Using TLM via the OpenAI library to score the trustworthiness of: structured outputs, function calling, messages, and more

This tutorial demonstrates how to assess the trustworthiness of OpenAI model responses using Cleanlab’s Trustworthy Language Model (TLM), accessible directly through the OpenAI library. Existing OpenAI users: you can obtain real-time trustworthiness scores for every OpenAI response, without changing your code.

Using TLM via the OpenAI library enables you to leverage OpenAI’s advanced features (structured outputs, function calling, …), while reliably scoring the trustworthiness of each response to automatically catch errors/hallucinations made by OpenAI.

In this tutorial, we use OpenAI’s structured outputs feature to perform multi-label classification (i.e. document tagging) with trustworthiness scores from TLM. The same method can be used to score the trustworthiness of any type of output from OpenAI (not just structured outputs).

Setup

TLM requires a Cleanlab account. Sign up for one here and use TLM for free! If you’ve already signed up, check your email for a personal login link.

The Python package dependencies for this tutorial can be installed using pip:

%pip install --upgrade openai tqdm pandas
import pandas as pd
from enum import Enum
from pydantic import BaseModel
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
import ast
import time
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

Fetch Example Dataset

This tutorial uses a modified version of the Alexa intent detection dataset.

Each text sample contains one or more statements, each of which can correspond to a different intent (for example, controlling devices or asking for information). The label for each example lists the intents it expresses, so a single sample can have more than one. Let’s take a look at the dataset below:

In this tutorial, we will only run the LLM inference on 50 randomly sampled examples of this dataset as a demonstration.

!wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/massive_multilabel_classification.csv
data = pd.read_csv("massive_multilabel_classification.csv")
data["labels"] = data["labels"].apply(ast.literal_eval)
data = data.sample(50, random_state=123).reset_index(drop=True)
data.head()
   text                                                                                                  labels
0  lets have a chat                                                                                      [general_quirky]
1  what are meeting scheduled for today                                                                  [calendar_query]
2  erase all the events. resume my audio book from karl pilkington. tell me the profession of celebrity  [calendar_remove, play_audiobook, qa_factoid]
3  thirty minute reminder on meeting for tuesday                                                         [calendar_set]
4  i have a nine am meeting on wednesday send me a reminder                                              [calendar_set]
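
The `ast.literal_eval` call above suggests that the `labels` column is stored in the CSV as the string form of a Python list. That is an inference from the code rather than something stated about the dataset, so treat the raw value below as an assumption. A minimal sketch of the conversion on one such string:

```python
import ast

# Assumed raw cell value: the string representation of a Python list.
raw_label = "['calendar_remove', 'play_audiobook', 'qa_factoid']"

# literal_eval parses Python literals only, so it does not execute arbitrary code.
parsed_label = ast.literal_eval(raw_label)

print(type(parsed_label).__name__, len(parsed_label))
```

After this step each entry in `labels` is a real list, which is what later steps such as `explode()` need.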

Obtain LLM Predictions

Define Structured Output Schema

First, we need to get a list of all possible classes from the given dataset:

multilabel_classes = data["labels"].explode().unique()
multilabel_classes[:5]
    array(['general_quirky', 'calendar_query', 'calendar_remove',
           'play_audiobook', 'qa_factoid'], dtype=object)
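
To make the `explode().unique()` step concrete, here is a rough pure-Python equivalent: flatten the per-row label lists, then keep each class the first time it appears. The label lists are copied from the five dataset rows shown earlier; the variable names are illustrative only.

```python
from itertools import chain

# Label lists copied from the first rows of the dataset shown earlier.
labels = [
    ["general_quirky"],
    ["calendar_query"],
    ["calendar_remove", "play_audiobook", "qa_factoid"],
    ["calendar_set"],
    ["calendar_set"],
]

# Flatten the nested lists, then drop duplicates while keeping first-seen order
# (dict keys preserve insertion order in Python 3.7+).
unique_classes = list(dict.fromkeys(chain.from_iterable(labels)))

print(unique_classes)
```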

Then, we can create a class that inherits from pydantic’s BaseModel to represent the multi-label classification schema, ensuring that each predicted label is validated against the predefined list of possible classes:

class MultiLabelClassification(BaseModel):
    classes: list[Enum("MultilabelClasses", {name: name for name in multilabel_classes})]
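
The schema relies on Python's functional `Enum` API to turn the class list into a closed set of allowed values. The sketch below uses only the standard library to show that mechanism in isolation; the short class list and the `is_valid_label` helper are hypothetical, and the assumption is that pydantic applies a comparable by-value check when it validates the `classes` field.

```python
from enum import Enum

# Hypothetical short class list; the tutorial derives the full list from the dataset.
example_classes = ["general_quirky", "calendar_query", "calendar_set"]

# Functional Enum API: each member's name and value are both the class string.
ExampleClasses = Enum("ExampleClasses", {name: name for name in example_classes})


def is_valid_label(label: str) -> bool:
    """Return True if `label` is the value of some enum member."""
    try:
        ExampleClasses(label)  # lookup by value raises ValueError if unknown
        return True
    except ValueError:
        return False


print(is_valid_label("calendar_set"))
print(is_valid_label("order_pizza"))
```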

Prompt OpenAI

Then, we can instantiate the OpenAI client, pointing the base_url to TLM, which allows us to also get the trustworthiness score associated with each response.

# Get your API key from https://app.cleanlab.ai/account after creating an account
client = OpenAI(
    api_key="<Cleanlab API key>",
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)

Here is an example of how we can prompt OpenAI with one sample text:

sample_text = data['text'][0]
sample_text
    'lets have a chat'
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Classify the following text: {sample_text}"}
    ],
    response_format=MultiLabelClassification,
)

The returned object matches what OpenAI would ordinarily return, except it contains a few additional keys from TLM: trustworthiness_score, tlm_metadata. This way you can use TLM as a drop-in replacement for OpenAI in any application. Let’s parse the predictions and trustworthiness score from the returned response:

parsed_predictions = [prediction.value for prediction in completion.choices[0].message.parsed.classes]
trustworthiness_score = completion.tlm_metadata["trustworthiness_score"]

print(f"Predicted Classes: {parsed_predictions}")
print(f"Trustworthiness Score: {trustworthiness_score}")
    Predicted Classes: ['general_quirky']
    Trustworthiness Score: 0.8512080365166845
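
One way to act on a score in real time is to compare it against a cutoff and route low-scoring responses to a fallback. The sketch below is only an illustration: `route_response` and its default `threshold` of 0.7 are hypothetical, and a suitable cutoff depends on your own data and error tolerance.

```python
def route_response(predictions, trustworthiness_score, threshold=0.7):
    """Decide what to do with one response based on its trustworthiness score.

    `threshold` is a hypothetical cutoff; tune it on your own data.
    """
    if trustworthiness_score >= threshold:
        return {"action": "accept", "predictions": predictions}
    return {"action": "manual_review", "predictions": predictions}


# Score taken from the sample response above.
decision = route_response(["general_quirky"], 0.8512080365166845)
print(decision["action"])
```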

Batch Prompt on a Dataset

Here, we define two quick helper functions that let us process multiple texts in parallel, which speeds up prompting the LLM on an entire dataset. The helpers also parse the predictions and trustworthiness scores and collect them in a DataFrame for easy downstream analysis.

def classify_text(text):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Classify the following text: {text}"}],
        response_format=MultiLabelClassification,
    )

    return {
        "predictions": [pred.value for pred in completion.choices[0].message.parsed.classes],
        "trustworthiness_score": completion.tlm_metadata["trustworthiness_score"],
    }

def classify_texts_batch(texts, batch_size=20, max_threads=8, sleep_time=10):
    results = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]

        with ThreadPoolExecutor(max_threads) as executor:
            futures = [executor.submit(classify_text, text) for text in batch]
            batch_results = [f.result() for f in futures]

        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(texts):
            time.sleep(sleep_time)

    return pd.DataFrame(results)

results = classify_texts_batch(data["text"])
results.head()
   predictions                                    trustworthiness_score
0  [general_quirky]                               0.851207
1  [calendar_query]                               0.988874
2  [calendar_remove, play_audiobook, qa_factoid]  0.989885
3  [alarm_query]                                  0.338316
4  [calendar_set, calendar_query]                 0.687683
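
The batch helper sleeps between batches to stay under rate limits, but a single failed request would still abort the whole run. A possible complement is a small retry wrapper with exponential backoff. This is a sketch under assumptions, not part of the tutorial: `call_with_retries` and `flaky_classify` are hypothetical, and real code would likely catch only rate-limit or transient network errors.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0):
    """Call fn(*args), retrying on any exception with exponential backoff.

    Hypothetical helper: a real application would likely catch only
    rate-limit or transient network errors rather than every Exception.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for classify_text that fails twice before succeeding.
calls = {"count": 0}


def flaky_classify(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return {"predictions": ["general_quirky"], "text": text}


result = call_with_retries(flaky_classify, "lets have a chat", base_delay=0.01)
print(result["predictions"], calls["count"])
```

Inside `classify_texts_batch`, the wrapper could be applied by submitting `call_with_retries` with `classify_text` and the text as arguments instead of submitting `classify_text` directly.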

Examine Results

We have now obtained the predictions and trustworthiness score for each given text. Let’s examine the results in more detail.

combined_results = pd.concat([data, results], axis=1)
combined_results = combined_results.rename(columns={"labels": "ground_truth_labels"})

High Trustworthiness Scores

The responses with the highest trustworthiness scores represent texts where TLM is the most confident that it has predicted the correct intents.

The predictions for the samples below match the ground truth labels, so all three are correctly classified.

combined_results.sort_values("trustworthiness_score", ascending=False).head(3)
    text                                            ground_truth_labels                   predictions                           trustworthiness_score
7   what alarms did i set                           [alarm_query]                         [alarm_query]                         0.989979
20  turn the lights off                             [iot_hue_lightoff]                    [iot_hue_lightoff]                    0.989947
17  send an email to margaret. shut down the sound  [email_sendemail, audio_volume_mute]  [email_sendemail, audio_volume_mute]  0.989936

Low Trustworthiness Scores

The responses with the lowest trustworthiness scores are the outputs we are least confident are correct.

Results with low trustworthiness scores would benefit most from manual review, especially if we need almost all outputs across the dataset to be correct.

For the examples with the lowest trustworthiness scores in our dataset, shown below, the predictions tend to be incorrect or to warrant further review.

combined_results.sort_values("trustworthiness_score").head(3)
    text                                                          ground_truth_labels           predictions                          trustworthiness_score
42  i will need warm socks in winter in morning                   [weather_query]               [general_quirky, datetime_query]     0.264497
3   thirty minute reminder on meeting for tuesday                 [calendar_set]                [alarm_query]                        0.338316
41  features of google pixel. what is the deepest point on earth  [general_quirky, qa_factoid]  [qa_factoid, recommendation_events]  0.460527
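
The relationship between score and correctness can be quantified by comparing exact-match accuracy above and below a cutoff. The sketch below uses only the six rows shown in the two tables above, which are the extremes of the score range and therefore not representative of the full dataset. `THRESHOLD` and `exact_match_rate` are hypothetical names, and 0.8 is an arbitrary illustrative cutoff.

```python
# Rows copied from the two result tables above:
# (ground-truth labels, predicted labels, trustworthiness score)
rows = [
    (["alarm_query"], ["alarm_query"], 0.989979),
    (["iot_hue_lightoff"], ["iot_hue_lightoff"], 0.989947),
    (["email_sendemail", "audio_volume_mute"], ["email_sendemail", "audio_volume_mute"], 0.989936),
    (["weather_query"], ["general_quirky", "datetime_query"], 0.264497),
    (["calendar_set"], ["alarm_query"], 0.338316),
    (["general_quirky", "qa_factoid"], ["qa_factoid", "recommendation_events"], 0.460527),
]

THRESHOLD = 0.8  # hypothetical cutoff, not a value recommended by the tutorial


def exact_match_rate(subset):
    """Fraction of rows whose predicted label set equals the ground-truth set."""
    if not subset:
        return None
    return sum(set(truth) == set(pred) for truth, pred, _ in subset) / len(subset)


high = [r for r in rows if r[2] >= THRESHOLD]
low = [r for r in rows if r[2] < THRESHOLD]

print(f"high-trust exact match: {exact_match_rate(high)}")
print(f"low-trust exact match: {exact_match_rate(low)}")
```

On the full results, the same comparison could be run over `combined_results` to help choose a cutoff for manual review.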

Using Different Quality Presets

You can use TLM with different quality presets by specifying the preset after the model name.

In the example below, we specify model="gpt-4o-low" to use TLM with the low quality preset (for lower cost/latency). If unspecified, the default quality preset is medium.

Currently, only base, low, and medium presets are supported when using TLM via the OpenAI library. Read more about quality presets here.

sample_text = data['text'][0]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-low",
    messages=[
        {"role": "user", "content": f"Classify the following text: {sample_text}"}
    ],
    response_format=MultiLabelClassification,
)
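
If the preset is chosen at runtime, a small helper can build the model string and reject presets outside the supported list. This is a hypothetical convenience function, not part of TLM or the OpenAI library. It assumes the `<model>-<preset>` pattern shown for "gpt-4o-low" also applies to the other listed presets, and it returns the bare model name when no preset is given so that the documented default applies.

```python
# Presets the tutorial lists as supported via the OpenAI library.
SUPPORTED_PRESETS = ("base", "low", "medium")


def model_with_preset(model, preset=None):
    """Build a model string such as "gpt-4o-low" (hypothetical helper).

    With preset=None the model name is returned unchanged, leaving the
    choice of quality preset to the service default.
    """
    if preset is None:
        return model
    if preset not in SUPPORTED_PRESETS:
        raise ValueError(
            f"Unsupported preset {preset!r}; expected one of {SUPPORTED_PRESETS}"
        )
    return f"{model}-{preset}"


print(model_with_preset("gpt-4o", "low"))
```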

We re-emphasize that you can use TLM via the OpenAI library to score the trustworthiness of any type of OpenAI output (not just structured outputs). Beyond structured outputs, we recommend using TLM via the OpenAI library for LLM applications involving function calling, system prompts, multiple user/assistant messages, and other advanced OpenAI features that most other LLM APIs lack.
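
As a rough illustration of such a request, the sketch below assembles a payload that combines a system prompt, several user/assistant turns, and one tool in the OpenAI Chat Completions `tools` format. The `set_reminder` tool, its parameters, and the message contents are hypothetical. The sketch only builds the payload, so it runs without an API key. It assumes the TLM endpoint accepts this request shape and that the score would be read from `tlm_metadata` as in the earlier examples.

```python
# Hypothetical tool definition in the OpenAI Chat Completions "tools" format.
set_reminder_tool = {
    "type": "function",
    "function": {
        "name": "set_reminder",  # hypothetical function name
        "description": "Create a calendar reminder.",
        "parameters": {
            "type": "object",
            "properties": {
                "day": {"type": "string"},
                "time": {"type": "string"},
            },
            "required": ["day", "time"],
        },
    },
}

request_kwargs = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a scheduling assistant."},
        {"role": "user", "content": "i have a nine am meeting on wednesday"},
        {"role": "assistant", "content": "Would you like a reminder for it?"},
        {"role": "user", "content": "yes, send me a reminder"},
    ],
    "tools": [set_reminder_tool],
}

# With the TLM-backed client from this tutorial, the call would be:
#   completion = client.chat.completions.create(**request_kwargs)
# Here we only inspect the payload.
roles = [message["role"] for message in request_kwargs["messages"]]
print(roles)
```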

For questions about the OpenAI API, refer to the documentation linked from their library.