
Using TLM via the OpenAI library to score the trustworthiness of: structured outputs, function calling, messages, and more


This tutorial demonstrates how to assess the trustworthiness of OpenAI model responses using Cleanlab’s Trustworthy Language Model (TLM), accessible directly through the OpenAI library. Existing OpenAI users: you can obtain real-time trustworthiness scores for every OpenAI response, without changing your code.

Using TLM via the OpenAI library enables you to leverage OpenAI’s advanced features (structured outputs, function calling, …), while reliably scoring the trustworthiness of each response to automatically catch errors/hallucinations made by OpenAI.

In this tutorial, we will showcase how to use OpenAI’s structured outputs feature to perform multi-label classification, alongside using the TLM for trustworthiness scoring.

Install and Import Dependencies

Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet.

The Python package dependencies for this tutorial can be installed using pip:

%pip install --upgrade openai tqdm
import pandas as pd
from enum import Enum
from pydantic import BaseModel
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
import ast
import time
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

Fetch the Dataset

This tutorial uses a modified version of the Alexa intent detection dataset.

Each text sample contains one or more statements, each of which may correspond to a different intent (for example, controlling devices or asking for information). The labels for each sample list every intent expressed in that text, so a single sample can have more than one label. Let's take a look at the dataset below:

In this tutorial, we will only run the LLM inference on 50 randomly sampled examples of this dataset as a demonstration.

wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/massive_multilabel_classification.csv
data = pd.read_csv("massive_multilabel_classification.csv")
data["labels"] = data["labels"].apply(ast.literal_eval)
data = data.sample(50, random_state=0).reset_index(drop=True)
data.head()
text labels
0 play country radio. please describe that object for me [play_radio, qa_definition]
1 can you let amazon know that this new phone case is junk [social_post]
2 are there any alarms going off today [alarm_query]
3 will it be rainy day tomorrow. definition of velocity. let's cook meatballs together [weather_query, qa_definition, cooking_recipe]
4 set my coffee machine. play the oldies station [iot_coffee, play_radio]

Obtain LLM Predictions

Define Structured Output Schema

First, we need to get a list of all possible classes from the given dataset:

multilabel_classes = data["labels"].explode().unique()
multilabel_classes[:5]
    array(['play_radio', 'qa_definition', 'social_post', 'alarm_query',
'weather_query'], dtype=object)

Then, we can create an object that inherits from pydantic's BaseModel to represent the multi-label classification schema, ensuring that each predicted label is validated against the predefined list of possible classes:

class MultiLabelClassification(BaseModel):
    classes: list[Enum("MultilabelClasses", {name: name for name in multilabel_classes})]

Prompt OpenAI

Next, we instantiate the OpenAI client, pointing base_url to the TLM endpoint, which allows us to also obtain the trustworthiness score associated with each response.

# Get your API key from https://app.cleanlab.ai/account after creating an account
client = OpenAI(
    api_key="<Cleanlab API key>",
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/",
)

Here is an example of how we can prompt OpenAI with one sample text:

sample_text = data['text'][0]
sample_text
    'play country radio. please describe that object for me'
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Classify the following text: {sample_text}"}
    ],
    response_format=MultiLabelClassification,
)

The returned object matches what OpenAI would ordinarily return, except it contains a few additional keys from TLM: trustworthiness_score, tlm_metadata. This way you can use TLM as a drop-in replacement for OpenAI in any application. Let’s parse the predictions and trustworthiness score from the returned response:

parsed_predictions = [prediction.value for prediction in completion.choices[0].message.parsed.classes]
trustworthiness_score = completion.tlm_metadata["trustworthiness_score"]

print(f"Predicted Classes: {parsed_predictions}")
print(f"Trustworthiness Score: {trustworthiness_score}")
    Predicted Classes: ['play_radio', 'qa_definition']
Trustworthiness Score: 0.9091661249086779

Batch Prompt on a Dataset

Here, we define a quick helper function that processes multiple texts in parallel, which speeds up prompting the LLM on an entire dataset. The helper function also parses and collects the predictions and trustworthiness scores into a DataFrame for easy downstream analysis.

def classify_text(text):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Classify the following text: {text}"}],
        response_format=MultiLabelClassification,
    )

    return {
        "predictions": [pred.value for pred in completion.choices[0].message.parsed.classes],
        "trustworthiness_score": completion.tlm_metadata["trustworthiness_score"],
    }

def classify_texts_batch(texts, batch_size=20, max_threads=8, sleep_time=10):
    results = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]

        with ThreadPoolExecutor(max_threads) as executor:
            futures = [executor.submit(classify_text, text) for text in batch]
            batch_results = [f.result() for f in futures]

        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(texts):
            time.sleep(sleep_time)

    return pd.DataFrame(results)
results = classify_texts_batch(data["text"])
results.head()
predictions trustworthiness_score
0 [play_radio, qa_definition] 0.939186
1 [social_post, recommendation_events] 0.615166
2 [alarm_query] 0.940098
3 [weather_query, qa_definition, cooking_recipe] 0.878958
4 [iot_coffee, play_radio] 0.818165

Examine Results

We have now obtained the predictions and trustworthiness score for each given text. Let’s examine the results in more detail.

combined_results = pd.concat([data, results], axis=1)
combined_results = combined_results.rename(columns={"labels": "ground_truth_labels"})
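
For a quick overview of overall agreement, we can check how often the predicted label set exactly matches the ground truth labels (a simple sketch; other multi-label metrics may suit your application better):

# Fraction of samples whose predicted label set exactly matches the ground truth
exact_match = combined_results.apply(
    lambda row: set(row["predictions"]) == set(row["ground_truth_labels"]), axis=1
)
print(f"Exact-match accuracy: {exact_match.mean():.2f}")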

High Trustworthiness Scores

The responses with the highest trustworthiness scores represent texts where TLM is most confident that it has predicted the correct intents.

We can see below that the predictions for these samples match the ground truth labels.

combined_results.sort_values("trustworthiness_score", ascending=False).head(3)
text ground_truth_labels predictions trustworthiness_score
44 remove the list [lists_remove] [lists_remove] 0.957895
12 switch off the master's bedroom [iot_hue_lightoff] [iot_hue_lightoff] 0.957323
45 what are meeting scheduled for today [calendar_query] [calendar_query] 0.956170

Low Trustworthiness Scores

The responses with the lowest trustworthiness scores represent the texts in which TLM is least confident.

Low trustworthiness scores indicate instances where you should not place much confidence in the responses. Results with low trustworthiness scores would benefit most from manual review, especially if the results are critical to get right.
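
For example, one simple way to automate this triage is to flag any response whose trustworthiness score falls below a chosen threshold (the 0.7 cutoff below is only an illustrative value; tune it for your application):

REVIEW_THRESHOLD = 0.7  # illustrative cutoff, not a recommended value
needs_review = combined_results[combined_results["trustworthiness_score"] < REVIEW_THRESHOLD]
print(f"{len(needs_review)} of {len(combined_results)} responses flagged for manual review")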

Here are some examples with the lowest trustworthiness scores; you can see that these predictions tend to be incorrect or at least warrant further review.

combined_results.sort_values("trustworthiness_score").head(3)
text ground_truth_labels predictions trustworthiness_score
42 play. where is the nearest walmart [play_radio, recommendation_locations] [play_game, transport_query] 0.450565
25 notify me when joshua emails me. can you make dinner for me. what degree is it outside [email_query, general_quirky, weather_query] [email_query, qa_definition, weather_query] 0.593590
8 create a playlist [lists_createoradd] [lists_createoradd, music_query] 0.609958

Using Different Quality Presets

You can use TLM with different quality presets by specifying the preset after the model name.

For example, below we specify model="gpt-4o-low" to use TLM with the low quality preset (for lower cost/latency). If unspecified, the default quality preset is medium.

Currently, only base, low, and medium presets are supported when using TLM via the OpenAI library. Read more about quality presets here.

sample_text = data['text'][0]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-low",
    messages=[
        {"role": "user", "content": f"Classify the following text: {sample_text}"}
    ],
    response_format=MultiLabelClassification,
)
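
The returned completion has the same structure as before, so the predictions and trustworthiness score can be parsed in exactly the same way:

parsed_predictions = [prediction.value for prediction in completion.choices[0].message.parsed.classes]
print(f"Predicted Classes: {parsed_predictions}")
print(f"Trustworthiness Score: {completion.tlm_metadata['trustworthiness_score']}")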