Using TLM via the OpenAI library to score the trustworthiness of: structured outputs, function calling, messages, and more
This tutorial demonstrates how to assess the trustworthiness of OpenAI model responses using Cleanlab’s Trustworthy Language Model (TLM), accessible directly through the OpenAI library. Existing OpenAI users: you can obtain real-time trustworthiness scores for every OpenAI response, without changing your code.
Using TLM via the OpenAI library enables you to leverage OpenAI’s advanced features (structured outputs, function calling, …), while reliably scoring the trustworthiness of each response to automatically catch errors/hallucinations made by OpenAI.
In this tutorial, we will showcase how to use OpenAI’s structured outputs feature to perform multi-label classification, alongside using the TLM for trustworthiness scoring.
Install and Import Dependencies
Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet.
The Python package dependencies for this tutorial can be installed using pip:
%pip install --upgrade openai tqdm
import pandas as pd
from enum import Enum
from pydantic import BaseModel
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
import ast
import time
from tqdm import tqdm
pd.set_option('display.max_colwidth', None)
Fetch the Dataset
This tutorial uses a modified version of the Alexa intent detection dataset.
Each text sample contains one or more statements that can correspond to multiple intents (for example, controlling devices or asking for information). The labels for each sample specify its intents, and a single sample may have more than one. Let's take a look at the dataset below:
In this tutorial, we will only run the LLM inference on 50 randomly sampled examples of this dataset as a demonstration.
!wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/massive_multilabel_classification.csv
data = pd.read_csv("massive_multilabel_classification.csv")
data["labels"] = data["labels"].apply(ast.literal_eval)
data = data.sample(50, random_state=0).reset_index(drop=True)
data.head()
| | text | labels |
|---|---|---|
| 0 | play country radio. please describe that object for me | [play_radio, qa_definition] |
| 1 | can you let amazon know that this new phone case is junk | [social_post] |
| 2 | are there any alarms going off today | [alarm_query] |
| 3 | will it be rainy day tomorrow. definition of velocity. let's cook meatballs together | [weather_query, qa_definition, cooking_recipe] |
| 4 | set my coffee machine. play the oldies station | [iot_coffee, play_radio] |
Obtain LLM Predictions
Define Structured Output Schema
First, we need to get a list of all possible classes from the given dataset:
multilabel_classes = data["labels"].explode().unique()
multilabel_classes[:5]
Then, we can create an object that inherits from pydantic's BaseModel to represent the multi-label classification schema, ensuring that each predicted label is validated against the predefined list of possible classes:
class MultiLabelClassification(BaseModel):
classes: list[Enum("MultilabelClasses", {name: name for name in multilabel_classes})]
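As a quick sanity check (a minimal sketch, not part of the original tutorial), you can confirm that pydantic rejects any label outside this predefined list, so a malformed LLM output fails loudly rather than slipping through; the "not_a_real_intent" value below is a made-up example:
from pydantic import ValidationError

try:
    # any string not in multilabel_classes fails validation against the enum
    MultiLabelClassification(classes=["not_a_real_intent"])
except ValidationError as e:
    print(e)  # the error lists the allowed intent values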
Prompt OpenAI
Then, we can instantiate the OpenAI client, pointing the base_url to the TLM, which allows us to also get the trustworthiness score associated with each response.
# Get your API key from https://app.cleanlab.ai/account after creating an account
client = OpenAI(
api_key="<Cleanlab API key>",
base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)
Here is an example of how we can prompt OpenAI with one sample text:
sample_text = data['text'][0]
sample_text
completion = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[
{"role": "user", "content": f"Classify the following text: {sample_text}"}
],
response_format=MultiLabelClassification,
)
The returned object matches what OpenAI would ordinarily return, except it contains a few additional keys from TLM: trustworthiness_score, tlm_metadata. This way you can use TLM as a drop-in replacement for OpenAI in any application. Let’s parse the predictions and trustworthiness score from the returned response:
parsed_predictions = [prediction.value for prediction in completion.choices[0].message.parsed.classes]
trustworthiness_score = completion.tlm_metadata["trustworthiness_score"]
print(f"Predicted Classes: {parsed_predictions}")
print(f"Trustworthiness Score: {trustworthiness_score}")
Batch Prompt on a Dataset
Here, we define quick helper functions that allow us to process multiple texts in parallel, which will speed up prompting the LLM on an entire dataset. The helper functions also parse and collect the predictions and trustworthiness scores in a DataFrame for easy downstream analysis.
def classify_text(text):
completion = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[{"role": "user", "content": f"Classify the following text: {text}"}],
response_format=MultiLabelClassification,
)
return {
"predictions": [pred.value for pred in completion.choices[0].message.parsed.classes],
"trustworthiness_score": completion.tlm_metadata["trustworthiness_score"],
}
def classify_texts_batch(texts, batch_size=20, max_threads=8, sleep_time=10):
results = []
for i in tqdm(range(0, len(texts), batch_size)):
batch = texts[i:i + batch_size]
with ThreadPoolExecutor(max_threads) as executor:
futures = [executor.submit(classify_text, text) for text in batch]
batch_results = [f.result() for f in futures]
results.extend(batch_results)
# sleep to prevent hitting rate limits
if i + batch_size < len(texts):
time.sleep(sleep_time)
return pd.DataFrame(results)
results = classify_texts_batch(data["text"])
results.head()
| | predictions | trustworthiness_score |
|---|---|---|
| 0 | [play_radio, qa_definition] | 0.939186 |
| 1 | [social_post, recommendation_events] | 0.615166 |
| 2 | [alarm_query] | 0.940098 |
| 3 | [weather_query, qa_definition, cooking_recipe] | 0.878958 |
| 4 | [iot_coffee, play_radio] | 0.818165 |
Examine Results
We have now obtained the predictions and trustworthiness score for each given text. Let’s examine the results in more detail.
combined_results = pd.concat([data, results], axis=1)
combined_results = combined_results.rename(columns={"labels": "ground_truth_labels"})
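As an optional extra step (a minimal sketch, not part of the original tutorial), you can quantify overall performance by checking how often the predicted label set exactly matches the ground truth labels:
# fraction of samples whose predicted label set exactly matches the ground truth label set
exact_match = combined_results.apply(
    lambda row: set(row["predictions"]) == set(row["ground_truth_labels"]), axis=1
).mean()
print(f"Exact-match accuracy: {exact_match:.2f}")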
High Trustworthiness Scores
The responses with the highest trustworthiness scores represent texts where TLM is the most confident that it has predicted the correct intents.
We can see that the predictions for the samples below match the ground truth labels, i.e. they are correctly classified.
combined_results.sort_values("trustworthiness_score", ascending=False).head(3)
| | text | ground_truth_labels | predictions | trustworthiness_score |
|---|---|---|---|---|
| 44 | remove the list | [lists_remove] | [lists_remove] | 0.957895 |
| 12 | switch off the master's bedroom | [iot_hue_lightoff] | [iot_hue_lightoff] | 0.957323 |
| 45 | what are meeting scheduled for today | [calendar_query] | [calendar_query] | 0.956170 |
Low Trustworthiness Scores
The responses with the lowest trustworthiness scores represent the texts that TLM is least confident about.
Low trustworthiness scores indicate instances where you should not place much confidence in the responses. Results with low trustworthiness scores would benefit most from manual review, especially if the results are critical to get right.
Here are some examples with the lowest trustworthiness scores; you can see that their predictions tend to be incorrect or warrant further review.
combined_results.sort_values("trustworthiness_score").head(3)
| | text | ground_truth_labels | predictions | trustworthiness_score |
|---|---|---|---|---|
| 42 | play. where is the nearest walmart | [play_radio, recommendation_locations] | [play_game, transport_query] | 0.450565 |
| 25 | notify me when joshua emails me. can you make dinner for me. what degree is it outside | [email_query, general_quirky, weather_query] | [email_query, qa_definition, weather_query] | 0.593590 |
| 8 | create a playlist | [lists_createoradd] | [lists_createoradd, music_query] | 0.609958 |
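To operationalize this, one option is to route low-scoring predictions to manual review instead of using them directly. The sketch below assumes an arbitrary example threshold of 0.7 (not from the original tutorial), which you should tune for your own accuracy/review-cost tradeoff:
review_threshold = 0.7  # hypothetical cutoff; tune for your use case
needs_review = combined_results[combined_results["trustworthiness_score"] < review_threshold]
print(f"{len(needs_review)} of {len(combined_results)} samples flagged for manual review")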
Using Different Quality Presets
You can use TLM with different quality presets by specifying the preset after the model name.
For example, below we specify model="gpt-4o-low" to use TLM with the low quality preset (for lower cost/latency). If unspecified, the default quality preset is medium.
Currently, only the base, low, and medium presets are supported when using TLM via the OpenAI library. Read more about quality presets here.
sample_text = data['text'][0]
completion = client.beta.chat.completions.parse(
model="gpt-4o-low",
messages=[
{"role": "user", "content": f"Classify the following text: {sample_text}"}
],
response_format=MultiLabelClassification,
)
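Assuming the response has the same structure as the earlier calls, you can read the trustworthiness score from this low-preset completion in the same way:
print(completion.tlm_metadata["trustworthiness_score"])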