Using TLM with OpenAI’s Chat Completions API
This tutorial demonstrates the easiest ways to score the trustworthiness of responses from the OpenAI Chat Completions API. With minimal changes to your existing Chat Completions API code, you can score the trustworthiness of every LLM response in real time (this works for all OpenAI models and most non-OpenAI LLMs that also support the Chat Completions API, such as Gemini, DeepSeek, and Llama).
Setup
The Python packages required for this tutorial can be installed using pip:
%pip install --upgrade cleanlab-tlm openai
This tutorial requires a TLM API key. Get one here.
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<Cleanlab TLM API key>" # Get your free API key from: https://tlm.cleanlab.ai/
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>" # for using OpenAI client library, not strictly necessary for all workflows shown here
import openai
from openai import OpenAI
from cleanlab_tlm.utils.chat_completions import TLMChatCompletion
Overview of this tutorial
We’ll showcase three different workflows to incorporate trust scoring into your existing LLM code, with minimal code changes:
- Workflows 1 & 2: Use your own existing LLM infrastructure to generate responses, then use Cleanlab to score them
- Workflow 3: Use Cleanlab for both generating and scoring responses
Workflow 1: Score Responses from Existing LLM Calls
One way to use TLM if you’re already using OpenAI’s Chat Completions API is to score any existing LLM call you’ve made. This works for LLMs beyond OpenAI models, since many LLM providers (such as Gemini and DeepSeek) also support OpenAI’s Chat Completions API.
First, generate LLM responses as usual using the OpenAI API (or any other LLM infrastructure you already use):
openai_kwargs = dict(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
response = openai.chat.completions.create(**openai_kwargs)
response
We can then use TLM to score the generated response. Here, we first instantiate a TLMChatCompletion object. For other configuration options, view all of the valid arguments in our API documentation.
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "gpt-4.1-mini", "log": ["explanation"]})
score_result = tlm.score(
    response=response,
    **openai_kwargs
)
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {score_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {score_result['log']['explanation']}")
Workflow 2: Adding a Decorator to your LLM Call
Alternatively, you decorate your call to openai.chat.completions.create()
with a decorator that then appends the trust score as a key in the returned response. This workflow only requires minimal initial setup; after that zero changes are needed in the rest of your existing code!
import functools
def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)  # call the original Chat Completions function
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result  # attach the trust score to the returned response
            return response
        return wrapper
    return trust_score_decorator
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "gpt-4.1-mini", "log": ["explanation"]})
Then decorate your OpenAI Chat Completions function like this:
openai.chat.completions.create = add_trust_scoring(tlm)(openai.chat.completions.create)
After you decorate OpenAI’s Chat Completions function like this, all of your existing Chat Completions API code will automatically compute trust scores as well (no changes needed in any other code):
response = openai.chat.completions.create(**openai_kwargs)
response
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {response.tlm_metadata['log']['explanation']}")
Workflow 3: Use Cleanlab to Generate and Score Responses
For convenience, you can alternatively generate responses using Cleanlab’s infrastructure, which simultaneously returns trustworthiness scores. Responses can be generated with any of the OpenAI LLM models supported within TLM.
To do this, simply point the OpenAI client at Cleanlab’s backend instead of OpenAI’s: instantiate an OpenAI client, set its base_url to the Cleanlab backend (see URL below), and specify your Cleanlab API key. After that, you can use the chat.completions.create() method as you normally would (zero changes to any existing code), obtaining responses and trust scores without relying on OpenAI at all.
client = OpenAI(
    api_key="<Cleanlab TLM API key>",  # get your API key from: https://tlm.cleanlab.ai/
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
response
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
Getting Faster/Better Results
The default TLM settings are not latency-optimized because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency without compromising results.
Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application.
View more tips to improve latency and accuracy in our FAQ and Advanced Tutorial.
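For example, you can trade some scoring thoroughness for speed by choosing a lower quality preset or a smaller/faster model when instantiating TLMChatCompletion. A minimal sketch (the specific preset and model shown are illustrative; check the API documentation for currently supported values):
# illustrative low-latency configuration; consult the API documentation for supported presets/models
fast_tlm = TLMChatCompletion(
    quality_preset="low",  # lower presets score faster than "medium"
    options={"model": "gpt-4.1-nano"}  # assumed smaller/faster model, swap in one you have access to
)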
Running over Batches/Datasets
Here are some tips for handling rate limits and batching requests when processing large datasets.
Prevent hitting rate limits:
- Process data in small batches (e.g. 10-50 requests at a time)
- Add sleep intervals between batches (e.g. time.sleep(1)) to stay under rate limits
Handling errors:
- Save partial results frequently to avoid losing progress
- Consider using a try/except block to catch errors, and implement retry logic when rate limits are hit
You may find the basic TLM API showcased in our Quickstart tutorial simpler for running TLM over datasets, as it manages all of the above for you.
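For reference, that basic TLM API looks roughly like the sketch below (see the Quickstart tutorial for exact usage; the variable name basic_tlm is just for illustration):
from cleanlab_tlm import TLM

basic_tlm = TLM()  # manages batching, rate limits, and retries for you
out = basic_tlm.prompt("What is the capital of France?")
print(out["response"], out["trustworthiness_score"])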
Otherwise, here are helper functions for batching LLM calls made via the Chat Completions API:
from openai import OpenAI
import time
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
client = OpenAI(
    api_key="<Cleanlab TLM API key>",  # get your API key from: https://tlm.cleanlab.ai/
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)
def invoke_llm_with_retries(openai_kwargs, retries=3, backoff=2):
    attempt = 0
    while attempt <= retries:
        try:
            # the code to invoke the LLM goes here, feel free to modify
            response = client.chat.completions.create(**openai_kwargs)
            return {
                "response": response.choices[0].message.content,
                "trustworthiness_score": response.tlm_metadata["trustworthiness_score"],
                "raw_completion": response
            }
        except Exception as e:
            if attempt == retries:
                return {"error": str(e), "input": openai_kwargs}
            sleep_time = backoff ** attempt
            time.sleep(sleep_time)
            attempt += 1
def run_batch(batch_data, batch_size=20, max_threads=8, sleep_time=5):
    results = []
    for i in tqdm(range(0, len(batch_data), batch_size)):
        data = batch_data[i:i + batch_size]
        batch_results = [None] * len(data)
        with ThreadPoolExecutor(max_workers=max_threads) as executor:
            future_to_idx = {executor.submit(invoke_llm_with_retries, d): idx for idx, d in enumerate(data)}
            for future in as_completed(future_to_idx):
                idx = future_to_idx[future]
                batch_results[idx] = future.result()
        results.extend(batch_results)
        # sleep to prevent hitting rate limits
        if i + batch_size < len(batch_data):
            time.sleep(sleep_time)
    return results
sample_input = dict(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
sample_batch = [sample_input] * 10
run_batch(sample_batch)
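The returned list preserves the input order; entries that exhausted their retries contain an "error" key instead of a response (per invoke_llm_with_retries above). For example, you could separate successes from failures like this:
results = run_batch(sample_batch)
successes = [r for r in results if "error" not in r]
failures = [r for r in results if "error" in r]
print(f"{len(successes)} succeeded, {len(failures)} failed")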