Using TLM with OpenAI’s Chat Completions API
This tutorial demonstrates the easiest ways to score the trustworthiness of responses from the OpenAI Chat Completions API. With minimal changes to your existing Chat Completions API code, you can score the trustworthiness of every LLM response in real time (this works for all OpenAI models and most non-OpenAI LLMs that also support the Chat Completions API, such as Gemini, DeepSeek, and Llama).
Setup
The Python packages required for this tutorial can be installed using pip:
%pip install --upgrade cleanlab-tlm openai
This tutorial requires a TLM API key. Get one here.
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<Cleanlab TLM API key>" # Get your free API key from: https://tlm.cleanlab.ai/
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>" # for using OpenAI client library, not strictly necessary for all workflows shown here
import openai
from openai import OpenAI
from cleanlab_tlm.utils.chat_completions import TLMChatCompletion
Overview of this tutorial
We’ll showcase three different workflows to incorporate trust scoring into your existing LLM code, with minimal code changes:
- Workflows 1 & 2: Use your own existing LLM infrastructure to generate responses, then use Cleanlab to score them
- Workflow 3: Use Cleanlab for both generating and scoring responses
Workflow 1: Score Responses from Existing LLM Calls
One way to use TLM if you’re already using OpenAI’s Chat Completions API is to score any existing LLM call you’ve made. This works for LLMs beyond OpenAI models, since many LLM providers (such as Gemini and DeepSeek) also support OpenAI’s Chat Completions API.
First, generate LLM responses as usual using the OpenAI API (or any other LLM infrastructure you already use):
openai_kwargs = dict(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
response = openai.chat.completions.create(**openai_kwargs)
response
We can then use TLM to score the generated response. Here, we first instantiate a TLMChatCompletion object. For other configuration options, view all of the valid arguments in our API documentation.
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "gpt-4.1-mini", "log": ["explanation"]})
score_result = tlm.score(
    response=response,
    **openai_kwargs
)
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {score_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {score_result['log']['explanation']}")
Workflow 2: Adding a Decorator to your LLM Call
Alternatively, you decorate your call to openai.chat.completions.create()
with a decorator that then appends the trust score as a key in the returned response. This workflow only requires minimal initial setup; after that zero changes are needed in the rest of your existing code!
import functools
def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)  # call the original Chat Completions function
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result  # attach the trust score to the returned response
            return response
        return wrapper
    return trust_score_decorator
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "gpt-4.1-mini", "log": ["explanation"]})
Then decorate your OpenAI Chat Completions function like this:
openai.chat.completions.create = add_trust_scoring(tlm)(openai.chat.completions.create)
After you decorate OpenAI’s Chat Completions function like this, all of your existing Chat Completions API code will automatically compute trust scores as well (no changes needed in any other code):
response = openai.chat.completions.create(**openai_kwargs)
response
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {response.tlm_metadata['log']['explanation']}")
Workflow 3: Use Cleanlab to Generate and Score Responses
For convenience, you can alternatively generate responses using Cleanlab’s infrastructure, which simultaneously returns trustworthiness scores. Responses can be generated with any of the OpenAI LLM models supported within TLM.
To do this, simply point the OpenAI client at Cleanlab’s backend instead of OpenAI’s: instantiate an OpenAI client, set its base_url to the Cleanlab backend (see URL below), and specify your Cleanlab API key. After that, you can use the chat.completions.create() method as you normally would (zero changes to any existing code), obtaining responses and trust scores without relying on OpenAI at all.
client = OpenAI(
    api_key="<Cleanlab TLM API key>",  # get your API key from: https://tlm.cleanlab.ai/
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
response
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
Getting Faster/Better Results
The default TLM settings are not latency-optimized because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency without compromising results.
Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application.
View more tips to improve latency and accuracy in our FAQ and Advanced Tutorial.
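For example, you can trade some scoring thoroughness for speed by choosing a lower quality preset or a smaller/faster model when instantiating TLMChatCompletion. A minimal sketch (the specific preset and model shown are illustrative; check the API documentation for currently supported values):
# illustrative low-latency configuration; consult the API documentation for supported presets/models
fast_tlm = TLMChatCompletion(
    quality_preset="low",  # lower presets score faster than "medium"
    options={"model": "gpt-4.1-nano"}  # assumed smaller/faster model, swap in one you have access to
)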
Running over Batches/Datasets
Here are some tips for handling rate limits and batching requests when processing large datasets.
Prevent hitting rate limits:
- Process data in small batches (e.g. 10-50 requests at a time)
- Add sleep intervals between batches (e.g. time.sleep(1)) to stay under rate limits
Handling errors:
- Save partial results frequently to avoid losing progress
- Consider using a try/except block to catch errors, and implement retry logic when rate limits are hit
You may find the basic TLM API showcased in our Quickstart tutorial simpler for running TLM over datasets, as it manages all of the above for you.
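For reference, that basic TLM API looks roughly like the sketch below (see the Quickstart tutorial for exact usage; the variable name basic_tlm is just for illustration):
from cleanlab_tlm import TLM

basic_tlm = TLM()  # manages batching, rate limits, and retries for you
out = basic_tlm.prompt("What is the capital of France?")
print(out["response"], out["trustworthiness_score"])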
Otherwise, here are helper functions for batching LLM calls made via the Chat Completions API:
from openai import OpenAI
import time
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
client = OpenAI(
    api_key="<Cleanlab TLM API key>",  # get your API key from: https://tlm.cleanlab.ai/
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)
def invoke_llm_with_retries(openai_kwargs, retries=3, backoff=2):
    attempt = 0
    while attempt <= retries:
        try:
            # the code to invoke the LLM goes here, feel free to modify
            response = client.chat.completions.create(**openai_kwargs)
            return {
                "response": response.choices[0].message.content,
                "trustworthiness_score": response.tlm_metadata["trustworthiness_score"],
                "raw_completion": response
            }
        except Exception as e:
            if attempt == retries:
                return {"error": str(e), "input": openai_kwargs}
            sleep_time = backoff ** attempt
            time.sleep(sleep_time)
            attempt += 1
def run_batch(batch_data, batch_size=20, max_threads=8, sleep_time=5):
    results = []
    for i in tqdm(range(0, len(batch_data), batch_size)):
        data = batch_data[i:i + batch_size]
        batch_results = [None] * len(data)
        with ThreadPoolExecutor(max_workers=max_threads) as executor:
            future_to_idx = {executor.submit(invoke_llm_with_retries, d): idx for idx, d in enumerate(data)}
            for future in as_completed(future_to_idx):
                idx = future_to_idx[future]
                batch_results[idx] = future.result()
        results.extend(batch_results)
        # sleep to prevent hitting rate limits
        if i + batch_size < len(batch_data):
            time.sleep(sleep_time)
    return results
sample_input = dict(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
sample_batch = [sample_input] * 10
run_batch(sample_batch)
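The returned list preserves the input order; entries that exhausted their retries contain an "error" key instead of a response (per invoke_llm_with_retries above). For example, you could separate successes from failures like this:
results = run_batch(sample_batch)
successes = [r for r in results if "error" not in r]
failures = [r for r in results if "error" in r]
print(f"{len(successes)} succeeded, {len(failures)} failed")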