Using TLM in your VPC via OpenAI's Chat Completions API
This tutorial demonstrates how to integrate your VPC installation of Cleanlab's Trustworthy Language Model (TLM) into existing GenAI apps. You will learn how to assess the trustworthiness of LLM responses directly through the OpenAI client library, Azure's AI inference client, or Cleanlab's cleanlab-tlm client library.
API access to the TLM backend service
This demo assumes that you have access to the deployed TLM backend service at the URL http://example.customer.com:8080/api. You are welcome to expose the TLM service however you prefer, depending on the unique needs of your networking environment. Simply replace the base URL in the corresponding cell blocks below.
Please note that Google Colab does not have built-in support to access services on your local machine. This is because Colab runs in a virtual machine, so localhost refers to that VM, rather than your computer. If you would like to access TLM by port-forwarding to your local machine, you may do so by downloading the .ipynb file and running Jupyter locally, or by using a tunneling service like ngrok.
import os
os.environ["BASE_URL"] = "http://example.customer.com:8080/api"
Setup
The Python packages required for this tutorial can be installed using pip:
%pip install --upgrade openai azure-ai-inference cleanlab-tlm
from openai import OpenAI, AzureOpenAI
from cleanlab_tlm.utils.vpc.chat_completions import TLMChatCompletion
Overview of this tutorial
The workflows showcased below demonstrate how to incorporate trust scoring into your existing LLM code with minimal changes. We'll explore three workflows:
- Workflow 1 & 2: Use your own existing LLM infrastructure to generate responses, then use Cleanlab to score them
- Workflow 3: Use Cleanlab for both generating and scoring responses (response-generation can be from any LLM model supported in your VPC deployment)
Workflow 1: Score Responses from Existing LLM Calls
One way to use TLM if you're already using OpenAI's ChatCompletions API is to score any existing LLM call you've made. This works for LLMs beyond OpenAI models (many LLM providers like Gemini or DeepSeek also support OpenAI's Chat Completions API).
You can first generate LLM responses as usual using the OpenAI API (or any of your existing infrastructure):
openai_kwargs = {
    "model": "gpt-4.1-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "logprobs": True,
    "top_logprobs": 3,
}
client = AzureOpenAI(
    api_version="<your-api-version>",
    azure_endpoint="<your-azure-endpoint>",
    api_key="<your-azure-api-key>",
)
response = client.chat.completions.create(**openai_kwargs)
response
We can then use TLM to score the generated response. For models that support log probabilities, including them allows TLM to return higher-quality scores.
Here, we first instantiate a TLMChatCompletion object. For more configurations, view the valid arguments below.
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "azure/gpt-4.1-mini", "log": ["explanation"]})
score_result = tlm.score(
    response=response,
    **openai_kwargs
)
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {score_result['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {score_result['log']['explanation']}")
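Once you have a trustworthiness score, a common pattern is to gate what your application does with the response. The snippet below is a minimal sketch of that idea; the 0.8 threshold and the fallback message are illustrative choices, not values prescribed by Cleanlab.

```python
# Illustrative fallback used when a response's trust score is too low.
FALLBACK_MESSAGE = "Sorry, I cannot answer that confidently. Escalating to a human agent."

def apply_trust_gate(response_text, trustworthiness_score, threshold=0.8):
    """Return the LLM response only if its trust score clears the threshold."""
    if trustworthiness_score >= threshold:
        return response_text
    return FALLBACK_MESSAGE
```

In the workflow above, you would call something like `apply_trust_gate(response.choices[0].message.content, score_result["trustworthiness_score"])`; tune the threshold on a dataset from your own use-case.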
Using OpenAI client instead of AzureOpenAI (click to expand)
If you're using the OpenAI client instead of AzureOpenAI client, the only difference in your code from above would be that the client is instantiated differently:
client = OpenAI()
response = client.chat.completions.create(**openai_kwargs)
instead of:
client = AzureOpenAI(
    api_version="<your-api-version>",
    azure_endpoint="<your-azure-endpoint>",
    api_key="<your-azure-api-key>",
)
response = client.chat.completions.create(**openai_kwargs)
The code to score this response using TLM remains identical. Full code sample below:
openai_kwargs = {
    "model": "gpt-4.1-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
}
client = OpenAI()
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "azure/gpt-4.1-mini", "log": ["explanation"]})
response = client.chat.completions.create(**openai_kwargs)
score_result = tlm.score(
    response=response,
    **openai_kwargs
)
Workflow 2: Adding a Decorator to your LLM Call
For greater convenience, you can decorate your call to openai.chat.completions.create() with a decorator that appends the trust score as a key in the returned response. This workflow requires only a one-time setup, after which the rest of your existing code runs unchanged:
import functools

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator
Then, we decorate the OpenAI client so that your existing code automatically gets trust scores:
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "azure/gpt-4.1-mini", "log": ["explanation"]})
client = AzureOpenAI(
    api_version="<your-api-version>",
    azure_endpoint="<your-azure-endpoint>",
    api_key="<your-azure-api-key>",
)
client.chat.completions.create = add_trust_scoring(tlm)(client.chat.completions.create)
response = client.chat.completions.create(**openai_kwargs)
response
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {response.tlm_metadata['log']['explanation']}")
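If you want to sanity-check the decorator's behavior without a live endpoint, you can exercise it against stub objects. The StubTLM class and stub_create function below are hypothetical stand-ins, and the decorator is restated from above so this cell is self-contained:

```python
import functools
from types import SimpleNamespace

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator (same as above)."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator

class StubTLM:
    """Stand-in for TLMChatCompletion that returns a fixed score."""
    def score(self, response=None, **kwargs):
        return {"trustworthiness_score": 0.99}

def stub_create(**kwargs):
    """Stand-in for client.chat.completions.create, returning a response-shaped object."""
    return SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="Paris"))])

decorated_create = add_trust_scoring(StubTLM())(stub_create)
resp = decorated_create(model="stub-model", messages=[])
# resp now carries resp.tlm_metadata alongside the usual response fields
```

This also shows why the wrapper accepts keyword arguments only: the same kwargs passed to create() are forwarded to score() so TLM sees the exact prompt that produced the response.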
Using OpenAI client instead of AzureOpenAI (click to expand)
The only difference would again be that your client is instantiated differently:
client = OpenAI()
client.chat.completions.create = add_trust_scoring(tlm)(client.chat.completions.create)
instead of:
client = AzureOpenAI(
    api_version="<your-api-version>",
    azure_endpoint="<your-azure-endpoint>",
    api_key="<your-azure-api-key>",
)
client.chat.completions.create = add_trust_scoring(tlm)(client.chat.completions.create)
The code to score this response using TLM remains identical. Full code sample below:
import functools

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator
tlm = TLMChatCompletion(quality_preset="medium", options={"model": "azure/gpt-4.1-mini", "log": ["explanation"]})
client = OpenAI()
client.chat.completions.create = add_trust_scoring(tlm)(client.chat.completions.create)
response = client.chat.completions.create(**openai_kwargs)
The above workflows allow you to continue using your own LLM infrastructure to generate responses, and you simply add Cleanlab as an extra step to score their trustworthiness. Your core AI system remains the same as before, without changes to your existing code. Alternatively, you can avoid managing any LLM infrastructure via the workflow below, where Cleanlab manages the LLM calls to produce responses as well.
Workflow 3: Use Cleanlab to Generate and Score Responses
You can point your LLM client directly to Cleanlab's infrastructure. This approach generates responses using Cleanlab's backend while simultaneously providing trustworthiness scores.
OpenAI Client
First we demonstrate how to use the OpenAI client with TLM. Here, you can replace the base URL with your actual TLM service endpoint, and then use the chat.completions.create() method as you normally would.
If your existing code uses AzureOpenAI client instead of OpenAI client, simply make the following replacements in your code:
- from openai import AzureOpenAI -> from openai import OpenAI
- client = AzureOpenAI(...) -> client = OpenAI(...), using the arguments specified below
The rest of this section should work with your existing code, as the API interface and input/output types are the same between OpenAI and AzureOpenAI.
client = OpenAI(
    api_key="<your-api-key>",  # replace with your Azure OpenAI key
    base_url="http://example.customer.com:8080/api"  # replace with your TLM service URL
)
response = client.chat.completions.create(
    model="azure/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "quality_preset": "low",
        "options": {"log": ["explanation"]}
    }
)
response
The extra_body argument contains additional TLM configurations. For all supported inputs, view the valid arguments below.
print(f"Response: {response.choices[0].message.content}")
print(f"TLM Score: {response.tlm_metadata['trustworthiness_score']:.4f}")
print(f"TLM Explanation: {response.tlm_metadata['log']['explanation']}")
Adding a decorator to pass in TLM configurations via extra_body
Here, we demonstrate how to decorate your call to openai.chat.completions.create() so that the extra_body argument is automatically added to all subsequent calls to the create() method. After this initial setup, the rest of your existing code requires zero changes.
import functools

def add_extra_body(tlm_kwargs):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            kwargs["extra_body"] = tlm_kwargs
            return fn(*args, **kwargs)
        return wrapper
    return decorator
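You can verify the injection behavior of this decorator without a live endpoint by wrapping a stub function that simply echoes its keyword arguments (echo_create is a hypothetical stand-in, and the decorator is restated so the check is self-contained):

```python
import functools

def add_extra_body(tlm_kwargs):
    # Same decorator as above, restated so this check is self-contained
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            kwargs["extra_body"] = tlm_kwargs
            return fn(*args, **kwargs)
        return wrapper
    return decorator

def echo_create(**kwargs):
    """Stand-in for create() that just echoes its keyword arguments."""
    return kwargs

wrapped = add_extra_body({"quality_preset": "low"})(echo_create)
call_kwargs = wrapped(model="gpt-4o-mini", messages=[])
# call_kwargs now includes the injected extra_body alongside the caller's kwargs
```

One design note: as written, the decorator overwrites any extra_body the caller passes explicitly; if you need per-call overrides, merge the dictionaries inside wrapper instead of assigning.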
Similar to above, we can decorate the OpenAI client. After this monkey-patch, the code below is functionally equivalent to the version above where we specified extra_body in each create() call -- this lets you use your existing code with minimal changes.
tlm_kwargs = {"quality_preset": "low", "options": {"log": ["explanation"]}}
client.chat.completions.create = add_extra_body(tlm_kwargs)(client.chat.completions.create)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
)
response
Azure AI Inference Client
You can also use the azure-ai-inference client by pointing it to the TLM service endpoint. It can be called in a similar way:
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
azure_client = ChatCompletionsClient(
    endpoint="http://example.customer.com:8080/api",  # replace with your TLM service URL
    credential=AzureKeyCredential("<your-api-key>"),  # replace with your Azure OpenAI key
)
response = azure_client.complete(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    model_extras={
        "quality_preset": "low",
        "options": {"log": ["explanation"]}
    }
)
response
Note that the extra TLM options are now passed in via the model_extras argument (instead of the extra_body argument used when invoking TLM through the OpenAI client).
Input Arguments to TLM
These are optional TLM configurations you can specify either when initializing the TLMChatCompletion object, or in the extra_body argument to the OpenAI API client.
- quality_preset (default = "medium"): a preset configuration to control the quality of TLM responses and trustworthiness scores vs. latency/costs. The "medium" preset produces more reliable trustworthiness scores than "low", while the "base" preset provides the lowest possible latency/cost. Higher presets have increased runtime and cost; reduce your preset if you see token-limit errors.
- options: a dictionary of configuration options for TLM. Inputs include:
  - model (default = "gpt-4.1-mini"): underlying base LLM to use (better models yield better results, faster models yield faster results). Note that if you are using the openai.chat.completions.create() API, you should provide the model name there instead of in this options dictionary.
  - log (default = []): specify additional logs or metadata that TLM should return. Valid options include:
    - explanation: get explanations of why a response is scored with low trustworthiness
  - model_provider: a dictionary specifying the endpoint that LLM requests are sent to. Valid keys include:
    - api_base: the base URL endpoint for the LLM service
    - api_key: the corresponding API key to authenticate with the endpoint specified in api_base
    - api-version: the API version to use
    - provider: the provider name; should be one of the providers supported by litellm (e.g. "azure", "openai", "anthropic", "cohere")
Using Custom Endpoints
The model_provider parameter allows you to specify custom API endpoints for your LLM services. This is particularly useful when you want to route requests through specific endpoints for each request. Below are examples showing how to configure TLM to work with different endpoints.
When using the TLMChatCompletion object (workflows 1 / 2):
tlm = TLMChatCompletion(
    quality_preset="medium",
    options={
        "model": "azure/gpt-4.1-mini",
        "log": ["explanation"],
        "model_provider": {
            "api_base": "<your-api-base>",
            "api_key": "<your-api-key>"
        }
    }
)
tlm.score(...)
When using the OpenAI / Azure AI Inference client (workflow 3):
response = client.chat.completions.create(
    model="azure/gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "quality_preset": "low",
        "options": {
            "log": ["explanation"],
            "model_provider": {
                "api_base": "<your-api-base>",
                "api_key": "<your-api-key>"
            }
        }
    }
)
Getting Cheaper / Faster Results
The default TLM settings are not latency-optimized because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application.
- You can stream in a response from any (fast) LLM you are using, and then use TLMChatCompletion.score to subsequently stream in the trustworthiness score for the response. If you run TLM with a lower quality_preset and a cheaper model, the additional cost/runtime of trustworthiness scoring can be only a fraction of the cost/runtime of producing the response with your own LLM.
- Reduce the quality_preset setting (e.g. to "low" or "base").
- Specify options to further reduce TLM runtimes, e.g. by changing model to a faster base LLM (such as gpt-4.1-nano).
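Putting these levers together, a latency-optimized configuration might look like the following sketch (whether gpt-4.1-nano is available depends on the models enabled in your VPC deployment):

```python
# Latency-optimized TLM settings: lowest-cost preset plus a faster base LLM.
# Validate score quality on a sample of your data before adopting these.
fast_tlm_kwargs = {
    "quality_preset": "base",
    "options": {"model": "gpt-4.1-nano"},
}
```

You would pass these as TLMChatCompletion(**fast_tlm_kwargs) in workflows 1/2, or as the extra_body dictionary in workflow 3.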
Running on Batches and Managing Rate Limits
When processing large datasets, here are some tips to handle rate limits and implement proper batching strategies:
Prevent hitting rate limits
- Process data in small batches (e.g. 10-50 requests at a time)
- Add sleep intervals between batches (e.g. time.sleep(1)) to stay under rate limits
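The batch-slicing step itself is simple; a small chunked helper (a generic utility sketch, not part of cleanlab-tlm) shows the idea:

```python
def chunked(seq, size):
    """Yield successive batches of at most `size` items from seq."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# 25 inputs split into batches of 10 -> batches of 10, 10, and 5 items
batches = list(chunked(list(range(25)), 10))
```

You would sleep between iterations of this generator when sending each batch to the TLM service.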
Handling errors
- Save partial results frequently to avoid losing progress
- Consider using a try/except block to catch errors, and implement retry logic when rate limits are hit
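The retry-with-backoff step on its own can be sketched as follows; the sleep function is injectable here purely so the logic can be exercised without waiting, and flaky is a hypothetical stand-in for a rate-limited LLM call:

```python
import time

def call_with_retries(fn, retries=3, backoff=2, sleep=time.sleep):
    """Call fn(); on failure, wait backoff**attempt seconds and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            sleep(backoff ** attempt)

calls = {"count": 0}
def flaky():
    """Hypothetical call that fails twice (e.g. rate-limited) then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit")
    return "ok"

result = call_with_retries(flaky, sleep=lambda seconds: None)  # skip real sleeping
```

The full helper functions below apply this same pattern to actual chat.completions.create calls, returning an error record instead of raising once retries are exhausted.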
Here are some sample helper functions that could help with batching:
import time
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
client = OpenAI(
    api_key="<your-api-key>",  # replace with your Azure OpenAI key
    base_url="http://example.customer.com:8080/api"  # replace with your TLM service URL
)
def invoke_llm_with_retries(openai_kwargs, retries=3, backoff=2):
    attempt = 0
    while attempt <= retries:
        try:
            # the code to invoke the LLM goes here, feel free to modify
            response = client.chat.completions.create(**openai_kwargs)
            return {
                "response": response.choices[0].message.content,
                "trustworthiness_score": response.tlm_metadata["trustworthiness_score"],
                "raw_completion": response
            }
        except Exception as e:
            if attempt == retries:
                return {"error": str(e), "input": openai_kwargs}
            sleep_time = backoff ** attempt
            time.sleep(sleep_time)
            attempt += 1
def run_batch(batch_data, batch_size=20, max_threads=8, sleep_time=5):
    results = []
    for i in tqdm(range(0, len(batch_data), batch_size)):
        data = batch_data[i:i + batch_size]
        batch_results = [None] * len(data)
        with ThreadPoolExecutor(max_workers=max_threads) as executor:
            future_to_idx = {executor.submit(invoke_llm_with_retries, d): idx for idx, d in enumerate(data)}
            for future in as_completed(future_to_idx):
                idx = future_to_idx[future]
                batch_results[idx] = future.result()
        results.extend(batch_results)
        # sleep to prevent hitting rate limits
        if i + batch_size < len(batch_data):
            time.sleep(sleep_time)
    return results
sample_input = {
    "model": "gpt-4.1-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
}
sample_batch = [sample_input] * 10
run_batch(sample_batch)
More information about handling rate limits can be found in this OpenAI cookbook.