
Trustworthy Language Model (TLM) - Advanced Usage

Run in Google Colab

Assuming you’ve already run the Trustworthy Language Model quickstart tutorial, this tutorial covers more advanced TLM usage, including how to:

  • Explain low trustworthiness scores
  • Configure TLM to reduce latency and get better/faster results
  • Run TLM over large datasets and handle errors

Setup

This tutorial requires a TLM API key. Get one here.

Cleanlab’s TLM Python client can be installed using pip:

%pip install --upgrade cleanlab-tlm
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/

Explaining Low Trustworthiness Scores

TLM can return explanations to help you better understand why certain outputs are untrustworthy. Reasons why a particular LLM response is deemed untrustworthy might include:

  • an alternative contradictory response was almost instead generated by the LLM
  • reasoning/factual errors were discovered when reflecting on the response
  • the input/task has been identified as unclear/confusing or requiring knowledge the model is not confident about
  • the given prompt/response is atypical relative to the LLM’s training data.

You can optionally specify an explanation flag when initializing TLM. With this flag specified, TLM will log explanations identified during its trust scoring process, which you can obtain via tlm.get_explanation(). This works for both prompt() and get_trustworthiness_score() methods.

Here are some examples:

from cleanlab_tlm import TLM

tlm = TLM(options={"log": ["explanation"]})

prompt = "What is the ISBN number of the book 'KATABASIS' by R.F. Kuang?"

tlm_result = tlm.prompt(prompt)
explanation = tlm.get_explanation(prompt=prompt, tlm_result=tlm_result)

print(f'Response: {tlm_result["response"]}')
print(f'Trustworthiness Score: {tlm_result["trustworthiness_score"]}\n')
print(f'Explanation: {explanation}')
Response: The ISBN number for **"KATABASIS"** by R.F. Kuang is **9780593539789**.
Trustworthiness Score: 0.4066781853110632

Explanation: First, I need to verify if the book "KATABASIS" by R.F. Kuang exists and if the ISBN provided matches that book. R.F. Kuang is a known author, famous for works like "The Poppy War" series. However, "KATABASIS" is not a widely recognized title associated with her.

The ISBN given, 9780593539789, corresponds to a book published by a major publisher (the prefix 978-0-593 is often associated with Penguin Random House or related imprints). Checking this ISBN in databases like ISBNdb or WorldCat would clarify the actual book title and author.

From my knowledge and available data, the ISBN 9780593539789 corresponds to "Yellowface" by R.F. Kuang, published in 2023. "Yellowface" is a recent novel by R.F. Kuang, and this ISBN is correct for that book.

Therefore, the answer is factually incorrect because the ISBN given is for "Yellowface," not "KATABASIS." Also, "KATABASIS" does not appear to be a known book by R.F. Kuang, so the question itself might be based on incorrect or fictional information.

Hence, the answer is wrong in both the book title and the ISBN association.

Score should be very low, close to zero, since the answer is factually incorrect.
prompt = "Do LLMs dream of electric sheep?"
response = "Yes, but they prefer to dream of real sheep."

tlm_result = tlm.get_trustworthiness_score(prompt, response)
explanation = tlm.get_explanation(prompt=prompt, response=response, tlm_result=tlm_result)

print(f'Trustworthiness Score: {tlm_result["trustworthiness_score"]}\n')
print(f'Explanation: {explanation}')
Trustworthiness Score: 0.060880157610280185

Explanation: The question "Do LLMs dream of electric sheep?" is a playful reference to Philip K. Dick's novel "Do Androids Dream of Electric Sheep?" which inspired the movie "Blade Runner." The question is metaphorical, as LLMs (large language models) are AI systems and do not possess consciousness, subjective experiences, or the ability to dream. The proposed answer, "Yes, but they prefer to dream of real sheep," is clearly a humorous or whimsical response rather than a factual one. LLMs do not dream at all, electric sheep or otherwise, because dreaming is a biological process associated with sentient beings, not computational models. Therefore, the answer is factually incorrect. A more accurate answer would clarify that LLMs do not dream, as they are not conscious entities. Given this, the probability that the answer is accurate is effectively zero.

Tips:

  • Some TLM configurations may not support logging explanations during trust scoring, but you can still use the get_explanation() method after the fact.

  • If you’re getting unhelpful/generic explanations, try omitting "explanation" from the log specified at initialization before you invoke get_explanation(), as in the sketch after these tips.

  • Occasionally, a returned explanation may itself look anomalous. Treat that as strong evidence that the LLM should not be trusted in this case: if the LLM’s explanation does not seem sensible, how can you trust its response?
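For instance, here’s a minimal sketch of obtaining an explanation after the fact from a TLM instance initialized without the "explanation" log option (reusing the prompt defined earlier):

tlm_no_log = TLM()  # no "explanation" specified via the log option

tlm_result = tlm_no_log.prompt(prompt)
explanation = tlm_no_log.get_explanation(prompt=prompt, tlm_result=tlm_result)

print(f'Explanation: {explanation}')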

Optional TLM Configurations for Better/Faster Results

TLM’s default configuration is not latency/cost-optimized because it must remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency/cost without compromising results. The strategy: first run TLM with default settings on a dataset from your use-case to see the results, then adjust the model, quality_preset, and other TLMOptions to reduce latency for your application. If TLM’s default configuration seems ineffective, switch to a more powerful model (e.g. gpt-4.1, o4-mini, o3, claude-3.7-sonnet, or claude-3.5-sonnet-v2) or add custom evaluation criteria.

We describe these optional configurations below. If you email us (support@cleanlab.ai), our engineers can optimize TLM for your use-case in 15min – it’s that easy!

Task Type

TLM generally works for any LLM application. For certain tasks, get better results by specifying your task as one of:

  • classification: For multi-class classification tasks where the LLM chooses amongst a predefined set of categories/answers.
  • code_generation: For software engineering tasks where the LLM outputs code/programs.
  • default: Generic configuration for most use cases (used when you don’t specify a task).
tlm = TLM(task="classification")  # or, for example: task="code_generation"

Quality Presets

You can trade-off latency vs. quality via TLM’s quality_preset argument. For many use-cases, a lower quality_preset performs just as well.

tlm = TLM(quality_preset="low")  # supported quality presets: 'best', 'high', 'medium' (default), 'low', 'base'

# Run a prompt using this TLM preset:
output = tlm.prompt("<your prompt>") # this call gets faster using the 'low' preset

# Or run a batch of prompts simultaneously:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])

Details about TLM quality presets:

Higher quality presets produce more accurate trustworthiness scores: best is the most accurate, followed by high and medium, while low and base offer lower latency but less accurate scores. For faster results, reduce the preset to low or base (the default preset is medium).

Note: The range of the trustworthiness scores may slightly differ depending on your preset. Don’t directly compare the magnitude of TLM scores across different presets (settle on a preset before determining score thresholds). What remains comparable across different presets is how these TLM scores rank data or LLM responses from most to least confidently good.
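For example, here’s a minimal sketch of thresholding scores within a single fixed preset (the 0.8 cutoff is purely a hypothetical value for illustration; calibrate your own threshold on data scored with whichever preset you settle on):

tlm = TLM(quality_preset="low")
result = tlm.prompt("<your prompt>")

TRUST_THRESHOLD = 0.8  # hypothetical cutoff, calibrated on your own data for the "low" preset
if result["trustworthiness_score"] < TRUST_THRESHOLD:
    print("Low-trust response: consider escalating to a human or a fallback workflow.")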

Other TLM Options

When initializing a TLM instance, optionally specify the options argument as a dictionary of extra configurations beyond the quality preset. See the TLMOptions documentation. Here are useful options:

  • model: Which underlying LLM model TLM should utilize. TLM is a wrapper around any base LLM API to get trustworthiness scores for that LLM and auto-improve its responses. For lower latency, specify a faster model like gpt-4.1-nano or nova-micro. For higher accuracy, specify a bigger model like gpt-4.1, o4-mini, o3, or claude-sonnet-4-0.

  • max_tokens: The maximum number of tokens that TLM can generate. Try decreasing this value if you hit token limit errors.

  • num_candidate_responses: Increase this value to auto-improve responses from TLM.prompt(). It has no effect on TLM.get_trustworthiness_score(). TLM.prompt() can internally generate multiple candidate LLM responses, score each of their trustworthiness, and automatically return the best candidate.

For example, here’s how to auto-improve responses from the Claude 3.5 Sonnet LLM and reliably score their trustworthiness:

tlm = TLM(quality_preset="best", options={"model": "claude-3.5-sonnet", "num_candidate_responses": 4})

output = tlm.prompt("<your prompt>")

Trustworthy Language Model Lite

Using TLMLite in place of TLM.prompt() enables you to use a powerful/slower LLM to generate each response and a faster LLM to score its trustworthiness.

TLMLite is just like TLM, except you can specify a separate response_model for generating responses. Other options for TLMLite only apply to the trustworthiness scoring model.

For example, here we use the larger gpt-4.1 model to generate responses, and the faster gpt-4.1-nano model to score their trustworthiness. To further reduce latency, we specify quality_preset="low".

from cleanlab_tlm import TLMLite

tlm_lite = TLMLite(response_model="gpt-4.1", quality_preset="low", options={"model": "gpt-4.1-nano"})

output = tlm_lite.prompt("<your prompt>")

Reducing Latency

To use TLM in an ultra low-latency real-time application, you might stream in responses from your own LLM and then use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness scores.

In your TLMOptions, specify a smaller/faster model (e.g. gpt-4.1-nano or nova-micro) and a lower quality_preset (e.g. low or base). Also try reducing other TLMOptions values, such as the following (a combined low-latency configuration is sketched after this list):

  • reasoning_effort: try lower setting (e.g. low or none)
  • similarity_measure: try setting this to string
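Putting these together, a low-latency configuration might look like the following sketch (these option values are just the faster settings mentioned above, not a universal recommendation):

fast_tlm = TLM(
    quality_preset="base",
    options={
        "model": "gpt-4.1-nano",        # smaller/faster model
        "reasoning_effort": "none",     # minimize extra reasoning during scoring
        "similarity_measure": "string",
    },
)

# score a response you already streamed in from your own LLM
score_result = fast_tlm.get_trustworthiness_score("<your prompt>", "<response from your own LLM>")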

Running TLM over large datasets

To avoid overwhelming our API with requests, there’s a maximum number of tokens per minute that you can query TLM with (a rate limit). If you run multiple prompts simultaneously in a batch, you’ll need to stay under the rate limit while still getting all of your results as quickly as possible.

If you hit token limit errors, consider playing with TLM’s quality_preset and max_tokens parameters. If you run TLM on individual examples yourself in a for loop, you may hit the rate limit, so we recommend running in batches of many prompts passed in as a list.

The TLM.prompt() and TLM.get_trustworthiness_score() methods handle failed examples by returning a dictionary with null values for the response and trustworthiness_score keys, along with a log key containing detailed error information: an error message describing the specific issue (such as exceeding token limits) and a boolean flag indicating whether the error is retryable. These methods still return results for the remaining examples in the provided list where TLM ran successfully, so you can process the successful results and re-run only the examples with retryable errors (see the sketch below).
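For example, here’s a minimal sketch (assuming prompts is your list of prompt strings) of splitting a batch of results into successes and retryable failures; a fuller retry helper is defined later in this tutorial:

results = tlm.prompt(prompts)  # prompts: list of prompt strings

successes = [r for r in results if r["response"] is not None]
retryable_idx = [
    i for i, r in enumerate(results)
    if r.get("log", {}).get("error", {}).get("retryable")
]
# re-run just the failed prompts, e.g.: tlm.prompt([prompts[i] for i in retryable_idx])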

Mini-batching

If your dataset has more than several thousand examples, we recommend running TLM in mini-batches to checkpoint intermediate results.

This helper function allows you to run TLM in mini-batches. We recommend batch sizes of approximately 1000, but feel free to tinker with this number to best suit your use case. You can re-execute this function in the case of any failures and it will resume from the previous checkpoint.

Optional: Define helper function: batch_prompt()

import os
import pandas as pd

def batch_prompt(tlm: TLM, input_path: str, output_path: str, prompt_col_name: str, batch_size: int = 1000):
    """Runs TLM over the prompts in `input_path` in mini-batches, appending results to `output_path`.
    Any failures (errors or timeouts) appear as rows with `null` values in place of the results."""
    # resume from a previous checkpoint if partial results already exist
    if os.path.exists(output_path):
        start_idx = len(pd.read_csv(output_path))
    else:
        start_idx = 0

    df_batched = pd.read_csv(input_path, chunksize=batch_size)
    curr_idx = 0

    for curr_batch in df_batched:
        # skip this batch if results already exist for all of it
        if curr_idx + len(curr_batch) <= start_idx:
            curr_idx += len(curr_batch)
            continue

        # if results exist for only part of the batch, process just the remainder
        elif curr_idx < start_idx:
            curr_batch = curr_batch[start_idx - curr_idx:]
            curr_idx = start_idx

        results = tlm.prompt(curr_batch[prompt_col_name].to_list())
        results_df = pd.DataFrame(results)
        results_df.to_csv(output_path, mode='a', index=False, header=not os.path.exists(output_path))

        curr_idx += len(curr_batch)

Here we’ll demonstrate the batch_prompt() helper on a toy dataset of 100 prompts, but it can be run at scale. Just specify: an instantiated TLM object, the input file path to a CSV file containing your prompts and the name of the column they are located in, the output file path where results should be stored, and your intended batch size (we recommend ~1000 examples per batch).

import pandas as pd

# create sample prompts
sample_prompts = pd.DataFrame({"prompt": [f"What is the sum of 1 and {i}?" for i in range(1, 101)]})
sample_prompts.to_csv("sample_tlm_prompts.csv", index=False)
input_path = "sample_tlm_prompts.csv"
output_path = "sample_responses.csv"

df = pd.read_csv(input_path)
df.head()
prompt
0 What is the sum of 1 and 1?
1 What is the sum of 1 and 2?
2 What is the sum of 1 and 3?
3 What is the sum of 1 and 4?
4 What is the sum of 1 and 5?

We can then call the batch_prompt function to run TLM in mini-batches. Note that if this cell fails for any reason, you can just re-execute it and TLM will resume processing your data from the previous checkpoint.

tlm = TLM() 

batch_prompt(
    tlm=tlm,
    input_path=input_path,
    output_path=output_path,
    prompt_col_name="prompt",
    batch_size=20,
)

After the cell above is done executing, we can view the saved results in the output file:

results = pd.read_csv(output_path)
results.head()
response trustworthiness_score
0 The sum of 1 and 1 is 2. 0.959896
1 The sum of 1 and 2 is 3. 0.983522
2 The sum of 1 and 3 is 4. 0.978253
3 The sum of 1 and 4 is 5. 0.980250
4 The sum of 1 and 5 is 6. 0.969275

Retrying Failed Queries

When running large batches of prompts, some queries may fail due to issues like temporary network errors or timeouts. As noted above, the TLM.prompt() and TLM.get_trustworthiness_score() methods handle failed examples by returning dictionaries with detailed error information. By examining the log data in those results, you can efficiently retry only the queries that encountered retryable errors, without reprocessing the successful ones. This section demonstrates how to implement a retry mechanism for the failed queries.

For the purposes of this tutorial, we’ll intentionally use a very short timeout when calling TLM to trigger some errors.

tlm = TLM(timeout=0.25) 

prompts = [f"What is the sum of 1 and {i}?" for i in range(1, 10)]
res_with_failures = tlm.prompt(prompts)
res_with_failures[:5]
Querying TLM... 100%|██████████|

[{'response': 'The sum of 1 and 1 is 2.',
'trustworthiness_score': 0.9598961301228769},
{'response': 'The sum of 1 and 2 is 3.',
'trustworthiness_score': 0.9835215390172598},
{'response': 'The sum of 1 and 3 is 4.',
'trustworthiness_score': 0.9782526873921833},
{'response': 'The sum of 1 and 4 is 5.',
'trustworthiness_score': 0.9802496022988343},
{'response': None,
'trustworthiness_score': None,
'log': {'error': {'message': 'Timeout while waiting for prediction. Please retry or consider increasing the timeout.',
'retryable': True}}}]

We see above that while some queries succeeded, others failed due to timeout errors. Since timeout errors are retryable, we can define the retry_prompt() helper function to retry only the failed prompts and combine the results.

Optional: TLM retry_prompt helper function

We will also define a retry_get_trustworthiness_score() function here, which works the same way as retry_prompt() but obtains trustworthiness scores for prompt-response pairs.


import numpy as np

def retry_prompt(tlm, prompts, tlm_responses):
    # identify the prompts whose responses failed with a retryable error
    failed_idx = [idx for idx, item in enumerate(tlm_responses) if item.get('log', {}).get('error', {}).get('retryable')]
    failed_prompts = np.array(prompts)[failed_idx]

    # re-run TLM on only the failed prompts, then merge the new results back in
    retry_res = tlm.prompt(failed_prompts.tolist())
    tlm_responses_array = np.array(tlm_responses)
    tlm_responses_array[failed_idx] = retry_res

    return tlm_responses_array.tolist()

def retry_get_trustworthiness_score(tlm, prompts, responses, tlm_scores):
    # identify the prompt-response pairs whose scores failed with a retryable error
    failed_idx = [idx for idx, item in enumerate(tlm_scores) if item.get('log', {}).get('error', {}).get('retryable')]
    failed_prompts = np.array(prompts)[failed_idx]
    failed_responses = np.array(responses)[failed_idx]

    # re-score only the failed pairs, then merge the new results back in
    retry_res = tlm.get_trustworthiness_score(failed_prompts.tolist(), failed_responses.tolist())
    tlm_scores_array = np.array(tlm_scores)
    tlm_scores_array[failed_idx] = retry_res

    return tlm_scores_array.tolist()

This function takes three inputs:

  • tlm, an instantiated TLM object
  • prompts, which is a list of all the original prompts (same list that was initially passed to TLM.prompt())
  • tlm_responses, the list of responses from TLM that includes both successful results and error logs, which helps identify which prompts failed and can be retried.

retry_prompt() re-executes TLM only on the failed prompts, and returns aggregated results that combine the successful responses from the previous TLM.prompt() call with the retried responses. Let’s try it out:

retry_res = retry_prompt(tlm, prompts, res_with_failures)
retry_res
Querying TLM... 100%|██████████|

[{'response': 'The sum of 1 and 1 is 2.',
'trustworthiness_score': 0.9598961301228769},
{'response': 'The sum of 1 and 2 is 3.',
'trustworthiness_score': 0.9835215390172598},
{'response': 'The sum of 1 and 3 is 4.',
'trustworthiness_score': 0.9782526873921833},
{'response': 'The sum of 1 and 4 is 5.',
'trustworthiness_score': 0.9802496022988343},
{'response': 'The sum of 1 and 5 is 6.',
'trustworthiness_score': 0.9692750020809217},
{'response': 'The sum of 1 and 6 is 7.',
'trustworthiness_score': 0.9747792954587126},
{'response': 'The sum of 1 and 7 is 8.',
'trustworthiness_score': 0.9725614671284452},
{'response': 'The sum of 1 and 8 is 9.',
'trustworthiness_score': 0.9811183463257951},
{'response': 'The sum of 1 and 9 is 10.',
'trustworthiness_score': 0.9718280444545313}]

After retrying, we see that all of the prompts have succeeded.

However, note that retrying failed queries does not guarantee success. If a prompt continues to fail after a few retry attempts, consider checking your inputs for potential errors or making adjustments to your parameters.
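For example, here’s a minimal sketch of a bounded retry loop that reuses the retry_prompt() helper defined above (the cap of 3 attempts is arbitrary):

MAX_RETRIES = 3  # arbitrary cap on retry attempts

results = res_with_failures
for _ in range(MAX_RETRIES):
    if not any(r.get("log", {}).get("error", {}).get("retryable") for r in results):
        break  # nothing left to retry
    results = retry_prompt(tlm, prompts, results)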

Next Steps

  • Search for your use-case in our tutorials and cheat sheet to learn how you can best use TLM.

  • If you need an additional capability or deployment option, or are unsure how to achieve desired results, just ask: support@cleanlab.ai. We love hearing from users, and are happy to help optimize TLM latency/accuracy for your use-case.