Trustworthy Language Model (TLM) - Advanced Usage
Assuming you’ve run the Trustworthy Language Model quickstart tutorial, here you can learn more about TLM including how to:
- Explain low trustworthiness scores
- Configure TLM to reduce latency/costs and get better/faster results
- Run TLM over large datasets and handle errors
- Handle LLM responses flagged as untrustworthy
Setup
This tutorial requires a TLM API key. Get one here.
Cleanlab’s TLM Python client can be installed using pip:
%pip install --upgrade cleanlab-tlm
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
Explaining Low Trustworthiness Scores
To understand why TLM estimated low trustworthiness for each particular prompt/response, specify the explanation flag when initializing TLM. With this flag specified, the output dictionary that TLM returns for each input will contain an extra field called explanation.
Explanations will be generated for both the prompt() and get_trustworthiness_score() methods. Reasons why a particular LLM response is deemed untrustworthy include:
- an alternative contradictory response was almost instead generated by the LLM
- reasoning/factual errors were discovered during self-reflection by the LLM
- the given prompt/response is atypical relative to the LLM’s training data.
Here are some examples:
from cleanlab_tlm import TLM
tlm = TLM(options={"log": ["explanation"]})
output = tlm.prompt("Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers?")
print(f'Response: {output["response"]}')
print(f'Trustworthiness Score: {output["trustworthiness_score"]}\n')
print(f'Explanation: {output["log"]["explanation"]}')
output = tlm.get_trustworthiness_score(prompt="Do LLMs dream of electric sheep?", response="Yes, but they prefer to dream of real sheep.")
print(f'Trustworthiness Score: {output["trustworthiness_score"]}\n')
print(f'Explanation: {output["log"]["explanation"]}')
Currently, TLM only provides explanations for the trustworthiness score, not additional custom evaluation criteria scores.
Optional TLM Configurations for Better/Faster Results
TLM’s default configuration is not latency/cost-optimized because it must remain effective across all possible LLM use-cases.
For your specific use-case, you can greatly improve latency/cost without compromising results. Strategy: first run TLM with default settings to see results over a dataset from your use-case; then adjust the model, quality_preset, and other TLMOptions to reduce latency for your application. If TLM's default configuration seems ineffective, switch to a more powerful model (e.g. o3-mini, o1, or claude-3.5-sonnet-v2) or add custom evaluation criteria.
We describe these optional configurations below. If you email us (support@cleanlab.ai), our engineers can optimize TLM for your use-case in 15min – it’s that easy!
Task Type
TLM generally works for any LLM application. For certain tasks, get better results by specifying your task as one of:
- classification: For multi-class classification tasks where the LLM chooses amongst a predefined set of categories/answers.
- code_generation: For software engineering tasks where the LLM outputs code/programs.
- default: Generic configuration for most use cases (used when you don't specify a task).
tlm = TLM(task="classification") # or task could be say: 'code_generation'
Quality Presets
You can trade off latency vs. quality via TLM's quality_preset argument. For many use-cases, a lower quality_preset performs just as well.
tlm = TLM(quality_preset="low") # supported quality presets: 'best', 'high', 'medium' (default), 'low', 'base'
# Run a prompt using this TLM preset:
output = tlm.prompt("<your prompt>") # this call gets faster using the 'low' preset
# Or run a batch of prompts simultaneously:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])
Details about TLM quality presets:
Quality Preset | Trustworthiness Score Quality | Auto-Improvement of LLM Response in TLM.prompt() |
---|---|---|
Best | Good | Significant improvement |
High | Good | Moderate improvement |
Medium | Good | None (same response as base LLM model) |
Low | Fair | None (same response as base LLM model) |
Base | Lowest latency | None (same response as base LLM model) |
For faster results, reduce the preset to low or base (the default preset is medium).
If you just want trustworthiness scores, avoid the best or high presets. Those presets are for automatically improving LLM responses returned by TLM.prompt().
TLM.prompt() using the medium, low, or base preset returns the same response from the base LLM model that you'd ordinarily get. TLM.prompt() using the best or high preset internally runs the base LLM multiple times to return a more trustworthy response.
Rigorous benchmarks reveal that running TLM with the best preset can reduce the error rate (incorrect answers) of GPT-4o by 27%, of GPT-4 by 10%, and of GPT-3.5 by 22%.
If you encounter token limit errors, try a lower quality preset.
Note: The range of the trustworthiness scores may slightly differ depending on your preset. Don’t directly compare the magnitude of TLM scores across different presets (settle on a preset before determining score thresholds). What remains comparable across different presets is how these TLM scores rank data or LLM responses from most to least confidently good.
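For example, to surface the least trustworthy outputs in a dataset, rank them by score within whichever preset you settled on (a minimal sketch; the prompts here are placeholders):
tlm = TLM(quality_preset="low")  # settle on one preset before comparing or thresholding scores
prompts = ["<your first prompt>", "<your second prompt>", "<your third prompt>"]
outputs = tlm.prompt(prompts)

# Rank from least to most trustworthy (rankings are comparable across presets, raw magnitudes are not)
ranked = sorted(zip(prompts, outputs), key=lambda pair: pair[1]["trustworthiness_score"])
for prompt, output in ranked:
    print(f'{output["trustworthiness_score"]:.3f}  {prompt}')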
Other TLM Options
When initializing a TLM instance, optionally specify the options argument as a dictionary of extra configurations beyond the quality preset. See the TLMOptions documentation. Here are useful options:
- model: Which underlying LLM model TLM should rely on. TLM is a wrapper method that can be combined with any base LLM API to get trustworthiness scores for that LLM and auto-improve its responses. For low latency/cost, specify a fast model like nova-micro. For high accuracy, specify a powerful model like o3-mini.
- max_tokens: The maximum number of tokens TLM should generate. Decrease this value if you hit token limit errors or to improve runtimes.
For example, here's how to run a more accurate LLM than Claude 3.5 Sonnet and also get trustworthiness scores:
tlm = TLM(quality_preset="best", options={"model": "claude-3.5-sonnet"})
output = tlm.prompt("<your prompt>")
Trustworthy Language Model Lite
Using TLMLite in place of TLM enables you to use a powerful/slower LLM to generate each response and a faster LLM to score its trustworthiness.
TLMLite is just like TLM, except you can specify a separate response_model for generating responses. Other options for TLMLite only apply to the trustworthiness scoring model.
For example, here we use the larger gpt-4o model to generate responses, and the faster gpt-4o-mini model to score their trustworthiness (consider even faster models like nova-micro). To further reduce latency/cost, we specify quality_preset="low".
from cleanlab_tlm import TLMLite
tlm_lite = TLMLite(response_model="gpt-4o", quality_preset="low", options={"model": "gpt-4o-mini"})
output = tlm_lite.prompt("<your prompt>")
Reducing Latency
To use TLM in an ultra low-latency real-time application, you might stream in responses from your own LLM, and use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness score.
In your TLMOptions, specify a smaller/faster model (e.g. nova-micro) and a lower quality_preset (e.g. low or base). Also try reducing values in TLMOptions such as:
- reasoning_effort: try a lower setting (e.g. low or none)
- similarity_measure: try setting this to string
- max_tokens: try lower values
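For example, a latency-optimized configuration might look like this (a minimal sketch; which options help most depends on your use-case, and the max_tokens value is illustrative):
from cleanlab_tlm import TLM

# Latency-optimized configuration: fast base model, cheap preset, reduced scoring work
tlm_fast = TLM(
    quality_preset="low",
    options={
        "model": "nova-micro",        # smaller/faster base model
        "reasoning_effort": "none",   # less reasoning during trustworthiness scoring
        "similarity_measure": "string",
        "max_tokens": 64,             # cap generated tokens (illustrative value)
    },
)

# Score a response streamed from your own LLM
score = tlm_fast.get_trustworthiness_score(prompt="<your prompt>", response="<your LLM's response>")
print(score["trustworthiness_score"])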
Running TLM over large datasets
To avoid overwhelming our API with requests, there’s a maximum number of tokens per minute that you can query TLM with (rate limit). If running multiple prompts simultaneously in batch, you’ll need to stay under the rate limit, but you’ll also want to optimize for getting all results quickly.
If you hit token limit errors, consider playing with TLM's quality_preset and max_tokens parameters. If you run TLM on individual examples yourself in a for loop, you may hit the rate limit, so we recommend running in batches of many prompts passed in as a list.
If you are running TLM on big datasets beyond hundreds of examples, it is important to note that TLM.prompt() and TLM.get_trustworthiness_score() will fail if any of the individual examples within the provided list fail. This may be suboptimal. Instead consider using TLM.try_prompt() and TLM.try_get_trustworthiness_score(), which are analogous methods except they handle failed examples by returning a dictionary with null values for the response and trustworthiness_score keys, along with a log key containing detailed error information.
The error information includes an error message describing the specific issue (such as exceeding token limits) and a boolean flag indicating whether the error is retryable. These methods still return results for the remaining examples in the provided list where TLM ran successfully, so you can process those results while retrying only the examples whose errors are retryable.
tlm = TLM()
big_dataset_of_prompts = ["<first prompt>", "<second prompt>", "<third prompt>"] # imagine 1000s instead of 3
# Not recommended for dataset with 50+ prompts:
outputs_that_may_be_lost = tlm.prompt(big_dataset_of_prompts)
# Recommended for moderate-size dataset:
outputs_where_some_may_be_none = tlm.try_prompt(big_dataset_of_prompts)
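To see which examples failed and whether they can be retried, you can inspect the returned dictionaries (a minimal sketch, assuming failed entries are dictionaries with null values and an error log, as described above):
# Separate successful results from failures based on the returned dictionaries
successes = [out for out in outputs_where_some_may_be_none if out.get("trustworthiness_score") is not None]
failures = [out for out in outputs_where_some_may_be_none if out.get("trustworthiness_score") is None]

# Each failure's log indicates what went wrong and whether the error is retryable
for out in failures:
    error_info = out.get("log", {}).get("error", {})
    print(f'Retryable: {error_info.get("retryable")} | Error details: {error_info}')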
Mini-batching
If your dataset has more than several thousand examples, we recommend running TLM in mini-batches to checkpoint intermediate results.
This helper function allows you to run TLM in mini-batches. We recommend batch sizes of approximately 1000, but feel free to tinker with this number to best suit your use case. You can re-execute this function in the case of any failures and it will resume from the previous checkpoint.
Optional: TLM batch prompt helper function
Note that we also use the tlm.try_prompt() function here, which handles any failures (errors or timeouts) by returning placeholder results with null values in place of the failed examples.
import os
import pandas as pd

def batch_prompt(tlm: TLM, input_path: str, output_path: str, prompt_col_name: str, batch_size: int = 1000):
    # Resume from the previous checkpoint if an output file already exists
    if os.path.exists(output_path):
        start_idx = len(pd.read_csv(output_path))
    else:
        start_idx = 0

    df_batched = pd.read_csv(input_path, chunksize=batch_size)
    curr_idx = 0

    for curr_batch in df_batched:
        # if results already exist for the entire batch, skip it
        if curr_idx + len(curr_batch) <= start_idx:
            curr_idx += len(curr_batch)
            continue
        # if results exist for part of the batch, only process the remaining rows
        elif curr_idx < start_idx:
            curr_batch = curr_batch[start_idx - curr_idx:]
            curr_idx = start_idx

        results = tlm.try_prompt(curr_batch[prompt_col_name].to_list())
        results_df = pd.DataFrame(results)
        # Append results to the output CSV, writing the header only on the first write
        results_df.to_csv(output_path, mode='a', index=False, header=not os.path.exists(output_path))
        curr_idx += len(curr_batch)
Here we'll demonstrate using the batch_prompt() helper function on a toy dataset of 100 prompts, but this can be run at scale. Just specify: an instantiated TLM object, the input file path to a CSV file containing your prompts and the name of the column they are located in, the output file path where results should be stored, and your intended batch size (we recommend ~1000 examples per batch).
import pandas as pd
# create sample prompts
sample_prompts = pd.DataFrame({"prompt": [f"What is the sum of 1 and {i}?" for i in range(1, 101)]})
sample_prompts.to_csv("sample_tlm_prompts.csv", index=False)
input_path = "sample_tlm_prompts.csv"
output_path = "sample_responses.csv"
df = pd.read_csv(input_path)
df.head()
 | prompt |
---|---|
0 | What is the sum of 1 and 1? |
1 | What is the sum of 1 and 2? |
2 | What is the sum of 1 and 3? |
3 | What is the sum of 1 and 4? |
4 | What is the sum of 1 and 5? |
We can then call the batch_prompt function to run TLM in mini-batches. Note that if this cell fails for any reason, you can just re-execute it and TLM will resume processing your data from the previous checkpoint.
tlm = TLM()
batch_prompt(
    tlm=tlm,
    input_path=input_path,
    output_path=output_path,
    prompt_col_name="prompt",
    batch_size=20
)
After the cell above is done executing, we can view the saved results in the output file:
results = pd.read_csv(output_path)
results.head()
 | response | trustworthiness_score |
---|---|---|
0 | The sum of 1 and 1 is 2. | 0.959896 |
1 | The sum of 1 and 2 is 3. | 0.983522 |
2 | The sum of 1 and 3 is 4. | 0.978253 |
3 | The sum of 1 and 4 is 5. | 0.980250 |
4 | The sum of 1 and 5 is 6. | 0.969275 |
Retrying Failed Queries
When running large batches of prompts, some queries may fail due to issues like temporary network errors or timeouts. As recommended above, you can use the TLM.try_prompt() and TLM.try_get_trustworthiness_score() methods, which handle failed examples by returning a dictionary with detailed error information. By examining the log data in the response, you can efficiently retry only the queries that encountered retryable errors, without reprocessing the successful ones. This section demonstrates how to implement a retry mechanism for failed queries.
For the purposes of this tutorial, we’ll intentionally use a very short timeout when calling TLM to trigger some errors.
tlm = TLM(timeout=0.25)
prompts = [f"What is the sum of 1 and {i}?" for i in range(1, 10)]
res_with_failures = tlm.try_prompt(prompts)
res_with_failures[:5]
We see above that while some queries succeeded, others failed due to timeout errors. Since timeout errors are retryable, we can define the retry_prompt() helper function to retry only the failed prompts and combine the results.
Optional: TLM retry_prompt helper function
We will also define a retry_get_trustworthiness_score() function here, which acts the same way as retry_prompt() but obtains trustworthiness scores for prompt-response pairs.
import numpy as np
def retry_prompt(tlm, prompts, tlm_responses):
    # Identify which prompts failed with a retryable error (per the error log returned by try_prompt)
    failed_idx = [idx for idx, item in enumerate(tlm_responses) if item.get('log', {}).get('error', {}).get('retryable')]
    failed_prompts = np.array(prompts)[failed_idx]

    # Re-run TLM on only the failed prompts, then merge the new results back into the original list
    retry_res = tlm.try_prompt(failed_prompts.tolist())
    tlm_responses_array = np.array(tlm_responses)
    tlm_responses_array[failed_idx] = retry_res
    return tlm_responses_array.tolist()

def retry_get_trustworthiness_score(tlm, prompts, responses, tlm_scores):
    # Same idea as retry_prompt(), but for prompt-response pairs scored via try_get_trustworthiness_score()
    failed_idx = [idx for idx, item in enumerate(tlm_scores) if item.get('log', {}).get('error', {}).get('retryable')]
    failed_prompts = np.array(prompts)[failed_idx]
    failed_responses = np.array(responses)[failed_idx]

    retry_res = tlm.try_get_trustworthiness_score(failed_prompts.tolist(), failed_responses.tolist())
    tlm_scores_array = np.array(tlm_scores)
    tlm_scores_array[failed_idx] = retry_res
    return tlm_scores_array.tolist()
This function takes three inputs:
- tlm: an instantiated TLM object
- prompts: a list of all the original prompts (the same list that was initially passed to TLM.try_prompt())
- tlm_responses: the list of responses from TLM that includes both successful results and error logs, which lets us identify which prompts failed and can be retried
retry_prompt() will only re-execute TLM on the failed prompts, and returns aggregated results that combine the successful responses from the previous TLM.try_prompt() call with the retried responses. Let's try it out:
retry_res = retry_prompt(tlm, prompts, res_with_failures)
retry_res
After retrying, we see that the full list of prompts has succeeded.
However, note that retrying failed queries does not guarantee success. If a prompt continues to fail after a few retry attempts, consider checking your inputs for potential errors or making adjustments to your parameters.
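If you want to retry automatically, you could wrap retry_prompt() in a loop with a capped number of attempts (a minimal sketch; the attempt limit is an illustrative choice):
# Retry with a normally-configured TLM (the 0.25s timeout above was only set to force failures)
tlm_retry = TLM()
max_attempts = 3  # illustrative cap on retry rounds

results = res_with_failures
for attempt in range(max_attempts):
    # Stop as soon as no retryable errors remain
    if not any(item.get("log", {}).get("error", {}).get("retryable") for item in results):
        break
    results = retry_prompt(tlm_retry, prompts, results)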
Automated Handling of Untrustworthy LLM Responses
When TLM flags an LLM response as untrustworthy (low trustworthiness score), strategies to handle it include the following (a minimal sketch of the simplest automated strategies appears after these lists).
Automated strategies:
- Append a warning message such as: “Caution: this response was flagged as potentially untrustworthy”
- Replace your LLM response with a fallback such as: the raw retrieved context or search-results in RAG, or an abstention like: “Sorry I am unsure. Try rephrasing your request, or contact us”
- Replace your LLM response with an alternate response proposed by TLM’s explanation feature (covered below)
- Call the LLM again, providing TLM’s explanation of why the original response is untrustworthy, to see if the LLM can improve its original response (covered below)
- Or just update how you're using TLM in the first place. TLM.prompt() with a higher quality_preset like 'best' will automatically improve responses during generation (see Quality Presets above)
Human-in-the-Loop strategies:
- Escalate the untrustworthy responses for manual review
- Manually review lowest-trust outputs across a dataset to discover insights to improve your LLM prompt templates via prompt engineering techniques.
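Below is a minimal sketch of the first two automated strategies (the threshold value, warning text, and helper name are illustrative assumptions, not part of the TLM API):
TRUST_THRESHOLD = 0.7  # illustrative threshold; tune it on your own data

def handle_untrustworthy_response(output):
    """Apply the first two automated strategies: append a warning, or swap in a fallback."""
    if output["trustworthiness_score"] >= TRUST_THRESHOLD:
        return output["response"]
    # Strategy 1: append a warning message to the flagged response
    warned = output["response"] + "\n\nCaution: this response was flagged as potentially untrustworthy."
    # Strategy 2 (alternative): return a fallback/abstention instead, e.g.:
    # return "Sorry I am unsure. Try rephrasing your request, or contact us."
    return warned

tlm = TLM()
output = tlm.prompt("<your prompt>")
print(handle_untrustworthy_response(output))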
Automated Response Improvement via Alternative Response
TLM’s explanation feature often provides alternative responses that might be more trustworthy (discovered during trust scoring). Below is one way to automatically improve LLM responses based on TLM’s explanation feature.
Note that instead of this code, you can simply call TLM.prompt() using quality_preset='best' for similar automated response-improvements.
user_query = "Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers?"
tlm = TLM(options={"log": ["explanation"]})
# Generate initial response and score its trustworthiness (could instead produce response using your own LLM)
output = tlm.prompt(user_query)
# Variable that will store final response to return to user
final_response = output["response"]
final_score = output["trustworthiness_score"]
# Extract alternative response from explanation (if one exists)
def get_alternative_response(explanation):
    # Simple heuristic: treat the text after the final colon as a proposed alternative response
    if ":" in explanation:
        alt = explanation.split(":")[-1].strip()
        return alt
    return None

# Try to get and evaluate alternative response
alt_response = get_alternative_response(output["log"]["explanation"])
if alt_response:
    try:
        alt_score = tlm.get_trustworthiness_score(
            prompt=user_query,
            response=alt_response
        )
        # Update final values if alternative is better
        if alt_score["trustworthiness_score"] > final_score:
            final_response = alt_response
            final_score = alt_score["trustworthiness_score"]
    except Exception:
        pass  # Keep original response if scoring fails
print(f"Original Response: {output['response']}")
print(f"Original Score: {output['trustworthiness_score']}")
print(f"Final Response: {final_response}")
print(f"Final Score: {final_score}")
Automated Response Improvement via Follow-Up LLM Call
When the first LLM call produces an untrustworthy response, we can automatically try a second LLM call that uses this prompt template:
def generate_improved_response(query, original_response, tlm_explanation):
    improvement_prompt = f"""
## User Question
{query}
## Answer proposed by an AI Assistant
{original_response}
## Flaws identified in the proposed Answer
{tlm_explanation}
## Your task
After reviewing the above information, provide a better alternative answer to the User Question.
If you are unable to identify a better alternative answer, then either respond with the same
proposed Answer above or say: "Sorry I am unsure, try providing more information."
Your answer:
"""
    return tlm.prompt(improvement_prompt)  # You could use your own LLM here instead of TLM
# Use when original response has low trust score
improved_output = None
if output["trustworthiness_score"] < 0.7:  # Configurable threshold
    improved_output = generate_improved_response(
        user_query,
        output["response"],
        output["log"]["explanation"]
    )

print(f"Original Response: {output['response']}")
print(f"Original Score: {output['trustworthiness_score']}")
if improved_output:
    print(f"Improved Response: {improved_output['response']}")
    print(f"Improved Score: {improved_output['trustworthiness_score']}")
else:
    print("No improved response generated")
The above response improvement techniques work whether you generate responses via TLM.prompt() or via your own LLM followed by TLM.get_trustworthiness_score().
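For instance, if you generate responses with your own LLM, the same improvement pattern applies (a minimal sketch reusing the generate_improved_response() helper defined above; generate_with_your_llm() is a hypothetical stand-in for your own LLM call):
def generate_with_your_llm(prompt: str) -> str:
    # Hypothetical placeholder for a call to your own LLM (OpenAI, Anthropic, etc.)
    return "<your LLM's response>"

tlm = TLM(options={"log": ["explanation"]})

user_query = "<your prompt>"
response = generate_with_your_llm(user_query)
output = tlm.get_trustworthiness_score(prompt=user_query, response=response)

if output["trustworthiness_score"] < 0.7:  # illustrative threshold
    improved_output = generate_improved_response(user_query, response, output["log"]["explanation"])
    print(f"Improved Response: {improved_output['response']}")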
Next Steps
- Search for your use-case in our tutorials and cheat sheet to learn how you can best use TLM.
- If you need an additional capability or deployment option, or are unsure how to achieve desired results, just ask: support@cleanlab.ai. We love hearing from users, and are happy to help optimize TLM latency/accuracy for your use-case.