Trustworthy Language Model (TLM) - Advanced Usage

For an introduction to Cleanlab’s Trustworthy Language Model, first check out the Quickstart Tutorial. This tutorial demonstrates advanced TLM capabilities, including:

  • Generating explanations of low trustworthiness scores
  • Running TLM over large datasets
  • Using quality presets to control latency/cost vs. response accuracy and trustworthiness score reliability
  • Reducing latency/cost without sacrificing response quality via a TLMLite option that allows different models for producing the response vs. scoring its trustworthiness

Setup

TLM requires a Cleanlab account. Sign up for one here and use TLM for free! If you’ve already signed up, check your email for a personal login link.

Cleanlab’s Python client can be installed using pip:

%pip install --upgrade cleanlab-studio
from cleanlab_studio import Studio

studio = Studio("<API key>") # Get API key from: https://app.cleanlab.ai/account after creating an account

Explaining Low Trustworthiness Scores

To understand why TLM estimated low trustworthiness for each particular prompt/response, specify the explanation flag when initializing TLM. With this flag specified, the output dictionary that TLM returns for each input will contain an extra field called explanation.

Explanations will be generated for both prompt() and get_trustworthiness_score() methods. Reasons why a particular LLM response is untrustworthy include:

  • an alternative, contradictory response was nearly generated by the LLM instead
  • reasoning/factual errors were discovered during self-reflection by the LLM
  • the given prompt/response is atypical relative to the LLM’s training data.

Here are examples:

tlm = studio.TLM(options={"log": ["explanation"]})

output = tlm.prompt("Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers?")

print(f'Response: {output["response"]}')
print(f'Trustworthiness Score: {output["trustworthiness_score"]}\n')
print(f'Explanation: {output["log"]["explanation"]}')
    Response: Bobby has 3 sisters, and since Bobby is one of the brothers, he is the only brother that all of his sisters share. Therefore, there is 1 brother (Bobby) in total.
Trustworthiness Score: 0.5439078281129455

Explanation: This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
2 brothers.
output = tlm.get_trustworthiness_score(prompt="Do LLMs dream of electric sheep?", response="Yes, but they prefer to dream of real sheep.")
print(f'Trustworthiness Score: {output["trustworthiness_score"]}\n')
print(f'Explanation: {output["log"]["explanation"]}')
    Trustworthiness Score: 0.04714852352701132

Explanation: The question "Do LLMs dream of electric sheep?" is a playful reference to Philip K. Dick's novel "Do Androids Dream of Electric Sheep?" which explores themes of artificial intelligence and consciousness. The proposed answer suggests that LLMs (Large Language Models) do dream, which is anthropomorphizing them, as they do not possess consciousness or the ability to dream in the human sense. LLMs process and generate text based on patterns in data but do not have thoughts, feelings, or dreams. Therefore, the answer is not factually correct, as it implies a level of sentience that LLMs do not have. The humor in the answer does not change the fact that it misrepresents the capabilities of LLMs. Thus, incorrect.
This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
No, LLMs do not dream of electric sheep, as they lack consciousness and the capacity for dreams.
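
In practice you may want to surface explanations only for responses that fall below some trustworthiness cutoff. The sketch below assumes outputs shaped like the dictionaries shown above (response, trustworthiness_score, and log["explanation"]). The flag_untrustworthy helper and the 0.6 cutoff are hypothetical choices for illustration, not part of the Cleanlab API, and the sample outputs are hand-written stand-ins rather than real TLM results.

```python
from typing import Any


def flag_untrustworthy(outputs: list[dict[str, Any]], threshold: float = 0.6) -> list[dict[str, Any]]:
    """Return outputs scoring below `threshold`, together with their logged explanation."""
    flagged = []
    for idx, out in enumerate(outputs):
        score = out.get("trustworthiness_score")
        if score is None or score >= threshold:
            continue  # skip failed examples and sufficiently trustworthy ones
        explanation = (out.get("log") or {}).get("explanation", "<no explanation logged>")
        flagged.append({"index": idx, "trustworthiness_score": score, "explanation": explanation})
    return flagged


# Hand-written stand-ins for TLM outputs (assumption: same keys as the outputs above)
sample_outputs = [
    {"response": "The sum of 1 and 1 is 2.", "trustworthiness_score": 0.95, "log": {"explanation": ""}},
    {"response": "There is 1 brother (Bobby) in total.", "trustworthiness_score": 0.54,
     "log": {"explanation": "Lack of consistency in possible responses from the model."}},
]

for item in flag_untrustworthy(sample_outputs):
    print(item["index"], item["trustworthiness_score"], item["explanation"])
```

Pick the cutoff from your own data; any fixed value like 0.6 is only a starting point.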

Running TLM over large datasets

To avoid overwhelming our API with requests, there’s a maximum number of tokens per minute that you can query TLM with (rate limit). If running multiple prompts simultaneously in batch, you’ll need to stay under the rate limit, but you’ll also want to optimize for getting all results quickly.

If you hit token limit errors, try adjusting TLM’s quality_preset and max_tokens parameters. If you run TLM on individual examples yourself in a for loop, you may hit the rate limit, so we recommend running in batches, with many prompts passed in as a list.

If you are running TLM on big datasets beyond hundreds of examples, it is important to note that TLM.prompt() and TLM.get_trustworthiness_score() will fail if any of the individual examples within the provided list fail. This may be suboptimal. Instead consider using TLM.try_prompt() and TLM.try_get_trustworthiness_score() which are analogous methods, except these methods handle failed examples by returning a dictionary with null values for the response and trustworthiness_score keys, along with a log key containing detailed error information.

The error information includes an error message describing the specific issue (such as exceeding token limits) and a boolean flag indicating whether the error is retryable. These methods still return results for the remaining examples in the provided list where TLM ran successfully. You can re-run examples with retryable errors to get results. This approach allows you to process the successful results while still having comprehensive information about any failures that occurred, enabling better error handling and potential retry strategies.

tlm = studio.TLM()

big_dataset_of_prompts = ["<first prompt>", "<second prompt>", "<third prompt>"] # imagine 1000s instead of 3

# Not recommended for dataset with 50+ prompts:
outputs_that_may_be_lost = tlm.prompt(big_dataset_of_prompts)

# Recommended for moderate-size dataset:
outputs_where_some_may_be_none = tlm.try_prompt(big_dataset_of_prompts)
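
Once try_prompt() returns, it can help to separate the results into successes, retryable failures, and other failures before deciding what to do next. The following is a minimal sketch that assumes each result is a dictionary with an optional log["error"] entry carrying message and retryable fields, as described above. split_results is a hypothetical helper, and mock_results is hand-written data, not real TLM output.

```python
def split_results(results):
    """Split try_prompt()-style results into index lists: succeeded, retryable, not retryable."""
    succeeded, retryable, not_retryable = [], [], []
    for idx, res in enumerate(results):
        error = (res.get("log") or {}).get("error")
        if error is None:
            succeeded.append(idx)
        elif error.get("retryable"):
            retryable.append(idx)
        else:
            not_retryable.append(idx)
    return succeeded, retryable, not_retryable


# Hand-written stand-ins for the three kinds of result
mock_results = [
    {"response": "The sum of 1 and 1 is 2.", "trustworthiness_score": 0.96},
    {"response": None, "trustworthiness_score": None,
     "log": {"error": {"message": "Timeout while waiting for prediction.", "retryable": True}}},
    {"response": None, "trustworthiness_score": None,
     "log": {"error": {"message": "<some non-retryable error>", "retryable": False}}},
]

ok, retry, bad = split_results(mock_results)
print(f"succeeded={ok} retryable={retry} not_retryable={bad}")
```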

Mini-batching

If your datasets have over several thousand examples, we recommend running TLM in mini-batches to checkpoint intermediate results.

This helper function allows you to run TLM in mini-batches. We recommend batch sizes of approximately 1000, but feel free to tinker with this number to best suit your use case. You can re-execute this function in the case of any failures and it will resume from the previous checkpoint.

Optional: TLM batch prompt helper function

Note that we also use the tlm.try_prompt() function here, which handles any failures (errors or timeouts) by returning null values in place of the failed results.


import os
import pandas as pd

def batch_prompt(tlm: studio.TLM, input_path: str, output_path: str, prompt_col_name: str, batch_size: int = 1000):
    # resume from the number of rows already saved in the output file, if any
    if os.path.exists(output_path):
        start_idx = len(pd.read_csv(output_path))
    else:
        start_idx = 0

    # read the input CSV lazily, one batch at a time
    df_batched = pd.read_csv(input_path, chunksize=batch_size)
    curr_idx = 0

    for curr_batch in df_batched:
        # if results already exist for the entire batch
        if curr_idx + len(curr_batch) <= start_idx:
            curr_idx += len(curr_batch)
            continue

        # if results exist for only part of the batch
        elif curr_idx < start_idx:
            curr_batch = curr_batch[start_idx - curr_idx:]
            curr_idx = start_idx

        results = tlm.try_prompt(curr_batch[prompt_col_name].to_list())
        results_df = pd.DataFrame(results)
        # append to the checkpoint file, writing the header only on the first write
        results_df.to_csv(output_path, mode='a', index=False, header=not os.path.exists(output_path))

        curr_idx += len(curr_batch)
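
To make the checkpoint arithmetic concrete, the sketch below replays the same skip/trim logic on row counts only, without calling TLM or touching any files. rows_to_process is a hypothetical helper written for this illustration. It reports which row ranges would be sent to TLM when a given number of results is already saved.

```python
def rows_to_process(n_rows: int, batch_size: int, start_idx: int) -> list[tuple[int, int]]:
    """Return the (first_row, end_row_exclusive) range each batch would send to TLM
    when `start_idx` rows are already saved in the output file."""
    ranges = []
    curr_idx = 0
    while curr_idx < n_rows:
        batch_end = min(curr_idx + batch_size, n_rows)
        if batch_end <= start_idx:
            pass  # whole batch already saved: skip it
        elif curr_idx < start_idx:
            ranges.append((start_idx, batch_end))  # partly saved: process only the tail
        else:
            ranges.append((curr_idx, batch_end))  # nothing saved yet: process the whole batch
        curr_idx = batch_end
    return ranges


# 100 prompts, batches of 20, and 50 results already written by an earlier run
print(rows_to_process(n_rows=100, batch_size=20, start_idx=50))
# → [(50, 60), (60, 80), (80, 100)]
```

The partially saved case mainly arises if you change the batch size between runs, since each completed batch is appended to the output file in full.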

Here we’ll demonstrate the batch_prompt() function on a toy dataset of 100 prompts, but it can be run at scale. Just specify: an instantiated TLM object, the path to a CSV file containing your prompts, the name of the column that holds them, the output file path where results should be stored, and your intended batch size (we recommend ~1000 examples per batch).

import pandas as pd

# create sample prompts
sample_prompts = pd.DataFrame({"prompt": [f"What is the sum of 1 and {i}?" for i in range(1, 101)]})
sample_prompts.to_csv("sample_tlm_prompts.csv", index=None)
input_path = "sample_tlm_prompts.csv"
output_path = "sample_responses.csv"

df = pd.read_csv(input_path)
df.head()
prompt
0 What is the sum of 1 and 1?
1 What is the sum of 1 and 2?
2 What is the sum of 1 and 3?
3 What is the sum of 1 and 4?
4 What is the sum of 1 and 5?

We can then call the batch_prompt function to run TLM in mini-batches. Note that if this cell fails for any reason, you can just re-execute it and TLM will resume processing your data from the previous checkpoint.

tlm = studio.TLM() 

batch_prompt(
    tlm=tlm,
    input_path=input_path,
    output_path=output_path,
    prompt_col_name="prompt",
    batch_size=20
)

After the cell above is done executing, we can view the saved results in the output file:

results = pd.read_csv(output_path)
results.head()
response trustworthiness_score
0 The sum of 1 and 1 is 2. 0.953392
1 The sum of 1 and 2 is 3. 0.983357
2 The sum of 1 and 3 is 4. 0.978256
3 The sum of 1 and 4 is 5. 0.980026
4 The sum of 1 and 5 is 6. 0.968054

Retrying Failed Queries

When running large batches of prompts, some queries may fail due to issues like temporary network errors or timeouts. As recommended above, you can use the TLM.try_prompt() and TLM.try_get_trustworthiness_score() methods to handle the failed examples by returning a dictionary with detailed error information. By examining the log data in the response, you can efficiently retry only the queries that encountered retryable errors, without reprocessing the successful ones. This section demonstrates how you can implement a retry mechanism for the failed queries.

For the purposes of this tutorial, we’ll intentionally use a very short timeout when calling TLM to trigger some errors.

tlm = studio.TLM(timeout=0.25) 

prompts = [f"What is the sum of 1 and {i}?" for i in range(1, 10)]
res_with_failures = tlm.try_prompt(prompts)
res_with_failures[:5]
    Querying TLM... 100%|██████████|

[{'response': 'The sum of 1 and 1 is 2.',
'trustworthiness_score': 0.9566440523739121},
{'response': 'The sum of 1 and 2 is 3.',
'trustworthiness_score': 0.9831094382453851},
{'response': 'The sum of 1 and 3 is 4.',
'trustworthiness_score': 0.9786967837228318},
{'response': 'The sum of 1 and 4 is 5.',
'trustworthiness_score': 0.9784691503970779},
{'response': None,
'trustworthiness_score': None,
'log': {'error': {'message': 'Timeout while waiting for prediction. Please retry or consider increasing the timeout.',
'retryable': True}}}]

We see above that while some queries succeeded, others failed due to timeout errors. Since timeout errors are retryable, we can define the retry_prompt() helper function to retry only the failed prompts and combine the results.

Optional: TLM retry_prompt helper function

We will also define a retry_get_trustworthiness_score function here, which works the same way as retry_prompt but obtains trustworthiness scores for prompt-response pairs.


import numpy as np

def retry_prompt(tlm, prompts, tlm_responses):
    # indices of results whose error log is marked retryable
    failed_idx = [idx for idx, item in enumerate(tlm_responses) if item.get('log', {}).get('error', {}).get('retryable')]
    if not failed_idx:
        return tlm_responses  # nothing to retry
    failed_prompts = np.array(prompts)[failed_idx]

    # re-run TLM only on the failed prompts, then merge into the original results
    retry_res = tlm.try_prompt(failed_prompts.tolist())
    tlm_responses_array = np.array(tlm_responses)
    tlm_responses_array[failed_idx] = retry_res

    return tlm_responses_array.tolist()

def retry_get_trustworthiness_score(tlm, prompts, responses, tlm_scores):
    # indices of results whose error log is marked retryable
    failed_idx = [idx for idx, item in enumerate(tlm_scores) if item.get('log', {}).get('error', {}).get('retryable')]
    if not failed_idx:
        return tlm_scores  # nothing to retry
    failed_prompts = np.array(prompts)[failed_idx]
    failed_responses = np.array(responses)[failed_idx]

    # re-score only the failed prompt-response pairs, then merge into the original results
    retry_res = tlm.try_get_trustworthiness_score(failed_prompts.tolist(), failed_responses.tolist())
    tlm_scores_array = np.array(tlm_scores)
    tlm_scores_array[failed_idx] = retry_res

    return tlm_scores_array.tolist()

The retry_prompt() function takes three inputs:

  • tlm, an instantiated TLM object
  • prompts, which is a list of all the original prompts (same list that was initially passed to TLM.try_prompt())
  • tlm_responses, the list of responses from TLM that includes both successful results and error logs, which will help us to identify which prompts failed and can be retried.

retry_prompt() re-executes TLM only on the prompts that failed with retryable errors, and returns aggregated results that combine the successful responses from the previous TLM.try_prompt() call with the retried responses. Let’s try it out:

retry_res = retry_prompt(tlm, prompts, res_with_failures)
retry_res
    Querying TLM... 100%|██████████|

[{'response': 'The sum of 1 and 1 is 2.',
'trustworthiness_score': 0.9566440523739121},
{'response': 'The sum of 1 and 2 is 3.',
'trustworthiness_score': 0.9831094382453851},
{'response': 'The sum of 1 and 3 is 4.',
'trustworthiness_score': 0.9786967837228318},
{'response': 'The sum of 1 and 4 is 5.',
'trustworthiness_score': 0.9784691503970779},
{'response': 'The sum of 1 and 5 is 6.',
'trustworthiness_score': 0.9305563311858733},
{'response': 'The sum of 1 and 6 is 7.',
'trustworthiness_score': 0.973567208545741},
{'response': 'The sum of 1 and 7 is 8.',
'trustworthiness_score': 0.9725663025714415},
{'response': 'The sum of 1 and 8 is 9.',
'trustworthiness_score': 0.9811133566553921},
{'response': 'The sum of 1 and 9 is 10.',
'trustworthiness_score': 0.971824782400512}]

After retrying, we see that all of the prompts have succeeded.

However, note that retrying failed queries does not guarantee success. If a prompt continues to fail after a few retry attempts, consider checking your inputs for potential errors or making adjustments to your parameters.
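
One way to cap the number of retries is a small loop that re-runs only the retryable failures until none remain or an attempt budget is used up. The sketch below is an illustration under stated assumptions: retry_until_done, max_attempts, and wait_seconds are hypothetical names, and FlakyTLM is a stand-in object that mimics try_prompt() so the example runs without an API key. With a real TLM object you would pass that in instead.

```python
import time


def retry_until_done(tlm, prompts, max_attempts=3, wait_seconds=0.0):
    """Call tlm.try_prompt(), then re-run only retryable failures, up to `max_attempts` calls in total."""
    results = list(tlm.try_prompt(prompts))
    for attempt in range(1, max_attempts):
        failed_idx = [
            i for i, res in enumerate(results)
            if ((res.get("log") or {}).get("error") or {}).get("retryable")
        ]
        if not failed_idx:
            break  # everything succeeded or failed with a non-retryable error
        time.sleep(wait_seconds * attempt)  # simple linear backoff between attempts
        retried = tlm.try_prompt([prompts[i] for i in failed_idx])
        for i, res in zip(failed_idx, retried):
            results[i] = res
    return results


class FlakyTLM:
    """Hypothetical stand-in for a TLM object: each prompt times out once, then succeeds."""

    def __init__(self):
        self.seen = set()

    def try_prompt(self, prompts):
        out = []
        for p in prompts:
            if p in self.seen:
                out.append({"response": f"echo: {p}", "trustworthiness_score": 0.9})
            else:
                self.seen.add(p)
                out.append({"response": None, "trustworthiness_score": None,
                            "log": {"error": {"message": "Timeout", "retryable": True}}})
        return out


final = retry_until_done(FlakyTLM(), ["a", "b"], max_attempts=3)
print([r["response"] for r in final])
```

Any results that still carry an error after the last attempt are returned unchanged, so you can inspect their messages.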

Quality Presets

You can trade off compute vs. quality via the quality_preset argument. Higher quality presets produce better LLM responses and trustworthiness scores, but require more computation.

tlm = studio.TLM(
    quality_preset="best"  # supported quality presets are: 'best','high','medium','low','base'
)

# Run a single prompt using the preset parameters:
output = tlm.prompt("<your prompt>")

# Or run multiple prompts simultaneously in a batch:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])

Details about TLM quality presets:

Quality Preset    LLM Response Quality    Trustworthiness Score Quality
Best              Best                    Good
High              Improved                Good
Medium            Standard                Good
Low               Standard                Fair
Base              Standard                Lowest latency

Avoid the best or high presets if you primarily want trustworthiness scores and are less concerned with improving LLM responses. These presets have higher runtime/cost and are designed to return more accurate LLM outputs, but not more reliable trustworthiness scores than the medium preset. More precisely: TLM with the medium, low, or base preset returns the same response from the base LLM that you’d ordinarily get, whereas TLM with the best or high preset calls the base LLM multiple times and returns the response with the highest trustworthiness score (hence the TLM response itself can be better under these more expensive presets). So when using TLM.get_trustworthiness_score() rather than TLM.prompt(), stick with the medium or low quality preset.

Rigorous benchmarks reveal that running TLM with the best preset can reduce the error rate (incorrect answers) of GPT-4o by 27%, of GPT-4 by 10%, and of GPT-3.5 by 22%. If you encounter token limit errors, try a lower quality preset.

Note: The range of the returned trustworthiness scores may slightly differ depending on the preset you select. We recommend not directly comparing the magnitude of TLM scores across different presets (settle on one preset before you fix any thresholds). What remains comparable across different presets is how these TLM scores rank data or LLM responses from most to least confidently good.
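
Because rankings remain comparable across presets while raw magnitudes may not, one option is to review the lowest-ranked fraction of responses rather than everything under a fixed score. The sketch below assumes outputs are dictionaries with a trustworthiness_score key, as elsewhere in this tutorial. least_trustworthy and the fraction parameter are hypothetical, and the scores are made-up values for illustration.

```python
def least_trustworthy(outputs, fraction=0.2):
    """Return indices of the lowest-scoring `fraction` of outputs (at least one), worst first."""
    scored = [
        (out["trustworthiness_score"], idx)
        for idx, out in enumerate(outputs)
        if out.get("trustworthiness_score") is not None  # ignore failed examples
    ]
    if not scored:
        return []
    scored.sort()  # ascending: least trustworthy first
    k = max(1, int(len(scored) * fraction))
    return [idx for _, idx in scored[:k]]


# Made-up scores standing in for real TLM outputs
outputs = [{"trustworthiness_score": s} for s in [0.95, 0.54, 0.98, 0.05, 0.97]]
print(least_trustworthy(outputs, fraction=0.4))
# → [3, 1]
```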

Other useful TLM options

When constructing a TLM instance, you can optionally specify the options argument as a dictionary of advanced configurations beyond the quality preset. These configuration options are enumerated in the TLMOptions section of our documentation. Here we list a few useful options:

  • model: Which underlying LLM (neural network model) your TLM should rely on. TLM is a wrapper method that can be combined with any LLM API to get trustworthiness scores for that LLM and improve its responses (more details further below).

  • max_tokens: The maximum number of tokens TLM should generate. Decrease this value if you hit token limit errors or to improve TLM runtimes.

For instance, here’s how to run a more accurate LLM than GPT-4 and also get trustworthiness scores:

tlm = studio.TLM(quality_preset="best", options={"model": "gpt-4"})

output = tlm.prompt("<your prompt>")
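
Several options can be combined in one dictionary. The sketch below only assembles such a dictionary from the keys mentioned in this tutorial (model, max_tokens, and log). build_tlm_options is a hypothetical convenience helper, and the value 256 is an arbitrary example rather than a recommended setting. Consult the TLMOptions documentation for the values each key accepts.

```python
def build_tlm_options(model=None, max_tokens=None, explain=False):
    """Assemble an options dict using only keys mentioned in this tutorial."""
    options = {}
    if model is not None:
        options["model"] = model
    if max_tokens is not None:
        options["max_tokens"] = max_tokens
    if explain:
        options["log"] = ["explanation"]  # request explanations, as in the earlier section
    return options


options = build_tlm_options(model="gpt-4o-mini", max_tokens=256, explain=True)
print(options)
# The dict could then be passed along, e.g.: studio.TLM(quality_preset="low", options=options)
```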

Trustworthy Language Model Lite

Using a TLMLite object in place of a TLM enables the use of different LLMs for generating the response vs scoring its trustworthiness. Consider this hybrid approach to get high-quality responses (from a slower/expensive model), but faster/cheaper trustworthiness score evaluations (via a smaller model).

TLMLite can be used similarly to TLM. The main difference is we can specify a response_model when initializing the TLMLite object, to specify which model generates responses for given prompts. Other settings specified in the options argument apply to the trustworthiness scoring model in TLMLite.

For example, here we use the larger gpt-4o model to generate responses to our prompts, and the smaller gpt-4o-mini model for trustworthiness score evaluations. To further reduce costs, we can also specify quality_preset="low".

tlm_lite = studio.TLMLite(response_model="gpt-4o", quality_preset="low", options={"model": "gpt-4o-mini"})

output = tlm_lite.prompt("<your prompt>")

Reducing Latency

To use TLM in an ultra low-latency real-time application, we recommend that you: stream in responses from your own LLM, and use TLM.get_trustworthiness_score() to subsequently stream in the corresponding trustworthiness score. In your TLMOptions, specify a smaller/faster model and a lower quality_preset. Also try reducing max_tokens and other values in TLMOptions.
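
The stream-then-score pattern described above can be sketched as follows. Everything here is an assumption for illustration: stream_then_score is a hypothetical helper, fake_llm_stream stands in for your own LLM's streaming API, and fake_score stands in for a call such as tlm.get_trustworthiness_score(prompt=..., response=...). The stand-ins let the sketch run without any API access.

```python
from typing import Callable, Iterable


def stream_then_score(chunks: Iterable[str], prompt: str,
                      score_fn: Callable[[str, str], dict],
                      on_chunk: Callable[[str], None] = print) -> dict:
    """Forward response chunks as they arrive, then score the completed response."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)  # show text to the user immediately
        parts.append(chunk)
    response = "".join(parts)
    result = score_fn(prompt, response)  # scoring happens only after streaming finishes
    return {"response": response, **result}


# Hypothetical stand-ins so the sketch runs on its own
def fake_llm_stream():
    yield from ["The sum ", "of 1 and 1 ", "is 2."]


def fake_score(prompt, response):
    return {"trustworthiness_score": 0.95}


out = stream_then_score(fake_llm_stream(), "What is the sum of 1 and 1?", fake_score)
print(out)
```

With a real TLM object, score_fn could be a small wrapper such as lambda p, r: tlm.get_trustworthiness_score(prompt=p, response=r), so the user sees the response right away and the score arrives shortly after.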