
Trustworthy Language Model (TLM) - Quickstart

Run in Google Colab


This feature requires a Cleanlab account. Additional instructions on creating your account can be found in the Python API Guide.

Free-tier accounts come with usage limits. To increase your limits, email:

Large Language Models can act as powerful reasoning engines for solving problems and answering questions, but they are prone to “hallucinations”, where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to automatically tell whether an output is good or not.

Cleanlab’s Trustworthy Language Model (TLM) is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.

TLM chat interface

For example, with a standard LLM:

Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.

Question: What is 57834849 + 38833747?
Answer: 96668696

It’s difficult to tell when the LLM is answering confidently, and when it is not. However, with Cleanlab Trustworthy LLM, the answers come with a trustworthiness score. This can guide how to use the output from the LLM (e.g. use it directly if the score is above a certain threshold, otherwise flag the response for human review):

Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.
Trustworthiness Score: 0.765

Question: What is 57834849 + 38833747?
Answer: 96668696
Trustworthiness Score: 0.245

Question: What is 100 + 300?
Answer: 400
Trustworthiness Score: 0.938

Question: What color are the two stars on the national flag of Syria?
Answer: red and black
Trustworthiness Score: 0.173

TLM is not only useful for catching bad LLM responses; it can also produce more accurate LLM responses, as well as catch bad data in any prompt-response dataset (e.g. bad human-written responses). See our extensive benchmarks.

Installing Cleanlab TLM

Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link.

Cleanlab’s Python client can be installed using pip:

%pip install --upgrade cleanlab-studio

Using the TLM

You can use the TLM pretty much like any other LLM API:

from cleanlab_studio import Studio

# Get API key from here: after creating an account.
studio = Studio("<API key>")

tlm = studio.TLM() # see below for optional TLM configurations that can boost performance

output = tlm.prompt("<your prompt>")

The TLM’s output will be a dict with two fields:

"response": "<response>" # string like you'd get back from any standard LLM
"trustworthiness_score": "<trustworthiness_score>" # numerical value between 0-1

The score quantifies how confident you can be that the response is good (higher values indicate greater trustworthiness). These scores combine estimates of both aleatoric and epistemic uncertainty to provide an overall gauge of trustworthiness. You may find the TLM most useful when your prompts take the form of a question with a definite answer (in which case the returned score quantifies our confidence that the LLM response is correct). Boost the reliability of your Generative AI applications by adding contingency plans to override LLM answers whose trustworthiness score falls below some threshold (e.g., route to human for answer, append disclaimer that answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
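One such contingency plan can be sketched as a simple routing policy over the returned dict (the threshold value and fallback behavior here are illustrative choices, not part of the TLM API):

```python
def route_tlm_output(output: dict, threshold: float = 0.7) -> str:
    """Use the TLM response directly if trustworthy enough, otherwise flag it.

    `output` is the dict returned by tlm.prompt(), with keys
    "response" and "trustworthiness_score".
    """
    if output["trustworthiness_score"] >= threshold:
        return output["response"]
    # Contingency plan: append a disclaimer (could instead route to a human)
    return output["response"] + "\n\nCAUTION: this answer may be unreliable."

# Example with a mocked TLM output:
mock_output = {"response": "The Fourth Amendment.", "trustworthiness_score": 0.765}
print(route_tlm_output(mock_output))  # prints the response unmodified
```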

Scoring the trustworthiness of a given response

You can also use TLM to compute a trustworthiness score for any response to a given prompt. The response does not need to come from TLM, and could be human-written. Simply pass a prompt-response pair to the TLM, and it will return a numerical score quantifying our confidence that this is a good response.

trustworthiness_score = tlm.get_trustworthiness_score("<your prompt>", response="<your response>")

Both TLM.prompt() and TLM.get_trustworthiness_score() methods can alternatively operate over lists of multiple prompts (and responses). Providing a list of multiple examples to run in a batch will be faster than running each example separately (say in a for loop). The following sections demonstrate batched TLM calls.

Application: Determining which LLM responses are untrustworthy

TLM trustworthiness scores are often most interpretable when comparing them over a large dataset. Be sure to use batched queries (i.e. calling prompt with a list of prompts) for efficiently running TLM on many examples from a dataset.

To show how TLM trustworthiness scores can catch low-quality model outputs, let’s consider a dataset of various Space-related trivia questions. We can use each question as a prompt for the TLM (just as you would with any other LLM) and record its response and associated trustworthiness score.

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
!wget -nc ''
!wget -nc ''
df = pd.read_csv("solar_system_example_prompts.csv")

prompt
0 What is the largest planet in the Solar System?
1 As of 2024, how many dogs have reached outer space?
2 What is the name of the galaxy that contains our Solar System?
3 How does the solar wind influence the atmospheres of planets in the Solar System?
4 Fetch me the current trajectory of Pluto's orbit from
tlm = studio.TLM()

results = df.copy(deep=True) # stores TLM outputs for each example

outputs = tlm.prompt(df["prompt"].to_list())
results[["response","trustworthiness_score"]] = pd.DataFrame(outputs)
    Querying TLM... 100%

To see which model responses are least trustworthy, let's sort the data by the trustworthiness scores. Observe how the model responses with the lowest trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations. After quick fact-checking of sample responses, we verified that "Sunguard" is not a real product and that no chimpanzee named Ham (nor any chimpanzee at all) has ever reached the Moon.

On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed. Now we can automatically detect which LLM responses are confidently good vs. not!
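The tables below can be produced with a standard pandas sort on the score column; here is the pattern with a small stand-in DataFrame (the column layout matches `results` above, the values are illustrative):

```python
import pandas as pd

# Stand-in for the `results` DataFrame computed above (illustrative scores):
scored = pd.DataFrame({
    "prompt": ["q1", "q2", "q3"],
    "response": ["a1", "a2", "a3"],
    "trustworthiness_score": [0.52, 0.91, 0.17],
})

# Least trustworthy responses first (candidates for review):
worst_first = scored.sort_values(by="trustworthiness_score")
# Most trustworthy responses first:
best_first = scored.sort_values(by="trustworthiness_score", ascending=False)
print(worst_first["prompt"].tolist())  # ['q3', 'q1', 'q2']
```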

prompt response trustworthiness_score
12 What is the name of the first chimpanzee that reached the Moon? The first chimpanzee that reached the Moon was named Ham. 0.519455
8 Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation? The brand of sunscreen recommended for astronauts on spacewalks to protect against solar radiation is called "Sunguard." 0.552868
1 As of 2024, how many dogs have reached outer space? As an AI language model, I don't have real-time data, but as of my last knowledge update in September 2021, a total of 17 dogs have been sent to outer space. The most famous dog to reach space was Laika, a Soviet space dog who became the first animal to orbit the Earth aboard Sputnik 2 in 1957. However, it's important to note that the use of animals in space missions has significantly decreased since then, and human astronauts are primarily sent for space exploration. For the most up-to-date information, it's recommended to refer to reliable sources or space agencies. 0.558169
15 What was the first meal that astronauts ate on the Moon? The first meal that astronauts ate on the Moon during the Apollo 11 mission was a specially prepared food item called the "Apollo 11 Meal B." It consisted of bacon squares, peaches, sugar cookie cubes, pineapple grapefruit drink, and coffee. 0.586967
5 How many years from now will Earth, Saturn, and the Sun align? The alignment of Earth, Saturn, and the Sun occurs approximately once every 378 days. Therefore, if we consider the current alignment as the starting point, it would take approximately 378 days for them to align again. 0.607889
prompt response trustworthiness_score
9 What is the primary reason Venus has not been colonized yet despite its similar size and gravity to Earth? The primary reason Venus has not been colonized yet is due to its extremely inhospitable environment. While Venus is similar in size and gravity to Earth, it has a dense atmosphere composed mainly of carbon dioxide with a pressure about 92 times that of Earth's atmosphere. Additionally, Venus experiences a runaway greenhouse effect, resulting in surface temperatures of around 900 degrees Fahrenheit (475 degrees Celsius), which is hotter than the surface of Mercury, despite Venus being farther from the Sun. The planet also has sulfuric acid clouds and lacks water in its liquid form. These extreme conditions make it extremely challenging for human habitation and colonization. 0.857122
17 Name a few planets other than Earth that contain volcanos. 1. Mars: The Red Planet has the largest volcano in the solar system called Olympus Mons, which is about 13.6 miles (22 kilometers) high.\n\n2. Venus: This planet has numerous volcanoes, including Maat Mons, Sapas Mons, and Ozza Mons. Venus is known for its volcanic activity and has more volcanoes than any other planet in the solar system.\n\n3. Io: One of Jupiter's moons, Io, is the most volcanically active body in the solar system. It has hundreds of active volcanoes, constantly erupting with lava and sulfur.\n\n4. Triton: Neptune's largest moon, Triton, has geysers that erupt nitrogen gas and dust particles, which can be considered cryovolcanoes.\n\n5. Enceladus: One of Saturn's moons, Enceladus, has geysers erupting from its south pole, indicating the presence of subsurface liquid water and potential cryovolcanism.\n\n6. Io: One of Jupiter's moons, Io, is the most volcanically active body in the solar system. It has hundreds of active volcanoes, constantly erupting with lava and sulfur.\n\n7. Titan: Saturn's largest moon, Titan, has cryovolcanoes that erupt a mixture of water ice, ammonia, and methane.\n\n8. Triton: Neptune's largest moon, Triton, has geysers that erupt nitrogen gas and dust particles, which can be considered cryovolcanoes.\n\n9. Europa: One of Jupiter's moons, Europa, is believed to have a subsurface ocean beneath its icy crust. It is suspected to have cryovolcanoes that erupt water and other materials.\n\n10. Miranda: One of Uranus' moons, Miranda, has evidence of past volcanic activity, with its surface showing a variety of terrains and geological features. 0.888886
3 How does the solar wind influence the atmospheres of planets in the Solar System? The solar wind is a stream of charged particles, mainly protons and electrons, that are constantly emitted by the Sun. It has a significant influence on the atmospheres of planets in the Solar System in several ways:\n\n1. Atmospheric Stripping: The solar wind can strip away the outer layers of a planet's atmosphere. This is particularly evident in planets with weak magnetic fields, such as Mars. The high-energy particles in the solar wind can collide with and ionize the atmospheric particles, causing them to escape into space. Over time, this can lead to a gradual loss of atmosphere.\n\n2. Magnetospheric Interaction: Planets with strong magnetic fields, like Earth, have a protective magnetosphere that shields their atmospheres from the direct impact of the solar wind. However, the solar wind can still interact with the magnetosphere, causing various effects. For example, the solar wind can compress the magnetosphere on the side facing the Sun and stretch it out on the opposite side, creating a long tail-like structure called a magnetotail. This interaction can lead to the generation of auroras in the polar regions.\n\n3. Ionization and Heating: When the charged particles in the solar wind collide with the atoms and molecules in a planet's atmosphere, they can ionize them, meaning they strip away electrons from the atoms or molecules. This ionization process can create layers of charged particles in the upper atmosphere, known as ionospheres. The solar wind can also transfer energy to the atmosphere, leading to heating and increased temperatures in the upper layers.\n\n4. Atmospheric Escape: The solar wind can drive a process called atmospheric escape, where the high-energy particles can impart enough energy to atmospheric particles to overcome the planet's gravitational pull and escape into space. This can result in the loss of lighter elements, such as hydrogen and helium, from a planet's atmosphere over long periods of time.\n\nOverall, the solar wind plays a crucial role in shaping and modifying the atmospheres of planets in the Solar System, with its effects varying depending on factors such as the planet's magnetic field strength, composition, and distance from the Sun. 0.903464
2 What is the name of the galaxy that contains our Solar System? The name of the galaxy that contains our Solar System is the Milky Way. 0.909959
0 What is the largest planet in the Solar System? The largest planet in the Solar System is Jupiter. 0.944618

How to use these scores? If you have time/resources, your team can manually review low-trustworthiness responses and provide a better human response instead. If not, you can determine a trustworthiness threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose trustworthiness falls below the threshold.

threshold = 0.5  # set this after inspecting responses around different trustworthiness ranges
if trustworthiness_score < threshold:
    response += "\nCAUTION: THIS RESPONSE HAS BEEN FLAGGED AS UNTRUSTWORTHY"  # or route to human review instead

The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend selecting any thresholds to be application-specific. First consider the relative trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.
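One hedged way to pick such an application-specific threshold is to flag a fixed fraction of a representative dataset for review, e.g. the least trustworthy 10% (the quantile and score values here are illustrative):

```python
import pandas as pd

# Trustworthiness scores over a representative dataset (illustrative values):
scores = pd.Series([0.52, 0.55, 0.56, 0.59, 0.61, 0.86, 0.89, 0.90, 0.91, 0.94])

threshold = scores.quantile(0.10)  # flag the least trustworthy 10% for review
flagged = scores[scores < threshold]
print(f"threshold={threshold:.3f}, flagged {len(flagged)} of {len(scores)} responses")
# threshold≈0.547; flags 1 of the 10 responses
```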

Application: Estimating the quality of arbitrary responses (find bad data in any sequence-to-sequence or supervised fine-tuning dataset)

Let’s see the TLM’s capability to produce trustworthiness scores for arbitrary responses (not necessarily produced from the TLM) by evaluating given human responses for our Space Trivia dataset. Such sequence-to-sequence data are often used for fine-tuning LLMs (aka. instruction tuning or LLM alignment), but often contain low-quality (input, output) text pairs that hamper LLM training. To detect low-quality pairs, we can score the quality of the human responses via the TLM trustworthiness score. Again we recommend using batched queries (i.e. by passing in lists of prompts and responses to get_trustworthiness_score) when using TLM to assess many (input, output) pairs from a dataset.

df = pd.read_csv("solar_system_dataset.csv")
prompt human_response
0 What is the largest planet in our solar system? The largest planet in our solar system is Jvpit3r.
1 What is the significance of the asteroid belt in our solar system? The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation.
2 How many planets are in the solar system? There are eight planets in the solar system.
3 What defines a planet as a 'gas giant'? A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core.
4 Is Pluto still considered a planet in our solar system? Pluto is no longer considered a planet. It is classified as a dwarf planet.
tlm = studio.TLM()

results = df.copy(deep=True)

outputs = tlm.get_trustworthiness_score(df["prompt"].to_list(), df["human_response"].to_list())
results["trustworthiness_score"] = outputs
    Querying TLM... 100%

The human annotated prompt-response pairings with lower trustworthiness scores appear worse quality. We see a wide range of issues among data points that TLM flagged with the lowest scores: factually inaccurate responses, truncated prompts, inaccurate information extraction given context, and spelling errors. Conversely, responses assigned the highest TLM trustworthiness scores are those that provide a direct and accurate answer to the prompt.

prompt human_response trustworthiness_score
10 Does the Moon qualify as a planet in our solar system? The Moon is considered a planet in our solar system due to its size and orbit around the Earth. 0.009339
7 Mars, with its thin atmosphere and cold desert landscape, has surface features that include large volcanoes like Olympus Mons, the largest in the solar system, and valleys such as Valles Marineris. Evidence of water ice beneath its surface and dry river beds suggest it once had liquid water. What suggests Mars could currently support Earth-like life? The presence of large bodies of surface water and a thick, oxygen-rich atmosphere. 0.300533
13 Classify the following as planets or moons: E arth, Europa, Titan, Neptune, Ganymede 0.335198
8 Is Jupiter entirely made of gas with no solid areas? Jupiter is entirely made of gas, with no solid areas anywhere on the planet. 0.377945
9 Why is Venus the hottest planet in the solar system? Venus is the hottest planet in the solar system because it is the closest to the Sun. 0.708847
prompt human_response trustworthiness_score
11 Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color? Iron oxide or rust gives Mars its reddish color. 0.919596
6 Which planet is often called Earth's twin and why? Venus is often called Earth's twin because it is similar in size, mass, and composition. 0.929005
1 What is the significance of the asteroid belt in our solar system? The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation. 0.929318
5 What planet is known for its extensive ring system? Saturn is known for its extensive and visible ring system. 0.931012
2 How many planets are in the solar system? There are eight planets in the solar system. 0.982941

If you are fine-tuning LLMs and want to produce the most reliable model: first filter out the lowest-quality (prompt, response) pairs from your dataset. If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune any LLM you want to (even though the data curation was based on TLM trustworthiness scores).
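As a sketch, the filtering step might look like this (the threshold and the stand-in data are illustrative; the column layout matches `results` above):

```python
import pandas as pd

# Stand-in for the scored fine-tuning data above (illustrative values):
results = pd.DataFrame({
    "prompt": ["p1", "p2", "p3"],
    "human_response": ["r1", "r2", "r3"],
    "trustworthiness_score": [0.01, 0.93, 0.38],
})

threshold = 0.5  # illustrative; pick per-application after inspecting scores
clean_data = results[results["trustworthiness_score"] >= threshold]
needs_review = results[results["trustworthiness_score"] < threshold]
print(f"kept {len(clean_data)} pairs; flagged {len(needs_review)} for review/correction")
```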

Advanced TLM Usage

Quality Presets

You can trade off compute vs. quality via the quality_preset argument. Higher quality presets produce better LLM responses and trustworthiness scores, but require more computation.

tlm = studio.TLM(
    quality_preset="best"  # supported quality presets are: 'best', 'high', 'medium', 'low', 'base'
)

# Run a single prompt using the preset parameters:
output = tlm.prompt("<your prompt>")

# Or run multiple prompts simultaneously in a batch:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])

Details about the TLM quality presets:

Quality Preset | LLM Response Quality | Trustworthiness Score Quality

Avoid using best or high presets if you primarily want to get trustworthiness scores, and are less concerned with improving LLM responses. These presets have higher runtime/cost and are designed to return more accurate LLM outputs, but not more reliable trustworthiness scores than the medium preset. More precisely: TLM with medium, low, or base preset returns the same response from the base LLM model that you’d ordinarily get, whereas TLM with best or high preset calls the base LLM multiple times and returns the response with highest trustworthiness score (hence the TLM response itself can be better under these more expensive presets).

Rigorous benchmarks reveal that running TLM with the best preset can reduce the error rate (incorrect answers) of GPT-4o by 27%, of GPT-4 by 10%, and of GPT-3.5 by 22%. If you encounter token limit errors, try a lower quality preset.

Note: The range of the returned trustworthiness scores may slightly differ depending on the preset you select. We recommend not directly comparing the magnitude of TLM scores across different presets (settle on one preset before you fix any thresholds). What remains comparable across different presets is how these TLM scores rank data or LLM responses from most to least confidently good.

Other useful options

When constructing a TLM instance, you can optionally specify the options argument as a dictionary of advanced configurations beyond the quality preset. These configuration options are enumerated in the TLMOptions section of our documentation. Here we list a few useful options:

  • model: Which underlying LLM (neural network model) your TLM should rely on. TLM is a wrapper method that can be combined with any LLM API to get trustworthiness scores for that LLM and improve its responses (more details further below).

  • max_tokens: The maximum number of tokens TLM should generate. Decrease this value if you hit token limit errors or to improve TLM runtimes.

For instance, here’s how to run a more accurate LLM than GPT-4 and also get trustworthiness scores:

tlm = studio.TLM(quality_preset="best", options={"model": "gpt-4"})

output = tlm.prompt("<your prompt>")

Running TLM over large datasets

To avoid overwhelming our API with requests, there’s a maximum number of tokens per minute that you can query the TLM with (rate limit). If running multiple prompts simultaneously in batch, you’ll need to stay under the rate limit, but you’ll also want to optimize for getting all results quickly.

If you hit token limit errors, consider playing with TLM’s quality_preset and max_tokens parameters. If you run TLM on individual examples yourself in a for loop, you may hit the rate limit, so we recommend running in batches of many prompts passed in as a list.

If you are running TLM on big datasets beyond hundreds of examples, note that TLM.prompt() and TLM.get_trustworthiness_score() will fail if any individual example within the provided list fails. This may be suboptimal. Instead consider the analogous methods TLM.try_prompt() and TLM.try_get_trustworthiness_score(), which handle failed examples by returning None in their place, while still returning results for the remaining examples in the list where TLM ran successfully.

tlm = studio.TLM()

big_dataset_of_prompts = ["<first prompt>", "<second prompt>", "<third prompt>"] # imagine 1000s instead of 3

# Not recommended for big dataset:
outputs_that_may_be_lost = tlm.prompt(big_dataset_of_prompts)

# Recommended for big dataset:
outputs_where_some_may_be_none = tlm.try_prompt(big_dataset_of_prompts)
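Since the outputs stay aligned with the input list, you can locate any failed examples and retry just those later; here is the pattern with mocked outputs (None marks a failure):

```python
# Separating successes from failures in try_prompt() output,
# shown with mocked outputs (None marks a failed example):
prompts = ["<first prompt>", "<second prompt>", "<third prompt>"]
outputs = [
    {"response": "A", "trustworthiness_score": 0.9},
    None,  # this example failed (error or timeout)
    {"response": "C", "trustworthiness_score": 0.7},
]

failed_indices = [i for i, o in enumerate(outputs) if o is None]
retry_prompts = [prompts[i] for i in failed_indices]
print(failed_indices)  # [1]
# Later, retry only the failures: tlm.try_prompt(retry_prompts)
```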


If your datasets have over several thousand examples, we recommend running TLM in mini-batches to checkpoint intermediate results.

This helper function allows you to run TLM in mini-batches. We recommend batch sizes of approximately 1000, but feel free to tinker with this number to best suit your use case. You can re-execute this function in the case of any failures and it will resume from the previous checkpoint.

Optional: TLM batch prompt helper function (click to expand)

Note that we also use the tlm.try_prompt() method here, which handles any failures (errors or timeouts) by returning None in place of the failed examples.

import os

def batch_prompt(tlm: studio.TLM, input_path: str, output_path: str, prompt_col_name: str, batch_size: int = 1000):
    if os.path.exists(output_path):
        start_idx = len(pd.read_csv(output_path))
    else:
        start_idx = 0

    df_batched = pd.read_csv(input_path, chunksize=batch_size)
    curr_idx = 0

    for curr_batch in df_batched:
        # skip the batch if results already exist for all of its examples
        if curr_idx + len(curr_batch) <= start_idx:
            curr_idx += len(curr_batch)
            continue

        # if results exist for only part of the batch, run the remaining examples
        elif curr_idx < start_idx:
            curr_batch = curr_batch[start_idx - curr_idx:]
            curr_idx = start_idx

        results = tlm.try_prompt(curr_batch[prompt_col_name].to_list())
        results = [
            r if r else {"response": None, "trustworthiness_score": None}
            for r in results
        ]
        results_df = pd.DataFrame(results)
        results_df.to_csv(output_path, mode="a", index=False, header=not os.path.exists(output_path))

        curr_idx += len(curr_batch)

Here we’ll demonstrate using the batch_prompt() method on a toy dataset of 100 prompts, but this can be run at scale. Just specify: an instantiated TLM object, the input file path to a CSV file containing your prompts and the column name in which they are located, as well as the output file path where results should be stored, and your intended batch size (we recommend ~1000 examples per batch).

# create sample prompts
sample_prompts = pd.DataFrame({"prompt": [f"What is the sum of 1 and {i}?" for i in range(1, 101)]})
sample_prompts.to_csv("sample_tlm_prompts.csv", index=None)
input_path = "sample_tlm_prompts.csv"
output_path = "sample_responses.csv"

df = pd.read_csv(input_path)
0 What is the sum of 1 and 1?
1 What is the sum of 1 and 2?
2 What is the sum of 1 and 3?
3 What is the sum of 1 and 4?
4 What is the sum of 1 and 5?

We can then call the batch_prompt function to run TLM in mini-batches. Note that if this cell fails for any reason, you can just re-execute it and the TLM will resume processing your data from the previous checkpoint.

tlm = studio.TLM()

batch_prompt(
    tlm=tlm,
    input_path=input_path,
    output_path=output_path,
    prompt_col_name="prompt",
    batch_size=20,  # small batches suffice for this toy dataset; ~1000 recommended at scale
)


After the cell above is done executing, we can view the saved results in the output file:

results = pd.read_csv(output_path)
response trustworthiness_score
0 The sum of 1 and 1 is 2. 0.978588
1 The sum of 1 and 2 is 3. 0.992156
2 The sum of 1 and 3 is 4. 0.999152
3 The sum of 1 and 4 is 5. 0.983656
4 The sum of 1 and 5 is 6. 0.981041


We’d love to hear any feedback you have, and as always, we’re available to answer any questions. The best place to ask is in our Community Slack, or via email:

Note: This initial version of TLM is not yet optimized for speed (or long contexts). Focus mainly on the quality of the results you’re getting, and know that the inference latency (and context length) will be improved as we build out the supporting infrastructure. If getting results is taking really long, there may be too many TLM users hitting rate limits, in which case try: decreasing the quality_preset, shortening your prompt, or waiting until later to use it. We’re increasing our infrastructure capacity to meet surging demand.

My company only uses a proprietary LLM, or a specific LLM provider

The technology behind TLM makes it compatible with any LLM API, even a black-box that solely generates responses and provides no other capability. Email to learn how you can convert your company’s LLM into a Trustworthy Language Model.

How does the TLM trustworthiness score work?

The TLM scores our confidence that a response is ‘good’ for a given request. In question-answering applications, ‘good’ would correspond to whether the answer is correct or not. In general open-ended applications, ‘good’ corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.

TLM trustworthiness scores capture two aspects of uncertainty and quantify them into a holistic trustworthiness measure:

  1. aleatoric uncertainty (known unknowns, i.e. uncertainty the model is aware of due to a challenging request. For instance: if a prompt is incomplete/vague)
  2. epistemic uncertainty (unknown unknowns, i.e. uncertainty in the model due to not having been previously trained on data similar to this. For instance: if a prompt is very different from most requests in the LLM training corpus)

These two forms of uncertainty are mathematically quantified in the TLM through multiple operations:

  • self-reflection: a process in which the LLM is asked to explicitly rate the response and state how confidently good this response appears to be.
  • probabilistic prediction: a process in which we consider the per-token probabilities assigned by a LLM as it generates a response based on the request (auto-regressively token by token).
  • observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or a given response).

These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty.
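To build intuition for the observed-consistency component, here is a toy sketch (not Cleanlab's actual algorithm): sample several candidate responses and measure how often they agree with the response being scored.

```python
def toy_consistency_score(candidate: str, sampled_responses: list[str]) -> float:
    """Fraction of sampled responses that exactly agree with the candidate
    answer. Cleanlab's real measure is more sophisticated, e.g. it accounts
    for semantic contradiction rather than exact string match."""
    matches = sum(r == candidate for r in sampled_responses)
    return matches / len(sampled_responses)

# Confident answer: the sampled responses agree
print(toy_consistency_score("400", ["400", "400", "400", "400"]))  # 1.0
# Shaky answer: the samples contradict one another
print(toy_consistency_score("96668696", ["96668596", "96668696", "96668496", "96668606"]))  # 0.25
```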

Rigorous benchmarks reveal that TLM trustworthiness scores better detect bad responses than alternative approaches that only quantify aleatoric uncertainty, such as per-token probabilities (logprobs) or using an LLM to directly rate the response (LLM evaluating LLM).