Trustworthy Language Model (TLM) - Quickstart

Run in Google Colab

Cleanlab’s Trustworthy Language Model scores the trustworthiness of every LLM response in real-time, automatically flagging when the model’s response may be incorrect. TLM can detect incorrect outputs from any LLM model and can score any type of model output (natural language response, classification decision, structured output, tool-call, etc).

This tutorial demonstrates how to quickly make any LLM application more reliable with TLM; other tutorials and our cheat sheet demonstrate how to better utilize TLM in specific applications.

Setup

This tutorial requires a TLM API key. Get one here.

Cleanlab’s TLM Python client can be installed using pip:

%pip install --upgrade cleanlab-tlm

# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>"  # Get your free API key from: https://tlm.cleanlab.ai/

Using TLM

You can use TLM pretty much like any other LLM API:

from cleanlab_tlm import TLM

tlm = TLM()  # See Advanced Tutorial for optional TLM configurations to get better/faster results

output = tlm.prompt("<your prompt>")

TLM’s output will be a dict with two fields:

{
  "response": "<response>"  # string like you'd get back from any standard LLM
  "trustworthiness_score": "<trustworthiness_score>"  # numerical value between 0-1
}

The response is generated from the base LLM model powering TLM (most standard LLM models are supported). The score quantifies how confident you can be that the response is correct (higher values indicate greater trustworthiness). These scores are computed via state-of-the-art uncertainty estimation for LLMs.

Boost the reliability of any LLM application by adding contingency plans to handle LLM responses whose trustworthiness score is low (e.g. escalate to human, append disclaimer, revert to a fallback answer, request more information from user, …).

Scoring the trustworthiness of a given response

TLM can also score the trustworthiness of any response to a given prompt. The response could be from any LLM you’re using, or even be human-written.

trustworthiness_score = tlm.get_trustworthiness_score("<your prompt>", response="<your response>")

The output will be a dict with one field:

{
   "trustworthiness_score": "<trustworthiness_score>"  # numerical value between 0-1
}

For example, TLM returns a high score when your LLM’s response is confidently accurate:

tlm.get_trustworthiness_score("What's the first month of the year?", response="January")

{'trustworthiness_score': 0.9997738711822411}

And TLM returns a low score when your LLM’s reponse is untrustworthy, either because it is incorrect/unhelpful or the model is highly uncertain:

tlm.get_trustworthiness_score("What's the first month of the year?", response="February")

{'trustworthiness_score': 0.04739682241488771}

TLM.get_trustworthiness_score() helps you add trustworthiness scoring to any LLM application without changing your existing code. TLM.prompt() helps you simultaneously generate and score LLM responses.

TLM runs on top of a base LLM model (OpenAI’s gpt-4.1-mini by default). For faster/better TLM results, specify a faster/better base model than the default. Learn about optional configurations to get faster/better TLM results in our Advanced Tutorial.

Application: Detecting which LLM responses are untrustworthy

When initially exploring TLM, consider analyzing LLM responses over a dataset. Use batched queries for faster results – both TLM.prompt() and TLM.get_trustworthiness_score() can operate over lists of multiple prompts (and responses).

Here we consider an example dataset of Space-related trivia questions. We use each question as a prompt for TLM (just as you would with any other LLM) and record its response with the associated trustworthiness score.

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'

df = pd.read_csv("solar_system_example_prompts.csv")
df.head()

	prompt
0	What is the largest planet in the Solar System?
1	As of 2024, how many dogs have reached outer space?
2	What is the name of the galaxy that contains our Solar System?
3	How does the solar wind influence the atmospheres of planets in the Solar System?
4	Fetch me the current trajectory of Pluto's orbit from nasa.gov

tlm = TLM()

results = df.copy(deep=True)  # stores TLM outputs for each example

outputs = tlm.prompt(df["prompt"].to_list())
results[["response","trustworthiness_score"]] = pd.DataFrame(outputs)

Querying TLM... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|

Let’s sort the data by trustworthiness scores. Observe how the LLM responses with the lowest trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations. Thankfully, TLM automatically caught these incorrect responses!

results.sort_values(by="trustworthiness_score").head()

	prompt	response	trustworthiness_score
12	What is the name of the first chimpanzee that reached the Moon?	The first chimpanzee to reach the Moon was named "Ham." However, it's important to note that Ham did not actually land on the Moon; he was the first chimpanzee in space, launched by NASA on January 31, 1961. The first living beings to reach the Moon were the Soviet Union's tortoises aboard the Zond 5 mission in 1968. If you meant a specific mission involving chimpanzees and the Moon, there hasn't been one where a chimpanzee landed on the Moon.	0.342736
1	As of 2024, how many dogs have reached outer space?	As of 2024, a total of 12 dogs have been sent into space as part of the Soviet space program during the 1950s and 1960s. The most famous of these dogs is Laika, who became the first animal to orbit the Earth aboard Sputnik 2 in 1957. Other dogs, such as Belka and Strelka, also made significant contributions to space exploration. The missions involving these dogs were crucial for understanding the effects of space travel on living organisms.	0.598447
11	Scientists have recently confirmed that Mars has been inhabited since the 1960s, but they are thirty years behind Earth in terms of fashion. What style of haircut is fashionable for Martian men these days?	If Martians are thirty years behind Earth in terms of fashion and we consider the 1960s as their starting point, then the current fashionable haircut for Martian men would likely reflect styles from the 1990s. Popular haircuts from that era included the "bowl cut," "mullet," and "fade." Given the playful nature of this scenario, it's possible that Martian men might be sporting a unique blend of these styles, perhaps with a twist that incorporates Martian culture or aesthetics. So, a fashionable Martian man might be seen with a modernized mullet or a bowl cut with vibrant colors, reflecting both their retro influences and their extraterrestrial flair!	0.619199
8	Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation?	Astronauts on spacewalks use a specialized sunscreen called "SPF 50" or higher, which is often provided by brands like Banana Boat or Coppertone. However, the specific formulation used by NASA is designed to withstand the harsh conditions of space, including extreme temperatures and solar radiation. The sunscreen is formulated to provide protection against both UV radiation and the unique environmental factors encountered in space. It's important to note that the exact brand and formulation may vary based on mission requirements and advancements in technology.	0.642048
16	How many countries have planted their flags on the moon?	As of now, six countries have successfully landed missions on the Moon and planted their flags. These countries are:\n\n1. United States\n2. Soviet Union (now Russia)\n3. China\n4. India\n5. Japan\n6. Israel\n\nThe most notable missions that involved flag planting were the Apollo missions by the United States, particularly Apollo 11, which was the first manned mission to land on the Moon in 1969. Other countries have conducted uncrewed missions to the Moon, but the above list includes those that have successfully planted flags on the lunar surface.	0.648395

Inspecting the LLM responses with the highest trustworthiness scores, we find they are accurate and appropriate/helpful. TLM allows you to automatically detect which LLM responses are confidently correct vs. not!

results.sort_values(by="trustworthiness_score").tail()

	prompt	response	trustworthiness_score
7	Analyze the potential for life on Enceladus based on current scientific findings.	Enceladus, one of Saturn's moons, has garnered significant interest in the search for extraterrestrial life due to several compelling scientific findings. Here’s an analysis of the potential for life on Enceladus based on current knowledge:\n\n### 1. Subsurface Ocean\nOne of the most promising aspects of Enceladus is the presence of a subsurface ocean beneath its icy crust. Data from NASA's Cassini spacecraft, which orbited Saturn from 2004 to 2017, indicated that Enceladus has a global ocean of liquid water beneath its surface. This ocean is believed to be in contact with the moon's rocky core, which is crucial for the potential for life, as it could provide the necessary chemical interactions.\n\n### 2. Hydrothermal Activity\nCassini's observations revealed plumes of water vapor and ice particles erupting from the moon's south pole. These plumes contain not only water but also organic molecules and salts, suggesting that hydrothermal activity may be occurring on the ocean floor. Hydrothermal vents on Earth are known to support diverse ecosystems, and similar environments on Enceladus could potentially harbor microbial life.\n\n### 3. Organic Compounds\nThe analysis of the plumes has detected organic compounds, including simple hydrocarbons. While these compounds are not definitive evidence of life, they are essential building blocks for life as we know it. The presence of these organic materials, combined with liquid water and energy sources from hydrothermal activity, creates a potentially habitable environment.\n\n### 4. Energy Sources\nFor life to exist, it requires energy. The potential hydrothermal vents on the ocean floor could provide the necessary energy through chemical reactions, similar to those that sustain life in Earth's deep oceans. Additionally, the interaction between the ocean and the rocky core could produce a variety of chemical nutrients.\n\n### 5. Chemical Equilibrium\nThe presence of molecular hydrogen (H₂) detected in the plumes suggests that there may be chemical processes occurring that could support life. On Earth, hydrogen is often associated with microbial life, particularly in environments like hydrothermal vents.\n\n### 6. Icy Surface and Protection\nEnceladus's thick ice shell provides a protective barrier against cosmic radiation and other environmental factors that could be harmful to potential life forms. This insulation could help maintain a stable environment for life to thrive in the subsurface ocean.\n\n### 7. Comparative Planetology\nThe study of extremophiles on Earth—organisms that thrive in extreme	0.939376
9	What is the primary reason Venus has not been colonized yet despite its similar size and gravity to Earth?	The primary reason Venus has not been colonized, despite its similar size and gravity to Earth, is its extremely hostile environment. Venus has a thick atmosphere composed mainly of carbon dioxide, with clouds of sulfuric acid, leading to surface temperatures around 467 degrees Celsius (872 degrees Fahrenheit). This intense heat, combined with crushing atmospheric pressure about 92 times that of Earth, makes the surface of Venus incredibly inhospitable for human life and current technology.\n\nAdditionally, the lack of water and the presence of corrosive chemicals further complicate the possibility of colonization. While there have been discussions about potential floating colonies in the upper atmosphere, where conditions are less extreme, the challenges of developing the necessary technology and infrastructure for such habitats remain significant. Overall, the extreme environmental conditions are the primary barrier to colonization efforts on Venus.	0.957340
17	Name a few planets other than Earth that contain volcanos.	Several planets and moons in our solar system are known to have volcanic activity or features indicative of past volcanism. Here are a few:\n\n1. Mars: Mars has the largest volcano in the solar system, Olympus Mons, which is about 13.6 miles (22 kilometers) high. It also has several other large shield volcanoes.\n\n2. Venus: Venus has numerous volcanoes, with many large shield volcanoes and evidence of past volcanic activity. Some scientists believe that Venus may still be volcanically active today.\n\n3. Io: One of Jupiter's moons, Io is the most volcanically active body in the solar system. It has hundreds of active volcanoes, some of which erupt with lava fountains and plumes of sulfur and sulfur dioxide.\n\n4. Enceladus: A moon of Saturn, Enceladus has geysers that erupt water vapor and ice particles, indicating cryovolcanism (cold volcanism). This suggests that there may be subsurface volcanic activity.\n\n5. Triton: Neptune's largest moon, Triton, shows signs of cryovolcanism as well, with geysers that may be driven by subsurface activity.\n\nThese celestial bodies provide fascinating insights into volcanic processes beyond Earth.	0.963893
0	What is the largest planet in the Solar System?	The largest planet in the Solar System is Jupiter. It is a gas giant with a diameter of about 86,881 miles (139,822 kilometers) and is known for its prominent bands of clouds, Great Red Spot, and numerous moons.	0.981862
2	What is the name of the galaxy that contains our Solar System?	The galaxy that contains our Solar System is called the Milky Way Galaxy.	0.983264

Application: Scoring the trustworthiness of pre-generated responses

Let’s see TLM’s capability to score the trustworthiness of arbitrary responses (from any LLM model, or even human-written). When using TLM.get_trustworthiness_score() to score responses from your own LLM, simply provide the same prompt used for your own LLM (including any instructions, context, message history, etc). Here we use batched queries to run TLM more quickly over a dataset containing pre-generated responses.

df = pd.read_csv("solar_system_dataset.csv")
df.head()

	prompt	human_response
0	What is the largest planet in our solar system?	The largest planet in our solar system is Jvpit3r.
1	What is the significance of the asteroid belt in our solar system?	The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation.
2	How many planets are in the solar system?	There are eight planets in the solar system.
3	What defines a planet as a 'gas giant'?	A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core.
4	Is Pluto still considered a planet in our solar system?	Pluto is no longer considered a planet. It is classified as a dwarf planet.

tlm = TLM()

results = df.copy(deep=True)

outputs = tlm.get_trustworthiness_score(df["prompt"].to_list(), df["human_response"].to_list())
results["trustworthiness_score"] = [output["trustworthiness_score"] for output in outputs]

Querying TLM... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|

The responses receiving lower trustworthiness scores tend to be inaccurate, containing issues like: incorrect facts, contradicting given context, and poor writing quality. Sometimes the prompts were vague/confusing. While other hallucination detection and automated scoring strategies can only detect certain types of LLM errors, TLM can detect all sorts of incorrect LLM responses via its comprehensive uncertainty estimation.

Below, we see that the responses receiving high trustworthiness scores are all accurate and helpful answers. If your application requires highly accurate outputs, you could safely rely on the outputs with high trustworthiness scores, while implementing a fallback strategy to handle the rest of the outputs.

results.sort_values(by="trustworthiness_score").head()

	prompt	human_response	trustworthiness_score
10	Does the Moon qualify as a planet in our solar system?	The Moon is considered a planet in our solar system due to its size and orbit around the Earth.	0.053171
8	Is Jupiter entirely made of gas with no solid areas?	Jupiter is entirely made of gas, with no solid areas anywhere on the planet.	0.092502
9	Why is Venus the hottest planet in the solar system?	Venus is the hottest planet in the solar system because it is the closest to the Sun.	0.478233
13	Classify the following as planets or moons: E	arth, Europa, Titan, Neptune, Ganymede	0.522682
7	Mars, with its thin atmosphere and cold desert landscape, has surface features that include large volcanoes like Olympus Mons, the largest in the solar system, and valleys such as Valles Marineris. Evidence of water ice beneath its surface and dry river beds suggest it once had liquid water. What suggests Mars could currently support Earth-like life?	The presence of large bodies of surface water and a thick, oxygen-rich atmosphere.	0.537487

results.sort_values(by="trustworthiness_score").tail()

	prompt	human_response	trustworthiness_score
4	Is Pluto still considered a planet in our solar system?	Pluto is no longer considered a planet. It is classified as a dwarf planet.	0.939432
6	Which planet is often called Earth's twin and why?	Venus is often called Earth's twin because it is similar in size, mass, and composition.	0.940159
2	How many planets are in the solar system?	There are eight planets in the solar system.	0.986228
11	Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color?	Iron oxide or rust gives Mars its reddish color.	0.986975
5	What planet is known for its extensive ring system?	Saturn is known for its extensive and visible ring system.	0.987380

Beyond catching incorrect responses from your LLM in real-time, TLM is also useful if you are doing LLM Evals with a dataset of responses from your own LLM (i.e. production logs from your AI application). Before you evaluate (or train) models on such data, first triage the lowest-quality examples identified by TLM.

How to use these trust scores for reliable AI?

Offline, you can manually review the lowest-trust LLM responses across a dataset and discover insights to improve your LLM prompts.

In real-time, you can automatically determine which LLM responses are untrustworthy by comparing trustworthiness scores against a fixed threshold (say 0.7). The overall magnitude of trust scores may differ between applications, so select application-specific thresholds.

For maximally reliable AI applications, you can escalate untrustworthy LLM responses for human review.

Here are other strategies to automatically handle untrustworthy LLM responses without a human-in-the-loop:

Append a warning message/disclaimer to the response.
Replace your LLM response with a fallback message such as: “Sorry I am unsure. Try rephrasing your request, or contact us”.
In RAG, the fallback message might include raw retrieved context or search-results, for example: “Sorry I am unsure. Here’s some potentially relevant information: …“.
Replace your original LLM response with a re-generated response.
Escalate to a more expensive AI system (e.g. DeepResearch API).

Below, we showcase example implementations of these strategies. But our main recommendation for handling untrustworthy LLM responses is to use Cleanlab Codex to: log them for remediation, prioritize what to remediate, and serve appropriate remediations for such cases (like expert-provided answers).

Append disclaimer to untrustworthy responses

One straightforward strategy is to still present untrustworthy LLM responses to your user, but first edit them to make them less misleading. You could append a cautionary warning after the response:

if trustworthiness_score < threshold:  # say 0.7
    response = response + "\n\n CAUTION: This answer was flagged as potentially untrustworthy."

Or you could append a hedging statement before the response, making it sound less confident:

if trustworthiness_score < threshold:  # say 0.7
    response = "I'm not sure, but I'd guess:\n\n" + response

Automated response improvement strategies

Optional: Define helper methods: improve_response_tlm_explanation(), generate_improved_response()

def improve_response_tlm_explanation(prompt, response, tlm_output, trust_score_threshold=0.7):
    """ Use alternative response if one was identified in the TLM explanation and it has higher trust score."""
    # Variable that will store final response to return to user
    final_response = tlm_output["response"]
    final_score = tlm_output["trustworthiness_score"]
    if final_score < trust_score_threshold:
        # Try to get and evaluate alternative response
        alt_response = get_alternative_response(tlm_output["log"]["explanation"])
        if alt_response:
            try:
                alt_score = tlm.get_trustworthiness_score(
                    prompt=prompt,
                    response=alt_response
                )
                # Update final values if alternative is better
                if alt_score["trustworthiness_score"] > final_score:
                    final_response = alt_response
                    final_score = alt_score["trustworthiness_score"]
            except:
                pass  # Keep original response if scoring fails

    print(f"Original Response: {tlm_output['response']}\n")
    print(f"Original Score: {tlm_output['trustworthiness_score']}\n")
    print(f"Final Response: {final_response}\n")
    print(f"Final Score: {final_score}")
    return (final_response, final_score)

# Extract alternative response from explanation (if one exists)
def get_alternative_response(explanation):
    if ":" in explanation:
        alt = explanation.split(":")[-1].strip()
        return alt
    return None

def generate_improved_response(prompt, response, tlm_output, trust_score_threshold=0.7):
    """ Regenerate the response, with TLM's explanation added to the prompt."""
    final_response = response
    final_score = tlm_output["trustworthiness_score"]
    tlm_explanation = tlm_output["log"]["explanation"]

    if final_score < trust_score_threshold:
        improvement_prompt = f"""
## User Question

{prompt}

## Answer proposed by an AI Assistant

{response}

## Flaws identified in the proposed Answer

{tlm_explanation}

## Your task

After reviewing the above information, provide a better alternative answer to the User Question.
If you are unable to identify a better alternative answer, then either respond with the same
proposed Answer above or say: "Sorry I am unsure, try providing more information."

Your answer:
"""
        try:
            improved_output = tlm.prompt(improvement_prompt)  # You could use your own LLM here instead of TLM
            final_response = improved_output["response"]
            final_score = improved_output["trustworthiness_score"]
        except:
            pass

    print(f"Original Response: {response}\n")
    print(f"Original Score: {tlm_output['trustworthiness_score']}\n")
    print(f"Improved Response: {final_response}\n")
    print(f"Improved Score: {final_score}")
    return (final_response, final_score)

TLM’s explanation feature often provides alternative responses that might be more trustworthy (discovered during trust scoring). You can automatically replace untrustworthy LLM responses with such an alternative response (if one was found) like this:

user_query = "Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How many brothers?"

tlm = TLM(options={"log": ["explanation"]})  # log explanations during trust scoring

# Generate initial response and score its trustworthiness (could instead produce response using your own LLM)
tlm_output = tlm.prompt(user_query)
response = tlm_output["response"]

final_response, final_score = improve_response_tlm_explanation(user_query, response, tlm_output)

Original Response: Bobby has 3 sisters, and since Bobby is one of the brothers, he is the only brother that each of his sisters has. Therefore, there is only 1 brother (Bobby) in total.
Original Score: 0.6618441307670467
Final Response: 2 brothers.
Final Score: 0.9959271450161626

Here’s another way to automatically improve an untrustworthy LLM response: invoke a second LLM call to re-generate the response, this time modifying the prompt to include TLM’s explanation for why the original response was deemed untrustworthy.

final_response, final_score = generate_improved_response(user_query, response, tlm_output)

Original Response: Bobby has 3 sisters, and since Bobby is one of the brothers, he is the only brother that each of his sisters has. Therefore, there is only 1 brother (Bobby) in total.
Original Score: 0.6618441307670467
Improved Response: Bobby has 3 sisters, and he is one of the brothers. Since each sister has 2 brothers, and Bobby is one of them, it means there is one additional brother. Therefore, in total, there are 2 brothers (Bobby and his other brother).
Improved Score: 0.8661622424182114

These automated response improvement techniques work whether you generate responses via TLM.prompt() or via your own LLM followed by TLM.get_trustworthiness_score().

Instead of using the automated response improvement methods shown above, you can alternatively just call TLM.prompt() using options = {"num_candidate_responses": 4} (or any number larger than 1) for similar automated response-improvements (see Advanced Tutorial).

When replacing/editing LLM responses in conversational applications, don’t forget to also update the chat history to reflect what your user actually saw.

Next Steps

Learn about models/configurations to get faster/better TLM results in the Advanced Tutorial. The default TLM configuration (used in this tutorial) is not latency/cost-optimized because it must remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency/cost without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application. If TLM’s default configuration seems ineffective, switch to a more powerful model or add custom evaluation criteria.
Search for your use-case in our tutorials and cheat sheet to learn how you can best use TLM.
If you’re using OpenAI (or an OpenAI-like API), it may be more convenient to use TLM via our OpenAI wrapper.
If you need an additional capability or deployment option, or are unsure how to achieve desired results, just ask: support@cleanlab.ai. We love hearing from users, and are happy to help optimize TLM latency/accuracy for your use-case or provide private deployments.

Setup​

Using TLM​

Scoring the trustworthiness of a given response​

Application: Detecting which LLM responses are untrustworthy​

Application: Scoring the trustworthiness of pre-generated responses​

How to use these trust scores for reliable AI?​

Append disclaimer to untrustworthy responses​

Automated response improvement strategies​

Next Steps​