
Trustworthy Language Model (TLM) - Quickstart

Run in Google Colab

info

This feature requires a Cleanlab account. Additional instructions on creating your account can be found in the Python API Guide.

Free-tier accounts come with usage limits. To increase your limits, email: sales@cleanlab.ai.

Large Language Models can act as powerful reasoning engines for solving problems and answering questions, but they are prone to “hallucinations”, where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to automatically tell whether an output is good or not.

Cleanlab’s Trustworthy Language Model (TLM) is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.

[Image: TLM chat interface]

For example, with a standard LLM:

Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.

Question: What is 57834849 + 38833747?
Answer: 96668696

It’s difficult to tell when the LLM is answering confidently and when it is not. With the Cleanlab Trustworthy Language Model, however, each answer comes with a trustworthiness score. This score can guide how you use the LLM output (e.g. use it directly if the score is above a certain threshold, otherwise flag the response for human review):

Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.
Trustworthiness Score: 0.765

Question: What is 57834849 + 38833747?
Answer: 96668696
Trustworthiness Score: 0.245

Question: What is 100 + 300?
Answer: 400
Trustworthiness Score: 0.938

Question: What color are the two stars on the national flag of Syria?
Answer: red and black
Trustworthiness Score: 0.173

TLM is not only useful for catching bad LLM responses; it can also produce more accurate LLM responses, as well as catch bad data in any prompt-response dataset (e.g. bad human-written responses). See our extensive benchmarks.

Installing Cleanlab TLM

Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link.

Cleanlab’s Python client can be installed using pip:

%pip install --upgrade cleanlab-studio

Using the TLM

You can use the TLM pretty much like any other LLM API:

from cleanlab_studio import Studio

# Get API key from here: https://app.cleanlab.ai/account after creating an account.
studio = Studio("<API key>")

tlm = studio.TLM() # see below for optional TLM configurations that can boost performance

output = tlm.prompt("<your prompt>")

The TLM’s output will be a dict with two fields:

{
    "response": "<response>",  # string like you'd get back from any standard LLM
    "trustworthiness_score": <trustworthiness_score>  # numerical value between 0 and 1
}

The score quantifies how confident you can be that the response is good (higher values indicate greater trustworthiness). These scores combine estimates of both aleatoric and epistemic uncertainty to provide an overall gauge of trustworthiness. You may find the TLM most useful when your prompts take the form of a question with a definite answer (in which case the returned score quantifies our confidence that the LLM response is correct). Boost the reliability of your Generative AI applications by adding contingency plans to override LLM answers whose trustworthiness score falls below some threshold (e.g., route to human for answer, append disclaimer that answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
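For example, here is a minimal sketch of such a contingency plan (the threshold value is purely illustrative; choose one for your own application as discussed later in this tutorial):

output = tlm.prompt("<your prompt>")

threshold = 0.8  # illustrative value; tune per application
if output["trustworthiness_score"] >= threshold:
    answer = output["response"]  # use the response directly
else:
    answer = "I'm not sure about this one -- routing to a human for review."  # fallback contingency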

Scoring the trustworthiness of a given response

You can also use TLM to compute a trustworthiness score for any response to a given prompt. The response does not need to come from TLM and could even be human-written. Simply pass a prompt-response pair to the TLM, and it will return a numerical score quantifying our confidence that this is a good response.

trustworthiness_score = tlm.get_trustworthiness_score("<your prompt>", response="<your response>")

Both the TLM.prompt() and TLM.get_trustworthiness_score() methods can alternatively operate over lists of multiple prompts (and responses). Providing a list of examples to run as a batch will be faster than running each example separately (say, in a for loop). The following sections demonstrate batched TLM calls.
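For instance, a minimal sketch of a batched prompt() call (these two prompts are just illustrative; when given a list, the method returns a list of output dicts in the same order):

outputs = tlm.prompt(["What is 100 + 300?", "What is the largest planet in the Solar System?"])
# outputs is a list of dicts, one per prompt, e.g.:
# [{"response": "400", "trustworthiness_score": ...}, {"response": "Jupiter...", "trustworthiness_score": ...}]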

Application: Determining which LLM responses are untrustworthy

TLM trustworthiness scores are often most interpretable when comparing them over a large dataset. Be sure to use batched queries (i.e. calling prompt with a list of prompts) for efficiently running TLM on many examples from a dataset.

To show how TLM trustworthiness scores can catch low-quality model outputs, let’s consider a dataset of various Space-related trivia questions. We can use each question as a prompt for the TLM (just as you would with any other LLM) and record its response and associated trustworthiness score.

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'
df = pd.read_csv("solar_system_example_prompts.csv")
df.head()
prompt
0 What is the largest planet in the Solar System?
1 As of 2024, how many dogs have reached outer space?
2 What is the name of the galaxy that contains our Solar System?
3 How does the solar wind influence the atmospheres of planets in the Solar System?
4 Fetch me the current trajectory of Pluto's orbit from nasa.gov
tlm = studio.TLM()

results = df.copy(deep=True) # stores TLM outputs for each example

outputs = tlm.prompt(df["prompt"].to_list())
results[["response","trustworthiness_score"]] = pd.DataFrame(outputs)
    Querying TLM... 100%|██████████|

To see which model responses are least trustworthy, let’s sort the data by the trustworthiness scores. Observe how the model responses with the lowest trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations. After quick fact-checking on sample responses, we verified, for example, that no chimpanzee named Ham (nor any other chimpanzee) has ever reached the Moon.

On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed. Now we can automatically detect which LLM responses are confidently good vs. not!

results.sort_values(by="trustworthiness_score").head()
prompt response trustworthiness_score
12 What is the name of the first chimpanzee that reached the Moon? The first chimpanzee to reach the Moon was named "Ham." However, it's important to note that Ham did not actually go to the Moon; he was the first chimpanzee in space, launched by NASA on January 31, 1961. The first living beings to orbit the Moon were the crew of Apollo 8, which included humans, not chimpanzees. If you meant a different context or event, please clarify! 0.471894
11 Scientists have recently confirmed that Mars has been inhabited since the 1960s, but they are thirty years behind Earth in terms of fashion. What style of haircut is fashionable for Martian men these days? If Martians are thirty years behind Earth in terms of fashion and it's currently 2023, that would place their trendy styles around the early 1990s. For Martian men, a fashionable haircut from that era might be the classic "grunge" look, characterized by longer, tousled hair, often worn with a bit of a messy, carefree vibe. Alternatively, the "fade" haircut, which was popular in the early '90s, could also be in style, featuring short sides with longer hair on top that can be styled in various ways. Another possibility could be the "bowl cut," which was also a notable trend during that time. Overall, Martian men might be embracing a mix of these retro styles, reflecting the unique cultural influences of their planet. 0.635856
15 What was the first meal that astronauts ate on the Moon? The first meal that astronauts ate on the Moon was during the Apollo 11 mission. After Neil Armstrong and Buzz Aldrin landed on the lunar surface on July 20, 1969, they ate a meal that included bacon squares, peaches, and sugar cookies, along with coffee. This meal was part of their pre-packaged space food designed for consumption in a microgravity environment. 0.636673
8 Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation? Astronauts on spacewalks use a specialized sunscreen called **"SPF 50"** that is developed specifically for use in space. One of the brands that has been mentioned in this context is **"Banana Boat,"** which has been used by NASA for its effectiveness against solar radiation. However, it's important to note that the formulation is tailored for the unique conditions of space, including high levels of ultraviolet (UV) radiation. The sunscreen used by astronauts is designed to provide maximum protection while being suitable for the extreme environment of space. 0.714314
16 How many countries have planted their flags on the moon? As of now, six countries have successfully landed missions on the Moon and planted their flags there. These countries are:\n\n1. United States\n2. Soviet Union (now Russia)\n3. China\n4. India\n5. Japan\n6. Israel\n\nThe most notable missions that involved flag planting were the Apollo missions by the United States, particularly Apollo 11, which was the first manned mission to land on the Moon in 1969. The Soviet Union also conducted several successful unmanned missions that landed on the Moon. More recent missions by China and India have also contributed to lunar exploration. 0.752240
results.sort_values(by="trustworthiness_score").tail()
prompt response trustworthiness_score
4 Fetch me the current trajectory of Pluto's orbit from nasa.gov I don't have the ability to access external websites, including NASA's. However, I can provide you with general information about Pluto's orbit.\n\nPluto has a highly elliptical orbit that takes about 248 Earth years to complete one revolution around the Sun. Its orbit is tilted about 17 degrees relative to the plane of the solar system, and it has an average distance from the Sun of about 39.5 astronomical units (AU). Pluto's orbit also crosses that of Neptune, but the two bodies are in a stable orbital resonance that prevents them from colliding.\n\nFor the most current and detailed information about Pluto's trajectory, I recommend visiting NASA's official website or the Jet Propulsion Laboratory's (JPL) Solar System Dynamics page. 0.943510
9 What is the primary reason Venus has not been colonized yet despite its similar size and gravity to Earth? The primary reason Venus has not been colonized, despite its similar size and gravity to Earth, is its extremely hostile environment. Venus has a thick atmosphere composed mainly of carbon dioxide, with clouds of sulfuric acid, leading to surface temperatures that average around 467 degrees Celsius (872 degrees Fahrenheit). This intense heat, combined with crushing atmospheric pressure about 92 times that of Earth, makes the surface of Venus incredibly inhospitable for human life and current technology.\n\nAdditionally, the lack of water and the presence of corrosive chemicals further complicate any potential colonization efforts. While there have been discussions about the possibility of floating colonies in the upper atmosphere, where conditions are less extreme, the challenges of establishing and maintaining such habitats remain significant. Overall, the extreme environmental conditions are the primary barrier to colonization efforts on Venus. 0.971264
17 Name a few planets other than Earth that contain volcanos. Several planets and moons in our solar system are known to have volcanic activity or features indicative of past volcanism. Here are a few:\n\n1. **Mars**: Mars has the largest volcano in the solar system, Olympus Mons, which is about 13.6 miles (22 kilometers) high. It also has several other large shield volcanoes.\n\n2. **Venus**: Venus has numerous volcanoes, with many large shield volcanoes and evidence of volcanic activity in its relatively young surface. Some of the largest volcanoes include Maat Mons and Sif Mons.\n\n3. **Io**: One of Jupiter's moons, Io is the most volcanically active body in the solar system. It has hundreds of active volcanoes, including the massive volcano Loki Patera.\n\n4. **Enceladus**: A moon of Saturn, Enceladus has geysers that erupt water vapor and ice particles, indicating cryovolcanism (cold volcanism) rather than traditional lava flows.\n\n5. **Triton**: Neptune's largest moon, Triton, shows signs of cryovolcanism, with geysers that may be expelling nitrogen gas and other materials.\n\nThese celestial bodies provide fascinating insights into volcanic processes beyond Earth. 0.972526
0 What is the largest planet in the Solar System? The largest planet in the Solar System is Jupiter. It is a gas giant with a diameter of about 86,881 miles (139,822 kilometers) and is known for its prominent bands of clouds, Great Red Spot, and numerous moons. 0.984534
2 What is the name of the galaxy that contains our Solar System? The galaxy that contains our Solar System is called the Milky Way Galaxy. 0.985062

How to use these scores? If you have time/resources, your team can manually review low-trustworthiness responses and provide a better human response instead. If not, you can determine a trustworthiness threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose trustworthiness falls below the threshold.

threshold = 0.5  # set this after inspecting responses across different trustworthiness ranges
if trustworthiness_score < threshold:
    response = response + "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"

The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend choosing thresholds in an application-specific manner. First consider the relative trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.
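For example, one way to pick a starting threshold is to flag the least trustworthy slice of your dataset and inspect it manually (the 10% fraction below is just an illustrative starting point):

candidate_threshold = results["trustworthiness_score"].quantile(0.1)  # bottom 10% of scores (illustrative)
flagged = results[results["trustworthiness_score"] < candidate_threshold]
flagged  # inspect these responses before committing to a threshold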

Application: Estimating the quality of arbitrary responses (find bad data in any sequence-to-sequence or supervised fine-tuning dataset)

Let’s use the TLM’s capability to produce trustworthiness scores for arbitrary responses (not necessarily produced by the TLM) to evaluate the given human responses in our Space Trivia dataset. Such sequence-to-sequence data are often used to fine-tune LLMs (a.k.a. instruction tuning or LLM alignment), but often contain low-quality (input, output) text pairs that hamper LLM training. To detect low-quality pairs, we can score the quality of the human responses via the TLM trustworthiness score. Again, we recommend batched queries (i.e. passing lists of prompts and responses to get_trustworthiness_score) when using TLM to assess many (input, output) pairs from a dataset.

df = pd.read_csv("solar_system_dataset.csv")
df.head()
prompt human_response
0 What is the largest planet in our solar system? The largest planet in our solar system is Jvpit3r.
1 What is the significance of the asteroid belt in our solar system? The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation.
2 How many planets are in the solar system? There are eight planets in the solar system.
3 What defines a planet as a 'gas giant'? A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core.
4 Is Pluto still considered a planet in our solar system? Pluto is no longer considered a planet. It is classified as a dwarf planet.
tlm = studio.TLM()

results = df.copy(deep=True)

outputs = tlm.get_trustworthiness_score(df["prompt"].to_list(), df["human_response"].to_list())
results["trustworthiness_score"] = outputs
    Querying TLM... 100%|██████████|

The human-annotated prompt-response pairs with lower trustworthiness scores appear to be worse quality. We see a wide range of issues among the data points that TLM flagged with the lowest scores: factually inaccurate responses, truncated prompts, inaccurate information extraction given the context, and spelling errors. Conversely, the responses assigned the highest TLM trustworthiness scores are those that provide a direct and accurate answer to the prompt.

results.sort_values(by="trustworthiness_score").head()
prompt human_response trustworthiness_score
10 Does the Moon qualify as a planet in our solar system? The Moon is considered a planet in our solar system due to its size and orbit around the Earth. 0.038329
8 Is Jupiter entirely made of gas with no solid areas? Jupiter is entirely made of gas, with no solid areas anywhere on the planet. 0.058672
13 Classify the following as planets or moons: E arth, Europa, Titan, Neptune, Ganymede 0.483364
9 Why is Venus the hottest planet in the solar system? Venus is the hottest planet in the solar system because it is the closest to the Sun. 0.531315
7 Mars, with its thin atmosphere and cold desert landscape, has surface features that include large volcanoes like Olympus Mons, the largest in the solar system, and valleys such as Valles Marineris. Evidence of water ice beneath its surface and dry river beds suggest it once had liquid water. What suggests Mars could currently support Earth-like life? The presence of large bodies of surface water and a thick, oxygen-rich atmosphere. 0.622840
results.sort_values(by="trustworthiness_score").tail()
prompt human_response trustworthiness_score
6 Which planet is often called Earth's twin and why? Venus is often called Earth's twin because it is similar in size, mass, and composition. 0.959877
4 Is Pluto still considered a planet in our solar system? Pluto is no longer considered a planet. It is classified as a dwarf planet. 0.960110
2 How many planets are in the solar system? There are eight planets in the solar system. 0.987161
11 Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color? Iron oxide or rust gives Mars its reddish color. 0.987489
5 What planet is known for its extensive ring system? Saturn is known for its extensive and visible ring system. 0.988234

If you are fine-tuning LLMs and want to produce the most reliable model, first filter out the lowest-quality (prompt, response) pairs from your dataset, as sketched below. If you have the time/resources, consider manually correcting responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune any LLM (even though the curation was based on TLM trustworthiness scores).
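A minimal sketch of this filtering step, assuming a threshold you chose by inspecting the flagged pairs above (the 0.6 value and output filename are purely illustrative):

quality_threshold = 0.6  # illustrative; choose after inspecting low-scoring pairs
curated = results[results["trustworthiness_score"] >= quality_threshold]
curated[["prompt", "human_response"]].to_csv("solar_system_dataset_curated.csv", index=False)  # curated data for fine-tuning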

Questions

We’d love to hear any feedback you have, and as always, we’re available to answer any questions. The best place to ask is in our Community Slack, or via email: support@cleanlab.ai

Note: This initial version of TLM is not yet optimized for speed (or long contexts). Focus mainly on the quality of the results you’re getting, and know that inference latency (and context length) will improve as we build out the supporting infrastructure. If results are taking very long, too many TLM users may be hitting rate limits; in that case, try decreasing the quality_preset, shortening your prompt, or waiting to use it later. We’re increasing our infrastructure capacity to meet surging demand.
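For instance, you can trade some output quality for speed by instantiating TLM with a lower quality preset (this assumes the quality_preset option referenced in the note above):

tlm_fast = studio.TLM(quality_preset="low")  # faster, at some cost in response/score quality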

My company only uses a proprietary LLM, or a specific LLM provider

The technology behind TLM makes it compatible with any LLM API, even a black-box that solely generates responses and provides no other capability. Email sales@cleanlab.ai to learn how you can convert your company’s LLM into a Trustworthy Language Model.

How does the TLM trustworthiness score work?

The TLM scores our confidence that a response is ‘good’ for a given request. In question-answering applications, ‘good’ would correspond to whether the answer is correct or not. In general open-ended applications, ‘good’ corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.

TLM trustworthiness scores capture two aspects of uncertainty and quantify them into a holistic trustworthiness measure:

  1. aleatoric uncertainty (known unknowns, i.e. uncertainty the model is aware of due to a challenging request; for instance, an incomplete/vague prompt)
  2. epistemic uncertainty (unknown unknowns, i.e. uncertainty due to the model not having been trained on similar data; for instance, a prompt very different from most requests in the LLM training corpus)

These two forms of uncertainty are mathematically quantified in the TLM through multiple operations:

  • self-reflection: a process in which the LLM is asked to explicitly rate the response and explicitly state how confidently good this response appears to be.
  • probabilistic prediction: a process in which we consider the per-token probabilities assigned by an LLM as it generates a response based on the request (autoregressively, token by token).
  • observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response); a toy sketch of this idea follows below.

These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty.
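To make the observed-consistency idea concrete, here is a toy sketch. This is a conceptual illustration only, not TLM’s actual implementation: sample_responses is a hypothetical stand-in for any LLM sampler, and TLM measures contradictions far more robustly than the exact string matching used here.

from typing import Callable, List

def toy_observed_consistency(prompt: str, given_response: str,
                             sample_responses: Callable[[str], List[str]]) -> float:
    # Toy proxy: fraction of sampled responses that agree with the given response.
    samples = sample_responses(prompt)  # e.g. k responses sampled at temperature > 0
    agreements = [s.strip().lower() == given_response.strip().lower() for s in samples]
    return sum(agreements) / len(samples)  # low agreement suggests an untrustworthy response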

Rigorous benchmarks reveal that TLM trustworthiness scores detect bad responses better than alternative approaches that only quantify aleatoric uncertainty, such as per-token probabilities (logprobs) or using an LLM to directly rate the response (LLM evaluating LLM).