Trustworthy Language Model (TLM) - Quickstart
Large Language Models are prone to “hallucinations” where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to tell when an output is wrong.
Cleanlab’s Trustworthy Language Model (TLM) boosts the reliability of any LLM application by indicating when the model is unsure of its answer.
For example, with a standard LLM:
Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.
Question: What is 57834849 + 38833747?
Answer: 96668696
It’s hard to tell whether the LLM is answering confidently. With the Trustworthy Language Model, each answer comes with a trustworthiness score. This score can guide how to use the output from the LLM (e.g. use it directly if the score is above a certain threshold, otherwise flag the response for human review):
Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.
Trustworthiness Score: 0.765
Question: What is 57834849 + 38833747?
Answer: 96668696
Trustworthiness Score: 0.245
Question: What is 100 + 300?
Answer: 400
Trustworthiness Score: 0.938
Question: What color are the two stars on the national flag of Syria?
Answer: red and black
Trustworthiness Score: 0.173
TLM is not only useful for catching bad LLM responses in real time; it can also automatically improve the accuracy/quality of LLM responses. See extensive benchmarks.
Setup
TLM requires a Cleanlab account. Sign up for one here and use TLM for free! If you’ve already signed up, check your email for a personal login link.
Cleanlab’s Python client can be installed using pip:
%pip install --upgrade cleanlab-studio
Using TLM
You can use TLM pretty much like any other LLM API:
from cleanlab_studio import Studio
studio = Studio("<API key>") # Get API key from: https://app.cleanlab.ai/account after creating an account
tlm = studio.TLM() # See Advanced tutorial for optional TLM configurations to boost performance
output = tlm.prompt("<your prompt>")
TLM’s output will be a dict with two fields:
{
    "response": "<response>",  # string like you'd get back from any standard LLM
    "trustworthiness_score": "<trustworthiness_score>"  # numerical value between 0-1
}
The score quantifies how confident you can be that the response is good (higher values indicate greater trustworthiness). These scores combine estimates of both aleatoric and epistemic uncertainty to provide an overall gauge of trustworthiness. Boost the reliability of your Generative AI applications by adding contingency plans to override LLM answers whose trustworthiness score falls below some threshold (e.g., route to human for review, append disclaimer that answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
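The contingency plans described above can be sketched as a small routing function. This is only an illustration: `apply_contingency`, `THRESHOLD`, `DISCLAIMER`, and the `needs_review` field are hypothetical names that are not part of the Cleanlab client, and the 0.5 threshold is a placeholder you would tune for your own application. The function only assumes an output dict with the two fields shown earlier.

```python
from typing import Any, Dict, Optional

# Hypothetical values for illustration; tune both for your application.
THRESHOLD = 0.5
DISCLAIMER = "\n\nNote: this answer may be unreliable. Please verify it."


def apply_contingency(
    output: Dict[str, Any],
    threshold: float = THRESHOLD,
    fallback: Optional[str] = None,
) -> Dict[str, Any]:
    """Decide how to use one TLM output based on its trustworthiness score."""
    score = output["trustworthiness_score"]
    if score >= threshold:
        # Trusted enough: pass the response through unchanged.
        return {"text": output["response"], "needs_review": False}
    if fallback is not None:
        # Low score and a baseline answer is available: use the baseline.
        return {"text": fallback, "needs_review": True}
    # Low score and no baseline: keep the response but warn the reader.
    return {"text": output["response"] + DISCLAIMER, "needs_review": True}


# Scores taken from the introductory examples above.
trusted = apply_contingency({"response": "400", "trustworthiness_score": 0.938})
flagged = apply_contingency({"response": "96668696", "trustworthiness_score": 0.245})
print(trusted["needs_review"], flagged["needs_review"])  # False True
```

Returning a `needs_review` flag alongside the text lets the same function feed either an automated disclaimer or a human review queue.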
Scoring the trustworthiness of a given response
You can also use TLM to score the trustworthiness of any response to a given prompt. The response does not need to come from TLM, and could be human-written. Provide a (prompt, response) pair to TLM and obtain a trustworthiness score between 0 and 1, quantifying our confidence that this is a good response.
trustworthiness_score = tlm.get_trustworthiness_score("<your prompt>", response="<your response>")
Assuming the API request is successful, the output will be a dict with one field:
{
"trustworthiness_score": "<trustworthiness_score>" # numerical value between 0-1
}
For example, TLM returns a high score when the LLM/RAG/Agent’s response is accurate:
trustworthiness_score = tlm.get_trustworthiness_score("What's the first month of the year?", response="January")
trustworthiness_score
And TLM returns a low score when the LLM/RAG/Agent’s response to the given prompt is untrustworthy:
trustworthiness_score = tlm.get_trustworthiness_score("What's the first month of the year?", response="February")
trustworthiness_score
Both the TLM.prompt() and TLM.get_trustworthiness_score() methods can alternatively operate over lists of multiple prompts (and responses). Providing a list of multiple examples to run in a batch will be faster than running each example separately. The following sections demonstrate batched TLM calls.
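As a rough sketch of what a batched call looks like from the caller's side, the helper below sends a list of prompts in one call and pairs each prompt with its output. It assumes that `prompt()` returns one dict per prompt, in the same order as the input list, which is how the batched examples in this tutorial use it. `prompt_batch` and `StubTLM` are hypothetical names; the stub only stands in for `studio.TLM()` so the snippet runs without an API key, and its outputs are placeholders rather than real model responses.

```python
from typing import Any, Dict, List


def prompt_batch(tlm: Any, prompts: List[str]) -> List[Dict[str, Any]]:
    """Send all prompts in one batched call and pair each prompt with its output."""
    outputs = tlm.prompt(prompts)  # a list of prompts in, a list of dicts out
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    return [{"prompt": p, **out} for p, out in zip(prompts, outputs)]


class StubTLM:
    """Offline stand-in for studio.TLM(); not part of the Cleanlab client."""

    def prompt(self, prompts: List[str]) -> List[Dict[str, Any]]:
        # Placeholder values only; a real TLM returns model responses and scores.
        return [
            {"response": f"<response to: {p}>", "trustworthiness_score": 0.5}
            for p in prompts
        ]


rows = prompt_batch(StubTLM(), ["What is 100 + 300?", "What is 57834849 + 38833747?"])
for row in rows:
    print(row["prompt"], "->", row["trustworthiness_score"])
```

With a real client you would pass `studio.TLM()` instead of `StubTLM()`.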
Application: Determining which LLM responses are untrustworthy
TLM trustworthiness scores are often most interpretable when comparing them over a large dataset. Be sure to use batched queries (i.e. calling prompt() with a list of prompts) for efficiently running TLM on many examples from a dataset.
To show how TLM trustworthiness scores can catch low-quality model outputs, let’s consider a dataset of various Space-related trivia questions. We can use each question as a prompt for TLM (just as you would with any other LLM) and record its response and associated trustworthiness score.
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'
df = pd.read_csv("solar_system_example_prompts.csv")
df.head()
| | prompt |
---|---|
0 | What is the largest planet in the Solar System? |
1 | As of 2024, how many dogs have reached outer space? |
2 | What is the name of the galaxy that contains our Solar System? |
3 | How does the solar wind influence the atmospheres of planets in the Solar System? |
4 | Fetch me the current trajectory of Pluto's orbit from nasa.gov |
tlm = studio.TLM()
results = df.copy(deep=True) # stores TLM outputs for each example
outputs = tlm.prompt(df["prompt"].to_list())
results[["response","trustworthiness_score"]] = pd.DataFrame(outputs)
To see which model responses are least trustworthy, let’s sort the data by the trustworthiness scores. Observe how model responses with the lowest trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations.
Let’s take a closer look at the first prompt-response pair in the sorted results below. After quick fact-checking, we’ve verified that no chimpanzee named Ham has ever reached the Moon. The response is not only incorrect at the start but also confusing: its first sentence implies that Ham reached the Moon, and it later clarifies otherwise. This contradictory information could lead to misunderstandings and potential misinformation, so the low trustworthiness score usefully prompts us to double-check its validity.
results.sort_values(by="trustworthiness_score").head()
| | prompt | response | trustworthiness_score |
---|---|---|---|
12 | What is the name of the first chimpanzee that reached the Moon? | The first chimpanzee to reach the Moon was named "Ham." However, it's important to note that Ham did not actually land on the Moon; he was the first chimpanzee in space, launched by NASA on January 31, 1961. The first living beings to reach the Moon were the Soviet Union's tortoises aboard the Zond 5 mission in 1968. If you meant a specific mission involving chimpanzees and the Moon, there hasn't been one where a chimpanzee landed on the Moon. | 0.342736 |
1 | As of 2024, how many dogs have reached outer space? | As of 2024, a total of 12 dogs have been sent into space as part of the Soviet space program during the 1950s and 1960s. The most famous of these dogs is Laika, who became the first animal to orbit the Earth aboard Sputnik 2 in 1957. Other dogs, such as Belka and Strelka, also made significant contributions to space exploration. The missions involving these dogs were crucial for understanding the effects of space travel on living organisms. | 0.598447 |
11 | Scientists have recently confirmed that Mars has been inhabited since the 1960s, but they are thirty years behind Earth in terms of fashion. What style of haircut is fashionable for Martian men these days? | If Martians are thirty years behind Earth in terms of fashion and we consider the 1960s as their starting point, then the current fashionable haircut for Martian men would likely reflect styles from the 1990s. Popular haircuts from that era included the "bowl cut," "mullet," and "fade." Given the playful nature of this scenario, it's possible that Martian men might be sporting a unique blend of these styles, perhaps with a twist that incorporates Martian culture or aesthetics. So, a fashionable Martian man might be seen with a modernized mullet or a bowl cut with vibrant colors, reflecting both their retro influences and their extraterrestrial flair! | 0.619199 |
8 | Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation? | Astronauts on spacewalks use a specialized sunscreen called **"SPF 50"** or higher, which is often provided by brands like **Banana Boat** or **Coppertone**. However, the specific formulation used by NASA is designed to withstand the harsh conditions of space, including extreme temperatures and solar radiation. The sunscreen is formulated to provide protection against both UV radiation and the unique environmental factors encountered in space. It's important to note that the exact brand and formulation may vary based on mission requirements and advancements in technology. | 0.642048 |
16 | How many countries have planted their flags on the moon? | As of now, six countries have successfully landed missions on the Moon and planted their flags. These countries are:\n\n1. United States\n2. Soviet Union (now Russia)\n3. China\n4. India\n5. Japan\n6. Israel\n\nThe most notable missions that involved flag planting were the Apollo missions by the United States, particularly Apollo 11, which was the first manned mission to land on the Moon in 1969. Other countries have conducted uncrewed missions to the Moon, but the above list includes those that have successfully planted flags on the lunar surface. | 0.648395 |
On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed. Now we can automatically detect which LLM responses are confidently good vs. not!
results.sort_values(by="trustworthiness_score").tail()
| | prompt | response | trustworthiness_score |
---|---|---|---|
7 | Analyze the potential for life on Enceladus based on current scientific findings. | Enceladus, one of Saturn's moons, has garnered significant interest in the search for extraterrestrial life due to several compelling scientific findings. Here’s an analysis of the potential for life on Enceladus based on current knowledge:\n\n### 1. **Subsurface Ocean**\nOne of the most promising aspects of Enceladus is the presence of a subsurface ocean beneath its icy crust. Data from NASA's Cassini spacecraft, which orbited Saturn from 2004 to 2017, indicated that Enceladus has a global ocean of liquid water beneath its surface. This ocean is believed to be in contact with the moon's rocky core, which is crucial for the potential for life, as it could provide the necessary chemical interactions.\n\n### 2. **Hydrothermal Activity**\nCassini's observations revealed plumes of water vapor and ice particles erupting from the moon's south pole. These plumes contain not only water but also organic molecules and salts, suggesting that hydrothermal activity may be occurring on the ocean floor. Hydrothermal vents on Earth are known to support diverse ecosystems, and similar environments on Enceladus could potentially harbor microbial life.\n\n### 3. **Organic Compounds**\nThe analysis of the plumes has detected organic compounds, including simple hydrocarbons. While these compounds are not definitive evidence of life, they are essential building blocks for life as we know it. The presence of these organic materials, combined with liquid water and energy sources from hydrothermal activity, creates a potentially habitable environment.\n\n### 4. **Energy Sources**\nFor life to exist, it requires energy. The potential hydrothermal vents on the ocean floor could provide the necessary energy through chemical reactions, similar to those that sustain life in Earth's deep oceans. Additionally, the interaction between the ocean and the rocky core could produce a variety of chemical nutrients.\n\n### 5. **Chemical Equilibrium**\nThe presence of molecular hydrogen (H₂) detected in the plumes suggests that there may be chemical processes occurring that could support life. On Earth, hydrogen is often associated with microbial life, particularly in environments like hydrothermal vents.\n\n### 6. **Icy Surface and Protection**\nEnceladus's thick ice shell provides a protective barrier against cosmic radiation and other environmental factors that could be harmful to potential life forms. This insulation could help maintain a stable environment for life to thrive in the subsurface ocean.\n\n### 7. **Comparative Planetology**\nThe study of extremophiles on Earth—organisms that thrive in extreme | 0.939376 |
9 | What is the primary reason Venus has not been colonized yet despite its similar size and gravity to Earth? | The primary reason Venus has not been colonized, despite its similar size and gravity to Earth, is its extremely hostile environment. Venus has a thick atmosphere composed mainly of carbon dioxide, with clouds of sulfuric acid, leading to surface temperatures around 467 degrees Celsius (872 degrees Fahrenheit). This intense heat, combined with crushing atmospheric pressure about 92 times that of Earth, makes the surface of Venus incredibly inhospitable for human life and current technology.\n\nAdditionally, the lack of water and the presence of corrosive chemicals further complicate the possibility of colonization. While there have been discussions about potential floating colonies in the upper atmosphere, where conditions are less extreme, the challenges of developing the necessary technology and infrastructure for such habitats remain significant. Overall, the extreme environmental conditions are the primary barrier to colonization efforts on Venus. | 0.957340 |
17 | Name a few planets other than Earth that contain volcanos. | Several planets and moons in our solar system are known to have volcanic activity or features indicative of past volcanism. Here are a few:\n\n1. **Mars**: Mars has the largest volcano in the solar system, Olympus Mons, which is about 13.6 miles (22 kilometers) high. It also has several other large shield volcanoes.\n\n2. **Venus**: Venus has numerous volcanoes, with many large shield volcanoes and evidence of past volcanic activity. Some scientists believe that Venus may still be volcanically active today.\n\n3. **Io**: One of Jupiter's moons, Io is the most volcanically active body in the solar system. It has hundreds of active volcanoes, some of which erupt with lava fountains and plumes of sulfur and sulfur dioxide.\n\n4. **Enceladus**: A moon of Saturn, Enceladus has geysers that erupt water vapor and ice particles, indicating cryovolcanism (cold volcanism). This suggests that there may be subsurface volcanic activity.\n\n5. **Triton**: Neptune's largest moon, Triton, shows signs of cryovolcanism as well, with geysers that may be driven by subsurface activity.\n\nThese celestial bodies provide fascinating insights into volcanic processes beyond Earth. | 0.963893 |
0 | What is the largest planet in the Solar System? | The largest planet in the Solar System is Jupiter. It is a gas giant with a diameter of about 86,881 miles (139,822 kilometers) and is known for its prominent bands of clouds, Great Red Spot, and numerous moons. | 0.981862 |
2 | What is the name of the galaxy that contains our Solar System? | The galaxy that contains our Solar System is called the Milky Way Galaxy. | 0.983264 |
How to use these scores? If you have time/resources, your team can manually review low-trustworthiness responses and provide a better human response instead. If not, you can determine a trustworthiness threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose trustworthiness falls below the threshold.
threshold = 0.5  # set this after inspecting responses around different trustworthiness ranges
if trustworthiness_score < threshold:
    response = response + "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"
The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend choosing thresholds separately for each application. First consider the relative trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.
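One way to act on relative rather than absolute score levels is to derive the cutoff from the ranking of scores within your own dataset. The sketch below is an assumption-laden illustration, not a Cleanlab recommendation: `lowest_fraction_cutoff` is a hypothetical helper and the 20% fraction is arbitrary. The scores are the ten values shown in the two tables above, which are only the five lowest and five highest from that run.

```python
from typing import List


def lowest_fraction_cutoff(scores: List[float], fraction: float = 0.2) -> float:
    """Return the score at or below which roughly `fraction` of the examples fall."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores)
    # Number of examples to flag: at least one, at most all of them.
    n_flag = max(1, round(fraction * len(ranked)))
    return ranked[n_flag - 1]


# The ten scores displayed in the tables above (a subset of the full run).
scores = [0.342736, 0.598447, 0.619199, 0.642048, 0.648395,
          0.939376, 0.957340, 0.963893, 0.981862, 0.983264]

cutoff = lowest_fraction_cutoff(scores, fraction=0.2)
flagged = [s for s in scores if s <= cutoff]
print(cutoff, flagged)  # 0.598447 [0.342736, 0.598447]
```

A rank-based cutoff adapts automatically when a dataset's scores are shifted up or down overall, at the cost of always flagging some fixed share of examples.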
Application: Estimating the quality of arbitrary responses (find bad responses in LLM Evals or any Human-Written dataset)
Let’s see TLM’s capability to produce trustworthiness scores for arbitrary responses (not necessarily generated using TLM). Here we evaluate human-written responses for our Space Trivia dataset, but the same technique also works for responses generated by your existing LLM. To detect low-quality responses, we can score the trustworthiness of each existing response using TLM. Again we recommend using batched queries (i.e. passing in lists of prompts and responses to get_trustworthiness_score()) when using TLM to assess many examples from a dataset.
df = pd.read_csv("solar_system_dataset.csv")
df.head()
| | prompt | human_response |
---|---|---|
0 | What is the largest planet in our solar system? | The largest planet in our solar system is Jvpit3r. |
1 | What is the significance of the asteroid belt in our solar system? | The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation. |
2 | How many planets are in the solar system? | There are eight planets in the solar system. |
3 | What defines a planet as a 'gas giant'? | A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core. |
4 | Is Pluto still considered a planet in our solar system? | Pluto is no longer considered a planet. It is classified as a dwarf planet. |
tlm = studio.TLM()
results = df.copy(deep=True)
outputs = tlm.get_trustworthiness_score(df["prompt"].to_list(), df["human_response"].to_list())
results["trustworthiness_score"] = [output["trustworthiness_score"] for output in outputs]
The human-annotated prompt-response pairs with lower trustworthiness scores appear to be of worse quality. We see a wide range of issues among data points that TLM flagged with the lowest scores: factually inaccurate responses, truncated prompts, inaccurate information extraction given context, and spelling errors. Conversely, responses assigned the highest TLM trustworthiness scores are those that provide a direct and accurate answer to the prompt.
results.sort_values(by="trustworthiness_score").head()
| | prompt | human_response | trustworthiness_score |
---|---|---|---|
10 | Does the Moon qualify as a planet in our solar system? | The Moon is considered a planet in our solar system due to its size and orbit around the Earth. | 0.053171 |
8 | Is Jupiter entirely made of gas with no solid areas? | Jupiter is entirely made of gas, with no solid areas anywhere on the planet. | 0.092502 |
9 | Why is Venus the hottest planet in the solar system? | Venus is the hottest planet in the solar system because it is the closest to the Sun. | 0.478233 |
13 | Classify the following as planets or moons: E | arth, Europa, Titan, Neptune, Ganymede | 0.522682 |
7 | Mars, with its thin atmosphere and cold desert landscape, has surface features that include large volcanoes like Olympus Mons, the largest in the solar system, and valleys such as Valles Marineris. Evidence of water ice beneath its surface and dry river beds suggest it once had liquid water. What suggests Mars could currently support Earth-like life? | The presence of large bodies of surface water and a thick, oxygen-rich atmosphere. | 0.537487 |
results.sort_values(by="trustworthiness_score").tail()
| | prompt | human_response | trustworthiness_score |
---|---|---|---|
4 | Is Pluto still considered a planet in our solar system? | Pluto is no longer considered a planet. It is classified as a dwarf planet. | 0.939432 |
6 | Which planet is often called Earth's twin and why? | Venus is often called Earth's twin because it is similar in size, mass, and composition. | 0.940159 |
2 | How many planets are in the solar system? | There are eight planets in the solar system. | 0.986228 |
11 | Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color? | Iron oxide or rust gives Mars its reddish color. | 0.986975 |
5 | What planet is known for its extensive ring system? | Saturn is known for its extensive and visible ring system. | 0.987380 |
If you plan to fine-tune LLMs on such data, first filter or manually correct the lowest-quality (prompt, response) pairs identified by TLM. Using TLM to automatically find bad responses is also useful if you are doing LLM Evals with a dataset of responses from your own LLM (production logs from your AI application).
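The filtering step before fine-tuning might look like the following sketch, which splits scored records into a set to keep and a set to review or correct. `split_for_finetuning` and the 0.6 cutoff are hypothetical choices made for illustration; the four records are copied from the tables above.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def split_for_finetuning(
    records: List[Record], min_score: float
) -> Tuple[List[Record], List[Record]]:
    """Split scored (prompt, response) records into (keep, review) lists."""
    keep = [r for r in records if r["trustworthiness_score"] >= min_score]
    review = [r for r in records if r["trustworthiness_score"] < min_score]
    return keep, review


# Four rows copied from the tables above.
records = [
    {"prompt": "Does the Moon qualify as a planet in our solar system?",
     "human_response": "The Moon is considered a planet in our solar system "
                       "due to its size and orbit around the Earth.",
     "trustworthiness_score": 0.053171},
    {"prompt": "Is Jupiter entirely made of gas with no solid areas?",
     "human_response": "Jupiter is entirely made of gas, with no solid areas "
                       "anywhere on the planet.",
     "trustworthiness_score": 0.092502},
    {"prompt": "How many planets are in the solar system?",
     "human_response": "There are eight planets in the solar system.",
     "trustworthiness_score": 0.986228},
    {"prompt": "What planet is known for its extensive ring system?",
     "human_response": "Saturn is known for its extensive and visible ring system.",
     "trustworthiness_score": 0.987380},
]

# 0.6 is a hypothetical cutoff; choose yours after inspecting the sorted data.
keep, review = split_for_finetuning(records, min_score=0.6)
print(len(keep), "kept;", len(review), "sent for review")  # 2 kept; 2 sent for review
```

Keeping the low-scoring records in a separate list, rather than discarding them, leaves room for manual correction before they are added back to the training set.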
Next Steps
Learn about configurations to get faster/better TLM results in the Advanced Tutorial.
Search for your use-case in our tutorials and cheat sheet to learn how you can best use TLM.
If you need an additional capability or deployment option, or are unsure how to achieve desired results, just ask in our Community Slack or via email: support@cleanlab.ai. We love hearing from users!