
Trustworthy Language Model (TLM)



This feature is in beta, and requires a Cleanlab Studio API Token to use. To get the API token, you must first create a Cleanlab Studio account and access the account page here. Additional instructions on creating your account and getting the API token can be found in the Cleanlab Studio Python API Guide.

For higher token limits, email: sales@cleanlab.ai

Large Language Models can act as powerful reasoning engines for solving problems and answering questions, but they are prone to “hallucinations”, where they sometimes produce incorrect or nonsensical answers. With standard LLM APIs, it’s hard to automatically tell whether an output is good or not.

Cleanlab TLM is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper.

[Figure: TLM chat interface]

For example, with a standard LLM:

Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.

Question: What is 57834849 + 38833747?
Answer: 96668696

It’s difficult to tell when the LLM is answering confidently and when it is not. With Cleanlab’s Trustworthy Language Model, however, each answer comes with a confidence score. This score can guide how you use the LLM output (e.g., use it directly if the score is above a certain threshold; otherwise flag the response for human review):

Question: Which constitutional amendment would be violated if a police officer placed a GPS tracker on an individual’s car without a warrant?
Answer: The Fourth Amendment.
Confidence: 0.765

Question: What is 57834849 + 38833747?
Answer: 96668696
Confidence: 0.245

Question: What is 100 + 300?
Answer: 400
Confidence: 0.938

Question: Which part of the human body produces insulin?
Answer: the pancreas
Confidence: 0.759

Question: What color are the two stars on the national flag of Syria?
Answer: red and black
Confidence: 0.173

Installing Cleanlab TLM

Using TLM requires a Cleanlab Studio account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link.

The cleanlab-studio client can be installed using pip:

%pip install "cleanlab-studio>=1.2.0"
from cleanlab_studio import Studio

Using the TLM

You can query the TLM as follows:

# Get API key from here: https://app.cleanlab.ai/account after creating a Cleanlab Studio account.
# Instructions to create account can be found under Guide -> Quickstart -> Cleanlab Studio Python API -> Creating an API Key

studio = Studio("<API key>")

tlm = studio.TLM()

output = tlm.prompt("<your prompt>")

The TLM’s output will be a dict with two fields:

{
    "response": "<response>",  # string like you'd get back from any standard LLM
    "confidence_score": <confidence_score>,  # numerical value between 0 and 1
}

The score quantifies how confident you can be that the response is good (higher values indicate greater confidence). These scores combine estimates of both aleatoric and epistemic uncertainty to provide an overall gauge of confidence. You may find the TLM most useful when your prompts take the form of a question with a definite answer (in which case the returned score quantifies our confidence that the LLM response is correct). Boost the reliability of your Generative AI applications by adding contingency plans to override LLM answers whose confidence falls below some threshold (e.g., route to human for answer, append disclaimer that answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).
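The contingency plans described above can be sketched as a small routing function. This is a minimal illustration and not part of the cleanlab-studio API: the function name `apply_contingency`, the threshold of 0.5, the fallback text, and the treatment of a missing score as `None` are all assumptions made for this sketch. Only the `response` and `confidence_score` fields come from the TLM output described above.

```python
def apply_contingency(output, threshold=0.5, fallback=None):
    """Return the text to show a user for one TLM output dict.

    `output` is assumed to look like
    {"response": str, "confidence_score": float or None}.
    `threshold` and `fallback` are hypothetical knobs for this sketch.
    """
    score = output["confidence_score"]
    # A missing score (assumed here to be None) is treated as untrusted.
    if score is not None and score >= threshold:
        return output["response"]
    if fallback is not None:
        # Revert to a default baseline answer.
        return fallback
    # Otherwise keep the answer but append a disclaimer.
    return output["response"] + "\n(Note: this answer is uncertain and should be verified.)"


# Example outputs shaped like the dict above (values taken from the examples earlier).
confident = {"response": "400", "confidence_score": 0.938}
uncertain = {"response": "96668696", "confidence_score": 0.245}

shown_confident = apply_contingency(confident)
shown_uncertain = apply_contingency(uncertain, fallback="I'm not sure; please ask a human.")
```

Routing to a human reviewer would replace the fallback branch with whatever queueing mechanism your application already has.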

Advanced Usage

For efficient computation on larger datasets, TLM supports processing multiple concurrent requests in batches. The maximum number of concurrent requests in batch queries is set by max_concurrent_requests (default value is 16). To control reliability/compute trade-offs, you can pass in different quality presets to the TLM. The default preset is medium. For some use cases, this will be enough, but using the highest quality presets will produce better model responses and more reliable associated confidence scores (at the cost of extra computation).

# Default batch size is 16, max batch size is 128.
DEFAULT_MAX_CONCURRENT_TLM_REQUESTS = 16

tlm = studio.TLM(
    max_concurrent_requests=DEFAULT_MAX_CONCURRENT_TLM_REQUESTS,
    quality_preset="best",  # supported quality presets are: 'best', 'high', 'medium', 'low', 'base'
)

# Run a single prompt using the preset parameters:
output = tlm.prompt("<your prompt>")

# Or run multiple prompts simultaneously in a batch:
outputs = tlm.prompt(["<your first prompt>", "<your second prompt>", "<your third prompt>"])

Details about the TLM quality presets:

  • best and high will improve the LLM responses themselves, with best also returning the most reliable confidence scores.
  • medium and low will return standard LLM responses along with associated confidence scores, with medium producing more reliable confidence scores than low.
  • base will not return any confidence score, just an output response. This option is similar to using your favorite LLM. It helps you to compare the enhanced responses from best and high quality presets with a standard LLM, as well as the value of the additional confidence scores returned by TLM.

Benchmark: Accuracy of answers from OpenAI LLM (GPT3.5-Turbo) vs. TLM with quality_preset=best (across 4 different Q&A datasets from different domains)

Dataset     OpenAI LLM    Cleanlab TLM
GSM8K       47%           69%
CSQA        72%           73%
SVAMP       75%           82%
TriviaQA    73%           76%

These benchmarks also reveal that, in the vast majority of cases, the confidence scores are lower for incorrect TLM answers than correct answers. Thus you can safely rely on these scores to alert you about LLM responses that are untrustworthy.
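If you have a set of answers with known correctness, one way to check this property on your own data is to measure how often a correct answer receives a higher confidence score than an incorrect one. The sketch below is an assumption about how such a check could be done, not the procedure used for the benchmark above; the function name and the toy scores are made up for illustration.

```python
def fraction_correctly_ranked(scores, is_correct):
    """Fraction of (correct, incorrect) answer pairs in which the correct
    answer received the higher confidence score (ties count as half).
    This equals the AUROC of the scores as a detector of correct answers."""
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    incorrect = [s for s, ok in zip(scores, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for c in correct:
        for i in incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))


# Made-up confidence scores and correctness labels, for illustration only.
toy_scores = [0.94, 0.76, 0.25, 0.17, 0.60]
toy_correct = [True, True, False, False, True]
ranking_quality = fraction_correctly_ranked(toy_scores, toy_correct)
```

A value near 1.0 means the scores separate correct from incorrect answers well; a value near 0.5 means they are no better than chance.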

Scoring the confidence of a given response

You can also use TLM to compute a confidence score for any response to a given prompt. The response does not need to come from TLM, and could be human-written. Simply pass a (prompt, response) pair to the TLM and it will return a numerical score quantifying our confidence that this is a good response.

confidence_score = tlm.get_confidence_score("<your prompt>", response="<your response>")
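One possible use of such scores is choosing among several candidate responses to the same prompt. The sketch below is illustrative only: `pick_most_trusted` and `fake_score` are hypothetical names, and the stand-in scorer returns made-up values so the example runs without an API key.

```python
def pick_most_trusted(prompt, candidates, score_fn):
    """Return (best_response, best_score) among candidate responses.

    `score_fn(prompt, response)` is assumed to return a float in [0, 1].
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    scored = [(score_fn(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


# Stand-in scorer with made-up scores (not real TLM output).
def fake_score(prompt, response):
    return {"the pancreas": 0.76, "the liver": 0.10}.get(response, 0.0)


best, best_score = pick_most_trusted(
    "Which part of the human body produces insulin?",
    ["the liver", "the pancreas"],
    fake_score,
)
```

With a real client, `score_fn` could wrap the call shown above, for example `lambda p, r: tlm.get_confidence_score(p, response=r)`.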

Application: Determining which LLM responses are untrustworthy

TLM confidence scores are often most interpretable when compared across a large dataset. Be sure to use batched queries (e.g., by passing a list of prompts to tlm.prompt) when running TLM on many examples from a dataset.

To show how TLM confidence scores can catch low quality model outputs, let’s consider a dataset of various Space-related trivia questions. We can use each question as a prompt for the TLM (just as you would with any other LLM) and record its response and associated confidence score.

import pandas as pd
from tqdm import tqdm

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'

df = pd.read_csv("solar_system_example_prompts.csv")
df.head()
prompt
0 What is the largest planet in the Solar System?
1 As of 2024, how many dogs have reached outer space?
2 What is the name of the galaxy that contains our Solar System?
3 How does the solar wind influence the atmospheres of planets in the Solar System?
4 Fetch me the current trajectory of Pluto's orbit from nasa.gov
# Our list of prompts is small so we can use the default batch size
tlm = studio.TLM(quality_preset="best")

results = df.copy(deep=True)

outputs = tlm.prompt(df["prompt"].to_list())
results[["response","confidence_score"]] = pd.DataFrame(outputs)

To see which model responses are least trustworthy, let’s sort the data by the confidence scores. Observe how the model responses with the lowest confidence scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations. After quick fact-checking on sample responses, we’ve verified that “SpectraShield” is not a real product, and that no chimpanzee, named Ham or otherwise, has ever reached the Moon.

On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed. Now we can automatically detect which LLM responses are confidently good vs. not!

results.sort_values(by="confidence_score").head()
prompt response confidence_score
8 Which brand of sunscreen is recommended for astronauts on spacewalks to protect against solar radiation? One brand of sunscreen that is recommended for astronauts on spacewalks to protect against solar radiation is "SpectraShield" by Southern Research. This sunscreen was developed specifically for NASA and provides protection against both ultraviolet A (UVA) and ultraviolet B (UVB) radiation. It is designed to be highly effective in extreme environments, such as during spacewalks where astronauts are exposed to intense solar radiation. 0.428642
5 How many years from now will Earth, Saturn, and the Sun align? It is important to note that Earth, Saturn, and the Sun align at specific intervals called oppositions. Opposition occurs when Earth and Saturn are on opposite sides of the Sun, and hence, the three objects are relatively aligned in a straight line.\n\nThe time it takes for Earth and Saturn to align at opposition is approximately 378 days (or about 1 year and 13 days). This is because Earth takes approximately 365.25 days to orbit the Sun, while Saturn takes about 29.5 Earth years to complete its orbit.\n\nHowever, the alignment of all three objects (Earth, Saturn, and the Sun) in perfect alignment is quite rare and does not occur at every opposition. The exact alignment of the three objects can happen, but the occurrence is infrequent and depends on various factors such as the tilt of the orbits and the specific positions of the objects at a given time.\n\nTherefore, predicting the exact number of years from now for such a perfect alignment is impossible, as it can vary greatly. Within a few hundred or thousand years, some alignments may occur, but they cannot be accurately predicted without detailed mathematical calculations considering the specific positions and movements of the objects. 0.550901
12 What is the name of the first chimpanzee that reached the Moon? The first chimpanzee that reached the Moon was named Ham. 0.572259
1 As of 2024, how many dogs have reached outer space? As of now, no dogs have reached outer space as part of any known space missions. 0.624848
15 What was the first meal that astronauts ate on the Moon? The first meal that astronauts ate on the Moon was a specially prepared food item called "The Apollo Moon Meal." It consisted of a package containing bacon squares, peaches, sugar cookie cubes, pineapple grapefruit drink, and coffee. This meal was consumed by the Apollo 11 astronauts Neil Armstrong and Buzz Aldrin during their historic moon landing mission in July 1969. 0.666584
results.sort_values(by="confidence_score").tail()
prompt response confidence_score
7 Analyze the potential for life on Enceladus based on current scientific findings. Based on current scientific findings, there is strong evidence to suggest that Enceladus, one of Saturn's moons, has the potential to harbor life.\n\n1. Liquid Water: Enceladus is known to have a subsurface ocean of liquid water beneath its icy crust. This water is in contact with a rocky core, creating a ideal environment for life as we know it to potentially exist.\n\n2. Hydrothermal Activity: Observations from the Cassini spacecraft revealed that Enceladus has active hydrothermal vents on its ocean floor. These vents release heat and minerals, similar to the deep-sea hydrothermal vents on Earth that support diverse ecosystems. This indicates a possible source of energy and nutrients for potential life forms.\n\n3. Organic Molecules: In 2014, Cassini detected complex organic molecules, including carbon-based compounds like methane and other hydrocarbons, in Enceladus' plumes of water vapor and ice particles. These organic molecules are essential building blocks for life as we know it.\n\n4. Energy Source: In addition to hydrothermal vents, Enceladus also has tidal heating. The gravitational interactions between Saturn and its nearby moons cause tidal forces, generating heat within Enceladus. This heating could provide another potential energy source for life to thrive, as seen in Earth's deep-sea hydrothermal vents.\n\n5. Prebiotic Chemistry: Recent laboratory experiments that recreated the conditions of Enceladus' subsurface ocean suggest that prebiotic chemistry, the precursor to life, could occur on this moon. Scientists discovered that complex organic molecules, such as amino acids, can form under Enceladus-like conditions, further supporting the potential for life.\n\nWhile all these findings are promising, it is important to note that direct evidence of life on Enceladus has not been found yet. 
Future missions, like NASA's planned Europa Clipper mission, will further explore this moon to gather more data on its potential habitability.\n\nIn conclusion, based on the current scientific findings, Enceladus possesses several factors that make it a prime candidate for hosting life. The presence of liquid water, hydrothermal activity, complex organic molecules, and potential energy sources provide a compelling case for the existence of life beyond Earth on this intriguing moon. 0.917308
3 How does the solar wind influence the atmospheres of planets in the Solar System? The solar wind is a stream of charged particles, mainly protons and electrons, that are constantly emitted by the Sun. These particles, along with their associated magnetic field, have several effects on the atmospheres of planets in the Solar System:\n\n1. Atmospheric Escape: The solar wind can cause atmospheric particles, particularly the lighter ones like hydrogen and helium, to gain enough energy and escape the gravitational pull of a planet. This process, known as atmospheric escape, can lead to a gradual loss of a planet's atmosphere over time. For example, Mars is believed to have lost most of its early atmosphere due to the solar wind.\n\n2. Ionization and Aurora: When the solar wind interacts with a planet's magnetic field, it can cause the atoms and molecules in the upper atmosphere to become ionized. This ionization can lead to the formation of auroras, also known as polar lights. The charged particles from the solar wind excite the atoms and molecules in the atmosphere, causing them to emit colorful lights.\n\n3. Magnetospheric Effects: The solar wind exerts pressure on a planet's magnetic field, compressing it on the side facing the Sun and causing it to stretch out in the opposite direction. This interaction forms a region around the planet called the magnetosphere. The solar wind can influence the size and shape of the magnetosphere, which in turn affects the interaction between a planet's atmosphere and the solar wind.\n\n4. Solar-wind-induced Atmospheric Erosion: High-energy particles in the solar wind can directly impact the upper atmosphere of planets, potentially causing erosion. These particles can collide with atmospheric particles, breaking them apart, and altering the composition and structure of a planet's atmosphere. 
The erosion caused by the solar wind is particularly prominent on bodies that lack a protective magnetosphere, like the Moon.\n\nOverall, the solar wind plays a significant role in shaping the atmospheres of planets in the Solar System. Its effects can range from atmospheric escape and erosion to the formation of auroras and the modification of a planet's magnetosphere. 0.936951
10 In what year did the United Nations grant the Moon its independence, and who is the current lunar ambassador? The United Nations has not granted the Moon its independence, and there is no current lunar ambassador. As of now, the Moon is not considered an independent entity and has no recognized ambassador. 0.938256
0 What is the largest planet in the Solar System? The largest planet in the Solar System is Jupiter. 0.939693
2 What is the name of the galaxy that contains our Solar System? The name of the galaxy that contains our Solar System is the Milky Way. 0.941339

How to use these scores? If you have time/resources, your team can manually review low-confidence responses and provide a better human response instead. If not, you can determine a confidence threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose confidence falls below the threshold.

threshold = 0.5  # set this after inspecting responses around different confidence ranges
if confidence_score < threshold:
    response = response + "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"
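To choose a threshold, it can help to group responses into confidence bands and read a few from each band. The following is a rough sketch under stated assumptions: the helper name `bucket_by_confidence` and the band edges are hypothetical, and the records reuse a few prompts and scores from the results shown earlier.

```python
def bucket_by_confidence(records, edges=(0.25, 0.5, 0.75)):
    """Group records into confidence bands so a few from each can be inspected.

    Each record is assumed to be a dict with a "confidence_score" in [0, 1].
    """
    bounds = [0.0, *edges, 1.0]
    labels = [f"{lo:.2f}-{hi:.2f}" for lo, hi in zip(bounds[:-1], bounds[1:])]
    buckets = {label: [] for label in labels}
    for rec in records:
        score = rec["confidence_score"]
        # The band index is the number of edges at or below the score.
        idx = sum(score >= e for e in edges)
        buckets[labels[idx]].append(rec)
    return buckets


records = [
    {"prompt": "Which brand of sunscreen is recommended for astronauts on spacewalks?", "confidence_score": 0.428642},
    {"prompt": "What is the name of the first chimpanzee that reached the Moon?", "confidence_score": 0.572259},
    {"prompt": "What is the largest planet in the Solar System?", "confidence_score": 0.939693},
]
bands = bucket_by_confidence(records)
for label, recs in bands.items():
    print(label, [r["prompt"] for r in recs])
```

If responses in a band look consistently unreliable, a threshold at the top edge of that band is a reasonable starting point.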

Application: Estimating the quality of arbitrary responses (find bad data in sequence-to-sequence dataset)

Let’s see how the TLM can produce confidence scores for arbitrary responses (not necessarily produced by the TLM) by evaluating the given human responses for our Space Trivia dataset. Such sequence-to-sequence data are often used for fine-tuning LLMs (a.k.a. instruction tuning or LLM alignment), but often contain low-quality (input, output) text pairs that hamper LLM training. To detect low-quality pairs, we can score the quality of the human responses via the TLM confidence score. Again, we recommend using batched queries (i.e., passing lists of prompts and responses to get_confidence_score) when using TLM to assess many (input, output) pairs from a dataset.

df = pd.read_csv("solar_system_dataset.csv")
df.head()
prompt human_response
0 What is the largest planet in our solar system? The largest planet in our solar system is Jvpit3r.
1 What is the significance of the asteroid belt in our solar system? The asteroid belt, located between Mars and Jupiter, contains numerous rocky bodies and provides insight into the early solar system's formation.
2 How many planets are in the solar system? There are eight planets in the solar system.
3 What defines a planet as a 'gas giant'? A gas giant is a large planet composed mostly of gases, such as hydrogen and helium, with a relatively small rocky core.
4 Is Pluto still considered a planet in our solar system? Pluto is no longer considered a planet. It is classified as a dwarf planet.
tlm = studio.TLM(quality_preset="best")

results = df.copy(deep=True)

outputs = tlm.get_confidence_score(df["prompt"].to_list(), df["human_response"].to_list())
results["confidence_score"] = outputs

The human-annotated prompt-response pairs with lower confidence scores appear to be of worse quality. From examining and verifying the results, we see a wide range of issues among those datapoints: factually inaccurate responses, truncated prompts, inaccurate information extraction given context, and spelling errors. Conversely, responses assigned the highest TLM confidence scores are those that provide a direct and accurate answer to the prompt.

results.sort_values(by="confidence_score").head()
prompt human_response confidence_score
10 Does the Moon qualify as a planet in our solar system? The Moon is considered a planet in our solar system due to its size and orbit around the Earth. 0.035069
13 Classify the following as planets or moons: E arth, Europa, Titan, Neptune, Ganymede 0.324784
9 Why is Venus the hottest planet in the solar system? Venus is the hottest planet in the solar system because it is the closest to the Sun. 0.499935
0 What is the largest planet in our solar system? The largest planet in our solar system is Jvpit3r. 0.537292
7 Mars, with its thin atmosphere and cold desert landscape, has surface features that include large volcanoes like Olympus Mons, the largest in the solar system, and valleys such as Valles Marineris. Evidence of water ice beneath its surface and dry river beds suggest it once had liquid water. What suggests Mars could currently support Earth-like life? The presence of large bodies of surface water and a thick, oxygen-rich atmosphere. 0.565554
results.sort_values(by="confidence_score").tail()
prompt human_response confidence_score
6 Which planet is often called Earth's twin and why? Venus is often called Earth's twin because it is similar in size, mass, and composition. 0.938658
4 Is Pluto still considered a planet in our solar system? Pluto is no longer considered a planet. It is classified as a dwarf planet. 0.938804
5 What planet is known for its extensive ring system? Saturn is known for its extensive and visible ring system. 0.939700
2 How many planets are in the solar system? There are eight planets in the solar system. 0.944188
11 Mars, often called the Red Planet, has a thin atmosphere composed mainly of carbon dioxide. The surface exhibits iron oxide or rust, giving the planet its reddish appearance. Mars has the largest volcano in the solar system, Olympus Mons, and evidence suggests water ice exists beneath its surface. What gives Mars its reddish color? Iron oxide or rust gives Mars its reddish color. 0.968772

To get the most reliable model via LLM fine-tuning, first filter out the lowest-quality (prompt, response) pairs from your dataset. If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune any LLM you want to (even though the data curation was based on TLM confidence scores).
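The filtering step could look roughly like the sketch below. The helper name `split_by_quality` and the cutoff of 0.6 are assumptions for illustration; in practice the cutoff should be chosen after inspecting the flagged pairs. The example pairs and scores are taken from the results shown above.

```python
def split_by_quality(pairs, scores, threshold=0.6):
    """Split (prompt, response) pairs into kept and flagged lists by score."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    kept, flagged = [], []
    for pair, score in zip(pairs, scores):
        (kept if score >= threshold else flagged).append(pair)
    return kept, flagged


pairs = [
    ("Does the Moon qualify as a planet in our solar system?",
     "The Moon is considered a planet in our solar system due to its size and orbit around the Earth."),
    ("Why is Venus the hottest planet in the solar system?",
     "Venus is the hottest planet in the solar system because it is the closest to the Sun."),
    ("How many planets are in the solar system?",
     "There are eight planets in the solar system."),
]
scores = [0.035069, 0.499935, 0.944188]

kept, flagged = split_by_quality(pairs, scores)
```

The flagged pairs can then be dropped or sent for manual correction. With the results DataFrame above, the equivalent filter is results[results["confidence_score"] >= threshold].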

Questions

We’d love to hear any feedback you have, and as always, we’re available to answer any questions. The best place to ask is in our Community Slack, or via email: support@cleanlab.ai

Note: This beta version of TLM is not yet optimized for speed (or long contexts). Focus mainly on the quality of the results you’re getting, and know that the inference latency (and context length) will be greatly improved shortly as we build out the supporting infrastructure. If results are taking a long time to arrive, there may be too many TLM users hitting rate limits, in which case try lowering the quality_preset, shortening your prompt, or waiting until later to use it. We are increasing our infrastructure capacity to meet the surging beta demand.