Cheat sheet / FAQ for TLM
Make sure you’ve read the quickstart tutorial and advanced tutorial.
Tips on using TLM for specific tasks
Recall the two ways to use TLM:
- Option 1: Generate responses via your own LLM, and then use TLM to score their trustworthiness via TLM's `get_trustworthiness_score()` method.
- Option 2: Use TLM in place of your LLM to both generate responses and score their trustworthiness via TLM's `prompt()` method.
Choose Option 1 to:
- Stream in responses at the lowest latency, and then score their trustworthiness.
- Use a specific LLM model not supported within TLM, or keep your existing LLM inference code as is.

Choose Option 2 to:
- Produce both responses and trust scores with a single API call.
- Get improved LLM responses from TLM itself (e.g. via the 'best' quality preset).

A minimal sketch of both options appears below.
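Here is a minimal sketch of both options, assuming the `TLM` client from the `cleanlab_tlm` package (per the quickstart tutorial) and dict-style results with `response`/`trustworthiness_score` keys; check the tutorials for the exact interface.

```python
# A minimal sketch of both options. Assumes the TLM client from the
# `cleanlab_tlm` package and that results are returned as dicts with
# "response" / "trustworthiness_score" keys.
from cleanlab_tlm import TLM

tlm = TLM()  # default configuration

prompt = "What is the capital of Australia?"

# Option 1: the response comes from your own LLM, TLM only scores it.
response_from_your_llm = "The capital of Australia is Canberra."
score = tlm.get_trustworthiness_score(prompt, response_from_your_llm)
print(score["trustworthiness_score"])

# Option 2: TLM generates the response and scores it in one call.
result = tlm.prompt(prompt)
print(result["response"], result["trustworthiness_score"])
```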
In either case, ensure TLM receives the same information/prompt you’d supply to your own LLM. More tips for particular AI applications:
Retrieval-Augmented Generation (RAG)
Refer to our RAG tutorial, particularly the final sections.
Common mistakes when using TLM for RAG include:
- Context, system instructions (e.g. when to abstain or say 'No information available'), or evaluation criteria (specific requirements for a correct/good response) are missing from the `prompt` provided to TLM (see the example prompt after this list).
- Providing unnecessarily lengthy context in the `prompt` for TLM.
- Providing low-quality context in the `prompt` for TLM. This video demonstrates how you can use Cleanlab to curate high-quality documents for RAG systems.
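For example, a RAG prompt for TLM might bundle the same system instructions, retrieved context, and user question that your own LLM saw when generating the response. A minimal sketch, where the instruction wording, variable names, and `cleanlab_tlm` import are illustrative assumptions:

```python
# Illustrative only: include system instructions, retrieved context, and the
# user question in the same prompt that TLM scores.
from cleanlab_tlm import TLM

tlm = TLM()

system_instructions = (
    "Answer the question using only the provided context. "
    "If the context does not contain the answer, say 'No information available'."
)
retrieved_context = "Acme's return policy allows returns within 30 days of purchase."
user_question = "Can I return an item I bought six weeks ago?"
rag_response = "No, returns are only accepted within 30 days of purchase."

prompt = f"{system_instructions}\n\nContext:\n{retrieved_context}\n\nQuestion: {user_question}"

score = tlm.get_trustworthiness_score(prompt, rag_response)
print(score["trustworthiness_score"])
```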
In RAG: what makes TLM different than groundedness or faithfulness scores like RAGAS/DeepEval?
While some developers measure many scores to debug RAG system components, what matters most to your users is whether the RAG system answered correctly or not. TLM trustworthiness scores detect incorrect RAG responses in real time with 3x greater precision than groundedness/faithfulness scores like RAGAS (see benchmarks). Groundedness/faithfulness measures like RAGAS only estimate discrepancies between the RAG response and the retrieved context, and thus only detect certain response errors. TLM relies on state-of-the-art model uncertainty estimation, which detects the same discrepancies but also issues such as: the response is not a good answer for the user's query (LLMs often make reasoning/factual errors), the query is complex/vague, or the retrieved context is confusing, irrelevant, or insufficient for a proper answer.
Beyond TLM's built-in score for response trustworthiness, you can also use TLM's custom evaluation criteria to simultaneously score groundedness, faithfulness, abstention, context-relevance, or custom properties of your RAG system – all more efficiently and reliably than tools like RAGAS or DeepEval.
Classification
- Refer to our Zero-Shot Classification tutorial, as well as our Data Annotation tutorial.
- Pass the `constrain_outputs` keyword argument to `TLM.prompt()` to restrict the output to your set of classes/categories (see the sketch after this list).
- Consider running TLM with the 'best' quality preset to boost classification accuracy in addition to scoring trustworthiness.
- A good prompt template should list all of the possible categories a document/text can be classified as, definitions of the categories, and instructions for the LLM to choose a category (including how to handle edge cases). Append this template to the text of each document to form the `prompt` argument for TLM. After running TLM, review the most/least trustworthy LLM predictions and then refine your prompt based on this review.
- If you have some already-labeled examples from different classes, try few-shot prompting, where these examples and their classes are listed within the prompt template.
- You can also try using Structured Outputs, although today's LLMs display lower accuracy in some classification/tagging tasks when required to structure their outputs.
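A sketch of zero-shot classification with `constrain_outputs`, assuming the same `cleanlab_tlm` import and dict-style results as above; the category names and prompt template are illustrative:

```python
# A hedged sketch of zero-shot classification with TLM.
# `constrain_outputs` restricts the returned response to the listed classes.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")  # 'best' preset can boost classification accuracy

categories = ["billing", "technical issue", "account access", "other"]

prompt_template = (
    "Classify the customer message into exactly one of these categories: "
    f"{', '.join(categories)}. "
    "If the message fits multiple categories, choose the most specific one. "
    "Respond with the category name only.\n\nCustomer message: "
)

document = "I was charged twice for my subscription this month."

result = tlm.prompt(prompt_template + document, constrain_outputs=categories)
print(result["response"], result["trustworthiness_score"])
```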
Data Extraction
- Refer to our data extraction tutorial.
- The TLM trustworthiness score tells you which data auto-extracted from documents, databases, or transcripts is worth double-checking to ensure accuracy. Consider running TLM with the 'best' quality preset to boost extraction accuracy as well.
- If you already know which section of your documents contains the relevant information, save cost and boost accuracy by only including text from that part of the document in your prompt (see the sketch after this list).
- If you wish to extract multiple structured data fields from each unstructured document, consider using Structured Outputs.
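A sketch of extracting a single field from just the relevant section of a document and flagging low-trust extractions for review; the section text, score threshold, and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch of extracting one field from the relevant document section
# and flagging low-trust extractions for human review.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")

# Only the relevant section of the document, not the full text.
invoice_section = "Invoice #4821\nTotal due: $1,245.00\nDue date: 2024-07-15"

prompt = (
    "Extract the total amount due from the invoice text below. "
    "Respond with the amount only.\n\n" + invoice_section
)

result = tlm.prompt(prompt)
if result["trustworthiness_score"] < 0.8:  # illustrative threshold
    print("Flag for human review:", result["response"])
else:
    print("Auto-accepted:", result["response"])
```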
Data Annotation/Labeling
- Refer to our Data Annotation tutorial. Also check out the various tips/tutorials on using TLM for classification, structured outputs, and data extraction – these cover ideas useful for data annotation as well.
- LLMs (including TLM) can handle most types of data labeling, including text categorization, document tagging, entity recognition / PII detection, and more complex annotation tasks. The TLM trustworthiness scores additionally reveal which subset of the data the LLM can confidently handle. Let the LLM auto-label the 99% of cases where it is trustworthy, and manually label the remaining 1% (see the sketch after this list).
- Consider running TLM with the 'best' quality preset to boost auto-labeling accuracy in addition to scoring trustworthiness.
- TLM can also detect labeling errors made by human annotators (examples where TLM confidently assigns a different label than the human annotator).
- Provide detailed annotation instructions and example annotations in TLM's `prompt` argument – at least the same level of detail as your human annotator instructions (preferably more, since LLMs can quickly process more information than humans).
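A sketch of routing annotation work based on trustworthiness: auto-label where TLM is confident, send the rest to humans, and flag possible human label errors. The labels, threshold, and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch of trust-based annotation routing.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")

labels = ["spam", "not spam"]
instructions = (
    "Label the message as 'spam' or 'not spam'. Promotional bulk email counts as spam; "
    "personal or transactional messages do not. Respond with the label only.\n\nMessage: "
)

data = [
    {"text": "WIN a FREE cruise!!! Click now", "human_label": "not spam"},
    {"text": "Lunch tomorrow at noon?", "human_label": "not spam"},
]

for example in data:
    result = tlm.prompt(instructions + example["text"], constrain_outputs=labels)
    if result["trustworthiness_score"] >= 0.9:  # illustrative threshold
        if result["response"] != example["human_label"]:
            print("Possible human label error:", example["text"])
        else:
            print("Auto-labeled:", example["text"], "->", result["response"])
    else:
        print("Send to human annotator:", example["text"])
```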
Summarization
Include specific instructions in your prompt, such as the desired length of the summary, format, and what types of information/concepts are most/least important to include.
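A sketch of a summarization prompt with explicit length/format instructions, scored by TLM; the prompt wording and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch of summarization with explicit instructions about length,
# format, and which information matters most.
from cleanlab_tlm import TLM

tlm = TLM()

article = (
    "Acme Corp announced second-quarter revenue of $12M on July 10, up 8% year over year. "
    "The company also said a new product line will launch on September 1."
)

prompt = (
    "Summarize the article below in at most 3 bullet points. "
    "Prioritize financial figures and announced dates; omit background history.\n\n"
    + article
)

result = tlm.prompt(prompt)
print(result["response"])
print("Trustworthiness:", result["trustworthiness_score"])
```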
Conversational Chat (handling system prompts and message history)
For chatbots: TLM’s trustworthiness scoring can be useful for automated escalation to a human agent, or to flag key responses as potentially untrustworthy to your users.
TLM remains effective when system prompts and past message history are included in its `prompt` argument in various formats. For example, you could set TLM's `prompt` to the following string (which implies the next answer will come from the AI):
```
AI System Instructions: You are a customer support agent representing company XYZ.
User: hi
AI Assistant: How can I help you?
User: can I return my earrings?
AI Assistant:
```
This is also how packages like LangChain handle conversation history.
You can alternatively use OpenAI’s conversation history and system prompt handling, by running TLM via the OpenAI API.
In open-ended conversational chat applications: You may not want to rely on TLM’s trustworthiness score for every AI response, but rather only for verifiable statements that convey information. You can run TLM with a custom evaluation criteria like the following:
Determine whether the response is non-propositional, in which case it is great. Otherwise it is a bad response if it conveys any specific information or facts, or otherwise seems like an answer whose accuracy could matter.
Then, only consider TLM's trustworthiness score for responses whose custom evaluation score is low (i.e. responses that convey verifiable information whose accuracy matters).
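A sketch of adding such a custom evaluation criterion; the exact option name and the format of per-criterion scores in the result are assumptions here, so consult the Custom Evaluation Criteria tutorial for the precise interface:

```python
# A hedged sketch of a custom evaluation criterion for open-ended chat.
# The option name "custom_eval_criteria" and its format are assumptions;
# see the Custom Evaluation Criteria tutorial for the exact interface.
from cleanlab_tlm import TLM

criteria = (
    "Determine whether the response is non-propositional, in which case it is great. "
    "Otherwise it is a bad response if it conveys any specific information or facts, "
    "or otherwise seems like an answer whose accuracy could matter."
)

tlm = TLM(options={"custom_eval_criteria": [{"name": "non_propositional", "criteria": criteria}]})

result = tlm.prompt("User: can I return my earrings?\nAI Assistant:")
# Per-criterion scores appear alongside the trustworthiness score in the
# returned result (exact keys per the tutorial); inspect the full result:
print(result)
```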
Non-Text Response Types: Structured Outputs, Function Calling, …
Currently, you must use TLM via the OpenAI API to handle non-standard output types. Used this way, TLM can score the trustworthiness of every type of output that OpenAI can return.
LLM Evals, or improving LLM fine-tuning
For LLM Evals, use TLM to quickly find bad LLM responses in your logs. For improving LLM fine-tuning, use TLM to find bad training data and then filter/correct it.
The relevant tutorials in our documentation can help.
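A sketch of scoring logged (prompt, response) pairs to surface the worst LLM outputs (or worst fine-tuning examples) for review; the data and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch: score each logged (prompt, response) pair, then review or
# filter the least trustworthy pairs first.
from cleanlab_tlm import TLM

tlm = TLM()

logged_pairs = [
    {"prompt": "What year was the Eiffel Tower completed?", "response": "1899"},  # a bad logged response
    {"prompt": "What is 2 + 2?", "response": "4"},
]

for pair in logged_pairs:
    score = tlm.get_trustworthiness_score(pair["prompt"], pair["response"])
    pair["trustworthiness_score"] = score["trustworthiness_score"]

# Review (or filter out) the least trustworthy pairs first.
for pair in sorted(logged_pairs, key=lambda p: p["trustworthiness_score"])[:10]:
    print(pair)
```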
Recommended TLM configurations to try
TLM offers optional configurations. The default TLM configuration is not latency/cost-optimized because it must remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency/cost without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application. If TLM's default configuration seems ineffective, switch to a more powerful `model` (e.g. "o3-mini", "o1", or "claude-3.5-sonnet-v2") or add custom evaluation criteria.
We list some good configurations to try out below. Each can be copy/pasted into the initialization arguments for the TLM object:
tlm = TLM(<configuration>)
For low latency (real-time applications):
quality_preset = "base"
or:
quality_preset = "low", options = {"model": "nova-micro"} # consider "base" instead of "low"
or:
quality_preset = "low", options = {"reasoning_effort": "none", "similarity_measure": "string"}
For better trustworthiness scoring:
options = {"model": "gpt-4o"}
For more accurate LLM responses:
quality_preset = "best", options = {"model": "o3-mini"}
# Or instead of "o3-mini", consider: "claude-3.5-sonnet-v2", "o1", or "gpt-4o"
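A sketch of initializing TLM with one of these configurations and using it, assuming the `cleanlab_tlm` import from the quickstart tutorial:

```python
# A hedged sketch: a low-latency configuration for real-time scoring.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="low", options={"model": "nova-micro"})

score = tlm.get_trustworthiness_score(
    "What is the capital of Australia?",
    "The capital of Australia is Canberra.",
)
print(score["trustworthiness_score"])
```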
Frequently Asked Questions
How to reduce latency and get faster results? Or reduce costs?
The default TLM settings are not latency-optimized because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application.
- You can stream in a response from any (fast) LLM you are using, and then use `TLM.get_trustworthiness_score()` to subsequently stream in the trustworthiness score for that response. This section from the Trustworthy RAG tutorial demonstrates this. If you run TLM with a lower `quality_preset` (e.g. "low" or "base") and a cheaper `model` (e.g. "nova-micro"), then the additional cost/runtime of trustworthiness scoring can be only a fraction of the cost/runtime of producing the response with your own LLM (see the sketch after this list).
- Reduce the `quality_preset` setting (e.g. to "low" or "base").
- Specify `TLMOptions` to further reduce TLM runtimes by: changing `model` to a faster base LLM (e.g. "nova-micro"), lowering `reasoning_effort` (to "low" or "none"), changing `similarity_measure` to "string", and reducing `max_tokens` or other values in `TLMOptions`.
- If you're willing to wait for a high-quality response but want a lower-latency trustworthiness score, try TLM Lite.
- Enterprises can further reduce latency via private TLM deployment in your own VPC, especially if you have provisioned throughput with a major LLM provider and optimized infrastructure. TLM requires no additional infrastructure to maintain. Reach out to learn more.
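A sketch of the streaming pattern from the first bullet above: stream the response from your own LLM (shown here with the `openai` package purely as an example), then score it with a low-latency TLM configuration:

```python
# A hedged sketch: stream the response from your own LLM, then score it
# afterwards with a low-latency TLM configuration.
from openai import OpenAI
from cleanlab_tlm import TLM

client = OpenAI()
tlm = TLM(quality_preset="low", options={"model": "nova-micro"})

prompt = "Can I return my earrings after 45 days?"

# Stream the response tokens to the user as they arrive.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    response += delta
    print(delta, end="", flush=True)

# Score the completed response afterwards.
score = tlm.get_trustworthiness_score(prompt, response)
print("\nTrustworthiness:", score["trustworthiness_score"])
```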
How much does TLM cost?
You can try TLM for free! Sign up for a Cleanlab account here to get your API key, and have fun trying TLM in your LLM workflows.
Once your free trial tokens are used up, you can continue using this same TLM API on a pay-per-token plan. You can see the pricing in your Cleanlab Account under Usage & Billing. Note that TLM offers many base LLM models and configuration settings like quality presets, giving you flexible pricing options to suit your needs.
The default TLM settings are more expensive because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly reduce costs without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce costs for your application. For instance, you can reduce costs significantly via TLM Lite.
Enterprise subscriptions are available with: volume discounts, private deployment options, and additional features. Reach out to learn more.
Why did TLM return a low trustworthiness score?
Our Advanced Tutorial demonstrates how to activate explanations and understand why a particular response is considered untrustworthy.
Why don’t TLM trust scores align with my team’s human evaluations of LLM outputs?
Our Custom Evaluation Criteria tutorial demonstrates how to better tailor TLM for response quality ratings specific to your use-case.
Also try specifying a more powerful `model` (e.g. "o3-mini", "o1", or "gpt-4o") in the initialization of TLM.
Why should I trust the TLM trustworthiness scores?
For transparency and scientific rigor, we published our state-of-the-art research behind TLM at ACL, the top venue for NLP and Generative AI research. TLM combines all major forms of uncertainty quantification and LLM-based evaluation into one unified framework that comprehensively detects different types of LLM mistakes.
Ultimately what matters is whether TLM actually detects LLM errors in real applications. Rigorous benchmarks reveal that TLM trustworthiness scores detect wrong responses with significantly greater precision than alternative approaches like: token probabilities (logprobs), or asking the LLM to directly evaluate the response (LLM-as-judge). Such findings hold across diverse use-cases, domains, and all major LLMs including reasoning models. In extensive RAG benchmarks, TLM detected incorrect RAG responses with significantly greater precision than alternatives including: RAGAS, LLM-as-judge, G-Eval, DeepEval, HHEM, Lynx, Prometheus-2, or LogProbs.
Additional accuracy benchmarks reveal that TLM’s trustworthiness score can be used to automatically improve LLM responses themselves (in the same way across many LLM models). This would not be possible if the trustworthiness score were unable to automatically catch incorrect LLM responses.
How does the TLM trustworthiness score work?
TLM scores our confidence that a response is ‘good’ for a given request. In question-answering applications, ‘good’ would correspond to whether the answer is correct or not. In general open-ended applications, ‘good’ corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.
TLM trustworthiness scores capture two aspects of uncertainty and quantify them into a holistic trustworthiness measure:
- aleatoric uncertainty (known unknowns, i.e. uncertainty the model is aware of due to a challenging request; for instance, when a prompt is incomplete/vague).
- epistemic uncertainty (unknown unknowns, i.e. uncertainty due to the model not having been previously trained on data similar to this; for instance, when a prompt is very different from most requests in the LLM training corpus).
These two forms of uncertainty are mathematically quantified in TLM through multiple operations:
- self-reflection: a process in which the LLM is asked to explicitly rate the response and explicitly state how confidently good this response appears to be.
- probabilistic prediction: a process in which we consider the per-token probabilities assigned by a LLM as it generates a response based on the request (auto-regressively token by token).
- observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response).
These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty.
Rigorous benchmarks reveal that TLM trustworthiness scores better detect bad responses than alternative LLM confidence estimators that only quantify aleatoric uncertainty like: per-token probabilities (logprobs), or using LLM to directly rate the response (LLM-as-judge).
For more details on certain foundational components of TLM, refer to our research paper published at ACL, the top venue for NLP and Generative AI research.
Note that TLM does not rely on custom models; it only relies on the base LLM model specified in `TLMOptions`. Since TLM is a wrapper system around leading LLM APIs, it will remain applicable to all future frontier models. TLM does not need to be trained on your data, which means you don't need to do any dataset preparation/labeling, nor worry about data drift or whether your AI task will evolve over time.
Why don’t trustworthiness scores from TLM.prompt() and TLM.get_trustworthiness_score() always match?
These scores are not deterministic: they are computed via multiple (non-deterministic) LLM calls. When re-running TLM on the same prompt, results are cached, so you may get identical results until the cache is refreshed. `TLM.prompt()` additionally considers statistics produced during LLM response generation (such as token probabilities), whereas `TLM.get_trustworthiness_score()` does not.
If you want to use one base LLM model to generate responses and score their trustworthiness with a different (e.g. faster) base LLM model, you can still obtain the `.prompt()`-style trustworthiness score via TLM Lite.
Do you offer private deployments in VPC?
Yes, TLM can be deployed in your company’s own cloud such that all data remains within your private infrastructure. All major cloud providers and LLM models are supported. Reach out to learn more.
My company only uses a proprietary LLM, or a specific LLM provider
You can use `TLM.get_trustworthiness_score()` to score the trustworthiness of responses from any LLM. See our tutorial: Compute Trustworthiness Scores for any LLM.
If you would like to both produce responses and score their trustworthiness using your own custom (private) LLM, get in touch regarding our Enterprise plan. Our TLM technology is compatible with any LLM or Agentic system.
How to run TLM over a big dataset?
Refer to our Advanced Tutorial.
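As a hedged sketch of batch usage (assuming, per the Advanced Tutorial, that `prompt()` also accepts a list of prompts and returns a list of results; see that tutorial for rate-limit and error handling):

```python
# A hedged sketch of running TLM over a dataset in batch.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="base")

prompts = [
    "Classify the sentiment of: 'Great product, fast shipping.'",
    "Classify the sentiment of: 'Arrived broken and support never replied.'",
]

results = tlm.prompt(prompts)  # assumed batch behavior per the Advanced Tutorial
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["response"], result["trustworthiness_score"])
```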
Learn More
Beyond the tutorials in this documentation and tips on this page, you can learn more about TLM via our blog and additional cookbooks. For instance, the TLM demo cookbook provides a concise demo of TLM used across various applications (particularly customer support use-cases).
If your question is not answered here, feel free to ask in our Community Slack, or via email: support@cleanlab.ai