Cheat sheet / FAQ for TLM
Make sure you’ve read the quickstart tutorial and advanced tutorial.
Tips on using TLM for specific tasks
Recall the two ways to use TLM:
- Option 1: Generate responses via your own LLM, and then use TLM to score their trustworthiness via TLM's `get_trustworthiness_score()` method.
- Option 2: Use TLM in place of your LLM to both generate responses and score their trustworthiness via TLM's `prompt()` method.
Choose Option 1 to:
- Stream in responses at the lowest latency, and then score their trustworthiness.
- Use a specific LLM model not supported within TLM, or keep your existing LLM inference code as is.

Choose Option 2 to:
- Produce both responses and trust scores with a single API call.
- Get improved LLM responses from TLM itself (e.g. via the 'best' quality preset).

A minimal sketch of both options appears below.
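Here is a minimal sketch of both options, assuming the `TLM` client from the `cleanlab_tlm` package (per the quickstart tutorial) and dict-style results with `response`/`trustworthiness_score` keys; check the tutorials for the exact interface.

```python
# A minimal sketch of both options. Assumes the TLM client from the
# `cleanlab_tlm` package and that results are returned as dicts with
# "response" / "trustworthiness_score" keys.
from cleanlab_tlm import TLM

tlm = TLM()  # default configuration

prompt = "What is the capital of Australia?"

# Option 1: the response comes from your own LLM, TLM only scores it.
response_from_your_llm = "The capital of Australia is Canberra."
score = tlm.get_trustworthiness_score(prompt, response_from_your_llm)
print(score["trustworthiness_score"])

# Option 2: TLM generates the response and scores it in one call.
result = tlm.prompt(prompt)
print(result["response"], result["trustworthiness_score"])
```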
In either case, ensure TLM receives the same information/prompt you’d supply to your own LLM. More tips for particular AI applications:
Retrieval-Augmented Generation (RAG)
Refer to our RAG tutorial, particularly the final sections.
Common mistakes when using TLM for RAG include:
- Context, system instructions (e.g. when to abstain or say 'No information available'), or evaluation criteria (specific requirements for a correct/good response) are missing from the `prompt` provided to TLM (see the example prompt after this list).
- Providing unnecessarily lengthy context in the `prompt` for TLM.
- Providing low-quality context in the `prompt` for TLM. This video demonstrates how you can use Cleanlab to curate high-quality documents for RAG systems.
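For example, a RAG prompt for TLM might bundle the same system instructions, retrieved context, and user question that your own LLM saw when generating the response. A minimal sketch, where the instruction wording, variable names, and `cleanlab_tlm` import are illustrative assumptions:

```python
# Illustrative only: include system instructions, retrieved context, and the
# user question in the same prompt that TLM scores.
from cleanlab_tlm import TLM

tlm = TLM()

system_instructions = (
    "Answer the question using only the provided context. "
    "If the context does not contain the answer, say 'No information available'."
)
retrieved_context = "Acme's return policy allows returns within 30 days of purchase."
user_question = "Can I return an item I bought six weeks ago?"
rag_response = "No, returns are only accepted within 30 days of purchase."

prompt = f"{system_instructions}\n\nContext:\n{retrieved_context}\n\nQuestion: {user_question}"

score = tlm.get_trustworthiness_score(prompt, rag_response)
print(score["trustworthiness_score"])
```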
In RAG: what makes TLM different than groundedness or faithfulness scores like RAGAS/DeepEval?
While some developers measure many scores to debug RAG system components, what matters most to your users is whether the RAG system answered correctly or not. TLM trustworthiness scores detect incorrect RAG responses in real time with 3x greater precision than groundedness/faithfulness scores like RAGAS (see benchmarks). Groundedness/faithfulness measures like RAGAS only estimate discrepancies between the RAG response and the retrieved context, and thus only detect certain response errors. TLM relies on state-of-the-art model uncertainty estimation, which detects the same discrepancies but also issues such as: the response is not a good answer for the user's query (LLMs often make reasoning/factual errors), the query is complex/vague, or the retrieved context is confusing, irrelevant, or insufficient for a proper answer.
Beyond TLM's built-in score for response trustworthiness, you can also use TLM's custom evaluation criteria to simultaneously score groundedness, faithfulness, abstention, context-relevance, or custom properties of your RAG system – all more efficiently and reliably than tools like RAGAS or DeepEval.
Classification
- Refer to our Zero-Shot Classification tutorial, as well as our Data Annotation tutorial.
- Pass the `constrain_outputs` keyword argument to `TLM.prompt()` to restrict the output to your set of classes/categories (see the sketch after this list).
- Consider running TLM with the 'best' quality preset to boost classification accuracy in addition to scoring trustworthiness.
- A good prompt template should list all of the possible categories a document/text can be classified as, definitions of the categories, and instructions for the LLM to choose a category (including how to handle edge cases). Append this template to the text of each document to form the `prompt` argument for TLM. After running TLM, review the most/least trustworthy LLM predictions and then refine your prompt based on this review.
- If you have some already-labeled examples from different classes, try few-shot prompting, where these examples and their classes are listed within the prompt template.
- You can also try using Structured Outputs, although today's LLMs display lower accuracy in some classification/tagging tasks when required to structure their outputs.
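A sketch of zero-shot classification with `constrain_outputs`, assuming the same `cleanlab_tlm` import and dict-style results as above; the category names and prompt template are illustrative:

```python
# A hedged sketch of zero-shot classification with TLM.
# `constrain_outputs` restricts the returned response to the listed classes.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")  # 'best' preset can boost classification accuracy

categories = ["billing", "technical issue", "account access", "other"]

prompt_template = (
    "Classify the customer message into exactly one of these categories: "
    f"{', '.join(categories)}. "
    "If the message fits multiple categories, choose the most specific one. "
    "Respond with the category name only.\n\nCustomer message: "
)

document = "I was charged twice for my subscription this month."

result = tlm.prompt(prompt_template + document, constrain_outputs=categories)
print(result["response"], result["trustworthiness_score"])
```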
Data Extraction
- Refer to our data extraction tutorial.
- The TLM trustworthiness score tells you which data auto-extracted from documents, databases, or transcripts is worth double-checking to ensure accuracy. Consider running TLM with the 'best' quality preset to boost extraction accuracy as well.
- If you already know which section of your documents contains the relevant information, save cost and boost accuracy by only including text from that part of the document in your prompt (see the sketch after this list).
- If you wish to extract multiple structured data fields from each unstructured document, consider using Structured Outputs.
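A sketch of extracting a single field from just the relevant section of a document and flagging low-trust extractions for review; the section text, score threshold, and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch of extracting one field from the relevant document section
# and flagging low-trust extractions for human review.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")

# Only the relevant section of the document, not the full text.
invoice_section = "Invoice #4821\nTotal due: $1,245.00\nDue date: 2024-07-15"

prompt = (
    "Extract the total amount due from the invoice text below. "
    "Respond with the amount only.\n\n" + invoice_section
)

result = tlm.prompt(prompt)
if result["trustworthiness_score"] < 0.8:  # illustrative threshold
    print("Flag for human review:", result["response"])
else:
    print("Auto-accepted:", result["response"])
```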
Data Annotation/Labeling
- Refer to our Data Annotation tutorial. Also check out the various tips/tutorials on using TLM for classification, structured outputs, and data extraction – these cover ideas useful for data annotation as well.
- LLMs (including TLM) can handle most types of data labeling, including text categorization, document tagging, entity recognition / PII detection, and more complex annotation tasks. The TLM trustworthiness scores additionally reveal which subset of the data the LLM can confidently handle. Let the LLM auto-label the 99% of cases where it is trustworthy, and manually label the remaining 1% (see the sketch after this list).
- Consider running TLM with the 'best' quality preset to boost auto-labeling accuracy in addition to scoring trustworthiness.
- TLM can also detect labeling errors made by human annotators (examples where TLM confidently assigns a different label than the human annotator).
- Provide detailed annotation instructions and example annotations in TLM's `prompt` argument – at least the same level of detail as your human annotator instructions (preferably more, since LLMs can quickly process more information than humans).
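A sketch of routing annotation work based on trustworthiness: auto-label where TLM is confident, send the rest to humans, and flag possible human label errors. The labels, threshold, and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch of trust-based annotation routing.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="best")

labels = ["spam", "not spam"]
instructions = (
    "Label the message as 'spam' or 'not spam'. Promotional bulk email counts as spam; "
    "personal or transactional messages do not. Respond with the label only.\n\nMessage: "
)

data = [
    {"text": "WIN a FREE cruise!!! Click now", "human_label": "not spam"},
    {"text": "Lunch tomorrow at noon?", "human_label": "not spam"},
]

for example in data:
    result = tlm.prompt(instructions + example["text"], constrain_outputs=labels)
    if result["trustworthiness_score"] >= 0.9:  # illustrative threshold
        if result["response"] != example["human_label"]:
            print("Possible human label error:", example["text"])
        else:
            print("Auto-labeled:", example["text"], "->", result["response"])
    else:
        print("Send to human annotator:", example["text"])
```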
Summarization
Include specific instructions in your prompt, such as the desired length of the summary, format, and what types of information/concepts are most/least important to include.
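A sketch of a summarization prompt with explicit length/format instructions, scored by TLM; the prompt wording and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch of summarization with explicit instructions about length,
# format, and which information matters most.
from cleanlab_tlm import TLM

tlm = TLM()

article = (
    "Acme Corp announced second-quarter revenue of $12M on July 10, up 8% year over year. "
    "The company also said a new product line will launch on September 1."
)

prompt = (
    "Summarize the article below in at most 3 bullet points. "
    "Prioritize financial figures and announced dates; omit background history.\n\n"
    + article
)

result = tlm.prompt(prompt)
print(result["response"])
print("Trustworthiness:", result["trustworthiness_score"])
```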
Conversational Chat (handling system prompts and message history)
For chatbots: TLM’s trustworthiness scoring can be useful for automated escalation to a human agent, or to flag key responses as potentially untrustworthy to your users.
TLM remains effective when system prompts and past message history are included in its `prompt` argument in various formats. For example, you could set TLM's `prompt` to the following string (which implies the next answer will come from the AI):
```
AI System Instructions: You are a customer support agent representing company XYZ.
User: hi
AI Assistant: How can I help you?
User: can I return my earrings?
AI Assistant:
```
This is also how packages like LangChain handle conversation history.
You can alternatively use OpenAI’s conversation history and system prompt handling, by running TLM via the OpenAI API.
In open-ended conversational chat applications: You may not want to rely on TLM’s trustworthiness score for every AI response, but rather only for verifiable statements that convey information. You can run TLM with a custom evaluation criteria like the following:
Determine whether the response is non-propositional, in which case it is great. Otherwise it is a bad response if it conveys any specific information or facts, or otherwise seems like an answer whose accuracy could matter.
Then, only consider TLM's trustworthiness score for responses whose custom evaluation score is low (i.e. responses that convey verifiable information whose accuracy matters).
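A sketch of adding such a custom evaluation criterion; the exact option name and the format of per-criterion scores in the result are assumptions here, so consult the Custom Evaluation Criteria tutorial for the precise interface:

```python
# A hedged sketch of a custom evaluation criterion for open-ended chat.
# The option name "custom_eval_criteria" and its format are assumptions;
# see the Custom Evaluation Criteria tutorial for the exact interface.
from cleanlab_tlm import TLM

criteria = (
    "Determine whether the response is non-propositional, in which case it is great. "
    "Otherwise it is a bad response if it conveys any specific information or facts, "
    "or otherwise seems like an answer whose accuracy could matter."
)

tlm = TLM(options={"custom_eval_criteria": [{"name": "non_propositional", "criteria": criteria}]})

result = tlm.prompt("User: can I return my earrings?\nAI Assistant:")
# Per-criterion scores appear alongside the trustworthiness score in the
# returned result (exact keys per the tutorial); inspect the full result:
print(result)
```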
Non-Text Response Types: Structured Outputs, Function Calling, …
Currently, you must use TLM via the OpenAI API to handle non-standard output types. Used this way, TLM can score the trustworthiness of every type of output that OpenAI can return.
LLM Evals, or improving LLM fine-tuning
For LLM Evals, use TLM to quickly find bad LLM responses in your logs. For improving LLM fine-tuning, use TLM to find bad training data and then filter/correct it.
The relevant tutorials in our documentation can help.
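A sketch of scoring logged (prompt, response) pairs to surface the worst LLM outputs (or worst fine-tuning examples) for review; the data and `cleanlab_tlm` import are illustrative assumptions:

```python
# A hedged sketch: score each logged (prompt, response) pair, then review or
# filter the least trustworthy pairs first.
from cleanlab_tlm import TLM

tlm = TLM()

logged_pairs = [
    {"prompt": "What year was the Eiffel Tower completed?", "response": "1899"},  # a bad logged response
    {"prompt": "What is 2 + 2?", "response": "4"},
]

for pair in logged_pairs:
    score = tlm.get_trustworthiness_score(pair["prompt"], pair["response"])
    pair["trustworthiness_score"] = score["trustworthiness_score"]

# Review (or filter out) the least trustworthy pairs first.
for pair in sorted(logged_pairs, key=lambda p: p["trustworthiness_score"])[:10]:
    print(pair)
```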
Recommended TLM configurations to try
TLM offers optional configurations. The default TLM configuration is not latency/cost-optimized because it must remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency/cost without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application. If TLM's default configuration seems ineffective, switch to a more powerful `model` (e.g. "o3-mini", "o1", or "claude-3.5-sonnet-v2") or add custom evaluation criteria.
We list some good configurations to try out below. Each can be copy/pasted into the initialization arguments for the TLM object:
tlm = TLM(<configuration>)
For low latency (real-time applications):
quality_preset = "base"
or:
quality_preset = "low", options = {"model": "nova-micro"} # consider "base" instead of "low"
or:
quality_preset = "low", options = {"reasoning_effort": "none", "similarity_measure": "string"}
For better trustworthiness scoring:
options = {"model": "gpt-4o"}
For more accurate LLM responses:
quality_preset = "best", options = {"model": "o3-mini"}
# Or instead of "o3-mini", consider: "claude-3.5-sonnet-v2", "o1", or "gpt-4o"
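A sketch of initializing TLM with one of these configurations and using it, assuming the `cleanlab_tlm` import from the quickstart tutorial:

```python
# A hedged sketch: a low-latency configuration for real-time scoring.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="low", options={"model": "nova-micro"})

score = tlm.get_trustworthiness_score(
    "What is the capital of Australia?",
    "The capital of Australia is Canberra.",
)
print(score["trustworthiness_score"])
```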
Frequently Asked Questions
How to reduce latency and get faster results? Or reduce costs?
The default TLM settings are not latency-optimized because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly improve latency without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce latency for your application.
- You can stream in a response from any (fast) LLM you are using, and then use `TLM.get_trustworthiness_score()` to subsequently stream in the trustworthiness score for that response. This section from the Trustworthy RAG tutorial demonstrates this. If you run TLM with a lower `quality_preset` (e.g. "low" or "base") and a cheaper `model` (e.g. "nova-micro"), then the additional cost/runtime of trustworthiness scoring can be only a fraction of the cost/runtime of producing the response with your own LLM (see the sketch after this list).
- Reduce the `quality_preset` setting (e.g. to "low" or "base").
- Specify `TLMOptions` to further reduce TLM runtimes by: changing `model` to a faster base LLM (e.g. "nova-micro"), lowering `reasoning_effort` (to "low" or "none"), changing `similarity_measure` to "string", and reducing `max_tokens` or other values in `TLMOptions`.
- If you're willing to wait for a high-quality response but want a lower-latency trustworthiness score, try TLM Lite.
- Enterprises can further reduce latency via private TLM deployment in your own VPC, especially if you have provisioned throughput with a major LLM provider and optimized infrastructure. TLM requires no additional infrastructure to maintain. Reach out to learn more.
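A sketch of the streaming pattern from the first bullet above: stream the response from your own LLM (shown here with the `openai` package purely as an example), then score it with a low-latency TLM configuration:

```python
# A hedged sketch: stream the response from your own LLM, then score it
# afterwards with a low-latency TLM configuration.
from openai import OpenAI
from cleanlab_tlm import TLM

client = OpenAI()
tlm = TLM(quality_preset="low", options={"model": "nova-micro"})

prompt = "Can I return my earrings after 45 days?"

# Stream the response tokens to the user as they arrive.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    response += delta
    print(delta, end="", flush=True)

# Score the completed response afterwards.
score = tlm.get_trustworthiness_score(prompt, response)
print("\nTrustworthiness:", score["trustworthiness_score"])
```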
How much does TLM cost?
You can try TLM for free! Sign up for a Cleanlab account here to get your API key, and have fun trying TLM in your LLM workflows.
Once your free trial tokens are used up, you can continue using this same TLM API on a pay-per-token plan. You can see the pricing in your Cleanlab Account under Usage & Billing. Note that TLM offers many base LLM models and configuration settings like quality presets, giving you flexible pricing options to suit your needs.
The default TLM settings are more expensive because they have to remain effective across all possible LLM use-cases. For your specific use-case, you can greatly reduce costs without compromising results. Strategy: first run TLM with default settings to see what results look like over a dataset from your use-case; once results look promising, adjust the TLM preset/options/model to reduce costs for your application. For instance, you can reduce costs significantly via TLM Lite.
Enterprise subscriptions are available with: volume discounts, private deployment options, and additional features. Reach out to learn more.
Why did TLM return a low trustworthiness score?
Our Advanced Tutorial demonstrates how to activate explanations and understand why a particular response is considered untrustworthy.
Why don’t TLM trust scores align with my team’s human evaluations of LLM outputs?
Our Custom Evaluation Criteria tutorial demonstrates how to better tailor TLM for response quality ratings specific to your use-case.
Also try specifying a more powerful `model` (e.g. "o3-mini", "o1", or "gpt-4o") in the initialization of TLM.
Why should I trust the TLM trustworthiness scores?
For transparency and scientific rigor, we published our state-of-the-art research behind TLM at ACL, the top venue for NLP and Generative AI research. TLM combines all major forms of uncertainty quantification and LLM-based evaluation into one unified framework that comprehensively detects different types of LLM mistakes.
Ultimately what matters is whether TLM actually detects LLM errors in real applications. Rigorous benchmarks reveal that TLM trustworthiness scores detect wrong responses with significantly greater precision than alternative approaches like: token probabilities (logprobs), or asking the LLM to directly evaluate the response (LLM-as-judge). Such findings hold across diverse use-cases, domains, and all major LLMs including reasoning models. In extensive RAG benchmarks, TLM detected incorrect RAG responses with significantly greater precision than alternatives including: RAGAS, LLM-as-judge, G-Eval, DeepEval, HHEM, Lynx, Prometheus-2, or LogProbs.
Additional accuracy benchmarks reveal that TLM’s trustworthiness score can be used to automatically improve LLM responses themselves (in the same way across many LLM models). This would not be possible if the trustworthiness score were unable to automatically catch incorrect LLM responses.
How does the TLM trustworthiness score work?
TLM scores our confidence that a response is ‘good’ for a given request. In question-answering applications, ‘good’ would correspond to whether the answer is correct or not. In general open-ended applications, ‘good’ corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.
TLM trustworthiness scores capture two aspects of uncertainty and quantify them into a holistic trustworthiness measure:
- aleatoric uncertainty (known unknowns, i.e. uncertainty the model is aware of due to a challenging request; for instance, when a prompt is incomplete/vague).
- epistemic uncertainty (unknown unknowns, i.e. uncertainty due to the model not having been previously trained on data similar to this; for instance, when a prompt is very different from most requests in the LLM training corpus).
These two forms of uncertainty are mathematically quantified in TLM through multiple operations:
- self-reflection: a process in which the LLM is asked to explicitly rate the response and explicitly state how confidently good this response appears to be.
- probabilistic prediction: a process in which we consider the per-token probabilities assigned by a LLM as it generates a response based on the request (auto-regressively token by token).
- observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response).
These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty.
Rigorous benchmarks reveal that TLM trustworthiness scores better detect bad responses than alternative LLM confidence estimators that only quantify aleatoric uncertainty like: per-token probabilities (logprobs), or using LLM to directly rate the response (LLM-as-judge).
For more details on certain foundational components of TLM, refer to our research paper published at ACL, the top venue for NLP and Generative AI research.
Note that TLM does not rely on custom models; it only relies on the base LLM model specified in `TLMOptions`. Since TLM is a wrapper system around leading LLM APIs, it will remain applicable to all future frontier models. TLM does not need to be trained on your data, which means you don't need to do any dataset preparation/labeling, nor worry about data drift or whether your AI task will evolve over time.
Why don’t trustworthiness scores from TLM.prompt() and TLM.get_trustworthiness_score() always match?
These scores are not deterministic: they are computed via multiple (non-deterministic) LLM calls. When re-running TLM on the same prompt, results are cached, so you may get identical results until the cache is refreshed. `TLM.prompt()` additionally considers statistics produced during LLM response generation (such as token probabilities), whereas `TLM.get_trustworthiness_score()` does not.
If you want to use one base LLM model to generate responses and score their trustworthiness with a different (e.g. faster) base LLM model, you can still obtain the `.prompt()`-style trustworthiness score via TLM Lite.
Do you offer private deployments in VPC?
Yes, TLM can be deployed in your company’s own cloud such that all data remains within your private infrastructure. All major cloud providers and LLM models are supported. Reach out to learn more.
My company only uses a proprietary LLM, or a specific LLM provider
You can use `TLM.get_trustworthiness_score()` to score the trustworthiness of responses from any LLM. See our tutorial: Compute Trustworthiness Scores for any LLM.
If you would like to both produce responses and score their trustworthiness using your own custom (private) LLM, get in touch regarding our Enterprise plan. Our TLM technology is compatible with any LLM or Agentic system.
How to run TLM over a big dataset?
Refer to our Advanced Tutorial.
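As a hedged sketch of batch usage (assuming, per the Advanced Tutorial, that `prompt()` also accepts a list of prompts and returns a list of results; see that tutorial for rate-limit and error handling):

```python
# A hedged sketch of running TLM over a dataset in batch.
from cleanlab_tlm import TLM

tlm = TLM(quality_preset="base")

prompts = [
    "Classify the sentiment of: 'Great product, fast shipping.'",
    "Classify the sentiment of: 'Arrived broken and support never replied.'",
]

results = tlm.prompt(prompts)  # assumed batch behavior per the Advanced Tutorial
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["response"], result["trustworthiness_score"])
```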
Learn More
Beyond the tutorials in this documentation and tips on this page, you can learn more about TLM via our blog and additional cookbooks. For instance, the TLM demo cookbook provides a concise demo of TLM used across various applications (particularly customer support use-cases).
If your question is not answered here, feel free to ask in our Community Slack, or via email: support@cleanlab.ai