Cheat sheet / FAQ for TLM
This page provides tips on how to best utilize the Trustworthy Language Model in your AI applications. First read the quickstart tutorial and advanced tutorial to get familiar with TLM.
Tips on using TLM for specific tasks
Recall the two ways to use TLM in any LLM application:
- Generate responses via your own LLM, and then use TLM to score their trustworthiness via TLM’s get_trustworthiness_score() method.
- Use TLM in place of your LLM to both generate responses and score their trustworthiness via TLM’s prompt() method.
Choose Option 1 if you want to:
- Stream in responses at the lowest latency, and then subsequently score their trustworthiness.
- Use a specific LLM model not supported within TLM, or keep your LLM inference code as is.
Choose Option 2 if you want to:
- Simplify your application with only a single API call to produce both responses and trust scores.
- Use TLM for improved LLM responses (e.g. via the ‘best’ quality preset). A minimal sketch of both options appears below.
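For illustration, here is a minimal sketch of both options in Python. It assumes the cleanlab_tlm client package and the dictionary keys shown; adjust the import, client setup, and key handling to match your installed Cleanlab SDK.

```python
from cleanlab_tlm import TLM  # assumed package/import name; adjust to your installed Cleanlab client

tlm = TLM(quality_preset="medium")  # assumes your Cleanlab API key is already configured

prompt = "What is the capital of France?"

# Option 1: your own LLM produced the response; TLM only scores it
response_from_your_llm = "Paris"
score = tlm.get_trustworthiness_score(prompt, response_from_your_llm)
print(score["trustworthiness_score"])  # assumed key in the returned score object

# Option 2: TLM both generates the response and scores it
result = tlm.prompt(prompt)
print(result["response"], result["trustworthiness_score"])
```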
In either case, you should ensure TLM receives the same information you’d supply to your own LLM. Here are tips for particular AI applications:
Retrieval-Augmented Generation (RAG)
Refer to our RAG tutorial, particularly the final sections.
Common mistakes when using TLM for RAG include:
- Context, system instructions (e.g. when to abstain or say ‘No information available’), or evaluation criteria (specific requirements for a correct/good response) are missing from the prompt provided to TLM.
- Providing unnecessarily lengthy context in the prompt for TLM.
- Providing low-quality context in the prompt for TLM. This video demonstrates how you can use Cleanlab to curate high-quality documents for RAG systems.
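To avoid these mistakes, assemble a single prompt string containing your system instructions, retrieved context, and the user question before passing it to TLM. The helper below is a hypothetical sketch; the template wording, variable names, and client setup are illustrative, not part of the TLM API.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM()

def build_rag_prompt(system_instructions: str, context: str, question: str) -> str:
    """Combine the same information you would give your own RAG LLM into one prompt for TLM."""
    return (
        f"{system_instructions}\n"
        "If the answer is not contained in the Context, respond 'No information available'.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

rag_prompt = build_rag_prompt(
    system_instructions="You are a customer support assistant for company XYZ.",
    context="...retrieved document chunks go here...",
    question="Can I return my earrings?",
)

# Either have TLM answer and score, or score a response from your own RAG LLM:
result = tlm.prompt(rag_prompt)
# score = tlm.get_trustworthiness_score(rag_prompt, response_from_your_rag_llm)
```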
Classification or Document Tagging
Refer to our Zero-Shot Classification tutorial.
Pass the constrain_outputs keyword argument to TLM.prompt() to restrict the output to your set of classes/categories. Consider running TLM with the ‘best’ quality preset to boost classification accuracy in addition to scoring trustworthiness.
A good prompt template should list all of the possible categories a document can be classified as, definitions of those categories, and instructions for the LLM to choose a category (including how to handle edge cases). Append this template to the text of each document to form the prompt argument for TLM (see the sketch below). If you have a few labeled examples from different classes, try few-shot prompting, where these examples and their classes are listed within the prompt template.
You can also try using Structured Outputs, although today’s LLMs display lower accuracy in some classification/tagging tasks when required to structure their outputs.
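Here is a hypothetical zero-shot classification sketch using constrain_outputs; the category names, definitions, and prompt template are illustrative.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM(quality_preset="best")  # the 'best' preset can boost classification accuracy

categories = ["billing", "shipping", "returns", "other"]  # hypothetical classes

template = (
    "Classify the customer message into exactly one of these categories:\n"
    "- billing: questions about invoices or payments\n"
    "- shipping: questions about delivery status or carriers\n"
    "- returns: questions about returning or exchanging items\n"
    "- other: anything that does not fit the categories above\n\n"
    "Message: {document}\n"
    "Category:"
)

document = "My package never arrived, can you check where it is?"
result = tlm.prompt(template.format(document=document), constrain_outputs=categories)
print(result["response"], result["trustworthiness_score"])
```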
Data Extraction
Refer to our data extraction tutorial.
The TLM trustworthiness score tells you which data auto-extracted from documents, databases, or transcripts is worth double-checking to ensure accuracy. Consider running TLM with the ‘best’ quality preset to boost extraction accuracy as well.
If you already know which section in your documents contains the relevant information, save cost and boost accuracy by only including text from the relevant part of the document in your prompt.
If you wish to extract multiple structured data fields from each unstructured document, consider using Structured Outputs.
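A hypothetical extraction sketch follows; the document text, extracted field, and the 0.8 review threshold are illustrative and should be tuned for your application.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM()

# Only include the relevant section of the document to save cost and boost accuracy.
invoice_section = "...text from the 'Payment Details' section of the invoice..."

extraction_prompt = (
    "Extract the total amount due in USD from the invoice text below.\n"
    "Respond with only the number.\n\n"
    f"Invoice text:\n{invoice_section}"
)

result = tlm.prompt(extraction_prompt)
if result["trustworthiness_score"] < 0.8:  # illustrative threshold
    print("Flag this extraction for human review:", result["response"])
```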
Data Annotation/Labeling
LLMs (including TLM) can handle most types of data labeling, including text categorization, document tagging, entity recognition / PII detection, and more complex annotation tasks. The TLM trustworthiness scores additionally reveal which subset of data the LLM can confidently handle. Let the LLM auto-label the 99% of cases where it is trustworthy, and manually label the remaining 1%.
Consider running TLM with the ‘best’ quality preset to boost auto-labeling accuracy in addition to scoring trustworthiness.
TLM can also detect labeling errors made by human annotators (examples where TLM confidently assigns a different label than the human annotator).
Provide detailed annotation instructions and example annotations in TLM’s prompt argument. Include at least the same level of detail as your human-annotator instructions (preferably more, since LLMs can quickly process more information than humans). Check out the various tips/tutorials on using TLM for classification, structured outputs, and data extraction; these cover ideas useful for data annotation as well.
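A hypothetical auto-labeling loop that routes low-trust examples to human annotators; the labels, threshold, and data are illustrative.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM(quality_preset="best")  # 'best' can boost auto-labeling accuracy

texts = ["I love this product", "Terrible battery life", "It's okay, I guess"]
label_prompt = "Label the sentiment of this review as positive, negative, or neutral:\n\n{text}"

auto_labeled, needs_human_review = [], []
for text in texts:
    result = tlm.prompt(
        label_prompt.format(text=text),
        constrain_outputs=["positive", "negative", "neutral"],
    )
    if result["trustworthiness_score"] >= 0.9:  # illustrative threshold; tune on your data
        auto_labeled.append((text, result["response"]))
    else:
        needs_human_review.append(text)
```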
Summarization
Include specific instructions in your prompt, such as the desired length of the summary, format, and what types of information/concepts are most/least important to include.
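For example, a summarization prompt might spell out length, format, and focus like this; the wording and transcript variable are illustrative.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM()

transcript = "...full text of a customer support call..."
summary_prompt = (
    "Summarize the following support call in at most 3 bullet points.\n"
    "Focus on the customer's issue and how it was resolved; omit greetings and small talk.\n\n"
    f"Transcript:\n{transcript}"
)
result = tlm.prompt(summary_prompt)
```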
Conversational Chat (handling system prompts and message history)
TLM remains effective when system prompts and past message history are included in its prompt argument in various formats. For example, you could set TLM’s prompt to the following string (which implies the next answer will come from the AI):
AI System Instructions: You are a customer support agent representing company XYZ.
User: hi
AI Assistant: How can I help you?
User: can I return my earrings?
AI Assistant:
This is also how packages like LangChain handle conversation history.
You can alternatively use OpenAI’s conversation history and system prompt handling, by running TLM via the OpenAI API.
For chatbots: TLM’s trustworthiness scoring can be useful for automated escalation to a human agent, or to flag key responses as potentially untrustworthy to your users.
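Here is a hypothetical helper that flattens a message history into this kind of prompt string and escalates low-trust responses; the role names, threshold, and helper function are illustrative.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM()

def chat_to_prompt(system_instructions: str, messages: list[dict]) -> str:
    """Flatten a chat history into a single prompt string like the example above."""
    lines = [f"AI System Instructions: {system_instructions}", ""]
    for msg in messages:
        speaker = "User" if msg["role"] == "user" else "AI Assistant"
        lines.append(f"{speaker}: {msg['content']}")
    lines.append("AI Assistant:")  # implies the next answer comes from the AI
    return "\n".join(lines)

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "How can I help you?"},
    {"role": "user", "content": "can I return my earrings?"},
]

chat_prompt = chat_to_prompt("You are a customer support agent representing company XYZ.", history)
result = tlm.prompt(chat_prompt)
if result["trustworthiness_score"] < 0.5:  # illustrative escalation threshold
    print("Escalate this conversation to a human agent.")
```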
Non-Text Response Types: Structured Outputs, Function Calling, ...
Currently, you must use TLM via the OpenAI API to handle non-standard output types. Used this way, TLM can score the trustworthiness of every type of output that OpenAI can return.
LLM Evals, or improving LLM fine-tuning
For LLM Evals, use TLM to quickly find bad LLM responses in your logs. For improving LLM fine-tuning, use TLM to find bad training data and then filter/correct it.
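For instance, a hypothetical eval sweep over logged (prompt, response) pairs might look like the sketch below. It assumes the client accepts lists of prompts/responses for batch scoring (see the Advanced Tutorial for running over big datasets); the column names and 0.6 threshold are illustrative.

```python
import pandas as pd

from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

tlm = TLM()

logs = pd.DataFrame({
    "prompt": ["What is 2+2?", "Summarize our refund policy."],
    "response": ["5", "Refunds are accepted within 30 days of purchase."],
})

# Assumes batch scoring over lists of prompts/responses is supported.
scores = tlm.get_trustworthiness_score(logs["prompt"].tolist(), logs["response"].tolist())
logs["trustworthiness"] = [s["trustworthiness_score"] for s in scores]

# The lowest-scoring rows are the LLM responses (or fine-tuning examples) to inspect first.
bad_rows = logs[logs["trustworthiness"] < 0.6].sort_values("trustworthiness")
```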
The following tutorials can help:
Frequently Asked Questions
How does the TLM trustworthiness score work?
TLM scores our confidence that a response is ‘good’ for a given request. In question-answering applications, ‘good’ would correspond to whether the answer is correct or not. In general open-ended applications, ‘good’ corresponds to whether the response is helpful/informative and clearly better than alternative hypothetical responses. For extremely open-ended requests, TLM trustworthiness scores may not be as useful as for requests that are questions seeking a correct answer.
TLM trustworthiness scores capture two aspects of uncertainty and quantify them into a holistic trustworthiness measure:
- aleatoric uncertainty (known unknowns, i.e. uncertainty the model is aware of, arising from a challenging request; for instance, when a prompt is incomplete or vague).
- epistemic uncertainty (unknown unknowns, i.e. uncertainty arising because the model was not previously trained on data similar to this request; for instance, when a prompt is very different from most requests in the LLM training corpus).
These two forms of uncertainty are mathematically quantified in TLM through multiple operations:
- self-reflection: a process in which the LLM is asked to explicitly rate the response and state how confident it is that the response is good.
- probabilistic prediction: a process in which we consider the per-token probabilities assigned by an LLM as it generates a response based on the request (auto-regressively, token by token).
- observed consistency: a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response).
These operations produce various trustworthiness measures, which are combined into an overall trustworthiness score that captures all relevant types of uncertainty.
Rigorous benchmarks reveal that TLM trustworthiness scores detect bad responses better than alternative LLM confidence estimators that only quantify aleatoric uncertainty, such as per-token probabilities (logprobs) or using an LLM to directly rate the response (LLM-as-judge).
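To make the intuition behind observed consistency concrete, here is a toy illustration. It is not Cleanlab's actual implementation, and call_llm is a hypothetical wrapper around any LLM.

```python
# Toy illustration of the observed-consistency idea (NOT Cleanlab's actual algorithm).
# `call_llm(prompt, temperature)` is a hypothetical function wrapping any LLM.
def toy_observed_consistency(call_llm, prompt: str, candidate_response: str, n_samples: int = 5) -> float:
    samples = [call_llm(prompt, temperature=1.0) for _ in range(n_samples)]
    # Naive agreement check via exact matching; TLM instead uses much more
    # sophisticated measures of contradiction between responses.
    agreements = [s.strip().lower() == candidate_response.strip().lower() for s in samples]
    return sum(agreements) / n_samples  # higher = candidate is more consistent with resampled answers
```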
Why did TLM return a low trustworthiness score?
Our Advanced Tutorial demonstrates how to activate explanations and understand why a particular response is considered untrustworthy.
Why don’t TLM trust scores align with my team’s human evaluations of LLM outputs?
Our Custom Evaluation Criteria tutorial demonstrates how to better tailor TLM for response quality ratings specific to your use-case.
How to reduce latency and get faster results?
You can stream in a response from any fast LLM you are using, and then use TLM.get_trustworthiness_score to subsequently stream in the trustworthiness score for the response. This section from the Trustworthy RAG tutorial demonstrates this.
Reduce the quality_preset setting. Specify TLMOptions to further reduce TLM runtimes by changing model to a faster base LLM, and reducing max_tokens or other values in TLMOptions. If you’re willing to wait for a high-quality response but want a lower-latency trustworthiness score, try TLM Lite.
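A hypothetical lower-latency configuration is sketched below; the preset, model name, and option values are illustrative, so check TLMOptions for the settings supported in your version.

```python
from cleanlab_tlm import TLM  # assumed import; see the earlier sketch

fast_tlm = TLM(
    quality_preset="low",  # fewer internal LLM calls than higher presets
    options={              # assumed TLMOptions fields
        "model": "gpt-4o-mini",  # illustrative faster base LLM
        "max_tokens": 256,       # cap output length to cut generation time
    },
)

score = fast_tlm.get_trustworthiness_score("What is the capital of France?", "Paris")
```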
My company only uses a proprietary LLM, or a specific LLM provider
You can use TLM.get_trustworthiness_score() to score the trustworthiness of responses from any LLM. See our tutorial: Compute Trustworthiness Scores for any LLM.
If you would like to both produce responses and score their trustworthiness using your own custom (private) LLM, get in touch regarding our Enterprise plan. Our TLM technology is compatible with any LLM or Agentic system.
How much does TLM cost?
You can try TLM for free! Sign up for a Cleanlab account here to get your API key, and have fun trying TLM in your LLM workflows.
Once your free trial tokens are used up, you can continue using this same TLM API on a pay-per-token plan. You can see the pricing in your Cleanlab Account under Usage & Billing. Note that TLM offers many base LLM models and configuration settings like quality presets, giving you flexible pricing options to suit your needs.
Enterprise subscriptions are available with: volume discounts, private deployment options, and additional features. Reach out to learn more.
Do you offer private deployments in VPC?
Yes, TLM can be deployed in your company’s own cloud such that all data remains within your private infrastructure. All major cloud providers and LLM models are supported. Reach out to learn more.
How to run TLM over a big dataset?
Refer to our Advanced Tutorial.
Why don't trustworthiness scores from TLM.prompt() and TLM.get_trustworthiness_score() always match?
TLM.prompt() additionally considers statistics produced during LLM response generation (such as token probabilities), whereas TLM.get_trustworthiness_score() does not.
The scores are also non-deterministic, since they are computed via multiple LLM calls. When re-running TLM on the same prompt, results are cached, so you may get identical results until the cache is refreshed.
Learn More
Beyond the tutorials in this documentation and tips on this page, you can learn more about TLM via our blog and additional cookbooks. For instance, the TLM demo cookbook provides a concise demo of TLM used across various applications (particularly customer support use-cases).
If your question is not answered here, feel free to ask in our Community Slack, or via email: support@cleanlab.ai