module cleanlab_tlm.tlm
Cleanlab’s Trustworthy Language Model (TLM) is a large language model that gives more reliable answers and quantifies its uncertainty in these answers.
Learn how to use TLM via the quickstart tutorial.
class TLM
Represents a Trustworthy Language Model (TLM) instance, which is bound to a Cleanlab TLM account.
The TLM object can be used as a drop-in replacement for an LLM, or for scoring the trustworthiness of arbitrary text prompt/response pairs.
Advanced users can optionally specify TLM configuration options. The documentation below summarizes these options; more details are explained in the Advanced TLM tutorial.
Args:
- quality_preset ({“base”, “low”, “medium”, “high”, “best”}, default = “medium”): an optional preset configuration to control the quality of TLM responses and trustworthiness scores vs. latency/costs.
The “best” and “high” presets auto-improve LLM responses, with “best” also returning more reliable trustworthiness scores than “high”. The “medium”, “low”, and “base” presets return standard LLM responses along with associated trustworthiness scores, with “medium” producing more reliable trustworthiness scores than “low”. The “base” preset provides the lowest possible latency/cost.
Higher presets have increased runtime and cost. Reduce your preset if you see token-limit errors. Details about each preset are documented in TLMOptions. Ignore the “best” and “high” presets if you just want trustworthiness scores (i.e. are using TLM.get_trustworthiness_score() rather than TLM.prompt()). These presets can additionally improve the LLM response itself, but do not return more reliable trustworthiness scores than the “medium” or “low” presets.
- task ({“default”, “classification”, “code_generation”}, default = “default”): determines details of the algorithm used for scoring LLM response trustworthiness (similar to quality_preset).
  - “default”: use for general tasks such as question-answering, summarization, extraction, etc.
  - “classification”: use for classification tasks, where the response is a categorical prediction. When using this task type, constrain_outputs must be provided in the prompt() and get_trustworthiness_score() methods.
  - “code_generation”: use for code generation tasks.
- options (TLMOptions, optional): a typed dict of advanced configurations you can optionally specify. Available options (keys in this dict) include “model”, “max_tokens”, “num_candidate_responses”, “num_consistency_samples”, “use_self_reflection”, “similarity_measure”, “reasoning_effort”, “log”, “custom_eval_criteria”. See detailed documentation under TLMOptions. If specified, these override any settings from the choice of quality_preset (each quality_preset is just a certain TLMOptions configuration). A usage sketch follows this list.
- timeout (float, optional): timeout (in seconds) to apply to each TLM prompt. If a batch of data is passed in, the timeout is applied to each individual item in the batch. If a result is not produced within the timeout, a TimeoutError will be raised. Defaults to None, which does not apply a timeout.
- verbose (bool, optional): whether to print outputs during execution, i.e. show a progress bar when running TLM over a batch of data. If None, this will be auto-determined based on whether the code is running in an interactive environment such as a Jupyter notebook.
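For illustration, here is a minimal construction sketch. It assumes the package exposes TLM at the top level (as in the quickstart tutorial) and that your Cleanlab API key is already configured in your environment; the configuration values shown are arbitrary examples, not recommendations.

```python
from cleanlab_tlm import TLM

# Default client: "medium" quality preset with the default base model.
tlm = TLM()

# Advanced configuration: override preset defaults via `options` (values are illustrative).
tlm_configured = TLM(
    quality_preset="high",
    task="default",
    options={"model": "gpt-4.1", "max_tokens": 256, "reasoning_effort": "medium"},
    timeout=60.0,  # raise TimeoutError if any single prompt takes longer than 60 seconds
    verbose=True,  # show a progress bar when running over a batch of prompts
)
```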
method get_model_name
get_model_name() → str
Returns the name of the underlying LLM used to generate responses and score their trustworthiness.
method get_trustworthiness_score
get_trustworthiness_score(
prompt: 'Union[str, Sequence[str]]',
response: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMScore, list[TLMScore]]
Computes trustworthiness score for arbitrary given prompt-response pairs.
Args:
- prompt (str | Sequence[str]): prompt (or list of prompts) for the TLM to evaluate
- response (str | Sequence[str]): existing response (or list of responses) associated with the input prompts. These can be from any LLM or human-written responses.
- kwargs: optional keyword arguments; supports the same arguments as the prompt() method, such as constrain_outputs.
Returns:
TLMScore | list[TLMScore]
: If a single prompt/response pair was passed in, the method returns a TLMScore object containing the trustworthiness score and optional log dictionary keys.
If a list of prompt/response pairs was passed in, the method returns a list of TLMScore objects, each containing the trustworthiness score and optional log dictionary keys for the corresponding prompt-response pair.
The score quantifies how confident TLM is that the given response is good for the given prompt. The returned list will always have the same length as the input list. In case of TLM error or timeout on any prompt-response pair, the returned list will contain TLMScore objects with error messages and retryability information in place of the trustworthiness score.
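As a sketch (assuming an existing TLM instance named tlm; the prompts and responses below are made-up examples):

```python
# Score a single prompt/response pair (the response may come from any LLM or a human).
score = tlm.get_trustworthiness_score(
    prompt="What is the capital of France?",
    response="The capital of France is Marseille.",
)
print(score["trustworthiness_score"])  # expect a low score for this incorrect response

# Score a batch of pairs; the returned list matches the input order and length.
scores = tlm.get_trustworthiness_score(
    prompt=["What is 2 + 2?", "Name a prime number greater than 10."],
    response=["4", "9"],
)
print([s["trustworthiness_score"] for s in scores])
```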
method get_trustworthiness_score_async
get_trustworthiness_score_async(
prompt: 'Union[str, Sequence[str]]',
response: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMScore, list[TLMScore]]
Asynchronously gets trustworthiness score for prompt-response pairs. This method is similar to the get_trustworthiness_score()
method but operates asynchronously, allowing for non-blocking concurrent operations.
Use this method if prompt-response pairs are streaming in, and you want to return TLM scores for each pair as quickly as possible, without the TLM scoring of any one pair blocking the scoring of the others. Asynchronous methods do not block until completion, so you will need to fetch the results yourself.
Args:
- prompt (str | Sequence[str]): prompt (or list of prompts) for the TLM to evaluate
- response (str | Sequence[str]): response (or list of responses) corresponding to the input prompts
- kwargs: optional keyword arguments; supports the same arguments as the prompt() method, such as constrain_outputs.
Returns:
TLMScore | list[TLMScore]
: If a single prompt/response pair was passed in, the method returns a TLMScore object containing the trustworthiness score and optional log dictionary keys.
If a list of prompt/response pairs was passed in, the method returns a list of TLMScore objects, each containing the trustworthiness score and optional log dictionary keys for the corresponding prompt-response pair. The score quantifies how confident TLM is that the given response is good for the given prompt. This method will raise an exception if any errors occur or if you hit a timeout (given a timeout is specified).
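A minimal sketch of concurrent scoring with asyncio (the prompts and responses below are made-up examples):

```python
import asyncio

from cleanlab_tlm import TLM

tlm = TLM()

async def main() -> None:
    # Score several pairs concurrently; no single pair blocks the others.
    results = await asyncio.gather(
        tlm.get_trustworthiness_score_async("What is 2 + 2?", "4"),
        tlm.get_trustworthiness_score_async("What is 2 + 2?", "5"),
    )
    for result in results:
        print(result["trustworthiness_score"])

asyncio.run(main())
```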
method prompt
prompt(
prompt: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMResponse, list[TLMResponse]]
Gets response and trustworthiness score for any text input.
This method prompts the TLM with the given prompt(s), producing completions (like a standard LLM) but also provides trustworthiness scores quantifying the quality of the output.
Args:
- prompt (str | Sequence[str]): prompt (or list of multiple prompts) for the language model. Providing a batch of many prompts here will be faster than calling this method on each prompt separately.
- kwargs: optional keyword arguments for TLM. When using TLM for multi-class classification, specify constrain_outputs as a keyword argument to ensure returned responses are one of the valid classes/categories. constrain_outputs is a list of strings (or a list of lists of strings) used to denote the valid classes/categories of interest. We recommend also listing and defining the valid outputs in your prompt. If constrain_outputs is a list of strings, the response returned for every prompt will be constrained to match one of these values. The last entry in this list is additionally treated as the output to fall back to if the raw LLM output does not resemble any of the categories (for instance, this could be an “Other” category, or the category you would prefer to return whenever the LLM is unsure). If you run a list of multiple prompts simultaneously and want to constrain each of their outputs differently, then specify constrain_outputs as a list of lists of strings (one list for each prompt). See the classification sketch after the Returns section below.
Returns:
TLMResponse | list[TLMResponse]
: TLMResponse object containing the response and trustworthiness score. If multiple prompts were provided in a list, then a list of such objects is returned, one for each prompt. The returned list will always have the same length as the input list. In case of TLM failure on any prompt (due to timeouts or other errors), the returned list will include a TLMResponse with an error message and retryability information in place of the usual TLMResponse for that failed prompt.
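For illustration (assuming an existing TLM instance tlm; the prompts and category labels below are made-up examples):

```python
# Single prompt: returns a TLMResponse dict with the answer and its trustworthiness score.
result = tlm.prompt("In what year did the French Revolution begin?")
print(result["response"], result["trustworthiness_score"])

# Multi-class classification: constrain responses to valid labels.
# The last entry ("Other") is the fallback when the raw LLM output matches no category.
results = tlm.prompt(
    [
        "Classify the sentiment of: 'I love this product!'",
        "Classify the sentiment of: 'It broke after one day.'",
    ],
    constrain_outputs=["Positive", "Negative", "Other"],
)
for r in results:
    print(r["response"], r["trustworthiness_score"])
```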
method prompt_async
prompt_async(
prompt: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMResponse, list[TLMResponse]]
Asynchronously gets a response and trustworthiness score for any text input from TLM. This method is similar to the prompt()
method but operates asynchronously, allowing for non-blocking concurrent operations.
Use this method if prompts are streaming in one at a time, and you want to return results for each one as quickly as possible, without the TLM execution of one prompt blocking the execution of the others. Asynchronous methods do not block until completion, so you need to fetch the results yourself.
Args:
- prompt (str | Sequence[str]): prompt (or list of multiple prompts) for the TLM
- kwargs: optional keyword arguments, the same as for the prompt() method.
Returns:
TLMResponse | list[TLMResponse]
: TLMResponse object containing the response and trustworthiness score. If multiple prompts were provided in a list, then a list of such objects is returned, one for each prompt. This method will raise an exception if any errors occur or if you hit a timeout (given a timeout is specified).
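A minimal concurrency sketch with asyncio (the prompts below are made-up examples):

```python
import asyncio

from cleanlab_tlm import TLM

tlm = TLM()

async def main() -> None:
    # Issue several prompts concurrently; each completes without blocking the others.
    responses = await asyncio.gather(
        tlm.prompt_async("Summarize photosynthesis in one sentence."),
        tlm.prompt_async("What is the boiling point of water in Celsius?"),
    )
    for response in responses:
        print(response["response"], response["trustworthiness_score"])

asyncio.run(main())
```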
class TLMResponse
A typed dict containing the response, trustworthiness score, and additional logs output by the Trustworthy Language Model.
Attributes:
- response (str): text response from the Trustworthy Language Model.
- trustworthiness_score (float, optional): score between 0-1 corresponding to the trustworthiness of the response. A higher score indicates a higher confidence that the response is correct/good.
- log (dict, optional): additional logs and metadata returned from the LLM call, only if the log key was specified in TLMOptions.
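Since TLMResponse is a typed dict, its fields are accessed like ordinary dictionary keys. A brief sketch (assuming an existing TLM instance tlm):

```python
result = tlm.prompt("What is the capital of France?")
answer = result["response"]
score = result.get("trustworthiness_score")
logs = result.get("log", {})  # populated only if "log" keys were requested in TLMOptions
```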
class TLMScore
A typed dict containing the trustworthiness score and additional logs output by the Trustworthy Language Model.
Attributes:
- trustworthiness_score (float, optional): score between 0-1 corresponding to the trustworthiness of the response. A higher score indicates a higher confidence that the response is correct/good.
- log (dict, optional): additional logs and metadata returned from the LLM call, only if the log key was specified in TLMOptions.
class TLMOptions
Typed dict of advanced configuration options for the Trustworthy Language Model. Many of these configurations are determined by the quality preset selected (learn about quality presets in the TLM initialization method). Specifying TLMOptions values directly overrides any default values set from the quality preset.
For all options described below, higher settings will lead to longer runtimes and may consume more tokens internally. You may not be able to run long prompts (or prompts with long responses) in your account, unless your token/rate limits are increased. If you hit token limit issues, try lower/less expensive TLMOptions to be able to run longer prompts/responses, or contact Cleanlab to increase your limits.
The default values corresponding to each quality preset are:
- best: num_candidate_responses = 6, num_consistency_samples = 8, use_self_reflection = True. This preset improves LLM responses.
- high: num_candidate_responses = 4, num_consistency_samples = 8, use_self_reflection = True. This preset improves LLM responses.
- medium: num_candidate_responses = 1, num_consistency_samples = 8, use_self_reflection = True.
- low: num_candidate_responses = 1, num_consistency_samples = 4, use_self_reflection = True.
- base: num_candidate_responses = 1, num_consistency_samples = 0, use_self_reflection = False. When using get_trustworthiness_score() with the “base” preset, a cheaper self-reflection will be used to compute the trustworthiness score.
By default, TLM uses the “medium” quality_preset, the “gpt-4.1-mini” base model, and max_tokens set to 512. You can set custom values for these arguments regardless of the quality preset specified, as sketched below.
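Because TLMOptions values override preset defaults, combining a cheap preset with a couple of overrides might look like this (values are illustrative, not recommendations):

```python
from cleanlab_tlm import TLM

# Start from the inexpensive "low" preset, then swap in an even cheaper base model
# and disable self-reflection to further reduce cost/latency (at some reliability cost).
tlm = TLM(
    quality_preset="low",
    options={"model": "gpt-4.1-nano", "use_self_reflection": False},
)
```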
Args:
- model ({“gpt-4.1”, “gpt-4.1-mini”, “gpt-4.1-nano”, “o4-mini”, “o3”, “gpt-4.5-preview”, “gpt-4o-mini”, “gpt-4o”, “o3-mini”, “o1”, “o1-mini”, “gpt-4”, “gpt-3.5-turbo-16k”, “claude-3.7-sonnet”, “claude-3.5-sonnet-v2”, “claude-3.5-sonnet”, “claude-3.5-haiku”, “claude-3-haiku”, “nova-micro”, “nova-lite”, “nova-pro”}, default = “gpt-4.1-mini”): underlying base LLM to use (better models yield better results, faster models yield faster/cheaper results).
  - Models still in beta: “o3”, “o1”, “o4-mini”, “o3-mini”, “o1-mini”, “gpt-4.5-preview”, “claude-3.7-sonnet”, “claude-3.5-haiku”.
  - Recommended models for accuracy: “gpt-4.1”, “o4-mini”, “o3”, “claude-3.7-sonnet”, “claude-3.5-sonnet-v2”.
  - Recommended models for low latency/costs: “gpt-4.1-nano”, “nova-micro”.
- max_tokens (int, default = 512): the maximum number of tokens that can be generated in the TLM response (and in internal trustworthiness scoring). Higher values here may produce better (more reliable) TLM responses and trustworthiness scores, but at higher runtimes/costs. If you experience token/rate-limit errors while using TLM, try lowering this number. For OpenAI models, this parameter must be between 64 and 4096. For Claude models, this parameter must be between 64 and 512.
- num_candidate_responses (int, default = 1): how many alternative candidate responses are internally generated in TLM.prompt(). TLM.prompt() scores the trustworthiness of each candidate response, and then returns the most trustworthy one. This parameter must be between 1 and 20. It has no effect on TLM.get_trustworthiness_score(). Higher values here can produce more accurate responses from TLM.prompt(), but at higher runtimes/costs. When it is 1, TLM.prompt() simply returns a standard LLM response and does not attempt to auto-improve it.
- num_consistency_samples (int, default = 8): the amount of internal sampling to measure LLM response consistency, a factor affecting trustworthiness scoring. Must be between 0 and 20. Higher values produce more reliable TLM trustworthiness scores, but at higher runtimes/costs. Measuring consistency helps quantify the epistemic uncertainty associated with strange prompts or prompts that are too vague/open-ended to receive a clearly defined ‘good’ response. TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.
- use_self_reflection (bool, default = True): whether the LLM is asked to reflect on the given response and directly evaluate correctness/confidence. Setting this False disables reflection and will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores. Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
- similarity_measure ({“semantic”, “string”, “embedding”, “embedding_large”, “code”, “discrepancy”}, default = “semantic”): how the trustworthiness scoring’s consistency algorithm measures similarity between alternative responses considered plausible by the model. Supported similarity measures include “semantic” (based on natural language inference), “embedding” (based on vector embedding similarity), “embedding_large” (based on a larger embedding model), “code” (based on model-based analysis designed to compare code), “discrepancy” (based on model-based analysis of possible discrepancies), and “string” (based on character/word overlap). Set this to “string” for minimal runtimes/costs.
- reasoning_effort ({“none”, “low”, “medium”, “high”}, default = “high”): how much internal LLM calls are allowed to reason (number of thinking tokens) when generating alternative possible responses and reflecting on responses during trustworthiness scoring. Higher reasoning efforts may yield more reliable TLM trustworthiness scores. Reduce this value to reduce runtimes/costs.
- log (list[str], default = []): optionally specify additional logs or metadata that TLM should return. For instance, include “explanation” here to get explanations of why a response is scored with low trustworthiness.
- custom_eval_criteria (list[dict[str, Any]], default = []): optionally specify custom evaluation criteria beyond the built-in trustworthiness scoring. The expected input format is a list of dictionaries, where each dictionary has the following keys (see the combined example after this list):
  - name: name of the evaluation criteria.
  - criteria: instructions specifying the evaluation criteria.
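Putting a few of these options together might look like the following sketch (the criterion name, criterion text, and prompt are purely illustrative):

```python
from cleanlab_tlm import TLM

options = {
    "model": "gpt-4.1",
    "max_tokens": 256,
    "num_consistency_samples": 4,
    "reasoning_effort": "low",
    "log": ["explanation"],  # also return an explanation of why a response scored low
    "custom_eval_criteria": [
        {
            "name": "Conciseness",
            "criteria": "The response should answer the question in three sentences or fewer.",
        }
    ],
}

tlm = TLM(options=options)
result = tlm.prompt("Briefly explain what a trustworthiness score measures.")
print(result["trustworthiness_score"])
print(result.get("log"))  # explanation and any other requested logs appear here when returned
```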