Trustworthy Language Model

module cleanlab_studio.studio.trustworthy_language_model

Cleanlab’s Trustworthy Language Model (TLM) is a large language model that gives more reliable answers and quantifies its uncertainty in these answers.

This module is not meant to be imported and used directly. Instead, use Studio.TLM() to instantiate a TLM object, and then use methods such as prompt() and get_trustworthiness_score() documented on this page.

The Trustworthy Language Model tutorial further explains TLM and its use cases.


class TLM

Represents a Trustworthy Language Model (TLM) instance, which is bound to a Cleanlab Studio account.

The TLM object can be used either as a drop-in replacement for an LLM or to estimate trustworthiness scores for arbitrary text prompt/response pairs.

For advanced use, TLM offers configuration options. The documentation below summarizes these options, and more details are explained in the TLM tutorial.

The TLM object is not meant to be constructed directly. Instead, use the Studio.TLM() method to configure and instantiate a TLM object. After you’ve instantiated the TLM object using Studio.TLM(), you can use the instance methods documented on this page. Possible arguments for Studio.TLM() are documented below.

Args:

  • quality_preset (TLMQualityPreset, default = “medium”): An optional preset configuration to control the quality of TLM responses and trustworthiness scores vs. runtimes/costs. TLMQualityPreset is a string specifying one of the supported presets: “best”, “high”, “medium”, “low”, “base”.

    The “best” and “high” presets return improved LLM responses, with “best” also returning more reliable trustworthiness scores than “high”. The “medium” and “low” presets return standard LLM responses along with associated trustworthiness scores, with “medium” producing more reliable trustworthiness scores than “low”. The “base” preset provides a standard LLM response and a trustworthiness score at the lowest possible latency/cost.

    Higher presets have increased runtime and cost (and may internally consume more tokens). Reduce your preset if you see token-limit errors. Details about each preset are in the documentation for TLMOptions.

    Avoid the “best” and “high” presets if you primarily want trustworthiness scores (i.e. you are using tlm.get_trustworthiness_score() rather than tlm.prompt()) and are less concerned with improving LLM responses. These presets have higher runtime/cost and are optimized to return more accurate LLM outputs, but they do not return more reliable trustworthiness scores than the “medium” and “low” presets.

  • options (TLMOptions, optional): a typed dict of advanced configuration options. Available options (keys in this dict) include “model”, “max_tokens”, “num_candidate_responses”, “num_consistency_samples”, “use_self_reflection”. For more details about the options, see the documentation for TLMOptions. If specified, these override any settings from the choice of quality_preset.
  • timeout (float, optional): timeout (in seconds) to apply to each TLM prompt. If a batch of data is passed in, the timeout will be applied to each individual item in the batch. If a result is not produced within the timeout, a TimeoutError will be raised. Defaults to None, which does not apply a timeout.
  • verbose (bool, optional): whether to print outputs during execution, i.e., whether to show a progress bar when TLM is prompted with batches of data. If None, this will be determined automatically based on whether the code is running in an interactive environment such as a Jupyter notebook.
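
The arguments above could be assembled and passed to Studio.TLM() roughly as in the sketch below. The helper names build_tlm_kwargs and make_tlm are illustrative and not part of the library, and the Studio(api_key) constructor is assumed from the cleanlab_studio package. Only the helper is exercised here, since creating a real TLM object needs a valid account and network access.

```python
from typing import Any, Dict, Optional

# Preset names listed in the documentation above.
SUPPORTED_PRESETS = ("best", "high", "medium", "low", "base")


def build_tlm_kwargs(
    quality_preset: str = "medium",
    options: Optional[Dict[str, Any]] = None,
    timeout: Optional[float] = None,
    verbose: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for Studio.TLM()."""
    if quality_preset not in SUPPORTED_PRESETS:
        raise ValueError(f"Unsupported quality_preset: {quality_preset!r}")
    kwargs: Dict[str, Any] = {"quality_preset": quality_preset}
    # Only forward the optional arguments that were actually set.
    if options is not None:
        kwargs["options"] = dict(options)
    if timeout is not None:
        kwargs["timeout"] = timeout
    if verbose is not None:
        kwargs["verbose"] = verbose
    return kwargs


def make_tlm(api_key: str, **kwargs: Any):
    """Instantiate a TLM object (assumes cleanlab_studio is installed)."""
    from cleanlab_studio import Studio  # imported lazily; unused by the helper above

    studio = Studio(api_key)
    return studio.TLM(**build_tlm_kwargs(**kwargs))


tlm_kwargs = build_tlm_kwargs("low", options={"max_tokens": 256}, timeout=60.0)
print(tlm_kwargs)
```

With a real API key, make_tlm("<your API key>", quality_preset="low") would return an object exposing the instance methods documented on this page.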

method get_model_name

get_model_name() → str

Returns the name of the underlying LLM used to generate responses and score their trustworthiness.


method get_trustworthiness_score

get_trustworthiness_score(
prompt: 'Union[str, Sequence[str]]',
response: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMScore, List[TLMScore]]

Computes trustworthiness score for arbitrary given prompt-response pairs.

Args:

  • prompt (str | Sequence[str]): prompt (or list of prompts) for the TLM to evaluate
  • response (str | Sequence[str]): existing response (or list of responses) associated with the input prompts. These can be from any LLM or human-written responses.

Returns:

  • TLMScore | List[TLMScore]: If a single prompt/response pair was passed in, method returns a TLMScore object containing the trustworthiness score and optional log dictionary keys.

    If a list of prompt/responses was passed in, method returns a list of TLMScore objects each containing the trustworthiness score and optional log dictionary keys for each prompt-response pair passed in.

    The score quantifies how confident TLM is that the given response is good for the given prompt. If running on many prompt-response pairs simultaneously: this method will raise an exception if any TLM errors or timeouts occur. Use it if strict error handling and immediate notification of any exceptions/timeouts is preferred. You will lose any partial results if an exception is raised. If saving partial results is important, you can call this method on smaller batches of prompt-response pairs at a time (and save intermediate results) or use the try_get_trustworthiness_score() method instead.
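
The advice about smaller batches and intermediate results can be sketched as follows. The helper score_in_batches and the stand-in fake_score_fn are hypothetical; in real use, score_fn would be something like tlm.get_trustworthiness_score, and saved could be written to disk after each batch.

```python
from typing import Any, Callable, Dict, List, Sequence

Score = Dict[str, Any]  # stand-in for a TLMScore typed dict


def score_in_batches(
    score_fn: Callable[[Sequence[str], Sequence[str]], List[Score]],
    prompts: Sequence[str],
    responses: Sequence[str],
    saved: List[Score],
    batch_size: int = 100,
) -> List[Score]:
    """Score pairs batch by batch, appending each finished batch to `saved`.

    If `score_fn` raises, the exception propagates, but `saved` still holds
    every batch completed so far; calling again resumes after those items.
    """
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    for start in range(len(saved), len(prompts), batch_size):
        stop = start + batch_size
        saved.extend(score_fn(prompts[start:stop], responses[start:stop]))
    return saved


def fake_score_fn(batch_prompts, batch_responses):
    """Stand-in scorer so the sketch runs offline; returns a fixed score."""
    return [{"trustworthiness_score": 0.5} for _ in batch_prompts]


prompts = [f"question {i}" for i in range(5)]
responses = [f"answer {i}" for i in range(5)]
saved: List[Score] = []
score_in_batches(fake_score_fn, prompts, responses, saved, batch_size=2)
print(len(saved))  # 5
```

Because the loop starts at len(saved), re-running the call after a failure picks up where the last completed batch ended.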


method get_trustworthiness_score_async

get_trustworthiness_score_async(
prompt: 'Union[str, Sequence[str]]',
response: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMScore, List[TLMScore]]

Asynchronously gets trustworthiness score for prompt-response pairs. This method is similar to the get_trustworthiness_score() method but operates asynchronously, allowing for non-blocking concurrent operations.

Use this method if prompt-response pairs are streaming in, and you want to return TLM scores for each pair as quickly as possible, without the TLM scoring of any one pair blocking the scoring of the others. Asynchronous methods do not block until completion, so you will need to fetch the results yourself.

Args:

  • prompt (str | Sequence[str]): prompt (or list of prompts) for the TLM to evaluate
  • response (str | Sequence[str]): response (or list of responses) corresponding to the input prompts

Returns:

  • TLMScore | List[TLMScore]: If a single prompt/response pair was passed in, method returns a TLMScore object containing the trustworthiness score and optional log dictionary keys.

    If a list of prompt/responses was passed in, method returns a list of TLMScore objects each containing the trustworthiness score and optional log dictionary keys for each prompt-response pair passed in. The score quantifies how confident TLM is that the given response is good for the given prompt. This method will raise an exception if any errors occur or if you hit a timeout (given a timeout is specified).
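
One way to fetch results from an asynchronous scoring method is to await several calls together with asyncio.gather, as in the sketch below. The coroutine fake_score_async is a stand-in that makes the example runnable offline; with a real TLM object, tlm.get_trustworthiness_score_async would presumably be passed instead.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence, Tuple

Score = Dict[str, Any]  # stand-in for a TLMScore typed dict


async def fake_score_async(prompt: str, response: str) -> Score:
    """Stand-in for an async scoring call; sleeps briefly, returns a fixed score."""
    await asyncio.sleep(0.01)
    return {"trustworthiness_score": 0.9}


async def score_pairs(
    score_async: Callable[[str, str], Awaitable[Score]],
    pairs: Sequence[Tuple[str, str]],
) -> List[Score]:
    """Start one scoring call per pair and wait for all of them concurrently."""
    calls = [score_async(prompt, response) for prompt, response in pairs]
    # gather() returns results in the same order as the input pairs.
    return await asyncio.gather(*calls)


pairs = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
scores = asyncio.run(score_pairs(fake_score_async, pairs))
print(scores)
```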


method prompt

prompt(
prompt: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMResponse, List[TLMResponse]]

Gets response and trustworthiness score for any text input.

This method prompts the TLM with the given prompt(s), producing completions (like a standard LLM) but also provides trustworthiness scores quantifying the quality of the output.

Args:

  • prompt (str | Sequence[str]): prompt (or list of multiple prompts) for the language model. Providing a batch of many prompts here will be faster than calling this method on each prompt separately.
  • kwargs: Optional keyword arguments for TLM. When using TLM for multi-class classification, specify constrain_outputs as a keyword argument to ensure returned responses are one of the valid classes/categories. constrain_outputs is a list of strings (or a list of lists of strings), used to denote the valid classes/categories of interest. We recommend also listing and defining the valid outputs in your prompt as well. If constrain_outputs is a list of strings, the response returned for every prompt will be constrained to match one of these values. The last entry in this list is additionally treated as the output to fall back to if the raw LLM output does not resemble any of the categories (for instance, this could be an Other category, or it could be the category you’d prefer to return whenever the LLM is unsure). If you run a list of multiple prompts simultaneously and want to differently constrain each of their outputs, then specify constrain_outputs as a list of lists of strings (one list for each prompt).

Returns:

  • TLMResponse | List[TLMResponse]: TLMResponse object containing the response and trustworthiness score. If multiple prompts were provided in a list, then a list of such objects is returned, one for each prompt. This method will raise an exception if any errors occur or if you hit a timeout (given a timeout is specified). Use it if you want strict error handling and immediate notification of any exceptions/timeouts.

    If running this method on a big batch of prompts, you might lose partially completed results if TLM fails on any one of them. To avoid losing partial results for the prompts that TLM did not fail on, you can either call this method on smaller batches of prompts at a time (and save intermediate results between batches), or use the try_prompt() method instead.
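
The two accepted shapes of constrain_outputs can be illustrated with a small normalization helper. per_prompt_constraints is hypothetical and only mirrors the description above: a flat list of strings applies to every prompt, while a list of lists supplies one set of categories per prompt. It does not reproduce how TLM matches raw outputs to categories.

```python
from typing import List, Sequence, Union


def per_prompt_constraints(
    prompts: Sequence[str],
    constrain_outputs: Union[Sequence[str], Sequence[Sequence[str]]],
) -> List[List[str]]:
    """Hypothetical helper: expand constrain_outputs to one list per prompt."""
    if len(constrain_outputs) == 0:
        raise ValueError("constrain_outputs must not be empty")
    if all(isinstance(c, str) for c in constrain_outputs):
        # One shared list of categories, applied to every prompt.
        return [list(constrain_outputs) for _ in prompts]
    if any(isinstance(c, str) for c in constrain_outputs):
        raise ValueError("constrain_outputs mixes strings and lists")
    if len(constrain_outputs) != len(prompts):
        raise ValueError("need exactly one list of categories per prompt")
    return [list(c) for c in constrain_outputs]


prompts = [
    "Classify the sentiment as positive, negative, or other: 'Great service!'",
    "Is this message spam? Answer yes, no, or unsure: 'You won a prize'",
]
shared = per_prompt_constraints(prompts, ["positive", "negative", "other"])
separate = per_prompt_constraints(
    prompts, [["positive", "negative", "other"], ["yes", "no", "unsure"]]
)
# The last entry of each list is the documented fallback category.
fallbacks = [choices[-1] for choices in separate]
print(fallbacks)  # ['other', 'unsure']

# With a real TLM object, the constraints would be passed as a keyword, e.g.:
#   tlm.prompt(prompts, constrain_outputs=[["positive", "negative", "other"],
#                                          ["yes", "no", "unsure"]])
```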

method prompt_async

prompt_async(
prompt: 'Union[str, Sequence[str]]',
**kwargs: 'Any'
) → Union[TLMResponse, List[TLMResponse]]

Asynchronously get response and trustworthiness score for any text input from TLM. This method is similar to the prompt() method but operates asynchronously, allowing for non-blocking concurrent operations.

Use this method if prompts are streaming in one at a time, and you want to return results for each one as quickly as possible, without the TLM execution of any one prompt blocking the execution of the others. Asynchronous methods do not block until completion, so you will need to fetch the results yourself.

Args:

  • prompt (str | Sequence[str]): prompt (or list of multiple prompts) for the TLM
  • kwargs: Optional keyword arguments, the same as for the prompt() method.

Returns:

  • TLMResponse | List[TLMResponse]: TLMResponse object containing the response and trustworthiness score. If multiple prompts were provided in a list, then a list of such objects is returned, one for each prompt. This method will raise an exception if any errors occur or if you hit a timeout (given a timeout is specified).
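
For the streaming use case described above, results can be handled as each prompt finishes by using asyncio.as_completed. This is a sketch under assumptions: fake_prompt_async stands in for tlm.prompt_async so the example runs offline, and handle_stream is an illustrative name.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence, Tuple

Response = Dict[str, Any]  # stand-in for a TLMResponse typed dict


async def fake_prompt_async(prompt: str) -> Response:
    """Stand-in for an async prompt call; longer prompts take longer here."""
    await asyncio.sleep(0.01 * len(prompt))
    return {"response": prompt.upper(), "trustworthiness_score": 0.8}


async def handle_stream(
    prompt_async: Callable[[str], Awaitable[Response]],
    prompts: Sequence[str],
) -> List[Tuple[str, Response]]:
    """Collect (prompt, result) pairs in the order the calls finish."""

    async def run_one(prompt: str) -> Tuple[str, Response]:
        # Keep the prompt next to its result, since completion order may differ.
        return prompt, await prompt_async(prompt)

    finished: List[Tuple[str, Response]] = []
    for next_done in asyncio.as_completed([run_one(p) for p in prompts]):
        finished.append(await next_done)
    return finished


finished = asyncio.run(handle_stream(fake_prompt_async, ["a longer prompt", "short"]))
for prompt, result in finished:
    print(prompt, "->", result["response"])
```

Pairing each result with its prompt inside run_one avoids relying on input order, which as_completed does not preserve.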

method try_get_trustworthiness_score

try_get_trustworthiness_score(
prompt: 'Sequence[str]',
response: 'Sequence[str]',
**kwargs: 'Any'
) → List[TLMScore]

Gets trustworthiness score for batches of many prompt-response pairs.

The returned list will have the same length as the input list. If TLM hits any errors or timeouts while processing certain inputs, the list will contain TLMScore objects with error messages and retryability information in place of the TLM score for those failed inputs.

This is the recommended way to get TLM trustworthiness scores for big datasets, where some individual TLM calls within the dataset may fail. It will ensure partial results are not lost.

Args:

  • prompt (Sequence[str]): list of prompts for the TLM to evaluate
  • response (Sequence[str]): list of existing responses corresponding to the input prompts (from any LLM or human-written)

Returns:

  • List[TLMScore]: a list of TLMScore objects, each containing the trustworthiness score and the optional log dictionary keys for the corresponding prompt-response pair. The returned list will always have the same length as the input list. For any TLM call that failed (due to an error or timeout), the list will instead contain a TLMScore object with an error message and retryability information in place of the trustworthiness score.

    The score quantifies how confident TLM is that the given response is good for the given prompt. Use this method if you prioritize obtaining results for as many inputs as possible, though you might miss out on certain error messages. If you prefer to be notified immediately about any errors or timeouts, use the get_trustworthiness_score() method instead.
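
Since failures come back inside the returned list rather than as exceptions, a caller typically separates the two cases. The sketch below assumes a failed entry can be recognized by a trustworthiness_score of None, which is consistent with the score being optional in TLMScore. This page does not specify where the error details are stored, so the example does not read them. split_by_outcome is a hypothetical helper, and the scores list is hand-written sample data.

```python
from typing import Any, Dict, List, Sequence, Tuple


def split_by_outcome(scores: Sequence[Dict[str, Any]]) -> Tuple[List[int], List[int]]:
    """Return (indices that have a score, indices that do not)."""
    succeeded: List[int] = []
    failed: List[int] = []
    for index, score in enumerate(scores):
        # Assumption: a missing/None score marks a failed TLM call.
        if score.get("trustworthiness_score") is None:
            failed.append(index)
        else:
            succeeded.append(index)
    return succeeded, failed


# Hand-written sample standing in for the output of try_get_trustworthiness_score().
scores = [
    {"trustworthiness_score": 0.82},
    {"trustworthiness_score": None},
    {"trustworthiness_score": 0.4},
]
succeeded, failed = split_by_outcome(scores)
print(succeeded, failed)  # [0, 2] [1]
```

The indices line up with the input lists because the returned list always has the same length as the input.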


method try_prompt

try_prompt(prompt: 'Sequence[str]', **kwargs: 'Any') → List[TLMResponse]

Gets response and trustworthiness score for any batch of prompts, handling any failures (errors or timeouts).

The list returned will have the same length as the input list. If there are any failures (errors or timeouts) processing some inputs, the TLMResponse objects in the returned list will contain error messages and retryability information instead of the usual response.

This is the recommended approach for obtaining TLM responses and trustworthiness scores for large datasets with many prompts, where some individual TLM responses within the dataset might fail. It ensures partial results are preserved.

Args:

  • prompt (Sequence[str]): list of multiple prompts for the TLM
  • kwargs: Optional keyword arguments, the same as for the prompt() method.

Returns:

  • List[TLMResponse]: list of TLMResponse objects containing the response and trustworthiness score. The returned list will always have the same length as the input list. In case of TLM failure on any prompt (due to timeouts or other errors), the return list will include a TLMResponse with an error message and retryability information instead of the usual TLMResponse for that failed prompt. Use this method to obtain TLM results for as many prompts as possible, while handling errors/timeouts manually. If you prefer immediate notification about any errors or timeouts when processing multiple prompts, use the prompt() method instead.

class TLMResponse

A typed dict containing the response, trustworthiness score and additional logs from the Trustworthy Language Model.

Attributes:

  • response (str): text response from the Trustworthy Language Model.
  • trustworthiness_score (float, optional): score between 0 and 1 corresponding to the trustworthiness of the response. A higher score indicates a higher confidence that the response is correct/trustworthy.
  • log (dict, optional): additional logs and metadata returned from the LLM call only if the log key was specified in TLMOptions.

class TLMScore

A typed dict containing the trustworthiness score and additional logs from the Trustworthy Language Model.

Attributes:

  • trustworthiness_score (float, optional): score between 0 and 1 corresponding to the trustworthiness of the response. A higher score indicates a higher confidence that the response is correct/trustworthy.
  • log (dict, optional): additional logs and metadata returned from the LLM call only if the log key was specified in TLMOptions.
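
As a sketch of how these typed dicts might be consumed, the example below flags outputs whose trustworthiness score is missing or low. ResponseLike is a local mirror of the documented fields and not the library's class. The 0.5 threshold is an arbitrary illustration; a suitable cutoff depends on the application.

```python
from typing import Any, Dict, List, Optional, TypedDict


class ResponseLike(TypedDict, total=False):
    """Local mirror of the documented TLMResponse fields (not the library class)."""

    response: str
    trustworthiness_score: Optional[float]
    log: Dict[str, Any]


def needs_review(item: ResponseLike, threshold: float = 0.5) -> bool:
    """Flag items with a missing score or a score below `threshold`."""
    score = item.get("trustworthiness_score")
    return score is None or score < threshold


# Hand-written sample data standing in for TLM output.
outputs: List[ResponseLike] = [
    {"response": "Paris", "trustworthiness_score": 0.97},
    {"response": "Probably 1987", "trustworthiness_score": 0.31},
]
flagged = [o["response"] for o in outputs if needs_review(o)]
print(flagged)  # ['Probably 1987']
```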

class TLMOptions

Typed dict containing advanced configuration options for the Trustworthy Language Model. Many of these configurations are automatically determined by the quality preset selected (see the arguments in the TLM initialization method to learn more about quality presets). Specifying custom values here will override any default values from the quality preset.

For all options described below, higher/more expensive settings will lead to longer runtimes and may consume more tokens internally. The high token cost may prevent you from running long prompts (or prompts with long responses) in your account unless your token limits are increased. If you hit token limit issues, try using lower/less expensive settings to be able to run longer prompts/responses.

The default values corresponding to each quality preset (specified when instantiating Studio.TLM()) are:

  • best: num_candidate_responses = 6, num_consistency_samples = 8, use_self_reflection = True. This preset will improve LLM responses.
  • high: num_candidate_responses = 4, num_consistency_samples = 8, use_self_reflection = True. This preset will improve LLM responses.
  • medium: num_candidate_responses = 1, num_consistency_samples = 8, use_self_reflection = True.
  • low: num_candidate_responses = 1, num_consistency_samples = 4, use_self_reflection = True.
  • base: num_candidate_responses = 1, num_consistency_samples = 0, use_self_reflection = False. This preset is equivalent to a regular LLM call. When using get_trustworthiness_score() on “base” preset, a cheaper self-reflection will be used to approximate the trustworthiness score. If you explicitly set use_self_reflection = False, get_trustworthiness_score() will return None instead of a score.

By default, the TLM is set to the “medium” quality preset. The default model used is “gpt-4o-mini”, and max_tokens is 512 for all quality presets. You can set custom values for these arguments regardless of the quality preset specified.
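
The relationship between presets and explicit options can be written out as a small lookup and merge. The table below copies the defaults listed above. effective_options is a hypothetical helper that only illustrates the documented override rule; the real resolution happens inside TLM.

```python
from typing import Any, Dict, Optional

# Defaults per quality preset, as listed in the documentation above.
PRESET_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "best": {"num_candidate_responses": 6, "num_consistency_samples": 8, "use_self_reflection": True},
    "high": {"num_candidate_responses": 4, "num_consistency_samples": 8, "use_self_reflection": True},
    "medium": {"num_candidate_responses": 1, "num_consistency_samples": 8, "use_self_reflection": True},
    "low": {"num_candidate_responses": 1, "num_consistency_samples": 4, "use_self_reflection": True},
    "base": {"num_candidate_responses": 1, "num_consistency_samples": 0, "use_self_reflection": False},
}

# Defaults shared by all presets.
COMMON_DEFAULTS: Dict[str, Any] = {"model": "gpt-4o-mini", "max_tokens": 512}


def effective_options(
    quality_preset: str = "medium",
    options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge shared defaults, preset defaults, then explicit options (highest priority)."""
    merged = {**COMMON_DEFAULTS, **PRESET_DEFAULTS[quality_preset]}
    merged.update(options or {})
    return merged


resolved = effective_options("low", {"num_consistency_samples": 2, "max_tokens": 256})
print(resolved)
```

Here the explicit num_consistency_samples and max_tokens replace the values implied by the “low” preset, while the remaining settings keep their preset defaults.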

Args:

  • model (str, default = “gpt-4o-mini”): underlying base LLM to use (better models yield better results, faster models yield faster/cheaper results).
    • Models currently supported include: “gpt-4o-mini”, “gpt-4o”, “o1-preview”, “gpt-3.5-turbo-16k”, “gpt-4”, “claude-3.5-sonnet”, “claude-3-haiku”.
    • Additional models supported in beta include: “claude-3.5-sonnet-v2”, “claude-3.5-haiku”, “nova-micro”, “nova-lite”, “nova-pro”.
  • max_tokens (int, default = 512): the maximum number of tokens to generate in the TLM response. This number will impact the maximum number of tokens you will see in the output response, and also the number of tokens that can be generated internally within the TLM (to estimate the trustworthiness score). Higher values here can produce better (more reliable) TLM responses and trustworthiness scores, but at higher costs/runtimes. If you are experiencing token limit errors while using the TLM (especially on higher quality presets), consider lowering this number. For OpenAI models, this parameter must be between 64 and 4096. For Claude models, this parameter must be between 64 and 512.
  • num_candidate_responses (int, default = 1): how many alternative candidate responses are internally generated by TLM. TLM scores the trustworthiness of each candidate response, and then returns the most trustworthy one. Higher values here can produce better (more accurate) responses from the TLM, but at higher costs/runtimes (and internally consumes more tokens). This parameter must be between 1 and 20. When it is 1, TLM simply returns a standard LLM response and does not attempt to improve it.
  • num_consistency_samples (int, default = 8): the amount of internal sampling to evaluate LLM-response-consistency. This consistency forms a big part of the returned trustworthiness score, helping quantify the epistemic uncertainty associated with strange prompts or prompts that are too vague/open-ended to receive a clearly defined ‘good’ response. Higher values here produce better (more reliable) TLM trustworthiness scores, but at higher costs/runtimes. This parameter must be between 0 and 20.
  • use_self_reflection (bool, default = True): whether the LLM is asked to self-reflect upon the response it generated and self-evaluate this response. This self-reflection forms a big part of the trustworthiness score, helping quantify aleatoric uncertainty associated with challenging prompts and helping catch answers that are obviously incorrect/bad for a prompt asking for a well-defined answer that LLMs should be able to handle. Setting this to False disables the use of self-reflection and may produce worse TLM trustworthiness scores, but will reduce costs/runtimes.
  • similarity_measure (str, default = “semantic”): Controls how the trustworthiness scoring algorithm measures similarity between possible responses/outputs considered by the model. Set this to “string” to get faster results. Supported measures include “semantic” and “string”.
  • reasoning_effort (str, default = “high”): Controls how much the LLM reasons when considering alternative possible responses and double-checking responses. Higher efforts here produce better TLM trustworthiness scores, but at higher costs/runtimes. Reduce this value to get faster results. Supported efforts include “none”, “low”, “medium”, “high”.
  • log (List[str], default = []): optionally specify additional logs or metadata to return. For instance, include “explanation” here to get explanations of why a response is scored with low trustworthiness.
  • custom_eval_criteria (List[Dict[str, Any]], default = []): optionally specify custom evaluation criteria. The expected input format is a list of dictionaries, where each dictionary has the following keys:
    • name: name of the evaluation criterion
    • criteria: the instruction for the evaluation criterion

    Currently, only one custom evaluation criterion at a time is supported.
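
The documented ranges and allowed values can be checked before a TLMOptions dict is sent to TLM. validate_tlm_options is a hypothetical pre-flight check written from the descriptions above. It is not part of the library, and the library may validate differently. The max_tokens bounds are applied only to model names that look like OpenAI or Claude models, because bounds for other models are not stated here.

```python
from typing import Any, Dict, List

SIMILARITY_MEASURES = {"semantic", "string"}
REASONING_EFFORTS = {"none", "low", "medium", "high"}


def validate_tlm_options(options: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems: List[str] = []

    model = options.get("model", "gpt-4o-mini")
    max_tokens = options.get("max_tokens", 512)
    # Upper bounds are documented for OpenAI and Claude models only.
    if model.startswith("claude"):
        upper = 512
    elif model.startswith(("gpt", "o1")):
        upper = 4096
    else:
        upper = None
    if upper is not None and not 64 <= max_tokens <= upper:
        problems.append(f"max_tokens must be between 64 and {upper} for {model}")

    if not 1 <= options.get("num_candidate_responses", 1) <= 20:
        problems.append("num_candidate_responses must be between 1 and 20")
    if not 0 <= options.get("num_consistency_samples", 8) <= 20:
        problems.append("num_consistency_samples must be between 0 and 20")
    if options.get("similarity_measure", "semantic") not in SIMILARITY_MEASURES:
        problems.append("similarity_measure must be 'semantic' or 'string'")
    if options.get("reasoning_effort", "high") not in REASONING_EFFORTS:
        problems.append("reasoning_effort must be none, low, medium, or high")

    criteria = options.get("custom_eval_criteria", [])
    if len(criteria) > 1:
        problems.append("only one custom evaluation criterion is supported")
    for item in criteria:
        missing = {"name", "criteria"} - set(item)
        if missing:
            problems.append(f"custom_eval_criteria entry is missing keys: {sorted(missing)}")
    return problems


good = {
    "model": "claude-3-haiku",
    "max_tokens": 256,
    "log": ["explanation"],
    # Illustrative criterion; the name and wording are made up for this example.
    "custom_eval_criteria": [
        {"name": "conciseness", "criteria": "The response is brief and to the point."}
    ],
}
bad = {"model": "claude-3-haiku", "max_tokens": 1024, "num_candidate_responses": 0}
print(validate_tlm_options(good))  # []
print(validate_tlm_options(bad))
```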