module cleanlab_tlm.utils.rag

Real-time Evals for Retrieval-Augmented Generation (RAG) systems, powered by Cleanlab’s Trustworthy Language Model (TLM).

This module combines Cleanlab’s trustworthiness scores for each RAG response with additional Evals for other RAG components (such as the retrieved context).

You can also customize Evals for your use-case. Each Eval provides real-time detection of quality issues in your RAG application based on the user query, retrieved context (documents), and/or LLM-generated response.

For RAG use-cases, we recommend using this module’s TrustworthyRAG object in place of the basic TLM object.

This feature is in Beta; contact us if you encounter issues.


class TrustworthyRAG

Real-time Evals for Retrieval-Augmented Generation (RAG) systems, powered by Cleanlab’s Trustworthy Language Model (TLM).

For RAG use-cases, we recommend using this object in place of the basic TLM object. You can use TrustworthyRAG to either score an existing RAG response (from any LLM) based on user query and retrieved context, or to both generate the RAG response and score it simultaneously.

This object combines Cleanlab’s trustworthiness scores for each RAG response with additional Evals for other RAG components (such as the retrieved context).

You can also customize Evals for your use-case. Each Eval provides real-time detection of quality issues in your RAG application based on the user query, retrieved context (documents), and/or LLM-generated response.

Most arguments for this TrustworthyRAG() class are similar to those for TLM; the differences are described below. For details about each argument, refer to the TLM documentation.

Args:

  • quality_preset ({“base”, “low”, “medium”}, default = “medium”): an optional preset configuration to control the quality of generated LLM responses and trustworthiness scores vs. latency/costs.

  • api_key (str, optional): API key for accessing TLM. If not provided, this client will attempt to use the CLEANLAB_TLM_API_KEY environment variable.

  • options (TLMOptions, optional): a typed dict of advanced configurations you can optionally specify. The “custom_eval_criteria” key for TLM is not supported for TrustworthyRAG; specify evals instead.

  • timeout (float, optional): timeout (in seconds) to apply to each request.

  • verbose (bool, optional): whether to print outputs during execution, e.g. showing a progress bar when processing a batch of data.

  • evals (list[Eval], optional): additional evaluation criteria to check for, in addition to response trustworthiness. If not specified, default evaluations will be used (access these via get_default_evals). To come up with your custom evals, we recommend you first run get_default_evals() and then add/remove/modify the returned list. Each Eval in this list provides real-time detection of specific issues in your RAG application based on the user query, retrieved context (documents), and/or LLM-generated response.
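
Example:

Below is a minimal sketch of instantiating this class; it assumes your CLEANLAB_TLM_API_KEY environment variable is set, and the particular argument values are illustrative.

    from cleanlab_tlm.utils.rag import TrustworthyRAG

    # Default configuration:
    trustworthy_rag = TrustworthyRAG()

    # Or customize the quality/latency trade-off and request handling:
    trustworthy_rag = TrustworthyRAG(
        quality_preset="low",  # faster and cheaper than the default "medium"
        timeout=30.0,          # seconds allowed per request
        verbose=False,         # suppress progress outputs
    )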


method score

score(
    response: 'Union[str, Sequence[str]]',
    query: 'Union[str, Sequence[str]]',
    context: 'Union[str, Sequence[str]]',
    prompt: 'Optional[Union[str, Sequence[str]]]' = None,
    form_prompt: 'Optional[Callable[[str, str], str]]' = None
) → Union[TrustworthyRAGScore, list[TrustworthyRAGScore]]

Evaluate an existing RAG system’s response to a given user query and retrieved context.

Batch processing (supplying lists of strings) will be supported in a future release.

Args:

  • response (str | Sequence[str]): A response (or list of multiple responses) from your LLM/RAG system.
  • query (str | Sequence[str]): The user query (or list of multiple queries) that was used to generate the response.
  • context (str | Sequence[str]): The context (or list of multiple contexts) that was retrieved from the RAG Knowledge Base and used to generate the response.
  • prompt (str | Sequence[str], optional): Optional prompt (or list of multiple prompts) representing the actual inputs (combining query, context, and system instructions into one string) to the LLM that generated the response.
  • form_prompt (Callable[[str, str], str], optional): Optional function to format the prompt based on query and context. Cannot be provided together with prompt; provide one or the other. This function should take query and context as parameters and return a formatted prompt string. If not provided, a default prompt formatter will be used. To include a system prompt or any other special instructions for your LLM, incorporate them directly in your custom form_prompt() function definition.

Returns:

  • TrustworthyRAGScore | list[TrustworthyRAGScore]: TrustworthyRAGScore object containing evaluation metrics.
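
Example:

A minimal sketch of scoring an existing RAG response (the query, context, and response strings are illustrative; assumes CLEANLAB_TLM_API_KEY is set):

    from cleanlab_tlm.utils.rag import TrustworthyRAG

    trustworthy_rag = TrustworthyRAG()

    evaluation = trustworthy_rag.score(
        query="What is the combined property tax rate for parcel 12-B?",
        context="Parcel 12-B is assessed a combined city and county property tax rate of 1.8%.",
        response="The combined property tax rate for parcel 12-B is 1.8%.",
    )

    # The result is a TrustworthyRAGScore (a dict); each key maps to an EvalMetric.
    print(evaluation["trustworthiness"]["score"])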

method generate

generate(
    query: 'Union[str, Sequence[str]]',
    context: 'Union[str, Sequence[str]]',
    prompt: 'Optional[Union[str, Sequence[str]]]' = None,
    form_prompt: 'Optional[Callable[[str, str], str]]' = None
) → Union[TrustworthyRAGResponse, list[TrustworthyRAGResponse]]

Generate a RAG response and evaluate/score it simultaneously.

You can use this method in place of the generator LLM in your RAG application (no change to your prompts needed). It will produce both the response (based on the query/context) and the corresponding evaluations computed by score().

This method relies on the same arguments as score(), except you should not provide a response.

Returns:

  • TrustworthyRAGResponse | list[TrustworthyRAGResponse]: TrustworthyRAGResponse object containing the generated response text and corresponding evaluation scores.
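
Example:

A minimal sketch of generating and scoring a response in one call (the query and context strings are illustrative; assumes CLEANLAB_TLM_API_KEY is set):

    from cleanlab_tlm.utils.rag import TrustworthyRAG

    trustworthy_rag = TrustworthyRAG()

    result = trustworthy_rag.generate(
        query="What is the combined property tax rate for parcel 12-B?",
        context="Parcel 12-B is assessed a combined city and county property tax rate of 1.8%.",
    )

    # The result is a TrustworthyRAGResponse (a dict) with the generated text plus evaluations.
    print(result["response"])
    print(result["trustworthiness"]["score"])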

method get_evals

get_evals() → list[Eval]

Get the list of Evals that this TrustworthyRAG instance checks.

This method returns a copy of the internal evaluation criteria list (to prevent accidental modification of the instance’s evaluation criteria). The returned list contains all evaluation criteria currently configured for this TrustworthyRAG instance, whether they are the default evaluations or custom evaluations provided during initialization. To change which Evals are run, instantiate a new TrustworthyRAG instance.

Returns:

  • list[Eval]: A list of Eval objects which this TrustworthyRAG instance checks.
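
Example:

A minimal sketch of inspecting the Evals configured on an instance (this assumes each Eval exposes the name it was constructed with):

    from cleanlab_tlm.utils.rag import TrustworthyRAG

    trustworthy_rag = TrustworthyRAG()
    for eval_criterion in trustworthy_rag.get_evals():
        print(eval_criterion.name)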

function get_default_evals

get_default_evals() → list[Eval]

Get the evaluation criteria that are run in TrustworthyRAG by default.

Returns:

  • list[Eval]: A list of Eval objects based on pre-configured criteria that can be used with TrustworthyRAG.

Example:

    from cleanlab_tlm.utils.rag import TrustworthyRAG, get_default_evals

    default_evaluations = get_default_evals()

    # You can modify the default Evals by:
    # 1. Adding new evaluation criteria
    # 2. Updating existing criteria with custom text
    # 3. Removing specific evaluations you don't need
    # For example, drop one default evaluation by name (the name shown is illustrative):
    modified_evaluations = [e for e in default_evaluations if e.name != "response_helpfulness"]

    # Run TrustworthyRAG with your modified Evals
    trustworthy_rag = TrustworthyRAG(evals=modified_evaluations)

class Eval

Class representing an evaluation for TrustworthyRAG.

Args:

  • name (str): The name of the evaluation, used to identify this specific evaluation in the results.
  • criteria (str): The evaluation criteria text that describes what aspect is being evaluated and how.
  • query_identifier (str, optional): The exact string used in your evaluation criteria to reference the user’s query. For example, specifying query_identifier as “User Question” means your criteria should refer to the query as “User Question”. Leave this value as None (the default) if this Eval doesn’t consider the query.
  • context_identifier (str, optional): The exact string used in your evaluation criteria to reference the retrieved context. For example, specifying context_identifier as “Retrieved Documents” means your criteria should refer to the context as “Retrieved Documents”. Leave this value as None (the default) if this Eval doesn’t consider the context.
  • response_identifier (str, optional): The exact string used in your evaluation criteria to reference the RAG/LLM response. For example, specifying response_identifier as “AI Answer” means your criteria should refer to the response as “AI Answer”. Leave this value as None (the default) if this Eval doesn’t consider the response.
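
Example:

A sketch of defining a custom Eval and running it alongside the defaults (the name and criteria text are made up for illustration; note how the criteria refers to the query and response using the exact identifier strings):

    from cleanlab_tlm.utils.rag import Eval, TrustworthyRAG, get_default_evals

    conciseness_eval = Eval(
        name="response_conciseness",
        criteria=(
            "Determine whether the AI Answer directly addresses the User Question "
            "without including unnecessary or unrelated information."
        ),
        query_identifier="User Question",
        response_identifier="AI Answer",
        # context_identifier is omitted since this Eval does not consider the retrieved context.
    )

    trustworthy_rag = TrustworthyRAG(evals=get_default_evals() + [conciseness_eval])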

class EvalMetric

Evaluation metric reporting a quality score and optional logs.

Attributes:

  • score (float, optional): score between 0 and 1 corresponding to the evaluation metric. A higher score indicates a higher rating for the specific evaluation criterion being measured.

  • log (dict, optional): additional logs and metadata, reported only if the log key was specified in TLMOptions.


class TrustworthyRAGResponse

Object returned by TrustworthyRAG.generate() containing generated text and evaluation scores. This class is a dictionary with specific keys.

Attributes:

  • response (str): The generated response text.
  • trustworthiness (EvalMetric): Overall trustworthiness of the response.
  • Additional keys: Various evaluation metrics (context_sufficiency, response_helpfulness, etc.), each following the EvalMetric structure.

Example:

    {
        "response": "<response text>",
        "trustworthiness": {
            "score": 0.92,
            "log": {"explanation": "Did not find a reason to doubt trustworthiness."}
        },
        "context_informativeness": {
            "score": 0.65
        },
        ...
    }

class TrustworthyRAGScore

Object returned by TrustworthyRAG.score() containing evaluation scores. This class is a dictionary with specific keys.

Attributes:

  • trustworthiness (EvalMetric): Overall trustworthiness of the response.
  • Additional keys: Various evaluation metrics (context_sufficiency, response_helpfulness, etc.), each following the EvalMetric structure.

Example:

    {
        "trustworthiness": {
            "score": 0.92,
            "log": {"explanation": "Did not find a reason to doubt trustworthiness."}
        },
        "context_informativeness": {
            "score": 0.65
        },
        ...
    }