Trustworthy Retrieval-Augmented Generation with Cleanlab

Run in Google Colab

Retrieval-Augmented Generation (RAG) allows LLMs to answer domain-specific queries by retrieving relevant context (documents) from a knowledge base, and then concatenating query + context within the prompt used by the LLM to generate a final response. Cleanlab evaluates the trustworthiness of every RAG response in real-time, combining scores for each response with additional Evals for other RAG components like the retrieved context. Cleanlab works with any RAG architecture, retrieval/indexing methodology, and LLM model.

TLM adds a trustworthiness score to every RAG response

Overview

Here’s all the code needed for trustworthy RAG:

from cleanlab_tlm import TrustworthyRAG
trustworthy_rag = TrustworthyRAG() # optional configurations can improve: latency, accuracy, explanations

# Your existing RAG code:
context = rag_retrieve_context(user_query)
prompt = rag_form_prompt(user_query, context)
response = rag_generate_response(prompt)

# Detect issues with Cleanlab:
results = trustworthy_rag.score(query=user_query, context=context, response=response, form_prompt=rag_form_prompt)

The returned results will be a dict with keys like: ‘trustworthiness’, ‘response_helpfulness’, ‘context_sufficiency’, … Each points to a quality score between 0-1 that evaluates one type of issue in your RAG system.
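
To act on these scores in production, one simple pattern is to threshold the trustworthiness score and fall back to a safe reply when it is low. Below is a minimal sketch of that idea; the 0.8 cutoff and the fallback message are our own illustrative choices, not part of the library.

TRUST_THRESHOLD = 0.8  # illustrative cutoff -- tune on your own data

def guarded_response(results, response):
    """Return the RAG response only if its trustworthiness score clears the threshold."""
    if results["trustworthiness"]["score"] < TRUST_THRESHOLD:
        return "I'm not confident in my answer, so let me connect you with a human agent."
    return response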

Setup

This tutorial requires a TLM API key. Get one at https://tlm.cleanlab.ai/.

# Install the required packages
%pip install -U cleanlab-tlm pandas
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
# Import libraries
import pandas as pd
from cleanlab_tlm import TrustworthyRAG, Eval, get_default_evals

Example RAG Use-Case

Given a user query, let’s assume that your RAG system: retrieves relevant context from a knowledge base, formats an LLM prompt based on the query and context (plus auxiliary system instructions), and generates a response using this prompt. You can run this tutorial no matter what RAG architecture or LLM model you’re using!

For this tutorial, we’ll consider an example customer support RAG use-case, loading an example dataset of query, context, and response values. For simplicity, our retrieved context is hardcoded as a single customer service policy document. Replace our examples with the outputs of your RAG system, and Cleanlab will detect issues in your outputs in real-time.

Optional: Example queries, retrieved contexts, and generated responses from a RAG system (stored in DataFrame)

customer_service_policy = """The following is the customer service policy of ACME Inc.
# ACME Inc. Customer Service Policy

## Table of Contents
1. Free Shipping Policy
2. Free Returns Policy
3. Fraud Detection Guidelines
4. Customer Interaction Tone

## 1. Free Shipping Policy

### 1.1 Eligibility Criteria
- Free shipping is available on all orders over $50 within the continental United States.
- For orders under $50, a flat rate shipping fee of $5.99 will be applied.
- Free shipping is not available for expedited shipping methods (e.g., overnight or 2-day shipping).

### 1.2 Exclusions
- Free shipping does not apply to orders shipped to Alaska, Hawaii, or international destinations.
- Oversized or heavy items may incur additional shipping charges, which will be clearly communicated to the customer before purchase.

### 1.3 Handling Customer Inquiries
- If a customer inquires about free shipping eligibility, verify the order total and shipping destination.
- Inform customers of ways to qualify for free shipping (e.g., adding items to reach the $50 threshold).
- For orders just below the threshold, you may offer a one-time courtesy free shipping if it's the customer's first purchase or if they have a history of large orders.

## 2. Free Returns Policy

### 2.1 Eligibility Criteria
- Free returns are available for all items within 30 days of the delivery date.
- Items must be unused, unworn, and in their original packaging with all tags attached.
- Free returns are limited to standard shipping methods within the continental United States.

### 2.2 Exclusions
- Final sale items, as marked on the product page, are not eligible for free returns.
- Customized or personalized items are not eligible for free returns unless there is a manufacturing defect.
- Undergarments, swimwear, and earrings are not eligible for free returns due to hygiene reasons.

### 2.3 Process for Handling Returns
1. Verify the order date and ensure it falls within the 30-day return window.
2. Ask the customer about the reason for the return and document it in the system.
3. Provide the customer with a prepaid return label if they qualify for free returns.
4. Inform the customer of the expected refund processing time (5-7 business days after receiving the return).

### 2.4 Exceptions
- For items damaged during shipping or with manufacturing defects, offer an immediate replacement or refund without requiring a return.
- For returns outside the 30-day window, use discretion based on the customer's history and the reason for the late return. You may offer store credit as a compromise.

## 3. Fraud Detection Guidelines

### 3.1 Red Flags for Potential Fraud
- Multiple orders from the same IP address with different customer names or shipping addresses.
- Orders with unusually high quantities of the same item.
- Shipping address different from the billing address, especially if in different countries.
- Multiple failed payment attempts followed by a successful one.
- Customers pressuring for immediate shipping or threatening to cancel the order.

### 3.2 Verification Process
1. For orders flagging as potentially fraudulent, place them on hold for review.
2. Verify the customer's identity by calling the phone number on file.
3. Request additional documentation (e.g., photo ID, credit card statement) if necessary.
4. Cross-reference the shipping address with known fraud databases.

### 3.3 Actions for Confirmed Fraud
- Cancel the order immediately and refund any charges.
- Document the incident in the customer's account and flag it for future reference.
- Report confirmed fraud cases to the appropriate authorities and credit card companies.

### 3.4 False Positives
- If a legitimate customer is flagged, apologize for the inconvenience and offer a small discount or free shipping on their next order.
- Document the incident to improve our fraud detection algorithms.

## 4. Customer Interaction Tone

### 4.1 General Guidelines
- Always maintain a professional, friendly, and empathetic tone.
- Use the customer's name when addressing them.
- Listen actively and paraphrase the customer's concerns to ensure understanding.
- Avoid negative language; focus on what can be done rather than what can't.

### 4.2 Specific Scenarios

#### Angry or Frustrated Customers
- Remain calm and do not take comments personally.
- Acknowledge the customer's feelings and apologize for their negative experience.
- Focus on finding a solution and clearly explain the steps you'll take to resolve the issue.
- If necessary, offer to escalate the issue to a supervisor.

#### Confused or Indecisive Customers
- Be patient and offer clear, concise explanations.
- Ask probing questions to better understand their needs.
- Provide options and explain the pros and cons of each.
- Offer to send follow-up information via email if the customer needs time to decide.

#### VIP or Loyal Customers
- Acknowledge their status and thank them for their continued business.
- Be familiar with their purchase history and preferences.
- Offer exclusive deals or early access to new products when appropriate.
- Go above and beyond to exceed their expectations.

### 4.3 Language and Phrasing
- Use positive language: "I'd be happy to help you with that" instead of "I can't do that."
- Avoid technical jargon or abbreviations that customers may not understand.
- Use "we" statements to show unity with the company: "We value your feedback" instead of "The company values your feedback."
- End conversations on a positive note: "Is there anything else I can assist you with today?"

### 4.4 Written Communication
- Use proper grammar, spelling, and punctuation in all written communications.
- Keep emails and chat responses concise and to the point.
- Use bullet points or numbered lists for clarity when providing multiple pieces of information.
- Include a clear call-to-action or next steps at the end of each communication.

The following dialogue features a discussion between a user and a customer service bot. The bot attempts to help the customer but must respect the guidelines in the customer service policy. The bot provides very accurate and concise answers. The bot does not tell the user to contact customer service
Remember, as a representative of ACME Inc., you are often the first point of contact for our customers. Your interactions should always reflect our commitment to exceptional customer service and satisfaction.
"""

# Example queries and retrieved context + generated response
data = [
    {
        "query": "What's your free shipping policy for orders within the continental US?",
        "context": customer_service_policy,
        "response": "We offer free shipping on all orders over $50 within the continental United States. Orders under $50 have a flat rate shipping fee of $5.99. Expedited shipping methods, oversized items, and orders to Alaska or Hawaii don't qualify for free shipping."
    },
    {
        "query": "What is your return policy?",
        "context": customer_service_policy,
        "response": "We offer a 90-day return policy with full refunds on all purchases. Returns must be in original condition with packaging. We provide prepaid return labels and process refunds within 3 business days of receiving your return."
    },
    {
        "query": "What are your product warranty periods?",
        "context": customer_service_policy,
        "response": "Thank you for your inquiry about our warranty periods. I'd be happy to assist you with that. While I don't have the specific warranty information in our current policy document, we do offer warranties on most products. For detailed warranty information about specific products, I recommend checking the product packaging or documentation that came with your purchase. Is there a particular product you're inquiring about?"
    }
]

df = pd.DataFrame(data)

In practice, your RAG system should already have functions to retrieve context and generate responses. For this tutorial, we’ll simulate such functions based on the examples above.

Optional: Toy RAG methods you should replace with existing methods from your RAG system

def rag_retrieve_context(query):
    """Simulate retrieval from a knowledge base"""
    # In a real system, this would search the knowledge base
    for item in data:
        if item["query"] == query:
            return item["context"]
    return ""

def rag_form_prompt(query, context):
    """Format a prompt used by your RAG system's LLM to generate response based on query and retrieved context. Note that you'll want to include any system instructions to your LLM here as well (eg. to specify desired tone/formatting of responses)."""
    return f"""You are a customer service agent for ACME Inc. Your task is to answer the following customer question based on the customer service policy.

Customer Service Policy: {context}
Customer Question: {query}
"""

def rag_generate_response(prompt):
    """Simulate LLM response generation"""
    # In a real system, this would call an LLM
    query = prompt.split("Customer Question: ")[1].split("\n")[0]
    for item in data:
        if item["query"] == query:
            return item["response"]
    return ""

Real-time Evaluation using TrustworthyRAG

Cleanlab’s TrustworthyRAG runs many real-time evaluations to detect issues in your RAG system. It runs Cleanlab’s state-of-the-art LLM uncertainty estimator, the Trustworthy Language Model, to provide a trustworthiness score indicating overall confidence that your RAG system’s response is correct. TrustworthyRAG can simultaneously run additional evaluations to diagnose why responses are likely incorrect or other types of issues. Let’s see what Evals are run by default:

default_evals = get_default_evals()
print(default_evals)
[{
'name': 'context_sufficiency',
'criteria': 'Determine if the Document contains 100% of the information needed to answer the Question. If any external knowledge or assumptions are required, it does not meet the criteria. Each Question component must have explicit support in the Document.',
'query_identifier': 'Question',
'context_identifier': 'Document',
'response_identifier': None
}, {
'name': 'response_groundedness',
'criteria': 'Review the Response to the Query and assess whether every factual claim in the Response is explicitly supported by the provided Context. A Response meets the criteria if all information is directly backed by evidence in the Context, without relying on assumptions, external knowledge, or unstated inferences. The focus is on whether the Response is fully grounded in the Context, rather than whether it fully addresses the Query. If any claim in the Response lacks direct support or introduces information not present in the Context, the Response is bad and does not meet the criteria.',
'query_identifier': 'Query',
'context_identifier': 'Context',
'response_identifier': 'Response'
}, {
'name': 'response_helpfulness',
'criteria': 'Assess whether the AI Assistant Response is a helpful answer to the User Query.
A Response is not helpful if it:
- Is not useful, incomplete, or unclear
- Abstains or refuses to answer the question
- Contains statements which are similar to 'I don't know', 'Sorry', or 'No information available'
- Leaves part of the original User Query unresolved',
'query_identifier': 'User Query',
'context_identifier': None,
'response_identifier': 'AI Assistant Response'
}, {
'name': 'query_ease',
'criteria': 'Determine whether the above User Request appears simple and straightforward. A bad User Request will appear either: disgruntled, complex, purposefully tricky, abnormal, or vague, perhaps missing vital information needed to answer properly. The simpler the User Request appears, the better. If you believe an AI Assistant could correctly answer this User Request, it is considered good. If the User Request is non-propositional language, it is also considered good.',
'query_identifier': 'User Request',
'context_identifier': None,
'response_identifier': None
}]

Each Eval returns a score between 0-1 (higher is better) that assesses a different aspect of your RAG system:

  1. context_sufficiency: Evaluates whether the retrieved context contains sufficient information to completely answer the query. A low score indicates that key information is missing from the context (perhaps due to poor retrieval or missing documents).

  2. response_groundedness: Evaluates whether claims/information stated in the response are explicitly supported by the provided context.

  3. response_helpfulness: Evaluates whether the response effectively addresses the user query and appears helpful.

  4. query_ease: Evaluates whether the user query seems easy for an AI system to properly handle. Complex, vague, tricky, or disgruntled-sounding queries receive lower scores.

We recommend using the trustworthiness score to automatically flag potentially incorrect responses, and these additional Evals to diagnose what part of the RAG system led to a bad response.
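
One way to operationalize this recommendation is sketched below: flag a response when its trustworthiness score is low, then use the other Eval scores to guess which RAG component is at fault. The thresholds here are illustrative choices of ours, not library defaults.

def diagnose(results, trust_threshold=0.7, eval_threshold=0.3):
    """Map TrustworthyRAG scores to a rough diagnosis of the failing RAG component (illustrative thresholds)."""
    if results["trustworthiness"]["score"] >= trust_threshold:
        return "response looks OK"
    if results["context_sufficiency"]["score"] < eval_threshold:
        return "likely retrieval issue: the context is missing key information"
    if results["response_groundedness"]["score"] < eval_threshold:
        return "likely hallucination: the response is not grounded in the retrieved context"
    if results["query_ease"]["score"] < eval_threshold:
        return "the query itself is vague, complex, or tricky"
    return "untrustworthy response: review manually"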

Evaluating RAG Responses with TrustworthyRAG.score()

You can evaluate every response from your RAG system using TrustworthyRAG’s score() method. Here we do this using a helper function that evaluates one row (query + context + response example) from our earlier dataframe of examples. You can use the score() method however best suits your RAG system.

def evaluate_df_row(df, row_index, evaluator, verbose=False):
    """
    Evaluate a specific row from the dataframe using TrustworthyRAG

    Args:
        df: DataFrame containing queries
        row_index: Index of the row to evaluate
        evaluator: TrustworthyRAG instance to use for evaluation
        verbose (bool, optional): Whether to print detailed output. Defaults to False.

    Returns:
        None. Prints the evaluation results for the selected row.
    """
    # Select a query to evaluate
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # Get the retrieved context
    retrieved_context = rag_retrieve_context(user_query)
    if verbose:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # Format the RAG prompt
    rag_prompt = rag_form_prompt(user_query, retrieved_context)

    # Get the LLM response
    llm_response = rag_generate_response(rag_prompt)
    print(f"Generated response: {llm_response}\n")

    # Evaluate the response (this is the only code you need to add to your RAG system)
    result = evaluator.score(
        query=user_query,
        context=retrieved_context,
        response=llm_response,
        form_prompt=rag_form_prompt
    )

    print("Evaluation results:")
    for metric, value in result.items():
        if 'log' in value and 'explanation' in value['log']:
            print(f"Explanation: {value['log']['explanation']}\n")
        print(f"{metric}: {value['score']}")

Let’s evaluate the RAG response to our first example query. Reviewing the RAG results manually, we find both the RAG response and the retrieved context seem good. Cleanlab’s score() automatically determined this in real-time!

trustworthy_rag = TrustworthyRAG()  # Use default evals/configurations

evaluate_df_row(df, row_index=0, evaluator=trustworthy_rag)
Query: What's your free shipping policy for orders within the continental US?

Generated response: We offer free shipping on all orders over $50 within the continental United States. Orders under $50 have a flat rate shipping fee of $5.99. Expedited shipping methods, oversized items, and orders to Alaska or Hawaii don't qualify for free shipping.

Evaluation results:
trustworthiness: 0.966935913156101
context_sufficiency: 0.9975116191140101
response_groundedness: 0.9975024700788124
response_helpfulness: 0.9975123110551495
query_ease: 0.9975087687500489

Let’s evaluate the RAG response to our second example query. Reviewing the RAG results manually, we find the RAG response appears hallucinated. Cleanlab’s score() automatically determined this in real-time!

evaluate_df_row(df, row_index=1, evaluator=trustworthy_rag)
Query: What is your return policy?

Generated response: We offer a 90-day return policy with full refunds on all purchases. Returns must be in original condition with packaging. We provide prepaid return labels and process refunds within 3 business days of receiving your return.

Evaluation results:
trustworthiness: 0.06805964003033721
context_sufficiency: 0.9973566287120986
response_groundedness: 0.0024875635688671963
response_helpfulness: 0.9975124307760344
query_ease: 0.9975123107600834

Let’s evaluate the RAG response to our third example query. Reviewing the RAG results manually, we find the RAG system’s retrieved context appears insufficient. Cleanlab’s score() automatically caught this bad retrieval in real-time!

evaluate_df_row(df, row_index=2, evaluator=trustworthy_rag)
Query: What are your product warranty periods?

Generated response: Thank you for your inquiry about our warranty periods. I'd be happy to assist you with that. While I don't have the specific warranty information in our current policy document, we do offer warranties on most products. For detailed warranty information about specific products, I recommend checking the product packaging or documentation that came with your purchase. Is there a particular product you're inquiring about?

Evaluation results:
trustworthiness: 0.10388353414471514
context_sufficiency: 0.0024875622257826786
response_groundedness: 0.0025381785910353174
response_helpfulness: 0.45465018925211553
query_ease: 0.9974776551616285

Running Specific Evals

You might choose to only run specific evaluations rather than the default set. For example, here’s how to run a TrustworthyRAG instance that exclusively evaluates context sufficiency.

context_sufficiency_eval = [eval for eval in default_evals if eval.name == "context_sufficiency"]

trustworthy_rag_context_sufficiency_only = TrustworthyRAG(evals=context_sufficiency_eval)

# Evaluate each example from our dataframe
for i in range(len(df)):
    print(f"\nEvaluating example {i}:")
    evaluate_df_row(df, row_index=i, evaluator=trustworthy_rag_context_sufficiency_only)

Evaluating example 0:
Query: What's your free shipping policy for orders within the continental US?

Generated response: We offer free shipping on all orders over $50 within the continental United States. Orders under $50 have a flat rate shipping fee of $5.99. Expedited shipping methods, oversized items, and orders to Alaska or Hawaii don't qualify for free shipping.

Evaluation results:
trustworthiness: 0.9668385030700651
context_sufficiency: 0.9975119410516021

Evaluating example 1:
Query: What is your return policy?

Generated response: We offer a 90-day return policy with full refunds on all purchases. Returns must be in original condition with packaging. We provide prepaid return labels and process refunds within 3 business days of receiving your return.

Evaluation results:
trustworthiness: 0.06841609873744349
context_sufficiency: 0.9973749268917449

Evaluating example 2:
Query: What are your product warranty periods?

Generated response: Thank you for your inquiry about our warranty periods. I'd be happy to assist you with that. While I don't have the specific warranty information in our current policy document, we do offer warranties on most products. For detailed warranty information about specific products, I recommend checking the product packaging or documentation that came with your purchase. Is there a particular product you're inquiring about?

Evaluation results:
trustworthiness: 0.44060586570286747
context_sufficiency: 0.0024875622176428434

Custom Evals

You can also specify custom evaluations to assess specific criteria, and combine them with the default evaluations for comprehensive/tailored assessment of your RAG system.

For instance, here’s how to create and run a custom eval that checks the conciseness of each RAG response.

conciseness_eval = Eval(
    name="response_conciseness",
    criteria="Evaluate whether the Assistant Response is concise and to the point without unnecessary verbosity or repetition. A good response should be brief but comprehensive, covering all necessary information without extra words or redundant explanations.",
    response_identifier="Assistant Response"
)

# Combine default evals with a custom eval
combined_evals = get_default_evals() + [conciseness_eval]

# Initialize TrustworthyRAG with combined evals
combined_trustworthy_rag = TrustworthyRAG(evals=combined_evals)

# Loop through all examples in the dataframe
for i in range(len(df)):
    print(f"\nEvaluating example {i}:")
    evaluate_df_row(df, row_index=i, evaluator=combined_trustworthy_rag)

Evaluating example 0:
Query: What's your free shipping policy for orders within the continental US?

Generated response: We offer free shipping on all orders over $50 within the continental United States. Orders under $50 have a flat rate shipping fee of $5.99. Expedited shipping methods, oversized items, and orders to Alaska or Hawaii don't qualify for free shipping.

Evaluation results:
trustworthiness: 0.9661610896010152
context_sufficiency: 0.9975117154659671
response_groundedness: 0.9975011424240706
response_helpfulness: 0.9975121355291326
query_ease: 0.9975091996978735
response_conciseness: 0.8097224895220108

Evaluating example 1:
Query: What is your return policy?

Generated response: We offer a 90-day return policy with full refunds on all purchases. Returns must be in original condition with packaging. We provide prepaid return labels and process refunds within 3 business days of receiving your return.

Evaluation results:
trustworthiness: 0.06783690857824737
context_sufficiency: 0.9973566287120986
response_groundedness: 0.0024875670052090138
response_helpfulness: 0.9975124307760344
query_ease: 0.9975123110551495
response_conciseness: 0.9473101404730212

Evaluating example 2:
Query: What are your product warranty periods?

Generated response: Thank you for your inquiry about our warranty periods. I'd be happy to assist you with that. While I don't have the specific warranty information in our current policy document, we do offer warranties on most products. For detailed warranty information about specific products, I recommend checking the product packaging or documentation that came with your purchase. Is there a particular product you're inquiring about?

Evaluation results:
trustworthiness: 0.435717961856357
context_sufficiency: 0.0024875622257826786
response_groundedness: 0.002544918268778039
response_helpfulness: 0.45465018964071213
query_ease: 0.9974938182510782
response_conciseness: 0.6398390745765199

Note the response_identifier argument we specified above. When writing custom Evals, your evaluation criteria will depend on some subset of the query, context, and response. Specify which of these matter for your Eval via the query_identifier, context_identifier, and response_identifier arguments (don’t forget, otherwise your Eval will incorrectly ignore that field). Set the values of these arguments to the exact text (string) you used to refer to this object in your evaluation criteria. For instance, your criteria might refer to the context as ‘Document’ or ‘Evidence’. Use whatever name makes the most sense for your use-case, and simply specify that name in these identifier arguments.
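
For example, if your criteria refers to the context as ‘Evidence’, the query as ‘Customer Question’, and the response as ‘Agent Reply’, pass those exact strings as the identifiers. The Eval below is a hypothetical illustration (its name and criteria are our own wording, not a built-in evaluation).

policy_adherence_eval = Eval(
    name="policy_adherence",
    criteria="Assess whether the Agent Reply only makes commitments that are explicitly authorized by the Evidence when addressing the Customer Question. A good Agent Reply never promises refunds, discounts, or shipping terms that the Evidence does not permit.",
    query_identifier="Customer Question",
    context_identifier="Evidence",
    response_identifier="Agent Reply",
)

custom_trustworthy_rag = TrustworthyRAG(evals=get_default_evals() + [policy_adherence_eval])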

Understanding Differences Between the Provided Scores
  • Trustworthiness vs. Groundedness: Trustworthiness provides a holistic measure of response reliability, considering all possible factors (which might influence how uncertain your LLM is in its response). Groundedness specifically assesses whether the response’s claims are explicitly stated/implied by the context that your RAG system retrieved from your knowledge base. Groundedness is less concerned with the overall correctness/trustworthiness of the response and specifically focused on verifying that each fact in the response is supported by the retrieved context. Both evaluations can help you detect incorrect responses in your RAG application. While groundedness scores will only be low in cases where the response hallucinates information not mentioned in the context, trustworthiness scores will also be low when the user query is vague/complex or the context seems bad.

  • Context Sufficiency: Evaluates whether the retrieved context contains all of the information required to completely answer the query, without considering the generated response.

  • Response Helpfulness: Evaluates whether the response appears to satisfy the user’s request (relevance/helpfulness), without considering its correctness or the retrieved context.

  • Query Ease: Measures how straightforward the query seems to answer, without considering the generated response or retrieved context.

Recommendations:

  • Responses with low helpfulness score may be along the lines of “I don’t know” or “I cannot answer” (e.g. fallback responses from your AI). For these unhelpful responses, ignore the trustworthiness score.
  • If your RAG app encounters tricky/vague user requests you’d like to detect, then supplement trustworthiness scores with query_ease scores.
  • If your RAG app should avoid answering questions unless the answer is clearly present in the retrieved context, then supplement trustworthiness scores with groundedness scores.
  • To distinguish between bad responses caused by LLM hallucination vs. bad retrieval or missing documents, supplement trustworthiness scores with context_sufficiency scores.

Using TrustworthyRAG.generate() in place of your own LLM

Beyond evaluating responses already generated from your LLM, TrustworthyRAG can also generate responses and evaluate them simultaneously (using one of many supported models). This replaces your own LLM within your RAG system and can be more convenient/accurate/faster.

# Initialize TrustworthyRAG with default evals
trustworthy_rag_generator = TrustworthyRAG()

# Run retrieval for a sample query
user_query = "What are your product warranty periods?"
retrieved_context = rag_retrieve_context(user_query)

# Generate a response and evaluate it simultaneously
result = trustworthy_rag_generator.generate(
    query=user_query,
    context=retrieved_context,
    form_prompt=rag_form_prompt
)

print(f"Generated Response:\n{result['response']}\n")
print("Evaluation Scores:")
for metric, value in result.items():
    if metric != "response":
        print(f"{metric}: {value['score']}")
Generated Response:
I'm sorry, but I don't have information regarding product warranty periods as it is not covered in our customer service policy. I recommend checking the product details on our website or the documentation that came with your purchase for warranty information. If you have any other questions, feel free to ask!

Evaluation Scores:
trustworthiness: 0.9005242277883816
context_sufficiency: 0.0024875622214670305
response_groundedness: 0.0027882564124391625
response_helpfulness: 0.006961761793786819
query_ease: 0.9974853475455417

Reducing Latency

To reduce how long the evaluations take, you can combine these strategies when initializing TrustworthyRAG:

  1. Run only specific evaluations (e.g. only context_sufficiency)

  2. Lower the quality_preset to: low or base

  3. Specify faster TLMOptions configurations:

    • model: Specify a smaller/faster model like gpt-4o-mini or nova-lite
    • reasoning_effort: Reduce to low or none
    • max_tokens: Limit the maximum tokens processed

The code below demonstrates how to configure TrustworthyRAG for faster speed via these approaches.

# Get only the context_sufficiency eval
context_sufficiency_eval = [eval for eval in get_default_evals() if eval.name == "context_sufficiency"]

# Customize options for lower latency
low_latency_options = {
    "model": "nova-lite",
    "reasoning_effort": "none",
    "max_tokens": 64
}

# Initialize TrustworthyRAG with faster settings:
fast_trustworthy_rag = TrustworthyRAG(
    quality_preset="low",
    options=low_latency_options,
    evals=context_sufficiency_eval
)

# Evaluate all examples from our dataframe
for i in range(len(df)):
    print(f"\nEvaluating example {i}:")
    evaluate_df_row(df, row_index=i, evaluator=fast_trustworthy_rag)

Evaluating example 0:
Query: What's your free shipping policy for orders within the continental US?

Generated response: We offer free shipping on all orders over $50 within the continental United States. Orders under $50 have a flat rate shipping fee of $5.99. Expedited shipping methods, oversized items, and orders to Alaska or Hawaii don't qualify for free shipping.

Evaluation results:
trustworthiness: 0.9631636095416907
context_sufficiency: 1.0

Evaluating example 1:
Query: What is your return policy?

Generated response: We offer a 90-day return policy with full refunds on all purchases. Returns must be in original condition with packaging. We provide prepaid return labels and process refunds within 3 business days of receiving your return.

Evaluation results:
trustworthiness: 0.06768528249514133
context_sufficiency: 1.0

Evaluating example 2:
Query: What are your product warranty periods?

Generated response: Thank you for your inquiry about our warranty periods. I'd be happy to assist you with that. While I don't have the specific warranty information in our current policy document, we do offer warranties on most products. For detailed warranty information about specific products, I recommend checking the product packaging or documentation that came with your purchase. Is there a particular product you're inquiring about?

Evaluation results:
trustworthiness: 0.5924908427770693
context_sufficiency: 0.0

To instead improve the accuracy of results, specify a more powerful model, custom Evals, or a better prompt (or form_prompt() template).
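
For example, a higher-accuracy configuration might look like the sketch below. The model name here is an assumption on our part; check the TLM documentation for the models supported by your account.

# Assumed settings for higher accuracy (verify supported model names in the TLM docs)
accuracy_options = {
    "model": "gpt-4o",           # a more powerful model than the fast "nova-lite" used above
    "reasoning_effort": "high",  # spend more internal reasoning on each evaluation
}

accurate_trustworthy_rag = TrustworthyRAG(
    options=accuracy_options,
    evals=get_default_evals() + [conciseness_eval],  # defaults plus the custom Eval defined earlier
)

evaluate_df_row(df, row_index=1, evaluator=accurate_trustworthy_rag)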

Batch Processing

Both TrustworthyRAG.score() and TrustworthyRAG.generate() support batch processing, allowing you to evaluate or generate many responses at once. This significantly improves throughput when processing many queries.

Let’s process all three of our example queries in a single batch operation:

trustworthy_rag_batch = TrustworthyRAG()

# Prepare lists of queries, contexts and responses for batch processing
batch_queries = df["query"].tolist()
batch_contexts = [rag_retrieve_context(query) for query in batch_queries]
batch_responses = [rag_generate_response(rag_form_prompt(query, context))
                   for query, context in zip(batch_queries, batch_contexts)]

# 1. Batch scoring (evaluate pre-generated responses)
batch_scores = trustworthy_rag_batch.score(
    query=batch_queries,
    context=batch_contexts,
    response=batch_responses,
    form_prompt=rag_form_prompt
)

print("Batch Scoring Results:")
for i, (query, scores) in enumerate(zip(batch_queries, batch_scores)):
    print(f"\nExample {i}: {query}")
    print(f"Trustworthiness: {scores['trustworthiness']['score']}")

# 2. Batch generation (generate and evaluate responses simultaneously)
batch_generations = trustworthy_rag_batch.generate(
    query=batch_queries,
    context=batch_contexts,
    form_prompt=rag_form_prompt
)

print("\n\nBatch Generation Results:")
for i, (query, result) in enumerate(zip(batch_queries, batch_generations)):
    print(f"\nExample {i}: {query}")
    print(f"Generated response: {result['response'][:50]}...")
    print(f"Trustworthiness: {result['trustworthiness']['score']}")
Querying TrustworthyRAG... 100%|██████████|

Batch Scoring Results:

Example 0: What's your free shipping policy for orders within the continental US?
Trustworthiness: 0.966935913156101

Example 1: What is your return policy?
Trustworthiness: 0.037328382184029694

Example 2: What are your product warranty periods?
Trustworthiness: 0.10388353414471514

Querying TrustworthyRAG... 100%|██████████|



Batch Generation Results:

Example 0: What's your free shipping policy for orders within the continental US?
Generated response: Thank you for your question! Here’s a summary of o...
Trustworthiness: 0.9635255373352174

Example 1: What is your return policy?
Generated response: Thank you for your question! Here’s a summary of o...
Trustworthiness: 0.9622359971608274

Example 2: What are your product warranty periods?
Generated response: I'm sorry, but I don't have information regarding ...
Trustworthiness: 0.9005242277883816


Explaining Low Trustworthiness Scores

To explain why certain responses are deemed untrustworthy, specify the explanation log option during initialization (see the options dict below). TrustworthyRAG will then automatically include an explanation field within each returned trustworthiness dictionary. Below we also encourage internal LLM reasoning during evaluation via the reasoning_effort configuration, which can improve explanations (and the scores as well).

trustworthy_rag_with_explanation = TrustworthyRAG(options={"reasoning_effort": "high", "log": ["explanation"]})
evaluate_df_row(df, row_index=1, evaluator=trustworthy_rag_with_explanation)
Query: What is your return policy?

Generated response: We offer a 90-day return policy with full refunds on all purchases. Returns must be in original condition with packaging. We provide prepaid return labels and process refunds within 3 business days of receiving your return.

Evaluation results:
Explanation: The proposed response is incorrect because it does not align with the customer service policy outlined for ACME Inc. The policy states that free returns are available for all items within 30 days of the delivery date, not 90 days as mentioned in the response. Additionally, the policy specifies that items must be unused, unworn, and in their original packaging with all tags attached, which is not explicitly stated in the response. The processing time for refunds is also stated as 5-7 business days after receiving the return, rather than the 3 business days mentioned in the response. Therefore, the response fails to accurately reflect the company's return policy.
This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
Our return policy allows free returns for all items within 30 days of the delivery date, provided they are unused, unworn, and in their original packaging with all tags attached. Please note that final sale items, customized items (unless defective), and specific apparel like undergarments or swimwear are not eligible for free returns due to hygiene reasons. If eligible, we will provide you with a prepaid return label. Refunds are processed within 5-7 business days after we receive the returned item. If you experience any issues with your order, like damage during shipping, we can offer an immediate replacement or refund without a return.

trustworthiness: 0.038363571877046365
context_sufficiency: 0.9975124367416152
response_groundedness: 0.0024877692279631144
response_helpfulness: 0.9975124373317179
query_ease: 0.9975124362570796

Conclusion

This tutorial demonstrated how Cleanlab’s TrustworthyRAG can automatically detect critical issues in any RAG system, like hallucinations and bad retrievals. TrustworthyRAG evaluations help you avoid losing users’ trust by flagging potentially untrustworthy responses in real-time, and help you find/fix problems with your retrieval/search mechanism, prompt design, or LLM model.

Key concepts:

  1. Use TrustworthyRAG.score() to evaluate RAG responses from any LLM.
  2. Or use TrustworthyRAG.generate() to generate and simultaneously evaluate RAG responses (using one of many supported models).
  3. Adjust the evaluations run by removing some of the defaults or adding custom Evals.
  4. We recommend you specify one of form_prompt() or prompt with the same prompt you’re using to generate responses with your LLM (otherwise TrustworthyRAG may be missing key instructions that were supplied to your own LLM); see the sketch below.
  5. Improve latency/accuracy via optional configurations like TLMOptions and quality_preset.
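
For point 4, here is a minimal sketch of passing the exact prompt string your own LLM saw, assuming score() accepts a prompt argument as referenced above.

# Sketch: score using the exact prompt your LLM received, instead of a form_prompt template
user_query = df.iloc[0]["query"]
retrieved_context = rag_retrieve_context(user_query)
full_prompt = rag_form_prompt(user_query, retrieved_context)   # the same prompt your own LLM received
llm_response = rag_generate_response(full_prompt)

results = trustworthy_rag.score(
    query=user_query,
    context=retrieved_context,
    response=llm_response,
    prompt=full_prompt,   # pass the prompt directly instead of form_prompt (assumed keyword; see point 4)
)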

Integrate TrustworthyRAG into your RAG pipeline to continuously monitor response quality and achieve more trustworthy AI.