
Detect and remediate bad responses from any RAG system with Cleanlab


This tutorial demonstrates how to automatically improve any RAG application by integrating Codex as-a-backup. For each user query, simply provide the retrieved context and generated response from your RAG app. Cleanlab will automatically detect if the response is bad (unhelpful/untrustworthy), and if so: provide an alternative response whenever a similar query has been answered in the connected Codex Project, or otherwise log this query into the Codex Project for SMEs to answer.

Codex as-a-backup

Overview

Here’s all the code needed for using Codex as-a-backup with your RAG system.

from cleanlab_codex import Validator
validator = Validator(codex_access_key=...) # optional configurations can improve accuracy/latency

# Your existing RAG code:
context = rag_retrieve_context(user_query)
prompt = rag_form_prompt(user_query, context)
response = rag_generate_response(prompt)

# Detect bad responses and remediate with Cleanlab
results = validator.validate(query=user_query, context=context, response=response,
                             form_prompt=rag_form_prompt)

final_response = (
    results["expert_answer"]  # Codex's answer
    if results["is_bad_response"] and results["expert_answer"]
    else response  # Your RAG system's initial response
)

Setup

This tutorial requires a TLM API key. Get one at https://tlm.cleanlab.ai/.

%pip install --upgrade cleanlab-codex pandas
# Set your TLM API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
# Import libraries
import pandas as pd
from cleanlab_codex.validator import Validator

Example RAG App: Product Customer Support

Consider a customer support / e-commerce RAG use-case where the Knowledge Base contains product listings like the following:

Simple water bottle product listing

Here, the inner workings of the RAG app are not important for this tutorial. What is important is that the RAG app generates a response based on a user query and a context, which are all made available for evaluation.

For simplicity, our context is hardcoded as the product listing below. You should replace this hardcoded data with the outputs of your own RAG system; Cleanlab can detect issues in these outputs in real time.

product_listing = """Simple Water Bottle - Amber (limited edition launched Jan 1st 2025)
A water bottle designed with a perfect blend of functionality and aesthetics in mind. Crafted from high-quality, durable plastic with a sleek honey-colored finish.
Price: $24.99
Dimensions: 10 inches height x 4 inches width"""
# Example queries and retrieved context + generated response from RAG system
data = [
    {
        "query": "How much water can the Simple Water Bottle hold?",
        "context": product_listing,
        "response": "The Simple Water Bottle can hold 16 oz of Water"
    },
    {
        "query": "How can I order the Simple Water Bottle in bulk?",
        "context": product_listing,
        "response": "Based on the available information, I cannot provide a complete answer to this question."
    },
    {
        "query": "How much does the Simple Water Bottle cost?",
        "context": product_listing,
        "response": "The Simple Water Bottle costs $24.99"
    },
]

df = pd.DataFrame(data)
df
query context response
0 How much water can the Simple Water Bottle hold? Simple Water Bottle - Amber (limited edition l... The Simple Water Bottle can hold 16 oz of Water
1 How can I order the Simple Water Bottle in bulk? Simple Water Bottle - Amber (limited edition l... Based on the available information, I cannot p...
2 How much does the Simple Water Bottle cost? Simple Water Bottle - Amber (limited edition l... The Simple Water Bottle costs $24.99

In practice, your RAG system should already have functions to retrieve context and generate responses. For this tutorial, we’ll simulate these functions using the above fields.

Optional: Toy RAG methods you should replace with existing methods from your RAG system
def rag_retrieve_context(query):
    """Simulate retrieval from a knowledge base"""
    # In a real system, this would search the knowledge base
    for item in data:
        if item["query"] == query:
            return item["context"]
    return ""

def rag_form_prompt(query, context):
    """Format a prompt for the RAG system"""
    return f"""You are a customer service agent. Your task is to answer the following customer questions based on the product listing.

Product Listing: {context}
Customer Question: {query}
"""

def rag_generate_response(prompt):
    """Simulate LLM response generation"""
    # In a real system, this would call an LLM
    query = prompt.split("Customer Question: ")[1].split("\n")[0]
    for item in data:
        if item["query"] == query:
            return item["response"]

    # Return a fallback response if the LLM is unable to answer the question
    return "Based on the available information, I cannot provide a complete answer to this question."

Create Codex Project

To later use Codex, we must first create a Project. Here we assume some (question, answer) pairs have already been added to the Codex Project. To learn how that was done, see our tutorial: Populating Codex.

Our existing Codex Project contains the following entries:

Codex Project Example

User queries where Codex detected a bad response from your RAG app will be logged in this Project for SMEs to later answer.

Running detection and remediation

Now that our Codex Project is configured, we can initialize a Validator object. This Validator uses the validate() method to detect bad responses in our RAG applications by running Evals, scoring responses, and flagging them as bad when they fall below certain thresholds. When a response is flagged as bad, the Validator will query Codex for an expert answer that can remediate the bad response. If no suitable answer is found, the Validator will log the query into the Codex Project for SMEs to answer.

Let’s initialize the Validator using our access key:

access_key = "<YOUR-PROJECT-ACCESS-KEY>"  # Obtain from your Project's settings page: https://codex.cleanlab.ai/
validator = Validator(codex_access_key=access_key)

Applying this Validator to a RAG system is straightforward. Here we do so using a helper function that applies the Validator to one row of our example dataframe.

def df_row_validation(df, row_index, validator, verbosity=0):
    """
    Detect and remediate bad responses in a specific row from the dataframe

    Args:
        df (DataFrame): The dataframe containing the query, context, and response to validate.
        row_index (int): The index of the row in the dataframe to validate.
        validator (Validator): The Validator object to use for detection and remediation of bad responses.
        verbosity (int): Level of printed output. Defaults to 0.
            At verbosity level 0, only the query and final response are printed.
            At verbosity level 1, the initial RAG response and the validation results are printed as well.
            At verbosity level 2, the retrieved context is also printed.
    """
    # 1. Get user query
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # 2. Standard RAG pipeline
    retrieved_context = rag_retrieve_context(user_query)
    if verbosity >= 2:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # Format the RAG prompt
    rag_prompt = rag_form_prompt(user_query, retrieved_context)
    initial_response = rag_generate_response(rag_prompt)
    if verbosity >= 1:
        print(f"Initial RAG response: {initial_response}\n")

    # 3. Detect and remediate bad responses
    results = validator.validate(
        query=user_query,
        context=retrieved_context,
        response=initial_response,
        form_prompt=rag_form_prompt,
    )

    # Use expert answer if available and response was flagged as bad
    final_response = (
        results["expert_answer"]
        if results["is_bad_response"] and results["expert_answer"]
        else initial_response
    )
    print(f"Final Response: {final_response}\n")

    # For tutorial purposes, show validation results
    if verbosity >= 1:
        print("Validation Results:")
        for key, value in results.items():
            print(f" {key}: {value}")

Let’s validate the RAG response to our first example query. The final dictionary printed by our helper function is the result of Validator.validate(), which we’ll break down below.

df_row_validation(df, 0, validator, verbosity=1)
Query: How much water can the Simple Water Bottle hold?

Initial RAG response: The Simple Water Bottle can hold 16 oz of Water

Final Response: 32oz

Validation Results:
expert_answer: 32oz
is_bad_response: True
trustworthiness: {'log': {'explanation': "This response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): \nThank you for your question! Unfortunately, the product listing does not specify the water capacity of the Simple Water Bottle. If you’re looking for a water bottle suitable for daily hydration, it typically holds around 16-20 ounces, but I recommend checking the product specifications or contacting the manufacturer for detailed information. Let me know if there's anything else I can assist you with!"}, 'score': 0.48383545940159745, 'is_bad': True}
response_helpfulness: {'score': 0.9975122945166285, 'is_bad': False}

The Validator.validate() method returns a comprehensive dictionary containing multiple evaluation metrics and remediation options. Let’s examine the key components of these results:

Core Validation Results

  1. expert_answer (String | None)

    • Contains the remediation response retrieved from the Codex Project.
    • Returns None in two scenarios:
      • When is_bad_response is False (indicating no remediation needed, so Codex is not queried).
      • When no suitable expert answer exists in the Codex Project for similar queries.
    • Returns a string containing the expert-provided answer when:
      • The response is flagged as requiring remediation (is_bad_response=True).
      • A semantically similar query exists in the Codex Project with an expert answer.
  2. is_bad_response (Boolean)

    • This is the primary validation indicator that determines whether a response requires remediation (i.e. whether expert_answer can contain a string value).
    • Will be True when any evaluation metric falls below its configured threshold.
    • Controls whether the system will attempt to fetch an expert answer from Codex. Only when is_bad_response=True will the system look up an expert answer from Codex (which also logs the corresponding query into the Codex Project).

Evaluation Metrics

The Validator extends TrustworthyRAG’s evaluation scores by adding an is_bad boolean flag to each metric. This flag indicates whether the metric’s score falls below its configured threshold, which determines if a response needs remediation.

By default, the Validator uses the following metrics:

  • trustworthiness: overall confidence that your RAG system’s response is correct
  • response_helpfulness: evaluates whether the response effectively addresses the user query and appears helpful.

You can modify these metrics by providing a custom list of evals in the trustworthy_rag_config dictionary.
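To work with these results programmatically, you can iterate over the dictionary returned by validate(). Below is a minimal sketch, assuming results holds the return value of a validator.validate() call like the one shown above; each metric entry is a dict containing a score and an is_bad flag.

# Sketch: inspect each evaluation metric in the dict returned by validator.validate().
# Assumes `results` holds that return value (see the example above).
for name, value in results.items():
    if isinstance(value, dict) and "score" in value:
        print(f"{name}: score={value['score']:.3f}, flagged as bad={value['is_bad']}")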

Let’s validate another example from our RAG system. For this example, the response is flagged as bad, but no expert answer is available in the Codex Project. The corresponding query will be logged there for SMEs to answer.

df_row_validation(df, 1, validator, verbosity=1)
Query: How can I order the Simple Water Bottle in bulk?

Initial RAG response: Based on the available information, I cannot provide a complete answer to this question.

Final Response: Based on the available information, I cannot provide a complete answer to this question.

Validation Results:
expert_answer: None
is_bad_response: True
trustworthiness: {'log': {'explanation': 'The prompt/response appear atypical or vague.'}, 'score': 0.4924309760346324, 'is_bad': True}
response_helpfulness: {'score': 0.0024875641966603293, 'is_bad': True}

The RAG system is unable to answer this question because there is no relevant information in the retrieved context, nor has a similar question been answered in the Codex Project (see the contents of the Codex Project above).

Codex automatically recognizes this question could not be answered and logs it into the Project where it awaits an answer from an SME. Navigate to your Codex Project in the Web App where you (or an SME at your company) can enter the desired answer for this query.

As soon as an answer is provided in Codex, our RAG system will be able to answer all similar questions going forward (as seen for the previous query).

Advanced Usage

You can configure many aspects of the bad response detection like what score thresholds to use and TrustworthyRAG settings.

Response Quality Thresholds

Thresholds determine when a response needs intervention:

  • Each metric (trustworthiness, helpfulness, etc.) has its own threshold (0-1)
  • If any metric’s score falls below its threshold, the response is marked as “bad”
  • Example: With trustworthiness_threshold = 0.85
    • Score 0.80 -> Marked as bad (below threshold)
    • Score 0.90 -> Passes validation (above threshold)

Setting thresholds affects your validation strategy:

  • Higher thresholds (e.g. 0.9) = Stricter validation
    • More responses marked as “bad”
    • More queries logged for SMEs to answer
    • Better response quality but higher SME workload
  • Lower thresholds (e.g. 0.7) = More lenient validation
    • Fewer responses marked as “bad”
    • Fewer queries logged for SMEs to answer
    • Lower SME workload, but may allow lower quality responses from your RAG app to be returned unremediated.

Learn more about these thresholds in the BadResponseThresholds documentation.
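For illustration, here is a minimal sketch of a stricter Validator configuration, assuming that any metric omitted from bad_response_thresholds keeps its default threshold:

# Sketch: a stricter Validator that flags more responses as bad (higher SME workload).
# Assumes metrics omitted from `bad_response_thresholds` keep their default thresholds.
strict_validator = Validator(
    codex_access_key=access_key,
    bad_response_thresholds={
        "trustworthiness": 0.9,
        "response_helpfulness": 0.9,
    },
)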

Additional Configuration

The Validator combines powerful detection capabilities with automatic remediation of bad responses. For detection, you can customize the evaluation process using the trustworthy_rag_config dictionary parameter.

The Validator supports all configuration options available in TrustworthyRAG for the detection of bad responses. You can refer to the TrustworthyRAG documentation for the complete list of supported options that can be passed in this dictionary.

The following example shows how to configure the Validator with custom thresholds and evaluation settings:

Optional: Configure custom Evals
from cleanlab_tlm.utils.rag import Eval, get_default_evals

# Select the "response_helpfulness" Eval
evals = [evaluation for evaluation in get_default_evals() if evaluation.name == "response_helpfulness"]

evals.append(
    Eval(
        name="self_containedness",
        criteria="""Assess whether the AI Assistant Response provides self-contained, standalone information that would be clear to someone who hasn't seen the User Query or Context.

A good response should:
- Include relevant subjects and context from the User Query (and Context if necessary)
- Avoid pronouns (like "it", "he", "she") without clear antecedents
- Be understandable on its own without requiring the original User Query (and Context if necessary) for reference

For example:
- "27" is less self-contained than "I am 27 years old"
- "Yes" is less self-contained than "Yes, the store is open on Sundays"
- "$50" is less self-contained than "The product costs $50"
- "He is" is less self-contained than "He is a good person", which itself is less self-contained than "John is a good person"

A self-contained, complete AI Assistant Response would be considered good when the AI Assistant doesn't require the Context for reference when asked the same User Query again.
""",
        query_identifier="User Query",
        context_identifier="Context",
        response_identifier="AI Assistant Response",
    )
)

validator = Validator(
    codex_access_key=access_key,
    tlm_api_key=os.environ["CLEANLAB_TLM_API_KEY"],
    bad_response_thresholds={
        "trustworthiness": 0.85,
        "response_helpfulness": 0.9,
        "self_containedness": 0.75,
    },
    trustworthy_rag_config={
        "quality_preset": "base",
        "options": {
            "model": "gpt-4o-mini",
            "log": ["explanation"],
        },
        "evals": evals,
    },
)

Let’s validate another example from our RAG system.

df_row_validation(df, 2, validator, verbosity=1)
Query: How much does the Simple Water Bottle cost?

Initial RAG response: The Simple Water Bottle costs $24.99

Final Response: The Simple Water Bottle costs $24.99

Validation Results:
expert_answer: None
is_bad_response: False
trustworthiness: {'log': {'explanation': 'Did not find a reason to doubt trustworthiness.'}, 'score': 1.0, 'is_bad': False}
response_helpfulness: {'score': 0.9975124087312727, 'is_bad': False}
self_containedness: {'score': 0.995846566624026, 'is_bad': False}

Detection-Only Mode

While Validator.validate() provides a complete solution to detect, log, and fix bad responses, you might want to detect RAG issues without logging/remediation or other side-effects.

Validator.detect() runs the same detection as the validate() method, but without affecting the Codex Project at all. detect() returns the same evaluation scores and is_bad_response flag as validate(); only the expert_answer is missing, because the Codex Project is never queried.

Use Validator.detect() to test/tune detection configurations like score thresholds and TrustworthyRAG settings. Validator.detect() will not affect your Codex Project, whereas Validator.validate() will log queries whose response was detected as bad into the Codex Project and is thus for production use, not testing. Both methods run the same detection logic, so you can use detect() to first optimize detections and then switch to using validate().
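Before wrapping detect() in a helper function, here is a minimal direct-call sketch reusing the hardcoded product_listing and one example response from this tutorial:

# Sketch: call detect() directly on one example from this tutorial.
# detect() returns a (scores, is_bad_response) tuple and never queries or logs to the Codex Project.
scores, is_bad = validator.detect(
    query="How much does the Simple Water Bottle cost?",
    context=product_listing,
    response="The Simple Water Bottle costs $24.99",
    form_prompt=rag_form_prompt,
)
print(is_bad)  # expected to be False for a good response like this one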

def df_row_detection(df, row_index, validator, verbosity=0):
    """
    Detect bad responses in a specific row from the dataframe

    Args:
        df (DataFrame): The dataframe containing the query, context, and response to validate.
        row_index (int): The index of the row in the dataframe to validate.
        validator (Validator): The Validator object to use for detection of bad responses.
        verbosity (int): Level of printed output. Defaults to 0.
            At verbosity level 0, the query, initial RAG response, and detection results are printed.
            At verbosity level 1, the retrieved context is also printed.
    """
    # 1. Get user query
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # 2. Standard RAG pipeline
    retrieved_context = rag_retrieve_context(user_query)
    if verbosity >= 1:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # Format the RAG prompt
    rag_prompt = rag_form_prompt(user_query, retrieved_context)
    initial_response = rag_generate_response(rag_prompt)
    print(f"Initial RAG response: {initial_response}\n")

    # 3. Detect bad responses (no Codex lookup or logging)
    scores, is_bad_response = validator.detect(
        query=user_query,
        context=retrieved_context,
        response=initial_response,
        form_prompt=rag_form_prompt,
    )

    # Print results
    print("Validation Results:")
    for key, value in scores.items():
        print(f" {key}: {value}")
    print(f"\n is_bad_response: {is_bad_response}")

Let’s take another look at the previous example from our RAG system.

df_row_detection(df, 2, validator)
Query: How much does the Simple Water Bottle cost?

Initial RAG response: The Simple Water Bottle costs $24.99

Validation Results:
trustworthiness: {'log': {'explanation': 'Did not find a reason to doubt trustworthiness.'}, 'score': 1.0, 'is_bad': False}
response_helpfulness: {'score': 0.9975124087312727, 'is_bad': False}
self_containedness: {'score': 0.9917940685074799, 'is_bad': False}

is_bad_response: False

Next Steps

Now that Cleanlab is integrated with your RAG app, you and SMEs can open the Codex Project and answer questions logged there to continuously improve your AI.

This tutorial demonstrated how to use Cleanlab to automatically detect and remediate bad responses in any RAG application. Cleanlab provides a robust way to evaluate response quality and automatically fetch expert answers when needed. For responses that don’t meet quality thresholds, Codex automatically logs the queries for SME review.

Adding Cleanlab only improves your RAG app. Once integrated, it automatically identifies problematic responses and either remediates them with expert answers or logs them for review. Using a simple web interface, SMEs at your company can answer the highest priority questions in the Codex Project. As soon as an answer is entered in Codex, your RAG app will be able to properly handle all similar questions encountered in the future.

Codex is the fastest way for nontechnical SMEs to directly improve your RAG system. As the Developer, you simply integrate Cleanlab once, and from then on, SMEs can continuously improve how your system handles common user queries without needing your help.

Need help, more capabilities, or other deployment options?
Check the FAQ or email us at: support@cleanlab.ai