Detect and remediate bad responses from any RAG system with Cleanlab
This tutorial demonstrates how to automatically improve any RAG application by integrating Codex as-a-backup. For each user query, simply provide the retrieved context and generated response from your RAG app. Cleanlab will automatically detect if the response is bad (unhelpful/untrustworthy), and if so: provide an alternative response whenever a similar query has been answered in the connected Codex Project, or otherwise log this query into the Codex Project for SMEs to answer.
Overview
Here’s all the code needed for using Codex as-a-backup with your RAG system.
from cleanlab_codex import Validator

validator = Validator(codex_access_key=...)  # optional configurations can improve accuracy/latency

# Your existing RAG code:
context = rag_retrieve_context(user_query)
prompt = rag_form_prompt(user_query, context)
response = rag_generate_response(prompt)

# Detect bad responses and remediate with Cleanlab
results = validator.validate(
    query=user_query, context=context, response=response, form_prompt=rag_form_prompt
)

final_response = (
    results["expert_answer"]  # Codex's answer
    if results["is_bad_response"] and results["expert_answer"]
    else response  # Your RAG system's initial response
)
Setup
This tutorial requires a TLM API key. Get one at https://tlm.cleanlab.ai/.
%pip install --upgrade cleanlab-codex pandas
# Set your TLM API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
# Import libraries
import pandas as pd
from cleanlab_codex.validator import Validator
Example RAG App: Product Customer Support
Consider a customer support / e-commerce RAG use-case where the Knowledge Base contains product listings like the following:
Here, the inner workings of the RAG app are not important for this tutorial. What is important is that the RAG app generates a response based on a user query and a context, which are all made available for evaluation.
For simplicity, our context is hardcoded as the product listing below. You should replace these with the outputs of your RAG system, noting that Cleanlab can detect issues in these outputs in real-time.
product_listing = """Simple Water Bottle - Amber (limited edition launched Jan 1st 2025)
A water bottle designed with a perfect blend of functionality and aesthetics in mind. Crafted from high-quality, durable plastic with a sleek honey-colored finish.
Price: $24.99
Dimensions: 10 inches height x 4 inches width"""
# Example queries and retrieved context + generated response from RAG system
data = [
{
"query": "How much water can the Simple Water Bottle hold?",
"context": product_listing,
"response": "The Simple Water Bottle can hold 16 oz of Water"
},
{
"query": "How can I order the Simple Water Bottle in bulk?",
"context": product_listing,
"response": "Based on the available information, I cannot provide a complete answer to this question."
},
{
"query": "How much does the Simple Water Bottle cost?",
"context": product_listing,
"response": "The Simple Water Bottle costs $24.99"
},
]
df = pd.DataFrame(data)
df
| | query | context | response |
|---|---|---|---|
| 0 | How much water can the Simple Water Bottle hold? | Simple Water Bottle - Amber (limited edition l... | The Simple Water Bottle can hold 16 oz of Water |
| 1 | How can I order the Simple Water Bottle in bulk? | Simple Water Bottle - Amber (limited edition l... | Based on the available information, I cannot p... |
| 2 | How much does the Simple Water Bottle cost? | Simple Water Bottle - Amber (limited edition l... | The Simple Water Bottle costs $24.99 |
In practice, your RAG system should already have functions to retrieve context and generate responses. For this tutorial, we’ll simulate these functions using the above fields.
Optional: Toy RAG methods you should replace with existing methods from your RAG system
def rag_retrieve_context(query):
    """Simulate retrieval from a knowledge base"""
    # In a real system, this would search the knowledge base
    for item in data:
        if item["query"] == query:
            return item["context"]
    return ""

def rag_form_prompt(query, context):
    """Format a prompt for the RAG system"""
    return f"""You are a customer service agent. Your task is to answer the following customer questions based on the product listing.
Product Listing: {context}
Customer Question: {query}
"""

def rag_generate_response(prompt):
    """Simulate LLM response generation"""
    # In a real system, this would call an LLM
    query = prompt.split("Customer Question: ")[1].split("\n")[0]
    for item in data:
        if item["query"] == query:
            return item["response"]
    # Return a fallback response if the LLM is unable to answer the question
    return "Based on the available information, I cannot provide a complete answer to this question."
Create Codex Project
To later use Codex, we must first create a Project. Here we assume some (question, answer) pairs have already been added to the Codex Project. To learn how that was done, see our tutorial: Populating Codex.
Our existing Codex Project contains the following entries:
User queries where Codex detected a bad response from your RAG app will be logged in this Project for SMEs to later answer.
Running detection and remediation
Now that our Codex Project is configured, we can initialize a Validator object. This Validator uses the validate() method to detect bad responses in our RAG applications by running Evals, scoring responses, and flagging them as bad when they fall below certain thresholds.
When a response is flagged as bad, the Validator will query Codex for an expert answer that can remediate the bad response. If no suitable answer is found, the Validator will log the query into the Codex Project for SMEs to answer.
Let’s initialize the Validator using our access key:
access_key = "<YOUR-PROJECT-ACCESS-KEY>" # Obtain from your Project's settings page: https://codex.cleanlab.ai/
validator = Validator(codex_access_key=access_key)
Applying this Validator to a RAG system is straightforward. Here we do this using a helper function that applies the Validator to one row from our example dataframe.
def df_row_validation(df, row_index, validator, verbosity=0):
    """
    Detect and remediate bad responses in a specific row from the dataframe

    Args:
        df (DataFrame): The dataframe containing the query, context, and response to validate.
        row_index (int): The index of the row in the dataframe to validate.
        validator (Validator): The Validator object to use for detection and remediation of bad responses.
        verbosity (int): Whether to print verbose output. Defaults to 0.
            At verbosity level 0, only the query and final response are printed.
            At verbosity level 1, the initial RAG response and the validation results are printed as well.
            At verbosity level 2, the retrieved context is also printed.
    """
    # 1. Get user query
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # 2. Standard RAG pipeline
    retrieved_context = rag_retrieve_context(user_query)
    if verbosity >= 2:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # Format the RAG prompt
    rag_prompt = rag_form_prompt(user_query, retrieved_context)
    initial_response = rag_generate_response(rag_prompt)
    if verbosity >= 1:
        print(f"Initial RAG response: {initial_response}\n")

    # 3. Detect and remediate bad responses
    results = validator.validate(
        query=user_query,
        context=retrieved_context,
        response=initial_response,
        form_prompt=rag_form_prompt,
    )

    # Use expert answer if available and response was flagged as bad
    final_response = (
        results["expert_answer"]
        if results["is_bad_response"] and results["expert_answer"]
        else initial_response
    )
    print(f"Final Response: {final_response}\n")

    # For tutorial purposes, show validation results
    if verbosity >= 1:
        print("Validation Results:")
        for key, value in results.items():
            print(f" {key}: {value}")
Let’s validate the RAG response to our first example query. The final dictionary printed by our helper function contains the results of Validator.validate(), which we’ll break down below.
df_row_validation(df, 0, validator, verbosity=1)
The Validator.validate() method returns a comprehensive dictionary containing multiple evaluation metrics and remediation options. Let’s examine the key components of these results:
Core Validation Results
- expert_answer (String | None)
  - Contains the remediation response retrieved from the Codex Project.
  - Returns None in two scenarios:
    - When is_bad_response is False (indicating no remediation is needed, so Codex is not queried).
    - When no suitable expert answer exists in the Codex Project for similar queries.
  - Returns a string containing the expert-provided answer when:
    - The response is flagged as requiring remediation (is_bad_response=True).
    - A semantically similar query exists in the Codex Project with an expert answer.
- is_bad_response (Boolean)
  - The primary validation indicator that determines whether a response requires remediation (i.e. whether expert_answer can contain a string value).
  - Will be True when any evaluation metric falls below its configured threshold.
  - Controls whether the system will attempt to fetch an expert answer from Codex. Only when is_bad_response=True will the system look up an expert answer from Codex (which logs the corresponding query into the Codex Project).
Evaluation Metrics
The Validator extends TrustworthyRAG’s evaluation scores by adding an is_bad boolean flag to each metric. This flag indicates whether the metric’s score falls below its configured threshold, which determines if a response needs remediation.
By default, the Validator uses the following metrics:
- trustworthiness: overall confidence that your RAG system’s response is correct
- response_helpfulness: evaluates whether the response effectively addresses the user query and appears helpful
You can modify these metrics by providing a custom list of evals in the trustworthy_rag_config dictionary.
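For illustration, here’s a minimal sketch showing one way to inspect these per-metric results from validator.validate(). It assumes each Eval appears in the returned dictionary as a nested dict containing score and is_bad keys (matching the validation results printed above); treat it as illustrative rather than a definitive specification of the results structure.

# Illustrative sketch (assumption: each Eval appears in `results` as a nested dict
# with "score" and "is_bad" keys; expert_answer and is_bad_response are skipped below)
row = df.iloc[0]
results = validator.validate(
    query=row["query"], context=row["context"], response=row["response"],
    form_prompt=rag_form_prompt,
)
for name, result in results.items():
    if isinstance(result, dict) and "score" in result:
        status = "below threshold" if result.get("is_bad") else "ok"
        print(f"{name}: score={result['score']} ({status})")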
Let’s validate another example from our RAG system. For this example, the response is flagged as bad, but no expert answer is available in the Codex Project. The corresponding query will be logged there for SMEs to answer.
df_row_validation(df, 1, validator, verbosity=1)
The RAG system is unable to answer this question because there is no relevant information in the retrieved context, nor has a similar question been answered in the Codex Project (see the contents of the Codex Project above).
Codex automatically recognizes this question could not be answered and logs it into the Project where it awaits an answer from an SME. Navigate to your Codex Project in the Web App where you (or an SME at your company) can enter the desired answer for this query.
As soon as an answer is provided in Codex, our RAG system will be able to answer all similar questions going forward (as seen for the previous query).
Advanced Usage
You can configure many aspects of the bad response detection like what score thresholds to use and TrustworthyRAG settings.
Response Quality Thresholds
Thresholds determine when a response needs intervention:
- Each metric (trustworthiness, helpfulness, etc.) has its own threshold (0-1)
- If any metric’s score falls below its threshold, the response is marked as “bad”
- Example: With trustworthiness_threshold = 0.85
  - Score 0.80 -> Marked as bad (below threshold)
  - Score 0.90 -> Passes validation (above threshold)
Setting thresholds affects your validation strategy:
- Higher thresholds (e.g. 0.9) = Stricter validation
  - More responses marked as “bad”
  - More queries logged for SMEs to answer
  - Better response quality but higher SME workload
- Lower thresholds (e.g. 0.7) = More lenient validation
  - Fewer responses marked as “bad”
  - Fewer queries logged for SMEs to answer
  - Lower SME workload, but may allow lower-quality responses from your RAG app to be returned unremediated
Learn more about the thresholds in the BadResponseThresholds documentation.
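To make the decision rule concrete, here is a small illustrative sketch (not the library’s internal code) of how per-metric thresholds translate into a “bad” verdict, using made-up scores:

# Illustrative sketch of the threshold rule (not the library's internal implementation)
example_thresholds = {"trustworthiness": 0.85, "response_helpfulness": 0.9}
example_scores = {"trustworthiness": 0.80, "response_helpfulness": 0.95}

is_bad = any(example_scores[name] < threshold for name, threshold in example_thresholds.items())
print(is_bad)  # True, because trustworthiness (0.80) falls below its threshold (0.85)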
Additional Configuration
The Validator combines powerful detection capabilities with automatic remediation of bad responses. For detection, you can customize the evaluation process using the trustworthy_rag_config dictionary parameter.
The Validator supports all configuration options available in TrustworthyRAG for the detection of bad responses. You can refer to the TrustworthyRAG documentation for the complete list of supported options that can be passed in this dictionary.
The following example shows how to configure the Validator with custom thresholds and evaluation settings:
Optional: Configure custom Evals
from cleanlab_tlm.utils.rag import Eval, get_default_evals

# Select the "response_helpfulness" Eval
evals = [evaluation for evaluation in get_default_evals() if evaluation.name == "response_helpfulness"]

evals.append(
    Eval(
        name="self_containedness",
        criteria="""Assess whether the AI Assistant Response provides self-contained, standalone information that would be clear to someone who hasn't seen the User Query or Context.
A good response should:
- Include relevant subjects and context from the User Query (and Context if necessary)
- Avoid pronouns (like "it", "he", "she") without clear antecedents
- Be understandable on its own without requiring the original User Query (and Context if necessary) for reference
For example:
- "27" is less self-contained than "I am 27 years old"
- "Yes" is less self-contained than "Yes, the store is open on Sundays"
- "$50" is less self-contained than "The product costs $50"
- "He is" is less self-contained than "He is a good person", which itself is less self-contained than "John is a good person"
A self-contained, complete AI Assistant Response would be considered good when the AI Assistant doesn't require the Context for reference when asked the same User Query again.
""",
        query_identifier="User Query",
        context_identifier="Context",
        response_identifier="AI Assistant Response",
    )
)
validator = Validator(
    codex_access_key=access_key,
    tlm_api_key=os.environ["CLEANLAB_TLM_API_KEY"],
    bad_response_thresholds={
        "trustworthiness": 0.85,
        "response_helpfulness": 0.9,
        "self_containedness": 0.75,
    },
    trustworthy_rag_config={
        "quality_preset": "base",
        "options": {
            "model": "gpt-4o-mini",
            "log": ["explanation"],
        },
        "evals": evals,
    },
)
Let’s validate another example from our RAG system.
df_row_validation(df, 2, validator, verbosity=1)
Detection-Only Mode
While Validator.validate() provides a complete solution to detect, log, and fix bad responses, you might want to detect RAG issues without logging/remediation or other side-effects.
Validator.detect() runs the same detection as the validate() method, but without affecting the Codex Project at all. detect() returns nearly the same results dict as validate(); only the expert_answer key is missing (the Codex Project is ignored).
Use Validator.detect() to test/tune detection configurations like score thresholds and TrustworthyRAG settings. Validator.detect() will not affect your Codex Project, whereas Validator.validate() will log queries whose responses were detected as bad into the Codex Project; validate() is thus meant for production use, not testing. Both methods run the same detection logic, so you can use detect() to first optimize detections and then switch to using validate().
def df_row_detection(df, row_index, validator, verbosity=0):
    """
    Detect bad responses in a specific row from the dataframe

    Args:
        df (DataFrame): The dataframe containing the query, context, and response to validate.
        row_index (int): The index of the row in the dataframe to validate.
        validator (Validator): The Validator object to use for detection of bad responses.
        verbosity (int): Whether to print verbose output. Defaults to 0.
            At verbosity level 0, the query, the initial RAG response, and the detection results are printed.
            At verbosity level 1, the retrieved context is also printed.
    """
    # 1. Get user query
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # 2. Standard RAG pipeline
    retrieved_context = rag_retrieve_context(user_query)
    if verbosity >= 1:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # Format the RAG prompt
    rag_prompt = rag_form_prompt(user_query, retrieved_context)
    initial_response = rag_generate_response(rag_prompt)
    print(f"Initial RAG response: {initial_response}\n")

    # 3. Detect bad responses
    scores, is_bad_response = validator.detect(
        query=user_query,
        context=retrieved_context,
        response=initial_response,
        form_prompt=rag_form_prompt,
    )

    # Print results
    print("Validation Results:")
    for key, value in scores.items():
        print(f" {key}: {value}")
    print(f"\n is_bad_response: {is_bad_response}")
Let’s take another look at the previous example from our RAG system.
df_row_detection(df, 2, validator)
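If you want to sweep detection over many examples while tuning thresholds or TrustworthyRAG settings, you can safely loop detect() over your data, since it never logs anything to the Codex Project. Here is a rough sketch using the toy dataframe from this tutorial:

# Illustrative sketch: run detection across all example rows without touching the Codex Project
for _, row in df.iterrows():
    scores, is_bad = validator.detect(
        query=row["query"],
        context=row["context"],
        response=row["response"],
        form_prompt=rag_form_prompt,
    )
    print(f"{row['query']} -> is_bad_response={is_bad}")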
Next Steps
Now that Cleanlab is integrated with your RAG app, you and SMEs can open the Codex Project and answer questions logged there to continuously improve your AI.
This tutorial demonstrated how to use Cleanlab to automatically detect and remediate bad responses in any RAG application. Cleanlab provides a robust way to evaluate response quality and automatically fetch expert answers when needed. For responses that don’t meet quality thresholds, Codex automatically logs the queries for SME review.
Adding Cleanlab only improves your RAG app. Once integrated, it automatically identifies problematic responses and either remediates them with expert answers or logs them for review. Using a simple web interface, SMEs at your company can answer the highest priority questions in the Codex Project. As soon as an answer is entered in Codex, your RAG app will be able to properly handle all similar questions encountered in the future.
Codex is the fastest way for nontechnical SMEs to directly improve your RAG system. As the Developer, you simply integrate Cleanlab once, and from then on, SMEs can continuously improve how your system handles common user queries without needing your help.
Need help, more capabilities, or other deployment options?
Check the FAQ or email us at: support@cleanlab.ai