
Detect and remediate bad responses from any RAG application

Run in Google Colab

This tutorial demonstrates how to automatically improve any RAG application by integrating Codex as-a-backup. The Codex API takes in the AI-generated response from your RAG app, along with the same inputs provided to the LLM that generated it: the user query, retrieved context, and any other parts of the LLM prompt. Cleanlab automatically detects if your AI response is bad (e.g., untrustworthy, unhelpful, or unsafe), and the Codex API returns these real-time evaluation scores, which you can use to guardrail your AI. If your AI response is flagged as bad, the Codex API will also return an expert response whenever a similar query has already been answered in the connected Codex Project, or otherwise log the query into the Codex Project for SMEs to answer.

'Codex as a backup'

Overview

Here’s all the code needed for using Codex as-a-backup with your RAG system.

from cleanlab_codex import Project
project = Project.from_access_key(access_key)

# Your existing RAG code:
context = rag_retrieve_context(user_query)
messages = rag_form_prompt(user_query, context)
response = rag_generate_response(messages)

# Detect bad responses and remediate with Cleanlab
results = project.validate(messages=messages, query=user_query, context=context, response=response)

final_response = (
    results.expert_answer if results.expert_answer and results.escalated_to_sme
    else fallback_response if results.should_guardrail
    else response
)

Note: This tutorial is for Single-turn Q&A Apps. If you have a Multi-turn Chat app, a similar workflow is covered in the Detect and Remediate bad responses in Conversational Apps tutorial.

Setup

This tutorial requires a Codex API key. Get one here.

%pip install --upgrade cleanlab-codex pandas
# Set your Codex API key
import os
os.environ["CODEX_API_KEY"] = "<API key>" # Get your free API key from: https://codex.cleanlab.ai/account
# Import libraries
import pandas as pd
from cleanlab_codex.project import Project

Example RAG App: Product Customer Support

Consider a customer support / e-commerce RAG use-case where the Knowledge Base contains product listings like the following:

Simple water bottle product listing

Here, the inner workings of the RAG app are not important for this tutorial. What is important is that the RAG app generates a response based on a user query and a context, which are all made available for evaluation.

For simplicity, our context is hardcoded as the product listing below. You should replace these with the outputs of your RAG system, noting that Cleanlab can detect issues in these outputs in real-time.

product_listing = """Simple Water Bottle - Amber (limited edition launched Jan 1st 2025)
A water bottle designed with a perfect blend of functionality and aesthetics in mind. Crafted from high-quality, durable plastic with a sleek honey-colored finish.
Price: $24.99
Dimensions: 10 inches height x 4 inches width"""
Optional: Example queries and retrieved context + generated response from RAG system dataframe
data = [
    {
        "query": "How much water can the Simple Water Bottle hold?",
        "context": product_listing,
        "response": "The Simple Water Bottle can hold 16 oz of Water"
    },
    {
        "query": "How can I order the Simple Water Bottle in bulk?",
        "context": product_listing,
        "response": "Based on the available information, I cannot provide a complete answer to this question."
    },
    {
        "query": "How much does the Simple Water Bottle cost?",
        "context": product_listing,
        "response": "The Simple Water Bottle costs $10"
    },
]

df = pd.DataFrame(data)
df

In practice, your RAG system should already have functions to retrieve context, generate responses, and build a messages object to prompt the LLM with. For this tutorial, we’ll simulate these functions using the fields above, and define a simple fallback_response and prompt_template.

Optional: Toy RAG methods you should replace with existing methods from your RAG system

fallback_response = "I'm sorry, I couldn't find an answer for that — can I help with something else?"

prompt_template = """You are a customer service agent. Your task is to answer the following customer questions based on the product listing.

Product Listing: {context}

Customer Question: {query}
"""

def rag_retrieve_context(query):
    """Simulate retrieval from a knowledge base"""
    # In a real system, this would search the knowledge base
    for item in data:
        if item["query"] == query:
            return item["context"]
    return ""

def rag_generate_response(messages):
    """Simulate LLM response generation"""
    # In a real system, this would call an LLM
    user_prompt = messages[0]["content"]
    query = user_prompt.split("Customer Question: ")[1].split("\n")[0]
    for item in data:
        if item["query"] == query:
            return item["response"]

    # Return a fallback response if the LLM is unable to answer the question
    return "Based on the available information, I cannot provide a complete answer to this question."

def rag_form_prompt(query, context):
    """
    Form a prompt for your LLM response-generation step (from the user query, retrieved context, system instructions, etc).
    We represent the `prompt` using OpenAI's `messages` format, which matches the input to Cleanlab's `validate()` method.
    """
    user_prompt = prompt_template.format(query=query, context=context)
    messages = [{
        "role": "user",
        "content": user_prompt,
    }]
    return messages

Create Codex Project

To later use Codex, we must first create a Project. Here we assume some (question, answer) pairs have already been added to the Codex Project.

Our existing Codex Project contains the following entries:

Codex Project Example

User queries where Codex detected a bad response from your RAG app will be logged in this Project for SMEs to later answer.

Running detection and remediation

Now that our Codex Project is configured, we can use the Project.validate() method to detect bad responses from our RAG application. A single call runs many real-time Evals to score each AI response, and when scores fall below certain thresholds, the response is flagged for guardrailing or for SME review.

When your AI response is flagged for SME review, the Project.validate() call will simultaneously query Codex for an expert answer that can remediate your bad AI response. If no suitable expert answer is found, this query will be logged as Unaddressed in the Codex Project for SMEs to answer.

When a response is flagged for guardrailing, the should_guardrail return value will be marked as True. You can choose to return a safer fallback response in place of the original AI response, or escalate to a human employee rather than letting your AI handle this case.

Here’s some logic to determine the final_response to return to your user.

final_response = (
    results.expert_answer if results.expert_answer and results.escalated_to_sme  # you can optionally omit the 2nd part of the AND statement to always use expert answers when available
    else fallback_response if results.should_guardrail
    else initial_response
)
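
If you’d rather route guardrailed queries to a human support agent instead of returning a canned fallback string, you can swap in your own escalation hook. Here’s a minimal sketch of that variant; escalate_to_human_agent is a hypothetical placeholder for whatever hand-off mechanism your application provides.

def escalate_to_human_agent(query):
    # Hypothetical placeholder: in a real app this might open a support ticket
    # or hand the conversation (and the original user query) to a live agent.
    return "Connecting you with a support agent who can help with this request."

final_response = (
    results.expert_answer if results.expert_answer and results.escalated_to_sme
    else escalate_to_human_agent(user_query) if results.should_guardrail
    else initial_response
)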

Let’s initialize the Project using our access key:

access_key = "<YOUR-PROJECT-ACCESS-KEY>"  # Obtain from your Project's settings page: https://codex.cleanlab.ai/
project = Project.from_access_key(access_key)

Applying the validate() method to a RAG system is straightforward. Here we do this using a helper function that applies validation to one row from our example dataframe.

def df_row_validation(df, row_index, project, verbosity=0):
    """
    Detect and remediate bad responses in a specific row from the dataframe.

    Args:
        df (DataFrame): The dataframe containing the query, context, and response to validate.
        row_index (int): The index of the row in the dataframe to validate.
        project (Project): The Codex Project object used to detect bad responses and remediate them.
        verbosity (int): Whether to print verbose output. Defaults to 0.
            At verbosity level 0, only the query and final response are printed.
            At verbosity level 1, the initial RAG response and the validation results are printed as well.
            At verbosity level 2, the retrieved context is also printed.
    """
    # 1. Get user query
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # 2. Standard RAG pipeline
    # a. retrieve the context
    retrieved_context = rag_retrieve_context(user_query)
    if verbosity >= 2:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # b. build prompt for RAG system
    messages = rag_form_prompt(
        query=user_query,
        context=retrieved_context,
    )

    # c. simulate LLM response generation
    initial_response = rag_generate_response(messages)
    if verbosity >= 1:
        print(f"Initial RAG response: {initial_response}\n")

    # 3. Detect and remediate bad responses
    results = project.validate(
        messages=messages,
        response=initial_response,
        query=user_query,
        context=retrieved_context,
    )

    # 4. Get the final response:
    #    - Use an expert answer if one is available and the response was escalated to an SME
    #    - Otherwise, use the fallback_response if the response was flagged as requiring guardrails
    #    - Otherwise, use the initial response
    final_response = (
        results.expert_answer if results.expert_answer and results.escalated_to_sme
        else fallback_response if results.should_guardrail
        else initial_response
    )
    print(f"Final Response: {final_response}\n")

    # For tutorial purposes, show validation results
    if verbosity >= 1:
        print("Validation Results:")
        for key, value in results.model_dump().items():
            print(f" {key}: {value}")

Let’s validate the RAG response to our first example query. The final dictionary printed by our helper function contains the results of Project.validate(), which we’ll break down below.

df_row_validation(df, 0, project, verbosity=1)
Query: How much water can the Simple Water Bottle hold?

Initial RAG response: The Simple Water Bottle can hold 16 oz of Water

Final Response: 32oz

Validation Results:
deterministic_guardrails_results: {}
escalated_to_sme: True
eval_scores: {'trustworthiness': {'score': 0.1828326551090038, 'triggered': True, 'triggered_escalation': True, 'triggered_guardrail': True, 'failed': True, 'log': None}, 'context_sufficiency': {'score': 0.0024944120492266216, 'triggered': True, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': True, 'log': None}, 'response_helpfulness': {'score': 0.9975122307805526, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'query_ease': {'score': 0.815418802947351, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'response_groundedness': {'score': 0.0024875794784465807, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}}
expert_answer: 32oz
is_bad_response: True
should_guardrail: True

The Project.validate() method returns a comprehensive dictionary containing multiple evaluation metrics and remediation options. Let’s examine the key components of these results:

Core Validation Results

  1. expert_answer (String | None)

    • Contains the remediation response retrieved from the Codex Project.
    • Returns None in two scenarios:
      • When escalated_to_sme is False (indicating no remediation needed, so Codex is not queried).
      • When no suitable expert answer exists in the Codex Project for similar queries.
    • Returns a string containing the expert-provided answer when:
      • The response is flagged as requiring remediation (escalated_to_sme=True).
      • A semantically similar query exists in the Codex Project with an expert answer.
  2. escalated_to_sme (Boolean)

    • This is the primary validation indicator that determines if a response requires remediation (i.e. for expert_answer to contain a string value).
    • Will be True if any Evaluation triggers escalation (triggered_escalation=True in eval_scores), meaning the score for that specific eval falls below a configured threshold.
    • Controls whether the system will attempt to fetch an expert answer from Codex. When escalated_to_sme=True, the system will look up an expert answer from Codex (which logs the corresponding query into the Codex Project).
  3. should_guardrail (Boolean)

    • Will be True when any configured Guardrail is triggered (triggered_guardrail=True in eval_scores), meaning the score for that specific guardrail falls below a configured threshold.
    • On its own, this does not trigger checking Codex for an expert answer or flag the query for review.

Evaluation Metrics

Each evaluation metric has a triggered_guardrail and triggered_escalation boolean flag that indicates whether the metric’s score falls below its configured threshold, which determines if a response needs remediation or guardrailing.

By default, the Project.validate() method uses the following metrics as Evaluations for escalation:

  • trustworthiness: overall confidence that your RAG system’s response is correct.
  • response_helpfulness: evaluates whether the response attempts to helpfully address the user query vs. abstaining or saying ‘I don’t know’.

By default, the Project.validate() method uses the following metrics as Guardrails:

  • trustworthiness: overall confidence that your RAG system’s response is correct (used for guardrailing and escalation).

You can modify these metrics or add your own by defining a custom list of Evaluations and/or Guardrails for a Project in the Codex Web App.
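
To see exactly which metrics triggered escalation or guardrailing for a given response, you can inspect the per-metric entries returned by project.validate(). Here’s a minimal sketch that reads the fields shown in the model_dump() output earlier in this tutorial (adjust the field names if your version of cleanlab-codex reports them differently):

# Inspect per-metric scores and trigger flags from a validate() call
eval_scores = results.model_dump()["eval_scores"]
for metric, details in eval_scores.items():
    flags = []
    if details["triggered_escalation"]:
        flags.append("escalation")
    if details["triggered_guardrail"]:
        flags.append("guardrail")
    status = ", ".join(flags) if flags else "passed"
    print(f"{metric}: score={details['score']:.3f} ({status})")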

Let’s validate another example from our RAG system. For this example, the response is flagged as bad, but no expert answer is available in the Codex Project. The corresponding query will be logged there for SMEs to answer.

df_row_validation(df, 1, project, verbosity=1)
Query: How can I order the Simple Water Bottle in bulk?

Initial RAG response: Based on the available information, I cannot provide a complete answer to this question.

Final Response: I'm sorry, I couldn't find an answer for that — can I help with something else?

Validation Results:
deterministic_guardrails_results: {}
escalated_to_sme: True
eval_scores: {'trustworthiness': {'score': 0.5637340040550474, 'triggered': True, 'triggered_escalation': True, 'triggered_guardrail': True, 'failed': True, 'log': None}, 'context_sufficiency': {'score': 0.0024907979417248738, 'triggered': True, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': True, 'log': None}, 'response_helpfulness': {'score': 0.003276786800052492, 'triggered': True, 'triggered_escalation': True, 'triggered_guardrail': False, 'failed': True, 'log': None}, 'query_ease': {'score': 0.9975124059276235, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'response_groundedness': {'score': 0.9975124079370519, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}}
expert_answer: None
is_bad_response: True
should_guardrail: True

The RAG system is unable to answer this question because there is no relevant information in the retrieved context, nor has a similar question been answered in the Codex Project (see the contents of the Codex Project above).

Codex automatically recognizes this question could not be answered and logs it into the Project, where it awaits an answer from an SME (notice that escalated_to_sme is True). Navigate to your Codex Project in the Web App, where you (or an SME at your company) can enter the desired answer for this query.

As soon as an answer is provided in Codex, our RAG system will be able to answer all similar questions going forward (as seen for the previous query).

Advanced Usage

You can configure many aspects of bad-response detection, such as which score thresholds to use, in the Codex Web App.

Response Quality Thresholds

Thresholds determine when a response needs intervention:

  • A metric (trustworthiness, pii_detection, etc.) can be defined as an Evaluation, a Guardrail, or both.

  • Each metric has its own threshold (0-1) and a defined threshold direction (below/above).

  • If any Evaluation’s score violates the threshold condition, the response is escalated to an SME and Codex attempts to return an expert answer.

  • If any Guardrail’s score violates the threshold condition, Project.validate() returns with the should_guardrail=True flag and no other action is taken.

  • Example: With Evaluation metric Hallucination (trustworthiness) threshold set to “below 0.85”

    • Score 0.80 -> Escalated to an SME
    • Score 0.90 -> Passes validation

Setting thresholds affects your validation strategy:

  • Higher thresholds (e.g. 0.9) = Stricter validation
    • More responses flagged for escalation (escalated_to_sme=True)
    • More queries logged for SMEs to answer
    • Better response quality but higher SME workload
  • Lower thresholds (e.g. 0.7) = More lenient validation
    • Fewer responses flagged for escalation (escalated_to_sme=True)
    • Fewer queries logged for SMEs to answer
    • Lower SME workload, but may allow lower quality responses from your RAG app to be returned unremediated.
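
To make the threshold logic above concrete, here is a purely illustrative sketch of the comparison being applied (thresholds and their directions are configured in the Codex Web App, not in your code; the helper below is hypothetical):

def violates_threshold(score, threshold=0.85, direction="below"):
    # Hypothetical helper, for illustration only: a score violates a threshold
    # when it falls on the configured failing side.
    return score < threshold if direction == "below" else score > threshold

print(violates_threshold(0.80))  # True  -> escalated to an SME
print(violates_threshold(0.90))  # False -> passes validation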

Configure Custom Evaluations and Guardrails

You can configure these directly in the Codex Web UI. For a detailed walkthrough, see the Adding custom guardrails section of our other tutorial.

Logging Additional Information

When project.validate() returns results indicating a response should be escalated to an SME, it logs the query into your Codex Project. By default, this log automatically includes the evaluation scores (like trustworthiness), the retrieved context, and the LLM response.

You can include additional information that would be helpful for Subject Matter Experts (SMEs) when they review the logged queries in the Codex Project later.

To add extra information, simply pass key-value pairs into the metadata parameter of the validate() method. For example, you can add the location the query came from like so:

metadata = {"location": "USA"}
results = project.validate(
    messages=messages,
    response=response,
    query=query,
    context=context,
    metadata=metadata,
)

Run the example below to add an entry into Codex that contains this additional metadata.

def df_row_log_metadata(df, row_index, project, verbosity=0):
    """
    Detect and remediate bad responses in a specific row from the dataframe, logging extra metadata to the Codex Project.

    Args:
        df (DataFrame): The dataframe containing the query, context, and response to validate.
        row_index (int): The index of the row in the dataframe to validate.
        project (Project): The Codex Project object used to detect bad responses and remediate them.
        verbosity (int): Whether to print verbose output. Defaults to 0.
            At verbosity level 0, only the query and final response are printed.
            At verbosity level 1, the initial RAG response and the validation results are printed as well.
            At verbosity level 2, the retrieved context is also printed.
    """
    # 1. Get user query
    user_query = df.iloc[row_index]["query"]
    print(f"Query: {user_query}\n")

    # 2. Standard RAG pipeline
    # a. retrieve the context
    retrieved_context = rag_retrieve_context(user_query)
    if verbosity >= 2:
        print(f"Retrieved context:\n{retrieved_context}\n")

    # b. build prompt for RAG system
    messages = rag_form_prompt(
        query=user_query,
        context=retrieved_context,
    )

    # c. simulate LLM response generation
    initial_response = rag_generate_response(messages)
    if verbosity >= 1:
        print(f"Initial RAG response: {initial_response}\n")

    # 3. Detect and remediate bad responses (passing extra metadata to log in Codex)
    results = project.validate(
        messages=messages,
        response=initial_response,
        query=user_query,
        context=retrieved_context,
        metadata={"location": "USA"},
    )

    # 4. Get the final response:
    #    - Use an expert answer if one is available and the response was escalated to an SME
    #    - Otherwise, use the fallback_response if the response was flagged as requiring guardrails
    #    - Otherwise, use the initial response
    final_response = (
        results.expert_answer if results.expert_answer and results.escalated_to_sme
        else fallback_response if results.should_guardrail
        else initial_response
    )
    print(f"Final Response: {final_response}\n")

    # For tutorial purposes, show validation results
    if verbosity >= 1:
        print("Validation Results:")
        for key, value in results.model_dump().items():
            print(f" {key}: {value}")

df_row_log_metadata(df, 2, project, verbosity=1)
Query: How much does the Simple Water Bottle cost?

Initial RAG response: The Simple Water Bottle costs $10

Final Response: I'm sorry, I couldn't find an answer for that — can I help with something else?

Validation Results:
deterministic_guardrails_results: {}
escalated_to_sme: True
eval_scores: {'trustworthiness': {'score': 0.08583690987124462, 'triggered': True, 'triggered_escalation': True, 'triggered_guardrail': True, 'failed': True, 'log': None}, 'context_sufficiency': {'score': 0.9975124378114657, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'response_helpfulness': {'score': 0.9975124367477299, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'query_ease': {'score': 0.9968973517119587, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'response_groundedness': {'score': 0.002487575280819864, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}}
expert_answer: None
is_bad_response: True
should_guardrail: True

Next Steps

Now that Cleanlab is integrated with your Single-turn Q&A App, you and SMEs can open the Codex Project and answer questions logged there to continuously improve your AI.

This tutorial demonstrated how to use Cleanlab to automatically detect and remediate bad responses in any Single-turn Q&A application. Cleanlab provides a robust way to evaluate response quality and automatically fetch expert answers when needed. For responses that don’t meet quality thresholds, Codex automatically logs the queries for SME review.

Note: Automatic detection and remediation of bad responses for a Multi-turn Conversational Chat app is covered in the Detect and Remediate bad responses in Conversational Apps tutorial.

Adding Cleanlab only improves your RAG app. Once integrated, it automatically identifies problematic responses and either remediates them with expert answers or logs them for review. Using a simple web interface, SMEs at your company can answer the highest priority questions in the Codex Project. As soon as an answer is entered in Codex, your RAG app will be able to properly handle all similar questions encountered in the future.

Codex is the fastest way for nontechnical SMEs to directly improve your RAG system. As the Developer, you simply integrate Cleanlab once, and from then on, SMEs can continuously improve how your system handles common user queries without needing your help.

Need help, more capabilities, or other deployment options?
Check the FAQ or email us at: support@cleanlab.ai