Trustworthy Retrieval-Augmented Generation with the Trustworthy Language Model
This tutorial demonstrates how to replace the Generator LLM in any RAG system with Cleanlab’s Trustworthy Language Model (TLM), to score the trustworthiness of answers and improve overall reliability. We recommend first completing the TLM quickstart tutorial.
The second part of this tutorial demonstrates an alternative approach: using TLM to score responses from an existing RAG pipeline where low latency is key.
Retrieval-Augmented Generation (RAG) has become popular for building LLM-based Question-Answer systems in domains where LLMs alone suffer from hallucination, knowledge gaps, and factual inaccuracies. However, RAG systems often still produce unreliable responses, because they depend on LLMs that are fundamentally unreliable. Cleanlab’s Trustworthy Language Model (TLM) offers a solution by providing trustworthiness scores to assess and improve response quality, independent of your RAG architecture or retrieval and indexing processes. To diagnose when RAG answers cannot be trusted, simply swap TLM in for the existing LLM that generates answers from the retrieved context. This tutorial showcases the swap for a standard RAG system, based on a tutorial from the popular LlamaIndex framework. Here we merely replace the LLM used in the LlamaIndex tutorial with TLM and showcase some of the benefits. TLM can be similarly inserted into any other RAG framework.
Setup
RAG is all about connecting LLMs to data, to better inform their answers. This tutorial uses Nvidia’s Q1 FY2024 earnings report as an example dataset.
Use the following commands to download the data (earnings report) and store it in a directory named `data/`.
wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
mkdir -p ./data
mv NVIDIA_Financial_Results_Q1_FY2024.md data/
Let’s next install required dependencies.
%pip install -U cleanlab-studio llama-index llama-index-embeddings-huggingface
We then initialize our Cleanlab client. You can get your Cleanlab API key here: https://app.cleanlab.ai/account after creating an account. For detailed instructions, refer to this guide.
from cleanlab_studio import Studio
studio = Studio("<insert your API key>")
Integrate TLM with LlamaIndex
TLM not only provides a response but also includes a trustworthiness score indicating the confidence that this response is good/accurate. Here we initialize a TLM object with default settings. You can achieve better results by playing with the TLM configurations outlined in the Advanced section of the TLM quickstart tutorial.
tlm = studio.TLM()
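If you’re curious what TLM returns before integrating it into a RAG framework, you can call it directly. The snippet below is just an optional sanity check (the example question is arbitrary): `tlm.prompt()` returns a dict containing the generated response and its trustworthiness score.
# Optional sanity check: TLM returns both a response and a trustworthiness score
result = tlm.prompt("What year was NVIDIA founded?")
print(result["response"])               # the generated answer (a string)
print(result["trustworthiness_score"])  # a confidence score between 0 and 1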
Our RAG pipeline closely follows the LlamaIndex guide on Using a Custom LLM Model. LlamaIndex’s `CustomLLM` class exposes two methods, `complete()` and `stream_complete()`, for returning the LLM response. Additionally, it provides a `metadata` property to specify LLM details such as the context window, number of output tokens, and the name of your LLM.
Here we create a `TLMWrapper` subclass of `CustomLLM` that uses our TLM object instantiated above.
from typing import Any, Dict
import json
# Import LlamaIndex dependencies
from llama_index.core.base.llms.types import (
CompletionResponse,
CompletionResponseGen,
LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.core.llms.custom import CustomLLM
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
class TLMWrapper(CustomLLM):
    context_window: int = 16000
    num_output: int = 256
    model_name: str = "TLM"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # Prompt tlm for a response and trustworthiness score
        response: Dict[str, str] = tlm.prompt(prompt)
        output = json.dumps(response)
        return CompletionResponse(text=output)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Prompt tlm for a response and trustworthiness score
        response = tlm.prompt(prompt)
        output = json.dumps(response)

        # Stream the output
        output_str = ""
        for token in output:
            output_str += token
            yield CompletionResponse(text=output_str, delta=token)
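Before wiring this wrapper into a pipeline, you can optionally sanity-check it on its own. This is purely an illustrative check (the prompt is arbitrary): the wrapper packs TLM’s response and trustworthiness score into a JSON string, which downstream code parses back out.
# Optional: verify the wrapper returns a JSON-encoded response plus trustworthiness score
test_completion = TLMWrapper().complete("What does NVIDIA produce?")
parsed = json.loads(test_completion.text)
print(parsed["response"], parsed["trustworthiness_score"])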
Build a RAG pipeline with TLM
Now let’s integrate our TLM-based `CustomLLM` into a RAG pipeline.
Settings.llm = TLMWrapper()
Specify Embedding Model
RAG uses an embedding model to match queries against document chunks to retrieve the most relevant data. Here we opt for a no-cost, local embedding model from Hugging Face. You can use any other embedding model by referring to this LlamaIndex guide.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Load Data and Create Index + Query Engine
Let’s create an index from the documents stored in the data directory. The system can index multiple files within the same folder, although for this tutorial, we’ll use just one document. We stick with the default index from LlamaIndex for this tutorial.
documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    # file_path wouldn't be useful metadata to add to the LLM's context since our data source contains just 1 file
    doc.excluded_llm_metadata_keys.append("file_path")
index = VectorStoreIndex.from_documents(documents)
The generated index is used to power a query engine over the data.
query_engine = index.as_query_engine()
Note that TLM is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.
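For instance, you could configure the query engine differently (the `similarity_top_k` parameter below is a standard LlamaIndex option, shown here purely as a hypothetical variation) and TLM would still score whatever answers are generated:
# Hypothetical variation: retrieve more chunks per query; the TLM integration is unchanged
query_engine_top5 = index.as_query_engine(similarity_top_k=5)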
Answering queries with our RAG system
Let’s try out our RAG pipeline based on TLM. Here we pose questions with differing levels of complexity.
Optional: Define `display_response` helper function
# This method presents formatted responses from our TLM-based RAG pipeline.
# It parses the output to display both the response itself and the corresponding trustworthiness score.
def display_response(response):
    response_str = response.response
    output_dict = json.loads(response_str)
    print(f"Response: {output_dict['response']}")
    print(f"Trustworthiness score: {round(output_dict['trustworthiness_score'], 2)}")
Easy Questions
We first pose straightforward questions that can be directly answered by the provided data and can be easily located within a few lines of text.
response = query_engine.query(
"What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
)
display_response(response)
response = query_engine.query(
"What was the GAAP earnings per diluted share for the quarter?"
)
display_response(response)
response = query_engine.query(
"What significant transitions did Jensen Huang, NVIDIA's CEO, comment on?"
)
display_response(response)
TLM returns high trustworthiness scores for these responses, indicating high confidence they are accurate. After doing a quick fact-check (reviewing the original earnings report), we can confirm that TLM indeed accurately answered these questions. In case you’re curious, here are relevant excerpts from the data context for these questions:
NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, …
GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.
Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, …
Questions without Available Context
Now let’s see how TLM responds to queries that cannot be answered using the provided data.
response = query_engine.query(
"What factors as per the report were responsible to the decline in NVIDIA's proviz revenue?"
)
display_response(response)
The lower TLM trustworthiness score indicates a bit more uncertainty about this response, which aligns with the lack of relevant information available. Let’s try some more questions.
response = query_engine.query(
"How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
display_response(response)
response = query_engine.query(
"How does NVIDIA's dividend payout for this quarter compare to the industry average?",
)
display_response(response)
We observe that TLM demonstrates the ability to recognize the limitations of the available information. It refrains from generating speculative responses or hallucinations, thereby maintaining the reliability of the question-answering system. This behavior showcases an understanding of the boundaries of the context and prioritizes accuracy over conjecture.
Challenging Questions
Let’s see how our RAG system responds to harder questions, some of which may be misleading.
response = query_engine.query(
"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"
)
display_response(response)
response = query_engine.query(
"This report focuses on Nvidia's Q1FY2024 financial results. There are mentions of other companies in the report like Microsoft, Dell, ServiceNow, etc. Can you name them all here?",
)
display_response(response)
response = query_engine.query(
"How many RTX GPU models, including all custom versions released by third-party manufacturers and all revisions across different series, were officially announced in NVIDIA's Q1 FY2024 financial results?",
)
display_response(response)
response = query_engine.query(
"If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?",
)
display_response(response)
TLM automatically alerts us that these answers are unreliable via their low trustworthiness scores. RAG systems with TLM help you properly exercise caution whenever you see a low trustworthiness score. Here are the correct answers to the aforementioned questions:
NVIDIA’s revenue increased by $1.14 billion this quarter compared to last quarter.
Google, Amazon Web Services, Microsoft, Oracle, ServiceNow, Medtronic, Dell Technologies.
There is not a specific total count of RTX GPUs mentioned.
Projected annual revenue if this growth rate is maintained for the next four quarters: approximately $26.34 billion (see the quick calculation sketched below).
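For reference, that last projection follows from simple compounding. The Data Center figures used below (roughly $4.28 billion of Q1 FY2024 revenue, up about 18% quarter over quarter) come from the earnings report itself rather than the excerpts quoted above.
# Rough sketch of the projection: compound the Q1 Data Center figure forward four quarters
q1_data_center_revenue = 4.28  # billions of dollars (per the earnings report)
qoq_growth = 0.18              # ~18% quarter-over-quarter growth (per the earnings report)
projected_annual = sum(q1_data_center_revenue * (1 + qoq_growth) ** q for q in range(1, 5))
print(f"Projected annual revenue: ~${projected_annual:.2f} billion")  # ~$26.34 billion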
With TLM, you can easily increase trust in any RAG system!
Alternate low-latency/streaming approach: Use TLM to assess responses from an existing RAG system
This part of the tutorial demonstrates how to use TLM to assess RAG responses from any other generator LLM in a streaming use case (rather than using TLM to also produce the responses). TLM can score the trustworthiness of any response (generated by another LLM) to a given prompt via `TLM.get_trustworthiness_score_async()`. This is useful in settings where low latency is critical, since you can first stream in the response from your existing RAG system and subsequently stream in TLM’s trustworthiness score once it has been computed.
A key consideration here is the `prompt` argument provided to TLM. In order for TLM to effectively detect bad responses and hallucinations, its provided `prompt` should contain all relevant information, including:
- Optional system instructions to shape the generator LLM’s overall behavior
- Optional criteria that a good response should satisfy
- Relevant context fetched by the RAG system retriever (in the same format as provided to your generator LLM)
- The user query being responded to.
Here we demonstrate this process assuming a RAG system that uses the `gpt-4o-mini` LLM from OpenAI, but this can be done with any LLM.
First, let’s set up the OpenAI client for streaming responses:
%pip install openai
from openai import AsyncOpenAI
from typing import AsyncGenerator
client = AsyncOpenAI(api_key="<insert your OpenAI API key>")
Now we define a function for streaming a response from OpenAI and then scoring its trustworthiness via TLM.
In your RAG system, the retriever will fetch some context for your generator LLM. You should combine that context with the user query into the `prompt` argument of `TLM.get_trustworthiness_score_async()`.
If your RAG system uses system instructions to shape the generator LLM’s overall behavior, ensure these are also part of the `prompt` argument passed to TLM. If you have criteria that a good response should satisfy (e.g. conciseness, specific statements to avoid, etc.), also provide these as part of the `prompt` argument passed to TLM (ideally formulated as: `A correct answer will meet the following criteria: ...`).
If the `prompt` argument provided to TLM lacks any of the above information, then the resulting trustworthiness score may be less reliable.
We provide an example RAG prompt you can use below.
async def stream_openai_response(
    query: str,
    context: str,
    system_instructions: str = None,
    eval_criteria: str = None
) -> AsyncGenerator[str, None]:
    """
    Asynchronously stream a response from OpenAI, and subsequently provide a trustworthiness score using TLM.

    Args:
        query (str): The user's question.
        context (str): Retrieved/formatted context information to be used for answering the query.
        system_instructions (str): Optional instructions for the LLM on how to behave overall.
        eval_criteria (str): Optional criteria for evaluating the correctness/quality of the answer.

    Yields:
        str: Chunks of the streamed response from OpenAI, followed by the trustworthiness score.

    Note:
        The function yields the response in chunks for a streaming effect, and the trustworthiness
        score is yielded at the end after the full response is received.
    """
    full_prompt = f"""{system_instructions}
{eval_criteria}
Context:
{context}
User Question: {query}
"""

    response_stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": full_prompt}],
        stream=True
    )

    full_response = ""
    async for chunk in response_stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content
            yield content

    # After response streaming is complete, get the trustworthiness score for this prompt/response pair.
    # Here we demonstrate the asynchronous method in case there is additional logic you'd like to execute
    # while the trustworthiness score is being computed.
    trust_score_result = await tlm.get_trustworthiness_score_async(full_prompt, full_response)
    yield f"\n\nTrustworthiness Score: {trust_score_result['trustworthiness_score']:.2f}"
Here we suppose the context for your generator LLM has already been fetched by the retriever of your RAG system. The context used below includes the relevant excerpts about Nvidia from a document encountered earlier in this tutorial.
# Define the context and query for the OpenAI streaming response.
# In practice, the context below would be provided by your RAG system's retriever.
context = """
NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, up 19% from the previous quarter and down 13% from a year ago.
GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 26% from the previous quarter.
"The computer industry is going through two simultaneous transitions -- accelerated computing and generative AI," said Jensen Huang, founder and CEO of NVIDIA. "A trillion dollars of installed global data center infrastructure will transition from general purpose to accelerated computing as companies race to apply generative AI into every product, service and business process.
"""
query = "What was NVIDIA's revenue in the first quarter of 2023, and how does it compare to the previous quarter? And what are the two simultaneous transitions Jensen Huang mentioned?"
Next we define system instructions and evaluation criteria. These are optional, but if you are implementing them in your RAG system, try to also provide them to TLM.
System instructions for the LLM:
- Note: Edit these instructions to match the system instructions used for the generator LLM in your own RAG system.
- These instructions should guide the LLM on how to interpret and respond to queries based on the provided context.
Evaluation criteria for the response:
- Note: Edit these criteria to reflect the specific requirements for a good response in your use case.
- These criteria will be used by TLM to assess the trustworthiness of the response.
# System instructions for the LLM
system_instructions = """You are a helpful assistant designed to help users navigate a complex set of documents. Answer the user's Question based on the following Context. Follow these rules:
1. Use only information from the provided Context.
2. If the Context doesn't adequately address the Question, say: "Based on the available information, I cannot provide a complete answer to this question."
3. Give a clear, short, and accurate answer. Explain complex terms if needed.
4. If the Context contains conflicting information, point this out without attempting to resolve the conflict.
5. Don't use phrases like "according to the context," "as the context states," etc.
Remember, your purpose is to provide information based on the Context, not to offer original advice."""
# Evaluation criteria for the response
eval_criteria = """A correct Answer should: be concise without unnecessary words, only contain facts that are explicitly stated in the provided Context, and never contain investment advice."""
Now let’s actually stream in a RAG response and trustworthiness score using our `stream_openai_response()` function.
print("Streaming response:")
async for content in stream_openai_response(query, context, system_instructions, eval_criteria):
    print(content, end="", flush=True)
print("\n")
When you run the above code, you’ll see the OpenAI response streaming in real-time, followed by the trustworthiness score from TLM. This approach gives you the best of both worlds: a responsive interface with low-latency streaming and a reliability assessment of the generated answer to catch hallucinations in your RAG system.
If you have an existing RAG pipeline where latency is a key concern, this is how we recommend incorporating TLM to catch hallucinations.
Calibrate TLM scores to more closely reflect response-quality ratings by your team
If TLM trustworthiness scores do not align with your team’s manual quality ratings of good/bad RAG responses, consider introducing custom evaluation criteria. You can further calibrate TLM scores against your human quality ratings for a dataset of prompt-response pairs.
For example, let’s define custom evaluation criteria based on faithfulness and groundedness, two key measures of RAG systems. Faithfulness measures the factual consistency of the generated answer against the retrieved context, while groundedness measures whether information in the generated answer is grounded in the retrieved context.
faithfulness_groundedness_eval_criteria = {
    "custom_eval_criteria": [
        {
            "name": "Faithfulness & Groundedness",
            "criteria": "Determine if the Answer is solely based on information available in the Context (no additional facts are mentioned in the Answer that are not stated in the Context). \
Also determine if the Answer does not contradict any information in the Context. If the Context contains no information available to answer the Question, a good Answer should state 'there is no information available.'"
        }
    ]
}
For this custom evaluation criterion, there are four cases we want to consider:
- Question is answerable based on context, response answers the question - (should be a high score)
- Question is answerable based on context, response does not answer the question - (should be a low score)
- Question is not answerable based on context, response answers the question - (should be a low score)
- Question is not answerable based on context, response does not answer the question - (should be a high score)
Let’s now define some example data based on the Nvidia dataset we’ve used throughout this tutorial.
import pandas as pd
custom_eval_data = [
    {"question": "What was NVIDIA's revenue in the first quarter of fiscal 2024?",
     "context": "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion.",
     "answer": "NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.",
     "case": 1},
    {"question": "What was NVIDIA's revenue in the first quarter of fiscal 2024?",
     "context": "NVIDIA's GAAP earnings per share for the quarter were $0.82.",
     "answer": "NVIDIA's total revenue was $7.19 billion.",
     "case": 2},
    {"question": "What was NVIDIA's dividend payout in Q1 FY2024?",
     "context": "NVIDIA reported earnings per share, but no mention of dividends.",
     "answer": "NVIDIA's dividend payout was $0.10 per share.",
     "case": 3},
    {"question": "What was NVIDIA's dividend payout in Q1 FY2024?",
     "context": "NVIDIA reported earnings per share, but no mention of dividends.",
     "answer": "No information is available to answer the question.",
     "case": 4}
]
custom_df = pd.DataFrame(custom_eval_data)
We can now create the prompt used for this custom evaluation.
def create_tlm_prompt(row):
    return f"Context: {row['context']}\n\nUser Question: {row['question']}"
custom_df['prompt'] = custom_df.apply(create_tlm_prompt, axis=1)
tlm_faithfulness_groundedness = studio.TLM(options=faithfulness_groundedness_eval_criteria)
Then we will use TLM.get_trustworthiness_score() to obtain the score pertaining to our faithfulness & groundedness custom evaluation criteria.
res_faithfulness_groundedness = tlm_faithfulness_groundedness.get_trustworthiness_score(custom_df['prompt'].tolist(), custom_df['answer'].tolist())
res_faithfulness_groundedness_df = pd.DataFrame(res_faithfulness_groundedness)
df_results = pd.concat([custom_df, res_faithfulness_groundedness_df], axis=1)
df_results[['question', 'answer', 'trustworthiness_score', 'log']]
|   | question | answer | trustworthiness_score | log |
|---|----------|--------|-----------------------|-----|
| 0 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue in the first quarter of... | 0.986734 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
| 1 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue was $7.19 billion. | 0.775447 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
| 2 | What was NVIDIA's dividend payout in Q1 FY2024? | NVIDIA's dividend payout was $0.10 per share. | 0.565589 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
| 3 | What was NVIDIA's dividend payout in Q1 FY2024? | No information is available to answer the ques... | 0.418083 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
Let’s see if our custom evaluation scores align with expected behavior from each of the previously-described cases.
for idx, row in df_results.iterrows():
    case = row['case']
    faithfulness_and_groundedness_score = row['log']['custom_eval_criteria'][0]['score']
    print(f"Case {case} - Faithfulness & Groundedness Score: {faithfulness_and_groundedness_score:.2f}")
We expected cases 1 and 4 to have a high score and cases 2 and 3 to have a low score, so the results look great!
If you have human-quality ratings for many generated responses and want to produce automated scores that better align with these ratings, consider TLMCalibrated. This approach combines TLM’s trustworthiness and custom evaluation scores into a single score that is calibrated to match your manually-provided quality ratings.
# Quality ratings for each response, say provided by your team (1 = good, 0 = bad)
df_results['human_rating'] = [1, 0, 0, 1]
tlm_calibrated = studio.TLMCalibrated(options=faithfulness_groundedness_eval_criteria)
We fit the `TLMCalibrated` model to a dataset of the previously obtained TLM scores and human quality ratings, training the model to better align its scores.
tlm_calibrated.fit(res_faithfulness_groundedness_df.to_dict(orient='records'), df_results['human_rating'].tolist())
Here’s how to produce calibrated scores after fitting the model:
calibrated_res = tlm_calibrated.get_trustworthiness_score(custom_df['prompt'].tolist(), custom_df['answer'].tolist())
calibrated_res_df = pd.DataFrame(calibrated_res)
calibrated_combined_df = pd.concat([df_results, calibrated_res_df], axis=1)
calibrated_combined_df[['question', 'answer', 'calibrated_score']]
|   | question | answer | calibrated_score |
|---|----------|--------|------------------|
| 0 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue in the first quarter of... | 0.97 |
| 1 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue was $7.19 billion. | 0.03 |
| 2 | What was NVIDIA's dividend payout in Q1 FY2024? | NVIDIA's dividend payout was $0.10 per share. | 0.03 |
| 3 | What was NVIDIA's dividend payout in Q1 FY2024? | No information is available to answer the ques... | 0.81 |
The calibrated scores align even better with our expectations (high scores for question/answer pairs 1 and 4, low scores for pairs 2 and 3).
Learn more in our TLM Custom Evaluation Criteria tutorial.