
Trustworthy Retrieval-Augmented Generation with the Trustworthy Language Model


This tutorial demonstrates how to replace the Generator LLM in any RAG system with Cleanlab’s Trustworthy Language Model (TLM), to score the trustworthiness of answers and improve overall reliability. We recommend first completing the TLM quickstart tutorial.

The second part of this tutorial demonstrates how to instead use TLM to score responses from an existing RAG pipeline where low latency is key.

Retrieval-Augmented Generation (RAG) has become popular for building LLM-based question-answering systems in domains where LLMs alone suffer from hallucination, knowledge gaps, and factual inaccuracies. However, RAG systems often still produce unreliable responses, because they depend on LLMs that are fundamentally unreliable. Cleanlab’s Trustworthy Language Model (TLM) offers a solution by providing trustworthiness scores to assess and improve response quality, independent of your RAG architecture or retrieval and indexing processes. To diagnose when RAG answers cannot be trusted, simply swap the existing LLM that generates answers from the retrieved context with TLM. This tutorial showcases the approach for a standard RAG system, based on a tutorial from the popular LlamaIndex framework: we merely replace the LLM used in the LlamaIndex tutorial with TLM and showcase some of the benefits. TLM can be similarly inserted into any other RAG framework.

TLM RAG system correctly identifying high/low confidence responses

Setup

RAG is all about connecting LLMs to data, to better inform their answers. This tutorial uses Nvidia’s Q1 FY2024 earnings report as an example dataset. Use the following commands to download the data (earnings report) and store it in a directory named data/.

wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
mkdir -p ./data
mv NVIDIA_Financial_Results_Q1_FY2024.md data/

Let’s next install required dependencies.

%pip install -U cleanlab-studio llama-index llama-index-embeddings-huggingface

We then initialize our Cleanlab client. You can get your Cleanlab API key here: https://app.cleanlab.ai/account after creating an account. For detailed instructions, refer to this guide.

from cleanlab_studio import Studio

studio = Studio("<insert your API key>")

Integrate TLM with LlamaIndex

TLM not only provides a response but also includes a trustworthiness score indicating the confidence that this response is good/accurate. Here we initialize a TLM object with default settings. You can achieve better results by playing with the TLM configurations outlined in the Advanced section of the TLM quickstart tutorial.

tlm = studio.TLM()
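
TLM’s output is a dict containing both the generated response and its trustworthiness score, which the rest of this tutorial relies on. Here is a minimal sketch to see this for yourself (the example question is arbitrary):

# Quick check: TLM returns a dict with the answer and a trustworthiness score.
output = tlm.prompt("What is the capital of France?")  # arbitrary example question
print(output["response"])               # the generated answer text
print(output["trustworthiness_score"])  # score between 0 and 1; higher means more confidence the answer is good/accurate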

Our RAG pipeline closely follows the LlamaIndex guide on Using a Custom LLM Model. LlamaIndex’s CustomLLM class exposes two methods, complete() and stream_complete(), for returning the LLM response. Additionally, it provides a metadata property to specify LLM details such as the context window, number of output tokens, and name of your LLM.

Here we create a TLMWrapper subclass of CustomLLM that uses our TLM object instantiated above.

from typing import Any, Dict
import json

# Import LlamaIndex dependencies
from llama_index.core.base.llms.types import (
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.core.llms.custom import CustomLLM
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader


class TLMWrapper(CustomLLM):
    context_window: int = 16000
    num_output: int = 256
    model_name: str = "TLM"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # Prompt TLM for a response and trustworthiness score
        response: Dict[str, str] = tlm.prompt(prompt)
        output = json.dumps(response)
        return CompletionResponse(text=output)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Prompt TLM for a response and trustworthiness score
        response = tlm.prompt(prompt)
        output = json.dumps(response)

        # Stream the output
        output_str = ""
        for token in output:
            output_str += token
            yield CompletionResponse(text=output_str, delta=token)

Build a RAG pipeline with TLM

Now let’s integrate our TLM-based CustomLLM into a RAG pipeline.

Settings.llm = TLMWrapper()

Specify Embedding Model

RAG uses an embedding model to match queries against document chunks to retrieve the most relevant data. Here we opt for a no-cost, local embedding model from Hugging Face. You can use any other embedding model by referring to this LlamaIndex guide.

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Load Data and Create Index + Query Engine

Let’s create an index from the documents stored in the data directory. The system can index multiple files within the same folder, although for this tutorial, we’ll use just one document. We stick with the default index from LlamaIndex for this tutorial.

documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file
for doc in documents:
    doc.excluded_llm_metadata_keys.append("file_path")  # file_path wouldn't be useful metadata to add to the LLM's context since our data source contains just 1 file
index = VectorStoreIndex.from_documents(documents)

The generated index is used to power a query engine over the data.

query_engine = index.as_query_engine()

Note that TLM is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.
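
For example, you could configure the query engine to retrieve more chunks per query, and TLM would simply score whatever answer is generated from that retrieved context. A minimal sketch using LlamaIndex's standard similarity_top_k parameter (the variable name query_engine_top5 and the value 5 are arbitrary illustrative choices):

# Optional: a query engine that retrieves 5 chunks per query instead of the default
query_engine_top5 = index.as_query_engine(similarity_top_k=5)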

Answering queries with our RAG system

Let’s try out our RAG pipeline based on TLM. Here we pose questions with differing levels of complexity.

Optional: Define `display_response` helper function

# This method presents formatted responses from our TLM-based RAG pipeline. It parses the output to display both the response itself and the corresponding trustworthiness score.
def display_response(response):
    response_str = response.response
    output_dict = json.loads(response_str)
    print(f"Response: {output_dict['response']}")
    print(f"Trustworthiness score: {round(output_dict['trustworthiness_score'], 2)}")

Easy Questions

We first pose straightforward questions that can be directly answered by the provided data and can be easily located within a few lines of text.

response = query_engine.query(
    "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
)
display_response(response)
    Response: NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.
Trustworthiness score: 1.0
response = query_engine.query(
    "What was the GAAP earnings per diluted share for the quarter?"
)
display_response(response)
    Response: The GAAP earnings per diluted share for the quarter (Q1 FY24) was $0.82.
Trustworthiness score: 0.99
response = query_engine.query(
    "What significant transitions did Jensen Huang, NVIDIA's CEO, comment on?"
)
display_response(response)
    Response: Jensen Huang, NVIDIA's CEO, commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.
Trustworthiness score: 0.99

TLM returns high trustworthiness scores for these responses, indicating high confidence they are accurate. After doing a quick fact-check (reviewing the original earnings report), we can confirm that TLM indeed accurately answered these questions. In case you’re curious, here are relevant excerpts from the data context for these questions:

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, …

GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.

Jensen Huang, founder and CEO of NVIDIA, commented on the significant transitions the computer industry is undergoing, particularly accelerated computing and generative AI, …

Questions without Available Context

Now let’s see how TLM responds to queries that cannot be answered using the provided data.

response = query_engine.query(
    "What factors as per the report were responsible for the decline in NVIDIA's proviz revenue?"
)
display_response(response)
    Response: The report indicates that NVIDIA's professional visualization (proviz) revenue declined by 53% from a year ago. While the specific factors contributing to this decline are not detailed in the provided context, it can be inferred that broader market trends, such as shifts in demand for professional visualization products, competition, or changes in customer spending patterns, may have played a role. Additionally, the overall economic environment and the impact of the ongoing transitions in the computer industry, particularly in areas like accelerated computing and generative AI, could also have influenced the decline in proviz revenue.
Trustworthiness score: 0.76

The lower TLM trustworthiness score indicates a bit more uncertainty about the response, which aligns with the lack of relevant information in the provided data. Let's try some more questions.

response = query_engine.query(
    "How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
display_response(response)
    Response: The report indicates that NVIDIA's Gaming revenue decreased year over year by 38%, but it does not provide specific reasons for this decline. However, it does highlight several developments in the gaming segment, such as the launch of new GPUs and the expansion of game titles, which suggest that while there may be challenges in the market, NVIDIA is actively working to enhance its product offerings and gaming ecosystem. The decrease in revenue could be attributed to broader market conditions, reduced consumer spending, or increased competition, but these factors are not explicitly mentioned in the provided context.
Trustworthiness score: 0.92
response = query_engine.query(
    "How does NVIDIA's dividend payout for this quarter compare to the industry average?",
)
display_response(response)
    Response: The context information provided does not include specific details about the industry average for dividend payouts. Therefore, I cannot directly compare NVIDIA's dividend payout for this quarter to the industry average. However, NVIDIA announced a quarterly cash dividend of $0.04 per share for shareholders of record on June 8, 2023. To assess how this compares to the industry average, one would need to look up the average dividend payout for similar companies in the technology or semiconductor industry.
Trustworthiness score: 0.93

We observe that TLM demonstrates the ability to recognize the limitations of the available information. It refrains from generating speculative responses or hallucinations, thereby maintaining the reliability of the question-answering system. This behavior showcases an understanding of the boundaries of the context and prioritizes accuracy over conjecture.

Challenging Questions

Let’s see how our RAG system responds to harder questions, some of which may be misleading.

response = query_engine.query(
    "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"
)
display_response(response)
    Response: NVIDIA's revenue for the first quarter ended April 30, 2023, was $7.19 billion. The revenue for the previous quarter (the fourth quarter of fiscal 2023) can be calculated by taking the reported revenue for the first quarter and adjusting for the 19% increase from the previous quarter. 

Let \( x \) be the revenue for the previous quarter. According to the information provided:

\[
x + 0.19x = 7.19 \text{ billion}
\]
\[
1.19x = 7.19 \text{ billion}
\]
\[
x = \frac{7.19 \text{ billion}}{1.19} \approx 6.04 \text{ billion}
\]

Now, to find the decrease in revenue from the previous quarter to this quarter:

\[
\text{Decrease} = 7.19 \text{ billion} - 6.04 \text{ billion} \approx 1.15 \text{ billion}
\]

Thus, NVIDIA's revenue decreased by approximately $1.15 billion this quarter compared to the last quarter.
Trustworthiness score: 0.6
response = query_engine.query(
    "This report focuses on Nvidia's Q1FY2024 financial results. There are mentions of other companies in the report like Microsoft, Dell, ServiceNow, etc. Can you name them all here?",
)
display_response(response)
    Response: The report mentions the following companies: Microsoft. There are no mentions of Dell or ServiceNow in the provided context information.
Trustworthiness score: 0.73
response = query_engine.query(
    "How many RTX GPU models, including all custom versions released by third-party manufacturers and all revisions across different series, were officially announced in NVIDIA's Q1 FY2024 financial results?",
)
display_response(response)
    Response: In NVIDIA's Q1 FY2024 financial results, the following RTX GPU models were officially announced:

1. **GeForce RTX 4060 family of GPUs**
2. **GeForce RTX 4070 GPU**
3. **Six new NVIDIA RTX GPUs for mobile and desktop workstations**

This totals to **eight RTX GPU models** announced.
Trustworthiness score: 0.74
response = query_engine.query(
    "If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?",
)
display_response(response)
    Response: To calculate the projected annual revenue for NVIDIA's Data Center segment if it maintains its Q1 FY2024 quarter-over-quarter growth rate, we first need to determine the growth rate from Q4 FY2023 to Q1 FY2024.

NVIDIA reported a record Data Center revenue of $4.28 billion for Q1 FY2024. The revenue for the previous quarter (Q4 FY2023) can be calculated as follows:

Let \( R \) be the revenue for Q4 FY2023. The growth rate from Q4 FY2023 to Q1 FY2024 is given by:

\[
\text{Growth Rate} = \frac{\text{Q1 Revenue} - \text{Q4 Revenue}}{\text{Q4 Revenue}} = \frac{4.28 - R}{R}
\]

We know that the overall revenue for Q1 FY2024 is $7.19 billion, which is up 19% from the previous quarter. Therefore, we can express the revenue for Q4 FY2023 as:

\[
\text{Q1 FY2024 Revenue} = \text{Q4 FY2023 Revenue} \times 1.19
\]

Substituting the known value:

\[
7.19 = R \times 1.19
\]

Solving for \( R \):

\[
R = \frac{7.19}{1.19} \approx 6.03 \text{ billion}
\]

Now, we can calculate the Data Center revenue for Q4 FY2023. Since we don't have the exact figure for the Data Center revenue in Q4 FY2023, we will assume that the Data Center revenue also grew by the same percentage as the overall revenue.

Now, we can calculate the quarter-over-quarter growth rate for the Data Center segment:

\[
\text{Growth Rate} = \frac{4.28 - R_D}{R_D}
\]

Where \( R_D \) is the Data Center revenue for Q4 FY2023. However, we need to find \( R_D \) first.

Assuming the Data Center revenue was a certain percentage of the total revenue in Q4 FY2023, we can estimate it. For simplicity, let's assume the Data Center revenue was around 50% of the total revenue in Q4 FY2023 (this is a rough estimate, as we don't have the exact figure).

Thus, \( R_D \approx 0.5 \times 6
Trustworthiness score: 0.69

TLM automatically alerts us that these answers are unreliable via their lower trustworthiness scores. RAG systems with TLM help you properly exercise caution whenever you see a low trustworthiness score. Here are the correct answers to the aforementioned questions:

NVIDIA’s revenue increased by $1.14 billion this quarter compared to last quarter.

Google, Amazon Web Services, Microsoft, Oracle, ServiceNow, Medtronic, Dell Technologies.

There is not a specific total count of RTX GPUs mentioned.

Projected annual revenue if this growth rate is maintained for the next four quarters: approximately $26.34 billion.

With TLM, you can easily increase trust in any RAG system!
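
In practice, you can act on these scores programmatically, for example by flagging answers whose trustworthiness falls below a cutoff. Here is a minimal sketch; the answer_with_caution helper, the 0.7 threshold, and the fallback wording are hypothetical choices you should tune for your own application:

TRUSTWORTHINESS_THRESHOLD = 0.7  # hypothetical cutoff; tune this on your own data

def answer_with_caution(query: str) -> str:
    response = query_engine.query(query)
    output_dict = json.loads(response.response)  # our TLMWrapper packs the response and score as JSON
    if output_dict["trustworthiness_score"] < TRUSTWORTHINESS_THRESHOLD:
        return f"[Low confidence, please verify] {output_dict['response']}"
    return output_dict["response"]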

Alternate low-latency/streaming approach: Use TLM to assess responses from an existing RAG system

This part of the tutorial demonstrates how to use TLM to assess RAG responses produced by another generator LLM in a streaming use case (rather than using TLM to also generate the responses). TLM can score the trustworthiness of any response (generated by another LLM) to a given prompt via TLM.get_trustworthiness_score_async(). This is useful in settings where low latency is critical, since you can first stream in the response from your existing RAG system, and subsequently stream in TLM's trustworthiness score once it has been computed.

A key consideration here is the prompt argument provided to TLM. In order for TLM to effectively detect bad responses and hallucinations, its provided prompt should contain all relevant information, including:

  • Optional system instructions to shape the generator LLM’s overall behavior
  • Optional criteria that a good response should satisfy
  • Relevant context fetched by the RAG system retriever (in the same format as provided to your generator LLM)
  • The user query being responded to.

Here we demonstrate this process assuming a RAG system that uses the gpt-4o-mini LLM from OpenAI, but this can be done with any LLM. First, let’s set up the OpenAI client for streaming responses:

%pip install openai
from openai import AsyncOpenAI
from typing import AsyncGenerator

client = AsyncOpenAI(api_key="<insert your OpenAI API key>")

Now we define a function for streaming a response from OpenAI and then scoring its trustworthiness via TLM.

In your RAG system, the retriever will fetch some context for your generator LLM. You should combine that context with the user query into the prompt argument of TLM.get_trustworthiness_score_async(). If your RAG system uses system instructions to shape the generator LLM's overall behavior, ensure these are also part of the prompt argument passed to TLM. If you have criteria that a good response should satisfy (e.g. conciseness, specific statements to avoid, etc.), also provide these as part of the prompt argument passed to TLM (ideally formulated as: A correct answer will meet the following criteria: ...).

If the prompt argument provided to TLM lacks any of the above information, the resulting trustworthiness score may be less reliable.

We provide an example RAG prompt you can use below.

async def stream_openai_response(
    query: str,
    context: str,
    system_instructions: str = None,
    eval_criteria: str = None
) -> AsyncGenerator[str, None]:
    """
    Asynchronously stream a response from OpenAI, and subsequently provide a trustworthiness score using TLM.

    Args:
        query (str): The user's question.
        context (str): Retrieved/formatted context information to be used for answering the query.
        system_instructions (str): Optional instructions for the LLM on how to behave overall.
        eval_criteria (str): Optional criteria for evaluating the correctness/quality of the answer.

    Yields:
        str: Chunks of the streamed response from OpenAI, followed by the trustworthiness score.

    Note:
        The function yields the response in chunks for a streaming effect, and the trustworthiness
        score is yielded at the end after the full response is received.
    """
    full_prompt = f"""{system_instructions}

{eval_criteria}

Context:
{context}

User Question: {query}
"""

    response_stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": full_prompt}],
        stream=True
    )

    full_response = ""
    async for chunk in response_stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content
            yield content

    # After response streaming is complete, get the trustworthiness score for this prompt/response pair.
    # Here we demonstrate the asynchronous method in case there is additional logic you'd like to execute while the trustworthiness score is being computed.
    trust_score_result = await tlm.get_trustworthiness_score_async(full_prompt, full_response)
    yield f"\n\nTrustworthiness Score: {trust_score_result['trustworthiness_score']:.2f}"

Here we suppose the context for your generator LLM has already been fetched by the retriever of your RAG system. The context used below includes the relevant excerpts about Nvidia from a document encountered earlier in this tutorial.

# Define the context and query for the OpenAI streaming response.
# In your RAG system, this context would be provided by your retriever.
context = """NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, up 19% from the previous quarter and down 13% from a year ago.

GAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter. Non-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 26% from the previous quarter.

"The computer industry is going through two simultaneous transitions -- accelerated computing and generative AI," said Jensen Huang, founder and CEO of NVIDIA. "A trillion dollars of installed global data center infrastructure will transition from general purpose to accelerated computing as companies race to apply generative AI into every product, service and business process.
"""

query = "What was NVIDIA's revenue in the first quarter of 2023, and how does it compare to the previous quarter? And what are the two simultaneous transitions Jensen Huang mentioned?"

Next we define system instructions and evaluation criteria. These are optional, but if you are implementing them in your RAG system, try to also provide them to TLM.

System instructions for the LLM:

  • Note: Edit these instructions to match the system instructions used for the generator LLM in your own RAG system.
  • These instructions should guide the LLM on how to interpret and respond to queries based on the provided context.

Evaluation criteria for the response:

  • Note: Edit these criteria to reflect the specific requirements for a good response in your use case.
  • These criteria will be used by TLM to assess the trustworthiness of the response.
# System instructions for the LLM
system_instructions = """You are a helpful assistant designed to help users navigate a complex set of documents. Answer the user's Question based on the following Context. Follow these rules:
1. Use only information from the provided Context.
2. If the Context doesn't adequately address the Question, say: "Based on the available information, I cannot provide a complete answer to this question."
3. Give a clear, short, and accurate answer. Explain complex terms if needed.
4. If the Context contains conflicting information, point this out without attempting to resolve the conflict.
5. Don't use phrases like "according to the context," "as the context states," etc.
Remember, your purpose is to provide information based on the Context, not to offer original advice."""

# Evaluation criteria for the response
eval_criteria = """A correct Answer should: be concise without unnecessary words, only contain facts that are explicitly stated in the provided Context, and never contain investment advice."""

Now let’s actually stream in a RAG response and trustworthiness score using our stream_openai_response() function.

print("Streaming response:")
async for content in stream_openai_response(query, context, system_instructions, eval_criteria):
print(content, end="", flush=True)
print("\n")
    Streaming response:
NVIDIA's revenue for the first quarter of 2023 was $7.19 billion, which is up 19% from the previous quarter. Jensen Huang mentioned that the computer industry is going through two simultaneous transitions: accelerated computing and generative AI.

Trustworthiness Score: 0.99

When you run the above code, you’ll see the OpenAI response streaming in real-time, followed by the trustworthiness score from TLM. This approach gives you the best of both worlds: a responsive interface with low-latency streaming and a reliability assessment of the generated answer to catch hallucinations in your RAG system.

If you have an existing RAG pipeline where latency is a key concern, this is how we recommend incorporating TLM to catch hallucinations.
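
If your existing pipeline does not stream responses, you can instead score an already-generated answer with the synchronous TLM.get_trustworthiness_score() method, which is demonstrated further in the next section. A minimal sketch reusing the prompt components defined above (existing_prompt and existing_response are hypothetical placeholders for whatever your generator LLM actually received and returned):

# Assemble the same prompt (instructions + criteria + context + query) that your generator LLM received
existing_prompt = f"{system_instructions}\n\n{eval_criteria}\n\nContext:\n{context}\n\nUser Question: {query}"
existing_response = "NVIDIA's first-quarter revenue was $7.19 billion, up 19% from the previous quarter."  # placeholder answer from your generator LLM

scores = tlm.get_trustworthiness_score([existing_prompt], [existing_response])
print(scores[0]["trustworthiness_score"])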

Calibrate TLM scores to more closely reflect response-quality ratings by your team

If TLM trustworthiness scores do not align with your team’s manual quality ratings of good/bad RAG responses, consider introducing custom evaluation criteria. You can further calibrate TLM scores against your human quality ratings for a dataset of prompt-response pairs.

For example, let’s define custom evaluation criteria based on faithfulness and groundedness, two key measures of RAG systems. Faithfulness measures the factual consistency of the generated response against the retrieved context, while groundedness measures whether information in the generated response is grounded in the retrieved context.

faithfulness_groundedness_eval_criteria = {
    "custom_eval_criteria": [
        {
            "name": "Faithfulness & Groundedness",
            "criteria": "Determine if the Response is solely based on information available in the Context (no additional facts are mentioned in the Response that are not stated in the Context). \
Also determine if the Response does not contradict any information in the Context. If the Context contains no information available to answer the Question, a good Response should state 'there is no information available.'"
        }
    ]
}

For this custom evaluation criteria, there are 4 cases we want to consider:

  1. Question is answerable based on context, response answers the question - (should be a high score)
  2. Question is answerable based on context, response does not answer the question - (should be a low score)
  3. Question is not answerable based on context, response answers the question - (should be a low score)
  4. Question is not answerable based on context, response does not answer the question - (should be a high score)

Let's now define some example data based on the Nvidia dataset we've used throughout this tutorial.

import pandas as pd 

custom_eval_data = [
    {"question": "What was NVIDIA's revenue in the first quarter of fiscal 2024?",
     "context": "NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion.",
     "answer": "NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.",
     "case": 1},

    {"question": "What was NVIDIA's revenue in the first quarter of fiscal 2024?",
     "context": "NVIDIA's GAAP earnings per share for the quarter were $0.82.",
     "answer": "NVIDIA's total revenue was $7.19 billion.",
     "case": 2},

    {"question": "What was NVIDIA's dividend payout in Q1 FY2024?",
     "context": "NVIDIA reported earnings per share, but no mention of dividends.",
     "answer": "NVIDIA's dividend payout was $0.10 per share.",
     "case": 3},

    {"question": "What was NVIDIA's dividend payout in Q1 FY2024?",
     "context": "NVIDIA reported earnings per share, but no mention of dividends.",
     "answer": "No information is available to answer the question.",
     "case": 4},
]

custom_df = pd.DataFrame(custom_eval_data)

We can now create the prompt used for this custom evaluation.

def create_tlm_prompt(row):
    return f"Context: {row['context']}\n\nUser Question: {row['question']}"

custom_df['prompt'] = custom_df.apply(create_tlm_prompt, axis=1)
tlm_faithfulness_groundedness = studio.TLM(options=faithfulness_groundedness_eval_criteria)

Then we will use TLM.get_trustworthiness_score() to obtain the score pertaining to our faithfulness & groundedness custom evaluation criteria.

res_faithfulness_groundedness = tlm_faithfulness_groundedness.get_trustworthiness_score(custom_df['prompt'].tolist(), custom_df['answer'].tolist())
res_faithfulness_groundedness_df = pd.DataFrame(res_faithfulness_groundedness)
    Querying TLM... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
df_results = pd.concat([custom_df, res_faithfulness_groundedness_df], axis=1)
df_results[['question', 'answer', 'trustworthiness_score', 'log']]
|   | question | answer | trustworthiness_score | log |
|---|----------|--------|-----------------------|-----|
| 0 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue in the first quarter of... | 0.902085 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
| 1 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue was $7.19 billion. | 0.753851 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
| 2 | What was NVIDIA's dividend payout in Q1 FY2024? | NVIDIA's dividend payout was $0.10 per share. | 0.404263 | {'custom_eval_criteria': [{'name': 'Faithfulne... |
| 3 | What was NVIDIA's dividend payout in Q1 FY2024? | No information is available to answer the ques... | 0.414055 | {'custom_eval_criteria': [{'name': 'Faithfulne... |

Let’s see if our custom evaluation scores align with expected behavior from each of the previously-described cases.

for idx, row in df_results.iterrows():
    case = row['case']
    faithfulness_and_groundedness_score = row['log']['custom_eval_criteria'][0]['score']
    print(f"Case {case} - Faithfulness & Groundedness Score: {faithfulness_and_groundedness_score:.3f}")
    Case 1 - Faithfulness & Groundedness Score: 0.998
Case 2 - Faithfulness & Groundedness Score: 0.003
Case 3 - Faithfulness & Groundedness Score: 0.002
Case 4 - Faithfulness & Groundedness Score: 0.998

We expected cases 1 and 4 to have a high score and cases 2 and 3 to have a low score, so the results look great!

If you have human-quality ratings for many generated responses and want to produce automated scores that better align with these ratings, consider TLMCalibrated. This approach combines TLM’s trustworthiness and custom evaluation scores into a single score that is calibrated to match your manually-provided quality ratings.

# Quality ratings for each response, say provided by your team (1 = good, 0 = bad)
df_results['human_rating'] = [1, 0, 0, 1]
tlm_calibrated = studio.TLMCalibrated(options=faithfulness_groundedness_eval_criteria)

We fit the TLMCalibrated model to a dataset of the previously-obtained TLM scores and human quality ratings, training the model to better align its scores.

tlm_calibrated.fit(res_faithfulness_groundedness_df.to_dict(orient='records'), df_results['human_rating'].tolist())

Here’s how to produce calibrated scores after fitting the model:

calibrated_res = tlm_calibrated.get_trustworthiness_score(custom_df['prompt'].tolist(), custom_df['answer'].tolist())
calibrated_res_df = pd.DataFrame(calibrated_res)
    Querying TLM... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
calibrated_combined_df = pd.concat([df_results, calibrated_res_df], axis=1)
calibrated_combined_df[['question', 'answer', 'calibrated_score']]
|   | question | answer | calibrated_score |
|---|----------|--------|------------------|
| 0 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue in the first quarter of... | 0.97 |
| 1 | What was NVIDIA's revenue in the first quarter... | NVIDIA's total revenue was $7.19 billion. | 0.13 |
| 2 | What was NVIDIA's dividend payout in Q1 FY2024? | NVIDIA's dividend payout was $0.10 per share. | 0.03 |
| 3 | What was NVIDIA's dividend payout in Q1 FY2024? | No information is available to answer the ques... | 0.84 |

These calibrated scores align even better with our expectations (high scores for question/response pairs 1 and 4, low scores for pairs 2 and 3).

Learn more in our TLM Custom Evaluation Criteria tutorial.