Incorporating custom evaluation criteria for LLM outputs and calibrating TLM trustworthiness scores against human ratings

Run in Google Colab

Are you finding TLM trustworthiness scores do not align with your team’s manual quality ratings of good/bad LLM outputs? If so, you can address this via:

  • Adjusting prompts and TLM options/settings
  • Custom evaluation criteria
  • Calibrating TLM trustworthiness scores against your human quality ratings for example prompt-response pairs.

This tutorial demonstrates how to evaluate and improve the performance of TLM for your specific use-case.

Setup

Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link.

Cleanlab’s Python client can be installed using pip and a Studio object can be instantiated with your API key:

%pip install --upgrade cleanlab-studio scikit-learn
import pandas as pd
from scipy.stats import spearmanr # just used for evaluation of results in this tutorial

pd.set_option('display.max_colwidth', None)
from cleanlab_studio import Studio

# Get API key from here: https://app.cleanlab.ai/account after creating an account.
studio = Studio("<API key>")

Fetch the dataset

Let’s first load an example dataset. Here we consider a dataset composed of trivia questions with responses already computed from some LLM, which we want to automatically evaluate with TLM.

In this tutorial we will be using the tlm.get_trustworthiness_score() method to evaluate these prompt-response pairs, and explore how we can use various techniques to calibrate the trustworthiness score produced by the TLM to align with your specific use case.

wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/triviaqa-conciseness/triviaqa_with_ratings.csv
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/triviaqa-conciseness/triviaqa_without_ratings.csv
data = pd.read_csv("triviaqa_with_ratings.csv")

Suppose we aim to score LLM responses based on both correctness and conciseness (i.e., the LLM response should not contain extraneous information). Our dataset additionally contains human ratings of the (manually-determined) quality of each response. Even without such human ratings, you can still use many of the techniques demonstrated in this tutorial.

Here is a sample question where the LLM response is factually correct; however, it contains too much irrelevant information and hence is marked as incorrect by a human annotator (rating = 0):

print("Question:", data["question"][0], "\n")
print("Response:", data["answer"][0], "\n")
print("Human Rating:", data["human_rating"][0])
    Question: Bonar Law is the only Prime Minister not born in the UK. In which country was he born? 

Response: Bonar Law, who briefly served as British Prime Minister in the early 1920s, holds the distinction of being the only Prime Minister born outside the United Kingdom. He was born in Canada, specifically in the province of New Brunswick. This unique aspect of his background adds an interesting layer to his political legacy, as he later rose to lead the UK government during a particularly tumultuous period in British history.

Human Rating: 0

The LLM response to this other question is both correct and concise, so it has been manually marked as high-quality (human rating = 1):

print("Question:", data["question"][1], "\n")
print("Response:", data["answer"][1], "\n")
print("Human Rating:", data["human_rating"][1])
    Question: Which singer had a big 60s No 1 with Roses Are Red? 

Response: Bobby Vinton

Human Rating: 1

Scoring responses with TLM

To start, let’s apply a simple prompt template to each question, and then use get_trustworthiness_score() to score each response.

def create_prompt(question):
    return f"Answer this question as concisely as possible: {question}"

data["prompt"] = data["question"].apply(create_prompt)
print("Sample prompt:")
print(data["prompt"][0])
    Sample prompt:
Answer this question as concisely as possible: Bonar Law is the only Prime Minister not born in the UK. In which country was he born?
tlm = studio.TLM()
res = tlm.get_trustworthiness_score(data["prompt"].tolist(), data["answer"].tolist())
res_df = pd.DataFrame(res)
    Querying TLM... 100%|██████████|

Spearman’s rank correlation is a statistical method used to measure the strength of the relationship between two ranked variables. We can use it here to measure how well the trustworthiness scores from TLM reflect the given human ratings.

In this tutorial, the human ratings are binary (0 and 1 values representing incorrect and correct responses, respectively). However, all methods showcased in this tutorial can be applied to any range of numeric human ratings, such as ratings on a scale of 1 to 5. Spearman’s rank correlation is a valid measure of alignment between estimated trustworthiness scores and human ratings, regardless of whether they are binary or numeric, as it quantifies how consistently good responses are scored higher than bad ones.
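To illustrate this point, here is a tiny toy example with made-up scores and ratings (not from our dataset), showing that the correlation behaves the same way whether the ratings are binary or on a numeric scale:

# Toy illustration (hypothetical values): Spearman correlation depends only on ranks,
# so it applies equally to binary ratings and to ratings on, say, a 1-5 scale.
from scipy.stats import spearmanr

toy_scores = [0.2, 0.9, 0.4, 0.8]   # hypothetical trustworthiness scores
binary_ratings = [0, 1, 0, 1]       # binary human ratings (0 = bad, 1 = good)
scale_ratings = [2, 5, 1, 4]        # the same judgments expressed on a 1-5 scale

# Both correlations come out strongly positive, because in both cases
# the better-rated responses also received higher scores.
print(spearmanr(toy_scores, binary_ratings).statistic)
print(spearmanr(toy_scores, scale_ratings).statistic)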

print("Spearman correlation:", spearmanr(res_df["trustworthiness_score"], data["human_rating"]).statistic)
    Spearman correlation: 0.5710574352893738

Prompt engineering for better TLM trustworthiness scoring

One simple way to improve TLM performance is to improve the instructions we give it. With clearer instructions, TLM can better score the responses for our specific use-case. Important elements of an effective prompt argument for TLM include specifying precisely what a correct answer should contain, and perhaps also giving examples of wrong answers.

Here, for example, we can update TLM’s prompt to specify that wordy answers are considered incorrect.

def create_improved_prompt(question):
    return f"""Answer the following question as concisely as you can.
For the answer to be considered correct, it has to be concise. Wordy answers are incorrect.
Question: {question}"""

data["improved_prompt"] = data["question"].apply(create_improved_prompt)
print("Sample improved prompt:")
print(data["improved_prompt"][0])
    Sample improved prompt:
Answer the following question as concisely as you can.
For the answer to be considered correct, it has to be concise. Wordy answers are incorrect.
Question: Bonar Law is the only Prime Minister not born in the UK. In which country was he born?
res_improved = tlm.get_trustworthiness_score(data["improved_prompt"].tolist(), data["answer"].tolist())
res_improved = pd.DataFrame(res_improved)
    Querying TLM... 100%|██████████|

We see that with the updated prompt, TLM’s trustworthiness score is now better aligned with our human ratings (as measured via their Spearman correlation):

print("Spearman correlation:", spearmanr(res_improved["trustworthiness_score"], data["human_rating"]).statistic)
    Spearman correlation: 0.6219208505783059

Custom evaluation criteria

Our improved prompt template did enhance TLM performance. However, evaluating conciseness of the response represents a particular notion of response-quality that may not align with TLM’s general trustworthiness scoring (which aims to quantify uncertainty in the LLM). Fortunately, TLM supports custom evaluation criteria, allowing you to define specific ways to additionally evaluate response-quality.

We can define our custom evaluation criteria in TLM’s options dictionary:

custom_eval_criteria_option = {"custom_eval_criteria": [
    {"name": "Conciseness", "criteria": "Determine if the output is concise."}
]}

tlm_custom_eval = studio.TLM(options=custom_eval_criteria_option)

Note that while the prompt for TLM.get_trustworthiness_score() should only specify what a correct answer should look like, the custom evaluation criteria should specify how to determine if the answer is good (more tips below).

After defining the additional evaluation criteria, running TLM’s get_trustworthiness_score() method will return TLM’s standard trustworthiness scores along with a new custom evaluation score based on our criteria.

res_custom_eval = tlm_custom_eval.get_trustworthiness_score(data["improved_prompt"].tolist(), data["answer"].tolist())
res_custom_eval_df = pd.DataFrame(res_custom_eval)
res_custom_eval_df.head()
Querying TLM... 100%|██████████|
trustworthiness_score log
0 0.594730 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.21906989418743317}]}
1 0.987435 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 1.0}]}
2 0.935873 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.4}]}
3 0.176857 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.4058624456722605}]}
4 0.941126 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.40281272503149823}]}

The custom evaluation scores will be returned as entries in the log object:

res_custom_eval_df["log"].iloc[0]
    {'custom_eval_criteria': [{'name': 'Conciseness',
'score': 0.21906989418743317}]}
res_custom_eval_df["custom_score"] = res_custom_eval_df.apply(lambda x: x["log"]["custom_eval_criteria"][0]["score"], axis=1)
res_custom_eval_df["custom_score"].head()
    0    0.219070
1 1.000000
2 0.400000
3 0.405862
4 0.402813
Name: custom_score, dtype: float64

Let’s combine the original data with these results to see how well our conciseness criteria scored the responses.

custom_eval_combined = pd.concat([data, res_custom_eval_df], axis=1)
custom_eval_sorted = custom_eval_combined.sort_values("custom_score")

Below, we view the 2 examples with the lowest conciseness scores across our dataset and see that the LLM responses are indeed very verbose. In the first example, TLM’s trustworthiness score did not actually reflect the verbosity of the response.

This showcases how specifying custom evaluation criteria can help in use-cases where TLM’s general trustworthiness scores do not reflect your notion of good-vs-bad quality responses.

custom_eval_sorted.head(2)[["question", "answer", "trustworthiness_score", "custom_score"]]
question answer trustworthiness_score custom_score
29 What was the name of Michael Jackson's autobiography written in 1988? The title of Michael Jackson's autobiography, written and published in 1988, is 'Moonwalk'. In this deeply personal and revealing book, Jackson reflects on his extraordinary life, starting from his early days as a child prodigy with the Jackson 5 to his unprecedented rise as a global pop sensation. The book delves into his experiences with fame, his artistic inspirations, and the personal challenges he faced throughout his career, offering readers an intimate glimpse into the mind of the King of Pop. 0.664943 0.2
27 Name the magician who first introduced us to Basil Brush. The magician who first introduced us to Basil Brush is David Nixon, a prominent British magician and television personality. He was well-known for his charming magic performances and frequent appearances on television variety shows during the 1960s and 1970s. His collaboration with the puppet character Basil Brush helped make the mischievous fox a beloved figure in British entertainment, ultimately turning Basil into a household name across the UK. 0.071862 0.2

Inspecting responses with the highest conciseness scores in our dataset, we see that these are indeed nice and concise.

custom_eval_sorted.tail(2)[["question", "answer", "trustworthiness_score", "custom_score"]]
question answer trustworthiness_score custom_score
26 Richard Daley was mayor of which city for 21 years? Richard Daley was mayor of Chicago for 21 years. 0.982434 1.0
23 Taphephobia is the fear of losing your teeth? Taphephobia is actually the fear of being buried alive. 0.964557 1.0

Tips for writing custom evaluation criteria

You can further iterate on how you write the custom evaluation criteria, in order to better align the custom evaluation scores with what you consider good-vs-bad quality responses. Tips include:

  • Define clear, objective criteria for determining answer quality and avoid subjective language.
  • Consider including examples of good vs. bad responses, and edge-case guidelines – such as whether flowery language or answer-refusal is considered good or bad.
  • Qualitatively describe measurable aspects of the response without describing numerical scoring mechanisms; TLM already has an internal scoring system based on your qualitative description.

An example custom criteria to evaluate how well a RAG system abstains from responding might be:

Determine whether the answer is accurate based on the provided context. If the context does not provide information needed to answer the question, the answer should state 'Unable to respond based on available information.'
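Such a criteria string can be supplied to TLM via the same options format used for the conciseness criteria above. Here is a minimal sketch, assuming the studio object instantiated earlier (the criteria name "Grounded Abstention" is just an illustrative choice):

# Sketch: a RAG-focused custom evaluation criteria, passed via the same options
# structure used for the "Conciseness" criteria earlier in this tutorial.
rag_eval_criteria_option = {"custom_eval_criteria": [
    {
        "name": "Grounded Abstention",  # illustrative name
        "criteria": (
            "Determine whether the answer is accurate based on the provided context. "
            "If the context does not provide information needed to answer the question, "
            "the answer should state 'Unable to respond based on available information.'"
        ),
    }
]}

tlm_rag_eval = studio.TLM(options=rag_eval_criteria_option)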

Calibrating TLM scores against human quality ratings

TLM’s native trustworthiness scoring and custom evaluation criteria each have their strengths. By combining them, we can benefit from TLM’s general trustworthiness score while tailoring the evaluation to our specific criteria. Note that none of the above approaches requires human-rating annotations; so far we have only used the human ratings to quantify alignment between the scores and the ratings.

Here we showcase how TLMCalibrated can combine and calibrate these values into a score that is explicitly aligned with numeric/binary human-rating (i.e. ground-truth) quality annotations. While we integrate the custom evaluation and trustworthiness scores into a single calibrated score here, you can also run the same procedure to calibrate just one of these scores without the other.

First, we instantiate the TLMCalibrated object with the same custom evaluation criteria used previously in the options argument.

tlm_calibrated = studio.TLMCalibrated(options=custom_eval_criteria_option)

Next, we fit this TLMCalibrated object to a dataset composed of the previously obtained TLM scores and human quality ratings. This trains the model to produce better-aligned scores over this dataset.

We can pass res_custom_eval (the list of dictionaries returned from the previous get_trustworthiness_score() call) directly into fit().

tlm_calibrated.fit(res_custom_eval, data["human_rating"])

After fitting TLMCalibrated to our human-annotated dataset, we can call its get_trustworthiness_score() method. This returns not only the previous TLM trustworthiness score and custom conciseness score, but also an additional calibrated score that optimally combines those two scores to align with human quality ratings.

calibrated_res = tlm_calibrated.get_trustworthiness_score(data["improved_prompt"].tolist(), data["answer"].tolist())
calibrated_res_df = pd.DataFrame(calibrated_res)
    Querying TLM... 100%|██████████|
calibrated_res_df.head()
trustworthiness_score log calibrated_score
0 0.594730 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.21906989418743317}]} 0.065214
1 0.987435 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 1.0}]} 1.000000
2 0.935873 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.4}]} 0.831320
3 0.176857 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.4058624456722605}]} 0.230365
4 0.941126 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.40281272503149823}]} 0.840114

We again measure alignment between the calibrated scores and our human ratings via their Spearman correlation, which here is the highest among all scoring methods. Thus, whenever you have a moderately-sized set of human quality ratings, we recommend considering TLMCalibrated to produce a tailored evaluation score for LLM responses in your use-case.

print("Spearman correlation:", spearmanr(calibrated_res_df["calibrated_score"], data["human_rating"]).statistic)
    Spearman correlation: 0.7166216667291583

Using TLMCalibrated to get improved scores on new data

You can use the already-fitted TLMCalibrated model to score new responses via the same get_trustworthiness_score() method. Here we demonstrate this with some additional data.

new_data = pd.read_csv("triviaqa_without_ratings.csv")
new_data["prompt"] = new_data["question"].apply(create_improved_prompt)

We call get_trustworthiness_score() on the new (prompt, response) pairs, just as we did before.

new_data_res = tlm_calibrated.get_trustworthiness_score(new_data["prompt"].tolist(), new_data["answer"].tolist())
new_data_res_df = pd.DataFrame(new_data_res)
new_data_res_df.head(2)
Querying TLM... 100%|██████████|
trustworthiness_score log calibrated_score
0 0.968277 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 1.0}]} 0.981732
1 0.708994 {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.20586244549585175}]} 0.202411

The returned results contain the TLM trustworthiness score and log (like before), but also the calibrated_score to use for evaluating these additional responses (in real-time).

new_data_combined = pd.concat([new_data, new_data_res_df], axis=1)
new_data_combined.head(2)[["question", "answer", "calibrated_score"]]
question answer calibrated_score
0 In 1976, which gymnast scored 7 maximum scores of 10 as she won three gold medals, one silver and one bronze? Nadia Comaneci 0.981732
1 What is the Milky Way? The Milky Way is a barred spiral galaxy, one of the billions of galaxies in the observable universe, and it is our home galaxy. As part of the Local Group of galaxies, it exists alongside other notable galaxies like Andromeda and Triangulum, as well as about 54 smaller satellite galaxies. The Milky Way contains between 100 and 400 billion stars, including our Sun, and the dense band of light seen from Earth at night is caused by the multitude of stars and celestial objects aligned along the galactic plane. It is named after the appearance of this bright band of light, which stretches across the night sky. 0.202411

Further scoring improvements

At this point, you have specified custom evaluation criteria and combined them with TLM’s native trustworthiness score to produce calibrated scores via TLMCalibrated. To further align these scores with your human quality ratings, you can tinker with the prompts used in various arguments of TLMCalibrated.

Try jointly improving the text used for both the prompt argument and the custom evaluation criteria. Iterate through many alternative versions of this text, trying to maximize the Spearman correlation between the resulting calibrated scores from TLMCalibrated and your human quality ratings.
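A minimal sketch of one way to run such an iteration is shown below, mirroring the workflow used earlier in this tutorial (the candidate prompt templates and criteria strings are hypothetical placeholders; substitute your own variants):

# Sketch: search over candidate prompt templates and custom evaluation criteria,
# keeping whichever combination yields calibrated scores best aligned with human ratings.
from scipy.stats import spearmanr

candidate_prompts = [  # hypothetical variants -- write your own
    "Answer this question as concisely as possible: {question}",
    "Answer the following question as concisely as you can.\n"
    "For the answer to be considered correct, it has to be concise. Wordy answers are incorrect.\n"
    "Question: {question}",
]
candidate_criteria = [  # hypothetical variants -- write your own
    "Determine if the output is concise.",
    "Determine if the output answers the question directly, without extraneous information.",
]

results = []
for prompt_template in candidate_prompts:
    prompts = [prompt_template.format(question=q) for q in data["question"]]
    for criteria in candidate_criteria:
        options = {"custom_eval_criteria": [{"name": "Conciseness", "criteria": criteria}]}
        # Score with a plain TLM, fit TLMCalibrated on those scores plus human ratings,
        # then obtain calibrated scores -- the same workflow used above in this tutorial.
        res = studio.TLM(options=options).get_trustworthiness_score(prompts, data["answer"].tolist())
        tlm_cal = studio.TLMCalibrated(options=options)
        tlm_cal.fit(res, data["human_rating"])
        cal_df = pd.DataFrame(tlm_cal.get_trustworthiness_score(prompts, data["answer"].tolist()))
        corr = spearmanr(cal_df["calibrated_score"], data["human_rating"]).statistic
        results.append({"prompt": prompt_template, "criteria": criteria, "spearman": corr})

pd.DataFrame(results).sort_values("spearman", ascending=False)

Note that measuring the correlation on the same examples used for fitting can overstate alignment; if you have enough rated data, consider holding out a subset of it for this comparison.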

Tips on tuning both of these include:

  • The base TLM prompt will perform better for general correctness evaluation; however, you can still be specific about certain guidelines for it to follow.
  • For example, in RAG use cases: if the correct response to a question is “The given context does not contain any relevant information to answer the question” (because the retrieved context lacks the necessary details), you should explicitly specify this in the prompt, as shown in the sketch after this list. This ensures that TLM recognizes such responses as correct and does not score them low simply because they do not provide a direct answer.
  • The custom evaluation criteria are better suited for domain-specific needs beyond a general correctness check. You can specify these custom evaluation criteria to complement the general trustworthiness scoring that the base TLM provides.
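As a minimal sketch of the RAG tip above, such a prompt template might spell out the expected abstention response explicitly (the function and variable names here are hypothetical, not part of this tutorial’s dataset):

# Hypothetical RAG-style prompt template: explicitly states that an abstention
# response is the correct answer when the context lacks the needed information.
def create_rag_prompt(context, question):
    return f"""Answer the question using only the provided context.
If the context does not contain the information needed to answer the question,
the correct response is: "The given context does not contain any relevant information to answer the question."
Context: {context}
Question: {question}"""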

By systematically adjusting both the prompt and evaluation criteria, you can improve TLMCalibrated scores to suit your specific use-case.