Incorporating custom evaluation criteria for LLM outputs and calibrating TLM trustworthiness scores against human ratings
Are you finding TLM trustworthiness scores do not align with your team’s manual quality ratings of good/bad LLM outputs? If so, you can address this via:
- Adjusting prompts and TLM options/settings
- Custom evaluation criteria
- Calibrating TLM trustworthiness scores against your human quality ratings for example prompt-response pairs.
This tutorial demonstrates how to evaluate and improve the performance of TLM for your specific use-case.
Setup
TLM requires a Cleanlab account. Sign up for one here and use TLM for free! If you’ve already signed up, check your email for a personal login link.
Cleanlab’s Python client can be installed using pip. This tutorial also requires the scikit-learn package; SciPy, which is used below for evaluation, is installed along with it as a dependency.
%pip install --upgrade cleanlab-studio scikit-learn
import pandas as pd
from scipy.stats import spearmanr # just used for evaluation of results in this tutorial
pd.set_option('display.max_colwidth', None)
from cleanlab_studio import Studio
studio = Studio("<API key>") # Get API key from here: https://app.cleanlab.ai/account after creating an account
Fetch the dataset
Let’s first load an example dataset. Here we consider a dataset composed of trivia questions with responses already computed from some LLM, which we want to automatically evaluate with TLM.
In this tutorial we will use the tlm.get_trustworthiness_score() method to evaluate these prompt-response pairs, and explore various techniques for calibrating TLM’s trustworthiness score to align with your specific use case.
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/triviaqa-conciseness/triviaqa_with_ratings.csv
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/triviaqa-conciseness/triviaqa_without_ratings.csv
data = pd.read_csv("triviaqa_with_ratings.csv")
Suppose we aim to score LLM responses based on both correctness of the response and conciseness (i.e. the LLM response should not contain extraneous information). Our dataset additionally contains human ratings regarding the (manually-determined) quality of each response. Even without having such human ratings, you can still use many of the techniques demonstrated in this tutorial.
Here is a sample question where the LLM response is factually correct but contains too much irrelevant information, so a human annotator marked it as incorrect (rating = 0):
print("Question:", data["question"][0], "\n")
print("Response:", data["answer"][0], "\n")
print("Human Rating:", data["human_rating"][0])
The LLM response to this other question is both correct and concise, hence it has been manually marked as high-quality (human rating = 1):
print("Question:", data["question"][1], "\n")
print("Response:", data["answer"][1], "\n")
print("Human Rating:", data["human_rating"][1])
Scoring responses with TLM
Let’s first wrap each question in a simple prompt template, and then use get_trustworthiness_score() to score each response.
def create_prompt(question):
    return f"Answer this question as concisely as possible: {question}"
data["prompt"] = data["question"].apply(create_prompt)
print("Sample prompt:")
print(data["prompt"][0])
tlm = studio.TLM() # See Advanced tutorial for optional TLM configurations to boost performance
res = tlm.get_trustworthiness_score(data["prompt"].tolist(), data["answer"].tolist())
res_df = pd.DataFrame(res)
Spearman’s rank correlation is a statistical method used to measure the strength of the relationship between two ranked variables. We can use it here to measure how well the trustworthiness scores from TLM reflect the given human ratings.
In this tutorial, the human ratings are binary (only 0 and 1 values, representing incorrect and correct responses respectively). However, all methods showcased in this tutorial can be applied to any range of numeric human ratings, such as ratings on a scale of 1 to 5. Spearman’s rank correlation is a valid measure of alignment between estimated trustworthiness scores and human ratings, regardless of whether the ratings are binary or numeric, as it quantifies how consistently good responses are scored higher than bad ones.
print("Spearman correlation:", spearmanr(res_df["trustworthiness_score"], data["human_rating"]).statistic)
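To make the metric concrete, here is a minimal pure-Python sketch of Spearman’s rank correlation, computed as the Pearson correlation of average ranks, which is intended to mirror what spearmanr reports. The helper names and toy numbers are illustrative assumptions, not part of this tutorial’s dataset. Note that with binary ratings the tied ranks mean that even scores that perfectly separate good from bad responses can yield a correlation below 1.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only (not taken from the tutorial dataset).
toy_ratings = [1, 1, 0, 0]
separated = spearman([0.9, 0.8, 0.3, 0.2], toy_ratings)  # good responses all score higher
swapped = spearman([0.9, 0.3, 0.8, 0.2], toy_ratings)    # one good/bad pair is out of order
print("Separated:", round(separated, 3), "Swapped:", round(swapped, 3))
```

In this toy case the fully separated scores give roughly 0.894 rather than 1, so correlations are best compared against each other on the same set of ratings rather than read as absolute values.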
Prompt engineering for better TLM trustworthiness scoring
One simple way to improve TLM performance is to improve the instructions we give. With clearer instructions, TLM can better score the responses for our specific use-case. Important elements of an effective prompt argument for TLM include: specifying precisely what a correct answer should contain, and perhaps also giving examples of wrong answers.
Here, for example, we can update TLM’s prompt to specify that wordy answers are considered incorrect.
def create_improved_prompt(question):
    return f"""Answer the following question as concisely as you can.
For the answer to be considered correct, it has to be concise. Wordy answers are incorrect.
Question: {question}"""
data["improved_prompt"] = data["question"].apply(create_improved_prompt)
print("Sample improved prompt:")
print(data["improved_prompt"][0])
res_improved = tlm.get_trustworthiness_score(data["improved_prompt"].tolist(), data["answer"].tolist())
res_improved = pd.DataFrame(res_improved)
We see that with the updated prompt, TLM’s trustworthiness score is now better aligned with our human ratings (as measured via their Spearman correlation):
print("Spearman correlation:", spearmanr(res_improved["trustworthiness_score"], data["human_rating"]).statistic)
Custom evaluation criteria
Our improved prompt template did enhance TLM performance. However, evaluating conciseness of the response represents a particular notion of response-quality that may not align with TLM’s general trustworthiness scoring (which aims to quantify uncertainty in the LLM). Fortunately, TLM supports custom evaluation criteria, allowing you to define specific ways to additionally evaluate response-quality.
We can define our custom evaluation criteria in TLM’s options dictionary (currently, only one custom evaluation criterion at a time is supported):
custom_eval_criteria_option = {"custom_eval_criteria": [
{"name": "Conciseness", "criteria": "Determine if the response is concise."}
]}
tlm_custom_eval = studio.TLM(options=custom_eval_criteria_option)
Note that while the prompt for TLM.get_trustworthiness_score() should only specify what a correct response should look like, the custom evaluation criteria should specify how to determine if the response is good (more tips below).
After defining the additional evaluation criteria, running TLM’s get_trustworthiness_score() method returns both TLM’s standard trustworthiness scores and a new custom evaluation score based on our criteria.
res_custom_eval = tlm_custom_eval.get_trustworthiness_score(data["improved_prompt"].tolist(), data["answer"].tolist())
res_custom_eval_df = pd.DataFrame(res_custom_eval)
res_custom_eval_df.head()
| | trustworthiness_score | log |
---|---|---|
0 | 0.650464 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.006961743549913892}]} |
1 | 0.987435 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.9975124352663575}]} |
2 | 0.935303 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.2512494864758131}]} |
3 | 0.183738 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.2500971975237057}]} |
4 | 0.941002 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.25337718256654873}]} |
The custom evaluation scores will be returned as entries in the log object:
res_custom_eval_df["log"].iloc[0]
res_custom_eval_df["custom_score"] = res_custom_eval_df.apply(lambda x: x["log"]["custom_eval_criteria"][0]["score"], axis=1)
res_custom_eval_df["custom_score"].head()
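The lambda above relies on the custom score being the first entry of the list. As a slightly more defensive sketch, a small helper can look the score up by criterion name instead. The helper name get_custom_score is hypothetical, and the log layout is assumed to match the example rows shown in the table above.

```python
def get_custom_score(log, criteria_name):
    """Return the score of the named custom criterion in a TLM log dict, or None if absent."""
    for entry in log.get("custom_eval_criteria", []):
        if entry.get("name") == criteria_name:
            return entry.get("score")
    return None


# Example log copied from the first row of the results table above.
example_log = {"custom_eval_criteria": [{"name": "Conciseness", "score": 0.006961743549913892}]}
conciseness = get_custom_score(example_log, "Conciseness")
missing = get_custom_score(example_log, "Politeness")  # a name that is not present
print(conciseness, missing)
```

With a DataFrame of results, the same helper could be applied to the log column, for example via res_custom_eval_df["log"].apply(lambda log: get_custom_score(log, "Conciseness")).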
Let’s combine the original data with these results to see how well our conciseness criterion scored the responses.
custom_eval_combined = pd.concat([data, res_custom_eval_df], axis=1)
custom_eval_sorted = custom_eval_combined.sort_values("custom_score")
Below, we view the 2 examples with the lowest conciseness scores across our dataset, and see that the LLM responses are indeed very verbose. In the first example, TLM’s trustworthiness score did not actually reflect the verbosity of the response.
This showcases how specifying custom evaluation criteria can help in use-cases where TLM’s general trustworthiness scores do not reflect your notion of good-vs-bad quality responses.
custom_eval_sorted.head(2)[["question", "answer", "trustworthiness_score", "custom_score"]]
| | question | answer | trustworthiness_score | custom_score |
---|---|---|---|---|
14 | For which movie did Katharine Hepburn win her second Oscar? | Katharine Hepburn won her second Academy Award for her role in the 1967 film 'Guess Who's Coming to Dinner'. In this acclaimed film, she portrayed Christina Drayton, a mother grappling with her daughter's decision to marry a Black man. Hepburn’s poignant and emotionally charged performance earned her critical praise and solidified her place as one of Hollywood's all-time great actresses. The film, which addresses issues of race and social change, was significant in its time, and Hepburn's portrayal is a key element of its impact. | 0.711556 | 0.002532 |
29 | What was the name of Michael Jackson's autobiography written in 1988? | The title of Michael Jackson's autobiography, written and published in 1988, is 'Moonwalk'. In this deeply personal and revealing book, Jackson reflects on his extraordinary life, starting from his early days as a child prodigy with the Jackson 5 to his unprecedented rise as a global pop sensation. The book delves into his experiences with fame, his artistic inspirations, and the personal challenges he faced throughout his career, offering readers an intimate glimpse into the mind of the King of Pop. | 0.664933 | 0.002571 |
Inspecting responses with the highest conciseness scores in our dataset, we see that these are indeed nice and concise.
custom_eval_sorted.tail(2)[["question", "answer", "trustworthiness_score", "custom_score"]]
| | question | answer | trustworthiness_score | custom_score |
---|---|---|---|---|
34 | Marc Dutroux hit the headlines over a 'house of horrors' in which country? | Marc Dutroux hit the headlines over a 'house of horrors' in Belgium. | 0.947773 | 0.997512 |
9 | The Yalu river forms a sort of natural border between China and which of its neighbours? | The Yalu river forms a natural border between China and North Korea. | 0.629239 | 0.997512 |
Tips for writing custom evaluation criteria
You can further iterate on how you write out the custom evaluation criteria, in order to better align the custom evaluation scores with what you consider good-vs-bad quality responses. Tips include: define clear/objective criteria for determining answer quality and avoid subjective language. Consider including: examples of good vs bad responses, and edge-case guidelines – such as whether things like flowery-language or answer-refusal are considered good or bad. Qualitatively describe measurable aspects of the response without describing numerical scoring mechanisms, as TLM already has an internal scoring system based on your qualitative description.
An example custom criterion to evaluate how well a RAG system abstains from responding might be:
Determine whether the response is accurate based on the provided context. If the context does not provide information needed to answer the question, the response should state ‘Unable to respond based on available information’.
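As a sketch of how this criteria text could be plugged in, the dictionary below follows the same options structure used for the Conciseness example earlier. The criterion name "Abstention" and the variable names are hypothetical choices made for illustration.

```python
# Hypothetical criterion name; the criteria text is the RAG example given above.
rag_abstention_option = {
    "custom_eval_criteria": [
        {
            "name": "Abstention",
            "criteria": (
                "Determine whether the response is accurate based on the provided context. "
                "If the context does not provide information needed to answer the question, "
                "the response should state 'Unable to respond based on available information'."
            ),
        }
    ]
}

# Assumed usage, mirroring the Conciseness example earlier in this tutorial:
# tlm_rag = studio.TLM(options=rag_abstention_option)
print(rag_abstention_option["custom_eval_criteria"][0]["name"])
```

The criteria text stays qualitative and states the expected refusal wording explicitly, in line with the tips above.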
Calibrating TLM scores against human quality ratings
TLM’s native trustworthiness scoring and custom evaluation criteria each have their strengths. By combining them, we can benefit from TLM’s general trustworthiness score while tailoring specific aspects of the evaluation to our specific criteria. None of the above approaches requires human-rating annotations; so far we have used those only to quantify alignment between the scores and human ratings.
Here we showcase how TLMCalibrated can combine and calibrate these values into a score that is explicitly aligned against numeric/binary human-rating (i.e. ground-truth) quality annotations. While we here integrate the custom evaluation and trustworthiness scores into a single calibrated score, you can also run the same procedure to calibrate just one of these scores without the other.
First, we instantiate the TLMCalibrated object, passing the same custom evaluation criteria used previously in the options argument.
tlm_calibrated = studio.TLMCalibrated(options=custom_eval_criteria_option)
Next, we fit this TLMCalibrated object to a dataset composed of the previously obtained TLM scores and human quality ratings. This trains the model to produce better-aligned scores over this dataset.
We can pass res_custom_eval (the list of dictionaries returned from the previous get_trustworthiness_score() call) directly into fit().
tlm_calibrated.fit(res_custom_eval, data["human_rating"])
After fitting TLMCalibrated to our human-annotated dataset, we can call its get_trustworthiness_score() method. This returns not only the previous TLM trustworthiness score and custom conciseness score, but also an additional calibrated score that optimally combines those two scores to align with human quality ratings.
calibrated_res = tlm_calibrated.get_trustworthiness_score(data["improved_prompt"].tolist(), data["answer"].tolist())
calibrated_res_df = pd.DataFrame(calibrated_res)
calibrated_res_df.head()
| | trustworthiness_score | log | calibrated_score |
---|---|---|---|
0 | 0.650464 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.006961743549913892}]} | 0.026667 |
1 | 0.987435 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.9975124352663575}]} | 1.000000 |
2 | 0.935303 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.2512494864758131}]} | 0.669535 |
3 | 0.183738 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.2500971975237057}]} | 0.010000 |
4 | 0.941002 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.25337718256654873}]} | 0.914807 |
We again measure alignment between the calibrated scores and our human ratings via their Spearman correlation, which here is the highest among all scoring methods. Thus whenever you have a moderately-sized set of human quality ratings, we recommend considering TLMCalibrated to produce a tailored evaluation score for LLM responses in your use-case.
print("Spearman correlation:", spearmanr(calibrated_res_df["calibrated_score"], data["human_rating"]).statistic)
Using TLMCalibrated to get improved scores on new data
You can use the already-fitted TLMCalibrated model to score new responses via the same get_trustworthiness_score() method. Here we demonstrate this with some additional data.
new_data = pd.read_csv("triviaqa_without_ratings.csv")
new_data["prompt"] = new_data["question"].apply(create_improved_prompt)
We call get_trustworthiness_score() on the new (prompt, response) pairs, just as we did before.
new_data_res = tlm_calibrated.get_trustworthiness_score(new_data["prompt"].tolist(), new_data["answer"].tolist())
new_data_res_df = pd.DataFrame(new_data_res)
new_data_res_df.head(2)
| | trustworthiness_score | log | calibrated_score |
---|---|---|---|
0 | 0.965530 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.997512434971306}]} | 0.924324 |
1 | 0.705995 | {'custom_eval_criteria': [{'name': 'Conciseness', 'score': 0.0174344915132037}]} | 0.030667 |
The returned results contain the TLM trustworthiness score and log (like before), but also the calibrated_score to use for evaluating these additional responses (in real-time).
new_data_combined = pd.concat([new_data, new_data_res_df], axis=1)
new_data_combined.head(2)[["question", "answer", "calibrated_score"]]
| | question | answer | calibrated_score |
---|---|---|---|
0 | In 1976, which gymnast scored 7 maximum scores of 10 as she won three gold medals, one silver and one bronze? | Nadia Comaneci | 0.924324 |
1 | What is the Milky Way? | The Milky Way is a barred spiral galaxy, one of the billions of galaxies in the observable universe, and it is our home galaxy. As part of the Local Group of galaxies, it exists alongside other notable galaxies like Andromeda and Triangulum, as well as about 54 smaller satellite galaxies. The Milky Way contains between 100 and 400 billion stars, including our Sun, and the dense band of light seen from Earth at night is caused by the multitude of stars and celestial objects aligned along the galactic plane. It is named after the appearance of this bright band of light, which stretches across the night sky. | 0.030667 |
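One way such scores might be used at serving time is to flag responses whose calibrated score falls below a cutoff. The sketch below assumes a threshold of 0.5, which is an arbitrary illustrative value and not a recommendation from this tutorial; the helper name flag_low_scores is also hypothetical. The two example scores are the calibrated scores from the table above.

```python
def flag_low_scores(records, threshold=0.5, score_key="calibrated_score"):
    """Return the records whose score falls strictly below the threshold."""
    return [record for record in records if record[score_key] < threshold]


# Calibrated scores for rows 0 and 1, copied from the table above.
example_results = [
    {"row": 0, "calibrated_score": 0.924324},
    {"row": 1, "calibrated_score": 0.030667},
]
flagged = flag_low_scores(example_results)  # threshold of 0.5 is an assumption
print("Flagged rows:", [record["row"] for record in flagged])
```

A suitable threshold depends on your tolerance for passing poor responses versus flagging good ones, and is best chosen by inspecting scores on examples with known human ratings.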
Further scoring improvements
At this point, you have specified custom evaluation criteria and combined them with TLM’s native trustworthiness score to produce calibrated scores via TLMCalibrated. To further align these scores against your human quality ratings, you can tinker with the prompts used in various arguments of TLMCalibrated.
Try jointly improving the text used for both the prompt argument and the custom evaluation criteria. Iterate through many alternative versions of this text, trying to maximize the Spearman correlation between the calibrated scores from TLMCalibrated and the human quality ratings.
Tips on tuning both of these include:
- The prompt for the base TLM is best suited to general correctness evaluation, but you can still state specific guidelines for it to follow.
- For example in RAG use cases: if the correct response to a question is “The given context does not contain any relevant information to answer the question” (because the retrieved context lacks the necessary details), you should explicitly specify this in the prompt. This ensures that TLM recognizes such responses as correct and does not score them low simply because they do not provide a direct answer.
- The custom evaluation criteria is better suited for domain-specific needs beyond a general correctness check. You can specify these custom evaluation criteria to complement the general trustworthiness scoring that the base TLM provides.
By systematically adjusting both the prompt and evaluation criteria, you can improve TLMCalibrated scores to suit your specific use-case.
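The iteration described in this section can be organized as a simple comparison loop. The sketch below is one possible arrangement under stated assumptions: the function names are hypothetical, the scores are toy numbers rather than real TLM outputs, and pairwise agreement (the fraction of good/bad pairs in which the better-rated response scores higher) is used as a dependency-free stand-in for the Spearman correlation used elsewhere in this tutorial.

```python
def pairwise_agreement(scores, ratings):
    """Fraction of (higher-rated, lower-rated) pairs where the higher-rated response also scores higher."""
    agree, total = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ratings[i] > ratings[j]:
                total += 1
                if scores[i] > scores[j]:
                    agree += 1.0
                elif scores[i] == scores[j]:
                    agree += 0.5  # ties count as half agreement
    return agree / total if total else float("nan")


def pick_best_variant(variant_scores, ratings, metric=pairwise_agreement):
    """Return (name, metric value) for the variant whose scores align best with the ratings."""
    results = {name: metric(scores, ratings) for name, scores in variant_scores.items()}
    best_name = max(results, key=results.get)
    return best_name, results[best_name]


# Toy numbers for illustration only. In practice, each list would hold the scores
# obtained from one run using that particular prompt / criteria wording.
toy_ratings = [1, 0, 1, 0]
toy_variant_scores = {
    "variant_a": [0.9, 0.8, 0.4, 0.3],
    "variant_b": [0.9, 0.2, 0.7, 0.4],
}
best_name, best_value = pick_best_variant(toy_variant_scores, toy_ratings)
print(best_name, best_value)
```

To rank variants by Spearman correlation instead, a function such as lambda s, r: spearmanr(s, r).statistic could be passed as the metric argument.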