Skip to main content

Detect Low-Quality Data in any Instruction Tuning Dataset with Cleanlab Studio and the Trustworthy Language Model

Run in Google ColabRun in Google Colab

Data quality is paramount in instruction tuning (aka. supervised fine-tuning, alignment, sequence-to-sequence modeling), a popular method to improve the performance of pre-trained Language Models (LLMs) for specific tasks. Low-quality examples lurking in the dataset hamper LLM instruction tuning, resulting in poor performance. Such bad data is prevalent in real-world datasets and hard to catch manually.

Using Cleanlab Studio’s Python API together with the Trustworthy Language Model (TLM), this tutorial demonstrates how to automatically catch: low-quality responses, incomplete/vague prompts, and other problematic text (toxic language, PII, informal writing, bad grammar/spelling) lurking in any instruction-response dataset.

For optimal LLM fine-tuning or in-context (few-shot) learning, addressing such low-quality data is essential. After identifying bad data points, you can remove them from your dataset or manually improve flagged responses, in order to improve the quality of your dataset and resulting AI model. Before this tutorial, we recommend completing the TLM quickstart tutorial.

Detected issues in databricks-dolly-15k

Setup

Make sure you have a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link.

The Python client can be installed using pip:

%pip install cleanlab-studio

Once installed, let’s load this package and other dependencies.

from cleanlab_studio import Studio

import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

Also download the instruction-tuning dataset used in this tutorial.

wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/databricks-dolly-mini.jsonl'

We then initalize our Cleanlab client.

# Get API key from here: https://app.cleanlab.ai/account after creating an account.
studio = Studio("<API Key>")
tlm = studio.TLM()

Dataset

This tutorial considers a subset of the famous databricks-dolly-15k dataset, which was used to fine-tune the Dolly 2.0 LLM and many other popular models.

Like other instruction-tuning datasets, this data is composed of instruction-response pairs, where the responses were manually written/curated by a large team of 5000+ human data annotators.

df = pd.read_json("databricks-dolly-mini.jsonl", lines=True)
df.head()
instruction context response category
0 What are the official languages of Canada? The official languages of Canada are English and French. open_qa
1 Classify each of the following plates as either food or a cocktail: sushi, mojito, quiche, tacos, moscow mule, old fashioned, pizza, pasta, sandwich, pho, margarita, martini, ramen From the list that has been provided, the following items to belong to the food category: sushi quiche, tacos, pizza, pasta, sandwich, pho and ramen. The rest of the items can be categorized as cocktails: mojito, moscow mule, old fashioned, margarita and martini. classification
2 How do I see how large a dataset is in Spark? You can find the number of records in a DataFrame or RDD using the `count()` method. If you want to see how large it is in terms of bytes, you can find this information on the Spark UI. general_qa
3 Extract the locations where Nachum lived from the text below and list them in alphabetical order and separated by a semicolon. Nachum Gutman was born in Teleneşti, Bessarabia Governorate, then a part of the Russian Empire (now in the Republic of Moldova). He was the fourth child of Simha Alter and Rivka Gutman. His father was a Hebrew writer and educator who wrote under the pen name S. Ben Zion. In 1903, the family moved to Odessa, and two years later, to Ottoman Palestine. In 1908, Gutman attended the Herzliya Gymnasium in what would later become Tel Aviv. In 1912, he studied at the Bezalel School in Jerusalem. In 1920–26, he studied art in Vienna, Berlin and Paris.\n\nGutman was married to Dora, with whom he had a son. After Gutman's death in 1980, Dora asked two Tel Aviv gallery owners, Meir Stern of Stern Gallery and Miriam Tawin of Shulamit Gallery, to appraise the value all of the works left in his estate. Berlin;Jerusalem;Odessa;Palestine;Paris;Tel Aviv;Vienna information_extraction
4 What is a proof that there are two irrational numbers where raising one to the power of the other produces a rational number? Here is a proof by cases that there exist two irrational numbers a and b such that a^b is rational. First, consider a = b = sqrt(2). Recall that sqrt(2) is irrational, so If a^b = sqrt(2) ^ sqrt(2) is rational, we are done because we have found irrational a and b with a^b being a rational number. Otherwise, sqrt(2) ^ sqrt(2) is irrational. Let a = sqrt(2) ^ sqrt(2) and b = sqrt(2). Then a ^ b = (sqrt(2) ^ sqrt(2)) ^ sqrt(2) = sqrt(2) ^ (sqrt(2) * sqrt(2)) = sqrt(2) ^ 2 = 2, which is rational. So either way, we can find irrational a and b with a^b being rational. creative_writing

Some examples in this dataset come with additional context, which we’ll merge with the instruction to form a single prompt input for each row.

# Define a 'prompt' column by prepending the context to the instruction
def format_row(row):
if pd.notnull(row["context"]) and row["context"] != "":
return f"Context: {row['context']} The question is: {row['instruction']}"
else:
return row["instruction"]


df["prompt"] = df.apply(format_row, axis=1)

While our dataset here is composed of prompt/response pairs, the ideas presented in this tutorial can be used to automatically catch bad data in any text dataset composed of (input, output) pairs.

Using TLM to estimating the quality of prompt-response pairs and catch bad data

To detect low-quality (input, output) pairs in our dataset, we can score the quality of each response via the trustworthiness score estimated by Cleanlab’s Trustworthy Language Model (TLM).

To run this method over a dataset, we recommend using try_get_trustworthiness_score() instead of get_trustworthiness_score() because the former method will save partial results in the case where some examples fail during processing.

tlm = studio.TLM()  # see the TLM quickstart tutorial for configuration options

results = df.drop("prompt", axis=1).copy(deep=True)

trustworthiness_scores = tlm.try_get_trustworthiness_score(df["prompt"].to_list(), df["response"].to_list())
results["trustworthiness_score"] = trustworthiness_scores
    Querying TLM... 100%|██████████|

To see which responses in the dataset are least trustworthy (i.e. low quality), we sort the data by the computed trustworthiness scores.

results.sort_values(by="trustworthiness_score").head(10)
instruction context response category trustworthiness_score
13 When does the world celebrates Hindi diwas every year On January 10 open_qa 0.044324
17 Tell me whether these states are in Eastern Time, Central Time, or neither California, Oregon, Connecticut, Washington, Texas, Hawaii classification 0.093878
12 What's the capital of Brazil? The capital of Brazil is Rio de Janeiro. open_qa 0.115947
11 Who invented the airplane? The airplane was invented by Santos Dumont, a Brazilian inventor. He flew for the first time in France, in the year of 1906. open_qa 0.121684
16 What magazine was Alex Prager the director of Photography for? In 2011 Kathy Ryan, director of photography for The New York Times Magazine commissioned Prager to shoot 12, 1 minute films with some film actors from that year, inspired by "cinematic villainy". Prager won a News and Documentary Emmy Award for New Approaches to News & Documentary Programming: Arts, Lifestyle & Culture for her Touch of Evil short films. Alex Prager was the director of photography for the New York Times Magazine. closed_qa 0.184048
15 Which notable person having surname Liebers in this list was born first? Liebers is a German language surname. Notable people with the name include:\n\nMario Liebers (born 1960), German former competitive figure skater\nMartin Liebers (born 1985), German former competitive figure skater\nMatthias Liebers (born 1958), former German footballer\nPeter Liebers (born 1988), German former figure skater. Mario Liebers information_extraction 0.207657
19 classify the following as mammals vs reptiles:\ngoat\nsnake\ncockroach\npig\ndog\ncaterpillar\nlion mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar classification 0.210273
33 Why are dogs better than cats for a person's physical health? Dogs keep their owners more active by needing to go for walks/runs, socialization, getting into garbage so the owner never fully gets to relax, etc. Cats just chill and mind their business, and do not require walks or park time. open_qa 0.275223
18 Where is Babylon remain located? Irag open_qa 0.384999
14 Which of these are dog and which are birds: Phoenix, Siberian Husky, Alaskan Malamute, Hawk, Border Collie, Labrador, Araucana Birds: Phoenix, Siberian Husky, Alaskan Malamute, Border Collie, Labrador\nDogs: Siberian Husky, Alaskan Malamute, Border Collie, Labrador classification 0.425704

The human written prompt-response pairs with low trustworthiness appear worse quality. Reviewing the results in detail, we find a variety of issues among these lowest-scoring datapoints: factually inaccurate responses, truncated/vague prompts, inaccurate information extraction given context, and spelling errors. Conversely, the responses in the dataset that received the highest TLM trustworthiness scores below provide a direct and accurate answer to the instruction.

results.sort_values(by="trustworthiness_score").tail()
instruction context response category trustworthiness_score
7 What is the square root of 25? The square root of 25 is 5. open_qa 0.949566
47 Which is a species of fish? Snook or Sneak Snook classification 0.954673
0 What are the official languages of Canada? The official languages of Canada are English and French. open_qa 0.981853
10 Where did Danny Dayton graduated from? Danny Dayton (born Daniel David Segall, November 20, 1923 – February 6, 1999) was an American actor and television director. Beginning in the 1950s, he played many roles in film and on TV. He had a recurring role as Hank Pivnik on All in the Family and had guest roles in M*A*S*H, The Incredible Hulk, Charlie's Angels and The Love Boat.\n\nDayton graduated from the school of journalism at New York University, before turning to acting. In addition to acting, Dayton also directed episodes of Occasional Wife, Here's Lucy, and the short-lived series Good Morning World.\n\nOn February 6, 1999, Dayton died of emphysema in Los Angeles, California. Danny Dayton graduated from the school of journalism at New York University. summarization 0.990145
6 When did Florida become a state? Florida became a state on March 3, 1845. open_qa 0.999644

Detecting Additional Text Issues with Cleanlab Studio

The TLM trustworthiness scores offer one way to automatically detect bad data, but Cleanlab offfers other automated techniques you can use as well.

In this section, we demonstrate how the Cleanlab Studio platform can programmatically generate smart metadata for any text dataset. This metadata (returned as many different Cleanlab Columns) helps you discover all sorts of additional problems in your dataset and understand their severity.

We first load our dataset into Cleanlab Studio.

dataset_id = studio.upload_dataset(results, dataset_name="dolly-mini")
print(f"Dataset ID: {dataset_id}")

We can use the dataset_id to launch a Cleanlab Studio Project. Whereas the TLM is a pre-trained Foundation model that assesses your data based on existing world knowledge, a Cleanlab Studio Project automatically trains ML models on your dataset to learn its statistical properties and provide more tailored analysis.

project_id = studio.create_project(
dataset_id=dataset_id,
project_name="dolly-mini-text-issues",
modality="text",
task_type="unsupervised", # consider a different task_type if you have additional category labels or tags
model_type="regular",
label_column=None,
text_column="response", # we specifically audit the response column text in this dataset
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

The Project will take some time to run (you’ll receive an email when it is complete). The next code cell simply waits until the Project results are ready.

Warning: This next cell may take a long time to execute for big datasets. If your notebook times out, do not create another Project – just re-execute the following cell which will fetch the necessary information from the already-completed Project.

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
studio.wait_until_cleanset_ready(cleanset_id)

Once the Project results are ready, we fetch the generated Cleanlab columns that contain smart metadata about each data point in our dataset, like what types of issues it exhibits (PII, toxic, non english, informal writing, etc). Each issue type comes with corresponding severity scores that indicate how badly each data point exhibits this issue. Note that this Cleanlab Studio Project focused exclusively on the text in the response column of our dataset, because it contains the text used that a LLM would be trained to reproduce during supervised fine-tuning.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
combined_dataset_df = results.merge(cleanlab_columns_df, left_index=True, right_index=True)
combined_dataset_df.head(2)
instruction context response category trustworthiness_score cleanlab_row_ID is_empty_text text_num_characters is_PII PII_score PII_types PII_items is_informal informal_score spelling_issue_score grammar_issue_score slang_issue_score is_non_english non_english_score predicted_language is_toxic toxic_score sentiment_score bias_score is_biased gender_bias_score racial_bias_score sexual_orientation_bias_score
0 What are the official languages of Canada? The official languages of Canada are English and French. open_qa 0.981853 1 False 56 False 0.0 [] [] False 0.010022 0.0 0.017135 0.007921 False 0.046236 <NA> False 0.281250 0.729675 0.822070 True 0.82207 0.406006 1.647949e-01
1 Classify each of the following plates as either food or a cocktail: sushi, mojito, quiche, tacos, moscow mule, old fashioned, pizza, pasta, sandwich, pho, margarita, martini, ramen From the list that has been provided, the following items to belong to the food category: sushi quiche, tacos, pizza, pasta, sandwich, pho and ramen. The rest of the items can be categorized as cocktails: mojito, moscow mule, old fashioned, margarita and martini. classification 0.707636 2 False 263 False 0.0 [] [] False 0.321658 0.0 0.623752 0.180392 False 0.013322 <NA> False 0.333008 0.694214 0.335205 False 0.00000 0.335205 5.960000e-08

Toxic Language

Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms and applications. Let’s see what toxic language Cleanlab automatically detected in this dataset.

toxic_samples = combined_dataset_df.query("is_toxic").sort_values(
"toxic_score", ascending=False
)

columns_to_display = ["cleanlab_row_ID", "response", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])
cleanlab_row_ID response toxic_score is_toxic
20 21 People that pour milk before cereal??? People that pour MILK before CEREAL???\n\nAre you out of your mind?? When you pour milk before cereal you not only do a disservice to yourself but you are fighting tooth and nail to collectively drag the human race ten steps back to the stone age. You are morally bankrupt and your soul is in ANGUISH.\n\nYou absolute mongrels. You blithering neanderthals. The blatant disregard for human life is simply sickening. People that pour milk before cereal keep the company of villains and thieves. They are naught but charlatans and grifters, scum of the earth that would sell their own mothers up the river for a simple bowl of Wheaties (Wheaties?? Really?? Grow up.).\n\nWhen the good lord comes back to take us to the great Frosted Flakes bowl in the sky, you will not be at his side. You will wander this scorched earth in an eternity of agony and suffering for your hubris. You will beg, BEG I say for a drop of the sweet milk you so callously poured out and at that time I will look down at you from above with nothing but contempt. 0.862305 True

Personally Identifiable Information (PII)

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded, removed from publicly shared data, and not generated by fine-tuned LLMs. Let’s see what PII text Cleanlab automatically detected in this dataset.

PII_samples = combined_dataset_df.query("is_PII").sort_values(
"PII_score", ascending=False
)

columns_to_display = [
"cleanlab_row_ID",
"response",
"PII_score",
"is_PII",
"PII_types",
"PII_items",
]
display(PII_samples.head(5)[columns_to_display])
cleanlab_row_ID response PII_score is_PII PII_types PII_items
22 23 Hi Delta Airline,\n\nI am Ao Ni, I send this email regarding a cancelled flight of mine and to request documents to claim my travel insurance. \n\nI just received an email from you that my flight from Las Vegas to New York today at 10 PM was canceled due to mechanical issues. Even though Delta Airline will refund me the cost of flight ticket, the last minute ticket for tomorrow's flight is extremely expensive, almost double the normal price. I have to get back to New York by tomorrow to attend a very important meeting so I have to pay for the ticket, plus one night stay at a hotel for tonight. \n\nWill you reimburse my hotel and ticket? If not, I will have to claim it through my travel insurance and I need two documents to support my case. First, the confirmation of my order. Second, I need a letter from you to state the reason of the cancellation and that Delta Airlines is not able to reimburse my loss. \n\nI understand that bad things happen sometimes but I am still annoyed by the last minute notice. You should have noticed me earlier so that I could have had more time to deal with it. \n\nPlease let me know if you need anything from me. You can reach me at 123-456-7890 or my email 123@gmail.com. 0.5 True ["email"] ["123@gmail.com"]
23 24 Following the {first_initial}{last_name}@somecompany.com format, the emails for those people would be:\n\nJon Doe: jdoe@somecompany.com\nRichard Smith: rsmith@somecompany.com\nTom Jenkins: tjenkins@somecompany.com\nNick Parsons: nparsons@somecompany.com 0.5 True ["Uncategorized PII", "email"] ["first_initial}{last_name}@someco", "jdoe@somecompany.com", "rsmith@somecompany.com", "tjenkins@somecompany.com", "nparsons@somecompany.com"]

Non-English Text

Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to find and address if we want to ensure the text fields in our data and fine-tuned LLM outputs are understandable. Let’s see what non-English text Cleanlab automatically detected in our dataset.

non_english_samples = combined_dataset_df.query("is_non_english").sort_values(
"non_english_score", ascending=False
)

columns_to_display = [
"cleanlab_row_ID",
"response",
"non_english_score",
"is_non_english",
"predicted_language",
]
display(non_english_samples.head(5)[columns_to_display])
cleanlab_row_ID response non_english_score is_non_english predicted_language
5 6 Yusof bin Ishak 0.881498 True <NA>
18 19 Irag 0.871653 True <NA>
15 16 Mario Liebers 0.849844 True German
25 26 De dag dat Jezus gekruisigd werd 0.849697 True Dutch

Informal Language

Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Such informal text should be omitted from fine-tuning data if we want our LLM to produce professional sounding responses. Let’s see what informal text Cleanlab automatically detected in our dataset.

informal_samples = combined_dataset_df.query("is_informal").sort_values(
"informal_score", ascending=False
)

columns_to_display = ["cleanlab_row_ID", "response", "informal_score", "spelling_issue_score", "grammar_issue_score", "slang_issue_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])
cleanlab_row_ID response informal_score spelling_issue_score grammar_issue_score slang_issue_score is_informal
18 19 Irag 0.675657 1.0 0.43651 0.752632 True

Single data quality score

Thus far, we showed how to automatically detect various types of problems in any instruction tuning dataset. Here we show to get a single quality score for each request-response example that combines Cleanlab’s confidence about how good the response for the given request is together with other issue scores based on the response text alone. We achieve this via a weighted geometric average of the scores, where the weights are predetermined based on how important we think each issue type is.

weights = {  # change these based on your needs
"trustworthiness": 0.2,
"toxic": 0.2,
"PII": 0.2,
"non_english": 0.2,
"informal": 0.2,
}
def compute_aggregate_scores(cleanset_df, issue_weight):
EPS = 1e-2
cleanset_issue_types = ["toxic", "PII", "non_english", "informal"]
inverse_confidence_scores = 1 - cleanset_df["trustworthiness_score"]
aggregate_scores = inverse_confidence_scores * issue_weight["trustworthiness"]
for issue_type in cleanset_issue_types:
issue_scores = cleanset_df[issue_type + "_score"]
issue_examples = cleanset_df["is_" + issue_type]
issue_contributions = issue_scores * np.clip(issue_examples, EPS, 1 - EPS)
aggregate_scores += issue_weight[issue_type] * issue_contributions

cleanset_df.insert(
3, "cleanlab_score", 1 - aggregate_scores
) # low values = bad data
return cleanset_df
cleanlab_df = compute_aggregate_scores(
cleanset_df=combined_dataset_df, issue_weight=weights
)
columns_to_display = [
"context",
"instruction",
"response",
"cleanlab_score",
"trustworthiness_score",
"is_toxic",
"is_PII",
"is_non_english",
"is_informal",
]

cleanlab_df.sort_values("cleanlab_score", ascending=False)[
columns_to_display
].head() # high values = good data
context instruction response cleanlab_score trustworthiness_score is_toxic is_PII is_non_english is_informal
6 When did Florida become a state? Florida became a state on March 3, 1845. 0.999384 0.999644 False False False False
10 Danny Dayton (born Daniel David Segall, November 20, 1923 – February 6, 1999) was an American actor and television director. Beginning in the 1950s, he played many roles in film and on TV. He had a recurring role as Hank Pivnik on All in the Family and had guest roles in M*A*S*H, The Incredible Hulk, Charlie's Angels and The Love Boat.\n\nDayton graduated from the school of journalism at New York University, before turning to acting. In addition to acting, Dayton also directed episodes of Occasional Wife, Here's Lucy, and the short-lived series Good Morning World.\n\nOn February 6, 1999, Dayton died of emphysema in Los Angeles, California. Where did Danny Dayton graduated from? Danny Dayton graduated from the school of journalism at New York University. 0.997383 0.990145 False False False False
0 What are the official languages of Canada? The official languages of Canada are English and French. 0.995696 0.981853 False False False False
7 What is the square root of 25? The square root of 25 is 5. 0.98946 0.949566 False False False False
26 What year was Genghis Khan born? Genghis Khan was born in the year 1162. 0.98877 0.947948 False False False False
cleanlab_df.sort_values("cleanlab_score")[columns_to_display].head(
15
) # low values = bad data
context instruction response cleanlab_score trustworthiness_score is_toxic is_PII is_non_english is_informal
18 Where is Babylon remain located? Irag 0.569586 0.384999 False False True True
15 Liebers is a German language surname. Notable people with the name include:\n\nMario Liebers (born 1960), German former competitive figure skater\nMartin Liebers (born 1985), German former competitive figure skater\nMatthias Liebers (born 1958), former German footballer\nPeter Liebers (born 1988), German former figure skater. Which notable person having surname Liebers in this list was born first? Mario Liebers 0.672213 0.207657 False False True False
20 Write a rant about people that pour milk before cereal. People that pour milk before cereal??? People that pour MILK before CEREAL???\n\nAre you out of your mind?? When you pour milk before cereal you not only do a disservice to yourself but you are fighting tooth and nail to collectively drag the human race ten steps back to the stone age. You are morally bankrupt and your soul is in ANGUISH.\n\nYou absolute mongrels. You blithering neanderthals. The blatant disregard for human life is simply sickening. People that pour milk before cereal keep the company of villains and thieves. They are naught but charlatans and grifters, scum of the earth that would sell their own mothers up the river for a simple bowl of Wheaties (Wheaties?? Really?? Grow up.).\n\nWhen the good lord comes back to take us to the great Frosted Flakes bowl in the sky, you will not be at his side. You will wander this scorched earth in an eternity of agony and suffering for your hubris. You will beg, BEG I say for a drop of the sweet milk you so callously poured out and at that time I will look down at you from above with nothing but contempt. 0.768291 0.700411 True False False False
25 Goede Vrijdag is de vrijdag voor Pasen. Op deze dag herdenken christenen de kruisiging en dood van Jezus. Jezus werd volgens de Bijbel veroordeeld tot de kruisdood door de Romeinse stadhouder Pontius Pilatus, op aandrang van het sanhedrin. Deze straf werd voltrokken op de heuvel Golgotha nabij de stad Jeruzalem. Goede Vrijdag volgt op Witte Donderdag en gaat vooraf aan Stille Zaterdag. Daarop volgt Pasen. Wat is goede vrijdag? De dag dat Jezus gekruisigd werd 0.792483 0.808997 False False True False
5 After being expelled from Malaysia, Singapore became independent as the Republic of Singapore on 9 August 1965, with Lee Kuan Yew and Yusof bin Ishak as the first prime minister and president respectively. In 1967, the country co-founded the Association of Southeast Asian Nations (ASEAN). Race riots broke out once more in 1969. Lee Kuan Yew's emphasis on rapid economic growth, support for business entrepreneurship, and limitations on internal democracy shaped Singapore's policies for the next half-century. Economic growth continued throughout the 1980s, with the unemployment rate falling to 3% and real GDP growth averaging at about 8% up until 1999. During the 1980s, Singapore began to shift towards high-tech industries, such as the wafer fabrication sector, in order to remain competitive as neighbouring countries began manufacturing with cheaper labour. Singapore Changi Airport was opened in 1981 and Singapore Airlines was formed. The Port of Singapore became one of the world's busiest ports and the service and tourism industries also grew immensely during this period. Given this paragraph about Singapore, who was Singapore's first president? Yusof bin Ishak 0.792525 0.839890 False False True False
13 When does the world celebrates Hindi diwas every year On January 10 0.807525 0.044324 False False False False
17 Tell me whether these states are in Eastern Time, Central Time, or neither California, Oregon, Connecticut, Washington, Texas, Hawaii 0.817622 0.093878 False False False False
12 What's the capital of Brazil? The capital of Brazil is Rio de Janeiro. 0.821298 0.115947 False False False False
11 Who invented the airplane? The airplane was invented by Santos Dumont, a Brazilian inventor. He flew for the first time in France, in the year of 1906. 0.823922 0.121684 False False False False
16 In 2011 Kathy Ryan, director of photography for The New York Times Magazine commissioned Prager to shoot 12, 1 minute films with some film actors from that year, inspired by "cinematic villainy". Prager won a News and Documentary Emmy Award for New Approaches to News & Documentary Programming: Arts, Lifestyle & Culture for her Touch of Evil short films. What magazine was Alex Prager the director of Photography for? Alex Prager was the director of photography for the New York Times Magazine. 0.836171 0.184048 False False False False
19 classify the following as mammals vs reptiles:\ngoat\nsnake\ncockroach\npig\ndog\ncaterpillar\nlion mammals: goat, pig, dog, lion\nreptiles: snack, cockroach, caterpillar 0.840591 0.210273 False False False False
33 Why are dogs better than cats for a person's physical health? Dogs keep their owners more active by needing to go for walks/runs, socialization, getting into garbage so the owner never fully gets to relax, etc. Cats just chill and mind their business, and do not require walks or park time. 0.854142 0.275223 False False False False
23 Assume the email format is {first_initial}{last_name}@somecompany.com\n\nWhat would the email addresses be for the following people?\nJon Doe\nRichard Smith\nTom Jenkins\nNick Parsons Following the {first_initial}{last_name}@somecompany.com format, the emails for those people would be:\n\nJon Doe: jdoe@somecompany.com\nRichard Smith: rsmith@somecompany.com\nTom Jenkins: tjenkins@somecompany.com\nNick Parsons: nparsons@somecompany.com 0.865563 0.827946 False True False False
22 Your organized flight has been canceled. Write a letter to the airline company. You should 1) complain about the situation, 2) request what you need from them to make a travel insurance claim, 3) suggest what they should do in future situations. Hi Delta Airline,\n\nI am Ao Ni, I send this email regarding a cancelled flight of mine and to request documents to claim my travel insurance. \n\nI just received an email from you that my flight from Las Vegas to New York today at 10 PM was canceled due to mechanical issues. Even though Delta Airline will refund me the cost of flight ticket, the last minute ticket for tomorrow's flight is extremely expensive, almost double the normal price. I have to get back to New York by tomorrow to attend a very important meeting so I have to pay for the ticket, plus one night stay at a hotel for tonight. \n\nWill you reimburse my hotel and ticket? If not, I will have to claim it through my travel insurance and I need two documents to support my case. First, the confirmation of my order. Second, I need a letter from you to state the reason of the cancellation and that Delta Airlines is not able to reimburse my loss. \n\nI understand that bad things happen sometimes but I am still annoyed by the last minute notice. You should have noticed me earlier so that I could have had more time to deal with it. \n\nPlease let me know if you need anything from me. You can reach me at 123-456-7890 or my email 123@gmail.com. 0.866693 0.832689 False True False False
14 Which of these are dog and which are birds: Phoenix, Siberian Husky, Alaskan Malamute, Hawk, Border Collie, Labrador, Araucana Birds: Phoenix, Siberian Husky, Alaskan Malamute, Border Collie, Labrador\nDogs: Siberian Husky, Alaskan Malamute, Border Collie, Labrador 0.88263 0.425704 False False False False

To get the most reliable model via LLM fine-tuning, first filter out the lowest-quality (prompt, response) pairs from your dataset. If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune any LLM you want to (even though the data curation was based on TLM trustworthiness scores).

The same concepts demonstrated here also help you improve in-context learning (i.e. few-shot prompting), in which you incorporate dataset examples into the prompt of a pre-trained LLM to adapt its behavior for a specific task. Data quality matters greatly for in-context learning, just as in fine-tuning applications.