Detect Low-Quality Data/Responses in any LLM/Human Written Dataset (for better LLM Evals or Instruction-Tuning)
Data quality is paramount in instruction tuning (a.k.a. supervised fine-tuning, alignment, sequence-to-sequence modeling), a popular method to improve the performance of pre-trained Large Language Models (LLMs) on specific tasks. Low-quality examples lurking in the dataset hamper LLM instruction tuning and lead to poor performance. Such bad data is prevalent in real-world datasets and hard to catch manually.
Using Cleanlab Studio’s Python API together with the Trustworthy Language Model (TLM), this tutorial demonstrates how to automatically catch: low-quality responses, incomplete/vague prompts, and other problematic text (toxic language, PII, informal writing, bad grammar/spelling) lurking in any instruction-response dataset.
The same techniques demonstrated here are also helpful for LLM Evals (where the responses in the dataset come from your own LLM rather than being human-written).
Setup
This tutorial requires a TLM API key. Get one here.
Cleanlab’s Python client can be installed using pip:
%pip install cleanlab-studio
Once installed, let’s load this package and other dependencies.
from cleanlab_studio import Studio
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
Also download the instruction-tuning dataset used in this tutorial.
wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/databricks-dolly-mini.jsonl'
We then initialize our Cleanlab client.
studio = Studio("<API Key>") # Get your API key from: https://tlm.cleanlab.ai/
tlm = studio.TLM() # See Advanced Tutorial for optional TLM configurations to get better/faster results
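TLM also accepts optional configurations that trade accuracy against runtime (see the Advanced Tutorial referenced above). The line below is a minimal sketch, assuming the quality_preset argument is available in your version of the client; it is commented out because higher-effort presets run slower and cost more.
# tlm = studio.TLM(quality_preset="best")  # hedged sketch: higher-effort preset, assuming your client version supports it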
Dataset
This tutorial considers a subset of the famous databricks-dolly-15k dataset, which was used to fine-tune the Dolly 2.0 LLM and many other popular models.
Like other instruction-tuning datasets, this data is composed of instruction-response pairs, where the responses were manually written/curated by a large team of 5000+ human data annotators.
df = pd.read_json("databricks-dolly-mini.jsonl", lines=True)
df.head()
Some examples in this dataset come with additional context, which we’ll merge with the instruction to form a single prompt input for each row.
# Define a 'prompt' column by prepending the context to the instruction
def format_row(row):
    if pd.notnull(row["context"]) and row["context"] != "":
        return f"Context: `{row['context']}` The question is: `{row['instruction']}`"
    else:
        return row["instruction"]

df["prompt"] = df.apply(format_row, axis=1)
While our dataset here is composed of prompt/response pairs, the ideas presented in this tutorial can be used to automatically catch bad data in any text dataset composed of (input, output) pairs.
Using TLM to estimate the quality of prompt-response pairs and catch bad data
To detect low-quality (input, output) pairs in our dataset, we can score the quality of each response via the trustworthiness score estimated by Cleanlab’s Trustworthy Language Model (TLM).
To run this method over a dataset, we recommend using try_get_trustworthiness_score() instead of get_trustworthiness_score(), because the former method will save partial results in the case where some examples fail during processing.
tlm = studio.TLM()
results = df.drop("prompt", axis=1).copy(deep=True)
trustworthiness_scores = tlm.try_get_trustworthiness_score(df["prompt"].to_list(), df["response"].to_list())
results["trustworthiness_score"] = [score["trustworthiness_score"] for score in trustworthiness_scores]
To see which responses in the dataset are least trustworthy (i.e. low quality), we sort the data by the computed trustworthiness scores.
results.sort_values(by="trustworthiness_score").head(10)
The human-written prompt-response pairs with low trustworthiness scores appear to be of worse quality. Reviewing the results in detail, we find a variety of issues among these lowest-scoring datapoints: factually inaccurate responses, truncated/vague prompts, inaccurate information extraction given the context, and spelling errors. Conversely, the responses that received the highest TLM trustworthiness scores (shown below) provide a direct and accurate answer to the instruction.
results.sort_values(by="trustworthiness_score").tail()
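Beyond manually reviewing the sorted results, you can flag low-quality pairs programmatically by thresholding the trustworthiness score. The 0.5 cutoff below is an assumed value for illustration; tune it based on how much data you can afford to review.
THRESHOLD = 0.5  # assumed cutoff for illustration -- tune for your dataset
flagged = results[results["trustworthiness_score"] < THRESHOLD]
print(f"Flagged {len(flagged)} of {len(results)} examples for review")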
Detecting Additional Text Issues with Cleanlab Studio
The TLM trustworthiness scores offer one way to automatically detect bad data, but Cleanlab offers other automated techniques you can use as well.
In this section, we demonstrate how the Cleanlab Studio platform can programmatically generate smart metadata for any text dataset. This metadata (returned as many different Cleanlab Columns) helps you discover all sorts of additional problems in your dataset and understand their severity.
The remainder of this tutorial requires a Cleanlab Studio account.
Once you have instantiated a studio client object, we can load our dataset into Cleanlab Studio like this:
dataset_id = studio.upload_dataset(results, dataset_name="dolly-mini")
print(f"Dataset ID: `{dataset_id}`")
We can use the dataset_id to launch a Cleanlab Studio Project. Whereas TLM uses a pre-trained Foundation model that assesses your data based on existing world knowledge, a Cleanlab Studio Project automatically trains ML models on your dataset to learn its statistical properties and provide a more tailored analysis.
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="dolly-mini-text-issues",
    modality="text",
    task_type="unsupervised",  # consider a different task_type if you have additional category labels or tags
    model_type="regular",
    label_column=None,
    text_column="response",  # we specifically audit the response column text in this dataset
)
print(f"Project successfully created and training has begun! project_id: `{project_id}`")
The Project will take some time to run (you’ll receive an email when it is complete). The next code cell simply waits until the Project results are ready.
Warning: This next cell may take a long time to execute for big datasets. If your notebook times out, do not create another Project – just re-execute the following cell which will fetch the necessary information from the already-completed Project.
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: `{cleanset_id}`")
studio.wait_until_cleanset_ready(cleanset_id)
Once the Project results are ready, we fetch the generated Cleanlab columns that contain smart metadata about each data point in our dataset, such as which types of issues it exhibits (PII, toxic, non-English, informal writing, etc.). Each issue type comes with a corresponding severity score that indicates how badly each data point exhibits this issue. Note that this Cleanlab Studio Project focused exclusively on the text in the response column of our dataset, because that is the text an LLM would be trained to reproduce during supervised fine-tuning.
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
combined_dataset_df = results.merge(cleanlab_columns_df, left_index=True, right_index=True)
combined_dataset_df.head(2)
Toxic Language
Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms and applications. Let’s see what toxic language Cleanlab automatically detected in this dataset.
toxic_samples = combined_dataset_df.query("is_toxic").sort_values(
    "toxic_score", ascending=False
)
columns_to_display = ["cleanlab_row_ID", "response", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])
Personally Identifiable Information (PII)
Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded, removed from publicly shared data, and not generated by fine-tuned LLMs. Let’s see what PII text Cleanlab automatically detected in this dataset.
PII_samples = combined_dataset_df.query("is_PII").sort_values(
    "PII_score", ascending=False
)
columns_to_display = [
    "cleanlab_row_ID",
    "response",
    "PII_score",
    "is_PII",
    "PII_types",
    "PII_items",
]
display(PII_samples.head(5)[columns_to_display])
Non-English Text
Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to find and address if we want to ensure the text fields in our data and fine-tuned LLM outputs are understandable. Let’s see what non-English text Cleanlab automatically detected in our dataset.
non_english_samples = combined_dataset_df.query("is_non_english").sort_values(
    "non_english_score", ascending=False
)
columns_to_display = [
    "cleanlab_row_ID",
    "response",
    "non_english_score",
    "is_non_english",
    "predicted_language",
]
display(non_english_samples.head(5)[columns_to_display])
Informal Language
Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Such informal text should be omitted from fine-tuning data if we want our LLM to produce professional sounding responses. Let’s see what informal text Cleanlab automatically detected in our dataset.
informal_samples = combined_dataset_df.query("is_informal").sort_values(
    "informal_score", ascending=False
)
columns_to_display = ["cleanlab_row_ID", "response", "informal_score", "spelling_issue_score", "grammar_issue_score", "slang_issue_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])
Single data quality score
Thus far, we showed how to automatically detect various types of problems in any instruction tuning dataset. Here we show how to get a single quality score for each prompt-response example, which combines Cleanlab’s confidence in how good the response is for the given prompt with the other issue scores computed from the response text alone. We achieve this via a weighted average of the scores, where the weights are predetermined based on how important we consider each issue type.
weights = {  # change these based on your needs
    "trustworthiness": 0.2,
    "toxic": 0.2,
    "PII": 0.2,
    "non_english": 0.2,
    "informal": 0.2,
}
def compute_aggregate_scores(cleanset_df, issue_weight):
    EPS = 1e-2
    cleanset_issue_types = ["toxic", "PII", "non_english", "informal"]
    inverse_confidence_scores = 1 - cleanset_df["trustworthiness_score"]
    aggregate_scores = inverse_confidence_scores * issue_weight["trustworthiness"]
    for issue_type in cleanset_issue_types:
        issue_scores = cleanset_df[issue_type + "_score"]
        issue_examples = cleanset_df["is_" + issue_type]
        issue_contributions = issue_scores * np.clip(issue_examples, EPS, 1 - EPS)
        aggregate_scores += issue_weight[issue_type] * issue_contributions
    cleanset_df.insert(
        3, "cleanlab_score", 1 - aggregate_scores
    )  # low values = bad data
    return cleanset_df
cleanlab_df = compute_aggregate_scores(
    cleanset_df=combined_dataset_df, issue_weight=weights
)
columns_to_display = [
    "context",
    "instruction",
    "response",
    "cleanlab_score",
    "trustworthiness_score",
    "is_toxic",
    "is_PII",
    "is_non_english",
    "is_informal",
]
cleanlab_df.sort_values("cleanlab_score", ascending=False)[
columns_to_display
].head() # high values = good data
cleanlab_df.sort_values("cleanlab_score")[columns_to_display].head(
15
) # low values = bad data
To get the most reliable model via LLM fine-tuning, first filter out the lowest-quality (prompt, response) pairs from your dataset. If you have the time/resources, consider manually correcting those responses flagged as low-quality where you spot obvious room for improvement. This sort of data curation helps you better fine-tune any LLM you choose (even though the curation itself was based on TLM trustworthiness scores).
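For instance, one simple curation step is to drop the lowest-scoring rows before fine-tuning. The bottom-10% cutoff and output filename below are assumptions for illustration; adjust them to your dataset and review budget.
# Hedged sketch: drop the bottom 10% of examples by cleanlab_score (the cutoff fraction is an assumption)
cutoff = cleanlab_df["cleanlab_score"].quantile(0.10)
curated_df = cleanlab_df[cleanlab_df["cleanlab_score"] >= cutoff]
curated_df[["instruction", "context", "response"]].to_json(
    "dolly-mini-curated.jsonl", orient="records", lines=True  # hypothetical output filename
)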
The same concepts demonstrated here also help you improve in-context learning (i.e. few-shot prompting), in which you incorporate dataset examples into the prompt of a pre-trained LLM to adapt its behavior for a specific task. Data quality matters greatly for in-context learning, just as in fine-tuning applications.
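As a minimal sketch of that idea, you could select a few of the highest-scoring examples as few-shot demonstrations. The prompt template below is illustrative, not a prescribed format.
# Hedged sketch: build a few-shot prompt from the highest-scoring examples (template is illustrative)
top_examples = cleanlab_df.sort_values("cleanlab_score", ascending=False).head(3)
demos = "\n\n".join(
    f"Instruction: {row['instruction']}\nResponse: {row['response']}"
    for _, row in top_examples.iterrows()
)
few_shot_prompt = demos + "\n\nInstruction: <your new instruction here>\nResponse:"
print(few_shot_prompt)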