Detecting Issues in Text Datasets
This is the recommended quickstart tutorial for analyzing text datasets via Cleanlab Studio’s Python API.
In this tutorial, we demonstrate the metadata Cleanlab Studio automatically generates for any text classification dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. This entire notebook is run using the cleanlab_studio Python package, so you can audit your datasets programmatically.
Install and import dependencies
Make sure you have wget installed to run this tutorial. You can use pip to install all other required packages as follows:
%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os
from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)
Fetch and view dataset
Fetch the dataset for this tutorial.
mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/banking-text-quickstart-v1.csv -O data/banking-text-quickstart.csv
Here we’ll use a variant of the BANKING77 text dataset. This is a multi-class classification dataset where customer service requests are labeled as belonging to one of K classes (intent categories).
We can view the first few rows of our dataset below:
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/banking-text-quickstart.csv")
data = pd.read_csv(dataset_path)
data.head()
| | text | label |
|---|---|---|
0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. | cancel_transfer |
1 | why is there a fee when i thought there would be no fees? | card_payment_fee_charged |
2 | why can't my beneficiary access my account? | beneficiary_not_allowed |
3 | does it cost extra to send out more than one card? | getting_spare_card |
4 | can i change my pin at an atm? | change_pin |
Dataset Structure
The data used in the tutorial is stored in a standard CSV file containing the following columns:
text,label
<a text example>,<a class label>
"<a text example with quotes, to escape commas as column separators>",<another class label>
...
You can similarly format any other text dataset and run the rest of this tutorial. Details on how to format your dataset can be found in this guide, which also outlines other format options.
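For instance, if your own data currently lives in Python lists, a minimal sketch of writing it out in this format could look like the following (the my_texts and my_labels variables are hypothetical placeholders for your data):
# Hypothetical example: write your own data out in the same text,label CSV format
# (pandas automatically quotes text fields containing commas)
my_texts = ["example request one", "example request two"]
my_labels = ["class_a", "class_b"]
pd.DataFrame({"text": my_texts, "label": my_labels}).to_csv("my-text-dataset.csv", index=False)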
Load dataset into Cleanlab Studio
Now that we have our dataset, let’s load it into Cleanlab Studio and conduct our analysis. Use your API key to instantiate a studio
object, which analyzes your dataset.
from cleanlab_studio import Studio
# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
# initialize studio object
studio = Studio(API_KEY)
Load the data into Cleanlab Studio (more details/options can be found in this guide). This may take a while for big datasets.
dataset_id = studio.upload_dataset(dataset_path, dataset_name="banking-text-quickstart")
print(f"Dataset ID: {dataset_id}")
Launch a Project
A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one.
Note: The label_column and text_column arguments specified below happen to be named label and text in this example. Set these arguments to the names of the columns in your dataset that contain the class labels and the text field, respectively. If you have multiple text columns, please merge them into a single column as demonstrated in our FAQ and sketched below.
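As a quick hedged sketch of such a merge (the column names "subject" and "body" here are hypothetical placeholders; adapt them to your own schema):
# Hypothetical sketch: combine multiple text columns into a single "text" column before uploading
my_df = pd.DataFrame({"subject": ["refund request"], "body": ["i was charged twice"], "label": ["card_payment_fee_charged"]})
my_df["text"] = my_df["subject"].fillna("") + " " + my_df["body"].fillna("")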
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="banking-text-quickstart-project",
modality="text",
task_type="multi-class",
model_type="regular",
label_column="label",
text_column="text",
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
Once the project has been launched successfully and you see your project_id, you can feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.
You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a cleanset, an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this cleanset has been created.
Warning! For big datasets, this next cell may take a long time to execute while Cleanlab’s AI model is training. If your Jupyter notebook has timed out during this process, you can resume work by re-running the below cell (which should return instantly if the project has completed training). Do not create a new project.
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the Cleanlab Studio web interface and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.
Download Cleanlab columns
We can fetch Cleanlab columns that store metadata for this cleanset using its cleanset_id
. These columns have the same length as your original dataset and provide metadata about each individual data point, like what types of issues it exhibits and how severely.
If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id)
with the cleanset_id
printed from the previous cell.
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
| | cleanlab_row_ID | corrected_label | is_label_issue | label_issue_score | suggested_label | suggested_label_confidence_score | is_ambiguous | ambiguous_score | is_well_labeled | is_near_duplicate | ... | non_english_score | predicted_language | is_toxic | toxic_score | sentiment_score | bias_score | is_biased | gender_bias_score | racial_bias_score | sexual_orientation_bias_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | <NA> | False | 0.250695 | <NA> | 0.581173 | False | 0.850463 | True | False | ... | 0.004226 | <NA> | False | 0.101501 | 0.326050 | 0.122986 | False | 0.000000 | 0.122986 | 0.000018 |
1 | 2 | <NA> | False | 0.207495 | <NA> | 0.648213 | False | 0.842312 | True | False | ... | 0.007218 | <NA> | False | 0.184326 | 0.134735 | 0.395068 | False | 0.395068 | 0.232788 | 0.144897 |
2 | 3 | <NA> | False | 0.155370 | <NA> | 0.731616 | False | 0.780281 | True | False | ... | 0.008784 | <NA> | False | 0.061584 | 0.159088 | 0.259082 | False | 0.259082 | 0.093872 | 0.025330 |
3 | 4 | <NA> | False | 0.399957 | <NA> | 0.391870 | False | 0.895436 | False | False | ... | 0.004739 | <NA> | False | 0.181763 | 0.377625 | 0.241504 | False | 0.241504 | 0.131836 | 0.013489 |
4 | 5 | <NA> | False | 0.258581 | <NA> | 0.571155 | False | 0.837926 | True | False | ... | 0.058934 | <NA> | False | 0.115234 | 0.543762 | 0.381641 | False | 0.381641 | 0.112732 | 0.012970 |
5 rows × 38 columns
Review data issues
Details about all of the Cleanlab columns and their meanings can be found in this guide. Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:
- Label issue indicates the given label of this data point is likely wrong. For such data, consider correcting the label to the suggested_label if it seems more appropriate.
- Ambiguous indicates this data point does not clearly belong to any of the classes (e.g. a borderline case). Multiple human annotators might disagree on how to label this data point, so you might consider refining your annotation instructions to clarify how to handle data points like this.
- Outlier indicates this data point is very different from the rest of the data (looks atypical). The presence of outliers may indicate problems in your data sources; consider deleting such data from your dataset if appropriate.
- Near duplicate indicates there are other data points that are (exactly or nearly) identical to this data point. Duplicated data points can have an outsized impact on models/analytics, so consider deleting the extra copies from your dataset if appropriate.
The data points exhibiting each type of issue are indicated with boolean values in the respective is_<issue>
column, and the severity of this issue in each data point is quantified in the respective <issue>_score
column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
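Before diving into individual issues, here is a quick sketch for getting an overview: tally how many data points are flagged for each issue type by summing the boolean is_<issue> columns described above.
# Count how many data points are flagged for each type of issue
issue_flag_columns = [col for col in cleanlab_columns_df.columns if col.startswith("is_")]
print(cleanlab_columns_df[issue_flag_columns].sum().sort_values(ascending=False))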
Let’s go through some of the Cleanlab columns and types of data issues, starting with label issues (i.e. mislabeled data). We first create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).
import warnings
warnings.filterwarnings('ignore')
# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)
# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_index=True)
# Set a "given_label" column to the original label
combined_dataset_df.rename(columns={"label": "given_label"}, inplace=True)
To see which text examples are estimated to be mislabeled, we filter by is_label_issue
. We sort by label_issue_score
to see which of these data points are most likely mislabeled.
samples_ranked_by_label_issue_score = combined_dataset_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)
columns_to_display = ["text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(samples_ranked_by_label_issue_score.head(5)[columns_to_display])
| | text | label_issue_score | is_label_issue | given_label | suggested_label |
|---|---|---|---|---|---|
874 | why am i being charge a fee when using an atm? | 0.857160 | True | card_about_to_expire | card_payment_fee_charged |
988 | i was charged for getting cash. | 0.837559 | True | card_about_to_expire | card_payment_fee_charged |
490 | which currencies can i used to add funds to my account? | 0.792916 | True | cancel_transfer | supported_cards_and_currencies |
77 | how do i find my new pin? | 0.788946 | True | visa_or_mastercard | change_pin |
8 | can i change my pin on holiday? | 0.773169 | True | beneficiary_not_allowed | change_pin |
Note that in each of these examples, the given_label
really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request). Data labeling is an error-prone process and annotators make mistakes! Luckily we can easily correct these data points by just using Cleanlab’s suggested_label
above, which seems like a much more suitable label in most cases.
While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. You can alternatively ignore these boolean is_label_issue
flags and filter the data by thresholding the label_issue_score
yourself (if say you find the default thresholds produce false positives/negatives).
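For instance, a minimal sketch of custom thresholding (the 0.5 cutoff here is an arbitrary illustration, not a recommended default):
# Flag label issues via your own score threshold instead of the default flag
CUSTOM_THRESHOLD = 0.5  # arbitrary illustrative value; tune it based on manual review
custom_flagged = combined_dataset_df["label_issue_score"] > CUSTOM_THRESHOLD
print(f"{custom_flagged.sum()} data points flagged at threshold {CUSTOM_THRESHOLD}")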
Next, let’s look at the ambiguous examples detected in the dataset.
samples_ranked_by_ambiguous_score = combined_dataset_df.query("is_ambiguous").sort_values("ambiguous_score", ascending=False)
columns_to_display = ["text", "ambiguous_score", "is_ambiguous", "given_label", "suggested_label"]
display(samples_ranked_by_ambiguous_score.head(5)[columns_to_display])
| | text | ambiguous_score | is_ambiguous | given_label | suggested_label |
|---|---|---|---|---|---|
652 | i just made a top-up but it shows as pending! i use your service all the time and have never had a problem before. why does it keep showing up as pending? | 0.989361 | True | cancel_transfer | <NA> |
337 | my money didnt go through after i transferred. | 0.985214 | True | beneficiary_not_allowed | lost_or_stolen_phone |
783 | how do i avoid charges in the future | 0.981678 | True | card_payment_fee_charged | <NA> |
898 | hi, one of payment is still coming as pending for which i have already paid by card. i guess it did not processed, could you please check and update me. | 0.975506 | True | lost_or_stolen_phone | <NA> |
841 | the card payment didn't work | 0.974473 | True | change_pin | <NA> |
Next, let’s look at the outliers detected in the dataset.
samples_ranked_by_outlier_score = combined_dataset_df.query("is_outlier").sort_values("outlier_score", ascending=False)
columns_to_display = ["text", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier", "given_label", "suggested_label"]
display(samples_ranked_by_outlier_score.head(5)[columns_to_display])
| | text | outlier_score | is_empty_text | text_num_characters | is_outlier | given_label | suggested_label |
|---|---|---|---|---|---|---|---|
180 | p | 0.989873 | False | 1 | True | getting_spare_card | <NA> |
770 | la trasferenza al mio conto non è stata consentita. | 0.969794 | False | 51 | True | beneficiary_not_allowed | <NA> |
676 | 750 credit score | 0.966779 | False | 16 | True | getting_spare_card | <NA> |
528 | my sc | 0.964187 | False | 5 | True | apple_pay_or_google_pay | <NA> |
755 | 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> | 0.963091 | False | 61 | True | change_pin | <NA> |
Next, let’s look at the near duplicates detected in the dataset.
n_near_duplicate_sets = len(set(combined_dataset_df.loc[combined_dataset_df["near_duplicate_cluster_id"].notna(), "near_duplicate_cluster_id"]))
print(f"There are {n_near_duplicate_sets} sets of near duplicate texts in the dataset.")
Note that the near duplicate data points each have an associated near_duplicate_cluster_id integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point. Near duplicates also include exact duplicates (which have near_duplicate_score = 1).
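For example, here is a small sketch of looking up all near duplicates of one particular data point (we arbitrarily pick the first flagged row):
# Fetch all near duplicates of a single data point via its cluster ID
example_idx = combined_dataset_df.query("is_near_duplicate").index[0]
cluster_id = combined_dataset_df.loc[example_idx, "near_duplicate_cluster_id"]
display(combined_dataset_df[combined_dataset_df["near_duplicate_cluster_id"] == cluster_id][["text", "given_label"]])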
# Sort the combined dataset by near_duplicate_cluster_id
sorted_combined_dataset_df = combined_dataset_df[combined_dataset_df['is_near_duplicate']].sort_values(by="near_duplicate_cluster_id")
columns_to_display = ["text", "near_duplicate_score", 'near_duplicate_cluster_id', "is_near_duplicate", "given_label"]
sorted_combined_dataset_df[columns_to_display]
| | text | near_duplicate_score | near_duplicate_cluster_id | is_near_duplicate | given_label |
|---|---|---|---|---|---|
54 | what happens after my card expires? | 0.894769 | 0 | True | card_about_to_expire |
250 | what will happen after my card expires? | 0.894769 | 0 | True | card_about_to_expire |
127 | i have one other credit card from the us. do you take it? | 0.914963 | 1 | True | supported_cards_and_currencies |
759 | i have one other credit card from the us. do you take that? | 0.914963 | 1 | True | supported_cards_and_currencies |
710 | which type of card will i receive? | 0.978388 | 2 | True | visa_or_mastercard |
197 | what type of card will i receive? | 0.978388 | 2 | True | visa_or_mastercard |
921 | i can't use the app because i got mugged yesterday and they took everything. please help. | 0.909358 | 3 | True | lost_or_stolen_phone |
350 | i can't use the app because i got mugged yesterday and they took everything. i need some help. | 0.909358 | 3 | True | lost_or_stolen_phone |
374 | what do i do if my phone is stolen? | 0.962170 | 4 | True | lost_or_stolen_phone |
965 | what do i do if my phone was stolen? | 0.962170 | 4 | True | lost_or_stolen_phone |
396 | which currencies do you accept for adding money? | 0.975470 | 5 | True | supported_cards_and_currencies |
932 | what currencies do you accept for adding money? | 0.975470 | 5 | True | supported_cards_and_currencies |
549 | will i get visa or mastercard? | 0.884869 | 6 | True | visa_or_mastercard |
423 | would i get a visa or mastercard? | 0.884869 | 6 | True | visa_or_mastercard |
431 | my card is going to expire, what do i do? | 0.917732 | 7 | True | card_about_to_expire |
445 | my card is about to expire, what do i do? | 0.917732 | 7 | True | card_about_to_expire |
596 | can i top up using my apple watch? | 0.879197 | 8 | True | apple_pay_or_google_pay |
721 | can i use my apple watch to top up? | 0.879197 | 8 | True | apple_pay_or_google_pay |
947 | i'd prefer a mastercard. | 0.919696 | 9 | True | visa_or_mastercard |
718 | i would prefer a mastercard. | 0.919696 | 9 | True | visa_or_mastercard |
864 | what to do if my card is about to expire? | 0.945380 | 10 | True | card_about_to_expire |
753 | what do i do if my card is about to expire? | 0.945380 | 10 | True | card_about_to_expire |
804 | which atms allow me to change my pin? | 0.921094 | 11 | True | change_pin |
808 | what atms will allow me to change my pin? | 0.921094 | 11 | True | beneficiary_not_allowed |
813 | i paid with my card and i was charged a fee | 0.929191 | 12 | True | card_payment_fee_charged |
877 | i paid with card and i was charged a fee | 0.929191 | 12 | True | card_payment_fee_charged |
Text issues
Cleanlab Studio can also detect potential problems in any text in your dataset, such as the occurrence of toxic language, personally identifiable information (PII), or nonsensical language (e.g. HTML/XML tags and other random strings contaminating text descriptions). The following Cleanlab columns are specific to the text fields in your dataset (see here for details).
Similar to above, the is_<issue>
column contains boolean values indicating if a text field has been identified to exhibit a particular issue, and the <issue>_score
column contains numeric scores between 0 and 1 indicating the severity of this particular issue (1 indicates the most severe instance of the issue).
Let’s take a closer look at some text issues that have been flagged in our dataset:
Text issue detection is currently only provided for text-modality projects running in regular mode.
Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms, chatbots, or other applications depending on this dataset.
Here are some examples in this dataset detected to contain toxic language:
toxic_samples = combined_dataset_df.query("is_toxic").sort_values("toxic_score", ascending=False)
columns_to_display = ["text", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])
| | text | toxic_score | is_toxic |
|---|---|---|---|
852 | help me change my pin your garbage app is broken, the most pathetic bank and absolute worst customer service ever | 0.851074 | True |
773 | i'm really sick of your stupid requirements, just issue me the damn credit card! | 0.832520 | True |
416 | some f-ing lowlife mugged me, they stole everything including my phone. i can't use your app anymore, what can i do? | 0.825195 | True |
Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded and anonymized/removed if discovered in publicly shared data.
Cleanlab’s PII detection also returns two extra columns, PII_items
and PII_types
, which list the specific PII detected in the text and its type. Possible types of PII that can be detected are detailed in the guide and scored according to how sensitive each type of information is.
Here are some examples of PII detected in the dataset:
PII_samples = combined_dataset_df.query("is_PII").sort_values("PII_score", ascending=False)
columns_to_display = ["text", "PII_score", "is_PII", "PII_types", "PII_items"]
display(PII_samples.head(5)[columns_to_display])
| | text | PII_score | is_PII | PII_types | PII_items |
|---|---|---|---|---|---|
68 | my card number is 4012888888881881 how do I know if it is mastercard or visa? | 1.0 | True | ["credit card"] | ["4012888888881881"] |
235 | i just replaced my phone, do i have to make a new account? my username is gavdlin@gmail.com new phone number is 212-978-1213 | 0.5 | True | ["email", "phone number"] | ["gavdlin@gmail.com", "212-978-1213"] |
485 | i no longer have my phone number +44 20 8356 1167, what should i do? | 0.5 | True | ["phone number"] | ["+44 20 8356 1167"] |
760 | i wish to cancel a transfer sent to judmunz@yahoo.com | 0.5 | True | ["email"] | ["judmunz@yahoo.com"] |
243 | i want to choose a new pin, name on account is alvin weber and dob 2/10/1967 | 0.4 | True | ["date of birth"] | ["2/10/1967"] |
Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to identify and remove in situations where we want to ensure the text fields in our data are understandable (e.g. if they are text descriptions intended to be read).
If a text data point is detected to be non-English, Cleanlab Studio will predict its language in the predicted_language column. If an alternative language cannot be predicted (which can happen when the text contains more than one language, or consists of nonsensical characters), the predicted_language column will contain a null value.
Here are some non-English examples detected in the dataset:
non_english_samples = combined_dataset_df.query("is_non_english").sort_values("non_english_score", ascending=False)
columns_to_display = ["text", "non_english_score", "is_non_english", "predicted_language"]
display(non_english_samples.head(5)[columns_to_display])
| | text | non_english_score | is_non_english | predicted_language |
|---|---|---|---|---|
180 | p | 0.991476 | True | <NA> |
755 | 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> | 0.979175 | True | <NA> |
770 | la trasferenza al mio conto non è stata consentita. | 0.866523 | True | Italian |
220 | qué necesito hacer para cancelar una transacción? | 0.828047 | True | Spanish |
Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Its presence may be noteworthy if you expect the text in your dataset to be well-written.
Here are some examples of informal text detected in the dataset:
informal_samples = combined_dataset_df.query("is_informal").sort_values("informal_score", ascending=False)
columns_to_display = ["text", "informal_score", "spelling_issue_score", "grammar_issue_score", "slang_issue_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])
| | text | informal_score | spelling_issue_score | grammar_issue_score | slang_issue_score | is_informal |
|---|---|---|---|---|---|---|
528 | my sc | 0.701533 | 0.500000 | 0.615330 | 0.888503 | True |
720 | google pay top up not working. | 0.700279 | 0.000000 | 0.881408 | 0.869290 | True |
192 | which atm's am i able to change my pin? | 0.694062 | 0.111111 | 0.811346 | 0.868254 | True |
564 | i do i top up from my apple watch? | 0.674827 | 0.000000 | 0.925408 | 0.761659 | True |
476 | google play top up help? | 0.671472 | 0.000000 | 0.807158 | 0.871522 | True |
Improve the dataset based on the detected issues
Since the results of this analysis appear reasonable, let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets; we caution against blindly copying the actions we perform below.
For data marked as label_issue
, we create a new corrected_label
column, which will be the given label for data without detected label issues, and the suggested_label
for data with detected label issues.
corrected_label = np.where(combined_dataset_df["is_label_issue"],
combined_dataset_df["suggested_label"],
combined_dataset_df["given_label"])
For data marked as outlier or ambiguous, we will simply exclude them from our dataset. Here we create a boolean vector rows_to_exclude
to track which data points will be excluded.
# create an exclude column to keep track of the excluded data
rows_to_exclude = combined_dataset_df["is_outlier"] | combined_dataset_df["is_ambiguous"]
For each set of near duplicates, we only want to keep one of the data points that share a common near_duplicate_cluster_id
(so that the resulting dataset will no longer contain any near duplicates).
near_duplicates_to_exclude = combined_dataset_df['is_near_duplicate'] & combined_dataset_df['near_duplicate_cluster_id'].duplicated(keep='first')
rows_to_exclude |= near_duplicates_to_exclude
Note that we didn’t exclude the data with text issues here, but you might want to in your applications. We can check the total amount of excluded data:
print(f"Excluding {rows_to_exclude.sum()} text examples (out of {len(combined_dataset_df)})")
Finally, let’s actually make a new version of our dataset with these changes.
We craft a new dataframe from the original, applying corrections and exclusions, and then use this dataframe to save the new dataset in a separate CSV file. The new dataset is a CSV file that has the same format as our original dataset – you can use it as a plug-in replacement to get more reliable results in your ML and Analytics pipelines, without any change in your existing modeling code.
new_dataset_filename = "improved_dataset.csv"
# Start from the text column of the original dataset
fixed_dataset = combined_dataset_df[["text"]].copy()
# Add the corrected label column
fixed_dataset["label"] = corrected_label
# Automatically exclude selected rows
fixed_dataset = fixed_dataset[~rows_to_exclude]
# Check if the file exists before saving
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite it, so please delete it first or specify a different new_dataset_filename.")
else:
    # Save the adjusted dataset to a CSV file
    fixed_dataset.to_csv(new_dataset_filename, index=False)
    print(f"Adjusted dataset saved to {new_dataset_filename}")
If you want to curate a text dataset for better LLM fine-tuning, here’s an example.
If you are interested in building AI Assistants connected to your company’s data sources and other Retrieval-Augmented Generation applications, reach out to learn how Cleanlab can help.