Detecting Issues in Text Datasets

This is the recommended quickstart tutorial for analyzing text datasets via Cleanlab Studio’s Python API.

In this tutorial, we demonstrate the metadata Cleanlab Studio automatically generates for any text classification dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. This entire notebook is run using the cleanlab_studio Python package, so you can audit your datasets programmatically.

Install and import dependencies

Make sure you have wget installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os

from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

Fetch and view dataset

Fetch the dataset for this tutorial.

mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/banking-text-quickstart-v1.csv -O data/banking-text-quickstart.csv

Here we’ll use a variant of the BANKING77 text dataset. This is a multi-class classification dataset where customer service requests are labeled as belonging to one of K classes (intent categories).

We can view the first few rows of our dataset below:

BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/banking-text-quickstart.csv")
data = pd.read_csv(dataset_path)
data.head()
text label
0 i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. cancel_transfer
1 why is there a fee when i thought there would be no fees? card_payment_fee_charged
2 why can't my beneficiary access my account? beneficiary_not_allowed
3 does it cost extra to send out more than one card? getting_spare_card
4 can i change my pin at an atm? change_pin

Dataset Structure

The data used in the tutorial is stored in a standard CSV file containing the following columns:

text,label
<a text example>,<a class label>
"<a text example with quotes, to escape commas as column separators>",<another class label>
...

You can similarly format any other text dataset and run the rest of this tutorial. Details on how to format your dataset can be found in this guide, which also outlines other format options.
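
Before uploading your own data, a quick local check of the file layout can catch formatting problems early. The sketch below is only illustrative: it parses a small in-memory CSV with made-up rows, and the helper name check_text_dataset is hypothetical (it is not part of the cleanlab_studio package).

```python
import io

import pandas as pd

# Made-up sample in the same two-column layout; the quoted field contains a comma.
sample_csv = io.StringIO(
    "text,label\n"
    "can i change my pin at an atm?,change_pin\n"
    '"i made an error, please cancel my transfer",cancel_transfer\n'
)


def check_text_dataset(df, text_column="text", label_column="label"):
    """Return a list of problems found in the dataframe (hypothetical helper)."""
    problems = []
    for column in (text_column, label_column):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if not problems:
        n_missing_text = int(df[text_column].isna().sum())
        n_missing_label = int(df[label_column].isna().sum())
        if n_missing_text:
            problems.append(f"{n_missing_text} rows have no text")
        if n_missing_label:
            problems.append(f"{n_missing_label} rows have no label")
    return problems


sample_df = pd.read_csv(sample_csv)
problems = check_text_dataset(sample_df)
print(problems or "No formatting problems found")
```

The quoted second row is parsed as a single text field even though it contains a comma, which is the escaping behavior described above.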

Load dataset into Cleanlab Studio

Now that we have our dataset, let’s load it into Cleanlab Studio and conduct our analysis. Use your API key to instantiate a studio object, which analyzes your dataset.

from cleanlab_studio import Studio

# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

# initialize studio object
studio = Studio(API_KEY)

Load the data into Cleanlab Studio (more details/options can be found in this guide). This may take a while for big datasets.

dataset_id = studio.upload_dataset(dataset_path, dataset_name="banking-text-quickstart")
print(f"Dataset ID: {dataset_id}")

Launch a Project

A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one.

Note: The label_column and text_column arguments below should be set to the names of the columns in your dataset that hold the class labels and the text. In this example, those columns happen to be called label and text. If you have multiple text columns, please merge them into a single column as demonstrated in our FAQ.

project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="banking-text-quickstart-project",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
    text_column="text",
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Once the project has been launched successfully and you see your project_id, feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.

You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a cleanset, an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this cleanset has been created.

Warning! For big datasets, this next cell may take a long time to execute while Cleanlab’s AI model is training. If your Jupyter notebook has timed out during this process, you can resume work by re-running the below cell (which should return instantly if the project has completed training). Do not create a new project.

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)

Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the Cleanlab Studio web interface and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.

Download Cleanlab columns

We can fetch Cleanlab columns that store metadata for this cleanset using its cleanset_id. These columns have the same length as your original dataset and provide metadata about each individual data point, like what types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cell.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label suggested_label_confidence_score is_ambiguous ambiguous_score is_well_labeled is_near_duplicate ... non_english_score predicted_language is_toxic toxic_score sentiment_score bias_score is_biased gender_bias_score racial_bias_score sexual_orientation_bias_score
0 1 <NA> False 0.357680 <NA> 0.404141 False 0.914398 True False ... 0.004226 <NA> False 0.104431 0.326050 0.122986 False 0.000000 0.122986 0.000018
1 2 <NA> False 0.323392 <NA> 0.453953 False 0.918068 True False ... 0.007218 <NA> False 0.181641 0.134735 0.395068 False 0.395068 0.232788 0.144897
2 3 <NA> False 0.253626 <NA> 0.561822 False 0.857873 True False ... 0.008784 <NA> False 0.062195 0.159088 0.259082 False 0.259082 0.093872 0.025330
3 4 <NA> False 0.478667 <NA> 0.238665 False 0.937696 False False ... 0.004739 <NA> False 0.184570 0.377625 0.241504 False 0.241504 0.131836 0.013489
4 5 <NA> False 0.366417 <NA> 0.393526 False 0.907793 True False ... 0.058934 <NA> False 0.120178 0.543762 0.381641 False 0.381641 0.112732 0.012970

5 rows × 38 columns

Review data issues

Details about all of the Cleanlab columns and their meanings can be found in this guide. Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:

  • Label issue indicates the given label of this data point is likely wrong. For such data, consider correcting their label to the suggested_label if it seems more appropriate.
  • Ambiguous indicates this data point does not clearly belong to any of the classes (e.g. a borderline case). Multiple human annotators might disagree on how to label this data point, so you might consider refining your annotation instructions to clarify how to handle data points like this.
  • Outlier indicates this data point is very different from the rest of the data (looks atypical). The presence of outliers may indicate problems in your data sources, so consider deleting such data from your dataset if appropriate.
  • Near duplicate indicates there are other data points that are (exactly or nearly) identical to this data point. Duplicated data points can have an outsized impact on models/analytics, so consider deleting the extra copies from your dataset if appropriate.

The data points exhibiting each type of issue are indicated with boolean values in the respective is_<issue> column, and the severity of this issue in each data point is quantified in the respective <issue>_score column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
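
Because the flag columns share the is_<issue> naming pattern, a per-issue count takes only a few lines. The following is a minimal sketch on a toy dataframe with made-up values (not real Cleanlab results). Note that not every is_ column denotes a problem (is_well_labeled, for example), so you may want to filter the list for your own data.

```python
import pandas as pd

# Toy stand-in for the Cleanlab columns (made-up values for illustration only).
toy_columns_df = pd.DataFrame({
    "is_label_issue": [True, False, False, True],
    "label_issue_score": [0.76, 0.12, 0.30, 0.68],
    "is_outlier": [False, False, True, False],
    "outlier_score": [0.05, 0.02, 0.81, 0.10],
})

# Every boolean flag column follows the is_<issue> naming pattern.
flag_columns = [c for c in toy_columns_df.columns if c.startswith("is_")]

# Summing a boolean column counts the rows where the flag is True.
issue_counts = toy_columns_df[flag_columns].sum().sort_values(ascending=False)
print(issue_counts)
```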

Let’s go through some of the Cleanlab columns and types of data issues, starting with label issues (i.e. mislabeled data). We first create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).

# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)

# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_index=True)

# Set a "given_label" column to the original label
combined_dataset_df.rename(columns={"label": "given_label"}, inplace=True)
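
An index-based merge like the one above pairs rows by their index values, and pandas uses an inner join by default, so rows whose index is missing on either side are silently dropped. The sketch below shows one way to guard against that, using toy frames with made-up values. The validate argument and the length check are optional safeguards we add for illustration; they are not required by Cleanlab Studio.

```python
import pandas as pd

# Toy stand-ins for the original dataset and the downloaded Cleanlab columns.
original_df = pd.DataFrame({
    "text": ["can i change my pin at an atm?", "payment did not process", "p"],
    "label": ["change_pin", "beneficiary_not_allowed", "getting_spare_card"],
})
toy_cleanlab_df = pd.DataFrame({
    "cleanlab_row_ID": [1, 2, 3],
    "is_label_issue": [False, True, False],
})

# validate="one_to_one" raises an error if either index contains duplicate values.
combined_df = original_df.merge(
    toy_cleanlab_df, left_index=True, right_index=True, validate="one_to_one"
)

# An inner join drops unmatched rows, so equal lengths confirm nothing was lost.
rows_preserved = len(combined_df) == len(original_df)
print(f"All rows preserved: {rows_preserved}")
```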

To see which text examples are estimated to be mislabeled, we filter by is_label_issue. We sort by label_issue_score to see which of these data points are most likely mislabeled.

samples_ranked_by_label_issue_score = combined_dataset_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)

columns_to_display = ["text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(samples_ranked_by_label_issue_score.head(5)[columns_to_display])
text label_issue_score is_label_issue given_label suggested_label
874 why am i being charge a fee when using an atm? 0.756143 True card_about_to_expire card_payment_fee_charged
988 i was charged for getting cash. 0.700452 True card_about_to_expire card_payment_fee_charged
490 which currencies can i used to add funds to my account? 0.693506 True cancel_transfer supported_cards_and_currencies
8 can i change my pin on holiday? 0.677492 True beneficiary_not_allowed change_pin
769 why do i see extra charges for withdrawing my money? 0.671252 True card_about_to_expire card_payment_fee_charged

Note that in each of these examples, the given_label really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request). Data labeling is an error-prone process and annotators make mistakes! Luckily we can easily correct these data points by just using Cleanlab’s suggested_label above, which seems like a much more suitable label in most cases.

While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. You can alternatively ignore these boolean is_label_issue flags and filter the data by thresholding the label_issue_score yourself (if say you find the default thresholds produce false positives/negatives).
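
Thresholding the score yourself could look like the sketch below. The dataframe is a toy example with made-up scores, and the cutoff of 0.6 is an arbitrary value chosen for illustration; pick your own by inspecting examples near the boundary.

```python
import pandas as pd

# Toy data with made-up scores (not real Cleanlab results).
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "is_label_issue": [True, True, False, False],
    "label_issue_score": [0.76, 0.55, 0.48, 0.10],
})

# Hypothetical cutoff chosen for illustration; tune it on your own data.
custom_threshold = 0.6

# Keep only rows whose score is at or above the cutoff, most severe first.
high_confidence_issues = toy_df[
    toy_df["label_issue_score"] >= custom_threshold
].sort_values("label_issue_score", ascending=False)

n_default = int(toy_df["is_label_issue"].sum())
print(f"Default flags: {n_default}, custom threshold: {len(high_confidence_issues)}")
```

A higher cutoff trades recall for precision: fewer rows to review, but more mislabeled examples may slip through.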

Next, let’s look at the ambiguous examples detected in the dataset.

samples_ranked_by_ambiguous_score = combined_dataset_df.query("is_ambiguous").sort_values("ambiguous_score", ascending=False)

columns_to_display = ["text", "ambiguous_score", "is_ambiguous", "given_label", "suggested_label"]
display(samples_ranked_by_ambiguous_score.head(5)[columns_to_display])
text ambiguous_score is_ambiguous given_label suggested_label
633 i tried to withdraw 40 pounds but only 20 came out. did you steal my money? 0.990658 True card_payment_fee_charged <NA>
320 payment did not process 0.989376 True beneficiary_not_allowed card_payment_fee_charged
730 i'm still waiting for my transaction. 0.983218 True supported_cards_and_currencies <NA>
652 i just made a top-up but it shows as pending! i use your service all the time and have never had a problem before. why does it keep showing up as pending? 0.980733 True cancel_transfer <NA>
841 the card payment didn't work 0.979759 True change_pin <NA>

Next, let’s look at the outliers detected in the dataset.

samples_ranked_by_outlier_score = combined_dataset_df.query("is_outlier").sort_values("outlier_score", ascending=False)

columns_to_display = ["text", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier", "given_label", "suggested_label"]
display(samples_ranked_by_outlier_score.head(5)[columns_to_display])
text outlier_score is_empty_text text_num_characters is_outlier given_label suggested_label
180 p 0.812792 False 1 True getting_spare_card <NA>
503 cancel transaction 0.276691 False 18 True cancel_transfer <NA>
528 my sc 0.259906 False 5 True apple_pay_or_google_pay <NA>
683 bad bank 0.251266 False 8 True apple_pay_or_google_pay <NA>
520 switch banks 0.242160 False 12 True change_pin <NA>

Next, let’s look at the near duplicates detected in the dataset.

n_near_duplicate_sets = len(set(combined_dataset_df.loc[combined_dataset_df["near_duplicate_cluster_id"].notna(), "near_duplicate_cluster_id"]))
print(f"There are {n_near_duplicate_sets} sets of near duplicate texts in the dataset.")
    There are 3 sets of near duplicate texts in the dataset.

Note that the near duplicate data points each have an associated near_duplicate_cluster_id integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point. Remember that near duplicates also include exact duplicates (which have near_duplicate_score = 1).

Let’s check out the near duplicates with id = 0:

near_duplicate_cluster_id = 0  # play with this value to see other sets of near duplicates
selected_samples_by_near_duplicate_cluster_id = combined_dataset_df.query("near_duplicate_cluster_id == @near_duplicate_cluster_id")

columns_to_display = ["text", "near_duplicate_score", "is_near_duplicate", "given_label"]
selected_samples_by_near_duplicate_cluster_id[columns_to_display]
text near_duplicate_score is_near_duplicate given_label
344 is there a charge for sending out more cards? 0.891876 True getting_spare_card
453 is there a fee for sending out more cards? 0.891876 True getting_spare_card
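
To review every near duplicate set at once rather than one ID at a time, you can group by the cluster ID. This is a minimal sketch on toy data; we assume rows that are not near duplicates have a missing near_duplicate_cluster_id, as the notna() filter earlier in this tutorial suggests.

```python
import pandas as pd

# Toy data: two near duplicate sets plus one row with no cluster ID.
toy_df = pd.DataFrame({
    "text": [
        "is there a charge for sending out more cards?",
        "can i change my pin at an atm?",
        "is there a fee for sending out more cards?",
        "payment did not process",
        "payment did not process",
    ],
    "near_duplicate_cluster_id": [0, None, 0, 1, 1],
})

# Rows without a cluster ID are not near duplicates, so drop them first.
clustered = toy_df.dropna(subset=["near_duplicate_cluster_id"])

# Collect the texts of each near duplicate set, keyed by integer cluster ID.
cluster_ids = clustered["near_duplicate_cluster_id"].astype(int)
near_duplicate_sets = clustered.groupby(cluster_ids)["text"].apply(list)

for cluster_id, texts in near_duplicate_sets.items():
    print(cluster_id, texts)
```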

Text issues

Cleanlab Studio can also detect potential problems in any text in your dataset, such as the occurrence of toxic language, personally identifiable information (PII), or nonsensical language (e.g. HTML/XML tags and other random strings contaminating text descriptions). The following Cleanlab columns are specific to the text fields in your dataset (see here for details).

Similar to above, the is_<issue> column contains boolean values indicating if a text field has been identified to exhibit a particular issue, and the <issue>_score column contains numeric scores between 0 and 1 indicating the severity of this particular issue (1 indicates the most severe instance of the issue).

Let’s take a closer look at some text issues that have been flagged in our dataset:

Note: Text issue detection is currently only provided for text modality projects running in regular mode.

Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms, chatbots, or other applications depending on this dataset.

Here are some examples in this dataset detected to contain toxic language:

toxic_samples = combined_dataset_df.query("is_toxic").sort_values("toxic_score", ascending=False)

columns_to_display = ["text", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])
text toxic_score is_toxic
852 help me change my pin your garbage app is broken, the most pathetic bank and absolute worst customer service ever 0.837891 True
773 i'm really sick of your stupid requirements, just issue me the damn credit card! 0.834473 True
416 some f-ing lowlife mugged me, they stole everything including my phone. i can't use your app anymore, what can i do? 0.818848 True

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded and anonymized/removed if discovered in publicly shared data.

Cleanlab’s PII detection also returns two extra columns, PII_items and PII_types, which list the specific PII detected in the text and its type. Possible types of PII that can be detected are detailed in the guide and scored according to how sensitive each type of information is.

Here are some examples of PII detected in the dataset:

PII_samples = combined_dataset_df.query("is_PII").sort_values("PII_score", ascending=False)

columns_to_display = ["text", "PII_score", "is_PII", "PII_types", "PII_items"]
display(PII_samples.head(5)[columns_to_display])
text PII_score is_PII PII_types PII_items
68 my card number is 4012888888881881 how do I know if it is mastercard or visa? 1.0 True ["credit card"] ["4012888888881881"]
235 i just replaced my phone, do i have to make a new account? my username is gavdlin@gmail.com new phone number is 212-978-1213 0.5 True ["email", "phone number"] ["gavdlin@gmail.com", "212-978-1213"]
485 i no longer have my phone number +44 20 8356 1167, what should i do? 0.5 True ["phone number"] ["+44 20 8356 1167"]
760 i wish to cancel a transfer sent to judmunz@yahoo.com 0.5 True ["email"] ["judmunz@yahoo.com"]
243 i want to choose a new pin, name on account is alvin weber and dob 2/10/1967 0.4 True ["date of birth"] ["2/10/1967"]
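
If you decide to anonymize rather than delete such rows, the PII_items column can drive a simple masking step. The sketch below rests on assumptions: the helper redact_pii is hypothetical, and we assume PII_items arrives either as a JSON-encoded list of strings (as the displayed output suggests) or as an actual Python list. Check the column's dtype in your own download before relying on this.

```python
import json

import pandas as pd

# Toy data mirroring the displayed output; PII_items is assumed to be JSON text.
toy_df = pd.DataFrame({
    "text": [
        "i wish to cancel a transfer sent to judmunz@yahoo.com",
        "can i change my pin at an atm?",
    ],
    "PII_items": ['["judmunz@yahoo.com"]', "[]"],
})


def redact_pii(text, pii_items, placeholder="[REDACTED]"):
    """Replace each detected PII string in the text (hypothetical helper)."""
    if isinstance(pii_items, str):
        items = json.loads(pii_items)
    elif isinstance(pii_items, (list, tuple)):
        items = list(pii_items)
    else:
        items = []  # missing value: nothing to redact
    # Replace longer items first so overlapping matches are fully masked.
    for item in sorted(items, key=len, reverse=True):
        text = text.replace(item, placeholder)
    return text


toy_df["redacted_text"] = [
    redact_pii(text, items)
    for text, items in zip(toy_df["text"], toy_df["PII_items"])
]
print(toy_df["redacted_text"].tolist())
```

Plain string replacement only masks the exact substrings that were detected, so it is worth spot-checking the redacted text before sharing it.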

Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to identify and remove in situations where we want to ensure the text fields in our data are understandable (e.g. if they are text descriptions intended to be read).

If a text data point is detected to be non-English, Cleanlab Studio will predict its language in the predicted_language column. If an alternative language cannot be predicted (either because the text contains more than one language or because it contains nonsensical characters), predicted_language will contain a null value.

Here are some non-English examples detected in the dataset:

non_english_samples = combined_dataset_df.query("is_non_english").sort_values("non_english_score", ascending=False)

columns_to_display = ["text", "non_english_score", "is_non_english", "predicted_language"]
display(non_english_samples.head(5)[columns_to_display])
text non_english_score is_non_english predicted_language
180 p 0.991476 True <NA>
755 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> 0.979175 True <NA>
770 la trasferenza al mio conto non è stata consentita. 0.866523 True Italian
220 qué necesito hacer para cancelar una transacción? 0.828047 True Spanish
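
A quick tally of predicted languages can help you decide whether to translate, route, or drop the non-English rows. Here is a minimal sketch on toy data (three texts are taken from the output above, one is made up); the "unknown" label for null predictions is our own choice for illustration.

```python
import pandas as pd

# Toy data: null predicted_language means no single language could be predicted.
toy_df = pd.DataFrame({
    "text": [
        "p",
        "la trasferenza al mio conto non è stata consentita.",
        "qué necesito hacer para cancelar una transacción?",
        "hola, necesito ayuda",
    ],
    "is_non_english": [True, True, True, True],
    "predicted_language": [None, "Italian", "Spanish", "Spanish"],
})

non_english = toy_df[toy_df["is_non_english"]]

# Label null predictions explicitly so they appear in the tally.
language_counts = non_english["predicted_language"].fillna("unknown").value_counts()
print(language_counts)
```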

Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Its presence may be noteworthy if you are expecting the text in your dataset to be well-written.

Here are some examples of informal text detected in the dataset:

informal_samples = combined_dataset_df.query("is_informal").sort_values("informal_score", ascending=False)

columns_to_display = ["text", "informal_score", "spelling_issue_score", "grammar_issue_score", "slang_issue_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])
text informal_score spelling_issue_score grammar_issue_score slang_issue_score is_informal
528 my sc 0.701533 0.500000 0.615330 0.888503 True
720 google pay top up not working. 0.700279 0.000000 0.881408 0.869290 True
192 which atm's am i able to change my pin? 0.694062 0.111111 0.811346 0.868254 True
564 i do i top up from my apple watch? 0.674827 0.000000 0.925408 0.761659 True
476 google play top up help? 0.671472 0.000000 0.807158 0.871522 True

Improve the dataset based on the detected issues

Since the results of this analysis appear reasonable, let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets, so we caution against blindly copying the actions we perform below.

For data marked as label_issue, we create a new corrected_label column, which will be the given label for data without detected label issues, and the suggested_label for data with detected label issues.

corrected_label = np.where(
    combined_dataset_df["is_label_issue"],
    combined_dataset_df["suggested_label"],
    combined_dataset_df["given_label"],
)
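
As a variation on the correction above, the sketch below falls back to the given label whenever a flagged row has no suggested label, and counts how many labels actually change. It runs on toy data with made-up labels. Whether flagged rows can lack a suggestion in your results is an assumption worth checking; this is a defensive alternative, not a description of how Cleanlab Studio behaves.

```python
import pandas as pd

# Toy data: the last row is flagged but has no suggested label.
toy_df = pd.DataFrame({
    "given_label": ["change_pin", "card_about_to_expire", "cancel_transfer"],
    "suggested_label": [None, "card_payment_fee_charged", None],
    "is_label_issue": [False, True, True],
})

# Keep the given label when the row is not flagged or no suggestion exists.
keep_given = ~toy_df["is_label_issue"] | toy_df["suggested_label"].isna()

# Series.where keeps values where the condition is True, else takes the other value.
toy_df["corrected_label"] = toy_df["given_label"].where(
    keep_given, toy_df["suggested_label"]
)

n_changed = int((toy_df["corrected_label"] != toy_df["given_label"]).sum())
print(f"{n_changed} label(s) corrected")
```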

For data marked as outlier or ambiguous, we will simply exclude them from our dataset. Here we create a boolean vector rows_to_exclude to track which data points will be excluded.

# create an exclude column to keep track of the excluded data
rows_to_exclude = combined_dataset_df["is_outlier"] | combined_dataset_df["is_ambiguous"]

For each set of near duplicates, we only want to keep one of the data points that share a common near_duplicate_cluster_id (so that the resulting dataset will no longer contain any near duplicates).

near_duplicates_to_exclude = combined_dataset_df['is_near_duplicate'] & combined_dataset_df['near_duplicate_cluster_id'].duplicated(keep='first')

rows_to_exclude |= near_duplicates_to_exclude
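
To see why the expression above combines is_near_duplicate with duplicated(keep='first'), consider this small sketch on toy data with made-up texts. We assume rows that are not near duplicates have a missing cluster ID. pandas treats missing values as equal to each other in duplicated(), so without the is_near_duplicate condition, every such row after the first would be excluded by mistake.

```python
import pandas as pd

# Toy data: one cluster of three near duplicates and two unrelated rows.
toy_df = pd.DataFrame({
    "text": [
        "fee for more cards?",
        "charge for more cards?",
        "change my pin",
        "switch banks",
        "cost of more cards?",
    ],
    "is_near_duplicate": [True, True, False, False, True],
    "near_duplicate_cluster_id": [0, 0, None, None, 0],
})

# False for the first row of each cluster ID, True for every later row.
later_copies = toy_df["near_duplicate_cluster_id"].duplicated(keep="first")

# Only rows that are actual near duplicates may be excluded.
to_exclude = toy_df["is_near_duplicate"] & later_copies
deduplicated = toy_df[~to_exclude]
print(deduplicated["text"].tolist())
```

Here the second unrelated row ("switch banks") is marked as a later copy because its missing ID matches the earlier missing ID, yet it survives because is_near_duplicate is False.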

Note we didn’t exclude the data with text issues here but you might want to in your applications. We can check the total amount of excluded data:

print(f"Excluding {rows_to_exclude.sum()} text examples (out of {len(combined_dataset_df)})")
    Excluding 29 text examples (out of 1000)

Finally, let’s actually make a new version of our dataset with these changes.

We craft a new dataframe from the original, applying corrections and exclusions, and then use this dataframe to save the new dataset in a separate CSV file. The new dataset is a CSV file with the same format as our original dataset, so you can use it as a drop-in replacement to get more reliable results in your ML and analytics pipelines without any change to your existing modeling code.

new_dataset_filename = "improved_dataset.csv"
# Fetch the original dataset
fixed_dataset = combined_dataset_df[["text"]].copy()

# Add the corrected label column
fixed_dataset["label"] = corrected_label

# Automatically exclude selected rows
fixed_dataset = fixed_dataset[~rows_to_exclude]

# Check if the file exists before saving
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite, so please delete it first or specify a different new_dataset_filename.")
else:
    # Save the adjusted dataset to a CSV file
    fixed_dataset.to_csv(new_dataset_filename, index=False)
    print(f"Adjusted dataset saved to {new_dataset_filename}")
    Adjusted dataset saved to improved_dataset.csv

If you want to curate a text dataset for better LLM fine-tuning, here’s an example.