Detecting Issues in Text Datasets

This is the recommended quickstart tutorial for analyzing text datasets via Cleanlab Studio’s Python API.

In this tutorial, we demonstrate the metadata Cleanlab Studio automatically generates for any text classification dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. This entire notebook is run using the cleanlab_studio Python package, so you can audit your datasets programmatically.

Install and import dependencies

Make sure you have wget installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

%pip install cleanlab-studio

import numpy as np
import pandas as pd
import os

from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

Fetch and view dataset

Fetch the dataset for this tutorial.

mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/banking-text-quickstart-v1.csv -O data/banking-text-quickstart.csv

Here we’ll use a variant of the BANKING77 text dataset. This is a multi-class classification dataset where customer service requests are labeled as belonging to one of K classes (intent categories).

We can view the first few rows of our dataset below:

BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/banking-text-quickstart.csv")

data = pd.read_csv(dataset_path)
data.head()

	text	label
0	i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through.	cancel_transfer
1	why is there a fee when i thought there would be no fees?	card_payment_fee_charged
2	why can't my beneficiary access my account?	beneficiary_not_allowed
3	does it cost extra to send out more than one card?	getting_spare_card
4	can i change my pin at an atm?	change_pin

Dataset Structure

The data used in the tutorial is stored in a standard CSV file containing the following columns:

text,label
<a text example>,<a class label>
"<a text example with quotes, to escape commas as column separators>",<another class label>
...

You can similarly format any other text dataset and run the rest of this tutorial. Details on how to format your dataset can be found in this guide, which also outlines other format options.
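As a rough illustration of the quoting rule described above, the sketch below writes a few made-up rows with Python's standard csv module and reads them back. The example rows are hypothetical; only the text,label layout comes from this tutorial.

```python
import csv
import io

# Hypothetical rows: the second text contains a comma, the third a double quote.
rows = [
    {"text": "can i change my pin at an atm?", "label": "change_pin"},
    {"text": "i made an error, please cancel my transfer", "label": "cancel_transfer"},
    {"text": 'the app says "declined" every time', "label": "card_payment_fee_charged"},
]

# QUOTE_MINIMAL only quotes fields that contain the separator, a quote, or a newline.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["text", "label"], quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields unchanged.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Writing with pandas (DataFrame.to_csv) applies the same minimal quoting by default, so either route should produce a file in the layout shown above.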

Load dataset into Cleanlab Studio

Now that we have our dataset, let’s load it into Cleanlab Studio and conduct our analysis. Use your API key to instantiate a studio object, which analyzes your dataset.

from cleanlab_studio import Studio

# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

# initialize studio object
studio = Studio(API_KEY)

Load the data into Cleanlab Studio (more details/options can be found in this guide). This may take a while for big datasets.

dataset_id = studio.upload_dataset(dataset_path, dataset_name="banking-text-quickstart")
print(f"Dataset ID: {dataset_id}")

Launch a Project

A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one.

Note: The label_column and text_column arguments below should be set to the names of the columns in your dataset that hold the class labels and the text, respectively. In this example those columns happen to be called label and text. If you have multiple text columns, please merge them into a single column as demonstrated in our FAQ.

project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="banking-text-quickstart-project",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
    text_column="text",
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
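The note above mentions merging multiple text columns into one before creating a project. The FAQ is the authoritative reference; the following is only one plausible way to do it with pandas, shown on a made-up frame. The column names subject and body are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with two text columns ("subject" and "body" are made-up names).
raw = pd.DataFrame({
    "subject": ["card fee", "pin change", None],
    "body": [
        "why was i charged a fee?",
        "can i change my pin at an atm?",
        "where is my card?",
    ],
    "label": ["card_payment_fee_charged", "change_pin", "getting_spare_card"],
})

# Treat missing values as empty strings, join the fields, and trim stray whitespace.
raw["text"] = (raw["subject"].fillna("") + " " + raw["body"].fillna("")).str.strip()

# Keep only the single merged text column and the label column.
merged = raw[["text", "label"]]
print(merged)
```

The merged frame could then be saved to CSV and uploaded in place of the original file.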

Once the project has been launched successfully and you see your project_id, feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.

You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a cleanset, an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this cleanset has been created.

Warning! For big datasets, this next cell may take a long time to execute while Cleanlab’s AI model is training. If your Jupyter notebook has timed out during this process, you can resume work by re-running the below cell (which should return instantly if the project has completed training). Do not create a new project.

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)

Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the Cleanlab Studio web interface and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.

Download Cleanlab columns

We can fetch Cleanlab columns that store metadata for this cleanset using its cleanset_id. These columns have the same length as your original dataset and provide metadata about each individual data point, like what types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cell.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()

	cleanlab_row_ID	corrected_label	is_label_issue	label_issue_score	suggested_label	suggested_label_confidence_score	is_ambiguous	ambiguous_score	is_well_labeled	is_near_duplicate	...	non_english_score	predicted_language	is_toxic	toxic_score	sentiment_score	bias_score	is_biased	gender_bias_score	racial_bias_score	sexual_orientation_bias_score
0	1	<NA>	False	0.357680	<NA>	0.404141	False	0.914398	True	False	...	0.004226	<NA>	False	0.104431	0.326050	0.122986	False	0.000000	0.122986	0.000018
1	2	<NA>	False	0.323392	<NA>	0.453953	False	0.918068	True	False	...	0.007218	<NA>	False	0.181641	0.134735	0.395068	False	0.395068	0.232788	0.144897
2	3	<NA>	False	0.253626	<NA>	0.561822	False	0.857873	True	False	...	0.008784	<NA>	False	0.062195	0.159088	0.259082	False	0.259082	0.093872	0.025330
3	4	<NA>	False	0.478667	<NA>	0.238665	False	0.937696	False	False	...	0.004739	<NA>	False	0.184570	0.377625	0.241504	False	0.241504	0.131836	0.013489
4	5	<NA>	False	0.366417	<NA>	0.393526	False	0.907793	True	False	...	0.058934	<NA>	False	0.120178	0.543762	0.381641	False	0.381641	0.112732	0.012970

5 rows × 38 columns

Review data issues

Details about all of the Cleanlab columns and their meanings can be found in this guide. Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:

Label issue indicates the given label of this data point is likely wrong. For such data, consider correcting their label to the suggested_label if it seems more appropriate.
Ambiguous indicates this data point does not clearly belong to any of the classes (e.g. a borderline case). Multiple human annotators might disagree on how to label this data point, so you might consider refining your annotation instructions to clarify how to handle data points like this.
Outlier indicates this data point is very different from the rest of the data (looks atypical). The presence of outliers may indicate problems in your data sources; consider deleting such data from your dataset if appropriate.
Near duplicate indicates there are other data points that are (exactly or nearly) identical to this data point. Duplicated data points can have an outsized impact on models/analytics, so consider deleting the extra copies from your dataset if appropriate.

The data points exhibiting each type of issue are indicated with boolean values in the respective is_<issue> column, and the severity of this issue in each data point is quantified in the respective <issue>_score column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
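To get a quick overview of how many data points each flag marks, one option is to sum every boolean is_<issue> column. The sketch below uses a small made-up stand-in for the Cleanlab columns, with invented values, rather than real results.

```python
import pandas as pd

# Toy stand-in for cleanlab_columns_df; all values are made up for illustration.
toy_columns = pd.DataFrame({
    "is_label_issue": [True, False, False, True],
    "label_issue_score": [0.76, 0.32, 0.25, 0.68],
    "is_outlier": [False, False, True, False],
    "outlier_score": [0.05, 0.02, 0.81, 0.04],
})

# Select every boolean flag column by its "is_" prefix.
flag_columns = [c for c in toy_columns.columns if c.startswith("is_")]

# Summing booleans counts the True values, i.e. the flagged rows per issue type.
issue_counts = toy_columns[flag_columns].sum().sort_values(ascending=False)
print(issue_counts)
```

On the real Cleanlab columns, note that is_well_labeled also matches the is_ prefix but marks the absence of a problem, so you would likely want to leave it out of such a summary.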

Let’s go through some of the Cleanlab columns and types of data issues, starting with label issues (i.e. mislabeled data). We first create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).

# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)

# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_index=True)

# Set a "given_label" column to the original label
combined_dataset_df.rename(columns={"label": "given_label"}, inplace=True)

To see which text examples are estimated to be mislabeled, we filter by is_label_issue. We sort by label_issue_score to see which of these data points are most likely mislabeled.

samples_ranked_by_label_issue_score = combined_dataset_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)

columns_to_display = ["text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(samples_ranked_by_label_issue_score.head(5)[columns_to_display])

	text	label_issue_score	is_label_issue	given_label	suggested_label
874	why am i being charge a fee when using an atm?	0.756143	True	card_about_to_expire	card_payment_fee_charged
988	i was charged for getting cash.	0.700452	True	card_about_to_expire	card_payment_fee_charged
490	which currencies can i used to add funds to my account?	0.693506	True	cancel_transfer	supported_cards_and_currencies
8	can i change my pin on holiday?	0.677492	True	beneficiary_not_allowed	change_pin
769	why do i see extra charges for withdrawing my money?	0.671252	True	card_about_to_expire	card_payment_fee_charged

Note that in each of these examples, the given_label really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request). Data labeling is an error-prone process and annotators make mistakes! Luckily we can easily correct these data points by just using Cleanlab’s suggested_label above, which seems like a much more suitable label in most cases.

While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. You can alternatively ignore these boolean is_label_issue flags and filter the data by thresholding the label_issue_score yourself (if say you find the default thresholds produce false positives/negatives).
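A minimal sketch of thresholding the score yourself, on made-up data. The cutoff of 0.6 is an arbitrary, hypothetical value chosen only for illustration; a suitable value depends on your data and on how many false positives you can tolerate.

```python
import pandas as pd

# Toy data with made-up scores; only the column names mirror the tutorial.
toy = pd.DataFrame({
    "text": [
        "why am i being charged a fee when using an atm?",
        "can i change my pin at an atm?",
        "which currencies can i use to add funds?",
        "does it cost extra to send out more than one card?",
    ],
    "label_issue_score": [0.76, 0.45, 0.62, 0.10],
    "is_label_issue": [True, False, False, False],
})

SCORE_THRESHOLD = 0.6  # hypothetical cutoff, tune it for your own dataset

# Ignore the boolean flag and keep every row whose score reaches the cutoff.
flagged_by_threshold = (
    toy[toy["label_issue_score"] >= SCORE_THRESHOLD]
    .sort_values("label_issue_score", ascending=False)
)
print(flagged_by_threshold[["text", "label_issue_score", "is_label_issue"]])
```

Here the custom cutoff surfaces one extra row that the default flag did not mark, which is the kind of difference you would want to inspect by hand.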

Next, let’s look at the ambiguous examples detected in the dataset.

samples_ranked_by_ambiguous_score = combined_dataset_df.query("is_ambiguous").sort_values("ambiguous_score", ascending=False)

columns_to_display = ["text", "ambiguous_score", "is_ambiguous", "given_label", "suggested_label"]
display(samples_ranked_by_ambiguous_score.head(5)[columns_to_display])

	text	ambiguous_score	is_ambiguous	given_label	suggested_label
633	i tried to withdraw 40 pounds but only 20 came out. did you steal my money?	0.990658	True	card_payment_fee_charged	<NA>
320	payment did not process	0.989376	True	beneficiary_not_allowed	card_payment_fee_charged
730	i'm still waiting for my transaction.	0.983218	True	supported_cards_and_currencies	<NA>
652	i just made a top-up but it shows as pending! i use your service all the time and have never had a problem before. why does it keep showing up as pending?	0.980733	True	cancel_transfer	<NA>
841	the card payment didn't work	0.979759	True	change_pin	<NA>

Next, let’s look at the outliers detected in the dataset.

samples_ranked_by_outlier_score = combined_dataset_df.query("is_outlier").sort_values("outlier_score", ascending=False)

columns_to_display = ["text", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier", "given_label", "suggested_label"]
display(samples_ranked_by_outlier_score.head(5)[columns_to_display])

	text	outlier_score	is_empty_text	text_num_characters	is_outlier	given_label	suggested_label
180	p	0.812792	False	1	True	getting_spare_card	<NA>
503	cancel transaction	0.276691	False	18	True	cancel_transfer	<NA>
528	my sc	0.259906	False	5	True	apple_pay_or_google_pay	<NA>
683	bad bank	0.251266	False	8	True	apple_pay_or_google_pay	<NA>
520	switch banks	0.242160	False	12	True	change_pin	<NA>

Next, let’s look at the near duplicates detected in the dataset.

n_near_duplicate_sets = len(set(combined_dataset_df.loc[combined_dataset_df["near_duplicate_cluster_id"].notna(), "near_duplicate_cluster_id"]))
print(f"There are {n_near_duplicate_sets} sets of near duplicate texts in the dataset.")

    There are 3 sets of near duplicate texts in the dataset.

Note that the near duplicate data points each have an associated near_duplicate_cluster_id integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point. Remember that near duplicates also include exact duplicates (which have near_duplicate_score = 1).

Let’s check out the near duplicates with id = 0:

near_duplicate_cluster_id = 0  # play with this value to see other sets of near duplicates
selected_samples_by_near_duplicate_cluster_id = combined_dataset_df.query("near_duplicate_cluster_id == @near_duplicate_cluster_id")

columns_to_display = ["text", "near_duplicate_score", "is_near_duplicate", "given_label"]
selected_samples_by_near_duplicate_cluster_id[columns_to_display]

	text	near_duplicate_score	is_near_duplicate	given_label
344	is there a charge for sending out more cards?	0.891876	True	getting_spare_card
453	is there a fee for sending out more cards?	0.891876	True	getting_spare_card
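Instead of inspecting one cluster ID at a time, you could list every set of near duplicates at once by grouping on near_duplicate_cluster_id. The sketch below runs on a made-up frame; it assumes that rows without near duplicates carry a missing cluster ID, as the notna() filter used earlier suggests.

```python
import pandas as pd

# Toy data: rows 0 and 2 form one set, rows 3 and 4 another, row 1 has no duplicates.
toy = pd.DataFrame({
    "text": [
        "is there a charge for sending out more cards?",
        "can i change my pin at an atm?",
        "is there a fee for sending out more cards?",
        "my card expires soon",
        "my card expires soon",
    ],
    "near_duplicate_cluster_id": [0, None, 0, 1, 1],
})

# groupby drops rows with a missing key by default, so only real clusters remain.
duplicate_sets = toy.groupby("near_duplicate_cluster_id")["text"].agg(list)

for cluster_id, texts in duplicate_sets.items():
    print(f"cluster {cluster_id}: {texts}")
```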

Text issues

Cleanlab Studio can also detect potential problems in any text in your dataset, such as the occurrence of toxic language, personally identifiable information (PII), or nonsensical language (e.g. HTML/XML tags and other random strings contaminating text descriptions). The following Cleanlab columns are specific to the text fields in your dataset (see here for details).

Similar to above, the is_<issue> column contains boolean values indicating if a text field has been identified to exhibit a particular issue, and the <issue>_score column contains numeric scores between 0 and 1 indicating the severity of this particular issue (1 indicates the most severe instance of the issue).

Let’s take a closer look at some text issues that have been flagged in our dataset:

Note: Text issue detection is currently only provided for text modality projects running in regular mode.

Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms, chatbots, or other applications depending on this dataset.

Here are some examples in this dataset detected to contain toxic language:

toxic_samples = combined_dataset_df.query("is_toxic").sort_values("toxic_score", ascending=False)

columns_to_display = ["text", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])

	text	toxic_score	is_toxic
852	help me change my pin your garbage app is broken, the most pathetic bank and absolute worst customer service ever	0.837891	True
773	i'm really sick of your stupid requirements, just issue me the damn credit card!	0.834473	True
416	some f-ing lowlife mugged me, they stole everything including my phone. i can't use your app anymore, what can i do?	0.818848	True

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded and anonymized/removed if discovered in publicly shared data.

Cleanlab’s PII detection also returns two extra columns, PII_items and PII_types, which list the specific PII detected in the text and its type. Possible types of PII that can be detected are detailed in the guide and scored according to how sensitive each type of information is.

Here are some examples of PII detected in the dataset:

PII_samples = combined_dataset_df.query("is_PII").sort_values("PII_score", ascending=False)

columns_to_display = ["text", "PII_score", "is_PII", "PII_types", "PII_items"]
display(PII_samples.head(5)[columns_to_display])

	text	PII_score	is_PII	PII_types	PII_items
68	my card number is 4012888888881881 how do I know if it is mastercard or visa?	1.0	True	["credit card"]	["4012888888881881"]
235	i just replaced my phone, do i have to make a new account? my username is gavdlin@gmail.com new phone number is 212-978-1213	0.5	True	["email", "phone number"]	["gavdlin@gmail.com", "212-978-1213"]
485	i no longer have my phone number +44 20 8356 1167, what should i do?	0.5	True	["phone number"]	["+44 20 8356 1167"]
760	i wish to cancel a transfer sent to judmunz@yahoo.com	0.5	True	["email"]	["judmunz@yahoo.com"]
243	i want to choose a new pin, name on account is alvin weber and dob 2/10/1967	0.4	True	["date of birth"]	["2/10/1967"]
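If you decide to anonymize rather than delete such rows, the listed PII_items can drive a simple redaction. The helper below is a hypothetical sketch, not part of cleanlab_studio. It assumes PII_items arrives either as a JSON-encoded string (as the display above suggests) or as a Python list, and it only redacts the items that were actually listed.

```python
import json


def redact_pii(text, pii_items, placeholder="[REDACTED]"):
    """Replace each listed PII item found in `text` with a placeholder."""
    # Assumption: pii_items is a JSON-encoded string or an already parsed list.
    items = json.loads(pii_items) if isinstance(pii_items, str) else list(pii_items)
    # Replace longer items first so a shorter item cannot split a longer one.
    for item in sorted(items, key=len, reverse=True):
        text = text.replace(item, placeholder)
    return text


example_text = "i wish to cancel a transfer sent to judmunz@yahoo.com"
example_items = '["judmunz@yahoo.com"]'

redacted = redact_pii(example_text, example_items)
print(redacted)
```

To use it on the real data, you could apply the helper only to rows where is_PII is true, since other rows may hold missing values in PII_items. Sensitive details that were not listed as items would remain in the text, so a manual review is still advisable.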

Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to identify and remove in situations where we want to ensure the text fields in our data are understandable (e.g. if they are text descriptions intended to be read).

If a text data point is detected to be non-English, Cleanlab Studio will predict its language in the predicted_language column. If an alternative language cannot be predicted (either because the text contains more than one language or because it contains nonsensical characters), predicted_language will contain a null value.

Here are some non-English examples detected in the dataset:

non_english_samples = combined_dataset_df.query("is_non_english").sort_values("non_english_score", ascending=False)

columns_to_display = ["text", "non_english_score", "is_non_english", "predicted_language"]
display(non_english_samples.head(5)[columns_to_display])

	text	non_english_score	is_non_english	predicted_language
180	p	0.991476	True	<NA>
755	404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body>	0.979175	True	<NA>
770	la trasferenza al mio conto non è stata consentita.	0.866523	True	Italian
220	qué necesito hacer para cancelar una transacción?	0.828047	True	Spanish
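Because predicted_language can be null, it can help to separate rows with an identified language from rows without one before deciding what to do with them. A small sketch on made-up data (the rows imitate the output above but are not real results):

```python
import pandas as pd

# Toy stand-in for the rows flagged as non-English.
toy_non_english = pd.DataFrame({
    "text": [
        "p",
        "404Error<body><p>InvalidUsername</p></body>",
        "la trasferenza al mio conto non è stata consentita.",
        "qué necesito hacer para cancelar una transacción?",
    ],
    "predicted_language": [None, None, "Italian", "Spanish"],
})

# dropna=False keeps the null bucket visible in the counts.
language_counts = toy_non_english["predicted_language"].value_counts(dropna=False)
print(language_counts)

# Rows with no predicted language are candidates for nonsensical or mixed text.
unidentified = toy_non_english[toy_non_english["predicted_language"].isna()]
print(unidentified["text"].tolist())
```

Rows with an identified language might be routed to translation, while unidentified rows are more likely noise worth removing.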

Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Its presence may be noteworthy if you are expecting the text in your dataset to be well-written.

Here are some examples of informal text detected in the dataset:

informal_samples = combined_dataset_df.query("is_informal").sort_values("informal_score", ascending=False)

columns_to_display = ["text", "informal_score", "spelling_issue_score", "grammar_issue_score", "slang_issue_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])

	text	informal_score	spelling_issue_score	grammar_issue_score	slang_issue_score	is_informal
528	my sc	0.701533	0.500000	0.615330	0.888503	True
720	google pay top up not working.	0.700279	0.000000	0.881408	0.869290	True
192	which atm's am i able to change my pin?	0.694062	0.111111	0.811346	0.868254	True
564	i do i top up from my apple watch?	0.674827	0.000000	0.925408	0.761659	True
476	google play top up help?	0.671472	0.000000	0.807158	0.871522	True

Improve the dataset based on the detected issues

Since the results of this analysis appear reasonable, let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets, so we caution against blindly copying the actions we perform below.

For data marked as label_issue, we create a new corrected_label column, which will be the given label for data without detected label issues, and the suggested_label for data with detected label issues.

corrected_label = np.where(combined_dataset_df["is_label_issue"],
                           combined_dataset_df["suggested_label"],
                           combined_dataset_df["given_label"])
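A slightly more defensive variant of the correction above falls back to the given label whenever no suggestion is available. Whether a flagged row can actually lack a suggested_label is an assumption here, since the outputs in this tutorial do not show such a case. The sketch runs on made-up data.

```python
import pandas as pd

# Toy data: row 1 is flagged as a label issue but has no suggestion (assumed case).
toy = pd.DataFrame({
    "given_label": ["card_about_to_expire", "cancel_transfer", "change_pin"],
    "suggested_label": ["card_payment_fee_charged", None, "getting_spare_card"],
    "is_label_issue": [True, True, False],
})

# Use the suggestion only where a label issue is flagged AND a suggestion exists.
use_suggestion = toy["is_label_issue"] & toy["suggested_label"].notna()

# Series.where keeps values where the condition holds and takes `other` elsewhere.
toy["corrected_label"] = toy["suggested_label"].where(use_suggestion, toy["given_label"])
print(toy)
```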

For data marked as outlier or ambiguous, we will simply exclude them from our dataset. Here we create a boolean vector rows_to_exclude to track which data points will be excluded.

# create a boolean vector to keep track of the excluded data
rows_to_exclude = combined_dataset_df["is_outlier"] | combined_dataset_df["is_ambiguous"]

For each set of near duplicates, we only want to keep one of the data points that share a common near_duplicate_cluster_id (so that the resulting dataset will no longer contain any near duplicates).

near_duplicates_to_exclude = combined_dataset_df['is_near_duplicate'] & combined_dataset_df['near_duplicate_cluster_id'].duplicated(keep='first')

rows_to_exclude |= near_duplicates_to_exclude
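To see why the expression above combines is_near_duplicate with duplicated(keep='first'), consider the toy frame below (made-up values). pandas treats missing values as equal to each other in duplicated(), so without the is_near_duplicate guard, rows that have no cluster ID could be marked for removal as well.

```python
import pandas as pd

# Toy data: two clusters (ids 0 and 1) plus two rows with no near duplicates.
toy = pd.DataFrame({
    "text": ["fee for cards?", "change pin", "fee for more cards?", "hi", "hi", "lost phone"],
    "is_near_duplicate": [True, False, True, True, True, False],
    "near_duplicate_cluster_id": [0, None, 0, 1, 1, None],
})

# True for every repeat of a cluster id after its first occurrence (NaN counts too).
repeated_id = toy["near_duplicate_cluster_id"].duplicated(keep="first")

# The flag guards against dropping rows whose missing id merely repeats.
to_exclude = toy["is_near_duplicate"] & repeated_id

kept = toy[~to_exclude]
print(kept)
```

In this toy case one row per cluster survives, and both rows without near duplicates are kept.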

Note we didn’t exclude the data with text issues here but you might want to in your applications. We can check the total amount of excluded data:

print(f"Excluding {rows_to_exclude.sum()} text examples (out of {len(combined_dataset_df)})")

    Excluding 29 text examples (out of 1000)
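If your application also calls for dropping rows with text issues, the same boolean-vector pattern extends naturally. The sketch below uses made-up flags; the choice of is_toxic and is_PII is only an example, and which flags to act on depends on your use case.

```python
import pandas as pd

# Toy flags with made-up values; the column names mirror the Cleanlab columns.
toy = pd.DataFrame({
    "is_outlier": [False, True, False, False],
    "is_toxic": [False, False, True, False],
    "is_PII": [False, False, False, True],
})

# Start from the rows already excluded for other reasons (here: outliers).
toy_rows_to_exclude = toy["is_outlier"].copy()

# Add every row flagged with at least one of the chosen text issues.
text_issue_flags = ["is_toxic", "is_PII"]  # example selection, adjust as needed
toy_rows_to_exclude |= toy[text_issue_flags].any(axis=1)

print(f"Excluding {toy_rows_to_exclude.sum()} rows (out of {len(toy)})")
```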

Finally, let’s actually make a new version of our dataset with these changes.

We craft a new dataframe from the original, applying corrections and exclusions, and then use this dataframe to save the new dataset in a separate CSV file. The new dataset is a CSV file that has the same format as our original dataset – you can use it as a plug-in replacement to get more reliable results in your ML and Analytics pipelines, without any change in your existing modeling code.

new_dataset_filename = "improved_dataset.csv"

# Fetch the original dataset
fixed_dataset = combined_dataset_df[["text"]].copy()

# Add the corrected label column 
fixed_dataset["label"] = corrected_label

# Automatically exclude selected rows
fixed_dataset = fixed_dataset[~rows_to_exclude]

# Check if the file exists before saving
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite so please delete it first, or specify a different new_dataset_filename.")
else:
    # Save the adjusted dataset to a CSV file
    fixed_dataset.to_csv(new_dataset_filename, index=False)
    print(f"Adjusted dataset saved to {new_dataset_filename}")

    Adjusted dataset saved to improved_dataset.csv
