Detecting Issues in Text Datasets

This is the recommended quickstart tutorial for analyzing text datasets via Cleanlab Studio’s Python API.

In this tutorial, we demonstrate the metadata Cleanlab Studio automatically generates for any text classification dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. This entire notebook is run using the cleanlab_studio Python package, so you can audit your datasets programmatically.

Install and import dependencies

Make sure you have wget installed to run this tutorial. You can use pip to install the other required packages as follows:

%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os

from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

Fetch and view dataset

Fetch the dataset for this tutorial.

mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/banking-text-quickstart-v1.csv -O data/banking-text-quickstart.csv

Here we’ll use a variant of the BANKING77 text dataset. This is a multi-class classification dataset where customer service requests are labeled as belonging to one of K classes (intent categories).

We can view the first few rows of our dataset below:

BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/banking-text-quickstart.csv")
data = pd.read_csv(dataset_path)
data.head()
text label
0 i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. cancel_transfer
1 why is there a fee when i thought there would be no fees? card_payment_fee_charged
2 why can't my beneficiary access my account? beneficiary_not_allowed
3 does it cost extra to send out more than one card? getting_spare_card
4 can i change my pin at an atm? change_pin

Dataset Structure

The data used in the tutorial is stored in a standard CSV file containing the following columns:

text,label
<a text example>,<a class label>
"<a text example with quotes, to escape commas as column separators>",<another class label>
...

You can similarly format any other text dataset and run the rest of this tutorial. Details on how to format your dataset can be found in this guide, which also outlines other format options.
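
For instance, if your dataset currently lives in a pandas DataFrame, here is a minimal sketch of exporting it to this format (the my_texts and my_labels variables below are hypothetical placeholders for your own data):

# Sketch: export your own data to the expected two-column CSV format
# ("my_texts" and "my_labels" are hypothetical placeholders)
my_df = pd.DataFrame({"text": my_texts, "label": my_labels})
# pandas automatically quotes fields containing commas
my_df.to_csv("my-text-dataset.csv", index=False)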

Load dataset into Cleanlab Studio

Now that we have our dataset, let’s load it into Cleanlab Studio and conduct our analysis. Use your API key to instantiate a studio object, which analyzes your dataset.

from cleanlab_studio import Studio

# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

# initialize studio object
studio = Studio(API_KEY)

Load the data into Cleanlab Studio (more details/options can be found in this guide). This may take a while for big datasets.

dataset_id = studio.upload_dataset(dataset_path, dataset_name="banking-text-quickstart")
print(f"Dataset ID: {dataset_id}")

Launch a Project

A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one.

project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="banking-text-quickstart-project",
    modality="text",
    task_type="multi-class",
    model_type="regular",
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Once the project has been launched successfully and you see your project_id, you can feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.

You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a cleanset, an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this cleanset has been created.

Warning! For big datasets, this next cell may take a long time to execute while Cleanlab’s AI model is training. If your Jupyter notebook times out during this process, you can resume work by re-running the cell below (which should return instantly if the project has completed training). Do not create a new project.

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)

Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the Cleanlab Studio web interface and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.

Download Cleanlab columns

We can fetch Cleanlab columns that store metadata for this cleanset using its cleanset_id. These columns have the same length as your original dataset and provide metadata about each individual data point, like what types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cell.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label is_ambiguous ambiguous_score is_well_labeled is_near_duplicate near_duplicate_score ... PII_score PII_types PII_items is_informal informal_score is_non_english non_english_score predicted_language is_toxic toxic_score
0 0 <NA> False 0.187229 <NA> False 0.895058 False False 0.989314 ... 0.0 [] [] False 0.011878 False 0.004226 <NA> False 0.104431
1 1 <NA> False 0.208311 <NA> False 0.901848 False False 0.946454 ... 0.0 [] [] False 0.033124 False 0.007218 <NA> False 0.181641
2 2 <NA> False 0.114819 <NA> False 0.805536 False False 0.987998 ... 0.0 [] [] False 0.160778 False 0.008784 <NA> False 0.062195
3 3 <NA> False 0.452269 <NA> False 0.946026 False False 0.962509 ... 0.0 [] [] False 0.275605 False 0.004739 <NA> False 0.184570
4 4 <NA> False 0.189948 <NA> False 0.902838 False False 0.992378 ... 0.0 [] [] False 0.390461 False 0.058934 <NA> False 0.120178

5 rows × 25 columns

Review data issues

Details about all of the Cleanlab columns and their meanings can be found in this guide. Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:

  • Label issue indicates the given label of this data point is likely wrong. For such data, consider correcting the label to the suggested_label if it seems more appropriate.
  • Ambiguous indicates this data point does not clearly belong to any of the classes (e.g. a borderline case). Multiple human annotators might disagree on how to label this data point, so you might consider refining your annotation instructions to clarify how to handle data points like this.
  • Outlier indicates this data point is very different from the rest of the data (it looks atypical). The presence of outliers may indicate problems in your data sources; consider deleting such data from your dataset if appropriate.
  • Near duplicate indicates there are other data points that are (exactly or nearly) identical to this data point. Duplicated data points can have an outsized impact on models/analytics, so consider deleting the extra copies from your dataset if appropriate.

The data points exhibiting each type of issue are indicated with boolean values in the respective is_<issue> column, and the severity of this issue in each data point is quantified in the respective <issue>_score column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
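
To get a quick sense of how prevalent each issue is, you can tally the boolean Cleanlab columns (a simple sketch using the dataframe downloaded above):

# Tally each boolean Cleanlab column (note this also includes non-issue flags like is_well_labeled)
issue_flag_columns = [col for col in cleanlab_columns_df.columns if col.startswith("is_")]
cleanlab_columns_df[issue_flag_columns].sum()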

Let’s go through some of the Cleanlab columns and types of data issues, starting with label issues (i.e. mislabeled data). We first create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).

# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)

# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_on="cleanlab_row_ID")

# Set a "given_label" column to the original label
combined_dataset_df.rename(columns={"label": "given_label"}, inplace=True)

To see which text examples are estimated to be mislabeled, we filter by is_label_issue. We sort by label_issue_score to see which of these data points are most likely mislabeled.

samples_ranked_by_label_issue_score = combined_dataset_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(samples_ranked_by_label_issue_score.head(5)[columns_to_display])
cleanlab_row_ID text label_issue_score is_label_issue given_label suggested_label
524 524 will i be sent a new card before mine expires? 0.994404 True apple_pay_or_google_pay card_about_to_expire
8 8 can i change my pin on holiday? 0.984735 True beneficiary_not_allowed change_pin
808 808 what atms will allow me to change my pin? 0.982135 True beneficiary_not_allowed change_pin
133 133 my card is almost expired. how fast will i get a new one and what is the cost? 0.980538 True apple_pay_or_google_pay card_about_to_expire
874 874 why am i being charge a fee when using an atm? 0.960389 True card_about_to_expire card_payment_fee_charged

Note that in each of these examples, the given_label really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request). Data labeling is an error-prone process and annotators make mistakes! Luckily, we can easily correct these data points by simply using Cleanlab’s suggested_label above, which seems like a much more suitable label in most cases.

While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide which data to prioritize for review. You can alternatively ignore the boolean is_label_issue flags and filter the data by thresholding the label_issue_score yourself (if, say, you find the default thresholds produce too many false positives or negatives).
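
For example (the 0.9 threshold below is a hypothetical value you would tune after reviewing your own data):

# Flag label issues via a custom score threshold instead of the default boolean flag
custom_threshold = 0.9  # hypothetical value; tune based on manual review
custom_label_issues = combined_dataset_df.query("label_issue_score > @custom_threshold")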

Next, let’s look at the ambiguous examples detected in the dataset.

samples_ranked_by_ambiguous_score = combined_dataset_df.query("is_ambiguous").sort_values("ambiguous_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "ambiguous_score", "is_ambiguous", "given_label", "suggested_label"]
display(samples_ranked_by_ambiguous_score.head(5)[columns_to_display])
cleanlab_row_ID text ambiguous_score is_ambiguous given_label suggested_label
652 652 i just made a top-up but it shows as pending! i use your service all the time and have never had a problem before. why does it keep showing up as pending? 0.987011 True cancel_transfer supported_cards_and_currencies
898 898 hi, one of payment is still coming as pending for which i have already paid by card. i guess it did not processed, could you please check and update me. 0.983816 True lost_or_stolen_phone card_payment_fee_charged
793 793 what reasons would cause my card payment to be cancelled? 0.982958 True change_pin cancel_transfer
783 783 how do i avoid charges in the future 0.980937 True card_payment_fee_charged getting_spare_card
320 320 payment did not process 0.978533 True beneficiary_not_allowed card_payment_fee_charged

Next, let’s look at the outliers detected in the dataset.

samples_ranked_by_outlier_score = combined_dataset_df.query("is_outlier").sort_values("outlier_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "outlier_score", "is_outlier", "given_label", "suggested_label"]
display(samples_ranked_by_outlier_score.head(5)[columns_to_display])
cleanlab_row_ID text outlier_score is_outlier given_label suggested_label
180 180 p 0.375975 True getting_spare_card change_pin
503 503 cancel transaction 0.276691 True cancel_transfer cancel_transfer
528 528 my sc 0.259906 True apple_pay_or_google_pay change_pin
683 683 bad bank 0.251266 True apple_pay_or_google_pay change_pin
520 520 switch banks 0.242160 True change_pin change_pin

Next, let’s look at the near duplicates detected in the dataset.

# Count distinct near-duplicate sets (nunique() ignores rows without a cluster ID)
n_near_duplicate_sets = combined_dataset_df["near_duplicate_cluster_id"].nunique()
print(f"There are {n_near_duplicate_sets} sets of near duplicate texts in the dataset.")
    There are 3 sets of near duplicate texts in the dataset.

Note that the near duplicate data points each have an associated near_duplicate_cluster_id integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point. Remember that near duplicates also include exact duplicates (which have near_duplicate_score = 1).
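
For instance, the exact duplicates can be isolated directly via their score:

# Exact duplicates are the near duplicates whose score equals 1
exact_duplicates = combined_dataset_df.query("near_duplicate_score == 1")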

Let’s check out the near duplicates with id = 0:

near_duplicate_cluster_id = 0  # play with this value to see other sets of near duplicates
selected_samples_by_near_duplicate_cluster_id = combined_dataset_df.query("near_duplicate_cluster_id == @near_duplicate_cluster_id")

columns_to_display = ["cleanlab_row_ID", "text", "near_duplicate_score", "is_near_duplicate", "given_label"]
selected_samples_by_near_duplicate_cluster_id[columns_to_display]
cleanlab_row_ID text near_duplicate_score is_near_duplicate given_label
344 344 is there a charge for sending out more cards? 0.997122 True getting_spare_card
453 453 is there a fee for sending out more cards? 0.997122 True getting_spare_card

Text issues

Cleanlab Studio can also detect potential problems in any text in your dataset, such as the occurrence of toxic language, personally identifiable information (PII), or nonsensical language (e.g. HTML/XML tags and other random strings contaminating text descriptions). The following Cleanlab columns are specific to the text fields in your dataset (see here for details).

Similar to above, the is_<issue> column contains boolean values indicating if a text field has been identified to exhibit a particular issue, and the <issue>_score column contains numeric scores between 0 and 1 indicating the severity of this particular issue (1 indicates the most severe instance of the issue).
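
Beyond the boolean flags, inspecting the distribution of a severity score can help you choose your own review thresholds. For example:

# Summarize the severity scores for one text issue (toxicity, as an example)
combined_dataset_df["toxic_score"].describe()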

Let’s take a closer look at some text issues that have been flagged in our dataset:

Note: Text issue detection is currently only provided for text modality projects running in regular mode.

Text that contains toxic language may have elements of hateful speech and language others may find harmful or aggressive. Identifying toxic language is vital in tasks such as content moderation and LLM training/evaluation, where appropriate action should be taken to ensure safe platforms, chatbots, or other applications depending on this dataset.

Here are some examples in this dataset detected to contain toxic language:

toxic_samples = combined_dataset_df.query("is_toxic").sort_values("toxic_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "toxic_score", "is_toxic"]
display(toxic_samples.head(5)[columns_to_display])
cleanlab_row_ID text toxic_score is_toxic
852 852 help me change my pin your garbage app is broken, the most pathetic bank and absolute worst customer service ever 0.837891 True
773 773 i'm really sick of your stupid requirements, just issue me the damn credit card! 0.834473 True
416 416 some f-ing lowlife mugged me, they stole everything including my phone. i can't use your app anymore, what can i do? 0.818848 True

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Exposing PII can compromise an individual’s security and hence should be safeguarded and anonymized/removed if discovered in publicly shared data.

Cleanlab’s PII detection also returns two extra columns, PII_items and PII_types, which list the specific PII detected in the text and its type. Possible types of PII that can be detected are detailed in the guide and scored according to how sensitive each type of information is.

Here are some examples of PII detected in the dataset:

PII_samples = combined_dataset_df.query("is_PII").sort_values("PII_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "PII_score", "is_PII", "PII_types", "PII_items"]
display(PII_samples.head(5)[columns_to_display])
cleanlab_row_ID text PII_score is_PII PII_types PII_items
68 68 my card number is 4012888888881881 how do I know if it is mastercard or visa? 1.0 True ["credit card"] ["4012888888881881"]
235 235 i just replaced my phone, do i have to make a new account? my username is gavdlin@gmail.com new phone number is 212-978-1213 0.5 True ["email", "phone number"] ["gavdlin@gmail.com", "212-978-1213"]
485 485 i no longer have my phone number +44 20 8356 1167, what should i do? 0.5 True ["phone number"] ["+44 20 8356 1167"]
760 760 i wish to cancel a transfer sent to judmunz@yahoo.com 0.5 True ["email"] ["judmunz@yahoo.com"]
243 243 i want to choose a new pin, name on account is alvin weber and dob 2/10/1967 0.4 True ["date of birth"] ["2/10/1967"]
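
Once PII has been located, the PII_items column makes it straightforward to anonymize. Here is a minimal sketch (not part of the original workflow) that stores redacted text in a new, hypothetical redacted_text column; it assumes PII_items is returned as a list of strings (depending on your client version it may arrive JSON-encoded, in which case parse it first):

# Sketch: replace each detected PII item with a placeholder token
def redact_pii(row):
    text = row["text"]
    for item in row["PII_items"]:  # assumes a list of strings
        text = text.replace(item, "<REDACTED>")
    return text

pii_rows = combined_dataset_df["is_PII"]
combined_dataset_df.loc[pii_rows, "redacted_text"] = combined_dataset_df[pii_rows].apply(redact_pii, axis=1)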

Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters). These are important to identify and remove in situations where we want to ensure the text fields in our data are understandable (e.g. if they are text descriptions intended to be read).

If a text data point is detected to be non-English, Cleanlab Studio will predict its language in the predicted_language column. If a single alternative language cannot be identified (either because the text mixes multiple languages or because it contains nonsensical characters), predicted_language will contain a null value.

Here are some non-English examples detected in the dataset:

non_english_samples = combined_dataset_df.query("is_non_english").sort_values("non_english_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "non_english_score", "is_non_english", "predicted_language"]
display(non_english_samples.head(5)[columns_to_display])
cleanlab_row_ID text non_english_score is_non_english predicted_language
180 180 p 0.991476 True <NA>
755 755 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> 0.949186 True <NA>
770 770 la trasferenza al mio conto non è stata consentita. 0.866523 True Italian
220 220 qué necesito hacer para cancelar una transacción? 0.828047 True Spanish

Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Its presence may be noteworthy if you expect the text in your dataset to be well-written.

Here are some examples of informal text detected in the dataset:

informal_samples = combined_dataset_df.query("is_informal").sort_values("informal_score", ascending=False)

columns_to_display = ["cleanlab_row_ID", "text", "informal_score", "is_informal"]
display(informal_samples.head(5)[columns_to_display])
cleanlab_row_ID text informal_score is_informal
528 528 my sc 0.701533 True
720 720 google pay top up not working. 0.700279 True
192 192 which atm's am i able to change my pin? 0.694062 True
564 564 i do i top up from my apple watch? 0.674827 True
476 476 google play top up help? 0.671472 True

Improve the dataset based on the detected issues

Since the results of this analysis appear reasonable, let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets; we caution against blindly copying the actions we perform below.

Based on the label issue flags, we create a new corrected_label column: it keeps the given label for data without detected label issues, and uses the suggested_label for data flagged with label issues.

corrected_label = np.where(
    combined_dataset_df["is_label_issue"],
    combined_dataset_df["suggested_label"],
    combined_dataset_df["given_label"],
)

For data marked as outlier or ambiguous, we will simply exclude them from our dataset. Here we create a boolean vector rows_to_exclude to track which data points will be excluded.

# create an exclude column to keep track of the excluded data
rows_to_exclude = combined_dataset_df["is_outlier"] | combined_dataset_df["is_ambiguous"]

For each set of near duplicates, we only want to keep one of the data points that share a common near_duplicate_cluster_id (so that the resulting dataset will no longer contain any near duplicates).

near_duplicates_to_exclude = combined_dataset_df["is_near_duplicate"] & combined_dataset_df["near_duplicate_cluster_id"].duplicated(keep="first")

rows_to_exclude |= near_duplicates_to_exclude

Note that we didn’t exclude the data with text issues here, but you might want to in your own applications (see the sketch after the next cell). We can check the total amount of excluded data:

print(f"Excluding {rows_to_exclude.sum()} text examples (out of {len(combined_dataset_df)})")
    Excluding 28 text examples (out of 1000)
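
If you also wanted to drop the rows flagged with text issues, here is a minimal sketch (left commented out, since running it would increase the excluded count above):

# Optionally also exclude rows flagged with text issues (not done in this tutorial):
# rows_to_exclude |= (
#     combined_dataset_df["is_toxic"]
#     | combined_dataset_df["is_PII"]
#     | combined_dataset_df["is_non_english"]
# )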

Finally, let’s actually make a new version of our dataset with these changes.

We craft a new dataframe from the original, applying corrections and exclusions, and then use this dataframe to save the new dataset in a separate CSV file. The new dataset is a CSV file with the same format as our original dataset; you can use it as a plug-in replacement to get more reliable results in your ML and analytics pipelines, without any change to your existing modeling code.

new_dataset_filename = "improved_dataset.csv"
# Start from the text column of the original dataset
fixed_dataset = combined_dataset_df[["text"]].copy()

# Add the corrected label column
fixed_dataset["label"] = corrected_label

# Exclude the rows flagged for removal above
fixed_dataset = fixed_dataset[~rows_to_exclude]

# Check if the file exists before saving
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite it, so please delete it first or specify a different new_dataset_filename.")
else:
    # Save the adjusted dataset to a CSV file
    fixed_dataset.to_csv(new_dataset_filename, index=False)
    print(f"Adjusted dataset saved to {new_dataset_filename}")