Catching Issues in a Dynamically Growing Dataset
This is the recommended tutorial for programmatically auditing datasets that grow over time with the Cleanlab Studio Python API.
In this tutorial, we consider data that comes in batches accumulated into a master dataset. While one could follow the other tutorials to use Cleanlab to auto-detect issues across the entire master dataset, here we demonstrate how to catch issues in the most recent batch of data. We additionally demonstrate how to fix issues in the latest data batch, in order to create a higher-quality master dataset.
While this tutorial focuses specifically on label issues for brevity, the same ideas can be applied to catch any of the other data issues Cleanlab Studio can auto-detect (outliers, near duplicates, unsafe or low-quality content, …). This tutorial focuses on text data, but the same ideas can be applied to the other data modalities Cleanlab Studio supports such as images or structured/tabular data. We recommend first completing our text data quickstart tutorial to understand how Cleanlab Studio works with a static dataset.
Install and import dependencies
Make sure you have wget installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:
%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os
import random
from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)
Fetch and view dataset
Here we use a variant of the BANKING77 text classification dataset, in which customer service requests are labeled as belonging to one of K classes (intent categories). To fetch the dataset for this tutorial, make sure you have wget and zip installed.
mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/growing_dataset.zip -O data/growing_dataset.zip
unzip -q data/growing_dataset.zip -d data/
This data is split across 3 batches of unequal size, which will be received incrementally in this tutorial. Batch 3 contains 1 class that was not seen in batches 1 or 2, to help you learn how to handle applications where certain dataset classes might appear or disappear over time. Let's load the first batch of data.
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/growing_dataset")
batch_1 = pd.read_csv(os.path.join(dataset_path, "data_batch1.csv"))
Ensure unique identifier for data points
For a dynamic dataset, having a unique identifier for each data point allows us to better track results.
The current dataset has two columns - text and label - neither of which can be used to uniquely identify a row.
We can add a column id containing sequential numbers from 0 to batch size - 1. For each subsequent batch, the ids will continue from the size of the data accumulated so far up to the total size of the merged dataset.
# Create a new column and assign sequential numbers till batch size
batch_1["id"] = range(0, len(batch_1))
batch_1.head()
| | text | label | id |
---|---|---|---|
0 | why is there a fee when i thought there would be no fees? | card_payment_fee_charged | 0 |
1 | why can't my beneficiary access my account? | beneficiary_not_allowed | 1 |
2 | does it cost extra to send out more than one card? | getting_spare_card | 2 |
3 | can i change my pin at an atm? | change_pin | 3 |
4 | i have a us credit card, will you accept it? | supported_cards_and_currencies | 4 |
print(f"The total number of rows in the current dataset: {len(batch_1)}")
Load batch 1 into Cleanlab Studio
Upon receiving our batch of data, let's load it into Cleanlab Studio for analysis. First use your API key to instantiate a Studio object.
from cleanlab_studio import Studio
# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
studio = Studio(API_KEY)
Load the data from batch 1 into Cleanlab Studio. More details on formatting datasets can be found in this guide.
dataset_id = studio.upload_dataset(batch_1, dataset_name="data-batch-1")
print(f"Dataset ID: {dataset_id}")
Launch a Project
A Cleanlab Studio Project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one for the data we have received so far.
Note: The label_column and text_column specified below happen to be called label and text in this example. Set these arguments to the names of the columns in your dataset that contain the class labels and the text field, respectively. If you have multiple text columns, please merge them into a single column as demonstrated in our FAQ.
label_column = "label" # name of column containing labels
text_column = "text"
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="batch-1-analysis",
modality="text",
task_type="multi-class",
model_type="regular", # set this to "fast" if time-constrained
label_column=label_column,
text_column=text_column
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
Once the project has been launched successfully and the project_id is visible, feel free to close this notebook. It will take some time for Cleanlab's AI to train models on this data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results. The training for this example should take approximately 10 minutes.
You should only execute the above cell once per data batch. After launching the project, you can poll for its status to programmatically wait until the results are ready for review:
# Fetch the cleanset id corresponding to the above project_id
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
If your notebook timed out, you can resume by re-running the above lines of code. Do not create a new project for the same batch when coming back to this notebook. When the project is complete, the resulting cleanset contains many Cleanlab columns of metadata that can be used to produce a cleaner version of our original dataset.
Review issues detected in batch 1
Fetch the Cleanlab columns of metadata for this cleanset using its cleanset_id. These columns have the same length as our original data batch and provide metadata about each individual data point, like what types of issues it exhibits and how severely.
If at any point you want to re-run the remaining parts of this notebook (without creating another Project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cell.
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
| | cleanlab_row_ID | corrected_label | is_label_issue | label_issue_score | suggested_label | suggested_label_confidence_score | is_ambiguous | ambiguous_score | is_well_labeled | is_near_duplicate | ... | non_english_score | predicted_language | is_toxic | toxic_score | sentiment_score | bias_score | is_biased | gender_bias_score | racial_bias_score | sexual_orientation_bias_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | <NA> | False | 0.311937 | <NA> | 0.477899 | False | 0.946097 | True | False | ... | 0.007218 | <NA> | False | 0.181641 | 0.134277 | 0.394824 | False | 0.394824 | 0.232544 | 0.146484 |
1 | 2 | <NA> | False | 0.254897 | <NA> | 0.565403 | False | 0.893302 | True | False | ... | 0.008784 | <NA> | False | 0.062164 | 0.159088 | 0.259082 | False | 0.259082 | 0.093933 | 0.024902 |
2 | 3 | <NA> | False | 0.494578 | <NA> | 0.230432 | False | 0.968427 | False | False | ... | 0.004739 | <NA> | False | 0.184692 | 0.377869 | 0.244189 | False | 0.244189 | 0.131470 | 0.013443 |
3 | 4 | <NA> | False | 0.353640 | <NA> | 0.411544 | False | 0.921234 | True | False | ... | 0.058934 | <NA> | False | 0.120056 | 0.543640 | 0.379687 | False | 0.379687 | 0.112732 | 0.013229 |
4 | 5 | <NA> | False | 0.333641 | <NA> | 0.441447 | False | 0.892788 | True | False | ... | 0.008220 | <NA> | False | 0.107422 | 0.573120 | 0.161621 | False | 0.146045 | 0.161621 | 0.002678 |
5 rows × 38 columns
Details about the Cleanlab columns and their meanings can be found in the Cleanlab columns guide.
In this tutorial, we focus on label issues only. The ideas demonstrated here can be used for other types of data issues that Cleanlab Studio auto-detects.
A data point flagged with a label issue likely has a wrong given label. For such data points, consider correcting their label to the suggested_label if it seems more appropriate.
The data points exhibiting this issue are indicated with boolean values in the is_label_issue column, and the severity of this issue in each data point is quantified in the label_issue_score column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
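As noted above, the same flag/score pattern applies to every other issue type Cleanlab Studio detects. Here is a purely illustrative sketch (not run in this tutorial) of how you might review near duplicates instead, assuming a near_duplicate_score column accompanies the is_near_duplicate flag shown above (see the Cleanlab columns guide for the authoritative column names):
# Illustrative sketch: review another issue type via its boolean flag and numeric score.
# `near_duplicate_score` is assumed here by analogy with `label_issue_score`.
near_duplicates = cleanlab_columns_df.query("is_near_duplicate", engine="python")
near_duplicates = near_duplicates.sort_values("near_duplicate_score", ascending=False)
display(near_duplicates.head())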
Let's create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).
# Copy data into a new DataFrame
df1 = batch_1.copy()
# Combine the dataset with cleanlab columns
merge_df_cleanlab = df1.merge(cleanlab_columns_df, left_index=True, right_index=True)
# Rename label column to "given_label"
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
To see which data points are estimated to be mislabeled, we filter by is_label_issue. We sort by label_issue_score to see which of these data points are most likely mislabeled.
label_issues = merge_df_cleanlab.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)
columns_to_display = ["id", "text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(label_issues[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
7 | 7 | can i change my pin on holiday? | 0.706490 | True | beneficiary_not_allowed | change_pin |
459 | 459 | will i be sent a new card before mine expires? | 0.665202 | True | apple_pay_or_google_pay | card_about_to_expire |
117 | 117 | my card is almost expired. how fast will i get a new one and what is the cost? | 0.656704 | True | apple_pay_or_google_pay | card_about_to_expire |
54 | 54 | is it possible to change my pin? | 0.648337 | True | beneficiary_not_allowed | change_pin |
160 | 160 | p | 0.605484 | True | getting_spare_card | supported_cards_and_currencies |
115 | 115 | can i get a new card even though i am in china? | 0.575734 | True | apple_pay_or_google_pay | card_about_to_expire |
119 | 119 | what currencies does google pay top up accept? | 0.557280 | True | apple_pay_or_google_pay | supported_cards_and_currencies |
369 | 369 | do i need to verify my top-up card? | 0.527742 | True | getting_spare_card | apple_pay_or_google_pay |
Note that in most of these rows, the given_label really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request), except for the rows with a label issue score below 0.7. Luckily we can easily correct these data points by just using Cleanlab's suggested_label above, which seems like a more appropriate label in most cases.
While the boolean flags above help us estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. In this tutorial, we use a threshold on label_issue_score to select which data points to fix, excluding the flagged data points that do not meet the threshold.
Improve batch 1 data based on the detected issues
Let's use the Cleanlab columns to improve the quality of our dataset. For your own datasets, the actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets, so we caution against blindly copying the actions we perform here.
For data flagged as label issues, we create a new corrected_label column, which contains the given_label for data points without detected label issues and the suggested_label for data points with detected label issues. We use a label_issue_score threshold of 0.7 to determine which data points to re-label. The remaining data points flagged as label issues will be excluded from the dataset to avoid potential contamination.
Throughout, we track all of the rows we fixed (re-labeled) or excluded.
# Set issue score threshold
label_threshold = 0.70
# DataFrame to track excluded rows
threshold_filtered_rows = label_issues.query("label_issue_score < @label_threshold")
# Find indices of rows to exclude
ids_to_exclude1 = threshold_filtered_rows["id"]
indices_to_exclude1 = merge_df_cleanlab.query("id in @ids_to_exclude1").index
print(f"Excluding {len(threshold_filtered_rows)} text examples (out of {len(merge_df_cleanlab)})")
# Drop rows from the merge DataFrame
merge_df_cleanlab = merge_df_cleanlab.drop(indices_to_exclude1)
corrected_label = np.where(merge_df_cleanlab["is_label_issue"],
merge_df_cleanlab["suggested_label"],
merge_df_cleanlab["given_label"])
# DataFrame to track fixed (re-labeled) rows
label_issues_fixed_rows = merge_df_cleanlab.query("is_label_issue", engine="python")
Let’s make a cleaned version of the batch 1 data after applying these corrections:
fixed_batch_1 = merge_df_cleanlab[["text", "id"]].copy()
fixed_batch_1["label"] = corrected_label
Let’s also initialize our curated master dataset, a single DataFrame to store the clean data points accumulated across all the data batches.
fixed_dataset = pd.DataFrame(columns=["text", "label", "id"])
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_1], ignore_index=True) # add clean data from batch 1
Perfect! Now let's grow our master dataset after receiving the data from batch 2.
Adding a second batch of data
Our fixed_dataset currently contains the cleaned version of our first data batch. Suppose now we've collected a second batch of data, consisting of 300 rows, which we wish to add to this master fixed dataset.
batch_2 = pd.read_csv(os.path.join(dataset_path, "data_batch2.csv"))
We will again add the unique identifier id, whose values continue from where batch 1 left off: they start at the size of batch 1 (500) and run up to (but not including) the combined size of batches 1 and 2 (800).
total_rows = len(batch_1) + len(batch_2)
batch_2["id"] = range(len(batch_1), total_rows)
batch_2.head()
| | text | label | id |
---|---|---|---|
0 | i received my american express in apple pay, why is top up not working? | apple_pay_or_google_pay | 500 |
1 | i want to change my pin - do i need to be in a bank? | change_pin | 501 |
2 | i want to use a payment card to top up my account. how can i do this? | supported_cards_and_currencies | 502 |
3 | i would like to give another card to my daughter, how can i proceed? | getting_spare_card | 503 |
4 | is there a location where i can change my pin? | change_pin | 504 |
Let's first check what class labels are present in this second batch of data. We define a helper function to compare the unique values of the label column against our previously observed data.
Optional: Initialize helper method to compare classes
def compare_classes(new_data, historical_data, label_column):
    historical_data_classes = set(historical_data[label_column])
    new_batch_classes = set(new_data[label_column])
    if len(historical_data_classes.difference(new_batch_classes)) > 0:
        print(f"New batch has no data points from the following classes: {historical_data_classes.difference(new_batch_classes)}")
    if len(new_batch_classes.difference(historical_data_classes)) > 0:
        print(f"New batch has data points from previously unseen classes: {new_batch_classes.difference(historical_data_classes)}")
compare_classes(batch_2, fixed_dataset, label_column)
Here we don’t act on this information, but such information may be concerning depending on your application.
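If you did want to act on it, one hypothetical option (purely illustrative, not performed in this tutorial) is to hold out data points from previously unseen classes for separate review before analysis:
# Hypothetical example (not run here): hold out data points whose label is a class
# never seen in the clean historical data, so they can be reviewed separately.
unseen_classes = set(batch_2[label_column]) - set(fixed_dataset[label_column])
held_out_for_review = batch_2[batch_2[label_column].isin(unseen_classes)]
print(f"Holding out {len(held_out_for_review)} data points from previously unseen classes")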
We’ll repeat the Cleanlab Studio steps that we previously performed for our first data batch, this time on a larger dataset composed of our clean historical data plus the newest data batch.
batch_1_2 = pd.concat([fixed_dataset, batch_2], ignore_index=True)
Load dataset, launch Project, and get Cleanlab columns
Again: if your notebook times out during any of the following steps, you likely don’t need to re-run that step (re-running the step may take a long time again). Instead try to run the next step after restarting your notebook.
dataset_id = studio.upload_dataset(batch_1_2, dataset_name="data-batch-1-2")
print(f"Dataset ID: {dataset_id}")
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="batch-1-2-analysis",
modality="text",
task_type="multi-class",
model_type="regular",
label_column=label_column,
text_column=text_column
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
Review issues detected in batch 2
Similar to how we reviewed the label issues detected in batch 1 data, here we will focus on the label issues detected in the newest (second) batch of data. Note that our Project analyzed this batch of data together with the clean historical data, as more data allows Cleanlab’s AI to more accurately detect data issues. As before, the first step toward reviewing results is to merge the Cleanlab columns with the dataset that the Project was run on:
df2 = batch_1_2.copy()
merge_df_cleanlab = df2.merge(cleanlab_columns_df, left_index=True, right_index=True)
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
The current merge_df_cleanlab dataset consists of both the cleaned historical data (from batch 1) and the raw batch 2 data. Here we demonstrate how to focus on catching label issues in the new (batch 2) data only:
# Use identifier to create an array of batch-2 id's
ids_of_batch_2 = batch_2["id"]
# Isolate batch-2 data from the merged dataset
batch_2_subset = merge_df_cleanlab.query("id in @ids_of_batch_2", engine="python")
# Get batch 2 rows flagged as label issues
label_issues = batch_2_subset.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)
display(label_issues[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
742 | 749 | why am i being charge a fee when using an atm? | 0.789770 | True | card_about_to_expire | card_payment_fee_charged |
686 | 693 | what atms will allow me to change my pin? | 0.716286 | True | beneficiary_not_allowed | change_pin |
788 | 795 | what services can i use to top up? | 0.676584 | True | apple_pay_or_google_pay | supported_cards_and_currencies |
652 | 659 | why do i see extra charges for withdrawing my money? | 0.672628 | True | card_about_to_expire | card_payment_fee_charged |
587 | 594 | bad bank | 0.601791 | True | apple_pay_or_google_pay | supported_cards_and_currencies |
Improve batch 2 data based on the detected issues
Assume we are working with a production data pipeline where fixing issues in the most recent batch of data is the highest priority. Just as before, we can apply the same strategy to clean the batch 2 data (re-label the flagged data points with label_issue_score above the same threshold and exclude the rest of the label issues from the dataset).
# Keep track of excluded rows
issues_below_threshold = label_issues.query("label_issue_score < @label_threshold", engine="python")
threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])
# Find indices of rows to exclude
ids_to_exclude2 = issues_below_threshold["id"]
indices_to_exclude2 = batch_2_subset.query("id in @ids_to_exclude2", engine="python").index
print(f"Excluding {len(ids_to_exclude2)} text example(s) (out of {len(batch_2_subset)} from batch-2)")
# Drop rows from the batch-2 subset
batch_2_subset = batch_2_subset.drop(indices_to_exclude2)
corrected_label = np.where(batch_2_subset["is_label_issue"],
batch_2_subset["suggested_label"],
batch_2_subset["given_label"])
# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_2_subset.query("is_label_issue", engine="python")])
Applying these corrections produces a cleaned version of the batch 2 data. We add this cleaned batch 2 data to our master fixed dataset (which up to this point contained the cleaned batch 1 data).
fixed_batch_2 = batch_2_subset[["text", "id"]].copy()
fixed_batch_2["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_2], ignore_index=True)
Awesome! We have grown our master dataset with the additional data being collected, while still ensuring this dataset is clean and free of label issues.
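As an optional sanity check (purely illustrative, not part of the original workflow), you can inspect the size and class balance of the master dataset at any point:
# Optional sanity check on the accumulated master dataset
print(f"Master dataset now contains {len(fixed_dataset)} rows")
print(fixed_dataset["label"].value_counts())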
Adding another batch of data
Finally, let’s clean and then add another batch of newly collected data (200 rows) to our master dataset.
batch_3 = pd.read_csv(os.path.join(dataset_path, "data_batch3.csv"))
# Create an id column for unique identification of rows
total_rows = total_rows + len(batch_3)
batch_3["id"] = range(len(batch_1) + len(batch_2), total_rows)
batch_3.head()
| | text | label | id |
---|---|---|---|
0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. | cancel_transfer | 800 |
1 | i need help as fast as possible! i made a mistake on my most recent transfer; can you please stop it before it goes through? | cancel_transfer | 801 |
2 | i already made a transfer and want to cancel it, how do i do that? | cancel_transfer | 802 |
3 | i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow | cancel_transfer | 803 |
4 | hi, i made a transfer yesterday that i need to reverse. i need to put the money in a different account. | cancel_transfer | 804 |
Let’s first check what class labels are present in this third batch of data.
compare_classes(batch_3, fixed_dataset, label_column)
Here we don't act on this information, but it may be a concern depending on your application. We repeat the same Cleanlab Studio steps that we performed before.
batch_1_2_3 = pd.concat([fixed_dataset, batch_3], ignore_index=True)
Load dataset, launch Project, and get Cleanlab columns
Again: if your notebook times out during any of the following steps, you likely don't need to re-run that step (re-running the step may take a long time again). Instead try to run the next step after restarting your notebook. The training for this example could take approximately 15 minutes.
dataset_id = studio.upload_dataset(batch_1_2_3, dataset_name="data-batch-1-3")
print(f"Dataset ID: {dataset_id}")
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="batch-1-3-analysis",
modality="text",
task_type="multi-class",
model_type="regular",
label_column=label_column,
text_column=text_column
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
Review issues detected in batch 3
df3 = batch_1_2_3.copy()
merge_df_cleanlab = df3.merge(cleanlab_columns_df, left_index=True, right_index=True)
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
The merged dataset batch_1_2_3 consists of clean historical data (from batches 1 + 2) and the raw batch 3 data.
As before, let’s focus on the issues detected in the batch 3 data:
# Create an array of batch-3 id's
ids_of_batch_3 = batch_3["id"]
# Isolate batch-3 subset from the current dataset
batch_3_subset = merge_df_cleanlab.query("id in @ids_of_batch_3", engine="python")
# Fetch rows with label issues
label_issues = batch_3_subset.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)
display(label_issues[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
850 | 860 | which currencies can i used to add funds to my account? | 0.850656 | True | cancel_transfer | supported_cards_and_currencies |
978 | 988 | i was charged for getting cash. | 0.834128 | True | card_about_to_expire | card_payment_fee_charged |
949 | 959 | so, i was just charged for my recent atm withdrawal and any withdrawal prior to this has been free. what is the issue here? | 0.769915 | True | card_about_to_expire | card_payment_fee_charged |
840 | 850 | how long does it take for a top up to be approved? | 0.582045 | True | cancel_transfer | supported_cards_and_currencies |
Review issues in older batches
While we’ve been focusing on the issues detected in the latest batch of data only, we can also see if any issues have been detected in the older historical data (the cleaned version of batches 1 and 2). Now that there is significantly more data in the Cleanlab Studio Project, the AI is able to detect data issues more accurately and may find issues missed in previous rounds. Let’s see if there are any new label issues detected in the previous cleaned versions of batches 1 and 2:
batch_1_2_subset = merge_df_cleanlab.query("id not in @ids_of_batch_3", engine="python")
display(batch_1_2_subset.query("is_label_issue", engine="python")[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
39 | 39 | please tell me how to change my pin. | 0.830466 | True | beneficiary_not_allowed | change_pin |
67 | 68 | how do i find my new pin? | 0.811987 | True | visa_or_mastercard | change_pin |
90 | 91 | explain roth ira | 0.585797 | True | beneficiary_not_allowed | supported_cards_and_currencies |
94 | 95 | what cards do you offer? | 0.726222 | True | visa_or_mastercard | supported_cards_and_currencies |
While we’ve been fixing the label issues detected in each older data batch at the time the data was collected, this doesn’t guarantee the older batches are 100% free of label issues.
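Although this tutorial only fixes the newest batch, you could also propagate newly detected high-confidence corrections back into the master dataset. Here is a minimal sketch of that idea (hypothetical, reusing the same threshold; we do not run this here):
# Hypothetical: relabel historical rows in the master dataset that are newly flagged
# with high-confidence label issues in the latest Project.
historical_issues = batch_1_2_subset.query("(is_label_issue) & (label_issue_score >= @label_threshold)", engine="python")
id_to_suggested = dict(zip(historical_issues["id"], historical_issues["suggested_label"]))
mask = fixed_dataset["id"].isin(id_to_suggested)
fixed_dataset.loc[mask, "label"] = fixed_dataset.loc[mask, "id"].map(id_to_suggested)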
To demonstrate another type of issue, we can also review the outliers detected in these data batches in isolation. From the latest Cleanlab Studio Project, here are the outliers detected in batch 3:
columns_to_display_outlier = ["id", "text", "given_label", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier"]
outlier_issues_batch_3 = merge_df_cleanlab.query('(id in @ids_of_batch_3) & (is_outlier)', engine='python')
display(outlier_issues_batch_3[columns_to_display_outlier])
| | id | text | given_label | outlier_score | is_empty_text | text_num_characters | is_outlier |
---|---|---|---|---|---|---|---|
852 | 862 | cancel transaction | cancel_transfer | 0.178126 | False | 18 | True |
Outliers detected in batch 2:
outlier_issues_batch_2 = merge_df_cleanlab.query('(id in @ids_of_batch_2) & (is_outlier)', engine='python')
display(outlier_issues_batch_2[columns_to_display_outlier])
| | id | text | given_label | outlier_score | is_empty_text | text_num_characters | is_outlier |
---|---|---|---|---|---|---|---|
502 | 509 | metal card | card_about_to_expire | 0.234098 | False | 10 | True |
561 | 568 | changing my pin | change_pin | 0.172110 | False | 15 | True |
582 | 589 | 750 credit score | getting_spare_card | 0.154556 | False | 16 | True |
639 | 647 | 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> | change_pin | 0.202901 | False | 61 | True |
Outliers detected in batch 1:
ids_of_batch_1 = batch_1["id"]
outlier_issues_batch_1 = merge_df_cleanlab.query("(id in @ids_of_batch_1) & (is_outlier)", engine="python")
display(outlier_issues_batch_1[columns_to_display_outlier])
| | id | text | given_label | outlier_score | is_empty_text | text_num_characters | is_outlier |
---|---|---|---|---|---|---|---|
90 | 91 | explain roth ira | beneficiary_not_allowed | 0.176312 | False | 16 | True |
280 | 285 | payment did not process | beneficiary_not_allowed | 0.184214 | False | 23 | True |
450 | 456 | switch banks | change_pin | 0.186031 | False | 12 | True |
456 | 463 | my sc | apple_pay_or_google_pay | 0.247411 | False | 5 | True |
Improve batch 3 data based on the detected issues
Finally, we fix just the label issues detected in batch 3, using the same strategy applied to the previous data batches.
# Keep track of excluded rows
issues_below_threshold = label_issues.query("label_issue_score < @label_threshold", engine="python")
threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])
# Find indices of rows to exclude
ids_to_exclude3 = issues_below_threshold["id"]
indices_to_exclude3 = batch_3_subset.query("id in @ids_to_exclude3", engine="python").index
print(f"Excluding {len(ids_to_exclude3)} text example(s) (out of {len(batch_3_subset)} from batch-3)")
# Drop rows from the batch-3 subset
batch_3_subset = batch_3_subset.drop(indices_to_exclude3)
corrected_label = np.where(batch_3_subset["is_label_issue"],
batch_3_subset["suggested_label"],
batch_3_subset["given_label"])
# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_3_subset.query("is_label_issue", engine="python")])
And then add the cleaned batch 3 data to our master dataset:
fixed_batch_3 = batch_3_subset[["text", "id"]].copy()
fixed_batch_3["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_3], ignore_index=True)
print(f"Total number of label issues fixed across all 3 batches: {len(label_issues_fixed_rows)}")
print(f"Total number of rows, with label issues, excluded due to score less than threshold ({label_threshold}): {len(threshold_filtered_rows)}")
Saving the master dataset
After cleaning and accumulating multiple batches of data, we save the resulting master fixed dataset to a CSV file. The cleaned dataset has the same format as your original dataset, so you can use it as a plug-in replacement in your existing ML/Analytics pipelines to get more reliable results (without changing your existing modeling code).
new_dataset_filename = "fixed_dataset.csv" # Location to save clean master dataset
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite it, so please delete it first, or specify a different new_dataset_filename.")
else:
    fixed_dataset.to_csv(new_dataset_filename, index=False, columns=["text", "label"])
    print(f"Master fixed dataset saved to {new_dataset_filename}")
Faster methods
The approach demonstrated here is how we recommend handling growing datasets in the generally available version of Cleanlab Studio. If your only goal is to label/categorize data coming in at rapid volumes, you can instead deploy an ML model to more quickly process incoming data. For companies with particular data workloads, Cleanlab offers more compute-efficient and integrated solutions that scale to larger volumes of incoming data. Contact us to learn more.
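For reference, a rough sketch of what model-based labeling could look like with the Python API is shown below. The model_id placeholder comes from deploying a model out of an existing Project, and the method names (studio.get_model, model.predict) and input format are assumptions here; consult the Cleanlab Studio model deployment / inference tutorial for the authoritative usage.
# Rough sketch only (method names and inputs are assumptions, not verified here):
# deploy a model from an existing Project, then label new incoming data programmatically.
model = studio.get_model("<insert your model_id>")  # hypothetical model_id placeholder
new_texts = ["how do i reset my pin?", "why was i charged a fee at the atm?"]
predicted_labels = model.predict(new_texts)
print(predicted_labels)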