
Catching Issues in a Dynamically Growing Dataset


This is the recommended tutorial for programmatically auditing datasets that grow over time with the Cleanlab Studio Python API.

In this tutorial, we consider data that comes in batches accumulated into a master dataset. While one could follow the other tutorials to use Cleanlab to auto-detect issues across the entire master dataset, here we demonstrate how to catch issues in the most recent batch of data. We additionally demonstrate how to fix issues in the latest data batch, in order to create a higher-quality master dataset.

While this tutorial focuses specifically on label issues for brevity, the same ideas apply to any of the other data issues Cleanlab Studio can auto-detect (outliers, near duplicates, unsafe or low-quality content, …). This tutorial uses text data, but the same ideas apply to the other data modalities Cleanlab Studio supports, such as images or structured/tabular data. We recommend completing our text data quickstart tutorial first to understand how Cleanlab Studio works with a static dataset.

Install and import dependencies

Make sure you have wget installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os
import random

from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

Fetch and view dataset

Here we use a variant of the BANKING77 text classification dataset, in which customer service requests are labeled as belonging to one of K classes (intent categories). To fetch the dataset for this tutorial, make sure you have wget and unzip installed.

mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/growing_dataset.zip -O data/growing_dataset.zip
unzip -q data/growing_dataset.zip -d data/

This data is split across 3 unequally sized batches, which we will receive incrementally in this tutorial. Batch 3 contains one class that does not appear in batches 1 or 2, to illustrate how to handle applications where certain dataset classes might appear or disappear over time. Let’s load the first batch of data.

BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/growing_dataset")

batch_1 = pd.read_csv(os.path.join(dataset_path, "data_batch1.csv"))

Ensure unique identifier for data points

For a dynamic dataset, having a unique identifier for each data point allows us to better track results. The current dataset has two columns, text and label, neither of which can be used to uniquely identify a row.

We can add an id column containing sequential numbers, from 0 to batch size - 1. For subsequent batches, we’ll continue numbering from the size of the data accumulated so far up to the total size of the merged dataset.

# Create an id column with sequential numbers from 0 to batch size - 1
batch_1["id"] = range(0, len(batch_1))

batch_1.head()
text label id
0 why is there a fee when i thought there would be no fees? card_payment_fee_charged 0
1 why can't my beneficiary access my account? beneficiary_not_allowed 1
2 does it cost extra to send out more than one card? getting_spare_card 2
3 can i change my pin at an atm? change_pin 3
4 i have a us credit card, will you accept it? supported_cards_and_currencies 4
print(f"The total number of rows in the current dataset: {len(batch_1)}")
    The total number of rows in the current dataset: 500

Load batch 1 into Cleanlab Studio

Upon receiving our batch of data, let’s load it into Cleanlab Studio for analysis. First use your API key to instantiate a Studio object.

from cleanlab_studio import Studio

# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

studio = Studio(API_KEY)

Load the data from batch 1 into Cleanlab Studio. More details on formatting datasets can be found in this guide.

dataset_id = studio.upload_dataset(batch_1, dataset_name="data-batch-1")
print(f"Dataset ID: {dataset_id}")

Launch a Project

A Cleanlab Studio Project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one for the data we have received so far.

Note: The label_column and text_column arguments specified below happen to be named label and text for this example. Set these arguments to the names of the columns in your dataset that contain the labels and the text field, respectively. If you have multiple text columns, please merge them into a single column as demonstrated in our FAQ.

label_column = "label"  # name of column containing labels
text_column = "text"
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="batch-1-analysis",
    modality="text",
    task_type="multi-class",
    model_type="regular",  # set this to "fast" if time-constrained
    label_column=label_column,
    text_column=text_column,
)
print(f"Project successfully created and training has begun! project_id: {project_id}")

Once the project has been launched successfully and the project_id is visible, feel free to close this notebook. It will take some time for Cleanlab’s AI to train models on this data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results. Training for this example should take approximately 10 minutes.

You should only execute the above cell once per data batch. After launching the project, you can poll for its status to programmatically wait until the results are ready for review:

# Fetch the cleanset id corresponding to the above project_id
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)

If your notebook timed out, you can resume by re-running the above lines of code. Do not create a new project for the same batch when coming back to this notebook. When the project is complete, the resulting cleanset contains many Cleanlab columns of metadata that can be used to produce a cleaner version of our original dataset.

Review issues detected in batch 1

Fetch the Cleanlab columns of metadata for this cleanset using its cleanset_id. These columns have the same length as our original data batch and provide metadata about each individual data point, like what types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another Project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cell.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label suggested_label_confidence_score is_ambiguous ambiguous_score is_well_labeled is_near_duplicate ... non_english_score predicted_language is_toxic toxic_score sentiment_score bias_score is_biased gender_bias_score racial_bias_score sexual_orientation_bias_score
0 1 <NA> False 0.311937 <NA> 0.477899 False 0.946097 True False ... 0.007218 <NA> False 0.181641 0.134277 0.394824 False 0.394824 0.232544 0.146484
1 2 <NA> False 0.254897 <NA> 0.565403 False 0.893302 True False ... 0.008784 <NA> False 0.062164 0.159088 0.259082 False 0.259082 0.093933 0.024902
2 3 <NA> False 0.494578 <NA> 0.230432 False 0.968427 False False ... 0.004739 <NA> False 0.184692 0.377869 0.244189 False 0.244189 0.131470 0.013443
3 4 <NA> False 0.353640 <NA> 0.411544 False 0.921234 True False ... 0.058934 <NA> False 0.120056 0.543640 0.379687 False 0.379687 0.112732 0.013229
4 5 <NA> False 0.333641 <NA> 0.441447 False 0.892788 True False ... 0.008220 <NA> False 0.107422 0.573120 0.161621 False 0.146045 0.161621 0.002678

5 rows × 38 columns

Details about the Cleanlab columns and their meanings can be found in the Cleanlab columns guide.

In this tutorial, we focus on label issues only. The ideas demonstrated here can be used for other types of data issues that Cleanlab Studio auto-detects.

A data point flagged with a label issue likely has a wrong given label. For such data points, consider correcting their label to the suggested_label if it seems more appropriate.

The data points exhibiting this issue are indicated with boolean values in the is_label_issue column, and the severity of this issue in each data point is quantified in the label_issue_score column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
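For a quick sense of how prevalent label issues are in this batch, you can summarize these two columns (an optional check, not part of the tutorial's required steps):

# Optional: count flagged data points and inspect the score distribution.
num_flagged = cleanlab_columns_df["is_label_issue"].sum()
print(f"{num_flagged} of {len(cleanlab_columns_df)} data points flagged as label issues")
print(cleanlab_columns_df["label_issue_score"].describe())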

Let’s create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).

# Copy data into a new DataFrame
df1 = batch_1.copy()

# Combine the dataset with cleanlab columns
merge_df_cleanlab = df1.merge(cleanlab_columns_df, left_index=True, right_index=True)

# Rename label column to "given_label"
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)

To see which data points are estimated to be mislabeled, we filter by is_label_issue. We sort by label_issue_score to see which of these data points are most likely mislabeled.

label_issues = merge_df_cleanlab.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)

columns_to_display = ["id", "text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]

display(label_issues[columns_to_display])
id text label_issue_score is_label_issue given_label suggested_label
7 7 can i change my pin on holiday? 0.706490 True beneficiary_not_allowed change_pin
459 459 will i be sent a new card before mine expires? 0.665202 True apple_pay_or_google_pay card_about_to_expire
117 117 my card is almost expired. how fast will i get a new one and what is the cost? 0.656704 True apple_pay_or_google_pay card_about_to_expire
54 54 is it possible to change my pin? 0.648337 True beneficiary_not_allowed change_pin
160 160 p 0.605484 True getting_spare_card supported_cards_and_currencies
115 115 can i get a new card even though i am in china? 0.575734 True apple_pay_or_google_pay card_about_to_expire
119 119 what currencies does google pay top up accept? 0.557280 True apple_pay_or_google_pay supported_cards_and_currencies
369 369 do i need to verify my top-up card? 0.527742 True getting_spare_card apple_pay_or_google_pay

Note that in most of these rows, the given_label really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request), though this is less clear for the rows with a label issue score below 0.7. Luckily, we can easily correct these data points by using Cleanlab’s suggested_label above, which seems like a more appropriate label in most cases.

While the boolean flags above help us estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. In this tutorial, we use a threshold on label_issue_score to select which data points to fix, excluding the rest of the flagged data points that don’t meet the threshold.
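If you are unsure where to set this threshold, a quick illustrative check (not part of the required workflow) is to see how many flagged rows would be re-labeled versus excluded at a few candidate values:

# Illustrative only: compare how many flagged rows would be re-labeled vs. excluded
# at a few candidate thresholds before committing to one.
for t in (0.5, 0.6, 0.7):
    n_fix = (label_issues["label_issue_score"] >= t).sum()
    print(f"threshold={t}: re-label {n_fix}, exclude {len(label_issues) - n_fix}")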

Improve batch 1 data based on the detected issues

Let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, the actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets; we caution against blindly copying the actions we perform here.

For data flagged as label issues, we create a new corrected_label column, which will be the given_label for data points without detected label issues, and the suggested_label for data points with detected label issues. We use a label_issue_score threshold of 0.7 to determine which data points to re-label. The remaining data points flagged as label issues will be excluded from the dataset to avoid potential contamination.

Throughout, we track all of the rows we fixed (re-labeled) or excluded.

# Set issue score threshold
label_threshold = 0.70

# DataFrame to track excluded rows
threshold_filtered_rows = label_issues.query("label_issue_score < @label_threshold")

# Find indices of rows to exclude
ids_to_exclude1 = threshold_filtered_rows["id"]
indices_to_exclude1 = merge_df_cleanlab.query("id in @ids_to_exclude1").index

print(f"Excluding {len(threshold_filtered_rows)} text examples (out of {len(merge_df_cleanlab)})")

# Drop rows from the merge DataFrame
merge_df_cleanlab = merge_df_cleanlab.drop(indices_to_exclude1)

corrected_label = np.where(merge_df_cleanlab["is_label_issue"],
                           merge_df_cleanlab["suggested_label"],
                           merge_df_cleanlab["given_label"])

# DataFrame to track fixed (re-labeled) rows
label_issues_fixed_rows = merge_df_cleanlab.query("is_label_issue", engine="python")
    Excluding 7 text examples (out of 500)

Let’s make a cleaned version of the batch 1 data after applying these corrections:

fixed_batch_1 = merge_df_cleanlab[["text", "id"]].copy()
fixed_batch_1["label"] = corrected_label

Let’s also initialize our curated master dataset, a single DataFrame to store the clean data points accumulated across all the data batches.

fixed_dataset = pd.DataFrame(columns=["text", "label", "id"])
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_1], ignore_index=True) # add clean data from batch 1

Perfect! Now let’s grow our master dataset after receiving the data from batch 2.

Adding a second batch of data

Our fixed_dataset currently contains the cleaned version of our first data batch. Suppose now we’ve collected a second batch of data, consisting of 300 rows, which we wish to add to this master fixed dataset.

batch_2 = pd.read_csv(os.path.join(dataset_path, "data_batch2.csv"))

We will again add a unique identifier column id. For batch 2, the ids start at the size of batch 1 (500) and run up to (but not including) the combined size of batch 1 + batch 2 (800).

total_rows = len(batch_1) + len(batch_2)
batch_2["id"] = range(len(batch_1), total_rows)
batch_2.head()
text label id
0 i received my american express in apple pay, why is top up not working? apple_pay_or_google_pay 500
1 i want to change my pin - do i need to be in a bank? change_pin 501
2 i want to use a payment card to top up my account. how can i do this? supported_cards_and_currencies 502
3 i would like to give another card to my daughter, how can i proceed? getting_spare_card 503
4 is there a location where i can change my pin? change_pin 504

Let’s first check what class labels are present in this second batch of data. We define a helper function to compare the unique values of the label column against our previously observed data.

Optional: Define a helper method to compare classes

def compare_classes(new_data, historical_data, label_column):
    historical_data_classes = set(historical_data[label_column])
    new_batch_classes = set(new_data[label_column])
    if len(historical_data_classes.difference(new_batch_classes)) > 0:
        print(f"New batch has no data points from the following classes: {historical_data_classes.difference(new_batch_classes)}")
    if len(new_batch_classes.difference(historical_data_classes)) > 0:
        print(f"New batch has data points from previously unseen classes: {new_batch_classes.difference(historical_data_classes)}")

compare_classes(batch_2, fixed_dataset, label_column)
    New batch has no data points from the following classes: {'lost_or_stolen_phone'}

Here we don’t act on this information, but such information may be concerning depending on your application.

We’ll repeat the Cleanlab Studio steps that we previously performed for our first data batch, this time on a larger dataset composed of our clean historical data plus the newest data batch.

batch_1_2 = pd.concat([fixed_dataset, batch_2], ignore_index=True)

Load dataset, launch Project, and get Cleanlab columns

Again: if your notebook times out during any of the following steps, you likely don’t need to re-run that step (re-running it may take a long time). Instead, try running the next step after restarting your notebook.

dataset_id = studio.upload_dataset(batch_1_2, dataset_name="data-batch-1-2")
print(f"Dataset ID: {dataset_id}")
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="batch-1-2-analysis",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column=label_column,
    text_column=text_column,
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)

Review issues detected in batch 2

Similar to how we reviewed the label issues detected in batch 1 data, here we will focus on the label issues detected in the newest (second) batch of data. Note that our Project analyzed this batch of data together with the clean historical data, as more data allows Cleanlab’s AI to more accurately detect data issues. As before, the first step toward reviewing results is to merge the Cleanlab columns with the dataset that the Project was run on:

df2 = batch_1_2.copy()
merge_df_cleanlab = df2.merge(cleanlab_columns_df, left_index=True, right_index=True)
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)

The current merge_df_cleanlab dataset consists of both the cleaned historical data (from batch 1) and the raw batch 2 data. Here we demonstrate how to focus on catching label issues in the new (batch 2) data only:

# Use identifier to create an array of batch-2 id's
ids_of_batch_2 = batch_2["id"]
# Isolate batch-2 data from the merged dataset
batch_2_subset = merge_df_cleanlab.query("id in @ids_of_batch_2", engine="python")
# Get batch 2 rows flagged as label issues
label_issues = batch_2_subset.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)

display(label_issues[columns_to_display])
id text label_issue_score is_label_issue given_label suggested_label
742 749 why am i being charge a fee when using an atm? 0.789770 True card_about_to_expire card_payment_fee_charged
686 693 what atms will allow me to change my pin? 0.716286 True beneficiary_not_allowed change_pin
788 795 what services can i use to top up? 0.676584 True apple_pay_or_google_pay supported_cards_and_currencies
652 659 why do i see extra charges for withdrawing my money? 0.672628 True card_about_to_expire card_payment_fee_charged
587 594 bad bank 0.601791 True apple_pay_or_google_pay supported_cards_and_currencies

Improve batch 2 data based on the detected issues

Assume we are working with a production data pipeline where fixing issues in the most recent batch of data is the highest priority. Just as before, we apply the same strategy to clean the batch 2 data (re-label the flagged data points with label_issue_score above the same threshold and exclude the rest of the label issues from the dataset).

# Keep track of excluded rows
issues_below_threshold = label_issues.query("label_issue_score < @label_threshold", engine="python")

threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])

# Find indices of rows to exclude
ids_to_exclude2 = issues_below_threshold["id"]
indices_to_exclude2 = batch_2_subset.query("id in @ids_to_exclude2", engine="python").index

print(f"Excluding {len(ids_to_exclude2)} text example(s) (out of {len(batch_2_subset)} from batch-2)")

# Drop rows from the batch-2 subset
batch_2_subset = batch_2_subset.drop(indices_to_exclude2)

corrected_label = np.where(batch_2_subset["is_label_issue"],
                           batch_2_subset["suggested_label"],
                           batch_2_subset["given_label"])

# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_2_subset.query("is_label_issue", engine="python")])
    Excluding 3 text example(s) (out of 300 from batch-2)

Applying these corrections produces a cleaned version of the batch 2 data. We add this cleaned batch 2 data to our master fixed dataset (which up to this point contained the cleaned batch 1 data).

fixed_batch_2 = batch_2_subset[["text", "id"]].copy()
fixed_batch_2["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_2], ignore_index=True)

Awesome! We have grown our master dataset with the additional data being collected, while still ensuring this dataset is clean and free of label issues.
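As an optional sanity check (not part of the original workflow), you can confirm that the master dataset grew as expected after adding the cleaned batch 2 data:

# Optional sanity check: 500 batch-1 rows + 300 batch-2 rows, minus the rows excluded so far.
print(f"Master dataset now contains {len(fixed_dataset)} rows")
print(fixed_dataset["label"].value_counts().head())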

Adding another batch of data

Finally, let’s clean and then add another batch of newly collected data (200 rows) to our master dataset.

batch_3 = pd.read_csv(os.path.join(dataset_path, "data_batch3.csv"))

# Create an id column for unique identification of rows
total_rows = total_rows + len(batch_3)
batch_3["id"] = range(len(batch_1) + len(batch_2), total_rows)
batch_3.head()
text label id
0 i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. cancel_transfer 800
1 i need help as fast as possible! i made a mistake on my most recent transfer; can you please stop it before it goes through? cancel_transfer 801
2 i already made a transfer and want to cancel it, how do i do that? cancel_transfer 802
3 i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow cancel_transfer 803
4 hi, i made a transfer yesterday that i need to reverse. i need to put the money in a different account. cancel_transfer 804

Let’s first check what class labels are present in this third batch of data.

compare_classes(batch_3, fixed_dataset, label_column)
    New batch has data points from previously unseen classes: {'cancel_transfer'}

Here we don’t act on this information, but it may be a concern depending on your application. We repeat the Cleanlab Studio related steps that we performed before.

batch_1_2_3 = pd.concat([fixed_dataset, batch_3], ignore_index=True)

Load dataset, launch Project, and get Cleanlab columns

Again: if your notebook times out during any of the following steps, you likely don’t need to re-run that step (re-running it may take a long time). Instead, try running the next step after restarting your notebook. Training for this example should take approximately 15 minutes.

dataset_id = studio.upload_dataset(batch_1_2_3, dataset_name="data-batch-1-3")
print(f"Dataset ID: {dataset_id}")
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="batch-1-3-analysis",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column=label_column,
    text_column=text_column,
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)

Review issues detected in batch 3

df3 = batch_1_2_3.copy()
merge_df_cleanlab = df3.merge(cleanlab_columns_df, left_index=True, right_index=True)
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)

The merged dataset batch_1_2_3 consists of clean historical data (from batches 1 + 2) and the raw batch 3 data. As before, let’s focus on the issues detected in the batch 3 data:

# Create an array of batch-3 id's
ids_of_batch_3 = batch_3["id"]
# Isolate batch-3 subset from the current dataset
batch_3_subset = merge_df_cleanlab.query("id in @ids_of_batch_3", engine="python")
# Fetch rows with label issues
label_issues = batch_3_subset.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)

display(label_issues[columns_to_display])
id text label_issue_score is_label_issue given_label suggested_label
850 860 which currencies can i used to add funds to my account? 0.850656 True cancel_transfer supported_cards_and_currencies
978 988 i was charged for getting cash. 0.834128 True card_about_to_expire card_payment_fee_charged
949 959 so, i was just charged for my recent atm withdrawal and any withdrawal prior to this has been free. what is the issue here? 0.769915 True card_about_to_expire card_payment_fee_charged
840 850 how long does it take for a top up to be approved? 0.582045 True cancel_transfer supported_cards_and_currencies

Review issues in older batches

While we’ve been focusing on the issues detected in the latest batch of data only, we can also see if any issues have been detected in the older historical data (the cleaned version of batches 1 and 2). Now that there is significantly more data in the Cleanlab Studio Project, the AI is able to detect data issues more accurately and may find issues missed in previous rounds. Let’s see if there are any new label issues detected in the previous cleaned versions of batches 1 and 2:

batch_1_2_subset = merge_df_cleanlab.query("id not in @ids_of_batch_3", engine="python")

display(batch_1_2_subset.query("is_label_issue", engine="python")[columns_to_display])
id text label_issue_score is_label_issue given_label suggested_label
39 39 please tell me how to change my pin. 0.830466 True beneficiary_not_allowed change_pin
67 68 how do i find my new pin? 0.811987 True visa_or_mastercard change_pin
90 91 explain roth ira 0.585797 True beneficiary_not_allowed supported_cards_and_currencies
94 95 what cards do you offer? 0.726222 True visa_or_mastercard supported_cards_and_currencies

While we’ve been fixing the label issues detected in each older data batch at the time the data was collected, this doesn’t guarantee the older batches are 100% free of label issues.
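If you also want to act on these newly surfaced issues in the historical data, here is a minimal sketch (not part of this tutorial's workflow) applying the same threshold-based strategy to the batch 1 + 2 subset:

# Hypothetical sketch: re-apply the threshold-based correction to historical rows.
historical_issues = batch_1_2_subset.query("is_label_issue", engine="python")
to_relabel = historical_issues.query("label_issue_score >= @label_threshold", engine="python")
print(f"Would re-label {len(to_relabel)} historical rows and exclude "
      f"{len(historical_issues) - len(to_relabel)} lower-scoring ones")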

To demonstrate another type of issue, we can also review the outliers detected in these data batches in isolation. From the latest Cleanlab Studio Project, here are the outliers detected in batch 3:

columns_to_display_outlier = ["id", "text", "given_label", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier"]
outlier_issues_batch_3 = merge_df_cleanlab.query('(id in @ids_of_batch_3) & (is_outlier)', engine='python')

display(outlier_issues_batch_3[columns_to_display_outlier])
id text given_label outlier_score is_empty_text text_num_characters is_outlier
852 862 cancel transaction cancel_transfer 0.178126 False 18 True

Outliers detected in batch 2:

outlier_issues_batch_2 = merge_df_cleanlab.query('(id in @ids_of_batch_2) & (is_outlier)', engine='python')
display(outlier_issues_batch_2[columns_to_display_outlier])
id text given_label outlier_score is_empty_text text_num_characters is_outlier
502 509 metal card card_about_to_expire 0.234098 False 10 True
561 568 changing my pin change_pin 0.172110 False 15 True
582 589 750 credit score getting_spare_card 0.154556 False 16 True
639 647 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> change_pin 0.202901 False 61 True

Outliers detected in batch 1:

ids_of_batch_1 = batch_1["id"]
outlier_issues_batch_1 = merge_df_cleanlab.query("(id in @ids_of_batch_1) & (is_outlier)", engine="python")
display(outlier_issues_batch_1[columns_to_display_outlier])
id text given_label outlier_score is_empty_text text_num_characters is_outlier
90 91 explain roth ira beneficiary_not_allowed 0.176312 False 16 True
280 285 payment did not process beneficiary_not_allowed 0.184214 False 23 True
450 456 switch banks change_pin 0.186031 False 12 True
456 463 my sc apple_pay_or_google_pay 0.247411 False 5 True

Improve batch 3 data based on the detected issues

Finally, we fix just the label issues detected in batch 3, using the same strategy applied to the previous data batches.

# Keep track of excluded rows
issues_below_threshold = label_issues.query("label_issue_score < @label_threshold", engine="python")

threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])

# Find indices of rows to exclude
ids_to_exclude3 = issues_below_threshold["id"]
indices_to_exclude3 = batch_3_subset.query("id in @ids_to_exclude3", engine="python").index

print(f"Excluding {len(ids_to_exclude3)} text example(s) (out of {len(batch_3_subset)} from batch-3)")

# Drop rows from the batch-3 subset
batch_3_subset = batch_3_subset.drop(indices_to_exclude3)

corrected_label = np.where(batch_3_subset["is_label_issue"],
                           batch_3_subset["suggested_label"],
                           batch_3_subset["given_label"])

# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_3_subset.query("is_label_issue", engine="python")])
    Excluding 1 text example(s) (out of 200 from batch-3)

And then add the cleaned batch 3 data to our master dataset:

fixed_batch_3 = batch_3_subset[["text", "id"]].copy()
fixed_batch_3["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_3], ignore_index=True)

print(f"Total number of label issues fixed across all 3 batches: {len(label_issues_fixed_rows)}")
print(f"Total number of rows, with label issues, excluded due to score less than threshold ({label_threshold}): {len(threshold_filtered_rows)}")
    Total number of label issues fixed across all 3 batches: 6
Total number of rows, with label issues, excluded due to score less than threshold (0.7): 11

Saving the master dataset

After cleaning and accumulating multiple batches of data, we save the resulting master fixed dataset to a CSV file. The cleaned dataset has the same format as your original dataset, so you can use it as a plug-in replacement in your existing ML/Analytics pipelines to get more reliable results, without changing any modeling code.

new_dataset_filename = "fixed_dataset.csv"  # Location to save clean master dataset

if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite it, so please delete it first or specify a different new_dataset_filename.")
else:
    fixed_dataset.to_csv(new_dataset_filename, index=False, columns=["text", "label"])
    print(f"Master fixed dataset saved to {new_dataset_filename}")
    Master fixed dataset saved to fixed_dataset.csv
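As a final optional check, you can reload the saved CSV and confirm it contains the expected columns:

# Optional check: reload the saved file and verify its shape and columns.
reloaded = pd.read_csv(new_dataset_filename)
print(reloaded.shape)
print(reloaded.columns.tolist())  # expect ["text", "label"]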

Faster methods

The approach demonstrated here is how we recommend handling growing datasets in the generally available version of Cleanlab Studio. If your only goal is to label/categorize data coming in at rapid volumes, you can instead deploy a ML model to more quickly process incoming data. For companies with particular data workloads, Cleanlab offers more compute-efficient and integrated solutions that scale to larger volumes of incoming data. Contact us to learn more.
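As a rough illustration of that faster path, the sketch below assumes you have deployed a model from one of your cleansets in the Cleanlab Studio web app and copied its model ID; the get_model/predict calls follow the Python API's inference interface, but verify the exact method names and signatures against the inference documentation for your installed version.

# Hypothetical sketch: label incoming data with a model deployed from a cleanset.
# model_id is a placeholder; check that get_model/predict match your version
# of the cleanlab-studio package before relying on this.
model_id = "<insert your model ID>"
model = studio.get_model(model_id)

new_requests = pd.Series([
    "how do i change the pin on my card?",
    "why was i charged a fee at the atm?",
])
predicted_labels = model.predict(new_requests)
print(predicted_labels)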