Catching Issues in a Dynamically Growing Dataset
This is the recommended tutorial for programmatically auditing datasets that grow over time with the Cleanlab Studio Python API.
In this tutorial, we consider data that comes in batches accumulated into a master dataset. While one could follow the other tutorials to use Cleanlab to auto-detect issues across the entire master dataset, here we demonstrate how to catch issues in the most recent batch of data. We additionally demonstrate how to fix issues in the latest data batch, in order to create a higher-quality master dataset.
While this tutorial focuses specifically on label issues for brevity, the same ideas can be applied to catch any of the other data issues Cleanlab Studio can auto-detect (outliers, near duplicates, unsafe or low-quality content, …). This tutorial focuses on text data, but the same ideas can be applied to the other data modalities Cleanlab Studio supports such as images or structured/tabular data. We recommend first completing our text data quickstart tutorial to understand how Cleanlab Studio works with a static dataset.
Install and import dependencies
Make sure you have wget installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:
%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os
import random
from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)
Fetch and view dataset
Here we use a variant of the BANKING77 text classification dataset, in which customer service requests are labeled as belonging to one of K classes (intent categories). To fetch the dataset for this tutorial, make sure you have wget and zip installed.
mkdir -p data/
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/growing_dataset.zip -O data/growing_dataset.zip
unzip -q data/growing_dataset.zip -d data/
This data is split across 3 batches of unequal size, which will be received incrementally in this tutorial. Batch 3 contains 1 class that was not seen in batches 1 or 2, to help you learn how to handle applications where certain dataset classes might appear or disappear over time. Let's load the first batch of data.
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/growing_dataset")
batch_1 = pd.read_csv(os.path.join(dataset_path, "data_batch1.csv"))
Ensure unique identifier for data points
For a dynamic dataset, having a unique identifier for each data point allows us to better track results.
The current dataset has two columns - text and label - neither of which can be used to uniquely identify a row.
We can add a column id containing sequential numbers from 0 to batch size - 1. For each subsequent batch, the ids will continue from the size of the data accumulated so far up to the total size of the merged dataset.
# Create a new column and assign sequential numbers till batch size
batch_1["id"] = range(0, len(batch_1))
batch_1.head()
| | text | label | id |
---|---|---|---|
0 | why is there a fee when i thought there would be no fees? | card_payment_fee_charged | 0 |
1 | why can't my beneficiary access my account? | beneficiary_not_allowed | 1 |
2 | does it cost extra to send out more than one card? | getting_spare_card | 2 |
3 | can i change my pin at an atm? | change_pin | 3 |
4 | i have a us credit card, will you accept it? | supported_cards_and_currencies | 4 |
print(f"The total number of rows in the current dataset: {len(batch_1)}")
Load batch 1 into Cleanlab Studio
Upon receiving our batch of data, let's load it into Cleanlab Studio for analysis. First use your API key to instantiate a Studio object.
from cleanlab_studio import Studio
# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
studio = Studio(API_KEY)
Load the data from batch 1 into Cleanlab Studio. More details on formatting datasets can be found in this guide.
dataset_id = studio.upload_dataset(batch_1, dataset_name="data-batch-1")
print(f"Dataset ID: {dataset_id}")
Launch a Project
A Cleanlab Studio Project automatically trains ML models to provide AI-based analysis of your dataset. Let’s launch one for the data we have received so far.
Note: The label_column and text_column specified below happen to be called label and text in this example. Set these arguments to the names of the columns in your dataset that contain the class labels and the text field, respectively. If you have multiple text columns, please merge them into a single column as demonstrated in our FAQ.
label_column = "label" # name of column containing labels
text_column = "text"
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="batch-1-analysis",
modality="text",
task_type="multi-class",
model_type="regular", # set this to "fast" if time-constrained
label_column=label_column,
text_column=text_column
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
Once the project has been launched successfully and the project_id is visible, feel free to close this notebook. It will take some time for Cleanlab's AI to train models on this data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results. The training for this example should take approximately 10 minutes.
You should only execute the above cell once per data batch. After launching the project, you can poll for its status to programmatically wait until the results are ready for review:
# Fetch the cleanset id corresponding to the above project_id
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
If your notebook timed out, you can resume by re-running the above lines of code. Do not create a new project for the same batch when coming back to this notebook. When the project is complete, the resulting cleanset contains many Cleanlab columns of metadata that can be used to produce a cleaner version of our original dataset.
Review issues detected in batch 1
Fetch the Cleanlab columns of metadata for this cleanset using its cleanset_id. These columns have the same length as our original data batch and provide metadata about each individual data point, like what types of issues it exhibits and how severely.
If at any point you want to re-run the remaining parts of this notebook (without creating another Project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cell.
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
| | cleanlab_row_ID | corrected_label | is_label_issue | label_issue_score | suggested_label | suggested_label_confidence_score | is_ambiguous | ambiguous_score | is_well_labeled | is_near_duplicate | ... | non_english_score | predicted_language | is_toxic | toxic_score | sentiment_score | bias_score | is_biased | gender_bias_score | racial_bias_score | sexual_orientation_bias_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | <NA> | False | 0.311937 | <NA> | 0.477899 | False | 0.946097 | True | False | ... | 0.007218 | <NA> | False | 0.181641 | 0.134277 | 0.394824 | False | 0.394824 | 0.232544 | 0.146484 |
1 | 2 | <NA> | False | 0.254897 | <NA> | 0.565403 | False | 0.893302 | True | False | ... | 0.008784 | <NA> | False | 0.062164 | 0.159088 | 0.259082 | False | 0.259082 | 0.093933 | 0.024902 |
2 | 3 | <NA> | False | 0.494578 | <NA> | 0.230432 | False | 0.968427 | False | False | ... | 0.004739 | <NA> | False | 0.184692 | 0.377869 | 0.244189 | False | 0.244189 | 0.131470 | 0.013443 |
3 | 4 | <NA> | False | 0.353640 | <NA> | 0.411544 | False | 0.921234 | True | False | ... | 0.058934 | <NA> | False | 0.120056 | 0.543640 | 0.379687 | False | 0.379687 | 0.112732 | 0.013229 |
4 | 5 | <NA> | False | 0.333641 | <NA> | 0.441447 | False | 0.892788 | True | False | ... | 0.008220 | <NA> | False | 0.107422 | 0.573120 | 0.161621 | False | 0.146045 | 0.161621 | 0.002678 |
5 rows × 38 columns
Details about the Cleanlab columns and their meanings can be found in the Cleanlab columns guide.
In this tutorial, we focus on label issues only. The ideas demonstrated here can be used for other types of data issues that Cleanlab Studio auto-detects.
A data point flagged with a label issue likely has a wrong given label. For such data points, consider correcting their label to the suggested_label if it seems more appropriate.
The data points exhibiting this issue are indicated with boolean values in the is_label_issue column, and the severity of this issue in each data point is quantified in the label_issue_score column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
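As noted above, the same flag/score pattern applies to every other issue type Cleanlab Studio detects. Here is a purely illustrative sketch (not run in this tutorial) of how you might review near duplicates instead, assuming a near_duplicate_score column accompanies the is_near_duplicate flag shown above (see the Cleanlab columns guide for the authoritative column names):
# Illustrative sketch: review another issue type via its boolean flag and numeric score.
# `near_duplicate_score` is assumed here by analogy with `label_issue_score`.
near_duplicates = cleanlab_columns_df.query("is_near_duplicate", engine="python")
near_duplicates = near_duplicates.sort_values("near_duplicate_score", ascending=False)
display(near_duplicates.head())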
Let's create a given_label column in our dataframe to clearly indicate the class label originally assigned to each data point (customer service request).
# Copy data into a new DataFrame
df1 = batch_1.copy()
# Combine the dataset with cleanlab columns
merge_df_cleanlab = df1.merge(cleanlab_columns_df, left_index=True, right_index=True)
# Rename label column to "given_label"
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
To see which data points are estimated to be mislabeled, we filter by is_label_issue. We sort by label_issue_score to see which of these data points are most likely mislabeled.
label_issues = merge_df_cleanlab.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)
columns_to_display = ["id", "text", "label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(label_issues[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
7 | 7 | can i change my pin on holiday? | 0.706490 | True | beneficiary_not_allowed | change_pin |
459 | 459 | will i be sent a new card before mine expires? | 0.665202 | True | apple_pay_or_google_pay | card_about_to_expire |
117 | 117 | my card is almost expired. how fast will i get a new one and what is the cost? | 0.656704 | True | apple_pay_or_google_pay | card_about_to_expire |
54 | 54 | is it possible to change my pin? | 0.648337 | True | beneficiary_not_allowed | change_pin |
160 | 160 | p | 0.605484 | True | getting_spare_card | supported_cards_and_currencies |
115 | 115 | can i get a new card even though i am in china? | 0.575734 | True | apple_pay_or_google_pay | card_about_to_expire |
119 | 119 | what currencies does google pay top up accept? | 0.557280 | True | apple_pay_or_google_pay | supported_cards_and_currencies |
369 | 369 | do i need to verify my top-up card? | 0.527742 | True | getting_spare_card | apple_pay_or_google_pay |
Note that in most of these rows, the given_label really does seem wrong (the annotated intent in the original dataset does not appear appropriate for the customer request), except for the rows with a label issue score below 0.7. Luckily we can easily correct these data points by just using Cleanlab's suggested_label above, which seems like a more appropriate label in most cases.
While the boolean flags above help us estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. In this tutorial, we use a threshold on label_issue_score to select which data points to fix, excluding the flagged data points that do not meet the threshold.
Improve batch 1 data based on the detected issues
Let's use the Cleanlab columns to improve the quality of our dataset. For your own datasets, the actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets, so we caution against blindly copying the actions we perform here.
For data flagged as label issues, we create a new corrected_label column, which contains the given_label for data points without detected label issues and the suggested_label for data points with detected label issues. We use a label_issue_score threshold of 0.7 to determine which data points to re-label. The remaining data points flagged as label issues will be excluded from the dataset to avoid potential contamination.
Throughout, we track all of the rows we fixed (re-labeled) or excluded.
# Set issue score threshold
label_threshold = 0.70
# DataFrame to track excluded rows
threshold_filtered_rows = label_issues.query("label_issue_score < @label_threshold")
# Find indices of rows to exclude
ids_to_exclude1 = threshold_filtered_rows["id"]
indices_to_exclude1 = merge_df_cleanlab.query("id in @ids_to_exclude1").index
print(f"Excluding {len(threshold_filtered_rows)} text examples (out of {len(merge_df_cleanlab)})")
# Drop rows from the merge DataFrame
merge_df_cleanlab = merge_df_cleanlab.drop(indices_to_exclude1)
corrected_label = np.where(merge_df_cleanlab["is_label_issue"],
merge_df_cleanlab["suggested_label"],
merge_df_cleanlab["given_label"])
# DataFrame to track fixed (re-labeled) rows
label_issues_fixed_rows = merge_df_cleanlab.query("is_label_issue", engine="python")
Let’s make a cleaned version of the batch 1 data after applying these corrections:
fixed_batch_1 = merge_df_cleanlab[["text", "id"]].copy()
fixed_batch_1["label"] = corrected_label
Let’s also initialize our curated master dataset, a single DataFrame to store the clean data points accumulated across all the data batches.
fixed_dataset = pd.DataFrame(columns=["text", "label", "id"])
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_1], ignore_index=True) # add clean data from batch 1
Perfect! Now let's grow our master dataset after receiving the data from batch 2.
Adding a second batch of data
Our fixed_dataset currently contains the cleaned version of our first data batch. Suppose now we've collected a second batch of data, consisting of 300 rows, which we wish to add to this master fixed dataset.
batch_2 = pd.read_csv(os.path.join(dataset_path, "data_batch2.csv"))
We will again add the unique identifier id, whose values continue from where batch 1 left off: they start at the size of batch 1 (500) and run up to (but not including) the combined size of batches 1 and 2 (800).
total_rows = len(batch_1) + len(batch_2)
batch_2["id"] = range(len(batch_1), total_rows)
batch_2.head()
| | text | label | id |
---|---|---|---|
0 | i received my american express in apple pay, why is top up not working? | apple_pay_or_google_pay | 500 |
1 | i want to change my pin - do i need to be in a bank? | change_pin | 501 |
2 | i want to use a payment card to top up my account. how can i do this? | supported_cards_and_currencies | 502 |
3 | i would like to give another card to my daughter, how can i proceed? | getting_spare_card | 503 |
4 | is there a location where i can change my pin? | change_pin | 504 |
Let's first check what class labels are present in this second batch of data. We define a helper function to compare the unique values of the label column against our previously observed data.
Optional: Initialize helper method to compare classes
def compare_classes(new_data, historical_data, label_column):
    historical_data_classes = set(historical_data[label_column])
    new_batch_classes = set(new_data[label_column])
    if len(historical_data_classes.difference(new_batch_classes)) > 0:
        print(f"New batch has no data points from the following classes: {historical_data_classes.difference(new_batch_classes)}")
    if len(new_batch_classes.difference(historical_data_classes)) > 0:
        print(f"New batch has data points from previously unseen classes: {new_batch_classes.difference(historical_data_classes)}")
compare_classes(batch_2, fixed_dataset, label_column)
Here we don’t act on this information, but such information may be concerning depending on your application.
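If you did want to act on it, one hypothetical option (purely illustrative, not performed in this tutorial) is to hold out data points from previously unseen classes for separate review before analysis:
# Hypothetical example (not run here): hold out data points whose label is a class
# never seen in the clean historical data, so they can be reviewed separately.
unseen_classes = set(batch_2[label_column]) - set(fixed_dataset[label_column])
held_out_for_review = batch_2[batch_2[label_column].isin(unseen_classes)]
print(f"Holding out {len(held_out_for_review)} data points from previously unseen classes")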
We’ll repeat the Cleanlab Studio steps that we previously performed for our first data batch, this time on a larger dataset composed of our clean historical data plus the newest data batch.
batch_1_2 = pd.concat([fixed_dataset, batch_2], ignore_index=True)
Load dataset, launch Project, and get Cleanlab columns
Again: if your notebook times out during any of the following steps, you likely don’t need to re-run that step (re-running the step may take a long time again). Instead try to run the next step after restarting your notebook.
dataset_id = studio.upload_dataset(batch_1_2, dataset_name="data-batch-1-2")
print(f"Dataset ID: {dataset_id}")
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="batch-1-2-analysis",
modality="text",
task_type="multi-class",
model_type="regular",
label_column=label_column,
text_column=text_column
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
Review issues detected in batch 2
Similar to how we reviewed the label issues detected in batch 1 data, here we will focus on the label issues detected in the newest (second) batch of data. Note that our Project analyzed this batch of data together with the clean historical data, as more data allows Cleanlab’s AI to more accurately detect data issues. As before, the first step toward reviewing results is to merge the Cleanlab columns with the dataset that the Project was run on:
df2 = batch_1_2.copy()
merge_df_cleanlab = df2.merge(cleanlab_columns_df, left_index=True, right_index=True)
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
The current merge_df_cleanlab dataset consists of both the cleaned historical data (from batch 1) and the raw batch 2 data. Here we demonstrate how to focus on catching label issues in the new (batch 2) data only:
# Use identifier to create an array of batch-2 id's
ids_of_batch_2 = batch_2["id"]
# Isolate batch-2 data from the merged dataset
batch_2_subset = merge_df_cleanlab.query("id in @ids_of_batch_2", engine="python")
# Get batch 2 rows flagged as label issues
label_issues = batch_2_subset.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)
display(label_issues[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
742 | 749 | why am i being charge a fee when using an atm? | 0.789770 | True | card_about_to_expire | card_payment_fee_charged |
686 | 693 | what atms will allow me to change my pin? | 0.716286 | True | beneficiary_not_allowed | change_pin |
788 | 795 | what services can i use to top up? | 0.676584 | True | apple_pay_or_google_pay | supported_cards_and_currencies |
652 | 659 | why do i see extra charges for withdrawing my money? | 0.672628 | True | card_about_to_expire | card_payment_fee_charged |
587 | 594 | bad bank | 0.601791 | True | apple_pay_or_google_pay | supported_cards_and_currencies |
Improve batch 2 data based on the detected issues
Assume we are working with a production data pipeline where fixing issues in the most recent batch of data is the highest priority. Just as before, we can apply the same strategy to clean the batch 2 data (re-label the flagged data points with label_issue_score above the same threshold and exclude the rest of the label issues from the dataset).
# Keep track of excluded rows
issues_below_threshold = label_issues.query("label_issue_score < @label_threshold", engine="python")
threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])
# Find indices of rows to exclude
ids_to_exclude2 = issues_below_threshold["id"]
indices_to_exclude2 = batch_2_subset.query("id in @ids_to_exclude2", engine="python").index
print(f"Excluding {len(ids_to_exclude2)} text example(s) (out of {len(batch_2_subset)} from batch-2)")
# Drop rows from the batch-2 subset
batch_2_subset = batch_2_subset.drop(indices_to_exclude2)
corrected_label = np.where(batch_2_subset["is_label_issue"],
batch_2_subset["suggested_label"],
batch_2_subset["given_label"])
# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_2_subset.query("is_label_issue", engine="python")])
Applying these corrections produces a cleaned version of the batch 2 data. We add this cleaned batch 2 data to our master fixed dataset (which up to this point contained the cleaned batch 1 data).
fixed_batch_2 = batch_2_subset[["text", "id"]].copy()
fixed_batch_2["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_2], ignore_index=True)
Awesome! We have grown our master dataset with the additional data being collected, while still ensuring this dataset is clean and free of label issues.
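As an optional sanity check (purely illustrative, not part of the original workflow), you can inspect the size and class balance of the master dataset at any point:
# Optional sanity check on the accumulated master dataset
print(f"Master dataset now contains {len(fixed_dataset)} rows")
print(fixed_dataset["label"].value_counts())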
Adding another batch of data
Finally, let’s clean and then add another batch of newly collected data (200 rows) to our master dataset.
batch_3 = pd.read_csv(os.path.join(dataset_path, "data_batch3.csv"))
# Create an id column for unique identification of rows
total_rows = total_rows + len(batch_3)
batch_3["id"] = range(len(batch_1) + len(batch_2), total_rows)
batch_3.head()
| | text | label | id |
---|---|---|---|
0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. | cancel_transfer | 800 |
1 | i need help as fast as possible! i made a mistake on my most recent transfer; can you please stop it before it goes through? | cancel_transfer | 801 |
2 | i already made a transfer and want to cancel it, how do i do that? | cancel_transfer | 802 |
3 | i just realised i made the wrong payment yesterday. can you please change it to the right account? it's my rent payment and really really needs to be in the right account by tomorrow | cancel_transfer | 803 |
4 | hi, i made a transfer yesterday that i need to reverse. i need to put the money in a different account. | cancel_transfer | 804 |
Let’s first check what class labels are present in this third batch of data.
compare_classes(batch_3, fixed_dataset, label_column)
Here we don't act on this information, but it may be a concern depending on your application. We repeat the same Cleanlab Studio steps that we performed before.
batch_1_2_3 = pd.concat([fixed_dataset, batch_3], ignore_index=True)
Load dataset, launch Project, and get Cleanlab columns
Again: if your notebook times out during any of the following steps, you likely don't need to re-run that step (re-running the step may take a long time again). Instead try to run the next step after restarting your notebook. The training for this example could take approximately 15 minutes.
dataset_id = studio.upload_dataset(batch_1_2_3, dataset_name="data-batch-1-3")
print(f"Dataset ID: {dataset_id}")
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="batch-1-3-analysis",
modality="text",
task_type="multi-class",
model_type="regular",
label_column=label_column,
text_column=text_column
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
Review issues detected in batch 3
df3 = batch_1_2_3.copy()
merge_df_cleanlab = df3.merge(cleanlab_columns_df, left_index=True, right_index=True)
merge_df_cleanlab.rename(columns={"label": "given_label"}, inplace=True)
The merged dataset batch_1_2_3 consists of clean historical data (from batches 1 + 2) and the raw batch 3 data.
As before, let’s focus on the issues detected in the batch 3 data:
# Create an array of batch-3 id's
ids_of_batch_3 = batch_3["id"]
# Isolate batch-3 subset from the current dataset
batch_3_subset = merge_df_cleanlab.query("id in @ids_of_batch_3", engine="python")
# Fetch rows with label issues
label_issues = batch_3_subset.query("is_label_issue", engine="python").sort_values("label_issue_score", ascending=False)
display(label_issues[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
850 | 860 | which currencies can i used to add funds to my account? | 0.850656 | True | cancel_transfer | supported_cards_and_currencies |
978 | 988 | i was charged for getting cash. | 0.834128 | True | card_about_to_expire | card_payment_fee_charged |
949 | 959 | so, i was just charged for my recent atm withdrawal and any withdrawal prior to this has been free. what is the issue here? | 0.769915 | True | card_about_to_expire | card_payment_fee_charged |
840 | 850 | how long does it take for a top up to be approved? | 0.582045 | True | cancel_transfer | supported_cards_and_currencies |
Review issues in older batches
While we’ve been focusing on the issues detected in the latest batch of data only, we can also see if any issues have been detected in the older historical data (the cleaned version of batches 1 and 2). Now that there is significantly more data in the Cleanlab Studio Project, the AI is able to detect data issues more accurately and may find issues missed in previous rounds. Let’s see if there are any new label issues detected in the previous cleaned versions of batches 1 and 2:
batch_1_2_subset = merge_df_cleanlab.query("id not in @ids_of_batch_3", engine="python")
display(batch_1_2_subset.query("is_label_issue", engine="python")[columns_to_display])
| | id | text | label_issue_score | is_label_issue | given_label | suggested_label |
---|---|---|---|---|---|---|
39 | 39 | please tell me how to change my pin. | 0.830466 | True | beneficiary_not_allowed | change_pin |
67 | 68 | how do i find my new pin? | 0.811987 | True | visa_or_mastercard | change_pin |
90 | 91 | explain roth ira | 0.585797 | True | beneficiary_not_allowed | supported_cards_and_currencies |
94 | 95 | what cards do you offer? | 0.726222 | True | visa_or_mastercard | supported_cards_and_currencies |
While we’ve been fixing the label issues detected in each older data batch at the time the data was collected, this doesn’t guarantee the older batches are 100% free of label issues.
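Although this tutorial only fixes the newest batch, you could also propagate newly detected high-confidence corrections back into the master dataset. Here is a minimal sketch of that idea (hypothetical, reusing the same threshold; we do not run this here):
# Hypothetical: relabel historical rows in the master dataset that are newly flagged
# with high-confidence label issues in the latest Project.
historical_issues = batch_1_2_subset.query("(is_label_issue) & (label_issue_score >= @label_threshold)", engine="python")
id_to_suggested = dict(zip(historical_issues["id"], historical_issues["suggested_label"]))
mask = fixed_dataset["id"].isin(id_to_suggested)
fixed_dataset.loc[mask, "label"] = fixed_dataset.loc[mask, "id"].map(id_to_suggested)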
To demonstrate another type of issue, we can also review the outliers detected in these data batches in isolation. From the latest Cleanlab Studio Project, here are the outliers detected in batch 3:
columns_to_display_outlier = ["id", "text", "given_label", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier"]
outlier_issues_batch_3 = merge_df_cleanlab.query('(id in @ids_of_batch_3) & (is_outlier)', engine='python')
display(outlier_issues_batch_3[columns_to_display_outlier])
| | id | text | given_label | outlier_score | is_empty_text | text_num_characters | is_outlier |
---|---|---|---|---|---|---|---|
852 | 862 | cancel transaction | cancel_transfer | 0.178126 | False | 18 | True |
Outliers detected in batch 2:
outlier_issues_batch_2 = merge_df_cleanlab.query('(id in @ids_of_batch_2) & (is_outlier)', engine='python')
display(outlier_issues_batch_2[columns_to_display_outlier])
| | id | text | given_label | outlier_score | is_empty_text | text_num_characters | is_outlier |
---|---|---|---|---|---|---|---|
502 | 509 | metal card | card_about_to_expire | 0.234098 | False | 10 | True |
561 | 568 | changing my pin | change_pin | 0.172110 | False | 15 | True |
582 | 589 | 750 credit score | getting_spare_card | 0.154556 | False | 16 | True |
639 | 647 | 404Error<body><p>InvalidUsername</p><p> InvalidPIN</p></body> | change_pin | 0.202901 | False | 61 | True |
Outliers detected in batch 1:
ids_of_batch_1 = batch_1["id"]
outlier_issues_batch_1 = merge_df_cleanlab.query("(id in @ids_of_batch_1) & (is_outlier)", engine="python")
display(outlier_issues_batch_1[columns_to_display_outlier])
| | id | text | given_label | outlier_score | is_empty_text | text_num_characters | is_outlier |
---|---|---|---|---|---|---|---|
90 | 91 | explain roth ira | beneficiary_not_allowed | 0.176312 | False | 16 | True |
280 | 285 | payment did not process | beneficiary_not_allowed | 0.184214 | False | 23 | True |
450 | 456 | switch banks | change_pin | 0.186031 | False | 12 | True |
456 | 463 | my sc | apple_pay_or_google_pay | 0.247411 | False | 5 | True |
Improve batch 3 data based on the detected issues
Finally, we fix just the label issues detected in batch 3, using the same strategy applied to the previous data batches.
# Keep track of excluded rows
issues_below_threshold = label_issues.query("label_issue_score < @label_threshold", engine="python")
threshold_filtered_rows = pd.concat([issues_below_threshold, threshold_filtered_rows])
# Find indices of rows to exclude
ids_to_exclude3 = issues_below_threshold["id"]
indices_to_exclude3 = batch_3_subset.query("id in @ids_to_exclude3", engine="python").index
print(f"Excluding {len(ids_to_exclude3)} text example(s) (out of {len(batch_3_subset)} from batch-3)")
# Drop rows from the batch-3 subset
batch_3_subset = batch_3_subset.drop(indices_to_exclude3)
corrected_label = np.where(batch_3_subset["is_label_issue"],
batch_3_subset["suggested_label"],
batch_3_subset["given_label"])
# Keep track of fixed rows
label_issues_fixed_rows = pd.concat([label_issues_fixed_rows, batch_3_subset.query("is_label_issue", engine="python")])
And then add the cleaned batch 3 data to our master dataset:
fixed_batch_3 = batch_3_subset[["text", "id"]].copy()
fixed_batch_3["label"] = corrected_label
fixed_dataset = pd.concat([fixed_dataset, fixed_batch_3], ignore_index=True)
print(f"Total number of label issues fixed across all 3 batches: {len(label_issues_fixed_rows)}")
print(f"Total number of rows, with label issues, excluded due to score less than threshold ({label_threshold}): {len(threshold_filtered_rows)}")
Saving the master dataset
After cleaning and accumulating multiple batches of data, we save the resulting master fixed dataset to a CSV file. The cleaned dataset has the same format as your original dataset, so you can use it as a plug-in replacement in your existing ML/Analytics pipelines to get more reliable results (without changing your existing modeling code).
new_dataset_filename = "fixed_dataset.csv" # Location to save clean master dataset
if os.path.exists(new_dataset_filename):
    raise ValueError(f"File {new_dataset_filename} already exists. Cannot overwrite it, so please delete it first, or specify a different new_dataset_filename.")
else:
    fixed_dataset.to_csv(new_dataset_filename, index=False, columns=["text", "label"])
    print(f"Master fixed dataset saved to {new_dataset_filename}")
Faster methods
The approach demonstrated here is how we recommend handling growing datasets in the generally available version of Cleanlab Studio. If your only goal is to label/categorize data coming in at rapid volumes, you can instead deploy an ML model to more quickly process incoming data. For companies with particular data workloads, Cleanlab offers more compute-efficient and integrated solutions that scale to larger volumes of incoming data. Contact us to learn more.
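For reference, a rough sketch of what model-based labeling could look like with the Python API is shown below. The model_id placeholder comes from deploying a model out of an existing Project, and the method names (studio.get_model, model.predict) and input format are assumptions here; consult the Cleanlab Studio model deployment / inference tutorial for the authoritative usage.
# Rough sketch only (method names and inputs are assumptions, not verified here):
# deploy a model from an existing Project, then label new incoming data programmatically.
model = studio.get_model("<insert your model_id>")  # hypothetical model_id placeholder
new_texts = ["how do i reset my pin?", "why was i charged a fee at the atm?"]
predicted_labels = model.predict(new_texts)
print(predicted_labels)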