Detecting Issues in Structured/Tabular Datasets
This is the recommended quickstart tutorial for using Cleanlab Studio’s Python API to analyze tabular datasets (Excel/database files with columns of numeric/string values).
In this tutorial, we demonstrate the metadata Cleanlab Studio automatically generates for any classification dataset stored as a data table. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity. This entire notebook is run using the cleanlab_studio
Python package, so you can audit your datasets programmatically.
Install and import dependencies
Make sure you have wget
installed to run this tutorial.
You can use pip to install all other packages required for this tutorial as follows:
%pip install matplotlib cleanlab-studio
import numpy as np
import pandas as pd
import os
Dataset structure
Fetch the dataset for this tutorial, which is stored as a standard data table in CSV format.
wget -nc https://s.cleanlab.ai/grades-tabular-demo.csv -P data
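If wget is unavailable on your system, here is a small sketch of an equivalent download using only the Python standard library (it writes to the same data/ path that the wget command above uses):
import os
import urllib.request

os.makedirs("data", exist_ok=True)
url = "https://s.cleanlab.ai/grades-tabular-demo.csv"
path = os.path.join("data", "grades-tabular-demo.csv")
if not os.path.exists(path):  # mirror wget's -nc (no-clobber) behavior
    urllib.request.urlretrieve(url, path)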
The CSV file contains the following columns:
stud_ID,exam_1,exam_2,exam_3,notes,letter_grade
<id of student>,<grade on exam 1>,<grade on exam 2>,<grade on exam 3>,<optional notes>,<overall letter grade for student>
f48f73,53,77,93,,C
0bd4e7,81,64,80,great participation +10,B
e1795d,74,88,97,,B
...
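To sanity-check the downloaded file before uploading it, you can optionally peek at the first few rows with pandas:
pd.read_csv("data/grades-tabular-demo.csv").head()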
You can similarly format any other tabular dataset and run the rest of this tutorial. Details on how to format your dataset can be found in this guide, which also outlines other format options. Cleanlab Studio works out-of-the-box for messy tabular datasets with arbitrary numeric/string columns that may contain missing values.
Our dataset for this tutorial is a collection of student grades and other information about each student. Suppose we are interested in detecting incorrect values (data entry errors) in students’ final letter_grade, which belongs to one of five categories: A, B, C, D, F. In machine learning terminology, this can be considered a multi-class classification dataset, where the letter_grade column is viewed as the class label for each row (student) in the table.
If you were, say, aiming to detect miscategorized products in an e-commerce dataset, you would select the product-category column as the class label. Cleanlab Studio can auto-detect erroneous values not only in categorical columns but also in any numeric column, as demonstrated in our regression tutorial. Refer to our data entry errors tutorial for a general example of how to auto-detect errors in multiple heterogeneous columns of an arbitrary tabular dataset.
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/grades-tabular-demo.csv")
Load dataset into Cleanlab Studio
Use your API key to instantiate a Studio
object, which can be used to analyze your dataset.
from cleanlab_studio import Studio
# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
# initialize studio object
studio = Studio(API_KEY)
Next, load the dataset into Cleanlab Studio (more details/options can be found in this guide). This may take a while for big datasets.
dataset_id = studio.upload_dataset(dataset_path, dataset_name="student-grades")
print(f"Dataset ID: {dataset_id}")
Launch a Project
Let’s now create a project using this dataset. A Cleanlab Studio project will automatically train ML models to provide AI-based analysis of your dataset.
By default, Cleanlab Studio uses all columns except the label column as predictive features. But since stud_ID is not a predictive feature, we explicitly specify the columns to use as input variables for the model (the exam_1, exam_2, exam_3, and notes columns in this dataset). We also specify the column containing the class label for each data point (the letter_grade column in this dataset).
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="student-grades project",
modality="tabular",
task_type="multi-class",
model_type="regular",
label_column="letter_grade",
feature_columns=['exam_1', 'exam_2', 'exam_3', 'notes']
)
print(f"Project successfully created and training has begun! project_id: {project_id}")
Once the project has been launched successfully and you see your project_id, you can feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.
You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a cleanset, an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this cleanset has been created.
Warning! For big datasets, this next cell may take a long time to execute while Cleanlab’s AI model is training. If your notebook has timed out during this process, then you can resume work by re-running the below cell (which should return the cleanset_id
instantly if the project has completed training). Do not re-run the above cell and create a new project.
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.poll_cleanset_status(cleanset_id)
Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the Cleanlab Studio web interface and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.
Download Cleanlab columns
We can fetch the Cleanlab columns that contain the metadata of this cleanset using its cleanset_id. These columns have the same length as your original dataset and provide metadata about each individual data point, like what types of issues it exhibits and how severe these issues are.
If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id)
with the cleanset_id
printed from the previous cell.
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head()
|   | cleanlab_row_ID | corrected_label | is_label_issue | label_issue_score | suggested_label | suggested_label_confidence_score | is_ambiguous | ambiguous_score | is_well_labeled | is_near_duplicate | near_duplicate_score | near_duplicate_cluster_id | is_outlier | outlier_score | is_initially_unlabeled | has_rare_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | <NA> | False | 0.178143 | <NA> | 0.740559 | False | 0.848316 | False | False | 0.431745 | <NA> | False | 0.019050 | False | False |
| 1 | 2 | <NA> | False | 0.154428 | <NA> | 0.766985 | False | 0.756479 | False | False | 0.001234 | <NA> | False | 0.125321 | False | False |
| 2 | 3 | <NA> | False | 0.185642 | <NA> | 0.753580 | False | 0.722060 | False | False | 0.550396 | <NA> | False | 0.021067 | False | False |
| 3 | 4 | <NA> | False | 0.296633 | <NA> | 0.629907 | False | 0.887160 | False | False | 0.458981 | <NA> | False | 0.039498 | False | False |
| 4 | 5 | <NA> | False | 0.329882 | <NA> | 0.516878 | False | 0.933320 | False | False | 0.107241 | <NA> | False | 0.033416 | False | False |
Optional: Initialize visualization helper functions
We define some rule-based coloring functions below to better visualize the numeric and string columns in this dataset. Coloring helps us clearly understand data entry errors for this particular tutorial dataset, but is optional for your own datasets.
from typing import Callable, Dict, Optional
from IPython.core.display import HTML
import matplotlib.colors
def color_calc(value: int):
"""
Calculate color based on the given value. Intended for the range of 0 to 100. No error checking is done.
Parameters:
value (int or float): Value based on which the color is determined.
Returns:
str: Hexadecimal color code.
"""
if value <= 50:
r = 1.0
g = (value / 50) * 194.0 / 255.0 + 61.0 / 255.0
b = 61.0 / 255.0
else:
r = (100 - value) / 50
g = 1.0
b = 0.0
hex_color = matplotlib.colors.to_hex((r, g, b))
return hex_color
def grade_to_color(value: int):
"""
Format numerical grade value with background color. Intended for the range of 0 to 100. No error checking is done.
Parameters:
value (int or float): Numerical grade value.
Returns:
str: HTML div string with background color based on grade value.
"""
hex_color = color_calc(value)
return f'<div style="background-color: {hex_color}; text-align: center;">{value}</div>'
def letter_grade_to_color(grade: str):
"""
Format letter grade with background color to match the numerical grade. No error checking is done.
Parameters:
grade (str): Letter grade.
Returns:
str: HTML div string with background color based on letter grade.
"""
grade_map = {'A': 100, 'B': 75, 'C': 50, 'D': 25, 'F': 0}
value = grade_map.get(grade, 0) # default to 0 if grade is not found
hex_color = color_calc(value)
return f'<div style="background-color: {hex_color}; text-align: center;">{grade}</div>'
def highlight_notes(note: str):
"""
Format notes with background color based on keywords. Notes are returned as is if no keywords are found.
Parameters:
note (str): Text of notes.
Returns:
str: HTML div string with background color based on keywords found in notes.
"""
if 'missed' in note:
value = 40
elif 'cheated' in note:
value = 0
elif 'great participation' in note:
value = 100
else:
return note # default (no color)
hex_color = color_calc(value)
return f'<div style="background-color: {hex_color};">{note}</div>'
_TUTORIAL_FORMATTERS = {
'exam_1': grade_to_color,
'exam_2': grade_to_color,
'exam_3': grade_to_color,
'given_label': letter_grade_to_color,
'suggested_label': letter_grade_to_color,
'notes': highlight_notes
}
def display(df, formatters: Optional[Dict[str, Callable]] = None):
"""
Display DataFrame with formatted columns.
Parameters:
df (pd.DataFrame): DataFrame to display.
formatters (dict, optional): Mapping from column name to a formatter function applied when rendering that column.
Returns:
IPython.core.display.HTML: HTML representation of formatted DataFrame.
"""
if formatters is None:
formatters = {}
return HTML(df.to_html(escape=False, formatters=formatters, index=False))
disable_pretty_print = False # set to True to disable pretty printing when displaying DataFrames
optional_df_display_formatters = None if disable_pretty_print else _TUTORIAL_FORMATTERS
Review detected data issues
Details about all of the returned Cleanlab columns and their meanings can be found in this guide. Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:
- Label issue indicates the original value in the column you chose as the class label appears incorrect for this row (perhaps due to a data entry error, or accidental mislabeling). For such data, consider correcting this value to the suggested_label value if it seems more appropriate.
- Ambiguous indicates this data point is a borderline case, which might be appropriately described by more than one class label or none of the options at all. Multiple people might disagree on how to label this data point, so if your data are human-labeled, you might consider refining your annotation instructions to clarify how to handle data points like this.
- Outlier indicates this row is very different from the rest of the rows in your dataset (looks atypical). The presence of outliers may indicate problems in your data sources; consider deleting such data from your dataset if appropriate.
The rows exhibiting each type of issue are indicated with boolean values in the respective is_<issue>
column, and the severity of this issue in each row is quantified in the respective <issue>_score
column (on a scale of 0-1 with 1 indicating the most severe instances of the issue).
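Before examining individual issue types, it can help to tally how many rows are flagged in each boolean issue column. A minimal sketch using the columns shown above:
# Count the number of rows flagged with each issue type
issue_columns = ["is_label_issue", "is_ambiguous", "is_outlier", "is_near_duplicate"]
print(cleanlab_columns_df[issue_columns].sum())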
Let’s go through some of the Cleanlab columns and types of data issues, starting with label issues. We first create a given_label
column in our dataframe to clearly indicate the class label originally assigned to each data point in this dataset.
# Load the dataset into a DataFrame
df = pd.read_csv(dataset_path)
# Combine the dataset with the cleanlab columns
combined_dataset_df = df.merge(cleanlab_columns_df, left_index=True, right_index=True)
# Set a "given_label" column to the original label
combined_dataset_df.rename(columns={"letter_grade": "given_label"}, inplace=True)
# Store the column names of the dataset for visualization
DATASET_COLUMNS = df.columns.drop("letter_grade").tolist()
Finding label issues
To see which data points are estimated to be mislabeled (i.e. have potentially erroneous values in the class label column), we filter by is_label_issue
. We sort by label_issue_score
to see which of these data points are most likely mislabeled.
samples_ranked_by_label_issue_score = combined_dataset_df.sort_values("label_issue_score", ascending=False)
columns_to_display = DATASET_COLUMNS + ["label_issue_score", "is_label_issue", "given_label", "suggested_label"]
display(samples_ranked_by_label_issue_score.head(5)[columns_to_display], formatters=optional_df_display_formatters)
stud_ID | exam_1 | exam_2 | exam_3 | notes | label_issue_score | is_label_issue | given_label | suggested_label
---|---|---|---|---|---|---|---|---
ee538b | 100 | 100 | 99 | NaN | 0.967053 | True | F | A
db4bcf | 72 | 93 | 98 | great participation +10 | 0.960134 | True | F | A
8a0a87 | 66 | 0 | 78 | cheated on exam, gets 0pts | 0.957181 | True | A | F
0bdad5 | 71 | 0 | 82 | cheated on exam, gets 0pts | 0.944614 | True | A | F
34ccdd | 90 | 100 | 89 | great participation +10 | 0.944509 | True | F | A
Note that in each of these rows, the given_label
(i.e. final letter grade) really does seem wrong. Data entry and labeling is an error-prone process and mistakes are made! Luckily we can easily correct these data points by just using Cleanlab’s suggested_label
above, which seems like a much more suitable final letter_grade
value in most cases.
While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. You can alternatively ignore these boolean is_label_issue flags and filter the data by thresholding the label_issue_score yourself (if, say, you find the default thresholds produce false positives/negatives).
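For instance, here is a sketch of such custom thresholding (the 0.9 cutoff is purely illustrative; pick a value based on your own review of the scores):
CUSTOM_THRESHOLD = 0.9  # illustrative cutoff, not a recommended default
custom_label_issues = combined_dataset_df[combined_dataset_df["label_issue_score"] > CUSTOM_THRESHOLD]
print(f"{len(custom_label_issues)} rows exceed a label_issue_score of {CUSTOM_THRESHOLD}")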
Ambiguous samples
Next, let’s look at the ambiguous examples in the dataset. Many of these students fall right on the border between two final letter grades.
samples_ranked_by_ambiguous_score = combined_dataset_df.sort_values("ambiguous_score", ascending=False)
columns_to_display = DATASET_COLUMNS + ["ambiguous_score", "is_ambiguous", "given_label", "suggested_label"]
display(samples_ranked_by_ambiguous_score.head(10)[columns_to_display], formatters=optional_df_display_formatters)
stud_ID | exam_1 | exam_2 | exam_3 | notes | ambiguous_score | is_ambiguous | given_label | suggested_label
---|---|---|---|---|---|---|---|---
9f7f44 | 61 | 65 | 100 | missed homework frequently -10 | 0.983134 | True | D | <NA>
600f0b | 55 | 71 | 98 | missed homework frequently -10 | 0.975925 | False | A | <NA>
5d086b | 82 | 94 | 99 | missed class frequently -10 | 0.972961 | True | B | <NA>
5b2f76 | 99 | 86 | 95 | missed class frequently -10 | 0.972619 | False | B | <NA>
5b2d9a | 94 | 95 | 90 | missed class frequently -10 | 0.961089 | False | B | <NA>
a93747 | 100 | 80 | 78 | missed class frequently -10 | 0.952505 | False | C | <NA>
42155a | 56 | 80 | 79 | NaN | 0.952427 | True | F | C
e8901f | 69 | 90 | 86 | missed homework frequently -10 | 0.949782 | False | C | <NA>
b2a3ca | 59 | 94 | 80 | missed homework frequently -10 | 0.944547 | True | B | D
612ebd | 65 | 81 | 75 | NaN | 0.944378 | False | C | <NA>
Outliers
Next, let’s look at the outlier examples in the dataset. Many of these students perform well in at least 2 out of 3 subjects, and their grade is brought down by poor performance in one of the subjects or by missing homework frequently.
samples_ranked_by_outlier_score = combined_dataset_df.sort_values("outlier_score", ascending=False)
columns_to_display = DATASET_COLUMNS + ["outlier_score", "is_outlier", "given_label", "suggested_label"]
display(samples_ranked_by_outlier_score.head(10)[columns_to_display], formatters=optional_df_display_formatters)
stud_ID | exam_1 | exam_2 | exam_3 | notes | outlier_score | is_outlier | given_label | suggested_label
---|---|---|---|---|---|---|---|---
4787de | 73 | 84 | 68 | great participation +10 | 0.180820 | True | D | B
228dc0 | 65 | 85 | 80 | great participation +10 | 0.179532 | True | B | <NA>
93af9d | 76 | 70 | 73 | great participation +10 | 0.170726 | True | B | <NA>
8bfb9a | 75 | 91 | 68 | great final presentation +10 | 0.170343 | True | B | <NA>
fd8db2 | 90 | 67 | 77 | missed class frequently -10 | 0.165090 | True | D | <NA>
7f6511 | 90 | 67 | 77 | missed class frequently -10 | 0.165090 | True | D | <NA>
b4a1fe | 52 | 95 | 72 | great participation +10 | 0.161698 | True | B | <NA>
e1ac6f | 93 | 58 | 70 | great final presentation +10 | 0.161023 | True | D | B
8d904d | 73 | 73 | 76 | missed class frequently -10 | 0.160402 | True | D | <NA>
10ed39 | 83 | 71 | 70 | great participation +10 | 0.154319 | False | D | B
Near duplicates
Now let’s look at near duplicates. Some of these examples are nearly identical to each other, yet their given_label differs. Note that each near-duplicate data point has an associated near_duplicate_cluster_id integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point.
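For example, here is a small helper (hypothetical, not part of the cleanlab_studio API) that looks up all near duplicates of a given row:
def near_duplicates_of(df, row_index):
    """Return all rows sharing the near-duplicate cluster of the row at row_index."""
    cluster_id = df.loc[row_index, "near_duplicate_cluster_id"]
    if pd.isna(cluster_id):
        return df.loc[[row_index]]  # this row has no near duplicates
    return df[df["near_duplicate_cluster_id"] == cluster_id]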
n_near_duplicate_sets = len(set(combined_dataset_df.loc[combined_dataset_df["near_duplicate_cluster_id"].notna(), "near_duplicate_cluster_id"]))
print(f"There are {n_near_duplicate_sets} sets of near duplicate rows in the dataset.")
Let’s check out the near duplicates with cluster IDs 0 and 1:
near_duplicate_cluster_id = 0 # play with this value to see other sets of near duplicates
selected_samples_by_near_duplicate_cluster_id = combined_dataset_df.query("near_duplicate_cluster_id == @near_duplicate_cluster_id")
columns_to_display = ["stud_ID","exam_1","exam_2","exam_3","notes", "near_duplicate_score", "is_near_duplicate", "given_label"]
display(selected_samples_by_near_duplicate_cluster_id[columns_to_display])
stud_ID | exam_1 | exam_2 | exam_3 | notes | near_duplicate_score | is_near_duplicate | given_label
---|---|---|---|---|---|---|---
745c23 | 89 | 95 | 72 | NaN | 0.89761 | True | B
6be392 | 89 | 95 | 73 | NaN | 0.89761 | True | D
near_duplicate_cluster_id = 1 # play with this value to see other sets of near duplicates
selected_samples_by_near_duplicate_cluster_id = combined_dataset_df.query("near_duplicate_cluster_id == @near_duplicate_cluster_id")
columns_to_display = ["stud_ID","exam_1","exam_2","exam_3","notes", "near_duplicate_score", "is_near_duplicate", "given_label"]
display(selected_samples_by_near_duplicate_cluster_id[columns_to_display])
stud_ID | exam_1 | exam_2 | exam_3 | notes | near_duplicate_score | is_near_duplicate | given_label
---|---|---|---|---|---|---|---
24d6f2 | 0 | 83 | 97 | cheated on exam, gets 0pts | 0.953227 | True | D
7630c7 | 0 | 81 | 99 | cheated on exam, gets 0pts | 0.949133 | True | B
d4d286 | 0 | 81 | 97 | cheated on exam, gets 0pts | 0.953227 | True | B
1c1ee0 | 0 | 79 | 96 | cheated on exam, gets 0pts | 0.938130 | True | D
Improve the dataset based on the detected issues
Since the results of this analysis appear reasonable, let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, which actions you should take to remedy the detected issues will depend on what you are using the data for. No single action is going to be the best choice across all datasets, so we caution against blindly copying the actions we perform below.
For data marked as label_issue
, we create a new corrected_label
column, which will be the given label for data without detected label issues, and the suggested_label
for data with detected label issues.
corrected_label = np.where(combined_dataset_df["is_label_issue"],
combined_dataset_df["suggested_label"],
combined_dataset_df["given_label"])
For demonstration purposes, we will simply exclude data marked as outlier or ambiguous from our dataset. We create a boolean vector rows_to_exclude to track which data points will be excluded.
rows_to_exclude = combined_dataset_df["is_outlier"] | combined_dataset_df["is_ambiguous"]
For each set of near duplicates, we only want to keep one of the data points that share a common near_duplicate_cluster_id
(so that the resulting dataset will no longer contain any near duplicates).
near_duplicates_to_exclude = combined_dataset_df['is_near_duplicate'] & combined_dataset_df['near_duplicate_cluster_id'].duplicated(keep='first')
rows_to_exclude |= near_duplicates_to_exclude
We can check the total amount of excluded data:
print(f"Excluding {rows_to_exclude.sum()} examples (out of {len(combined_dataset_df)})")
Finally, let’s actually make a new version of our dataset with these changes.
We craft a new dataframe from the original, applying corrections and exclusions, and then use this dataframe to save the new dataset in a separate CSV file. The new dataset is a CSV file that looks just like our original dataset – you can use it as a plug-in replacement to get more reliable results in your ML and Analytics pipelines, without any change in your existing modeling code.
new_dataset_filename = "improved_dataset.csv"
# Fetch the original dataset
fixed_dataset = combined_dataset_df[DATASET_COLUMNS].copy()
# Add the corrected label column
fixed_dataset["letter_grade"] = corrected_label
# Automatically exclude selected rows
fixed_dataset = fixed_dataset[~rows_to_exclude]
# Save improved dataset to new CSV file
fixed_dataset.to_csv(new_dataset_filename, index=False)
print(f"Adjusted dataset saved to {new_dataset_filename}")
Note: Cleanlab Studio is not just for labeled datasets. You can follow this tutorial to auto-detect erroneous values in any categorical column of a table, as well as impute all missing values in this column, by selecting it as the label column in your Cleanlab Studio Project. Refer to our data entry errors tutorial for a general example of auto-detecting errors (and imputing missing values) across multiple heterogeneous columns of an arbitrary tabular dataset.