
Detect Data Entry Errors (and Impute Missing Values) in any Tabular Dataset

Run in Google Colab

This is the recommended tutorial for programmatically identifying erroneous values (or imputing missing values) in a structured dataset via the Cleanlab Studio Python API. Cleanlab’s AI automatically detects entries in any data table (e.g. CSV/Excel file or database) that are likely incorrect, perhaps due to data entry mistakes or measurement error (e.g. sensor noise).

Simply provide any data table (including columns that are: text, numeric, or categorical — even with missing values), and state-of-the-art ML models will be trained to score the quality of each datapoint (row) and flag any entry (cell value) that is likely erroneous. These same ML models can produce predictions to accurately impute missing entries in the table.

Unlike traditional data validation or data quality tools, Cleanlab is not simply based on manual rules (e.g. ‘this column only contains positive values in a certain range’). Instead Cleanlab uses AI that accounts for all of the available information in your dataset to flag suspicious entries which may be erroneous. Thus Cleanlab can automatically detect not only atypical values in a column (like traditional data quality tools), but also atypical combinations of values across columns (e.g. multiple values that are jointly incompatible).

Data entry issues

Install and import dependencies

You can use pip to install the packages required for this tutorial as follows:

%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os
import random

from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)

Fetch and view dataset

This tutorial considers a structured dataset of medical records. To fetch this dataset, make sure you have wget installed.

wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/data_entry.csv -P data
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/")

data = pd.read_csv(os.path.join(dataset_path, 'data_entry.csv'))
data.head()
patient_id diagnosis medication dosage note visiting_hours invoice
0 90235 Bronchitis Dextromethorphan 10mg Adjusting dosage of Dextromethorphan for Bronchitis. 2.0 40.0
1 34227 Pneumonia Amoxicillin 500mg Reviewing patient's response to Amoxicillin therapy for Pneumonia. 35.0 300.0
2 21253 Hypertension Hydrochlorothiazide 25mg Recommending additional tests to monitor efficacy of Hydrochlorothiazide for Hypertension. 3.0 300.0
3 44136 Pneumonia Levofloxacin 10 IU Advised patient on importance of adherence to Levofloxacin regimen for Pneumonia. 5.0 300.0
4 76265 Bronchitis Guaifenesin 200mg Patient treated for Bronchitis using Guaifenesin. 4.0 80.0

We can check if the dataset is complete or if there are missing entries.

missing_data = data[data.isnull().any(axis=1)]
missing_data
patient_id diagnosis medication dosage note visiting_hours invoice
102 10322 Hypertension Losartan NaN Patient presenting with Hypertension symptoms despite Losartan therapy. 4.0 400.0
129 53909 Pneumonia Azithromycin 250mg NaN 5.0 300.0
256 68289 Bronchitis Guaifenesin 200mg Patient responding well to Guaifenesin for Bronchitis. 5.0 NaN
397 81520 Diabetes mellitus Insulin 10 IU Considering alternative treatments for Diabetes mellitus due to patient's intolerance to Insulin. 2.0 NaN
473 61117 Diabetes mellitus Insulin 10 IU Instructed patient on proper administration of Insulin for Diabetes mellitus. NaN 80.0
655 64952 Influenza Rimantadine 100mg Patient presenting with Influenza symptoms despite Rimantadine therapy. NaN 80.0
816 15877 Bronchitis NaN 10mg Scheduled follow-up appointment to evaluate Bronchitis. 1.0 20.0
966 49846 Bronchitis NaN 10mg NaN 1.0 20.0

Cleanlab Studio automatically deals with missing values, so we don’t have to handle them specially.

Load data into Cleanlab Studio

Our data analysis starts by loading the data, which may take a while for big datasets. First use your API key to instantiate a Studio object.

from cleanlab_studio import Studio

# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"

# Initialize studio object
studio = Studio(API_KEY)

A unique ID column allows us to better track and visualize results. If such a column is not present in your dataset, you can uncomment the following code to create one.
Here we’ll use the ‘patient_id’ column which has unique values for all entries in our dataset.

# Alternatively you can create an ID column of sequential integers
# identifier_col = 'unique_id'
# data[identifier_col] = range(0, len(data))
identifier_col = 'patient_id'  # unique ID column
data[identifier_col].is_unique
    True

When loading data into Cleanlab Studio, the application will automatically infer the schema of your dataset. However in order to analyze specific numeric columns later on, we’ll override the default schema here to explicitly declare these as numeric columns. You don’t need to do this for all numeric columns, only certain columns you wish to specifically analyze/predict within a later Cleanlab Studio Project.

dataset_id = studio.upload_dataset(
    data,
    dataset_name="tabular-data-entry",
    schema_overrides=[
        {"name": "visiting_hours", "column_type": "float"},
        {"name": "invoice", "column_type": "float"},
    ],
)
print(f"Dataset ID: {dataset_id}")

After it’s loaded, we can use the dataset’s id to create a Project in Cleanlab Studio, which automatically trains ML models to provide AI-based analysis of your dataset.

Launch projects

To find erroneous values in this dataset, we launch a Cleanlab Studio project for each numeric/categorical column that may contain errors. Each project uses ML to analyze a particular column of the dataset by treating that column as the label (dependent variable) and the rest as predictive features (independent variables). The predictions for each entry in the label column are thus based on all of the other available information in the same row, and can be used to impute missing entries and detect erroneous entries.

We first define helper functions to determine the type of each column and to launch a Cleanlab Studio project for a column (based on its determined type).

Optional: Helper functions to determine the type of a column, which dictates the kind of Cleanlab project used to find erroneous values in it. (click to expand)

def is_categorical_column(col, unique_threshold = 0.05, min_na_values = 100):
    """
    Check if a column is categorical and can be used as the labels for a Cleanlab Studio classification project.

    Parameters:
        col (pd.Series or pd.DataFrame): Values in this column (extracted from the original DataFrame).
        unique_threshold (float): Threshold to check uniqueness of categories in the column. If the ratio of (unique categories in the column / total number of data samples) is too high, return False.
        min_na_values (int): Minimum number of non-null values in this column in order to train a reliable AI model (must be above this to return True).

    Returns:
        bool: Whether the column is categorical and can be used as the label in a Cleanlab Studio project.
    """

    num_unique_vals = col.nunique()

    # Check if column has all identical values
    if num_unique_vals == 1:
        return False

    # Check if column can be considered to be categorical
    unique_ratio = num_unique_vals / len(col)
    if unique_ratio > unique_threshold:
        return False

    # Check that the minimum class count for any class is at least 5
    if any(col.value_counts() < 5):
        return False

    # Check if too many NaN values to be worth training a model to predict
    if sum(col.notna()) < min_na_values:
        return False
    if len(col) < 101 and sum(col.isna()) > 0:
        return False

    return True

def is_numeric_column(col, min_na_values = 100):
    """
    Check if a column is numeric and if it can be used as the labels for a Cleanlab Studio regression project.

    Parameters:
        col (pd.Series or pd.DataFrame): Values in this column (extracted from the original DataFrame).
        min_na_values (int): Minimum number of non-null values in this column in order to train a reliable AI model (must be above this to return True).

    Returns:
        bool: Whether the column is numeric and can be used as the label in a Cleanlab Studio project.
    """
    if col.dtype not in (float, int):
        return False

    # Check if too many NaN values to be worth training a model to predict
    if sum(col.notna()) < min_na_values:
        return False
    if len(col) < 101 and sum(col.isna()) > 0:
        return False

    return True

def filter_column_types(data, identifier_col, columns_to_exclude = []):
    """
    Determine whether each column is categorical or numeric, to decide which type of Cleanlab Studio project to run for it.

    Parameters:
        data (pd.DataFrame): Original dataset.
        identifier_col (str): Unique identifier column (will not be considered).
        columns_to_exclude (list[str]): List of names of columns to not consider.

    Returns:
        categorical_columns (list): List of names of columns identified as categorical.
        numeric_columns (list): List of names of columns identified as numeric.
        skip_columns (list): List of names of columns that are neither categorical nor numeric (e.g. contain arbitrary text or strings).
    """
    categorical_columns = []
    numeric_columns = []
    skip_columns = []

    all_columns = data.columns.to_list()
    # List of columns to analyze
    columns_to_analyze = [column for column in all_columns if column not in columns_to_exclude]
    # Remove identifier_col as it doesn't require any analysis
    columns_to_analyze.remove(identifier_col)

    for col_name in columns_to_analyze:
        col = data[col_name]
        if is_categorical_column(col):
            categorical_columns.append(col_name)
        elif is_numeric_column(col):
            numeric_columns.append(col_name)
        else:
            skip_columns.append(col_name)

    if len(categorical_columns + numeric_columns) == 0:
        print("No columns to find errors in.")

    return categorical_columns, numeric_columns, skip_columns

# Helper function to launch a Cleanlab project to detect errors in a column
def create_column_project(dataset_id, project_name, input_columns, audit_column, task_type, model_type="regular"):
    """
    Launch a Cleanlab Studio project based on the type of the column to be audited.

    Parameters:
        dataset_id (str): Cleanlab Studio dataset ID.
        project_name (str): Name of the Cleanlab Studio project.
        input_columns (list[str]): List of names of columns to use as predictive features when training the Cleanlab AI model to analyze the `audit_column`.
        audit_column (str): Name of the column to audit, i.e. use as the label for training the Cleanlab AI model.
        task_type (str): Type of ML task - 'multi-class' for classification or 'regression' for regression.
        model_type (str): Type of ML model - 'regular' or 'fast' mode.

    Returns:
        project_id (str): Unique ID of this Cleanlab Studio project that can be used to programmatically fetch results.
    """
    predictive_columns = input_columns.copy()
    # Remove audit_column from the list of input predictive features
    predictive_columns.remove(audit_column)

    project_id = studio.create_project(
        dataset_id=dataset_id,
        project_name=f"{project_name}-{audit_column}",
        modality="tabular",
        task_type=task_type,
        model_type=model_type,
        label_column=audit_column,
        feature_columns=predictive_columns
    )
    print(f"Project for auditing '{audit_column}' created and training has begun! project_id: {project_id}")
    return project_id

Not all columns may be susceptible to data entry errors. To save time, you can specify columns_to_exclude to skip such columns in the error checks (they may still be used as predictive features for inferring whether other columns have erroneous values).

In our dataset, we’ll skip the ‘diagnosis’ column. Note that any identifier column you previously declared for the dataset (in our case ‘patient_id’) will be ignored (neither being checked for errors nor used as a predictive feature).

The following code determines which categorical and numeric columns we will check for errors (i.e. run a project for).

# List of columns to exclude
columns_to_exclude = ['diagnosis']

categorical_columns, numeric_columns, skip_columns = filter_column_types(data, identifier_col, columns_to_exclude)
print("Categorical columns to audit:", categorical_columns, "\nNumeric columns to audit:", numeric_columns, "\nNot categorical or numeric (not audited):", skip_columns)
    Categorical columns to audit: ['medication', 'dosage'] 
Numeric columns to audit: ['visiting_hours', 'invoice']
Not categorical or numeric (not audited): ['note']

We only audit numeric or categorical columns for errors (the ‘note’ column in our dataset contains free-form text and is automatically not checked for errors). For each column to audit, an appropriate Cleanlab project is launched (classification project for categorical columns, regression project for numeric columns). Let’s launch these projects:

project_name = "tutorial"  # You can change this prefix used to name the per-column projects
model_type = "regular" # You can set this to "fast" to get quicker but less accurate results

# All the columns (except identifier) can be predictive features
input_columns = data.columns.to_list() # You can reduce this to use fewer predictive features
input_columns.remove(identifier_col)
project_ids = {} # will store IDs of all the projects for different columns

# Run classification projects for categorical columns
for audit_column in categorical_columns:
    project_ids[audit_column] = create_column_project(dataset_id, project_name, input_columns, audit_column, 'multi-class', model_type)

# Run regression projects for numeric columns
for audit_column in numeric_columns:
    project_ids[audit_column] = create_column_project(dataset_id, project_name, input_columns, audit_column, 'regression', model_type)

Once all of the projects have been launched successfully and their project_ids are printed, feel free to close this notebook. It will take time for Cleanlab’s AI to train on and analyze your data. Come back after training is complete (you will receive email notifications) and continue with the notebook to review your results. Each project produces a cleanset (cleaned dataset). For this tutorial, all projects launched here should complete within around 10 minutes.

You should only execute the above cells once! Do not call create_project again.

Before we proceed with the rest of the tutorial, we will wait for all the projects to complete. You can poll each project’s status to programmatically wait until the results are ready for review (this next code cell will take some time):

cleanset_ids = {}

for audit_column in project_ids:
    cleanset_ids[audit_column] = studio.get_latest_cleanset_id(project_ids[audit_column])
    print(f"Project for auditing '{audit_column}' is running! cleanset_id: {cleanset_ids[audit_column]}")
    project_status = studio.wait_until_cleanset_ready(cleanset_ids[audit_column])

Get project results

Each project generates smart metadata about each data point, which is stored as Cleanlab columns we can fetch.

If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cells.
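
For example, if you saved a cleanset_id from an earlier session, you can re-fetch that column’s Cleanlab columns without re-creating any project. Here is a minimal sketch (the placeholder string should be replaced with the actual cleanset_id you saved):

saved_cleanset_id = "<cleanset_id printed above>"  # placeholder: paste your saved cleanset_id here
saved_cleanlab_columns = studio.download_cleanlab_columns(saved_cleanset_id)
saved_cleanlab_columns.head()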

We define a helper function that fetches the Cleanlab-generated metadata columns for each project and stores which entries are estimated to be erroneous, confidence scores for these estimates, and AI model predictions in 3 CSV files (it also returns the corresponding DataFrames). Other helper functions are defined to help inspect/highlight Cleanlab results for each column using these 3 DataFrames/CSVs.

Optional: Helper function to extract and re-arrange data entries detected as possibly erroneous (click to expand)

def extract_and_write_cleanlab_results(data, cleanset_ids, identifier_col):
    """
    Fetch Cleanlab columns to store erroneous entries, respective confidence scores, and AI model predictions in 3 different CSVs.

    Parameters:
        data (pd.DataFrame): Original data's DataFrame.
        cleanset_ids (dict): Dictionary with column name as key, and its respective cleanset_id as value.
        identifier_col (str): Unique identifier column.

    Returns:
        is_issue_df (pd.DataFrame): DataFrame of booleans, for all audited columns, denoting whether the value is erroneous or not.
        issue_score_df (pd.DataFrame): DataFrame of confidence scores (0-1), for all audited columns, denoting the probability of the value being erroneous.
        prediction_df (pd.DataFrame): DataFrame of AI predictions, for all audited columns, that can be used to impute missing values.
    """
    # Instantiate DataFrames with identifier column
    is_issue_df, issue_score_df, prediction_df = (pd.DataFrame(data[identifier_col]) for _ in range(3))

    for audit_column in cleanset_ids:
        # Download Cleanlab columns from this column's project
        cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_ids[audit_column])
        data_cleanlab_df = data.merge(cleanlab_columns_df, left_index=True, right_index=True)

        # DataFrame to store True/False flag whether the entry seems erroneous
        is_issue_df[audit_column] = data_cleanlab_df['is_label_issue']

        # DataFrame to store AI's confidence score on whether the entry is erroneous
        issue_score_df[audit_column] = data_cleanlab_df['label_issue_score']

        # DataFrame to store predicted values
        prediction_df[audit_column] = np.where(data_cleanlab_df["is_label_issue"] | data_cleanlab_df[audit_column].isna(),
                                               data_cleanlab_df["suggested_label"],
                                               data_cleanlab_df[audit_column])

    # Save DataFrames to CSV files (comment these lines if you prefer not to)
    is_issue_df.to_csv("is_issue.csv", index=False)
    issue_score_df.to_csv("issue_score.csv", index=False)
    prediction_df.to_csv("imputed_values.csv", index=False)
    return is_issue_df, issue_score_df, prediction_df

# Helper function to view the Cleanlab analysis results for a particular column
def inspect_column(audit_column):
    """
    Create a DataFrame with the original data, the column to inspect, and its respective Cleanlab columns.

    Parameters:
        audit_column (str): Name of the column to inspect.

    Returns:
        merged_df (pd.DataFrame): DataFrame of the original data, the inspected column, and its respective Cleanlab columns.
    """
    is_issue_col = is_issue_df[audit_column].rename(f"{audit_column}_is_issue")
    issue_score_col = issue_score_df[audit_column].rename(f"{audit_column}_issue_score")
    predicted_values_col = predicted_value_df[audit_column].rename(f"{audit_column}_predicted_value")

    # Move the inspected column to the end for ease of comparison
    data_columns = data.columns.to_list()
    data_columns.remove(audit_column)
    data_columns.append(audit_column)

    merged_df = pd.concat([data[data_columns], is_issue_col, issue_score_col, predicted_values_col], axis=1)
    return merged_df

# Helper function to highlight particular values for ease of visualization
def apply_highlight(row, col, mask):
    '''
    Highlight the (row, col) entry if mask[row, col] is True

    Parameters:
        mask (pd.DataFrame): Contains boolean indicating whether the entry is to be highlighted

    Returns:
        CSS-style (str): Style to apply to the (row, col) entry
    '''
    # Check if the mask is True for this particular entry
    if mask.at[row.name, col]:
        return 'background-color: red'
    else:
        return ''

# Helper function to highlight entries in an Excel file based on a boolean table
def highlight_errors_in_excel(excel_path, issue_csv_path):
    """
    Highlights entries in an Excel file based on boolean values in a DataFrame.

    Parameters:
        excel_path (path in str): Path of the Excel file to create and modify.
        issue_csv_path (path in str): Path of `is_issue.csv` containing the boolean table.

    Returns:
        None: Saves the modified Excel file.
    """
    # Export original data to Excel format
    data.to_excel(excel_path, index=False)

    # Read Excel data and is_issue.csv
    excel_workbook = load_workbook(excel_path)
    excel_worksheet = excel_workbook.active
    is_issue_df = pd.read_csv(issue_csv_path)

    # Columns to consider for highlighting
    # is_issue_df contains the columns we audited
    columns_to_highlight = is_issue_df.columns.to_list() # You can reduce this to a list of specific columns
    columns_to_highlight.remove(identifier_col)

    # Define fill pattern to apply to a cell
    yellow_fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')

    # Map column names to their respective column indices in Excel
    excel_column_with_indices = {cell.value: cell.column for cell in excel_worksheet[1]}

    # Iterate over columns_to_highlight
    for column in columns_to_highlight:
        excel_column_idx = excel_column_with_indices[column]
        # Apply highlight if the value in `is_issue_df` is True
        for row_idx, value in enumerate(is_issue_df[column]):
            excel_row_idx = row_idx + 2  # Excel row indices start at 1 and the first row contains the column headers, hence add 2
            if value == True:
                excel_worksheet.cell(row=excel_row_idx, column=excel_column_idx).fill = yellow_fill

    excel_workbook.save(excel_path)

This helper function outputs 3 CSV files:

  • is_issue.csv contains True/False values specifying whether each entry value in the input dataset is estimated to be likely erroneous (True) or not (False). You can filter based on the True values to determine which subset of data appears suspicious/corrupted in your dataset (inferred based on the other available information in the dataset).

  • issue_score.csv contains scores between 0 and 1 estimating the likelihood that each entry value is corrupted (higher scores indicate values that appear more suspicious/erroneous). You can sort a column’s values by these scores (in descending order) to prioritize which values to review (i.e. which entries are most suspicious).

  • imputed_values.csv contains a value for each entry predicted by Cleanlab’s AI model. The prediction represents the expected value of this entry based on all of the other information in the dataset (in particular this row), and can be used as an imputed value if the original value is missing.

All 3 CSV files contain the same number of rows as the original dataset, along with the identifier column and the columns that we ran through Cleanlab projects. Each erroneous flag (boolean), suspicion score, and imputed value corresponds to the data entry at the same row/column in your original dataset.

The entries flagged with the highest issue scores are those you should inspect first. After that, consider having your team review the entries that Cleanlab has flagged as seeming erroneous (based on the boolean is_issue.csv); a minimal filtering/sorting sketch follows the previews below.

# Create the above 3 DataFrames
is_issue_df, issue_score_df, predicted_value_df = extract_and_write_cleanlab_results(data, cleanset_ids, identifier_col)
# Display the 3 CSV files saved to the disk
ls -l i*.csv

We can view the values in each of these DataFrames/CSVs.

is_issue_df.head(3)
patient_id medication dosage visiting_hours invoice
0 90235 False False False False
1 34227 False False True False
2 21253 False False False False
issue_score_df.head(3)
patient_id medication dosage visiting_hours invoice
0 90235 0.106462 0.468849 0.288734 0.032805
1 34227 0.495846 0.471051 1.000000 0.124499
2 21253 0.105022 0.469134 0.406527 0.039586
predicted_value_df.head(3)
patient_id medication dosage visiting_hours invoice
0 90235 Dextromethorphan 10mg 2.000000 40.0
1 34227 Amoxicillin 500mg 5.301408 300.0
2 21253 Hydrochlorothiazide 25mg 3.000000 300.0
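
For instance, following the bullet points above, you can filter a column’s flagged entries and rank them by their issue scores directly from these DataFrames. Below is a minimal sketch for the 'invoice' column (the inspect_column helper used later in this tutorial provides a richer per-column view):

audit_column = 'invoice'
flagged = data[is_issue_df[audit_column]]  # rows whose 'invoice' entry is flagged as likely erroneous
flagged = flagged.assign(invoice_issue_score=issue_score_df.loc[flagged.index, audit_column])
flagged.sort_values('invoice_issue_score', ascending=False).head()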

Optional: Analyze erroneous entries in a spreadsheet

Tabular data is frequently analyzed in Microsoft Excel for its ease of use. We can use is_issue.csv and the original data to create a new Excel file (.xlsx) with erroneous cells highlighted in yellow.
The file highlighted_data_entry.xlsx is saved to the current working directory and can be opened with any spreadsheet software.

Note: You have the flexibility to flag additional or fewer entries in your data table by setting higher or lower thresholds on the entries in issue_score.csv, allowing you to create your own customized boolean table.
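
For instance, here is a minimal sketch of a stricter custom boolean table built by thresholding the issue scores (the 0.8 cutoff is just an illustrative value to tune for your data); the resulting CSV can then be passed to highlight_errors_in_excel below in place of is_issue.csv:

CUSTOM_THRESHOLD = 0.8  # illustrative cutoff; raise/lower it to flag fewer/more entries
audited_columns = [col for col in issue_score_df.columns if col != identifier_col]
custom_issue_df = issue_score_df.copy()
custom_issue_df[audited_columns] = issue_score_df[audited_columns] > CUSTOM_THRESHOLD
custom_issue_df.to_csv("custom_is_issue.csv", index=False)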

# Library for manipulating Excel table
%pip install openpyxl
# Import necessary methods
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

# Path of output Excel file and is_issue.csv
excel_path = 'highlighted_data_entry.xlsx'
issue_csv_path = 'is_issue.csv' # Boolean table that we generated above

# Save the highlighted Excel data
highlight_errors_in_excel(excel_path, issue_csv_path)

Impute missing values

Before reviewing the detected data entry errors, let’s first see how to impute the missing entries in the original dataset using Cleanlab’s predicted values in predicted_value_df. These predictions come from state-of-the-art machine learning models trained (on your dataset) to predict each missing entry based on the other information available in its row.

highlight_imputed_values = True  # Change to False if you don't want to highlight imputed values
missing_data_imputed = missing_data.copy()

impute_columns = categorical_columns + numeric_columns # You can reduce this to a list of specific columns
for col in impute_columns:
    missing_data_imputed[col] = missing_data_imputed[col].fillna(predicted_value_df[col])

if highlight_imputed_values:
    # Boolean mask indicating whether the entry is missing or not
    mask_impute = missing_data.isna()
    # Apply highlight style
    missing_data_imputed = missing_data_imputed.style.apply(lambda x: [apply_highlight(x, col, mask_impute) for col in x.index], axis=1, subset=impute_columns)

missing_data_imputed
  patient_id diagnosis medication dosage note visiting_hours invoice
102 10322 Hypertension Losartan 50mg Patient presenting with Hypertension symptoms despite Losartan therapy. 4.000000 400.000000
129 53909 Pneumonia Azithromycin 250mg nan 5.000000 300.000000
256 68289 Bronchitis Guaifenesin 200mg Patient responding well to Guaifenesin for Bronchitis. 5.000000 202.368744
397 81520 Diabetes mellitus Insulin 10 IU Considering alternative treatments for Diabetes mellitus due to patient's intolerance to Insulin. 2.000000 174.182175
473 61117 Diabetes mellitus Insulin 10 IU Instructed patient on proper administration of Insulin for Diabetes mellitus. 1.003176 80.000000
655 64952 Influenza Rimantadine 100mg Patient presenting with Influenza symptoms despite Rimantadine therapy. 2.114235 80.000000
816 15877 Bronchitis Dextromethorphan 10mg Scheduled follow-up appointment to evaluate Bronchitis. 1.000000 20.000000
966 49846 Bronchitis Dextromethorphan 10mg nan 1.000000 20.000000

Review the data entry error audit for each column

Let’s take a closer look at the ‘medication’ column. Here we obtain a merged DataFrame using the inspect_column helper function defined above, then filter for the data points where is_issue is True and sort the values by issue_score to view the most suspicious entries detected in this column.

audit_column = 'medication'

medication_df = inspect_column(audit_column)
medication_issues = medication_df[medication_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
medication_issues.head()
patient_id diagnosis dosage note visiting_hours invoice medication medication_is_issue medication_issue_score medication_predicted_value
484 73051 Influenza 10mg Patient treated for Influenza using Rimantadine. 1.0 40.0 Rimantadine True 0.920749 Zanamivir
771 11433 Pneumonia 500mg Educated patient on potential side effects of Amoxicillin for Pneumonia. 2.0 120.0 Insulin True 0.709063 Amoxicillin

Per drugs.com, Rimantadine is typically prescribed at a dosage of 100mg, not 10mg, which is the dosage for Zanamivir (the predicted value). Also, Insulin is the medication for Diabetes mellitus, not Pneumonia. Such entries can be referred to a subject matter expert for review.

Next let’s check the visiting_hours column and similarly view the most suspicious values detected in this column.

audit_column = 'visiting_hours'

visiting_hours_df = inspect_column(audit_column)
visiting_hours_issues = visiting_hours_df[visiting_hours_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
visiting_hours_issues.head()
patient_id diagnosis medication dosage note invoice visiting_hours visiting_hours_is_issue visiting_hours_issue_score visiting_hours_predicted_value
1 34227 Pneumonia Amoxicillin 500mg Reviewing patient's response to Amoxicillin therapy for Pneumonia. 300.0 35.0 True 1.0 5.301408
325 52005 Pneumonia Levofloxacin 500mg Patient exhibiting adverse reaction to Levofloxacin prescribed for Pneumonia. 300.0 20.0 True 1.0 5.519128
639 39637 Hypertension Hydrochlorothiazide 10mg Instructed patient on proper administration of Hydrochlorothiazide for Hypertension. 500.0 65.0 True 1.0 7.023181
710 19203 Hypertension Losartan 50mg Noted improvement in patient's condition after administering Losartan for Hypertension. 200000.0 30.0 True 1.0 5.218635
855 25119 Influenza Rimantadine 100mg Recommended lifestyle modifications in addition to Rimantadine therapy for Influenza. 120.0 28.0 True 1.0 3.762323

The visiting hours recorded in these entries appear off, as patients don’t typically have these sorts of consultations for such long hours. Such entries can be double-checked based on domain knowledge, and their presence may indicate you should implement regularly scheduled data validation to ensure values fall within known ranges (an illustrative sketch follows below).
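
As an illustrative aside, a simple scheduled validation rule could look like the following sketch; the 0-12 hour range is an assumed plausible bound for this dataset (not a value produced by Cleanlab), so adjust it based on domain knowledge:

# Assumed plausible range for a single visit (hypothetical bounds for illustration)
MIN_VISITING_HOURS, MAX_VISITING_HOURS = 0, 12
out_of_range = data[(data['visiting_hours'] < MIN_VISITING_HOURS) | (data['visiting_hours'] > MAX_VISITING_HOURS)]
out_of_range[['patient_id', 'diagnosis', 'visiting_hours']].head()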

We repeat the same steps as above, but this time for the dosage column to check for entries with suspicious values.

audit_column = 'dosage'

dosage_df = inspect_column(audit_column)
dosage_issues = dosage_df[dosage_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
dosage_issues.head()
patient_id diagnosis medication note visiting_hours invoice dosage dosage_is_issue dosage_issue_score dosage_predicted_value
3 44136 Pneumonia Levofloxacin Advised patient on importance of adherence to Levofloxacin regimen for Pneumonia. 5.0 300.0 10 IU True 0.532683 500mg
639 39637 Hypertension Hydrochlorothiazide Instructed patient on proper administration of Hydrochlorothiazide for Hypertension. 65.0 500.0 10mg True 0.531477 25mg
109 38965 Bronchitis Guaifenesin Patient diagnosed with Bronchitis. Prescribed Guaifenesin for treatment. 1.0 20.0 100mg True 0.531120 200mg
484 73051 Influenza Rimantadine Patient treated for Influenza using Rimantadine. 1.0 40.0 10mg True 0.530678 100mg

Levofloxacin is prescribed in milligrams (mg), not insulin units (IU), which signals that the dosage value is definitely wrong. Also, the recommended dosage for Rimantadine, per drugs.com, is 100mg. Zanamivir, another drug for Influenza, is prescribed at a 5-10mg dose. So either the recorded dosage or the medication seems off here. The same applies to Hydrochlorothiazide, which is usually prescribed at a 25-100mg dosage for Hypertension, and to Guaifenesin for treating Bronchitis.

For all of these data entries, Cleanlab’s predicted value seems more appropriate.

Finally, we look for data entry errors in the “invoice” column using the same steps.

audit_column = 'invoice'

invoice_df = inspect_column(audit_column)
invoice_issues = invoice_df[invoice_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
invoice_issues.head()
patient_id diagnosis medication dosage note visiting_hours invoice invoice_is_issue invoice_issue_score invoice_predicted_value
710 19203 Hypertension Losartan 50mg Noted improvement in patient's condition after administering Losartan for Hypertension. 30.0 200000.0 True 1.0 606.011169

The invoice value of $200k is far too high. Such an entry should be verified against other records to avoid accounting issues later on.

Improve your dataset by correcting erroneous entries

The best way to ensure that you have high-quality data is to manually inspect the entries that Cleanlab identified as having issues (i.e. is_issue = True) and correct them. However, that could be time-consuming, so you can also auto-correct your data by replacing the entries that have been flagged as issues with Cleanlab’s predicted values. We recommend reviewing various predicted values from diverse rows to ensure they are reasonable before doing this (a spot-check sketch follows).
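
For example, before auto-correcting you might spot-check a random sample of flagged entries in a column, comparing the original values against Cleanlab’s predictions. A minimal sketch for the 'dosage' column:

audit_column = 'dosage'
flagged_mask = is_issue_df[audit_column]  # boolean mask of entries flagged as likely erroneous
review_sample = pd.DataFrame({
    'original_' + audit_column: data.loc[flagged_mask, audit_column],
    'predicted_' + audit_column: predicted_value_df.loc[flagged_mask, audit_column],
})
review_sample.sample(min(5, len(review_sample)), random_state=0)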

We demonstrate how to automatically obtain an improved dataset below, where autocorrected_dataset will have the exact same rows and columns as your original dataset, but with the entries estimated to be potentially erroneous replaced with values predicted by Cleanlab based on the other information in the dataset.

autocorrected_dataset = data.copy()
fix_columns = categorical_columns + numeric_columns # You can reduce this to a list of specific columns

impute_missing = False # Change this to True if you want to impute the missing values alongside fixing the erroneous entries
mask_condition = {col: (is_issue_df[col] if not impute_missing else is_issue_df[col] | autocorrected_dataset[col].isna()) for col in fix_columns}

for col in fix_columns:
    autocorrected_dataset[col] = autocorrected_dataset[col].mask(mask_condition[col], predicted_value_df[col])

autocorrected_dataset.to_csv('autocorrected_dataset.csv', index=False)

Lastly, we view the first 5 samples of the autocorrected dataset highlighting the corrected entries.

samples = 5
highlight_corrected_values = True
autocorrected_dataset_copy = autocorrected_dataset.copy()

if highlight_corrected_values:
    autocorrected_dataset_copy = autocorrected_dataset.head(samples).style.apply(lambda x: [apply_highlight(x, col, is_issue_df) for col in x.index], axis=1, subset=impute_columns)

autocorrected_dataset_copy
  patient_id diagnosis medication dosage note visiting_hours invoice
0 90235 Bronchitis Dextromethorphan 10mg Adjusting dosage of Dextromethorphan for Bronchitis. 2.000000 40.000000
1 34227 Pneumonia Amoxicillin 500mg Reviewing patient's response to Amoxicillin therapy for Pneumonia. 5.301408 300.000000
2 21253 Hypertension Hydrochlorothiazide 25mg Recommending additional tests to monitor efficacy of Hydrochlorothiazide for Hypertension. 3.000000 300.000000
3 44136 Pneumonia Levofloxacin 500mg Advised patient on importance of adherence to Levofloxacin regimen for Pneumonia. 5.000000 300.000000
4 76265 Bronchitis Guaifenesin 200mg Patient treated for Bronchitis using Guaifenesin. 4.000000 80.000000