Detect Data Entry Errors (and Impute Missing Values) in any Tabular Dataset
This is the recommended tutorial for programmatically identifying erroneous values (or imputing missing values) in a structured dataset via the Cleanlab Studio Python API. Cleanlab’s AI automatically detects entries in any data table (e.g. a CSV/Excel file or database) that are likely incorrect, for example due to data entry mistakes or measurement error (e.g. sensor noise).
Simply provide any data table (including columns that are: text, numeric, or categorical — even with missing values), and state-of-the-art ML models will be trained to score the quality of each datapoint (row) and flag any entry (cell value) that is likely erroneous. These same ML models can produce predictions to accurately impute missing entries in the table.
Unlike traditional data validation or data quality tools, Cleanlab is not simply based on manual rules (e.g. ‘this column only contains positive values in a certain range’). Instead Cleanlab uses AI that accounts for all of the available information in your dataset to flag suspicious entries which may be erroneous. Thus Cleanlab can automatically detect not only atypical values in a column (like traditional data quality tools), but also atypical combinations of values across columns (e.g. multiple values that are jointly incompatible).
Install and import dependencies
You can use pip to install the packages required for this tutorial as follows:
%pip install cleanlab-studio
import numpy as np
import pandas as pd
import os
import random
from IPython.display import display, Markdown
pd.set_option("display.max_colwidth", None)
Fetch and view dataset
This tutorial considers a structured dataset of medical records. To fetch this dataset, make sure you have wget installed.
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/data_entry.csv -P data
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "data/")
data = pd.read_csv(os.path.join(dataset_path, 'data_entry.csv'))
data.head()
patient_id | diagnosis | medication | dosage | note | visiting_hours | invoice | |
---|---|---|---|---|---|---|---|
0 | 90235 | Bronchitis | Dextromethorphan | 10mg | Adjusting dosage of Dextromethorphan for Bronchitis. | 2.0 | 40.0 |
1 | 34227 | Pneumonia | Amoxicillin | 500mg | Reviewing patient's response to Amoxicillin therapy for Pneumonia. | 35.0 | 300.0 |
2 | 21253 | Hypertension | Hydrochlorothiazide | 25mg | Recommending additional tests to monitor efficacy of Hydrochlorothiazide for Hypertension. | 3.0 | 300.0 |
3 | 44136 | Pneumonia | Levofloxacin | 10 IU | Advised patient on importance of adherence to Levofloxacin regimen for Pneumonia. | 5.0 | 300.0 |
4 | 76265 | Bronchitis | Guaifenesin | 200mg | Patient treated for Bronchitis using Guaifenesin. | 4.0 | 80.0 |
We can check if the dataset is complete or if there are missing entries.
missing_data = data[data.isnull().any(axis=1)]
missing_data
patient_id | diagnosis | medication | dosage | note | visiting_hours | invoice | |
---|---|---|---|---|---|---|---|
102 | 10322 | Hypertension | Losartan | NaN | Patient presenting with Hypertension symptoms despite Losartan therapy. | 4.0 | 400.0 |
129 | 53909 | Pneumonia | Azithromycin | 250mg | NaN | 5.0 | 300.0 |
256 | 68289 | Bronchitis | Guaifenesin | 200mg | Patient responding well to Guaifenesin for Bronchitis. | 5.0 | NaN |
397 | 81520 | Diabetes mellitus | Insulin | 10 IU | Considering alternative treatments for Diabetes mellitus due to patient's intolerance to Insulin. | 2.0 | NaN |
473 | 61117 | Diabetes mellitus | Insulin | 10 IU | Instructed patient on proper administration of Insulin for Diabetes mellitus. | NaN | 80.0 |
655 | 64952 | Influenza | Rimantadine | 100mg | Patient presenting with Influenza symptoms despite Rimantadine therapy. | NaN | 80.0 |
816 | 15877 | Bronchitis | NaN | 10mg | Scheduled follow-up appointment to evaluate Bronchitis. | 1.0 | 20.0 |
966 | 49846 | Bronchitis | NaN | 10mg | NaN | 1.0 | 20.0 |
Cleanlab Studio automatically deals with missing values, so we don’t have to specially handle them.
Load data into Cleanlab Studio
Our data analysis starts by loading the data, which may take a while for big datasets. First use your API key to instantiate a Studio object.
from cleanlab_studio import Studio
# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
# Initialize studio object
studio = Studio(API_KEY)
A unique ID column allows us to better track and visualize results. If such a column is not present in your dataset, you can uncomment the following code to create one.
Here we’ll use the ‘patient_id’ column which has unique values for all entries in our dataset.
# Alternatively you can create an ID column of sequential integers
# identifier_col = 'unique_id'
# data[identifier_col] = range(0, len(data))
identifier_col = 'patient_id' # unique ID column
data[identifier_col].is_unique
When loading data into Cleanlab Studio, the application will automatically infer the schema of your dataset. However in order to analyze specific numeric columns later on, we’ll override the default schema here to explicitly declare these as numeric columns. You don’t need to do this for all numeric columns, only certain columns you wish to specifically analyze/predict within a later Cleanlab Studio Project.
dataset_id = studio.upload_dataset(
data,
dataset_name="tabular-data-entry",
schema_overrides=[
{"name": "visiting_hours", "column_type": "float"},
{"name": "invoice", "column_type": "float"},
],
)
print(f"Dataset ID: {dataset_id}")
After it’s loaded, we can use the dataset’s id to create a Project in Cleanlab Studio, which automatically trains ML models to provide AI-based analysis of your dataset.
Launch projects
To find erroneous values in this dataset, we launch a Cleanlab Studio project for each numeric/categorical column which may contain errors. Each project uses ML to analyze a particular column of the dataset by treating that column as the label (dependent variable), and the rest as predictive features (independent variables). The predictions for each entry in the label column are thus based on all of the other available information in the same row, and can be used to impute missing entries and detect erroneous entries.
We first define helper functions to determine the type of each column and launch Cleanlab Studio projects for a column (based on its determined type).
Optional: Helper functions to determine each column’s type, which determines the kind of Cleanlab Studio project used to find erroneous values in it. (click to expand)
def is_categorical_column(col, unique_threshold = 0.05, min_na_values = 100):
"""
Check if a column is categorical and can be used as the labels for a Cleanlab Studio's classification project.
Parameters:
col (pd.Series or pd.DataFrame): Values in this column (extracted from the original DataFrame).
unique_threshold (float): Threshold to check uniqueness of categories in the column. If the ratio of (unique categories in the column / total number of data samples) is too high, return False.
min_na_values (int): Minimum number of non-null values in this column in order to train a reliable AI model (must be above this to return True).
Returns:
bool: Whether the column is categorical and can be used as label in Cleanlab Studio project.
"""
num_unique_vals = col.nunique()
# Check if the column has all identical values
if num_unique_vals == 1:
return False
# Check if column can be considered to be categorical
unique_ratio = num_unique_vals / len(col)
if unique_ratio > unique_threshold:
return False
# Check that every class (category) has at least 5 examples
if any(col.value_counts() < 5):
return False
# Check if too many NaN values to be worth training a model to predict
if sum(col.notna()) < min_na_values:
return False
if len(col) < 101 and sum(col.isna()) > 0:
return False
return True
def is_numeric_column(col, min_na_values = 100):
"""
Check if a column is numeric and if it can be used as the labels for a Cleanlab Studio regression project.
Parameters:
col (pd.Series or pd.DataFrame): Values in this column (extracted from the original DataFrame).
min_na_values (int): Minimum number of non-null values in this column in order to train a reliable AI model (must be above this to return True).
Returns:
bool: Whether the column is numeric and can be used as label in Cleanlab Studio project.
"""
if col.dtype not in (float, int):
return False
# Check if too many NaN values to be worth training a model to predict
if sum(col.notna()) < min_na_values:
return False
if len(col) < 101 and sum(col.isna()) > 0:
return False
return True
def filter_column_types(data, identifier_col, columns_to_exclude = []):
"""
Determine each column's type: categorical or numerical, to determine which type of Cleanlab Studio project to run for this column.
Parameters:
data (pd.DataFrame): Original dataset.
identifier_col (str): Unique identifier column (will not be considered).
columns_to_exclude (list[str]): List of names of columns to not consider.
Returns:
categorical_columns (list): List of names of columns identified as categorical
numeric_columns (list): List of names of columns identified as numeric
skip_columns (list): List of names of columns that are neither categorical nor numerical (e.g. contain arbitrary text or strings).
"""
categorical_columns = []
numeric_columns = []
skip_columns = []
all_columns = data.columns.to_list()
# List of columns to analyse
columns_to_analyze = [column for column in all_columns if column not in columns_to_exclude]
# Remove identifier_col as it doesn't require any analysis
columns_to_analyze.remove(identifier_col)
for col_name in columns_to_analyze:
col = data[col_name]
if is_categorical_column(col):
categorical_columns.append(col_name)
elif is_numeric_column(col):
numeric_columns.append(col_name)
else:
skip_columns.append(col_name)
if len(categorical_columns + numeric_columns) == 0:
print("No columns to find errors in.")
return categorical_columns, numeric_columns, skip_columns
# Helper function to launch Cleanlab project to detect errors in a column
def create_column_project(dataset_id, project_name, input_columns, audit_column, task_type, model_type="regular"):
"""
Launch Cleanlab Studio project based on the type of the column to be audited.
Parameters:
dataset_id (str): Cleanlab Studio dataset ID.
project_name (str): Name of the Cleanlab Studio project.
input_columns (list[str]): List of names of columns to use as predictive features for training Cleanlab AI model to analyze the `audit_column`.
audit_column (str): Name of the column to audit i.e. use as label for training Cleanlab AI model.
task_type (str): Type of ML task - 'multi-class' for classification or 'regression' for regression.
model_type (str): Type of ML model - 'regular' or 'fast' mode.
Returns:
project_id (str): Unique ID of this Cleanlab Studio project that can be used to programmatically fetch results.
"""
predictive_columns = input_columns.copy()
# Remove audit_column from the list of input predictive features
predictive_columns.remove(audit_column)
project_id = studio.create_project(
dataset_id=dataset_id,
project_name=f"{project_name}-{audit_column}",
modality="tabular",
task_type=task_type,
model_type=model_type,
label_column=audit_column,
feature_columns=predictive_columns
)
print(f"Project for auditing '{audit_column}' created and training has begun! project_id: {project_id}")
return project_id
Not all columns may be susceptible to data entry errors. To save time, you can specify columns_to_exclude to skip these columns in the error checks (these columns may still be used as predictive features for inferring whether other columns have erroneous values).
In our dataset, we’ll skip the ‘diagnosis’ column. Note that any identifier column you previously declared for the dataset (in our case ‘patient_id’) will be ignored (neither being checked for errors nor used as a predictive feature).
The following code determines which categorical and numeric columns we will check for errors (i.e. run a project for).
# List of columns to exclude
columns_to_exclude = ['diagnosis']
categorical_columns, numeric_columns, skip_columns = filter_column_types(data, identifier_col, columns_to_exclude)
print("Categorical columns to audit:", categorical_columns, "\nNumeric columns to audit:", numeric_columns, "\nNot categorical or numeric (not audited):", skip_columns)
We only audit numeric or categorical columns for errors (the ‘note’ column in our dataset contains free-form text and is automatically not checked for errors). For each column to audit, an appropriate Cleanlab project is launched (classification project for categorical columns, regression project for numeric columns). Let’s launch these projects:
project_name = "tutorial" # You can change this prefix used to name the per-column projects
model_type = "regular" # You can set this to "fast" to get quicker but less accurate results
# All the columns (except identifier) can be predictive features
input_columns = data.columns.to_list() # You can reduce this to use fewer predictive features
input_columns.remove(identifier_col)
project_ids = {} # will store IDs of all the projects for different columns
# Run classification projects for categorical columns
for audit_column in categorical_columns:
project_ids[audit_column] = create_column_project(dataset_id, project_name, input_columns, audit_column, 'multi-class', model_type)
# Run regression projects for numeric columns
for audit_column in numeric_columns:
project_ids[audit_column] = create_column_project(dataset_id, project_name, input_columns, audit_column, 'regression', model_type)
Once all of the projects have been launched successfully and their project_ids are visible, feel free to close this notebook. It will take time for Cleanlab’s AI to train and analyze your data. Come back after training is complete (you will receive emails) and continue with the notebook to review your results. Each project produces a cleanset (cleaned dataset). For this tutorial, all projects launched here should complete within around 10 minutes.
You should only execute the above cells once! Do not call create_project again.
Before we proceed with the rest of the tutorial, we will wait for all the projects to complete. You can poll each project’s status to programmatically wait until the results are ready for review (this next code cell will take some time):
cleanset_ids = {}
for audit_column in project_ids:
cleanset_ids[audit_column] = studio.get_latest_cleanset_id(project_ids[audit_column])
print(f"Project for auditing '{audit_column}' is running! cleanset_id: {cleanset_ids[audit_column]}")
project_status = studio.wait_until_cleanset_ready(cleanset_ids[audit_column])
Get project results
Each project generates smart metadata about each data point, which is stored as Cleanlab columns we can fetch.
If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cells.
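For instance, here is a minimal sketch (with a placeholder ID, commented out like the optional code earlier in this tutorial) of reloading the Cleanlab columns for one project after a restart:
# Optional sketch: reload previously computed results after restarting the notebook.
# Replace the placeholder below with one of the cleanset_id values printed by the cells above.
# example_cleanset_id = "<cleanset_id printed for the medication project>"
# medication_results = studio.download_cleanlab_columns(example_cleanset_id)
# medication_results.head()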
We define a helper function that fetches the Cleanlab-generated metadata columns for each project and stores which entries are estimated to be erroneous, confidence scores for these estimates, and AI model predictions in 3 CSV files (as well as returning the corresponding DataFrames). Other helper functions are defined to help inspect/highlight Cleanlab results for each column using these 3 DataFrames/CSVs.
Optional: Helper functions to extract and re-arrange data entries detected as possibly erroneous (click to expand)
def extract_and_write_cleanlab_results(data, cleanset_ids, identifier_col):
"""
Fetch Cleanlab columns to store erroneous entries, respective confidence scores, and AI model predictions in 3 different CSVs
Parameters:
data (pd.DataFrame): Original data's DataFrame
cleanset_ids (dict): Dictionary with column name as key, and its respective cleanset_id as value
identifier_col (str): Unique identifier column
Returns:
is_issue_df (pd.DataFrame): DataFrame of booleans, for all audited columns, denoting whether the value is erroneous or not
issue_score_df (pd.DataFrame): DataFrame of confidence scores (0-1), for all audited columns, denoting probability of value being erroneous
prediction_df (pd.DataFrame): DataFrame of AI predictions, for all audited columns, that would be used to impute missing values
"""
# Instantiate DataFrames with identifier column
is_issue_df, issue_score_df, prediction_df = (pd.DataFrame(data[identifier_col]) for _ in range(3))
for audit_column in cleanset_ids:
# Download Cleanlab columns from this column's project
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_ids[audit_column])
data_cleanlab_df = data.merge(cleanlab_columns_df, left_index=True, right_index=True)
# DataFrame to store True/False flag whether the entry seems erroneous
is_issue_df[audit_column] = data_cleanlab_df['is_label_issue']
# DataFrame to store AI's confidence score on whether the entry is erroneous
issue_score_df[audit_column] = data_cleanlab_df['label_issue_score']
# DataFrame to store predicted values
prediction_df[audit_column] = np.where(data_cleanlab_df["is_label_issue"] | data_cleanlab_df[audit_column].isna(),
data_cleanlab_df["suggested_label"],
data_cleanlab_df[audit_column])
# Save DataFrames to CSV files (comment these lines if you prefer not to)
is_issue_df.to_csv("is_issue.csv", index=False)
issue_score_df.to_csv("issue_score.csv", index=False)
prediction_df.to_csv("imputed_values.csv", index=False)
return is_issue_df, issue_score_df, prediction_df
# Helper function to view the Cleanlab analysis results for a particular column
def inspect_column(audit_column):
"""
Create a DataFrame with original data, column to inspect and its respective Cleanlab columns
Parameters:
audit_column (str): Name of the column to inspect
Returns:
merged_df (pd.DataFrame): DataFrame of original data, inspected column and its respective Cleanlab columns
"""
is_issue_col = is_issue_df[audit_column].rename(f"{audit_column}_is_issue")
issue_score_col = issue_score_df[audit_column].rename(f"{audit_column}_issue_score")
predicted_values_col = predicted_value_df[audit_column].rename(f"{audit_column}_predicted_value")
# Moving the column for ease of comparison
data_columns = data.columns.to_list()
data_columns.remove(audit_column)
data_columns.append(audit_column)
merged_df = pd.concat([data[data_columns], is_issue_col, issue_score_col, predicted_values_col], axis=1)
return merged_df
# Helper function to highlight particular values for ease of visualization
def apply_highlight(row, col, mask):
'''
Highlight the (row, col) entry if mask[row, col] is True
Parameters:
row (pd.Series): Row of the DataFrame being styled (passed by pandas Styler.apply)
col (str): Name of the column to check for this row
mask (pd.DataFrame): Boolean DataFrame indicating whether each entry is to be highlighted
Returns:
CSS-style (str): Style to apply to the (row, col) entry
'''
# Check if the mask is True for this particular entry
if mask.at[row.name, col]:
return 'background-color: red'
else:
return ''
# Helper function to highlight entries in Excel based on a boolean table
def highlight_errors_in_excel(excel_path, issue_csv_path):
"""
Highlights entries in an Excel file based on boolean values in a DataFrame.
Parameters:
excel_path (path in str): Path of Excel file to manipulate
issue_csv_path (path in str): Path of `is_issue.csv` containing boolean table
Returns:
None: Save the modified Excel file
"""
# Export original data to Excel format
data.to_excel(excel_path, index=False)
# Read Excel data and is_issue.csv
excel_workbook = load_workbook(excel_path)
excel_worksheet = excel_workbook.active
is_issue_df = pd.read_csv(issue_csv_path)
# Columns to consider for highlighting
# is_issue_df contains the columns we audited
columns_to_highlight = is_issue_df.columns.to_list() # You can reduce this to a list of specific columns
columns_to_highlight.remove(identifier_col)
# Define pattern to apply to a cell
yellow_fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
# Map each column name to its column index in the Excel sheet
excel_column_with_indice = {cell.value:cell.column for cell in excel_worksheet[1]}
# Iterate over columns_to_highlight
for column in columns_to_highlight:
excel_column_idx = excel_column_with_indice[column]
# Apply highlight if the value in `is_issue_df` is True
for row_idx, value in enumerate(is_issue_df[column]):
excel_row_idx = row_idx + 2 # Excel row indices start at 1 and the first row holds the column headers, hence add 2
if value: excel_worksheet.cell(row=excel_row_idx, column=excel_column_idx).fill = yellow_fill
excel_workbook.save(excel_path)
This helper function outputs 3 CSV files:
is_issue.csv contains True/False values specifying whether each entry value in the input dataset is estimated to be likely erroneous (True) or not (False). You can filter based on the True values to determine which subset of data appears suspicious/corrupted in your dataset (inferred based on the other available information in the dataset).
issue_score.csv contains scores between 0 and 1 estimating the likelihood that each entry value is corrupted (higher scores indicate values that appear more suspicious/erroneous). You can sort a column’s values by these scores (in descending order) to prioritize which values to review (i.e. which entries are most suspicious).
imputed_values.csv contains a value for each entry predicted by Cleanlab’s AI model. The prediction represents the expected value of this entry based on all of the other information in the dataset (in particular this row), and can be used as an imputed value if the original value were missing.
All 3 CSV files contain the same number of rows as the original dataset, along with the identifier column and one column for each column we ran through a Cleanlab project. Each erroneous flag (boolean), suspicion score, and imputed value corresponds to the data entry at the same row/column in your original dataset.
The entries flagged with the highest issue scores are those you should inspect first. After that, consider having your team review the entries that Cleanlab has flagged as seemingly erroneous (based on the boolean is_issue.csv).
# Create the above 3 DataFrames
is_issue_df, issue_score_df, predicted_value_df = extract_and_write_cleanlab_results(data, cleanset_ids, identifier_col)
# Display the 3 CSV files saved to the disk
ls -l i*.csv
We can view the values in each of these DataFrames/CSVs.
is_issue_df.head(3)
patient_id | medication | dosage | visiting_hours | invoice | |
---|---|---|---|---|---|
0 | 90235 | False | False | False | False |
1 | 34227 | False | False | True | False |
2 | 21253 | False | False | False | False |
issue_score_df.head(3)
patient_id | medication | dosage | visiting_hours | invoice | |
---|---|---|---|---|---|
0 | 90235 | 0.106462 | 0.468849 | 0.288734 | 0.032805 |
1 | 34227 | 0.495846 | 0.471051 | 1.000000 | 0.124499 |
2 | 21253 | 0.105022 | 0.469134 | 0.406527 | 0.039586 |
predicted_value_df.head(3)
patient_id | medication | dosage | visiting_hours | invoice | |
---|---|---|---|---|---|
0 | 90235 | Dextromethorphan | 10mg | 2.000000 | 40.0 |
1 | 34227 | Amoxicillin | 500mg | 5.301408 | 300.0 |
2 | 21253 | Hydrochlorothiazide | 25mg | 3.000000 | 300.0 |
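As a quick sanity check, we can also summarize the audit across columns. Here is a minimal sketch using the DataFrames produced above (audited_columns and audit_summary are names introduced just for this example):
# Summarize the audit: number of flagged entries and highest issue score per audited column
audited_columns = [col for col in is_issue_df.columns if col != identifier_col]
audit_summary = pd.DataFrame({
    "num_flagged": is_issue_df[audited_columns].sum(),
    "max_issue_score": issue_score_df[audited_columns].max(),
})
audit_summary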
Optional: Analyze erroneous entries in a spreadsheet
Tabular data is frequently analyzed in Microsoft Excel for its ease of use. We can make use of is_issue.csv and the original data to create a new Excel file (.xlsx) with erroneous cells highlighted in yellow.
The file highlighted_data_entry.xlsx is saved to the current working directory and can be opened using spreadsheet software.
Note: You can flag more or fewer entries in your data table by setting lower or higher thresholds on the scores in issue_score.csv, allowing you to create a customized boolean table, as sketched below.
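For instance, here is a minimal sketch of one way to build such a customized boolean table, using issue_score_df from above and a hypothetical cutoff of 0.8 (custom_threshold and custom_is_issue_df are names introduced just for this example):
# Derive a custom boolean table by thresholding the issue scores.
# Lower thresholds flag more entries as issues; higher thresholds flag fewer.
custom_threshold = 0.8  # illustrative cutoff; tune as needed (optionally per column)
audited_columns = [col for col in issue_score_df.columns if col != identifier_col]
custom_is_issue_df = issue_score_df.copy()
custom_is_issue_df[audited_columns] = issue_score_df[audited_columns] >= custom_threshold
custom_is_issue_df.to_csv("custom_is_issue.csv", index=False)
You could then pass 'custom_is_issue.csv' as the issue_csv_path argument to highlight_errors_in_excel below, instead of the default is_issue.csv.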
# Library for manipulating Excel table
%pip install openpyxl
# Import necessary methods
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
# Path of output Excel file and is_issue.csv
excel_path = 'highlighted_data_entry.xlsx'
issue_csv_path = 'is_issue.csv' # Boolean table that we generated above
# Save the highlighted Excel data
highlight_errors_in_excel(excel_path, issue_csv_path)
Impute missing values
Before detecting data entry errors, let’s first see how to impute the missing entries in the original dataset using Cleanlab’s predicted values in predicted_value_df. These predictions come from state-of-the-art machine learning models trained (on your dataset) to predict each missing entry based on the other information available in its row.
highlight_imputed_values = True # Change to False if you don't want to highlight imputed values
missing_data_imputed = missing_data.copy()
impute_columns = categorical_columns + numeric_columns # You can reduce this to a list of specific columns
for col in impute_columns:
missing_data_imputed[col] = missing_data_imputed[col].fillna(predicted_value_df[col])
if highlight_imputed_values:
# Boolean mask indicating whether the entry is missing or not
mask_impute = missing_data.isna()
# Apply highlight style
missing_data_imputed = missing_data_imputed.style.apply(lambda x: [apply_highlight(x, col, mask_impute) for col in x.index], axis=1, subset=impute_columns)
missing_data_imputed
patient_id | diagnosis | medication | dosage | note | visiting_hours | invoice | |
---|---|---|---|---|---|---|---|
102 | 10322 | Hypertension | Losartan | 50mg | Patient presenting with Hypertension symptoms despite Losartan therapy. | 4.000000 | 400.000000 |
129 | 53909 | Pneumonia | Azithromycin | 250mg | nan | 5.000000 | 300.000000 |
256 | 68289 | Bronchitis | Guaifenesin | 200mg | Patient responding well to Guaifenesin for Bronchitis. | 5.000000 | 202.368744 |
397 | 81520 | Diabetes mellitus | Insulin | 10 IU | Considering alternative treatments for Diabetes mellitus due to patient's intolerance to Insulin. | 2.000000 | 174.182175 |
473 | 61117 | Diabetes mellitus | Insulin | 10 IU | Instructed patient on proper administration of Insulin for Diabetes mellitus. | 1.003176 | 80.000000 |
655 | 64952 | Influenza | Rimantadine | 100mg | Patient presenting with Influenza symptoms despite Rimantadine therapy. | 2.114235 | 80.000000 |
816 | 15877 | Bronchitis | Dextromethorphan | 10mg | Scheduled follow-up appointment to evaluate Bronchitis. | 1.000000 | 20.000000 |
966 | 49846 | Bronchitis | Dextromethorphan | 10mg | nan | 1.000000 | 20.000000 |
Review the data entry error audit for each column
Let’s take a closer look at the ‘medication’ column. Here we obtain a merged DataFrame using the inspect_column helper function defined above, then filter for the data points where is_issue is True and sort by issue_score to view the most suspicious entries detected in this column.
audit_column = 'medication'
medication_df = inspect_column(audit_column)
medication_issues = medication_df[medication_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
medication_issues.head()
patient_id | diagnosis | dosage | note | visiting_hours | invoice | medication | medication_is_issue | medication_issue_score | medication_predicted_value | |
---|---|---|---|---|---|---|---|---|---|---|
484 | 73051 | Influenza | 10mg | Patient treated for Influenza using Rimantadine. | 1.0 | 40.0 | Rimantadine | True | 0.920749 | Zanamivir |
771 | 11433 | Pneumonia | 500mg | Educated patient on potential side effects of Amoxicillin for Pneumonia. | 2.0 | 120.0 | Insulin | True | 0.709063 | Amoxicillin |
Per drugs.com, Rimantadine is typically prescribed at a dosage of 100mg, not 10mg, which is the dosage for Zanamivir (the predicted value). Also, Insulin is a medication for Diabetes mellitus, not Pneumonia. Such entries can be referred to a subject matter expert for review.
Next let’s check the visiting_hours column and similarly view the most suspicious values detected in this column.
audit_column = 'visiting_hours'
visiting_hours_df = inspect_column(audit_column)
visiting_hours_issues = visiting_hours_df[visiting_hours_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
visiting_hours_issues.head()
patient_id | diagnosis | medication | dosage | note | invoice | visiting_hours | visiting_hours_is_issue | visiting_hours_issue_score | visiting_hours_predicted_value | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 34227 | Pneumonia | Amoxicillin | 500mg | Reviewing patient's response to Amoxicillin therapy for Pneumonia. | 300.0 | 35.0 | True | 1.0 | 5.301408 |
325 | 52005 | Pneumonia | Levofloxacin | 500mg | Patient exhibiting adverse reaction to Levofloxacin prescribed for Pneumonia. | 300.0 | 20.0 | True | 1.0 | 5.519128 |
639 | 39637 | Hypertension | Hydrochlorothiazide | 10mg | Instructed patient on proper administration of Hydrochlorothiazide for Hypertension. | 500.0 | 65.0 | True | 1.0 | 7.023181 |
710 | 19203 | Hypertension | Losartan | 50mg | Noted improvement in patient's condition after administering Losartan for Hypertension. | 200000.0 | 30.0 | True | 1.0 | 5.218635 |
855 | 25119 | Influenza | Rimantadine | 100mg | Recommended lifestyle modifications in addition to Rimantadine therapy for Influenza. | 120.0 | 28.0 | True | 1.0 | 3.762323 |
The visiting hours recorded in these entries appear off, as patients don’t typically have consultations lasting this many hours. Such entries can be double-checked based on domain knowledge, and their presence may indicate you should implement regularly scheduled data validation to ensure values fall within known ranges.
We repeat the same steps as above, but this time for the dosage column, to check for entries with suspicious values.
audit_column = 'dosage'
dosage_df = inspect_column(audit_column)
dosage_issues = dosage_df[dosage_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
dosage_issues.head()
patient_id | diagnosis | medication | note | visiting_hours | invoice | dosage | dosage_is_issue | dosage_issue_score | dosage_predicted_value | |
---|---|---|---|---|---|---|---|---|---|---|
3 | 44136 | Pneumonia | Levofloxacin | Advised patient on importance of adherence to Levofloxacin regimen for Pneumonia. | 5.0 | 300.0 | 10 IU | True | 0.532683 | 500mg |
639 | 39637 | Hypertension | Hydrochlorothiazide | Instructed patient on proper administration of Hydrochlorothiazide for Hypertension. | 65.0 | 500.0 | 10mg | True | 0.531477 | 25mg |
109 | 38965 | Bronchitis | Guaifenesin | Patient diagnosed with Bronchitis. Prescribed Guaifenesin for treatment. | 1.0 | 20.0 | 100mg | True | 0.531120 | 200mg |
484 | 73051 | Influenza | Rimantadine | Patient treated for Influenza using Rimantadine. | 1.0 | 40.0 | 10mg | True | 0.530678 | 100mg |
Levofloxacin is prescribed in milligrams (mg), not international units (IU), which signals that this dosage value is definitely wrong. Also, the recommended dosage for Rimantadine, per drugs.com, is 100mg; Zanamivir, another drug for Influenza, is prescribed at a 5-10mg dose. So either the recorded dosage or the medication seems off here. The same applies to Hydrochlorothiazide, which is usually prescribed at a 25-100mg dosage for Hypertension, and to Guaifenesin for treating Bronchitis.
For all of these data entries, Cleanlab’s predicted value seems more appropriate.
Finally, we look for data entry errors in the “invoice” column using the same steps.
audit_column = 'invoice'
invoice_df = inspect_column(audit_column)
invoice_issues = invoice_df[invoice_df[f"{audit_column}_is_issue"] == True].sort_values(f"{audit_column}_issue_score", ascending=False)
invoice_issues.head()
patient_id | diagnosis | medication | dosage | note | visiting_hours | invoice | invoice_is_issue | invoice_issue_score | invoice_predicted_value | |
---|---|---|---|---|---|---|---|---|---|---|
710 | 19203 | Hypertension | Losartan | 50mg | Noted improvement in patient's condition after administering Losartan for Hypertension. | 30.0 | 200000.0 | True | 1.0 | 606.011169 |
The invoice value of $200k is far too high. Such an entry should be verified against other records to avoid accounting issues later on.
Improve your dataset by correcting erroneous entries
The best way to ensure that you have high-quality data is to manually inspect the entries that Cleanlab identified as issues (i.e. is_issue = True) and correct them. However, that can be time-consuming, so you can instead auto-correct your data by replacing the entries flagged as issues with Cleanlab’s predicted values. We recommend reviewing predicted values from diverse rows to ensure they are reasonable before doing this, as in the quick spot check below.
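For example, here is a minimal spot-check sketch using the DataFrames produced above, comparing original vs. predicted values for entries flagged in the ‘dosage’ column (flagged_dosage and spot_check are names introduced just for this example):
# Spot check a flagged column before auto-correcting: compare original vs. predicted values
flagged_dosage = is_issue_df.index[is_issue_df["dosage"]]
spot_check = pd.DataFrame({
    "original": data.loc[flagged_dosage, "dosage"],
    "predicted": predicted_value_df.loc[flagged_dosage, "dosage"],
})
spot_check.head()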
We demonstrate how to automatically obtain an improved dataset below, where autocorrected_dataset will have the exact same rows and columns as your original dataset, but with the entries estimated to be potentially erroneous replaced with values predicted by Cleanlab based on the other information in the dataset.
autocorrected_dataset = data.copy()
fix_columns = categorical_columns + numeric_columns # You can reduce this to a list of specific columns
impute_missing = False # Change this to True if you want to impute the missing values alongside fixing the erroneous entries
mask_condition = {col: (is_issue_df[col] if not impute_missing else is_issue_df[col] | autocorrected_dataset[col].isna()) for col in fix_columns}
for col in fix_columns:
autocorrected_dataset[col] = autocorrected_dataset[col].mask(mask_condition[col], predicted_value_df[col])
autocorrected_dataset.to_csv('autocorrected_dataset.csv', index=False)
Lastly, we view the first 5 samples of the autocorrected dataset highlighting the corrected entries.
samples = 5
highlight_corrected_values = True
autocorrected_dataset_copy = autocorrected_dataset.copy()
if highlight_corrected_values:
autocorrected_dataset_copy = autocorrected_dataset.head(samples).style.apply(lambda x: [apply_highlight(x, col, is_issue_df) for col in x.index], axis=1, subset=fix_columns)
autocorrected_dataset_copy
patient_id | diagnosis | medication | dosage | note | visiting_hours | invoice | |
---|---|---|---|---|---|---|---|
0 | 90235 | Bronchitis | Dextromethorphan | 10mg | Adjusting dosage of Dextromethorphan for Bronchitis. | 2.000000 | 40.000000 |
1 | 34227 | Pneumonia | Amoxicillin | 500mg | Reviewing patient's response to Amoxicillin therapy for Pneumonia. | 5.301408 | 300.000000 |
2 | 21253 | Hypertension | Hydrochlorothiazide | 25mg | Recommending additional tests to monitor efficacy of Hydrochlorothiazide for Hypertension. | 3.000000 | 300.000000 |
3 | 44136 | Pneumonia | Levofloxacin | 500mg | Advised patient on importance of adherence to Levofloxacin regimen for Pneumonia. | 5.000000 | 300.000000 |
4 | 76265 | Bronchitis | Guaifenesin | 200mg | Patient treated for Bronchitis using Guaifenesin. | 4.000000 | 80.000000 |