Produce Better-Quality Data 10x Faster in any Data Annotation Platform You’re Using (e.g. Label Studio)

This tutorial shows how you can integrate Cleanlab Studio with any data annotation tool. Reduce how much annotation work your team has to do (thanks to Cleanlab’s auto-labeling) while simultaneously producing better results (thanks to Cleanlab’s label error detection). The basic workflow is: first label some data with your tool, then use Cleanlab to catch incorrectly selected labels and to auto-label the large subset of the data that AI (foundation models) can confidently handle. You can iterate these 2 steps multiple times. Each time you label (or re-label) a bit more data with your annotation tool, Cleanlab’s AI learns from your new labels and becomes more effective at auto-labeling and label error detection. Since Cleanlab Studio is so easy to use, you can mostly focus on working with your annotation tool as you normally would, except now you’ll get datasets labeled more quickly and accurately.

Data Annotation with Label Studio and Cleanlab Studio

Here we’ll demonstrate this workflow with the Label Studio data annotation platform (one of the most popular open-source tools), but you can integrate any data annotation tool of your choosing with Cleanlab Studio in the same way. For example, you could follow the same workflow with annotation tools like Labelbox, LabelMe, or COCO Annotator, among others. Some annotation platforms have built-in AI features, but none with the same powerful capabilities and ease of use as Cleanlab Studio.

This particular tutorial will focus on labeling image data for multi-class classification, but the same steps can be applied to text or tabular data being labeled for other types of tasks (e.g. tagging instead of multi-class classification). Here are the steps we’ll follow in this tutorial, starting with an entirely unlabeled dataset:

  1. Use your data annotation tool to label some of the data.
  2. Then load the currently labeled and unlabeled sets of this data into Cleanlab Studio.
  3. Use Cleanlab Studio to automatically find label errors (and other issues in the data) and send data flagged as likely incorrectly labeled back to your annotation tool for relabeling. Also use Cleanlab Studio to auto-label as much of the unlabeled data as can be confidently handled by Cleanlab’s AI.
  4. Iterate steps 1-3 until all of the data is labeled. With each repetition, less unlabeled data will remain to label in step 1.

Install and import required dependencies

Note: this tutorial requires that you’ve created a Cleanlab Studio account. You can use pip to install all packages required for this tutorial as follows:

pip install cleanlab-studio label-studio

When creating this tutorial, we used version 1.11.0 of the label-studio Python package.

from cleanlab_studio import Studio
import pandas as pd
import requests

pd.set_option('display.max_colwidth', None)

Prepare the dataset

The dataset we’ll label is a sample of images of potato chips from PepsiCo; the original data can be found on Kaggle here. Our task is to label each image as one of 2 categories: Defective or Non-Defective. Below are some images from the dataset, with 2 Non-Defective potato chips depicted in the top row and 2 Defective potato chips in the bottom row:

[Example images from the dataset]

We have already stored the tutorial dataset with each image at an accessible URL. You could also produce such URLs for files in your cloud storage. Both Cleanlab Studio and Label Studio can operate directly on such linked external media files, or you can run these tools on local files from your machine. Let’s fetch the CSV file listing the URLs of the images that comprise our dataset:

wget -nc "https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/potato_chip_image_urls.csv"

Optionally, you can also download the full image dataset from here.

image_df = pd.read_csv("potato_chip_image_urls.csv")
image_urls = list(image_df["image"])

Next let’s define a DataFrame that will store our final labels for each image. This object will be updated throughout the tutorial.

This final labels DataFrame has an image column pointing to the URL of each image in our dataset. We initialize it with a label column, which our goal is to fill in for each image. We also include a historical_labels column that holds the history of labels (represented as a list) for each image. Anytime a chosen label changes due to a correction (when relabeling an image in the data annotation tool), the newly selected label is appended to that image’s list of historical labels. Storing the previously selected labels is particularly useful when the data annotators are noisy.

final_labels_df = image_df.copy()
final_labels_df["label"] = [None] * len(image_urls)
final_labels_df["historical_labels"] = [[] for _ in range(len(image_urls))]

final_labels_df.head()
image label historical_labels
0 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014120.jpg None []
1 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020053.jpg None []
2 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005833.jpg None []
3 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013704.jpg None []
4 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015012.jpg None []

Label some of the images with your annotation tool

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Now let’s label a batch of the images using Label Studio, our example 3rd party annotation tool in this tutorial. Here we’ll use the Label Studio Python SDK to connect to the API.

Before executing the Python code below, first run label-studio in a command line interface of your choice. Don’t proceed until you have Label Studio running locally at: http://localhost:8080

from label_studio_sdk import Client

# Define the URL where Label Studio is accessible, plus the API key for your
# Label Studio user account (found in your account settings)
LABEL_STUDIO_URL = 'http://localhost:8080'
label_studio_api_key = "<insert API key here>"

# Connect to the Label Studio API and check the connection
ls = Client(url=LABEL_STUDIO_URL, api_key=label_studio_api_key)
ls.check_connection()
    {'status': 'UP'}

The status should be UP if everything’s working properly. Let’s create a project via our Label Studio Client object for labeling our potato chip images.

project = ls.start_project(
    title='Label Studio PepsiCo Image Tutorial Project',
    label_config='''
    <View>
      <Image name="image" value="$image" valueType="url"/>
      <Choices name="choice" toName="image">
        <Choice value="Defective"/>
        <Choice value="Non-Defective"/>
      </Choices>
    </View>
    '''
)

Format the data for Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

To programmatically import our images into Label Studio, we must format them like this:

[
    {'image': image_url1},
    {'image': image_url2},
    {'image': image_url3},
]

This is a list of dicts, where each dict has an image key whose corresponding value image_url points to where the image is accessible. Now let’s use a helper function (defined below) to format our image URLs into the list of dicts that Label Studio accepts in its project.import_tasks function.

Optional: Define helper function to format the image URLs for Label Studio (click to expand)

from typing import List, Dict

def format_label_studio_image_urls(image_urls: List[str]) -> List[Dict[str, str]]:
    """
    Converts a list of image URLs into the format that Label Studio accepts.

    Parameters:
    - image_urls (List[str]): A list of full URLs to the images that need to be labeled.

    Returns:
    - List[Dict[str, str]]: A list of dictionaries, where each dictionary contains a key 'image'
      with its value being the full URL to an image file, ready for use in Label Studio.
    """
    # Convert the list of URLs into the expected dictionary format for Label Studio
    label_studio_urls = [{'image': url} for url in image_urls]
    return label_studio_urls

image_urls_for_label_studio = format_label_studio_image_urls(image_urls)
print(len(image_urls_for_label_studio))
    400

Let’s look at 3 example images we are importing into Label Studio:

# Check what some of the potato chip images look like

from IPython.display import HTML, display

images_html = "".join(
    f'''<div style="text-align: center;"><img src="{url['image']}" width="400" alt="" /></div>'''
    for url in image_urls_for_label_studio[:3]
)

# Display the images in the notebook
display(HTML(images_html))

Now we can import the images into Label Studio for labeling!

# Task IDs used in Label Studio to represent each image are the output we see printed below
imported_ids = project.import_tasks(image_urls_for_label_studio)
print(imported_ids[:5])
    [161861, 161862, 161863, 161864, 161865]

Learn how to import your data into Label Studio via this guide.

Label 50 images in Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Go to your project in Label Studio and start labeling your imported images. Let’s just label 50 of those images to start. This guide shows how to label/annotate your data with Label Studio, but you could use any other data annotation platform.

Try to select a diverse set of initial examples to label, one that represents all of the classes of interest well (since Cleanlab’s AI will later learn from this initially labeled set).
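
If you have no particular prior knowledge of the data, one simple option is to take a uniformly random sample as the initial batch, since it tends to cover the classes roughly in proportion to their frequency. Here’s a minimal illustrative sketch (not part of the core workflow) for choosing which images to label first:

import random

# Illustrative only: pick 50 random images to prioritize as the initial batch.
random.seed(0)  # fix the seed so the chosen batch is reproducible
initial_batch_urls = random.sample(image_urls, k=50)
print(initial_batch_urls[:2])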

Once you have labeled your images, one way to export the labels from Label Studio is via the Export button in your locally running Label Studio session. In this tutorial, we instead export programmatically, which minimizes the amount of data you must manually move between tools and demonstrates how to interact with Label Studio through its Python API.

By default, Label Studio only exports tasks (i.e. data points in Label Studio) with annotations. For more details on exporting tasks, you can go here.

To export these annotated tasks programmatically, we use the following helper function.

Optional: Define helper function to export data from your Label Studio project (click to expand)

def export_data_from_label_studio_project(project_id: int, label_studio_api_key: str, output_file_path: str, export_type: str = "CSV", server_url: str = "http://localhost:8080"):
    """
    Exports data from a specific project in Label Studio.

    This function requests an export of data from a specified project in Label Studio
    (supporting various export formats) and saves the exported data to a file at the
    specified path.

    Parameters:
    - project_id (int): The ID of the project in Label Studio to export data from.
    - label_studio_api_key (str): The API token for authenticating with the Label Studio server.
    - output_file_path (str): The path where the exported file should be saved.
    - export_type (str, optional): The export format (e.g., CSV, JSON). Defaults to "CSV".
    - server_url (str, optional): The URL of the Label Studio server. Defaults to "http://localhost:8080".

    Returns:
    - None: The function saves the exported data to a file and prints the outcome.
    """
    url = f"{server_url}/api/projects/{project_id}/export?exportType={export_type}"

    # Authorization header containing your Label Studio API token
    headers = {
        "Authorization": f"Token {label_studio_api_key}"
    }

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        # Save the exported data to a file
        with open(output_file_path, "wb") as f:
            f.write(response.content)
        print("Data exported successfully.")
    else:
        print(f"Failed to export data. Status code: {response.status_code}, Message: {response.text}")

To export our tasks, we pass our Label Studio project ID and our API key into this helper function.

This project ID can be found in the URL of your project after you have selected it within Label Studio.

We also choose a CSV file to export our tasks into.

ls_project_id = 55 # REPLACE WITH YOUR PROJECT ID
export_filename = "label_studio_potato_chips_annotations.csv" # REPLACE WITH YOUR FILENAME

export_data_from_label_studio_project(ls_project_id, label_studio_api_key, export_filename)

Let’s see what our labeled data looks like so far.

label_studio_annotations = pd.read_csv(export_filename)
label_studio_annotations.head(10)
annotation_id annotator choice created_at id image lead_time updated_at
0 1599 1 Defective 2024-03-21T21:40:46.231429Z 161861 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014120.jpg 3.178 2024-03-21T21:40:46.231451Z
1 1600 1 Defective 2024-03-21T21:40:48.406866Z 161862 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020053.jpg 2.073 2024-03-21T21:40:48.406890Z
2 1601 1 Defective 2024-03-21T21:40:50.714271Z 161863 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005833.jpg 2.199 2024-03-21T21:40:50.714307Z
3 1602 1 Defective 2024-03-21T21:40:52.918417Z 161864 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013704.jpg 2.073 2024-03-21T21:40:52.918444Z
4 1603 1 Defective 2024-03-21T21:40:55.034898Z 161865 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015012.jpg 2.017 2024-03-21T21:40:55.034925Z
5 1604 1 Defective 2024-03-21T21:40:57.286808Z 161866 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011506.jpg 2.131 2024-03-21T21:40:57.286836Z
6 1605 1 Non-Defective 2024-03-21T21:40:59.702141Z 161867 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002817.jpg 2.298 2024-03-21T21:40:59.702167Z
7 1606 1 Non-Defective 2024-03-21T21:41:02.890256Z 161868 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002553.jpg 3.059 2024-03-21T21:41:02.890279Z
8 1611 1 Defective 2024-03-21T22:13:55.609870Z 161869 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020243.jpg 1972.597 2024-03-21T22:13:55.609890Z
9 1612 1 Defective 2024-03-21T22:13:58.086707Z 161870 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014858.jpg 2.342 2024-03-21T22:13:58.086738Z

Update final labels DataFrame with labels obtained in our annotation tool

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Now we can fill in the official labels for the 50 images that we just annotated in Label Studio.

# Here we define a helper function to update our final labels based on annotations from Label Studio

def update_final_labels_via_labelstudio(final_labels_df, labels_df):
    """
    Updates the 'label' column and appends labels to 'historical_labels' in final_labels_df
    based on matching the 'image' column between final_labels_df and labels_df.

    Parameters:
    - final_labels_df (pd.DataFrame): The DataFrame to update, containing columns 'label',
      'image', and 'historical_labels'.
    - labels_df (pd.DataFrame): The DataFrame containing new labels from our annotation tool,
      with columns including 'choice' and 'image'.

    Returns:
    - pd.DataFrame: The updated final_labels_df.
    """
    for _, label_row in labels_df.iterrows():
        image_url = label_row['image']
        label_choice = label_row['choice']

        # Find the index in final_labels_df where the image URL matches
        index = final_labels_df.index[final_labels_df['image'] == image_url]

        if not index.empty:
            # Update the label column
            final_labels_df.at[index[0], 'label'] = label_choice

            # Record the chosen label in this image's label history
            current_value = final_labels_df.at[index[0], 'historical_labels']
            if isinstance(current_value, list):
                # Append to the existing history (this also covers the empty-list case)
                current_value.append(label_choice)
                final_labels_df.at[index[0], 'historical_labels'] = current_value
            else:
                # Handle non-list values by converting them into a list
                final_labels_df.at[index[0], 'historical_labels'] = [current_value, label_choice]

    return final_labels_df

final_labels_df = update_final_labels_via_labelstudio(final_labels_df, label_studio_annotations)
final_labels_df[~final_labels_df["label"].isnull()]
image label historical_labels
0 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014120.jpg Defective [Defective]
1 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020053.jpg Defective [Defective]
2 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005833.jpg Defective [Defective]
3 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013704.jpg Defective [Defective]
4 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015012.jpg Defective [Defective]
5 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011506.jpg Defective [Defective]
6 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002817.jpg Non-Defective [Non-Defective]
7 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002553.jpg Non-Defective [Non-Defective]
8 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020243.jpg Defective [Defective]
9 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014858.jpg Defective [Defective]
10 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_235659.jpg Non-Defective [Non-Defective]
11 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014520.jpg Defective [Defective]
12 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010752.jpg Defective [Defective]
13 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015937.jpg Defective [Defective]
14 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002905.jpg Non-Defective [Non-Defective]
15 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013808.jpg Defective [Defective]
16 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_232231.jpg Non-Defective [Non-Defective]
17 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010521.jpg Defective [Defective]
18 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014223.jpg Defective [Defective]
19 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010858.jpg Defective [Defective]
20 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014353.jpg Defective [Defective]
21 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015648.jpg Defective [Defective]
22 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011017.jpg Defective [Defective]
23 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013341.jpg Defective [Defective]
24 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003754.jpg Non-Defective [Non-Defective]
25 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003026.jpg Non-Defective [Non-Defective]
26 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014639.jpg Defective [Defective]
27 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015451.jpg Defective [Defective]
28 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020004.jpg Defective [Defective]
29 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005727.jpg Defective [Defective]
30 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003342.jpg Non-Defective [Non-Defective]
31 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013435.jpg Defective [Defective]
32 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_001756.jpg Non-Defective [Non-Defective]
33 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014824.jpg Defective [Defective]
34 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010315.jpg Defective [Defective]
35 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_000106.jpg Non-Defective [Non-Defective]
36 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015545.jpg Defective [Defective]
37 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_004702.jpg Defective [Defective]
38 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_001519.jpg Non-Defective [Non-Defective]
39 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011522.jpg Defective [Defective]
40 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_233144.jpg Defective [Defective]
41 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014117.jpg Non-Defective [Non-Defective]
42 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg Non-Defective [Non-Defective]
43 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg Non-Defective [Non-Defective]
44 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg Defective [Defective]
45 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011524.jpg Non-Defective [Non-Defective]
46 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg Non-Defective [Non-Defective]
47 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003531.jpg Non-Defective [Non-Defective]
48 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014529.jpg Defective [Defective]
49 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_235645.jpg Non-Defective [Non-Defective]

We’ve now recorded labels (and updated the historical labels) for 50 of our images, which were manually annotated. Cleanlab Studio can help us avoid having to manually annotate the rest of the dataset.

Use Cleanlab Studio to catch label errors and auto-label the data that AI can confidently handle

To run Cleanlab Studio, we first properly format our dataset (including both the labeled images and the rest that remain unlabeled). Here we use the Cleanlab External Image format, so there is no need to move your images (Cleanlab will read them directly from their public URLs). You can also run Cleanlab Studio on data stored in a variety of other cloud/local storage formats.

We drop the historical_labels column from the final labels DataFrame before loading it into Cleanlab Studio, since it is not needed for catching label errors or auto-labeling the images.

cleanlab_df = final_labels_df.drop("historical_labels", axis=1)

A quickstart tutorial for programmatically analyzing image datasets via Cleanlab Studio’s Python API is available here. If you prefer to use the web interface and interactively browse/correct your data, see our other tutorial: Finding Issues in Large-Scale Image Datasets. Here we run through the simple steps without much explanation.

# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload, 
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert API key here>"

# initialize studio object
studio = Studio(API_KEY)

Let’s load our set of currently labeled/unlabeled images into Cleanlab Studio. It’s important to provide your full dataset to get the best results from Cleanlab’s AI.

dataset_id = studio.upload_dataset(
    cleanlab_df,
    dataset_name="Pepsico_RnD_Potato_Chip_Image_Data_Tutorial",
    schema_overrides=[{"name": "image", "column_type": "image_external"}],
)
print(f"Dataset ID: {dataset_id}")
    Uploading dataset...: 100%
    Ingesting Dataset...: 100%

    Dataset ID: 7dfa48ce904e47ab9554a66ddef602b9

Once the data are loaded, we create a Project based on this Dataset in Cleanlab Studio.

project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="Pepsico_RnD_Potato_Chip_Image_Data_Tutorial_Project",
    modality="image",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
)
print(f"Project successfully created and ML training has begun! project_id: {project_id}")
    Project successfully created and ML training has begun! project_id: dbbac382835b4ef9bc8cc47469bae255

The Project will take a while, as Cleanlab’s AI models must train on your dataset and analyze it. You’ll receive an email when the results are ready. Each Project creates a cleanset (cleaned dataset). Run the cell below to fetch the cleanset_id from Cleanlab Studio; this code will block until your project results are ready. For big datasets, if your notebook times out, do not recreate the project. Instead, just re-run the cell below to fetch the cleanset_id based on the project_id (which you can also find in the Cleanlab Studio Web App).

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
studio.wait_until_cleanset_ready(cleanset_id)
    cleanset_id: 333667428d494e54908dc7c8e9e39cec

    Cleanset Progress: Step 50/50, Ready for review!

Cleanlab Studio automatically generates smart metadata for any image, text, or tabular dataset. This metadata (returned as Cleanlab Columns) helps you find and fix various problems in your dataset, or impute missing values and auto-label unlabeled data.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head(10)
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label suggested_label_confidence_score is_ambiguous ambiguous_score is_well_labeled is_near_duplicate ... is_odd_size odd_size_score is_low_information low_information_score is_grayscale is_odd_aspect_ratio odd_aspect_ratio_score aesthetic_score is_NSFW NSFW_score
0 1 <NA> False 0.373451 <NA> 0.626549 False 0.993814 False False ... False 0.0 False 0.305794 False False 0.0 0.532279 False 0.041162
1 2 <NA> False 0.425856 <NA> 0.574144 False 0.999999 False False ... False 0.0 False 0.314794 False False 0.0 0.511932 False 0.051039
2 3 <NA> False 0.407524 <NA> 0.592476 False 0.999208 False False ... False 0.0 False 0.278744 False False 0.0 0.519195 False 0.113037
3 4 <NA> False 0.391541 <NA> 0.608459 False 0.997314 False False ... False 0.0 False 0.300383 False False 0.0 0.479944 False 0.054440
4 5 <NA> False 0.390192 <NA> 0.609808 False 0.997103 False False ... False 0.0 False 0.330797 False False 0.0 0.482801 False 0.052081
5 6 <NA> False 0.392143 <NA> 0.607857 False 0.997406 False False ... False 0.0 False 0.291288 False False 0.0 0.464114 False 0.000000
6 7 <NA> False 0.425907 <NA> 0.574093 False 0.951693 False False ... False 0.0 False 0.348319 False False 0.0 0.552842 False 0.137950
7 8 <NA> False 0.420150 <NA> 0.579850 False 0.947803 False False ... False 0.0 False 0.328956 False False 0.0 0.584307 False 0.092091
8 9 <NA> False 0.437139 <NA> 0.562861 False 0.999753 False False ... False 0.0 False 0.310666 False False 0.0 0.489662 False 0.108625
9 10 <NA> False 0.375067 <NA> 0.624933 False 0.994185 False False ... False 0.0 False 0.327839 False False 0.0 0.525412 False 0.110453

10 rows × 33 columns

We’ll join our Cleanlab Studio metadata columns with some of our original data columns in order to view detected label issues, suggested labels, confidence in these suggestions, and other important columns that Cleanlab automatically generated. Learn more about these metadata columns here.

Label issues in image datasets often involve challenges such as incorrectly chosen class labels due to human mistakes, inconsistencies across different data annotators, ambiguous examples where multiple classes might seem applicable, and instances where relevant features are overlooked by annotators. Cleanlab can automatically detect annotation problems stemming from a variety of sources, including: subjective interpretation of images, the complexity of scenes, variability in object appearances, and unclear portions of annotation guidelines (or changing guidelines). Identifying and resolving these annotation issues is crucial for creating a high-quality dataset.

cleanlab_df = cleanlab_df.reset_index().rename(columns={'index': 'cleanlab_row_ID'})
cleanlab_columns_df = pd.merge(
    cleanlab_columns_df,
    cleanlab_df[['cleanlab_row_ID', 'label', 'image']],
    on='cleanlab_row_ID',
    how='left',
)

Review some of the mislabeled images that Cleanlab Studio detected

We have retrieved our Cleanlab Studio results, so how should we actually inspect and use them effectively?

Let’s look at an example of a mislabeled image that Cleanlab detected, to build confidence that Cleanlab can actually catch such errors. We can filter our cleanlab_columns_df on the is_label_issue column to find the images Cleanlab flagged as label issues.

We then sort these label issues (in descending order) by their label_issue_score, which quantifies how confident Cleanlab is that a data point is mislabeled.

Finally, we identify the images most likely to be mislabeled by thresholding this label issue score.

label_issues = cleanlab_columns_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)

Optional: Helper functions to render the image column of a DataFrame as displayed images in a separate column (click to expand)

from IPython.display import HTML

def url_to_img_html(url: str) -> str:
    """
    Converts a URL to an HTML img tag.
    """
    return f'<img src="{url}" width="175px" alt="Image" />'


def display_images_in_df(df: pd.DataFrame) -> HTML:
    """
    Modifies the DataFrame to include a column with HTML img tags for rendering images.
    """
    image_column = "image_shown"
    df_copy = df.copy()

    # Convert URLs in the 'image' column to HTML img tags
    df_copy[image_column] = df_copy['image'].apply(url_to_img_html)

    # Determine the position of the 'image' column
    image_col_index = df_copy.columns.get_loc('image') + 1

    # Reorder columns so 'image_shown' appears immediately after 'image'
    columns = list(df_copy.columns)
    new_column_order = columns[:image_col_index] + [image_column] + columns[image_col_index:-1]
    df_copy = df_copy[new_column_order]

    # Generate an HTML representation of the DataFrame to display images
    html = df_copy.to_html(escape=False)

    return HTML(html)

To see that some of these images were actually mislabeled, you can view the first 2 images below in the image_shown column:

display_images_in_df(label_issues.head(2)[["cleanlab_row_ID", "image", "is_label_issue", "label_issue_score", "label", "suggested_label"]])
cleanlab_row_ID image image_shown is_label_issue label_issue_score label suggested_label
42 43 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg Image True 0.778596 Non-Defective Defective
41 42 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg Image True 0.708746 Non-Defective Defective

Cleanlab has automatically flagged data that was badly annotated! Let’s use Cleanlab’s label issue detection to determine which data points should be re-labeled in our data annotation tool.

Based on a threshold value we set for the label_issue_score column, we flag for relabeling (in our data annotation tool) any images whose score exceeds this threshold.

Here we set our label_issue_score threshold to 0.7, to only re-consider the images Cleanlab is most confident are mislabeled:

images_to_relabel_in_label_studio = cleanlab_columns_df.query(
    "is_label_issue & label_issue_score > 0.7"
).sort_values("label_issue_score", ascending=False)

images_to_relabel_in_label_studio.head()
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label suggested_label_confidence_score is_ambiguous ambiguous_score is_well_labeled is_near_duplicate ... is_low_information low_information_score is_grayscale is_odd_aspect_ratio odd_aspect_ratio_score aesthetic_score is_NSFW NSFW_score label image
42 43 <NA> True 0.778596 Defective 0.778596 False 0.905642 False False ... False 0.302884 False False 0.0 0.5266 False 0.000000 Non-Defective https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg
41 42 <NA> True 0.708746 Defective 0.708746 False 0.959486 False False ... False 0.315274 False False 0.0 0.5030 False 0.074041 Non-Defective https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg

2 rows × 35 columns

Note: The value you choose for score thresholds will affect subsequent results. We suggest you experiment with this threshold to achieve optimal results for your data. For label issues, choose a threshold such that images scoring near it genuinely appear mislabeled.
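
One way to sanity-check a candidate threshold is to visually inspect flagged images whose scores fall just around it, reusing the display_images_in_df helper defined above. A minimal sketch (the 0.05 window here is an arbitrary choice):

candidate_threshold = 0.7  # value under consideration

# Inspect flagged images scoring near the candidate threshold: if those just above
# it look genuinely mislabeled while those just below look fine, the threshold is
# reasonable; otherwise adjust it and re-check.
near_threshold = cleanlab_columns_df.query(
    "is_label_issue and abs(label_issue_score - @candidate_threshold) < 0.05"
)
display_images_in_df(
    near_threshold[["image", "label_issue_score", "label", "suggested_label"]]
)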

Review some of the unlabeled images that we want to auto-label

Let’s look at a few examples of unlabeled images that we want to try and auto-label using Cleanlab Studio. Auto-labeling uses predictions from Cleanlab’s AI (which was trained on your existing labeled data provided in the Cleanlab Studio Project) to suggest labels for unlabeled data points. To ensure accurate auto-labels, we must account for the confidence in these predictions.

unlabeled_images = final_labels_df[final_labels_df['label'].isnull()]
unlabeled_images.head()
image label historical_labels
50 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_014652.jpg None []
51 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210318_234146.jpg None []
52 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_010408.jpg None []
53 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_001713.jpg None []
54 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_020124.jpg None []

Let’s look at some of these unlabeled images that we want to auto-label below; the image_shown column shows what the images actually look like:

display_images_in_df(unlabeled_images.head(3)[["label", "historical_labels", "image"]])
label historical_labels image image_shown
50 None [] https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_014652.jpg Image
51 None [] https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210318_234146.jpg Image
52 None [] https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_010408.jpg Image

We can then sort these unlabeled images (in descending order) by their suggested_label_confidence_score to see which data points Cleanlab can most confidently auto-label.

The suggested_label_confidence_score quantifies our confidence that Cleanlab’s predicted label for a data point is correct (again, higher values correspond to data points you can more confidently auto-fix / auto-label with Cleanlab). To read more about this score, you can go here.

Based on a threshold value we set for the suggested_label_confidence_score column, we auto-label (using the suggested_label column) any currently-unlabeled images whose score exceeds this threshold.

Here is an example, setting our suggested_label_confidence_score threshold to 0.7 (a high threshold, to find the images we are most confident about auto-labeling):

auto_label_criteria = (
    cleanlab_columns_df['suggested_label'].notnull()
    & (cleanlab_columns_df['suggested_label_confidence_score'] > 0.7)
    & cleanlab_columns_df['image'].isin(unlabeled_images['image'])
)

images_to_auto_label = cleanlab_columns_df[auto_label_criteria].sort_values(
    "suggested_label_confidence_score", ascending=False
)

images_to_auto_label.head()
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label suggested_label_confidence_score is_ambiguous ambiguous_score is_well_labeled is_near_duplicate ... is_low_information low_information_score is_grayscale is_odd_aspect_ratio odd_aspect_ratio_score aesthetic_score is_NSFW NSFW_score label image
320 321 <NA> False 0.0 Defective 0.745203 False NaN False False ... False 0.284669 False False 0.0 0.565557 False 0.000000 None https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_020059.jpg
214 215 <NA> False 0.0 Defective 0.736089 False NaN False False ... False 0.291471 False False 0.0 0.530241 False 0.000000 None https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_013358.jpg
158 159 <NA> False 0.0 Defective 0.732108 False NaN False False ... False 0.316843 False False 0.0 0.568669 False 0.004064 None https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_003158.jpg
247 248 <NA> False 0.0 Defective 0.729612 False NaN False False ... False 0.307088 False False 0.0 0.538092 False 0.000000 None https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_002930.jpg
222 223 <NA> False 0.0 Defective 0.728620 False NaN False False ... False 0.294260 False False 0.0 0.493728 False 0.000000 None https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_010310.jpg

5 rows × 35 columns

Note: The value you choose for this threshold strongly affects the results, and we highly suggest tuning it yourself to achieve optimal results for your use case. For deciding which auto-labels can be trusted, choose a threshold above which the images’ suggested_label is consistently correct.
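
Before committing to a threshold, it can also help to see how many images each candidate value would auto-label, and then visually spot-check the lowest-confidence images among them. A quick sketch using objects already defined above:

# How many currently-unlabeled images would be auto-labeled at each candidate threshold?
unlabeled_scores = cleanlab_columns_df.loc[
    cleanlab_columns_df["image"].isin(unlabeled_images["image"]),
    "suggested_label_confidence_score",
]
for t in [0.5, 0.6, 0.7, 0.8]:
    print(f"threshold {t}: {int((unlabeled_scores > t).sum())} images auto-labeled")

# Visually spot-check the least-confident images that would still be auto-labeled
lowest_confidence = images_to_auto_label.sort_values("suggested_label_confidence_score").head(3)
display_images_in_df(lowest_confidence[["image", "suggested_label", "suggested_label_confidence_score"]])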

Update final labels DataFrame with our Cleanlab Studio results

Let’s now use our Cleanlab Studio results to achieve the following:

  1. Based on some threshold value we set for Cleanlab’s label_issue_score column, select data points to re-label in our data annotation tool. This efficiently fixes data that was likely previously mislabeled.
  2. Based on some threshold value we set for Cleanlab’s suggested_label_confidence_score column, select currently-unlabeled data points to auto-label with Cleanlab (using the suggested_label column). This lets us auto-label only the subset of data that AI can confidently handle.
  3. Update our final labels to reflect the newly auto-labeled images, and determine which data points need to be relabeled in our data annotation tool as well as which were auto-labeled by Cleanlab.

Optional: Define helper function to update the values of our labels based on what we auto-label in Cleanlab Studio and determine which images to relabel (click to expand)

from typing import Tuple, Optional

def update_final_labels_via_cleanlab(
    final_labels_df: pd.DataFrame,
    cleanlab_columns_df: pd.DataFrame,
    issue_score_threshold: float = 0.6,
    suggested_label_confidence_score_threshold: float = 0.55,
    run_auto_labeling_only: bool = False,
) -> Tuple[pd.DataFrame, Optional[pd.DataFrame], pd.DataFrame]:
    """
    Updates final labels with suggested labels from cleanlab_columns_df based on specified criteria,
    and identifies images to relabel in Label Studio.

    Parameters:
    - final_labels_df (pd.DataFrame): DataFrame to update, containing 'label', 'image', and 'historical_labels'.
    - cleanlab_columns_df (pd.DataFrame): DataFrame containing Cleanlab analysis and suggestions.
    - issue_score_threshold (float): Threshold for determining if a label issue is significant enough to require relabeling.
    - suggested_label_confidence_score_threshold (float): Confidence score threshold for auto-labeling.
    - run_auto_labeling_only (bool): If True, only run the auto-labeling logic.

    Returns:
    - pd.DataFrame: Updated final labels.
    - Optional[pd.DataFrame]: images_to_relabel_in_label_studio, containing images identified for relabeling (None if run_auto_labeling_only=True).
    - pd.DataFrame: images_to_auto_label, containing images auto-labeled based on the suggested label confidence score.
    """
    images_to_relabel_in_label_studio = None
    if not run_auto_labeling_only:
        # Part 1: Identify images to relabel in Label Studio
        images_to_relabel_in_label_studio = cleanlab_columns_df.query(
            "is_label_issue & label_issue_score > @issue_score_threshold"
        ).sort_values("label_issue_score", ascending=False)

    # Only consider currently-unlabeled images in final_labels_df for auto-labeling
    unlabeled_final_labels_df = final_labels_df[final_labels_df['label'].isnull()]

    # Part 2: Auto-label unlabeled images meeting the suggested_label confidence
    # score criteria, excluding any slated for relabeling
    auto_label_criteria = (
        cleanlab_columns_df['suggested_label'].notnull()
        & (cleanlab_columns_df['suggested_label_confidence_score'] > suggested_label_confidence_score_threshold)
        & cleanlab_columns_df['image'].isin(unlabeled_final_labels_df['image'])
    )
    if not run_auto_labeling_only:
        auto_label_criteria &= ~cleanlab_columns_df['image'].isin(images_to_relabel_in_label_studio['image'])
    images_to_auto_label = cleanlab_columns_df[auto_label_criteria].sort_values(
        "suggested_label_confidence_score", ascending=False
    )

    # Update final_labels_df with the auto-label suggestions
    for _, row in images_to_auto_label.iterrows():
        image_url = row['image']
        suggested_label = row['suggested_label']

        # Find the index in final_labels_df where the image URL matches
        index = final_labels_df.index[final_labels_df['image'] == image_url]

        if not index.empty:
            # Update the label column
            final_labels_df.at[index[0], 'label'] = suggested_label

            # Append the suggested label to this image's label history
            existing_labels = final_labels_df.at[index[0], 'historical_labels']
            if isinstance(existing_labels, list):
                # Append to the existing history (this also covers the empty-list case)
                existing_labels.append(suggested_label)
                final_labels_df.at[index[0], 'historical_labels'] = existing_labels
            else:
                final_labels_df.at[index[0], 'historical_labels'] = [existing_labels, suggested_label]

    # Adjust the return based on `run_auto_labeling_only`
    if run_auto_labeling_only:
        return final_labels_df, None, images_to_auto_label
    else:
        return final_labels_df, images_to_relabel_in_label_studio, images_to_auto_label

Let’s update our final labels, obtain which images to relabel in Label Studio, and auto-label some of our unlabeled images using this helper function we just defined.

For the default threshold arguments of this helper function, we choose lower values for label_issue_score and suggested_label_confidence_score than in the examples shown above, to make it easier to find images to relabel and auto-label in this tutorial.

# Obtain the updated final labels object, the images to relabel in Label Studio, and the images we auto-labeled
final_labels_df, images_to_relabel, auto_labeled_images = update_final_labels_via_cleanlab(final_labels_df, cleanlab_columns_df)

Let’s check how many images we auto-labeled with Cleanlab Studio using our threshold on suggested_label_confidence_score:

# We auto-labeled these images 
print(auto_labeled_images.shape)
    (325, 35)

Let’s also look at some examples of the images we auto-labeled to check whether they were labeled correctly. Below we show the suggested_label for 2 of the auto-labeled images, with the image_shown column revealing what the images actually look like:

display_images_in_df(auto_labeled_images.head(2)[["cleanlab_row_ID", "image", "is_label_issue", "label_issue_score", "label", "suggested_label"]])
cleanlab_row_ID image image_shown is_label_issue label_issue_score label suggested_label
320 321 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_020059.jpg Image False 0.0 None Defective
214 215 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_013358.jpg Image False 0.0 None Defective

You will find after checking each image that the suggested labels are correct! Let’s see how many unlabeled images remain in our data after the auto-labeling:

# We can see we still have unlabeled images that need to be handled in another round of labeling
print(final_labels_df["label"].isnull().sum())
    25

And now let’s see how many images we identified that need to be relabeled in Label Studio, our 3rd-party data annotation tool of choice:

# We will relabel these images with Label Studio
print(images_to_relabel.shape)

images_to_relabel[["is_label_issue", "label_issue_score", "label", "suggested_label", "image"]].head(10)
(4, 35)
is_label_issue label_issue_score label suggested_label image
42 True 0.778596 Non-Defective Defective https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg
41 True 0.708746 Non-Defective Defective https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg
43 True 0.685009 Defective Defective https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg
45 True 0.680246 Non-Defective Defective https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg

You can see that for most of these images, the annotator-chosen label does not match the suggested_label found by Cleanlab’s AI. Cleanlab is very useful for catching label errors like these!

Round 2: label more images (and re-label those with label errors) in your data annotation tool

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

After getting back the initial Cleanlab Studio results, you can use them to inform your next round of data annotation. Here we’ll use the images_to_relabel DataFrame to figure out the exact task_ids (the identifiers of our images in Label Studio) to modify when relabeling these images in Label Studio. We’ll continue labeling data in the same Label Studio project, which means we need to update this project with the auto-labels obtained from Cleanlab (so annotators don’t re-label images that were already auto-labeled). You could alternatively start a new Label Studio project with just the remaining still-to-be-labeled data.

For simplicity in this tutorial, our 2nd round of data labeling just focuses on addressing mislabeled images. In practice, you should also manually label more of the unlabeled data (that could not be confidently auto-labeled yet). Every time you re-label data or label new data in your data annotation tool, run a Cleanlab Studio project on the updated dataset and Cleanlab’s results will automatically improve.
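
Re-running Cleanlab Studio after a round of relabeling looks just like the first run: upload the updated dataset and create a fresh project. A sketch using the same API calls as above (the dataset/project names here are arbitrary):

# After merging your round-2 annotations into final_labels_df:
round2_df = final_labels_df.drop("historical_labels", axis=1)
round2_dataset_id = studio.upload_dataset(
    round2_df,
    dataset_name="Potato_Chip_Image_Data_Round_2",  # arbitrary name
    schema_overrides=[{"name": "image", "column_type": "image_external"}],
)
round2_project_id = studio.create_project(
    dataset_id=round2_dataset_id,
    project_name="Potato_Chip_Round_2_Project",  # arbitrary name
    modality="image",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
)
round2_cleanset_id = studio.get_latest_cleanset_id(round2_project_id)
studio.wait_until_cleanset_ready(round2_cleanset_id)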

Optional: Helper function designed to help identify the right task IDs (associated with an image) in Label Studio for relabeling (click to expand)

def identify_tasks_to_relabel(tasks, images_to_relabel_list):
    """
    Identifies tasks to relabel based on image URLs.

    Parameters:
    - tasks (list): A list of task dictionaries.
    - images_to_relabel_list (list): A list of image URLs to relabel.

    Returns:
    - dict: A dictionary of task IDs to relabel, mapped to their filenames.
    """
    # Extract the basename (filename) from the task image paths
    task_filenames = {task['id']: task['data']['image'].split('/')[-1] for task in tasks}

    # Extract filenames from the images_to_relabel_list
    image_relabel_filenames = [image.split('/')[-1] for image in images_to_relabel_list]

    # Keep only the task IDs whose filenames are in the list of images to relabel
    tasks_to_relabel = {}
    for task_id, filename in task_filenames.items():
        if filename in image_relabel_filenames:
            tasks_to_relabel[task_id] = filename

    return tasks_to_relabel

def identify_tasks_to_autolabel(tasks, image_to_label_mapping):
    """
    Identifies tasks to auto-label based on image URLs and assigns the suggested label.

    Parameters:
    - tasks (list): A list of task dictionaries.
    - image_to_label_mapping (dict): A dictionary mapping image filenames to suggested labels.

    Returns:
    - dict: A dictionary of task IDs mapped to their suggested labels.
    """
    tasks_to_autolabel = {}

    # Extract the basename (filename) from each task's image path
    for task in tasks:
        filename = task['data']['image'].split('/')[-1]
        if filename in image_to_label_mapping:
            # Map the task ID to the suggested label
            tasks_to_autolabel[task['id']] = image_to_label_mapping[filename]

    return tasks_to_autolabel

Let’s get all of the tasks (i.e. data points in Label Studio) that we want to relabel.

# Get all tasks to find the ones you want to relabel
tasks = project.get_tasks()

image_relabel_list = list(images_to_relabel["image"])

tasks_to_relabel = identify_tasks_to_relabel(tasks, image_relabel_list)
print("Task IDs to relabel:", tasks_to_relabel)
    Task IDs to relabel: {161903: 'IMG_20210319_014842.jpg', 161904: 'IMG_20210319_013912.jpg', 161905: 'IMG_20210319_003322.jpg', 161907: 'IMG_20210319_014306.jpg'}

Let’s create a dict that maps our image filenames to the suggested labels that we are going to use to annotate our images in Label Studio and add a flag that they were auto-labeled.

# Convert the DataFrame to a dictionary mapping image filenames to suggested labels
image_to_label_mapping = dict(
    zip(
        auto_labeled_images['image'].apply(lambda x: x.split('/')[-1]),
        auto_labeled_images['suggested_label'],
    )
)
# `tasks` is the list of task dictionaries fetched above
tasks_we_autolabeled = identify_tasks_to_autolabel(tasks, image_to_label_mapping)

The returned task_ids form a dictionary where each key is a task_id mapping to the filename of an image we want to relabel.

Now that we know which task_ids to relabel and which we auto-labeled, we will remove our label (called an annotation in Label Studio) for each image associated with the relevant task_ids identified above. We will also mark (using a metadata field) which images need to be relabeled so that this is clearly visible in the Label Studio UI, and similarly mark which images were already auto-labeled with Cleanlab Studio, so it’s clear which images still need labeling.

You can use the code below to execute these steps and send the updates back to your Label Studio project that is running locally.

Optional: Helper functions that list annotations for a given task and clear/mark annotations for relabeling in Label Studio (click to expand)

from typing import Any

def fetch_annotation_metadata(api_key: str, task_id: int) -> List[Dict[str, Any]]:
    """
    Fetches annotation metadata for a specific task from Label Studio.

    Parameters:
    - api_key (str): The API key for Label Studio.
    - task_id (int): The ID of the task for which annotations are fetched.

    Returns:
    - List[Dict[str, Any]]: A list of annotations for the specified task.

    Raises:
    - Exception: If the request to the Label Studio API fails.
    """
    headers = {'Authorization': f'Token {api_key}'}
    response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/annotations/', headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to fetch annotations. Status code: {response.status_code}, Response: {response.text}")


def clear_annotations_mark_for_relabeling_and_autolabel(api_key: str, tasks_to_relabel: Dict[int, str], auto_labeled_tasks: Dict[int, str]) -> None:
    """
    Clears annotations for specified tasks, marks them for relabeling, and annotates auto-labeled tasks.

    Parameters:
    - api_key (str): The API key for Label Studio.
    - tasks_to_relabel (Dict[int, str]): Tasks to be cleared and marked for relabeling.
    - auto_labeled_tasks (Dict[int, str]): Task IDs mapped to their suggested label for auto-labeling.

    Returns:
    - None

    Raises:
    - Exception: If requests to the Label Studio API for deleting annotations, updating tasks, or creating annotations fail.
    """
    for task_id in tasks_to_relabel.keys():
        # List all annotations for the task using the helper function above
        annotations = fetch_annotation_metadata(api_key, task_id)

        # Delete each annotation
        for annotation in annotations:
            response = requests.delete(f'http://localhost:8080/api/annotations/{annotation["id"]}/', headers={'Authorization': f'Token {api_key}'})
            if response.status_code != 204:
                raise Exception(f"Failed to delete annotation. Status code: {response.status_code}, Response: {response.text}")

        # Fetch the task data
        task_response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/', headers={'Authorization': f'Token {api_key}'})
        if task_response.status_code != 200:
            raise Exception(f"Failed to fetch task. Status code: {task_response.status_code}, Response: {task_response.text}")
        task_data = task_response.json()['data']

        # Update the task to indicate it needs relabeling
        if task_data.get('metadata') is None:
            task_data['metadata'] = {}
        task_data['metadata']['needs_relabeling'] = True

        # Update the task with the modified data
        update_response = requests.patch(f'http://localhost:8080/api/tasks/{task_id}/', json={'data': task_data}, headers={'Authorization': f'Token {api_key}'})
        if update_response.status_code != 200:
            raise Exception(f"Failed to update task. Status code: {update_response.status_code}, Response: {update_response.text}")

    print("Annotations cleared and tasks marked for relabeling in Label Studio.")

    # Handle the auto-labeled tasks
    for task_id, suggested_label in auto_labeled_tasks.items():
        # Fetch the task data for updating metadata
        task_response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/', headers={'Authorization': f'Token {api_key}'})
        if task_response.status_code != 200:
            raise Exception(f"Failed to fetch task. Status code: {task_response.status_code}, Response: {task_response.text}")
        task_data = task_response.json()['data']

        # Update task metadata to indicate the task was auto-labeled
        if task_data.get('metadata') is None:
            task_data['metadata'] = {}
        task_data['metadata']['auto_labeled'] = True

        # Update the task with the modified data, including the auto-labeled flag
        update_response = requests.patch(f'http://localhost:8080/api/tasks/{task_id}/', json={'data': task_data}, headers={'Authorization': f'Token {api_key}'})
        if update_response.status_code != 200:
            raise Exception(f"Failed to update task. Status code: {update_response.status_code}, Response: {update_response.text}")

        # API endpoint for creating annotations
        annotation_endpoint = f'http://localhost:8080/api/tasks/{task_id}/annotations/'

        # Prepare the annotation data ('from_name' must match the Choices control
        # name in our labeling config, which is 'choice' for this project)
        annotation_data = {
            'result': [{
                'from_name': 'choice',
                'to_name': 'image',
                'type': 'choices',
                'value': {'choices': [suggested_label]}
            }]
        }

        # Attempt to create the annotation
        annotation_response = requests.post(annotation_endpoint, json=annotation_data, headers={'Authorization': f'Token {api_key}'})
        if annotation_response.status_code not in [200, 201]:
            raise Exception(f"Failed to create annotation. Status code: {annotation_response.status_code}, Response: {annotation_response.text}")

    print("Tasks that were auto-labeled were successfully annotated in Label Studio.")

We can look at the specific annotation metadata for each task_id using the Label Studio API. This can help you understand important details about an annotation, including when it was created, when it was last updated, which project it belongs to, and more.

Here is an example of how to list the annotation metadata for a specific task_id in Label Studio using a helper function we defined:

example_task_id = list(tasks_to_relabel.keys())[0]

print(fetch_annotation_metadata(label_studio_api_key, example_task_id))
    [{'id': 1645, 'created_username': ' matt.turk@cleanlab.ai, 1', 'created_ago': '15\xa0hours, 23\xa0minutes', 'completed_by': 1, 'result': [{'value': {'choices': ['Non-Defective']}, 'id': 'E47bGehLt4', 'from_name': 'choice', 'to_name': 'image', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2024-03-21T22:15:22.666733Z', 'updated_at': '2024-03-21T22:15:22.666763Z', 'draft_created_at': None, 'lead_time': 2.482, 'import_id': None, 'last_action': None, 'task': 161903, 'project': 55, 'updated_by': 1, 'parent_prediction': None, 'parent_annotation': None, 'last_created_by': None}]

Clear annotations for images and mark them for relabeling in Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Let’s use a helper function to clear the existing annotation in Label Studio for the images (each corresponding to a task in Label Studio) we know we want to relabel.

We also flag these images with a metadata column value to help you find the images you need to manually relabel. For datasets much larger than this tutorial's, this flag makes it far easier to isolate the subset of images you are trying to relabel.

You could instead note the exact task_ids and manually locate each image to relabel without clearing the annotations in an automated way, but that would be a more difficult and time-consuming way to relabel your data in Label Studio.

As mentioned earlier, we also flag the images we've already auto-labeled with a metadata column value, so you can easily see within Label Studio which images Cleanlab Studio has already labeled and which ones still need attention.

For more information on deleting annotations in Label Studio, see the Label Studio API documentation.

clear_annotations_mark_for_relabeling_and_autolabel(label_studio_api_key, tasks_to_relabel, tasks_we_autolabeled)
    Annotations cleared and tasks marked for relabeling in Label Studio.
Tasks that were auto-labeled were successfully annotated in Label Studio.

Now you can go to your Label Studio project to relabel the images that we flagged. In your Label Studio project, you'll see that the images that need to be relabeled have a metadata column value of {"needs_relabeling":true}, while the images we have already auto-labeled have a metadata column value of {"auto_labeled":true}.
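If you prefer to find these tasks programmatically rather than through the UI's metadata filter, a minimal sketch like the one below can list them. Note that the /api/projects/{id}/tasks/ endpoint and its response shape are assumptions based on the other Label Studio API calls in this tutorial, and may vary by Label Studio version:

# Sketch: list flagged tasks via the Label Studio API (assumes a local instance on port 8080;
# endpoint/pagination behavior may differ across Label Studio versions)
resp = requests.get(
    f'http://localhost:8080/api/projects/{ls_project_id}/tasks/',
    headers={'Authorization': f'Token {label_studio_api_key}'},
)
resp.raise_for_status()
for task in resp.json():
    metadata = task['data'].get('metadata') or {}
    if metadata.get('needs_relabeling'):
        print(f"Task {task['id']}: needs manual relabeling")
    elif metadata.get('auto_labeled'):
        print(f"Task {task['id']}: already auto-labeled by Cleanlab Studio")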

Here is a video showing how to find the images that need to be relabeled, as well as the images we have already auto-labeled, and then how to relabel the specified images within Label Studio:

After relabeling, export the labels as seen below.

Update final labels DataFrame with our 2nd round of labels obtained via Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

After relabeling some images in Label Studio, we update the DataFrame serving as our master record of each image’s final label.

We’ll reuse our Label Studio project ID defined earlier and our previously defined helper function to export the new image labels from Label Studio into a CSV file:

relabeled_export_filename = "relabeled_label_studio_potato_chips_annotations.csv" # REPLACE WITH YOUR FILENAME

export_data_from_label_studio_project(ls_project_id, label_studio_api_key, relabeled_export_filename)

Let’s read in the image data we just exported from Label Studio and keep only the images that were relabeled.

# Read in relabeled data exported from Label Studio
relabeled_df = pd.read_csv(relabeled_export_filename)
relabeled_df = relabeled_df[relabeled_df["metadata"] == '{"needs_relabeling":true}']

Here are the images that we relabeled:

relabeled_df.head(10)
annotation_id annotator choice created_at id image lead_time metadata updated_at
42 2307 1 Defective 2024-03-22T13:40:20.268734Z 161903 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg 2.569 {"needs_relabeling":true} 2024-03-22T13:40:20.268757Z
43 2308 1 Defective 2024-03-22T13:40:24.147650Z 161904 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg 2.230 {"needs_relabeling":true} 2024-03-22T13:40:24.147698Z
44 2309 1 Non-Defective 2024-03-22T13:40:27.828297Z 161905 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg 2.220 {"needs_relabeling":true} 2024-03-22T13:40:27.828330Z
46 2310 1 Defective 2024-03-22T13:40:31.009574Z 161907 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg 1.953 {"needs_relabeling":true} 2024-03-22T13:40:31.009615Z

You can confirm from the metadata column values that this data was indeed relabeled. The choice column holds the corrected label chosen during relabeling.
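As a quick check, you can also tally the corrected labels:

# Count how many relabeled images received each corrected label
print(relabeled_df["choice"].value_counts())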

Let’s proceed with updating our final labels using the newly relabeled data. We will look at the final labels before AND after the update to illustrate the difference.

# Final labels before we update them with newly relabeled images
final_labels_df.loc[relabeled_df.index]
image label historical_labels
42 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg Non-Defective [Non-Defective]
43 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg Non-Defective [Non-Defective]
44 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg Defective [Defective]
46 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg Non-Defective [Non-Defective]
# Apply updated labels from Label Studio to final labels
final_labels_df = update_final_labels_via_labelstudio(final_labels_df, relabeled_df)
# Final labels after we update them with newly relabeled images
final_labels_df.loc[relabeled_df.index]
image label historical_labels
42 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg Defective [Non-Defective, Defective]
43 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg Defective [Non-Defective, Defective]
44 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg Non-Defective [Defective, Non-Defective]
46 https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg Defective [Non-Defective, Defective]

You can see that several images originally labeled Non-Defective (incorrectly) are now correctly labeled Defective. Similarly, images that should have been Non-Defective but were mislabeled Defective have been corrected.

The historical_labels column is also correctly updated for these images. For each data point, this list is ordered from oldest to newest, so the last entry is the most up-to-date label we are using for that image.
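For intuition, the per-row update performed by update_final_labels_via_labelstudio amounts to something like the sketch below (a simplification, not the helper's actual implementation; it assumes historical_labels stores Python lists):

# Simplified sketch of the per-row relabeling update
def apply_new_label(final_df, row_index, new_label):
    final_df.at[row_index, "label"] = new_label                    # current label
    final_df.at[row_index, "historical_labels"].append(new_label)  # newest label goes last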

We will now update the metadata column in Label Studio to remove the flag indicating that images need to be relabeled. You can run the following code to do this:

Optional: Helper function to update the relabeling status of all relabeled images (click to expand)

def update_tasks_relabeling_status(api_key: str, task_ids: List[int], needs_relabeling: bool) -> None:
    """
    Updates the relabeling status of multiple tasks in Label Studio.

    Parameters:
    - api_key (str): The API key for Label Studio.
    - task_ids (List[int]): A list of task IDs to update.
    - needs_relabeling (bool): True to mark the tasks as needing relabeling, False to clear the mark.

    Raises:
    - Exception: If any request to update a task fails.
    """
    headers = {'Authorization': f'Token {api_key}'}

    for task_id in task_ids:
        task_response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/', headers=headers)

        if task_response.status_code != 200:
            raise Exception(f"Failed to fetch task {task_id}. Status code: {task_response.status_code}, Response: {task_response.text}")

        task_data = task_response.json()['data']
        if 'metadata' not in task_data or task_data['metadata'] is None:
            task_data['metadata'] = {}

        if needs_relabeling:
            task_data['metadata']['needs_relabeling'] = True
        else:
            # Remove the needs_relabeling key if it exists
            task_data['metadata'].pop('needs_relabeling', None)
            # If the metadata dictionary is empty, set it to None
            if not task_data['metadata']:
                task_data['metadata'] = None

        update_response = requests.patch(f'http://localhost:8080/api/tasks/{task_id}/', json={'data': task_data}, headers=headers)
        if update_response.status_code != 200:
            raise Exception(f"Failed to update task {task_id}. Status code: {update_response.status_code}, Response: {update_response.text}")

    print("All tasks relabeling status updated successfully.")
# Remove metadata flag from Label Studio that images need to be relabeled.
relabeled_task_ids = list(tasks_to_relabel.keys())
update_tasks_relabeling_status(label_studio_api_key, relabeled_task_ids, needs_relabeling=False)
    All tasks relabeling status updated successfully.

Now check your project in Label Studio to make sure no images there still have the needs_relabeling flag set to true. Without this step, it becomes hard to tell which images have already been relabeled and which have not, which is especially confusing to manage for big datasets.
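You can also verify this programmatically using the same task endpoint as before (a quick sketch):

# Sketch: confirm none of the relabeled tasks still carry the needs_relabeling flag
for task_id in relabeled_task_ids:
    resp = requests.get(f'http://localhost:8080/api/tasks/{task_id}/',
                        headers={'Authorization': f'Token {label_studio_api_key}'})
    resp.raise_for_status()
    metadata = resp.json()['data'].get('metadata') or {}
    assert not metadata.get('needs_relabeling'), f"Task {task_id} is still flagged"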

Note: For brevity, we did not annotate any more unlabeled data points in Label Studio during this 2nd round of labeling. In practice, you should also manually label some more of the unlabeled data in this round using your data annotation platform (to help Cleanlab’s AI learn not only from label corrections but from new labels as well).
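If you do label more data at this stage, the rows still needing manual labels are easy to pull from our master DataFrame:

# Sketch: list the data points that still have no label after this round
still_unlabeled = final_labels_df[final_labels_df["label"].isnull()]
print(f"{len(still_unlabeled)} images remain unlabeled")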

Round 2: Use Cleanlab Studio for further auto-labeling and label error detection

We will now follow the same process: first, properly format our dataset (including both the labeled images and the remaining unlabeled ones) to ingest the newest version of our data back into Cleanlab Studio.

We will then create a new Cleanlab Studio dataset and project, which we’ll use to auto-label the remaining unlabeled images in our final labels.

In this tutorial, given the small sample size, we choose to only auto-label the remainder of our data points in this round. In practice, you can repeat the process, detecting data points to re-label with your data annotation tool AND using Cleanlab to auto-label more unlabeled data points, as many times as you deem necessary to achieve optimal results for your use case. Each time you label more data or correct existing labels and then launch a Cleanlab Studio project, Cleanlab’s AI will get better at auto-labeling your data and detecting label errors.
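One possible stopping check between rounds is sketched below (column names follow this tutorial’s DataFrames; the exact criterion is up to you):

# Sketch: stop iterating once nothing is unlabeled and Cleanlab flags no label issues
def labeling_done(final_df, cleanlab_cols):
    no_unlabeled = not final_df["label"].isnull().any()
    no_issues = not cleanlab_cols["is_label_issue"].any()
    return no_unlabeled and no_issues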

Let’s create our DataFrame object to ingest into Cleanlab Studio:

new_cleanlab_df = final_labels_df.drop("historical_labels", axis=1)

Let’s load our new set of currently labeled/unlabeled images into Cleanlab Studio. As mentioned earlier, it’s important to provide your full dataset to get the best results from Cleanlab’s AI.

new_dataset_id = studio.upload_dataset(new_cleanlab_df, dataset_name="Round_2_Pepsico_RnD_Potato_Chip_Image_Data_Tutorial", schema_overrides=[{"name": "image", "column_type": "image_external"}])
print(f"Dataset ID: {new_dataset_id}")
    Uploading dataset...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
Ingesting Dataset...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|


Dataset ID: 720f4800607c4e2892e177d4e2beaa39

Once the new data are loaded, we create a new Project based on this Dataset in Cleanlab Studio.

new_project_id = studio.create_project(
    dataset_id=new_dataset_id,
    project_name="Round_2_Pepsico_RnD_Potato_Chip_Image_Data_Tutorial_Project",
    modality="image",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
)
print(
    f"Project successfully created and ML training has begun! project_id: {new_project_id}"
)
    Project successfully created and ML training has begun! project_id: 84cbb77617e64d9caf1a48ea27722d91

Note: the Project will take a while, as Cleanlab’s AI models train on your dataset and analyze it. You’ll receive an email when the results are ready. Run the cell below to fetch the cleanset_id from Cleanlab Studio; this code will block until your project results are ready. For big datasets, if your notebook times out, do not recreate the project. Instead, just re-run the cell below to fetch the cleanset_id based on the project_id (which you can also find in the Cleanlab Studio Web App).

new_cleanset_id = studio.get_latest_cleanset_id(new_project_id)
print(f"cleanset_id: {new_cleanset_id}")
studio.wait_until_cleanset_ready(new_cleanset_id)
    cleanset_id: ecde48d5d79a4195ba9ca6e637858239


Cleanset Progress: / Step 50/50, Ready for review!

Let’s now download the metadata columns from Cleanlab Studio as we did before and see what the columns look like:

new_cleanlab_columns_df = studio.download_cleanlab_columns(new_cleanset_id)
new_cleanlab_columns_df.head(10)
cleanlab_row_ID corrected_label is_label_issue label_issue_score suggested_label suggested_label_confidence_score is_ambiguous ambiguous_score is_well_labeled is_near_duplicate ... is_odd_size odd_size_score is_low_information low_information_score is_grayscale is_odd_aspect_ratio odd_aspect_ratio_score aesthetic_score is_NSFW NSFW_score
0 1 <NA> False 0.147065 <NA> 0.852935 False 0.997672 False False ... False 0.0 False 0.305794 False False 0.0 0.532279 False 0.041162
1 2 <NA> False 0.194291 <NA> 0.805709 False 0.999998 False False ... False 0.0 False 0.314794 False False 0.0 0.511932 False 0.051039
2 3 <NA> False 0.144631 <NA> 0.855369 False 0.997418 False False ... False 0.0 False 0.278744 False False 0.0 0.519195 False 0.113037
3 4 <NA> False 0.166404 <NA> 0.833596 False 0.999222 False False ... False 0.0 False 0.300383 False False 0.0 0.479944 False 0.054440
4 5 <NA> False 0.144390 <NA> 0.855610 False 0.997392 False False ... False 0.0 False 0.330797 False False 0.0 0.482801 False 0.052081
5 6 <NA> False 0.161320 <NA> 0.838680 False 0.998895 False False ... False 0.0 False 0.291288 False False 0.0 0.464114 False 0.000000
6 7 <NA> True 0.844409 Defective 0.844409 False 0.998458 False False ... False 0.0 False 0.348319 False False 0.0 0.552842 False 0.137950
7 8 <NA> False 0.793626 Defective 0.793626 True 0.999799 False False ... False 0.0 False 0.328956 False False 0.0 0.584307 False 0.092091
8 9 <NA> False 0.297036 <NA> 0.702964 False 0.987958 False False ... False 0.0 False 0.310666 False False 0.0 0.489662 False 0.108625
9 10 <NA> False 0.123683 <NA> 0.876317 False 0.994688 False False ... False 0.0 False 0.327839 False False 0.0 0.525412 False 0.110453

10 rows × 33 columns

Let’s repeat the process of joining our Cleanlab Studio metadata columns with some of our original data columns in order to more easily observe the important columns that Cleanlab Studio generated for us.

new_cleanlab_df = new_cleanlab_df.reset_index().rename(columns={'index': 'cleanlab_row_ID'})
new_cleanlab_columns_df = new_cleanlab_columns_df.merge(new_cleanlab_df[['cleanlab_row_ID', 'label', 'image']], left_index=True, right_index=True)

Update final labels DataFrame with our 2nd round of Cleanlab Studio results

With our new Cleanlab Studio results we can now auto-label the remainder of our unlabeled data points and update our final labels object accordingly.

As mentioned before, based on a threshold value we set for the suggested_label_confidence_score column, we auto-label (using the suggested_label column) any images that are currently unlabeled and whose confidence score is greater than or equal to this threshold.
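That rule is essentially a boolean mask over the merged columns; here is a minimal sketch of roughly what the helper’s auto-labeling branch does (the actual helper also updates historical_labels and other bookkeeping):

# Sketch: select currently unlabeled rows whose confidence meets the threshold
threshold = 0.5
mask = (
    new_cleanlab_columns_df["label"].isnull()
    & (new_cleanlab_columns_df["suggested_label_confidence_score"] >= threshold)
)
confident_auto_labels = new_cleanlab_columns_df.loc[mask, ["image", "suggested_label"]]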

In this round, we only execute the auto-labeling logic in our helper function, not the label error detection logic (which finds data points to re-label). In practice, you should repeat both of these steps as many times as you need to accurately label your dataset. In between each Cleanlab Studio project, aim to manually label a bit more unlabeled data in your data annotation tool (this will improve Cleanlab’s AI in the subsequent rounds). To keep this tutorial short, we won’t show many rounds of this process and only focus on auto-labeling in this 2nd round.

We’ll also choose a slightly lower (less conservative) threshold value for suggested_label_confidence_score when using our helper function this time around. This illustrates how you can choose different threshold values in different rounds of the labeling process as Cleanlab’s AI improves over multiple rounds.

# Obtain updated final labels and newly auto-labeled images
final_labels_df, _, new_auto_labeled_images = update_final_labels_via_cleanlab(final_labels_df, new_cleanlab_columns_df, suggested_label_confidence_score_threshold=0.5, run_auto_labeling_only=True)

We’ve updated our final labels and auto-labeled the unlabeled data points that could be confidently handled by Cleanlab’s AI. Let’s see how many unlabeled data points we were able to auto-label:

print(new_auto_labeled_images.shape)
    (25, 36)

The number of data points we auto-labeled matches the number of unlabeled data points remaining from the previous round of labeling! Cleanlab’s AI improved so much from our 2nd round of manual annotations that it was able to auto-label all of the remaining unlabeled data.
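As a quick sanity check, every newly auto-labeled row should correspond to a data point that was unlabeled going into this round:

# All newly auto-labeled rows should have been initially unlabeled
assert new_auto_labeled_images["is_initially_unlabeled"].all()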

Let’s inspect some images we auto-labeled in this round to review whether they are correct. We’ll look at the suggested_label for 2 of the auto-labeled images, alongside the image_shown column, to see what the images actually look like:

display_images_in_df(new_auto_labeled_images.head(2)[["is_initially_unlabeled", "image", "label", "suggested_label"]])
is_initially_unlabeled image image_shown label suggested_label
102 True https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_004902.jpg Image None Defective
394 True https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_013246.jpg Image None Defective

The suggested_label values confidently predicted by Cleanlab’s AI are correct for these examples!

Here’s how many unlabeled data points remain after this round of auto-labeling:

print(final_labels_df["label"].isnull().sum())
    0

We do not have any unlabeled images left, so we have successfully used a 3rd-party data annotation tool combined with Cleanlab Studio to label all of our data points! If any unlabeled images remained, you could manually label some more with your data annotation platform (in a 3rd round), and then run another Cleanlab Studio Project to further improve the auto-labeling of your data.

Summary

This tutorial demonstrated how easily Cleanlab Studio can be used with any 3rd-party data annotation tool. By incorporating Cleanlab Studio into our workflow, we only had to manually label 50 data points (plus the 6 relabeled points) in our data annotation tool instead of all 400. The remaining 350 data points (~88% of our images) were accurately auto-labeled by Cleanlab Studio, saving significant time and effort. Cleanlab Studio can scale to massive datasets just as easily, unlocking immense savings. Cleanlab also automatically caught incorrect labels, allowing us to produce a higher-quality dataset.

While we focused on image data here, all of this works for text and tabular datasets as well. The same workflow can also be applied in other data annotation applications beyond multi-class classification, such as image/document tagging. We re-emphasize that you can apply this same workflow with any data annotation tool of your choosing (see the beginning of the tutorial for links to other tools beyond Label Studio).