Produce Better-Quality Data 10x Faster in any Data Annotation Platform You’re Using (e.g. Label Studio)

Run in Google Colab

This tutorial shows how you can integrate Cleanlab Studio with any data annotation tool. Reduce how much data annotation work your team has to do (thanks to Cleanlab’s auto-labeling) and simultaneously produce better results (thanks to Cleanlab’s label error detection). The basic workflow looks as follows: first label some data with your tool, then use Cleanlab to catch any incorrectly selected labels and auto-label a large subset of the data that AI (Foundation models) can confidently label. You can then iterate these 2 steps multiple times. Each time you label (or re-label) a bit more data with your annotation tool, Cleanlab’s AI will learn from your new labels and become more effective for auto-labeling and label error detection. Since Cleanlab Studio is so easy to use, you can mostly focus on working with your annotation tool as you normally would – except now you’ll get datasets labeled more quickly and accurately.

Data Annotation with Label Studio and Cleanlab Studio

Here we’ll demonstrate this workflow with the Label Studio data annotation platform (one of the most popular tools that is freely open-source), but you can similarly integrate any data annotation tool of your choosing with Cleanlab Studio. For example, you could do the same workflows with annotation tools like LabelBox, LabelMe, or Coco among others. Some annotation platforms have built-in AI features, but none with the same powerful capabilities and ease-of-use as Cleanlab Studio.

This particular tutorial will focus on labeling image data for multi-class classification, but the same steps can be applied to text or tabular data being labeled for other types of tasks (e.g. tagging instead of multi-class classification). Here are the steps we’ll follow in this tutorial, starting with an entirely unlabeled dataset:

Use your data annotation tool to label some of the data.
Then load the currently labeled and unlabeled sets of this data into Cleanlab Studio.
Use Cleanlab Studio to automatically find label errors (and other issues in the data) and send data flagged as likely incorrectly labeled back to your annotation tool for relabeling. Also use Cleanlab Studio to auto-label as much of the unlabeled data as can be confidently handled by Cleanlab’s AI.
Iterate steps 1-3 until all of our data is labeled. Over time, there will be less unlabeled data remaining to label in later repetitions of step 1.

Install and import required dependencies

Note: this tutorial requires that you’ve created a Cleanlab Studio account. You can use pip to install all packages required for this tutorial as follows:

pip install cleanlab-studio label-studio

When creating this tutorial, we used version 1.11.0 of the label-studio Python package.

from cleanlab_studio import Studio
from datasets import load_dataset
import pandas as pd
import requests

pd.set_option('display.max_colwidth', None)

Prepare the dataset

The dataset we’ll label is a sample of images of potato chips from PepsiCo - the original data can be found on Kaggle here. Our task is to label each image from the dataset into one of 2 categories: Defective or Non-Defective. Below are some images from the dataset. 2 Non-Defective potato chips are depicted in the top row and 2 Defective potato chips in the bottom row:

examples of dataset

We have already stored the tutorial dataset with each image at an accessible URL. You could also produce such URLs for files in your cloud storage. Both Cleanlab Studio and Label Studio can operate directly on such linked external media files, or you run these tools on local files from your machine. Let’s fetch the files with the URLs pointing to the images that comprise our dataset:

wget -nc "https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/potato_chip_image_urls.csv"

Optionally, you can also download the full image dataset from here.

image_df = pd.read_csv("potato_chip_image_urls.csv")
image_urls = list(image_df["image"])

Next let’s define a DataFrame that will store our final labels for each image. This object will be updated throughout the tutorial.

This final labels DataFrame has an image column pointing to the URL of each image in our dataset. We initialize this DataFrame with a label column that is our goal to fill in for each image. We also include a column that will contain historical labels (represented as a list) for each image. Anytime a chosen label changes due to a label correction (when relabeling an image in the data annotation tool), the newly selected label will be appended to the list of historical labels that were selected by annotators of this image. Storing this set of previously selected labels is particularly useful when the data annotators are noisy.

final_labels_df = image_df.copy()
final_labels_df["label"] = [None] * len(image_urls)
final_labels_df["historical_labels"] = [[] for _ in range(len(image_urls))]

final_labels_df.head()

	image	label	historical_labels
0	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014120.jpg	None	[]
1	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020053.jpg	None	[]
2	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005833.jpg	None	[]
3	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013704.jpg	None	[]
4	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015012.jpg	None	[]

Label some of the images with your annotation tool

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Now let’s label a batch of the images using Label Studio, our example 3rd party annotation tool in this tutorial. Here we’ll use the Label Studio Python SDK to connect to the API.

Before executing the Python code below, first run label-studio in a command line interface of your choice. Don’t proceed until you have Label Studio running locally at: http://localhost:8080

from label_studio_sdk import Client

# Define the URL where Label Studio is accessible and the API key for your Label Studio user account, which you can find in your account settings
LABEL_STUDIO_URL = 'http://localhost:8080'
label_studio_api_key = "<insert API key here>"

# Connect to the Label Studio API and check the connection
ls = Client(url=LABEL_STUDIO_URL, api_key=label_studio_api_key)
ls.check_connection()

{'status': 'UP'}

The status should be UP if everything’s working properly. Let’s create a project via our Label Studio Client Object for labeling our Potato Chip images.

project = ls.start_project(
  title='Label Studio PepsiCo Image Tutorial Project',
  label_config='''
<View>
<Image name="image" value="$image" valueType="url"/>
<Choices name="choice" toName="image">
<Choice value="Defective"/>
<Choice value="Non-Defective" />
</Choices>
</View>
'''
)

Format the data for Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

To programmatically import our images into Label Studio, we must format them like this:

[{'image': image_url1},
 {'image': image_url2},
 {'image': image_url3}
]

This is a list of dicts, where each dict has an image key and corresponding value image_url pointing to where the image is accessible. Now let’s use a previously defined helper function to format our image URLs to the list of dicts that Label Studio will accept in its project.import_tasks function.

Optional: Define helper function to format the image URLs for Label Studio

from typing import List, Dict

def format_label_studio_image_urls(image_urls: List[str]) -> List[Dict[str, str]]:
    """
    Converts a list of image URLs into the format that Label Studio accepts.

    Parameters:
    - image_urls (List[str]): A list of full URLs to the images that need to be labeled.

    Returns:
    - List[Dict[str, str]]: A list of dictionaries, where each dictionary contains a key 'image'
      with its value being the full URL to an image file, ready for use in Label Studio.
    """
    
    # Convert the list of URLs into the expected dictionary format for Label Studio
    label_studio_urls = [{'image': url} for url in image_urls]

    return label_studio_urls

image_urls_for_label_studio = format_label_studio_image_urls(image_urls)
print(len(image_urls_for_label_studio))

Let’s look at 3 example images we are importing into Label Studio:

# Check what some of the potato chip images look like

from IPython.display import HTML

images_html = "".join(
    f'''<div style="text-align: center;"><img src="{url['image']}" width="400" alt="" /></div>'''
    for url in image_urls_for_label_studio[:3]
)

# Display the images in the notebook
display(HTML(images_html))

Now we can import the images into Label Studio for labeling!

# Task IDs used in Label Studio to represent each image are the output we see printed below
imported_ids = project.import_tasks(image_urls_for_label_studio)
print(imported_ids[:5])

[161861, 161862, 161863, 161864, 161865]

Learn how to import your data into Label Studio via this guide.

Label 50 images in Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Go to your project in Label Studio and start labeling your imported images. Let’s just label 50 of those images to start. This guide shows how to label/annotate your data with Label Studio, but you could use any other data annotation platform.

Try to select a diverse set of initial examples to label that well represents all of the classes of interest (since Cleanlab’s AI will later learn from this initially labeled set).

Once you have labeled your images, one way to export the labels from Label Studio is via the Export button, as you can see in the video below or within your Label Studio session running locally. In this tutorial, we choose to export programmatically. This minimizes the amount of data you need to manually move between different tools in this tutorial. This also shows you how to interact with Label Studio using the Python API.

Here’s a video demonstrating how to label a some images in Label Studio after you have launched the app locally:

By default, Label Studio only exports tasks (i.e. data points in Label Studio) with annotations. For more details on exporting tasks, you can go here.

To be able to export all tasks programmatically, we use the following helper function.

Optional: Define helper function to export data from your Label Studio project

def export_data_from_label_studio_project(project_id: str, label_studio_api_key: str, output_file_path: str, export_type: str="CSV", server_url: str="http://localhost:8080"):
    """
    Exports data from a specific project in Label Studio.

    This function requests an export of data from a specified project in Label Studio, supporting
    various export formats. The function handles the request and saves the exported data to a file
    at the specified path.

    Parameters:
    - project_id (str): The ID of the project in Label Studio to export data from.
    - label_studio_api_key (str): The API token for authenticating with the Label Studio server.
    - output_file_path (str, optional): The path where the exported file should be saved.
    - export_type (str, optional): The type of export format (e.g., CSV, JSON). Defaults to "CSV".
    - server_url (str, optional): The URL of the Label Studio server. Defaults to "http://localhost:8080".

    Returns:
    - None: The function saves the exported data to a file and prints the outcome.
    """
    url = f"{server_url}/api/projects/{project_id}/export?exportType={export_type}"

    # Your authorization token
    headers = {
        "Authorization": f"Token {label_studio_api_key}"
    }

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        # Save the exported data to a file
        with open(output_file_path, "wb") as f:
            f.write(response.content)
        print("Data exported successfully.")
    else:
        print(f"Failed to export data. Status code: {response.status_code}, Message: {response.text}")

We also need to pass our Label Studio project ID and our API key into our helper function to be able to export our tasks.

This project ID can be found in the URL of your project after you have selected your project within Label Studio.

We also create a csv file to export our tasks into.

ls_project_id = 55 # REPLACE YOUR PROJECT ID HERE
export_filename = "label_studio_potato_chips_annotations.csv" # REPLACE YOUR FILENAME HERE

export_data_from_label_studio_project(ls_project_id, label_studio_api_key, export_filename)

Let’s see what our labeled data looks like so far.

label_studio_annotations = pd.read_csv(export_filename)
label_studio_annotations.head(10)

	annotation_id	annotator	choice	created_at	id	image	lead_time	updated_at
0	1599	1	Defective	2024-03-21T21:40:46.231429Z	161861	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014120.jpg	3.178	2024-03-21T21:40:46.231451Z
1	1600	1	Defective	2024-03-21T21:40:48.406866Z	161862	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020053.jpg	2.073	2024-03-21T21:40:48.406890Z
2	1601	1	Defective	2024-03-21T21:40:50.714271Z	161863	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005833.jpg	2.199	2024-03-21T21:40:50.714307Z
3	1602	1	Defective	2024-03-21T21:40:52.918417Z	161864	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013704.jpg	2.073	2024-03-21T21:40:52.918444Z
4	1603	1	Defective	2024-03-21T21:40:55.034898Z	161865	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015012.jpg	2.017	2024-03-21T21:40:55.034925Z
5	1604	1	Defective	2024-03-21T21:40:57.286808Z	161866	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011506.jpg	2.131	2024-03-21T21:40:57.286836Z
6	1605	1	Non-Defective	2024-03-21T21:40:59.702141Z	161867	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002817.jpg	2.298	2024-03-21T21:40:59.702167Z
7	1606	1	Non-Defective	2024-03-21T21:41:02.890256Z	161868	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002553.jpg	3.059	2024-03-21T21:41:02.890279Z
8	1611	1	Defective	2024-03-21T22:13:55.609870Z	161869	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020243.jpg	1972.597	2024-03-21T22:13:55.609890Z
9	1612	1	Defective	2024-03-21T22:13:58.086707Z	161870	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014858.jpg	2.342	2024-03-21T22:13:58.086738Z

Update final labels DataFrame with labels obtained in our annotation tool

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Now we can fill in the official labels for 50 images that we just annotated in Label Studio so far.

# Here we define a helper function to update the values of our final labels based on annotations from Label Studio

def update_final_labels_via_labelstudio(final_labels_df, labels_df):
    """
    Updates the 'label' column and appends labels to 'historical_labels' in final_labels_df
    based on the 'image' column matching between final_labels_df and labels_df.

    Parameters:
    - final_labels_df (pd.DataFrame): The DataFrame to update, containing columns 'label',
      'image', and 'historical_labels'.
    - labels_df (pd.DataFrame): The DataFrame containing new labels from our annotation tool, with columns
      including 'choice' and 'image'.

    Returns:
    - pd.DataFrame: The updated final_labels_df.
    """
    for _, label_row in labels_df.iterrows():
        image_url = label_row['image']
        label_choice = label_row['choice']
        
        # Find the index in final_labels_df where the image URL matches
        index = final_labels_df.index[final_labels_df['image'] == image_url]
        
        if not index.empty:
            # Update the label column
            final_labels_df.at[index[0], 'label'] = label_choice
            
            # Check if the cell contains NaN or cannot be appended to directly
            if len(final_labels_df.at[index[0], 'historical_labels']) == 0:
                # Initialize with a list containing the label if NaN
                final_labels_df.at[index[0], 'historical_labels'] = [label_choice]
            else:
                current_value = final_labels_df.at[index[0], 'historical_labels']
                # Check if current_value is a list, implying we can append to it
                if isinstance(current_value, list):
                    current_value.append(label_choice)
                    final_labels_df.at[index[0], 'historical_labels'] = current_value
                else:
                    # Handle non-list, non-NaN values (e.g., initializing or converting to a list)
                    final_labels_df.at[index[0], 'historical_labels'] = [current_value, label_choice]

    return final_labels_df

final_labels_df = update_final_labels_via_labelstudio(final_labels_df, label_studio_annotations)

final_labels_df[~final_labels_df["label"].isnull()]

	image	label	historical_labels
0	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014120.jpg	Defective	[Defective]
1	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020053.jpg	Defective	[Defective]
2	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005833.jpg	Defective	[Defective]
3	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013704.jpg	Defective	[Defective]
4	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015012.jpg	Defective	[Defective]
5	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011506.jpg	Defective	[Defective]
6	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002817.jpg	Non-Defective	[Non-Defective]
7	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002553.jpg	Non-Defective	[Non-Defective]
8	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020243.jpg	Defective	[Defective]
9	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014858.jpg	Defective	[Defective]
10	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_235659.jpg	Non-Defective	[Non-Defective]
11	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014520.jpg	Defective	[Defective]
12	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010752.jpg	Defective	[Defective]
13	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015937.jpg	Defective	[Defective]
14	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_002905.jpg	Non-Defective	[Non-Defective]
15	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013808.jpg	Defective	[Defective]
16	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_232231.jpg	Non-Defective	[Non-Defective]
17	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010521.jpg	Defective	[Defective]
18	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014223.jpg	Defective	[Defective]
19	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010858.jpg	Defective	[Defective]
20	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014353.jpg	Defective	[Defective]
21	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015648.jpg	Defective	[Defective]
22	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011017.jpg	Defective	[Defective]
23	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013341.jpg	Defective	[Defective]
24	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003754.jpg	Non-Defective	[Non-Defective]
25	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003026.jpg	Non-Defective	[Non-Defective]
26	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014639.jpg	Defective	[Defective]
27	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015451.jpg	Defective	[Defective]
28	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_020004.jpg	Defective	[Defective]
29	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_005727.jpg	Defective	[Defective]
30	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003342.jpg	Non-Defective	[Non-Defective]
31	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013435.jpg	Defective	[Defective]
32	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_001756.jpg	Non-Defective	[Non-Defective]
33	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014824.jpg	Defective	[Defective]
34	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_010315.jpg	Defective	[Defective]
35	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_000106.jpg	Non-Defective	[Non-Defective]
36	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_015545.jpg	Defective	[Defective]
37	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_004702.jpg	Defective	[Defective]
38	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_001519.jpg	Non-Defective	[Non-Defective]
39	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011522.jpg	Defective	[Defective]
40	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_233144.jpg	Defective	[Defective]
41	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014117.jpg	Non-Defective	[Non-Defective]
42	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg	Non-Defective	[Non-Defective]
43	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg	Non-Defective	[Non-Defective]
44	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg	Defective	[Defective]
45	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_011524.jpg	Non-Defective	[Non-Defective]
46	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg	Non-Defective	[Non-Defective]
47	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003531.jpg	Non-Defective	[Non-Defective]
48	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014529.jpg	Defective	[Defective]
49	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210318_235645.jpg	Non-Defective	[Non-Defective]

We’ve now recorded labels (and updated the historical labels) for 50 of our images, which were manually annotated. Cleanlab Studio can help us avoid having to manually annotate the rest of the dataset.

Use Cleanlab Studio to catch label errors and auto-label the data that AI can confidently handle

To run Cleanlab Studio, we first properly format our dataset (including both labeled and the rest of the unlabeled images). Here we use the Cleanlab External Image format, so there is no need to move your images (Cleanlab will read them directly from their public URLs). You can run Cleanlab Studio on data stored in a variety of alternate cloud/local storage formats.

We drop the historical_labels column from the final labels DataFrame prior to loading it into Cleanlab Studio, since it is not needed for us to use Cleanlab Studio to catch label errors and auto-label the images.

cleanlab_df = final_labels_df.drop("historical_labels", axis=1)

A quickstart tutorial for programmatically analyzing image datasets via Cleanlab Studio’s Python API is available here. If you prefer to use the web interface and interactively browse/correct your data, see our other tutorial: Finding Issues in Large-Scale Image Datasets. Here we run through the simple steps without much explanation.

# You can find your Cleanlab Studio API key by going to studio.cleanlab.ai/upload, 
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert API key here>"

# initialize studio object
studio = Studio(API_KEY)

Let’s load our set of currently labeled/unlabeled images into Cleanlab Studio. It’s important to provide your full dataset to get the best results from Cleanlab’s AI.

dataset_id = studio.upload_dataset(cleanlab_df, dataset_name="Pepsico_RnD_Potato_Chip_Image_Data_Tutorial", schema_overrides=[{"name": "image", "column_type": "image_external"}])
print(f"Dataset ID: {dataset_id}")

Uploading dataset...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
Ingesting Dataset...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|

Dataset ID: 7dfa48ce904e47ab9554a66ddef602b9

Once the data are loaded, we create a Project based on this Dataset in Cleanlab Studio.

project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="Pepsico_RnD_Potato_Chip_Image_Data_Tutorial_Project",
    modality="image",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
)
print(
    f"Project successfully created and ML training has begun! project_id: {project_id}"
)

Project successfully created and ML training has begun! project_id: dbbac382835b4ef9bc8cc47469bae255

The Project will take a while for Cleanlab’s AI models to train on your dataset and analyze it. You’ll receive an email when the results are ready. Each Project creates a cleanset (cleaned dataset). Run the cell below to fetch the cleanset_id from Cleanlab Studio, this code will block until your project results are ready. For big datasets, if your notebook times out, do not recreate the project. Instead just re-run the cell below to fetch the cleanset_id based on the project_id (which you can also find in the Cleanlab Studio Web App).

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
studio.wait_until_cleanset_ready(cleanset_id)

cleanset_id: 333667428d494e54908dc7c8e9e39cec

Cleanset Progress: \ Step 50/50, Ready for review!

Cleanlab Studio automatically generates smart metadata for any image, text, or tabular dataset. This metadata (returned as Cleanlab Columns) helps you find and fix various problems in your dataset, or impute missing values and auto-label unlabeled data.

cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
cleanlab_columns_df.head(10)

	cleanlab_row_ID	corrected_label	is_label_issue	label_issue_score	suggested_label	suggested_label_confidence_score	is_ambiguous	ambiguous_score	is_well_labeled	is_near_duplicate	...	is_odd_size	is_low_information	low_information_score	is_grayscale	is_odd_aspect_ratio	aesthetic_score	is_NSFW	NSFW_score
0	1	<NA>	False	0.373451	<NA>	0.626549	False	0.993814	False	False	...	False	False	0.305794	False	False	0.532279	False	0.041162
1	2	<NA>	False	0.425856	<NA>	0.574144	False	0.999999	False	False	...	False	False	0.314794	False	False	0.511932	False	0.051039
2	3	<NA>	False	0.407524	<NA>	0.592476	False	0.999208	False	False	...	False	False	0.278744	False	False	0.519195	False	0.113037
3	4	<NA>	False	0.391541	<NA>	0.608459	False	0.997314	False	False	...	False	False	0.300383	False	False	0.479944	False	0.054440
4	5	<NA>	False	0.390192	<NA>	0.609808	False	0.997103	False	False	...	False	False	0.330797	False	False	0.482801	False	0.052081
5	6	<NA>	False	0.392143	<NA>	0.607857	False	0.997406	False	False	...	False	False	0.291288	False	False	0.464114	False	0.000000
6	7	<NA>	False	0.425907	<NA>	0.574093	False	0.951693	False	False	...	False	False	0.348319	False	False	0.552842	False	0.137950
7	8	<NA>	False	0.420150	<NA>	0.579850	False	0.947803	False	False	...	False	False	0.328956	False	False	0.584307	False	0.092091
8	9	<NA>	False	0.437139	<NA>	0.562861	False	0.999753	False	False	...	False	False	0.310666	False	False	0.489662	False	0.108625
9	10	<NA>	False	0.375067	<NA>	0.624933	False	0.994185	False	False	...	False	False	0.327839	False	False	0.525412	False	0.110453

10 rows × 33 columns

We’ll join our Cleanlab Studio metadata columns with some of our original data columns in order to view detected label issues, suggested labels, confidence in these suggestions, and other important columns that Cleanlab automatically generated. Learn more about these metadata columns here.

Label issues in image datasets often involve challenges such as incorrectly chosen class labels due to human mistakes, inconsistencies across different data annotators, ambiguous examples where multiple classes might seem applicable, and instances where relevant features are overlooked by annotators. Cleanlab can automatically detect annotation problems stemming from a variety of sources, including: subjective interpretation of images, the complexity of scenes, variability in object appearances, and unclear portions of annotation guidelines (or changing guidelines). Identifying and resolving these annotation issues is crucial for creating a high-quality dataset.

cleanlab_df = cleanlab_df.reset_index().rename(columns={'index': 'cleanlab_row_ID'})
cleanlab_columns_df = pd.merge(cleanlab_columns_df, cleanlab_df[['cleanlab_row_ID', 'label', 'image']], on='cleanlab_row_ID', how='left')

Review some of the mislabeled images that Cleanlab Studio detected

We have retrieved our Cleanlab Studio results, so how should we actually look at the results and use them effectively?

Let’s look at an example of a mislabeled image that Cleanlab was able to detect to build trust that Cleanlab can actually do this. We can filter our cleanlab_columns_df based on the column is_label_issue to find the images that are found to be label issues by Cleanlab.

We then sort (in descending order) these label issues by their label_issue_score, which quantifies how confident Cleanlab is that a data point is mislabeled.

We identify the images most confidently to be mislabeled by thresholding the label issue score.

label_issues = cleanlab_columns_df.query("is_label_issue").sort_values("label_issue_score", ascending=False)

Optional: Helper functions to render id column of DataFrame as images in a separate column

from PIL import Image
from io import BytesIO
from base64 import b64encode
from IPython.display import HTML

def url_to_img_html(url: str) -> str:
    """
    Converts a URL to an HTML img tag.
    """
    return f'<img src="{url}" width="175" alt="Image" />'

def display_images_in_df(df: pd.DataFrame) -> HTML:
    """
    Modifies the DataFrame to include a column with HTML img tags for rendering images.
    """
    image_column = "image_shown"
    df_copy = df.copy()
    
    # Assuming 'url_to_img_html' converts URLs in 'image' column to HTML img tags
    df_copy[image_column] = df_copy['image'].apply(url_to_img_html)
    
    # Determine the position of the 'image' column
    image_col_index = df_copy.columns.get_loc('image') + 1
    
    # Reorder columns to ensure 'image_shown' appears immediately after 'image'
    columns = list(df_copy.columns)
    new_column_order = columns[:image_col_index] + [image_column] + columns[image_col_index:-1]
    df_copy = df_copy[new_column_order]
    
    # Generate HTML representation of the DataFrame to display images
    html = df_copy.to_html(escape=False)
    
    return HTML(html)

To see that some of these images were actually mislabeled, you can view the first 2 images below in the image_shown column:

display_images_in_df(label_issues.head(2)[["cleanlab_row_ID", "image", "is_label_issue", "label_issue_score", "label", "suggested_label"]])

	cleanlab_row_ID	image	image_shown	is_label_issue	label_issue_score	label	suggested_label
42	43	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg		True	0.778596	Non-Defective	Defective
41	42	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg		True	0.708746	Non-Defective	Defective

Cleanlab has automatically flagged data that was badly annotated! Let’s use Cleanlab’s label issue detection to determine which data points should be re-labeled in our data annotation tool.

Based on some threshold value we set for our label_issue_score column, we should choose to flag and relabel (in our data annotation tool) any images that are greater than or equal to this threshold value.

Here we set our label_issue_score threshold to 0.7 (to only re-consider the images Cleanlab is most confident are mislabeled):

images_to_relabel_in_label_studio = cleanlab_columns_df.query(
            "is_label_issue & label_issue_score > 0.7"
).sort_values("label_issue_score", ascending=False)

images_to_relabel_in_label_studio.head()

	cleanlab_row_ID	corrected_label	is_label_issue	label_issue_score	suggested_label	suggested_label_confidence_score	is_ambiguous	ambiguous_score	is_well_labeled	is_near_duplicate	...	is_low_information	low_information_score	is_grayscale	is_odd_aspect_ratio	odd_aspect_ratio_score	aesthetic_score	is_NSFW	NSFW_score	label	image
42	43	<NA>	True	0.778596	Defective	0.778596	False	0.905642	False	False	...	False	0.302884	False	False	0.0	0.5266	False	0.000000	Non-Defective	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg
41	42	<NA>	True	0.708746	Defective	0.708746	False	0.959486	False	False	...	False	0.315274	False	False	0.0	0.5030	False	0.074041	Non-Defective	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg

2 rows × 35 columns

Note: The value you choose for score thresholds will affect subsequent results. We suggest you play around with this threshold to achieve optimal results for your data. For label issues, you should choose a threshold where you see images that were mislabeled around this label_issue_score threshold.

Review some of the unlabeled images that we want to auto-label

Let’s look at a few examples of unlabeled images that we want to try and auto-label using Cleanlab Studio. Auto-labeling uses predictions from Cleanlab’s AI (which was trained on your existing labeled data provided in the Cleanlab Studio Project) to suggest labels for unlabeled data points. To ensure accurate auto-labels, we must account for the confidence in these predictions.

unlabeled_images = final_labels_df[final_labels_df['label'].isnull()]
unlabeled_images.head()

	image	label	historical_labels
50	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_014652.jpg	None	[]
51	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210318_234146.jpg	None	[]
52	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_010408.jpg	None	[]
53	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_001713.jpg	None	[]
54	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_020124.jpg	None	[]

Let’s look at some of these unlabeled images that we want to auto-label below. The image_shown column is to see what the images actually look like:

display_images_in_df(unlabeled_images.head(3)[["label", "historical_labels", "image"]])

	label	historical_labels	image
50	None	[]	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_014652.jpg
51	None	[]	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210318_234146.jpg
52	None	[]	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_010408.jpg

We can then sort (in descending order) these unlabeled images by their suggested_label_confidence_score in order to see which images are considered data points we can more confidently auto-label with Cleanlab.

The suggested_label_confidence_score is our confidence that Cleanlab’s predicted label for a data point is correct (again, higher values correspond to data points you can more confidently auto-fix / auto-label with Cleanlab). To read more about this score, you can go here.

Based on some threshold value we set for our suggested_label_confidence_score column, we should auto-label (using the suggested_label column) any images that are currently unlabeled and that are greater than or equal to this threshold value.

Here is an example of how to do this by setting our suggested_label_confidence_score threshold to 0.7(a high threshold to find images we are most confident about being able to auto-label):

auto_label_criteria = cleanlab_columns_df['suggested_label'].notnull() & \
                      (cleanlab_columns_df['suggested_label_confidence_score'] > 0.7) & \
                      cleanlab_columns_df['image'].isin(unlabeled_images['image'])

images_to_auto_label = cleanlab_columns_df[auto_label_criteria].sort_values("suggested_label_confidence_score", ascending=False)

images_to_auto_label.head()

	cleanlab_row_ID	corrected_label	is_label_issue	suggested_label	suggested_label_confidence_score	is_ambiguous	ambiguous_score	is_well_labeled	is_near_duplicate	...	is_low_information	low_information_score	is_grayscale	is_odd_aspect_ratio	aesthetic_score	is_NSFW	NSFW_score	label	image
320	321	<NA>	False	Defective	0.745203	False	NaN	False	False	...	False	0.284669	False	False	0.565557	False	0.000000	None	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_020059.jpg
214	215	<NA>	False	Defective	0.736089	False	NaN	False	False	...	False	0.291471	False	False	0.530241	False	0.000000	None	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_013358.jpg
158	159	<NA>	False	Defective	0.732108	False	NaN	False	False	...	False	0.316843	False	False	0.568669	False	0.004064	None	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_003158.jpg
247	248	<NA>	False	Defective	0.729612	False	NaN	False	False	...	False	0.307088	False	False	0.538092	False	0.000000	None	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_002930.jpg
222	223	<NA>	False	Defective	0.728620	False	NaN	False	False	...	False	0.294260	False	False	0.493728	False	0.000000	None	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_010310.jpg

5 rows × 35 columns

Note: The value you choose for this threshold is sensitive in how it affects results and we highly suggest you tune or play around with this threshold yourself to achieve optimal results for your use case. For deciding which auto-labels can be trusted, you should choose a threshold where you see the images with greater suggested_label_confidence_score have the correct suggested_label.

Update final labels DataFrame with our Cleanlab Studio results

Let’s now use our Cleanlab Studio results to achieve the following:

Based on some threshold value we set for Cleanlab’s label_issue_score column, select data points to re-label in our data annotation tool. This efficiently fixes data that was likely previously mislabeled.
Based on some threshold value we set for Cleanlab’s suggested_label_confidence_score column, select currently-unlabeled data points to auto-label with Cleanlab (using the suggested_label column). This allows to only auto-label the subset of data that AI can confidently handle.
Update our final labels to reflect the newly auto-labeled images, determine the indices of data points that need to be relabeled in our data annotation tool, as well as indices of data points that were auto-labeled by Cleanlab.

Optional: Define helper function to update the values of our labels based on what we auto-label in Cleanlab Studio and determine which images to relabel

from typing import Tuple, Optional

def update_final_labels_via_cleanlab(
    final_labels_df: pd.DataFrame, 
    cleanlab_columns_df: pd.DataFrame, 
    issue_score_threshold: float = 0.6, 
    suggested_label_confidence_score_threshold: float = 0.55,
    run_auto_labeling_only: bool = False
) -> Tuple[pd.DataFrame, Optional[pd.DataFrame], Optional[pd.DataFrame]]:
    """
    Updates final labels with suggested labels from cleanlab_columns_df based on specified criteria,
    and identifies images to relabel in Label Studio.

    Parameters:
    - final_labels_df (pd.DataFrame): DataFrame to update, containing 'label', 'image', and 'historical_labels'.
    - cleanlab_columns_df (pd.DataFrame): DataFrame containing cleanlab analysis and suggestions.
    - issue_score_threshold (float): Threshold for determining if a label issue is significant enough to require relabeling.
    - suggested_label_confidence_score_threshold (float): Confidence score threshold for auto-labeling.
    - run_auto_labeling_only (bool): If True, only run the auto-labeling logic.

    Returns:
    - pd.DataFrame: Updated final labels.
    - Optional[pd.DataFrame]: images_to_relabel_in_label_studio, containing images identified for relabeling (None if run_auto_labeling_only=True).
    - pd.DataFrame: images_to_auto_label, containing images auto-labeled based on the suggested label confidence score.
    """
    images_to_relabel_in_label_studio = None
    if not run_auto_labeling_only:
        # Part 1: Identify images to relabel in Label Studio
        images_to_relabel_in_label_studio = cleanlab_columns_df.query(
            "is_label_issue & label_issue_score > @issue_score_threshold"
        ).sort_values("label_issue_score", ascending=False)

    # Filter out already labeled images in final_labels_df for auto-labeling
    unlabeled_final_labels_df = final_labels_df[final_labels_df['label'].isnull()]

    # Part 2: Auto-label images not in the images_to_relabel_in_label_studio DataFrame
    # and meeting the suggested_label confidence score criteria
    # Additionally, ensure these images are currently unlabeled in final_labels_df
    auto_label_criteria = cleanlab_columns_df['suggested_label'].notnull() & \
                          (cleanlab_columns_df['suggested_label_confidence_score'] > suggested_label_confidence_score_threshold) & \
                          cleanlab_columns_df['image'].isin(unlabeled_final_labels_df['image'])
    if not run_auto_labeling_only:
        auto_label_criteria &= ~cleanlab_columns_df['image'].isin(images_to_relabel_in_label_studio['image'])
    images_to_auto_label = cleanlab_columns_df[auto_label_criteria].sort_values("suggested_label_confidence_score", ascending=False)

    # Update final_labels_df with auto-label suggestions
    for _, row in images_to_auto_label.iterrows():
        image_url = row['image']
        suggested_label = row['suggested_label']

        # Find the index in final_labels_df where the image URL matches
        index = final_labels_df.index[final_labels_df['image'] == image_url]

        if not index.empty:
            # Update the label column
            final_labels_df.at[index[0], 'label'] = suggested_label
            
            # Append the suggested label to the historical_labels list
            if len(final_labels_df.at[index[0], 'historical_labels']) == 0:
                final_labels_df.at[index[0], 'historical_labels'] = [suggested_label]
            else:
                existing_labels = final_labels_df.at[index[0], 'historical_labels']
                if isinstance(existing_labels, list):
                    existing_labels.append(suggested_label)
                else:
                    final_labels_df.at[index[0], 'historical_labels'] = [existing_labels, suggested_label]

    # Adjust the return based on `run_auto_labeling_only`
    if run_auto_labeling_only:
        return final_labels_df, None, images_to_auto_label
    else:
        return final_labels_df, images_to_relabel_in_label_studio, images_to_auto_label

Let’s update our final labels, obtain which images to relabel in Label Studio, and auto-label some of our unlabeled images using this helper function we just defined.

We choose lower thresholds for label_issue_score and suggested_label_confidence_score as the default values for these scores (as arguments to the helper function) compared to the examples we showed previously in order to make it easier to find images to relabel and auto-label for this tutorial.

# Obtain updated finals labels object, the images to relabel in Label Studio, and the images we auto-labeled
final_labels_df, images_to_relabel, auto_labeled_images = update_final_labels_via_cleanlab(final_labels_df, cleanlab_columns_df)

Let’s check how many images we auto labeled with Cleanlab Studio using our threshold on suggested_label_confidence_score:

# We auto-labeled these images 
print(auto_labeled_images.shape)

(325, 35)

Let’s also look at some examples of the images we auto-labeled to see if they were correct. We will see what our suggested_label is for 2 of the images we auto-labeled and the image_shown column to see what the images actually look like:

display_images_in_df(auto_labeled_images.head(2)[["cleanlab_row_ID", "image", "is_label_issue", "label_issue_score", "label", "suggested_label"]])

	cleanlab_row_ID	image	image_shown	is_label_issue	label_issue_score	label	suggested_label
320	321	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_020059.jpg		False	0.0	None	Defective
214	215	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_013358.jpg		False	0.0	None	Defective

You will find after checking each image that the suggested labels are correct! Let’s see how many unlabeled images remain in our data after the auto-labeling:

# We can see we still have unlabeled images that need to be handled in another round of labeling
print(final_labels_df["label"].isnull().sum())

And now let’s see how many images we identified that need to relabeled in Label Studio, our 3rd party data annotation tool of choice:

# We will relabel these images with Label Studio
print(images_to_relabel.shape)

images_to_relabel[["is_label_issue", "label_issue_score", "label", "suggested_label", "image"]].head(10)

	is_label_issue	label_issue_score	label	suggested_label	image
42	True	0.778596	Non-Defective	Defective	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg
41	True	0.708746	Non-Defective	Defective	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg
43	True	0.685009	Defective	Defective	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg
45	True	0.680246	Non-Defective	Defective	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg

You can see for each of those images that label does not match suggested_label that found by Cleanlab’s AI. Cleanlab is very useful for catching label errors like these!

Round 2: label more images (and re-label those with label errors) in your data annotation tool

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

After getting back the initial Cleanlab Studio results, you can use them to inform your next round of data annotation. Here we’ll use the images_to_relabel DataFrame to figure out the exact task_ids (index of our images in Label Studio) to modify when relabeling these images in Label Studio. Here we’ll continue labeling data in the same Label Studio project, which means we need to update this project with the auto-labels obtained from Cleanlab (so annotators don’t label those images that were already auto-labeled). You could alternatively start a new Label Studio project with just the remaining still-to-be-labeled data.

For simplicity in this tutorial, our 2nd round of data labeling just focuses on addressing mislabeled images. In practice, you should also manually label more of the unlabeled data (that could not be confidently auto-labeled yet). Every time you re-label data or label new data in your data annotation tool, run a Cleanlab Studio project on the updated dataset and Cleanlab’s results will automatically improve.

Optional: Helper function designed to help identify the right task IDs (associated with an image) in Label Studio for relabeling

def identify_tasks_to_relabel(tasks, images_to_relabel_list):
    """
    Identifies tasks to relabel based on image URLs.

    Parameters:
    - tasks (list): A list of task dictionaries.
    - images_to_relabel_list (list): A list of image URLs to relabel.

    Returns:
    - dict: A dictionary of task IDs to relabel, mapped to their filenames.
    """
    # Extract the basename (filename) from the task image paths
    task_filenames = {task['id']: task['data']['image'].split('/')[-1] for task in tasks}

    # Extract filenames from the images_to_relabel_list
    image_relabel_filenames = [image.split('/')[-1] for image in images_to_relabel_list]

    # Filter the task IDs based on whether their filenames are in the list of images to relabel
    tasks_to_relabel = {}
    for task_id, filename in task_filenames.items():
        if filename in image_relabel_filenames:
            tasks_to_relabel[task_id] = filename

    return tasks_to_relabel

def identify_tasks_to_autolabel(tasks, image_to_label_mapping):
    """
    Identifies tasks to auto-label based on image URLs and assigns the suggested label.

    Parameters:
    - tasks (list): A list of task dictionaries.
    - image_to_label_mapping (dict): A dictionary mapping image filenames to suggested labels.

    Returns:
    - dict: A dictionary of task IDs mapped to their suggested labels.
    """
    tasks_to_autolabel = {}
    
    # Extract the basename (filename) from the task image paths
    for task in tasks:
        filename = task['data']['image'].split('/')[-1]
        if filename in image_to_label_mapping:
            # Map task ID to the suggested label
            tasks_to_autolabel[task['id']] = image_to_label_mapping[filename]

    return tasks_to_autolabel

Let’s get all of the tasks (i.e. data points in Label Studio) that we want to relabel.

# Get all tasks to find the ones you want to relabel
tasks = project.get_tasks()

image_relabel_list = list(images_to_relabel["image"])

tasks_to_relabel = identify_tasks_to_relabel(tasks, image_relabel_list)
print("Task IDs to relabel:", tasks_to_relabel)

Task IDs to relabel: {161903: 'IMG_20210319_014842.jpg', 161904: 'IMG_20210319_013912.jpg', 161905: 'IMG_20210319_003322.jpg', 161907: 'IMG_20210319_014306.jpg'}

Let’s create a dict that maps our image filenames to the suggested labels that we are going to use to annotate our images in Label Studio and add a flag that they were auto-labeled.

# Convert DataFrame to dictionary mapping image filenames to suggested labels
image_to_label_mapping = dict(zip(auto_labeled_images['image'].apply(lambda x: x.split('/')[-1]), auto_labeled_images['suggested_label']))

# Assuming `tasks` is your list of task dictionaries
tasks_we_autolabeled = identify_tasks_to_autolabel(tasks, image_to_label_mapping)

The task_ids are formatted as a dictionary where each key is a task_id that maps to the filename of an image we want to relabel.

We have now found which task_ids to relabel and which we know we have auto-labeled, so we will remove our label (called an annotation in Label Studio) for these images associated with the relevant task_ids identified above. Then we will also mark (using a metadata column) which images need to be relabeled so that this is clearly visible in the Label Studio UI. We also will mark which images we have already auto-labeled in Label Studio so that one is not confused by which images still need to be labeled or not and which ones we have already auto-labeled with Cleanlab Studio.

You can use the code below to execute these steps and send the updates back to your Label Studio project that is running locally.

Optional: Helper functions that list annotations for a given task and clear/mark annotations for relabeling in Label Studio

from typing import Any

def fetch_annotation_metadata(api_key: str, task_id: int) -> List[Dict[str, Any]]:
    """
    Fetches annotation metadata for a specific task from Label Studio.

    Parameters:
    - api_key (str): The API key for Label Studio.
    - task_id (int): The ID of the task for which annotations are fetched.

    Returns:
    - List[Dict[str, Any]]: A list of annotations for the specified task.
    
    Raises:
    - Exception: If the request to the Label Studio API fails.
    """
    headers = {'Authorization': f'Token {api_key}'}
    response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/annotations/', headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Failed to fetch annotations. Status code: {response.status_code}, Response: {response.text}")

def clear_annotations_mark_for_relabeling_and_autolabel(api_key: str, tasks_to_relabel: Dict[int, str], auto_labeled_tasks: Dict[int, str]) -> None:
    """
    Clears annotations for specified tasks, marks them for relabeling, and annotates auto-labeled tasks.

    Parameters:
    - api_key (str): The API key for Label Studio.
    - tasks_to_relabel (Dict[int, str]): Tasks to be cleared and marked for relabeling.
    - auto_labeled_tasks (Dict[int, str]): Task IDs mapped to their suggested label for auto-labeling.

    Returns:
    - None
    
    Raises:
    - Exception: If requests to the Label Studio API for deleting annotations, updating tasks, or creating annotations fail.
    """

    for task_id in tasks_to_relabel.keys():
        # List all annotations for the task using the custom function
        annotations = fetch_annotation_metadata(api_key, task_id)
        
        # Delete each annotation
        for annotation in annotations:
            response = requests.delete(f'http://localhost:8080/api/annotations/{annotation["id"]}/', headers={'Authorization': f'Token {api_key}'})
            if response.status_code != 204:
                raise Exception(f"Failed to delete annotation. Status code: {response.status_code}, Response: {response.text}")
        
        # Fetch the task data
        task_response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/', headers={'Authorization': f'Token {api_key}'})
        if task_response.status_code != 200:
            raise Exception(f"Failed to fetch task. Status code: {task_response.status_code}, Response: {task_response.text}")
        task_data = task_response.json()['data']

        # Update the task to indicate it needs relabeling
        if task_data.get('metadata') is None:
            task_data['metadata'] = {}

        task_data['metadata']['needs_relabeling'] = True
        
        # Update the task with the modified data
        update_response = requests.patch(f'http://localhost:8080/api/tasks/{task_id}/', json={'data': task_data}, headers={'Authorization': f'Token {api_key}'})
        if update_response.status_code != 200:
            raise Exception(f"Failed to update task. Status code: {update_response.status_code}, Response: {update_response.text}")

    print("Annotations cleared and tasks marked for relabeling in Label Studio.")

        # Handling auto-labeled tasks
    for task_id, suggested_label in auto_labeled_tasks.items():
        # Fetch the task data for updating metadata
        task_response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/', headers={'Authorization': f'Token {api_key}'})
        if task_response.status_code != 200:
            raise Exception(f"Failed to fetch task. Status code: {task_response.status_code}, Response: {task_response.text}")
        task_data = task_response.json()['data']
        
        # Update task metadata to indicate auto-labeled
        if task_data.get('metadata') is None:
            task_data['metadata'] = {}
        task_data['metadata']['auto_labeled'] = True
        
        # Update the task with the modified data including auto-labeled flag
        update_response = requests.patch(f'http://localhost:8080/api/tasks/{task_id}/', json={'data': task_data}, headers={'Authorization': f'Token {api_key}'})
        if update_response.status_code != 200:
            raise Exception(f"Failed to update task. Status code: {update_response.status_code}, Response: {update_response.text}")
        
        # Correct API endpoint for creating annotations
        annotation_endpoint = f'http://localhost:8080/api/tasks/{task_id}/annotations/'

        # Prepare annotation data
        annotation_data = {
            'result': [{
                'from_name': 'label',
                'to_name': 'image',
                'type': 'choices',
                'value': {'choices': [suggested_label]}
            }]
        }

        # Attempt to create the annotation
        annotation_response = requests.post(annotation_endpoint, json=annotation_data, headers={'Authorization': f'Token {api_key}'})
        if annotation_response.status_code not in [200, 201]:
            raise Exception(f"Failed to create annotation. Status code: {annotation_response.status_code}, Response: {annotation_response.text}")

    print("Tasks that were auto-labeled were successfully annotated in Label Studio.")

We can look at specific annotation metadata for each task_id using the Label Studio API. This can help understand important details about the annotation, including when it was created, last updated, what the project id number is and more.

Here is an example of how to list the annotation metadata for a specific task_id in Label Studio using a helper function we defined:

example_task_id = list(tasks_to_relabel.keys())[0]

print(fetch_annotation_metadata(label_studio_api_key, example_task_id))

[{'id': 1645, 'created_username': ' matt.turk@cleanlab.ai, 1', 'created_ago': '15\xa0hours, 23\xa0minutes', 'completed_by': 1, 'result': [{'value': {'choices': ['Non-Defective']}, 'id': 'E47bGehLt4', 'from_name': 'choice', 'to_name': 'image', 'type': 'choices', 'origin': 'manual'}], 'was_cancelled': False, 'ground_truth': False, 'created_at': '2024-03-21T22:15:22.666733Z', 'updated_at': '2024-03-21T22:15:22.666763Z', 'draft_created_at': None, 'lead_time': 2.482, 'import_id': None, 'last_action': None, 'task': 161903, 'project': 55, 'updated_by': 1, 'parent_prediction': None, 'parent_annotation': None, 'last_created_by': None}]

Clear annotations for images and mark them for relabeling in Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

Let’s use a helper function to clear the existing annotation in Label Studio for the images (each corresponding to a task in Label Studio) we know we want to relabel.

We also flag these images with a metadata column value to help you find the images that you manually need to relabel. If you had significantly more images to label than those in this tutorial, this flag would help you find the subset of images you are trying to relabel more easily.

You could also obtain the exact task_ids of the images and manually find each image to relabel without clearing the annotations in an automated way, but that would be a more difficult and time consuming method of relabeling your data in Label Studio.

We also flag the images that we’ve already auto-labeled, as mentioned earlier, with a metadata column value to more easily help you see which images do not need to be labeled within Label Studio due to Cleanlab Studio already helping us label those images.

For more information on deleting annotations in Label Studio, you can go here.

clear_annotations_mark_for_relabeling_and_autolabel(label_studio_api_key, tasks_to_relabel, tasks_we_autolabeled)

Annotations cleared and tasks marked for relabeling in Label Studio.
Tasks that were auto-labeled were successfully annotated in Label Studio.

Now you can go to your Label Studio project to now relabel the images that we flagged. In your Label Studio project, you’ll see that the images that need to be relabeled have a value in the metadata column of your project that says {"needs_relabeling":true}. The images we have already auto-labeled have a value in the metadata column that says “

Here is a video showing you how to find the images that need to be relabeled and also which images we have already auto-labeled. Then we show how to actually relabel the specified images within Label Studio:

After relabeling, export the labels as seen below.

Update final labels DataFrame with our 2nd round of labels obtained via Label Studio

Note: You can skip this section if you already know how to use Label Studio or are using another data annotation tool.

After relabeling some images in Label Studio, we update the DataFrame serving as our master record of each image’s final label.

We’ll reuse our Label Studio project ID defined earlier and our previously defined helper function to export the new image labels from Label Studio into a csv file:

relabeled_export_filename = "relabeled_label_studio_potato_chips_annotations.csv" # REPLACE YOUR FILENAME HERE

export_data_from_label_studio_project(ls_project_id, label_studio_api_key, relabeled_export_filename)

Let’s read in our image data we just exported out of Label Studio and filter out the images we did not relabel.

# Read in relabeled data exported from Label Studio
relabeled_df = pd.read_csv(relabeled_export_filename)
relabeled_df = relabeled_df[relabeled_df["metadata"] == '{"needs_relabeling":true}']

Here are the images that we relabeled:

relabeled_df.head(10)

	annotation_id	annotator	choice	created_at	id	image	lead_time	metadata	updated_at
42	2307	1	Defective	2024-03-22T13:40:20.268734Z	161903	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg	2.569	{"needs_relabeling":true}	2024-03-22T13:40:20.268757Z
43	2308	1	Defective	2024-03-22T13:40:24.147650Z	161904	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg	2.230	{"needs_relabeling":true}	2024-03-22T13:40:24.147698Z
44	2309	1	Non-Defective	2024-03-22T13:40:27.828297Z	161905	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg	2.220	{"needs_relabeling":true}	2024-03-22T13:40:27.828330Z
46	2310	1	Defective	2024-03-22T13:40:31.009574Z	161907	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg	1.953	{"needs_relabeling":true}	2024-03-22T13:40:31.009615Z

You can confirm the metadata column values show we did in fact relabel this data. The choice column represents what our now correct label value is after relabeling.

Let’s proceed with updating our final labels with our newly relabeled data. We will look at our final labels before AND after we update the values using our relabeled Label Studio images to illustrate the difference.

# Final labels before we update them with newly relabeled images
final_labels_df.loc[relabeled_df.index]

	image	label	historical_labels
42	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg	Non-Defective	[Non-Defective]
43	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg	Non-Defective	[Non-Defective]
44	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg	Defective	[Defective]
46	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg	Non-Defective	[Non-Defective]

# Apply updated labels from Label Studio to final labels
final_labels_df = update_final_labels_via_labelstudio(final_labels_df, relabeled_df)

# Final labels after we update them with newly relabeled images
final_labels_df.loc[relabeled_df.index]

	image	label	historical_labels
42	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014842.jpg	Defective	[Non-Defective, Defective]
43	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_013912.jpg	Defective	[Non-Defective, Defective]
44	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_003322.jpg	Non-Defective	[Defective, Non-Defective]
46	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_To_Label/IMG_20210319_014306.jpg	Defective	[Non-Defective, Defective]

You can see that we have new labels for several images that originally were Non-Defective (which was incorrect) and are now (correctly) Defective. Similarly, we correct images that were supposed to be Non-Defective and were not.

The historical_labels column is also correctly updated for these images. The label that is at the end of this historical_labels list is considered the most up-to-date label we are using for the image in the data point. The labels in this list, for each data point, are sorted by least recent label to most recent label of our image in that row.

We will update our metadata column in Label Studio to remove the flag that images need to be relabeled. You can run the following code to do this:

Optional: Helper function to update the relabeling status of all relabeled images

def update_tasks_relabeling_status(api_key: str, task_ids: List[int], needs_relabeling: bool) -> None:
    """
    Updates the relabeling status of multiple tasks in Label Studio.

    Parameters:
    - api_key (str): The API key for Label Studio.
    - task_ids (List[int]): A list of task IDs to update.
    - needs_relabeling (bool): True to mark the tasks as needing relabeling, False to clear the mark.

    Raises:
    - Exception: If any request to update a task fails.
    """
    headers = {'Authorization': f'Token {api_key}'}

    for task_id in task_ids:
        task_response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/', headers=headers)

        if task_response.status_code != 200:
            raise Exception(f"Failed to fetch task {task_id}. Status code: {task_response.status_code}, Response: {task_response.text}")

        task_data = task_response.json()['data']
        if 'metadata' not in task_data or task_data['metadata'] is None:
            task_data['metadata'] = {}

        if needs_relabeling:
            task_data['metadata']['needs_relabeling'] = True
        else:
            # Remove the needs_relabeling key if it exists
            task_data['metadata'].pop('needs_relabeling', None)
            # If the metadata dictionary is empty, set it to None
            if not task_data['metadata']:
                task_data['metadata'] = None

        update_response = requests.patch(f'http://localhost:8080/api/tasks/{task_id}/', json={'data': task_data}, headers=headers)
        if update_response.status_code != 200:
            raise Exception(f"Failed to update task {task_id}. Status code: {update_response.status_code}, Response: {update_response.text}")

    print(f"All tasks relabeling status updated successfully.")

# Remove metadata flag from Label Studio that images need to be relabeled.
relabeled_task_ids = list(tasks_to_relabel.keys())
update_tasks_relabeling_status(label_studio_api_key, relabeled_task_ids, needs_relabeling=False)

All tasks relabeling status updated successfully.

Now you can check your project in Label Studio to make sure no images there still have the needs_relabeling flag set to True. Without doing this step, it will be confusing to know which images you already relabeled and which images you did not relabel. This can otherwise be complex to manage for big datasets.

Note: For brevity, we did not annotate any more unlabeled data points in Label Studio during this 2nd round of labeling in this tutorial. In practice, you should also manually label some more of the unlabeled data in this 2nd round using your data annotation platform (to help Cleanlab’s AI learn not only from label corrections but new labels as well).

Round 2: Use Cleanlab Studio for further auto-labeling and label error detection

We will now follow the same process to first properly format our dataset (including both labeled and the rest of the unlabeled images) to ingest the newest version of our data back into Cleanlab Studio.

And we will create both a new Cleanlab Studio dataset and project that will be used to auto-label our remaining unlabeled images in our final labels.

In this tutorial, we choose to only auto-label the remainder of our data points given the size of the sample being used. In practice, you can try to repeat the process to detect data points that you want to re-label with your data annotation tool AND use Cleanlab to auto-label more unlabeled data points as many times as you deem necessary to achieve optimal results for your use case. Each time you label more data / correct existing labels and then launch a Cleanlab Studio project, Cleanlab’s AI will get better at auto-labeling your data and detecting label errors.

Let’s create our DataFrame object to ingest into Cleanlab Studio:

new_cleanlab_df = final_labels_df.drop("historical_labels", axis=1)

Let’s load our new set of currently labeled/unlabeled images into Cleanlab Studio. As mentioned earlier, it’s important to provide your full dataset to get the best results from Cleanlab’s AI.

new_dataset_id = studio.upload_dataset(new_cleanlab_df, dataset_name="Round_2_Pepsico_RnD_Potato_Chip_Image_Data_Tutorial", schema_overrides=[{"name": "image", "column_type": "image_external"}])
print(f"Dataset ID: {new_dataset_id}")

Uploading dataset...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|
Ingesting Dataset...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|

Dataset ID: 720f4800607c4e2892e177d4e2beaa39

Once the new data are loaded, we create a new Project based on this Dataset in Cleanlab Studio.

new_project_id = studio.create_project(
    dataset_id=new_dataset_id,
    project_name="Round_2_Pepsico_RnD_Potato_Chip_Image_Data_Tutorial_Project",
    modality="image",
    task_type="multi-class",
    model_type="regular",
    label_column="label",
)
print(
    f"Project successfully created and ML training has begun! project_id: {new_project_id}"
)

Project successfully created and ML training has begun! project_id: 84cbb77617e64d9caf1a48ea27722d91

Note: the Project will take a while for Cleanlab’s AI models to train on your dataset and analyze it. You’ll receive an email when the results are ready. Run the cell below to fetch the cleanset_id from Cleanlab Studio, this code will block until your project results are ready. For big datasets, if your notebook times out, do not recreate the project. Instead just re-run the cell below to fetch the cleanset_id based on the project_id (which you can also find in the Cleanlab Studio Web App).

new_cleanset_id = studio.get_latest_cleanset_id(new_project_id)
print(f"cleanset_id: {new_cleanset_id}")
studio.wait_until_cleanset_ready(new_cleanset_id)

cleanset_id: ecde48d5d79a4195ba9ca6e637858239

Cleanset Progress: / Step 50/50, Ready for review!

Let’s now download the metadata columns from Cleanlab Studio as we did before and see what the columns look like:

new_cleanlab_columns_df = studio.download_cleanlab_columns(new_cleanset_id)
new_cleanlab_columns_df.head(10)

	cleanlab_row_ID	corrected_label	is_label_issue	label_issue_score	suggested_label	suggested_label_confidence_score	is_ambiguous	ambiguous_score	is_well_labeled	is_near_duplicate	...	is_odd_size	is_low_information	low_information_score	is_grayscale	is_odd_aspect_ratio	aesthetic_score	is_NSFW	NSFW_score
0	1	<NA>	False	0.147065	<NA>	0.852935	False	0.997672	False	False	...	False	False	0.305794	False	False	0.532279	False	0.041162
1	2	<NA>	False	0.194291	<NA>	0.805709	False	0.999998	False	False	...	False	False	0.314794	False	False	0.511932	False	0.051039
2	3	<NA>	False	0.144631	<NA>	0.855369	False	0.997418	False	False	...	False	False	0.278744	False	False	0.519195	False	0.113037
3	4	<NA>	False	0.166404	<NA>	0.833596	False	0.999222	False	False	...	False	False	0.300383	False	False	0.479944	False	0.054440
4	5	<NA>	False	0.144390	<NA>	0.855610	False	0.997392	False	False	...	False	False	0.330797	False	False	0.482801	False	0.052081
5	6	<NA>	False	0.161320	<NA>	0.838680	False	0.998895	False	False	...	False	False	0.291288	False	False	0.464114	False	0.000000
6	7	<NA>	True	0.844409	Defective	0.844409	False	0.998458	False	False	...	False	False	0.348319	False	False	0.552842	False	0.137950
7	8	<NA>	False	0.793626	Defective	0.793626	True	0.999799	False	False	...	False	False	0.328956	False	False	0.584307	False	0.092091
8	9	<NA>	False	0.297036	<NA>	0.702964	False	0.987958	False	False	...	False	False	0.310666	False	False	0.489662	False	0.108625
9	10	<NA>	False	0.123683	<NA>	0.876317	False	0.994688	False	False	...	False	False	0.327839	False	False	0.525412	False	0.110453

10 rows × 33 columns

Let’s repeat the process of joining our Cleanlab Studio metadata columns with some of our original data columns in order to more easily observe the important columns that Cleanlab Studio generated for us.

new_cleanlab_df = new_cleanlab_df.reset_index().rename(columns={'index': 'cleanlab_row_ID'})
new_cleanlab_columns_df = new_cleanlab_columns_df.merge(new_cleanlab_df[['cleanlab_row_ID', 'label', 'image']], left_index=True, right_index=True)

Update final labels DataFrame with our 2nd round of Cleanlab Studio results

With our new Cleanlab Studio results we can now auto-label the remainder of our unlabeled data points and update our final labels object accordingly.

As mentioned before, based on some threshold value we set for our suggested_label_confidence_score column, we should auto-label (using the suggested_label column) any images that are currently unlabeled and that are greater than or equal to this threshold value.

In this round, we we only execute the auto-labeling logic in our helper function and not the label error detection logic (to find data points to re-label). In practice, you should repeat both of these steps as many times as you need to accurately label your dataset. In between each Cleanlab Studio project, you should aim to manually label a bit more unlabeled data in your data annotation tool (this will improve Cleanlab’s AI in the subsequent rounds). To keep this tutorial short, we won’t show many rounds of this process, and here only focus on auto-labeling in this 2nd round.

We’ll also choose a slightly lower (less conservative) threshold value for suggested_label_confidence_score when using our helper function this time around. This illustrates how you can choose different threshold values in different rounds of the labeling process as Cleanlab’s AI improves over multiple rounds.

# Obtain updated final labels and newly auto-labeled images
final_labels_df, _, new_auto_labeled_images = update_final_labels_via_cleanlab(final_labels_df, new_cleanlab_columns_df, suggested_label_confidence_score_threshold=0.5, run_auto_labeling_only=True)

We’ve updated our final labels and auto-labeled the unlabeled data points that could be confidently handled by Cleanlab’s AI. Let’s see how many unlabeled data points we were able to auto-label:

print(new_auto_labeled_images.shape)

(25, 36)

The number of data points we auto-labeled matches the number of unlabeled data points remaining from the previous round of labeling! Cleanlab’s AI improved so much from our 2nd round of manual annotations that it was able to auto-label all of the remaining unlabeled data.

Let’s inspect some images we auto-labeled in this round to review if they were correct. We will see what our suggested_label is for 2 of the images we auto-labeled and the image_shown column to see what the images actually look like:

display_images_in_df(new_auto_labeled_images.head(2)[["is_initially_unlabeled", "image", "label", "suggested_label"]])

	is_initially_unlabeled	image	image_shown	label	suggested_label
102	True	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_004902.jpg		None	Defective
394	True	https://s.cleanlab.ai/PepsiCo-Potato-Chip-Images/Label_Studio_Tutorial_Potato_Chip_Images_Unlabeled/IMG_20210319_013246.jpg		None	Defective

The suggested_label values confidently predicted by Cleanlab’s AI are correct for these examples!

Here’s how many unlabeled data points remain after this round of auto-labeling:

print(final_labels_df["label"].isnull().sum())

We do not have any unlabeled images left, so we have successfully used a 3rd party data annotation tool combined with Cleanlab Studio to label all of our data points! Otherwise you could manually label some more images with your data annotation platform (in a 3rd round), and then run another Cleanlab Studio Project to further improve the auto-labeling of your data.

Summary

This tutorial demonstrated how easily Cleanlab Studio can be used with any 3rd party data annotation tool. By incorporating Cleanlab Studio into our workflow, we only had to manually label 50 data points (plus the 6 relabeled points) in our data annotation tool instead of all 400 data points. The other 350 data points we were able to accurately auto-label using Cleanlab Studio, which is ~88% of our images, saving time and effort. Cleanlab Studio can scale to massive datasets just as easily, unlocking immense savings. Cleanlab also automatically caught incorrect labels, allowing us to produce a higher-quality dataset.

While we focused on image data here, all of this works for text and tabular datasets as well. The same workflow can also be applied in other data annotation applications beyond multi-class classification, such as image/document tagging. We re-emphasize that you can apply this same workflow with any data annotation tool of your choosing (see beginning of tutorial for links to other tools beyond Label Studio).

Install and import required dependencies​

Prepare the dataset​

Label some of the images with your annotation tool​

Format the data for Label Studio​

Label 50 images in Label Studio​

Update final labels DataFrame with labels obtained in our annotation tool​

Use Cleanlab Studio to catch label errors and auto-label the data that AI can confidently handle​

Review some of the mislabeled images that Cleanlab Studio detected​

Review some of the unlabeled images that we want to auto-label​

Update final labels DataFrame with our Cleanlab Studio results​

Round 2: label more images (and re-label those with label errors) in your data annotation tool​

Clear annotations for images and mark them for relabeling in Label Studio​

Update final labels DataFrame with our 2nd round of labels obtained via Label Studio​

Round 2: Use Cleanlab Studio for further auto-labeling and label error detection​

Update final labels DataFrame with our 2nd round of Cleanlab Studio results​

Summary​

Install and import required dependencies

Prepare the dataset

Label some of the images with your annotation tool

Format the data for Label Studio

Label 50 images in Label Studio

Update final labels DataFrame with labels obtained in our annotation tool

Use Cleanlab Studio to catch label errors and auto-label the data that AI can confidently handle

Review some of the mislabeled images that Cleanlab Studio detected

Review some of the unlabeled images that we want to auto-label

Update final labels DataFrame with our Cleanlab Studio results

Round 2: label more images (and re-label those with label errors) in your data annotation tool

Clear annotations for images and mark them for relabeling in Label Studio

Update final labels DataFrame with our 2nd round of labels obtained via Label Studio

Round 2: Use Cleanlab Studio for further auto-labeling and label error detection

Update final labels DataFrame with our 2nd round of Cleanlab Studio results

Summary