Data Labeling

This tutorial describes how to use Cleanlab Studio for data labeling, a task where you have a classification dataset with some subset of the data labeled, and you wish to efficiently label the rest of the data points. Cleanlab Studio can expedite this process by suggesting a label for your unlabeled data and supporting a human-in-the-loop workflow where you can review and accept proposed labels.

This tutorial uses a multi-class classification dataset. The process is similar for a multi-label dataset – learn how to format such datasets here.

During this process, Cleanlab Studio can also help identify data issues in the already-labeled subset of the data.

Download the Dataset

This tutorial uses a version of the Banking77 dataset which you can download here: banking_tutorial.csv.

This intent recognition dataset is composed of customer service requests, classified by their intents into one of 77 categories. 90% of the examples in our dataset are not yet labeled. Here are some of the labeled examples from this dataset:

text | label
I am still waiting on my card? | card_arrival
What can I do if my card still hasn’t arrived after 2 weeks? | card_arrival
I noticed an extra pound charge? | extra_charge_on_statement
when i travel can it top up automatically | automatic_top_up
Help. My card is broken. | card_not_working

We use a text dataset as an example here, but the steps from this tutorial can also be applied to image or structured/tabular datasets.
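Before uploading, it can help to check how much of the dataset is actually unlabeled. The snippet below is a minimal sketch with pandas. It uses a small in-memory stand-in for banking_tutorial.csv, and it assumes that unlabeled rows have an empty label cell. The two unlabeled rows are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for banking_tutorial.csv with the same two columns.
# Assumption: an empty "label" cell means the row is not labeled yet.
# The last two rows are invented for this sketch.
csv_text = """text,label
I am still waiting on my card?,card_arrival
I noticed an extra pound charge?,extra_charge_on_statement
Help. My card is broken.,card_not_working
Where is my refund?,
Why was I charged twice?,
"""

df = pd.read_csv(io.StringIO(csv_text))

# pandas reads empty cells as NaN, so isna() marks the unlabeled rows.
unlabeled_mask = df["label"].isna()
unlabeled_fraction = unlabeled_mask.mean()

# Class counts among the rows that already have a label.
labeled_counts = df.loc[~unlabeled_mask, "label"].value_counts()

print(f"{unlabeled_mask.sum()} of {len(df)} rows are unlabeled ({unlabeled_fraction:.0%})")
print(labeled_counts)
```

To run this on the real file, replace the `io.StringIO(csv_text)` argument with the path to banking_tutorial.csv.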

Uploading a Dataset to Cleanlab Studio

First, upload the dataset to Cleanlab Studio using the upload wizard. You can find this from your Dashboard by clicking Upload Dataset and using the wizard to walk through the steps for Upload from your computer:

  1. Click Upload Dataset
  2. Select Upload from your computer
  3. Select or drag-and-drop the provided banking_tutorial.csv
  4. Click Next
  5. Cleanlab Studio automatically infers the dataset’s schema; leave this as-is, and click Confirm

For many Cleanlab users, Upload from your computer will be the most common way of uploading datasets, but datasets can also be uploaded via URL, command line, or from the Python library. URL uploads are useful for publicly hosted datasets, such as common benchmark datasets you might use to experiment with Cleanlab Studio. Command line and Python library options are often used when building Cleanlab Studio into your data processing and QA pipeline.

Python API instructions

To upload your data to Cleanlab Studio using the Python library, you can use the upload_dataset method:

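from cleanlab_studio import Studio
import pandas as pd

# Authenticate with your Cleanlab Studio API key
studio = Studio('YOUR_API_KEY')

# Load the tutorial CSV and upload it as a new dataset
df = pd.read_csv('banking_tutorial.csv')
studio.upload_dataset(dataset=df, dataset_name="Data Labeling Tutorial")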

You should see this view once the upload is confirmed.

uploaded_dataset.png

Creating a Project

Next, click Create Project from the dataset page or the dashboard. On the project creation page, there are several settings to choose.

  • Machine Learning Task: select Text Classification (the default for a text dataset) — we’re working with a text classification task in this tutorial (classifying a single string of text into a class like “card_arrival” or “automatic_top_up”).
  • Type of Classification: select Multi-Class (the default) — the Banking77 dataset is for a single-label classification task, where each utterance belongs to exactly one class.
  • Text column: select “text” (which is auto-detected) — this is the name of the column from your dataset that contains the text, on which predictions are based.
  • Label column: select “label” (which is auto-detected) — this is the name of the column from your dataset that contains the labels.

Keep the default “Use a provided model” setting, and choose “Fast” mode for the purpose of the tutorial. Fast models train more quickly (15-90 minutes), and Regular models take longer (up to 24 hours) but produce better results.

Finally, click Clean my data! to start the data analysis process!

create_project_settings.png

When you launch a project, Cleanlab Studio trains cutting-edge ML models to analyze your data, which can take some time. For the tutorial dataset, it should take about 15 minutes. You will get an email with a link to the results once the analysis is complete. By clicking on that link or Ready for review on the dashboard, you will see the project view:

project_page.png

Labeling Data and Resolving Issues

The interface supports two workflows for labeling your data: batch labeling, and individual row/hand labeling. This tutorial demonstrates both. Batch labeling can be very powerful when combined with Cleanlab Studio’s filters interface, or when the data are fairly “easy” to label. Batch labeling can also speed up the hand-labeling process by providing a good starting point that can be rapidly iterated on. Hand labeling is a good option for the last few examples that Cleanlab Studio is less confident about.

While Cleanlab Studio can be used to quickly re-label data auto-identified as being originally mislabeled, let’s focus for a moment on labeling originally unlabeled data only. Learn how to format unlabeled data here. You can filter to view solely the unlabeled data in a dataset by selecting Unlabeled in the Filter bar. Unlabeled data has a “-” in the Given column (to indicate no label has been given yet). For each unlabeled data point, Cleanlab Studio automatically suggests an appropriate label in the Suggested column, along with a confidence score for this suggestion.

Unlabeled data.

Labeling individual examples

If you’d prefer to review data points one-at-a-time, aided by Cleanlab Studio’s suggestions in labeling them, start by clicking on a row of data and using the resolver panel to select labels. Even though you’re going through the datapoints one-by-one, you’ll be leveraging Cleanlab Studio’s insights about likely labels by seeing percentage likelihoods for each label, and Cleanlab will surface the most likely labels to the top while you do the tagging.

resolver.png

Basic row-by-row labeling workflow

The clip below demonstrates using the issue resolver panel to label some examples that Cleanlab Studio is less confident about. To label, select an example by clicking on it and use shortcuts to take actions like label, exclude, tag, etc. Or if you prefer, use the mouse and click on the buttons!

  • W to auto-fix (apply Cleanlab Studio’s recommended action, such as changing the label)
  • N to add the “Needs Review” tag (so you can come back to the data point later)
  • Q to keep the given label
  • E to exclude the data point from the dataset
  • 1-9 to label, for the top 9 most likely labels

Batch Labeling

Batch labeling allows you to resolve label errors and purge bad data rapidly. As shown below, it can function either as a standalone action or as a baseline for human review. When used on a dataset with many unlabeled data points, batch actions/the Clean Top K feature can improve the speed of labeling by allowing you to clear the low-hanging fruit and focus on the most nuanced examples.

To batch label, first filter the data using the quick-filter buttons and the filter bar to pull up the examples that you’d like to label.

Once you have retrieved the desired data, open the Clean Top K dialog to take an action on the data. The actions are Auto-fix (apply our suggested label), Exclude (remove from dataset), Label (apply one label to the selected points), and Needs Review (add a tag to come back to some data).

In this clip, the top 100 examples with a suggested label of Refund_not_showing_up that were initially unlabeled are batch labeled.
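The idea behind a top-K batch action can be sketched outside the web interface with pandas. This is an illustration of the logic only, not how Cleanlab Studio implements Clean Top K. The column names `given`, `suggested`, and `confidence`, the helper `auto_fix_top_k`, and the example rows are all hypothetical.

```python
import pandas as pd

# Hypothetical project rows. "given" is missing for unlabeled data;
# "suggested" and "confidence" stand in for the Suggested column and its score.
rows = pd.DataFrame({
    "text": [
        "My refund has not appeared yet",
        "Something about money coming back?",
        "I am still waiting on my card?",
        "Where did my refund go",
        "when i travel can it top up automatically",
    ],
    "given": [None, None, "card_arrival", None, None],
    "suggested": [
        "Refund_not_showing_up",
        "Refund_not_showing_up",
        "card_arrival",
        "Refund_not_showing_up",
        "automatic_top_up",
    ],
    "confidence": [0.97, 0.55, 0.90, 0.88, 0.93],
})


def auto_fix_top_k(frame, suggested_label, k):
    """Return a copy in which the k most confident unlabeled rows with
    the given suggestion receive that suggested label."""
    candidates = frame[frame["given"].isna() & (frame["suggested"] == suggested_label)]
    top_k = candidates.nlargest(k, "confidence").index
    result = frame.copy()
    result.loc[top_k, "given"] = result.loc[top_k, "suggested"]
    return result


labeled = auto_fix_top_k(rows, "Refund_not_showing_up", k=2)
print(labeled[["given", "suggested", "confidence"]])
```

With k=2, only the two most confident refund suggestions are accepted; the lower-confidence row stays unlabeled for hand review.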

Improving Labeling Efficiency

Searching for unlikely labels

If you want to discover a label that is not in the top 10 most likely labels, you can also search for labels in the search bar and select the label from there, rather than scrolling.

Filtering and multi-select to avoid context switching

In addition to easily auto-fixing one issue category, you can quickly filter for specific suggested labels. This speeds up the process significantly because it reduces context switching. For example, you can review all of the data points that Cleanlab Studio thinks should be labeled cash_withdrawal_not_recognized at once.

You can also filter for specific corrections (e.g., given label is top_up_failed and suggested label is top_up_pending) by selecting a filter value for both the given and suggested labels.

filter_bar.png
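The given/suggested filter described above amounts to a boolean mask over two columns. Here is a minimal pandas sketch of that logic. The column names `given` and `suggested`, the helper `filter_correction`, and the example rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with a given label and a suggested label.
rows = pd.DataFrame({
    "text": [
        "My top up is still processing",
        "The top up was rejected",
        "Top up has been pending for hours",
        "Top up shows as waiting",
    ],
    "given": ["top_up_failed", "top_up_failed", "top_up_pending", "top_up_failed"],
    "suggested": ["top_up_pending", "top_up_failed", "top_up_pending", "top_up_pending"],
})


def filter_correction(frame, given_label, suggested_label):
    """Keep only rows with this exact (given, suggested) label pair."""
    mask = (frame["given"] == given_label) & (frame["suggested"] == suggested_label)
    return frame[mask]


matches = filter_correction(rows, "top_up_failed", "top_up_pending")
print(matches)
```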

Use click-and-drag/multi-select to label lots of examples at once without using batch actions. This is helpful to quickly review and correct filtered examples where you want to apply the same action to every data point.

multiselect.png

Tip: Check out the Analytics page to see which label errors are most common in your dataset. You can find these in the Most Frequently Suggested Label Corrections section. If you click on one of the bars, that’ll apply filters to show all of the corresponding data points, and then you can use multi-select to fix them all at once.

analytics_page.png
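A ranking like Most Frequently Suggested Label Corrections can be approximated by counting (given, suggested) pairs where the two labels differ. The sketch below shows one way to do that with pandas. It is not the Analytics page's actual computation, and the column names and rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the last one is unlabeled and therefore not a correction.
rows = pd.DataFrame({
    "given": [
        "top_up_failed", "top_up_failed", "card_arrival",
        "top_up_failed", "card_arrival", None,
    ],
    "suggested": [
        "top_up_pending", "top_up_pending", "card_arrival",
        "top_up_failed", "card_delivery_estimate", "card_arrival",
    ],
})

# A correction is a labeled row whose suggestion differs from its given label.
corrections = rows[rows["given"].notna() & (rows["given"] != rows["suggested"])]

# Count each (given, suggested) pair, most frequent first.
counts = (
    corrections.groupby(["given", "suggested"])
    .size()
    .sort_values(ascending=False)
)
print(counts)
```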

Using the Needs Review tag

If there are examples which need further review, you can use the Needs Review tag to mark them, and go back and review them later by filtering for the tag.

needs_review_callout.png

Fixing other data issues

As you can see in our other tutorials, Cleanlab Studio is a great tool for resolving all kinds of data issues, not just annotating unlabeled data. For example, you can exclude all outliers from the dataset with two clicks:

As always, after resolving a data issue, you can quickly check that Cleanlab’s suggestions are correct by opening the resolver panel and looking over the actions taken, and updating them if you disagree.

Just like that, our outliers have been excluded, and we only had to verify the changes rather than go through the whole dataset ourselves.

Exporting your Labeled Dataset

Now that you’ve labeled the dataset and resolved some other issues found, it’s time to export your cleanset! You can do that using the Export Cleanset button at the bottom of the page. If you prefer, you can also use the Python API to export the results.

export

Python library export instructions

Retrieve the Cleanset ID from the Export with API option on the Export Cleanset dialog, and then use the download_cleanlab_columns method:

from cleanlab_studio import Studio
import pandas as pd

# Authenticate with your Cleanlab Studio API key
studio = Studio('YOUR_API_KEY')

# Download the Cleanlab columns for your cleanset
df = studio.download_cleanlab_columns('YOUR_CLEANSET_ID')
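Once the columns are downloaded, a typical next step is to combine them with the original data to get a final labeled dataset. The sketch below uses stand-in DataFrames so it runs without an API key. It assumes the downloaded rows line up with the original rows by index, and the column names `corrected_label` and `action`, as well as the `"exclude"` value, are assumptions made for illustration; check the columns in your own export.

```python
import pandas as pd

# Stand-in for the original dataset (the unlabeled rows are made up).
original = pd.DataFrame({
    "text": [
        "I am still waiting on my card?",
        "Where did my refund go",
        "when i travel can it top up automatically",
        "asdf qwerty",
    ],
    "label": ["card_arrival", None, None, "card_arrival"],
})

# Stand-in for the downloaded columns. Column names and values are assumed.
cleanlab_cols = pd.DataFrame({
    "corrected_label": [None, "Refund_not_showing_up", "automatic_top_up", None],
    "action": [None, "auto-fix", "auto-fix", "exclude"],
})

# Align the two frames by row index.
merged = original.join(cleanlab_cols)

# Drop rows that were excluded during review.
merged = merged[merged["action"] != "exclude"].copy()

# Prefer the corrected label; fall back to the original label otherwise.
merged["final_label"] = merged["corrected_label"].fillna(merged["label"])

final = merged[["text", "final_label"]].reset_index(drop=True)
print(final)
```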

Bonus: Making the most of your data with the Cleanlab Studio Inference API

You’ve exported the dataset — now what? In addition to using your cleaned and labeled data for all the tasks you were previously tackling with the smaller/noisier original dataset, Cleanlab Studio also has the capability to use your data immediately for inference! This is great if you have even more unlabeled data to handle (say, some that came in after you generated your initial dataset). Click Deploy Model on the bottom of the page, and Cleanlab Studio will automatically train a model on your cleaned data and deploy it for use in production. You can use this model to generate predictions for any new data, as described in our inference API tutorial.