This tutorial describes how to use Cleanlab Studio for data labeling, a task where you have a classification dataset with some subset of the data labeled, and you wish to efficiently label the rest of the data points. Cleanlab Studio can expedite this process by suggesting a label for your unlabeled data and supporting a human-in-the-loop workflow where you can review and accept proposed labels.
During this process, Cleanlab Studio can also help identify data issues in the already-labeled subset of the data.
Download the Dataset
This tutorial uses a version of the Banking77 dataset which you can download here: banking_tutorial.csv.
This intent recognition dataset is composed of customer service requests, classified by their intents into one of 77 categories. 90% of the examples in our dataset are not yet labeled. Here are some of the labeled examples from this dataset:
- I am still waiting on my card?
- What can I do if my card still hasn’t arrived after 2 weeks?
- I noticed an extra pound charge?
- when i travel can it top up automatically
- Help. My card is broken.
We use a text dataset as an example here, but the steps from this tutorial can also be applied to image or structured/tabular datasets.
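Before uploading, it can help to confirm how much of the file is actually unlabeled. The sketch below is one way to do that with the standard library. It assumes the CSV has a "label" column and that unlabeled rows simply have an empty label cell; the sample rows are made up for illustration and are not taken from banking_tutorial.csv.

```python
import csv
import io
from collections import Counter


def summarize_labels(csv_file, label_column="label"):
    """Count labeled vs. unlabeled rows and the number of distinct classes.

    Assumption: an unlabeled row has an empty (or whitespace-only) label cell.
    """
    class_counts = Counter()
    unlabeled = 0
    total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        label = (row.get(label_column) or "").strip()
        if label:
            class_counts[label] += 1
        else:
            unlabeled += 1
    return {
        "total": total,
        "unlabeled": unlabeled,
        "unlabeled_fraction": unlabeled / total if total else 0.0,
        "num_classes": len(class_counts),
    }


# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    "text,label\n"
    "example request A,card_arrival\n"
    "example request B,\n"
    "example request C,automatic_top_up\n"
    "example request D,\n"
)
summary = summarize_labels(sample)
print(summary)

# With the downloaded file, the call would look like this:
# with open("banking_tutorial.csv", newline="", encoding="utf-8") as f:
#     print(summarize_labels(f))
```

If the unlabeled fraction reported for the real file is far from the 90% mentioned above, the empty-cell assumption probably does not match how the file marks unlabeled rows.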
Uploading a Dataset to Cleanlab Studio
First, upload the dataset to Cleanlab Studio using the upload wizard. From your Dashboard, click Upload Dataset and follow the wizard’s steps for Upload from your computer:
- Select or drag-and-drop the provided
- Cleanlab Studio automatically infers the dataset’s schema; leave this as-is, and click
For many Cleanlab users, Upload from your computer will be the most common way of uploading datasets, but datasets can also be uploaded via URL, from the command line, or from the Python library. URL uploads are useful for publicly hosted datasets, such as the benchmark datasets you might use to experiment with Cleanlab Studio. The command line and Python library options are often used when building Cleanlab Studio into a data processing and QA pipeline.
Python API instructions
To upload your data to Cleanlab Studio using the Python library, you can use the following code:
from cleanlab_studio import Studio
import pandas as pd
studio = Studio('YOUR_API_KEY')
df = pd.read_csv('banking_tutorial.csv')
studio.upload_dataset(dataset=df, dataset_name="Data Labeling Tutorial")
You should see this view once the upload is confirmed.
Creating a Project
Click Create Project from the dataset page or the dashboard. On the project creation page, there are several settings to choose.
- Machine Learning Task: select Text Classification (the default for a text dataset) — we’re working with a text classification task in this tutorial (classifying a single string of text into a class like “card_arrival” or “automatic_top_up”).
- Type of Classification: select Multi-Class (the default) — the Banking77 dataset is for a single-label classification task, where each utterance belongs to exactly one class.
- Text column: select “text” (which is auto-detected) — this is the name of the column from your dataset that contains the text, on which predictions are based.
- Label column: select “label” (which is auto-detected) — this is the name of the column from your dataset that contains the labels.
Keep the default “Use a provided model” setting, and choose “Fast” mode for the purpose of the tutorial. Fast models train more quickly (15-90 minutes), and Regular models take longer (up to 24 hours) but produce better results.
Click Clean my data! to start the data analysis process.
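If you are scripting this step rather than using the web interface, the same settings can be passed through the Python library. The sketch below wraps that call in a small helper. It assumes that `Studio.create_project` accepts the keyword names shown and returns a project ID, so check them against the cleanlab_studio API reference for your installed version. The helper name `create_labeling_project` is ours, not part of the library.

```python
def create_labeling_project(studio, dataset_id, project_name="Data Labeling Tutorial"):
    """Create a Fast-mode, multi-class text project and return its project ID.

    `studio` is expected to be a cleanlab_studio.Studio instance. The keyword
    names passed to create_project are an assumption about the library's API.
    """
    return studio.create_project(
        dataset_id=dataset_id,
        project_name=project_name,
        modality="text",            # Machine Learning Task: Text Classification
        task_type="multi-class",    # Type of Classification: Multi-Class
        model_type="fast",          # "Fast" mode, as chosen in this tutorial
        label_column="label",       # Label column
        text_column="text",         # Text column
    )


# Usage, assuming upload_dataset returns the new dataset's ID:
# from cleanlab_studio import Studio
# studio = Studio("YOUR_API_KEY")
# dataset_id = studio.upload_dataset(dataset=df, dataset_name="Data Labeling Tutorial")
# project_id = create_labeling_project(studio, dataset_id)
```

Keeping the settings in one helper makes it easy to switch `model_type` later if you want the slower Regular mode.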
When you launch a project, Cleanlab Studio trains cutting-edge ML models to analyze your data, which can take some time. For the tutorial dataset, it should take about 15 minutes. You will get an email with a link to the results once the analysis is complete. By clicking on that link, or on Ready for review on the dashboard, you will see the project view:
Labeling Data and Resolving Issues
The interface supports two workflows for labeling your data: batch labeling and individual row (hand) labeling. This tutorial demonstrates both. Batch labeling can be very powerful when combined with Cleanlab Studio’s filters interface, or when the data are fairly “easy” to label. Batch labeling can also speed up the hand-labeling process by providing a good starting point that can be rapidly iterated on. Hand labeling is a good option for those last few examples that Cleanlab Studio indicates it is less confident about.
While Cleanlab Studio can be used to quickly re-label data auto-identified as being originally mislabeled, let’s focus for a moment on labeling originally unlabeled data only. Learn how to format unlabeled data here. You can filter to view solely the unlabeled data in a dataset by selecting Unlabeled in the Filter bar. Unlabeled data has a “-” in the Given column (to indicate no label has been given yet). For each unlabeled data point, Cleanlab Studio automatically suggests an appropriate label in the Suggested column, along with a confidence score for this suggestion.
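To make the division of labor between the two workflows concrete, here is a minimal sketch of how suggestions could be triaged by confidence. Everything in it is illustrative: the field names ("given", "suggested", "confidence"), the 0.9 threshold, and the sample rows are assumptions, not Cleanlab Studio’s export format or internal logic.

```python
def triage(rows, threshold=0.9):
    """Split unlabeled rows into a batch-accept queue and a hand-review queue.

    Rows with a given label are left alone. The threshold is a hypothetical
    cutoff that you would tune for your own data.
    """
    unlabeled = [row for row in rows if row["given"] is None]
    auto_accept = [row for row in unlabeled if row["confidence"] >= threshold]
    hand_review = [row for row in unlabeled if row["confidence"] < threshold]
    # Look at the least confident suggestions first.
    hand_review.sort(key=lambda row: row["confidence"])
    return auto_accept, hand_review


rows = [
    {"id": 1, "given": None, "suggested": "card_arrival", "confidence": 0.98},
    {"id": 2, "given": None, "suggested": "automatic_top_up", "confidence": 0.55},
    {"id": 3, "given": "card_arrival", "suggested": "card_arrival", "confidence": 0.99},
    {"id": 4, "given": None, "suggested": "card_arrival", "confidence": 0.71},
]
auto_accept, hand_review = triage(rows)
print([row["id"] for row in auto_accept])   # candidates for batch labeling
print([row["id"] for row in hand_review])   # candidates for row-by-row review
```

In the web interface, the same split is achieved with filters and sorting rather than code.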
Labeling individual examples
If you’d prefer to review data points one at a time, aided by Cleanlab Studio’s suggestions, start by clicking on a row of data and using the resolver panel to select labels. Even though you’re going through the data points one by one, you still benefit from Cleanlab Studio’s insights about likely labels: the panel shows a percentage likelihood for each label and surfaces the most likely labels at the top while you tag.
Basic row-by-row labeling workflow
The clip below demonstrates using the issue resolver panel to label some examples that Cleanlab Studio is less confident about. To label, select an example by clicking on it and use shortcuts to take actions like label, exclude, tag, etc. Or if you prefer, use the mouse and click on the buttons!
- W to auto-fix (apply Cleanlab Studio’s recommended action, such as changing the label)
- N to add the “Needs Review” tag (so you can come back to the data point later)
- Q to keep the given label
- E to exclude the data point from the dataset
- 9 to label, for the top 9 most likely labels
Batch labeling allows you to resolve label errors and purge bad data rapidly. As shown below, it can function either as a standalone action or as a baseline for human review. When used on a dataset with many unlabeled data points, batch actions/the Clean Top K feature can improve the speed of labeling by allowing you to clear the low-hanging fruit and focus on the most nuanced examples.
To batch label, first filter the data using the quick-filter buttons and the filter bar to pull up the examples that you’d like to label.
Once you have retrieved the desired data, open the Clean Top K dialog to take an action on the data. The actions are Auto-fix (apply our suggested label), Exclude (remove from dataset), Label (apply one label to the selected points), and Needs Review (add a tag to come back to some data).
In this clip, the top 100 initially unlabeled examples with a suggested label of Refund_not_showing_up are batch labeled.
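Conceptually, this batch action is a filter followed by a top-K selection. The sketch below mimics that on a handful of made-up rows. The field names and the ranking by confidence are assumptions for illustration; this is not how Cleanlab Studio implements Clean Top K.

```python
def select_top_k(rows, suggested_label, k=100):
    """Return up to k initially unlabeled rows with the given suggested label,
    ordered from most to least confident."""
    candidates = [
        row for row in rows
        if row["given"] is None and row["suggested"] == suggested_label
    ]
    candidates.sort(key=lambda row: row["confidence"], reverse=True)
    return candidates[:k]


def accept_suggestions(batch):
    """Return copies of the rows with the suggested label applied as the label."""
    return [{**row, "given": row["suggested"]} for row in batch]


rows = [
    {"id": 1, "given": None, "suggested": "Refund_not_showing_up", "confidence": 0.97},
    {"id": 2, "given": None, "suggested": "Refund_not_showing_up", "confidence": 0.62},
    {"id": 3, "given": "card_arrival", "suggested": "Refund_not_showing_up", "confidence": 0.99},
    {"id": 4, "given": None, "suggested": "card_arrival", "confidence": 0.91},
    {"id": 5, "given": None, "suggested": "Refund_not_showing_up", "confidence": 0.88},
]
batch = select_top_k(rows, "Refund_not_showing_up", k=2)
labeled = accept_suggestions(batch)
print([row["id"] for row in labeled])
```

Row 3 is skipped because it already has a given label, and row 2 falls outside the top 2, so it would remain for hand review.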
Improving Labeling Efficiency
Searching for unlikely labels
If you want to discover a label that is not in the top 10 most likely labels, you can also search for labels in the search bar and select the label from there, rather than scrolling.
Filtering and multi-select to avoid context switching
In addition to easily auto-fixing one issue category, you can quickly filter for specific suggested labels. This speeds up the process significantly by reducing context switching. For example, you can review all of the data points that Cleanlab Studio thinks should be labeled cash_withdrawal_not_recognized at once.
You can also filter for specific corrections (e.g., given label is top_up_failed and suggested label is top_up_pending) by selecting a filter value for both the given and suggested labels.
Use click-and-drag/multi-select to label lots of examples at once without using batch actions. This is helpful to quickly review and correct filtered examples where you want to apply the same action to every data point.
Tip: Check out the Analytics page to see which label errors are most common in your dataset. You can find these in the Most Frequently Suggested Label Corrections section. If you click on one of the bars, that’ll apply filters to show all of the corresponding data points, and then you can use multi-select to fix them all at once.
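The idea behind that chart can be sketched as a count of (given label, suggested label) pairs where the two disagree. The rows and field names below are made up for illustration, and the helper is not part of Cleanlab Studio.

```python
from collections import Counter


def most_common_corrections(rows, top_n=3):
    """Count (given, suggested) pairs where a given label exists and differs
    from the suggestion, and return the most frequent ones."""
    pairs = Counter(
        (row["given"], row["suggested"])
        for row in rows
        if row["given"] is not None and row["given"] != row["suggested"]
    )
    return pairs.most_common(top_n)


rows = [
    {"given": "top_up_failed", "suggested": "top_up_pending"},
    {"given": "top_up_failed", "suggested": "top_up_pending"},
    {"given": "card_arrival", "suggested": "card_arrival"},   # no correction
    {"given": None, "suggested": "card_arrival"},             # unlabeled, skipped
    {"given": "card_arrival", "suggested": "card_delivery_estimate"},
]
corrections = most_common_corrections(rows)
for (given, suggested), count in corrections:
    print(f"{given} -> {suggested}: {count}")
```

Each pair corresponds to one given/suggested filter combination, which is why fixing the most frequent pair first tends to clear the largest group of similar rows.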
Using the Needs Review tag
If there are examples that need further review, you can use the Needs Review tag to mark them, and go back and review them later by filtering for the tag.
Fixing other data issues
As you can see in our other tutorials, Cleanlab Studio is a great tool for resolving all kinds of data issues, not just annotating unlabeled data. For example, you can exclude all outliers from the dataset with two clicks:
As always, after resolving a data issue, you can quickly check that Cleanlab’s suggestions are correct by opening the resolver panel and looking over the actions taken, and updating them if you disagree.
Just like that - our outliers have been excluded, with us only having to verify the changes rather than going through the whole dataset ourselves.
Exporting your Labeled Dataset
Now that you’ve labeled the dataset and resolved some of the other issues found, it’s time to export your cleanset! You can do that using the Export Cleanset button at the bottom of the page. If you prefer, you can also use the Python API to export the results.
Python library export instructions
Retrieve the Cleanset ID from the Export with API option on the Export Cleanset dialog, and then use the following code:
from cleanlab_studio import Studio
import pandas as pd
studio = Studio('YOUR_API_KEY')
df = studio.download_cleanlab_columns('YOUR_CLEANSET_ID')
Bonus: Making the most of your data with the Cleanlab Studio Inference API
You’ve exported the dataset. Now what? In addition to using your cleaned and labeled data for all the tasks you were previously tackling with the smaller, noisier original dataset, you can use it immediately for inference in Cleanlab Studio. This is great if you have even more unlabeled data to handle (say, some that came in after you generated your initial dataset). Click Deploy Model at the bottom of the page, and Cleanlab Studio will automatically train a model on your cleaned data and deploy it for use in production. You can use this model to generate predictions for any new data, as described in our inference API tutorial.