Getting Started with Cleanlab Studio
Welcome to Cleanlab Studio! This guide walks through an end-to-end Cleanlab Studio workflow:
- Upload a dataset
- Create a project (our AI analyzes your data)
- Review detected data issues
- Export a cleanset (the cleaned dataset)
Cleanlab Studio offers a fast path from unreliable, messy data to reliable models and analytics by automatically detecting and correcting data issues. You can improve the quality of your dataset entirely in the web browser, with no code required, as the rest of this guide shows.
While the rest of this guide steps through how to use Cleanlab Studio via the no-code web interface, you can alternatively use this tool programmatically via the Python API or Command Line Interface (CLI). Both unlock even greater Cleanlab Studio capabilities than are available in the web interface (but we recommend exploring the simple web interface first).
The generally available public version of Cleanlab Studio supports many other tasks/workflows; refer to our tutorials for specific examples. Many additional capabilities are available to users on an Enterprise subscription plan. If you are interested in building AI Assistants connected to your company’s data sources and other Retrieval-Augmented Generation applications, reach out to learn how Cleanlab can help.
Demo Datasets
Cleanlab Studio comes pre-loaded with several demo datasets and projects, so you can check those out in your account after signing in. Alternatively, if you’d like to go through the entire workflow using a demo dataset, here are some you can use to get started:
- Text: amazon-text-demo.csv
- Tabular/CSV: grades-tabular-demo.csv
- Image: mnist.tar.gz (please extract this archive before uploading it to Cleanlab Studio)
- This tutorial’s dataset: tweets.csv
Upload a Dataset
Cleanlab Studio offers a variety of ways to upload your data to best suit your needs. You can upload data from your computer, via URL, via command line, or via our Python API. We also offer Data Warehouse and Cloud Storage options for Enterprise Users!
How to format your dataset
Cleanlab Studio supports text, image, and tabular datasets (we’re constantly adding new modalities) in multiple formats. The free tier supports CSV, JSON, Excel, DataFrames, and ZIP files. If you’re uploading your own dataset, the “How to format your dataset” wizard on the upload page walks you through the best way to upload your specific type of data.
Once you upload a dataset…
Cleanlab Studio will automatically infer its schema (the data types and feature types of all the fields). You can review this schema and make any corrections before clicking “Confirm schema.”
Cleanlab Studio will then immediately analyze your dataset and display any processing issues or missing data cells it finds.
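If you would like a rough preview of what to expect before uploading, a small local check can help. The sketch below is not Cleanlab Studio's schema inference; it is a minimal illustration, using only the Python standard library, that guesses a coarse type per column and counts empty cells. The function names and the sample data are hypothetical.

```python
import csv
import io


def _guess(value):
    """Guess a coarse type for one non-empty cell."""
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            pass
    return "text"


def _merge(current, new):
    """Combine the type seen so far with the type of a new cell."""
    if current is None or current == new:
        return new
    if {current, new} == {"integer", "float"}:
        return "float"
    return "text"


def summarize_columns(csv_text):
    """Return {column: {"type": ..., "missing": ...}} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    summary = {name: {"type": None, "missing": 0} for name in (reader.fieldnames or [])}
    for row in reader:
        for name in summary:
            value = (row[name] or "").strip()
            if value == "":
                summary[name]["missing"] += 1  # empty cell
            else:
                summary[name]["type"] = _merge(summary[name]["type"], _guess(value))
    return summary


# Hypothetical sample data, for illustration only.
sample = "id,text,score\n1,great product,4.5\n2,,3\n3,terrible,\n"
summary = summarize_columns(sample)
for column, info in summary.items():
    print(column, info)
```

The schema that Cleanlab Studio infers may differ from this guess, so treat the output only as a prompt for what to double-check on the schema confirmation screen.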
Create a Project to detect data and label issues
To analyze a dataset for data issues, you first need to create a project. There are a few options to configure when creating a project:
- Machine Learning Task: What type of task are you training a model to accomplish? Currently, we support Classification for text, tabular and image datasets.
- Type of Classification: Classification tasks are either multi-class (each datapoint is assigned to exactly 1 of K classes) or multi-label (each datapoint can belong to anywhere from 0 to K of the K classes).
- Label Column: The column in your dataset that you want us to find label errors in and suggest label corrections on. Some examples in the dataset may be unlabeled.
- Predictive Columns / Text Column: Depending on your machine learning task, you’ll be asked to specify which columns we should use to train our classification models.
- Model Type: Fast mode runs quicker analyses of your data but may produce lower quality results, while regular mode will produce the best results but could take up to 24 hours for larger datasets.
It takes some time for our cutting-edge AI system to train on the dataset, so we’ll send you an email when the analysis is complete.
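To make the multi-class versus multi-label distinction above concrete, here is a minimal sketch in Python. It represents each datapoint's labels as a list, which is purely an illustrative choice and not the format Cleanlab Studio expects in uploaded files; the function names and class names are hypothetical.

```python
def fits_multi_class(row_labels, classes):
    """True if the datapoint is assigned exactly one of the K known classes."""
    return len(row_labels) == 1 and row_labels[0] in classes


def fits_multi_label(row_labels, classes):
    """True if the datapoint belongs to between 0 and K distinct known classes."""
    distinct = set(row_labels)
    return len(distinct) == len(row_labels) and distinct <= set(classes)


# Hypothetical classes, for illustration only (K = 3).
classes = ["sports", "politics", "tech"]

one_label = ["tech"]
two_labels = ["tech", "sports"]
no_labels = []

print(fits_multi_class(one_label, classes), fits_multi_label(one_label, classes))
print(fits_multi_class(two_labels, classes), fits_multi_label(two_labels, classes))
print(fits_multi_class(no_labels, classes), fits_multi_label(no_labels, classes))
```

A datapoint with two labels or with none only fits the multi-label setting, so if your data contains such rows, multi-label is likely the right choice for Type of Classification.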
Review issues detected in your dataset and correct them
Cleanlab Studio automatically flags examples that have data issues. For each identified label issue, Cleanlab Studio suggests a label that may be more appropriate for that example. You can accept all of the suggestions with the “Auto-fix top issues” button, but for best results, we recommend reviewing the flagged issues. Cleanlab Studio’s data/label error correction interface makes this easy: data is ranked by quality, so your time is spent on the data that needs review (no need to review already-clean data). You can also choose to exclude data points, which is our recommended action for outliers (data points that do not appear to belong to any of the classes in your dataset).
After reviewing flagged issues and taking actions to fix the data, you have produced a cleaned version of your dataset, which we call a cleanset.
Project Analytics
Along with the review interface, Cleanlab Studio offers analytics information about the data and label issues in your dataset. From the Analytics tab, you can view information about the classes with the most label issues, the data corrections that we most commonly suggest, and more! This tab can give you a high level summary of your data, or offer a direct view into the specific issues that we’ve detected — click on any bar or square in the Analytics chart and we’ll show you the exact examples exhibiting the specific issue summarized in the chart.
Export a cleanset
Cleanlab Studio supports exporting a cleanset from the web app or the cleanlab-studio CLI / Python client library. The cleaned labels are available in the cleanlab_corrected_label column. The export also includes other metadata from Cleanlab Studio; look through the column headers to see what’s there. Data columns generated by Cleanlab Studio are prefixed with cleanlab_.
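As one way to work with an exported CSV, the sketch below uses only the Python standard library to list the cleanlab_-prefixed columns and to substitute values from cleanlab_corrected_label into your own label column. It rests on assumptions not confirmed in this guide: that an empty cleanlab_corrected_label cell means no correction was made, and that your label column is named "label". The function name, the sample rows, and the cleanlab_example_metadata column are hypothetical placeholders.

```python
import csv
import io


def use_corrected_labels(export_csv_text, label_column):
    """Return (rows with corrected labels and no cleanlab_ columns, cleanlab_ column names)."""
    reader = csv.DictReader(io.StringIO(export_csv_text))
    # Columns generated by Cleanlab Studio carry the cleanlab_ prefix.
    metadata_columns = [c for c in (reader.fieldnames or []) if c.startswith("cleanlab_")]
    cleaned_rows = []
    for row in reader:
        corrected = (row.get("cleanlab_corrected_label") or "").strip()
        # Keep only your original columns.
        cleaned = {k: v for k, v in row.items() if k not in metadata_columns}
        # Assumption: an empty corrected label means "keep the original label".
        if corrected:
            cleaned[label_column] = corrected
        cleaned_rows.append(cleaned)
    return cleaned_rows, metadata_columns


# Hypothetical export, for illustration only.
export = (
    "text,label,cleanlab_corrected_label,cleanlab_example_metadata\n"
    "love it,negative,positive,x\n"
    "hate it,negative,,y\n"
)
rows, metadata_columns = use_corrected_labels(export, label_column="label")
print(metadata_columns)
for r in rows:
    print(r)
```

Check the column headers of your own export before relying on this, since the metadata it contains may differ from the placeholder used here.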
You can also re-run Cleanlab Studio on the cleanset. This re-analyzes the new version of your dataset with ML models trained on the cleaner data, which often gives even better results.
Next steps
Once you are happy with the quality of your cleanset (e.g. you’ve skimmed all the detected data issues), you can use it in place of your original dataset in your downstream workflows. Immediately get better predictions/conclusions without changing any of your modeling code or analytics workflows. Better data -> better results!
If your goal is to deploy a machine learning model to serve predictions on new data, Cleanlab Studio can do this for you in just a few clicks. The model will be trained on the final cleanset you produced, and will be the same cutting-edge combination of AutoML and Foundation models that detected the data issues originally.
This guide covered a no-code workflow to improve your data via your web browser. The other pages on this documentation website cover our Python API, which lets you use Cleanlab Studio programmatically and unlock more capabilities.
Remember: Cleanlab Studio works for text, image, and tabular datasets! Some Cleanlab functionality may be demonstrated here for, say, an image dataset, but you can apply the same functionality to data from the other two modalities.
Need help?
Our team of experts is here to support your Data/AI projects! See various ways to get help on our Community page.