A Project corresponds to an automated Cleanlab analysis of your Dataset, as well as the corrections made to improve the Dataset.
Once your dataset is added successfully, it will appear in the “Datasets” section. The next step toward improving your dataset is to create a Project by selecting Create Project under the Action column.
As soon as you create a Project, Cleanlab Studio automatically trains models to analyze your data. This may take some time (you will get an email when the results are ready). Once the Status shows “Ready for review”, you can click the Project to review the issues in your dataset and start correcting them.
Creating a project to improve your data requires just a few selections that are explained in detail below.
Machine Learning Task / Dataset Type
This selection corresponds to the modality of your data. Cleanlab Studio supports the following data modalities:
- Text — create a project to analyze a single column of text data (e.g. customer service requests)
- Tabular — create a project to analyze multiple columns of data (e.g. financial reports)
- Image — create a project to analyze image data (e.g. e-commerce products)
The generally available version of Cleanlab Studio focuses on classification datasets where each data point is categorized amongst a discrete set of classes. Other types of datasets and machine learning tasks are supported in the Enterprise plan.
Type of Classification
Cleanlab Studio supports the following ML tasks:
- Multi-class classification (multi-class): A single example belongs to exactly one of the classes; the classes are mutually exclusive.
- Multi-label classification (multi-label): A single example can belong to one or more classes simultaneously, or to none of the classes at all; the classes are not mutually exclusive (each class either applies to the example or not).
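To make the distinction concrete, the sketch below shows how labels for the two task types might be represented and checked in plain Python. The class names and helper functions are hypothetical illustrations, not part of Cleanlab Studio or its expected upload format.

```python
# Hypothetical class set for a customer-service text dataset (illustration only).
CLASSES = {"billing", "shipping", "returns"}


def is_valid_multi_class(label: str, classes: set) -> bool:
    """Multi-class: the example carries exactly one label from the class set."""
    return label in classes


def is_valid_multi_label(labels: list, classes: set) -> bool:
    """Multi-label: the example carries zero or more labels, all from the class set."""
    return set(labels) <= classes


# Multi-class: one label per example.
multi_class_label = "billing"

# Multi-label: any subset of the classes, including the empty set.
multi_label_examples = [["billing", "returns"], ["shipping"], []]
```

An empty list is a valid multi-label annotation (none of the classes apply), whereas a multi-class example must always have one label.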
Cleanlab trains many different ML models on your data and automatically identifies the best one to detect issues in your dataset. This process can take some time to deliver the best results. You have a choice between two settings:
- Fast — trains faster but less accurate ML models
- Regular — trains more accurate ML models and runs more data analyses
We recommend using Regular mode, except when you have a tight deadline to meet.
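The selections described above can be summarized as a small settings record. The following is a minimal sketch, assuming you want to note your choices (for example in a script or checklist) before filling in the web form; `ProjectSettings` and its field names are hypothetical and are not the Cleanlab Studio API.

```python
from dataclasses import dataclass

# Allowed values, taken from the options listed in this guide.
MODALITIES = {"text", "tabular", "image"}
TASK_TYPES = {"multi-class", "multi-label"}
MODEL_MODES = {"fast", "regular"}


@dataclass(frozen=True)
class ProjectSettings:
    """Hypothetical record of the choices made when creating a Project."""

    modality: str
    task_type: str
    model_mode: str = "regular"  # Regular is the recommended default

    def __post_init__(self):
        # Reject any value that is not one of the documented options.
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"unknown task type: {self.task_type!r}")
        if self.model_mode not in MODEL_MODES:
            raise ValueError(f"unknown model mode: {self.model_mode!r}")


settings = ProjectSettings(modality="text", task_type="multi-class")
```

Defaulting `model_mode` to `"regular"` mirrors the recommendation above; pass `model_mode="fast"` only when turnaround time matters more than accuracy.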
After making your selections, click Clean my data! to kick off project training.
You will get an email when training completes.
At that point, the auto-detected data issues are presented in an intuitive data correction interface for you to quickly clean up your dataset.