A cleanset is an improved version of your original dataset. You create one by fixing the data issues detected in your dataset, for example by correcting erroneous labels or excluding examples that are outliers or duplicates. Use the Resolver to correct individual data points, or the Clean Top K button (at the bottom of a Project) to auto-fix many data points at once. The resulting dataset has the same format as your original dataset and can serve as a drop-in replacement for more reliable downstream modeling and analytics. After making your data improvements, obtain the cleanset by exporting it from the web interface or the Python API.
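Conceptually, a cleanset is the original dataset with your corrections applied. The sketch below illustrates that idea in plain Python. It does not call the Cleanlab Studio API, and the names `build_cleanset`, `corrected_labels`, and `excluded_ids` are hypothetical, chosen only for this illustration.

```python
def build_cleanset(rows, corrected_labels, excluded_ids, label_key="label", id_key="id"):
    """Return a copy of `rows` with label fixes applied and excluded rows dropped.

    Hypothetical helper for illustration only; not part of Cleanlab Studio.
    """
    cleanset = []
    for row in rows:
        row_id = row[id_key]
        if row_id in excluded_ids:
            continue  # e.g. an outlier or duplicate the user chose to exclude
        fixed = dict(row)  # copy so the original dataset is left untouched
        if row_id in corrected_labels:
            fixed[label_key] = corrected_labels[row_id]
        cleanset.append(fixed)
    return cleanset


original = [
    {"id": 1, "text": "great product", "label": "negative"},     # mislabeled
    {"id": 2, "text": "terrible service", "label": "negative"},
    {"id": 3, "text": "terrible service", "label": "negative"},  # duplicate of id 2
]
cleanset = build_cleanset(original, corrected_labels={1: "positive"}, excluded_ids={3})
```

Every row in `cleanset` keeps the same keys as the original rows, which is what lets it stand in for the original dataset downstream.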
Cleanlab Studio learns to iteratively improve results as you make dataset changes. Once you’ve made significant dataset changes, click Improve Results to obtain further improved data quality analysis and suggested actions. When you click Improve Results, Cleanlab Studio re-runs the Project using the current corrected dataset as the starting point. Because the ML model in this re-run is auto-trained on cleaner data, the resulting Project detects data issues and suggests labels more accurately.
A recommended workflow is:
1. Run Cleanlab Studio on your original dataset.
2. Correct up to half of the detected issues.
3. Click Improve Results to re-run the analysis on the corrected dataset.
4. Repeat steps 2 and 3 until you reach the desired data quality or ML model deployment accuracy.
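The loop structure of this workflow can be sketched in plain Python. This is a toy illustration under stated assumptions, not how Cleanlab Studio works internally: `iterate_cleaning`, `toy_detect`, and `toy_correct` are hypothetical names, and the detection step is a stand-in that simply compares labels against a known-good reference instead of training an ML model.

```python
def iterate_cleaning(labels, detect_issues, correct, max_rounds=10):
    """Detect issues, correct up to half of them, re-run, and repeat."""
    issues = detect_issues(labels)  # initial run on the original dataset
    rounds = 0
    while issues and rounds < max_rounds:
        # Correct up to half of the detected issues (at least one, so the loop progresses).
        batch = issues[: max(1, len(issues) // 2)]
        labels = correct(labels, batch)
        # Re-run the analysis on the corrected data (the Improve Results step).
        issues = detect_issues(labels)
        rounds += 1
    return labels, rounds


# Toy stand-ins (assumptions for this example only).
truth = {1: "cat", 2: "dog", 3: "cat", 4: "dog", 5: "cat"}
noisy = {1: "dog", 2: "cat", 3: "dog", 4: "cat", 5: "cat"}


def toy_detect(labels):
    # Flag every example whose label disagrees with the reference labels.
    return sorted(i for i, y in labels.items() if y != truth[i])


def toy_correct(labels, ids):
    fixed = dict(labels)  # copy so earlier versions stay unchanged
    for i in ids:
        fixed[i] = truth[i]
    return fixed


cleaned, rounds = iterate_cleaning(noisy, toy_detect, toy_correct)
```

In practice the stopping condition would be your own data quality or model accuracy target, not an empty issue list.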
Cleanlab Studio creates a dataset snapshot each time you run analyses on your dataset (or cleanset) via Improve Results. These versions are accessible via the Version history button. Each snapshot saves the dataset corrections made up to that point in time, which lets you revisit the state of the dataset at each re-run of a Project.
Here you can view each stage of dataset correction: the recent corrections that produced the newest cleanset version build on top of older corrections from previous cleansets. Use this to export an older dataset version and track which stage of dataset corrections produces the best downstream results.
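One way to picture the version history is as a list of snapshots, each holding the cumulative corrections at the time of a re-run. The sketch below is a minimal model of that idea only. The `VersionHistory` class and its methods are hypothetical and do not reflect Cleanlab Studio's actual storage or API.

```python
class VersionHistory:
    """Toy model: one snapshot of cumulative label corrections per re-run."""

    def __init__(self, original_labels):
        self._original = dict(original_labels)
        self._snapshots = []  # oldest first

    def snapshot(self, corrections):
        """Record the corrections as they stand now; return a 1-based version number."""
        self._snapshots.append(dict(corrections))
        return len(self._snapshots)

    def export(self, version):
        """Rebuild the dataset labels as they were at the given version."""
        labels = dict(self._original)
        labels.update(self._snapshots[version - 1])
        return labels


history = VersionHistory({1: "dog", 2: "cat", 3: "cat"})

corrections = {1: "cat"}             # fixes made before the first re-run
v1 = history.snapshot(corrections)

corrections[2] = "dog"               # later fixes build on the earlier ones
v2 = history.snapshot(corrections)

older = history.export(v1)
newest = history.export(v2)
```

Exporting `v1` and `v2` separately is what would let you compare downstream results across stages of correction.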