Cleanset
A Cleanset is an improved version of your original dataset. You create one by fixing the data issues detected in your dataset, for example by correcting erroneous labels or excluding examples that are outliers or duplicates. Use the Resolver to correct individual data points, or the Clean Top K button (at the bottom of a Project) to auto-fix many data points at once. The resulting dataset has the same format as your original dataset and can be used as a drop-in replacement to get more reliable downstream modeling and analytics. After making your data improvements, obtain the cleanset by exporting it from the web interface or the Python API.
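As a rough illustration of what "same format" means, the sketch below applies label corrections and exclusions to a small in-memory dataset. It does not call the Cleanlab Studio API. The `apply_corrections` helper, the `corrections` mapping, and the record fields are hypothetical names invented for this example.

```python
from typing import Optional

# Hypothetical original dataset: each record has an id, a text, and a label.
original = [
    {"id": 1, "text": "great product", "label": "negative"},      # mislabeled
    {"id": 2, "text": "terrible service", "label": "negative"},
    {"id": 3, "text": "terrible service", "label": "negative"},   # duplicate of id 2
    {"id": 4, "text": "asdf qwerty", "label": "positive"},        # outlier
]

# Hypothetical corrections keyed by record id:
# a string means "relabel to this value", None means "exclude this record".
corrections: dict[int, Optional[str]] = {1: "positive", 3: None, 4: None}


def apply_corrections(records, corrections):
    """Return a new list with the same fields as `records`, corrections applied."""
    cleaned = []
    for record in records:
        if record["id"] not in corrections:
            cleaned.append(dict(record))  # untouched record, copied as-is
            continue
        new_label = corrections[record["id"]]
        if new_label is None:
            continue  # excluded (for example an outlier or a duplicate)
        cleaned.append({**record, "label": new_label})
    return cleaned


cleanset = apply_corrections(original, corrections)
```

Because every surviving record keeps the same fields, code that consumed `original` can consume `cleanset` without changes.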
Improve Results
Cleanlab Studio iteratively improves its results as you make dataset changes. Once you’ve made significant dataset changes, click Improve Results to obtain an updated data quality analysis and suggested actions. When you click Improve Results, Cleanlab Studio re-runs the Project using the current corrected dataset as the starting point. Because the ML model in this re-run is auto-trained on cleaner data, the resulting Project yields more accurate detection of data issues and more accurate suggested labels.
A recommended workflow is:
1. Run Cleanlab Studio on your original dataset.
2. Correct up to half of the detected issues.
3. Run Improve Results.
4. Repeat steps 2 and 3 until you achieve the desired data quality or ML model deployment accuracy.
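The control flow of this workflow can be sketched as a simple loop. This is a toy stand-in, not Cleanlab Studio code: `improve_iteratively`, `detect_issues`, and `fix` are hypothetical names, and the "issue detector" here just flags negative numbers, whereas the real Project uses an ML model that is retrained on each re-run.

```python
def improve_iteratively(dataset, detect_issues, fix, max_rounds=10):
    """Detect issues, correct up to half of them, re-run detection, repeat."""
    issues = detect_issues(dataset)  # initial run on the original dataset
    rounds = 0
    while issues and rounds < max_rounds:
        # Correct up to half of the detected issues; always fix at least one
        # so the loop makes progress when only a single issue remains.
        batch = issues[: max(1, len(issues) // 2)]
        for index in batch:
            dataset[index] = fix(dataset[index])
        issues = detect_issues(dataset)  # re-run on the corrected dataset
        rounds += 1
    return dataset, rounds


# Toy example: negative values play the role of "data issues".
def detect_negatives(values):
    return [i for i, value in enumerate(values) if value < 0]


data = [3, -1, -4, 1, -5, -9, 2, -6]
cleaned, rounds_used = improve_iteratively(data, detect_negatives, abs)
```

In practice the stopping condition would be a data quality or model accuracy target rather than "no issues left", and each re-run may surface issues that earlier runs missed.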
Version History
Cleanlab Studio creates a dataset snapshot each time you run analyses on your dataset (or cleanset) via Improve Results. These versions are accessible via the Version history button. Each snapshot saves the dataset corrections that had been made at that point in time, which lets you revisit the state of the dataset at each re-run of a Project.
Here you can view each stage of dataset correction: the recent corrections that produced the newest cleanset version build on top of older corrections from previous cleansets. Use this to export an older dataset version and to track which stage of dataset corrections produces the best downstream results.
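One way to picture versions that build on each other is as a list of cumulative snapshots. The sketch below is only a mental model and makes no claim about how Cleanlab Studio stores versions. `Snapshot`, `take_snapshot`, and `best_snapshot` are hypothetical names, and `evaluate` stands in for whatever downstream metric you track, such as the accuracy of a model trained on that version.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical record of one dataset version."""
    version: int
    # Cumulative corrections up to this version: record id -> corrected label.
    corrections: dict = field(default_factory=dict)


def take_snapshot(history, new_corrections):
    """Append a snapshot that builds on the most recent one."""
    previous = history[-1].corrections if history else {}
    merged = {**previous, **new_corrections}  # newer corrections win on conflict
    snapshot = Snapshot(version=len(history) + 1, corrections=merged)
    history.append(snapshot)
    return snapshot


def best_snapshot(history, evaluate):
    """Return the snapshot with the highest score from a caller-supplied `evaluate`."""
    return max(history, key=evaluate)


history = []
take_snapshot(history, {1: "positive"})                 # first re-run
take_snapshot(history, {7: "negative"})                 # second re-run adds a correction
take_snapshot(history, {1: "neutral", 9: "positive"})   # third re-run revises id 1
```

Each snapshot is kept in `history`, so an older version stays available for export even after later corrections revise it.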