Cleanset

A Cleanset is an improved version of your original dataset. You create one by fixing the data issues detected in your dataset, such as correcting erroneous labels or excluding examples that are outliers or duplicates. Use the Resolver to correct individual data points, or the Clean Top K button (at the bottom of a Project) to auto-fix many data points at once. The resulting dataset has the same format as your original dataset, so it can serve as a drop-in replacement for more reliable downstream modeling and analytics. After making your data improvements, obtain the cleanset by exporting it from the web interface or the Python API.

'Cleanlab Studio turns datasets into cleansets.'
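
To make the "same format as your original dataset" point concrete, here is a minimal sketch in plain Python. It is not the Cleanlab Studio API: the function `build_cleanset`, the row layout (`id`, `text`, `label`), and the sample data are all hypothetical, chosen only to illustrate how label corrections and exclusions yield a dataset with the same fields as the original.

```python
from typing import Dict, List, Set

# Hypothetical row layout: every row is a dict with "id", "text" and "label".
Row = Dict[str, str]


def build_cleanset(
    rows: List[Row],
    corrected_labels: Dict[str, str],
    excluded_ids: Set[str],
) -> List[Row]:
    """Return a new dataset with the same fields as `rows`.

    Rows whose id is in `excluded_ids` are dropped (e.g. outliers or
    duplicates); rows whose id is in `corrected_labels` get the new label.
    """
    cleanset = []
    for row in rows:
        if row["id"] in excluded_ids:
            continue  # excluded examples do not appear in the cleanset
        fixed = dict(row)  # copy, so the original dataset is left untouched
        if row["id"] in corrected_labels:
            fixed["label"] = corrected_labels[row["id"]]
        cleanset.append(fixed)
    return cleanset


# Made-up example data: one mislabeled row ("a") and one duplicate ("c").
original = [
    {"id": "a", "text": "great product", "label": "negative"},
    {"id": "b", "text": "terrible service", "label": "negative"},
    {"id": "c", "text": "terrible service", "label": "negative"},
]
cleanset = build_cleanset(original, {"a": "positive"}, {"c"})
```

Because the cleanset keeps the same fields as the original, any downstream code that consumed `original` can consume `cleanset` unchanged.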

Improve Results

Cleanlab Studio iteratively improves its results as you make dataset changes. Once you’ve made significant dataset changes, click Improve Results to obtain an updated data quality analysis and updated suggested actions. When you click Improve Results, Cleanlab Studio re-runs the Project using the current corrected dataset as the starting point. Because the ML model in this re-run is automatically trained on cleaner data, the resulting Project detects data issues and suggests labels more accurately.

A recommended workflow is:

  1. Run Cleanlab Studio on your original dataset
  2. Correct up to half of the detected issues
  3. Run Improve Results
  4. Repeat steps 2 and 3 until you achieve the desired data quality or ML model deployment accuracy.
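
The control flow of this workflow can be sketched as a simple loop. This is only an illustration under stated assumptions: the steps are normally performed in the web interface, and `iterate_cleaning` together with its three callables (`detect_issues`, `correct`, `quality_ok`) and the `max_rounds` cap are hypothetical placeholders, not Cleanlab Studio functions.

```python
from typing import Callable, List


def iterate_cleaning(
    detect_issues: Callable[[], List[str]],
    correct: Callable[[List[str]], None],
    quality_ok: Callable[[], bool],
    max_rounds: int = 10,
) -> int:
    """Run the detect / correct / re-run loop; return the rounds used."""
    for round_number in range(1, max_rounds + 1):
        # Steps 1 and 3: (re-)run the analysis on the current dataset.
        issues = detect_issues()
        if not issues or quality_ok():
            return round_number
        # Step 2: correct about half of the detected issues
        # (at least one, so that every round makes progress).
        batch_size = max(1, len(issues) // 2)
        correct(issues[:batch_size])
    return max_rounds


# Toy stand-ins for the manual steps: eight made-up issue ids.
remaining = ["row-%d" % i for i in range(8)]


def detect() -> List[str]:
    return list(remaining)


def fix(batch: List[str]) -> None:
    for item in batch:
        remaining.remove(item)


rounds = iterate_cleaning(detect, fix, lambda: len(remaining) <= 1)
```

The toy stand-ins only shrink a list; in a real Project, each re-run can also change which issues are detected, since the model is retrained on the corrected data.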

'Cleanlab Studio with an arrow pointing to improve results button.'

Version History

Cleanlab Studio creates a dataset snapshot each time you run analyses on your dataset (or cleanset) via Improve Results. These versions are accessible via the Version history button. Each snapshot saves the dataset corrections you had made at that point in time, so you can revisit the state of the dataset at each re-run of a Project.

'Cleanlab Studio with an arrow pointing to version history button.'

Here you can view each stage of dataset correction. The recent corrections that produced the newest cleanset version build on top of the older corrections from previous cleansets. Use this page to export an older dataset version and to track which stage of dataset corrections produces the best downstream results.
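
As a conceptual model of how versions build on one another, the sketch below accumulates corrections round by round. It is an assumption made for illustration, not a description of how Cleanlab Studio stores versions: `build_history`, the mapping from row id to corrected label, and the sample corrections are all hypothetical.

```python
from typing import Dict, List


def build_history(rounds: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return one snapshot per round.

    Each snapshot contains every correction made up to and including
    that round, so newer versions build on top of older ones.
    """
    history: List[Dict[str, str]] = []
    cumulative: Dict[str, str] = {}
    for new_corrections in rounds:
        # Build a new dict so that earlier snapshots stay unchanged.
        cumulative = {**cumulative, **new_corrections}
        history.append(cumulative)
    return history


# Made-up corrections (row id -> corrected label) from three re-runs.
rounds = [{"a": "positive"}, {"b": "neutral"}, {"a": "neutral"}]
history = build_history(rounds)
```

Keeping every snapshot, rather than only the latest one, is what makes it possible to go back to an earlier version and compare its downstream results with those of later versions.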

'Cleanlab Studio version history page.'