Dynamically evolve your classification dataset by adding new classes
In classification datasets, a common problem arises when there are hidden classes — categories that exist within the data but were not identified or labeled during the initial dataset creation.
Cleanlab Studio offers a feature that streamlines the process of identifying and labeling these hidden classes. This feature enables efficient categorization with minimal manual effort, allowing you to dynamically improve your datasets for better machine learning performance.
For this demonstration, we will use a Shoes Dataset from a retailer, which contains around 1300 examples across five categories: boots, flip_flops, sandals, sneakers, and soccer_shoes. These steps can be applied to any multi-class classification dataset.
First, create a project in Cleanlab Studio using this dataset. You can download the dataset and upload it, or use the Import via URL option for direct upload. Next, set up a multi-class image classification project in fast mode.
Creating a label for a new category in your dataset
Once the project is Ready for Review, review the detected issues. You may find data points that don’t fit into any of the existing categories. For example, some shoes may belong to the casual category of loafers instead of sneakers or sandals. Creating a new label for such cases is straightforward. This feature is also useful for splitting overarching labels into more fine-grained categories, like dividing sandals into flats and heels. Watch the video below to see how to create a new label. Here, we sort the rows in descending order of label issue score and go through the data points one by one.
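The review order described here can also be reproduced offline on exported results. The following is a minimal sketch, assuming the project rows are available as Python dictionaries. The field names `image`, `given_label`, and `label_issue_score`, and all sample values, are hypothetical illustrations, not Cleanlab Studio's documented export schema.

```python
# Hypothetical rows; field names and scores are illustrative only.
rows = [
    {"image": "img_001.jpg", "given_label": "sneakers", "label_issue_score": 0.12},
    {"image": "img_002.jpg", "given_label": "sandals", "label_issue_score": 0.91},
    {"image": "img_003.jpg", "given_label": "sneakers", "label_issue_score": 0.78},
]


def sort_by_issue_score(rows):
    """Return rows ordered from highest to lowest label issue score."""
    return sorted(rows, key=lambda r: r["label_issue_score"], reverse=True)


def relabel(rows, image, new_label):
    """Return a copy of rows in which `image` carries `new_label`."""
    return [
        {**r, "given_label": new_label} if r["image"] == image else r
        for r in rows
    ]


# Review the most suspicious rows first.
review_queue = sort_by_issue_score(rows)

# Assign the new class to a row that fits none of the existing labels.
updated = relabel(review_queue, "img_002.jpg", "loafers")
```

Sorting by score first means the rows most likely to belong to a hidden class surface at the top, so only a few need to be inspected before the new label is created.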
Creating a label for irrelevant data points
For ambiguous or outlier images, such as those focusing on the person rather than the shoe, you can create a label like people. This ensures these images are categorized appropriately without being excluded or tagged as issues. This scenario is quite common in classification datasets, where a data point may belong to a class that is not relevant at the moment. These classes could be labeled as miscellaneous, other, unknown, or simply clutter.
Find more data points belonging to the new class
Once you have labeled a few images for the new class, hit Improve Issues Found to re-run Cleanlab Studio’s analysis. This process will identify more data points that belong to the new class, allowing for quick and efficient labeling. After the analysis is complete and the project is Ready for Review again, you can use the filter to see more data points with the suggested label loafers, then use the auto-fix batch action to correct the labels for all of them.
Similarly, we relabel the images whose suggested label is people.
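The effect of filtering by suggested label and applying a batch fix can be sketched as follows. This is only an illustration of the outcome on hypothetical data, not how Cleanlab Studio implements auto-fix. The field names `given_label` and `suggested_label` and the helper `auto_fix` are assumptions made for this example.

```python
# Hypothetical rows after the re-run; field names are illustrative only.
rows = [
    {"image": "img_010.jpg", "given_label": "sneakers", "suggested_label": "loafers"},
    {"image": "img_011.jpg", "given_label": "sandals", "suggested_label": "people"},
    {"image": "img_012.jpg", "given_label": "boots", "suggested_label": "boots"},
    {"image": "img_013.jpg", "given_label": "sandals", "suggested_label": "loafers"},
]


def auto_fix(rows, suggested_label):
    """Accept the suggestion for every row whose suggested label matches."""
    return [
        {**r, "given_label": suggested_label}
        if r["suggested_label"] == suggested_label
        else r
        for r in rows
    ]


# Apply the batch fix once per new class.
fixed = auto_fix(rows, "loafers")
fixed = auto_fix(fixed, "people")

new_labels = [r["given_label"] for r in fixed]
```

Rows whose suggested label does not match the filter are left untouched, which mirrors applying the batch action only to the filtered view.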