
Dynamically evolve your classification dataset by adding new classes

In classification datasets, a common problem arises when there are hidden classes — categories that exist within the data but were not identified or labeled during the initial dataset creation.

Cleanlab Studio offers a feature that streamlines the process of identifying and labeling these hidden classes. This feature enables efficient categorization with minimal manual effort, allowing you to dynamically improve your datasets for better machine learning performance.

create_label_main.png

For this demonstration, we will use a Shoes Dataset from a retailer, which contains around 1300 examples across five categories: boots, flip_flops, sandals, sneakers, and soccer_shoes. These steps can be applied to any multi-class classification dataset.

First, create a project in Cleanlab Studio using this dataset. You can download the dataset and upload it, or use the Import via URL option for direct upload. Next, set up a multi-class image classification project in fast mode.

select_task_type.png

Creating a label for a new category in your dataset

Once the project is Ready for Review, review the detected issues. You may find data points that don’t fit into any of the existing categories. For example, some shoes may be loafers, a casual style that is neither sneakers nor sandals.

resolver_view

Creating a new label for such cases is straightforward. This feature is also useful for splitting overarching labels into more fine-grained categories, like dividing sandals into flats and heels. Watch the video below to see how to create a new label. Here, we sort the rows in descending order of label issue score and go over the data points.
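The sorting step can also be pictured outside the web interface. The snippet below is a minimal sketch, not Cleanlab Studio's API: the rows and the field names `image`, `given_label`, and `label_issue_score` are hypothetical stand-ins for exported project data, and it assumes that a higher score marks a more likely labeling problem.

```python
# Hypothetical rows; the field names are illustrative, not an official export schema.
rows = [
    {"image": "img_001.jpg", "given_label": "sneakers", "label_issue_score": 0.12},
    {"image": "img_002.jpg", "given_label": "sandals", "label_issue_score": 0.91},
    {"image": "img_003.jpg", "given_label": "sneakers", "label_issue_score": 0.67},
]


def rank_by_label_issue_score(rows):
    """Return a new list with the highest label issue scores first."""
    return sorted(rows, key=lambda row: row["label_issue_score"], reverse=True)


ranked = rank_by_label_issue_score(rows)
for row in ranked:
    print(row["image"], row["given_label"], row["label_issue_score"])
```

Reviewing in this order puts the examples most likely to belong to a hidden class at the top of the list.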

After creating the new label, you can start labeling data points that belong to this new category. This helps Cleanlab Studio learn from these examples and find more similar data points when you re-run the analysis. You can label examples individually or in batches using filters.
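Conceptually, creating a label and applying it in a batch comes down to two operations: extending the set of classes, then relabeling every row that matches a filter. The sketch below illustrates this with plain Python. The helper names `add_label` and `relabel_where`, the `reviewer_note` field, and the sample rows are all hypothetical and are not part of Cleanlab Studio.

```python
# The five original classes from the Shoes Dataset.
classes = ["boots", "flip_flops", "sandals", "sneakers", "soccer_shoes"]

# Hypothetical rows; "reviewer_note" stands in for whatever filter you apply.
rows = [
    {"image": "img_010.jpg", "label": "sneakers", "reviewer_note": "loafer"},
    {"image": "img_011.jpg", "label": "sandals", "reviewer_note": ""},
    {"image": "img_012.jpg", "label": "sandals", "reviewer_note": "loafer"},
]


def add_label(classes, new_label):
    """Return the class list with new_label appended if it is not already present."""
    return list(classes) if new_label in classes else list(classes) + [new_label]


def relabel_where(rows, predicate, new_label):
    """Return copies of rows, assigning new_label to every row matching predicate."""
    return [{**row, "label": new_label} if predicate(row) else row for row in rows]


classes = add_label(classes, "loafers")
relabeled = relabel_where(rows, lambda row: row["reviewer_note"] == "loafer", "loafers")
```

The same pattern covers splitting a broad label into finer ones: add the new class names, then relabel with a filter restricted to the original class.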

Creating a label for irrelevant data points

For ambiguous or outlier images, such as those focusing on the person rather than the shoe, you can create a label like people. This ensures these images are categorized appropriately without being excluded or tagged as issues. This scenario is quite common in classification datasets, where a data point may belong to a class that is not relevant at the moment. These classes could be labeled as miscellaneous, other, unknown, or simply clutter.

Find more data points belonging to the new class

Once you have labeled a few images for the new class, click Improve Issues Found to re-run Cleanlab Studio’s analysis. The re-run identifies more data points that belong to the new class, so they can be labeled quickly. After the analysis is complete and the project is Ready for Review again, use the filter to see the additional data points whose suggested label is loafers, then apply the auto-fix batch action to correct the labels for all of them.

Similarly, we label the images whose suggested label is people.
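The filter-then-auto-fix step amounts to accepting the suggested label for every row where that suggestion equals the new class. The following is a minimal sketch of that logic under assumed data: the function `accept_suggestions`, the field names `label` and `suggested_label`, and the sample rows are hypothetical and do not reflect how Cleanlab Studio implements the batch action.

```python
# Hypothetical rows after re-running the analysis; field names are illustrative.
rows = [
    {"image": "img_020.jpg", "label": "sneakers", "suggested_label": "loafers"},
    {"image": "img_021.jpg", "label": "boots", "suggested_label": "people"},
    {"image": "img_022.jpg", "label": "sandals", "suggested_label": "sandals"},
    {"image": "img_023.jpg", "label": "sandals", "suggested_label": "loafers"},
]


def accept_suggestions(rows, target_label):
    """Return copies of rows, accepting the suggestion wherever it equals target_label."""
    return [
        {**row, "label": target_label} if row["suggested_label"] == target_label else row
        for row in rows
    ]


# Apply the batch correction for each new class in turn.
fixed = accept_suggestions(rows, "loafers")
fixed = accept_suggestions(fixed, "people")
```

Rows whose suggestion points to a different class are left untouched, which mirrors filtering on one suggested label at a time before applying the batch action.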