Frequently Asked Questions

Welcome to the Cleanlab Studio FAQ page! Here you can find answers to common questions.

Can’t find the answer? Join our Slack community for live support or contact us at support@cleanlab.ai.

General

What types of data does Cleanlab Studio handle?

Cleanlab Studio currently supports:

  • Text data
  • Image data
  • Structured tabular data (Excel, CSV, JSON, SQL, etc.)

Video, Audio, and other formats are supported in Cleanlab Studio for Enterprise.

A tutorial on this site may demonstrate some Cleanlab functionality on, say, an image dataset, but you can apply the same functionality to text or tabular data. Find the tutorial that covers the functionality you are interested in, regardless of the data modality it uses as an example. It should be straightforward to apply the same steps to your own data, even if it is a different data modality.

What types of machine learning tasks does Cleanlab Studio support?

Cleanlab Studio currently supports:

Regression, Image Segmentation, (2D/3D) Object Detection, and other ML tasks are supported in Cleanlab Studio for Enterprise.

How does Cleanlab Studio detect label and data issues in my datasets?

Cleanlab Studio automatically trains many state-of-the-art ML models based on your dataset’s features and label column (including Foundation models with extensive world-knowledge), and combines the outputs from these models with novel algorithms to estimate data and label quality. This is the culmination of years of research from our scientists.

After you’ve cleaned up your dataset, you can re-train the same AutoML system that was used to detect data issues on the higher-quality data with one click. With another click, you can deploy this ML model to serve accurate predictions in your application. Beyond deploying it for prediction and using it to detect data issues, the same AutoML system can also be used to confidently label large subsets of data automatically.

Thus Cleanlab Studio is far more than a data quality and data cleaning tool. This data-centric AI platform automates all of the steps of a real-world ML project: data labeling, characterizing data quality, data cleaning, model training/tuning/selection, and model deployment to serve predictions. This is the quickest way to go from raw data to reliable ML deployment, all without having to write code!

What makes Cleanlab Studio unique?

  1. Cleanlab Studio works for both structured (tabular) and unstructured (image, text) datasets.
  2. Cleanlab Studio auto-detects label and data issues via AI (rather than user-specified rules) and auto-suggests how to fix these issues to produce a higher quality dataset. Cleanlab Studio can simultaneously detect many types of common issues, most of which can only be auto-detected by an AI system that understands the information content in each data point.
  3. Cleanlab Studio provides a quick interface to improve the quality of your existing data - no code required!
  4. Cleanlab Studio supports end-to-end ML Model Deployment, so you can go from data correction to solution all in one interface. With a few clicks, automatically train the same ML models (used to detect issues in your original dataset) on the improved version of your dataset, and deploy them to serve predictions in your applications. This is the fastest way to go from messy raw data to highly accurate deployed ML.
  5. Cleanlab Studio also allows you to use these same ML models to automatically label a dataset from scratch. This is the fastest way to create a high-quality dataset for supervised learning applications, or to get a bunch of documents/images tagged. With auto-labeling + automated label error detection, one person can now label a huge dataset!

Why do the buttons in the Cleanlab Studio application look different than in the tutorial?

We are continuously adding/deleting/renaming buttons to improve Cleanlab Studio. Thus what you see in the Tutorials may sometimes be an older version of the application. If anything is unclear, please ask for help in our Slack community and somebody will resolve your issue promptly.

Why can't I find a particular Cleanlab column computed for my dataset?

Projects created in Regular mode may compute more Cleanlab columns than Projects created in Fast mode. If you are missing a column of interest in a Fast mode Project, first try creating another Project in Regular mode. If the column is still missing, consider reformatting your dataset. Certain columns may not be returned in the Project results if their computation failed (likely due to dataset formatting). Reach out for additional help: support@cleanlab.ai

Cleanlab Studio Web

My image dataset won't upload properly.

Image data must be formatted in a specific way. To learn how:

  1. Go to the Upload page
  2. Select “How to format your dataset” or press H
  3. Select your desired Machine Learning task
  4. Select Image

Here you will find how to format your image data depending on where it is stored and how it is structured. You can also read more in the Image Data Quickstart tutorial or our Datasets guide.

Where is my cleanset_id?

You can find your cleanset_id inside of the project you want to export:

  1. Select Export Cleanset
  2. Select Export using API
  3. Copy cleanset_id by clicking the copy icon on the right

Cleanlab Studio Python API

Why am I getting an error when running a tutorial notebook?

The tutorial notebooks require the latest version of the cleanlab-studio library, which is continuously being improved.

Please upgrade to the latest version via: pip install cleanlab-studio --upgrade

If you are experiencing this issue in Google Colab or a Jupyter notebook, make sure you refresh your kernel after upgrading.

If you are still encountering an error after that, ask for help in our Slack community and somebody will resolve your issue promptly.
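
To confirm that the upgrade took effect, you can check which version of the package is installed. The snippet below is a minimal sketch that uses only the Python standard library; the helper name installed_version is made up for this example and is not part of cleanlab-studio.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "cleanlab-studio" is the pip package name used in the upgrade command above.
version = installed_version("cleanlab-studio")
if version is None:
    print("cleanlab-studio is not installed in this environment")
else:
    print(f"cleanlab-studio version: {version}")
```

This reads the package metadata on disk, so it can report the new version even while an older copy of the library is still loaded in memory. That is why refreshing the kernel after upgrading matters.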

Where is my API key?

You can find your API key in two ways:

  • On the Cleanlab Studio Account page
  • From the Upload page, select Upload Dataset and then Upload via command line
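
Once you have copied your API key, one option is to keep it out of your notebook by reading it from an environment variable. This is only a sketch: the variable name CLEANLAB_API_KEY and the helper read_api_key are assumptions chosen for this example, not names defined by Cleanlab Studio.

```python
import os


def read_api_key(env_var: str = "CLEANLAB_API_KEY") -> str:
    """Read the API key from an environment variable (the name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


# Usage, assuming the variable has been set in your shell or notebook environment:
# from cleanlab_studio import Studio
# studio = Studio(read_api_key())
```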

Where is my cleanset_id?

You can retrieve your latest cleanset_id by running the code below. If you need a different one, instructions for getting the cleanset_id from the Web app can be found above.

from cleanlab_studio import Studio
API_KEY = "<insert your API key>"
project_id = "<insert your project ID>"
studio = Studio(API_KEY)
cleanset_id = studio.get_latest_cleanset_id(project_id)

My notebook timed out after I created a project.

When you create a project in Cleanlab Studio, it takes some time for the analysis to finish running. You can check on the project and wait for its results using the project_id:

from cleanlab_studio import Studio
API_KEY = "<insert your API key>"
project_id = "<insert your project ID>"
studio = Studio(API_KEY)

cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
# Waits until the cleanset is ready
project_status = studio.wait_until_cleanset_ready(cleanset_id)

Note: Do NOT create a new project. Reuse the project_id of the project you already created.

My notebook timed out while my dataset was uploading.

You can try a few things to check if the dataset was uploaded successfully:

  1. Check whether the dataset_id was printed in your notebook before the timeout (it is returned by the upload_dataset method).
  2. Check the Cleanlab Studio Web Application to see if your dataset is shown, and copy down its dataset_id.

If you cannot find the dataset or its ID, then re-execute the upload_dataset command and ensure your notebook stays connected for sufficiently long.

Otherwise, once you restart your notebook, skip past the upload_dataset command and move on to Project creation based on the dataset_id.
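
One way to make a notebook safe to re-run is to save each ID to a small file as soon as it is returned, and to reuse the saved value afterwards. The sketch below uses only the standard library. The file name cleanlab_ids.json and the helper get_or_create_id are hypothetical, and the upload call in the usage comment stands in for whatever upload_dataset call your notebook already makes.

```python
import json
from pathlib import Path
from typing import Callable

STATE_FILE = Path("cleanlab_ids.json")  # hypothetical file name


def get_or_create_id(name: str, create: Callable[[], str],
                     state_file: Path = STATE_FILE) -> str:
    """Return a saved ID if one exists; otherwise call `create` once and save its result."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    if name not in state:
        state[name] = create()  # only runs when no ID has been saved yet
        state_file.write_text(json.dumps(state, indent=2))
    return state[name]


# Usage (replace the lambda body with your original upload call):
# dataset_id = get_or_create_id("dataset_id", lambda: studio.upload_dataset(...))
```

This only helps when the upload call returned before the notebook disconnected. If it did not, check the Web Application for the dataset as described above.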

How do I format the labels for a multi-label dataset?

To run Cleanlab Studio on a multi-label dataset (where each data point can belong to more than one class or to none of the classes), format each value in your label column as a string that contains no whitespace and in which the label values are separated by commas. For example, if you are using Python with a multi-label dataset in a Pandas DataFrame where the possible classes are 1, 2, 3, you could format the label column like this:

# df is your existing pandas DataFrame, with one row per entry in labels
labels = ['2,3', '2,3', '1,2,3', '2,3', '2,3', '1,2,3', '1,2,3', '2,3', '3', '']
df["label"] = labels

This example shows specifically how to do multi-label formatting with a Python Pandas DataFrame, but many other data formats are supported. Here is a detailed tutorial on how to prepare and format multi-label data in Cleanlab Studio, and you should also refer to the Datasets guide.

Pay particular attention to the subtle formatting difference between unlabeled data points (that have not yet been labeled) and data points that have been labeled but do not belong to any of the classes.
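
If your labels start out as Python lists rather than strings, a small helper can produce the comma-separated format described above. This is a sketch under stated assumptions: the helper name format_multilabel is made up, an empty list is mapped to '' (as in the example labels above), and None is passed through unchanged to stand for an unlabeled data point. Check the Datasets guide for the exact representation of unlabeled data points in your file format before relying on this.

```python
from typing import Iterable, Optional


def format_multilabel(classes: Optional[Iterable[object]]) -> Optional[str]:
    """Join class values into a comma-separated string with no surrounding whitespace.

    Assumption: None means "not yet labeled" and is returned unchanged;
    an empty list means "labeled, but belongs to none of the classes".
    """
    if classes is None:
        return None
    return ",".join(str(c).strip() for c in classes)


raw_labels = [[2, 3], [1, 2, 3], [3], [], None]
labels = [format_multilabel(row) for row in raw_labels]
print(labels)
```

Note that strip() only trims whitespace around each class value. Class names that contain spaces internally would still need to be renamed to satisfy the no-whitespace requirement.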

I got a ValueError: CLI is out of date and must be updated.

Please upgrade to the latest version via: pip install cleanlab-studio --upgrade

If you are experiencing this issue in Google Colab or a Jupyter notebook, make sure you refresh your kernel after upgrading.

If you are still encountering an error after that, ask for help in our Slack community and somebody will resolve your issue promptly.

Cleanlab Studio CLI

I got a No file exists at: {filename} error

Ensure you are calling cleanlab dataset upload from the same directory where your file is located.
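
To see which location a relative filename actually points to, you can resolve it against the current working directory before uploading. This is a minimal standard-library sketch; the helper name describe_path and the filename my_dataset.csv are placeholders for this example.

```python
from pathlib import Path


def describe_path(filename: str) -> str:
    """Report whether `filename` exists, resolved against the current directory."""
    path = Path(filename).expanduser().resolve()
    status = "found" if path.is_file() else "NOT found"
    return f"{status}: {path} (current directory: {Path.cwd()})"


# Replace the placeholder with the file you are trying to upload.
print(describe_path("my_dataset.csv"))
```

If the file is reported as not found, change to the directory that contains it before running the upload command.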

I got Error: Resource Not Found

Ensure you are calling cleanlab cleanset download with the correct cleanset_id. This can be copied by:

  1. Inside the project you want to export from, select Export Cleanset
  2. Select Export using API
  3. Click the copy icon next to your cleanset_id