Frequently Asked Questions
Welcome to the Cleanlab Studio FAQ page! Here you can find answers to common questions.
Can’t find the answer? Join our Slack community for live support or contact us at support@cleanlab.ai.
General
What types of data does Cleanlab Studio handle?
Cleanlab Studio currently supports:
- Text data
- Image data
- Structured tabular data (Excel, CSV, JSON, SQL, etc)
Video, Audio, and other formats are supported in Cleanlab Studio for Enterprise.
A tutorial on this site may demonstrate some Cleanlab functionality on, say, an image dataset, but you can apply the same functionality to text or tabular data. Find the tutorial that covers the functionality you are interested in, regardless of the data modality it uses as an example. It should be straightforward to apply the same steps to your own data, even if it is a different data modality.
What types of machine learning tasks does Cleanlab Studio support?
Cleanlab Studio currently supports:
- Multi-Class Classification
- Multi-Label Classification
- Entity Recognition
- Sequence-to-Sequence (Text Generation)
Regression, Image Segmentation, (2D/3D) Object Detection, and other ML tasks are supported in Cleanlab Studio for Enterprise.
How does Cleanlab Studio detect label and data issues in my datasets?
Cleanlab Studio automatically trains many state-of-the-art ML models based on your dataset’s features and label column (including Foundation models with extensive world-knowledge), and combines the outputs from these models with novel algorithms to estimate data and label quality. This is the culmination of years of research from our scientists.
After you’ve cleaned up your dataset, you can re-train the same AutoML system that was used to detect data issues on the higher-quality data with one click. With another click, you can deploy this ML model to serve accurate predictions in your application. Beyond deploying it for prediction and using it to detect data issues, the same AutoML system can also be used to confidently label large subsets of data automatically.
Thus Cleanlab Studio is far more than a data quality and data cleaning tool. This data-centric AI platform automates all of the steps of a real-world ML project: data labeling, characterizing data quality, data cleaning, model training/tuning/selection, and model deployment to serve predictions. This is the quickest way to go from raw data to reliable ML deployment, all without having to write code!
What makes Cleanlab Studio unique?
- Cleanlab Studio works for both structured (tabular) and unstructured (image, text) datasets.
- Cleanlab Studio auto-detects label and data issues via AI (rather than user-specified rules) and auto-suggests how to fix these issues to produce a higher quality dataset. Cleanlab Studio can simultaneously detect many types of common issues, most of which can only be auto-detected by an AI system that understands the information content in each data point.
- Cleanlab Studio provides a quick interface to improve the quality of your existing data - no code required!
- Cleanlab Studio supports end-to-end ML Model Deployment, so you can go from data correction to solution all in one interface. With a few clicks, automatically train the same ML models (used to detect issues in your original dataset) on the improved version of your dataset, and deploy them to serve predictions in your applications. This is the fastest way to go from messy raw data to highly accurate deployed ML.
- Cleanlab Studio also allows you to use these same ML models to automatically label a dataset from scratch. This is the fastest way to create a high-quality dataset for supervised learning applications, or to get a bunch of documents/images tagged. With auto-labeling + automated label error detection, one person can now label a huge dataset!
Why do the buttons in the Cleanlab Studio application look different than in the tutorial?
We are continuously adding/deleting/renaming buttons to improve Cleanlab Studio. Thus what you see in the Tutorials may sometimes be an older version of the application. If anything is unclear, please ask for help in our Slack community and somebody will resolve your issue promptly.
I can't find a particular Cleanlab column computed for my dataset?
More Cleanlab columns may be computed in Projects created in Regular vs. Fast mode. If you are missing a column of interest in a Fast mode Project, first try creating another Project in Regular mode.
If the column is still missing, consider reformatting your dataset. Certain columns may not be returned in the Project results if their computation failed (likely due to dataset formatting).
Otherwise you can also try decreasing the size of your dataset. Certain columns may not be returned in the Project results if their computation took too long. If your dataset is too large, you will need to get on an Enterprise plan in order to get the best results with Cleanlab Studio.
Reach out for additional help: support@cleanlab.ai
When should I choose Text vs Tabular dataset? How to handle multiple text fields?
While a text dataset can have multiple columns, the subsequent Cleanlab Studio project can only focus on one specific text column. Cleanlab’s AI will only consider this column when determining issues in your dataset and predicting labels. If you have multiple text columns that are important, please combine them all into a single text column (one long string, perhaps including the column names and some delimiter to separate the concatenated columns) and then re-upload your dataset.
For instance, you might format one row in a dataset that had 3 text columns like this:
Column 1 name: Text from this row of column 1. \n Column 2 name: Text from this row of column 2. \n Column 3 name: Text from this row of column 3.
Here is a Python function you can use to combine columns of your choosing into a single text column (as shown above):
import pandas as pd

def combine_columns_into_text(filepath, column_names, output_filepath="output.csv", name_of_text_column="text", delimiter=' \n '):
    """
    Takes in location of your CSV file (`filepath`) and list of strings (`column_names`) specifying which columns to merge.
    Writes dataset as new file: `output_filepath` with the specified columns merged into one string column.
    Name of the merged column will be: `name_of_text_column` (optional choice), select this as the Text column in Cleanlab Studio.
    Merged columns will be separated by (optionally) provided `delimiter`.
    """
    df = pd.read_csv(filepath)
    # Build one "column name: value" string per row, joined by the delimiter.
    df[name_of_text_column] = df.apply(lambda row: delimiter.join([f"{col}: {str(row[col])}" for col in column_names]), axis=1)
    df.to_csv(output_filepath, index=False)
    print(f"Dataset with merged columns saved as: {output_filepath}. Load this file in Cleanlab Studio and select '{name_of_text_column}' as the Text column in your Project.")
    return df
Alternatively you can run Cleanlab Studio on such a dataset by treating it as a tabular dataset. For tabular data, Cleanlab’s AI will consider all columns when determining issues in your dataset and predicting labels (even if there are multiple columns containing text).
So what’s the difference between running a Text Project or a Tabular Project?
In a Text Project, your data will be run through pretrained Foundation models (that are able to contextualize your data with broader world knowledge), and various other Transformer models will be fit to your data. So Text Projects will be better for applications where predicting labels or detecting data issues requires deep semantic understanding of the information in the text.
In a Tabular Project, each column will be appropriately featurized for supervised ML models to be applied. For instance, numeric columns are rescaled, categorical columns are encoded, and text columns are encoded via N-gram and TF-IDF word/phrase count vectors. Thus Tabular Projects will be better for applications where: the numeric/categorical information in your dataset is important, being able to learn relationships across multiple columns is important, or the relevant information in the text fields can be encoded into relatively simple patterns (keywords/phrases) rather than requiring semantic understanding of the full meaning of the text.
For example, suppose you have a product dataset with feature columns: Name, Description, Price, Weight, … If your label is say whether or not each product is age-restricted (e.g. if it contains alcohol), then a Text Project may be better. If your label is say whether or not each product is expensive relative to similar products, then a Tabular Project may be better.
Since Cleanlab Studio is so easy to use, you can just try both approaches and see which works better for your data!
Why is the suggested label sometimes missing?
Note: If you’d like to obtain the predictions (and predicted class probabilities) output by Cleanlab’s AutoML system for every data point, use the download_pred_probs() method in the Python API. You can always use these predictions to get suggested labels for every data point, even if some suggestions are missing in the Cleanlab columns.
These predictions are out-of-sample (obtained via cross-validation) and can be used for unbiased model evaluation estimates for your dataset. Cleanlab’s suggested labels are simply based on these predictions. You can apply your own thresholds to the predicted class probabilities to form your own predictions (and suggested labels) with different levels of precision/recall.
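As a rough sketch of applying your own threshold, assume the predicted class probabilities have already been loaded into a pandas DataFrame with one column per class. The class names, the example values, and the `suggest_labels` helper below are hypothetical and not part of the Cleanlab Studio API:

```python
import pandas as pd

def suggest_labels(pred_probs: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return the most likely class per row, or a missing value (NaN)
    when the top predicted probability falls below `threshold`."""
    top_class = pred_probs.idxmax(axis=1)  # column name with the highest probability
    top_prob = pred_probs.max(axis=1)      # the corresponding probability
    return top_class.where(top_prob >= threshold)

# Hypothetical predicted class probabilities for three data points.
pred_probs = pd.DataFrame({"cat": [0.95, 0.55, 0.08], "dog": [0.05, 0.45, 0.92]})
suggested = suggest_labels(pred_probs, threshold=0.9)
print(suggested)
```

Raising the threshold yields fewer but more confident suggestions (higher precision), while lowering it yields more suggestions (higher recall).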
For Multi-Class Classification Projects
Cleanlab’s suggested label will be null for data points not flagged with a label issue (is_label_issue column is False). Suggested labels may also be null for data points simultaneously flagged with other types of issues.
The idea is you can rely on the suggested label column for automated data correction, so Cleanlab leaves it empty in cases where there is not an alternative label that is confidently better. In such cases, you can either leave the label as it was originally given, or manually review a data point and determine how to best re-label it.
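A minimal sketch of this kind of automated correction with pandas follows. The tiny DataFrame stands in for an exported cleanset, and the column names `label` and `suggested_label` are assumptions to adjust to your own export:

```python
import pandas as pd

# Hypothetical stand-in for an exported cleanset.
df = pd.DataFrame({
    "label": ["cat", "dog", "cat"],
    "is_label_issue": [False, True, True],
    "suggested_label": [None, "cat", None],
})

# Use the suggested label where one exists; otherwise keep the given label.
df["corrected_label"] = df["suggested_label"].fillna(df["label"])

# Flagged rows without a suggestion are candidates for manual review.
needs_review = df[df["is_label_issue"] & df["suggested_label"].isna()]
print(df)
print(needs_review)
```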
For Multi-Label Classification Projects
Cleanlab’s suggested label will always exist for each data point in a multi-label classification project. This is to facilitate interpretable comparison between the given label and the predicted label (especially because there can be empty labels in multi-label datasets, for instance an image/document to which none of the tags apply). Thus in multi-label classification, you should not always rely on the suggested label column for automated data correction. Only rely on this suggested label for data points flagged as a label issue.
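For multi-label data, one way to follow this advice is to overwrite the given label only on flagged rows. This is a sketch on a hypothetical DataFrame. The column names `label` and `suggested_label` are assumptions:

```python
import pandas as pd

# Hypothetical stand-in for an exported multi-label cleanset.
df = pd.DataFrame({
    "label": ["1,2", "3", ""],
    "is_label_issue": [False, True, False],
    "suggested_label": ["1,2", "2,3", "1"],
})

# Take the suggested label only where a label issue was flagged;
# everywhere else keep the given label (including empty labels).
df["corrected_label"] = df["suggested_label"].where(df["is_label_issue"], df["label"])
print(df)
```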
Cleanlab Studio Web App
My image dataset won't upload properly.
Image data must be formatted in a specific way. To learn how:
- Go to Upload page
- Select “How to format your dataset” or press H
- Select your desired Machine Learning task
- Select Image
Here you will find how to format your image data depending on where it is stored and how it is structured. You can also read more in the Image Data Quickstart tutorial or our Datasets guide.
Where is my cleanset_id?
You can find your cleanset_id inside of the project you want to export:
- Select Export Cleanset
- Select Export using API
- Copy cleanset_id by clicking the copy icon on the right
Cleanlab Studio Python API
Why am I getting an error when running a tutorial notebook?
The tutorial notebooks require the latest version of the cleanlab-studio library, which is continuously being improved.
Please upgrade to the latest version via: pip install cleanlab-studio --upgrade
If you are experiencing this issue in Google Colab or a Jupyter notebook, make sure you restart your kernel after upgrading.
If you are still encountering an error after that, ask for help in our Slack community and somebody will resolve your issue promptly.
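To confirm which version your notebook is actually using after the upgrade, you can query the installed package metadata with the Python standard library. The `installed_version` helper is a hypothetical convenience wrapper, not part of cleanlab-studio:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str = "cleanlab-studio"):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(f"cleanlab-studio version: {installed_version()}")
```

If this still shows the old version after upgrading, the kernel most likely has not been restarted yet.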
Where is my API key?
Where is my cleanset_id?
You can retrieve your latest cleanset_id by running the code below. If you need a different one, instructions for getting the cleanset_id from the Web app can be found above.
from cleanlab_studio import Studio
studio = Studio(API_KEY)
cleanset_id = studio.get_latest_cleanset_id(project_id)
My notebook timed out after I created a project.
You can resume from the existing Project using its project_id:
from cleanlab_studio import Studio

API_KEY = "<insert your API key>"
project_id = "<insert your project ID>"
studio = Studio(API_KEY)
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.wait_until_cleanset_ready(cleanset_id)
My notebook timed out while my dataset was uploading.
You can try a few things to check if the dataset was uploaded successfully:
- Check if the dataset_id was printed in your notebook before the timeout (this is returned from the upload_dataset method)
- Check the Cleanlab Studio Web Application to see if your dataset is shown, and copy down its dataset_id.
If you cannot find the dataset or its ID, then re-execute the upload_dataset command and ensure your notebook stays connected for sufficiently long.
Otherwise, once you restart your notebook, skip past the upload_dataset command and move on to Project creation based on the dataset_id.
How do I handle (near) duplicates?
From each set of nearly duplicated data points that share a common near_duplicate_cluster_id, you might only want to keep one of the data points, so that the resulting dataset will no longer contain any near duplicates.
You can use this code to select rows of duplicate data and then remove them from your original dataframe:
near_duplicates_to_exclude = df['is_near_duplicate'] & df['near_duplicate_cluster_id'].duplicated(keep='first')
df_fixed = df[~near_duplicates_to_exclude]
You might find that Cleanlab’s near duplicate determination in is_near_duplicate is too loose/stringent. The near_duplicate_score column quantifies how similar each data point is to its nearest neighbor in the dataset. Exactly duplicated data points will have a score of 1. You can use these scores to apply a different threshold to determine which data points are considered near duplicates. For instance, here is code to only consider exact duplicates, and remove extra copies of exact duplicates from the dataset:
df['is_exact_duplicate'] = df['near_duplicate_score'] == 1.0
exact_duplicates_to_exclude = df['is_exact_duplicate'] & df['near_duplicate_cluster_id'].duplicated(keep='first')
df_fixed = df[~exact_duplicates_to_exclude]
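The same pattern extends to any stricter cutoff on the score. The sketch below uses a small hypothetical DataFrame. The 0.95 threshold is an arbitrary illustration, and the column names follow the ones used in the text, so adjust them to match your exported cleanset:

```python
import pandas as pd

# Hypothetical stand-in for an exported cleanset.
df = pd.DataFrame({
    "text": ["a", "a!", "b", "c", "c"],
    "near_duplicate_score": [0.97, 0.97, 0.20, 1.0, 1.0],
    "near_duplicate_cluster_id": [0, 0, None, 1, 1],
})

threshold = 0.95  # arbitrary example value; tune for your data

# Treat a row as a near duplicate only if its score clears the custom threshold.
is_dup = df["near_duplicate_score"] >= threshold

# Within each cluster, keep the first row and drop later copies that clear the threshold.
extra_copies = is_dup & df["near_duplicate_cluster_id"].duplicated(keep="first")
df_fixed = df[~extra_copies]
print(df_fixed)
```

This sketch assumes cluster IDs are only available for rows that Cleanlab itself flagged, so it is best suited to thresholds stricter than the default determination.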
How do I format the labels for a multi-label dataset?
To run Cleanlab Studio on a multi-label dataset (where each data point can belong to more than one class or none of the classes), you will want to make sure the values in your label column are formatted as a string that contains no whitespace and in which each label value is separated by a comma. If we are, say, using Python with a multi-label dataset in a Pandas DataFrame where the possible classes are 1, 2, 3, then we could format the label column like this:
labels = ['2,3', '2,3', '1,2,3', '2,3', '2,3', '1,2,3', '1,2,3', '2,3', '3', '']
df["label"] = labels
This example shows specifically how to do multi-label formatting with a Python Pandas DataFrame, but many other data formats are supported. Here is a detailed tutorial on how to prepare and format multi-label data in Cleanlab Studio, and you should also refer to the Datasets guide.
Pay particular attention to the subtle formatting difference between unlabeled data points (that have not yet been labeled) and data points that have been labeled but do not belong to any of the classes.
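If your labels currently live as Python lists, a small helper along these lines can produce the comma-separated strings described above. `format_multilabel` is a hypothetical name, and the sketch does not handle the unlabeled case or class names that themselves contain commas or whitespace:

```python
def format_multilabel(labels):
    """Join an iterable of class names into a comma-separated string with no whitespace."""
    return ",".join(str(label).strip() for label in labels)

# Each inner list holds the classes for one data point; an empty list
# means the data point was labeled as belonging to none of the classes.
rows = [[2, 3], [1, 2, 3], [3], []]
formatted = [format_multilabel(row) for row in rows]
print(formatted)
```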
I got a `ValueError: CLI is out of date and must be updated.`
Please upgrade to the latest version via: pip install cleanlab-studio --upgrade
If you are experiencing this issue in Google Colab or a Jupyter notebook, make sure you restart your kernel after upgrading.
If you are still encountering an error after that, ask for help in our Slack community and somebody will resolve your issue promptly.
Cleanlab Studio CLI
I got a No file exists at: {filename} error
Ensure you are calling cleanlab dataset upload from the same directory where your file is located.
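Before invoking the CLI, a quick check like the following can confirm whether the file is visible from your current working directory. `resolve_dataset_path` is a hypothetical helper written for illustration, not part of the Cleanlab CLI:

```python
from pathlib import Path

def resolve_dataset_path(filename: str) -> Path:
    """Return the absolute path of `filename`, raising if no such file exists."""
    path = Path(filename).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"No file at {path} (current directory: {Path.cwd()})"
        )
    return path
```

Calling `resolve_dataset_path("my_dataset.csv")` (a placeholder filename) from the directory where you plan to run the upload either returns the absolute path or reports where it looked.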
I got Error: Resource Not Found
Ensure you are calling cleanlab cleanset download with the cleanset_ID. This can be copied by:
- Inside the project you want to export from, select Export Cleanset
- Select Export using API
- Click the copy icon next to your cleanset_ID