Datasets

A Dataset corresponds to your original text/image/tabular data.

Cleanlab Studio can analyze and train/deploy models on diverse types of datasets. This page outlines how to format your data and the available options.

[Image: Upload Dataset interface]

Modalities

Cleanlab Studio supports datasets of the following modalities:

  • Tabular (structured data stored in tables with numeric/categorical/string columns, e.g. financial reports, sensor measurements)
  • Text (e.g. customer service requests, reports, LLM outputs)
  • Image (e.g. product images, photographs, satellite imagery)

Formats

Text/Tabular

Cleanlab Studio natively supports the below formats for text and tabular datasets. Our tutorial shows how to convert other common data formats into one of these native formats.

CSV

CSV is a standard format for text/tabular data, and most tools that process tabular data can export CSV files. Make sure your file is a standard CSV: use , as the delimiter and " as the quote character (usually the default).

The first row of your CSV should contain headers naming all of the columns. Cleanlab Studio requires that tabular data is non-jagged: each row should contain the same number of columns. If some values are missing, the entries for those columns can be left blank (but the columns shouldn’t be missing entirely).

Unlabeled data. For a multi-class dataset (where each data point belongs to exactly one class), you can indicate which data points are unlabeled by leaving their values in the label column blank. For a multi-label classification dataset (where each data point can belong to multiple classes, e.g. image/document tagging), you must use the JSON format described in the Multi-label Data section to indicate unlabeled data points. (Note: an unlabeled data point in multi-label classification is one whose label remains unknown, NOT the same as a data point that has been labeled as not belonging to any of the classes/tags.)

Example CSV text dataset
review_id,review,label
f3ac,The sales rep was fantastic!,positive
d7c4,He was a bit wishy-washy.,negative
439a,"They kept using the word ""obviously,"" which was off-putting.",positive
a53f,,negative
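
The CSV requirements above (standard quoting, a header row, and the same number of fields in every row) can be checked before uploading. The following is a minimal sketch using only Python's standard csv module; the helper names to_csv and jagged_rows are hypothetical and are not part of Cleanlab Studio.

```python
import csv
import io

# Example records; None stands for a missing value and is written as a blank field.
rows = [
    {"review_id": "f3ac", "review": "The sales rep was fantastic!", "label": "positive"},
    {"review_id": "439a",
     "review": 'They kept using the word "obviously," which was off-putting.',
     "label": "positive"},
    {"review_id": "a53f", "review": None, "label": "negative"},
]


def to_csv(records, fieldnames):
    """Serialize records as a standard CSV (',' delimiter, '"' quote character)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # fields containing commas or quotes are quoted automatically
    return buf.getvalue()


def jagged_rows(csv_text):
    """Return the 1-based numbers of data rows whose field count differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [i for i, row in enumerate(reader, start=1) if len(row) != len(header)]


csv_text = to_csv(rows, ["review_id", "review", "label"])
print(csv_text)
print("jagged rows:", jagged_rows(csv_text))
```

Letting the csv writer handle quoting avoids the most common formatting problem: a text field that itself contains a comma or a double quote.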

Excel

Cleanlab Studio supports Microsoft Excel files. The first sheet will be imported as your dataset. The first row of your sheet should contain names for all of the columns.

Unlabeled data. For a multi-class dataset (where each data point belongs to exactly one class), you can indicate which data points are unlabeled by leaving their values in the label column blank. For a multi-label classification dataset (where each data point can belong to multiple classes, e.g. image/document tagging), you must use the JSON format described in the Multi-label Data section to indicate unlabeled data points. (Note: an unlabeled data point in multi-label classification is one whose label remains unknown, NOT the same as a data point that has been labeled as not belonging to any of the classes/tags.)

Example Excel text dataset
|   | review_id | review | label |
| --- | --- | --- | --- |
| 0 | f3ac | The sales rep was fantastic! | positive |
| 1 | d7c4 | He was a bit wishy-washy. | negative |
| 2 | 439a | They kept using the word “obviously,” which wa... | positive |
| 3 | a53f |  | negative |

JSON

JSON is a standard data interchange format. Cleanlab Studio expects text/tabular data encoded as a JSON array of JSON objects.

All objects in your dataset must contain the same set of keys; if a value is missing from one of your rows of data, map it to null.

Unlabeled data. For a multi-class dataset (where each data point belongs to exactly one class), you can indicate which data points are unlabeled by setting their values for the label key to one of: null, pd.NA, None, or "" (empty string). For a multi-label classification dataset (where each data point can belong to multiple classes, e.g. image/document tagging), use the format described in the Multi-label Data section to indicate unlabeled data points. (Note: an unlabeled data point in multi-label classification is one whose label remains unknown, NOT the same as a data point that has been labeled as not belonging to any of the classes/tags.)

Note: Cleanlab Studio doesn’t support nested JSON structures; you must flatten them.

Example JSON text dataset
[
  {
    "review_id": "f3ac",
    "review": "The sales rep was fantastic!",
    "label": "positive"
  },
  {
    "review_id": "d7c4",
    "review": "He was a bit wishy-washy.",
    "label": "negative"
  },
  {
    "review_id": "439a",
    "review": "They kept using the word 'obviously,' which was off-putting.",
    "label": "positive"
  },
  {
    "review_id": "a53f",
    "review": null,
    "label": "negative"
  }
]
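
Because nested structures must be flattened and every object must carry the same set of keys, a small preprocessing step can help. The sketch below is one possible approach using only the standard library. The dotted key names (such as meta.lang) and the choice to serialize lists as JSON strings are assumptions made for illustration, not Cleanlab Studio conventions.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            # Assumption: store lists as a JSON string so the value stays a primitive.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def normalize(records):
    """Flatten every record and give all records the same keys (missing -> None)."""
    flat = [flatten(r) for r in records]
    keys = []
    for r in flat:
        for k in r:
            if k not in keys:
                keys.append(k)
    return [{k: r.get(k) for k in keys} for r in flat]


nested = [
    {"review_id": "f3ac", "meta": {"lang": "en", "source": {"channel": "email"}}},
    {"review_id": "d7c4", "meta": {"lang": "fr"}},
]
# None is written as null by json.dumps, matching the missing-value rule above.
print(json.dumps(normalize(nested), indent=2))
```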

Pandas/PySpark DataFrame

Cleanlab Studio’s Python API supports a number of DataFrame formats, including Pandas DataFrames and PySpark DataFrames. You can upload directly from a DataFrame in a Python script or Jupyter notebook.

Example DataFrame text dataset
|   | review_id | review | label |
| --- | --- | --- | --- |
| 0 | f3ac | The sales rep was fantastic! | positive |
| 1 | d7c4 | He was a bit wishy-washy. | negative |
| 2 | 439a | They kept using the word “obviously,” which wa... | positive |
| 3 | a53f |  | negative |

Image

Image datasets are natively supported in the formats below. Our image data formatting tutorial shows how to convert other common data formats into one of these native formats.

Simple ZIP

Images can be uploaded in ZIP file format, with a folder for each possible class label and image files in each folder. The folder names are interpreted as class labels for the images within the folder.
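
As an illustration of this layout, the sketch below builds such an archive in memory with Python's zipfile module. The function name build_simple_zip and the optional root parameter are hypothetical; the text above does not say whether the class folders should sit inside a parent folder, so the default places them at the top of the archive.

```python
import io
import zipfile


def build_simple_zip(images_by_label, root=""):
    """Build a ZIP with one folder per class label.

    images_by_label maps a label to a dict of {filename: image bytes}.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for label, files in images_by_label.items():
            for filename, data in files.items():
                # The folder name is what gets interpreted as the class label.
                path = f"{label}/{filename}"
                if root:
                    path = f"{root}/{path}"
                zf.writestr(path, data)
    return buf.getvalue()


# Placeholder bytes stand in for real image files.
archive = build_simple_zip({
    "cat": {"img_001.png": b"placeholder"},
    "dog": {"img_002.png": b"placeholder"},
})
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    for name in zf.namelist():
        print(name)
```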

[Image: Simple Zip Folder]

Metadata ZIP

Images can also be uploaded in ZIP file format, with an associated CSV file that must be named metadata.csv and placed at the top-level of the zipped directory. This file contains mappings between relative filepaths and class labels. The images in the ZIP can be in an arbitrary layout.

Key details about the metadata.csv file (see image below of what an example metadata file might look like):

  • Must be formatted as a standard CSV (e.g. use , as the delimiter and " as the quote character, as is typically the default).
  • Must have one column that contains the filepath to each image. Filepaths must include the parent folder (cifar10_cl/ in the example below).
  • Must have one column that contains the label for each image.
  • May contain other columns with additional metadata that you’d like to see or filter by.
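
The requirements listed above can be sketched in Python as follows. The column names filepath and label, the helper name build_metadata_zip, and the reading of "top-level of the zipped directory" as "directly inside the parent folder" are all assumptions made for illustration.

```python
import csv
import io
import zipfile


def build_metadata_zip(parent, images):
    """Build a ZIP holding images in an arbitrary layout plus a metadata.csv.

    images is a list of (path relative to parent, label, image bytes).
    """
    meta = io.StringIO()
    writer = csv.writer(meta, lineterminator="\n")
    writer.writerow(["filepath", "label"])  # assumed column names
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel_path, label, data in images:
            full_path = f"{parent}/{rel_path}"  # filepaths include the parent folder
            zf.writestr(full_path, data)
            writer.writerow([full_path, label])
        zf.writestr(f"{parent}/metadata.csv", meta.getvalue())
    return buf.getvalue()


# Placeholder bytes stand in for real image files.
archive = build_metadata_zip("cifar10_cl", [
    ("batch_a/0001.png", "frog", b"placeholder"),
    ("batch_b/0002.png", "truck", b"placeholder"),
])
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    print(zf.read("cifar10_cl/metadata.csv").decode())
```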
[Image: Metadata Zip Folder]

External Media

Images can be supplied using public URLs to the raw image files (e.g. in cloud storage) with metadata in any of our supported tabular formats (CSV, JSON, XLS/XLSX, DataFrame). See above for the description of the metadata file. If using a CSV, ensure that it is formatted as a standard CSV (e.g. use , as the delimiter and " as the quote character). If using JSON, ensure that it is encoded as a JSON array of JSON objects, where each object has the same set of keys. Values must be primitives: Cleanlab Studio doesn’t support nested JSON structures, so you must flatten them.

One of the metadata columns should contain a sequence of URLs, each pointing to a single hosted image. These URLs must be either public or pre-signed; no additional authentication can be required to access the images. Your dataset can contain arbitrary other columns, in addition to the image and label columns.

Example external media image dataset
|   | id | img | label |
| --- | --- | --- | --- |
| 0 | 0 | https://s.cleanlab.ai/DCA_Competition_2023_Dat... | c |
| 1 | 1 | https://s.cleanlab.ai/DCA_Competition_2023_Dat... | h |
| 2 | 2 | https://s.cleanlab.ai/DCA_Competition_2023_Dat... | y |
| 3 | 3 | https://s.cleanlab.ai/DCA_Competition_2023_Dat... | p |
| 4 | 4 | https://s.cleanlab.ai/DCA_Competition_2023_Dat... | j |

Multi-label Data

Cleanlab Studio requires a particular format for multi-label classification datasets (where each data point can belong to multiple classes, e.g. image/document tagging). Specifically, the label column must be of type string and formatted as a comma-separated string of classes, e.g. “wearing_hat,has_glasses” (note: no whitespace).

To upload a multi-label dataset that contains some unlabeled data points and some data points that belong to none of the classes, use JSON file format. Unlabeled data points are those that simply have not been annotated yet (the classes to which they belong remain unknown and should be labeled), unlike data points purposefully assigned an empty label because they belong to none of the classes. Important: For a multi-label dataset like image/document tagging, you cannot specify a certain image/document has none of the tags in CSV format (must use JSON file format instead).

To represent these two types of data points in the JSON file format, use the following values for the label key:

  • empty labels: "" (empty string)
  • unlabeled: null, pd.NA, or None

To upload a multi-label text or tabular dataset, upload a JSON file formatted like the example below. To upload a multi-label image dataset, upload the ZIP containing the image files and a metadata.json file containing the labels in the format shown above. The same format can be used for datasets uploaded via the Python API.

Example JSON multi-label text dataset

In this example dataset each data point is a text review. The first data point f3ac is a data point with an empty label (has been annotated as belonging to none of the classes), while the last review a53f is an unlabeled data point (that has not yet been annotated). Note the difference between their label values to distinguish these cases.

[
  {
    "review_id": "f3ac",
    "review": "The message was sent yesterday.",
    "label": ""
  },
  {
    "review_id": "d7c4",
    "review": "He was a bit rude to the staff.",
    "label": "negative,rude,mean"
  },
  {
    "review_id": "439a",
    "review": "They provided a wonderful experience that made us very happy.",
    "label": "positive,happy,joy"
  },
  {
    "review_id": "a53f",
    "review": "Please let her know I appreciated the hospitality.",
    "label": null
  }
]
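
When generating such a file programmatically, the three cases (tagged, empty label, unlabeled) can be produced from a list of tags. This is a minimal sketch with a hypothetical helper named encode_label; it relies on json.dumps writing Python's None as null.

```python
import json
from typing import Optional, Sequence


def encode_label(tags: Optional[Sequence[str]]) -> Optional[str]:
    """Encode a multi-label annotation for the label key.

    None      -> None (written as null): not annotated yet
    []        -> "": annotated as belonging to none of the classes
    [a, b, c] -> "a,b,c": comma-separated, no whitespace
    """
    if tags is None:
        return None
    cleaned = [t.strip() for t in tags]
    if any(t == "" or "," in t for t in cleaned):
        raise ValueError("class names must be non-empty and must not contain commas")
    return ",".join(cleaned)


records = [
    {"review_id": "f3ac", "label": encode_label([])},
    {"review_id": "d7c4", "label": encode_label(["negative", "rude", "mean"])},
    {"review_id": "a53f", "label": encode_label(None)},
]
print(json.dumps(records, indent=2))
```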

Other Things to Consider

  1. If you only have empty labels, but no unlabeled data, you still need to provide the file in JSON format and set the labels to "". Empty string labels in CSV format will be interpreted as unlabeled (there is no way to specify data points that do not belong to any of the classes in CSV format).
  2. This distinction does not affect multi-class datasets. You can upload a JSON file with null and "" labels and both will be interpreted as unlabeled data.
  3. Cleanlab Studio doesn’t support nested JSON structures; you must flatten them.

Schemas

Schemas define the data type of each field in your dataset. Cleanlab Studio infers these data types automatically, but you may sometimes want to override the inferred types and manually specify that a column contains a particular type of value.

Cleanlab Studio supports the following data types:

| Data type | Feature type |
| --- | --- |
| string | text, categorical, datetime, identifier, filepath |
| integer | categorical, datetime, identifier, numeric |
| float | datetime, numeric |
| boolean | boolean |

Schema Overrides

To manually override the schema inferred for your dataset, pass in a schema override. The format of a schema override is as follows:

{
  "<column_name>": {
    "data_type": "<override_data_type>",
    "feature_type": "<override_feature_type>"
  }
  ...
}

In the Python API, you can provide a partial schema override that specifies data types for only a subset of columns. In the CLI, you must provide a full schema containing every column you wish to include from your dataset.
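
As an illustration, a schema override can be assembled as a plain Python dict and checked against the data type / feature type combinations listed in the table above before it is passed along. The helper check_override and the column names below are hypothetical; this sketch only validates the structure and does not call Cleanlab Studio.

```python
import json

# Allowed feature types per data type, copied from the table above.
ALLOWED = {
    "string": {"text", "categorical", "datetime", "identifier", "filepath"},
    "integer": {"categorical", "datetime", "identifier", "numeric"},
    "float": {"datetime", "numeric"},
    "boolean": {"boolean"},
}


def check_override(override):
    """Return a list of problems found in a schema override dict (empty if none)."""
    problems = []
    for column, spec in override.items():
        data_type = spec.get("data_type")
        feature_type = spec.get("feature_type")
        if data_type not in ALLOWED:
            problems.append(f"{column}: unknown data_type {data_type!r}")
        elif feature_type not in ALLOWED[data_type]:
            problems.append(
                f"{column}: feature_type {feature_type!r} not valid for {data_type}"
            )
    return problems


# A partial override covering two example columns.
override = {
    "review_id": {"data_type": "string", "feature_type": "identifier"},
    "created_at": {"data_type": "integer", "feature_type": "datetime"},
}
print(json.dumps(override, indent=2))
print("problems:", check_override(override))
```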

Supported Formats for Datetime Feature Type

Integer/Float Data Type

Unix timestamps of type integer or float are supported as datetime features.
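
For reference, the sketch below converts a Python datetime to integer and float Unix timestamps. It assumes the timestamps are expressed in seconds since the epoch; the unit is not stated above, so verify it against your own data.

```python
from datetime import datetime, timezone

# A timezone-aware moment in UTC.
moment = datetime(2021, 7, 31, 2, 58, 1, tzinfo=timezone.utc)

as_integer = int(moment.timestamp())  # whole seconds since 1970-01-01 UTC
as_float = moment.timestamp() + 0.5   # seconds with a fractional part

# Converting back recovers the original moment from the integer value.
restored = datetime.fromtimestamp(as_integer, tz=timezone.utc)
print(as_integer, as_float, restored.isoformat())
```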

String Data Type

The following formats are supported for string data types. See the chrono crate documentation for more information on how to interpret format strings. Note: only Day_Month_Year, Year_Month_Day, and Year_Month_Day_Timezone order are currently supported (not Month_Day_Year).

| Category | Format | Example |
| --- | --- | --- |
| D_M_Y | %d-%m-%Y %H:%M:%S | 31-07-2021 02:58:01 |
| D_M_Y | %d-%m-%Y | 31-07-2021 |
| Y_M_D | %Y/%m/%dT%H:%M:%S | 2021/07/31T02:58:01 |
| Y_M_D | %Y-%m-%dT%H:%M:%S | 2021-07-31T02:58:01 |
| Y_M_D | %Y/%m/%dT%H%M%S | 2021/07/31T025801 |
| Y_M_D | %Y-%m-%dT%H%M%S | 2021-07-31T025801 |
| Y_M_D | %Y/%m/%dT%H:%M | 2021/07/31T02:58 |
| Y_M_D | %Y-%m-%dT%H:%M | 2021-07-31T02:58 |
| Y_M_D | %Y/%m/%dT%H%M | 2021/07/31T0258 |
| Y_M_D | %Y-%m-%dT%H%M | 2021-07-31T0258 |
| Y_M_D | %Y/%m/%dT%H:%M:%S.%9f | 2021/07/31T02:58:01.555000000 |
| Y_M_D | %Y-%m-%dT%H:%M:%S.%9f | 2021-07-31T02:58:01.555000000 |
| Y_M_D | %Y/%m/%dT%H:%M:%S.%6f | 2021/07/31T02:58:01.555000 |
| Y_M_D | %Y-%m-%dT%H:%M:%S.%6f | 2021-07-31T02:58:01.555000 |
| Y_M_D | %Y/%m/%dT%H:%M:%S.%3f | 2021/07/31T02:58:01.555 |
| Y_M_D | %Y-%m-%dT%H:%M:%S.%3f | 2021-07-31T02:58:01.555 |
| Y_M_D | %Y/%m/%dT%H%M%S.%9f | 2021/07/31T025801.555000000 |
| Y_M_D | %Y-%m-%dT%H%M%S.%9f | 2021-07-31T025801.555000000 |
| Y_M_D | %Y/%m/%dT%H%M%S.%6f | 2021/07/31T025801.555000 |
| Y_M_D | %Y-%m-%dT%H%M%S.%6f | 2021-07-31T025801.555000 |
| Y_M_D | %Y/%m/%dT%H%M%S.%3f | 2021/07/31T025801.555 |
| Y_M_D | %Y-%m-%dT%H%M%S.%3f | 2021-07-31T025801.555 |
| Y_M_D | %Y/%m/%d | 2021/07/31 |
| Y_M_D | %Y-%m-%d | 2021-07-31 |
| Y_M_D | %Y/%m/%d %H:%M:%S | 2021/07/31 02:58:01 |
| Y_M_D | %Y-%m-%d %H:%M:%S | 2021-07-31 02:58:01 |
| Y_M_D | %Y/%m/%d %H%M%S | 2021/07/31 025801 |
| Y_M_D | %Y-%m-%d %H%M%S | 2021-07-31 025801 |
| Y_M_D | %Y/%m/%d %H:%M | 2021/07/31 02:58 |
| Y_M_D | %Y-%m-%d %H:%M | 2021-07-31 02:58 |
| Y_M_D | %Y/%m/%d %H%M | 2021/07/31 0258 |
| Y_M_D | %Y-%m-%d %H%M | 2021-07-31 0258 |
| Y_M_D | %Y/%m/%d %H:%M:%S.%9f | 2021/07/31 02:58:01.555000000 |
| Y_M_D | %Y-%m-%d %H:%M:%S.%9f | 2021-07-31 02:58:01.555000000 |
| Y_M_D | %Y/%m/%d %H:%M:%S.%6f | 2021/07/31 02:58:01.555000 |
| Y_M_D | %Y-%m-%d %H:%M:%S.%6f | 2021-07-31 02:58:01.555000 |
| Y_M_D | %Y/%m/%d %H:%M:%S.%3f | 2021/07/31 02:58:01.555 |
| Y_M_D | %Y-%m-%d %H:%M:%S.%3f | 2021-07-31 02:58:01.555 |
| Y_M_D | %Y/%m/%d %H%M%S.%9f | 2021/07/31 025801.555000000 |
| Y_M_D | %Y-%m-%d %H%M%S.%9f | 2021-07-31 025801.555000000 |
| Y_M_D | %Y/%m/%d %H%M%S.%6f | 2021/07/31 025801.555000 |
| Y_M_D | %Y-%m-%d %H%M%S.%6f | 2021-07-31 025801.555000 |
| Y_M_D | %Y/%m/%d %H%M%S.%3f | 2021/07/31 025801.555 |
| Y_M_D | %Y-%m-%d %H%M%S.%3f | 2021-07-31 025801.555 |
| Y_M_D_Z | All supported Y_M_D formats that include times + %#z | 2021/07/31T02:58:01+09 |
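
To get a rough idea of whether a date string falls into one of the supported orderings, it can be tried against a handful of the formats above. The sketch below uses Python's datetime.strptime, which is not the chrono parser and differs in details (for example, Python's %f accepts at most six fractional digits), so it covers only a small subset without fractional seconds or time zones and should be read as an approximation. The helper name classify is hypothetical.

```python
from datetime import datetime

# A small subset of the formats listed above that Python's strptime also understands.
FORMATS = {
    "D_M_Y": ["%d-%m-%Y %H:%M:%S", "%d-%m-%Y"],
    "Y_M_D": [
        "%Y-%m-%dT%H:%M:%S",
        "%Y/%m/%dT%H:%M:%S",
        "%Y-%m-%d %H:%M:%S",
        "%Y/%m/%d %H:%M:%S",
        "%Y-%m-%d",
        "%Y/%m/%d",
    ],
}


def classify(value):
    """Return (category, format) for the first matching format, or None."""
    for category, formats in FORMATS.items():
        for fmt in formats:
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                continue
            return category, fmt
    return None


for sample in ["31-07-2021 02:58:01", "2021/07/31", "07/31/2021"]:
    print(sample, "->", classify(sample))
```

Note that a string such as 07-08-2021 parses as Day_Month_Year even if it was written as Month_Day_Year, so a check like this cannot detect that ambiguity.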