Cleanlab Columns

Cleanlab Studio runs many analyses on your dataset to find data issues and provide other useful metadata, including detection of: label issues, outliers, ambiguous examples, well-labeled examples, and more. New metadata columns are continuously being added to Cleanlab Studio, both in the web app and in the API, where they are available through download_cleanlab_columns(). Read this page to learn about the different types of data issues that Cleanlab Studio can automatically detect in your datasets.

Most Cleanlab analyses (label issue, outlier, etc.) produce two columns:

  • A numeric score between 0 and 1 indicating the magnitude of the attribute. For example, an outlier score of 0.98 is an extreme outlier, and a label issue score of 0.2 is not a severe label issue. Example: label_issue_score.
  • A boolean flag indicating whether the attribute is estimated to be true or not. When both the score and boolean are present, the boolean is computed based on thresholding the score, with the threshold chosen intelligently by Cleanlab Studio (we’ve extensively benchmarked these scores on a large number of datasets). For example, a data point with the outlier value True is likely an outlier that warrants your attention (as it may have an outsized impact on downstream models or indicate problems with the data sources). Example: is_label_issue.

For most applications, you should use the boolean attribute rather than manually thresholding the scores yourself, and use the score only for sorting. All analyses run independently (so, e.g., a data point can be flagged as both a label issue and an outlier). Projects created in Regular mode may have more Cleanlab columns computed than Projects created in Fast mode.
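
For instance, here is a minimal sketch (in Python) of fetching the Cleanlab columns and applying this boolean-then-score pattern. The API key and cleanset ID below are placeholders you would substitute with your own:

```python
# Minimal sketch: fetch Cleanlab columns, then filter and sort them.
# "YOUR_API_KEY" and "YOUR_CLEANSET_ID" are placeholders; this assumes
# download_cleanlab_columns() returns a pandas DataFrame with one row
# per data point.
from cleanlab_studio import Studio

studio = Studio("YOUR_API_KEY")
cleanlab_cols = studio.download_cleanlab_columns("YOUR_CLEANSET_ID")

# Use the boolean flag to select issues (recommended over manual thresholds)...
label_issues = cleanlab_cols[cleanlab_cols["is_label_issue"]]

# ...and use the numeric score only for sorting, e.g. most severe first.
label_issues = label_issues.sort_values("label_issue_score", ascending=False)
print(f"{len(label_issues)} data points flagged as label issues")
```

The later code sketches on this page reuse this `cleanlab_cols` DataFrame.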

| Analysis Type | Column Name | Type |
| --- | --- | --- |
| Label Issues | is_label_issue | Boolean |
| | label_issue_score | Float |
| | suggested_label | String or Integer |
| | suggested_label_confidence_score | Float |
| Ambiguous Examples | is_ambiguous | Boolean |
| | ambiguous_score | Float |
| Well-labeled Examples | is_well_labeled | Boolean |
| Near Duplicates | is_near_duplicate | Boolean |
| | near_duplicate_score | Float |
| | near_duplicate_cluster_id | Integer |
| Outliers | is_outlier | Boolean |
| | outlier_score | Float |

The following analyses are only run on text data:

| Analysis Type | Column Name | Type | Example |
| --- | --- | --- | --- |
| Personally Identifiable Information | is_PII | Boolean | hello@cleanlab.com, 242-123-4567 |
| | PII_score | Float | |
| | PII_types | List | |
| | PII_items | List | |
| Non-English Text | is_non_english | Boolean | 404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body> |
| | non_english_score | Float | |
| | predicted_language | String | |
| Informal Text | is_informal | Boolean | ‘sup my bro! how was the weekend at d beach? |
| | informal_score | Float | |
| Toxic Language | is_toxic | Boolean | F**k you! |
| | toxic_score | Float | |
| Text Sentiment | sentiment_score | Float | |
| Text Length | is_empty_text | Boolean | |
| | text_num_characters | Integer | |

The following analyses are only run on image data:

| Analysis Type | Column Name | Type |
| --- | --- | --- |
| Dark images | is_dark | Boolean |
| | dark_score | Float |
| Light images | is_light | Boolean |
| | light_score | Float |
| Blurry images | is_blurry | Boolean |
| | blurry_score | Float |
| Low information images | is_low_information | Boolean |
| | low_information_score | Float |
| Odd aspect ratio images | is_odd_aspect_ratio | Boolean |
| | odd_aspect_ratio_score | Float |
| Grayscale images | is_grayscale | Boolean |
| Oddly sized images | is_odd_size | Boolean |
| | odd_size_score | Float |
| Aesthetic images | aesthetic_score | Float |
| NSFW images | is_NSFW | Boolean |
| | NSFW_score | Float |
| Not Analyzed | is_not_analyzed | Boolean |

Label Issues

Label issues are data points that are likely mislabeled.

is_label_issue

Contains a boolean value, with True indicating that the data point has likely been labeled incorrectly in the original dataset.

label_issue_score

Contains a score between 0 and 1. The higher the score of a data point, the more likely it is mislabeled in the original dataset.

suggested_label

Contains an alternative suggested label that appears more appropriate for the data point than the original given label. The suggested label will be null for data points not flagged with a label issue (is_label_issue is False). The type of this column matches the type of the given label column (string or integer).

suggested_label_confidence_score

Contains a score between 0 and 1 for data points that have a suggested_label. The higher the score, the more confident we are in the suggested label. For multi-label datasets, the confidence score is aggregated over all the suggested labels.
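
Putting these four columns together, one possible correction workflow is to overwrite flagged labels with Cleanlab's suggestions only when the confidence is high. This is an illustrative sketch, not a prescribed method: the 0.8 threshold is an arbitrary example, and `df`/`label` are hypothetical names for your dataset and its label column.

```python
def apply_suggested_labels(df, cleanlab_cols, label_col="label", min_confidence=0.8):
    """Replace likely-wrong labels with Cleanlab's suggested labels.

    Only rows flagged as label issues whose suggestion meets the
    (illustrative) confidence threshold are changed; both DataFrames
    are assumed to share the same index.
    """
    fix = cleanlab_cols["is_label_issue"] & (
        cleanlab_cols["suggested_label_confidence_score"] >= min_confidence
    )
    corrected = df.copy()
    corrected.loc[fix, label_col] = cleanlab_cols.loc[fix, "suggested_label"]
    return corrected
```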

Ambiguous

Ambiguous data points are not well-described by any class in the dataset. Even many different human annotators may disagree on the best class label for such data points. Such data might be appropriate to exclude from a dataset or have your experts review. The presence of ambiguous data points may also indicate that the class definitions were unclear in the original annotation instructions; if many data points are flagged as ambiguous, consider defining the distinctions between the affected classes more clearly.

is_ambiguous

Contains a boolean value, with True indicating that the data point appears ambiguous.

ambiguous_score

Contains a score between 0 and 1. The higher the score of a data point, the more ambiguous the data point (i.e. the more we anticipate multiple human annotators would disagree on how to properly label this data point).

Well-Labeled

Well-labeled data points are confidently estimated to already have the correct given label in the original dataset. These are good representative examples of their given class. Well-labeled examples can safely be used in downstream tasks where high quality is required. If manually reviewing a dataset to ensure the highest quality, you can safely skip reviewing these Well-labeled data points, which have already been validated by Cleanlab’s AI. Learn more.

is_well_labeled

Contains a boolean value, with True indicating that the given label of the data point is highly likely to be correct.
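
For example, when building a manual-review queue, you might skip well-labeled rows and review the rest in order of suspicion (a sketch reusing the `cleanlab_cols` DataFrame from above):

```python
# Skip rows Cleanlab already validated; review the most suspicious first.
needs_review = cleanlab_cols[~cleanlab_cols["is_well_labeled"]].sort_values(
    "label_issue_score", ascending=False
)
```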

Outliers

Outliers are atypical data points that significantly differ from other data. Such anomalous data might be appropriate to exclude from a dataset. Outliers may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources.

Note: Outliers are not currently provided for multi-label tabular datasets.

is_outlier

Contains a boolean value, with True indicating that the data point is identified as an outlier.

outlier_score

Contains a score between 0 and 1. The higher the score of a data point, the more atypical it appears to be compared to the rest of the dataset (more extreme outlier).

Near Duplicates

Near duplicate data points are those that are identical or near-identical to other examples in the dataset. Extra duplicate copies of data might be appropriate to exclude from a dataset. They may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources.

Note: Near-duplicates are not currently provided for tabular datasets.

is_near_duplicate

Contains a boolean value, with True indicating that this data point is a (near or exact) duplicate of another data point in the dataset.

near_duplicate_score

Contains a score between 0 and 1. The higher the score of a data point, the more similar it is to some other data point in the dataset.

near_duplicate_cluster_id

Contains a cluster ID for each data point, where data points with the same ID belong to the same set of near duplicates. Use this to determine which data points are near duplicates of one another. The cluster IDs are sequential integers: 0, 1, 2, … Data points that do not have near duplicates (where is_near_duplicate is False) are assigned an ID of <NA>.
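
For instance, here is one way (a sketch, reusing `cleanlab_cols` from above) to deduplicate a dataset by keeping a single representative from each near-duplicate cluster:

```python
import pandas as pd

# Rows with near duplicates share a cluster ID; keep one row per cluster.
dupes = cleanlab_cols[cleanlab_cols["is_near_duplicate"]]
representatives = dupes.groupby("near_duplicate_cluster_id").head(1)

# Rows without near duplicates (cluster ID <NA>) are kept as-is.
non_dupes = cleanlab_cols[~cleanlab_cols["is_near_duplicate"]]
deduplicated = pd.concat([non_dupes, representatives]).sort_index()
```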

Columns specific to Text Data

Each of the values listed in this section is computed independently for each text field (it does not depend on the other text fields in the dataset, unlike, say, duplicates or label issues, which depend on statistics of the full dataset).

Personally Identifiable Information (PII)

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Names, addresses, and phone numbers are common examples of PII.

Note: PII detection is currently only provided for text datasets from Projects run in Regular mode.

is_PII

Contains a boolean value, with True indicating that the text contains PII.

PII_score

Contains a score between 0 and 1. The higher the score of a data point, the more sensitive the PII detected.

  • Low sensitivity: URL, social media username
  • Moderate sensitivity: phone number, email address
  • High sensitivity: social security number, credit card number

PII_types

Contains a comma-separated list of the types of PII detected in this text. Possible types include: social security number, credit card number, phone number, email, social media username, URL.

PII_items

Contains a comma-separated list of the PII items detected in the text. For example: abc@gmail.com, 242-139-4346, etc.
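
As an illustration, you could escalate only texts containing high-sensitivity PII types. This sketch handles PII_types whether each entry arrives as a Python list or as a comma-separated string (which may depend on how you load the columns):

```python
HIGH_SENSITIVITY = {"social security number", "credit card number"}

def has_high_sensitivity_pii(types):
    """True if any detected PII type is in the high-sensitivity set."""
    if isinstance(types, str):
        types = [t.strip() for t in types.split(",")]
    elif not isinstance(types, list):
        return False  # e.g. missing value when no PII was detected
    return bool(HIGH_SENSITIVITY & set(types))

sensitive = cleanlab_cols[cleanlab_cols["PII_types"].apply(has_high_sensitivity_pii)]
```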

Non-English Text

Non-English text may include foreign languages or nonsensical characters.

Note: Non-English text detection is currently only provided for text datasets from Projects run in Regular mode.

is_non_english

Contains a boolean value, with True indicating that the text is not written in English.

non_english_score

Contains a score between 0 and 1. The higher the score of a data point, the less the text resembles proper English. High scores may correspond to text from other languages, or text contaminated by non-language strings (e.g. HTML/XML tags, identifiers, hashes, random characters).

predicted_language

Contains the predicted language of the text if it is identified as non-English. The language is only predicted if we are confident that the text is written in another language (i.e. it is not just nonsensical characters); otherwise this column is assigned <NA>.

Informal Text

Informal text contains casual language or language errors such as improper grammar or spelling.

Note: Informal text detection is currently only provided for text datasets from Projects run in Regular mode.

is_informal

Contains a boolean value, with True indicating that the text contains informal language.

informal_score

Contains a score between 0 and 1. Data points with lower scores correspond to text written in a more formal style with fewer grammar/spelling errors.

Toxic Language

Text that contains hateful speech, harmful language, aggression, or related toxic elements.

Note: Toxicity detection is currently only provided for text datasets from Projects run in Regular mode.

is_toxic

Contains a boolean value, with True indicating that the given text likely contains toxic language.

toxic_score

Contains a score between 0 and 1. The higher the score of a data point, the more hateful the text appears.

Text Sentiment

Sentiment refers to how positive or negative the tone conveyed in some text sounds (i.e. how the author seems to feel about the topic).

Note: Sentiment is only estimated for text datasets from Projects run in Regular mode.

sentiment_score

Contains a score between 0 and 1. Higher scores correspond to text with stronger positive sentiments (positive opinions conveyed about the topic), while lower scores correspond to text with stronger negative sentiments. Text conveying a neutral opinion will receive sentiment scores around 0.5.

Note: Unlike many of Cleanlab’s other issue scores, higher sentiment scores may correspond to better text examples in your dataset.

Text Length

Texts that are unusually short or long could be anomalies in a dataset.

is_empty_text

Contains a boolean value, with True indicating that the given text field is empty (there is no text associated with this data point).

text_num_characters

The length of each text field (how many characters), useful for detecting data points with anomalous text length.
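
Cleanlab does not flag length outliers for you, but text_num_characters makes it easy to do so yourself. Here is one conventional approach (an IQR fence; this choice is part of the sketch, not something Cleanlab prescribes):

```python
# Flag texts whose character counts fall outside an IQR-based fence.
lengths = cleanlab_cols["text_num_characters"]
q1, q3 = lengths.quantile([0.25, 0.75])
iqr = q3 - q1
unusual_length = cleanlab_cols[
    (lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)
]

# Empty texts are flagged directly by Cleanlab.
empty = cleanlab_cols[cleanlab_cols["is_empty_text"]]
```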

Columns specific to Image Data

Cleanlab Studio performs various types of image-specific analyses that are independent of the machine learning task, annotations, and other factors. These analyses focus solely on the images within the dataset. This metadata plays a crucial role in identifying and removing low-quality or corrupted images from your dataset, which can adversely affect model performance.

Dark images

These refer to images with insufficient brightness, appearing dim or underexposed, resulting in a lack of detail and clarity.

is_dark

Boolean value. True indicates that the image is classified as a dark image.

dark_score

A score between 0 and 1, representing the darkness of the image. Higher scores indicate darker images.

Light images

These are images that are excessively overexposed or washed out due to an abundance of light or high brightness levels.

is_light

Boolean value. True indicates that the image is classified as a light image.

light_score

A score between 0 and 1, indicating the brightness level of the image. Higher scores suggest brighter images.

Blurry images

These are images that lack sharpness and clarity, causing the subjects or details to appear hazy, out of focus, or indistinct.

is_blurry

Boolean value. True indicates that the image is classified as a blurry image.

blurry_score

A score between 0 and 1, representing the degree of blurriness in the image. Higher scores indicate a more blurry image.

Low information images

These are images lacking content (low entropy in their pixel values).

is_low_information

Boolean value. True indicates that the image is classified as a low information image.

low_information_score

A score between 0 and 1, representing the severity of the image’s lack of information. Higher scores correspond to images containing less information.

Grayscale images

These are images lacking color.

is_grayscale

Boolean value. True indicates that the image is classified as a grayscale image.

Odd Aspect Ratio images

These are images with an unusual aspect ratio (highly asymmetric width and height).

is_odd_aspect_ratio

Boolean value. True indicates that the image is classified as having an odd aspect ratio.

odd_aspect_ratio_score

A score between 0 and 1, representing the degree of irregularity in the image’s aspect ratio. Higher scores correspond to images whose aspect ratio is more extreme.

Odd sized images

These are images that are unusually small relative to the rest of the images in the dataset.

is_odd_size

Boolean value. True indicates that the image is classified as an image with an odd size.

odd_size_score

A score between 0 and 1, indicating the degree of irregularity in the image’s size. Higher scores indicate images whose size is more unusual compared to the other images in the dataset.

Aesthetic images

These are images which are visually appealing (as rated by most people, although this is highly subjective). They might be artistic, beautiful photographs, or contain otherwise interesting content.

aesthetic_score

A score between 0 and 1, representing how aesthetic the image is estimated to be. If training Generative AI models on your dataset, consider first filtering out the images with low aesthetic_score.

Note: Higher aesthetic scores correspond to higher-quality images in the dataset (unlike many of Cleanlab’s other issue scores).

NSFW images

NSFW (Not Safe For Work) images refer to any visual content that is not suitable for viewing in a professional or public environment, typically due to its explicit, pornographic, or graphic nature. These images may contain nudity, sexual acts, or other content that is considered inappropriate for certain settings, such as workplaces or public spaces.

is_NSFW

Boolean value. True indicates that the image depicts NSFW content.

NSFW_score

A score between 0 and 1, representing the severity of the issue. Higher scores correspond to images more likely to be considered inappropriate.

Not Analyzed

For image projects only, a few of the images in the dataset might fail to be processed due to reasons such as poorly formatted data or invalid image file paths. None of our other analyses will be run on such images, and all the other Cleanlab columns will be filled with default values (e.g. False for all the boolean issue columns). Thus we provide a boolean column so such rows can be easily filtered out.

is_not_analyzed

Contains a boolean value, with True indicating that this data point was not analyzed because its image failed to be processed. Images may fail to process due to an unsupported file format or corrupted image data. Consider using a more standard file format for each image that failed to be processed.
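
A typical last step before relying on the other Cleanlab columns is therefore to drop these rows (a sketch reusing `cleanlab_cols` from above):

```python
# Exclude rows whose images failed to process, since their other
# Cleanlab columns only hold default values.
analyzed = cleanlab_cols[~cleanlab_cols["is_not_analyzed"]]
```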