Cleanlab Columns

Cleanlab Studio runs many analyses of your dataset to find data issues and provide other useful metadata, including detection of: label issues, outliers, ambiguous examples, well-labeled examples, and more. New metadata columns are continuously being added to Cleanlab Studio, both in the web app and via the API, where they are available through download_cleanlab_columns(). Read this page to learn about all of the different types of data issues that Cleanlab Studio can automatically detect in your datasets.

Most Cleanlab analyses (label issue, outlier, etc.) produce two columns:

  • A numeric score between 0 and 1 indicating the magnitude of the attribute. For example, an outlier score of 0.98 is an extreme outlier, and a label issue score of 0.2 is not a severe label issue. Example: label_issue_score.
  • A boolean flag indicating whether the attribute is estimated to be true or not. When both the score and boolean are present, the boolean is computed based on thresholding the score, with the threshold chosen intelligently by Cleanlab Studio (we’ve extensively benchmarked these scores on a large number of datasets). For example, a data point with the outlier value True is likely an outlier that warrants your attention (as it may have an outsized impact on downstream models or indicate problems with the data sources). Example: is_label_issue.

For most applications, you should rely on the boolean attribute rather than manually thresholding the scores yourself, and use the score only for sorting. All analyses run independently (so, e.g., a data point can be flagged as both a label issue and an outlier). More Cleanlab columns are computed for Projects created in Regular mode than in Fast mode.
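For example, here is a minimal sketch of this workflow using the Python API (the API key and cleanset ID below are hypothetical placeholders; download_cleanlab_columns() returns the Cleanlab columns as a pandas DataFrame):

```python
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")  # hypothetical placeholder API key
cleanlab_cols = studio.download_cleanlab_columns("<YOUR_CLEANSET_ID>")  # pandas DataFrame of Cleanlab columns

# Use the boolean column to select issues, and the score only to rank them.
label_issues = cleanlab_cols[cleanlab_cols["is_label_issue"]]
worst_first = label_issues.sort_values("label_issue_score", ascending=False)
print(worst_first.head())
```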

| Analysis Type | Column Name | Type |
| --- | --- | --- |
| Label Issues | is_label_issue | Boolean |
| | label_issue_score | Float |
| | suggested_label | String or Integer |
| | suggested_label_confidence_score | Float |
| Ambiguous Examples | is_ambiguous | Boolean |
| | ambiguous_score | Float |
| Well-labeled Examples | is_well_labeled | Boolean |
| Near Duplicates | is_near_duplicate | Boolean |
| | near_duplicate_score | Float |
| | near_duplicate_cluster_id | Integer |
| Outliers | is_outlier | Boolean |
| | outlier_score | Float |

The following analyses are only run on text data:

| Analysis Type | Column Name | Type | Example |
| --- | --- | --- | --- |
| Personally Identifiable Information | is_PII | Boolean | hello@cleanlab.com, 242-123-4567 |
| | PII_score | Float | |
| | PII_types | List | |
| | PII_items | List | |
| Non-English Text | is_non_english | Boolean | 404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body> |
| | non_english_score | Float | |
| | predicted_language | String | |
| Informal Text | is_informal | Boolean | ‘sup my bro! how was the weekend at d beach? |
| | informal_score | Float | |
| | spelling_issue_score | Float | |
| | grammar_issue_score | Float | |
| | slang_issue_score | Float | |
| Toxic Language | is_toxic | Boolean | F**k you! |
| | toxic_score | Float | |
| Text Sentiment | sentiment_score | Float | |
| Biased Language | is_biased | Boolean | He can’t be a nurse, nursing is a profession for women. |
| | bias_score | Float | |
| | gender_bias_score | Float | |
| | racial_bias_score | Float | |
| | sexual_orientation_bias_score | Float | |
| Text Length | is_empty_text | Boolean | |
| | text_num_characters | Integer | |

The following analyses are only run on image data:

| Analysis Type | Column Name | Type |
| --- | --- | --- |
| Dark images | is_dark | Boolean |
| | dark_score | Float |
| Light images | is_light | Boolean |
| | light_score | Float |
| Blurry images | is_blurry | Boolean |
| | blurry_score | Float |
| Low information images | is_low_information | Boolean |
| | low_information_score | Float |
| Odd aspect ratio images | is_odd_aspect_ratio | Boolean |
| | odd_aspect_ratio_score | Float |
| Grayscale images | is_grayscale | Boolean |
| Oddly sized images | is_odd_size | Boolean |
| | odd_size_score | Float |
| Aesthetic images | aesthetic_score | Float |
| NSFW images | is_NSFW | Boolean |
| | NSFW_score | Float |
| Not Analyzed | is_not_analyzed | Boolean |

Label Issues

Label issues are data points that are likely mislabeled.

is_label_issue

Contains a boolean value, with True indicating that the data point has likely been labeled incorrectly in the original dataset.

label_issue_score

Contains a score between 0 and 1. The higher the score of a data point, the more likely it is mislabeled in the original dataset.

Mathematically, is_label_issue is estimated via Confident Learning algorithms, with label_issue_score defined based on the normalized_margin method defined in the Confident Learning paper. The determination of is_label_issue is not a simple cutoff applied to the label_issue_score because different classes have different estimated label error rates and model prediction error rates, which Confident Learning accounts for.

suggested_label

Contains an alternative suggested label that appears more appropriate for the data point than the original given label. For multi-class classification and regression projects, the suggested label will be null for data points not flagged with any issues (i.e. is_label_issue, is_outlier, is_ambiguous, is_near_duplicate are all False). The type of this column matches the type of the given label column (string or integer).

suggested_label_confidence_score

Contains a score between 0 and 1 for data points that have a suggested_label. The higher the score, the more confident we are in the suggested label.

While the label_issue_score quantifies the likelihood that a given label in the original dataset is incorrect (higher values correspond to data points that are more likely incorrectly annotated), the suggested_label_confidence_score quantifies the likelihood that Cleanlab’s predicted label for a data point is correct (higher values correspond to data points you can more confidently auto-fix / auto-label with Cleanlab). Unlabeled data points thus do not receive a label_issue_score but they do receive a suggested_label_confidence_score.

For multi-class classification datasets, this score is the predicted class probability of the class deemed most likely by Cleanlab’s ML model. For multi-label datasets, the confidence score for each data point is aggregated over all the possible classes for this data point.
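As a rough sketch (not an official recommendation), you could auto-fix the most confident label issues by combining these columns. Here `df` is assumed to be your original dataset with a `label` column, aligned row-for-row with the downloaded Cleanlab columns, and the 0.9 cutoff is an arbitrary example:

```python
# Continues from the download sketch above (`cleanlab_cols` is the DataFrame of Cleanlab columns).
confident_fixes = cleanlab_cols["is_label_issue"] & (
    cleanlab_cols["suggested_label_confidence_score"] > 0.9  # arbitrary example cutoff
)
df.loc[confident_fixes, "label"] = cleanlab_cols.loc[confident_fixes, "suggested_label"]
```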

Ambiguous

Ambiguous data points are not well-described by any class in a dataset. Even many different human annotators may disagree on the best class label for such data points. Such data might be appropriate to exclude from a dataset or have your experts review. The presence of ambiguous data points may also indicate the class definitions were not very clear in the original data annotation instructions – consider more clearly defining the distinction between certain classes with many data points flagged as ambiguous.

is_ambiguous

Contains a boolean value, with True indicating that the data point appears ambiguous.

ambiguous_score

Contains a score between 0 and 1. The higher the score of a data point, the more ambiguous the data point (i.e. the more we anticipate multiple human annotators would disagree on how to properly label this data point).

Mathematically, this score is proportional to the normalized entropy of predicted class probabilities for each data point output by Cleanlab’s AutoML system.
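For intuition, here is a sketch of normalized entropy computed over rows of predicted class probabilities; this is only illustrative and is not Cleanlab's exact implementation:

```python
import numpy as np

def normalized_entropy(pred_probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of predicted class probabilities, scaled into [0, 1]
    by dividing by log(K) for K classes."""
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(pred_probs * np.log(pred_probs + eps), axis=1)
    return entropy / np.log(pred_probs.shape[1])

# A confident prediction scores near 0; a uniform (maximally ambiguous) one scores 1.
print(normalized_entropy(np.array([[0.98, 0.01, 0.01], [1/3, 1/3, 1/3]])))
```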

Well-Labeled

Well-labeled data points are confidently estimated to already have the correct given label in the original dataset. These are good representative examples of their given class. Well-labeled examples can safely be used in downstream tasks where high quality is required. If manually reviewing a dataset to ensure the highest quality, you can safely skip reviewing these Well-labeled data points, which have already been validated by Cleanlab’s AI. Learn more.

is_well_labeled

Contains a boolean value, with True indicating that the given label of the data point is highly likely to be correct.

Outliers

Outliers are atypical data points that significantly differ from other data. Such anomalous data might be appropriate to exclude from a dataset. Outliers may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources.

For images/text, outliers are detected based on how semantically similar the content is to any other content in the dataset. For text data, the length of the text is also taken into account.

Note: Outliers are not currently provided for multi-label tabular datasets.

is_outlier

Contains a boolean value, with True indicating that the data point is identified as an outlier.

outlier_score

Contains a score between 0 and 1. The higher the score of a data point, the more atypical it appears to be compared to the rest of the dataset (more extreme outlier).

Near Duplicates

Near duplicate data points are those that are identical or near-identical to other examples in the dataset. Extra duplicate copies of data might be appropriate to exclude from a dataset. They may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources.

Note: Near-duplicates are not currently provided for tabular datasets.

is_near_duplicate

Contains a boolean value, with True indicating that this data point is a (near or exact) duplicate of another data point in the dataset.

near_duplicate_score

Contains a score between 0 and 1. The higher the score of a data point, the more similar it is to some other data point in the dataset.

Exact duplicates will have a score equal to 1. For images/text, near duplicates are scored based on how semantically similar the content is to any other content in the dataset.

near_duplicate_cluster_id

Contains a cluster ID for each data point, where data points with the same ID belong to the same set of near duplicates. Use this to determine which data points are near duplicates of one another. The cluster IDs are sequential integers: 0, 1, 2, … Data points that do not have near duplicates (where is_near_duplicate is False) are assigned an ID of <NA>.
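For example, here is one simple (illustrative, not prescribed) way to review near-duplicate sets and keep a single row from each, where `cleanlab_cols` is the downloaded DataFrame from the earlier sketch and `df` is your original dataset aligned by row index:

```python
near_dups = cleanlab_cols[cleanlab_cols["is_near_duplicate"]]

# Review each near-duplicate set side by side.
for cluster_id, group in near_dups.groupby("near_duplicate_cluster_id"):
    print(f"cluster {cluster_id}: rows {list(group.index)}")

# Keep only the first row of each near-duplicate cluster; drop the rest.
extra_copies = near_dups.index[near_dups.duplicated("near_duplicate_cluster_id", keep="first")]
deduped = df.drop(index=extra_copies)
```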

Columns specific to Text Data

Each of the values listed in this section is computed independently for each text field (it does not depend on other text fields in the dataset, unlike, say, duplicates or label issues, which depend on statistics of the full dataset).

Personally Identifiable Information (PII)

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Names, addresses, and phone numbers are common examples of PII.

Note: PII detection is currently only provided for text datasets from Projects run in Regular mode.

is_PII

Contains a boolean value, with True indicating that the text contains PII.

PII_score

Contains a score between 0 and 1. The higher the score of a data point, the more sensitive the PII detected.

  • Low sensitivity: URL, social media username
  • Moderate sensitivity: phone number, email address
  • High sensitivity: social security number, credit card number

PII_types

Contains a comma-separated list of the types of PII detected in this text. Possible types include: social security number, credit card number, phone number, email, social media username, URL.

PII_items

Contains a comma-separated list of the PII items detected in the text. For example: abc@gmail.com, 242-139-4346, etc.
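For example (an illustrative sketch using the DataFrame downloaded earlier), you could review the detected PII starting with the most sensitive items:

```python
pii_rows = cleanlab_cols[cleanlab_cols["is_PII"]]
for _, row in pii_rows.sort_values("PII_score", ascending=False).iterrows():
    print(row["PII_types"], "->", row["PII_items"])
```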

Non-English Text

Non-English text may include foreign languages or be unreadable due to nonsensical characters and other indecipherable gibberish.

Note: Non-English text detection is currently only provided for text datasets from Projects run in Regular mode.

is_non_english

Contains a boolean value, with True indicating that the text is not written in English.

non_english_score

Contains a score between 0 and 1 (defined as 1 minus the probability that this text is English, as estimated by a language detection model). The higher the score of a data point, the less the text resembles proper English. High scores may correspond to text from other languages, or text contaminated by non-language strings (e.g. HTML/XML tags, identifiers, hashes, or random character combinations that are not meaningfully readable).

predicted_language

Contains the predicted language of the text if it is identified to be non-English. The language will only be predicted if Cleanlab is confident that the text is written in another language (i.e. it is not nonsensical characters); otherwise it will be assigned <NA>.

Informal Text

Informal text is poorly written either due to casual language or writing errors such as improper grammar or spelling.

Note: Informal text detection is currently only provided for text datasets from Projects run in Regular mode.

is_informal

Contains a boolean value, with True indicating that the text contains informal language and is not well written.

informal_score

Contains a score between 0 and 1. Data points with lower scores correspond to text written in a more formal style with fewer grammar/spelling errors. This score aggregates the spelling_issue_score, grammar_issue_score, and slang_issue_score defined below. The highest values of the informal_score (near 1.0) correspond to text that is poorly written with many obvious writing mistakes (grammatical errors, spelling errors, and slang terms throughout).

spelling_issue_score

Contains a score between 0 and 1. Higher scores correspond to text with more misspelled words.

grammar_issue_score

Contains a score between 0 and 1. Higher scores correspond to text with more severe grammatical errors.

slang_issue_score

Contains a score between 0 and 1. Higher scores correspond to less formally written text with more slang and colloquial language.

Toxic Language

Text that contains hateful speech, harmful language, aggression, or related toxic elements.

Note: Toxicity detection is currently only provided for text datasets from Projects run in Regular mode.

is_toxic

Contains a boolean value, with True indicating that the given text likely contains toxic language.

toxic_score

Contains a score between 0 and 1. The higher the score of a data point, the more hateful the text appears.

Text Sentiment

Sentiment refers to how positive or negative the tone conveyed in some text sounds (i.e. how the author seems to feel about the topic).

Note: Sentiment is only estimated for text datasets from Projects run in Regular mode.

sentiment_score

Contains a score between 0 and 1. Higher scores correspond to text with stronger positive sentiments (positive opinions conveyed about the topic), while lower scores correspond to text with stronger negative sentiments. Text conveying a neutral opinion will receive sentiment scores around 0.5.

Note: Unlike many of Cleanlab’s other issue scores, higher sentiment scores may correspond to better text examples in your dataset.
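If you want categorical sentiment labels, one illustrative approach (using the DataFrame downloaded earlier, with arbitrary example cutoffs) is to bucket the score:

```python
import pandas as pd

sentiment_bucket = pd.cut(
    cleanlab_cols["sentiment_score"],
    bins=[0.0, 0.4, 0.6, 1.0],            # arbitrary example cutoffs
    labels=["negative", "neutral", "positive"],
    include_lowest=True,
)
```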

Biased Language

Text that contains biased language, prejudiced expressions, stereotypes, or promotes discrimination/unfair treatment. Specific types of bias detected include discrimination against: gender, race, or sexual orientation.

Note: Bias detection is currently only provided for text datasets from Projects run in Regular mode.

is_biased

Contains a boolean value, with True indicating that the given text contains language likely to be perceived as biased. Since bias is inherently subjective, the bias_score thresholds used to determine True values here are stringently selected. If you find biased text that is not flagged by this boolean, you can apply a lower threshold to the bias_score yourself, as sketched below.
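For example (an illustrative sketch with an arbitrary cutoff, using the DataFrame downloaded earlier):

```python
# Flag additional text with your own, lower threshold on bias_score.
custom_flagged = cleanlab_cols[cleanlab_cols["bias_score"] > 0.5]  # example cutoff, not a Cleanlab default
```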

bias_score

Contains a score between 0 and 1. The higher the score of a data point, the more likely the text might be perceived as biased. This score is an aggregation of the following sub-scores into an overall bias measure.

gender_bias_score

Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s gender.

racial_bias_score

Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s race.

sexual_orientation_bias_score

Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s sexual orientation.

Text Length

Texts that are unusually short or long could be anomalies in a dataset. For text datasets, Cleanlab Studio’s outlier score already accounts for the length of each text field, but we explicitly provide this information as well.

is_empty_text

Contains a boolean value, with True indicating that the given text field is empty (there is no text associated with this data point).

text_num_characters

The length of each text field (how many characters are in each row). Trailing whitespace is not included in the character count.
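For example (an illustrative sketch with arbitrary cutoffs, using the DataFrame downloaded earlier), you could flag rows whose text is empty or unusually short or long:

```python
empty = cleanlab_cols["is_empty_text"]
very_short = cleanlab_cols["text_num_characters"] < 5        # arbitrary example cutoff
very_long = cleanlab_cols["text_num_characters"] > 10_000    # arbitrary example cutoff
suspect_length = cleanlab_cols[empty | very_short | very_long]
```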

Columns specific to Image Data

Cleanlab Studio performs various types of image-specific analyses that are independent of the machine learning task, annotations, and other factors. These analyses focus solely on the images within the dataset. This metadata plays a crucial role in identifying and removing low-quality or corrupted images from your dataset, which can adversely affect model performance.

Dark images

These refer to images with insufficient brightness, appearing dim or underexposed, resulting in a lack of detail and clarity.

is_dark

Boolean value. True indicates that the image is classified as a dark image.

dark_score

A score between 0 and 1, representing the darkness of the image. Higher scores indicate darker images.

Light images

These are images that are excessively overexposed or washed out due to an abundance of light or high brightness levels.

is_light

Boolean value. True indicates that the image is classified as a light image.

light_score

A score between 0 and 1, indicating the brightness level of the image. Higher scores suggest brighter images.

Blurry images

These are images that lack sharpness and clarity, causing the subjects or details to appear hazy, out of focus, or indistinct.

is_blurry

Boolean value. True indicates that the image is classified as a blurry image.

blurry_score

A score between 0 and 1, representing the degree of blurriness in the image. Higher scores indicate a more blurry image.

Low information images

These are images lacking content (low entropy in values of their pixels).

is_low_information

Boolean value. True indicates that the image is classified as a low information image.

low_information_score

A score between 0 and 1, representing the severity of the image’s lack of information. Higher scores correspond to images containing less information.

Grayscale images

These are images lacking color.

is_grayscale

Boolean value. True indicates that the image is classified as a grayscale image.

Odd Aspect Ratio images

These are images with unusually long width or height (asymmetric dimensions).

is_odd_aspect_ratio

Boolean value. True indicates that the image is classified as having an odd aspect ratio.

odd_aspect_ratio_score

A score between 0 and 1, representing the degree of irregularity in the image’s aspect ratio. Higher scores correspond to images whose aspect ratio is more extreme.

Odd sized images

These are images that are unusually small relative to the rest of the images in the dataset.

is_odd_size

Boolean value. True indicates that the image is classified as an image with an odd size.

odd_size_score

A score between 0 and 1, indicating the degree of irregularity in the image’s size. Higher scores indicate images whose size is more unusual compared to the other images in the dataset.

Aesthetic images

These are images which are visually appealing (as rated by most people, although this is highly subjective). They might be artistic, beautiful photographs, or contain otherwise interesting content.

Note: Aesthetic scores are currently only provided for image datasets from Projects run in Regular mode.

aesthetic_score

A score between 0 and 1, representing how aesthetic the image is estimated to be. If training Generative AI models on your dataset, consider first filtering out the images with low aesthetic_score.

Note: Higher aesthetic scores correspond to higher-quality images in the dataset (unlike many of Cleanlab’s other issue scores).

NSFW images

NSFW (Not Safe For Work) images refer to any visual content that is not suitable for viewing in a professional or public environment, typically due to its explicit, pornographic, or graphic nature. These images may contain nudity, sexual acts, or other content that is considered inappropriate for certain settings, such as workplaces or public spaces.

Note: NSFW detection is currently only provided for image datasets from Projects run in Regular mode.

is_NSFW

Boolean value. True indicates that the image depicts NSFW content.

NSFW_score

A score between 0 and 1, representing the severity of the issue. Higher scores correspond to images more likely to be considered inappropriate.

Not Analyzed

For image projects only, a few of the images in the dataset might fail to be processed due to reasons such as poorly formatted data or invalid image file paths. None of our other analyses will be run on such images, and all the other Cleanlab columns will be filled with default values (e.g. False for all the boolean issue columns). Thus we provide a boolean column so such rows can be easily filtered out.

is_not_analyzed

Contains a boolean value, with True indicating that this data point was not analyzed because its image failed to be processed. Images may fail to process because of an unsupported file format or file corruption. Consider converting any images that failed to process to a more standard file format.
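For example (an illustrative sketch using the DataFrame downloaded earlier), you can exclude these rows before any downstream filtering so that their default values do not skew your analysis:

```python
analyzed_only = cleanlab_cols[~cleanlab_cols["is_not_analyzed"]]
```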