Cleanlab Columns
Cleanlab Studio runs many analyses of your dataset to find data issues and provide other useful metadata, including detection of label issues, outliers, ambiguous examples, well-labeled examples, and more. New metadata columns are continuously being added to Cleanlab Studio, both in the web app and in the API, where they are available through `download_cleanlab_columns()`. Read this page to learn about all of the different types of data issues that Cleanlab Studio can automatically detect in your datasets.
Most Cleanlab analyses (label issue, outlier, etc.) produce two columns:
- A numeric score between 0 and 1 indicating the magnitude of the attribute. For example, an outlier score of `0.99` indicates an extreme outlier (such data points warrant your attention), whereas a label issue score of `0.01` indicates a data point that is unlikely to be mislabeled (such data points do not warrant your attention). Sort your dataset by these scores to identify the most important data points to focus on. The raw magnitude of these scores is less important than their relative ordering across your dataset. Example: `label_issue_score`.
- A boolean flag indicating whether the attribute is estimated to be true or not. When both the score and boolean are present, the boolean is computed by thresholding the score, with the threshold chosen intelligently by Cleanlab Studio (we’ve extensively benchmarked these scores on a large number of datasets). For example, a data point with the outlier value `True` is likely an outlier that warrants your attention (as it may have an outsized impact on downstream models or indicate problems with the data sources). Example: `is_label_issue`.
For most applications, start by sorting your data points by the score you care about, then consider the corresponding boolean flag attribute, and finally consider a different score threshold if appropriate for your application.
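For example, here is a minimal sketch of this workflow in Python. It assumes your API key and cleanset ID are filled in, that `download_cleanlab_columns()` returns a pandas DataFrame, and that the 0.8 cutoff is just a hypothetical threshold you would tune for your application:

```python
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")  # placeholder for your API key
cleanlab_cols = studio.download_cleanlab_columns("<YOUR_CLEANSET_ID>")  # assumed to return a pandas DataFrame

# 1. Sort by the score you care about to surface the most severe issues first.
worst_first = cleanlab_cols.sort_values("label_issue_score", ascending=False)

# 2. Use the boolean flag computed with Cleanlab Studio's chosen threshold.
flagged = cleanlab_cols[cleanlab_cols["is_label_issue"]]

# 3. Or apply your own score threshold if appropriate for your application.
CUSTOM_THRESHOLD = 0.8  # hypothetical value, tune per application
custom_flagged = cleanlab_cols[cleanlab_cols["label_issue_score"] > CUSTOM_THRESHOLD]
```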
All analyses run independently (so e.g. a data point can be flagged as both a label issue and an outlier). More Cleanlab columns may be computed in Projects created in `Regular` vs. `Fast` mode.
| Analysis Type | Column Name | Type |
|---|---|---|
| Label Issues | `is_label_issue`<br>`label_issue_score`<br>`suggested_label`<br>`suggested_label_confidence_score` | Boolean<br>Float<br>String or Integer<br>Float |
| Ambiguous Examples | `is_ambiguous`<br>`ambiguous_score` | Boolean<br>Float |
| Well-labeled Examples | `is_well_labeled` | Boolean |
| Near Duplicates | `is_near_duplicate`<br>`near_duplicate_score`<br>`near_duplicate_cluster_id` | Boolean<br>Float<br>Integer |
| Outliers | `is_outlier`<br>`outlier_score` | Boolean<br>Float |
The following analyses are only run on text data:
| Analysis Type | Column Name | Type | Example |
|---|---|---|---|
| Personally Identifiable Information | `is_PII`<br>`PII_score`<br>`PII_types`<br>`PII_items` | Boolean<br>Float<br>List<br>List | hello@cleanlab.com, 242-123-4567 |
| Non-English Text | `is_non_english`<br>`non_english_score`<br>`predicted_language` | Boolean<br>Float<br>String | `404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body>` |
| Informal Text | `is_informal`<br>`informal_score`<br>`spelling_issue_score`<br>`grammar_issue_score`<br>`slang_issue_score` | Boolean<br>Float<br>Float<br>Float<br>Float | ‘sup my bro! how was the weekend at d beach? |
| Toxic Language | `is_toxic`<br>`toxic_score` | Boolean<br>Float | F**k you! |
| Text Sentiment | `sentiment_score` | Float | |
| Biased Language | `is_biased`<br>`bias_score`<br>`gender_bias_score`<br>`racial_bias_score`<br>`sexual_orientation_bias_score` | Boolean<br>Float<br>Float<br>Float<br>Float | He can’t be a nurse, nursing is a profession for women. |
| Text Length | `is_empty_text`<br>`text_num_characters` | Boolean<br>Integer | |
The following analyses are only run on image data:
| Analysis Type | Column Name | Type |
|---|---|---|
| Dark images | `is_dark`<br>`dark_score` | Boolean<br>Float |
| Light images | `is_light`<br>`light_score` | Boolean<br>Float |
| Blurry images | `is_blurry`<br>`blurry_score` | Boolean<br>Float |
| Low information images | `is_low_information`<br>`low_information_score` | Boolean<br>Float |
| Odd aspect ratio images | `is_odd_aspect_ratio`<br>`odd_aspect_ratio_score` | Boolean<br>Float |
| Grayscale images | `is_grayscale` | Boolean |
| Oddly sized images | `is_odd_size`<br>`odd_size_score` | Boolean<br>Float |
| Aesthetic images | `aesthetic_score` | Float |
| NSFW images | `is_NSFW`<br>`NSFW_score` | Boolean<br>Float |
| Not Analyzed | `is_not_analyzed` | Boolean |
Label Issues
Label issues are data points that are likely mislabeled. Learn more.
is_label_issue
Contains a boolean value, with `True` indicating that the data point has likely been labeled incorrectly in the original dataset.
label_issue_score
Contains a score between 0 and 1. The higher the score of a data point, the more likely it is mislabeled in the original dataset.
Mathematically, `is_label_issue` is estimated via Confident Learning algorithms, with `label_issue_score` defined based on the normalized_margin method described in the Confident Learning paper.
The determination of `is_label_issue` is not a simple cutoff applied to the `label_issue_score`, because different classes have different estimated label error rates and model prediction error rates, which Confident Learning accounts for.
suggested_label
Contains an alternative suggested label that appears more appropriate for the data point than the original given label. For multi-class classification and regression projects, the suggested label will be null for data points not flagged with any issues (i.e. `is_label_issue`, `is_outlier`, `is_ambiguous`, and `is_near_duplicate` are all `False`). The type of this column matches the type of the given label column (string or integer).
suggested_label_confidence_score
Contains a score between 0 and 1 for data points that have a `suggested_label`. The higher the score, the more confident we are in the suggested label.
While the `label_issue_score` quantifies the likelihood that a given label in the original dataset is incorrect (higher values correspond to data points that are more likely incorrectly annotated), the `suggested_label_confidence_score` quantifies the likelihood that Cleanlab’s predicted label for a data point is correct (higher values correspond to data points you can more confidently auto-fix / auto-label with Cleanlab). Unlabeled data points thus do not receive a `label_issue_score`, but they do receive a `suggested_label_confidence_score`.
For multi-class classification datasets, this score is the predicted class probability of the class deemed most likely by Cleanlab’s ML model. For multi-label datasets, the confidence score for each data point is aggregated over all the possible classes for this data point.
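For instance, you could use these two columns together to auto-fix only the most confidently mislabeled data points. The sketch below assumes the Cleanlab columns have been joined with your original data in a pandas DataFrame that has a `label` column, and the 0.9 cutoff is a hypothetical choice:

```python
import pandas as pd

def auto_fix_labels(df: pd.DataFrame, min_confidence: float = 0.9) -> pd.DataFrame:
    """Replace the given label with Cleanlab's suggested label, but only for rows
    flagged as label issues whose suggested label is confidently correct."""
    fix_mask = df["is_label_issue"] & (df["suggested_label_confidence_score"] >= min_confidence)
    fixed = df.copy()
    fixed.loc[fix_mask, "label"] = fixed.loc[fix_mask, "suggested_label"]  # "label" column name is an assumption
    return fixed
```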
Ambiguous
Ambiguous data points are not well-described by any class in a dataset. Even many different human annotators may disagree on the best class label for such data points. Such data might be appropriate to exclude from a dataset or have your experts review. The presence of ambiguous data points may also indicate the class definitions were not very clear in the original data annotation instructions – consider more clearly defining the distinction between certain classes with many data points flagged as ambiguous. Read about how ambiguity is scored.
is_ambiguous
Contains a boolean value, with `True` indicating that the data point appears ambiguous.
ambiguous_score
Contains a score between 0 and 1. The higher the score of a data point, the more ambiguous the data point (i.e. the more we anticipate multiple human annotators would disagree on how to properly label this data point).
Mathematically, this score is proportional to the normalized entropy of predicted class probabilities for each data point output by Cleanlab’s AutoML system.
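As a rough illustration of that quantity (not Cleanlab's exact implementation), the normalized entropy of a vector of predicted class probabilities can be computed as follows:

```python
import numpy as np

def normalized_entropy(pred_probs: np.ndarray) -> float:
    """Entropy of a class-probability vector, divided by log(K) so the result lies in [0, 1].
    Values near 1 mean the model is torn between classes (more ambiguous)."""
    eps = 1e-12  # guard against log(0)
    p = np.clip(pred_probs, eps, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

normalized_entropy(np.array([0.98, 0.01, 0.01]))  # ~0.10: confident prediction, low ambiguity
normalized_entropy(np.array([0.34, 0.33, 0.33]))  # ~1.00: nearly uniform, highly ambiguous
```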
Well-Labeled
Well-labeled data points are confidently estimated to already have the correct given label in the original dataset. These are good representative examples of their given class. Well-labeled examples can safely be used in downstream tasks where high quality is required. If manually reviewing a dataset to ensure the highest quality, you can safely skip reviewing these Well-labeled data points, which have already been validated by Cleanlab’s AI. Learn more.
is_well_labeled
Contains a boolean value, with `True` indicating that the given label of the data point is highly likely to be correct.
Outliers
Outliers are atypical data points that significantly differ from other data. Such anomalous data might be appropriate to exclude from a dataset. Outliers may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources. Learn more.
For images/text, outliers are detected based on how semantically similar the content is to any other content in the dataset. For text data, the length of the text is also taken into account.
is_outlier
Contains a boolean value, with `True` indicating that the data point is identified as an outlier.
outlier_score
Contains a score between 0 and 1. The higher the score of a data point, the more atypical it appears to be compared to the rest of the dataset (more extreme outlier).
Near Duplicates
Near duplicate data points are those that are identical or near-identical to other examples in the dataset. Extra duplicate copies of data might be appropriate to exclude from a dataset. They may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources.
is_near_duplicate
Contains a boolean value, with `True` indicating that this data point is a (near or exact) duplicate of another data point in the dataset.
near_duplicate_score
Contains a score between 0 and 1. The higher the score of a data point, the more similar it is to some other data point in the dataset.
Exact duplicates will have a score equal to 1. For images/text, near duplicates are scored based on how semantically similar the content is to any other content in the dataset.
near_duplicate_cluster_id
Contains a cluster ID for each data point, where data points with the same ID belong to the same set of near duplicates. Use this to determine which data points are near duplicates of one another. The cluster IDs are sequential integers: 0, 1, 2, … Data points that do not have near duplicates (where `is_near_duplicate` is `False`) are assigned an ID of `<NA>`.
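For example, you could group rows by this ID to review each set of near duplicates together, or keep just one representative per set. This sketch assumes the downloaded Cleanlab columns are in a pandas DataFrame `df`:

```python
import pandas as pd

dupes = df[df["is_near_duplicate"]]

# Review each set of near duplicates together.
for cluster_id, group in dupes.groupby("near_duplicate_cluster_id"):
    print(f"Near-duplicate set {cluster_id}: rows {list(group.index)}")

# Keep only the first row of each near-duplicate set, plus all rows with no near duplicates.
first_per_cluster = dupes[~dupes.duplicated(subset="near_duplicate_cluster_id")]
deduped = pd.concat([df[~df["is_near_duplicate"]], first_per_cluster]).sort_index()
```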
Columns specific to Text Data
Each of the values listed in this section is independently computed for each text field (it does not depend on other text fields in the dataset, unlike, say, duplicates or label issues, which depend on statistics of the full dataset). Learn more.
Personally Identifiable Information (PII)
Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Names, addresses, and phone numbers are examples of common PII.
Note: PII detection is currently only provided for text datasets from Projects run in Regular mode.
is_PII
Contains a boolean value, with `True` indicating that the text contains PII.
PII_score
Contains a score between 0 and 1. The higher the score of a data point, the more sensitive the PII detected.
- Low sensitivity: URL, social media username
- Moderate sensitivity: phone number, email address
- High sensitivity: social security number, credit card number
PII_types
Contains a comma separated list of the types of PII detected in this text. Possible types include: social security number, credit card number, phone number, email, social media username, URL.
PII_items
Contains a comma separated list of the PII items detected in the text. For example: abc@gmail.com, 242-139-4346, etc.
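For example, you could use these columns to isolate rows containing highly sensitive PII. The sketch below assumes the Cleanlab columns are in a pandas DataFrame `df`, and it normalizes the `PII_types` value in case it is stored as either a list or a comma-separated string:

```python
HIGH_SENSITIVITY = {"social security number", "credit card number"}  # per the sensitivity tiers above

def has_high_sensitivity_pii(pii_types) -> bool:
    """True if any detected PII type in this row is highly sensitive."""
    if not isinstance(pii_types, (str, list)):
        return False  # e.g. missing value for rows with no PII detected
    if isinstance(pii_types, str):
        pii_types = [t.strip() for t in pii_types.split(",")]
    return bool(HIGH_SENSITIVITY.intersection(pii_types))

sensitive_rows = df[df["PII_types"].apply(has_high_sensitivity_pii)]
```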
Non-English Text
Non-English text may include foreign languages or be unreadable due to nonsensical characters and other indecipherable gibberish.
Examples of text detected as non-English:
- `{{Answer}}\n \n {{nextStep}}\n\n {{followUp}}\n\n {{addInfo}}\n \n {{cusFeedback}}\n \n {{faq}}\n \n{{EOM}}`
- `404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body>`
- `la trasferenza al mio conto non è stata consentita`
Note: Non-English text detection is currently only provided for text datasets from Projects run in Regular mode.
is_non_english
Contains a boolean value, with `True` indicating that the text is not written in English.
non_english_score
Contains a score between 0 and 1 (defined as 1 minus the probability that this text is English, as estimated by a language detection model). The higher the score of a data point, the less the text resembles proper English language. High scores may correspond to text from other languages, or text contaminated by non-language strings (e.g. HTML/XML tags, identifiers, hashes, random character combinations that are not meaningfully readable).
predicted_language
Contains the predicted language of the text if it is identified to be non-English. The language will only be predicted if Cleanlab is confident that the text is written in another language (i.e. it is not just nonsensical characters); otherwise it will be assigned `<NA>`.
Informal Text
Informal text is poorly written either due to casual language or writing errors such as improper grammar or spelling.
Note: Informal text detection is currently only provided for text datasets from Projects run in Regular mode.
is_informal
Contains a boolean value, with `True` indicating that the text contains informal language and is not well written.
informal_score
Contains a score between 0 and 1. Data points with lower scores correspond to text written in a more formal style with fewer grammar/spelling errors. This score aggregates the `spelling_issue_score`, `grammar_issue_score`, and `slang_issue_score` defined below. The highest values of the `informal_score` (near 1.0) correspond to text that is poorly written with many obvious writing mistakes (many grammatical errors, spelling errors, and slang terms throughout).
spelling_issue_score
Contains a score between 0 and 1. Higher scores correspond to text with more misspelled words.
grammar_issue_score
Contains a score between 0 and 1. Higher scores correspond to text with more severe grammatical errors.
slang_issue_score
Contains a score between 0 and 1. Higher scores correspond to less formally written text with more slang and colloquial language.
Toxic Language
Text that contains hateful speech, harmful language, aggression, or related toxic elements.
Note: Toxicity detection is currently only provided for text datasets from Projects run in Regular mode.
is_toxic
Contains a boolean value, with `True` indicating that the given text likely contains toxic language.
toxic_score
Contains a score between 0 and 1. The higher the score of a data point, the more hateful the text appears.
Text Sentiment
Sentiment refers to how positive or negative the tone conveyed in some text sounds (i.e. how the author seems to feel about the topic).
Note: Sentiment is only estimated for text datasets from Projects run in Regular mode.
sentiment_score
Contains a score between 0 and 1. Higher scores correspond to text with stronger positive sentiments (positive opinions conveyed about the topic), while lower scores correspond to text with stronger negative sentiments. Text conveying a neutral opinion will receive sentiment scores around 0.5.
Note: Unlike many of Cleanlab’s other issue scores, higher sentiment scores may correspond to better text examples in your dataset.
Biased Language
Text that contains biased language, prejudiced expressions, stereotypes, or promotes discrimination/unfair treatment. Specific types of bias detected include discrimination against: gender, race, or sexual orientation.
Note: Bias detection is currently only provided for text datasets from Projects run in Regular mode.
is_biased
Contains a boolean value, with `True` indicating that the given text contains language likely to be perceived as biased. Since bias is inherently subjective, the `bias_score` thresholds used to determine `True` values here are stringently selected. If you find biased text that is not flagged by this boolean, simply apply a lower threshold to `bias_score` yourself.
bias_score
Contains a score between 0 and 1. The higher the score of a data point, the more likely the text might be perceived as biased. This score is an aggregation of the following sub-scores into an overall bias measure.
gender_bias_score
Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s gender.
racial_bias_score
Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s race.
sexual_orientation_bias_score
Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s sexual orientation.
Text Length
Texts that are unusually short or long could be anomalies in a dataset. For text datasets, Cleanlab Studio’s outlier score already accounts for the length of each text field, but we explicitly provide this information as well.
is_empty_text
Contains a boolean value, with `True` indicating that the given text field is empty (there is no text associated with this data point).
text_num_characters
The length of each text field (how many characters in each row). Any trailing white spaces are not considered in the character count.
Columns specific to Image Data
Cleanlab Studio performs various types of image-specific analyses that are independent of the machine learning task, annotations, and other factors. These analyses focus solely on the images within the dataset. This metadata plays a crucial role in identifying and removing low-quality or corrupted images from your dataset, which can adversely affect model performance. Learn more.
Dark images
These refer to images with insufficient brightness, appearing dim or underexposed, resulting in a lack of detail and clarity.
is_dark
Boolean value. True indicates that the image is classified as a dark image.
dark_score
A score between 0 and 1, representing the darkness of the image. Higher scores indicate darker images.
Light images
These are images that are excessively overexposed or washed out due to an abundance of light or high brightness levels.
is_light
Boolean value. True indicates that the image is classified as a light image.
light_score
A score between 0 and 1, indicating the brightness level of the image. Higher scores suggest brighter images.
Blurry images
These are images that lack sharpness and clarity, causing the subjects or details to appear hazy, out of focus, or indistinct.
is_blurry
Boolean value. True indicates that the image is classified as a blurry image.
blurry_score
A score between 0 and 1, representing the degree of blurriness in the image. Higher scores indicate a more blurry image.
Low information images
These are images lacking content (low entropy in values of their pixels).
is_low_information
Boolean value. True indicates that the image is classified as a low information image.
low_information_score
A score between 0 and 1, representing the severity of the image’s lack of information. Higher scores correspond to images containing less information.
Grayscale images
These are images lacking color.
is_grayscale
Boolean value. True indicates that the image is classified as a grayscale image.
Odd Aspect Ratio images
These are images with unusually long width or height (asymmetric dimensions).
is_odd_aspect_ratio
Boolean value. True indicates that the image is classified as having an odd aspect ratio.
odd_aspect_ratio_score
A score between 0 and 1, representing the degree of irregularity in the image’s aspect ratio. Higher scores correspond to images whose aspect ratio is more extreme.
Odd sized images
These are images that are unusually small relative to the rest of the images in the dataset.
is_odd_size
Boolean value. True indicates that the image is classified as an image with an odd size.
odd_size_score
A score between 0 and 1, indicating the degree of irregularity in the image’s size. Higher scores indicate images whose size is more unusual compared to the other images in the dataset.
Aesthetic images
These are images which are visually appealing (as rated by most people, although this is highly subjective). They might be artistic, beautiful photographs, or contain otherwise interesting content.
Note: Aesthetic scores are currently only provided for image datasets from Projects run in Regular mode.
aesthetic_score
A score between 0 and 1, representing how aesthetic the image is estimated to be. If training Generative AI models on your dataset, consider first filtering out the images with low `aesthetic_score`.
Note: Higher aesthetic scores correspond to higher-quality images in the dataset (unlike many of Cleanlab’s other issue scores).
NSFW images
NSFW (Not Safe For Work) images refer to any visual content that is not suitable for viewing in a professional or public environment, typically due to its explicit, pornographic, or graphic nature. These images may contain nudity, sexual acts, or other content that is considered inappropriate for certain settings, such as workplaces or public spaces.
Note: NSFW detection is currently only provided for image datasets from Projects run in Regular mode.
is_NSFW
Boolean value. True indicates that the image depicts NSFW content.
NSFW_score
A score between 0 and 1, representing the severity of the issue. Higher scores correspond to images more likely to be considered inappropriate.
Not Analyzed
For image projects only, a few of the images in the dataset might fail to be processed due to reasons such as poorly formatted data or invalid image file paths. None of our other analyses will be run on such images, and all the other Cleanlab columns will be filled with default values (e.g. `False` for all the boolean issue columns). Thus we provide a boolean column so such rows can be easily filtered out.
is_not_analyzed
Contains a boolean value, with `True` indicating that this data point was not analyzed because its image failed to be processed. Images may fail to process because of an unsupported file format or corrupted file contents. Consider using a more standard file format for each image that failed to be processed.
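For example, you might drop these rows before interpreting any other Cleanlab columns, since their values are just defaults rather than real analysis results. A minimal sketch, assuming the Cleanlab columns are in a pandas DataFrame `df`:

```python
not_analyzed = df[df["is_not_analyzed"]]
print(f"{len(not_analyzed)} images failed to be processed and were not analyzed")

# Restrict attention to rows whose Cleanlab columns reflect actual analysis results.
analyzed = df[~df["is_not_analyzed"]]
```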