Cleanlab Columns
Cleanlab Studio runs many analyses of your dataset to find data issues and provide other useful metadata, including detection of label issues, outliers, ambiguous examples, well-labeled examples, and more. New metadata columns are continuously being added to Cleanlab Studio, both in the web app and in the API, where they are available through `download_cleanlab_columns()`. Read this page to learn about all of the different types of data issues that Cleanlab Studio can automatically detect in your datasets.
Most Cleanlab analyses (label issue, outlier, etc.) produce two columns:
- A numeric score between 0 and 1 indicating the magnitude of the attribute. For example, an outlier score of `0.99` indicates an extreme outlier (such data points warrant your attention), whereas a label issue score of `0.01` indicates a data point that is unlikely to be mislabeled (such data points do not warrant your attention). Sort your dataset by these scores to identify the most important data points to focus on. The raw magnitude of these scores is less important than their relative ordering across your dataset. Example: `label_issue_score`.
- A boolean flag indicating whether the attribute is estimated to be true or not. When both the score and boolean are present, the boolean is computed by thresholding the score, with the threshold chosen intelligently by Cleanlab Studio (we’ve extensively benchmarked these scores on a large number of datasets). For example, a data point with the outlier value `True` is likely an outlier that warrants your attention (as it may have an outsized impact on downstream models or indicate problems with the data sources). Example: `is_label_issue`.
For most applications, start by sorting your data points by the score you care about, then consider the corresponding boolean flag attribute, and finally consider a different score threshold if appropriate for your application.
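For example, here is a minimal sketch of this workflow in Python. It assumes your API key and cleanset ID are filled in, that `download_cleanlab_columns()` returns a pandas DataFrame, and that the 0.8 cutoff is just a hypothetical threshold you would tune for your application:

```python
from cleanlab_studio import Studio

studio = Studio("<YOUR_API_KEY>")  # placeholder for your API key
cleanlab_cols = studio.download_cleanlab_columns("<YOUR_CLEANSET_ID>")  # assumed to return a pandas DataFrame

# 1. Sort by the score you care about to surface the most severe issues first.
worst_first = cleanlab_cols.sort_values("label_issue_score", ascending=False)

# 2. Use the boolean flag computed with Cleanlab Studio's chosen threshold.
flagged = cleanlab_cols[cleanlab_cols["is_label_issue"]]

# 3. Or apply your own score threshold if appropriate for your application.
CUSTOM_THRESHOLD = 0.8  # hypothetical value, tune per application
custom_flagged = cleanlab_cols[cleanlab_cols["label_issue_score"] > CUSTOM_THRESHOLD]
```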
All analyses run independently (so e.g. a data point can be flagged as both a label issue and an outlier). More Cleanlab columns may be computed in Projects created in `Regular` vs. `Fast` mode.
| Analysis Type | Column Name | Type |
|---|---|---|
| Label Issues | `is_label_issue`<br>`label_issue_score`<br>`suggested_label`<br>`suggested_label_confidence_score` | Boolean<br>Float<br>String or Integer<br>Float |
| Ambiguous Examples | `is_ambiguous`<br>`ambiguous_score` | Boolean<br>Float |
| Well-labeled Examples | `is_well_labeled` | Boolean |
| Near Duplicates | `is_near_duplicate`<br>`near_duplicate_score`<br>`near_duplicate_cluster_id` | Boolean<br>Float<br>Integer |
| Outliers | `is_outlier`<br>`outlier_score` | Boolean<br>Float |
The following analyses are only run on text data:
| Analysis Type | Column Name | Type | Example |
|---|---|---|---|
| Personally Identifiable Information | `is_PII`<br>`PII_score`<br>`PII_types`<br>`PII_items` | Boolean<br>Float<br>List<br>List | hello@cleanlab.com, 242-123-4567 |
| Non-English Text | `is_non_english`<br>`non_english_score`<br>`predicted_language` | Boolean<br>Float<br>String | `404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body>` |
| Informal Text | `is_informal`<br>`informal_score`<br>`spelling_issue_score`<br>`grammar_issue_score`<br>`slang_issue_score` | Boolean<br>Float<br>Float<br>Float<br>Float | ‘sup my bro! how was the weekend at d beach? |
| Toxic Language | `is_toxic`<br>`toxic_score` | Boolean<br>Float | F**k you! |
| Text Sentiment | `sentiment_score` | Float | |
| Biased Language | `is_biased`<br>`bias_score`<br>`gender_bias_score`<br>`racial_bias_score`<br>`sexual_orientation_bias_score` | Boolean<br>Float<br>Float<br>Float<br>Float | He can’t be a nurse, nursing is a profession for women. |
| Text Length | `is_empty_text`<br>`text_num_characters` | Boolean<br>Integer | |
The following analyses are only run on image data:
| Analysis Type | Column Name | Type |
|---|---|---|
| Dark images | `is_dark`<br>`dark_score` | Boolean<br>Float |
| Light images | `is_light`<br>`light_score` | Boolean<br>Float |
| Blurry images | `is_blurry`<br>`blurry_score` | Boolean<br>Float |
| Low information images | `is_low_information`<br>`low_information_score` | Boolean<br>Float |
| Odd aspect ratio images | `is_odd_aspect_ratio`<br>`odd_aspect_ratio_score` | Boolean<br>Float |
| Grayscale images | `is_grayscale` | Boolean |
| Oddly sized images | `is_odd_size`<br>`odd_size_score` | Boolean<br>Float |
| Aesthetic images | `aesthetic_score` | Float |
| NSFW images | `is_NSFW`<br>`NSFW_score` | Boolean<br>Float |
| Not Analyzed | `is_not_analyzed` | Boolean |
Label Issues
Label issues are data points that are likely mislabeled. Learn more.
is_label_issue
Contains a boolean value, with `True` indicating that the data point has likely been labeled incorrectly in the original dataset.
label_issue_score
Contains a score between 0 and 1. The higher the score of a data point, the more likely it is mislabeled in the original dataset.
Mathematically, `is_label_issue` is estimated via Confident Learning algorithms, with `label_issue_score` defined based on the normalized_margin method described in the Confident Learning paper.
The determination of `is_label_issue` is not a simple cutoff applied to the `label_issue_score`, because different classes have different estimated label error rates and model prediction error rates, which Confident Learning accounts for.
suggested_label
Contains an alternative suggested label that appears more appropriate for the data point than the original given label. For multi-class classification and regression projects, the suggested label will be null for data points not flagged with any issues (i.e. `is_label_issue`, `is_outlier`, `is_ambiguous`, and `is_near_duplicate` are all `False`). The type of this column matches the type of the given label column (string or integer).
suggested_label_confidence_score
Contains a score between 0 and 1 for data points that have a `suggested_label`. The higher the score, the more confident we are in the suggested label.
While the `label_issue_score` quantifies the likelihood that a given label in the original dataset is incorrect (higher values correspond to data points that are more likely incorrectly annotated), the `suggested_label_confidence_score` quantifies the likelihood that Cleanlab’s predicted label for a data point is correct (higher values correspond to data points you can more confidently auto-fix / auto-label with Cleanlab). Unlabeled data points thus do not receive a `label_issue_score`, but they do receive a `suggested_label_confidence_score`.
For multi-class classification datasets, this score is the predicted class probability of the class deemed most likely by Cleanlab’s ML model. For multi-label datasets, the confidence score for each data point is aggregated over all the possible classes for this data point.
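For instance, you could use these two columns together to auto-fix only the most confidently mislabeled data points. The sketch below assumes the Cleanlab columns have been joined with your original data in a pandas DataFrame that has a `label` column, and the 0.9 cutoff is a hypothetical choice:

```python
import pandas as pd

def auto_fix_labels(df: pd.DataFrame, min_confidence: float = 0.9) -> pd.DataFrame:
    """Replace the given label with Cleanlab's suggested label, but only for rows
    flagged as label issues whose suggested label is confidently correct."""
    fix_mask = df["is_label_issue"] & (df["suggested_label_confidence_score"] >= min_confidence)
    fixed = df.copy()
    fixed.loc[fix_mask, "label"] = fixed.loc[fix_mask, "suggested_label"]  # "label" column name is an assumption
    return fixed
```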
Ambiguous
Ambiguous data points are not well-described by any class in a dataset. Even many different human annotators may disagree on the best class label for such data points. Such data might be appropriate to exclude from a dataset or have your experts review. The presence of ambiguous data points may also indicate the class definitions were not very clear in the original data annotation instructions – consider more clearly defining the distinction between certain classes with many data points flagged as ambiguous. Read about how ambiguity is scored.
is_ambiguous
Contains a boolean value, with `True` indicating that the data point appears ambiguous.
ambiguous_score
Contains a score between 0 and 1. The higher the score of a data point, the more ambiguous the data point (i.e. the more we anticipate multiple human annotators would disagree on how to properly label this data point).
Mathematically, this score is proportional to the normalized entropy of predicted class probabilities for each data point output by Cleanlab’s AutoML system.
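As a rough illustration of that quantity (not Cleanlab's exact implementation), the normalized entropy of a vector of predicted class probabilities can be computed as follows:

```python
import numpy as np

def normalized_entropy(pred_probs: np.ndarray) -> float:
    """Entropy of a class-probability vector, divided by log(K) so the result lies in [0, 1].
    Values near 1 mean the model is torn between classes (more ambiguous)."""
    eps = 1e-12  # guard against log(0)
    p = np.clip(pred_probs, eps, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

normalized_entropy(np.array([0.98, 0.01, 0.01]))  # ~0.10: confident prediction, low ambiguity
normalized_entropy(np.array([0.34, 0.33, 0.33]))  # ~1.00: nearly uniform, highly ambiguous
```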
Well-Labeled
Well-labeled data points are confidently estimated to already have the correct given label in the original dataset. These are good representative examples of their given class. Well-labeled examples can safely be used in downstream tasks where high quality is required. If manually reviewing a dataset to ensure the highest quality, you can safely skip reviewing these Well-labeled data points, which have already been validated by Cleanlab’s AI. Learn more.
is_well_labeled
Contains a boolean value, with `True` indicating that the given label of the data point is highly likely to be correct.
Outliers
Outliers are atypical data points that significantly differ from other data. Such anomalous data might be appropriate to exclude from a dataset. Outliers may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources. Learn more.
For images/text, outliers are detected based on how semantically similar the content is to any other content in the dataset. For text data, the length of the text is also taken into account.
is_outlier
Contains a boolean value, with `True` indicating that the data point is identified as an outlier.
outlier_score
Contains a score between 0 and 1. The higher the score of a data point, the more atypical it appears to be compared to the rest of the dataset (more extreme outlier).
Near Duplicates
Near duplicate data points are those that are identical or near-identical to other examples in the dataset. Extra duplicate copies of data might be appropriate to exclude from a dataset. They may otherwise have outsized impact on your modeling/analyses or indicate problems with the data sources.
is_near_duplicate
Contains a boolean value, with `True` indicating that this data point is a (near or exact) duplicate of another data point in the dataset.
near_duplicate_score
Contains a score between 0 and 1. The higher the score of a data point, the more similar it is to some other data point in the dataset.
Exact duplicates will have a score equal to 1. For images/text, near duplicates are scored based on how semantically similar the content is to any other content in the dataset.
near_duplicate_cluster_id
Contains a cluster ID for each data point, where data points with the same ID belong to the same set of near duplicates. Use this to determine which data points are near duplicates of one another. The cluster IDs are sequential integers: 0, 1, 2, … Data points that do not have near duplicates (where `is_near_duplicate` is `False`) are assigned an ID of `<NA>`.
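For example, you could group rows by this ID to review each set of near duplicates together, or keep just one representative per set. This sketch assumes the downloaded Cleanlab columns are in a pandas DataFrame `df`:

```python
import pandas as pd

dupes = df[df["is_near_duplicate"]]

# Review each set of near duplicates together.
for cluster_id, group in dupes.groupby("near_duplicate_cluster_id"):
    print(f"Near-duplicate set {cluster_id}: rows {list(group.index)}")

# Keep only the first row of each near-duplicate set, plus all rows with no near duplicates.
first_per_cluster = dupes[~dupes.duplicated(subset="near_duplicate_cluster_id")]
deduped = pd.concat([df[~df["is_near_duplicate"]], first_per_cluster]).sort_index()
```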
Columns specific to Text Data
Each of the values listed in this section is independently computed for each text field (it does not depend on other text fields in the dataset, unlike, say, duplicates or label issues, which depend on statistics of the full dataset). Learn more.
Personally Identifiable Information (PII)
Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive. Names, addresses, and phone numbers are examples of common PII.
Note: PII detection is currently only provided for text datasets from Projects run in Regular mode.
is_PII
Contains a boolean value, with `True` indicating that the text contains PII.
PII_score
Contains a score between 0 and 1. The higher the score of a data point, the more sensitive the PII detected.
- Low sensitivity: URL, social media username
- Moderate sensitivity: phone number, email address
- High sensitivity: social security number, credit card number
PII_types
Contains a comma separated list of the types of PII detected in this text. Possible types include: social security number, credit card number, phone number, email, social media username, URL.
PII_items
Contains a comma separated list of the PII items detected in the text. For example: abc@gmail.com, 242-139-4346, etc.
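For example, you could use these columns to isolate rows containing highly sensitive PII. The sketch below assumes the Cleanlab columns are in a pandas DataFrame `df`, and it normalizes the `PII_types` value in case it is stored as either a list or a comma-separated string:

```python
HIGH_SENSITIVITY = {"social security number", "credit card number"}  # per the sensitivity tiers above

def has_high_sensitivity_pii(pii_types) -> bool:
    """True if any detected PII type in this row is highly sensitive."""
    if not isinstance(pii_types, (str, list)):
        return False  # e.g. missing value for rows with no PII detected
    if isinstance(pii_types, str):
        pii_types = [t.strip() for t in pii_types.split(",")]
    return bool(HIGH_SENSITIVITY.intersection(pii_types))

sensitive_rows = df[df["PII_types"].apply(has_high_sensitivity_pii)]
```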
Non-English Text
Non-English text may include foreign languages or be unreadable due to nonsensical characters and other indecipherable gibberish.
Examples of text detected as non-English:
- `{{Answer}}\n \n {{nextStep}}\n\n {{followUp}}\n\n {{addInfo}}\n \n {{cusFeedback}}\n \n {{faq}}\n \n{{EOM}}`
- `404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body>`
- `la trasferenza al mio conto non è stata consentita`
Note: Non-English text detection is currently only provided for text datasets from Projects run in Regular mode.
is_non_english
Contains a boolean value, with `True` indicating that the text is not written in English.
non_english_score
Contains a score between 0 and 1 (defined as 1 minus the probability that this text is English, as estimated by a language detection model). The higher the score of a data point, the less the text resembles proper English language. High scores may correspond to text from other languages, or text contaminated by non-language strings (e.g. HTML/XML tags, identifiers, hashes, random character combinations that are not meaningfully readable).
predicted_language
Contains the predicted language of the text if it is identified to be non-English. The language will only be predicted if Cleanlab is confident that the text is written in another language (i.e. it is not just nonsensical characters); otherwise it will be assigned `<NA>`.
Informal Text
Informal text is poorly written either due to casual language or writing errors such as improper grammar or spelling.
Note: Informal text detection is currently only provided for text datasets from Projects run in Regular mode.
is_informal
Contains a boolean value, with `True` indicating that the text contains informal language and is not well written.
informal_score
Contains a score between 0 and 1. Data points with lower scores correspond to text written in a more formal style with fewer grammar/spelling errors. This score aggregates the `spelling_issue_score`, `grammar_issue_score`, and `slang_issue_score` defined below. The highest values of the `informal_score` (near 1.0) correspond to text that is poorly written with many obvious writing mistakes (many grammatical errors, spelling errors, and slang terms throughout).
spelling_issue_score
Contains a score between 0 and 1. Higher scores correspond to text with more misspelled words.
grammar_issue_score
Contains a score between 0 and 1. Higher scores correspond to text with more severe grammatical errors.
slang_issue_score
Contains a score between 0 and 1. Higher scores correspond to less formally written text with more slang and colloquial language.
Toxic Language
Text that contains hateful speech, harmful language, aggression, or related toxic elements.
Note: Toxicity detection is currently only provided for text datasets from Projects run in Regular mode.
is_toxic
Contains a boolean value, with `True` indicating that the given text likely contains toxic language.
toxic_score
Contains a score between 0 and 1. The higher the score of a data point, the more hateful the text appears.
Text Sentiment
Sentiment refers to how positive or negative the tone conveyed in some text sounds (i.e. how the author seems to feel about the topic).
Note: Sentiment is only estimated for text datasets from Projects run in Regular mode.
sentiment_score
Contains a score between 0 and 1. Higher scores correspond to text with stronger positive sentiments (positive opinions conveyed about the topic), while lower scores correspond to text with stronger negative sentiments. Text conveying a neutral opinion will receive sentiment scores around 0.5.
Note: Unlike many of Cleanlab’s other issue scores, higher sentiment scores may correspond to better text examples in your dataset.
Biased Language
Text that contains biased language, prejudiced expressions, stereotypes, or promotes discrimination/unfair treatment. Specific types of bias detected include discrimination against: gender, race, or sexual orientation.
Note: Bias detection is currently only provided for text datasets from Projects run in Regular mode.
is_biased
Contains a boolean value, with `True` indicating that the given text contains language likely to be perceived as biased. Since bias is inherently subjective, the `bias_score` thresholds used to determine `True` values here are stringently selected. If you find biased text that is not flagged by this boolean, simply apply a lower threshold to `bias_score` yourself.
bias_score
Contains a score between 0 and 1. The higher the score of a data point, the more likely the text might be perceived as biased. This score is an aggregation of the following sub-scores into an overall bias measure.
gender_bias_score
Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s gender.
racial_bias_score
Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s race.
sexual_orientation_bias_score
Contains a score between 0 and 1, with higher scores indicating that the given text contains language more likely to be perceived as discriminating against one’s sexual orientation.
Text Length
Texts that are unusually short or long could be anomalies in a dataset. For text datasets, Cleanlab Studio’s outlier score already accounts for the length of each text field, but we explicitly provide this information as well.
is_empty_text
Contains a boolean value, with `True` indicating that the given text field is empty (there is no text associated with this data point).
text_num_characters
The length of each text field (how many characters in each row). Any trailing white spaces are not considered in the character count.
Columns specific to Image Data
Cleanlab Studio performs various types of image-specific analyses that are independent of the machine learning task, annotations, and other factors. These analyses focus solely on the images within the dataset. This metadata plays a crucial role in identifying and removing low-quality or corrupted images from your dataset, which can adversely affect model performance. Learn more.
Dark images
These refer to images with insufficient brightness, appearing dim or underexposed, resulting in a lack of detail and clarity.
is_dark
Boolean value. True indicates that the image is classified as a dark image.
dark_score
A score between 0 and 1, representing the darkness of the image. Higher scores indicate darker images.
Light images
These are images that are excessively overexposed or washed out due to an abundance of light or high brightness levels.
is_light
Boolean value. True indicates that the image is classified as a light image.
light_score
A score between 0 and 1, indicating the brightness level of the image. Higher scores suggest brighter images.
Blurry images
These are images that lack sharpness and clarity, causing the subjects or details to appear hazy, out of focus, or indistinct.
is_blurry
Boolean value. True indicates that the image is classified as a blurry image.
blurry_score
A score between 0 and 1, representing the degree of blurriness in the image. Higher scores indicate a more blurry image.
Low information images
These are images lacking content (low entropy in values of their pixels).
is_low_information
Boolean value. True indicates that the image is classified as a low information image.
low_information_score
A score between 0 and 1, representing the severity of the image’s lack of information. Higher scores correspond to images containing less information.
Grayscale images
These are images lacking color.
is_grayscale
Boolean value. True indicates that the image is classified as a grayscale image.
Odd Aspect Ratio images
These are images with unusually long width or height (asymmetric dimensions).
is_odd_aspect_ratio
Boolean value. True indicates that the image is classified as having an odd aspect ratio.
odd_aspect_ratio_score
A score between 0 and 1, representing the degree of irregularity in the image’s aspect ratio. Higher scores correspond to images whose aspect ratio is more extreme.
Odd sized images
These are images that are unusually small relative to the rest of the images in the dataset.
is_odd_size
Boolean value. True indicates that the image is classified as an image with an odd size.
odd_size_score
A score between 0 and 1, indicating the degree of irregularity in the image’s size. Higher scores indicate images whose size is more unusual compared to the other images in the dataset.
Aesthetic images
These are images which are visually appealing (as rated by most people, although this is highly subjective). They might be artistic, beautiful photographs, or contain otherwise interesting content.
Note: Aesthetic scores are currently only provided for image datasets from Projects run in Regular mode.
aesthetic_score
A score between 0 and 1, representing how aesthetic the image is estimated to be. If training Generative AI models on your dataset, consider first filtering out the images with low `aesthetic_score`.
Note: Higher aesthetic scores correspond to higher-quality images in the dataset (unlike many of Cleanlab’s other issue scores).
NSFW images
NSFW (Not Safe For Work) images refer to any visual content that is not suitable for viewing in a professional or public environment, typically due to its explicit, pornographic, or graphic nature. These images may contain nudity, sexual acts, or other content that is considered inappropriate for certain settings, such as workplaces or public spaces.
Note: NSFW detection is currently only provided for image datasets from Projects run in Regular mode.
is_NSFW
Boolean value. True indicates that the image depicts NSFW content.
NSFW_score
A score between 0 and 1, representing the severity of the issue. Higher scores correspond to images more likely to be considered inappropriate.
Not Analyzed
For image projects only, a few of the images in the dataset might fail to be processed due to reasons such as poorly formatted data or invalid image file paths. None of our other analyses will be run on such images, and all the other Cleanlab columns will be filled with default values (e.g. `False` for all the boolean issue columns). Thus we provide a boolean column so such rows can be easily filtered out.
is_not_analyzed
Contains a boolean value, with `True` indicating that this data point was not analyzed because its image failed to be processed. Images may fail to process because of an unsupported file format or corrupted file contents. Consider using a more standard file format for each image that failed to be processed.
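For example, you might drop these rows before interpreting any other Cleanlab columns, since their values are just defaults rather than real analysis results. A minimal sketch, assuming the Cleanlab columns are in a pandas DataFrame `df`:

```python
not_analyzed = df[df["is_not_analyzed"]]
print(f"{len(not_analyzed)} images failed to be processed and were not analyzed")

# Restrict attention to rows whose Cleanlab columns reflect actual analysis results.
analyzed = df[~df["is_not_analyzed"]]
```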