Catching Issues in a Product Catalog (Multimodal Dataset)


This tutorial demonstrates how e-commerce businesses can use Cleanlab Studio to ensure high-quality product catalogs. Follow this tutorial to see how to programmatically audit a multimodal dataset (with images, text, and structured/tabular numeric/categorical features) using the Cleanlab Studio Python API.

Cleanlab Studio handles such multimodal data (e.g., product images, titles/descriptions, and price, size, brand, …) to auto-detect many common issues in product catalogs, including:

Products (SKUs) that are miscategorized or have incorrect tags (tax-classifications, age-restrictions, …)
Near-duplicate products (SKUs)
Products with images that are low-quality (over/under-exposed, blurry, low information, …) or NSFW
Products with text titles/descriptions/reviews that are low-quality (non-English/unreadable) or unsafe (toxic language, PII)
Products whose image does not match description
Products whose price, size, or other numeric attribute seems off.

Issues detected in a product catalog

Install and import dependencies

Make sure you have wget and unzip installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:

%pip install -U Pillow cleanlab-studio

import numpy as np
import pandas as pd
import os
import random

from IPython.display import display, Markdown

pd.set_option("display.max_colwidth", None)

Get the dataset

This tutorial uses a variant of this retailer’s product catalog dataset with the following information about each product/SKU:

product category (11 unique classes)
sale price, average rating (numeric features in structured/tabular format)
brand (categorical feature in structured/tabular format)
title, description (text fields)
image of the product

The data are in a CSV file, with images stored in a separate directory (their filepaths are listed in a column of the CSV). Let’s download the dataset:

wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/product_catalog.zip -O product_catalog.zip
unzip -q product_catalog.zip

The unzipped data directory product_catalog/ has the following structure:

product_catalog/                 # Main directory after unzipping the archive.
|
|-- metadata.csv                 # CSV of product information (tabular and text fields, filepaths to images)
|         
|-- images/                      # Directory containing product images
|   |-- <image_filename_1>.jpg   # Example of an image file 
|   |-- <image_filename_2>.jpg   # Another image file
|   |-- ...

Each row (data point) in the CSV file corresponds to a product within the catalog. Let’s load the data:

BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "product_catalog")

data = pd.read_csv(os.path.join(dataset_path, "metadata.csv"))
data.head(1)

	sku	title	description	brand	sale_price	average_product_rating	image	category
0	53d474313dd0	Park B. Smith® Chevron Foil-Printed Decorative Pillow	A chic foil-printed chevron pattern and luxurious comfort make this elegant decorative pillow a must-have accent piece for your formal living space. foil-printed cover plush and soft cotton cover with feather fill spot clean 12x18" imported	Park B Smith	43.5	3.0	images/0.jpg	bed & bath
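Before uploading, it can be worth confirming that every filepath listed in the image column resolves to a file on disk. The sketch below is one way to do that with the standard library. `find_missing_images` is a hypothetical helper (not part of Cleanlab Studio), and the demo builds a tiny throwaway catalog instead of touching the real dataset.

```python
import csv
import tempfile
from pathlib import Path


def find_missing_images(dataset_dir, image_column="image"):
    """Return image paths listed in metadata.csv that do not exist on disk.

    Hypothetical helper; paths are resolved relative to dataset_dir.
    """
    dataset_dir = Path(dataset_dir)
    with open(dataset_dir / "metadata.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [
        row[image_column]
        for row in rows
        if not (dataset_dir / row[image_column]).is_file()
    ]


# Demo on a throwaway directory that mimics the product_catalog/ layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "images" / "0.jpg").write_bytes(b"placeholder")  # not a real JPEG
    with open(root / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sku", "image", "category"])
        writer.writerow(["a1", "images/0.jpg", "pants"])
        writer.writerow(["b2", "images/1.jpg", "shoes"])  # file intentionally absent
    missing = find_missing_images(root)

print(missing)
```

For the tutorial dataset, the equivalent call would be `find_missing_images(dataset_path)`.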

Optional: Initialize helper methods to view images in the DataFrame

import pathlib
from PIL import Image
from io import BytesIO
from base64 import b64encode
from IPython.display import HTML

BASE_PATH = os.getcwd()


def path_to_img_html(path: str) -> str:
    buf = BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG")
    b64 = b64encode(buf.getvalue()).decode("utf8")
    return f'<img src="data:image/jpeg;base64,{b64}" width="175px" alt="" />'


def display_image(df):
    image_column = "rendered_image"
    df_copy = df.copy()
    df_copy[image_column] = df_copy["image"].apply(
        lambda x: pathlib.Path(BASE_PATH).joinpath("product_catalog").joinpath(x)
    )

    # Rearrange columns to move image_column after "image"
    columns = list(df_copy.columns)
    columns.remove(image_column)
    columns.insert(columns.index("image") + 1, image_column)
    df_copy = df_copy[columns]
    return HTML(
        df_copy.to_html(escape=False, formatters={image_column: path_to_img_html})
    )

display_image(data.head(3))

	sku	title	description	brand	sale_price	average_product_rating	image	category
0	53d474313dd0	Park B. Smith® Chevron Foil-Printed Decorative Pillow	A chic foil-printed chevron pattern and luxurious comfort make this elegant decorative pillow a must-have accent piece for your formal living space. foil-printed cover plush and soft cotton cover with feather fill spot clean 12x18" imported	Park B Smith	43.50	3.0	images/0.jpg	bed & bath
1	4a1e2abd23c0	a.n.a Jeggings - Plus size	Our a.n.a® jeggings gives you the look of super skinny jeans that you love to pair with your favorite tops. flat front back pockets 31" inseam cotton/Lyocell/polyester/spandex washable imported	A.N.A	36.25	4.8	images/1.jpg	pants
2	debf3b021522	Carter's® Bunny Slippers - Baby Girls	Little feet will look cuter than ever—and stay warm and cozy—in these precious bunny slippers from Carter's. 3D ears embroidered face 3D tail faux fur lining synthetic wipe clean with damp cloth imported	Carter's	10.14	4.5	images/2.jpg	shoes

Load data into Cleanlab Studio

The multimodal product catalog dataset spans two formats: tabular and image. We’ll load both into Cleanlab Studio.

First use your API key to instantiate a Studio object.

from cleanlab_studio import Studio

# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert API key here>"

# Initialize studio object
studio = Studio(API_KEY)

For loading the image data, we use the Metadata ZIP format and for loading the text/tabular data, we use the CSV format. Both formats are detailed in the dataset guide.

Load image data

The Metadata ZIP format requires a CSV file named metadata.csv with at least these 2 columns:

image’s path - contains the relative file-path for each image. Filepaths should not include the parent folder (product_catalog/ in this example)
label - contains the label for each image.

The dataset used in this tutorial adheres to this format: our “metadata.csv” file contains both the “image” and “category” columns. The other columns are optional and are not used for training AI models in the image modality.

The ZIP file we downloaded can be directly loaded into Cleanlab Studio.
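A quick local check of the two requirements above (a metadata.csv entry, plus the path and label columns) can catch format problems before the upload starts. This is a minimal sketch under assumptions: `check_metadata_zip` is a hypothetical helper, the column names "image" and "category" are the ones used in this tutorial's dataset, and the demo archive is a toy file created on the fly. It does not replicate whatever validation Cleanlab Studio itself performs.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def check_metadata_zip(zip_path, required_columns=("image", "category")):
    """Return the required columns that are missing from metadata.csv in the ZIP.

    Hypothetical pre-upload check. Raises FileNotFoundError if the archive
    has no entry named metadata.csv.
    """
    with zipfile.ZipFile(zip_path) as zf:
        names = [n for n in zf.namelist() if PurePosixPath(n).name == "metadata.csv"]
        if not names:
            raise FileNotFoundError("no metadata.csv found in archive")
        with zf.open(names[0]) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            header = next(reader, [])  # empty file -> no columns
    return [col for col in required_columns if col not in header]


# Demo with a toy archive (file contents are placeholders, not real images).
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "toy_catalog.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_catalog/metadata.csv", "sku,image,category\na1,images/0.jpg,pants\n")
        zf.writestr("toy_catalog/images/0.jpg", b"placeholder")
    missing_default = check_metadata_zip(zip_path)
    missing_label = check_metadata_zip(zip_path, required_columns=("image", "label"))

print(missing_default, missing_label)
```

An empty list means both required columns were found in the header row.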

Note: To work with big image datasets, we recommend hosting them in cloud storage (rather than a local ZIP file), as demonstrated in this tutorial: Finding Issues in Large-Scale Image Datasets.

image_dataset_id = studio.upload_dataset(
    "product_catalog.zip", dataset_name="multimodal-image"
)
print(f"Dataset ID: {image_dataset_id}")

Load tabular data

The tabular data contains categorical, numeric, and text features. Before loading the DataFrame, let’s remove the “image” column since it’s redundant for the tabular and text modalities. When loading the dataset into Cleanlab Studio, we explicitly set the type of the sale_price column to float, since later in this tutorial we’ll run a regression project with sale_price as the label column (the label column must have float type for a regression project; otherwise this step is optional). See this guide for more information on schema overrides.

# Remove "image" column
data_copy = data.copy()
data_copy.drop(columns="image", inplace=True)

tabular_dataset_id = studio.upload_dataset(
    data_copy,
    dataset_name="multimodal-tabular",
    schema_overrides=[{"name": "sale_price", "column_type": "float"}],
)
print(f"Dataset ID: {tabular_dataset_id}")
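Because sale_price is forced to float, it may help to confirm locally that every value in that column actually parses as a number. The following sketch uses `pandas.to_numeric` on a toy DataFrame (the values here are made up for illustration); in the tutorial you would run the same check on `data_copy`.

```python
import pandas as pd

# Toy stand-in for the catalog; one price is deliberately not numeric.
toy = pd.DataFrame(
    {"sku": ["a1", "b2", "c3"], "sale_price": ["43.5", "36.25", "N/A"]}
)

# Values that cannot be parsed as numbers become NaN.
as_float = pd.to_numeric(toy["sale_price"], errors="coerce")

# Rows that had a value originally but failed to parse.
unparseable = toy[as_float.isna() & toy["sale_price"].notna()]
print(unparseable)
```

Rows listed in `unparseable` are worth fixing or dropping before relying on the float override.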

Launch projects

A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset.

Here we launch multiple projects, each analyzing a particular aspect of the product information. In total, we run the following 3 projects:

Image project: with the product images
Text project: with the product descriptions
Tabular project: with all the numeric/string product features

In each project, we treat the product category as a label. Since each product belongs to exactly one category, this is a multi-class classification task. If you are instead analyzing product tags (where one or more tags can apply to the same product), specify that your task is multi-label classification instead.

The image project is based on the previously uploaded image dataset (image_dataset_id).

# Name of the column containing labels
label_column = "category"

image_project_id = studio.create_project(
    dataset_id=image_dataset_id,
    project_name="multimodal-image",
    modality="image",
    task_type="multi-class",
    model_type="regular",
    label_column=label_column,
)
print(
    f"Project successfully created and training has begun! project_id: {image_project_id}"
)

For the rest of the projects, we use the previously uploaded tabular dataset (tabular_dataset_id).

For each project, we specify certain input text or tabular column(s) to use as predictive features. In each project, Cleanlab will train a separate AI model to predict the label based on these features.

# Name of the column containing product description
product_description_column = "description"

description_project_id = studio.create_project(
    dataset_id=tabular_dataset_id,
    project_name="multimodal-text-description",
    modality="text",
    task_type="multi-class",
    model_type="regular",
    label_column=label_column,
    text_column=product_description_column,
)
print(
    f"Project successfully created and training has begun! project_id: {description_project_id}"
)

# Name of the columns for tabular data modelling
tabular_columns = [
    "title",
    "description",
    "sale_price",
    "average_product_rating",
    "brand",
]

tabular_project_id = studio.create_project(
    dataset_id=tabular_dataset_id,
    project_name="multimodal-tabular",
    modality="tabular",
    task_type="multi-class",
    model_type="regular",
    label_column=label_column,
    feature_columns=tabular_columns,
)
print(
    f"Project successfully created and training has begun! project_id: {tabular_project_id}"
)

Once the projects have launched successfully and the project IDs are printed, feel free to close this notebook. It will take time for Cleanlab’s AI to train these models and analyze your data. Come back after training is complete (you will receive an email) and continue with the notebook to review your results. For this tutorial, every project should produce a cleanset (cleaned dataset) in around 20 minutes.

You should only execute the above cells once! Do not call create_project again.
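One way to guard against accidentally re-creating a project is to cache each returned ID on disk and reuse it on later runs. The sketch below illustrates the idea with the standard library only. `get_or_create_id` and the `project_ids.json` filename are hypothetical, and `fake_create_project` is a stand-in for the real `studio.create_project(...)` call so the example runs without an API key.

```python
import json
import tempfile
from pathlib import Path


def get_or_create_id(cache_file, key, create_fn):
    """Return the ID cached under `key`; call create_fn() only if none is stored."""
    cache_file = Path(cache_file)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    if key not in cache:
        cache[key] = create_fn()
        cache_file.write_text(json.dumps(cache, indent=2))
    return cache[key]


# Stand-in for studio.create_project(...) that records how often it is called.
calls = []


def fake_create_project():
    calls.append(1)
    return f"project-{len(calls)}"


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "project_ids.json"
    first = get_or_create_id(cache_path, "image", fake_create_project)
    second = get_or_create_id(cache_path, "image", fake_create_project)

print(first, second, len(calls))
```

In the notebook, `create_fn` could be a lambda wrapping the same `studio.create_project(...)` arguments used in the cells above.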

Before we proceed with the rest of the tutorial, we will wait for all the projects to complete.

You can poll each project’s status to programmatically wait until the results (and cleanset) are ready for review:

# Fetch cleanset_id
image_cleanset_id = studio.get_latest_cleanset_id(image_project_id)
# Poll for project status
project_status = studio.wait_until_cleanset_ready(image_cleanset_id)

description_cleanset_id = studio.get_latest_cleanset_id(description_project_id)
project_status = studio.wait_until_cleanset_ready(description_cleanset_id)

tabular_cleanset_id = studio.get_latest_cleanset_id(tabular_project_id)
project_status = studio.wait_until_cleanset_ready(tabular_cleanset_id)

Get project results

We can fetch the Cleanlab columns that contain the metadata for the cleanset using its cleanset_id. These columns have the same length as our original data and provide metadata about each individual data point, like what types of issues it exhibits and how severely.

If at any point you want to re-run the remaining parts of this notebook (without creating another Project), simply call studio.download_cleanlab_columns(cleanset_id) with the cleanset_id printed from the previous cells.

# Download cleanlab columns from image project
image_cleanlab_columns = studio.download_cleanlab_columns(image_cleanset_id)
# Download cleanlab columns from description text project
description_cleanlab_columns = studio.download_cleanlab_columns(description_cleanset_id)
# Download cleanlab columns from tabular project
tabular_cleanlab_columns = studio.download_cleanlab_columns(tabular_cleanset_id)

Label issues detected in tabular project

Each Cleanlab project auto-detects many different types of issues in the corresponding dataset. We demonstrate how to programmatically review these issues, and which data points exhibit them. Let’s start with label issues. First, we merge the Cleanlab columns with the original DataFrame to view them jointly.

# Combine original DataFrame with Cleanlab columns
tabular_cleanlab_df = data.merge(
    tabular_cleanlab_columns, left_index=True, right_index=True
)

# Set "given_label" column to the original label
tabular_cleanlab_df.rename(columns={"category": "given_label"}, inplace=True)

Recall that we specified the product category as the label in the project. To see which data points Cleanlab has estimated to be mislabeled, we filter by is_label_issue. For our tutorial dataset, these label issues correspond to products that are likely miscategorized. We sort the label issues by label_issue_score to see which of these data points are most likely mislabeled.

tabular_label_issue = tabular_cleanlab_df.query(
    "is_label_issue", engine="python"
).sort_values("label_issue_score", ascending=False)

# Select columns to display results
label_issue_columns = [
    "given_label",
    "suggested_label",
    "label_issue_score",
    "is_label_issue",
]
columns_to_display = tabular_columns + label_issue_columns

display(tabular_label_issue.head()[columns_to_display])

	title	description	sale_price	average_product_rating	brand	given_label	suggested_label	label_issue_score	is_label_issue
718	Nike® 3-pk. Dri-FIT Low Cut Socks	The Nike Dri-FIT® low-cut socks give you the moisture-wicking performance and cushioned support you need. dri-FIT® moisture technology wicking keeps feet cool and dry maximum breathability cushioned footbed for comfort arch compression for perfected fit polyester/nylon/cotton/spandex washable imported sock size 10-13 fits shoe sizes 8-12	21.14	3.1	Nike	accessories	underwear & socks	0.994825	True
357	Liz Claiborne® Emma Cropped Ankle Pants - Plus	Our cropped ankle pants have a straight leg and sleek look that you'll love wearing to work or on nights out. ● hook-and-bar with zip closure● straight leg● back pockets● 23" inseam● cotton/rayon/spandex● washable● imported	31.42	4.9	LIZ CLAIBORNE	shorts	pants	0.985283	True
736	Speedo® Tropical Striped Swim Trunks	The VaporPLUS quick-drying fabric of our swim trunks will keep you comfortable no matter how many times you go in the water. UPF 50+ protection relaxed fit mesh liner Veclro® fly with string closure inside key pocket side seam pockets 10" inseam polyester machine wash, hang dry imported	39.87	5.0	Speedo	shorts	swimwear	0.977736	True
509	Speedo® Straight Away Swim Trunks	Featuring the latest quick-drying technology, our Speedo swim trunks will be your go-to for your days by the beach or pool. Block the Burn® UPF 50+ protection VaporPLUS™ quick-drying fabric relaxed fit Velcro® fly with drawstring side seam pockets interior key pocket mesh liner 10" inseam polyester machine wash, hang dry imported	39.87	5.0	Speedo	shorts	swimwear	0.967584	True
341	Stylus™ Long-Sleeve Textured Cable Sweater	Our classic crewneck cable-knit sweater gets a casual update with a flatteringly comfortable fit. cotton/acrylic side slits high-low hem machine wash, dry flat imported misses: approx. 24½" - 29½" length petite: approx. 23" - 25" length	15.70	4.4	STYLUS	tops	coats & jackets	0.956729	True

The above analysis implies that the category assigned to these products is wrong. Such errors can hamper discoverability, worsen customer experience, and degrade ML/Analytics efforts.

Label issues and outliers detected in “description” text project

We’ll repeat the steps of merging Cleanlab columns with the original DataFrame and filter the merged DataFrame for data points flagged with label issues.

# Combine description DataFrame with Cleanlab columns
description_cleanlab_df = data.merge(
    description_cleanlab_columns, left_index=True, right_index=True
)
description_cleanlab_df.rename(columns={"category": "given_label"}, inplace=True)

# Filter label issues
description_label_issue = description_cleanlab_df.query(
    "is_label_issue", engine="python"
).sort_values("label_issue_score", ascending=False)
columns_to_display = [product_description_column] + label_issue_columns

display(description_label_issue.head()[columns_to_display])

	description	given_label	suggested_label	label_issue_score	is_label_issue
738	Crochet edges give our sweater tank top a look of class that's also ready for the weekend. v-neck approx. 24" - 27½" length cotton/polyester/Lurex® metallic washable imported	coats & jackets	tops	0.755175	True
983	Gotham City will be cleaned up in no time when you've got our Batman flip flops. nubuck/synthetic/nylon upper spot clean synthetic sole	swimwear	shoes	0.733110	True
357	Our cropped ankle pants have a straight leg and sleek look that you'll love wearing to work or on nights out. ● hook-and-bar with zip closure● straight leg● back pockets● 23" inseam● cotton/rayon/spandex● washable● imported	shorts	pants	0.720278	True
728	With just the right touch of trendy ombre coloring, our Alfred Dunner layered sweater brings some modern excitement to your wardrobe and then it tops it off with a pretty necklace for a pulled together look. crewneck with necklace 3/4 sleeves approx. 22-24" length polyester/acrylic cami: polyester hand wash, dry flat imported	coats & jackets	tops	0.696369	True
718	The Nike Dri-FIT® low-cut socks give you the moisture-wicking performance and cushioned support you need. dri-FIT® moisture technology wicking keeps feet cool and dry maximum breathability cushioned footbed for comfort arch compression for perfected fit polyester/nylon/cotton/spandex washable imported sock size 10-13 fits shoe sizes 8-12	accessories	underwear & socks	0.667388	True

Our ‘description’ text project suggests that, for each of these flagged products, either the description is misleading or the assigned category is incorrect.

Note that Cleanlab trains different ML models for tabular projects vs. text projects. Beyond identifying mismatched text descriptions, this text project may help you more accurately identify miscategorized products, depending on the relative accuracy of the tabular vs. text ML models for your dataset.

If your dataset has multiple text fields per data point, you can either analyze each in a separate project, or concatenate them all together in a single text project.
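For the concatenation option, a single combined column can be built with pandas before uploading. This is a sketch on toy data: the `combined_text` column name and the " | " separator are arbitrary choices for illustration, and missing fields are skipped so they do not appear as the string "nan".

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Chevron Pillow", None],
        "description": ["Plush cotton cover.", "Quick-drying swim trunks."],
    }
)

# Join the non-missing text fields of each row with a separator.
toy["combined_text"] = toy[["title", "description"]].apply(
    lambda row: " | ".join(str(value) for value in row if pd.notna(value)),
    axis=1,
)
print(toy["combined_text"].tolist())
```

The combined column would then be the one passed as `text_column` when creating the text project.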

Outliers: Beyond mis-labeled (mis-categorized) data, Cleanlab Studio also auto-detects outliers (anomalies) lurking in your dataset. These data points are very different from the rest of the data. Outliers in a product catalog may correspond to products which are strange, or whose information got corrupted.

Let’s review the outliers detected in these product descriptions. Cleanlab Studio’s AI understands the semantics of text and can identify text which meaningfully stands out:

description_outlier_issue = description_cleanlab_df.query(
    "is_outlier", engine="python"
).sort_values("outlier_score", ascending=False)

outlier_columns = ["given_label", "outlier_score", "is_empty_text", "text_num_characters", "is_outlier"]
columns_to_display = [product_description_column] + outlier_columns

display(description_outlier_issue.head(5)[columns_to_display])

	description	given_label	outlier_score	is_empty_text	text_num_characters	is_outlier
541	m	accessories	0.800763	False	1	True
959	<!--<b><d><p></p></d></b><>-->	underwear & socks	0.706694	False	30	True
5	404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body>	shorts	0.679714	False	60	True
138	/tmp/description/3116287515.txt: missing	kitchen & dining	0.117357	False	40	True
16	useless garbage absolute worst pathetic brand	pajamas	0.097774	False	45	True

Other data issues detected in “description” text project

Cleanlab Studio can also detect other problems in text data, such as the occurrence of: toxic language, personally identifiable information (PII), or nonsensical language (e.g. HTML/XML tags and other random strings contaminating text descriptions), in the text columns of the dataset. The following Cleanlab columns are specific to the text fields in the dataset (see here for details), and are useful for detecting low-quality or unsafe text (content moderation).

Similar to the above, the is_<issue> column contains boolean values indicating if a text field has been identified to exhibit a particular issue, and the <issue>_score column contains numeric scores between 0 and 1 indicating the severity of this particular issue (1 indicates the most severe instance of the issue).
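Since every issue type follows the same is_<issue> / <issue>_score naming pattern, the filter-and-sort steps used throughout this tutorial can be wrapped in one small function. The sketch below shows the idea on a toy DataFrame. `top_issues` is a hypothetical helper and the rows are invented; the column names follow the pattern described above.

```python
import pandas as pd


def top_issues(df, issue, n=5):
    """Rows flagged with `issue`, most severe first.

    Relies on the is_<issue> and <issue>_score column naming pattern.
    """
    flagged = df[df[f"is_{issue}"].astype(bool)]
    return flagged.sort_values(f"{issue}_score", ascending=False).head(n)


toy = pd.DataFrame(
    {
        "description": ["ok text", "call 555 0100", "<p></p>"],
        "is_PII": [False, True, False],
        "PII_score": [0.0, 0.5, 0.0],
        "is_non_english": [False, False, True],
        "non_english_score": [0.01, 0.02, 0.99],
    }
)

# Number of flagged rows per issue type.
issue_counts = {
    issue: int(toy[f"is_{issue}"].sum()) for issue in ["PII", "non_english"]
}
print(issue_counts)
print(top_issues(toy, "non_english")["description"].tolist())
```

With the real results, a call such as `top_issues(description_cleanlab_df, "PII")` would select the same rows as the explicit query-and-sort cells in this section.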

Personally Identifiable Information (PII) is information that could be used to identify an individual or is otherwise sensitive, which should not be present in product descriptions. Cleanlab’s PII issue check also returns two extra columns, PII_items and PII_types, which list the specific PII detected in the text and its type.

Here’s the PII detected in these product descriptions:

description_pii_issue = description_cleanlab_df.query(
    "is_PII", engine="python"
).sort_values("PII_score", ascending=False)

pii_columns = ["PII_items", "PII_types", "PII_score", "is_PII"]
columns_to_display = [product_description_column] + pii_columns

display(description_pii_issue.head(5)[columns_to_display])

	description	PII_items	PII_types	PII_score	is_PII
142	Pillow with hearts that can be placed anywhere. Call us on 202 856 1167 for free trial." spot clean imported	["202 856 1167"]	["phone number"]	0.5	True
366	Pay directly for a 10% discount. Email for details at judmunz@yahoo.com. Work out in our shorts, featuring an elastic waist and Champion Vapor moisture-wicking fabric to keep you cool and dry. elastic waist with drawstring 2 pockets flat front Champion Vapor moisture-wicking fabric 11" inseam polyester washable imported	["judmunz@yahoo.com"]	["email"]	0.5	True
600	Black to brown with this useful reversible belt, perfect for formal and casual wear. Made with 100% leather materials. We can manufacture for your preferred colours on large orders. Call us on 212-978-1213 for details.	["212-978-1213"]	["phone number"]	0.5	True
920	Star war fans, here we come with this new sketchers for your toddlers. They would be on sky nine. We accept Paypal on shoeseller@myshop.com	["shoeseller@myshop.com"]	["email"]	0.5	True
1070	These shorts are very comfortable, lightweight and beach ready. Bulk orders accepted. Pay directly for a 10% discount. Contact @ hulphasi@yahoo.com.	["hulphasi@yahoo.com"]	["email"]	0.5	True
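Once PII has been located, one possible remediation is to mask it in the description text. The sketch below replaces each detected item with a placeholder. `redact_pii` is a hypothetical helper, and accepting either a Python list or a JSON-encoded string for the PII items is an assumption about how the PII_items column may be stored after download.

```python
import json


def redact_pii(text, pii_items, placeholder="[REDACTED]"):
    """Replace each detected PII string in `text` with a placeholder.

    `pii_items` may be a list of strings or a JSON-encoded list (assumption).
    """
    if isinstance(pii_items, str):
        pii_items = json.loads(pii_items)
    for item in pii_items:
        text = text.replace(item, placeholder)
    return text


example = "Call us on 212-978-1213 for details."
redacted = redact_pii(example, '["212-978-1213"]')
print(redacted)
```

To apply this row-wise, something like `description_pii_issue.apply(lambda r: redact_pii(r["description"], r["PII_items"]), axis=1)` should work, though the redacted text still deserves a manual review.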

Non-English text includes text written in a foreign language or containing nonsensical characters (such as HTML/XML tags, identifiers, hashes, random characters).

It is often more effective to offer language options on the ecommerce platform and let users select their preferred language than to mix multiple languages in the catalog’s product descriptions, which negatively affects customer experience. Nonsensical product descriptions are particularly damaging and should be remedied immediately.

For text detected to be non-English, Cleanlab Studio predicts its language in the predicted_language column. If an alternative language cannot be predicted (either because the text contains more than one language, or because it is a nonsensical/unreadable string such as HTML/XML or otherwise malformatted text), predicted_language will contain a null value.

Here’s the Non-English text detected in these product descriptions:

description_non_en_issue = description_cleanlab_df.query(
    "is_non_english", engine="python"
).sort_values("non_english_score", ascending=False)

non_en_columns = ["predicted_language", "non_english_score", "is_non_english"]
columns_to_display = [product_description_column] + non_en_columns

display(description_non_en_issue.head(5)[columns_to_display])

	description	predicted_language	non_english_score	is_non_english
959	<!--<b><d><p></p></d></b><>-->	<NA>	0.999843	True
541	m	<NA>	0.997578	True
5	404Error<body><p>InvalidUsername</p><p>InvalidPIN</p></body>	<NA>	0.987424	True
924	La teiera in ghisa si ispira alle antiche teiere cinesi in ghisa molto pregiate ancora in uso oggi. Cestino per la preparazione del tè in ottone con rivestimento in porcellana blu destinato alla preparazione e al servizio del tè, non per piano cottura	Italian	0.849882	True
1032	"Calcetines de la tripulación. En el trabajo o en el juego estos calcetines ricos en algodón cuentan con un toque de elasticidad para brindar comodidad durante todo el día. Paquete de 3 pares de mezcla de algodón suave para mayor transpirabilidad mejor ajuste con puntera sin costuras de spandex."	Spanish	0.841750	True
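The null-versus-predicted distinction in predicted_language suggests a simple triage: descriptions with a predicted language are candidates for translation, while those with a null value likely need to be rewritten. A minimal sketch of that split on toy rows (shortened from the examples above) follows; the variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "description": ["<p></p>", "La teiera in ghisa", "Calcetines de algodón"],
        "predicted_language": [None, "Italian", "Spanish"],
        "is_non_english": [True, True, True],
    }
)

flagged = toy[toy["is_non_english"]]

# No language could be predicted: likely unreadable or mixed-language text.
unreadable = flagged[flagged["predicted_language"].isna()]

# A language was predicted: candidates for translation.
translatable = flagged[flagged["predicted_language"].notna()]

language_counts = translatable["predicted_language"].value_counts().to_dict()
print(len(unreadable), language_counts)
```

The same two filters can be applied to `description_non_en_issue` from the cell above.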

Informal text contains casual language, slang, or poor writing such as improper grammar or spelling. Maintaining a professional tone in product descriptions is generally recommended for product catalogs, and is important for customer trust.

Here’s the informal text detected in these product descriptions:

description_informal_issue = description_cleanlab_df.query(
    "is_informal", engine="python"
).sort_values("informal_score", ascending=False)

informal_columns = ["informal_score", "spelling_issue_score", "grammar_issue_score", "slang_issue_score", "is_informal"]
columns_to_display = [product_description_column] + informal_columns

display(description_informal_issue.head(5)[columns_to_display])

	description	informal_score	spelling_issue_score	grammar_issue_score	slang_issue_score	is_informal
351	You'll love the year-round cottony comfy (feels like its hugging you) of our short-sleeve crewneck tee. cotton washable imported	0.667319	0.095238	0.891363	0.729315	True
541	m	0.647167	1.000000	0.277822	0.840096	True
365	Go beyond the basics with our flattering ribbed tank top. For ya'll gals out there looking to spice ur wardrobe. roundneck approx. 27¾ - 29½" length fun coral: cotton/polyester/spandex washable imported	0.638390	0.125000	0.707476	0.825999	True
651	Buy the shine dress 4 ur bedding. Colorful, kid friendly, wash nicely, proofing stains for food. Made in Africa, shipped from 128758	0.628461	0.000000	0.897662	0.673491	True
1003	this will make you feel buy more things to fill your pocket since it has loads of pockets. keep all your friends phone in your pocket hah.	0.622662	0.000000	0.725848	0.830807	True

Label issues, outliers, and ambiguous examples detected in image project

Cleanlab also trains different ML models for image projects. Depending on their accuracy, they could be more/less effective for detecting label issues than the text/tabular projects (or may simply detect a different subset of the label issues lurking in a dataset). The flagged label issues can also be used to identify a product whose image is misleading (does not reflect its category/description).

# Column containing image's path
image_path_column = ["image"]

image_cleanlab_df = data.merge(
    image_cleanlab_columns, left_on="image", right_on="image"
)
image_cleanlab_df.rename(columns={"category": "given_label"}, inplace=True)

image_label_issue = image_cleanlab_df.query(
    "is_label_issue", engine="python"
).sort_values("label_issue_score", ascending=False)
columns_to_display = image_path_column + label_issue_columns

display_image(image_label_issue.head(5)[columns_to_display])

	image	given_label	suggested_label	label_issue_score	is_label_issue
718	images/718.jpg	accessories	underwear & socks	0.847718	True
1162	images/1162.jpg	pajamas	underwear & socks	0.811411	True
1103	images/1103.jpg	kitchen & dining	bed & bath	0.804226	True
907	images/907.jpg	tops	kitchen & dining	0.802845	True
359	images/359.jpg	accessories	kitchen & dining	0.785909	True
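Because the tabular, text, and image projects each train different models, one way to prioritize review is to count how many projects flag the same product. In the outputs above, for example, row 718 (the Nike socks) appears among the top label issues of all three projects. The sketch below shows such a vote count on toy per-project results keyed by SKU; the data and the threshold of two projects are illustrative assumptions.

```python
import pandas as pd

# Toy per-project results keyed by SKU (stand-ins for the merged DataFrames).
tabular = pd.DataFrame({"sku": ["a", "b", "c"], "is_label_issue": [True, True, False]})
text = pd.DataFrame({"sku": ["a", "b", "c"], "is_label_issue": [True, False, False]})
image = pd.DataFrame({"sku": ["a", "b", "c"], "is_label_issue": [True, False, True]})

projects = {"tabular": tabular, "text": text, "image": image}

# One boolean column per project, aligned on SKU.
votes = pd.concat(
    [df.set_index("sku")["is_label_issue"].rename(name) for name, df in projects.items()],
    axis=1,
)
votes["num_projects_flagging"] = votes[list(projects)].sum(axis=1)

# Products flagged by at least two of the three projects.
consensus = votes[votes["num_projects_flagging"] >= 2].index.tolist()
print(votes)
print(consensus)
```

Aligning on a stable key such as the SKU avoids relying on row order being identical across the downloaded Cleanlab columns.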

Outlier images are auto-detected as those that look atypical from the rest of the dataset. Cleanlab Studio’s AI understands the visual semantics of images and flags those that meaningfully stand out in this product catalog:

image_outlier_issue = image_cleanlab_df.query(
    "is_outlier", engine="python"
).sort_values("outlier_score", ascending=False)

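# outlier_columns (defined for the text project) also lists text-only columns
# such as is_empty_text, so select the image-relevant columns explicitly.
image_outlier_columns = ["given_label", "outlier_score", "is_outlier"]
columns_to_display = image_path_column + image_outlier_columns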

display_image(image_outlier_issue.head(5)[columns_to_display])

	image	given_label	outlier_score	is_outlier
844	images/844.jpg	pants	0.382356	True
777	images/777.jpg	kitchen & dining	0.349961	True
304	images/304.jpg	kitchen & dining	0.331406	True
328	images/328.jpg	bed & bath	0.316104	True
596	images/596.jpg	accessories	0.308186	True

Ambiguous data points are not well-described by any class label in the dataset, or may be borderline cases between multiple class labels. Many ambiguous data points may indicate the class definitions were not very clear in the original category taxonomy defined for the product catalog (or data annotation/categorization instructions more generally). Consider closely reviewing the class labels for such ambiguous data. Unlike outliers which are very different from other data, ambiguous data may occur in clusters.

image_ambiguous_issue = image_cleanlab_df.query(
    "is_ambiguous", engine="python"
).sort_values("ambiguous_score", ascending=False)

ambiguous_columns = ["given_label", "ambiguous_score", "is_ambiguous"]
columns_to_display = image_path_column + ambiguous_columns

display_image(image_ambiguous_issue.head(5)[columns_to_display])

	image	given_label	ambiguous_score	is_ambiguous
414	images/414.jpg	kitchen & dining	0.960864	True
861	images/861.jpg	underwear & socks	0.960580	True
47	images/47.jpg	shoes	0.959784	True
304	images/304.jpg	kitchen & dining	0.958595	True
407	images/407.jpg	kitchen & dining	0.953552	True

Other data issues detected in image project

As with the text fields, Cleanlab can auto-detect various types of problems in image data. The following Cleanlab columns are specific to images (see here for details), and are useful for detecting low-quality or unsafe images (content moderation).

Low-information images lack content and exhibit low entropy in the values of their pixels. They do not help your customers reach a purchase decision and negatively affect the browsing experience.

Here are the low-information images Cleanlab detected in this product catalog:

image_low_quality_issue = image_cleanlab_df.query(
    "is_low_information", engine="python"
).sort_values("low_information_score", ascending=False)

low_quality_columns = ["given_label", "low_information_score", "is_low_information"]
columns_to_display = image_path_column + low_quality_columns

display_image(image_low_quality_issue.head(5)[columns_to_display])

	image	given_label	low_information_score	is_low_information
414	images/414.jpg	kitchen & dining	0.785571	True
323	images/323.jpg	accessories	0.707833	True
705	images/705.jpg	bed & bath	0.707282	True

Overly Dark images appear dim or underexposed and lack clarity, which is undesirable in a product catalog.

Here are the dark images Cleanlab detected in this product catalog:

image_dark_issue = image_cleanlab_df.query("is_dark", engine="python").sort_values(
    "dark_score", ascending=False
)

dark_columns = ["given_label", "dark_score", "is_dark"]
columns_to_display = image_path_column + dark_columns

display_image(image_dark_issue.head(5)[columns_to_display])

	image	rendered_image	given_label	dark_score	is_dark
1017	images/1017.jpg		shoes	0.906149	True
1042	images/1042.jpg		bed & bath	0.835294	True

Overly Light images appear excessively bright or overexposed, lacking adequate contrast and detail for a product catalog. Here are light images detected by Cleanlab:

image_light_issue = image_cleanlab_df.query("is_light", engine="python").sort_values(
    "light_score", ascending=False
)

light_columns = ["given_label", "light_score", "is_light"]
columns_to_display = image_path_column + light_columns

display_image(image_light_issue.head(5)[columns_to_display])

	image	rendered_image	given_label	light_score	is_light
414	images/414.jpg		kitchen & dining	0.996078	True

Aesthetic score: Cleanlab Studio can compute an aesthetic score to quantify how visually appealing each image is (as rated by most people, although this is subjective). Use this score to automatically identify images which are non-realistic, artificial, or depict content that is not helpful for the customer to make a decision. For a product with multiple images, consider ordering them by aesthetic score in the display to boost customer engagement.

Note: Higher aesthetic scores correspond to higher-quality images in the dataset (unlike many of Cleanlab’s other issue scores).
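The suggestion to order a product's images by aesthetic score could be sketched as follows. This assumes a hypothetical `product_id` column linking several images to one product, which the tutorial dataset does not have (it has one image per product); all values below are invented for illustration.

```python
import pandas as pd

# Hypothetical data: "product_id" is an assumed column, not part of the tutorial dataset.
images_df = pd.DataFrame(
    {
        "product_id": ["A", "A", "A", "B", "B"],
        "image": ["a1.jpg", "a2.jpg", "a3.jpg", "b1.jpg", "b2.jpg"],
        "aesthetic_score": [0.42, 0.81, 0.15, 0.30, 0.67],
    }
)

# Higher aesthetic_score means a more appealing image, so sort descending within each product.
ordered = images_df.sort_values(
    ["product_id", "aesthetic_score"], ascending=[True, False]
)
# 1-based display position of each image within its product
ordered["display_rank"] = ordered.groupby("product_id").cumcount() + 1

print(ordered)
```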

Here are the unaesthetic images with low aesthetic scores detected in this product catalog:

image_unaesthetic = image_cleanlab_df.sort_values("aesthetic_score", ascending=True)

aesthetic_columns = ["given_label", "aesthetic_score"]
columns_to_display = image_path_column + aesthetic_columns

display_image(image_unaesthetic.head(3)[columns_to_display])

	image	given_label	aesthetic_score
352	images/352.jpg	kitchen & dining	0.131365
95	images/95.jpg	accessories	0.140328
102	images/102.jpg	kitchen & dining	0.151972

Odd Size and Odd Aspect Ratio images have an unusual area or unusual width/height proportions. On an e-commerce platform, such images may result in an inconsistent, uneven display. If your website or application uses responsive design, odd-sized images might not adapt well to different screen sizes.

Here are the odd size or aspect ratio images Cleanlab detected in this product catalog:

image_odd_size = image_cleanlab_df.query(
    "is_odd_aspect_ratio or is_odd_size", engine="python"
).sort_values("odd_aspect_ratio_score", ascending=False)

odd_size_columns = [
    "given_label",
    "odd_aspect_ratio_score",
    "is_odd_aspect_ratio",
    "odd_size_score",
    "is_odd_size",
]
columns_to_display = image_path_column + odd_size_columns

display_image(image_odd_size.head(3)[columns_to_display])

	image	rendered_image	given_label	odd_aspect_ratio_score	is_odd_aspect_ratio	odd_size_score	is_odd_size
499	images/499.jpg		kitchen & dining	0.821429	True	0.090175	False
486	images/486.jpg		coats & jackets	0.750000	True	0.091837	False

In this dataset, there are odd aspect ratio images but no odd size images flagged above Cleanlab’s odd_size_score threshold.
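To build intuition for what an unusual aspect ratio looks like, here is a minimal sketch that compares each image's aspect ratio to the typical (median) ratio in a collection. It is not how Cleanlab computes odd_aspect_ratio_score; the image dimensions, the `looks_odd` column, and the 0.5 cutoff are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical image dimensions in pixels; none of these values come from the tutorial dataset.
dims = pd.DataFrame(
    {"width": [800, 800, 820, 800, 2400], "height": [800, 780, 800, 800, 400]},
    index=["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
)

# Ratio of the shorter side to the longer side: 1.0 is square, values near 0 are very elongated.
sides = dims[["width", "height"]]
dims["aspect_ratio"] = sides.min(axis=1) / sides.max(axis=1)

# Flag images whose ratio is far from the median ratio; 0.5 is an arbitrary cutoff.
typical_ratio = dims["aspect_ratio"].median()
dims["looks_odd"] = (dims["aspect_ratio"] - typical_ratio).abs() > 0.5

print(dims)
```

Only the very wide banner-like image (`e.jpg`) is flagged here, because every other image is close to square.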

Not Safe For Work (NSFW) images are not suitable for viewing in a professional or public environment because they depict explicit/pornographic content or graphic violence/gore. These images must be removed from an e-commerce product catalog.

Here are the images that Cleanlab flagged as NSFW in this product catalog:

image_nsfw = image_cleanlab_df.query("is_NSFW", engine="python").sort_values(
    "NSFW_score", ascending=False
)

nsfw_columns = ["given_label", "NSFW_score", "is_NSFW"]
columns_to_display = image_path_column + nsfw_columns

display_image(image_nsfw.head(1)[columns_to_display])

	image	rendered_image	given_label	NSFW_score	is_NSFW
844	images/844.jpg		pants	0.611196	True

Near duplicates

Cleanlab also automatically detects data points that are exact or near duplicates. Duplicate products confuse your customers and degrade your analytics and logistical efficiency.

Let’s first review the near duplicate product descriptions detected from our text project:

# nunique() ignores missing values, so only rows assigned to a cluster are counted
n_near_duplicate_sets = description_cleanlab_df["near_duplicate_cluster_id"].nunique()
print(
    f"There are {n_near_duplicate_sets} sets of near duplicate samples from the 'description' text field."
)

    There are 72 sets of near duplicate samples from the 'description' text field.

Each near-duplicate data point has an associated near_duplicate_cluster_id. Data points that share the same cluster ID are near duplicates of each other, so we can use this column to find the near duplicates of any data point. The near_duplicate_score quantifies how semantically similar each data point is to another data point in the dataset. This score equals 1.0 for exact duplicates, i.e. data points that have identical copies in the dataset.

near_duplicate_cluster_id = 10
description_near_duplicate_cluster = description_cleanlab_df.query(
    "near_duplicate_cluster_id == @near_duplicate_cluster_id", engine="python"
)

near_duplicate_columns = ["near_duplicate_score", "is_near_duplicate", "given_label"]
columns_to_display = tabular_columns + near_duplicate_columns

display(description_near_duplicate_cluster.head()[columns_to_display])

	title	description	sale_price	average_product_rating	brand	near_duplicate_score	is_near_duplicate	given_label
189	Clarks® Leisa Grove Leather Sandals	Stay comfortable and looking great all day with these sandals that are crafted in luxurious leather. technology Cushion Soft technology offers softness you can feel from your first step, long-lasting comfort and fit with minimal cushion compression optimal breathability via open-cell technology OrthoLite® cushioned footbed absorbs impact and offers breathability, moisture management and is antimicrobial construction leather upper EVA sole details strappy open-toe design adjustable hook-and-loop closure fabric lining	60.42	4.8	Clarks	0.999996	True	shoes
951	Clarks® Leisa Grove Leather Sandals - Wide Width	Stay comfortable and looking great all day with these sandals that are crafted in luxurious leather. technology Cushion Soft technology offers softness you can feel from your first step, long-lasting comfort and fit with minimal cushion compression optimal breathability via open-cell technology OrthoLite® cushioned footbed absorbs impact and offers breathability, moisture management and is antimicrobial construction leather upper EVA sole details strappy open-toe design adjustable hook-and-loop closure fabric lining	60.42	4.4	Clarks	0.999996	True	shoes
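Beyond inspecting one cluster at a time, you may want an overview of all clusters. The sketch below uses a small made-up DataFrame as a stand-in for description_cleanlab_df (only the near_duplicate_cluster_id and is_near_duplicate column names come from this tutorial) to count cluster sizes and to list the rows that remain after keeping one representative per cluster. Keeping the first row is just one possible policy: near duplicates such as a regular and a wide-width listing can be legitimate variants, so the remaining rows are marked for review rather than deleted.

```python
import pandas as pd

# Toy stand-in for description_cleanlab_df; these rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "title": ["Sandal", "Sandal - Wide Width", "Tee", "Tee", "Tee (2-pack)", "Barstool"],
        "near_duplicate_cluster_id": [10, 10, 20, 20, 20, None],
        "is_near_duplicate": [True, True, True, True, True, False],
    }
)

duplicates = toy_df[toy_df["is_near_duplicate"]]

# Number of rows in each near-duplicate cluster, largest cluster first
cluster_sizes = (
    duplicates.groupby("near_duplicate_cluster_id").size().sort_values(ascending=False)
)
print(cluster_sizes)

# One possible policy: keep the first row of each cluster and mark the rest for review.
to_review = duplicates[duplicates.duplicated("near_duplicate_cluster_id", keep="first")]
print(to_review["title"].tolist())
```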

Let’s also review the near duplicate product images detected in this dataset (from the image project):

# nunique() ignores missing values, so only rows assigned to a cluster are counted
n_near_duplicate_sets = image_cleanlab_df["near_duplicate_cluster_id"].nunique()
print(
    f"There are {n_near_duplicate_sets} sets of near duplicate samples identified from the images."
)

    There are 50 sets of near duplicate samples identified from the images.

near_duplicate_cluster_id = 20  # play with this value to see other sets of near duplicates
image_near_duplicate_cluster = image_cleanlab_df.query(
    "near_duplicate_cluster_id == @near_duplicate_cluster_id", engine="python"
)

columns_to_display = image_path_column + tabular_columns + near_duplicate_columns
display_image(image_near_duplicate_cluster.head()[columns_to_display])

	image	rendered_image	title	description	sale_price	average_product_rating	brand	near_duplicate_score	is_near_duplicate	given_label
291	images/291.jpg		adidas® Brand Cut Tee	Our adidas brand cut t-shirt offers comfort, style and easy care. regular fit crewneck short sleeves cotton washable imported	21.53	3.1	ADIDAS	1.0	True	tops
872	images/872.jpg		adidas® Brand Cut Tee	Our adidas brand cut t-shirt offers comfort, style and easy care. regular fit crewneck short sleeves cotton washable imported	21.53	1.8	ADIDAS	1.0	True	tops

Cleanlab Studio automatically provides holistic quality assurance for your product catalog. The AI system closely evaluates diverse multimodal information about each product, using its understanding of the content semantics to diagnose problems you never even considered. Cleanlab Studio automatically detects many types of data issues (not all demonstrated here).

Next Steps: After auto-detecting these issues in a product catalog, you can alert the appropriate parties to fix them (e.g. the seller of a particular product in an e-commerce marketplace). It’s this easy to use Data-Centric AI to improve your product catalog and delight your customers! Better-quality data achieved with Cleanlab Studio will improve customer engagement, product discoverability, and your sales.
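One way to prepare such alerts is to collect, for each product, the names of the boolean issue flags that are set. The sketch below does this on a small invented DataFrame. The is_label_issue, is_near_duplicate, and is_dark column names appear in this tutorial, but the rows and the report structure are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for a DataFrame of Cleanlab columns; the values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "is_label_issue": [False, True, False],
        "is_near_duplicate": [True, False, False],
        "is_dark": [False, True, False],
    },
    index=[291, 1017, 5],
)

flag_columns = [col for col in toy_df.columns if col.startswith("is_")]

# Map each product index to the list of flags that are True for it
issues = {
    idx: [name for name in flag_columns if row[name]]
    for idx, row in toy_df.iterrows()
}
# Keep only products with at least one flagged issue
report = {idx: names for idx, names in issues.items() if names}

print(report)
```

The resulting dictionary could then be grouped by seller or category (if your catalog has such a column) before notifying whoever is responsible for each listing.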

Bonus: Detect erroneous numeric values (regression task)

Previously we analyzed issues with respect to the specified product category, designating the ML task as multi-class classification within Cleanlab Studio. We can also use Cleanlab Studio for analyzing issues in numeric columns like sale price, product ratings, size, etc. by designating the ML task as regression.

Suppose we are interested in flagging suspicious (or potentially incorrect) values in the listed sale_price of certain products. We accomplish this in Cleanlab Studio by declaring this column as the label for a regression task. Using the same tabular dataset previously loaded into Cleanlab Studio, we specify which columns to use as predictive features for inferring this label and launch a “regression” project. Cleanlab will train its AI model to predict the sale price based on these features, and then flag the listed values it deems least likely.

# Name of the numeric column to detect issues in
regression_column = "sale_price"
# Name of the tabular columns to use as predictive features (including 'category' here)
tabular_columns_price = [
    "title",
    "description",
    "average_product_rating",
    "brand",
    "category",
]

tabular_price_project_id = studio.create_project(
    dataset_id=tabular_dataset_id,
    project_name="multimodal-tabular-price",
    modality="tabular",
    task_type="regression",
    model_type="regular",
    label_column=regression_column,
    feature_columns=tabular_columns_price,
)
print(
    f"Project successfully created and training has begun! project_id: {tabular_price_project_id}"
)

You should only execute the above cell once! Do not call create_project again. Come back after training is complete (you will receive an email).

tabular_price_cleanset_id = studio.get_latest_cleanset_id(tabular_price_project_id)
project_status = studio.wait_until_cleanset_ready(tabular_price_cleanset_id)

tabular_price_cleanlab_columns = studio.download_cleanlab_columns(
    tabular_price_cleanset_id
)

# Combine original DataFrame with Cleanlab columns
tabular_price_cleanlab_df = data.merge(
    tabular_price_cleanlab_columns, left_index=True, right_index=True
)

# Set "given_label" column to the original label
tabular_price_cleanlab_df.rename(columns={"sale_price": "given_label"}, inplace=True)

tabular_price_label_issue = tabular_price_cleanlab_df.query(
    "is_label_issue", engine="python"
).sort_values("label_issue_score", ascending=False)
columns_to_display = tabular_columns_price + label_issue_columns

display(tabular_price_label_issue.head()[columns_to_display])

	title	description	average_product_rating	brand	category	given_label	suggested_label	label_issue_score	is_label_issue
254	Mixit™ Oval Bling Wedge Sandals	With its shiny stone detail and extra height in the heel, our wedge sandal is the perfect blend of comfort, style and bling. manmade materials comfortable slip-on	5.0	MIXIT	shoes	1000.10	59.930344	1.0	True
142	Pillow covers free trial too	Pillow with hearts that can be placed anywhere. Call us on 202 856 1167 for free trial." spot clean imported	2.5	ß	bed & bath	456.78	55.065586	1.0	True
1086	Master Massage Olympic 32” Massage Table Set	The Olympic LX massage table is all about revolutionary size and strength. At 32 inches wide with oversized ash wood legs and double thickness bed, this table supports up to 3,200 pounds	3.8	MASTER MASSAGE	bed & bath	640.56	62.977962	1.0	True
360	Nike® Team Training Gym Sack	The Nike Team Training gym sack helps keep your gear organized with an interior divider and bonded zip pocket. The water-resistant fabric and polyurethane-coated bottom also help ensure your essentials stay dry. main compartment with drawcord closure water-resistant fabric zip pocket for secure, small-item storage perforated film provides breathability Swoosh graphic polyester spot clean imported	4.9	Nike	accessories	2011.76	51.723045	1.0	True
869	Onasis Backless Barstool	Add sturdy and stylish seating to your counter or table with this backless barstool featuring splayed legs accented with a decorative wood trim. wood and rubberwood construction flush-joint joinery cushion filled with upholstery foam brushed bronze-tone nailhead accents 275-pound weight capacity Upholstery options: Faux leather: polyurethane; spot clean with a dry cloth; avoid direct heat and sunlight. Tweed: 75% polyester/25% acrylic; spot clean with a clean, damp cloth or have professionally cleaned. Assembly required; screwdriver needed. Imported. Counter-height barstool: 20Wx14Dx26"H Bar-height barstool: 21Wx15Dx32"H	5.0	Linon	kitchen & dining	464.60	81.529083	1.0	True

The listed sale price of each of these flagged products seems suspiciously high compared to Cleanlab’s suggested_label. These are definitely worth a second look. Good thing Cleanlab caught them! The label_issue_score reflects how likely Cleanlab’s AI considers each product’s listed sale price to be incorrect.
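Because all five rows shown above have a label_issue_score of 1.0, that score alone does not rank them. One simple way to triage, sketched below, is to compare the listed price with Cleanlab's suggested_label. The given_label and suggested_label values are copied from the table above; the `price_ratio` column is an assumption introduced for this example and is not a Cleanlab column.

```python
import pandas as pd

# given_label / suggested_label values copied from the flagged rows shown above
flagged = pd.DataFrame(
    {
        "given_label": [1000.10, 456.78, 640.56, 2011.76, 464.60],
        "suggested_label": [59.930344, 55.065586, 62.977962, 51.723045, 81.529083],
    },
    index=[254, 142, 1086, 360, 869],
)

# How many times larger the listed price is than the suggested price
flagged["price_ratio"] = flagged["given_label"] / flagged["suggested_label"]

# Largest discrepancies first
ranked = flagged.sort_values("price_ratio", ascending=False)
print(ranked.round(1))
```

A large ratio is only a prompt for review: some products may legitimately cost far more than the model's estimate.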

You could repeat this bonus analysis with the size, rating, or other important numeric information about each product that is vital to get right.
