Detecting Issues in Image Datasets
This is the recommended quickstart tutorial for programmatically analyzing image datasets via Cleanlab Studio’s Python API. If you prefer to use the web interface and interactively browse/correct your data, see our other tutorial: Finding Issues in Large-Scale Image Datasets
Here we demonstrate the metadata Cleanlab Studio automatically generates for any image classification dataset. This metadata (returned as Cleanlab Columns) helps you discover various problems in your dataset and understand their severity.
Install and import dependencies
Make sure you have `wget` and `zip` installed to run this tutorial. You can use pip to install all other packages required for this tutorial as follows:
%pip install Pillow cleanlab-studio
import os
import numpy as np
from cleanlab_studio import Studio
Load dataset into Cleanlab Studio
To fetch the data for this tutorial, make sure you have `wget` and `zip` installed.
wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/caltech256-image-quickstart.zip'
unzip -q caltech256-image-quickstart.zip
The dataset we will use is a subset of the well-known Caltech 256 image dataset. This is a multi-class classification dataset where each image is labeled as belonging to one of K classes. We have downloaded the dataset and created a local ZIP file, which is one of the supported formats.
Dataset Folder Structure
Upon unzipping the `caltech256-image-quickstart.zip` file, you will discover a directory named `caltech256-image-quickstart/`. Inside this directory, the dataset is organized with images sorted by their respective classes in separate subdirectories. The structure is as follows:
caltech256-image-quickstart/ # Main directory after unzipping the archive.
|
|-- frog/ # This is the first class directory.
| |-- <image_filename_1>.jpg # Example of an image filename inside the "frog" class directory.
| |-- <image_filename_2>.jpg # Another example of an image filename.
| |-- ...
|
|-- ibis/ # This is the second class directory.
| |-- <image_filename_1>.jpg # Example of an image filename inside the "ibis" class directory.
| |-- <image_filename_2>.jpg
| |-- ...
|
|-- penguin/ # Yet another class directory.
| |-- <image_filename_1>.jpg # Example of an image filename inside the "penguin" class directory.
| |-- <image_filename_2>.jpg
| |-- ...
|
|-- ... # Additional class directories follow the same structure.
Understanding this structure is important if you intend to format your own dataset the same way. You can format any other image dataset like this and run the rest of this tutorial on it. Details on how to format your dataset can be found in this guide, which also outlines other format options (such as providing the labels in a metadata CSV file instead of relying on the subdirectory structure to represent the labels).
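To make this concrete, here is a minimal sketch (not part of the official workflow) of how you might arrange your own labeled images into this folder structure and zip them up; `my_images` and `my_labels` are hypothetical placeholders for your own filepaths and class labels:
import os
import shutil
# Hypothetical inputs: parallel lists of image filepaths and their class labels.
my_images = ["photos/img1.jpg", "photos/img2.jpg"]
my_labels = ["frog", "ibis"]
for path, label in zip(my_images, my_labels):
    class_dir = os.path.join("my_dataset", label)  # one subdirectory per class
    os.makedirs(class_dir, exist_ok=True)
    shutil.copy2(path, class_dir)
# Zip the folder into a single archive that Cleanlab Studio can ingest.
shutil.make_archive("my_dataset", "zip", root_dir=".", base_dir="my_dataset")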
Note: To work with big datasets more quickly, we recommend you host them as external media (e.g. in cloud storage as demonstrated in this tutorial) rather than a local ZIP file.
BASE_PATH = os.getcwd()
dataset_path = os.path.join(BASE_PATH, "caltech256-image-quickstart")
dataset_zip_path = dataset_path + ".zip"
Use your API key to instantiate a `studio` object, which can be used to analyze your dataset.
# you can find your Cleanlab Studio API key by going to app.cleanlab.ai/upload,
# clicking "Upload via Python API", and copying the API key there
API_KEY = "<insert your API key>"
# initialize studio object
studio = Studio(API_KEY)
Next, load the dataset into Cleanlab Studio (more details/options can be found in this guide). This may take a while for big datasets.
dataset_id = studio.upload_dataset(dataset_zip_path, dataset_name="caltech256")
print(f"Dataset ID: {dataset_id}")
Launch a Project
We will then create a project using this dataset. A Cleanlab Studio project automatically trains ML models to provide AI-based analysis of your dataset.
project_id = studio.create_project(
dataset_id=dataset_id,
project_name="caltech256 project",
modality="image",
task_type="multi-class",
model_type="regular",
)
print(
f"Project successfully created and ML training has begun! project_id: {project_id}"
)
Once the project has been launched successfully and you see your `project_id`, feel free to close this notebook. It will take some time for Cleanlab’s AI to train on your data and analyze it. Come back after training is complete (you will receive an email) and continue with the notebook to review your results.
You should only execute the above cell once per dataset. After launching the project, you can poll for its status to programmatically wait until the results are ready for review. Each project creates a cleanset, an improved version of your original dataset that contains additional metadata for helping you clean up the data. The next code cell simply waits until this cleanset has been created.
Warning! For big datasets, this next cell may take a long time to execute while Cleanlab’s AI model is training. If your Jupyter notebook times out during this process, you can resume work by re-running the cell below (which should return instantly if the project has completed training; do not create a new Project).
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
studio.wait_until_cleanset_ready(cleanset_id)
Once the above cell completes execution, your project results are ready for review! At this point, you can optionally view your project in the Cleanlab Studio web interface and interactively improve your dataset. However, this tutorial will stick with a fully programmatic workflow.
Download Cleanlab columns
We can fetch the Cleanlab columns that contain the metadata of this cleanset using its `cleanset_id`. These columns have the same length as your original dataset and provide metadata about each individual data point, like what types of issues it exhibits and how severely.
If at any point you want to re-run the remaining parts of this notebook (without creating another project), simply call `studio.download_cleanlab_columns(cleanset_id)` with the `cleanset_id` printed from the previous cell.
cleanlab_columns_df = studio.download_cleanlab_columns(cleanset_id)
Let’s view the downloaded dataframe with all the issue columns:
cleanlab_columns_df.head()
|   | image | cleanlab_row_ID | corrected_label | is_label_issue | label_issue_score | suggested_label | suggested_label_confidence_score | is_ambiguous | ambiguous_score | is_well_labeled | ... | is_odd_size | odd_size_score | is_low_information | low_information_score | is_grayscale | is_odd_aspect_ratio | odd_aspect_ratio_score | aesthetic_score | is_NSFW | NSFW_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | penguin/158_0058.jpg | 1 | <NA> | False | 0.053458 | <NA> | 0.921922 | False | 0.638933 | True | ... | False | 0.113644 | False | 0.107532 | False | False | 0.183594 | 0.510599 | False | 0.000000 |
| 1 | penguin/158_0064.jpg | 2 | <NA> | False | 0.051673 | <NA> | 0.918950 | False | 0.640233 | True | ... | False | 0.071429 | False | 0.060965 | False | False | 0.335938 | 0.481435 | False | 0.000000 |
| 2 | penguin/158_0070.jpg | 3 | <NA> | False | 0.149695 | <NA> | 0.802456 | False | 0.748173 | True | ... | False | 0.035302 | False | 0.079976 | False | False | 0.250000 | 0.564252 | False | 0.000000 |
| 3 | penguin/158_0138.jpg | 4 | <NA> | False | 0.070986 | <NA> | 0.891869 | False | 0.666421 | True | ... | False | 0.096633 | False | 0.180649 | False | False | 0.355469 | 0.404128 | False | 0.000000 |
| 4 | penguin/158_0110.jpg | 5 | <NA> | False | 0.048130 | <NA> | 0.926317 | False | 0.633335 | True | ... | False | 0.002039 | False | 0.183095 | False | False | 0.277344 | 0.594717 | False | 0.010624 |

5 rows × 34 columns
Review detected data issues
Details about all of the Cleanlab columns and their meanings can be found in this guide. Here we briefly showcase some of the Cleanlab columns that correspond to issues detected in our tutorial dataset:
- **Label issue** indicates the given label of this data point is likely wrong. For such data, consider correcting the label to the `suggested_label` if it seems more appropriate.
- **Ambiguous** indicates this data point does not clearly belong to any of the classes (e.g. a borderline case). Multiple human annotators might disagree on how to label this data point, so you might consider refining your annotation instructions to clarify how to handle data points like this.
- **Outlier** indicates this data point is very different from the rest of the data (looks atypical). The presence of outliers may indicate problems in your data sources; consider deleting such data from your dataset if appropriate.
- **Near duplicate** indicates there are other data points that are (exactly or nearly) identical to this data point. Duplicated data points can have an outsized impact on models/analytics, so consider deleting the extra copies from your dataset if appropriate.
The data points exhibiting each type of issue are indicated with boolean values in the respective `is_<issue>` column, and the severity of this issue in each data point is quantified in the respective `<issue>_score` column (on a scale of 0-1, with 1 indicating the most severe instances of the issue).
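As a quick sanity check (a small sketch beyond the tutorial itself), you can tally how many data points are flagged for each issue type directly from these boolean columns:
# Count flagged data points per issue type using the boolean is_* columns.
issue_columns = [col for col in cleanlab_columns_df.columns if col.startswith("is_")]
print(cleanlab_columns_df[issue_columns].sum().sort_values(ascending=False))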
Let’s go through some of the Cleanlab columns and types of data issues, starting with label issues (i.e. mislabeled data). We first create a `given_label` column in our dataframe to clearly indicate the class label originally assigned to each image in this dataset.
def get_label_for_example(row):
"""A helper function to extract the label from image path."""
return row["image"].split("/")[0]
# create a given label column
cleanlab_columns_df["given_label"] = cleanlab_columns_df.apply(
get_label_for_example, axis=1
)
We’ll also add an image column that allows us to view the images in this tutorial notebook as part of the dataframe.
# code to render the image column of the DataFrame as inline images
from PIL import Image
from io import BytesIO
from base64 import b64encode
from IPython.display import HTML
def path_to_img_html(path: str) -> str:
buf = BytesIO()
Image.open(path).save(buf, format="JPEG")
b64 = b64encode(buf.getvalue()).decode("utf8")
return f'<img src="data:image/jpeg;base64,{b64}" width="175px" alt="" />'
def display(df):
image_column = "image"
df_copy = df.copy()
df_copy[image_column] = df_copy["image"].apply(lambda x: dataset_path + "/" + x)
# Rearrange columns to move image_column right next to index
columns = list(df_copy.columns)
columns.remove(image_column)
columns.insert(0, image_column)
df_copy = df_copy[columns]
return HTML(df_copy.to_html(escape=False, formatters=dict(image=path_to_img_html)))
Label Issues
To see which images are estimated to be mislabeled, we filter by `is_label_issue`. We sort by `label_issue_score` to see which of these images are most likely mislabeled.
samples_ranked_by_label_issue_score = cleanlab_columns_df.query(
"is_label_issue"
).sort_values("label_issue_score", ascending=False)
columns_to_display = [
"image",
"label_issue_score",
"is_label_issue",
"given_label",
"suggested_label",
]
display(samples_ranked_by_label_issue_score.head()[columns_to_display])
|     | image | label_issue_score | is_label_issue | given_label | suggested_label |
|-----|-------|-------------------|----------------|-------------|-----------------|
| 326 | (image) | 0.959155 | True | toad | penguin |
| 166 | (image) | 0.853347 | True | ibis | frog |
| 336 | (image) | 0.804267 | True | toad | frog |
| 622 | (image) | 0.737624 | True | frog | toad |
| 333 | (image) | 0.620983 | True | toad | swan |
Note that in each of these images, the `given_label` really does seem wrong. Data labeling is an error-prone process and annotators make mistakes! Luckily we can easily correct these data points by just using Cleanlab’s `suggested_label` above, which seems like a much more suitable label in most cases.
While the boolean flags above can help estimate the overall label error rate, the numeric scores help decide what data to prioritize for review. You can alternatively ignore these boolean `is_label_issue` flags and filter the data by thresholding the `label_issue_score` yourself (if, say, you find the default thresholds produce false positives/negatives).
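For instance, here is a minimal sketch of such a custom-threshold filter (the 0.5 cutoff is an arbitrary illustrative value, not a Cleanlab default):
# Flag label issues with a custom threshold instead of the default boolean flag.
custom_threshold = 0.5  # arbitrary example value; tune for your precision/recall needs
custom_label_issues = cleanlab_columns_df.query("label_issue_score > @custom_threshold")
print(f"{len(custom_label_issues)} images flagged at threshold {custom_threshold}")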
Ambiguous
Next, let’s look at the ambiguous examples in the dataset.
samples_ranked_by_ambiguous_score = cleanlab_columns_df.query(
"is_ambiguous"
).sort_values("ambiguous_score", ascending=False)
columns_to_display = [
"image",
"ambiguous_score",
"is_ambiguous",
"given_label",
"suggested_label",
]
display(samples_ranked_by_ambiguous_score.head(5)[columns_to_display])
|     | image | ambiguous_score | is_ambiguous | given_label | suggested_label |
|-----|-------|-----------------|--------------|-------------|-----------------|
| 522 | (image) | 0.982241 | True | frog | swan |
| 282 | (image) | 0.973153 | True | toad | swan |
| 278 | (image) | 0.972304 | True | ibis | penguin |
| 352 | (image) | 0.972086 | True | toad | frog |
| 464 | (image) | 0.970093 | True | swan | ibis |
Outliers
Next, let’s look at the outliers in the dataset.
samples_ranked_by_outlier_score = cleanlab_columns_df.query("is_outlier").sort_values(
"outlier_score", ascending=False
)
columns_to_display = [
"image",
"outlier_score",
"is_outlier",
"given_label",
"suggested_label",
]
display(samples_ranked_by_outlier_score.head(5)[columns_to_display])
|     | image | outlier_score | is_outlier | given_label | suggested_label |
|-----|-------|---------------|------------|-------------|-----------------|
| 80  | (image) | 0.346339 | True | penguin | frog |
| 486 | (image) | 0.345775 | True | swan | frog |
| 475 | (image) | 0.345775 | True | swan | frog |
| 599 | (image) | 0.341197 | True | frog | frog |
| 517 | (image) | 0.341197 | True | frog | frog |
Near Duplicates
Next, let’s look at the near duplicates in the dataset.
n_near_duplicate_sets = len(
set(
cleanlab_columns_df.loc[
cleanlab_columns_df["near_duplicate_cluster_id"].notna(),
"near_duplicate_cluster_id",
]
)
)
print(
f"There are {n_near_duplicate_sets} sets of near duplicate images in the dataset."
)
Note that the near duplicate data points each have an associated `near_duplicate_cluster_id` integer. Data points that share the same ID are near duplicates of each other, so you can use this column to find the near duplicates of any data point. Remember that the near duplicates also include exact duplicates (which have `near_duplicate_score = 1`).
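To review every near-duplicate set at once (a small sketch beyond what the tutorial covers), you can group the flagged rows by their shared cluster ID:
# List the members of each near-duplicate set, grouped by cluster ID.
near_dups = cleanlab_columns_df.query("is_near_duplicate")
for cluster_id, group in near_dups.groupby("near_duplicate_cluster_id"):
    print(f"Cluster {cluster_id}: {list(group['image'])}")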
Let’s check out the near duplicates with id = 2:
near_duplicate_cluster_id = (
2 # play with this value to see other sets of near duplicates
)
selected_samples_by_near_duplicate_cluster_id = cleanlab_columns_df.query(
"near_duplicate_cluster_id == @near_duplicate_cluster_id"
)
columns_to_display = [
"image",
"near_duplicate_score",
"is_near_duplicate",
"given_label",
"near_duplicate_cluster_id",
]
display(selected_samples_by_near_duplicate_cluster_id[columns_to_display])
|     | image | near_duplicate_score | is_near_duplicate | given_label | near_duplicate_cluster_id |
|-----|-------|----------------------|-------------------|-------------|---------------------------|
| 317 | (image) | 0.896038 | True | toad | 2 |
| 321 | (image) | 0.896038 | True | toad | 2 |
Image issues
Cleanlab Studio performs various analyses of each image in the dataset that are independent of the machine learning task and data annotations. The resulting image-specific metadata helps identify low-quality images from your dataset, such as images which are: dark, light, blurry, low-information, grayscale, oddly-sized, or formatted in an odd aspect ratio.
As above, the `is_<issue>` column contains boolean values indicating if an image has been identified to exhibit a particular issue, and the `<issue>_score` column contains numeric scores between 0 and 1 indicating the severity of this particular issue (1 indicates the most severe instance of the issue).
Let’s inspect some of the image-specific issues detected in our dataset:
Dark images
Cleanlab Studio automatically identifies overly dark images in a dataset, which appear dim/underexposed and lack detail/clarity. Since data annotators and machine learning models can struggle with dark images, carefully consider how to handle them and whether they are introducing noise in the dataset or undesirable spurious correlations.
Here are the most severe examples of dark images detected in this dataset:
samples_marked_as_dark = cleanlab_columns_df.query("is_dark").sort_values(
"dark_score", ascending=False
)
columns_to_display = [
"image",
"dark_score",
"is_dark",
"given_label",
]
display(samples_marked_as_dark.head(5)[columns_to_display])
|     | image | dark_score | is_dark | given_label |
|-----|-------|------------|---------|-------------|
| 127 | (image) | 0.809566 | True | penguin |
| 171 | (image) | 0.713295 | True | ibis |
You can do something similar to discover the light (overexposed) images in your dataset, which Cleanlab Studio also automatically detects.
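For example, here is a minimal sketch of the analogous query, assuming your cleanset includes the `is_light` and `light_score` columns (check the Cleanlab columns guide for the authoritative column names):
# Analogous query for overexposed (light) images; assumes is_light / light_score columns.
samples_marked_as_light = cleanlab_columns_df.query("is_light").sort_values(
    "light_score", ascending=False
)
display(samples_marked_as_light.head(5)[["image", "light_score", "is_light", "given_label"]])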
Blurry images
Cleanlab Studio can also detect blurry images in a dataset. These images lack sharpness and clarity, resulting in hazy, out-of-focus, or indistinct subjects or details. Blurry images can pose challenges, particularly in tasks such as fine-grained classification that demand attention to detail. Consider if they should be excluded from your dataset (or if there is a problem with your data source).
Here are the most severe examples of blurry images detected in this dataset:
samples_marked_as_blurry = cleanlab_columns_df.query("is_blurry").sort_values(
"blurry_score", ascending=False
)
columns_to_display = [
"image",
"blurry_score",
"is_blurry",
"given_label",
]
display(samples_marked_as_blurry.head(5)[columns_to_display])
|     | image | blurry_score | is_blurry | given_label |
|-----|-------|--------------|-----------|-------------|
| 433 | (image) | 0.761646 | True | swan |
Grayscale images
Cleanlab Studio also detects grayscale images in a dataset, which lack color. Their presence can potentially lead to spurious correlations between the class label and the image (if images from some classes are grayscale more often than images from other classes).
Here are examples of grayscale images detected in this dataset:
samples_marked_as_grayscale = cleanlab_columns_df.query("is_grayscale")
columns_to_display = [
"image",
"is_grayscale",
"given_label",
]
display(samples_marked_as_grayscale.head(5)[columns_to_display])
|     | image | is_grayscale | given_label |
|-----|-------|--------------|-------------|
| 209 | (image) | True | ibis |
| 346 | (image) | True | toad |
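Since grayscale images concentrated in particular classes are what would create the spurious correlations mentioned above, here is a quick sketch to check the per-class grayscale rate:
# Fraction of images per class that are grayscale (a high value for one class may
# indicate a spurious correlation between color and label).
grayscale_by_class = cleanlab_columns_df.groupby("given_label")["is_grayscale"].mean()
print(grayscale_by_class.sort_values(ascending=False))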
Low information images
Cleanlab Studio can also detect low-information images in a dataset, which lack content and exhibit low entropy in their pixel values. Low-information images can be memorized by models, leading to regurgitation in generative tasks, and in supervised learning they may lack the detail needed to be useful.
Here are the most severe examples of low-information images detected in this dataset:
samples_marked_as_low_information = cleanlab_columns_df.query(
"is_low_information"
).sort_values("low_information_score", ascending=False)
columns_to_display = [
"image",
"is_low_information",
"low_information_score",
"given_label",
]
display(samples_marked_as_low_information.head(10)[columns_to_display])
|     | image | is_low_information | low_information_score | given_label |
|-----|-------|--------------------|-----------------------|-------------|
| 80  | (image) | True | 0.804015 | penguin |
| 517 | (image) | True | 0.742925 | frog |
| 599 | (image) | True | 0.742925 | frog |
| 547 | (image) | True | 0.737675 | frog |
| 81  | (image) | True | 0.724251 | penguin |
Odd Aspect Ratio images
Cleanlab Studio also detects images with odd aspect ratios, i.e. unusual width-to-height proportions.
Here are the images with odd aspect ratios detected in this dataset:
samples_marked_as_odd_aspect_ratio = cleanlab_columns_df.query(
"is_odd_aspect_ratio"
).sort_values(["odd_aspect_ratio_score"], ascending=False)
columns_to_display = [
"image",
"is_odd_aspect_ratio",
"odd_aspect_ratio_score",
"given_label",
]
display(samples_marked_as_odd_aspect_ratio.head(5)[columns_to_display])
|     | image | is_odd_aspect_ratio | odd_aspect_ratio_score | given_label |
|-----|-------|---------------------|------------------------|-------------|
| 207 | (image) | True | 0.65625 | ibis |
Odd Sized images
Cleanlab Studio also detects images with unusual sizes, compared to the rest of the dataset. This issue type differs from odd aspect ratio, as it focuses on the image’s total area, assessing discrepancies in size rather than shape, and highlights outliers in comparison to the dataset’s overall image size distribution.
Here are the images with odd sizes detected in this dataset:
samples_marked_as_odd_size = cleanlab_columns_df.query("is_odd_size").sort_values(
["odd_size_score"], ascending=False
)
samples_marked_as_odd_size["image_size"] = [
Image.open(dataset_path + "/" + path).size
for path in samples_marked_as_odd_size["image"]
]
columns_to_display = [
"image",
"is_odd_size",
"odd_size_score",
"given_label",
"image_size",
]
display(samples_marked_as_odd_size.head(5)[columns_to_display])
|     | image | is_odd_size | odd_size_score | given_label | image_size |
|-----|-------|-------------|----------------|-------------|------------|
| 41  | (image) | True | 1.0 | penguin | (6000, 9000) |
| 80  | (image) | True | 1.0 | penguin | (1006, 814) |
| 81  | (image) | True | 1.0 | penguin | (1048, 680) |
| 127 | (image) | True | 1.0 | penguin | (600, 450) |
| 346 | (image) | True | 1.0 | toad | (500, 331) |
Most of the images in this dataset are of size (256, 256). Looking at the above images, we can clearly tell they are unusually sized compared to the rest of the dataset, being much larger (e.g. (6000, 9000)) or much smaller than is typical.
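To verify this yourself (a small sketch beyond the tutorial), you can tabulate the most common image sizes across the whole dataset:
from collections import Counter
# Tabulate the most common (width, height) sizes across the entire dataset.
size_counts = Counter(
    Image.open(os.path.join(dataset_path, path)).size
    for path in cleanlab_columns_df["image"]
)
print(size_counts.most_common(5))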
NSFW images
Cleanlab Studio also detects NSFW (Not Safe For Work) content in your dataset. NSFW images are not suitable for viewing in a professional or public environment because they depict explicit/pornographic content or graphic violence/gore.
Here are the images flagged as NSFW in this dataset:
samples_marked_as_nsfw = cleanlab_columns_df.query("is_NSFW").sort_values(
"NSFW_score", ascending=False
)
columns_to_display = [
"image",
"is_NSFW",
"NSFW_score",
"given_label",
]
display(samples_marked_as_nsfw.head(5)[columns_to_display])
|    | image | is_NSFW | NSFW_score | given_label |
|----|-------|---------|------------|-------------|
| 71 | (image) | True | 0.612313 | penguin |
Aesthetic score
Cleanlab Studio can also compute an aesthetic score to quantify how visually appealing each image is (as rated by most people, although this is subjective). Use this score to automatically identify images which are artistic, beautiful photographs, or depict otherwise interesting content.
Note: Higher aesthetic scores correspond to higher-quality images in the dataset (unlike many of Cleanlab’s other issue scores).
Here are the images with highest aesthetic scores in this dataset:
samples_with_highest_aesthetic_score = cleanlab_columns_df.sort_values(
"aesthetic_score", ascending=False
)
columns_to_display = [
"image",
"aesthetic_score",
"given_label",
]
display(samples_with_highest_aesthetic_score.head(5)[columns_to_display])
|     | image | aesthetic_score | given_label |
|-----|-------|-----------------|-------------|
| 583 | (image) | 0.741301 | frog |
| 596 | (image) | 0.689487 | frog |
| 561 | (image) | 0.657363 | frog |
| 114 | (image) | 0.655281 | penguin |
| 614 | (image) | 0.645130 | frog |
Here are the images with the lowest aesthetic scores in this dataset:
samples_with_lowest_aesthetic_score = cleanlab_columns_df.sort_values("aesthetic_score")
columns_to_display = [
"image",
"aesthetic_score",
"given_label",
]
display(samples_with_lowest_aesthetic_score.head(5)[columns_to_display])
|     | image | aesthetic_score | given_label |
|-----|-------|-----------------|-------------|
| 580 | (image) | 0.153385 | frog |
| 424 | (image) | 0.183691 | swan |
| 624 | (image) | 0.197218 | frog |
| 222 | (image) | 0.219795 | ibis |
| 80  | (image) | 0.224014 | penguin |
Improve the dataset based on the detected issues
Since the results of this analysis appear reasonable, let’s use the Cleanlab columns to improve the quality of our dataset. For your own datasets, the actions you should take to remedy the detected issues will depend on what you are using the data for. No action may be the best choice for certain datasets; we caution against blindly copying the actions we perform below.
For data marked as `label_issue`, we create a new `corrected_label` column, which will be the given label for data without detected label issues, and the `suggested_label` for data with detected label issues.
corrected_label = np.where(
cleanlab_columns_df["is_label_issue"],
cleanlab_columns_df["suggested_label"],
cleanlab_columns_df["given_label"],
)
For data marked as outlier or ambiguous, we will simply exclude them from our dataset. Here we create a boolean vector `rows_to_exclude` to track which images will be excluded.
rows_to_exclude = (
cleanlab_columns_df["is_outlier"] | cleanlab_columns_df["is_ambiguous"]
)
For each set of near duplicates, we only want to keep one of the data points that share a common `near_duplicate_cluster_id` (so that the resulting dataset will no longer contain any near duplicates).
near_duplicates_to_exclude = cleanlab_columns_df[
"is_near_duplicate"
] & cleanlab_columns_df["near_duplicate_cluster_id"].duplicated(keep="first")
rows_to_exclude |= near_duplicates_to_exclude
To keep things simple, we’ll ignore the image-specific issues here.
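If you did want to act on them, here is an optional sketch (not applied in this tutorial) that counts how many additional images would be excluded if you also dropped, say, low-information and NSFW images:
# Optional: count images that would additionally be excluded for image-quality issues.
# (Not applied here; uncomment the last line to actually exclude them.)
optional_exclusions = (
    cleanlab_columns_df["is_low_information"] | cleanlab_columns_df["is_NSFW"]
)
print(f"Would additionally exclude {optional_exclusions.sum()} images")
# rows_to_exclude |= optional_exclusions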
We can check the total amount of excluded data:
print(f"Excluding {rows_to_exclude.sum()} images (out of {len(cleanlab_columns_df)})")
Finally, let’s actually make a new version of our dataset with these changes.
We craft a new dataframe from the original, applying corrections and exclusions, and then use this dataframe to save the new dataset in a separate directory. The new dataset is a directory that looks just like our original dataset; you can use it as a drop-in replacement to get more reliable results in your ML and analytics pipelines without changing any of your existing code.
Caution: With local image datasets, make sure to verify the source and output directory paths to avoid overwriting or mixing data, confirm there’s sufficient disk space to store the new dataset, and give your settings a once-over before saving the new dataset.
new_dataset_directory = "improved_dataset" # where to save the new dataset (image files will be copied over)
Optional: Initialize helper method to export a cleaned image dataset
import csv
import shutil
def export_adjusted_dataset(
filepaths, labels, source_root_dir, output_dir, zip_output=True
):
"""Copies fixed image dataset to a new output directory. Optionally zips the output directory."""
if len(filepaths) != len(labels):
raise ValueError(
"The number of source filepaths and labels should be the same."
)
if os.path.abspath(source_root_dir) == os.path.abspath(output_dir):
raise ValueError("The input and output directories should not be the same.")
if os.path.exists(output_dir):
raise ValueError(
f"Directory {output_dir} already exists. Cannot overwite so please delete it first, or specify a different output_dir."
)
# Initialize a dictionary to keep track of the mappings
mappings = {}
# Iterate over the filepaths and labels
for filepath, label in zip(filepaths, labels):
source_path = os.path.join(source_root_dir, filepath)
target_dir = os.path.join(output_dir, label)
# Create the target directory if it doesn't exist
os.makedirs(target_dir, exist_ok=True)
# Define the initial target path
target_filename = os.path.basename(filepath)
target_path = os.path.join(target_dir, target_filename)
# Handle filename collisions
counter = 1
while os.path.exists(target_path):
# Append a unique identifier to the filename
target_filename = f"{os.path.splitext(os.path.basename(filepath))[0]}_{counter}{os.path.splitext(filepath)[1]}"
target_path = os.path.join(target_dir, target_filename)
counter += 1
# Copy the image to the new directory
shutil.copy2(source_path, target_path)
# Save the mapping
mappings[filepath] = os.path.relpath(target_path, output_dir)
# Save the filename mappings to a CSV file
csv_columns = ["source_path", "target_path"]
with open(os.path.join(output_dir, "mappings.csv"), "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=csv_columns)
writer.writeheader()
for key, value in mappings.items():
writer.writerow({"source_path": key, "target_path": value})
# Zip the new directory if zip_output is True
if zip_output:
shutil.make_archive(output_dir, "zip", output_dir)
return f"{output_dir}.zip"
else:
return output_dir
# Fetch the original dataset
fixed_dataset = cleanlab_columns_df[["image"]].copy()
# Add the corrected label column
fixed_dataset["label"] = corrected_label
# Automatically exclude selected rows
fixed_dataset = fixed_dataset[~rows_to_exclude]
# Save the adjusted dataset to disk
output_path = export_adjusted_dataset(
filepaths=fixed_dataset["image"],
labels=fixed_dataset["label"],
source_root_dir=dataset_path,
output_dir=new_dataset_directory,
zip_output=True,
)
print(f"Improved dataset saved to {output_path}")