Synthetic

Warning: The utility methods in utils are not guaranteed to be stable between different versions of the cleanlab-studio API.

module cleanlab_studio.utils.synthetic

A collection of utility functions for the Cleanlab Studio Python API.


function score_synthetic_dataset

score_synthetic_dataset(
    cleanset_df: DataFrame,
    real_or_synth_column: str = 'real_or_synthetic',
    synthetic_class_names: Optional[Tuple[str, str]] = None
) → Dict[str, float]

Computes issue scores for a dataset that contains both real and synthetic examples, in order to surface dataset-level problems in the synthetic data.

Args:

  • cleanset_df: The dataframe containing the dataset to score. It must contain the column named by real_or_synth_column (“real_or_synthetic” by default), which indicates whether each example is real or synthetic. It must also contain the cleanset columns provided by Cleanlab Studio.
  • real_or_synth_column: The name of the column that indicates whether each example is real or synthetic.
  • synthetic_class_names: The two class names used in the real-or-synthetic column, i.e. which class corresponds to synthetic examples and which to real examples. The first name must correspond to the synthetic examples and the second to the real examples. If None, defaults to (“synthetic”, “real”).
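
The argument rules above can be sketched as a small wrapper that checks the label column before calling score_synthetic_dataset. This is a minimal sketch, not part of the library: the helpers resolve_class_names, unexpected_labels, and score_checked are hypothetical names introduced for illustration, and the up-front validation is an assumption about what is convenient, not something the API is documented to require.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Documented default when synthetic_class_names is None: (synthetic, real).
DEFAULT_CLASS_NAMES: Tuple[str, str] = ("synthetic", "real")


def resolve_class_names(
    synthetic_class_names: Optional[Tuple[str, str]] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: return (synthetic_label, real_label)."""
    if synthetic_class_names is None:
        return DEFAULT_CLASS_NAMES
    synthetic_label, real_label = synthetic_class_names
    if synthetic_label == real_label:
        raise ValueError("synthetic and real class names must differ")
    return synthetic_label, real_label


def unexpected_labels(labels: Iterable, class_names: Tuple[str, str]) -> Set:
    """Hypothetical helper: labels that match neither class name."""
    return set(labels) - set(class_names)


def score_checked(
    cleanset_df,
    real_or_synth_column: str = "real_or_synthetic",
    synthetic_class_names: Optional[Tuple[str, str]] = None,
) -> Dict[str, float]:
    """Hypothetical wrapper: validate the label column, then score."""
    # Imported lazily so the helpers above work without cleanlab-studio installed.
    from cleanlab_studio.utils.synthetic import score_synthetic_dataset

    class_names = resolve_class_names(synthetic_class_names)
    bad = unexpected_labels(cleanset_df[real_or_synth_column], class_names)
    if bad:
        raise ValueError(
            f"unexpected labels in {real_or_synth_column!r}: {sorted(map(str, bad))}"
        )
    return score_synthetic_dataset(
        cleanset_df,
        real_or_synth_column=real_or_synth_column,
        synthetic_class_names=class_names,
    )
```

For a cleanset whose label column is named, say, "source" with values "generated" and "original", the call would be score_checked(df, real_or_synth_column="source", synthetic_class_names=("generated", "original")). Note the order: the synthetic label comes first.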