Studio

module cleanlab_studio

Python API for Cleanlab Studio.

Global Variables

  • name
  • method

class Studio

method __init__

__init__(api_key: Optional[str])

method TLM

TLM(
quality_preset: Literal['best', 'high', 'medium', 'low', 'base'] = 'medium',
**kwargs: Any
) → TLM

Gets Trustworthy Language Model (TLM) object to prompt.

Args:

  • quality_preset: quality preset to use for prompts
  • kwargs (Any): additional kwargs to pass to TLM class

Returns: TLM object to prompt.


method apply_corrections

apply_corrections(
cleanset_id: str,
dataset: Any,
keep_excluded: bool = False
) → Any

Applies corrections from a Cleanlab Studio cleanset to your dataset. This function takes your local copy of the original dataset and the cleanset_id of the cleanset generated from that dataset in the Project web interface. It returns a copy of your original dataset in which the label column is replaced with the corrected labels you selected (either manually or via auto-fix) in the Cleanlab Studio web interface, and from which the rows you marked as excluded are removed. Corrections should already have been made by viewing your Project in the Cleanlab Studio web interface (see Cleanlab Studio web quickstart).

The intended workflow is: create a Project, correct your Dataset automatically/manually in the web interface to generate a Cleanset (cleaned dataset), then call this function to make your original dataset locally look like the current Cleanset.

Args:

  • cleanset_id: ID of cleanset to apply corrections from.
  • dataset: Dataset to apply corrections to. Supported formats include pandas, snowpark, and pyspark DataFrame. Dataset should have the same number of rows as the dataset used to create the project. It should also contain a label column with the same name as the label column for the project.
  • keep_excluded: Whether to retain rows with an “exclude” action. By default these rows will be removed from the dataset.

Returns: A copy of the dataset with corrections applied.
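
The workflow above could be sketched as follows. This is a minimal sketch rather than reference usage: correct_dataset is a hypothetical helper, studio is assumed to be an already constructed Studio object, and the row-count check assumes the dataset supports len(), as a pandas DataFrame does (snowpark and pyspark DataFrames do not).

```python
def correct_dataset(studio, cleanset_id, dataset, keep_excluded=False):
    """Hypothetical helper: apply cleanset corrections and count dropped rows.

    `studio` is assumed to be a cleanlab_studio.Studio instance, and
    `dataset` the local copy of the data the project was created from.
    """
    corrected = studio.apply_corrections(
        cleanset_id, dataset, keep_excluded=keep_excluded
    )
    # With keep_excluded=False, rows marked "exclude" are removed, so the
    # corrected copy may be shorter than the original.
    n_removed = len(dataset) - len(corrected)
    return corrected, n_removed


# Usage (not executed here; requires a real API key and cleanset):
# from cleanlab_studio import Studio
# studio = Studio("<your API key>")
# corrected_df, n_removed = correct_dataset(studio, cleanset_id, df)
```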


method create_project

create_project(
dataset_id: str,
project_name: str,
modality: Literal['text', 'tabular', 'image'],
task_type: Optional[Literal['multi-class', 'multi-label', 'regression', 'unsupervised']] = 'multi-class',
model_type: Literal['fast', 'regular'] = 'regular',
label_column: Optional[str] = None,
feature_columns: Optional[List[str]] = None,
text_column: Optional[str] = None
) → str

Creates a Cleanlab Studio project.

Args:

  • dataset_id: ID of dataset to create project for.
  • project_name: Name for resulting project.
  • modality: Modality of project (i.e. text, tabular, image).
  • task_type: Type of ML task to perform (i.e. multi-class, multi-label, regression, unsupervised).
  • model_type: Type of model to train (i.e. fast, regular).
  • label_column: Name of column in dataset containing labels (if not supplied, we’ll make our best guess).
  • feature_columns: List of columns to use as features when training tabular modality project (if not supplied and modality is “tabular” we’ll use all valid feature columns).
  • text_column: Name of column containing the text to train text modality project on (if not supplied and modality is “text” we’ll make our best guess).

Returns: ID of created project.


method delete_project

delete_project(project_id: str) → None

Deletes a project from Cleanlab Studio.

Args:

  • project_id: ID of project to delete.

method download_cleanlab_columns

download_cleanlab_columns(
cleanset_id: str,
include_cleanlab_columns: bool = True,
include_project_details: bool = False,
to_spark: bool = False
) → Any

Downloads Cleanlab columns for a cleanset.

Args:

  • cleanset_id: ID of cleanset to download columns from. To obtain the cleanset ID from a project ID, use get_latest_cleanset_id.
  • include_cleanlab_columns: whether to download all Cleanlab columns or just the clean_label column
  • include_project_details: whether to download columns related to project status such as resolved rows, actions taken, etc.
  • to_spark: whether to return a pyspark DataFrame instead of a pandas DataFrame.

Returns: A pandas or pyspark DataFrame. Type is Any to avoid requiring pyspark installation.


method download_embeddings

download_embeddings(cleanset_id: str) → ndarray[Any, dtype[float64]]

Downloads feature embeddings for a cleanset (available only for text and image projects). These are numeric vectors produced via neural network representations of each data point in your dataset.

Args:

  • cleanset_id (str): the ID of the cleanset from which you want to download feature embeddings.

Returns:

  • np.NDArray[float64]: a 2D numpy array of feature embeddings of shape N by N_EMBED, where N is the number of rows in the original dataset and N_EMBED is the dimension of the feature embeddings. The embedding dimension depends on which neural network is used to represent your data (Cleanlab automatically identifies the best type of neural network for your data).

For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Feature embeddings are not computed for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset. If you want to work with feature embeddings for an image project, the recommended workflow is as follows:

  1. When the image project completes, download the cleanset via studio.download_cleanlab_columns, and check whether the is_not_analyzed boolean column has any True values.

  2. If no rows are flagged as is_not_analyzed, all rows were processed successfully. In this case, the rows of the feature embeddings will correspond to the rows of the original dataset, and downstream analysis can be carried out with no further preparation.

  3. If there are rows flagged as is_not_analyzed, the rows of the feature embeddings will correspond to the rows of the original dataset after filtering out the rows that are not analyzed.
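
Steps 1 to 3 can be sketched without committing to a particular DataFrame library. The helpers below are hypothetical, and they assume the is_not_analyzed column is in the same row order as the original dataset, which the text above implies but does not state outright.

```python
def analyzed_positions(is_not_analyzed):
    """Return the positions of original rows that were analyzed.

    `is_not_analyzed` is any iterable of booleans, e.g. the is_not_analyzed
    column of the DataFrame from studio.download_cleanlab_columns(...).
    """
    return [i for i, flag in enumerate(is_not_analyzed) if not flag]


def align_rows(rows, is_not_analyzed, n_embeddings):
    """Keep only the analyzed rows and check they match the embeddings."""
    keep = analyzed_positions(is_not_analyzed)
    if len(keep) != n_embeddings:
        raise ValueError(
            f"{len(keep)} analyzed rows but {n_embeddings} embedding rows"
        )
    return [rows[i] for i in keep]


# Toy data: the second image failed to process, so only 2 embeddings exist.
paths = ["cat.png", "broken.png", "dog.png"]
flags = [False, True, False]
print(align_rows(paths, flags, n_embeddings=2))  # ['cat.png', 'dog.png']
```

With a pandas DataFrame, the same positions could be passed to dataset.iloc[...] to select the rows that line up with the embeddings array.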


method download_pred_probs

download_pred_probs(cleanset_id: str, keep_id: bool = False) → DataFrame

Downloads predicted probabilities for a cleanset (only for classification datasets).

Args:

  • cleanset_id (str): the ID of the cleanset for which to download the corresponding predicted class probabilities.
  • keep_id (bool): whether to include the ID column in the returned DataFrame to enable easy join/merge operations with original dataset.

Returns:

  • pd.DataFrame: a DataFrame of probabilities of shape N by M, where N is the number of rows in the original dataset, and M is the total number of classes in the original dataset. Every row of the returned DataFrame corresponds to the predicted probability of each class for the corresponding row in the original dataset. If keep_id is True, the DataFrame will include an extra ID column that can be used for database joins/merges with the original dataset or downloaded Cleanlab columns.

For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Predicted probabilities will not be calculated for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset.

If you want to work with predicted probabilities for an image project, the recommended workflow is to download probabilities with the option keep_id=True, and then join them with the original dataset on the ID column. Alternatively, you can follow the steps described under download_embeddings and filter out the rows that were not analyzed. The filtered dataset will then have rows that align with the predicted probabilities DataFrame.
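
The keep_id=True join might look like the sketch below. join_pred_probs and its id_column parameter are hypothetical: the sketch assumes dataset is a pandas DataFrame and that its ID column carries the same name as the ID column in the returned predicted probabilities, which should be verified against your own data.

```python
def join_pred_probs(studio, cleanset_id, dataset, id_column):
    """Hypothetical helper: attach predicted probabilities to the dataset.

    Assumes `dataset` is a pandas DataFrame whose `id_column` holds the same
    IDs, under the same column name, as the ID column returned by
    download_pred_probs(..., keep_id=True).
    """
    pred_probs = studio.download_pred_probs(cleanset_id, keep_id=True)
    # An inner join drops rows that have no predicted probabilities,
    # such as images that could not be analyzed.
    return dataset.merge(pred_probs, on=id_column, how="inner")
```

An inner join is used here so that the result contains only rows present in both frames; a left join would instead keep unanalyzed rows with missing probabilities.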


method get_latest_cleanset_id

get_latest_cleanset_id(project_id: str) → str

Gets latest cleanset ID for a project.

Args:

  • project_id: ID of project.

Returns: ID of latest associated cleanset.


method get_model

get_model(model_id: str) → Model

Gets a model deployed by Cleanlab Studio.

Args:

  • model_id: ID of model to get. This ID should be fetched in the deployments page of the app UI.

Returns: Model object with methods to run predictions on new input data.


method poll_dataset_id_for_name

poll_dataset_id_for_name(dataset_name: str, timeout: Optional[int] = None) → str

Polls for dataset ID for a dataset name.

Args:

  • dataset_name: Name of dataset to get ID for.
  • timeout: Optional timeout after which to stop polling for progress. If not provided, will block until dataset is ready.

Returns: ID of dataset.

Raises:

  • TimeoutError: if dataset is not ready by end of timeout
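
Since a timeout surfaces as a TimeoutError, callers may want to handle it explicitly. A minimal sketch, where find_dataset_id is a hypothetical wrapper and studio is assumed to be a Studio object:

```python
def find_dataset_id(studio, dataset_name, timeout=60):
    """Hypothetical wrapper: return the dataset ID, or None on timeout."""
    try:
        return studio.poll_dataset_id_for_name(dataset_name, timeout=timeout)
    except TimeoutError:
        # The dataset was not ready before the timeout elapsed.
        return None
```

Returning None keeps the caller in control of whether to retry, wait longer, or give up.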

method upload_dataset

upload_dataset(
dataset: Any,
dataset_name: Optional[str] = None,
schema_overrides: Optional[Dict[str, Dict[str, Any]]] = None,
modality: Optional[str] = None,
id_column: Optional[str] = None
) → str

Uploads a dataset to Cleanlab Studio.

Args:

  • dataset: Object representing the dataset to upload. Currently supported formats include a str path to your dataset, or a pandas, snowpark, or pyspark DataFrame.
  • dataset_name: Name for your dataset in Cleanlab Studio (optional if uploading from filepath).
  • schema_overrides: Optional dictionary of overrides you would like to make to the schema of your dataset. If not provided, schema will be inferred. Format defined here.
  • modality: Optional parameter to override the modality of your dataset. If not provided, modality will be inferred.
  • id_column: Optional parameter to override the ID column of your dataset. If not provided, a monotonically increasing ID column will be generated.

Returns: ID of uploaded dataset.


method wait_until_cleanset_ready

wait_until_cleanset_ready(
cleanset_id: str,
timeout: Optional[float] = None,
show_cleanset_link: bool = False
) → None

Blocks until a cleanset is ready or the timeout is reached.

Args:

  • cleanset_id (str): ID of cleanset to check status for.
  • timeout (Optional[float], optional): timeout for polling, in seconds. Defaults to None.
  • show_cleanset_link (bool, optional): whether to print a link to view the cleanset in the Cleanlab Studio web UI when the cleanset is ready. Defaults to False.

Raises:

  • TimeoutError: if cleanset is not ready by end of timeout
  • CleansetError: if cleanset errored while running
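
Taken together, the methods on this page can be chained into one pipeline. The sketch below is one plausible ordering for a text project, not an official recipe: run_text_project and the generated project name are hypothetical, studio is assumed to be a Studio object, and the choice of model_type="fast" is arbitrary.

```python
def run_text_project(studio, dataset, dataset_name, label_column, text_column,
                     timeout=None):
    """Hypothetical end-to-end sketch chaining the Studio methods above."""
    dataset_id = studio.upload_dataset(dataset, dataset_name=dataset_name)
    project_id = studio.create_project(
        dataset_id=dataset_id,
        project_name=f"{dataset_name}-project",
        modality="text",
        task_type="multi-class",
        model_type="fast",
        label_column=label_column,
        text_column=text_column,
    )
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Blocks until the cleanset is ready; TimeoutError and CleansetError
    # are deliberately left to propagate to the caller.
    studio.wait_until_cleanset_ready(cleanset_id, timeout=timeout)
    return cleanset_id, studio.download_cleanlab_columns(cleanset_id)


# Usage (not executed here; requires a real API key):
# from cleanlab_studio import Studio
# studio = Studio("<your API key>")
# cleanset_id, cols = run_text_project(studio, df, "reviews", "label", "text")
```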