Studio

module cleanlab_studio

Python API for Cleanlab Studio.


class Studio

method __init__

__init__(api_key: Optional[str])

Creates a Cleanlab Studio client.

Args:

  • api_key: You can find your API key on your account page in Cleanlab Studio. Instead of specifying the API key here, you can also log in with cleanlab login on the command-line.
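
A minimal sketch of the two ways to authenticate described above. The helper names resolve_api_key and make_studio, and the environment variable CLEANLAB_STUDIO_API_KEY, are hypothetical conveniences and not part of the library. Passing None to Studio is assumed to fall back to credentials saved by cleanlab login.

```python
import os
from typing import Any, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env_var: str = "CLEANLAB_STUDIO_API_KEY",  # hypothetical variable name
) -> Optional[str]:
    """Prefer an explicit key, then the environment variable, else None."""
    if explicit_key:
        return explicit_key
    return os.environ.get(env_var) or None


def make_studio(explicit_key: Optional[str] = None) -> Any:
    """Create a client; a None key is assumed to rely on a prior `cleanlab login`."""
    from cleanlab_studio import Studio  # imported lazily so the helper above works standalone

    return Studio(resolve_api_key(explicit_key))
```

Keeping the key out of source code (environment variable or command-line login) avoids committing credentials by accident.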

method TLM

TLM(
quality_preset: Literal['best', 'high', 'medium', 'low', 'base'] = 'medium',
options: Optional[TLMOptions] = None,
timeout: Optional[float] = None,
verbose: Optional[bool] = None
) → TLM

Instantiates a configured Trustworthy Language Model (TLM) instance.

The TLM object can be used as a drop-in replacement for an LLM, for estimating trustworthiness scores for arbitrary text prompt/response pairs, and more (see the TLM documentation).

For advanced use, TLM offers configuration options. The documentation below summarizes these options, and more details are explained in the TLM tutorial.

Args:

  • quality_preset (TLMQualityPreset): An optional preset to control the quality of TLM responses and trustworthiness scores vs. runtimes/costs. TLMQualityPreset is a string specifying one of the supported presets, including “best”, “high”, “medium”, “low”, “base”.

    The “best” and “high” presets return improved LLM responses, with “best” also returning more reliable trustworthiness scores than “high”. The “medium” and “low” presets return standard LLM responses along with associated trustworthiness scores, with “medium” producing more reliable trustworthiness scores than “low”. The “base” preset does not return any trustworthiness score, just a standard LLM response, and is similar to directly using your favorite LLM API.

    Higher presets have increased runtime and cost (and may internally consume more tokens). Reduce your preset if you see token-limit errors. Details about each preset are in the documentation for TLMOptions. Avoid using “best” or “high” presets if you primarily want to get trustworthiness scores and are less concerned with improving LLM responses. These presets have higher runtime/cost and are optimized to return more accurate LLM outputs, but not necessarily more reliable trustworthiness scores.

  • options (TLMOptions, optional): a typed dict of advanced configuration options. Available options (keys in this dict) include “model”, “max_tokens”, “num_candidate_responses”, “num_consistency_samples”, “use_self_reflection”. For more details about the options, see the documentation for TLMOptions. If specified, these override any settings from the choice of quality_preset.
  • timeout (float, optional): timeout (in seconds) to apply to each TLM prompt. If a batch of data is passed in, the timeout will be applied to each individual item in the batch. If a result is not produced within the timeout, a TimeoutError will be raised. Defaults to None, which does not apply a timeout.
  • verbose (bool, optional): whether to print outputs during execution, i.e., whether to show a progress bar when TLM is prompted with batches of data. If None, this will be determined automatically based on whether the code is running in an interactive environment such as a Jupyter notebook.
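
The arguments above can be combined as in the following sketch, which configures a TLM for trustworthiness scoring rather than improved responses (hence “medium” instead of “best” or “high”). The function name make_scoring_tlm and the option values are illustrative assumptions; only the option keys come from the list above, and studio is assumed to be an existing Studio client.

```python
from typing import Any, Dict, Optional


def make_scoring_tlm(studio: Any, timeout_s: Optional[float] = 60.0) -> Any:
    """Build a TLM aimed at trustworthiness scores, not improved responses."""
    options: Dict[str, Any] = {
        "max_tokens": 256,            # illustrative value, not a recommendation
        "use_self_reflection": True,  # assumed to accept a boolean
    }
    return studio.TLM(
        quality_preset="medium",  # standard responses plus trustworthiness scores
        options=options,          # these override the preset's settings
        timeout=timeout_s,        # applied per prompt, in seconds
        verbose=False,            # suppress the progress bar
    )
```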

Returns: A configured TLM instance.


method apply_corrections

apply_corrections(
cleanset_id: str,
dataset: Any,
keep_excluded: bool = False
) → Any

Applies corrections from a Cleanlab Studio cleanset to your dataset. This function takes in your local copy of the original dataset, as well as the cleanset_id for the cleanset generated from this dataset in the Project web interface. The function returns a copy of your original dataset, where the label column has been substituted with corrected labels that you selected (either manually or via auto-fix) in the Cleanlab Studio web interface Project, and the rows you marked as excluded will be excluded from the returned copy of your original dataset. Corrections should have been made by viewing your Project in the Cleanlab Studio web interface (see Cleanlab Studio web quickstart).

The intended workflow is: create a Project, correct your Dataset automatically/manually in the web interface to generate a Cleanset (cleaned dataset), then call this function to make your original dataset locally look like the current Cleanset.

Args:

  • cleanset_id: ID of cleanset to apply corrections from.
  • dataset: Dataset to apply corrections to. Supported formats include pandas, snowpark, and pyspark DataFrame. Dataset should have the same number of rows as the dataset used to create the project. It should also contain a label column with the same name as the label column for the project.
  • keep_excluded: Whether to retain rows with an “exclude” action. By default these rows will be removed from the dataset.

Returns: A copy of the dataset with corrections applied.
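
The workflow described above can be sketched as follows, assuming corrections were already made in the web interface. The helper name corrected_copy is hypothetical; studio is assumed to be an existing Studio client and dataset your local copy of the original data.

```python
from typing import Any


def corrected_copy(
    studio: Any,
    project_id: str,
    dataset: Any,
    keep_excluded: bool = False,
) -> Any:
    """Look up the project's latest cleanset and apply its corrections locally."""
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # The original dataset is not modified; a corrected copy is returned.
    return studio.apply_corrections(cleanset_id, dataset, keep_excluded=keep_excluded)
```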


method create_project

create_project(
dataset_id: str,
project_name: str,
modality: Literal['text', 'tabular', 'image'],
task_type: Optional[Literal['multi-class', 'multi-label', 'regression', 'unsupervised']] = 'multi-class',
model_type: Literal['fast', 'regular'] = 'regular',
label_column: Optional[str] = None,
feature_columns: Optional[List[str]] = None,
text_column: Optional[str] = None
) → str

Creates a Cleanlab Studio project.

Args:

  • dataset_id: ID of dataset to create project for.
  • project_name: Name for resulting project.
  • modality: Modality of project (i.e. text, tabular, image).
  • task_type: Type of ML task to perform. Select a supervised task type (i.e. “multi-class”, “multi-label”, “regression”) if your dataset has a label column you would like to predict values for or detect erroneous values in. Select “unsupervised” if your dataset has no specific label column. See the Projects Guide for more information on task types.
  • model_type: Type of model to train (i.e. “fast”, “regular”). See the Projects Guide for more information on model types.
  • label_column: Name of column in dataset containing labels (if not supplied, we’ll make our best guess). For “unsupervised” tasks, this should be None.
  • feature_columns: List of columns to use as features for a tabular project. By default all columns are used as feature columns. This parameter is particularly useful if your dataset has a column containing unique IDs and you want to exclude that column from the feature columns.
  • text_column: Name of column containing the text to train text modality project on (if not supplied and modality is “text” we’ll make our best guess).

Returns: ID of created project.
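
As an illustration of how these arguments interact, the sketch below assembles keyword arguments for create_project and enforces the documented rule that label_column should be None for “unsupervised” tasks. The helper name project_kwargs and the example IDs are hypothetical; the check is a local convenience, not library behavior.

```python
from typing import Any, Dict, List, Optional


def project_kwargs(
    dataset_id: str,
    project_name: str,
    modality: str,
    task_type: str = "multi-class",
    label_column: Optional[str] = None,
    feature_columns: Optional[List[str]] = None,
    text_column: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble create_project arguments, omitting optional ones left as None."""
    if task_type == "unsupervised" and label_column is not None:
        raise ValueError("label_column should be None for unsupervised tasks")
    kwargs: Dict[str, Any] = {
        "dataset_id": dataset_id,
        "project_name": project_name,
        "modality": modality,
        "task_type": task_type,
    }
    optional = {
        "label_column": label_column,
        "feature_columns": feature_columns,
        "text_column": text_column,
    }
    # Leave unset arguments out so the service can make its best guess.
    kwargs.update({name: value for name, value in optional.items() if value is not None})
    return kwargs
```

With a Studio client in hand, the result would be passed as studio.create_project(**project_kwargs("my-dataset-id", "demo", "tabular", feature_columns=["age", "income"])).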


method delete_dataset

delete_dataset(dataset_id: str) → None

Deletes a dataset from Cleanlab Studio.

If the dataset is used in projects, the projects will be deleted as well.


method delete_project

delete_project(project_id: str) → None

Deletes a project from Cleanlab Studio.

Args:

  • project_id: ID of project to delete.

method download_cleanlab_columns

download_cleanlab_columns(
cleanset_id: str,
include_cleanlab_columns: bool = True,
include_project_details: bool = False,
to_spark: bool = False
) → Any

Downloads Cleanlab columns for a cleanset.

Args:

  • cleanset_id: ID of cleanset to download columns from. To obtain the cleanset ID from a project ID, use get_latest_cleanset_id.
  • include_cleanlab_columns: whether to download all Cleanlab columns or just the clean_label column
  • include_project_details: whether to download columns related to project status such as resolved rows, actions taken, etc.
  • to_spark: whether to return a pyspark DataFrame instead of a pandas DataFrame.

Returns: A pandas or pyspark DataFrame. Type is Any to avoid requiring pyspark installation.


method download_embeddings

download_embeddings(cleanset_id: str) → ndarray[Any, dtype[float64]]

Downloads feature embeddings for a cleanset (available only for text and image projects). These are numeric vectors produced via neural network representations of each data point in your dataset.

Args:

  • cleanset_id (str): the ID of the cleanset from which you want to download feature embeddings.

Returns:

  • np.NDArray[float64]: a 2D numpy array of feature embeddings of shape N by N_EMBED, where N is the number of rows in the original dataset, and N_EMBED is the dimension of the feature embeddings. The embedding-dimension depends on which neural network is used to represent your data (Cleanlab automatically identifies the best type of neural network for your data).

For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Feature embeddings are not computed for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset. If you want to work with feature embeddings for an image project, the recommended workflow is as follows:

  1. When the image project completes, download the cleanset via studio.download_cleanlab_columns, and check whether the is_not_analyzed boolean column has any True values.

  2. If no rows are flagged as is_not_analyzed, all rows were processed successfully. In this case, the rows of the feature embeddings will correspond to the rows of the original dataset, and downstream analysis can be carried out with no further preparation.

  3. If there are rows flagged as is_not_analyzed, the rows of the feature embeddings will correspond to the rows of the original dataset after filtering out the rows that are not analyzed.
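
The steps above can be sketched with plain Python sequences. The helper name rows_matching_embeddings is hypothetical, and the sketch assumes the is_not_analyzed flags are in the same order as the rows of the original dataset. With pandas and numpy, the flags would come from the is_not_analyzed column of the downloaded Cleanlab columns and n_embeddings from embeddings.shape[0].

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def rows_matching_embeddings(
    rows: Sequence[T],
    is_not_analyzed: Sequence[bool],
    n_embeddings: int,
) -> List[T]:
    """Drop rows that were not analyzed so the rest line up with the embeddings."""
    if len(rows) != len(is_not_analyzed):
        raise ValueError("expected one is_not_analyzed flag per original row")
    kept = [row for row, skipped in zip(rows, is_not_analyzed) if not skipped]
    if len(kept) != n_embeddings:
        raise ValueError(
            f"{len(kept)} analyzed rows but {n_embeddings} embedding rows"
        )
    return kept
```

The length check at the end guards against silently misaligned rows, which would otherwise pair embeddings with the wrong data points.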


method download_pred_probs

download_pred_probs(cleanset_id: str, keep_id: bool = False) → DataFrame

Downloads predicted probabilities for a cleanset (only for classification datasets).

Args:

  • cleanset_id (str): the ID of the cleanset for which to download the corresponding predicted class probabilities.
  • keep_id (bool): whether to include the ID column in the returned DataFrame to enable easy join/merge operations with original dataset.

Returns:

  • pd.DataFrame: a DataFrame of probabilities of shape N by M, where N is the number of rows in the original dataset, and M is the total number of classes in the original dataset. Every row of the returned DataFrame corresponds to the predicted probability of each class for the corresponding row in the original dataset. If keep_id is True, the DataFrame will include an extra ID column that can be used for database joins/merges with the original dataset or downloaded Cleanlab columns.

For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Predicted probabilities will not be calculated for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset.

If you want to work with predicted probabilities for an image project, the recommended workflow is to download probabilities with the option keep_id=True, and then do a join with the original dataset on the ID column. Alternatively, you can follow the steps here, and filter out the rows that were not analyzed. The filtered dataset will then have rows that align with the predicted probabilities DataFrame.
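
A minimal sketch of the join described above, written over lists of row dictionaries (for example, the result of DataFrame.to_dict("records")). The helper name join_on_id and the ID column name used in the usage note are hypothetical; with pandas, an inner DataFrame.merge on the ID column achieves the same result.

```python
from typing import Any, Dict, List


def join_on_id(
    dataset_rows: List[Dict[str, Any]],
    prob_rows: List[Dict[str, Any]],
    id_column: str,
) -> List[Dict[str, Any]]:
    """Inner-join dataset rows with predicted probabilities on the ID column."""
    probs_by_id = {row[id_column]: row for row in prob_rows}
    joined: List[Dict[str, Any]] = []
    for row in dataset_rows:
        probs = probs_by_id.get(row[id_column])
        if probs is None:
            continue  # no probabilities for this row (e.g. it was not analyzed)
        merged = dict(row)
        merged.update({k: v for k, v in probs.items() if k != id_column})
        joined.append(merged)
    return joined
```

For example, join_on_id(dataset_rows, prob_rows, "id") keeps only rows that have probabilities, so rows that failed processing drop out without manual filtering.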


method get_latest_cleanset_id

get_latest_cleanset_id(project_id: str) → str

Gets latest cleanset ID for a project.

Args:

  • project_id: ID of project.

Returns: ID of latest associated cleanset.


method get_model

get_model(model_id: str) → Model

Gets a model that is deployed in a Cleanlab Studio account.

The returned model can then be used to predict labels for new data. See the documentation for the Model class for more on what you can do with a Model object.

Args:

  • model_id: ID of model to get. The model ID can be found in the “Model Details” tab of a model page.

Returns: Model instance, which exposes methods to predict labels for new data.


method poll_dataset_id_for_name

poll_dataset_id_for_name(dataset_name: str, timeout: Optional[int] = None) → str

Polls for dataset ID for a dataset name.

Args:

  • dataset_name: Name of dataset to get ID for.
  • timeout: Optional timeout after which to stop polling for progress. If not provided, will block until dataset is ready.

Returns: ID of dataset.

Raises:

  • TimeoutError: if dataset is not ready by end of timeout
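
One way to handle the timeout is sketched below. The helper name dataset_id_or_none and the default timeout value are hypothetical; studio is assumed to be an existing Studio client.

```python
from typing import Any, Optional


def dataset_id_or_none(
    studio: Any,
    dataset_name: str,
    timeout: Optional[int] = 60,  # illustrative default
) -> Optional[str]:
    """Return the dataset ID, or None if the dataset is not ready in time."""
    try:
        return studio.poll_dataset_id_for_name(dataset_name, timeout=timeout)
    except TimeoutError:
        return None
```

Returning None lets the caller decide whether to retry later or report the delay, instead of blocking indefinitely as happens when no timeout is given.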

method upload_dataset

upload_dataset(
dataset: Any,
dataset_name: Optional[str] = None,
schema_overrides: Optional[List[SchemaOverride]] = None,
**kwargs: Any
) → str

Uploads a dataset to Cleanlab Studio.

Args:

  • dataset: Object representing the dataset to upload. Currently supported formats include a str path to your dataset, or a pandas, snowflake, or pyspark DataFrame.
  • dataset_name: Name for your dataset in Cleanlab Studio (optional if uploading from filepath).
  • schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
  • modality: [DEPRECATED] Optional parameter to override the modality of your dataset. If not provided, modality will be inferred.
  • id_column: [DEPRECATED] Optional parameter to override the ID column of your dataset. If not provided, a monotonically increasing ID column will be generated.

Returns: ID of uploaded dataset.


method wait_until_cleanset_ready

wait_until_cleanset_ready(
cleanset_id: str,
timeout: Optional[float] = None,
show_cleanset_link: bool = False
) → None

Blocks until a cleanset is ready or the timeout is reached.

Args:

  • cleanset_id (str): ID of cleanset to check status for.
  • timeout (Optional[float], optional): timeout for polling, in seconds. Defaults to None.
  • show_cleanset_link (bool, optional): whether to print a link to view the cleanset in the Cleanlab Studio web UI when the cleanset is ready. Defaults to False.

Raises:

  • TimeoutError: if cleanset is not ready by end of timeout
  • CleansetError: if cleanset errored while running
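
Putting several of the methods on this page together, an end-to-end run might look like the sketch below. The function name run_pipeline is hypothetical, and the sketch assumes a cleanset ID is available from get_latest_cleanset_id as soon as the project has been created. TimeoutError and CleansetError are left to propagate to the caller.

```python
from typing import Any, Optional


def run_pipeline(
    studio: Any,
    dataset: Any,
    dataset_name: str,
    project_name: str,
    modality: str,
    label_column: Optional[str] = None,
    timeout: Optional[float] = None,
) -> Any:
    """Upload a dataset, create a project, wait for its cleanset, and download results."""
    dataset_id = studio.upload_dataset(dataset, dataset_name=dataset_name)
    project_id = studio.create_project(
        dataset_id=dataset_id,
        project_name=project_name,
        modality=modality,
        label_column=label_column,
    )
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Blocks until the cleanset is ready; raises TimeoutError if the timeout passes first.
    studio.wait_until_cleanset_ready(cleanset_id, timeout=timeout)
    return studio.download_cleanlab_columns(cleanset_id)
```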