Studio

module cleanlab_studio

Python API for Cleanlab Studio.

Some methods here apply to other Cleanlab products (e.g. TLM) that also rely on the Cleanlab Studio client.


class Studio

method __init__

__init__(api_key: 'Optional[str]')

Creates a Cleanlab Studio client.

Args:

  • api_key: You can find your API key on your account page in Cleanlab Studio. Instead of specifying the API key here, you can also log in by running cleanlab login on the command line.

method apply_corrections

apply_corrections(
cleanset_id: 'str',
dataset: 'Any',
keep_excluded: 'bool' = False
) → Any

Applies corrections from a Cleanlab Studio cleanset to your dataset. This function takes your local copy of the original dataset and the cleanset_id of the cleanset generated from that dataset in the Project web interface. It returns a copy of your original dataset in which the label column is replaced with the corrected labels you selected (manually or via auto-fix) in the Cleanlab Studio web interface, and the rows you marked as excluded are removed. Corrections should be made beforehand by viewing your Project in the Cleanlab Studio web interface (see Cleanlab Studio web quickstart).

The intended workflow is: create a Project, correct your Dataset automatically/manually in the web interface to generate a Cleanset (cleaned dataset), then call this function to make your original dataset locally look like the current Cleanset.

Args:

  • cleanset_id: ID of cleanset to apply corrections from.
  • dataset: Dataset to apply corrections to. Supported formats include pandas, snowpark, and pyspark DataFrame. Dataset should have the same number of rows as the dataset used to create the project. It should also contain a label column with the same name as the label column for the project.
  • keep_excluded: Whether to retain rows with an “exclude” action. By default these rows will be removed from the dataset.

Returns: A copy of the dataset with corrections applied.
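The intended workflow above can be sketched as a small helper. This is a minimal sketch, not part of the library: apply_latest_corrections is a hypothetical name, and studio is assumed to be an already authenticated Studio client. It uses only get_latest_cleanset_id and apply_corrections as documented on this page.

```python
from typing import Any


def apply_latest_corrections(
    studio: Any, project_id: str, dataset: Any, keep_excluded: bool = False
) -> Any:
    """Apply corrections from a project's latest cleanset to a local dataset.

    `studio` is assumed to be a cleanlab_studio Studio client, and `dataset`
    the local copy of the data that was uploaded for the project.
    """
    # Look up the most recent cleanset generated for this project.
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Documented to return a corrected copy of the dataset.
    return studio.apply_corrections(cleanset_id, dataset, keep_excluded=keep_excluded)
```

With a real client, this might be called as apply_latest_corrections(studio, project_id, df), where df is the pandas DataFrame originally uploaded for the project.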


method create_enrichment_project

create_enrichment_project(name: 'str', dataset_id: 'str') → EnrichmentProject

Creates a Cleanlab Studio Enrichment Project.

Args:

  • name (str): Name of the enrichment project to create.
  • dataset_id (str): ID of dataset to be enriched.

Returns: EnrichmentProject object for the newly created project.


method create_project

create_project(
dataset_id: 'str',
project_name: 'str',
modality: "Literal['text', 'tabular', 'image']",
task_type: "Optional[Literal['multi-class', 'multi-label', 'regression', 'unsupervised']]" = 'multi-class',
model_type: "Literal['fast', 'regular']" = 'regular',
label_column: 'Optional[str]' = None,
feature_columns: 'Optional[List[str]]' = None,
text_column: 'Optional[str]' = None
) → str

Creates a Cleanlab Studio project.

Args:

  • dataset_id: ID of dataset to create project for.
  • project_name: Name for resulting project.
  • modality: Modality of project (i.e. text, tabular, image).
  • task_type: Type of ML task to perform. Select a supervised task type (i.e. “multi-class”, “multi-label”, “regression”) if your dataset has a label column you would like to predict values for or detect erroneous values in. Select “unsupervised” if your dataset has no specific label column. See the Projects Guide for more information on task types.
  • model_type: Type of model to train (i.e. “fast”, “regular”). See the Projects Guide for more information on model types.
  • label_column: Name of column in dataset containing labels (if not supplied, we’ll make our best guess). For “unsupervised” tasks, this should be None.
  • feature_columns: List of columns to use as features for a tabular project. By default all columns are used as feature columns. This parameter is particularly useful if your dataset has a column containing unique IDs and you want to exclude that column from the feature columns.
  • text_column: Name of column containing the text to train text modality project on (if not supplied and modality is “text” we’ll make our best guess).

Returns: ID of created project.
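The argument rules above can be checked before calling create_project. The helper below is a hedged sketch: project_kwargs is a hypothetical name, and it only encodes constraints stated in the argument descriptions (allowed modalities, and label_column being None for "unsupervised" tasks). The resulting dictionary is meant to be passed as studio.create_project(**kwargs).

```python
from typing import Any, Dict, Optional

# Modalities listed in the create_project signature.
MODALITIES = ("text", "tabular", "image")


def project_kwargs(
    dataset_id: str,
    project_name: str,
    modality: str,
    task_type: Optional[str] = "multi-class",
    label_column: Optional[str] = None,
    text_column: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for create_project (hypothetical helper)."""
    if modality not in MODALITIES:
        raise ValueError(f"modality must be one of {MODALITIES}, got {modality!r}")
    # The docs state label_column should be None for unsupervised tasks.
    if task_type == "unsupervised" and label_column is not None:
        raise ValueError("label_column must be None for unsupervised tasks")
    kwargs: Dict[str, Any] = {
        "dataset_id": dataset_id,
        "project_name": project_name,
        "modality": modality,
        "task_type": task_type,
        "label_column": label_column,
    }
    # text_column is only passed along for text projects.
    if modality == "text" and text_column is not None:
        kwargs["text_column"] = text_column
    return kwargs
```

Leaving out text_column or label_column lets Cleanlab Studio make its best guess, as described above.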


method delete_dataset

delete_dataset(dataset_id: 'str') → None

Deletes a dataset from Cleanlab Studio.

If the dataset is used in projects, the projects will be deleted as well.


method delete_enrichment_project

delete_enrichment_project(project_id: 'str') → None

Deletes an Enrichment Project from Cleanlab Studio.

Args:

  • project_id: ID of enrichment project to delete.

method delete_project

delete_project(project_id: 'str') → None

Deletes a project from Cleanlab Studio.

Args:

  • project_id: ID of project to delete.

method deploy_model

deploy_model(cleanset_id: 'str', model_name: 'str') → str

Trains and deploys a model with an improved dataset created by applying any corrections you’ve made to your cleanset in Cleanlab Studio.

Args:

  • cleanset_id: ID of cleanset to deploy model for.
  • model_name: Name for resulting model.

method download_cleanlab_columns

download_cleanlab_columns(
cleanset_id: 'str',
include_cleanlab_columns: 'bool' = True,
include_project_details: 'bool' = False,
to_spark: 'bool' = False
) → Any

Downloads Cleanlab columns for a cleanset.

Args:

  • cleanset_id: ID of cleanset to download columns from. To obtain the cleanset ID from a project ID, use get_latest_cleanset_id.
  • include_cleanlab_columns: whether to download all Cleanlab columns or just the clean_label column
  • include_project_details: whether to download columns related to project status such as resolved rows, actions taken, etc.
  • to_spark: whether to return a pyspark DataFrame instead of a pandas DataFrame.

Returns: A pandas or pyspark DataFrame. Type is Any to avoid requiring pyspark installation.


method download_embeddings

download_embeddings(cleanset_id: 'str') → NDArray[float64]

Downloads feature embeddings for a cleanset (available only for text and image projects). These are numeric vectors produced via neural network representations of each data point in your dataset.

Args:

  • cleanset_id (str): the ID of the cleanset from which you want to download feature embeddings.

Returns:

  • np.NDArray[float64]: a 2D numpy array of feature embeddings of shape N by N_EMBED, where N is the number of rows in the original dataset, and N_EMBED is the dimension of the feature embeddings. The embedding-dimension depends on which neural network is used to represent your data (Cleanlab automatically identifies the best type of neural network for your data).

For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Feature embeddings are not computed for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset. If you want to work with feature embeddings for an image project, the recommended workflow is as follows:

  1. When the image project completes, download the cleanset via studio.download_cleanlab_columns, and check whether the is_not_analyzed boolean column has any True values.

  2. If no rows are flagged as is_not_analyzed, it means that all the rows were processed successfully. In this case, the rows of the feature embeddings will correspond to the rows of the original dataset, and downstream analysis can be carried out with no further preparation.

  3. If there are rows flagged as is_not_analyzed, the rows of the feature embeddings will correspond to the rows of the original dataset after filtering out the rows that are not analyzed.
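The filtering in steps 1 to 3 can be sketched in plain Python. This is an illustrative helper, not part of the library: analyzed_row_positions is a hypothetical name, and it assumes the is_not_analyzed values are given in the same row order as the original dataset.

```python
from typing import List, Sequence


def analyzed_row_positions(is_not_analyzed: Sequence[bool]) -> List[int]:
    """Return positions of original rows that have a feature embedding.

    Assumption: flags are in original dataset row order. Embedding row i then
    corresponds to the original row at positions[i].
    """
    # Keep only rows that were analyzed (flag is False).
    return [pos for pos, skipped in enumerate(is_not_analyzed) if not skipped]
```

With pandas, the flags might come from cleanlab_columns["is_not_analyzed"].tolist(), and original_df.iloc[positions] would then give the rows aligned with the embeddings. A useful sanity check is that len(positions) equals the number of rows in the embeddings array.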


method download_pred_probs

download_pred_probs(cleanset_id: 'str', keep_id: 'bool' = False) → DataFrame

Downloads predicted probabilities for a cleanset (only for classification datasets).

Args:

  • cleanset_id (str): the ID of the cleanset for which to download the corresponding predicted class probabilities.
  • keep_id (bool): whether to include the ID column in the returned DataFrame to enable easy join/merge operations with original dataset.

Returns:

  • pd.DataFrame: a DataFrame of probabilities of shape N by M, where N is the number of rows in the original dataset, and M is the total number of classes in the original dataset. Every row of the returned DataFrame corresponds to the predicted probability of each class for the corresponding row in the original dataset. If keep_id is True, the DataFrame will include an extra ID column that can be used for database joins/merges with the original dataset or downloaded Cleanlab columns.

For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Predicted probabilities will not be calculated for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset.

If you want to work with predicted probabilities for an image project, the recommended workflow is to download probabilities with the option keep_id=True, and then do a join with the original dataset on the ID column. Alternatively, you can follow the steps described under download_embeddings, and filter out the rows that were not analyzed. The filtered dataset will then have rows that align with the predicted probabilities DataFrame.
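The keep_id=True join can be sketched as follows. This is a minimal sketch under stated assumptions: pred_probs_joined is a hypothetical helper, original_df is assumed to be a pandas DataFrame, and id_column is assumed to be the name of the ID column present in both the original dataset and the downloaded predicted probabilities.

```python
from typing import Any


def pred_probs_joined(
    studio: Any, cleanset_id: str, original_df: Any, id_column: str
) -> Any:
    """Join the original rows with their predicted class probabilities."""
    # keep_id=True adds the ID column needed for the join.
    pred_probs = studio.download_pred_probs(cleanset_id, keep_id=True)
    # Inner join: rows that were not analyzed have no predicted
    # probabilities, so they drop out of the result.
    return original_df.merge(pred_probs, on=id_column, how="inner")
```

An inner join is chosen here so that the result contains only rows for which probabilities exist; a left join would instead keep unanalyzed rows with missing probability values.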


method get_enrichment_job_status

get_enrichment_job_status(job_id: 'str') → dict[str, Any]

Get the status of an enrichment job.

Args:

  • job_id (str): ID of the enrichment job.

Returns:

  • dict[str, Any]: A dictionary containing the status of the enrichment job.

method get_enrichment_project

get_enrichment_project(project_id: 'str') → EnrichmentProject

Get an EnrichmentProject object for a given Cleanlab Studio Enrichment Project’s ID.

Args:

  • project_id (str): ID of the enrichment project.

Returns: EnrichmentProject object for the given project ID.


method get_enrichment_projects

get_enrichment_projects() → List[EnrichmentProject]

Get a list of all EnrichmentProjects.

Returns: List of EnrichmentProject objects.


method get_latest_cleanset_id

get_latest_cleanset_id(project_id: 'str') → str

Gets latest cleanset ID for a project.

Args:

  • project_id: ID of project.

Returns: ID of latest associated cleanset.


method get_model

get_model(model_id: 'str') → Model

Gets a model that is deployed in a Cleanlab Studio account.

The returned model can then be used to predict labels for new data. See the documentation for the Model class for more on what you can do with a Model object.

Args:

  • model_id: ID of model to get. The model ID can be found in the “Model Details” tab of a model page.

Returns: Model instance, which exposes methods to predict labels for new data.


method poll_dataset_id_for_name

poll_dataset_id_for_name(
dataset_name: 'str',
timeout: 'Optional[int]' = None
) → str

Polls for dataset ID for a dataset name.

Args:

  • dataset_name: Name of dataset to get ID for.
  • timeout: Optional timeout after which to stop polling for progress. If not provided, will block until dataset is ready.

Returns: ID of dataset.

Raises:

  • TimeoutError: if dataset is not ready by end of timeout
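A bounded poll that handles the documented TimeoutError might look like the sketch below. dataset_id_or_none is a hypothetical helper. It assumes the documented TimeoutError is Python's built-in exception and that timeout is given in seconds, which this page states only for the wait_until_* methods.

```python
from typing import Any, Optional


def dataset_id_or_none(
    studio: Any, dataset_name: str, timeout: int = 60
) -> Optional[str]:
    """Return the dataset ID, or None if it is not ready within `timeout`."""
    try:
        return studio.poll_dataset_id_for_name(dataset_name, timeout=timeout)
    except TimeoutError:
        # Assumption: the library raises the built-in TimeoutError.
        return None
```

Passing an explicit timeout avoids blocking indefinitely, which is the behavior when no timeout is provided.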

method upload_dataset

upload_dataset(
dataset: 'Any',
dataset_name: 'Optional[str]' = None,
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str

Uploads a dataset to Cleanlab Studio.

Args:

  • dataset: Object representing the dataset to upload. Currently supported formats include a str path to your dataset or a pandas, snowflake, or pyspark DataFrame.
  • dataset_name: Name for your dataset in Cleanlab Studio (optional if uploading from filepath).
  • schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
  • modality: [DEPRECATED] Optional parameter to override the modality of your dataset. If not provided, modality will be inferred.
  • id_column: [DEPRECATED] Optional parameter to override the ID column of your dataset. If not provided, a monotonically increasing ID column will be generated.

Returns: ID of uploaded dataset.


method upload_from_bigframe

upload_from_bigframe(
bigframe: 'Any',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str

Uploads a dataset, from a BigFrame, to Cleanlab Studio.

Args:

  • bigframe: BigFrame object representing the dataset to upload.
  • schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.

method upload_from_bigquery

upload_from_bigquery(
bigquery_project: 'str',
bigquery_dataset_id: 'str',
bigquery_table_id: 'str',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str

Uploads a dataset, from BigQuery, to Cleanlab Studio.

Args:

  • bigquery_project: BigQuery project ID.
  • bigquery_dataset_id: BigQuery dataset ID.
  • bigquery_table_id: BigQuery table ID.
  • schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.

method upload_from_url

upload_from_url(
url: 'str',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str

Uploads a dataset, from URL, to Cleanlab Studio.

Args:

  • url: URL to the dataset to upload.
  • schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.

method wait_until_cleanset_ready

wait_until_cleanset_ready(
cleanset_id: 'str',
timeout: 'Optional[float]' = None,
show_cleanset_link: 'bool' = False
) → None

Blocks until a cleanset is ready or the timeout is reached.

Args:

  • cleanset_id (str): ID of cleanset to check status for.
  • timeout (Optional[float], optional): timeout for polling, in seconds. Defaults to None.
  • show_cleanset_link (bool, optional): whether to print a link to view the cleanset in the Cleanlab Studio web UI when the cleanset is ready. Defaults to False.

Raises:

  • TimeoutError: if cleanset is not ready by end of timeout
  • CleansetError: if cleanset errored while running
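Several methods on this page are typically chained: upload a dataset, create a project, wait for its cleanset, then download the Cleanlab columns. The sketch below strings them together using only the signatures documented here. clean_dataset is a hypothetical helper and studio is assumed to be an authenticated Studio client.

```python
from typing import Any, Optional


def clean_dataset(
    studio: Any,
    dataset: Any,
    dataset_name: str,
    project_name: str,
    modality: str,
    label_column: Optional[str] = None,
    timeout: Optional[float] = None,
) -> Any:
    """Upload a dataset, run a project on it, and return its Cleanlab columns."""
    dataset_id = studio.upload_dataset(dataset, dataset_name=dataset_name)
    project_id = studio.create_project(
        dataset_id=dataset_id,
        project_name=project_name,
        modality=modality,
        label_column=label_column,
    )
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Blocks until ready; may raise TimeoutError or CleansetError as documented.
    studio.wait_until_cleanset_ready(cleanset_id, timeout=timeout)
    return studio.download_cleanlab_columns(cleanset_id)
```

Callers that need to distinguish a slow cleanset from a failed one can catch the two documented exceptions separately around the call.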

method wait_until_model_ready

wait_until_model_ready(
model_id: 'str',
timeout: 'Optional[float]' = None
) → None

Blocks until a model is ready or the timeout is reached.

Args:

  • model_id (str): ID of model to check status for.
  • timeout (Optional[float], optional): timeout for polling, in seconds. Defaults to None.

Raises:

  • TimeoutError: if model is not ready by end of timeout
  • DeploymentError: if model errored while training
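Model deployment follows a similar wait pattern. The sketch below chains deploy_model, wait_until_model_ready, and get_model. deploy_and_get_model is a hypothetical helper, and it assumes that the string returned by deploy_model is the model ID, which this page does not state explicitly.

```python
from typing import Any, Optional


def deploy_and_get_model(
    studio: Any, cleanset_id: str, model_name: str, timeout: Optional[float] = None
) -> Any:
    """Deploy a model for a cleanset and return it once it is ready."""
    # Assumption: the returned string is the ID of the deployed model.
    model_id = studio.deploy_model(cleanset_id, model_name)
    # Blocks until ready; may raise TimeoutError or DeploymentError as documented.
    studio.wait_until_model_ready(model_id, timeout=timeout)
    return studio.get_model(model_id)
```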

method TLM

TLM(
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLM

Instantiate a Trustworthy Language Model (TLM). For more details, see the documentation of: cleanlab_studio.studio.trustworthy_language_model.TLM


method TLMCalibrated

TLMCalibrated(
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLMCalibrated

Instantiate a version of the Trustworthy Language Model that you can calibrate using existing ratings for example prompt-response pairs. For more details, see the documentation of: cleanlab_studio.utils.tlm_calibrated.TLMCalibrated


method TLMLite

TLMLite(
response_model: 'str' = 'gpt-4o',
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLMLite

Instantiate a version of the Trustworthy Language Model that uses one model for response and another for trustworthiness scoring (reduce costs/latency without reducing response quality). For more details, see the documentation of: cleanlab_studio.utils.tlm_lite.TLMLite