Studio
module cleanlab_studio
Python API for Cleanlab Studio.
class Studio
method __init__
__init__(api_key: 'Optional[str]')
Creates a Cleanlab Studio client.
Args:
api_key: You can find your API key on your account page in Cleanlab Studio. Instead of specifying the API key here, you can also log in with cleanlab login on the command line.
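For example, a minimal client setup (replace the placeholder with your actual key):

from cleanlab_studio import Studio

# Initialize the Python client; the key is on your Cleanlab Studio account page.
studio = Studio("YOUR_API_KEY")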
method TLM
TLM(
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLM
Instantiate a Trustworthy Language Model (TLM). For more details, see the documentation of: cleanlab_studio.studio.trustworthy_language_model.TLM
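A minimal usage sketch (the prompt is a placeholder; see the linked TLM documentation for the exact structure of the returned result):

tlm = studio.TLM(quality_preset="medium")
out = tlm.prompt("Is the sky green?")
# out contains the model's response along with an associated trustworthiness score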
method TLMCalibrated
TLMCalibrated(
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLMCalibrated
Instantiate a version of the Trustworthy Language Model that you can calibrate using existing ratings of example prompt-response pairs. For more details, see the documentation of: cleanlab_studio.utils.tlm_calibrated.TLMCalibrated
method TLMLite
TLMLite(
response_model: 'str' = 'gpt-4o',
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLMLite
Instantiate a version of the Trustworthy Language Model that uses one model to generate responses and another for trustworthiness scoring (to reduce cost/latency without reducing response quality). For more details, see the documentation of: cleanlab_studio.utils.tlm_lite.TLMLite
method apply_corrections
apply_corrections(
cleanset_id: 'str',
dataset: 'Any',
keep_excluded: 'bool' = False
) → Any
Applies corrections from a Cleanlab Studio cleanset to your dataset. This function takes in your local copy of the original dataset, as well as the cleanset_id
for the cleanset generated from this dataset in the Project web interface. The function returns a copy of your original dataset, where the label column has been substituted with corrected labels that you selected (either manually or via auto-fix) in the Cleanlab Studio web interface Project, and the rows you marked as excluded will be excluded from the returned copy of your original dataset. Corrections should have been made by viewing your Project in the Cleanlab Studio web interface (see Cleanlab Studio web quickstart).
The intended workflow is: create a Project, correct your Dataset automatically/manually in the web interface to generate a Cleanset (cleaned dataset), then call this function to make your original dataset locally look like the current Cleanset.
Args:
cleanset_id: ID of cleanset to apply corrections from.
dataset: Dataset to apply corrections to. Supported formats include pandas, Snowpark, and PySpark DataFrames. Dataset should have the same number of rows as the dataset used to create the project. It should also contain a label column with the same name as the label column for the project.
keep_excluded: Whether to retain rows with an “exclude” action. By default these rows will be removed from the dataset.
Returns: A copy of the dataset with corrections applied.
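A minimal sketch of this workflow with a pandas DataFrame (the file path and cleanset ID are placeholders):

import pandas as pd

df = pd.read_csv("original_dataset.csv")  # your local copy of the original dataset
fixed_df = studio.apply_corrections("YOUR_CLEANSET_ID", df)  # corrected labels applied, excluded rows dropped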
method create_enrichment_project
create_enrichment_project(name: 'str', dataset_id: 'str') → EnrichmentProject
Creates a Cleanlab Studio Enrichment Project.
Args:
name (str): Name of the enrichment project to create.
dataset_id (str): ID of dataset to be enriched.
Returns:
EnrichmentProject: The EnrichmentProject object for the new enrichment project.
method create_project
create_project(
dataset_id: 'str',
project_name: 'str',
modality: "Literal['text', 'tabular', 'image']",
task_type: "Optional[Literal['multi-class', 'multi-label', 'regression', 'unsupervised']]" = 'multi-class',
model_type: "Literal['fast', 'regular']" = 'regular',
label_column: 'Optional[str]' = None,
feature_columns: 'Optional[List[str]]' = None,
text_column: 'Optional[str]' = None
) → str
Creates a Cleanlab Studio project.
Args:
dataset_id: ID of dataset to create project for.
project_name: Name for resulting project.
modality: Modality of project (i.e. text, tabular, image).
task_type: Type of ML task to perform. Select a supervised task type (i.e. “multi-class”, “multi-label”, “regression”) if your dataset has a label column you would like to predict values for or detect erroneous values in. Select “unsupervised” if your dataset has no specific label column. See the Projects Guide for more information on task types.
model_type: Type of model to train (i.e. “fast”, “regular”). See the Projects Guide for more information on model types.
label_column: Name of column in dataset containing labels (if not supplied, we’ll make our best guess). For “unsupervised” tasks, this should be None.
feature_columns: List of columns to use as features for a tabular project. By default all columns are used as feature columns. This parameter is particularly useful if your dataset has a column containing unique IDs and you want to exclude that column from the feature columns.
text_column: Name of column containing the text to train a text modality project on (if not supplied and modality is “text”, we’ll make our best guess).
Returns: ID of created project.
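For example, a text multi-class project might be created like this (the column names are hypothetical):

project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="my-project",
    modality="text",
    task_type="multi-class",
    label_column="label",   # hypothetical label column name
    text_column="review",   # hypothetical text column name
)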
method delete_dataset
delete_dataset(dataset_id: 'str') → None
Deletes a dataset from Cleanlab Studio.
If the dataset is used in projects, the projects will be deleted as well.
method delete_enrichment_project
delete_enrichment_project(project_id: 'str') → None
Deletes an Enrichment Project from Cleanlab Studio.
Args:
project_id
: ID of enrichment project to delete.
method delete_project
delete_project(project_id: 'str') → None
Deletes a project from Cleanlab Studio.
Args:
project_id
: ID of project to delete.
method deploy_model
deploy_model(cleanset_id: 'str', model_name: 'str') → str
Trains and deploys a model with an improved dataset created by applying any corrections you’ve made to your cleanset in Cleanlab Studio.
Args:
cleanset_id: ID of cleanset to deploy model for.
model_name: Name for resulting model.
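A minimal sketch (assuming, as the signature suggests, that the returned string is the new model’s ID, which you can pass to wait_until_model_ready):

model_id = studio.deploy_model(cleanset_id, model_name="my-model")
studio.wait_until_model_ready(model_id)  # block until training/deployment finishes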
method download_cleanlab_columns
download_cleanlab_columns(
cleanset_id: 'str',
include_cleanlab_columns: 'bool' = True,
include_project_details: 'bool' = False,
to_spark: 'bool' = False
) → Any
Downloads Cleanlab columns for a cleanset.
Args:
cleanset_id: ID of cleanset to download columns from. To obtain the cleanset ID from a project ID, use get_latest_cleanset_id.
include_cleanlab_columns: whether to download all Cleanlab columns or just the clean_label column.
include_project_details: whether to download columns related to project status such as resolved rows, actions taken, etc.
Returns:
A pandas or pyspark DataFrame. Type is Any to avoid requiring pyspark installation.
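For example:

cleanset_id = studio.get_latest_cleanset_id(project_id)
cleanlab_cols = studio.download_cleanlab_columns(cleanset_id, include_project_details=True)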
method download_embeddings
download_embeddings(cleanset_id: 'str') → NDArray[float64]
Downloads feature embeddings for a cleanset (available only for text and image projects). These are numeric vectors produced via neural network representations of each data point in your dataset.
Args:
cleanset_id (str): the ID of the cleanset from which you want to download feature embeddings.
Returns:
np.NDArray[float64]: a 2D numpy array of feature embeddings of shape N by N_EMBED, where N is the number of rows in the original dataset, and N_EMBED is the dimension of the feature embeddings. The embedding dimension depends on which neural network is used to represent your data (Cleanlab automatically identifies the best type of neural network for your data).
For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Feature embeddings are not computed for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset. If you want to work with feature embeddings for an image project, the recommended workflow is as follows:

1. When the image project completes, download the cleanset via studio.download_cleanlab_columns, and check whether the is_not_analyzed boolean column has any True values.
2. If no rows are flagged as is_not_analyzed, all rows were processed successfully. In this case, the rows of the feature embeddings correspond to the rows of the original dataset, and downstream analysis can be carried out with no further preparation.
3. If there are rows flagged as is_not_analyzed, the rows of the feature embeddings correspond to the rows of the original dataset after filtering out the rows that were not analyzed.
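A sketch of this workflow in pandas (assuming an image project whose cleanset includes the is_not_analyzed column; df is a placeholder for your original dataset as a pandas DataFrame):

cols = studio.download_cleanlab_columns(cleanset_id)
embeddings = studio.download_embeddings(cleanset_id)

if cols["is_not_analyzed"].any():
    # Keep only the analyzed rows so they align with the embedding rows.
    df = df[~cols["is_not_analyzed"].values]
assert len(df) == embeddings.shape[0]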
method download_pred_probs
download_pred_probs(cleanset_id: 'str', keep_id: 'bool' = False) → DataFrame
Downloads predicted probabilities for a cleanset (only for classification datasets).
Args:
cleanset_id (str): the ID of the cleanset for which to download the corresponding predicted class probabilities.
keep_id (bool): whether to include the ID column in the returned DataFrame to enable easy join/merge operations with the original dataset.
Returns:
pd.DataFrame: a DataFrame of probabilities of shape N by M, where N is the number of rows in the original dataset, and M is the total number of classes in the original dataset. Every row of the returned DataFrame corresponds to the predicted probability of each class for the corresponding row in the original dataset. If keep_id is True, the DataFrame will include an extra ID column that can be used for database joins/merges with the original dataset or downloaded Cleanlab columns.
For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Predicted probabilities will not be calculated for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset.
If you want to work with predicted probabilities for an image project, the recommended workflow is to download probabilities with the option keep_id=True, and then do a join with the original dataset on the ID column. Alternatively, you can follow the steps here, and filter out the rows that were not analyzed. The filtered dataset will then have rows that align with the predicted probabilities DataFrame.
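A sketch of the ID-based join (the ID column name here is hypothetical; use whichever ID column your dataset has):

pred_probs = studio.download_pred_probs(cleanset_id, keep_id=True)
merged = df.merge(pred_probs, on="id")  # "id" is a hypothetical ID column name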
method get_enrichment_job_status
get_enrichment_job_status(job_id: 'str') → dict[str, Any]
Get the status of an enrichment job.
Args:
job_id (str): ID of the enrichment job.
Returns:
dict[str, Any]: A dictionary containing the status of the enrichment job.
method get_enrichment_project
get_enrichment_project(project_id: 'str') → EnrichmentProject
Get an EnrichmentProject object for a given Cleanlab Studio Enrichment Project’s ID.
Args:
project_id (str): ID of the enrichment project.
Returns:
EnrichmentProject: The EnrichmentProject object for the enrichment project.
method get_enrichment_projects
get_enrichment_projects() → List[EnrichmentProject]
Get a list of all EnrichmentProjects.
Returns:
List[EnrichmentProject]: A list of EnrichmentProject objects.
method get_latest_cleanset_id
get_latest_cleanset_id(project_id: 'str') → str
Gets latest cleanset ID for a project.
Args:
project_id: ID of project.
Returns: ID of latest associated cleanset.
method get_model
get_model(model_id: 'str') → Model
Gets a model that is deployed in a Cleanlab Studio account.
The returned model can then be used to predict labels for new data. See the documentation for the Model class for more on what you can do with a Model object.
Args:
model_id: ID of model to get. The model ID can be found in the “Model Details” tab of a model page.
Returns: Model instance, which exposes methods to predict labels for new data.
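A minimal sketch (new_data is a placeholder; see the Model class documentation for supported input formats and the exact predict signature):

model = studio.get_model("YOUR_MODEL_ID")
predictions = model.predict(new_data)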
method poll_dataset_id_for_name
poll_dataset_id_for_name(
dataset_name: 'str',
timeout: 'Optional[int]' = None
) → str
Polls for dataset ID for a dataset name.
Args:
dataset_name: Name of dataset to get ID for.
timeout: Optional timeout after which to stop polling for progress. If not provided, will block until dataset is ready.
Returns: ID of dataset.
Raises:
TimeoutError: if dataset is not ready by end of timeout
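For example (the dataset name is a placeholder):

dataset_id = studio.poll_dataset_id_for_name("my-dataset", timeout=600)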
method upload_dataset
upload_dataset(
dataset: 'Any',
dataset_name: 'Optional[str]' = None,
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset to Cleanlab Studio.
Args:
dataset: Object representing the dataset to upload. Currently supported formats include a str path to your dataset, or a pandas, Snowflake, or PySpark DataFrame.
dataset_name: Name for your dataset in Cleanlab Studio (optional if uploading from filepath).
schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
modality: [DEPRECATED] Optional parameter to override the modality of your dataset. If not provided, modality will be inferred.
id_column: [DEPRECATED] Optional parameter to override the ID column of your dataset. If not provided, a monotonically increasing ID column will be generated.
Returns: ID of uploaded dataset.
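For example, uploading a pandas DataFrame (the file path and dataset name are placeholders):

import pandas as pd

df = pd.read_csv("my_dataset.csv")
dataset_id = studio.upload_dataset(df, dataset_name="my-dataset")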
method upload_from_bigframe
upload_from_bigframe(
bigframe: 'Any',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset from a BigFrame to Cleanlab Studio.
Args:
bigframe: BigFrame object representing the dataset to upload.
schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
method upload_from_bigquery
upload_from_bigquery(
bigquery_project: 'str',
bigquery_dataset_id: 'str',
bigquery_table_id: 'str',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset from BigQuery to Cleanlab Studio.
Args:
bigquery_project: BigQuery project ID.
bigquery_dataset_id: BigQuery dataset ID.
bigquery_table_id: BigQuery table ID.
schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
method upload_from_url
upload_from_url(
url: 'str',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset from a URL to Cleanlab Studio.
Args:
url: URL of the dataset to upload.
schema_overrides: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
method wait_until_cleanset_ready
wait_until_cleanset_ready(
cleanset_id: 'str',
timeout: 'Optional[float]' = None,
show_cleanset_link: 'bool' = False
) → None
Blocks until a cleanset is ready or the timeout is reached.
Args:
cleanset_id (str): ID of cleanset to check status for.
timeout (Optional[float], optional): timeout for polling, in seconds. Defaults to None.
show_cleanset_link (bool, optional): whether to print a link to view the cleanset in the Cleanlab Studio web UI when the cleanset is ready. Defaults to False.
Raises:
TimeoutError: if cleanset is not ready by end of timeout
CleansetError: if cleanset errored while running
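For example, to block for up to an hour and print the cleanset link when it is ready:

studio.wait_until_cleanset_ready(cleanset_id, timeout=3600, show_cleanset_link=True)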
method wait_until_model_ready
wait_until_model_ready(
model_id: 'str',
timeout: 'Optional[float]' = None
) → None
Blocks until a model is ready or the timeout is reached.
Args:
model_id (str): ID of model to check status for.
timeout (Optional[float], optional): timeout for polling, in seconds. Defaults to None.
Raises:
TimeoutError: if model is not ready by end of timeout
DeploymentError: if model errored while training