Studio
module cleanlab_studio
Python API for Cleanlab Studio.
Some methods here apply to other Cleanlab products (e.g. TLM) that also rely on the Cleanlab Studio client.
class Studio
method __init__
__init__(api_key: 'Optional[str]')
Creates a Cleanlab Studio client.
Args:
api_key
: You can find your API key on your account page in Cleanlab Studio. Instead of specifying the API key here, you can also log in with cleanlab login on the command line.
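A minimal sketch of the two authentication routes described above. The helper name make_client and the environment variable CLEANLAB_API_KEY are illustrative assumptions, not part of the library. The client class is passed in as an argument so the helper does not depend on how you import it. Passing None is assumed to fall back to credentials stored by cleanlab login.

```python
import os
from typing import Any, Callable, Mapping, Optional


def make_client(
    studio_cls: Callable[[Optional[str]], Any],
    env: Mapping[str, str] = os.environ,
) -> Any:
    """Build a client from an API key in the environment, if one is set.

    CLEANLAB_API_KEY is a hypothetical variable name chosen for this sketch.
    When it is unset, None is passed, which relies on an earlier
    `cleanlab login` on the command line.
    """
    api_key = env.get("CLEANLAB_API_KEY")  # assumption: your own env var name
    return studio_cls(api_key)
```

With the real library this would be called as make_client(Studio) after from cleanlab_studio import Studio.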
method apply_corrections
apply_corrections(
cleanset_id: 'str',
dataset: 'Any',
keep_excluded: 'bool' = False
) → Any
Applies corrections from a Cleanlab Studio cleanset to your dataset. This function takes your local copy of the original dataset, along with the cleanset_id for the cleanset generated from this dataset in the Project web interface. It returns a copy of your original dataset in which the label column has been replaced with the corrected labels you selected (either manually or via auto-fix) in the Cleanlab Studio web interface, and from which the rows you marked as excluded have been removed. Corrections should have been made by viewing your Project in the Cleanlab Studio web interface (see Cleanlab Studio web quickstart).
The intended workflow is: create a Project, correct your Dataset automatically/manually in the web interface to generate a Cleanset (cleaned dataset), then call this function to make your original dataset locally look like the current Cleanset.
Args:
cleanset_id
: ID of cleanset to apply corrections from.
dataset
: Dataset to apply corrections to. Supported formats include pandas, snowpark, and pyspark DataFrame. Dataset should have the same number of rows as the dataset used to create the project. It should also contain a label column with the same name as the label column for the project.
keep_excluded
: Whether to retain rows with an “exclude” action. By default these rows will be removed from the dataset.
Returns: A copy of the dataset with corrections applied.
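The intended workflow above can be sketched as a small helper that looks up the latest cleanset for a project and applies its corrections to a local dataset. The helper name apply_latest_corrections is hypothetical, and studio is assumed to be an already constructed Studio client.

```python
from typing import Any


def apply_latest_corrections(
    studio: Any,
    project_id: str,
    dataset: Any,
    keep_excluded: bool = False,
) -> Any:
    """Return a corrected copy of `dataset` using the project's latest cleanset."""
    # Look up the most recent cleanset produced for this project.
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Corrected labels replace the label column; rows marked "exclude"
    # are dropped unless keep_excluded=True.
    return studio.apply_corrections(
        cleanset_id, dataset, keep_excluded=keep_excluded
    )
```

The dataset passed in should be the same local data that was uploaded to create the project, since the row count and label column name are expected to match.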
method create_enrichment_project
create_enrichment_project(name: 'str', dataset_id: 'str') → EnrichmentProject
Creates a Cleanlab Studio Enrichment Project.
Args:
name
(str): Name of the enrichment project to create.
dataset_id
(str): ID of dataset to be enriched.
Returns:
EnrichmentProject
: The EnrichmentProject object for the new enrichment project.
method create_project
create_project(
dataset_id: 'str',
project_name: 'str',
modality: "Literal['text', 'tabular', 'image']",
task_type: "Optional[Literal['multi-class', 'multi-label', 'regression', 'unsupervised']]" = 'multi-class',
model_type: "Literal['fast', 'regular']" = 'regular',
label_column: 'Optional[str]' = None,
feature_columns: 'Optional[List[str]]' = None,
text_column: 'Optional[str]' = None
) → str
Creates a Cleanlab Studio project.
Args:
dataset_id
: ID of dataset to create project for.
project_name
: Name for resulting project.
modality
: Modality of project (i.e. text, tabular, image).
task_type
: Type of ML task to perform. Select a supervised task type (i.e. “multi-class”, “multi-label”, “regression”) if your dataset has a label column you would like to predict values for or detect erroneous values in. Select “unsupervised” if your dataset has no specific label column. See the Projects Guide for more information on task types.
model_type
: Type of model to train (i.e. “fast”, “regular”). See the Projects Guide for more information on model types.
label_column
: Name of column in dataset containing labels (if not supplied, we’ll make our best guess). For “unsupervised” tasks, this should be None.
feature_columns
: List of columns to use as features for a tabular project. By default all columns are used as feature columns. This parameter is particularly useful if your dataset has a column containing unique IDs and you want to exclude that column from the feature columns.
text_column
: Name of column containing the text to train text modality project on (if not supplied and modality is “text” we’ll make our best guess).
Returns: ID of created project.
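As a rough sketch of how create_project fits with the other methods on this page, the helper below uploads a dataset, creates a text project, and waits for its cleanset. The helper name run_text_project is hypothetical, and the sketch assumes a cleanset ID is available from get_latest_cleanset_id immediately after the project is created.

```python
from typing import Any, Optional


def run_text_project(
    studio: Any,
    dataset: Any,
    dataset_name: str,
    project_name: str,
    label_column: Optional[str] = None,
    text_column: Optional[str] = None,
    timeout: Optional[float] = None,
) -> str:
    """Upload a dataset, start a multi-class text project, return its cleanset ID."""
    dataset_id = studio.upload_dataset(dataset, dataset_name=dataset_name)
    project_id = studio.create_project(
        dataset_id=dataset_id,
        project_name=project_name,
        modality="text",
        task_type="multi-class",
        model_type="regular",
        label_column=label_column,  # None lets Cleanlab Studio guess
        text_column=text_column,    # None lets Cleanlab Studio guess
    )
    # Assumption: the new project's cleanset can be looked up right away.
    cleanset_id = studio.get_latest_cleanset_id(project_id)
    # Blocks until the cleanset is ready; errors propagate to the caller.
    studio.wait_until_cleanset_ready(cleanset_id, timeout=timeout)
    return cleanset_id
```

For a tabular project, modality would be "tabular" and feature_columns could be passed to leave out an ID column.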
method delete_dataset
delete_dataset(dataset_id: 'str') → None
Deletes a dataset from Cleanlab Studio.
If the dataset is used in projects, the projects will be deleted as well.
method delete_enrichment_project
delete_enrichment_project(project_id: 'str') → None
Deletes an Enrichment Project from Cleanlab Studio.
Args:
project_id
: ID of enrichment project to delete.
method delete_project
delete_project(project_id: 'str') → None
Deletes a project from Cleanlab Studio.
Args:
project_id
: ID of project to delete.
method deploy_model
deploy_model(cleanset_id: 'str', model_name: 'str') → str
Trains and deploys a model with an improved dataset created by applying any corrections you’ve made to your cleanset in Cleanlab Studio.
Args:
cleanset_id
: ID of cleanset to deploy model for.
model_name
: Name for resulting model.
method download_cleanlab_columns
download_cleanlab_columns(
cleanset_id: 'str',
include_cleanlab_columns: 'bool' = True,
include_project_details: 'bool' = False,
to_spark: 'bool' = False
) → Any
Downloads Cleanlab columns for a cleanset.
Args:
cleanset_id
: ID of cleanset to download columns from. To obtain the cleanset ID from a project ID, use get_latest_cleanset_id.
include_cleanlab_columns
: Whether to download all Cleanlab columns or just the clean_label column.
include_project_details
: Whether to download columns related to project status such as resolved rows, actions taken, etc.
Returns:
A pandas or pyspark DataFrame. Type is Any to avoid requiring pyspark installation.
method download_embeddings
download_embeddings(cleanset_id: 'str') → NDArray[float64]
Downloads feature embeddings for a cleanset (available only for text and image projects). These are numeric vectors produced via neural network representations of each data point in your dataset.
Args:
cleanset_id
(str): the ID of the cleanset from which you want to download feature embeddings.
Returns:
np.NDArray[float64]
: a 2D numpy array of feature embeddings of shape N by N_EMBED, where N is the number of rows in the original dataset, and N_EMBED is the dimension of the feature embeddings. The embedding dimension depends on which neural network is used to represent your data (Cleanlab automatically identifies the best type of neural network for your data).
For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Feature embeddings are not computed for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset. If you want to work with feature embeddings for an image project, the recommended workflow is as follows:
1. When the image project completes, download the cleanset via studio.download_cleanlab_columns, and check whether the is_not_analyzed boolean column has any True values.
2. If no rows are flagged as is_not_analyzed, it means that all the rows were processed successfully. In this case, the rows of the feature embeddings will correspond to the rows of the original dataset, and downstream analysis can be carried out with no further preparation.
3. If there are rows flagged as is_not_analyzed, the rows of the feature embeddings will correspond to the rows of the original dataset after filtering out the rows that are not analyzed.
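The row alignment described above can be illustrated with a small pure-Python helper. It takes the is_not_analyzed flags in original row order and returns, for each embedding row, the position of the matching row in the original dataset. The helper name is hypothetical, and the sketch assumes the flags are listed in the same order as the original dataset.

```python
from typing import Iterable, List


def analyzed_row_indices(is_not_analyzed: Iterable[bool]) -> List[int]:
    """Map each embedding row to its row position in the original dataset.

    Entry k of the result is the original row position that embedding row k
    belongs to, assuming embeddings keep the original row order and simply
    skip rows flagged as not analyzed.
    """
    return [i for i, flag in enumerate(is_not_analyzed) if not flag]
```

With real data, the flags would come from the is_not_analyzed column returned by studio.download_cleanlab_columns. A useful sanity check is that the length of the returned list equals the number of rows in the embeddings array.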
method download_pred_probs
download_pred_probs(cleanset_id: 'str', keep_id: 'bool' = False) → DataFrame
Downloads predicted probabilities for a cleanset (only for classification datasets).
Args:
cleanset_id
(str): the ID of the cleanset for which to download the corresponding predicted class probabilities.
keep_id
(bool): whether to include the ID column in the returned DataFrame to enable easy join/merge operations with original dataset.
Returns:
pd.DataFrame
: a DataFrame of probabilities of shape N by M, where N is the number of rows in the original dataset, and M is the total number of classes in the original dataset. Every row of the returned DataFrame corresponds to the predicted probability of each class for the corresponding row in the original dataset. If keep_id is True, the DataFrame will include an extra ID column that can be used for database joins/merges with the original dataset or downloaded Cleanlab columns.
For image projects, a few images in the original dataset might fail to be processed due to poorly formatted data or invalid image file paths. Predicted probabilities will not be calculated for those rows. The rows in the original dataset that failed to be processed are marked as True in the is_not_analyzed Cleanlab column of the cleanset.
If you want to work with predicted probabilities for an image project, the recommended workflow is to download probabilities with the option keep_id=True, and then join them with the original dataset on the ID column. Alternatively, you can follow the steps described under download_embeddings and filter out the rows that were not analyzed. The filtered dataset will then have rows that align with the predicted probabilities DataFrame.
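A minimal sketch of the recommended join, assuming the original dataset is a pandas DataFrame. The helper name is hypothetical, and the name of the ID column is not stated on this page, so it is taken as a parameter that you supply.

```python
from typing import Any


def pred_probs_with_original(
    studio: Any,
    cleanset_id: str,
    original: Any,
    id_column: str,
) -> Any:
    """Join predicted probabilities onto the original rows via the ID column."""
    # keep_id=True adds the ID column needed for the join.
    pred_probs = studio.download_pred_probs(cleanset_id, keep_id=True)
    # An inner join keeps only rows that have predicted probabilities,
    # so rows that were not analyzed drop out of the result.
    # Assumption: id_column names the ID column in both DataFrames.
    return original.merge(pred_probs, on=id_column, how="inner")
```

The inner join is a design choice for this sketch. A left join would keep the rows that were not analyzed, with missing values in the probability columns.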
method get_enrichment_job_status
get_enrichment_job_status(job_id: 'str') → dict[str, Any]
Get the status of an enrichment job.
Args:
job_id
(str): ID of the enrichment job.
Returns:
dict[str, Any]
: A dictionary containing the status of the enrichment job.
method get_enrichment_project
get_enrichment_project(project_id: 'str') → EnrichmentProject
Get an EnrichmentProject object for a given Cleanlab Studio Enrichment Project’s ID.
Args:
project_id
(str): ID of the enrichment project.
Returns:
EnrichmentProject
: The EnrichmentProject object for the enrichment project.
method get_enrichment_projects
get_enrichment_projects() → List[EnrichmentProject]
Get a list of all EnrichmentProjects.
Returns:
List[EnrichmentProject]
: A list of EnrichmentProject objects.
method get_latest_cleanset_id
get_latest_cleanset_id(project_id: 'str') → str
Gets latest cleanset ID for a project.
Args:
project_id
: ID of project.
Returns: ID of latest associated cleanset.
method get_model
get_model(model_id: 'str') → Model
Gets a model that is deployed in a Cleanlab Studio account.
The returned model can then be used to predict labels for new data. See the documentation for the Model class for more on what you can do with a Model object.
Args:
model_id
: ID of model to get. The model ID can be found in the “Model Details” tab of a model page.
Returns: Model instance, which exposes methods to predict labels for new data.
method poll_dataset_id_for_name
poll_dataset_id_for_name(
dataset_name: 'str',
timeout: 'Optional[int]' = None
) → str
Polls for dataset ID for a dataset name.
Args:
dataset_name
: Name of dataset to get ID for.
timeout
: Optional timeout after which to stop polling for progress. If not provided, will block until dataset is ready.
Returns: ID of dataset.
Raises:
TimeoutError
: if dataset is not ready by end of timeout
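One way to handle the timeout is to turn it into a None result, as in the sketch below. The helper name and the default of 60 are arbitrary choices for illustration. The timeout is assumed to be in seconds, and the exception is assumed to be Python's built-in TimeoutError.

```python
from typing import Any, Optional


def find_dataset_id(
    studio: Any, dataset_name: str, timeout: int = 60
) -> Optional[str]:
    """Return the dataset ID, or None if it is not ready within `timeout`."""
    try:
        return studio.poll_dataset_id_for_name(dataset_name, timeout=timeout)
    except TimeoutError:
        # The dataset was not ready in time; let the caller decide what to do.
        return None
```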
method upload_dataset
upload_dataset(
dataset: 'Any',
dataset_name: 'Optional[str]' = None,
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset to Cleanlab Studio.
Args:
dataset
: Object representing the dataset to upload. Currently supported formats include a str path to your dataset, or a pandas, snowflake, or pyspark DataFrame.
dataset_name
: Name for your dataset in Cleanlab Studio (optional if uploading from filepath).
schema_overrides
: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
modality
: [DEPRECATED] Optional parameter to override the modality of your dataset. If not provided, modality will be inferred.
id_column
: [DEPRECATED] Optional parameter to override the ID column of your dataset. If not provided, a monotonically increasing ID column will be generated.
Returns: ID of uploaded dataset.
method upload_from_bigframe
upload_from_bigframe(
bigframe: 'Any',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset, from a BigFrame, to Cleanlab Studio.
Args:
bigframe
: BigFrame object representing the dataset to upload.
schema_overrides
: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
method upload_from_bigquery
upload_from_bigquery(
bigquery_project: 'str',
bigquery_dataset_id: 'str',
bigquery_table_id: 'str',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset, from BigQuery, to Cleanlab Studio.
Args:
bigquery_project
: BigQuery project ID.
bigquery_dataset_id
: BigQuery dataset ID.
bigquery_table_id
: BigQuery table ID.
schema_overrides
: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
method upload_from_url
upload_from_url(
url: 'str',
schema_overrides: 'Optional[List[SchemaOverride]]' = None,
**kwargs: 'Any'
) → str
Uploads a dataset, from URL, to Cleanlab Studio.
Args:
url
: URL to the dataset to upload.
schema_overrides
: Optional list of overrides you would like to make to the schema of your dataset. If not provided, all columns will be untyped. Format defined here.
method wait_until_cleanset_ready
wait_until_cleanset_ready(
cleanset_id: 'str',
timeout: 'Optional[float]' = None,
show_cleanset_link: 'bool' = False
) → None
Blocks until a cleanset is ready or the timeout is reached.
Args:
cleanset_id
(str): ID of cleanset to check status for.
timeout
(Optional[float], optional): timeout for polling, in seconds. Defaults to None.
show_cleanset_link
(bool, optional): whether to print a link to view the cleanset in the Cleanlab Studio web UI when the cleanset is ready. Defaults to False.
Raises:
TimeoutError
: if cleanset is not ready by end of timeout
CleansetError
: if cleanset errored while running
method wait_until_model_ready
wait_until_model_ready(
model_id: 'str',
timeout: 'Optional[float]' = None
) → None
Blocks until a model is ready or the timeout is reached.
Args:
model_id
(str): ID of model to check status for.
timeout
(Optional[float], optional): timeout for polling, in seconds. Defaults to None.
Raises:
TimeoutError
: if model is not ready by end of timeout
DeploymentError
: if model errored while training
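The model-related methods on this page (deploy_model, wait_until_model_ready, get_model) can be chained as in the sketch below. The helper name is hypothetical, and the sketch assumes the string returned by deploy_model is the model ID, which this page does not state explicitly.

```python
from typing import Any, Optional


def deploy_and_fetch_model(
    studio: Any,
    cleanset_id: str,
    model_name: str,
    timeout: Optional[float] = None,
) -> Any:
    """Deploy a model for a cleanset, wait for it, and return the Model object."""
    # Assumption: the string returned by deploy_model is the model ID.
    model_id = studio.deploy_model(cleanset_id, model_name)
    # Blocks until the model is ready; TimeoutError and DeploymentError
    # are left to propagate to the caller.
    studio.wait_until_model_ready(model_id, timeout=timeout)
    return studio.get_model(model_id)
```

The returned object can then be used to predict labels for new data, as described under get_model.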
method TLM
TLM(
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLM
Instantiate a Trustworthy Language Model (TLM). For more details, see the documentation of: cleanlab_studio.studio.trustworthy_language_model.TLM
method TLMCalibrated
TLMCalibrated(
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLMCalibrated
Instantiate a version of the Trustworthy Language Model that you can calibrate using existing ratings for example prompt-response pairs. For more details, see the documentation of: cleanlab_studio.utils.tlm_calibrated.TLMCalibrated
method TLMLite
TLMLite(
response_model: 'str' = 'gpt-4o',
quality_preset: 'TLMQualityPreset' = 'medium',
options: 'Optional[TLMOptions]' = None,
timeout: 'Optional[float]' = None,
verbose: 'Optional[bool]' = None
) → TLMLite
Instantiate a version of the Trustworthy Language Model that uses one model for the response and another for trustworthiness scoring (to reduce costs and latency without reducing response quality). For more details, see the documentation of: cleanlab_studio.utils.tlm_lite.TLMLite