Enrichment

module cleanlab_studio.studio.enrichment

Methods for interfacing with Enrichment Projects.

This module is not meant to be imported and used directly. Instead, use Studio.get_enrichment_project() to instantiate an EnrichmentProject object.

Global Variables

  • ROW_ID_COLUMN_NAME
  • REGEX_PARAMETER_ERROR_MESSAGE
  • CLEANLAB_ROW_ID_COLUMN_NAME
  • CHECK_READY_INTERVAL

class EnrichmentJobStatusEnum

An enumeration.


class EnrichmentProject

Represents an Enrichment Project instance, which is bound to a Cleanlab Studio account.

EnrichmentProjects should be instantiated using the Studio.get_enrichment_project() method.

method __init__

__init__(
api_key: 'str',
id: 'str',
name: 'str',
created_at: 'Optional[Union[str, datetime]]' = None
) → None

Initialize an EnrichmentProject.

Objects of this class are not meant to be constructed directly. Instead, use Studio.get_enrichment_project().


property created_at

(datetime.datetime) When the Enrichment Project was created.


property id

(str) ID of the Enrichment Project.


property name

(str) Name of the Enrichment Project.


property ready

Check if the latest enrichment job is ready or not.

If a preview was run after the most recent full run, accessing this property raises an error, because the latest job is then a preview.


property updated_at

(datetime.datetime) When the Enrichment Project was last updated.


method download_results

download_results(
job_id: 'Optional[str]' = None,
include_original_dataset: 'Optional[bool]' = False
) → EnrichmentResults

Retrieve the results of an enrichment job.

This method fetches the results of a specified enrichment job. If no job_id is provided, it will default to retrieving the results of the latest job.

Args:

  • job_id (str, optional): The ID of the job to retrieve results from. If not provided, the latest job will be used.
  • include_original_dataset (bool, optional): If True, the original dataset will be included in the returned results. Defaults to False.

method export_results_as_csv

export_results_as_csv(job_id: 'Optional[str]' = None) → None

Export the results of a job as a CSV file.


method list_all_jobs

list_all_jobs() → List[EnrichmentJob]

List all jobs in the project.


method pause

pause() → None

Pause the latest batch job.


method preview

preview(
options: 'EnrichmentOptions',
new_column_name: 'str',
indices: 'Optional[List[int]]' = None
) → EnrichmentPreviewResults

Enrich a subset of data for a preview.

Args:

  • options (EnrichmentOptions): Options for enriching the dataset.
  • new_column_name (str): The name of the new column to store the prompt results.
  • indices (List[int], optional): The indices of the rows to enrich, up to 10. If None, three rows in the dataset will be randomly picked.

method resume

resume() → JSONDict

Resume the latest batch job.


method run

run(options: 'EnrichmentOptions', new_column_name: 'str') → dict[str, Any]

Enrich the entire dataset using the provided prompt.

This method triggers a remote job that applies TLM to each row of the dataset based on the given prompt. The process will run on a remote server and will block execution until the job is fully completed.

Args:

  • options (EnrichmentOptions): Options for enriching the dataset.
  • new_column_name (str): The name of the new column to store the prompt results.
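
As a rough illustration of how preview(), run(), and download_results() might be combined, the following sketch previews a few rows, enriches the full dataset, and then fetches the latest results. The helper name preview_then_run and its preview_indices parameter are hypothetical; project is assumed to be an EnrichmentProject obtained from Studio.get_enrichment_project(), and options an EnrichmentOptions instance.

```python
from typing import Any, List, Optional, Tuple


def preview_then_run(
    project: Any,
    options: Any,
    new_column_name: str,
    preview_indices: Optional[List[int]] = None,
) -> Tuple[Any, Any]:
    """Preview a few rows, enrich the whole dataset, then fetch the results.

    Hypothetical helper: it only calls methods documented on this page.
    """
    # preview() is documented as accepting up to 10 row indices.
    if preview_indices is not None and len(preview_indices) > 10:
        raise ValueError("preview accepts at most 10 row indices")

    # Enrich a small subset first to sanity-check the prompt.
    preview_results = project.preview(
        options, new_column_name, indices=preview_indices
    )

    # Enrich the full dataset; run() is described as blocking until completion.
    project.run(options, new_column_name)

    # With no job_id, download_results() uses the latest job (the run above).
    results = project.download_results(include_original_dataset=True)
    return preview_results, results
```

Running the full job after the preview also means the latest job is no longer a preview, which matters for the ready property described earlier.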

method show_trustworthiness_score_history

show_trustworthiness_score_history() → None

Show the trustworthiness score history of all jobs in the project.


method to_dict

to_dict() → Dict[str, Any]

Returns a dictionary of EnrichmentProject metadata.


method wait_until_ready

wait_until_ready() → None

Wait until the latest enrichment job is ready.
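
The waiting behaviour can be pictured as a simple polling loop over the ready property. The sketch below is not the library's implementation: the function name poll_until_ready, the interval_seconds default, and the timeout_seconds parameter are all assumptions made for illustration (the module does list a CHECK_READY_INTERVAL global, but its value is not given here).

```python
import time
from typing import Any, Optional


def poll_until_ready(
    project: Any,
    interval_seconds: float = 5.0,
    timeout_seconds: Optional[float] = None,
) -> None:
    """Block until `project.ready` is truthy, checking at a fixed interval.

    Hypothetical helper; `project` is assumed to expose a `ready` property.
    """
    deadline = (
        None if timeout_seconds is None else time.monotonic() + timeout_seconds
    )
    # `ready` is re-evaluated on every pass. Per the description of `ready`,
    # it may raise if the latest job is a preview; that error propagates.
    while not project.ready:
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("enrichment job was not ready before the timeout")
        time.sleep(interval_seconds)
```

A timeout is included here only so that a stalled job does not block the caller forever; the documented wait_until_ready() takes no arguments.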


class EnrichmentJob

Represents an Enrichment Job instance.

This class is not meant to be constructed directly. Instead, use the EnrichmentProject methods to create and manage Enrichment Jobs.


class EnrichmentOptions

Options for enriching a dataset with a Trustworthy Language Model (TLM).

Args:

  • prompt (str): A string in string.Template format that contains both the prompt text and the names of the columns to embed.

  • `Example`: “Is this a numeric value, answer Yes or No only. Value: ${column_name}”

  • constrain_outputs (List[str], optional): List of all possible output values for the metadata column. If specified, every entry in the metadata column will exactly match one of these values (for less open-ended data enrichment tasks). If None, the metadata column can contain arbitrary values (for more open-ended data enrichment tasks). There may be additional transformations applied to ensure the returned value is one of these. If regex is also specified, then these transformations occur after your regex is applied. If optimize_prompt is True, the prompt will be automatically adjusted to include a statement that the response must match one of the constrain_outputs.

  • optimize_prompt (bool, default = True): When False, your provided prompt will not be modified in any way. When True, your provided prompt may be automatically adjusted in an effort to produce better results.

  • For instance, if constrain_outputs is specified, we may automatically append the following statement to your prompt: “Your answer must exactly match one of the following values: constrain_outputs.”

  • quality_preset (TLMQualityPreset, default = “medium”): The quality preset of the Trustworthy Language Model (TLM) to use for data enrichment.

  • regex (str | Replacement | List[Replacement], optional): A string, tuple, or list of tuples specifying regular expressions to apply for post-processing the raw LLM outputs.

    If a string is passed in, a regex match is performed and the matched pattern is returned (None is returned if the pattern cannot be matched). Specifically, the provided string is passed to Python’s re.match() method.

    Pass in a tuple (R1, R2) instead to perform find-and-replace operations rather than matching/extraction. R1 should be a string containing the regex pattern to match in the raw LLM response, and R2 should be the string to replace matches with. Pass in a list of tuples to apply multiple replacements; they are applied in the order they appear in the list. Note that you cannot pass in a list of strings (chaining multiple regex processing steps is only allowed for replacement operations).

    This regex processing is useful in settings where you cannot prompt the LLM to generate valid outputs 100% of the time, but can easily transform the raw LLM outputs to be valid through regular expressions that extract and replace parts of the raw output string. When a regex is applied, the processed results appear in the {new_column_name} column, and the raw outputs (before any regex processing) are saved in the {new_column_name}_log column of the results dataframe.

  • `Example 1`: regex = '.*The answer is: (Bird|[Rr]abbit).*' will extract strings that are the words ‘Bird’, ‘Rabbit’ or ‘rabbit’ after the characters “The answer is: ” from the raw response.
  • `Example 2`: regex = [('True', 'T'), ('False', 'F')] will replace the words True and False with T and F.
  • `Example 3`: regex = (' Explanation:.*', '') will remove everything after and including the word “Explanation:”.
  • For instance, the response “True. Explanation: 3+4=7, and 7 is an odd number.” would return “True.” after the regex replacement.
  • tlm_options (TLMOptions, default = {}): Options for the Trustworthy Language Model (TLM) to use for data enrichment.
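
To make the prompt and regex arguments more concrete, here is a minimal local sketch that uses only Python's standard library. It is not the library's implementation: fill_prompt and apply_regex are hypothetical helper names, the Replacement alias is defined here as a (pattern, replacement) tuple, and returning the first capture group of a string regex is an assumption inferred from Example 1.

```python
import re
from string import Template
from typing import List, Optional, Tuple, Union

# Alias defined for this sketch: (regex pattern, replacement string).
Replacement = Tuple[str, str]


def fill_prompt(prompt: str, row: dict) -> str:
    """Embed column values from `row` into a string.Template prompt."""
    return Template(prompt).substitute(row)


def apply_regex(
    raw: str, regex: Union[str, Replacement, List[Replacement]]
) -> Optional[str]:
    """Post-process a raw LLM output with a match or replacement regex."""
    if isinstance(regex, str):
        # String form: extraction via re.match(); None if nothing matches.
        match = re.match(regex, raw)
        if match is None:
            return None
        # Assumption: return the first capture group when the pattern has
        # one, otherwise the whole matched text.
        return match.group(1) if match.groups() else match.group(0)

    # Tuple form is a single replacement; list form is several, in order.
    replacements = [regex] if isinstance(regex, tuple) else list(regex)
    result = raw
    for pattern, replacement in replacements:
        result = re.sub(pattern, replacement, result)
    return result


prompt = "Is this a numeric value, answer Yes or No only. Value: ${column_name}"
filled = fill_prompt(prompt, {"column_name": "42"})

extracted = apply_regex(
    "I checked. The answer is: rabbit.", ".*The answer is: (Bird|[Rr]abbit).*"
)
shortened = apply_regex("True or False", [("True", "T"), ("False", "F")])
trimmed = apply_regex(
    "True. Explanation: 3+4=7, and 7 is an odd number.", (" Explanation:.*", "")
)
```

Under these assumptions, the three calls at the end mirror Examples 1 to 3 above: extraction of the matched answer, ordered replacements, and removal of the trailing explanation.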

class EnrichmentResults

Enrichment result.

method __init__

__init__(results: 'DataFrame')

method details

details() → DataFrame

classmethod from_dataframe

from_dataframe(df: 'DataFrame') → EnrichmentResults

classmethod from_dict

from_dict(
json_dict: 'List[JSONDict]',
include_original_dataset: 'Optional[bool]' = False
) → EnrichmentResults

method join

join(original_data: 'DataFrame', with_details: 'bool' = False) → DataFrame

class EnrichmentPreviewResults

Enrichment preview results.

method __init__

__init__(results: 'DataFrame')

method details

details() → DataFrame

classmethod from_dataframe

from_dataframe(df: 'DataFrame') → EnrichmentResults

classmethod from_dict

from_dict(
json_dict: 'List[JSONDict]',
include_original_dataset: 'Optional[bool]' = False
) → EnrichmentPreviewResults

method join

join(original_data: 'DataFrame', with_details: 'bool' = False) → DataFrame

Join the original data with the enrichment results. The result only contains those rows that were enriched by preview.

Args:

  • original_data (pd.DataFrame): The original data to join with the enrichment results.
  • with_details (bool): If with_details is True, the details of the enrichment results will be included in the output DataFrame.