Enrichment
module cleanlab_studio.studio.enrichment
Methods for interfacing with Enrichment Projects.
This module is not meant to be imported and used directly. Instead, use Studio.get_enrichment_project()
to instantiate an EnrichmentProject object.
Global Variables
- ROW_ID_COLUMN_NAME
- REGEX_PARAMETER_ERROR_MESSAGE
- CLEANLAB_ROW_ID_COLUMN_NAME
- CHECK_READY_INTERVAL
class EnrichmentJobStatusEnum
An enumeration.
class EnrichmentProject
Represents an Enrichment Project instance, which is bound to a Cleanlab Studio account.
EnrichmentProjects should be instantiated using the Studio.get_enrichment_project()
method.
method __init__
__init__(
api_key: 'str',
id: 'str',
name: 'str',
created_at: 'Optional[Union[str, datetime]]' = None
) → None
Initialize an EnrichmentProject.
Objects of this class are not meant to be constructed directly. Instead, use Studio.get_enrichment_project().
property created_at
(datetime.datetime) When the Enrichment Project was created.
property id
(str) ID of the Enrichment Project.
property name
(str) Name of the Enrichment Project.
property ready
Check if the latest enrichment job is ready or not.
If a preview was run after the last full run, accessing this property raises an error, because the latest job is then a preview.
property updated_at
(datetime.datetime) When the Enrichment Project was last updated.
method download_results
download_results(
job_id: 'Optional[str]' = None,
include_original_dataset: 'Optional[bool]' = False
) → EnrichmentResults
Retrieve the results of an enrichment job.
This method fetches the results of a specified enrichment job. If no job_id
is provided, it will default to retrieving the results of the latest job.
Args:
job_id
(str, optional): The ID of the job to retrieve results from. If not provided, the latest job will be used.
include_original_dataset
(bool, optional): If True, the original dataset will be included in the returned results. Defaults to False.
method export_results_as_csv
export_results_as_csv(job_id: 'Optional[str]' = None) → None
Download the results of a job.
method list_all_jobs
list_all_jobs() → List[EnrichmentJob]
List all jobs in the project.
method pause
pause() → None
Pause the latest batch job.
method preview
preview(
options: 'EnrichmentOptions',
new_column_name: 'str',
indices: 'Optional[List[int]]' = None
) → EnrichmentPreviewResults
Enrich a subset of data for a preview.
Args:
options
(EnrichmentOptions): Options for enriching the dataset.
new_column_name
(str): The name of the new column to store the prompt results.
indices
(List[int], optional): The indices of the rows to enrich, up to 10. If None, three rows in the dataset will be randomly picked.
method resume
resume() → JSONDict
Resume the latest batch job.
method run
run(options: 'EnrichmentOptions', new_column_name: 'str') → dict[str, Any]
Enrich the entire dataset using the provided prompt.
This method triggers a remote job that applies TLM to each row of the dataset based on the given prompt. The process will run on a remote server and will block execution until the job is fully completed.
Args:
options
(EnrichmentOptions): Options for enriching the dataset.
new_column_name
(str): The name of the new column to store the prompt results.
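The methods documented above suggest a preview-then-run workflow. The following is a minimal sketch of that workflow, not the library's own code: the helper name `preview_then_run` is hypothetical, and `project` and `options` are assumed to be an EnrichmentProject and an EnrichmentOptions obtained elsewhere. Only the documented `preview()`, `run()`, and `download_results()` signatures are used.

```python
from typing import Any, List, Optional, Tuple


def preview_then_run(
    project: Any,
    options: Any,
    new_column_name: str,
    preview_indices: Optional[List[int]] = None,
) -> Tuple[Any, Any]:
    """Hypothetical helper: preview a few rows, then enrich the full dataset.

    `project` is assumed to be an EnrichmentProject and `options` an
    EnrichmentOptions. Both are typed as Any so that this sketch needs
    nothing beyond the standard library.
    """
    # preview() is documented to accept at most 10 row indices.
    if preview_indices is not None and len(preview_indices) > 10:
        raise ValueError("preview accepts at most 10 row indices")

    # Enrich a small subset first to sanity-check the prompt.
    preview_results = project.preview(
        options, new_column_name, indices=preview_indices
    )

    # run() is documented to block until the remote job has completed.
    project.run(options, new_column_name)

    # With no job_id, download_results() falls back to the latest job,
    # which at this point is the full run started above.
    results = project.download_results(include_original_dataset=True)
    return preview_results, results
```

In practice one would likely inspect the preview results before deciding to start the full run; the sketch chains the calls only to show their order.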
method show_trustworthiness_score_history
show_trustworthiness_score_history() → None
Show the trustworthiness score history of all jobs in the project.
method to_dict
to_dict() → Dict[str, Any]
Returns a dictionary of EnrichmentProject metadata.
method wait_until_ready
wait_until_ready() → None
Wait until the latest enrichment job is ready.
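Waiting for readiness is typically a polling loop. The sketch below illustrates the general pattern only and is not the library's implementation: the function `wait_until`, its `interval` and `timeout` parameters, and the timeout behavior are all assumptions added for illustration.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    interval: float = 1.0,
    timeout: float = 60.0,
) -> None:
    """Hypothetical polling helper: block until is_ready() returns True.

    Raises TimeoutError if the condition is still False once `timeout`
    seconds have elapsed. The timeout is an addition of this sketch.
    """
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met before the timeout")
        # Sleep between checks so the remote service is not polled in a tight loop.
        time.sleep(interval)
```

With an EnrichmentProject named `project`, this could be called as `wait_until(lambda: project.ready, interval=10.0)`. Note that `ready` is documented to raise an error when the latest job is a preview.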
class EnrichmentJob
Represents an Enrichment Job instance.
This class is not meant to be constructed directly. Instead, use the EnrichmentProject
methods to create and manage Enrichment Jobs.
class EnrichmentOptions
Options for enriching a dataset with a Trustworthy Language Model (TLM).
Args:
prompt
(str): A string.Template string that contains both the prompt and the names of the columns to embed. `Example`: “Is this a numeric value, answer Yes or No only. Value: ${column_name}”
constrain_outputs
(List[str], optional): List of all possible output values for the metadata column. If specified, every entry in the metadata column will exactly match one of these values (for less open-ended data enrichment tasks). If None, the metadata column can contain arbitrary values (for more open-ended data enrichment tasks). There may be additional transformations applied to ensure the returned value is one of these. If regex is also specified, then these transformations occur after your regex is applied. If optimize_prompt is True, the prompt will be automatically adjusted to include a statement that the response must match one of the constrain_outputs.
optimize_prompt
(bool, default = True): When False, your provided prompt will not be modified in any way. When True, your provided prompt may be automatically adjusted in an effort to produce better results. For instance, if constrain_outputs is specified, we may automatically append the following statement to your prompt: “Your answer must exactly match one of the following values: constrain_outputs.”
quality_preset
(TLMQualityPreset, default = “medium”): The quality preset of the Trustworthy Language Model (TLM) to use for data enrichment.
regex
(str | Replacement | List[Replacement], optional): A string, tuple, or list of tuples specifying regular expressions to apply for post-processing the raw LLM outputs. If a string value is passed in, a regex match will be performed and the matched pattern will be returned (if the pattern cannot be matched, None will be returned). Specifically, the provided string will be passed into Python’s re.match() method. Pass in a tuple (R1, R2) instead if you wish to perform find and replace operations rather than matching/extraction. R1 should be a string containing the regex pattern to match, and R2 should be a string to replace matches with. Pass in a list of tuples instead if you wish to apply multiple replacements. Replacements will be applied in the order they appear in the list. Note that you cannot pass in a list of strings (chaining of multiple regex processing steps is only allowed for replacement operations). These tuples specify the desired patterns to match and replace from the raw LLM response. This regex processing is useful in settings where you are unable to prompt the LLM to generate valid outputs 100% of the time, but can easily transform the raw LLM outputs to be valid through regular expressions that extract and replace parts of the raw output string. When this regex is applied, the processed results can be seen in the {new_column_name} column, and the raw outputs (before any regex processing) will be saved in the {new_column_name}_log column of the results dataframe.
- `Example 1`: regex = '.*The answer is: (Bird|[Rr]abbit).*' will extract strings that are the words ‘Bird’, ‘Rabbit’ or ‘rabbit’ after the characters “The answer is: ” from the raw response.
- `Example 2`: regex = [('True', 'T'), ('False', 'F')] will replace the words True and False with T and F.
- `Example 3`: regex = (' Explanation:.*', '') will remove everything after and including the words “Explanation:”. For instance, the response “True. Explanation: 3+4=7, and 7 is an odd number.” would return “True.” after the regex replacement.
tlm_options
(TLMOptions, default = {}): Options for the Trustworthy Language Model (TLM) to use for data enrichment.
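The prompt templating and regex post-processing described above can be tried locally with the standard library. This is a minimal sketch under stated assumptions, not the library's implementation: `render_prompt` and `apply_regex` are hypothetical helpers, and returning the first capture group for a string regex is an assumption inferred from Example 1.

```python
import re
from string import Template
from typing import Dict, List, Optional, Tuple, Union

Replacement = Tuple[str, str]


def render_prompt(prompt: str, row: Dict[str, str]) -> str:
    """Fill ${column_name} placeholders in the prompt with one row's values."""
    return Template(prompt).substitute(row)


def apply_regex(
    raw: str, regex: Union[str, Replacement, List[Replacement]]
) -> Optional[str]:
    """Hypothetical post-processing of one raw response."""
    if isinstance(regex, str):
        # A string is used for matching/extraction via re.match().
        match = re.match(regex, raw)
        if match is None:
            return None
        # Assumption: return the first capture group when the pattern has
        # one, otherwise the whole matched text.
        return match.group(1) if match.groups() else match.group(0)
    # A tuple or a list of tuples means find-and-replace, in list order.
    replacements = [regex] if isinstance(regex, tuple) else regex
    for pattern, replacement in replacements:
        raw = re.sub(pattern, replacement, raw)
    return raw


prompt = "Is this a numeric value, answer Yes or No only. Value: ${column_name}"
print(render_prompt(prompt, {"column_name": "42"}))
# Is this a numeric value, answer Yes or No only. Value: 42

print(apply_regex("Hmm. The answer is: rabbit.", ".*The answer is: (Bird|[Rr]abbit).*"))
# rabbit

print(apply_regex("False", [("True", "T"), ("False", "F")]))
# F

print(apply_regex("True. Explanation: 3+4=7, and 7 is an odd number.", (" Explanation:.*", "")))
# True.
```

Testing a pattern this way against a few representative raw responses can catch regex mistakes before a full enrichment run.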
class EnrichmentResults
Enrichment result.
method __init__
__init__(results: 'DataFrame')
method details
details() → DataFrame
classmethod from_dataframe
from_dataframe(df: 'DataFrame') → EnrichmentResults
classmethod from_dict
from_dict(
json_dict: 'List[JSONDict]',
include_original_dataset: 'Optional[bool]' = False
) → EnrichmentResults
method join
join(original_data: 'DataFrame', with_details: 'bool' = False) → DataFrame
class EnrichmentPreviewResults
Enrichment preview results.
method __init__
__init__(results: 'DataFrame')
method details
details() → DataFrame
classmethod from_dataframe
from_dataframe(df: 'DataFrame') → EnrichmentResults
classmethod from_dict
from_dict(
json_dict: 'List[JSONDict]',
include_original_dataset: 'Optional[bool]' = False
) → EnrichmentPreviewResults
method join
join(original_data: 'DataFrame', with_details: 'bool' = False) → DataFrame
Join the original data with the enrichment results. The result only contains those rows that were enriched by preview.
Args:
original_data
(pd.DataFrame): The original data to join with the enrichment results.
with_details
(bool): If with_details is True, the details of the enrichment results will be included in the output DataFrame.