
Ingesting Data from Google BigQuery (BigFrames)

Run in Google Colab

In this tutorial, you’ll learn how to ingest data from Google BigQuery DataFrames (bigframes) to Cleanlab Studio. You’ll start by creating a bigframes DataFrame and then use the Python client to ingest your table. This guide will help you integrate your bigframes/BigQuery data into Cleanlab Studio efficiently.

This notebook uses the bigframes package, along with the cleanlab-studio Python package.

1. Install and import dependencies

You’ll need to install the cleanlab-studio package, along with the bigframes package.

1a. Install the required packages

Required packages are installed using pip:

%pip install -U cleanlab-studio
%pip install bigframes "numpy>=1.26.0"
import bigframes.pandas as bpd
from google.cloud import bigquery
from cleanlab_studio import Studio

1b. Set up BigQuery options and Cleanlab Studio

To make API calls to BigQuery and Cleanlab Studio, you need to set up the BigQuery DataFrames options and create a Cleanlab Studio client.

This tutorial assumes you have already authenticated your Google Cloud account. If you haven’t, you can follow the instructions in the Google Cloud documentation.
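If you are running locally rather than in Colab, one common way to authenticate is with application-default credentials via the gcloud CLI. This is a setup sketch, not part of the tutorial's code; the project ID placeholder is the same one used below:

```shell
# Authenticate your local environment with application-default credentials
gcloud auth application-default login

# Optionally set your default GCP project for subsequent commands
gcloud config set project <your-gcp-project-id>
```

In Colab, authenticating the runtime with `google.colab.auth.authenticate_user()` achieves the same effect.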

Make sure to set the GCP_PROJECT variable and your Cleanlab Studio API key (API_KEY) in the following block.

# Set BigQuery DataFrames options
GCP_PROJECT = "<your-gcp-project-id>"
GCP_REGION = "US"
bpd.options.bigquery.project = GCP_PROJECT
bpd.options.bigquery.location = GCP_REGION

# Create a Studio client
# You can find your Cleanlab Studio API key by going to app.cleanlab.ai/account
API_KEY = "<YOUR_API_KEY>"
studio = Studio(API_KEY)

2. Create a DataFrame (from a BigQuery table)

The following code block creates a DataFrame from a (public) BigQuery table. You can use this DataFrame to ingest data to Cleanlab Studio.

# Get the dataset and read into a DataFrame
query_or_table = "bigquery-public-data.ml_datasets.penguins"
df = bpd.read_gbq(query_or_table)
df.head()
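Depending on your data, you may want to do light cleanup before ingesting, for example dropping rows whose label column is missing (the public penguins table contains some rows with a null `sex` value). The sketch below uses a small plain-pandas stand-in for illustration; bigframes DataFrames mirror the pandas API, so the same filtering pattern should apply to `df` above:

```python
import pandas as pd

# Illustrative stand-in for the penguins DataFrame; bigframes DataFrames
# expose the same pandas-style API, so this pattern carries over
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "sex": ["MALE", None, "FEMALE"],
})

# Keep only rows with a non-null label so they don't end up in the project
clean_df = df[df["sex"].notna()]
print(len(clean_df))  # number of rows that still have a label
```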

3. Ingest bigframes DataFrame to Cleanlab Studio

You can use the cleanlab-studio Python package to ingest the bigframes DataFrame to Cleanlab Studio.

After ingesting the data, you can access it in Cleanlab Studio by opening the application and finding the dataset on the Dashboard (or clicking the link below).

# Ingest the dataset to Cleanlab Studio
dataset_id = studio.upload_from_bigframe(df)

# View the dataset in Cleanlab Studio
print(f"https://app.cleanlab.ai/datasets/{dataset_id}")

4. Run Cleanlab Studio Project on bigframes ingested data

Let’s now create a project using this dataset. A Cleanlab Studio project will automatically train ML models to provide AI-based analysis of your dataset.

# Create and run a Cleanlab Studio multi-class classification project,
# using the `sex` column as the label and the remaining columns as features
project_id = studio.create_project(
    dataset_id=dataset_id,
    project_name="bigquery-public-data.ml_datasets.penguins project",
    modality="tabular",
    task_type="multi-class",
    model_type="regular",
    label_column="sex",
)

print(f"Project successfully created and training has begun! project_id: {project_id}")
# Check on status of the project and if cleanset has been created (after project finishes running)
cleanset_id = studio.get_latest_cleanset_id(project_id)
print(f"cleanset_id: {cleanset_id}")
project_status = studio.poll_cleanset_status(cleanset_id)

5. Store results in a bigframes DataFrame

We can fetch the Cleanlab columns that contain the metadata of this cleanset using its cleanset_id. These columns have the same number of rows as your original dataset and provide metadata about each individual data point, such as which types of issues it exhibits and how severe these issues are.

Then we’ll convert the metadata results directly into a BigQuery DataFrame.

# Convert the pandas DataFrame to a BigQuery DataFrame
cleanlab_bq_df = bpd.DataFrame(studio.download_cleanlab_columns(cleanset_id))
cleanlab_bq_df.head()
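A common next step is to join the Cleanlab columns back onto the original rows, for example to inspect data points flagged as likely label issues. The column names used below (`is_label_issue`, `suggested_label`) are assumptions based on typical Cleanlab Studio cleanset columns; check `cleanlab_bq_df.columns` for the exact names in your cleanset. Plain pandas is used as a stand-in here, since bigframes mirrors its API:

```python
import pandas as pd

# Hypothetical original data and Cleanlab metadata, aligned by row index
data = pd.DataFrame({"sex": ["MALE", "FEMALE", "MALE"]})
cleanlab_cols = pd.DataFrame({
    "is_label_issue": [False, True, False],   # assumed cleanset column name
    "suggested_label": [None, "MALE", None],  # assumed cleanset column name
})

# Join the metadata onto the original rows and inspect flagged points
combined = data.join(cleanlab_cols)
flagged = combined[combined["is_label_issue"]]
print(flagged)
```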

6. Save the results in a BigQuery table

Let’s now initialize a BigQuery client and create a dataset in BigQuery (if it doesn’t already exist) to define a table to write our Cleanlab Project results to. This shows how easy it is to integrate Cleanlab Studio with BigQuery (both before and after running a Cleanlab Studio project).

# Initialize a BigQuery client
client = bigquery.Client(project=GCP_PROJECT)

# Create a new BigQuery dataset (if it doesn't exist); note we use a separate
# variable name so we don't shadow the Cleanlab Studio dataset_id from above
bq_dataset_id = f"{GCP_PROJECT}.penguins_cleanlab_results"
dataset = bigquery.Dataset(bq_dataset_id)
dataset.location = GCP_REGION

try:
    dataset = client.create_dataset(dataset, exists_ok=True)
    print(f"Dataset {bq_dataset_id} is ready")
except Exception as e:
    print(f"Error with dataset {bq_dataset_id}: {e}")

# Define the table
table_id = f"{bq_dataset_id}.penguins_with_cleanlab"

# Save the dataset to the new BigQuery table
cleanlab_bq_df.to_gbq(table_id, if_exists='replace')

print(f"Data successfully saved to {table_id}")

# Verify the data was saved by reading it back
verified_df = bpd.read_gbq(table_id)

print(f"Shape of data written to BigQuery: {cleanlab_bq_df.shape}")
print(f"Shape of data read back from BigQuery: {verified_df.shape}")

7. Conclusion

In this tutorial, you learned how to ingest data from BigQuery into Cleanlab Studio, run a Cleanlab Studio project on that data, and then download the cleanset results into a bigframes DataFrame and save them to a BigQuery table. All of this was done with the cleanlab-studio Python package and the bigframes API. For more details on configuring and running Cleanlab Studio projects, check out our Projects guide.