
How to Use Cleanlab Studio With Your Data Stored in Databricks


This tutorial demonstrates how to run Cleanlab Studio on data stored in the Databricks platform. As an example, we consider an application of text classification with LLMs, improving the models by improving the data they are fine-tuned on. We’ll see how Cleanlab Studio systematically improves the training data to boost LLM performance by 37%, without requiring you to spend any time or resources changing the model architecture, hyperparameters, or training process.

Here’s a video walkthrough of the tutorial, if you prefer to watch instead of reading the rest of the tutorial below:

LLMs acquire powerful generative and discriminative capabilities after being pre-trained on a large corpus of text (usually scraped from the internet), but producing reliable outputs for a particular business use case often requires additional training (fine-tuning) on a labeled data set from the application domain.

Labeled data powers AI/ML in the enterprise, but real-world datasets have been found to contain between 7% and 50% annotation errors. Imperfectly-labeled text data hampers the training (and evaluation) of ML models across tasks like intent recognition, entity recognition, and sequence generation. Although pretrained LLMs are equipped with a lot of world knowledge, their performance is adversely affected by noisy training data (as noted by OpenAI). This tutorial illustrates how using Cleanlab Studio to improve the training data can mitigate the negative effects of bad data (such as erroneous labels) without writing code or spending time changing the model architecture, hyperparameters, or training process. Because Cleanlab Studio improves the data itself (regardless of which model is used), it remains applicable for LLMs that are yet to be invented, like GPT-10.

This tutorial applies LLMs to a politeness classification task, beginning by fine-tuning OpenAI’s Davinci model on the original dataset stored in Databricks. The baseline model achieves moderate performance, but by automatically finding and fixing errors in the data using the Databricks connector for Cleanlab Studio, we can achieve significantly better performance using the same LLM and fine-tuning process, just by improving the data (and spending minimal human time manually reviewing data). We see a 37% reduction in LLM prediction error after using Cleanlab Studio to improve the dataset:

Improvement using Cleanlab Studio

You can achieve analogous results whether you are using popular APIs for fine-tuning LLMs or training open-source LLMs like Dolly directly on Databricks. See the accompanying blog post for additional context on LLMs and fine-tuning, why data quality matters for LLMs and ML tasks in general, and how Cleanlab Studio can help you easily improve ML model robustness and performance by systematically improving data quality.

Install and configure dependencies

As an example of how data curation can improve ML models, this tutorial fine-tunes a model from OpenAI. Note: OpenAI is not a dependency requirement of Cleanlab Studio. Our data curation platform can be used to improve any ML model (or even to improve data outside of ML applications); OpenAI is used here just as an example model.

%pip install openai cleanlab-studio

Configure OpenAI API key

Note that invoking the OpenAI API will use credits or bill you. The estimated cost to run this tutorial is $15 with the Davinci model, which is the most powerful but also the most expensive. You can scale down to the Curie or Ada model to reduce costs by setting openai_model in the cell below, replacing “davinci” with “curie” or “ada”. Fine-tuning the Ada model costs $1 per run with the given dataset.

Put your OpenAI API key in the cell below. You can find your API key at https://platform.openai.com/api-keys. Here we have saved the key in a secret scope.

import openai
import os

# dbutils and spark are provided automatically in Databricks notebooks

# we set the environment variable because it is used by the OpenAI command-line tool
os.environ['OPENAI_API_KEY'] = dbutils.secrets.get("solution-accelerator-cicd","openai_api")
# we also set the .api_key property below for the Python API
openai.api_key = os.environ['OPENAI_API_KEY']
# set openai model name
openai_model = 'davinci'

Set up Cleanlab Studio

  1. If you don’t have an account already, sign up for an account. It may take up to one day to get access.
  2. Get your API key and enter it below
import cleanlab_studio

CLEANLAB_STUDIO_API_KEY = dbutils.secrets.get("solution-accelerator-cicd","cleanlab_api")
studio = cleanlab_studio.Studio(CLEANLAB_STUDIO_API_KEY)

Download and prepare data

Here we consider a 3-class variant of the Stanford Politeness Dataset, which has text phrases labeled as: impolite, neutral, or polite. Since the labels were annotated by human raters, some of them are naturally low-quality.

The training dataset has 1916 examples each labeled by a single human annotator, and thus some may be unreliable.

The test dataset has 480 examples each labeled by 5 annotators, and we use their consensus label as a high-quality approximation of the true politeness (measuring test accuracy against these consensus labels). To ensure a fair comparison, this test dataset remains fixed throughout the experiments in this tutorial (all data quality improvement is done only for the training set).
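For intuition, a consensus label like this is typically just the majority vote over the individual annotations. Here is a minimal sketch of that idea using a hypothetical pandas DataFrame with one column per annotator (the provided test.csv already contains the final consensus label, so you do not need to run this):

import pandas as pd

# hypothetical per-annotator labels for three test phrases (illustration only)
annotations = pd.DataFrame({
    'annotator_1': ['polite', 'impolite', 'neutral'],
    'annotator_2': ['polite', 'neutral', 'neutral'],
    'annotator_3': ['polite', 'impolite', 'polite'],
    'annotator_4': ['neutral', 'impolite', 'neutral'],
    'annotator_5': ['polite', 'impolite', 'neutral'],
})

# majority vote across annotators for each row
consensus = annotations.mode(axis=1)[0]
print(consensus.tolist())  # ['polite', 'impolite', 'neutral']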

To prepare the data, we download raw data into DBFS, load it into PySpark DataFrames, and do some processing to prepare the dataset for the downstream task.

%sh
rm -rf /tmp/stanford-politeness
mkdir -p /tmp/stanford-politeness
cd /tmp/stanford-politeness
curl --silent -L https://raw.githubusercontent.com/databricks-industry-solutions/cleanlab-improving-llms/main/data/train.csv -o train.csv
curl --silent -L https://raw.githubusercontent.com/databricks-industry-solutions/cleanlab-improving-llms/main/data/test.csv -o test.csv

# move the dataset to our main bucket
rm -rf /dbfs/solacc/product/llm/stanford-politeness/raw
mkdir -p /dbfs/solacc/product/llm/stanford-politeness/raw
cp train.csv test.csv /dbfs/solacc/product/llm/stanford-politeness/raw

We can use the %fs command to see that our raw data is indeed saved in DBFS.

%fs ls /solacc/product/llm/stanford-politeness/raw

Next, we load the raw data into a PySpark DataFrame to enable further processing.

data_path = '/solacc/product/llm/stanford-politeness'
raw_path = f'{data_path}/raw'
politeness_train = spark.read.options(header='true', inferSchema='true', escape='"', multiLine=True).csv(f'{raw_path}/train.csv')
politeness_test = spark.read.options(header='true', inferSchema='true', escape='"', multiLine=True).csv(f'{raw_path}/test.csv')
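Before going further, it can help to sanity-check the size and class balance of what was loaded. A quick sketch (the label column is named completion and the text column prompt, matching the project configuration used later):

# check the number of training examples and the distribution of politeness labels
print(politeness_train.count(), 'training examples')
politeness_train.groupBy('completion').count().show()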

We can inspect this prepared data, looking at some specific rows to highlight data errors that are present. For example, the data point on row 1426 is erroneously labeled “impolite” when “polite” is a more appropriate label.

display(politeness_train.toPandas().iloc[[1426, 299, 134]])

Formatting data for fine-tuning

We are using the OpenAI APIs for fine-tuning, which require data in a specific format (JSONL, newline-delimited JSON objects). We also need to do some pre-processing of the label column, adding whitespace before the completion, as the API recommends.

We save the prepared results into DBFS, so that the result files can be used by the OpenAI API.

from pyspark.sql import functions as F

def prepare_data(df, path):
    '''
    Write a dataframe into a single JSONL file located at path, in a format appropriate for fine-tuning.

    This makes a small tweak to the data, namely, adding whitespace before the completion.

    We don't need the full power of Spark's parallel and distributed processing for this small demo dataset,
    but you would want to leverage it for any large real-world datasets. Our small dataset lives in a single
    partition, but larger datasets would have multiple partitions. By default, Spark writes the dataset into
    multiple files for efficiency (each partition corresponds to a separate file), but we need a single file
    to pass to the OpenAI command-line tool. This function ensures that a single JSONL file containing all of
    the data is produced as the final result.
    '''
    # add whitespace to the completion, as OpenAI requires
    df = df.withColumn('completion', F.format_string(' %s', 'completion'))

    temp_dir = f'{path}_tmp'
    # write using a single partition, so we have a single JSONL file
    df.coalesce(1).write.mode('overwrite').json(temp_dir)
    # Spark saves the JSON file in a directory, along with some other files we don't need anymore
    all_files = dbutils.fs.ls(temp_dir)
    # move the .json file to the output destination
    for f in all_files:
        if f.path.endswith('.json'):
            dbutils.fs.mv(f.path, path)
    # remove all the other files, which we don't need
    dbutils.fs.rm(temp_dir, recurse=True)

prepare_data(politeness_train, f'{data_path}/processed/train.jsonl')
prepare_data(politeness_test, f'{data_path}/processed/test.jsonl')
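As a quick sanity check, you can peek at the beginning of the generated file to confirm it contains one JSON object per line, with a prompt and a whitespace-prefixed completion (a minimal sketch using dbutils.fs.head):

# inspect the first ~300 bytes of the JSONL file we just wrote
print(dbutils.fs.head(f'{data_path}/processed/train.jsonl', 300))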

Fine-Tune and evaluate OpenAI model without Cleanlab Studio (accuracy ~65%)

We use the OpenAI fine-tuning API to first establish a baseline by:

  • Fine-tuning the OpenAI model on our (original, with some errors) training set
  • Evaluating the model on our test set

First, we upload our training set and test set to OpenAI:

train_file = openai.File.create(file=open(f'/dbfs/{data_path}/processed/train.jsonl', 'rb'), purpose='fine-tune')
test_file = openai.File.create(file=open(f'/dbfs/{data_path}/processed/test.jsonl', 'rb'), purpose='fine-tune')

Then, we invoke the OpenAI API to fine-tune the openai_model (“davinci” by default, unless you changed it above). Note that this incurs some cost (roughly $7.50 for the Davinci model or $0.50 for the Ada model, at the time this was written).

We also invoke the fine-tuning API with the validation_file keyword argument, so that the API will automatically compute statistics including model accuracy on the test set after the fine-tuning process.

response = openai.FineTune.create(
    training_file=train_file.id,
    validation_file=test_file.id,
    compute_classification_metrics=True,
    classification_n_classes=3,
    model=openai_model,
    suffix='baseline'
)

You can follow the progress of fine-tuning with the following command. You might need to re-run the cell if it times out. Training time varies based on queue length and other factors; it can take up to 1 hour to fine-tune the LLM. The cell below checks the status of the fine-tune job and blocks execution until the job reaches a terminal status (“succeeded” or “failed”). The code is based on this openai-cookbook example.

import time

def wait_for_finetune(job_id):
    status = openai.FineTune.retrieve(id=job_id)["status"]
    if status not in ["succeeded", "failed"]:
        print(f'Job not in terminal status: {status}. Waiting.')
        while status not in ["succeeded", "failed"]:
            time.sleep(60)
            status = openai.FineTune.retrieve(id=job_id)["status"]
            print(f'Status: {status}')
    else:
        print(f'Finetune job {job_id} finished with status: {status}')

wait_for_finetune(response.id)

Once the job completes, we see the test accuracy achieved when fine-tuning this LLM on the original training dataset.

import pandas as pd

# download the fine-tuning results via the OpenAI CLI (the ! prefix runs a shell command from the notebook)
!openai api fine_tunes.results -i {response.id} > baseline.csv

base_df = pd.read_csv('baseline.csv')
baseline_acc = base_df.iloc[-1]['classification/accuracy']
print(f"Fine-tuning Accuracy: {baseline_acc:.1%}")

Baseline results: ~65% accuracy

Our baseline Davinci LLM achieves a test accuracy of 65% when fine-tuned on the original training data (Curie achieved 64% accuracy, Ada achieved 60% accuracy). Model training is nondeterministic, so your results might vary slightly, even with the exact same dataset and initial model checkpoint. OpenAI’s models might also be changed/updated over time.

Even a state-of-the-art LLM like the Davinci model produces lackluster results for this classification task; is it because of low data quality?

Improve the data using Cleanlab Studio and re-train the LLM (accuracy ~78%)

Next, we use the Databricks connector for Cleanlab Studio to automatically improve data quality, and subsequently re-train our LLM on the curated dataset. We can directly upload a Spark DataFrame to the Cleanlab Studio platform via studio.upload_dataset().

dataset_id = studio.upload_dataset(politeness_train, dataset_name='Stanford Politeness')

Create a project

To analyze the data, use the Cleanlab Studio web UI to create a project, configuring it according to the ML task. For this demo, you should select:

  • ML task: text classification
  • Type of classification: multi-class
  • Text column: prompt (will be auto-detected)
  • Label column: completion (will be auto-detected)

Select fast mode or regular mode depending on the speed/quality tradeoff you desire.

Creating a project.

Make corrections to your dataset with Cleanlab Studio

Cleanlab Studio not only finds data points with potential issues, but it also makes suggestions for how to address the issues (e.g., changing the label of a data point). Deciding how to make use of the analysis results is up to you. For example, you could discard all potentially erroneous data points, or you could review the data points most likely to have issues and make corrections. This human-in-the-loop data correction usually gives the best results.

If you want to save time, you could briefly review some flagged issues, and then auto-fix the top issues.

Reviewing an issue flagged by Cleanlab Studio.

The selected row in the screenshot above is an example of a poorly-labeled datapoint.

The phrase:

I went through your comments. How does it look now?

is labeled “impolite”. Cleanlab Studio flags this as a label error, and it suggests that the label be switched to “neutral”. In the screenshot above, we pressed “W” to accept Cleanlab Studio’s suggestion to automatically fix the label.

Label issues like this degrade the accuracy of the fine-tuned LLM. Correcting them allows us to train an improved LLM, as we’ll see below.

Export your improved dataset back into Databricks

Once you’re done correcting issues found in your dataset with Cleanlab Studio, export the improved dataset back into Databricks:

  1. Click on the “Export Cleanset” button in your Cleanlab Studio project
  2. Select the “Export using API” tab
  3. Copy the “cleanset ID” and paste it into the cell below
cleanset_id = '7b27d51ba79b4087b32b3f064f87a47b'  # paste your own Cleanset ID here
politeness_train_fixed = studio.apply_corrections(cleanset_id, politeness_train)
display(politeness_train_fixed)
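To get a sense of how much the dataset changed, you can compare the corrected DataFrame against the original and count differing labels. A minimal sketch, assuming the prompt text uniquely identifies each training example:

from pyspark.sql import functions as F

# join the original and corrected training data on the prompt text and count label changes
changed = (
    politeness_train.alias('orig')
    .join(politeness_train_fixed.alias('fixed'), on='prompt')
    .filter(F.col('orig.completion') != F.col('fixed.completion'))
)
print(f'{changed.count()} labels differ between the original and corrected training data')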

Fine-tune the LLM on your improved dataset and evaluate the results

Let’s see how Cleanlab Studio improves the performance of the LLM. We follow the same process as earlier, except we use the politeness_train_fixed DataFrame as our training data.

When we ran the experiment below, we used Cleanlab Studio’s web interface to review the data issues it flagged; machine-augmented human-in-the-loop data improvement often gives the best results. By default, the cell below uses the improved dataset that we exported from Cleanlab Studio (for reproducibility). If you instead want to use the dataset you corrected yourself (politeness_train_fixed from above), set the flag below to False.

# By default, for reproducibility, we use the dataset that we exported from Cleanlab Studio as CSV.
# If you instead want to use the dataset you improved yourself (politeness_train_fixed from above),
# set the flag below to False.
use_provided_training_set_improved_using_cleanlab_studio = True
if use_provided_training_set_improved_using_cleanlab_studio:
    # load the provided CSV with pandas and convert it to a Spark DataFrame for prepare_data()
    politeness_train_fixed = spark.createDataFrame(
        pd.read_csv('https://raw.githubusercontent.com/databricks-industry-solutions/cleanlab-improving-llms/main/data/train_fixed.csv')
    )

prepare_data(politeness_train_fixed, f'{data_path}/processed/train_fixed.jsonl')

train_file_fixed = openai.File.create(file=open(f'/dbfs/{data_path}/processed/train_fixed.jsonl', 'rb'), purpose='fine-tune')

response_fixed = openai.FineTune.create(
    training_file=train_file_fixed.id,
    validation_file=test_file.id,
    compute_classification_metrics=True,
    classification_n_classes=3,
    model=openai_model,
    suffix='fixed'
)

You can follow the progress of fine-tuning with the following command. You might need to re-run the cell if it times out. Training time varies based on queue length and other factors; it can take up to 1 hour to fine-tune the LLM. We use the wait_for_finetune function defined above to block this step until fine-tuning is done.

wait_for_finetune(response_fixed.id)

Once the job completes, we see the test accuracy achieved when fine-tuning this LLM on the improved dataset. If you simply auto-fixed some of the labels (spending zero human time on data improvement), you’ll still see improvement; if you reviewed some of Cleanlab Studio’s suggestions following a human-in-the-loop data cleaning process, you’ll see larger improvements here.

!openai api fine_tunes.results -i {response_fixed.id} > fixed.csv
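You can then read the results file and report the accuracy achieved with the improved dataset, mirroring the baseline evaluation above:

fixed_df = pd.read_csv('fixed.csv')
fixed_acc = fixed_df.iloc[-1]['classification/accuracy']
print(f"Fine-tuning Accuracy (improved data): {fixed_acc:.1%}")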

Impact of data improvement using Cleanlab Studio: ~78% accuracy (compared to ~65% baseline accuracy)

Training on the improved dataset, we see a test accuracy of 78% for the Davinci model (Curie achieved ~76% accuracy, Ada achieved ~75% accuracy). These results are from our train_fixed.csv (provided above); results on your dataset will vary depending on how you improved the dataset using Cleanlab Studio (e.g., whether you used auto-fix or manually reviewed the top issues, how you corrected labels, how you removed outliers, etc.). Even the results of fine-tuning on the provided dataset might vary slightly, because model training is nondeterministic, and OpenAI’s initial model checkpoints may be updated over time.

In this evaluation, we see that data quality has a huge impact on LLM performance. By simply improving the data quality (and leaving the original LLM checkpoint, training parameters, fine-tuning process, etc. as-is), we have reduced prediction error by ~37%. Because Cleanlab Studio improves models by improving the underlying data, the strategy outlined here works for any model or LLM that exists today or may exist in the future!
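For clarity, the ~37% figure is the relative reduction in prediction error implied by the accuracy numbers above:

baseline_error = 1 - 0.65   # ~65% baseline test accuracy
improved_error = 1 - 0.78   # ~78% test accuracy after improving the data
error_reduction = (baseline_error - improved_error) / baseline_error
print(f'{error_reduction:.0%} relative reduction in prediction error')  # ~37%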

With Cleanlab, it’s that easy to improve any dataset or model you have in Databricks

Using Cleanlab with Databricks