Trustworthy Yes/No Decision Automation with TLM (Binary Classification)
This tutorial demonstrates how to use the Trustworthy Language Model (TLM) to automate Yes/No decisions and, more generally, any binary classification task where you want an LLM to pick between two options (e.g. True or False, A or B, etc.).
To use TLM for multi-class classification tasks where your LLM picks from more than two options, refer to our Zero-Shot Classification Tutorial.
Setup
This tutorial requires a TLM API key. Get one here.
Cleanlab’s Python client can be installed using pip.
%pip install --upgrade cleanlab-tlm
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)
from cleanlab_tlm import TLM
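# The "low" quality preset favors speed and cost; higher presets (e.g. "medium" or "high") generally yield more reliable scores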
tlm = TLM(quality_preset="low")
Binary classification dataset
Let’s consider a dataset composed of customer service messages received by a bank.
wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/tlm-annotation-tutorial/customer-service-text.csv
df = pd.read_csv("customer-service-text.csv")
df.head()
| | text |
|---|---|
| 0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. |
| 1 | why is there a fee when i thought there would be no fees? |
| 2 | how do i replace my card before it expires next month? |
| 3 | what should i do if someone stole my phone? |
| 4 | please help me get a visa card. |
In our example, let’s suppose the goal is to determine whether each customer is asking for help changing their card’s PIN, or about something else.
This is a decision-making task, which we want to automate using an LLM that takes in a customer message and outputs:
- “Yes” if the request is about changing their card PIN
- “No” otherwise
Apply TLM for Yes/No decision making
In binary decision-making, LLM errors (false positives/negatives) may have asymmetric impact. For example, incorrectly predicting Yes may be 3x worse than incorrectly predicting No.
If we just have an LLM output either Yes or No, it is difficult to control the false positive/negative error rates. Instead, you can use TLM to produce a score reflecting the LLM’s confidence that Yes is the right decision. You can subsequently translate these scores into Yes/No predictions by choosing the score threshold that achieves the best false positive/negative trade-off for your use case.
customer_message = "I need to change my card's PIN"
prompt_template = '''Given the following customer message, determine if it is about the customer needing help changing their card's PIN. Please respond with only "Yes" or "No" with no leading or trailing text.
Here is the customer message: {}'''
prompt = prompt_template.format(customer_message)
# Note how we specify the only outputs to consider are: Yes or No
response = tlm.get_trustworthiness_score(prompt, "Yes", constrain_outputs=["Yes", "No"])
print("Trustworthiness Score:", response["trustworthiness_score"])
For the above example, the trustworthiness score indicates the LLM’s confidence that Yes is the right decision. You can confidently decide Yes for examples where this score is high, and No for examples where this score is low.
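For instance, here is a minimal sketch of turning the score into a hard decision (the 0.5 cutoff is purely illustrative; a later section shows how to pick a better one):
# Illustrative cutoff only; choose a value suited to your false positive/negative tolerance
THRESHOLD = 0.5
decision = "Yes" if response["trustworthiness_score"] >= THRESHOLD else "No"
print("Decision:", decision)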
Run TLM for automated decision-making over the dataset
Let’s now apply TLM to predict decisions for every customer message.
# Construct prompt for each message.
all_prompts = [prompt_template.format(text) for text in df.text]
print(all_prompts[0])
# Score the entire dataset in one batch.
responses = tlm.get_trustworthiness_score(all_prompts, ["Yes"] * len(all_prompts), constrain_outputs=["Yes", "No"])
# Extract the scores from each response
scores = [response["trustworthiness_score"] for response in responses]
df["trustworthiness"] = scores
df.head(3)
| | text | trustworthiness |
|---|---|---|
| 0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. | 0.000469 |
| 1 | why is there a fee when i thought there would be no fees? | 0.000469 |
| 2 | how do i replace my card before it expires next month? | 0.000469 |
Assess Prediction Performance (Optional)
In this section, we introduce ground-truth labels solely in order to evaluate the performance of TLM. These ground-truth labels are never provided to TLM, and this section can be skipped if you don’t have ground-truth labels.
wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/tlm-annotation-tutorial/customer-service-categories.csv
ground_truth_labels = pd.read_csv("customer-service-categories.csv")
df['ground_truth_label'] = np.where(ground_truth_labels['label'] == "change pin", "Yes", "No")
df.head(3)
| | text | trustworthiness | ground_truth_label |
|---|---|---|---|
| 0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. | 0.000469 | No |
| 1 | why is there a fee when i thought there would be no fees? | 0.000469 | No |
| 2 | how do i replace my card before it expires next month? | 0.000469 | No |
Lowest confidence examples
Let’s sort the messages by TLM’s confidence score for Yes, and look at a few examples.
df.sort_values(by="trustworthiness", ascending=True).head(3)
| | text | trustworthiness | ground_truth_label |
|---|---|---|---|
| 0 | i need to cancel my recent transfer as soon as possible. i made an error there. please help before it goes through. | 0.000469 | No |
| 630 | my google pay top up isn't working. help. | 0.000469 | No |
| 633 | how do i top up using my apple watch? | 0.000469 | No |
We see that the lowest scores were given to messages that are not related to changing card PIN (these are obvious cases where No is the right decision).
Highest confidence examples
We can also view messages where TLM estimated highest confidence that Yes is the right decision.
df.sort_values("trustworthiness", ascending=False).head(3)
| | text | trustworthiness | ground_truth_label |
|---|---|---|---|
| 104 | which atm's am i able to change my pin? | 0.999831 | Yes |
| 28 | how do i change my pin while traveling? | 0.999831 | Yes |
| 664 | how can i reset my pin? | 0.999831 | Yes |
We can see that the highest trustworthiness scores are given to examples that are related to changing card PIN (cases where Yes is clearly the right decision).
Aggregate results
Let’s plot the distribution of these confidence scores, grouping the messages by their ground-truth label.
Optional: Plotting code
import matplotlib.pyplot as plt
import seaborn as sns
def analyze_results(train_df, score_col):
    """
    Plot the distribution of trustworthiness scores in `score_col`,
    grouped by the ground-truth label.
    """
    plt.figure(figsize=(10, 6))
    sns.histplot(data=train_df, x=score_col, hue='ground_truth_label', bins=20)
    plt.title('Distribution of Trustworthiness Scores by Label')
    plt.xlabel('Trustworthiness Score')
    plt.ylabel('Count')
    plt.show()
analyze_results(df, "trustworthiness")
We see that the TLM score is consistently lower for “No” ground-truth examples (messages that are unrelated to changing PIN), while “Yes” ground-truth examples (messages that are actually requesting a PIN change) consistently receive higher scores from TLM.
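With ground-truth labels available, you can also use them to choose a decision threshold that reflects asymmetric error costs. The sketch below is illustrative only: the 3:1 cost ratio echoes the earlier example, and the cost values, threshold grid, and predicted_decision column name are all assumptions rather than part of TLM’s API.
# Sketch: pick the threshold minimizing a cost-weighted error count (cost values are illustrative)
y_true = (df["ground_truth_label"] == "Yes").to_numpy()
tlm_scores = df["trustworthiness"].to_numpy()

COST_FALSE_YES = 3.0  # hypothetical: wrongly predicting Yes is 3x worse
COST_FALSE_NO = 1.0   # hypothetical: wrongly predicting No

best_threshold, best_cost = None, float("inf")
for threshold in np.linspace(0.0, 1.0, 101):
    predicted_yes = tlm_scores >= threshold
    cost = COST_FALSE_YES * np.sum(predicted_yes & ~y_true) + COST_FALSE_NO * np.sum(~predicted_yes & y_true)
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost

print(f"Selected threshold: {best_threshold:.2f} (weighted cost: {best_cost:.0f})")
df["predicted_decision"] = np.where(tlm_scores >= best_threshold, "Yes", "No")
In practice, select the threshold on a held-out labeled sample rather than on the same data you use for final evaluation.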
Yes/No Decision with Unsure option
Sometimes, it may be useful to include an “Unsure” option for your LLM to consider. This can be done by setting TLM’s constrain_outputs parameter to ["Yes", "No", "Unsure"].
prompt_template_with_unsure = '''Given the following customer message, determine if it is about the customer needing help changing their card's PIN. Please respond with only "Yes", "No", or "Unsure" with no leading or trailing text.
Here is the customer message: {}'''
# Construct prompt for each message to label.
all_prompts_with_unsure = [prompt_template_with_unsure.format(text) for text in df.text]
print(all_prompts_with_unsure[0])
# Score the entire dataset in one batch.
responses_with_unsure = tlm.get_trustworthiness_score(all_prompts_with_unsure, ["Yes"] * len(all_prompts_with_unsure), constrain_outputs=["Yes", "No", "Unsure"])
# Extract the scores from each response
scores = [response["trustworthiness_score"] for response in responses_with_unsure]
df["trustworthiness_with_unsure"] = scores
analyze_results(df, "trustworthiness_with_unsure")
We see that TLM’s confidence score is again consistently higher for “Yes” ground truth examples (messages that are actually requesting a PIN change).
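One way you might act on these scores is to auto-decide the clear-cut cases and route mid-range scores, where the model is effectively unsure, to a human reviewer. The sketch below uses purely illustrative cutoffs; the 0.9/0.1 values and the routing column name are assumptions.
# Illustrative cutoffs; tune them to your false positive/negative tolerance
HIGH_CUTOFF, LOW_CUTOFF = 0.9, 0.1

def route(score):
    if score >= HIGH_CUTOFF:
        return "auto-Yes"
    if score <= LOW_CUTOFF:
        return "auto-No"
    return "human review"

df["routing"] = df["trustworthiness_with_unsure"].apply(route)
print(df["routing"].value_counts())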