Reliable Zero-Shot Classification with the Trustworthy Language Model

Run in Google Colab

In zero-shot (or few-shot) classification, we use a Foundation model to classify input data into predefined categories (aka. classes), without having to train this model on a manually annotated dataset. This requires much less work than training/deploying classical machine learning models (no data preparation/labeling required either) and can generalize better across evolving environments. The problem with classification with pretrained LLMs is we don’t know which LLM classifications we can trust. LLMs are prone to hallucination and will often predict a category even when their world knowledge does not suffice to justify this prediction.

This tutorial demonstrates how you can easily replace any LLM with Cleanlab’s Trustworthy Language Model (TLM) to:

Score the trustworthiness of each classification
Automatically boost classification accuracy

Use TLM to ensure reliable classification where you know which model predictions cannot be trusted.

Setup

This tutorial requires a TLM API key. Get one here.

The Python client package can be installed using pip:

%pip install cleanlab-tlm

# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/

import pandas as pd
from cleanlab_tlm import TLM

Let’s load an example classification dataset. Here we consider legal documents from the “US” Jurisdiction of the Multi_Legal_Pile. We aim to classify each document into one of three categories: [caselaw, contracts, legislation]. We’ll prompt our TLM to categorize each document and record its response and associated trustworthiness score. You can use the ideas from this tutorial to improve LLMs for any other classification task!

First download our dataset and load it into a DataFrame.

wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/zero_shot_classification.csv'

df = pd.read_csv('zero_shot_classification.csv')
df.head(2)

	index	text
0	0	Probl2B\n0/NV Form\nRev. June 2014\n\n\n\n ...
1	1	UNITED STATES DI...

Perform Zero Shot Classification with TLM

Let’s initalize a TLM object using gpt-4o as the underlying base model. Advanced configuration options exist that can produce improved classification accuracy or trustworthiness scoring.

MODEL = "gpt-4o"  # which base LLM should TLM utilize
tlm = TLM(options={"model": MODEL})  # to boost accuracy, consider adding: `num_candidate_responses=4` (any number larger than 1, higher number = more accurate response)

Next, let’s define a prompt template to instruct TLM on how to classify each document. Write your prompt just as you would with any other LLM when adapting it for zero-shot classification. A good prompt template might contain all the possible categories a document can be classified as, as well as formatting instructions for the LLM response. Of course the text of the document is crucial.

'You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: {categories}. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". \nDocument: {document}'

If you have a couple labeled examples from different classes, you may be able to get better LLM predictions via few-shot prompting (where these examples + their classes are embedded within the prompt). Here we’ll stick with zero-shot classification for simplicity, but note that TLM can also be used for few-shot classification just like any other LLM.

Let’s apply the above prompt template to all documents in our dataset and form the list of prompts we want to run. For one arbitrary document, we print the actual corresponding prompt fed into TLM below.

zero_shot_prompt_template = 'You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: {categories}. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Cateogry: <category document belongs to>". \nDocument: {document}'
categories = ['caselaw', 'contracts', 'legislation']
string_categories = str(categories).replace('\'', '')

# Create a DataFrame to store results and apply the prompt template to all examples
results_df = df.copy()
results_df['prompt'] = results_df['text'].apply(lambda x: zero_shot_prompt_template.format(categories=string_categories, document=x))

print(f"{results_df.at[7, 'prompt']}")

You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: [caselaw, contracts, legislation]. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Cateogry: <category document belongs to>". 
Document: UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK

UNITED STATES OF AMERICA,

                 v.                                                ORDER

JOSE DELEON,                                                    14 Cr. 28 (PGG)

                         Defendant.

PAUL G. GARDEPHE, U.S.D.J.:

              It is hereby ORDERED that the violation of supervised release hearing currently

scheduled for January 8, 2020 is adjourned to January 15, 2020 at 3:30 p.m. in Courtroom 705

of the Thurgood Marshall United States Courthouse, 40 Foley Square, New York, New York.

Dated: New York, New York
       January 8, 2020

Now we prompt TLM and save the output responses and their associated trustworthiness scores for all examples. We use the constrain_outputs parameter to ensure that TLM always outputs one of the valid categories. The last entry in the constrain_outputs list is treated as the category to fall back to whenever the LLM fails to choose one of the categories (so optionally order your categories such that the last one is what you would choose in cases of high uncertainty).

outputs = tlm.prompt(results_df['prompt'].to_list(), constrain_outputs=categories + ["other"])

results_df[["predicted_category","trustworthiness_score"]] = pd.DataFrame(outputs)

Querying TLM... 100%|██████████|

Optional: Define helper methods to better display results.

def display_result(results_df: pd.DataFrame, index: int):
    """Displays TLM result for the example from the dataset whose `index` is provided."""
    
    print(f"TLM predicted category: {results_df.iloc[index].predicted_category}")
    print(f"TLM trustworthiness score: {results_df.iloc[index].trustworthiness_score}\n")
    print(results_df.iloc[index].text)

Analyze Classification Results

Let’s first inspect the most trustworthy predictions from our model. We sort the TLM outputs over our documents to see which predictions received the highest trustworthiness scores.

results_df = results_df.sort_values(by='trustworthiness_score', ascending=False)
display_result(results_df, index=0)

TLM predicted category: contracts
TLM trustworthiness score: 0.966594896528417

EXHIBIT 10

WIRELESS RONIN TECHNOLOGIES, INC.

SENIOR MANAGEMENT BONUS PLAN

Effective January 1, 2012

The Senior Management Bonus Plan (the “Plan”) provides bonuses to certain
members of the Company’s senior management team. Such bonuses are based 50
percent upon the Company’s annual gross revenue dollars and 50 percent upon the
Company’s adjusted EBITDA, which is calculated based upon the Company’s
accounting practices, consistently applied and upon GAAP standards applicable to
the Company.

The Company’s Compensation Committee has identified eligible members of senior
management and established annual gross revenue dollar and adjusted EBITDA goals
for the upcoming plan year. The Company’s Board of Directors and the
Compensation Committee of the Board reserve the right to modify, terminate or
suspend this Plan at any time in the Board or Committee’s sole discretion.

September 30,

Percentage of Goal Annual

Gross Revenue Dollar

     Percentage of Goal Annual Gross Revenue
Dollar portion of Target Bonus  

Less than 85%

       0 % 

85% to < 100%

       12.5 % 

100% to < 110%

       50 % 

110% to < 120%

       55 % 

120% to < 130%

       60 % 

130% to < 140%

       65 % 

140% to < 150%

       70 % 

150% to < 160%

       75 % 

160% to < 170%

       80 % 

170% to < 180%

       85 % 

180% to < 190%

       90 % 

190% to < 200%

       95 % 

Over 200%

       100 % 

September 30,

Percentage of Goal Annual

Adjusted EBITDA

     Percentage of Goal Annual Adjusted
EBITDA portion of Target Bonus  

Less than 85%

       0 % 

85% to < 100%

       12.5 % 

100% to < 110%

       50 % 

110% to < 120%

       55 % 

120% to < 130%

       60 % 

130% to < 140%

       65 % 

140% to < 150%

       70 % 

150% to < 160%

       75 % 

160% to < 170%

       80 % 

170% to < 180%

       85 % 

180% to < 190%

       90 % 

190% to < 200%

       95 % 

Over 200%

       100 % 

--------------------------------------------------------------------------------

The following process documents how bonus payouts are to be calculated:
(1) actual gross revenue dollar performance for 2012 is compared to targeted
gross revenue dollar performance for 2012, (2) the percentage payout associated
with that gross revenue dollar performance is multiplied by the named
executive’s target bonus amount, then multiplied by the portion of the payout
associated with gross revenue dollars (50%), (3) actual adjusted EBITDA
performance for 2012 is compared to targeted adjusted EBITDA performance for
2012, (4) the percentage payout associated with that adjusted EBITDA performance
is multiplied by the named executive’s target bonus amount, then multiplied by
the portion of the payout associated with adjusted EBITDA (50%), then (5) the
numbers generated in steps (2) and (4) are added together to generate the total
bonus payout.

“Adjusted EBITDA” as used by the Compensation Committee equals the Company’s
earnings before all interest, tax, depreciation, amortization and stock-based
expenses but after payment of non-equity based employee bonuses.

A document about “SENIOR MANAGEMENT BONUS PLAN, Effective January 1, 2012” is clearly a contract, so it makes sense that TLM classifies it into the “contracts” category with high trustworthiness.

display_result(results_df, index=1)

TLM predicted category: contracts
TLM trustworthiness score: 0.9608771429046686

AMENDMENT No. 4 TO EMPLOYMENT AGREEMENT

THIS AMENDMENT No. 4 TO EMPLOYMENT AGREEMENT (“Amendment No. 3”), is entered
into as of August 3, 2017 (the “Effective Date”), by and between Third Point
Reinsurance Ltd., a Bermuda company (the “Company”), and Manoj K. Gupta (the
“Executive”).

WHEREAS, the Company and the Executive entered into a certain Employment
Agreement dated as of March 27, 2012, an Amendment No. 1 to Employment Agreement
dated as of February 26, 2015, an Amendment No. 2 to Employment Agreement dated
as of April 1, 2016, and an Amendment No. 3 to Employment Agreement dated as of
March 1, 2017 (collectively, the “Employment Agreement”); and

WHEREAS, in consideration of the mutual agreements set forth below and for other
good and valuable consideration given, by each party to this Amendment No. 4 to
the other, the receipt and sufficiency of which are hereby acknowledged, the
Company and Executive agree to amend the Employment Agreement on the terms set
forth below.

NOW, THEREFORE, the parties hereto, intending to be legally bound, hereby agree
as follows:
1.
Section 2 (a) Duties of the Employment Agreement shall be amended to read as
follows:

“2. Extent of Employment.
(a) Duties. During the Employment Term and from and after the Effective Date,
the Executive shall serve as President of TPRe USA. In addition, the Executive
shall continue to serve as the Company’s Head of Investor Relations and Business
Development, in such capacity reporting to the Chief Executive Officer of the
Company. In each function, the Executive shall perform such duties, services,
and responsibilities on behalf of TPRe USA and the Company, respectively,
consistent with such positions as may be reasonably assigned to the Executive
from time to time.
2.
The parties hereto agree that except as specifically set forth in this Amendment
No. 4, each and every provision of the Employment Agreement shall remain in full
force and effect as set forth therein.

[Signature Page Follows]

--------------------------------------------------------------------------------

IN WITNESS WHEREOF, the Company has caused this Amendment No. 4 to be executed,
and the Executive has hereunto set his hand, in each case to be effective as of
the day and year first above written.
THIRD POINT REINSURANCE LTD.

By: /s/ J. Robert Bredahl ______________________________
Name: Title:     J. Robert Bredahl
President & Chief Executive Officer

By: /s/ Janice R. Weidenborner
_____________________________
Name: Title:     Janice R. Weidenborner
EVP, Group General Counsel

EXECUTIVE
By:     /s/ Manoj K. Gupta
______________________________
Manoj K. Gupta

Another document titled as “AMENDMENT No. 4 TO EMPLOYMENT AGREEMENT” is clearly a contract, so it makes sense that TLM classifies it into the “contracts” category with high trustworthiness.

display_result(results_df, index=3)

TLM predicted category: legislation
TLM trustworthiness score: 0.9484836135698226

        DEPARTMENT OF DEFENSE
        GENERAL SERVICES ADMINISTRATION
        NATIONAL AERONAUTICS AND SPACE ADMINISTRATION 
        48 CFR Parts 2, 4, 12, 14, 15, 16, 19, 27, 30, 31, 32, 42, 44, 49, and 52
        [FAR Case 2005-036; Docket 2007-001, Sequence 7]
        RIN 9000-AK74
        Federal Acquisition Regulation; FAR Case 2005-036, Definition of Cost or Pricing Data
        
          AGENCIES:
          Department of Defense (DoD), General Services Administration (GSA), and National Aeronautics and Space Administration (NASA).
        
          ACTION:
          Notice of public meeting; extension of comment period.
        
          SUMMARY:
          The Civilian Agency Acquisition Council and the Defense Acquisition Regulations Council (Councils) are cosponsoring a public meeting to discuss the proposed Federal Acquisition Regulation (FAR) rule 2005-036 on cost or pricing data.  The rule would revise the definition of “cost or pricing data”; change the term “information other than cost or pricing data” to “data other than certified cost or pricing data”; add a definition of “certified cost or pricing data” to make the terms and definitions consistent with 10 U.S.C. 2306(a) and 41 U.S.C. 254(b) and more understandable to the general reader; change terminology throughout the FAR; and clarify the need to obtain data other than certified cost or pricing data when there is no other means to determine fair and reasonable pricing during price analysis.
          The proposed rule was published in the Federal Register at 72 FR 20092 on April 23, 2007.  The Councils are now seeking additional views before finalizing the proposed rule.

          A public meeting will be held on November 15, 2007, from 9:00 a.m. to 1:00 p.m. EST, in the General Services Administration Building Auditorium, 1800 F Street, NW, Washington, DC, 20405.  Interested parties may register electronically at: http://www.corpcomm-inc.com/dpap/dars/far_case_2005-036_meeting_registration.php.  Attendees are encouraged but not required to register for the public meeting, to ensure adequate room accommodations.
        
          DATES:

          The comment period has been extended.  Interested parties should submit written comments to the FAR Secretariat on or before November 22, 2007, to be considered in the formulation of the final rule.  Please do not resubmit comments already made.  If you wish your views at the public meeting to be considered as a public comment you must submit a public comment.  Copies of the public comments already submitted are available at http://www.regulations.gov.
        
          ADDRESSES:
          Submit comments identified by FAR Case 2005-036 by any of the following methods:
        
        Federal eRulemaking Portal: http://www.regulations.gov.
        • To search for any document, first select under “Step 1,” “Documents with an Open Comment Period” and select under “Optional Step 2,” “Federal Acquisition Regulation” as the agency of choice. Under “Optional Step 3,” select “Proposed Rules”. Under “Optional Step 4,” from the drop down list, select “Document Title” and type the FAR case number “2005-036”. Click the “Submit” button. Please include your name and company name (if any) inside the document.
        You may also search for any document by clicking on the “Search for Documents” tab at the top of the screen. Select from the agency field “Federal Acquisition Regulation”, and type “2005-036” in the “Document Title” field.  Select the “Submit” button.
        • Fax:  202-501-4067.
        • Mail:  General Services Administration, Regulatory Secretariat (VIR), 1800 F Street, NW, Room 4035, ATTN:  Laurieann Duarte, Washington, DC  20405.
        
          Instructions: Please submit comments only and cite FAR case 2005-036 in all correspondence related to this case.  All comments received will be posted without change to http://www.regulations.gov, including any personal and/or business confidential information provided.
        
          FOR FURTHER INFORMATION CONTACT
          Edward N. Chambers, Deputy Chair, FAR Finance Team, by telephone at (202) 501-3221, or by e-mail at edward.chambers@gsa.gov, for clarification of content.  Please cite FAR Case 2005-036. For information pertaining to status or publication schedules, contact the FAR Secretariat, Room 4035, GS Building, Washington, DC, 20405, at (202) 501-4755.
        
          Dated:  October 24, 2007.
          Al Matera,
          Director, Office of Acquisition Policy.
        
      [FR Doc. 07-5404 Filed 10-31-07; 8:45 am]
      BILLING CODE 6820-EP-S

This document about “DEPARTMENT OF DEFENSE, GENERAL SERVICES ADMINISTRATION, NATIONAL AERONAUTICS AND SPACE ADMINISTRATION” clearly belongs to some legislation measure, so it makes sense that TLM classifies it into the “legislation” category with high trustworthiness.

Least Trustworthy Predictions

Now let’s see which classifications predicted by the model are least trustworthy. We sort the data by trustworthiness scores in the opposite order to see which predictions received the lowest scores. Observe how model classifications with the lowest trustworthiness scores are often incorrect, corresponding to examples with vague/irrelevant text or documents possibly belonging to more than one category.

results_df = results_df.sort_values(by='trustworthiness_score')
display_result(results_df, index=0)

TLM predicted category: legislation
TLM trustworthiness score: 0.227975262181385

[exaa_001.jpg] 

 [exaa_002.jpg]

 [exaa_003.jpg]

 [exaa_004.jpg]

 [exaa_005.jpg]

 [exaa_006.jpg]

 [exaa_007.jpg]

 [exaa_008.jpg]

 [exaa_009.jpg]

 [exaa_010.jpg]

 [exaa_011.jpg]

 [exaa_012.jpg]

 [exaa_013.jpg]

 [exaa_014.jpg]

 [exaa_015.jpg]

 [exaa_016.jpg]

 [exaa_017.jpg]

 [exaa_018.jpg]

This example is clearly not legislation nor any other category since the document is just a list of JPG file names. TLM’s low trust score alerts us that this example cannot be confidently classified.

How to use Trustworthiness Scores?

If you have time, your team can manually review/correct the least trustworthy LLM classifications. Inspecting the least trustworthy examples also helps you discover how to improve your prompt (e.g. how to handle edge-cases, which few-shot examples to provide, etc).

Alternatively, you can determine a trustworthiness threshold below which LLM predictions seem too unreliable, and abstain from classifying such cases. The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend selecting any thresholds to be application-specific. First consider the relative trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.

Measuring Classification Accuracy with Ground Truth Labels

Our example dataset happens to have labels for each document, so we can load them in to assess the accuracy of our model predictions. We’ll study the impact on accuracy as we abstain from making predictions for examples receiving lower trustworthiness scores.

wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/zero_shot_classification_labels.csv'

df_ground_truth = pd.read_csv('zero_shot_classification_labels.csv')
df = pd.merge(results_df, df_ground_truth, on=['index'], how='outer')
df['is_correct'] = df['type'] == df['predicted_category']

df.head()

	index	text	prompt	predicted_category	trustworthiness_score	type	is_correct
0	0	Probl2B\n0/NV Form\nRev. June 2014\n\n\n\n ...	You are an expert Legal Document Auditor. Clas...	caselaw	0.778880	caselaw	True
1	1	UNITED STATES DI...	You are an expert Legal Document Auditor. Clas...	caselaw	0.809627	caselaw	True
2	2	\n \n FEDERAL COMMUNICATIONS COMMI...	You are an expert Legal Document Auditor. Clas...	legislation	0.923114	legislation	True
3	3	\n \n DEPARTMENT OF COMMERCE\n ...	You are an expert Legal Document Auditor. Clas...	legislation	0.767369	legislation	True
4	4	EXHIBIT 10.14\n\nAMENDMENT NO. 1 TO\n\nCHANGE ...	You are an expert Legal Document Auditor. Clas...	contracts	0.843782	contracts	True

print('TLM zero-shot classification accuracy over all documents: ', df['is_correct'].sum() / df.shape[0])

TLM zero-shot classification accuracy over all documents:  0.9605263157894737

Next we plot the accuracy of the TLM-predicted categories (computed with respect to ground-truth labels). Here we assume predictions from TLM are only considered for the subset of data where the trustworthiness score is sufficiently high, so accuracy is only computed over this data subset (the remaining data could be manually reviewed by humans). Our plot depicts the resulting accuracy across different choices of the trustworthiness score threshold, which determine how much of the data gets auto-labeled by the LLM (see X-axis below).

Optional: Plotting code

import numpy as np
import matplotlib.pyplot as plt

# Calculate the number of examples, percentage of data, and accuracy of TLM's predictions for each threshold value
threshold_analysis = pd.DataFrame([{
    "threshold": t,
    "num_examples": len(filtered := df[df["trustworthiness_score"] > t]),
    "percent_data": len(filtered) / len(df) * 100,
    "accuracy": np.mean(filtered["predicted_category"] == filtered["type"]) * 100
} for t in np.arange(0, 1.0, 0.01)]).round(2)

# Plot the accuracy of TLM's predictions and percentage of data for each trustworthiness score threshold value
def create_enhanced_line_plot(threshold_analysis):
    plt.figure(figsize=(8.25, 6.6))
    points = plt.scatter(threshold_analysis['percent_data'], threshold_analysis['accuracy'],
                        c=threshold_analysis['threshold'], cmap='viridis', s=40)  # Increased marker size
    plt.plot(threshold_analysis['percent_data'], threshold_analysis['accuracy'], 
            alpha=0.3, color='gray', zorder=1, linewidth=2)  # Increased line width
    
    plt.colorbar(points).set_label('trustworthiness Threshold', fontsize=14)  # Increased font size
    plt.grid(True, alpha=0.3)
    plt.xlabel('Percentage of Data Included', fontsize=14)  # Increased font size
    plt.ylabel('Classification Accuracy', fontsize=14)  # Increased font size
    plt.title('Accuracy vs Auto Classification Threshold', fontsize=16)  # Increased font size
    plt.xticks(fontsize=14)  # Increased tick label size
    plt.yticks(fontsize=14)  # Increased tick label size
    plt.xlim(85, 100)
    plt.tight_layout()
    return plt.gcf()

# Apply the function to your data
fig = create_enhanced_line_plot(threshold_analysis)
plt.show()

TLM performance on zero-shot classification

The above plot shows the accuracy of TLM predicted labels, if we only have the LLM handle the subset of the data where TLM’s trustworthiness score exceeds a certain threshold. This shows how TLM can ensure a target labeling accuracy for examples above a certain trustworthiness score. You can escalate to humans who manually categorize the remaining data whose trustworthiness falls below a score threshold.

For this task, we can achieve 100% accuracy in automated classification with TLM by setting the trustworthiness score threshold near 0.7, which allows us to automatically categorize 91% of the data. This means you only need to manually handle 9% of the data to achieve perfect accuracy. Use TLM trust scores to guarantee reliable LLM classifications.

Automatically Boost Accuracy

Beyond scoring trustworthiness, TLM can auto-improve the accuracy of LLM predictions if you specify a higher number of num_candidate_responses in TLMOptions. Additionally consider setting TLM’s base model option to a more powerful LLM that works well in your domain. TLM can automatically improve the accuracy of any LLM model, no change to your prompts/code required!

base_accuracy = np.mean(df["predicted_category"] == df["type"])
print(f"Base accuracy: {base_accuracy:.1%}")

# Here we set a higher number of candidate responses to auto-improve accuracy
tlm_best = TLM(options={"model": MODEL, "num_candidate_responses": 6})
best_responses = tlm_best.prompt(df['prompt'].to_list(), constrain_outputs=categories + ["other"])
df[["best_predicted_category","best_trustworthiness_score"]] = pd.DataFrame(best_responses)
boosted_accuracy = np.mean(df['type'] == df['best_predicted_category'])
print(f"Boosted accuracy: {boosted_accuracy:.1%}")

Base accuracy: 96.1%

Querying TLM... 100%|██████████|

Boosted accuracy: 97.4%

Next Steps

If you are enforcing Structured Outputs on your LLM, learn how you can still apply TLM via our OpenAI API. For classification tasks: structured outputs may degrade accuracy, so using TLM’s constrain_outputs argument is generally recommended over using structured outputs.

For binary classification tasks (i.e. Yes/No or True/False decisions), learn how you can control false positive/negative error rates with TLM via our tutorial: Yes/No Decisions.

Learn how to auto-label data using TLM and save human data annotation costs via our tutorial on: Data Annotation/Labeling.

Setup​

Perform Zero Shot Classification with TLM​

Analyze Classification Results​

Least Trustworthy Predictions​

How to use Trustworthiness Scores?​

Measuring Classification Accuracy with Ground Truth Labels​

Automatically Boost Accuracy​

Next Steps​