
Reliable Zero-Shot Classification with the Trustworthy Language Model


In zero-shot classification, we use a foundation model to classify input data into predefined categories (a.k.a. classes), without having to train this model on a dataset manually annotated with these categories. This leverages the pre-trained model’s world knowledge to accomplish tasks that would otherwise require much more work training classical machine learning models from scratch. The problem with zero-shot classification of text with LLMs is that we don’t know which LLM classifications we can trust. Most LLMs are prone to hallucination and will often predict a category even when their world knowledge does not suffice to justify this prediction.

This tutorial demonstrates how you can easily replace any LLM with Cleanlab’s Trustworthy Language Model (TLM) to gauge the trustworthiness of each zero-shot classification. Use the TLM to ensure reliable classification where you know which model predictions cannot be trusted. Before this tutorial, we recommend completing the TLM quickstart tutorial.

Setup

Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link.

The Python client package can be installed using pip:

%pip install cleanlab-studio
import re
import pandas as pd
from tqdm import tqdm
from difflib import SequenceMatcher

from cleanlab_studio import Studio

In Python, launch your Cleanlab Studio client using your API key.

# Get your API key from https://app.cleanlab.ai/account after creating an account.
studio = Studio("<insert your API key>")

Let’s load an example classification dataset. Here we consider legal documents from the “US” Jurisdiction of the Multi_Legal_Pile, a large-scale multilingual legal dataset that spans over 24 languages. We aim to classify each document into one of three categories: [caselaw, contracts, legislation]. We’ll prompt our TLM to categorize each document and record its response and associated trustworthiness score. You can use the ideas from this tutorial to improve LLMs for any other text classification task!

First download our example dataset and then load it into a DataFrame.

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/zero_shot.csv'
df = pd.read_csv('zero_shot.csv')
df.head(2)
index text
0 0 Probl2B\n0/NV Form\nRev. June 2014\n\n\n\n ...
1 1 UNITED STATES DI...

Perform Zero Shot Classification with TLM

Let’s initialize a TLM object. Here we use default TLM settings, but check out the TLM quickstart tutorial for configuration options that can produce better results.

tlm = studio.TLM()

Next, let’s define a prompt template to instruct the TLM on how to classify each document’s text. Write your prompt just as you would with any other LLM when adapting it for zero-shot classification. A good prompt template might contain all the possible categories a document can be classified as, as well as formatting instructions for the LLM response. Of course, the prompt must also include the text of the document itself.

'You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: {categories}. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". \nDocument: {document}'

If you have a couple labeled examples from different classes, you may be able to get better LLM predictions via few-shot prompting (where these examples + their classes are embedded within the prompt). Here we’ll stick with zero-shot classification for simplicity, but note that TLM can also be used for few-shot classification just like any other LLM.
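As a rough illustration of the few-shot variant mentioned above, the sketch below embeds a handful of labeled examples into the prompt. The helper name `build_few_shot_prompt`, the example snippets, and the exact prompt wording are illustrative assumptions, not part of this tutorial's dataset or the TLM API. The resulting string would be passed to the TLM like any other prompt.

```python
# Hypothetical helper (not part of cleanlab_studio): builds a few-shot
# classification prompt from a handful of labeled examples.
def build_few_shot_prompt(document, categories, examples):
    """`examples` is a list of (text, category) pairs to show the LLM."""
    category_list = "[" + ", ".join(categories) + "]"
    shots = "\n\n".join(
        f"Document: {text}\nCategory: {category}" for text, category in examples
    )
    return (
        "You are an expert Legal Document Auditor. Classify the following "
        "document into a single category that best represents it. "
        f"The categories are: {category_list}. "
        "Here are some labeled examples:\n\n"
        f"{shots}\n\n"
        "In your response, first provide a brief explanation as to why the "
        "document belongs to a specific category and then on a new line write "
        '"Category: <category document belongs to>".\n'
        f"Document: {document}"
    )


categories = ["caselaw", "contracts", "legislation"]
# Made-up snippets for illustration only; use real labeled documents in practice.
examples = [
    ("IT IS HEREBY ORDERED that the hearing is adjourned.", "caselaw"),
    ("This Employment Agreement is entered into by and between the parties.", "contracts"),
]
few_shot_prompt = build_few_shot_prompt(
    "AGENCY: Environmental Protection Agency.", categories, examples
)
print(few_shot_prompt)
```

Keeping the final formatting instruction identical to the zero-shot template means the same response parser can be reused unchanged.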

Let’s apply the above prompt template to all documents in our dataset and form the list of prompts we want to run. For one arbitrary document, we print the actual corresponding prompt fed into the TLM below.

zero_shot_prompt_template = 'You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: {categories}. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". \nDocument: {document}'
categories = ['caselaw', 'contracts', 'legislation']
string_categories = str(categories).replace('\'', '')

# Create a DataFrame to store results and apply the prompt template to all examples
results_df = df.copy()
results_df['prompt'] = results_df['text'].apply(lambda x: zero_shot_prompt_template.format(categories=string_categories, document=x))

print(f"{results_df.at[7, 'prompt']}")
    You are an expert Legal Document Auditor. Classify the following document into a single category that best represents it. The categories are: [caselaw, contracts, legislation]. In your response, first provide a brief explanation as to why the document belongs to a specific category and then on a new line write "Category: <category document belongs to>". 
Document: UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK

UNITED STATES OF AMERICA,

v. ORDER

JOSE DELEON, 14 Cr. 28 (PGG)

Defendant.


PAUL G. GARDEPHE, U.S.D.J.:

It is hereby ORDERED that the violation of supervised release hearing currently

scheduled for January 8, 2020 is adjourned to January 15, 2020 at 3:30 p.m. in Courtroom 705

of the Thurgood Marshall United States Courthouse, 40 Foley Square, New York, New York.

Dated: New York, New York
January 8, 2020

Now we prompt the TLM and save the output responses and their associated trustworthiness scores for all examples. We recommend the try_prompt() method to run TLM over datasets with many examples.

outputs = tlm.try_prompt(results_df['prompt'].to_list())

results_df[["response","trustworthiness_score"]] = pd.DataFrame(outputs)
    Querying TLM... 100%|██████████|
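When running over many examples, some queries may fail. The sketch below assumes (this is an assumption about the output format, so check the TLM documentation) that `try_prompt()` returns one dictionary per prompt and that a failed query has `None` in its `response` and `trustworthiness_score` fields. The `outputs` list here is a hand-written stand-in, not real TLM output.

```python
# Hand-written stand-in for the list returned by tlm.try_prompt(...).
# Assumption: a failed query yields None values rather than raising an error.
outputs = [
    {"response": "Category: caselaw", "trustworthiness_score": 0.91},
    {"response": None, "trustworthiness_score": None},
    {"response": "Category: contracts", "trustworthiness_score": 0.88},
]

# Positions of prompts that did not produce a usable response.
failed_positions = [
    i for i, out in enumerate(outputs)
    if out is None or out.get("response") is None
]
print(f"{len(failed_positions)} of {len(outputs)} queries failed: {failed_positions}")
```

Prompts at these positions could be re-submitted or routed to manual review before parsing the responses.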

Parse raw LLM Responses into Category Predictions

Our prompt template asks the LLM to explain its predictions, which can boost their accuracy. We now parse out the classification prediction, which should be exactly one of the categories for each document. Because LLMs don’t necessarily follow output formatting instructions perfectly, the parser we define here can fuzzily map raw LLM outputs to predicted categories.

Optional: Define helper methods to parse categories and better display results.

import warnings

def parse_category(response: str, categories: list):
    """Takes in a response and parses out the category. If no category out of the possible `categories` is directly mentioned in the response, the category with greatest string similarity to the response is returned (along with a warning).
    This parser assumes the LLM was instructed to return "Category: <category example belongs to>" on a new line.

    Params
    ------
    response: Response from the LLM
    categories: List of possible categories examples in the dataset could be classified as
    """

    # Parse category if LLM response is properly formatted
    pattern = r'Category: (' + '|'.join(categories) + ')'
    exact_matches = re.findall(pattern, response, re.IGNORECASE)
    if len(exact_matches) > 0:
        return exact_matches[-1].lower()  # Return the last match since our zero shot prompt template asks the TLM to list the category last

    # If there are no exact matches to a specific category, return the closest category based on string similarity.
    pattern = r'Category: (.+)'
    matches = re.findall(pattern, response)
    if len(matches) > 0:
        category = matches[-1].lower()
    else:
        category = response.lower()  # If the LLM did not follow the response format we requested "Category: ..." in the prompt template, consider the whole response.

    closest_match = max(categories, key=lambda x: SequenceMatcher(None, category, x).ratio())
    similarity_score = SequenceMatcher(None, category, closest_match).ratio()
    str_warning = "matches"
    if similarity_score < 0.5:
        str_warning = "remotely matches"

    warnings.warn(f"None of the categories {str_warning} raw LLM output: {category}")
    return closest_match


def display_result(results_df: pd.DataFrame, index: int):
    """Displays the TLM result for the example from the dataset whose `index` is provided."""

    print(f"TLM predicted category: {results_df.iloc[index].predicted_category}")
    print(f"TLM trustworthiness score: {results_df.iloc[index].trustworthiness_score}\n")
    print(results_df.iloc[index].text)

results_df['predicted_category'] = results_df['response'].apply(lambda x: parse_category(x, categories))
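To see how the fuzzy fallback behaves, the standalone sketch below scores a misformatted output against each category with `difflib.SequenceMatcher`, the same similarity measure used in `parse_category`. The raw output string is a made-up example, not an actual TLM response.

```python
from difflib import SequenceMatcher

categories = ["caselaw", "contracts", "legislation"]
raw_output = "contract agreement"  # made-up LLM output that matches no category exactly

# Similarity ratio in [0, 1] between the raw output and each category name.
scores = {c: SequenceMatcher(None, raw_output, c).ratio() for c in categories}
closest = max(scores, key=scores.get)
print(scores)
print(f"Closest category: {closest}")
```

A low best score suggests the response was far from any category name, which is why the parser above warns in that case.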

Analyze Classification Results

Let’s first inspect the most trustworthy predictions from our model. We sort the TLM outputs over our documents to see which predictions received the highest trustworthiness scores.

results_df = results_df.sort_values(by='trustworthiness_score', ascending=False)
display_result(results_df, index=0)
    TLM predicted category: legislation
TLM trustworthiness score: 0.9214038143792375



ENVIRONMENTAL PROTECTION AGENCY
40 CFR Parts 86 and 600
DEPARTMENT OF TRANSPORTATION
National Highway Traffic Safety Administration
49 CFR Parts 531, 533, 537, and 538
[EPA-HQ-OAR-2009-0472; FRL-8966-9; NHTSA-2009-0059]
RIN 2060-AP58; 2127-AK90
Public Hearing Locations for the Proposed Rulemaking To Establish Light-Duty Vehicle Greenhouse Gas Emission Standards and Corporate Average Fuel Economy Standards

AGENCY:
Environmental Protection Agency (EPA) and National Highway Traffic Safety Administration (NHTSA).


ACTION:
Notice of public hearings.


SUMMARY:

EPA and NHTSA are announcing the location addresses for the joint public hearings to be held for the “Proposed Rulemaking to Establish Light-Duty Vehicle Greenhouse Gas Emission Standards and Corporate Average Fuel Economy Standards,” published in the Federal Register on September 28, 2009. This joint proposed rulemaking is consistent with the National Fuel Efficiency Policy announced by President Obama on May 19, 2009, responding to the country's critical need to address global climate change and to reduce oil consumption. As described in the joint proposed rule, EPA is proposing greenhouse gas emissions standards under the Clean Air Act, and NHTSA is proposing Corporate Average Fuel Economy standards under the Energy Policy and Conservation Act, as amended. These standards apply to passenger cars, light-duty trucks, and medium-duty passenger vehicles, covering model years 2012 through 2016, and represent a harmonized and consistent National Program. The joint proposed rule provides the dates, times, cities, instructions and other information for the public hearings and these details have not changed.


DATES:

NHTSA and EPA will jointly hold three public hearings on the following dates: October 21, 2009, in Detroit, Michigan, October 23, 2009 in New York, New York, and October 27, 2009 in Los Angeles, California. The hearings will start at 9 a.m. local time and continue until everyone has had a chance to speak. If you would like to present testimony at the public hearings, we ask that you notify the EPA and NHTSA contact persons listed under FOR FURTHER INFORMATION CONTACT at least ten days before the hearing.


ADDRESSES:
NHTSA and EPA will jointly hold three public hearings at the following locations: Detroit Metro Airport Marriott, 30559 Flynn Drive, Romulus, Michigan 48174 on October 21, 2009; New York LaGuardia Airport Marriott, 102-05 Ditmars Boulevard, East Elmhurst, New York 11369 on October 23, 2009; and Renaissance Los Angeles Airport Hotel, 9620 Airport Boulevard, Los Angeles, California 90045 on October 27, 2009. Please see the proposed rule for addresses and detailed instructions for submitting comments.


FOR FURTHER INFORMATION CONTACT:

EPA: Tad Wysor, Office of Transportation and Air Quality, Assessment and Standards Division, Environmental Protection Agency, 2000 Traverwood Drive, Ann Arbor, MI 48105; telephone number: 734-214-4332; fax number: 734-214-4816; e-mail address: wysor.tad@epa.gov, or Assessment and Standards Division Hotline; telephone number (734) 214-4636; e-mail address asdinfo@epa.gov. NHTSA: Rebecca Yoon, Office of Chief Counsel, National Highway Traffic Safety Administration, 1200 New Jersey Avenue, SE., Washington, DC 20590. Telephone: (202) 366-2992.



SUPPLEMENTARY INFORMATION:

The proposal for which NHTSA and EPA are jointly holding the public hearings was published in the Federal Register on September 28, 2009.1

The proposed rule provides the dates, times, cities, instructions for how to participate and other information on the public hearings and these details have not changed. If you would like to present testimony at the public hearings, we ask that you notify the EPA and NHTSA contact persons listed under FOR FURTHER INFORMATION CONTACT at least ten days before the hearing. See the SUPPLEMENTARY INFORMATION section on “Public Participation” in the proposed rule for more information about the public hearings.2
Also, please refer to the proposed rule for addresses and detailed instructions for submitting comments.


1 74 FR 49454, September 28, 2009.



2 74 FR 49455, September 28, 2009.

This notice of public hearings further provides the location addresses for the hearings, shown below:


October 21, 2009: Detroit Metro Airport Marriott, 30559 Flynn Drive, Romulus, Michigan 48174, 734-214-7555.

October 23, 2009: New York LaGuardia Airport Marriott, 102-05 Ditmars Boulevard, East Elmhurst, New York 11369, 718-565-8900.

October 27, 2009: Renaissance Los Angeles Airport Hotel, 9620 Airport Boulevard, Los Angeles, California 90045, 310-337-2800.

Dated: October 1, 2009.
Paul N. Argyropoulos,
Acting Director, Office of Transportation and Air Quality, Environmental Protection Agency.
Dated: October 1, 2009.
Stephen R. Kratzke,
Associate Administrator for Rulemaking, National Highway Traffic Safety Administration.


[FR Doc. E9-24159 Filed 10-5-09; 8:45 am]
BILLING CODE 6560-50-P


A document titled “Public Hearing Locations for the Proposed Rulemaking To Establish Light-Duty Vehicle Greenhouse Gas Emission Standards and Corporate Average Fuel Economy Standards” clearly pertains to a legislative measure, so it makes sense that the TLM classifies it into the “legislation” category with a high trustworthiness score.

The two documents below concern a Stock Option Grant for a company’s employee and Change in Control Employment Agreements. They are quite clearly contracts, which the TLM correctly classifies with high trustworthiness scores.

display_result(results_df, index=1)
    TLM predicted category: contracts
TLM trustworthiness score: 0.9208593621290568

Exhibit 10.1

 

Re: Stock Option Grant

In recognition of your significant responsibilities at Airgas, I am pleased to
inform you that pursuant to the Airgas, Inc. Amended and Restated 2006 Equity
Incentive Plan (the “Plan”), effective May ##, 20## you have been granted a
non-qualified stock option (the “Option”) to purchase #,### shares of common
stock, at a price of $##.## per share.

This Option is subject to the applicable terms and conditions of the Plan which
are incorporated herein by reference, and in the event of any contradiction,
distinction or differences between this letter and the terms of the Plan, the
terms of the Plan will control. Please go to
http://airnet/page.asp?7000000000610 to review the prospectus for the Plan and
the Plan.

Subject to your continued employment with the Company, the Option may be
exercised in cumulative equal installments of 25% of the shares on each of the
first four anniversaries of the date of grant. It shall terminate in full at
5:00 P.M. local Philadelphia, Pennsylvania time on August ##, 201#, unless
sooner terminated as specified in Section 6.7 of the Plan.

In order to excise your vested options, you should contact Merrill Lynch to
review your account. Arrangements will be made for withholding any taxes that
may be due with respect to such shares. For information about your account and
to exercise options, call Merrill Lynch at         -        -         to talk to
a Participant Service Representative or visit the web site at
www.benefits.ml.com.

If you have not received copies of the latest Airgas Annual Report and Proxy
Statement because you either do not own Airgas shares or you do not hold other
Airgas stock options, copies are attached hereto. You will be sent a new Annual
Report and Proxy Statement each year as they become available to Airgas
shareholders.

Your stock option is one of the ways you can participate in the long-term
success of Airgas. I wish you much success and personal fulfillment. Thanks for
your hard work and dedication.

Yours truly,
display_result(results_df, index=2)
    TLM predicted category: contracts
TLM trustworthiness score: 0.9201449921200638

Exhibit 10(k)A

SCHEDULE OF CHANGE IN CONTROL EMPLOYMENT AGREEMENTS

In accordance with the Instructions to Item 601 of Regulation S-K, the
Registrant has omitted filing Change in Control Employment Agreements by and
between P. H. Glatfelter Company and the following employees as exhibits to this
Form 10-K because they are substantially identical to the Form of Change in
Control Employment Agreement by and between P. H. Glatfelter Company and certain
employees, which is filed as Exhibit 10 (j) to our Form 10-K for the year ended
December 31, 2008.

David C. Elder

John P. Jacunski

Michael L. Korniczky

Debabrata Mukherjee

Dante C. Parrini

Martin Rapp

Mark A. Sullivan

William T. Yanavitch II

Least Trustworthy Predictions

Now let’s see which classifications predicted by the model are least trustworthy. We sort the data by trustworthiness scores in the opposite order to see which predictions received the lowest scores. Observe how model classifications with the lowest trustworthiness scores are often incorrect, corresponding to examples with vague/irrelevant text or documents possibly belonging to more than one category.

results_df = results_df.sort_values(by='trustworthiness_score')
display_result(results_df, index=0)
    TLM predicted category: contracts
TLM trustworthiness score: 0.6625658934123568

1 RENE L. VALLADARES
Federal Public Defender
2 State Bar No. 11479
KATHERINE A. TANAKA
3 Assistant Federal Public Defender
Nevada State Bar No. 14655C
4 411 E. Bonneville, Ste. 250
Las Vegas, Nevada 89101
5 (702) 388-6577/Phone
(702) 388-6261/Fax
6 Katherine_Tanaka@fd.org

7 Attorney for Joseph A. Strickland

8
UNITED STATES DISTRICT COURT
9
DISTRICT OF NEVADA
10
11 UNITED STATES OF AMERICA, Case No. 2:16-cr-155-JCM-CWH

12 Plaintiff, STIPULATION TO CONTINUE
REVOCATION HEARING
13 v.
(First Request)
14 JOSEPH A. STRICKLAND,

15 Defendant.

16
17 IT IS HEREBY STIPULATED AND AGREED, by and between Nicholas A.
18 Trutanich, United States Attorney, and Lisa Cartier-Giroux, Assistant United States Attorney,
19 counsel for the United States of America, and Rene L. Valladares, Federal Public Defender,
20 and Katherine A. Tanaka, Assistant Federal Public Defender, counsel for Joseph A. Strickland,
21 that the Revocation Hearing currently scheduled on January 9, 2020, be vacated and continued
22 to a date and time convenient to the Court, but no sooner than ninety (90) days.
23 This Stipulation is entered into for the following reasons:
24 1. Mr. Strickland’s state court proceedings are still pending, and those proceedings
25 are related to the instant matter. Defense counsel needs time to confer with Mr. Strickland
26 regarding the violations once his state case is complete.
1 2. Mr. Strickland is out of custody and does not object to the continuance.
2 3. The parties agree to the continuance.
3 This is the first request for a continuance of the revocation hearing.
4 DATED this 08 day of January, 2020.
5
6 RENE L. VALLADARES NICHOLAS A. TRUTANICH
Federal Public Defender United States Attorney
7
8
By /s/ Katherine A. Tanaka By /s/ Lisa Cartier-Giroux
9 KATHERINE A. TANAKA LISA CARTIER-GIROUX
Assistant Federal Public Defender Assistant United States Attorney
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
2
1 UNITED STATES DISTRICT COURT

2 DISTRICT OF NEVADA

3
UNITED STATES OF AMERICA, Case No. 2:16-cr-155-JCM-CWH
4
Plaintiff, ORDER
5
v.
6
JOSPEH R. STRICKLAND,
7
Defendant.
8
9
10 IT IS THEREFORE ORDERED that the revocation hearing currently scheduled for

11 April 9, 2020
Thursday, January 9, 2020, at 10:00 a.m., be vacated and continued to ________________ at

12 10 30 __.m.
the hour of ___:___ a
13 8th day of January, 2020.
DATED this ___

14
15
UNITED STATES DISTRICT JUDGE
16
17
18
19
20
21
22
23
24
25
26
3

This is a case between Joseph A. Strickland and the United States of America. Here the LLM miscategorized this document as a contract when it should be caselaw. The document contains a stipulation agreed to by counsel for both parties, so perhaps that language of agreement confused the model.

display_result(results_df, index=1)
    TLM predicted category: legislation
TLM trustworthiness score: 0.6705137038662226

Case 2:20-cv-00012 Document 1 Filed 01/08/20 Page 1 of 5 PageID 1




UNITED STATES DISTRICT COURT
MIDDLE DISTRICT OF FLORIDA
FORT MYERS DIVISION

MERCEDES MUNOZ,

Plaintiff,

vs. Case No.:
SOPHIA OF GATEWAY, LLC d/b/a
SUBWAY, and MOHAMMAD
SULEMAN, Individually.

Defendants.
COMPLAINT AND DEMAND FOR JURY TRIAL

Plaintiff, MERCEDES MUNOZ (“Munoz” or “Plaintiff”), by and through her

undersigned attorneys, sues Defendants, SOPHIA OF GATEWAY, LLC d/b/a Subway,

a Florida Limited Liability Company (“Subway”), and MOHAMMAD SULEMAN

(“SULEMAN”), individually, (collectively referred to as “Defendants”) and states as

follows:
NATURE OF ACTION

1. Plaintiff brings this action against Defendants, her former employer, for

failure to pay overtime compensation in violation of the Fair Labor Standards Act of 1938,
29 U.S.C. §§ 201-219 (“FLSA”).
JURISDICTION

2. This Court has jurisdiction over this controversy pursuant to 29 U.S.C.

§ 216(b), 28 U.S.C. § 1331, and 28 U.S.C. § 1343.

3. Defendants are subject to the personal jurisdiction of the United States

District Court because they engage in substantial and not isolated activity within this

judicial district.



[1]
Case 2:20-cv-00012 Document 1 Filed 01/08/20 Page 2 of 5 PageID 2




4. Defendants are also subject to the personal jurisdiction of the United States

District Court because they operate, conduct, engage in, and/or carry on business in the

Middle District of Florida. Specifically, Defendant Subway’s principal place of business is

located at 12030 Fairway Isle Drive, Fort Myers, Florida 33913, and SULEMAN is a

resident of Lee County.
FLSA COVERAGE

5. At all times material, Defendants employed at least two or more employees

who handled, sold, or otherwise worked with goods or materials that had once moved

through interstate commerce.

6. At all times material, Defendants had gross sales volume of at least

$500,000 annually.

7. At all times material, Defendants have and continue to be an “enterprise

engaged in commerce” within the meaning of the FLSA.

8. By virtue of having held and/or exercised the authority to: (a) hire and fire

employees of Subway; (b) determine the work schedules for the employees of Subway; and

(c) control the finances and operations of Subway, Defendant SULEMAN, is an employer

as defined by 29 U.S.C. §201 et. seq.

9. At all times material to this action, Plaintiff was an “employee” of

Defendants within the meaning of the FLSA.

10. At all times material to this action, Defendants were Plaintiff’s

“employers” within the meaning of the FLSA.

11. At all times material hereto, the work performed by the Plaintiff was

essential to the business performed by Defendants.




[2]
Case 2:20-cv-00012 Document 1 Filed 01/08/20 Page 3 of 5 PageID 3




VENUE

12. Venue is proper in the United States District Court for the Middle District

of Florida based upon the following:

a. The unlawful pay practices alleged herein occurred in Naples, Florida,

in the Middle District of Florida;

b. At all times material hereto, Defendant Subway was and continues to be

a Florida Limited Liability Company registered with the Florida

Department of Corporations, with a Florida Registered Agent and a

license to do business within this judicial district; and

c. Defendants employed Plaintiff in the Middle District of Florida.
PARTIES

13. At all times material hereto, Plaintiff was a resident of Collier County,

Florida, in the Middle District of Florida.

14. Defendant Subway was, and continues to be, a Florida Limited Liability

Company engaged in the transaction of business in Lee and Collier Counties, Florida, with

its principal place of business located in Fort Myers, Florida.

15. Upon information and belief, at all times material to this action, Defendant
SULEMAN was, and continues to be, a resident of Lee County, Florida.
STATEMENT OF CLAIM
COUNT I
VIOLATION OF 29 U.S.C. § 207 (UNPAID OVERTIME)

16. Plaintiff realleges Paragraphs 1 through 15 as if fully stated herein.

17. At all times material hereto, Defendants owned and operated three

Subway franchise locations in Lee and Collier Counties.

18. Defendants hired Plaintiff on or around October 9, 2017. Plaintiff was hired




[3]
Case 2:20-cv-00012 Document 1 Filed 01/08/20 Page 4 of 5 PageID 4




by Defendants to work in the Subway store located at 50 Wilson Boulevard S, Naples, FL

34117.

19. Plaintiff’s employment ceased on or about December 24, 2019.

20. While employed, Plaintiff was paid on an hourly basis.

21. Beginning with the date of Plaintiff’s hire in October 2017 and continuing

through December 24, 2019, Plaintiff worked hours and workweeks in excess of forty (40)

per week for which she was not compensated at the statutory rate of time and one-half the

regular rate for all hours actually worked.

22. Plaintiff is entitled to be paid at the rate of time and one-half her regular

hourly rate for all hours worked in excess of the maximum hours provided for in the FLSA.

23. Defendants failed to pay Plaintiff overtime compensation in the lawful

amount for all hours worked in excess of the maximum hours provided for in the FLSA.

24. Defendants knew of and/or showed a willful disregard for the provisions of

the FLSA as evidenced by Defendants’ failure to compensate Plaintiff at the statutory rate

of time and one-half for the hours worked in excess of forty (40) hours per week when

Defendants knew or should have known such was due.

25. As a direct and proximate result of Defendants’ willful disregard of the

FLSA, Plaintiff is also entitled to liquidated damages pursuant to the FLSA.

26. Due to the unlawful acts of Defendants, Plaintiff has suffered damages in

the form of unpaid overtime wages, plus liquidated damages in an equal amount.

27. Plaintiff is entitled to an award of his reasonable attorney’s fees and costs

pursuant to 29 U.S.C. § 216(b).




[4]
Case 2:20-cv-00012 Document 1 Filed 01/08/20 Page 5 of 5 PageID 5




WHEREFORE, Plaintiff respectfully requests that final judgment be entered in

her favor against Defendants as follows:

a. Declaring that Defendants have violated the maximum hour provisions

of 29 U.S.C. § 207;

b. Awarding Plaintiff overtime compensation in amounts according to

proof;

c. Awarding Plaintiff liquidated damages in an equal amount to unpaid

overtime;

d. Awarding Plaintiff reasonable attorney’s fees and costs and expenses of

this litigation pursuant to 29 U.S.C. § 216(b);

e. Awarding Plaintiff post-judgment interest; and

f. Ordering any other and further relief this Court deems to be just and

proper.
JURY DEMAND
Plaintiff demands trial by jury on all issues so triable as of right.

Dated: January 8, 2020
Respectfully submitted,
By: /s/ Jason L. Gunter
Jason L. Gunter
Fla. Bar No. 0134694
Conor P. Foley
Fla. Bar No. 111977
GUNTERFIRM
1514 Broadway, Suite 101
Fort Myers, FL 33901
Phone: (239) 334–7017
Fax: (239) 236–8008
Email: Jason@Gunterfirm.com
Email: Conor@Gunterfirm.com



[5]

This document is also clearly caselaw, but the model predicted legislation, perhaps confused by the complaint’s repeated references to statutory provisions such as the FLSA.

display_result(results_df, index=3)
    TLM predicted category: contracts
TLM trustworthiness score: 0.7303548621607547

 

[exaa_001.jpg]
[exaa_002.jpg]
[exaa_003.jpg]
[exaa_004.jpg]
[exaa_005.jpg]
[exaa_006.jpg]
[exaa_007.jpg]
[exaa_008.jpg]
[exaa_009.jpg]
[exaa_010.jpg]
[exaa_011.jpg]
[exaa_012.jpg]
[exaa_013.jpg]
[exaa_014.jpg]
[exaa_015.jpg]
[exaa_016.jpg]
[exaa_017.jpg]
[exaa_018.jpg]

 

 

 



 

This document clearly does not belong in any of the three categories, as it is just a series of image file names. It makes sense that the TLM is unsure which category to assign.

Low trustworthiness scores like this can also help us identify examples that confuse the LLM and catch its mistakes. Without reliable trustworthiness scores, we don’t know when we can rely on AI and when we cannot.

How to use Trustworthiness Scores?

If you have time/resources, your team can manually review the LLM classifications of low-trustworthiness responses and provide a better human classification instead. If not, you can determine a trustworthiness threshold below which responses seem too unreliable to use, and have the model abstain from predicting in such cases (i.e. outputting “I don’t know” instead).
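The abstention idea can be sketched as follows. The toy DataFrame stands in for `results_df`, and both its scores and the 0.8 threshold are made-up values chosen only for illustration; a suitable threshold depends on your application.

```python
import pandas as pd

# Toy stand-in for results_df; the scores and the threshold are made-up values.
toy_results = pd.DataFrame({
    "predicted_category": ["caselaw", "contracts", "legislation", "contracts"],
    "trustworthiness_score": [0.92, 0.66, 0.85, 0.73],
})
threshold = 0.8  # hypothetical; pick a value suited to your application

# Keep the model's prediction only when its score clears the threshold,
# otherwise abstain by outputting "I don't know".
toy_results["final_category"] = toy_results["predicted_category"].where(
    toy_results["trustworthiness_score"] >= threshold, other="I don't know"
)
print(toy_results)
```

Rows marked "I don't know" are natural candidates for manual review.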

The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend choosing thresholds on an application-specific basis. First consider the relative trustworthiness levels between different data points before considering the overall magnitude of these scores for individual data points.

Measuring Classification Accuracy with Ground Truth Labels

Our example dataset happens to have labels for each document, so we can load them in to assess the accuracy of our model predictions. We’ll study the impact on accuracy as we abstain from making predictions for examples receiving lower trustworthiness scores.

!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/zero_shot_labels.csv'
df_ground_truth = pd.read_csv('zero_shot_labels.csv')
df = pd.merge(results_df, df_ground_truth, on=['index'], how='outer')
df['is_correct'] = df['type'] == df['predicted_category']

df.head()
index text prompt response trustworthiness_score predicted_category type is_correct
0 0 Probl2B\n0/NV Form\nRev. June 2014\n\n\n\n ... You are an expert Legal Document Auditor. Clas... Category: legislation\n\nExplanation: This doc... 0.840864 legislation caselaw False
1 1 UNITED STATES DI... You are an expert Legal Document Auditor. Clas... Category: caselaw 0.906702 caselaw caselaw True
2 2 \n \n FEDERAL COMMUNICATIONS COMMI... You are an expert Legal Document Auditor. Clas... Category: Legislation\n\nExplanation: This doc... 0.854045 legislation legislation True
3 3 \n \n DEPARTMENT OF COMMERCE\n ... You are an expert Legal Document Auditor. Clas... Category: Legislation\n\nExplanation: This doc... 0.913332 legislation legislation True
4 4 EXHIBIT 10.14\n\nAMENDMENT NO. 1 TO\n\nCHANGE ... You are an expert Legal Document Auditor. Clas... Category: Contracts 0.887005 contracts contracts True
print('TLM zero-shot classification accuracy over all documents: ', df['is_correct'].sum() / df.shape[0])
    TLM zero-shot classification accuracy over all documents:  0.7784431137724551

Next suppose we instead abstain from making predictions on 20% of the documents flagged with the lowest trustworthiness scores (e.g. having experts manually categorize these documents instead).

quantile = 0.2  # Play with value to observe the accuracy vs. number of abstained examples tradeoff

filtered_df = df[df['trustworthiness_score'] > df['trustworthiness_score'].quantile(quantile)]
acc = filtered_df['is_correct'].sum() / filtered_df.shape[0]
print(f'TLM zero-shot classification accuracy over the documents within the top-{(1-quantile) * 100}% of trustworthiness scores: {acc}')
    TLM zero-shot classification accuracy over the documents within the top-80.0% of trustworthiness scores: 0.8195488721804511

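This shows the benefit of considering the TLM’s trustworthiness score for zero-shot classification over having to rely on results from a standard LLM: abstaining on the 20% of documents with the lowest scores raises accuracy on the remaining documents from about 77.8% to about 82.0%.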
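To explore the tradeoff between accuracy and the number of abstained examples more systematically, one could sweep over several quantiles. The sketch below uses a small made-up DataFrame in place of the merged `df`, so its numbers are illustrative only. It also uses `>=` rather than `>` when filtering, an assumption made here so that a quantile of 0 keeps every example.

```python
import pandas as pd

# Toy stand-in for the merged DataFrame `df` above (values are made up).
toy_df = pd.DataFrame({
    "trustworthiness_score": [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50],
    "is_correct": [True, True, True, True, True, False, True, False, False, False],
})

rows = []
for quantile in [0.0, 0.2, 0.4, 0.6]:
    # Score below which predictions are discarded (abstained on).
    cutoff = toy_df["trustworthiness_score"].quantile(quantile)
    kept = toy_df[toy_df["trustworthiness_score"] >= cutoff]
    rows.append({
        "abstain_fraction": quantile,
        "num_kept": len(kept),
        "accuracy": kept["is_correct"].mean(),
    })

tradeoff = pd.DataFrame(rows)
print(tradeoff)
```

Applied to real results, a table like this can help pick the abstention level that balances accuracy against the manual review workload.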