Scoring the Trustworthiness of Structured Outputs
This tutorial demonstrates how to score the trustworthiness of structured outputs from LLMs using the OpenAI Chat Completions API. With minimal code changes, you can evaluate both an overall trustworthiness score for each LLM response and granular per-field scores for each component of structured responses (e.g., individual fields in JSON/dictionary outputs).
Before starting this tutorial, we recommend you first complete our basic tutorial on Using TLM with the Chat Completions API.
Setup
This tutorial requires a TLM API key. Get one here. While this tutorial uses your own OpenAI account to generate structured outputs, you can alternatively use TLM without an OpenAI account to both generate structured outputs and score their trustworthiness.
The Python packages required for this tutorial can be installed using pip:
%pip install --upgrade cleanlab-tlm openai
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<Cleanlab TLM API key>" # Get your free API key from: https://tlm.cleanlab.ai/
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>" # For using OpenAI client library to generate structured outputs, you can use TLM for this instead too
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
from pydantic import create_model
from typing import Optional
import pandas as pd
import time
from tqdm import tqdm
from cleanlab_tlm.utils.chat_completions import TLMChatCompletion
Fetch Dataset: PII Extraction
This tutorial uses a PII (Personally Identifiable Information) extraction dataset.
Each text sample contains various types of personal information embedded within natural language text. The task is to extract different categories of PII from the text. Each example contains multiple types of PII that need to be identified and classified into specific categories including names (FIRSTNAME, LASTNAME), dates (DATE), and account numbers (ACCOUNTNUMBER).
Let’s take a look at the dataset below:
wget -nc https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/pii_extraction.csv
data = pd.read_csv("pii_extraction.csv")
data.head()
|   | source_text |
|---|---|
| 0 | We would like to do a follow-up meeting with S... |
| 1 | Melvin, the password of your study support acc... |
| 2 | Americo, we need a report on how seasonality a... |
| 3 | Dear parents, our annual school trip is schedu... |
| 4 | In relation to the filed litigation, we hereby... |
Obtain LLM Predictions
Define structured output schema
We know the 4 PII fields we want to extract are: ['FIRSTNAME', 'LASTNAME', 'DATE', 'ACCOUNTNUMBER']
Using that, we can create a Pydantic model to represent our PII extraction schema. Each field is optional and can be None if that entity type is not found in the text:
pii_entities = ['FIRSTNAME', 'LASTNAME', 'DATE', 'ACCOUNTNUMBER']
fields = {name: (Optional[str], None) for name in pii_entities}
PII = create_model("PII", **fields)
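For reference, the dynamically created model above is equivalent to writing the schema out by hand. Here is a minimal sketch of the equivalent static declaration (shown only for clarity; the tutorial uses `create_model` so the schema can be built from any list of entity names):

from pydantic import BaseModel

class PII(BaseModel):
    FIRSTNAME: Optional[str] = None
    LASTNAME: Optional[str] = None
    DATE: Optional[str] = None
    ACCOUNTNUMBER: Optional[str] = None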
Prompt OpenAI for responses + TLM for trust scores
Here, we utilize OpenAI’s Chat Completions API to extract PII from the text. We also add a TLM decorator to that call, which automatically uses TLM to evaluate the trustworthiness of those extractions and appends the trust score to the response returned by OpenAI.
The decorator allows us to first use OpenAI to identify and parse PII, then use TLM to assess the reliability of each extracted field, all with minimal setup. For more information, view our TLM for Chat Completions tutorial.
If you don’t have an OpenAI account, you can use your TLM account to both generate the structured outputs and score their trustworthiness, as shown here.
import functools

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator
First initialize a `TLMChatCompletion` object (note that we specify `per_field_score` in the `log` option to obtain granular trust scores for each field in the structured output), then we can decorate our OpenAI Chat Completions function:
tlm = TLMChatCompletion(options={"log": ["per_field_score"]})
client = OpenAI()
client.chat.completions.parse = add_trust_scoring(tlm)(client.chat.completions.parse)
After you decorate OpenAI’s Chat Completions function, all of your existing Chat Completions API code will automatically compute trust scores as well (no changes needed to the rest of your code). Let’s run OpenAI on one text sample to generate structured outputs and score their trustworthiness with TLM:
sample_text = data["source_text"][0]
sample_text
completion = client.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": f"Extract PII information from the following text, return null if the entity is not found: {sample_text}"}
    ],
    response_format=PII,
)
The returned object matches what OpenAI would ordinarily return, except it has an additional `tlm_metadata` field from TLM with extra information like the trustworthiness score and per-field scores. Our decorated function serves as a drop-in replacement for OpenAI in any application (and will still return the same responses you’d get directly from OpenAI alone).
print(f"Extracted PII Information: {completion.choices[0].message.parsed}")
print(f"Trustworthiness Score: {completion.tlm_metadata['trustworthiness_score']:.4f}")
print(f"Per-field Trustworthiness Scores: {completion.tlm_metadata['log']['per_field_score']}")
Run a dataset of many examples
Here, we define a quick helper function to process multiple text samples in parallel, which speeds up prompting the LLM over a dataset. The helper function also collects the LLM outputs and trustworthiness scores into a formatted DataFrame for easy downstream analysis.
def extract_pii(text):
    tlm = TLMChatCompletion(quality_preset="medium", options={"log": ["per_field_score"]})
    client = OpenAI()
    client.chat.completions.parse = add_trust_scoring(tlm)(client.chat.completions.parse)

    completion = client.chat.completions.parse(
        model="gpt-4.1-mini",
        messages=[
            {"role": "user", "content": f"Extract PII information from the following text, return null if the entity is not found: {text}"}
        ],
        response_format=PII,
    )

    return {
        "raw_completion": completion,
        # the columns below extract the PII information and scores from the raw OpenAI response
        "extracted_pii": completion.choices[0].message.parsed,
        "trustworthiness_score": completion.tlm_metadata["trustworthiness_score"],
        "per_field_score": completion.tlm_metadata["log"]["per_field_score"],
    }
def extract_pii_batch(texts, batch_size=15, max_threads=8, sleep_time=2):
    results = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i + batch_size]
        with ThreadPoolExecutor(max_threads) as executor:
            futures = [executor.submit(extract_pii, text) for text in batch]
            batch_results = [f.result() for f in futures]
        results.extend(batch_results)

        # sleep to prevent hitting rate limits
        if i + batch_size < len(texts):
            time.sleep(sleep_time)

    return pd.DataFrame(results)
results = extract_pii_batch(data["source_text"])
results.head(2)
|   | raw_completion | extracted_pii | trustworthiness_score | per_field_score |
|---|---|---|---|---|
| 0 | ParsedChatCompletion[PII](id='chatcmpl-CJSmu5J... | FIRSTNAME='Sierra' LASTNAME='Green' DATE='Augu... | 0.98902 | {'ACCOUNTNUMBER': {'explanation': 'There is no... |
| 1 | ParsedChatCompletion[PII](id='chatcmpl-CJSmulq... | FIRSTNAME='Melvin' LASTNAME=None DATE=None ACC... | 0.96906 | {'ACCOUNTNUMBER': {'explanation': 'The text do... |
Examine Results
We’ve now generated structured outputs (i.e., extracted data) for each text sample in the dataset and scored the trustworthiness of each output.
pd.set_option('display.max_colwidth', None)
combined_results = pd.concat([data, results], axis=1)
High Trustworthiness Scores
The responses with the highest trustworthiness scores represent texts where TLM is most confident in the accuracy of your LLM’s structured outputs.
Looking at the examples below with high trustworthiness scores, we can see that your OpenAI model successfully extracted the correct PII elements in these text samples:
combined_results.sort_values("trustworthiness_score", ascending=False).head(3)[["source_text", "extracted_pii", "trustworthiness_score"]]
|   | source_text | extracted_pii | trustworthiness_score |
|---|---|---|---|
| 0 | We would like to do a follow-up meeting with Sierra Green regarding her recent surgery. The proposed date is August 13, 2013 at our clinic in West Nash. | FIRSTNAME='Sierra' LASTNAME='Green' DATE='August 13, 2013' ACCOUNTNUMBER=None | 0.98902 |
| 9 | Pinkie, our Customer Brand Engineer noticed unusual traffic from 197.30.116.133 on our site https://tattered-past.org. Please investigate. | FIRSTNAME='Pinkie' LASTNAME=None DATE=None ACCOUNTNUMBER=None | 0.98902 |
| 27 | Patient Fredrick with insurance account 22661006 and SSN 756.9719.4002 is scheduled for general cleaning on 12/05/1973. Please send a confirmation text message to his mobile number +736 435 268.6135 today. | FIRSTNAME='Fredrick' LASTNAME=None DATE='12/05/1973' ACCOUNTNUMBER='22661006' | 0.98902 |
Low Trustworthiness Scores
The lowest trustworthiness scores reveal the LLM outputs that TLM is least confident are accurate. Outputs with low trustworthiness scores benefit most from manual review, especially if you need nearly all outputs across the dataset to be correct while minimizing human review costs (a short sketch after the table below shows one way to flag such rows programmatically).
The LLM outputs with the lowest trustworthiness scores in this dataset are shown below; these extractions are often incorrect or ambiguous, warranting further review.
combined_results.sort_values("trustworthiness_score").head(3)[["source_text", "extracted_pii", "trustworthiness_score"]]
|   | source_text | extracted_pii | trustworthiness_score |
|---|---|---|---|
| 16 | To: Maximillian Noah Moore, we forgot to update your record with phone IMEI: 30-265288-033265-8. Could you please provide it in your earliest convenience to keep your records updated. | FIRSTNAME='Maximillian' LASTNAME='Moah' DATE=None ACCOUNTNUMBER=None | 0.242112 |
| 24 | Loma, your son's eye color, Eye color: Brown, is quite unique. It's beautiful! | FIRSTNAME=None LASTNAME='Loma' DATE=None ACCOUNTNUMBER=None | 0.320759 |
| 12 | Is your business tax-ready? Our team in Novato is here to help you navigate through Martinique's complex tax rules. Contact us at 56544500. | FIRSTNAME=None LASTNAME=None DATE=None ACCOUNTNUMBER='56544500' | 0.595425 |
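Rather than eyeballing the lowest-scoring rows, you can flag every extraction whose score falls below a chosen threshold for human review. Here is a minimal sketch (the 0.8 cutoff is an assumed example value, not a recommendation; tune it to your own accuracy vs. review-cost tradeoff):

REVIEW_THRESHOLD = 0.8  # assumed example cutoff, tune for your application
needs_review = combined_results[combined_results["trustworthiness_score"] < REVIEW_THRESHOLD]
print(f"{len(needs_review)} of {len(combined_results)} extractions flagged for review")
needs_review[["source_text", "extracted_pii", "trustworthiness_score"]]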
Obtaining Trust Scores for Individual Fields
Beyond TLM’s overall trustworthiness score, you can obtain granular confidence scores for each individual field in the structured output from your LLM. These field-level scores help you pinpoint which specific values may be incorrect or warrant focused review.
Let’s look at the text sample receiving the lowest trustworthiness score in this dataset:
lowest_scoring_text = combined_results.loc[combined_results['trustworthiness_score'].idxmin()]
print(f"Text: {lowest_scoring_text['source_text']}")
print(f"Extracted PII Information: {lowest_scoring_text['extracted_pii']}")
print(f"Trustworthiness Score: {lowest_scoring_text['trustworthiness_score']}")
print(f"Per-field Trustworthiness Scores: {lowest_scoring_text['per_field_score']}")
The `per_field_score` dictionary contains a granular confidence score and explanation for each extracted field. Since this dictionary can be overwhelming, we provide a `get_untrustworthy_fields()` method that:
- Prints detailed information about low-confidence fields
- Returns a list of fields that may need manual review due to low trust scores
untrustworthy_fields = tlm.get_untrustworthy_fields(tlm_result=lowest_scoring_text['raw_completion'])
This method returns a list of fields whose confidence score is low, allowing you to focus manual review on the specific fields whose extracted values are untrustworthy.
untrustworthy_fields
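To act on this list, you can pair each flagged field name with the value your LLM extracted for it, yielding a focused review queue. A minimal sketch, assuming (as shown above) that `untrustworthy_fields` is a list of field-name strings:

# pair each flagged field with its extracted value for targeted manual review
parsed = lowest_scoring_text["raw_completion"].choices[0].message.parsed
for field in untrustworthy_fields:
    print(f"{field}: extracted value {getattr(parsed, field)!r} -> flag for manual review")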