Information Extraction from Documents using the Trustworthy Language Model
This tutorial demonstrates how to use Cleanlab’s Trustworthy Language Model (TLM) to reliably extract data from unstructured text documents.
You can use TLM like any other LLM: prompt it with the document text as context, along with an instruction detailing what information should be extracted and in what format. While today’s GenAI and LLMs demonstrate promise for such information extraction, the technology remains fundamentally unreliable. LLMs may extract completely wrong (hallucinated) values in certain edge cases, and existing LLMs give you no indication when this happens. Cleanlab’s TLM additionally provides a trustworthiness score alongside the extracted information to indicate how confident we can be in its accuracy. TLM trustworthiness scores offer state-of-the-art automation for catching badly extracted information before it harms downstream processes.
Setup
This tutorial requires a TLM API key, which you can get at https://tlm.cleanlab.ai/. We recommend completing the quickstart tutorial first.
The Python packages required for this tutorial can be installed using pip:
%pip install --upgrade cleanlab-tlm "unstructured[pdf]==0.13.2" "pillow-heif==0.16.0" "pdfminer.six==20240706"
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from IPython.display import display, IFrame
from cleanlab_tlm import TLM
pd.set_option('display.max_colwidth', None)
Document Dataset
The ideas demonstrated in this tutorial apply to arbitrary information extraction tasks involving any type of document. The particular documents used in this tutorial are a collection of datasheets for electronic parts, stored in PDF format. These datasheets contain technical specifications and application guidelines for the electronic products, serving as an important guide for users of these products. Let’s download the data and look at an example datasheet.
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/electronics-datasheets/electronics-datasheets.zip
mkdir datasheets/
unzip -q electronics-datasheets.zip -d datasheets/
Here’s the first page of a sample datasheet (13.pdf):
This document contains many details, so it might be tough to extract one particular piece of information. Datasheets may use complex and highly technical language, while also varying in their structure and organization of information.
Hence, we’ll use the Trustworthy Language Model (TLM) to automatically extract key information about each product from these datasheets. TLM also provides a trustworthiness score to quantify how confident we can be that the right information was extracted, which allows us to catch potential errors in this process at scale.
Convert PDF Documents to Text
Our documents are PDFs, but TLM requires text inputs (like any other LLM). We can use the open-source unstructured library to extract the text from each PDF.
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
def extract_text_from_pdf(filename):
    # Partition the PDF into elements, then join their text line by line
    elements = partition_pdf(filename)
    text = "\n".join([str(el) for el in elements])
    return text
directory = "datasheets/"
all_files = os.listdir(directory)
pdf_files = sorted([file for file in all_files if file.endswith('.pdf')])
pdf_texts = [extract_text_from_pdf(os.path.join(directory, file)) for file in pdf_files]
Here is a sample of the extracted text:
print(pdf_texts[0])
Extract Information using TLM
Let’s initialize TLM, here using the default configuration settings.
tlm = TLM() # See Advanced Tutorial for optional TLM configurations to get better/faster results
We’ll use the following prompt template that instructs our model to extract the operating voltage of each electronics product from its datasheet:
Please reference the provided datasheet to determine the operating voltage of the item.
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet: <insert-datasheet>
Some of these datasheets are very long (over 50 pages), and TLM might not have the ability to ingest large inputs (contact us if you require larger context windows: sales@cleanlab.ai). For this tutorial, we’ll limit the input datasheet text to only the first 10,000 characters. Most datasheets summarize product technical details in the first few pages, so this should not be an issue for our information extraction task. If you already know roughly where in your documents the relevant information lies, you can save cost and runtime by only including text from the relevant part of the document in your prompts rather than the whole thing.
prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item.
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet:
"""
texts_with_prompt = [prompt_template + text[:10000] for text in pdf_texts]
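If you know which keywords mark the relevant part of your documents, the idea of sending only that part can be sketched as below. This is a minimal illustration, not part of TLM or of the tutorial’s original code: the helper name `select_relevant_text`, the keyword list, and the `window` size are all assumptions made for this example. It falls back to the first 10,000 characters when no keyword is found.

```python
import re

def select_relevant_text(text, keywords, window=2000, max_chars=10000):
    """Return the text around the first keyword match, or the document head if none match.

    Hypothetical helper: `keywords` must be a non-empty list of strings.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        # No keyword found: fall back to keeping only the start of the document
        return text[:max_chars]
    # Keep `window` characters on each side of the match, clipped to the document
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    return text[start:end][:max_chars]
```

With the variables defined above, one could build prompts via `prompt_template + select_relevant_text(text, ["operating voltage", "supply voltage"])` instead of `prompt_template + text[:10000]`. Whether this helps depends on how consistently your documents label the field of interest.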
After forming our prompts for each datasheet, let’s run the prompts through TLM. We can run all prompts simultaneously in a batch.
TLM will return a list of dictionaries, with each dictionary containing the corresponding response and trustworthiness score, which quantifies how confident you can be that the response is correct.
tlm_response = tlm.prompt(texts_with_prompt)
Let’s organize the extracted information (TLM responses) and trustworthiness scores for each input PDF, along with its filename.
results_df = pd.DataFrame({
    "filename": pdf_files,
    "response": [d["response"] for d in tlm_response],
    "trustworthiness_score": [d["trustworthiness_score"] for d in tlm_response]
})
results_df.head()
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 0 | 1.pdf | Operating Voltage: 5.5V DC | 0.997596 |
| 1 | 10.pdf | Operating Voltage: 2.7 - 5.5V DC | 0.992541 |
| 2 | 11.pdf | Operating Voltage: 1.8 - 5.5V DC | 0.985845 |
| 3 | 12.pdf | Operating Voltage: 0.9 - 1.6V DC | 0.987066 |
| 4 | 13.pdf | Operating Voltage: 1.8 - 5.5V | 0.990618 |
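Because the prompt requests a fixed response format, downstream code can turn each response string into numbers. The following is a minimal sketch of such a parser, assuming responses follow the format requested in the prompt; the function name `parse_operating_voltage` and the regular expression are illustrative assumptions, not part of TLM. Responses that deviate from the format (including "N/A" and multi-range answers) yield `None`, so they can be handled separately.

```python
import re

# Matches e.g. "Operating Voltage: 2.7 - 5.5V DC" or "Operating Voltage: 12V DC"
VOLTAGE_PATTERN = re.compile(
    r"Operating Voltage:\s*"
    r"(?P<low>\d+(?:\.\d+)?)"                 # first (or only) value
    r"(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?"    # optional upper bound of a range
    r"\s*V\s*(?P<kind>AC|DC)?"                # unit, then optional AC/DC
)

def parse_operating_voltage(response):
    """Parse a response into (low, high, kind); return None for N/A or unexpected formats."""
    match = VOLTAGE_PATTERN.fullmatch(response.strip())
    if match is None:
        return None
    low = float(match.group("low"))
    # A single value is treated as a degenerate range (low == high)
    high = float(match.group("high")) if match.group("high") else low
    return low, high, match.group("kind")

print(parse_operating_voltage("Operating Voltage: 2.7 - 5.5V DC"))  # (2.7, 5.5, 'DC')
print(parse_operating_voltage("Operating Voltage: N/A"))            # None
```

Returning `None` rather than raising keeps a batch run going; unparseable responses are natural candidates for manual review alongside low-trust ones.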
Examine Results
TLM has extracted the product’s operating voltage from each datasheet, along with a trustworthiness_score indicating its confidence. Let’s now examine the results in more detail.
High Trustworthiness Scores
The responses with the highest trustworthiness scores represent datasheets where TLM is the most confident that it has accurately extracted the operating voltage of the product.
results_df.sort_values("trustworthiness_score", ascending=False).head()
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 29 | 36.pdf | Operating Voltage: N/A | 1.000000 |
| 12 | 20.pdf | Operating Voltage: 3.3V DC | 1.000000 |
| 37 | 43.pdf | Operating Voltage: 1.8 - 3.6V DC | 0.999889 |
| 33 | 4.pdf | Operating Voltage: 19.2 - 28.8V DC | 0.999249 |
| 6 | 15.pdf | Operating Voltage: 12V DC | 0.998750 |
Let’s look at an example response with one of our highest trustworthiness scores.
results_df.loc[results_df["filename"] == "31.pdf"]
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 24 | 31.pdf | Operating Voltage: 1.71 - 3.6V DC | 0.994991 |
Below we show part of a datasheet for which the LLM extraction received a high trustworthiness score, document 31.pdf.
From this image (you can also find this on page 1 of the original file, 31.pdf), we see that it specifies the operating voltage range to be 1.71 VDC to 3.6 VDC. This matches the information extracted by TLM. When the trustworthiness scores are high, you can trust the results from TLM with great confidence!
Low Trustworthiness Scores
The responses with the lowest trustworthiness scores represent datasheets where you should be least confident in the LLM extractions.
results_df.sort_values("trustworthiness_score").head()
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 20 | 28.pdf | Operating Voltage: 2.81 - 3.15V DC | 0.273827 |
| 64 | 68.pdf | Operating Voltage: 1.5 - 3.8V DC | 0.321355 |
| 73 | 76.pdf | Operating Voltage: 18 - 36 VDC | 0.336443 |
| 7 | 16.pdf | Operating Voltage: 2.3 - 2.7V DC, 4.6 - 5.5V DC, 6.9 - 8.1V DC | 0.367804 |
| 14 | 22.pdf | Operating Voltage: 6.3 - 450 V DC | 0.438818 |
Let’s zoom in on one example where the LLM extraction received a low trustworthiness score.
results_df.loc[results_df["filename"] == "37.pdf"]
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 30 | 37.pdf | Operating Voltage: 3.0 - 3.3V DC | 0.488113 |
The LLM extracted an operating voltage of 3.0 - 3.3V DC for datasheet 37.pdf, which received a low trustworthiness score.
We depict part of this datasheet below (you can find this information on page 7 of 37.pdf). Reviewing the document, we see that this table lists only the product’s supply voltage and backlight supply voltage; it does not mention the operating voltage. The LLM failed to extract the correct information (or to realize that this information is not available), and hence returned the wrong response. At least you automatically get a low trustworthiness score for this response, allowing you to catch this incorrect LLM output.
For datasheet 68.pdf, the LLM output also received a low trustworthiness score. The LLM output was:
Operating Voltage: 1.5 - 2.6V DC
Below we review part of the corresponding datasheet (the table on page 2 of 68.pdf).
We see that there are many different products here with a wide range of operating voltages (some of which are not in the 1.5 - 2.6V range). Again, the low trustworthiness scores automatically help you discover that this example requires further review.
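One way to act on these scores is to accept high-trust extractions automatically and queue the rest for manual review. Below is a minimal sketch of that routing using plain Python records shaped like the rows of the results table. The function name `split_by_trust` and the threshold of 0.8 are assumptions for illustration only; an appropriate threshold depends on your data and how costly errors are in your application.

```python
def split_by_trust(records, threshold=0.8):
    """Split extraction records into (auto_accepted, needs_review) by trustworthiness score."""
    auto_accepted, needs_review = [], []
    for record in records:
        score = record["trustworthiness_score"]
        # Treat a missing score as untrusted so it is never silently accepted
        if score is not None and score >= threshold:
            auto_accepted.append(record)
        else:
            needs_review.append(record)

    def review_priority(record):
        score = record["trustworthiness_score"]
        return -1.0 if score is None else score

    # Put the least trustworthy extractions first in the review queue
    needs_review.sort(key=review_priority)
    return auto_accepted, needs_review

# Example records copied from the results shown in this tutorial
records = [
    {"filename": "20.pdf", "response": "Operating Voltage: 3.3V DC", "trustworthiness_score": 1.0},
    {"filename": "37.pdf", "response": "Operating Voltage: 3.0 - 3.3V DC", "trustworthiness_score": 0.488113},
    {"filename": "28.pdf", "response": "Operating Voltage: 2.81 - 3.15V DC", "trustworthiness_score": 0.273827},
]
accepted, review = split_by_trust(records, threshold=0.8)
print([r["filename"] for r in accepted])  # ['20.pdf']
print([r["filename"] for r in review])    # ['28.pdf', '37.pdf']
```

The same split can be written on the results DataFrame with a boolean mask on the trustworthiness_score column.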
Enforcing Structured Outputs from TLM via the OpenAI API
Above, we extracted the operating voltage from datasheets using basic LLM calls. Let’s explore more advanced extraction that enforces structured outputs from the LLM. TLM can score the trustworthiness of LLM structured outputs when you use it via our OpenAI API integration.
Let’s extract multiple fields simultaneously and ensure they conform to a specific format by defining a structured output schema. For instance, you can guarantee the LLM outputs: specific fields and specific types of values for each field (numbers, specific categories, …). Here we’ll extract:
- Operating voltage
- Whether the maximum dimension exceeds 100mm
First install and import the necessary libraries.
%pip install openai pydantic
from pydantic import BaseModel, Field
from typing import Literal
from openai import OpenAI
We define a Pydantic schema to specify the extracted data format:
class DatasheetInfo(BaseModel):
    operating_voltage: str = Field(
        description="The operating voltage of the product in the format 'X - Y V [AC/DC]' or single value 'X V [AC/DC]'. Use 'N/A' if not available."
    )
    max_dimension_exceeds_100mm: Literal["Yes", "No", "N/A"] = Field(
        description="Whether any dimension (length, width, or height) exceeds 100mm. Choose 'Yes', 'No', or 'N/A' if dimensions are not specified."
    )
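To see what such a schema enforces, the sketch below validates a few hand-written dictionaries against it locally, without calling any LLM. The schema is repeated (with shortened descriptions) so the snippet runs on its own, and the helper `is_valid_datasheet_info` is a hypothetical name introduced for this example.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class DatasheetInfo(BaseModel):
    operating_voltage: str = Field(description="Operating voltage, or 'N/A' if not available.")
    max_dimension_exceeds_100mm: Literal["Yes", "No", "N/A"] = Field(
        description="Whether any dimension exceeds 100mm."
    )

def is_valid_datasheet_info(data):
    """Return True if `data` satisfies the DatasheetInfo schema, else False."""
    try:
        DatasheetInfo(**data)
        return True
    except ValidationError:
        return False

# A value outside the Literal choices is rejected, as is a missing field
print(is_valid_datasheet_info({"operating_voltage": "5.5 V DC", "max_dimension_exceeds_100mm": "No"}))     # True
print(is_valid_datasheet_info({"operating_voltage": "5.5 V DC", "max_dimension_exceeds_100mm": "Maybe"}))  # False
print(is_valid_datasheet_info({"operating_voltage": "5.5 V DC"}))                                           # False
```

Note that the schema constrains the shape and allowed values of the output, not its correctness: a well-formed answer can still be wrong, which is where the trustworthiness score comes in.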
We initialize a TLM client via the OpenAI client library. This allows you to run any OpenAI-compatible code and get trustworthiness scores, including for nonstandard response types like structured outputs, tool calls, etc.
client = OpenAI(
    api_key="<Cleanlab API key>",  # Get your key from https://tlm.cleanlab.ai/
    base_url="https://api.cleanlab.ai/api/v1/openai_trustworthy_llm/"
)
Extract Multiple Fields from Documents
Let’s create a prompt for structured extraction of multiple fields at once. Here we just show this for one example datasheet.
# Modify our existing prompt template from the tutorial
prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item and whether any dimension exceeds 100mm.
For operating voltage:
Respond in the format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
For maximum dimension:
Also determine if any dimension (length, width, or height) exceeds 100mm.
Choose 'Yes', 'No', or 'N/A' if dimensions are not specified.
Datasheet:
"""
# Sample datasheet text (first 1000 characters from the datasheet)
datasheet_text = """
0.5w Solar Panel 55*70
This is a custom solar panel, which mates directly with many of our development boards and has a high efficiency at 17%. Unit has a clear epoxy coating with hard-board backing. Robust sealing for out door applications!
Specification
PET
Package
Typical peak power
0.55W
Voltage at peak power
5.5v
Current at peak power
100mA
Length
70 mm
Width
55 mm
Depth
1.5 mm
Weight
17g
Efficiency
17%
Wire diameter
1.5mm
Connector
2.0mm JST
"""
# Build the full prompt
full_prompt = prompt_template + datasheet_text
Now we use the OpenAI API to extract multiple fields at once. Since we pointed the OpenAI client base_url at TLM, we will also get back trustworthiness scores for each request.
# Send request to TLM via OpenAI API
completion = client.beta.chat.completions.parse(
    model="gpt-4o",  # You can also use "gpt-4o-low" or "gpt-4o-medium" for different quality presets
    messages=[
        {"role": "user", "content": full_prompt}
    ],
    response_format=DatasheetInfo,
)
# Extract the structured information and trustworthiness score
extracted_info = completion.choices[0].message.parsed
trustworthiness_score = completion.tlm_metadata["trustworthiness_score"]
# Display the results
print("Extracted Information:")
print(f"Operating Voltage: {extracted_info.operating_voltage}")
print(f"Max Dimension Exceeds 100mm: {extracted_info.max_dimension_exceeds_100mm}")
print(f"Trustworthiness Score: {trustworthiness_score}")
Next Steps
Don’t let unreliable LLM outputs block AI automation to extract information from documents at scale! With TLM, you can let LLMs automatically process the documents where they are trustworthy and automatically detect which remaining LLM outputs to manually review. This saves your team time and improves the accuracy of extracted information.
- Also check out our cheatsheet and Data Annotation tutorial.
- To improve extraction accuracy, run TLM with a more powerful model and quality_preset configuration.