
Information Extraction from Documents using the Trustworthy Language Model

Run in Google Colab

This tutorial demonstrates how to use Cleanlab’s Trustworthy Language Model (TLM) to reliably extract data from unstructured text documents.

You can use TLM like any other LLM: just prompt it with the document text provided as context, along with an instruction detailing what information should be extracted and in what format. While today’s GenAI and LLMs demonstrate promise for such information extraction, the technology remains fundamentally unreliable. Existing LLMs may extract completely wrong (hallucinated) values in certain edge cases, and you won’t know when this has happened. Cleanlab’s TLM additionally provides a trustworthiness score alongside the extracted information, indicating how confident we can be in its accuracy. TLM trustworthiness scores offer state-of-the-art automation to catch badly extracted information before it harms downstream processes.

LLM extracting data and providing trustworthiness scores

Setup

This tutorial requires a TLM API key. Get one here and complete the quickstart tutorial first.

The Python packages required for this tutorial can be installed using pip:

!apt-get -qq install poppler-utils  # Required for PDF processing with the unstructured library
%pip install --upgrade cleanlab-tlm "unstructured[pdf]==0.18.15"
# Set your API key
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<API key>" # Get your free API key from: https://tlm.cleanlab.ai/
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from IPython.display import display, IFrame

from cleanlab_tlm import TLM

pd.set_option('display.max_colwidth', None)

Document Dataset

The ideas demonstrated in this tutorial apply to arbitrary information extraction tasks involving any type of documents. The particular documents used in this tutorial are a collection of datasheets for electronic parts, stored in PDF format. These datasheets contain technical specifications and application guidelines for the electronic products, serving as an important guide for users of these products. Let’s download the data and look at an example datasheet.

!wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/electronics-datasheets/electronics-datasheets.zip
!mkdir -p datasheets/
!unzip -q electronics-datasheets.zip -d datasheets/

Here’s the first page of a sample datasheet (13.pdf):

Sample Datasheet 13.pdf

This document contains many details, so it can be tough to locate and extract one particular piece of information. Datasheets may use complex, highly technical language, while also varying in their structure and organization of information.

Hence, we’ll use the Trustworthy Language Model (TLM) to automatically extract key information about each product from these datasheets. TLM also provides a trustworthiness score to quantify how confident we can be that the right information was extracted, which allows us to catch potential errors in this process at scale.

Convert PDF Documents to Text

Our documents are PDFs, but TLM requires text inputs (like any other LLM). We can use the open-source unstructured library to extract the text from each PDF.

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

def extract_text_from_pdf(filename):
    elements = partition_pdf(filename, languages=["en"])
    text = "\n".join([str(el) for el in elements])
    return text

directory = "datasheets/"
all_files = os.listdir(directory)
pdf_files = sorted([file for file in all_files if file.endswith('.pdf')])
pdf_texts = [extract_text_from_pdf(os.path.join(directory, file)) for file in pdf_files]

Here is a sample of the extracted text:

print(pdf_texts[0])
0.5w Solar Panel 55*70
This is a custom solar panel, which mates directly with many of our development boards and has a high efficiency at 17%. Unit has a clear epoxy coating with hard-board backing. Robust sealing for out door applications!
Specification
PET
Package
Typical peak power
0.55W
Voltage at peak power
5.5v
Current at peak power
100mA
Length
70 mm
Width
55 mm
Depth
1.5 mm
Weight
17g
Efficiency
17%
Wire diameter
1.5mm
Connector
2.0mm JST
Hardware Installation
http://wiki.seeedstudio.com/0.5w_Solar_Panel_55x70/9‐24‐18

Extract Information using TLM

Let’s initialize TLM, here using the default configuration settings.

tlm = TLM()  # See Advanced Tutorial for optional TLM configurations to get better/faster results

We’ll use the following prompt template that instructs our model to extract the operating voltage of each electronics product from its datasheet:

Please reference the provided datasheet to determine the operating voltage of the item. 
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet: <insert-datasheet>

Some of these datasheets are very long (over 50 pages), and TLM may not be able to ingest such large inputs (contact us if you require larger context windows: sales@cleanlab.ai). For this tutorial, we’ll limit the input datasheet text to the first 10,000 characters. Most datasheets summarize product technical details in the first few pages, so this should not be an issue for our information extraction task. If you already know roughly where in your documents the relevant information lies, you can save cost and runtime by including only the relevant part of each document in your prompts rather than the whole thing.

prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item. 
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet:
"""

texts_with_prompt = [prompt_template + text[:10000] for text in pdf_texts]

After forming our prompts for each datasheet, let’s generate LLM responses using these prompts. Here we use TLM.prompt() to both generate responses and score their trustworthiness, but you can alternatively use any LLM to generate responses and TLM.get_trustworthiness_score() to score their trustworthiness.

tlm_response = tlm.prompt(texts_with_prompt)  # run all prompts at once in batch-mode
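
If you generate responses with a different LLM, you can still score them with TLM.get_trustworthiness_score(). Below is a minimal sketch that, purely for illustration, re-scores the responses we just generated; in practice, the responses would come from whatever LLM you already use.

# Sketch: score responses from any LLM (here we reuse TLM's own responses just for illustration)
responses = [d["response"] for d in tlm_response]
score_results = tlm.get_trustworthiness_score(texts_with_prompt, response=responses)
scores = [r["trustworthiness_score"] for r in score_results]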

Let’s organize the extracted information from the LLM and trustworthiness scores, along with the filename of each document.

results_df = pd.DataFrame({
    "filename": pdf_files,
    "response": [d["response"] for d in tlm_response],
    "trustworthiness_score": [d["trustworthiness_score"] for d in tlm_response]
})

results_df.head()
filename response trustworthiness_score
0 1.pdf Operating Voltage: 5.5V DC 0.987320
1 10.pdf Operating Voltage: 2.7 - 5.5V DC 0.992886
2 11.pdf Operating Voltage: 1.8 - 5.5V 0.977994
3 12.pdf Operating Voltage: 0.9 - 1.6V DC 0.991136
4 13.pdf Operating Voltage: 1.8 - 5.5V 0.985581

Examine Results

The LLM has extracted the product’s operating voltage from each datasheet, and we have a trustworthiness_score indicating how confident we can be that the LLM output is correct.

High Trustworthiness Scores

The responses with the highest trustworthiness scores represent datasheets where we can be the most confident that the LLM has accurately extracted the product’s operating voltage from its datasheet document.

results_df.sort_values("trustworthiness_score", ascending=False).head()
filename response trustworthiness_score
48 53.pdf Operating Voltage: 2.2 - 3.6V DC 0.998998
11 2.pdf Operating Voltage: 10 - 30V DC 0.998991
26 33.pdf Operating Voltage: 1.62 - 3.6V DC 0.998701
22 3.pdf Operating Voltage: 2.4 - 3.6V DC 0.998617
44 5.pdf Operating Voltage: 4.5 - 9V DC 0.998272

Let’s look at an example response with one of our highest trustworthiness scores.

results_df.loc[results_df["filename"] == "31.pdf"]
filename response trustworthiness_score
24 31.pdf Operating Voltage: 1.71 - 3.6V DC 0.993152

Below we show part of a datasheet for which the LLM extraction received a high trustworthiness score, document 31.pdf.

Table from Datasheet 31.pdf

From this image (you can also find this on page 1 of the original file, 31.pdf), we see the datasheet clearly specifies the operating voltage range as 1.71 VDC to 3.6 VDC. This matches the LLM’s extracted information. High trustworthiness scores help you know which LLM outputs you can be confident in!

Low Trustworthiness Scores

The responses with the lowest trustworthiness scores represent datasheets where you should be least confident in the LLM extractions.

results_df.sort_values("trustworthiness_score").head()
filename response trustworthiness_score
25 32.pdf Operating Voltage: 3.3V DC 0.331792
51 56.pdf Operating Voltage: 3.3 - 280V AC/DC 0.489190
29 36.pdf Operating Voltage: 3.3V DC 0.507770
56 60.pdf Operating Voltage: 80 Vp-p 0.662016
27 34.pdf Operating Voltage: 12 - 30 V DC 0.725684

Let’s zoom in on one example where the LLM extraction received a low trustworthiness score.

results_df.loc[results_df["filename"] == "56.pdf"]
filename response trustworthiness_score
51 56.pdf Operating Voltage: 3.3 - 280V AC/DC 0.48919

The LLM extracted an operating voltage range of 3.3 - 280V AC/DC for datasheet 56.pdf, which received a low trustworthiness score. We depict part of this datasheet below (you can find this information on page 1 of 56.pdf). While the definition of the operating voltage is not explicitly stated, the datasheet lists several different product models with different input voltage ranges. In this case, the LLM essentially combined the input ranges across all models and hence returned the wrong response. At least you automatically get a low trustworthiness score for this response, allowing you to catch this incorrect LLM output.

Table from Datasheet 56.pdf
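
In practice, you can use these trustworthiness scores to decide which extractions to accept automatically and which to route for manual review. Here is a minimal sketch, using an arbitrary threshold of 0.8 that you should tune for your own accuracy vs. review-effort tradeoff:

REVIEW_THRESHOLD = 0.8  # arbitrary cutoff for illustration; tune for your application

needs_review = results_df[results_df["trustworthiness_score"] < REVIEW_THRESHOLD]
auto_accepted = results_df[results_df["trustworthiness_score"] >= REVIEW_THRESHOLD]

print(f"Auto-accepted extractions: {len(auto_accepted)}")
print(f"Flagged for manual review: {len(needs_review)}")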

Structured Outputs

Above, we extracted the operating voltage from datasheets using basic LLM calls. Let’s explore more advanced data extraction that enforces structured outputs from the LLM. With minimal code changes, you can extract multiple fields simultaneously and score the trustworthiness of each field in the structured output.

Let’s extract multiple fields in a specific format that we define in a structured output schema. Here we’ll extract:

  • Operating voltage
  • Whether the maximum dimension exceeds 100mm

First, install and import the necessary libraries.

%pip install openai pydantic
from pydantic import BaseModel, Field
import json
from typing import Literal
from openai import OpenAI

from cleanlab_tlm.utils.chat_completions import TLMChatCompletion
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>"  # Here we'll showcase using OpenAI to generate structured outputs, but you can use TLM for this instead too

Here, we utilize OpenAI’s Chat Completions API to extract structured data from the datasheet, and use TLM to score the trustworthiness of those extractions.

The decorator defined below allows us to first use OpenAI to extract and parse the information, then use TLM to assess the reliability of each extracted field, all with minimal setup. For more information, see our TLM for Chat Completions tutorial.

If you don’t have an OpenAI account, you can use your TLM account to both generate the structured outputs and score their trustworthiness, as shown here.

import functools

def add_trust_scoring(tlm_instance):
    """Decorator factory that creates a trust scoring decorator."""
    def trust_score_decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            response = fn(**kwargs)
            score_result = tlm_instance.score(response=response, **kwargs)
            response.tlm_metadata = score_result
            return response
        return wrapper
    return trust_score_decorator

tlm = TLMChatCompletion(options={"log": ["per_field_score"]})

client = OpenAI()
client.chat.completions.parse = add_trust_scoring(tlm)(client.chat.completions.parse)

Extract Multiple Fields from Documents

Let’s create a prompt for structured extraction of multiple fields at once. Here we just show this for one example datasheet.

# Modify our existing prompt template from the tutorial
prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item and whether any dimension exceeds 100mm.

For operating voltage:
Respond in the format: "[insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "N/A".

For maximum dimension:
Also determine if any dimension (length, width, or height) exceeds 100mm.
Choose 'Yes', 'No', or 'N/A' if dimensions are not specified.

Datasheet:
"""

# Sample datasheet text (first 1000 characters from the datasheet)
datasheet_text = """
0.5w Solar Panel 55*70
This is a custom solar panel, which mates directly with many of our development boards and has a high efficiency at 17%. Unit has a clear epoxy coating with hard-board backing. Robust sealing for out door applications!
Specification
PET
Package
Typical peak power
0.55W
Voltage at peak power
5.5w
Current at peak power
100mA
Length
70 mm
Width
55 mm
Depth
1.5 mm
Weight
17g
Efficiency
17%
Wire diameter
1.5mm
Connector
2.0mm JST
"""

# Build the full prompt
full_prompt = prompt_template + datasheet_text

We define a Pydantic schema to specify the extracted data format:

class DatasheetInfo(BaseModel):
    operating_voltage: str = Field(
        description="The operating voltage of the product in the format 'X - Y V [AC/DC]' or single value 'X V [AC/DC]'. Use 'N/A' if not available."
    )
    max_dimension_exceeds_100mm: Literal["Yes", "No", "N/A"] = Field(
        description="Whether any dimension (length, width, or height) exceeds 100mm. Choose 'Yes', 'No', or 'N/A' if dimensions are not specified."
    )

Now we use the OpenAI API to extract multiple fields at once. After you decorate OpenAI’s Chat Completions function, your existing code that uses the decorated function will automatically compute trust scores as well (zero change needed in other code).

completion = client.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": full_prompt}
    ],
    response_format=DatasheetInfo,
)

# Extract the structured information and trustworthiness score
extracted_info = completion.choices[0].message.parsed
trustworthiness_score = completion.tlm_metadata["trustworthiness_score"]
per_field_scores = completion.tlm_metadata["log"]["per_field_score"]

# Display the results
print(f"Extracted Information:")
print(f"Operating Voltage: {extracted_info.operating_voltage}")
print(f"Max Dimension Exceeds 100mm: {extracted_info.max_dimension_exceeds_100mm}")
print(f"Trustworthiness Score: {trustworthiness_score}")

print(f"\nPer-field trustworthiness scores:")
print(json.dumps(per_field_scores, indent=2))
Extracted Information:
Operating Voltage: N/A
Max Dimension Exceeds 100mm: No
Trustworthiness Score: 0.5403188015968065

Per-field trustworthiness scores:
{
  "max_dimension_exceeds_100mm": {
    "explanation": "The datasheet lists dimensions as Length: 70 mm, Width: 55 mm, Depth: 1.5 mm. None of these exceed 100 mm, so 'No' is correct.",
    "score": 1.0
  },
  "operating_voltage": {
    "explanation": "The datasheet provides 'Voltage at peak power' as '5.5w', which appears to be a typographical error where 'w' should be 'V' (volts). Interpreting this as 5.5V is reasonable. Since the datasheet does not specify AC or DC, and the product is a solar panel (which typically outputs DC voltage), the operating voltage should be '5.5 V DC'. The response 'N/A' is therefore incorrect and incomplete.",
    "score": 0.5
  }
}

The per_field_scores dictionary contains a granular confidence score and explanation for each extracted field. Since this dictionary can be overwhelming for larger schemas, we provide a get_untrustworthy_fields() method that:

  • Prints detailed information about low-confidence fields
  • Returns a list of fields that may need manual review due to low trust scores
untrustworthy_fields = tlm.get_untrustworthy_fields(tlm_result=completion)
Untrustworthy fields: ['operating_voltage']

Field: operating_voltage
Response: N/A
Score: 0.5
Explanation: The datasheet provides 'Voltage at peak power' as '5.5w', which appears to be a typographical error where 'w' should be 'V' (volts). Interpreting this as 5.5V is reasonable. Since the datasheet does not specify AC or DC, and the product is a solar panel (which typically outputs DC voltage), the operating voltage should be '5.5 V DC'. The response 'N/A' is therefore incorrect and incomplete.

This method returns a list of fields whose confidence score is low, allowing you to focus manual review on the specific fields whose extracted value is untrustworthy.

untrustworthy_fields
['operating_voltage']
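
You can combine this list with the parsed output to keep trusted field values and route only the untrustworthy ones to a human reviewer. A minimal sketch (assuming Pydantic v2's model_dump() method):

# Sketch: keep trusted fields, flag low-trust fields for manual review
extracted = extracted_info.model_dump()  # dict mapping field name -> extracted value
for field_name, value in extracted.items():
    status = "REVIEW" if field_name in untrustworthy_fields else "OK"
    print(f"[{status}] {field_name}: {value!r}")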

Next Steps

Don’t let unreliable LLM outputs stop you from automating information extraction from documents at scale! With TLM, you can let LLMs automatically process the documents where their outputs are trustworthy, and automatically detect which remaining outputs require manual review. This saves your team time and improves the accuracy of the extracted information.