
Information Extraction from Documents using the Trustworthy Language Model


This tutorial demonstrates how to use Cleanlab’s Trustworthy Language Model (TLM) to reliably extract data from unstructured text documents.

You can use TLM like any other LLM: prompt it with the document text as context, along with an instruction detailing what information should be extracted and in what format. While today’s GenAI and LLMs show promise for such information extraction, the technology remains fundamentally unreliable. LLMs may extract completely wrong (hallucinated) values in certain edge cases, and with existing LLMs you won’t know when this happens. Cleanlab’s TLM additionally provides a trustworthiness score alongside the extracted information to indicate how confident we can be in its accuracy. These TLM trustworthiness scores are significantly more reliable than existing measures of LLM confidence, and help you catch badly extracted information before it harms downstream processes.

Before this tutorial, we recommend first completing the TLM quickstart tutorial.

LLM extracting data and providing trustworthiness scores

Setup

TLM requires a Cleanlab account. Sign up for one here and use TLM for free! If you’ve already signed up, check your email for a personal login link. The Python package dependencies for this tutorial can be installed using pip:

%pip install --upgrade cleanlab-studio "unstructured[pdf]==0.13.2" "pillow-heif==0.16.0"
import os
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from IPython.display import display, IFrame

from cleanlab_studio import Studio

pd.set_option('display.max_colwidth', None)

Fetch the Document Dataset

The ideas demonstrated in this tutorial apply to arbitrary information extraction tasks involving any type of document. The particular documents used in this tutorial are a collection of datasheets for electronic parts, stored in PDF format. These datasheets contain technical specifications and application guidelines for the electronic products, serving as an important guide for users of these products. Let’s download the data and look at an example datasheet.

wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/electronics-datasheets/electronics-datasheets.zip
mkdir datasheets/
unzip -q electronics-datasheets.zip -d datasheets/

Here’s the first page of a sample datasheet (13.pdf):

Sample Datasheet 13.pdf

This document contains many details, so it might be tough to extract one particular piece of information. Datasheets may use complex and highly technical language, while also varying in their structure and organization of information.

Hence, we’ll use the Trustworthy Language Model (TLM) to automatically extract key information about each product from these datasheets. TLM also provides a trustworthiness score to quantify how confident we can be that the right information was extracted, which allows us to catch potential errors in this process at scale.

Convert PDF Documents to Text

Our documents are PDFs, but TLM requires text inputs (like any other LLM). We can use the open-source unstructured library to extract the text from each PDF.

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

def extract_text_from_pdf(filename):
    elements = partition_pdf(filename)
    text = "\n".join([str(el) for el in elements])
    return text

directory = "datasheets/"
all_files = os.listdir(directory)
pdf_files = sorted([file for file in all_files if file.endswith('.pdf')])
pdf_texts = [extract_text_from_pdf(os.path.join(directory, file)) for file in pdf_files]

Here is a sample of the extracted text:

print(pdf_texts[0])
    0.5w Solar Panel 55*70
This is a custom solar panel, which mates directly with many of our development boards and has a high efficiency at 17%. Unit has a clear epoxy coating with hard-board backing. Robust sealing for out door applications!
Specification
PET
Package
Typical peak power
0.55W
Voltage at peak power
5.5v
Current at peak power
100mA
Length
70 mm
Width
55 mm
Depth
1.5 mm
Weight
17g
Efficiency
17%
Wire diameter
1.5mm
Connector
2.0mm JST
Hardware Installation
http://wiki.seeedstudio.com/0.5w_Solar_Panel_55x70/9‐24‐18

Extract Information using TLM

Let’s initialize TLM, here using the default configuration settings.

studio = Studio("<API key>")  # Get your API key from: https://app.cleanlab.ai/account after creating an account

tlm = studio.TLM() # See Advanced tutorial for optional TLM configurations to boost performance

We’ll use the following prompt template that instructs our model to extract the operating voltage of each electronics product from its datasheet:

Please reference the provided datasheet to determine the operating voltage of the item. 
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet: <insert-datasheet>

Some of these datasheets are very long (over 50 pages), and TLM might not have the ability to ingest large inputs (contact us if you require larger context windows: sales@cleanlab.ai). For this tutorial, we’ll limit the input datasheet text to only the first 10,000 characters. Most datasheets summarize product technical details in the first few pages, so this should not be an issue for our information extraction task. If you already know roughly where in your documents the relevant information lies, you can save cost and runtime by only including text from the relevant part of the document in your prompts rather than the whole thing.

prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item. 
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet:
"""

texts_with_prompt = [prompt_template + text[:10000] for text in pdf_texts]
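If the relevant details do not always sit in the first 10,000 characters, one alternative to plain truncation is to keep only the text surrounding keyword matches. The following is a minimal sketch of that idea, not part of TLM or this tutorial's pipeline: the helper name `select_relevant_text`, the keyword list, and the window size are all hypothetical choices you would tune for your own documents.

```python
import re

def select_relevant_text(text, keywords=("voltage",), window=2000, max_chars=10000):
    """Keep text near keyword matches; fall back to the first max_chars characters.

    Hypothetical helper: keywords, window, and max_chars are illustrative defaults.
    """
    spans = []
    for keyword in keywords:
        # Case-insensitive search on the original text so offsets stay valid.
        for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            spans.append((start, end))

    if not spans:
        # No keyword found: behave like the simple truncation used above.
        return text[:max_chars]

    # Merge overlapping windows so no passage is repeated.
    spans.sort()
    merged = [spans[0]]
    for start, end in spans[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))

    snippet = "\n...\n".join(text[start:end] for start, end in merged)
    return snippet[:max_chars]
```

To try it, you could build the prompts with `prompt_template + select_relevant_text(text)` instead of `prompt_template + text[:10000]`. Keyword matching can miss datasheets that use different terminology, so the truncation fallback is kept for documents without any match.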

After forming our prompts for each datasheet, let’s run the prompts through TLM. We can run all prompts simultaneously in a batch.

TLM will return a list of dictionaries, with each dictionary containing the corresponding response and trustworthiness score, which quantifies how confident you can be that the response is correct.

tlm_response = tlm.prompt(texts_with_prompt)
    Querying TLM... 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|

Let’s organize the extracted information (TLM responses) and trustworthiness scores for each input PDF, along with its filename.

results_df = pd.DataFrame({
    "filename": pdf_files,
    "response": [d["response"] for d in tlm_response],
    "trustworthiness_score": [d["trustworthiness_score"] for d in tlm_response]
})

results_df.head()
   filename  response                          trustworthiness_score
0  1.pdf     Operating Voltage: 5.5V DC        0.997596
1  10.pdf    Operating Voltage: 2.7 - 5.5V DC  0.992541
2  11.pdf    Operating Voltage: 1.8 - 5.5V DC  0.985845
3  12.pdf    Operating Voltage: 0.9 - 1.6V DC  0.987066
4  13.pdf    Operating Voltage: 1.8 - 5.5V     0.990618
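Because the prompt fixes the response format, the extracted strings can be converted into numeric values for downstream use. Below is a minimal sketch under that assumption. The function `parse_operating_voltage` and its regular expression are hypothetical helpers written for this illustration, not part of TLM. Responses that deviate from the requested format return `None` so they can be inspected by hand.

```python
import re

# Matches e.g. "Operating Voltage: 2.7 - 5.5V DC" or "Operating Voltage: 5.5V".
VOLTAGE_PATTERN = re.compile(
    r"^Operating Voltage:\s*"
    r"(?P<low>\d+(?:\.\d+)?)"                 # single value or lower bound
    r"(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?"    # optional upper bound
    r"\s*V"
    r"(?:\s+(?P<kind>AC|DC))?\s*$"            # optional AC/DC marker
)

def parse_operating_voltage(response):
    """Parse a response in the prompt's format into a dict, or None if off-format."""
    text = response.strip()
    if text == "Operating Voltage: N/A":
        return {"available": False, "low": None, "high": None, "kind": None}
    match = VOLTAGE_PATTERN.match(text)
    if match is None:
        return None  # Unexpected format: leave for manual inspection.
    low = float(match.group("low"))
    # A single value is treated as a range whose bounds coincide.
    high = float(match.group("high")) if match.group("high") else low
    return {"available": True, "low": low, "high": high, "kind": match.group("kind")}
```

With the DataFrame above, `results_df["response"].apply(parse_operating_voltage)` would give one parsed record (or `None`) per datasheet. Parsing only checks the format of a response, not its correctness, so it complements the trustworthiness score rather than replacing it.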

Examine Results

TLM has extracted each product’s operating voltage from its datasheet, along with a trustworthiness_score indicating its confidence. Let’s now examine the results in more detail.

High Trustworthiness Scores

The responses with the highest trustworthiness scores represent datasheets where TLM is the most confident that it has accurately extracted the operating voltage of the product.

results_df.sort_values("trustworthiness_score", ascending=False).head()
    filename  response                            trustworthiness_score
24  31.pdf    Operating Voltage: 1.71 - 3.6V DC   0.999100
44  5.pdf     Operating Voltage: 4.5 - 9V DC      0.998593
48  53.pdf    Operating Voltage: 2.2 - 3.6V DC    0.997787
33  4.pdf     Operating Voltage: 19.2 - 28.8V DC  0.997736
0   1.pdf     Operating Voltage: 5.5V DC          0.997596

Below we show part of the datasheet which corresponds to the highest TLM trustworthiness score, document 31.pdf.

Table from Datasheet 31.pdf

From this image (also found on page 1 of the original file, 31.pdf), we see that the datasheet specifies an operating voltage range of 1.71 VDC to 3.6 VDC. This matches the information extracted by TLM. When the trustworthiness scores are high, you can trust the results from TLM with great confidence!

Low Trustworthiness Scores

The responses with the lowest trustworthiness scores represent datasheets where TLM is the least confident that it correctly extracted the operating voltage of the product. Low trustworthiness scores indicate instances where you should not place much confidence in the LLM response. Results with low trustworthiness scores would benefit most from manual review, especially if the results are critical to get right.

results_df.sort_values("trustworthiness_score").head()
    filename  response                          trustworthiness_score
30  37.pdf    Operating Voltage: 3.0V DC        0.492340
64  68.pdf    Operating Voltage: 1.5 - 2.6V DC  0.581276
9   18.pdf    Operating Voltage: 5V ±5% DC      0.582412
14  22.pdf    Operating Voltage: 6.3 - 450V DC  0.623049
70  73.pdf    Operating Voltage: 24 V DC        0.647475

Above we see the results that received the lowest trustworthiness scores among our documents. TLM extracted an operating voltage value of 3V DC for datasheet 37.pdf, which received the lowest trustworthiness score. We depict part of this datasheet below (you can find this information on page 7 of 37.pdf). We see that this table only lists the product’s supply voltage and backlight supply voltage, and there is no mention of the operating voltage. Like any standard LLM, TLM finds it hard to extract the correct information (or to realize that this information is not available) because of the confusing terminology, and hence returns the wrong response. Furthermore, 3V DC is not even the correct supply voltage range. However, the trustworthiness score associated with this response is low, which allows us to catch this bad response with minimal manual review of the overall results.

Table from Datasheet 37.pdf

For datasheet 68.pdf, TLM also produced a low trustworthiness score for its response, which was the following:

Operating Voltage: 1.5 - 2.6V DC

Below we show part of the datasheet (the table on page 2 of 68.pdf).

Table from Datasheet 68.pdf

We see that there are many different products here with a wide range of operating voltages (some of which are not in the 1.5 - 2.6V range). Again, the low trustworthiness scores help you realize this example requires further review.
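The review workflow described above can be sketched as a simple split on the trustworthiness score. This is only an illustration: the function `split_by_trust` is a hypothetical helper, and the threshold of 0.8 is an assumed value, not a recommendation from TLM. In practice you would pick a threshold by checking a sample of your own results. The function works directly on the list of dictionaries returned by `tlm.prompt`.

```python
def split_by_trust(filenames, tlm_outputs, threshold=0.8):
    """Split TLM outputs into auto-accepted and needs-review records.

    Hypothetical helper; threshold=0.8 is an assumed, untuned cutoff.
    Each item of tlm_outputs is a dict with "response" and
    "trustworthiness_score" keys, as in the results shown above.
    """
    auto_accept, needs_review = [], []
    for filename, output in zip(filenames, tlm_outputs):
        record = {"filename": filename, **output}
        if output["trustworthiness_score"] >= threshold:
            auto_accept.append(record)
        else:
            needs_review.append(record)
    # Least trustworthy first, so reviewers see the riskiest extractions early.
    needs_review.sort(key=lambda record: record["trustworthiness_score"])
    return auto_accept, needs_review
```

For example, `accepted, to_review = split_by_trust(pdf_files, tlm_response)` would route the low-scoring documents discussed above to `to_review`, while the high-scoring ones pass through without manual inspection.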

Don’t let unreliable results prevent you from employing LLM automation to extract information from documents at scale! You can rely on LLMs for a large subset of documents and, with TLM’s trustworthiness scores, know which other documents you should still inspect manually. This can save your teams immense time and significantly improve the quality of the extracted information.

You may also find TLM with Structured Outputs useful in certain data extraction applications like Named Entity Recognition or PII Detection.