Information Extraction from Documents using the Trustworthy Language Model

This tutorial demonstrates how to use Cleanlab’s Trustworthy Language Model (TLM) to reliably extract data from unstructured text documents.

You can use TLM like any other LLM: prompt it with the document text as context, along with an instruction detailing what information to extract and in what format. While today’s GenAI and LLMs show promise for such information extraction, the technology remains fundamentally unreliable. LLMs may extract completely wrong (hallucinated) values in certain edge cases, and existing LLMs give you no indication when this happens. Cleanlab’s TLM additionally provides a trustworthiness score alongside the extracted information to indicate how confident we can be in its accuracy. These TLM trustworthiness scores are significantly more reliable than existing measures of LLM confidence, and help you catch badly extracted information before it harms downstream processes.

Before this tutorial, we recommend first completing the TLM quickstart tutorial.

LLM extracting data and providing trustworthiness scores

Install and Import Dependencies

Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link. The Python package dependencies for this tutorial can be installed using pip:

%pip install --upgrade cleanlab-studio "unstructured[pdf]==0.13.2"
import os
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from IPython.display import display, IFrame

from cleanlab_studio import Studio

pd.set_option('display.max_colwidth', None)

Fetch the Document Dataset

The ideas demonstrated in this tutorial apply to arbitrary information extraction tasks involving any type of document. The documents used in this tutorial are a collection of datasheets for electronic parts, stored in PDF format. These datasheets contain technical specifications and application guidelines for the electronic products, serving as an important guide for users of these products. Let’s download the data and look at an example datasheet.

wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/electronics-datasheets/electronics-datasheets.zip
mkdir -p datasheets/
unzip -q electronics-datasheets.zip -d datasheets/

Here’s the first page of a sample datasheet (13.pdf):

Sample Datasheet 13.pdf

This document contains many details, so it might be tough to extract one particular piece of information. Datasheets may use complex and highly technical language, while also varying in their structure and organization of information.

Hence, we’ll use the Trustworthy Language Model (TLM) to automatically extract key information about each product from these datasheets. TLM also provides a trustworthiness score to quantify how confident we can be that the right information was extracted, which allows us to catch potential errors in this process at scale.

Convert PDF Documents to Text

Our documents are PDFs, but TLM requires text inputs (like any other LLM). We can use the open-source unstructured library to extract the text from each PDF.

def extract_text_from_pdf(filename):
    elements = partition_pdf(filename)
    text = "\n".join([str(el) for el in elements])
    return text

directory = "datasheets/"
all_files = os.listdir(directory)
pdf_files = sorted([file for file in all_files if file.endswith('.pdf')])
pdf_texts = [extract_text_from_pdf(os.path.join(directory, file)) for file in pdf_files]

Here is a sample of the extracted text:

print(pdf_texts[0])
    0.5w Solar Panel 55*70
This is a custom solar panel, which mates directly with many of our development boards and has a high efficiency at 17%. Unit has a clear epoxy coating with hard-board backing. Robust sealing for out door applications!
Specification
PET
Package
Typical peak power
0.55W
Voltage at peak power
5.5v
Current at peak power
100mA
Length
70 mm
Width
55 mm
Depth
1.5 mm
Weight
17g
Efficiency
17%
Wire diameter
1.5mm
Connector
2.0mm JST
Hardware Installation
http://wiki.seeedstudio.com/0.5w_Solar_Panel_55x70/9‐24‐18

Extract Information using TLM

Let’s initialize the TLM client. Here we use default TLM settings, but check out the TLM quickstart tutorial for configuration options to get better results.

# Get your API key from https://app.cleanlab.ai/account after creating an account
studio = Studio("<API key>")

tlm = studio.TLM()

We’ll use the following prompt template that instructs our model to extract the operating voltage of each electronics product from its datasheet:

Please reference the provided datasheet to determine the operating voltage of the item. 
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet: <insert-datasheet>

Some of these datasheets are very long (over 50 pages), and TLM may not be able to ingest such large inputs (contact us if you require larger context windows: sales@cleanlab.ai). For this tutorial, we’ll limit the input datasheet text to the first 10,000 characters. Most datasheets summarize product technical details in the first few pages, so this should not be an issue for our information extraction task. If you already know roughly where in your documents the relevant information lies, you can save cost and runtime by including only that part of the document in your prompts rather than the whole thing.

prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item. 
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet:
"""

texts_with_prompt = [prompt_template + text[:10000] for text in pdf_texts]
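
If the relevant passage does not always fall within the first 10,000 characters, one option is to cut a window of text around a keyword instead of taking the start of the document. The sketch below is only an illustration under stated assumptions: keyword_window, its keywords and radius parameters, and the fallback to the leading characters are hypothetical choices, not part of TLM or the unstructured library.

```python
def keyword_window(text, keywords=("voltage",), radius=5000, max_chars=10000):
    """Return the text surrounding the earliest keyword hit.

    Hypothetical helper: falls back to the first `max_chars` characters
    when none of the keywords appear in the text.
    """
    lowered = text.lower()
    # Position of the first occurrence of each keyword (-1 means not found).
    hits = [lowered.find(keyword.lower()) for keyword in keywords]
    hits = [hit for hit in hits if hit >= 0]
    if not hits:
        return text[:max_chars]
    first = min(hits)
    start = max(0, first - radius)
    end = min(len(text), first + radius)
    return text[start:end][:max_chars]


# Possible usage with this tutorial's variables (assumed to be defined already):
# texts_with_prompt = [prompt_template + keyword_window(text) for text in pdf_texts]
```

A window around the first hit is a crude heuristic: a datasheet may mention a keyword in its title long before the specification table, so it is worth checking a few documents by hand before relying on it.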

After forming the prompt for each datasheet, let’s pass the prompts to TLM. Note that we can run all prompts simultaneously in a batch.

TLM returns a list of dictionaries, one per prompt. Each dictionary contains the response and its trustworthiness score, which quantifies how confident you can be that the response is correct.

tlm_response = tlm.prompt(texts_with_prompt)
    Querying TLM... 100%|██████████|

Let’s organize the extracted information (TLM responses) and trustworthiness scores for each input PDF, along with its filename.

results_df = pd.DataFrame({
    "filename": pdf_files,
    "response": [d["response"] for d in tlm_response],
    "trustworthiness_score": [d["trustworthiness_score"] for d in tlm_response]
})

results_df.head()
    filename  response                           trustworthiness_score
0   1.pdf     Operating Voltage: 5.5V DC         0.765674
1   10.pdf    Operating Voltage: 2.7V - 5.5V DC  0.939829
2   11.pdf    Operating Voltage: 1.8 - 5.5V DC   0.915594
3   12.pdf    Operating Voltage: 0.9 - 1.6VDC    0.930093
4   13.pdf    Operating Voltage: 1.8 - 5.5V      0.936506
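
The responses above follow the requested format only loosely (for example 5.5V DC, 1.6VDC, and 5.5V with no AC/DC marker), so downstream code may want to normalize them into numbers. The following is a minimal sketch of such a parser, assuming responses look like the ones shown in this tutorial; parse_operating_voltage and its output keys (min_v, max_v, current) are hypothetical names, not part of the TLM API.

```python
import re

_NUMBER = r"(\d+(?:\.\d+)?)"
# Accepts "5.5V DC", "2.7V - 5.5V DC", "0.9 - 1.6VDC", "24 V DC", "1.8 - 5.5V".
_VOLTAGE_PATTERN = re.compile(
    r"Operating Voltage:\s*" + _NUMBER + r"\s*V?\s*"
    r"(?:-\s*" + _NUMBER + r"\s*)?V?\s*(AC|DC)?",
    re.IGNORECASE,
)


def parse_operating_voltage(response):
    """Parse the first voltage statement in a response.

    Returns None when no numeric voltage is found (e.g. "Operating Voltage: N/A").
    """
    match = _VOLTAGE_PATTERN.search(response)
    if match is None:
        return None
    low = float(match.group(1))
    # A single value is treated as a range whose ends are equal.
    high = float(match.group(2)) if match.group(2) else low
    current = match.group(3).upper() if match.group(3) else None
    return {"min_v": low, "max_v": high, "current": current}
```

This sketch reads only the first voltage statement in a response. A response that lists several voltages for different part numbers would need extra handling, for instance by iterating over all matches with the pattern’s finditer method.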

Examine Results

TLM has extracted the product’s operating voltage from each datasheet, along with a trustworthiness_score indicating its confidence. Let’s now examine the results in more detail.

High Trustworthiness Scores

The responses with the highest trustworthiness scores represent datasheets where TLM is the most confident that it has accurately extracted the operating voltage of the product.

results_df.sort_values("trustworthiness_score", ascending=False).head()
    filename  response                            trustworthiness_score
17  25.pdf    Operating Voltage: 3.0 - 6.0V DC    0.964483
6   15.pdf    Operating Voltage: 12V DC           0.955249
24  31.pdf    Operating Voltage: 1.71V - 3.6V DC  0.948459
1   10.pdf    Operating Voltage: 2.7V - 5.5V DC   0.939829
4   13.pdf    Operating Voltage: 1.8 - 5.5V       0.936506

Below we show part of the datasheet which corresponds to the highest TLM trustworthiness score, document 25.pdf.

Table from Datasheet 25.pdf

From this image (you can also find this at the bottom of page 2 in the original file, 25.pdf), we see a table that specifies the minimum and maximum operating voltage of this item as 3 VDC and 6 VDC respectively. This matches the information extracted by TLM. When the trustworthiness scores are high, you can trust the results from TLM with great confidence!

Low Trustworthiness Scores

The responses with the lowest trustworthiness scores represent datasheets where TLM is the least confident that it correctly extracted the operating voltage of the product. Low trustworthiness scores indicate instances where you should not place much confidence in the TLM response. Results with low trustworthiness scores would benefit most from manual review, especially if the results are critical to get right.

results_df.sort_values("trustworthiness_score").head()
    filename  response                        trustworthiness_score
62  66.pdf    Operating Voltage: 24 V DC      0.514770
52  57.pdf    Operating Voltage: 5V DC (26M024B1U, 26M024B1B)\nOperating Voltage: 12V DC (26M048B1U, 26M048B1B, 26M048B2U, 26M048B2B)  0.524739
36  42.pdf    Operating Voltage: 1V - 15V DC  0.566969
73  76.pdf    Operating Voltage: 9 - 75 VDC   0.588441
31  38.pdf    Operating Voltage: 5V DC        0.603914

Above we see the results that received the lowest trustworthiness scores among our documents. TLM extracted an operating voltage of 24 V DC for datasheet 66.pdf, which received the lowest trustworthiness score. We depict part of this datasheet below (you can find this information on page 2 of 66.pdf). The datasheet mentions many voltages, such as input voltage range, voltage output signal, and nominal supply voltage, but never the operating voltage. Like any standard LLM, TLM finds it hard to extract the correct information (or to realize that this information is not available) amid this abundance of closely related information, and it returns the wrong response. However, the trustworthiness score associated with this response is low, which allows us to catch this bad response with minimal manual review of the overall results.

Table from Datasheet 66.pdf

For datasheet 57.pdf, TLM also produced a low trustworthiness score for its response, which was the following:

Operating Voltage: 5V DC (26M024B1U, 26M024B1B)
Operating Voltage: 12V DC (26M048B1U, 26M048B1B, 26M048B2U, 26M048B2B)

Below we show parts of the datasheet (the tables on pages 2 and 3 of 57.pdf).

Table 1 from Datasheet 57.pdf

Table 2 from Datasheet 57.pdf

We see that these two tables contain 8 different products, each with its own technical specification. TLM got some of the extracted information right: part numbers 26M024B1U and 26M024B1B do indeed have an operating voltage of 5V. However, some information was extracted incorrectly. For example, the items with part numbers 26M048B1U and 26M048B1B have an operating voltage of 5V, not the 12V listed in the TLM response.

Again, the low trustworthiness scores help you realize this example requires further review.
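
One way to act on these scores is to route every response below a chosen threshold to manual review and accept the rest automatically. The sketch below assumes TLM outputs shaped like the dictionaries used in this tutorial (with response and trustworthiness_score keys); split_for_review and the 0.7 threshold are hypothetical and should be tuned on your own documents.

```python
def split_for_review(filenames, tlm_outputs, threshold=0.7):
    """Split TLM outputs into auto-accepted records and records needing review.

    The threshold is an illustrative assumption, not a recommended value.
    """
    auto_accept, needs_review = [], []
    for name, output in zip(filenames, tlm_outputs):
        record = {"filename": name, **output}
        if output["trustworthiness_score"] < threshold:
            needs_review.append(record)
        else:
            auto_accept.append(record)
    # Put the least trustworthy responses first so reviewers see them first.
    needs_review.sort(key=lambda record: record["trustworthiness_score"])
    return auto_accept, needs_review


# Possible usage with this tutorial's variables (assumed to be defined already):
# accepted, to_review = split_for_review(pdf_files, tlm_response)
```

With the results_df DataFrame from earlier, the same split can be expressed as a boolean mask, for example results_df[results_df["trustworthiness_score"] < 0.7].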

Don’t let unreliable results prevent you from employing LLM automation to extract information from documents at scale! You can rely on LLMs for a large subset of documents and use TLM’s trustworthiness scores to decide which documents you should still inspect manually. This can save your teams immense time and significantly improve the quality of extracted information.