Information Extraction from Documents using the Trustworthy Language Model
This tutorial demonstrates how to use Cleanlab’s Trustworthy Language Model (TLM) to reliably extract data from unstructured text documents.
You can use TLM like any other LLM: prompt it with the document text as context, along with an instruction detailing what information should be extracted and in what format. While today’s GenAI and LLMs show promise for such information extraction, the technology remains fundamentally unreliable. LLMs may extract completely wrong (hallucinated) values in certain edge cases, and existing LLMs give you no indication when this happens. Cleanlab’s TLM also provides a trustworthiness score alongside the extracted information to indicate how confident we can be regarding its accuracy. These TLM trustworthiness scores are significantly more reliable than existing measures of LLM confidence, and help you catch badly extracted information before it harms downstream processes.
Before this tutorial, we recommend first completing the TLM quickstart tutorial.
Install and Import Dependencies
Using TLM requires a Cleanlab account. Sign up for one here if you haven’t yet. If you’ve already signed up, check your email for a personal login link. The Python package dependencies for this tutorial can be installed using pip:
%pip install --upgrade cleanlab-studio "unstructured[pdf]==0.13.2" "pillow-heif==0.16.0"
import os
import pandas as pd
from unstructured.partition.pdf import partition_pdf
from IPython.display import display, IFrame
from cleanlab_studio import Studio
pd.set_option('display.max_colwidth', None)
Fetch the Document Dataset
The ideas demonstrated in this tutorial apply to arbitrary information extraction tasks involving any type of document. The documents used in this tutorial are a collection of datasheets for electronic parts, stored in PDF format. These datasheets contain technical specifications and application guidelines for the electronic products, serving as an important guide for users of these products. Let’s download the data and look at an example datasheet.
wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/electronics-datasheets/electronics-datasheets.zip
mkdir -p datasheets/
unzip -q electronics-datasheets.zip -d datasheets/
Here’s the first page of a sample datasheet (13.pdf):
This document contains many details, so it might be tough to extract one particular piece of information. Datasheets may use complex and highly technical language, while also varying in their structure and organization of information.
Hence, we’ll use the Trustworthy Language Model (TLM) to automatically extract key information about each product from these datasheets. TLM also provides a trustworthiness score to quantify how confident we can be that the right information was extracted, which allows us to catch potential errors in this process at scale.
Convert PDF Documents to Text
Our documents are PDFs, but TLM requires text inputs (like any other LLM). We can use the open-source unstructured library to extract the text from each PDF.
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
def extract_text_from_pdf(filename):
    elements = partition_pdf(filename)
    text = "\n".join([str(el) for el in elements])
    return text
directory = "datasheets/"
all_files = os.listdir(directory)
pdf_files = sorted([file for file in all_files if file.endswith('.pdf')])
pdf_texts = [extract_text_from_pdf(os.path.join(directory, file)) for file in pdf_files]
Here is a sample of the extracted text:
print(pdf_texts[0])
Extract Information using TLM
Let’s initialize the TLM client. Here we use default TLM settings, but check out the TLM quickstart tutorial for configuration options to get better results.
# Get your API key from https://app.cleanlab.ai/account after creating an account
studio = Studio("<API key>")
tlm = studio.TLM()
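Rather than pasting the key directly into the notebook, you may prefer to read it from an environment variable. The sketch below assumes a variable named `CLEANLAB_API_KEY`; that name and the `load_api_key` helper are illustrative choices for this tutorial, not something the Cleanlab client requires.

```python
import os


def load_api_key(env=None, var_name="CLEANLAB_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical name chosen for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key.")
    return api_key


# Usage (mirrors the client setup above):
# studio = Studio(load_api_key())
# tlm = studio.TLM()
```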
We’ll use the following prompt template that instructs our model to extract the operating voltage of each electronics product from its datasheet:
Please reference the provided datasheet to determine the operating voltage of the item.
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet: <insert-datasheet>
Some of these datasheets are very long (over 50 pages), and TLM might not have the ability to ingest large inputs (contact us if you require larger context windows: sales@cleanlab.ai). For this tutorial, we’ll limit the input datasheet text to only the first 10,000 characters. Most datasheets summarize product technical details in the first few pages, so this should not be an issue for our information extraction task. If you already know roughly where in your documents the relevant information lies, you can save cost and runtime by only including text from the relevant part of the document in your prompts rather than the whole thing.
prompt_template = """Please reference the provided datasheet to determine the operating voltage of the item.
Respond in the following format: "Operating Voltage: [insert appropriate voltage range]V [AC/DC]" with the appropriate voltage range and indicating "AC" or "DC" if applicable, or omitting if not.
If the operating voltage is a range, write it as "A - B" with "-" between the values.
If the operating voltage information is not available, specify "Operating Voltage: N/A".
Datasheet:
"""
texts_with_prompt = [prompt_template + text[:10000] for text in pdf_texts]
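If the relevant details do not always appear at the start of a document, one alternative to plain truncation is to keep a window of text around the first mention of a keyword. The sketch below is an illustration only: the `select_relevant_text` helper, the keyword "voltage", and the window sizes are all assumptions, and it falls back to the first `max_chars` characters when the keyword is absent.

```python
import re


def select_relevant_text(text, keyword="voltage", max_chars=10000, lead_chars=2000):
    """Return at most `max_chars` characters of `text` around the first keyword match.

    Falls back to the first `max_chars` characters if the keyword never appears.
    """
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return text[:max_chars]
    # Keep some context before the match, then spend the rest of the budget after it.
    start = max(0, match.start() - lead_chars)
    return text[start:start + max_chars]


# Toy document in which the keyword appears well past the first 10,000 characters.
toy_text = "x" * 50000 + " Operating voltage: 3.3 V " + "y" * 50000
snippet = select_relevant_text(toy_text)
print(len(snippet), "voltage" in snippet)
```

Prompts could then be built with `prompt_template + select_relevant_text(text)` in place of `text[:10000]`. A single keyword is a crude heuristic, so it is worth spot-checking a few snippets before relying on it.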
After forming our prompts for each datasheet, let’s pass in the prompts to the TLM. Note we can run all prompts simultaneously in a batch.
TLM will return a list of dictionaries, with each dictionary containing the corresponding response and corresponding trustworthiness score, which quantifies how confident you can be that the response is correct.
tlm_response = tlm.prompt(texts_with_prompt)
Let’s organize the extracted information (TLM responses) and trustworthiness scores for each input PDF, along with its filename.
results_df = pd.DataFrame({
    "filename": pdf_files,
    "response": [d["response"] for d in tlm_response],
    "trustworthiness_score": [d["trustworthiness_score"] for d in tlm_response]
})
results_df.head()
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 0 | 1.pdf | Operating Voltage: 5.5V DC | 0.997596 |
| 1 | 10.pdf | Operating Voltage: 2.7 - 5.5V DC | 0.992541 |
| 2 | 11.pdf | Operating Voltage: 1.8 - 5.5V DC | 0.985845 |
| 3 | 12.pdf | Operating Voltage: 0.9 - 1.6V DC | 0.987066 |
| 4 | 13.pdf | Operating Voltage: 1.8 - 5.5V | 0.990618 |
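Because the prompt requests a fixed output format, the responses can be parsed into structured fields for downstream use. The following is a minimal sketch and not part of the TLM API: the `parse_operating_voltage` helper and its regular expression are assumptions based on the format requested in the prompt. Responses that deviate from that format, as well as "N/A" responses, return `None` so they can be handled separately.

```python
import re

# Matches e.g. "Operating Voltage: 2.7 - 5.5V DC", "Operating Voltage: 5.5V DC",
# or "Operating Voltage: 1.8 - 5.5V" (AC/DC omitted).
VOLTAGE_PATTERN = re.compile(
    r"^Operating Voltage:\s*"
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*V"
    r"(?:\s+(?P<kind>AC|DC))?\s*$"
)


def parse_operating_voltage(response):
    """Parse a response into (low, high, kind), or return None if it does not match.

    A single value is returned with low == high; kind is None when AC/DC is omitted.
    """
    match = VOLTAGE_PATTERN.match(response.strip())
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    return low, high, match.group("kind")


print(parse_operating_voltage("Operating Voltage: 2.7 - 5.5V DC"))
print(parse_operating_voltage("Operating Voltage: 1.8 - 5.5V"))
print(parse_operating_voltage("Operating Voltage: N/A"))
```

Responses that fail to parse are themselves a useful signal: an answer that ignores the requested format is worth a manual look regardless of its score.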
Examine Results
TLM has extracted each product’s operating voltage from its datasheet, along with a trustworthiness_score indicating its confidence. Let’s now examine the results in more detail.
High Trustworthiness Scores
The responses with the highest trustworthiness scores represent datasheets where TLM is the most confident that it has accurately extracted the operating voltage of the product.
results_df.sort_values("trustworthiness_score", ascending=False).head()
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 24 | 31.pdf | Operating Voltage: 1.71 - 3.6V DC | 0.999100 |
| 44 | 5.pdf | Operating Voltage: 4.5 - 9V DC | 0.998593 |
| 48 | 53.pdf | Operating Voltage: 2.2 - 3.6V DC | 0.997787 |
| 33 | 4.pdf | Operating Voltage: 19.2 - 28.8V DC | 0.997736 |
| 0 | 1.pdf | Operating Voltage: 5.5V DC | 0.997596 |
Below we show part of the datasheet which corresponds to the highest TLM trustworthiness score, document 31.pdf.
From this image (also found on page 1 of the original file, 31.pdf), we see that it specifies an operating voltage range of 1.71 VDC to 3.6 VDC. This matches the information extracted by TLM. When the trustworthiness scores are high, you can trust the results from TLM with great confidence!
Low Trustworthiness Scores
The responses with the lowest trustworthiness scores represent datasheets where TLM is the least confident that it correctly extracted the operating voltage of the product. Low trustworthiness scores indicate instances where you should not place much confidence in the TLM response. Results with low trustworthiness scores would benefit most from manual review, especially if the results are critical to get right.
results_df.sort_values("trustworthiness_score").head()
| | filename | response | trustworthiness_score |
|---|---|---|---|
| 30 | 37.pdf | Operating Voltage: 3.0V DC | 0.492340 |
| 64 | 68.pdf | Operating Voltage: 1.5 - 2.6V DC | 0.581276 |
| 9 | 18.pdf | Operating Voltage: 5V ±5% DC | 0.582412 |
| 14 | 22.pdf | Operating Voltage: 6.3 - 450V DC | 0.623049 |
| 70 | 73.pdf | Operating Voltage: 24 V DC | 0.647475 |
Above we see the results that received the lowest trustworthiness scores among our documents. TLM extracted an operating voltage value of 3V DC for datasheet 37.pdf, which received the lowest trustworthiness score.
We depict part of this datasheet below (you can find this information on page 7 of 37.pdf). We see that this table lists only the product’s supply voltage and backlight supply voltage, with no mention of an operating voltage. Like any standard LLM, TLM struggles to extract the correct information (or to recognize that this information is not available) because of confusing terminology, and hence returns the wrong response. Moreover, 3V DC is not even the correct supply voltage range. Still, the trustworthiness score associated with this response is low, which allows us to catch this bad response with minimal manual review of the overall results.
For datasheet 68.pdf, TLM also produced a low trustworthiness score for its response, which was the following:
Operating Voltage: 1.5 - 2.6V DC
Below we show part of the datasheet (the table on page 2 of 68.pdf).
We see that there are many different products here with a wide range of operating voltages (some of which are not in the 1.5 - 2.6V range).
Again, the low trustworthiness scores help you realize this example requires further review.
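One way to act on these scores is to route responses below a chosen threshold to manual review and accept the rest automatically. The sketch below assumes a threshold of 0.8, which is an arbitrary illustrative value rather than a recommendation; the `split_for_review` helper is likewise hypothetical. It operates on a list of dictionaries in the same shape that `tlm.prompt` returns, using a few rows from the tables above as sample data.

```python
def split_for_review(filenames, tlm_outputs, threshold=0.8):
    """Split results into (auto_accepted, needs_review) by trustworthiness score."""
    auto_accepted, needs_review = [], []
    for filename, output in zip(filenames, tlm_outputs):
        record = {"filename": filename, **output}
        if output["trustworthiness_score"] >= threshold:
            auto_accepted.append(record)
        else:
            needs_review.append(record)
    # Put the least trustworthy responses at the front of the review queue.
    needs_review.sort(key=lambda record: record["trustworthiness_score"])
    return auto_accepted, needs_review


# Sample rows copied from the result tables in this tutorial.
sample_files = ["31.pdf", "37.pdf", "68.pdf"]
sample_outputs = [
    {"response": "Operating Voltage: 1.71 - 3.6V DC", "trustworthiness_score": 0.999100},
    {"response": "Operating Voltage: 3.0V DC", "trustworthiness_score": 0.492340},
    {"response": "Operating Voltage: 1.5 - 2.6V DC", "trustworthiness_score": 0.581276},
]
accepted, review = split_for_review(sample_files, sample_outputs)
print([record["filename"] for record in accepted])
print([record["filename"] for record in review])
```

A suitable threshold depends on how costly an extraction error is for your application; checking a small labeled sample of documents is one way to choose it.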
Don’t let unreliable results prevent you from employing LLM automation to extract information from documents at scale! You can rely on LLMs for a large subset of documents and, with TLM’s trustworthiness scores, know which other documents you should still manually inspect. This can save your teams immense time and significantly improve the quality of extracted information.