How To Automate PDF Data Extraction - 3 Different Methods To Parse PDFs For Analytics

research@theseattledataguy.com | October 2, 2024 | big data

If you work in data, then at some point in your career, you’ll likely need to parse data from a PDF.

You might need to parse thousands of PDFs in order to pull out invoice information.

Or maybe you need to parse financial filing documents such as 10-Ks.

This can seem challenging at first. After all, PDFs are not CSVs or some other more structured file format.

But it doesn’t have to be. There are several tools you can use, ranging from Python libraries to out-of-the-box solutions, all of which can make parsing and analyzing data from PDFs far easier.

In this article, I’ll cover how you can use Python to scrape data from a PDF, as well as how you can analyze data from a PDF without ever using Python.

So, let’s dive in!

How To Parse Data From PDFs With Python

Now, the first issue you’ll likely run into is that not all PDFs are alike. Some can be parsed easily because they were created via Adobe or some other tool that provides an underlying text structure, whereas others are just scans. In turn, those scans are generally just images.

They have no structure, and in order to pull out the data from said images, we’ll need a different technology. And that’s just to parse the data. From there, analyzing data will require even more steps.

But let’s start with just parsing the PDFs with Python.

There are several libraries you can use in Python to parse PDFs.

The two I have used are PyPDF2 and Pytesseract. I recall using PyPDF2 in a ChatGPT project, and that was when I first ran into the issue of ChatGPT being behind on data (of course, this was two years ago): the methods it was trying to use were deprecated.

Here is what you can use the two libraries for.

What Is Pytesseract And When To Use It:

Pytesseract is an OCR (Optical Character Recognition) tool, which means it’s used to extract text from images or scanned documents. If you have a document or image that is not in a text-based format, such as a PNG, JPEG, or scanned PDF, Pytesseract will likely be the right tool.
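As a rough sketch of how that might look, the helper below runs OCR on an image or a scanned PDF. The function name ocr_file and the list of image suffixes are made up for this example. It assumes the Tesseract engine itself is installed (Pytesseract is a Python wrapper around it) and that the pdf2image package, which needs Poppler, is available to turn PDF pages into images.

```python
from pathlib import Path

# File types treated as plain images (an assumption for this sketch).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_file(path: str, lang: str = "eng") -> str:
    """Return the text Tesseract recognizes in an image or scanned PDF."""
    suffix = Path(path).suffix.lower()
    if suffix not in IMAGE_SUFFIXES and suffix != ".pdf":
        raise ValueError(f"Unsupported file type for OCR: {suffix or path}")

    # Imported here so the file-type check above still works on a machine
    # where the OCR stack has not been installed yet.
    import pytesseract
    from PIL import Image

    if suffix == ".pdf":
        # pdf2image renders each PDF page to a PIL image (requires Poppler).
        from pdf2image import convert_from_path

        pages = convert_from_path(path, dpi=300)
        return "\n".join(
            pytesseract.image_to_string(page, lang=lang) for page in pages
        )

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling ocr_file("scanned_invoice.pdf") (a hypothetical file name) would return the recognized text as one string, with pages separated by newlines.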

What Is PyPDF2 And When To Use It:

PyPDF2 is designed to handle text extraction, manipulation, and merging/splitting of text-based PDF files, not scanned images. If the PDF is not a scanned document but was generated electronically, PyPDF2 can directly extract this text.
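One rough way to choose between the two libraries is to try plain text extraction first and fall back to OCR only when almost nothing comes back. The sketch below shows that check on its own. The function name needs_ocr and the 20-characters-per-page threshold are assumptions for illustration, not something either library provides.

```python
from typing import List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is a scan, given the text extracted from each page.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    if not page_texts:
        return True  # nothing was extracted at all, so fall back to OCR
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars < min_chars_per_page * len(page_texts)


# Made-up sample strings standing in for per-page extraction results.
electronic_pdf = ["Invoice 1234\nTotal due: $500.00", "Payment terms: net 30 days"]
scanned_pdf = ["", " \n", ""]

print(needs_ocr(electronic_pdf))  # False
print(needs_ocr(scanned_pdf))     # True
```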

OCR – What Is Optical Character Recognition?

As you can see, they both have their use cases.

Quick pause: let’s discuss what OCR is.

Optical Character Recognition (OCR) is a technology used to convert different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data.

How To Use Python To Parse Data From A PDF

OK, that’s a good high-level explanation. But instead of telling you, let me show you an example of how you can use PyPDF2 to pull the data from a PDF.

For our example, let’s use PyPDF2 to parse bill documents from Washington state. These are text-based documents that anyone can access online, and they contain information about upcoming and past bills.

Below is a script you can use for any bill based on ID that will then store the data in pdf_text.

As you can see, the script pulls in the PDF from online and then uses PyPDF2.PdfReader() to parse the information. This will work on any PDF that isn’t an image.

For example, you could do a similar thing with financial documents, such as SEC filings, just by changing the URL, like in the example below.

Now, as you can tell, this is just the start of the project. After this, you’d need to store the data somewhere, such as a data warehouse, in order to analyze it.
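To make the storage step concrete, here is a small sketch that writes one row per page into SQLite, used purely as a stand-in for a real data warehouse. The table name pdf_pages, its columns, and the save_pages function are assumptions for this example.

```python
import sqlite3
from typing import List


def save_pages(conn: sqlite3.Connection, source: str, page_texts: List[str]) -> None:
    """Store one row per page so the text can be queried with SQL later."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pdf_pages (
            source      TEXT    NOT NULL,
            page_number INTEGER NOT NULL,
            page_text   TEXT    NOT NULL,
            PRIMARY KEY (source, page_number)
        )
        """
    )
    rows = [(source, number, text) for number, text in enumerate(page_texts, start=1)]
    # INSERT OR REPLACE keeps re-runs from failing on rows that already exist.
    conn.executemany("INSERT OR REPLACE INTO pdf_pages VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
save_pages(conn, "bill_1000.pdf", ["first page text", "second page text"])
print(conn.execute("SELECT COUNT(*) FROM pdf_pages").fetchone()[0])  # 2
```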

You could also just write the rest of the script so that it analyzes this single PDF.

But what happens when you want to analyze hundreds or thousands of PDFs and then store that data for some form of analysis? Then you might want to use a solution that doesn’t just parse the data once. Instead, you’ll likely want to use a solution that makes it easy to parse the data over and over again.
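If you did stay in Python, a repeatable version of the job could look roughly like the sketch below. It walks a folder, skips any file whose contents were already parsed, and keeps going when a single file fails. The function name parse_new_pdfs and the idea of tracking files by content hash are my own assumptions here, and the parser is passed in so you can plug in PyPDF2, OCR, or anything else.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


def parse_new_pdfs(
    folder: str,
    extract_text: Callable[[bytes], str],
    seen_hashes: Dict[str, str],
) -> Dict[str, str]:
    """Parse every PDF in `folder` that has not been parsed before.

    `extract_text` is whatever parser you choose (PyPDF2, OCR, ...).
    `seen_hashes` maps a content hash to the file name it came from and is
    updated in place, so a later run skips files that were already handled.
    """
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue  # the same bytes were already parsed on an earlier run
        try:
            results[path.name] = extract_text(data)
        except Exception as error:  # one bad file should not stop the batch
            print(f"Skipping {path.name}: {error}")
            continue
        seen_hashes[digest] = path.name
    return results
```

In a real pipeline you would persist seen_hashes between runs (for example in a database table) rather than keep it in memory.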

How To Analyze PDFs Without Python With Roe

If you’re not comfortable with Python, or you just want to be able to run large queries over your PDFs, you can use tools like Roe AI. Roe has built-in data connectors to unstructured data sources such as S3, which allow you to query data directly from PDFs via SQL and their agents.

So instead of needing to manually write Python to parse PDFs, then load that data into a database, then write a query, you can just write a query!

An example of this can be seen below, where Roe’s team analyzed 40 SEC 8-K filings from $LYFT and $UBER, which totaled 2,400 pages.

Then, instead of writing a script to manually pull out all of the data fields, they were able to implement an agent, as shown below, to extract the key data points they were interested in.

Once the data is extracted, it can then be accessed and queried using SQL.

From there, you can build further queries and actually ask questions about the data directly.

One of the concepts that I did like about Roe is that, unlike the traditional approach of vector search RAG, where you’re often stuck “chatting” with the end data set, Roe uses SQL and LLM vision models to keep the original data unaltered. In turn, this allows them to easily tie results back and show you where the data came from. I believe being able to keep this data lineage is key.

Also, as someone who has relied heavily on SQL, I foresee this approach meeting a lot of business and reporting use cases.

What PDFs Will You Parse?

Analyzing structured and unstructured PDFs can provide far more context and value to businesses. Instead of being limited to data from traditional tables, your team can analyze PDFs in your S3 bucket or other storage system, all with SQL.

This means that analysts can answer complex business questions faster without requiring an entire Python script.

That’s why Seattle Data Guy recently partnered with Roe. Their ability to make analyzing PDFs and unstructured data like images and video so easy really resonates. Instead of needing to write Python, you can use SQL and Agents to analyze your data.

If you’re interested in learning more, then you can reach out to me or Richard!

Disclosure: Seattle Data Guy does have a stake in Roe.AI
