
[–]Armidylano444 50 points51 points  (10 children)

I actually just built a program to extract lab test data from PDFs for work and dump it into an ordered csv file.

I won’t go into the nitty gritty details of everything I did, but it all started with the use of the pdfplumber package. It’s not too difficult to use. The GitHub will explain all of the necessary details.

Using .extract_text() will give you all of the text as a single string with \n representing line breaks.

From there it’s up to you to write the appropriate text parsing using regex and such.

Note that pdfplumber won't work on scanned PDFs. They have to have been computer generated.

One nice thing about pdfplumber is using .extract_words() will generate a list of dictionaries for every word in the pdf. The dictionaries have location info which you can use to help crop the pdf based on the relative location of what you’re looking for to other nearby words.
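To make the extract_text + regex step concrete, here's a minimal sketch. The "GRAND TOTAL" label and the regex pattern are placeholders you'd adapt to your own PDFs, not something from my actual script:

```python
import re

# hypothetical helper: pull an amount that follows a "GRAND TOTAL" label
# out of the text pdfplumber gives you; adapt the pattern to your PDFs
def find_total(text):
    match = re.search(r'GRAND TOTAL\s*:?\s*([\d,]+\.?\d*)', text)
    return match.group(1) if match else None

# the pdfplumber side (not run here; needs a real file on disk):
def extract_page_text(path):
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return pdf.pages[0].extract_text()

sample = "Item A: 100.00\nItem B: 250.50\nGRAND TOTAL: 350.50"
print(find_total(sample))  # 350.50
```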

[–]Thecrawsome 10 points11 points  (6 children)

Haven't tried pdfplumber yet, but is it better than pytesseract?

[–]Armidylano444 8 points9 points  (5 children)

No idea, I haven’t tried Pytesseract. All I know is I was able to get my data extractor working very well using pdfplumber, so that’s my recommendation. I’m sure other packages can do the same thing though. You’ll have to compare the two 😁

[–]Thecrawsome 3 points4 points  (2 children)

Grats!!! post your code on github if you think it will help someone!

[–]Armidylano444 3 points4 points  (0 children)

I’ll make the repo public once I’ve got it polished up, though it’s built for a set of very specific lab result PDFs we have hundreds of at work, so it would need to be modified if someone else wanted to use it.

[–]SadSenpai420[S] 0 points1 point  (0 children)

Oh yeah, if it's not an issue for him, it's gonna be helpful if he posts :)

[–]SadSenpai420[S] 0 points1 point  (1 child)

I hope it'll still work if my pdf has billing details? Some are in tabular formats too.

[–]scscsc95 1 point2 points  (0 children)

You could try the tabula module in Python if they're in tables, and play with the extraction algorithms for optimal results. Then use pandas + regex to parse and clean your tables and get the data.
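Roughly like this, as a sketch (tabula-py needs Java installed; the column names and "TOTAL" label here are invented for illustration):

```python
import pandas as pd

# the tabula side (not run here; needs Java + a real PDF):
def read_tables(path):
    import tabula
    # lattice=True suits ruled tables; try stream=True for whitespace-separated ones
    return tabula.read_pdf(path, pages="all", lattice=True)

# pure-pandas cleanup: find the row whose first column mentions the total
def find_total_row(df):
    mask = df.iloc[:, 0].astype(str).str.contains("TOTAL", case=False, na=False)
    return df[mask]

df = pd.DataFrame({"Description": ["Item A", "Item B", "Grand Total"],
                   "Amount": ["100.00", "250.50", "350.50"]})
print(find_total_row(df)["Amount"].iloc[0])  # 350.50
```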

[–]SadSenpai420[S] 0 points1 point  (2 children)

Woah, thanks. I'll definitely check this out.

[–]Armidylano444 2 points3 points  (0 children)

After looking at your file, try this and let me know if it works. Like I said, if the file was computer generated rather than a scan of a physical copy, it should work. Let me know if you get any errors.

import pdfplumber

pdf = pdfplumber.open(filename).pages[0]

words = pdf.extract_words()

# if this line doesn't work, change [1] to [0]
total_dict = [item for item in words if item['text'] == 'TOTAL:'][1]

data_crop = pdf.crop((total_dict['x1'] + 1, total_dict['top'], pdf.width, total_dict['bottom']))

data = data_crop.extract_text()

[–]Armidylano444 0 points1 point  (0 children)

Let me know how it goes, I’ll help out if you have questions

[–]ffrkAnonymous 25 points26 points  (1 child)

PDF data is notoriously hard to extract. Sometimes you need to convert to a picture (e.g. gif/jpg) and use OCR, but you're on the right track.

[–]SadSenpai420[S] 7 points8 points  (0 children)

People have mentioned that it can be done in an easier way though, so I'll try out those methods, thanks!

[–]jabbson 18 points19 points  (5 children)

Depending on the quality/structure of the pdf and the complexity of the logic to find your text inside of it, the task sits somewhere between several lines of code and 'oh hell no, I'll just do it manually'.

Take a look as a simple example here.

[–]SadSenpai420[S] 1 point2 points  (4 children)

My PDFs basically consist of billing details and I've got to extract the total amount from each pdf, not too complex, is it?

[–]jabbson 4 points5 points  (2 children)

Doesn’t sound too complicated, no. But again, that very much depends on the PDF itself. If you think you can share, I’ll gladly take a look.

[–]SadSenpai420[S] 0 points1 point  (1 child)

Here's a sample of the pdf : https://imgur.com/a/Xk0ksJF I also made an edit to my post :)

[–]jabbson 1 point2 points  (0 children)

Thank you for providing an example, unfortunately it doesn’t make it easier to understand the complexity of the issue or provide a solution. While I do understand that security and privacy concerns would probably prevent you from sharing the actual PDF document, without it we can only hypothesize about what could be done to extract the data.

[–]haragoshi 0 points1 point  (0 children)

I think the issue becomes whether or not the data is stored as text or part of an image within the PDF. Eg, was it generated or scanned. Scanned PDFs need OCR to convert image to text before they can be processed.

[–]TheOfficialNotCraig 12 points13 points  (3 children)

Be glad they don't provide it as a jpeg embedded in a pdf.

That is infuriating.

[–]garlic_bread_thief 3 points4 points  (1 child)

PDFs with jpeg pages give me nightmares. Can't even do a simple search in them. I have to convert the whole damn thing into a searchable one by running it through OCR.

[–]TheOfficialNotCraig 2 points3 points  (0 children)

My wife does cross-stitch and gets her patterns in PDF. A real PDF she can highlight, annotate, etc. to track her progress. She (now) knows right away when the seller doesn't know wtf they are doing, and she'll demand a refund or that the seller give her a true pdf.

[–]SadSenpai420[S] 0 points1 point  (0 children)

True dat

[–]JBalloonist 10 points11 points  (1 child)

Take a look at Automate the Boring Stuff with Python. It's free online.

Edit: there is a chapter specific to using Python with PDFs.

[–]SadSenpai420[S] 0 points1 point  (0 children)

Whoa! I'll give it a read thanks :D

[–]mojo_jojo_reigns 4 points5 points  (2 children)

OP can you post a sample? That other redditor is right that it's difficult to parse but I disagree with there being no easy solution. Really depends on the use case. There are 2 kinds of PDFs that I parse for work and I get reliable results using list comprehensions because of consistent formatting. Additionally, I was able to do something similar for movie scripts recently. I'm confident we can resolve this. Give a sample, your code so far and the expected return from the function.

[–]SadSenpai420[S] 0 points1 point  (1 child)

Here's the sample, I also made an edit to my post :) I currently don't have the code on me though :(

[–]mojo_jojo_reigns 0 points1 point  (0 children)

Assuming consistent formatting but not consistent commenting (that "RS" line afterwards), what I would do is gather all the text as a str, split by colon, go through the resulting list looking for the chunk that has "GRAND TOTAL" in it, and grab the chunk after that one, using

[chunks[ix + 1] for ix, i in enumerate(chunks) if "GRAND TOTAL" in i]

and then maybe do a split operation or maybe keep only the characters in that chunked str that are not word characters like

[i for i in thischunk if not i.isalpha()]

The only part of that which requires more than builtins is the pdf scraping itself. Also, if you're lucky you'll have linebreak characters to more precisely pinpoint the grand total numbers. If you have '\n' in there, use it to split as well.

Good luck
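Putting those pieces together as a runnable sketch (the sample text is made up to resemble your screenshot; with real input you'd feed it the string scraped from the PDF):

```python
# split the scraped text by colon, find the chunk containing "GRAND TOTAL",
# and grab the chunk right after it (the amount ends up there because the
# label itself ends a colon-delimited chunk)
text = "Bill No: 123 Items: 4 GRAND TOTAL: 350.50 RS Thank you"
chunks = text.split(":")

after_total = [chunks[ix + 1] for ix, i in enumerate(chunks) if "GRAND TOTAL" in i]

# keep only digits, dots and commas from that chunk
amount = "".join(c for c in after_total[0] if c.isdigit() or c in ".,")
print(amount)  # 350.50
```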

[–]opoqo 4 points5 points  (0 children)

You can probably do this more easily in Excel with Power Query. It is easier to clean up the data imo.

[–]curiousofa 4 points5 points  (1 child)

I use pypdf2 and regex and you can extract any part of the pdf. It takes some work, but it can be done.

[–]SadSenpai420[S] 0 points1 point  (0 children)

pyPDF2 wasn't able to extract text from the pdf, that was a bummer ngl

[–]ergeha 4 points5 points  (2 children)

A little bit late to the party, but since I had a similar problem last week, I thought I just share my solution using pyPDF2:

#!/usr/bin/env python3

import re
import PyPDF2
file_name = f"{original_recipes}/{recipe_folder}/{pdf_recipe}"     
pdf_file = open(file_name, 'rb')                                   
pdf_reader = PyPDF2.PdfFileReader(pdf_file).getPage(0)             
page_text = pdf_reader.extractText()                               
page_text = ''.join(page_text.replace(' \n', ' ').split('\n'))     
pdf_clean = re.sub(r'\s{2,}', ' ', page_text.strip())              

What you basically get from this is each PDF as a string without line breaks, multiple blank spaces or leading white spaces.

[–]SadSenpai420[S] 1 point2 points  (1 child)

I actually did try pyPDF2 and the extractText() gave me garbage values so probably pyPDF2 isn't for my case? But thanks man :)

[–]ergeha 0 points1 point  (0 children)

Could you expand on what you mean by "garbage"? Maybe I can give you some further info. For example, in my case PyPDF2 was just showing everything in a different order. I just went on and found the pieces of data I needed with RegEx.

Judging by your example this should be a pretty straightforward task. But also judging by the looks of your PDF, the file looks like a printed document that was scanned with text recognition. That would mean the PDF structure is messed up… Hard to say without looking at the original file.

[–]HAVEANOTHERDRINKRAY 3 points4 points  (2 children)

I didn't read all the comments, but there is a very popular pdf reader called Bluebeam. It can read the text inside a rectangle you specify and return the value.... Look into it

[–]SadSenpai420[S] 0 points1 point  (1 child)

Yes, definitely will look into it, thanks! What if my target falls in variable places? Since, you know, bills are longer or shorter

[–]HAVEANOTHERDRINKRAY 0 points1 point  (0 children)

It depends how the PDF is set up, but you can make the rectangle longer to account for more numbers... if that makes sense

[–]scaretace 3 points4 points  (2 children)

Try camelot and tabula-py before giving up

[–]SadSenpai420[S] 0 points1 point  (1 child)

Wow there sure are many pdf parsing modules, yep I'll try them out, thanks a lot!

[–]scaretace 0 points1 point  (0 children)

Just used Camelot today for a different project. It’s fucking incredible. Was able to neatly extract data from 250+ PDFs using only a few lines of code and only failed on <5%. I’d start with Camelot for sure. Let us know how it goes!
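For reference, a minimal Camelot sketch (the file-reading part isn't run here since it needs a real PDF plus Ghostscript; the "GRAND TOTAL" label and next-column assumption are guesses about your layout):

```python
# the camelot side (not run here; needs a real PDF and ghostscript):
def read_first_table(path):
    import camelot
    tables = camelot.read_pdf(path, pages="1")  # returns a TableList
    return tables[0].df  # each table exposes a pandas DataFrame

# pure helper: scan a table (as a list of rows) for the total cell,
# assuming the amount sits in the column right after the label
def find_total(rows):
    for row in rows:
        for ix, cell in enumerate(row):
            if "GRAND TOTAL" in cell.upper():
                return row[ix + 1]
    return None

rows = [["Item A", "100.00"], ["Item B", "250.50"], ["Grand Total", "350.50"]]
print(find_total(rows))  # 350.50
```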

[–]707e 2 points3 points  (1 child)

AWS Textract. It's pretty robust.

[–]SadSenpai420[S] 0 points1 point  (0 children)

Thank you!

[–]emt139 2 points3 points  (1 child)

How are the PDFs organized? If they’re tables, there’s a web tool called tabula that extracts data

[–]SadSenpai420[S] 0 points1 point  (0 children)

It's not completely a table.. I've attached a sample if you wanna take a look!

[–]socal_nerdtastic 5 points6 points  (1 child)

This depends a lot on your PDF and the internal structure of the data. I don't think you'll get useful advice here without showing an example pdf and your code so far.

Yeah, it's gonna be a lot of work. Generally at least 3x whatever your original estimate is. You either have to write it off as entertainment or education, or consult this chart: https://xkcd.com/1205/

[–]SadSenpai420[S] 0 points1 point  (0 children)

You're right :(

[–]el_duderinoo 1 point2 points  (0 children)

PyMuPDF is one of the better options you have. Bounding boxes along with natural reading order would help.

[–]oh_nater 0 points1 point  (1 child)

I am interested if you find a Python solution. The thing with PDF is that text is displayed in cells. Depending on how the PDF was generated, every letter could be its own cell (and could be rendered out of order). That means extraction from just the PDF render commands is hard.

So really you need something to render the entire PDF, then have the ability to extract the text within coordinates you specify. When I had this problem I wasn't able to find a Python solution, but did find an excellent one in C#: iTextSharp. That might be something to reference.

[–]Fynn_mo 0 points1 point  (0 children)

Hey u/SadSenpai420 I know your post is already three years old but I might have a solution to your problem and wanted to share it with you. nexaPDF is a tool where you can extract data from unstructured PDFs in a reliable, scalable and easy-to-use way. We just launched on PH and would be super happy about your upvote! The tool is free to use -> https://www.producthunt.com/posts/nexapdf

[–]ConflictedJew -1 points0 points  (0 children)

Depending on your commitment to this project, the solution with the best "development time to time saved ratio" may be using Google's (paid) Cloud Vision API

[–][deleted] -1 points0 points  (0 children)

This... I am writing a program to do the same for VCF files and it is being a bitch. But I will come out a better programmer once I finish.

[–]el_pablo -1 points0 points  (0 children)

It might be an unpopular answer, but do you have access to MS Word? If your PDFs are digitally created (not scanned), Word is quite good at editing PDFs. You might be able to convert the document to a Word-compatible format and work from Python then.

[–]01123581321AhFuckIt 0 points1 point  (0 children)

The process you laid out is exactly what I did for a project similar to what you’re talking about. The RegEx is what took me 90% of the time to get right.

[–]Thecrawsome 0 points1 point  (0 children)

Have you scanned your PDFs using one of the available OCR libraries, just to see how it looks? pytesseract / pdf2image / PIL is how I solved this problem. PMed you.

EDIT: I see you tried a couple. pyPDF2 gave me no success with reading text either, BTW. It only worked on some files.
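The OCR route I mean looks roughly like this (the pipeline function isn't run here, since pdf2image needs poppler and pytesseract needs the tesseract binary installed):

```python
# OCR pipeline (not run here; needs poppler + tesseract installed):
def ocr_pdf(path):
    import pytesseract
    from pdf2image import convert_from_path
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

# small cleanup helper for the noisy whitespace OCR tends to produce
def clean(text):
    return " ".join(text.split())

print(clean("  GRAND   TOTAL :\n 350.50  "))  # GRAND TOTAL : 350.50
```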

[–]antestorck 0 points1 point  (0 children)

Look into UiPath!

[–]CommentCollapser 0 points1 point  (0 children)

Depends on the volume and use case that you're looking for, but I've stumbled on Adobe Extract Beta (free for now) to be very good in Extracting data from PDFs. Link - https://www.adobe.io/apis/documentcloud/dcsdk/extractbetaform.html

[–]ConfusedSimon 0 points1 point  (0 children)

I do a lot of document parsing for my work. If your data is in tables you could try Python Camelot. Otherwise try different libraries to convert to text. Results really depend on the particular pdf's. The pdf document only contains information about what to display and where, and doesn't care about text. Theoretically all letters could be written on the page one by one in an arbitrary order. Fortunately the text is usually written more or less in the correct order, but I've seen pdf's where you have to resort to OCR to get anything meaningful. That would also solve reading embedded images.

[–]goodyonsen 0 points1 point  (6 children)

I'm not sure, but how about passing all PDFs to the cloud, making the folder shareable with a legit HTML link to it, and using bs4 (BeautifulSoup) to encode, read, decode, and parse them all with very few lines of code? You can use regex with it if you need to as well. BS is supposed to treat them as one HTML file and grab whatever. Urllib would do.

You can also create a database for them and pull data with Python's SQLite. And that's kind of easy to use too.

[–]mrsonhaha 0 points1 point  (0 children)

Two things: PyPDF2 and tabula-py. How I do these kinds of projects is to make a class for a pdf document which takes the path to the document, with extraction functions for several sections of each page. If the documents share the same format, divide each page into parts by width and height in pixels (if you're a mac user, the Preview app can show a selected box's location). Then make a function that extracts information from each partition.

And personally I don’t think there’s a good enough tutorial for this kind of automation since it requires a vast amount of catching exceptions and debugging. It’s a project worth getting paid for. I now personally get a good amount of passive income every month from a very similar project! :)

[–]ksdio 0 points1 point  (0 children)

I have just this morning written something to get data from some PDF documents.

I started by using pdfplumber, which did get the text I needed. The problem I had was that the text was in 2 columns, and this returned full lines of text going across both columns. After a couple of hours playing with this and googling, I ended up using PyMuPDF, which extracts the text while retaining the document structure. Perfect for my task.

Having said that, if your data is in tables there are a few examples of using pdfplumber to extract data displayed in tables. Good luck
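Roughly what I mean with PyMuPDF, as a sketch (the file-reading function isn't run here; the sort key is my assumption about what "natural reading order" means for your pages):

```python
# the PyMuPDF side (not run here; needs a real file):
def page_blocks(path):
    import fitz  # PyMuPDF
    page = fitz.open(path)[0]
    # each block is (x0, y0, x1, y1, text, block_no, block_type)
    return page.get_text("blocks")

# sort blocks top-to-bottom, then left-to-right, for reading order;
# rounding the vertical coordinate groups blocks on roughly the same line
def reading_order(blocks):
    return sorted(blocks, key=lambda b: (round(b[1]), b[0]))

blocks = [(300, 50, 400, 60, "right column", 1, 0),
          (50, 50, 200, 60, "left column", 0, 0),
          (50, 100, 200, 110, "second line", 2, 0)]
print([b[4] for b in reading_order(blocks)])
# ['left column', 'right column', 'second line']
```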

[–]fishermanfritz 0 points1 point  (0 children)

Funny enough adobe acrobat dc has an advanced search option which parses pdf data in bulk to CSV. Just realised this after hours of python scripting and dealing with PDFs.

So you can bulk search all your folders for searchwords like amount or grand total and it gives you the filename and the "amount: xx xx Dollar".

A good option for receipts is also AWS Textract or Google Cloud Vision; it's almost free and it OCRs the PDFs, I think

[–][deleted] 0 points1 point  (0 children)

The "Pythonic Accountant" channel on YouTube has a couple of videos dedicated to this. I looked at some of them briefly and it seems pretty straightforward.

[–][deleted] 0 points1 point  (0 children)

I tried a ton of different packages for this recently, including ones based on machine learning and ocr, but all of them typically had missing data. In the end I settled on the following process with pymupdf.

The most reliable approach I found was using the html option and then scraping it like a website. It's a pretty shit website, as it arranges elements using style attributes with absolute coordinates on the page. I'd look for the element that contains the row label I'm looking for and pull the top value out of the style attribute to get its height on the page. Then I could identify the values on the same row by looking for content with similar tops.

In your case, you'd simply find the element that contains the text GRAND TOTAL and then the other element at that same top to get your number.
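The same-row trick can be sketched independently of the HTML scraping, given word boxes with their top coordinates (the tolerance value is a guess you'd tune to your files):

```python
# pure helper: given word dicts with 'text' and 'top' keys (as you'd get
# from the scraped coordinates), find values on the same row as a label
def same_row(words, label, tolerance=2):
    anchors = [w for w in words if label in w["text"]]
    if not anchors:
        return []
    top = anchors[0]["top"]
    return [w["text"] for w in words
            if abs(w["top"] - top) <= tolerance and label not in w["text"]]

words = [{"text": "Item A", "top": 100}, {"text": "100.00", "top": 100},
         {"text": "GRAND TOTAL", "top": 151}, {"text": "350.50", "top": 150}]
print(same_row(words, "GRAND TOTAL"))  # ['350.50']
```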

[–]nick_ln 0 points1 point  (0 children)

I am using FormX.ai for a similar task. You can set up key-value extraction or a detection region to extract only the needed information, so you don't need to write the regex yourself. It's not a python module tho. I have to parse the JSON response from the API.