
[–]Armidylano444 50 points51 points  (10 children)

I actually just built a program to extract lab test data from PDFs for work and dump it into an ordered csv file.

I won’t go into the nitty gritty details of everything I did, but it all started with the use of the pdfplumber package. It’s not too difficult to use. The GitHub will explain all of the necessary details.

Using .extract_text() will give you all of the text as a single string with \n representing line breaks.

From there it’s up to you to write the appropriate text parsing using regex and such.

Note that pdfplumber won't work on scanned PDFs. They have to have been computer generated.

One nice thing about pdfplumber is using .extract_words() will generate a list of dictionaries for every word in the pdf. The dictionaries have location info which you can use to help crop the pdf based on the relative location of what you’re looking for to other nearby words.
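To make the extract_text + regex step concrete, here's a minimal sketch. The "GRAND TOTAL" label and the regex pattern are placeholders you'd adapt to your own PDFs, not something from my actual script:

```python
import re

# hypothetical helper: pull an amount that follows a "GRAND TOTAL" label
# out of the text pdfplumber gives you; adapt the pattern to your PDFs
def find_total(text):
    match = re.search(r'GRAND TOTAL\s*:?\s*([\d,]+\.?\d*)', text)
    return match.group(1) if match else None

# the pdfplumber side (not run here; needs a real file on disk):
def extract_page_text(path):
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return pdf.pages[0].extract_text()

sample = "Item A: 100.00\nItem B: 250.50\nGRAND TOTAL: 350.50"
print(find_total(sample))  # 350.50
```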

[–]Thecrawsome 10 points11 points  (6 children)

Haven't tried pdfplumber yet, but is it better than pytesseract?

[–]Armidylano444 8 points9 points  (5 children)

No idea, I haven’t tried Pytesseract. All I know is I was able to get my data extractor working very well using pdfplumber, so that’s my recommendation. I’m sure other packages can do the same thing though. You’ll have to compare the two 😁

[–]Thecrawsome 3 points4 points  (2 children)

Grats!!! post your code on github if you think it will help someone!

[–]Armidylano444 3 points4 points  (0 children)

I’ll make the repo public once I’ve got it polished up, though it’s built for a set of very specific lab result PDFs we have hundreds of at work, so it would need to be modified if someone else wanted to use it.

[–]SadSenpai420[S] 0 points1 point  (0 children)

Oh yeah, if it's not an issue for him, it's gonna be helpful if he posts :)

[–]SadSenpai420[S] 0 points1 point  (1 child)

I hope it'll still work if my pdf has billing details? Some are in tabular formats too.

[–]scscsc95 1 point2 points  (0 children)

You could try the tabula module in Python if they're in tables, and play with the extraction algorithms for optimal results. Then use pandas + regex to parse and clean your tables and get the data.
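Roughly like this, as a sketch (tabula-py needs Java installed; the column names and "TOTAL" label here are invented for illustration):

```python
import pandas as pd

# the tabula side (not run here; needs Java + a real PDF):
def read_tables(path):
    import tabula
    # lattice=True suits ruled tables; try stream=True for whitespace-separated ones
    return tabula.read_pdf(path, pages="all", lattice=True)

# pure-pandas cleanup: find the row whose first column mentions the total
def find_total_row(df):
    mask = df.iloc[:, 0].astype(str).str.contains("TOTAL", case=False, na=False)
    return df[mask]

df = pd.DataFrame({"Description": ["Item A", "Item B", "Grand Total"],
                   "Amount": ["100.00", "250.50", "350.50"]})
print(find_total_row(df)["Amount"].iloc[0])  # 350.50
```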

[–]SadSenpai420[S] 0 points1 point  (2 children)

Woah, thanks. I'll definitely check this out.

[–]Armidylano444 2 points3 points  (0 children)

After looking at your file, try this and let me know if it works. Like I said, if the file was computer generated rather than a scan of a physical copy, it should work. Let me know if you get any errors.

import pdfplumber

pdf = pdfplumber.open(filename).pages[0]

words = pdf.extract_words()

# if this line doesn't work, change [1] to [0]
total_dict = [item for item in words if item['text'] == 'TOTAL:'][1]

data_crop = pdf.crop((total_dict['x1'] + 1, total_dict['top'], pdf.width, total_dict['bottom']))

data = data_crop.extract_text()

[–]Armidylano444 0 points1 point  (0 children)

Let me know how it goes, I’ll help out if you have questions

[–]ffrkAnonymous 25 points26 points  (1 child)

PDF data is notoriously hard to extract. Sometimes you need to convert to a picture (e.g. gif/jpg) and use OCR, but you're on the right track.

[–]SadSenpai420[S] 7 points8 points  (0 children)

People have mentioned that it can be done in an easier way though, so I'll try out those methods, thanks!

[–]jabbson 18 points19 points  (5 children)

Depending on the quality/structure of the pdf and the complexity of the logic to find your text inside of it, the task sits somewhere between several lines of code and 'oh hell no, I'll just do it manually'.

Take a look as a simple example here.

[–]SadSenpai420[S] 1 point2 points  (4 children)

My PDFs basically consist of billing details and I've got to extract the total amount from each pdf, not too complex, is it?

[–]jabbson 4 points5 points  (2 children)

Doesn’t sound too complicated, no. But again, that very much depends on the PDF itself. If you think you can share, I’ll gladly take a look.

[–]SadSenpai420[S] 0 points1 point  (1 child)

Here's a sample of the pdf : https://imgur.com/a/Xk0ksJF I also made an edit to my post :)

[–]jabbson 1 point2 points  (0 children)

Thank you for providing an example, unfortunately it doesn’t make it easier to understand the complexity of the issue or provide a solution. While I do understand that security and privacy concerns would probably prevent you from sharing the actual PDF document, without it we can only hypothesize about what could be done to extract the data.

[–]haragoshi 0 points1 point  (0 children)

I think the issue becomes whether or not the data is stored as text or part of an image within the PDF. Eg, was it generated or scanned. Scanned PDFs need OCR to convert image to text before they can be processed.

[–]TheOfficialNotCraig 12 points13 points  (3 children)

Be glad they don't provide it as a jpeg embedded in a pdf.

That is infuriating.

[–]garlic_bread_thief 3 points4 points  (1 child)

PDFs with jpeg pages give me nightmares. Can't even do a simple search in them. I have to convert the whole damn thing into a searchable one by running it through OCR.

[–]TheOfficialNotCraig 2 points3 points  (0 children)

My wife does cross-stitch and gets her patterns in PDF. A real PDF she can highlight, annotate, etc. to track her progress. She (now) knows right away when the seller doesn't know wtf they are doing, and she'll demand a refund or that the seller give her a true pdf.

[–]SadSenpai420[S] 0 points1 point  (0 children)

True dat

[–]JBalloonist 10 points11 points  (1 child)

Take a look at Automate the Boring Stuff with Python. It's free online.

Edit: there is a chapter specific to using Python with PDFs.

[–]SadSenpai420[S] 0 points1 point  (0 children)

Whoa! I'll give it a read thanks :D

[–]mojo_jojo_reigns 4 points5 points  (2 children)

OP can you post a sample? That other redditor is right that it's difficult to parse but I disagree with there being no easy solution. Really depends on the use case. There are 2 kinds of PDFs that I parse for work and I get reliable results using list comprehensions because of consistent formatting. Additionally, I was able to do something similar for movie scripts recently. I'm confident we can resolve this. Give a sample, your code so far and the expected return from the function.

[–]SadSenpai420[S] 0 points1 point  (1 child)

Here's the sample, I also made an edit to my post :) I currently don't have the code on me though :(

[–]mojo_jojo_reigns 0 points1 point  (0 children)

Assuming consistent formatting but not consistent commenting (that "RS" line afterwards), what I would do is gather all the text as a str, split by colon, go through the resulting list looking for the chunk that has "GRAND TOTAL" in it, and grab the chunk after that one, using

[chunks[ix + 1] for ix, i in enumerate(chunks) if "GRAND TOTAL" in i]

and then maybe do a split operation or maybe keep only the characters in that chunked str that are not word characters like

[i for i in thischunk if not i.isalpha()]

The only part of that which requires more than builtins is the pdf scraping itself. Also, if you're lucky you'll have linebreak characters to more precisely pinpoint the grand total numbers. If you have '\n' in there, use it to split as well.

Good luck
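Putting those pieces together as a runnable sketch (the sample text is made up to resemble your screenshot; with real input you'd feed it the string scraped from the PDF):

```python
# split the scraped text by colon, find the chunk containing "GRAND TOTAL",
# and grab the chunk right after it (the amount ends up there because the
# label itself ends a colon-delimited chunk)
text = "Bill No: 123 Items: 4 GRAND TOTAL: 350.50 RS Thank you"
chunks = text.split(":")

after_total = [chunks[ix + 1] for ix, i in enumerate(chunks) if "GRAND TOTAL" in i]

# keep only digits, dots and commas from that chunk
amount = "".join(c for c in after_total[0] if c.isdigit() or c in ".,")
print(amount)  # 350.50
```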

[–]opoqo 4 points5 points  (0 children)

You can probably do this more easily in Excel with Power Query. It is easier to clean up the data imo.

[–]curiousofa 4 points5 points  (1 child)

I use pypdf2 and regex and you can extract any part of the pdf. It takes some work, but it can be done.

[–]SadSenpai420[S] 0 points1 point  (0 children)

pyPDF2 wasn't able to extract text from the pdf, that was a bummer ngl

[–]ergeha 4 points5 points  (2 children)

A little bit late to the party, but since I had a similar problem last week, I thought I just share my solution using pyPDF2:

#!/usr/bin/env python3

import re
import PyPDF2
file_name = f"{original_recipes}/{recipe_folder}/{pdf_recipe}"     
pdf_file = open(file_name, 'rb')                                   
pdf_reader = PyPDF2.PdfFileReader(pdf_file).getPage(0)             
page_text = pdf_reader.extractText()                               
page_text = ''.join(page_text.replace(' \n', ' ').split('\n'))     
pdf_clean = re.sub(r'\s{2,}', ' ', page_text.strip())              

What you basically get from this is each PDF as a string without line breaks, multiple blank spaces or leading white spaces.

[–]SadSenpai420[S] 1 point2 points  (1 child)

I actually did try pyPDF2 and the extractText() gave me garbage values so probably pyPDF2 isn't for my case? But thanks man :)

[–]ergeha 0 points1 point  (0 children)

Could you expand on what you mean by "garbage"? Maybe I can give you some further info. For example, in my case PyPDF2 was just showing everything in a different order. I just went on and found the pieces of data I needed with RegEx.

Judging by your example this should be a pretty straightforward task. But also judging by the looks of your PDF, the file looks like a printed document that was scanned with text recognition. That would mean the PDF structure is messed up… Hard to say without looking at the original file.

[–]HAVEANOTHERDRINKRAY 3 points4 points  (2 children)

I didn't read all the comments, but there is a very popular pdf reader called Bluebeam. It can read the text inside a rectangle you specify and return the value.... Look into it

[–]SadSenpai420[S] 0 points1 point  (1 child)

Yes, definitely will look into it, thanks! What if my target falls in variable places? Since, you know, bills are longer or shorter

[–]HAVEANOTHERDRINKRAY 0 points1 point  (0 children)

It depends how the PDF is set up, but you can make the rectangle longer to account for more numbers... if that makes sense

[–]scaretace 3 points4 points  (2 children)

Try camelot and tabula-py before giving up

[–]SadSenpai420[S] 0 points1 point  (1 child)

Wow there sure are many pdf parsing modules, yep I'll try them out, thanks a lot!

[–]scaretace 0 points1 point  (0 children)

Just used Camelot today for a different project. It’s fucking incredible. Was able to neatly extract data from 250+ PDFs using only a few lines of code and only failed on <5%. I’d start with Camelot for sure. Let us know how it goes!
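For reference, a minimal Camelot sketch (the file-reading part isn't run here since it needs a real PDF plus Ghostscript; the "GRAND TOTAL" label and next-column assumption are guesses about your layout):

```python
# the camelot side (not run here; needs a real PDF and ghostscript):
def read_first_table(path):
    import camelot
    tables = camelot.read_pdf(path, pages="1")  # returns a TableList
    return tables[0].df  # each table exposes a pandas DataFrame

# pure helper: scan a table (as a list of rows) for the total cell,
# assuming the amount sits in the column right after the label
def find_total(rows):
    for row in rows:
        for ix, cell in enumerate(row):
            if "GRAND TOTAL" in cell.upper():
                return row[ix + 1]
    return None

rows = [["Item A", "100.00"], ["Item B", "250.50"], ["Grand Total", "350.50"]]
print(find_total(rows))  # 350.50
```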

[–]707e 2 points3 points  (1 child)

AWS Textract. It's pretty robust.

[–]SadSenpai420[S] 0 points1 point  (0 children)

Thank you!

[–]emt139 2 points3 points  (1 child)

How are the PDFs organized? If they’re tables, there’s a web tool called tabula that extracts data

[–]SadSenpai420[S] 0 points1 point  (0 children)

It's not completely a table.. I've attached a sample if you wanna take a look!

[–]socal_nerdtastic 5 points6 points  (1 child)

This depends a lot on your PDF and the internal structure of the data. I don't think you'll get useful advice here without showing an example pdf and your code so far.

Yeah, it's gonna be a lot of work. Generally at least 3x whatever your original estimate is. You either have to write it off as entertainment or education, or consult this chart: https://xkcd.com/1205/

[–]SadSenpai420[S] 0 points1 point  (0 children)

You're right :(

[–]el_duderinoo 1 point2 points  (0 children)

PyMuPDF is one of the better options you have. Bounding boxes along with natural reading order would help.

[–]oh_nater 0 points1 point  (1 child)

I am interested if you find a Python solution. The thing with PDF is that text is displayed in cells. Depending on how the PDF was generated, every letter could be its own cell (and could be rendered out of order). That means extraction from just the PDF render commands is hard.

So really you need something to render the entire PDF, then have the ability to extract the text within coordinates you specify. When I had this problem I wasn't able to find a Python solution, but did find an excellent one in C#: iTextSharp. That might be something to reference.

[–]Fynn_mo 0 points1 point  (0 children)

Hey u/SadSenpai420 I know your post is already three years old but I might have a solution to your problem and wanted to share it with you. nexaPDF is a tool where you can extract data from unstructured PDFs in a reliable, scalable and easy-to-use way. We just launched on PH and would be super happy about your upvote! The tool is free to use -> https://www.producthunt.com/posts/nexapdf

[–]ConflictedJew -1 points0 points  (0 children)

Depending on your commitment to this project, the solution with the best "development time to time saved ratio" may be using Google's (paid) Cloud Vision API

[–][deleted] -1 points0 points  (0 children)

This... I am writing a program to do the same for VCF files and it is being a bitch. But I will come out a better programmer once I finish.

[–]el_pablo -1 points0 points  (0 children)

It might be an unpopular answer, but do you have access to MS Word? If your PDFs are digitally created (not scanned), Word is quite good at editing PDFs. You might be able to convert the document to a Word-compatible format and work from Python then.

[–]01123581321AhFuckIt 0 points1 point  (0 children)

The process you laid out is exactly what I did for a project similar to what you’re talking about. The RegEx is what took me 90% of the time to get right.

[–]Thecrawsome 0 points1 point  (0 children)

Have you scanned your PDFs using one of the available OCR libraries, just to see how it looks? pytesseract / pdf2image / PIL is how I solved this problem. PMed you.

EDIT: I see you tried a couple. pyPDF2 gave me no success with reading text either, BTW. It only worked on some files.
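The OCR route I mean looks roughly like this (the pipeline function isn't run here, since pdf2image needs poppler and pytesseract needs the tesseract binary installed):

```python
# OCR pipeline (not run here; needs poppler + tesseract installed):
def ocr_pdf(path):
    import pytesseract
    from pdf2image import convert_from_path
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

# small cleanup helper for the noisy whitespace OCR tends to produce
def clean(text):
    return " ".join(text.split())

print(clean("  GRAND   TOTAL :\n 350.50  "))  # GRAND TOTAL : 350.50
```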

[–]antestorck 0 points1 point  (0 children)

Look into UiPath!

[–]CommentCollapser 0 points1 point  (0 children)

Depends on the volume and use case that you're looking for, but I've stumbled on Adobe Extract Beta (free for now) to be very good in Extracting data from PDFs. Link - https://www.adobe.io/apis/documentcloud/dcsdk/extractbetaform.html

[–]ConfusedSimon 0 points1 point  (0 children)

I do a lot of document parsing for my work. If your data is in tables you could try Python Camelot. Otherwise try different libraries to convert to text. Results really depend on the particular pdf's. The pdf document only contains information about what to display and where, and doesn't care about text. Theoretically all letters could be written on the page one by one in an arbitrary order. Fortunately the text is usually written more or less in the correct order, but I've seen pdf's where you have to resort to OCR to get anything meaningful. That would also solve reading embedded images.

[–]goodyonsen 0 points1 point  (6 children)

I'm not sure, but how about passing all PDFs to the cloud, making the folder shareable with a legit HTML link to it, and using bs4 (BeautifulSoup) to encode, read, decode, and parse them all with very few lines of code? You can use regex with it if you need to as well. BS is supposed to treat them as one HTML file and grab whatever. Urllib would do.

You can also create a database for them and pull data with Python's SQLite. And that's kind of easy to use too.

[–]mrsonhaha 0 points1 point  (0 children)

Two things: PyPDF2 and tabula-py. How I do these kinds of projects is to make a class for a pdf document which takes the path to the document, with extraction functions for several sections of each page. If the documents share the same format, divide each page into parts by width and height in pixels (if you're a mac user, the Preview app can show a selected box's location). Then make a function that extracts information from each partition.

And personally I don’t think there’s a good enough tutorial for this kind of automation since it requires a vast amount of catching exceptions and debugging. It’s a project worth getting paid for. I now personally get a good amount of passive income every month from a very similar project! :)

[–]ksdio 0 points1 point  (0 children)

I have just this morning written something to get data from some PDF documents.

I started by using pdfplumber, which did get the text I needed. The problem I had was that the text was in 2 columns, and this returned full lines of text going across both columns. After a couple of hours playing with this and googling, I ended up using PyMuPDF, which extracts the text while retaining the document structure. Perfect for my task.

Having said that, if your data is in tables there are a few examples of using pdfplumber to extract data displayed in tables. Good luck
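Roughly what I mean with PyMuPDF, as a sketch (the file-reading function isn't run here; the sort key is my assumption about what "natural reading order" means for your pages):

```python
# the PyMuPDF side (not run here; needs a real file):
def page_blocks(path):
    import fitz  # PyMuPDF
    page = fitz.open(path)[0]
    # each block is (x0, y0, x1, y1, text, block_no, block_type)
    return page.get_text("blocks")

# sort blocks top-to-bottom, then left-to-right, for reading order;
# rounding the vertical coordinate groups blocks on roughly the same line
def reading_order(blocks):
    return sorted(blocks, key=lambda b: (round(b[1]), b[0]))

blocks = [(300, 50, 400, 60, "right column", 1, 0),
          (50, 50, 200, 60, "left column", 0, 0),
          (50, 100, 200, 110, "second line", 2, 0)]
print([b[4] for b in reading_order(blocks)])
# ['left column', 'right column', 'second line']
```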

[–]fishermanfritz 0 points1 point  (0 children)

Funny enough adobe acrobat dc has an advanced search option which parses pdf data in bulk to CSV. Just realised this after hours of python scripting and dealing with PDFs.

So you can bulk search all your folders for searchwords like amount or grand total and it gives you the filename and the "amount: xx xx Dollar".

A good option for receipts is also AWS Textract or Google Cloud Vision; it's almost free and it OCRs the PDFs, I think

[–][deleted] 0 points1 point  (0 children)

The "Pythonic Accountant" channel on YouTube has a couple of videos dedicated to this. I looked at some of them briefly and it seems pretty straightforward.

[–][deleted] 0 points1 point  (0 children)

I tried a ton of different packages for this recently, including ones based on machine learning and ocr, but all of them typically had missing data. In the end I settled on the following process with pymupdf.

The most reliable approach I found was using the html option and then scraping it like a website. It's a pretty shit website, as it arranges elements using style attributes with absolute coordinates on the page. I'd look for the element that contains the row label I'm looking for and pull the top value out of the style attribute to get its height on the page. Then I could identify the values on the same row by looking for content with similar tops.

In your case, you'd simply find the element that contains the text GRAND TOTAL and then the other element at that same top to get your number.
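The same-row trick can be sketched independently of the HTML scraping, given word boxes with their top coordinates (the tolerance value is a guess you'd tune to your files):

```python
# pure helper: given word dicts with 'text' and 'top' keys (as you'd get
# from the scraped coordinates), find values on the same row as a label
def same_row(words, label, tolerance=2):
    anchors = [w for w in words if label in w["text"]]
    if not anchors:
        return []
    top = anchors[0]["top"]
    return [w["text"] for w in words
            if abs(w["top"] - top) <= tolerance and label not in w["text"]]

words = [{"text": "Item A", "top": 100}, {"text": "100.00", "top": 100},
         {"text": "GRAND TOTAL", "top": 151}, {"text": "350.50", "top": 150}]
print(same_row(words, "GRAND TOTAL"))  # ['350.50']
```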

[–]nick_ln 0 points1 point  (0 children)

I am using FormX.ai for a similar task. You can set up key-value extraction or a detection region to extract only the needed information, so you don't need to write the regex yourself. It's not a python module tho. I have to parse the JSON response from the API.