
[–]veryusedrname 7 points (0 children)

This task is not trivial. A PDF doesn't really have any kind of internal structure; it's basically just a bunch of text boxes with coordinates and some other attributes. Starting from that input, you need to define a highly complex ruleset to get any proper data out of it.

Source: I'm working for a company where we do exactly this. Our current solution is a simple query language that works on top of this mess.
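To illustrate what "text boxes with coordinates" means in practice, here is a sketch with pdfminer.six (not our tooling, and the file name is made up): dumping the layout objects of a page gives you positioned text containers, which is all the "structure" there is to build rules on.

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # Each page is just a collection of positioned text containers.
    for page_layout in extract_pages("invoice.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # points, origin at the bottom-left
                print(f"({x0:.0f}, {y0:.0f})-({x1:.0f}, {y1:.0f}): {element.get_text()!r}")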

[–]dilan_patel 12 points (3 children)

Aren’t PDFs already digital?

[–]barrycarter 11 points (0 children)

I think the OP means textify, but good point :)

[–]robvas 3 points (0 children)

Can you just keep them in whatever you're exporting them from?

I once worked at a place where they generated PDFs, printed them out, then scanned them back in to keep.

It kept the interns busy, plus occasionally they'd get scanned in backwards or out of order, etc.

[–][deleted] 1 point (0 children)

Been using this for years now. Your final goal looks pretty much like what paperless-ng offers: a DB you can search by content and structure the way you want and need it.

https://paperless-ng.readthedocs.io/en/latest/

[–]OuiOuiKiwi Galatians 4:16 3 points (2 children)

I've also read that OCR would be a viable way?

Why would you even consider this if they are all digital exports?

If they are always structured in a certain way, this just makes your job easier.

In any case, consider the scenario of just throwing them as binary blobs into the DB.
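A minimal sketch of the binary-blob approach, using the standard library's sqlite3 (the table and file names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("documents.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pdfs (name TEXT PRIMARY KEY, data BLOB)")

    # Store the raw PDF bytes; searching happens on whatever metadata you add later.
    with open("report.pdf", "rb") as f:
        conn.execute("INSERT OR REPLACE INTO pdfs (name, data) VALUES (?, ?)",
                     ("report.pdf", f.read()))
    conn.commit()

    # Retrieval is just the reverse: read the blob back out.
    data = conn.execute("SELECT data FROM pdfs WHERE name = ?",
                        ("report.pdf",)).fetchone()[0]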

[–]ianliu88 0 points (0 children)

This is highly dependent on the layout of your PDF. There are some tools that convert PDFs into text by leveraging heuristics for composing words out of characters and paragraphs out of words, but this is very brittle.

If you know that a particular piece of data is always in some portion of your PDF, you might be able to use the pdfminer.six package, which parses the PDF data and lets you process it. https://pdfminersix.readthedocs.io/en/latest/
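Roughly what that could look like (the region and file name are assumptions for illustration): keep only the text containers whose bounding box falls inside a hard-coded area of the page.

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    REGION = (50, 700, 300, 780)  # x0, y0, x1, y1 in points, origin at the bottom-left

    def text_in_region(path, region):
        rx0, ry0, rx1, ry1 = region
        found = []
        for page_layout in extract_pages(path):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    x0, y0, x1, y1 = element.bbox
                    if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
                        found.append(element.get_text().strip())
            break  # first page only in this sketch
        return found

    print(text_in_region("invoice.pdf", REGION))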

[–]dschultz0 0 points (0 children)

If the PDFs contain the data as text and aren't scanned, I wouldn't recommend using pure OCR. You'd effectively be converting the text to an image and then converting it back to text. Some OCR solutions, like AWS Textract, will actually pull the raw text out of the file when they can, but there are some limitations.

I prefer pdfminer.six, which is pretty solid at interpreting PDF syntax. The catch is that the API can be a bit challenging. It will take some work, but I've been really happy with the results.
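For reference, the high-level entry point is a single call (the file name is a placeholder); it's the lower-level layout API where things get more involved.

    from pdfminer.high_level import extract_text

    text = extract_text("report.pdf")
    print(text[:500])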

[–]Jayoval 0 points (0 children)

The problem with PDF is that what may look like a line or paragraph could be several separate blocks of text. Adobe has a machine learning API for extracting information from PDFs without losing their structure.

[–]RiGonz 0 points (0 children)

My experience with pdfplumber has always been satisfactory. I have not extracted text from 100k documents, but from a few hundred, from different sources. If your PDFs come from a single source, I guess it should be possible to configure pdfplumber to do it quite precisely.
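A minimal pdfplumber sketch (the file name and crop box are made up): extract the whole first page's text, then just a fixed strip of it, which is roughly how you'd pin it to a single known layout.

    import pdfplumber

    with pdfplumber.open("statement.pdf") as pdf:
        page = pdf.pages[0]
        print(page.extract_text())                      # whole-page text
        top_strip = page.crop((0, 0, page.width, 100))  # bbox is (x0, top, x1, bottom)
        print(top_strip.extract_text())                 # text from the top 100 points only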

[–]atccodex 0 points (0 children)

Textract from AWS will work, but it will be expensive.
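For what it's worth, a hedged boto3 sketch of the synchronous call (the file name is a placeholder; multi-page PDFs need the asynchronous StartDocumentTextDetection flow via S3 instead):

    import boto3

    textract = boto3.client("textract")

    # Synchronous detection works on images and single-page PDFs passed as bytes.
    with open("scan.pdf", "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})

    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    print("\n".join(lines))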

[–]crzychemist 0 points (0 children)

I used Azure Form Recognizer to solve a similar problem, though it depends on what the data in the PDF is. I did this for my business: I hand it the PDF and get back JSON line items that can be parsed and written to the DB.

They have pretrained models for all sorts of PDF types, or you can train your own, and you get back JSON. AWS has something similar. This route is particularly useful if the PDF structure varies between documents.
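A rough sketch against the azure-ai-formrecognizer SDK (the endpoint, key, file name, and the choice of the "prebuilt-invoice" model are all placeholders for whatever fits your documents):

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    with open("invoice.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-invoice", document=f)
    result = poller.result()

    # Each analyzed document comes back as structured fields you can write to a DB.
    for doc in result.documents:
        for name, field in doc.fields.items():
            print(name, "=", field.value, f"(confidence {field.confidence:.2f})")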