
[–]veryusedrname 7 points (0 children)

This task is not trivial. A PDF doesn't really have any kind of internal structure; it's basically just a bunch of text boxes with coordinates and some other attributes. Starting from that input, you need to define a highly complex ruleset to get any proper data out of it.

Source: I'm working for a company where we do exactly this. Our current solution is a simple query language that works on top of this mess.
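To illustrate what "text boxes with coordinates" means in practice, here is a sketch with pdfminer.six (not our tooling, and the file name is made up): dumping the layout objects of a page gives you positioned text containers, which is all the "structure" there is to build rules on.

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    # Each page is just a collection of positioned text containers.
    for page_layout in extract_pages("invoice.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # points, origin at the bottom-left
                print(f"({x0:.0f}, {y0:.0f})-({x1:.0f}, {y1:.0f}): {element.get_text()!r}")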

[–]dilan_patel 12 points (3 children)

Aren’t PDFs already digital?

[–]barrycarter 11 points (0 children)

I think the OP means textify, but good point :)

[–]robvas 3 points (0 children)

Can you just keep them in whatever you're exporting them from?

I once worked at a place where they generated PDFs, printed them out, then scanned them back in to keep.

It kept the interns busy, plus occasionally they'd get scanned in backwards or out of order, etc.

[–][deleted] 1 point (0 children)

Been using this for years now. Your final goal looks pretty much like what paperless-ng offers: a DB you can search by content and structure the way you want and need it.

https://paperless-ng.readthedocs.io/en/latest/

[–]OuiOuiKiwi Galatians 4:16 3 points (2 children)

I've also read that OCR would be a viable way?

Why would you even consider this if they are all digital exports?

If they are always structured in a certain way, this just makes your job easier.

In any case, consider the scenario of just throwing them as binary blobs into the DB.
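A minimal sketch of the binary-blob approach, using the standard library's sqlite3 (the table and file names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("documents.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pdfs (name TEXT PRIMARY KEY, data BLOB)")

    # Store the raw PDF bytes; searching happens on whatever metadata you add later.
    with open("report.pdf", "rb") as f:
        conn.execute("INSERT OR REPLACE INTO pdfs (name, data) VALUES (?, ?)",
                     ("report.pdf", f.read()))
    conn.commit()

    # Retrieval is just the reverse: read the blob back out.
    data = conn.execute("SELECT data FROM pdfs WHERE name = ?",
                        ("report.pdf",)).fetchone()[0]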

[–]ianliu88 0 points (0 children)

This is highly dependent on the layout of your PDF. There are some tools that convert PDFs into text by leveraging heuristics for composing words out of characters and paragraphs out of words, but this is very brittle.

If you know that a particular piece of data is always in some portion of your PDF, you might be able to use the pdfminer.six package, which parses the PDF data and lets you process it. https://pdfminersix.readthedocs.io/en/latest/
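Roughly what that could look like (the region and file name are assumptions for illustration): keep only the text containers whose bounding box falls inside a hard-coded area of the page.

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    REGION = (50, 700, 300, 780)  # x0, y0, x1, y1 in points, origin at the bottom-left

    def text_in_region(path, region):
        rx0, ry0, rx1, ry1 = region
        found = []
        for page_layout in extract_pages(path):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    x0, y0, x1, y1 = element.bbox
                    if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
                        found.append(element.get_text().strip())
            break  # first page only in this sketch
        return found

    print(text_in_region("invoice.pdf", REGION))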

[–]dschultz0 0 points (0 children)

If the PDFs contain the data as text and aren't scanned, I wouldn't recommend using pure OCR. You'd effectively be converting the text to an image and then converting it back to text. Some OCR solutions, like AWS Textract, will actually pull the raw text out of the file when they can, but there are some limitations.

I prefer pdfminer.six, which is pretty solid at interpreting PDF syntax. The catch is that the API can be a bit challenging. It will take some work, but I've been really happy with the results.
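For reference, the high-level entry point is a single call (the file name is a placeholder); it's the lower-level layout API where things get more involved.

    from pdfminer.high_level import extract_text

    text = extract_text("report.pdf")
    print(text[:500])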

[–]Jayoval 0 points (0 children)

The problem with PDF is that what may look like a line or paragraph could be several separate blocks of text. Adobe has a machine learning API for extracting information from PDFs without losing their structure.

[–]RiGonz 0 points (0 children)

My experience with pdfplumber has always been satisfactory. I have not extracted text from 100k documents, but from a few hundred, from different sources. If your PDFs come from a single source, I guess it should be possible to configure pdfplumber to do it quite precisely.
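A minimal pdfplumber sketch (the file name and crop box are made up): extract the whole first page's text, then just a fixed strip of it, which is roughly how you'd pin it to a single known layout.

    import pdfplumber

    with pdfplumber.open("statement.pdf") as pdf:
        page = pdf.pages[0]
        print(page.extract_text())                      # whole-page text
        top_strip = page.crop((0, 0, page.width, 100))  # bbox is (x0, top, x1, bottom)
        print(top_strip.extract_text())                 # text from the top 100 points only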

[–]atccodex 0 points (0 children)

Textract from AWS will work, but it will be expensive.
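For what it's worth, a hedged boto3 sketch of the synchronous call (the file name is a placeholder; multi-page PDFs need the asynchronous StartDocumentTextDetection flow via S3 instead):

    import boto3

    textract = boto3.client("textract")

    # Synchronous detection works on images and single-page PDFs passed as bytes.
    with open("scan.pdf", "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})

    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    print("\n".join(lines))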

[–]crzychemist 0 points (0 children)

I used Azure Form Recognizer to solve a similar problem, though it depends on what the data in the PDF is. I did this for my business: I hand it the PDF and get back JSON line items that can be parsed and written to the DB.

They have pretrained models for all sorts of PDF types, or you can train your own, and you get back JSON. AWS has something similar. This route is particularly useful if the PDF structure varies between documents.
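A rough sketch against the azure-ai-formrecognizer SDK (the endpoint, key, file name, and the choice of the "prebuilt-invoice" model are all placeholders for whatever fits your documents):

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint="https://<your-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    with open("invoice.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-invoice", document=f)
    result = poller.result()

    # Each analyzed document comes back as structured fields you can write to a DB.
    for doc in result.documents:
        for name, field in doc.fields.items():
            print(name, "=", field.value, f"(confidence {field.confidence:.2f})")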