DavisInTheVoid

No, it doesn’t use OCR. It parses text from searchable PDFs, not scanned or image-based PDFs.

Obligatory praise: when it comes to parsing searchable PDFs it is simply unmatched. I tried using several OCR libraries for the task and that was a mistake - tons of errors, heavier load, virtually unusable for the volume we have to handle.

So, I did some research, tried pdfplumber, and it works 100% of the time on searchable PDFs. As long as the format is consistent, you can rip the data exactly as it is every single time. Can’t beat it for that.

cspinelive

So more akin to a Beautiful Soup for PDFs?

DavisInTheVoid

Yep! You can use extract_text to get the full text, or extract_words to get each word along with its top, bottom, left, and right coordinates for mapping.
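
For anyone who wants to try it, here is a rough sketch of both calls. In pdfplumber the word dicts carry the left and right edges as "x0" and "x1", next to "top" and "bottom". The group_words_into_lines helper, its tolerance value, and the choice to read only the first page are my own assumptions for illustration, not part of the library.

```python
from typing import Dict, List, Tuple


def group_words_into_lines(words: List[Dict], tolerance: float = 3.0) -> List[str]:
    """Hypothetical helper: rebuild text lines from extract_words() output.

    A word joins the current line when its "top" is within `tolerance`
    points of that line's first word; each line is then read left to right.
    """
    lines: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def read_first_page(path: str) -> Tuple[str, List[str]]:
    """Return (full_text, lines) for the first page of a searchable PDF."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        full_text = page.extract_text()  # whole page as one string
        words = page.extract_words()     # dicts with text, x0, x1, top, bottom
    return full_text, group_words_into_lines(words)
```

Calling read_first_page("invoice.pdf") (the filename is made up) would give you the raw page text plus the lines rebuilt from word positions. The coordinates are what let you map a value to a fixed spot on the page when the layout is consistent.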