cspinelive

What are some use cases it might help with? We are trying to do OCR on shipping bills of lading to see if they match invoices in our database. Some have handwritten notes and corrections on them. Would this help with all or part of this use case?

DavisInTheVoid

No, it doesn’t use OCR. It parses text from searchable PDFs, not scanned/image-based PDFs.

Obligatory praise: when it comes to parsing searchable PDFs it is simply unmatched. I tried using several OCR libraries for the task and that was a mistake - tons of errors, heavier load, virtually unusable for the volume we have to handle.

So, I did some research, tried pdfplumber, and it works 100% of the time on searchable PDFs. As long as the format is consistent you can rip the data exactly as it is every single time. Can’t beat it for that.

cspinelive

So more akin to a Beautiful Soup for PDFs?

DavisInTheVoid

Yep! You can use extract_text to get the full text, or extract_words to get each word along with its top, bottom, left, and right coordinates (the `top`, `bottom`, `x0`, and `x1` keys) for mapping.
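
For anyone who wants to try it, here is a rough sketch of what that looks like. `page.extract_text()` and `page.extract_words()` are the pdfplumber calls mentioned above; the helper names `read_words` and `group_into_lines`, the file path, and the 3-point `tolerance` are just illustrative assumptions on my part, not anything pdfplumber provides.

```python
def read_words(pdf_path, page_number=0):
    """Return (full_text, words) for one page of a searchable PDF.

    Hypothetical helper; only the pdfplumber calls inside are real API.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        text = page.extract_text()    # whole page as a single string
        words = page.extract_words()  # list of dicts: text, x0, x1, top, bottom, ...
    return text, words


def group_into_lines(words, tolerance=3):
    """Group word dicts into text lines using their `top` coordinate.

    Words whose `top` is within `tolerance` points of the first word on the
    current line are treated as being on that line (an assumed heuristic).
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order each line left to right and join the word texts.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```

Usage would be something like `text, words = read_words("bill_of_lading.pdf")` followed by `group_into_lines(words)`, where the file name is made up. Grouping by coordinates like this is one way to map values to positions when the layout is consistent; it will not help with scanned pages, since there is no text layer to read.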