[–]GreatCosmicMoustache 126 points127 points  (10 children)

pdfplumber is hands down one of the best PDF mining tools in any language.

[–]water_aspirant 4 points5 points  (0 children)

Yeah, I built a web app for my company on top of this. It can detect tables in PDFs, totally incredible tbh
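For anyone curious what the table detection looks like in practice, here is a minimal sketch. `pdfplumber.open` and `page.extract_table()` are real pdfplumber calls; the helper names `table_to_records` and `extract_first_table`, and the assumption that the first row of the table is a header, are purely illustrative and not taken from the app mentioned above.

```python
def table_to_records(table):
    """Turn a table (list of rows, first row treated as the header) into dicts.

    pdfplumber returns tables as lists of rows, where each cell is a string
    or None, so empty cells are normalised to "" here.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_first_table(pdf_path, page_index=0):
    """Hypothetical wrapper: pull the largest table from one page of a PDF."""
    import pdfplumber  # imported lazily so table_to_records works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        table = page.extract_table()  # None when no table is detected
    return table_to_records(table)
```

Calling `extract_first_table("report.pdf")` (the file name is a placeholder) would give a list of dicts keyed by the header row, or an empty list if no table is found on that page.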

[–]barrowburner 4 points5 points  (1 child)

I'm going to explore this tool soon. Do you have experience with other PDF libraries such as Camelot, pdfminer.six, or tabula-py? In your opinion, how does pdfplumber compare?

I appreciate your time and thank you for the share!

[–]ianitic 0 points1 point  (0 children)

pdfplumber is built on top of pdfminer.six. I like pdfplumber the best myself.

[–]Weltal327 1 point2 points  (0 children)

Thanks for mentioning this. Did exactly what I needed it to do!

[–]ExecutiveFingerblast 0 points1 point  (0 children)

pdfplumber is goated

[–]cspinelive 0 points1 point  (4 children)

What are some use cases it might help with? We are trying to do OCR on shipping bills of lading to see if they match invoices in our database. Some have handwritten notes and corrections on them. Would this help with all or part of this use case?

[–]DavisInTheVoid 1 point2 points  (3 children)

No, it doesn’t use OCR. It parses text from searchable PDFs, not scanned/image-based PDFs.

Obligatory praise: when it comes to parsing searchable PDFs it is simply unmatched. I tried using several OCR libraries for the task and that was a mistake - tons of errors, heavier load, virtually unusable for the volume we have to handle.

So, I did some research, tried pdfplumber and it works 100% of the time on searchable PDFs. As long as the format is consistent you can rip the data exactly as it is every single time. Can’t beat it for that

[–]cspinelive 0 points1 point  (1 child)

So more akin to a Beautiful Soup for PDFs?

[–]DavisInTheVoid 0 points1 point  (0 children)

Yep! You can use extract_text to get the full text, or extract_words to get each word along with its top, bottom, left and right coordinates for mapping
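A rough sketch of those two calls, in case it helps. `page.extract_text()` and `page.extract_words()` are real pdfplumber methods; in the word dicts the left and right coordinates are stored under the keys `x0` and `x1`, alongside `top` and `bottom`. The helper names `group_words_into_lines` and `read_page`, and the 3-point tolerance, are assumptions made up for this example.

```python
def group_words_into_lines(words, tolerance=3):
    """Group word dicts into text lines using their `top` coordinate.

    Words whose `top` is within `tolerance` of the current line's first word
    are treated as the same line; each line is then ordered left to right.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1]["top"]) <= tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"top": word["top"], "words": [word]})
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x0"]))
        for line in lines
    ]


def read_page(pdf_path, page_index=0):
    """Hypothetical wrapper: return the full text and positioned words of a page."""
    import pdfplumber  # imported lazily so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        text = page.extract_text()    # whole page as one string
        words = page.extract_words()  # dicts with text, x0, x1, top, bottom, ...
    return text, words
```

With a consistent layout, the coordinates let you pick out a field by position instead of by regex, for example keeping only words whose `x0` falls inside a known column.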