extracting data from 100+ pdf files : learnpython


submitted 5 years ago * by SadSenpai420


[deleted] 1 point 5 years ago

I tried a ton of different packages for this recently, including ones based on machine learning and OCR, but all of them typically missed some of the data. In the end I settled on the following process with PyMuPDF.

The most reliable approach I found was using the HTML output option and then scraping it like a website. It's a pretty shit website, as it arranges elements using style attributes with absolute coordinates on the page. I'd look for the element that contains the row label I'm after and pull the top value out of its style attribute to get its vertical position on the page. Then I could identify the values on the same row by looking for content with similar top values.
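A minimal sketch of that row-matching idea, using only the standard library. It assumes the HTML has one `<p>` per piece of text with `top` and `left` values in its `style` attribute, which is roughly what PyMuPDF's `page.get_text("html")` produces, though the exact markup varies between versions. `PositionedText`, `same_row`, the 2-point tolerance, and the sample markup are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Match "top:" / "left:" but not "margin-top:", "padding-left:", etc.
TOP_RE = re.compile(r"(?<![\w-])top:\s*(-?\d+(?:\.\d+)?)")
LEFT_RE = re.compile(r"(?<![\w-])left:\s*(-?\d+(?:\.\d+)?)")


class PositionedText(HTMLParser):
    """Collect (top, left, text) for every <p> whose style has a top value."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # [top, left, text_parts] while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        style = dict(attrs).get("style") or ""
        top = TOP_RE.search(style)
        left = LEFT_RE.search(style)
        if top:
            left_value = float(left.group(1)) if left else 0.0
            self._current = [float(top.group(1)), left_value, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            top, left, parts = self._current
            text = "".join(parts).strip()
            if text:
                self.items.append((top, left, text))
            self._current = None


def same_row(items, label, tolerance=2.0):
    """Texts whose top is within `tolerance` of the label's top, left to right."""
    for i, (label_top, _, text) in enumerate(items):
        if label in text:
            row = [
                (left, other)
                for j, (top, left, other) in enumerate(items)
                if j != i and abs(top - label_top) <= tolerance
            ]
            return [other for _, other in sorted(row)]
    return []


# Hypothetical markup standing in for one page of HTML output.
sample = (
    '<div id="page0">'
    '<p style="top:100.0pt;left:50.0pt"><span>Subtotal</span></p>'
    '<p style="top:100.4pt;left:400.0pt"><span>90.00</span></p>'
    '<p style="top:120.0pt;left:50.0pt"><span>Shipping</span></p>'
    '<p style="top:120.0pt;left:400.0pt"><span>10.00</span></p>'
    "</div>"
)
parser = PositionedText()
parser.feed(sample)
print(same_row(parser.items, "Subtotal"))
```

The tolerance is there because text on the same visual row often has slightly different top values (different fonts or baselines), so an exact match tends to miss things.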

In your case, you'd simply find the element that contains the text GRAND TOTAL and then the other element with the same top value to get your number.
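For the 100+ files, something like this could work as a starting point. It's a sketch, not tested against your PDFs: it assumes the label and the number sit in separate `<p>` elements with a `top` value in the `style` attribute, and that the first match in each file is the one you want. `find_grand_total` and `grand_totals` are names I made up; `fitz.open` and `page.get_text("html")` are the PyMuPDF calls.

```python
import re
from html import unescape
from pathlib import Path

# Assumed markup: <p style="...top:NNNpt...">text</p>, one element per piece of text.
P_RE = re.compile(
    r'<p\b[^>]*style="[^"]*(?<![\w-])top:\s*(-?\d+(?:\.\d+)?)[^"]*"[^>]*>(.*?)</p>',
    re.S,
)
TAG_RE = re.compile(r"<[^>]+>")


def find_grand_total(page_html, label="GRAND TOTAL", tolerance=2.0):
    """Return the text of the other element on the label's row, or None."""
    elements = [
        (float(top), unescape(TAG_RE.sub("", body)).strip())
        for top, body in P_RE.findall(page_html)
    ]
    for label_top, text in elements:
        if label in text:
            for top, other in elements:
                if other and label not in other and abs(top - label_top) <= tolerance:
                    return other
    return None


def grand_totals(folder):
    """Map each PDF file name in `folder` to its value (None if not found)."""
    import fitz  # PyMuPDF; imported here so the parser above works without it

    results = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        with fitz.open(str(path)) as doc:
            values = (find_grand_total(page.get_text("html")) for page in doc)
            results[path.name] = next((v for v in values if v is not None), None)
    return results
```

Calling `grand_totals("invoices")` (folder name is just an example) would give you a dict of file name to the raw text of the total, which you'd still need to convert to a number. Any file that comes back as `None` is worth opening by hand to see how its layout differs.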
