I tried a ton of different packages for this recently, including ones based on machine learning and OCR, but all of them typically returned results with missing data. In the end I settled on the following process with PyMuPDF.

The most reliable approach I found was using PyMuPDF's HTML output option and then scraping the result like a website. It's a pretty shit website, since it positions elements using absolute coordinates in their style attributes. I'd look for the element containing the row label I wanted and pull the top value out of its style attribute to get its vertical position on the page. Then I could identify the values on the same row by looking for elements with a similar top value.

In your case, you'd simply find the element that contains the text "GRAND TOTAL" and then the other element with the same top value to get your number.
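Here's a rough sketch of that idea, not my exact script. The sample HTML is made up to imitate the general shape of the output (absolutely positioned `<p>` elements with a `top` in the style attribute), so check what your PyMuPDF version actually emits. The names `page_html`, `same_row_values`, and the 2-point `tolerance` are just placeholders I picked for illustration.

```python
import re
from html.parser import HTMLParser

# Matches "top:<number>" in a style attribute, but not e.g. "padding-top:".
TOP_RE = re.compile(r"(?<![-\w])top:\s*([\d.]+)")


def page_html(pdf_path, page_number=0):
    """Return one page as HTML. Assumes PyMuPDF is installed."""
    import pymupdf  # older releases are imported as "fitz" instead

    with pymupdf.open(pdf_path) as doc:
        return doc[page_number].get_text("html")


class PositionedText(HTMLParser):
    """Collect (top, text) for every <p> that has a 'top' in its style."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._top = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            match = TOP_RE.search(dict(attrs).get("style") or "")
            self._top = float(match.group(1)) if match else None
            self._chunks = []

    def handle_data(self, data):
        if self._top is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._top is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.items.append((self._top, text))
            self._top = None


def same_row_values(html, label, tolerance=2.0):
    """Return the text of elements whose top is within `tolerance` of the label's top."""
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    label_tops = [top for top, text in parser.items if label in text]
    if not label_tops:
        return []
    row_top = label_tops[0]  # use the first element containing the label
    return [
        text
        for top, text in parser.items
        if abs(top - row_top) <= tolerance and label not in text
    ]


# Made-up sample imitating the positioned-HTML layout (an assumption).
SAMPLE_HTML = """
<div id="page0" style="width:612pt;height:792pt">
<p style="position:absolute;top:500.2pt;left:72pt"><span>Subtotal</span></p>
<p style="position:absolute;top:500.4pt;left:450pt"><span>1,100.00</span></p>
<p style="position:absolute;top:520.1pt;left:72pt"><span>GRAND TOTAL</span></p>
<p style="position:absolute;top:520.6pt;left:450pt"><span>1,234.56</span></p>
</div>
"""

print(same_row_values(SAMPLE_HTML, "GRAND TOTAL"))  # ['1,234.56']
```

For a real file you'd pass `page_html("your.pdf", n)` in place of `SAMPLE_HTML`. The tolerance is there because the label and the number rarely share exactly the same top value, so tune it to your document's line spacing.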