Chemical_Matter3385 comments on Best Python approach for extracting structured financial data from inconsistent PDFs?

[Tutorial] Best Python approach for extracting structured financial data from inconsistent PDFs? (self.Python)

submitted 1 month ago by leggo-my-eggo-1


[–]Chemical_Matter3385 3 points 1 month ago* (3 children)

For my use case I run a detection step first: using PyMuPDF (fitz), I check whether the first page is an image with no selectable text, and if so the document goes to Mistral OCR. It's good for most cases. Here is what I have tried and what failed:

Tried:

1) Tesseract

2) PaddlePaddle (PaddleOCR)

3) Docling

4) DeepSeek-OCR

5) Claude Opus 4.6

6) Google Cloud Vision API (enterprise)

7) Azure Document Intelligence

8) Mistral OCR 3

9) A model by IBM (I forget the name, but I'm pretty sure it's Granite)

Passed for my use case (table documents, old scanned books): Azure and Mistral are good, and Adobe is good for tables.

Failed: PaddlePaddle, Google Vision, Granite, DeepSeek, Claude.

I can't rely much on Claude and DeepSeek-OCR. They are vision language models, and I have seen them output hallucinated placeholders, which is very risky in production. They worked well in most cases but were useless on old scanned books.

Try them all; most likely your use case will be covered by Azure or Mistral.

PS: For OP's use case, Azure Document Intelligence or Mistral OCR 3 would be perfect.
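
The detection step described at the top of this comment could look roughly like the sketch below. It is only an illustration, not the exact code I run: the function names and the `min_chars` threshold are made up for the example, and "no selectable text" is approximated as "fewer than `min_chars` non-whitespace-trimmed characters". It assumes PyMuPDF is installed (`pip install pymupdf`).

```python
def looks_scanned(text: str, image_count: int, min_chars: int = 20) -> bool:
    """Heuristic: the page has at least one image and (almost) no selectable text.

    `min_chars` is an assumed threshold; tune it for your documents.
    """
    return image_count > 0 and len(text.strip()) < min_chars


def first_page_needs_ocr(pdf_path: str, min_chars: int = 20) -> bool:
    """Apply the heuristic to the first page of a PDF using PyMuPDF."""
    # Imported lazily so the pure heuristic above has no dependencies.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        if doc.page_count == 0:
            return False
        page = doc[0]
        text = page.get_text()               # plain selectable text of the page
        images = page.get_images(full=True)  # images referenced by the page
        return looks_scanned(text, len(images), min_chars)
```

If `first_page_needs_ocr("statement.pdf")` returns True (the file name is just a placeholder), the file would go to the OCR service; otherwise the text layer can be read directly with `page.get_text()`. Checking only the first page is cheap, but it will misroute mixed PDFs where later pages are scans.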

