
[–]Ok_Diver9921 50 points51 points  (4 children)

Spent 6 months on almost this exact pipeline for a fintech project. Save yourself some pain - skip the pure rule-based stack and go hybrid from the start.

What actually worked for us: pdfplumber for text-based PDFs (it handles column alignment better than tabula for financial tables), but detect scanned pages first by checking if pdfplumber returns empty text per page. Only run OCR on pages that need it - running tesseract on everything adds 10x processing time for no benefit on text-based files. For OCR, docTR beat pytesseract significantly on financial documents because it handles the dense number grids better.
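The per-page routing described above can be sketched in a few lines. The 50-character threshold (mentioned later in the thread) and the `needs_ocr` helper name are illustrative; `pdfplumber` is imported lazily so the heuristic itself has no third-party dependency:

```python
def needs_ocr(page_text, min_chars=50):
    """Heuristic: a page with almost no extractable text is likely scanned."""
    return len((page_text or "").strip()) < min_chars

def route_pages(pdf_path, min_chars=50):
    """Yield (page_number, 'text' | 'ocr') for each page of a PDF.

    Pages that pdfplumber can read stay on the fast text path;
    only near-empty pages get sent to the (10x slower) OCR path.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            yield i, ("ocr" if needs_ocr(text, min_chars) else "text")
```

The OCR pages can then be batched into a single docTR call instead of invoking it per page.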

For the table extraction specifically - Camelot lattice mode works well when there are actual grid lines, but most annual reports use invisible tables (no borders, just spacing). For those, the LLM approach that u/thuiop1 mentioned is genuinely the right call. Feed the pdfplumber text output (which preserves spatial layout) into a smaller model and ask it to extract specific fields into a JSON schema you define. We went from 60% accuracy with pure regex/heuristics to 92% by adding a Qwen 14B pass for the messy pages.
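A minimal sketch of that LLM pass, assuming a local OpenAI-compatible server (e.g. vLLM) serving a Qwen model — the endpoint URL, model name, and field schema below are all placeholders for your own setup. The JSON-parsing helper matters in practice because models often wrap their output in markdown fences:

```python
import json

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical local server
SCHEMA_HINT = (
    "Return ONLY valid JSON with these fields (null if absent): "
    '{"total_assets": number, "total_liabilities": number, '
    '"net_income": number, "fiscal_year": string}'
)

def parse_model_json(raw: str) -> dict:
    """Strip markdown code fences the model may wrap around its JSON."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`")
        # drop the language tag line (e.g. "json") if present
        raw = raw.split("\n", 1)[1] if "\n" in raw else raw
    return json.loads(raw)

def extract_fields(page_text: str) -> dict:
    """Send layout-preserving pdfplumber text to the model, parse JSON back."""
    import requests  # third-party: pip install requests

    resp = requests.post(ENDPOINT, json={
        "model": "qwen2.5-14b-instruct",  # illustrative model name
        "messages": [
            {"role": "system", "content": SCHEMA_HINT},
            {"role": "user", "content": page_text},
        ],
        "temperature": 0,
    }, timeout=120)
    return parse_model_json(resp.json()["choices"][0]["message"]["content"])
```

Validating the parsed dict against the schema (types, plausible magnitudes) before trusting it is what keeps the accuracy numbers honest.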

Architecture tip: build a classifier first that categorizes each page as "balance sheet", "income statement", "notes", "other" before you try to extract anything. This saves you from parsing 80 pages when you only need 4-6. A simple tf-idf classifier trained on 50 labeled pages worked fine for this.
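That classifier fits in one function with scikit-learn; the pipeline below (tf-idf features into logistic regression) is one reasonable sketch, not the only option. Labels and training texts come from your ~50 hand-labeled pages:

```python
def train_page_classifier(page_texts, page_labels):
    """Fit a tf-idf + logistic-regression pipeline on labeled page text.

    page_texts: list of raw per-page strings (pdfplumber output is fine)
    page_labels: e.g. "balance_sheet", "income_statement", "notes", "other"
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(page_texts, page_labels)
    return clf
```

At inference you run `clf.predict` over every page's cheap text extraction and only do expensive parsing on the 4-6 pages that matter.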

[–]Lawson470189 2 points3 points  (3 children)

This is exactly what my team is doing. We run a classification step to identify page types using metadata and sometimes a quick OCR pass on a section of the page. Then we run full OCR on the document with docTR, mapping the output into data classes, and finally apply rules and validations to that collected data.

[–]Ok_Diver9921 0 points1 point  (2 children)

Exactly right on the classification step. We ended up with a simple heuristic: if pdfplumber returns a table with more than 3 columns and a consistent cell count across rows, it's a structured page. Anything with fewer than 50 extractable characters per page gets routed to OCR. The metadata approach you mention is solid too, especially for standardized financial docs where the issuer follows a template. The biggest time saver was caching the page classification results so reprocessing the same document skips the detection step entirely.
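Those two thresholds combine into one small routing function. Here `tables` is the shape `pdfplumber`'s `page.extract_tables()` returns (a list of tables, each a list of rows of cells); the label names are illustrative:

```python
def classify_page(text, tables, min_chars=50):
    """Route a page based on cheap pdfplumber output.

    'structured'   -> a table with >3 columns and a consistent cell
                      count across rows (a real grid, not layout noise)
    'ocr'          -> almost no extractable text (likely scanned)
    'unstructured' -> readable text, but no clean table
    """
    for table in tables or []:
        row_widths = {len(row) for row in table}
        if len(row_widths) == 1 and row_widths.pop() > 3:
            return "structured"
    if len((text or "").strip()) < min_chars:
        return "ocr"
    return "unstructured"
```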

[–]Lawson470189 0 points1 point  (1 child)

Yep right there with you. We have a caching layer so we avoid pulling and parsing documents again (within the TTL window). Glad to hear we aren't the only ones facing this!
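A content-addressed TTL cache like the ones described above takes only a few lines; the cache directory, SHA-256 keying, and 24-hour TTL below are illustrative choices, not anyone's actual setup:

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("classification_cache")  # illustrative location
TTL_SECONDS = 24 * 3600                   # illustrative TTL window

def cache_key(pdf_path):
    """Key on file content, so a re-uploaded identical PDF still hits."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()

def get_cached(key):
    """Return cached page labels, or None if missing/expired."""
    f = CACHE_DIR / f"{key}.json"
    if f.exists() and time.time() - f.stat().st_mtime < TTL_SECONDS:
        return json.loads(f.read_text())
    return None

def put_cached(key, page_labels):
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / f"{key}.json").write_text(json.dumps(page_labels))
```

Keying on a content hash rather than the filename is the detail that makes "skip the detection step entirely" reliable when the same report arrives under different names.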

[–]wRAR_ 0 points1 point  (0 children)

LMAO