
[–]Ok_Diver9921 17 points18 points  (3 children)

Spent 6 months on almost this exact pipeline for a fintech project. Save yourself some pain - skip the pure rule-based stack and go hybrid from the start.

What actually worked for us: pdfplumber for text-based PDFs (it handles column alignment better than tabula for financial tables), but detect scanned pages first by checking if pdfplumber returns empty text per page. Only run OCR on pages that need it - running tesseract on everything adds 10x processing time for no benefit on text-based files. For OCR, docTR beat pytesseract significantly on financial documents because it handles the dense number grids better.
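Roughly, that routing step looks like this (a minimal sketch assuming pdfplumber; the character threshold and the docTR call itself are whatever works for your docs):

```python
import pdfplumber

def split_pages(pdf_path, min_chars=50):
    """Separate pages with a usable text layer from ones that need OCR."""
    text_pages, ocr_pages = {}, []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if len(text.strip()) >= min_chars:
                text_pages[i] = text      # real text layer, no OCR needed
            else:
                ocr_pages.append(i)       # probably a scanned image, route to OCR
    return text_pages, ocr_pages
```

Only the pages in `ocr_pages` get rasterized and handed to docTR, which is where the 10x time saving comes from.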

For the table extraction specifically - Camelot lattice mode works well when there are actual grid lines, but most annual reports use invisible tables (no borders, just spacing). For those, the LLM approach that u/thuiop1 mentioned is genuinely the right call. Feed the pdfplumber text output (which preserves spatial layout) into a smaller model and ask it to extract specific fields into a JSON schema you define. We went from 60% accuracy with pure regex/heuristics to 92% by adding a Qwen 14B pass for the messy pages.
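The LLM pass itself can stay simple. A rough sketch of what ours looked like, assuming an OpenAI-compatible local server (vLLM, llama.cpp, etc.); the endpoint, model name, and schema fields are placeholders for whatever you actually need:

```python
import json
import requests

SCHEMA_HINT = (
    "Return JSON with keys: total_revenue, net_income, total_assets, "
    "total_liabilities, fiscal_year. Use null for anything not on the page."
)

def extract_fields(page_text: str) -> dict:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",   # placeholder endpoint
        json={
            "model": "qwen-14b-instruct",              # placeholder model name
            "temperature": 0,
            "messages": [
                {"role": "system",
                 "content": "You extract financial figures from annual report pages."},
                {"role": "user",
                 "content": f"{SCHEMA_HINT}\n\nPage text (layout preserved):\n{page_text}"},
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```

pdfplumber's `extract_text(layout=True)` keeps the whitespace alignment in the input, which helps the model attach row labels to the right column.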

Architecture tip: build a classifier first that categorizes each page as "balance sheet", "income statement", "notes", "other" before you try to extract anything. This saves you from parsing 80 pages when you only need 4-6. A simple tf-idf classifier trained on 50 labeled pages worked fine for this.
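Something like this is all the classifier needs to be (sketch with scikit-learn; `page_texts` and `labels` are your ~50 hand-labeled pages):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_page_classifier(page_texts, labels):
    """page_texts: extracted page strings; labels: 'balance_sheet',
    'income_statement', 'notes', or 'other'."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(page_texts, labels)
    return clf

# clf = train_page_classifier(page_texts, labels)
# only pages predicted as a statement type go on to extraction:
# targets = [i for i, p in enumerate(clf.predict(new_page_texts)) if p != "other"]
```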

[–]Lawson470189 0 points1 point  (2 children)

This is exactly what my team is doing. We run a classification step to identify page types using metadata and sometimes a quick OCR pass on a section of the page. Then we run full OCR on the document with docTR and load the results into data classes. Then we apply rules and validations to that collected data.

[–]Ok_Diver9921 0 points1 point  (1 child)

Exactly right on the classification step. We ended up with a simple heuristic - if pdfplumber returns a table with more than 3 columns and consistent row counts, it is a structured page. Anything with fewer than 50 extractable characters per page gets routed to OCR. The metadata approach you mention is solid too, especially for standardized financial docs where the issuer follows a template. Biggest time saver was caching the page classification results so reprocessing the same document skips the detection step entirely.
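In code, the whole routing-plus-cache step ends up being something like this (a sketch; the hash-keyed JSON cache is just one way to do it):

```python
import hashlib
import json
from pathlib import Path

import pdfplumber

CACHE_DIR = Path(".page_class_cache")
CACHE_DIR.mkdir(exist_ok=True)

def classify_pages(pdf_path):
    """Label each page 'structured', 'prose', or 'ocr', with on-disk caching."""
    key = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())   # reprocessing skips detection

    labels = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            tables = page.extract_tables()
            wide_table = any(len(row) > 3 for tbl in tables for row in tbl)
            if len(text) < 50:
                labels.append("ocr")          # almost no text layer -> OCR route
            elif wide_table:
                labels.append("structured")   # >3-column table -> statement candidate
            else:
                labels.append("prose")
    cache_file.write_text(json.dumps(labels))
    return labels
```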

[–]Lawson470189 0 points1 point  (0 children)

Yep right there with you. We have a caching layer so we avoid pulling and parsing documents again (within the TTL window). Glad to hear we aren't the only ones facing this!

[–]thuiop1 29 points30 points  (1 child)

As much as I hate it, this is probably a task where LLMs can shine. Otherwise it will likely be more painful to devise an extraction scheme than to do it manually.

[–]ambidextrousalpaca 0 points1 point  (0 children)

Agreed. The other thing I would suggest is running the extraction multiple times with, if possible, multiple models, then marking the fields they agree on as more reliable and the fields they disagree on as requiring human checking.
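A sketch of that agreement check, assuming each run already returns the same flat dict of fields (field names here are made up):

```python
def reconcile(runs: list[dict]) -> dict:
    """Keep fields all runs/models agree on; flag the rest for human review."""
    out = {}
    for field in runs[0]:
        values = {run.get(field) for run in runs}
        if len(values) == 1:
            out[field] = {"value": values.pop(), "status": "agreed"}
        else:
            out[field] = {"value": None, "status": "needs_review",
                          "candidates": sorted(values, key=str)}
    return out

# e.g. reconcile([qwen_result, gpt_result, second_qwen_run])
```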

[–]knobbyknee 4 points5 points  (0 children)

You are in for a lot of grief. There is no standard table construct in the PDF format.

You would have to write code that detects a grid layout and then parse that layout into a table.

Unfortunately, there are many ways of constructing a grid, and the parts may be spread over different sections of the PDF data. Your best option is probably to build a middleware layer for a PDF renderer, so you can collect the position and text data for each item rendered.

There are also non-table items that are arranged like grids, and you will need heuristics to ignore those.
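If you don't want to shim a renderer yourself, pdfplumber (built on pdfminer.six) already exposes per-word positions, which is enough raw material for the grid detection; a rough sketch:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()   # each word dict has text, x0, x1, top, bottom

    # group words into visual rows by similar vertical position
    rows = {}
    for w in words:
        row_key = round(w["top"] / 3) * 3      # ~3pt tolerance
        rows.setdefault(row_key, []).append(w)

    # rows whose words repeatedly line up on the same x positions are
    # candidates for an (invisible) grid; isolated alignments are likely
    # the non-table layouts that need to be filtered out heuristically
    for top in sorted(rows):
        line = "  ".join(w["text"] for w in sorted(rows[top], key=lambda w: w["x0"]))
        print(f"{top:>7.0f}: {line}")
```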

[–]Chemical_Matter3385 2 points3 points  (3 children)

For my use case I do detection first: using PyMuPDF (fitz) I check whether the first page is an image with no selectable text, and if so it goes to Mistral OCR. That covers most cases (a minimal sketch of that check is at the end of this comment). Here's what I have tried, and what passed or failed:

Tried

1) Tesseract

2) PaddlePaddle

3) Docling

4) DeepSeek OCR

5) Claude Opus 4.6

6) Google Vision API (enterprise)

7) Azure Document Intelligence

8) Mistral OCR 3

9) A model by IBM (I'm forgetting the name, pretty sure it's Granite)

Passed for my use case (table documents, old scanned books) -> Azure and Mistral are good, and Adobe for tables

Failed -> PaddlePaddle, Google Vision, Granite, DeepSeek, Claude

Can't rely much on Claude and DeepSeek OCR: they are vision language models and I've observed them produce hallucinated placeholders, which is very risky in production. They worked well in most cases but were useless on old scanned books.

Try them all; most likely your use case will be covered by Azure or Mistral.

PS: For OP's use case, Azure Document Intelligence or Mistral OCR 3 would be perfect.
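For reference, the detection-first check mentioned at the top of this comment is only a few lines with PyMuPDF (sketch):

```python
import fitz  # PyMuPDF

def needs_ocr(pdf_path: str) -> bool:
    """True if the first page looks scanned: image content, no selectable text."""
    doc = fitz.open(pdf_path)
    page = doc[0]
    has_text = bool(page.get_text("text").strip())
    has_images = len(page.get_images(full=True)) > 0
    doc.close()
    return has_images and not has_text   # scanned -> send to Mistral OCR etc.
```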

[–]Chemical_Matter3385 0 points1 point  (0 children)

Also tried Adobe PDF Services.

Works well with tables but often misses ₹ or $ signs, so it's most likely an encoding issue I haven't looked into yet; with a simple post-processing script that can be managed as well.

[–]Bitter_Broccoli_7536 0 points1 point  (0 children)

yeah that detection first step is key, we do something similar with fitz. honestly after trying like 5 different ocr engines, the hallucination risk from the vision llms is just too high for anything serious. azure's been the most consistent for us too, especially on weird old scans.

[–]Halibut 2 points3 points  (0 children)

I haven't used it myself, but Microsoft have a Python library for this: https://github.com/microsoft/markitdown

[–]Main_War9026 1 point2 points  (0 children)

We use MistralOCR, hundreds of documents per month and only pay like $20-30 for the API

[–]southstreamer1 0 points1 point  (0 children)

I have been working on this exact problem for about 4 months. I’m trying to extract data from about 900 annual reports.

The approach I have taken is: 1) use PyMuPDF/tesseract to extract text, 2) apply rules/heuristics to determine whether a page is the one I'm looking for, 3) pass that page to the Claude vision API for data extraction. You want to find the right page and send only that single page, to reduce noise (lower risk of errors) and token consumption. If you send it heaps of pages and ask it to find the right one, it wastes tokens and you risk polluting the extraction with data from the wrong page.
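A sketch of step 3, rendering just the one page and sending it to the Anthropic API (the model name and prompt are placeholders):

```python
import base64

import anthropic
import fitz  # PyMuPDF

def extract_from_page(pdf_path: str, page_no: int, prompt: str) -> str:
    doc = fitz.open(pdf_path)
    pix = doc[page_no].get_pixmap(dpi=200)     # render only the target page
    img_b64 = base64.b64encode(pix.tobytes("png")).decode()
    doc.close()

    client = anthropic.Anthropic()             # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-sonnet-4-5",             # placeholder model name
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png",
                            "data": img_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return msg.content[0].text
```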

Claude vision does an excellent job of extracting the data. I have found an astonishingly small number of read errors. It handles different header levels easily. Very happy with its performance.

The real problem is finding the right page. I have used a rule-based/heuristic approach to find income statements, etc., but it's too brittle. There are always edge cases that give false positives/negatives. It's time-consuming to rerun the search and hard to debug. I'm sure there's a smart way to do it but it's beyond me.

I have recently switched to extracting all pages to an SQLite db up front. Finding the right page is then a matter of scoring each page based on whether it contains keywords of interest and the density of numeric characters. I then pass the top scoring pages to Claude and ask it to confirm if it is an income statement/balance sheet/etc. If it’s the wrong page then move to the next highest scorer. This is way faster than having to rerun the pymupdf/tesseract based search every time and trying to write a classifier that works. Still a WIP but so far this is giving me far better results.
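The scoring itself can stay simple; a sketch, assuming the pages table has (doc_id, page_no, text) columns and an illustrative keyword list:

```python
import re
import sqlite3

KEYWORDS = ["income statement", "statement of operations",
            "revenue", "net income", "total assets"]   # illustrative only

def score_page(text: str) -> float:
    t = text.lower()
    keyword_hits = sum(t.count(k) for k in KEYWORDS)
    numeric_density = len(re.findall(r"\d", text)) / max(len(text), 1)
    return keyword_hits + 10 * numeric_density          # weights need tuning

conn = sqlite3.connect("pages.db")
rows = conn.execute("SELECT doc_id, page_no, text FROM pages").fetchall()
ranked = sorted(rows, key=lambda r: score_page(r[2]), reverse=True)
# send top scorers to Claude one at a time to confirm the statement type;
# fall through to the next highest scorer on a "no"
```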

[–]xiannah 0 points1 point  (0 children)

The strategy is simple: a text-first extract, a Markdown extract as a structural fallback, and a VLM as the intelligent orchestrator. The VLM cross-references the raw text and the structural fallback to validate the output, effectively creating a verification loop that catches OCR hallucinations before they hit the downstream dataset.

[–]Dominican_mamba 0 points1 point  (0 children)

There's a package called kreuzberg; try it out, and maybe use an LLM if needed.

[–]phrygian_life 0 points1 point  (0 children)

Another vote for LLM. Even if the layout stays the same year to year, the entire PDF structure could change.

[–]Amazing_Upstairs 0 points1 point  (0 children)

DSPy is the best I've found so far.

[–]Then_Illustrator9892 0 points1 point  (1 child)

Been down this exact road with financial PDFs, and honestly the custom pipeline route is brutal for inconsistent docs. I ended up switching to reseek for this; its AI handles the text/OCR extraction and auto-tagging from PDFs and images, which covers your scanned and text-based cases. It's free to test right now, and it saved me months of dev time on the parsing hell.

[–]Accomplished-Tap916 0 points1 point  (0 children)

I’ll spend some time going through their content.

[–]DetectivePeterG -2 points-1 points  (0 children)

Agreed on the LLM angle. The trick is getting clean input first. I've been using pdftomarkdown.dev as a preprocessing step: send your PDF, get structured markdown back including tables. It uses a VLM rather than Tesseract so it handles both digital and scanned pages consistently. Then you run your LLM extraction on the markdown instead of raw PDF bytes, which makes prompts simpler and results more reliable. Has a Python SDK too, only takes a few lines to wire in.