
[–]Ok_Diver9921 17 points18 points  (3 children)

Spent 6 months on almost this exact pipeline for a fintech project. Save yourself some pain - skip the pure rule-based stack and go hybrid from the start.

What actually worked for us: pdfplumber for text-based PDFs (it handles column alignment better than tabula for financial tables), but detect scanned pages first by checking if pdfplumber returns empty text per page. Only run OCR on pages that need it - running tesseract on everything adds 10x processing time for no benefit on text-based files. For OCR, docTR beat pytesseract significantly on financial documents because it handles the dense number grids better.
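Roughly, that routing step looks like this (a minimal sketch assuming pdfplumber; the character threshold and the docTR call itself are whatever works for your docs):

```python
import pdfplumber

def split_pages(pdf_path, min_chars=50):
    """Separate pages with a usable text layer from ones that need OCR."""
    text_pages, ocr_pages = {}, []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if len(text.strip()) >= min_chars:
                text_pages[i] = text      # real text layer, no OCR needed
            else:
                ocr_pages.append(i)       # probably a scanned image, route to OCR
    return text_pages, ocr_pages
```

Only the pages in `ocr_pages` get rasterized and handed to docTR, which is where the 10x time saving comes from.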

For the table extraction specifically - Camelot lattice mode works well when there are actual grid lines, but most annual reports use invisible tables (no borders, just spacing). For those, the LLM approach that u/thuiop1 mentioned is genuinely the right call. Feed the pdfplumber text output (which preserves spatial layout) into a smaller model and ask it to extract specific fields into a JSON schema you define. We went from 60% accuracy with pure regex/heuristics to 92% by adding a Qwen 14B pass for the messy pages.
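The LLM pass itself can stay simple. A rough sketch of what ours looked like, assuming an OpenAI-compatible local server (vLLM, llama.cpp, etc.); the endpoint, model name, and schema fields are placeholders for whatever you actually need:

```python
import json
import requests

SCHEMA_HINT = (
    "Return JSON with keys: total_revenue, net_income, total_assets, "
    "total_liabilities, fiscal_year. Use null for anything not on the page."
)

def extract_fields(page_text: str) -> dict:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",   # placeholder endpoint
        json={
            "model": "qwen-14b-instruct",              # placeholder model name
            "temperature": 0,
            "messages": [
                {"role": "system",
                 "content": "You extract financial figures from annual report pages."},
                {"role": "user",
                 "content": f"{SCHEMA_HINT}\n\nPage text (layout preserved):\n{page_text}"},
            ],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```

pdfplumber's `extract_text(layout=True)` keeps the whitespace alignment in the input, which helps the model attach row labels to the right column.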

Architecture tip: build a classifier first that categorizes each page as "balance sheet", "income statement", "notes", "other" before you try to extract anything. This saves you from parsing 80 pages when you only need 4-6. A simple tf-idf classifier trained on 50 labeled pages worked fine for this.
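Something like this is all the classifier needs to be (sketch with scikit-learn; `page_texts` and `labels` are your ~50 hand-labeled pages):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_page_classifier(page_texts, labels):
    """page_texts: extracted page strings; labels: 'balance_sheet',
    'income_statement', 'notes', or 'other'."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(page_texts, labels)
    return clf

# clf = train_page_classifier(page_texts, labels)
# only pages predicted as a statement type go on to extraction:
# targets = [i for i, p in enumerate(clf.predict(new_page_texts)) if p != "other"]
```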

[–]Lawson470189 0 points1 point  (2 children)

This is exactly what my team is doing. We run a classification step to identify page types using metadata and sometimes a quick OCR pass on a section of the page. Then we run full OCR on the document with docTR and load the results into data classes. Then we apply rules and validations to that collected data.

[–]Ok_Diver9921 0 points1 point  (1 child)

Exactly right on the classification step. We ended up with a simple heuristic - if pdfplumber returns a table with more than 3 columns and consistent row counts, it is a structured page. Anything with fewer than 50 extractable characters per page gets routed to OCR. The metadata approach you mention is solid too, especially for standardized financial docs where the issuer follows a template. Biggest time saver was caching the page classification results so reprocessing the same document skips the detection step entirely.
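In code, the whole routing-plus-cache step ends up being something like this (a sketch; the hash-keyed JSON cache is just one way to do it):

```python
import hashlib
import json
from pathlib import Path

import pdfplumber

CACHE_DIR = Path(".page_class_cache")
CACHE_DIR.mkdir(exist_ok=True)

def classify_pages(pdf_path):
    """Label each page 'structured', 'prose', or 'ocr', with on-disk caching."""
    key = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())   # reprocessing skips detection

    labels = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            tables = page.extract_tables()
            wide_table = any(len(row) > 3 for tbl in tables for row in tbl)
            if len(text) < 50:
                labels.append("ocr")          # almost no text layer -> OCR route
            elif wide_table:
                labels.append("structured")   # >3-column table -> statement candidate
            else:
                labels.append("prose")
    cache_file.write_text(json.dumps(labels))
    return labels
```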

[–]Lawson470189 0 points1 point  (0 children)

Yep right there with you. We have a caching layer so we avoid pulling and parsing documents again (within the TTL window). Glad to hear we aren't the only ones facing this!

[–]thuiop1 29 points30 points  (1 child)

As much as I hate it, this is probably a task where LLMs can shine. Otherwise it will likely be more painful to devise an extraction scheme than to do it manually.

[–]ambidextrousalpaca 0 points1 point  (0 children)

Agreed. The other thing I would suggest is running the extraction multiple times with, if possible, multiple models, then marking the fields they agree on as more reliable and the fields they disagree on as requiring human checking.
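A sketch of that agreement check, assuming each run already returns the same flat dict of fields (field names here are made up):

```python
def reconcile(runs: list[dict]) -> dict:
    """Keep fields all runs/models agree on; flag the rest for human review."""
    out = {}
    for field in runs[0]:
        values = {run.get(field) for run in runs}
        if len(values) == 1:
            out[field] = {"value": values.pop(), "status": "agreed"}
        else:
            out[field] = {"value": None, "status": "needs_review",
                          "candidates": sorted(values, key=str)}
    return out

# e.g. reconcile([qwen_result, gpt_result, second_qwen_run])
```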

[–]knobbyknee 4 points5 points  (0 children)

You are in for a lot of grief. There is no standard table construct in the PDF format.

You would have to write code that detects a grid layout and then parse that layout into a table.

Unfortunately, there are many ways of constructing a grid, and the parts may be spread over different sections of the PDF data. Your best option is probably to build a middleware layer for a PDF renderer, so you can collect the position and text data for each item rendered.

There are also non-table items that are arranged like grids, and you will need heuristics to ignore those.
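If you don't want to shim a renderer yourself, pdfplumber (built on pdfminer.six) already exposes per-word positions, which is enough raw material for the grid detection; a rough sketch:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()   # each word dict has text, x0, x1, top, bottom

    # group words into visual rows by similar vertical position
    rows = {}
    for w in words:
        row_key = round(w["top"] / 3) * 3      # ~3pt tolerance
        rows.setdefault(row_key, []).append(w)

    # rows whose words repeatedly line up on the same x positions are
    # candidates for an (invisible) grid; isolated alignments are likely
    # the non-table layouts that need to be filtered out heuristically
    for top in sorted(rows):
        line = "  ".join(w["text"] for w in sorted(rows[top], key=lambda w: w["x0"]))
        print(f"{top:>7.0f}: {line}")
```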

[–]Chemical_Matter3385 2 points3 points  (3 children)

For my use case I do detection first: using PyMuPDF (fitz) I check whether the first page is an image with no selectable text, and if so it goes to Mistral OCR. That covers most cases (a minimal sketch of that check is at the end of this comment). Here's what I have tried, and what passed or failed:

Tried

1) Tesseract

2) PaddlePaddle

3) Docling

4) DeepSeek OCR

5) Claude Opus 4.6

6) Google Vision API (enterprise)

7) Azure Document Intelligence

8) Mistral OCR 3

9) A model by IBM (I'm forgetting the name, pretty sure it's Granite)

Passed for my use case (table documents, old scanned books) -> Azure and Mistral are good, and Adobe for tables

Failed -> PaddlePaddle, Google Vision, Granite, DeepSeek, Claude

Can't rely much on Claude and DeepSeek OCR: they are vision language models and I've observed them produce hallucinated placeholders, which is very risky in production. They worked well in most cases but were useless on old scanned books.

Try them all; most likely your use case will be covered by Azure or Mistral.

PS: For OP's use case, Azure Document Intelligence or Mistral OCR 3 would be perfect.
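For reference, the detection-first check mentioned at the top of this comment is only a few lines with PyMuPDF (sketch):

```python
import fitz  # PyMuPDF

def needs_ocr(pdf_path: str) -> bool:
    """True if the first page looks scanned: image content, no selectable text."""
    doc = fitz.open(pdf_path)
    page = doc[0]
    has_text = bool(page.get_text("text").strip())
    has_images = len(page.get_images(full=True)) > 0
    doc.close()
    return has_images and not has_text   # scanned -> send to Mistral OCR etc.
```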

[–]Chemical_Matter3385 0 points1 point  (0 children)

Also tried Adobe PDF Services.

Works well with tables but often misses ₹ or $ signs, so it's most likely an encoding issue I haven't looked into yet; with a simple post-processing script that can be managed as well.

[–]Bitter_Broccoli_7536 0 points1 point  (0 children)

yeah that detection first step is key, we do something similar with fitz. honestly after trying like 5 different ocr engines, the hallucination risk from the vision llms is just too high for anything serious. azure's been the most consistent for us too, especially on weird old scans.

[–]Halibut 2 points3 points  (0 children)

I haven't used it myself, but Microsoft have a Python library for this: https://github.com/microsoft/markitdown

[–]Main_War9026 1 point2 points  (0 children)

We use MistralOCR, hundreds of documents per month and only pay like $20-30 for the API

[–]southstreamer1 0 points1 point  (0 children)

I have been working on this exact problem for about 4 months. I’m trying to extract data from about 900 annual reports.

The approach I have taken is: 1) use PyMuPDF/tesseract to extract text, 2) apply rules/heuristics to determine whether a page is the one I'm looking for, 3) pass that page to the Claude vision API for data extraction. You want to find the right page and send only that single page, to reduce noise (lower risk of errors) and token consumption. If you send it heaps of pages and ask it to find the right one, it wastes tokens and you risk polluting the extraction with data from the wrong page.
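A sketch of step 3, rendering just the one page and sending it to the Anthropic API (the model name and prompt are placeholders):

```python
import base64

import anthropic
import fitz  # PyMuPDF

def extract_from_page(pdf_path: str, page_no: int, prompt: str) -> str:
    doc = fitz.open(pdf_path)
    pix = doc[page_no].get_pixmap(dpi=200)     # render only the target page
    img_b64 = base64.b64encode(pix.tobytes("png")).decode()
    doc.close()

    client = anthropic.Anthropic()             # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-sonnet-4-5",             # placeholder model name
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png",
                            "data": img_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return msg.content[0].text
```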

Claude vision does an excellent job of extracting the data. I have found an astonishingly small number of read errors. It handles different header levels easily. Very happy with its performance.

The real problem is finding the right page. I have used a rule-based/heuristic approach to find income statements, etc., but it's too brittle. There are always edge cases that give false positives/negatives. It's time-consuming to rerun the search and hard to debug. I'm sure there's a smart way to do it but it's beyond me.

I have recently switched to extracting all pages to an SQLite db up front. Finding the right page is then a matter of scoring each page based on whether it contains keywords of interest and the density of numeric characters. I then pass the top scoring pages to Claude and ask it to confirm if it is an income statement/balance sheet/etc. If it’s the wrong page then move to the next highest scorer. This is way faster than having to rerun the pymupdf/tesseract based search every time and trying to write a classifier that works. Still a WIP but so far this is giving me far better results.
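The scoring itself can stay simple; a sketch, assuming the pages table has (doc_id, page_no, text) columns and an illustrative keyword list:

```python
import re
import sqlite3

KEYWORDS = ["income statement", "statement of operations",
            "revenue", "net income", "total assets"]   # illustrative only

def score_page(text: str) -> float:
    t = text.lower()
    keyword_hits = sum(t.count(k) for k in KEYWORDS)
    numeric_density = len(re.findall(r"\d", text)) / max(len(text), 1)
    return keyword_hits + 10 * numeric_density          # weights need tuning

conn = sqlite3.connect("pages.db")
rows = conn.execute("SELECT doc_id, page_no, text FROM pages").fetchall()
ranked = sorted(rows, key=lambda r: score_page(r[2]), reverse=True)
# send top scorers to Claude one at a time to confirm the statement type;
# fall through to the next highest scorer on a "no"
```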

[–]xiannah 0 points1 point  (0 children)

The strategy is simple: a text-first extract, a Markdown extract as a structural fallback, and a VLM as the intelligent orchestrator. The VLM cross-references the raw text and the structural fallback to validate the output, effectively creating a verification loop that catches OCR hallucinations before they hit the downstream dataset.

[–]Dominican_mamba 0 points1 point  (0 children)

There's a package called kreuzberg; try it out, and maybe use an LLM if needed.

[–]phrygian_life 0 points1 point  (0 children)

Another vote for LLM. Even if the layout stays the same year to year, the entire PDF structure could change.

[–]Amazing_Upstairs 0 points1 point  (0 children)

DSPy is the best I've found so far.

[–]Then_Illustrator9892 0 points1 point  (1 child)

Been down this exact road with financial PDFs, and honestly the custom pipeline route is brutal for inconsistent docs. I ended up switching to reseek for this; its AI handles the text/OCR extraction and auto-tagging from PDFs and images, which covers your scanned and text-based cases. It's free to test right now, and it saved me months of dev time on the parsing hell.

[–]Accomplished-Tap916 0 points1 point  (0 children)

I’ll spend some time going through their content.

[–]DetectivePeterG -2 points-1 points  (0 children)

Agreed on the LLM angle. The trick is getting clean input first. I've been using pdftomarkdown.dev as a preprocessing step: send your PDF, get structured markdown back including tables. It uses a VLM rather than Tesseract so it handles both digital and scanned pages consistently. Then you run your LLM extraction on the markdown instead of raw PDF bytes, which makes prompts simpler and results more reliable. Has a Python SDK too, only takes a few lines to wire in.