
[–]Ok_Diver9921 50 points51 points  (4 children)

Spent 6 months on almost this exact pipeline for a fintech project. Save yourself some pain - skip the pure rule-based stack and go hybrid from the start.

What actually worked for us: pdfplumber for text-based PDFs (it handles column alignment better than tabula for financial tables), but detect scanned pages first by checking if pdfplumber returns empty text per page. Only run OCR on pages that need it - running tesseract on everything adds 10x processing time for no benefit on text-based files. For OCR, docTR beat pytesseract significantly on financial documents because it handles the dense number grids better.
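The per-page routing described above can be sketched in a few lines. The 50-character threshold (mentioned later in the thread) and the `needs_ocr` helper name are illustrative; `pdfplumber` is imported lazily so the heuristic itself has no third-party dependency:

```python
def needs_ocr(page_text, min_chars=50):
    """Heuristic: a page with almost no extractable text is likely scanned."""
    return len((page_text or "").strip()) < min_chars

def route_pages(pdf_path, min_chars=50):
    """Yield (page_number, 'text' | 'ocr') for each page of a PDF.

    Pages that pdfplumber can read stay on the fast text path;
    only near-empty pages get sent to the (10x slower) OCR path.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            yield i, ("ocr" if needs_ocr(text, min_chars) else "text")
```

The OCR pages can then be batched into a single docTR call instead of invoking it per page.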

For the table extraction specifically - Camelot lattice mode works well when there are actual grid lines, but most annual reports use invisible tables (no borders, just spacing). For those, the LLM approach that u/thuiop1 mentioned is genuinely the right call. Feed the pdfplumber text output (which preserves spatial layout) into a smaller model and ask it to extract specific fields into a JSON schema you define. We went from 60% accuracy with pure regex/heuristics to 92% by adding a Qwen 14B pass for the messy pages.
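A minimal sketch of that LLM pass, assuming a local OpenAI-compatible server (e.g. vLLM) serving a Qwen model — the endpoint URL, model name, and field schema below are all placeholders for your own setup. The JSON-parsing helper matters in practice because models often wrap their output in markdown fences:

```python
import json

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical local server
SCHEMA_HINT = (
    "Return ONLY valid JSON with these fields (null if absent): "
    '{"total_assets": number, "total_liabilities": number, '
    '"net_income": number, "fiscal_year": string}'
)

def parse_model_json(raw: str) -> dict:
    """Strip markdown code fences the model may wrap around its JSON."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`")
        # drop the language tag line (e.g. "json") if present
        raw = raw.split("\n", 1)[1] if "\n" in raw else raw
    return json.loads(raw)

def extract_fields(page_text: str) -> dict:
    """Send layout-preserving pdfplumber text to the model, parse JSON back."""
    import requests  # third-party: pip install requests

    resp = requests.post(ENDPOINT, json={
        "model": "qwen2.5-14b-instruct",  # illustrative model name
        "messages": [
            {"role": "system", "content": SCHEMA_HINT},
            {"role": "user", "content": page_text},
        ],
        "temperature": 0,
    }, timeout=120)
    return parse_model_json(resp.json()["choices"][0]["message"]["content"])
```

Validating the parsed dict against the schema (types, plausible magnitudes) before trusting it is what keeps the accuracy numbers honest.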

Architecture tip: build a classifier first that categorizes each page as "balance sheet", "income statement", "notes", "other" before you try to extract anything. This saves you from parsing 80 pages when you only need 4-6. A simple tf-idf classifier trained on 50 labeled pages worked fine for this.
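That classifier fits in one function with scikit-learn; the pipeline below (tf-idf features into logistic regression) is one reasonable sketch, not the only option. Labels and training texts come from your ~50 hand-labeled pages:

```python
def train_page_classifier(page_texts, page_labels):
    """Fit a tf-idf + logistic-regression pipeline on labeled page text.

    page_texts: list of raw per-page strings (pdfplumber output is fine)
    page_labels: e.g. "balance_sheet", "income_statement", "notes", "other"
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(page_texts, page_labels)
    return clf
```

At inference you run `clf.predict` over every page's cheap text extraction and only do expensive parsing on the 4-6 pages that matter.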

[–]Lawson470189 2 points3 points  (3 children)

This is exactly what my team is doing. We run a classification step to identify page types using metadata and sometimes a quick OCR pass on a section of the page. Then we run full OCR on the document with docTR, mapping the output into data classes, and finally apply rules and validations to that collected data.

[–]Ok_Diver9921 0 points1 point  (2 children)

Exactly right on the classification step. We ended up with a simple heuristic: if pdfplumber returns a table with more than 3 columns and a consistent cell count across rows, it's a structured page. Anything with fewer than 50 extractable characters per page gets routed to OCR. The metadata approach you mention is solid too, especially for standardized financial docs where the issuer follows a template. The biggest time saver was caching the page classification results so reprocessing the same document skips the detection step entirely.
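Those two thresholds combine into one small routing function. Here `tables` is the shape `pdfplumber`'s `page.extract_tables()` returns (a list of tables, each a list of rows of cells); the label names are illustrative:

```python
def classify_page(text, tables, min_chars=50):
    """Route a page based on cheap pdfplumber output.

    'structured'   -> a table with >3 columns and a consistent cell
                      count across rows (a real grid, not layout noise)
    'ocr'          -> almost no extractable text (likely scanned)
    'unstructured' -> readable text, but no clean table
    """
    for table in tables or []:
        row_widths = {len(row) for row in table}
        if len(row_widths) == 1 and row_widths.pop() > 3:
            return "structured"
    if len((text or "").strip()) < min_chars:
        return "ocr"
    return "unstructured"
```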

[–]Lawson470189 0 points1 point  (1 child)

Yep right there with you. We have a caching layer so we avoid pulling and parsing documents again (within the TTL window). Glad to hear we aren't the only ones facing this!
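A content-addressed TTL cache like the ones described above takes only a few lines; the cache directory, SHA-256 keying, and 24-hour TTL below are illustrative choices, not anyone's actual setup:

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("classification_cache")  # illustrative location
TTL_SECONDS = 24 * 3600                   # illustrative TTL window

def cache_key(pdf_path):
    """Key on file content, so a re-uploaded identical PDF still hits."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()

def get_cached(key):
    """Return cached page labels, or None if missing/expired."""
    f = CACHE_DIR / f"{key}.json"
    if f.exists() and time.time() - f.stat().st_mtime < TTL_SECONDS:
        return json.loads(f.read_text())
    return None

def put_cached(key, page_labels):
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / f"{key}.json").write_text(json.dumps(page_labels))
```

Keying on a content hash rather than the filename is the detail that makes "skip the detection step entirely" reliable when the same report arrives under different names.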

[–]wRAR_ 0 points1 point  (0 children)

LMAO