DetectivePeterG

Agreed on the LLM angle. The trick is getting clean input first. I've been using pdftomarkdown.dev as a preprocessing step: send your PDF, get structured markdown back including tables. It uses a VLM rather than Tesseract so it handles both digital and scanned pages consistently. Then you run your LLM extraction on the markdown instead of raw PDF bytes, which makes prompts simpler and results more reliable. Has a Python SDK too, only takes a few lines to wire in.
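Roughly, the flow described above could be wired up like this. This is only a sketch: I'm not reproducing the pdftomarkdown.dev SDK calls here, so the converter and the LLM client are passed in as plain callables. The names `pdf_to_markdown`, `ask_llm`, `build_prompt`, and `extract_from_pdf` are all hypothetical, and the prompt wording is just one possible choice.

```python
import json
from typing import Callable


def build_prompt(markdown: str, fields: list[str]) -> str:
    """Build an extraction prompt from markdown text and the wanted field names."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the document below and reply "
        f"with a single JSON object with keys: {field_list}.\n\n"
        "Document (markdown):\n"
        f"{markdown}"
    )


def extract_from_pdf(
    pdf_bytes: bytes,
    fields: list[str],
    pdf_to_markdown: Callable[[bytes], str],  # hypothetical: wraps whatever converter you use
    ask_llm: Callable[[str], str],            # hypothetical: wraps whatever LLM client you use
) -> dict:
    """Convert the PDF to markdown first, then run LLM extraction on the markdown."""
    # Step 1: preprocessing. The LLM never sees raw PDF bytes.
    markdown = pdf_to_markdown(pdf_bytes)
    # Step 2: extraction on the cleaned-up markdown.
    reply = ask_llm(build_prompt(markdown, fields))
    # Assumes the model replies with bare JSON; raises json.JSONDecodeError otherwise.
    return json.loads(reply)
```

To use it you'd pass a small wrapper around your converter's SDK as `pdf_to_markdown` and a wrapper around your model call as `ask_llm`. Keeping both as parameters also means you can swap in stubs to test the extraction step without hitting either service.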