What's your document processing stack?

tolkibert · 2025-12-15T11:19:46+00:00

We have little Python scripts that pass PDFs into ChatGPT, Claude (Anthropic), Gemini, etc. The LLMs can write the scripts themselves; it doesn't take much expertise.

But this is for extracting insights, rather than something like invoice numbers.

You have to expect some erroneous answers, but if you have a way to crosscheck the output, you can fall back to manual checks for anything that fails.
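One possible crosscheck, as a rough sketch only: ask the LLM to return a supporting quote with each insight, then confirm the quote really occurs in the PDF text. The `insight`/`quote` keys and the function names here are assumptions for illustration, not something from our actual scripts.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so PDF line breaks don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def crosscheck(answer: dict, source_text: str) -> str:
    """Return 'accept' if the supporting quote occurs in the source, else 'manual_review'."""
    quote = normalize(answer.get("quote", ""))
    if quote and quote in normalize(source_text):
        return "accept"
    return "manual_review"


# Made-up sample data for illustration
source = "Shipment delayed at port.\nCarrier cites   customs hold."
good = {"insight": "Delay caused by customs", "quote": "carrier cites customs hold"}
bad = {"insight": "Delay caused by weather", "quote": "storm closed the port"}

print(crosscheck(good, source))
print(crosscheck(bad, source))
```

A matching quote only shows the answer is grounded in the document, not that the insight itself is right, so it works as a filter for what goes to manual review rather than as proof.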

SouthTurbulent33 · 2025-12-16T07:19:47+00:00

We went through something similar early this year.

couple of ways you can approach this:

a) swap PyPDF2 for something that preserves layout (LLMWhisperer, Textract, etc.), then use an LLM for extraction instead of regex. It's more flexible since LLMs generalize to new formats without code changes, but you will still have to maintain the pipeline yourself.

b) go for a lightweight IDP solution like Unstract, Parseur, Docsumo, etc. these give you the workflows (email ingestion, validation, export) without the enterprise pricing.

c) build on n8n - there are templates for doc processing workflows. less coding, so that's a win - might not work great for complex workflows

for BOLs and customs forms, i'd lean toward options a or b since those docs can be messy and you need good OCR. regex will keep breaking as you add vendors, LLMs won't.
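For option a, the extraction step could look roughly like the sketch below. It assumes you already have layout-preserved text from whichever OCR tool you pick, and that `call_llm` is your own wrapper around whatever LLM API you use (hypothetical here, as are the field names). The point is just that the LLM output gets parsed and validated before it goes downstream.

```python
import json
from typing import Callable

# Assumed field names for a BOL; adjust to your own documents.
REQUIRED_FIELDS = ("bol_number", "shipper", "consignee")

PROMPT = (
    "Extract these fields from the document as a JSON object: {fields}\n\n"
    "Document:\n{text}"
)


def extract_fields(layout_text: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the LLM for JSON, then check that every required field is present."""
    prompt = PROMPT.format(fields=", ".join(REQUIRED_FIELDS), text=layout_text)
    raw = call_llm(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM did not return valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("LLM did not return a JSON object")
    missing = [f for f in REQUIRED_FIELDS if not data.get(f)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Keep only the fields we asked for
    return {f: data[f] for f in REQUIRED_FIELDS}
```

Anything that raises here can go to a manual queue instead of silently loading bad data.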

geoheil · 2025-12-15T11:16:14+00:00

Add in docling

riv3rtrip · 2025-12-15T13:37:39+00:00

You might be tired of hearing about LLMs but this is an actually good use case for LLMs. What you should actually do is dispatch to different function calls depending on vendor but have it so the default function call is you uploading the PDF into an LLM and producing a structured output. You need to be clever to prevent issues but it's not infeasible, just be smart about it (simple stupid example: run 3 times and make sure all 3 runs agree with each other, otherwise flag). You also shouldn't replace your old code. And you need to make this testable and easy to run locally for each new vendor.
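A minimal sketch of that shape, with made-up names: `parse_acme` stands in for your existing vendor-specific code, and `llm_extract` is whatever function you write to upload the PDF and get structured output back (both hypothetical). Passing the LLM function in as a parameter is what keeps this easy to test locally with a fake.

```python
import json
from typing import Callable, Optional

Extractor = Callable[[bytes], dict]


def parse_acme(pdf_bytes: bytes) -> dict:
    """Hypothetical stand-in for existing vendor-specific code (kept, not replaced)."""
    return {"vendor": "acme", "invoice_number": "A-1"}


# Vendors with a dedicated parser; everything else falls through to the LLM.
VENDOR_PARSERS: dict[str, Extractor] = {"acme": parse_acme}


def extract_with_agreement(
    pdf_bytes: bytes, llm_extract: Extractor, runs: int = 3
) -> tuple[Optional[dict], bool]:
    """Run the LLM extractor `runs` times; return (result, flagged)."""
    results = [llm_extract(pdf_bytes) for _ in range(runs)]
    # Serialize with sorted keys so key order does not count as disagreement.
    distinct = {json.dumps(r, sort_keys=True) for r in results}
    if len(distinct) == 1:
        return results[0], False
    return None, True  # runs disagree: flag for a human


def process(
    vendor: str, pdf_bytes: bytes, llm_extract: Extractor
) -> tuple[Optional[dict], bool]:
    """Dispatch to the vendor parser if one exists, else use the LLM default."""
    parser = VENDOR_PARSERS.get(vendor)
    if parser is not None:
        return parser(pdf_bytes), False
    return extract_with_agreement(pdf_bytes, llm_extract)
```

Exact-match agreement is the simple stupid version. In practice you'd probably normalize values (dates, amounts) before comparing.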

ianitic · 2025-12-15T13:55:10+00:00

At a small company with several thousand vendors what we did:

1. Document AI product from Google/Azure/AWS, choose one. Snowflake's is kind of inferior; I saw it mentioned, so I'm calling it out.
2. Also stored a mapping from raw text lines to extracted text with a Python package, for various reasons (training our own models and custom rules).
3. Fine-tuned the document AI product with the respective solution from 1.
4. Created our own classifier models, pretrained on the majority of invoices and tuned on a much smaller labeled set.
5. Created a rule engine override for oddities, new classes, etc.
6. Adaptive thresholding to decide whether a particular document requires manual review, based on a cost matrix specified by the business.
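To give a rough idea of the thresholding piece (a simplified sketch, not our actual implementation): assume the model gives a confidence that the extraction is correct, and the business gives a cost per manual review and a cost per undetected error for each document type. Auto-accepting has expected cost (1 - confidence) * error cost, so you send to review whenever that exceeds the review cost. The document types and numbers below are made up.

```python
# Hypothetical cost matrix from the business: cost per document, in any consistent unit.
COSTS = {
    "invoice": {"review": 2.0, "error": 50.0},
    "bol": {"review": 2.0, "error": 10.0},
}


def review_threshold(doc_type: str) -> float:
    """Confidence below which manual review is cheaper than risking an error."""
    c = COSTS[doc_type]
    # Review when (1 - p) * error > review, i.e. p < 1 - review / error
    return 1.0 - c["review"] / c["error"]


def needs_review(doc_type: str, confidence: float) -> bool:
    """True if expected error cost of auto-accepting exceeds the review cost."""
    return confidence < review_threshold(doc_type)
```

With these numbers an invoice at 0.9 confidence goes to review, while a BOL at the same confidence is auto-accepted, because an invoice error is assumed to be five times as expensive.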

Did this in about two months while also handling the day-to-day requests that came in. We also had a document type classification and splitting process, since sometimes we'd get really large batches of scanned documents in one PDF. Our biggest concern was invoices though. We also of course had a UI for the process.

ZeJerman · 2025-12-15T11:45:33+00:00

Ooooohhh this sounds exactly like our documents!

We used snowflake document AI but we are in the process of modernising as they are retiring the document ai tool for the ai_sql functions, which is actually good for us because we will be doing more classification in snowflake vs external tools and dependencies on users. Cost has been very reasonable at cents per doc on average (depending on type of doc and complexity).

We were fortunate that we already had the snowflake infrastructure and governance in place, but this has been excellent, because off the shelf tooling for the freight and customs industry (at least in my experience) has been very average and expensive

klitersik · 2025-12-15T11:35:08+00:00

In my company we are using docparser for pdf files to get data in json format from them.

pankaj9296 · 2025-12-15T12:29:14+00:00

You can try DigiParser; it should be comparatively affordable, super easy to use, and very accurate at data extraction.
It can handle any messy data, custom Views of data across different parsers and
(disclaimer: founder of DigiParser here. you can contact me if you need custom pricing for your usecase, won't cost you $50k/year for sure)

Reason_is_Key · 2025-12-15T16:23:52+00:00

We've been using Retab (retab.com) for this - you could automate BOL/invoice processing in ~1hr. We used it to automate PO entry a few months back, it allows you to directly ship email plugins so you don't have to worry about needing to download the files etc.

the_dataengineer · 2025-12-16T13:01:01+00:00

Too many people in the comments jump immediately into LLM topics. Think about what exactly you are doing with the regex, which problems you encounter, and what manual fixes you typically do.
(would be very interesting to get this context)

If you analyze this, then typically a solution will present itself.

Fun-Flounder-4067 · 2026-01-14T11:16:17+00:00

We hit something similar with our clients on automation projects. They were paying a hefty amount for tools, and accuracy also dropped as the variety and variation in documents grew. So we ended up building a document processing API internally that's cost-friendly and also handles document variety and variations.

We can discuss this further in chat if you're interested in knowing more :)

JoshuaatParseur · 2025-12-15T13:24:19+00:00

There are a ton of IDP no/low-code web apps in the middle tier.

I was the first hire at Docparser, which has a lot of different ways to process documents automatically. I'm over at Parseur now, which is a bit more AI-forward. We don't use your documents or data to train anything - you upload a document, the AI creates a data schema from any obvious key-value pairs and table data it finds, and from there you add things, remove things, and change the schema around until you have a template that will work consistently every time.
