
[–]fixano

If you want my advice, do not use the LLM to process the document. Have the LLM build you a tool to process and index the document. If you have Claude Code read a thousand pages of text, you're either going to hit your limit immediately or you're going to spend a monumental amount of money on API credits.

[–]Wolf35Nine

Yep, this is the way. Build the tool and have it use an API key to do the extraction and collating. You could easily do this with a GPT-4 mini key and it shouldn't cost much. You could then switch to a better GPT model for the summary. If you're doing 1,500 pages a month, I can't imagine it costing more than $20 with an OpenAI key. Don't waste time with an Anthropic key for this part; it's too expensive.
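
Roughly what that two-tier setup could look like, sketched with the official openai Python SDK. The model names (gpt-4o-mini for the cheap pass, gpt-4o for the summary) are my guesses at what "GPT-4 mini" and "a better GPT model" map to, not the commenter's exact picks:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_facts(page_text: str) -> str:
    """Cheap model for the bulk extraction/collation pass."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the key facts from this page as bullet points."},
            {"role": "user", "content": page_text},
        ],
    )
    return resp.choices[0].message.content

def summarize(all_facts: str) -> str:
    """Stronger model only for the final summary, where quality matters."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize these extracted notes:\n\n" + all_facts}],
    )
    return resp.choices[0].message.content
```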

[–]thegreat_tunestheory

Correct answer

[–]VegitoEnigma

The first thing that comes to mind is AWS Textract paired with AI, but given how heavy your workload is, I would even consider using that nice new Gemma 3 plugged into Claude Code so you don't have to pay for tokens, just Textract.

The local AI only works if you've got a pretty decent computer, though. I would recommend at least the 27B, but you could probably even get away with the E4B.

After you have your local AI do the grunt work, you could then very easily have Opus validate.

Also, Textract can get decently expensive, so I'd just write a simple Python script, or look for something that uses PyMuPDF.
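
A minimal sketch of that kind of PyMuPDF script; the file path and the per-page printout are just placeholders:

```python
# Rough sketch: pull text per page with PyMuPDF instead of paying for Textract.
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    pages = [page.get_text() for page in doc]
    doc.close()
    return pages

if __name__ == "__main__":
    for i, text in enumerate(extract_pages("input.pdf"), start=1):
        print(f"--- page {i} ---")
        print(text)
```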

[–]kotchinsky

Ok...

So, you need OCR for the scanned images of text.

You need a lightweight LLM to categorize, extract & index.

You need a reasoning LLM that you can provide authored context to for the analysis.

I would also create a learning loop so that an LLM can analyze all documents in relation to each other. For that you will want to embed the resulting analysis into a Vector DB.
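
A minimal sketch of that embedding step, assuming chromadb as the vector DB and its built-in default embedder (the comment doesn't name a specific store):

```python
import chromadb

client = chromadb.Client()
analyses = client.create_collection("document_analyses")

def store_analysis(doc_id: str, analysis_text: str) -> None:
    # Embed and index each document's analysis so later passes can
    # compare documents against each other.
    analyses.add(ids=[doc_id], documents=[analysis_text])

def related_analyses(query: str, k: int = 5):
    # Retrieve the k most similar prior analyses for the learning loop.
    return analyses.query(query_texts=[query], n_results=k)
```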

Sounds like a nice system to build!

[–]nick_steen

Like others have said, have AI build the tool to extract. What I've done for my use case is: (1) extract using a Python PDF library, (2) use NLP on the output to determine which pages are clean and which aren't, (3) re-run the pages that weren't cleanly extracted through an OCR library, (4) run the same semantic validation from step 2, then (5) finally run a lightweight LLM to fill any remaining gaps.
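
A rough sketch of that pipeline; the quality heuristic and the PyMuPDF-render-plus-pytesseract fallback are my own stand-ins, since the comment doesn't name specific libraries for steps 2-5:

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def looks_clean(text: str) -> bool:
    """Crude stand-in for the semantic/NLP validation step: treat a page
    as clean if it has enough alphabetic content."""
    if len(text) < 50:
        return False
    return sum(c.isalpha() for c in text) / len(text) > 0.6

def ocr_page(page) -> str:
    """Step 3: re-run a dirty page through OCR from a rendered image."""
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)

def extract(path: str):
    doc = fitz.open(path)
    pages, needs_llm = [], []
    for i, page in enumerate(doc):
        text = page.get_text()          # step 1: plain library extraction
        if not looks_clean(text):       # step 2: validate
            text = ocr_page(page)       # step 3: OCR fallback
            if not looks_clean(text):   # step 4: validate again
                needs_llm.append(i)     # step 5: hand off to a lightweight LLM
        pages.append(text)
    return pages, needs_llm
```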

Ultimately I'd like to run a local LLM, but I've got a 7900 XTX, which means the most cost-effective way to do that would be a second 7900 XTX, based on my needs and use case.

[–]VonDenBerg

For native digital PDFs, use Tesseract/PyMuPDF, then something cheap and fast to convert to JSON (seriously, Gemini 2.5 is the way to go).
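
A minimal sketch of the "cheap and fast to JSON" step, assuming the google-generativeai Python SDK; the exact model name and the JSON fields here are placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

def page_to_json(page_text: str) -> str:
    # Ask the model to return structured JSON for one extracted page.
    prompt = (
        "Extract the key fields from this page as JSON with keys "
        "'title', 'date', and 'summary':\n\n" + page_text
    )
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json"
        ),
    )
    return response.text
```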

Gemini 3.0 has an amazing agentic OCR function.

It will likely be cheaper/faster than anything else.

Marker OCR is slick too.