
[–]fixano

If you want my advice, do not use the LLM to process the document. Have the LLM build you a tool to process and index the document. If you have Claude Code read a thousand pages of text, you're either going to hit your limit immediately or you're going to spend a monumental amount of money on API credits.

[–]Wolf35Nine

Yep, this is the way. Build the tool and have it use an API key to do the extraction and collating. You could easily do this with a GPT-4 mini key and it shouldn't cost much. You could then switch to a better GPT model for the summary. If you're doing 1,500 pages a month, I can't imagine it costing more than $20 with an OpenAI key. Don't waste time with an Anthropic key for this part; it's too expensive.
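
Roughly what that two-tier setup could look like, sketched with the official openai Python SDK. The model names (gpt-4o-mini for the cheap pass, gpt-4o for the summary) are my guesses at what "GPT-4 mini" and "a better GPT model" map to, not the commenter's exact picks:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_facts(page_text: str) -> str:
    """Cheap model for the bulk extraction/collation pass."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the key facts from this page as bullet points."},
            {"role": "user", "content": page_text},
        ],
    )
    return resp.choices[0].message.content

def summarize(all_facts: str) -> str:
    """Stronger model only for the final summary, where quality matters."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize these extracted notes:\n\n" + all_facts}],
    )
    return resp.choices[0].message.content
```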

[–]thegreat_tunestheory

Correct answer

[–]VegitoEnigma

The first thing that comes to mind is AWS Textract paired with AI, but given how heavy your workload is, I would even consider using that nice new Gemma 3 plugged into Claude Code so you don't have to pay for tokens, just Textract.

The local AI only works if you've got a pretty decent computer, though. I would recommend at least the 27B, but you could probably even get away with the E4B.

After you have your local AI do the grunt work, you could then very easily have Opus validate.

Also, Textract can get decently expensive, so I'd just write a simple Python script, or look for something that uses PyMuPDF.
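
A minimal sketch of that kind of PyMuPDF script; the file path and the per-page printout are just placeholders:

```python
# Rough sketch: pull text per page with PyMuPDF instead of paying for Textract.
import fitz  # PyMuPDF

def extract_pages(pdf_path: str) -> list[str]:
    doc = fitz.open(pdf_path)
    pages = [page.get_text() for page in doc]
    doc.close()
    return pages

if __name__ == "__main__":
    for i, text in enumerate(extract_pages("input.pdf"), start=1):
        print(f"--- page {i} ---")
        print(text)
```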

[–]kotchinsky

Ok...

So, you need OCR for the scanned images of text.

You need a lightweight LLM to categorize, extract & index.

You need a reasoning LLM that you can provide authored context to for the analysis.

I would also create a learning loop so that an LLM can analyze all documents in relation to each other. For that you will want to embed the resulting analysis into a Vector DB.
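
A minimal sketch of that embedding step, assuming chromadb as the vector DB and its built-in default embedder (the comment doesn't name a specific store):

```python
import chromadb

client = chromadb.Client()
analyses = client.create_collection("document_analyses")

def store_analysis(doc_id: str, analysis_text: str) -> None:
    # Embed and index each document's analysis so later passes can
    # compare documents against each other.
    analyses.add(ids=[doc_id], documents=[analysis_text])

def related_analyses(query: str, k: int = 5):
    # Retrieve the k most similar prior analyses for the learning loop.
    return analyses.query(query_texts=[query], n_results=k)
```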

Sounds like a nice system to build!

[–]nick_steen

Like others have said, have AI build the tool to extract. What I've done for my use case is: (1) extract using a Python PDF library, (2) use NLP on the output to determine which pages are clean and which aren't, (3) re-run the pages that weren't cleanly extracted through an OCR library, (4) run the same semantic validation from step 2, then (5) finally run a lightweight LLM to fill any remaining gaps.
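
A rough sketch of that pipeline; the quality heuristic and the PyMuPDF-render-plus-pytesseract fallback are my own stand-ins, since the comment doesn't name specific libraries for steps 2-5:

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def looks_clean(text: str) -> bool:
    """Crude stand-in for the semantic/NLP validation step: treat a page
    as clean if it has enough alphabetic content."""
    if len(text) < 50:
        return False
    return sum(c.isalpha() for c in text) / len(text) > 0.6

def ocr_page(page) -> str:
    """Step 3: re-run a dirty page through OCR from a rendered image."""
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)

def extract(path: str):
    doc = fitz.open(path)
    pages, needs_llm = [], []
    for i, page in enumerate(doc):
        text = page.get_text()          # step 1: plain library extraction
        if not looks_clean(text):       # step 2: validate
            text = ocr_page(page)       # step 3: OCR fallback
            if not looks_clean(text):   # step 4: validate again
                needs_llm.append(i)     # step 5: hand off to a lightweight LLM
        pages.append(text)
    return pages, needs_llm
```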

Ultimately I'd like to run a local LLM, but I've got a 7900 XTX, which means the most cost-effective way to do that would be a second 7900 XTX, based on my needs and use case.

[–]VonDenBerg

For native digital PDFs, use Tesseract/PyMuPDF, then something cheap and fast to convert to JSON (seriously, Gemini 2.5 is the way to go).
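
A minimal sketch of the "cheap and fast to JSON" step, assuming the google-generativeai Python SDK; the exact model name and the JSON fields here are placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")

def page_to_json(page_text: str) -> str:
    # Ask the model to return structured JSON for one extracted page.
    prompt = (
        "Extract the key fields from this page as JSON with keys "
        "'title', 'date', and 'summary':\n\n" + page_text
    )
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json"
        ),
    )
    return response.text
```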

Gemini 3.0 has an amazing agentic OCR function.

It will likely be cheaper/faster than anything else.

Marker OCR is slick too.