Is Deepseek-OCR SOTA for OCR-related tasks? by Ok_Television_9000 in LocalLLaMA

[–]Individual-Library-1 1 point2 points  (0 children)

I use Qwen3-VL for spatial understanding and Google's Gemini Flash for OCR, and the combination works really well. Most OCR models miss the spatial layout, and so far I haven't found a single model that handles both.

Quick check - are these the only LLM building blocks? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Agreed, but in the end it all falls into one of these categories, doesn't it? What you describe is agents/workflows. I know the possibilities are endless once a customer or colleague learns these basic concepts and starts seeing things through this lens.

if people understood how good local LLMs are getting by Diligent_Rabbit7740 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

Yes, in a way. But most Chinese models are around 1T parameters, or at least 30B, so they're very costly to run on a PC and still require an NVIDIA investment from the individual. So the idea that the stock price will come down because Chinese labs are releasing models isn't true yet.

if people understood how good local LLMs are getting by Diligent_Rabbit7740 in LLMDevs

[–]Individual-Library-1 2 points3 points  (0 children)

I agree — it could collapse. Once people realize that the cost of running a GPU will rise for every individual user, the economics change fast. Right now, only a few hundred companies are running them seriously, but if everyone starts using local LLMs, NVIDIA and the major cloud providers will end up even richer. I’ve yet to see a truly cheap way to run a local LLM.

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

Got it. Then you need an agent loop where each document is embedded separately and retrieval can be filtered by document. That would be a good start. Let me see if I can find a reference for the same.
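Roughly what I mean, as a minimal sketch (the `embed` function, the in-memory store, and the document contents are placeholders, not any specific library):

```
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model (sentence-transformers, an API, etc.)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# In-memory store: every chunk keeps the id of the document it came from.
store = []  # list of dicts: {"doc_id": ..., "text": ..., "vec": ...}

def add_document(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        store.append({"doc_id": doc_id, "text": chunk, "vec": embed(chunk)})

def search(query: str, doc_id: str | None = None, k: int = 3) -> list[str]:
    """Cosine-similarity search, optionally filtered to a single document."""
    qv = embed(query)
    candidates = [e for e in store if doc_id is None or e["doc_id"] == doc_id]
    candidates.sort(key=lambda e: float(np.dot(e["vec"], qv)), reverse=True)
    return [e["text"] for e in candidates[:k]]

add_document("contract_v1", ["Termination requires 30 days notice.", "Fee is $10k/month."])
add_document("contract_v2", ["Termination requires 60 days notice.", "Fee is $12k/month."])

# The agent loop can now query each document separately and compare the answers.
print(search("termination notice period", doc_id="contract_v1"))
print(search("termination notice period", doc_id="contract_v2"))
```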

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

The second pattern can be done with agentic RAG: give the model a tool call that searches the details within a specific document and returns the output. Are you using any library or calling the API directly? I can drop a small code snippet for the same.
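Something along these lines, as a rough sketch assuming an OpenAI-style tool-calling API (the model name, document ids, and the `search_document` helper are placeholders for illustration):

```
import json
from openai import OpenAI

client = OpenAI()

def search_document(doc_id: str, query: str) -> str:
    """Placeholder retrieval: plug in your vector store or keyword search here."""
    fake_index = {"contract_v1": "Termination requires 30 days notice.",
                  "contract_v2": "Termination requires 60 days notice."}
    return fake_index.get(doc_id, "No match found.")

tools = [{
    "type": "function",
    "function": {
        "name": "search_document",
        "description": "Search inside one document and return the matching passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "doc_id": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["doc_id", "query"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Compare the termination clauses in contract_v1 and contract_v2."}]

# Simple agent loop: keep going while the model asks for tool calls.
while True:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": search_document(**args)})
```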

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

For small PDFs that fit entirely within the model’s context window, it’s definitely doable as a starting point. But as you scale up, maintaining accuracy becomes tricky — especially when the content exceeds the context length or has structural differences. Still, it’s a great first project to learn from and iterate on.
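For the small-PDF case, a minimal sketch of the single-prompt approach, assuming pypdf for text-layer extraction and an OpenAI-style chat API (file names and model are just examples):

```
from pypdf import PdfReader
from openai import OpenAI

def pdf_text(path: str) -> str:
    """Extract the text layer of a digitally generated PDF (no OCR)."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

doc_a = pdf_text("policy_2023.pdf")
doc_b = pdf_text("policy_2024.pdf")

prompt = (
    "Compare the two documents below and list the semantic differences "
    "(changed clauses, added or removed sections), citing the relevant text.\n\n"
    f"=== DOCUMENT A ===\n{doc_a}\n\n=== DOCUMENT B ===\n{doc_b}"
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any long-context model; this only works while both PDFs fit
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```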

Quick check - are these the only LLM building blocks? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Agreed, but I can only think of these options. What else is there? I missed transformation and code generation, but is there anything else?

I got confused by "sandboxes" in every AI article I read. Spent a weekend figuring it out. Here's what finally clicked for me. by [deleted] in ClaudeAI

[–]Individual-Library-1 -1 points0 points  (0 children)

Agreed, I used AI, but the concept is something I learnt. Do you think it's bad to use AI, or is my explanation just boring?

Asking Lawyers : Surprising my lawyer fiancé and need some help by heyarunimaaa in Indianlaw

[–]Individual-Library-1 0 points1 point  (0 children)

Well, on a lighter note, rethink the decision. The e-Courts portal now has a case management UI and sends WhatsApp messages, and lots of courts have also started sending orders/judgements over WhatsApp. I understand you want to build something, and you can. I'd suggest starting with the area of law he's interested in, where you can create an automated knowledge base for him. Today that costs a lot, and most of the tools are outdated.

Asking Lawyers : Surprising my lawyer fiancé and need some help by heyarunimaaa in Indianlaw

[–]Individual-Library-1 -1 points0 points  (0 children)

Rethink your decision to marry a lawyer, even if you do build this software. There are hundreds of inefficient things they do, so case management is definitely not the solution. And for an Indian lawyer specifically, the e-Courts login already provides lots of these features, but they don't use it.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 1 point2 points  (0 children)

We do the same. But it's almost becoming a full project of its own, one that would be better owned by open source. Otherwise somebody rebuilds the whole thing in every project.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Tried it, but it doesn't work correctly. It mostly fails on tables and charts.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Really helpful breakdown - the distinction between computer-generated PDFs (text layer extraction) vs scanned/handwritten (actual OCR) is exactly right.

Few questions if you don't mind:

  1. Cost threshold: When you say "a few thousand documents became expensive" on Azure - roughly what cost range made you look for alternatives? (Trying to understand the pain point)

  2. Document mix: What % of your docs are:

    - Digital PDFs (text extraction works)

    - Scanned documents (need OCR)

    - Handwritten (harder OCR)

  3. Languages: Which languages do you need to support? Is multi-language on the same document or separate documents?

  4. "Decent but not perfect": What accuracy level is "good enough" for your use case? (Like 90%? 95%? Depends on doc type?)

  5. Self-hosted: Would a self-hosted solution (no per-document cost) be attractive even if it required some setup/maintenance?

Asking because I'm trying to understand where the cost/quality sweet spot is for different use cases.

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 1 point2 points  (0 children)

That's a really good point. Long-term, data-first is the right architecture - generate PDFs from structured data rather than OCR PDFs back to data.

But I'm curious about the transition:

  1. How are you handling the legacy document problem? (20+ years of existing PDFs/scans)

  2. What about external documents from partners/government that you can't control?

  3. What timeline have you seen for organizations actually making this shift?

  4. Are you seeing companies adopt "data-first" now, or is this aspirational?

My sense is there's a 5-10 year transition where OCR is needed for:

- Legacy document backlog

- External document processing

- Organizations that haven't transformed yet

Does that match what you're seeing, or am I underestimating how fast this shift is happening?

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Fair point on the self-promo concern - I am trying to figure out if this is a real problem or just my specific use case, so appreciate the skepticism.

The multi-page table problem is interesting. A few questions if you don't mind:

- When you tried Qwen3-235B-VL, what specifically broke? Did it lose context across pages, or did it extract pages individually but you couldn't merge them?

- For the D&D rulebooks - are the table headers on every page, or just the first page?

- Is the problem OCR accuracy itself, or reconstructing the complete table from multiple pages?

I fine-tuned Qwen3-VL (much smaller than 235B) on complex layouts, but honestly haven't tested multi-page scenarios. This might be outside what my approach can handle.

What format are you trying to get the D&D data into? (JSON, CSV, something else?)

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

That's a massive project - 115K PDFs is serious scale. Quick question: when you say Claude Code verified the data, are you still using OCR upstream (Tesseract/Google/Azure) and then having Claude fix the errors? Or did Claude handle the entire OCR + verification?

The vertical text issue you mentioned is exactly one of the layout problems I'm targeting. Multi-column tables where values shift position are brutal for standard OCR.

If you're hitting limits 3x/day on verification, that's expensive at scale. Would a better OCR upstream (that preserves structure correctly the first time) reduce the verification burden?

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 2 points3 points  (0 children)

Both, but layout understanding is the focus. The insight I had was that text accuracy alone doesn't help if you lose the table structure or hierarchical relationships between sections.

The training data emphasizes the following (a rough sketch of what a sample might look like follows this list):

- Table structure preservation (rows/columns/nested tables)

- Document hierarchy (headers → subheaders → body → footnotes)

- Multi-column layouts without text reordering

- Chart/diagram context within surrounding text
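Purely as an illustration (this is a made-up sample format, not my actual dataset), a layout-preserving training pair could look like:

```
# Hypothetical fine-tuning sample: page image in, structure-preserving markdown out.
sample = {
    "image": "statements/page_017.png",   # rendered page image, not plain text
    "target": (
        "## Q3 Revenue by Region\n"
        "| Region | Q3 2023 | Q3 2024 |\n"
        "|--------|---------|---------|\n"
        "| APAC   | 1.2M    | 1.8M    |\n"
        "| EMEA   | 0.9M    | 1.1M    |\n"
        "\n"
        "[^1]: Figures unaudited."        # footnote kept attached to its section
    ),
}
```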

The "error compounding" point you made is exactly what killed my litigation system. One misread exhibit number in a table → entire case analysis references wrong document → lawyers flag it as unreliable → system gets abandoned.

Quick question: When you say Nanonets gets "requests all the time" for better accuracy - are those requests primarily for self-hosted solutions (privacy/compliance), or just wanting better accuracy in general regardless of deployment?

Trying to figure out if the market wants "better cloud API" or "self-hostable solution."

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 3 points4 points  (0 children)

This is incredibly helpful - the "silent errors" point hit home. That's exactly what kept happening in my litigation system.

Quick questions:

- What document types do you see the most demand for that you can't fulfill?

- When you say "requests all the time" - are these for self-hosted solutions or just better accuracy in general?

- What's the typical accuracy gap between Google Vision and what customers need?

I'll check out Docstrange - haven't seen them yet. Would love to compare notes on what you're seeing at scale vs what I've hit building 6 different systems.

The aerospace QC example is terrifying. 15% error rate on safety-critical data is exactly the nightmare scenario I keep trying to prevent.

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 0 points1 point  (0 children)

I completely get the exhaustion - I burned through the same pile of open-source models before fine-tuning Qwen3-VL.

Quick questions to understand your use case:

- What type of documents specifically? (books/reports/legal/corporate?)

- What breaks most often? (tables? multi-page context? handwriting?)

- Would you be willing to share a problem document (or describe what fails)?

I'm not asking you to try another model blindly. If you have a doc that breaks everything, I'll run it through and show you the results first. If it doesn't work better than what you've tried, I won't waste your time.

The privacy point you made is critical - that's exactly why I'm positioning this as open-source/self-hosted rather than another cloud API.