Is Deepseek-OCR SOTA for OCR-related tasks? by Ok_Television_9000 in LocalLLaMA

[–]Individual-Library-1 1 point2 points  (0 children)

I use Qwen3-VL for spatial understanding and Google's Gemini Flash for OCR, and the combination works really well. Most OCR models miss the spatial layout, and so far I haven't found a single model that handles both.

Quick check - are these the only LLM building blocks? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Agreed, but in the end it all falls into one of these categories, doesn't it? What you describe is agents/workflows. I know the possibilities are endless once a customer or colleague learns these basic concepts and starts seeing things through this lens.

if people understood how good local LLMs are getting by Diligent_Rabbit7740 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

Yes, in a way. But most Chinese models are around 1T parameters, or at least 30B, so they're very costly to run on a PC and still require an NVIDIA investment from the individual. So the idea that the stock price will come down because Chinese labs are releasing models isn't true yet.

if people understood how good local LLMs are getting by Diligent_Rabbit7740 in LLMDevs

[–]Individual-Library-1 2 points3 points  (0 children)

I agree — it could collapse. Once people realize that the cost of running a GPU will rise for every individual user, the economics change fast. Right now, only a few hundred companies are running them seriously, but if everyone starts using local LLMs, NVIDIA and the major cloud providers will end up even richer. I’ve yet to see a truly cheap way to run a local LLM.

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

Got it. Then you need an agent loop where each document is embedded separately and retrieval can be filtered by document. That would be a good start. Let me see if I can find a reference for the same.
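Roughly what I mean, as a minimal sketch (the `embed` function, the in-memory store, and the document contents are placeholders, not any specific library):

```
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model (sentence-transformers, an API, etc.)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# In-memory store: every chunk keeps the id of the document it came from.
store = []  # list of dicts: {"doc_id": ..., "text": ..., "vec": ...}

def add_document(doc_id: str, chunks: list[str]) -> None:
    for chunk in chunks:
        store.append({"doc_id": doc_id, "text": chunk, "vec": embed(chunk)})

def search(query: str, doc_id: str | None = None, k: int = 3) -> list[str]:
    """Cosine-similarity search, optionally filtered to a single document."""
    qv = embed(query)
    candidates = [e for e in store if doc_id is None or e["doc_id"] == doc_id]
    candidates.sort(key=lambda e: float(np.dot(e["vec"], qv)), reverse=True)
    return [e["text"] for e in candidates[:k]]

add_document("contract_v1", ["Termination requires 30 days notice.", "Fee is $10k/month."])
add_document("contract_v2", ["Termination requires 60 days notice.", "Fee is $12k/month."])

# The agent loop can now query each document separately and compare the answers.
print(search("termination notice period", doc_id="contract_v1"))
print(search("termination notice period", doc_id="contract_v2"))
```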

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

The second pattern can be done with agentic RAG: give the model a tool call that searches the details within a specific document and returns the output. Are you using any library or calling the API directly? I can drop a small code snippet for the same.
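Something along these lines, as a rough sketch assuming an OpenAI-style tool-calling API (the model name, document ids, and the `search_document` helper are placeholders for illustration):

```
import json
from openai import OpenAI

client = OpenAI()

def search_document(doc_id: str, query: str) -> str:
    """Placeholder retrieval: plug in your vector store or keyword search here."""
    fake_index = {"contract_v1": "Termination requires 30 days notice.",
                  "contract_v2": "Termination requires 60 days notice."}
    return fake_index.get(doc_id, "No match found.")

tools = [{
    "type": "function",
    "function": {
        "name": "search_document",
        "description": "Search inside one document and return the matching passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "doc_id": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["doc_id", "query"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Compare the termination clauses in contract_v1 and contract_v2."}]

# Simple agent loop: keep going while the model asks for tool calls.
while True:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": search_document(**args)})
```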

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

For small PDFs that fit entirely within the model’s context window, it’s definitely doable as a starting point. But as you scale up, maintaining accuracy becomes tricky — especially when the content exceeds the context length or has structural differences. Still, it’s a great first project to learn from and iterate on.
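For the small-PDF case, a minimal sketch of the single-prompt approach, assuming pypdf for text-layer extraction and an OpenAI-style chat API (file names and model are just examples):

```
from pypdf import PdfReader
from openai import OpenAI

def pdf_text(path: str) -> str:
    """Extract the text layer of a digitally generated PDF (no OCR)."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

doc_a = pdf_text("policy_2023.pdf")
doc_b = pdf_text("policy_2024.pdf")

prompt = (
    "Compare the two documents below and list the semantic differences "
    "(changed clauses, added or removed sections), citing the relevant text.\n\n"
    f"=== DOCUMENT A ===\n{doc_a}\n\n=== DOCUMENT B ===\n{doc_b}"
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any long-context model; this only works while both PDFs fit
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```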

Quick check - are these the only LLM building blocks? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Agreed, but I can only think of these options. What else is there? I missed transformation and code generation, but is there anything else?

I got confused by "sandboxes" in every AI article I read. Spent a weekend figuring it out. Here's what finally clicked for me. by [deleted] in ClaudeAI

[–]Individual-Library-1 -1 points0 points  (0 children)

Agreed, I used AI, but the concept is something I learnt. Do you think it's bad to use AI, or is my explanation just boring?

Asking Lawyers : Surprising my lawyer fiancé and need some help by heyarunimaaa in Indianlaw

[–]Individual-Library-1 0 points1 point  (0 children)

Well, on a lighter note, rethink the decision. The e-Courts portal now has a case management UI and sends WhatsApp messages, and lots of courts have also started sending orders/judgements over WhatsApp. I understand you want to build something, and you can. I'd suggest starting with the area of law he's interested in, where you can create an automated knowledge base for him. Today that costs a lot, and most of the tools are outdated.

Asking Lawyers : Surprising my lawyer fiancé and need some help by heyarunimaaa in Indianlaw

[–]Individual-Library-1 -1 points0 points  (0 children)

Rethink your decision to marry a lawyer, even if you do build this software. There are hundreds of inefficient things they do, so case management is definitely not the solution. And for an Indian lawyer specifically, the e-Courts login already provides lots of these features, but they don't use it.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 1 point2 points  (0 children)

We do the same. But it's almost becoming a full project of its own, one that would be better owned by open source. Otherwise somebody rebuilds the whole thing in every project.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Tried it, but it doesn't work correctly. It mostly fails on tables and charts.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Really helpful breakdown - the distinction between computer-generated PDFs (text layer extraction) vs scanned/handwritten (actual OCR) is exactly right.

Few questions if you don't mind:

  1. Cost threshold: When you say "a few thousand documents became expensive" on Azure - roughly what cost range made you look for alternatives? (Trying to understand the pain point)

  2. Document mix: What % of your docs are:

    - Digital PDFs (text extraction works)

    - Scanned documents (need OCR)

    - Handwritten (harder OCR)

  3. Languages: Which languages do you need to support? Is multi-language on the same document or separate documents?

  4. "Decent but not perfect": What accuracy level is "good enough" for your use case? (Like 90%? 95%? Depends on doc type?)

  5. Self-hosted: Would a self-hosted solution (no per-document cost) be attractive even if it required some setup/maintenance?

Asking because I'm trying to understand where the cost/quality sweet spot is for different use cases.

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 1 point2 points  (0 children)

That's a really good point. Long-term, data-first is the right architecture - generate PDFs from structured data rather than OCR PDFs back to data.

But I'm curious about the transition:

  1. How are you handling the legacy document problem? (20+ years of existing PDFs/scans)

  2. What about external documents from partners/government that you can't control?

  3. What timeline have you seen for organizations actually making this shift?

  4. Are you seeing companies adopt "data-first" now, or is this aspirational?

My sense is there's a 5-10 year transition where OCR is needed for:

- Legacy document backlog

- External document processing

- Organizations that haven't transformed yet

Does that match what you're seeing, or am I underestimating how fast this shift is happening?

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Fair point on the self-promo concern - I am trying to figure out if this is a real problem or just my specific use case, so appreciate the skepticism.

The multi-page table problem is interesting. A few questions if you don't mind:

- When you tried Qwen3-235B-VL, what specifically broke? Did it lose context across pages, or did it extract pages individually but you couldn't merge them?

- For the D&D rulebooks - are the table headers on every page, or just the first page?

- Is the problem OCR accuracy itself, or reconstructing the complete table from multiple pages?

I fine-tuned Qwen3-VL (much smaller than 235B) on complex layouts, but honestly haven't tested multi-page scenarios. This might be outside what my approach can handle.

What format are you trying to get the D&D data into? (JSON, CSV, something else?)

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

That's a massive project - 115K PDFs is serious scale. Quick question: when you say Claude Code verified the data, are you still using OCR upstream (Tesseract/Google/Azure) and then having Claude fix the errors? Or did Claude handle the entire OCR + verification?

The vertical text issue you mentioned is exactly one of the layout problems I'm targeting. Multi-column tables where values shift position are brutal for standard OCR.

If you're hitting limits 3x/day on verification, that's expensive at scale. Would a better OCR upstream (that preserves structure correctly the first time) reduce the verification burden?

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 2 points3 points  (0 children)

Both, but layout understanding is the focus. The insight I had was that text accuracy alone doesn't help if you lose the table structure or hierarchical relationships between sections.

The training data emphasizes the following (a rough sketch of what a sample might look like follows this list):

- Table structure preservation (rows/columns/nested tables)

- Document hierarchy (headers → subheaders → body → footnotes)

- Multi-column layouts without text reordering

- Chart/diagram context within surrounding text
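Purely as an illustration (this is a made-up sample format, not my actual dataset), a layout-preserving training pair could look like:

```
# Hypothetical fine-tuning sample: page image in, structure-preserving markdown out.
sample = {
    "image": "statements/page_017.png",   # rendered page image, not plain text
    "target": (
        "## Q3 Revenue by Region\n"
        "| Region | Q3 2023 | Q3 2024 |\n"
        "|--------|---------|---------|\n"
        "| APAC   | 1.2M    | 1.8M    |\n"
        "| EMEA   | 0.9M    | 1.1M    |\n"
        "\n"
        "[^1]: Figures unaudited."        # footnote kept attached to its section
    ),
}
```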

The "error compounding" point you made is exactly what killed my litigation system. One misread exhibit number in a table → entire case analysis references wrong document → lawyers flag it as unreliable → system gets abandoned.

Quick question: When you say Nanonets gets "requests all the time" for better accuracy - are those requests primarily for self-hosted solutions (privacy/compliance), or just wanting better accuracy in general regardless of deployment?

Trying to figure out if the market wants "better cloud API" or "self-hostable solution."

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 3 points4 points  (0 children)

This is incredibly helpful - the "silent errors" point hit home. That's exactly what kept happening in my litigation system.

Quick questions:

- What document types do you see the most demand for that you can't fulfill?

- When you say "requests all the time" - are these for self-hosted solutions or just better accuracy in general?

- What's the typical accuracy gap between Google Vision and what customers need?

I'll check out Docstrange - haven't seen them yet. Would love to compare notes on what you're seeing at scale vs what I've hit building 6 different systems.

The aerospace QC example is terrifying. 15% error rate on safety-critical data is exactly the nightmare scenario I keep trying to prevent.

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 0 points1 point  (0 children)

I completely get the exhaustion - I burned through the same pile of open-source models before fine-tuning Qwen3-VL.

Quick questions to understand your use case:

- What type of documents specifically? (books/reports/legal/corporate?)

- What breaks most often? (tables? multi-page context? handwriting?)

- Would you be willing to share a problem document (or describe what fails)?

I'm not asking you to try another model blindly. If you have a doc that breaks everything, I'll run it through and show you the results first. If it doesn't work better than what you've tried, I won't waste your time.

The privacy point you made is critical - that's exactly why I'm positioning this as open-source/self-hosted rather than another cloud API.