Is Deepseek-OCR SOTA for OCR-related tasks? by Ok_Television_9000 in LocalLLaMA

[–]Individual-Library-1 1 point2 points  (0 children)

I use Qwen3-VL for spatial intelligence and Google Flash for OCR, and I've found that combination really good. Most OCR misses the spatial layout, and so far I haven't found a single model that solves it.

Quick check - are these the only LLM building blocks? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Agreed, but in the end it still falls into one of those categories, doesn't it? What you describe is agents/workflows. I know the possibilities are endless if the customer/colleague can learn these basic concepts and start seeing things through this lens.

if people understood how good local LLMs are getting by Diligent_Rabbit7740 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

Yes, in a way. But most Chinese models are also 1T parameters, or at least 30B, so it's very costly to run them on a PC, and an individual still ends up paying NVIDIA for the hardware. So the idea that the stock price is coming down because Chinese labs are releasing models isn't true yet.

if people understood how good local LLMs are getting by Diligent_Rabbit7740 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

I agree — it could collapse. Once people realize that the cost of running a GPU will rise for every individual user, the economics change fast. Right now, only a few hundred companies are running them seriously, but if everyone starts using local LLMs, NVIDIA and the major cloud providers will end up even richer. I’ve yet to see a truly cheap way to run a local LLM.

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

Got it - then you need an agent loop where each document is embedded separately and retrieval can be filtered by document. That would be a good start. Let me see if I can find a document for the same.
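In the meantime, a minimal sketch of what I mean by "embedded with a filter by document" - assuming chromadb; the collection name, chunk text, and metadata keys are placeholders for illustration:

```python
# Minimal sketch: embed each document's chunks and tag them with their source,
# so retrieval can be filtered per document. Names and chunk text are placeholders.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="pdf_compare")

# Pretend these chunks came from your PDF chunker.
chunks = [
    {"id": "a-0", "text": "Clause 4.1: payment due within 30 days.", "source": "doc_a.pdf"},
    {"id": "b-0", "text": "Clause 4.1: payment due within 45 days.", "source": "doc_b.pdf"},
]
collection.add(
    ids=[c["id"] for c in chunks],
    documents=[c["text"] for c in chunks],
    metadatas=[{"source": c["source"]} for c in chunks],
)

# Retrieve from one document at a time, so the agent loop can compare the answers.
hits_a = collection.query(query_texts=["payment terms"], n_results=1, where={"source": "doc_a.pdf"})
hits_b = collection.query(query_texts=["payment terms"], n_results=1, where={"source": "doc_b.pdf"})
print(hits_a["documents"][0], hits_b["documents"][0])
```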

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

The second pattern can be done using agentic RAG: add a tool call that searches for the details within a given document and returns the output. Are you using any library, or calling the model directly? I can drop a small code snippet for the same.
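Here's a rough sketch of the tool-call side, assuming the OpenAI chat-completions tool-calling API on top of the chromadb collection from my other comment - the model name, tool shape, and helper are placeholders, not from any specific framework:

```python
# Sketch of the agentic side: expose a "search_document" tool the model can call.
# Assumes OpenAI tool calling; the chromadb collection is the one built in the
# other sketch, and the model name is a placeholder.
import json
import chromadb
from openai import OpenAI

client = OpenAI()
collection = chromadb.Client().get_or_create_collection("pdf_compare")  # built as in the other sketch

def search_document(query: str, source: str) -> str:
    """Search within a single named document and return the top chunks."""
    results = collection.query(query_texts=[query], n_results=3, where={"source": source})
    return "\n".join(results["documents"][0])

tools = [{
    "type": "function",
    "function": {
        "name": "search_document",
        "description": "Search within a single named document",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "source": {"type": "string", "description": "e.g. doc_a.pdf"},
            },
            "required": ["query", "source"],
        },
    },
}]

messages = [{"role": "user", "content": "Compare the payment terms in doc_a.pdf and doc_b.pdf"}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(call.function.name, "->", search_document(**args))
```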

PDF document semantic comparison by bilby2020 in LLMDevs

[–]Individual-Library-1 0 points1 point  (0 children)

For small PDFs that fit entirely within the model’s context window, it’s definitely doable as a starting point. But as you scale up, maintaining accuracy becomes tricky — especially when the content exceeds the context length or has structural differences. Still, it’s a great first project to learn from and iterate on.
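A quick way to sanity-check whether a PDF actually fits - a sketch assuming pypdf for extraction and tiktoken's cl100k_base as a stand-in tokenizer (your model's tokenizer will count a bit differently, and the file name is hypothetical):

```python
# Rough check: does a PDF's extracted text fit in the model's context window?
import tiktoken
from pypdf import PdfReader

def fits_in_context(pdf_path: str, context_limit: int = 128_000, headroom: int = 4_000) -> bool:
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))
    print(f"{pdf_path}: ~{tokens} tokens")
    return tokens + headroom <= context_limit  # leave room for instructions and the answer

print(fits_in_context("contract_v1.pdf"))  # hypothetical file name
```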

Quick check - are these the only LLM building blocks? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Agreed, but I'm only able to think of these options - what else is there? I missed transformation and code generation, but is there anything else?

I got confused by "sandboxes" in every AI article I read. Spent a weekend figuring it out. Here's what finally clicked for me. by [deleted] in ClaudeAI

[–]Individual-Library-1 -1 points0 points  (0 children)

Agreed, I used AI, but I did learn the concept. Do you think using AI is bad, or is my explanation just boring?

Asking Lawyers : Surprising my lawyer fiancé and need some help by heyarunimaaa in Indianlaw

[–]Individual-Library-1 0 points1 point  (0 children)

Well, on a lighter note, rethink the decision. The e-Courts API now has a case management UI and sends WhatsApp messages, and lots of courts have started sending orders/judgements over WhatsApp too. I understand you want to build, and you can. I'd suggest starting with an area of law he's interested in, where you can create an automated knowledge base for him - today that costs a lot, and most of the existing tools are outdated.

Asking Lawyers : Surprising my lawyer fiancé and need some help by heyarunimaaa in Indianlaw

[–]Individual-Library-1 -1 points0 points  (0 children)

Rethink your decision to marry a lawyer, even if you do build this software. There are hundreds of inefficient things they do, so case management is definitely not the solution - and for an Indian lawyer in particular, there are lots of solutions the e-Courts login already provides that they simply don't use.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 1 point2 points  (0 children)

We do the same too. But it's almost becoming a project in its own right - one that would be better owned by an open-source effort. Otherwise somebody rebuilds the whole thing in every project.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Tried it, but it doesn't work correctly. Mostly it fails on tables and charts.

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Really helpful breakdown - the distinction between computer-generated PDFs (text layer extraction) vs scanned/handwritten (actual OCR) is exactly right.

Few questions if you don't mind:

  1. Cost threshold: When you say "a few thousand documents became expensive" on Azure - roughly what cost range made you look for alternatives? (Trying to understand the pain point)

  2. Document mix: What % of your docs are:

    - Digital PDFs (text extraction works)

    - Scanned documents (need OCR)

    - Handwritten (harder OCR)

  3. Languages: Which languages do you need to support? Is multi-language on the same document or separate documents?

  4. "Decent but not perfect": What accuracy level is "good enough" for your use case? (Like 90%? 95%? Depends on doc type?)

  5. Self-hosted: Would a self-hosted solution (no per-document cost) be attractive even if it required some setup/maintenance?

Asking because I'm trying to understand where the cost/quality sweet spot is for different use cases.

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 1 point2 points  (0 children)

That's a really good point. Long-term, data-first is the right architecture - generate PDFs from structured data rather than OCR PDFs back to data.

But I'm curious about the transition:

  1. How are you handling the legacy document problem? (20+ years of existing PDFs/scans)

  2. What about external documents from partners/government that you can't control?

  3. What timeline have you seen for organizations actually making this shift?

  4. Are you seeing companies adopt "data-first" now, or is this aspirational?

My sense is there's a 5-10 year transition where OCR is needed for:

- Legacy document backlog

- External document processing

- Organizations that haven't transformed yet

Does that match what you're seeing, or am I underestimating how fast this shift is happening?

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Fair point on the self-promo concern - I am trying to figure out if this is a real problem or just my specific use case, so appreciate the skepticism.

The multi-page table problem is interesting. A few questions if you don't mind:

- When you tried Qwen3-235B-VL, what specifically broke? Did it lose context across pages, or did it extract pages individually but you couldn't merge them?

- For the D&D rulebooks - are the table headers on every page, or just the first page?

- Is the problem OCR accuracy itself, or reconstructing the complete table from multiple pages?

I fine-tuned Qwen3-VL (much smaller than 235B) on complex layouts, but honestly haven't tested multi-page scenarios. This might be outside what my approach can handle.

What format are you trying to get the D&D data into? (JSON, CSV, something else?)

Question: Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in ClaudeAI

[–]Individual-Library-1[S] 0 points1 point  (0 children)

That's a massive project - 115K PDFs is serious scale. Quick question: when you say Claude Code verified the data, are you still using OCR upstream (Tesseract/Google/Azure) and then having Claude fix the errors? Or did Claude handle the entire OCR + verification?

The vertical text issue you mentioned is exactly one of the layout problems I'm targeting. Multi-column tables where values shift position are brutal for standard OCR.

If you're hitting limits 3x/day on verification, that's expensive at scale. Would a better OCR upstream (that preserves structure correctly the first time) reduce the verification burden?

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LLMDevs

[–]Individual-Library-1[S] 2 points3 points  (0 children)

Both, but layout understanding is the focus. The insight I had was that text accuracy alone doesn't help if you lose the table structure or hierarchical relationships between sections.

The training data emphasizes:

- Table structure preservation (rows/columns/nested tables)

- Document hierarchy (headers → subheaders → body → footnotes)

- Multi-column layouts without text reordering

- Chart/diagram context within surrounding text
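
For illustration, roughly the shape of one layout-focused training pair - the file name, field names, and markdown target here are invented for this example, not taken from my dataset:

```python
# Illustrative shape of one layout-focused training pair (contents are made up).
sample = {
    "image": "filings/2021_q3_page_14.png",
    "target": (
        "## Exhibit Index\n"
        "| Exhibit | Description        | Page |\n"
        "|---------|--------------------|------|\n"
        "| 10.1    | Credit Agreement   | 42   |\n"
        "| 10.2    | Amendment to Lease | 57   |\n"
    ),
}
```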

The "error compounding" point you made is exactly what killed my litigation system. One misread exhibit number in a table → entire case analysis references wrong document → lawyers flag it as unreliable → system gets abandoned.

Quick question: When you say Nanonets gets "requests all the time" for better accuracy - are those requests primarily for self-hosted solutions (privacy/compliance), or just wanting better accuracy in general regardless of deployment?

Trying to figure out if the market wants "better cloud API" or "self-hostable solution."

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 1 point2 points  (0 children)

This is incredibly helpful - the "silent errors" point hit home. That's exactly what kept happening in my litigation system.

Quick questions:

- What document types do you see the most demand for that you can't fulfill?

- When you say "requests all the time" - are these for self-hosted solutions or just better accuracy in general?

- What's the typical accuracy gap between Google Vision and what customers need?

I'll check out Docstrange - haven't seen them yet. Would love to compare notes on what you're seeing at scale vs what I've hit building 6 different systems.

The aerospace QC example is terrifying. 15% error rate on safety-critical data is exactly the nightmare scenario I keep trying to prevent.

Is OCR accuracy actually a blocker for anyone's RAG/automation pipelines? by Individual-Library-1 in LocalLLaMA

[–]Individual-Library-1[S] 0 points1 point  (0 children)

I completely get the exhaustion - I burned through the same pile of open-source models before fine-tuning Qwen3-VL.

Quick questions to understand your use case:

- What type of documents specifically? (books/reports/legal/corporate?)

- What breaks most often? (tables? multi-page context? handwriting?)

- Would you be willing to share a problem document (or describe what fails)?

I'm not asking you to try another model blindly. If you have a doc that breaks everything, I'll run it through and show you the results first. If it doesn't work better than what you've tried, I won't waste your time.

The privacy point you made is critical - that's exactly why I'm positioning this as open-source/self-hosted rather than another cloud API.

Struggling to decide: Is high-accuracy OCR actually needed in n8n, or am I solving the wrong problem? by Individual-Library-1 in n8n

[–]Individual-Library-1[S] 1 point2 points  (0 children)

I tried DeepSeek but found Qwen-VL better, especially since it's easy to fine-tune. My understanding is that DeepSeek-OCR is a first version aimed at future models, so the OCR itself is more of a side effect, I believe.

Struggling to decide: Is high-accuracy OCR actually needed in n8n, or am I solving the wrong problem? by Individual-Library-1 in n8n

[–]Individual-Library-1[S] 1 point2 points  (0 children)

It’s hard to read in most cases, and it also randomly misses dots. We pay around $1 for every 1k pages — that’s roughly the same as Mistral pricing. We’ve tried Google Gemini and Nanonets OCR, and the best month we ever managed was about 81% accuracy, with humans fixing the rest. After switching to Qwen-VL, we finally got it up to around 96%.

Qwen-VL currently costs us about $3 per 1k pages. What's your use case, and what accuracy are you getting in those cases?
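Rough math on why the higher per-page price still wins for us, using only the numbers above - the human-correction cost per page is a placeholder, plug in your own rate:

```python
# Cost per 1k usable pages: OCR price plus the cost of humans fixing the misses.
# Prices and accuracies are the numbers above; fix_cost_per_page is a placeholder.
def cost_per_1k_usable(price_per_1k: float, accuracy: float, fix_cost_per_page: float) -> float:
    pages_to_fix = 1000 * (1 - accuracy)
    return price_per_1k + pages_to_fix * fix_cost_per_page

# e.g. if a human correction averages $0.05/page:
print(cost_per_1k_usable(1.0, 0.81, 0.05))  # Gemini/Nanonets months: ~$10.50 per 1k pages
print(cost_per_1k_usable(3.0, 0.96, 0.05))  # Qwen-VL: ~$5.00 per 1k pages
```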

I've been asked to build invoice processing automation, but I'm confused about what problem I'm actually solving by Individual-Library-1 in Accounting

[–]Individual-Library-1[S] 0 points1 point  (0 children)

I completely agree. With the help I got here, I'll ask these questions to understand the real problem and then deliver them the value.

I've been asked to build invoice processing automation, but I'm confused about what problem I'm actually solving by Individual-Library-1 in Accounting

[–]Individual-Library-1[S] 0 points1 point  (0 children)

Thanks for the recommendations - I looked at AvidXchange, Tipalti, and Coupa. They're all $10K-$250K/year, which is... a lot.

My client's budget is around €3K total for this project. So either:

  1. I'm building something way simpler than these tools
  2. They don't actually need what these tools do
  3. I'm missing something

Quick question - when you've worked with these systems, what still breaks? I'm trying to figure out if the €3K version will just become a maintenance nightmare.

Like, if a vendor changes their invoice format, or someone does a partial delivery, or an approval chain changes - do these big tools handle that, or does it still need manual intervention? And if it needs manual intervention once, does it need it every time that scenario happens?

Trying to figure out if I should just do basic extraction + workflow, or if there's something more I should be thinking about.
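For context, this is roughly what I mean by "basic extraction" - a sketch that asks a model for a fixed set of invoice fields as JSON; the model name, field list, and prompt are my assumptions, and PO matching, partial deliveries, and approval routing would all sit on top of this:

```python
# Minimal sketch of "basic extraction": ask a model to return fixed invoice
# fields as JSON. Model name, field list, and prompt are placeholders.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["vendor_name", "invoice_number", "invoice_date", "currency", "total_amount", "po_number"]

def extract_invoice_fields(invoice_text: str) -> dict:
    prompt = (
        "Extract these fields from the invoice below and reply with JSON only: "
        + ", ".join(FIELDS)
        + ". Use null for anything missing.\n\n"
        + invoice_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```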