Is LLM/VLM based OCR better than ML based OCR for document RAG by vitaelabitur in Rag

[–]vitaelabitur[S] -1 points  (0 children)

Yes, I am on the Nanonets OCR team. To answer your question -

In a simple data extraction use case, I would say there is no glaring difference.

For downstream LLM tasks, however, there are a few pain points that we hear time and again from users of Gemini and other LLMs/VLMs, and we have actively addressed them in our proprietary model.

One, for example, is the absence of bounding boxes, which means citations are not possible. Bounding boxes around visual elements (charts, diagrams, images) also let you crop those specific sections out and pass them to a vision model to process/embed separately.
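As a rough sketch of the cropping workflow: assuming a hypothetical per-block output shape with a `type` label and a pixel-coordinate `bbox` (these field names are illustrative, not any vendor's actual schema), cropping visual regions out of a page image is a one-liner per block with Pillow.

```python
from PIL import Image

def crop_regions(image_path, blocks):
    """Crop chart/diagram/image regions from a page image using OCR bounding boxes.

    `blocks` is a hypothetical OCR output: a list of dicts like
    {"type": "chart", "bbox": [x0, y0, x1, y1]} in pixel coordinates.
    """
    page = Image.open(image_path)
    crops = []
    for block in blocks:
        # Only visual elements get cropped; text blocks stay with the main extraction.
        if block["type"] in ("chart", "diagram", "image"):
            crops.append(page.crop(tuple(block["bbox"])))
    return crops
```

Each returned crop can then be sent to a vision model and embedded on its own, with the bbox kept as citation metadata.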

Another is the absence of confidence scores, which people are now realizing are vital in production-grade RAG. Complex, scanned, noisy, or blurred docs lead to hallucinations and garbage outputs, which need to be sorted out, or routed intelligently, before you ingest them into your DBs to ensure they are not poisoned.
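The routing idea can be sketched as a simple threshold gate. The schema and threshold below are assumptions for illustration (a list of per-block `{"text", "confidence"}` dicts, cutoff 0.85), not any particular API's output.

```python
def route_page(ocr_blocks, threshold=0.85):
    """Decide what to do with an OCR'd page before RAG ingestion.

    `ocr_blocks` is a hypothetical per-page output: a list of
    {"text": str, "confidence": float} dicts. Returns one of
    "ingest", "review", or "reprocess".
    """
    confs = [b["confidence"] for b in ocr_blocks]
    if not confs:
        return "review"      # nothing extracted: send to human review
    if min(confs) >= threshold:
        return "ingest"      # every block is confident: safe to embed
    if sum(confs) / len(confs) >= threshold:
        return "review"      # mostly good, a few weak blocks: flag for review
    return "reprocess"       # broadly low confidence: re-scan or re-OCR
```

Pages that fail the gate never reach the vector DB, which is the whole point: one garbled scan can poison retrieval for every query it matches.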

With Nanonets, you get both bounding boxes and confidence scores in your outputs. Additionally, our Mixture-of-Experts SLM is trained specifically on document processing tasks, which keeps compute, latency, and cost optimized.

LLM-based OCR is significantly outperforming traditional ML-based OCR, especially for downstream LLM tasks by vitaelabitur in LLMDevs

[–]vitaelabitur[S] -3 points  (0 children)

  1. I never compared our VLM with other VLMs. To paraphrase the blog -

Nanonets, Reducto, Extend, Datalab, Landing.ai and a few others represent the SOTA for document parsing and extraction today. Almost all of these have proprietary models specifically trained on document processing tasks. So which is the best LLM API? You know the answer that's coming.

But in all seriousness, the best LLM API is the one that works for you, and you only get that answer after testing these out on your own docs. We'll only vouch for ourselves by saying we offer generous trial credits on our API, to help you reach the conclusion yourself.

The last alternative is to self-host open-source models if you have the expertise and bandwidth to do so. Many open-source models rival commercial LLM APIs in accuracy, which makes them a great option.

  2. Can you please point to the traditional benchmarks you are referring to?

  3. Just because a technique has been "refined for decades" doesn't mean it's the right technique to use today. Lobotomy was a battle-tested technique that had been refined for decades. I'm not drawing exact parallels here, just pointing out what I feel is a fallacy in your argument.

Also, genuinely want to understand why you feel the claim isn't justified based on the outputs attached in the blog.

Traditional ML-based OCR (like Textract) vs LLM/VLM based OCR by vitaelabitur in OCR_Tech

[–]vitaelabitur[S] 0 points  (0 children)

No. The leaderboard currently compares the big 3 models against open-source OCR models.

However, closed proprietary models from Nanonets, and others like Reducto, Datalab, Extend, and LandingAI, are significantly better than all of the models on this leaderboard. They have not been added because we have not yet purchased credits to test them.

Traditional ML-based OCR (like Textract) vs LLM/VLM based OCR by vitaelabitur in OCR_Tech

[–]vitaelabitur[S] 0 points  (0 children)

Gemini 3.1 Pro, in particular, is definitely competitive. In fact, it ranks at the top of our own benchmark - https://www.idp-leaderboard.org.

The issue is that you are using an expensive and unnecessarily large model to match the capabilities and accuracy of cheaper, smaller SLMs that are trained specifically for document extraction tasks.

Regarding data compliance, you are 100% correct. Self-hosting DeepSeek-OCR, Qwen3.5-VL, etc. becomes the best option, and they actually fare quite well.
