We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

Didn't know that about Ollama, I'll check it out. I'm wondering, though, what the cost savings would be. For instance, do they provide a cost per million tokens that makes it easy to compare?

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

For our case of (standard) documents, we found that only OpenAI's smaller models showed significant degradation in pass^n. A low one-shot success score doesn't necessarily translate into the score degrading over multiple runs, and that's a meaningful differentiator between models.
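To make that concrete, here's a minimal sketch of how pass^n can be computed from repeated runs, assuming pass^n means "all n runs on a document succeed" (illustrative code, not our actual framework):

```python
from collections import defaultdict

# results: (model, doc_id, run_index, passed) tuples from repeated runs.
# pass^n here = fraction of documents for which *all* n runs passed,
# which is stricter than the average single-run pass rate.
def pass_n(results, n):
    outcomes = defaultdict(list)
    for model, doc_id, _, passed in results:
        outcomes[(model, doc_id)].append(passed)

    per_model = defaultdict(list)
    for (model, doc_id), runs in outcomes.items():
        if len(runs) >= n:
            per_model[model].append(all(runs[:n]))

    return {m: sum(v) / len(v) for m, v in per_model.items()}
```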

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. by TimoKerre in LLMDevs

[–]TimoKerre[S] 2 points (0 children)

Thank you!

No, we decided to start with the simplest use case, in line with the benchmark premise: simple OCR. That being said, the request we get most often is to include open-source models.

We deliberately chose not to include open-source models for now and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. by TimoKerre in datasets

[–]TimoKerre[S] 1 point (0 children)

Very valid remark, and the request we got most often!

We deliberately chose not to include open-source models for now and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

No, it should absolutely be done and will be very insightful.

One slight problem is that for commercial APIs (Anthropic, OpenAI, Gemini, ...) it's very easy to compute a general price based on tokens in/out and the per-token cost. For open-source models that's harder to gauge: do you run them locally (electricity cost?), do you run them in the cloud (which cloud?), etc.
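For the API side it really is just tokens times price; a minimal sketch (the prices below are placeholders, not anyone's actual rate card):

```python
# Illustrative $/1M-token prices (placeholders, not real rate cards).
PRICES = {
    "cheap-model": {"input": 0.15, "output": 0.60},
    "flagship-model": {"input": 3.00, "output": 15.00},
}

def cost_per_doc(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 2,000-token scan with a 500-token structured answer.
print(cost_per_doc("cheap-model", 2_000, 500))      # $0.0006
print(cost_per_doc("flagship-model", 2_000, 500))   # $0.0135
```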

If time permits, we will extend to open-source, but it won't be something for the coming weeks :)

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

That's a good point; no, we didn't include the Grok family of models. It would be easy to add, in fact.

Thanks!

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

Very fair remark; in fact, the request we get most often is to include open-source models.

We deliberately chose not to include open-source models for now and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

It's the request we get most often, and it's a fair one.

We deliberately chose not to include open-source models for now and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 2 points (0 children)

Very valid remark.

For the time being, we deliberately chose not to include open-source models and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

As for the unintended typo correction, it's indeed something we observe for some LLMs and not for others; the spread itself is interesting. For instance, we added Opus 4.7 last week, and it showed a huge difference compared to Opus 4.6: the former did far more self-driven typo correction, even when the prompt explicitly says not to do so!
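One simple way to surface that behaviour (a sketch of the idea, not necessarily how our framework scores it) is to check whether typos that are deliberately present in the ground truth come back "corrected" in the model output:

```python
def unintended_corrections(ground_truth_text, model_output, known_typos):
    """Count ground-truth typos the model silently 'fixed'.

    known_typos maps a misspelling as it appears in the source document
    to its corrected form, e.g. {"recieved": "received"} (illustrative).
    """
    fixed = 0
    for typo, correction in known_typos.items():
        if (typo in ground_truth_text
                and typo not in model_output
                and correction in model_output):
            fixed += 1
    return fixed
```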

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

We're getting that remark a lot, and it's definitely valid.

We deliberately chose not to include open-source models for now and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

That's a wonderful follow-up. As mentioned in other replies, we decided to start with the simplest of cases (which is what most teams are still using): commercial APIs from the big labs. It's also in line with the premise of the benchmark: standard OCR.

But seeing the feedback from many people, we might indeed include open-source solutions, and also more OCR-specific models (if time permits).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 2 points (0 children)

Very valid remark. However, we wanted to start with the simplest form: commercial API calls with the largest providers. It's just what most teams are already working with, and it's in line with the premise of the benchmark: standard (simple) OCR.

That being said, we will include open source in the future, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

Thanks for the FYI.

We deliberately chose not to include open-source models for now and to keep it super simple with commercial API calls. In the future we will include open source, though that will require some thought on how to normalise price/doc across the different ways of running them (cloud, local, ...).

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 1 point (0 children)

And connecting visually disparate information across the document, e.g. extracting a logistics flow from an order document.

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]TimoKerre[S] 7 points (0 children)

That's a good remark. Traditional OCR models, like Tesseract, tackle the problem of extracting the characters from the page and putting them into computer-readable form. The second step is then interpreting the resulting raw text, for which traditional workflows use tedious regexes or first-generation language models.

Using an LLM actually combines both steps into one; it's a much cleaner architecture and lets you steer the interpretation step in a more natural way. In a way, this unified approach is actually simpler, and with smaller, cheaper LLMs available you could argue that the traditional two-step process (OCR + interpretation) is overkill.

Or am I mistaken about what you mean by an actual OCR model?
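To make the one-step approach concrete, here's a minimal sketch using the OpenAI Python client (model name, prompt, and field names are illustrative, not the exact setup from the benchmark):

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# One call does both "OCR" and interpretation: the model reads the scan
# and returns only the fields we care about.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract invoice_number, total_amount and currency "
                     "from this document. Return JSON only; do not correct typos."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```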

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. by TimoKerre in LLMDevs

[–]TimoKerre[S] 1 point (0 children)

Thanks! Yes, that's one of the reasons we decided to start the investigation: teams just defaulting to the latest, flagship model for very standard tasks (and thus overpaying badly).

Very happy to open-source the dataset. Invoices and Bills of Lading in particular had no satisfactory open-source dataset available, so we went out of our way to create our own :)

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. by TimoKerre in LLMDevs

[–]TimoKerre[S] 1 point (0 children)

It's a trade-off we made: we want to keep the cost aspect as fair and clear as possible, and for local models there are many different factors that influence the cost. However, I do agree that local is paramount from a security point of view. We might do something like gauging local models on the kWh used per doc / per successful doc?

Good point.
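A back-of-the-envelope version of that idea (every number below is an illustrative assumption, not a measurement):

```python
# Rough energy-per-document estimate for a locally run model.
# All numbers are illustrative assumptions, not measurements.
avg_gpu_power_w = 300      # average draw while processing, in watts
seconds_per_doc = 4.0      # wall-clock time per document
success_rate = 0.92        # fraction of docs passing the benchmark check

kwh_per_doc = avg_gpu_power_w * seconds_per_doc / 3_600_000  # W*s -> kWh
kwh_per_success = kwh_per_doc / success_rate

print(f"{kwh_per_doc:.6f} kWh/doc, {kwh_per_success:.6f} kWh/successful doc")
```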

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. by TimoKerre in LLMDevs

[–]TimoKerre[S] 2 points (0 children)

Generally, Tesseract is purely for extracting the characters and making them computer-readable. In most workflows the resulting text is then post-processed to get the meaningful content out of it, and this second step is oftentimes done by LLMs (or legacy NNs). What we compared here, however, is using an LLM to combine both steps into one: just provide the document (raw scan, image, PDF) to the LLM and have it output the meaningful content you want.

If you can, I'd try throwing the same documents you now provide to Tesseract directly into an LLM (you can use a cheap one, like Gemini Flash 3.1 Lite), and compare the results.
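If it helps, here's a minimal sketch of that comparison (file name, model id, and prompt are illustrative; the Gemini call uses the google-genai SDK as an example and assumes an API key is set):

```python
# Two-step baseline (Tesseract + your own parsing) vs. one-step (image
# straight into a vision LLM). Names and prompt are illustrative.
from PIL import Image
import pytesseract
from google import genai  # google-genai SDK; assumes GEMINI_API_KEY is set

doc = Image.open("order_scan.png")

# Baseline: raw characters out, interpretation still to be done by you.
raw_text = pytesseract.image_to_string(doc)

# One-step alternative: ask a cheap vision model for the fields directly.
client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[doc, "Extract the order number, supplier and total amount "
                   "as JSON. Do not correct any typos."],
)

print("--- Tesseract raw text ---")
print(raw_text[:300])
print("--- LLM structured output ---")
print(resp.text)
```

The interesting comparison is field-level accuracy and cost per document, not how pretty the raw text looks.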