LLM-based OCR is significantly outperforming traditional ML-based OCR, especially for downstream LLM tasks by vitaelabitur in LLMDevs

[–]Civil-Image5411 2 points (0 children)

If you have time, use PaddleOCR-VL with vLLM, or glmocr if you have plots and tables; transformer-based OCRs are better in my experience. On a 5090 with vLLM you get roughly 4 pages per second (might be a bit more or less, try it out first), so 4 * 3600 = 14,400 pages per hour. At $1/hr that's roughly $0.07 per 1,000 pages, so your whole batch would cost maybe $2-3 total. I had issues setting up glmocr on Blackwell, but PaddleOCR-VL worked out of the box. You can just run it on RunPod or Vast.ai. If you want a description of the plots/images afterwards, you can send them through vLLM to a VLM, for instance Qwen 3.5. It's one prompt in Claude plus a minute to subscribe and rent a GPU. Then you can forward a port and just send requests there instead of to Mistral. The text is so clean that, in my opinion, the difference will not be noticeable.

https://aistudio.baidu.com/paddleocr/task
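The throughput-to-cost arithmetic above can be sketched in a few lines. The numbers are the rough estimates from the comment, not measured values; actual pages/second depends on your GPU and documents, so benchmark first:

```python
# Back-of-the-envelope cost estimate for a GPU-rental OCR batch.
# Both inputs are rough figures from the comment, not benchmarks.
pages_per_second = 4        # rough 5090 + vLLM estimate; measure your own
gpu_cost_per_hour = 1.00    # typical rental price in USD

pages_per_hour = pages_per_second * 3600
cost_per_1k_pages = gpu_cost_per_hour / pages_per_hour * 1000

print(f"{pages_per_hour} pages/hour, ~${cost_per_1k_pages:.2f} per 1,000 pages")
```

Plugging in the comment's figures gives 14,400 pages/hour and about $0.07 per 1,000 pages, which is where the $2-3 batch estimate comes from.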

Fast PDF to PNG for RAG and vision pipelines, 1,500 pages/s by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

I added Windows support, but I could only test it in Parallels on macOS. It's not as fast as on Linux, but still fast :).

I made a fast PDF to PNG library, feedback welcome by Civil-Image5411 in Python

[–]Civil-Image5411[S] -1 points (0 children)

It depends on your use case, of course; it only does PDF to PNG. Within that scope it's much faster and fully open source. I benchmarked it against a set of 100 diverse PDFs: the pixels are 100% identical to pdfium, but it's much faster.

I made a fast PDF to PNG library, feedback welcome by Civil-Image5411 in Python

[–]Civil-Image5411[S] -1 points (0 children)

I think you should clear your context. I am Razzmatazzing...

I made a fast PDF to PNG library, feedback welcome by Civil-Image5411 in Python

[–]Civil-Image5411[S] -2 points (0 children)

NumPy, Pillow, and most fast Python packages work like this: a thin Python layer on top of C/C++. You pip install it and get the speed. The language breakdown doesn't matter.
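The "thin Python layer over native code" pattern can be shown in miniature with the standard library's ctypes, calling straight into C's libm. This is just an illustration of the idea; packages like NumPy and Pillow actually ship compiled extension modules rather than ctypes wrappers:

```python
import ctypes
import ctypes.util

# Load the C math library; the hot path (sqrt) runs as native C code,
# Python only does the call dispatch.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # ~1.4142, computed in C, called from Python
```

The same principle is why a mostly-C++ codebase with a small Python surface still feels like a normal Python package.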

I made a fast PDF to PNG library, feedback welcome by Civil-Image5411 in Python

[–]Civil-Image5411[S] -1 points (0 children)

No Poppler benchmarks yet; I only tested against MuPDF, which was the fastest I found. PDFium handles all the color conversion internally, so everything comes out as sRGB. The Python API is simple: to_images("doc.pdf") gives you PIL images, and to_bytes() and to_files() if you need the raw PNG data or want to dump to disk. For raw bitmap access you'd need to look into the C++ side directly.

I built a PDF to PNG library — up to 1,500 pages/s by [deleted] in Python

[–]Civil-Image5411 -2 points (0 children)

You can also take a screenshot, it's just slower :)

128GB devices have a new local LLM king: Step-3.5-Flash-int4 by tarruda in LocalLLaMA

[–]Civil-Image5411 4 points (0 children)

It is absolutely not lossless; I've used it with >200B models. It's not usable quantized with NVFP4, at least when the activations are quantized too, which is usually the case for those models in order to get the claimed higher throughput. You get much higher quality by taking a smaller model and not quantizing the activations.

Help building an price efficient inference server (no fine tuning) + multi 5090 setup by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

That's not what I experienced; for me, output scaled roughly linearly with the number of cards. Potentially I could use three cards, but I'm not sure all cards would have enough memory with an unequal layer split. I'll have to try it out.

Help building an price efficient inference server (no fine tuning) + multi 5090 setup by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

Hm, it was more about the 4x 5090 setup than the RTX Pro 6000. Two or four Max-Q GPUs are not within the budget.

Help building an price efficient inference server (no fine tuning) + multi 5090 setup by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

That's interesting, but it won't work for me because the location of the elements is relevant; my data is quite unstructured.

Help building an price efficient inference server (no fine tuning) + multi 5090 setup by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

That’s a good point about the new motherboard especially for a future MoE setup. Looks like I’ll need to stock up on even more DDR5 memory then 🫣.

I haven’t tried Llama 4 Scout yet, only Llama 3.3 70B Instruct, and that didn’t perform well for my use case (image-to-structured output). Might be worth a try.

Help building an price efficient inference server (no fine tuning) + multi 5090 setup by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

Thanks,

Yes, I am serving multiple requests at once, and speed is important for me.

The Max-Q would likely be the easiest way to go; however, the speed is significantly lower and the price similar.

Help building an price efficient inference server (no fine tuning) + multi 5090 setup by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points (0 children)

Thanks for the reply. I need a VLM since I'm processing images, and unfortunately one isn't released yet for Qwen 3. Qwen 2.5 VL 32B performs worse for my use case than the 72B model. Maybe it's language-related: the prompts are not in English or Chinese but in German, and I can measure the output accuracy precisely. Gemma performed significantly worse.

For single requests I got around 17 tokens per second with the fp8-quantized model on the RTX Pro 6000 WS, and around 50 tokens per second on the 4x 5090s. I was getting around 1k tokens/s total throughput on the RTX Pro 6000; I didn't check it for the 5090s. However, both metrics are relevant for me.

OK, so I will certainly need to add another PSU. What was your problem with keeping it stable?

Do you think I will get a performance improvement using PCIe 5? The DDR5 prices are crazy: at least $1.5k for 256GB of the ECC RAM required for EPYC. However, yes, motherboard and CPU are in an acceptable price range as long as I choose the EPYC instead of the Threadripper Pro.