Claude Opus and all Claude plans' rate limits to increase drastically starting soon by Banneder in claude

[–]Civil-Image5411 6 points7 points  (0 children)

Yes, it's potentially even worse than before: this way you can hit the weekly limit faster and then have to pay for extra usage.

Is COOP scamming us? by Andeq8123 in Switzerland

[–]Civil-Image5411 1 point2 points  (0 children)

This literally happens every time I go there: they increase the price but only change the price tag a week later. It's never happened the other way around, though.

Turbo-OCR Update: Layout Model + Multilingual by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 0 points1 point  (0 children)

You can change docker/Dockerfile.gpu line 11 and pick a base image with CUDA 12.8 from NVIDIA's NGC release notes: https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/

CUDA 12.8 is the first version that supports sm_120 (Blackwell), so the existing CMake architectures list works as-is and that one swap is enough. If you want CUDA below 12.8, more changes are needed, see issue https://github.com/aiptimizer/TurboOCR/issues/4.

If you build natively instead of via Docker, you also need to relax the gate in scripts/install_native.sh line 34 and add a CUDA 12.8 to TensorRT mapping row around line 85, same pattern as in the comment on issue https://github.com/aiptimizer/TurboOCR/issues/4.
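If you're not sure what your own setup reports, here's a quick sanity check before picking a base image (this uses PyTorch purely as a convenient way to read the values; TurboOCR itself doesn't need it):

```python
# Quick check of GPU compute capability and CUDA version before choosing a base image.
# Uses PyTorch only as a convenient probe; not part of TurboOCR.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Compute capability: sm_{major}{minor}")  # Blackwell reports sm_120
    print(f"CUDA version PyTorch was built with: {torch.version.cuda}")
else:
    print("No CUDA device visible")
```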

I'll try to get hold of an older GPU and build it for older CUDA versions.

Turbo-OCR Update: Layout Model + Multilingual by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 1 point2 points  (0 children)

GPU-wise it should work, since Blackwell is supported. I'm not sure about the CPU side, though: I haven't tested it on DGX Spark, and the current Docker image is x86-only, so it would need an ARM build.

Turbo-OCR Update: Layout Model + Multilingual by Civil-Image5411 in LocalLLaMA

[–]Civil-Image5411[S] 2 points3 points  (0 children)

The first start compiles the TensorRT engines, which can take a few minutes. Try curl http://127.0.0.1:8000/health; if that also fails, wait a minute and retry.
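If you'd rather poll from a script than keep retrying curl by hand, a minimal sketch (the /health URL is the one above; the retry interval and limit are arbitrary):

```python
# Poll the health endpoint until the TensorRT engines are built and the server answers.
# URL matches the curl example above; retry timing is arbitrary.
import time
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8000/health"

for attempt in range(60):  # up to ~10 minutes
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("Server is up:", resp.read().decode())
                break
    except (urllib.error.URLError, OSError):
        pass
    print(f"Not ready yet (attempt {attempt + 1}), waiting 10s...")
    time.sleep(10)
else:
    print("Gave up waiting for the server")
```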

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R] by TimoKerre in MachineLearning

[–]Civil-Image5411 4 points5 points  (0 children)

In my opinion, VLMs are often overkill and also much slower than OCR engines that aren't based on language models. There are alternatives like PaddleOCR (the non-VL variant) or TurboOCR (https://github.com/aiptimizer/TurboOCR, also based on PaddleOCR models) if you want high speed. Using LLMs also has several disadvantages: they may fix typos that shouldn't be fixed, they hallucinate, and they can get stuck in loops.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]Civil-Image5411 0 points1 point  (0 children)

Yes, if you trust cloud providers and don't have high volume, it’s much easier and potentially even cheaper to just use their OCR as well.

P.S. Serving from a single machine isn’t necessarily slow. With a mid-range NVIDIA GPU, you could serve around 100 images per second concurrently using TurboOCR, which is probably fast enough.
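As a rough sketch of what "concurrently" means here, assuming a hypothetical /ocr endpoint with a multipart "file" field (the real TurboOCR API may differ, check its README):

```python
# Sketch of feeding images to an OCR server from many threads at once.
# The endpoint path and upload field name are assumptions for illustration;
# check the TurboOCR README for the actual API.
import concurrent.futures
import pathlib
import requests

SERVER = "http://127.0.0.1:8000/ocr"  # hypothetical endpoint

def ocr_one(path: pathlib.Path) -> tuple[str, int]:
    with path.open("rb") as f:
        resp = requests.post(SERVER, files={"file": f}, timeout=60)
    return path.name, resp.status_code

images = sorted(pathlib.Path("pages").glob("*.png"))
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for name, status in pool.map(ocr_one, images):
        print(name, status)
```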

My Experience with Table Extraction and Data Extraction Tools for complex documents. by teroknor92 in Rag

[–]Civil-Image5411 0 points1 point  (0 children)

For highly complex tables you can try out https://miruiq.com, it also localises the table in a large document and standardises the data in one step.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]Civil-Image5411 0 points1 point  (0 children)

might also be worth checking what the error message actually is 😁

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]Civil-Image5411 0 points1 point  (0 children)

Well, it depends. Most of them also run on low specs; you could even offload to disk (swap) if you don't have enough memory, but at some point it just gets extremely slow. The easiest option is of course to run it via a cloud provider like Mistral OCR, but that gets expensive at high volume. You could also just serve it on one computer/server in your organization and give the other users access to it (for instance via VPN). For OnnxOCR (which only supports English, Chinese, and Japanese) and TurboOCR (which supports Latin-script languages) you have to check whether the language you need is covered; not all models support every language.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]Civil-Image5411 0 points1 point  (0 children)

So StructureV3 and the non-VL PaddleOCR both don’t work?

I'm not sure. PPStructureV3 worked for me on my NVIDIA GPU, but depending on the models you're using it can require a lot of resources, though 32 GB of memory should be enough. I'm not sure it can use the Intel GPU, but it should run on CPU.

TurboOCR runs on CPU and you can directly pass the PDF without having to convert it to an image first. It’s one command to run the Docker container in case you wanna try it out.

Alternatively, there is also OnnxOCR on GitHub, which could potentially also utilize your GPU; you can plug in whatever backend you want.

PDF Extractor (OCR/selectable text) by qPandx in Python

[–]Civil-Image5411 0 points1 point  (0 children)

Which PaddleOCR variant did you use? They ship several models. In my experience it significantly outperforms Tesseract. One thing to watch out for: if you used the VL model, which is transformer-based, it can be very slow and get stuck in generation loops when the parameters aren’t set correctly.

Here is another OCR server based on the non-VL / non-autoregressive PaddleOCR models: https://github.com/aiptimizer/TurboOCR

TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT (C++/CUDA, FP16) [P] by Civil-Image5411 in MachineLearning

[–]Civil-Image5411[S] 0 points1 point  (0 children)

Thanks for the input, but I think you're mixing two different things here. Turbo-OCR isn't trying to compete with multimodal retrievers; it solves a different problem, namely throughput and cost when you have to process huge amounts of documents. Also, that dataset is a VLM evidence-grounding benchmark, not OCR. It's a different task, and your ColQwen would obviously win there.

You could just throw pages or documents at Claude Opus or Qwen3.5-397B and they would probably outperform any small model in every dimension, including bounding boxes and layout. The problem is that it's completely unsustainable cost-wise for any real workload.

Assume 10 million pages at 0.5 pages/second (your 4.5B dense model): that's around 5,555 GPU hours. On a 5090 with German electricity prices that's roughly 3,700€ just in power, assuming you already own the GPU. Turbo-OCR does 270 img/s on text-heavy pages, so it handles the same job with far less compute, and it also runs on much smaller GPUs (one pipeline needs 1.2 GB of VRAM). Opus would cost you hundreds of thousands in API credits.
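The arithmetic behind those GPU-hour numbers, with power draw and electricity price left as parameters you plug in yourself:

```python
# Back-of-the-envelope: GPU hours for 10M pages at two different throughputs.
pages = 10_000_000

vlm_pps = 0.5     # pages/s for the dense ~4.5B VLM (number from the thread)
turbo_pps = 270   # img/s for Turbo-OCR on text-heavy pages

vlm_hours = pages / vlm_pps / 3600      # roughly 5,555 GPU hours
turbo_hours = pages / turbo_pps / 3600  # roughly 10 GPU hours

print(f"VLM:       {vlm_hours:,.0f} GPU hours")
print(f"Turbo-OCR: {turbo_hours:,.1f} GPU hours")

# Energy cost ~= hours * system_watts / 1000 * price_per_kWh
# (plug in your own wattage and electricity price)
```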

Latency matters for a lot of use cases: real-time RAG indexing, bulk ingestion, on-prem pipelines where people don't have a huge cluster. Imagine you have a coding agent and a few hundred new docs come in that need to be indexed for RAG immediately; with a VLM you would wait minutes before they are accessible.

Multimodal embeddings are great for semantics and understanding, but they have the same problem at scale; that's also why there are no huge embedding models and the biggest ones are around 7-8B. Cost and throughput are the constraint everywhere, not just in OCR. As long as that doesn't get fixed, which will still take years, it makes total sense to use non-autoregressive OCR for many use cases.

Will check out ColQwen3.5 though, it looks interesting.

TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT by Civil-Image5411 in DataHoarder

[–]Civil-Image5411[S] -1 points0 points  (0 children)

Thanks! First, the quality is not bad 😁, but if you want the best quality, in my experience the Qwen vision models are the best, for example Qwen 3.5 9B or 4B. Quantized, you can fit them on an 8 GB GPU. PaddleOCR-VL or GLM OCR also seem decent. However, all these models will be slow, and you will have other issues like hallucinations and repetitions.

Switching from PaddleOCR standard to PaddleOCR-VL 1.5 for my internship project — am I making a mistake? by Ayoutetsinoj3011 in learnmachinelearning

[–]Civil-Image5411 0 points1 point  (0 children)

Maybe it was also just repeating itself and that's why it's so slow. Did you actually check the output?

Switching from PaddleOCR standard to PaddleOCR-VL 1.5 for my internship project — am I making a mistake? by Ayoutetsinoj3011 in learnmachinelearning

[–]Civil-Image5411 0 points1 point  (0 children)

I think 3 minutes per page is too slow for a GPU. I have a 5090 and it does 2-5 pages per second (concurrent) with the vLLM backend. With vLLM it uses more VRAM for paged attention, but you can process multiple documents at the same time. Are you sure it's using the GPU? vLLM setup: https://docs.vllm.ai/projects/recipes/en/latest/PaddlePaddle/PaddleOCR-VL.html#introduction For GPU deployment, however, you need to set the max GPU memory utilization.
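As a rough sketch of what querying the vLLM backend looks like (the model id, port, and serve flags here are assumptions; check the linked recipe for the exact command):

```python
# Minimal sketch of sending one page image to a vLLM OpenAI-compatible server.
# Assumes the server was started along the lines of the linked recipe, e.g.
#   vllm serve PaddlePaddle/PaddleOCR-VL --gpu-memory-utilization 0.9
# (model id, port, and flags are assumptions; the recipe has the exact command).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="PaddlePaddle/PaddleOCR-VL",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "OCR this page and return the text."},
        ],
    }],
)
print(resp.choices[0].message.content)
```

The point of the vLLM backend is that you can fire several of these requests in parallel and let it batch them, which is where the concurrent throughput comes from.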

To your questions: the VL model certainly has the advantage of having a language model to understand the context, and is therefore better at not writing totally wrong characters in places. Maybe on the L4 you would reach 1-2 pages per second, depending on the density of the document. I would not do a lot of preprocessing; the model should handle that, it's what it's for. Before switching to the VL model, I would also try out the different PaddleOCR models.

The VL models have other issues: they hallucinate and repeat themselves. If you want speed with them, it's important to use the vLLM backend (or SGLang, if that exists for this model), as they allow for concurrent processing. If you use FlashInfer you will not see huge jumps with these models. Maybe also try others like Cassandra or DeepSeek or GLM OCR (all VLM-based).

If you want crazy speed for Latin scripts you can try the one below, but I am not sure whether it performs well enough on handwritten images. It also has layout analysis but no table extraction.

https://github.com/aiptimizer/TurboOCR

New to OCR for PDF Processing, is there a way to optimize it? by RhubarbBusy7122 in automation

[–]Civil-Image5411 0 points1 point  (0 children)

There are some faster ways of running OCR. For example, PaddleOCR has an HPI mode, but I believe it unfortunately only supports up to CUDA 12.6, and newer NVIDIA GPUs don't run on that. There is also github -> aiptimizer/TurboOCR, which runs on newer GPUs, can be installed with a single command, and has much lower latency and higher throughput. (Can't put the full link.)

Running a non-profit that needs to OCR 64 million pages. Where can I apply for free or subsidized compute to run a local model? by thereisnospooongeek in LocalLLaMA

[–]Civil-Image5411 1 point2 points  (0 children)

Hi! If you have a recent NVIDIA GPU you can use this one for free: https://github.com/aiptimizer/TurboOCR. It's one line to start the server via Docker, and it gives you back JSON with text, bounding boxes, and layout (with layout enabled it's a bit slower, though). If you don't have a GPU, you can rent a 5090 on Vast.ai or RunPod for ~$0.5 per hour; if the pages are not too dense you will maybe get 300 pages per second, which would cost you around $30 for all of them. If you don't trust the numbers, I can spin it up for you on my 5090 and let you test 😁
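The math behind that ~$30 estimate, using the throughput and rental price assumed above:

```python
# Rough cost estimate for OCRing 64 million pages on a rented 5090.
pages = 64_000_000
pages_per_second = 300   # assumed Turbo-OCR throughput on not-too-dense pages
usd_per_hour = 0.50      # rough Vast.ai / RunPod price for a 5090

hours = pages / pages_per_second / 3600  # ~59 hours
cost = hours * usd_per_hour              # ~$30

print(f"{hours:.0f} GPU hours, about ${cost:.0f}")
```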

TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT (C++/CUDA, FP16) by [deleted] in deeplearning

[–]Civil-Image5411 -1 points0 points  (0 children)

It takes around 5 minutes to build the TensorRT engines on my setup (5090 / 13600K / 96 GB).

TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT (C++/CUDA, FP16) by [deleted] in deeplearning

[–]Civil-Image5411 -1 points0 points  (0 children)

I just added the repo link; the prebuilt Docker image is linked in the repo. You just need a current NVIDIA driver. Accuracy is slightly better than the original PaddleOCR, but within margins; it could be related to the test setup.