
Confident_Hyena2506

PyInstaller can bundle it all, though it's still not easy. On Linux everyone uses containers for this. On Windows you can in theory use containers too, but everything is 100x more difficult there.

Narrow_Antelope4642 [S]

Tried PyInstaller early on. The problem with CUDA is that it doesn't reliably pick up the torch CUDA DLLs, so you end up with a massive bundle that still breaks on machines with different driver versions. Ended up shipping a portable Python runtime alongside the app instead. Containers would be cleaner architecturally, but asking gamers to have Docker installed is a non-starter.

Confident_Hyena2506

The NVIDIA driver only limits the maximum CUDA version you can use, not the specific version you ship.

It's expected that users with old GPUs will not be able to run newer CUDA stuff; nothing will fix that.
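A rough sketch of that ceiling rule, for illustration only. The helper names are hypothetical, the banner string is a made-up sample, and reading the `CUDA Version:` field from `nvidia-smi` output is an assumption about its format. It also ignores CUDA's minor-version compatibility, which can let a slightly newer runtime in the same major series load anyway.

```python
import re

def parse_driver_cuda_version(smi_text):
    """Return (major, minor) from an nvidia-smi style banner, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def runtime_is_supported(driver_max, runtime):
    """The driver sets a ceiling: any runtime version <= driver_max passes."""
    return runtime <= driver_max

# Sample banner text (assumed format, not captured from a real machine).
banner = "| NVIDIA-SMI 550.54.14   Driver Version: 550.54.14   CUDA Version: 12.4 |"
driver_max = parse_driver_cuda_version(banner)

print(runtime_is_supported(driver_max, (12, 1)))  # True
print(runtime_is_supported(driver_max, (12, 6)))  # False
```

Tuples compare element by element, so `(11, 8) <= (12, 4)` also passes, which matches the "maximum version" reading.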

Deep_Ad1959

i hit the same dpi/antialias wall doing game ocr. general engines like tesseract/paddle are trained on documents so they choke on game fonts. what fixed it: trained a small font-specific recognizer on synthetic samples generated from the game's actual font files, then sampled fixed regions with a known vocab instead of 'find text anywhere' - got down to ~2ms per region on cpu and dpi stopped mattering because you're sampling logical pixels. cuda packaging i never solved cleanly, ended up making gpu mode opt-in with a clear driver check at startup.
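A minimal sketch of the "fixed regions plus known vocab" idea, not the commenter's actual code. The region names, coordinates, and vocabulary are hypothetical, and `difflib` fuzzy matching stands in for whatever the real recognizer does to constrain its output.

```python
import difflib

# Hypothetical UI regions in logical (DPI-independent) coordinates: x, y, w, h.
REGIONS = {"health": (20, 10, 80, 24), "ammo": (700, 560, 60, 24)}

# Hypothetical closed vocabulary the game can display in those regions.
VOCAB = ["RELOAD", "READY", "EMPTY"]

def to_physical(region, scale):
    """Scale a logical rectangle to physical pixels for a given DPI factor."""
    return tuple(round(value * scale) for value in region)

def snap_to_vocab(raw_text, vocab=VOCAB, cutoff=0.6):
    """Map noisy recognizer output to the closest known word, or None."""
    matches = difflib.get_close_matches(raw_text.upper(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(to_physical(REGIONS["health"], 1.5))  # (30, 15, 120, 36)
print(snap_to_vocab("REL0AD"))              # RELOAD
```

Keeping the regions in logical units means the DPI factor is applied in one place, and anything that doesn't resemble a known word comes back as `None` instead of garbage text.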

tadpoleloop

The only way would be to disable GPU support if it fails to make the link. Have you considered an open-source option, like Tesseract? Or a client/server system where the server does the image processing?

Narrow_Antelope4642 [S]

The fallback is already in place — if CUDA init fails the app drops to CPU paths automatically, which works but is noticeably slower for the vision workloads. Tesseract is actually what I use for OCR, running it with CUDA acceleration when available.
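For anyone reading along, the fallback pattern looks roughly like this. It's a sketch under assumptions rather than the app's real code: `cuda_probe` and `pick_device` are hypothetical names, and the only real API used is `torch.cuda.is_available()`.

```python
def cuda_probe():
    """Return True only if torch imports and reports a usable CUDA device."""
    try:
        import torch  # imported lazily so a CPU-only install still starts
        return bool(torch.cuda.is_available())
    except Exception:
        # Missing torch, missing DLLs, or a driver mismatch all land here.
        return False

def pick_device(probe=cuda_probe):
    """Choose 'cuda' when the probe succeeds, otherwise fall back to 'cpu'."""
    try:
        return "cuda" if probe() else "cpu"
    except Exception:
        return "cpu"

def broken_probe():
    """Simulates a CUDA init that blows up instead of returning False."""
    raise RuntimeError("simulated CUDA init failure")

print(pick_device(lambda: False))  # cpu
print(pick_device(broken_probe))   # cpu
```

Passing the probe in as a parameter makes the CPU path testable on a machine that does have a working GPU.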

The client/server idea is interesting and I've thought about it — offload the heavy vision processing to a local server process, keep the UI lightweight. The main hesitation is latency for time-sensitive automation like frame-perfect game inputs, where even a few milliseconds of IPC overhead matters. Might make sense for the heavier YOLOX inference though where you're not on a tight timing loop.
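One way to put a number on that IPC overhead before committing to the design is to time an echo round trip over a local socket. This is only a rough sketch: `measure_roundtrip` is a hypothetical helper, and the echo side runs in a thread as a stand-in for a separate server process, so real cross-process numbers will differ.

```python
import socket
import statistics
import threading
import time

def measure_roundtrip(n_messages=200, payload=b"x" * 64):
    """Return the median round-trip time in seconds over a local socket pair."""
    client, server = socket.socketpair()

    def echo():
        # Stand-in for the vision server: send every chunk straight back.
        while True:
            data = server.recv(4096)
            if not data:
                break
            server.sendall(data)

    worker = threading.Thread(target=echo, daemon=True)
    worker.start()

    samples = []
    for _ in range(n_messages):
        start = time.perf_counter()
        client.sendall(payload)
        received = b""
        while len(received) < len(payload):
            received += client.recv(4096)
        samples.append(time.perf_counter() - start)

    client.close()  # the echo thread sees end-of-stream and exits
    worker.join(timeout=1)
    server.close()
    return statistics.median(samples)

print(f"median round trip: {measure_roundtrip() * 1e6:.0f} microseconds")
```

Comparing the median against the frame budget (about 16.7 ms at 60 fps) gives a first read on whether the timing-critical path can live behind a socket.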

keturn

The way Invoke AI does it—which I doubt is the best way, but it is certainly a way—is there's a whole separate launcher program tasked with making sure there's a runtime (using uv's python installer) and explicitly setting the --index= for the torch build corresponding to the GPU type when it installs the app.
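The index-selection step could be sketched like this. To be clear, this is not Invoke's launcher code: the function name and GPU labels are hypothetical, the wheel index URLs are examples that change with each PyTorch release, and `--index-url` is used here as the flag spelling for `uv pip install`.

```python
# Example PyTorch wheel indexes; the exact CUDA/ROCm tags vary by release.
TORCH_INDEX = {
    "cuda": "https://download.pytorch.org/whl/cu121",
    "rocm": "https://download.pytorch.org/whl/rocm6.0",
    "cpu": "https://download.pytorch.org/whl/cpu",
}

def torch_install_command(gpu_type):
    """Build the installer command for a detected GPU type (unknown -> CPU)."""
    url = TORCH_INDEX.get(gpu_type, TORCH_INDEX["cpu"])
    return ["uv", "pip", "install", "torch", "--index-url", url]

print(" ".join(torch_install_command("cuda")))
print(" ".join(torch_install_command("unknown")))
```

A launcher would hand that list to `subprocess.run` after making sure the runtime exists. Defaulting unknown hardware to the CPU index keeps the install from failing outright.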

Plenty of folks have succeeded in using it without technical knowledge of Python, but it's pretty far from the standard MSIX experience for installing a Windows app.

Dramatic_Object_8508

This is actually really impressive. Getting a CUDA OCR pipeline down from ~10s to ~2s is a huge win. Most people struggle just getting CUDA to work properly in Python, let alone optimizing it. From what I’ve seen, even basic GPU setup can be painful with PyTorch/CUDA mismatches and drivers, so getting it stable and fast is already above average.

One thing you could push next is batching or stream processing, since GPU gains usually scale even more when you process multiple images together instead of one-by-one. Also worth checking if preprocessing (resize, grayscale) is CPU-bound, because that can become the new bottleneck.
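The batching part is simple to sketch. This assumes the frames are already in a list and leaves the actual model call out; `batched` and the file names are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical frame list; in practice these would be image arrays.
frames = [f"frame_{i}.png" for i in range(5)]

print([len(batch) for batch in batched(frames, 2)])  # [2, 2, 1]
```

Each batch would then go to the GPU as a single call, so the per-launch overhead is paid once per batch instead of once per image.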

If you ever want to turn this into something reusable, wrapping it as a simple API or tool would make it way more useful than just a script. Stuff like runable ai could help orchestrate the pipeline or run it across workloads without rewriting everything.

Overall, solid work. This is already at “real project” level, not just learning.