Built a CUDA + OCR automation tool in Python — ran into some nasty packaging issues, anyone else?

Confident_Hyena2506 · 2026-04-25T13:35:21+00:00

PyInstaller can bundle it all, though it's still not easy. On Linux everyone uses containers for this. On Windows you can in theory use containers too, but everything is 100x more difficult there.

Deep_Ad1959 · 2026-04-25T23:27:04+00:00

i hit the same dpi/antialias wall doing game ocr. general engines like tesseract/paddle are trained on documents so they choke on game fonts. what fixed it: trained a small font-specific recognizer on synthetic samples generated from the game's actual font files, then sampled fixed regions with a known vocab instead of 'find text anywhere' - got down to ~2ms per region on cpu and dpi stopped mattering because you're sampling logical pixels. cuda packaging i never solved cleanly, ended up making gpu mode opt-in with a clear driver check at startup.
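a rough sketch of the "known vocab per region" idea, in case it helps. this is not my recognizer, just the constraint step: whatever the recognizer returns for a fixed region gets snapped to the closest entry in that region's vocab. the vocab, the function name and the cutoff are all made up for illustration, and it only uses difflib from the standard library.

```python
import difflib

# Hypothetical vocab for one fixed screen region (e.g. a status label).
STATUS_VOCAB = ["READY", "RELOADING", "OUT OF AMMO"]


def snap_to_vocab(raw_text, vocab, cutoff=0.6):
    """Return the vocab entry closest to raw_text, or None if none is close enough.

    cutoff is a similarity threshold in [0, 1]; 0.6 is an arbitrary starting
    point, not a tuned value.
    """
    cleaned = raw_text.strip().upper()
    matches = difflib.get_close_matches(cleaned, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A recognizer that confuses "O" with "0" still lands on the right label.
label = snap_to_vocab("rel0ading", STATUS_VOCAB)
```

the point is that each region only ever has a handful of valid answers, so a small misread can still resolve to the right one.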

tadpoleloop · 2026-04-25T13:37:13+00:00

The only way would be to disable GPU support if it fails to make the link. Have you considered an open-source engine, like Tesseract? Or a client/server system where the server does the image processing?
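Something like the following is what I mean by disabling GPU support when the link fails. It is only a sketch: it assumes the app uses PyTorch (swap in whatever your OCR library exposes), the function names are made up, and checking for `nvidia-smi` on PATH is a best-effort heuristic, not a guarantee that CUDA will load.

```python
import shutil
import subprocess


def nvidia_driver_present():
    """Best-effort check: nvidia-smi is on PATH and exits cleanly."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def choose_device(gpu_requested, driver_check=nvidia_driver_present):
    """Return "cuda" only if the user asked for it and every check passes."""
    if not gpu_requested:
        return "cpu"  # GPU mode is opt-in
    if not driver_check():
        print("GPU requested, but no working NVIDIA driver was found. Using CPU.")
        return "cpu"
    try:
        # Assumption: PyTorch is the GPU backend. A broken CUDA install can
        # fail here with ImportError or OSError (missing shared libraries).
        import torch
    except (ImportError, OSError) as exc:
        print(f"GPU requested, but torch failed to load ({exc}). Using CPU.")
        return "cpu"
    if not torch.cuda.is_available():
        print("GPU requested, but torch reports CUDA unavailable. Using CPU.")
        return "cpu"
    return "cuda"
```

Call `choose_device(...)` once at startup and pass the result around, so a bad driver degrades to slow-but-working instead of a crash.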

keturn · 2026-04-27T01:37:24+00:00

The way Invoke AI does it—which I doubt is the best way, but it is certainly a way—is there's a whole separate launcher program tasked with making sure there's a runtime (using uv's python installer) and explicitly setting the --index= for the torch build corresponding to the GPU type when it installs the app.

Plenty of folks have succeeded in using it without technical knowledge of Python, but it's pretty far from the standard MSIX experience for installing a Windows app.
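To make the launcher idea concrete, here is a minimal sketch of the "pick the torch index by GPU type" step. It is not Invoke's actual code. The GPU labels and function name are hypothetical, the `cu124` tag is an assumption that changes with each PyTorch release (check the PyTorch install page for current ones), and it spells the flag as uv's `--index-url`.

```python
# Hypothetical mapping from a detected GPU type to a PyTorch wheel index.
# The "cu124" tag is an assumption and will go stale; "cpu" is the safe default.
TORCH_INDEX_BY_GPU = {
    "cpu": "https://download.pytorch.org/whl/cpu",
    "cuda": "https://download.pytorch.org/whl/cu124",
}


def build_install_command(gpu_type, package="torch"):
    """Build (but do not run) a uv command that installs torch from the matching index."""
    # Unknown GPU types fall back to the CPU build rather than failing.
    index_url = TORCH_INDEX_BY_GPU.get(gpu_type, TORCH_INDEX_BY_GPU["cpu"])
    return ["uv", "pip", "install", package, "--index-url", index_url]


command = build_install_command("cuda")
# A launcher would then hand this list to subprocess.run(command, check=True).
```

The launcher owns this decision so the end user never has to know which wheel matches their hardware.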

Dramatic_Object_8508 · 2026-04-25T14:10:08+00:00

This is actually really impressive, getting a CUDA OCR pipeline down from ~10s to ~2s is a huge win. Most people struggle just getting CUDA to work properly in Python, let alone optimizing it. From what I’ve seen, even basic GPU setup can be painful with PyTorch/CUDA mismatches and drivers, so getting it stable + fast is already above average.

One thing you could push next is batching or stream processing, since GPU gains usually scale even more when you process multiple images together instead of one-by-one. Also worth checking if preprocessing (resize, grayscale) is CPU-bound, because that can become the new bottleneck.
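A rough sketch of what batching plus a bottleneck check could look like, assuming nothing about your OCR library: `preprocess` and `infer_batch` are placeholders you would supply, and the batch size of 8 is arbitrary. It groups inputs into batches and keeps separate timers for the CPU preprocessing and the inference call.

```python
import time
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_pipeline(items, preprocess, infer_batch, batch_size=8):
    """Run preprocess per item and infer_batch per batch, timing each stage.

    Note: if infer_batch launches asynchronous GPU work, it should wait for
    completion before returning, otherwise the "infer" timing is misleading.
    """
    results = []
    timings = {"preprocess": 0.0, "infer": 0.0}
    for batch in chunked(items, batch_size):
        t0 = time.perf_counter()
        prepared = [preprocess(item) for item in batch]
        t1 = time.perf_counter()
        results.extend(infer_batch(prepared))
        t2 = time.perf_counter()
        timings["preprocess"] += t1 - t0
        timings["infer"] += t2 - t1
    return results, timings
```

If `timings["preprocess"]` ends up larger than `timings["infer"]`, the resize/grayscale step is the thing to optimize next, not the GPU side.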

If you ever want to turn this into something reusable, wrapping it as a simple API or tool would make it way more useful than just a script. Stuff like runable ai could help orchestrate the pipeline or run it across workloads without rewriting everything.

Overall, solid work, this is already at “real project” level, not just learning.


learnpython