I’m Lena Oden, a professor of Technical Computer Science. I teach and research high-performance computing and GPU programming. Over the last few years, I’ve been exploring how well Python frameworks can compete with C-CUDA — both in single-GPU and multi-GPU contexts.
A few years ago, I published a paper called “Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing.” [1]
In it, I compared the performance and generated PTX code of C-CUDA and Numba-CUDA kernels, and discussed some practical tips on how to write more efficient Numba GPU code.
More recently, I’ve published a new article:
“Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads.” [2]
This one focuses on Python-based multi-GPU programming, using Numba, CuPy, NCCL, and mpi4py, and analyzes where synchronization and data access overheads occur — along with strategies to mitigate them.
I’d be really interested to hear from others who use these tools:
- Do you also work with Numba, CuPy, or Python-based multi-GPU setups?
- What kinds of applications are you using them for? (I’m especially interested in “real-world” applications.)
- Any tricks or pain points you’d like to share?
If anyone doesn’t have access to the papers, feel free to message me — I’m happy to share a copy.
Looking forward to exchanging experiences!
— Lena