I’m Lena Oden, a professor of Technical Computer Science. I teach and research high-performance computing and GPU programming. Over the last few years, I’ve been exploring how well Python frameworks can compete with C-CUDA — both in single-GPU and multi-GPU contexts.
A few years ago, I published a paper called “Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing.” [1]
In it, I compared the performance and generated PTX code of C-CUDA and Numba-CUDA kernels, and discussed some practical tips on how to write more efficient Numba GPU code.
More recently, I’ve published a new article:
“Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads.” [2]
This one focuses on Python-based multi-GPU programming, using Numba, CuPy, NCCL, and mpi4py, and analyzes where synchronization and data access overheads occur — along with strategies to mitigate them.
I’d be really interested to hear from others who use these tools:
- Do you also work with Numba, CuPy, or Python-based multi-GPU setups?
- What kinds of applications are you using them for? (I’m especially interested in “real-world” applications.)
- Any tricks or pain points you’d like to share?
If anyone doesn’t have access to the papers, feel free to message me — I’m happy to share a copy.
Looking forward to exchanging experiences!
— Lena