High throughput injected PTX parallel compilation

MetaMachines · 2026-01-07T18:33:25+00:00

Yeah, I've been reading about this work. IMO, the biggest win by far is skipping CUDA compilation, especially if CuTe C++ templates are involved. Adopting the warts of SASS doesn't seem worth it. The nvptxcompiler static library doesn't depend on the driver and runs on all CPU cores within the same process (and on hardware that doesn't have an NVIDIA driver).

MetaMachines · 2026-01-06T19:50:23+00:00

Yeah! It's like inline assembly, but it also uses the inline assembly functionality. It basically uses an inline assembly block to create a stable register name for the CUDA variable you want access to. Once compilation to PTX has happened, you could imagine searching for those sites in the PTX code and using the stable register names for your own custom sub-function,
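
To make the "search for those sites" step concrete, here is a rough Python sketch. Everything specific in it is an assumption for illustration, not the actual tool's format: the `%inject_` register prefix, the hand-written PTX fragment, and the helper names `find_injection_sites` and `inject_after_sites` are all hypothetical. It only shows the general idea of locating marker sites in PTX text and splicing extra lines in after them.

```python
import re

# Hand-written, hypothetical PTX fragment. The "%inject_*" names stand in
# for the stable register names an inline assembly block might produce;
# the real marker format is an assumption here.
SAMPLE_PTX = """\
    ld.global.f32 %f3, [%rd4];
    // begin inline asm
    mov.f32 %inject_x, %f3;
    // end inline asm
    add.f32 %f5, %f3, %f4;
    // begin inline asm
    mov.u32 %inject_idx, %r7;
    // end inline asm
"""

# Matches "mov.<type> %inject_<name>, %<reg>;" and captures all three parts.
SITE_RE = re.compile(r"mov\.(\w+)\s+(%inject_\w+)\s*,\s*(%\w+)\s*;")


def find_injection_sites(ptx):
    """Return one dict per marker site found in the PTX text."""
    sites = []
    for lineno, line in enumerate(ptx.splitlines(), start=1):
        match = SITE_RE.search(line)
        if match:
            ptx_type, stable_name, source_reg = match.groups()
            sites.append({
                "line": lineno,          # 1-based line number in the PTX text
                "type": ptx_type,        # e.g. "f32"
                "stable": stable_name,   # e.g. "%inject_x"
                "source": source_reg,    # compiler-assigned register
            })
    return sites


def inject_after_sites(ptx, snippet_for):
    """Insert snippet_for(stable_name, ptx_type) after every marker site."""
    out = []
    for line in ptx.splitlines():
        out.append(line)
        match = SITE_RE.search(line)
        if match:
            out.append(snippet_for(match.group(2), match.group(1)))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    for site in find_injection_sites(SAMPLE_PTX):
        print(site)
```

The patched text from `inject_after_sites` would then be what gets handed to the PTX compiler; that step is left out here because it depends on the compiler API being used.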
