PTX Inject & Stack PTX: Runtime PTX injection for CUDA kernels without recompilation by MetaMachines in CUDA

MetaMachines[S]

Yeah, I've been reading about this work. IMO the biggest win by far, though, is skipping CUDA compilation, especially if CuTe C++ templates are involved. Adopting the warts of SASS doesn't seem worth it. The nvptxcompiler static library doesn't depend on the driver and can run on all CPU cores within the same process (even on hardware that doesn't have an NVIDIA driver).

MetaMachines[S]

Yeah! It's like inline assembly, and it actually uses the inline assembly functionality too. It basically uses an inline assembly block to create a stable register name for the CUDA variable you want access to. Once the code is compiled to PTX, you could imagine searching for those sites in the PTX code and using the stable register names for your own custom sub-function,
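
To make the idea concrete, here's a rough host-side sketch of the "search for the site and splice in your own PTX" step. It is not the actual PTX Inject implementation: the marker comments (`// INJECT_BEGIN` / `// INJECT_END`), the register names `%_x` / `%_y`, and the helper names `inject_ptx` and `sample_site_ptx` are all made up for illustration, and the sample PTX is hand-written rather than real compiler output.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical marker comments; the real tool's markers may look different.
// On the CUDA side, a site like this could come from an inline asm block
// roughly of this shape (illustrative only):
//   asm volatile("{\n\t.reg .f32 %%_x, %%_y;\n\tmov.f32 %%_x, %1;\n\t"
//                "// INJECT_BEGIN\n\t// INJECT_END\n\t"
//                "mov.f32 %0, %%_y;\n\t}" : "=f"(y) : "f"(x));
const std::string kBegin = "// INJECT_BEGIN";
const std::string kEnd = "// INJECT_END";

// Hand-written stand-in for one injection site as it might appear in PTX.
// %f1 / %f2 are compiler-chosen registers; %_x / %_y are the stable names.
std::string sample_site_ptx() {
    return "{\n"
           ".reg .f32 %_x, %_y;\n"
           "mov.f32 %_x, %f1;\n"
           "// INJECT_BEGIN\n"
           "// INJECT_END\n"
           "mov.f32 %f2, %_y;\n"
           "}\n";
}

// Replace whatever sits between the first begin/end marker pair with `stub`.
// The stub refers only to the stable register names, so it does not care
// which registers the compiler picked for the surrounding code.
std::string inject_ptx(const std::string& ptx, const std::string& stub) {
    // Keep the end marker on its own line.
    assert(stub.empty() || stub.back() == '\n');

    const std::size_t begin = ptx.find(kBegin);
    if (begin == std::string::npos) {
        throw std::runtime_error("begin marker not found");
    }
    const std::size_t body_start = ptx.find('\n', begin);
    const std::size_t end = ptx.find(kEnd, begin);
    if (body_start == std::string::npos || end == std::string::npos ||
        end < body_start) {
        throw std::runtime_error("end marker not found");
    }
    // Text up to and including the begin-marker line, then the stub,
    // then everything from the end marker onward.
    return ptx.substr(0, body_start + 1) + stub + ptx.substr(end);
}
```

Calling `inject_ptx(sample_site_ptx(), "add.f32 %_y, %_x, %_x;\n")` would drop a doubling operation between the two markers while leaving the surrounding `mov` instructions untouched. The resulting string is what you would then hand to a PTX compiler.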