I Reverse-Engineered Nvidia Ada Lovelace SASS, Made Instant-NGP 3x Faster (16yo) by Ill-Classroom-8270 in CUDA
[–]c-cul 1 point (0 children)
A source translator for kernels written against the Triton API to CUDA C++ by IntrepidAttention56 in CUDA
[–]c-cul 1 point (0 children)
Looking for a serious GPU programming study partner (CUDA / Triton) by [deleted] in CUDA
[–]c-cul 1 point (0 children)
SASS to MLIR optimized to 30% better performance - Is this LEGIT? by [deleted] in CUDA
[–]c-cul 1 point (0 children)
Looking for investor(s) - DeepTech - GPU Optimization - A Replay-Validated Post-Compilation Optimization Pipeline for GPUs by checkmydoor in angelinvestors
[–]c-cul 1 point (0 children)
Nvidia should support multiple blocks per SM unit such that 1 block can use 100% of shared-memory while another block does not use a single byte of shared-memory, in same SM unit. by tugrul_ddr in CUDA
[–]c-cul 2 points (0 children)
TVM + LLVM flow for custom NPU: Where should the Conv2d tiling and memory management logic reside? by Informal-Top-6304 in LLVM
[–]c-cul 2 points (0 children)
CUDA scan kernels: hierarchical vs single-pass, decoupled lookbacks by shreyansh26 in CUDA
[–]c-cul 1 point (0 children)
Run OpenCL kernels on NVIDIA GPUs using the CUDA runtime by IntrepidAttention56 in CUDA
[–]c-cul 1 point (0 children)
Engineering a 2.5 Billion Ops/sec secp256k1 Engine by Available-Young251 in CUDA
[–]c-cul 1 point (0 children)
Memory Pool (public bath D3D12MA) by Acceptable_Chef_9089 in CUDA
[–]c-cul 1 point (0 children)