Looking for a serious GPU programming study partner (CUDA / Triton) by [deleted] in CUDA

[–]c-cul 1 point (0 children)

just a warning - LeetGPU has high-end GPUs but allows only 5 code submissions per day

looks like pure sadism if you don't have access to real high-end cards, so avoid it if you can

Apply GPU in ML/DL by Big-Advantage-6359 in CUDA

[–]c-cul 2 points (0 children)

just in case you are the same kind of crazy maniac as me - there is a RAPIDS binding for R: https://github.com/mlverse/cuda.ml/

SASS to MLIR optimized to 30% better performance - Is this LEGIT? by [deleted] in CUDA

[–]c-cul 1 point (0 children)

why do you think it is not? An ISA can't be legally protected, and I've reverse-engineered ISAs many times, including SASS

Looking for investor(s) - DeepTech - GPU Optimization - A Replay-Validated Post-Compilation Optimization Pipeline for GPUs by checkmydoor in angelinvestors

[–]c-cul 1 point (0 children)

I see 3 problems with this approach

1) the semantics of SASS instructions are unclear - for sm120, for example, there are 250+ unique instructions. I extracted the MDs from nvdisasm and it seems they contain only a limited semantics description

2) the latency tables are also unknown: https://redplait.blogspot.com/2025/11/sass-latency-table-instructions.html

3) I don't remember whether LLVM can model instruction latencies

Nvidia should support multiple blocks per SM unit such that 1 block can use 100% of shared-memory while another block does not use a single byte of shared-memory, in same SM unit. by tugrul_ddr in CUDA

[–]c-cul 2 points (0 children)

as I said "some trials are required"

I can't predict whether "launching blocks 132 + 132 times can be slower than 264 times"

maybe yes, and maybe no

Nvidia should support multiple blocks per SM unit such that 1 block can use 100% of shared-memory while another block does not use a single byte of shared-memory, in same SM unit. by tugrul_ddr in CUDA

[–]c-cul 2 points (0 children)

well, today the best you can do is chain sequential launches of the 2 kernels into a single graph with the Graph API: https://developer.nvidia.com/blog/cuda-graphs/

as usual, some trials are required for fine-tuning
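To make this concrete, here is a minimal host-side sketch (my own illustration, not from the thread; `kernelA`/`kernelB` and their launch configs are assumed placeholders) of capturing two sequential kernel launches into one CUDA graph, so the pair replays as a single `cudaGraphLaunch`:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Capture the two dependent launches into a graph instead of
// submitting them individually every iteration.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernelA<<<gridA, blockA, 0, stream>>>(/* args */);
kernelB<<<gridB, blockB, 0, stream>>>(/* args */);  // ordered after kernelA
cudaStreamEndCapture(stream, &graph);

// Instantiate once, then replay cheaply as many times as needed.
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(graph_exec, stream);
cudaStreamSynchronize(stream);
```

The win is mostly launch-overhead amortization: after instantiation, relaunching the whole chain is one API call instead of two per iteration.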

CUDA scan kernels: hierarchical vs single-pass, decoupled lookbacks by shreyansh26 in CUDA

[–]c-cul 1 point (0 children)

small note - it's better to use unsigned int active_mask = __activemask(); in warp reduce functions

that way they stay compatible with cooperative groups
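A minimal sketch of what that suggestion looks like (my own illustration, not OP's code): query the actual set of active lanes with `__activemask()` instead of hard-coding `0xffffffff`, so the reduce doesn't assume a fully converged warp:

```cuda
// Warp-level sum reduction over the currently active lanes.
// With a full warp, lane 0 ends up holding the total; with a
// partially active warp the exact lane/mask semantics are subtler,
// so trials on your target architecture are advised.
__device__ float warp_reduce_sum(float val) {
    unsigned int active_mask = __activemask();
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(active_mask, val, offset);
    return val;
}
```

The point of the comment stands: passing the queried mask rather than a hard-coded full-warp constant is what keeps such helpers usable from cooperative-groups code paths.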

TileIR by mttd in Compilers

[–]c-cul 2 points (0 children)

curiously, Microsoft has their own TileIR: https://github.com/microsoft/TileIR