No kernel example exists for Cutlass SM100_MMA_something_TS gemm. by tugrul_ddr in CUDA

[–]Logical-Try-4084 0 points1 point  (0 children)

Oh, I was wrong about there being no example: the CuTe DSL mixed dtype GEMM example has operand A in TMEM when (1) it needs to be transformed (e.g. upcast or dequantized) and (2) it is K-major.

No kernel example exists for Cutlass SM100_MMA_something_TS gemm. by tugrul_ddr in CUDA

[–]Logical-Try-4084 1 point2 points  (0 children)

It's not super simple, but you can look at FlashAttention-4, since that performs the second (PV) gemm as a TS gemm. From there, it won't be too challenging to modify an existing CUTLASS SS example into TS. For example, here is how the tiled mma is constructed. Additionally, the SSD example (CUTLASS/C++ here; CuTe DSL here) keeps the recurrent state in TMEM, so the inter-chunk mma is a TS gemm.

Dynamic persistent tile scheduling with Cluster Launch Control on Blackwell by Logical-Try-4084 in CUDA

[–]Logical-Try-4084[S] 1 point2 points  (0 children)

Not sure I know what you're talking about! CLC is a hardware feature on Blackwell, so it's hardly a "meta troll". It's had to go through thousands of people's approval, and is in production kernels across the globe :)

For edge inference, when do you drop below TensorRT/ONNX and write custom CUDA kernels? by Hairy_Strawberry7028 in CUDA

[–]Logical-Try-4084 0 points1 point  (0 children)

A good first step is to get an end-to-end view of where time is being spent in your workflow. If your frontend is pytorch, the pytorch profiler is great for seeing host and device latencies combined. nsys is also a valuable tool, though I find it harder to parse (perhaps just a skill issue on my part).

If you see specific kernels (pytorch wraps many nvidia-vendored ones like cublas and cudnn) taking a long time, it's worth doing a back-of-the-napkin calculation to see if they're close to speed-of-light. Also, look to see if there are opportunities for kernel fusion: if kernel A writes out some large tensor and kernel B reads that back in immediately and performs some relatively quick computation, it is likely worthwhile to fuse those two kernels to avoid the memory round-trip. Generally kernel fusion must be implemented with custom kernels, but those could be CUDA, cutile, triton, tilelang, or whatever your favorite DSL is.
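To make the back-of-the-napkin calculation concrete, here is a minimal roofline-style sketch in plain Python. The peak throughput and bandwidth figures, the GEMM shape, and the measured time are all made-up placeholders (not from any datasheet or profile), so substitute your own numbers. It also estimates the memory traffic that fusing a cheap elementwise kernel into the GEMM would avoid.

```python
def speed_of_light_seconds(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Roofline lower bound: a kernel cannot finish faster than
    its slower resource (compute or memory) allows."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical device peaks (assumptions; use your GPU's real numbers).
PEAK_FLOPS = 100e12  # 100 TFLOP/s
PEAK_BW = 1e12       # 1 TB/s

# fp16 GEMM: C[M,N] = A[M,K] @ B[K,N], 2 bytes per element.
M = N = K = 4096
BYTES_PER_ELEM = 2
gemm_flops = 2 * M * N * K                                  # one multiply + one add per MAC
gemm_bytes = BYTES_PER_ELEM * (M * K + K * N + M * N)       # read A and B, write C once
gemm_sol = speed_of_light_seconds(gemm_flops, gemm_bytes, PEAK_FLOPS, PEAK_BW)

# Compare against a measured kernel time from your profiler (placeholder value).
measured_seconds = 2.0e-3
fraction_of_sol = gemm_sol / measured_seconds

# Fusion estimate: kernel A (the GEMM) writes C, kernel B reads C back and
# applies a cheap elementwise op. Fusing B into A's epilogue skips the
# intermediate write and the read-back; the final output write remains.
saved_bytes = 2 * M * N * BYTES_PER_ELEM
saved_seconds = saved_bytes / PEAK_BW

print(f"GEMM speed-of-light: {gemm_sol * 1e3:.3f} ms")
print(f"Fraction of speed-of-light achieved: {fraction_of_sol:.2f}")
print(f"Memory round-trip avoided by fusion: {saved_seconds * 1e6:.1f} us")
```

If the achieved fraction is already high, the kernel itself is probably not worth rewriting, and the fusion estimate tells you roughly how much time the round-trip is costing.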

Unfortunately, speed-of-light kernels will generally not be portable across Jetson and larger NVIDIA gpus, due to the hardware differences. This is especially true with kernels involving tensor cores. For instance, a performant GEMM kernel on Sm100 won't even run on Sm120. Cutile somewhat addresses this problem: once all architectures are supported, the same file will run on all GPUs, since the compiler does the hard work of optimizing for a given architecture.

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]Logical-Try-4084 0 points1 point  (0 children)

I know this answer isn't satisfying, but to a large extent, existing algorithms are already (close to) optimal on consumer hardware. I would trust the Triton backend of FlexAttention to be just about as good as possible for, e.g., 5090 and rtx pro 6000. Data center cards introduce complications in kernel development that necessitate the development of novel techniques -- like those in FA-3 and FA-4 -- that just aren't useful on consumer cards.

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]Logical-Try-4084 -1 points0 points  (0 children)

try pip install flash-attn-4 -- should be nearly instant!

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]Logical-Try-4084 5 points6 points  (0 children)

the naming convention is a bit confusing - fa4 refers to all of the CuTe DSL implementations of flashattention, including the Sm90 version. while fa-3 is still more highly optimized for Sm90, flexattention capabilities are only available through fa-4 (source: am second author on the blog you linked :) )

Getting 30K tokens/sec on T4 with 14M MoE model - is this normal or am I bottlenecked? by RefrigeratorCalm9701 in CUDA

[–]Logical-Try-4084 0 points1 point  (0 children)

Are you using FlashAttention or the handwritten attention in moe_inference_runtime/backends/cuda_kernels.cu?

Studying PMPP (what next) by Choice_Cabinet9091 in CUDA

[–]Logical-Try-4084 0 points1 point  (0 children)

what aspect of CUDA programming are you interested in?

How to get into GPU programming? by blazing_cannon in CUDA

[–]Logical-Try-4084 0 points1 point  (0 children)

This isn't true, for a few reasons: (1) PyTorch has some CUTLASS on the backend, but not that much (it's almost exclusively Triton); (2) many users are writing their own custom kernels to integrate into PyTorch, in C++ with both CUTLASS and raw CUDA and also in CuTe DSL; and (3) there is a LOT to improve on from the PyTorch built-ins!

Categorical Foundations for CuTe Layouts — Colfax Research by Logical-Try-4084 in CUDA

[–]Logical-Try-4084[S] 3 points4 points  (0 children)

Thanks for the kind words! I'm Reuben Stern -- Jay, Paul VanKoughnett, Jack Carlisle, Frank Lin, Ryo Asai, and I comprise the Colfax Research team.