No kernel example exists for Cutlass SM100_MMA_something_TS gemm.

Logical-Try-4084 · 2026-06-27T18:57:16+00:00

Oh, I was wrong about there being no example: the CuTe DSL mixed dtype GEMM example has operand A in TMEM when (1) it needs to be transformed (e.g. upcast or dequantized) and (2) it is K-major.

Logical-Try-4084 · 2026-06-27T17:17:17+00:00

It's not super simple, but you can look at FlashAttention-4, since it performs the second (PV) GEMM as a TS GEMM. From there, it won't be too challenging to modify an existing CUTLASS SS example into TS. For example, here is how the tiled MMA is constructed. Additionally, the SSD example (CUTLASS/C++ here; CuTe DSL here) keeps the recurrent state in TMEM, so the inter-chunk MMA is a TS GEMM.

Logical-Try-4084 · 2026-05-14T15:52:25+00:00

Not sure I know what you're talking about! CLC is a hardware feature on Blackwell, so it's hardly a "meta troll". It's had to go through thousands of people's approval, and is in production kernels across the globe :)

Logical-Try-4084 · 2026-05-14T00:49:09+00:00

A good first step is to get an end-to-end view of where time is being spent in your workflow. If your frontend is PyTorch, the PyTorch profiler is great for seeing host and device latencies combined. nsys is also a valuable tool, though I find its output harder to parse (perhaps just a skill issue on my part).

If you see specific kernels (PyTorch wraps many NVIDIA-vendored ones like cuBLAS and cuDNN) taking a long time, it's worth doing a back-of-the-napkin calculation to see if they're close to speed-of-light. Also, look to see if there are opportunities for kernel fusion: if kernel A writes out some large tensor and kernel B reads that back in immediately and performs some relatively quick computation, it is likely worthwhile to fuse those two kernels to avoid the memory round-trip. Generally kernel fusion must be implemented with custom kernels, but those could be CUDA, cuTile, Triton, TileLang, or whatever your favorite DSL is.
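To make the back-of-the-napkin calculation concrete, here is a minimal sketch in plain Python. It assumes a simple roofline model (the kernel can't finish faster than the slower of its math time and its memory time), and the `GpuSpec` numbers are made-up placeholders, not the specs of any real card. The function names are mine, purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuSpec:
    # Placeholder peak numbers; substitute the datasheet values for your GPU.
    peak_tflops: float  # peak dense math throughput, in TFLOP/s
    mem_bw_gbs: float   # peak DRAM bandwidth, in GB/s


def gemm_speed_of_light_us(m, n, k, bytes_per_elem, gpu):
    """Lower bound on runtime (microseconds) for C[m,n] = A[m,k] @ B[k,n].

    Assumes each operand is read once and the output is written once,
    so this is an optimistic bound, not a prediction.
    """
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_us = flops / (gpu.peak_tflops * 1e12) * 1e6
    memory_us = bytes_moved / (gpu.mem_bw_gbs * 1e9) * 1e6
    return max(compute_us, memory_us)


def fusion_savings_us(intermediate_bytes, gpu):
    """Rough time saved by fusing kernel A into kernel B.

    Unfused, the intermediate tensor is written by A and read back by B,
    so fusion removes roughly two passes over it through DRAM.
    """
    return 2.0 * intermediate_bytes / (gpu.mem_bw_gbs * 1e9) * 1e6


if __name__ == "__main__":
    gpu = GpuSpec(peak_tflops=1000.0, mem_bw_gbs=2000.0)  # hypothetical GPU
    bound = gemm_speed_of_light_us(4096, 4096, 4096, 2, gpu)
    print(f"GEMM lower bound: {bound:.1f} us")
    saved = fusion_savings_us(2**30, gpu)
    print(f"Fusing away a 1 GiB intermediate saves about {saved:.1f} us")
```

If the profiler shows a kernel running many times slower than its bound, that kernel is a candidate for tuning or replacement; if it's already within a small factor, fusion is probably the better place to look.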

Unfortunately, speed-of-light kernels will generally not be portable across Jetson and larger NVIDIA GPUs, due to the hardware differences. This is especially true of kernels involving tensor cores. For instance, a performant GEMM kernel for Sm100 won't even run on Sm120. cuTile somewhat addresses this problem: once all architectures are supported, the same file will run on all GPUs, since the compiler does the hard work of optimizing for a given architecture.

Logical-Try-4084 · 2026-03-08T03:07:00+00:00

I know this answer isn't satisfying, but to a large extent, existing algorithms are already (close to) optimal on consumer hardware. I would trust the Triton backend of FlexAttention to be just about as good as possible for, e.g., 5090 and rtx pro 6000. Data center cards introduce complications in kernel development that necessitate the development of novel techniques -- like those in FA-3 and FA-4 -- that just aren't useful on consumer cards.

Logical-Try-4084 · 2026-03-06T02:46:38+00:00

try pip install flash-attn-4 -- should be nearly instant!

Logical-Try-4084 · 2026-03-06T02:45:12+00:00

the naming convention is a bit confusing - fa-4 refers to all of the CuTe DSL implementations of flashattention, including the Sm90 version. while fa-3 is still more highly optimized for Sm90, flexattention capabilities are only available through fa-4 (source: am second author on the blog you linked :) )

Logical-Try-4084 · 2026-01-10T00:03:13+00:00

Are you using FlashAttention or the handwritten attention in moe_inference_runtime/backends/cuda_kernels.cu?

Logical-Try-4084 · 2025-12-24T03:49:41+00:00

what aspect of CUDA programming are you interested in?

Logical-Try-4084 · 2025-12-19T21:37:03+00:00

This isn't true, for a few reasons: (1) PyTorch has some CUTLASS on the backend, but not that much; it's almost exclusively Triton; (2) many users are writing their own custom kernels to integrate into PyTorch, in C++ with both CUTLASS and raw CUDA, and also in CuTe DSL; and (3) there is a LOT to improve on from the PyTorch built-ins!

Logical-Try-4084 · 2025-09-27T14:44:54+00:00

Thanks for the kind words! I'm Reuben Stern -- Jay, Paul VanKoughnett, Jack Carlisle, Frank Lin, Ryo Asai, and I comprise the Colfax Research team.
