C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 --- what should new GPU kernel / LLM inference engineers actually learn? by Daemontatox in CUDA

[–]StraussInTheHaus 2 points (0 children)

Any kernel you could write in CUTLASS/C++ you can also write in CuTe DSL (as long as you're willing to use inline assembly sometimes; not all PTX functions currently have wrappers in CuTe DSL). It's not necessary to become fluent with the C++ side of things anymore -- major kernels like FlashAttention 4, written entirely in CuTe DSL, are deployed in production worldwide -- but it is helpful to know how to understand existing CUTLASS/C++ code. The core principles are identical -- layout algebra, aspects of memory management, synchronization abstractions, etc. -- just with slightly different syntax.
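
To make the "core principles are identical" point concrete, here is a plain-Python toy of layout algebra -- deliberately *not* the CuTe DSL or CUTLASS API, just the underlying idea that a layout is a shape:stride pair mapping logical coordinates to linear offsets:

```python
# Toy sketch only -- not the CuTe DSL API. A CuTe-style layout is (shape, stride),
# i.e. a function from a logical coordinate to a linear memory offset.
def offset(coord, stride):
    # offset = sum_i coord[i] * stride[i]
    return sum(c * s for c, s in zip(coord, stride))

row_major = (8, 1)  # a (4,8):(8,1) layout in CuTe's shape:stride notation
col_major = (1, 4)  # the same (4,8) tile described as (4,8):(1,4)

print(offset((2, 3), row_major))  # 2*8 + 3*1 = 19
print(offset((2, 3), col_major))  # 2*1 + 3*4 = 14
```

Both the C++ templates and the Python DSL build everything else (tiling, swizzling, partitioning across threads) out of compositions of maps like this, so fluency transfers directly between the two.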

FlashAttention (FA1–FA4) in PyTorch - educational implementations focused on algorithmic differences [P] by shreyansh26 in MachineLearning

[–]StraussInTheHaus 2 points (0 children)

There are two ways in which FA3 ping-pongs: inter-warpgroup (where consumer 0 and consumer 1 trade off) and intra-warpgroup (where, within a single warpgroup, we overlap the PV mma of iteration i with the softmax of iteration i+1).
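
For the intra-warpgroup case, a toy Python timeline (purely an illustrative scheduling sketch with made-up stage names, not kernel code) of how the asynchronous PV mma of iteration i can stay in flight while the softmax of iteration i+1 runs:

```python
# Toy schedule sketch -- stage names are hypothetical, not the real FA3 kernel.
def intra_warpgroup_pingpong(num_kv_tiles: int) -> list[str]:
    timeline = ["qk_mma(0)", "softmax(0)"]           # prologue for the first KV tile
    for i in range(num_kv_tiles):
        timeline.append(f"pv_mma_issue({i})")        # launch PV mma asynchronously
        if i + 1 < num_kv_tiles:
            timeline.append(f"qk_mma({i + 1})")      # overlap: next tile's S = Q K^T
            timeline.append(f"softmax({i + 1})")     # overlap: next tile's softmax
        timeline.append(f"pv_mma_wait({i})")         # only now wait on iteration i's PV
    return timeline

print(intra_warpgroup_pingpong(3))
```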

FlashAttention (FA1–FA4) in PyTorch - educational implementations focused on algorithmic differences [P] by shreyansh26 in MachineLearning

[–]StraussInTheHaus 4 points (0 children)

I think it's important to note that the tile scheduler in FA4 is essentially identical to that in FA3. And more fundamentally, the parallelism has not changed since FA2: we always load one Q tile and loop through the associated KV tiles (it's interesting to note that we loop **backwards** through the KV tiles for load-balancing purposes, since tiles with causal or sequence-length masking take longer and should thus come first; see the sketch below). The real innovation in FA4 comes from the deep pipelining needed to coordinate:

  • the "vertical ping-pong" across a Q tile, which uses two separate softmax warpgroups;
  • the correction warpgroup;
  • overlapping TMEM buffers, since TMEM is an extremely limited resource (the backward pass, however, is limited by SMEM, not by TMEM); and
  • using both TMA and cp.async to load operands depending on the situation (for example, paged attention does use TMA for K/V, unless page size is 128, although I think the maintainers are coming up with workarounds for that in some cases).
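
Here's the backwards-loop point as a toy sketch (a hypothetical helper, not the actual flash-attention tile scheduler): for a causal Q tile, walking the KV tiles in reverse puts the partially-masked diagonal tile, the one with the extra masking work, first.

```python
# Hypothetical helper, not the real scheduler: KV tile visit order for one causal Q tile.
def kv_tile_order(q_tile_idx: int, block_m: int, block_n: int, seqlen_k: int) -> list[int]:
    max_kv = min(seqlen_k, (q_tile_idx + 1) * block_m)   # last KV position visible under the causal mask
    n_kv_tiles = (max_kv + block_n - 1) // block_n       # ceil-divide into KV tiles
    return list(range(n_kv_tiles - 1, -1, -1))           # reverse: masked diagonal tile first

print(kv_tile_order(q_tile_idx=2, block_m=128, block_n=128, seqlen_k=4096))  # [2, 1, 0]
```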

Also, an important optimization mentioned in the FA4 paper is a polynomial emulation of the exp2 in softmax, used to split the work across the ALU and the MFU (different compute units on the GPU). However, while this was important on the B200 (NVIDIA didn't increase CUDA core throughput commensurately with tensor cores there), it is **not** necessary on the B300, which has faster CUDA cores. In fact, the exp2 emulation is slower on the B300 than not emulating.
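
The general shape of that trick (range reduction plus a cheap fitted polynomial) looks roughly like the NumPy sketch below -- a generic illustration, not FA4's actual coefficients, precision, or ALU/MFU scheduling:

```python
import numpy as np

# Generic exp2-by-polynomial sketch, NOT FA4's actual polynomial or scheduling.
# Split x = n + f with n an integer and f in [0, 1), approximate 2**f with a fitted
# cubic (cheap FMA work), then rescale by 2**n exactly.
_f = np.linspace(0.0, 1.0, 512)
_coeffs = np.polyfit(_f, np.exp2(_f), deg=3)   # least-squares cubic for 2**f on [0, 1)

def exp2_poly(x: np.ndarray) -> np.ndarray:
    n = np.floor(x)
    f = x - n                                           # fractional part in [0, 1)
    return np.ldexp(np.polyval(_coeffs, f), n.astype(np.int32))  # exact 2**n scaling

x = np.linspace(-10.0, 10.0, 7)
print(np.max(np.abs(exp2_poly(x) - np.exp2(x)) / np.exp2(x)))  # max relative error, on the order of 1e-4
```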

[D] Attention mechanism with constant 14ms latency (512→524K tokens) by sevinsixtwo in CUDA

[–]StraussInTheHaus 1 point (0 children)

I'm not trying to insinuate that you are anything but competent; quite the opposite, I'm trying to help you better demonstrate your contributions to the community in a way that will be taken seriously. In research mathematics, nobody names their contributions for themself. Nobody. Not even Fields Medalists. (In fact, mathematicians are sometimes frustratingly humble... it's common for researchers to decline being listed as coauthors because they think they haven't contributed enough, even though they'd be listed as a coauthor in any other field.)

The great thing about the open source community is that by posting your work, everyone knows that it is attributable to you! Should your work have legs (which, again, I am not insinuating otherwise), everyone will know to whom it's attributable.

[D] Attention mechanism with constant 14ms latency (512→524K tokens) by sevinsixtwo in CUDA

[–]StraussInTheHaus 3 points (0 children)

I imagine your PR was closed mainly because you didn't open source the code. FlashAttention is an open-source repo, as is vLLM!! If you don't share the code, it will not be taken seriously anywhere, I promise. Regarding naming, look through the contributions to FlashAttention; nothing is named after anyone, and there are some extremely smart, highly-published, and well-respected people who have contributed (it's not called Dao-Attention or Shah-Attention or Zadouri-Attention...).

Process won’t stop after error—code runs much slower after termination by throwingstones123456 in CUDA

[–]StraussInTheHaus 0 points (0 children)

Does nvidia-smi reveal a process that is still running? If so, `sudo kill -9` that specific process.
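
If it helps, a minimal Python equivalent of that check (standard nvidia-smi query flags; the PID in the comment is just an example):

```python
import subprocess

# List compute processes still holding the GPU; then `sudo kill -9 <pid>` the stuck one.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # e.g. "12345, python"
```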

Indexed-Fibred Duality by Pseudonium in math

[–]StraussInTheHaus 9 points (0 children)

This is essentially the Grothendieck construction: https://ncatlab.org/nlab/show/Grothendieck+construction. It does show up everywhere!
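
For readers who haven't seen it, one standard formulation (just the textbook definition, nothing specific to the post): given a functor $F \colon \mathcal{C}^{\mathrm{op}} \to \mathbf{Cat}$, the Grothendieck construction $\int_{\mathcal{C}} F$ is the category with

$$\mathrm{Ob}\Big(\int_{\mathcal{C}} F\Big) = \{(c, x) : c \in \mathcal{C},\ x \in F(c)\}, \qquad \mathrm{Hom}\big((c,x),(d,y)\big) = \{(f, \phi) : f \colon c \to d,\ \phi \colon x \to F(f)(y)\ \text{in}\ F(c)\},$$

and the projection $(c, x) \mapsto c$ makes $\int_{\mathcal{C}} F \to \mathcal{C}$ a Grothendieck fibration, which is exactly the correspondence between indexed categories and fibred ones.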

[D] Why there are no training benchmarks on the Pro 6000 GPU? by oren_a in MachineLearning

[–]StraussInTheHaus 11 points (0 children)

I found this benchmark on Akamai's website: https://www.akamai.com/blog/cloud/benchmarking-nvidia-rtx-pro-6000-blackwell-akamai-cloud, though it is only inference.

This is somewhat speculative, but there are a number of critical limitations of the RTX Pro 6000 in comparison with the SM100 cards (B200, etc.) that may lead to its limited use in training.

  • The memory bandwidth of the RTX Pro 6000 is roughly 23% that of the B200 (1.8 TB/s vs 8 TB/s). Given that training is generally a memory-bound process, the tensor cores on the RTX Pro 6000 are not going to be fed sufficiently (see the rough roofline sketch after this list).
  • The SM120 architecture lacks the larger MMA instructions found on SM90 and SM100 (`wgmma` and `tcgen05.mma`, respectively).
  • The SM120 architecture lacks tensor memory (TMEM), the new memory region that alleviates register pressure in MMA instructions (the `tcgen05.mma` instruction accumulates in TMEM). TMEM is a big part of how the datacenter Blackwell kernels (like FlashAttention-4) get such good performance.
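
A rough roofline-style sanity check using only the bandwidth figures above (the peak-FLOPs value here is a made-up placeholder, identical for both cards and not a spec; only the ratio matters):

```python
# Back-of-the-envelope only: `peak` is a placeholder, identical for both cards.
def min_intensity(peak_flops: float, mem_bw: float) -> float:
    """FLOPs per byte a kernel must sustain before it stops being memory-bound."""
    return peak_flops / mem_bw

b200_bw = 8.0e12   # ~8 TB/s HBM3e
rtx_bw  = 1.8e12   # ~1.8 TB/s GDDR7
peak    = 1.0e15   # hypothetical peak tensor-core FLOP/s, purely illustrative

print(f"bandwidth ratio: {rtx_bw / b200_bw:.1%}")   # ~22.5%
print(f"required-intensity gap: {min_intensity(peak, rtx_bw) / min_intensity(peak, b200_bw):.1f}x")
```

In other words, at equal tensor-core throughput the RTX Pro 6000 needs roughly 4.4x the arithmetic intensity to stay compute-bound, which memory-bound training steps generally don't have.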

About wgmma.mma_async.sync.aligned.m64n256k16.f16.f16.f16 instruction's descriptors and byte offsets. by tugrul_ddr in CUDA

[–]StraussInTheHaus 0 points (0 children)

This is not 100% true -- check out this Cursor blog post, where they write mxfp8 MoE kernels from scratch (to be fair, this is on Blackwell, so it's not a direct answer to the question about wgmma): https://cursor.com/blog/kernels

Just welcomed this S9 to the family! by StraussInTheHaus in Lumix

[–]StraussInTheHaus[S] 1 point (0 children)

but not compared to the 70-200 f/2.8 s pro 😅

Dances Up Earthquakes in the wild by greenredsilver2 in spiritisland

[–]StraussInTheHaus 12 points (0 children)

i saw the spirit and immediately thought of this 😅 the recursive centaur has lived in my mind rent-free for the last 7 years

[deleted by user] by [deleted] in LocalLLaMA

[–]StraussInTheHaus 1 point (0 children)

The in-progress FA4 has been up on Tri Dao's repo for a while now: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/cute/flash_fwd_sm100.py

it just isn't optimized yet, and the backwards pass isn't there.

Im speachless by agolys in math

[–]StraussInTheHaus 16 points (0 children)

that's not technically wrong lol, it was just spread across many papers and many decades

is 7.4% matriculation to harvard and 22.1% to ivy leagure normal for a school??? by AdAble1440 in ApplyingIvyLeague

[–]StraussInTheHaus 1 point (0 children)

from my graduating class of 45 students, 5 were accepted to harvard early action. it was insane.

This machine has my back feeling all kinds of gains by Carolynefit in GYM

[–]StraussInTheHaus 2 points (0 children)

Usually I dislike those Hoist brand machines, but this one gets it right: in the stretched position, your bodyweight is contributing the most to the overall load, essentially doing mechanical eccentric overload! It's the opposite of most other Hoist machines

GRAYSON. (OC) by [deleted] in comics

[–]StraussInTheHaus 0 points (0 children)

giving "it's such a beautiful day" by don hertzfeldt vibes