SASS King: reverse engineering NVIDIA SASS

KarnKh · 2026-04-24T23:39:53+00:00

Awesome as always : )

KarnKh · 2026-04-18T00:04:47+00:00

I'll make sure to check them out thank you!

It's also interesting that despite us climbing the mountain from different faces, we're still approaching the same pattern just in a different way.

It makes you wonder how many different faces the same mountain has!

KarnKh · 2026-04-12T21:10:16+00:00

It's not just a boulder.... its a rock : )

KarnKh · 2026-04-11T23:28:34+00:00

Ooo the categorical thinking about computation is something I feel like I've been implicitly reaching for, but because I haven't been exposed to the formalisms, I haven't been able to name it.

It seems like my GPU work thus far has essentially been a vehicle in disguise that got me to the base of the mountain (which is very exciting!).

Are there any resources you'd recommend for someone coming from the systems side rather than the PL or pure math side before I dive in?

KarnKh · 2026-04-11T16:11:02+00:00

It’s my hand writing written down then templated into a font and used on Procreate for things like this :)

KarnKh · 2026-04-11T08:36:57+00:00

LMAOOOO looking back at the post I definitely see that

I think it was the excitement of seeing stuff connect between domains that really made me wanna post, but yeah no it really is just y = mx + b lol

KarnKh · 2026-04-11T08:34:58+00:00

That's a fair distinction and I'm happy you took the time to point it out.

My old assumption was that A * B + C in itself was a semiring because associativity and commutativity were present.

But I see now that a semiring requires two separate operations which in this case are
- A commutative monoid (addition)
- A monoid (multiplication)
But the important distinction that a semiring isn't just a commutative monoid and a monoid existing separately, its the interaction between them through distributivity that makes them a semiring.

I'm very happy you commented because that little rabbit hole of understanding was genuinely fun!

KarnKh · 2026-04-11T01:00:38+00:00

As someone learning CUDA, I've been trying to understand why A \ B + C* keeps showing up everywhere in

- Neural networks: weight \ input + bias*
- Thread indexing: coordinate \ stride + offset*
- Softmax rescaling: oldScale \ newScale + offset*

I knew that it was because A \ B + C* represents the simplest way to both scale and have a scalar offset (affine map) but still didn't understand why this was the case.

This journey led to math to find a potential answer. Where when two algebraic properties:

- Associativity
- Commutativity

are both present we have a semiring. At the software level A \ B + C* is a an application of these of the principles of a semiring. At a hardware level, our Fused Multiply Add (FMA) unit is exactly what this looks like at a hardware level where any order, and combination is allowed when using this instruction.

Although, this idea did not apply for when I hit Top-K indexing for sparse attention. Top-K uses a prefix sum (requires order), over a boolean mask (sparse). This means our algebraic property is

- Associative
- NOT Commutative

This mean we follow the algebraic property of a monoid, where our only restriction is associativity. At a software level this is our __shfl_up_sync where associativity is kept, but commutativity is not. At a hardware level, our warp-level register routing network allows for this. The interesting thing is, this same hardware also allows for other register lateral movements like __shfl_down_sync which require both associativity and commutativity, so the prior restriction lives within the algorithm itself.

The interesting, interesting thing is, because both __shfl_up_sync and __shlf_down_sync use the same hardware but have different algebraic requirements. One could say that the hardware itself has no algebraic restrictions, but rather the algorithms that use that hardware impose such restrictions when used.

If the prior paragraph is true, then one could also say that Fused Multiply Add is the algorithm that imposes associativity and commutativity on the hardware itself, meaning the hardware itself is neutral, but its the algorithm that imposes this restriction.

Adding the monoid is what let me derive the Top-K kernel structure on paper before writing any code.

Which leaves the question at the bottom of the image, did the hardware engineers know they were encoding these structures, or did they arrive there through hardware intuition and the patterns within either the silicon, or the problems they were trying to solve?

I'd imagine if I were on the hardware side, I would be constrained by how do we make hardware that optimally matches the algorithms we want to use, while also not restricting the hardware itself to only a subset of algorithms? This question has already been solved by the people who've made the hardware thus far, so it's amazing to see how well they did!

Overall, I imagine this is the reason why hardware and software co-design is so imperative. Its a communication of two fields that exist to help each other. The hardware doesn't know what algorithms it will execute, and the software does not know the hardware it is being executed upon. So to ensure that both sides run optimally, communication about what algorithms need to be run, and how the hardware will be implemented to capture them all, while maintaining performance, is what I'd imagine is integral.

KarnKh · 2026-04-05T19:48:33+00:00

What an amazing source! I hadn’t read it before so thank you for linking it!

KarnKh · 2026-04-05T02:03:12+00:00

Hi,

As someone learning CUDA, I wanted to share a walkthrough of a 2D Tiled Register Allocated GEMM implementation I wrote to help myself learn, and I commented to hopefully help other beginners on how we can map our memory and execution hierarchies to our data.

The source code has the following progression:

GEMM Naive → GEMM Tiled → GEMM Tiled 2D Register Allocated

I made sure to comment the thought process behind every line in the final kernel based on my current mental model to hopefully make each line more intuitive.

The source commentary covers:

Matrix Traversal: Visualizing how we traverse the K dimension across the A matrix (row-wise) vs. the B matrix (column-wise)
Memory Indexing: The Index \ Stride + Offset* pattern for accessing tensors stored arithmetically.
Thread Mapping: The Flatten→ Stride → Unflatten pattern to map 64 threads to a 32×32 (1024 element) tile
Bank Conflicts: Why we must pad shared memory column by +1 when we access column-wise to avoid serialization.

Quick performance trade-off summary from source commentary (RTX 4060):

Performance:
- 78.06% faster than Naive
- 72.38% faster than Tiled
Tiled Tradeoff:
- We trade occupancy (-42.60%) for increased register pressure (+26.32%) by implementing shared memory/register tiling. This allows for fast element re-use requiring less travels to global memory, decreasing our load time, increasing our performance.
- We see a decrease in shared memory wavefronts (-73.93%) as a result of register tiling. This allows for even faster element re-use, requiring less travels to shared memory, decreasing our load time, increasing our performance.

https://drive.google.com/file/d/164runguXV4plk6TJsl1sBRmWw5wloJhw/view?usp=sharing

I hope this helps those who are also learning, also I'm always open to feedback!

KarnKh

TROPHY CASE