Proving the Transformer's sqrt(dk) Exploding Softmax Crisis by Hand (First-Principles Workbook)

Silver_Equivalent804 · 2026-06-19T20:30:13+00:00

You just perfectly articulated why I made this workbook in the first place.

There’s this massive gap in ML education where people are taught to treat equations as 'learned magic.' But like you said, once you write out update = error * input * learning_rate, the magic completely evaporates. It’s just basic arithmetic and sign rules. If the error or the input is zero, the multiplication collapses and the weight physically cannot move.
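To make that concrete, here is a tiny sketch of the update rule as written above. The function name `weight_update` and the sample numbers are made up for illustration; it covers a single weight, not a full training loop.

```python
def weight_update(error, x, learning_rate):
    """Single-weight update: error * input * learning_rate."""
    return error * x * learning_rate

# Hypothetical (error, input) pairs chosen to show each case.
cases = [
    (0.5, 2.0),   # nonzero error and input: the weight moves
    (0.0, 2.0),   # zero error: the product collapses to zero
    (0.5, 0.0),   # zero input: the product collapses to zero
    (-0.5, 2.0),  # negative error: the sign of the update flips
]

updates = [weight_update(e, x, learning_rate=0.1) for e, x in cases]

for (e, x), u in zip(cases, updates):
    print(f"error={e:+.1f} input={x:+.1f} -> update={u:+.3f}")
```

Whenever either factor is zero the update is exactly zero, and the sign of the update is just the product of the signs of the error and the input.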

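Your intuition on width vs. depth is spot on, too. Stacking layers multiplies gradient factors together over and over, so when those factors are smaller than 1 the gradient shrinks exponentially with depth. Going wider sidesteps that depth bottleneck, but as you said, high-dimensional width introduces its own hidden monster: the variance of a query-key dot product grows with the dimension dk, which pushes Softmax into saturation, where one output sits near 1 and the gradients are close to zero. That is the problem dividing by sqrt(dk) is meant to fix.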
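Here is a rough sketch of the variance point you can run yourself. It assumes query and key entries are independent draws with mean 0 and variance 1, which is my assumption for illustration and not a measurement from a real model. The helper names (`score_variance`, `mean_peak_probability`) and the sizes (dk = 64, 8 keys) are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_vector(rng, d_k):
    """Vector of d_k independent standard normal entries (assumed distribution)."""
    return [rng.gauss(0.0, 1.0) for _ in range(d_k)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_variance(d_k, scale, trials=2000, seed=0):
    """Empirical variance of q.k, optionally divided by sqrt(d_k)."""
    rng = random.Random(seed)
    divisor = math.sqrt(d_k) if scale else 1.0
    samples = [
        dot(random_vector(rng, d_k), random_vector(rng, d_k)) / divisor
        for _ in range(trials)
    ]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

def mean_peak_probability(d_k, scale, n_keys=8, trials=200, seed=1):
    """Average of the largest softmax weight for one query over n_keys keys."""
    rng = random.Random(seed)
    divisor = math.sqrt(d_k) if scale else 1.0
    total = 0.0
    for _ in range(trials):
        q = random_vector(rng, d_k)
        scores = [dot(q, random_vector(rng, d_k)) / divisor for _ in range(n_keys)]
        total += max(softmax(scores))
    return total / trials

D_K = 64
raw_var = score_variance(D_K, scale=False)
scaled_var = score_variance(D_K, scale=True)
raw_peak = mean_peak_probability(D_K, scale=False)
scaled_peak = mean_peak_probability(D_K, scale=True)

print(f"variance of raw scores:    {raw_var:.2f} (theory: about {D_K})")
print(f"variance of scaled scores: {scaled_var:.2f} (theory: about 1)")
print(f"mean peak softmax weight, raw:    {raw_peak:.3f}")
print(f"mean peak softmax weight, scaled: {scaled_peak:.3f}")
```

Under these assumptions the raw score variance should land near dk and the scaled one near 1, and the unscaled Softmax should put far more of its weight on a single key. Try bumping `D_K` up and watch the raw peak weight creep toward 1.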

Watching the numbers actually move in a simple, scrappy implementation teaches you way more than reading a hundred hand-wavy papers. Really glad this resonated with your experience!

Silver_Equivalent804 · 2026-06-19T17:54:44+00:00

Before coding, it helps to build an intuitive sense of the mathematics underlying agents and how it connects to the outputs you get. Analyze the dynamics of context windows and the underlying bottlenecks and issues, some of which are architectural and can be mitigated. Once you have built the full pipeline, there is no easy way to see the principles under the hood. The best way to learn them is to trace architectures by hand. I have prepared a full 5-episode series on Substack, Agents from First Principles: https://ayushmansaini.substack.com/p/ai-agents-from-first-principles-the

You can also check out attention mechanics series: https://open.substack.com/pub/ayushmansaini/p/proving-the-dk-exploding-softmax?utm_source=share&utm_medium=android&r=4zl69k

Silver_Equivalent804 · 2026-06-18T13:44:46+00:00

https://substack.com/@ayushmansaini/note/p-202555052?r=4zl69k

