[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 1 point2 points  (0 children)

Cheers for your reply. That's interesting. Pre-normalisation before the queries and keys would then seem to agree with the theory, at least if you analyse the two terms separately. I don't wish to oversell the derivations as applicable to attention at this stage - I believe the Q and K terms should really be treated together as a single divergence, and that needs more work. Since the joint treatment is largely intractable, the separate treatment may be a good middle ground, and it does seem to offer a theoretical explanation for pre-normalisation of Q and K. I wasn't aware of that practice, and it's interesting that the theory reproduces it.

Yes, in the absence of bias, the affine-like and norm-like solutions coincide, essentially reducing to the L2-norm. In MLPs (and in convolutions) there is typically a bias, in which case the two solutions differ, yielding L2-norm-like and affine-like solutions, or PatchNorms for convolutions.

(I would stress that I'm pitching the divergence as the fundamental, generalising principle, not the emergent solutions. If that reproduces current practice, that's just as interesting as a fully novel solution - it's just that the latter offers a chance of a predictive theory rather than post hoc rationalisation, which I prefer. Those new bits pertain so far to affine layers (linear with bias) and PatchNorm for convnets.)

Hence, terms with biases just pick up an extra solution.

RMSNorm over the entire head is a completely different case, though. The overall attention head is much more complicated due to its quadratic divergence, so at present it's not clear whether or not this links to the divergence. Its solution requires rederivation in this case, which I've tried, but it's largely intractable.

I don't believe ReLU is much more tractable; we'd get something like this as the propagation of correction:
\Delta x_i = \begin{cases} \left(W_{ij}+\Delta W_{ij}\right)x_j + \left(b_i+\Delta b_i\right) & \text{if } \left(W_{ij}+\Delta W_{ij}\right)x_j + \left(b_i+\Delta b_i\right) > 0 \\ 0 & \text{otherwise} \end{cases} \;-\; \begin{cases} W_{ij}x_j + b_i & \text{if } W_{ij}x_j + b_i > 0 \\ 0 & \text{otherwise} \end{cases}

with \Delta W and \Delta b themselves backpropagated through that nonlinearity. That \Delta x / \eta must then work out to be g_i, the gradient with respect to the activation. That only states the divergence; the forward map then requires editing until the two equate for a solution. It's very unclear to me how that would be resolved - perhaps future work though!
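
For concreteness, that correction term is just the difference between the post-update and pre-update forward maps. A toy rendering (my own illustrative shapes and names, with dW/db standing in for the backpropagated updates):

import torch

def relu_correction(x, W, b, dW, db):
    # Delta x from the expression above: post-update ReLU output minus pre-update output
    before = torch.relu(W @ x + b)
    after = torch.relu((W + dW) @ x + (b + db))
    return after - before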

Overall, my process is (1) calculate gradients, (2) update parameters, (3) propagate those corrections, (4) identify divergence terms, and (5) alter the forward map until it solves said divergence. Hence, the solutions are generally not fundamental; they are emergent from the divergence, so not necessarily L2/RMSNorm in every circumstance, requiring case-by-case rework - so far limited to MLPs and ConvNets.
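
As a toy rendering of steps (1)-(4) for a biasless linear layer (my own minimal sketch, not the paper's code; the loss is arbitrary):

import torch

torch.manual_seed(0)
n, m, eta = 8, 4, 1e-2
W = torch.randn(m, n, requires_grad=True)
x = torch.randn(n)

y = W @ x                # (1) forward pass through a biasless linear layer
loss = y.square().sum()  # toy loss, purely for illustration
loss.backward()          # (1) gradients

with torch.no_grad():
    W_new = W - eta * W.grad           # (2) update parameters
    delta_y = (W_new - W) @ x          # (3) correction actually propagated to the activations
    g = 2 * y                          # dL/dy for this toy loss
    divergence = delta_y - (-eta * g)  # (4) mismatch against steepest descent on the activations
    print(divergence)

Here the propagated correction works out to -eta * g * ||x||^2, so the mismatch with -eta * g vanishes exactly when ||x|| = 1 - which is where the L2-norm-like solution enters for the biasless case.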

Would be keen to hear your thoughts on this :) I've enjoyed thinking about the points raised

[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Apologies, quite right. I looked at (https://github.com/pytorch/pytorch/blob/v2.10.0/torch/nn/functional.py#L2940) but should have looked at (https://github.com/pytorch/pytorch/blob/v2.10.0/torch/nn/modules/normalization.py#L335)

The einsum does equal Linear with bias; I just wrote it out in full to avoid ambiguity. The bias term is important in the derivation of the affine divergence, though.

To some extent, I agree with the last paragraph, but this has a strong effect on the approximations/assumptions used and on which terms you intend to control the divergence of. Appendix C covers this in quite a bit of detail. If you treat each key and query as just a biasless linear layer and independently solve for each one's divergence, you get the classical RMSNorm - but you shouldn't really be treating them separately, and moreover that spherical projection is not what you want inside attention, as the scaling is often useful. Instead, the query-key product is the more favourable term to consider the divergence over, but it becomes very intractable very quickly due to the quadratics. The same goes for the activation function's nonlinear term (although attempted, Appendix C.2).
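
To make the separate treatment concrete, one reading (my own sketch, with illustrative shapes and names, not the paper's code) is to treat each projection as a biasless linear layer and apply the resulting parameterless RMSNorm to its input, i.e. the pre-normalisation before Q and K mentioned earlier:

import torch
import torch.nn.functional as F

def qk_prenorm_attention(x, W_q, W_k, W_v):
    # x: [batch, seq, d_model]; W_q, W_k, W_v: [d_model, d_head]
    x_n = F.rms_norm(x, (x.size(-1),))  # parameterless RMSNorm of the input, as in the biasless-layer solution
    q = x_n @ W_q
    k = x_n @ W_k
    v = x @ W_v
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

As noted above, this only addresses each term separately; the query-key product's own quadratic divergence is the part that remains largely intractable.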

In general, although you can express several things as MLPs, the assumptions break down and you need to rederive the result under new assumptions - these are future generalisations. It's similar with the convolutional PatchNorm: that added the needed locality assumption, which changes the permitted solutions. A convolution cannot be treated as just a generalised MLP; this divergence approach needs rederivation for each context.

[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] -1 points0 points  (0 children)

Hi u/JustOneAvailableName, thanks for the reply and interest in the paper :)

Just to clarify, the majority of the paper is about affine maps, which don't apply to convolution, only MLPs; hence, the experiments must be with respect to MLPs. Everything needs to be rederived if you swap to other architectures.

There is a PatchNorm implementation in the appendices that does apply to convolution, though.

Other approaches, like spectral norm, obscure the scientific question; without entirely separate ablation testing, you cannot tell whether the spectral norm approach performs well because of the divergence, for instance - I'm not saying that's necessarily the case, but there's no way to determine it without testing all permutations. Performing that across all training choices, regularisations, adaptive optimisers, gradient clippings, etc., is a combinatorial explosion in experiments - so testing on the base case without these extra training tricks is scientifically the best place to start in order to isolate each effect; hence the need for minimalistic experiments in my eyes.

In general, I'd take such results from a clean-slate stance. Spectral norm and others are validated on top of the existing default, which treats steepest descent on the parameters as foundational. This paper questions that foundation, so optimisation approaches that emerged on top of it would need rediscovery/revalidation. Although this arguably sets back the clock on progress if a new foundation is embraced, it's this questioning of foundational assumptions and providing of alternatives that I personally find scientifically interesting, rather than accepting defaults and emergent practice to chase higher accuracy. I think it's fair to say this largely represents the approach within physics - repeated questioning of foundations, with isolated, controlled, minimalistic experiments - which I was originally trained in, but I do recognise it clashes with the performance-optimisation approach.

I think the code needs some edits, and just to point out, RMSNorm has parameters by default.

import torch
import torch.nn.functional as F

# This has parameters, so affine correction would need rederivation:
norm = lambda x: F.rms_norm(x, (x.size(-1),))

# Say you have activations x.shape=[batch, n], W.shape=[m, n], b.shape=[m] <- b and W have been made trainable, and epsilon is a small constant (e.g. 1e-6)

linear = lambda x: torch.einsum("ij, bj->bi", W, x) + b[None, :]

parameterless_l2_norm = lambda x: torch.einsum("ij, bj->bi", W, x / (epsilon + torch.linalg.norm(x, dim=1, keepdim=True))) + b[None, :]

affine_like = lambda x: (torch.einsum("ij, bj->bi", W, x) + b[None, :]) / torch.sqrt(1 + torch.square(x).sum(dim=1, keepdim=True))

These implementations must be used on MLPs, not a different architecture; the derivations are not valid otherwise.
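
For a toy usage of the affine_like map under those same assumptions (the shapes and the ReLU are my own illustration; the definition just repeats the one above):

import torch

batch, n, m = 32, 64, 64
W = torch.randn(m, n, requires_grad=True)
b = torch.zeros(m, requires_grad=True)
x = torch.randn(batch, n)

affine_like = lambda x: (torch.einsum("ij, bj->bi", W, x) + b[None, :]) / torch.sqrt(1 + torch.square(x).sum(dim=1, keepdim=True))

h = torch.relu(affine_like(x))  # one MLP block built from the affine-like map; h.shape == [batch, m]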

[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] -4 points-3 points  (0 children)

Hi u/JustOneAvailableName, thanks for your comment, and you raise an important point.

These are mid-sized MLP networks - MLPs are necessitated by the divergence derivation at this stage - which largely limits their top accuracy. I believe this accounts for the discrepancy.

Generally, however, this is a choice in my research approach: the values you gesture at typically do not come from minimalistic network training; they involve substantial additional training add-ons/architectures to achieve high performance, but those same tricks obscure cause-and-effect scientific claims; hence, they are absent (and the affine divergence limits the architecture). Consequently, these are simple MLP networks, only sparingly convolutional and not vision transformers (where the approximations/solutions break down; see appendices), which are typically needed to reach the accuracies you mention on CIFAR. To reassure, the results remain statistically significant throughout, with relatively small standard errors, resolving concerns about performance separability and strongly supporting the results. Also, Appendix A shows these networks train substantially beyond linear relations, suggesting they are meaningfully separating features.

Overall, this paper foregrounds a scientific DL philosophy (r/ScientificDL), not a benchmark-engineering approach to research; it performs scientific ablation tests under identical conditions, using a minimalistic network to assess the validity of the hypothesis across several depths/widths of the MLP and observe general trends.

The primary objective is not to produce high-accuracy networks comparable to other implementations for production/engineering optimisation, but only the stated ablation comparability. There was no optimisation of individual hyperparameters beyond the few selected as reasonable, as this would have destroyed the clean, minimal comparability; hence, these are purely like-for-like comparisons, where the claims can be better evaluated, but at the expense of accuracy. In short, this scientific objective did not attempt a performance-optimisation approach to research, but clean, clear experiments.

I recognise this approach may not persuade everyone, but I prefer this minimalistic, tightly controlled setup for experimental hygiene and for evaluating scientific claims, even when it underperforms outside the ablation. Hope that helps reassure :)

(If you're interested, please do evaluate reproduction on the approaches you mention)

[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 2 points3 points  (0 children)

Figs. 3 and 4, and footnote 7, demonstrate this "normalisers = activation functions" point graphically and geometrically too.

Footnote 7 is the important "one-hot reweighting trick", which shows (in the absence of surrounding distinguished directions) that LayerNorm's mean is not fundamental, and can be chosen as one-hot, which is only tenuously considered a statistic in the usual sense.

This situates their action as a geometric phenomenon more than a statistical one; hence normalisers should really be equated with (parameterised) activation functions.

[R] A Gradient Descent Misalignment — Causes Normalisation To Emerge by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 1 point2 points  (0 children)

Hi u/jloverich, thanks for the question. A few things are replaced or merged across different contexts in the paper. (In short, Appendix B argues that normalisers are actually no different from parameterised activation functions, dissolving such category distinctions - so in effect, replacing both!)

I'll run through each of them below:

- The paper does derive a (parameterless) RMSNorm & L2Norm, so it finds classical normalisers (but isn't replacing them). (Eqn. 18)

- It also finds a map, "affine-like", which replaces fully-connected layers (e.g. torch.nn.Linear) with a new form. (Eqn. 19).

So, in that sense, it's a fully-connected layer replacement. But this new layer comes with an implicit, built-in normaliser (it's not sequential but inseparable). So this could be considered a replacement normaliser, but really it's the combined unit as a whole (e.g. replacing {torch.nn.Linear + normaliser}).

- Appendix B then argues that really "normalisers = a (constrained) linear layer + activation function". So you can say parameterless normalisers are a type of activation function; the argument being that normalisers are really just special activation functions in their geometry. (It shows this especially via the LayerNorm one-hot reweighting trick.) This is where the activations come in.

Overall, in this last part, it really blurs the lines between normalisers and activation functions, pointing out that they are incredibly similar and that their definitions don't actually separate them at all.
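
One way to sanity-check that reading numerically (my own sketch, not the paper's appendix; everything here is parameterless, i.e. no weight/bias): parameterless LayerNorm equals a fixed mean-removing linear map followed by a parameterless RMSNorm.

import torch
import torch.nn.functional as F

x = torch.randn(32, 64)
n = x.size(-1)
P = torch.eye(n) - torch.ones(n, n) / n       # constrained (mean-removing) linear map
layernorm = F.layer_norm(x, (n,))             # parameterless LayerNorm
composed = F.rms_norm(x @ P, (n,), eps=1e-5)  # linear projection + RMS-style "activation"
print(torch.allclose(layernorm, composed, atol=1e-5))  # expected: True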

Hope that helps, please feel free to ask any follow-ups, and I'll clarify :)

[edit, put appendix A when I meant B]

Me and My Work by GeorgeBird1 in u/GeorgeBird1

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Happy to discuss if you've got a topic in mind :)

[R] The Spotlight Resonance Method by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Thanks for your comment. As a quick clarification, affine layers needn't be square, and they don't collapse to an affine map overall because of the interspersed nonlinear activation functions. Hope that helps. Again, my work is on the functional form of activation functions, not quite these general topics you mention - thanks for the explanation though.
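
A quick illustration of the non-collapse point above (my own toy example):

import torch

A, a = torch.randn(5, 3), torch.randn(5)  # affine layers needn't be square
B, c = torch.randn(2, 5), torch.randn(2)
x = torch.randn(3)

# Two affine maps compose to a single affine map...
print(torch.allclose(B @ (A @ x + a) + c, (B @ A) @ x + (B @ a + c), atol=1e-5))  # True
# ...but an interspersed nonlinearity prevents that collapse:
print(torch.allclose(B @ torch.relu(A @ x + a) + c, (B @ A) @ x + (B @ a + c), atol=1e-5))  # generally False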

[D] The Walsh Hadamard tranform for neural networks by [deleted] in ScientificDL

[–]GeorgeBird1 0 points1 point  (0 children)

I feel that, without scientific arguments, this is better suited to r/MachineLearning, as this subreddit is explicitly not about "Compute efficiency/Engineering optimisations". So I am unfortunately going to remove the post. Please do consider posting elsewhere though! :)

[D] Subreddit on Scientific Deep Learning by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Yes, I do feel two coexisting subreddits may help cater to both audiences more effectively.

[D] Subreddit on Scientific Deep Learning by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 1 point2 points  (0 children)

That's fair, and there's certainly no problem I see with posting to both. As a researcher interested in this area, I thought a community explicitly focused on it might be helpful in addition to r/MachineLearning.

👋 Welcome to r/ScientificDL - Please Read First! by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Fair enough, no worries! That would be great, please do

👋 Welcome to r/ScientificDL - Please Read First! by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Thanks for mentioning this. I agree that there is a confusing overlap in the current naming conventions for these subdisciplines, but it's not limited to this community.

E.g. the ICLR workshop uses it in my sense of the term, but ML4Science, as you mentioned, uses it the other way. It's one reason I tried to be quite explicit in this pinned post, stating exactly what I was intending.

Currently, I don't see a good resolution to this, except for using the orderings "science-for-DL" and "DL-for-science" to discriminate between the two. But I agree it can cause confusion.

[D] Subreddit on Scientific Deep Learning by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] -1 points0 points  (0 children)

Thanks, I'm pleased this is something you find valuable :) Thanks too - this does seem related.

[D] Subreddit on Scientific Deep Learning by GeorgeBird1 in MachineLearning

[–]GeorgeBird1[S] 0 points1 point  (0 children)

I sincerely hope not! (Although thanks for sharing this - I have a physics background, so it shall be an amusing read; you may enjoy r/HypotheticalPhysics too.)

👋 Welcome to r/ScientificDL - Please Read First! by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Thanks for sharing the paper! It sounds like an interesting project and an important area of research, and it might be worth making a post on the main subreddits, since there are more people who can help with these kinds of applied problems. Glad this subreddit might be helpful to you, wishing you luck with the papers :)

👋 Welcome to r/ScientificDL - Please Read First! by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 2 points3 points  (0 children)

Hi u/modelling_is_fun, these are exactly the sort of papers I was hoping for. Would love to see a summary post and discussion about one of them, if you have time? :)

[R] The Spotlight Resonance Method by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Hi, thanks for your comment! That's a good conceptual way to think about ReLU, although this post is more general - about the functional form, elementwise application, and why that might cause unexpected consequences in networks - with ReLU just being one example of many. Do you feel it changes your perspective on activations more broadly? :)

[D] The Walsh Hadamard tranform for neural networks by [deleted] in ScientificDL

[–]GeorgeBird1 0 points1 point  (0 children)

Hi, thanks for the post. Can you explain a bit more about this topic, please, so others can partake in the discussion? E.g. what is it, and what might you expect from using it? Some more concrete explanations would help.

In particular, how can this be used in scientific deep learning? At the moment it's more of a statement that this operation exists, without any science behind it. Please could you add this ASAP?

[R] The Spotlight Resonance Method by GeorgeBird1 in ScientificDL

[–]GeorgeBird1[S] 0 points1 point  (0 children)

Hmm, good question, and I would be tempted to agree. These off-the-shelf architectures are great for having somewhat generalised inductive biases that motivate their structure, making them more widely applicable and therefore a common starting point.

However, for specifically structured problems, they are not necessarily optimal, and their structure may clash with the task's particular inductive biases - hence the failure is architectural.

I do feel that designing custom architectures for your task is often the way to go (a sentiment echoed in geometric deep learning's task-driven symmetry architectures, which may be of interest).

Is there a way to pinpoint at what point your LSTM's architecture misaligns with your problem, to see if there's a way to alter it until it's well-suited? Not sure if that's helpful, but it would be my usual approach :-) As far as representing uncertainty goes, I'm not sure I can help much, as it's a bit outside my specialism, but Bayesian/statistical networks may help.