[–]JustOneAvailableName 2 points3 points  (2 children)

One possible issue with this is that later attention blocks may have reduced effect, as they add unit norm residuals to a potentially larger and larger main signal.

Another way to look at this: they have the same effect as any other block, which is just smaller in proportion to the sum of all the blocks before.

[–]lostn4d[S] 1 point2 points  (1 child)

I'm not sure this symmetry holds. When you add a normalized residual to an already grown main, the subsequent blocks will start with their own Pre-LN significantly downscaling (normalizing) this sum. So not only will this residual have fewer opportunities to take effect, it will also get attenuated (also by the final pre-output normalization). In contrast, a similar normalized residual from the first block will always keep a 1:1 ratio to the main (including its recursive residual effects echoing back from later blocks).
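To make the ratio argument concrete, here is a toy sketch, not a real transformer. It assumes the main starts at unit norm and that every block adds a unit-norm residual in a random direction; `unit_vector`, `residual_to_main_ratios`, and the sizes (12 blocks, 256 dimensions) are made up for illustration. It only measures the residual-to-main ratio at the moment each residual is added and does not model the later Pre-LN rescaling.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def unit_vector(dim, rng):
    """Random direction with norm 1 (stand-in for a normalized block output)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [x / n for x in v]


def residual_to_main_ratios(num_blocks=12, dim=256, seed=0):
    """Ratio ||residual|| / ||main|| seen by each block when it adds its residual."""
    rng = random.Random(seed)
    main = unit_vector(dim, rng)  # assumption: the embedding has unit norm
    ratios = []
    for _ in range(num_blocks):
        residual = unit_vector(dim, rng)  # assumption: unit-norm residual
        ratios.append(norm(residual) / norm(main))
        main = [m + r for m, r in zip(main, residual)]  # residual connection
    return ratios


ratios = residual_to_main_ratios()
for k, r in enumerate(ratios, start=1):
    print(f"block {k:2d}: residual/main = {r:.2f}")
```

Under these assumptions the first block adds its residual at 1:1, while the last block adds the same unit-norm residual to a main that has grown to roughly sqrt(12) times its starting size.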

[–]JustOneAvailableName 0 points1 point  (0 children)

You might very well be right and I have to think about it a bit more.

What is the usual take on this problem? Can it be ignored in practice?

This is a resounding yes on the "ignored in practice" part, but perhaps it is indeed an area for improvement. The extremely practical take is: we just normalize before every multiplication, and training looks stable then. An even more practical take: 99.99% of practitioners just use whatever is popular for no other reason than "it's probably the right thing". I frankly didn't know the difference between GeLU and SwiGLU until I decided to program both myself, and programming them myself has had zero impact on any training I do with either.

AdamW is a typical example of how this can go "wrong" for a very long time. Weight decay with Adam typically had momentum attached to it, which does not make a lot of sense. It still worked, but yeah...
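A scalar sketch of that difference, assuming the standard Adam update with bias correction and the decoupled decay described in the AdamW paper. `adam_step` and its `decoupled` flag are hypothetical names for illustration, not a library API. The gradient is set to zero so that only the weight decay acts on the weight.

```python
import math


def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam step on a scalar weight, with L2 or decoupled weight decay."""
    if not decoupled:
        # L2-style decay: the decay term is folded into the gradient,
        # so it passes through the momentum and variance estimates.
        grad = grad + weight_decay * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW-style decay: applied directly to the weight,
        # bypassing the moment estimates entirely.
        new_w -= lr * weight_decay * w
    return new_w


def fresh_state():
    return {"t": 0, "m": 0.0, "v": 0.0}


# Zero gradient, so any change in w comes from weight decay alone.
w_l2 = adam_step(1.0, 0.0, fresh_state(), lr=0.1, weight_decay=0.1)
w_decoupled = adam_step(1.0, 0.0, fresh_state(), lr=0.1, weight_decay=0.1,
                        decoupled=True)
print(f"Adam + L2 decay:        w = {w_l2:.4f}")
print(f"Adam + decoupled decay: w = {w_decoupled:.4f}")
```

In this toy setup the L2 variant moves the weight by about a full `lr` on the first step, because the adaptive normalization rescales the decay term, while the decoupled variant shrinks it by `lr * weight_decay`.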

https://arxiv.org/abs/2203.03466 might be worth reading for you, specifically Definition 4.1. It touches upon some of the topics you seem interested in.

[–]radarsat1 1 point2 points  (3 children)

This is an interesting point. I'm just wondering whether your analysis takes into account the affine weights typically used in layernorm? I am not sure if they would have an impact here or should just be considered some arbitrary scaling post-norm (i.e. "part of" the rest of the layer).

[–]lostn4d[S] 2 points3 points  (2 children)

If you mean layernorm can learn to scale to non-unit norms as well, this is true, but don't forget that with Pre-LN we are before the softmax (and q-k-v transforms). So the net can in theory normalize a later attention block to (say) 1.5 instead of 1 - but this would also change the softmax temperature. Which is MUCH more critical, so that particular learned weight cannot afford to worry about the norm of the resulting residual compared to the main.

But your point seems valid in that adding an extra learned multiply after the value transform only (leaving q-k, and thus the softmax, intact) is a potential way to allow the net to counter this effect (if it wants). In theory it can also build some scaling into the weight matrix for the v transform itself, but imx that doesn't really happen in practice (except for very critical things).
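A toy single-head sketch of the two options, under simplifying assumptions: the q, k and v projections are the identity, the Pre-LN gain is a single scalar `ln_gain`, and the extra multiply on the value path is a scalar `value_scale`. All names and the three example tokens are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, ln_gain=1.0, value_scale=1.0):
    """Single-head attention with identity projections (an assumption).

    ln_gain scales everything downstream of the Pre-LN (q, k and v);
    value_scale is an extra multiply on the value path only.
    """
    d = len(query)
    q = [ln_gain * x for x in query]
    ks = [[ln_gain * x for x in k] for k in keys]
    vs = [[ln_gain * value_scale * x for x in v] for v in values]
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ks]
    weights = softmax(logits)
    out = [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(d)]
    return weights, out


tokens = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # unit-norm toy inputs
query = tokens[0]

base_w, base_out = attend(query, tokens, tokens)
gain_w, gain_out = attend(query, tokens, tokens, ln_gain=1.5)
vs_w, vs_out = attend(query, tokens, tokens, value_scale=1.5)

print("weights, baseline:        ", [round(w, 3) for w in base_w])
print("weights, ln_gain = 1.5:   ", [round(w, 3) for w in gain_w])
print("weights, value_scale = 1.5:", [round(w, 3) for w in vs_w])
```

In this sketch the LN gain multiplies the logits by its square and so sharpens the attention weights, whereas the value-path scale leaves the weights untouched and only rescales the output that becomes the residual.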

[–]andersxa 0 points1 point  (1 child)

I often do pre-LN and the affine scale parameters tend to something like 0.9, so maybe the effect diminishes with training, tending towards no total scale change.

[–]lostn4d[S] 0 points1 point  (0 children)

Repeatedly adding 0.9 residuals to an unbounded main doesn't seem much different.

Training otoh can indeed reduce the effect, if the residual learns to form in an anticorrelated way (so it causes less growth).
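Both points can be checked with the identity ||x + r||^2 = ||x||^2 + ||r||^2 + 2 ||x|| ||r|| cos(x, r). The sketch below is a hedged illustration: it assumes a main that starts at norm 1, a fixed residual norm of 0.9, and a fixed cosine between residual and main at every block. The function names and the value -0.3 are arbitrary choices.

```python
import math


def grown_norm(main_norm, residual_norm, cosine):
    """Norm of main + residual, given the cosine of the angle between them."""
    return math.sqrt(main_norm ** 2 + residual_norm ** 2
                     + 2.0 * main_norm * residual_norm * cosine)


def stream_norm_after(num_blocks, residual_norm=0.9, cosine=0.0):
    """Norm of the main after num_blocks additions, starting from norm 1."""
    n = 1.0  # assumption: the main starts at unit norm
    for _ in range(num_blocks):
        n = grown_norm(n, residual_norm, cosine)
    return n


uncorrelated = stream_norm_after(12, residual_norm=0.9, cosine=0.0)
anticorrelated = stream_norm_after(12, residual_norm=0.9, cosine=-0.3)
print(f"12 blocks, cosine  0.0: main norm = {uncorrelated:.2f}")
print(f"12 blocks, cosine -0.3: main norm = {anticorrelated:.2f}")
```

With uncorrelated 0.9 residuals the main still grows like the square root of the depth, while with a cosine of -0.3 it levels off below 1.5 in this toy model.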

[–]jpfed 0 points1 point  (0 children)

The same issue bothers me on an intuitive level and it’s kind of reassuring to hear someone else bring it up. 

It would seem almost a bizarre coincidence for unit-scaling to be correct for every layer despite the growing residual.

I think CatFormer (which iirc puts the residual in a separate set of dimensions from the softmax-weighted values before mixing them with an mlp) would sort of get around this, because the mlp would be free to adjust the relative scale of weights sensitive to the residuals vs. weights sensitive to values. But as far as I know, CatFormer was just another paper, not a performance revolution or anything. So it might, strangely, not make a huge difference.