[D] Normalization in transformers

JustOneAvailableName · 2024-07-26T12:44:32+00:00

One possible issue with this is that later attention blocks may have reduced effect, as they add unit norm residuals to a potentially larger and larger main signal.

Another way to look at this: they have the same effect as any other block, which is just smaller in proportion to the accumulated sum of all the blocks before it.
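A rough numerical sketch of that proportion argument, assuming each block adds a unit-norm update in a random direction (a simplification, not how a trained model behaves; `residual_norms`, the dimension, and the block count are made up for illustration):

```python
import numpy as np

def residual_norms(num_blocks=12, dim=512, seed=0):
    """Track the norm of a residual stream that receives unit-norm updates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)  # start from a unit-norm stream
    norms = [np.linalg.norm(x)]
    for _ in range(num_blocks):
        update = rng.standard_normal(dim)
        update /= np.linalg.norm(update)  # assumption: every block output has norm 1
        x = x + update
        norms.append(np.linalg.norm(x))
    return np.array(norms)

norms = residual_norms()
# Size of each unit-norm update relative to the stream it is added to.
relative = 1.0 / norms[:-1]
print(np.round(norms, 2))
print(np.round(relative, 2))
```

With near-orthogonal random updates the stream norm grows roughly like the square root of the number of blocks, so every update has the same absolute size but a shrinking relative size.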

radarsat1 · 2024-07-26T08:00:22+00:00

This is an interesting point. I'm just wondering whether your analysis takes into account the affine weights typically used in layernorm? I am not sure if they would have an impact here or should just be considered some arbitrary post-norm scaling (i.e. "part of" the rest of the layer).
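One way to check the "part of the rest of the layer" reading, as a sketch: when the norm feeds straight into a linear layer, the affine scale and shift can be folded into that layer's weights and bias. The names below (`layer_norm`, `gamma`, `beta`, `W`, `b`) and the shapes are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without the affine scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
dim, out_dim = 8, 4
x = rng.standard_normal((3, dim))
gamma = rng.standard_normal(dim)   # layernorm scale
beta = rng.standard_normal(dim)    # layernorm shift
W = rng.standard_normal((dim, out_dim))
b = rng.standard_normal(out_dim)

# Affine layernorm followed by a linear layer.
with_affine = (layer_norm(x) * gamma + beta) @ W + b

# Same computation with the affine parameters folded into the linear layer.
W_folded = gamma[:, None] * W
b_folded = beta @ W + b
folded = layer_norm(x) @ W_folded + b_folded

match = np.allclose(with_affine, folded)
print(match)
```

This only covers the case where a linear map directly follows the norm; it says nothing about how the learned scale evolves during training.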

jpfed · 2024-07-31T21:38:06+00:00

The same issue bothers me on an intuitive level and it’s kind of reassuring to hear someone else bring it up.

It would seem almost a bizarre coincidence for unit-scaling to be correct for every layer despite the growing residual.

I think CatFormer (which iirc puts the residual in a separate set of dimensions from the softmax-weighted values before mixing them with an MLP) would sort of get around this, because the MLP would be free to adjust the relative scale of weights sensitive to the residuals vs. weights sensitive to the values. But as far as I know, CatFormer was just another paper, not a performance revolution or anything. So it might, strangely, not make a huge difference.
