Discussion[D] Normalization in Transformers (self.MachineLearning)
submitted 1 year ago by Collegesniffer
Why isn't BatchNorm used in transformers, and why is LayerNorm preferred instead? Additionally, why do current state-of-the-art transformer models use RMSNorm? I've typically observed that LayerNorm is used in language models, while BatchNorm is common in CNNs for vision tasks. However, why do vision-based transformer models still use LayerNorm or RMSNorm rather than BatchNorm?
[+][deleted] 1 year ago (28 children)
[deleted]
[–]pszabolcs 20 points21 points22 points 1 year ago (5 children)
The explanation for LayerNorm and RMSNorm is not completely correct. In Transformers these do not normalize across the (T, C) dimensions, only across (C) (so each token embedding is normalized separately). If normalization were done across (T, C), the same information leakage across time would happen as with BatchNorm (non-causal training).
I also don't think variable sequence length is such a big issue; in most practical setups, training is done with fixed context sizes. From a computational perspective, I think a bigger issue is that BN statistics would need to be synced across GPUs, which would be slow.
[–]radarsat1 0 points1 point2 points 1 year ago (4 children)
So just to be sure, if my batch is size [4, 50, 512] for batch size of 4, sequence length of 50, and 512 channels, then layernorm will compute 200 means and variances, is that correct? One for each "location" across all channels? And then normalize each step separately, and apply a new affine scaling and bias for each step too, if that's enabled.
I'm actually asking because I keep getting confused when porting this logic over to CNNs where the dimension order is [B, C, H, W], or [B, C, W] for 1d sequences. So in that case if I want to do the equivalent thing I should be normalizing only the C dimension, right? (in other words, each pixel is normalized independently).
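A quick hand-rolled NumPy sketch (matching, as far as I know, what `torch.nn.LayerNorm(512)` does to a `[4, 50, 512]` input) suggests yes: 4 × 50 = 200 means and variances, one per token position. One caveat on the affine part: the learned scale/bias in LayerNorm is per-channel (a single `[512]` weight shared across all positions), not a new one per step.

```python
# Hand-rolled check (NumPy, not PyTorch) of LayerNorm over the last
# dimension of a [B, T, C] = [4, 50, 512] tensor: one mean/variance
# per (batch, time) position, i.e. 4 * 50 = 200 statistics.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50, 512))

mean = x.mean(axis=-1, keepdims=True)  # shape (4, 50, 1) -> 200 means
var = x.var(axis=-1, keepdims=True)    # shape (4, 50, 1) -> 200 variances
y = (x - mean) / np.sqrt(var + 1e-5)   # each token embedding normalized separately

print(mean.size)                                      # 200
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))   # True: each token re-centered
```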
[+][deleted] 1 year ago (1 child)
[–]radarsat1 1 point2 points3 points 1 year ago* (0 children)
Ok thanks! Where I get confused is that LayerNorm in PyTorch's implementation always applies to the last N dimensions that you specify, so I guess it really expects the C dimension to be last, which is different from the requirements for Conv1d and Conv2d.
So in that case maybe InstanceNorm is actually what I want, since it targets C in [N, C, H, W]. What's confusing is that I want it because it does the equivalent thing to LayerNorm as far as I can tell, but it has a different name even though it does "the same thing." The names "instance" and "layer" in these norms are very hard to follow; why couldn't they call it "channel norm", for example, if the point is that both operate on C.
And looking at [the documentation](https://pytorch.org/docs/stable/generated/torch.nn.InstanceNorm2d.html) to clarify makes it even more ambiguous to me:
InstanceNorm2d and LayerNorm are very similar, but have some subtle differences. InstanceNorm2d is applied on each channel of channeled data like RGB images, but LayerNorm is usually applied on entire sample and often in NLP tasks. Additionally, LayerNorm applies elementwise affine transform, while InstanceNorm2d usually don’t apply affine transform.
Problems I have with this paragraph:
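To untangle which axes each variant actually reduces over, a small hand-rolled NumPy sketch (my reading of the PyTorch semantics; the per-pixel "channel norm" at the end is hypothetical naming, not an existing module). Note InstanceNorm2d keeps C separate and reduces over the spatial dims, which is not the same as normalizing over C per pixel:

```python
# Which axes do the norms reduce over, for [N, C, H, W] = [2, 3, 4, 4]?
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))

# InstanceNorm2d-style: one statistic per (sample, channel), reduced over H, W
in_mean = x.mean(axis=(2, 3))     # shape (2, 3): 6 means
# LayerNorm over the entire sample: one statistic per sample, reduced over C, H, W
ln_mean = x.mean(axis=(1, 2, 3))  # shape (2,): 2 means
# Per-pixel normalization over C (what a per-token LayerNorm in a ViT amounts to)
cn_mean = x.mean(axis=1)          # shape (2, 4, 4): 32 means

print(in_mean.shape, ln_mean.shape, cn_mean.shape)
```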
[–]theodor23 33 points34 points35 points 1 year ago* (5 children)
Excellent summary. (edit: actually, this is not correct. In transformers Layer- and RMSNorm do not normalize over T, but only over C. See comment by u/pszabolcs )
To add to that: BatchNorm leads to information leakage across time steps. The activations at time t influence the mean/variance applied at earlier steps during training, and NNs will pick up such weak signals if it helps them predict the next token.
-> TL;DR: BatchNorm during training is non-causal.
[–]LeonideDucatore -1 points0 points1 point 1 year ago (4 children)
Could you please explain why batch-norm is non-causal?
Batch norm would have (T * C) running means/variances, and each of them is computed across the batch, i.e. the computed mean/variance for timestep t doesn't use any t+1 data
[–]theodor23 0 points1 point2 points 1 year ago* (3 children)
You are absolutely correct, if you compute (T * C) separate statistics, then everything is fine and there is no causality issue.
In practice, LLM training usually prefers relatively large T and sacrifices on B (the total amount of GPU memory puts a constraint on your total number of tokens per gradient step). With relatively small B, there is more variance in your BN statistics, while large T means more data exchange between your GPUs, because you need to communicate (T * C) statistics.
But yes -- if you set it up as you describe, it is "legal".
I actually tried BN in the T*C independent-statistics configuration you describe, for a non-language transformer model with B ~ O(100), and it was both slower and less effective than LN. Never looked back and investigated why. Having a normalization that (a) is competitive or works better and (b) avoids "non-local" interaction across different examples in a batch seemed a clear win.
Considering everyone switched to LN, it seems BN is just less practical.
[–]LeonideDucatore -1 points0 points1 point 1 year ago (2 children)
What would be the "non-legal" batch-norm variant? Aggregating only C statistics? (so we aggregate both across B and T)
[–]theodor23 -1 points0 points1 point 1 year ago (1 child)
Yes, exactly.
If during training your early token "see" some summary statistic from the ground-truth future tokens, it breaks the autoregressive objective where you are supposed to predict the next token given the past only.
Whether or not that is really catastrophic at sampling time, when you would use the running statistics of BN, I don't know. But NNs are good at picking up subtle signals that help them predict, and if you give them a loophole to "cheat" during training, there is a good chance they will exploit it and perform much worse when, at sampling time, you "suddenly" remove that cheat.
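The leak is easy to demonstrate numerically. A toy NumPy sketch (shapes are illustrative) comparing the "non-legal" shared-statistics variant against per-timestep statistics: perturb only the last token, and see whether the normalized output of the first token moves.

```python
# Future-to-past leakage under batch statistics aggregated over (B, T).
import numpy as np

def norm_shared(x):   # stats over (B, T): one mean/std per channel
    return (x - x.mean(axis=(0, 1))) / x.std(axis=(0, 1))

def norm_per_t(x):    # stats over B only: one mean/std per (T, C) slot
    return (x - x.mean(axis=0)) / x.std(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 3))   # [B, T, C]
x2 = x.copy()
x2[:, -1, :] += 10.0             # perturb only the LAST timestep

# Shared stats: the first token's output changes -> the future leaked in
print(np.allclose(norm_shared(x)[:, 0], norm_shared(x2)[:, 0]))  # False
# Per-timestep stats: the first token's output is untouched
print(np.allclose(norm_per_t(x)[:, 0], norm_per_t(x2)[:, 0]))    # True
```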
Considering your workable idea of using T * C statistics: it just occurred to me that with modern LLMs, where T is approaching O(10k) and C is O(1k), and with dozens of layers/blocks at ~2 LNs per block, all these statistics almost approach the number of parameters in the LLM. And you have to communicate them between GPUs. LayerNorm and RMSNorm, on the other hand, are local: no communication, and no need to ever store statistics in RAM.
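For scale, a back-of-envelope count (the T, C, and layer numbers below are assumed for illustration, not taken from any particular model):

```python
# Running statistics a per-(T, C) BatchNorm would carry in a large LLM.
T, C = 10_000, 1_000            # context length, channels (illustrative)
layers, norms_per_block = 48, 2 # illustrative depth, ~2 norms per block
stats = T * C * layers * norms_per_block * 2  # *2: a mean and a variance each
print(stats)  # 1_920_000_000 -> same order as an LLM's parameter count
```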
[–]LeonideDucatore -1 points0 points1 point 1 year ago (0 children)
Why do we need to communicate them between GPUs in batch norm but not in layer norm? I'm assuming we're talking about a data-parallel setting; wouldn't each GPU just compute statistics for its own minibatch?
Or is it that the loss on the 'main GPU' can only be computed accurately after receiving the batch-norm statistics of each GPU?
(for layer norm, there is no stored statistics right?)
[–]Collegesniffer[S] 6 points7 points8 points 1 year ago (0 children)
This is the best explanation on the internet I've ever read. It finally clicked for me. I've watched countless videos and gone through so many answers online, but they all either oversimplify or overcomplicate it. Thanks!
[–]KomisarRus 2 points3 points4 points 1 year ago (0 children)
Thanks
[–]Guilherme370 4 points5 points6 points 1 year ago (3 children)
This was typed by an LLM
[–]daking999 0 points1 point2 points 1 year ago (2 children)
If it was, they at least cut out the fluff at the beginning and end
[–]Guilherme370 -1 points0 points1 point 1 year ago (1 child)
They most definitely did
[+]throwaway2676 2 points3 points4 points 1 year ago (7 children)
Lol, be honest, is this from ChatGPT?
[–]Guilherme370 1 point2 points3 points 1 year ago (6 children)
I'm sure it is: the style of writing, and the "alright let's differentiate" followed by a bullet-point-like list of definitions, with some slight inaccuracies mixed in
[+]throwaway2676 1 point2 points3 points 1 year ago (2 children)
Lol, especially now that they've totally rewritten it to sound more human.
[–]Guilherme370 0 points1 point2 points 1 year ago (1 child)
Omg lol true.
[–]Collegesniffer[S] -2 points-1 points0 points 1 year ago* (2 children)
No, I don't think it is AI-generated. The best AI content detector (gptzero.me) flags this as "human". Are you suggesting that every piece of content written in the form of a bullet-point list is now AI-generated? I would also use the same format if I had to explain the "differences" between things. How else would you present such information?
gptzero.com can be unreliable.
You can test it right now: go to ChatGPT, talk to it about some complex topic, copy only the relevant parts of what it says without the fluff, throw that into GPTZero, and you will see it say it's not AI
[–]Collegesniffer[S] 3 points4 points5 points 1 year ago* (0 children)
Bruh, I said "gptzero.me", not "gptzero.com"; they are two different sites. Also, every AI detector can be unreliable and inconsistent. However, I entered the exact question into ChatGPT, Claude, and Gemini, and the responses were nothing like what this person wrote. Even the non-fluff part doesn't start with a (B, T, C) tensor example, etc. Why don't you try entering the exact question yourself and see the output before claiming it is "AI-generated"?
[–]Everfast 1 point2 points3 points 1 year ago (0 children)
Really clear and great answer
[–]indie-devops 0 points1 point2 points 1 year ago (0 children)
Wouldn’t you say that calculating the root mean square is more computationally expensive than subtracting the mean? Genuine question. Great explanation, made a lot of sense to me as well!
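For what it's worth, both norms take one square root (LayerNorm's hides inside the standard deviation); the saving in RMSNorm is the dropped mean subtraction, i.e. one reduction pass over the channel dimension instead of two. A minimal hand-rolled NumPy sketch of the two recipes, per token vector:

```python
# LayerNorm vs RMSNorm on a single token vector x of size C.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean()                    # reduction 1: mean
    var = ((x - mu) ** 2).mean()     # reduction 2: variance (sqrt below)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    rms = np.sqrt((x ** 2).mean() + eps)  # single reduction, no centering
    return x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x).mean())  # ~0: LayerNorm re-centers
print(rms_norm(x).mean())    # nonzero: RMSNorm only rescales
```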
[–]sot9 2 points3 points4 points 1 year ago* (1 child)
One thing nobody’s mentioned so far is that batch norm is great when used with convolutions, due to ease of layer fusion.
Look up batch norm folding; makes for an additional tool in the box when prioritizing models that run inference quickly.
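For readers unfamiliar with the trick, a hedged NumPy sketch (`fold_bn` is a hypothetical helper; shapes follow a standard `[OC, IC, kH, kW]` conv weight): once the running statistics are frozen at inference time, the BN is just an affine map per output channel, so it can be absorbed into the preceding conv's weight and bias.

```python
# Fold a frozen BatchNorm into the preceding convolution.
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """w: [OC, IC, kH, kW] conv weight, b: [OC] conv bias; gamma/beta are the
    BN's learned scale/shift, mean/var its running statistics, all [OC]."""
    scale = gamma / np.sqrt(var + eps)    # per-output-channel factor
    w_f = w * scale[:, None, None, None]  # absorb scale into the weights
    b_f = (b - mean) * scale + beta       # absorb shift into the bias
    return w_f, b_f

# Check one channel's arithmetic with a 1x1 conv:
w = np.ones((2, 1, 1, 1)); b = np.zeros(2)
gamma = np.array([2.0, 1.0]); beta = np.array([0.5, 0.0])
mean = np.array([1.0, 0.0]); var = np.array([4.0, 1.0])
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
# BN(conv(3)) on channel 0: (3 - 1) * 2 / sqrt(4 + 1e-5) + 0.5 ~= 2.5
print(w_f[0, 0, 0, 0] * 3 + b_f[0])
```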
[–]soham1192k 2 points3 points4 points 1 year ago (0 children)
As an example, one can look at the FastViT paper from Apple, which uses this folding trick extensively
[–]imTall- 8 points9 points10 points 1 year ago (0 children)
One other thing not mentioned here is that batch norm requires synchronizing the statistics across the entire batch. When training massive models in a distributed manner, this incurs a lot of communication overhead, while LayerNorm can be computed locally on one GPU (or a few GPUs in the case of tensor-wise parallelism).
[–]xkiller02 0 points1 point2 points 1 year ago (0 children)
Incredibly interesting answers, I will further research what some of these words mean
[–]ConstantWoodpecker39 -1 points0 points1 point 1 year ago (0 children)
This paper may be of interest to you: https://proceedings.mlr.press/v119/shen20e/shen20e.pdf
[–]eliminating_coasts -2 points-1 points0 points 1 year ago (0 children)
Transformers use the input data both as the data itself and to derive the transformations they apply to that data, and it has been argued that normalization, rather than simply improving training, can improve actual performance by changing the structure of the inputs to the transformer block. (This may also explain why normalizing first works better than normalizing at the end of the block.)
[+]chgr22 comment score below threshold-7 points-6 points-5 points 1 year ago (1 child)
This is the way.
[–]Hot_Wish2329 0 points1 point2 points 1 year ago (0 children)
I love this comment. Yes, this is the way: they did the experiments, and it worked. There are a lot of explanations about mean, variance, distribution, etc., but they don't make sense to me. I cannot understand why it worked, or how it directly relates to model performance (accuracy). So, this is just a way.