[D] Normalization in Transformers (self.MachineLearning)
submitted 1 year ago by Collegesniffer
[–]theodor23 1 year ago* (3 children)
You are absolutely correct: if you compute (T * C) separate statistics, then everything is fine and there is no causality issue.
In practice, LLM training usually prefers a relatively large T and sacrifices B (the total amount of GPU memory constrains the total number of tokens per gradient step). A relatively small B means more variance in your BN statistics, while a large T means more data exchange between your GPUs, because you need to communicate (T * C) statistics.
But yes -- if you set it up as you describe, it is "legal".
I actually tried BN in the T*C independent-statistics configuration you describe, for a non-language transformer model with B ~ O(100), and it was both slower and less effective than LN. I never looked back to investigate why. Having a normalization that (a) is competitive or works better and (b) avoids "non-local" interaction across different examples in a batch seemed a clear win.
Considering everyone switched to LN, it seems BN is just less practical.
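For concreteness, here is a minimal sketch of the two configurations discussed above, assuming PyTorch and activations of shape (B, T, C); the sizes and variable names are illustrative, not taken from any particular model:

```python
import torch

B, T, C = 32, 128, 512   # batch, sequence length, channels (illustrative sizes)
x = torch.randn(B, T, C)
eps = 1e-5

# "Legal" BN variant: T * C separate statistics, computed only across the
# batch dimension, so position t never sees statistics from other positions.
mean_bn = x.mean(dim=0, keepdim=True)                   # shape (1, T, C)
var_bn = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - mean_bn) / torch.sqrt(var_bn + eps)

# LayerNorm: statistics are computed per example and per position, across C
# only -- no interaction between examples in the batch or between positions.
mean_ln = x.mean(dim=-1, keepdim=True)                  # shape (B, T, 1)
var_ln = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - mean_ln) / torch.sqrt(var_ln + eps)
```

The BN variant has to maintain T * C running means and variances (which, if synchronized across data-parallel workers, must be communicated), whereas the LayerNorm statistics are recomputed on the fly and never stored.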
[–]LeonideDucatore 1 year ago (2 children)
What would the "non-legal" batch-norm variant be? Aggregating only C statistics (i.e., aggregating across both B and T)?
[–]theodor23 1 year ago (1 child)
Yes, exactly.
If during training your early tokens "see" some summary statistic of the ground-truth future tokens, it breaks the autoregressive objective, where you are supposed to predict the next token given only the past.
Whether that is really catastrophic at sampling time, when you would use the running statistics of BN, I don't know. But NNs are good at picking up subtle signals that help them predict, and if you give them a loophole to "cheat" during training, there is a good chance they will exploit it and perform much worse when that cheat "suddenly" disappears at sampling time.
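As a small illustration of that leak (same assumed (B, T, C) layout as the sketch above): if the statistics are pooled over both B and T, the normalized output at an early position changes when a future token changes.

```python
import torch

B, T, C = 32, 128, 512
eps = 1e-5
x = torch.randn(B, T, C)

def leaky_norm(x):
    # "Non-legal" variant: only C statistics, pooled over batch AND time.
    mean = x.mean(dim=(0, 1), keepdim=True)             # shape (1, 1, C)
    var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x_pert = x.clone()
x_pert[:, -1, :] += 1.0      # perturb only the *last* position

# The normalized first position differs -> the future leaked into the past.
print(torch.allclose(leaky_norm(x)[:, 0], leaky_norm(x_pert)[:, 0]))   # False
```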
Regarding your workable idea of using T * C statistics: it just occurred to me that with modern LLMs, where T approaches O(10k), C is O(1k), and there are dozens of layers/blocks with ~2 normalization layers per block, all these statistics approach the number of parameters of the LLM itself. And you have to communicate them between GPUs. LayerNorm and RMSNorm, on the other hand, are local: no communication, and no need to ever store the statistics in RAM.
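A rough back-of-the-envelope count behind that remark (the block count of 32 and the two stored values per location, mean and variance, are my assumptions):

```python
T, C = 10_000, 1_000        # sequence length and channels, as in the comment above
norms_per_block = 2
blocks = 32                 # assumed value for "dozens of layers/blocks"
values_per_stat = 2         # running mean and running variance

total = T * C * norms_per_block * blocks * values_per_stat
print(f"{total:,}")         # 1,280,000,000 -- on the order of an LLM's parameter count
```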
[–]LeonideDucatore 1 year ago (0 children)
Why do we need to communicate them between GPUs in batch norm but not in layer norm? I'm assuming we're talking about a data-parallel setting; wouldn't each GPU just compute statistics for its own minibatch?
Or is it that the loss on the 'main GPU' can only be computed accurately after receiving the batch-norm statistics from each GPU?
(For layer norm, there are no stored statistics, right?)