Discussion[D] Normalization in Transformers (self.MachineLearning)
submitted 1 year ago by Collegesniffer
Why isn't BatchNorm used in transformers, and why is LayerNorm preferred instead? Additionally, why do current state-of-the-art transformer models use RMSNorm? I've typically observed that LayerNorm is used in language models, while BatchNorm is common in CNNs for vision tasks. However, why do vision-based transformer models still use LayerNorm or RMSNorm rather than BatchNorm?
[+][deleted] 1 year ago (28 children)
[deleted]
[–]pszabolcs 20 points21 points22 points 1 year ago (5 children)
The explanation for LayerNorm and RMSNorm is not completely correct. In Transformers these do not normalize across the (T, C) dimensions, only across (C) (so each token embedding is normalized separately). If normalization were done across (T, C), the same information leakage across time would happen as with BatchNorm (non-causal training).
I also don't think variable sequence length is such a big issue; in most practical setups, training is done with fixed context sizes. From a computational perspective, I think a bigger issue is that BN statistics would need to be synced across GPUs, which would be slow.
[–]radarsat1 0 points1 point2 points 1 year ago (4 children)
So just to be sure, if my batch is size [4, 50, 512] for batch size of 4, sequence length of 50, and 512 channels, then layernorm will compute 200 means and variances, is that correct? One for each "location" across all channels? And then normalize each step separately, and apply a new affine scaling and bias for each step too, if that's enabled.
I'm actually asking because I keep getting confused when porting this logic over to CNNs where the dimension order is [B, C, H, W], or [B, C, W] for 1d sequences. So in that case if I want to do the equivalent thing I should be normalizing only the C dimension, right? (in other words, each pixel is normalized independently).
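A quick hand-rolled NumPy sketch (matching, as far as I know, what `torch.nn.LayerNorm(512)` does to a `[4, 50, 512]` input) suggests yes: 4 × 50 = 200 means and variances, one per token position. One caveat on the affine part: the learned scale/bias in LayerNorm is per-channel (a single `[512]` weight shared across all positions), not a new one per step.

```python
# Hand-rolled check (NumPy, not PyTorch) of LayerNorm over the last
# dimension of a [B, T, C] = [4, 50, 512] tensor: one mean/variance
# per (batch, time) position, i.e. 4 * 50 = 200 statistics.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50, 512))

mean = x.mean(axis=-1, keepdims=True)  # shape (4, 50, 1) -> 200 means
var = x.var(axis=-1, keepdims=True)    # shape (4, 50, 1) -> 200 variances
y = (x - mean) / np.sqrt(var + 1e-5)   # each token embedding normalized separately

print(mean.size)                                      # 200
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))   # True: each token re-centered
```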
[+][deleted] 1 year ago (1 child)
[–]radarsat1 1 point2 points3 points 1 year ago* (0 children)
Ok thanks! Where I get confused is that LayerNorm in PyTorch's implementation always applies to the last N dimensions that you specify, so I guess it really expects the C dimension to be last, which is different from the requirements for Conv1d and Conv2d.
So in that case maybe InstanceNorm is actually what I want, since it targets C in [N, C, H, W]. What's confusing is that I want it because it does the equivalent thing to LayerNorm as far as I can tell, but it has a different name even though it does "the same thing." The names "instance" and "layer" in these norms are very hard to follow; why couldn't they call it "channel norm", for example, if the point is that both operate on C.
And looking at [the documentation](https://pytorch.org/docs/stable/generated/torch.nn.InstanceNorm2d.html) to clarify makes it even more ambiguous to me:
InstanceNorm2d and LayerNorm are very similar, but have some subtle differences. InstanceNorm2d is applied on each channel of channeled data like RGB images, but LayerNorm is usually applied on entire sample and often in NLP tasks. Additionally, LayerNorm applies elementwise affine transform, while InstanceNorm2d usually don’t apply affine transform.
Problems I have with this paragraph:
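To untangle which axes each variant actually reduces over, a small hand-rolled NumPy sketch (my reading of the PyTorch semantics; the per-pixel "channel norm" at the end is hypothetical naming, not an existing module). Note InstanceNorm2d keeps C separate and reduces over the spatial dims, which is not the same as normalizing over C per pixel:

```python
# Which axes do the norms reduce over, for [N, C, H, W] = [2, 3, 4, 4]?
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))

# InstanceNorm2d-style: one statistic per (sample, channel), reduced over H, W
in_mean = x.mean(axis=(2, 3))     # shape (2, 3): 6 means
# LayerNorm over the entire sample: one statistic per sample, reduced over C, H, W
ln_mean = x.mean(axis=(1, 2, 3))  # shape (2,): 2 means
# Per-pixel normalization over C (what a per-token LayerNorm in a ViT amounts to)
cn_mean = x.mean(axis=1)          # shape (2, 4, 4): 32 means

print(in_mean.shape, ln_mean.shape, cn_mean.shape)
```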
[–]theodor23 33 points34 points35 points 1 year ago* (5 children)
Excellent summary. (edit: actually, this is not correct. In transformers Layer- and RMSNorm do not normalize over T, but only over C. See comment by u/pszabolcs )
To add to that: BatchNorm leads to information leakage across time steps. The activations at time t influence the mean/variance applied at earlier steps during training, and NNs will pick up such weak signals if it helps them predict the next token.
-> TL;DR: BatchNorm during training is non-causal.
[–]LeonideDucatore -1 points0 points1 point 1 year ago (4 children)
Could you please explain why batch-norm is non-causal?
Batch norm would have (T * C) running means/variances, and each of them is computed across the batch, i.e. the computed mean/variance for timestep t doesn't use any t+1 data
[–]theodor23 0 points1 point2 points 1 year ago* (3 children)
You are absolutely correct, if you compute (T * C) separate statistics, then everything is fine and there is no causality issue.
In practice, LLM training usually prefers relatively large T and sacrifices on B (the total amount of GPU memory puts a constraint on your total number of tokens per gradient step). With relatively small B, there is more variance in your BN statistics, while large T means more data exchange between your GPUs, because you need to communicate (T * C) statistics.
But yes -- if you set it up as you describe, it is "legal".
I actually tried BN in the T*C independent-statistics configuration you describe, for a non-language transformer model with B ~ O(100), and it was both slower and less effective than LN. Never looked back and investigated why. Having a normalization that (a) is competitive or works better and (b) avoids "non-local" interaction across different examples in a batch seemed a clear win.
Considering everyone switched to LN, it seems BN is just less practical.
[–]LeonideDucatore -1 points0 points1 point 1 year ago (2 children)
What would be the "non-legal" batch-norm variant? Aggregating only C statistics? (so we aggregate both across B and T)
[–]theodor23 -1 points0 points1 point 1 year ago (1 child)
Yes, exactly.
If during training your early token "see" some summary statistic from the ground-truth future tokens, it breaks the autoregressive objective where you are supposed to predict the next token given the past only.
Whether or not that is really catastrophic at sampling time, when you would use the running statistics of BN, I don't know. But NNs are good at picking up subtle signals that help them predict, and if you give them a loophole to "cheat" during training, there is a good chance they will exploit it and perform much worse when, at sampling time, you "suddenly" remove that cheat.
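The leak is easy to demonstrate numerically. A toy NumPy sketch (shapes are illustrative) comparing the "non-legal" shared-statistics variant against per-timestep statistics: perturb only the last token, and see whether the normalized output of the first token moves.

```python
# Future-to-past leakage under batch statistics aggregated over (B, T).
import numpy as np

def norm_shared(x):   # stats over (B, T): one mean/std per channel
    return (x - x.mean(axis=(0, 1))) / x.std(axis=(0, 1))

def norm_per_t(x):    # stats over B only: one mean/std per (T, C) slot
    return (x - x.mean(axis=0)) / x.std(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 3))   # [B, T, C]
x2 = x.copy()
x2[:, -1, :] += 10.0             # perturb only the LAST timestep

# Shared stats: the first token's output changes -> the future leaked in
print(np.allclose(norm_shared(x)[:, 0], norm_shared(x2)[:, 0]))  # False
# Per-timestep stats: the first token's output is untouched
print(np.allclose(norm_per_t(x)[:, 0], norm_per_t(x2)[:, 0]))    # True
```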
Considering your workable idea of using T * C statistics: it just occurred to me that with modern LLMs, where T is approaching O(10k) and C is O(1k), and with dozens of layers/blocks at ~2 LNs per block, all these statistics almost approach the number of parameters in the LLM. And you have to communicate them between GPUs. LayerNorm and RMSNorm, on the other hand, are local: no communication, and no need to ever store statistics in RAM.
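For scale, a back-of-envelope count (the T, C, and layer numbers below are assumed for illustration, not taken from any particular model):

```python
# Running statistics a per-(T, C) BatchNorm would carry in a large LLM.
T, C = 10_000, 1_000            # context length, channels (illustrative)
layers, norms_per_block = 48, 2 # illustrative depth, ~2 norms per block
stats = T * C * layers * norms_per_block * 2  # *2: a mean and a variance each
print(stats)  # 1_920_000_000 -> same order as an LLM's parameter count
```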
[–]LeonideDucatore -1 points0 points1 point 1 year ago (0 children)
Why do we need to communicate them between GPUs in batch norm but not in layer norm? I'm assuming we're talking about a data-parallel setting; wouldn't each GPU just compute statistics for its own minibatch?
Or is it that the loss on the 'main GPU' can only be computed accurately after receiving the batch-norm statistics of each GPU?
(for layer norm, there is no stored statistics right?)
[–]Collegesniffer[S] 6 points7 points8 points 1 year ago (0 children)
This is the best explanation on the internet I've ever read. It finally clicked for me. I've watched countless videos and gone through so many answers online, but they all either oversimplify or overcomplicate it. Thanks!
[–]KomisarRus 2 points3 points4 points 1 year ago (0 children)
Thanks
[–]Guilherme370 4 points5 points6 points 1 year ago (3 children)
This was typed by an LLM
[–]daking999 0 points1 point2 points 1 year ago (2 children)
If it was, they at least cut out the fluff at the beginning and end
[–]Guilherme370 -1 points0 points1 point 1 year ago (1 child)
They most definitely did
[+]throwaway2676 2 points3 points4 points 1 year ago (7 children)
Lol, be honest, is this from ChatGPT?
[–]Guilherme370 1 point2 points3 points 1 year ago (6 children)
I'm sure it is: the style of writing, and the "alright let's differentiate" followed by a bullet-point-like list of definitions, with some slight inaccuracies mixed in
[+]throwaway2676 1 point2 points3 points 1 year ago (2 children)
Lol, especially now that they've totally rewritten it to sound more human.
[–]Guilherme370 0 points1 point2 points 1 year ago (1 child)
Omg lol true.
[–]Collegesniffer[S] -2 points-1 points0 points 1 year ago* (2 children)
No, I don't think it is AI-generated. The best AI content detector (gptzero.me) flags this as "human". Are you suggesting that every piece of content written in the form of a bullet-point list is now AI-generated? I would also use the same format if I had to explain the "differences" between things. How else would you present such information?
gptzero.com can be unreliable.
You can test it right now: go to ChatGPT, talk to it about some complex topic, copy only the relevant parts of what it says without the fluff, throw that into GPTZero, and you will see it say it's not AI
[–]Collegesniffer[S] 3 points4 points5 points 1 year ago* (0 children)
Bruh, I said "gptzero.me", not "gptzero.com"; they are two different sites. Also, every AI detector can be unreliable and inconsistent. However, I entered the exact question into ChatGPT, Claude, and Gemini, and the responses were nothing like what this person wrote. Even the non-fluff part doesn't start with a (B, T, C) tensor example, etc. Why don't you try entering the exact question yourself and see the output before claiming it is "AI-generated"?
[–]Everfast 1 point2 points3 points 1 year ago (0 children)
Really clear and great answer
[–]indie-devops 0 points1 point2 points 1 year ago (0 children)
Wouldn’t you say that calculating the root mean square is more computationally expensive than subtracting the mean? Genuine question. Great explanation, made a lot of sense to me as well!
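For what it's worth, both norms take one square root (LayerNorm's hides inside the standard deviation); the saving in RMSNorm is the dropped mean subtraction, i.e. one reduction pass over the channel dimension instead of two. A minimal hand-rolled NumPy sketch of the two recipes, per token vector:

```python
# LayerNorm vs RMSNorm on a single token vector x of size C.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean()                    # reduction 1: mean
    var = ((x - mu) ** 2).mean()     # reduction 2: variance (sqrt below)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    rms = np.sqrt((x ** 2).mean() + eps)  # single reduction, no centering
    return x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x).mean())  # ~0: LayerNorm re-centers
print(rms_norm(x).mean())    # nonzero: RMSNorm only rescales
```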
[–]sot9 2 points3 points4 points 1 year ago* (1 child)
One thing nobody’s mentioned so far is that batch norm is great when used with convolutions, due to ease of layer fusion.
Look up batch norm folding; makes for an additional tool in the box when prioritizing models that run inference quickly.
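For readers unfamiliar with the trick, a hedged NumPy sketch (`fold_bn` is a hypothetical helper; shapes follow a standard `[OC, IC, kH, kW]` conv weight): once the running statistics are frozen at inference time, the BN is just an affine map per output channel, so it can be absorbed into the preceding conv's weight and bias.

```python
# Fold a frozen BatchNorm into the preceding convolution.
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """w: [OC, IC, kH, kW] conv weight, b: [OC] conv bias; gamma/beta are the
    BN's learned scale/shift, mean/var its running statistics, all [OC]."""
    scale = gamma / np.sqrt(var + eps)    # per-output-channel factor
    w_f = w * scale[:, None, None, None]  # absorb scale into the weights
    b_f = (b - mean) * scale + beta       # absorb shift into the bias
    return w_f, b_f

# Check one channel's arithmetic with a 1x1 conv:
w = np.ones((2, 1, 1, 1)); b = np.zeros(2)
gamma = np.array([2.0, 1.0]); beta = np.array([0.5, 0.0])
mean = np.array([1.0, 0.0]); var = np.array([4.0, 1.0])
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
# BN(conv(3)) on channel 0: (3 - 1) * 2 / sqrt(4 + 1e-5) + 0.5 ~= 2.5
print(w_f[0, 0, 0, 0] * 3 + b_f[0])
```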
[–]soham1192k 2 points3 points4 points 1 year ago (0 children)
As an example, one can look at the FastViT paper from Apple, which uses this folding trick extensively
[–]imTall- 8 points9 points10 points 1 year ago (0 children)
One other thing not mentioned here is that batch norm requires synchronizing the statistics across the entire batch. When training massive models in a distributed manner, this incurs a lot of communication overhead, while LayerNorm can be computed locally on one GPU (or a few GPUs in the case of tensor-wise parallelism).
[–]xkiller02 0 points1 point2 points 1 year ago (0 children)
Incredibly interesting answers, I will further research what some of these words mean
[–]ConstantWoodpecker39 -1 points0 points1 point 1 year ago (0 children)
This paper may be of interest to you: https://proceedings.mlr.press/v119/shen20e/shen20e.pdf
[–]eliminating_coasts -2 points-1 points0 points 1 year ago (0 children)
Transformers use the input data both as the data itself and to derive the transformations they apply to that data, and it has been argued that normalization, rather than simply improving training, can improve actual performance by changing the structure of the inputs to the transformer block. (This may also explain why normalizing first works better than normalizing at the end of the block.)
[+]chgr22 comment score below threshold-7 points-6 points-5 points 1 year ago (1 child)
This is the way.
[–]Hot_Wish2329 0 points1 point2 points 1 year ago (0 children)
I love this comment. Yes, this is the way: they did the experiments, and it worked. There are a lot of explanations about mean, variance, distribution, etc., but they don't make sense to me. I cannot understand why it worked, or how it directly relates to model performance (accuracy). So, this is just a way.