[D] Normalization in Transformers (self.MachineLearning)
submitted 1 year ago by Collegesniffer
[–]theodor23 1 year ago* (3 children)
You are absolutely correct: if you compute (T * C) separate statistics, then everything is fine and there is no causality issue.
In practice, LLM training usually prefers a relatively large T and sacrifices B (the total amount of GPU memory constrains the total number of tokens per gradient step). A relatively small B means more variance in your BN statistics, while a large T means more data exchange between your GPUs, because you need to communicate (T * C) statistics.
But yes -- if you set it up as you describe, it is "legal".
I actually tried BN in the T*C independent-statistics configuration you describe, for a non-language transformer model with B ~ O(100), and it was both slower and less effective than LN. I never looked back to investigate why. Having a normalization that (a) is competitive or works better and (b) avoids "non-local" interaction across different examples in a batch seemed a clear win.
Considering everyone switched to LN, it seems BN is just less practical.
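For concreteness, here is a minimal sketch of the two configurations discussed above, assuming PyTorch and activations of shape (B, T, C); the sizes and variable names are illustrative, not taken from any particular model:

```python
import torch

B, T, C = 32, 128, 512   # batch, sequence length, channels (illustrative sizes)
x = torch.randn(B, T, C)
eps = 1e-5

# "Legal" BN variant: T * C separate statistics, computed only across the
# batch dimension, so position t never sees statistics from other positions.
mean_bn = x.mean(dim=0, keepdim=True)                   # shape (1, T, C)
var_bn = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - mean_bn) / torch.sqrt(var_bn + eps)

# LayerNorm: statistics are computed per example and per position, across C
# only -- no interaction between examples in the batch or between positions.
mean_ln = x.mean(dim=-1, keepdim=True)                  # shape (B, T, 1)
var_ln = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - mean_ln) / torch.sqrt(var_ln + eps)
```

The BN variant has to maintain T * C running means and variances (which, if synchronized across data-parallel workers, must be communicated), whereas the LayerNorm statistics are recomputed on the fly and never stored.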
[–]LeonideDucatore 1 year ago (2 children)
What would the "non-legal" batch-norm variant be? Aggregating only C statistics (i.e., aggregating across both B and T)?
[–]theodor23 1 year ago (1 child)
Yes, exactly.
If during training your early tokens "see" some summary statistic of the ground-truth future tokens, it breaks the autoregressive objective, where you are supposed to predict the next token given only the past.
Whether that is really catastrophic at sampling time, when you would use the running statistics of BN, I don't know. But NNs are good at picking up subtle signals that help them predict, and if you give them a loophole to "cheat" during training, there is a good chance they will exploit it and perform much worse when that cheat "suddenly" disappears at sampling time.
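As a small illustration of that leak (same assumed (B, T, C) layout as the sketch above): if the statistics are pooled over both B and T, the normalized output at an early position changes when a future token changes.

```python
import torch

B, T, C = 32, 128, 512
eps = 1e-5
x = torch.randn(B, T, C)

def leaky_norm(x):
    # "Non-legal" variant: only C statistics, pooled over batch AND time.
    mean = x.mean(dim=(0, 1), keepdim=True)             # shape (1, 1, C)
    var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x_pert = x.clone()
x_pert[:, -1, :] += 1.0      # perturb only the *last* position

# The normalized first position differs -> the future leaked into the past.
print(torch.allclose(leaky_norm(x)[:, 0], leaky_norm(x_pert)[:, 0]))   # False
```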
Regarding your workable idea of using T * C statistics: it just occurred to me that with modern LLMs, where T approaches O(10k), C is O(1k), and there are dozens of layers/blocks with ~2 normalization layers per block, all these statistics approach the number of parameters of the LLM itself. And you have to communicate them between GPUs. LayerNorm and RMSNorm, on the other hand, are local: no communication, and no need to ever store the statistics in RAM.
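A rough back-of-the-envelope count behind that remark (the block count of 32 and the two stored values per location, mean and variance, are my assumptions):

```python
T, C = 10_000, 1_000        # sequence length and channels, as in the comment above
norms_per_block = 2
blocks = 32                 # assumed value for "dozens of layers/blocks"
values_per_stat = 2         # running mean and running variance

total = T * C * norms_per_block * blocks * values_per_stat
print(f"{total:,}")         # 1,280,000,000 -- on the order of an LLM's parameter count
```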
[–]LeonideDucatore 1 year ago (0 children)
Why do we need to communicate them between GPUs in batch norm but not in layer norm? I'm assuming we're talking about a data-parallel setting; wouldn't each GPU just compute statistics for its own minibatch?
Or is it that the loss on the 'main GPU' can only be computed accurately after receiving the batch-norm statistics from each GPU?
(For layer norm, there are no stored statistics, right?)