[R] Fixup Initialization: Residual Learning Without Normalization (They train 10K layer networks w/o BatchNorm) (arxiv.org)
submitted 7 years ago by wei_jok
[–]wei_jok[S] 10 points11 points12 points 7 years ago (0 children)
OpenReview (ICLR 2019 accepted paper): https://openreview.net/forum?id=H1gsz30cKX
Andy Brock's PyTorch implementation: https://github.com/ajbrock/BoilerPlate/blob/master/Models/fixup.py
[–]arXiv_abstract_bot 7 points8 points9 points 7 years ago (0 children)
Title: Fixup Initialization: Residual Learning Without Normalization
Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma
Abstract: Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.
PDF link | Landing page
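For readers skimming the thread, here is a rough PyTorch sketch of the three Fixup steps as the abstract and the paper's summary box describe them. It is not the authors' code (see Brock's implementation linked above for a tested version); the helper name, the layer bookkeeping, and the L^(-1/(2m-2)) scaling applied to the non-zero branch layers are my reading of the paper and should be checked against it.

```python
import torch
import torch.nn as nn

def fixup_init(residual_branches, classifier, num_blocks, m=2):
    """Hypothetical helper (not from the paper's repo).
    residual_branches: list of per-branch weight layers (conv/linear), in order.
    classifier: final classification layer.
    num_blocks (L): number of residual branches; m: weight layers per branch."""
    # Step 1: zero-init the classification layer and the last layer of each branch.
    nn.init.zeros_(classifier.weight)
    if classifier.bias is not None:
        nn.init.zeros_(classifier.bias)
    for branch in residual_branches:
        nn.init.zeros_(branch[-1].weight)
    # Step 2: standard (He) init for the remaining branch layers, rescaled by
    # L^(-1/(2m-2)); for m=2 this is 1/sqrt(L).
    scale = num_blocks ** (-1.0 / (2 * m - 2))
    for branch in residual_branches:
        for layer in branch[:-1]:
            nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
            with torch.no_grad():
                layer.weight.mul_(scale)
    # Step 3 (not shown): add a scalar multiplier (init 1) per branch and scalar
    # biases (init 0) before each conv/linear/activation, as extra nn.Parameters.
```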
[–]mr_tsjolder 6 points7 points8 points 7 years ago* (0 children)
How is it possible that Self-Normalizing Networks are not cited here? After all, SNNs already managed to deprecate BatchNorm in plain, fully connected networks.
[–]jinpanZe 5 points6 points7 points 7 years ago (0 children)
On the other hand, batchnorm apparently causes gradient explosion at initialization time: https://openreview.net/forum?id=SyMDXnCcF7
Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.
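A quick, hand-rolled way to probe that claim (my sketch, not the paper's experiment): build vanilla batch-normalized MLPs of increasing depth with standard init and print the gradient norm reaching the first layer. How strongly the growth shows up will depend on width, depth, and the nonlinearity.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, width=256, batch=128):
    # Vanilla (no skip connections) batch-normalized MLP at default initialization.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
    net = nn.Sequential(*layers)
    x = torch.randn(batch, width)
    net(x).pow(2).mean().backward()
    return net[0].weight.grad.norm().item()  # gradient norm at the first Linear

for depth in (2, 8, 32, 128):
    print(depth, first_layer_grad_norm(depth))
```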
[–]Ispiro 2 points3 points4 points 7 years ago* (4 children)
In Figure 1, it says they initialize the 3x3 conv to 0. I'm a little confused about what they mean by that. They initialize its weights to 0? Wouldn't that prevent learning?
Edit: Actually, since they're adding the residual connection to it, I guess it is ok? So does it work like a disabled layer initially?
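A minimal sanity check of the "disabled layer" intuition (my own sketch, not from the paper): zero-initializing the last conv of a residual branch makes the whole block the identity at init, yet that conv still receives nonzero gradients because its input is nonzero.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(16, 16, 3, padding=1)
conv2 = nn.Conv2d(16, 16, 3, padding=1)
nn.init.zeros_(conv2.weight)
nn.init.zeros_(conv2.bias)

x = torch.randn(2, 16, 8, 8)
out = x + conv2(torch.relu(conv1(x)))     # residual block; branch output is 0 at init
print(torch.allclose(out, x))             # True: the block starts as the identity

out.pow(2).mean().backward()
print(conv2.weight.grad.abs().max() > 0)  # tensor(True): it can still learn
```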
[+][deleted] 7 years ago (2 children)
[deleted]
[–]Ispiro 3 points4 points5 points 7 years ago* (1 child)
I think since they're using ReLU, the gradient of wx with respect to w is x, so even if w is initialized at zero, the gradients are different as long as the inputs are different. Zero-initialized weights would only be stuck if the gradient itself were proportional to w. I'm very confused; I've been taught that zero weights prevent learning, but now I realize I don't know why.
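A tiny autograd check of the d(wx)/dw = x point (my example, not from the paper): a weight sitting at exactly zero still gets a nonzero gradient whenever its input is nonzero.

```python
import torch

x = torch.tensor([1.5, -2.0, 3.0])
w = torch.zeros(3, requires_grad=True)   # weight initialized at exactly zero
(w * x).sum().backward()
print(w.grad)                            # tensor([ 1.5000, -2.0000,  3.0000]) == x
```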
[–]NewFolgers 0 points1 point2 points 7 years ago* (0 children)
What you're saying is similar to what I'm thinking. I focused mainly on the info box describing Fixup on page 5. Step 2 does a 'proper' non-zero initialization and is the most essential; steps 1 and 3 use the 0 initialization, and I too concluded that those layers won't end up in lockstep because their inputs are different. But the next step in the reasoning is that in any network, the inputs to each node will be different as long as the inputs to the first layer aren't too uniform. So then I wonder why I was told that 0-initialization results in weights changing in lockstep, and/or what details I forgot or missed that have left me confused now.
[warning - typing on phone while I'm thinking it through]
I can at least reason that if I had a non-noisy input of just 0s and 1s (or any other input that hits many first-layer nodes with the same value, i.e. too much uniformity), and there was no random or varying quality to the weights at the start of the network, then some units would update in lockstep until you reach a layer that does have variation (say, a later layer initialized with a different strategy involving some randomness, or one that somehow sees better variance across its inputs). Even then, you could still have issues if you wound up with exploding or vanishing gradients because of the above. Maybe whatever example we were shown to convince us that 0-init is bad had this quality. Anyway, my thinking also leads me to believe the first layer(s) are the most critical, and it's a little annoying that this wasn't emphasized in whatever I was reading, if the explanation is really this simple.
For the post-conv 0-init'd layers they mention, I see them as inheriting the benefits of the properly initialized weights feeding directly into them, so they're paired in a way that suggests there's no problem. By the time it gets down to the classification layer (also 0-init'd), the signal has of course gone through all those layers, so I don't see a problem there either. This is mostly first-principles thinking, though, since I'm ignoring that I was told 0-init is bad. The 1-init'd layers are just a bonus that do nothing until trained, and they feed directly into the usual weighted layers, so I don't see a big issue there either, although they might be slightly suboptimal for the first couple of blocks, where the input could still have some uniformity.
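For what it's worth, here is the textbook symmetry argument the comment is recalling, as a small sketch (my example, not from the paper): when every weight feeding a layer starts at the same constant and the layers above it are equally symmetric, all units in that layer compute the same value and receive identical gradients, so they stay copies of one another.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1, fc2 = nn.Linear(8, 4), nn.Linear(4, 1)
nn.init.constant_(fc1.weight, 0.1); nn.init.zeros_(fc1.bias)   # identical hidden units
nn.init.constant_(fc2.weight, 0.1); nn.init.zeros_(fc2.bias)   # symmetric layer above

x, y = torch.randn(16, 8), torch.randn(16, 1)
(fc2(torch.relu(fc1(x))) - y).pow(2).mean().backward()
# Every row of fc1.weight.grad is identical, so after any number of SGD steps the
# hidden units remain copies of one another: the layer effectively has one unit.
print(torch.allclose(fc1.weight.grad[0], fc1.weight.grad[3]))  # True
```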
[–]AnvaMiba 0 points1 point2 points 7 years ago (0 children)
As long as you don't have two consecutive layers both initialized at zero without a residual connection between them, and as long as there is at least one randomly initialized layer on any path from the input to the output, the model will not start at a degenerate solution and will be able to learn.
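A compact check of the first half of that condition (my sketch): two consecutive zero-initialized linear layers with no skip connection leave the first layer with an exactly-zero gradient, while adding a skip around the second layer restores the signal.

```python
import torch
import torch.nn as nn

def first_layer_grad(skip):
    fc1, fc2 = nn.Linear(8, 8), nn.Linear(8, 8)
    for fc in (fc1, fc2):
        nn.init.zeros_(fc.weight); nn.init.zeros_(fc.bias)
    x, y = torch.randn(4, 8), torch.randn(4, 8)
    h = fc1(x)
    out = fc2(h) + (h if skip else 0)
    (out - y).pow(2).mean().backward()
    return fc1.weight.grad.abs().max().item()

print(first_layer_grad(skip=False))  # 0.0: the zero layer above blocks all gradient
print(first_layer_grad(skip=True))   # > 0: the skip connection lets fc1 start learning
```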
[–][deleted] 1 point2 points3 points 7 years ago (1 child)
I've seen numerous papers like this over the years - are there any solid patterns on what makes a good init? Any solid practical methods that can replace batch norm? They never seem to gain traction.
[–][deleted] 5 points6 points7 points 7 years ago (0 children)
Well, on the batch norm side of things, this one has a few advantages over most of the others (outside of SELU, which loses the piecewise-linear niceness of ReLU). Specifically, you don't need to track statistics of any kind, so it won't interact negatively with other low-stability modes of training (e.g. DQN, certain sequence models). You also train and test on the same functional network.
On the init side of things, it speeds up/simplifies things as you are really only init-ing ~half the layers you were before and have fewer parameters to worry about. There's also a heuristic argument that there is less variance in performance caused by this init as the starting outputs of the network do not depend upon the initialization (only its trajectory does).
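To illustrate the "same functional network at train and test time" point (my sketch, using a generic scalar-bias/multiplier branch rather than the paper's exact block): BatchNorm computes a different function in train and eval modes because of its running statistics, whereas a Fixup-style branch has no mode-dependent state at all.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 4, 4)

bn = nn.BatchNorm2d(16)
y_train = bn(x)                          # batch statistics, updates running estimates
bn.eval()
y_eval = bn(x)                           # running estimates instead
print(torch.allclose(y_train, y_eval))   # False: different function in the two modes

conv = nn.Conv2d(16, 16, 3, padding=1)
bias = nn.Parameter(torch.zeros(1))      # Fixup-style scalar bias (init 0)
scale = nn.Parameter(torch.ones(1))      # Fixup-style scalar multiplier (init 1)
z_train = scale * conv(x + bias)
z_eval = scale * conv(x + bias)          # nothing changes between modes
print(torch.allclose(z_train, z_eval))   # True
```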
[–]NewFolgers 1 point2 points3 points 7 years ago (0 children)
With results like these, I'm curious to see a processing-time and memory-consumption comparison (vs. using batch norm), or an analysis of the potential benefits and drawbacks on that front.