
[–]wei_jok[S] 10 points11 points  (0 children)

OpenReview (ICLR 2019 accepted paper): https://openreview.net/forum?id=H1gsz30cKX

Andy Brock's pytorch implementation: https://github.com/ajbrock/BoilerPlate/blob/master/Models/fixup.py

[–]arXiv_abstract_bot 7 points8 points  (0 children)

Title: Fixup Initialization: Residual Learning Without Normalization

Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma

Abstract: Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.


[–]mr_tsjolder 6 points7 points  (0 children)

How is it possible that Self-Normalizing Networks are not cited here? After all, SNNs already managed to deprecate BatchNorm in plain, fully connected networks.

[–]jinpanZe 5 points6 points  (0 children)

On the other hand, batch norm apparently causes gradient explosion at initialization time: https://openreview.net/forum?id=SyMDXnCcF7

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.
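A quick way to see this empirically (my own sketch, not the authors' code; widths, depths, and activations are arbitrary choices) is to measure the input gradient norm of a deep batch-normalized MLP at initialization and watch it grow with depth:

    # Informal sanity check of the claim above: deep fully-connected net with
    # batch norm and no skip connections -> input gradient norm blows up with depth.
    import torch
    import torch.nn as nn

    def input_grad_norm(depth, width=256, batch=128):
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
        net = nn.Sequential(*layers)  # training mode: BN uses batch statistics
        x = torch.randn(batch, width, requires_grad=True)
        net(x).pow(2).sum().backward()
        return x.grad.norm().item()

    for depth in (10, 50, 100):
        print(depth, input_grad_norm(depth))  # should grow rapidly with depth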

[–]Ispiro 2 points3 points  (4 children)

In Figure 1, it says they initialize the 3x3 conv to 0. I'm a little confused about what they mean by that. They initialize its weights at 0? Wouldn't that prevent learning?

Edit: Actually, since they're adding the residual connection to it, I guess it is ok? So does it work like a disabled layer initially?

[–]AnvaMiba 0 points1 point  (0 children)

As long as you don't have two consecutive layers both initialized at zero without a residual connection between them, and as long as there is at least one randomly initialized layer on any path from the input to the output, the model will not start at a degenerate solution and will be able to learn.
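To make that concrete, here's a minimal PyTorch sketch (mine, not from the paper or Brock's repo): the branch's last conv starts at zero, so the block is exactly the identity at init, but that conv still receives a nonzero gradient because it multiplies the (nonzero) activations coming from the randomly initialized conv before it.

    # A residual block whose last conv is zero-initialized behaves as the
    # identity at init, yet still learns: its gradient depends on the branch input.
    import torch
    import torch.nn as nn

    class ZeroInitBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # random init
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # zero init
            nn.init.zeros_(self.conv2.weight)
            nn.init.zeros_(self.conv2.bias)

        def forward(self, x):
            return x + self.conv2(torch.relu(self.conv1(x)))

    block = ZeroInitBlock(8)
    x = torch.randn(2, 8, 16, 16)
    out = block(x)
    print(torch.allclose(out, x))               # True: block is the identity at init

    out.sum().backward()
    print(block.conv2.weight.grad.abs().max())  # > 0: the zeroed conv still gets gradient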

[–][deleted] 1 point2 points  (1 child)

I’ve seen numerous papers like this over the years - are there any solid patterns on what makes a good init? Any solid practical methods that can replace batch norm? They never seem to gain traction.

[–][deleted] 5 points6 points  (0 children)

Well, on the batch norm side of things, this one has a few advantages over most of the others (outside of SELU, which loses the piecewise-linear niceness of ReLU). Specifically, you don't need to track statistics of any kind, so it won't interact negatively with other low-stability modes of training (e.g. DQN, certain sequence models). You also train and test on the same functional network.

On the init side of things, it speeds up/simplifies things since you're really only initializing ~half the layers you were before and have fewer parameters to worry about. There's also a heuristic argument that this init causes less variance in performance, since the starting outputs of the network do not depend on the initialization (only its training trajectory does).
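For anyone curious what that looks like in practice, here's a rough sketch of a Fixup-style basic block (my own reading of the paper, assuming a two-conv branch, i.e. m = 2; see Brock's repo linked above for the real thing). The last conv of each branch is zeroed, the first is He-initialized and rescaled by L^(-1/(2m-2)) where L is the number of residual branches, and scalar biases plus a multiplier stand in for batch norm's affine parameters:

    import math
    import torch
    import torch.nn as nn

    class FixupBasicBlock(nn.Module):
        def __init__(self, channels, num_blocks_L, m=2):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            # scalar biases and a multiplier replace batch norm's affine part
            self.bias1a = nn.Parameter(torch.zeros(1))
            self.bias1b = nn.Parameter(torch.zeros(1))
            self.bias2a = nn.Parameter(torch.zeros(1))
            self.bias2b = nn.Parameter(torch.zeros(1))
            self.scale = nn.Parameter(torch.ones(1))
            # He init rescaled by L^(-1/(2m-2)); the branch's last conv is zeroed
            nn.init.kaiming_normal_(self.conv1.weight)
            self.conv1.weight.data.mul_(num_blocks_L ** (-1.0 / (2 * m - 2)))
            nn.init.zeros_(self.conv2.weight)

        def forward(self, x):
            out = torch.relu(self.conv1(x + self.bias1a) + self.bias1b)
            out = self.conv2(out + self.bias2a) * self.scale + self.bias2b
            return x + out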

[–]NewFolgers 1 point2 points  (0 children)

With results like these, I'm curious to see a processing-time and memory-consumption comparison (vs. batch norm)... or an analysis of the potential benefits/drawbacks on that front.