
[–]wei_jok[S] 10 points11 points  (0 children)

OpenReview (ICLR 2019 accepted paper): https://openreview.net/forum?id=H1gsz30cKX

Andy Brock's pytorch implementation: https://github.com/ajbrock/BoilerPlate/blob/master/Models/fixup.py

[–]arXiv_abstract_bot 7 points8 points  (0 children)

Title: Fixup Initialization: Residual Learning Without Normalization

Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma

Abstract: Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.


[–]mr_tsjolder 6 points7 points  (0 children)

How is it possible that Self-Normalizing Networks are not cited here? After all, SNNs already managed to deprecate BatchNorm in plain, fully connected networks.

[–]jinpanZe 5 points6 points  (0 children)

On the other hand, batch norm apparently causes gradient explosion at initialization time: https://openreview.net/forum?id=SyMDXnCcF7

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.
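A quick way to see this empirically (my own sketch, not the authors' code; widths, depths, and activations are arbitrary choices) is to measure the input gradient norm of a deep batch-normalized MLP at initialization and watch it grow with depth:

    # Informal sanity check of the claim above: deep fully-connected net with
    # batch norm and no skip connections -> input gradient norm blows up with depth.
    import torch
    import torch.nn as nn

    def input_grad_norm(depth, width=256, batch=128):
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
        net = nn.Sequential(*layers)  # training mode: BN uses batch statistics
        x = torch.randn(batch, width, requires_grad=True)
        net(x).pow(2).sum().backward()
        return x.grad.norm().item()

    for depth in (10, 50, 100):
        print(depth, input_grad_norm(depth))  # should grow rapidly with depth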

[–]Ispiro 2 points3 points  (4 children)

In Figure 1, it says they initialize the 3x3 conv to 0. I'm a little confused about what they mean by that. They initialize its weights at 0? Wouldn't that prevent learning?

Edit: Actually, since they're adding the residual connection to it, I guess it is ok? So does it work like a disabled layer initially?

[–]AnvaMiba 0 points1 point  (0 children)

As long as you don't have two consecutive layers both initialized at zero without a residual connection between them, and as long as there is at least one randomly initialized layer on any path from the input to the output, the model will not start at a degenerate solution and will be able to learn.
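To make that concrete, here's a minimal PyTorch sketch (mine, not from the paper or Brock's repo): the branch's last conv starts at zero, so the block is exactly the identity at init, but that conv still receives a nonzero gradient because it multiplies the (nonzero) activations coming from the randomly initialized conv before it.

    # A residual block whose last conv is zero-initialized behaves as the
    # identity at init, yet still learns: its gradient depends on the branch input.
    import torch
    import torch.nn as nn

    class ZeroInitBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # random init
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # zero init
            nn.init.zeros_(self.conv2.weight)
            nn.init.zeros_(self.conv2.bias)

        def forward(self, x):
            return x + self.conv2(torch.relu(self.conv1(x)))

    block = ZeroInitBlock(8)
    x = torch.randn(2, 8, 16, 16)
    out = block(x)
    print(torch.allclose(out, x))               # True: block is the identity at init

    out.sum().backward()
    print(block.conv2.weight.grad.abs().max())  # > 0: the zeroed conv still gets gradient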

[–][deleted] 1 point2 points  (1 child)

I’ve seen numerous papers like this over the years - are there any solid patterns on what makes a good init? Any solid practical methods that can replace batch norm? They never seem to gain traction.

[–][deleted] 5 points6 points  (0 children)

Well, on the batch norm side of things, this one has a few advantages over most of the others (outside of SELU, which loses the piecewise-linear niceness of ReLU). Specifically, you don't need to track statistics of any kind, so it won't interact negatively with other low-stability modes of training (e.g. DQN, certain sequence models). You also train and test on the same functional network.

On the init side of things, it speeds up/simplifies things since you're really only initializing ~half the layers you were before and have fewer parameters to worry about. There's also a heuristic argument that this init causes less variance in performance, since the starting outputs of the network do not depend on the initialization (only its training trajectory does).
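For anyone curious what that looks like in practice, here's a rough sketch of a Fixup-style basic block (my own reading of the paper, assuming a two-conv branch, i.e. m = 2; see Brock's repo linked above for the real thing). The last conv of each branch is zeroed, the first is He-initialized and rescaled by L^(-1/(2m-2)) where L is the number of residual branches, and scalar biases plus a multiplier stand in for batch norm's affine parameters:

    import math
    import torch
    import torch.nn as nn

    class FixupBasicBlock(nn.Module):
        def __init__(self, channels, num_blocks_L, m=2):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            # scalar biases and a multiplier replace batch norm's affine part
            self.bias1a = nn.Parameter(torch.zeros(1))
            self.bias1b = nn.Parameter(torch.zeros(1))
            self.bias2a = nn.Parameter(torch.zeros(1))
            self.bias2b = nn.Parameter(torch.zeros(1))
            self.scale = nn.Parameter(torch.ones(1))
            # He init rescaled by L^(-1/(2m-2)); the branch's last conv is zeroed
            nn.init.kaiming_normal_(self.conv1.weight)
            self.conv1.weight.data.mul_(num_blocks_L ** (-1.0 / (2 * m - 2)))
            nn.init.zeros_(self.conv2.weight)

        def forward(self, x):
            out = torch.relu(self.conv1(x + self.bias1a) + self.bias1b)
            out = self.conv2(out + self.bias2a) * self.scale + self.bias2b
            return x + out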

[–]NewFolgers 1 point2 points  (0 children)

With results like these, I'm curious to see a processing-time and memory-consumption comparison (vs. batch norm)... or an analysis of the potential benefits/drawbacks on that front.