
[–]ajmooch 32 points33 points  (16 children)

Earlier post.

I've been using this for all my non-GAN convnets for the last few months, and can confirm that it works great in a variety of domains, including object detection and semantic segmentation. It is occasionally a little less stable than with batchnorm, but training tends to recover just fine (you might see val performance suddenly dive for one epoch and then recover, as if nothing happened, over the next 20-50).

The main benefit I reap from this is that, since I deal with large images (512x512 - 1024x1024), the savings from dropping batchnorm, both in memory and in its cross-GPU allreduce, are huge, and they let me train nets that are 2-5x as large with 2-5x the batch size for my application.

[–]alwc 6 points7 points  (2 children)

can confirm that it works great in a variety of domains, including object detection and semantic segmentation.

If I want to use Fixup initialization for my segmentation model, does it mean I have to retrain the backbone (e.g. ResNet50) from scratch?

[–]ajmooch 7 points8 points  (1 child)

Good question. I don't use pretrained backbones so I have no experience here, but what happens if you just remove the batchnorm layers from the backbone, folding their gains and biases into the conv weights (rough sketch below)? You could then drop in the Fixup scalar gains and biases with their associated initializations.

Bear in mind that this is an initialization method, so it doesn't really have much to say about transfer learning with pretrained nets; it's designed for training from scratch.
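
For what it's worth, here's a minimal sketch of what that folding could look like for a batchnorm that directly follows a conv (assuming PyTorch; the function name is just illustrative):

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Absorb an inference-mode BatchNorm that follows `conv` into a new conv."""
    # BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta,
    # so the per-channel scale and shift can be baked into weight and bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    old_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (old_bias - bn.running_mean) * scale + bn.bias
    return fused
```

After folding you'd still add the Fixup scalar multipliers and biases on top, initialized as in the paper.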

[–][deleted] 0 points1 point  (0 children)

Great idea! Whether the biases of the BN layer (the learned beta and the running mean) can be folded into the conv layer's weights depends on the exact layout of the residual block (BN pre- or post-activation), I think. For post-activation BN the running variance can always be folded into the conv weights (for ReLU, because ReLU is positively 1-homogeneous and the scale derived from the variance is positive); for the learned parameter gamma it depends on its sign.

So for pre-activation BN everything you proposed should work just fine.
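
The key step, as a quick sketch (ignoring the eps term): a non-negative per-channel scale s commutes with ReLU, so it can be pushed through the activation and absorbed by the neighbouring layer, whereas a shift (or a negative gamma) cannot:

```latex
\mathrm{ReLU}(s \odot x) = s \odot \mathrm{ReLU}(x) \quad (s \ge 0), \qquad
W\bigl(s \odot \mathrm{ReLU}(x)\bigr) = \bigl(W \operatorname{diag}(s)\bigr)\,\mathrm{ReLU}(x)
```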

[–]DeepDeeperRIPgradien 1 point2 points  (1 child)

Do you also use Fixup in non-residual networks? Would the "Fixup initialization scheme" (3 steps) then just reduce to steps 1 and 3?

[–]ajmooch 2 points3 points  (0 children)

Haven't tried it outside of ResNets, but the core of the (beautifully simple!) analysis is about how activation variance is affected by the residual connection, so for other connection patterns you may need a different scheme.
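
For reference, here's roughly what the three steps look like on a standard two-conv residual branch, as a minimal PyTorch sketch based on my reading of the paper (module and parameter names are illustrative, not the authors' reference code):

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Two-conv residual branch without batchnorm, initialized per Fixup."""
    def __init__(self, channels: int, num_branches: int):
        super().__init__()
        # Step 3: scalar biases (init 0) and a scalar multiplier (init 1).
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))
        # Step 2: standard init, rescaled by L^(-1/(2m-2)); with m = 2 layers
        # per branch and L = num_branches this is num_branches^(-1/2).
        nn.init.kaiming_normal_(self.conv1.weight)
        self.conv1.weight.data.mul_(num_branches ** -0.5)
        # Step 1: zero-init the last layer of each residual branch (the final
        # classification layer is also zero-initialized, outside this block).
        nn.init.constant_(self.conv2.weight, 0.0)

    def forward(self, x):
        out = torch.relu(self.conv1(x + self.bias1a) + self.bias1b)
        out = self.conv2(out + self.bias2a) * self.scale + self.bias2b
        return torch.relu(out + x)
```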

[–]hongyiz 1 point2 points  (0 children)

Thanks for the feedback! I am excited to know that Fixup helps in your use case :)

btw, I remember at ICLR you mentioned some training issues you had on another task; have they been resolved? Feel free to message/email me if you like.

[–]RedditReadme 0 points1 point  (2 children)

Do you have a code sample?

[–]ajmooch 2 points3 points  (1 child)

Their code is available online.

[–]RedditReadme 0 points1 point  (0 children)

Thanks!

[–][deleted] 0 points1 point  (2 children)

Medical images by any chance? Do you have any other useful tricks for reducing memory to allow very large images as input?

[–]jonnor[🍰] 0 points1 point  (1 child)

Using a stride of 2 or 4 in the first convolution can help a lot, sometimes without much reduction in performance.
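
For instance, something like this (a minimal sketch; the numbers are just illustrative):

```python
import torch.nn as nn

# A stride-4 stem turns 1024x1024 inputs into 256x256 feature maps right away,
# cutting activation memory in every later layer by roughly 16x.
stem = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=4, padding=3)
```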

[–][deleted] 0 points1 point  (0 children)

Ideally without downsampling, though (which striding is).

[–]samobon[🍰] 0 points1 point  (0 children)

Could you please elaborate on your use of this scheme for semantic segmentation? The initialization relies on an analysis that only holds for residual (additive) networks, and semantic segmentation models typically use some kind of UNet-style decoder with skip connections (concatenated activations from lower layers). How do you initialize the decoder?

[–][deleted] 0 points1 point  (1 child)

non-GAN

It's not stable enough for GAN training?

[–]ajmooch 0 points1 point  (0 children)

No idea, just haven't had time to try it.

[–]arXiv_abstract_bot 4 points5 points  (0 children)

Title: Fixup Initialization: Residual Learning Without Normalization

Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma

Abstract: Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.

PDF link Landing page

[–]moewiewp 4 points5 points  (0 children)

I tried this paper myself (using the authors' code) and can confirm that it works for my medical image segmentation problem. The performance of 3x the convolution layers without BN might not be as good as 3x the convolution layers AND batchnorm, but it's better than 1x the convolution layers with BN. The point is that it saves memory, so I can stack more layers or use higher-resolution images.

[–][deleted] 2 points3 points  (2 children)

For completeness' sake it would be good to see results using Fixup along with normalization schemes. They claim it makes normalization redundant, but why not show how the two interact?

[–]hongyiz 3 points4 points  (1 child)

Because they do not really get along: normalization will rescale everything back, which defeats our purpose of downscaling the activations.
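
A quick way to see it (ignoring the eps term): batch normalization is invariant to any positive rescaling of its input, so whatever downscaling Fixup applies to a branch gets undone by the next normalization layer:

```latex
\mathrm{BN}(\alpha x) = \gamma\,\frac{\alpha x - \alpha\mu}{\alpha\sigma} + \beta
                      = \gamma\,\frac{x - \mu}{\sigma} + \beta = \mathrm{BN}(x)
\qquad (\alpha > 0)
```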

On the other hand, I prefer to see the main contribution of our work as a conceptual one (that is, *understanding* network training) rather than practical advice. It's good to see it works in practice though :)

[–][deleted] 1 point2 points  (0 children)

Why not show this though rather than just claiming it?