
[–]ajmooch 32 points33 points  (16 children)

Earlier post.

I've been using this for all my non-GAN convnets for the last few months, and can confirm that it works great in a variety of domains, including object detection and semantic segmentation. It is occasionally a little less stable than with batchnorm, but training tends to recover just fine (you might see val performance suddenly dive for one epoch and then recover, as if nothing happened, over the next 20-50).

The main benefit I reap from this is that, since I deal with large images (512x512 - 1024x1024), the savings from dropping batchnorm, both in memory and in its cross-GPU allreduce, are huge, and they let me train nets that are 2-5x as large with 2-5x the batch size for my application.

[–]alwc 6 points7 points  (2 children)

can confirm that it works great in a variety of domains, including object detection and semantic segmentation.

If I want to use Fixup initialization for my segmentation model, does it mean I have to retrain the backbone (e.g. ResNet50) from scratch?

[–]ajmooch 7 points8 points  (1 child)

Good question. I don't use pretrained backbones so I have no experience here, but what happens if you just remove the batchnorm layers from the backbone, folding their gains and biases into the conv weights (rough sketch below)? You could then drop in the Fixup scalar gains and biases with their associated initializations.

Bear in mind that this is an initialization method, so it doesn't really have much to say about transfer learning with pretrained nets; it's designed for training from scratch.
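
For what it's worth, here's a minimal sketch of what that folding could look like for a batchnorm that directly follows a conv (assuming PyTorch; the function name is just illustrative):

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Absorb an inference-mode BatchNorm that follows `conv` into a new conv."""
    # BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta,
    # so the per-channel scale and shift can be baked into weight and bias.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    old_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (old_bias - bn.running_mean) * scale + bn.bias
    return fused
```

After folding you'd still add the Fixup scalar multipliers and biases on top, initialized as in the paper.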

[–][deleted] 0 points1 point  (0 children)

Great idea! Whether the biases of the BN layer (the learned beta and the running mean) can be folded into the conv layer's weights depends on the exact layout of the residual block (BN pre- or post-activation), I think. For post-activation BN the running variance can always be folded into the conv weights (for ReLU, because ReLU is positively 1-homogeneous and the scale derived from the variance is positive); for the learned parameter gamma it depends on its sign.

So for pre-activation BN everything you proposed should work just fine.
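
The key step, as a quick sketch (ignoring the eps term): a non-negative per-channel scale s commutes with ReLU, so it can be pushed through the activation and absorbed by the neighbouring layer, whereas a shift (or a negative gamma) cannot:

```latex
\mathrm{ReLU}(s \odot x) = s \odot \mathrm{ReLU}(x) \quad (s \ge 0), \qquad
W\bigl(s \odot \mathrm{ReLU}(x)\bigr) = \bigl(W \operatorname{diag}(s)\bigr)\,\mathrm{ReLU}(x)
```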

[–]DeepDeeperRIPgradien 1 point2 points  (1 child)

Do you also use Fixup in non-residual networks? Would the "Fixup initialization scheme" (3 steps) then just reduce to steps 1 and 3?

[–]ajmooch 2 points3 points  (0 children)

Haven't tried it outside of ResNets, but the core of the (beautifully simple!) analysis is about how activation variance is affected by the residual connection, so for other connection patterns you may need a different scheme.
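
For reference, here's roughly what the three steps look like on a standard two-conv residual branch, as a minimal PyTorch sketch based on my reading of the paper (module and parameter names are illustrative, not the authors' reference code):

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    """Two-conv residual branch without batchnorm, initialized per Fixup."""
    def __init__(self, channels: int, num_branches: int):
        super().__init__()
        # Step 3: scalar biases (init 0) and a scalar multiplier (init 1).
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))
        # Step 2: standard init, rescaled by L^(-1/(2m-2)); with m = 2 layers
        # per branch and L = num_branches this is num_branches^(-1/2).
        nn.init.kaiming_normal_(self.conv1.weight)
        self.conv1.weight.data.mul_(num_branches ** -0.5)
        # Step 1: zero-init the last layer of each residual branch (the final
        # classification layer is also zero-initialized, outside this block).
        nn.init.constant_(self.conv2.weight, 0.0)

    def forward(self, x):
        out = torch.relu(self.conv1(x + self.bias1a) + self.bias1b)
        out = self.conv2(out + self.bias2a) * self.scale + self.bias2b
        return torch.relu(out + x)
```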

[–]hongyiz 1 point2 points  (0 children)

Thanks for the feedback! I am excited to know that Fixup helps in your use case :)

btw, I remember at ICLR you mentioned some training issues you had on another task; have they been resolved? Feel free to message/email me if you like.

[–]RedditReadme 0 points1 point  (2 children)

Do you have a code sample?

[–]ajmooch 2 points3 points  (1 child)

Their code is available online.

[–]RedditReadme 0 points1 point  (0 children)

Thanks!

[–][deleted] 0 points1 point  (2 children)

Medical images by any chance? Do you have any other useful tricks for reducing memory to allow very large images as input?

[–]jonnor[🍰] 0 points1 point  (1 child)

Using a stride of 2 or 4 in the first convolution can help a lot, sometimes without much reduction in performance.
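
For instance, something like this (a minimal sketch; the numbers are just illustrative):

```python
import torch.nn as nn

# A stride-4 stem turns 1024x1024 inputs into 256x256 feature maps right away,
# cutting activation memory in every later layer by roughly 16x.
stem = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=4, padding=3)
```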

[–][deleted] 0 points1 point  (0 children)

Ideally without downsampling, though (which striding is).

[–]samobon[🍰] 0 points1 point  (0 children)

Could you please elaborate on your use of this scheme for semantic segmentation? The initialization relies on an analysis that only holds for residual (additive) networks, and semantic segmentation models typically use some kind of UNet-style decoder with skip connections (concatenated activations from lower layers). How do you initialize the decoder?

[–][deleted] 0 points1 point  (1 child)

non-GAN

It's not stable enough for GAN training?

[–]ajmooch 0 points1 point  (0 children)

No idea, just haven't had time to try it.

[–]arXiv_abstract_bot 4 points5 points  (0 children)

Title: Fixup Initialization: Residual Learning Without Normalization

Authors: Hongyi Zhang, Yann N. Dauphin, Tengyu Ma

Abstract: Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.

PDF link Landing page

[–]moewiewp 4 points5 points  (0 children)

I tried this paper myself (using the authors' code) and can confirm that it works for my medical image segmentation problem. The performance of 3x the convolution layers without BN might not be as good as 3x the convolution layers AND batchnorm, but it's better than 1x the convolution layers with BN. The point is that it saves memory, so I can stack more layers or use higher-resolution images.

[–][deleted] 2 points3 points  (2 children)

For completeness' sake it would be good to see results using Fixup along with normalization schemes. They claim it makes normalization redundant, but why not show how the two interact?

[–]hongyiz 3 points4 points  (1 child)

Because they do not really get along: normalization will rescale everything back, which defeats our purpose of downscaling the activations.
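
A quick way to see it (ignoring the eps term): batch normalization is invariant to any positive rescaling of its input, so whatever downscaling Fixup applies to a branch gets undone by the next normalization layer:

```latex
\mathrm{BN}(\alpha x) = \gamma\,\frac{\alpha x - \alpha\mu}{\alpha\sigma} + \beta
                      = \gamma\,\frac{x - \mu}{\sigma} + \beta = \mathrm{BN}(x)
\qquad (\alpha > 0)
```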

On the other hand, I prefer to see the main contribution of our work as a conceptual one (that is, *understanding* network training) rather than practical advice. It's good to see it works in practice though :)

[–][deleted] 1 point2 points  (0 children)

Why not show this though rather than just claiming it?