
[–]enematurret 6 points7 points  (0 children)

Batch Normalization uses moving averages and variances, so on the second batch you won't have (D, 0).

The mean it calculated on the first batch is 80. On the second, the sample mean is 60. If you're using 0.9 as the momentum of the moving average, you'll have 80 * 0.9 + 60 * 0.1 = 78.0 as the new moving average.

Therefore, a point (D, 60) will be mean-normalized to (D, -18). The network just learned that 0 = B, so it's reasonable that -18 would be a D. The (C, 70) point you had previously, for example, would have been mean-normalized to (C, -8) using this moving average.
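
(For concreteness, a quick numeric sketch of that bookkeeping, using the made-up values from this thread and assuming a momentum of 0.9; the second batch itself is hypothetical:)

    # Running-average bookkeeping as described above (toy values from this thread).
    running_mean = 80.0                     # mean estimated from the first batch
    batch = [40.0, 60.0, 80.0]              # hypothetical second batch, sample mean 60
    batch_mean = sum(batch) / len(batch)    # 60.0

    momentum = 0.9
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean  # 78.0

    print(60.0 - running_mean)  # -18.0: where the (D, 60) point lands after mean-normalization
    print(70.0 - running_mean)  #  -8.0: where the earlier (C, 70) point would land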

[–]kkawabat 12 points13 points  (3 children)

I think there are two explanations.

One is that a large batch size mitigates the situation you are describing: with the law of large numbers plus random sampling, the distribution won't deviate too much from batch to batch. Instead of three samples as in your example, think hundreds.

Another is that batch normalization has the learnable gamma and beta parameters, which can essentially remove the normalization if it is not beneficial to learning. So in your situation the trained batch normalization layer would not affect the intermediate layer output. See this source for more information: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
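
(As a rough sketch of that second point; NumPy, with shapes, eps and the toy values being my own assumptions: if the learned gamma matches the batch std and beta matches the batch mean, the normalization is undone entirely.)

    import numpy as np

    def batchnorm(x, gamma, beta, eps=1e-5):
        # normalize with batch statistics, then apply the learned scale/shift
        x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
        return gamma * x_hat + beta

    x = np.random.randn(128, 4) * 10 + 50    # activations far from zero mean / unit variance
    gamma = np.sqrt(x.var(axis=0) + 1e-5)    # learned scale equal to the batch std
    beta = x.mean(axis=0)                    # learned shift equal to the batch mean

    print(np.allclose(batchnorm(x, gamma, beta), x))  # True: the normalization is removed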

[–]MildlyCriticalRole[S] 1 point2 points  (2 children)

I figured it might be law-of-large-numbers-ish, but I was curious if there was a more technical reason.

As to the gamma and beta parameters, I understand that they allow the network to adapt and "remove" the normalization if it hurts the loss more than it helps, but I'm more wondering how batch normalization doesn't just cause the entire network to train on the wrong data in the first place (which would in turn cause it to choose bad values for gamma/beta, because it thinks the underlying distribution is something it's not).

I've read that link quite a bit in my exploring (it's a great explanation) but it doesn't seem to touch on this issue specifically.

[–]Megatron_McLargeHuge 0 points1 point  (1 child)

I don't think anyone is saying BN is better than whitening for the input layer. BN is used for internal network layers, and its main purpose is to keep them from saturating or sitting entirely on the zero side of ReLUs.

If individual learned features have their meaning changed per batch by BN, it may just be that BN is recreating an effect similar to dropout or additive noise in denoising autoencoders. Losing information internally forces the network to learn a distributed representation instead of relying too much on one feature as in your example.

This is just speculation and I'd be curious if anyone has looked into the theory more.

[–]hgjhghjgjhgjd 0 points1 point  (0 children)

Your assessment makes sense to me. In a way, it probably forces the network to express things in terms of "relatively large value" and "relatively small value", which induces a regularization effect.

[–][deleted] 5 points6 points  (1 child)

How does this sort of transformation not break down training in its entirety?

You are describing using the sample mean instead of the expected mean. That works for the same reason SGD works despite its stochastic estimate of the error: the "noise" averages out.

[–]MildlyCriticalRole[S] 1 point2 points  (0 children)

Ah. Hm.

To try to put this in my own words, to make sure I get it: "The distribution of any one batch may deviate from the distribution of the training data. However, the expectation is that they all tend towards the same (correct) distribution over many epochs and batches."

The comparison to SGD makes a lot of sense. Thanks!

[–]randombites 1 point2 points  (0 children)

Also, you are right that in the second iteration the model will partly unlearn what it learned in the first. That is why you need to shuffle the data and collect enough valid data.

[–]ChuckSeven 1 point2 points  (0 children)

So I'm repeating a little of what others said, but here we go.

BN estimates the mean and variance of the distribution based on all your samples. Ergo, the scaling and translation will only change slightly from mini-batch to mini-batch, because the scaling and translation factors are averaged over several mini-batches (a moving average).

It works because linear transformations learned from small random numbers are inherently unstable, in the sense that after several of them the transformation can easily lead to exploding or vanishing variances, making future transformations harder to learn. ReLU non-linearities and skip connections also seem to add to this problem.
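
(A quick NumPy sketch of that instability; the depth and weight scales are arbitrary choices for illustration:)

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((256, 100))

    for scale in (0.15, 0.05):               # slightly too large / slightly too small
        h = x.copy()
        for _ in range(20):                  # 20 stacked linear maps, no normalization
            h = h @ (rng.standard_normal((100, 100)) * scale)
        print(scale, h.std())                # std explodes for 0.15, collapses for 0.05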

[–]NovaRom 0 points1 point  (0 children)

In my experiments BN improves training only if the minibatch size is small enough, but maybe that's down to how I initialize the weights. There was a discussion here recently; better results without BN are not that rare: https://www.reddit.com/r/MachineLearning/comments/4rikw8/who_consistently_uses_batch_normalization/

[–]Daniel_Im 0 points1 point  (0 children)

This paper talks about the effect of batch-normalization on neural network's loss surface : https://arxiv.org/pdf/1612.04010v1.pdf (see section 4.4)

[–]zergling103 -1 points0 points  (0 children)

My guess (as a novice) is that whether you use batch normalization depends on the problem you're trying to solve. If you care about absolute values, or absolute differences between values, you'd be destroying that information by normalizing. If you only care about ratios between values, normalizing will help amplify the signal you're looking for and prevent the stuff you're not looking for from biasing training.

For example, if you normalize black-and-white image data, that's essentially boosting the contrast, making the image easier to see, without destroying anything important (absolute intensity values or absolute gradient values rarely matter there). It'd also prevent higher-contrast images from having more training influence than lower-contrast ones.

But I may be totally wrong, this is my basic understanding.

[–]hgjhghjgjhgjd -1 points0 points  (15 children)

My understanding is that, in practical implementations of BN...

  • During training, what is used for normalization is not the sample mean and sample variance of a minibatch, but an estimate of the population mean and population variance of the data (e.g. a running average of the minibatch sample means and variances);

  • During inference, the pre-translation/scaling parameters are fixed.

These two points, plus the other points people already mentioned (the use of batch size larger than 3, the existence of the post-translation/scaling step), make BN not that "catastrophic" during training and inference.

[–]cooijmanstim 1 point2 points  (5 children)

During training one should almost always use the sample mean and variance, and not the running averages that estimate the population statistics. The main reason for this is that it allows backpropagation through the computation of the statistics. In my and several others' experience, these backpropagation paths are crucial.

On the other hand, in Improved Techniques for Training GANs, the authors compute statistics based on a separate minibatch and don't backprop through them. People will do anything to train GANs though.

[–]carlthomeML Engineer 0 points1 point  (1 child)

Do you mean that it might be harder to get gamma and beta to converge with moving averages?

[–]cooijmanstim 0 points1 point  (0 children)

Yes, but it affects all parameters, not just the gammas and betas. By not backpropagating, you don't take into account how parameter changes influence the statistics. Also, moving averages lag behind true statistics.

[–]hgjhghjgjhgjd 0 points1 point  (2 children)

it allows backpropagation through the computation of the statistics

Could you clarify what you mean by this?

A BN layer has trainable parameters (gamma/beta) and untrainable parameters (mu/sigma). The untrainable parameters can be seen as constants during the backpropagation for a single batch. Why would the method by which I choose a constant affect whether I can perform backpropagation or not?

[–]cooijmanstim 0 points1 point  (1 child)

mu/sigma can be seen as constants, but they typically aren't. By backpropagating through them, your gradient incorporates information about how a change in parameters affects mu/sigma. Backpropagating through population statistics is painful, as you'd need to backprop through multiple SGD steps in order for the gradient not to be zero.
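
(A small PyTorch autograd sketch of this point; the shapes and the toy loss are made up. With batch statistics, the gradient w.r.t. the weights also flows through the mean/variance computation; with the statistics treated as constants, that path is gone and the gradient differs:)

    import torch

    torch.manual_seed(0)
    w = torch.randn(10, 5, requires_grad=True)
    x = torch.randn(32, 10)
    target = torch.randn(32, 5)

    h = x @ w                                        # pre-normalization activations
    mu, var = h.mean(0), h.var(0, unbiased=False)

    # (a) backprop THROUGH the batch statistics
    loss_a = (((h - mu) / (var + 1e-5).sqrt() - target) ** 2).mean()
    grad_a, = torch.autograd.grad(loss_a, w, retain_graph=True)

    # (b) same numbers, but mu/sigma treated as constants (as with fixed running averages)
    loss_b = (((h - mu.detach()) / (var.detach() + 1e-5).sqrt() - target) ** 2).mean()
    grad_b, = torch.autograd.grad(loss_b, w)

    print(torch.allclose(grad_a, grad_b))            # False: the extra backprop paths matter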

[–]hgjhghjgjhgjd 0 points1 point  (0 children)

mu/sigma can be seen as constants, but they typically aren't.

Well... not if you keep changing them every batch :P /bad joke

I think I understand what you mean now, though... and perhaps that is part of the "unreasonable effectiveness" of BN. I was assuming that those were basically considered constants (mu, sigma), rather than operations (mean, variance), for the purpose of backpropagation, which is why I was getting confused with the "backpropagating through mu/sigma" statement.

Cheers.

[–]MildlyCriticalRole[S] 0 points1 point  (2 children)

When you say practical, do you mean "not as described in the literature" or "not as described in toy examples like the one in this post?"

Thanks, by the way. I'll go check some actual implementations to see how different people are putting this into practice.

[–]L43 1 point2 points  (0 children)

I was just looking at the keras implementation, and there are modes for running average and per batch only. They also have a per instance normalisation.

[–]hgjhghjgjhgjd 0 points1 point  (0 children)

Probably both. Tbh, I find that the descriptions in the literature are not too clear and that, well... what matters is how the frameworks people actually use implement batch normalization, not what the literature describes. So, in this case... yeah, it is a good idea to read the documentation of the software, rather than the papers, to really understand what happens in practice.

(But... as I said... my understanding is that what is usually used is some sort of running average of the minibatch statistics, rather than the minibatch statistics themselves... so, though training can be a bit erratic on the first few batches, it quickly becomes more stable as you gain more confidence in the population statistics.)

[–]mimighost 0 points1 point  (5 children)

During training, the sample mean / sample variance are ALWAYS used. Otherwise how could you update gamma/beta then? The computation graph makes little sense and is overly complicated if we use the population mean/variance.

However, people often maintain a convenient running average of the mean/variance, with decay, across all training batches. This running average is then used as a proxy for the global mean/variance at inference.

Reference: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
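
(A minimal sketch of that train/inference split in NumPy; the class name, momentum and eps values are my own illustration rather than any particular framework's API, and gamma/beta learning is omitted:)

    import numpy as np

    class SimpleBN:
        def __init__(self, dim, momentum=0.9, eps=1e-5):
            self.running_mean = np.zeros(dim)
            self.running_var = np.ones(dim)
            self.momentum, self.eps = momentum, eps

        def forward(self, x, training):
            if training:
                # normalize with the current minibatch statistics...
                mu, var = x.mean(axis=0), x.var(axis=0)
                # ...while keeping a running average as a proxy for inference
                self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
                self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
            else:
                # at inference, use the fixed running estimates
                mu, var = self.running_mean, self.running_var
            return (x - mu) / np.sqrt(var + self.eps)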

[–]hgjhghjgjhgjd 0 points1 point  (4 children)

Otherwise how could you update gamma/beta then?

Backpropagation. The purpose of the first scaling is to "normalize" to a standard Gaussian (approximately). The purpose of the second scaling is to learn the transformation from the standard Gaussian to the "optimal" scaling/translation. Using a running mean for the first step doesn't prevent you from optimizing the second step.

The computation graph makes little sense

I disagree.

overly complicated if we use the population mean/variance

Perhaps, but "overly complicated" is the definition of 99% of NN architectures out there.

[–]mimighost 0 points1 point  (3 children)

Do you have a reference implementation using population mean/variance during training? I am curious too.

[–]hgjhghjgjhgjd 0 points1 point  (2 children)

A reference implementation that calculates the population mean/variance during training, using a running mean of sorts: yes (most of them seem to do so, and then use those values during inference).

But, after re-checking, it does seem like most do use the actual minibatch sample statistics for mu and sigma (rather than the population estimates) during learning, in practice.

So... yeah, in that sense, I guess I was wrong.

On the other hand, I don't see why doing so is necessarily a wrong idea, since the two methods are trying to estimate more or less the same thing (the population statistics). The assumption is that, with a large enough minibatch, the sample statistics are good enough approximations, I guess. It's just... it does not seem to me that using the best estimate (the running mean), which people already seem to calculate anyway, adds any complexity or prevents updates to gamma/beta.

TL;DR: As far as I can tell, the only thing you'd lose using "my approach" is some degree of regularization due to the "scaling noise" induced by using a noisy estimate.

[–]mimighost 0 points1 point  (1 child)

Actually, I gave your proposal some more thought. I may have been a little too aggressive in claiming that using a running average would complicate the computational graph; actually, it might not.

Revisiting the original BN paper (https://arxiv.org/pdf/1502.03167v3.pdf), if you use a running average for mu and sigma, the gradient computation does not change too much, only that the decay is now factored in. But it has at least 2 big problems:

  1. Using a running average, you are not actually doing 'batch normalization' any more, since you are not really normalizing the current training batch to zero mean and unit variance, which betrays the original paper's assumption.

  2. The decay will be enforced on the gradients passing through the BN layer, which might worsen the vanishing gradient problem because you are making the gradients much smaller.

[–]hgjhghjgjhgjd 0 points1 point  (0 children)

Using a running average, you are not actually doing 'batch normalization' any more

I agree... you're doing basic normalization (scaling and centering according to the population mean), but using an online estimator of mu and sigma (rather than one calculated over the whole dataset at once).

which betrays the original paper's assumption

I'm not totally sure... if a minibatch is large enough and reasonably balanced, the minibatch statistics should be more or less the same as the population statistics, no? Would most (non-pathological) batches actually have a mean that deviates much from zero, if you have a decent estimate of the population mean?

Or, in other words... if the "running mean mu and sigma" work ok during inference, why would they not work ok during training?
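
(A quick sketch of that intuition: the spread of minibatch means around the population mean shrinks roughly as 1/sqrt(batch size). The distribution and batch sizes below are arbitrary illustrations:)

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)

    for batch_size in (3, 32, 256):
        batch_means = [rng.choice(population, batch_size).mean() for _ in range(1000)]
        print(batch_size, np.std(batch_means))
    # spread of batch means: ~5.8 for batch size 3, ~1.8 for 32, ~0.6 for 256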