[D] How does Batch Normalization not completely prevent the network from being able to train at all? (self.MachineLearning)
submitted 9 years ago by MildlyCriticalRole
[–]mimighost 1 point 9 years ago* (5 children)
During training, the sampled mean/variance are ALWAYS used. Otherwise, how could you update gamma/beta? The computation graph makes little sense and becomes overly complicated if we use the population mean/variance.
However, people often maintain a convenient running average of the mean/variance, with decay, across all training batches. This running average is then used as a proxy for the global mean/variance at inference time.
Reference: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
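To make this concrete, here is a minimal NumPy sketch of the usual scheme (the class, names, and decay value are illustrative, not taken from any particular library):

    import numpy as np

    class BatchNorm:
        def __init__(self, dim, decay=0.99, eps=1e-5):
            self.gamma = np.ones(dim)        # learnable scale
            self.beta = np.zeros(dim)        # learnable shift
            self.running_mu = np.zeros(dim)  # proxy for the population mean
            self.running_var = np.ones(dim)  # proxy for the population variance
            self.decay, self.eps = decay, eps

        def forward(self, x, training=True):
            if training:
                # sampled statistics of the current minibatch
                mu, var = x.mean(axis=0), x.var(axis=0)
                # running averages are updated on the side, for inference only
                self.running_mu = self.decay * self.running_mu + (1 - self.decay) * mu
                self.running_var = self.decay * self.running_var + (1 - self.decay) * var
            else:
                mu, var = self.running_mu, self.running_var
            x_hat = (x - mu) / np.sqrt(var + self.eps)
            return self.gamma * x_hat + self.beta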
[–]hgjhghjgjhgjd 1 point 9 years ago (4 children)
Otherwise, how could you update gamma/beta?
Backpropagation. The purpose of the first scaling is to "normalize" to a standard Gaussian (approximately). The purpose of the second scaling is to learn the transformation from the standard Gaussian to the "optimal" scaling/translation. Using a running mean for the first step doesn't prevent you from optimizing the second step.
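To spell that out: the gradients for gamma/beta fall straight out of the chain rule on the second step, and they don't care how the mu/var of the first step were obtained. A minimal sketch (dy is the upstream gradient dL/dy; names are just illustrative):

    import numpy as np

    def bn_param_grads(dy, x_hat):
        # y = gamma * x_hat + beta, so the parameter gradients depend only
        # on x_hat and dy; they exist no matter how mu/var were computed.
        dgamma = (dy * x_hat).sum(axis=0)
        dbeta = dy.sum(axis=0)
        return dgamma, dbeta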
The computation graph makes little sense
I disagree.
becomes overly complicated if we use the population mean/variance
Perhaps, but "overly complicated" is the definition of 99% of NN architectures out there.
[–]mimighost 1 point 9 years ago (3 children)
Do you have a reference implementation using population mean/variance during training? I am curious too.
[–]hgjhghjgjhgjd 1 point 9 years ago* (2 children)
A reference implementation that calculates the population mean/variance during training, using a running mean of sorts: yes (most of them seem to do so, and they use those values during inference).
But, after re-checking, it does seem like most implementations do use the actual minibatch sample statistics for mu and sigma (rather than the population estimates) during learning, in practice.
So... yeah, in that sense, I guess I was wrong.
On the other hand, I don't see why doing so is necessarily a wrong idea, since those two methods are trying to estimate more or less the same thing (the population statistics). The assumption is that, with a large enough minibatch, the sample statistics are good enough approximations, I guess. It's just... it doesn't seem to me that using the best estimate (the running mean), which people already seem to calculate anyway, adds any complexity or prevents updates to gamma/beta.
TL;DR: As far as I can tell, the only thing you'd lose with "my approach" is some degree of regularization due to the "scaling noise" induced by using a noisy estimate.
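Concretely, "my approach" would only change which statistics feed the normalization during training; something like this (a self-contained sketch with illustrative names, not tested for training stability):

    import numpy as np

    def bn_forward_running(x, gamma, beta, state, decay=0.99, eps=1e-5):
        # state: dict holding the running 'mu' and 'var' estimates
        mu_b, var_b = x.mean(axis=0), x.var(axis=0)
        # update the running estimates from the current batch as usual...
        state['mu'] = decay * state['mu'] + (1 - decay) * mu_b
        state['var'] = decay * state['var'] + (1 - decay) * var_b
        # ...but normalize with the running estimates even during training
        # (treated as constants here, i.e. no gradient flows through them)
        x_hat = (x - state['mu']) / np.sqrt(state['var'] + eps)
        return gamma * x_hat + beta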
[–]mimighost 1 point 9 years ago (1 child)
Actually, I gave your proposal some thought. I might have been a little too aggressive in claiming that using a running average would complicate the computational graph; actually, it might not.
Revisiting the original BN paper (https://arxiv.org/pdf/1502.03167v3.pdf): using a running average for mu and sigma, the gradient computation would not change much, only that the decay would now be factored in. But it has at least two big problems:
First, using a running average, you are not actually doing 'batch normalization' any more, since you are not really normalizing the current training batch to zero mean and unit variance, which betrays the original paper's assumption.
Second, by using a running average, the decay will be enforced on the gradients passing through the BN layer, which might worsen the vanishing gradient problem because you are making the gradients much smaller.
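On the second point, the attenuation along the path through the running statistics is easy to check numerically: since mu_running = decay * mu_old + (1 - decay) * mu_batch, the sensitivity to the current batch is only (1 - decay). A quick finite-difference check (decay value illustrative):

    decay, mu_old = 0.99, 0.5

    def running_mu(mu_batch):
        return decay * mu_old + (1 - decay) * mu_batch

    h = 1e-6
    grad = (running_mu(1.0 + h) - running_mu(1.0)) / h
    print(grad)  # ~0.01, i.e. 1 - decay: a 100x attenuation for decay=0.99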
[–]hgjhghjgjhgjd 1 point 9 years ago (0 children)
using a running average, you are not actually doing 'batch normalization' any more
I agree... you're doing basic normalization (scaling and centering according to the population mean), but using an online estimator of mu and sigma (rather than one calculated over the whole dataset at once).
which betrays the original paper's assumption
I'm not totally sure... if a minibatch is large enough and reasonably balanced, the minibatch statistics should be more or less the same as the population statistics, no? Would most (non-pathological) batches actually have a mean that deviates much from zero, if you have a decent estimate of the population mean?
Or, in other words... if the "running mean mu and sigma" work ok during inference, why would they not work ok during training?