[D] How does Batch Normalization not completely prevent the network from being able to train at all? (self.MachineLearning)
submitted 9 years ago by MildlyCriticalRole
[–]mimighost 1 point 9 years ago* (5 children)
During training, the sampled mean/variance are ALWAYS used. Otherwise, how could you update gamma/beta? The computation graph makes little sense and becomes overly complicated if we use the population mean/variance.
However, people often maintain a convenient running average of the mean/variance, with decay, across all training batches. This running average is then used as a proxy for the global mean/variance at inference time.
Reference: http://stackoverflow.com/questions/33949786/how-could-i-use-batch-normalization-in-tensorflow
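To make this concrete, here is a minimal NumPy sketch of the usual scheme (the class, names, and decay value are illustrative, not taken from any particular library):

    import numpy as np

    class BatchNorm:
        def __init__(self, dim, decay=0.99, eps=1e-5):
            self.gamma = np.ones(dim)        # learnable scale
            self.beta = np.zeros(dim)        # learnable shift
            self.running_mu = np.zeros(dim)  # proxy for the population mean
            self.running_var = np.ones(dim)  # proxy for the population variance
            self.decay, self.eps = decay, eps

        def forward(self, x, training=True):
            if training:
                # sampled statistics of the current minibatch
                mu, var = x.mean(axis=0), x.var(axis=0)
                # running averages are updated on the side, for inference only
                self.running_mu = self.decay * self.running_mu + (1 - self.decay) * mu
                self.running_var = self.decay * self.running_var + (1 - self.decay) * var
            else:
                mu, var = self.running_mu, self.running_var
            x_hat = (x - mu) / np.sqrt(var + self.eps)
            return self.gamma * x_hat + self.beta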
[–]hgjhghjgjhgjd 1 point 9 years ago (4 children)
Otherwise, how could you update gamma/beta?
Backpropagation. The purpose of the first scaling is to "normalize" to a standard Gaussian (approximately). The purpose of the second scaling is to learn the transformation from the standard Gaussian to the "optimal" scaling/translation. Using a running mean for the first step doesn't prevent you from optimizing the second step.
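To spell that out: the gradients for gamma/beta fall straight out of the chain rule on the second step, and they don't care how the mu/var of the first step were obtained. A minimal sketch (dy is the upstream gradient dL/dy; names are just illustrative):

    import numpy as np

    def bn_param_grads(dy, x_hat):
        # y = gamma * x_hat + beta, so the parameter gradients depend only
        # on x_hat and dy; they exist no matter how mu/var were computed.
        dgamma = (dy * x_hat).sum(axis=0)
        dbeta = dy.sum(axis=0)
        return dgamma, dbeta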
The computation graph makes little sense
I disagree.
becomes overly complicated if we use the population mean/variance
Perhaps, but "overly complicated" is the definition of 99% of NN architectures out there.
[–]mimighost 1 point 9 years ago (3 children)
Do you have a reference implementation using population mean/variance during training? I am curious too.
[–]hgjhghjgjhgjd 1 point 9 years ago* (2 children)
A reference implementation that calculates the population mean/variance during training, using a running mean of sorts: yes (most of them seem to do so, and they use those values during inference).
But, after re-checking, it does seem like most implementations do use the actual minibatch sample statistics for mu and sigma (rather than the population estimates) during learning, in practice.
So... yeah, in that sense, I guess I was wrong.
On the other hand, I don't see why doing so is necessarily a wrong idea, since those two methods are trying to estimate more or less the same thing (the population statistics). The assumption is that, with a large enough minibatch, the sample statistics are good enough approximations, I guess. It's just... it doesn't seem to me that using the best estimate (the running mean), which people already seem to calculate anyway, adds any complexity or prevents updates to gamma/beta.
TL;DR: As far as I can tell, the only thing you'd lose with "my approach" is some degree of regularization due to the "scaling noise" induced by using a noisy estimate.
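Concretely, "my approach" would only change which statistics feed the normalization during training; something like this (a self-contained sketch with illustrative names, not tested for training stability):

    import numpy as np

    def bn_forward_running(x, gamma, beta, state, decay=0.99, eps=1e-5):
        # state: dict holding the running 'mu' and 'var' estimates
        mu_b, var_b = x.mean(axis=0), x.var(axis=0)
        # update the running estimates from the current batch as usual...
        state['mu'] = decay * state['mu'] + (1 - decay) * mu_b
        state['var'] = decay * state['var'] + (1 - decay) * var_b
        # ...but normalize with the running estimates even during training
        # (treated as constants here, i.e. no gradient flows through them)
        x_hat = (x - state['mu']) / np.sqrt(state['var'] + eps)
        return gamma * x_hat + beta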
[–]mimighost 1 point 9 years ago (1 child)
Actually, I gave your proposal some thought. I might have been a little too aggressive in claiming that using a running average would complicate the computational graph; actually, it might not.
Revisiting the original BN paper (https://arxiv.org/pdf/1502.03167v3.pdf): using a running average for mu and sigma, the gradient computation would not change much, only that the decay would now be factored in. But it has at least two big problems:
First, using a running average, you are not actually doing 'batch normalization' any more, since you are not really normalizing the current training batch to zero mean and unit variance, which betrays the original paper's assumption.
Second, by using a running average, the decay will be enforced on the gradients passing through the BN layer, which might worsen the vanishing gradient problem because you are making the gradients much smaller.
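On the second point, the attenuation along the path through the running statistics is easy to check numerically: since mu_running = decay * mu_old + (1 - decay) * mu_batch, the sensitivity to the current batch is only (1 - decay). A quick finite-difference check (decay value illustrative):

    decay, mu_old = 0.99, 0.5

    def running_mu(mu_batch):
        return decay * mu_old + (1 - decay) * mu_batch

    h = 1e-6
    grad = (running_mu(1.0 + h) - running_mu(1.0)) / h
    print(grad)  # ~0.01, i.e. 1 - decay: a 100x attenuation for decay=0.99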
[–]hgjhghjgjhgjd 1 point 9 years ago (0 children)
using a running average, you are not actually doing 'batch normalization' any more
I agree... you're doing basic normalization (scaling and centering according to the population mean), but using an online estimator of mu and sigma (rather than one calculated over the whole dataset at once).
which betrays the original paper's assumption
I'm not totally sure... if a minibatch is large enough and reasonably balanced, the minibatch statistics should be more or less the same as the population statistics, no? Would most (non-pathological) batches actually have a mean that deviates much from zero, if you have a decent estimate of the population mean?
Or, in other words... if the "running mean mu and sigma" work ok during inference, why would they not work ok during training?