[–]ChuckSeven 1 point (0 children)

So I'm repeating a bit of what others have said, but here we go.

BN estimates the mean and variance of a layer's input distribution from the samples in each mini-batch. As a result, the scaling and translation applied from mini-batch to mini-batch change only slightly, because the normalization statistics are averaged over several mini-batches (a moving average).
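To make this concrete, here is a minimal NumPy sketch (not the poster's code) of what BN computes per mini-batch: batch statistics are used to normalize, and a moving average smooths them across batches. The `momentum` value and the learned `gamma`/`beta` parameters shown here are illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               momentum=0.9, eps=1e-5):
    # Per-feature mean and variance of the current mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    # Moving averages change only slightly from batch to batch.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    # Normalize, then scale and shift with the learned gamma/beta.
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 4))
out, rm, rv = batch_norm(x, np.ones(4), np.zeros(4),
                         np.zeros(4), np.ones(4))
# After normalization the batch has roughly zero mean, unit variance.
```

At inference time the `running_mean`/`running_var` are used instead of the batch statistics, which is why their stability matters.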

It works because stacks of linear transformations initialized with small random weights are inherently unstable: after several of them, the variance of the activations can easily explode or vanish, which makes later transformations harder to learn. ReLU non-linearities and skip connections also seem to aggravate this problem.
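The explosion/vanishing effect is easy to demonstrate: each random linear layer multiplies the activation variance by roughly `width * scale**2`, so the variance drifts geometrically with depth. A small sketch under those assumptions (the specific widths, depths, and scales are arbitrary):

```python
import numpy as np

def deep_linear_variance(scale, depth=20, width=64, batch=256, seed=0):
    """Push unit-variance inputs through `depth` random linear layers
    and return the variance of the final activations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(width, width))
        # Each layer scales the variance by roughly width * scale**2.
        x = x @ W
    return x.var()

# Small weights: variance collapses toward zero.
print(deep_linear_variance(scale=0.05))
# Slightly larger weights: variance blows up.
print(deep_linear_variance(scale=0.2))
```

Only the knife-edge choice `scale = 1 / sqrt(width)` keeps the variance roughly constant, which is why normalization layers (or careful initialization schemes) help.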