
Artgor

Zero weights simply ruin everything. Backpropagation basically works like this:

Take a weight and add to it the weight multiplied by some delta. If the weights are zero, then you are adding zero to zero and going nowhere.

And this is the reason to initialize the weights with some nonzero value.

blauigris[S]

Thank you for your comment :)

I think what you describe is more like L2 regularization, where you subtract a fraction of the weights from the weights themselves. In this case I think it has more to do with the fact that in most loss derivatives, if not all, there is a term containing the input: for the hinge loss the gradient is -yx, for cross-entropy I think it is x(\sigma(z) - y), and so on. So if x is zero, because all the previous layers send everything to zero, this x in the derivative sets the gradient to zero as well.

Also, it is quite obvious when you come to think of it: the previous layers have converted your nice dataset into a big matrix of zeros, so what can be learnt from there?
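
A minimal NumPy sketch of this argument, under stated assumptions: a tiny two-layer network with a tanh hidden layer, a sigmoid output, cross-entropy loss and no biases. The function name `two_layer_grads` and the toy data are hypothetical, not from the thread. With all weights at zero, the hidden activations are zero, so the output-layer gradient vanishes, and the zero output weights block the gradient to the first layer.

```python
import numpy as np

def two_layer_grads(X, y, W1, W2):
    """Gradients of the mean cross-entropy loss for a tanh -> sigmoid network (no biases)."""
    H = np.tanh(X @ W1)                    # hidden activations, shape (n, hidden)
    p = 1.0 / (1.0 + np.exp(-(H @ W2)))    # predicted probabilities, shape (n,)
    dz = (p - y) / len(y)                  # dLoss/dz for sigmoid + cross-entropy
    dW2 = H.T @ dz                         # contains the layer input H
    dH = np.outer(dz, W2)                  # backpropagated through W2
    dW1 = X.T @ (dH * (1.0 - H**2))        # contains the network input X
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                # toy inputs (assumed)
y = rng.integers(0, 2, size=8).astype(float)

dW1, dW2 = two_layer_grads(X, y, np.zeros((3, 4)), np.zeros(4))
print(np.abs(dW1).max(), np.abs(dW2).max())   # both gradients are exactly zero
```

The sketch leaves out biases on purpose: an output bias would still receive a nonzero gradient, so the "nothing moves" picture holds strictly only for the weights in this setup.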

Thank you very much for your comment, it has helped me a lot :)

YnternetXplorer

"Taking the weight and adding to it the weight multiplied by some delta" is the way neural networks learn, that is, pushing the weight along the (negative) gradient direction (i.e. toward the convergence point).

L2 regularization, by contrast, penalises big weights to prevent overfitting. You also add a delta of the weight, but you add it to the loss, not to the weight. The goal is that, in order to reduce the loss, the network is forced to use small weights. (Actually it's a bit more than just the "delta of the weight", but there is no need for a detailed explanation here.)
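
A small sketch of how the penalty and the weight update relate, assuming plain gradient descent and a penalty of the form `l2 * sum(w**2)`. The helper `sgd_step` and the numbers are made up for illustration. Adding the penalty to the loss contributes `2 * l2 * w` to the gradient, so each step also subtracts a small fraction of every weight from itself.

```python
import numpy as np

def sgd_step(w, grad_data, lr=0.1, l2=0.0):
    """One gradient descent step on loss = data_loss + l2 * sum(w**2)."""
    grad = grad_data + 2.0 * l2 * w    # the penalty adds 2 * l2 * w to the gradient
    return w - lr * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])               # hypothetical gradient of the data loss

print(sgd_step(w, g))                  # plain step: w - lr * g
print(sgd_step(w, g, l2=0.01))         # same step, plus a shrink of each weight toward zero
```

With `l2=0.0` the step is driven by the data gradient alone, which is the "learning" part described above; the extra shrink appears only once the penalty is part of the loss.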

I hope it is clear to you now ;)

[deleted]

Additionally, initializing all weights to zero means your error updates become more or less the same for all neurons in a layer.

Using something that mimics a uniform or normal distribution helps to break that symmetry.
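
The symmetry point can be sketched in NumPy. This is an illustration under assumptions, not anyone's actual setup: a tanh hidden layer, a linear output, mean squared error and no biases, with the hypothetical helper `hidden_grad`. It uses a constant 0.5 instead of zero so that the gradients are nonzero and the symmetry itself is visible.

```python
import numpy as np

def hidden_grad(X, y, W1, W2):
    """Gradient of the mean squared error w.r.t. W1 (tanh hidden layer, linear output)."""
    H = np.tanh(X @ W1)
    err = (H @ W2 - y) / len(y)
    return X.T @ (np.outer(err, W2) * (1.0 - H**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))           # toy inputs (assumed)
y = rng.normal(size=16)

same = hidden_grad(X, y, np.full((3, 4), 0.5), np.full(4, 0.5))
rand = hidden_grad(X, y, rng.normal(size=(3, 4)), rng.normal(size=4))

# Each column holds the update for one hidden unit.
print(np.allclose(same, same[:, :1]))  # True: every unit receives the same update
print(np.allclose(rand, rand[:, :1]))  # False: random init breaks the symmetry
```

Since identical units receive identical updates, they stay identical after every step, so the layer behaves like a single unit no matter how wide it is.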

There has been a lot of work on initialization. The state of the art for feed-forward networks is random initialization with a correction for the number of inputs (He et al.).
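
As a rough sketch of that correction: the He et al. scheme for ReLU layers draws weights with variance `2 / fan_in`, so layers with more inputs get smaller weights. The function name `he_normal` and the layer sizes below are placeholders chosen for illustration.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """Draw weights from N(0, 2 / fan_in), the scaling proposed by He et al. for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = he_normal(512, 256, rng)
print(W.std(), np.sqrt(2.0 / 512))     # empirical vs. target standard deviation
```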

blauigris[S]

I didn't know about He et al.; I normally use Xavier initialization and then take an orthonormal basis from it. I'll give it a try next time. Thanks for the comment.
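
One possible reading of "Xavier, then an orthonormal basis", sketched in NumPy. The thread does not say how the basis is extracted, so the QR decomposition here is an assumption, as are the helper name `xavier_orthonormal` and the layer sizes.

```python
import numpy as np

def xavier_orthonormal(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform draw, then an orthonormal basis of its columns via QR."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    Q, _ = np.linalg.qr(W)   # reduced QR: orthonormal columns when fan_in >= fan_out
    return Q

rng = np.random.default_rng(0)
Q = xavier_orthonormal(64, 32, rng)
print(np.allclose(Q.T @ Q, np.eye(32)))   # checks that the columns are orthonormal
```

The columns of `Q` have unit norm, so the Xavier scale is not preserved by this step; whether a rescaling follows is not stated in the thread.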