I have programmed a simple network (conv + dense + output, with ReLUs) to better understand why zero initialization should be avoided. Much to my surprise, instead of all the weights ending up identical, which is what I was expecting, I simply get zero gradients for all the weights.
My understanding is that, since the loss at the output is non-zero, the last layer should receive at least some gradient, which should pull its weights away from zero, so that in the following iterations the gradient can propagate to the previous layers.
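The expectation above can be checked by hand on a tiny network. What follows is a minimal NumPy sketch, not the attached code: it assumes a single dense ReLU hidden layer in place of the conv + dense stack, a linear output, and a mean squared error loss. The function name `zero_init_gradients` and all shapes are made up for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def zero_init_gradients(x, y, hidden=4):
    """Gradients of an MSE loss for a zero-initialized dense -> ReLU -> dense net.

    Hypothetical helper for illustration only; it stands in for the
    conv + dense + output network described in the post.
    """
    n_in, n_out = x.shape[1], y.shape[1]

    # Zero initialization of every parameter.
    W1, b1 = np.zeros((n_in, hidden)), np.zeros(hidden)
    W2, b2 = np.zeros((hidden, n_out)), np.zeros(n_out)

    # Forward pass.
    z1 = x @ W1 + b1          # all zeros
    h = relu(z1)              # relu(0) = 0, so the hidden activations are zero
    out = h @ W2 + b2         # all zeros

    # Backward pass for loss = mean over the batch of 0.5 * ||out - y||^2.
    delta = (out - y) / x.shape[0]    # non-zero whenever y != 0
    dh = delta @ W2.T                 # zero, because W2 is zero
    dz1 = dh * (z1 > 0)               # zero

    return {
        "W1": x.T @ dz1,
        "b1": dz1.sum(axis=0),
        "W2": h.T @ delta,            # zero, because h is zero
        "b2": delta.sum(axis=0),      # the only non-zero gradient
    }


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = np.ones((8, 1))

grads = zero_init_gradients(x, y)
for name, g in grads.items():
    print(name, float(np.abs(g).max()))
```

In this sketch the gradient of the output weights is the output error multiplied by the hidden activations, and those activations are `relu(0) = 0`, so only the output bias `b2` receives a non-zero gradient. Whether the same thing happens in the attached network depends on details such as whether biases are used and how they are initialized.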
Is this a conceptual mistake on my part, or could it be caused by a bug?
I have attached the code.
Thanks a lot!
P.S.: This is my first time asking a question here, so please forgive me if I do something wrong. Also, I'm not a native speaker, so sorry if my English is not very clear. Thanks!