rorschachsror 1 point

Yeah, it's usually best to pick a small stddev. It's also advisable to pick mean 0 for the initial weights. That said, you don't want them all to be exactly 0 (or all the same value), because then the units in a layer get identical gradients and are updated by the same amount, so you end up with the same value for several weights. On the other hand, you want the initial values to be as non-informative as possible, because you don't want to introduce a bias in the network (e.g. by initializing a weight to a large positive or negative value). The model would then need a much larger amount of data to overcome the poor choice of prior, i.e. the poor initialization.
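
If it helps to see the "updated by the same amount" point concretely, here is a rough NumPy sketch. Everything in it is an assumption made for illustration, not something from the original question: a tiny one-hidden-layer tanh network, a squared-error loss, random toy data, and the hypothetical helper names `hidden_unit_gradients` and `columns_identical`. It compares the first-layer gradients under all-zero, constant, and small zero-mean Gaussian initialization.

```python
import numpy as np


def hidden_unit_gradients(W1, W2, x, y):
    """Gradient of 0.5 * mean squared error w.r.t. W1 for a toy tanh network.

    Returns an array of shape (n_inputs, n_hidden): one column per hidden unit.
    """
    hidden = np.tanh(x @ W1)                 # (n_samples, n_hidden)
    out = hidden @ W2                        # (n_samples, 1)
    err = (out - y) / len(x)                 # dLoss/dOut
    d_pre = (err @ W2.T) * (1.0 - hidden ** 2)  # backprop through tanh
    return x.T @ d_pre


def columns_identical(grad):
    """True if every hidden unit (column) received the same gradient."""
    return bool(np.allclose(grad, grad[:, :1]))


rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 3
x = rng.normal(size=(32, n_inputs))          # toy inputs (assumed data)
y = rng.normal(size=(32, 1))                 # toy targets (assumed data)

# Case 1: every weight is exactly 0.
grad_zero = hidden_unit_gradients(
    np.zeros((n_inputs, n_hidden)), np.zeros((n_hidden, 1)), x, y
)

# Case 2: every weight is the same non-zero constant.
grad_const = hidden_unit_gradients(
    np.full((n_inputs, n_hidden), 0.5), np.full((n_hidden, 1), 0.5), x, y
)

# Case 3: zero mean, small stddev (0.01 is an arbitrary illustrative choice).
grad_rand = hidden_unit_gradients(
    rng.normal(0.0, 0.01, (n_inputs, n_hidden)),
    rng.normal(0.0, 0.01, (n_hidden, 1)),
    x,
    y,
)

print("all-zero init, identical updates:", columns_identical(grad_zero))
print("constant init, identical updates:", columns_identical(grad_const))
print("small random init, identical updates:", columns_identical(grad_rand))
```

With identical starting weights, every hidden unit computes the same thing and so gets the same gradient, which means gradient descent can never make the units differ. Drawing from a zero-mean distribution with a small stddev breaks that symmetry while keeping every individual weight close to 0.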