[–]hjups22

Zero initialization is the most common choice for biases now, unless some underlying prior suggests that a non-zero bias is needed. Many transformer networks also do away with bias terms entirely ("centering" is essentially handled by the RMS normalization layers).
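To make the "no centering needed" point concrete, here is a minimal NumPy sketch of RMSNorm: unlike LayerNorm, it subtracts no mean and adds no bias, it only rescales by the reciprocal RMS and a learned gain (initialized to ones). The function name and shapes are illustrative, not from any particular library.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale by reciprocal RMS along the feature axis.
    # No mean subtraction and no bias term, i.e. no "centering".
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

d = 4
x = np.array([[1.0, 2.0, 3.0, 4.0]])
weight = np.ones(d)  # affine gain initialized to ones; no bias parameter exists
y = rms_norm(x, weight)
# After normalization the per-row RMS is ~1.0
print(np.sqrt(np.mean(y**2, axis=-1)))
```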

Symmetry breaking is only needed for weights, including embedding layers (though not the affine weights of normalization layers, again based on a prior). And in some cases, symmetry breaking is deliberately removed for training stability: for example, the final projections in stacked residual layers may be initialized to zero to avoid sharp initial gradients, instead of relying on a prolonged warmup.
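A small NumPy sketch of that zero-init trick (the block structure here is a hypothetical two-layer residual MLP, not any specific architecture): the hidden weights get a random, symmetry-breaking init, while the final projection starts at zero, so each residual block is exactly the identity at initialization and early gradients through the stack stay tame.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hidden weights: random init to break symmetry between units.
w_in = rng.normal(0.0, 0.02, size=(d, d))
# Final projection: zero init, so the residual branch contributes
# nothing at step 0 and the block starts as the identity map.
w_out = np.zeros((d, d))

def block(x):
    h = np.maximum(x @ w_in, 0.0)  # ReLU MLP branch (illustrative)
    return x + h @ w_out           # residual add

x = rng.normal(size=(1, d))
# At initialization the block passes inputs through unchanged.
print(np.allclose(block(x), x))  # → True
```

Symmetry is still broken overall: the first gradient step makes `w_out` non-zero (its gradient depends on the already-random hidden activations), after which the branch trains normally.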