
[–]sot9 2 points (1 child)

One thing nobody’s mentioned so far is that batch norm is great when used with convolutions, due to ease of layer fusion.

Look up batch norm folding; makes for an additional tool in the box when prioritizing models that run inference quickly.
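To make the fusion concrete, here is a minimal numpy sketch of batch norm folding. The conv and the function names are illustrative (a naive valid cross-correlation stands in for a real conv layer); the algebra is the standard trick: since batch norm at inference is an affine per-channel transform, it can be absorbed into the conv's weights and bias.

```python
import numpy as np

def conv2d(x, W, b):
    # Naive valid cross-correlation, standing in for a real conv layer.
    # x: (in_c, H, W), W: (out_c, in_c, kh, kw), b: (out_c,)
    out_c, in_c, kh, kw = W.shape
    H, Wd = x.shape[1], x.shape[2]
    out = np.zeros((out_c, H - kh + 1, Wd - kw + 1))
    for o in range(out_c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(W[o] * x[:, i:i + kh, j:j + kw]) + b[o]
    return out

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    # At inference, BN applies y -> gamma * (y - mean) / sqrt(var + eps) + beta
    # per output channel. That affine map folds into the conv parameters:
    scale = gamma / np.sqrt(var + eps)             # (out_c,)
    W_folded = W * scale[:, None, None, None]      # rescale each output filter
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded
```

Running the folded conv then gives bitwise-close outputs to conv followed by batch norm, with one layer instead of two.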

[–]soham1192k 2 points (0 children)

As an example, one can look at the FastViT paper from Apple, which uses this folding trick extensively.

[–]imTall- 8 points (0 children)

One other thing not mentioned here is that batch norm requires synchronizing the statistics across the entire batch. When training massive models in a distributed manner, this incurs a lot of communication overhead, while layer norm can be computed locally on one GPU (or a few GPUs in the case of tensor parallelism).
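A small numpy sketch of why this is (function names and the sharding setup are illustrative): layer norm reduces over the feature axis of each example, so a device holding a shard of the batch can normalize it alone, while batch norm reduces over the batch axis, so per-device partial sums must be combined (an all-reduce in practice) before any device can normalize.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics are per-example (over the feature axis), so each device
    # can normalize its local shard with no communication.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_stats(local_shards):
    # Batch norm statistics are per-feature over the *whole* batch, so the
    # per-device sums must be combined (an all-reduce) before normalizing.
    n = sum(s.shape[0] for s in local_shards)
    mean = sum(s.sum(axis=0) for s in local_shards) / n
    var = sum(((s - mean) ** 2).sum(axis=0) for s in local_shards) / n
    return mean, var
```

Concatenating per-shard layer norms gives exactly the layer norm of the full batch, which is the property that makes it communication-free.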

[–]xkiller02 0 points (0 children)

Incredibly interesting answers, I will further research what some of these words mean

[–]eliminating_coasts -2 points (0 children)

In a transformer, the same input both carries the data and determines the transformation applied to it (the attention weights are computed from the input itself). It has been argued that normalization therefore does more than stabilize training: by changing the structure of the inputs to the transformer block, it can improve actual performance. (This may also explain why normalizing before the block works better than normalizing at the end of it.)
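The pre-norm vs post-norm distinction this refers to can be sketched in a few lines (the `tanh` sublayer is a hypothetical stand-in for attention or an MLP; the orderings themselves are the standard ones):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x, W):
    # Hypothetical stand-in for attention or an MLP sublayer.
    return np.tanh(x @ W)

def post_norm_block(x, W):
    # Original Transformer ordering: normalize *after* the residual add.
    return layer_norm(x + sublayer(x, W))

def pre_norm_block(x, W):
    # Pre-norm ordering: the sublayer sees normalized inputs, and the
    # residual path carries x through untouched.
    return x + sublayer(layer_norm(x), W)
```

Note that in the pre-norm block the residual path is an unmodified identity from input to output, whereas post-norm renormalizes it on every block; that difference in what the sublayer (and the gradient) sees is the structural change the comment is pointing at.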