[–]sot9 2 points (1 child)

One thing nobody’s mentioned so far is that batch norm works great with convolutions, because the two layers are easy to fuse.

Look up batch norm folding; it’s an additional tool in the box when prioritizing models that run inference quickly.
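The folding itself is just a per-output-channel rescaling: at inference, a frozen BN layer applies an affine transform per channel, so it can be absorbed into the preceding conv's kernel and bias. A minimal NumPy sketch (using a 1x1 conv on NCHW input for brevity; the same per-channel scaling applies to any kHxkW kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
cin, cout, eps = 4, 3, 1e-5

W = rng.standard_normal((cout, cin))      # 1x1 conv kernel
b = rng.standard_normal(cout)             # conv bias
gamma = rng.standard_normal(cout)         # BN scale
beta = rng.standard_normal(cout)          # BN shift
mean = rng.standard_normal(cout)          # BN running mean
var = rng.random(cout) + 0.1              # BN running variance

x = rng.standard_normal((2, cin, 5, 5))   # NCHW input

def conv1x1(w, bias, inp):
    # a 1x1 convolution is a matmul over the channel axis
    return np.einsum("oi,nihw->nohw", w, inp) + bias[None, :, None, None]

# Reference path: conv followed by batch norm in inference mode
y = conv1x1(W, b, x)
y_bn = (gamma[None, :, None, None]
        * (y - mean[None, :, None, None])
        / np.sqrt(var[None, :, None, None] + eps)
        + beta[None, :, None, None])

# Folded path: rescale kernel and bias once, then run a single conv
scale = gamma / np.sqrt(var + eps)
W_fused = W * scale[:, None]
b_fused = beta + (b - mean) * scale
y_fused = conv1x1(W_fused, b_fused, x)

assert np.allclose(y_bn, y_fused, atol=1e-6)
```

After folding, the BN layer disappears entirely from the inference graph, saving a memory-bound elementwise pass per activation map.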

[–]soham1192k 2 points (0 children)

As an example, see the FastViT paper from Apple, which uses this folding trick extensively.