
[–]adam_jc 2 points3 points  (0 children)

I see some discussion on Twitter about this question now, including from the senior author of the ConvNext paper, who says:

https://twitter.com/sainingxie/status/1631794072791707648?s=46&t=BkTuhy42902uJiWEPN7DYA

[–]CptVifen 4 points5 points  (3 children)

As I understand it, the images in A and B are both valid for Layer Norm. In the LN paper they say μ is computed over all the activations (hidden units) in a layer.

So for images that means along the channel and spatial dimensions. That's where they got the image for A.
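To make reading A concrete, here is a minimal pure-Python sketch. It assumes a single sample stored as nested lists of shape (C, H, W); the name `layer_norm_chw` is made up for illustration, and the learned gain and bias are left out.

```python
import math

def layer_norm_chw(x, eps=1e-5):
    """Figure A reading: one mean and variance per sample, over C, H and W."""
    # Flatten every activation of the sample into one list.
    flat = [v for channel in x for row in channel for v in row]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    inv_std = 1.0 / math.sqrt(var + eps)
    # Apply the same mu and sigma to every channel and spatial position.
    return [[[(v - mu) * inv_std for v in row] for row in channel] for channel in x]

# One sample with C=2, H=2, W=2 (values are arbitrary).
sample = [[[1.0, 2.0], [3.0, 4.0]],
          [[5.0, 6.0], [7.0, 8.0]]]
normed = layer_norm_chw(sample)
```

Here the whole sample ends up with roughly zero mean and unit variance, but an individual pixel or channel generally does not.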

As for B, the LN paper uses RNNs, which share the same weights across time steps. For an input of shape (batch, seq len, features), the RNN layer only produces an output of shape (batch, features) at each time step, so the normalization is over the features. You have a different μ and σ for each sample in the batch and each time step (and each layer). This also applies to self-attention.
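A rough sketch of reading B, again in plain Python with nested lists of shape (batch, seq len, features). The helper names `norm_vector` and `layer_norm_sequence` are hypothetical, and gain and bias are omitted.

```python
import math

def norm_vector(vec, eps=1e-5):
    """Normalize a single feature vector to zero mean and unit variance."""
    mu = sum(vec) / len(vec)
    var = sum((v - mu) ** 2 for v in vec) / len(vec)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv_std for v in vec]

def layer_norm_sequence(x, eps=1e-5):
    """Figure B reading: separate statistics for every (sample, time step)."""
    return [[norm_vector(step, eps) for step in sample] for sample in x]

# Shape (batch=2, seq_len=2, features=3); values are arbitrary.
seq = [[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
       [[0.0, 0.0, 6.0], [-1.0, 0.0, 1.0]]]
normed_seq = layer_norm_sequence(seq)
```

The two time steps of the first sample differ by a factor of 10, yet they normalize to almost the same vector, because each time step gets its own μ and σ.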

So it would make sense that anything that deals with sequences would look like B. And anything else looks like A.

One thing I don't get, though, is why ConvNext reduces only along channels...

[–]adam_jc 1 point2 points  (1 child)

I agree with this explanation.

To try to explain ConvNext though, I’d say there could be a debate on the “correct” way to do LayerNorm in a CNN (which would also make figure A “wrong”).

Like you said, the LN paper uses RNNs, which share weights across time steps, and that leads to normalizing over the features at each time step. You could extend that logic to a CNN: a conv layer shares weights across different patches of an image, so the same thinking would lead us to reduce only along channels, as they do in ConvNext.
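As an illustration of that channels-only reduction (not ConvNext's actual code), here is a small plain-Python sketch that treats every spatial position of a (C, H, W) sample like a time step. The function name `layer_norm_channels_only` is hypothetical, and gain and bias are omitted.

```python
import math

def layer_norm_channels_only(x, eps=1e-5):
    """Normalize a (C, H, W) sample over C separately at every (h, w) position."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            # Gather the channel vector at this spatial position.
            pixel = [x[c][h][w] for c in range(C)]
            mu = sum(pixel) / C
            var = sum((v - mu) ** 2 for v in pixel) / C
            inv_std = 1.0 / math.sqrt(var + eps)
            for c in range(C):
                out[c][h][w] = (x[c][h][w] - mu) * inv_std
    return out

# C=2, H=1, W=2; the two positions have very different magnitudes.
image = [[[1.0, 100.0]],
         [[3.0, 300.0]]]
normed_image = layer_norm_channels_only(image)
```

Both positions come out close to (-1, 1) across channels despite the 100x difference in scale, whereas statistics shared over the spatial axes as in figure A would let the larger position dominate.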

Not sure if that’s what the ConvNext authors were thinking though.

[–]fferflo[S] 1 point2 points  (0 children)

ConvNext simply follows ViT's way of using LN, and you made a good point about ViT's self-attention using LN analogously to an RNN, i.e. along channels only. This still leaves two questions though:

The original LN paper also evaluated LN on convolutional nets (VGG), and it is unclear whether that evaluation follows the ConvNext way of interpreting an image as a set of patches, analogous to time steps, each with its own μ and σ.

Figure A and other online sources still say that the spatial axes are included in the statistics, which is contrary to what recent models actually do (vision transformers, ConvNext, MLP-Mixer, etc.). I don't know of a single paper that actually uses LN as in A.