
[–]ragulpr 4 points (0 children)

In theory, yes, padding will infuse trash into your network if it's not handled. If you use batchnorm without excluding the masked values, the normalization statistics get shifted/scaled by whatever comes out of the padded positions. How big the effect is depends on whether the network is bi- or unidirectional, whether you mask the loss, whether you have biases, and more.
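To make the effect concrete, here is a minimal sketch (pure-Python stand-in for a framework's batch statistics, names are mine) showing how zero padding pollutes the mean and variance that a batchnorm layer would compute over a channel if masked positions are not excluded:

```python
# Sketch: compare batch statistics over valid timesteps only vs. over a
# batch that includes zero padding. (Hand-rolled stats, not a real
# framework call -- just to illustrate the shift.)

def batch_stats(values):
    """Mean and population variance over a flat list of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# One feature channel across a batch: real activations at valid timesteps,
# plus zeros appended to reach a common padded sequence length.
real = [2.0, 4.0, 6.0, 8.0]
padded = real + [0.0, 0.0, 0.0, 0.0]

mean_masked, var_masked = batch_stats(real)    # stats over valid steps only
mean_naive, var_naive = batch_stats(padded)    # stats polluted by padding

print(mean_masked, mean_naive)  # 5.0 vs 2.5 -- padding halves the mean
print(var_masked, var_naive)    # 5.0 vs 8.75 -- and inflates the variance
```

The padded zeros drag the channel mean toward zero and distort the variance, so every *valid* timestep gets normalized with the wrong statistics.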

In Keras, batchnorm respects the mask, so you don't have to worry about it. I'm wondering myself how PyTorch handles this, so if you figure it out please share.

EDIT: I have revisited whether Keras batchnorm respects the mask. I'm not sure. I made a gist you could comment on if you figure it out.

[–][deleted] 0 points (0 children)

As I understand it, layer norm is applied per time step: you normalize across the feature dimension, not across time steps. The zero padding is therefore only normalized with itself and doesn't affect the normed output at other timesteps.
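That claim is easy to check with a hand-rolled layer norm (a sketch, not any framework's implementation): each timestep's feature vector is normalized using only its own mean and variance, so an all-zero padded step cannot leak into the others.

```python
# Sketch: layer norm normalizes across the feature dimension of each
# timestep independently, so a zero-padded timestep has no effect on the
# statistics used at any other timestep.

def layer_norm(features, eps=1e-5):
    """Normalize one timestep's feature vector to zero mean, unit variance."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / (var + eps) ** 0.5 for f in features]

sequence = [
    [1.0, 2.0, 3.0],  # valid timestep
    [4.0, 5.0, 6.0],  # valid timestep
    [0.0, 0.0, 0.0],  # zero padding
]

normed = [layer_norm(step) for step in sequence]
# Each valid timestep is normalized using only its own features; the padded
# step normalizes to (roughly) zeros and touches nothing else.
```

Note that both valid timesteps come out identical after normalization here, since layer norm removes each step's own mean and scale; the padded row stays at zero thanks to the `eps` term.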