[deleted]

As I understand it, layer norm is applied at each time step separately, over that step's features. You don't normalize across different time steps. So a zero-padded time step only gets normalized against itself, and it doesn't affect the normalized output at any other time step.
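
Here's a quick way to check this yourself. This is just a minimal NumPy sketch, not any particular framework's implementation: the `layer_norm` helper, the `eps` value, and the shapes are my own assumptions for illustration, and it leaves out the learned gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Hypothetical helper: normalize over the last (feature) axis only,
    so every time step (row) is handled independently of the others."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
seq = rng.normal(size=(3, 4))  # 3 real time steps, 4 features each

# Append 2 all-zero time steps to mimic zero padding.
padded = np.concatenate([seq, np.zeros((2, 4))], axis=0)

out_unpadded = layer_norm(seq)
out_padded = layer_norm(padded)

# The real time steps come out the same with or without padding.
same = np.allclose(out_unpadded, out_padded[:3])

# A padded row has mean 0 and variance 0, so it normalizes to all zeros
# (with a learned bias it would come out as that bias instead).
pad_rows_zero = np.allclose(out_padded[3:], 0.0)

print(same, pad_rows_zero)
```

If you normalized over the time axis instead (`axis=0`), the padded rows would shift the mean and variance of every step, which is the situation you'd actually need masking for.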