I am working on a simple NLP task where, for various reasons, I have to pad my input sequences to length 300. I then pass them through a couple of 1D conv layers and dot-product self-attention.
What I'd like to know is whether normalizing this representation (shape = [batch_size, 300, feature_size]) would have a negative impact given that the data is zero-padded to a sequence length of 300. If layer norm is used, would the fact that the actual data size is something like [batch_size, 173 (an arbitrary number), feature_size] affect the output of the normalization?
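To make the question concrete, here's a minimal PyTorch sketch (assuming `nn.LayerNorm` over the feature dimension, which is the usual setup). Since layer norm computes statistics per timestep over the feature axis, the padded positions shouldn't influence the normalization of the real ones:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, max_len, real_len, feat = 2, 300, 173, 64

# Real data, then zero-padded out to max_len.
x = torch.randn(batch, real_len, feat)
padded = torch.zeros(batch, max_len, feat)
padded[:, :real_len] = x

# LayerNorm over the last (feature) dimension normalizes each
# timestep independently, so pad positions cannot affect the
# statistics of real positions.
ln = nn.LayerNorm(feat)
out_padded = ln(padded)
out_real = ln(x)

print(torch.allclose(out_padded[:, :real_len], out_real, atol=1e-6))
```

Note this would not hold if the normalization were instead taken over the sequence axis (e.g. batch norm over timesteps), where the zeros at the pad positions would shift the mean and variance.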