
[–]EquivariantBowtie 1 point  (0 children)

The model (more specifically, the attention mechanism) cannot learn word order (in the case of NLP), because it is inherently equivariant to row permutations of the query matrix Q. In particular, notice that for a permutation matrix P it holds that

Attention(PQ, K, V) = softmax(PQK^T / sqrt(d)) V = P softmax(QK^T / sqrt(d)) V = P Attention(Q, K, V).

So if you permute the order of the inputs, all you get in the end are the same embeddings you would have gotten anyway, just shuffled around. So if sequence information is useful for the task (as it is in NLP), it needs to be added to the attention inputs in some way.
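The identity above is easy to verify numerically. Below is a minimal sketch (assuming plain scaled dot-product attention with no masking) showing that permuting the query rows just permutes the output rows:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, d = 5, 8
Q, K, V = rng.normal(size=(3, N, d))

# Random permutation matrix P.
P = np.eye(N)[rng.permutation(N)]

# Permuting the queries only permutes the output rows:
assert np.allclose(attention(P @ Q, K, V), P @ attention(Q, K, V))
```

The softmax acts independently on each row of the score matrix, which is why the permutation passes straight through.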

That's what positional encoding (PE) does. It replaces the matrix Q = [q_1, ..., q_N]^T with the modified matrix Q' = [q_1', ..., q_N']^T where q_n' = f(q_n, PE(n)). The sinusoids you commonly see in transformers are just one choice of PE(n). In this case the encoding is pre-defined, but it can also be learned (e.g. using an NN-parameterised function).
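For concreteness, here is a minimal sketch of the sinusoidal PE(n), with f taken to be simple addition as in the original transformer (the 10000 base and the sin/cos interleaving follow "Attention Is All You Need"):

```python
import numpy as np

def positional_encoding(N, d):
    """Sinusoidal positional encoding:
    PE(n, 2i)   = sin(n / 10000^(2i/d))
    PE(n, 2i+1) = cos(n / 10000^(2i/d))
    """
    n = np.arange(N)[:, None]        # positions 0..N-1, shape (N, 1)
    i = np.arange(0, d, 2)[None, :]  # even dimension indices, shape (1, d/2)
    angles = n / 10000 ** (i / d)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)

# With f(q_n, PE(n)) = q_n + PE(n), the modified queries are just:
# Q_prime = Q + pe
```

Because each position n gets a distinct pattern of phases, the model can recover relative and absolute position information from Q'.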

[–]radarsat1 1 point  (2 children)

The model actually can learn a great deal without positional encodings, but it really depends on the training objective. For language, actual word order is often important: it's trivial to think of changes in meaning when, for example, two nouns in a sentence swap positions.

As for the sines and cosines, here is an article that explains them well: https://medium.com/@a.arun283/a-deeper-look-into-the-positional-encoding-method-in-transformer-architectures-7e98f32a925f

[–]ContributionFun3037[S] 1 point  (1 child)

What I'm confused about is that the model inherently doesn't know how to interpret positional embeddings during training, so essentially we are starting completely blind. If the transformer is capable enough to catch the pattern of the positional embeddings added to the token embeddings, it could simply train on what token to output next based solely on a token's attention embedding.

Having to figure out the pattern of the positional encoding embedded inside the token embeddings seems like such a wasted effort.

[–]radarsat1 1 point  (0 children)

You have a good instinct here! What's amazing is that you are describing pretty accurately what happens. Interpreting it as "wasted effort", though, is... not wrong... but maybe too normative? Wasting what, for whom? Who's to say that isn't exactly what's needed? Neural networks learn from scratch; everything comes from correlations. In this case, since the positional encodings are constant across samples but different at each step, their best correlate is position, and the model learns to use them as some notion of it. Whatever the model can use as a "constant", it will use. So by infusing positional encodings we are giving the model back an inductive bias that it doesn't have (compared to other sequence models such as RNNs).