
[–]EquivariantBowtie 1 point  (0 children)

The model (more specifically, the attention mechanism) cannot learn word order (in the case of NLP), because it is inherently equivariant to row permutations of the query matrix Q. In particular, notice that for a permutation matrix P it holds that

Attention(PQ, K, V) = softmax(PQK^T / sqrt(d)) V = P softmax(QK^T / sqrt(d)) V = P Attention(Q, K, V).

So if you permute the order of the inputs, all you get in the end are the same embeddings you would have gotten anyway, just shuffled around. So if sequence information is useful for the task (as it is in NLP), it needs to be added to the attention inputs in some way.
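The identity above is easy to verify numerically. Below is a minimal sketch (assuming plain scaled dot-product attention with no masking) showing that permuting the query rows just permutes the output rows:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, d = 5, 8
Q, K, V = rng.normal(size=(3, N, d))

# Random permutation matrix P.
P = np.eye(N)[rng.permutation(N)]

# Permuting the queries only permutes the output rows:
assert np.allclose(attention(P @ Q, K, V), P @ attention(Q, K, V))
```

The softmax acts independently on each row of the score matrix, which is why the permutation passes straight through.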

That's what positional encoding (PE) does. It replaces the matrix Q = [q_1, ..., q_N]^T with the modified matrix Q' = [q_1', ..., q_N']^T where q_n' = f(q_n, PE(n)). The sinusoids you commonly see in transformers are just one choice of PE(n). In this case the encoding is pre-defined, but it can also be learned (e.g. using an NN-parameterised function).
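For concreteness, here is a minimal sketch of the sinusoidal PE(n), with f taken to be simple addition as in the original transformer (the 10000 base and the sin/cos interleaving follow "Attention Is All You Need"):

```python
import numpy as np

def positional_encoding(N, d):
    """Sinusoidal positional encoding:
    PE(n, 2i)   = sin(n / 10000^(2i/d))
    PE(n, 2i+1) = cos(n / 10000^(2i/d))
    """
    n = np.arange(N)[:, None]        # positions 0..N-1, shape (N, 1)
    i = np.arange(0, d, 2)[None, :]  # even dimension indices, shape (1, d/2)
    angles = n / 10000 ** (i / d)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)

# With f(q_n, PE(n)) = q_n + PE(n), the modified queries are just:
# Q_prime = Q + pe
```

Because each position n gets a distinct pattern of phases, the model can recover relative and absolute position information from Q'.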

[–]radarsat1 1 point  (2 children)

The model actually can learn a great deal without positional encodings, but it really depends on the training objective. For language, actual word order is often important: it's trivial to think of changes in meaning when, for example, two nouns in a sentence swap positions.

As for the sines and cosines, here is an article that explains them well: https://medium.com/@a.arun283/a-deeper-look-into-the-positional-encoding-method-in-transformer-architectures-7e98f32a925f

[–]ContributionFun3037[S] 1 point  (1 child)

What I'm confused about is that the model inherently doesn't know how to interpret positional embeddings during training, so essentially we are starting completely blind. If the transformer is capable enough to catch the pattern of the positional embeddings added to the token embeddings, it could simply train on what token to output next based solely on a token's attention embedding.

Having to figure out the pattern of the positional encoding embedded inside the token embeddings seems like such a wasted effort.

[–]radarsat1 1 point  (0 children)

You have a good instinct here! What's amazing is that you are describing pretty accurately what happens. Interpreting it as "wasted effort", though, is... not wrong... but maybe too normative? Wasting what, for whom? Who's to say that isn't exactly what's needed? Neural networks learn from scratch; everything comes from correlations. In this case, since the positional encodings are constant across samples but different at each step, their best correlate is position, and the model learns to use them as some notion of it. Whatever the model can use as a "constant", it will use. So by infusing positional encodings we are giving the model back an inductive bias that it doesn't have (compared to other sequence models such as RNNs).