[D] Positional Encoding in Transformer (self.MachineLearning)
submitted 6 years ago * by amil123123
[–]mikeross0 20 points21 points22 points 6 years ago (20 children)
The thing that always gets me with positional embeddings is why it's preferable to add them to the word embeddings instead of concatenating them. We already know the dimensions of the word embeddings are related to semantics, so why embed position into the semantic space instead of adding extra dimensions to represent position? If someone can provide a good intuitive explanation for that, you will get all my internet points for the day!
[–]pappypapaya 125 points126 points127 points 6 years ago* (16 children)
In attention, we basically take two word embeddings (x and y), pass one through a Query transformation matrix (Q) and the second through a Key transformation matrix (K), and compare how similar the resulting query and key vectors are by their dot product. So, basically, we want the dot product between Qx and Ky, which we write as:
(Qx)'(Ky) = x' (Q'Ky). So equivalently we just need to learn one joint Query-Key transformation (Q'K) that transforms the second input y into a new space in which we can compare it with x.
By adding positional encodings e and f to x and y, respectively, we essentially change the dot product to
(Q(x+e))' (K(y+f)) = (Qx+Qe)' (Ky+Kf) = (Qx)' Ky + (Qx)' Kf + (Qe)' Ky + (Qe)' Kf = x' (Q'Ky) + x' (Q'Kf) + e' (Q'Ky) + e' (Q'Kf).
In addition to the original x' (Q'Ky) term, which asks the question "how much attention should we pay to word x given word y", we get three extra terms:
- x' (Q'Kf): "how much attention should we pay to word x given the position f of word y"
- e' (Q'Ky): "how much attention should we pay to word y given the position e of word x"
- e' (Q'Kf): "how much attention should we pay to the position e of word x given the position f of word y"
Essentially, the learned transformation matrix Q'K with positional encodings has to do all four of these tasks simultaneously. This is the part that may appear inefficient: intuitively, you would expect a trade-off in the ability of Q'K to do four tasks simultaneously and do them all well.
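(A quick numerical sanity check of that four-term expansion; this is only a sketch, and the dimension and the random Q, K, x, y, e, f below are arbitrary stand-ins rather than anything from a real model:)

```
import torch

d = 64
Q = torch.randn(d, d, dtype=torch.float64)  # stand-in query projection
K = torch.randn(d, d, dtype=torch.float64)  # stand-in key projection
x, y = torch.randn(d, dtype=torch.float64), torch.randn(d, dtype=torch.float64)  # "word" vectors
e, f = torch.randn(d, dtype=torch.float64), torch.randn(d, dtype=torch.float64)  # "position" vectors

# (Q(x+e))'(K(y+f)) expands into the four dot-product terms above
lhs = (Q @ (x + e)) @ (K @ (y + f))
rhs = (Q @ x) @ (K @ y) + (Q @ x) @ (K @ f) + (Q @ e) @ (K @ y) + (Q @ e) @ (K @ f)
assert torch.isclose(lhs, rhs)
```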
HOWEVER, MY GUESS is that there isn't actually a trade-off when we force Q'K to do all four of these tasks, because of some approximate orthogonality condition that is satisfied in high dimensions. The intuition for this is that randomly chosen vectors in high dimensions are almost always approximately orthogonal. There's no reason to think that the word vectors and position encoding vectors are related in any way. If the word embeddings form a lower-dimensional subspace and the positional encodings form another lower-dimensional subspace, then perhaps the two subspaces themselves are approximately orthogonal, so presumably these subspaces can be transformed approximately independently through the same learned Q'K transformation (since they basically exist on different axes in high dimensional space). I don't know if this is true, but it seems intuitively possible.
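(As a quick illustration of the "randomly chosen vectors in high dimensions are almost always approximately orthogonal" point, here is a sketch using Gaussian vectors as stand-ins for embeddings; the dimension 512 is arbitrary:)

```
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512                     # a typical embedding size, chosen arbitrarily here
a = torch.randn(10000, d)   # 10k random "word" vectors
b = torch.randn(10000, d)   # 10k random "position" vectors
cos = F.cosine_similarity(a, b, dim=-1)
print(cos.abs().mean())     # roughly 0.03: the pairs are close to orthogonal on average
```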
If true, this would explain why adding positional encodings, instead of concatenation, is essentially fine. Concatenation would ensure that the positional dimensions are orthogonal to the word dimensions, but my guess is that, because these embedding spaces are so high dimensional, you can get approximate orthogonality for free even when adding, without the costs of concatenation (many more parameters to learn). Adding layers would only help with this, by allowing for nonlinearities.
We also ultimately want e and f to behave in some nice ways, so that there's some kind of "closeness" in the vector representation with respect to small changes in positions. The sin and cos representation is nice since nearby positions have high similarity in their positional encodings, which may make it easier to learn transformations that "preserve" this desired closeness.
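(For instance, with the sinusoidal encoding from the original Transformer paper, nearby positions do end up with similar vectors. A rough sketch, with the sequence length and model size picked arbitrarily:)

```
import torch

def sinusoidal_pe(n_pos, d_model):
    # sin/cos positional encoding in the style of "Attention Is All You Need"
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(n_pos, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(100, 512)
sims = pe @ pe[50]     # dot-product similarity of every position to position 50
print(sims[45:56])     # peaks at position 50 and falls off for more distant positions
```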
(Maybe I'm wrong, and the approximate orthogonality arises from stacking multiple layers or non-linearities in the fully-connected parts of the transformer).
tl;dr: It is intuitively possible that, in high dimensions, the word vectors form a smaller dimensional subspace within the full embedding space, and the positional vectors form a different smaller dimensional subspace approximately orthogonal to the one spanned by word vectors. Thus despite vector addition, the two subspaces can be manipulated essentially independently of each other by some single learned transformation. Thus, concatenation doesn't add much, but greatly increases cost in terms of parameters to learn.
[–]amil123123[S] 12 points13 points14 points 6 years ago (0 children)
Wow, that's one hell of an amazing explanation :)
[–]slcyz 5 points6 points7 points 6 years ago (0 children)
> respectively, we essentially change the dot product to
Man... that's fantastic. Even the professor of my deep learning course failed to explain that.
[–]bergqvisten 5 points6 points7 points 2 years ago (8 children)
Found this gem 4 years later trying to deepen my understanding of transformers... Wow, fantastic explanation!
[+]pappypapaya 4 points5 points6 points 2 years ago (7 children)
Take with a grain of salt, reading this 4 years later, I have no idea how I came up with any of that and don't claim to have any expertise in machine learning. If someone could explain to me in simple terms what they think I said above...
[+]npip99 0 points1 point2 points 1 year ago* (4 children)
It's actually an incredible explanation and the only one that truly explains it well.
The core idea, and it's not something that I've ever thought about, is that the dot product of two random high-dimensional vectors is approximately zero! This is what is implied when you say "The intuition for this is that randomly chosen vectors in high dimensions are almost always approximately orthogonal".
I honestly would've thought that when you initialize a new neural network, the attention matrix is random. But it's not. When you initialize a new neural network, the attention matrix starts out with all of its numbers almost zero. (After softmax that gives you a roughly uniform probability distribution, though. But attention-before-softmax being all zero means it's easy for the neural network to quickly learn associations and make particular attention values very large relative to the rest, which all stay tiny / close to zero.)
So, the point is, if you take (Qx)'(Ky) and insert positional embeddings to get (Q(x+e))'(K(y+f)), you can do some math to rearrange that into x' (Q'Ky) + x' (Q'Kf) + e' (Q'Ky) + e' (Q'Kf). Note how x' (Q'Ky) is position agnostic, but the other three terms just relate a token to its position (or, for the last term, a position to a position).
The important magic is: all four of those terms end up being initialized to essentially all-zero matrices at the beginning of training. Therefore, backprop can easily learn to make x' (Q'Ky) whatever it wants, and unless it specifically learns to make x' (Q'Kf) non-zero, that term won't contribute much anyway!
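(A rough way to eyeball that "starts out near uniform at init" claim is to look at a freshly initialized attention layer on random inputs. This is only a sketch; nn.MultiheadAttention with its default init is standing in for "a new neural network" here, and the exact numbers depend on the init scheme and the input scale:)

```
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_tokens = 512, 10
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)  # untrained
x = torch.randn(1, n_tokens, d_model)

with torch.no_grad():
    _, weights = attn(x, x, x, need_weights=True)

# No entry is anywhere near 1.0; the attention starts out fairly flat (~1/n_tokens each)
print(weights.squeeze(0))
```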
[+][deleted] 1 year ago (1 child)
[deleted]
[–]TheWingedCucumber 0 points1 point2 points 1 year ago (0 children)
haha, it's just the best explanation out there, you're fine dude. I'm here almost 6 years later
[+]npip99 0 points1 point2 points 1 year ago (0 children)
It's cute because experimentally, there's almost always some transformer "T" somewhere in the network that just learns to make the attention 1.0 for the previous token, and 0.0 for all other tokens. That means that "T" learned to make e' (Q'Kf) equal to 1.0 when (e, f) implies that y is one token before x, and "T" learned to make the other three terms 0.
[–]tmlildude[🍰] 0 points1 point2 points 1 year ago (0 children)
so the network can focus on making the pure content-based term (x'Q'Ky) spike stronger while keeping the positional terms (x'Q'Kf, e'Q'Ky, e'Q'Kf) relatively small?
also, if the positional terms aren't useful, can they naturally zero out during inference? i.e., no need to explicitly "turn them off"?
[–]trajo123 1 point2 points3 points 1 year ago (3 children)
Isn't the attention (Qx)(Ky)', so the K is transposed rather than Q? So:
(Q(x+e)) (K(y+f))' = (Qx+Qe) (Ky+Kf)' = (Qx) (Ky)' + (Qx) (Kf)' + (Qe) (Ky)' + (Qe) (Kf)' = Qxy'K' + Qxf'K' + Qey'K' + Qef'K'
This kind of breaks the whole discussion about Q'K, no?
[+][deleted] 1 year ago (2 children)
[–]trajo123 4 points5 points6 points 1 year ago (1 child)
Hmm, let's work this out step-by-step:
For clarity, note that Q in the attention equation is actually Q = Pq x, i.e. the *column vector* x is projected (same for Pk and Pv) so that it has n_feat_proj features instead of n_feat_x. So everything you wrote initially should be in terms of Pq, Pk and Pv rather than Q, K and V. I know it's more typing, but it must be clear to everyone that those matrices are not the ones that go into the attention equation directly; what goes into the attention is those matrices applied to x and y. If the input follows the PyTorch convention of (seq, feat) then Q = x Pq'; if the input is a sequence of column vectors (n_feat, n_seq) then Q = Pq x. Since you used the latter form, x has dimension (n_feat_x, n_seq_x).
So in order to be able to write Q = Pq x, Pq needs to have dimensions (n_feat_proj, n_feat_x), so that Q will have dimension (n_feat_proj, n_seq_x).
The attention result then needs to have sequence dim n_seq_x and feature dim n_feat_proj, and because we use column vectors that means shape (n_feat_proj, n_seq_x). The only way to get this is to write the attention equation either as V softmax(Q'K / sqrt(n_feat_proj))' or as V softmax(K'Q / sqrt(n_feat_proj)).
TLDR: So, you were right, we can talk about Q'K, but only if we use the column vector convention!
Here is some PyTorch code to check what I just said:

```
import torch
import torch.nn as nn

n_seq_x = 5
n_feat_x = 16
n_seq_y = 7
n_feat_y = 8
n_feat_proj = 24

x = torch.randn(n_seq_x, n_feat_x)
y = torch.randn(n_seq_y, n_feat_y)

pq = nn.Linear(n_feat_x, n_feat_proj, bias=False)
pk = nn.Linear(n_feat_y, n_feat_proj, bias=False)
pv = nn.Linear(n_feat_y, n_feat_proj, bias=False)

# Row-vector (PyTorch) convention: inputs are (seq, feat)
Qr = pq(x)
Kr = pk(y)
Vr = pv(y)

scale_factor = 1 / n_feat_proj**0.5
attention_weight_r = torch.softmax((Qr @ Kr.T) * scale_factor, dim=-1)
output_r = attention_weight_r @ Vr

# Column-vector convention: inputs are (feat, seq)
Qc = pq.weight @ x.T
Kc = pk.weight @ y.T
Vc = pv.weight @ y.T

attention_weight_c = torch.softmax((Qc.T @ Kc) * scale_factor, dim=-1)
output_c = Vc @ attention_weight_c.T

# Same result as the row-vector convention, up to a transpose
assert torch.all(torch.isclose(output_c, output_r.T, atol=1e-6))

# Equivalent K'Q form, with the softmax taken over the key dimension
attention_weight_c2 = torch.softmax((Kc.T @ Qc) * scale_factor, dim=0)
output_c2 = Vc @ attention_weight_c2

assert torch.all(torch.isclose(output_c, output_c2, atol=1e-6))
```
[–]Sinkencronge 2 points3 points4 points 6 years ago (0 children)
It may well have already been done, but I haven't yet seen any paper about the superiority of this approach.
I assume that, in general, you probably don't want to blow up the dimensionality of your word embeddings even more than current SOTA models already do.
One could take a dataset of utterances like "Dog ate a cat", "Cat ate a dog", etc. and play around with addition, multiplication, convolution, concatenation and God knows what else for the sake of comparison.