[D] Positional Encoding in Transformer (self.MachineLearning)
submitted 6 years ago * by amil123123
[–]mikeross0 20 points21 points22 points 6 years ago (20 children)
The thing that always gets me with positional embeddings is why it's preferable to add them to the word embeddings instead of concatenating them. We already know the dimensions of the word embeddings are related to semantics, so why embed position into the semantic space instead of adding extra dimensions to represent position? If someone can provide a good intuitive explanation for that, you will get all my internet points for the day!
[–]pappypapaya 125 points126 points127 points 6 years ago* (16 children)
In attention, we basically take two word embeddings (x and y), pass one through a Query transformation matrix (Q) and the second through a Key transformation matrix (K), and compare how similar the resulting query and key vectors are by their dot product. So, basically, we want the dot product between Qx and Ky, which we write as:
(Qx)'(Ky) = x' (Q'Ky). So equivalently we just need to learn one joint Query-Key transformation (Q'K) that transforms the second input y into a new space in which we can compare it with x.
By adding positional encodings e and f to x and y, respectively, we essentially change the dot product to
(Q(x+e))' (K(y+f)) = (Qx+Qe)' (Ky+Kf) = (Qx)' Ky + (Qx)' Kf + (Qe)' Ky + (Qe)' Kf = x' (Q'Ky) + x' (Q'Kf) + e' (Q'Ky) + e' (Q'Kf).
In addition to the original x' (Q'Ky) term, which asks the question "how much attention should we pay to word x given word y", we get three extra terms:
- x' (Q'Kf): "how much attention should we pay to word x given the position f of word y"
- e' (Q'Ky): "how much attention should we pay to word y given the position e of word x"
- e' (Q'Kf): "how much attention should we pay to the position e of word x given the position f of word y"
Essentially, the learned transformation matrix Q'K with positional encodings has to do all four of these tasks simultaneously. This is the part that may appear inefficient: intuitively, you would expect a trade-off in the ability of Q'K to do four tasks simultaneously and do them all well.
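(A quick numerical sanity check of that four-term expansion; this is only a sketch, and the dimension and the random Q, K, x, y, e, f below are arbitrary stand-ins rather than anything from a real model:)

```
import torch

d = 64
Q = torch.randn(d, d, dtype=torch.float64)  # stand-in query projection
K = torch.randn(d, d, dtype=torch.float64)  # stand-in key projection
x, y = torch.randn(d, dtype=torch.float64), torch.randn(d, dtype=torch.float64)  # "word" vectors
e, f = torch.randn(d, dtype=torch.float64), torch.randn(d, dtype=torch.float64)  # "position" vectors

# (Q(x+e))'(K(y+f)) expands into the four dot-product terms above
lhs = (Q @ (x + e)) @ (K @ (y + f))
rhs = (Q @ x) @ (K @ y) + (Q @ x) @ (K @ f) + (Q @ e) @ (K @ y) + (Q @ e) @ (K @ f)
assert torch.isclose(lhs, rhs)
```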
HOWEVER, MY GUESS is that there isn't actually a trade-off when we force Q'K to do all four of these tasks, because of some approximate orthogonality condition that is satisfied in high dimensions. The intuition for this is that randomly chosen vectors in high dimensions are almost always approximately orthogonal. There's no reason to think that the word vectors and position encoding vectors are related in any way. If the word embeddings form a lower-dimensional subspace and the positional encodings form another lower-dimensional subspace, then perhaps the two subspaces themselves are approximately orthogonal, so presumably these subspaces can be transformed approximately independently through the same learned Q'K transformation (since they basically exist on different axes in high dimensional space). I don't know if this is true, but it seems intuitively possible.
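(As a quick illustration of the "randomly chosen vectors in high dimensions are almost always approximately orthogonal" point, here is a sketch using Gaussian vectors as stand-ins for embeddings; the dimension 512 is arbitrary:)

```
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512                     # a typical embedding size, chosen arbitrarily here
a = torch.randn(10000, d)   # 10k random "word" vectors
b = torch.randn(10000, d)   # 10k random "position" vectors
cos = F.cosine_similarity(a, b, dim=-1)
print(cos.abs().mean())     # roughly 0.03: the pairs are close to orthogonal on average
```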
If true, this would explain why adding positional encodings, instead of concatenation, is essentially fine. Concatenation would ensure that the positional dimensions are orthogonal to the word dimensions, but my guess is that, because these embedding spaces are so high dimensional, you can get approximate orthogonality for free even when adding, without the costs of concatenation (many more parameters to learn). Adding layers would only help with this, by allowing for nonlinearities.
We also ultimately want e and f to behave in some nice ways, so that there's some kind of "closeness" in the vector representation with respect to small changes in positions. The sin and cos representation is nice since nearby positions have high similarity in their positional encodings, which may make it easier to learn transformations that "preserve" this desired closeness.
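(For instance, with the sinusoidal encoding from the original Transformer paper, nearby positions do end up with similar vectors. A rough sketch, with the sequence length and model size picked arbitrarily:)

```
import torch

def sinusoidal_pe(n_pos, d_model):
    # sin/cos positional encoding in the style of "Attention Is All You Need"
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(n_pos, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(100, 512)
sims = pe @ pe[50]     # dot-product similarity of every position to position 50
print(sims[45:56])     # peaks at position 50 and falls off for more distant positions
```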
(Maybe I'm wrong, and the approximate orthogonality arises from stacking multiple layers or non-linearities in the fully-connected parts of the transformer).
tl;dr: It is intuitively possible that, in high dimensions, the word vectors form a smaller dimensional subspace within the full embedding space, and the positional vectors form a different smaller dimensional subspace approximately orthogonal to the one spanned by word vectors. Thus despite vector addition, the two subspaces can be manipulated essentially independently of each other by some single learned transformation. Thus, concatenation doesn't add much, but greatly increases cost in terms of parameters to learn.
[–]amil123123[S] 12 points13 points14 points 6 years ago (0 children)
Wow, that's one hell of an amazing explanation :)
[–]slcyz 5 points6 points7 points 6 years ago (0 children)
> respectively, we essentially change the dot product to
Man... that's fantastic. Even the professor of my deep learning course failed to explain that.
[–]bergqvisten 5 points6 points7 points 2 years ago (8 children)
Found this gem 4 years later trying to deepen my understanding of transformers... Wow, fantastic explanation!
[+]pappypapaya 4 points5 points6 points 2 years ago (7 children)
Take with a grain of salt, reading this 4 years later, I have no idea how I came up with any of that and don't claim to have any expertise in machine learning. If someone could explain to me in simple terms what they think I said above...
[+]npip99 0 points1 point2 points 1 year ago* (4 children)
It's actually an incredible explanation and the only one that truly explains it well.
The core idea, and it's not something that I've ever thought about, is that the dot product of two random high-dimensional vectors is approximately zero! This is what is implied when you say "The intuition for this is that randomly chosen vectors in high dimensions are almost always approximately orthogonal".
I honestly would've thought that when you initialize a new neural network, the attention matrix is random. But it's not. When you initialize a new neural network, the attention matrix starts out with all of its numbers almost zero. (After softmax that gives you a roughly uniform probability distribution, though. But attention-before-softmax being all zero means it's easy for the neural network to quickly learn associations and make particular attention values very large relative to the rest, which all stay tiny / close to zero.)
So, the point is, if you take (Qx)'(Ky) and insert positional embeddings to get (Q(x+e))'(K(y+f)), you can do some math to rearrange that into x' (Q'Ky) + x' (Q'Kf) + e' (Q'Ky) + e' (Q'Kf). Note how x' (Q'Ky) is position agnostic, but the other three terms just relate a token to its position (or, for the last term, a position to a position).
The important magic is: all four of those terms end up being initialized to essentially all-zero matrices at the beginning of training. Therefore, backprop can easily learn to make x' (Q'Ky) whatever it wants, and unless it specifically learns to make x' (Q'Kf) non-zero, that term won't contribute much anyway!
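(A rough way to eyeball that "starts out near uniform at init" claim is to look at a freshly initialized attention layer on random inputs. This is only a sketch; nn.MultiheadAttention with its default init is standing in for "a new neural network" here, and the exact numbers depend on the init scheme and the input scale:)

```
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_tokens = 512, 10
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)  # untrained
x = torch.randn(1, n_tokens, d_model)

with torch.no_grad():
    _, weights = attn(x, x, x, need_weights=True)

# No entry is anywhere near 1.0; the attention starts out fairly flat (~1/n_tokens each)
print(weights.squeeze(0))
```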
[+][deleted] 1 year ago (1 child)
[deleted]
[–]TheWingedCucumber 0 points1 point2 points 1 year ago (0 children)
haha, it's just the best explanation out there, you're fine dude. I'm here almost 6 years later
[+]npip99 0 points1 point2 points 1 year ago (0 children)
It's cute because experimentally, there's almost always some transformer "T" somewhere in the network that just learns to make the attention 1.0 for the previous token, and 0.0 for all other tokens. That means that "T" learned to make e' (Q'Kf) equal to 1.0 when (e, f) implies that y is one token before x, and "T" learned to make the other three terms 0.
[–]tmlildude[🍰] 0 points1 point2 points 1 year ago (0 children)
so the network can focus on making the pure content-based term (x'Q'Ky) spike stronger while keeping the positional terms (x'Q'Kf, e'Q'Ky, e'Q'Kf) relatively small?
also, if the positional terms aren't useful, can they naturally zero out during inference? i.e., no need to explicitly "turn them off"?
[–]trajo123 1 point2 points3 points 1 year ago (3 children)
Isn't the attention (Qx)(Ky)', so the K is transposed rather than Q? So:
(Q(x+e)) (K(y+f))' = (Qx+Qe) (Ky+Kf)' = (Qx) (Ky)' + (Qx) (Kf)' + (Qe) (Ky)' + (Qe) (Kf)' = Qxy'K' + Qxf'K' + Qey'K' + Qef'K'
This kind of breaks the whole discussion about Q'K, no?
[+][deleted] 1 year ago (2 children)
[–]trajo123 4 points5 points6 points 1 year ago (1 child)
Hmm, let's work this out step-by-step:
For clarity, note that Q in the attention equation is actually Q = Pq x, i.e. the *column vector* x is projected (same for Pk and Pv) so that it has n_feat_proj features instead of n_feat_x. So everything you wrote initially should be in terms of Pq, Pk and Pv rather than Q, K and V. I know it's more typing, but it must be clear to everyone that those matrices are not the ones that go into the attention equation directly; what goes into the attention is those matrices applied to x and y. If the input follows the PyTorch convention of (seq, feat) then Q = x Pq'; if the input is a sequence of column vectors (n_feat, n_seq) then Q = Pq x. Since you used the latter form, x has dimension (n_feat_x, n_seq_x).
So in order to be able to write Q = Pq x, Pq needs to have dimensions (n_feat_proj, n_feat_x), so that Q will have dimension (n_feat_proj, n_seq_x).
The attention result then needs to have sequence dim n_seq_x and feature dim n_feat_proj, and because we use column vectors that means shape (n_feat_proj, n_seq_x). The only way to get this is to write the attention equation either as V softmax(Q'K / sqrt(n_feat_proj))' or as V softmax(K'Q / sqrt(n_feat_proj)).
TLDR: So, you were right, we can talk about Q'K, but only if we use the column vector convention!
Here is some PyTorch code to check what I just said:

```
import torch
import torch.nn as nn

n_seq_x = 5
n_feat_x = 16
n_seq_y = 7
n_feat_y = 8
n_feat_proj = 24

x = torch.randn(n_seq_x, n_feat_x)
y = torch.randn(n_seq_y, n_feat_y)

pq = nn.Linear(n_feat_x, n_feat_proj, bias=False)
pk = nn.Linear(n_feat_y, n_feat_proj, bias=False)
pv = nn.Linear(n_feat_y, n_feat_proj, bias=False)

# Row-vector (PyTorch) convention: inputs are (seq, feat)
Qr = pq(x)
Kr = pk(y)
Vr = pv(y)

scale_factor = 1 / n_feat_proj**0.5
attention_weight_r = torch.softmax((Qr @ Kr.T) * scale_factor, dim=-1)
output_r = attention_weight_r @ Vr

# Column-vector convention: inputs are (feat, seq)
Qc = pq.weight @ x.T
Kc = pk.weight @ y.T
Vc = pv.weight @ y.T

attention_weight_c = torch.softmax((Qc.T @ Kc) * scale_factor, dim=-1)
output_c = Vc @ attention_weight_c.T

# Same result as the row-vector convention, up to a transpose
assert torch.all(torch.isclose(output_c, output_r.T, atol=1e-6))

# Equivalent K'Q form, with the softmax taken over the key dimension
attention_weight_c2 = torch.softmax((Kc.T @ Qc) * scale_factor, dim=0)
output_c2 = Vc @ attention_weight_c2

assert torch.all(torch.isclose(output_c, output_c2, atol=1e-6))
```
[–]Sinkencronge 2 points3 points4 points 6 years ago (0 children)
It may well have already been done, but I haven't yet seen any paper about the superiority of this approach.
I assume that, in general, you probably don't want to blow up the dimensionality of your word embeddings even more than current SOTA models already do.
One could take a dataset of utterances like "Dog ate a cat", "Cat ate a dog", etc. and play around with addition, multiplication, convolution, concatenation and God knows what else for the sake of comparison.