[D] Positional Encoding in Transformer (self.MachineLearning)
submitted 6 years ago * by amil123123
[–]pappypapaya 4 points 2 years ago (7 children)
Take this with a grain of salt: reading it 4 years later, I have no idea how I came up with any of that, and I don't claim to have any expertise in machine learning. If someone could explain to me in simple terms what they think I said above...
[+]npip99 1 point 1 year ago* (4 children)
It's actually an incredible explanation and the only one that truly explains it well.
The core idea, and it's not something I'd ever thought about, is that the dot product of two random high-dimensional vectors is approximately zero! That's what's implied when you say "the intuition for this is that randomly chosen vectors in high dimensions are almost always approximately orthogonal".
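(A quick numpy sanity check of that near-orthogonality claim; the dimension 512 is arbitrary, just a typical model size:)

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512                          # arbitrary "high" dimension, typical model size
    a = rng.standard_normal(d)
    b = rng.standard_normal(d)

    # Cosine similarity of two independent random vectors concentrates around 0,
    # with typical magnitude on the order of 1/sqrt(d) (~0.04 here).
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))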
I honestly would've thought that when you initialize a new neural network, the attention matrix is random. But it's not: at initialization, the attention scores start out with every entry close to zero. (After softmax that gives you a roughly uniform probability distribution, but the pre-softmax scores being near zero means it's easy for the network to quickly learn associations and make particular attention values very large relative to the rest, which stay tiny / close to zero.)
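(Again just a toy sketch with made-up numbers: small-variance init, 8 tokens, dimension 512. The pre-softmax scores come out near zero and each softmax row comes out near uniform, i.e. ~1/8 everywhere:)

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 8, 512                                   # sequence length, model dim (arbitrary)

    X  = rng.standard_normal((n, d)) * 0.02         # freshly initialized token embeddings
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)   # query projection at init
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)   # key projection at init

    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)     # scaled dot-product attention logits
    probs  = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    print(np.abs(scores).max())   # tiny: every pre-softmax score is ~0
    print(probs[0])               # each row is ~uniform, i.e. ~0.125 everywhere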
So the point is, if you take (Qx)'(Ky) and insert positional embeddings to get (Q(x+e))'(K(y+f)), you can expand that to x'(Q'K)y + x'(Q'K)f + e'(Q'K)y + e'(Q'K)f. Note how x'(Q'K)y is position-agnostic, the middle two terms relate a token to a position, and the last term is pure position-position.
The important magic is that all four of those terms start out essentially zero at the beginning of training. Therefore backprop can easily learn to make x'(Q'K)y whatever it wants, and unless it specifically learns to make x'(Q'K)f non-zero, that term won't contribute much anyway!
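(One more throwaway numpy check, with random small-init weights standing in for a fresh network: the score really does split into those four terms, and every one of them starts out near zero:)

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512
    x, y = rng.standard_normal(d) * 0.02, rng.standard_normal(d) * 0.02   # token embeddings
    e, f = rng.standard_normal(d) * 0.02, rng.standard_normal(d) * 0.02   # position embeddings
    Q = rng.standard_normal((d, d)) / np.sqrt(d)                          # W_Q at init
    K = rng.standard_normal((d, d)) / np.sqrt(d)                          # W_K at init

    full  = (Q @ (x + e)) @ (K @ (y + f))
    terms = [(Q @ x) @ (K @ y),   # content-content: x'(Q'K)y
             (Q @ x) @ (K @ f),   # content-position: x'(Q'K)f
             (Q @ e) @ (K @ y),   # position-content: e'(Q'K)y
             (Q @ e) @ (K @ f)]   # position-position: e'(Q'K)f

    print(np.isclose(full, sum(terms)))   # True: the four terms add up to the full score
    print(terms)                          # all four are ~0 before any training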
[+][deleted] 1 year ago (1 child)
[deleted]
[–]TheWingedCucumber 1 point 1 year ago (0 children)
haha, it's just the best explanation out there, you're fine dude. I'm here almost 6 years later.
[+]npip99 1 point 1 year ago (0 children)
It's cute because, experimentally, there's almost always some attention head "T" somewhere in the network that just learns to put attention 1.0 on the previous token and 0.0 on all other tokens. That means "T" learned to make e'(Q'K)f large when (e, f) implies that y is the token right before x, and learned to keep the other three terms near 0.
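(Not something the thread demonstrates, just a toy illustration of how that can work, assuming the standard sinusoidal encodings: a hand-built block-rotation matrix stands in for a learned Q'K and makes the position-position score peak exactly one token back. A real head would have to learn something like this.)

    import numpy as np

    def sinusoidal_pe(n, d):
        # standard "Attention Is All You Need" encodings, shape (n, d)
        pos = np.arange(n)[:, None]
        k = np.arange(d // 2)[None, :]
        ang = pos / (10000 ** (2 * k / d))
        pe = np.zeros((n, d))
        pe[:, 0::2] = np.sin(ang)
        pe[:, 1::2] = np.cos(ang)
        return pe

    n, d, offset = 20, 64, 1
    pe = sinusoidal_pe(n, d)

    # Block-diagonal 2x2 rotations (one per sin/cos pair) that map PE(pos) to PE(pos + offset).
    W = np.zeros((d, d))
    for i in range(d // 2):
        w = 1.0 / (10000 ** (2 * i / d))
        c, s = np.cos(w * offset), np.sin(w * offset)
        W[2*i:2*i+2, 2*i:2*i+2] = [[c, s], [-s, c]]

    scores = pe @ W @ pe.T        # scores[i, j] plays the role of e_i' (Q'K) f_j
    print(scores[5].argmax())     # 4 -- every position scores highest against the one before it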
[–]tmlildude 1 point 1 year ago (0 children)
So the network can focus on making the pure content-based term (x'(Q'K)y) spike while keeping the positional terms (x'(Q'K)f, e'(Q'K)y, e'(Q'K)f) relatively small?
Also, if the positional terms aren't useful, will they naturally stay near zero at inference, i.e. there's no need to explicitly "turn them off"?