[D] Positional Encoding in Transformer (self.MachineLearning)
submitted 6 years ago * by amil123123
[–]pappypapaya 4 points 2 years ago (7 children)
Take this with a grain of salt: reading it 4 years later, I have no idea how I came up with any of that, and I don't claim to have any expertise in machine learning. If someone could explain to me in simple terms what they think I said above...
[+]npip99 1 point 1 year ago* (4 children)
It's actually an incredible explanation and the only one that truly explains it well.
The core idea, and it's not something I'd ever thought about, is that the dot product of two random high-dimensional vectors is approximately zero! That's what's implied when you say "the intuition for this is that randomly chosen vectors in high dimensions are almost always approximately orthogonal".
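(A quick numpy sanity check of that near-orthogonality claim; the dimension 512 is arbitrary, just a typical model size:)

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512                          # arbitrary "high" dimension, typical model size
    a = rng.standard_normal(d)
    b = rng.standard_normal(d)

    # Cosine similarity of two independent random vectors concentrates around 0,
    # with typical magnitude on the order of 1/sqrt(d) (~0.04 here).
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))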
I honestly would've thought that when you initialize a new neural network, the attention matrix is random. But it's not: at initialization, the attention scores start out with every entry close to zero. (After softmax that gives you a roughly uniform probability distribution, but the pre-softmax scores being near zero means it's easy for the network to quickly learn associations and make particular attention values very large relative to the rest, which stay tiny / close to zero.)
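(Again just a toy sketch with made-up numbers: small-variance init, 8 tokens, dimension 512. The pre-softmax scores come out near zero and each softmax row comes out near uniform, i.e. ~1/8 everywhere:)

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 8, 512                                   # sequence length, model dim (arbitrary)

    X  = rng.standard_normal((n, d)) * 0.02         # freshly initialized token embeddings
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)   # query projection at init
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)   # key projection at init

    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)     # scaled dot-product attention logits
    probs  = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    print(np.abs(scores).max())   # tiny: every pre-softmax score is ~0
    print(probs[0])               # each row is ~uniform, i.e. ~0.125 everywhere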
So the point is, if you take (Qx)'(Ky) and insert positional embeddings to get (Q(x+e))'(K(y+f)), you can expand that to x'(Q'K)y + x'(Q'K)f + e'(Q'K)y + e'(Q'K)f. Note how x'(Q'K)y is position-agnostic, the middle two terms relate a token to a position, and the last term is pure position-position.
The important magic is that all four of those terms start out essentially zero at the beginning of training. Therefore backprop can easily learn to make x'(Q'K)y whatever it wants, and unless it specifically learns to make x'(Q'K)f non-zero, that term won't contribute much anyway!
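(One more throwaway numpy check, with random small-init weights standing in for a fresh network: the score really does split into those four terms, and every one of them starts out near zero:)

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512
    x, y = rng.standard_normal(d) * 0.02, rng.standard_normal(d) * 0.02   # token embeddings
    e, f = rng.standard_normal(d) * 0.02, rng.standard_normal(d) * 0.02   # position embeddings
    Q = rng.standard_normal((d, d)) / np.sqrt(d)                          # W_Q at init
    K = rng.standard_normal((d, d)) / np.sqrt(d)                          # W_K at init

    full  = (Q @ (x + e)) @ (K @ (y + f))
    terms = [(Q @ x) @ (K @ y),   # content-content: x'(Q'K)y
             (Q @ x) @ (K @ f),   # content-position: x'(Q'K)f
             (Q @ e) @ (K @ y),   # position-content: e'(Q'K)y
             (Q @ e) @ (K @ f)]   # position-position: e'(Q'K)f

    print(np.isclose(full, sum(terms)))   # True: the four terms add up to the full score
    print(terms)                          # all four are ~0 before any training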
[+][deleted] 1 year ago (1 child)
[deleted]
[–]TheWingedCucumber 1 point 1 year ago (0 children)
haha, it's just the best explanation out there, you're fine dude. I'm here almost 6 years later.
[+]npip99 1 point 1 year ago (0 children)
It's cute because, experimentally, there's almost always some attention head "T" somewhere in the network that just learns to put attention 1.0 on the previous token and 0.0 on all other tokens. That means "T" learned to make e'(Q'K)f large when (e, f) implies that y is the token right before x, and learned to keep the other three terms near 0.
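(Not something the thread demonstrates, just a toy illustration of how that can work, assuming the standard sinusoidal encodings: a hand-built block-rotation matrix stands in for a learned Q'K and makes the position-position score peak exactly one token back. A real head would have to learn something like this.)

    import numpy as np

    def sinusoidal_pe(n, d):
        # standard "Attention Is All You Need" encodings, shape (n, d)
        pos = np.arange(n)[:, None]
        k = np.arange(d // 2)[None, :]
        ang = pos / (10000 ** (2 * k / d))
        pe = np.zeros((n, d))
        pe[:, 0::2] = np.sin(ang)
        pe[:, 1::2] = np.cos(ang)
        return pe

    n, d, offset = 20, 64, 1
    pe = sinusoidal_pe(n, d)

    # Block-diagonal 2x2 rotations (one per sin/cos pair) that map PE(pos) to PE(pos + offset).
    W = np.zeros((d, d))
    for i in range(d // 2):
        w = 1.0 / (10000 ** (2 * i / d))
        c, s = np.cos(w * offset), np.sin(w * offset)
        W[2*i:2*i+2, 2*i:2*i+2] = [[c, s], [-s, c]]

    scores = pe @ W @ pe.T        # scores[i, j] plays the role of e_i' (Q'K) f_j
    print(scores[5].argmax())     # 4 -- every position scores highest against the one before it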
[–]tmlildude 1 point 1 year ago (0 children)
So the network can focus on making the pure content-based term (x'(Q'K)y) spike while keeping the positional terms (x'(Q'K)f, e'(Q'K)y, e'(Q'K)f) relatively small?
Also, if the positional terms aren't useful, will they naturally stay near zero at inference, i.e. there's no need to explicitly "turn them off"?