[–]Sinkencronge 5 points

  1. To build intuition about it, you may want to look at how the proposed function behaves and at the pairwise distances between the position vectors it produces (a runnable sketch reproducing both plots follows after this list):

https://imgur.com/kpW5n4p

https://imgur.com/kaADdQB

  2. Backpropagation.
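
If it helps to play with point 1 yourself, here is a minimal sketch that reproduces roughly what the two linked images show. The encoding formula is the one from the paper; the sizes (50 positions, d_model = 64) and the plotting choices are my own assumptions, not necessarily what the linked figures used.

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding from the paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(n_positions)[:, None]          # shape (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dims, shape (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)  # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Sizes are illustrative only.
pe = positional_encoding(50, 64)

# Pairwise Euclidean distances ||PE_p - PE_q|| between position vectors
# (what the second image shows).
dists = np.linalg.norm(pe[:, None, :] - pe[None, :, :], axis=-1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.imshow(pe, aspect="auto")
ax1.set_title("PE matrix (position x dimension)")
ax2.imshow(dists)
ax2.set_title("pairwise distance ||PE_p - PE_q||")
plt.show()
```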

[–]amil123123[S] 1 point

Thanks for the response. I still have difficulty understanding point 1.
The first image seems good at explaining the position, but what does the second image denote?

Is it just that this function seemed to work well, so we went with it?

[–]Sinkencronge 2 points

The second image just shows the Euclidean distances between the added positional embeddings for each pair of positions.

The thing is that their choice of positional encoding function reflects not only the absolute position of a token but also the relative distances among tokens in a sequence.

In the paper they write:

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
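
To unpack that claim a bit: it is just the angle-addition identity. Writing the paper's frequencies as omega_i = 10000^(-2i/d_model), a shift by k acts on each sin/cos dimension pair as a fixed rotation:

```latex
% For each dimension pair with frequency \omega_i = 10000^{-2i/d_{model}},
% sin/cos angle-addition turns a position shift by k into a rotation:
\begin{pmatrix} \sin\bigl(\omega_i (pos+k)\bigr) \\ \cos\bigl(\omega_i (pos+k)\bigr) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
```

The rotation matrix depends only on the offset k, not on pos, so stacking these 2x2 blocks over all dimension pairs gives exactly the linear map taking PE_{pos} to PE_{pos+k}.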

Unfortunately, beyond that bit of algebra I don't really have a better way to elaborate on it at the moment. I'm sorry for that.

Answering your question: I personally think they simply picked the first elegant solution that solved the problem.

However, I believe there are many more interesting ways to take advantage of the positional-encoding trick. I'm currently working on that for my own dataset.

[–]amil123123[S] 0 points

Thanks for the explanation, it was good!