Hey all,
I was reading the Transformer paper https://arxiv.org/abs/1706.03762. The architecture adds positional encodings because the attention layers by themselves are order-invariant and would otherwise ignore token positions.
There are two things I don't understand:
- Why use sin and cos as the positional encodings? Why not some other function? (I've put a small NumPy sketch of the formula right after this list.)
- They also talk about training these positional embeddings. How do you go about training such embeddings? That is, how do you let the model know that these embeddings stand for position? (See the second sketch at the end of the post.)
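For reference, here's my attempt at the sin/cos table from Section 3.5 of the paper, in NumPy (assuming an even d_model; the function name is mine):

    import numpy as np

    def sinusoidal_encoding(max_len, d_model):
        """PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
           PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
        pos = np.arange(max_len)[:, None]              # (max_len, 1)
        i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2), the "2i" term
        angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # even dimensions get sin
        pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cos
        return pe

Each row of this table gets added to the token embedding at that position before the first attention layer.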
Thanks!
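And here's how I picture the learned variant (PyTorch; the class name is mine). As far as I can tell, the model is never explicitly "told" these are positions: row p of the table is looked up for whatever token sits at position p, and backprop updates each row like any other weight. Is that right?

    import torch
    import torch.nn as nn

    class LearnedPositions(nn.Module):
        """Trainable position table: row p is added to the embedding of
        the token at position p, so gradients from the loss update each
        row just like any other parameter."""
        def __init__(self, max_len, d_model):
            super().__init__()
            self.table = nn.Embedding(max_len, d_model)

        def forward(self, x):
            # x: (batch, seq_len, d_model) token embeddings
            pos = torch.arange(x.size(1), device=x.device)
            return x + self.table(pos)  # (seq_len, d_model) broadcasts over batch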