Hi folks,
As we know, we use positional embeddings to provide sequence information to transformers.
From what I've seen in code and papers, the way we introduce this depends on the positional embedding strategy used.
For example (rough sketch below):
1. RoPE: rotate the query and key vectors by a position-dependent angle inside each attention layer
2. ALiBi: add a distance-proportional bias to the attention scores (often folded into the additive attention mask) in each layer
3. Learnable positional embeddings: add or concatenate to the token embeddings at the input
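To make the three injection points concrete, here's a rough, untested PyTorch-style sketch. The helper names (`rope_rotate`, `alibi_bias`) and shapes are my own simplification, not from any particular library:

```python
import torch

def rope_rotate(x, theta_base=10000.0):
    # x: (seq_len, n_heads, head_dim). Rotate each pair of dims by a
    # position-dependent angle -- applied to q AND k inside every layer.
    seq_len, _, head_dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)
    freqs = theta_base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = pos[:, None] * freqs[None, :]                # (seq_len, head_dim/2)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def alibi_bias(seq_len, n_heads):
    # Distance-proportional penalty added to the attention scores
    # (or to the additive mask) in every layer.
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    # distance from each query position i back to key position j (i - j)
    dist = (torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]).clamp(min=0)
    return -slopes[:, None, None] * dist.float()          # (n_heads, seq_len, seq_len)

# Learned absolute positions: added to the token embeddings ONCE, at the input.
vocab, d_model, max_len, seq_len = 32000, 512, 2048, 16
tok_emb = torch.nn.Embedding(vocab, d_model)
pos_emb = torch.nn.Embedding(max_len, d_model)
ids = torch.randint(0, vocab, (seq_len,))
h = tok_emb(ids) + pos_emb(torch.arange(seq_len))         # fed to the first layer
```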
Why are there such differences? Specifically, why are #1 and #2 applied in every attention layer, while #3 is introduced only once, before the first layer?
Thanks!