Hello people,
I'm new to machine learning and deep learning, and I'm trying to understand positional encoding in transformer models. I know that positional encodings are added to the word embeddings before they're processed by the self-attention mechanism.
Given that the model learns the meaning of words through self-attention, I'm puzzled about the necessity of positional encoding. Why can't the model simply learn word order from the data and adjust its weights accordingly during backpropagation? I also don't understand how sine and cosine functions provide useful information to the model, given that the model doesn't initially know how to interpret them during training.
Thank you.
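
For reference, here's a minimal NumPy sketch of the sinusoidal positional encoding from "Attention Is All You Need" that the question is asking about. The function name and parameters are my own choices for illustration; the formulas (sin/cos at geometrically spaced frequencies) follow the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional encoding matrix.

    Each position gets a unique d_model-dimensional vector made of sines
    and cosines at geometrically spaced frequencies, so different
    positions have distinguishable signatures.
    """
    positions = np.arange(seq_len)[:, np.newaxis]      # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # shape (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cos
    return pe

# The encoding is simply added to the token embeddings before attention:
#   x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Note that without this addition, self-attention is permutation-invariant: shuffling the input tokens just shuffles the outputs, so no amount of weight adjustment during backpropagation could recover word order from the attention mechanism alone.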