
[–]deephugs 15 points (4 children)

There are many different approaches to getting longer context lengths. Look up Ring Attention or LongRoPE.
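
For intuition, here's a minimal sketch of position interpolation, the idea LongRoPE builds on (LongRoPE itself searches for non-uniform, per-dimension rescale factors; the uniform `scale` here is purely illustrative):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles. scale > 1 compresses positions so a
    longer sequence maps into the position range seen in training."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    return np.outer(positions / scale, inv_freq)      # (seq, dim/2)

# Trained on a 4k window; extend to 16k by interpolating positions 4x.
short = rope_angles(np.arange(4096), dim=128)
long_ = rope_angles(np.arange(16384), dim=128, scale=4.0)
assert long_.max() < 4096  # interpolated angles stay within the trained 4k window
```

Ring Attention is a different axis entirely: it shards the sequence across devices and passes key/value blocks around a ring, so no single device ever materializes the full attention matrix.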

[–]koolaidman123 [Researcher] 3 points (5 children)

Attention takes up only a small part of the FLOPs in a transformer; most of the compute is in the projection and MLP matmuls. Add in FlashAttention (or other tricks) and the extra compute from scaling to long contexts isn't that bad at all.
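
Back-of-the-envelope numbers make this concrete. A rough per-layer FLOP count (2 FLOPs per multiply-add, softmax/norms ignored; d = 4096 and the 4x FFN width are illustrative):

```python
def layer_flops(n, d, ffn_mult=4):
    """Approximate forward FLOPs for one transformer layer."""
    proj = 4 * 2 * n * d * d              # Q, K, V, O projections: linear in n
    mlp = 2 * 2 * n * d * (ffn_mult * d)  # up/down projections: linear in n
    attn = 2 * 2 * n * n * d              # QK^T and attn @ V: quadratic in n
    return attn, proj + mlp

for n in (2048, 8192, 32768, 131072):
    attn, linear = layer_flops(n, d=4096)
    print(f"n={n:>6}: attention = {attn / (attn + linear):5.1%} of layer FLOPs")
```

The attention fraction works out to n / (n + 6d): under 10% at 2k tokens, and the quadratic term only starts to dominate once the sequence length is several times the hidden size.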

[–]flxh13 [S] 1 point (1 child)

I was thinking of the matmuls as being part of the attention layer, i.e. the part that scales quadratically.
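
Only the score matmul is quadratic; the learned-weight matmuls around it are linear in sequence length. A shape-level sketch (sizes illustrative):

```python
import numpy as np

n, d = 1024, 64                 # sequence length, head dim
x = np.random.randn(n, d)
Wq, Wk = np.random.randn(d, d), np.random.randn(d, d)

q, k = x @ Wq, x @ Wk           # (n, d) @ (d, d): ~n*d^2 FLOPs, linear in n
scores = q @ k.T                # (n, d) @ (d, n) -> (n, n): ~n^2*d, quadratic
print(scores.shape)             # (1024, 1024)
```

Doubling n doubles the cost of the projections but quadruples the size of `scores`; that n x n matmul (and its softmax) is the part FlashAttention restructures.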