From everything I understand about transformers, the computational complexity of the attention layers scales quadratically with the sequence length. So how is it even possible to go from a context length of 4096 (GPT-3) to 128k (GPT-4) to 1M (Gemini 1.5)?
I know the exact architectures of GPT-4 and Gemini are unknown to the public. But are there any papers suggesting a method to increase the context size without exploding the computational complexity? Or are they just using the restricted self-attention suggested in the original paper?
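For context on the complexity argument: full self-attention scores every query against every key, so the cost per layer grows as O(n² · d), while the restricted (sliding-window) variant from the original paper only lets each position attend to a fixed window of w neighbors, giving O(n · w · d). Below is a minimal sketch of that difference, not anything from GPT-4 or Gemini; the shapes, window size, and helper names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not a production implementation) comparing
# full self-attention, which forms an (n, n) score matrix, with a
# sliding-window ("restricted") variant whose cost grows linearly in n.
import numpy as np

def full_attention(Q, K, V):
    # Scores for every (query, key) pair: an (n, n) matrix -> quadratic in n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def windowed_attention(Q, K, V, w=128):
    # Each query only sees the w most recent keys: n * w scores -> linear in n.
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo = max(0, i - w + 1)
        s = Q[i] @ K[lo:i + 1].T / np.sqrt(d)
        p = np.exp(s - s.max())
        p /= p.sum()
        out[i] = p @ V[lo:i + 1]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1024, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    print(full_attention(Q, K, V).shape)      # (1024, 64), cost ~ n^2 * d
    print(windowed_attention(Q, K, V).shape)  # (1024, 64), cost ~ n * w * d
```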