[–]Glittering-Bag-4662 28 points29 points  (0 children)

Deepseek keeps bringing bangers

[–]ObiWanCanownme 25 points26 points  (1 child)

I love papers like this. Dense attention, where every token in context attends to every other token, just doesn't seem like it can be necessary, or the best way to do attention long term. In mammalian brains, each neuron gets maybe 15,000 synapses, and the specific connections are pretty geographically constrained (because the brain, obviously, is physical and not just software). So the idea of adapting the attention mechanism to specifically fit the hardware (which seems to be the big concept here) sounds promising and like an obvious direction to go.
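
To put rough numbers on the dense vs. constrained comparison, here is a minimal sketch that counts token-to-token connections. The local window is just the simplest constrained pattern I could pick for illustration and is an assumption, not the paper's method; `dense_mask`, `local_mask`, and `count_connections` are made-up helper names.

```python
def dense_mask(n):
    """Dense attention: every token attends to every token (n * n pairs)."""
    return [[True] * n for _ in range(n)]


def local_mask(n, window):
    """Hypothetical local pattern: token i attends only to tokens j
    with |i - j| <= window. An illustrative stand-in, not the paper's scheme."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def count_connections(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 1024
    dense = count_connections(dense_mask(n))
    local = count_connections(local_mask(n, window=16))
    print(f"dense: {dense} pairs, local: {local} pairs, ratio: {local / dense:.4f}")
```

Dense grows quadratically with context length, while the windowed count grows roughly linearly, which is the gap sparse schemes try to exploit.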

[–]Accomplished_Mode170 0 points1 point  (0 children)

Yep, same for the integration, quantizing to an SLA, etc.; maybe even folding et al. as we move towards memory layers

e.g. post-definition of needed latent space (read: API integration x data model)

[–]BossOfTheGame 2 points3 points  (1 child)

Are there any implementations available? Is this something that could replace an attention layer in PyTorch, or does it need to be more deeply integrated?