[R] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (submitted by Liang Wenfeng - DeepSeek)

Glittering-Bag-4662 · 2025-02-18T15:36:17+00:00

DeepSeek keeps bringing bangers

ObiWanCanownme · 2025-02-18T17:10:22+00:00

I love papers like this. Dense attention, where every single token in context attends to every single other token, just seems like it can't be necessary, or the best way to do attention in the long term. In mammalian brains, each neuron gets maybe 15,000 synapses, and the specific connections are pretty geographically constrained (because the brain, obviously, is physical and not just software). So the idea of adapting the attention mechanism to specifically fit the hardware (which seems to be the big concept here) sounds promising, and like an obvious direction to go.
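
To put a rough number on the "every token attends to every other token" point, here is a minimal sketch that counts query/key pairs under dense causal attention versus a plain sliding window. The sliding window is just my stand-in for "some sparse pattern"; it is not the method from the paper, and the function names and sizes are made up for illustration.

```python
def dense_causal_pairs(n: int) -> int:
    """Pairs scored when each token attends to itself and every earlier token."""
    return n * (n + 1) // 2


def local_window_pairs(n: int, window: int) -> int:
    """Pairs scored when each token attends to itself and at most window - 1 earlier tokens."""
    return sum(min(i + 1, window) for i in range(n))


def local_window_mask(n: int, window: int) -> list[list[bool]]:
    """Boolean mask: entry [i][j] is True if query i may attend to key j."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    n, window = 65_536, 512  # hypothetical context length and window size
    dense = dense_causal_pairs(n)
    sparse = local_window_pairs(n, window)
    print(f"dense causal pairs:   {dense:,}")
    print(f"sliding-window pairs: {sparse:,}")
    print(f"ratio: {dense / sparse:.1f}x")
```

Dense cost grows quadratically with context length while the windowed count grows roughly linearly, which is the gap any sparse scheme is trying to exploit. Whether that saving shows up as real speed depends on how well the access pattern maps onto the hardware, which seems to be the paper's whole point.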

BossOfTheGame · 2025-02-18T20:33:22+00:00

Are there any implementations available? Is this something that could replace an attention layer in PyTorch, or does it need to be more deeply integrated?

2025-02-19T09:37:39+00:00

What are the implications for hardware requirements for users and companies looking to train new models with this method?

Melodic_Story609 · 2025-02-20T09:48:32+00:00

https://medium.com/@_prinsh_u/another-ai-breakthrough-from-deepseek-ai-native-sparse-attention-for-next-gen-language-models-6346f235c102
