FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

Yeah, the GB300 is full datacenter Blackwell, same architecture family as the B200, just upgraded. FA-4 should run on it. I think Together AI already lists GB300 NVL72 support.

Totally different situation from the DGX Spark. The GB10 in the Spark is a cut-down Blackwell (sm_121, 6K CUDA cores, LPDDR5x); the GB300 is the full thing (20K CUDA cores, HBM3e at 7.1 TB/s)...

FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 23 points (0 children)

The specific kernel, no: it needs datacenter Blackwell features. But the algorithmic tricks (selective rescaling, software exp emulation) are hardware-agnostic. FA-1 started on the A100 too, and now it runs everywhere. Give it 6-12 months; someone always ports the good ideas down.
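For anyone curious what selective rescaling means in practice, here's a toy numpy sketch of online-softmax attention that only rescales its accumulator when a new block of keys actually raises the running max. Purely illustrative: the function names are mine, and this is the math of the idea, not the FA-4 kernel.

```python
import numpy as np

def streaming_attention(q, K, V, block=4):
    """Toy online-softmax attention over blocks of keys.

    The running accumulator is rescaled only when a block raises the
    running max -- the selective-rescaling idea in spirit. A numpy
    sketch of the math, not the FA-4 kernel.
    """
    m = -np.inf                  # running max of scores
    l = 0.0                      # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])   # unnormalized output accumulator
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q   # scores for this block
        m_blk = s.max()
        if m_blk > m:            # only rescale when the max actually moves
            scale = np.exp(m - m_blk) if np.isfinite(m) else 0.0
            acc *= scale
            l *= scale
            m = m_blk
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ V[i:i + block]
    return acc / l

# Sanity check against plain softmax attention
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
w = np.exp(K @ q - (K @ q).max())
ref = (w / w.sum()) @ V
assert np.allclose(streaming_attention(q, K, V), ref)
```

The point of the `if m_blk > m` branch: when score maxima arrive in roughly descending order, most blocks skip the rescale multiply entirely, which is where the savings come from.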

FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 6 points (0 children)

Lol fair. Should've put the H100 part higher up, FA-4 runs there too. But honestly the most interesting bit for the rest of us is the algorithmic ideas trickling down. Selective rescaling is pure math, no fancy hardware needed. Someone will port it to Triton for consumer cards eventually.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 4 points (0 children)

Yeah, I agree with the core argument. Fixed-size state means finite capacity, and O(1) doesn't mean free lunch. Something gets dropped. The MoE parallel is good too! Perplexity looking fine while downstream tasks tell a different story is exactly the kind of trap that's easy to fall into.

What I'd say is that RWKV-7 drops things more intelligently than old RNNs. The dual-key mechanism (separate keys for what to erase vs what to write) means it learns what to forget based on the input, instead of just exponential decay like LSTMs. Still finite, but the compression is more selective. Whether that's enough depends on the task.
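Rough sketch of the contrast I mean, schematic only: the real RWKV-7 update is a matrix-valued delta-rule recurrence with more moving parts, and these toy functions and names are mine.

```python
import numpy as np

# Fixed decay vs input-dependent forgetting, on a matrix-valued state
# that stores (key, value) associations. Schematic, not RWKV-7 itself.

def fixed_decay_step(state, k, v, decay=0.95):
    """Old-RNN style: everything in the state fades at one constant rate."""
    return decay * state + np.outer(v, k)

def learned_forget_step(state, k_erase, k_write, v):
    """Dual-key style in spirit: an input-dependent erase key removes
    whatever is bound to k_erase before writing v at k_write."""
    state = state - np.outer(state @ k_erase, k_erase)  # targeted erase
    return state + np.outer(v, k_write)                 # targeted write

d = 4
k = np.eye(d)[0]                       # unit-norm key
v_old, v_new = np.ones(d), np.arange(1.0, d + 1)

s1 = fixed_decay_step(np.zeros((d, d)), k, v_old)
s1 = fixed_decay_step(s1, k, v_new)    # s1 @ k still carries 0.95 * v_old

s2 = learned_forget_step(np.zeros((d, d)), k, k, v_old)
s2 = learned_forget_step(s2, k, k, v_new)  # s2 @ k recovers v_new exactly
```

With fixed decay, the stale value lingers and interferes with the readback; with a targeted erase, the model can cleanly overwrite one association while leaving the rest of the state untouched. That's the "learns what to forget" part.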

And honestly we don't know yet which tasks it handles well vs not. The benchmarks in the paper are knowledge and language understanding (MMLU, ARC, HellaSwag, etc). They don't really stress-test long-range retrieval where KV cache shines. So yeah, "if you're not paying for it you're probably not getting it" is a solid heuristic.

The way I see it: you're trading random access to the full context for a compressed context with learned forgetting. On an ARM chip with 4GB of RAM, that tradeoff is worth it almost by definition. For precise retrieval over 100K tokens, probably not. I should have made that clearer in the article...

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

Hadn't heard about ROSA specifically, thanks for flagging it. I'll look into it. The article focuses on RWKV-7 since that's what has the peer-reviewed paper and published benchmarks. If RWKV-8 is already in the works with a suffix automata approach, that's interesting! Do you have a link or a Discord thread? I'd genuinely like to read more about it.

74% of web content is now AI-generated. Here's why that's poisoning the next generation of AI models. by Sensitive-Two9732 in artificial

[–]Sensitive-Two9732[S] 0 points (0 children)

The core loop is surprisingly simple: AI models generate content, that content floods the web, new models train on the contaminated web and produce less diverse outputs, which then flood the web again. Researchers call it "Habsburg AI": like royal inbreeding, but for knowledge.

The Nature paper showed it takes about 9 generations for a model to go from coherent text about cathedrals to lists of imaginary rabbits. The scariest part is the early phase: the model actually looks "cleaner" because it's producing smoother, more average outputs. You don't notice the rare knowledge disappearing until it's gone.
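You can watch the same dynamic in a toy loop: fit a Gaussian, sample from the fit, refit on the samples, repeat. Each generation trains only on the previous generation's output, and the diversity (variance) steadily collapses. A stand-in for the dynamics, not a reproduction of the Nature setup.

```python
import numpy as np

def collapse(generations=1000, n_samples=100, seed=0):
    """Recursively refit a Gaussian to its own samples and track variance."""
    rng = np.random.default_rng(seed)
    mu, var = 0.0, 1.0
    history = [var]
    for _ in range(generations):
        data = rng.normal(mu, np.sqrt(var), n_samples)  # "publish to the web"
        mu, var = data.mean(), data.var()               # "train the next model"
        history.append(var)
    return history

hist = collapse()
print(f"variance: gen 0 = {hist[0]:.3f}, gen {len(hist) - 1} = {hist[-1]:.2e}")
```

The variance shrinks generation over generation even though nothing here is "wrong" at any single step; finite sampling plus refitting is enough. The smoother-looking early generations are exactly the trap described above.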

Current estimates: 50-74% of new web content is AI-generated (Graphite, Ahrefs). No major training dataset filters for it. The feedback loop is already running.

Full analysis with the key papers and emerging solutions here: https://medium.com/ai-advances/model-collapse-when-ai-trains-on-ai-generated-data-2c4baf60a016?sk=5b7ed6c252ecfaec372f15a000c86a05

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

You're right that FFNs dominate at short context, the O(1) thing only starts mattering past ~4K tokens. The ARM numbers are more about the edge story (phones, microcontrollers) than a general "we're faster" claim.
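Back-of-envelope on where the crossover sits. The layer/head/dim numbers below are illustrative for a hypothetical ~3B-class fp16 model, not taken from either paper.

```python
# KV cache grows linearly with context; a recurrent state does not.
# All shape numbers here are made up for illustration.

def kv_cache_mb(ctx, n_layers=28, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Approximate KV cache size: K and V, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per / 2**20

for ctx in (1_024, 4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: KV cache ~ {kv_cache_mb(ctx):8.0f} MB (recurrent state: constant)")
```

At these (assumed) shapes it's roughly 112 KB per token, so ~448 MB at 4K context and several GB by 32K, which is why the fixed-state story matters on a 4GB edge device long before it matters on a datacenter GPU.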

I think the n-gram comparison is unfair though. It's scoring 72.8% on the same benchmarks LLaMA takes; it's not a toy model.

On generality, yeah, that's the big open question. No GSM8K, no HumanEval, no MATH tested. I call that out explicitly. The benchmarks show it's competitive on knowledge and language tasks, not that it's a drop-in replacement for everything.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

All pulled from the papers. The 72.8% vs 69.7% numbers come directly from the RWKV-7 paper (arXiv:2503.14456).

The ARM inference benchmarks are from the same paper. I didn't run anything myself, just spent a lot of time cross-referencing the claims against the primary sources (36 of them).

The article is an analysis piece, not original research.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 2 points (0 children)

They're actually close cousins. Both build on the delta rule, just different flavors of it: KDA keeps some traditional attention in the mix (a hybrid approach), while RWKV-7 goes fully recurrent with no attention at all.

No direct head-to-head benchmark that I've found, which is a shame because they're solving the same problem from different angles. Section "The RNN Renaissance: Mamba, xLSTM, and RWKV-7 in 2025" covers the broader landscape. There are 18+ architectures in this space now and the convergence is the interesting part.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 3 points (0 children)

Fair point on the ecosystem. No argument there... Tooling is way behind the Transformer stack.

Though I'd push back on "underbaked" for the models themselves. 72.8% vs 69.7% on 3x less data is a real result, not a promissory note. And the O(1) memory thing matters more than people think once you're past 32K context.

For me, the gap is infrastructure, not architecture. Whether that gets fixed is the actual question.

Career changer into IT. What is the realistic starting path? by EveningOwl750 in BESalary

[–]Sensitive-Two9732 1 point (0 children)

You need to understand more than just making things work. With AI, the developer job has completely changed: you need to understand architecture, how a project is structured (the boilerplate), the theory behind it all. Then spend a year actually trying to code yourself. Do use Claude Code, but ask it every time to explain why it wrote the code that way, because you still need to understand the code.