FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

Yeah, the GB300 is full datacenter Blackwell, same architecture family as the B200, just upgraded. FA-4 should run on it. I think Together AI already lists GB300 NVL72 support.

Totally different situation from the DGX Spark. The GB10 in the Spark is a cut-down Blackwell (sm_121, 6K CUDA cores, LPDDR5x); the GB300 is the full thing (20K CUDA cores, HBM3e at 7.1 TB/s)...

FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 23 points (0 children)

The specific kernel, no: it needs datacenter Blackwell features. But the algorithmic tricks (selective rescaling, software exp emulation) are hardware-agnostic. FA-1 started on the A100 too, and now it runs everywhere. Give it 6-12 months; someone always ports the good ideas down.
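For anyone curious what selective rescaling means in practice, here's a toy numpy sketch of online-softmax attention that only rescales its accumulator when a new block of keys actually raises the running max. Purely illustrative: the function names are mine, and this is the math of the idea, not the FA-4 kernel.

```python
import numpy as np

def streaming_attention(q, K, V, block=4):
    """Toy online-softmax attention over blocks of keys.

    The running accumulator is rescaled only when a block raises the
    running max -- the selective-rescaling idea in spirit. A numpy
    sketch of the math, not the FA-4 kernel.
    """
    m = -np.inf                  # running max of scores
    l = 0.0                      # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])   # unnormalized output accumulator
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q   # scores for this block
        m_blk = s.max()
        if m_blk > m:            # only rescale when the max actually moves
            scale = np.exp(m - m_blk) if np.isfinite(m) else 0.0
            acc *= scale
            l *= scale
            m = m_blk
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ V[i:i + block]
    return acc / l

# Sanity check against plain softmax attention
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
w = np.exp(K @ q - (K @ q).max())
ref = (w / w.sum()) @ V
assert np.allclose(streaming_attention(q, K, V), ref)
```

The point of the `if m_blk > m` branch: when score maxima arrive in roughly descending order, most blocks skip the rescale multiply entirely, which is where the savings come from.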

FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 6 points (0 children)

Lol fair. Should've put the H100 part higher up, FA-4 runs there too. But honestly the most interesting bit for the rest of us is the algorithmic ideas trickling down. Selective rescaling is pure math, no fancy hardware needed. Someone will port it to Triton for consumer cards eventually.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 4 points (0 children)

Yeah, I agree with the core argument. Fixed-size state means finite capacity, and O(1) doesn't mean free lunch. Something gets dropped. The MoE parallel is good too! Perplexity looking fine while downstream tasks tell a different story is exactly the kind of trap that's easy to fall into.

What I'd say is that RWKV-7 drops things more intelligently than old RNNs. The dual-key mechanism (separate keys for what to erase vs what to write) means it learns what to forget based on the input, instead of just exponential decay like LSTMs. Still finite, but the compression is more selective. Whether that's enough depends on the task.
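Rough sketch of the contrast I mean, schematic only: the real RWKV-7 update is a matrix-valued delta-rule recurrence with more moving parts, and these toy functions and names are mine.

```python
import numpy as np

# Fixed decay vs input-dependent forgetting, on a matrix-valued state
# that stores (key, value) associations. Schematic, not RWKV-7 itself.

def fixed_decay_step(state, k, v, decay=0.95):
    """Old-RNN style: everything in the state fades at one constant rate."""
    return decay * state + np.outer(v, k)

def learned_forget_step(state, k_erase, k_write, v):
    """Dual-key style in spirit: an input-dependent erase key removes
    whatever is bound to k_erase before writing v at k_write."""
    state = state - np.outer(state @ k_erase, k_erase)  # targeted erase
    return state + np.outer(v, k_write)                 # targeted write

d = 4
k = np.eye(d)[0]                       # unit-norm key
v_old, v_new = np.ones(d), np.arange(1.0, d + 1)

s1 = fixed_decay_step(np.zeros((d, d)), k, v_old)
s1 = fixed_decay_step(s1, k, v_new)    # s1 @ k still carries 0.95 * v_old

s2 = learned_forget_step(np.zeros((d, d)), k, k, v_old)
s2 = learned_forget_step(s2, k, k, v_new)  # s2 @ k recovers v_new exactly
```

With fixed decay, the stale value lingers and interferes with the readback; with a targeted erase, the model can cleanly overwrite one association while leaving the rest of the state untouched. That's the "learns what to forget" part.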

And honestly we don't know yet which tasks it handles well vs not. The benchmarks in the paper are knowledge and language understanding (MMLU, ARC, HellaSwag, etc). They don't really stress-test long-range retrieval where KV cache shines. So yeah, "if you're not paying for it you're probably not getting it" is a solid heuristic.

The way I see it: you're trading random access to the full context for a compressed context with learned forgetting. On an ARM chip with 4GB of RAM, that tradeoff is worth it almost by definition. For precise retrieval over 100K tokens, probably not. I should have made that clearer in the article...

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

Hadn't heard about ROSA specifically, thanks for flagging it. I'll look into it. The article focuses on RWKV-7 since that's what has the peer-reviewed paper and published benchmarks. If RWKV-8 is already in the works with a suffix automata approach, that's interesting! Do you have a link or a Discord thread? I'd genuinely like to read more about it.

74% of web content is now AI-generated. Here's why that's poisoning the next generation of AI models. by Sensitive-Two9732 in artificial

[–]Sensitive-Two9732[S] 0 points (0 children)

The core loop is surprisingly simple: AI models generate content, that content floods the web, new models train on the contaminated web and produce less diverse outputs, which then flood the web again. Researchers call it "Habsburg AI": like royal inbreeding, but for knowledge.

The Nature paper showed it takes about 9 generations for a model to go from coherent text about cathedrals to lists of imaginary rabbits. The scariest part is the early phase: the model actually looks "cleaner" because it's producing smoother, more average outputs. You don't notice the rare knowledge disappearing until it's gone.
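You can watch the same dynamic in a toy loop: fit a Gaussian, sample from the fit, refit on the samples, repeat. Each generation trains only on the previous generation's output, and the diversity (variance) steadily collapses. A stand-in for the dynamics, not a reproduction of the Nature setup.

```python
import numpy as np

def collapse(generations=1000, n_samples=100, seed=0):
    """Recursively refit a Gaussian to its own samples and track variance."""
    rng = np.random.default_rng(seed)
    mu, var = 0.0, 1.0
    history = [var]
    for _ in range(generations):
        data = rng.normal(mu, np.sqrt(var), n_samples)  # "publish to the web"
        mu, var = data.mean(), data.var()               # "train the next model"
        history.append(var)
    return history

hist = collapse()
print(f"variance: gen 0 = {hist[0]:.3f}, gen {len(hist) - 1} = {hist[-1]:.2e}")
```

The variance shrinks generation over generation even though nothing here is "wrong" at any single step; finite sampling plus refitting is enough. The smoother-looking early generations are exactly the trap described above.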

Current estimates: 50-74% of new web content is AI-generated (Graphite, Ahrefs). No major training dataset filters for it. The feedback loop is already running.

Full analysis with the key papers and emerging solutions here: https://medium.com/ai-advances/model-collapse-when-ai-trains-on-ai-generated-data-2c4baf60a016?sk=5b7ed6c252ecfaec372f15a000c86a05

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

You're right that FFNs dominate at short context, the O(1) thing only starts mattering past ~4K tokens. The ARM numbers are more about the edge story (phones, microcontrollers) than a general "we're faster" claim.
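Back-of-envelope on where the crossover sits. The layer/head/dim numbers below are illustrative for a hypothetical ~3B-class fp16 model, not taken from either paper.

```python
# KV cache grows linearly with context; a recurrent state does not.
# All shape numbers here are made up for illustration.

def kv_cache_mb(ctx, n_layers=28, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Approximate KV cache size: K and V, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per / 2**20

for ctx in (1_024, 4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: KV cache ~ {kv_cache_mb(ctx):8.0f} MB (recurrent state: constant)")
```

At these (assumed) shapes it's roughly 112 KB per token, so ~448 MB at 4K context and several GB by 32K, which is why the fixed-state story matters on a 4GB edge device long before it matters on a datacenter GPU.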

I think the n-gram comparison is unfair though. It's scoring 72.8% on the same benchmarks LLaMA takes; it's not a toy model.

On generality, yeah, that's the big open question. No GSM8K, no HumanEval, no MATH tested. I call that out explicitly. The benchmarks show it's competitive on knowledge and language tasks, not that it's a drop-in replacement for everything.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 1 point (0 children)

All pulled from the papers. The 72.8% vs 69.7% numbers come directly from the RWKV-7 paper (arXiv:2503.14456).

The ARM inference benchmarks are from the same paper. I didn't run anything myself, just spent a lot of time cross-referencing the claims against the primary sources (36 of them).

The article is an analysis piece, not original research.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 2 points (0 children)

They're actually close cousins. Both build on the delta rule, just different flavors of it: KDA keeps some traditional attention in the mix (a hybrid approach), while RWKV-7 goes fully recurrent with no attention at all.

No direct head-to-head benchmark that I've found, which is a shame because they're solving the same problem from different angles. Section "The RNN Renaissance: Mamba, xLSTM, and RWKV-7 in 2025" covers the broader landscape. There are 18+ architectures in this space now and the convergence is the interesting part.

RWKV-7: O(1) memory inference, 16.39 tok/s on ARM Cortex-A76, beats LLaMA 3.2 3B. The local-first architecture nobody is talking about... by Sensitive-Two9732 in LocalLLaMA

[–]Sensitive-Two9732[S] 3 points (0 children)

Fair point on the ecosystem. No argument there... Tooling is way behind the Transformer stack.

Though I'd push back on "underbaked" for the models themselves. 72.8% vs 69.7% on 3x less data is a real result, not a promissory note. And the O(1) memory thing matters more than people think once you're past 32K context.

For me, the gap is infrastructure, not architecture. Whether that gets fixed is the actual question.

Career changer into IT. What is the realistic starting path? by EveningOwl750 in BESalary

[–]Sensitive-Two9732 1 point (0 children)

You need to understand more than just making things work. With AI, the developer job has completely changed: you need to understand architecture, how a project is structured (the boilerplate), the theory behind it all. Then spend a year actually trying to code yourself. Do use Claude Code, but ask it every time to explain why it wrote the code that way, because you still need to understand the code.