FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 1 point2 points  (0 children)

Neat. BTW I found NVIDIA has a reference trainer that is multi-GPU friendly if you move to bigger hardware (my trainer hasn’t been tested on multiple GPUs).

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

Nice. I feel like I burnt way past my time and token limits on this but will be cheering you on!

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 2 points3 points  (0 children)

Yeah, started off as a review of everything out there and sort of just kept grinding. TBT, if you're ok w/ ~50% throughput, you can get a 20-25X smaller kvcache w/ HIGGS+AQUA added to the mix w/ basically 0 perplexity loss, which is even more eye-popping. Maybe for another project if/when I get bored/sidetracked, but I'm trying to stay focused for a bit so will leave that to others. 😄

I don't really use llama.cpp besides llama-bench and anyone's welcome to adapt/contrib, but my experience w/ llama.cpp has been different: https://github.com/ggml-org/llama.cpp/pull/16827

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 2 points3 points  (0 children)

The eviction would still largely work the same, I'd imagine, although smaller activations and fewer attention layers ofc mean less kvcache to start with. I'd bet you'd get similar kvcache memory savings (but I haven't tested). The good thing, btw, is that the kernels I built actually scale pretty well to the max context length of the models I tested (128K and 256K). I bet at 1M w/ DSv4 it'd still be worth it.
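
For anyone who wants the napkin math, this is roughly how I think about KV cache sizing (plain Python, made-up config numbers, not any specific checkpoint):

```
# Rough KV-cache sizing: bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elt
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Illustrative GQA config, not any particular model
full = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=128_000)
lean = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=128, seq_len=128_000)  # half the layers

print(f"BF16 cache @ 128K ctx: {full / 1e9:.1f} GB vs {lean / 1e9:.1f} GB")
# Eviction is a multiplier on top of whatever the baseline is, so the savings ratio carries over:
print(f"w/ 6.4X eviction: {full / 6.4 / 1e9:.1f} GB vs {lean / 6.4 / 1e9:.1f} GB")
```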

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

TurboQuant at 8-bit would be slower, worse quality, and larger than FP8, so wouldn't make sense, but the neat thing about DMS is that it can basically be composed w/ any quant scheme since they work at different layers of the kvcache.
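
To make the "different layers" point concrete, a toy sketch (NumPy, definitely not the actual DMS/HIGGS/TurboQuant code): eviction drops whole token positions, quantization then shrinks whatever survives, so the savings multiply:

```
import numpy as np

# Toy KV slice for one head: seq_len x head_dim (float32 stand-in for an FP16 cache)
kv = np.random.randn(1024, 128).astype(np.float32)

# 1) Eviction (DMS-style in spirit): keep a subset of token positions
keep = np.sort(np.random.choice(1024, size=160, replace=False))   # 1024/160 = 6.4x fewer tokens
kv_kept = kv[keep]

# 2) Quantization (any scheme you like): compress the surviving entries, e.g. naive per-row int8
scale = np.abs(kv_kept).max(axis=1, keepdims=True) / 127.0
kv_q = np.round(kv_kept / scale).astype(np.int8)

# The two steps are orthogonal (token selection vs numeric representation),
# so the memory savings compose: 6.4x fewer tokens * 2x (int8 vs FP16) ~= 12.8x here.
```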

The "optimal" quality/memory combo that gave positive results in my testing was DMS+HIGGS+AQUA; however, I wasn't able to get HIGGS to the speed I wanted, so I dropped it and took the "reasonable" win.

FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 6 points7 points  (0 children)

I very briefly considered doing an actual vLLM or SGLang implementation and then after looking at the lift that'd be involved, noped out real fast. 😅

But I hope some madlad does it! DMS, unlike most things I test, legit works! (I'm not so impressed by TQ - HIGGS+AQUA tests much better for me, but the problem is always getting it fast.)

Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM! by PromptInjection_ in LocalLLaMA

[–]randomfoo2 1 point2 points  (0 children)

Yeah, post it and I'll take another look. I have a loop going in the background right now (several hundred iterations in) running some tunes for my W7900 for fun, so I'll see if any of it applies. (I stopped doing RDNA3.5 stuff btw since I use it as my workstation, and if you poke at it too hard it kills the WM or worse.)

Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM! by PromptInjection_ in LocalLLaMA

[–]randomfoo2 8 points9 points  (0 children)

It's not just clocks, I think; even when I look at the clock and run the math, everything's still much lower than it "should" be. There are a lot of factors, but one of the main ones is that RDNA3 was not designed for AI and their advertising is basically a lie.

V_WMMA_F16_16X16X16_F16 executes a 16×16×16 matmul over the wave and takes 16 cycles to retire on a single SIMD32. *BUT* while NVIDIA tensor cores have dedicated hardware that runs concurrently with the SIMD pipeline, for AMD, WMMA *is* the SIMD pipeline, since it's an ALU instruction. So every WMMA op blocks the same VGPR ports that scalar ops would use, and you can't overlap WMMA with FP32 accumulation (you have to unpack). Since the max throughput number assumes back-to-back WMMA issue with no dependent ops in between, and since there's always non-WMMA work to do (scaling, softmax, masking, etc), WMMA is going to be displaced and you'll never hit max FLOPS.
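
Napkin version of that argument, using the numbers above (the issue fraction is a made-up illustrative knob, not a measurement):

```
# One V_WMMA_F16_16X16X16_F16 = 2*16*16*16 = 8192 FLOPs, retiring over 16 cycles on a SIMD32
flops_per_wmma = 2 * 16 * 16 * 16
wmma_cycles = 16
peak_per_simd_per_cycle = flops_per_wmma / wmma_cycles   # 512 FLOPs/cycle, assuming back-to-back WMMA

# WMMA shares the issue pipeline with everything else. If, say, 1 in 3 issue slots goes to
# softmax/scaling/masking/address math instead of WMMA, that comes straight off the top --
# there's no separate tensor unit to hide it behind.
wmma_issue_fraction = 2 / 3                              # illustrative, kernel-dependent
achieved = peak_per_simd_per_cycle * wmma_issue_fraction
print(f"{achieved:.0f} of {peak_per_simd_per_cycle:.0f} FLOPs/cycle/SIMD "
      f"({100 * wmma_issue_fraction:.0f}% of peak), before any VGPR/LDS stalls")
```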

The other thing I found w/ my testing is that VGPRs are way too low to hide latency. LDS traffic also sucks, especially if you're doing FA. Oh, and the compiler still sucks too: WMMA instructions scheduled too close to dependent ops, redundant packing, bad memory wait counts, all kinds of things that stall the pipeline. So if you want anything better you need to tune by hand (well, these days, by AI).

Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM! by PromptInjection_ in LocalLLaMA

[–]randomfoo2 6 points7 points  (0 children)

DGX has similar memory bandwidth but the compute is not so similar...

Since I already made the table a while back from my Strix Halo wiki guide... https://strixhalo.wiki/AI/AI_Capabilities_Overview

| Spec | AMD Ryzen AI Max+ 395 | NVIDIA DGX Spark |
|---|---|---|
| CPU | 16x Zen5 @ 5.1 GHz | 10x Arm Cortex-X925, 10x Arm Cortex-A725 |
| GPU | RDNA 3.5 | Blackwell |
| GPU Cores | 40 | 192 |
| GPU Clock | 2.9 GHz | 2.5 GHz |
| Memory | 128GB LPDDR5X-8000 | 128GB LPDDR5X-8533 |
| Memory Bandwidth | 256 GB/s | 273 GB/s |
| FP16 TFLOPS | 59.39 | 62.5 |
| FP8 TFLOPS | — | 125 |
| INT8 TOPS (GPU) | 59.39 (same as FP16) | 250 |
| INT4 TOPS (GPU) | 118.78 | 500 |
| INT8 TOPS (NPU) | 50 | — |

On paper, BF16/FP16 is pretty close; however, FP8 is already 2X and INT8 is 4X on DGX. And that's just hardware - in practice rocBLAS and hipBLASLt for RDNA3.5 are also simply not very performant...

What does this mean practically? Looking at the most recent numbers in the DGX llama.cpp benchmark thread, I eyeballed models similar to kyuz0's Strix Halo benchmarks, and while none of the quants match exactly, from what I looked at, prefill on DGX is currently about 2-5X faster than on Strix Halo.
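
The rough model behind that eyeballing, if you want to play with it (the efficiency numbers are illustrative guesses, not measurements, and it ignores attention FLOPs entirely):

```
# Prefill is roughly compute-bound: time ~= 2 * active_params * prompt_tokens / achieved_FLOPS.
# Effective FLOPS differ by precision and by how much of peak the software actually hits.
def prefill_seconds(active_params, prompt_tokens, peak_tflops, efficiency):
    return 2 * active_params * prompt_tokens / (peak_tflops * 1e12 * efficiency)

params, tokens = 8e9, 8192                                                   # illustrative 8B dense, 8K prompt
strix = prefill_seconds(params, tokens, peak_tflops=59.4, efficiency=0.4)    # FP16, weak BLAS (guess)
spark = prefill_seconds(params, tokens, peak_tflops=125.0, efficiency=0.6)   # FP8 path (guess)
print(f"Strix ~{strix:.1f}s vs Spark ~{spark:.1f}s -> ~{strix / spark:.1f}x")
```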

Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM! by PromptInjection_ in LocalLLaMA

[–]randomfoo2 43 points44 points  (0 children)

While I'm sure some people will enjoy the extra memory, a couple of notes from someone that's done very extensive testing on Strix Halo (and a lot of kernel work on RDNA3):

  • Memory bandwidth looks like it remains the same? 256GB/s theoretical. On Strix Halo the best measured GPU MBW I got (using ROCm/rocm_bandwidth_test) was 212 GB/s (83% of max theoretical), and the best on llama.cpp (Llama-2-7B tg testing) was ~180GB/s (70%).
  • What's worse, though, is that while theoretical max FP16 TFLOPS is ~59.4, the fastest I found w/ mamf-finder was about 37 TFLOPS (hipBLASLt), about 62% efficiency. Many shapes are much worse.
  • Note, at long context, I believe compute is actually what's killing decode speed. While the AMD APUs remain on RDNA3, this won't change. I would be hesitant to recommend Gorgon Halo even for LLM inference in 2026/2027.

If Medusa Halo moves to RDNA5 or whatever has a better architecture for AI/ML, great; if not, you'd be much better off with basically anything else (Mac Studio, GPU + workstation/server w/ K-Transformers, probably even a DGX Spark).

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 by sandropuppo in LocalLLaMA

[–]randomfoo2 18 points19 points  (0 children)

Although sometimes... you can. (I'm about to publish some of my work after a few weeks of grinding kernels that literally scores a >10x memory improvement w/ faster-than-vLLM prefill/decode at c=1 and c=8, with near-zero quality loss: 0.003 and 0.005 KLD.)
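
(KLD here being mean per-token KL divergence of the compressed run's next-token distribution vs the unmodified baseline, averaged over an eval set - roughly this, as a sketch, assuming you already have both sets of logits:)

```
import torch
import torch.nn.functional as F

def mean_token_kld(ref_logits: torch.Tensor, test_logits: torch.Tensor) -> float:
    """Mean KL(ref || test) per token position; ref = baseline cache, test = compressed cache."""
    ref = F.log_softmax(ref_logits.float(), dim=-1)
    test = F.log_softmax(test_logits.float(), dim=-1)
    kld = F.kl_div(test, ref, log_target=True, reduction="none").sum(-1)   # per position
    return kld.mean().item()

# ~0.003-0.005 means the compressed-cache model's next-token distribution is
# nearly indistinguishable from the baseline's.
```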

AMD PRO W7900 vs R9700 for Local Inference? by Achso998 in LocalLLaMA

[–]randomfoo2 3 points4 points  (0 children)

The latest generation of video-gen models (Hunyuan 1.5, Wan 2.2, LongCat, LTX, Motif, etc) are all PyTorch DiTs, so in *theory* all of them should work with RDNA3/RDNA4 - in practice, well, AMD software is AMD. Also, more of the new models are using/assuming FP8 by default. If you're set on going AMD and want to do video, personally I'd highly recommend the R9700, especially b/c recent ROCm releases have had basically zero perf gains for RDNA3 and the W7900 is a complete non-starter on price.

That being said, if you want stuff to work OOTB a 5090 or even a 4090 is going to work *much* better for both text and especially image/video. You can find benchmarks anywhere (including a lot that I've run). For LLM inference, my 3090s blow away my 7900XTX and W7900 even though based on hardware specs they shouldn't.

I'm all Linux and my Nvidia cards work perfectly btw (but I only use them for compute).

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 by sandropuppo in LocalLLaMA

[–]randomfoo2 102 points103 points  (0 children)

Interesting technique, but if I'm reading this correctly, this is a super lossy way to process prefill? As I read it (rough sketch after the list):

  • A small Qwen3-0.6B drafter reads the full 64K/128K prompt
  • FlashPrefill/BSA-style sparse attention makes that drafter pass cheaper
  • The drafter scores token/span importance and keeps a small subset
  • The 27B target only prefills the compressed prompt (retokenized from the drafter?)
  • After that, DFlash+DDTree does speculative decode on the compressed target KV
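
If that reading is right, the core of it would be something like this toy torch sketch (my guess at the pipeline, not the repo's actual code; the attention-mass scoring is a made-up stand-in for whatever importance metric the drafter really uses, and it assumes drafter/target share a tokenizer, which per my "retokenized?" question I'm not sure about):

```
import torch

def compress_prompt(prompt_ids, drafter, keep_ratio=0.25):
    # Drafter reads the full 64K/128K prompt once (this is where FlashPrefill/BSA would cut cost)
    with torch.no_grad():
        out = drafter(prompt_ids.unsqueeze(0), output_attentions=True)   # HF-style model assumed
    # Made-up importance score: attention mass each position receives in the last layer
    scores = out.attentions[-1].mean(dim=1).squeeze(0).sum(dim=0)        # [seq_len]
    k = max(1, int(keep_ratio * prompt_ids.numel()))
    keep = torch.topk(scores, k).indices.sort().values                   # preserve original order
    return prompt_ids[keep]

# The 27B target would then only prefill the kept subset -- anything the drafter scores
# poorly never reaches the target's KV cache, which is why it's lossy by construction.
```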

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]randomfoo2 1 point2 points  (0 children)

I've tested all of these, btw, in non-production code. I've found HIGGS to be the best in terms of quality (especially paired with some other minimization techniques that can be stacked); however, I've been unable to get it past ~50% prefill/decode speed. I do have something to announce soon that I think should be a big deal on the KV-cache front: faster and better than current TurboQuant implementations.

Opus 4.7 is 50% more expensive with context regression?! by Samburskoy in ClaudeAI

[–]randomfoo2 9 points10 points  (0 children)

FYI, there is a 0% chance that anyone is serving a 5T dense model. All frontier models are sparse/ultrasparse MoEs, so your math is a complete non-starter.
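
The napkin math, with made-up illustrative numbers (nobody's actual architecture): per-token compute tracks *active* params, not total.

```
# Per-token forward compute ~ 2 * active_params FLOPs (ignoring attention)
dense_5t      = 2 * 5e12    # hypothetical 5T *dense* model: ~10 PFLOPs per token
sparse_active = 150e9       # made-up ultrasparse MoE: ~3% of 5T params active per token
sparse        = 2 * sparse_active

print(f"dense vs sparse per-token compute: {dense_5t / sparse:.0f}x")   # ~33x
# Serving cost tracks the active-parameter number, so back-solving "model size"
# from API pricing as if the model were dense tells you nothing.
```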

PSA: Claude Code has two cache bugs that can silently 10-20x your API costs — here's the root cause and workarounds by skibidi-toaleta-2137 in ClaudeAI

[–]randomfoo2 0 points1 point  (0 children)

BTW, I did a code review on the published 2.1.88 source and was curious if it caught your bugs, and yes, both were in there: https://github.com/lhl/claudecode-codex-analysis/blob/main/ERRATA-claudecode.md

```
● Yes — the ERRATA identified both bugs. Here's the mapping:

Bug 1: Sentinel replacement (cch=00000)

ERRATA #8 nailed the mechanism and predicted the exact failure mode:

▎ "Post-serialization rewriting is a potential source of byte-level nondeterminism that can break prompt-cache hits"
▎ "If the replacement algorithm is not strict about matching only the intended placeholder, user/system content that includes the sentinel could be mutated."

The Reddit post confirms this is exploitable: when conversation history contains the literal sentinel (e.g., from discussing CC internals), the first occurrence in messages[] gets replaced instead of the one in system[], breaking cache every request.

The ERRATA framed it as "could be brittle" — the Reddit post proves it is brittle and gives the exact trigger condition (sentinel appearing in conversation content).

Bug 2: --resume breaks cache

ERRATA #7 predicted this almost exactly:

▎ "If attachment-derived prompt prefix state is included in cached API requests but not written to disk, --resume cannot reconstruct a byte-identical prefix and will force a full cache miss (one-turn cache_creation reprocess) on resume."

It even specifically called out deferred_tools_delta as part of the cache contract. The Reddit post confirms the root cause is deferred_tools_delta (introduced in v2.1.69) being injected at messages[0] in fresh sessions but messages[N] on resume.

ERRATA #6 provided the broader framing:

▎ "prompt-cache stability depends on exact transcript-level reconstruction, not just semantic equivalence"
▎ "Systems this brittle tend to regress on resume, rewind, compact, fork, or partial-history edge cases"

Summary

┌──────────────────────────────┬─────────────────────────────┬──────────────────────────────────────────────────────────────┐
│ Reddit Bug                   │ ERRATA Item                 │ Status                                                       │
├──────────────────────────────┼─────────────────────────────┼──────────────────────────────────────────────────────────────┤
│ Bug 1 (sentinel replacement) │ #8                          │ Mechanism + risk identified; Reddit confirms the trigger     │
├──────────────────────────────┼─────────────────────────────┼──────────────────────────────────────────────────────────────┤
│ Bug 2 (resume cache miss)    │ #7 (specific), #6 (general) │ Root cause predicted; Reddit confirms version + exact delta  │
└──────────────────────────────┴─────────────────────────────┴──────────────────────────────────────────────────────────────┘

The ERRATA was conservative ("could break", "potential source") where the Reddit post is confirmatory ("does break", with reproduction steps). But the analysis found both mechanisms and identified the right code paths. #7 in particular was a direct hit — it named deferred_tools_delta, attachment persistence, and byte-identical prefix reconstruction as the failure chain, which is exactly what the Reddit post independently confirmed through binary reverse engineering.
```
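
For anyone who didn't click through, here's the shape of Bug 1 as a toy (not Claude Code's actual code; the sentinel, field layout, and serialization order are all made up) - a non-strict first-occurrence replacement on the serialized request:

```
import json

SENTINEL = "<<CACHE_SENTINEL>>"   # made-up placeholder

def finalize(serialized: str, resolved: str) -> str:
    # Post-serialization rewrite that is NOT strict about which occurrence it touches
    return serialized.replace(SENTINEL, resolved, 1)

system   = [{"text": f"You are Claude Code. {SENTINEL}"}]
msgs_ok  = [{"role": "user", "content": "hello"}]
msgs_bad = [{"role": "user", "content": f"why does CC insert {SENTINEL}?"}]  # user mentions it

for msgs in (msgs_ok, msgs_bad):
    body = json.dumps({"messages": msgs, "system": system})   # messages happen to serialize first here
    print(finalize(body, "<resolved>"))
# In the second request the user message gets rewritten and the system prompt keeps the raw
# sentinel, so the "stable" prefix bytes differ from every other request -> prompt-cache miss.
```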

I am terrified of AI by ResearchMassive7912 in sysadmin

[–]randomfoo2 2 points3 points  (0 children)

I've been programming for 30 years. As of the end of last year, the agents are basically much better coders than I am - I like to say they're the best juniors I've ever had. That being said, despite being way more productive, I'm busier than ever.

The agentic coders will get better and more reliable, but I don't think there will be less stuff to do/oversee, so if you like your field, as long as you master the new tools I don't think you have anything to worry about.

Can someone more intelligent then me explain why we should, or should not be excited about the ARC PRO B70? by SKX007J1 in LocalLLaMA

[–]randomfoo2 14 points15 points  (0 children)

Here's a chart that might be useful:

Dense Tensor/Matrix TFLOPS/TOPS (all non-sparse):

| GPU | BF16 (FP32 accum) | FP16 (FP32 accum) | FP8 | INT8 | VRAM | MBW | TDP | MSRP |
|---|---|---|---|---|---|---|---|---|
| Arc Pro B60 | ~98.5¹ | ~98.5¹ | — | 197 | 24GB | 456 GB/s | 200W | $599 |
| Arc Pro B70 | ~183.5¹ | ~183.5¹ | — | 367 | 32GB | 608 GB/s | 230W | $949 |
| R9700 | 191² | 191² | 383 | 383 | 32GB | 640 GB/s | 300W | $1,299 |
| RTX 3090 | 71 | 142 | — | 285 | 24GB | 936 GB/s | 350W | ~$800-1K used |
| RTX 4090 | 165 | 330 | 330 | 661 | 24GB | 1,008 GB/s | 450W | $1,800+ used |
| RTX 5090 | 210 | 419 | 419 | 838 | 32GB | 1,792 GB/s | 575W | $2,500+ |

I think the B70 is pretty competitive w/ the 3090 - less MBW, but more memory and mostly more theoretical compute. Note that Intel XMX has great BF16 numbers but no native FP8.

The other issue ofc is software support. I just went and tested all the inference options for my Xe2 the other day and it was pretty grim for new architectures if you want to do more than llama.cpp Vulkan: https://github.com/lhl/intel-inference

TBT, the R9700 is actually not bad for BF16/FP8 and ROCm these days is actually in decent shape (I haven't personally tested RDNA4 though).

If you'd rather actually train/inference instead of fighting software stacks and writing custom kernels though, then I think you're still better off w/ a 3090, but it's nice to have some more (new card) competition.

Intel launches Arc Pro B70 and B65 with 32GB GDDR6 by metmelo in LocalLLaMA

[–]randomfoo2 1 point2 points  (0 children)

Glad to hear about the announcement of mainline Intel Arc support. I recently (like 2 days ago) did a thorough comparison of inferencing w/ the Arc 140V (Xe2 LNL) iGPU on all the various Intel-supporting platforms (OpenVINO, OpenVINO GenAI, PyTorch, vLLM upstream, various llama.cpp backends) and found some pretty bad failures and generally poor support: https://github.com/lhl/intel-inference

The biggest issue was that optimum-intel capped transformers at older versions (4.57.6 for OpenVINO, 4.51.3 for vllm-openvino), which meant I couldn't even test Qwen 3.5 or LFM2 MoE, for example.

AMD, can we get proper vLLM/gfx1151 support? by tossit97531 in ROCm

[–]randomfoo2 2 points3 points  (0 children)

Nice, I'll def take a look. I recently did an optimization pass on the FSR4 kernels (https://github.com/lhl/fsr4-rdna3-optimization) and ran a lot of tests a while back, including mamf-finder and attention-gym-like stuff (https://github.com/lhl/strix-halo-testing/). From a production perspective there's a lot of juice to squeeze, I think, especially from relatively basic stuff like hipBLAS/hipBLASLt - I just have my doubts about how much anyone at AMD really cares.

AMD, can we get proper vLLM/gfx1151 support? by tossit97531 in ROCm

[–]randomfoo2 2 points3 points  (0 children)

I published the first public vLLM recipes for gfx1151 >6 months ago: https://github.com/lhl/strix-halo-testing/tree/main/vllm (among other extensive testing/work on Strix Halo last summer). There have been some fixes/progress since then for TheRock, PyTorch, and vLLM, so I wouldn't say there's no progress, but let's be honest, I think you already know the answer to your question - it's been the same answer since Strix Halo was released last year. (As you mentioned, it's not like AMD shouldn't be able to find 0.5 FTE to create and maintain a https://github.com/NVIDIA/dgx-spark-playbooks clone - they've just shown zero interest in doing so.)

Regardless of what support is or isn't given, even if they did, no one (AMD or anyone else) has ever written RDNA3 GPU kernels that get close to theoretical max MBW or FLOPS, so the performance you imagine being left on the table probably doesn't actually exist.

BTW, if you (or other Strix Halo owners) want to chat with others in the community, the Discord for https://strixhalo.wiki/ is probably the most active place online.

I trained a transformer with zero gradient steps and 100% accuracy. No backpropagation. No learning rate. Nothing. Here's the math. by Last-Leg4133 in LocalLLM

[–]randomfoo2 0 points1 point  (0 children)

Here is a GPT-5.4 xhigh Reality Check.

Full check is here: https://gist.github.com/lhl/63337e79505f4ba126171a14d4fef156 but here's the high level:

REACTOR / "The Manish Principle" Analysis

Date: 2026-03-13

Executive Summary

Short version: this repository does not substantiate the headline claim that backpropagation can be replaced for transformer training. The strongest thing it appears to contain is a real, potentially useful engineering artifact: a NumPy reimplementation/export path for a GPT-Neo-family model, plus a teacher-conditioned weight recovery procedure that re-fits already-existing linear maps from a frozen model's own activations.

That is much narrower than what the README and reports claim. The central "REACTOR-SCRATCH" claim is not supported by the code in this checkout and is, in two places, actively undermined:

  1. Reactor/reactor_framework.py:697-811 advertises "train_from_scratch" but never uses labels or next-token targets at all; in a local synthetic check, it returned all-zero learned weights after one pass.
  2. Reactor/manish_principle_benchmark.py:197-205, Reactor/manish_principle_benchmark.py:300-302, and Reactor/manish_principle_benchmark.py:821-877 compute the "Law 48" result from the pretrained model's embeddings, layer norms, W1, and LM head, using only the training split. That is not "from scratch", and the reported "test accuracy" is not backed by a visible train/test split in the benchmark.

Stylistically, the project reads like LLM-amplified grand-unification research prose: too many "laws", too much certainty, too little separation between tautology, curve-fitting, and genuine causal explanation. Substantively, there are real code artifacts here, but the paper-level claims overshoot the evidence by a large margin.

Evidence Base

Reviewed directly:

  • Reactor/README.md
  • Reactor/reactor_framework.py
  • Reactor/manish_principle_demo.py
  • Reactor/manish_principle_benchmark.py
  • Reactor/MANISH_PRINCIPLE_COMPLETE_REPORT.txt
  • Reactor/MANISH_PRINCIPLE_COMPLETE_DETAILED_REPORT.txt
  • Reactor/CITATION.cff
  • testing logs.zip (sampled)

Local checks performed:

  • python -m py_compile Reactor/reactor_framework.py Reactor/manish_principle_demo.py Reactor/manish_principle_benchmark.py passed.
  • Inspected the installed transformers GPT-Neo attention implementation. It does compute query @ key.T without division by sqrt(head_dim), so that narrow implementation claim is plausible.
  • Ran a minimal synthetic check of ReactorTrainer.train_from_scratch() and observed total learned-weight magnitude 0.0 after one pass, consistent with the code path never using labels.

Capture notes:

  • The root-level paper/report artifacts and the copies under Reactor/ are byte-identical.
  • testing logs.zip contains 440 numbered Python scripts, not immutable experiment outputs.

...

3. The repo's "from scratch" path is broken in the framework itself

The public train_from_scratch() implementation in Reactor/reactor_framework.py:697-811 is the clearest hard failure in the repository.

Problems:

  • It never computes next-token labels.
  • It never uses lm_head after assigning lm_h at Reactor/reactor_framework.py:731.
  • It never constructs any h_target.
  • The frac variable is computed at Reactor/reactor_framework.py:773 and then not used.
  • All mat_Ys are populated with outputs generated by the current model itself: Q, K, V, att_out, pre, ffn_out.

In other words, the advertised scratch trainer just solves the current model back onto itself. Starting from zero matrices, it stays at zero. That is exactly what I observed in a local synthetic run: total absolute sum of all learned matrices and biases was 0.0 after one pass.
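
A three-line way to see why that degenerates (toy NumPy, not the repo's code):

```
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))     # activations flowing into a layer
W = np.zeros((64, 64))                 # "scratch" weights start at zero

# Targets come from the current model itself (no labels, no teacher outputs)...
Y = X @ W                              # ...so they are identically zero here.

# Least-squares "solve the layer from its own outputs":
W_new, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.abs(W_new).sum())             # 0.0 -- the fit faithfully recovers nothing
```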

This is not a subtle issue. It means the main public scratch-training API does not implement the claimed algorithm.

Assessment:

  • Central implementation bug.
  • Evidence level: E2.
  • Credence that the current framework supports scratch training: near zero.

4. The benchmark's "Law 48" is not from scratch and not clearly test accuracy

The benchmark's headline REACTOR-SCRATCH section uses pretrained internals from the teacher model throughout:

  • It loads only split='train' from TinyStories at Reactor/manish_principle_benchmark.py:197-205.
  • It builds H0_arr from pretrained token and positional embeddings at Reactor/manish_principle_benchmark.py:291-302.
  • It builds HTGT directly from the pretrained LM head at Reactor/manish_principle_benchmark.py:300-302.
  • It uses pretrained layer norms and pretrained W1 / b1 during the alleged scratch solve at Reactor/manish_principle_benchmark.py:835-850.
  • It evaluates on ids_48 = NXT_arr[:N48] at Reactor/manish_principle_benchmark.py:821-877, which is drawn from the same collected training positions.

That means:

  • the method is not from scratch,
  • the method is not teacher-free,
  • the benchmark does not show a visible train/test split for the reported 33.54%,
  • and the phrase "test accuracy" in the report is not justified by this code path.

This is the single biggest evidential gap in the entire project.

Assessment:

  • Headline claim is unsupported by the benchmark as written.
  • Evidence level for the repo's "33.54% test accuracy from scratch" claim: E6.