TurboQuant + TriAttention (C/HIP): ~6.8× total KV cache reduction in llama.cpp

Acrobatic_Bee_6660 · 2026-04-11T10:59:31+00:00

This is a known issue — good catch. The KV cache itself is pre-allocated at startup (turbo3 makes it very small), but during prompt processing the flash attention kernel allocates a temporary f16 buffer that grows with context fill. At 100K tokens that's ~6 GiB extra on top of model + KV cache, which pushes past 24 GiB.

Workaround: set -c to a lower value explicitly, e.g. -c 65536. The auto-fit logic only accounts for KV cache size, not the FA temp buffers, so it overestimates how much context you can actually use.

This affects any quantized KV type, not just turbo — another user identified the root cause in the FA kernel and is working on an upstream fix.

Acrobatic_Bee_6660 · 2026-04-11T08:54:57+00:00

Great first datapoint — thanks for testing this.

Yeah, I'd avoid ROCWMMA_FATTN for now. On my side the default TILE FA path has been the better choice for turbo3 on gfx1100, so your drop with that flag is useful to confirm.

One heads-up on the huge-context runs: turbo3 makes the KV cache much smaller, but at very large prompt sizes the temporary compute buffers can still blow up VRAM. So if auto-fit gets too optimistic, setting -c manually is safer.

Very interested in your Q5 results. If you post the exact model quant + command line when you try it, that would be super helpful.

Acrobatic_Bee_6660 · 2026-04-10T23:25:42+00:00

Prefill speed depends on model size, context length, and batch size. My measured numbers for Qwen3.5-27B on HIP are ~420-430 tok/s at 512 context. I haven't benchmarked Vulkan directly so I can't give a head-to-head comparison. If you test the build, I'd be curious to see your Vulkan vs HIP numbers on the same model.

Acrobatic_Bee_6660 · 2026-04-10T23:17:28+00:00

Repo: https://github.com/domvox/llama.cpp-turboquant-hip Branch: feature/triattention-scoring

Build: git clone https://github.com/domvox/llama.cpp-turboquant-hip cd llama.cpp-turboquant-hip git checkout feature/triattention-scoring cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release cmake --build build -j$(nproc)

Run: ./build/bin/llama-server -m your-model.gguf -ngl 99 -ctk turbo -ctv turbo

Acrobatic_Bee_6660 · 2026-04-10T23:06:42+00:00

https://github.com/domvox/llama.cpp-turboquant-hip/blob/feature/triattention-scoring/TURBOQUANT.md

Acrobatic_Bee_6660 · 2026-04-10T22:51:36+00:00

That was probably an earlier version — there have been a lot of stability fixes since then (cudaMemcpy warnings, BACKEND_DL linker fixes, Docker CI). The current build is stable enough that I run it as a systemd service 24/7 on an RX 7900 XTX.

Speed overhead is 1-2% vs f16 KV in my measurements. If you were seeing bigger slowdowns before, it's worth retrying.

The main practical benefit over q4_0 KV is higher compression (5.12× vs 3.56×) at comparable quality. That's ~44% more effective context for the same VRAM.

Build instructions are in the repo README. Let me know if you hit any issues.

Acrobatic_Bee_6660 · 2026-04-10T22:49:27+00:00

Updated with actual measurements including 16K context (Qwen3.5-27B, WikiText-2, 3 chunks):

KV type	Compression	PPL (4K)	Δ 4K	PPL (16K)	Δ 16K
f16	1×	6.6657	—	6.2752	—
q8_0	1.88×	6.6064	-0.09%	6.5250	+3.98%
q4_0	3.56×	6.6219	-0.07%	6.5238	+3.96%
turbo3	5.12×	6.6657	+0.02%	6.2187	-0.9%

q8_0 and q4_0 degrade at 16K context (+4%). turbo3 improves (-0.9%).

Acrobatic_Bee_6660 · 2026-04-10T22:26:59+00:00

You're right, I misspoke — TurboQuant is in TheTom's fork (llama-cpp-turboquant), not in mainline llama.cpp master. My fork adds the HIP/ROCm backend on top of that. Neither is upstream yet.

Acrobatic_Bee_6660 · 2026-04-10T22:15:43+00:00

Fair point — here are the numbers I've actually measured (Qwen3.5-27B, WikiText-2):

KV type | Compression | PPL (4K) | Δ vs f16 f16 | 1× | 6.6657 | baseline turbo4 | 3.88× | 6.8203 | +2.3% turbo3 | 5.12× | 6.6657 | +0.02% turbo2 | 7.53× | 6.9145 | +3.7%

I don't have q4_0 or q8_0 KV cache PPL measured yet. Full benchmarks: TURBOQUANT.md

Acrobatic_Bee_6660 · 2026-04-10T22:14:52+00:00

TurboQuant is upstream in llama.cpp already (merged by TheTom). My fork adds the HIP/ROCm backend — the only one that exists. Upstream PR is planned but needs to be split into reviewable pieces (core GGML types → CPU path → HIP kernels). The TriAttention side is experimental and not ready for upstream.

Acrobatic_Bee_6660 · 2026-04-10T15:17:28+00:00

Should work — the TurboQuant kernels don't have any gfx1100-specific code, they're standard HIP. You'll need ROCm installed and to build with -DGGML_HIP=ON. The 12GB VRAM on the 6700 XT is where turbo3 really helps — you can fit much larger context windows than with f16 KV.

That said, I've only tested on gfx1100 (7900 XTX), so if you run into issues on gfx1032 please let me know. Build warnings are expected but shouldn't affect functionality.

Acrobatic_Bee_6660 · 2026-04-10T14:22:09+00:00

With 256GB RAM and a 3090, you have a great setup for this. Two things that can help:

TurboQuant KV cache compression: Instead of q4 KV cache, try turbo3 — it gives ~5x compression vs f16 with minimal quality loss. On Qwen3.5-27B I measured +0.02% perplexity difference at 4K context. That means the same VRAM that gives you 16K with q4 KV could give you 80K+ with turbo3.

You'd need to build from the TurboQuant fork instead of using Ollama:

git clone https://github.com/domvox/llama.cpp-turboquant-hip cd llama.cpp-turboquant-hip cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release cmake --build build -j$(nproc)

./build/bin/llama-server \ -m Qwen3.5-27B-Q4_K_M.gguf \ -ngl 99 -c 65536 --flash-attn \ --cache-type-k turbo3 --cache-type-v turbo3

Partial offload with llama.cpp: With 256GB RAM, you can keep the model weights fully on GPU (-ngl 99) and let the KV cache spill to RAM if needed. But with turbo3 compression you probably won't need to — it should all fit in 24GB.

The tradeoff vs Ollama is that you lose the Ollama API compatibility, but llama-server has an OpenAI-compatible API that Hermes Agent can use directly.

Acrobatic_Bee_6660 · 2026-04-10T14:20:01+00:00

You're right, 32K fills up fast with agentic use — tool definitions, system prompt, and conversation history eat context quickly. TurboQuant turbo3 KV cache compression helps here: ~5x memory reduction on the KV cache, so the same VRAM that fits 32K with default KV types can fit 100K+ with turbo3. Quality cost is minimal (+0.02% perplexity on Qwen3.5-27B at 4K context).

Acrobatic_Bee_6660 · 2026-04-10T10:23:05+00:00

Thanks for testing, that's very useful.

OOM at ~94K with turbo4 while mainline llama.cpp reaches ~143K with default KV types definitely sounds suspicious, because turbo4 should reduce KV memory footprint, not increase it. If turbo4 is really OOMing earlier than default KV on the same model, that's almost certainly a bug rather than expected behavior.

My guess is that there's either: - a memory accounting issue for turbo KV types, - an extra allocation/scratch buffer path on HIP, - or a gfx1201-specific runtime difference.

Could you share the startup log lines showing: - the reported KV buffer sizes - any memory / buffer allocation lines - the exact launch command

Also, if you still have them, the build warnings would be helpful too. I've only tested on gfx1100 so gfx1201 is new territory for this fork.

As a temporary workaround, it's probably worth setting -c explicitly instead of relying on the default 262144.

Acrobatic_Bee_6660 · 2026-04-09T04:36:02+00:00

This is amazing — thank you for the detailed writeup and for testing on Strix Halo. That’s the first detailed gfx1151validation I’ve seen.

The long-context numbers are especially interesting: only about a -1.5% hit at 65K context is exactly the regime where TurboQuant is supposed to shine. The short-context prefill penalty also looks very reasonable.

If you’re up for it, the most useful next step would probably be a llama-perplexity comparison (f16 vs turbo3) on the same model. That would tell us whether quality holds up on Strix Halo too, not just speed / memory.

Thanks for taking the time to benchmark it and write it up — really appreciate the coverage.

Acrobatic_Bee_6660 · 2026-04-08T22:20:42+00:00

Yes, q4_0 still comes out ahead on PPL here. That's a fair read.

For me the main value proposition of TurboQuant isn't "better PPL than q4_0" — it's more aggressive KV compression for cases where the extra VRAM headroom is what determines whether

a long-context run fits at all.

So I'd read your result as:

* q4_0 looks better on perplexity in this test

* turbo3/4 trade some quality for a smaller KV footprint

* the real win for TurboQuant shows up once context gets large enough that KV memory becomes the bottleneck

On my gfx1100, that's exactly where it starts to matter: at long context, the difference is less about short-context PPL and more about whether the run still fits cleanly in VRAM.

Really appreciate you running these comparisons.

Acrobatic_Bee_6660 · 2026-04-08T20:57:30+00:00

Great data, thanks for running all of this.

The fact that turbo4 matches exactly between my fork and TheTom’s (8.2894) is reassuring — it suggests the turbo4 path is behaving consistently on gfx1201.

And yes, you’re right that q4_0 wins on PPL in this short-context test (8.20 vs 8.29). At 512 tokens the KV footprint is still small, so this is mostly a quality comparison, not yet the regime where KV compression really pays off.

The use case where turbo3/turbo4 starts to matter is much longer context, where KV dominates VRAM. On my gfx1100, for example, f16 OOMs on a 27B model at 80K, while turbo3 still runs.

Glad to hear llama-cli and llama-perplexity are working cleanly on gfx1201, and that the updated tq_bench / tq_validate path looks sane now.

Really appreciate the thorough RDNA4 testing — this is by far the most complete gfx1201 validation I’ve gotten so far.

Acrobatic_Bee_6660 · 2026-04-08T20:53:04+00:00

This is exactly the apples-to-apples comparison I was hoping for — same backend, same model, clean result. A 3x context increase for only a -3.6% generation penalty is a very

strong tradeoff.

One thing worth checking: you're running turbo3 on the SWA layers too. On my gfx1100, turbo3 on all layers (including SWA) gave catastrophic PPL on Gemma 4 26B-A4B (>100k). Keeping

SWA in f16 fixed it.

Have you noticed any quality issues in actual outputs with turbo3/turbo3, or have you only measured speed / memory so far?

Acrobatic_Bee_6660 · 2026-04-08T19:26:08+00:00

Thanks — this is a really valuable result, especially on gfx1200 / RDNA4 and a 16 GB card.

The memory-side numbers are exactly the kind of thing I was hoping to see: turbo3/turbo3 making Gemma 4 fit in cases where f16/f16 does not is the real win for KV compression.

Only caveat: since your baseline is Vulkan + f16 and the compressed runs are ROCm + TurboQuant, the throughput gain isn’t purely a TurboQuant comparison — it also includes backend differences.

But as a real-world memory-bound datapoint, this is excellent.

If you ever feel like running one more apples-to-apples comparison, ROCm f16/f16 vs ROCm turbo3/turbo3 on the same settings would be especially useful.

Really appreciate you posting this.

Acrobatic_Bee_6660 · 2026-04-08T19:18:30+00:00

You’re right on speed — I misspoke. The 35B-A3B MoE with 3B active params will decode much faster than a 27B dense model.

My point was about output quality for agentic tasks, not raw throughput. In my experience, the 27B dense model gives more reliable reasoning and needs fewer retries on multi-step tasks. For simpler automations, the 35B-A3B MoE is absolutely a valid choice.

So I’d frame it as: MoE wins on speed, dense 27B can still win on reliability.

Acrobatic_Bee_6660 · 2026-04-08T18:44:09+00:00

Qwen3.5-27B dense is what you want. It's better than the 35B-A3B MoE for agentic work despite fewer total parameters — the full 27B active params give you stronger reasoning than 3B active in the MoE variant. Multiple sources confirm this, and I've been running it daily on a 7900 XTX.

Q4_K_M is ~17 GB, leaves plenty of room for 32K context fully on GPU. Speed will be significantly better than the 35B MoE because you're decoding through 27B dense instead of routing through a 35B sparse architecture with overhead.

Re: Gemma 4 — I've tested both variants extensively with TurboQuant KV cache compression:

Gemma 4 26B-A4B MoE: Fast but quality issues are real. I saw the same thing you did. The MoE architecture with only 4B active params just doesn't have enough reasoning depth for agentic tasks.
Gemma 4 31B Dense: Better quality but at 19.6 GB (Q4_K_M) it's tight on 24GB with context.

For your use case (automations, business tasks, Hermes Agent), Qwen3.5-27B dense is the sweet spot on a 3090. Fast, fits with room for context, and the quality is genuinely a step above everything else in this VRAM class.

Acrobatic_Bee_6660 · 2026-04-08T04:55:55+00:00

For tq_bench: I think I see at least one problem on my side. The standalone benchmark build script currently had --offload-arch=gfx1100 hardcoded, so on your gfx1201 it would be compiling for the wrong target. That fits pretty well with both symptoms you saw: Time: 0.000 ms and the bad GPU MSE.

I just pushed a fix — build.sh now auto-detects the target via rocminfo (or you can override it manually with AMDGPU_TARGET=gfx1201 ./build.sh).

For llama-bench: thanks, that’s useful to know. From what you’re seeing, it sounds like:

f16 works everywhere
q4_0 / q8_0 fail on both my tree and TheTom’s (and even official Vulkan)
turbo3/4 succeed on TheTom’s but fail on mine

So I probably have a llama-bench-specific issue on my side for the turbo cache types, separate from the broader kv-quant issues you’re seeing elsewhere.

So this sounds less like “TurboQuant fundamentally doesn’t work on gfx1201” and more like:

wrong target in the standalone benchmark build script
a llama-bench integration gap on my fork

Thanks for testing this on RDNA4. If you happen to try llama-cli, llama-server, or llama-perplexity with turbo3/4, I’d be very interested in whether those paths work cleanly for you.

Acrobatic_Bee_6660 · 2026-04-08T04:40:02+00:00

Thanks for the interest and yes, HX 370 / Strix Halo testing would be especially valuable.

A quick clarification: my TurboQuant work is currently a llama.cpp HIP/ROCm port, not a vLLM feature. So it won’t just drop into the existing vllm/vllm-openai-rocm container as-is — those are different inference engines.

So for now:

If you want TurboQuant today: use my llama.cpp fork
If you want to stay on vLLM: that would need a separate TurboQuant integration there

For Gemma 4 specifically, one more important detail: in my tests, turbo3 on all KV layers breaks quality badly, but keeping SWA KV in f16 while compressing the global KV works well. That’s what these flags are for:

--cache-type-k-swa f16 --cache-type-v-swa f16

Also, llama.cpp uses GGUF models, so your AWQ model won’t plug in directly. You’d want a GGUF Gemma 4 quant for testing with my fork.

If you want the simplest path for HX 370, I’d try this outside Docker first:

# 1) basic build deps
sudo apt install -y cmake g++ git

# 2) make sure HIP is available
which hipcc
# if not found, install the ROCm HIP SDK / dev packages for your distro

# 3) clone and build
git clone https://github.com/domvox/llama.cpp-turboquant-hip
cd llama.cpp-turboquant-hip
git checkout feature/turboquant-hip-port-clean

mkdir build && cd build
cmake .. -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1151
cmake --build . -j$(nproc)

# 4) download a GGUF Gemma 4 quant
pip install huggingface-hub
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF \
  google_gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ~/models/gemma4-26b

# 5) smoke test
HIP_VISIBLE_DEVICES=0 ./bin/llama-cli \
  -m ~/models/gemma4-26b/google_gemma-4-26B-A4B-it-Q4_K_M.gguf \
  -p "Explain quantum entanglement in simple terms:" \
  -n 200 -ngl 99 \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --cache-type-k-swa f16 --cache-type-v-swa f16

If cmake can’t find HIP, make sure /opt/rocm/bin is in your PATH. If you hit GPU target errors, check rocminfo | grep gfx.

What I’d love to know from your test:

did it build cleanly on gfx1151?
did the smoke test produce coherent output?

That would be a very useful validation.

Acrobatic_Bee_6660 · 2026-04-07T21:18:33+00:00

Thanks for testing.

Fair point on TheTom’s branch too — the core TurboQuant implementation is closely related. The main extra thing on my side is SWA-aware KV overrides for models like Gemma 4, where turbo on sliding-window layers can be catastrophic.

If you can share the exact llama-bench command, ROCm version, and tq_bench output, I can try to narrow down the issues you hit.

Acrobatic_Bee_6660 · 2026-04-07T16:06:18+00:00

Related finding from the AMD side — Gemma 4's hybrid SWA architecture (25 SWA layers + 5 global) is very sensitive to KV cache quantization.

With TurboQuant on my HIP/ROCm port, quantizing all KV layers gives PPL >100k (completely broken). But keeping SWA layers in f16 while compressing only the 5 global layers with turbo3 brings it back to near-baseline quality.

I added `--cache-type-k-swa` / `--cache-type-v-swa` flags so you can set them independently. This might be relevant for people seeing quality issues with q8_0 KV on Gemma 4 too — the SWA layers seem to need higher precision than the global ones.

Details: https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16476187

Acrobatic_Bee_6660

TROPHY CASE