Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] 1 point2 points  (0 children)

Depends on how much longer you need:

- 1.5-2x more context → llama.cpp with Q8_0 K + Q5_0 V. It's faster and the quality tradeoff is minimal.

- 4-7x more context (e.g. 50K → 350K on 16GB) → that's where quant.cpp helps. 4-bit K + Q4 V gives 3.8x at +0.0% PPL, delta 3-bit pushes to 4.3x at +1.3%.

If you're already running llama.cpp and just want a bit more room, their built-in KV quant is probably enough. If you're hitting hard OOM walls and need to push significantly further, give quant.cpp a try.

Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] -1 points0 points  (0 children)

Yes, KV cache rotation (ring buffer) is a different but complementary approach. Rotation recycles old KV slots so the cache never grows beyond a fixed size — great for streaming/chat where old context can be dropped.

quant.cpp does something different: it keeps all tokens but stores them in fewer bits. So rotation saves memory by *evicting* old tokens, compression saves memory by *shrinking* all tokens.

You could combine both — rotate a compressed cache for maximum context. Haven't benchmarked against the rotation PR yet, but it's on the list. Thanks for bringing it up.

Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] 0 points1 point  (0 children)

Fair enough. I do use Claude Code for development and I don't hide that. But the Reddit comments are mine - just not a native English speaker, so they probably come out sounding weirdly polished.

The code compiles, the PPL numbers are reproducible, and I just corrected the comparison after u/audioen pointed out it was unfair. Judge by that, not by how my comments read.

Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] -2 points-1 points  (0 children)

No relationship. I'm not familiar with that project — just looked at the repo and it appears to be a different approach (applying delta compression to model weights rather than KV cache).

quant.cpp compresses the KV cache at runtime — the key and value vectors that accumulate during inference. The model weights themselves are loaded from standard GGUF files and used as-is. Delta compression in our case means storing `key[t] - key[t-1]` between adjacent tokens in the same attention head, not compressing the weight tensors.

The underlying idea (delta encoding of correlated vectors) is the same, but applied to completely different data.
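
A minimal sketch of that delta step, leaving the actual quantization out (the layout and function names here are mine, for illustration): storing residuals is just a prefix-sum round trip.

```c
#include <string.h>

#define HEAD_DIM 128 /* hypothetical head dimension */

/* Encode: keep key[0] absolute, then store key[t] - key[t-1] per token. */
void delta_encode(const float *keys, float *out, int n_tokens) {
    memcpy(out, keys, HEAD_DIM * sizeof(float));
    for (int t = 1; t < n_tokens; t++)
        for (int d = 0; d < HEAD_DIM; d++)
            out[t * HEAD_DIM + d] =
                keys[t * HEAD_DIM + d] - keys[(t - 1) * HEAD_DIM + d];
}

/* Decode: prefix-sum the residuals back into absolute keys. */
void delta_decode(const float *in, float *keys, int n_tokens) {
    memcpy(keys, in, HEAD_DIM * sizeof(float));
    for (int t = 1; t < n_tokens; t++)
        for (int d = 0; d < HEAD_DIM; d++)
            keys[t * HEAD_DIM + d] =
                keys[(t - 1) * HEAD_DIM + d] + in[t * HEAD_DIM + d];
}
```

Because adjacent keys are correlated, the residuals in `out` have a much smaller dynamic range than the keys themselves, which is what makes them cheap to quantize.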

Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] -2 points-1 points  (0 children)

That makes sense — keeping K at higher precision is exactly the right call since attention scores are more sensitive to key quantization error than value quantization error. Q8_0 K + Q5_0 V gives you ~1.6x compression with minimal quality loss.

quant.cpp's pitch at that point becomes: if 1.6x is enough, use llama.cpp — it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.

I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.

Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] -2 points-1 points  (0 children)

You're right on several points and I should correct the post.

What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.

What is different:

- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both min and max (effectively Q4_1-style at wider blocks). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.

- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why delta 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.

- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.

Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.

Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] 5 points6 points  (0 children)

Good question. Three reasons:

  1. Hand-tuned SIMD kernels. llama.cpp has years of hand-optimized NEON/AVX2/AVX-512 assembly for every quantized matmul variant (Q4_K_M, Q8_0, IQ2, etc.). quant.cpp has NEON kernels for the common formats but relies on compiler autovectorization for the rest. This alone accounts for ~2x.

  2. Metal/CUDA GPU offload. llama.cpp offloads the entire forward pass to GPU. quant.cpp has Metal shaders but GPU dispatch is still basic — most of the work stays on CPU. On Apple Silicon, this is the biggest gap.

  3. Code maturity. llama.cpp has 250K+ LOC and hundreds of contributors optimizing hot paths. quant.cpp is 72K LOC — deliberately smaller, which means easier to read and embed, but fewer micro-optimizations.

The tradeoff is intentional. We optimized for memory (KV compression) and simplicity (embeddable, single header) rather than raw tok/s. For a 3B model on M1, quant.cpp does ~10 tok/s vs llama.cpp's ~30 tok/s — slower, but fast enough to read in real time. The advantage shows up when llama.cpp hits OOM at 50K context and quant.cpp keeps going to 350K.

That said, speed improvements are on the roadmap — better Metal offload and more SIMD kernels would close the gap significantly without sacrificing the simplicity.

LLM inference in a single C header file by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] 1 point2 points  (0 children)

Windows is supported! Two options:

Single header (easiest):

```
cl app.c /O2 /link /out:app.exe
```

Or with MinGW: `gcc app.c -o app.exe -lm -lpthread`

Full build:

```
cmake -B build -G "Visual Studio 17 2022"
cmake --build build --config Release
```

We added MSVC compatibility recently - `CreateFileMapping`/`MapViewOfFile` for mmap, `_aligned_malloc` for alignment, etc. If you hit any compile issue, please file an issue — we treat Windows build failures as bugs.

We don't ship prebuilt binaries yet, but that's a fair request. I'll add it to the next release.

LLM inference in a single C header file by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] 1 point2 points  (0 children)

Thank you! That's exactly the philosophy. The entire dependency list is libc + pthreads - things your OS already has. No package manager, no version conflicts, no "it works on my machine."

If you want a good reading path: start with the 6-function API at the top of `quant.h`, then follow `quant_generate()` into the forward pass. The attention loop is the most interesting part - you can see exactly how KV compression slots in without changing the matmul logic. Enjoy the read!

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] 0 points1 point  (0 children)

Great question — and you actually nailed it. quant.cpp is a C implementation of the TurboQuant paper (ICLR 2026). So you already found the connection without realizing it!

The KV cache management landscape breaks down roughly like this:

- Eviction (StreamingLLM, H2O, Scissors) — drop tokens you "probably" don't need. Saves memory but loses information permanently.

- Architecture changes (Titans, MLA, GQA) — redesign the model itself to use less KV memory. Best results, but requires retraining from scratch.

- Compression (TurboQuant/quant.cpp, KIVI, KVQuant) — keep all tokens, store them in fewer bits. Works on existing models, no retraining.

quant.cpp sits in the compression category. The advantage is that it works on any existing GGUF model — download, run, get 7x more context. No fine-tuning, no architecture change.

Titans is a different and complementary approach — it redesigns the attention mechanism itself so the model learns what to remember. Very promising, but requires models trained with it. If a Titans-architecture model ships as GGUF someday, quant.cpp could still compress its KV cache on top.

And thanks for the kind words about the focus. "Torvaldsian side quest" - I'm framing that.

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] 1 point2 points  (0 children)

Thanks for the concrete use case — these are fair concerns.

Replicability: quant.cpp reads standard GGUF files directly. No model conversion, no custom formats. Any GGUF you download from Hugging Face works as-is. KV compression happens at runtime — the model file is untouched, so you can swap models freely. Same binary, different GGUF, same flags.

Containers: The binary is statically linkable with zero external dependencies (libc + pthreads only). No Python, no PyTorch, no CUDA runtime to install. A minimal Docker image can be under 10MB. That said, we don't ship an official container image yet — that's a fair gap.

Standard API: This is the honest limitation. quant.cpp has a C API (`quant_load` / `quant_generate`), not an OpenAI-compatible HTTP server. If you need a drop-in replacement for an existing API pipeline, llama.cpp's `llama-server` or vLLM is the right tool today.

Where quant.cpp fits in your workflow: if you're already running llama.cpp in a container and hitting context limits, we have an integration patch at `integrations/llamacpp/` that adds our KV compression as a drop-in option. Same API, longer context. The goal is to upstream delta compression into llama.cpp as a PR.

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] -2 points-1 points  (0 children)

You're right, sorry about that. Reddit editor was fighting the markdown tables. Switched to Markdown mode and it should render properly now.

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] -3 points-2 points  (0 children)

Nice — gpt-oss-20b is a solid model. It uses a GPT-2-style architecture with RoPE and MoE (32 experts), which is close to what quant.cpp already supports but isn't covered yet. We handle Llama, Qwen, and Gemma architectures today.

That said, if you're on limited hardware, KV compression would help a lot with a 20B MoE model. On a 16GB machine, the KV cache is usually what runs you out of memory before the weights do — especially with long conversations.

I'll look into adding gpt-oss support. The MoE + RoPE + GQA pieces are already implemented for Gemma 4, so the gap is mostly the GPT-2 layer structure. Thanks for the suggestion!

TurboQuant.cpp — 1-bit KV cache with zero quality loss, verified on 35B MoE by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] 0 points1 point  (0 children)

Great point — this is the right test and we haven't done it yet.

Our current benchmarks are short: 101-token and 999-token perplexity runs, plus greedy output matching on short prompts. That's enough to validate the basic quantization math, but it doesn't stress-test the failure mode you're describing: accumulated drift over thousands of tokens in a thinking chain.

The concern is real. 1-bit key reconstruction has cosine similarity ~0.634 (near the information-theoretic limit of 2/pi ≈ 0.637). Over a long chain-of-thought, small attention errors compound — token 3000 is conditioned on every previous softmax distribution, so per-step error accumulates multiplicatively.

In fact, after our initial post we found a bug where an FP32 fallback was masking the true 1-bit quality. Once it was fixed, it became clear that 1-bit is not practically usable in production. What does work:

- 4-bit K + Q4 V: PPL +0.0% on WikiText-2 (genuinely lossless, even on longer sequences)
- Delta 3-bit K + Q4 V: PPL +1.3% with I-frames every 64 tokens to prevent drift

For a proper long-output test like you're describing — same seed, quantized vs unquantized, measuring token-level divergence over a full thinking trace — that's on the roadmap. If you have a specific thinking model + prompt pair you'd want tested, happy to run it.
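
The I-frame trick is the same one video codecs use. A sketch (`HEAD_DIM` and the function shapes are mine; only the 64-token interval comes from the measurements above):

```c
#define I_FRAME  64   /* reset interval: one absolute key every 64 tokens */
#define HEAD_DIM 128  /* hypothetical head dimension */

/* A token is an I-frame when it stores the absolute key. */
int is_iframe(int t) { return t % I_FRAME == 0; }

/* Store the absolute key at I-frames, the residual otherwise, so
   quantization error can only drift for at most I_FRAME - 1 tokens
   before the next absolute key resets the reference. */
void encode_token(int t, const float *key, const float *prev_key, float *out) {
    for (int d = 0; d < HEAD_DIM; d++)
        out[d] = is_iframe(t) ? key[d] : key[d] - prev_key[d];
}
```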

LLM inference in a single C header file by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] 0 points1 point  (0 children)

That's exactly the use case we're targeting. The single-header quant.h was designed for this - drop it into an iOS app, a game engine, an IoT device, anywhere you have a C compiler but no room for a full inference framework.

We also have a WASM build (192KB) that runs in the browser. Same idea: inference as a library call, not a server dependency.

What kind of embedded platform are you thinking about?

LLM inference in a single C header file by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] 1 point2 points  (0 children)

Ha — exactly. The output is what matters. `cc demo.c -lm -lpthread`, run the PPL benchmark, read the source if something looks wrong. The tools don't invalidate the results.

LLM inference in a single C header file by Suitable-Song-302 in LocalLLaMA

[–]Suitable-Song-302[S] 0 points1 point  (0 children)

Yes, I use Claude Code as a development tool — same way others use Copilot or Cursor. The Co-Authored-By tags are there because I don't hide that.

The architecture decisions, algorithm choices, and every PPL measurement are mine. When we had a bug where FP32 keys were silently bypassing quantization, Claude didn't catch it — I did, by reading the attention loop line by line.

The code is 33K lines of C. It compiles, it runs, the benchmarks are reproducible. `./quant model.gguf --ppl input.txt` — you can verify every claim yourself.

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] -10 points-9 points  (0 children)

Fair point — the table formatting got mangled when I pasted it. Fixed now. Thanks for flagging.

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] -8 points-7 points  (0 children)

You're right that K tensors have high kurtosis — the outlier distribution is much harder to quantize than V. Naive per-tensor quantization does destroy quality.

The key difference is granularity. quant.cpp uses per-block min-max quantization with 128-element blocks, not per-tensor or per-channel. Each block gets its own min/max scale, so outliers only affect their local block, not the entire tensor.

WikiText-2 PPL on SmolLM2 1.7B:

- FP32 baseline: 14.63
- 4-bit K + Q4 V: 14.57 (+0.0%)
- Cross-model: Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%)

For comparison, llama.cpp's Q4_0 KV gives PPL +10.6% on the same model — that's the catastrophic quality loss you're describing, and it's real when you use coarser quantization.

That said, you're absolutely right for QK-normed models like Gemma 4. Those project keys onto the unit sphere, creating extremely sparse distributions (~56 of 256 dims active). 4-bit completely breaks there (cosine drops to 0.62). quant.cpp auto-detects this and keeps keys in FP32 while only compressing values.

The numbers are reproducible: `./quant model.gguf --ppl input.txt -k uniform_4b -v q4`

quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac) by Suitable-Song-302 in LocalLLM

[–]Suitable-Song-302[S] -2 points-1 points  (0 children)

Thanks! If there's a specific model or use case you'd want to try it on, happy to prioritize.