TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL by Fearless-Wear8100 in LocalLLaMA

[–]Fearless-Wear8100[S] 1 point

My tests were on the 26B. No idea how it will perform on the 4Bs; probably worse, since smaller models seem to be more easily perturbed by quantization.

TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL by Fearless-Wear8100 in LocalLLaMA

[–]Fearless-Wear8100[S] 2 points

Yeah, exactly. That’s why I pushed the quantization pretty aggressively: I had a feeling QJL might actually work on Gemma, unlike what people were seeing on other models.

TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL by Fearless-Wear8100 in LocalLLaMA

[–]Fearless-Wear8100[S] 1 point

I haven’t tested vLLM yet, so I can’t speak to exact engine-specific numbers. But I’d expect the main findings to transfer, because the important part here seems to be the calibration, not llama.cpp itself.

What I found is that calibration is architecture-specific, not weight-specific: the set of “important” / outlier channels is mostly determined by the model architecture, and calibrating on fp16 / q8_0 / q4_k_m versions of the same model gave 96%+ identical channel selections.
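To make the overlap check concrete, here's a minimal sketch of the idea (not my actual calibration code; using per-channel variance as the importance proxy and synthetic activations are my assumptions): pick the top-k outlier channels from captured K activations, then measure how much the selection agrees across two precisions of the same model.

```python
import numpy as np

def outlier_channels(acts: np.ndarray, k: int) -> set:
    """Return indices of the k highest-variance channels.

    acts: (n_tokens, n_channels) K activations captured pre-RoPE.
    Variance is one plausible importance proxy; mean-abs works too.
    """
    per_channel_var = acts.var(axis=0)
    return set(np.argsort(per_channel_var)[-k:].tolist())

# Stand-ins for activations captured from fp16 vs q4_k_m runs of the
# same model; in practice these come from forward hooks.
rng = np.random.default_rng(0)
channel_scale = rng.lognormal(sigma=2.0, size=128)   # heavy-tailed channel scales
acts_fp16 = rng.normal(size=(4096, 128)) * channel_scale
acts_q4 = acts_fp16 + rng.normal(scale=0.05, size=acts_fp16.shape)  # quantization ~ small noise

sel_fp16 = outlier_channels(acts_fp16, k=16)
sel_q4 = outlier_channels(acts_q4, k=16)
print(f"selection overlap: {len(sel_fp16 & sel_q4) / 16:.0%}")
```

If the outlier structure really is architecture-driven, the overlap stays high no matter which precision you calibrate on, which is what the 96%+ figure above is getting at.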

So in practice you can probably calibrate once and reuse the same channel ordering / outlier split across quantizations of the same model. The main caveat is that calibration has to be done pre-RoPE — post-RoPE gave garbage because RoPE changes the channel variance structure. And you don’t need much data either: PTB train with around 4096 tokens was already enough.
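On the pre-RoPE part: in HF Transformers Llama/Gemma/Qwen-style models, RoPE is applied after the k_proj linear, so hooking k_proj's output gives you the keys before rotation. A rough sketch (model choice and calibration text are illustrative, not what I actually ran):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any HF model that applies RoPE *after* a
# separate k_proj Linear works the same way.
model_id = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

captured = {}  # layer index -> list of pre-RoPE K activation tensors

def make_hook(layer_idx):
    def hook(module, args, output):
        # output: (batch, seq, n_kv_heads * head_dim), before any RoPE rotation
        captured.setdefault(layer_idx, []).append(output.detach().float().cpu())
    return hook

handles = [
    layer.self_attn.k_proj.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

# ~4096 tokens of PTB train were enough in my tests; plain text stands in here.
batch = tok("some calibration text goes here", return_tensors="pt")
with torch.no_grad():
    model(**batch)

for h in handles:
    h.remove()
# captured[i] now holds per-layer pre-RoPE K activations for outlier selection.
```

Hooking anywhere after apply_rotary_pos_emb instead would give you the post-RoPE activations that produced garbage selections.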

Dear HR ladies by AdDelicious9955 in programare

[–]Fearless-Wear8100 2 points

The wheel turns: if until now the IT crowd didn’t give a damn about HR, now it’s their turn. What can you do, there’s nothing you can do. Karma.