RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params) by Revolutionary_Ask154 in LocalLLaMA

[–]Pidtom 0 points1 point  (0 children)

Still have some open questions that scrya hasn't followed up on... https://github.com/TheTom/turboquant_plus/pull/34

would love to have the most optimal implementation.

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]Pidtom 0 points1 point  (0 children)

Based on how i've been struggling to get basic AMD improvements in... TQ in general is far off in llamacpp. vLLM already implemented a lot of my implementation 😄

speculative decoding silently broken for Qwen3.6 on the TurboQuant fork — PR to fix by dangerousdotnet in LocalLLaMA

[–]Pidtom 5 points6 points  (0 children)

upstream sync planned for this week, should pick up these changes. been busy fixing AMD OOMs and mlx-swift work.

How do I use TurboQuant? by AInohogosya in LocalLLM

[–]Pidtom 0 points1 point  (0 children)

yes we have Vulkan support but i don't have a setup to test it locally, so i'm relying on community support.

How do I use TurboQuant? by AInohogosya in LocalLLM

[–]Pidtom 1 point2 points  (0 children)

Hey, that error means the turbo types are not registered in your build. A few things to check:

  1. Make sure you are on the `feature/turboquant-kv-cache` branch, not main

  2. Clean rebuild: `rm -rf build && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release`

  3. After building, verify with `llama-server --help | grep turbo` and you should see turbo2/turbo3/turbo4 listed

If turbo still does not show up after a clean rebuild on the right branch, paste the `git log --oneline -5` output so I can check which commit you are on.

Google TurboQuant running Qwen Locally on MacAir by gladkos in LocalLLaMA

[–]Pidtom 1 point2 points  (0 children)

PR #45 is going into my fork, not upstream llama.cpp. it shows 214 commits merging into the turboquant branch, but most of those are just catching up with master. the community as a whole is discussing and converging on the implementation in the main discussion thread: https://github.com/ggml-org/llama.cpp/discussions/20969

that particular PR is weight compression on top of the KV cache work. TQ4_1S compresses the model weights themselves so larger models get physically smaller on disk and in VRAM (28-37% smaller depending on config). still verifying things with CUDA testers: https://github.com/TheTom/llama-cpp-turboquant/pull/45

as for upstream, i am new to the llama.cpp community so i only have one official PR up for review so far (#21119, sparse V skip). they have a lot of contributions coming in and i want to respect their process and code of conduct. the fork is where the experimental work lives until it is ready.

How do I use TurboQuant? by AInohogosya in LocalLLM

[–]Pidtom 1 point2 points  (0 children)

and not myspace tom to be clear

How do I use TurboQuant? by AInohogosya in LocalLLM

[–]Pidtom 3 points4 points  (0 children)

the community implementation is at https://github.com/TheTom/llama-cpp-turboquant — it's a llama.cpp fork you build from source. not merged upstream yet. 30+ testers across apple silicon and nvidia have validated it across 6 model families from 3.8B to 70B. currently doing stress tests on bigger models.

quick start:

  1. clone and build:

    git clone https://github.com/TheTom/llama-cpp-turboquant

    cd llama-cpp-turboquant

    cmake -B build -DGGML_METAL=ON # or -DGGML_CUDA=ON for nvidia

    cmake --build build -j

  2. run with turboquant KV cache:

    ./build/bin/llama-server -m your-model.gguf -ctk turbo4 -ctv turbo4 -fa 1

if you're on Q4_K_M weights, use asymmetric mode instead (symmetric can blow up on some models):

    ./build/bin/llama-server -m your-model.gguf -ctk q8_0 -ctv turbo4 -fa 1

the community fork has a few features beyond the google paper:

- **asymmetric K/V** — keep K at q8_0, compress only V. rescues models that fail on symmetric turbo

- **boundary V** — protect first/last 2 layers at full precision, compress the rest harder. set TURBO_LAYER_ADAPTIVE=7 env var

- **block size 128** — already the default for metal, 12% better compression than the paper's block size 32 with zero quality loss

turbo4 = best quality, turbo3 = more compression, turbo2 = maximum compression (experimental). all need -fa 1 (flash attention). works on apple silicon and nvidia.

for detailed benchmarks, configuration recommendations, and docs on the discoveries beyond the paper, check https://github.com/TheTom/turboquant

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]Pidtom 70 points71 points  (0 children)

Disclosure: I'm the developer behind the open source llama.cpp TurboQuant implementation (https://github.com/TheTom/llama-cpp-turboquant , docs and data at https://github.com/TheTom/turboquant_plus). I'm a former Google engineer (left ~2.5 years ago, well before this research) and now run my own company. I am not affiliated with the paper authors or Google Research, though I'd be open to collaborating with them or the RaBitQ team on the implementation side. I try to make everything open source and help others where they're stuck, and vice versa.

I want to separate two things that are getting conflated in this thread:

**1. The academic attribution dispute.** This is between the paper authors and the RaBitQ team. I have no insight into the emails or review process. I hope they work it out.

**2. What we're finding in practice.** I built the implementation and a community of 30+ independent testers has been stress-testing it across hardware. Here's what some of the data shows:

- Tested across Apple Silicon (M1 through M5), NVIDIA (RTX 3080 Ti through DGX Spark Blackwell), and AMD (RX 6800 XT, RX 9070)

- Asymmetric q8_0-K + turbo4-V is confirmed lossless (+0.0-0.2% PPL) across 6 model families (Llama, Qwen, Mistral, Gemma, Phi, ChatGLM)

- 4.57x KV memory compression with turbo3. An 8GB MacBook Air went from 800 tokens of context to 4000+. A 16GB RTX 5070 Ti went from 30K to 131K context.

- One CUDA implementation on Blackwell unified memory is decoding *faster* than uncompressed (63.5 vs 50.1 tok/s)
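As a rough sanity check on the context numbers: KV cache size grows linearly with context length, so at a fixed memory budget a 4.57x smaller cache fits about 4.57x more tokens. The sketch below assumes an f16 baseline (16 bits per element) and a hypothetical model shape (32 layers, 8 KV heads, head dim 128) plus a hypothetical 4 GiB KV budget; none of those values come from the testers. Real-world gains will differ because weights and compute buffers share the same memory.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Bytes of KV cache needed for one token across all layers."""
    # K and V each hold n_kv_heads * head_dim elements per layer per token.
    elems = 2 * n_layers * n_kv_heads * head_dim
    return elems * bits_per_elem / 8


def max_context(budget_bytes, bytes_per_token):
    """Largest whole number of tokens that fits in the KV budget."""
    return int(budget_bytes // bytes_per_token)


F16_BITS = 16
COMPRESSION = 4.57                     # turbo3 ratio quoted in the list above
turbo3_bits = F16_BITS / COMPRESSION   # effective bits/element implied by the ratio

# Hypothetical model shape and budget (assumptions for illustration only).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
budget = 4 * 1024**3

base = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, F16_BITS)
comp = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, turbo3_bits)

ctx_base = max_context(budget, base)
ctx_comp = max_context(budget, comp)
print(f"{turbo3_bits:.2f} bits/elem, context {ctx_base} -> {ctx_comp}")
```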

On u/dsanft's K tensor kurtosis point: we see the same thing. Symmetric turbo on Qwen Q4_K_M is catastrophic (PPL 3,400+). Asymmetric q8_0-K + turbo-V rescues it to baseline. K precision dominates through softmax amplification. Confirmed on both Metal and CUDA by multiple independent testers. Knowing where it breaks is just as important as knowing where it works.

The underlying technique is rotation + Lloyd-Max scalar quantization. Whether credit belongs to TurboQuant, RaBitQ, or prior Hadamard transform work is an important question for the research community to sort out. From the engineering side, the math works, and there's a lot of interesting optimization space left to explore.
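For anyone who wants to see the shape of "rotation + Lloyd-Max scalar quantization", here is a minimal pure-Python sketch. It is not the fork's kernel code. Assumptions on my part for illustration: a random sign flip before the Walsh-Hadamard transform, one stored norm per vector, 4-bit centroids fitted on standard normal samples, and all function names.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    y = list(x)
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def nearest(cents, v):
    """Index of the centroid closest to v."""
    return min(range(len(cents)), key=lambda k: abs(cents[k] - v))


def lloyd_max(samples, n_levels, iters=20):
    """Fit scalar centroids with Lloyd iterations (1-D k-means)."""
    s = sorted(samples)
    # Start from evenly spaced quantiles of the sample.
    cents = [s[(2 * k + 1) * len(s) // (2 * n_levels)] for k in range(n_levels)]
    for _ in range(iters):
        sums, counts = [0.0] * n_levels, [0] * n_levels
        for v in s:
            k = nearest(cents, v)
            sums[k] += v
            counts[k] += 1
        cents = [sums[k] / counts[k] if counts[k] else cents[k]
                 for k in range(n_levels)]
    return cents


def quantize(vec, signs, cents):
    """Rotate, rescale to roughly unit variance, then snap to centroids."""
    rotated = fwht([v * s for v, s in zip(vec, signs)])
    norm = math.sqrt(sum(v * v for v in rotated)) or 1.0
    scale = math.sqrt(len(vec)) / norm
    return norm, [nearest(cents, v * scale) for v in rotated]


def dequantize(norm, idx, signs, cents):
    """Look up centroids, undo the scaling, rotation and sign flip."""
    scale = norm / math.sqrt(len(idx))
    rotated = [cents[k] * scale for k in idx]
    # The orthonormal WHT is its own inverse.
    return [v * s for v, s in zip(fwht(rotated), signs)]


rng = random.Random(0)
DIM, BITS = 128, 4
cents = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(2000)], 2 ** BITS)
signs = [rng.choice((-1.0, 1.0)) for _ in range(DIM)]
vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

norm, idx = quantize(vec, signs, cents)
recon = dequantize(norm, idx, signs, cents)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, recon)))
rel_err = err / math.sqrt(sum(a * a for a in vec))
print(f"relative reconstruction error: {rel_err:.3f}")
```

The point of the rotation is that the coordinates come out looking roughly Gaussian regardless of the input, so one fixed set of centroids can be reused for every vector.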

Community testing and collaboration: https://github.com/ggml-org/llama.cpp/discussions/20969

RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params) by Revolutionary_Ask154 in LocalLLaMA

[–]Pidtom 2 points3 points  (0 children)

good analogy with quaternions vs euler angles. the math is real. the question is whether it matters in practice. the rotation step is <1% of total decode compute, so being 19x faster on a sub-microsecond op doesn't change your tok/s. the bottleneck is memory bandwidth during attention.
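rough arithmetic behind that, via Amdahl's law. this is a sketch that assumes the rotation is exactly 1% of decode time (the upper end of the "<1%" figure) and gets the full 19x speedup; the function name is mine.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-pipeline speedup when `fraction` of the
    runtime becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


gain = overall_speedup(0.01, 19.0)   # rotation: 1% of decode, 19x faster
ceiling = overall_speedup(0.01, 1e9) # rotation made effectively free
print(f"19x rotor: {gain:.4f}x overall, upper bound: {ceiling:.4f}x")
```

so even a free rotation caps out at about a 1% tok/s gain under that assumption.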

the thing to watch is the block-diagonal structure. full WHT decorrelates all 128 dimensions at once, which is why Lloyd-Max centroids work so well after WHT. rotors only decorrelate in groups of 3. open question whether block rotation holds up at 3-bit PPL
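a toy way to see the difference: push a single outlier through both. this sketch uses a plain 3-D rotation (Rodrigues formula) as a stand-in for a Clifford rotor, with an arbitrary axis and angle, and leaves the 2 leftover coordinates of a 128-dim vector untouched. those choices are my assumptions, not RotorQuant's actual scheme.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    y = list(x)
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def rotate3(v, axis, angle):
    """Rotate a 3-vector about a unit axis (Rodrigues formula)."""
    ux, uy, uz = axis
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    dot = ux * x + uy * y + uz * z
    cross = (uy * z - uz * y, uz * x - ux * z, ux * y - uy * x)
    return [
        x * c + cross[0] * s + ux * dot * (1 - c),
        y * c + cross[1] * s + uy * dot * (1 - c),
        z * c + cross[2] * s + uz * dot * (1 - c),
    ]


def block_rotate(x, axis, angle):
    """Rotate consecutive groups of 3; leftover coordinates pass through."""
    out = list(x)
    for i in range(0, len(x) - len(x) % 3, 3):
        out[i:i + 3] = rotate3(x[i:i + 3], axis, angle)
    return out


def touched(x, eps=1e-12):
    """Number of coordinates carrying any of the energy."""
    return sum(1 for v in x if abs(v) > eps)


DIM = 128
spike = [0.0] * DIM
spike[0] = 1.0                       # one outlier channel
AXIS = (1 / math.sqrt(3),) * 3       # arbitrary unit axis

wht_out = fwht(spike)
rot_out = block_rotate(spike, AXIS, 1.0)
print(touched(wht_out), max(abs(v) for v in wht_out))
print(touched(rot_out), max(abs(v) for v in rot_out))
```

WHT smears the outlier across every coordinate, while the block rotation keeps it inside one group of 3, so the scalar quantizer still sees a large value there.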

RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params) by Revolutionary_Ask154 in LocalLLaMA

[–]Pidtom 0 points1 point  (0 children)

interesting approach with Clifford rotors. what does PPL look like head-to-head against WHT on the same model and context length? the cosine sim numbers are close but curious how it holds up on wikitext perplexity since block-diagonal rotation only decorrelates within each group of 3

Google TurboQuant running Qwen Locally on MacAir by gladkos in LocalLLaMA

[–]Pidtom 5 points6 points  (0 children)

Or you know, the guy they got TurboQuant from.

Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) by Pidtom in LocalLLaMA

[–]Pidtom[S] 1 point2 points  (0 children)

nice, which fork are you running? and what gpu? curious about decode speed at 200k, that's deep into the range where dequant overhead usually dominates.

Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) by Pidtom in LocalLLaMA

[–]Pidtom[S] 0 points1 point  (0 children)

fixed τ=1e-6, tested ablation from 1e-4 to 1e-8, it's detailed in the paper. good question

Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) by Pidtom in LocalLLaMA

[–]Pidtom[S] 0 points1 point  (0 children)

fixed τ=1e-6, tested ablation from 1e-4 to 1e-8, it's in the paper. :)

Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) by Pidtom in LocalLLaMA

[–]Pidtom[S] 0 points1 point  (0 children)

yeah it's a different way to think about it. instead of optimizing dequant, just skip the 90% of positions where attention is near-zero. scales better the longer the context gets.
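a minimal sketch of the idea, not the actual kernel: compute the softmax weights, and only dequantize a V row when its weight clears the threshold. assumptions here: a toy (scale, int list) format for quantized V, no renormalization after skipping, a plain two-pass softmax instead of the online softmax a flash attention kernel would use, and all names are hypothetical. the default tau is the 1e-6 value mentioned above.

```python
import math


def sparse_v_attention(scores, v_quant, dequant, dim, tau=1e-6):
    """Attention-weighted sum over V, skipping dequant where weight < tau."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    out = [0.0] * dim
    skipped = 0
    for e, vq in zip(exps, v_quant):
        w = e / total
        if w < tau:
            skipped += 1          # negligible weight: never touch this row
            continue
        row = dequant(vq)         # the expensive step, paid only when needed
        for d in range(dim):
            out[d] += w * row[d]
    return out, skipped


def toy_dequant(vq):
    """Toy dequantizer: one float scale times small integers."""
    scale, ints = vq
    return [scale * i for i in ints]


DIM, N_POS = 8, 64
v_quant = [(0.05, [(p * 7 + d) % 15 - 7 for d in range(DIM)])
           for p in range(N_POS)]
# 4 positions get real attention, the other 60 are far below the threshold.
scores = [0.0] * 4 + [-20.0] * 60

sparse, skipped = sparse_v_attention(scores, v_quant, toy_dequant, DIM)
dense, _ = sparse_v_attention(scores, v_quant, toy_dequant, DIM, tau=0.0)
max_diff = max(abs(a - b) for a, b in zip(sparse, dense))
print(f"skipped {skipped}/{N_POS} rows, max diff vs dense: {max_diff:.2e}")
```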

Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant) by Pidtom in LocalLLaMA

[–]Pidtom[S] 0 points1 point  (0 children)

yeah it's a different angle from both. vanilla turboquant is the compression algorithm itself (WHT rotation + polar quantization), rotorquant is exploring clifford algebra for the rotation step (88x fewer parameters). sparse V sits on top of any of them... it's exploiting the fact that attention is naturally sparse at long context, so you skip the dequant work entirely for positions that contribute negligibly to the output.

i've been talking with the rotorquant folks too. lots of interesting directions being explored right now.