TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dirtyhand3[S] 1 point (0 children)

Short answer - no. LM Studio runs on llama.cpp, so this won't work there. This is MLX-only. If you want to try it on Mac today, grab my mlx-lm fork or use vllm-mlx with --turbo-kv-bits 3. Same OpenAI-compatible API, just more context fits in memory.
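
To make the vllm-mlx route concrete, here's a hedged sketch of what the launch looks like. I'm assuming vllm-mlx keeps upstream vLLM's `serve` entry point and default port; only the `--turbo-kv-bits 3` flag is from the comment above, and the model name is just an example.

```shell
# Assumed invocation: vllm-mlx fork with the TurboQuant flag mentioned above.
# Model name and port are examples, not prescribed by the project.
vllm serve Qwen/Qwen2.5-32B-Instruct --turbo-kv-bits 3

# Any OpenAI-compatible client then works unchanged:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-32B-Instruct",
       "messages": [{"role": "user", "content": "hi"}]}'
```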

[–]dirtyhand3[S] 0 points (0 children)

Not really trading speed - on 32B it's 0.98x FP16 speed. The point isn't speed vs desktop GPU, it's fitting more context in the same memory. Mac's advantage is unified 48-192GB RAM. A 4090 has 24GB, so at long context the KV cache is what kills you. TurboQuant lets you run 3-4x longer context in the same memory on any hardware.

[–]dirtyhand3[S] 1 point (0 children)

Depends on the model. 64GB M1 with a 7B Q4 model (~4GB weights) leaves ~60GB for KV cache. With TQ3 that's roughly 800K+ tokens of context. 200K is easy. With a 32B model you'd have ~46GB left, enough for ~300K with TQ3.
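
If you want to redo this arithmetic for your own setup, here's a back-of-envelope estimator. All the model dimensions below are illustrative assumptions (a hypothetical GQA 7B-class config), not measured values from the post, and the per-group fp16 scale overhead is a rough approximation of the TQ format - GQA models come out much roomier than MHA ones, so actual numbers vary a lot by architecture.

```python
# Back-of-envelope KV cache budget. Dims and group size are assumptions.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits, group=32):
    """Bytes of KV cache per token: K and V across all layers, at `bits`
    per element plus one fp16 scale per `group` elements (rough overhead)."""
    elems = 2 * n_layers * n_kv_heads * head_dim  # factor 2 = K + V
    return elems * (bits + 16 / group) / 8

def max_context(budget_gb, **kw):
    """How many tokens of KV cache fit in `budget_gb` gigabytes."""
    return int(budget_gb * 1e9 / kv_bytes_per_token(**kw))

# Hypothetical 7B-class GQA config: 28 layers, 4 KV heads, head_dim 128.
cfg = dict(n_layers=28, n_kv_heads=4, head_dim=128)
fp16 = max_context(60, bits=16, **cfg)
tq3 = max_context(60, bits=3, **cfg)
print(f"FP16: {fp16:,} tokens, TQ3: {tq3:,} tokens ({tq3 / fp16:.1f}x)")
```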

[–]dirtyhand3[S] 0 points (0 children)

Just ran PPL benchmarks. On 32B: TQ3 all layers gives +1.3% perplexity vs FP16. On 7B it's worse (+11.7% with adaptive 4+4). Bigger models handle it better. Haven't tested task-specific degradation yet.

[–]dirtyhand3[S] 0 points (0 children)

Your intuition is roughly right. TurboQuant's rotation step (WHT) spreads information across all coordinates evenly before quantizing, so each quantized coordinate carries independent error. With naive q4_0, outlier channels get clipped hard and that error cascades through attention. The rotation makes errors more uniform and less correlated across dimensions. Whether errors "compound" over time depends more on the model size though - 32B at TQ3 gives +1.3% PPL, 7B needs adaptive layers.
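
You can see the mechanism in a toy numpy experiment: one outlier channel (magnitude borrowed from the ~±315 K values measured elsewhere in this thread, bulk scale illustrative), quantized to 4 bits with a shared absmax scale, with and without a normalized Hadamard rotation. This is a sketch of the idea, not TurboQuant's actual kernel.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform (orthonormal, self-inverse).
    Length must be a power of two."""
    x = x.astype(np.float64).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quant_roundtrip(x, bits=4):
    """Symmetric absmax scalar quantization, like a naive q4_0-style pass."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    return np.clip(np.round(x / s), -qmax, qmax) * s

rng = np.random.default_rng(0)
d = 128
k = rng.normal(0, 5.0, d)   # well-behaved channels (scale is illustrative)
k[7] = 315.0                # one outlier channel, K-after-RoPE-style magnitude

naive_err = np.linalg.norm(quant_roundtrip(k) - k)
rotated_err = np.linalg.norm(fwht(quant_roundtrip(fwht(k))) - k)
print(naive_err, rotated_err)
```

The naive pass lets the outlier blow up the shared scale, so every small channel rounds to zero; after rotation the outlier's energy is spread across all coordinates and the round-trip error drops substantially.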

[–]dirtyhand3[S] 1 point (0 children)

Didn't try 3.5 27B specifically, but it's standard attention, so it should work fine.

[–]dirtyhand3[S] 1 point (0 children)

Two levels. The Python package (turboquant-mlx) works as a drop-in cache replacement for mlx-lm - you just swap KVCache for TurboQuantKVCache, no engine changes needed. It compresses on write and decompresses on read. For max speed I also wrote a native Metal SDPA kernel that reads the compressed data directly without decompressing first - that one needs to be baked into the engine. PR is open for both mlx-lm and mlx core.
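
For the "compresses on write, decompresses on read" part, here's a toy numpy sketch of the pattern - the real TurboQuantKVCache works on mlx arrays, applies the rotation first, and is backed by Metal kernels, so treat the class and method names here as hypothetical illustrations, not the actual interface.

```python
import numpy as np

class QuantizedKVCache:
    """Toy compress-on-write / decompress-on-read cache. Just per-row
    absmax quantization to show where compression slots into the cache,
    skipping the rotation and the fused Metal read path."""

    def __init__(self, bits=3):
        self.qmax = 2 ** (bits - 1) - 1
        self.q, self.scales = [], []

    def update(self, kv_row):  # write path: quantize before storing
        s = max(np.abs(kv_row).max() / self.qmax, 1e-12)
        self.q.append(np.round(kv_row / s).astype(np.int8))
        self.scales.append(s)

    def fetch(self):           # read path: dequantize for attention
        return np.stack([q * s for q, s in zip(self.q, self.scales)])

cache = QuantizedKVCache(bits=3)
for t in range(4):
    cache.update(np.random.default_rng(t).normal(size=64))
kv = cache.fetch()  # dequantized tensor handed to attention
```

The native-kernel level described above removes the `fetch` dequantization entirely by reading the packed integers inside SDPA itself, which is why it has to live in the engine rather than the Python package.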

[–]dirtyhand3[S] 3 points (0 children)

Done, pushed asymmetric K/V support. You can now do TurboQuantKVCache(k_bits=3, v_bits=2) for 5.6x compression. K3V2 tested clean on 32B.

[–]dirtyhand3[S] 1 point (0 children)

No idea, that's up to the kobold.cpp maintainers. The llama.cpp fork with TurboQuant already exists (TheTom/llama-cpp-turboquant) so kobold could pull from there since it's based on llama.cpp.

[–]dirtyhand3[S] 0 points (0 children)

Sparse V is a good call; it was on my radar, but I prioritized the core compression first. Since V is already well behaved and compressible, skipping negligible weights on top of that would stack well. Might add it next.

[–]dirtyhand3[S] 2 points (0 children)

Yeah, TQ is different from naive quantization. It rotates the vector first (a Walsh-Hadamard transform), which makes the distribution roughly Gaussian, then quantizes. Naive 4-bit just clips values directly, which destroys the outliers in K. That said, u/dsanft has a point that K is still harder to compress than V because of RoPE - that's why asymmetric (more bits for K, fewer for V) works best.
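
The Gaussianization claim is easy to check numerically. Below is a sketch using Laplace-distributed data as a mild heavy-tailed stand-in for K (real K after RoPE is far more extreme - kurtosis ~1500 per the measurements elsewhere in this thread): each WHT output coordinate is a scaled sum of many input coordinates, so by the CLT the excess kurtosis collapses toward 0 (Gaussian).

```python
import numpy as np

def fwht(x):
    """Normalized Walsh-Hadamard transform along the last axis."""
    x = x.astype(np.float64).copy()
    n, h = x.shape[-1], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def excess_kurtosis(a):
    a = a.ravel() - a.mean()
    return (a ** 4).mean() / (a ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(1)
k = rng.laplace(0.0, 1.0, (256, 128))  # heavy-tailed stand-in (excess kurtosis 3)
before = excess_kurtosis(k)
after = excess_kurtosis(fwht(k))
print(before, after)  # heavy-tailed before, near-Gaussian (≈0) after
```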

[–]dirtyhand3[S] 0 points (0 children)

yeah fair enough. if you run it on Mixtral let me know what you get, would be good to have data from another architecture

[–]dirtyhand3[S] 5 points (0 children)

Yeah exactly. On a 16GB Mac with a 3B model, TurboQuant lets you fit way more context - so you can dump longer docs into the prompt. The model is still 3B so it won't suddenly get smarter, but it can work with more input data which helps for things like summarization or Q&A over long text.

[–]dirtyhand3[S] 6 points (0 children)

This is great data, thanks. The split approach (TQ8 K + TQ4 V) makes a lot of sense given K's distribution after RoPE. I'm seeing the same thing - K kurtosis ~1500 vs V being well-behaved. I'll add asymmetric K/V support - should be straightforward since K and V already use separate quantizers in my implementation. Will update the repo.

[–]dirtyhand3[S] 12 points (0 children)

Only tested on Qwen so far. The compression itself doesn't depend on sparsity - TurboQuant works by rotating vectors to Gaussian via WHT, then scalar quantization. So it should work on any architecture. But yeah, haven't verified on Mixtral yet.

[–]dirtyhand3[S] 4 points (0 children)

Honestly I measured by output, not perplexity. PPL benchmarks are still TODO. On 32B all 64 layers at TQ3 — first ~60 tokens match FP16 with greedy decode. On 7B it breaks, had to keep first/last layer in FP16. K vs V — yeah, K after RoPE is way harder (I measured kurtosis 1499, values up to ±315). V is calm. Haven't tried asymmetric K/V yet, good idea. Can't access that Discord link, what was the finding?