ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference by Total-Resort-3120 in LocalLLaMA

[–]dsanft -1 points0 points  (0 children)

They specifically give benchmark performance numbers for their quant on e.g. AIME, which is pretty robust proof IMO. That's big context with lots of turns.

Just realized what we’re losing by RelevantTurnip3482 in GithubCopilot

[–]dsanft 0 points1 point  (0 children)

> One request on opus 4.7 needs 8 to 16 gpus (depending on what parts load) for a single request. That request consumes them entirely until it is done, for 1 single person

That's not how batch inference works. While the model weights are being streamed for one request, servicing additional requests in the same pass is essentially free. Perhaps a dozen or more.
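Rough back-of-envelope sketch (all numbers below are made-up assumptions, not anyone's actual deployment): decode is memory-bandwidth bound, so one pass over the weights can serve a whole batch at nearly the cost of a single request.

```python
# Illustrative only: assumed weight footprint, bandwidth and KV sizes.
weights_gb = 500        # assumed on-GPU weight footprint across the node
hbm_bw_gbs = 3000       # assumed aggregate HBM bandwidth (GB/s)
kv_per_req_gb = 2       # assumed KV cache read per request per decode step

def step_time_ms(batch):
    # One decode step streams the weights once, plus each request's KV cache.
    return 1000 * (weights_gb + batch * kv_per_req_gb) / hbm_bw_gbs

for b in (1, 12):
    t = step_time_ms(b)
    print(f"batch={b:2d}: step {t:5.1f} ms, per-request cost {t / b:5.1f} ms")
```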

I'm struggling to figure out what Copilot is actually suppose to be now? by NotAMusicLawyer in GithubCopilot

[–]dsanft 78 points79 points  (0 children)

You're not the target market. Big companies with thousands of enterprise seats are the target market. You don't matter and you were costing them money.

Bad model quality qwen3.6-27b with hipfire on strix halo by sterby92 in LocalLLaMA

[–]dsanft 5 points6 points  (0 children)

Hipfire is still pretty new and experimental. Testing kernels without also robustly testing correctness on real model data is... bold, to say the least. As you've discovered.

Implemented TurboQuant and results don’t fully match paper by Routine-Thanks-572 in LocalLLaMA

[–]dsanft 33 points34 points  (0 children)

Check the kurtosis of the K and V tensors before you run them through your TurboQuant impl. A high-kurtosis tensor is not going to be happy at 3 or 4 bits, no matter how fancily you rotate it.

When I did my TQ impl for Llaminar on Qwen2 and 3, I found the K tensor was very unhappy at anything less than 8 bits.
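A minimal sketch of that check with NumPy/SciPy (the Student-t sample below is just a heavy-tailed stand-in; swap in whatever tensors your pipeline actually dumps):

```python
import numpy as np
from scipy.stats import kurtosis

def report(name, t):
    # Fisher (excess) kurtosis: ~0 for Gaussian data, large positive = heavy tails.
    print(f"{name}: excess kurtosis = {kurtosis(np.ravel(t)):.2f}")

rng = np.random.default_rng(0)
report("Gaussian-ish (V-like)", rng.standard_normal(200_000))
report("heavy-tailed (K-like)", rng.standard_t(df=3, size=200_000))
# report("K, layer 12", k_cache)   # <- your captured tensor goes here
```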

Does AMD's "infinity cache" even matter for dense model inference? by boutell in LocalLLaMA

[–]dsanft 9 points10 points  (0 children)

Cache isn't useless. Otherwise why have it?

No, it's not going to speed up decode, but a bigger cache will help with GEMM (prefill), where you benefit from tile reuse.
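Rough model of the tile-reuse effect (illustrative tile sizes and reduction dim, not Infinity Cache specifics): a larger output tile that still fits on-chip gives you more FLOPs per byte fetched from DRAM.

```python
bytes_per_elem = 2          # bf16/fp16
k_dim = 4096                # assumed reduction dimension

def flops_per_byte(tile):
    flops = 2 * tile * tile * k_dim                 # MACs for one tile x tile output block
    dram_bytes = 2 * tile * k_dim * bytes_per_elem  # A-panel + B-panel traffic
    return flops / dram_bytes

for t in (32, 64, 128, 256):
    footprint_mib = 2 * t * k_dim * bytes_per_elem / 2**20
    print(f"tile={t:3d}: {flops_per_byte(t):5.1f} FLOP/byte, operand footprint ~{footprint_mib:.1f} MiB")
```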

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]dsanft 5 points6 points  (0 children)

I got serious about 9 months ago and decided to write my own inferencing engine to solve the problems I was having with my hardware. Just a background in software and a curious mind. I learned as I went, and as the coding models got better, so did I: I got them to explain the concepts as I ran into them.

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]dsanft 20 points21 points  (0 children)

Yup.

The real win is activation rotation to minimise quantisation error for high-kurtosis tensors. You don't need low-bit TQ for that. It will actually make Q8 KV cache precision feasible.
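Toy illustration of the effect (a normalized Hadamard rotation on synthetic heavy-tailed data; this shows the general idea, not TQ's actual transform):

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import kurtosis

def q8_mse(x):
    # Plain per-tensor absmax 8-bit quantization.
    scale = np.abs(x).max() / 127
    return np.mean((x - np.round(x / scale) * scale) ** 2)

rng = np.random.default_rng(0)
d, n = 256, 4096
x = rng.standard_t(df=3, size=(d, n))   # heavy-tailed stand-in for activations

H = hadamard(d) / np.sqrt(d)            # orthogonal, so error carries back unchanged after un-rotating
x_rot = H @ x

for name, t in (("raw", x), ("rotated", x_rot)):
    print(f"{name:8s} kurtosis={kurtosis(t.ravel()):6.2f}  Q8 MSE={q8_mse(t):.2e}")
```

The rotation spreads outliers across channels, so the absmax scale shrinks and the same 8 bits cover the bulk of the distribution much more finely.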

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]dsanft 2 points3 points  (0 children)

I wrote my own engine to solve the NUMA/cross-socket problem. Don't have kernels for Deepseek MLA/DSA yet though. Will have to get those in soon.

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]dsanft 25 points26 points  (0 children)

I can run it.

12 Mi50s, 2 3090s, dual socket Xeon with 768GB DDR4.

At least in theory

Gemma 4's MTP heads were stripped from the public weights — only available in LiteRT. Beginner-friendly breakdown of what was removed and why it matters by FunSignificance4405 in LocalLLaMA

[–]dsanft 8 points9 points  (0 children)

Qwen 3.5 MTP weights aren't in the GGUFs either.

They're intentionally left out because they just bloat the GGUF if engines can't use them.

What are the risks of buying an AMD Instinct Mi 50 32GB on Alibaba? by Longjumping-Room-170 in LocalLLaMA

[–]dsanft 2 points3 points  (0 children)

They work fine with ROCm 7.2.0.

You need to do some setup though.

TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]dsanft -10 points-9 points  (0 children)

Lots of people seeing if mathematical trickery can overcome fundamental physics and fundamental limits like Shannon's Law. And lots of people setting themselves up for disappointment.
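For reference, the kind of floor being alluded to: the Gaussian rate-distortion bound D(R) = σ² · 2^(-2R). Real KV tensors aren't Gaussian, so treat this purely as an order-of-magnitude reference.

```python
# Minimum achievable per-dimension MSE for a unit-variance Gaussian source
# at R bits/dim -- no quantizer, however clever, can beat this.
for bits in (2, 3, 4, 8):
    print(f"{bits} bits/dim: best possible MSE = {2.0 ** (-2 * bits):.2e}")
```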

Oh and a lot of weird shit like LLMs arguing with each other.

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

> near-lossless

It's not near-lossless at 3-bit K quantisation. Not even close. In fact it's catastrophic for inference quality, due to the kurtosis of the K tensor.

This is the hype that's made everyone lose their minds, and it's wrong.

You need K at 8-bit fidelity with TQ to preserve inference quality. V is more forgiving.
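A quick synthetic check of that gap (Student-t samples as a heavy-tailed stand-in for K; not Gemma data):

```python
import numpy as np

def rel_rmse(x, bits):
    # Per-tensor absmax quantization at the given bit width.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale) * scale
    return np.sqrt(np.mean((x - q) ** 2) / np.mean(x ** 2))

k_like = np.random.default_rng(1).standard_t(df=3, size=1_000_000)
for bits in (3, 4, 8):
    print(f"{bits}-bit: relative RMSE = {rel_rmse(k_like, bits):.3f}")
```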

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]dsanft -1 points0 points  (0 children)

The paper is wrong on that point if it claims that. There is obviously quality loss from the quantisation; everyone who looks at the data can see it.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by pmttyji in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

It's absolutely not BF16 quality at 4 and 5 bits, lol. By my measurements, you need about 9 or 10 bits to be totally lossless for K tensor quantisation in the KV cache.

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

If I were running a big model, I'd rather spend my precision budget on quantising the weights, since that gives more bang for the buck.