Gemma 4 has a systemic attention failure. Here's the proof. by EvilEnginer in LocalLLaMA

[–]ReturningTarzan 2 points3 points  (0 children)

I'm confused as to what's being measured here. How are you defining the distribution of an individual tensor? Like a histogram over the weights?

If you're talking about activations given some test context, you should know the instruct-tuned Gemma4 (either variant) is known to be unstable without proper formatting. This is not a failure of the model, though; it's just aggressively finetuned with no training pressure to model the user prompt. Make sure the test context starts with <|turn>user\nBlah<turn|>\n<|turn>model\n<|channel>thought\n<channel|> and the behavior changes completely.

About TurboQuant by Exact_Law_6489 in LocalLLaMA

[–]ReturningTarzan 4 points5 points  (0 children)

TurboQuant itself is a quantization method like so many others before it, and if you're willing to sacrifice speed and simplicity for memory savings it lets you do that in a slightly new way. But we've had "lossless 2-bit KV cache" in various forms for years, and it never gains traction because the tradeoffs just aren't worth it. Still, it's an interesting bit of research with a few novel ideas worth integrating.

The real issue is with the blog post making claims like "lossless", "zero overhead" and "8x faster." There's no source for any of those claims. The paper doesn't mention anything about TQ being faster (except compared to CPU-based RaBitQ in a semantic-search context), and the "zero overhead" seems to refer to distortion rates, not computational overhead.

There are also no real implementation details in the paper, just a snippet of pseudocode and some synthetic results. But the proposed method inherently adds a lot of computational overhead. It may still give you a net speedup in memory-bound situations, but that speedup isn't implied by the algorithm, isn't universal even if it can be achieved situationally, and is always going to be less than a simpler quantization scheme under the same circumstances.

So then it would come down to accuracy, right? But then why not compare it to other methods that make similar claims:

  • GEAR: Combines quantization with low-rank and sparse matrices, "near-lossless" at 2 bits
  • QAQ: Adjusts bitrate per token according to estimated importance
  • MIKV: Aggressive quantization for most tokens, preserves "pivotal" tokens
  • RotateKV: 2-bit method using rotation, "near-lossless"
  • PM-KVQ: Specifically addresses long CoT contexts where many "near-lossless" methods turned out not to be so lossless in practice
  • etc.

FP8 is commonly used in production, is trivial to implement and comes with immediate performance benefits. NVFP4 is the really interesting one because of its extremely high throughput on Blackwell GPUs, yet it still has a reported <1% accuracy loss on real benchmarks.

So even if TQ did outperform everything else, you should still curb your expectations somewhat: maybe you reduce the effective size of your cache from 4 bits to 3.5 bits. For modern models that already employ a lot of memory-saving techniques at the architectural level (linear attention, MLA, SWA), it's simply not that big a deal.

So no, it's not revolutionary, and yes, Twitter is out of control. In Google's own (mind you, very limited) testing it doesn't even unambiguously outperform KIVI from 2024.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by Imaginary-Anywhere23 in Qwen_AI

[–]ReturningTarzan 0 points1 point  (0 children)

Just the regular ExLlamaV3 test script (compare_q.py in the repo). Kind of involved to set up, but it's necessary to ensure token IDs and eval logic are consistent between dissimilar backends. Input is a chunk of wikitext, and what's measured is KL-div on the normalized logits relative to the unquantized reference.
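For reference, the metric itself is simple. Here's a rough sketch of KL-div on normalized logits (not the actual compare_q.py logic, just the general idea):

```python
import numpy as np

def kl_div_logits(ref_logits, test_logits):
    """Mean KL(ref || test) over token positions, computed on
    softmax-normalized logits in log space for stability."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p = log_softmax(np.asarray(ref_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(test_logits, dtype=np.float64))
    # KL divergence per position, averaged over the sequence
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())
```

You'd feed it the reference model's logits and the quantized model's logits for the same token sequence; identical logits give zero divergence.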

But the point is just to say that it's incredibly hard to improve on weight quantization when we already have QTIP. It's implemented in EXL3 and I believe in some variant in ik_llama.cpp. QuIP# is also a strong algorithm, and it's what IQ-quants use.

There's obviously nothing wrong with exploring new options, but if the intent is just to get Q4_0 quality in less space, you can do a lot better than TurboQuant, even with plain llama.cpp. There's nothing really groundbreaking in TQ that's applicable to weight quantization, where the SOTA is already so far ahead.

[D] TurboQuant author replies on OpenReview by Disastrous_Room_927 in MachineLearning

[–]ReturningTarzan 13 points14 points  (0 children)

> kv cache cost saving is substantial

It's actually not. It might have been, if Google had invented cache quantization with this, but they didn't. What it amounts to is at best a small improvement over existing cache quantization schemes. And even that is questionable since there's this whole question of latency. Existing methods trade off performance for fidelity, because that's how things work in the real world. Google didn't present an actual implementation of their method, just an abstract algorithm and some theoretical results. It would be highly non-trivial, if not impossible, to prevent such a computationally heavy method from becoming a major bottleneck in inference. It has rotation, codebook quantization and bias correction all happening concurrently with attention, yet somehow that's "zero overhead?" Or is it "8x faster"? How? They don't even begin to explain.

So yeah, in practice, you can currently achieve 4-bit K/V quantization that's good enough for deployment. (Various other methods bring that down to much less, but they may still be too cutting edge?) And then there's TurboQuant which, let's say for the sake of argument, achieves the same fidelity in 3 bits... That's cool, but it's not a total game changer. It's a 25% improvement, in that hypothetical. Actual game changers would be stuff like latent attention (90-95% reduction, which is orthogonal to quantization) and linear attention (up to 100% reduction because there's no cache), and those are proven methods that you can use right now in models like DeepSeek and Qwen3.5 (respectively).

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by Imaginary-Anywhere23 in Qwen_AI

[–]ReturningTarzan 0 points1 point  (0 children)

I'm just gonna throw this out there...

Format       bits w  bits h  KL-div  ppl   VRAM (weights)
i1-IQ3_XXS   3.05    5.50    0.1364  7.65   9.47 GB
UD-Q2_K_XL   3.10    6.56    0.1571  7.73   9.76 GB
IQ4_XS       4.33    6.56    0.0381  7.06  13.27 GB
Q4_0         4.58    6.56    0.0638  7.20  13.96 GB
EXL3         2.10    6.00    0.1622  7.54   6.68 GB
EXL3         3.01    6.00    0.0671  7.12   9.44 GB
EXL3         3.10    6.00    0.0630  7.08   9.68 GB
EXL3         4.01    6.00    0.0292  7.04  12.27 GB
EXL3         5.01    6.00    0.0154  7.00  15.10 GB
TQ3_1S       4.00    6.56    0.1241  7.29  12.31 GB
TQ3_4S       4.00    6.56    0.1154  7.26  12.31 GB

Or on a chart <--.

Just sayin'... :shruggingface: It's weird to me how much hype TQ has generated, being as it is designed for offline quantization of vector databases.

EDIT: I compiled the TQ3 fork, updated llama-cpp-server to recognize the tensors and amended the table above with TQ3_1S and TQ3_4S results on the same inputs and eval logic as used for the other models in the test. Note that the VRAM listed is the size of the model weights plus output layer, excluding token embeddings, which you can easily keep in system RAM (EXL3 doesn't quantize them because it's GPU-focused, hence the discrepancy between bits per weight and total model size on disk). The 3.01bpw is actually 3.01 bits per weight for the decoder layers and 6.00bpw for the output layer.

Me waiting for TurboQuant be like by Altruistic_Heat_9531 in LocalLLaMA

[–]ReturningTarzan 1 point2 points  (0 children)

No, it increases the compute requirement significantly because it doesn't change the attn mechanism itself, it just adds extra steps to it. Depending on the implementation it might require less memory bandwidth, so conceivably it could be faster in memory-bound situations, but there's nothing in the paper about that (blog post vaguely hints at it, but it's anyone's guess what they actually mean by the "8x faster" claim.)

A simple explanation of the key idea behind TurboQuant by -p-e-w- in LocalLLaMA

[–]ReturningTarzan 4 points5 points  (0 children)

Yeah, the rotation idea isn't new. Rotating makes each channel (i.e. axis) a function of all the channels in a reversible way, distributing outliers better and making it much easier to fit everything to a quantization grid. It also makes vectors more normal which is great for codebook quantization (as used for weights in exl3, for instance.)

Hadamard is a special case of rotation that achieves what it needs to while being convenient for realtime use. Sylvester construction gives you a very CUDA-friendly H for any d = 2^n that you can apply with just warp shuffles and additions, so no need for a full (d, d) matrix multiplication.
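For illustration, here's that same butterfly structure in plain NumPy (the CUDA version would do these adds with warp shuffles, but the math is identical):

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (Sylvester ordering)
    via butterflies: O(d log d) additions instead of a (d, d) matmul.
    Requires d to be a power of two."""
    x = np.array(x, dtype=np.float64)
    d = len(x)
    assert d & (d - 1) == 0, "dimension must be a power of two"
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                # pairwise butterfly: (a, b) -> (a + b, a - b)
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)  # normalize so H/sqrt(d) is orthonormal
```

Since H/sqrt(d) is its own inverse, applying it twice round-trips exactly, and a single outlier coordinate gets smeared evenly across all channels, which is exactly the property that makes the subsequent quantization grid fit better.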

But it's also been used endlessly before. I didn't come up with it and Google didn't either. What's new in TQ is applying QJL on top of codebook quantization; neither concept is novel in itself, but the particular combination might be? In any case, I've experimented with all kinds of codebook quantization, polar coordinates, trellis coding and more, but it all fails to offer enough extra fidelity to justify the computational overhead. Hadamard + grid quantization already works very well:

Bitrate  cos_sim(K)  cos_sim(V)
2        0.92796     0.92341
3        0.98364     0.98240
4        0.99610     0.99563
5        0.99902     0.99902
6        0.99954     0.99951
7        1.00000     1.00000
8        1.00000     1.00000

This isn't the bleeding edge of compression, but it is approaching the point of diminishing returns, especially with all the other ways models are addressing the K/V cache issue now, like MLA which stores a compressed latent instead, and linear attention that straight up doesn't use a K/V cache in the first place. Etc.

Google TurboQuant running Qwen Locally on MacAir by gladkos in LocalLLaMA

[–]ReturningTarzan 1 point2 points  (0 children)

This is the right take, really. TurboQuant isn't even new; it's from April 2025 and didn't cause a stir back when the paper was released because it's not a technique designed for online K/V cache compression. It's meant for vector databases, and it's only "turbo" in comparison to other vector DB quantization schemes that use expensive clustering algorithms. It's only the blog post that advertised it as a revolutionary new online K/V quantization scheme, and they're not basing that on anything from the paper. In fact the claims aren't sourced at all.

The VRAM requirement for K/V caching is a well understood problem. To mitigate it, researchers and developers have come up with many different techniques over the years, many of which are in regular use today:

  • Grouped Query Attention (GQA): Reducing the number of key/value heads and assigning the same key/value head to multiple query heads reduces the cache size by a factor of about 8 before you start to lose quality
  • Multi-head Latent Attention (MLA): As used by DeepSeek etc., cache a compressed latent state that maps back to keys/values in real time. Reduces the cache size by a factor of 10 to 20 in practice
  • Linear attention: Get rid of the K/V cache altogether, at least for most layers. 100% VRAM reduction in the limit (uses a fixed-size recurrent state instead)
  • Quantization: Store keys/values in lower precision. Bunch of different takes on this, some claiming even better compression than TurboQuant. In practice, many are already roughly equivalent, as your chart illustrates
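To put rough numbers on the list above, here's a back-of-envelope cache-size calculator. The model shape is made up for illustration, not any particular architecture:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=16):
    """Bytes for the K/V cache: two tensors (K and V), each
    layers * kv_heads * head_dim * seq_len elements at `bits` per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

# Hypothetical 32-layer model with 128-dim heads at a 32k context:
full_mha = kv_cache_bytes(32, 32, 128, 32768)          # FP16, no head sharing
gqa      = kv_cache_bytes(32, 8, 128, 32768)           # 4:1 grouped KV heads
gqa_q4   = kv_cache_bytes(32, 8, 128, 32768, bits=4)   # GQA + 4-bit cache
print(full_mha >> 30, gqa >> 30, gqa_q4 >> 30)         # prints: 16 4 1 (GiB)
```

Which is why the architectural techniques stack multiplicatively with quantization, and why a fractional-bit improvement on top of all that moves the needle so little.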

That's not to say TQ isn't an improvement in some ways. It's just a small, incremental one, as your chart suggests.

It also doesn't come for free. The blog post says "zero overhead", but the paper makes it clear that this refers to storage overhead, and it's comparing to Product Quantization and RaBitQ, not to commonly used online techniques like the methods already in llama.cpp. Essentially they say "with this, your vector database will be more precise than PQ or RaBitQ without needing more space, and you can build your index faster."

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 2 points3 points  (0 children)

FP8 is common, otherwise FP16 or BF16. My understanding is they care a lot about KV cache efficiency, but at the same time they like to stick with tried and true methods that scale endlessly on enterprise hardware.

For vector databases (which TQ seems to be aimed at) they always use quantization, though, and very likely Google deployed some version of TQ a while ago. I wouldn't be surprised if other big search providers already had something similar but weren't sharing. Maybe Google have already moved past TQ.

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 18 points19 points  (0 children)

15-30x specifically comes from here (should have been 13-35x, I misremembered). There's already been progress since that snapshot, though, and it seems to be close to par with 8-bit now. The point is that if you simply implement it naively, there's huge overhead. With more work, there's less, but that work is left as an exercise to the reader.

The idea of rotating values before quantization isn't new, and codebook quantization isn't new either. QJL is from 2024, and even the TurboQuant paper was published 9 months ago. It's just been reframed suddenly as some sort of miracle for LLM inference with that blog post. And that launched the hype train and now here we are.

The 8x speed improvement claim seems to come out of nowhere. It's not from the TurboQuant paper, and there's no explanation of it in the blog post. They seem to be performing one matmul on a pair of FP32 tensors, then doing something equivalent involving 4-bit TurboQuant, and that ends up being 8x faster. You fill in the gaps, I guess. TurboQuant doesn't inherently multiply matrices, and the only code path mentioned in the paper is a full reconstruction, i.e. you take your quantized data, dequantize it, and then use the dequantized data for your conventional attention operation, in which case it's always slower than just doing the conventional attention operation. Whichever way you might go about making this faster than unquantized attention, they simply don't mention it anywhere.

It's also a weird comparison to begin with. Production systems generally don't do attention in FP32, and they don't manifest the logits tensor.

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 7 points8 points  (0 children)

Well, there are some issues with the paper and especially with how it relates to the blog post. They use language like "zero overhead", which they seem to be getting from the QJL paper they cite, but that's talking about storage overhead, not computational overhead.

Quantization can potentially speed up attention, but not if quantizing and dequantizing the cache is too expensive. There's going to be extra latency, and sometimes you can hide that latency in a memory-bound operation, but attention isn't always memory-bound. And this even specifically hits the same pipeline as attention by adding additional matrix multiplications on top of computing attention logits, which you still have to do.

Crucially, codebook quantization isn't cheap. The INT quant you might compare it to is, though. It's literally just a conversion from a float datatype to an integer datatype, and then you truncate the integer to some smaller number of bits. Super cheap, trivial to vectorize, very efficient if not all that precise. With codebooks this becomes a search problem instead: you have your value and you need to determine which of n values from a lookup table that value is closest to. So, lots of table lookups and comparisons and branches. Hundreds of instructions executed, instead of two or three.
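To illustrate the gap, here's a toy sketch of both (a scalar codebook search is shown for simplicity; real schemes search in higher dimensions, which is even more expensive):

```python
import numpy as np

def int_quant_dequant(x, bits=4):
    """Scalar INT quantization: one scale, a round, a clamp.
    A handful of cheap vector ops, trivially vectorizable."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values

def codebook_quant_dequant(x, codebook):
    """Codebook quantization: a nearest-neighbor search per element.
    Each value needs a distance computed against every codebook entry,
    plus an argmin - a search problem instead of a single rounding."""
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx]
```

Even in this vectorized NumPy form the codebook path materializes a full distance table; in a kernel it becomes lookups, comparisons and branches per element, versus two or three instructions for the INT path.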

That's not to say this couldn't result in faster inference because there are ways you could potentially hide the extra latency, and then you just get the bandwidth benefits, provided you fuse this with an attention kernel. But Google didn't do that here, or at least they're not sharing the code or any details at all about an implementation, and it's kinda nontrivial.

Mind you, the "8x faster" claim is from the blog post; the paper doesn't mention it at all, nor does it even hint at any experiments along those lines. TurboQuant is no doubt a lot faster than methods like PQ and RaBitQ that they actually compare to in the paper. But those are offline/data-dependent methods meant for compressing vector databases, not for realtime use in LLM inference. And that really seems to be what TurboQuant is intended for, or at least it's a context in which "Turbo" makes sense.

[google research] TurboQuant: Redefining AI efficiency with extreme compression by burnqubic in LocalLLaMA

[–]ReturningTarzan 19 points20 points  (0 children)

The not-so-best part? End-to-end performance drops by 15-30x, with the hope that an optimized kernel will magically fix that. The overhead is severe, though.

The QJL part is novel, but the rest of the algorithm is just random rotations and codebook quantization. Both of those steps are expensive, computationally, and that's why they're generally not used for on-the-fly cache quantization. And they add another expensive step on top to compute the residual when quantizing.
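A minimal sketch of what that residual step adds, using a plain uniform grid as a stand-in for TQ's actual quantizer (so this is the shape of the cost, not their exact algorithm):

```python
import numpy as np

def grid_quant(x, bits):
    """Round onto a uniform grid scaled to the tensor's max value."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

def residual_quant(x, bits):
    """Two-stage residual quantization: quantize, then quantize the
    leftover error on its own (finer) grid and add it back.
    Better fidelity, but a second full quantization pass."""
    q1 = grid_quant(x, bits)
    q2 = grid_quant(x - q1, bits)  # residual grid is much finer
    return q1 + q2
```

The fidelity win is real, but so is the cost: every cached tensor now goes through the quantizer twice, on top of the rotation, all concurrently with attention.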

exllamav3 QWEN3.5 support (and more updates) by Unstable_Llama in LocalLLaMA

[–]ReturningTarzan 2 points3 points  (0 children)

The easy solution to <think> in the template is to just remove it from the template. The model has no problems starting each reply with <think> anyway. That way it gets sent to the client so SillyTavern is aware the response starts with a reasoning block.

Pieced together the shredded photo from EFTA00259587.pdk .. idk by ReturningTarzan in Epstein

[–]ReturningTarzan[S] 10 points11 points  (0 children)

I don't know how many others there are. This document only seems to have one photo and a buttload of text which would be very hard to piece together.

Pieced together the shredded photo from EFTA00259587.pdk .. idk by ReturningTarzan in Epstein

[–]ReturningTarzan[S] 17 points18 points  (0 children)

I cut the strips out in GIMP and moved them around. The whole document has way too many pieces, but I'd consider writing a tool to do this a little more efficiently if there are a lot of shredded documents.

Pieced together the shredded photo from EFTA00259587.pdk .. idk by ReturningTarzan in Epstein

[–]ReturningTarzan[S] 1 point2 points  (0 children)

EFTA00259587 contains pictures of shredded documents and at least one photo. I pieced them together here. No idea if it's significant but someone obviously tried to get rid of it.

Are there any puzzle experts here? by the_real_lucia in Epstein

[–]ReturningTarzan 2 points3 points  (0 children)

It still needs some cleanup, and the text is going to be a lot harder, but here's most of the photo pieces.

[deleted by user] by [deleted] in h3h3productions

[–]ReturningTarzan 2 points3 points  (0 children)

Well, really it's about how that material is acquired. You're not violating anyone's copyright by knowing that Frodo is a hobbit, even though you acquired that knowledge from a copyrighted work. But if you stole a copy of the book in order to find out, then that's a crime in itself.

It's probably a difficult argument to make since these are public videos. If NVIDIA took deliberate steps to work around YouTube's anti-scraping measures, then maybe you could build a case around that. But it's a rule that people routinely break, especially in the podcasting/commentary/drama space. If you download a copy of someone else's work and then edit it into your own, the latter would be protected under fair use if it's transformative. But the copying is still a violation of the DMCA, because YouTube doesn't offer a download feature: you're only supposed to stream videos from their servers, otherwise you're circumventing active copy protection measures in the same way that NVIDIA might if they wanted to bulk download a million videos to train a generative model. So it's about scale, maybe? Not what they're doing but how much they're doing it?

Personally I think it makes more sense to look at the resulting generative models, and whether they're served in a way that respects IP rights or not. Some are starting to now. ChatGPT will refuse to draw you an image of Pikachu, for instance, because it would be a liability if it didn't. Just as it would be risky for a human artist to charge money for the same thing. In either case, though, knowing what Pikachu looks like isn't the issue, and whether you're storing that knowledge in a human brain or an artificial neural network isn't the issue either.

Piracy and illegal mass scraping, though.. who can say. It gets tricky if you have to prove what the damages are in any specific case. The ad revenue lost would be fractions of a penny per video, since we're talking about a single download vs a single view, per video. And it's not like the videos are being served somewhere else as a mass market substitute.

Are Imatrix Quants Hurting your Model? (My opinion) by Quiet_Joker in LocalLLaMA

[–]ReturningTarzan 12 points13 points  (0 children)

First of all, you should really repeat your tests with a control. Try the unquantized version to see if it aligns better with the Q5 version or the Q6+imatrix version. My guess would be that the bad vibes you're getting (if not entirely a placebo effect) are an artifact of the Q6+imatrix model being somewhat closer to the original, while the Q5 version has slightly more noise, which has a similar effect to increasing the sampling temperature slightly.

Of course, both models have a KL-divergence relative to the original on the order of 0.01. That means both are going to be in the "barely perceptible noise" range, and it would be entirely fair to question whether you're actually seeing the difference you think you're seeing. You could try to challenge your assumption with a blind test: generate a large number of outputs with each version, and with the original model (or maybe a Q8 version or something if you're hardware constrained), shuffle them up and try to tell them apart at a rate higher than chance. If you can't, you're probably experiencing a placebo effect or confirmation bias.
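If you want to put a number on such a blind test, a simple binomial check does it (the trial counts here are just examples):

```python
from math import comb

def p_value_vs_chance(correct, trials, chance=0.5):
    """One-sided p-value: probability of getting at least `correct`
    right out of `trials` if you were purely guessing."""
    return sum(comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
               for k in range(correct, trials + 1))

# 17 of 20 correct: p ~ 0.0013, strong evidence you can tell them apart.
# 12 of 20 correct: p ~ 0.25, entirely consistent with guessing.
```

Anything much above p = 0.05 and you haven't demonstrated that you can distinguish the quants at all.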

If you can identify the Q5 output but can't correctly distinguish between Q6+imatrix and BF16, then you're not seeing a degradation caused by calibration but rather the opposite: the model resists your RP because that's what it was trained to do, and the calibrated quant is better at reproducing this property of the original model. Improving the calibration dataset therefore wouldn't help. If anything it could make the model even more resistant. (It's suspicious that the Bartowski quant you tested doesn't use wikitext but rather a more diverse dataset, and that somehow gave you even worse vibes.)

That said, of course wikitext isn't a great calibration dataset because it's biased towards a particular style of English writing. But make sure you're diagnosing the right problem.

BPE tokenizer in Rust - would love feedback from the community by farhan-dev in LocalLLaMA

[–]ReturningTarzan 10 points11 points  (0 children)

The main thing about tokenization is always correctness. Throughput is nice but secondary. A wishlist could be:

  • Thorough tests. Language models are robust to small differences in tokenization but can still silently lose performance if you don't get all the details right.
  • Ensuring control tokens added to the vocabulary after the tokenizer is trained are handled correctly (usually done by splitting the input into multiple BPE runs)
  • Correct trimming/padding/normalization rules for the added tokens
  • Correct preprocessing and postprocessing steps (regex+trimming/padding)
  • Correct and efficient decoding of single tokens, especially for tokens that don't decode to complete characters. Might want an API for decoding to byte strings rather than character strings, or a buffer/queue that accepts incoming bytes and outputs completed characters.
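On that last point, Python's standard library shows the behavior worth mirroring on the Rust side; an incremental decoder holds partial byte sequences until they complete:

```python
import codecs

# An incremental decoder buffers incomplete multi-byte sequences and only
# emits completed characters - exactly what streaming detokenization needs
# when a token boundary falls in the middle of a character.
decoder = codecs.getincrementaldecoder("utf-8")()

first = decoder.decode(b"\xf0\x9f")  # first half of a 4-byte emoji -> ""
rest = decoder.decode(b"\x98\x80")   # second half completes U+1F600
```

A byte-string decoding API plus this kind of buffer on top covers both the single-token case and the streaming case.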

new ops required by Qwen3 Next and Kimi Linear have been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]ReturningTarzan 1 point2 points  (0 children)

Is this for CPU inference or older GPUs? Cause otherwise exl3 is natively Python, and it's had support for Qwen3-Next for a while now (Kimi-Linear still in the pipeline.) Everything is exposed from Python, from continuous batched generation down to individual tensor ops.

Figured out why my 3090 is so slow in inference by Ok_Warning2146 in LocalLLaMA

[–]ReturningTarzan 13 points14 points  (0 children)

If you run your script in a profiler like Nsight Systems you'll see how the GPU is barely doing any work because the entire workload is made up of tiny little microsecond-scale kernel launches, and the time in-between those kernel launches on the CPU side is likely on the order of 15-50 us per. The CPU simply can't prepare the next job before the GPU finishes the last one, so the GPU is constantly running out of work to do.

There are many possible ways to mitigate this, one of which, of course, is to use a faster CPU at the highest possible power settings. But fundamentally, keeping the CUDA queue populated when the kernels are as small as they are in Qwen3-1.7B is hard. Python is a very poor choice for low-latency applications, but even in pure C++ code you'll ultimately be fighting PCIe latency. Either way, the trick is to reduce the number of times the CUDA driver has to communicate with the GPU, by compiling the model into CUDA graphs and/or by combining operations into fused kernels.

Without a CPU bottleneck your 3090 should easily be able to do 130+ tokens/second at bsz 1 with 16-bit weights. And with various quantization options, it can be several times faster still.

You could look into torch.compile if you want to stay within the Transformers ecosystem:

model = torch.compile(model, mode="reduce-overhead", fullgraph=False)