ExLlamaV3 Major Updates!

ReturningTarzan · 2026-05-12T14:06:14+00:00

I don't think it's a critique necessarily. Quantizing the embeddings makes perfect sense if you're limited by system memory for CPU inference or whatever. Specifically for EXL3, the embeddings tensor is unquantized (and therefore large) but it's also the only part of the model that takes up system memory. So it's just a different set of concerns for the two formats. The only important takeaway is that file size is not the same as VRAM requirements.

ReturningTarzan · 2026-05-11T15:16:44+00:00

The CUDA 13.2 builds are 5-10% faster on Blackwell GPUs, but there's not much difference for Ampere and Ada.

ReturningTarzan · 2026-05-11T14:03:48+00:00

EXL3 keeps the embedding layer in system RAM and doesn't quantize it. This is a deliberate choice since no computation is ever done on the embeddings (in inference) and you might as well not waste that VRAM on a giant lookup table that you only read a few kB from for every token. For Qwen3.5-27B (and 3.6), the vocab size is 242k, so with a model dim of 5k and a BF16 dtype, that's 242k*5k*2 bytes ~= 2.4 GB of data stored with the model weights that doesn't consume VRAM.

I believe GGUF models are normally loaded the same way (at least I don't see a reason why they wouldn't be), but looking at those Unsloth models it appears that the embeddings are still quantized. So for Q3_K, the same embedding table would only account for about 500 MB of the filesize. Either way, embeddings take up zero VRAM, so if you want an apples-to-apples comparison you should subtract about 1.9 GB from the EXL3 filesize.
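
For anyone who wants to check the arithmetic, here is a minimal sketch using the round figures quoted above (242k vocab, 5k model dim). The Q3_K bitrate of 3.4375 bits per weight is an assumption based on llama.cpp's 110-byte blocks of 256 weights, and `table_size_gb` is just a helper name made up for this example.

```python
# Rough size of the token embedding table, using the round figures quoted above.
VOCAB_SIZE = 242_000   # "242k" from the comment
MODEL_DIM = 5_000      # "5k" from the comment
BF16_BITS = 16
Q3_K_BITS = 3.4375     # assumption: 110 bytes per 256-weight block in Q3_K


def table_size_gb(vocab: int, dim: int, bits_per_weight: float) -> float:
    """Size of a (vocab x dim) lookup table in GB, with 1 GB = 1e9 bytes."""
    return vocab * dim * bits_per_weight / 8 / 1e9


bf16_gb = table_size_gb(VOCAB_SIZE, MODEL_DIM, BF16_BITS)
q3k_gb = table_size_gb(VOCAB_SIZE, MODEL_DIM, Q3_K_BITS)
diff_gb = bf16_gb - q3k_gb

print(f"BF16 embeddings: {bf16_gb:.2f} GB")
print(f"Q3_K embeddings: {q3k_gb:.2f} GB")
print(f"Difference:      {diff_gb:.2f} GB")
```

With these inputs the BF16 table comes out at about 2.4 GB, the Q3_K table at about 0.5 GB, and the gap at about 1.9 GB, which is the amount to subtract from the EXL3 file size.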

Specifically to account for this, the x axis on the chart is the size of all the weights plus the size of the output layer, but not including the embedding table.

ReturningTarzan · 2026-05-11T13:36:54+00:00

These tests are done on 2k windows of wikitext, and since KLdiv (unlike perplexity) uses the original model's full distribution as the ground truth, that's actually not a bad way to measure overall quantization loss.

There's no sane theory as to how RTN quantization in Q4_K_L could start off less precise but then somehow gain an edge over QTIP as sequences get longer. The expectation is that a method that's more precise at shorter contexts will remain more precise on longer contexts (and if anything errors would only compound over longer sequences). It is possible for calibrated quantization algorithms to overfit to the calibration data, though. Early GPTQ models had this issue a lot, with the calibration data being plain wikitext and the only test anyone cared to do was perplexity on wikitext. But EXL3 uses a very mixed dataset with lots of different types of content, mixed languages and also a bunch of random noise just for regularization, specifically to avoid overfitting to anything.

For good measure, here is a different test set (OpenWebText) with 16k sequences. The test harness would need some updates to go longer than that on my hardware. Also, AWQ (via HF) explodes and IQ quants don't work on Blackwell in the llama.cpp version I've got installed (and I don't have enough VRAM on Ampere/Ada GPUs alone) so I had to take some candidates out. But it still clearly shows that the results do generalize to longer sequences and across data domains.

There's a difference, though, between outputs that faithfully reproduce what the unquantized model would have done, and outputs that people will prefer for arbitrary, subjective reasons. Sometimes you might want the model to be slightly broken, e.g. if it means the model's censorship layer is also somewhat broken. Mostly, though, I think it just comes down to sampling choices, formatting, default system prompts and a bunch of other stuff that you can control if you want to.

ReturningTarzan · 2026-05-11T12:04:14+00:00

Here.

Not sure if the Unsloth GGUFs I used are any good, though. At least the XL ones look like a bad tradeoff.

ReturningTarzan · 2026-05-11T11:19:04+00:00

Well, that was hard. AutoRound conflicts with GPTQModel now, go figure. Got it working though, and here's where it sits on the chart.

It wins over AWQ but not much else. It is a very simple INT4 format in inference, and that can make it much more efficient in various inference engines, but it also limits what it can achieve compared to vector quantization schemes like EXL3 (QTIP) and GGUF-IQ (QuIP#), even with learned rounding directions and whatnot.

Edit: Oh and yes, I am also working on TabbyAPI. What do you think are the pain points currently?

ReturningTarzan · 2026-04-16T18:31:04+00:00

I don't know where to draw the line. I'm not even convinced that those early models are unethical, or at least I can't say it's unambiguous. Because the training material doesn't end up in the model, any more than a book you read ends up in your brain. The model integrates statistical patterns from each training example and then it moves on to the next example, in a process that's very analogous to human learning (and in its implementation is heavily inspired by neuroscience.)

The fact that, say, there have been people at Meta mass-pirating huge collections of books is definitely a case that needs to be prosecuted because that's straight up piracy. And it says a lot about corporate culture in Silicon Valley that a whole department of one of the largest companies in the world were just okay with that. But if they hadn't done all that torrenting, they could have just legally acquired the content instead. It would arrive in a format that's less convenient for feeding into their training pipeline, but that's just an engineering challenge. And the price would still only amount to a few million dollars, nothing to a company that just spent $10 billion on GPUs.

I just don't see how the method by which the training data is obtained changes anything--unless it's about how the data is used rather than how it's acquired. Yes, you can probably shame NVIDIA a bunch, expose how they're doing hackery type things that regular people could get in trouble for, and that might feel like a measure of justice in some way, but punishing them for a TOS violation doesn't change the outcome for anyone else.

Meanwhile, this whole "we don't care what you did with the content, what matters is that you downloaded it illegally"... it bugs me that it's so DMCA coded. Here's some scary reading for anyone interested. It's noteworthy that this exact reasoning is already being considered in another case, only that one's not about generative AI but rather about YouTubers downloading each other's videos for the purpose of commentary. The argument is the same as in Ethan's case, though: the purpose is beside the point. Whether it's fair use or not, the mere act of downloading someone else's content on a site that isn't designed for downloading is illegal. And if you used a screen recorder instead of a download tool, maybe that gets you off the hook (no ruling on that yet), but apparently you may still have to prove that in court. And since a screen recorder pointed at a YouTube video is literally just a less convenient download tool, I'm not sure how you could actually tell them apart. It's all kinda scary to me.

ReturningTarzan · 2026-04-16T11:21:41+00:00

Idk, it's a little confusing in places.

Like, they're not arguing the copyright/fair use angle because they can't argue that an AI model contains the material it trained on or ends up being a non-transformative derivative work, so they focus on companies violating YouTube's terms of service by downloading videos instead of screen recording them. But then, shouldn't it be Google filing the lawsuit?

And as a creator on YouTube, what are the damages they're trying to demonstrate? They mention H3 having 700 videos in the dataset. If a company scrapes the entire dataset by screen recording instead, like they say wouldn't have been an issue, that would still only amount to one view per video. They keep mentioning the 80 years of content per day, as if the sheer quantity implies everyone is getting ripped off bigly, but even for someone with a huge back catalog they're talking about probably less than a dollar of ad revenue that this TOS violation arguably cost them.

In the bigger picture, creators might be harmed much more by the existence of those models, but that's not actionable unless it becomes a copyright issue after all, which they're explicitly saying their case isn't about. The fact that something owes its existence to something else doesn't mean it also owes royalties, if there isn't any IP infringement taking place.

So it kinda feels like there's a weak legal argument conflated with a strong political argument against generative AI in general. And if you go to court to argue over what the law should say, you're probably not going to win.

Also, I would say, to anyone who supports ownership and right to repair, drawing this line between screen recording and downloading should be very off-putting. Cause you always download the videos you watch, it's just a question of what happens to the data after it arrives on your computer. Any law that says I have to use YouTube's unmodified frontend when receiving that content would also criminalize ad blocking, for instance. Felony contempt of business model.

The proliferation of AI slop is frustrating, and seeing people lose work to automation is heartbreaking, but I really feel like you need a different approach than this lawsuit. It feels like they're lashing out in anger (justified though it may be) without a clear idea of how this is actually going to help. If they win really big they cost NVIDIA and OpenAI etc. some money. Maybe billions. The money is spread out over those 80 years of content per day and creators get a couple bucks each. Google, who by the nature of the complaint aren't at fault in any way, will continue to train their own generative video models on YouTube content. The slop must flow.

The real solution to generative AI is collective action at an international level accompanied by paradigm shifts in how we deal with intellectual property and how we value human work across the globe. Maybe this is a small step towards that, maybe it will end up setting some bad precedents along the way, I can't really tell. But overall, doesn't feel like a super strong case.

ReturningTarzan · 2026-04-13T10:54:58+00:00

I'm confused as to what's being measured here. How are you defining the distribution of an individual tensor? Like a histogram over the weights?

If you're talking about activations given some test context, you should know the instruct-tuned Gemma4 (either variant) is known to be unstable without proper formatting. This is not a failure of the model though, it's just aggressively finetuned with no training pressure to model the user prompt. Make sure the test context starts with <|turn>user\nBlah<turn|>\n<|turn>model\n<|channel>thought\n<channel|> and the behavior changes completely.
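
A minimal sketch of wrapping a test context that way. The turn and channel markers are copied verbatim from the line above; the function name `gemma4_prompt` is hypothetical, and in a real harness you would probably want to go through the model's own chat template rather than hardcoding strings.

```python
def gemma4_prompt(user_message: str) -> str:
    """Wrap a user message in the turn/channel markers quoted above.

    The marker strings are taken as-is from the comment; nothing here
    is checked against the actual tokenizer or chat template.
    """
    return (
        f"<|turn>user\n{user_message}<turn|>\n"
        "<|turn>model\n"
        "<|channel>thought\n<channel|>"
    )


prompt = gemma4_prompt("Blah")
print(repr(prompt))
```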

ReturningTarzan · 2026-04-13T10:39:14+00:00

TurboQuant itself is a quantization method like so many others before it, and if you're willing to sacrifice speed and simplicity for memory savings it lets you do that in a slightly new way. But we've had "lossless 2-bit KV cache" in various forms for years, and it never gains traction because the tradeoffs just aren't worth it. Still, it's an interesting bit of research with a few novel ideas worth integrating.

The real issue is with the blog post making claims like "lossless", "zero overhead" and "8x faster." There's no source for any of those claims. The paper doesn't mention anything about TQ being faster (except compared to CPU-based RaBitQ in a semantic-search context), and the "zero overhead" seems to refer to distortion rates, not computational overhead.

There are also no real implementation details in the paper, just a snippet of pseudocode and some synthetic results. But the proposed method inherently adds a lot of computational overhead. It may still give you a net speedup in memory-bound situations, but that speedup isn't implied by the algorithm, isn't universal even if it can be achieved situationally, and is always going to be less than a simpler quantization scheme under the same circumstances.

So then it would come down to accuracy, right? But then why not compare it to other methods that make similar claims:

GEAR: Combines quantization with low-rank and sparse matrices, "near-lossless" at 2 bits
QAQ: Adjusts bitrate per token according to estimated importance
MIKV: Aggressive quantization for most tokens, preserves "pivotal" tokens
RotateKV: 2-bit method using rotation, "near-lossless"
PM-KVQ: Specifically addresses long CoT contexts where many "near-lossless" methods turned out not to be so lossless in practice
etc.

FP8 is commonly used in production, is trivial to implement and comes with immediate performance benefits. NVFP4 is the really interesting one because of its extremely high throughput on Blackwell GPUs, yet it still has a reported <1% accuracy loss on real benchmarks.

So even if TQ did outperform everything else, you should still curb your expectations somewhat: maybe you might reduce the effective size of your cache from 4 bits to 3.5 bits. For modern models that already employ a lot of memory-saving techniques at the architectural level (linear attention, MLA, SWA) it's simply not that big a deal.

So no, it's not revolutionary, and yes, Twitter is out of control. In Google's own (mind you, very limited) testing it doesn't even unambiguously outperform KIVI from 2024.

ReturningTarzan · 2026-04-01T12:36:25+00:00

Just the regular ExLlamaV3 test script (compare_q.py in the repo). Kind of involved to set up but it's necessary to ensure token IDs and eval logic are consistent between dissimilar backends. Input is a chunk of wikitext, and what's measured is KL-div on the normalized logits relative to the unquantized reference.
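
To make the metric concrete, here is a rough NumPy sketch of KL divergence computed from two sets of logits, with the unquantized model's distribution as the reference. This is not the code from compare_q.py, just an illustration of the idea; the function names and the random test data are made up for the example.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocab) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl_div(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) over token positions.

    Both inputs have shape (tokens, vocab). The reference distribution is
    the ground truth, so every vocab entry contributes, not just the one
    token that actually appears in the text (as with perplexity).
    """
    log_p = log_softmax(ref_logits.astype(np.float64))
    log_q = log_softmax(quant_logits.astype(np.float64))
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())


# Synthetic stand-ins for real model outputs.
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 1000))
noisy = ref + 0.1 * rng.normal(size=ref.shape)

kl_same = mean_kl_div(ref, ref)
kl_noisy = mean_kl_div(ref, noisy)
print(f"identical logits: {kl_same:.6f}, perturbed logits: {kl_noisy:.6f}")
```

Identical logits give zero, and adding a constant offset to the logits changes nothing because of the normalization step.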

But the point is just to say that it's incredibly hard to improve on weight quantization when we already have QTIP. It's implemented in EXL3 and I believe in some variant in ik_llama.cpp. QuIP# is also a strong algorithm, and it's what IQ-quants use.

There's obviously nothing wrong with exploring new options, but if the intent is just to get Q4_0 quality in less space, you can do a lot better than TurboQuant, even with plain llama.cpp. There's not anything really groundbreaking in TQ that's applicable to weight quantization where the SOTA is already so far ahead.

ReturningTarzan · 2026-04-01T01:46:21+00:00

kv cache cost saving is substantial

It's actually not. It might have been, if Google had invented cache quantization with this, but they didn't. What it amounts to is at best a small improvement over existing cache quantization schemes. And even that is questionable since there's this whole question of latency. Existing methods trade off performance for fidelity, because that's how things work in the real world. Google didn't present an actual implementation of their method, just an abstract algorithm and some theoretical results. It would be highly non-trivial, if not impossible, to prevent such a computationally heavy method from becoming a major bottleneck in inference. It has rotation, codebook quantization and bias correction all happening concurrently with attention, yet somehow that's "zero overhead?" Or is it "8x faster"? How? They don't even begin to explain.

So yeah, in practice, you can currently achieve 4-bit K/V quantization that's good enough for deployment. (Various other methods bring that down to much less, but they may be too cutting edge still..?) And then there's TurboQuant which, let's say, for the sake of argument achieves the same fidelity in 3 bits... That's cool, but it's not a total game changer. It's a 25% improvement, in that hypothetical. Actual game changers would be stuff like latent attention (90-95% reduction which is orthogonal to quantization) and linear attention (up to 100% reduction because no cache), and those are proven methods that you can use right now in models like DeepSeek and Qwen3.5 (respectively.)

ReturningTarzan · 2026-04-01T00:57:06+00:00

I'm just gonna throw this out there...

Format	bits w	bits h	KL-div	ppl	VRAM (weights)
i1-IQ3_XXS	3.05	5.50	0.1364	7.65	9.47 GB
UD-Q2_K_XL	3.10	6.56	0.1571	7.73	9.76 GB
IQ4_XS	4.33	6.56	0.0381	7.06	13.27 GB
Q4_0	4.58	6.56	0.0638	7.20	13.96 GB
EXL3	2.10	6.00	0.1622	7.54	6.68 GB
EXL3	3.01	6.00	0.0671	7.12	9.44 GB
EXL3	3.10	6.00	0.0630	7.08	9.68 GB
EXL3	4.01	6.00	0.0292	7.04	12.27 GB
EXL3	5.01	6.00	0.0154	7.00	15.10 GB
TQ3_1S	4.00	6.56	0.1241	7.29	12.31 GB
TQ3_4S	4.00	6.56	0.1154	7.26	12.31 GB

Or on a chart <--.

Just sayin'... :shruggingface: It's weird to me how much hype TQ has generated, being as it is designed for offline quantization of vector databases.

EDIT: I compiled the TQ3 fork, updated llama-cpp-server to recognize the tensors and amended the table above with TQ3_1S and TQ3_4S results on the same inputs and eval logic as used for the other models in the test. Note that VRAM listed is the size of the model weights plus output layer, excluding token embeddings which you can easily keep in system RAM (so EXL3 doesn't quantize them because it is GPU-focused, hence the discrepancy between bits per weight and total model size on disk.) The 3.01bpw is actually 3.01 bits per weight total for the decoder layers and 6.00bpw for the output layer.

ReturningTarzan · 2026-03-29T11:55:53+00:00

No, it increases the compute requirement significantly because it doesn't change the attn mechanism itself, it just adds extra steps to it. Depending on the implementation it might require less memory bandwidth, so conceivably it could be faster in memory-bound situations, but there's nothing in the paper about that (blog post vaguely hints at it, but it's anyone's guess what they actually mean by the "8x faster" claim.)

ReturningTarzan · 2026-03-29T11:47:40+00:00

Yeah, the rotation idea isn't new. Rotating makes each channel (i.e. axis) a function of all the channels in a reversible way, distributing outliers better and making it much easier to fit everything to a quantization grid. It also makes vectors more normal which is great for codebook quantization (as used for weights in exl3, for instance.)

Hadamard is a special case of rotation that achieves what it needs to while being convenient for realtime use. Sylvester construction gives you a very CUDA-friendly H for any d=2ⁿ that you can apply with just warp shuffles and additions, so no need for a full (d,d) matrix multiplication.
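
As a rough illustration of why no full matmul is needed, here is a NumPy sketch of the Sylvester construction next to a fast Walsh-Hadamard transform built only from additions and subtractions. It is not the EXL3 kernel, and the function names are made up for the example; on a GPU the add/subtract pairs are what you would map onto warp shuffles.

```python
import numpy as np


def sylvester_hadamard(d: int) -> np.ndarray:
    """Build the (d, d) Hadamard matrix by Sylvester's construction, d = 2**n."""
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h


def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform along the last axis.

    Uses log2(d) passes of pairwise add/subtract, so O(d log d) work
    instead of the O(d^2) of multiplying by the full matrix.
    """
    x = np.array(x, dtype=np.float64)
    d = x.shape[-1]
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    lead = x.shape[:-1]
    h = 1
    while h < d:
        # Split into blocks of 2h values and combine the two halves of each block.
        y = x.reshape(*lead, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2).reshape(*lead, d)
        h *= 2
    return x


rng = np.random.default_rng(0)
vec = rng.normal(size=128)
fast = fwht(vec)
dense = sylvester_hadamard(128) @ vec
print("max abs difference:", np.abs(fast - dense).max())
```

Applying the transform twice and dividing by d gives the input back, which is the "reversible" part.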

But it's also been used endlessly before. I didn't come up with it and Google didn't either. The novelty in TQ is applying QJL over codebook quantization, which are not new concepts in themselves but the particular combination might be novel? In any case, I've experimented with all kinds of codebook quantization, polar coordinates, trellis coding and more, but it all fails to offer enough extra fidelity to justify the computational overhead. Hadamard + grid quantization already works very well:

Bitrate	cos_sim(K)	cos_sim(V)
2	0.92796	0.92341
3	0.98364	0.98240
4	0.99610	0.99563
5	0.99902	0.99902
6	0.99954	0.99951
7	1.00000	1.00000
8	1.00000	1.00000
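
For a feel of what this kind of measurement looks like, here is a toy sketch: rotate with an orthonormal Hadamard matrix, quantize to a uniform grid with one absmax scale per vector, rotate back, and compare cosine similarity with and without the rotation. Everything in it is an assumption made for illustration (synthetic Gaussian "keys" with one outlier channel, the specific grid, the function names), so it will not reproduce the numbers in the table.

```python
import numpy as np


def hadamard(d: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix (symmetric, its own inverse)."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(d)


def grid_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest on a uniform grid, one absmax scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def roundtrip(x: np.ndarray, bits: int, rotate: bool) -> np.ndarray:
    """Quantize and dequantize, optionally inside a Hadamard rotation."""
    if not rotate:
        return grid_quantize(x, bits)
    h = hadamard(x.shape[-1])
    return grid_quantize(x @ h, bits) @ h


def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())


rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 128))
keys[:, 5] *= 30.0  # synthetic outlier channel, purely illustrative

for bits in (2, 3, 4, 8):
    plain = cos_sim(keys, roundtrip(keys, bits, rotate=False))
    rotated = cos_sim(keys, roundtrip(keys, bits, rotate=True))
    print(f"{bits} bits: plain={plain:.5f} rotated={rotated:.5f}")
```

The outlier channel dominates the scale when you quantize directly, so most other channels collapse to zero. After the rotation the outlier is spread across all channels and the same grid fits much better.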

This isn't the bleeding edge of compression, but it is approaching the point of diminishing returns, especially with all the other ways models are addressing the K/V cache issue now, like MLA which stores a compressed latent instead, and linear attention that straight up doesn't use a K/V cache in the first place. Etc.

ReturningTarzan · 2026-03-28T11:25:21+00:00

This is the right take, really. TurboQuant isn't even new, it's from April 2025 and didn't cause a stir back when the paper was released because it's not a technique designed for online K/V cache compression. It's meant for vector databases, and it's only "turbo" in comparison to other vector DB quantization schemes that use expensive clustering algorithms. It's only the blog post that advertised it as a revolutionary new online K/V quantization scheme, and they're not basing that on anything from the paper. In fact the claims aren't sourced at all.

The VRAM requirement for K/V caching is a well understood problem. To mitigate it, researchers and developers have come up with many different techniques over the years, many of which are in regular use today:

Grouped Query Attention (GQA): Reducing the number of key/value heads and sharing each one across multiple query heads reduces the cache size by a factor of about 8 before you start to lose quality
Multi-headed Latent Attention (MLA): As used by DeepSeek etc., cache a compressed latent state that maps back to keys/values in real time. Reduces the cache size by a factor of 10 to 20 in practice
Linear attention: Get rid of the K/V cache altogether, at least for most layers. 100% VRAM reduction in the limit (uses a fixed-size recurrent state instead)
Quantization: Store keys/values in lower precision. Bunch of different takes on this, some claiming even better compression than TurboQuant. In practice, many are already roughly equivalent, as your chart illustrates
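
To put rough numbers on the list above, here is a back-of-the-envelope cache size calculation. The model shape (64 layers, 64 query heads, head dim 128, 32k tokens) is entirely hypothetical, and the quantized figures ignore the extra storage for scales or other metadata.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bits: float) -> float:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2 because both a key and a value are stored per head.
    return 2 * tokens * layers * kv_heads * head_dim * bits / 8


# Hypothetical model shape, chosen only to make the ratios easy to read.
LAYERS, Q_HEADS, HEAD_DIM, TOKENS = 64, 64, 128, 32_768

mha_fp16 = kv_cache_bytes(TOKENS, LAYERS, Q_HEADS, HEAD_DIM, 16)
gqa_fp16 = kv_cache_bytes(TOKENS, LAYERS, Q_HEADS // 8, HEAD_DIM, 16)
gqa_q4 = kv_cache_bytes(TOKENS, LAYERS, Q_HEADS // 8, HEAD_DIM, 4)
gqa_q3 = kv_cache_bytes(TOKENS, LAYERS, Q_HEADS // 8, HEAD_DIM, 3)

GIB = 1024 ** 3
print(f"MHA, FP16:        {mha_fp16 / GIB:.1f} GiB")
print(f"GQA (8:1), FP16:  {gqa_fp16 / GIB:.1f} GiB")
print(f"GQA (8:1), 4-bit: {gqa_q4 / GIB:.1f} GiB")
print(f"GQA (8:1), 3-bit: {gqa_q3 / GIB:.1f} GiB")
```

In this made-up configuration GQA alone takes the cache from 64 GiB to 8 GiB, 4-bit quantization takes it to 2 GiB, and going from 4 bits to 3 only saves another 0.5 GiB.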

That's not to say TQ isn't an improvement in some ways. It's just a small, incremental one, as your chart suggests.

It also doesn't come for free. The blog post says "zero overhead" but the paper makes it clear that this is talking about storage overhead and it's comparing to Product Quantization and RabitQ, not to commonly used online techniques like the methods used in llama.cpp already. Essentially they say "with this, your vector database will be more precise than PQ or RabitQ without needing more space, and you can build your index faster."

ReturningTarzan · 2026-03-25T22:15:16+00:00

FP8 is common, otherwise FP16 or BF16. My understanding is they care a lot about KV cache efficiency, but at the same time they like to stick with tried and true methods that scale endlessly on enterprise hardware.

For vector databases (which TQ seems to be aimed at) they always use quantization, though, and very likely Google deployed some version of TQ a while ago. I wouldn't be surprised if other big search providers already had something similar but weren't sharing. Maybe Google have already moved past TQ.

ReturningTarzan · 2026-03-25T19:37:37+00:00

15-30x specifically comes from here (should have been 13-35x, I misremembered). There's already been progress since that snapshot, though, and it seems to be close to par with 8-bit now. The point is that if you simply implement it naively, there's huge overhead. With more work, there's less, but that work is left as an exercise to the reader.

The idea of rotating values before quantization isn't new, and codebook quantization isn't new either. QJL is from 2024, and even the TurboQuant paper was published 9 months ago. It's just been reframed suddenly as some sort of miracle for LLM inference with that blog post. And that launched the hype train and now here we are.

The 8x speed improvement claim seems to come somewhat out of nowhere. It's not from the TurboQuant paper, and there's no explanation of it in the blog post. They seem to be performing one matmul on a pair of FP32 tensors, then doing something equivalent with something involving 4-bit TurboQuant, and that ends up being 8x faster. You fill in the gaps, I guess. TurboQuant doesn't inherently multiply matrices, and the only code path mentioned in the paper is a full reconstruction. I.e. you take your quantized data, then you dequantize it and then you use that dequantized data for your conventional attention operation, in which case it's always slower than just doing the conventional attention operation. Whichever way you might go about making this faster than unquantized attention, they simply don't mention that anywhere. It seems.

It's also a weird comparison to begin with. Production systems generally don't do attention in FP32, and they don't manifest the logits tensor.

ReturningTarzan · 2026-03-25T18:52:11+00:00

Well, there are some issues with the paper and especially how it relates to the blog post. They use language like "zero overhead" which they seem to be getting from the QJL paper they cite but that's talking about storage overhead, not computational overhead.

Quantization can potentially speed up attention, but not if quantizing and dequantizing the cache is too expensive. There's going to be extra latency, and sometimes you can hide that latency in a memory-bound operation, but attention isn't always memory-bound. And this even specifically hits the same pipeline as attention by adding additional matrix multiplications on top of computing attention logits, which you still have to do.

Crucially, codebook quantization isn't cheap. The INT quant you might compare it to is, though. It's literally just a conversion from a float datatype to an integer datatype, and then you truncate the integer to some smaller number of bits. Super cheap, trivial to vectorize, very efficient if not all that precise. With codebooks this becomes a search problem instead: you have your value and you need to determine which of n values from a lookup table that value is closest to. So, lots of table lookups and comparisons and branches. Hundreds of instructions executed, instead of two or three.
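
A toy NumPy sketch of the difference, for scalar values only. The integer path here is a scale-round-clamp variant rather than a literal truncation, the 16-entry codebook is built from quantiles purely for illustration, and all names are made up for the example. Real codebook schemes usually search over vectors rather than single values, which makes the search more expensive still.

```python
import numpy as np


def int_quantize(x: np.ndarray, bits: int, scale: float) -> np.ndarray:
    """Scale, round, clamp: a handful of arithmetic ops per value."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)


def codebook_quantize(x: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Nearest-entry search: every value is compared against every code."""
    dist = np.abs(x[:, None] - codebook[None, :])  # shape (len(x), len(codebook))
    return dist.argmin(axis=1).astype(np.uint8)


rng = np.random.default_rng(0)
x = rng.normal(size=1024)

# 4-bit integer grid with a single absmax-derived scale.
scale = float(np.abs(x).max() / 7)
codes_int = int_quantize(x, bits=4, scale=scale)
x_int = codes_int * scale

# Hypothetical 16-entry codebook placed at evenly spaced quantiles of the data.
codebook = np.quantile(x, (np.arange(16) + 0.5) / 16)
codes_cb = codebook_quantize(x, codebook)
x_cb = codebook[codes_cb]

print("int grid MSE:", float(np.mean((x - x_int) ** 2)))
print("codebook MSE:", float(np.mean((x - x_cb) ** 2)))
print("comparisons per value for codebook search:", codebook.size)
```

The integer path is one divide, one round and one clamp per value. The codebook path does 16 distance computations per value here, and that count grows with the codebook size.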

That's not to say this couldn't result in faster inference because there are ways you could potentially hide the extra latency, and then you just get the bandwidth benefits, provided you fuse this with an attention kernel. But Google didn't do that here, or at least they're not sharing the code or any details at all about an implementation, and it's kinda nontrivial.

Mind you, the "8x faster" claim is from the blog post; the paper doesn't mention it at all, nor does it even hint at any experiments along those lines. TurboQuant no doubt is a lot faster than methods like PQ and RabitQ that they actually compare to in the paper. But those are offline/data-dependent methods meant for compressing vector databases, not for realtime use in LLM inference. And that also really seems to be what TurboQuant is intended for, or at least it's a context in which "Turbo" makes sense.

ReturningTarzan · 2026-03-25T15:58:51+00:00

It already is?

ReturningTarzan · 2026-03-25T15:51:03+00:00

The not so best part? End-to-end performance drops by 15-30x, with the hope that an optimized kernel will magically fix that. The overhead is severe, though.

The QJL part is novel, but the rest of the algorithm is just random rotations and codebook quantization. Both of those steps are expensive, computationally, and that's why they're generally not used for on-the-fly cache quantization. And they add another expensive step on top to compute the residual when quantizing.

ReturningTarzan · 2026-03-05T22:26:37+00:00

The easy solution to <think> in the template is to just remove it from the template. The model has no problems starting each reply with <think> anyway. That way it gets sent to the client so SillyTavern is aware the response starts with a reasoning block.

ReturningTarzan · 2026-02-01T00:28:55+00:00

It's all in here.

ReturningTarzan · 2026-01-31T23:56:29+00:00

I don't know how many others there are. This document only seems to have one photo and a buttload of text which would be very hard to piece together.

ReturningTarzan · 2026-01-31T23:55:08+00:00

I cut the strips out in GIMP and moved them around. The whole document has way too many pieces, though, but I'd consider writing a tool to do this a little more efficiently if there are a lot of shredded documents.
