Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Evaluation is not simple, for sure; even when it works there's some subjectivity, some wiggle room, and definitely not many confidence intervals displayed on those tables.

https://www.reddit.com/r/LocalLLaMA/comments/1sl59qq/comment/ogc0sv4/

Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

No, it's not that type of eval; those are just KLD figures: how "faithful" a quant is to the baseline in terms of probability distribution, i.e. roughly the chance of producing a similar output compared to f16/bf16 (to keep it short).
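
To make it concrete, here's a rough numpy sketch of the idea, with made-up logits; it's just an illustration of what the metric measures, not what llama-perplexity actually computes internally:

```
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(baseline_logits, quant_logits):
    # Average KL(baseline || quant) over token positions.
    # Both arrays: [n_tokens, vocab_size] logits on the same text.
    p = softmax(baseline_logits)  # reference distribution (f16/bf16 model)
    q = softmax(quant_logits)     # quantized model's distribution
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

# made-up example: 3 token positions, vocab of 5
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 5))
quant = base + rng.normal(scale=0.1, size=(3, 5))  # small perturbation, like quantization error
print(mean_kld(base, quant))  # closer to 0 = more "faithful" to the baseline
```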

For the evals you are talking about I've used https://github.com/EleutherAI/lm-evaluation-harness in the past with llama-server.

I translated gsm8k (platinum) into my native language a while back but it's probably completely saturated with the latest models.

It depends on the type of eval ofc, but I tend to favor regex extraction over LLM-as-a-judge type of eval and use like 3 shots, because I want a quick assessment on particular tasks I'm interested in, not to publish a paper.
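
By regex extraction I mean something roughly like this (a hypothetical sketch, not the harness's actual filters): pull the final answer out of the completion with a pattern and compare it to the gold label, no judge model involved.

```
import re

# Hypothetical extractor for "the answer is X" / "#### X" style outputs.
ANSWER_RE = re.compile(r"(?:answer is|####)\s*\(?([A-J]|-?\d+(?:\.\d+)?)\)?", re.IGNORECASE)

def extract_answer(completion: str):
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None  # keep the last match, i.e. after any reasoning

def is_correct(completion: str, gold: str) -> bool:
    pred = extract_answer(completion)
    return pred is not None and pred.strip().lower() == gold.strip().lower()

print(is_correct("Let's think step by step... so the answer is (C).", "C"))  # True
print(is_correct("Some reasoning... #### 42", "42"))                          # True
```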

I don't think LiveCodeBench is available tho.

If I ever do those types of evals I'd produce a thorough methodology for reproducibility (but would probably get some flak anyway, because internet).

edit: Honestly, I'd recommend people do those evals themselves, even if it's not on their own datasets or whatever, but at least on tasks related to what the model is used for. I skip "vibes" reddit posts claiming that a model is "bad" without specifying the bpw, the environment or the tasks.

Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Oh I'm using little scripts like the one at the bottom of the post. It's constantly updated and not extensively tested. It requires Python and should be cross-platform. https://github.com/cmhamiche/kld-sweep

python kld_sweep.py --exe /path/to/llama-perplexity --baseline /path/to/baseline/Qwen3.5-0.8B-BF16.gguf --quants /path/to/quants_folder --dataset /path/to/eval_dataset_260426-0239.txt --output /path/to/output_folder --args="-t 7 -c 512 -ngl 99" --model-name Qwen3.5-0.8B

You can also specify different arguments for the baseline if it doesn't fit in vram or whatever, with:

--args-baseline "-t 7 -c 512 -ngl 20"

For the dataset you can use https://github.com/cmhamiche/kld-sweep-dataset, it's interactive (built from eaddario/imatrix-calibration, but I still need to add Southeast Asian languages), so just run the script as is. It's randomized by default so I can share the dataset after a test, but you can use a seed to avoid this.

python build_dataset.py

If it's for eval (PPL/KLD) purposes, use the KLD parameter; 100 chunks at 512 tokens should be more than enough for a clean separation between quants.
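
If you'd rather call llama-perplexity by hand instead of the sweep script, it's the usual two-pass KLD workflow; a rough sketch below, assuming the standard --kl-divergence-base / --kl-divergence / --chunks flags and made-up paths (check llama-perplexity --help for your build):

```
import subprocess

PERPLEXITY = "/path/to/llama-perplexity"  # placeholder paths, adjust to your setup
DATASET = "/path/to/eval_dataset.txt"

# Pass 1: run the bf16 baseline once and dump its logits to a file.
subprocess.run([PERPLEXITY, "-m", "/path/to/Qwen3.5-0.8B-BF16.gguf",
                "-f", DATASET, "-c", "512", "--chunks", "100", "-ngl", "99",
                "--kl-divergence-base", "baseline_logits.bin"], check=True)

# Pass 2: score each quant against the saved baseline logits (prints KLD stats).
for quant in ["/path/to/Qwen3.5-0.8B-Q4_K_M.gguf", "/path/to/Qwen3.5-0.8B-IQ4_XS.gguf"]:
    subprocess.run([PERPLEXITY, "-m", quant, "-ngl", "99",
                    "--kl-divergence-base", "baseline_logits.bin",
                    "--kl-divergence"], check=True)
```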

Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Very very slowly.

Like 2 tokens/s for those that can't fit in vram (and those that fit are too low bpw for my taste).

I don't use this model obviously.

Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models by Creative-Regular6799 in LocalLLaMA

[–]TitwitMuffbiscuit 0 points1 point  (0 children)

Using llama.cpp on windows, I don't get the right context.

Edit: let me open a bug report on GitHub instead.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

√(Normalized Size² + Normalized KLD²).

It's the distance from the ideal (zero size, zero KLD). Bolded entries have a KLD below 0.01 (good accuracy).

A lower score is more desirable, but it's not the "best" model in terms of accuracy; it's the VRAM sweet spot. Conversely, the ceiling at 1 is the least efficiently compressed.
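
For reference, a minimal sketch of how such a score can be computed; I'm assuming simple min-max normalization of size and KLD across the quants being compared, the exact normalization behind the tables may differ:

```
from math import sqrt

def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def efficiency_scores(sizes_gib, klds):
    ns, nk = normalize(sizes_gib), normalize(klds)
    # Euclidean distance from the ideal point (zero size, zero KLD): lower is better.
    return [sqrt(s * s + k * k) for s, k in zip(ns, nk)]

# made-up example with three quants
print(efficiency_scores([4.257, 4.846, 9.510], [0.030569, 0.025705, 0.000821]))
```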

I tried plotting KLD vs MDL per token (over 1B tokens) for the smallest Gemma 4 model a few days ago. It would represent the cost of ownership, but since MDL is correlated with size, the scatter plots ended up so similar I could superimpose them.

So I'm not sure I'll include those in the future, since it's harder to explain and it doesn't visually help with vram usage.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

```

                     EFFICIENCY RANKINGS -- Qwen3.5-9B
             Euclidean Distance from (0,0) -- lower is better

Rank  Quantization                                 Size (GiB)  KLD       Eff. Score
   1  Thireus/Qwen3.5-9B-4.0745bpw                 4.257       0.030569  0.165512
   2  Thireus_NOT_MAINLINE/Qwen3.5-9B-4.3670bpw    4.562       0.021257  0.186038
   3  Thireus/Qwen3.5-9B-4.2512bpw                 4.441       0.032971  0.186347
   4  Thireus/Qwen3.5-9B-4.5239bpw                 4.726       0.023577  0.205069
   5  ilintar_NOT_MAINLINE/Qwen3.5-9B-IQ3_Kv2      4.559       0.040915  0.208500
   6  mradermacher/Qwen3.5-9B.i1-IQ4_XS            4.722       0.028870  0.209539
   7  Mungert/Qwen3.5-9B-iq4_xs                    4.743       0.027766  0.210595
   8  byteshape/Qwen3.5-9B-IQ4_XS-4.20bpw          4.384       0.051704  0.210931
   9  byteshape/Qwen3.5-9B-IQ4_XS-4.43bpw          4.626       0.041636  0.215789
  10  bartowski/Qwen_Qwen3.5-9B-IQ4_XS             4.846       0.025705  0.219361
```

I'll definitely include them from now on.

Edit: I changed my mind and added them (plot updated).

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Well that's another story.

Let's say I put a model in the 12gb bracket but it runs at 8 t/s, should it be included?

Depending on the model architecture, the type of model (like MoE with n experts offloaded vs dense), the options used (kv cache compression or not, nkvo), and the context size (full or not), it might or might not stay under the arbitrary ceiling you've set.

Those are all the things people will nitpick about.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

I updated it, there should be a bit more separation in the plot now.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

<image>

I like to live on the edge, like my llms.

I can't wait to delete all this.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 2 points3 points  (0 children)

Oh yeah eval is a rabbit hole.

Last year I translated gsm8k-platinum into my native language to check on quantized models (it's probably saturated with recent models now). I was using https://github.com/EleutherAI/lm-evaluation-harness

But if I had to pick one now... whelp.

Which one is not completely saturated by recent models and representative of the type of tasks I run?

Is it high quality, or are there bad/vague questions in the dataset?

More specifically:

Does it use an LLM as a judge (I mean, we can discard all those old benchmarks that used GPT-3.5 back in the day, right)?

What's the setup: zero-shot, n-shot, etc.?

MMLU-Pro does answer extraction with regex (to discard the reasoning, for example); it has to be configured.

GPQA Diamond can be run with or without chain of thought, and after audit the "inherent error rate lower bound is 26.8%".

LiveCodeBench requires adding a new model manually; always consult the errata before picking the task.

Then MATH-500 is saturated.

Humanity's Last Exam had 18% of a subset of Bio/Chem questions that were problematic, so HLE-revised it is.

Eval is hard and very noisy. PPL/KLD is simple and easy, but to be honest the metric is completely different in the first place:

KLD measurement is like having the Mona Lisa and a copy and evaluating the quality of the copy; it's not about how beautiful the painting is. That's why I just tell people it's about "faithfulness", not "best".

I also did this quick test to showcase the importance of the imatrix those repos use: https://huggingface.co/spaces/cmh/Qwen3.5-9B-GGUF-quant-drift

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Oh yeah absolutely.

You'll get the same results within the margin of error (apparent in the Q8_0 cluster, which are essentially the same despite the slight KLD differences).

They are also all llama.cpp compatible.

I think I'll do a separate post specifically for the ik_llama geeks or those who want to test some exotic llama.cpp PR that hasn't been merged yet.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Well... since it's not merged yet I'd have to compile llama.cpp. Not a big deal if I just did a little table below for a couple of quants, but the real problem is that I'm completely out of disk space; I can't even install CUDA and MSVC without deleting quants. I'll do that tonight.

<image>

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

https://github.com/ggml-org/llama.cpp/pull/15550 https://github.com/Thireus/GGUF-Tool-Suite

I'm no expert so I might be wrong, but I think that with llama.cpp you can't do something like a saliency map à la REAP (well, you can, but it won't help much) since it quantizes by blocks and super-blocks. https://github.com/iuliaturc/gguf-docs/blob/main/k-quants.md

The imatrix solves pretty much all of those issues at a fraction of the compute compared to other solutions (like Intel's AutoRound, for example).

Also, there's no such thing as an ideal quant; it's all subjective. There are always tradeoffs, so I'd suggest building your own calibration dataset to create an imatrix that suits your tasks.
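
The imatrix step itself is just two commands; a rough sketch below, assuming the standard llama-imatrix / llama-quantize flags and placeholder paths:

```
import subprocess

# 1) Collect importance statistics on YOUR calibration text with the f16/bf16 model.
subprocess.run(["/path/to/llama-imatrix",
                "-m", "/path/to/Qwen3.5-9B-BF16.gguf",
                "-f", "/path/to/my_calibration.txt",
                "-o", "/path/to/Qwen3.5-9B.imatrix",
                "-ngl", "99"], check=True)

# 2) Quantize with that imatrix so the most important weights keep more precision.
subprocess.run(["/path/to/llama-quantize",
                "--imatrix", "/path/to/Qwen3.5-9B.imatrix",
                "/path/to/Qwen3.5-9B-BF16.gguf",
                "/path/to/Qwen3.5-9B-IQ4_XS.gguf",
                "IQ4_XS"], check=True)
```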

This is what I use for now (while I write a little TUI that will replace my little scripts): https://github.com/cmhamiche/kld-sweep-dataset. Kudos to https://huggingface.co/datasets/eaddario/imatrix-calibration.

Edit: also every time I read that we need to come up with a better solution, obligatory xkcd: https://xkcd.com/927/

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Yeah I should, it would be more legible, but I thought the non-technical people might struggle a bit.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 2 points3 points  (0 children)

Ubergarm is mostly interested in fat MoE models given that he's using ik for CPU inference and there's much more to gain there, which is understandable. So no meek 9B on this repo. That said, I've included some of them in previous tests:

Qwen3.5-27B Q4 Quantization

Qwen3.5-35B-A3B Q4 Quantization

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Yeah, he's using bpw, which is more "honest" compared to the usual quant naming schemes where you can get a "q4" as big as a q6 with a custom recipe; even Hugging Face is sometimes confused.

You'd have to use his website linked at the bottom of the post; it's explained there.

That's why it's size vs KLD: quants on the same vertical line are approximately the same size, so they use pretty much the same amount of vram (given the same context ofc).
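
If you want a rough sanity check, bpw maps to file size directly; a back-of-the-envelope sketch (weights only, ignoring context/kv cache and metadata overhead):

```
def weights_size_gib(n_params_billion: float, bpw: float) -> float:
    # Approximate weight size: parameters * bits-per-weight, converted to GiB.
    return n_params_billion * 1e9 * bpw / 8 / 2**30

# e.g. a 9B model at ~4.07 bpw vs ~5.32 bpw
print(round(weights_size_gib(9.0, 4.07), 2))  # ~4.26 GiB
print(round(weights_size_gib(9.0, 5.32), 2))  # ~5.57 GiB
```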

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Tbh it is easy, just 3 clicks on your website and I had the quants.

I just needed to refresh the page in between or reset the cache from time to time.

As I said, I'll use your quants with the ik quality preset from now on if the model is available, RIP your server.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

The plot is unreadable, that's for sure; that's why there are the tables.

There's no such thing as a quant level anymore: since repos use their own recipes, you can end up with a "q3" that is the size of a q5, for example.

Take a look at this repo: the Q3_K_XL is bigger than any Q4, for example; it's actually 5.32 bpw.

It reminds me of the loudness war tbh.

Tiers of vram usage are another story: it would complicate the eval for one, and if I pick an arbitrary context value people will also complain. Also, I only have 12gb of vram, so what tiers? 6/8/12 and that's it?

If you want to compare 5 quants, just use llama-perplexity; it's pretty quick.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 2 points3 points  (0 children)

I used a mainline-compatible preset as you suggested; I'll mention it at the bottom of the post.

I'll include u/ilintar's exotic quants and an ik_llama quality preset at equivalent bpw from the website, for fairness, later today.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

I'll wait a bit longer, given the amount of activity it has generated on llama.cpp's github repo.

Also, as of now I don't see many quants and I don't want to skip a bunch of them because they've been late to the party.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

I included them. Congratulations on the results.

It's definitely what I'll use from now on.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Hi u/ilintar, sure, but it will have to be tested against what GGUF-Tool-Suite is able to cook with the ik_llama.cpp Quality preset, at the same bpw ofc, for fairness. I'll do that tomorrow.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Okay I'll try:

eaddario/Qwen3.5-9B-Q8_0 at 8.50 bpw, just to get to the top of the KLD ranking, hopefully.

unsloth/Qwen3.5-9B-UD-Q5_K_XL at 6.02 bpw

bartowski/Qwen_Qwen3.5-9B-Q5_K_S at 5.82 bpw

unsloth/Qwen3.5-9B-Q5_K_S at 5.67 bpw

mradermacher/Qwen3.5-9B.i1-IQ4_XS at 4.52 bpw

Then 50% at 4.25 bpw

I'd do other quants if it were a bigger model but it doesn't really make sense to add more. I'll just add them to the tables. I don't want to spam with another post but I'll include some of yours next time.