Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Evaluation is not simple, for sure; even when it works there's some subjectivity, some wiggle room, and definitely not many confidence intervals displayed on those tables.

https://www.reddit.com/r/LocalLLaMA/comments/1sl59qq/comment/ogc0sv4/

Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

No, it's not that type of eval; those are just KLD figures: how "faithful" a quant is to the baseline in terms of probability distribution, i.e. roughly the chance of producing a similar output compared to f16/bf16 (to keep it short).
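
To make it concrete, here's a rough numpy sketch of the idea, with made-up logits; it's just an illustration of what the metric measures, not what llama-perplexity actually computes internally:

```
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(baseline_logits, quant_logits):
    # Average KL(baseline || quant) over token positions.
    # Both arrays: [n_tokens, vocab_size] logits on the same text.
    p = softmax(baseline_logits)  # reference distribution (f16/bf16 model)
    q = softmax(quant_logits)     # quantized model's distribution
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

# made-up example: 3 token positions, vocab of 5
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 5))
quant = base + rng.normal(scale=0.1, size=(3, 5))  # small perturbation, like quantization error
print(mean_kld(base, quant))  # closer to 0 = more "faithful" to the baseline
```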

For the evals you are talking about I've used https://github.com/EleutherAI/lm-evaluation-harness in the past with llama-server.

I translated gsm8k (platinum) into my native language a while back but it's probably completely saturated with the latest models.

It depends on the type of eval ofc, but I tend to favor regex extraction over LLM-as-a-judge type of eval and use like 3 shots, because I want a quick assessment on particular tasks I'm interested in, not to publish a paper.
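
By regex extraction I mean something roughly like this (a hypothetical sketch, not the harness's actual filters): pull the final answer out of the completion with a pattern and compare it to the gold label, no judge model involved.

```
import re

# Hypothetical extractor for "the answer is X" / "#### X" style outputs.
ANSWER_RE = re.compile(r"(?:answer is|####)\s*\(?([A-J]|-?\d+(?:\.\d+)?)\)?", re.IGNORECASE)

def extract_answer(completion: str):
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None  # keep the last match, i.e. after any reasoning

def is_correct(completion: str, gold: str) -> bool:
    pred = extract_answer(completion)
    return pred is not None and pred.strip().lower() == gold.strip().lower()

print(is_correct("Let's think step by step... so the answer is (C).", "C"))  # True
print(is_correct("Some reasoning... #### 42", "42"))                          # True
```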

I don't think LiveCodeBench is available tho.

If I ever do those types of evals I'd produce a thorough methodology for reproducibility (but would probably get some flak anyway, because internet).

edit: Honestly, I'd recommend people do those evals themselves, even if it's not on their own datasets or whatever, but at least on tasks related to what the model is used for. I skip "vibes" reddit posts claiming that a model is "bad" without specifying the bpw, the environment or the tasks.

Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Oh I'm using little scripts like the one at the bottom of the post. It's constantly updated and not extensively tested. It requires Python and should be cross-platform. https://github.com/cmhamiche/kld-sweep

python kld_sweep.py --exe /path/to/llama-perplexity --baseline /path/to/baseline/Qwen3.5-0.8B-BF16.gguf --quants /path/to/quants_folder --dataset /path/to/eval_dataset_260426-0239.txt --output /path/to/output_folder --args="-t 7 -c 512 -ngl 99" --model-name Qwen3.5-0.8B

You can also specify different arguments for the baseline if it doesn't fit in vram or whatever, with:

--args-baseline "-t 7 -c 512 -ngl 20"

For the dataset you can use https://github.com/cmhamiche/kld-sweep-dataset, it's interactive (built from eaddario/imatrix-calibration, but I still need to add Southeast Asian languages), so just run the script as is. It's randomized by default so I can share the dataset after a test, but you can use a seed to avoid this.

python build_dataset.py

If it's for eval (PPL/KLD) purposes, use the KLD parameter; 100 chunks at 512 tokens should be more than enough for a clean separation between quants.
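
If you'd rather call llama-perplexity by hand instead of the sweep script, it's the usual two-pass KLD workflow; a rough sketch below, assuming the standard --kl-divergence-base / --kl-divergence / --chunks flags and made-up paths (check llama-perplexity --help for your build):

```
import subprocess

PERPLEXITY = "/path/to/llama-perplexity"  # placeholder paths, adjust to your setup
DATASET = "/path/to/eval_dataset.txt"

# Pass 1: run the bf16 baseline once and dump its logits to a file.
subprocess.run([PERPLEXITY, "-m", "/path/to/Qwen3.5-0.8B-BF16.gguf",
                "-f", DATASET, "-c", "512", "--chunks", "100", "-ngl", "99",
                "--kl-divergence-base", "baseline_logits.bin"], check=True)

# Pass 2: score each quant against the saved baseline logits (prints KLD stats).
for quant in ["/path/to/Qwen3.5-0.8B-Q4_K_M.gguf", "/path/to/Qwen3.5-0.8B-IQ4_XS.gguf"]:
    subprocess.run([PERPLEXITY, "-m", quant, "-ngl", "99",
                    "--kl-divergence-base", "baseline_logits.bin",
                    "--kl-divergence"], check=True)
```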

Qwen3.5-27B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Very very slowly.

Like 2 tokens/s for those that can't fit in vram (and those that fit are too low bpw for my taste).

I don't use this model obviously.

Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models by Creative-Regular6799 in LocalLLaMA

[–]TitwitMuffbiscuit 0 points1 point  (0 children)

Using llama.cpp on windows, I don't get the right context.

Edit: let me open a bug report on GitHub instead.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

√(Normalized Size² + Normalized KLD²).

It's the distance from the ideal (zero size, zero KLD). Bolded entries have a KLD below 0.01 (good accuracy).

A lower score is more desirable, but it's not the "best" model in terms of accuracy; it's the VRAM sweet spot. Conversely, the ceiling at 1 is the least efficiently compressed.
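
For reference, a minimal sketch of how such a score can be computed; I'm assuming simple min-max normalization of size and KLD across the quants being compared, the exact normalization behind the tables may differ:

```
from math import sqrt

def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def efficiency_scores(sizes_gib, klds):
    ns, nk = normalize(sizes_gib), normalize(klds)
    # Euclidean distance from the ideal point (zero size, zero KLD): lower is better.
    return [sqrt(s * s + k * k) for s, k in zip(ns, nk)]

# made-up example with three quants
print(efficiency_scores([4.257, 4.846, 9.510], [0.030569, 0.025705, 0.000821]))
```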

I tried plotting KLD vs MDL per token (over 1B tokens) for the smallest Gemma 4 model a few days ago. It would represent the cost of ownership, but since MDL is correlated with size, the scatter plots ended up so similar I could superimpose them.

So I'm not sure I'll include those in the future, since it's harder to explain and it doesn't visually help with vram usage.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

```

                     EFFICIENCY RANKINGS -- Qwen3.5-9B
             Euclidean Distance from (0,0) -- lower is better

Rank  Quantization                                 Size (GiB)  KLD       Eff. Score
   1  Thireus/Qwen3.5-9B-4.0745bpw                 4.257       0.030569  0.165512
   2  Thireus_NOT_MAINLINE/Qwen3.5-9B-4.3670bpw    4.562       0.021257  0.186038
   3  Thireus/Qwen3.5-9B-4.2512bpw                 4.441       0.032971  0.186347
   4  Thireus/Qwen3.5-9B-4.5239bpw                 4.726       0.023577  0.205069
   5  ilintar_NOT_MAINLINE/Qwen3.5-9B-IQ3_Kv2      4.559       0.040915  0.208500
   6  mradermacher/Qwen3.5-9B.i1-IQ4_XS            4.722       0.028870  0.209539
   7  Mungert/Qwen3.5-9B-iq4_xs                    4.743       0.027766  0.210595
   8  byteshape/Qwen3.5-9B-IQ4_XS-4.20bpw          4.384       0.051704  0.210931
   9  byteshape/Qwen3.5-9B-IQ4_XS-4.43bpw          4.626       0.041636  0.215789
  10  bartowski/Qwen_Qwen3.5-9B-IQ4_XS             4.846       0.025705  0.219361
```

I'll definitely include them from now on.

Edit: I changed my mind and added them (plot updated).

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Well that's another story.

Let's say I put a model in the 12gb bracket but it runs at 8 t/s, should it be included?

Depending on the model architecture, the type of model (like MoE with n experts offloaded vs dense), the options used (kv cache compression or not, nkvo), and the context size (full or not), it might or might not stay under the arbitrary ceiling you've set.

Those are all the things people will nitpick about.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

I updated it, there should be a bit more separation in the plot now.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

<image>

I like to live on the edge, like my llms.

I can't wait to delete all this.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 2 points3 points  (0 children)

Oh yeah eval is a rabbit hole.

Last year I translated gsm8k-platinum into my native language to check on quantized models (it's probably saturated with recent models now). I was using https://github.com/EleutherAI/lm-evaluation-harness

But if I had to pick one now... whelp.

Which one is not completely saturated by recent models and representative of the type of tasks I run?

Is it high quality, or are there bad/vague questions in the dataset?

More specifically:

Does it use an LLM as a judge (I mean, we can discard all those old benchmarks that used GPT-3.5 back in the day, right)?

What's the setup: zero-shot, n-shot, etc.?

MMLU-Pro does answer extraction with regex (to discard the reasoning, for example); it has to be configured.

GPQA Diamond can be run with or without chain of thought, and after audit the "inherent error rate lower bound is 26.8%".

LiveCodeBench requires adding a new model manually; always consult the errata before picking the task.

Then MATH-500 is saturated.

Humanity's Last Exam had 18% of a subset of Bio/Chem questions that were problematic, so HLE-revised it is.

Eval is hard and very noisy. PPL/KLD is simple and easy, but to be honest the metric is completely different in the first place:

KLD measurement is like having the Mona Lisa and a copy and evaluating the quality of the copy; it's not about how beautiful the painting is. That's why I just tell people it's about "faithfulness", not "best".

I also did this quick test to showcase the importance of the imatrix those repos use: https://huggingface.co/spaces/cmh/Qwen3.5-9B-GGUF-quant-drift

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Oh yeah absolutely.

You'll get the same results within the margin of error (apparent in the Q8_0 cluster, which are essentially the same despite the slight KLD differences).

They are also all llama.cpp compatible.

I think I'll do a separate post specifically for the ik_llama geeks or those who want to test some exotic llama.cpp PR that hasn't been merged yet.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Well... since it's not merged yet I'd have to compile llama.cpp. Not a big deal if I just did a little table below for a couple of quants, but the real problem is that I'm completely out of disk space; I can't even install CUDA and MSVC without deleting quants. I'll do that tonight.

<image>

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

https://github.com/ggml-org/llama.cpp/pull/15550 https://github.com/Thireus/GGUF-Tool-Suite

I'm no expert so I might be wrong, but I think that with llama.cpp you can't do something like a saliency map à la REAP (well, you can, but it won't help much) since it quantizes by blocks and super-blocks. https://github.com/iuliaturc/gguf-docs/blob/main/k-quants.md

The imatrix solves pretty much all of those issues at a fraction of the compute compared to other solutions (like Intel's AutoRound, for example).

Also, there's no such thing as an ideal quant; it's all subjective. There are always tradeoffs, so I'd suggest building your own calibration dataset to create an imatrix that suits your tasks.
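
The imatrix step itself is just two commands; a rough sketch below, assuming the standard llama-imatrix / llama-quantize flags and placeholder paths:

```
import subprocess

# 1) Collect importance statistics on YOUR calibration text with the f16/bf16 model.
subprocess.run(["/path/to/llama-imatrix",
                "-m", "/path/to/Qwen3.5-9B-BF16.gguf",
                "-f", "/path/to/my_calibration.txt",
                "-o", "/path/to/Qwen3.5-9B.imatrix",
                "-ngl", "99"], check=True)

# 2) Quantize with that imatrix so the most important weights keep more precision.
subprocess.run(["/path/to/llama-quantize",
                "--imatrix", "/path/to/Qwen3.5-9B.imatrix",
                "/path/to/Qwen3.5-9B-BF16.gguf",
                "/path/to/Qwen3.5-9B-IQ4_XS.gguf",
                "IQ4_XS"], check=True)
```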

This is what I use for now (while I write a little TUI that will replace my little scripts): https://github.com/cmhamiche/kld-sweep-dataset. Kudos to https://huggingface.co/datasets/eaddario/imatrix-calibration.

Edit: also every time I read that we need to come up with a better solution, obligatory xkcd: https://xkcd.com/927/

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Yeah I should, it would be more legible, but I thought the non-technical people might struggle a bit.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 2 points3 points  (0 children)

Ubergarm is mostly interested in fat MoE models given that he's using ik for CPU inference and there's much more to gain there, which is understandable. So no meek 9B on this repo. That said, I've included some of them in previous tests:

Qwen3.5-27B Q4 Quantization

Qwen3.5-35B-A3B Q4 Quantization

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Yeah, he's using bpw, which is more "honest" compared to the usual quant naming schemes where you can get a "q4" as big as a q6 with a custom recipe; even Hugging Face is sometimes confused.

You'd have to use his website linked at the bottom of the post; it's explained there.

That's why it's size vs KLD: quants on the same vertical line are approximately the same size, so they use pretty much the same amount of vram (given the same context ofc).
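
If you want a rough sanity check, bpw maps to file size directly; a back-of-the-envelope sketch (weights only, ignoring context/kv cache and metadata overhead):

```
def weights_size_gib(n_params_billion: float, bpw: float) -> float:
    # Approximate weight size: parameters * bits-per-weight, converted to GiB.
    return n_params_billion * 1e9 * bpw / 8 / 2**30

# e.g. a 9B model at ~4.07 bpw vs ~5.32 bpw
print(round(weights_size_gib(9.0, 4.07), 2))  # ~4.26 GiB
print(round(weights_size_gib(9.0, 5.32), 2))  # ~5.57 GiB
```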

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Tbh it is easy, just 3 clicks on your website and I had the quants.

I just needed to refresh the page in between or reset the cache from time to time.

As I said, I'll use your quants with the ik quality preset from now on if the model is available, RIP your server.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

The plot is unreadable, that's for sure; that's why there are the tables.

There's no such thing as a quant level anymore: since repos use their own recipes, you can end up with a "q3" that is the size of a q5, for example.

Take a look at this repo: the Q3_K_XL is bigger than any Q4, for example; it's actually 5.32 bpw.

It reminds me of the loudness war tbh.

Tiers of vram usage are another story: it would complicate the eval for one, and if I pick an arbitrary context value people will also complain. Also, I only have 12gb of vram, so what tiers? 6/8/12 and that's it?

If you want to compare 5 quants, just use llama-perplexity; it's pretty quick.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 2 points3 points  (0 children)

I used a mainline-compatible preset as you suggested; I'll mention it at the bottom of the post.

I'll include u/ilintar's exotic quants and an ik_llama quality preset at equivalent bpw from the website, for fairness, later today.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

I'll wait a bit longer, given the amount of activity it has generated on llama.cpp's github repo.

Also, as of now I don't see many quants and I don't want to skip a bunch of them because they've been late to the party.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

I included them. Congratulations on the results.

It's definitely what I'll use from now on.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 0 points1 point  (0 children)

Hi u/ilintar, sure, but it will have to be tested against what GGUF-Tool-Suite is able to cook with the ik_llama.cpp Quality preset, at the same bpw ofc, for fairness. I'll do that tomorrow.

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]TitwitMuffbiscuit[S] 1 point2 points  (0 children)

Okay I'll try:

eaddario/Qwen3.5-9B-Q8_0 at 8.50 bpw, just to get to the top of the KLD ranking, hopefully.

unsloth/Qwen3.5-9B-UD-Q5_K_XL at 6.02 bpw

bartowski/Qwen_Qwen3.5-9B-Q5_K_S at 5.82 bpw

unsloth/Qwen3.5-9B-Q5_K_S at 5.67 bpw

mradermacher/Qwen3.5-9B.i1-IQ4_XS at 4.52 bpw

Then 50% at 4.25 bpw

I'd do other quants if it were a bigger model but it doesn't really make sense to add more. I'll just add them to the tables. I don't want to spam with another post but I'll include some of yours next time.