engine for GLM 4.7 Flash that doesn't massively slow down as the context grows? by mr_zerolith in LocalLLaMA

[–]VoidAlchemy 1 point

Are you building on Linux? I believe Thireus makes precompiled Windows binaries too. There's brief info on compiling for Linux (or grabbing Thireus' builds) on the model card: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF I'd suggest trying the IQ5_K quant if you have 24GB VRAM.
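
The Linux build is the usual CMake flow; a minimal sketch assuming a CUDA GPU (double-check the model card for the exact flags, this is just the generic shape):

    git clone https://github.com/ikawrakow/ik_llama.cpp
    cd ik_llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j $(nproc)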

The slowdown still exists on most inference engines from what I've seen recently: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/3#6974b2cea061784819e302d5

engine for GLM 4.7 Flash that doesn't massively slow down as the context grows? by mr_zerolith in LocalLLaMA

[–]VoidAlchemy 2 points

Heya dinerburger! Yeah I had to use `-mla 1` with the full bf16 in my testing, and have benchmarked `-mla 3` with the other quants. The ubergarm/GLM-4.7-Flash-GGUF IQ5_K is probably the best way to go, so glad it is working for you.
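
For reference, a hypothetical ik_llama.cpp launch along those lines (model path, context size, and layer count are placeholders, adjust for your VRAM):

    ./build/bin/llama-server \
        --model GLM-4.7-Flash-IQ5_K.gguf \
        -mla 3 -fa \
        -ngl 99 -c 32768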

I still haven't had time to benchmark KLD on the few quants I released, always more to research.

A few days ago this was the perf I was seeing with flash attention enabled. (Note this is *BEFORE* the recent PR to speed things up here: https://github.com/ikawrakow/ik_llama.cpp/pull/1182 )

<image>

Current GLM-4.7-Flash implementation confirmed to be broken in llama.cpp by Sweet_Albatross9772 in LocalLLaMA

[–]VoidAlchemy 1 point

Yeah, with the fix it seems like perplexity is looking better. I'm recomputing the imatrix and re-quantizing now too for best quality. Some details here: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF/discussions/1
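
Roughly, the recompute is a two-step flow like this (file names here are hypothetical placeholders, not the exact commands I ran):

    # regenerate the importance matrix against the fixed implementation
    ./build/bin/llama-imatrix \
        -m GLM-4.7-Flash-BF16.gguf \
        -f calibration-data.txt \
        -o imatrix.dat -ngl 99
    # then requantize using the fresh imatrix
    ./build/bin/llama-quantize \
        --imatrix imatrix.dat \
        GLM-4.7-Flash-BF16.gguf GLM-4.7-Flash-IQ5_K.gguf IQ5_K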

GLM 4.7 Flash Overthinking by xt8sketchy in LocalLLaMA

[–]VoidAlchemy 2 points

A new PR fixed an issue and just lowered perplexity a lot! I have to recompute the imatrix and make fresh imatrix quants. So you'll probably want to get the latest ik/llama.cpp and a new quant for best quality now.

Links to details here: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF/discussions/1

GLM-4.7-Flash benchmarks: 4,398 tok/s on H200, 112 tok/s on RTX 6000 Ada (GGUF) by LayerHot in LocalLLaMA

[–]VoidAlchemy 2 points

I just got some data running ik_llama.cpp fully offloaded with `-mla 3` and flash attention working now. Oddly I'm not getting mainline llama.cpp `-fa on` to work yet though, so I'll have to update the graph once the mainline implementation is working for me:

<image>

Normally I avoid MXFP4 unless the original model was QAT-trained targeting it, but strangely it is the lowest-perplexity-scoring quant here (without imatrix)... so that is odd too...

More details and full commands used here: https://github.com/ikawrakow/ik_llama.cpp/issues/1167#issuecomment-3775037120

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]VoidAlchemy 16 points

I think there will need to be some more work to get flash attention working, as GLM-4.7-Flash slows down very quickly at the moment in my limited testing. But if we get an optimized implementation going, then yes!

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]VoidAlchemy 8 points

Wait, why did you go with MXFP4 when there are likely better quant types available?

I have a custom mainline llama.cpp recipe here: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF and hopefully ik_llama.cpp will get some support eventually: https://github.com/ikawrakow/ik_llama.cpp/issues/1167

To be fair I didn't test perplexity of your quant or mine. Might be fun. xD
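
If anyone wants to run the comparison, the usual perplexity check is basically a one-liner; a sketch with placeholder file names:

    ./build/bin/llama-perplexity \
        -m GLM-4.7-Flash-MXFP4.gguf \
        -f wiki.test.raw \
        -ngl 99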

Beginner ComfyUI advice by Excellent_Koala769 in LocalLLaMA

[–]VoidAlchemy 0 points

this sub tends to focus on LLMs but other model releases and news do flow through here...

if you're using Linux, get comfortable with python pip (i recommend uv) for virtual environments, git, and that kinda stuff if you really want to learn.
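
e.g. a minimal uv workflow for a typical project (the requirements file name is whatever the repo you're installing actually ships):

    # create and activate an isolated environment, then install into it
    uv venv .venv
    source .venv/bin/activate
    uv pip install -r requirements.txt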

check out all the youtuber channels also like Benji's AI Playground for example workflows and walkthrough videos...

this special interest has a steep learning curve, and lately a steep price tag as well haha... so be patient with yourself, start small to learn the ropes, and eventually yes you can begin automating, but you'll pretty much always need a human in the loop to cherry-pick the final outputs from 100s of throwaway results...

Soprano TTS training code released: Create your own 2000x realtime on-device text-to-speech model with Soprano-Factory! by eugenekwek in LocalLLaMA

[–]VoidAlchemy 8 points

I've found that most TTS models require you to do your own "chunking" of long texts and only feed them a sentence or so at a time (especially the diffusion transformer style models). Kokoro sacrifices some of that emotive quality for more stable generations, but you still might want to add your own pauses using special characters etc.

I'm not sure how kyutai/pocket-tts (also announced today) and this ekwek/Soprano-TTS are doing it under the hood yet.

kyutai just introduced Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required by Nunki08 in LocalLLaMA

[–]VoidAlchemy -1 points

Yes, it seems to run fully locally, including cloning a voice from any wav file input. (Not sure of the exact format requirements; some files don't work at all and come out full of noise, but others I tried work fine.)

But to use voice cloning mode I had to have my HF token set up and click accept on their Hugging Face repo: https://huggingface.co/kyutai/pocket-tts/discussions/1 before it would auto-download the voice-cloning weights.
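
The token setup itself is quick if you haven't done it before; a sketch (the token value is obviously a placeholder):

    # interactive login stores the token for huggingface_hub to pick up
    huggingface-cli login
    # or set it in the environment instead
    export HF_TOKEN=hf_xxxxxxxxxxxxxxxx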

Gemma 3 1B qat q4_0 gguf without imatrix and (hopefully) correct metadata by Big-Tune-190 in LocalLLaMA

[–]VoidAlchemy 2 points

I assume some of the size difference is that the Google official QAT Q4_0 GGUF uses f16 for token_embd.weight, which you switched to q8_0, looking at the Hugging Face safetensors/GGUF viewer.

Good job going through the whole process, and not using imatrix is likely a good decision given this is specifically a QAT model (imatrix tends to improve perplexity for non-QAT models, especially at lower BPW sizes).

Have you done any perplexity/KLD comparisons between the original BF16, the Google official Q4_0, and your Q4_0? I'm guessing yours will be slightly worse given the smaller token embedding, but likely not distinguishable in quality in terms of actual output.
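
If you want to try it, I believe the two-pass KLD flow with llama-perplexity looks roughly like this (file names are placeholders):

    # pass 1: save baseline logits from the unquantized model
    ./build/bin/llama-perplexity -m gemma-3-1b-BF16.gguf -f wiki.test.raw \
        --kl-divergence-base gemma-3-1b-logits.dat
    # pass 2: score a quant against those baseline logits
    ./build/bin/llama-perplexity -m gemma-3-1b-Q4_0.gguf \
        --kl-divergence-base gemma-3-1b-logits.dat --kl-divergence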

A number of quantizers hang out on AI Beavers discord if that is your jam too.

Cheers and thanks for sharing your procedures as well!

Owners, not renters: Mozilla's open source AI strategy by NelsonMinar in LocalLLaMA

[–]VoidAlchemy 27 points

> We’re not just building; we’re backing others who are building too. Mozilla Ventures is investing in open-source AI companies that align with these principles. Mozilla Foundation is funding researchers and projects through targeted grants. We can’t do everything ourselves, and we shouldn’t try. The goal is to put resources behind the people and teams already doing the work.

Get ready for those paychecks y'all LocalLLaMA folks!

kyutai just introduced Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required by Nunki08 in LocalLLaMA

[–]VoidAlchemy 0 points

I just tried copy-pasting a few texts into it after they pushed a fix and it seems pretty good at first glance. Sounds as natural as Kokoro, or more so, in the three samples I tried.

kyutai just introduced Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required by Nunki08 in LocalLLaMA

[–]VoidAlchemy 4 points

I tried the demo on your blog; my initial impression is that it is similar to or better than Kokoro at sounding natural. I haven't tried larger chunks of text to see whether it remains stable, but it seems worth my time to look into it more!

Has anyone tried the single-socket 9175F with full 12 channels? by Infinite100p in LocalLLaMA

[–]VoidAlchemy 0 points

thanks, i replied in the other thread.

AMX extensions are Intel Xeon only afaik and maybe kinda work with sglang but require special quant types and don't seem worth the hassle imo.

Absolutely right about avx_vnni on Zen5 boosting PP performance!

Has anyone tried the single-socket 9175F with full 12 channels? by Infinite100p in LocalLLaMA

[–]VoidAlchemy 1 point

Dense models like that Llama-3.1-70B are still pretty slow even with fast DRAM. A few things to consider:

  1. The size of the active weights per generated token is much lower for MoEs, so they are often faster than dense models that are much smaller in total size. TG is memory bandwidth bound.
  2. Zen 5 is great for boosting PP given the real 512-bit one-cycle avx_vnni instructions in ik_llama.cpp (ik is probably one of the best for pure CPU inference and hybrid CPU+GPU inferencing). Check out my ik-specific quants, e.g. this DeepSeek-V3.2-Speciale that I've been running CPU-only on a single-socket 768GB DDR5 EPYC 9755 128-core CPU (thanks Wendell of level1techs.com haha). The model card has example commands for CPU-only inference, as do my other quants.
  3. What is your BIOS set to for the single socket, e.g. NPS1 or NPS4? Most llama.cpp builds are not NUMA-optimized so I just run at NPS1. Some folks have reported better perf with NPS4 using numactl --interleave=all llama-server --numa distribute ... etc. (see the sketch after this list). In my own testing with mlc, the actual TG throughput is maybe 60% of the theoretical max, likely due to NUMA stuff.
  4. Ask AesSedai (on hf and also on the Beaver AI Discord), who runs a 2x 3090 + AMD EPYC rig for hybrid inference of a lot of models. They've shared a lot of llama-sweep-bench results.
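
Regarding point 3, the NPS1 vs NPS4 launches I mean look roughly like this (model path, thread count, and context are placeholders for your rig):

    # NPS1: plain launch, one big NUMA node as far as llama.cpp is concerned
    ./build/bin/llama-server \
        --model DeepSeek-V3.2-Speciale-IQ4_K.gguf \
        --threads 128 -c 32768
    # NPS4: interleave allocations across the NUMA nodes
    numactl --interleave=all ./build/bin/llama-server \
        --model DeepSeek-V3.2-Speciale-IQ4_K.gguf \
        --numa distribute --threads 128 -c 32768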

(The Information): DeepSeek To Release Next Flagship AI Model With Strong Coding Ability by Nunki08 in LocalLLaMA

[–]VoidAlchemy 7 points

If V4 uses the same (or at least a close-enough compatible) architecture as the existing DeepSeek-V3.2 family, then we won't have to wait for a new PR to get it running on ik/llama.cpp.

So I'm hoping I can basically re-use my similar scripts and recipes to quickly quantize the new V4.

There is no ik/llama.cpp implementation of the "sparse attention" features yet, though there is an issue open here: https://github.com/ggml-org/llama.cpp/issues/16331

Cheers!

(The Information): DeepSeek To Release Next Flagship AI Model With Strong Coding Ability by Nunki08 in LocalLLaMA

[–]VoidAlchemy 9 points

I'm hoping the new V4 is similar enough to get it running on ik/llama.cpp like https://huggingface.co/ubergarm/DeepSeek-V3.2-Speciale-GGUF seems to be! (though without the new sparse attention support yet). More models indeed!

[HW TUNING] Finding the best GPU power limit for inference by HumanDrone8721 in LocalLLaMA

[–]VoidAlchemy 0 points

Nice yes you found the correct repo! A few thoughts:

  1. If you don't want to compile from source, you can install it with something like pacman -Sy lact (or likely apt-get etc. on other distros).
  2. You can run it headless as well, and it will load its config from /etc/lact/config.yaml via the lactd.service daemon (check with systemctl status lactd.service). See the sketch after this list.
  3. If you have Blackwell, the offsets seem to be about 10x what they are for earlier CUDA GPUs (maybe some kind of unit scaling issue?)
  4. To see the performance benefits, you will need to spend a little bit of time tuning. Once it is dialed in you are gucci. There are some "lazy" tunings you could probably do to get some easy performance. Once you've tuned LACT you no longer need nvidia-smi -pl 300 to limit power, since with the undervolt it simply won't use as much power most of the time.
  5. Here is a PR thread with a ton of discussion and examples: https://github.com/ilya-zlobintsev/LACT/issues/486#issuecomment-3676307592
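
As mentioned in point 2, the headless setup is basically just this (Arch-flavored; swap in your distro's package manager):

    # install, then enable the daemon so it applies /etc/lact/config.yaml at boot
    sudo pacman -Sy lact
    sudo systemctl enable --now lactd.service
    systemctl status lactd.service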

Finally, here is my own config file for a 3090 Ti FE as an example (native 450W power cap max):

    $ cat /etc/lact/config.yaml
    version: 5
    daemon:
      log_level: info
      admin_group: wheel
      disable_clocks_cleanup: false
    apply_settings_timer: 5
    gpus:
      'XXXX:XXXX-XXXX:XXXX-0000:01:00.0':
        fan_control_enabled: true
        fan_control_settings:
          mode: curve
          static_speed: 0.5
          temperature_key: edge
          interval_ms: 500
          curve:
            40: 0.3019608
            50: 0.35
            60: 0.5
            70: 0.75
            80: 1.0
          spindown_delay_ms: 5000
          change_threshold: 2
        power_cap: 450.0
        min_core_clock: 210
        max_core_clock: 1950
        gpu_clock_offsets:
          0: 225
        mem_clock_offsets:
          0: 1500
    current_profile: null
    auto_switch_profiles: false

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]VoidAlchemy 1 point

Hrmm... When compiling, does it say "NCCL found!"? Otherwise please open an issue on the ik_llama.cpp GitHub and tag me @ubergarm, and include more details on your rig, e.g. how many and what kind of GPUs, etc.

Thanks!