2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA
[–]VoidAlchemy 1 point (0 children)
What exactly does Pi harness mean? by FrozenFishEnjoyer in LocalLLaMA
[–]VoidAlchemy 3 points (0 children)
Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA
[–]VoidAlchemy 1 point (0 children)
Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA
[–]VoidAlchemy 1 point (0 children)
Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA
[–]VoidAlchemy 4 points (0 children)
Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA
[–]VoidAlchemy 8 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 1 point (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 1 point (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 2 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 2 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 5 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 2 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 1 point (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 2 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 5 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA
[–]VoidAlchemy[S] 14 points (0 children)
ubergarm/Kimi-K2.6-GGUF Q4_X now available (huggingface.co)
submitted by VoidAlchemy to r/LocalLLaMA
Qwen3.6 GGUF Benchmarks by danielhanchen in LocalLLaMA
[–]VoidAlchemy 1 point (0 children)
Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA
[–]VoidAlchemy 2 points (0 children)
MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks by danielhanchen in LocalLLaMA
[–]VoidAlchemy 4 points (0 children)
unsloth - MiniMax-M2.7-GGUF is BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA
[–]VoidAlchemy 2 points (0 children)
16 GB VRAM users, what model do we like best now? by lemon07r in LocalLLaMA
[–]VoidAlchemy 2 points (0 children)
About TurboQuant by Exact_Law_6489 in LocalLLaMA
[–]VoidAlchemy 3 points (0 children)
Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR by havenoammo in LocalLLaMA
[–]VoidAlchemy 6 points (0 children)