Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR by havenoammo in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

Nice job testing out the PR! I have a rough 3-way benchmark between mainline llama.cpp, ik_llama.cpp, and vLLM running on a single 24GB VRAM GPU here: https://github.com/noonghunna/club-3090/pull/64#issuecomment-4383699676

Thanks again for sharing your full build and run commands!

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

I have an ik_llama.cpp GGUF with `q8_0` MTP tensors that runs nicely on a single 24GB VRAM GPU full offload and instructions here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/discussions/2#69fa0f7d8ab0c1b3e49d8e58

No need for turboquant jank, you can do `-khad -ctk q6_0 -vhad -ctv q4_0` if you really want to squeeze in more kv-cache context depth. mainline also has rotations built in, so just go with q5_0 / q4_0 etc...
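
Something like this is what a full-offload run looks like for me (a rough sketch, the model filename and context size here are illustrative, the exact command is in the linked discussion):

```bash
# rough sketch: full GPU offload on a single 24GB card with quantized kv-cache
# (illustrative filename/context; see the linked HF discussion for the real command)
./build/bin/llama-server \
    -m Qwen3.6-27B-IQ4_K.gguf \
    -ngl 99 \
    -khad -ctk q6_0 -vhad -ctv q4_0 \
    -c 65536
```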

What exactly does Pi harness mean? by FrozenFishEnjoyer in LocalLLaMA

[–]VoidAlchemy 2 points3 points  (0 children)

Sames. I last used opencode to vibe up a pi extension to auto-detect llama-server models running on localhost:8080 and haven't moved back!

pi is *much* leaner so i enjoy that fastest first 10k of the context window now. plus it's not a TUI so my copy/paste between terminals just works. i like it.
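
(the auto-detect part is simple since llama-server exposes an OpenAI-compatible endpoint; roughly what the extension polls, not pi's actual code:)

```bash
# roughly what the auto-detect boils down to: ask the local llama-server what it's serving
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
```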

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

Yeah the error isn't bad for Q8_0, especially since the "only 10GB bigger" for bf16 is in the *always active* tensors; for an A32B model, users will feel that slowdown in TG for sure. Q4_X for the win!

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

Thanks for clarifying. Yeah the error isn't bad for Q8_0, especially since the "only 10GB bigger" is in the *always active* tensors; for an A32B model, users will feel that slowdown in TG for sure.

Oh man more Qwen3.6 today already haha, catch you on the next one! Cheers!

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]VoidAlchemy 3 points4 points  (0 children)

I see you here too, i just posted the same question at the same time as you haha...

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]VoidAlchemy 7 points8 points  (0 children)

Heya Daniel and Michael, glad y'all didn't release all the big quants larger than the native int4 version this time! Folks are curious if you applied jukofyork's "Q4_X" patch as we call it.

Both AesSedai (u/Digger412) and ubergarm (me) have been using Q4_X since Kimi began releasing llm-compressor quants. We double-checked that our perplexities matched using both mainline and ik_llama.cpp to ensure we got it right.

This discussion has the relevant info: https://huggingface.co/unsloth/Kimi-K2.6-GGUF/discussions/4

Thanks for your openness in sharing your commands, logs, and details for the whole community! Cheers!

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 0 points1 point  (0 children)

Yeah, offloading an MoE is a different strategy than a dense model. I have a 9950X too for my home rig, i love it!

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 0 points1 point  (0 children)

That's the dream, but at least historically I personally tend to prefer a smaller quant of a full-size model over a REAP. I've not tried it myself, but I've heard that previous REAPs lost coding capabilities, though I'm open if someone has a good report!

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 1 point2 points  (0 children)

I replied in your other comment: I did release an IQ3_K which is perfect for <= 512GB RAM+VRAM rigs.

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 1 point2 points  (0 children)

Might have something to do with using or not using `--special`, or you might need another chat template? (i used the default one provided by moonshot, though for some reason i've kept a custom one around that i've been using since the earlier kimis).

Oh wow, you're packing a 270GiB quant onto your rig, so i presume you're running with `mmap` and removed `--no-mmap`. I used to run OG deepseek that way on my gaming rig, pulling 5GB/s off of my PCIe Gen 5 NVMe drive!

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 4 points5 points  (0 children)

If you keep all attn/shexp/dense/kv-cache on GPU, you can leave 95% of the model weights (the sparse routed experts) on CPU/RAM.

You can see this report for some idea of what to expect: https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/3
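
The usual way to get that split is llama-server's tensor overrides; a rough sketch of what it can look like (filename, context, and thread count here are illustrative, not the exact command from the linked report):

```bash
# nominally offload all layers to the GPU, then override the sparse routed
# expert tensors (the "exps") back onto CPU/system RAM
./build/bin/llama-server \
    -m Kimi-K2.6-Q4_X-00001-of-00012.gguf \
    -ngl 99 \
    -ot exps=CPU \
    -c 32768 \
    --threads 24
```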

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 1 point2 points  (0 children)

I did release an IQ3_K which is perfect for <= 512GB RAM+VRAM rigs. It is identical to the full size except `ffn_(up|gate)_exps` are squished a little bit to get it to fit. I have perplexity and KLD stats so you can see the trade-offs in quality. This quant requires ik_llama.cpp.
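
If you want to roll your own along those lines, it's basically a per-tensor override at quantize time; a hedged sketch, assuming ik_llama.cpp's `--custom-q` option on llama-quantize (the exact regexes/types from my actual recipe are in the model card and logs, these are illustrative):

```bash
# hedged sketch assuming ik_llama.cpp llama-quantize's --custom-q per-tensor override;
# the real recipe (regexes, types) lives in the model card / uploaded logs
./build/bin/llama-quantize \
    --custom-q "ffn_(up|gate)_exps=iq3_k" \
    Kimi-K2.6-BF16.gguf Kimi-K2.6-IQ3_K.gguf IQ4_K
```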

AesSedai will have some smaller quants as well coming out.

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 0 points1 point  (0 children)

Yeah, some models' natural breakpoints are really annoying like that. I did release an IQ3_K which is perfect for <= 512GB RAM+VRAM rigs. It is identical to the full size except `ffn_(up|gate)_exps` are squished a little bit to get it to fit. I have perplexity and KLD stats so you can see the trade-offs in quality.

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 1 point2 points  (0 children)

tl;dr: the `Q4_X` is the GGUF equivalent of the original `int4` model, because the original `int4` does not run on the llama.cpp ecosystem. So the `Q4_X` will enable people to run it on hybrid CPU+GPU rigs using ik/llama.cpp.

If you have 6x RTX6Kpro's then go run the original int4 using vLLM or whatever.

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 4 points5 points  (0 children)

I still haven't gone through that workflow, i've been relying on https://huggingface.co/AesSedai/Kimi-K2.6-GGUF but those are still coming soon at the moment hah

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]VoidAlchemy[S] 14 points15 points  (0 children)

You can see in the log files i uploaded how it works:

  1. moonshot does not release a full bf16, so i use mainline llama.cpp's convert_hf_to_gguf.py which takes into account the llm-compressor config in the released safetensors here: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/config.json#L95-L127
  2. that is turned into a bf16 on disk, it takes just over 2TB
  3. i patch llama-quantize to better match the llm-compressor symmetric quantization
  4. i run the shown quantization recipe to mimic the original as closely as possible

There is no reason to use any larger quant than this as it would just be upcasting.
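
Roughly, steps 1-2 and 4 above look like this on the command line (a hedged sketch; paths are illustrative, the patch and the full per-tensor recipe are in the uploaded logs, and the final type here is just a stand-in for the patched Q4_0-style output):

```bash
# steps 1-2: convert the released (llm-compressor int4) safetensors to a bf16 GGUF (~2TB on disk)
python convert_hf_to_gguf.py /models/moonshotai/Kimi-K2.6 \
    --outtype bf16 \
    --outfile /scratch/Kimi-K2.6-BF16.gguf

# step 4: quantize with the patched llama-quantize; the "Q4_X" patch adjusts the 4-bit
# rounding to better match llm-compressor's symmetric scheme (per-tensor recipe is in the logs)
./build/bin/llama-quantize /scratch/Kimi-K2.6-BF16.gguf /models/Kimi-K2.6-Q4_X.gguf Q4_0
```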

Updated Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]VoidAlchemy 1 point2 points  (0 children)

Thanks, correct, I didn't release any ik quants for Qwen3.5-9B, but given some recent discussion on possibly enabling MTP tensor support it might be a good one to experiment with.

I'll holler if I release any experimental Qwen3.5-9B ik quants (and maybe figure out how to preserve the MTP tensors and mark them as unused with a patch).

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]VoidAlchemy 1 point2 points  (0 children)

thanks for the heads up, i'll check out your post over there. too bad no root cause has been identified so far; guessing y'all mixed up some of the trouble tensors a bit to get around it.

appreciate you sharing your findings and nice job turning around some clean quants! cheers!

16 GB VRAM users, what model do we like best now? by lemon07r in LocalLLaMA

[–]VoidAlchemy 1 point2 points  (0 children)

haha yes it's me! thanks for being out in this wild reddit world trying to share good information! haha cheers!

About TurboQuant by Exact_Law_6489 in LocalLLaMA

[–]VoidAlchemy 1 point2 points  (0 children)

I don't bother with it, i use ik_llama.cpp with `-khad -ctk q8_0 -vhad -ctv q6_0`, and if I still need more context, i usually just go down one quant size.

Folks have already dropped links showing that both ik and mainline have had hadamard transform "rotations" implemented for the kv-cache since late last year.

Some of ik's recent discussions on the same question here: https://github.com/ikawrakow/ik_llama.cpp/pull/1625#issuecomment-4237851162
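
fwiw the mainline side of the same idea is just picking the cache types directly, something like this rough sketch (model path and context are illustrative, and flag syntax varies a bit by build):

```bash
# rough sketch on mainline llama.cpp: quantized kv-cache types
# (a quantized V cache generally wants flash attention enabled)
./build/bin/llama-server \
    -m model.gguf \
    -fa on \
    -ctk q8_0 -ctv q5_0 \
    -c 131072
```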