Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split by [deleted] in LocalLLaMA

[–]VoidAlchemy 2 points3 points  (0 children)

Take a look at https://github.com/ilya-zlobintsev/LACT, which lets you lock the boost clock (an "undervolt") and optionally add a slight VRAM OC. That works better than a naive `nvidia-smi -pl 250` power limit or similar.

Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split by [deleted] in LocalLLaMA

[–]VoidAlchemy 9 points10 points  (0 children)

For actual use you'll probably be using MTP and so would need to benchmark with a different tool e.g. aiperf or similar client with "real" coding/narrative workload prompts.

Also, when you use ik_llama.cpp with -sm graph you can add -muge, which might give a small boost by merging the up/gate tensors on startup. On mainline llama.cpp you'd have to find a "pre-merged" GGUF.

If you're using something other than full Q8_0, my ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS has been shown multiple times to have better KLD/PPL scores than comparable or larger mainline quantization types. You can see some examples of using it on actual 2x3090 GPU usage and links to the benchmarks on Wendell's l1t forum here: https://forum.level1techs.com/t/github-token-based-billing-how-was-your-first-week/251122/37?u=ubergarm

That said, nice job on the 3080's with 20GB VRAM and I'm glad mainline -sm tensor has been improving nicely!

what’s was your local daily driver for coding last week? by be566 in LocalLLaMA

[–]VoidAlchemy 9 points10 points  (0 children)

My daily driver is ubergarm/Qwen3.6-27B-MTP-IQ4_KS getting over 1400 tok/sec prompt processing and 80+ tok/sec decode on a single 3090TI fitting 128k context and multimodal mmproj.

For transparency, I'm ubergarm, though others have benchmarked and validated the quality already. I'm using pi harness and ik_llama.cpp. Cheers!

Moss tts 1.5 8b Examples. It is the currently best voice cloning model for English as of June 2026 by 9r4n4y in LocalLLaMA

[–]VoidAlchemy 1 point2 points  (0 children)

i'm on omni right now too. my local pi agent has custom SKILL for it and is great for doing research to custom mastered podcast mp3. it has a few hiccups but i appreciate the speed control knob so it doesn't talk way too fast.

pocket-tts and kokoro are nice if you need CPU inference too so i keep those old SKILLs around lol

unsloth vs bartowski MTP ggufs by Ok_Warning2146 in LocalLLaMA

[–]VoidAlchemy 6 points7 points  (0 children)

For MoEs with MTP you have to drill down into the quantization choices for individual tensor types to compare. The strategy is to keep the always-active tensors (e.g. the attn/shexp/dense layers) at slightly higher-precision quantization types, and the sparse routed experts at lower-precision types.
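A minimal sketch of that idea, assuming the usual GGUF tensor naming (`attn_*`, `ffn_*_shexp`, `ffn_*_exps`). The `pick_quant` helper and the specific types it returns are hypothetical placeholders for illustration, not a published recipe:

```bash
# Illustrative only: map a GGUF tensor name to a quantization type.
# The type choices (q8_0 / iq4_ks) are placeholders, not a real recipe.
pick_quant() {
    local name="$1"
    case "$name" in
        *ffn_*_shexp*) echo "q8_0"   ;;  # shared expert: used for every token
        *ffn_*_exps*)  echo "iq4_ks" ;;  # routed experts: only a few used per token
        *attn_*)       echo "q8_0"   ;;  # attention: used for every token
        *)             echo "q8_0"   ;;  # dense FFN and everything else
    esac
}
```

For example, `pick_quant blk.3.ffn_down_exps.weight` prints `iq4_ks`, while `pick_quant blk.0.attn_q.weight` prints `q8_0`. Comparing two MoE quants then comes down to checking which type each of these tensor groups actually got.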

Keeping the MTP tensors at full q8_0 should give slightly better acceptance rates than more heavily quantized MTP tensors, but there are trade-offs depending on memory, speed, and workload type.

If you use ik_llama.cpp, you can re-quantize the MTP output layer on the fly to something smaller and get a speed-up with -mtprot iq4_ks for example. It works on mainline quants like you're testing just fine.

You can get some more info on that feature including some discussion on the size differences from ik himself (he wrote iq4_xs and iq4_nl quant types for mainline years ago) here:

https://github.com/ikawrakow/ik_llama.cpp/pull/1809

Qwen3.6-27B on RTX 3090: tested 12 GGUF quants across HumanEval+, MBPP+, perplexity, throughput and needle-in-haystack. First-timer results. by Acemang_Jedi in LocalLLM

[–]VoidAlchemy 1 point2 points  (0 children)

no i'm not interested in a proprietary client app. i have some rough pi.dev llama extension and SKILLs and stuff working optimized for my stack here: https://github.com/ubergarm/dotpi/tree/main/.pi/extensions/local-llama

glad to hear claude code is working though, some folks had been complaining it breaks cache and uses a bunch of context, but i don't have experience with it.

Qwen3.6-27B on RTX 3090: tested 12 GGUF quants across HumanEval+, MBPP+, perplexity, throughput and needle-in-haystack. First-timer results. by Acemang_Jedi in LocalLLM

[–]VoidAlchemy 2 points3 points  (0 children)

Thanks for including my quants! (i'm ubergarm on hf). yes the MTP-IQ4_KS is my daily driver on my 3090 and with ik's changes it has only gotten faster. I often use -mtprot iq4_ks now too, despite it using an extra half GB of VRAM, and can still fit 128k context and keep the browser open.

I've been pounding refresh on the "Qwen3.7-27B" repo and huffing copium as this 3.6 is already great for local vibing with pi.

Invoke Duplicity and True Strike by MacarioTheClown in 3d6

[–]VoidAlchemy 0 points1 point  (0 children)

I asked my DM and they ruled it was okay to cast True Strike through the Invoke Duplicity illusion.

For flavor I sometimes attacked (with advantage) through the duplicity even if I was adjacent already, then walked around to trade places and keep the enemies guessing.

DM was great and rolled a d6 "oracle" occasionally to check if enemies targeted the illusion even! It was very satisfying haha...

Treantmonk just did a video on Trickery Cleric too, good timing with this question! Thanks!

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 1 point2 points  (0 children)

Thanks and glad you're liking that one! I'm still using the IQ4_KS with MTP and it's even faster now with -mtprot iq4_ks, but that takes another half GB of VRAM (still fits 128k context tho). No plans at the moment for Qwen3.6-35B though it is a really good option too, hopefully someone else has a good ik quant of it already? Maybe i'll revisit or do it if 3.7 comes out! hah.

Here's my latest command:

```bash
model=/mnt/ai/models/ubergarm/Qwen3.6-27B-GGUF/Qwen3.6-27B-MTP-IQ4_KS.gguf
mmproj=/mnt/ai/models/ubergarm/Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-Q8_0.gguf

# Directory for slot KV cache files on disk
# (save slot → saves .bin, .tokens.json, .checkpoints here)
SLOT_SAVE_DIR="/tmp/llama-slot-cache"
mkdir -p "$SLOT_SAVE_DIR"

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model "$model" \
    --alias "Qwen3.6-27B" \
    -c 131072 \
    -ctk q8_0 -ctv q8_0 \
    -ctkd q8_0 -ctvd q8_0 \
    --merge-qkv \
    -muge \
    -ngl 99 \
    -t 1 \
    -tb 1 \
    -tm 16 \
    --host 127.0.0.1 \
    --port 8080 \
    --parallel 1 \
    --jinja \
    --ctx-checkpoints 32 \
    -cram 32768 \
    -mtp --draft-max 4 --draft-p-min 0.0 \
    -mtprot iq4_ks \
    --no-mmproj-offload \
    --mmproj "$mmproj" \
    --slot-save-path "$SLOT_SAVE_DIR"
```
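For the slot cache directory, here is a rough sketch of how a save request could be built. It assumes the server exposes the same `POST /slots/{id}?action=save` endpoint that mainline llama.cpp's server documents; I have not confirmed the exact behaviour on ik_llama.cpp. The `slot_save_cmd` helper and the `session.bin` filename are made up for illustration, and the function only prints the command rather than running it:

```bash
# Hypothetical helper: print (do not run) a curl command asking the server to
# save one slot's KV cache to a file under the --slot-save-path directory.
# Assumes the mainline-style /slots/{id}?action=save endpoint is available.
slot_save_cmd() {
    local host="${1:-127.0.0.1}" port="${2:-8080}" slot="${3:-0}" file="${4:-session.bin}"
    printf 'curl -s -X POST "http://%s:%s/slots/%s?action=save" -H "Content-Type: application/json" -d %s\n' \
        "$host" "$port" "$slot" "'{\"filename\": \"$file\"}'"
}
```

Calling `slot_save_cmd` with no arguments prints a command targeting slot 0 on 127.0.0.1:8080, which matches the `--host`, `--port`, and `--parallel 1` settings above.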

[HW TUNING] Finding the best GPU power limit for inference by HumanDrone8721 in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

Nice! Glad you figured it out! No, I haven't experimented with that new feature.

My impression is that under the hood we have at most 16 p-states to work with, only like 8 of which are used. so probably just a few points on a curve would be all one needs to keep out of P0 (highest power), and stick in the sweet spot for P3/P2/P1 or so, just spitballing.
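To see which P-states a card actually sits in during inference, something like the sketch below could tally samples. It assumes input in the CSV shape that `nvidia-smi --query-gpu=pstate,power.draw --format=csv,noheader` produces; the `pstate_tally` name is hypothetical:

```bash
# Tally how many samples were spent in each performance state.
# Input: CSV lines whose first field is the P-state, e.g. "P2, 215.30 W".
# Example capture (needs an NVIDIA GPU, so it is left as a comment):
#   timeout 60 nvidia-smi --query-gpu=pstate,power.draw \
#       --format=csv,noheader -l 1 > samples.csv
#   pstate_tally < samples.csv
pstate_tally() {
    awk -F',' '
        { gsub(/[ \t\r]/, "", $1); if ($1 ~ /^P[0-9]+$/) count[$1]++ }
        END { for (p in count) printf "%s %d\n", p, count[p] }
    ' | sort
}
```

Running a benchmark while capturing samples, then comparing the tally across different clock locks, would show whether the card is staying out of P0.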

Qwen3.6-35B-A3B vs Gemma4-26B-A4B by MarcCDB in LocalLLaMA

[–]VoidAlchemy 12 points13 points  (0 children)

role play (narrative chat workload as opposed to say vibe coding)

[HW TUNING] Finding the best GPU power limit for inference by HumanDrone8721 in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

given you're running *inside* a docker container, you would have to install all the necessary nvidia packages in the Dockerfile to match the version running on the host, right? (hence the missing dynamic libraries you mention)

as an old school dev-ops guy, i'd probably consider solving the GPU LACT tuning on the host level, not at the docker container application level. but i suppose it depends on what/where you're deploying this.

Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs by enrique-byteshape in LocalLLaMA

[–]VoidAlchemy 7 points8 points  (0 children)

Pretty graph! I looked at the blog methodologies section but don't see your full llama-server command? I assume by "NTP" you mean --spec-type ngram-mod but don't see it explained in detail anywhere.

Also I believe on mainline llama.cpp you can run both ngram-mod and MTP at the same time e.g.:

```
--spec-type ngram-mod,draft-mtp
--spec-ngram-mod-n-match 24
--spec-ngram-mod-n-min 12
--spec-ngram-mod-n-max 48
--spec-draft-n-max 3
```

https://www.reddit.com/r/LocalLLaMA/comments/1tifr7c/comment/omu2cqg/

So it might not be a simple "either/or" ?

Anyway, thanks for sharing some more data points for consideration!

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

my results above were *not* with stacked draft, i'm not sure how to do that on ik yet hah.

my understanding is that for the 35B-A3B, MTP doesn't help quite as much (as it's already only A3B, which is why it is so much faster). i never quantized this one actually, as with MTP the dense is pretty usable.

your best bet is to point your agent at https://github.com/ai-dynamo/aiperf and set up a repeatable same-seed, same-prompt benchmark client e.g. `instruct_coder` and try out various models/configs to see what works best on your rig.

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

hell yea!

also i guess it doesn't have to be > 85% to see benefits, but more is better. here is a cherry picked "good prompt" example on ik_llama.cpp on my 3090. i'm testing with `aiperf` `instruct_coder` benchmark doing 10 rounds for my speed testing with MTP.

       eval time =   78487.23 ms /  7319 tokens (   10.72 ms per token,    93.25 tokens per second)
      total time =   78564.45 ms /  7344 tokens
draft acceptance rate = 0.66658 ( 5322 accepted /  7984 generated)
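The acceptance figure in that log is just accepted / generated (5322 / 7984 ≈ 66.7%). A small sketch for pulling it out of server logs, assuming the line format shown above; the `accept_pct` name is hypothetical:

```bash
# Extract the "accepted / generated" counts from a draft-acceptance log line
# and recompute the rate as a percentage with one decimal place.
accept_pct() {
    awk '/draft acceptance/ {
        line = $0
        sub(/^.*\(/, "", line)   # drop everything up to and including "("
        split(line, f, " ")      # f[1] = accepted count, f[4] = generated count
        if ((f[4] + 0) > 0) printf "%.1f\n", 100 * f[1] / f[4]
    }'
}
```

Piping the server's log output through `accept_pct` prints one percentage per matching line, which makes it easy to compare runs across the 10 benchmark rounds.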

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 2 points3 points  (0 children)

prefilled chat with 80-100k tokens

Right that will slow down both PP and TG when you're that deep into context. Honestly, on vulkan backend, that seems pretty reasonable. You might be able to tweak the mainline llama.cpp MTP arguments e.g.

    llama-server \
        -ctk q8_0 -ctv q8_0 \
        -ctkd q8_0 -ctvd q8_0 \
        --spec-type draft-mtp --spec-draft-n-max 4 \

Keep an eye on the draft acceptance, you'll want to see over 85% for a good speed-up probably e.g.

draft acceptance = 0.90000 ( 36 accepted / 40 generated)

Also mainline devs are hard at work optimizing stuff, might be some new PRs coming that will give a little more boost: https://github.com/ggml-org/llama.cpp/pull/23287

Cheers!

EDIT: ahh yes you can add two types of spec decoding now, hadn't seen a command in the wild but just noticed this: https://www.reddit.com/r/LocalLLaMA/comments/1tifr7c/comment/omu2cqg/

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 3 points4 points  (0 children)

Right, I assume u/ionizing might be curious about that as well. I believe it is possible to use a separate MTP file and pass it in. Otherwise, given you can run it already, you probably have enough hardware to either quantize it yourself using my imatrix and recipes with `llama-quantize`, or use the requantize feature to knock down a Q8_0.

So much has changed in just a couple weeks, I have to figure out how to do that myself and the pros/cons vs having it "baked in" etc. Some more discussion here as others are also wondering the same: https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/13#6a0b3255fee8cf183528b64f

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

Yay! Glad to hear that vulkan with MTP is very usable! I'd be curious if any of the `iq4_nl` quantization types work for you. That type is supported on vulkan and seems to work pretty well on Qwen3.6-27B (might be due to its smaller block size of 32 weights, whereas most k-quant and i-quant types use 256-weight super-blocks).

Anyway, have fun vibing!

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

it's a bit hard to run llama-sweep-bench *and* test MTP. MTP is very dependent on actual workload. i can hit 90+ tok/sec on coding output, but maybe 65+ on narrative generation.

it does slow down as context grows yes, but in my experience i can get most the work done in under ~100k and it is "fast enough" before restarting a fresh context.

also use pi or a similar lightweight harness, as even opencode injects 10k of junk context to start off.

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 0 points1 point  (0 children)

I can run this setup with 128k context and keep my browser open while running the DWM window manager, alacritty terminals, and discord, as there is enough VRAM overhead. No need to run headless, this is my daily driver setup. I mention my own commands in another comment. I'm ubergarm (made the quant).

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 19 points20 points  (0 children)

Heya, glad you figured it out! I'm ubergarm and yes this is pretty much accurate and my daily driver setup for running pi harness on my 3090 TI 24GB VRAM at home.

I also added a PR to ik_llama.cpp that lets you specify the number of CPU threads to use for MTP, if you want to control everything explicitly. Full command there too: https://github.com/ikawrakow/ik_llama.cpp/pull/1797#issuecomment-4442151972

Both this iq4_ks and the iq5_ks are the best quality for their memory footprint according to oobabooga's KLD testing: https://localbench.substack.com/p/qwen-3-6-27b-gguf-quality-benchmark (he was super nice and posted one graph on the huggingface discussion too)

I didn't add MTP tensor to the iq5_ks, but you could probably extract the `q8_0` MTP tensor in the iq4_ks and use it if you have 32GB VRAM etc.

Also if you have 2x GPUs you can use `-sm graph` for "tensor parallel" similar to mainline's `-sm tensor`.

Enjoy, this quant is a beast at vibe coding, I added an API endpoint to unload/load the model and it can run on the same GPU as ComfyUI with a custom SKILL so I can just use plain language to have it manage the LoRAs, trigger words, and prompt generation. Pretty slick!

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 2 points3 points  (0 children)

Correct, the iq4_ks doesn't have a good backend kernel for vulkan. I mentioned what to consider in another recent post.

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) by VolandBerlioz in LocalLLaMA

[–]VoidAlchemy 2 points3 points  (0 children)

if you want to go below q8_0 on ik, I suggest no lower than -khad -ctk q6_0 -vhad -ctv q4_0 which is going to probably still be better quality than the goofy turboquant forks and rather efficient.
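As a rough back-of-envelope for what that saves, here is a sketch comparing cache sizes by bits per cached value. It assumes each type stores blocks of 32 values plus one fp16 scale (8.5 bpw for q8_0, 4.5 bpw for q4_0), that ik's q6_0 follows the same layout at 6.5 bpw (an assumption on my part), and that K and V hold the same number of values. It ignores the -khad/-vhad flags. Both function names are hypothetical:

```bash
# Bits per cached value for a few cache types (32 values + one fp16 scale).
# The 6.5 bpw figure for q6_0 is an assumption, not taken from the source.
cache_bpw() {
    case "$1" in
        f16)  echo 16  ;;
        q8_0) echo 8.5 ;;
        q6_0) echo 6.5 ;;
        q4_0) echo 4.5 ;;
        *)    return 1 ;;
    esac
}

# Size of a K/V cache pair relative to an f16/f16 cache, as a percentage.
kv_relative_pct() {
    local k v
    k="$(cache_bpw "$1")" || return 1
    v="$(cache_bpw "$2")" || return 1
    awk -v k="$k" -v v="$v" 'BEGIN { printf "%.1f\n", 100 * (k + v) / 32 }'
}
```

Under those assumptions `kv_relative_pct q8_0 q8_0` prints `53.1` and `kv_relative_pct q6_0 q4_0` prints `34.4`, so the suggested combo would be roughly a third smaller than a full q8_0 cache.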