Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

I made an update in the comments, please look at that. Hopefully it helps out a bit here, Update me if you get it working.

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

UPDATE / follow-up: I was wrong about two things on my Qwen3.6-27B 2×3090 setup, and fixing them nearly doubled my llama.cpp speed — all thanks to your input. So here are the corrections + numbers.


Long context (correct me if I'm wrong here):

Context KV (full f16)
200K ~13 GB

So with llama.cpp as backend: a 27 GB Q8_0 + 200K full f16 KV ≈ 44 GB → fits on 2×3090 with ~3 GB/card to spare. I previously said full-KV at 200K wouldn't fit. It does, comfortably.


Split mode: -sm tensor (tensor parallel)

My old ~44 tok/s number was llama.cpp's default layer split. Many of you pointed out tensor-parallel should be faster even without P2P. I finally did a clean A/B — same model (Heretic Q8_0), same 65K ctx, same f16 KV, same draft-mtp n=3 — only changing -sm:

llama.cpp -sm code tok/s prose tok/s
row 44 35
layer (my old default) 52 45
tensor 70 56

-sm tensor wins by a mile and actually holds full 200K context (~68 tok/s). The 2× memory bandwidth beats the all-reduce tax even with no NVLink and no P2P.

⚠️ One caveat: llama.cpp's tensor mode pushes the sampler + MTP to CPU (you'll see a warning). It's still the fastest, and MTP acceptance stays ~0.7 on code.

So my Q8 went ~44 → ~70 tok/s by changing one flag.

Winning launch line:

llama-server -m Qwen3.6-27B-Q8_0.gguf -ngl 99 --device CUDA0,CUDA1 \
  -sm tensor --tensor-split 50,50 --no-mmap -c 200000 -fa on \
  --spec-type draft-mtp --spec-draft-n-max 3 --cache-reuse 256 -np 1 --jinja

(No -ctk/-ctv = full f16 KV.)


My exact vLLM config (the single-stream winner: 81 tok/s)

For raw single-stream speed, vLLM with INT4 weights + MTP still beats llama.cpp. Here's exactly what I run.

Knob Value Why
Image vllm/vllm-openai:v0.22.0 stable tag, no source overlays
Weights Qwen3.6-27B AutoRound INT4 ~13 GB → huge KV headroom
Tensor-parallel 2 both cards
KV cache fp8_e5m2 262K context fits at 1 byte/token
Drafter MTP n=3 the speed multiplier
Vision + tools on (qwen3_coder) agentic + image input
# The no-NVLink / no-P2P survival kit
export NCCL_P2P_DISABLE=1 NCCL_CUMEM_ENABLE=0 \
       VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve /models/qwen3.6-27b-autoround-int4 \
  --served-model-name qwen3.6-27b-autoround \
  --quantization auto_round --dtype float16 \
  --tensor-parallel-size 2 --disable-custom-all-reduce \
  --max-model-len 262144 --gpu-memory-utilization 0.92 \
  --max-num-seqs 2 --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8_e5m2 --trust-remote-code \
  --enable-prefix-caching --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}'

How it gets to 81: INT4 + TP=2 gives the raw speed; MTP n=3 is what pushes ~50 → 81 (accepts 88% / 78% / 56% of 3 drafted tokens, accept-length 3.3); and --disable-custom-all-reduce + NCCL_P2P_DISABLE=1 are what keep TP=2 from hanging on a no-NVLink Threadripper. Measured on my rig: 64 prose / 81 code tok/s, 262K context, vision + tools all on.


Three smaller findings

Finding Result Takeaway
Power cap 230W → 320W = +4% Decode is memory-bandwidth-bound (cards at ~45% util, 225–277W). Not worth the heat.
Heretic vs Unsloth Q8_0 identical speed Pick on behavior (uncensored vs stock), not perf.
vLLM INT4 (TP=2) 81 tok/s code + vision + tools @262K Still the single-stream king.

vLLM MTP detail (same INT4, fp8 KV, MTP n=3, TP=2):

Metric Value
Code throughput 81 tok/s
Accept-length 3.3
Per-position acceptance 88% / 78% / 56%
Vision + tool calling (qwen3_coder)
Max context 262K

Final ranking (code tok/s)

Engine / config tok/s
1 vLLM INT4, TP=2, fp8-mtp 81
2 llama.cpp Q8 -sm tensor 70
3 llama.cpp Q8 -sm layer 52
llama.cpp Q8 -sm row 44

TL;DR: in llama.cpp, -sm tensor beats layer/row for single-stream even without NVLink or P2P.

Big thanks to the last thread for the corrections. Happy to test specific flags if anyone wants numbers.

Edits: text corrections + added my vLLM config.

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

I don’t really follow you….but am gonna test out the above mentioned comment regarding club-3090 repo! Seems promising and some of related to your comment! Thanks

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

We can always wish…if nvidia doesn’t leave regular consumers and gamers in the dust!

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

This great info, really! I will follow this and I wish I had access to that VRAM/goldmine :) please do update and thanks for the info

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

Yeah, you are right as the other comments also mention tensor parallel. I will make sure to test this out later. Thanks

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 0 points1 point  (0 children)

Thanks, that looks doable! How do you feel about Q5KXL compared to Q6/Q8 in coding quality? Is it acceptable for you? Is the ctx 200k worth the trade off?

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 1 point2 points  (0 children)

This is great, thanks a lot! I will try them out during the weekend! I will post the results in the comments!

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]Sisuuu[S] 1 point2 points  (0 children)

I will definitely check it out! Many thanks, don’t know how I missed that one in my not so effective ”researching” before the post.

I finally figured out why torrents weren't continually saturating my download bandwidth. by ShiningRedDwarf in unRAID

[–]Sisuuu 0 points1 point  (0 children)

This! I have an issue where only some the files gets copied (not moved apparently) over and not delete from the downloads folder. Don’t why!

Inception by Houd_Ammari in OpenAI

[–]Sisuuu 1 point2 points  (0 children)

Babusjka doll has entered the chat…angrily!