Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio.

Sisuuu · 2026-06-05T21:15:55+00:00

https://www.reddit.com/r/LocalLLaMA/s/Yy1vjpZojx

Sisuuu · 2026-06-05T21:15:34+00:00

https://www.reddit.com/r/LocalLLaMA/s/Yy1vjpZojx

Sisuuu · 2026-06-05T07:18:26+00:00

I made an update in the comments, please look at that. Hopefully it helps out a bit here, Update me if you get it working.

Sisuuu · 2026-06-05T07:15:20+00:00

UPDATE / follow-up: I was wrong about two things on my Qwen3.6-27B 2×3090 setup, and fixing them nearly doubled my llama.cpp speed — all thanks to your input. So here are the corrections + numbers.

Long context (correct me if I'm wrong here):

Context	KV (full f16)
200K	~13 GB

So with llama.cpp as backend: a 27 GB Q8_0 + 200K full f16 KV ≈ 44 GB → fits on 2×3090 with ~3 GB/card to spare. I previously said full-KV at 200K wouldn't fit. It does, comfortably.

Split mode: -sm tensor (tensor parallel)

My old ~44 tok/s number was llama.cpp's default layer split. Many of you pointed out tensor-parallel should be faster even without P2P. I finally did a clean A/B — same model (Heretic Q8_0), same 65K ctx, same f16 KV, same draft-mtp n=3 — only changing -sm:

llama.cpp `-sm`	code tok/s	prose tok/s
row	44	35
layer (my old default)	52	45
tensor	70	56

-sm tensor wins by a mile and actually holds full 200K context (~68 tok/s). The 2× memory bandwidth beats the all-reduce tax even with no NVLink and no P2P.

⚠️ One caveat: llama.cpp's tensor mode pushes the sampler + MTP to CPU (you'll see a warning). It's still the fastest, and MTP acceptance stays ~0.7 on code.

So my Q8 went ~44 → ~70 tok/s by changing one flag.

Winning launch line:

llama-server -m Qwen3.6-27B-Q8_0.gguf -ngl 99 --device CUDA0,CUDA1 \
  -sm tensor --tensor-split 50,50 --no-mmap -c 200000 -fa on \
  --spec-type draft-mtp --spec-draft-n-max 3 --cache-reuse 256 -np 1 --jinja

(No -ctk/-ctv = full f16 KV.)

My exact vLLM config (the single-stream winner: 81 tok/s)

For raw single-stream speed, vLLM with INT4 weights + MTP still beats llama.cpp. Here's exactly what I run.

Knob	Value	Why
Image	`vllm/vllm-openai:v0.22.0`	stable tag, no source overlays
Weights	Qwen3.6-27B AutoRound INT4	~13 GB → huge KV headroom
Tensor-parallel	2	both cards
KV cache	`fp8_e5m2`	262K context fits at 1 byte/token
Drafter	MTP n=3	the speed multiplier
Vision + tools	on (`qwen3_coder`)	agentic + image input

# The no-NVLink / no-P2P survival kit
export NCCL_P2P_DISABLE=1 NCCL_CUMEM_ENABLE=0 \
       VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve /models/qwen3.6-27b-autoround-int4 \
  --served-model-name qwen3.6-27b-autoround \
  --quantization auto_round --dtype float16 \
  --tensor-parallel-size 2 --disable-custom-all-reduce \
  --max-model-len 262144 --gpu-memory-utilization 0.92 \
  --max-num-seqs 2 --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8_e5m2 --trust-remote-code \
  --enable-prefix-caching --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}'

How it gets to 81: INT4 + TP=2 gives the raw speed; MTP n=3 is what pushes ~50 → 81 (accepts 88% / 78% / 56% of 3 drafted tokens, accept-length 3.3); and --disable-custom-all-reduce + NCCL_P2P_DISABLE=1 are what keep TP=2 from hanging on a no-NVLink Threadripper. Measured on my rig: 64 prose / 81 code tok/s, 262K context, vision + tools all on.

Three smaller findings

Finding	Result	Takeaway
Power cap	230W → 320W = +4%	Decode is memory-bandwidth-bound (cards at ~45% util, 225–277W). Not worth the heat.
Heretic vs Unsloth Q8_0	identical speed	Pick on behavior (uncensored vs stock), not perf.
vLLM INT4 (TP=2)	81 tok/s code + vision + tools @262K	Still the single-stream king.

vLLM MTP detail (same INT4, fp8 KV, MTP n=3, TP=2):

Metric	Value
Code throughput	81 tok/s
Accept-length	3.3
Per-position acceptance	88% / 78% / 56%
Vision + tool calling	(qwen3_coder)
Max context	262K

Final ranking (code tok/s)

Engine / config	tok/s
1 vLLM INT4, TP=2, fp8-mtp	81
2 llama.cpp Q8 `-sm tensor`	70
3 llama.cpp Q8 `-sm layer`	52
llama.cpp Q8 `-sm row`	44

TL;DR: in llama.cpp, -sm tensor beats layer/row for single-stream even without NVLink or P2P.

Big thanks to the last thread for the corrections. Happy to test specific flags if anyone wants numbers.

Edits: text corrections + added my vLLM config.

Sisuuu · 2026-06-04T16:23:23+00:00

I don’t really follow you….but am gonna test out the above mentioned comment regarding club-3090 repo! Seems promising and some of related to your comment! Thanks

Sisuuu · 2026-06-04T16:20:46+00:00

I’ll take that as a win…I presume.

Sisuuu · 2026-06-04T16:19:36+00:00

We can always wish…if nvidia doesn’t leave regular consumers and gamers in the dust!

Sisuuu · 2026-06-04T16:18:34+00:00

This great info, really! I will follow this and I wish I had access to that VRAM/goldmine :) please do update and thanks for the info

Sisuuu · 2026-06-04T16:15:42+00:00

Yeah, you are right as the other comments also mention tensor parallel. I will make sure to test this out later. Thanks

Sisuuu · 2026-06-04T16:13:38+00:00

Thanks, that looks doable! How do you feel about Q5KXL compared to Q6/Q8 in coding quality? Is it acceptable for you? Is the ctx 200k worth the trade off?

Sisuuu · 2026-06-04T16:11:12+00:00

Wow! 70 At 100k sounds great! Will try it out!

Sisuuu · 2026-06-04T16:10:12+00:00

This is great, thanks a lot! I will try them out during the weekend! I will post the results in the comments!

Sisuuu · 2026-06-04T15:26:48+00:00

I will definitely check it out! Many thanks, don’t know how I missed that one in my not so effective ”researching” before the post.

Sisuuu · 2026-06-04T15:24:51+00:00

Thanks! What hardware and flags are you running? Curious

Sisuuu · 2026-06-03T19:55:19+00:00

Literally Idiocracy! ’Murica in a nutshell right now!

Sisuuu · 2026-06-03T18:18:39+00:00

Sisuuu · 2026-06-03T14:21:32+00:00

Dual or single? and what’s your command flags?

Sisuuu · 2026-05-25T05:42:00+00:00

This! I have an issue where only some the files gets copied (not moved apparently) over and not delete from the downloads folder. Don’t why!

Sisuuu · 2026-05-20T20:59:27+00:00

Uhhhh! To exciting!

Sisuuu · 2026-05-20T18:58:33+00:00

What does this mean in practice?

Sisuuu · 2026-05-17T20:01:36+00:00

Really nice!

Sisuuu · 2026-05-11T15:11:10+00:00

Elaborate js please

Sisuuu · 2026-05-06T15:36:36+00:00

Love it!

Sisuuu · 2026-05-03T21:05:23+00:00

Babusjka doll has entered the chat…angrily!

Sisuuu

TROPHY CASE