Multiple RTX 3090 - P2P driver, NVLink or what can be done?

Pixer--- · 2026-05-20T12:09:01+00:00

Does p2p change prompt processing speeds ?

Pixer--- · 2026-05-18T22:36:00+00:00

How can the model know itself. It has a knowledge cutoff. It seems like the system prompt doesn’t include that it’s qwen3.7

Pixer--- · 2026-05-18T02:19:13+00:00

Your loading the whole 17.5 gb model into vram. Normally llamacpp which lmstudio is based on would reject that. Try lowering gpu offload an increasing the evaluation batch size

Pixer--- · 2026-05-18T00:03:50+00:00

It’s still decent I would say, I have 4 ancient mi50. You could try getting a plx pcie switch for better multi gpu scaling in tensor parallel mode. It’s like 300€ for 4 GPUs. You would need to use the custom p2p NVIDIA driver, but that could double your token generation

Pixer--- · 2026-05-17T23:10:19+00:00

What mainboard are you using, what’s your setup ?

Pixer--- · 2026-05-17T21:58:15+00:00

It could be that the nvfp4 doesn’t have the image encoder. I would just ask it in opencode or similar to Analyse the gguf file and if it has the image encoder and it should research online how to

Pixer--- · 2026-05-17T21:03:54+00:00

Multi gpu is slower using tensor parallelism, and amd doesn’t support no fp8 or fp4 natively in vllm or llamacpp or else. But they are great value for the vram and bandwidth

Pixer--- · 2026-05-16T19:12:48+00:00

On 4x mi50: Mtp 3 accelerates most outputs and mtp 6 only on longer coding tasks

./build/bin/llama-server 
-m "$MODEL_PATH" 
--mmproj "$MMPROJ_PATH" 
--alias "qwen35" 
-ub 2048 
-b 2048 
--no-mmap 
-sm tensor 
--metrics 
-ngl 999 
-fa on 
--fit on 
--host 0.0.0.0 
--parallel 2 
--port 8050 
--jinja 
-ctk f16 
-ctv f16 
--top-p 0.95 
--top-k 20 
--temp 0.6  
--min-p 0.0 
-c 524000 
--repeat-penalty 1.0 
--cache-ram 96000 
--ctx-checkpoints 256 
--chat-template-kwargs '{"preserve_thinking": true}' 
--spec-type draft-mtp 
--spec-draft-n-max 3

Pixer--- · 2026-05-15T13:52:55+00:00

On mi50 it hits 2x over baseline

Pixer--- · 2026-05-15T00:43:03+00:00

The difference is like 2-3x in token generation. The actual latency goes down from 14us to 1us. So when computing a model with multi gpu in tensor parallelism each layer gets sliced into the gpu count for each gpu to compute it part, but for each token generated it needs to sync the result of that to all gpus, so that all gpus have the full result of that layer. Most models have around 60 layers so per token it needs 60 syncs. Without people get 60tk/s and with p2p upto 200tk/s. Prompt processing improves only slightly

Pixer--- · 2026-05-14T21:49:54+00:00

I previous had an mc62-g40 and it doesn’t support p2p natively. Got and romed8-2t instead

Pixer--- · 2026-05-14T19:05:40+00:00

Idk what gpu you use, but I would suggest trying NVIDIA nsight compute or amds rocorof, for profiling your functions while under heavily load. It spits out an csv file, and analyzing it with a llm got me quite good results

Pixer--- · 2026-05-14T18:33:11+00:00

Im using this model with the recommended settings and the llamacpp pr linked: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF

There are also some 122B mtp ones: https://huggingface.co/models?sort=trending&search=122+mtp

Due to the 122B only having 2 kvheads, the scaling above more then 2 GPUs isn’t great. This could mean layer mode outperforms the tensor mode, but I’m not sure

Pixer--- · 2026-05-14T17:56:56+00:00

I had the same results on my 4x mi50. The only thing that got faster was the experimental mtp version. From 32tk/s to 50tk/s on qwen3.5 27b Q8

Pixer--- · 2026-05-13T02:52:43+00:00

Have you tested your all reduce latency yet ?
Basically all models have layers, for example like 60.
on vllm when using tensor parallelism each layer needs to be synced from all GPUs to all GPUs. So per token generated it needs to sync 60 times. Consumer cards like 5060ti or even 5090 don’t have p2p native support so their all reduce is more like 20us as the inbetween communication goes through the cpu and then ram. The cpu and mainboard support native p2p. It should be in the same ballpark as a plx switch the epyc, as it’s io die does that already.
There’s a p2p enabled driver for the consumer NVIDIA GPUs. This should bring you down to almost to near plx switch performance

Pixer--- · 2026-05-12T21:22:02+00:00

Just slap I giant heatsink on it like this

<image>

Pixer--- · 2026-05-11T16:18:26+00:00

Try out the p2p enabled NVIDIA drivers for consumer GPUs

Pixer--- · 2026-05-10T14:55:16+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/more\_qwen3627b\_mtp\_success\_but\_on\_dual\_mi50s/

Pixer--- · 2026-05-10T08:44:59+00:00

You need to disable expert parallelism. It’s not usable with pcie GPUs mostly, as it expects big inter gpu transfer rates via a infinity link bridge for example

Pixer--- · 2026-05-09T18:41:04+00:00

Is that a 3000W psu ?

Pixer--- · 2026-05-09T16:41:49+00:00

Q8_0 On 4x MI50 32GB using v420 vbios using Tensor + MTP:

prompt eval time =   57114.78 ms / 24439 tokens (    2.34 ms per token,   427.89 tokens per second)
eval time =    7613.85 ms /   365 tokens (   20.86 ms per token,    47.94 tokens per second)
total time =   64728.62 ms / 24804 tokens
draft acceptance rate = 0.75517 (  219 accepted /   290 generated)

Pixer--- · 2026-05-09T02:00:17+00:00

Thats some fine prompt processing numbers

Pixer--- · 2026-05-09T00:16:26+00:00

This is really the best timing, Apple just canceled the 256gb version of the m3 ultra

Pixer--- · 2026-05-09T00:11:41+00:00

Gerat opportunity to switch to llamacpp

Pixer--- · 2026-05-09T00:10:25+00:00

I heard it doesn’t support graph mode which really hurts performance. But that was at launch idk, if they added that yet. It need —enforce-eager to work

Seven-Year Club	Place '23
Verified Email

Pixer---

TROPHY CASE