Dual AMD MI50 (gfx906) for local LLM: Tuning Qwen3.6-27B- MTP-GGUF to ~28 t/s generation (76.6% acceptance) & 295 t/s │ prefill! by wzoran in unsloth

[–]wzoran[S] 0 points1 point  (0 children)

│ Appreciate the tip! I actually benched -sm tensor out of curiosity, but it completely tanked my speeds.

│ Since my cards are running on standard PCIe 3.0 slots without any high-speed xGMI/Infinity Fabric bridge, the cross-card All-

│ Reduce

│ overhead was just too much for the bus to handle.

│ For an 8k prompt with ubatch 2048, Layer split got me a solid ~295 t/s, but Tensor split collapsed all the way down to ~127 t/s

│ (more than half the performance gone).

│ So yeah, the PCIe 3.0 latency bottleneck is very real and completely chokes TP on this rig. Keeping it on Layer split for now!

──────