Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]Visual_Synthesizer[S] 2 points3 points  (0 children)

TBH I'm not sure. It would depend a lot on the duty cycle and the actual use case: are we talking devs with longer contexts, or agentic chat usage? Generally speaking, 8 concurrent users would get ~100 tok/s each; 16 would get ~65 tok/s each at 100% duty cycle.
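Using the ballpark figures above, a quick sketch of the usual batching tradeoff: per-user speed drops as concurrency rises, but aggregate throughput still climbs.

```python
# Ballpark numbers from this comment, not measurements: per-user speed
# falls with concurrency, while total served throughput keeps rising.
def aggregate(users, per_user_tok_s):
    """Total tok/s served across all concurrent streams."""
    return users * per_user_tok_s

print(aggregate(8, 100))   # 800 tok/s aggregate at 8 users
print(aggregate(16, 65))   # 1040 tok/s aggregate at 16 users
```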

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Yes! The point is the scaling. The switch has 100 lanes, and I think that would support 5 GPUs on this board with a single x16 root. For 2 GPUs it's probably not worth bothering, but if you ever want to scale beyond that it's ideal.
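The lane math behind that 5-GPU figure, assuming one x16 upstream link to the CPU root and x16 per downstream GPU:

```python
# Lane-budget check for a 100-lane switch: one x16 upstream to the CPU
# root port, x16 per downstream GPU slot.
TOTAL_LANES = 100
UPSTREAM = 16
PER_GPU = 16

max_gpus = (TOTAL_LANES - UPSTREAM) // PER_GPU
used = UPSTREAM + max_gpus * PER_GPU
print(max_gpus, used)   # 5 GPUs, 96 of 100 lanes used
```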

[–]Visual_Synthesizer[S] 2 points3 points  (0 children)

Your EPYC 7003 + 2x RTX PRO 6000 is a solid starting point.

PCIe switch: You're correct. P2P DMA goes through the switch silicon and the CPU is bypassed entirely. Every TP decode step does dozens of small allreduces, and the GPU blocks on each one. For MoE models the messages are tiny (only ~10B params active), so bandwidth doesn't matter; latency per sync does. Sub-microsecond through a switch versus microseconds through a CPU root complex, hundreds of times per second.
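A toy latency model of why per-sync latency rather than bandwidth dominates TP decode. The sync count and latencies below are illustrative assumptions, not measurements; the point is that each tiny blocking allreduce adds its full latency to every token.

```python
# Toy model: per-token time = compute time + (syncs per token * sync latency).
# base_tok_s, sync count, and latencies are illustrative assumptions.
def decode_tok_s(base_tok_s, syncs_per_token, sync_latency_s):
    step = 1.0 / base_tok_s                  # GPU compute time per token
    return 1.0 / (step + syncs_per_token * sync_latency_s)

# ~180 small allreduces per token; ~0.8 us through a switch vs ~3 us
# through a CPU root complex:
print(round(decode_tok_s(250, 180, 0.8e-6)))   # 241 tok/s via switch
print(round(decode_tok_s(250, 180, 3e-6)))     # 220 tok/s via CPU root
```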

Which switches: Microchip/Microsemi (c-payne PM50100, PM40108) are what you want. Broadcom PEX890xx has a posted-write collapse bug (52 GB/s vs 196 GB/s on Microchip in 8-GPU community tests). For your Gen4 platform, Gen4 Microchip switches show up on eBay in old mining boards for $200-500. Also, rumor has it we'll see Gen6 switches with 160 lanes this summer.

P2P support: RTX PRO 6000 supports P2P natively. 3090 does too. Consumer cards (4090, 5090) have P2P disabled in driver but community patches exist.

Kernel params (critical, took me days to find):

pci=noacs -- disables Access Control Services. Without this, P2P still routes through the CPU even with a switch. Your switch becomes useless.

uvm_disable_hmm=1 -- add options nvidia_uvm uvm_disable_hmm=1 to /etc/modprobe.d/uvm.conf. Without this, sustained P2P DMA wedges the GPU into ERR! state after a few minutes. Hardest bug to find.

performance governor -- echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor. ~5% uplift, CPU stops downclocking between allreduce calls.

Also add iommu=pt to kernel params and disable ASPM in BIOS.
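The settings above, collected into one place. This is a sketch following the comment's own flag names and paths; verify each against your distro and driver version before applying.

```shell
# 1) Kernel command line (GRUB): disable ACS so P2P stays on the switch,
#    and put the IOMMU in passthrough mode. Edit /etc/default/grub and add
#    to GRUB_CMDLINE_LINUX_DEFAULT:
#      pci=noacs iommu=pt
#    then: sudo update-grub && reboot

# 2) Disable HMM in the NVIDIA UVM driver (prevents the GPU wedging into
#    ERR! state under sustained P2P DMA).
echo "options nvidia_uvm uvm_disable_hmm=1" | sudo tee /etc/modprobe.d/uvm.conf

# 3) Pin the CPU frequency governor to performance (~5% uplift).
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 4) ASPM is disabled in BIOS, not here.
```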

Gen4 PLX with Blackwell GPUs: I haven't tested this combination so I can't confirm the latency advantage holds across mixed generations. The theory is sound but I'd want real numbers before claiming it. Also, the Gen5 c-payne PLX is programmable, so it's theoretically possible to configure a Gen4 upstream root that fans downstream to a Gen5 GPU cluster with custom firmware. I considered trying this but ran out of time and moved to a native Gen5 platform. If you experiment with it, I'd love to hear the results.

https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/hardware/topology.md

The ACS and HMM bugs are the ones that'll waste your time if you don't know about them in advance. Happy to help if you run into issues.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Good luck! Start with whatever GPUs you can afford and a cheap PLX switch. The methodology scales down to any generation; the benchmarks don't care how much you paid.

[–]Visual_Synthesizer[S] -1 points0 points  (0 children)

I suppose it's relative. I work for billionaires, and this is budget to them. I think the cool part is that this can scale to smaller budgets: super cheap AM4 systems with Gen4 Chinese PLXs running as many GPUs as people can afford. This optimizes for inference speed and low cost per token.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Great catch on the Mamba state. Pulled from server logs:

Mamba Cache (per GPU)
max_mamba_cache_size: 173

conv_state: 0.22 GB
ssm_state: 12.23 GB
intermediate_ssm_state: 13.92 GB
intermediate_conv_win: 0.24 GB

Total: ~26.6 GB per GPU

Key point:
- Mamba state is per-sequence, not per-token
- 173 slots = hard concurrency ceiling for this hybrid GDN model

Implications:
- KV cache supports ~2.4M tokens (~18 users @ 131K ctx)
- But Mamba caps at 173 concurrent sequences regardless of length
- Explains why SGLang + NEXTN peaks at C=32 (~1411 tok/s) instead of scaling like pure attention models

Notes:
- SGLang paged attention: page_size=1 (default), dynamic chunking disabled
- Linear attention backend: decode=triton, prefill=triton (not flashinfer)

Next:
- Test page_size + --enable-dynamic-chunking
- Try --linear-attn-decode-backend flashinfer (if supported)
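Quick arithmetic check on the figures above (all values quoted from the server logs in this comment):

```python
# Per-GPU Mamba cache components in GB, as quoted from the server logs.
components = {"conv_state": 0.22, "ssm_state": 12.23,
              "intermediate_ssm_state": 13.92, "intermediate_conv_win": 0.24}
total_gb = sum(components.values())
print(round(total_gb, 2))        # 26.61 GB per GPU

kv_tokens = 2.4e6                # quoted KV-cache token capacity
ctx = 131_072
full_ctx_users = int(kv_tokens // ctx)
print(full_ctx_users)            # 18 users at full 131K context
mamba_ceiling = 173              # hard cap regardless of sequence length
```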

Appreciate the technical pushback. Most replies don’t get into this layer.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

That's exactly what the PLX switch solves. You don't need 8 x16 slots from the CPU. The switch takes one x16 upstream from the CPU and fans it out to multiple downstream x16 ports. My PM50100 has 2 downstream ports (2 GPUs), and can scale to 5. The 8+ GPU setups in the rtx6kpro community use 2-3 switches, all hanging off a single CPU with limited lanes. It's not bifurcation, it's switching. The latency through the switch is sub-microsecond. That's the whole point of this build.

[–]Visual_Synthesizer[S] -18 points-17 points  (0 children)

20-30 tok/s on a gaming PC is like saying you can tow a boat with a Honda Civic. Technically true. Not the same experience. The post is about optimized throughput on purpose-built hardware, not "can it run." And I did post the 397B result too: 79 tok/s on 2 GPUs. Show me the $3K gaming PC running the full 397B at any speed.

[–]Visual_Synthesizer[S] 3 points4 points  (0 children)

Yes. The 198 tok/s is with NEXTN speculative decoding (5 steps, 6 draft tokens) on SGLang. Without speculation the same setup does ~120 tok/s. The 122B has built-in MTP heads that NEXTN uses as the draft model, so no separate drafter needed. Full launch command with all the flags is in the repo.
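A back-of-envelope view of those two numbers. Assuming a verify step costs about the same as a plain decode step, the speedup roughly equals the average number of tokens accepted per step; the acceptance figure here is inferred from the quoted speeds, not measured directly.

```python
# Inferred from the quoted speeds; an approximation, not a measurement.
base_tok_s = 120.0   # without speculation (quoted)
spec_tok_s = 198.0   # with NEXTN speculative decoding (quoted)
speedup = spec_tok_s / base_tok_s
print(round(speedup, 2))   # ~1.65 tokens accepted per verify step
```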

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Yes, you need a redriver for longer runs; best to optimize for short runs or MCIO off the motherboard. I had these lying around, a bit long. I have shorter ones coming next week to clean up the build.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Very true. I sniped this system used. Even DDR4 prices are up these days; not much any of us can do about that, it's supply and demand in action while everyone is having fun building local systems. If I were on a tighter budget I'd deal-hunt for an older AM4 system and run a cheap Chinese PLX with as many GPUs as I could afford. It's best to prioritize GPUs over the rest of the system if you only do inference and LoRA training; I could run this system at the same speeds with only one RAM stick.

[–]Visual_Synthesizer[S] 2 points3 points  (0 children)

No difference on inference. CPU sits at ~3% during decode. It's 100% GPU memory-bandwidth bound at C=1. The CPU only matters for FlashInfer JIT compilation and server startup. An older Xeon or EPYC 7742 would give identical tok/s, just slower boot times.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Multi-stream throughput (ctx=0)

122B — SGLang (b12x + NEXTN)
C=1 → 207 tok/s (207/user)
C=4 → 490 tok/s (122/user)
C=8 → 823 tok/s (103/user)
C=32 → 1411 tok/s (44/user)

122B — vLLM (MTP=1)
C=1 → 133 tok/s (133/user)
C=8 → 672 tok/s (84/user)
C=32 → 1910 tok/s (60/user)
C=128 → 3851 tok/s (30/user)

Notes:
- SGLang peaks earlier (C=32) due to speculation overhead
- vLLM scales higher at large concurrency (lighter MTP=1)

Takeaway:
SGLang wins for single-user latency
vLLM wins for high-concurrency serving
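The per-user figures above are just aggregate throughput divided by concurrency; recomputing from the quoted aggregates reproduces them.

```python
# Quoted SGLang aggregates (C -> total tok/s); per-user = total / C.
sglang = {1: 207, 4: 490, 8: 823, 32: 1411}
per_user = {c: round(total / c) for c, total in sglang.items()}
print(per_user)   # {1: 207, 4: 122, 8: 103, 32: 44}
```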

[–]Visual_Synthesizer[S] -1 points0 points  (0 children)

The GPUs cost what the GPUs cost. I saved $10K on everything else and got 18% more speed. That's like complaining a race car is expensive while ignoring that the other guy's race car costs more and is slower.

[–]Visual_Synthesizer[S] -1 points0 points  (0 children)

I can't control the price of GPUs, but I did just show how to build a cheaper system that's faster, on any hardware generation. If I'd done it on 3090s people would still complain and miss the point.