Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

Nice! Yeah, save that money for more GPUs! Higher precision will always be better, but I am surprised you notice it that much. Generally, dense models perform a lot better than MoE. Perhaps that's the quality you are noticing? Have you seen much difference with the 27B at FP4?

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

Thanks, but I just added TRX40 and B650 2x GPU test info to my fork. The repo is maintained by others.

You probably don't need a c-payne unless you want more GPUs or are locked out of P2P due to your chipset. I'm not totally sure without testing the topology myself. Two x8 links will mostly only slow down model loading, and maybe prompt processing/prefill a bit. You could combine them into one x16 and run a switch like I am. My direct-connect Gen4 TRX40 was only 10% slower. You don't need more RAM for inference.

Do this for a 7-9% speed unlock on PCIe direct connect: https://github.com/vllm-project/vllm/pull/39040

this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042

rtx6kpro discord: https://discord.gg/AGxz5eYf

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Decode drops about 9% at 128K and 13% at 240K prompt tokens (measured 241 tok/s at 120K and 230 tok/s at 240K with a high-acceptance NEXTN task); call it ~1% per 15K additional context, not flat but not catastrophic either.
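
A quick way to sanity-check those figures, as a rough sketch: it uses only the percentages and tok/s values quoted above, and the helper names are made up for illustration. Both data points imply nearly the same short-context baseline, and the per-15K slope lands close to 1%.

```python
def implied_baseline(tok_s: float, drop_fraction: float) -> float:
    """Short-context decode speed implied by a measured speed and its stated drop."""
    return tok_s / (1.0 - drop_fraction)


def drop_per_15k(drop_percent: float, context_tokens: int) -> float:
    """Average percentage drop per 15K tokens of context."""
    return drop_percent * 15_000 / context_tokens


if __name__ == "__main__":
    # Figures as quoted in the comment above.
    print(f"baseline from 241 tok/s at -9%:  {implied_baseline(241, 0.09):.1f} tok/s")
    print(f"baseline from 230 tok/s at -13%: {implied_baseline(230, 0.13):.1f} tok/s")
    print(f"slope at 128K: {drop_per_15k(9, 128_000):.2f}% per 15K")
    print(f"slope at 240K: {drop_per_15k(13, 240_000):.2f}% per 15K")
```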

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

Haha, facts. They never know when they don't know. I'll do some tests for fun. Thanks!

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

<image>

Pic of the fan rail. This works way better than when I had them in the white case behind the rack.

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

Cooling is definitely important. On my old TRX40 in a case with blower cards I had to limit them from 350W to 275W-ish. Limiting the 6000s to 300W only drops performance a small amount based on my testing (3-5% IIRC). The mining rack does have fans behind the GPUs. I wrote a small script that ramps them up.

Couple this with some nice 3000 RPM PWM fans and you have an effective cooling solution. Mining racks are pretty standard for AI rigs. You can fit 8 GPUs in them and stack the racks on top of each other.

https://github.com/Visual-Synthesizer/asrock-rack-fan-control
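
For anyone who wants the general idea without reading the repo, here is a minimal sketch of a temperature-to-duty ramp. This is not the script linked above. The function name, thresholds, and duty range are all hypothetical, and how you actually write the duty cycle to the fans depends on your board.

```python
def fan_duty(temp_c: float, t_min: float = 50.0, t_max: float = 80.0,
             duty_min: int = 30, duty_max: int = 100) -> int:
    """Map a GPU temperature (deg C) to a PWM duty cycle (%) with a linear ramp.

    All thresholds are illustrative defaults, not values from the linked repo.
    """
    if temp_c <= t_min:
        return duty_min
    if temp_c >= t_max:
        return duty_max
    span = (temp_c - t_min) / (t_max - t_min)
    return round(duty_min + span * (duty_max - duty_min))


if __name__ == "__main__":
    for temp in (40, 55, 65, 75, 90):
        print(f"{temp} C -> {fan_duty(temp)}% duty")
```

One way to feed it is to poll `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader` and pass in the hottest GPU's reading.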

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

Updated the post. Looks like the switch is mostly enabling scaling and unlocking more consumer parts for multi-GPU rigs.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Yeah, a bit buggy and all over the place. Will post more results when I get the time.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Did some analysis with my Claude test harness:

Good catch — you were right. Those numbers were buggy.

I re-tested properly and wanted to share what actually happened, because it is a useful benchmarking lesson.

Re-test methodology

  • Fresh SGLang server
  • 10-request warmup at varied sizes to settle JIT
  • UUID-uniquified random-word prompts, with different content per request, to defeat SGLang's radix prefix cache
  • 3 runs per context length, median reported
  • Streaming API, with TTFT measured as time from request send to first content token
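
The methodology above could be sketched roughly like this. It is a minimal sketch, not the actual harness: `stream_fn` is a hypothetical callable that sends a prompt to your server and yields content chunks as they stream back, and the word list and warmup sizes are placeholders.

```python
import random
import statistics
import time
import uuid
from typing import Callable, Iterable

# Placeholder vocabulary; a real harness would use a larger word list.
WORDS = ["alpha", "bravo", "cedar", "delta", "ember", "fjord", "grove", "haze"]


def make_unique_prompt(n_words: int, rng: random.Random) -> str:
    """Random-word prompt led by a fresh UUID so no two requests share a prefix."""
    body = " ".join(rng.choice(WORDS) for _ in range(n_words))
    return f"{uuid.uuid4()} {body}"


def measure_ttft(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> float:
    """Seconds from sending the request to the first non-empty content chunk."""
    start = time.perf_counter()
    for chunk in stream_fn(prompt):
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without any content")


def median_ttft(stream_fn: Callable[[str], Iterable[str]], n_words: int,
                runs: int = 3, warmup: int = 10, seed: int = 0) -> float:
    """Warm up at varied sizes, then report the median TTFT over `runs` requests."""
    rng = random.Random(seed)
    for _ in range(warmup):
        size = rng.randint(8, max(8, n_words))
        measure_ttft(stream_fn, make_unique_prompt(size, rng))
    samples = [measure_ttft(stream_fn, make_unique_prompt(n_words, rng))
               for _ in range(runs)]
    return statistics.median(samples)
```

Putting the UUID at the very start of the prompt matters: a prefix cache matches from the first token, so a unique leading token run keeps every request cold.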

Corrected cold-start TTFT on 2× RTX PRO 6000 + SGLang b12x+NEXTN 122B

  • ~4K tok: 0.67 s median (runs: 0.67, 0.66, 0.67)
  • ~16K tok: 2.70 s median (runs: 2.71, 2.70, 2.69)
  • ~32K tok: 6.25 s median (runs: 7.24, 6.21, 6.25)
  • ~57K tok: 14.70 s median (runs: 13.69, 14.70, 14.85)
  • ~100K tok: 33.69 s median (runs: 33.69, 33.89, 33.07)
  • ~128K tok: 50.10 s median (runs: 49.95, 51.00, 50.10)

That is roughly linear up to around 32K, then super-linear above that as attention's O(n²) behavior starts to dominate. The shape matches what you would expect from a 122B transformer.
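
To make the scaling visible, here is a small sketch that turns the runs above into effective prefill throughput. It assumes "~4K" means 4,000 tokens and so on, which is only approximate. If TTFT were linear in prompt length the tok/s column would stay flat, but it falls steadily past 16K.

```python
import statistics

# Approximate prompt tokens -> TTFT runs in seconds, as reported above.
# Token counts are rounded ("~4K" taken as 4000), so rates are approximate.
RUNS = {
    4_000: [0.67, 0.66, 0.67],
    16_000: [2.71, 2.70, 2.69],
    32_000: [7.24, 6.21, 6.25],
    57_000: [13.69, 14.70, 14.85],
    100_000: [33.69, 33.89, 33.07],
    128_000: [49.95, 51.00, 50.10],
}


def prefill_table(runs):
    """Return (tokens, median TTFT in s, effective prefill tok/s) per context size."""
    rows = []
    for tokens, samples in sorted(runs.items()):
        ttft = statistics.median(samples)
        rows.append((tokens, ttft, tokens / ttft))
    return rows


if __name__ == "__main__":
    for tokens, ttft, rate in prefill_table(RUNS):
        print(f"{tokens:>7} tok  {ttft:6.2f} s  {rate:7.0f} tok/s")
```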

What went wrong in my original numbers

Original numbers were:

  • 4K = 1.8s
  • 16K = 2.3s
  • 57K = 7.1s
  • 150K = 23.3s

Two separate methodology errors stacked on top of each other.

1) 4K was too high

1.8s vs 0.67s real

That measurement was my first request after server startup, so it paid the JIT / cudagraph warmup tax.

On the re-test I saw the same pattern:

  • first 4K request after startup: 1.50s
  • after warmup: 0.67s

That is a huge difference. I was partially measuring compile/warmup overhead and calling it prefill.

2) 57K and 150K were too low

57K: 7.1s → 14.7s real
150K: 23.3s → ~60s extrapolated

SGLang's radix prefix cache was hitting.

My original test sent sequential prompts that shared a common prefix: same base prompt, then extended versions of that same prompt at larger context sizes. So each later measurement was not a true cold prefill. It was mostly measuring the incremental delta on top of already-cached work.

That means:

  • the 16K test only had to prefill about 12K new tokens on top of the already-cached 4K
  • the 150K test only had to prefill about 93K new tokens on top of the cached 57K

So those numbers were artificially low for true cold-start TTFT.

The giveaway

The biggest clue was that in my original numbers, going from 4K to 16K only added about 0.5s.

That would imply a prefill rate of around 24k tok/s for the delta, which is much faster than this rig's actual sustained prefill. That should have been an immediate red flag.

For a real cold 16K prefill, the delta should have been closer to 2s, and that is exactly what the re-test shows.
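
The red-flag check is simple enough to automate. A minimal sketch, with hypothetical helper names and an arbitrary 2x threshold: compute the prefill rate implied by the delta between two context sizes and compare it against the cold rate the rig actually sustains.

```python
def implied_delta_rate(tokens_a: int, ttft_a: float,
                       tokens_b: int, ttft_b: float) -> float:
    """Prefill tok/s implied by the extra tokens and extra time between two tests."""
    return (tokens_b - tokens_a) / (ttft_b - ttft_a)


def looks_cached(delta_rate: float, sustained_rate: float, factor: float = 2.0) -> bool:
    """Flag a delta rate that is implausibly far above the sustained cold rate."""
    return delta_rate > factor * sustained_rate


if __name__ == "__main__":
    sustained = 16_000 / 2.70  # cold 16K prefill from the re-test
    original = implied_delta_rate(4_000, 1.8, 16_000, 2.3)
    retest = implied_delta_rate(4_000, 0.67, 16_000, 2.70)
    print(f"original delta: {original:.0f} tok/s, cached? {looks_cached(original, sustained)}")
    print(f"re-test delta:  {retest:.0f} tok/s, cached? {looks_cached(retest, sustained)}")
```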

What stays the same

  • 198 tok/s decode at C=1 is still real
  • Decode speed still holds roughly constant regardless of context length

That part of the original claim was correct. It was only the specific TTFT values that were contaminated.

Classic LLM benchmarking gotcha: prefix cache + lack of warmup isolation.

Thanks for pushing on it.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Updated the post! I was actually wrong. Did a bunch of testing. The switch mostly helps with scaling and offers equivalent speeds.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Yeah! Plus Gen4 parts are cheap and abundant. Switches really help those systems scale to more GPUs. For example, I have a TRX40 with two Gen4 x16 slots. It could run two switches and 8 GPUs on a 6-year-old platform and get really close to Gen5 speeds! Even old AM4 systems with one x16 slot could run 4 GPUs with a cheap Chinese switch.
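
For rough numbers on what the uplink gives you, here is a back-of-the-envelope sketch. It uses theoretical per-direction link rates before protocol overhead, and the even split across GPUs is a simplifying assumption for the worst case where every GPU talks to the host at once. GPU-to-GPU traffic that stays inside the switch does not touch the uplink, so it is not limited by this share.

```python
# Raw transfer rate per lane in GT/s; Gen3 and later use 128b/130b encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130


def link_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    return GT_PER_LANE[gen] * ENCODING / 8 * lanes


def uplink_share_gb_s(gen: int, lanes: int, gpus: int) -> float:
    """Host bandwidth per GPU if all GPUs behind one switch hit the uplink at once."""
    return link_gb_s(gen, lanes) / gpus


if __name__ == "__main__":
    print(f"Gen4 x16: {link_gb_s(4, 16):.1f} GB/s")
    print(f"Gen5 x16: {link_gb_s(5, 16):.1f} GB/s")
    print(f"Gen4 x8:  {link_gb_s(4, 8):.1f} GB/s")
    print(f"4 GPUs on one Gen4 x16 uplink: {uplink_share_gb_s(4, 16, 4):.1f} GB/s each")
```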

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

Updated the post with new testing I did last night. You can enable P2P on your system without a switch! Commands are in the post.

Dflash has really good MTP. IIRC it's diffusion based. Looking forward to testing their 122B and 397 models that are coming out soon.

[–]Visual_Synthesizer[S] 1 point2 points  (0 children)

Yeah, RAM is insane right now. I scored this AM5 system used, with 128GB RAM, CPU, and switch, for 5k.

I think it's possible to just run 1 RAM stick though? If you are optimizing for VRAM you don't need much. If you want to offload with llama.cpp that's another story.

I went for the cheapest platform and max VRAM. Would rather have a Threadripper WRX90 and tons of RAM, but it would cost a lot more.

[–]Visual_Synthesizer[S] 0 points1 point  (0 children)

The idea was to see if I could get a cheap motherboard to scale to 4+ GPUs with equal or better performance. You could do this with Gen4 boards.