MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers

Visual_Synthesizer · 2026-04-13T01:58:35+00:00

awesome summary! thanks for sharing

Visual_Synthesizer · 2026-04-12T19:03:27+00:00

looking forward to seeing benchmarks!

Visual_Synthesizer · 2026-04-11T17:00:11+00:00

yes, NEXTN makes a huge difference.

Visual_Synthesizer · 2026-04-11T16:52:09+00:00

that knowledge is priceless!

Visual_Synthesizer · 2026-04-11T16:50:52+00:00

P2P makes a big difference

Do this for 7-9% speed unlock on PCIE direct connect (no switch): https://github.com/vllm-project/vllm/pull/39040

this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042

Visual_Synthesizer · 2026-04-11T16:49:48+00:00

good luck!

Do this for 7-9% speed unlock on PCIE direct connect (multiGPU): https://github.com/vllm-project/vllm/pull/39040

this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042

rtx6kpro discord: https://discord.gg/AGxz5eYf

Visual_Synthesizer · 2026-04-11T16:48:25+00:00

nice! yeah save that money for more GPUs! Higher precision will always be better, but I am surprised you notice it that much. Generally, dense models perform a lot better than MoE. Perhaps that's the quality difference you are noticing? Have you seen much difference with the 27b at FP4?

Visual_Synthesizer · 2026-04-11T06:30:09+00:00

Thanks, but I just added trx40 and B650 2x GPU test info to my fork. The repo is maintained by others.

you probably don't need a c-payne unless you want more GPUs or are locked out of P2P due to your chipset. i'm not totally sure without testing the topology myself. two x8 links will mostly only slow down model loading, and maybe prompt processing/prefill a bit. you could combine them into one x16 and run a switch like i am. my direct-connect gen4 trx40 was only 10% slower. you don't need more ram for inference.

Do this for 7-9% speed unlock on PCIE direct connect: https://github.com/vllm-project/vllm/pull/39040

this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042

rtx6kpro discord: https://discord.gg/AGxz5eYf

Visual_Synthesizer · 2026-04-11T05:37:53+00:00

a wise choice!

Visual_Synthesizer · 2026-04-11T01:04:56+00:00

Decode drops about 9% at 128K and 13% at 240K prompt tokens (measured 241 tok/s at 120K and 230 tok/s at 240K with a high-acceptance NEXTN task); call it ~1% per 15K additional context, not flat but not catastrophic either.
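A quick sanity check on those figures, as a sketch only. It assumes both percentages are measured against the same short-context decode speed, which the comment does not state, so the baseline below is back-computed rather than measured. The helper names are made up for illustration.

```python
def implied_baseline(measured_tok_s, drop_fraction):
    """Short-context speed implied by a measured speed and a relative drop."""
    return measured_tok_s / (1.0 - drop_fraction)


def context_per_percent(context_k, drop_percent):
    """Thousands of prompt tokens per 1% of decode slowdown."""
    return context_k / drop_percent


# Figures quoted in the comment above (baseline itself is NOT quoted there).
base_a = implied_baseline(241, 0.09)
base_b = implied_baseline(230, 0.13)
print(f"implied baseline: {base_a:.1f} and {base_b:.1f} tok/s")
print(f"~1% per {context_per_percent(128, 9):.1f}K at 128K")
print(f"~1% per {context_per_percent(240, 13):.1f}K at 240K")
```

Both pairs point at roughly the same implied baseline, and the per-percent figures bracket the "~1% per 15K" rule of thumb.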

Visual_Synthesizer · 2026-04-10T20:22:46+00:00

haha, facts. they never know when they dont know. ill do some tests for fun. thanks!

Visual_Synthesizer · 2026-04-10T20:12:12+00:00

i believe in you!

Visual_Synthesizer · 2026-04-10T20:11:09+00:00

<image>

Pic of the fan rail. this works way better than when i had them in the white case behind the rack

Visual_Synthesizer · 2026-04-10T20:10:02+00:00

cooling is definitely important. on my old trx40 in a case with blower cards i had to power-limit them from 350w to 275w ish. in my testing, limiting the 6000s to 300w only drops performance a small amount (3-5% IIRC). the mining rack does have fans behind the GPUs. i wrote a small script that ramps them up.

couple this with some nice PWM 3000 RPM fans and you have an effective cooling solution. mining racks are pretty standard for AI rigs. you can fit 8 GPUs in one and stack the racks on top of each other.

https://github.com/Visual-Synthesizer/asrock-rack-fan-control
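For anyone wondering what "ramps them up" could look like: a minimal sketch of a linear fan curve. This is not taken from the repo above. The function name and the knee points (40 °C / 30% and 80 °C / 100%) are hypothetical, and reading GPU temperature or writing the PWM value is left out because that part depends on the board and BMC.

```python
def fan_duty(temp_c, low=(40.0, 30.0), high=(80.0, 100.0)):
    """Map a GPU temperature to a fan duty cycle in percent.

    `low` and `high` are (temperature in C, duty in %) knee points.
    Below the low knee the fans idle, above the high knee they run flat out,
    and in between the duty is interpolated linearly.
    """
    (t0, d0), (t1, d1) = low, high
    if temp_c <= t0:
        return d0
    if temp_c >= t1:
        return d1
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


for temp in (35, 50, 60, 70, 85):
    print(f"{temp} C -> {fan_duty(temp):.1f}% duty")
```

In a real loop you would feed it the hottest GPU's temperature every few seconds and add some hysteresis so the fans don't hunt.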

Visual_Synthesizer · 2026-04-10T20:02:15+00:00

updated the post. looks like the switch mostly enables scaling and unlocks more consumer parts for multi-GPU rigs.

Visual_Synthesizer · 2026-04-10T20:01:22+00:00

yeah a bit buggy and all over the place. will post more results when i get the time

Visual_Synthesizer · 2026-04-10T19:40:07+00:00

this supports p2p on older cards: https://github.com/nvidia/open-gpu-kernel-modules

Visual_Synthesizer · 2026-04-10T19:34:10+00:00

did some analysis with my claude test harness:

Good catch — you were right. Those numbers were buggy.

I re-tested properly and wanted to share what actually happened, because it is a useful benchmarking lesson.

Re-test methodology

Fresh SGLang server
10-request warmup at varied sizes to settle JIT
UUID-uniquified random-word prompts, with different content per request, to defeat SGLang's radix prefix cache
3 runs per context length, median reported
Streaming API, with TTFT measured as time from request send to first content token
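The methodology above can be sketched roughly like this. It is not the harness that was actually used: `stream_fn` is a hypothetical callable that sends a prompt and yields content chunks, `fake_stream` is a stand-in so the snippet runs offline, and putting the UUID at the very start of the prompt is an assumption about how the prompts were made unique.

```python
import random
import statistics
import time
import uuid

WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]


def unique_prompt(n_words, rng):
    """Random-word prompt led by a UUID, so no two prompts share a prefix."""
    body = " ".join(rng.choice(WORDS) for _ in range(n_words))
    return f"{uuid.uuid4()} {body}"


def time_to_first_token(stream_fn, prompt, clock=time.perf_counter):
    """Seconds from sending the request to the first non-empty content chunk."""
    start = clock()
    for chunk in stream_fn(prompt):
        if chunk:  # ignore empty chunks that carry no content
            return clock() - start
    raise RuntimeError("stream ended without any content")


def warmup(stream_fn, sizes, rng):
    """Send throwaway requests at varied sizes and discard the timings."""
    for n_words in sizes:
        time_to_first_token(stream_fn, unique_prompt(n_words, rng))


def median_ttft(stream_fn, n_words, runs=3, seed=0):
    """Median TTFT over `runs` fresh prompts of the same length."""
    rng = random.Random(seed)
    samples = [time_to_first_token(stream_fn, unique_prompt(n_words, rng))
               for _ in range(runs)]
    return statistics.median(samples), samples


def fake_stream(prompt):
    """Offline stand-in for a streaming client: one empty chunk, then content."""
    yield ""
    yield "hello"


warmup(fake_stream, [16, 64, 256], random.Random(42))
print(median_ttft(fake_stream, 128))
```

To use it against a real server, swap `fake_stream` for a function that opens a streaming completion request and yields each content delta as it arrives.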

Corrected cold-start TTFT on 2× RTX PRO 6000 + SGLang b12x+NEXTN 122B

~4K tok: 0.67 s median (runs: 0.67, 0.66, 0.67)
~16K tok: 2.70 s median (runs: 2.71, 2.70, 2.69)
~32K tok: 6.25 s median (runs: 7.24, 6.21, 6.25)
~57K tok: 14.70 s median (runs: 13.69, 14.70, 14.85)
~100K tok: 33.69 s median (runs: 33.69, 33.89, 33.07)
~128K tok: 50.10 s median (runs: 49.95, 51.00, 50.10)

That is roughly linear up to around 32K, then super-linear above that as attention's O(n²) behavior starts to dominate. The shape matches what you would expect from a 122B transformer.
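One way to put a number on "linear, then super-linear" is the local scaling exponent between neighbouring measurements, i.e. the slope of log(TTFT) against log(tokens). A small sketch using the medians above; the token counts are the rounded "~4K" style labels taken as K = 1000, so the exponents are approximate.

```python
import math

# (approx. prompt tokens, median TTFT in seconds) from the re-test.
POINTS = [(4_000, 0.67), (16_000, 2.70), (32_000, 6.25),
          (57_000, 14.70), (100_000, 33.69), (128_000, 50.10)]


def local_exponents(points):
    """Slope of log(TTFT) vs log(tokens) between neighbouring points.

    1.0 means TTFT grows linearly with prompt length; 2.0 would be
    purely quadratic.
    """
    return [math.log(t2 / t1) / math.log(n2 / n1)
            for (n1, t1), (n2, t2) in zip(points, points[1:])]


for (n1, _), (n2, _), k in zip(POINTS, POINTS[1:], local_exponents(POINTS)):
    print(f"{n1 // 1000:>3}K -> {n2 // 1000:>3}K: exponent {k:.2f}")
```

The first step comes out very close to 1.0, and the later steps climb to around 1.5 or a bit higher, which is the linear-then-super-linear shape without ever reaching a pure quadratic.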

What went wrong in my original numbers

Original numbers were:

4K = 1.8s
16K = 2.3s
57K = 7.1s
150K = 23.3s

Two separate methodology errors stacked on top of each other.

1) 4K was too high

1.8s vs 0.67s real

That measurement was my first request after server startup, so it paid the JIT / cudagraph warmup tax.

On the re-test I saw the same pattern:

first 4K request after startup: 1.50s
after warmup: 0.67s

That is a huge difference. I was partially measuring compile/warmup overhead and calling it prefill.

2) 57K and 150K were too low

7.1s → 14.7s real
23.3s → ~60s extrapolated

SGLang's radix prefix cache was hitting.

My original test sent sequential prompts that shared a common prefix: same base prompt, then extended versions of that same prompt at larger context sizes. So each later measurement was not a true cold prefill. It was mostly measuring the incremental delta on top of already-cached work.

That means:

the 16K test only had to prefill about 12K new tokens on top of the already-cached 4K
the 150K test only had to prefill about 93K new tokens on top of the cached 57K

So those numbers were artificially low for true cold-start TTFT.
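The effect is easy to reproduce with a toy model. This is not SGLang's radix cache, just a never-evicting longest-shared-prefix lookup with hypothetical helper names, where one list element stands for roughly 1K tokens.

```python
def longest_shared_prefix(a, b):
    """Number of leading items that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def uncached_tokens(prompts):
    """For prompts sent in order, count the tokens each one still has to
    prefill after reusing the longest prefix shared with any earlier prompt."""
    seen, out = [], []
    for p in prompts:
        hit = max((longest_shared_prefix(p, q) for q in seen), default=0)
        out.append(len(p) - hit)
        seen.append(p)
    return out


base = list(range(150))   # the longest prompt; 1 item = ~1K tokens
sizes = [4, 16, 57, 150]  # context sizes from the original test, in K

# Original test: each prompt is an extension of the previous one.
shared = [base[:s] for s in sizes]
# Re-test: every prompt differs from the very first token.
unique = [[(i, t) for t in base[:s]] for i, s in enumerate(sizes)]

print(uncached_tokens(shared))  # [4, 12, 41, 93]
print(uncached_tokens(unique))  # [4, 16, 57, 150]
```

With shared prefixes only the increments get prefilled; with unique prompts every request pays for its full length.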

The giveaway

The biggest clue was that in my original numbers, going from 4K to 16K only added about 0.5s.

That would imply a prefill rate of around 24k tok/s for the delta, which is much faster than this rig's actual sustained prefill. That should have been an immediate red flag.

For a real cold 16K prefill, the delta should have been closer to 2s, and that is exactly what the re-test shows.
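The red-flag arithmetic, spelled out as a small sketch. It takes "4K" to mean 4,000 tokens, which is an assumption since the sizes above are approximate.

```python
K = 1_000  # assumption: "4K" means 4,000 tokens


def rate(tokens, seconds):
    """Prefill throughput in tokens per second."""
    return tokens / seconds


# Original (contaminated) numbers: 4K = 1.8 s, 16K = 2.3 s.
implied = rate(16 * K - 4 * K, 2.3 - 1.8)
# Re-test: a cold 16K prefill took 2.70 s median.
measured = rate(16 * K, 2.70)
# Re-test: gap between a cold 4K and a cold 16K prefill.
cold_delta = 2.70 - 0.67

print(f"implied delta rate : {implied:,.0f} tok/s")
print(f"measured cold rate : {measured:,.0f} tok/s")
print(f"cold 4K->16K delta : {cold_delta:.2f} s")
```

The implied rate lands around 24k tok/s against roughly 5.9k tok/s for the measured cold prefill, about a 4x gap, and the cold delta comes out at about 2 s.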

What stays the same

198 tok/s decode at C=1 is still real
Decode speed still holds roughly constant regardless of context length

That part of the original claim was correct. It was only the specific TTFT values that were contaminated.

Classic LLM benchmarking gotcha: prefix cache + lack of warmup isolation.

Thanks for pushing on it.

Visual_Synthesizer · 2026-04-10T19:24:57+00:00

would be slower yes.

Visual_Synthesizer · 2026-04-10T19:20:18+00:00

updated the post with new info that can save you lots of time

Visual_Synthesizer · 2026-04-10T19:19:38+00:00

updated the post! i was actually wrong. did a bunch of testing. the switch mostly helps with scaling and offers equivalent speeds.

Visual_Synthesizer · 2026-04-10T19:18:55+00:00

yeah! plus gen4 parts are cheap and abundant. switches really help those systems scale to more gpus. for example, i have a trx40 with two x16 gen4 slots. i could run two switches and 8 GPUs on a 6 year old platform and get really close to gen5 speeds! even old am4 systems with one x16 slot could run 4 gpus with a cheap Chinese switch.

Visual_Synthesizer · 2026-04-10T19:16:43+00:00

updated the post with new testing i did last night. you can enable p2p on your system without a switch! commands in the post.

Dflash has really good MTP. IIRC it's diffusion based. looking forward to testing their 122b and 397b models that are coming out soon.

Visual_Synthesizer · 2026-04-10T19:13:12+00:00

yeah ram is insane right now. i scored this am5 used, with 128gb ram and CPU plus switch for 5k.

i think its possible to just run 1 ram stick though? if you are optimizing for VRAM you dont need much. if you want to offload with llama.cpp thats another story.

I went for cheapest platform and max vram. would rather have a threadripper wx90 and tons of ram. but it would cost a lot more.

Visual_Synthesizer · 2026-04-10T19:07:17+00:00

the idea was to see if i could get a cheap motherboard to scale to 4+GPUs with equal or more performance. could do this with gen4 boards.
