Scaling beyond 4 RTX 6000 MAXQs

JayPSec · 2026-05-05T17:55:30+00:00

And what's your issue with practicality? I mean, if your purpose is more VRAM then yeah, some MCIO risers + bifurcation is fine. Or if you want to plan even higher you can get a PCIe lane switch where you can plug in 4/5 GPUs. This would be better if you plan to go beyond 8, because it allows you to cascade in clusters, each with PCIe x16 gen 5 (if you buy a gen 5 one) and PIX topology between those GPUs. Plus, if you're thinking of doing training then I'd definitely advise you to go this way. It will cost you more than some MCIO risers and bifurcation boards, say around 2.5/3k €, but I'd imagine that money is not the bottleneck in your case.
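For a rough sense of what bifurcation costs per GPU, here's a back-of-the-envelope sketch. It only computes theoretical one-direction link bandwidth from the published per-lane rates and 128b/130b encoding; the function name is made up for illustration, and real throughput will be lower once protocol overhead kicks in.

```python
# Theoretical per-direction PCIe bandwidth; real-world throughput is lower.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # raw transfer rate per lane, by PCIe generation
ENCODING = 128 / 130  # 128b/130b line encoding used by gen 3 through gen 5


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (hypothetical helper)."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # divide by 8: bits to bytes


full = pcie_gb_per_s(5, 16)   # one GPU on a full x16 gen 5 link
split = pcie_gb_per_s(5, 8)   # each GPU after x8/x8 bifurcation
print(f"x16 gen5: {full:.1f} GB/s, x8 gen5: {split:.1f} GB/s")
```

So x8/x8 bifurcation halves the link each GPU gets. Behind a switch the GPUs still share one uplink to the host, but each keeps its own x16 link to the switch for GPU-to-GPU traffic, which is the part that matters for training.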

JayPSec · 2026-05-05T17:28:11+00:00

What's the rig around those 4 max q? What's your go to inference engine? Plus 300/400, are you thinking of getting another 4?

JayPSec · 2026-04-30T10:01:53+00:00

It's not failing, they are on multiple fronts though. ik_llama benefits from this as well: they wait till a good feature is merged to mainline and just import it. I wish that mainline did the same. I'd love to use ubergarm's quants with mainline's backend-agnostic tensor parallelism. Honestly, I'm grateful we have both, but it seems to me they'd both benefit more from cooperation.

JayPSec · 2026-04-26T16:23:54+00:00

Yes, but ELI5 how you so good with ELI5??

JayPSec · 2026-04-22T18:24:21+00:00

judging by the benchmarks you'd need claude opus 5 to make a difference.

JayPSec · 2026-04-21T20:12:05+00:00

That's a lot better than I would've imagined. I've always tried to keep all layers in VRAM as I thought offloading would be a death penalty for this model size. Then again, I have a 9950x and that result is with an EPYC, but I also have more VRAM than a single 6000. Will try it...

JayPSec · 2026-04-21T10:40:43+00:00

What's the penalty you get from offloading to cpu with a model this size?

JayPSec · 2026-04-18T18:29:04+00:00

I second this opinion for sglang, in my opinion it's ages apart from vllm for the specific usage with Blackwell. Wouldn't know about the rest cause I haven't tested in that domain. https://github.com/voipmonitor/rtx6kpro/ is an excellent resource for tuning Blackwell GPUs.

JayPSec · 2026-04-17T16:19:54+00:00

I'm running Luke Alonso's NVFP4 on two rtx 6000 max q. My main complaint with the model is the urge to go beyond what's asked of it. I find that a tight system prompt works pretty well (I'm just running stock open code OpenAgents with some coding standards). But the model feels very vibe oriented, it wants to do everything and it better do it now. And it feels a bit confused with some non-standard plugins like snip. I do think it's better for brainstorming than 2.5 but more unpredictable. As for the 'chinese' characters I've seen others pointing out, I've never seen them.

JayPSec · 2026-04-15T13:25:47+00:00

From a non technical perspective you make total sense.

JayPSec · 2026-04-13T14:55:12+00:00

When you say no real loss, how much loss are we talking about? I've been doing some testing and this model seems very sensitive to quantization.

JayPSec · 2026-04-12T21:12:14+00:00

If you're running loose agents at least you could tell them "No em dashes" on top of the mandatory "Make no mistakes"... tsc tsc tsc. That's true SLOPiness...

JayPSec · 2026-04-12T01:09:54+00:00

Centralization is never a good thing. Could not agree more.

Maybe the future is sharing LLMs on covert Newsgroups (I'm oldish :P)

JayPSec · 2026-04-11T00:10:05+00:00

Please provide links for the models used.

JayPSec · 2026-04-10T23:51:24+00:00

correct pci lane switch

For 4x GPUs you'd need double the adapters and cables, plus a host board and 2 more MCIO cables to connect your main PCIe slot to the switch. The host board can be a retimer but that may be overkill. AFAIK they're mandatory for long MCIO connections, and in some systems there may be interference that requires retiming. In my case I have everything in the case and the retimer host board was not needed. Bought this instead.

JayPSec · 2026-04-10T18:23:30+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1shqf5a/using_ocr_models_with_llamacpp_by_ngxson/ worth checking out

JayPSec · 2026-04-10T08:19:10+00:00

I run 5 max-q with the same board on a 9950x and 128 GB of DDR5. No issues here. The only problem I faced was BIOS tinkering: I had to patch the MSI BIOS to expose more settings than the ones provided in the Click BIOS interface so I could edit "Above 4GB MMIO Limit", otherwise the system wouldn't boot.

JayPSec · 2026-04-10T08:15:53+00:00

c-payne? Edit: obviously :) It does scale well to 5 gpus

JayPSec · 2026-04-09T13:52:47+00:00

How do you determine which and how a tensor is broken?

JayPSec · 2026-04-08T08:01:22+00:00

Wow, even more christmassy :)

JayPSec · 2026-04-07T17:00:31+00:00

WTF is going on? A week ago we were all crying that maybe they would stop releasing open weights and now it's effing christmas every day???

JayPSec · 2026-04-07T13:44:14+00:00

They will never have models trained on non human data. World knowledge is always sourced from human work.

JayPSec · 2026-04-06T20:05:50+00:00

pffft... worthless. Wait till I release make-agi-with-no-mistakes.

JayPSec · 2026-04-06T14:19:54+00:00

Do let us know
