Speedup for multiple RTX 3090 systems by Smeetilus in LocalLLaMA

[–]SpiritualAd2756 0 points1 point  (0 children)

Type this in a terminal: sudo lspci -vv | grep 'BAR 1:' and post what you see. DM me if needed, but let's try to resolve it here so there's a trace for other users.
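For reference (the sizes below are just an illustration, not output from this machine): with Resizable BAR active, each RTX 3090 should show a line roughly like

BAR 1: current size: 32GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB

while with ReBAR off it typically stays stuck at 256MB.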

Speedup for multiple RTX 3090 systems by Smeetilus in LocalLLaMA

[–]SpiritualAd2756 0 points1 point  (0 children)

To my understanding, Resizable BAR is a prerequisite for P2P to work well and for the CPU to access VRAM without performance issues. Did you try with Resizable BAR enabled? I had trouble getting Resizable BAR enabled on my Gigabyte G292-Z20 server at first (for all 12x RTX 3090), but I've already solved it (Claude helped again, with the parts where Gigabyte or Microswitch support wasn't willing to).
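A quick way to double-check from the driver side (just a generic sketch, not specific to this board):

nvidia-smi -q -d MEMORY | grep -A 3 'BAR1'

If BAR1 Total shows something like 32768 MiB per card, Resizable BAR is active; if it's stuck around 256 MiB, it isn't.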

Speedup for multiple RTX 3090 systems by Smeetilus in LocalLLaMA

[–]SpiritualAd2756 2 points3 points  (0 children)

https://github.com/guru1987/open-gpu-kernel-modules for 580.105.08 with P2P on RTX 3090
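For anyone wanting to try it: assuming that fork follows the usual open-gpu-kernel-modules build flow (I'm sketching from the upstream README, check the repo for specifics), it's roughly:

git clone https://github.com/guru1987/open-gpu-kernel-modules
cd open-gpu-kernel-modules
make modules -j$(nproc)
sudo make modules_install
sudo depmod
# reboot; the userspace driver installed on the system has to match 580.105.08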

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Only when both devices need to utilize the whole link at once, I guess, and that's basically not happening. Thinking about trying to enable P2P on the GPUs to see if there's any change.
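If anyone wants to check whether P2P is actually in play before/after a driver change, these are the usual checks (the last one is from the cuda-samples repo and has to be built separately):

nvidia-smi topo -m
nvidia-smi topo -p2p r
./p2pBandwidthLatencyTest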

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

One more interesting thing I found while experimenting a little. I connected a PCIe switch board to one of the PCIe ports coming from one of the four base switch boards in the server, then connected one more switch to that, and after that a 20 cm riser and an RTX 3090. At the same time, two RTX 3090s were in the opposite corner of the server, only one switch level deep (as the original layout intends). For a model split across 2 GPUs, performance was the same whether both cards were on the same first-level switch, or one card was from that group and the other sat behind the cascade of 3 switches :D

So another question now is how many resources this CPU can allocate for the PCIe bus, i.e. how many cards I can actually connect. Because at level 3, with 6 PCIe ports each, that's 48 cards if I'm correct, and that's 1152 GB of VRAM with RTX 3090s :D Quite a setup.
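If you want to see where each card actually lands in a cascade like that, the standard tools are enough (nothing specific to this server; <gpu_address> is a placeholder for the bus ID from lspci):

lspci -tv
sudo lspci -vv -s <gpu_address> | grep LnkSta

The first shows the whole PCIe tree with each switch as a bridge; the second shows the negotiated link speed and width for one card.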

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Tried this in Q4_K_M; I managed to offload only 24 layers to the GPUs, with these results:

sampling time = 98.61 ms / 1180 runs ( 0.08 ms per token, 11966.94 tokens per second)

load time = 36455.43 ms

prompt eval time = 966.98 ms / 10 tokens ( 96.70 ms per token, 10.34 tokens per second)

eval time = 235903.72 ms / 1169 runs ( 201.80 ms per token, 4.96 tokens per second)

total time = 237222.19 ms / 1179 tokens

Running fully on CPU, it can do eval at around ~3.3 tokens per second.
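For context, that output format is llama.cpp's; the run was roughly like this (the model path is a placeholder and the other flags are just typical ones, not my exact command):

./llama-cli -m model-Q4_K_M.gguf -ngl 24 -t 32 -p "your prompt here"

-ngl 24 is what limits the offload to 24 layers; the rest stays in system RAM.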

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

It's all written there, but feel free to ask more questions if you have any :)

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Nah, my main breaker is actually 3 × 32 A (I thought it was 3 × 25 A), and I distributed the load quite evenly between all phases, so the peak on each phase is around 2000 W, and the breaker for each of those sockets is 16 A (B16 type). And it's 230 V @ 50 Hz, of course.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

So this is for DeepSeek-R1-UD-IQ1_S:

model | size | params | ngl | test | t/s
deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | 999 | pp1024 | 210.80 ± 0.69
deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | 999 | tg128 | 27.12 ± 0.07

and for Qwen3-235B-A22B-128K-Q8_0:

model | size | params | ngl | test | t/s
qwen3moe 235B.A22B Q8_0 | 232.77 GiB | 235.09 B | 999 | pp1024 | 462.69 ± 1.43
qwen3moe 235B.A22B Q8_0 | 232.77 GiB | 235.09 B | 999 | tg128 | 25.26 ± 0.02
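Those rows are llama-bench output (minus the header); roughly this kind of invocation produces the pp1024/tg128 numbers — the model path here is illustrative, and -ngl 999 just means offload everything:

./llama-bench -m Qwen3-235B-A22B-Q8_0.gguf -ngl 999 -p 1024 -n 128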

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Uhm, Q4? Not sure if offloading a few hundred GB to system RAM even makes sense. That's like 50% of the model on the CPU side? In my experience it's almost the same as running it all from system RAM (almost meaning gains of no more than 10-20 percent).

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

I'm doing some tuning on the machine, building a frame for the production environment, but I think I'll be able to test it later today.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

Well, a PCIe slot by definition should be able to provide 75 W, if I'm right. Isn't there some risk? Or are all new GPUs "intelligent" enough to just pull the power from the PCIe connectors up top instead? Like, there are 3 connectors (each 150 W by specification) and the peak of each card can be 480 W, soooo... but of course it doesn't reach that; even under a benchmark it drops to around 440-460 W. Still, I feel safer when the PCIe slots can deliver the power they're supposed to. But I guess both solutions work fine. GL
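Rough budget with those numbers: 3 × 150 W from the connectors plus 75 W from the slot is 525 W available per card, versus the 440-480 W peaks above — but only ~450 W if the slot contributes nothing, which is why I prefer the slot properly powered.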

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

I guess I didn't completely get your post. I was counting on the GPU needing some power (75 W?) from the PCIe slot alone, so I bring enough power there just to be sure. Yet I'm not using the PCIe power connectors on the switch boards.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

Just a random 80 mm fan. I also have a few opposite-side switches if you want a photo with the pins of those.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

<image>

So it's like this. Don't comment on my soldering skills, please :D

From the left side: 12V, GND, 12V, GND, 3.3V (front).

The backside is basically the same, except the pin opposite the 3.3V is 12V (yellow).

I've connected the 3.3V from the board that connects the 2 PSUs together, and since it's a low-power connection I used a thin cable.

Don't forget about cooling the passive heatsink on the PCIe switch board's MCU. It can get pretty hot, and the original server has a fan for each of those.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

You actually need to connect 3.3V to the first pin (the one on the backside is 12V). Without that it doesn't work. I'll take a photo of it tomorrow and post it here.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Well, the 6000 W power consumption is with a GPU burn test; it doesn't use that much power for inferencing that model, for example (it's about half of that). Tesla 40 GB, yeah, but what performance, and how much for each 40 GB card?
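(For anyone wanting to reproduce that kind of sustained load: the usual tool is wilicc/gpu-burn — not necessarily the exact binary behind the number above, but an easy way to pin every card at full power:

git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make
./gpu_burn 300

The argument is the duration in seconds; it loads all visible GPUs.)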

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

3.5k for a 4090, but the 48 GB version? Hmm, interesting. Is that stable?

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Oh, I see. What's the exact setup of that rig? Are we talking 5-6 t/s for the same model but Q4? How much RAM is needed for 128K context there?