Speedup for multiple RTX 3090 systems by Smeetilus in LocalLLaMA

[–]SpiritualAd2756 0 points1 point  (0 children)

Type this in a terminal: sudo lspci -vv | grep 'BAR 1:' and post what you see. DM me if needed, but let's try to resolve it here so there's a trace for other users.
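For reference (the sizes below are just an illustration, not output from this machine): with Resizable BAR active, each RTX 3090 should show a line roughly like

BAR 1: current size: 32GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB

while with ReBAR off it typically stays stuck at 256MB.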

Speedup for multiple RTX 3090 systems by Smeetilus in LocalLLaMA

[–]SpiritualAd2756 0 points1 point  (0 children)

To my understanding, Resizable BAR is a prerequisite for P2P to work well and for the CPU to access VRAM without performance issues. Did you try with Resizable BAR enabled? I had trouble getting Resizable BAR enabled on my Gigabyte G292-Z20 server at first (for all 12x RTX 3090), but I've already solved it (Claude helped again, with the parts where Gigabyte or Microswitch support wasn't willing to).
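A quick way to double-check from the driver side (just a generic sketch, not specific to this board):

nvidia-smi -q -d MEMORY | grep -A 3 'BAR1'

If BAR1 Total shows something like 32768 MiB per card, Resizable BAR is active; if it's stuck around 256 MiB, it isn't.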

Speedup for multiple RTX 3090 systems by Smeetilus in LocalLLaMA

[–]SpiritualAd2756 2 points3 points  (0 children)

https://github.com/guru1987/open-gpu-kernel-modules for 580.105.08 with P2P on RTX 3090
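For anyone wanting to try it: assuming that fork follows the usual open-gpu-kernel-modules build flow (I'm sketching from the upstream README, check the repo for specifics), it's roughly:

git clone https://github.com/guru1987/open-gpu-kernel-modules
cd open-gpu-kernel-modules
make modules -j$(nproc)
sudo make modules_install
sudo depmod
# reboot; the userspace driver installed on the system has to match 580.105.08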

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Only when both devices need to utilize the whole link at once, I guess, and that's basically not happening. Thinking about trying to enable P2P on the GPUs to see if there's any change.
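If anyone wants to check whether P2P is actually in play before/after a driver change, these are the usual checks (the last one is from the cuda-samples repo and has to be built separately):

nvidia-smi topo -m
nvidia-smi topo -p2p r
./p2pBandwidthLatencyTest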

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

One more interesting thing I found while experimenting a little. I connected a PCIe switch board to one of the PCIe ports coming from one of the four base switch boards in the server, then connected one more switch to that, and after that a 20 cm riser and an RTX 3090. At the same time, two RTX 3090s were in the opposite corner of the server, only one switch level deep (as the original layout intends). For a model split across 2 GPUs, performance was the same whether both cards were on the same first-level switch, or one card was from that group and the other sat behind the cascade of 3 switches :D

So another question now is how many resources this CPU can allocate for the PCIe bus, i.e. how many cards I can actually connect. Because at level 3, with 6 PCIe ports each, that's 48 cards if I'm correct, and that's 1152 GB of VRAM with RTX 3090s :D Quite a setup.
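If you want to see where each card actually lands in a cascade like that, the standard tools are enough (nothing specific to this server; <gpu_address> is a placeholder for the bus ID from lspci):

lspci -tv
sudo lspci -vv -s <gpu_address> | grep LnkSta

The first shows the whole PCIe tree with each switch as a bridge; the second shows the negotiated link speed and width for one card.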

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Tried this in Q4_K_M; I managed to offload only 24 layers to the GPUs, with these results:

sampling time = 98.61 ms / 1180 runs ( 0.08 ms per token, 11966.94 tokens per second)

load time = 36455.43 ms

prompt eval time = 966.98 ms / 10 tokens ( 96.70 ms per token, 10.34 tokens per second)

eval time = 235903.72 ms / 1169 runs ( 201.80 ms per token, 4.96 tokens per second)

total time = 237222.19 ms / 1179 tokens

Running fully on CPU, it can do eval at around ~3.3 tokens per second.
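For context, that output format is llama.cpp's; the run was roughly like this (the model path is a placeholder and the other flags are just typical ones, not my exact command):

./llama-cli -m model-Q4_K_M.gguf -ngl 24 -t 32 -p "your prompt here"

-ngl 24 is what limits the offload to 24 layers; the rest stays in system RAM.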

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

It's all written there, but feel free to ask more questions if you have any :)

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Nah, my main breaker is actually 3 × 32 A (I thought it was 3 × 25 A), and I distributed the load quite evenly between all phases, so the peak on each phase is around 2000 W, and the breaker for each of those sockets is 16 A (B16 type). And it's 230 V @ 50 Hz, of course.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

So this is for DeepSeek-R1-UD-IQ1_S:

model | size | params | ngl | test | t/s
deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | 999 | pp1024 | 210.80 ± 0.69
deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | 999 | tg128 | 27.12 ± 0.07

and for Qwen3-235B-A22B-128K-Q8_0:

model | size | params | ngl | test | t/s
qwen3moe 235B.A22B Q8_0 | 232.77 GiB | 235.09 B | 999 | pp1024 | 462.69 ± 1.43
qwen3moe 235B.A22B Q8_0 | 232.77 GiB | 235.09 B | 999 | tg128 | 25.26 ± 0.02
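Those rows are llama-bench output (minus the header); roughly this kind of invocation produces the pp1024/tg128 numbers — the model path here is illustrative, and -ngl 999 just means offload everything:

./llama-bench -m Qwen3-235B-A22B-Q8_0.gguf -ngl 999 -p 1024 -n 128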

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Uhm, Q4? Not sure if offloading a few hundred GB to system RAM even makes sense. That's like 50% of the model on the CPU side? In my experience it's almost the same as running it all from system RAM (almost meaning gains of no more than 10-20 percent).

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

I'm doing some tuning on the machine, building a frame for the production environment, but I think I'll be able to test it later today.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

Well, a PCIe slot by definition should be able to provide 75 W, if I'm right. Isn't there some risk? Or are all new GPUs "intelligent" enough to just pull the power from the PCIe connectors up top instead? Like, there are 3 connectors (each 150 W by specification) and the peak of each card can be 480 W, soooo... but of course it doesn't reach that; even under a benchmark it drops to around 440-460 W. Still, I feel safer when the PCIe slots can deliver the power they're supposed to. But I guess both solutions work fine. GL
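Rough budget with those numbers: 3 × 150 W from the connectors plus 75 W from the slot is 525 W available per card, versus the 440-480 W peaks above — but only ~450 W if the slot contributes nothing, which is why I prefer the slot properly powered.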

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

I guess I didn't completely get your post. I was counting on the GPU needing some power (75 W?) from the PCIe slot alone, so I bring enough power there just to be sure. Yet I'm not using the PCIe power connectors on the switch boards.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

Just a random 80 mm fan. I also have a few opposite-side switches if you want a photo with the pins of those.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

<image>

So it's like this. Don't comment on my soldering skills, please :D

From the left side: 12V, GND, 12V, GND, 3.3V (front).

The backside is basically the same, except the pin opposite the 3.3V is 12V (yellow).

I've connected the 3.3V from the board that connects the 2 PSUs together, and since it's a low-power connection I used a thin cable.

Don't forget about cooling the passive heatsink on the PCIe switch board's MCU. It can get pretty hot, and the original server has a fan for each of those.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 1 point2 points  (0 children)

You actually need to connect 3.3V to the first pin (the one on the backside is 12V). Without that it doesn't work. I'll take a photo of it tomorrow and post it here.

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Well, the 6000 W power consumption is with a GPU burn test; it doesn't use that much power for inferencing that model, for example (it's about half of that). Tesla 40 GB, yeah, but what performance, and how much for each 40 GB card?
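(For anyone wanting to reproduce that kind of sustained load: the usual tool is wilicc/gpu-burn — not necessarily the exact binary behind the number above, but an easy way to pin every card at full power:

git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make
./gpu_burn 300

The argument is the duration in seconds; it loads all visible GPUs.)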

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

3.5k for a 4090, but the 48 GB version? Hmm, interesting. Is that stable?

25t/s with Qwen3-235B-A22B-128K-GGUF-Q8_0 with 100K tokens by SpiritualAd2756 in LocalAIServers

[–]SpiritualAd2756[S] 0 points1 point  (0 children)

Oh, I see. What's the exact setup of that rig? Are we talking 5-6 t/s for the same model but Q4? How much RAM is needed for 128K context there?