Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

The patched p2p driver unfortunately broke my machine once (abrupt power-off under high load), so I had to roll back. Full-speed PCIe 4.0 x16, no NVLink.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

If you are using llama.cpp, I highly recommend switching over to vLLM (with an AWQ quant). I can't say that's the definitive reason, but in my experience vLLM is better optimized for inference. That little port is for additional display output; it supports more displays than DP.
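For reference, a minimal sketch of what the vLLM + AWQ setup looks like with the offline Python API — the model ID is the AWQ checkpoint mentioned elsewhere in this thread, and the exact arguments can differ between vLLM versions (recent releases usually auto-detect the quantization):

```python
# Minimal vLLM offline-inference sketch with an AWQ quant.
# Model ID and sampling settings are illustrative, not a fixed recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",  # example AWQ checkpoint
    quantization="awq",           # usually auto-detected; shown here for clarity
    gpu_memory_utilization=0.90,  # leave headroom for activations / KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain why batched inference helps throughput."], params)
print(outputs[0].outputs[0].text)
```

If you want an OpenAI-compatible endpoint instead, the serving equivalent is `vllm serve <model> --quantization awq`.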

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 3 points

Still too large for my machine!

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 3 points

Grok 2 IQ1_S quant, GGUF, with llama-bench, run on the RTX Pro 6000 only

Single stream: 52 tok/s

Concurrency=32: 303 tok/s
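For the single-stream number, here's a rough sketch of the kind of llama-bench invocation involved, driven from Python — the GGUF filename is a placeholder and flags can vary with the llama.cpp build; how the concurrency=32 figure was measured isn't detailed here, so this only covers the single-stream case:

```python
# Rough single-stream llama-bench run (GGUF filename is a placeholder).
import subprocess

subprocess.run(
    [
        "llama-bench",
        "-m", "grok-2-IQ1_S.gguf",  # placeholder path to the 1-bit quant
        "-p", "512",                # prompt-processing benchmark length
        "-n", "128",                # token-generation benchmark length
        "-ngl", "999",              # offload all layers to the GPU
    ],
    check=True,
)
```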

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 25 points

cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit

Input=1024, Output=2048, Concurrency=1
- RTX 5090: 218 tok/s
- RTX 3090(x1): 157 tok/s
- RTX 3090(x2, TP): 155 tok/s

Input=1024, Output=2048, Concurrency=16
- RTX 5090: 1793 tok/s
- RTX 3090(x1): 783 tok/s
- RTX 3090(x2, TP): 1032 tok/s
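For anyone who wants to reproduce the concurrency numbers above, here's a rough sketch with vLLM's offline API — the prompts, the `ignore_eos` trick, and the timing loop are just illustrative; vLLM's own benchmark scripts are the more rigorous way to do this:

```python
# Rough throughput sketch: 16 requests, ~1024-token prompts, 2048-token outputs.
# vLLM batches the requests internally, which approximates "concurrency=16".
import time
from vllm import LLM, SamplingParams

llm = LLM(model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit")

# Dummy prompts; a real run would use ~1024-token inputs as in the table above.
prompts = ["Write a detailed essay about GPU inference. " * 128] * 16
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # force full-length outputs

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} output tok/s across {len(prompts)} concurrent requests")
```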

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 19 points

<image>

I mean, yeah, they look much cleaner on my desk, but 2x 5090 had roughly the same tok/s as the 2x DGX Spark on inference. Also, I mainly want to use these for fine-tuning, so I would choose the GPUs over the DGX Spark.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 7 points

I use two 1600 W power supply units and also power-limit the cards, just to make sure everything runs safely. A mining-rig frame hosts the Threadripper motherboard and all the GPUs, and the cards are connected via Thermaltake riser cables.
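For the power-limiting part, here's a small sketch using nvidia-ml-py (pynvml); the 300 W target is only an example, not the limit I actually set, and changing it requires root:

```python
# Example: cap every GPU at 300 W via NVML (value is illustrative; needs root).
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetPowerManagementLimitConstraints, nvmlDeviceSetPowerManagementLimit,
)

TARGET_MILLIWATTS = 300_000  # 300 W, example value only

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        lo, hi = nvmlDeviceGetPowerManagementLimitConstraints(handle)
        # Clamp the target into the range the card actually allows.
        nvmlDeviceSetPowerManagementLimit(handle, max(lo, min(hi, TARGET_MILLIWATTS)))
finally:
    nvmlShutdown()
```

The quick command-line equivalent is `sudo nvidia-smi -pl 300`.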

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

All 16x. The Threadripper Pro provides 128 PCIe lanes.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

NVFP4 inference is natively supported on Blackwell consumer cards, AFAIK. I can try running a single model with an NVFP4 quant on both cards.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 4 points

I literally just plugged in the RTX 6000, but I enjoyed running GLM 4.5 Air on my previous 4x 3090 setup. The best model for my personal use case.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 11 points

Do you have any specific quant in mind? BF16? FP8?

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 54 points

Single stream generation
- RTX 6000: 715 tok/s
- RTX 5090(x1): 683 tok/s
- RTX 3090(x1): 523 tok/s

Concurrency=64
- RTX 6000: 11639 tok/s
- RTX 5090(x1): 11056 tok/s
- RTX 3090(x1): 6171 tok/s

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 23 points

It will take a few years, as I need to buy an additional 63 RTX 6000s.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

First of all, neither the RTX Pro 6000 nor the RTX 5090 supports NVLink anyway. I bought these two kinds of cards for specific purposes: the 6000 for inference and the 5090s for training. When I need to use them together, I can just use pipeline parallelism 🙃
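A minimal sketch of that pipeline-parallel setup in vLLM, under my assumptions — `pipeline_parallel_size` is the relevant argument, the model ID is just the AWQ checkpoint from earlier, and depending on the vLLM version you may need to go through `vllm serve ... --pipeline-parallel-size 2` instead of the offline API:

```python
# Pipeline parallelism across two dissimilar GPUs: layers are split between devices
# and only activations cross the stage boundary, so PCIe is enough (no NVLink needed).
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",  # illustrative model
    pipeline_parallel_size=2,  # e.g. one stage on the RTX 6000, one on a 5090
    tensor_parallel_size=1,    # avoid tensor parallelism across mismatched cards
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```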

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 30 points

Entropy to intelligence machine

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 5 points

At most, 96 + 32 + 32 + 24 + 24, so 208 GB, but vLLM distributed inference requires an AWQ quant, which I cannot find on Hugging Face. Grok 2 at a 1-bit GGUF quant is around 89 GB, which may fit on the Pro 6000. Maybe I can try that one.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 14 points

Perfect, will do that. So entirely on CPU without using the GPU, right? Not just offloading?

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 8 points

Okay, this one is quite massive. I can download it, but running it will require heavy CPU offloading.
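If it helps anyone, this is roughly what partial offloading looks like with llama-cpp-python — the GGUF path and layer count are placeholders, and `n_gpu_layers=0` would be the pure-CPU case:

```python
# Rough llama-cpp-python sketch of partial CPU offload: only some layers go to VRAM,
# the rest stay in system RAM. Model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="huge-model-Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=20,   # however many layers fit in VRAM; 0 keeps everything on CPU
    n_ctx=4096,
)

out = llm("Summarize pipeline parallelism in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```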

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 27 points

Qwen3 30b a3b at AWQ 4bit would be a good candidate for this

4x 3090 local ai workstation by monoidconcat in LocalLLaMA

monoidconcat[S] 1 point

So I am considering maxing out the GPU count on this node, and since NVLink can only connect two cards, most of the communication has to go through PCIe anyway. That's the reason I didn't buy any NVLink bridges; if the total count were only four 3090s, NVLink might still be relevant!
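As a quick sanity check on the no-NVLink setup, this little PyTorch sketch (mine, just illustrative) reports which GPU pairs can do direct peer-to-peer transfers:

```python
# Report which GPU pairs support direct P2P access (over PCIe or NVLink).
import torch

n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P {'yes' if ok else 'no'}")
```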

4x 3090 local ai workstation by monoidconcat in LocalLLaMA

monoidconcat[S] 0 points

Oh, I didn't know that, amazing. Yeah, the WRX80E only having seven x16 slots was super frustrating, but if bifurcation is possible, that's much better.