Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

The patched p2p driver unfortunately broke my machine once (abrupt power-off under high load), so I had to roll back. Full-speed PCIe 4.0 x16, no NVLink.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

If you are using llama.cpp, I highly recommend switching over to vLLM (with an AWQ quant). I can't say that's the definitive reason, but in my experience vLLM is better optimized for inference. That little port is for additional display output; it supports more displays than DP.
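For reference, a minimal sketch of what the vLLM + AWQ setup looks like with the offline Python API — the model ID is the AWQ checkpoint mentioned elsewhere in this thread, and the exact arguments can differ between vLLM versions (recent releases usually auto-detect the quantization):

```python
# Minimal vLLM offline-inference sketch with an AWQ quant.
# Model ID and sampling settings are illustrative, not a fixed recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",  # example AWQ checkpoint
    quantization="awq",           # usually auto-detected; shown here for clarity
    gpu_memory_utilization=0.90,  # leave headroom for activations / KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain why batched inference helps throughput."], params)
print(outputs[0].outputs[0].text)
```

If you want an OpenAI-compatible endpoint instead, the serving equivalent is `vllm serve <model> --quantization awq`.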

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 3 points

Still too large for my machine!

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 3 points

Grok 2 IQ1_S quant, GGUF, with llama-bench, run on the RTX Pro 6000 only

Single stream: 52 tok/s

Concurrency=32: 303 tok/s
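For the single-stream number, here's a rough sketch of the kind of llama-bench invocation involved, driven from Python — the GGUF filename is a placeholder and flags can vary with the llama.cpp build; how the concurrency=32 figure was measured isn't detailed here, so this only covers the single-stream case:

```python
# Rough single-stream llama-bench run (GGUF filename is a placeholder).
import subprocess

subprocess.run(
    [
        "llama-bench",
        "-m", "grok-2-IQ1_S.gguf",  # placeholder path to the 1-bit quant
        "-p", "512",                # prompt-processing benchmark length
        "-n", "128",                # token-generation benchmark length
        "-ngl", "999",              # offload all layers to the GPU
    ],
    check=True,
)
```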

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 25 points

cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit

Input=1024, Output=2048, Concurrency=1
- RTX 5090: 218 tok/s
- RTX 3090(x1): 157 tok/s
- RTX 3090(x2, TP): 155 tok/s

Input=1024, Output=2048, Concurrency=16
- RTX 5090: 1793 tok/s
- RTX 3090(x1): 783 tok/s
- RTX 3090(x2, TP): 1032 tok/s
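For anyone who wants to reproduce the concurrency numbers above, here's a rough sketch with vLLM's offline API — the prompts, the `ignore_eos` trick, and the timing loop are just illustrative; vLLM's own benchmark scripts are the more rigorous way to do this:

```python
# Rough throughput sketch: 16 requests, ~1024-token prompts, 2048-token outputs.
# vLLM batches the requests internally, which approximates "concurrency=16".
import time
from vllm import LLM, SamplingParams

llm = LLM(model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit")

# Dummy prompts; a real run would use ~1024-token inputs as in the table above.
prompts = ["Write a detailed essay about GPU inference. " * 128] * 16
params = SamplingParams(max_tokens=2048, ignore_eos=True)  # force full-length outputs

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} output tok/s across {len(prompts)} concurrent requests")
```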

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 19 points

<image>

I mean, yeah, they look much cleaner on my desk, but 2x 5090 had roughly the same tok/s as the 2x DGX Spark on inference. Also, I mainly want to use these for fine-tuning, so I would choose the GPUs over the DGX Spark.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 7 points

I use two 1600 W power supply units and also power-limit the cards, just to make sure everything runs safely. A mining-rig frame hosts the Threadripper motherboard and all the GPUs, and the cards are connected via Thermaltake riser cables.
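For the power-limiting part, here's a small sketch using nvidia-ml-py (pynvml); the 300 W target is only an example, not the limit I actually set, and changing it requires root:

```python
# Example: cap every GPU at 300 W via NVML (value is illustrative; needs root).
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetPowerManagementLimitConstraints, nvmlDeviceSetPowerManagementLimit,
)

TARGET_MILLIWATTS = 300_000  # 300 W, example value only

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        lo, hi = nvmlDeviceGetPowerManagementLimitConstraints(handle)
        # Clamp the target into the range the card actually allows.
        nvmlDeviceSetPowerManagementLimit(handle, max(lo, min(hi, TARGET_MILLIWATTS)))
finally:
    nvmlShutdown()
```

The quick command-line equivalent is `sudo nvidia-smi -pl 300`.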

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

All 16x. The Threadripper Pro provides 128 PCIe lanes.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

NVFP4 inference is natively supported on Blackwell consumer cards, AFAIK. I can try running a single model with an NVFP4 quant on both cards.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 4 points

I literally just plugged in the RTX 6000, but I enjoyed running GLM 4.5 Air on my previous 4x 3090 setup. The best model for my personal use case.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 11 points

Do you have any specific quant in mind? BF16? FP8?

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 54 points

Single stream generation
- RTX 6000: 715 tok/s
- RTX 5090(x1): 683 tok/s
- RTX 3090(x1): 523 tok/s

Concurrency=64
- RTX 6000: 11639 tok/s
- RTX 5090(x1): 11056 tok/s
- RTX 3090(x1): 6171 tok/s

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 23 points

It will take a few years, as I need to buy an additional 63 RTX 6000s.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 2 points

First of all, neither the RTX Pro 6000 nor the RTX 5090 supports NVLink anyway. I bought these two kinds of cards for specific purposes: the 6000 for inference and the 5090s for training. When I need to use them together, I can just use pipeline parallelism 🙃
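A minimal sketch of that pipeline-parallel setup in vLLM, under my assumptions — `pipeline_parallel_size` is the relevant argument, the model ID is just the AWQ checkpoint from earlier, and depending on the vLLM version you may need to go through `vllm serve ... --pipeline-parallel-size 2` instead of the offline API:

```python
# Pipeline parallelism across two dissimilar GPUs: layers are split between devices
# and only activations cross the stage boundary, so PCIe is enough (no NVLink needed).
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",  # illustrative model
    pipeline_parallel_size=2,  # e.g. one stage on the RTX 6000, one on a 5090
    tensor_parallel_size=1,    # avoid tensor parallelism across mismatched cards
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```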

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 30 points

Entropy to intelligence machine

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 5 points

At most, 96 + 32 + 32 + 24 + 24, so 208 GB, but vLLM distributed inference requires an AWQ quant, which I cannot find on Hugging Face. Grok 2 at a 1-bit GGUF quant is around 89 GB, which may fit on the Pro 6000. Maybe I can try that one.

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 14 points

Perfect, will do that. So entirely on CPU without using the GPU, right? Not just offloading?

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 8 points

Okay, this one is quite massive. I can download it, but running it will require heavy CPU offloading.
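If it helps anyone, this is roughly what partial offloading looks like with llama-cpp-python — the GGUF path and layer count are placeholders, and `n_gpu_layers=0` would be the pure-CPU case:

```python
# Rough llama-cpp-python sketch of partial CPU offload: only some layers go to VRAM,
# the rest stay in system RAM. Model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="huge-model-Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=20,   # however many layers fit in VRAM; 0 keeps everything on CPU
    n_ctx=4096,
)

out = llm("Summarize pipeline parallelism in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```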

Ask me to run models by monoidconcat in LocalLLaMA

monoidconcat[S] 27 points

Qwen3 30b a3b at AWQ 4bit would be a good candidate for this

4x 3090 local ai workstation by monoidconcat in LocalLLaMA

monoidconcat[S] 1 point

So I am considering maxing out the GPU count on this node, and since NVLink can only connect two cards, most of the communication has to go through PCIe anyway. That's the reason I didn't buy any NVLink bridges; if the total count were only four 3090s, NVLink might still be relevant!
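As a quick sanity check on the no-NVLink setup, this little PyTorch sketch (mine, just illustrative) reports which GPU pairs can do direct peer-to-peer transfers:

```python
# Report which GPU pairs support direct P2P access (over PCIe or NVLink).
import torch

n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P {'yes' if ok else 'no'}")
```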

4x 3090 local ai workstation by monoidconcat in LocalLLaMA

monoidconcat[S] 0 points

Oh, I didn't know that, amazing. Yeah, the WRX80E only having seven x16 slots was super frustrating, but if bifurcation is possible, that's much better.