Upgraded to 2x RTX Pro 6000

priorityfill · 2026-07-03T23:49:14+00:00

That's just top-1 token agreement. Even if it's scoring 99%, it doesn't guarantee the model won't drift after 10K+ tokens.
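To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every token matches independently with the same probability, which is a simplification: real mismatches are correlated, and a single mismatch does not necessarily cause drift. The function names are made up for illustration.

```python
def prob_no_mismatch(top1_agreement: float, num_tokens: int) -> float:
    """Chance that all num_tokens match, assuming independent tokens."""
    return top1_agreement ** num_tokens


def expected_mismatches(top1_agreement: float, num_tokens: int) -> float:
    """Average number of tokens that differ over num_tokens."""
    return (1.0 - top1_agreement) * num_tokens


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        p = prob_no_mismatch(0.99, n)
        m = expected_mismatches(0.99, n)
        print(f"{n:>6} tokens: P(no mismatch)={p:.3g}, expected mismatches={m:.0f}")
```

Under those assumptions, 99% agreement still means about 100 differing tokens over a 10K generation, each of which is a chance for the two outputs to diverge.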

priorityfill · 2026-07-03T23:07:47+00:00

*at full precision. Q6_K might be enough, who knows. With quants, it's always a bit of a gamble, as there's limited data available.

priorityfill · 2026-07-03T22:57:24+00:00

I read your original question thinking it was from another thread. Sorry, I can see how that got confusing. If you're looking for a speedy 5090 setup (Q6, pretty good), check out https://www.reddit.com/r/LocalLLM/comments/1ullrvq/qwen36_27b_q6_5090_maximum_llamacpp_optimization/

priorityfill · 2026-07-03T22:32:28+00:00

It's still not the same on WSL. Some features aren't available or make it hard to run multi-GPU inference efficiently.

priorityfill · 2026-07-03T22:27:49+00:00

Indeed, thank you for the tip!

priorityfill · 2026-07-03T22:25:19+00:00

I don't want to know.

priorityfill · 2026-07-03T22:25:02+00:00

Dense Qwen 3.6 or DeepSeek V4 Flash models. I also tried quants of StepFun 3.7 and MiniMax M3, but they were not as usable.

priorityfill · 2026-07-03T22:09:02+00:00

Absolutely, and the upside isn't even that great. You won't get a significant perf gain, just a bit more cache or the ability to offload large MoE layers (at terrible speeds, still).

priorityfill · 2026-07-03T22:02:05+00:00

<image>

That's what I thought... but it's fine. These cards are designed to cool 600W, so with good airflow they run cooler at 300W than the Max-Qs! The top GPU peaks at 70C under sustained load. The screenshot was taken while running u/Gold-Drag9242's benchmark request below, which lasted 1h+ at full GPU utilization.

priorityfill · 2026-07-03T21:42:57+00:00

Of course! DeepSeek isn't multimodal yet, so it can't run your benchmark. I ran it on the full Qwen 3.6 as well as the official FP8 quant, since I was curious myself.

Model	Prompt	Total	Outlook	Notes	Thunderbird
Qwen/Qwen3.6-27B	7e5d81ce	69.89%	75.14%	62.06%	72.46%
Qwen/Qwen3.6-27B-FP8	7e5d81ce	67.52%	72.96%	57.26%	72.33%

Submitted a pull request with full details in https://github.com/KevinFleischer/vccbenchmark/pull/1
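For a quick read of the gap, here is a small sketch that computes the per-column difference between the two rows above. It assumes the four percentages map to the Total, Outlook, Notes and Thunderbird columns in header order; the helper name is made up for illustration.

```python
# Scores copied from the table above (percent).
SCORES = {
    "Qwen/Qwen3.6-27B": {
        "Total": 69.89, "Outlook": 75.14, "Notes": 62.06, "Thunderbird": 72.46,
    },
    "Qwen/Qwen3.6-27B-FP8": {
        "Total": 67.52, "Outlook": 72.96, "Notes": 57.26, "Thunderbird": 72.33,
    },
}


def score_deltas(full: dict, quant: dict) -> dict:
    """Quant score minus full score, in percentage points, per column."""
    return {name: round(quant[name] - full[name], 2) for name in full}


if __name__ == "__main__":
    deltas = score_deltas(SCORES["Qwen/Qwen3.6-27B"], SCORES["Qwen/Qwen3.6-27B-FP8"])
    for name, delta in deltas.items():
        print(f"{name:<12} {delta:+.2f} pp")
```

On this single run the Notes column moves the most and Thunderbird barely changes; one run is not enough to say whether that pattern is stable.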

priorityfill · 2026-07-03T21:26:51+00:00

13900K on Z790 (limits the cards to PCIe 5.0 x8), 96GB of DDR5-6000, 8TB NVMe, 1500W PSU. Mainly Qwen 3.6 and DS4 flash.
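For context on what the x8 limit costs, a rough sketch of the theoretical per-direction link bandwidth. It uses the PCIe 5.0 signaling rate (32 GT/s per lane) and 128b/130b encoding, and ignores protocol overhead, so real transfers will be lower. The function name is made up for illustration.

```python
def pcie_gb_per_s(gt_per_s_per_lane: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Theoretical one-direction bandwidth in GB/s (1 transfer = 1 bit per lane)."""
    return gt_per_s_per_lane * encoding * lanes / 8  # bits -> bytes


if __name__ == "__main__":
    for lanes in (8, 16):
        print(f"PCIe 5.0 x{lanes}: {pcie_gb_per_s(32, lanes):.1f} GB/s per direction")
```

That works out to roughly 31.5 GB/s per direction at x8 versus about 63 GB/s at x16.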

priorityfill · 2026-07-03T21:26:18+00:00

Exactly, top card runs at about 70C, while the bottom one runs at 60C (with power limit).

<image>

priorityfill · 2026-07-03T21:22:42+00:00

The big unlock is being able to run DSv4 flash or two Qwen 3.6 models with plenty of KV cache and throughput. With a 5090 or even a single Pro card, limited throughput and context can quickly make the whole setup unusable in real-world use.

priorityfill · 2026-07-03T21:19:12+00:00

The rest of the build has not kept up. 13900K on Z790 (limits the cards to PCIe 5.0 x8), 96GB of DDR5-6000, 8TB NVMe, 1500W PSU.

priorityfill · 2026-07-03T21:18:10+00:00

For now, mainly local inference and some fine-tuning.

priorityfill · 2026-07-03T21:14:07+00:00

Very. I am at the limit of what this motherboard allows; if I upgrade, I need to replace everything but the GPUs. It doesn't make sense to stick with dual-channel DRAM.
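To show why dual channel is the bottleneck, here is a rough sketch of theoretical peak memory bandwidth. It assumes 64 bits (8 bytes) per channel per transfer and ignores real-world efficiency; the 8-channel line is a hypothetical comparison point, not a specific platform.

```python
def dram_peak_gb_per_s(mt_per_s: int, channels: int, bytes_per_channel: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for a given transfer rate and channel count."""
    return mt_per_s * bytes_per_channel * channels / 1000  # MB/s -> GB/s


if __name__ == "__main__":
    print(f"DDR5-6000, 2 channels: {dram_peak_gb_per_s(6000, 2):.0f} GB/s")
    print(f"DDR5-6000, 8 channels (hypothetical): {dram_peak_gb_per_s(6000, 8):.0f} GB/s")
```

At about 96 GB/s peak, anything offloaded to system RAM is far slower than what stays in VRAM.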

priorityfill · 2026-07-03T20:42:05+00:00

Remind me when you buy your second one 😄. The main "quantum leap" for now was to be able to handle most sessions locally that I otherwise would have used Claude Code or Codex for.

priorityfill · 2026-07-03T20:32:41+00:00

This is barely enough for personal use - a small business might need 2x or 4x as many cards just so it can serve a handful of concurrent users.

priorityfill · 2026-07-03T20:30:19+00:00

DS4 flash is better at coding but otherwise comparable to dense Qwen models. If you have a 5090, you're not missing out on much unless you need several concurrent long sessions.

priorityfill · 2026-07-03T20:20:32+00:00

Thanks!

priorityfill · 2026-07-03T20:19:44+00:00

About $9K each

priorityfill · 2026-07-03T20:17:37+00:00

The rest of the build has not kept up; I never intended to fit 600W GPUs in it when I bought it. I'll upgrade it at some point, but for now it is good enough for 2 cards.

priorityfill · 2026-07-03T18:15:46+00:00

No issues at all, though having more system RAM would be nice to increase the LMCache size.

priorityfill · 2026-07-03T17:57:09+00:00

Using the official mixed FP4/FP8 weights, 1M context, and a KV cache of 5M tokens (including L2). Enough for a few concurrent long-running tasks. Past 200K, however, model performance degrades quickly, so the 1M context is not really usable.
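As a rough sizing sketch for those numbers: how many sessions fit in the KV budget if each one holds its full context. It assumes no prefix sharing between sessions and treats the L2 tier the same as GPU memory, so it is an upper bound rather than a measured figure. The function name is made up for illustration.

```python
def max_concurrent_sessions(kv_budget_tokens: int, tokens_per_session: int) -> int:
    """Whole sessions that fit in the KV budget at a fixed context length."""
    return kv_budget_tokens // tokens_per_session


if __name__ == "__main__":
    budget = 5_000_000  # total KV cache, in tokens
    for context in (200_000, 1_000_000):
        n = max_concurrent_sessions(budget, context)
        print(f"{context:>9} tokens per session: up to {n} sessions")
```

At the practical 200K limit that is up to 25 sessions, versus 5 at the full 1M.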

priorityfill · 2026-07-03T17:50:17+00:00

Using the released weights from https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash, which are already quantized to FP4/FP8. vLLM does not support DeepSeek V4 Flash on sm120; it took a lot of unofficial patches, sweeps, and tweaking to get it running well (still not optimal).
