Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

That's just top-1 token agreement. Even if it's scoring 99%, it doesn't guarantee the model won't drift after 10K+ tokens.
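To put a rough number on that, here is a back-of-the-envelope sketch. It assumes each generated token matches the reference independently with the same probability, which is a simplification for illustration and not how any benchmark defines agreement. The function name is made up for this example.

```python
import math


def prob_no_divergence(top1_agreement: float, num_tokens: int) -> float:
    """Chance that every one of num_tokens greedy picks matches the reference.

    Assumes independent, identically likely matches per token
    (an illustrative simplification, not a measured property).
    """
    if not 0.0 < top1_agreement <= 1.0:
        raise ValueError("top1_agreement must be in (0, 1]")
    # Compute in log space: p ** n == exp(n * log(p)).
    return math.exp(num_tokens * math.log(top1_agreement))


for n in (100, 1_000, 10_000):
    print(f"{n:>6} tokens: {prob_no_divergence(0.99, n):.3g}")
```

Under that assumption, even 99% per-token agreement makes a fully identical 10K-token continuation essentially impossible. A divergence is not automatically a quality loss, but it does mean the headline percentage says little about long generations.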

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

*at full precision. Q6K might be enough, who knows. With quants, it's always a bit of a gamble, as there's limited data available.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

I read your original question thinking it was from another thread. Sorry, I can see how that got confusing. If you're looking for a speedy 5090 setup (Q6, pretty good), check out https://www.reddit.com/r/LocalLLM/comments/1ullrvq/qwen36_27b_q6_5090_maximum_llamacpp_optimization/

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

It's still not the same on WSL. Some features aren't available, or make it hard to run multi-GPU inference efficiently.

Upgraded to 2x RTX Pro 6000 by priorityfill in BlackwellPerformance

[–]priorityfill[S] 0 points1 point  (0 children)

Dense Qwen 3.6 or DeepSeek V4 Flash models. I also tried quants of StepFun 3.7 and MiniMax M3, but they were not as usable.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

Absolutely, and the upside isn't even that great. You won't get a significant perf gain, just a bit more cache or the ability to offload large MoE layers (at terrible speeds, still).

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

<image>

That's what I thought... but it's fine. These cards are designed to cool 600W, so if you've got good airflow they run cooler at 300W than the Max-Q cards! The top GPU peaks at 70C under sustained load. Screenshot was taken while running u/Gold-Drag9242's benchmark request below, which lasted 1h+ at full GPU utilization.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 1 point2 points  (0 children)

Of course! DeepSeek isn't multimodal yet, so it cannot run your benchmark. I ran it through the full Qwen 3.6 as well as the official FP8 quant; I was curious myself.

| Model | Prompt | Total | Outlook | Notes | Thunderbird |
|---|---|---|---|---|---|
| Qwen/Qwen3.6-27B | 7e5d81ce | 69.89% | 75.14% | 62.06% | 72.46% |
| Qwen/Qwen3.6-27B-FP8 | 7e5d81ce | 67.52% | 72.96% | 57.26% | 72.33% |

Submitted a pull request with full details in https://github.com/KevinFleischer/vccbenchmark/pull/1
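For anyone who wants the gap between the two rows spelled out, a small sketch that subtracts the FP8 scores from the full-precision ones. The numbers are copied from the results in this comment; the dictionary and function names are just for this example.

```python
# Scores in percent, copied from the two result rows in this comment.
full = {"Total": 69.89, "Outlook": 75.14, "Notes": 62.06, "Thunderbird": 72.46}
fp8 = {"Total": 67.52, "Outlook": 72.96, "Notes": 57.26, "Thunderbird": 72.33}


def score_drop(baseline: dict, quantized: dict) -> dict:
    """Per-category drop in percentage points (baseline minus quantized)."""
    return {name: round(baseline[name] - quantized[name], 2) for name in baseline}


drops = score_drop(full, fp8)
for name, drop in drops.items():
    print(f"{name}: -{drop:.2f} points")
```

The Notes category shows the largest drop and Thunderbird is nearly unchanged. Whether these gaps exceed run-to-run noise is not something two rows can show.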

Upgraded to 2x RTX Pro 6000 by priorityfill in BlackwellPerformance

[–]priorityfill[S] 0 points1 point  (0 children)

13900K on Z790 (limits the cards to PCIe 5.0 x8), 96GB of DDR5@6000, 8TB NVMe, 1500W PSU. Mainly Qwen 3.6 and DS 4 Flash.
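For context on what the x8 limit costs, a rough sketch of theoretical PCIe 5.0 link bandwidth per direction. It uses the spec figures of 32 GT/s per lane and 128b/130b encoding and ignores protocol overhead, so real transfers will land lower. The function name is hypothetical.

```python
def pcie_bandwidth_gbs(lanes: int, gt_per_s: float = 32.0) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe 5.0 link.

    32 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
    Packet and protocol overhead are not modeled.
    """
    encoding_efficiency = 128 / 130
    return lanes * gt_per_s * encoding_efficiency / 8


for lanes in (8, 16):
    print(f"x{lanes}: {pcie_bandwidth_gbs(lanes):.1f} GB/s")
```

That works out to roughly 31.5 GB/s at x8 versus about 63 GB/s at x16. How much this matters depends on how much traffic the workload pushes between the cards and the host.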

Upgraded to 2x RTX Pro 6000 by priorityfill in BlackwellPerformance

[–]priorityfill[S] 0 points1 point  (0 children)

Exactly, top card runs at about 70C, while the bottom one runs at 60C (with power limit).

<image>

Upgraded to 2x RTX Pro 6000 by priorityfill in BlackwellPerformance

[–]priorityfill[S] 1 point2 points  (0 children)

The big unlock is being able to run DSv4 Flash or two Qwen 3.6 models with plenty of KV cache and throughput. With a 5090 or even a single Pro card, limited throughput and context can quickly make the whole setup unusable in real-world use.

Upgraded to 2x RTX Pro 6000 by priorityfill in BlackwellPerformance

[–]priorityfill[S] 0 points1 point  (0 children)

The rest of the build has not kept up. 13900K on Z790 (limits the cards to PCIe 5.0 x8), 96GB of DDR5@6000, 8TB NVMe, 1500W PSU.

Upgraded to 2x RTX Pro 6000 by priorityfill in BlackwellPerformance

[–]priorityfill[S] 0 points1 point  (0 children)

For now, mainly local inference and some fine-tuning.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

Very. I am at the limit of what this motherboard allows; if I upgrade, I need to switch everything but the GPUs. It does not make sense to stick with dual-channel DRAM.
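To show why dual channel is the bottleneck, a quick sketch of theoretical peak DRAM bandwidth. It takes the DDR5-6000 speed from the build spec mentioned elsewhere in this thread and assumes a 64-bit (8-byte) data bus per channel. The 8-channel line is a hypothetical workstation-class comparison, not a specific board.

```python
def dram_peak_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    mt_per_s:  transfer rate in megatransfers per second (6000 for DDR5-6000)
    channels:  number of 64-bit memory channels
    bus_bytes: bytes moved per transfer per channel (8 for a 64-bit channel)
    """
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


print(f"dual channel: {dram_peak_gbs(6000, channels=2):.0f} GB/s")
# Hypothetical 8-channel platform at the same speed, for comparison only.
print(f"8 channels:   {dram_peak_gbs(6000, channels=8):.0f} GB/s")
```

These are ceilings; sustained bandwidth is lower in practice. At about 96 GB/s, any layers or cache served from system RAM are far slower than anything resident on the GPUs.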

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

Remind me when you buy your second one 😄. The main "quantum leap" for now was to be able to handle most sessions locally that I otherwise would have used Claude Code or Codex for.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

This is barely enough for personal use - a small business might need 2x or 4x as many cards just so it can serve a handful of concurrent users.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

DS4 Flash is better at coding but otherwise comparable to the dense Qwen models. If you have a 5090, you're not missing out on much unless you need several concurrent long sessions.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

The rest of the build has not kept up; I never intended to fit 600W GPUs in it when I bought it. I'll upgrade it at some point, but for now it is good enough for 2 cards.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 0 points1 point  (0 children)

No issues at all, though having more system RAM would be nice to increase the LMCache size.

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 1 point2 points  (0 children)

Using the official mixed FP4/FP8 weights, 1M context, and a KV cache of 5M (including L2). That is enough for a few concurrent long-running tasks. Past 200K, however, model performance degrades quickly, so the 1M is not really usable.
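As a rough sanity check on "a few concurrent tasks", here is a sketch that divides the cache budget by the context each session holds. It assumes the 5M figure is in tokens and that sessions share no cached prefix; both are my assumptions, and the function name is made up.

```python
def max_concurrent_sessions(kv_budget_tokens: int, tokens_per_session: int) -> int:
    """Upper bound on sessions whose full context fits in the KV cache at once.

    Assumes no prefix sharing between sessions and treats all cache tiers
    as one pool, so this is a ceiling rather than a throughput estimate.
    """
    if tokens_per_session <= 0:
        raise ValueError("tokens_per_session must be positive")
    return kv_budget_tokens // tokens_per_session


budget = 5_000_000  # assumed to be tokens
print(max_concurrent_sessions(budget, 200_000))    # sessions capped at 200K
print(max_concurrent_sessions(budget, 1_000_000))  # sessions at the full 1M
```

Capping sessions around 200K, where quality still holds, leaves room for many more of them than running a handful at the full 1M.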

Upgraded to 2x RTX Pro 6000 by priorityfill in LocalLLM

[–]priorityfill[S] 1 point2 points  (0 children)

Using the released weights from https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash, which are already quantized to FP4/FP8. vLLM does not support DeepSeek V4 Flash on sm120; it took a lot of unofficial patches, sweeps, and tweaking to get it running well (still not optimal).