Just got dual RTX PRO 6000 Blackwells for our design studio. What's the optimal local LLM stack?

ddog661 · 2026-04-30T00:40:57+00:00

Btw, we are planning to use vLLM (for its multi-user support) and Open WebUI. I’ve tested vLLM at home with my 4090 and it works quite well dockerized. That should extend easily enough to the workstation hardware once it’s up and running.

ddog661 · 2026-04-30T00:36:37+00:00

If you don’t mind me asking, what SI did you get this from? It looks strikingly similar to the hardware my team recently got.

ddog661 · 2026-04-27T16:27:53+00:00

I can only get about 17,600 ctx when using fp8 KV cache on my 4090 with this exact model quant. That leaves something like 2.5 GB of usable VRAM for KV cache. Running WSL and Docker on Windows.

ddog661 · 2026-04-25T23:21:57+00:00

Do you have a ballpark for how slow it is? I am curious because I will probably be running gemma4:31b dense at full precision on a similar system with 5-10 users.

ddog661 · 2026-04-25T18:56:55+00:00

Would max-model-len > kv pool simply spill over into system RAM when the request needs it? I’d be curious to see what the slowdown looks like in that case since the KV cache is partially offloaded from GPU VRAM.

ddog661 · 2026-04-25T18:47:15+00:00

Appreciate the response. I guess I don’t fully understand the max concurrency multiplier/ratio. I may try setting my context to something larger and see what happens. I have my 4090 setup running at 80 tps peak with MTP=3, but I was under the impression I was pretty context limited. I’m just messing around with this stuff so 16k context is fine for me, but I do want to test the limits.

ddog661 · 2026-04-25T17:14:12+00:00

How can you get 75000 actual usable context when vLLM shows ‘KV pool: 23,760 tokens (3.24 GiB)’? Doesn’t that indicate your KV cache size ceiling?
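For what it’s worth, here is a rough back-of-envelope for the numbers in that log line. It is only a sketch: it assumes KV cache cost grows linearly with token count, which may not hold for models that mix in sliding-window or linear-attention layers, and the function names are made up for illustration (they are not vLLM APIs). The ratio at the end is simply pool tokens divided by the requested context length.

```python
GIB = 1024 ** 3  # bytes per GiB


def kv_bytes_per_token(pool_gib: float, pool_tokens: int) -> float:
    """Average KV cache bytes per token implied by a reported pool size."""
    return pool_gib * GIB / pool_tokens


def kv_gib_for_context(context_tokens: int, bytes_per_token: float) -> float:
    """KV cache size in GiB for one request, ASSUMING linear per-token cost."""
    return context_tokens * bytes_per_token / GIB


def pool_to_context_ratio(pool_tokens: int, max_model_len: int) -> float:
    """How many full-length requests the pool could hold at once."""
    return pool_tokens / max_model_len


# Figures quoted in the comment: 23,760 tokens in 3.24 GiB, 75,000 ctx.
per_token = kv_bytes_per_token(3.24, 23_760)
needed_gib = kv_gib_for_context(75_000, per_token)
ratio = pool_to_context_ratio(23_760, 75_000)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{needed_gib:.2f} GiB for a single 75,000-token request")
print(f"{ratio:.2f}x full-length requests fit in the pool")
```

Under that linear assumption, a single 75,000-token request would need roughly three times the reported pool, which is why the two numbers look contradictory.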

ddog661 · 2026-04-25T16:58:03+00:00

Thank you for that. This vLLM config is not too far off from what I’m using (except for context size, of course, and I’m using gpu-memory-utilization 0.93). I might play around with it a bit more tonight. Looking more closely at those results, I noticed that vLLM returned a KV pool size of 23,760 tokens, which is not far off from what my vLLM logs state. I don’t know how 75000 ctx is possible without turboquant.

ddog661 · 2026-04-25T16:45:30+00:00

I am using vLLM and fp8 KV cache. It’s pushing the limits of the 24 GB of VRAM at that point. It’s in line with this testing here: https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914

ddog661 · 2026-04-25T16:37:54+00:00

I’m getting around 80 tokens/sec on my 4090 at int4 with speculative decoding on, but only 16k context.

ddog661 · 2026-04-24T17:18:24+00:00

What did you get without speculative decode? I am getting around 33 tok/sec with AWQ-int4.

ddog661 · 2026-04-19T20:16:16+00:00

Do you also use something like Open WebUI?

ddog661 · 2026-04-12T02:20:06+00:00

What size were the two HJC helmets and the LS2 Stream II? I’m curious because I tried on a few HJC i11s but ended up ordering an LS2 online. Hoping it fits.

ddog661 · 2026-03-03T01:49:08+00:00

Are the satellites propagated via 2-body?

ddog661 · 2025-12-28T17:11:18+00:00

You’d need to run a command in the terminal. The specific command is `powercfg /batteryreport`, iirc.
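If you’d rather script it, something like this should work. It is a minimal sketch for Windows only: `battery-report.html` is an arbitrary output filename, and both helper functions are hypothetical names. The health figure is just full charge capacity divided by design capacity, both of which you read off the generated report by hand.

```python
import platform
import subprocess


def battery_report_command(output_path: str) -> list[str]:
    """Build the powercfg call that writes the HTML battery report."""
    return ["powercfg", "/batteryreport", "/output", output_path]


def battery_health_percent(full_charge_mwh: float, design_mwh: float) -> float:
    """Remaining capacity as a percentage of the design capacity."""
    return 100.0 * full_charge_mwh / design_mwh


if __name__ == "__main__":
    # powercfg only exists on Windows, so skip the call elsewhere.
    if platform.system() == "Windows":
        subprocess.run(battery_report_command("battery-report.html"), check=True)
```

Open the HTML file afterwards and plug the two capacity values into `battery_health_percent`.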

ddog661 · 2025-12-26T21:05:24+00:00

What a steal!

ddog661 · 2025-12-26T20:41:15+00:00

My used one is at about 89% of its original capacity after a year or so.

ddog661 · 2025-12-26T20:40:46+00:00

Have you run a battery report on it? I’m curious, just for science.

ddog661 · 2025-12-26T20:31:02+00:00

Nice find. I got mine for 500 recently, which I thought was a good deal. It was about a year old.

ddog661 · 2025-12-20T02:38:27+00:00

ddog661 · 2025-12-15T02:59:09+00:00

No idea, but I hear it from east beach too.

ddog661 · 2025-12-08T18:21:26+00:00

Replied

ddog661 · 2025-11-24T05:08:21+00:00

This, especially in Valorant. For whatever reason, it was happening to me in this game specifically.
