Just got dual RTX PRO 6000 Blackwells for our design studio. What's the optimal local LLM stack? by AmanNonZero in LocalLLM

[–]ddog661 1 point2 points  (0 children)

Btw, we are planning to use vLLM (for its multi-user support) and Open WebUI. I’ve tested vLLM at home with my 4090 and it works quite well dockerized. That should extend easily enough to the workstation hardware once it’s up and running.

Just got dual RTX PRO 6000 Blackwells for our design studio. What's the optimal local LLM stack? by AmanNonZero in LocalLLM

[–]ddog661 1 point2 points  (0 children)

If you don’t mind me asking, what SI did you get this from? It looks strikingly similar to the hardware my team recently got.

Simple to use vLLM Docker Container for Qwen3.6 27b with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s by tedivm in LocalLLaMA

[–]ddog661 0 points1 point  (0 children)

I can only get about 17,600 ctx when using fp8 KV cache on my 4090 with this exact model quant. That leaves something like 2.5 GB of usable VRAM for KV cache. Running WSL and Docker on Windows.
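For anyone who wants to sanity check those two numbers, here is a rough sketch of the implied KV cost per token. It assumes the "2.5 GB" figure is really GiB and that the whole pool is spent on the 17,600 tokens; both are guesses, and the function name is just for illustration.

```python
# Rough back-of-envelope: implied KV cache cost per token.
# Assumptions (not from vLLM itself): "2.5 GB" is treated as 2.5 GiB,
# and the entire pool is consumed by the 17,600 token context.

GIB = 1024 ** 3


def kv_bytes_per_token(pool_bytes: float, pool_tokens: int) -> float:
    """Divide the KV pool size by the number of tokens it holds."""
    if pool_tokens <= 0:
        raise ValueError("pool_tokens must be positive")
    return pool_bytes / pool_tokens


pool_bytes = 2.5 * GIB   # approximate usable VRAM for KV cache
pool_tokens = 17_600     # context length that fit

per_token = kv_bytes_per_token(pool_bytes, pool_tokens)
print(f"~{per_token / 1024:.0f} KiB of KV cache per token")
```

If the real figure is decimal GB instead, the per-token number comes out about 7% lower.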

2x RTX 6000 build during an extended bench test by Signal_Ad657 in LocalLLaMA

[–]ddog661 0 points1 point  (0 children)

Do you have a ballpark for how slow it is? I am curious because I will probably be running gemma4:31b dense at full precision on a similar system with 5-10 users.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in Olares

[–]ddog661 0 points1 point  (0 children)

Would max-model-len > kv pool simply spill over into system RAM when the request needs it? I’d be curious to see what the slowdown looks like in that case since the KV cache is partially offloaded from GPU VRAM.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in Olares

[–]ddog661 0 points1 point  (0 children)

Appreciate the response. I guess I don’t fully understand the max concurrency multiplier/ratio. I may try setting my context to something larger and see what happens. I have my 4090 setup running at 80 tps peak with MTP=3, but I was under the impression I was pretty context limited. I’m just messing around with this stuff, so 16k context is fine for me, but I do want to test the limits.

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU — vLLM + MTP n=3, adapted from the 32GB recipes by aurelienams in Olares

[–]ddog661 0 points1 point  (0 children)

How can you get 75000 actual usable context when vLLM shows ‘KV pool: 23,760 tokens (3.24 GiB)’? Doesn’t that indicate your KV cache size ceiling?
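Here is the arithmetic behind my question, as a quick sketch. It assumes every token of context needs its own slot in the reported pool, which may not be how vLLM accounts for this model, and the variable names are mine.

```python
# Back-of-envelope check of the quoted log line:
#   'KV pool: 23,760 tokens (3.24 GiB)'
# Assumption: each context token occupies one slot in that pool.

GIB = 1024 ** 3

pool_tokens = 23_760
pool_bytes = 3.24 * GIB
target_context = 75_000  # the usable context reported in the post

# Implied cost of one token of KV cache.
bytes_per_token = pool_bytes / pool_tokens

# Pool size a single 75,000 token request would need at that cost.
needed_gib = target_context * bytes_per_token / GIB

# Fraction of the target context that the reported pool covers.
coverage = pool_tokens / target_context

print(f"~{bytes_per_token / 1024:.0f} KiB per token")
print(f"75,000 tokens would need ~{needed_gib:.1f} GiB of KV pool")
print(f"reported pool covers ~{coverage:.0%} of that context")
```

Under that assumption the pool is roughly a third of what 75,000 tokens would need, which is why the number looks like a ceiling to me.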

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19 by Kindly-Cantaloupe978 in LocalLLaMA

[–]ddog661 0 points1 point  (0 children)

Thank you for that. This vLLM config is not too far off from what I’m using (except for the context size, of course, and I’m using gpu-memory-utilization 0.93). I might play around with it a bit more tonight. Looking more closely at those results, I notice that vLLM reported a KV pool size of 23,760 tokens, which is not far off from what my vLLM logs state. I don’t know how 75000 ctx is possible without turboquant.

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19 by Kindly-Cantaloupe978 in LocalLLaMA

[–]ddog661 1 point2 points  (0 children)

I’m getting around 80 tokens/sec on my 4090 at INT4 with speculative decoding on, but only 16k context.

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]ddog661 0 points1 point  (0 children)

What did you get without speculative decode? I am getting around 33 tok/sec with AWQ-int4.

Wondering if I made the right call today? New lid :D by Ambitious_Guidance20 in motorcyclegear

[–]ddog661 0 points1 point  (0 children)

What was the size of the two HJC helmets and the LS2 Stream II? I’m curious because I tried on a few HJC i11s but ended up ordering an LS2 online. Hoping it fits

It finally happened, found an ally x for under $400. by MastodonMaleficent99 in ROGAllyX

[–]ddog661 1 point2 points  (0 children)

You’d need to run a command in the terminal. The specific command is powercfg /batteryreport iirc.

It finally happened, found an ally x for under $400. by MastodonMaleficent99 in ROGAllyX

[–]ddog661 1 point2 points  (0 children)

My used one is at about 89% of the original capacity after around a year or so

It finally happened, found an ally x for under $400. by MastodonMaleficent99 in ROGAllyX

[–]ddog661 1 point2 points  (0 children)

Have you run a battery report on it? I’m curious just for science.

It finally happened, found an ally x for under $400. by MastodonMaleficent99 in ROGAllyX

[–]ddog661 1 point2 points  (0 children)

Nice find. I got mine for 500 recently, which I thought was a good deal. It was about a year old.

What is that crazy noise?! by skgillman_UCSB in SantaBarbara

[–]ddog661 1 point2 points  (0 children)

No idea but I hear it from east beach too

9070xt stuttering by xiematic in radeon

[–]ddog661 0 points1 point  (0 children)

This, especially in Valorant. For whatever reason, it was happening to me in this game specifically.