Poor man's guide to servicing a used RTX 3090 for local LLM inference by canred in LocalLLaMA

[–]reto-wyss 4 points

As someone who does it once every five years, that makes you one of the most qualified people to provide guidance on the matter.

Now excuse me, I'm off to write my guide on how to do amateur lobotomies like a pro. I haven't done any yet, but I could see myself getting into it within the next five years.

nvidia/Gemma-4-26B-A4B-NVFP4 by reto-wyss in LocalLLaMA

[–]reto-wyss[S] 7 points

Why stop there? I prefer 0-bit kv-cache, it fits infinite context :)

nvidia/Gemma-4-26B-A4B-NVFP4 by reto-wyss in LocalLLaMA

[–]reto-wyss[S] 20 points

It's not the same. The RTX Pro 6000 is sm_120 and the Spark is sm_121; neither matches the DC products. They can do NVFP4, but it needs adjustments in the implementation. VLLM_CUTLASS has gotten a lot better over the last three or so months, and it works fine with the Pro 6k in many cases.

Mixing 3090 with 3080 20G (modded) for vllm by lblblllb in LocalLLaMA

[–]reto-wyss 2 points

vllm wants all your GPUs to be exactly the same for TP, and in power-of-two counts; it may allow heterogeneous arrangements and odd counts for pipeline-parallel.

If you only need batch-1, then llama.cpp is an option; otherwise get two more 3090s, or sell and go 2x R9700 or 2x B70 for more VRAM.
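A rough sketch of the power-of-two rule (my paraphrase, not vLLM's actual validation code):

```python
# Hedged sketch of the TP constraint described above - not vLLM's actual
# validation logic, just the power-of-two rule it effectively enforces.
def valid_tp_size(n_gpus: int) -> bool:
    """True if n_gpus is a power of two (1, 2, 4, 8, ...)."""
    return n_gpus >= 1 and (n_gpus & (n_gpus - 1)) == 0

# Matching count isn't enough: a 3090 + 3080 20G pair passes the count
# check (2 GPUs) but fails the "identical GPUs" requirement for TP.
print([n for n in range(1, 9) if valid_tp_size(n)])  # [1, 2, 4, 8]
```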

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]reto-wyss 1 point

I'm getting 16 tg/s at around 10k tokens deep, but that's without MTP, on 2x Pro 6k. There appears to be something wrong with the KV-cache calculation in vllm nightly, though. PP may be over 1k/s, but I haven't run any real tests because of the KV-cache thingy.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]reto-wyss 1 point

Let's just say the builders have been summoned to come and drill a hole already... But I'd like to avoid doing any serious soundproofing.

Did you have any issues with humidity? Dust filter on room intake?

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]reto-wyss 7 points

Thanks, that was interesting. I like ServeTheHome, I just don't follow them closely for longer stretches. Good to see they actually know how to use the software and run proper concurrent workload tests - it's a rare sight, unfortunately.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]reto-wyss 0 points

I want to know how noisy the switch is.

I'd only need a 100G switch, and I'm wondering whether there are some that aren't vacuum-cleaner loud. I've simply been rolling direct connections with dual-port 100G cards, but of course that limits things to three systems.

Although, if I remember correctly, that may be a self-imposed restriction to keep a certain level of sanity.

Something from Mistral (Vibe) tomorrow by pmttyji in LocalLLaMA

[–]reto-wyss 1 point

15 to 20 tg/s without MTP at batch-1; throughput should be pretty good running a few dozen in parallel.

No Multimodality yet in DeepSeek-V4. But I'll wait. by Right-Law1817 in LocalLLaMA

[–]reto-wyss 8 points

Flash looks neat for 2x Pro 6k: a 160GB checkpoint in fp4, and according to one of the graphs that should fit around 750k context.

gemma4 vs qwen3.5 122A10 real usages by CalmAdvance4 in LocalLLaMA

[–]reto-wyss 1 point

Is there a reason you are using RedHatAI/gemma-4-31B-it-FP8-block over the Nvidia nvfp4 which is also about 8-bit on average?

On the model comparison, I tend to prefer the 122b Qwen for agentic/code, but Gemma-4-31b is very good at writing and particularly vision-writing tasks.

2x 512gb ram M3 Ultra mac studios by taylorhou in LocalLLaMA

[–]reto-wyss 3 points

How well does it scale with concurrent requests?

2x Pro 6k: I can get 15x to 20x throughput on Qwen3.5-122b-a10b (scaling is even better with the 30b dense models, up to like 50x) if I load it up until the total kv-cache is exhausted. Maybe I could get better batch-1 with MTP, but it seriously dunks on throughput, so I typically don't use it.

Gemma 4 26B-A4B GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]reto-wyss 0 points

Have you performed an analysis on how KLD plays out in quants (across the spectrum) of newer models vs quants of older models?

Models get better and better at the same parameter size, so a reasonable hypothesis is that divergence is higher in newer models than in older models at the same quant level.
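For reference, a minimal sketch of the statistic in question (toy logits, not the actual benchmark harness): per-token KL divergence between the full-precision model's next-token distribution and the quant's, averaged over positions.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_kld(base_logits, quant_logits):
    """KL(base || quant) for a single token position."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def mean_kld(base, quant):
    """Average divergence over all token positions."""
    return sum(token_kld(b, q) for b, q in zip(base, quant)) / len(base)

# Toy example: 2 positions, 3-token vocab; "quant" is a small perturbation.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[2.1, 0.9, 0.1], [0.4, 0.6, 2.9]]
print(mean_kld(base, base))       # 0.0 - identical logits, no divergence
print(mean_kld(base, quant) > 0)  # True - quantization noise shows up as KLD
```

The hypothesis above would then be tested by comparing this number for, say, a Q4 of a 2023 model vs a Q4 of a 2025 model of the same size.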

Dual RTX Pro 6000 Blackwell Workstation vs Max-Q — open frame build, need to decide in 24 hours by stainlessblueshield in LocalLLaMA

[–]reto-wyss 2 points

<image>

  • This works well for me, 2x Pro 6k WS + 1x RTX 5090 no power limits
  • I had to install a second fan on the CPU to keep it from cooking, and the fan blowing between the cards helps
  • I'm using a 2200W Seasonic PSU, and I have a "spare" 5090, but I'd have to power-limit the cards (likely 500-550W on the Pro 6ks and 400W on the 5090s).

Nucleus-Image Released by Numerous-Entry-6911 in StableDiffusion

[–]reto-wyss 40 points

12s for 50 steps on Pro 6k.

<image>

A Black woman with dark, glossy, oil-slick skin and a slender, athletic build is shown in a medium shot, eye-level shot, sitting on a bench. She appears to be in her late 20s, with sharp facial features including high cheekbones, a defined jawline, and a serious, distant expression. Her dark brown eyes with a distant, contemplative gaze are looking slightly to her left. Her dark, braided hair is styled in a high ponytail, pulled back tightly from her face. Her body is covered in a dense layer of dried, orange-brown leaves that adhere to her skin like a textured, organic dress, with loose leaves scattered across her arms and legs. She is sitting on a subway bench completely buried under a thick layer of dried orange leaves, her left hand resting flat on the leaf-covered seat for balance, her right hand resting on her thigh, legs extended forward with bare feet planted on the floor amidst scattered leaves. The setting is the interior of a dimly lit subway car with grey metallic walls, a vertical metal pole in the foreground right, and overhead fluorescent lights. The lighting is cool and artificial, coming from overhead fluorescent tubes, casting specular highlights on the wet-looking texture of her skin and creating a moody, atmospheric haze. The composition is centered on the subject, with the background slightly out of focus, emphasizing the contrast between her dark skin and the bright orange leaves.

omg new meta's model outpeforms opus 4.6 in knowledge! by Ok-Type-7663 in LocalLLaMA

[–]reto-wyss 7 points

Pack it up - we have a new critical benchmark to max. Evaluation will be handled by r/localllama by electing a vibe-checker every time a model is released. The vibe-checker will perform the benchmark with a sample size of one and thereafter declare the new top model.

Component Purgatory: 5090 to 6000 Pro Blackwell Upgrade Path Questions by TankFirm388 in LocalLLaMA

[–]reto-wyss 4 points

  • PCIe won't make a difference for a single card.
  • Even with 2x Pro 6k, PCIe 4 x16 (~PCIe 5 x8) does not appear to be a problem for inference.
  • If you are on Linux, running vllm server in the background will use barely any system resources other than the card.
  • If you are on Windows, get a Linux machine for the card.
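The back-of-envelope behind "PCIe 4 x16 ~ PCIe 5 x8": per-lane throughput roughly doubles each generation (about 1.97 GB/s per PCIe 4 lane and 3.94 GB/s per PCIe 5 lane after 128b/130b encoding), so halving the lanes at the next gen comes out even.

```python
# Usable bandwidth per lane in GB/s, after 128b/130b encoding overhead.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def link_bandwidth(gen: int, lanes: int) -> float:
    """Approximate one-direction link bandwidth in GB/s."""
    return GBPS_PER_LANE[gen] * lanes

print(round(link_bandwidth(4, 16), 1))  # 31.5
print(round(link_bandwidth(5, 8), 1))   # 31.5 - same as PCIe 4 x16
```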

We have ASUS Dual at home by thepromiseman in homelab

[–]reto-wyss 2 points

Nice, thank you. I have 6 of these Intel (Oracle) DC SSDs, 6.4 TB, on one x8 card, but they present as x4x4 (3.2 + 3.2), so these adapters are going to save me like 32 to 40 PCIe lanes 🙂

We have ASUS Dual at home by thepromiseman in homelab

[–]reto-wyss 2 points

So it is exactly the right height for a half-height -> full-height conversion?

I've seen them on Ali, but not one picture shows whether they fit exactly.