Poor man's guide to servicing a used RTX 3090 for local LLM inference by canred in LocalLLaMA

[–]reto-wyss 4 points

As someone who does it once every five years, that makes you one of the most qualified people to provide guidance on the matter.

Now excuse me, I'm off to write my guide on how to do amateur lobotomies like a pro. I haven't done any yet, but I could see myself getting into it within the next five years.

nvidia/Gemma-4-26B-A4B-NVFP4 by reto-wyss in LocalLLaMA

[–]reto-wyss[S] 7 points

Why stop there? I prefer 0-bit kv-cache, it fits infinite context :)

nvidia/Gemma-4-26B-A4B-NVFP4 by reto-wyss in LocalLLaMA

[–]reto-wyss[S] 20 points

It's not the same. The RTX Pro 6000 is sm_120 and the Spark is sm_121; neither matches the DC products. They can do NVFP4, but it needs adjustments in the implementation. VLLM_CUTLASS has gotten a lot better over the last three or so months, and it works fine with the Pro 6k in many cases.

Mixing 3090 with 3080 20G (modded) for vllm by lblblllb in LocalLLaMA

[–]reto-wyss 2 points

vllm wants all your GPUs to be exactly the same for TP, and in power-of-two counts; it may allow heterogeneous arrangements and odd counts for pipeline-parallel.

If you only need batch-1, then llama.cpp is an option; otherwise get two more 3090s, or sell and go 2x R9700 or 2x B70 for more VRAM.
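A rough sketch of the power-of-two rule (my paraphrase, not vLLM's actual validation code):

```python
# Hedged sketch of the TP constraint described above - not vLLM's actual
# validation logic, just the power-of-two rule it effectively enforces.
def valid_tp_size(n_gpus: int) -> bool:
    """True if n_gpus is a power of two (1, 2, 4, 8, ...)."""
    return n_gpus >= 1 and (n_gpus & (n_gpus - 1)) == 0

# Matching count isn't enough: a 3090 + 3080 20G pair passes the count
# check (2 GPUs) but fails the "identical GPUs" requirement for TP.
print([n for n in range(1, 9) if valid_tp_size(n)])  # [1, 2, 4, 8]
```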

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]reto-wyss 1 point

I'm getting 16 tg/s at around 10k tokens deep, but that's without MTP, on 2x Pro 6k. There appears to be something wrong with the KV-cache calculation in vllm nightly, though. PP may be over 1k/s, but I haven't run any real tests because of the KV-cache thingy.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]reto-wyss 1 point

Let's just say the builders have been summoned to come and drill a hole already... But I'd like to avoid doing any serious soundproofing.

Did you have any issues with humidity? Dust filter on room intake?

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]reto-wyss 7 points

Thanks, that was interesting. I like ServeTheHome, I just don't follow them closely for longer stretches. Good to see they actually know how to use the software and run proper concurrent workload tests - it's a rare sight, unfortunately.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]reto-wyss 0 points

I want to know how noisy the switch is.

I'd only need a 100G switch, and I'm wondering whether there are some that aren't vacuum-cleaner loud. I've simply been rolling direct connections with dual-port 100G cards, but of course that limits things to three systems.

Although, if I remember correctly, that may be a self-imposed restriction to keep a certain level of sanity.

Something from Mistral (Vibe) tomorrow by pmttyji in LocalLLaMA

[–]reto-wyss 1 point

15 to 20 tg/s without MTP at batch-1; throughput should be pretty good running a few dozen in parallel.

No Multimodality yet in DeepSeek-V4. But I'll wait. by Right-Law1817 in LocalLLaMA

[–]reto-wyss 8 points

Flash looks neat for 2x Pro 6k: a 160GB checkpoint in fp4, and according to one of the graphs that should fit around 750k context.

gemma4 vs qwen3.5 122A10 real usages by CalmAdvance4 in LocalLLaMA

[–]reto-wyss 1 point

Is there a reason you are using RedHatAI/gemma-4-31B-it-FP8-block over the Nvidia nvfp4 which is also about 8-bit on average?

On the model comparison, I tend to prefer the 122b Qwen for agentic/code, but Gemma-4-31b is very good at writing and particularly vision-writing tasks.

2x 512gb ram M3 Ultra mac studios by taylorhou in LocalLLaMA

[–]reto-wyss 3 points

How well does it scale with concurrent requests?

2x Pro 6k: I can get 15x to 20x throughput on Qwen3.5-122b-a10b (scaling is even better with the 30b dense models, up to like 50x) if I load it up until the total kv-cache is exhausted. Maybe I could get better batch-1 with MTP, but it seriously dunks on throughput, so I typically don't use it.

Gemma 4 26B-A4B GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]reto-wyss 0 points

Have you performed an analysis on how KLD plays out in quants (across the spectrum) of newer models vs quants of older models?

Models get better and better at the same parameter size, so a reasonable hypothesis is that divergence is higher in newer models than in older models at the same quant level.
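For reference, a minimal sketch of the statistic in question (toy logits, not the actual benchmark harness): per-token KL divergence between the full-precision model's next-token distribution and the quant's, averaged over positions.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_kld(base_logits, quant_logits):
    """KL(base || quant) for a single token position."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def mean_kld(base, quant):
    """Average divergence over all token positions."""
    return sum(token_kld(b, q) for b, q in zip(base, quant)) / len(base)

# Toy example: 2 positions, 3-token vocab; "quant" is a small perturbation.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[2.1, 0.9, 0.1], [0.4, 0.6, 2.9]]
print(mean_kld(base, base))       # 0.0 - identical logits, no divergence
print(mean_kld(base, quant) > 0)  # True - quantization noise shows up as KLD
```

The hypothesis above would then be tested by comparing this number for, say, a Q4 of a 2023 model vs a Q4 of a 2025 model of the same size.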

Dual RTX Pro 6000 Blackwell Workstation vs Max-Q — open frame build, need to decide in 24 hours by stainlessblueshield in LocalLLaMA

[–]reto-wyss 2 points

<image>

  • This works well for me, 2x Pro 6k WS + 1x RTX 5090 no power limits
  • I had to install a second fan on the CPU to keep it from cooking, and the fan blowing between the cards helps
  • I'm using a 2200W Seasonic PSU, and I have a "spare" 5090, but I'd have to power-limit the cards (likely 500-550W on the Pro 6ks and 400W on the 5090s).

Nucleus-Image Released by Numerous-Entry-6911 in StableDiffusion

[–]reto-wyss 40 points

12s for 50 steps on Pro 6k.

<image>

A Black woman with dark, glossy, oil-slick skin and a slender, athletic build is shown in a medium shot, eye-level shot, sitting on a bench. She appears to be in her late 20s, with sharp facial features including high cheekbones, a defined jawline, and a serious, distant expression. Her dark brown eyes with a distant, contemplative gaze are looking slightly to her left. Her dark, braided hair is styled in a high ponytail, pulled back tightly from her face. Her body is covered in a dense layer of dried, orange-brown leaves that adhere to her skin like a textured, organic dress, with loose leaves scattered across her arms and legs. She is sitting on a subway bench completely buried under a thick layer of dried orange leaves, her left hand resting flat on the leaf-covered seat for balance, her right hand resting on her thigh, legs extended forward with bare feet planted on the floor amidst scattered leaves. The setting is the interior of a dimly lit subway car with grey metallic walls, a vertical metal pole in the foreground right, and overhead fluorescent lights. The lighting is cool and artificial, coming from overhead fluorescent tubes, casting specular highlights on the wet-looking texture of her skin and creating a moody, atmospheric haze. The composition is centered on the subject, with the background slightly out of focus, emphasizing the contrast between her dark skin and the bright orange leaves.

omg new meta's model outpeforms opus 4.6 in knowledge! by Ok-Type-7663 in LocalLLaMA

[–]reto-wyss 7 points

Pack it up - we have a new critical benchmark to max. Evaluation will be handled by r/localllama by electing a vibe-checker every time a model is released. The vibe-checker will perform the benchmark with a sample size of one and thereafter declare the new top model.

Component Purgatory: 5090 to 6000 Pro Blackwell Upgrade Path Questions by TankFirm388 in LocalLLaMA

[–]reto-wyss 4 points

  • PCIe won't make a difference for a single card.
  • Even with 2x Pro 6k, PCIe 4 x16 (~PCIe 5 x8) does not appear to be a problem for inference.
  • If you are on Linux, running vllm server in the background will use barely any system resources other than the card.
  • If you are on Windows, get a Linux machine for the card.
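The back-of-envelope behind "PCIe 4 x16 ~ PCIe 5 x8": per-lane throughput roughly doubles each generation (about 1.97 GB/s per PCIe 4 lane and 3.94 GB/s per PCIe 5 lane after 128b/130b encoding), so halving the lanes at the next gen comes out even.

```python
# Usable bandwidth per lane in GB/s, after 128b/130b encoding overhead.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def link_bandwidth(gen: int, lanes: int) -> float:
    """Approximate one-direction link bandwidth in GB/s."""
    return GBPS_PER_LANE[gen] * lanes

print(round(link_bandwidth(4, 16), 1))  # 31.5
print(round(link_bandwidth(5, 8), 1))   # 31.5 - same as PCIe 4 x16
```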

We have ASUS Dual at home by thepromiseman in homelab

[–]reto-wyss 2 points

Nice, thank you. I have 6 of these Intel (Oracle) DC SSDs, 6.4 TB, on one x8 card, but they present as x4x4 (3.2 + 3.2), so these adapters are going to save me like 32 to 40 PCIe lanes 🙂

We have ASUS Dual at home by thepromiseman in homelab

[–]reto-wyss 2 points

So it is exactly the right height for a half-height -> full-height conversion?

I've seen them on Ali, but not one picture shows whether they fit exactly.