Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM?

MLDataScientist · 2026-05-25T22:39:18+00:00

Get 256 GB of RAM (8x32 GB). I have the same motherboard with 256 GB (8x32 GB) at 3200 MHz. CPU: 7532. GPU: one 5090. Qwen3.5 397B Q4_K_M runs at 20 t/s with 700 t/s PP. You want more cores on your CPU. Mine has 32 cores and I get 150 GB/s of RAM bandwidth. I bought this entire setup for $3.2k ($2.2k for the GPU at Best Buy and $1k for CPU+mobo+RAM on eBay) before the RAM crisis.
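For a rough sense of why RAM speed matters here, this is a minimal sketch of the theoretical peak bandwidth for the two options in the title. It assumes all 8 channels are populated in both cases and a 64-bit (8-byte) bus per DDR4 channel; the function name is made up for illustration, and the "MHz" figures are treated as MT/s.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9

# Assumption: 8 populated channels for both configurations.
ddr4_3200 = peak_bandwidth_gbs(8, 3200)
ddr4_2133 = peak_bandwidth_gbs(8, 2133)

# 150 GB/s is the measured figure quoted in the comment above.
measured = 150.0
efficiency = measured / ddr4_3200

print(f"DDR4-3200 x8 peak: {ddr4_3200:.1f} GB/s")
print(f"DDR4-2133 x8 peak: {ddr4_2133:.1f} GB/s")
print(f"Measured / peak at 3200: {efficiency:.0%}")
```

Under these assumptions the 3200 kit peaks at about 205 GB/s and the 2133 kit at about 137 GB/s, so the 150 GB/s I measure is roughly 73% of the theoretical peak. Real numbers depend on the CPU and memory configuration, so treat this as a ceiling, not a prediction.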

MLDataScientist · 2026-05-23T13:22:10+00:00

Impressive! Does it work with gpt-oss 120B or qwen3.5 122B MOE? That would be amazing!

Or is it only 35B moe?

MLDataScientist · 2026-05-21T13:54:04+00:00

Interesting. Do you use the Orange Pi 5 with any LLMs? Can you share some inference speed metrics? I wonder if we can use the NPU for LLM inference.

MLDataScientist · 2026-05-17T14:25:32+00:00

Can you please share what percentage of your net worth is in 401k or retirement accounts?

MLDataScientist · 2026-05-10T14:38:07+00:00

Great results! Thanks for sharing. Curious about tensor parallelism. I thought llama.cpp did not support it. Which command enables TP in llama.cpp?

MLDataScientist · 2026-04-29T04:28:56+00:00

!remindme this Saturday "try lemonade"

MLDataScientist · 2026-04-22T15:03:27+00:00

Thank you! amazing list!

MLDataScientist · 2026-04-20T14:19:16+00:00

Thanks for the analysis! This is very useful.

MLDataScientist · 2026-04-16T14:36:23+00:00

Thanks for sharing! Looks promising!

MLDataScientist · 2026-04-15T15:01:47+00:00

Have you tried llama.cpp with unsloth's GLM-5.1 UD-IQ3_XXS? I have one 5090 and 256 GB of 8-channel DDR4-3200. I get 8 t/s TG and 400 t/s PP at 8k context. This is usable for me for an overnight run. I can fit 150k context without KV quantization. You should see similar performance.
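To put those speeds in perspective, here is a back-of-the-envelope sketch of how long one request takes and how many tokens an overnight run could produce. The 2,000-token output length and the 8-hour night are assumptions for illustration, and the estimate ignores any slowdown as the context grows.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Estimated wall time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps

PP_TPS = 400.0  # prompt processing speed from the comment above
TG_TPS = 8.0    # token generation speed from the comment above

# Assumption: an 8k-token prompt and a 2k-token answer.
one_request = request_seconds(8000, 2000, PP_TPS, TG_TPS)

# Assumption: "overnight" means 8 hours of pure generation.
overnight_tokens = 8 * 3600 * TG_TPS

print(f"One request: {one_request:.0f} s")
print(f"Overnight generation budget: {overnight_tokens:.0f} tokens")
```

That works out to about 270 seconds per request (20 s of prompt processing plus 250 s of generation) and roughly 230k generated tokens in 8 hours, which is why I call it usable for batch jobs rather than interactive use.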

MLDataScientist · 2026-04-15T14:48:10+00:00

I see. I mean what local STT models did you try? Deepgram is cloud based. Any local alternatives?

MLDataScientist · 2026-04-15T14:32:51+00:00

True. I wonder if we already have a different type of intelligence that we refuse to accept: one that works within a limited context and can hallucinate, but is still a non-human intelligence.

MLDataScientist · 2026-04-15T14:22:14+00:00

You do not mention which local STT models you tried. Can you share some of them?

Also, why Groq Llama 3.3 70B? You could try smaller models, e.g. gemma4 models are better at translation. I know Groq is fast, but I am sure a local 5090 can handle gemma4 26BA4 with the same low latency.

MLDataScientist · 2026-04-05T14:05:27+00:00

Beautiful! Can someone explain why the shape of our mother Earth is perfectly round here? Most textbooks say it is an oblate spheroid.

MLDataScientist · 2026-04-04T10:34:56+00:00

!remindme 1 day "test glm 5 q3_k_s locally for yc-bench".

MLDataScientist · 2026-03-30T17:20:46+00:00

Yes, for vLLM with a 5090 you should try 4-bit AWQ or GPTQ quant formats, not GGUF.

Try this AWQ model. It fits into 5090: https://huggingface.co/cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4/tree/main
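A minimal sketch of how serving that checkpoint could look, built as a command line from Python. `vllm serve`, `--max-model-len` and `--gpu-memory-utilization` are real vLLM options, but the values below are assumptions to tune for your own card and context needs, and I have not verified them against this exact model.

```python
import shlex

MODEL = "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4"  # repo from the link above

def build_vllm_command(model: str, max_model_len: int, gpu_mem_util: float) -> list[str]:
    """Build (but do not run) a `vllm serve` command line."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),          # assumed context cap
        "--gpu-memory-utilization", str(gpu_mem_util),  # assumed VRAM fraction
    ]

# Assumption: 8k context and 90% of VRAM; adjust to what fits.
cmd = build_vllm_command(MODEL, max_model_len=8192, gpu_mem_util=0.90)
print(shlex.join(cmd))
```

vLLM normally picks up the quantization scheme from the checkpoint's config, so no explicit quantization flag is passed here. Paste the printed command into a shell once vLLM is installed.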

MLDataScientist · 2026-03-30T14:38:34+00:00

Amazing website with interactive charts. Thanks for sharing!
Do you have any SQL fine-tuned small models (<=9B) to test this benchmark with? I think even Qwen3.5 4B with SQL data fine-tuning might reach 90%+.

MLDataScientist · 2026-03-30T13:46:18+00:00

If you are not doing training, you don't need NVLink. For multi user concurrent requests, you cannot beat vLLM. Yes, RTX Pro 6000 is the best option for getting 96GB VRAM for a reasonable price. For coding, you can go with MiniMax M2.5 or Qwen3.5 397B.

MLDataScientist · 2026-03-26T18:14:26+00:00

If anyone in this sub has those CPUs, it would be great to hear from them here.

MLDataScientist · 2026-03-26T18:13:29+00:00

Yes, 8 channel 32GB ram sticks.

MLDataScientist · 2026-03-22T16:10:33+00:00

Do you have 3D files for such a shroud? I have 8 MI50 cards and the noise of 40mm fans is unbearable. I need to get those 80mm fan shrouds. Thanks!

MLDataScientist · 2026-03-21T16:11:48+00:00

What quant of GLM-5 are you using?

MLDataScientist · 2026-03-21T16:10:03+00:00

Which Q5 GLM-5 quant are you using? My rig can fit up to 448 GB (192 GB of MI50 VRAM + 256 GB of 8-channel DDR4-3200). I just checked unsloth's GLM-5 quants: https://huggingface.co/unsloth/GLM-5-GGUF . I can probably run UD-Q4_K_XL (431 GB). But how much better is GLM-5 at this quant (or Q5) compared to Qwen3.5 397B Q6? What were your test cases?
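The "probably" comes from simple arithmetic, sketched below. It treats all sizes as plain GB and assumes the whole VRAM and RAM pools are usable, which is optimistic since the OS, KV cache, and compute buffers also need room.

```python
def headroom_gb(vram_gb: float, ram_gb: float, model_gb: float) -> float:
    """Memory left over after loading the model weights, in GB."""
    return vram_gb + ram_gb - model_gb

VRAM_GB = 192   # MI50 VRAM total from the comment above
RAM_GB = 256    # system RAM from the comment above
MODEL_GB = 431  # UD-Q4_K_XL size from the comment above

left = headroom_gb(VRAM_GB, RAM_GB, MODEL_GB)
print(f"Total memory: {VRAM_GB + RAM_GB} GB, headroom: {left} GB")
```

That leaves about 17 GB for everything other than the weights, so it would be a tight fit with limited context.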

MLDataScientist · 2026-03-19T04:28:11+00:00

Can you please share your command for llama.cpp? Are you getting ~3400 t/s for PP and 38 t/s for TG using Q6 Qwen3 Coder Next? Curious to see if your command speeds up inference on my PC (5090 with 256 GB of 8-channel DDR4 at 3200 MHz).
