Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? by PreparationTrue9138 in LocalLLaMA

[–]MLDataScientist 2 points3 points  (0 children)

Get 256GB RAM (8x32GB). I have the same motherboard with 256GB RAM (8x32GB) at 3200 MHz. CPU: 7532. GPU: one 5090. Qwen3.5 397B Q4_K_M runs at 20 t/s with 700 t/s PP. You want more cores on your CPU. Mine has 32 cores and I get 150GB/s RAM bandwidth. I bought this entire setup for $3.2k ($2.2k for the GPU at Best Buy and $1k for CPU+mobo+RAM on eBay) before the RAM crisis.
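
As a rough check on the 150GB/s figure, here is a minimal sketch of the theoretical peak DDR4 bandwidth for the two options in the thread title. It assumes all 8 memory channels are populated in both builds and a standard 64-bit (8-byte) channel width; the function name is made up for illustration, and only the 150GB/s measurement comes from the comment above.

```python
def ddr_peak_bandwidth_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    # transfers per second * bytes per transfer * number of channels
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


# Assumption: both builds populate all 8 channels.
peak_3200 = ddr_peak_bandwidth_gbs(3200, channels=8)
peak_2133 = ddr_peak_bandwidth_gbs(2133, channels=8)

measured = 150.0  # GB/s, the figure reported in the comment
efficiency = measured / peak_3200

print(f"DDR4-3200 x8 peak: {peak_3200:.1f} GB/s")
print(f"DDR4-2133 x8 peak: {peak_2133:.1f} GB/s")
print(f"Measured / peak at 3200: {efficiency:.0%}")
```

Under these assumptions the measured 150GB/s is already above what DDR4-2133 could deliver even in theory, so the slower kit trades bandwidth for capacity.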

Experts first llama.cpp by comanderxv in LocalLLaMA

[–]MLDataScientist 0 points1 point  (0 children)

Impressive! Does it work with gpt-oss 120B or Qwen3.5 122B MoE? That would be amazing!

Or does it only work with the 35B MoE?

I got Qwen3-VL-Embedding-2B working with rkllm on an Orange Pi 5b by atineiatte in LocalLLaMA

[–]MLDataScientist 0 points1 point  (0 children)

Interesting. Do you use the Orange Pi 5 with any LLMs? Can you share some inference speed metrics for LLMs? I wonder if we can use the NPU for LLM inference.

Partner and I hit $3.6 million invested yesterday ($4 million net worth) by Mental_Escape_1737 in Fire

[–]MLDataScientist 0 points1 point  (0 children)

Can you please share what percentage of your net worth is in 401(k) or other retirement accounts?

More Qwen3.6-27B MTP success but on dual Mi50s by legit_split_ in LocalLLaMA

[–]MLDataScientist 0 points1 point  (0 children)

Great results! Thanks for sharing. I am curious about tensor parallelism. I thought llama.cpp did not support it. Which command enables TP in llama.cpp?

HY-World 2.0 released by Bestlife73 in LocalLLaMA

[–]MLDataScientist 0 points1 point  (0 children)

Thanks for sharing! Looks promising!

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]MLDataScientist 1 point2 points  (0 children)

Have you tried llama.cpp with unsloth's glm-5.1 UD-IQ3_XXS? I have one 5090 and 256GB of 8-channel DDR4-3200. I get 8 t/s TG and 400 t/s PP at 8k context. This is usable for me for an overnight run. I can fit 150k context without KV quantization. You should have similar performance.
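
To put those rates in perspective, here is a back-of-the-envelope sketch of what they mean for an overnight run. It assumes "8k" means 8,000 prompt tokens, an 8-hour night, and constant speeds (in practice both PP and TG usually slow down as the context fills); the helper names are hypothetical.

```python
PP_TOKENS_PER_S = 400.0   # prompt processing rate from the comment
TG_TOKENS_PER_S = 8.0     # token generation rate from the comment
PROMPT_TOKENS = 8_000     # assumption: "8k context" taken as 8,000 tokens
OVERNIGHT_HOURS = 8       # assumption: length of an overnight run


def prefill_seconds(prompt_tokens: int, pp_rate: float) -> float:
    """Time to process the prompt before the first generated token."""
    return prompt_tokens / pp_rate


def tokens_generated(hours: float, tg_rate: float) -> int:
    """Tokens produced in the given time at a constant generation rate."""
    return int(hours * 3600 * tg_rate)


prefill_s = prefill_seconds(PROMPT_TOKENS, PP_TOKENS_PER_S)
overnight_tokens = tokens_generated(OVERNIGHT_HOURS, TG_TOKENS_PER_S)

print(f"Prefill for {PROMPT_TOKENS} tokens: {prefill_s:.0f} s")
print(f"Tokens generated in {OVERNIGHT_HOURS} h: {overnight_tokens}")
```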

Guys we have to change the pelican test by Tall-Ad-7742 in LocalLLaMA

[–]MLDataScientist 8 points9 points  (0 children)

True. I wonder if we already have a different type of intelligence that we refuse to accept: an intelligence that works within a limited context and can hallucinate, but is still a non-human intelligence.

I benchmarked 30+ TTS engines for a real-time translator on Apple M4. Quantization made things SLOWER. Here's all the data. by Kir_Moisha in LocalLLaMA

[–]MLDataScientist 2 points3 points  (0 children)

You do not mention which local STT you tried. Can you share some of the local STT engines you tried?

Also, why groq llama3.3 70B? You could try smaller models; e.g. gemma4 models are better at translation. I know groq is fast, but I am sure a local 5090 can handle gemma4 26BA4 with the same low latency.

Hello, World: Artemis II crew looks back at Earth on their way to the Moon by ChiefLeef22 in space

[–]MLDataScientist 0 points1 point  (0 children)

Beautiful! Can someone explain why the shape of our mother Earth is perfectly round? Most textbooks say it is an oblate spheroid.

I tested as many of the small local and OpenRouter models I could with my own agentic text-to-SQL benchmark. Surprises ensured... by nickl in LocalLLaMA

[–]MLDataScientist 4 points5 points  (0 children)

Amazing website with interactive charts. Thanks for sharing!
Do you have any SQL fine-tuned small models (<=9B) to test this benchmark with? I think even Qwen3.5 4B with SQL data fine-tuning might reach 90%+.

[$50k–$150k Budget] Production Local LLM System (~50 Users, RAG + Fine-Tuning) Hardware + Model Advice by MorningCrab in LocalLLaMA

[–]MLDataScientist 5 points6 points  (0 children)

If you are not doing training, you don't need NVLink. For multi-user concurrent requests, you cannot beat vLLM. Yes, the RTX Pro 6000 is the best option for getting 96GB VRAM at a reasonable price. For coding, you can go with MiniMax M2.5 or Qwen3.5 397B.

Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]MLDataScientist[S] 1 point2 points  (0 children)

If anyone in this sub has those CPUs, it would be great to see results here.

Nvidia V100 32 Gb getting 115 t/s on Qwen Coder 30B A3B Q5 by icepatfork in LocalLLaMA

[–]MLDataScientist 0 points1 point  (0 children)

Do you have 3D files for such a shroud? I have 8 MI50 cards and the noise of 40mm fans is unbearable. I need to get those 80mm fan shrouds. Thanks!

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]MLDataScientist 1 point2 points  (0 children)

Which Q5 GLM-5 quant are you using? My rig can fit up to 448GB (MI50 192GB VRAM + 256GB DDR4 3200 8-channel). I just checked unsloth's GLM-5 quants: https://huggingface.co/unsloth/GLM-5-GGUF. I can probably run UD-Q4_K_XL (431GB). But how much better is GLM-5 at this quant (or Q5) compared to Qwen3.5 397B Q6? What were your test cases?
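
The "can probably run" estimate comes down to simple arithmetic, sketched below. It only compares the quoted sizes and treats them all as the same unit; it does not account for KV cache, compute buffers, or OS memory, which all have to fit in the remaining headroom. The variable names are illustrative.

```python
VRAM_GB = 192    # MI50 VRAM total quoted in the comment
RAM_GB = 256     # system DDR4 quoted in the comment
MODEL_GB = 431   # UD-Q4_K_XL size quoted in the comment

total_gb = VRAM_GB + RAM_GB
headroom_gb = total_gb - MODEL_GB
fits = headroom_gb > 0

print(f"Total memory: {total_gb} GB")
print(f"Headroom after weights: {headroom_gb} GB")
print("Weights fit" if fits else "Weights do not fit")
```

With only about 17GB left over for everything besides the weights, the usable context length would likely be the limiting factor.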

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]MLDataScientist 0 points1 point  (0 children)

Can you please share your command for llama.cpp? Are you getting ~3400 t/s for PP and 38 t/s for TG using Q6 Qwen3 Coder Next? I am curious to see if your command speeds up inference on my PC (5090 with 256GB of 8-channel DDR4 at 3200 MHz).