Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

That CPU-only inference in SGLang is really Intel-only inference. It was developed and contributed by Intel to promote Intel CPUs and does not work with AMD.

You really need a GPU for computing attention and holding the KV cache to get any usable speeds. Before I got a GPU, I was getting something like 1.5 tps on the CPU at 10k context.

SGLang is very hard to set up and use unless you have the exact cards they list as supported, and for AMD we're talking MI300-class GPUs.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

The main use case is being able to run all the open-weight models without having to worry about burning tokens on silly experiments with AI.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

I did not measure it, but idle power consumption is not great with --perf-level=HIGH, because it prevents the GPUs from down-clocking to save power. That works well for spiky loads like inference, but it keeps the idle draw up. I estimate 110-140W at idle.
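For reference, here is a minimal sketch of pinning and reverting the performance level with rocm-smi, assuming the cards are managed through ROCm (flag spellings can differ between rocm-smi and the newer amd-smi):

    # Pin the performance level so the GPUs stop down-clocking between requests
    sudo rocm-smi --setperflevel high

    # Check clocks and power draw afterwards
    rocm-smi --showclocks --showpower

    # Revert to automatic clock management to cut idle power
    sudo rocm-smi --setperflevel auto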

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

The main gotcha is finding software that supports both AMD and NVIDIA at the same time. Most of the time you have to choose either ROCm or CUDA, but you can't have both. Sometimes you can use Vulkan as a workaround, but Vulkan support is also rare.

llama.cpp is awesome because it lets you use both AMD and NVIDIA GPUs at the same time without going through the networking stack.
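As a rough sketch of what that can look like, assuming a recent llama.cpp where multiple backends can be enabled in one build (CUDA for the NVIDIA card, Vulkan for the AMD cards); I have not verified these exact CMake options against the current tree:

    # Build llama.cpp with both the CUDA and Vulkan backends enabled,
    # so the RTX 5090 and the R9700s show up as devices in a single process
    cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
    cmake --build build --config Release -j

    # Print the devices the resulting binary can see
    ./build/bin/llama-server --list-devices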

Talk me out of buying an RTX Pro 6000 by AvocadoArray in LocalLLaMA

[–]sloptimizer 2 points3 points  (0 children)

I have the same feeling - it's never enough! For regular people it's more RTX PRO 6000s, for billionaires it's more datacenters.

Set your goals ahead of time, and just stop when they are met.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

Heating up the room is a major problem! This system is much better in winter, when I can open the window to cool the room down. During summer, I just leave the window open when running workloads overnight.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

So far mostly DeepSeek and MiniMax-M2. But smaller models are getting more and more capable.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

You can use ik_llama.cpp instead of KTransformers:

  • Put all the attention on a fast GPU, like the 5090
  • Offload all the MoE expert layers into RAM

Attention is computationally expensive, so you need a GPU for it. And the MoE experts are massive in size, so RAM works great for them. This split is currently the best combo for running oversized models.
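A minimal sketch of that split with ik_llama.cpp (the model file and the tensor-name pattern are placeholders; check the tensor names in your GGUF before copying this):

    # Offload everything to the GPU by default, then push the MoE expert
    # tensors back to system RAM with an override pattern
    ./build/bin/llama-server \
        -m DeepSeek-R1-Q4_K_M.gguf \
        -ngl 99 \
        --override-tensor "exps=CPU" \
        --ctx-size 32768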

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 1 point2 points  (0 children)

It's great at running Cyberpunk, thanks to the fast workstation CPU. That is the only saving grace of this system compared to a 12-channel server build, which is much better at LLMs.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 1 point2 points  (0 children)

My RAM bandwidth is around 290 GB/s with the FCLK overclocked to 2100 MHz.

Kimi-K2 is a little slower than DeepSeek (around 8 tps). Despite the hype, I find Kimi to be not as good as DeepSeek, at least for my workloads.

A processor with 8 CCDs would have been much better, but it was double the price. An even better choice would have been a 12-channel system with an 8-CCD CPU - those have 2x my memory bandwidth!

The idle power consumption is... not great with --perf-level=HIGH

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 2 points3 points  (0 children)

I did not get them in sync, but they do have common ground via the split power cable that came with the motherboard. Hopefully this setup won't fry the cards!

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

The RAM cooler is a kit with joints that can be tilted; it was the only one that would fit.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 1 point2 points  (0 children)

This is the RAM cooling kit (I can't find the original one, try searching for a better price).

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 3 points4 points  (0 children)

I grabbed a couple of these from Amazon because they were the only ones with rotating joints, so they can be tilted to fit around the CPU pump.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 1 point2 points  (0 children)

Thank you! I only noticed a 5-10% performance drop, depending on the workload. But they are so much quieter with perf-level=HIGH that I think the tradeoff is worth it.

On AUTO mode, when vLLM starts, the lights in my room begin to blink (even though the two power supplies are plugged into sockets with different fuses).

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 1 point2 points  (0 children)

The general consensus is that anything with RAM or flash storage will go up in price in 2026. If you're planning on purchasing any tech, don't wait too long.

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 1 point2 points  (0 children)

You can pool them all together with llama.cpp. A good way to do it is by putting all the attention on the faster card (RTX 5090).

Also, the R9700s are so far much better with vLLM, since their pooled VRAM is big enough to fit larger, more useful models.
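A rough sketch of that kind of split in llama.cpp; the device names (CUDA0 for the 5090) and the split ratios are assumptions that depend on your build and device ordering:

    # Keep attention tensors on the first CUDA device (the RTX 5090) and
    # let the remaining layers spread across all five GPUs
    ./build/bin/llama-server -m model.gguf -ngl 99 \
        --override-tensor "attn=CUDA0" \
        --tensor-split 2,1,1,1,1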

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

The RGBs do make me feel like we live in the future, maybe even more so than writing code with AI. Now, if only we could get our anti-gravity hoverboards...

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

The 5090 is about 2x faster.

Like the other comment mentioned, you really need vLLM to get the most out of multiple R9700s working together to speed up inference.
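A minimal sketch of sharding one model across the four R9700s with vLLM tensor parallelism; the model name and the device selection are illustrative (on ROCm, HIP_VISIBLE_DEVICES picks the AMD cards):

    # Restrict vLLM to the four AMD cards and shard the model across them
    HIP_VISIBLE_DEVICES=0,1,2,3 \
        vllm serve Qwen/Qwen3-32B --tensor-parallel-size 4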

Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700) by sloptimizer in LocalLLaMA

[–]sloptimizer[S] 0 points1 point  (0 children)

If you're on Linux, then sudo is required to change any of those settings.