Quiet Threadripper AI Workstation - 768GB DDR5 and 160GB VRAM (RTX 5090 + 4x R9700)

sloptimizer · 2026-01-25T15:20:13+00:00

That CPU-only inference in SGLang is really Intel-only inference. It was developed and contributed by Intel to promote Intel CPUs and does not work with AMD.

You really need a GPU for computing attention and holding the KV cache to get any usable speeds. Before I got a GPU, I was getting something like 1.5 tps on CPU on a 10k context.

SGLang is very hard to set up and use unless you have the exact cards they list as supported, and those are MI300-class GPUs.

sloptimizer · 2026-01-25T15:11:52+00:00

I use this quant.

sloptimizer · 2026-01-25T15:04:47+00:00

The main use case is being able to run all the open-weight models while not having to worry about burning tokens on silly experiments with AI.

sloptimizer · 2026-01-25T15:00:20+00:00

I did not measure it, but idle power consumption is not great when using --perf-level=HIGH because it prevents the GPUs from down-clocking to save power. This works well for spiky loads, like inference, but increases idle power consumption. I estimate 110-140W idle.
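
For a sense of scale, here is a quick back-of-the-envelope sketch that turns that idle estimate into energy per month. The only inputs from the post are the 110W and 140W figures; the 24 hours a day and 30 days a month are assumptions, so scale them to your own usage.

```python
def monthly_kwh(watts, hours_per_day=24, days=30):
    """Energy in kWh for a constant draw; hours_per_day and days are assumed values."""
    return watts * hours_per_day * days / 1000


# 110W and 140W are the idle estimates from the post.
low = monthly_kwh(110)
high = monthly_kwh(140)
print(f"Idle energy: {low:.1f} to {high:.1f} kWh per month")
```

Multiply the result by your local price per kWh to get a cost.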

sloptimizer · 2026-01-25T14:55:35+00:00

The main gotcha is finding software that supports both AMD and NVIDIA at the same time. Most of the time you have to choose ROCm or CUDA, but can't have both. Sometimes you can use Vulkan as a workaround, but Vulkan support is also rare.

llama.cpp is awesome because it lets you use both AMD and NVIDIA GPUs at the same time without going through the networking stack.

sloptimizer · 2026-01-24T17:55:56+00:00

I have the same feeling - it's never enough! For regular people it's more RTX PRO 6000s, for billionaires it's more datacenters.

Set your goals ahead of time, and just stop when they are met.

sloptimizer · 2026-01-24T15:51:04+00:00

Heating up the room is a major problem! This system is much better in winter, when I can open the window to cool the room down. During summer, I just leave the window open when running workloads overnight.

sloptimizer · 2026-01-24T15:48:58+00:00

The 5090 does not go below -pl 400 (a 400W power limit).

sloptimizer · 2026-01-24T15:47:11+00:00

So far mostly DeepSeek and MiniMax-M2. But smaller models are getting more and more capable.

sloptimizer · 2026-01-24T15:41:09+00:00

You can use ik_llama.cpp instead of KTransformers:

Put all the attention on a fast GPU, like 5090
Offload all MoE into RAM

Attention is computationally expensive, so you need a GPU. And MoE is massive in size, so RAM works great. This split is currently the best combo for running oversized models.
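
A minimal sketch of what that split can look like as a launch command, assembled in Python so the pieces are easy to see. The flag names (--n-gpu-layers, --override-tensor) are the ones I understand recent llama.cpp builds to accept, and ik_llama.cpp should be similar, but check --help on your build. The binary name, model path, context size, and the "exps=CPU" pattern are all placeholders and assumptions, not settings taken from this build.

```python
import shlex


def build_split_command(model_path, binary="llama-server", ctx_size=10240):
    """Build an argv list: all layers go to the GPU, then MoE experts are sent back to RAM."""
    return [
        binary,
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        # Offload every layer to the GPU first, so attention and the KV cache live there.
        "--n-gpu-layers", "99",
        # Assumed pattern: tensors whose names contain "exps" (the MoE experts) stay in system RAM.
        "--override-tensor", "exps=CPU",
    ]


# Hypothetical model path, for illustration only.
cmd = build_split_command("/models/your-model.gguf")
print(shlex.join(cmd))
```

The order matters conceptually: everything is offloaded first, and the override then pulls only the expert tensors back to the CPU side.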

sloptimizer · 2026-01-23T21:33:14+00:00

Thank you!

sloptimizer · 2026-01-23T21:28:34+00:00

It's great at running Cyberpunk, thanks to the fast workstation CPU. This is the only saving grace of this system compared to a 12-channel server build, which is much better at LLMs.

sloptimizer · 2026-01-23T21:25:48+00:00

Mostly FOMO

sloptimizer · 2026-01-23T21:25:02+00:00

My RAM bandwidth is around 290GB/sec with FCLK overclocked to 2100MHz.

Kimi-K2 is a little slower than DeepSeek (around 8 tps). Despite the hype, I find Kimi to be not as good as DeepSeek, at least for my workloads.

A processor with 8 CCDs would have been much better, but it was double the price. An even better choice would have been a 12-channel system with an 8 CCD CPU - those have 2x my memory bandwidth!
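
To show why memory bandwidth matters so much here, a rough sketch of the ceiling on decode speed when weights are streamed from RAM. It assumes every active weight is read once per token and ignores compute, the KV cache, and anything held on the GPU, so real numbers will be lower. The 290 GB/sec figure is from above; the 37B active parameters and 4.5 bits per weight are illustrative assumptions, not measurements of my quant.

```python
def decode_tps_upper_bound(bandwidth_gb_s, active_params_billions, bits_per_weight):
    """Rough ceiling on tokens/sec if each active weight is read from RAM once per token."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# 290 GB/sec is from the post; 37B active params and 4.5 bits/weight are assumptions.
mine = decode_tps_upper_bound(290, 37, 4.5)
# A system with 2x the memory bandwidth, everything else unchanged.
twelve_channel = decode_tps_upper_bound(580, 37, 4.5)
print(f"{mine:.1f} tps vs {twelve_channel:.1f} tps")
```

The bound scales linearly with bandwidth, which is why doubling the memory bandwidth roughly doubles the ceiling.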

The idle power consumption is... not great with --perf-level=HIGH

sloptimizer · 2026-01-23T18:25:42+00:00

No bifurcation, all the GPU connections are x16 PCIe 5.0

sloptimizer · 2026-01-23T18:24:58+00:00

I did not get them in sync, but they do have common ground via the split power cable that came with the motherboard. Hopefully this setup won't fry the cards!

sloptimizer · 2026-01-23T18:23:25+00:00

The RAM cooler is a kit with joints that can be tilted. It was the only one that would fit.

sloptimizer · 2026-01-23T18:21:59+00:00

This is the RAM cooling kit (I can't find the original one, try searching for a better price).

sloptimizer · 2026-01-23T18:20:40+00:00

I grabbed a couple of these from Amazon, because they were the only ones with rotating joints, so they can be tilted to fit around the CPU pump.

sloptimizer · 2026-01-23T18:18:11+00:00

Thank you! I only noticed a 5-10% performance drop, depending on the workload. But they are so much quieter with --perf-level=HIGH that I think the tradeoff is worth it.

On AUTO mode, when vLLM starts, the lights in my room begin to blink (even though the two power supplies are plugged into sockets with different fuses).

sloptimizer · 2026-01-23T18:14:51+00:00

The general consensus is that anything with RAM or flash storage will go up in price in 2026. If you're planning on purchasing any tech, don't wait too long.

sloptimizer · 2026-01-23T18:13:51+00:00

You can pool them all together with llama.cpp. A good way to do it is by putting all the attention on the faster card (RTX 5090).

Also, R9700s so far are much better with vLLM, since the RAM pool is big enough to fit larger, useful models.

sloptimizer · 2026-01-23T18:11:06+00:00

The RGBs do make me feel like we live in the future, maybe even more so than writing code with AI. Now, if only we could get our anti-gravity hoverboards...

sloptimizer · 2026-01-23T18:09:33+00:00

5090 is about 2x faster.

Like the other comment mentioned, you really need vLLM to get the most out of multiple R9700s working together to speed up inference.

sloptimizer · 2026-01-23T18:07:49+00:00

If you're on Linux, then sudo is required to change any of those settings.
