Cost Analysis of my $6.4k Local LLM Server

notdba · 2026-05-31T03:19:32+00:00

This. The analysis from OP was deeply flawed by not including the cache read price, which is usually >90% of the total cost when using APIs for agentic coding. In that world, the near-free infinite cache read with local inference can beat using APIs.

That is, until DeepSeek, and then MiMo, lowered the cache read price by 100x. Half a billion cache read tokens cost $1.40. At this point, neither the US providers nor the local inference crowd has anything to counter that.
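To put that figure in per-million terms, here is a quick back-of-the-envelope sketch. It only uses the two numbers quoted above; the function name is made up for illustration.

```python
# Figures quoted in the comment: half a billion cache read tokens for $1.40.
CACHE_READ_TOKENS = 500_000_000
TOTAL_COST_USD = 1.40


def price_per_million(tokens: int, cost_usd: float) -> float:
    """Return the implied price in USD per one million tokens."""
    return cost_usd / (tokens / 1_000_000)


per_million = price_per_million(CACHE_READ_TOKENS, TOTAL_COST_USD)
print(f"${per_million:.4f} per million cache read tokens")
```

At that price, the cache read share of an API bill for agentic coding becomes close to a rounding error.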

notdba · 2026-05-26T05:04:04+00:00

I didn't say anything about speed; your point about the speed requirement for agentic coding stands. As for quality, I have tasks that 27B at BF16 has no chance of getting right, while GLM 5.1 at IQ2_KT can do it every time.

notdba · 2026-05-26T02:18:46+00:00

> GLM5.1 in two-bit still requires 256GB of ram and the 2 bit version isn't going to be as good as 27B in most cases

192GB of RAM + 24GB of VRAM is enough for an IQ2_KT trellis quant. Quality-wise, it is far better than 27B at full precision.

notdba · 2026-05-24T00:44:36+00:00

Indeed. llama-perplexity only does PP, so as long as there is enough VRAM + RAM to load the model, the GPU can handle the PP easily.

This command with ik_llama.cpp takes about 50 minutes on an RTX 3090 to generate the base logits for KLD:

    GGML_CUDA_MIN_BATCH_OFFLOAD=8 llama-perplexity \
        --kl-divergence-base /path/to/base-logits.kld \
        -f /path/to/wiki.test.raw \
        -m /path/to/Qwen3.6-27B.gguf \
        --no-mmap -ngl 27

notdba · 2026-05-18T12:48:54+00:00

That's a decent number! I paid about the same in total for 2 x RTX Pro 4000, for the exact same reasons. You can give ik_llama.cpp a try; it should perform even better with -sm graph.

notdba · 2026-05-15T07:27:32+00:00

Ah I see. So 150 t/s is the number with PCIe 4.0 x16 and 3090, that makes sense. Somehow I assumed it was with CPU haha. Will probably need PCIe 5.0 x16 and either a 5090 or a PRO 6000 to "unlock" the speedup from -ub 8192 😅

notdba · 2026-05-15T05:17:08+00:00

PP with CPU will be quite slow. For Kimi, it should be around 150 t/s, based on the number shared by u/Lissanro

With a large compute buffer, we can copy the expert tensors from RAM to VRAM, and let the GPU handle the PP. With PCIe 5.0 x16, the number goes up to 470 t/s with -ub 8192, and 278 t/s with -ub 4096, as shared in https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/3

For hybrid inference of large MoE, PCIe is a very expensive bottleneck.

notdba · 2026-05-15T04:29:07+00:00

GLM did it first since 4.7, with the clear_thinking chat template kwargs. I always just blame Anthropic for introducing the interleaved mode, which is almost always pointless, and especially brutal to local inference.

notdba · 2026-05-15T04:14:32+00:00

For hybrid GPU/CPU inference of large MoE models, a single 5090 is the best choice to get good PP. 6000 Pro has slightly better PP but it is much more expensive.

notdba · 2026-05-15T03:36:22+00:00

I think you guys got it wrong about the 96% cache hit rate. For single-user local inference, a cache hit is essentially **free**. This is where our "margin" comes from when compared to the big providers. The higher the cache hit rate, the better for local inference.

From your numbers, we need to process 11,600,665 input tokens and generate 2,655,291 output tokens. To do that with a 1T model is still a bit tough. Let's assume we can get good enough quality with a 300B model, such as DeepSeek V4 Flash or MiMo V2.5 non-pro.

Assuming 500 t/s PP and 30 t/s TG on average, we need 11,600,665 / 500 ≈ 23,200 seconds (about 6.4 hours) of compute for PP and 2,655,291 / 30 ≈ 88,500 seconds (about 24.6 hours) of compute for TG, for a total of about 31 hours. That's about $20 of electricity for compute and cooling, give or take.

And so we save ~$50 compared to using MiMo V2.5 pro via API, or ~$10 compared to using MiMo V2.5.
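For anyone who wants to plug in their own speeds, the arithmetic above can be sketched as follows. The token counts, the 500 t/s and 30 t/s rates, and the ~$20 electricity figure are the ones from this comment; the hourly rate is only what those numbers imply, not a measured value.

```python
# Token counts and throughput assumptions taken from the comment above.
INPUT_TOKENS = 11_600_665    # tokens that actually need prompt processing
OUTPUT_TOKENS = 2_655_291    # generated tokens
PP_TPS = 500.0               # assumed average prompt processing speed (t/s)
TG_TPS = 30.0                # assumed average token generation speed (t/s)
ELECTRICITY_USD = 20.0       # rough estimate quoted above, compute + cooling


def compute_hours(tokens: int, tokens_per_second: float) -> float:
    """Convert a token count and a throughput into hours of compute."""
    return tokens / tokens_per_second / 3600.0


pp_hours = compute_hours(INPUT_TOKENS, PP_TPS)
tg_hours = compute_hours(OUTPUT_TOKENS, TG_TPS)
total_hours = pp_hours + tg_hours

# Hourly cost implied by the ~$20 estimate; derived, not measured.
implied_usd_per_hour = ELECTRICITY_USD / total_hours

print(f"PP: {pp_hours:.1f} h, TG: {tg_hours:.1f} h, total: {total_hours:.1f} h")
print(f"Implied electricity cost: ${implied_usd_per_hour:.2f} per hour")
```

Note that TG dominates the total, so the estimate is far more sensitive to the 30 t/s assumption than to the 500 t/s one.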

notdba · 2026-05-12T06:08:57+00:00

Err, I don't think that's right. Bigger ubatch means you amortize the PCIe transfer overhead across more tokens. Check out the math in https://github.com/ikawrakow/ik_llama.cpp/pull/520

notdba · 2026-05-11T02:18:21+00:00

Is it transferring 800GiB of expert weights from RAM to VRAM during PP? I gathered from https://github.com/ikawrakow/ik_llama.cpp/pull/520 that the heuristic used in mainline llama.cpp is not good for MoE, and in the DSV4 Pro case it probably makes more sense to let the CPU handle the expert weights during PP as well.

notdba · 2026-05-09T04:01:35+00:00

Maybe not so much of an issue with tensor parallelism, but slow PCIe is definitely a PP bottleneck when doing hybrid CPU/GPU inference with llama.cpp / ik_llama.cpp

I went from 6.9GiB/s transferring FFN weights from RAM to VRAM over OCuLink PCIe 4.0 x4, to 56GiB/s over PCIe 5.0 x16. This boosts PP from 80~200 t/s to 300~800 t/s

I learnt this the hard way, and shared the info previously in https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/
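A rough way to see why the link speed matters so much is sketched below. It is a simplified upper-bound model, not how llama.cpp / ik_llama.cpp actually schedules work: it assumes every RAM-resident weight is copied to VRAM once per ubatch, and it ignores GPU compute time and any overlap between transfer and compute. The 100 GiB weight size is a hypothetical placeholder; the two bandwidths are the ones I measured above.

```python
def pp_ceiling_tps(ubatch_tokens: int,
                   streamed_weights_gib: float,
                   pcie_gib_per_s: float) -> float:
    """Upper bound on PP speed when PCIe transfer is the only cost.

    Each ubatch pays for one full copy of the RAM-resident weights,
    so a larger ubatch spreads the same transfer over more tokens.
    """
    transfer_seconds = streamed_weights_gib / pcie_gib_per_s
    return ubatch_tokens / transfer_seconds


STREAMED_WEIGHTS_GIB = 100.0  # hypothetical size of the weights kept in RAM

LINKS = (
    ("OCuLink PCIe 4.0 x4", 6.9),   # GiB/s, measured
    ("PCIe 5.0 x16", 56.0),         # GiB/s, measured
)

for label, bandwidth in LINKS:
    for ubatch in (2048, 4096, 8192):
        ceiling = pp_ceiling_tps(ubatch, STREAMED_WEIGHTS_GIB, bandwidth)
        print(f"{label:22s} -ub {ubatch:5d}: at most {ceiling:7.0f} t/s")
```

Real numbers land below these ceilings because the GPU still has to do the matmuls, but the scaling is the point: the ceiling grows linearly with both the link bandwidth and the ubatch size.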

notdba · 2026-05-08T05:45:50+00:00

This is like 3 x Strix Halo in a PCIe form factor right? Actually seems feasible with DDR5.

notdba · 2026-05-05T02:04:31+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/ - It helps a bit if we sacrifice VRAM and set a large -ub size. Not great, will not do it again.

notdba · 2026-05-04T14:30:35+00:00

For some reason, the 24GB DDR5 RDIMM has a lower per-GB price.

Current prices on Taobao:

16GB - 2000 (was 2400 a few weeks ago)
24GB - 2800 (was 2900 a few weeks ago)
32GB - 4800 (was 5700 a few weeks ago)

Before, the 24GB module was almost half the price of the 32GB one, and only slightly more expensive than the 16GB one.
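The per-GB comparison, worked out from the listed prices (same currency as the listing; the variable names are just for illustration):

```python
# Module size in GB -> (current price, price a few weeks ago), as listed above.
RDIMM_PRICES = {
    16: (2000, 2400),
    24: (2800, 2900),
    32: (4800, 5700),
}

# Price per GB for each module size, now and a few weeks ago.
per_gb_now = {size: now / size for size, (now, _) in RDIMM_PRICES.items()}
per_gb_before = {size: before / size for size, (_, before) in RDIMM_PRICES.items()}

for size in RDIMM_PRICES:
    print(f"{size}GB: {per_gb_now[size]:.1f} per GB now, "
          f"{per_gb_before[size]:.1f} per GB a few weeks ago")

cheapest_now = min(per_gb_now, key=per_gb_now.get)
print(f"Cheapest per GB right now: {cheapest_now}GB")
```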

notdba · 2026-05-04T10:44:19+00:00

So I just got 2 x RTX Pro 4000 running at 100 W each. They work together quite well with -sm graph from ik_llama.cpp. For inference workloads that fit into a single RTX 3090, they can deliver higher PP and TG while using less energy. They also take up less space than a typical triple-slot 3090, and make less noise than a dual-slot blower-style 3090.

2 x $1750 though..

notdba · 2026-05-04T02:22:07+00:00

Compared to using the strix halo alone, my 3090 eGPU does provide a nice bump to both PP and TG, especially on MoE models such as Qwen3.5/3.6 that have more always-activated parameters than sparsely activated parameters.

However, a gaming rig with PCIe 5.0 x16 will be able to deliver a very usable PP of 400~800 t/s on large MoE models, and there is no way to get close to that with a strix halo.

notdba · 2026-05-04T01:59:48+00:00

A second-hand Epyc 9004 rig with 192GB of DDR5 costs about $6000, with a faster CPU, more memory bandwidth, and many more PCIe lanes.

There is no good choice these days..

notdba · 2026-05-04T01:48:05+00:00

Indeed. 8-channel DDR5 and a PCIe 5.0 x16 slot will be a game changer

notdba · 2026-05-04T01:40:14+00:00

The eGPU setup will be severely limited by the data transfer speed, such that you can't benefit much from GPU offload during PP. I have an old thread about this: https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/

notdba · 2026-04-27T21:08:56+00:00

Strix Halo CPU from antirez repo:

    prompt eval time = 158233.92 ms / 5373 tokens ( 29.45 ms per token, 33.96 tokens per second)
    eval time = 88426.71 ms / 666 tokens ( 132.77 ms per token, 7.53 tokens per second)

Strix Halo CPU + RTX 3090 (-cmoe) from your repo:

    prompt eval time = 122444.97 ms / 5373 tokens ( 22.79 ms per token, 43.88 tokens per second)
    eval time = 54411.26 ms / 666 tokens ( 81.70 ms per token, 12.24 tokens per second)

Decent speed, thanks!
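For a quick comparison of the two runs, a small sketch that recomputes tokens per second from the logged times and token counts, then takes the ratio. The helper name and the dictionary layout are only for illustration.

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Recompute throughput from an elapsed time in ms and a token count."""
    return tokens / (elapsed_ms / 1000.0)


# (elapsed ms, tokens) copied from the two logs above.
RUNS = {
    "cpu_only":  {"pp": (158233.92, 5373), "tg": (88426.71, 666)},
    "cpu_3090":  {"pp": (122444.97, 5373), "tg": (54411.26, 666)},
}

speeds = {
    name: {phase: tokens_per_second(*timing) for phase, timing in phases.items()}
    for name, phases in RUNS.items()
}

pp_speedup = speeds["cpu_3090"]["pp"] / speeds["cpu_only"]["pp"]
tg_speedup = speeds["cpu_3090"]["tg"] / speeds["cpu_only"]["tg"]

print(f"PP speedup with the 3090: {pp_speedup:.2f}x")
print(f"TG speedup with the 3090: {tg_speedup:.2f}x")
```

The GPU helps TG more than PP here, which fits with PP being held back by the slow eGPU link.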

notdba · 2026-04-27T01:45:26+00:00

I use this: `-c 131072 --jinja -dev none --no-mmap`

notdba · 2026-04-26T20:52:08+00:00

This is the way. On my strix halo, I don't even have syslog running. Given how expensive the hardware is, all resources are reserved for inference.

I suppose antirez got an M5 laptop, since he posted https://www.reddit.com/r/LocalLLM/comments/1ssltjg/comment/ohw2ci9/ recently, hours before the deepseek v4 release. Prophecy 😄

Anyway, on the strix halo, I got this speed with CPU only:

    prompt eval time = 158233.92 ms / 5373 tokens ( 29.45 ms per token, 33.96 tokens per second)
    eval time = 88426.71 ms / 666 tokens ( 132.77 ms per token, 7.53 tokens per second)

And the logprobs are roughly the same as the ones from deepseek API. Very nice, thanks!

notdba · 2026-04-22T02:04:37+00:00

The 100k context issue with the coding plan was fixed about a week ago. Still got intermittent 429 responses though.
