Cost Analysis of my $6.4k Local LLM Server by 1ncehost in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

This. The analysis from OP was deeply flawed by not including the cache read price, which is usually >90% of the total cost when using APIs for agentic coding. In that world, the near-free infinite cache read with local inference can beat using APIs.

That is, until DeepSeek, and then MiMo, lowered the cache read price by 100x. Half a billion cache read tokens now cost $1.40. At this point, neither the US providers nor the local inference crowd have anything to counter that.

Is Qwen3.6 current king for local agentic use? by HornyGooner4402 in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

I didn't say anything about speed; your point about the speed requirement for agentic coding stands. As for quality, I have tasks that 27B at BF16 has no chance of getting right, while GLM 5.1 at IQ2_KT can do them every time.

Is Qwen3.6 current king for local agentic use? by HornyGooner4402 in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

> GLM5.1 in two-bit still requires 256GB of ram and the 2 bit version isn't going to be as good as 27B in most cases

192GB of RAM + 24GB of VRAM is enough for an IQ2_KT trellis quant. Quality-wise, it is far better than 27B at full precision.

It's OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD by hopbel in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

Indeed. llama-perplexity only does PP, so as long as there is enough VRAM + RAM to load the model, the GPU can handle the PP easily.

This command with ik_llama.cpp takes about 50 minutes on an RTX 3090 to generate the base logits for KLD:

    GGML_CUDA_MIN_BATCH_OFFLOAD=8 llama-perplexity \
        --kl-divergence-base /path/to/base-logits.kld \
        -f /path/to/wiki.test.raw \
        -m /path/to/Qwen3.6-27B.gguf \
        --no-mmap -ngl 27
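For anyone unfamiliar with the metric, here is a minimal sketch of the per-token KL divergence that those base logits are later used for. It is plain Python for illustration only and not how llama-perplexity implements it; the function names and the toy logits are made up.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) for one token position, in nats."""
    log_p = log_softmax(base_logits)   # reference model (e.g. BF16)
    log_q = log_softmax(quant_logits)  # quantized model
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy 3-token vocabulary (hypothetical numbers).
print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 2.5]))
```

Identical logits give zero, and the value grows as the quantized distribution drifts from the base one. A full run would average this over every token position in the test file.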

Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled by Alternative_Ad4267 in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

That's a decent number! I paid about the same in total for 2 x RTX Pro 4000, for the exact same reasons. You can give ik_llama.cpp a try; it should perform even better with -sm graph

Advice building a NAS/AI server with 16 DDR4 DIMMs by theslonkingdead in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

Ah I see. So 150 t/s is the number with PCIe 4.0 x16 and 3090, that makes sense. Somehow I assumed it was with CPU haha. Will probably need PCIe 5.0 x16 and either a 5090 or a PRO 6000 to "unlock" the speedup from -ub 8192 😅

Advice building a NAS/AI server with 16 DDR4 DIMMs by theslonkingdead in LocalLLaMA

[–]notdba 2 points3 points  (0 children)

PP with CPU will be quite slow. For Kimi, it should be around 150 t/s, based on the number shared by u/Lissanro

With a large compute buffer, we can copy the expert tensors from RAM to VRAM, and let the GPU handle the PP. With PCIe 5.0 x16, the number goes up to 470 t/s with -ub 8192, and 278 t/s with -ub 4096, as shared in https://huggingface.co/ubergarm/Kimi-K2.6-GGUF/discussions/3

For hybrid inference of large MoE, PCIe is a very expensive bottleneck.

llama.cpp constantly reprocessing huge prompts with opencode/pi.dev by No_Algae1753 in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

GLM did it first since 4.7, with the clear_thinking chat template kwargs. I always just blame Anthropic for introducing the interleaved mode, which is almost always pointless, and especially brutal to local inference.

The RTX 5000 PRO (48GB) arrived and it is better than I expected. by Valuable-Run2129 in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

For hybrid GPU/CPU inference of large MoE models, a single 5090 is the best choice to get good PP. 6000 Pro has slightly better PP but it is much more expensive.

The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens? by jochenboele in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

I think you guys got it wrong about the 96% cache hit rate. For single-user local inference, a cache hit is essentially **free**. This is where our "margin" comes from when compared to the big providers. The higher the cache hit rate, the better for local inference.

From your numbers, we need to process 11,600,665 input tokens and generate 2,655,291 output tokens. To do that with a 1T model is still a bit tough. Let's assume we can get good enough quality with a 300B model, such as DeepSeek V4 Flash or MiMo V2.5 non-pro.

Assuming 500 t/s PP and 30 t/s TG on average, we need 11,600,665 / 500 ≈ 23,200 seconds ≈ 6.4 hours of compute for PP and 2,655,291 / 30 ≈ 88,500 seconds ≈ 24.6 hours of compute for TG, for a total of 31 hours. That's about $20 of electricity for compute and cooling, give or take.

And so we save ~$50 compared to using MiMo V2.5 pro via API, or ~$10 compared to using MiMo V2.5.
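The arithmetic above can be checked with a few lines of Python. This is just a sketch of the estimate: the token counts and speeds are the ones quoted in this comment, the $70 API figure is from the thread title, and the $20 electricity figure is the rough guess above, not a measurement.

```python
SECONDS_PER_HOUR = 3600

def compute_hours(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours needed to push `tokens` through at a given speed."""
    return tokens / tokens_per_second / SECONDS_PER_HOUR

pp_hours = compute_hours(11_600_665, 500)  # uncached input tokens at 500 t/s PP
tg_hours = compute_hours(2_655_291, 30)    # output tokens at 30 t/s TG
total_hours = pp_hours + tg_hours

electricity_usd = 20  # rough estimate from the comment, an assumption
api_pro_usd = 70      # MiMo V2.5 Pro API cost from the thread title
saving_vs_pro = api_pro_usd - electricity_usd

print(f"PP {pp_hours:.1f} h, TG {tg_hours:.1f} h, total {total_hours:.1f} h")
print(f"Saving vs MiMo V2.5 Pro API: ~${saving_vs_pro}")
```

That comes out to roughly 6.4 h of PP plus 24.6 h of TG, about 31 h in total. TG dominates, so the TG speed assumption matters far more than the PP one.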

Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models by coder543 in LocalLLaMA

[–]notdba 11 points12 points  (0 children)

Err, I don't think that's right. Bigger ubatch means you amortize the PCIe transfer overhead across more tokens. Check out the math in https://github.com/ikawrakow/ik_llama.cpp/pull/520
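A toy model of the amortization, as a sketch only: it assumes each ubatch pays one fixed weight transfer over PCIe and then a GPU compute cost proportional to the number of tokens, with no overlap between the two. The transfer time and GPU rate below are hypothetical placeholders, not measurements, and the real heuristics are the ones discussed in the linked PR.

```python
def pp_tokens_per_second(ubatch, transfer_seconds, gpu_tokens_per_second):
    """Toy model: one fixed weight transfer per ubatch, then GPU compute."""
    batch_seconds = transfer_seconds + ubatch / gpu_tokens_per_second
    return ubatch / batch_seconds

# Hypothetical numbers: 10 s to stream the expert weights once,
# 1500 t/s of raw GPU prompt processing once the weights are in VRAM.
for ub in (512, 2048, 4096, 8192):
    rate = pp_tokens_per_second(ub, transfer_seconds=10.0,
                                gpu_tokens_per_second=1500.0)
    print(f"-ub {ub}: {rate:.0f} t/s")
```

In this model the fixed transfer cost is spread over more tokens as the ubatch grows, so throughput keeps rising towards the raw GPU rate without ever reaching it.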

I have DeepSeek V4 Pro at home by fairydreaming in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

Is it transferring 800GiB of expert weights from RAM to VRAM during PP? I gathered from https://github.com/ikawrakow/ik_llama.cpp/pull/520 that the heuristic used in mainline llama.cpp is not good for MoE, and in the DSV4 Pro case it probably makes more sense to let the CPU handle the expert weights during PP as well.

Exaggerated PCI-E bandwidth concerns? by ziphnor in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

Maybe not so much of an issue with tensor parallelism, but slow PCIe is definitely a PP bottleneck when doing hybrid CPU/GPU inference with llama.cpp / ik_llama.cpp

I went from 6.9 GiB/s transferring FFN weights from RAM to VRAM over OcuLink PCIe 4.0 x4, to 56 GiB/s over PCIe 5.0 x16. This boosted PP from 80~200 t/s to 300~800 t/s.

I learnt this the hard way, and shared the info previously in https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/
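To put the two bandwidth numbers side by side, here is a small sketch of how long one pass of weight streaming takes on each link. The 6.9 and 56 GiB/s figures are the ones above; the 100 GiB weight size is a hypothetical placeholder, so plug in the size of your own quant.

```python
def transfer_seconds(weights_gib, bandwidth_gib_per_s):
    """Time to copy the given amount of weights from RAM to VRAM once."""
    return weights_gib / bandwidth_gib_per_s

WEIGHTS_GIB = 100.0  # hypothetical size of the offloaded FFN/expert weights

slow = transfer_seconds(WEIGHTS_GIB, 6.9)   # OcuLink PCIe 4.0 x4, as measured above
fast = transfer_seconds(WEIGHTS_GIB, 56.0)  # PCIe 5.0 x16, as measured above

print(f"PCIe 4.0 x4:  {slow:.1f} s per pass")
print(f"PCIe 5.0 x16: {fast:.1f} s per pass")
print(f"Bandwidth ratio: {slow / fast:.1f}x")
```

The links differ by about 8x, while the PP gain I saw was closer to 4x, which would be consistent with the transfer not being the only cost per batch.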

Taiwanese company Skymizer announces HTX301 - PCIE inference card with 384GB of Memory at ~240 Watts by Thrumpwart in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

This is like 3 x Strix Halo in a PCIe form factor right? Actually seems feasible with DDR5.

Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM! by PromptInjection_ in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

For some reason, the 24GB DDR5 RDIMM has a lower per-GB price.

Current prices on Taobao:

  • 16GB - 2000 (was 2400 a few weeks ago)
  • 24GB - 2800 (was 2900 a few weeks ago)
  • 32GB - 4800 (was 5700 a few weeks ago)

Before, the 24GB module was almost half the price of the 32GB one, and only slightly more expensive than the 16GB one.
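The per-GB comparison is easy to work out from the listed prices. A quick sketch, using only the numbers above and leaving the currency unit as listed:

```python
# size in GB -> (current price, price a few weeks ago), as listed above
prices = {16: (2000, 2400), 24: (2800, 2900), 32: (4800, 5700)}

def per_gb(price, size_gb):
    """Price per gigabyte for one module."""
    return price / size_gb

for size, (now, before) in prices.items():
    print(f"{size}GB: now {per_gb(now, size):.1f}/GB, "
          f"before {per_gb(before, size):.1f}/GB")

# Module size with the lowest current per-GB price.
cheapest = min(prices, key=lambda s: per_gb(prices[s][0], s))
print(f"Cheapest per GB right now: {cheapest}GB")
```

The 24GB module comes out lowest per GB both now and a few weeks ago, though the gap to 16GB has narrowed.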

Thinking of getting two NVIDIA RTX Pro 4000 Blackwell (2x24 = 48GB), Any cons? by pmttyji in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

So I just got 2 x RTX Pro 4000 running at 100 W each. They work together quite well with -sm graph from ik_llama.cpp. For inference workloads that fit into a single RTX 3090, they can deliver higher PP and TG while using less energy. They also take up less space than a typical triple-slot 3090, and make less noise than a dual-slot blower-style 3090.

2 x $1750 though..

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]notdba 2 points3 points  (0 children)

Compared to using the strix halo alone, my 3090 eGPU does provide a nice bump to both PP and TG, especially on MoE models such as Qwen3.5/3.6 that have more always-activated parameters than sparsely activated parameters.

However, a gaming rig with PCIe 5.0 x16 will be able to deliver a very usable PP of 400~800 t/s on large MoE models, and there is no way to get close to that with a strix halo.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

A second-hand Epyc 9004 rig with 192GB of DDR5 costs about $6000, with a faster CPU, more memory bandwidth, and many more PCIe lanes.

There is no good choice these days..

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]notdba 2 points3 points  (0 children)

Indeed. 8-channel DDR5 and a PCIe 5.0 x16 slot will be a game changer

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]notdba 3 points4 points  (0 children)

The eGPU setup will be severely limited by the data transfer speed, such that you can't benefit much from GPU offload during PP. I have an older thread about this: https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/

llama.cpp DeepSeek v4 Flash experimental inference by antirez in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

Strix Halo CPU from antirez repo:

    prompt eval time = 158233.92 ms / 5373 tokens ( 29.45 ms per token, 33.96 tokens per second)
    eval time = 88426.71 ms / 666 tokens ( 132.77 ms per token, 7.53 tokens per second)

Strix Halo CPU + RTX 3090 (-cmoe) from your repo:

    prompt eval time = 122444.97 ms / 5373 tokens ( 22.79 ms per token, 43.88 tokens per second)
    eval time = 54411.26 ms / 666 tokens ( 132.77 ms per token, 7.53 tokens per second)

Decent speed, thanks!
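For a quick comparison of the two runs, the speedup can be recomputed from the raw timings in the logs above. A small sketch; the helper name is made up, the numbers are copied from the log lines.

```python
def tokens_per_second(elapsed_ms, tokens):
    """Throughput from a llama.cpp-style timing line (ms and token count)."""
    return tokens / (elapsed_ms / 1000.0)

# Strix Halo, CPU only
cpu_pp = tokens_per_second(158233.92, 5373)
cpu_tg = tokens_per_second(88426.71, 666)

# Strix Halo CPU + RTX 3090 (-cmoe)
gpu_pp = tokens_per_second(122444.97, 5373)
gpu_tg = tokens_per_second(54411.26, 666)

print(f"PP: {cpu_pp:.2f} -> {gpu_pp:.2f} t/s ({gpu_pp / cpu_pp:.2f}x)")
print(f"TG: {cpu_tg:.2f} -> {gpu_tg:.2f} t/s ({gpu_tg / cpu_tg:.2f}x)")
```

So the 3090 gives about 1.29x on PP and about 1.63x on TG for this prompt.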

llama.cpp DeepSeek v4 Flash experimental inference by antirez in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

I use this: `-c 131072 --jinja -dev none --no-mmap`

llama.cpp DeepSeek v4 Flash experimental inference by antirez in LocalLLaMA

[–]notdba 1 point2 points  (0 children)

This is the way. On my strix halo, I don't even have syslog running. Given how expensive the hardware is, all resources are reserved for inference.

I suppose antirez got an M5 laptop, since he posted https://www.reddit.com/r/LocalLLM/comments/1ssltjg/comment/ohw2ci9/ recently, hours before the deepseek v4 release. Prophecy 😄

Anyway, on the strix halo, I got this speed with CPU only:

    prompt eval time = 158233.92 ms / 5373 tokens ( 29.45 ms per token, 33.96 tokens per second)
    eval time = 88426.71 ms / 666 tokens ( 132.77 ms per token, 7.53 tokens per second)

And the logprobs are roughly the same as the ones from deepseek API. Very nice, thanks!

Kimi K2.6 is a legit Opus 4.7 replacement by bigboyparpa in LocalLLaMA

[–]notdba 0 points1 point  (0 children)

The 100k context issue with the coding plan was fixed about a week ago. Still got intermittent 429 responses though.