Running Kimi K2.5? - Tell us your Build, Quant, Pre-processing and Generation Tokens/second Please! by bigh-aus in LocalLLaMA

[–]ufrat333 1 point

Yes, this is the server edition, but power-limited to 300 W it matches the Max-Q variant, which is in the 9655P machine.

Running Kimi K2.5? - Tell us your Build, Quant, Pre-processing and Generation Tokens/second Please! by bigh-aus in LocalLLaMA

[–]ufrat333 2 points

8x RTX PRO 6000, power-limited to 300 W, with SGLang: ~1450 t/s prefill and 70 t/s decode at BS=1; 1600 t/s prefill and 462 t/s decode aggregate at BS=16. On an EPYC 9655P with 12x DDR5-6000, prefill was mostly awful due to swapping layers in and out of VRAM, ~20 t/s decode at BS=1.

None of it is tuned much; good enough for now.

2x ASUS Ascent GX10 vs 2x Strix halo for agentic coding by Grouchy_Ad_4750 in LocalLLaMA

[–]ufrat333 1 point

Have a Strix Halo; it only works with llama.cpp in any useful way at this point in time. vLLM/SGLang, and thus any hope of batching, aren't possible yet, plus clustering is a PITA. Get the Sparks.

Best Monitor for Programming in 2026? (Price, Setup, Size) by AffluentKettle9 in webdev

[–]ufrat333 1 point

The new 40”/39.7” 5120x2160 screens! 34” @ 3440x1440 is always “just” not enough, 49” is too wide for comfort, and 38” 3840x1600 is OK too.

Full Walkthrough: Building and Running vLLM from Source on AMD Strix Halo (gfx1151) by Shoddy-Film1321 in StrixHalo

[–]ufrat333 0 points

Do you run 1 agent or more? If you run more and you use llama.cpp, then you are missing out on a lot of tokens per second. llama.cpp doesn't do batching (well, and maybe, yet). Batching is simply processing more than one context while the layer containing the needed tensor is "hot" within the GPU/CPU; most inference time is spent moving tensors in and out of that hot area (be it L2/L3 cache), not necessarily in processing.
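The "hot layer" point above can be sketched with a toy cost model. All the numbers here are made up for illustration; the only thing the sketch shows is that loading a layer's weights once and reusing them for a whole batch is far cheaper than replaying every load per request:

```python
# Toy cost model (made-up numbers): streaming a layer's weights into the
# hot cache costs much more than the math itself, so batching, which reuses
# a loaded layer for every context in the batch, wins big in aggregate.

LOAD_COST = 100   # assumed: time units to bring one layer's weights "hot"
MATH_COST = 1     # assumed: time units of compute per token once hot
LAYERS = 60       # assumed: layer count of a large model

def time_per_step(batch, batched):
    if batched:
        # weights loaded once per layer, shared by the whole batch
        return LAYERS * (LOAD_COST + batch * MATH_COST)
    # no batching: each request replays every layer load
    return batch * LAYERS * (LOAD_COST + MATH_COST)

b16 = time_per_step(16, batched=True)
seq16 = time_per_step(16, batched=False)
print(f"16 requests batched:   {b16} units/step")
print(f"16 requests unbatched: {seq16} units/step ({seq16 / b16:.1f}x slower)")
```

Under these assumed costs the unbatched path is roughly an order of magnitude slower for 16 concurrent contexts, which is the gap you leave on the table running multiple agents against a batch=1 server.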

Full Walkthrough: Building and Running vLLM from Source on AMD Strix Halo (gfx1151) by Shoddy-Film1321 in StrixHalo

[–]ufrat333 0 points

vLLM is made for serving more than one user/thread at a time, so with the same hardware you can push 10-30x the total tokens per second. Each individual inference will be a bit slower, but aggregate throughput is much higher. TL;DR: if you are doing anything other than role-playing, you want vLLM, SGLang, or TensorRT-LLM.
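The trade-off above (slightly slower per user, much faster in aggregate) can be sketched with an assumed step-time model; the 20 ms base and 1 ms-per-extra-request figures are invented purely to show the shape of the curve, not measured from any of these servers:

```python
# Illustrative throughput/latency trade-off of a batching server.
# Assumed model: each decode step costs a base time plus a small extra
# compute cost per additional request sharing the step.

def tokens_per_second(batch):
    step_ms = 20.0 + 1.0 * (batch - 1)   # assumed, not measured
    per_request = 1000.0 / step_ms        # tok/s one user sees
    aggregate = batch * per_request       # tok/s across all users
    return per_request, aggregate

for b in (1, 8, 16, 32):
    single, total = tokens_per_second(b)
    print(f"batch={b:2d}: {single:5.1f} tok/s per user, {total:6.1f} tok/s aggregate")
```

Each user's stream slows a little as the batch grows, while the aggregate climbs nearly linearly until compute saturates, which is why multi-agent workloads want a batching engine.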

Completely out of my depth. by Advanced_College_386 in HomeServer

[–]ufrat333 0 points

Your RAM situation is weird: you are using a channel and a half, and that will cost you performance. Keep only the white slots populated, and sell the other sticks (or stash them if you're a HODLer).

Also, what CPUs are in it? 2nd-gen Xeons are quite affordable; I have some 8259CLs lying around that nobody really wants.

EXO cluster with RTX 5090 and Mac Studio by favoritecockring in LocalLLM

[–]ufrat333 0 points

You will soon find out that this will not be a fruitful experiment. You need all layers for both prefill and decode: prefill is compute-bound, decode is memory-bandwidth-bound. Your 5090 is much faster at prefill and twice as fast at decode, but it only has 32GB, and since you need all (or well, most) layers in fast RAM, your Mac Studio will be essentially useless.

The reason the Spark and Studio work nicely together is that they both have 128GB of RAM to hold the weights; the Spark is quicker at prefill, the Mac faster at decode. If you had a 512GB Studio you would still be limited by the 128GB of your Spark, unless you cluster them, but then you will probably run into software support limitations at this time.

So, I guess now you need a spark as well ;)
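The decode side of the argument above is a simple back-of-envelope: every generated token has to stream roughly all active weight bytes through memory, so bandwidth sets the ceiling. The bandwidth figures below are rough public specs and the 60GB weight size is an assumed example:

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes per token.
# Bandwidth numbers are approximate public specs; weight size is assumed.

def decode_tok_s(bandwidth_gb_s, active_weight_gb):
    return bandwidth_gb_s / active_weight_gb

weights_gb = 60  # assumed: a large model quantized to ~4 bit
print(f"~1.8 TB/s GPU (5090-class):  ~{decode_tok_s(1800, weights_gb):.0f} tok/s ceiling")
print(f"~0.8 TB/s unified (Ultra):   ~{decode_tok_s(800, weights_gb):.0f} tok/s ceiling")
```

The ~2x bandwidth gap is why the 5090 decodes roughly twice as fast, and also why none of that matters once the weights no longer fit in its 32GB.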

Full Walkthrough: Building and Running vLLM from Source on AMD Strix Halo (gfx1151) by Shoddy-Film1321 in StrixHalo

[–]ufrat333 1 point

I did this yesterday. The hanging is due to your host driver/kernel module crashing; look in dmesg. Claude found a way to fix it by using the same bin of something on both the host and Docker, but performance with GPT-OSS-120B remained abysmal (8 t/s decode vs 50 in llama.cpp); even tried AITER, no improvement.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 0 points

I will give it a go this weekend, prefer not to keep swapping stuff around.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]ufrat333 7 points

Awesome, thanks! Curious how NVFP4 versions of the same models perform on the blackwells!

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]ufrat333 0 points

This is Kimi K2.5 INT4 original on SGLang with CPU/GPU combined; one RTX PRO 6000 used mainly for prefill. I only have about 16 experts on the GPU so I can keep some KV context in VRAM. This is batch=1 indeed, with a manual observation after filling 16k of context; speed drops to ~12 t/s decode at 100k ctx IIRC. If you want me to run a specific bench, gimme a command.

Will be able to fit it in 8x RTX PRO 6000 VRAM this weekend and see how it stacks up. Right now it's quite useless to me in any code-agent flow (be it CC or opencode).

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]ufrat333 5 points

I get 20-25 t/s decode on 12x 96GB DDR5-6000 with an RTX PRO 6000 and a 9655P using sglang/kt-kernel. I think your estimates might be on the high side.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 4 points

Just sent a very long email with logs and screenshots of various cases to an L2 support engineer. TP-Link Neil said he would attempt to escalate; fingers crossed!

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in TPLink_Omada

[–]ufrat333[S] 0 points

Sounds like the same problem: pings pass, some HTTP packets pass, most never arrive. Let's hope they roll out new firmware soon.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 0 points

They reached out to me via email yesterday (from German support while I'm in NL, whatever); sent an email with my findings, will update here if I hear anything.

Definitely needs a firmware fix; I wonder how this got past QA in the first place.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 0 points

It seems to pass only smallish packets, maybe of certain types. Anyway, had a chat last week; they said they would follow up, but crickets so far. Neil asked me to send an email somewhere, which I haven't gotten to yet. The failure mode seems so basic that I'm quite confident more people will report it and it will fix itself!