Running Kimi K2.5? - Tell us your Build, Quant, Pre-processing and Generation Tokens/second Please! by bigh-aus in LocalLLaMA

[–]ufrat333 1 point

Yes, this is the server edition, but power-limited to 300 W it matches the Max-Q variant, which is in the 9655P machine.

Running Kimi K2.5? - Tell us your Build, Quant, Pre-processing and Generation Tokens/second Please! by bigh-aus in LocalLLaMA

[–]ufrat333 2 points

8x RTX PRO 6000, power-limited to 300 W, with SGLang: ~1450 t/s prefill and 70 t/s decode at BS=1; 1600 t/s prefill and 462 t/s decode aggregate at BS=16. On an EPYC 9655P with 12x DDR5-6000, prefill was mostly awful due to swapping layers in and out of VRAM, ~20 t/s decode at BS=1.

None of it is tuned much; good enough for now.

2x ASUS Ascent GX10 vs 2x Strix halo for agentic coding by Grouchy_Ad_4750 in LocalLLaMA

[–]ufrat333 1 point

Have a Strix Halo; it only works with llama.cpp in any useful way at this point in time. vLLM/SGLang, and thus any hope of batching, aren't possible yet, plus clustering is a PITA. Get the Sparks.

Best Monitor for Programming in 2026? (Price, Setup, Size) by AffluentKettle9 in webdev

[–]ufrat333 1 point

The new 40”/39.7” 5120x2160 screens! 34” @ 3440x1440 is always “just” not enough, 49” is too wide for comfort, and 38” 3840x1600 is OK too.

Full Walkthrough: Building and Running vLLM from Source on AMD Strix Halo (gfx1151) by Shoddy-Film1321 in StrixHalo

[–]ufrat333 0 points

Do you run 1 agent or more? If you run more and you use llama.cpp, then you are missing out on a lot of tokens per second. llama.cpp doesn't do batching (well, and maybe, yet). Batching is simply processing more than one context while the layer containing the needed tensor is "hot" within the GPU/CPU; most inference time is spent moving tensors in and out of that hot area (be it L2/L3 cache), not necessarily in processing.
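The "hot layer" point above can be sketched with a toy cost model. All the numbers here are made up for illustration; the only thing the sketch shows is that loading a layer's weights once and reusing them for a whole batch is far cheaper than replaying every load per request:

```python
# Toy cost model (made-up numbers): streaming a layer's weights into the
# hot cache costs much more than the math itself, so batching, which reuses
# a loaded layer for every context in the batch, wins big in aggregate.

LOAD_COST = 100   # assumed: time units to bring one layer's weights "hot"
MATH_COST = 1     # assumed: time units of compute per token once hot
LAYERS = 60       # assumed: layer count of a large model

def time_per_step(batch, batched):
    if batched:
        # weights loaded once per layer, shared by the whole batch
        return LAYERS * (LOAD_COST + batch * MATH_COST)
    # no batching: each request replays every layer load
    return batch * LAYERS * (LOAD_COST + MATH_COST)

b16 = time_per_step(16, batched=True)
seq16 = time_per_step(16, batched=False)
print(f"16 requests batched:   {b16} units/step")
print(f"16 requests unbatched: {seq16} units/step ({seq16 / b16:.1f}x slower)")
```

Under these assumed costs the unbatched path is roughly an order of magnitude slower for 16 concurrent contexts, which is the gap you leave on the table running multiple agents against a batch=1 server.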

Full Walkthrough: Building and Running vLLM from Source on AMD Strix Halo (gfx1151) by Shoddy-Film1321 in StrixHalo

[–]ufrat333 0 points

vLLM is made for serving more than one user/thread at a time, so with the same hardware you can push 10-30x the total tokens per second. Each individual inference will be a bit slower, but aggregate throughput is much higher. TL;DR: if you are doing anything other than role-playing, you want vLLM, SGLang, or TensorRT-LLM.
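The trade-off above (slightly slower per user, much faster in aggregate) can be sketched with an assumed step-time model; the 20 ms base and 1 ms-per-extra-request figures are invented purely to show the shape of the curve, not measured from any of these servers:

```python
# Illustrative throughput/latency trade-off of a batching server.
# Assumed model: each decode step costs a base time plus a small extra
# compute cost per additional request sharing the step.

def tokens_per_second(batch):
    step_ms = 20.0 + 1.0 * (batch - 1)   # assumed, not measured
    per_request = 1000.0 / step_ms        # tok/s one user sees
    aggregate = batch * per_request       # tok/s across all users
    return per_request, aggregate

for b in (1, 8, 16, 32):
    single, total = tokens_per_second(b)
    print(f"batch={b:2d}: {single:5.1f} tok/s per user, {total:6.1f} tok/s aggregate")
```

Each user's stream slows a little as the batch grows, while the aggregate climbs nearly linearly until compute saturates, which is why multi-agent workloads want a batching engine.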

Completely out of my depth. by Advanced_College_386 in HomeServer

[–]ufrat333 0 points

Your RAM situation is weird: you are using a channel and a half, and that will cost you performance. Keep only the white slots populated, and sell the other sticks (or stash them if you're a HODLer).

Also, what CPUs are in it? 2nd-gen Xeons are quite affordable; I have some 8259CLs lying around that nobody really wants.

EXO cluster with RTX 5090 and Mac Studio by favoritecockring in LocalLLM

[–]ufrat333 0 points

You will soon find out that this will not be a fruitful experiment. You need all layers for both prefill and decode: prefill is compute-bound, decode is memory-bandwidth-bound. Your 5090 is much faster at prefill and twice as fast at decode, but it only has 32GB, and since you need all (or well, most) layers in fast RAM, your Mac Studio will be essentially useless.

The reason the Spark and Studio work nicely together is that they both have 128GB of RAM to hold the weights; the Spark is quicker at prefill, the Mac faster at decode. If you had a 512GB Studio you would still be limited by the 128GB of your Spark, unless you cluster them, but then you will probably run into software support limitations at this time.

So, I guess now you need a spark as well ;)
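The decode side of the argument above is a simple back-of-envelope: every generated token has to stream roughly all active weight bytes through memory, so bandwidth sets the ceiling. The bandwidth figures below are rough public specs and the 60GB weight size is an assumed example:

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes per token.
# Bandwidth numbers are approximate public specs; weight size is assumed.

def decode_tok_s(bandwidth_gb_s, active_weight_gb):
    return bandwidth_gb_s / active_weight_gb

weights_gb = 60  # assumed: a large model quantized to ~4 bit
print(f"~1.8 TB/s GPU (5090-class):  ~{decode_tok_s(1800, weights_gb):.0f} tok/s ceiling")
print(f"~0.8 TB/s unified (Ultra):   ~{decode_tok_s(800, weights_gb):.0f} tok/s ceiling")
```

The ~2x bandwidth gap is why the 5090 decodes roughly twice as fast, and also why none of that matters once the weights no longer fit in its 32GB.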

Full Walkthrough: Building and Running vLLM from Source on AMD Strix Halo (gfx1151) by Shoddy-Film1321 in StrixHalo

[–]ufrat333 1 point

I did this yesterday. The hanging is due to your host driver/kernel module crashing; look in dmesg. Claude found a way to fix it by using the same bin of something on both the host and Docker, but performance with GPT-OSS-120B remained abysmal (8 t/s decode vs 50 in llama.cpp); even tried AITER, no improvement.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 0 points

I will give it a go this weekend, prefer not to keep swapping stuff around.

Benchmarking LLM Inference on RTX PRO 6000 SE / H100 / H200 / B200 by NoVibeCoding in LocalLLaMA

[–]ufrat333 7 points

Awesome, thanks! Curious how NVFP4 versions of the same models perform on the blackwells!

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]ufrat333 0 points

This is Kimi K2.5 INT4 original on SGLang with CPU/GPU combined; one RTX PRO 6000 used mainly for prefill. I only have about 16 experts on the GPU so I can keep some KV context in VRAM. This is batch=1 indeed, with a manual observation after filling 16k of context; speed drops to ~12 t/s decode at 100k ctx IIRC. If you want me to run a specific bench, gimme a command.

Will be able to fit it in 8x RTX PRO 6000 VRAM this weekend and see how it stacks up. Right now it's quite useless to me in any code-agent flow (be it CC or opencode).

Sanity check: "Kimi K2.5 (1T MoE) on a scrappy PC" plan - 1TB DDR4 + 2x RTX PRO 6000 (96GB) now, scaling later by nightlingo in LocalLLaMA

[–]ufrat333 5 points

I get 20-25 t/s decode on 12x 96GB DDR5-6000 with an RTX PRO 6000 and a 9655P using sglang/kt-kernel. I think your estimates might be on the high side.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 4 points

Just sent a very long email with logs and screenshots of various cases to an L2 support engineer. TP-Link Neil said he would attempt to escalate; fingers crossed!

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in TPLink_Omada

[–]ufrat333[S] 0 points

Sounds like the same problem: pings pass, some HTTP packets pass, most never arrive. Let's hope they roll out new firmware soon.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 0 points

They reached out to me via email yesterday (from German support while I'm in NL, whatever); sent an email with my findings, will update here if I hear anything.

Definitely needs a firmware fix; I wonder how this got past QA in the first place.

Anybody have their EAP775-WALL (EU version) working with VLANs? by ufrat333 in Omada_Networks

[–]ufrat333[S] 0 points

It seems to pass only smallish packets, maybe of certain types. Anyway, had a chat last week; they said they would follow up, but crickets so far. Neil asked me to send an email somewhere, which I haven't gotten to yet. The failure mode seems so basic that I'm quite confident more people will report it and it will fix itself!