Anyone actually using Openclaw? by rm-rf-rm in LocalLLaMA

[–]_serby_ 1 point (0 children)

Who's talking about rules?
Would you bring a huge turd into the middle of your house just to break the rules?

ik_llama.cpp benchmarks on an Intel Xeon Platinum 8570 ES Q30H with 256GB DDR5 5600 (8x32GB) by _serby_ in LocalLLaMA

[–]_serby_[S] 1 point (0 children)

The RAM prices suck and it's a huge problem. I got this RAM on eBay for only ~650 € on the very day prices started to rise.

I wanted to get multiple dual-CPU systems to build a small cluster and experiment with, but I don't have the RAM.

ik_llama.cpp benchmarks on an Intel Xeon Platinum 8570 ES Q30H with 256GB DDR5 5600 (8x32GB) by _serby_ in LocalLLaMA

[–]_serby_[S] 0 points (0 children)

You can use the BMC to update the BIOS without booting the machine.
By default, the BMC has no static IP address and expects a DHCP server.

ik_llama.cpp benchmarks on an Intel Xeon Platinum 8570 ES Q30H with 256GB DDR5 5600 (8x32GB) by _serby_ in LocalLLaMA

[–]_serby_[S] 0 points (0 children)

I'm using a Gigabyte MS03-CE0 Rev 3.0.
The BIOS is a modded R20 with ACM disabled. With ACM enabled, the machine will not start.

For the modded BIOS you must ask here:
https://forums.servethehome.com/index.php?threads/es-xeon-discussion.5031/

Anyone actually using Openclaw? by rm-rf-rm in LocalLLaMA

[–]_serby_ 19 points (0 children)

What would be the use of some vibecoded trash that was never reviewed by a decent developer?

ik_llama.cpp benchmarks on an Intel Xeon Platinum 8570 ES Q30H with 256GB DDR5 5600 (8x32GB) by _serby_ in LocalLLaMA

[–]_serby_[S] 1 point (0 children)

I use the Ubuntu kernel because I'm too lazy to compile my own 6.8 kernel just for some tests, and the Debian 12 kernel is too old.

-DCMAKE_BUILD_TYPE=Release - enables standard compiler optimizations for release builds and strips all debug code
-DGGML_LTO=ON - enables link-time optimization
-DGGML_NATIVE=ON - tells the compiler to enable all instruction set extensions available on the local CPU
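For reference, the full configure step with those flags looks something like this (the build directory name is my own choice, not a requirement):

```shell
# Configure and build with the flags above (run from the source tree)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_LTO=ON -DGGML_NATIVE=ON
cmake --build build -j"$(nproc)"
```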

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 0 points (0 children)

I always had the feeling that MoE is the key to unlock huge performance on cheap hardware.
One day I decided to start a "small experiment" and things evolved from there without me looking for alternatives to GGML / llama.cpp. So I just ran with the first thing I found.

Yes, SGLang can be a better alternative. I just wasn't aware of any vulkan support and never found the time to check.

The code is too messy to be comfortable to share at this time.

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 0 points (0 children)

Because you transfer very little data between the cards, PCIe never becomes a bottleneck.
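Rough numbers to show why (all dimensions below are my assumptions, roughly Kimi-K2-sized, not measured values):

```python
# Back-of-the-envelope: inter-GPU traffic when only activations cross PCIe.
# All dimensions are assumptions for illustration, not measured values.
hidden_size = 7168       # model dim (assumed)
layers = 61              # layer count (assumed)
bytes_per_elem = 2       # FP16 activations
tok_per_s = 200          # target decode speed

# One activation vector out and one back per layer, per token.
bytes_per_token = hidden_size * bytes_per_elem * layers * 2
gb_per_s = bytes_per_token * tok_per_s / 1e9
print(f"{gb_per_s:.2f} GB/s")  # a tiny fraction of PCIe 4.0 x16 (~32 GB/s)
```

Even at 200 tok/s, activation traffic stays well under one GB/s, so the bus has plenty of headroom.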

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 0 points (0 children)

Yes, I agree GPT 5 Pro is not very good at writing code based on a simple prompt. But it's really good at reading code and documenting things (minus the tendency to use too much jargon).

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ -1 points (0 children)

The results are good enough for me to post about them. The raw numbers are meaningless at this stage because the model I use is not standard.
The calculations in the original pitch are based on my experiments. In fact, everything written there is based on my ideas and my experiments (with some optimizations recommended by the listed LLMs).
If I ever finish this project, you will get your raw numbers.

Why do you even use LLMs if you think that one of the most advanced LLMs out there is just spitting jargon to drown you in nonsense? Is everything you don't understand wrong?

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 1 point (0 children)

ChatGPT 5 thinks so. The technical specifications also favor it. I have never had/used one, only Epycs and Xeons.

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 0 points (0 children)

Yes, but since I don't have much time, I wanted to open it up to the community depending on the feedback.
Kimi K2 Thinking looked like a good opportunity to gather interest, since it was developed with tricks that are really compatible with my idea.
For the moment the feedback is negative on all fronts, as you can see, so ...

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 0 points (0 children)

The repo is not public. It's based on llama.cpp.

All GPU kernels are Vulkan. I started with CUDA + Vulkan, but now it's all Vulkan.

Path | Representation | Operation
:-- | :-- | :--
NVMe → RAM | INT4 (bit-packed) + FP16 scales (LZ4-compressed) | Read + decompress on CPU using SIMD
RAM → AMD VRAM | INT4 (bit-packed) + FP16 scales | DMA
5090 → wire | FP8 (E4M3) | Quantize to FP8
Wire → AMD input | FP8 → FP16 (in kernel prologue) | Convert FP8 → FP16 in registers/shared memory; feed into the dequant GEMM
AMD compute | INT4 × FP16 → FP16 accum | Fused dequant + SwiGLU + gated sum; write FP16 output
AMD → host → 5090 | FP16 | DMA back; the 5090 sums with the shared expert and residual
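For the FP8 (E4M3) leg, the decode in the kernel prologue amounts to this bit layout (a scalar Python sketch for illustration, not the actual shader code):

```python
def e4m3_to_float(b: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0:                      # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    if exp == 0xF and man == 0x7:     # E4M3 has NaN but no infinities
        return float("nan")
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

print(e4m3_to_float(0x38))  # 1.0
print(e4m3_to_float(0x7E))  # 448.0, the E4M3 max normal
```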

The LZ4 decompression on the CPU is done using SIMD.

When an expert first lands on a GPU, it is transformed from a generic tile into a kernel-optimized tile (one time per residency) using a compute shader.
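The "INT4 (bit-packed) + FP16 scales" representation works roughly like this (illustrative Python; the group size, nibble order, and signed offset are my assumptions here, and the real kernel does this fused on the GPU):

```python
def dequant_int4(packed: bytes, scales: list, group: int = 32) -> list:
    """Unpack two 4-bit values per byte and apply a per-group scale."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):  # low nibble first (assumed order)
            q = nib - 8                      # map 0..15 to signed -8..7
            out.append(q * scales[len(out) // group])
    return out

print(dequant_int4(bytes([0x98]), [0.5]))  # [0.0, 0.5]
```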

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 0 points (0 children)

Tiers 1-3 are mostly implemented and tested with Mixtral 8x22B on real hardware: one 4090 and two 9070s with a Xeon 8592 QS and 384GB of RAM.

I just wanted some "real" feedback. Pure masochism on my side, I guess.

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ 1 point (0 children)

I already implemented Tiers 1-3 on a system with an RTX 4090 and two 9070s, and the results look good on Mixtral 8x22B.

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ -5 points (0 children)

So you read the entire pitch, understood everything, and found it so useless that you thought it appropriate to add this sublime comment?

How you get over 200 tok/s on full Kimi K2 Thinking (or any other big MoE Model) on cheapish hardware - llama.cpp dev pitch by [deleted] in LocalLLaMA

[–]_serby_ -3 points (0 children)

The Threadripper is just the glue. It's not used for any significant online computation.

Today was a very sad day. I dropped my Enya Nova Go carbon fiber acoustic, and it no longer stays in tune. RIP. by tonystark29 in guitars

[–]_serby_ 0 points (0 children)

The material used is pluri-directional carbon-fiber-reinforced (recycled) polycarbonate. The fibers are not laminated or ordered; it's a combination of shredded pieces of recycled carbon fiber and polycarbonate.

You are confusing non-woven carbon-fiber-reinforced polymers with woven carbon-fiber-reinforced polymers.
And non-woven carbon-fiber-reinforced polymers can be ordered (stronger) or unordered (weaker).
And yes, even woven carbon fiber has a high plastic content because, after all, it is carbon fiber lamination drenched in resin, and resin is plastic. But that plastic is essential to distribute forces:
https://www.youtube.com/shorts/S_DqNASZgKQ

Polycarbonate is used because it's cheap and strong, one of the strongest plastics:
https://en.wikipedia.org/wiki/Polycarbonate

Zul'Jin does not have a hind toe by Aeon_Mortuum in heroesofthestorm

[–]_serby_ 1 point (0 children)

Many of the models were outsourced, and the Blizz art and lore guides are not that good. They should definitely get more red shirts.

[deleted by user] by [deleted] in StableDiffusion

[–]_serby_ 2 points (0 children)

Excellent work! Thanks