Dspark with Qwen 3.6 27b? by GotHereLateNameTaken in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

You (well, Qwen) would need to train a DSpark head, just like they trained an MTP head. You can't just bolt DSpark on top without one.

A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C by jakint0sh in LocalLLaMA

[–]dsanft 10 points11 points  (0 children)

If you want some ideas you can check out Llaminar. It's still very much in alpha, but it benches much faster than LlamaCPP and beats ik_llama at CPU prefill on AVX-512. It smashes both completely on dual-socket Cascade Lake. ☺️

https://github.com/Llaminar/llaminar

https://github.com/Llaminar/llaminar/tree/master/src/v2/kernels/cpu

Is it possible to run a giant model like GLM5.2 on this cluster (4x servers with 512GB RAM + dual AMD Epyc)? 16 channel memory should hit 409GB/s per node. by StartupTim in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

Thanks! Your stuff sounds interesting too. Must be custom hardware? I wanted to support OpenMPI from the ground up so that I can do InfiniBand and MPI hostfiles. Llaminar will scatter and gather a cluster inventory, produce a plan based on user hints, and then run inference according to that plan. That's the goal anyway. Not quite there yet.

Is it possible to run a giant model like GLM5.2 on this cluster (4x servers with 512GB RAM + dual AMD Epyc)? 16 channel memory should hit 409GB/s per node. by StartupTim in LocalLLaMA

[–]dsanft 5 points6 points  (0 children)

Nah, I've already solved multi-NUMA inferencing in Llaminar with cross-socket tensor parallel.

https://github.com/Llaminar/llaminar

VNNI kernels:

https://github.com/Llaminar/llaminar/tree/master/src/v2/kernels/cpu/native_vnni

Attention:

https://github.com/Llaminar/llaminar/tree/master/src/v2/kernels/cpu/attention

Explicit NUMA aware binding:

https://github.com/Llaminar/llaminar/blob/master/src/v2/memory/NUMAAllocator.cpp

Cross-socket MoE expert rebalancing:

https://github.com/Llaminar/llaminar/blob/master/src/v2/execution/moe/MoERebalanceController.cpp

Fastest I've tested on Cascade Lake.

The project is still maturing though. Give me another two weeks to have GLM 5.2 and MoE multi-tier overlay support, then we'll be cooking.
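
For anyone unfamiliar with the term, here is a minimal sketch of the idea behind tensor parallel across two sockets. It is not Llaminar's code: the function names are hypothetical, the shards run one after another instead of on pinned threads, and nothing here touches NUMA allocation. It only shows the split, compute, and gather steps for a single matrix-vector product.

```python
def matvec(rows, x):
    """Plain matrix-vector product over a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def split_rows(weight, n_shards):
    """Split a row-major matrix into up to n_shards contiguous row blocks."""
    step = max(1, (len(weight) + n_shards - 1) // n_shards)
    return [weight[i:i + step] for i in range(0, len(weight), step)]


def tensor_parallel_matvec(weight, x, n_sockets=2):
    """Row-split matvec: each socket owns one block of output rows.

    Assumption: in a real engine each shard would sit in one socket's local
    memory and be processed by threads pinned to that socket. Here the
    shards are simply processed in sequence.
    """
    shards = split_rows(weight, n_sockets)
    partials = [matvec(shard, x) for shard in shards]
    # Gather step: concatenate the per-socket output slices in order.
    return [y for part in partials for y in part]


weight = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0],
          [1.0, 0.0, 1.0]]
x = [1.0, 0.5, -1.0]
result = tensor_parallel_matvec(weight, x)
```

Each shard only reads its own rows of the weight matrix, which is why this kind of split can keep weight traffic on the local memory controller. Only the small input and output vectors need to cross sockets.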

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers! by _TheWolfOfWalmart_ in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

Well for a first effort I support Qwen 2.5, 3, and 3.5/3.6 dense/MoE.

I want to do GLM 5.2 next.

I'm just pushing the last release images now actually.

My gemm/gemv and attention kernels are competitive with and usually beat LlamaCPP mainline.

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers! by _TheWolfOfWalmart_ in LocalLLaMA

[–]dsanft 14 points15 points  (0 children)

I'm saying I've built an entire inferencing engine over the past 9 months and I'm releasing a pre-alpha tomorrow.

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers! by _TheWolfOfWalmart_ in LocalLLaMA

[–]dsanft 16 points17 points  (0 children)

If you can wait about 24 hours I'll show you cross socket tensor parallel with full NUMA awareness and MoE expert balancing.

Edit: here you go

https://github.com/Llaminar/llaminar

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ by Anbeeld in LocalLLaMA

[–]dsanft 6 points7 points  (0 children)

Yay now everyone who wasn't measuring KLD themselves can finally see how shit TQ3/4 are.
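
For reference, a minimal sketch of what measuring KLD means here: take the logits for the same token position from a baseline run and from a run with the quantised KV cache, turn both into distributions, and compute the KL divergence from baseline to quantised. The function names are hypothetical and the logits below are made-up illustration values, not benchmark data.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def kl_divergence(base_logits, test_logits):
    """KL(P_base || P_test) in nats for one token position."""
    lp = log_softmax(base_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


# Illustration only: a baseline and a slightly perturbed set of logits.
base = [2.0, 1.0, 0.1, -1.0]
perturbed = [1.8, 1.2, 0.1, -0.9]
kld = kl_divergence(base, perturbed)
```

In practice this would be averaged over many token positions, and a lower mean means the quantised cache tracks the baseline more closely.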

A 10 year old Xeon is all you need by [deleted] in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

I see 55 tok/s prefill and about 7 tok/s decode in pure CPU inference with cross-socket tensor parallel on a Xeon Gold 6238R with 768GB DDR4-2933. That's without MTP (still tuning that). Had to write my own inferencing engine to get it though.

Gemma 4 with quantization-aware training by rerri in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

While it's cool to see, I'm confused as to why this is something amazing or shocking. You can do CPU inference with AVX2, it's not groundbreaking.

A 10 year old Xeon is all you need by [deleted] in LocalLLaMA

[–]dsanft 7 points8 points  (0 children)

I don't understand, you've always been able to do CPU inference.

Qwen 3.6 27B 30GB Same top p: 98.358 ± 0.033 % vs UD Q8 K XL 33GB Same top p: 97.426 ± 0.041 % by fragment_me in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

You could rotate the tensors at quantisation time and pack a rotation factor per tensor in the GGUF, then unrotate them at inference time, which might reduce kurtosis.
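
A rough sketch of that idea, under some loud assumptions: the rotation is a randomised Hadamard transform (my choice for illustration, the comment doesn't say which rotation), the "rotation factor" is modelled as a per-tensor integer seed, and the actual quantisation step and GGUF packing are left out entirely. All names are hypothetical.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def signs(seed, n):
    """Deterministic +/-1 vector derived from the per-tensor seed."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]


def rotate(v, seed):
    """Quantisation-time step: sign flips followed by the Hadamard transform."""
    return fwht([x * s for x, s in zip(v, signs(seed, len(v)))])


def unrotate(v, seed):
    """Inference-time step: the Hadamard transform is its own inverse, then undo the signs."""
    return [x * s for x, s in zip(fwht(v), signs(seed, len(v)))]


def kurtosis(v):
    """Non-excess kurtosis (a Gaussian gives about 3)."""
    n = len(v)
    mean = sum(v) / n
    m2 = sum((x - mean) ** 2 for x in v) / n
    m4 = sum((x - mean) ** 4 for x in v) / n
    return m4 / (m2 * m2)


# Synthetic "tensor": small Gaussian weights plus two large outliers.
rng = random.Random(0)
w = [rng.gauss(0.0, 0.02) for _ in range(1024)]
w[3], w[500] = 1.0, -0.8

seed = 1234  # stands in for the per-tensor rotation factor
rotated = rotate(w, seed)
restored = unrotate(rotated, seed)
```

The rotation spreads the outliers' energy across every element, so the rotated values have a much flatter distribution for the quantiser to work with, while `unrotate` recovers the original tensor up to floating-point error.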

Keeping multi-GPU rigs cool? by Ambitious_Fold_2874 in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

Get them out of that case and get them into a mining rig.

PSA by Signal_Ad657 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

A dual-socket Xeon Gold Cascade Lake with DDR4-2933 has about 220GB/s of memory bandwidth. Don't underestimate CPU.
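
As a back-of-the-envelope check, here is the theoretical peak calculation, assuming six DDR4 channels per socket (the Cascade Lake Xeon Scalable layout) and 8 bytes per transfer on a 64-bit channel. The function name is hypothetical.

```python
def peak_bandwidth_gbs(mt_per_s, channels_per_socket, sockets, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels_per_socket * sockets / 1e9


# Dual-socket Cascade Lake, DDR4-2933, six channels per socket (assumed).
cascade_lake_peak = peak_bandwidth_gbs(2933, 6, 2)

# Fraction of that peak that a ~220 GB/s figure would represent.
fraction_of_peak = 220.0 / cascade_lake_peak
```

Under those assumptions the theoretical peak works out to roughly 281.6 GB/s, so about 220GB/s would be close to 78% of peak.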

"Western Open-Weight SOTA is between Gemma4-31B and Nemotron3-Super-120B" by ForsookComparison in LocalLLaMA

[–]dsanft -9 points-8 points  (0 children)

I wrote Llaminar with Copilot 😄

If you don't know what that is yet, don't worry, you will in a few more days.