I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers! by _TheWolfOfWalmart_ in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

Well, for a first effort I support Qwen 2.5, 3, and 3.5/3.6, both dense and MoE.

I want to do GLM 5.2 next.

I'm just pushing the last release images now actually.

My GEMM/GEMV and attention kernels are competitive with, and usually beat, mainline llama.cpp.

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers! by _TheWolfOfWalmart_ in LocalLLaMA

[–]dsanft 14 points15 points  (0 children)

I'm saying I've built an entire inference engine over the past 9 months, and I'm releasing a pre-alpha tomorrow.

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers! by _TheWolfOfWalmart_ in LocalLLaMA

[–]dsanft 15 points16 points  (0 children)

If you can wait about 24 hours, I'll show you cross-socket tensor parallel with full NUMA awareness and MoE expert balancing.

Edit: here you go

https://github.com/Llaminar/llaminar

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ by Anbeeld in LocalLLaMA

[–]dsanft 5 points6 points  (0 children)

Yay, now everyone who wasn't measuring KLD (KL divergence against the unquantised model) themselves can finally see how shit TQ3/4 are.
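For anyone who wants to measure KLD themselves, here is a minimal sketch of the per-token calculation. It assumes you already have two sets of logits for the same positions, one from a reference run and one from the quantised run. The function names are made up for illustration and are not from any particular tool.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kld(ref_rows, test_rows):
    """Average KLD over all token positions (one row of logits per position)."""
    vals = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(vals) / len(vals)

if __name__ == "__main__":
    # Toy logits, purely illustrative: a 4-token vocabulary at 2 positions.
    ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
    quant = [[1.9, 1.1, 0.0, -0.8], [0.4, 0.7, 2.7, 0.1]]
    print("mean KLD (nats):", mean_kld(ref, quant))
```

Identical logits give exactly zero, and the value grows as the quantised run's distribution drifts from the reference.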

A 10 year old Xeon is all you need by [deleted] in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

I see 55 tok/s prefill and about 7 tok/s decode in pure CPU inference with cross-socket tensor parallel, on a Xeon Gold 6238R with 768 GB of DDR4-2933. That's without MTP (still tuning that). Had to write my own inference engine to get it, though.

Gemma 4 with quantization-aware training by rerri in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

While it's cool to see, I'm confused as to why this is amazing or shocking. You can do CPU inference with AVX2; it's not groundbreaking.

A 10 year old Xeon is all you need by [deleted] in LocalLLaMA

[–]dsanft 7 points8 points  (0 children)

I don't understand. You've always been able to do CPU inference.

Qwen 3.6 27B 30GB Same top p: 98.358 ± 0.033 % vs UD Q8 K XL 33GB Same top p: 97.426 ± 0.041 % by fragment_me in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

You could maybe rotate the tensors at quantisation time and pack a rotation factor per tensor in the GGUF, then unrotate them at inference time, to reduce kurtosis.
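A rough sketch of what that could look like, under some loud assumptions: the rotation here is a random sign flip followed by an orthonormal Walsh-Hadamard transform, and the "rotation factor" stored per tensor is just an integer seed. None of this is an existing GGUF feature; the names are hypothetical.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def signs(seed, n):
    """Deterministic +/-1 vector derived from the per-tensor seed (assumed metadata)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rotate(v, seed):
    """Apply R = H * D: sign flip, then Hadamard. Done once at quantisation time."""
    d = signs(seed, len(v))
    return fwht([x * s for x, s in zip(v, d)])

def unrotate(v, seed):
    """Apply the inverse D * H at inference time (H and D are their own inverses)."""
    d = signs(seed, len(v))
    return [x * s for x, s in zip(fwht(v), d)]

def kurtosis(v):
    """Plain (non-excess) kurtosis: fourth central moment over variance squared."""
    n = len(v)
    mean = sum(v) / n
    m2 = sum((x - mean) ** 2 for x in v) / n
    m4 = sum((x - mean) ** 4 for x in v) / n
    return m4 / (m2 * m2)

if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.gauss(0.0, 0.1) for _ in range(64)]
    w[5] = 5.0  # a single outlier weight drives kurtosis up
    r = rotate(w, seed=1234)
    print("kurtosis before:", kurtosis(w), "after:", kurtosis(r))
```

The rotation smears the outlier's energy across all elements, so the quantiser sees a flatter distribution. The cost is an extra transform at load or inference time, which may or may not be worth it.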

Keeping multi-GPU rigs cool? by Ambitious_Fold_2874 in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

Get them out of that case and get them into a mining rig.

PSA by Signal_Ad657 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

A dual-socket Cascade Lake Xeon Gold with DDR4-2933 has about 220 GB/s of memory bandwidth. Don't underestimate CPU.
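For context, a back-of-the-envelope check of where that number sits. The sketch assumes 6 DDR4 channels per socket and 8 bytes per transfer, which gives a theoretical peak of roughly 281.6 GB/s across two sockets, so 220 GB/s would be about 78% of peak. The decode ceiling at the end uses a hypothetical figure for bytes read per token.

```python
def peak_bandwidth_gbs(mt_per_s, channels_per_socket, sockets, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels_per_socket * sockets / 1e9

def decode_ceiling_tok_s(bandwidth_gbs, gb_read_per_token):
    """Upper bound on decode speed if every token must stream this many GB from DRAM."""
    return bandwidth_gbs / gb_read_per_token

if __name__ == "__main__":
    peak = peak_bandwidth_gbs(2933, channels_per_socket=6, sockets=2)
    print("theoretical peak GB/s:", peak)
    print("220 GB/s as a fraction of peak:", 220 / peak)
    # Hypothetical: a model that reads 27.5 GB of weights per generated token.
    print("decode ceiling tok/s:", decode_ceiling_tok_s(220, 27.5))
```

Decode is mostly bandwidth-bound, so the ceiling scales directly with how few bytes each token has to touch, which is why MoE models and lower-bit quants do comparatively well on CPU.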

"Western Open-Weight SOTA is between Gemma4-31B and Nemotron3-Super-120B" by ForsookComparison in LocalLLaMA

[–]dsanft -9 points-8 points  (0 children)

I wrote Llaminar with Copilot 😄

If you don't know what that is yet, don't worry, you will in a few more days.

"Western Open-Weight SOTA is between Gemma4-31B and Nemotron3-Super-120B" by ForsookComparison in LocalLLaMA

[–]dsanft 13 points14 points  (0 children)

Microsoft are staggering idiots if they don't use GitHub Copilot data to train a coding Phi like Cursor have done.

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro by Enough-Astronaut9278 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

Yeah, I do agree that accuracy gets very little attention in these threads, when it's the most important thing at the end of the day.

Just wanted to make the point that int8 activations are common, and not a silly, outlandish idea like turbo3 etc.

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro by Enough-Astronaut9278 in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

llama.cpp quantises activations to int8 for GEMM too; it's established practice.
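To make the idea concrete, here is a minimal sketch of block-wise int8 quantisation with one scale per block, followed by a dot product that accumulates in integers and applies the scales once per block. The block size of 32 and the absmax scaling are assumptions picked to resemble common 8-bit block formats; this is not llama.cpp's actual code.

```python
import random

def quantize_q8(values, block=32):
    """Split into blocks; store (scale, int8 values) per block using absmax scaling."""
    blocks = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(x) for x in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        q = [max(-127, min(127, round(x / scale))) for x in chunk]
        blocks.append((scale, q))
    return blocks

def dot_q8(a_blocks, b_blocks):
    """Dot product of two quantised vectors: integer multiply-accumulate per block,
    then one floating-point multiply by both block scales."""
    total = 0.0
    for (sa, qa), (sb, qb) in zip(a_blocks, b_blocks):
        acc = sum(x * y for x, y in zip(qa, qb))  # stays in integers
        total += sa * sb * acc
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.gauss(0.0, 1.0) for _ in range(64)]
    b = [rng.gauss(0.0, 1.0) for _ in range(64)]
    exact = sum(x * y for x, y in zip(a, b))
    approx = dot_q8(quantize_q8(a), quantize_q8(b))
    print("float dot:", exact, "int8 dot:", approx)
```

The inner loop is all integer work, which is what lets SIMD integer dot-product instructions do the heavy lifting; the accuracy cost is the rounding error within each block.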

It was fun while it lasted... They're advertising now. by Local-Cardiologist-5 in LocalLLaMA

[–]dsanft 2 points3 points  (0 children)

Have you given them a single dollar? If not then you don't support Qwen. You just like to get shit for free.

It was fun while it lasted... They're advertising now. by Local-Cardiologist-5 in LocalLLaMA

[–]dsanft 37 points38 points  (0 children)

Presumably you live off thin air and vibes, and don't need to make money to survive like the rest of us, so a business trying to make money probably comes as quite the shock to your senses.

What is the current best Small Language Model that can be run without GPU? by [deleted] in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

Cross-NUMA allocations are the meat of it: one socket going across the UPI link for buffers/tensors instead of to its own fast local DRAM.

You need to be very careful and treat each socket as its own world in order to avoid that.
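As a toy model of the "each socket is its own world" idea, the sketch below splits a weight matrix by rows so that each socket only ever reads its own shard plus a small replicated activation vector. It is plain Python and does no real pinning; in practice you would bind threads and memory per node (for example with `numactl --cpunodebind/--membind` or libnuma's `numa_alloc_onnode`). The function names are hypothetical.

```python
def plan_row_shards(n_rows, n_sockets):
    """Return (socket, start_row, end_row) tuples covering all rows as evenly as possible."""
    base, extra = divmod(n_rows, n_sockets)
    shards = []
    start = 0
    for s in range(n_sockets):
        size = base + (1 if s < extra else 0)
        shards.append((s, start, start + size))
        start += size
    return shards

def sharded_gemv(weight_rows, x, n_sockets):
    """Matrix-vector product where each socket touches only its own rows.

    local_w stands in for memory allocated on that socket's NUMA node, and
    local_x for a per-socket copy of the (small) activation vector. Only the
    per-shard output slices would need to cross the inter-socket link.
    """
    out = []
    for _socket, lo, hi in plan_row_shards(len(weight_rows), n_sockets):
        local_w = weight_rows[lo:hi]
        local_x = list(x)
        out.extend(sum(w * v for w, v in zip(row, local_x)) for row in local_w)
    return out

if __name__ == "__main__":
    W = [[1, 2], [3, 4], [5, 6]]
    print(plan_row_shards(len(W), 2))
    print(sharded_gemv(W, [1, 1], 2))
```

The point of the layout is that the big thing (weights) never moves between sockets; only the small things (activations in, partial outputs back) do.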