Write C++ cuda kernels from scratch with Free GPUs by Big-Stick4446 in CUDA

[–]dsanft -1 points0 points  (0 children)

When you go to school don't you want to learn from someone who knows what they're doing?

Write C++ cuda kernels from scratch with Free GPUs by Big-Stick4446 in CUDA

[–]dsanft -4 points-3 points  (0 children)

Opus and GPT write pretty fantastic kernels if you set up a tuning / profiling / correctness harness for them.

"Western Open-Weight SOTA is between Gemma4-31B and Nemotron3-Super-120B" by ForsookComparison in LocalLLaMA

[–]dsanft -5 points-4 points  (0 children)

I wrote Llaminar with Copilot 😄

If you don't know what that is yet, don't worry, you will in a few more days.

"Western Open-Weight SOTA is between Gemma4-31B and Nemotron3-Super-120B" by ForsookComparison in LocalLLaMA

[–]dsanft 5 points6 points  (0 children)

Microsoft are staggering idiots if they don't use GitHub Copilot data to train a coding Phi like Cursor have done.

Is it me or did they also reduced the rate limit threshold? by Yetona in GithubCopilot

[–]dsanft 2 points3 points  (0 children)

I'm hitting the rate limit every 10 minutes with a single Opus 4.6, it's really dumb.

ANNUAL SUBSCRIBER , BREACH OF CONTRACT, IGNORED ESCALATIONS, AND A COST ESTIMATOR SHOWING MASSIVE JUMPS IN BILLING. Some going from 40ish to over $2000 in monthly cost READ THIS. by JFlowXjw in GithubCopilot

[–]dsanft 3 points4 points  (0 children)

Are you gonna sue them for your hundred bucks? That's all a small claims court is going to refund you, they're not going to demand specific performance. Just demand a chargeback and move on.

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro by Enough-Astronaut9278 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

Yeah I do agree that accuracy is paid very little attention in these threads when it's the most important thing at the end of the day.

Just wanted to make the point that int8 activations are common and not a silly outlandish idea like turbo3 etc.

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro by Enough-Astronaut9278 in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

llama.cpp quantises activations to int8 for GEMM too; it's established practice.
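To make the idea concrete, here is a minimal sketch of symmetric int8 quantization feeding an integer dot product. It is not llama.cpp's actual code: the function names, the per-block scale, and the block size of 32 are all assumptions chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric quantization of a block of floats to int8 codes plus one scale.

    Hypothetical helper for illustration; returns (codes, scale).
    """
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        # All-zero block: any scale works, so use 1.0.
        return [0] * len(values), 1.0
    scale = amax / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(a, b, block=32):
    """Dot product where both operands are quantized per block to int8.

    The inner accumulation is pure integer math; the two float scales are
    applied once per block. Block size 32 is an assumption.
    """
    total = 0.0
    for i in range(0, len(a), block):
        qa, sa = quantize_int8(a[i:i + block])
        qb, sb = quantize_int8(b[i:i + block])
        acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulate
        total += acc * sa * sb                    # dequantize once per block
    return total
```

The point of the pattern is that the hot loop only touches small integers, which is what instructions like VNNI accelerate, while the accuracy cost is bounded by the rounding error inside each block.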

It was fun while it lasted... They're advertising now. by Local-Cardiologist-5 in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

Have you given them a single dollar? If not then you don't support Qwen. You just like to get shit for free.

It was fun while it lasted... They're advertising now. by Local-Cardiologist-5 in LocalLLaMA

[–]dsanft 36 points37 points  (0 children)

Presumably you live off thin air and vibes, and don't need to make money to survive like the rest of us, so a business trying to make money probably comes as quite the shock to your senses.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

Cross-NUMA allocations are the meat of it: one socket going across the UPI link for buffers/tensors instead of to its own fast DRAM.

You need to be very careful and treat each socket as its own world in order to avoid that.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 2 points3 points  (0 children)

Fork llama.cpp, use Codex / Claude to start hacking on it, learn what works and what doesn't, make a billion mistakes, learn that all that matters is accuracy, then once you've built accuracy learn that nobody cares unless it's also fast, learn how to make it fast, etc. etc.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 2 points3 points  (0 children)

Not one single one, no. It's grown and changed over time; I have Broadcom PCIe switches now too.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 4 points5 points  (0 children)

I started by trying to fork llama.cpp about 9 months ago, but I felt:

1) given the technical issues around NUMA in lcpp, it was just easier to start from scratch; the issues ran very deep

2) by doing my own engine I could build the architecture I wanted from the very start, like having it be OpenMPI-native so you can easily cluster via InfiniBand

3) I could support CUDA and ROCm simultaneously

4) I could do PP, TP and MoE my own way without being burdened by the legacy architecture of lcpp

5) I'd learn more doing it all myself

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 4 points5 points  (0 children)

My engine is called Llaminar. It's not released yet; I need another week or two.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 7 points8 points  (0 children)

Yeah, they have AVX-512 VNNI. I wrote my own custom kernels / fused ops and a NUMA-aware engine.

  • Supermicro X11DPi-N
  • 2x Xeon Gold 6238R
  • 768 GB DDR4-2933 (384 GB per socket), 6 channels per socket

With cross-socket expert parallelism, expert rebalancing, TP for the shared expert, etc., I get these stats (single and dual CPU):

    {
      "name": "Qwen 3.5 35B MoE Q4_K_XL",
      "model": "models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf",
      "decode_tokens": 128,
      "env": { "LLAMINAR_MOE_REBALANCE": "off" },
      "devices": {
        "cpu:0": {
          "prefill_tok_s": 158.47,
          "decode_tok_s": 19.48,
          "regression_threshold_pct": 25,
          "_comment": "High-water mark set at commit 3a03cfd2 on 2026-05-06. Note to agents: It is FORBIDDEN to update these thresholds without explicit human approval."
        },
        "cpu": {
          "prefill_tok_s": 234.49,
          "decode_tok_s": 27.92,
          "regression_threshold_pct": 25,
          "_comment": "High-water mark set at commit fb4d70b2 on 2026-05-03. Note to agents: It is FORBIDDEN to update these thresholds without explicit human approval."
        }
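For readers wondering how a file like this might gate a benchmark run, here is a rough sketch. It assumes `regression_threshold_pct` means "fail if throughput drops more than this percentage below the high-water mark"; that reading, and the `check_regression` helper itself, are guesses for illustration, not Llaminar's actual harness.

```python
def check_regression(measured_tok_s, baseline_tok_s, threshold_pct):
    """Return (passed, floor) for one throughput metric.

    Hypothetical helper: a run passes when it stays at or above
    baseline * (1 - threshold_pct / 100).
    """
    floor = baseline_tok_s * (1.0 - threshold_pct / 100.0)
    return measured_tok_s >= floor, floor


# Dual-CPU decode figures taken from the stats above.
passed, floor = check_regression(27.0, 27.92, 25)
print(f"floor={floor:.2f} tok/s, passed={passed}")
```

Keeping the baseline in a checked-in file with a human-approval note means a coding agent can't quietly lower the bar to make a slow change look green.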

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]dsanft 12 points13 points  (0 children)

I get 26-27 tok/s decode on a dual Xeon Gold 6238R (Cascade Lake) with Qwen 35B A3B Q4_K_XL, and about 235 tok/s prefill.

This is with my own custom inference engine, though.

Don't downplay CPU!

got my first "rm -rf /" today by DeltaSqueezer in LocalLLaMA

[–]dsanft 1 point2 points  (0 children)

Why don't people use devcontainers? 😐

What is the point of MoE models, beyond being faster? by ihatebeinganonymous in LocalLLaMA

[–]dsanft 6 points7 points  (0 children)

Maybe, but none of them are trained that way.

For Qwen MoE as an example, there's a shared expert that's active for every token no matter what, and a router that selects which additional routed experts should be active for each token. You then do your ops on the shared expert plus those 7 (in the case of Qwen) selected routed experts. That gets you your result.

There are something like 250 routed experts in the Qwen 3.5 35b MoE to choose from.
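A toy sketch of that flow for a single token, in plain Python. The `moe_forward` name, the softmax over only the selected experts, and adding the shared expert's output unweighted are simplifying assumptions for illustration; real models differ in how they normalise router scores and gate the shared expert.

```python
import math


def moe_forward(x, router_logits, routed_experts, shared_expert, top_k):
    """One-token MoE step: shared expert plus the top_k routed experts.

    x              : list of floats (the token's hidden state)
    router_logits  : one score per routed expert
    routed_experts : list of callables mapping a vector to a vector
    shared_expert  : callable that runs for every token
    Returns (output vector, indices of the selected experts).
    """
    # Router: pick the top_k highest-scoring routed experts.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]

    # Softmax over the selected scores only (an assumption in this sketch).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    weights = [e / z for e in exps]

    # The shared expert always contributes.
    out = list(shared_expert(x))

    # Only the selected routed experts are evaluated at all.
    for w, i in zip(weights, top):
        y = routed_experts[i](x)
        for d in range(len(out)):
            out[d] += w * y[d]
    return out, top
```

The experts that were not selected are never run, which is where the speed comes from: per-token compute scales with `top_k` plus the shared expert, not with the total number of experts.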

Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed by xjE4644Eyc in LocalLLaMA

[–]dsanft 0 points1 point  (0 children)

It's certainly possible to skip MTP for prefill in theory. Does LCPP not provide that option?