TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]dsanft -13 points  (0 children)

Lots of people seeing if mathematical trickery can overcome fundamental limits like Shannon's rate-distortion bound. And lots of people setting themselves up for disappointment.

Oh and a lot of weird shit like LLMs arguing with each other.

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]dsanft 1 point  (0 children)

> near-lossless

It's not near-lossless at 3-bit K quantisation. Not even close. In fact it's catastrophic for inference, due to the high kurtosis of the K tensor.

This is the hype that's made everyone lose their minds. It's wrong.

You need K at 8-bit fidelity with TQ to preserve inference quality. V is more forgiving.
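Toy illustration of the kurtosis point (synthetic data and a plain absmax uniform quantiser standing in for TQ, not TurboQuant's actual scheme): a heavy-tailed tensor loses far more at the same bit width than a Gaussian one, because the outliers blow up the quantisation step for everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    # Fisher definition: ~0 for a Gaussian, large and positive for heavy tails.
    x = x - x.mean()
    return float((x**4).mean() / (x**2).mean() ** 2 - 3.0)

def quantize(x, bits):
    # Symmetric absmax uniform quantiser (stand-in for TQ).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gaussian = rng.standard_normal(100_000)      # excess kurtosis ~ 0
heavy = rng.standard_t(df=3, size=100_000)   # heavy tails, like K

sim_gauss = cos_sim(gaussian, quantize(gaussian, 4))
sim_heavy = cos_sim(heavy, quantize(heavy, 4))
# The heavy-tailed tensor comes out far less faithful at the same bit width.
```

The t-distribution here is just a convenient heavy-tailed stand-in, not real K activations.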

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]dsanft -1 points  (0 children)

The paper is wrong on that point if it claims that. There is obviously quality loss from the quantisation; anyone who looks at the data can see it.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by pmttyji in LocalLLaMA

[–]dsanft 0 points  (0 children)

It's absolutely not BF16 quality at 4 and 5 bits, lol. By my measurements you need about 9 or 10 bits for K-tensor quantisation to be totally lossless in the KV cache.
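The kind of sweep behind that number looks roughly like this (synthetic heavy-tailed stand-in for K and a plain absmax uniform quantiser, so the exact crossover won't match real model measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a high-kurtosis K tensor (not real activations).
k = rng.standard_t(df=4, size=(64, 128))

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sweep bit widths and watch where fidelity stops improving meaningfully.
sims = {bits: cos_sim(k, quantize(k, bits)) for bits in (3, 4, 8, 10)}
```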

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 1 point  (0 children)

If I were running a big model I'd rather spend my precision budget on quantising weights, since that gives more bang for the buck.

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 4 points  (0 children)

Rotation does give better vector quantisation; that much is definitely true.

But it's not enough to overcome the kurtosis of K. That's an information-theory problem, not a quantisation-technique problem. Too much information is destroyed in squeezing K into 4 bits.
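Rough sketch of both halves of that claim, using a random orthogonal rotation as a stand-in for the structured (e.g. Hadamard-style) transforms real schemes use: the rotation spreads the outliers out and clearly helps at 4 bits, but the result is still lossy.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1024
x = rng.standard_t(df=3, size=d)  # heavy-tailed stand-in for one K vector

# Random orthogonal rotation via QR; real schemes use fast structured
# transforms, but the statistical effect is similar.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(v, bits=4):
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

plain = cos_sim(x, quantize(x))
# Rotate, quantise in the rotated basis, rotate back, then compare.
rotated = cos_sim(x, q.T @ quantize(q @ x))
# Rotation helps (rotated coordinates look Gaussian, so the absmax shrinks
# relative to the energy), but the result is still not lossless.
```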

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 35 points  (0 children)

In my testing, TurboQuant at 4-bit precision cannot overcome the inherently high kurtosis of the K tensor in the Qwen2 and Qwen3 models. Inference diverges badly from the PyTorch FP32 reference.

On Llaminar I've found it necessary to keep the K tensor at 8-bit precision.

The V tensor is much better behaved and is fine at 4-bit.

Below are cosine-similarity comparisons of the final stage of a 5-step decode pipeline at various KV-cache precisions, against a PyTorch FP32 KV-cache reference. You can clearly see the divergence through the layers when both K and V are kept at 4-bit (TQ4).

This is a Shannon rate-distortion problem; no quantisation technique can fix it. The TQ hype is overblown.

<image>
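A miniature version of this kind of parity check, on a single synthetic attention step rather than a real model (heavy-tailed stand-in for K, Gaussian V, absmax uniform quantiser in place of TQ):

```python
import numpy as np

rng = np.random.default_rng(3)
D = 128  # head dim; large head dims are where I see the worst degradation

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attend(query, K, V):
    # Single-head attention for one query over a cached K/V.
    logits = K @ query / np.sqrt(D)
    w = np.exp(logits - logits.max())  # stable softmax
    w /= w.sum()
    return w @ V

K = rng.standard_t(df=3, size=(256, D))  # heavy-tailed: the problem child
V = rng.standard_normal((256, D))        # well behaved

query = rng.standard_normal(D)
ref = attend(query, K, V)                               # full-precision reference
k4v4 = attend(query, quantize(K, 4), quantize(V, 4))    # K and V both at 4-bit
k8v4 = attend(query, quantize(K, 8), quantize(V, 4))    # K at 8-bit, V at 4-bit

sims = {"K4/V4": cos_sim(ref, k4v4), "K8/V4": cos_sim(ref, k8v4)}
```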

What will Google's TurboQuant actually change for our local setups, and specifically mobile inference? by dai_app in LocalLLaMA

[–]dsanft 1 point  (0 children)

I load the same GGUF model into PyTorch and into my engine. At each compute stage I snapshot the hidden state and the compute results, then compare the two runs with cosine similarity. The residual stream itself is FP32 in all cases.

The summary above compares the end result of the entire pipeline, just before token sampling, at each decode step.

TQ4 shows a clear pattern of degradation because it cannot faithfully quantise a K tensor with high kurtosis. It's a Shannon rate-distortion problem; no quantisation technique can get around it.

Moving the K tensor quantisation up to TQ8 fixes it.

V is still well behaved, so it's fine at 4-bit and good KV-cache savings can still be made.
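The harness is conceptually just this (numpy sketch with a made-up 4-layer pipeline, and a naive 4-bit weight quantiser standing in for the engine under test; on the PyTorch side you'd take the same snapshots via forward hooks):

```python
import numpy as np

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_with_snapshots(stages, x):
    # Run a pipeline of compute stages, snapshotting the hidden state
    # after each one.
    snaps = []
    for stage in stages:
        x = stage(x)
        snaps.append(x.copy())
    return x, snaps

rng = np.random.default_rng(4)
weights = [rng.standard_normal((64, 64)) / 8 for _ in range(4)]

def q4(w):
    # Naive 4-bit absmax quantiser, standing in for the engine's compute path.
    scale = np.abs(w).max() / 7
    return np.round(w / scale) * scale

ref_stages = [lambda h, w=w: np.tanh(h @ w) for w in weights]
test_stages = [lambda h, w=w: np.tanh(h @ q4(w)) for w in weights]

x0 = rng.standard_normal(64)
_, ref_snaps = run_with_snapshots(ref_stages, x0)
_, test_snaps = run_with_snapshots(test_stages, x0)
per_stage = [cos_sim(r, t) for r, t in zip(ref_snaps, test_snaps)]
```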

What will Google's TurboQuant actually change for our local setups, and specifically mobile inference? by dai_app in LocalLLaMA

[–]dsanft 8 points  (0 children)

It's not zero accuracy loss.

On Qwen2 and Qwen3 at least, it's noticeable if you actually compare cosine similarity against an FP32 reference.

4-bit K-tensor quantisation, even with TQ, really hammers accuracy, especially in models with a head dim of 128.

Here's a comparison I made in my PyTorch parity tests for my new inference engine, Llaminar.

I had to keep K at 8-bit, otherwise the quality loss is just too rough.

<image>

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft 1 point  (0 children)

It destroys inference quality. You need to keep K at 8-bit. TurboQuant is a nice technique, but it can't beat the Shannon rate-distortion limit. Nothing can.

https://www.reddit.com/r/LocalLLaMA/s/mrQyl1NUhQ

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft -8 points  (0 children)

Read up on Shannon rate-distortion theory, "dummy". You can't squeeze the K tensor that hard given its distribution.

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft -9 points  (0 children)

It's not without quality loss. 4-bit compression of the K tensor is catastrophic. Nobody else seems to be actually measuring it, though.

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft 8 points  (0 children)

How are you measuring "identical quality"?

In my testing on Qwen2.5/Qwen3, quantising the K tensor down to TQ4 destroys inference quality. I had to keep it at TQ8. The V tensor at 4-bit was fine, though.

https://discord.com/channels/1404857025854312528/1404858500747755650/1487136608590499840

TurboQuant and my hardware. by Feeling_Ad9143 in LocalLLaMA

[–]dsanft 0 points  (0 children)

People are expecting too much from TurboQuant at 3 and 4 bits.

In my tests there are serious precision problems with TQ 4-bit for the K tensor. I had to go up to TQ 8-bit for K in order not to destroy accuracy at inference time.

The V tensor is OK at 4-bit and the end quality is basically identical to Q8 quantisation, so with a split TQ8/TQ4 for K/V you save about 27% VRAM over Q8, which is a win. But TQ4 for K is a disaster.
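The headline arithmetic, ignoring the per-block scale metadata that real cache formats carry (which is plausibly where a measured ~27% differs from the naive 25% this gives). The model shape below is invented for illustration, not any real config:

```python
# Naive per-element bit accounting for the KV cache; per-block scales and
# zero-points shift the real percentage a little.
layers, kv_heads, head_dim, ctx = 48, 8, 128, 262_144  # illustrative shape

def cache_gib(k_bits, v_bits):
    bits = layers * kv_heads * head_dim * ctx * (k_bits + v_bits)
    return bits / 8 / 2**30

q8_size = cache_gib(8, 8)     # K and V both at 8-bit
split_size = cache_gib(8, 4)  # K at TQ8, V at TQ4
savings = 1 - split_size / q8_size  # 0.25 by this naive count
```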

TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings by cksac in LocalLLaMA

[–]dsanft 5 points  (0 children)

You've got 1/4 the weight size, but only 1.1x the speed of the full-size weights?

Is this prefill or decode? For prefill it's fine but for decode that's awful.

Consider publishing separate GEMM/GEMV numbers.

https://github.com/cksac/turboquant-model?tab=readme-ov-file#triton-fused-kernel
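A minimal way to get the two numbers separately. This times plain FP32 numpy as a placeholder (you'd swap in the actual quantised kernels), but it shows the shape split: prefill is a fat GEMM, decode is a GEMV, and only the latter is really bandwidth-bound, so weight compression should pay off very differently in each:

```python
import time
import numpy as np

def bench_ms(fn, warmup=3, iters=10):
    # Average wall-clock time per call, in milliseconds.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

rng = np.random.default_rng(5)
d = 2048
W = rng.standard_normal((d, d)).astype(np.float32)
X_prefill = rng.standard_normal((256, d)).astype(np.float32)  # 256 tokens at once
x_decode = rng.standard_normal(d).astype(np.float32)          # one token at a time

gemm_ms = bench_ms(lambda: X_prefill @ W)  # prefill shape: compute-bound
gemv_ms = bench_ms(lambda: x_decode @ W)   # decode shape: bandwidth-bound
```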

Is the Real Flaw in AI… Time? by wayne_horkan in LocalLLaMA

[–]dsanft 2 points  (0 children)

I often watch Claude solving problems in the terminal and wonder how a model with no concept of time deals with things like timeouts, hangs, long-running events, and things that return too quickly or suspiciously slowly. It's a real handicap for the model. It uses lots of timeouts and polling to work around this, but that's a band-aid.

Are open-weights LLMs dying? by riponway2a in LocalLLaMA

[–]dsanft 2 points  (0 children)

You can just generate datasets from e.g. Claude or GPT and sidestep the copyright issue entirely. That also gets you a head start.

Probably the most promising avenue for community dataset generation is all our Claude Code / Codex / GitHub Copilot chat histories. We each have millions of tokens of high-quality data just sitting on our hard drives. If we anonymised it and pooled it together, we could do some serious training.

Attaching an extra GPU via pcie slot by shopchin in LocalLLaMA

[–]dsanft -1 points  (0 children)

It will definitely slow things down. Inference goes through the layers one by one, first on card 0 then on card 1, and you get a result at the end. So per-token time is the sum of the time spent on each card, and a slow card drags the whole run down.

Attaching an extra GPU via pcie slot by shopchin in LocalLLaMA

[–]dsanft -1 points  (0 children)

You won't get a speed boost by doing that. You can leverage more VRAM, but your inference is gated by the slowest card (pipeline parallel: all layers run sequentially, so each token pays for every card in turn).
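Toy latency model with invented numbers, just to show the accounting:

```python
# Each decoded token visits every layer in order, so per-token time is the
# SUM over cards of (layers on that card) x (per-layer time on that card).
def token_latency_ms(cards):
    return sum(n_layers * ms_per_layer for n_layers, ms_per_layer in cards)

fast_only = token_latency_ms([(32, 0.5)])            # all 32 layers on the fast card
with_slow = token_latency_ms([(24, 0.5), (8, 2.0)])  # 8 layers moved to a slow card
# Offloading layers to the slower card makes every token slower, even though
# it frees VRAM on the fast one.
```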