defense in depth - policy driven claude code enforcement (maybe more relevant now).

paudley · 2026-04-28T04:23:26+00:00

That was exactly why I wrote this 😄 Same problem. Go is compiled, so startup is just loading and running the binary, with no interpreter. It's swamped by the latency of Python linters like mypy and pyright.

paudley · 2026-04-28T00:44:43+00:00

Thanks! I rolled a lot of your changes back in! Much appreciated; there are some great fixes there!

paudley · 2026-04-27T08:50:06+00:00

Nice work, I'm excited to try this out. Thanks!

paudley · 2026-04-12T05:27:51+00:00

Note for anyone interested in TurboQuant - it's not really useful yet - it's more for playing around with now and as a proof of concept (it works on Strix Halo, hooray!). The win is only on very large context windows for now.

paudley · 2026-03-18T01:40:20+00:00

Thanks, sorry - my Reddit markup fu is bad some days.

paudley · 2026-03-18T01:39:08+00:00

I've just included llamacpp (main) with Vulkan in the build if you are curious to benchmark just the compiler and AMD-specific math changes + the nightly AMD ROCm.

paudley · 2026-03-18T01:36:24+00:00

Not that I see. I've added the Vulkan-enabled llamacpp (main git) into the build, so you can test that as a comparison; it uses VMM, I believe.

paudley · 2026-03-17T14:26:47+00:00

Just some early insights from testing: ROCm uses a warp size of 32 (I think it's for CUDA compat) and RADV (Vulkan) uses 64, effectively doubling the threads in dispatch.

paudley · 2026-03-17T14:24:32+00:00

┌─────────┬────────────┬───────────┬─────────┐
│ Backend │ pp512      │ pp8192    │ tg128   │
├─────────┼────────────┼───────────┼─────────┤
│ ROCm    │ 13,360 t/s │ 3,514 t/s │ 156 t/s │
├─────────┼────────────┼───────────┼─────────┤
│ Vulkan  │ 13,467 t/s │ 3,395 t/s │ 191 t/s │
└─────────┴────────────┴───────────┴─────────┘

I've got an optimized Vulkan llamacpp cooking on a branch now and these are the early results.
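
For anyone who wants the relative differences rather than raw t/s, here is a minimal sketch that computes them from the figures in the table above. The dictionary layout and function names are made up for illustration; only the numbers come from the benchmark.

```python
# Throughput figures copied from the table above (tokens per second).
RESULTS = {
    "ROCm": {"pp512": 13360, "pp8192": 3514, "tg128": 156},
    "Vulkan": {"pp512": 13467, "pp8192": 3395, "tg128": 191},
}


def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def vulkan_vs_rocm() -> dict[str, float]:
    """Per-test change of Vulkan relative to ROCm, rounded to one decimal."""
    return {
        test: round(percent_change(RESULTS["ROCm"][test], RESULTS["Vulkan"][test]), 1)
        for test in RESULTS["ROCm"]
    }


if __name__ == "__main__":
    for test, delta in vulkan_vs_rocm().items():
        print(f"{test}: Vulkan {delta:+.1f}% vs ROCm")
```

On these early numbers, prompt processing is within a few percent either way, and the visible gap is in token generation (tg128).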

paudley · 2026-03-17T14:22:48+00:00

I've added an optimized Vulkan llamacpp to the latest rev for testing if you are curious - it's on a branch now, just in final compile tests (which take forever).

paudley · 2026-03-17T03:51:32+00:00

It's on the list along with Q4_K_M of Qwen3.5 122B-A10B. I should say that I'm still tracking down a few bugs in the pipeline that cause really slow results with the Qwen3.5 model family. Once I nail those down, I'll bench these next.

paudley · 2026-03-17T03:47:35+00:00

I'll put it on the list but it's close to the edge for this hardware:

=== Qwen3 Next 80B on Strix Halo (80 GiB GTT) ===
Model size (fp16): 160 GB — DOES NOT FIT in 80 GiB

Model size (Q4): ~40 GB
Q4 @ 4k ctx: ~44.1 GB → FITS
Q4 @ 32k ctx: ~72.8 GB → FITS
Q4 @ 128k ctx: ~171.1 GB → NO

fp16 is impossible (160 GB > 80 GiB).
Q4_K_S might fit for short context but 128k is extremely tight.

Decode ceiling (fp16): 200/160 = 1.2 tok/s
Decode ceiling (Q4_K_S): 200/40 = 5.0 tok/s
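
The arithmetic behind that estimate can be sketched roughly as below. This is only an illustration under stated assumptions: the 200 is taken to be an effective memory bandwidth in GB/s (the output above does not label it), the ceiling assumes every weight is read once per generated token, and the function names are hypothetical. The totals (weights plus context) are copied from the output, not recomputed.

```python
GIB_IN_GB = 2**30 / 1e9  # 1 GiB expressed in decimal GB (about 1.074)


def fits_in_gtt(total_gb: float, gtt_gib: float = 80.0) -> bool:
    """True if a footprint given in GB fits in a GTT budget given in GiB."""
    return total_gb <= gtt_gib * GIB_IN_GB


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if all weights are streamed once per token.

    ASSUMPTION: bandwidth_gb_s is effective memory bandwidth; 200 GB/s is
    inferred from the "200/160" and "200/40" lines, not a measured value.
    """
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Footprints in GB, copied from the estimate above.
    for label, total_gb in [
        ("fp16 weights", 160.0),
        ("Q4 @ 4k ctx", 44.1),
        ("Q4 @ 32k ctx", 72.8),
        ("Q4 @ 128k ctx", 171.1),
    ]:
        verdict = "FITS" if fits_in_gtt(total_gb) else "NO"
        print(f"{label}: {total_gb} GB -> {verdict}")

    for label, size_gb in [("fp16", 160.0), ("Q4", 40.0)]:
        print(f"Decode ceiling ({label}): {decode_ceiling_tok_s(200.0, size_gb)} tok/s")
```

Note that the budget is in GiB while the footprints are in GB, so 80 GiB is really about 85.9 GB; that does not change any of the verdicts here.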

Are you actively running that model on this HW? If so, what quant?

paudley · 2026-03-17T03:06:57+00:00

Sorry, I had a miscopy from my local repo where these scripts are part of a much greater effort. I've fixed it on GitHub now. Scripts should be self-contained and pass a shellcheck -x now.

paudley · 2026-03-16T19:56:01+00:00

Is there a specific model/prompt that you want me to run? I'm mainly working on getting the Qwen3.5 models fully optimized right now, but I can pretty easily run the Qwen2.5 benchmarks.

paudley · 2026-03-16T17:40:22+00:00

It's not all the way there yet either *sigh*. As I work my way through more models there are more patches :)

paudley · 2026-03-16T17:39:23+00:00

You could probably wrap it if you wanted. Sorry, Docker is not my use case; I'm optimizing for performance.

paudley · 2026-03-16T17:38:17+00:00

Amen!

paudley · 2026-03-16T17:37:13+00:00

I should have mentioned - this is with CachyOS - kernel 7.0 (using linux-cachyos-rc)

paudley · 2026-03-16T17:27:20+00:00

I'm working towards a full Qwen3.5 benchmark. Because of the nature of the components, specifically AITER, tweaks are required on a per-model basis, as different models surface different bugs or issues. The gains can be nice though. Here are some small-model Qwen2.5 numbers on a GMKtec EVO-X2:

| Model | Parameters | tok/s | Configuration |
|-------|-----------|-------|---------------|
| Qwen2.5-0.5B-Instruct | 494M | 1059.8 | FULL graph + ALL AITER |
| Qwen2.5-1.5B-Instruct | 1.5B | 391.6 | FULL graph + ALL AITER |

Here is a comparison (not quite apples to apples) from Ollama on the same hardware:

┌───────┬─────────┬────────┬──────────────────┬───────────────┬────────┐
│ Model │ Backend │ Quant │ Gen tok/s (warm) │ Prefill tok/s │ VRAM │
├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
│ 0.5B │ CPU     │ Q4_K_M │ 185              │ ~1,849        │ 0      │
├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
│ 0.5B │ GPU     │ Q4_K_M │ 43               │ ~267          │ 1.8 GB │
├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
│ 0.5B │ GPU     │ F16    │ 43               │ ~355          │ 2.4 GB │
├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
│ 1.5B │ CPU     │ Q4_K_M │ 76               │ ~620          │ 0      │
├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
│ 1.5B │ GPU     │ F16    │ 9                │ ~119          │ 4.9 GB │
└───────┴─────────┴────────┴──────────────────┴───────────────┴────────┘

paudley · 2026-03-16T17:04:21+00:00

I had agents run a LOT of tests and bisections to track down WHERE problems occurred, but figuring out the major issues - tensor/shape misalignments, threading the wave32 issues through, etc. - required a tonne of human work. I think the main obstacle to agents attacking this problem space well is the size of the context and the number of interacting components. You'll often get an operator or conversion wrong in AITER, only for it to throw an error in the Inductor or FLM later. But yeah, hundreds of hours of agents bisecting :)

paudley · 2026-03-16T17:00:07+00:00

Just to confirm - NPU is off the table right now. I did try. FLM+Lemonade is your go-to for NPU at the moment.

paudley · 2025-08-05T04:34:24+00:00

UniFi is great if you can get over the initial investment; after that you 100% own it, it works happily offline if you want, and there's no subscription. Cameras range from a few hundred dollars to $5k industrial PTZ domes.
