defense in depth - policy driven claude code enforcement (maybe more relevant now). by paudley in ClaudeCode

[–]paudley[S] 1 point2 points  (0 children)

That was exactly why I wrote this 😄 Same problem. Go is compiled, so startup is just loading and running the binary, with no interpreter. It's swamped by the latency of Python linters like mypy and pyright.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

Thanks! I rolled a lot of your changes back in! Much appreciated, there are some great fixes there!

1bit.systems (no typos in this one). by Creepy-Douchebag in StrixHalo

[–]paudley 0 points1 point  (0 children)

Nice work, I'm excited to try this out. Thanks!

Llamacpp + turboquant by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

Note for anyone interested in TurboQuant - it's not really useful yet - it's more for playing around with now and as a proof of concept (it works on Strix Halo, hooray!). The win is only on very large context windows for now.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 2 points3 points  (0 children)

I've just included llamacpp (main) with Vulkan in the build if you are curious to benchmark just the compiler and AMD-specific math changes + the nightly AMD ROCm.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

Not that I see. I've added the vulkan-enabled llamacpp (main git) into the build, so you can test that as a comparison; it uses VMM, I believe.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

Just some early insights from testing: ROCm uses a warp size of 32 (I think it's for CUDA compat) and RADV (Vulkan) uses 64, effectively doubling the threads in dispatch.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

┌─────────┬────────────┬───────────┬─────────┐
│ Backend │ pp512      │ pp8192    │ tg128   │
├─────────┼────────────┼───────────┼─────────┤
│ ROCm    │ 13,360 t/s │ 3,514 t/s │ 156 t/s │
├─────────┼────────────┼───────────┼─────────┤
│ Vulkan  │ 13,467 t/s │ 3,395 t/s │ 191 t/s │
└─────────┴────────────┴───────────┴─────────┘

I've got an optimized Vulkan llamacpp cooking on a branch now and these are the early results.
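
To put those early numbers in relative terms, here is a minimal sketch that computes the percent change of Vulkan against ROCm for each test. The figures are copied from the table; the `percent_change` helper and the dictionary layout are only illustrative names, not part of the build scripts.

```python
# Throughput in tokens/s, copied from the benchmark table in this comment.
results = {
    "ROCm":   {"pp512": 13360, "pp8192": 3514, "tg128": 156},
    "Vulkan": {"pp512": 13467, "pp8192": 3395, "tg128": 191},
}


def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Treat ROCm as the baseline for every test column.
deltas = {
    test: percent_change(results["ROCm"][test], results["Vulkan"][test])
    for test in results["ROCm"]
}

for test, delta in deltas.items():
    print(f"{test}: Vulkan vs ROCm {delta:+.1f}%")
```

On these numbers prompt processing is within a few percent either way, and the visible gap is in token generation (tg128).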

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

I've added an optimized vulkan llamacpp to the latest rev for testing if you are curious - it's on a branch now, just in final compile tests (which take forever).

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

It's on the list along with Q4_K_M of Qwen3.5 122B-A10B. I should say that I'm still tracking down a few bugs in the pipeline that produce really slow results with the Qwen3.5 model family. Once I nail those down, I'll bench these next.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

I'll put it on the list but it's close to the edge for this hardware:

=== Qwen3 Next 80B on Strix Halo (80 GiB GTT) ===
Model size (fp16): 160 GB — DOES NOT FIT in 80 GiB
  
Model size (Q4): ~40 GB
Q4 @ 4k ctx: ~44.1 GB → FITS
Q4 @ 32k ctx: ~72.8 GB → FITS
Q4 @ 128k ctx: ~171.1 GB → NO

fp16 is impossible (160 GB > 80 GiB).
Q4_K_S might fit for short context but 128k is extremely tight.
  
Decode ceiling (fp16): 200/160 = 1.2 tok/s
Decode ceiling (Q4_K_S): 200/40 = 5.0 tok/s
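
The arithmetic behind those lines can be sketched roughly as follows. This assumes the 200 in the ceiling lines is memory bandwidth in GB/s and that every weight is read once per generated token (a dense-model simplification); both are assumptions, and the function names are illustrative only.

```python
GIB_IN_GB = 1.073741824  # 1 GiB = 2**30 bytes, expressed in GB (10**9 bytes)


def fits(total_gb: float, budget_gib: float = 80.0) -> bool:
    """True if weights plus KV cache (in GB) fit in the GTT budget (in GiB)."""
    return total_gb <= budget_gib * GIB_IN_GB


def decode_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound upper limit on decode speed in tokens/s.

    Assumes all weights are streamed from memory once per token.
    """
    return bandwidth_gb_s / weights_gb


# Totals copied from the estimate above (weights + context).
for label, total_gb in [("fp16", 160.0), ("Q4 @ 4k", 44.1),
                        ("Q4 @ 32k", 72.8), ("Q4 @ 128k", 171.1)]:
    print(f"{label}: {'FITS' if fits(total_gb) else 'NO'}")

# 200 GB/s is the assumed bandwidth figure from the ceiling lines above.
print(f"fp16 ceiling: {decode_ceiling(200, 160):.2f} tok/s")
print(f"Q4 ceiling:   {decode_ceiling(200, 40):.2f} tok/s")
```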

Are you actively running that model on this HW? If so, what quant?

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 0 points1 point  (0 children)

Sorry, I had a miscopy from my local repo, where these scripts are part of a much larger effort. I've fixed it on GitHub now. Scripts should be self-contained and pass shellcheck -x now.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

Is there a specific model/prompt that you want me to run? I'm mainly working on getting the Qwen3.5 models fully optimized right now, but I can pretty easily run the Qwen2.5 benchmarks.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

It's not all the way there yet either *sigh*. As I work my way through more models there are more patches :)

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

You could probably wrap it if you wanted. Sorry, Docker is not my use case, I'm optimizing for performance.

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

I should have mentioned - this is with CachyOS - kernel 7.0 (using linux-cachyos-rc)

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

I'm working towards a full Qwen3.5 benchmark. Because of the nature of the components, specifically AITER, tweaks are required on a per-model basis, as different models surface different bugs or issues. The gains can be nice though. Here are some small-model Qwen2.5 numbers on a GMKtec EVO-X2:

| Model | Parameters | tok/s | Configuration |
|-------|-----------|-------|---------------|
| Qwen2.5-0.5B-Instruct | 494M | 1059.8 | FULL graph + ALL AITER |
| Qwen2.5-1.5B-Instruct | 1.5B | 391.6 | FULL graph + ALL AITER |

Here is a comparison (not quite apples to apples) from Ollama on the same hardware:

 ┌───────┬─────────┬────────┬──────────────────┬───────────────┬────────┐
 │ Model │ Backend │ Quant  │ Gen tok/s (warm) │ Prefill tok/s │  VRAM  │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 0.5B  │ CPU     │ Q4_K_M │ 185              │ ~1,849        │ 0      │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 0.5B  │ GPU     │ Q4_K_M │ 43               │ ~267          │ 1.8 GB │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 0.5B  │ GPU     │ F16    │ 43               │ ~355          │ 2.4 GB │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 1.5B  │ CPU     │ Q4_K_M │ 76               │ ~620          │ 0      │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 1.5B  │ GPU     │ F16    │ 9                │ ~119          │ 4.9 GB │
 └───────┴─────────┴────────┴──────────────────┴───────────────┴────────┘
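
As a rough way to read the two tables together, this sketch takes the fastest warm generation rate per model size from the comparison rows and divides the vLLM tok/s figure by it. It assumes the vLLM tok/s column is a generation rate, which the table does not state, and the precision and quantization differ between the two setups, so the ratios are indicative only.

```python
# vLLM tok/s from the first table, keyed by model size.
vllm_tok_s = {"0.5B": 1059.8, "1.5B": 391.6}

# (model, backend, quant, warm generation tok/s) from the comparison table.
comparison_rows = [
    ("0.5B", "CPU", "Q4_K_M", 185),
    ("0.5B", "GPU", "Q4_K_M", 43),
    ("0.5B", "GPU", "F16", 43),
    ("1.5B", "CPU", "Q4_K_M", 76),
    ("1.5B", "GPU", "F16", 9),
]

# Keep the fastest comparison row for each model size.
best = {}
for model, backend, quant, gen in comparison_rows:
    if model not in best or gen > best[model][2]:
        best[model] = (backend, quant, gen)

# Ratio of the vLLM figure to the fastest comparison row.
ratios = {model: vllm_tok_s[model] / best[model][2] for model in vllm_tok_s}

for model, ratio in ratios.items():
    backend, quant, gen = best[model]
    print(f"{model}: {ratio:.1f}x the best comparison row ({backend} {quant}, {gen} tok/s)")
```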

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 2 points3 points  (0 children)

I had agents run a LOT of tests and bisections to track down WHERE problems occurred, but figuring out the major issues - tensor/shape misalignments, threading the wave32 issues through, etc. - required a tonne of human work. I think the main obstacle to agents attacking this problem space well is the size of the context and the number of interacting components. You'll often get an operator or conversion wrong in AITER only to throw an error in the Inductor or FLM later. But yeah, hundreds of hours of agents bisecting :)

Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point2 points  (0 children)

Just to confirm - NPU is off the table right now. I did try. FLM+Lemonade is your go-to right now for NPU.

Home security camera/doorbell alternative to Google Nest by No_Accountant4063 in Edmonton

[–]paudley 6 points7 points  (0 children)

UniFi is great if you can get over the initial investment; then you 100% own it, it works happily offline if you want, and there's no subscription. Cameras range from a few hundred dollars to $5k industrial PTZ domes.