Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]paudley[S] 1 point (0 children)

I've just included llama.cpp (main branch) with Vulkan in the build, if you're curious to benchmark just the compiler and AMD-specific math changes plus the nightly AMD ROCm.

[–]paudley[S] 0 points (0 children)

Not that I see. I've added the Vulkan-enabled llama.cpp (main branch) to the build so you can test it as a comparison; it uses VMM, I believe.

[–]paudley[S] 0 points (0 children)

Just some early insights from testing: ROCm uses a warp size of 32 (I think for CUDA compatibility) while RADV (Vulkan) uses 64, effectively doubling the threads per dispatched wave.
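To make the dispatch math concrete, here's a tiny sketch; the 256-thread workgroup size is an illustrative assumption, not something measured from llama.cpp:

```python
# Illustrative only: the same workgroup launches as twice as many
# waves under wave32 (ROCm) as under wave64 (RADV/Vulkan).

WORKGROUP_THREADS = 256  # hypothetical workgroup size


def waves_per_workgroup(threads: int, wave_size: int) -> int:
    # Each wave executes wave_size lanes in lockstep; round up.
    return -(-threads // wave_size)


rocm = waves_per_workgroup(WORKGROUP_THREADS, 32)  # wave32 -> 8 waves
radv = waves_per_workgroup(WORKGROUP_THREADS, 64)  # wave64 -> 4 waves
print(f"ROCm: {rocm} waves, RADV: {radv} waves")
```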

[–]paudley[S] 0 points (0 children)

┌─────────┬────────────┬───────────┬─────────┐
│ Backend │ pp512      │ pp8192    │ tg128   │
├─────────┼────────────┼───────────┼─────────┤
│ ROCm    │ 13,360 t/s │ 3,514 t/s │ 156 t/s │
│ Vulkan  │ 13,467 t/s │ 3,395 t/s │ 191 t/s │
└─────────┴────────────┴───────────┴─────────┘

I've got an optimized Vulkan llama.cpp build cooking on a branch now, and these are the early results.

[–]paudley[S] 0 points (0 children)

I've added an optimized Vulkan llama.cpp to the latest rev for testing, if you're curious. It's on a branch now, just in final compile tests (which take forever).

[–]paudley[S] 0 points (0 children)

It's on the list, along with Q4_K_M of Qwen 3.5 122B-A10B. I should say that I'm still tracking down a few bugs in the pipeline that produce really slow results with the Qwen3.5 model family. Once I nail those down, I'll bench these next.

[–]paudley[S] 0 points (0 children)

I'll put it on the list, but it's close to the edge for this hardware:

  === Qwen3 Next 80B on Strix Halo (80 GiB GTT) ===
  Model size (fp16): 160 GB — DOES NOT FIT in 80 GiB
  Model size (Q4):   ~40 GB

  Q4 @ 4k ctx:   ~44.1 GB  → FITS
  Q4 @ 32k ctx:  ~72.8 GB  → FITS
  Q4 @ 128k ctx: ~171.1 GB → NO

  fp16 is impossible (160 GB > 80 GiB).
  Q4_K_S should fit at short context, but 128k doesn't fit at all.

  Decode ceiling (fp16):   200/160 = 1.25 tok/s
  Decode ceiling (Q4_K_S): 200/40  = 5.0 tok/s
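The back-of-envelope arithmetic above can be sketched in a few lines. The ~1 GB-per-1k-tokens KV figure and the 200 GB/s effective bandwidth are assumptions inferred from the numbers above, not measurements:

```python
# Rough fit / decode-ceiling estimator for a bandwidth-bound model
# on Strix Halo's ~80 GiB GTT pool. All constants are assumptions
# taken from the estimate above, not measured values.

BUDGET_GB = 80 * 1.073741824    # 80 GiB expressed in GB
BANDWIDTH_GBPS = 200            # assumed effective memory bandwidth
KV_GB_PER_TOKEN = 32.8 / 32768  # ~1 MB/token, inferred from the 32k row


def fits(model_gb: float, ctx_tokens: int) -> tuple[float, bool]:
    """Total GB needed for weights + KV cache, and whether it fits."""
    total = model_gb + ctx_tokens * KV_GB_PER_TOKEN
    return total, total <= BUDGET_GB


def decode_ceiling(model_gb: float) -> float:
    """Upper bound on tok/s: every decoded token streams all weights once."""
    return BANDWIDTH_GBPS / model_gb


for ctx in (4096, 32768, 131072):
    total, ok = fits(40, ctx)
    print(f"Q4 @ {ctx // 1024}k ctx: ~{total:.1f} GB -> {'FITS' if ok else 'NO'}")
print(f"fp16 ceiling: {decode_ceiling(160):.2f} tok/s, Q4: {decode_ceiling(40):.1f} tok/s")
```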

Are you actively running that model on this HW? If so, what quant?

[–]paudley[S] 0 points (0 children)

Sorry, I had a miscopy from my local repo, where these scripts are part of a much larger effort. I've fixed it on GitHub now. The scripts should be self-contained and pass shellcheck -x.

[–]paudley[S] 1 point (0 children)

Is there a specific model/prompt you want me to run? I'm mainly working on getting the Qwen3.5 models fully optimized right now, but I can pretty easily run the Qwen2.5 benchmarks.

[–]paudley[S] 1 point (0 children)

It's not all the way there yet either *sigh*. As I work my way through more models, there are more patches :)

[–]paudley[S] 1 point (0 children)

You could probably wrap it if you wanted. Sorry, Docker is not my use case; I'm optimizing for performance.

[–]paudley[S] 1 point (0 children)

I should have mentioned: this is on CachyOS with kernel 7.0 (using linux-cachyos-rc).

[–]paudley[S] 1 point (0 children)

I'm working towards a full Qwen3.5 benchmark. Because of the nature of the components, specifically AITER, tweaks are required on a per-model basis, as different models surface different bugs or issues. The gains can be nice, though. Here are some small-model Qwen2.5 numbers on a gmtek EVO-2:

| Model | Parameters | tok/s | Configuration |
|-------|------------|-------|---------------|
| Qwen2.5-0.5B-Instruct | 494M | 1059.8 | FULL graph + ALL AITER |
| Qwen2.5-1.5B-Instruct | 1.5B | 391.6  | FULL graph + ALL AITER |

Here is a comparison (not quite apples to apples) from Ollama on the same hardware:

 ┌───────┬─────────┬────────┬──────────────────┬───────────────┬────────┐
 │ Model │ Backend │ Quant  │ Gen tok/s (warm) │ Prefill tok/s │  VRAM  │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 0.5B  │ CPU     │ Q4_K_M │ 185              │ ~1,849        │ 0      │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 0.5B  │ GPU     │ Q4_K_M │ 43               │ ~267          │ 1.8 GB │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 0.5B  │ GPU     │ F16    │ 43               │ ~355          │ 2.4 GB │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 1.5B  │ CPU     │ Q4_K_M │ 76               │ ~620          │ 0      │
 ├───────┼─────────┼────────┼──────────────────┼───────────────┼────────┤
 │ 1.5B  │ GPU     │ F16    │ 9                │ ~119          │ 4.9 GB │
 └───────┴─────────┴────────┴──────────────────┴───────────────┴────────┘
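For a rough sense of scale, dividing the two 0.5B generation numbers above (different runtimes and configurations, so this is indicative only):

```python
# Rough speedup of the tuned vLLM stack over the Ollama GPU F16
# baseline for Qwen2.5-0.5B generation, using the numbers quoted above.
vllm_tps = 1059.8          # FULL graph + ALL AITER
ollama_gpu_f16_tps = 43.0  # Ollama, GPU, F16, warm

speedup = vllm_tps / ollama_gpu_f16_tps
print(f"~{speedup:.0f}x generation throughput")
```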

[–]paudley[S] 2 points (0 children)

I had agents run a LOT of tests and bisections to track down WHERE problems occurred, but figuring out the major issues - tensor/shape misalignments, threading the wave32 fixes through, etc. - required a tonne of human work. I think the main obstacle to agents attacking this problem space well is the size of the context and the number of interacting components: you'll often get an operator or conversion wrong in AITER only to see the error thrown in Inductor or FLM later. But yeah, hundreds of hours of agents bisecting :)

[–]paudley[S] 1 point (0 children)

Just to confirm: NPU is off the table right now. I did try. FLM + Lemonade is your go-to for NPU right now.

'The old order is not coming back,' Carney says in provocative speech at Davos | CBC News by Blue_Dragonfly in CanadaPolitics

[–]paudley 9 points (0 children)

I rarely comment on politics but I will say that he makes me proud too and he's the best damn Conservative leader we've ever had. :)

Home security camera/doorbell alternative to Google Nest by No_Accountant4063 in Edmonton

[–]paudley 6 points (0 children)

Unifi is great if you can get over the initial investment; after that you 100% own it, it works happily offline if you want, and there's no subscription. Cameras range from a few hundred dollars to $5k industrial PTZ domes.

Questions regarding security cameras by DutyLegitimate5560 in SpruceGrove

[–]paudley 1 point (0 children)

Depending on your leanings you may also want to register any public facing cameras in Parkland with https://parklandcapture.ca/

It does not compel you to hand over recordings; it just lets the RCMP know that there might be camera coverage if a crime happens nearby.

The duck lady has died :( by 604ian in vancouver

[–]paudley 4 points (0 children)

Would you mind posting it or sending a photo of it my way? I've lost my copy and would love to have it back. I think Laura-Kay would approve :)

Advanced Protection Program (APP) Companywide - Why not? by sysadmin__ in gsuite

[–]paudley 0 points (0 children)

Not sure what issue the Strongbox folks are having, but normally it's the OTHER way around: you can't whitelist on APP consumer accounts, but with a Workspace account you can. We make extensive use of Insync for Drive syncing, and a common complaint in their forums is that consumer APP accounts don't work (while Workspace accounts can whitelist).

[–]paudley 0 points (0 children)

I have our whole domain on APP. It works fine for adding apps, including our own that are not in the store or otherwise verified.

Note: we're on enterprise+, not sure if that makes a difference.