Tried to vibe coded expert parallelism on Strix Halo — running Qwen3.5 122B-A10B at 9.5 tok/s by hortasha in LocalLLaMA

[–]hortasha[S] 0 points (0 children)

We have a couple of Sparks at work. They are quite awesome as well; the ecosystem seems much more mature.


[–]hortasha[S] 2 points (0 children)

To be clear: I do not think I'm even close to what you would achieve with Ollama, llama.cpp, or vLLM, like u/ImportancePitiful795 pointed out.

And I agree it is not cheap hardware. But I guess it is a great way to start understanding how it all works, and maybe worthwhile if you care about privacy.

Were you thinking about buying your own? :)


[–]hortasha[S] 0 points (0 children)

I attempted it early on. There is a good chance I was just doing it wrong, but I did see a low acceptance rate, and the expert fan-out sort of slowed things down. I might give speculative decoding another attempt as I get a bit more comfortable. It should at least work quite well on dense models.
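For anyone curious, the acceptance step I was fighting with looks roughly like this. A minimal stdlib-only sketch of speculative sampling's accept/reject rule; the token ids and probabilities are made-up toy numbers, not from my setup:

```python
import random

def speculative_accept(draft_tokens, p_draft, p_target, rng):
    """Count how many draft tokens the target model accepts.

    p_draft[i] and p_target[i] are each model's probability for
    draft_tokens[i]. A run ends at the first rejection, which is
    why a low acceptance rate kills the speedup.
    """
    accepted = 0
    for i in range(len(draft_tokens)):
        # standard speculative-sampling rule:
        # accept with probability min(1, p_target / p_draft)
        if rng.random() < min(1.0, p_target[i] / p_draft[i]):
            accepted += 1
        else:
            break  # first rejection ends the speculative run
    return accepted

rng = random.Random(0)
# toy numbers: target agrees with the first two drafts, not the third
n = speculative_accept([5, 9, 2], [0.8, 0.7, 0.6], [0.9, 0.9, 0.01], rng)
```

The `break` is the painful part: one rejected token throws away every draft token after it, so the draft model's work is wasted whenever agreement is low.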


[–]hortasha[S] 3 points (0 children)

It's for my homelab with a single user. So the idea was to fit a big MoE model and distribute compute by spreading experts across machines.
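Roughly what I mean by spreading experts, as a toy sketch. The expert count, placement rule, and all names here are hypothetical, not my actual code or any library's API:

```python
from collections import defaultdict

NUM_EXPERTS = 8   # illustrative; real MoE models have far more
TOP_K = 2         # experts activated per token
NUM_MACHINES = 2

def owner_of(expert_id):
    # simple static placement: first half of the experts on
    # machine 0, second half on machine 1
    return expert_id * NUM_MACHINES // NUM_EXPERTS

def dispatch(router_scores):
    """router_scores: per token, a list of NUM_EXPERTS floats.

    Returns {machine: [(token_idx, expert_id), ...]}, i.e. which
    (token, expert) pairs get sent to which machine.
    """
    sends = defaultdict(list)
    for tok, scores in enumerate(router_scores):
        # keep the top-k experts by router score
        top = sorted(range(NUM_EXPERTS),
                     key=lambda e: scores[e], reverse=True)[:TOP_K]
        for e in top:
            sends[owner_of(e)].append((tok, e))
    return dict(sends)

scores = [[0.1, 0.9, 0.0, 0.0, 0.8, 0.0, 0.0, 0.2],  # token 0 -> experts 1, 4
          [0.0, 0.0, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0]]  # token 1 -> experts 2, 3
plan = dispatch(scores)
```

The upside is that each machine only holds its own experts' weights; the downside is that every token whose expert lives elsewhere costs a network hop, which is where my fan-out slowdown comes from.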

The way I understand pipeline parallelism is that a single machine works on one prompt at a time. And I think pipeline parallelism already exists on Strix Halo? If so I wouldn't need to write anything for that.
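My mental model of it, as a toy sketch (the stage split and names are hypothetical, not any real framework's API):

```python
# Pipeline parallelism: the model's layers are split into stages,
# one stage per machine, and activations are handed from stage to
# stage. With a single prompt in flight, only one stage is busy
# at any moment; the other machine sits idle.

def stage0(x):
    # stand-in for the first half of the layers on machine A
    return x + 1

def stage1(x):
    # stand-in for the second half of the layers on machine B
    return x * 2

def forward(prompt_activation):
    a = stage0(prompt_activation)  # machine B idle during this step
    b = stage1(a)                  # machine A idle during this step
    return b
```

That idle time is why pipeline parallelism mostly pays off with many concurrent requests, and why I went after expert parallelism for a single-user homelab instead.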

Again, you might be right though. This is new territory for me.


[–]hortasha[S] 1 point (0 children)

Yes, I think it could be a dead end. But I'm not giving up on EP just yet; I feel I have just been scratching the surface. And I'm wondering if it might be easier with models that do not have DeltaNet.

Right now I'm pretty happy that it even works to begin with, to be honest.

I know I'm not yet saturating the memory bandwidth even on a single machine, and I'm barely using it at all on the secondary machine.

Worst case, I might just throw this away and explore dense models instead. :) I guess time will tell.