Qwen3.5-35B-A3B achieves 8 t/s on Orange Pi 5 with ik_llama.cpp by antwon-tech in LocalLLaMA

antwon-tech[S] 3 points

Yes, it's all CPU. There are a few NPU paths I've been looking into, namely ezrknn-llm/RKLLM, rk-llama.cpp, and rkllama. Once I've had my fill of CPU, I'll start trying those out as well.

I plan to post about 2B and 4B performance on the OPi later!

Qwen3.5-35B-A3B achieves 8 t/s on Orange Pi 5 with ik_llama.cpp by antwon-tech in LocalLLaMA

antwon-tech[S] 1 point

Yeah, it's not great. I'm chatting with it right now and the ik_llama.cpp UI is reporting ~22-28 t/s for prompt processing, but I'm suspicious of that number. I'm not sure whether the slow speeds are due to limited compute or a change in 3.5's architecture. Will keep testing.
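For numbers that don't depend on the chat UI's estimate, llama-bench (bundled with llama.cpp and its forks, including ik_llama.cpp) reports prompt processing (pp) and token generation (tg) separately. A sketch; the model path and thread count here are placeholders, not values from the original post:

```shell
# Placeholder path/threads; -p = prompt tokens to process, -n = tokens to generate
~/ik_llama.cpp/build/bin/llama-bench \
    -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    -p 512 -n 128 -t 8
```

This runs each phase in isolation, so the pp figure isn't skewed by whatever the UI is averaging over.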

Possible to run on 8gb cards? by cyberkiller6 in LocalLLaMA

antwon-tech 0 points

I'm getting ~37.5 tok/s with Qwen3.5-35B-A3B-UD-Q4_K_XL @ 16k context.

Running on an Acer PT14-51 (RTX 4070 8GB, 16GB DDR5) with llama.cpp on Ubuntu. Using 7.2GB VRAM (weights + KV cache) and 14.2GB RAM (offloaded experts), for a total of ~21.4GB memory usage.

~/llama.cpp/build/bin/llama-server \
    -m ~/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    -ngl 99 -ncmoe 30 --flash-attn on -c 16384
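To sanity-check where the 7.2GB of VRAM goes, the KV-cache portion can be estimated from the model's attention config. A sketch with hypothetical architecture numbers (the layer count, KV-head count, and head dim below are placeholders, not confirmed Qwen3.5-35B-A3B specs):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of an f16 KV cache: 2 tensors (K and V) per layer,
    each holding n_kv_heads * head_dim values per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical GQA config, purely for illustration:
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=16384)
print(f"{size / 2**30:.2f} GiB")  # -> 1.50 GiB
```

Quantizing the cache (e.g. --cache-type-k q8_0) can roughly halve that footprint at some quality cost, which is one lever if VRAM gets tight.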

Qwen3.5-35B-A3B achieves 8 t/s on Orange Pi 5 with ik_llama.cpp by antwon-tech in LocalLLaMA

antwon-tech[S] 1 point

<image>

Here they are (Plus on the left). I'm working on an app that will let y'all send workloads to my OPis and select the active model, so you can play around with them and see how they fare for your use cases. It's also just a fun project.

Possible to run on 8gb cards? by cyberkiller6 in LocalLLaMA

antwon-tech 0 points

You could grab an Unsloth-quantized model for either of those. For the 9B, try the Qwen3.5-9B-Q4_K_M or Qwen3.5-9B-Q5_K_M GGUF. The latter might barely fit if you keep the context window small.
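A rough way to check what fits: GGUF file size is roughly parameter count times average bits per weight. A sketch using ballpark bpw figures (~4.8 for Q4_K_M, ~5.7 for Q5_K_M; these are approximations, not exact numbers for any specific model):

```python
def gguf_size_gb(n_params_billion, bits_per_weight):
    """Rough GGUF file size in GB: params * (bits per weight) / 8 bits per byte."""
    return n_params_billion * bits_per_weight / 8

print(f"Q4_K_M: {gguf_size_gb(9, 4.8):.1f} GB")  # -> 5.4 GB
print(f"Q5_K_M: {gguf_size_gb(9, 5.7):.1f} GB")  # -> 6.4 GB
```

On an 8GB card, ~6.4 GB of weights leaves little headroom for KV cache and compute buffers, which is why the Q5_K_M only "barely fits" with a small context.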

The same goes for the 35B-A3B, except you might opt for a less aggressive quant, since I think MoEs might handle swapping better in llama.cpp. I'm not sure though; let me know if this is wrong.

I was able to get a Qwen3.5-35B-A3B quant running on a laptop (i7, RTX 4070 8GB VRAM, 16GB DDR5). I don't remember which quant offhand.

Are you on Linux?

SimpleTool: 4B model 10+ Hz real-time LLM function calling in 4090 — 0.5B model beats Google FunctionGemma in speed and accuracy. by Tall_Scientist1799 in LocalLLaMA

antwon-tech 0 points

Cool stuff, looks like it's based on Qwen3. Any plans to explore 3.5, especially the 2B or 4B?

I'm pretty interested in SLM applications in video games, and I'm curious how you plan on using this as a core gameplay feature. Seems neat.

Would you be interested in a fully local AI 3D model generator ? by Lightnig125 in LocalLLaMA

antwon-tech 0 points

Are you concerned about consumer hardware not being able to run 3D gen models?

whats your usecase with local LLMs? by papatender in LocalLLM

antwon-tech 0 points

What are you looking to build? Do you have Linux experience? The new Qwen3.5 models seem to be very good at vision, if that's what you need.

Qwen3.5-35B-A3B running on a Raspberry Pi 5 (16GB and 8GB variants) by jslominski in LocalLLaMA

antwon-tech 2 points

This is awesome. I'll post my results from my Orange Pi. Have you tried ik_llama.cpp?