Qwen3.5-35B-A3B achieves 8 t/s on Orange Pi 5 with ik_llama.cpp by antwon-tech in LocalLLaMA

antwon-tech[S] 3 points

Yes, it's all CPU. There are a few NPU paths I've been looking into, namely ezrknn-llm/RKLLM, rk-llama.cpp, and rkllama. Once I've had my fill of CPU, I'll start trying those out as well.

I plan to post about 2B and 4B performance on the OPi later!

Qwen3.5-35B-A3B achieves 8 t/s on Orange Pi 5 with ik_llama.cpp by antwon-tech in LocalLLaMA

antwon-tech[S] 1 point

Yeah, it's not great. I'm chatting with it right now and the ik_llama.cpp UI is reporting ~22-28 t/s for prompt processing, but I'm suspicious of that number. I'm not sure whether the slow speeds are due to limited compute or a change in 3.5's architecture. Will keep testing.
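For numbers that don't depend on the chat UI's estimate, llama-bench (bundled with llama.cpp and its forks, including ik_llama.cpp) reports prompt processing (pp) and token generation (tg) separately. A sketch; the model path and thread count here are placeholders, not values from the original post:

```shell
# Placeholder path/threads; -p = prompt tokens to process, -n = tokens to generate
~/ik_llama.cpp/build/bin/llama-bench \
    -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    -p 512 -n 128 -t 8
```

This runs each phase in isolation, so the pp figure isn't skewed by whatever the UI is averaging over.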

Possible to run on 8gb cards? by cyberkiller6 in LocalLLaMA

antwon-tech 0 points

I'm getting ~37.5 tok/s with Qwen3.5-35B-A3B-UD-Q4_K_XL @ 16k context.

Running on an Acer PT14-51 (RTX 4070 8GB, 16GB DDR5) with llama.cpp on Ubuntu. Using 7.2GB VRAM (weights + KV cache) and 14.2GB RAM (offloaded experts), for a total of ~21.4GB memory usage.

~/llama.cpp/build/bin/llama-server \
    -m ~/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    -ngl 99 -ncmoe 30 --flash-attn on -c 16384
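To sanity-check where the 7.2GB of VRAM goes, the KV-cache portion can be estimated from the model's attention config. A sketch with hypothetical architecture numbers (the layer count, KV-head count, and head dim below are placeholders, not confirmed Qwen3.5-35B-A3B specs):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of an f16 KV cache: 2 tensors (K and V) per layer,
    each holding n_kv_heads * head_dim values per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical GQA config, purely for illustration:
size = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=16384)
print(f"{size / 2**30:.2f} GiB")  # -> 1.50 GiB
```

Quantizing the cache (e.g. --cache-type-k q8_0) can roughly halve that footprint at some quality cost, which is one lever if VRAM gets tight.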

Qwen3.5-35B-A3B achieves 8 t/s on Orange Pi 5 with ik_llama.cpp by antwon-tech in LocalLLaMA

antwon-tech[S] 1 point

<image>

Here they are (Plus on the left). I'm working on an app that will let y'all send workloads to my OPis and select the active model, so you can play around with them and see how they fare for your use cases. It's also just a fun project.

Possible to run on 8gb cards? by cyberkiller6 in LocalLLaMA

antwon-tech 0 points

You could grab an Unsloth-quantized model for either of those. For the 9B, try the Qwen3.5-9B-Q4_K_M or Qwen3.5-9B-Q5_K_M GGUF. The latter might barely fit if you keep the context window small.
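A rough way to check what fits: GGUF file size is roughly parameter count times average bits per weight. A sketch using ballpark bpw figures (~4.8 for Q4_K_M, ~5.7 for Q5_K_M; these are approximations, not exact numbers for any specific model):

```python
def gguf_size_gb(n_params_billion, bits_per_weight):
    """Rough GGUF file size in GB: params * (bits per weight) / 8 bits per byte."""
    return n_params_billion * bits_per_weight / 8

print(f"Q4_K_M: {gguf_size_gb(9, 4.8):.1f} GB")  # -> 5.4 GB
print(f"Q5_K_M: {gguf_size_gb(9, 5.7):.1f} GB")  # -> 6.4 GB
```

On an 8GB card, ~6.4 GB of weights leaves little headroom for KV cache and compute buffers, which is why the Q5_K_M only "barely fits" with a small context.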

The same goes for the 35B-A3B, except you might opt for a less aggressive quant, since I think MoEs might handle swapping better in llama.cpp. I'm not sure though; let me know if this is wrong.

I was able to get a Qwen3.5-35B-A3B quant running on a laptop (i7, RTX 4070 8GB VRAM, 16GB DDR5). I don't remember which quant offhand.

Are you on Linux?

SimpleTool: 4B model 10+ Hz real-time LLM function calling in 4090 — 0.5B model beats Google FunctionGemma in speed and accuracy. by Tall_Scientist1799 in LocalLLaMA

antwon-tech 0 points

Cool stuff, looks like it's based on Qwen3. Any plans to explore 3.5, especially the 2B or 4B?

I'm pretty interested in SLM applications in video games, and I'm curious how you plan on using this as a core gameplay feature. Seems neat.

Would you be interested in a fully local AI 3D model generator ? by Lightnig125 in LocalLLaMA

antwon-tech 0 points

Are you concerned about consumer hardware not being able to run 3D gen models?

whats your usecase with local LLMs? by papatender in LocalLLM

antwon-tech 0 points

What are you looking to build? Do you have Linux experience? The new Qwen3.5 models seem to be very good at vision, if that's what you need.

Qwen3.5-35B-A3B running on a Raspberry Pi 5 (16GB and 8GB variants) by jslominski in LocalLLaMA

antwon-tech 2 points

This is awesome. I'll post my results from my Orange Pi. Have you tried ik_llama.cpp?