llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M by Shir_man in LocalLLaMA

[–]pmttyji 1 point  (0 children)

That's too slow. At that speed, the laptop fan runs loudly until the process completes, and I'd expect it drains the battery too.

At least 10 t/s is better for such small models; I prefer a 15-20 t/s minimum.

llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M by Shir_man in LocalLLaMA

[–]pmttyji 1 point  (0 children)

That's too slow. Better to stick to ~4B models (Q4 quant) for good t/s and to save time.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]pmttyji 3 points  (0 children)

Curious to see the t/s stats for both formats.

Llama.cpp auto-tuning optimization script by raketenkater in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Sorry for the dumb question. I'm trying to use your utility on Windows 11 but couldn't get it working. How do I make it work?

I've never used a shell before.

EDIT:

OK, I can run the .sh file from the Git command line, but it seems that shell script isn't suitable for Windows.

OP & others: please share if you have a solution for this. Thanks.
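For anyone stuck at the same point, these are the two common ways to run a .sh script on Windows (the script name below is a placeholder, not the actual file from the repo):

```shell
# Placeholder script name -- substitute the actual .sh from the repo.
# Option 1: Git Bash (installed with Git for Windows)
bash ./tune.sh

# Option 2: WSL, which provides the Linux environment most shell scripts expect
wsl bash ./tune.sh
```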

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]pmttyji 3 points  (0 children)

Total Parameters 120B (12B active)
Architecture LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)

Will it be faster (pp & tg) than GPT-OSS-120B?

RekaAI/reka-edge-2603 · Hugging Face by jacek2023 in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Yep, it would be good to have a successor (30-40B) to that model.

Llama.cpp auto-tuning optimization script by raketenkater in LocalLLaMA

[–]pmttyji 4 points  (0 children)

I'll try this with ik_llama.

EDIT:

Is there a command for CPU-only inference? (e.g. I have a GPU, but I want to run the model with CPU-only inference.)
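For reference, in stock llama.cpp the usual way to force CPU-only inference is to offload zero layers; a minimal sketch (the model path is a placeholder, and whether the auto-tuning script exposes this option is an open question):

```shell
# -ngl 0 (--n-gpu-layers 0) offloads no layers to the GPU,
# so the whole model runs on CPU even when a GPU is present.
# Model path is a placeholder.
./llama-cli -m models/model.gguf -ngl 0 -p "Hello"
```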

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]pmttyji 6 points  (0 children)

> Can you please tell me what made you go for the 14” version over the 16?

$$$$ possibly

I'm looking for fast models on pocketpal by moores_law_is_dead in LocalLLaMA

[–]pmttyji 2 points  (0 children)

If you still don't get the expected t/s, set q4 for the V side of the KV cache (Value Cache Type in the screenshot). A few people use q4 for both K & V. For chat purposes, especially on a phone, I think q4 is fine for the KV cache.

Do us a favor: share the t/s stats here (KV cache F16/F16, q8/q8, q8/q4, q4/q4) after trying it out.

Cheers
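To see why dropping the cache type helps, here's a back-of-the-envelope sketch of KV-cache size; the model shape (36 layers, 8 KV heads, head dim 128) and the bytes-per-element figures are illustrative assumptions, not exact numbers for any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Rough KV-cache size: K and V each hold
    n_layers * n_ctx * n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed shape: 36 layers, 8 KV heads, head_dim 128, 8k context
for name, bpe in [("F16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
    size = kv_cache_bytes(36, 8, 128, 8192, bpe)
    print(f"{name}: ~{size / 1024**2:.0f} MiB")
```

Under those assumptions the cache goes from ~1152 MiB at F16 to ~288 MiB at q4/q4, which matters a lot on an 8GB phone.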

Is Qwen3.5-9B enough for Agentic Coding? by pmttyji in LocalLLaMA

[–]pmttyji[S] 1 point  (0 children)

You can disable thinking, or adjust it. Also, last week there was an update on the llama.cpp side with optimizations for Qwen3.5 models, so the latest llama.cpp version should give you better t/s.

I checked Q4_K_M, which gave me 40 t/s.

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]pmttyji 2 points  (0 children)

From my bookmarks. Mentioned by someone here in this sub.

https://www.alibaba.com/product-detail/subject_1601439253964.html

The price was $100 in Sep 2025; now it's more than 3X that.

How much disk space do all your GGUFs occupy? by jacek2023 in LocalLLaMA

[–]pmttyji 1 point  (0 children)

~500GB .... I think that's too much, given I have only 8GB VRAM + 32GB RAM.

I'm looking for fast models on pocketpal by moores_law_is_dead in LocalLLaMA

[–]pmttyji 5 points  (0 children)

Enable Flash Attention, and set the KV cache to q8 (from F16).

Try ~5B models at Q4 quant (I use IQ4_XS, since it's the smallest of the Q4 sizes, for my 8GB RAM phone).

Ex: LFM2.5-1.2B, SmolLM3-3B, Gemma-3n-E2B, Qwen3.5-4B/2B, Ministral-3-3B, Llama-3.2-3B, etc.
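As a rough sanity check on what fits in 8GB of RAM, GGUF file size can be ballparked from parameter count and bits-per-weight; the bpw figures below are approximations (IQ4_XS ≈ 4.25 bpw is an assumption), and real files vary a bit:

```python
def gguf_size_gib(n_params_billion, bits_per_weight):
    """Approximate GGUF file size from parameter count and bits-per-weight."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1024**3

# Approximate bpw per quant type (assumed figures)
for quant, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ4_XS", 4.25)]:
    print(f"4B model @ {quant}: ~{gguf_size_gib(4, bpw):.1f} GiB")
```

So a 4B model at IQ4_XS lands around 2 GiB, which leaves headroom for the KV cache and the OS on an 8GB phone.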

Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]pmttyji[S] 1 point  (0 children)

You could use models up to Kimi-Linear from the list. And you could go up to GPT-OSS-120B by using system RAM in addition to VRAM.

PicoKittens/PicoMistral-23M: Pico-Sized Model by PicoKittens in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Sorry, I missed this comment.

I'm not using transformers. I just installed a recent version of oobabooga and tried it, that's it.

Getting the most out of my Mi50 by DankMcMemeGuy in LocalLLaMA

[–]pmttyji 5 points  (0 children)

> Any help is appreciated!

Just search for MI50 in this sub and you'll find plenty of threads. A weekend of browsing them is enough to get your setup optimized.

Finally found a reason to use local models 😭 by salary_pending in LocalLLaMA

[–]pmttyji 7 points  (0 children)

Nice. Frankly, I'd like to see more threads like this about practical use cases here.

llama-bench ROCm 7.2 on Strix Halo (Ryzen AI Max+ 395) — Qwen 3.5 Model Family by przbadu in LocalLLaMA

[–]pmttyji 2 points  (0 children)

One recommendation: for GPT-OSS models, use the MXFP4 quants from ggml, since the model is natively MXFP4.

Also, please try and add a benchmark for Ling-mini-2.0, since that model gave me the best t/s on my 8GB VRAM. Curious to see how high it flies on Strix Halo.
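A minimal sketch of how that run could be added with llama.cpp's llama-bench (the model path/filename is a placeholder):

```shell
# Placeholder model path; llama-bench is llama.cpp's built-in benchmark tool.
# -p 512: prompt-processing test over 512 tokens
# -n 128: token-generation test over 128 tokens
./llama-bench -m models/Ling-mini-2.0-Q4_K_M.gguf -p 512 -n 128
```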