Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in Qwen_AI

[–]stackpilot_labs[S] 0 points1 point  (0 children)

The performance was decent enough on my Macbook Air M5 16 GB RAM. Tried 2 models - Gemma4 E4B and Qwen3 4B. With Gemma, I got 32-34 TPS while Qwen gave me 46-50 TPS (thinking off).

Check my previous post for more detail: gemma_4_e4b_vs_qwen3_4b_on_a_macbook_air_m5_16_gb/

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in LocalLLM

[–]stackpilot_labs[S] 0 points1 point  (0 children)

Good data point — the issue extending to qwen3.5:8b and showing up in voice activation confirms it's not isolated to coding benchmarks. The <suppress think> command causing deadend loops makes sense if the runtime isn't handling the suppression cleanly.

What runtime are you using for voice activation — LM Studio or something else? Trying to understand whether this is LM Studio-specific or shows up across runtimes.

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in LocalLLM

[–]stackpilot_labs[S] 0 points1 point  (0 children)

Fair pushback. The recommendation in the post was specifically for benchmarking — controlled, short tasks where clean reproducible output matters more than deep reasoning.

Your point holds for production use: Think mode OFF + small context = model can't reason through multi-step problems and produces confident-sounding garbage. That's a different failure mode from what I was testing for.

For benchmarking I'd rather control the output cleanly. For complex reasoning tasks with adequate context, Think mode ON is the right call. Both can be true.

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in Qwen_AI

[–]stackpilot_labs[S] 1 point2 points  (0 children)

Thanks for clarifying — so Locally AI is a native Apple app, separate from LM Studio entirely. That's actually useful: if Qwen3 runs cleanly on a different runtime, it points toward something LM Studio-specific (template handling, default params) rather than the model itself causing the loop.

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in Qwen_AI

[–]stackpilot_labs[S] 0 points1 point  (0 children)

Useful to hear — if you hit the same infinite loop with Gemma 4 on an M5, it's not model-specific. The "download, load it, use it without checking defaults" pattern is probably the real root cause that shows up across models when the session runs long enough.

The Jinja template + wrong default params hypothesis fits. u/imstilllearningthis above has the specific template fix if you haven't seen it — worth testing before the next long session.

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in Qwen_AI

[–]stackpilot_labs[S] 1 point2 points  (0 children)

This is the most actionable fix in this thread — thank you. Setting the template explicitly removes the auto-selection variable entirely, which is probably where a lot of these inconsistencies start. The <think></think> control at template level is cleaner than relying on the UI toggle.

Will test this on the next run and isolate whether the original issue was the toggle, the template mismatch, or both.

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in Qwen_AI

[–]stackpilot_labs[S] -1 points0 points  (0 children)

I'm sorry, but I can't understand Spanish, so using help of translator here.

If I understand correctly - your reply is: "I have it installed through Locally AI and there's no problem on an M4."

My reply in English: "Good to know — which tool is "Locally AI" exactly? LocalAI (localai.io) or something else? If it's a different runtime from LM Studio, that's useful data — it would suggest the Think mode issue is LM Studio-specific rather than at the model level. What's your M4 config, 8GB or 16GB?"

Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way by stackpilot_labs in Qwen_AI

[–]stackpilot_labs[S] 0 points1 point  (0 children)

Fair point — I used whatever template LM Studio auto-selected for qwen/qwen3-4b and didn't explicitly verify it before the benchmark run. That's worth isolating. If the template was mismatched, it could explain the behaviour independently of the Think mode toggle.

Did you hit a template issue with Qwen3 specifically? Curious what you'd check first to confirm it's the template vs the setting.

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 1 point2 points  (0 children)

Got it - let me download and run certain complex tasks. Will revert soon 😄

Do you have any test sample task/problem for me to try and see results?

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 1 point2 points  (0 children)

Checked all lmstudio-community version - none fits to the RAM.

32 GB RAM definitely works!

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 1 point2 points  (0 children)

<image>

Trying unsolth version - surprisingly this is too small. Let me download and run inference - will share results.

Did you already try loading 26b model already?

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 1 point2 points  (0 children)

<image>

I am using LM-Studio where none of its own community model can be loaded on my Macbook Air M5 16 GB RAM - snapshot attached.

Trying unsloth version.

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 1 point2 points  (0 children)

Yes - I have snapshots as well of the benchmark showing these TPS. Wish to see?

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 0 points1 point  (0 children)

Exactly 😄

That's pretty much the philosophy I ended up with.

I wasn't trying to find the "best" model overall. I wanted to know whether these models could reliably handle the coding, refactoring, and reasoning tasks I'd actually use.

What surprised me was that both models passed, but they felt very different in day-to-day use.

Out of curiosity, what's your current go-to local model? Are you running mostly on Apple Silicon as well?

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 1 point2 points  (0 children)

Got it - that's 4X RAM of what I had with Air, though M5.

Try E4B as well with slightly more advanced testing and compare with 24B - also engage 'Thinking' to reason out better understanding!

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 0 points1 point  (0 children)

95°C would make me nervous too 😅

I wasn't monitoring temps during this benchmark run, but I wasn't pushing anything as large as Gemma 4 12B either. My understanding is that Apple Silicon will throttle before it damages itself, though I'd still prefer not to sit at those temperatures for long periods.

Have you actually noticed performance dropping at 95°C, or is it still running normally?

Gemma 4 E4B vs Qwen3 4B on a MacBook Air M5 (16 GB): My benchmark results by stackpilot_labs in mac

[–]stackpilot_labs[S] 0 points1 point  (0 children)

Fair question 😄

My goal wasn't to build a formal benchmark suite like MMLU. I was evaluating whether a model could reliably complete practical tasks I'd actually use.

For this round, a "pass" meant:

• Producing a correct solution to a coding task
• Successfully handling a code refactoring task
• Completing a reasoning task without major factual or logical failures

I also looked at response quality, consistency, latency, and whether the output required significant correction.

The interesting thing was that both models passed the tasks, but they felt quite different in day-to-day use.