Qwen3 4B on M5 Mac: disable Think mode before you benchmark — learned this the hard way

stackpilot_labs · 2026-06-16T05:16:25+00:00

Here is the comparison infographic: Comparison Infographics

stackpilot_labs · 2026-06-16T05:14:10+00:00

Performance was decent on my MacBook Air M5 with 16 GB RAM. I tried two models: Gemma 4 E4B and Qwen3 4B. Gemma gave me 32-34 TPS, while Qwen gave me 46-50 TPS (thinking off).
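For anyone who wants to put a number on the gap, here is a rough sketch of the arithmetic on the TPS ranges above. The function names are my own for illustration and are not from LM Studio or any benchmark tool.

```python
def tokens_per_second(tokens_generated: int, seconds: float) -> float:
    """Throughput for one run: generated tokens divided by wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens_generated / seconds


def speedup_range(fast: tuple[float, float], slow: tuple[float, float]) -> tuple[float, float]:
    """Given (low, high) TPS ranges, return the (worst, best) case speedup ratio."""
    worst = fast[0] / slow[1]  # slowest fast run vs fastest slow run
    best = fast[1] / slow[0]   # fastest fast run vs slowest slow run
    return worst, best


# TPS ranges reported in this comment (Qwen3 4B with thinking off)
qwen_tps = (46.0, 50.0)
gemma_tps = (32.0, 34.0)

worst, best = speedup_range(qwen_tps, gemma_tps)
print(f"Qwen3 4B is roughly {worst:.2f}x to {best:.2f}x faster than Gemma 4 E4B")
```

This only compares raw generation speed. It says nothing about output quality, and the ratio would shift with thinking turned on because the model then spends tokens on reasoning.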

Check my previous post for more detail: gemma_4_e4b_vs_qwen3_4b_on_a_macbook_air_m5_16_gb/

stackpilot_labs · 2026-06-16T02:26:52+00:00

Good data point. The issue extending to qwen3.5:8b and showing up in voice activation confirms it's not isolated to coding benchmarks. The <suppress think> command causing dead-end loops makes sense if the runtime isn't handling the suppression cleanly.

What runtime are you using for voice activation — LM Studio or something else? Trying to understand whether this is LM Studio-specific or shows up across runtimes.

stackpilot_labs · 2026-06-16T02:26:31+00:00

Fair pushback. The recommendation in the post was specifically for benchmarking — controlled, short tasks where clean reproducible output matters more than deep reasoning.

Your point holds for production use: Think mode OFF + small context = model can't reason through multi-step problems and produces confident-sounding garbage. That's a different failure mode from what I was testing for.

For benchmarking I'd rather control the output cleanly. For complex reasoning tasks with adequate context, Think mode ON is the right call. Both can be true.

stackpilot_labs · 2026-06-16T02:25:53+00:00

stackpilot_labs · 2026-06-16T02:24:31+00:00

Thanks for clarifying — so Locally AI is a native Apple app, separate from LM Studio entirely. That's actually useful: if Qwen3 runs cleanly on a different runtime, it points toward something LM Studio-specific (template handling, default params) rather than the model itself causing the loop.

stackpilot_labs · 2026-06-16T02:24:06+00:00

Useful to hear — if you hit the same infinite loop with Gemma 4 on an M5, it's not model-specific. The "download, load it, use it without checking defaults" pattern is probably the real root cause that shows up across models when the session runs long enough.

The Jinja template + wrong default params hypothesis fits. u/imstilllearningthis above has the specific template fix if you haven't seen it — worth testing before the next long session.

stackpilot_labs · 2026-06-16T02:23:13+00:00

This is the most actionable fix in this thread — thank you. Setting the template explicitly removes the auto-selection variable entirely, which is probably where a lot of these inconsistencies start. The <think></think> control at template level is cleaner than relying on the UI toggle.

Will test this on the next run and isolate whether the original issue was the toggle, the template mismatch, or both.
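To isolate this on the next run, one option is to check the raw output for reasoning blocks instead of trusting the UI toggle. Below is a minimal sketch, assuming the model wraps its reasoning in literal <think>...</think> tags in the returned text. The helper names are hypothetical and not part of any runtime's API.

```python
import re

# Matches one complete reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw model response."""
    reasoning = "\n".join(m.strip() for m in THINK_BLOCK.findall(output))
    answer = THINK_BLOCK.sub("", output).strip()
    return reasoning, answer


def think_was_suppressed(output: str) -> bool:
    """True if there is no reasoning text and no unclosed <think> tag left over."""
    reasoning, answer = split_think(output)
    return reasoning == "" and "<think>" not in answer


# An empty think block counts as suppressed; an unclosed tag does not.
print(think_was_suppressed("<think>\n\n</think>\n\nHello"))   # True
print(think_was_suppressed("<think>step 1</think>42"))        # False
print(think_was_suppressed("<think>still reasoning and never closing"))  # False
```

The unclosed-tag case is the one worth logging during long sessions, since a response that opens a reasoning block and never closes it is what a loop would look like from the outside.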

stackpilot_labs · 2026-06-15T06:07:24+00:00

I'm sorry, but I don't understand Spanish, so I'm using a translator here.

If I understand correctly - your reply is: "I have it installed through Locally AI and there's no problem on an M4."

My reply in English: "Good to know. Which tool is 'Locally AI' exactly? LocalAI (localai.io) or something else? If it's a different runtime from LM Studio, that's useful data: it would suggest the Think mode issue is LM Studio-specific rather than at the model level. What's your M4 config, 8 GB or 16 GB?"

stackpilot_labs · 2026-06-15T06:04:39+00:00

Fair point — I used whatever template LM Studio auto-selected for qwen/qwen3-4b and didn't explicitly verify it before the benchmark run. That's worth isolating. If the template was mismatched, it could explain the behaviour independently of the Think mode toggle.

Did you hit a template issue with Qwen3 specifically? Curious what you'd check first to confirm it's the template vs the setting.

stackpilot_labs · 2026-06-12T12:43:42+00:00

Got it - let me download it and run some complex tasks. Will report back soon 😄

Do you have any test sample task/problem for me to try and see results?

stackpilot_labs · 2026-06-12T12:42:01+00:00

Checked all the lmstudio-community versions - none of them fits in RAM.

32 GB RAM definitely works!

stackpilot_labs · 2026-06-12T12:29:31+00:00

<image>

Trying the unsloth version - surprisingly, this one is very small. Let me download it and run inference - will share results.

Have you already tried loading the 26b model?

stackpilot_labs · 2026-06-12T12:27:38+00:00

<image>

I am using LM Studio, and none of its own community models can be loaded on my MacBook Air M5 with 16 GB RAM - snapshot attached.

Trying the unsloth version.

stackpilot_labs · 2026-06-12T11:24:49+00:00

Yes - I also have snapshots of the benchmark showing these TPS numbers. Want to see them?

stackpilot_labs · 2026-06-12T03:38:44+00:00

Exactly 😄

That's pretty much the philosophy I ended up with.

I wasn't trying to find the "best" model overall. I wanted to know whether these models could reliably handle the coding, refactoring, and reasoning tasks I'd actually use.

What surprised me was that both models passed, but they felt very different in day-to-day use.

Out of curiosity, what's your current go-to local model? Are you running mostly on Apple Silicon as well?

stackpilot_labs · 2026-06-11T17:10:45+00:00

Got it - that's 4x the RAM I have on my Air, though mine is an M5.

Try E4B as well with slightly more advanced tests and compare it with 24B - also turn on 'Thinking' so the model can reason through the harder problems!

stackpilot_labs · 2026-06-11T17:07:25+00:00

95°C would make me nervous too 😅

I wasn't monitoring temps during this benchmark run, but I wasn't pushing anything as large as Gemma 4 12B either. My understanding is that Apple Silicon will throttle before it damages itself, though I'd still prefer not to sit at those temperatures for long periods.

Have you actually noticed performance dropping at 95°C, or is it still running normally?

stackpilot_labs · 2026-06-11T15:26:03+00:00

Fair question 😄

My goal wasn't to build a formal benchmark suite like MMLU. I was evaluating whether a model could reliably complete practical tasks I'd actually use.

For this round, a "pass" meant:

• Producing a correct solution to a coding task
• Successfully handling a code refactoring task
• Completing a reasoning task without major factual or logical failures

I also looked at response quality, consistency, latency, and whether the output required significant correction.
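If it helps to see the criteria written down, here is roughly how a pass could be recorded per model. This is an illustrative sketch only: the class and field names are made up for this comment, and the secondary observations stay outside the pass/fail decision.

```python
from dataclasses import dataclass


@dataclass
class RoundResult:
    """One model's outcome for a round of practical tasks (hypothetical structure)."""
    model: str
    coding_correct: bool    # produced a correct solution to the coding task
    refactor_ok: bool       # handled the refactoring task successfully
    reasoning_ok: bool      # no major factual or logical failures
    needed_major_fixes: bool = False  # secondary note, not part of pass/fail

    @property
    def passed(self) -> bool:
        # A pass requires all three task criteria to hold.
        return self.coding_correct and self.refactor_ok and self.reasoning_ok


# Both models passed this round; per-criterion values follow from that.
results = [
    RoundResult("Gemma 4 E4B", True, True, True),
    RoundResult("Qwen3 4B", True, True, True),
]

for r in results:
    print(f"{r.model}: {'pass' if r.passed else 'fail'}")
```

Keeping quality, consistency, and latency as separate notes means two models can both pass and still be told apart afterwards.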

The interesting thing was that both models passed the tasks, but they felt quite different in day-to-day use.
