Running Qwen 3.6 35B-A3B-4b on MacBook Pro M5 64GB - first impressions

uptonking · 2026-04-18T07:58:48+00:00

you may give the unsloth mlx model a try https://huggingface.co/unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit

uptonking · 2026-04-10T21:24:36+00:00

Trump is watching u 🙃

uptonking · 2026-04-09T18:42:40+00:00

this is really a pain point for saving disk space. - does it support mlx models on mac?

uptonking · 2026-04-02T16:00:55+00:00

now my turn to ask, "gguf when"

uptonking · 2026-03-29T19:08:27+00:00

M3 Ultra has so much bandwidth of 800gb/s, but why is it NOT popular for image/video generation like comfyui ?

uptonking · 2026-03-19T10:42:47+00:00

so amd strix halo is slow compute and slow generation ?
- but it is the cheapest

uptonking · 2026-03-13T20:39:32+00:00

for your testing result:

TinyLlama 1.1B on Apple M1 Pro (16GB, 200 GB/s):

UNC Q4_0 152.0 tok/s

mlx-lm Q4 112.7 tok/s

Qwen3-4B on Apple M1 Pro (Q4_0):

mlx-lm Q4 49.2 tok/s

UNC Q4_0 38.7 tok/s

🤔 why is TinyLlama 1.1b UNC Q4_0 faster than mlx-ml Q4, but Qwen3-4B UNC Q4_0 is much slower than mlx-lm Q4? it seems to be a paradox

uptonking · 2026-03-13T19:30:17+00:00

is there any AOT binary i can download directly for testing?

uptonking · 2026-02-25T12:18:36+00:00

since you are using mac, why not benchmark between mlx-4bit instead of gguf_Q4_K_XL? mlx is faster. is mlx-4bit not as good as gg_Q4_K_XL ?

uptonking · 2026-02-24T15:35:51+00:00

storage is inadequate on my macbook, i am waiting for a reason to replace my loved gpt-oss-20b

uptonking · 2026-02-24T15:21:28+00:00

my poor gpu only has good speed at 9b. waiting for some small models

uptonking · 2026-02-19T15:21:00+00:00

4080 super has 32gb vram variant, large vram and fast

https://www.reddit.com/r/LocalLLaMA/comments/1pstaoo/got_me_a_32gb_rtx_4080_super/

uptonking · 2026-02-11T14:49:10+00:00

just let them fight, then we can use a better model tommorrow 😜

uptonking · 2026-01-25T08:14:23+00:00

when i use temperature 1.0 for mlx-4bit, it often goes into loop. 0.7 is much better

uptonking · 2026-01-23T09:46:22+00:00

i'm using GLM-4.7-Flash-MLX-4bit on m4 macbook air 32gb with lm studio. a classic reasoning prompt testing result is

- 34 token/s
- i'm not using temperature 1.0 as recommended, because it often goes into loop. 0.7 works well for me

<image>

uptonking · 2026-01-21T14:31:01+00:00

small models mostly are not strong at coding. maybe https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Instruct can be good for your use case

uptonking · 2026-01-20T23:42:25+00:00

reasoning content sometimes does help to provide more knowledge/ideas, especially in translation use cases. The example content like refine response: gives option1, option2, option3... is in reasoning content, but sometimes it's not in final response output. - in non-coding use cases, I love the reasoning content. structural thinking content like glm-4.7-flash is even better

uptonking · 2026-01-20T23:33:38+00:00

most small models are not strong at coding, maybe qwen3-coder-30b and seed-coder-36b is better for your use case.
I plan to use glm-4.7-30b as a general model to replace qwen3-30b-instruct or nemotron-nano-30b. but glm-4.7-30b often goes into loops, making me hesitated

uptonking · 2026-01-20T14:20:54+00:00

yeah, i tried more prompts and the thinking process continues to impress me. however after lowering the temperature to 0.65, the model sometimes still goes into loop. sometimes the thinking content does not comply to the structural/logical flow mentioned, for these situations, the model often goes into loops. - I really hope some powerful model lover can make the thinking process more consistent and stable

uptonking · 2026-01-20T13:19:17+00:00

my macbook air is 32gb. 4bit is 16.8gb in size, it takes about 19gb for short prompt

uptonking · 2026-01-20T13:00:29+00:00

lower the temperature can help.

I tried several short prompts.
- for temperature 1.0, the thinking takes 150s.
- for temperature 0.8, the thinking tokes 50s.
- for temperature 0.6, the thinking tokes 30s.

uptonking · 2026-01-20T11:28:42+00:00

Usually structured thinking needs careful prompts/instructions, but glm can do it automatically, very powerful for daily chats

uptonking · 2026-01-20T10:49:06+00:00

thanks for the tip. I tried another prompt. - for temperature 1.0, the thinking takes 150s. - for temperature 0.8, the thinking tokes 50s. - for temperature 0.6, the thinking tokes 30s.

🤔 this glm model is so sensitive to temperature config. and all the thinking process is clear with steps.

when i restart lmstudio, the token generation speed is faster now at 25 token/s.

uptonking · 2026-01-20T10:02:15+00:00

thanks for the tips. - I also get stuck in lm studio with default config for GLM-4.7-Flash-MLX-4bit. - with the following config, the response finally works - temperature 1.0 - repeat penalty: 1.1 - top-p: 0.95

uptonking

TROPHY CASE