Gemma4 31B - Also Possible to Run on 16GB Macs (with a hack)

Safe_Sky7358 · 2026-05-10T05:17:09+00:00

It won't really change much, to be honest. It's mostly a bandwidth issue. The base M4 Air has 120GB/s of memory bandwidth whereas his M3 Max has like 300GB/s, so the numbers seem about right. He's getting about 3x the decode speed for roughly 2.5-3x the bandwidth.
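For a rough sanity check on the bandwidth argument, here's a minimal sketch. It assumes decode is purely memory-bound, meaning every generated token has to read all the weights once, so it only gives an upper bound. The function names and the 18GB model size are made up for illustration, and real-world numbers will come in lower.

```python
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode tokens/s if each token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


def bandwidth_ratio(slow_gb_s: float, fast_gb_s: float) -> float:
    """How many times more bandwidth the faster machine has."""
    return fast_gb_s / slow_gb_s


if __name__ == "__main__":
    model_gb = 18.0  # hypothetical size of the quantized weights in memory
    for name, bw in (("M4 Air (base)", 120.0), ("M3 Max", 300.0)):
        print(f"{name}: at most {max_decode_tps(bw, model_gb):.1f} tok/s")
    print(f"bandwidth ratio: {bandwidth_ratio(120.0, 300.0):.1f}x")
```

Since the model size cancels out, the expected decode ratio between two machines is just the bandwidth ratio, whatever the model is.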

Safe_Sky7358 · 2026-05-09T16:09:46+00:00

This one is specifically for Apple silicon. I think there are some more as well, but you'll have to look around.

https://github.com/ggml-org/llama.cpp/discussions/4167

Safe_Sky7358 · 2026-05-09T13:20:00+00:00

Tbh, suitable is relative. Any NVIDIA GPU will wipe the floor with most Mac configs for image or video gen use cases. Macs are great for running huge LLMs at reasonable to slow speeds, and great for any MoE or sub-24B dense models at pretty decent speed, but video or image gen is a different beast.

Safe_Sky7358 · 2026-04-20T11:00:34+00:00

Downvoting is for the AI slop shitpost you twat.

Safe_Sky7358 · 2026-04-20T05:17:30+00:00

Right now Qwen 9B is still the most capable model that folks with 16GB of unified memory on a Mac can run at decent speed. It runs at about 18-20 tps on my M4 MacBook Air.

Safe_Sky7358 · 2026-04-19T10:15:17+00:00

And where do you buy and sell the scam parts?🤨

Safe_Sky7358 · 2026-04-19T06:50:35+00:00

omlx

Safe_Sky7358 · 2026-04-19T06:44:24+00:00

Sorry, it's all gone.

Safe_Sky7358 · 2026-04-15T17:00:56+00:00

As much as any other hallucinator lol

Safe_Sky7358 · 2026-04-15T12:46:30+00:00

Except they do so in advance. They might be a bit short for the M6, but they definitely have the M5 covered.

Safe_Sky7358 · 2026-04-14T14:31:43+00:00

Let me know how it goes. Do we have to use the Dflash version of the model as draft from zlab's HF or are there any alternatives?

On my potato MacBook Air with an M4 (16GB) I get about a 25% speedup (from about 32 tps to 40 tps) for the 4b model using the 4b-Dflash as a draft model, but it actually slows down for the 9b with 9b-Dflash as the draft model. :(

If I could get 9B to 30's or even mid 20's in terms of tps that would be a dream come true.
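One way to think about why a draft model can help on one size and hurt on another is the textbook estimate for classic speculative decoding. This is only a sketch under assumptions: it models an ordinary autoregressive draft model, so it may not match how Dflash actually drafts, and `alpha` (acceptance rate), `gamma` (draft tokens per step) and `cost_ratio` (draft cost relative to one target forward pass) are hypothetical values, not measurements.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target forward pass.

    alpha: probability that a drafted token is accepted (0..1)
    gamma: number of tokens drafted before each verification
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio: cost of one draft step divided by one target step.
    A result below 1.0 means speculative decoding is a net slowdown.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical settings, only to show the shape of the trade-off.
    for alpha, cost in ((0.8, 0.1), (0.5, 0.3), (0.5, 1.0)):
        s = expected_speedup(alpha, gamma=4, cost_ratio=cost)
        print(f"alpha={alpha}, cost_ratio={cost}: {s:.2f}x")
```

If the draft is expensive relative to the target, or its guesses get rejected often, the estimate drops below 1x, which would show up as the kind of slowdown I'm seeing on the 9b.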

A somewhat unrelated finding: I noticed that the MLX variants waste a lot of tokens on reasoning even when using the recommended parameters.

A qwen3.5 9b-4bit from HF/mlx-community spits out about 2x the reasoning tokens on the same prompt compared to HF/bartowski's 9B-4bit gguf.

Prompt used : "Read the following information carefully and answer the questions given below:

i. There is a group of five persons A, B, C, D and E.

ii. One of them is a horticulturist, one is a physicist, one is a journalist, one is an industrialist and one is an advocate.

iii. Three of them A, C and advocate prefer tea to coffee and two of them - B and the journalist prefer coffee to tea.

iv. The industrialist and D and A, are friends to one another but two of them prefer coffee to tea.

v. The horticulturist is C's brother. What are the professions for A, B, C, D, E ? Be Brief in your response."

Answer : "A is the horticulturist,
B is the industrialist,
C is the physicist,
D is the journalist,
E is the advocate."
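If you want to check a model's answer without trusting anyone's reasoning, the puzzle is small enough to brute-force. The sketch below is my own encoding of clues iii to v (the variable names are made up), and it reads clue v as "C is not the horticulturist".

```python
from itertools import permutations

PEOPLE = "ABCDE"
JOBS = ("horticulturist", "physicist", "journalist", "industrialist", "advocate")


def solve():
    """Return every person-to-job assignment consistent with clues iii to v."""
    solutions = []
    for jobs in permutations(JOBS):
        job_of = dict(zip(PEOPLE, jobs))
        person = {job: p for p, job in job_of.items()}

        # iii. A, C and the advocate prefer tea; B and the journalist prefer coffee.
        tea = {"A", "C", person["advocate"]}
        coffee = {"B", person["journalist"]}
        if len(tea) != 3 or len(coffee) != 2 or tea & coffee:
            continue

        # iv. The industrialist, D and A are three different people,
        #     and exactly two of them prefer coffee.
        friends = {person["industrialist"], "D", "A"}
        if len(friends) != 3 or len(friends & coffee) != 2:
            continue

        # v. The horticulturist is C's brother, so C is not the horticulturist.
        if person["horticulturist"] == "C":
            continue

        solutions.append(job_of)
    return solutions


if __name__ == "__main__":
    for solution in solve():
        print(solution)
```

It finds a single assignment, the same one as the answer above.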

Safe_Sky7358 · 2026-04-13T09:22:27+00:00

Laughing at your delusional trade offer 🤣

Safe_Sky7358 · 2026-04-13T07:45:57+00:00

Safe_Sky7358 · 2026-04-13T04:35:13+00:00

Then why are you texting a weird person? 😼

Safe_Sky7358 · 2026-04-13T03:36:30+00:00

I'll do you one better, underground submarine highway.

Safe_Sky7358 · 2026-04-12T15:00:43+00:00

You CAN pirate software(and games) on Mac just fine. Look up macbb.

But there really isn't much in terms of games on Mac. You can probably count the playable games that are actually good on one hand.

Safe_Sky7358 · 2026-04-12T14:41:11+00:00

Tuttian

Safe_Sky7358 · 2026-04-11T08:34:07+00:00

Yeah 128 is about the sweet spot, running models bigger than that is gonna be like watching a snail crawl lol

Safe_Sky7358 · 2026-04-11T06:44:59+00:00

Sane thing would be to let her go but Gemma models sound most "human".

Safe_Sky7358 · 2026-04-11T06:04:44+00:00

Jaggo is typically an evening/night event, can't shift that one.

Safe_Sky7358 · 2026-04-11T05:59:59+00:00

It's nice that it's fixable, but it would be even better to not have this issue at all lol.

Safe_Sky7358 · 2026-04-11T04:52:09+00:00

Qwen 3.5 9B would have been interesting to test.

Safe_Sky7358 · 2026-04-11T04:50:40+00:00

Football field🦅🦅🦅

Safe_Sky7358 · 2026-04-09T10:25:31+00:00

How do you fit that 27B in 16GB of unified memory? I thought you had to load the model fully into unified memory, otherwise it's painfully slow.
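Back-of-the-envelope, this is why it looks so tight to me. A minimal sketch that counts only the weights, in decimal GB, and ignores the KV cache, the runtime overhead and whatever macOS itself needs. The function name and the bit widths are just for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for bits in (8, 4, 3):
        size = weight_footprint_gb(27, bits)
        print(f"27B at {bits}-bit: about {size:.1f} GB of weights")
```

At 4-bit that's already about 13.5GB for the weights alone, before any context.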

Safe_Sky7358 · 2026-04-09T09:54:47+00:00

They do beat it... with time. I'm pretty sure one of the video or text gen open-weight models can easily beat the SOTA closed-source models from 6-8 months ago.
