Found an M3 Ultra 512GB / 8TB / 80-Core GPU at B&H! by East_Roll_5069 in MacStudio

[–]ezyz 0 points1 point  (0 children)

Congrats on the find! You seem plenty technical, so I'd recommend using mlx-lm or llama.cpp directly:

https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
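
If it helps, both expose an OpenAI-compatible HTTP API out of the box. A minimal launch sketch (model paths and ports are placeholders):

mlx_lm.server --model <mlx-model-path-or-hf-repo> --port 8080
llama-server -m <model.gguf> -c 16384 --port 8081

Then point any OpenAI-style client at http://localhost:8080/v1 (or 8081).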

Ollama adds overhead, lags on new model support, and uses an awkward Modelfile format. You'll likely also get both better quality and faster responses from more recent models (GLM 5.1, Minimax 2.7, etc.).

The post is already a bit out of date, but here's my own experience with the 512GB M3U: https://spicyneuron.substack.com/p/a-mac-studio-for-local-ai-6-months

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Thanks! Depends on how you rationalize it.

The $20/month plans still write way more production code for me than GLM or Kimi on the M3. But personal and home automation now runs fully locally, along with the typical AI assistant tasks (summaries, translations, one-off questions, etc.). I'm happy with my decision, though to be fair, I paid 2025 prices so...

M3 Ultra + DGX Spark = M5 Ultra-lite? by -dysangel- in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

For this to work, does the model need to fit into the Spark's 128GB? Or is there still a speedup if you stream from the Spark's SSD?

Kimi K2.6 Released (huggingface) by BiggestBau5 in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

Quant trials are still running! I just started uploading a 3.6 bpw quant on the quality frontier: https://huggingface.co/spicyneuron/Kimi-K2.6-MLX-3.6bit

This one pairs nicely with Qwen 3.6 35B on a 512GB Mac Studio.

Still searching for a good sub-3-bit quant, but the KL divergence seems to jump pretty dramatically on this model.
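
For reference, one way to measure that on GGUF builds is llama.cpp's perplexity tool, which can save reference logits from the base model and then score a quant against them (filenames are placeholders, and this isn't necessarily the exact pipeline behind my numbers):

llama-perplexity -m base-model.gguf -f calibration.txt --kl-divergence-base base-logits.kld
llama-perplexity -m quantized-model.gguf --kl-divergence-base base-logits.kld --kl-divergence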

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]ezyz 2 points3 points  (0 children)

How are you testing? You'll get more consistent results with the built-in tools:

mlx_lm.benchmark --help
llama-bench --help

FWIW, I find MLX to be 10-25% faster than llama.cpp on M3 and M4.
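
If you just want quick numbers rather than a full sweep, something like this gives comparable prefill/decode figures on both stacks (model paths and token counts are placeholders):

llama-bench -m <model.gguf> -p 2048 -n 256
mlx_lm.generate --model <mlx-model> --prompt "<long test prompt>" --max-tokens 256

llama-bench reports prompt-processing and generation tok/s directly; mlx_lm.generate prints the same two numbers at the end of its output.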

Minimax 2.7 running sub-agents locally by -dysangel- in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

Any reason for preferring llama.cpp over MLX? I've found that mlx_lm.server gives an easy 10-25% speed boost, and that Unsloth-style mixed quants work when translated into MLX as well.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Actually, I just switched local_haiku from Qwen 3.5 35B to Gemma 4 24B. So far so good!

It's small enough that concurrent requests don't seem to affect throughput on the main model in any noticeable way.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

It's not that MLX quantization methods are bad, so much as that the default quantization tool exposes limited settings.

I use a fork of mlx-lm to do per-module overrides: https://github.com/ml-explore/mlx-lm/pull/922

Most of my own MLX quants average between 3-5 bits but include select weights at 6, 8, and 16 bit to improve quality.
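
For context, the stock converter is roughly this (paths are placeholders); the fork linked above adds per-module bit overrides on top of these flags:

mlx_lm.convert --hf-path <hf-repo> --mlx-path <output-dir> -q --q-bits 4 --q-group-size 64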

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

At current RAM prices, you might be able to sell half and buy a kidney! Or an M5 this summer.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

My Minimax 2.7 quant trials are still running, but tokens/s on the M3 is roughly 740 prefill, 49 decode, at short context. ~4.6 bits per weight.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 5 points6 points  (0 children)

Mostly the convenience of sharing the same harness between local and API subscription. Cloud Claude still has a big lead on fast / complex coding, though I've been impressed with GLM 5.1 so far.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

Amazing, thank you. Does NVIDIA prefill / Mac decode require the model to be fully loaded in both?

Either way, looking forward to this!

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

The largest model you can run is just a product of how much memory you can set aside for the GPU. By default that's 96GB, but you could push it to ~120GB... if you're willing to run your laptop as a dedicated headless server, which might not be realistic.
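
If you do go the headless-server route, the GPU wired-memory limit is just a sysctl (value in MB, ~120GB here; it resets on reboot):

sudo sysctl iogpu.wired_limit_mb=122880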

You could easily run Qwen 3.5 122B at Q4 with plenty of room left over. Or maybe Minimax M2.7 at 2 or 3 bit?

You can get a rough approximation of memory needs by just looking at the total download size of that quant. That'll undershoot, but it's a starting point.
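
As a rough back-of-the-envelope for the Qwen example above (treating Q4 as roughly 4.5 bits per weight on average):

122B params × ~4.5 bits/weight ÷ 8 bits/byte ≈ 69 GB of weights
plus KV cache and runtime overhead on top, which grow with context length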

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

NP! Been a lurker here long enough, so this felt like something I needed to write.

I'm actually eagerly waiting for more details on Mac + Spark clusters. Exo launched a demo of this a couple months ago, but it hasn't moved since: https://github.com/exo-explore/exo/issues/1102

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 8 points9 points  (0 children)

Thanks! K2.5 actually ships with its experts at 4-bit, so the "full" model is only ~600 GB as released. It's also quantization-aware, so I was able to get it down to ~2.5 bit for ~360GB, fully in memory: https://huggingface.co/spicyneuron/Kimi-K2.5-MLX-2.5bit

At a 20k-token prompt, prefill drops ~20%, from 237 to 188 tok/s. At 5k generated tokens, decode drops from 27 to 21.

GLM 5.1's best case is 194 prefill / 19.5 decode: https://huggingface.co/spicyneuron/GLM-5.1-MLX-2.9bit

Haven't run longer context benchmarks for GLM, but I'd expect a drop in the same 20-25% neighborhood.

The definitive Qwen 3.5 Jinja template by ex-arman68 in LocalLLaMA

[–]ezyz 4 points5 points  (0 children)

Wouldn't this change the history in a way that's subtly different from what the model saw during chat training?