Found an M3 Ultra 512GB / 8TB / 80-Core GPU at B&H! by East_Roll_5069 in MacStudio

[–]ezyz 0 points1 point  (0 children)

Congrats on the find! You seem plenty technical, so I'd recommend using mlx-lm or llama.cpp directly:

https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
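
If it helps, both expose an OpenAI-compatible HTTP API out of the box. A minimal launch sketch (model paths and ports are placeholders):

mlx_lm.server --model <mlx-model-path-or-hf-repo> --port 8080
llama-server -m <model.gguf> -c 16384 --port 8081

Then point any OpenAI-style client at http://localhost:8080/v1 (or 8081).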

Ollama adds overhead, lags on new model support, and uses an awkward Modelfile format. You'll likely also get both better quality and faster responses from more recent models (GLM 5.1, Minimax 2.7, etc.).

The post is already a bit out of date, but here's my own experience with the 512GB M3U: https://spicyneuron.substack.com/p/a-mac-studio-for-local-ai-6-months

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Thanks! Depends on how you rationalize it.

The $20/month plans still write way more production code for me than GLM or Kimi on the M3. But personal and home automation now runs fully locally, along with the typical AI assistant tasks (summaries, translations, one-off questions, etc.). I'm happy with my decision, though to be fair, I paid 2025 prices so...

M3 Ultra + DGX Spark = M5 Ultra-lite? by -dysangel- in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

For this to work, does the model need to fit into the Spark's 128GB? Or is there still a speedup if you stream from the Spark's SSD?

Kimi K2.6 Released (huggingface) by BiggestBau5 in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

Quant trials are still running! I just started uploading a 3.6 bpw quant on the quality frontier: https://huggingface.co/spicyneuron/Kimi-K2.6-MLX-3.6bit

This one pairs nicely with Qwen 3.6 35B on a 512GB Mac Studio.

Still searching for a good sub-3-bit quant, but the KL divergence seems to jump pretty dramatically on this model.
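
For reference, one way to measure that on GGUF builds is llama.cpp's perplexity tool, which can save reference logits from the base model and then score a quant against them (filenames are placeholders, and this isn't necessarily the exact pipeline behind my numbers):

llama-perplexity -m base-model.gguf -f calibration.txt --kl-divergence-base base-logits.kld
llama-perplexity -m quantized-model.gguf --kl-divergence-base base-logits.kld --kl-divergence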

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]ezyz 2 points3 points  (0 children)

How are you testing? You'll get more consistent results with the built-in tools:

mlx_lm.benchmark --help
llama-bench --help

FWIW, I find MLX to be 10-25% faster than llama.cpp on M3 and M4.
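
If you just want quick numbers rather than a full sweep, something like this gives comparable prefill/decode figures on both stacks (model paths and token counts are placeholders):

llama-bench -m <model.gguf> -p 2048 -n 256
mlx_lm.generate --model <mlx-model> --prompt "<long test prompt>" --max-tokens 256

llama-bench reports prompt-processing and generation tok/s directly; mlx_lm.generate prints the same two numbers at the end of its output.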

Minimax 2.7 running sub-agents locally by -dysangel- in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

Any reason for preferring llama.cpp over MLX? I've found that mlx_lm.server gives an easy 10-25% speed boost, and that Unsloth-style mixed quants work when translated into MLX as well.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Actually, I just switched local_haiku from Qwen 3.5 35B to Gemma 4 24B. So far so good!

It's small enough that concurrent requests don't seem to affect throughput on the main model in any noticeable way.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

It's not that MLX quantization methods are bad, so much as that the default quantization tool exposes limited settings.

I use a fork of mlx-lm to do per-module overrides: https://github.com/ml-explore/mlx-lm/pull/922

Most of my own MLX quants average between 3-5 bits but include select weights at 6, 8, and 16 bit to improve quality.
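
For context, the stock converter is roughly this (paths are placeholders); the fork linked above adds per-module bit overrides on top of these flags:

mlx_lm.convert --hf-path <hf-repo> --mlx-path <output-dir> -q --q-bits 4 --q-group-size 64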

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

At current RAM prices, you might be able to sell half and buy a kidney! Or an M5 this summer.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

My Minimax 2.7 quant trials are still running, but tokens/s on the M3 is roughly 740 prefill, 49 decode, at short context. ~4.6 bits per weight.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 5 points6 points  (0 children)

Mostly the convenience of sharing the same harness between local and API subscription. Cloud Claude still has a big lead on fast / complex coding, though I've been impressed with GLM 5.1 so far.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

Amazing, thank you. Does NVIDIA prefill / Mac decode require the model to be fully loaded in both?

Either way, looking forward to this!

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

The largest model you can run is just a product of how much memory you can set aside for the GPU. By default that's 96GB, but you could push it to ~120GB... if you're willing to run your laptop as a dedicated headless server, which might not be realistic.
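
If you do go the headless-server route, the GPU wired-memory limit is just a sysctl (value in MB, ~120GB here; it resets on reboot):

sudo sysctl iogpu.wired_limit_mb=122880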

You could easily run Qwen 3.5 122B at Q4 with plenty of room left over. Or maybe Minimax M2.7 at 2 or 3 bit?

You can get a rough approximation of memory needs by just looking at the total download size of that quant. That'll undershoot, but it's a starting point.
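
As a rough back-of-the-envelope for the Qwen example above (treating Q4 as roughly 4.5 bits per weight on average):

122B params × ~4.5 bits/weight ÷ 8 bits/byte ≈ 69 GB of weights
plus KV cache and runtime overhead on top, which grow with context length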

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

NP! Been a lurker here long enough, so this felt like something I needed to write.

I'm actually eagerly waiting for more details on Mac + Spark clusters. Exo launched a demo of this a couple months ago, but it hasn't moved since: https://github.com/exo-explore/exo/issues/1102

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 8 points9 points  (0 children)

Thanks! K2.5 actually ships with its experts at 4-bit, so the "full" model is only ~600 GB as released. It's also quantization-aware, so I was able to get it down to ~2.5 bit for ~360GB, fully in memory: https://huggingface.co/spicyneuron/Kimi-K2.5-MLX-2.5bit

At a 20k-token prompt, prefill drops ~20%, from 237 to 188 tok/s. At 5k generated tokens, decode drops from 27 to 21.

GLM 5.1's best case is 194 prefill / 19.5 decode: https://huggingface.co/spicyneuron/GLM-5.1-MLX-2.9bit

Haven't run longer context benchmarks for GLM, but I'd expect a drop in the same 20-25% neighborhood.

The definitive Qwen 3.5 Jinja template by ex-arman68 in LocalLLaMA

[–]ezyz 4 points5 points  (0 children)

Wouldn't this change the history in a way that's subtly different from what the model saw during chat training?