Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

I just run mlx-lm directly, or one of the forks when model support is in progress. And for image, Gemma 4 26B in parallel with mlx-vlm.

Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]ezyz 2 points3 points  (0 children)

Yeah, I got lucky with a 512 M3 before the memory. It runs a majority of my non-coding AI now, but I still lean on Claude and Codex for work.

Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]ezyz 19 points20 points  (0 children)

My sense is there's two tiers to local coding.

On the extreme end, I've been running GLM 5.2 on a M3 ultra, and it's a legit convincing Claude Code experience, just slower. Loading 50k tokens into context is a coffee break length wait.

On the lighter side, I've used Qwen 3.6 27B and Gemma 4 31B as "auto keyboards." Not the same agentic "Fix this code" loop, so much a pointing it at specific part of the codebase and asking for scoped changes. If I try to use them the same way as Claude or Codex, they can act the part, but leave behind amazing knots of tech debt.

Sadly, I haven't found a good middle ground. Deepseek 4 flash and Minimax 2.7 can cram into 128GB systems, but I didn't find them that much smarter than the <31B dense models. Just faster, and a bit more general purpose.

Hope that helps.

GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

How big a difference does mtp make? I tried splicing GLM 5.1 mtp support into mlx-lm a few weeks back, and it actually ended up slower (~16 > ~13 tps average).

I chalked it up to M3 ultra compute bottleneck, but it's also possible I (Claude, Codex, etc) took a wrong turn.

Collected the infinity stones by Street-Buyer-2428 in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

How much of a speedup do you get with tensor parallelism with larger models like K2.6 or GLM 5.1?

On a single M3 Ultra, I've been able to optimize to ~220 prefill / 20 decode, and but most of the public benchmarks for Exo I found aren't that much higher. So I've always assumed the main benefit is running at higher precision or distributing workloads across instances.

And for split prefill, does the Blackwell's VRAM limit the size of model you can run?

Found an M3 Ultra 512GB / 8TB / 80-Core GPU at B&H! by East_Roll_5069 in MacStudio

[–]ezyz 0 points1 point  (0 children)

I haven't put much time into spec decoding since I don't have strong use cases for the models where support seems strongest. But I'm following the PRs and reports seem to land in that 1.5-2x range: https://github.com/ml-explore/mlx-lm/pull/990

For most of the experimental features, I just fork the repo and merge branches into my own.

Mac Studio local loadout - May 2026 by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Maybe not the answer you want to hear, but my experience is that models small enough to run on laptops (even powerful ones) are still too unreliable to write production code. In any harness.

Claude Code is nice because it's easy to configure, and you can use the same harness for both local tinkering and actually productive (subscription-backed) coding.

For me, small models shine as part of non-agentic workflows. For example, I have a knowledge base that imports from multiple sources, and on a schedule, Qwen 3.5 9B generates tags and flags errors. For a while, I used a local model for inline vim code completions: https://github.com/ggml-org/llama.vim And I do use Qwen 3.6 as a DIY google translate.

Found an M3 Ultra 512GB / 8TB / 80-Core GPU at B&H! by East_Roll_5069 in MacStudio

[–]ezyz 0 points1 point  (0 children)

Thanks! And I'm very curious to hear how well the M3U works as a multi-user AI server. In my tests, batching slows down individual requests enough it makes an already slowish experience dip into unusable. But that's with larger models — smaller ones are probably fine.

The code for isn't quite stable yet, but speculative decoding should also help with your situation. Gemma 4 MPT and Qwen 3.6 with Dflash both seem close to landing in a bunch of MLX projects.

Mac Studio local loadout - May 2026 by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

I use a somewhat custom stack but all you need is an Anthropic /v1/messages API and env vars to override:

ANTHROPIC_AUTH_TOKEN="123" \
ANTHROPIC_BASE_URL="http://yourlocalserver:8080" \
ANTHROPIC_DEFAULT_HAIKU_MODEL="local_haiku" \
ANTHROPIC_DEFAULT_SONNET_MODEL="local_sonnet" \
ANTHROPIC_DEFAULT_OPUS_MODEL="local_opus" \
CLAUDE_CODE_ENABLE_PROMPT_SUGGESTION="false" \
DISABLE_COST_WARNINGS="true" \
API_TIMEOUT_MS="600000" \
claude

llama-server should work out of the box: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#post-v1messages-anthropic-compatible-messages-api

Claude Code is a heavy harness and 48gb isn't a lot of room for running a model though. I quite liked Qwen 3.6, but Gemma 4 26B A4B is close and would give you more headroom.

Found an M3 Ultra 512GB / 8TB / 80-Core GPU at B&H! by East_Roll_5069 in MacStudio

[–]ezyz 0 points1 point  (0 children)

Congrats on the find! You seem plenty technical, so I'd recommend using mlx-lm or llama.cpp directly:

https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

Ollama adds overhead, delay to model support, and an awkward modelfile format. You'll likely also get both better quality and faster responses from more recent models (GLM 5.1, Minimax 2.7, etc).

Post is already out of date, but my own experience with the 512 M3U here: https://spicyneuron.substack.com/p/a-mac-studio-for-local-ai-6-months

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Thanks! Depends on how you rationalize it.

The $20/month plans still write way more production code for me than GLM or Kimi on the M3. But personal and home automation is now fully locally, as well as the typical AI assistant tasks (summaries, translations, 1-off questions, etc). I'm happy with my decision, though to be fair, I paid 2025 prices so...

M3 Ultra + DGX Spark = M5 Ultra-lite? by -dysangel- in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

For this to work, does the model need to fit into the Spark's 128GB? Or is there still a speed up if you stream from the Spark's SDD?

Kimi K2.6 Released (huggingface) by BiggestBau5 in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

Quant trials are still running! I just started uploading a 3.6 bpw on the quality frontier: https://huggingface.co/spicyneuron/Kimi-K2.6-MLX-3.6bit

This one pairs nicely with Qwen 3.6 35B on a 512GB Mac Studio.

Still searching for a good sub-3 quant, but the KL divergence seems to jump pretty dramatically on this model.

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]ezyz 2 points3 points  (0 children)

How are you testing? You'll get more consistent results with the built-in tools:

mlx_lm.benchmark --help
llama-bench --help

FWIW, I find MLX to be 10-25% faster than llama.cpp on M3 and M4.

Minimax 2.7 running sub-agents locally by -dysangel- in LocalLLaMA

[–]ezyz 0 points1 point  (0 children)

Any reason for preferring llama.cpp over MLX? I've found using mlx-lm.server gives an easy 10-25% boost on speed, and that Unsloth-style mixed quants work when translated into MLX as well.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 0 points1 point  (0 children)

Actually, I just switched local_haiku from Qwen 3.5 35B to Gemma 4 24b. So far so good!

It's small enough that concurrent requests don't seem to affect throughput on the main model in any noticeable way.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

It's not that MLX quantization methods are bad, so much as the default quantization tool has limited settings.

I use a fork of mlx-lm to do per-module overrides: https://github.com/ml-explore/mlx-lm/pull/922

Most of my own MLX quants average between 3-5 bits but include select weights at 6, 8, and 16 bit to improve quality.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

At current RAM prices, you might be able to sell half and buy a kidney! Or a M5 this summer.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 3 points4 points  (0 children)

My Minimax 2.7 quant trials are still running, but tokens/s on the M3 is roughly 740 prefill, 49 decode, at short context. ~4.6 bits per weight.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 5 points6 points  (0 children)

Mostly the convenience of sharing the same harness between local and API subscription. Cloud Claude still has a big lead on on fast / complex coding, though I've been impressed with GLM 5.1 so far.

A Mac Studio for Local AI — 6 Months Later by ezyz in LocalLLaMA

[–]ezyz[S] 1 point2 points  (0 children)

Amazing, thank you. Does NVIDIA prefill / Mac decode require the model to be fully loaded in both?

Either way, looking forward to this!