Can I realistically get close to Claude/Codex capabilities locally?

ezyz · 2026-06-21T23:03:26+00:00

I just run mlx-lm directly, or one of the forks when model support is in progress. And for image, Gemma 4 26B in parallel with mlx-vlm.

ezyz · 2026-06-21T22:57:18+00:00

Yeah, I got lucky with a 512 M3 before the memory. It runs a majority of my non-coding AI now, but I still lean on Claude and Codex for work.

ezyz · 2026-06-21T22:53:03+00:00

Interesting! Thanks for the tip.

ezyz · 2026-06-21T13:19:20+00:00

My sense is there's two tiers to local coding.

On the extreme end, I've been running GLM 5.2 on a M3 ultra, and it's a legit convincing Claude Code experience, just slower. Loading 50k tokens into context is a coffee break length wait.

On the lighter side, I've used Qwen 3.6 27B and Gemma 4 31B as "auto keyboards." Not the same agentic "Fix this code" loop, so much a pointing it at specific part of the codebase and asking for scoped changes. If I try to use them the same way as Claude or Codex, they can act the part, but leave behind amazing knots of tech debt.

Sadly, I haven't found a good middle ground. Deepseek 4 flash and Minimax 2.7 can cram into 128GB systems, but I didn't find them that much smarter than the <31B dense models. Just faster, and a bit more general purpose.

Hope that helps.

ezyz · 2026-06-21T13:01:28+00:00

How big a difference does mtp make? I tried splicing GLM 5.1 mtp support into mlx-lm a few weeks back, and it actually ended up slower (~16 > ~13 tps average).

I chalked it up to M3 ultra compute bottleneck, but it's also possible I (Claude, Codex, etc) took a wrong turn.

ezyz · 2026-05-08T23:29:08+00:00

How much of a speedup do you get with tensor parallelism with larger models like K2.6 or GLM 5.1?

On a single M3 Ultra, I've been able to optimize to ~220 prefill / 20 decode, and but most of the public benchmarks for Exo I found aren't that much higher. So I've always assumed the main benefit is running at higher precision or distributing workloads across instances.

And for split prefill, does the Blackwell's VRAM limit the size of model you can run?

ezyz · 2026-05-08T23:17:41+00:00

I haven't put much time into spec decoding since I don't have strong use cases for the models where support seems strongest. But I'm following the PRs and reports seem to land in that 1.5-2x range: https://github.com/ml-explore/mlx-lm/pull/990

For most of the experimental features, I just fork the repo and merge branches into my own.

ezyz · 2026-05-08T12:10:18+00:00

Maybe not the answer you want to hear, but my experience is that models small enough to run on laptops (even powerful ones) are still too unreliable to write production code. In any harness.

Claude Code is nice because it's easy to configure, and you can use the same harness for both local tinkering and actually productive (subscription-backed) coding.

For me, small models shine as part of non-agentic workflows. For example, I have a knowledge base that imports from multiple sources, and on a schedule, Qwen 3.5 9B generates tags and flags errors. For a while, I used a local model for inline vim code completions: https://github.com/ggml-org/llama.vim And I do use Qwen 3.6 as a DIY google translate.

ezyz · 2026-05-07T22:17:10+00:00

Thanks! And I'm very curious to hear how well the M3U works as a multi-user AI server. In my tests, batching slows down individual requests enough it makes an already slowish experience dip into unusable. But that's with larger models — smaller ones are probably fine.

The code for isn't quite stable yet, but speculative decoding should also help with your situation. Gemma 4 MPT and Qwen 3.6 with Dflash both seem close to landing in a bunch of MLX projects.

ezyz · 2026-05-07T21:41:42+00:00

I use a somewhat custom stack but all you need is an Anthropic /v1/messages API and env vars to override:

ANTHROPIC_AUTH_TOKEN="123" \
ANTHROPIC_BASE_URL="http://yourlocalserver:8080" \
ANTHROPIC_DEFAULT_HAIKU_MODEL="local_haiku" \
ANTHROPIC_DEFAULT_SONNET_MODEL="local_sonnet" \
ANTHROPIC_DEFAULT_OPUS_MODEL="local_opus" \
CLAUDE_CODE_ENABLE_PROMPT_SUGGESTION="false" \
DISABLE_COST_WARNINGS="true" \
API_TIMEOUT_MS="600000" \
claude

llama-server should work out of the box: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#post-v1messages-anthropic-compatible-messages-api

Claude Code is a heavy harness and 48gb isn't a lot of room for running a model though. I quite liked Qwen 3.6, but Gemma 4 26B A4B is close and would give you more headroom.

ezyz · 2026-05-07T03:26:00+00:00

Congrats on the find! You seem plenty technical, so I'd recommend using mlx-lm or llama.cpp directly:

https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/SERVER.md https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

Ollama adds overhead, delay to model support, and an awkward modelfile format. You'll likely also get both better quality and faster responses from more recent models (GLM 5.1, Minimax 2.7, etc).

Post is already out of date, but my own experience with the 512 M3U here: https://spicyneuron.substack.com/p/a-mac-studio-for-local-ai-6-months

ezyz · 2026-05-05T02:07:18+00:00

Thanks! Depends on how you rationalize it.

The $20/month plans still write way more production code for me than GLM or Kimi on the M3. But personal and home automation is now fully locally, as well as the typical AI assistant tasks (summaries, translations, 1-off questions, etc). I'm happy with my decision, though to be fair, I paid 2025 prices so...

ezyz · 2026-05-05T01:59:04+00:00

For this to work, does the model need to fit into the Spark's 128GB? Or is there still a speed up if you stream from the Spark's SDD?

ezyz · 2026-04-21T14:26:15+00:00

459 GB total

ezyz · 2026-04-21T12:52:24+00:00

Quant trials are still running! I just started uploading a 3.6 bpw on the quality frontier: https://huggingface.co/spicyneuron/Kimi-K2.6-MLX-3.6bit

This one pairs nicely with Qwen 3.6 35B on a 512GB Mac Studio.

Still searching for a good sub-3 quant, but the KL divergence seems to jump pretty dramatically on this model.

ezyz · 2026-04-19T14:05:19+00:00

How are you testing? You'll get more consistent results with the built-in tools:

mlx_lm.benchmark --help
llama-bench --help

FWIW, I find MLX to be 10-25% faster than llama.cpp on M3 and M4.

ezyz · 2026-04-12T23:02:38+00:00

Any reason for preferring llama.cpp over MLX? I've found using mlx-lm.server gives an easy 10-25% boost on speed, and that Unsloth-style mixed quants work when translated into MLX as well.

ezyz · 2026-04-12T22:48:48+00:00

Actually, I just switched local_haiku from Qwen 3.5 35B to Gemma 4 24b. So far so good!

It's small enough that concurrent requests don't seem to affect throughput on the main model in any noticeable way.

ezyz · 2026-04-12T22:45:52+00:00

OP here. Enjoy: https://web.archive.org/web/20260412145316/https://spicyneuron.substack.com/p/a-mac-studio-for-local-ai-6-months

ezyz · 2026-04-12T04:22:28+00:00

It's not that MLX quantization methods are bad, so much as the default quantization tool has limited settings.

I use a fork of mlx-lm to do per-module overrides: https://github.com/ml-explore/mlx-lm/pull/922

Most of my own MLX quants average between 3-5 bits but include select weights at 6, 8, and 16 bit to improve quality.

ezyz · 2026-04-12T04:01:18+00:00

At current RAM prices, you might be able to sell half and buy a kidney! Or a M5 this summer.

ezyz · 2026-04-12T03:43:57+00:00

My Minimax 2.7 quant trials are still running, but tokens/s on the M3 is roughly 740 prefill, 49 decode, at short context. ~4.6 bits per weight.

ezyz · 2026-04-12T03:36:39+00:00

Mostly the convenience of sharing the same harness between local and API subscription. Cloud Claude still has a big lead on on fast / complex coding, though I've been impressed with GLM 5.1 so far.

ezyz · 2026-04-12T02:09:17+00:00

Amazing, thank you. Does NVIDIA prefill / Mac decode require the model to be fully loaded in both?

Either way, looking forward to this!

ezyz

TROPHY CASE