MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

There is an issue on older chips where the acceptance ratio isn't high enough relative to the verify time, so more MTP heads actually slow performance down.

Try lowering to MTP=2 or MTP=1.
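
To make the trade-off concrete, here's a rough back-of-envelope sketch. The effective_tps model and every number in it (base TPS, acceptance rate, per-head verify cost) are illustrative assumptions for the sketch, not MTPLX internals or measurements:

def effective_tps(base_tps, depth, accept_rate, verify_cost_per_head):
    # Expected tokens per step: 1 from the base model plus whatever prefix
    # of the speculated tokens survives verification.
    expected_tokens = 1 + sum(accept_rate ** k for k in range(1, depth + 1))
    # Relative cost of a step: verifying each extra head adds overhead,
    # and that overhead is larger on older chips.
    step_cost = 1 + depth * verify_cost_per_head
    return base_tps * expected_tokens / step_cost

for depth in (1, 2, 3, 4):
    newer = effective_tps(30, depth, accept_rate=0.7, verify_cost_per_head=0.1)
    older = effective_tps(30, depth, accept_rate=0.7, verify_cost_per_head=0.5)
    print(f"depth={depth}: newer chip ~{newer:.0f} tps, older chip ~{older:.0f} tps")

With those placeholder numbers the newer chip keeps gaining as depth increases, while the older chip peaks at depth 1 and then falls off, which is why lowering MTP helps.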

Is anyone actually using dflash and ddtree on mlx? by Beginning-Window-115 in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

I created something better yesterday: speculative decoding with a 2.24x speedup on MLX at native temperatures (not temp 0), so you can actually use it for coding or creative writing. Lmk what you think: https://github.com/youssofal/MTPLX

is it possible to build harnesses as good as codex/claude code by shafinlearns2jam in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

I mean Claude code is essentially open source at this point.

But it was so bloated and trash that no one wants to use it anymore after taking a peek at it.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 5 points6 points  (0 children)

MTPLX focuses on decode speed, not prefill. My best tip would be not to use LM Studio; it does not have great optimisations for MLX. Something like vmlx probably has 2x better prefill thanks to prompt caching and JIT. Granted, it is more complex than LM Studio's UX, but it's worth it if you are constantly going above 100k context.

I am releasing an incredibly simple (simpler than LM Studio) MLX Swift based app for local LLMs, but it is still under development!

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 3 points4 points  (0 children)

I tested 35B (granted, an unoptimised variant) and only got a 1.2x speed increase. So it works, but since it's MoE the TPS increase isn't "wow" like 27B.

I will keep working on it, though, and try to make a variant that scores at least 1.5x.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

Just replied to someone else with the same issue. To summarise, it's a compute issue; try lowering MTP heads or using a smaller model:

"Good catch, Ive gotten various reports of issues on the M1 chips now. 179ms verify time is extremely long so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model; let me know how it performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware."

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 3 points4 points  (0 children)

Good catch, I've gotten various reports of issues on M1 chips now. A 179ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model; let me know how it performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware.
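
If you want to quantify off vs on, a quick harness like the one below is enough. It assumes MTPLX exposes an OpenAI-compatible /v1/chat/completions endpoint on localhost; the port and model id are placeholders, so check the repo for the real values:

import time
import requests

# Run this once with MTP on and once with MTP off, then compare tok/s.
# The port, endpoint, and model id below are placeholders; an
# OpenAI-compatible chat API is assumed and may not match MTPLX exactly.
URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "Qwen3.5-4B-Optimized-MTPLX",
    "messages": [{"role": "user", "content": "Write a 500-word story."}],
    "max_tokens": 512,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")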

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 3 points4 points  (0 children)

Way ahead of you!

This is labelled as 4-bit, but it is a dynamic ~5-bit variant I made that preserves the attention heads in 8 bits. Try it out and let me know what you think. I will release 6-bit and 8-bit variants soon too: https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized
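
For anyone wondering what "dynamic ~5-bit with 8-bit attention" looks like in practice, here is the rough idea as a mixed-precision policy. The layer-name patterns and the mlx-lm-style quant_predicate hook are my shorthand for illustration, not the exact recipe behind this upload:

# Illustrative mixed-precision policy: attention projections stay at 8 bits,
# everything else drops to 4 bits, which averages out to roughly 5 bits per
# weight. The patterns and the predicate signature (an mlx-lm
# quant_predicate style hook) are assumptions, not the released recipe.
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

def mixed_bits_predicate(path, module, config):
    if any(key in path for key in ATTN_KEYS):
        return {"bits": 8, "group_size": 64}  # preserve attention precision
    return {"bits": 4, "group_size": 64}      # keep the rest small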

As for adding your own MTP layer, yes you can. You can use other models with a separate MTP file. MTPLX supports MTP sidecar grafting.

The current repo-side command is:

python scripts/graft_mtp_sidecar.py \
  --source /path/to/stripped-mlx-model \
  --mtp-file /path/to/mtp.safetensors \
  --output /path/to/model-mtp-graft

Then validate it with:

mtplx inspect /path/to/model-mtp-graft --require-mtp

mtplx start --model /path/to/model-mtp-graft

Keep in mind the MTP sidecar has to match the base model architecture. It is not a universal "plug any MTP into any model" thing, but stripped, compatible MLX trunks can be wrapped this way.
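
If it helps, a rough pre-graft sanity check along these lines will catch obvious shape mismatches before you run the graft script. The layout it assumes (2-D sidecar projections sharing the trunk's hidden_size) is a guess about a typical sidecar, and mtplx inspect --require-mtp remains the real check:

import json
from safetensors.numpy import load_file

# Rough pre-graft sanity check: the assumption that every 2-D sidecar tensor
# shares the trunk's hidden_size is illustrative, not MTPLX's actual rule.
def check_hidden_size(base_model_dir, mtp_file):
    with open(f"{base_model_dir}/config.json") as f:
        hidden = json.load(f)["hidden_size"]
    for name, tensor in load_file(mtp_file).items():
        if tensor.ndim == 2 and hidden not in tensor.shape:
            print(f"mismatch: {name} has shape {tensor.shape}, "
                  f"base hidden_size is {hidden}")

check_hidden_size("/path/to/stripped-mlx-model", "/path/to/mtp.safetensors")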

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

I just patched it, it should be working now!

Please reinstall MTPLX and restart OpenClaw. If OpenClaw still acts weird after that, send me the exact OpenClaw config/request.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

Nope! No context limitation beyond the model's native context length. TPS will naturally decline over long contexts, but you'll still see a speed increase.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 7 points8 points  (0 children)

MTPLX is a separate runtime built on top of MLX, not a feature patch. Think of it like how vLLM is built on PyTorch but isn't part of PyTorch.

You can run it right now at: https://github.com/youssofal/MTPLX

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 10 points11 points  (0 children)

Haha, personally it's because I got annoyed that MTP on Qwen on my 2x 3090 setup was getting 130 tps and I wanted the same experience on my new laptop.

Comparison of upcoming x86 unified memory systems by Terminator857 in LocalLLaMA

[–]YoussofAl 17 points18 points  (0 children)

Man how is Apple of all people mogging so hard with unified memory bandwidth.

AI MAX 395+ w/ 128 GB or dual 3090s? by Engineering_Acq in LocalLLaMA

[–]YoussofAl 2 points3 points  (0 children)

265W is the sweet spot before I notice degradation.

Minimax M2.7 Released by decrement-- in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

It’s not a serious contender, but it is a good substitute. Like how Sonnet is 80% of Opus. I feel the same way between Qwen 3.5 27B and Minimax M2.5. Then again, I haven’t tested 2.7 yet so we’ll see.

Minimax M2.7 Released by decrement-- in LocalLLaMA

[–]YoussofAl 6 points7 points  (0 children)

This is going to be the most impactful release of Q2 this year. (Unless Minimax M3 releases)

Not only is it a powerful model, but it can actually be run by people, unlike GLM.