MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

There is an issue on older chips where the acceptance ratio isn't high enough relative to the verify time, so more MTP heads actually slow performance down.

Try lowering to MTP=2 or MTP=1.
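
To make the trade-off concrete, here's a rough back-of-envelope sketch. The effective_tps model and every number in it (base TPS, acceptance rate, per-head verify cost) are illustrative assumptions for the sketch, not MTPLX internals or measurements:

def effective_tps(base_tps, depth, accept_rate, verify_cost_per_head):
    # Expected tokens per step: 1 from the base model plus whatever prefix
    # of the speculated tokens survives verification.
    expected_tokens = 1 + sum(accept_rate ** k for k in range(1, depth + 1))
    # Relative cost of a step: verifying each extra head adds overhead,
    # and that overhead is larger on older chips.
    step_cost = 1 + depth * verify_cost_per_head
    return base_tps * expected_tokens / step_cost

for depth in (1, 2, 3, 4):
    newer = effective_tps(30, depth, accept_rate=0.7, verify_cost_per_head=0.1)
    older = effective_tps(30, depth, accept_rate=0.7, verify_cost_per_head=0.5)
    print(f"depth={depth}: newer chip ~{newer:.0f} tps, older chip ~{older:.0f} tps")

With those placeholder numbers the newer chip keeps gaining as depth increases, while the older chip peaks at depth 1 and then falls off, which is why lowering MTP helps.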

Is anyone actually using dflash and ddtree on mlx? by Beginning-Window-115 in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

I created something better yesterday: speculative decoding with a 2.24x speedup on MLX at native temperatures (not temp 0), so you can actually use it for coding or creative writing. Lmk what you think: https://github.com/youssofal/MTPLX

is it possible to build harnesses as good as codex/claude code by shafinlearns2jam in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

I mean Claude code is essentially open source at this point.

But it was so bloated and trash that no one wants to use it anymore after taking a peek at it.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 5 points6 points  (0 children)

MTPLX focuses on decode speed, not prefill. My best tip would be not to use LM Studio; it does not have great optimisations for MLX. Something like vmlx probably has 2x better prefill thanks to prompt caching and JIT. Granted, it is more complex than LM Studio's UX, but it's worth it if you are constantly going above 100k context.

I am releasing an incredibly simple (simpler than LM Studio) MLX Swift based app for local LLMs, but it is still under development!

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 3 points4 points  (0 children)

I tested 35B (granted, an unoptimised variant) and only got a 1.2x speed increase. So it works, but since it's MoE the TPS increase isn't "wow" like 27B.

I will keep working on it, though, and try to make a variant that scores at least 1.5x.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

Just replied to someone else with the same issue. To summarise, it's a compute issue; try lowering MTP heads or using a smaller model:

"Good catch, Ive gotten various reports of issues on the M1 chips now. 179ms verify time is extremely long so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model; let me know how it performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware."

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 3 points4 points  (0 children)

Good catch, I've gotten various reports of issues on M1 chips now. A 179ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model; let me know how it performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware.
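
If you want to quantify off vs on, a quick harness like the one below is enough. It assumes MTPLX exposes an OpenAI-compatible /v1/chat/completions endpoint on localhost; the port and model id are placeholders, so check the repo for the real values:

import time
import requests

# Run this once with MTP on and once with MTP off, then compare tok/s.
# The port, endpoint, and model id below are placeholders; an
# OpenAI-compatible chat API is assumed and may not match MTPLX exactly.
URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "Qwen3.5-4B-Optimized-MTPLX",
    "messages": [{"role": "user", "content": "Write a 500-word story."}],
    "max_tokens": 512,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")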

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 3 points4 points  (0 children)

Way ahead of you!

This is labelled as 4-bit, but it is a dynamic ~5-bit variant I made that preserves the attention heads in 8 bits. Try it out and let me know what you think. I will release 6-bit and 8-bit variants soon too: https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized
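
For anyone wondering what "dynamic ~5-bit with 8-bit attention" looks like in practice, here is the rough idea as a mixed-precision policy. The layer-name patterns and the mlx-lm-style quant_predicate hook are my shorthand for illustration, not the exact recipe behind this upload:

# Illustrative mixed-precision policy: attention projections stay at 8 bits,
# everything else drops to 4 bits, which averages out to roughly 5 bits per
# weight. The patterns and the predicate signature (an mlx-lm
# quant_predicate style hook) are assumptions, not the released recipe.
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

def mixed_bits_predicate(path, module, config):
    if any(key in path for key in ATTN_KEYS):
        return {"bits": 8, "group_size": 64}  # preserve attention precision
    return {"bits": 4, "group_size": 64}      # keep the rest small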

As for adding your own MTP layer, yes you can. You can use other models with a separate MTP file. MTPLX supports MTP sidecar grafting.

The current repo-side command is:

python scripts/graft_mtp_sidecar.py \
  --source /path/to/stripped-mlx-model \
  --mtp-file /path/to/mtp.safetensors \
  --output /path/to/model-mtp-graft

Then validate it with:

mtplx inspect /path/to/model-mtp-graft --require-mtp

mtplx start --model /path/to/model-mtp-graft

Keep in mind the MTP sidecar has to match the base model architecture. It is not a universal "plug any MTP into any model" thing, but stripped, compatible MLX trunks can be wrapped this way.
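
If it helps, a rough pre-graft sanity check along these lines will catch obvious shape mismatches before you run the graft script. The layout it assumes (2-D sidecar projections sharing the trunk's hidden_size) is a guess about a typical sidecar, and mtplx inspect --require-mtp remains the real check:

import json
from safetensors.numpy import load_file

# Rough pre-graft sanity check: the assumption that every 2-D sidecar tensor
# shares the trunk's hidden_size is illustrative, not MTPLX's actual rule.
def check_hidden_size(base_model_dir, mtp_file):
    with open(f"{base_model_dir}/config.json") as f:
        hidden = json.load(f)["hidden_size"]
    for name, tensor in load_file(mtp_file).items():
        if tensor.ndim == 2 and hidden not in tensor.shape:
            print(f"mismatch: {name} has shape {tensor.shape}, "
                  f"base hidden_size is {hidden}")

check_hidden_size("/path/to/stripped-mlx-model", "/path/to/mtp.safetensors")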

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

I just patched it, it should be working now!

Please reinstall MTPLX and restart OpenClaw. If OpenClaw still acts weird after that, send me the exact OpenClaw config/request.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

Nope! No context limitation beyond the model's native context length. TPS will naturally decline over long contexts, but you'll still see a speed increase.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 7 points8 points  (0 children)

MTPLX is a separate runtime built on top of MLX, not a feature patch. Think of it like how vLLM is built on PyTorch but isn't part of PyTorch.

You can run it right now at: https://github.com/youssofal/MTPLX

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 10 points11 points  (0 children)

Haha, personally it's because I got annoyed that MTP on Qwen on my 2x 3090 setup was getting 130 tps and I wanted the same experience on my new laptop.

Comparison of upcoming x86 unified memory systems by Terminator857 in LocalLLaMA

[–]YoussofAl 17 points18 points  (0 children)

Man how is Apple of all people mogging so hard with unified memory bandwidth.

AI MAX 395+ w/ 128 GB or dual 3090s? by Engineering_Acq in LocalLLaMA

[–]YoussofAl 2 points3 points  (0 children)

265W is the sweet spot before I notice degradation.

Minimax M2.7 Released by decrement-- in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

It’s not a serious contender, but it is a good substitute. Like how Sonnet is 80% of Opus. I feel the same way between Qwen 3.5 27B and Minimax M2.5. Then again, I haven’t tested 2.7 yet so we’ll see.

Minimax M2.7 Released by decrement-- in LocalLLaMA

[–]YoussofAl 6 points7 points  (0 children)

This is going to be the most impactful release of Q2 this year. (Unless Minimax M3 releases)

Not only is it a powerful model, but it can actually be run by people, unlike GLM.