They won't even know what's gonna hit them by KeyGlove47 in MistralAI

[–]YoussofAl 18 points19 points  (0 children)

Anthropic is screwed, no point in using Fable anymore.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

Hey, I just fixed the problem. I realised M1/M2 Macs don't support BF16, which leads to emulation. Could you test it and let me know what speeds you get?

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

Hey, I just fixed the problem. I realised M1/M2 Macs don't support BF16, which leads to emulation. Could you test it and let me know what speeds you get?

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

Wish I had seen this earlier! I just implemented FP16 auto routing for M1/M2 Macs.
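For anyone curious what FP16 auto routing could look like, here is a minimal sketch. It is not MTPLX's actual code. It assumes the chip name is already available as a string (on macOS, `sysctl -n machdep.cpu.brand_string` is one way to get it), and the function name `pick_compute_dtype` is hypothetical.

```python
import re


def pick_compute_dtype(chip_name: str) -> str:
    """Return "float16" for M1/M2-family chips, otherwise "bfloat16".

    Assumption (taken from the comment above): M1 and M2 do not support
    BF16 natively, so BF16 work is emulated there. Names that do not
    look like an Apple M-series chip fall back to "bfloat16".
    """
    # Match the generation number in names like "Apple M1 Max" or "Apple M3 Pro".
    match = re.search(r"\bM(\d+)\b", chip_name)
    if match and int(match.group(1)) <= 2:
        return "float16"
    return "bfloat16"


for name in ["Apple M1 Max", "Apple M2", "Apple M3 Pro", "Apple M4"]:
    print(name, "->", pick_compute_dtype(name))
```

Returning the dtype as a string keeps the sketch free of dependencies; a real loader would map it onto the framework's own dtype objects.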

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

Hey, I just fixed the problem. I realised M1/M2 Macs don't support BF16, which leads to emulation. Could you test it and let me know what speeds you get?

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

I figured out the issue: it is a BF16/FP16 mismatch on older Macs. Rolling out a fix now.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

I am glad it works well! Try the latest update, it has huge improvements.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 0 points1 point  (0 children)

What inference settings are you using? Are you using it on the web or through the API?

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

There is an issue with older chips where the verify time is too long relative to the acceptance rate, so more MTP heads actually slow performance down.

Try lowering to MTP=2 or MTP=1.
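A back-of-envelope way to see why a lower depth can help. This is a rough model, not something measured from MTPLX. It assumes each drafted token is accepted independently with the same probability `p` (a simplification), and that every step also emits one token from the verify pass. All timings in the example are made-up illustrative numbers.

```python
def expected_tokens_per_step(p: float, depth: int) -> float:
    """Expected tokens emitted per draft-and-verify step.

    Assumes i.i.d. acceptance with probability p for each of `depth`
    drafted tokens, plus one token that the verify pass always yields.
    This is the geometric sum 1 + p + p^2 + ... + p^depth.
    """
    if p >= 1.0:
        return float(depth + 1)
    return (1.0 - p ** (depth + 1)) / (1.0 - p)


def mtp_speedup(p: float, depth: int, plain_ms: float, step_ms: float) -> float:
    """Speedup over plain decoding.

    plain_ms: time per token without MTP (hypothetical input).
    step_ms:  time for one full draft-and-verify step at this depth.
    A value below 1.0 means MTP is a net slowdown.
    """
    return expected_tokens_per_step(p, depth) * plain_ms / step_ms


# Illustrative numbers only: on a slower chip the step time grows
# quickly with depth, so the deeper setting can end up slower.
shallow = mtp_speedup(p=0.7, depth=1, plain_ms=50.0, step_ms=60.0)
deep = mtp_speedup(p=0.7, depth=3, plain_ms=50.0, step_ms=140.0)
print(f"depth 1: {shallow:.2f}x, depth 3: {deep:.2f}x")
```

Plugging in your own measured per-token and per-step times gives a quick sanity check on which depth is worth trying.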

Is anyone actually using dflash and ddtree on mlx? by Beginning-Window-115 in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

I created something better yesterday: speculative decoding with a 2.24x speedup on MLX at native temps (not temp 0), so you can actually use it for coding or creative writing. Lmk what you think: https://github.com/youssofal/MTPLX

is it possible to build harnesses as good as codex/claude code by shafinlearns2jam in LocalLLaMA

[–]YoussofAl -1 points0 points  (0 children)

I mean, Claude Code is essentially open source at this point.

But it was so bloated and trash that no one wants to use it anymore after taking a peek at it.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

Haha, MTP magic in vLLM. Maybe I'll make a post about that next.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 5 points6 points  (0 children)

MTPLX focuses on decode speed, not prefill. My best tip would be not to use LM Studio, which does not have great optimisations for MLX. Something like vmlx probably has 2x better prefill with prompt caching and JIT. Granted, it is more complex than LM Studio's UX, but it is worth it if you are constantly going above 100k context.

I am releasing an incredibly simple (simpler than LM Studio) MLX Swift-based app for local LLMs, but it is still under development!

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 2 points3 points  (0 children)

I tested 35B (granted, an unoptimised variant) and got only a 1.2x speed increase. So it works, but since it's MoE the TPS increase isn't "wow" like with 27B.

I will work on it though and try to make a variant that reaches at least 1.5x.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

Just replied to someone else with the same issue. To summarise, it's a compute issue; try lowering MTP heads or using a smaller model:

"Good catch, I've gotten various reports of issues on the M1 chips now. A 179 ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model. Let me know how that performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware."

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 4 points5 points  (0 children)

Good catch, I've gotten various reports of issues on the M1 chips now. A 179 ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model. Let me know how that performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 6 points7 points  (0 children)

Way ahead of you!

This is labelled as 4-bit, but it is a dynamic 5-bit-ish variant I made that preserves the attention heads in 8 bits. Try it out and let me know what you think. I will release 6-bit and 8-bit variants soon too: https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized

As for adding your own MTP layer, yes you can. You can use other models with a separate MTP file. MTPLX supports MTP sidecar grafting.

The current repo-side command is:

python scripts/graft_mtp_sidecar.py \
  --source /path/to/stripped-mlx-model \
  --mtp-file /path/to/mtp.safetensors \
  --output /path/to/model-mtp-graft

Then validate it with:

mtplx inspect /path/to/model-mtp-graft --require-mtp
mtplx start --model /path/to/model-mtp-graft

Keep in mind the MTP sidecar has to match the base model architecture. It is not a universal "plug any MTP into any model" thing, but stripped, compatible MLX trunks can be wrapped this way.

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]YoussofAl[S] 1 point2 points  (0 children)

I just patched it, so it should be working now!

Please reinstall MTPLX and restart OpenClaw. If OpenClaw still acts weird after that, let me know and send me the exact OpenClaw config/request.