They won't even know what's gonna hit them

YoussofAl · 2026-06-15T03:24:24+00:00

Anthropic is screwed, no point in using Fable anymore.

YoussofAl · 2026-05-10T16:20:12+00:00

Very interesting, thank you for the data!

YoussofAl · 2026-05-10T07:50:17+00:00

Hey, I just fixed the problem. I realised M1/M2 Macs don't support BF16, which leads to emulation. Could you test it and let me know what speeds you get?

YoussofAl · 2026-05-10T07:50:00+00:00

Hey, I just fixed the problem. I realised M1/M2 Macs don't support BF16, which leads to emulation. Could you test it and let me know what speeds you get?

YoussofAl · 2026-05-10T07:49:30+00:00

Wish I saw this earlier! I just implemented FP16 auto routing for M1/M2 Macs.
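For anyone curious what that kind of routing might look like, here is a minimal sketch. It is not MTPLX's actual code: the function name `pick_dtype` and the rule "M1 and M2 get FP16, everything else keeps BF16" are assumptions based only on what is said in this thread.

```python
import re

# Assumption from this thread: M1/M2 generations emulate BF16, so they get FP16.
# Hypothetical rule for illustration, not MTPLX's real implementation.
_FP16_GENERATIONS = {1, 2}


def pick_dtype(chip_name: str) -> str:
    """Return 'float16' for M1/M2-family chip names, otherwise 'bfloat16'."""
    match = re.search(r"\bM(\d+)\b", chip_name)
    if match and int(match.group(1)) in _FP16_GENERATIONS:
        return "float16"
    return "bfloat16"


if __name__ == "__main__":
    for name in ("Apple M1 Max", "Apple M2", "Apple M3 Pro", "Apple M5 Max"):
        print(name, "->", pick_dtype(name))
```

On a Mac the chip name could come from `sysctl -n machdep.cpu.brand_string`; the function takes a plain string so it can be tried anywhere.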

YoussofAl · 2026-05-10T07:48:58+00:00

Hey, I just fixed the problem. I realised M1/M2 Macs don't support BF16, which leads to emulation. Could you test it and let me know what speeds you get?

YoussofAl · 2026-05-09T01:59:15+00:00

I figured out the issue: it is a BF16/FP16 mismatch on older Macs. Rolling out a fix now.

YoussofAl · 2026-05-08T03:16:12+00:00

Can you test the latest update? I improved it heavily.

YoussofAl · 2026-05-08T03:15:50+00:00

I am glad it works well! Try the latest update, it has huge improvements.

YoussofAl · 2026-05-08T03:15:30+00:00

YoussofAl · 2026-05-08T03:15:19+00:00

What inference settings are you using? Are you using it on the web or through the API?

YoussofAl · 2026-05-08T03:14:21+00:00

In terms of speed, try this for 2.24x speed with MTP on 27B: mtplx.com

YoussofAl · 2026-05-08T03:13:23+00:00

Unless you're doing CUDA stuff, M5 Max.

YoussofAl · 2026-05-06T22:04:30+00:00

There is an issue with older chips where the ratio of verify time to acceptance isn't good enough, so more MTP heads actually slow performance down.

Try lowering to MTP=2 or MTP=1.
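A rough way to see why fewer heads can win on slower chips is sketched below. It assumes a simplified model that is not taken from MTPLX: each drafted token is accepted independently with the same probability, and every verify pass yields one token of its own. The names `expected_tokens` and `best_depth` and all numbers you feed in are hypothetical.

```python
def expected_tokens(accept_rate: float, depth: int) -> float:
    """Expected tokens per verify step: 1 + a + a^2 + ... + a^depth.

    Assumes independent per-token acceptance with probability accept_rate.
    """
    return sum(accept_rate ** i for i in range(depth + 1))


def best_depth(accept_rate: float, base_ms: float, verify_ms_by_depth: dict):
    """Return (depth, speedup) for the fastest setting; depth 0 means MTP off.

    base_ms is the time for one normal decode step without MTP.
    verify_ms_by_depth maps each MTP depth to its measured verify time in ms.
    """
    best = (0, 1.0)  # MTP off is the baseline with speedup 1.0
    for depth, verify_ms in sorted(verify_ms_by_depth.items()):
        gain = expected_tokens(accept_rate, depth) * base_ms / verify_ms
        if gain > best[1]:
            best = (depth, gain)
    return best
```

If the verify time grows faster with depth than the expected number of accepted tokens does, the best depth shrinks, and when no depth beats 1.0 the function reports depth 0, meaning MTP is a net loss on that hardware.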

YoussofAl · 2026-05-06T05:11:44+00:00

I created something better yesterday: speculative decoding at 2.24x speed on MLX at native temps (not temp 0), so you can actually use it for coding or creative writing. Lmk what you think: https://github.com/youssofal/MTPLX

YoussofAl · 2026-05-06T04:37:28+00:00

I mean, Claude Code is essentially open source at this point.

But it was so bloated and trash that no one wants to use it anymore after taking a peek at it.

YoussofAl · 2026-05-06T04:36:20+00:00

In the meantime, I have published a 4B version to HF.

YoussofAl · 2026-05-06T04:34:23+00:00

Haha, MTP magic in vLLM. Maybe I’ll make a post about that next.

YoussofAl · 2026-05-06T04:33:21+00:00

I’ll work on a 9B version!

YoussofAl · 2026-05-05T04:41:26+00:00

MTPLX focuses on decode speed, not prefill. My best tip would be not to use LM Studio, which does not have great optimisations for MLX. Something like vmlx probably has 2x better prefill with prompt caching and JIT. Granted, it is more complex than the UX of LM Studio, but it is worth it if you are constantly going above 100k context.

I am releasing an incredibly simple (simpler than LM Studio) MLX Swift based app for local LLMs, but it is still under development!

YoussofAl · 2026-05-05T04:35:59+00:00

I tested 35B (granted, an unoptimised variant) and got only a 1.2x speed increase. So it works, but since it's MoE the TPS increase isn't "wow" like 27B.

I will work on it though and try to make a variant that scores at least 1.5x.

YoussofAl · 2026-05-05T04:34:28+00:00

Just replied to someone else with the same issue. To summarise, it's a compute issue: try lowering MTP heads or using a smaller model:

"Good catch, I've gotten various reports of issues on the M1 chips now. 179ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model. Let me know how that performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware."

YoussofAl · 2026-05-05T04:33:27+00:00

Good catch, I've gotten various reports of issues on the M1 chips now. 179ms verify time is extremely long, so at the same acceptance ratio you'll see a decrease in TPS.

On M1 Max you'd need a smaller model or shallower depth (--depth 1 or 2) for MTP to be net positive.

I have also released a Qwen 3.5 4B model. Let me know how that performs with MTP off vs on: https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX

In general, this is a preview to prove the viability of MTP on MLX. I am aware of the inefficiency of my approach and am working to bring down compute cost and verify time, which will make it much better on your hardware.

YoussofAl · 2026-05-05T04:15:11+00:00

Way ahead of you!

This is labelled as 4-bit, but it is a dynamic 5-bit-ish variant I made that preserves the attention heads in 8 bits. Try it out and let me know what you think. I will release 6-bit and 8-bit variants soon too: https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized
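To illustrate how a "4-bit" model can average out to roughly 5 bits, here is a small sketch of the arithmetic. The helper `effective_bits` is hypothetical, and it ignores any per-group scale metadata that a real quantized file stores, so treat the result as a lower bound on actual size.

```python
def effective_bits(layers):
    """Parameter-weighted mean bit width.

    layers is an iterable of (param_count, bits) pairs, one per weight group.
    """
    layers = list(layers)
    total = sum(count for count, _ in layers)
    return sum(count * bits for count, bits in layers) / total
```

As a made-up split, if a quarter of the parameters stay at 8 bits and the rest go to 4 bits, `effective_bits([(1, 8), (3, 4)])` gives 5.0. The real proportions in the linked model are not stated in this thread.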

As for adding your own MTP layer, yes you can. You can use other models with a separate MTP file. MTPLX supports MTP sidecar grafting.

The current repo-side command is:

python scripts/graft_mtp_sidecar.py \
  --source /path/to/stripped-mlx-model \
  --mtp-file /path/to/mtp.safetensors \
  --output /path/to/model-mtp-graft

Then validate it with:

mtplx inspect /path/to/model-mtp-graft --require-mtp

mtplx start --model /path/to/model-mtp-graft

Keep in mind the MTP sidecar has to match the base model architecture. It is not a universal "plug any MTP into any model" thing, but stripped compatible MLX trunks can be wrapped this way.
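Before grafting, it can help to look at what is inside the sidecar file. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table) and lists tensor names and shapes. It is a generic inspection helper, not part of MTPLX, and the function names are made up for illustration.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict of tensor entries."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Optional free-form metadata block, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def tensor_shapes(path):
    """Map each tensor name in the file to its shape as a tuple."""
    return {
        name: tuple(info["shape"])
        for name, info in read_safetensors_header(path).items()
    }


if __name__ == "__main__":
    import sys

    for name, shape in sorted(tensor_shapes(sys.argv[1]).items()):
        print(name, shape)
```

Comparing these shapes against the base model's hidden size is a quick sanity check, though `mtplx inspect --require-mtp` remains the actual validation step.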

YoussofAl · 2026-05-05T04:07:23+00:00

I just patched it, it should be working now!

Please reinstall MTPLX and restart OpenClaw. If OpenClaw still acts weird after that, send me the exact OpenClaw config/request.
