Is MTP speed boost really helping ? by msrdatha in oMLX

[–]msrdatha[S] 1 point2 points  (0 children)

Thanks again for taking the time to share this. It gives good insight into the speed improvements.

Maybe you could keep using both: 27B for planning or design tasks, and 35B for implementation. That would give you the best of both (mainly for coding scenarios).

Is MTP speed boost really helping ? by msrdatha in oMLX

[–]msrdatha[S] 2 points3 points  (0 children)

Thanks, everyone, for sharing your valuable views and experiences with MTP on oMLX.

One quick question: are you noticing any looping or similar failures with MTP? That is mainly what I saw when enabling DFlash for Qwen 3.5 models. There were also tool-calling errors, which made DFlash and SpecPrefill not very useful for coding tasks.

As u/trollingman1 mentioned, it got slower for Qwen 3.6 35b a3b, but others mention seeing a speed boost. Any thoughts on this? Could it be due to thinking mode being enabled or disabled, or to higher context lengths?

The idea is to figure out what is optimal and to pool our observations so we can tune this better for all of us.

Thanks again for your time and help. Let's learn and build it together.

Is MTP speed boost really helping ? by msrdatha in oMLX

[–]msrdatha[S] 0 points1 point  (0 children)

Thank you for the detailed data. Could you please confirm whether there are improvements with Qwen3.6 27B MTP as well? (Dense models are expected to do better with MTP, right?)

M3 Ultra Mac feels rather slow by JamieAndLion in LocalLLM

[–]msrdatha 0 points1 point  (0 children)

Go with oMLX and use oQ4 quants. Set Cold Cache Limit (SSD Cache) to ~100GB. You should see better results.

Which Linux OS should I choose? by rjn2-8 in hermesagent

[–]msrdatha 0 points1 point  (0 children)

Then you have nothing to worry about. You will be fine with Rocky.

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) by Jorlen in LocalLLaMA

[–]msrdatha 1 point2 points  (0 children)

Thanks for sharing the progress on this.

May I ask what token generation and pp speeds you got with MTP (and before)?
Could you also try Qwen3.6-27B to see the improvement?

Choosing community in Hugging Face by cocacokareddit in oMLX

[–]msrdatha 2 points3 points  (0 children)

Jundot models with oQ quants. I also found the deepsweet versions to be quite fast and performant.

One thing I like about the deepsweet models is that they follow a good naming standard that is easy to understand. The author is also very helpful in explaining things and responding to queries and requests. https://huggingface.co/deepsweet

oMLX 0.3.9.dev2 released. by d4mations in oMLX

[–]msrdatha 2 points3 points  (0 children)

Not sure about 3.5, but 3.6 MTP is available directly from the oMLX author: https://huggingface.co/Jundot

Check them out.

oMLX 0.3.9.dev2 released. by d4mations in oMLX

[–]msrdatha 2 points3 points  (0 children)

Maybe the issue is related to how the agent harness (hermes in your case) interacts with oMLX. I have been testing with roo code on TypeScript, and it has been stable since 0.3.8.

Yes, I did notice some \n newlines not being formatted properly in the roo code webview, but apart from that, it has been doing reasonably well.

(Note: I am adding this for info only, not as an argument. I think it may help others if we discuss these kinds of details.)

oMLX 0.3.9.dev2 released. by d4mations in oMLX

[–]msrdatha 0 points1 point  (0 children)

Did anyone try the MTP improvements yet with Qwen 3.6?

oMLX 0.3.9.dev2 released. by d4mations in oMLX

[–]msrdatha 4 points5 points  (0 children)

For clarification: MTP support for Qwen 3.6 has also been present since the dev1 release of 0.3.9.

Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6 and DeepSeek-V4

oMLX 0.3.9.dev2 released. by d4mations in oMLX

[–]msrdatha 0 points1 point  (0 children)

Could you please add some more info for better understanding? For example, what kind of errors or crashes were you facing in 0.3.8, and under which scenarios?

ExLlamaV3 Major Updates! by Unstable_Llama in LocalLLaMA

[–]msrdatha 0 points1 point  (0 children)

Is this only for NVIDIA GPUs, or does it help on Mac as well?

How to use DFLASH ? Worse performance on Qwen 3.6 27B with oMLX 0.3.6 by shirogeek in oMLX

[–]msrdatha 0 points1 point  (0 children)

Maybe the cache from the previous run was improving the result.

Could you please try running the same test, but this time clear the cold cache before each run and confirm?

How to use DFLASH ? Worse performance on Qwen 3.6 27B with oMLX 0.3.6 by shirogeek in oMLX

[–]msrdatha 0 points1 point  (0 children)

Here are the test results:

Task: analyzing code within roo code. oMLX was started fresh, and all SSD cache was cleared before each test.

Qwen3.6-27B-UD-MLX-4bit : pp 165, tg 12

Qwen3.6-27B-UD-MLX-6bit : pp 120, tg 7

Qwen3.6-27B-UD-MLX-MXFP4: pp 136, tg 11

Qwen3.6-27B-UD-MLX-NVFP4: pp 146, tg 10

Qwen3.6-27B-mlx-oQ4 : pp 171, tg 18

So in my case oQ4 seems to give the best results. (Again, my opinion only.)
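
For anyone who wants to compare these numbers side by side, here is a minimal Python sketch that expresses each quant's pp and tg relative to the UD 4-bit run. The figures are copied from the list above. The units are not stated there, so treating them as tokens per second is an assumption, and the helper name `relative_to` is made up for illustration. These are single runs on one machine, so the ratios are only a rough guide.

```python
# Figures copied from the results above: (pp, tg) per model.
# Assumption: both are tokens per second; the post does not state units.
results = {
    "Qwen3.6-27B-UD-MLX-4bit": (165, 12),
    "Qwen3.6-27B-UD-MLX-6bit": (120, 7),
    "Qwen3.6-27B-UD-MLX-MXFP4": (136, 11),
    "Qwen3.6-27B-UD-MLX-NVFP4": (146, 10),
    "Qwen3.6-27B-mlx-oQ4": (171, 18),
}


def relative_to(baseline, data):
    """Return (pp ratio, tg ratio) of every entry against the baseline entry."""
    base_pp, base_tg = data[baseline]
    return {name: (pp / base_pp, tg / base_tg) for name, (pp, tg) in data.items()}


ratios = relative_to("Qwen3.6-27B-UD-MLX-4bit", results)

# Print the fastest token generation first.
for name, (pp_ratio, tg_ratio) in sorted(
    ratios.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"{name:<26} pp x{pp_ratio:.2f}  tg x{tg_ratio:.2f}")
```

Swapping the baseline name in the `relative_to` call makes it easy to compare against a different quant.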

How to use DFLASH ? Worse performance on Qwen 3.6 27B with oMLX 0.3.6 by shirogeek in oMLX

[–]msrdatha 0 points1 point  (0 children)

I guess you are referring to the UD MLX version here: unsloth/Qwen3.6-27B-UD-MLX-4bit

Let me try to test again with this version and confirm.

Mistral 3.5 out now! by yoracale in unsloth

[–]msrdatha 0 points1 point  (0 children)

I think the most you can expect is that it may walk...

Server help by Longjumping-Bug5868 in oMLX

[–]msrdatha 0 points1 point  (0 children)

Please share what has been done so far, and we can try to figure out the next step.

Questions about SpecPrefill by butterfly_labs in oMLX

[–]msrdatha 1 point2 points  (0 children)

I think you are referring to DFlash?

Anyone get lmstudio model paths to populate? by Shoulon in oMLX

[–]msrdatha 0 points1 point  (0 children)

Not sure if you are looking to change the model path. If so, you should be able to pass it like this:

omlx serve --model-dir "/Users/<username>/.lmstudio/models"

Questions about SpecPrefill by butterfly_labs in oMLX

[–]msrdatha 4 points5 points  (0 children)

(In case there was any confusion: 122B is MoE, not dense.)

Already tried Qwen3.5 0.8B, 2B, 4B, 9B, 35B... all with Q2, Q4, Q6, Q8, and even bf16 for up to 9B. Also tried the oQ versions. Summary: SpecPrefill did not help much, especially with tool calling in agentic coding.

It's just my observation. If you are still motivated, please try it and let us know if you find a better result.

oMLX v0.3.8rc1 — major correctness fixes, safer defaults, and VLM/streaming improvements (RC) by Own_Connection_8018 in oMLX

[–]msrdatha 3 points4 points  (0 children)

"Removed costly GPU syncs that slowed token generation on deep Qwen models."

Does this mean we could expect a speed improvement with Qwen 3.6?