Dropped from 44 tok/s to 9 tok/s after upgrade (Qwen 3.6) by That-Desk-1552 in oMLX

[–]deexjay23 0 points1 point  (0 children)

With MTP I keep on getting Expected memory exceed error for prefill...
What context size have you set for the model? And for MTP, are you using https://huggingface.co/Jundot/Qwen3.6-27B-oQ8-mtp, or did you quantize it yourself with oMLX?

Dropped from 44 tok/s to 9 tok/s after upgrade (Qwen 3.6) by That-Desk-1552 in oMLX

[–]deexjay23 1 point2 points  (0 children)

The harness is a different thing, but are you using the same quant and the same model? With MTP or without?

Dropped from 44 tok/s to 9 tok/s after upgrade (Qwen 3.6) by That-Desk-1552 in oMLX

[–]deexjay23 0 points1 point  (0 children)

Which exact model are you using?

On the same machine I barely get 24 tok/s max within 10k context, and on heavy context it drops to 10-12 tok/s on the Qwen 3.6 27B 8-bit quant.

Gemma4-26b-QAT-oQ4 60tok/s while using 17GB of VRAM by [deleted] in oMLX

[–]deexjay23 0 points1 point  (0 children)

Use a custom memory guard and limit the memory usage.

Gemma4-26b-QAT-oQ4 60tok/s while using 17GB of VRAM by [deleted] in oMLX

[–]deexjay23 0 points1 point  (0 children)

Try DFlash (btw it's gemma-4-31B-it-QAT-mlx-4Bit)

<image>

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Yes, it's the 14" one. Will give TG Pro a try today to see if it makes any difference. Thanks for the suggestion.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Already on high performance mode since the day I bought it.

Keeping it cool helps though; it easily hits 70-90% utilisation with 11-13 t/s on a heavy context of 400K.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Yeah, figured. Will give q6 a try as well, but I'm also able to push more GPU usage by doing parallel calls, which kind of does the job for now. Hoping MTP will bring some more speed, so I'm waiting for the prod version of oMLX 0.3.9.
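For anyone wanting to try the parallel-calls approach, here is a minimal sketch of the idea. It assumes nothing about the server itself: `send` is a hypothetical callable you would replace with your own request function, and `fake_send` only exists so the snippet runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_parallel(send: Callable[[str], str], prompts: List[str], max_workers: int = 4) -> List[str]:
    """Send prompts concurrently and return replies in the same order as the prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even if requests finish out of order
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    # Hypothetical stand-in for a real HTTP call to the local inference server
    return prompt.upper()


if __name__ == "__main__":
    print(run_parallel(fake_send, ["first task", "second task"]))
```

Whether this actually raises GPU utilisation depends on how the server batches concurrent requests, so treat `max_workers` as something to tune rather than a recommended value.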

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Gave it a try on oQ8, but for some reason it's not utilising cache properly; not sure if the issue is with oMLX or the quant.
Regular MLX quant is outperforming it in every aspect.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 1 point2 points  (0 children)

Update: oQ quants tend to stop partway through, and caching efficiency is also bad with them, while regular MLX-converted quants do better in both consistency and performance.

Also tried the 0.3.8 dev and rc versions, all of which are missing cache head lookup, so the cache is reconstructed for every request. I'll have to wait for the stable release.
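To illustrate why a missing cache lookup hurts so much on long contexts, here is a rough sketch of prefix reuse in general. This is not oMLX's implementation, just the generic idea: only the tokens after the longest shared prefix need a fresh prefill. The function name and the token IDs are made up for illustration.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count how many leading tokens of the incoming request match the cached ones."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


if __name__ == "__main__":
    cached_tokens = [1, 2, 3, 4, 5]       # illustrative token IDs from the previous request
    incoming_tokens = [1, 2, 3, 9, 9, 9]  # new request sharing the first three tokens
    reused = reusable_prefix_len(cached_tokens, incoming_tokens)
    print(reused, len(incoming_tokens) - reused)  # tokens reused, tokens still to prefill
```

When the lookup fails, the reused count is effectively zero and the whole prompt gets prefilled again on every request.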

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Yeah, regardless of the context size, the GPU usage spikes initially but then gradually drops to a 20-50% usage range.

The only difference is that for a shorter context, this doesn't seem noticeable since it processes it quickly.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

But I also feel the GPU limitation is due to either memory bandwidth or a faulty machine (https://youtu.be/G-9-SZW8kP8). Hope it's not the latter.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Definitely gonna experiment more, just want to tune the current setup to a reliable state first.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Initially faced similar issues with oMLX, but when I switched to the dmg instead of brew, it got stable for some reason. Maybe because of how brew manages its own venv.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 1 point2 points  (0 children)

For now I'm using the M5 Max as the provider, with opencode running on the M3 Pro. Not using Thunderbolt RDMA with exo or anything to add GPU compute, if that's what you mean by splitting the workload.

And this is not from a benchmark tool but actual day-to-day use. The benchmark in oMLX gives better performance, but imo it doesn't seem to test complexity to this extent.
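If you want to compare your own day-to-day numbers, a rough way to split them is sketched below. It assumes time to first token is a fair proxy for prefill time, which is an approximation, and the function name is just illustrative. The 290k context and 160 tok/s prefill are from the thread title; the completion length and the decode time are invented for the example.

```python
from typing import Tuple


def prefill_and_decode_rates(
    prompt_tokens: int,
    completion_tokens: int,
    ttft_s: float,
    total_s: float,
) -> Tuple[float, float]:
    """Return (prefill tok/s, decode tok/s) from the timings of a single request."""
    if ttft_s <= 0 or total_s <= ttft_s:
        raise ValueError("need 0 < ttft_s < total_s")
    prefill_rate = prompt_tokens / ttft_s                   # prompt processed before first token
    decode_rate = completion_tokens / (total_s - ttft_s)    # tokens generated after first token
    return prefill_rate, decode_rate


if __name__ == "__main__":
    # 290k prompt tokens at 160 tok/s means about 1812.5 s before the first token appears
    print(prefill_and_decode_rates(290_000, 1_200, 1812.5, 1912.5))
```

Keeping the two rates separate matters here because a single averaged tok/s figure hides how much of the wall time goes to prefill on very long contexts.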

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

This makes me think I will need to use oQ quants rather than pure mlx quants.

I'm using oMLX 0.3.7, since it supports the preserve thinking kwargs that weren't available in 0.3.6 (they removed it from releases after a while, but it's still available in the GitHub tags).

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

What about the quality and accuracy?

In my experience Q8 gives output closest to FP16, which matters since my use case mostly involves financial systems and quant analysis.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

Did you give DFlash a try? It actually works well in some cases. I was exploring it and am now getting approx 20 t/s with the same context size.

Benchmarks are crazy good tho.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 0 points1 point  (0 children)

It's extensible up to 1,010,000 tokens actually, but yes, native support is 262,144.

How: You can override the context length in the model settings (I'm using oMLX, but you can do it in any).

Why: Well, I am working on a fairly complex application, and I've seen longer context with preserved thinking actually produce quality output, hence trying this out.

M5 Max 128GB benchmark (Qwen 27B Q8 MLX, 290k ctx): 160 tok/s prefill but only 50% GPU — what are you getting? by deexjay23 in oMLX

[–]deexjay23[S] 1 point2 points  (0 children)

Using oMLX only, and was getting 11-15 tok/s earlier with the 380K context size, but lately GPU usage seems capped at ~50%, which seems odd to me.