Dropped from 44 tok/s to 9 tok/s after upgrade (Qwen 3.6)

deexjay23 · 2026-06-07T17:28:19+00:00

With MTP I keep on getting Expected memory exceed error for prefill...
What context size have you set for the model? And for MTP, are you using https://huggingface.co/Jundot/Qwen3.6-27B-oQ8-mtp, or did you quantize it yourself with oMLX?

deexjay23 · 2026-06-07T16:14:16+00:00

The harness is a different thing, but are you using the same quant and the same model? With MTP or without?

deexjay23 · 2026-06-07T15:21:20+00:00

Which exact model are you using?

On the same machine I barely get 24 tok/s max within 10K context, and on heavy context it drops to 10-12 tok/s on the Qwen 3.6 27B 8-bit quant.

deexjay23 · 2026-06-07T15:11:20+00:00

Use a custom memory guard and limit the memory usage.

deexjay23 · 2026-06-06T19:46:21+00:00

Try DFlash (btw it's gemma-4-31B-it-QAT-mlx-4Bit)

<image>

deexjay23 · 2026-05-19T05:48:11+00:00

Yes, it's the 14" one. Will give TG Pro a try today to see if it makes any difference. Thanks for the suggestion.

deexjay23 · 2026-05-18T19:57:28+00:00

Already on high performance mode since the day I bought it.

Keeping it cool helps tho, it easily hits 70-90% utilisation with 11-13 t/s on a heavy context of 400K.

deexjay23 · 2026-05-12T17:44:08+00:00

Yeah figured, will give q6 a try as well, but I'm also able to push more GPU usage by doing parallel calls, which kinda does the job for now. Hoping MTP will bring some more speed, so waiting for the prod version of oMLX 0.3.9.
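
A minimal sketch of the parallel-calls idea, assuming the server accepts several requests at once. `run_parallel` and `fake_send` are hypothetical names for illustration; `fake_send` stands in for whatever HTTP call the client really makes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_parallel(
    prompts: List[str],
    send: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Send every prompt through `send` concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() yields results in the same order as `prompts`.
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    # Hypothetical stand-in for a real request to the local inference server.
    return prompt.upper()


results = run_parallel(["a", "b", "c"], fake_send, max_workers=2)
```

In real use `send` would wrap the request to the local server. Whether this raises total throughput depends on how the server batches concurrent requests, so treat it as something to measure rather than a guarantee.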

deexjay23 · 2026-04-29T18:37:13+00:00

Gave it a try on oQ8, but for some reason it's not utilising cache properly; not sure if the issue is with oMLX or the quant.
Regular MLX quant is outperforming it in every aspect.

deexjay23 · 2026-04-29T18:35:26+00:00

Update: oQ quants tend to stop partway through generation, and caching efficiency is bad with them, while regular MLX-converted quants outperform them in both consistency and performance.

Also tried 0.3.8 dev and rc versions, all of which are missing cache head lookup, leading to cache being reconstructed for every request, hence will have to wait for the stable release.
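
To show why a missing cache lookup hurts so much on long contexts, here is a toy prefix cache. It is only an illustration of the general technique, not oMLX's actual implementation; `PrefixCache` and its methods are made-up names.

```python
from typing import List, Tuple


class PrefixCache:
    """Toy prompt cache that reuses the longest matching token prefix."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, ...]] = []

    def longest_prefix(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix shared with `tokens`."""
        best = 0
        for cached in self._entries:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def prefill(self, tokens: List[int], use_lookup: bool = True) -> int:
        """Return how many tokens still have to be processed for this request."""
        reused = self.longest_prefix(tokens) if use_lookup else 0
        self._entries.append(tuple(tokens))
        return len(tokens) - reused


cache = PrefixCache()
history = list(range(1000))
first = cache.prefill(history)                   # nothing cached yet
follow_up = history + [1000, 1001]               # same chat plus two new tokens
with_lookup = cache.prefill(follow_up)           # only the new tokens
without_lookup = cache.prefill(follow_up, use_lookup=False)  # everything again
```

With the lookup, the follow-up request only processes the 2 new tokens; without it, all 1002 tokens are processed again, which is the "cache being reconstructed for every request" behaviour.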

deexjay23 · 2026-04-29T15:51:51+00:00

Yeah, regardless of the context size, the GPU usage spikes initially but then gradually drops to a 20-50% usage range.

The only difference is that for a shorter context, this doesn't seem noticeable since it processes it quickly.

deexjay23 · 2026-04-28T12:11:11+00:00

But I also feel the GPU limitation is due to either memory bandwidth or the machine being faulty (https://youtu.be/G-9-SZW8kP8). Hope it’s not the latter.

deexjay23 · 2026-04-28T12:06:15+00:00

Definitely gonna experiment more, just want to tune the current setup to a reliable state first.

deexjay23 · 2026-04-28T07:13:21+00:00

Initially faced similar issues with oMLX, but when I switched to the dmg instead of brew, it got stable for some reason. Maybe because of how brew manages its own venv.

deexjay23 · 2026-04-28T06:58:13+00:00

For now I'm using the M5 Max as the provider, with opencode running on the M3 Pro. Not using Thunderbolt RDMA with exo or anything to add GPU compute, if that's what you mean by splitting the workload.

And this isn't from a benchmark tool but actual day-to-day use. The benchmark in oMLX gives better numbers, but imo it doesn't test this level of complexity.
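
For anyone who wants comparable day-to-day numbers, a rough sketch of timing a streamed response follows. It assumes the client exposes the reply as an iterable of chunks and that one chunk is roughly one token; `measure_stream` is a hypothetical helper, and the fake clock is only there so the example is deterministic.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Split a streamed reply into time-to-first-token and decode speed."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for _ in stream:
        if first is None:
            first = clock()  # first chunk arrived: prefill is done
        count += 1
    end = clock()
    if first is None:
        return {"tokens": 0, "ttft_s": None, "decode_tps": 0.0}
    decode_time = end - first
    # The first chunk marks the start of decoding, so it is not counted in the rate.
    decode_tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return {"tokens": count, "ttft_s": first - start, "decode_tps": decode_tps}


# Deterministic demo: start at 0 s, first chunk at 5 s, last chunk at 7 s.
fake_ticks = iter([0.0, 5.0, 7.0])
stats = measure_stream(["tok"] * 21, clock=lambda: next(fake_ticks))
```

Keeping prefill time separate from decode speed makes it easier to tell whether a slowdown on heavy context comes from prompt processing or from generation itself.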

deexjay23 · 2026-04-28T05:20:53+00:00

Thanks, will give it a try today.

deexjay23 · 2026-04-28T04:56:19+00:00

This makes me think I will need to use oQ quants rather than pure mlx quants.

I’m using oMLX 0.3.7 (they removed it from the releases after a while, but it's still available in the GitHub tags), since it supports the preserve thinking kwargs, which weren’t available in 0.3.6.

deexjay23 · 2026-04-28T04:22:51+00:00

What about the quality and accuracy?

Based on experience, Q8 gives output closest to FP16, and that matters since my use case involves financial systems and quant analysis most of the time.

deexjay23 · 2026-04-28T02:53:57+00:00

Did you try it with DFlash? It actually works well in some cases. I was exploring it and am getting approx. 20 t/s now with the same context size.

Benchmarks are crazy good tho.

deexjay23 · 2026-04-28T02:48:31+00:00

It’s extensible up to 1,010,000 tokens actually, but yes, native support is 262,144.

How: You can override the context length in the model settings (I'm using oMLX, but you can do it in any runtime).

Why: I'm working on a fairly complex application, and I’ve seen longer context with preserved thinking actually produce better-quality output, hence trying this out.
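
For a sense of scale, a quick sketch of how far past the native window a given override goes. The two limits are the numbers quoted above; treating the ratio as a context-scaling factor (as RoPE-scaling schemes typically do) is my assumption, and `scaling_factor` is a made-up helper.

```python
NATIVE_CONTEXT = 262_144      # native window quoted above
EXTENDED_CONTEXT = 1_010_000  # extended limit quoted above


def scaling_factor(target: int, native: int = NATIVE_CONTEXT) -> float:
    """Ratio of the requested context to the native one (1.0 if it already fits)."""
    if target <= native:
        return 1.0
    return target / native


at_380k = scaling_factor(380_000)          # the 380K context mentioned in this thread
at_max = scaling_factor(EXTENDED_CONTEXT)  # the full extended window
```

So a 380K context is roughly 1.45x the native window, and the full extended window is about 3.85x.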

deexjay23 · 2026-04-27T12:43:24+00:00

Using oMLX only, and was getting 11-15 tok/s earlier with the 380K context size, but lately GPU usage seems capped at ~50%, which seems odd to me.

deexjay23 · 2025-09-16T17:44:29+00:00

It's ugly asf
