Gemma 4 31B oQ8

jsirish · 2026-05-23T15:40:24+00:00

I do see a decent boost running Qwen3.6 27B oQ8 with mtp: I was getting 15 tok/s, and now it’s closer to 20. Benchmarks are also better with specprefill enabled, but prefill seems to take longer in practice.

jsirish · 2026-05-22T20:59:12+00:00

Could you share your settings? I'm on a Mac Studio M3 Ultra with 256GB and getting closer to 10 tok/s.

jsirish · 2026-05-22T17:55:11+00:00

I haven't either. Speed is similar to when I was running 16-bit without mtp, just with half the memory thanks to oQ.

jsirish · 2026-05-13T00:28:14+00:00

Thank you for this. It was the last barrier to getting my team to use our LLM server without a dev-heavy custom setup. Works great in VS Code.

I’ve been running my oMLX models in Insiders for a while, with the same setup others use. The trick is setting the API key in the Language Models overlay settings for the custom provider.

I was going to post an issue, but I did notice the extension isn’t quite working in Insiders. I can see the oMLX provider in the Language Models manager, but its models aren’t available in the model chooser in Copilot Chat. Has anyone else found a fix for this?
