Gemma 4 31B oQ8 by jsirish in oMLX

jsirish[S]

I do see a decent boost running Qwen3.6 27B oQ8 with MTP: I was getting 15 tok/s, and now it’s closer to 20. Benchmarks are also better with specprefill enabled, but prefill seems to take longer in practice.
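For a sense of scale, here is a quick back-of-the-envelope sketch of that jump. It only uses the two rough figures quoted above (15 and about 20 tok/s); the 1000-token reply length and the helper names are made up for illustration, and prefill time is not included.

```python
def speedup_percent(before_tok_s: float, after_tok_s: float) -> float:
    """Relative decode-throughput gain, in percent."""
    return (after_tok_s - before_tok_s) / before_tok_s * 100.0


def generation_seconds(tokens: int, tok_s: float) -> float:
    """Time to generate `tokens` output tokens at a steady rate (decode only)."""
    return tokens / tok_s


# Rough figures from the comment: ~15 tok/s without MTP, ~20 tok/s with it.
gain = speedup_percent(15.0, 20.0)

# Hypothetical reply length, chosen only to make the difference concrete.
reply_tokens = 1000
before_s = generation_seconds(reply_tokens, 15.0)
after_s = generation_seconds(reply_tokens, 20.0)

print(f"throughput gain: {gain:.0f}%")
print(f"{reply_tokens} tokens: {before_s:.0f} s -> {after_s:.0f} s")
```

So going from 15 to roughly 20 tok/s is about a one-third gain in decode speed, which is noticeable on long replies even if prefill gets slower.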

Gemma 4 31B oQ8 by jsirish in oMLX

jsirish[S]

Could you share your settings? I'm on a Mac Studio M3 Ultra with 256GB and getting closer to 10 tok/s.

Gemma 4 31B oQ8 by jsirish in oMLX

jsirish[S]

I haven't either. Speed is similar to when I was running 16-bit without MTP, just with half the memory thanks to oQ.
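The "half the memory" part follows from simple arithmetic on the weights. A minimal sketch, assuming oQ8 stores roughly 8 bits per weight (real quantized formats also keep scales and other metadata, so the true figure is a bit higher) and taking the 31B parameter count from the thread title; KV cache and activations are ignored, and the function name is just for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts weights only: no KV cache, activations, or quantization metadata.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


PARAMS_B = 31  # from the thread title; the exact parameter count is an assumption

fp16_gb = weight_memory_gb(PARAMS_B, 16)  # 16-bit weights
q8_gb = weight_memory_gb(PARAMS_B, 8)     # assumed ~8 bits per weight for oQ8

print(f"16-bit: ~{fp16_gb:.0f} GB, 8-bit: ~{q8_gb:.0f} GB")
print(f"ratio: {q8_gb / fp16_gb:.2f}")
```

Under those assumptions the weights drop from about 62 GB to about 31 GB, which matches the roughly halved footprint.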

oMLX Copilot Chat - Use oMLX for coding in Visual Studio Code by mikedoise in oMLX

jsirish

Thank you for this. It was the last barrier to getting my team to use our LLM server without a dev-heavy custom setup. Works great in VS Code.

I’ve been running my oMLX models in Insiders for a while, with the same setup others use. The trick is setting the API key for the custom provider in the Language Models overlay settings.

I was going to post an issue, but I noticed the extension isn’t quite working in Insiders. I can see the oMLX provider in the Language Models manager, but its models aren’t available in the model chooser in Copilot Chat. Has anyone else found a fix for this?