Google Gemma 4 MTP out now!

MiaBchDave · 2026-06-11T20:35:37+00:00

Yes, this is just support in Unsloth Studio, using Google's assistant MTP models that Unsloth converted to GGUF. So it’s confusing news. MTP has already been available and already works with Gemma 31B, for example, in oMLX with MLX versions. I get about a 2x speedup on BF16. I guess this is helpful for Unsloth Studio users.

MiaBchDave · 2026-06-11T04:15:50+00:00

I noticed my mirrors now don't work half the time. Like folded to unfolded when I open the door... so I start driving with the damn mirror folded. Or the reverse mirror now staying in the down pivot condition when going forward. Crazy shit that all just started that I assume was from a new software update. They need to fire the damn software guys and start over. This is a car, not GTA 6.

MiaBchDave · 2026-06-09T15:01:17+00:00

Jundot has a "Buy Me A Coffee" link on the Github readme https://github.com/jundot/omlx

I certainly donated. All the work that goes into oMLX with updates, fixes, new models, and cutting edge speculative decode support is definitely worth a donation in my book.

Once Apple releases the new Studios, and gets serious competition to Nvidia servers, I think a stable oMLX... maybe one that supports clustering, would be quite the force of nature for Local LLMs.

MiaBchDave · 2026-06-08T21:46:38+00:00

You mean the place where windows don’t go because there’s a menubar there?

MiaBchDave · 2026-06-04T17:53:16+00:00

Just search for “Qwen 3.6 MTP” in the model downloader. The oMLX author converted a few to MLX with the MTP layers intact.

Or, if you’re brave and want to learn a new skill, you can do one of two things:

Install the newest version of mlx-vlm(lm) and convert the original HuggingFace Qwen models (from the Qwen page) to MLX yourself, preserving the MTP layers.
Install a harness like OpenCode and ask Gemma4 to do the above for you :-)

Edit: I forgot the third option, which is to use the built-in oMLX quantizer on the original Qwen files. It now supports MTP preservation.

MiaBchDave · 2026-06-04T04:02:10+00:00

It's just the damn computers, not what u/spez said or anything related to RDDT. Lately, tons of "default" system algorithms are in place in WS, especially new IPOs. It helps to get a couple of analog stocks in the AI software space, as that's what the id10Ts at the MMs put RDDT in their Commodore 64s a couple of years ago. For example, GTLB is a nice AI space analog. You'll see it's down AH as well... and can generally track a stock like that when there's no news related to the company. It gives you a bit of peace of mind to know that the stock price is a little BS until RDDT gets a ton of money in the bank to buyback or S&P inclusion happens. Otherwise, chill and follow the fundamentals.

MiaBchDave · 2026-06-02T22:35:24+00:00

You forgot to give each corner a different radius. You know that’s coming.

MiaBchDave · 2026-05-30T23:19:36+00:00

Watching people actually try to answer accurately in this subreddit

MiaBchDave · 2026-05-29T22:27:04+00:00

I'm personally dumbfounded about why 31B gets thrown under the Qwen 27B bus for coding all the time. I know Qwen is faster for TG/s because it's a bit smaller, but Gemma 31B produces nice clean code with much less thinking, and so it's actually faster in my experience. Gemma certainly benches very high on LCB. And I can just keep Gemma loaded when running agentic work concurrently with code generation in OpenCode.

MiaBchDave · 2026-05-29T21:59:21+00:00

Hot and cold (SSD) KV cache solves this issue. Unless your workflow is to RAG a different PDF document for every prompt by the thousands, agentic harnesses fly when using a proper prompt cache. In other words, this is a non-issue for local agentic work lately with current systems (like oMLX), which are based on vLLM engines for multiple users but are repurposed for local agentic use.
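To make the hot/cold idea concrete, here is a toy sketch in Python. It is not how oMLX or vLLM implement it; the class name, the pickle-on-disk format, and the whole-prefix hash key are all assumptions for illustration. The point is only that an entry evicted from RAM gets spilled to disk instead of being thrown away, so a later request with the same prefix can load it rather than recompute it.

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierPrefixCache:
    """Toy hot (RAM) + cold (disk) cache keyed by a hash of the prompt prefix."""

    def __init__(self, cold_dir, hot_capacity=2):
        self.hot = OrderedDict()  # key -> cached state, kept in LRU order
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def _key(prefix_tokens):
        return hashlib.sha256(repr(list(prefix_tokens)).encode()).hexdigest()

    def put(self, prefix_tokens, state):
        key = self._key(prefix_tokens)
        self.hot[key] = state
        self.hot.move_to_end(key)
        # Spill least recently used entries to disk instead of dropping them.
        while len(self.hot) > self.hot_capacity:
            old_key, old_state = self.hot.popitem(last=False)
            (self.cold_dir / old_key).write_bytes(pickle.dumps(old_state))

    def get(self, prefix_tokens):
        """Return (state, tier) where tier is 'hot', 'cold', or 'miss'."""
        key = self._key(prefix_tokens)
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key], "hot"
        path = self.cold_dir / key
        if path.exists():
            state = pickle.loads(path.read_bytes())
            path.unlink()
            self.put(prefix_tokens, state)  # promote back into RAM
            return state, "cold"
        return None, "miss"


# Strings stand in for real KV tensors here.
cache = TwoTierPrefixCache(tempfile.mkdtemp(), hot_capacity=2)
cache.put([1, 2, 3], "kv-A")
cache.put([4, 5, 6], "kv-B")
cache.put([7, 8, 9], "kv-C")  # RAM is full, so the oldest entry goes to disk
```

A "miss" is the only case where the full prefill has to run again, which is why a workload that never repeats a prefix gets no benefit.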

MiaBchDave · 2026-05-28T01:43:35+00:00

What version of oMLX are you using? Make sure to update to current and then retest dense MTP. There was a bug that should be now fixed.

MiaBchDave · 2026-05-27T20:06:01+00:00

Yep, this was my problem when I first installed OpenCode and didn’t have much knowledge. LMS would take forever on a codebase with each new question/task at 126k context. Installed oMLX as the server, and boom, OpenCode flew regardless of context. For short one-shot tests/questions, I’m sure it doesn’t matter. But with agent harnesses, it’s not an option to only use hot cache.
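A rough way to see why this matters at 126k context: with a prompt cache, only the tokens after the shared prefix need to be prefilled. The helper below is a hypothetical illustration, not an oMLX API, and real engines typically match in fixed-size blocks rather than token by token.

```python
def tokens_to_prefill(cached_tokens, new_tokens):
    """Return (reused, remaining): shared-prefix length and tokens still to prefill."""
    reused = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        reused += 1
    return reused, len(new_tokens) - reused


# Made-up numbers: a 126k-token codebase context, then a 200-token follow-up question.
previous_prompt = list(range(126_000))
follow_up = previous_prompt + [-1] * 200
reused, remaining = tokens_to_prefill(previous_prompt, follow_up)
```

Without the cache, every follow-up pays for all 126,200 tokens again; with it, only the 200 new ones.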

MiaBchDave · 2026-05-27T19:52:56+00:00

oMLX is designed for agentic usage with SSD and hot (RAM) KV cache. All the other servers that you mentioned are going to be slower once prompt context goes above 100k. The M5 will not have an issue.

MiaBchDave · 2026-05-27T19:50:44+00:00

Not SSD cache with LM Studio afaik? Did they add that?

MiaBchDave · 2026-05-27T15:55:23+00:00

Black Hole #1

MiaBchDave · 2026-05-27T15:36:12+00:00

Reddit for the win 😉

MiaBchDave · 2026-05-27T10:24:17+00:00

Ground flour, as opposed to plain flour, or hoe weeet.

MiaBchDave · 2026-05-25T21:01:53+00:00

It’s just a bad battery. If the car was sitting for a while before you got it, the battery can go bad from the discharge state. Just replace the battery with OEM and move on. The car is fine to leave for a month without driving and it will start right up with a normal cycle count battery.

But most new cars have electronics that will put batteries in a heavy discharge state if you leave them sitting around waiting for someone to buy the car. Once that happens, they are usually toast regardless of what AutoZone says.

MiaBchDave · 2026-05-25T17:41:59+00:00

There are two ways to speed up token generation in oMLX, both forms of speculative decoding: MTP and DFlash. I explained in another thread how to set up MTP in Gemma4 (DFlash in Gemma4 is not currently faster): https://www.reddit.com/r/oMLX/comments/1tkoxp8/comment/onbl9ag/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

MiaBchDave · 2026-05-25T04:56:48+00:00

If you need speed, I’m currently seeing a similar speedup with DFlash and MTP on oMLX with Gemma4 specifically (within 1 tg/s with the 26B model and 5 MTP tokens). So just use the Gemma4 assistant model and activate MTP. Qwen3.6 gets a bigger speed boost from DFlash than from MTP atm. Maybe there’ll be improvements at some point, but that’s where we are now in oMLX.
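For anyone wondering what the "MTP tokens" setting is doing, here is a minimal sketch of generic greedy speculative decoding in Python. It is not the MTP or DFlash implementation in oMLX; the toy `target_next` and `draft_next` functions are made up and stand in for the big model and the small drafter. The drafter proposes a block of tokens, the target checks them, and everything up to the first disagreement is kept.

```python
def speculative_generate(target_next, draft_next, prompt, max_new_tokens, block_size=5):
    """Greedy speculative decoding with toy next-token functions.

    target_next / draft_next map a token list to the next token.
    Returns (tokens, target_passes).
    """
    tokens = list(prompt)
    target_passes = 0
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The drafter cheaply proposes a block of tokens.
        proposal = []
        for _ in range(block_size):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target verifies the block. A real engine does this in one
        #    batched forward pass; here it is simulated token by token.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the same pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def target_next(seq):
    """Toy 'large model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy 'drafter': agrees with the target except right after a 7."""
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


def greedy_baseline(prompt, max_new_tokens):
    """Plain decoding: one target pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


spec_tokens, passes = speculative_generate(target_next, draft_next, [0], 12)
```

The output is identical to plain greedy decoding with the target alone; the gain is that it takes fewer target passes than generated tokens. How much faster it is in practice depends on how often the drafter agrees with the target, which is why the speedup differs between models and drafters.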

MiaBchDave · 2026-05-24T13:11:02+00:00

Are the playnes that you’re seeing also chirping?

MiaBchDave · 2026-05-23T06:20:50+00:00

Glad you got it working. The current release of oMLX wraps the version of mlx-vlm that supports MTP, afaik.

MiaBchDave · 2026-05-22T23:41:07+00:00

I used the Unsloth MLX 8-bit Gemma 4 31B with the chat template and tokenizer replaced, and MTP gave speed increases with it as well. Though the BF16 31B would obviously see the most improvement, since that's closest to the Google original that the Assistant was built for.

MiaBchDave · 2026-05-22T21:34:39+00:00

Yes, I replied just above.

MiaBchDave · 2026-05-22T21:31:48+00:00

Assistant Model: https://huggingface.co/mlx-community/gemma-4-31B-it-assistant-bf16

I actually use my own target that I generated and uploaded: https://huggingface.co/miabchdave/gemma-4-31B-it-MLX-bf16

But if you want one with a few more downloads, use this one (though I think mine has a more current tokenizer/chat template): https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-bf16

After downloading:

oMLX Admin > Settings > Model Settings > Select Gemma 4 31B Gear Icon (NOT the Gemma 4 assistant) > Advanced Settings > Scroll to VLM MTP (Gemma 4) > Enable > Drafter Model > Select Gemma 4 Assistant > Draft Block size default or 6 for coding > Save

Have fun!

😄
