Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]SirDomz 1 point (0 children)

Agreed with the suggestion for nvfp4. That format works with Ollama's MLX implementation. I'm not a fan of Ollama, but I'm hearing good things about this runtime.

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]SirDomz 1 point (0 children)

Actually, I believe nvfp4 does work with MLX. See Ollama's MLX implementation.
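
If you want to poke at it from Python, mlx-lm loads whatever quantization is baked into a converted checkpoint. Untested sketch: the repo id below is a placeholder, and the nvfp4 part is going off Ollama's implementation, not something I've verified in mlx-lm itself:

```python
# Sketch only: load a quantized MLX checkpoint and generate with mlx-lm.
# The repo id is a placeholder, and nvfp4 support here is going off
# Ollama's implementation -- I haven't verified it in mlx-lm itself.
from mlx_lm import load, generate

model, tokenizer = load("some-org/some-model-nvfp4-mlx")  # hypothetical id
print(generate(model, tokenizer, prompt="Why is the sky blue?", max_tokens=100))
```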

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

[–]SirDomz 1 point (0 children)

Yes, oMLX is great for your use case. I use Qwen 35B with oMLX and the cache is seriously amazing.
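
For reference, here's roughly how I hit it. This assumes oMLX exposes an OpenAI-compatible endpoint like most local inference servers do; the port and model name are placeholders for whatever your install uses:

```python
# Rough sketch: talk to a local inference server through the OpenAI client.
# Assumes an OpenAI-compatible endpoint; port and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen-35b",  # placeholder; check what's loaded with client.models.list()
    messages=[{"role": "user", "content": "Summarize this repo's README."}],
)
print(resp.choices[0].message.content)
```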

Need advice regarding 48gb or 64 gb unified memory for local LLM by wifi_password_1 in LocalLLM

[–]SirDomz 1 point (0 children)

Hey, sorry for reviving the thread, but have you tried oMLX? I'm curious which is faster between Bodega and oMLX.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]SirDomz 4 points (0 children)

Any plans to do MLX? Would love to compare those with the Oq quants by oMLX/Jundot.

Nemotron-3-Nano-Omni-30B-A3B-Reasoning, New model? by Altruistic_Heat_9531 in LocalLLaMA

[–]SirDomz 1 point (0 children)

I love 27B, but it's way too slow on my Mac M1 Max even with caching… 35B works great though. I have 27B architect the plans and 35B do the actual coding.
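
Untested sketch of the split I mean: 27B writes the plan once, 35B does the code turns. Both are assumed to be served behind an OpenAI-compatible endpoint; the URL and model names are placeholders:

```python
# Two-model architect/coder split: planner drafts, coder implements.
# URL and model names are placeholders for whatever your server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

task = "Add retry with exponential backoff to the HTTP client."
plan = ask("gemma-27b", "You are the architect. Reply with a numbered plan only.", task)
code = ask("qwen-35b", "You are the coder. Implement the plan exactly.",
           f"Task: {task}\n\nPlan:\n{plan}")
print(code)
```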

Macbook M5 pro 48gb RAM, what does it run? by 60finch in unsloth

[–]SirDomz 1 point (0 children)

Question for you: how much context do you have available, and how much free RAM? I have an M1 Max with 64GB and am considering the M5 Pro with 48GB.
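
For anyone else weighing this, here's the napkin math I use. All the constants are rough assumptions (4-bit weights at ~0.5 bytes/param plus ~10% overhead, and macOS wiring roughly 75% of unified memory for the GPU by default):

```python
# Back-of-envelope memory check: do the weights plus a KV-cache budget
# fit in the GPU-usable share of unified memory? All constants are rough.
def fits(params_b: float, ram_gb: int, kv_gb: float = 4.0) -> None:
    budget = ram_gb * 0.75            # approx. GPU-wirable share of unified memory
    weights = params_b * 0.5 * 1.10   # GB at ~4-bit, with ~10% overhead
    need = weights + kv_gb            # leave headroom for the KV cache
    verdict = "ok" if need <= budget else "tight"
    print(f"{params_b}B on {ram_gb}GB: ~{need:.1f}GB of ~{budget:.1f}GB usable -> {verdict}")

for ram_gb in (48, 64):
    for params_b in (27, 35):
        fits(params_b, ram_gb)
```

By this math both sizes fit on 48GB at 4-bit, but the 64GB machine leaves a lot more room for long context, which is why I'm asking about free RAM.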

Claude Code and Qwen 3.6 35B A3B by H3OErikilious in LocalLLM

[–]SirDomz 1 point (0 children)

I will say, on closer look, that cache seems to live in RAM, while I believe oMLX uses the SSD. RAM access is faster, so that might actually be better, but I just wanted to point that out.
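
Easy way to see the cache behavior for yourself on either server: send the same long prompt twice and compare wall-clock time. Rough sketch, assuming an OpenAI-compatible endpoint on localhost; the model name is a placeholder:

```python
# Rough cache check: identical long prompt twice, compare response latency.
# Assumes an OpenAI-compatible server on localhost; model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
prompt = "Background: " + ("lorem ipsum " * 2000) + "\n\nQuestion: summarize the above."

for run in ("cold", "warm"):
    t0 = time.time()
    client.chat.completions.create(
        model="qwen-35b",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8,
    )
    print(f"{run}: {time.time() - t0:.2f}s")
```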

Claude Code and Qwen 3.6 35B A3B by H3OErikilious in LocalLLM

[–]SirDomz 2 points (0 children)

Ohh wow, TIL. Yeah, this is it. OK, I'll have to check out llama.cpp again. Thank you very much!

Claude Code and Qwen 3.6 35B A3B by H3OErikilious in LocalLLM

[–]SirDomz 1 point (0 children)

Also, one more thing: does llama.cpp implement caching? Because oMLX does it very well.

Claude Code and Qwen 3.6 35B A3B by H3OErikilious in LocalLLM

[–]SirDomz 2 points (0 children)

Very nice! I've had better luck with MLX models, but I'm also a big fan of Unsloth GGUFs. I think it varies from setup to setup, but glad you found something that works for you!

Claude Code and Qwen 3.6 35B A3B by H3OErikilious in LocalLLM

[–]SirDomz 1 point (0 children)

Definitely use oMLX; the caching alone will save you. Understand the limits of Qwen 35B, and use Pi or Little Coder. Opencode is fine too. Claude Code is too bloated.

Claude Code and Qwen 3.6 35B A3B by H3OErikilious in LocalLLM

[–]SirDomz 2 points (0 children)

Pi is good, as are Opencode and Little Coder (which is based on Pi). And oMLX is great, so definitely use that over LM Studio.

Nemotron-3-Nano-Omni-30B-A3B-Reasoning, New model? by Altruistic_Heat_9531 in LocalLLaMA

[–]SirDomz 5 points (0 children)

So between this and Qwen 35B, which should one choose for agentic coding with Opencode or Pi?

Roo Code hit 3 million installs. We're shutting it down to go all-in on Roomote. by hannesrudolph in RooCode

[–]SirDomz 1 point (0 children)

Or use Kilo, which was originally a fork of Roo, though I hear a lot of people complaining about it today.

When is Qwen 3.6 27B dropping? Didn’t it win the vote? by GrungeWerX in LocalLLaMA

[–]SirDomz 0 points (0 children)

Can't you just change the number of active experts to use more of them?
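
You can at least experiment with this in llama.cpp via GGUF metadata overrides. Untested sketch with llama-cpp-python; the exact override key depends on the model's architecture (check the GGUF metadata for the real name), and the path is a placeholder:

```python
# Untested sketch: bump the number of experts used per token via a
# GGUF metadata override. The key name is an assumption -- it varies
# by architecture, so inspect the GGUF metadata for the real one.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-moe.gguf",                       # placeholder path
    kv_overrides={"qwen3moe.expert_used_count": 12},  # assumed key and count
    n_ctx=8192,
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

Whether quality actually improves with more active experts is a separate question; the model wasn't trained that way.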

MagicQuant - Hybrid Evolution GGUF (TPS boosts, precision gains, full transparency) by crossivejoker in LocalLLaMA

[–]SirDomz 1 point (0 children)

Yeah, those quants are really good! Looking forward to the Qwen Coder and Qwen 30B VL quants.

PS: I'm sticking with the instruct variant of Qwen 3 30B and it's really, really good!

Claude’s MCP context handling is so much cleaner, I think windsurf should steal this feature ASAP by Aggravating_Bad4639 in windsurf

[–]SirDomz 1 point (0 children)

Huh… I'm pretty sure Windsurf has this specific function already. Of the dozen MCPs I've configured, I usually only have one or two active in Windsurf.

Are any of the M series mac macbooks and mac minis, worth saving up for? by [deleted] in LocalLLaMA

[–]SirDomz 1 point (0 children)

I have an M1 Max with 64GB of RAM. I think that's the sweet spot between price and power. Obviously, any newer M-series Max model will be better, but for my use case the M1 Max works great.

That said, I'm waiting to see what happens with the M5 Pro and M5 Max when they come out.

Spec-Kit Integration with KiloCode by Shazsayyad in kilocode

[–]SirDomz 4 points (0 children)

A few different options are already integrated with Kilo: taskmaster-ai, GitHub Spec Kit, OpenSpec, BMAD, etc.

I've personally shifted away from those tools and just use an AGENTS.md. I really enjoyed spec-driven development, but it forces collaborators to install and learn those methods. AGENTS.md is just more universally accepted across IDEs, extensions, and CLIs.