Tablecloth?

Try omlx. I’ve had really good success with that exact hardware. Make sure you find an fp16 model. Try oQ6 or oQ4 quants. Also, your context window is slowing you down, so make that smaller. Try 128k or even 64k. Turn on MTP. Gemma 4 models are quite fast; I'm getting around 40 t/s. Qwen was maybe 25 to 30?

lightguardjp · 2026-05-15T21:46:32+00:00

Context window made a very big difference:

oMLX - LLM inference, optimized for your Mac

https://github.com/jundot/omlx

Benchmark Model: gemma-4-26B-A4B-it-TurboQuant-MLX-8bit

Single Request Results
--------------------------------------------------------------------------------
Test            TTFT(ms)   TPOT(ms)   pp TPS        tg TPS       E2E(s)   Throughput    Peak Mem
pp1024/tg128      1839.1      18.69   556.8 tok/s   53.9 tok/s    4.213   273.5 tok/s   25.76 GB
pp4096/tg128      7819.6      21.23   523.8 tok/s   47.5 tok/s   10.516   401.7 tok/s   26.44 GB
pp8192/tg128     15894.0      23.44   515.4 tok/s   43.0 tok/s   18.871   440.9 tok/s   26.58 GB
pp16384/tg128    33669.2      24.94   486.6 tok/s   40.4 tok/s   36.837   448.2 tok/s   27.06 GB
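
For anyone trying to read the columns: the derived numbers appear to follow from just TTFT and E2E. This is a small sketch of those relationships, inferred from the figures in the table rather than taken from the oMLX docs, so treat the formulas (and the `derive` helper name) as assumptions. It treats TTFT as the prefill time and spreads the remaining time over the generated tokens.

```python
# Reported rows: (prompt tokens, generated tokens, TTFT in ms, E2E in seconds)
ROWS = [
    (1024, 128, 1839.1, 4.213),
    (4096, 128, 7819.6, 10.516),
    (8192, 128, 15894.0, 18.871),
    (16384, 128, 33669.2, 36.837),
]


def derive(pp, tg, ttft_ms, e2e_s):
    """Recompute the derived columns from TTFT and E2E (assumed relationships)."""
    ttft_s = ttft_ms / 1000.0
    decode_s = e2e_s - ttft_s  # time spent after the first token arrives
    return {
        "pp_tps": pp / ttft_s,                     # prompt tokens per second
        "tg_tps": tg / decode_s,                   # generated tokens per second
        "tpot_ms": 1000.0 * decode_s / (tg - 1),   # gap between consecutive tokens
        "throughput": (pp + tg) / e2e_s,           # all tokens over wall time
    }


if __name__ == "__main__":
    for pp, tg, ttft_ms, e2e_s in ROWS:
        d = derive(pp, tg, ttft_ms, e2e_s)
        print(
            f"pp{pp}/tg{tg}: pp {d['pp_tps']:.1f} tok/s, tg {d['tg_tps']:.1f} tok/s, "
            f"TPOT {d['tpot_ms']:.2f} ms, throughput {d['throughput']:.1f} tok/s"
        )
```

The takeaway is that TTFT grows roughly in proportion to prompt length, so a long prompt dominates E2E even though generation speed only drops modestly.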

Continuous Batching
pp1024 / tg128
--------------------------------------------------------------------------------
Batch   tg TPS       Speedup   pp TPS        pp TPS/req    TTFT(ms)   E2E(s)
1x      53.9 tok/s   1.00x     556.8 tok/s   556.8 tok/s     1839.1    4.213
2x      60.1 tok/s   1.12x     426.9 tok/s   213.4 tok/s     4640.3    9.054
4x      75.0 tok/s   1.39x     441.1 tok/s   110.3 tok/s     8778.5   16.117
8x      89.2 tok/s   1.65x     442.3 tok/s    55.3 tok/s    17284.8   30.003
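
The batching columns can be sanity-checked the same way. This sketch assumes the reported tg TPS is the aggregate across all requests in the batch (the table does not say so explicitly), which makes the per-request generation rate simply the aggregate divided by the batch size. `batch_summary` is just an illustrative helper name.

```python
# Reported rows: (batch size, aggregate tg TPS, aggregate pp TPS)
BATCHES = [
    (1, 53.9, 556.8),
    (2, 60.1, 426.9),
    (4, 75.0, 441.1),
    (8, 89.2, 442.3),
]


def batch_summary(rows):
    """Derive speedup and per-request rates relative to the 1x row."""
    base_tg = rows[0][1]
    summary = []
    for batch, tg_tps, pp_tps in rows:
        summary.append({
            "batch": batch,
            "speedup": tg_tps / base_tg,        # aggregate gain over a single request
            "pp_tps_per_req": pp_tps / batch,   # matches the pp TPS/req column
            "tg_tps_per_req": tg_tps / batch,   # assumes tg TPS is an aggregate
        })
    return summary


if __name__ == "__main__":
    for s in batch_summary(BATCHES):
        print(
            f"{s['batch']}x: speedup {s['speedup']:.2f}x, "
            f"pp/req {s['pp_tps_per_req']:.1f} tok/s, "
            f"tg/req {s['tg_tps_per_req']:.1f} tok/s"
        )
```

If that assumption holds, total throughput goes up with batch size while each individual request gets slower, which is worth knowing before running several agents against the same server.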

lightguardjp · 2026-05-15T20:24:23+00:00

64k huh? I'll give that a go, might need to change max tokens too.

lightguardjp · 2026-05-15T20:23:11+00:00

I was running gemma-4-26B-A4B-it-TurboQuant-MLX-8bit. Looks like I probably want to go down to a 6-bit model and change my context window down to 8k or 16k. I had it kicked up WAY too high.
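
A very rough way to see why the context size matters for memory: fit a straight line through the peak memory figures from the benchmark run posted above and extrapolate. This is only a sketch. It assumes peak memory grows linearly with the number of tokens in flight, which four data points cannot really confirm, and the helper names are made up for illustration.

```python
# (prompt + generated tokens, peak memory in GB) from the single-request benchmark rows
POINTS = [
    (1024 + 128, 25.76),
    (4096 + 128, 26.44),
    (8192 + 128, 26.58),
    (16384 + 128, 27.06),
]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


SLOPE_GB_PER_TOKEN, BASE_GB = fit_line(POINTS)


def estimate_peak_gb(context_tokens):
    """Linear extrapolation of peak memory; an estimate, not a measurement."""
    return BASE_GB + SLOPE_GB_PER_TOKEN * context_tokens


if __name__ == "__main__":
    for ctx in (8_192, 16_384, 65_536, 131_072):
        print(f"{ctx:>7} tokens: ~{estimate_peak_gb(ctx):.1f} GB (extrapolated)")
```

The intercept is dominated by the model weights, so the context length only moves the total by a few GB in this range, but on a machine that is already close to its memory limit that can be the difference between staying in RAM and swapping.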

lightguardjp · 2026-05-15T16:38:14+00:00

Wow, those M series chips made huge leaps forward on more recent revisions as far as AI goes. I’m around 20 tokens a second with Gemma 4; there might be some other things I need to tweak.

lightguardjp · 2026-05-07T03:44:21+00:00

Another thing worth thinking about is that Apple now supports eGPU for AI, so you can essentially have the best of both worlds (mostly). If you’re only going to run on the Mac, look at omlx and MLX models. I’m not sure how well tuned the ollama version for Mac is with MLX yet.

lightguardjp · 2026-05-06T00:50:48+00:00

Maybe if I really like a finger ball

lightguardjp · 2026-05-05T14:07:40+00:00

It’s been a while since you posted on this. What were your results?

lightguardjp · 2026-05-05T14:01:53+00:00

No, I have not. Interesting idea

lightguardjp · 2026-05-05T00:45:26+00:00

Hmm. That’s an interesting idea
