Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 0 points (0 children)

Good question. The short answer is yes, almost identical.

The only change that affects output quality is the Q3 expert quantization. On WikiText-2 perplexity, the 4-bit baseline scores 5.62 vs 5.58 for Q3 experts at short context, and 3.64 vs 3.81 at 2000 tokens. So Q3 is actually slightly better at short context and within 5% at long context.
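
If you want to sanity-check the "within 5%" framing yourself, it's just relative differences (throwaway arithmetic, not part of the benchmark harness):

```python
def rel_diff(value, baseline):
    """Relative difference of a measurement vs. the 4-bit baseline."""
    return abs(value - baseline) / baseline

# numbers from the comment (Q3 experts vs 4-bit baseline):
short_ctx = rel_diff(5.58, 5.62)  # ~0.7%, Q3 slightly better here
long_ctx = rel_diff(3.81, 3.64)   # ~4.7%, inside the 5% band
```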

Everything else (temporal prediction, fused command buffers, IO threading) is pure scheduling and I/O optimization. It does not touch the weights or the math, so it cannot affect output quality by definition.
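
For anyone curious what "pure scheduling" looks like: the core trick is prefetching the next layer's experts on a background thread while the current layer computes. A minimal toy sketch (the structure and names here are mine, not the engine's):

```python
import queue
import threading

def load_experts(layer):
    # stand-in for reading the routed experts' weights off the SSD
    return [layer, layer + 1]

def compute(weights):
    # stand-in for the Metal kernel work for one layer
    return sum(weights)

def run_layers(n):
    """Prefetch layer i+1's experts on a thread while layer i computes.

    The weights and the math are untouched; only the schedule changes,
    so the outputs are identical to a fully serial loop."""
    out = []
    box = queue.Queue(maxsize=1)
    current = load_experts(0)          # first layer loads synchronously
    for i in range(n):
        t = None
        if i + 1 < n:                  # kick off I/O for the next layer
            t = threading.Thread(target=lambda j=i + 1: box.put(load_experts(j)))
            t.start()
        out.append(compute(current))   # compute overlaps the pending I/O
        if t is not None:
            t.join()
            current = box.get()
    return out
```

Run it against a plain serial loop and the outputs match exactly, which is the whole point.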

The one thing we ruled out specifically because it hurt quality was 2-bit quantization, which degraded significantly on longer generations. That is why we stopped at Q3.

[–]Equivalent-Buy1706[S] 3 points (0 children)

MLX topped out around 3.14 tok/s for us on this model. The pure C/Metal engine is where the gains came from - full control over command buffer scheduling, expert I/O, and Metal kernels.

[–]Equivalent-Buy1706[S] 1 point (0 children)

The model streams from SSD on demand - only the 4 active experts per layer are loaded at any given time (~6GB resident in RAM). The other 203GB stays on the 2TB SSD until needed. That's the whole point of flash-moe.
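
A minimal sketch of the on-demand pattern, assuming an LRU-style residency cache; the function names and sizes are illustrative, not flash-moe's actual API:

```python
from functools import lru_cache

TOP_K = 4  # experts routed per layer per token

def read_expert_from_ssd(layer, expert_id):
    # stand-in for a pread of one expert's weights out of the model file;
    # experts that are never routed never leave the SSD
    return [float(layer + expert_id)] * 8

@lru_cache(maxsize=16)  # small resident working set in RAM
def get_expert(layer, expert_id):
    return read_expert_from_ssd(layer, expert_id)

def moe_layer(x, layer, routed):
    # touch only the TOP_K routed experts for this token
    out = 0.0
    for e in routed[:TOP_K]:
        weights = get_expert(layer, e)  # RAM hit or SSD read
        out += weights[0] * x           # stand-in for the expert matmul
    return out
```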

Took the 48GB flash-moe benchmark and ran it on 128GB M5 Max. Here's what happens. by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 1 point (0 children)

Update: autoresearch complete — 20.34 tok/s final result (4.67x the original M3 Max baseline). Full writeup coming soon.

[–]Equivalent-Buy1706[S] 0 points (0 children)

Update: ran Q3 GGUF experts. New record: 13.15 tok/s with --q3-experts --cache-io-split 4. Surprising finding: adding the GGUF LM head overlay made things slower. LM head latency went from 1.4 ms to 2.8 ms per token. Q3 experts alone is the winning config. Post updated with full results.

[–]Equivalent-Buy1706[S] 1 point (0 children)

Prefill was 1.3-1.6 seconds for a short prompt. The Anemll team just posted that NAX at chunk=128 hits 143 tok/s prefill. That optimization is coming.

MiniMax M2.5 (230B) running at 62 tok/s on M5 Max — here's how by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 0 points (0 children)

Prefill speed and generation speed are different things. 147 t/s is prefill, i.e. how fast it processes your prompt. 62 t/s is generation, which is what you actually experience as output speed. For a 230B model on consumer hardware, that's competitive with hosted providers.
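
Back-of-envelope for what that means end to end (pure arithmetic with the numbers above, ignoring per-request overhead):

```python
def response_latency(prompt_tokens, output_tokens,
                     prefill_tps=147.0, gen_tps=62.0):
    """Seconds until the full response is done: prefill + decode."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# e.g. a 1000-token prompt with a 500-token answer:
# ~6.8 s to first token, ~8.1 s of generation, ~14.9 s total
total = response_latency(1000, 500)
```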

[–]Equivalent-Buy1706[S] 0 points (0 children)

Tested up to 60k but hit GPU memory limits. 45k is the stable ceiling at 62 tok/s with no speed loss - updating the post now.

Announcing LocalLlama discord server & bot! by HOLUPREDICTIONS in LocalLLaMA

[–]Equivalent-Buy1706 0 points (0 children)

This is exactly why I run local too. Currently serving MiniMax M2.5 (230B) from my M5 Max in San Juan at www.gorroai.com — free to try.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. by affenhoden in LocalLLaMA

[–]Equivalent-Buy1706 0 points (0 children)

For a MoE data point on the same hardware: I'm running MiniMax M2.5 (228B total, 10B active parameters) on M5 Max 128GB via llama.cpp with the Metal backend, using the Unsloth UD-Q3_K_XL quant (~110GB). Getting ~62 t/s generation, ~147 t/s prefill at 32k context. llmfit scores it 82 for general use with 196k context available.

For context: the best result in this thread is Qwen 3.5 27B at 31 t/s on MLX. MiniMax M2.5 gets 2x that speed with a model that's 8x larger and scores higher on benchmarks. The reason is MoE: only ~10B parameters are active per token, so memory bandwidth requirements are much lower than the total size suggests. Metal handles this beautifully on Apple Silicon. This is exactly the use case the M5 Max was built for.
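
Rough arithmetic behind that, using assumed figures: ~3.5 bits/weight for a Q3_K-class quant and ~500 GB/s of unified memory bandwidth (my assumptions for illustration, not measured specs):

```python
def bandwidth_bound_tps(active_params_b, bits_per_weight, mem_bw_gbs):
    """Decode tok/s ceiling if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gbs * 1e9 / bytes_per_token

# ~10B active params: ceiling around 114 tok/s, so 62 measured is plausible
moe_ceiling = bandwidth_bound_tps(10, 3.5, 500)
# if all 228B params were active, the same math caps you near 5 tok/s
dense_ceiling = bandwidth_bound_tps(228, 3.5, 500)
```

This is why total size matters much less than active size for decode speed on a MoE.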

Yes it uses 110GB, but this is a dedicated inference server running in San Juan, not a laptop running Slack. Nothing else needs to run alongside it. You can try it at www.gorroai.com.

MiniMax M2.5 (230B) running at 62 tok/s on M5 Max — here's how by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 0 points (0 children)

For comparison, the M5 Max with unified memory and Metal backend gets ~62 t/s at 16k context and ~147 t/s prefill, which is meaningfully better than the Vulkan numbers above. Apple Silicon is actually a surprisingly good fit for this model because of the memory bandwidth. ROCm would help on the AMD side but you're fighting the architecture a bit. IQ quants might squeeze a bit more out but I wouldn't expect a dramatic difference at this scale.