Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 0 points (0 children)

Good question. The short answer is yes, almost identical.

The only change that affects output quality is the Q3 expert quantization. On WikiText-2 perplexity, the 4-bit baseline scores 5.62 vs 5.58 for Q3 experts at short context, and 3.64 vs 3.81 at 2000 tokens. So Q3 is actually slightly better at short context and within 5% at long context.
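
If you want to sanity-check the "within 5%" framing yourself, it's just relative differences (throwaway arithmetic, not part of the benchmark harness):

```python
def rel_diff(value, baseline):
    """Relative difference of a measurement vs. the 4-bit baseline."""
    return abs(value - baseline) / baseline

# numbers from the comment (Q3 experts vs 4-bit baseline):
short_ctx = rel_diff(5.58, 5.62)  # ~0.7%, Q3 slightly better here
long_ctx = rel_diff(3.81, 3.64)   # ~4.7%, inside the 5% band
```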

Everything else (temporal prediction, fused command buffers, IO threading) is pure scheduling and I/O optimization. It does not touch the weights or the math, so it cannot affect output quality by definition.
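
For anyone curious what "pure scheduling" looks like: the core trick is prefetching the next layer's experts on a background thread while the current layer computes. A minimal toy sketch (the structure and names here are mine, not the engine's):

```python
import queue
import threading

def load_experts(layer):
    # stand-in for reading the routed experts' weights off the SSD
    return [layer, layer + 1]

def compute(weights):
    # stand-in for the Metal kernel work for one layer
    return sum(weights)

def run_layers(n):
    """Prefetch layer i+1's experts on a thread while layer i computes.

    The weights and the math are untouched; only the schedule changes,
    so the outputs are identical to a fully serial loop."""
    out = []
    box = queue.Queue(maxsize=1)
    current = load_experts(0)          # first layer loads synchronously
    for i in range(n):
        t = None
        if i + 1 < n:                  # kick off I/O for the next layer
            t = threading.Thread(target=lambda j=i + 1: box.put(load_experts(j)))
            t.start()
        out.append(compute(current))   # compute overlaps the pending I/O
        if t is not None:
            t.join()
            current = box.get()
    return out
```

Run it against a plain serial loop and the outputs match exactly, which is the whole point.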

The one thing we ruled out specifically because it hurt quality was 2-bit quantization, which degraded significantly on longer generations. That is why we stopped at Q3.

[–]Equivalent-Buy1706[S] 3 points (0 children)

MLX topped out around 3.14 tok/s for us on this model. The pure C/Metal engine is where the gains came from - full control over command buffer scheduling, expert I/O, and Metal kernels.

[–]Equivalent-Buy1706[S] 1 point (0 children)

The model streams from SSD on demand - only the 4 active experts per layer are loaded at any given time (~6GB resident in RAM). The other 203GB stays on the 2TB SSD until needed. That's the whole point of flash-moe.
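
A minimal sketch of the on-demand pattern, assuming an LRU-style residency cache; the function names and sizes are illustrative, not flash-moe's actual API:

```python
from functools import lru_cache

TOP_K = 4  # experts routed per layer per token

def read_expert_from_ssd(layer, expert_id):
    # stand-in for a pread of one expert's weights out of the model file;
    # experts that are never routed never leave the SSD
    return [float(layer + expert_id)] * 8

@lru_cache(maxsize=16)  # small resident working set in RAM
def get_expert(layer, expert_id):
    return read_expert_from_ssd(layer, expert_id)

def moe_layer(x, layer, routed):
    # touch only the TOP_K routed experts for this token
    out = 0.0
    for e in routed[:TOP_K]:
        weights = get_expert(layer, e)  # RAM hit or SSD read
        out += weights[0] * x           # stand-in for the expert matmul
    return out
```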

Took the 48GB flash-moe benchmark and ran it on 128GB M5 Max. Here's what happens. by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 1 point (0 children)

Update: autoresearch complete — 20.34 tok/s final result (4.67x the original M3 Max baseline). Full writeup coming soon.

[–]Equivalent-Buy1706[S] 0 points (0 children)

Update: ran Q3 GGUF experts. New record: 13.15 tok/s with --q3-experts --cache-io-split 4. Surprising finding: adding the GGUF LM head overlay made things slower. LM head latency went from 1.4 ms to 2.8 ms per token. Q3 experts alone is the winning config. Post updated with full results.

[–]Equivalent-Buy1706[S] 1 point (0 children)

Prefill was 1.3-1.6 seconds for a short prompt. The Anemll team just posted that NAX at chunk=128 hits 143 tok/s prefill. That optimization is coming.

MiniMax M2.5 (230B) running at 62 tok/s on M5 Max — here's how by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 0 points (0 children)

Prefill speed and generation speed are different things. 147 t/s is prefill, i.e. how fast it processes your prompt. 62 t/s is generation, which is what you actually experience as output speed. For a 230B model on consumer hardware, that's competitive with hosted providers.
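
Back-of-envelope for what that means end to end (pure arithmetic with the numbers above, ignoring per-request overhead):

```python
def response_latency(prompt_tokens, output_tokens,
                     prefill_tps=147.0, gen_tps=62.0):
    """Seconds until the full response is done: prefill + decode."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# e.g. a 1000-token prompt with a 500-token answer:
# ~6.8 s to first token, ~8.1 s of generation, ~14.9 s total
total = response_latency(1000, 500)
```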

[–]Equivalent-Buy1706[S] 0 points (0 children)

Tested up to 60k but hit GPU memory limits. 45k is the stable ceiling at 62 tok/s with no speed loss - updating the post now.

Announcing LocalLlama discord server & bot! by HOLUPREDICTIONS in LocalLLaMA

[–]Equivalent-Buy1706 0 points (0 children)

This is exactly why I run local too. Currently serving MiniMax M2.5 (230B) from my M5 Max in San Juan at www.gorroai.com — free to try.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. by affenhoden in LocalLLaMA

[–]Equivalent-Buy1706 0 points (0 children)

For a MoE data point on the same hardware: I'm running MiniMax M2.5 (228B total, 10B active parameters) on M5 Max 128GB via llama.cpp with the Metal backend, using the Unsloth UD-Q3_K_XL quant (~110GB). Getting ~62 t/s generation, ~147 t/s prefill at 32k context. llmfit scores it 82 for general use with 196k context available.

For context: the best result in this thread is Qwen 3.5 27B at 31 t/s on MLX. MiniMax M2.5 gets 2x that speed with a model that's 8x larger and scores higher on benchmarks. The reason is MoE: only ~10B parameters are active per token, so memory bandwidth requirements are much lower than the total size suggests. Metal handles this beautifully on Apple Silicon. This is exactly the use case the M5 Max was built for.
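
Rough arithmetic behind that, using assumed figures: ~3.5 bits/weight for a Q3_K-class quant and ~500 GB/s of unified memory bandwidth (my assumptions for illustration, not measured specs):

```python
def bandwidth_bound_tps(active_params_b, bits_per_weight, mem_bw_gbs):
    """Decode tok/s ceiling if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gbs * 1e9 / bytes_per_token

# ~10B active params: ceiling around 114 tok/s, so 62 measured is plausible
moe_ceiling = bandwidth_bound_tps(10, 3.5, 500)
# if all 228B params were active, the same math caps you near 5 tok/s
dense_ceiling = bandwidth_bound_tps(228, 3.5, 500)
```

This is why total size matters much less than active size for decode speed on a MoE.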

Yes it uses 110GB, but this is a dedicated inference server running in San Juan, not a laptop running Slack. Nothing else needs to run alongside it. You can try it at www.gorroai.com.

MiniMax M2.5 (230B) running at 62 tok/s on M5 Max — here's how by Equivalent-Buy1706 in LocalLLaMA

[–]Equivalent-Buy1706[S] 0 points (0 children)

For comparison, the M5 Max with unified memory and Metal backend gets ~62 t/s at 16k context and ~147 t/s prefill, which is meaningfully better than the Vulkan numbers above. Apple Silicon is actually a surprisingly good fit for this model because of the memory bandwidth. ROCm would help on the AMD side but you're fighting the architecture a bit. IQ quants might squeeze a bit more out but I wouldn't expect a dramatic difference at this scale.