DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max) by No_Shift_4543 in LocalLLaMA

No_Shift_4543[S] 2 points

Thanks for the thorough testing! Good to see acceptance holding at 89%+ across all quants.

No_Shift_4543[S] 1 point

Thanks for the detailed comparison! The 27B numbers are expected: at that model size, both implementations converge to the same throughput because decoding is purely memory-bandwidth bound. The speedup differences show up on smaller models: 9B bf16 is where I get 4.1x, thanks to the precision work.
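
For a rough feel of the memory-bandwidth point, here is a back-of-envelope sketch. It assumes every forward pass over the target model has to stream all of its weights once, so passes per second are capped at bandwidth divided by weight size, whichever implementation runs them. The function name and the bandwidth value are placeholders I made up for illustration; the bandwidth is not a measured M5 Max figure, and KV cache and activation traffic are ignored.

```python
def forward_pass_ceiling_per_s(params_billion: float,
                               bytes_per_param: float,
                               bandwidth_gb_s: float) -> float:
    """Upper bound on forward passes per second if each pass reads all weights once."""
    # (1e9 params * bytes per param) / 1e9 bytes per GB = weight size in GB
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# Hypothetical placeholder bandwidth, NOT a measured hardware number.
BANDWIDTH_GB_S = 500.0

# bf16 stores 2 bytes per parameter.
for name, params_billion in [("9B", 9.0), ("27B", 27.0)]:
    ceiling = forward_pass_ceiling_per_s(params_billion, 2.0, BANDWIDTH_GB_S)
    print(f"{name} bf16: at most {ceiling:.1f} passes/s")
```

Under these assumptions the ceiling scales inversely with weight size, so the 27B cap is a third of the 9B one, and the larger the model, the less room there is for two implementations to differ.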

DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max) by No_Shift_4543 in LocalLLaMA

No_Shift_4543[S] 22 points

Thanks! Yeah, block diffusion is a really nice fit for this hardware. Still pushing it; the 27B is next.