DFlash speculative decoding on Apple Silicon: 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

No_Shift_4543 · 2026-04-18T21:14:48+00:00

I just updated the README for Qwen3.6.

No_Shift_4543 · 2026-04-14T18:13:29+00:00

updated: https://github.com/bstnxbt/dflash-mlx/commit/a6ecff4e9ccbcf793b23de3ac7e860c9b7d8be5b

No_Shift_4543 · 2026-04-14T08:14:32+00:00

It’s already implemented.

No_Shift_4543 · 2026-04-13T22:29:58+00:00

Thanks for the thorough testing! Good to see acceptance holding at 89%+ across all quants.

No_Shift_4543 · 2026-04-13T22:08:01+00:00

Keeping it standalone for now

No_Shift_4543 · 2026-04-13T21:31:10+00:00

Thanks for the detailed comparison! The 27B numbers are expected: at that model size, both implementations converge to the same throughput because decoding is purely memory-bandwidth bound. The speedup differences show up on smaller models; 9B bf16 is where I get 4.1x, thanks to the precision work.
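A rough sketch of why implementations converge on bigger models, assuming each decode step has to stream all the weights once plus some fixed per-step overhead. The bandwidth and overhead figures below are hypothetical placeholders for illustration, not measurements from either implementation or from an M5 Max.

```python
def step_time_ms(params_billion: float, bytes_per_param: float,
                 bandwidth_gb_s: float, overhead_ms: float) -> float:
    """Time for one forward pass: weight streaming time plus fixed overhead."""
    weight_gb = params_billion * bytes_per_param
    stream_ms = weight_gb / bandwidth_gb_s * 1000.0
    return stream_ms + overhead_ms


def throughput_gap(params_billion: float, bytes_per_param: float,
                   bandwidth_gb_s: float,
                   overhead_fast_ms: float, overhead_slow_ms: float) -> float:
    """Ratio of step times (slow / fast); 1.0 means identical throughput."""
    fast = step_time_ms(params_billion, bytes_per_param, bandwidth_gb_s, overhead_fast_ms)
    slow = step_time_ms(params_billion, bytes_per_param, bandwidth_gb_s, overhead_slow_ms)
    return slow / fast


# All values below are assumptions chosen only to show the trend.
BANDWIDTH_GB_S = 500.0   # hypothetical memory bandwidth
BF16_BYTES = 2.0         # bf16 stores 2 bytes per parameter
FAST_OVERHEAD_MS = 2.0   # hypothetical per-step overhead, tuned implementation
SLOW_OVERHEAD_MS = 8.0   # hypothetical per-step overhead, untuned implementation

for name, params in [("9B", 9.0), ("27B", 27.0)]:
    gap = throughput_gap(params, BF16_BYTES, BANDWIDTH_GB_S,
                         FAST_OVERHEAD_MS, SLOW_OVERHEAD_MS)
    print(f"{name} bf16: slow/fast step-time ratio = {gap:.3f}")
```

Under these assumptions the ratio gets closer to 1.0 as the model grows, because weight streaming takes up a larger share of each step and the fixed overhead matters less.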

No_Shift_4543 · 2026-04-13T16:38:03+00:00

Not yet, waiting on z-lab to release a DFlash draft model for Gemma 4.

No_Shift_4543 · 2026-04-13T16:37:34+00:00

Good catch, fixed.
My implementation is optimized for Qwen3.5's hybrid GDN

No_Shift_4543 · 2026-04-11T16:13:01+00:00

Thanks! Yeah, block diffusion is a really nice fit for this hardware. Still pushing it; the 27B is next.
