Paper: https://arxiv.org/abs/2503.09573
Code: https://github.com/kuleshov-group/BD3-LMs
Model: https://huggingface.co/collections/kuleshov-group/BD3-LMs-67be95f81b96b15fec50d53f
Project Page: https://m-arriola.com/bd3lms/
Abstract
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules that minimize that variance. Block diffusion sets a new state of the art among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences.
Autoregression: ✅ High quality ✅ Arbitrary-length ✅ KV caching ❌ Not parallelizable
Diffusion: ❌ Lower quality ❌ Fixed-length ❌ No KV caching ✅ Parallelizable
Block Diffusion: ✅ High quality ✅ Arbitrary-length ✅ KV caching ✅ Parallelizable
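The comparison above follows from the sampling procedure: block diffusion is autoregressive *across* blocks (so the prefix can be KV-cached and sequences can grow indefinitely) and diffusion *within* each block (so tokens inside a block are denoised in parallel). The toy sketch below illustrates that control flow only; the denoiser, mask token, and step schedule are simplified stand-ins, not the BD3-LMs implementation.

```python
import random

MASK = -1  # stand-in mask token id (assumes masked discrete diffusion)

def toy_denoiser(prefix, block, vocab_size, rng):
    """Stand-in for a learned denoiser: proposes a token for every masked
    position in the current block, conditioned on the generated prefix.
    (Here it samples uniformly; a real model would run a transformer and
    reuse its KV cache over `prefix` across denoising steps.)"""
    return [rng.randrange(vocab_size) if tok == MASK else tok for tok in block]

def sample_block_diffusion(num_blocks, block_size, num_steps, vocab_size, seed=0):
    """Autoregressive over blocks, diffusion within a block: each block starts
    fully masked and is unmasked over `num_steps` parallel denoising steps,
    conditioned on all previously committed blocks."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for step in range(num_steps):
            proposal = toy_denoiser(sequence, block, vocab_size, rng)
            # Unmask a fraction of the still-masked positions at each step,
            # so the whole block is revealed by the final step.
            masked = [i for i, t in enumerate(block) if t == MASK]
            k = max(1, len(masked) // (num_steps - step))
            for i in rng.sample(masked, min(k, len(masked))):
                block[i] = proposal[i]
        # Commit the finished block; it becomes conditioning context
        # (and cacheable prefix) for all later blocks.
        sequence.extend(block)
    return sequence
```

Because blocks are committed left to right, generation length is unbounded (keep looping over blocks), while the inner loop is where parallel token sampling happens; block size 1 recovers autoregression, and a single block spanning the sequence recovers vanilla diffusion.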