Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Authors: Aaron Lou, Chenlin Meng, Stefano Ermon

What

This paper introduces Score Entropy Discrete Diffusion (SEDD), a novel approach for building discrete diffusion models parameterized by the ratios of the data distribution, aiming to address the limitations of existing diffusion models in handling discrete data like natural language.
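
To make the parameterization concrete (notation adapted from the paper's description rather than quoted from it): for a sequence x and any sequence y that differs from it in a single token, the network estimates the concrete score

$$
s_\theta(x)_y \;\approx\; \frac{p_t(y)}{p_t(x)},
$$

the discrete analogue of the gradient $\nabla_x \log p_t(x)$ that continuous diffusion models learn.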

Why

The paper is important because it presents a novel method for discrete diffusion models that outperforms previous models in language modeling tasks, challenges the dominance of autoregressive models, and offers advantages like faster, controllable, and higher-quality generation without relying on distribution annealing techniques.

How

The authors develop a novel loss function called score entropy, analogous to the score matching objective used in continuous diffusion models. They use this loss to train a transformer model on language modeling benchmarks such as text8, One Billion Words, and the GPT-2 zero-shot perplexity tasks. They evaluate the model's perplexity and generation quality against existing discrete diffusion models and autoregressive models such as GPT-2.
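
As a rough illustration of the objective (not the authors' code; the function and argument names below are my own, and the exact weighting and masking used in the paper may differ), the denoising score entropy loss can be sketched as:

```python
import torch

def score_entropy_loss(pred_ratios, true_ratios, weights=None, eps=1e-12):
    """Sketch of a (denoising) score entropy objective.

    pred_ratios: the model's positive estimates of p_t(y) / p_t(x_t) for
                 alternative tokens y at each position (shape [..., vocab]).
    true_ratios: the corresponding ground-truth ratios, tractable in closed
                 form for structured (uniform / absorbing) forward processes.
    weights:     optional per-entry weights, e.g. forward transition rates.
    """
    s = pred_ratios.clamp_min(eps)  # keep the log well-defined
    r = true_ratios
    # Elementwise score entropy: s - r * log s + r * (log r - 1).
    # torch.xlogy returns 0 where r == 0, matching the limit of r * log r,
    # and the r-only term is a constant that keeps the loss non-negative.
    elem = s - r * torch.log(s) + (torch.xlogy(r, r) - r)
    if weights is not None:
        elem = weights * elem
    # In practice the entry corresponding to y == x_t is excluded; here we
    # simply sum over the vocabulary dimension and average over the batch.
    return elem.sum(dim=-1).mean()
```

Training then amounts to sampling a noise level, corrupting the clean sequence with the forward process, computing the closed-form true ratios, and minimizing this quantity over the transformer's outputs.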

Result

SEDD significantly outperforms previous discrete diffusion models on language modeling benchmarks and achieves competitive perplexity compared to autoregressive models, even surpassing GPT-2 on several zero-shot tasks. Furthermore, SEDD generates higher-quality text without distribution annealing techniques and supports flexible conditional generation such as infilling, matching nucleus-sampling quality while allowing prompting strategies beyond left-to-right.
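
To give a flavor of how infilling can work in this setting (a simplified sketch under my own assumptions; `reverse_step` and `forward_corrupt` are hypothetical helpers, not the paper's API, and the paper's actual conditional sampling leverages the learned ratios more directly), one generic strategy is to clamp the known tokens at every reverse-diffusion step:

```python
import torch

def infill(model, tokens, known_mask, num_steps, reverse_step, forward_corrupt):
    """Hypothetical infilling loop: positions where known_mask is True are held
    fixed to the prompt, the rest are denoised by the discrete reverse process."""
    # Start from a fully corrupted sequence (e.g. all tokens absorbed / masked).
    x = forward_corrupt(tokens, t=1.0)
    for step in range(num_steps, 0, -1):
        t = step / num_steps
        # One reverse-diffusion update driven by the learned ratios.
        x = reverse_step(model, x, t)
        # Clamp the conditioning tokens back in so generation stays consistent.
        x = torch.where(known_mask, tokens, x)
    return x
```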

Limitations & Future Work

The paper acknowledges limitations such as the remaining gap with modern large language models and the need to explore distribution annealing techniques suited to SEDD. Future work could focus on closing the performance gap with larger language models, adapting empirical design choices from continuous diffusion models, and systematically exploring noise schedules and loss weightings for further improvement.

Abstract

Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by 25-75%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around 6-8x better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left-to-right prompting).