SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Authors: Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie

What

This paper introduces Scalable Interpolant Transformers (SiT), a family of generative models built on the Diffusion Transformer (DiT) backbone that leverages stochastic interpolants to improve image generation.

Why

This work matters because it provides a detailed analysis of the design choices involved in building generative models based on dynamical transport, pointing toward more efficient and higher-performing models. Specifically, it demonstrates a consistent performance gain over DiT by carefully selecting the interpolant connecting the data and noise distributions and by learning the velocity field of the interpolating process rather than the score.
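
To make the interpolant idea concrete, here is a minimal sketch of a linear interpolant between data and noise, together with the conditional velocity that serves as the regression target when learning the velocity field rather than the score. The function name and tensor shapes are illustrative assumptions, not the authors' code.

```python
import torch

def linear_interpolant(x0: torch.Tensor, eps: torch.Tensor, t: torch.Tensor):
    """Linear interpolant x_t = (1 - t) * x0 + t * eps, connecting data x0
    (at t = 0) to Gaussian noise eps (at t = 1)."""
    t = t.view(-1, 1, 1, 1)      # broadcast per-sample time over image dims
    xt = (1 - t) * x0 + t * eps
    velocity = eps - x0          # d x_t / dt: the velocity regression target
    return xt, velocity
```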

How

The authors start from the DDPM framework and systematically analyze the effect of each design choice on the ImageNet 256x256 benchmark. They compare discrete- vs. continuous-time learning, predicting the velocity vs. the score, different interpolants such as linear and generalized variance-preserving (GVP), and deterministic (Heun) vs. stochastic (Euler-Maruyama) samplers with tunable diffusion coefficients.
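
As a concrete illustration of continuous-time velocity learning with the GVP interpolant (alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2), which keeps alpha_t^2 + sigma_t^2 = 1), here is a hedged sketch of the training objective; the model(xt, t) signature is an assumption, not the authors' interface.

```python
import math
import torch

def gvp_interpolant(x0, eps, t):
    """GVP interpolant x_t = cos(pi*t/2) * x0 + sin(pi*t/2) * eps,
    with data at t = 0 and noise at t = 1."""
    t = t.view(-1, 1, 1, 1)
    alpha, sigma = torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)
    xt = alpha * x0 + sigma * eps
    # Time derivative of x_t gives the conditional velocity target.
    d_alpha = -0.5 * math.pi * torch.sin(0.5 * math.pi * t)
    d_sigma = 0.5 * math.pi * torch.cos(0.5 * math.pi * t)
    return xt, d_alpha * x0 + d_sigma * eps

def velocity_loss(model, x0):
    """Continuous-time objective: sample t ~ U(0, 1) and regress the
    network output onto the conditional velocity."""
    t = torch.rand(x0.shape[0], device=x0.device)
    eps = torch.randn_like(x0)
    xt, v_target = gvp_interpolant(x0, eps, t)
    return ((model(xt, t) - v_target) ** 2).mean()
```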

Result

SiT consistently outperforms DiT in FID across all model sizes, demonstrating the effectiveness of stochastic interpolants and velocity learning. The paper also finds that SDE-based sampling generally outperforms ODE-based sampling, and that the optimal diffusion coefficient for SDE sampling depends on the choice of interpolant and model. Classifier-free guidance further improves SiT, reaching an FID-50K of 2.06 and surpassing DiT in all comparable settings.
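
A minimal sketch of these two sampling ingredients, assuming the convention that data sits at t = 0 and noise at t = 1: classifier-free guidance blends conditional and unconditional velocity predictions, and Euler-Maruyama integrates the reverse SDE whose drift combines the velocity with a score correction scaled by the tunable diffusion coefficient. velocity_fn, score_fn, w_t, and the model signature are stand-ins (SiT obtains the score from the learned velocity), and the step rule follows the standard reverse-SDE discretization rather than the authors' exact implementation.

```python
import torch

def guided_velocity(model, xt, t, y, y_null, scale):
    """Classifier-free guidance on the velocity: scale = 1 recovers the
    purely conditional prediction."""
    v_cond, v_uncond = model(xt, t, y), model(xt, t, y_null)
    return v_uncond + scale * (v_cond - v_uncond)

def euler_maruyama_sample(velocity_fn, score_fn, x, w_t, n_steps=250):
    """Integrate the reverse SDE from noise (t = 1) to data (t = 0) with
    drift v - (w/2) * s and tunable diffusion coefficient w = w_t(t)."""
    ts = torch.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t = ts[i]
        dt = ts[i] - ts[i + 1]                 # positive step size
        tb = t.expand(x.shape[0])
        w = w_t(t)
        drift = velocity_fn(x, tb) - 0.5 * w * score_fn(x, tb)
        x = x - drift * dt                     # step backward in time
        if i < n_steps - 1:                    # no noise on the final step
            x = x + (w * dt) ** 0.5 * torch.randn_like(x)
    return x
```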

Limitations and Future Work

The authors acknowledge that the relative performance of different samplers may vary under different computational budgets. In future work, they plan to apply SiT to downstream tasks such as video generation and image editing, and to investigate further gains from combining SiT with more advanced sampling techniques and architectural modifications.

Abstract

We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06.
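
For contrast with the stochastic sampler above, here is a sketch of the deterministic alternative: second-order Heun integration of the probability-flow ODE dX/dt = v(X, t), again assuming data at t = 0 and noise at t = 1, with an illustrative function signature.

```python
import torch

def heun_sample(velocity_fn, x, n_steps=50):
    """Heun (predictor-corrector) integration of the probability-flow ODE,
    running backward in time from noise (t = 1) to data (t = 0)."""
    ts = torch.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t0, t1 = ts[i], ts[i + 1]
        dt = t1 - t0                                # negative: backward step
        v0 = velocity_fn(x, t0.expand(x.shape[0]))
        x_euler = x + dt * v0                       # Euler predictor
        v1 = velocity_fn(x_euler, t1.expand(x.shape[0]))
        x = x + 0.5 * dt * (v0 + v1)                # trapezoidal corrector
    return x
```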