DiffiT: Diffusion Vision Transformers for Image Generation

Authors: Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat

What

This paper introduces DiffiT, a novel Vision Transformer (ViT)-based diffusion model designed for efficient and high-quality image generation in both latent and image spaces.

Why

The paper addresses limitations of existing CNN-based and ViT-based diffusion models by introducing Time-dependent Multihead Self-Attention (TMSA), which fuses the time-step embedding directly into the self-attention computation, significantly improving parameter efficiency and enabling fine-grained control over the denoising process for better image fidelity and diversity.
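
As a rough sketch of the mechanism (paraphrased; the notation below is illustrative rather than a verbatim reproduction of the paper), TMSA forms queries, keys, and values from both the spatial tokens x_s and a time token x_t, so the attention weights themselves adapt to the diffusion time step:

    q = x_s W_{qs} + x_t W_{qt}
    k = x_s W_{ks} + x_t W_{kt}
    v = x_s W_{vs} + x_t W_{vt}
    Attention(q, k, v) = softmax(q k^T / \sqrt{d} + B) v

where d is the per-head dimension and B is a learned relative position bias.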

How

The authors integrate the proposed TMSA mechanism into a U-shaped encoder-decoder architecture for image-space generation and into a purely ViT-based architecture for latent-space generation. They train and evaluate DiffiT on diverse datasets, including ImageNet, FFHQ, and CIFAR-10, and conduct ablation studies to validate the effectiveness of TMSA and the other architectural choices.
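
To make the architecture concrete, here is a minimal, hypothetical PyTorch sketch of a DiffiT-style block under the formulation sketched above; the module and parameter names (TMSA, DiffiTBlock, t_emb, num_heads, mlp_ratio) are assumptions for illustration and do not mirror the official NVlabs implementation, and the relative position bias is omitted for brevity.

    import math
    import torch.nn as nn

    class TMSA(nn.Module):
        """Time-dependent multi-head self-attention (illustrative sketch)."""
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            # Separate projections for spatial tokens and the time token;
            # summing them makes q/k/v depend on the diffusion time step.
            self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
            self.qkv_time = nn.Linear(dim, 3 * dim, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, t_emb):
            # x: (B, N, C) spatial tokens; t_emb: (B, C) time-step embedding
            B, N, C = x.shape
            qkv = self.qkv_spatial(x) + self.qkv_time(t_emb).unsqueeze(1)
            qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
            attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
            out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

    class DiffiTBlock(nn.Module):
        """Pre-norm transformer block: TMSA followed by an MLP, with residuals."""
        def __init__(self, dim, num_heads=8, mlp_ratio=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = TMSA(dim, num_heads)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                nn.Linear(mlp_ratio * dim, dim),
            )

        def forward(self, x, t_emb):
            x = x + self.attn(self.norm1(x), t_emb)
            x = x + self.mlp(self.norm2(x))
            return x

In the image-space variant, blocks of this kind would sit at each resolution level of the U-shaped encoder-decoder, while the latent-space variant stacks them over patchified latent tokens.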

Result

DiffiT achieves a state-of-the-art FID score of 1.73 on ImageNet-256 with 19.85% and 16.88% fewer parameters than the previous Transformer-based SOTA models MDT and DiT, respectively. It also achieves competitive results on FFHQ-64 and CIFAR-10, demonstrating its ability to generate high-fidelity, diverse images across different datasets and resolutions.

Limitations & Future Work

The paper acknowledges potential limitations in scaling DiffiT to higher resolutions and more complex image generation tasks. Future work could focus on improving memory efficiency, training on larger datasets, and exploring applications in image editing, restoration, and text-to-image generation.

Abstract

Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for fine-grained control of the denoising process and introduce the Time-dependent Multihead Self-Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset while having 19.85% and 16.88% fewer parameters than other Transformer-based diffusion models such as MDT and DiT, respectively. Code: https://github.com/NVlabs/DiffiT