U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
Authors: Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang
What
This paper introduces U-DiT, a U-Net-style diffusion transformer for latent-space image generation that downsamples tokens in self-attention to improve generation quality while reducing computational cost relative to isotropic DiT models.
Why
The paper matters because it challenges the prevailing use of isotropic architectures in diffusion transformers, showing that a U-Net backbone combined with a novel downsampled self-attention mechanism can reach state-of-the-art performance at reduced computational cost.
How
The authors first ran a toy experiment comparing a simple U-Net-style DiT against an isotropic DiT and found that the U-Net inductive bias helped only slightly, suggesting the backbone was underutilized. They then introduced downsampled self-attention, which exploits the observation that U-Net backbone features are dominated by low-frequency components to cut redundancy. Finally, they scaled this design up into the U-DiT family and evaluated it against existing DiT models on ImageNet 256x256, reporting FID, sFID, IS, precision, and recall.
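To make the downsampled self-attention idea concrete, here is a minimal PyTorch sketch. It is an illustrative simplification under stated assumptions (a 2x2 average-pooling downsampler, nearest-neighbor upsampling, and a plain nn.MultiheadAttention layer), not the authors' U-DiT implementation, which is in the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    # Illustrative module: run self-attention on a spatially downsampled
    # token map, then restore full resolution. Hypothetical simplification
    # of the paper's downsampled self-attention, not the official code.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, N, C) latent tokens with N = h * w
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        # 2x2 average pooling keeps the low-frequency content that dominates
        # the U-Net backbone features while cutting the token count by 4x.
        down = F.avg_pool2d(feat, 2)
        tokens = down.flatten(2).transpose(1, 2)            # (B, N/4, C)
        out, _ = self.attn(tokens, tokens, tokens)          # Q = K = V
        # Upsample back so the block still emits full-resolution features.
        out = out.transpose(1, 2).reshape(b, c, h // 2, w // 2)
        out = F.interpolate(out, scale_factor=2.0, mode="nearest")
        return out.flatten(2).transpose(1, 2)               # (B, N, C)

# Toy usage: a 16x16 token map with 384-dim tokens.
layer = DownsampledSelfAttention(dim=384)
y = layer(torch.randn(2, 16 * 16, 384), h=16, w=16)         # same shape as input

In this simplified form, the attention-map computation shrinks by roughly 16x because attention cost is quadratic in the token count; the actual U-DiT downsampler differs in detail, but token downsampling is the stated source of its FLOPs savings.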
Result
U-DiT significantly outperforms isotropic DiTs, achieving better FID scores with fewer FLOPs; for example, U-DiT-B surpasses DiT-XL/2 with only about 1/6 of its computational cost. This highlights the efficacy of the U-Net architecture and downsampled self-attention for efficient, high-quality image generation.
LF
The authors note that computational resource constraints and a tight schedule kept them from exploring the full potential of U-DiTs; they suggest scaling the models further and extending training iterations as future work.
Abstract
Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their following improvements is worth rethinking. To this end, we conduct a simple toy experiment by comparing a U-Net architectured DiT with an isotropic one. It turns out that the U-Net architecture only gains a slight advantage amid the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention and bring further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost. Codes are available at https://github.com/YuchuanTian/U-DiT.
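As a complement to the attention sketch above, the following is a schematic of the U-shaped token pipeline the abstract refers to: encoder stages downsample the feature map, a middle stage operates at the lowest resolution, and decoder stages upsample and fuse skip connections. The function name u_shaped_forward, the constant channel width, and the pooling/upsampling choices are illustrative assumptions, not the official U-DiT architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def u_shaped_forward(x, enc_stages, mid_stage, dec_stages):
    # Sketch of a U-shaped pipeline over a (B, C, H, W) feature map.
    # Assumes constant channel width and one skip connection per level.
    skips = []
    for stage in enc_stages:
        x = stage(x)
        skips.append(x)                          # keep features for the skip path
        x = F.avg_pool2d(x, 2)                   # halve spatial resolution
    x = mid_stage(x)                             # lowest-resolution stage
    for stage, skip in zip(dec_stages, reversed(skips)):
        x = F.interpolate(x, scale_factor=2.0, mode="nearest")
        x = stage(x + skip)                      # fuse skip features, then refine
    return x

# Toy usage: a 32x32 map, two encoder/decoder levels, 64 channels throughout.
block = lambda: nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.GELU())
out = u_shaped_forward(torch.randn(1, 64, 32, 32),
                       enc_stages=[block(), block()],
                       mid_stage=block(),
                       dec_stages=[block(), block()])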