PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

What

PixArt-Σ is a Diffusion Transformer (DiT) model capable of directly generating high-quality images at 4K resolution. It builds on its predecessor, PixArt-α, with higher-quality training data and an efficient token compression mechanism.

Why

The paper addresses the challenge of efficiently training high-quality text-to-image (T2I) models with limited resources. It introduces "weak-to-strong training," which incrementally improves a pre-trained model rather than training from scratch. Furthermore, PixArt-Σ pushes the resolution of T2I generation to 4K, a significant advancement in the field.

How

The authors employ a "weak-to-strong training" strategy, starting from the pre-trained PixArt-α. They enhance the model by: (1) Curating a higher-quality dataset with better aesthetics, higher resolution (up to 4K), and more accurate and dense captions. (2) Introducing an efficient token compression mechanism within the DiT framework to handle the increased computational demands of 4K generation. (3) Proposing efficient fine-tuning techniques for rapid adaptation to new VAEs, higher resolutions, and KV compression.
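The core idea behind the token compression in step (2) is that keys and values can be formed from a spatially downsampled copy of the token grid, so the attention matrix shrinks by roughly the square of the compression ratio. A minimal single-head sketch of this idea, using average pooling in place of the learned convolutional compression (pooling and the `ratio` parameter are simplifying assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_compressed_attention(x, hw, ratio=2):
    """Single-head attention sketch with KV token compression.

    Keys/values come from a grid average-pooled by `ratio`, so the
    attention matrix is (N, N/ratio**2) instead of (N, N).
    Illustrative only: the paper learns the compression (and uses
    separate Q/K/V projections), which are omitted here for brevity.
    """
    N, C = x.shape
    H, W = hw
    grid = x.reshape(H, W, C)
    # average-pool ratio x ratio patches -> compressed token grid
    pooled = grid.reshape(H // ratio, ratio, W // ratio, ratio, C).mean(axis=(1, 3))
    kv = pooled.reshape(-1, C)            # (N / ratio**2, C) compressed tokens
    q, k, v = x, kv, kv
    attn = softmax(q @ k.T / np.sqrt(C))  # (N, N / ratio**2)
    return attn @ v                       # full-resolution output, (N, C)
```

With `ratio=2` the key/value count drops 4x, which is where the reported training and inference savings at 4K come from.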

Result

Key findings include: (1) PixArt-Σ achieves state-of-the-art 4K image generation with high fidelity and strong adherence to textual prompts. (2) The "weak-to-strong training" strategy proves highly efficient, requiring significantly fewer GPU days than training from scratch. (3) The proposed KV compression mechanism effectively reduces training and inference time without compromising quality. (4) Both human and AI preference studies confirm PixArt-Σ's superior performance over existing open-source models and competitive results against commercial T2I products.

Limitations & Future Work

Limitations include the inability to perfectly generate certain objects and scenes like text and hands, limitations in handling complex prompts, and potential biases in generated content. Future work involves improving data quality, scaling model size, enhancing alignment with complex instructions, and addressing ethical concerns related to biases and sensitive content.

Abstract

In this paper, we introduce PixArt-\Sigma, a Diffusion Transformer model~(DiT) capable of directly generating images at 4K resolution. PixArt-\Sigma represents a significant advancement over its predecessor, PixArt-\alpha, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-\Sigma is its training efficiency. Leveraging the foundational pre-training of PixArt-\alpha, it evolves from the 'weaker' baseline to a 'stronger' model by incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-\Sigma are twofold: (1) High-Quality Training Data: PixArt-\Sigma incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-\Sigma achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-\Sigma's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.