PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
What
PixArt-Σ is a Diffusion Transformer (DiT) model capable of directly generating high-quality images at 4K resolution, building upon its predecessor PixArt-α with enhanced training data and an efficient token compression mechanism.
Why
This paper addresses the challenge of efficiently training high-quality T2I models with limited resources. It introduces "weak-to-strong training," which incrementally improves a pre-trained model rather than training from scratch. Furthermore, PixArt-Σ pushes the resolution boundary of T2I generation to 4K, a significant advancement in the field.
How
The authors employ a "weak-to-strong training" strategy, starting from the pre-trained PixArt-α. They enhance the model by: (1) curating a higher-quality dataset with better aesthetics, higher resolution (up to 4K), and more accurate, denser captions; (2) introducing an efficient key-value (KV) token compression mechanism within the DiT framework to handle the increased computational demands of 4K generation; (3) proposing efficient fine-tuning techniques for rapid adaptation to new VAEs, higher resolutions, and KV compression.
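The KV compression in step (2) can be illustrated with a minimal sketch: keys and values are spatially downsampled before the attention product, while queries keep full resolution, so the attention map shrinks from N×N to N×(N/r²). This is a hedged illustration, not the paper's implementation — plain 2×2 average pooling stands in for PixArt-Σ's learned compression operator, and the single-head, unprojected attention omits the usual Q/K/V linear layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_compressed_attention(x, h, w, ratio=2):
    """Self-attention over N = h*w image tokens where keys and values
    are spatially downsampled by `ratio` before attention.
    Average pooling is a stand-in for a learned compression operator."""
    n, d = x.shape
    assert n == h * w and h % ratio == 0 and w % ratio == 0
    q = x                                      # queries keep full resolution: (N, d)
    grid = x.reshape(h, w, d)
    # ratio x ratio average pooling over the token grid
    pooled = grid.reshape(h // ratio, ratio, w // ratio, ratio, d).mean(axis=(1, 3))
    kv = pooled.reshape(-1, d)                 # compressed keys/values: (N / r^2, d)
    attn = softmax(q @ kv.T / np.sqrt(d))      # attention map: (N, N / r^2)
    return attn @ kv                           # output: (N, d)

# 16x16 token grid, 8-dim tokens; attention cost drops by ratio^2 = 4x.
tokens = np.random.randn(16 * 16, 8)
out = kv_compressed_attention(tokens, 16, 16, ratio=2)
```

Because only K and V are compressed, the output still has one token per spatial position, which is what lets the mechanism cut cost without reducing the generated image's resolution.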
Result
Key findings include: (1) PixArt-Σ achieves state-of-the-art 4K image generation with high fidelity and strong adherence to textual prompts. (2) The "weak-to-strong training" strategy proves highly efficient, requiring significantly fewer GPU days than training from scratch. (3) The proposed KV compression mechanism effectively reduces training and inference time without compromising quality. (4) Both human and AI preference studies confirm PixArt-Σ's superior performance over existing open-source models and competitive results against commercial T2I products.
Limitations & Future Work
Limitations include the inability to perfectly generate certain objects and scenes like text and hands, limitations in handling complex prompts, and potential biases in generated content. Future work involves improving data quality, scaling model size, enhancing alignment with complex instructions, and addressing ethical concerns related to biases and sensitive content.
Abstract
In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT)
model capable of directly generating images at 4K resolution.
PixArt-Σ represents a significant advancement over its predecessor,
PixArt-α, offering images of markedly higher fidelity and improved
alignment with text prompts. A key feature of PixArt-Σ is its training
efficiency. Leveraging the foundational pre-training of PixArt-α, it
evolves from the 'weaker' baseline to a 'stronger' model by incorporating
higher-quality data, a process we term "weak-to-strong training". The
advancements in PixArt-Σ are twofold: (1) High-Quality Training Data:
PixArt-Σ incorporates superior-quality image data, paired with more
precise and detailed image captions. (2) Efficient Token Compression: we
propose a novel attention module within the DiT framework that compresses both
keys and values, significantly improving efficiency and facilitating
ultra-high-resolution image generation. Thanks to these improvements,
PixArt-Σ achieves superior image quality and user-prompt adherence
with a significantly smaller model size (0.6B parameters) than
existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD
Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K
images supports the creation of high-resolution posters and wallpapers,
efficiently bolstering the production of high-quality visual content in
industries such as film and gaming.