PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

What

This paper introduces a novel Transformer-based text-to-image diffusion model called PixArt-α, which achieves image generation quality comparable to state-of-the-art models while significantly reducing training cost and CO2 emissions.

Why

This work matters because the high training costs and environmental impact of advanced T2I models hinder innovation and accessibility in the AIGC community.

How

The authors propose three core designs: (1) decomposing training into three stages that learn pixel dependency, text-image alignment, and high-resolution aesthetic image generation; (2) an efficient T2I Transformer that builds on DiT, adds cross-attention to inject text conditions, and streamlines the computation-intensive class-condition branch (see the architecture sketch below); and (3) highly informative training data built from SAM images paired with dense pseudo-captions generated by LLaVA.
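To make design (2) concrete, here is a minimal PyTorch sketch (not the official implementation) of a PixArt-style block: DiT's per-block class-condition adaLN MLPs are replaced by one globally shared, time-conditioned modulation plus small per-block learnable offsets, and text tokens are injected through a cross-attention layer. Layer sizes, module names, and the exact modulation layout are illustrative assumptions.

```python
# Minimal sketch of a PixArt-style Transformer block (illustrative, not official):
# - self-attention and MLP are modulated by a globally shared time-conditioned
#   signal plus per-block learnable offsets (replacing DiT's per-block adaLN MLPs)
# - text conditioning enters through a cross-attention layer
import torch
import torch.nn as nn


class PixArtStyleBlock(nn.Module):
    def __init__(self, dim=1152, num_heads=16, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(approximate="tanh"),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # Per-block learnable offsets added to the shared modulation; this is
        # where the parameter saving over DiT's per-block condition MLPs comes from.
        self.scale_shift_table = nn.Parameter(torch.zeros(6, dim))

    def forward(self, x, text_tokens, global_mod):
        # x: (B, N, dim) latent patch tokens
        # text_tokens: (B, L, dim) projected text embeddings (e.g., from T5)
        # global_mod: (B, 6, dim) produced once per step by a shared MLP of the timestep
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.scale_shift_table[None] + global_mod
        ).chunk(6, dim=1)

        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.self_attn(h, h, h, need_weights=False)[0]

        # Cross-attention injects the text condition.
        x = x + self.cross_attn(x, text_tokens, text_tokens, need_weights=False)[0]

        h = self.norm2(x) * (1 + scale2) + shift2
        x = x + gate2 * self.mlp(h)
        return x


if __name__ == "__main__":
    block = PixArtStyleBlock(dim=64, num_heads=4)
    x = torch.randn(2, 16, 64)      # 16 latent patch tokens
    text = torch.randn(2, 8, 64)    # 8 text tokens
    mod = torch.randn(2, 6, 64)     # shared time-embedding modulation
    print(block(x, text, mod).shape)  # torch.Size([2, 16, 64])
```

The design intent is that a single shared modulation MLP, corrected by lightweight per-block offsets, replaces the heavy per-block condition branch of DiT without changing the form of the modulation.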

Result

Key findings include a COCO FID of 7.32 achieved with only 12% of Stable Diffusion v1.5's training time, wins over competing models in user studies on image quality and text alignment, and superior compositionality on T2I-CompBench.

Limitations and Future Work

Limitations include difficulty in accurately controlling the number of generated objects, handling fine details such as human hands, and limited text generation capability. Future work involves addressing these limitations and exploring personalized extensions.

Abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
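For design (3), the following is a hedged sketch of the auto-labeling idea: caption SAM images with a vision-language model to obtain dense, concept-rich pseudo-captions. This is not the authors' pipeline; the llava-hf/llava-1.5-7b-hf checkpoint, the prompt wording, and the generation settings are assumptions for illustration.

```python
# Hedged sketch of dense pseudo-captioning with a vision-language model (LLaVA).
# The checkpoint, prompt, and decoding settings below are illustrative assumptions,
# not the paper's exact auto-labeling setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)


def pseudo_caption(image_path: str) -> str:
    """Generate a dense caption for one image (e.g., a SAM image)."""
    image = Image.open(image_path).convert("RGB")
    prompt = ("USER: <image>\nDescribe this image and its style "
              "in a very detailed manner. ASSISTANT:")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = processor.batch_decode(out, skip_special_tokens=True)[0]
    return text.split("ASSISTANT:")[-1].strip()


# Example (hypothetical file name):
# print(pseudo_caption("sam_image_0001.jpg"))
```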