UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
Authors: Yanwu Xu, Yang Zhao, Zhisheng Xiao, Tingbo Hou
What
This paper introduces UFOGen, a text-to-image generative model that combines diffusion models with a Generative Adversarial Network (GAN) objective to enable ultra-fast, one-step image generation from text prompts.
Why
This paper is important because it addresses a key limitation of traditional text-to-image diffusion models, namely their slow inference speed due to the multi-step denoising process. UFOGen’s ability to generate high-quality images in a single step significantly improves efficiency and expands the potential applications of such models.
How
The authors achieve one-step generation by modifying existing diffusion-GAN hybrid models in two key ways: 1) they introduce a new generator parameterization that samples from the forward diffusion process rather than the denoising posterior, enabling distribution matching at the clean-image level; 2) they strengthen the reconstruction loss so that the generated clean image is explicitly matched to the target. By initializing UFOGen from a pre-trained Stable Diffusion model, they reuse its learned text-image knowledge and obtain stable training with fast convergence.
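To make this recipe concrete, below is a minimal, self-contained sketch of one UFOGen-style training step in PyTorch. The ToyGenerator, ToyDiscriminator, noise schedule, and loss weights are illustrative stand-ins (in the paper the generator is initialized from a pre-trained Stable Diffusion UNet and conditioned on text embeddings, which are omitted here), not the authors' implementation.

```python
# Minimal sketch of a UFOGen-style training step, assuming toy modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # forward-process noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class ToyGenerator(nn.Module):
    """Predicts a clean image x0_hat from a noisy input x_t and timestep t."""
    def __init__(self, dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        h = torch.cat([x_t.flatten(1), t.float().unsqueeze(1) / T], dim=1)
        return self.net(h).view_as(x_t)

class ToyDiscriminator(nn.Module):
    """Scores samples at the clean-image level (real x0 vs. generated x0_hat)."""
    def __init__(self, dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.SiLU(), nn.Linear(256, 1))

    def forward(self, x0):
        return self.net(x0.flatten(1))

def forward_diffuse(x0, t, noise):
    """Sample x_t ~ q(x_t | x0) from the forward diffusion process."""
    a = alphas_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def training_step(G, D, opt_g, opt_d, x0, recon_weight=1.0):
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    x_t = forward_diffuse(x0, t, torch.randn_like(x0))
    x0_hat = G(x_t, t)                     # one forward pass -> clean image

    # Discriminator update: distinguish real x0 from generated x0_hat.
    d_loss = F.softplus(-D(x0)).mean() + F.softplus(D(x0_hat.detach())).mean()
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: adversarial term plus explicit reconstruction of x0.
    g_loss = F.softplus(-D(x0_hat)).mean() + recon_weight * F.mse_loss(x0_hat, x0)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

if __name__ == "__main__":
    G, D = ToyGenerator(), ToyDiscriminator()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    x0 = torch.randn(4, 3, 32, 32)         # stand-in for a batch of real images
    print(training_step(G, D, opt_g, opt_d, x0))
```

The parts mirrored from the paper are the structure of the step: the generator maps a noised input to a clean image in a single forward pass, the adversarial loss is applied at the clean-image level, and an explicit reconstruction term ties the generated image to the target.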
Result
UFOGen generates high-quality images from text prompts in a single step, surpassing existing few-step diffusion models such as Progressive Distillation and Latent Consistency Models in visual quality. It performs comparably to InstaFlow while being more training-efficient and using a simpler training pipeline. UFOGen also adapts successfully to downstream tasks such as image-to-image generation and controllable generation, demonstrating its flexibility and broader applicability.
Limitations & Future Work
The paper acknowledges limitations common to Stable Diffusion-based models, such as missing objects, attribute leakage, and counting errors. Future work could address these limitations and explore UFOGen in more complex generative settings, such as video generation or 3D object synthesis. Investigating the model's behavior under different guidance scales and comparing it against a wider range of text-to-image models would also give a more complete picture of its strengths and weaknesses.
Abstract
Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation, UFOGen showcases versatility in applications. Notably, UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks, presenting a significant advancement in the landscape of efficient generative models.
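To illustrate what "one-step" synthesis means operationally, the following is a small sketch of sampling: draw Gaussian noise once and map it to an image with a single generator call. The generator signature, conditioning argument, and latent shape are assumptions for illustration, not UFOGen's actual interface.

```python
# Sketch of one-step sampling with a UFOGen-style generator (assumed interface).
import torch

@torch.no_grad()
def sample_one_step(generator, cond, batch_size=1, shape=(4, 64, 64), num_train_steps=1000):
    """Draw Gaussian noise once and produce an image with a single generator call.
    `generator(x_t, t, cond) -> x0_hat` is an assumed signature; `cond` stands in
    for text embeddings of the prompt."""
    x_T = torch.randn(batch_size, *shape)                         # pure noise
    t = torch.full((batch_size,), num_train_steps - 1, dtype=torch.long)
    return generator(x_T, t, cond)                                # single forward pass, no iterative denoising
```

Unlike standard diffusion sampling, there is no loop over timesteps; the cost of generation is a single network evaluation.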