You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs
Authors: Yihong Luo, Xiaolong Chen, Jing Tang
What
This paper introduces YOSO, a one-step generative model that combines the diffusion process with Generative Adversarial Networks (GANs) to achieve rapid, scalable, and high-fidelity image synthesis.
Why
This paper matters because it addresses a key limitation of standard diffusion models: iterative denoising makes sampling slow. YOSO enables one-step generation without compromising image quality, which makes it attractive for practical deployment.
How
The authors propose a self-cooperative learning approach in which the generator learns from itself: the distribution of its generated samples at one corruption level is matched against the distribution of its own samples at a lower corruption level. For text-to-image generation, they further introduce a latent perceptual loss, a latent discriminator, and fixes to the noise scheduler that make one-step fine-tuning of pre-trained models feasible.
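Below is a minimal PyTorch sketch of how such a self-cooperative objective could look. It is an illustrative simplification, not the authors' implementation: the generator G, the time-conditioned discriminator D, the schedule alphas_cumprod, the loss weighting, and the plain-MSE stand-in for the latent perceptual loss are all assumptions.

```python
# Simplified sketch of a YOSO-style self-cooperative training step.
# G(x_t, t): one-step denoising generator; D(x, t): time-conditioned
# discriminator. Both are assumed, not the authors' exact networks.
import torch
import torch.nn.functional as F

def q_sample(x0, t, alphas_cumprod, noise=None):
    """Forward diffusion: corrupt x0 to noise level t (standard DDPM formula)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def self_cooperative_step(G, D, x0, alphas_cumprod, T):
    b = x0.size(0)
    t = torch.randint(1, T, (b,), device=x0.device)

    # Generator denoises a corrupted sample in one shot.
    x_t = q_sample(x0, t, alphas_cumprod)
    x0_fake = G(x_t, t)

    # Self-cooperative target: the generator's own output from a *less*
    # corrupted input, treated as the "real" branch (stop-gradient).
    x_tm1 = q_sample(x0, t - 1, alphas_cumprod)
    with torch.no_grad():
        x0_target = G(x_tm1, t - 1)

    # Both branches are re-corrupted to level t-1, so the discriminator
    # always compares samples from smoothed (noisy) distributions.
    fake_tm1 = q_sample(x0_fake, t - 1, alphas_cumprod)
    real_tm1 = q_sample(x0_target, t - 1, alphas_cumprod)

    # Non-saturating GAN losses.
    d_loss = (F.softplus(D(fake_tm1.detach(), t - 1)).mean()
              + F.softplus(-D(real_tm1, t - 1)).mean())
    g_adv = F.softplus(-D(fake_tm1, t - 1)).mean()

    # Reconstruction toward real data; the paper's latent perceptual loss
    # would replace this plain MSE.
    g_rec = F.mse_loss(x0_fake, x0)
    return d_loss, g_adv + g_rec
```

In the text-to-image setting, both the discriminator and the reconstruction term would operate in the latent space of the pre-trained autoencoder, which is what makes the latent discriminator and latent perceptual loss cheap to evaluate.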
Result
YOSO achieves competitive performance on unconditional image generation, outperforming other one-step methods and rivaling multi-step diffusion models. In text-to-image generation, YOSO demonstrates superior image quality, prompt alignment, and mode coverage compared to state-of-the-art one-step models such as SD-Turbo and SDXL-Turbo. Notably, YOSO-LoRA reaches strong results with LoRA fine-tuning alone, underscoring the method's efficiency. YOSO also shows promising compatibility with downstream tasks such as image-to-image editing and ControlNet.
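For reference, one-step inference with a LoRA-distilled checkpoint would look roughly like the following diffusers sketch. The base checkpoint and LoRA weight path below are placeholders; consult the linked repository for the released weights and the exact scheduler configuration.

```python
# Hedged sketch: one-step text-to-image sampling with a YOSO-style LoRA
# loaded on top of a pre-trained Stable Diffusion checkpoint (diffusers API).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder base checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/yoso-lora")  # placeholder LoRA weights

image = pipe(
    "a photo of an astronaut riding a horse on mars",
    num_inference_steps=1,  # single sampling step
    guidance_scale=1.0,     # distilled one-step models typically drop CFG
).images[0]
image.save("yoso_one_step.png")
```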
Limitations & Future Work
The authors acknowledge that fine-tuning on a dataset different from the pre-trained model's training set introduces a distribution shift; they suggest training on larger and more diverse datasets such as LAION to mitigate it. Exploring more advanced noise-scheduler adaptation techniques and extending YOSO to further downstream tasks are left as future work.
Abstract
We introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis, achieved by integrating the diffusion process with GANs. Specifically, we smooth the distribution using the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model trained from scratch with competitive performance. Moreover, our method can be extended to fine-tune pre-trained text-to-image diffusion models for high-quality one-step text-to-image synthesis, even with LoRA fine-tuning alone. In particular, we provide the first diffusion transformer that generates images in one step, trained at 512 resolution, with the capability of adapting to 1024 resolution without explicit training. Our code is provided at https://github.com/Luo-Yihong/YOSO.