SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
Authors: Thuan Hoang Nguyen, Anh Tran
What
This paper presents SwiftBrush, a novel image-free distillation method for text-to-image diffusion models that enables single-step, high-fidelity image generation without relying on any training image data.
Why
This paper is important because it addresses the slow inference speed of traditional text-to-image diffusion models by enabling single-step generation while maintaining high fidelity, which is crucial for deployment on consumer devices and broader accessibility.
How
The authors draw inspiration from text-to-3D synthesis techniques and adapt Variational Score Distillation (VSD) for text-to-image generation. They employ a pretrained text-to-image teacher model and an additional trainable LoRA teacher model to guide the learning of a student model that can generate images from text prompts in a single step. The student model is trained without using any image data, relying solely on text captions and a specialized loss function.
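To make the training procedure concrete, here is a minimal, hypothetical sketch of one image-free VSD-style distillation step, assuming PyTorch. The module and argument names (student, teacher, lora_teacher, text_emb, alphas_cumprod) are placeholders, not the authors' code, and the weighting w(t) is an assumption rather than the paper's exact choice.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, lora_teacher, text_emb,
                      alphas_cumprod, opt_student, opt_lora):
    """One training step: the student maps pure noise to a latent in a single pass,
    guided by a frozen pretrained teacher and a trainable LoRA teacher; no real or
    teacher-generated images are needed, only text embeddings."""
    b = text_emb.shape[0]
    z = torch.randn(b, 4, 64, 64, device=text_emb.device)   # input noise latent
    x0 = student(z, text_emb)                                # one-step generation

    # Re-noise the student's output at a random diffusion timestep.
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z.device)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * eps

    with torch.no_grad():
        eps_teacher = teacher(x_t, t, text_emb)              # frozen pretrained teacher
    eps_lora = lora_teacher(x_t.detach(), t, text_emb)       # trainable LoRA teacher

    # Student update: the VSD-style gradient is the gap between the two teachers'
    # noise predictions, pushed back through the student's output x0.
    w = 1.0 - a_t                                            # assumed weighting schedule
    grad = w * (eps_teacher - eps_lora.detach())
    loss_student = (grad * x0).sum()                         # surrogate: d(loss)/d(x0) = grad
    opt_student.zero_grad(); loss_student.backward(); opt_student.step()

    # LoRA-teacher update: ordinary denoising loss on the student's own samples,
    # so it tracks the distribution the student currently generates.
    loss_lora = F.mse_loss(eps_lora, eps)
    opt_lora.zero_grad(); loss_lora.backward(); opt_lora.step()
    return loss_student.item(), loss_lora.item()

In this sketch the student is updated to close the gap between the frozen teacher's and the LoRA teacher's score estimates, while the LoRA teacher is continually refit to the student's outputs, which is the role the paper assigns to the additional trainable teacher.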
Result
SwiftBrush achieves promising zero-shot results on benchmarks like COCO 2014 and Human Preference Score v2, surpassing existing one-step image generation methods in quality while being more efficient and requiring significantly less training time. Notably, SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark without using any training image data.
Limitations and Future Work
The authors acknowledge that SwiftBrush, while efficient, may produce slightly lower-quality images than its multi-step teacher. Future work could extend SwiftBrush to few-step generation, explore single-teacher distillation, and integrate techniques such as DreamBooth, ControlNet, or InstructPix2Pix for enhanced control and broader applicability.
Abstract
Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named SwiftBrush. Drawing inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any ground-truth 3D data, our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model into a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably, SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark, matching or even substantially surpassing existing state-of-the-art distillation techniques.
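For reference, a rough sketch of the VSD-style gradient the abstract alludes to, written in the notation common in the text-to-3D VSD literature rather than the paper's own (the weighting \omega(t) and the symbol names are assumptions):

\nabla_\theta \mathcal{L}_{\mathrm{VSD}} \;\approx\; \mathbb{E}_{t,\,\epsilon,\,y}\!\left[\, \omega(t)\,\big(\epsilon_{\mathrm{teacher}}(x_t, t, y) - \epsilon_{\mathrm{lora}}(x_t, t, y)\big)\, \frac{\partial x_0}{\partial \theta} \right]

Here x_0 is the student's single-step output for prompt y, x_t is that output re-noised at timestep t, \epsilon_{\mathrm{teacher}} is the frozen pretrained diffusion model, and \epsilon_{\mathrm{lora}} is the trainable LoRA teacher fitted to the student's own samples with a standard denoising loss.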