SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

Authors: Yuda Song, Zehao Sun, Xuanwu Yin

What

This paper introduces SDXS, an approach that distills large-scale text-to-image diffusion models into efficient models capable of real-time inference on a single GPU, reaching approximately 100 FPS for 512x512 images and 30 FPS for 1024x1024 images.
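As a rough illustration of what one-step inference looks like in practice, here is a minimal sketch using the diffusers StableDiffusionPipeline. The checkpoint path is a placeholder, and guidance is disabled because distilled one-step models typically do not use classifier-free guidance.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a one-step distilled checkpoint (placeholder path; substitute the
# released SDXS weights). fp16 keeps GPU latency low.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/sdxs-512", torch_dtype=torch.float16
).to("cuda")

# A single denoising step, no classifier-free guidance: the distilled
# model is trained to map noise to an image directly.
image = pipe(
    "a photo of a corgi surfing a wave",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("corgi.png")
```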

Why

This work addresses a key limitation of traditional diffusion models: their iterative, multi-step sampling process makes inference slow, hindering deployment on edge devices and in applications that require real-time performance.

How

The authors employ a dual approach:

1) Model miniaturization: knowledge distillation compresses the U-Net and image decoder architectures.
2) One-step training: a training technique combining feature matching and score distillation reduces the sampling process to a single step (a schematic sketch follows this list).
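The following PyTorch skeleton sketches how feature matching and score distillation can be combined into one objective. It is a minimal sketch assuming placeholder modules (student, teacher, feature_extractor) and a generic SDS-style gradient; it is not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def one_step_distillation_loss(student, teacher, feature_extractor,
                               noise_latents, text_emb, scheduler, t_max):
    """Schematic loss combining feature matching and score distillation.
    All module names are placeholders, not the paper's exact objective."""
    # The one-step student maps pure noise directly to a clean latent.
    x_student = student(noise_latents, t_max, text_emb)

    # Feature matching: align features of the student's output with those
    # of a (frozen) multi-step teacher sample from the same noise.
    with torch.no_grad():
        x_teacher = teacher.sample(noise_latents, text_emb)  # multi-step
    fm_loss = F.mse_loss(feature_extractor(x_student),
                         feature_extractor(x_teacher))

    # Score distillation: re-noise the student output at a random timestep
    # and nudge it along the teacher's predicted denoising direction.
    t = torch.randint(0, t_max, (x_student.shape[0],),
                      device=x_student.device)
    noise = torch.randn_like(x_student)
    x_t = scheduler.add_noise(x_student, noise, t)
    with torch.no_grad():
        eps_teacher = teacher(x_t, t, text_emb)
    # Standard SDS trick: the detached target yields a gradient on
    # x_student proportional to (eps_teacher - noise).
    sds_loss = F.mse_loss(x_student,
                          (x_student - (eps_teacher - noise)).detach())

    return fm_loss + sds_loss
```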

Result

The resulting models, SDXS-512 and SDXS-1024, are about 30x faster than SD v1.5 and 60x faster than SDXL, respectively, while maintaining comparable image quality. The proposed training method also adapts to image-conditioned generation with ControlNet, enabling applications such as image-to-image translation.
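For context, throughput figures like these are typically measured along the following lines. This is a minimal sketch assuming a loaded pipe as above, with warm-up runs and CUDA synchronization so that asynchronous kernel execution is timed fairly.

```python
import time
import torch

@torch.no_grad()
def measure_fps(pipe, prompt, n_warmup=5, n_runs=50):
    """Rough images-per-second measurement for a one-step pipeline."""
    for _ in range(n_warmup):  # warm up kernels and memory allocators
        pipe(prompt, num_inference_steps=1, guidance_scale=0.0)
    torch.cuda.synchronize()  # ensure warm-up work has finished
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, num_inference_steps=1, guidance_scale=0.0)
    torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)

# e.g. print(f"{measure_fps(pipe, 'a photo of a cat'):.1f} FPS")
```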

Limitations & Future Work

The authors acknowledge reduced image diversity when using ControlNet for image-to-image translation. Future work will focus on improving diversity and on applications such as inpainting and super-resolution, particularly on edge devices.

Abstract

Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.