Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Authors: Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi
What
This paper introduces Smooth Diffusion, a diffusion model trained with a novel regularization that improves the smoothness of the latent space in text-to-image generation, enhancing performance in downstream tasks such as image interpolation, inversion, and editing.
Why
This paper is important because it addresses the limitations of current diffusion models in terms of latent space smoothness, which hinder the quality of downstream tasks. By proposing Smooth Diffusion with a novel regularization technique, this work paves the way for higher-quality and more controllable image generation and manipulation.
How
The authors propose Smooth Diffusion, which introduces Step-wise Variation Regularization to enforce a constant ratio between the variation of the input latent code and that of the output image at every training step. They train Smooth Diffusion on top of Stable Diffusion on the LAION Aesthetics 6.5+ dataset using LoRA fine-tuning. To assess latent space smoothness, they propose a new metric, Interpolation Standard Deviation (ISTD), and compare Smooth Diffusion against Stable Diffusion and other state-of-the-art methods on various downstream tasks, both qualitatively and quantitatively, using metrics such as FID, CLIP Score, MSE, LPIPS, SSIM, and PSNR.
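The regularization can be sketched roughly as follows. The function names, the plain L2 norms, and the squared-error loss form are illustrative assumptions, not the paper's implementation; `predict_x0` stands in for whatever routine recovers the predicted clean latent at step `t`.

```python
import torch

def stepwise_variation_reg(model, predict_x0, z_t, t, cond, delta=1e-2, ratio=1.0):
    """Hedged sketch of Step-wise Variation Regularization:
    perturb the noisy latent slightly and penalize any deviation of the
    output variation from a fixed proportion of the input variation."""
    # Sample a random unit-norm perturbation direction per latent.
    noise_dir = torch.randn_like(z_t)
    noise_dir = noise_dir / noise_dir.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
    z_pert = z_t + delta * noise_dir
    # Predicted clean latents before and after the perturbation.
    x0 = predict_x0(model, z_t, t, cond)
    x0_pert = predict_x0(model, z_pert, t, cond)
    # Enforce ||variation of output|| ~= ratio * ||variation of input||.
    out_var = (x0_pert - x0).flatten(1).norm(dim=1)
    in_var = (z_pert - z_t).flatten(1).norm(dim=1)
    return ((out_var - ratio * in_var) ** 2).mean()
```

In training this term would be added to the usual diffusion denoising loss; with `ratio` fixed, a perfectly "smooth" model (output change exactly proportional to input change) incurs zero penalty.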
Result
Smooth Diffusion demonstrates significantly smoother latent space interpolation compared to Stable Diffusion, evidenced by lower ISTD scores and smoother visual transitions. Furthermore, Smooth Diffusion shows superior performance in image inversion and reconstruction, particularly when using DDIM inversion, and achieves better preservation of unedited content in both text-based and drag-based image editing tasks.
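The ISTD metric can be approximated with a sketch like the following: generate images along a latent interpolation path and report the standard deviation of distances between adjacent images. The spherical interpolation and the plain L2 distance are assumptions for illustration; the paper may use a different interpolation or image distance.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical interpolation between two Gaussian latents."""
    cos_omega = np.dot(z0.ravel(), z1.ravel()) / (
        np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    return (np.sin((1 - alpha) * omega) * z0
            + np.sin(alpha * omega) * z1) / np.sin(omega)

def istd(generate, z0, z1, steps=11):
    """Hedged sketch of ISTD: std of distances between adjacent images
    along a latent interpolation path. `generate` maps a latent to an
    image array; lower values indicate a smoother latent space."""
    images = [generate(slerp(z0, z1, a)) for a in np.linspace(0.0, 1.0, steps)]
    dists = [np.linalg.norm(images[i + 1] - images[i]) for i in range(steps - 1)]
    return float(np.std(dists))
```

Intuitively, a smooth model spaces the interpolated images evenly along the path, so adjacent-image distances are nearly equal and ISTD is small; abrupt visual jumps inflate the standard deviation.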
Limitations and Future Work
The authors acknowledge that the effectiveness of Smooth Diffusion's LoRA component, while adaptable to other models sharing Stable Diffusion's architecture, is not guaranteed and requires further investigation. They also suggest applying Smooth Diffusion to more challenging tasks, such as video generation, as future work.
Abstract
Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse content. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce that the ratio between the variation of an arbitrary input latent and that of the output image is constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion.