Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
Authors: Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye
What
This paper introduces Contrastive Denoising Score (CDS), a novel text-guided image editing technique for latent diffusion models that improves upon Delta Denoising Score (DDS) by incorporating a contrastive loss inspired by Contrastive Unpaired Translation (CUT).
Why
This paper addresses the limitation of DDS in preserving structural details during text-guided image editing. By integrating CUT loss into the DDS framework, CDS enables more effective preservation of source image structure while aligning with target text prompts, leading to improved image editing quality.
How
The authors propose to extract intermediate features from the self-attention layers of the latent diffusion model and use them to compute the CUT loss (sketched below). This loss is then added to the DDS framework to regularize the editing process toward structural consistency with the source image. The authors demonstrate the effectiveness of their approach through qualitative and quantitative experiments on various text-driven image editing tasks, including comparisons with state-of-the-art methods, and they also show that CDS extends to other domains such as Neural Radiance Fields (NeRF).
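As a concrete illustration of the loss described above, the following is a minimal PyTorch-style sketch of a PatchNCE-style CUT loss computed on spatially aligned feature maps; the names (`cut_loss`, `n_patches`, `tau`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cut_loss(feat_src: torch.Tensor, feat_tgt: torch.Tensor,
             n_patches: int = 256, tau: float = 0.07) -> torch.Tensor:
    """PatchNCE-style contrastive loss between spatially aligned feature maps.

    feat_src, feat_tgt: (B, C, H, W) intermediate features, e.g. self-attention
    features of the LDM U-Net for the source latent and the latent being edited.
    """
    B, C, H, W = feat_src.shape
    # Flatten spatial dimensions: (B, H*W, C)
    src = feat_src.flatten(2).permute(0, 2, 1)
    tgt = feat_tgt.flatten(2).permute(0, 2, 1)

    n = min(n_patches, H * W)
    loss = feat_src.new_zeros(())
    for b in range(B):
        # Randomly sample patch locations, shared by source and target features.
        idx = torch.randperm(H * W, device=feat_src.device)[:n]
        q = F.normalize(tgt[b, idx], dim=-1)   # queries: edited-image patches
        k = F.normalize(src[b, idx], dim=-1)   # keys: source-image patches
        # Positive pair = same spatial location; negatives = other sampled locations.
        logits = q @ k.t() / tau               # (n, n) similarity matrix
        labels = torch.arange(n, device=logits.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss / B
```

In the CDS setting, `feat_src` and `feat_tgt` would come from the same self-attention layers evaluated on the noised source latent and the noised latent being optimized, so each positive pair shares a spatial location while negatives are drawn from other locations of the source features.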
Result
CDS outperforms existing state-of-the-art methods in text-guided image editing by effectively regulating structural consistency while aligning with target text prompts. It achieves a better balance between preserving structural details and transforming content compared to DDS and other baselines. Furthermore, CDS demonstrates successful application in Neural Radiance Fields editing, highlighting its extensibility.
Limitations & Future Work
The authors acknowledge failure cases when randomly sampled patches are unrepresentative or when objects appear in unconventional poses; future work may explore strategies to address these cases. They also acknowledge the ethical implications of image manipulation techniques like CDS, emphasizing the need for responsible use and regulation to prevent misuse.
Abstract
With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS), an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. To address this, here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Inspired by the similarities and differences between DDS and the contrastive learning for unpaired image-to-image translation (CUT), we introduce a straightforward approach using CUT loss within the DDS framework. Rather than employing auxiliary networks as in the original CUT approach, we leverage the intermediate features of the LDM, specifically those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving structural correspondence between the input and output while maintaining content controllability. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. Project page: https://hyelinnam.github.io/CDS/
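For reference, the "difference between scoring functions" mentioned in the abstract is the DDS update direction that CDS augments with the CUT regularizer. Below is a minimal sketch under standard DDPM forward-noising assumptions; the `unet(z_t, t, text_emb)` signature and the other names are assumptions for illustration, not a specific library API.

```python
import torch

@torch.no_grad()
def dds_direction(unet, z_src, z_edit, emb_src, emb_tgt, alphas_cumprod, t):
    """Difference of two noise predictions made with shared noise and timestep t."""
    noise = torch.randn_like(z_edit)
    a_t = alphas_cumprod[t].sqrt()            # DDPM forward-noising coefficients
    s_t = (1.0 - alphas_cumprod[t]).sqrt()
    eps_src = unet(a_t * z_src + s_t * noise, t, emb_src)    # source latent, source prompt
    eps_tgt = unet(a_t * z_edit + s_t * noise, t, emb_tgt)   # edited latent, target prompt
    return eps_tgt - eps_src                                 # used as the gradient on z_edit

# Conceptual CDS step: update z_edit with this direction while also minimizing
# lambda_cut * cut_loss(...) on the corresponding self-attention features
# (see the sketch under "How").
```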