LocInv: Localization-aware Inversion for Text-Guided Image Editing

Authors: Chuanming Tang, Kai Wang, Fei Yang, Joost van de Weijer

What

This paper introduces Localization-aware Inversion (LocInv), a method for text-guided image editing that leverages localization priors such as segmentation maps or bounding boxes to improve the accuracy of cross-attention maps in diffusion models, thereby making object manipulation and attribute editing more precise.
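To give a concrete sense of what such a localization prior looks like in practice, the sketch below converts a COCO-style bounding box into a binary mask at an assumed 16x16 cross-attention resolution for Stable Diffusion. This is a toy illustration rather than code from the paper; the function name, resolution, and coordinate convention are assumptions.

```python
# Toy sketch (not code from the paper): rasterize a COCO-style bounding box
# into a binary mask at the cross-attention resolution, so it can be compared
# against an attention map as a localization prior.

import torch


def box_to_mask(box_xywh, image_size, attn_size=16):
    """box_xywh: (x, y, w, h) in pixels; image_size: (H, W) of the source image."""
    x, y, w, h = box_xywh
    img_h, img_w = image_size
    mask = torch.zeros(attn_size, attn_size)
    # Scale pixel coordinates down to the attention-map grid.
    x0 = int(round(x / img_w * attn_size))
    y0 = int(round(y / img_h * attn_size))
    x1 = int(round((x + w) / img_w * attn_size))
    y1 = int(round((y + h) / img_h * attn_size))
    # Guarantee at least one marked cell even for very small boxes.
    mask[y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)] = 1.0
    return mask
```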

Why

This paper addresses the problem of cross-attention leakage in text-guided image editing with diffusion models. Existing methods often struggle to edit only the intended objects, causing unintended alterations in other image regions. LocInv tackles this by incorporating readily available localization information, yielding more accurate and controllable image editing.

How

LocInv builds on pre-trained Stable Diffusion models and incorporates localization priors (segmentation maps or bounding boxes) obtained from datasets or foundation models. By dynamically updating the tokens associated with noun words during the denoising process, it refines the cross-attention maps so that they align closely with the target objects. For attribute editing, LocInv additionally introduces an adjective binding loss that ties adjective representations to their corresponding nouns, improving the model's ability to edit object attributes.
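To make the mechanism concrete, here is a minimal sketch, assuming a PyTorch setup, of the kind of per-step token refinement described above: a localization loss pulls a noun token's cross-attention map inside the given mask, and an adjective binding loss ties an adjective's map to its noun's. The helper get_cross_attention, the exact loss forms, and all hyperparameters are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of localization-guided token updating during denoising.
# Assumptions: `get_cross_attention` returns (num_tokens, H, W) attention maps
# for the current latent, and `tokens` are learnable text-token embeddings.

import torch
import torch.nn.functional as F


def localization_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Encourage the attention map to concentrate inside the localization prior.

    attn_map: (H, W) cross-attention map for one noun token, values >= 0.
    mask:     (H, W) binary segmentation mask (or rasterized bounding box).
    """
    attn = attn_map / (attn_map.sum() + 1e-8)   # normalize to a distribution
    inside = (attn * mask).sum()                # attention mass on the object
    return 1.0 - inside                         # penalize mass outside the prior


def adjective_binding_loss(adj_map: torch.Tensor, noun_map: torch.Tensor) -> torch.Tensor:
    """Align an adjective's attention map with its noun's attention map
    (here via cosine similarity), so attribute edits stay on the right object."""
    return 1.0 - F.cosine_similarity(adj_map.flatten(), noun_map.flatten(), dim=0)


def update_tokens(tokens, get_cross_attention, mask, noun_idx, adj_idx=None,
                  lr=0.01, n_iters=15):
    """Inner loop for one denoising step: refine the learnable token embeddings.

    tokens:              (num_tokens, dim) embeddings with requires_grad=True.
    get_cross_attention: assumed callable returning (num_tokens, H, W) maps.
    """
    optimizer = torch.optim.Adam([tokens], lr=lr)
    for _ in range(n_iters):
        attn = get_cross_attention(tokens)      # (num_tokens, H, W)
        loss = localization_loss(attn[noun_idx], mask)
        if adj_idx is not None:
            loss = loss + adjective_binding_loss(attn[adj_idx], attn[noun_idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return tokens.detach()
```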

Result

In extensive evaluations on a subset of the COCO dataset, LocInv consistently outperforms existing text-guided image editing methods, both on quantitative metrics (LPIPS, SSIM, PSNR, CLIP Score, DINO-Sim) and in qualitative comparisons. The method is particularly strong on local object Word-Swap tasks, preserving the background while accurately replacing target objects. Notably, LocInv also supports Attribute-Edit, modifying object colors and materials by binding adjective and noun representations, a capability largely unexplored by existing methods.
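For intuition about the background-preservation side of this evaluation, the sketch below scores fidelity outside the edited region with PSNR and SSIM from torchmetrics. It is an assumed setup for illustration, not the paper's evaluation pipeline; the function name and tensor conventions are my own.

```python
# Rough sketch of background-fidelity scoring. Assumptions: images are float
# tensors in [0, 1] with shape (1, 3, H, W), and `mask` is a broadcastable
# (1, 1, H, W) binary map marking the edited object region.

import torch
from torchmetrics.functional import (
    peak_signal_noise_ratio,
    structural_similarity_index_measure,
)


def background_fidelity(source: torch.Tensor, edited: torch.Tensor,
                        mask: torch.Tensor) -> dict:
    """Compare source and edited images on the region outside the edit mask."""
    keep = 1.0 - mask            # 1 where the image should stay unchanged
    src_bg = source * keep
    edt_bg = edited * keep
    return {
        "psnr": peak_signal_noise_ratio(edt_bg, src_bg).item(),
        "ssim": structural_similarity_index_measure(edt_bg, src_bg).item(),
    }
```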

Limitations & Future Work

The authors acknowledge limitations related to the resolution of cross-attention maps, the editing capabilities of frozen Stable Diffusion models, and challenges in reconstructing high-frequency image details. Future work aims to explore pixel-level text-to-image models for finer control, integrate techniques like InstructPix2Pix for enhanced editing, and address limitations in reconstructing intricate image details.

Abstract

Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Building on T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing regions beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we compel the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at https://github.com/wangkai930418/DPL