Watch Your Steps: Local Image and Scene Editing by Text Instructions

Authors: Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski

What

This paper presents a method for localizing image and scene edits: the discrepancy between the noise predictions of a diffusion-based image editor with and without the text instruction yields a relevance map that guides the editing process.

Why

This paper addresses the limitations of existing diffusion-based image editors, particularly their tendency to over-edit. By introducing relevance maps, the method allows for precise control over the editing process, preserving irrelevant regions while ensuring the desired changes are applied effectively to both images and 3D scenes represented as neural radiance fields.

How

The authors compute a relevance map as the difference between the noise predictions of InstructPix2Pix (IP2P) with and without the edit instruction. After binarization, this map guides the IP2P denoising process so that edits are confined to the relevant region. For 3D scene editing, a relevance field is trained on the relevance maps of the training views to maintain 3D consistency; relevance maps rendered from this field then guide iterative updates to the scene.
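
For concreteness, here is a minimal PyTorch-style sketch of the relevance map and a masked denoising step. The interfaces eps_model (the noise predictor), step_fn (one scheduler step), and the threshold tau are illustrative placeholders under assumed signatures, not the authors' implementation:

import torch

@torch.no_grad()
def relevance_map(eps_model, z_t, t, image_cond, instruction, null_instruction):
    # Noise prediction with and without the edit instruction; the per-pixel
    # discrepancy indicates which pixels must change to realize the edit.
    eps_edit = eps_model(z_t, t, image_cond, instruction)
    eps_null = eps_model(z_t, t, image_cond, null_instruction)
    rel = (eps_edit - eps_null).abs().mean(dim=1, keepdim=True)  # channel-averaged magnitude
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-8)     # normalize to [0, 1]
    return rel

@torch.no_grad()
def masked_denoise_step(step_fn, eps_model, z_t, t, image_cond, instruction,
                        null_instruction, tau=0.5):
    # Binarize the relevance map and apply the edit only inside the mask;
    # outside the mask, follow the instruction-free (reconstruction) prediction.
    rel = relevance_map(eps_model, z_t, t, image_cond, instruction, null_instruction)
    mask = (rel > tau).float()
    z_edit = step_fn(z_t, eps_model(z_t, t, image_cond, instruction), t)
    z_keep = step_fn(z_t, eps_model(z_t, t, image_cond, null_instruction), t)
    return mask * z_edit + (1.0 - mask) * z_keep

The blend in the last line is the key point: the denoising trajectory only departs from the reconstruction where the relevance map says the instruction demands a change.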

Result

The method achieves state-of-the-art performance on both image and NeRF editing tasks, outperforming baselines at preserving consistency with the input while matching them in edit quality. The relevance maps prevent over-editing by confining changes to the desired regions, and the method produces sharper, higher-quality results than prior approaches, particularly for NeRF editing.

Limitations and Future Work

The authors acknowledge the method’s reliance on IP2P, inheriting its limitations. Cases where IP2P fails to interpret the instruction or localize the edit properly pose challenges. Future work could explore better instruction-conditioned diffusion models and address ambiguities in localizing edits for broader applications.

Abstract

Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field, denoted the relevance field, is trained on the relevance maps of the training views, defining the 3D region within which modifications should be made. We perform iterative updates on the training views, guided by relevance maps rendered from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/
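
Under the same caveat as the earlier sketch, the 3D stage can be summarized in two steps: fit a relevance field to the per-view relevance maps, then use its renders to mask the iterative updates of the training views. The objects field, nerf, optimizer, views, and edit_with_ip2p below are hypothetical stand-ins, not the released implementation:

import random
import torch
import torch.nn.functional as F

def fit_relevance_field(field, optimizer, views, rel_maps, n_steps):
    # Fit a 3D field so that its volume-rendered relevance matches the
    # per-view 2D relevance maps, giving a 3D-consistent edit region.
    for _ in range(n_steps):
        i = random.randrange(len(views))
        pred = field.render(views[i].camera).clamp(1e-5, 1 - 1e-5)  # assumed renderer
        loss = F.binary_cross_entropy(pred, rel_maps[i])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def update_training_view(nerf, field, view, edit_with_ip2p, tau=0.5):
    # One iterative-dataset-update step: edit the current scene render with
    # IP2P, but keep pixels outside the rendered relevance mask unchanged.
    rendered = nerf.render(view.camera)                       # current scene render
    mask = (field.render(view.camera) > tau).float()          # 3D-consistent mask
    edited = edit_with_ip2p(rendered, view.instruction)       # instruction-guided edit
    view.target = mask * edited + (1.0 - mask) * view.target  # confine the update

Because every view's mask is rendered from the same field, the edited region stays consistent across viewpoints, which is what keeps the iterative NeRF updates from drifting.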