Consolidating Attention Features for Multi-view Image Editing

Authors: Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre

What

This paper introduces a method for consistent multi-view image editing, focusing on geometric manipulations such as articulations and shape changes. It uses spatial controls together with QNeRF, a novel neural radiance field trained in the query feature space of the diffusion model's self-attention layers.

Why

This work addresses the limitations of existing multi-view image editing techniques that struggle with consistent geometric modifications across multiple views, offering a solution for more realistic and high-fidelity edits.

How

The authors leverage ControlNet and a pre-trained Stable Diffusion model to edit images based on spatial controls. They introduce QNeRF, trained on query features from self-attention layers, to progressively consolidate these features during denoising, ensuring consistency across views.
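The soft query injection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the blending weight `alpha`, and the toy tensor shapes are assumptions. Per-view queries from the denoising network are blended with 3D-consistent queries rendered by QNeRF, and the blended queries drive standard scaled dot-product self-attention.

```python
import torch

def inject_queries(q_generated, q_rendered, alpha=0.7):
    # Softly blend per-view queries from the denoising network with
    # 3D-consistent queries rendered by QNeRF. `alpha` (hypothetical
    # parameter name) controls injection strength; alpha=1.0 would
    # fully replace the generated queries.
    return alpha * q_rendered + (1.0 - alpha) * q_generated

def self_attention_with_injection(q, k, v, q_rendered, alpha=0.7):
    # Standard scaled dot-product self-attention, except the queries
    # are first replaced by the blended, multi-view-consistent ones.
    q_mix = inject_queries(q, q_rendered, alpha)
    scale = q_mix.shape[-1] ** -0.5
    attn = torch.softmax(q_mix @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

# Toy usage: one view, 16 tokens, 8-dim features.
q = torch.randn(1, 16, 8)
k = torch.randn(1, 16, 8)
v = torch.randn(1, 16, 8)
q_nerf = torch.randn(1, 16, 8)  # stand-in for QNeRF-rendered queries
out = self_attention_with_injection(q, k, v, q_nerf)
print(out.shape)  # torch.Size([1, 16, 8])
```

In the actual method this blending happens progressively across denoising timesteps, with QNeRF retrained on the evolving query features, so the injected queries become increasingly consistent across views.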

Result

The proposed method achieves higher visual quality and stronger multi-view consistency than baselines such as InstructNeRF2NeRF and TokenFlow, as demonstrated through qualitative results, KID and FID scores, and user preference studies. It also enables training NeRFs with fewer artifacts that are better aligned with the target geometry.

Limitations and Future Work

Limitations include difficulties in generating highly detailed structures like hands, potential for hallucinating inconsistent details in complex objects, and reliance on a black-box optimizer for QNeRF training. Future work could explore robust statistics for QNeRF optimization, alternative 3D representations like Gaussian Splats, and addressing the limitations inherited from text-to-image models.

Abstract

Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry.