Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code
Authors: Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu
What
This paper introduces Direct Inversion, a technique for inverting diffusion models in text-based image editing. It disentangles the source and target diffusion branches so that each specializes in its own role: preserving the essential content of the source image and achieving fidelity to the target prompt, respectively.
Why
The paper addresses limitations of existing inversion techniques for diffusion-based, text-guided image editing, which often rely on computationally expensive optimization and tend to compromise either content preservation or edit fidelity. The authors argue that disentangling the source and target branches lets each objective be optimized separately, and they introduce a new benchmark dataset for evaluation.
How
The authors propose Direct Inversion, which rectifies the deviation of the source branch's denoising path using a simple three-line modification to DDIM inversion. They also introduce PIE-Bench, a new benchmark of 700 images spanning diverse scenes and editing categories, and use it to evaluate the method across 8 editing techniques and against existing inversion methods with 7 evaluation metrics.
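The core idea can be illustrated with a short sketch. The snippet below is a conceptual approximation, not the authors' released code: denoise_step is a hypothetical wrapper around one DDIM denoising step of the diffusion model, and inversion_latents is assumed to hold the latent trajectory [z_0, ..., z_T] recorded during DDIM inversion of the source image. After every step, the source branch is snapped back onto that recorded trajectory so it stays faithful to the source content, while the target branch denoises freely under the target prompt.

# Conceptual sketch of Direct Inversion (hypothetical helper names, not the paper's code).
# inversion_latents = [z_0, z_1, ..., z_T] recorded during DDIM inversion of the source image.
# denoise_step(z, prompt, t) performs one DDIM denoising step from timestep t to t-1.
def edit_with_direct_inversion(inversion_latents, source_prompt, target_prompt, denoise_step):
    num_steps = len(inversion_latents) - 1
    z_src = inversion_latents[-1]  # z_T seeds both branches
    z_tgt = inversion_latents[-1]
    for i in range(num_steps):
        t = num_steps - i  # current timestep index: T, T-1, ..., 1
        # Target branch: ordinary denoising under the target prompt (edit fidelity).
        z_tgt = denoise_step(z_tgt, target_prompt, t)
        # Source branch: denoise under the source prompt, then rectify the deviation
        # by returning to the recorded inversion trajectory (content preservation).
        # The source-branch forward pass still matters because editing methods such as
        # Prompt-to-Prompt inject its attention features into the target branch.
        z_src = denoise_step(z_src, source_prompt, t)
        z_src = inversion_latents[t - 1]  # the "direct inversion" correction
    return z_tgt  # decode with the VAE to obtain the edited image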
Result
Direct Inversion outperforms existing optimization-based inversion methods, with significant gains in essential content preservation (up to 83.2% improvement in Structure Distance) and edit fidelity (up to 8.8% improvement in Edit Region CLIP Similarity), while being significantly faster. When integrated with other editing techniques, it also improves content preservation by up to 20.2% and edit fidelity by up to 2.5%.
Limitations & Future Work
The authors acknowledge limitations inherited from existing diffusion-based editing methods, such as instability and low success rates in certain complex editing scenarios. Future work includes extending the approach to video editing, developing more robust editing models, and creating more comprehensive evaluation metrics.
Abstract
Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce “Direct Inversion,” a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of magnitude speed-up.
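For concreteness, the following is a minimal sketch of the DDIM inversion procedure that produces the noisy latent and the recorded trajectory used in the editing sketch above. It assumes a PyTorch-style noise-prediction UNet and a precomputed cumulative alpha schedule; the function name and signature are illustrative, not the paper's or any library's actual API.

import torch

@torch.no_grad()
def ddim_invert(z0, prompt_embedding, unet, alphas_cumprod, timesteps):
    # Deterministic DDIM run in reverse: map the clean latent z_0 of the source
    # image to a noisy latent z_T, recording every intermediate latent.
    # unet(z, t, c) predicts the noise eps; alphas_cumprod[t] is alpha-bar_t.
    z = z0
    trajectory = [z0]
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):  # low noise -> high noise
        a_cur = alphas_cumprod[t_cur]
        a_next = alphas_cumprod[t_next]
        eps = unet(z, t_cur, prompt_embedding)
        # Predict the clean latent implied by the current noisy latent.
        z0_pred = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        # Reverse DDIM update: step one level up the noise schedule.
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
        trajectory.append(z)
    return trajectory  # [z_0, z_1, ..., z_T]; trajectory[-1] seeds both editing branches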