Cross-Image Attention for Zero-Shot Appearance Transfer

Authors: Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or

What

This paper presents a zero-shot approach for transferring visual appearance between objects in different images, leveraging the semantic knowledge encoded within pretrained text-to-image diffusion models.

Why

This paper is significant because, unlike existing approaches, it offers a method for appearance transfer that requires neither training a new model nor per-image optimization. Instead, it leverages pretrained diffusion models and their ability to capture semantic correspondences between images, even across different object categories.

How

The authors introduce a ‘Cross-Image Attention’ mechanism that replaces the standard self-attention layers within the denoising network of a pretrained diffusion model. By combining the queries derived from the structure image with the keys and values derived from the appearance image, the mechanism implicitly establishes semantic correspondences and transfers visual features from the appearance image onto the target structure, without any training or per-image optimization. To improve transfer quality, they further employ attention-map contrasting, appearance guidance, and AdaIN normalization of the noisy latents; the core operation is sketched below.
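The core operation can be illustrated with a short sketch. This is not the authors’ implementation: it assumes attention features of shape (batch, tokens, dim), and the tensor names (`q_struct`, `k_app`, `v_app`) and the `contrast_strength` parameter are illustrative stand-ins for the quantities described above.

```python
import torch

def cross_image_attention(q_struct: torch.Tensor,
                          k_app: torch.Tensor,
                          v_app: torch.Tensor,
                          contrast_strength: float = 1.0) -> torch.Tensor:
    """Attend from the structure image's queries to the appearance image's keys/values.

    All tensors are assumed to have shape (batch, tokens, dim), as in a
    self-attention layer of the denoising U-Net.
    """
    d_k = q_struct.shape[-1]
    # Scaled dot-product attention, computed *across* the two images:
    # queries come from the structure image, keys/values from the appearance image.
    scores = q_struct @ k_app.transpose(-2, -1) / d_k ** 0.5
    attn = scores.softmax(dim=-1)

    # Rough stand-in for attention-map contrasting: push probabilities away from
    # their mean to sharpen the implicit correspondences, then renormalize.
    mean = attn.mean(dim=-1, keepdim=True)
    attn = ((attn - mean) * contrast_strength + mean).clamp(min=0.0)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    # Mixing the appearance image's values according to the structure image's
    # queries transfers appearance features onto the target structure.
    return attn @ v_app
```

In an actual pipeline, a function like this would stand in for the self-attention computation in selected layers of the denoising network while the structure image’s latent is denoised, with appearance guidance steering the process further toward the appearance image.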

Result

The paper demonstrates high-quality appearance transfer across a variety of object domains, including challenging cases with large differences in object shape and viewpoint, and even transfers between different object categories. Qualitative and quantitative comparisons with existing techniques such as Swapping Autoencoders, SpliceVIT, and DiffuseIT show that the method achieves a better balance between structure preservation and faithful appearance transfer. A user study further supports these findings, confirming the superior quality and appearance fidelity of the generated images.

Limitations and Future Work

The authors acknowledge limitations in the model’s ability to establish accurate correspondences, especially between semantically dissimilar objects. In addition, the transfer depends on faithfully inverting the input images into the diffusion model’s latent space, and results can be sensitive to the choice of inversion method and random seed. Future work could improve the robustness of cross-domain transfer and develop inversion techniques that yield more reliable and editable latent codes.

Abstract

Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images — one depicting the target structure and the other specifying the desired appearance — our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model’s internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
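As a complement to the description above, the manipulation of the noisy latent codes can be illustrated with an AdaIN-style normalization that aligns the output latent’s per-channel statistics with those of the appearance image’s latent at the same timestep. This is only a minimal sketch under that assumption; the names `z_out` and `z_app` and the (B, C, H, W) latent layout are illustrative, not taken from the paper’s code.

```python
import torch

def adain_latents(z_out: torch.Tensor, z_app: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Match the per-channel mean/std of z_out to those of z_app.

    Both latents are assumed to have shape (B, C, H, W) and to correspond to the
    same denoising timestep.
    """
    mu_out = z_out.mean(dim=(2, 3), keepdim=True)
    std_out = z_out.std(dim=(2, 3), keepdim=True) + eps
    mu_app = z_app.mean(dim=(2, 3), keepdim=True)
    std_app = z_app.std(dim=(2, 3), keepdim=True) + eps
    # Normalize the output latent, then re-scale and shift it with the
    # appearance latent's statistics.
    return (z_out - mu_out) / std_out * std_app + mu_app
```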