Customizing Text-to-Image Models with a Single Image Pair
Authors: Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu
What
This paper introduces Pair Customization, a novel method for customizing text-to-image models that learns a stylistic difference from a single image pair.
Why
This paper is significant because it addresses a limitation of existing model customization techniques, which often overfit to content when learning a style from only one or a few example images. By learning from an image pair instead, the method can better disentangle style from content, enabling more effective and generalizable style transfer.
How
The authors propose a joint optimization method with separate LoRA weights for style and content: a content LoRA is trained to reconstruct the content image, while a style LoRA learns the stylistic difference between the pair. They further enforce orthogonality between the style and content LoRA parameters for better disentanglement. At inference, they introduce "style guidance," which integrates the style LoRA's predictions into the denoising process for improved style control and content preservation.
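To make the training setup concrete, here is a minimal PyTorch-style sketch of the dual-LoRA idea as we read it from the summary, not the authors' released code. The class name PairedLoRALinear, the rank, and the form of the orthogonality penalty are illustrative assumptions; the actual method attaches LoRA adapters to layers of a pretrained diffusion UNet and optimizes them with the usual denoising loss.

```python
import torch

class PairedLoRALinear(torch.nn.Module):
    """Frozen linear layer with two additive low-rank branches: content and style (illustrative)."""
    def __init__(self, base: torch.nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)     # pretrained weights stay frozen
        d_out, d_in = base.weight.shape
        # Content LoRA (B_c @ A_c) and style LoRA (B_s @ A_s), each of rank `rank`.
        self.A_c = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_c = torch.nn.Parameter(torch.zeros(d_out, rank))
        self.A_s = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_s = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, use_content: bool = True, use_style: bool = False):
        out = self.base(x)
        if use_content:   # content branch: trained to reconstruct the content image
            out = out + x @ self.A_c.t() @ self.B_c.t()
        if use_style:     # style branch: trained (jointly with content) on the styled image
            out = out + x @ self.A_s.t() @ self.B_s.t()
        return out

def orthogonality_penalty(layer: PairedLoRALinear) -> torch.Tensor:
    """Penalize overlap between the content and style LoRA row spaces,
    one plausible way to encourage the disentanglement described above."""
    return (layer.A_c @ layer.A_s.t()).pow(2).mean()
```

The "style guidance" at inference can be read as adding a second guidance direction on top of standard classifier-free guidance. The combination below and the default scales are assumptions about the formulation, not the paper's exact equation:

```python
def style_guided_noise(eps_uncond, eps_content, eps_full,
                       cfg_scale: float = 7.5, style_scale: float = 3.0):
    """Combine UNet noise predictions at one denoising step (hypothetical formulation).
    eps_uncond : base model, no text conditioning
    eps_content: content LoRA applied, with the text prompt
    eps_full   : content + style LoRA applied, with the text prompt
    """
    return (eps_uncond
            + cfg_scale * (eps_content - eps_uncond)    # standard classifier-free guidance
            + style_scale * (eps_full - eps_content))   # extra style-guidance direction
```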
Result
The proposed method demonstrates superior performance in capturing and applying stylistic differences compared to existing baselines. It effectively preserves the structure of the input content while applying the learned style, as demonstrated through quantitative metrics like perceptual distance and a human preference study.
Limitations & Future Work
The paper acknowledges limitations in handling categories that differ significantly from the training pair, as well as the computational demands of test-time optimization. Future work could explore encoder-based approaches for faster customization and improved style transfer across broader categories.
Abstract
Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.