PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models

Authors: Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, Min Zheng

What

This paper introduces PhotoVerse, a novel method for personalized text-to-image generation that uses a dual-branch conditioning mechanism to produce high-quality, customized images quickly from only a single reference image of the target identity.

Why

This paper addresses the limitations of existing personalized text-to-image generation methods, such as long tuning times, large storage requirements, and the need for multiple input images. It offers a faster and more user-friendly approach for incorporating specific individuals into diverse scenes with high fidelity.

How

The paper proposes a dual-branch conditioning mechanism that combines improved identity textual embeddings and spatial concept cues through dual-modality adapters in both text and image domains. The method utilizes a pre-trained Stable Diffusion model and incorporates a novel facial identity loss component during training to enhance identity preservation. The approach employs lightweight adapters and fine-tunes only the cross-attention module of the UNet, resulting in fast and efficient personalization without the need for test-time tuning.
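A minimal PyTorch-style sketch of this dual-branch idea is given below, assuming a CLIP-style image encoder supplies reference-face features; the module and projection names (Adapter, DualBranchCrossAttention, to_k_id, to_v_id) are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight MLP mapping reference-image features into a target space
    (text-embedding space or cross-attention context space)."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DualBranchCrossAttention(nn.Module):
    """Cross-attention that attends to text tokens and identity tokens.

    Only these projections (plus the adapters) would carry gradients; the
    rest of the frozen UNet is untouched, which is what keeps the
    personalization step lightweight.
    """

    def __init__(self, query_dim: int, context_dim: int):
        super().__init__()
        self.scale = query_dim ** -0.5
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        # Text branch: standard Stable Diffusion cross-attention projections.
        self.to_k_text = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v_text = nn.Linear(context_dim, query_dim, bias=False)
        # Image branch: extra key/value projections for identity tokens.
        self.to_k_id = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v_id = nn.Linear(context_dim, query_dim, bias=False)

    def forward(self, x, text_tokens, id_tokens):
        q = self.to_q(x)

        def attend(k_proj, v_proj, ctx):
            k, v = k_proj(ctx), v_proj(ctx)
            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            return attn @ v

        # Fuse the two branches; a simple sum is one possible choice.
        return attend(self.to_k_text, self.to_v_text, text_tokens) + \
               attend(self.to_k_id, self.to_v_id, id_tokens)
```

In this sketch, training only the adapters and the cross-attention projections is what would make personalization cheap enough to skip test-time tuning.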

Result

PhotoVerse demonstrates superior performance in preserving identity attributes while enabling image editing, stylization, and new scene generation. It achieves high identity similarity across diverse ethnicities and produces high-quality images with sharp details and natural aesthetics. The method eliminates the need for test-time tuning and generates images in just a few seconds using a single reference image, significantly improving efficiency compared to existing methods.

Limitations and Future Work

The authors acknowledge potential bias in pre-trained large models as a limitation. Future work could involve exploring methods to mitigate this bias and further enhance the generalization capabilities of the model. Additionally, incorporating control mechanisms for pose and composition could provide users with more fine-grained control over image generation.

Abstract

Personalized text-to-image generation has emerged as a powerful and sought-after tool, empowering users to create customized images based on their specific concepts and prompts. However, existing approaches to personalization encounter multiple challenges, including long tuning times, large storage requirements, the necessity for multiple input images per identity, and limitations in preserving identity and editability. To address these obstacles, we present PhotoVerse, an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains, providing effective control over the image generation process. Furthermore, we introduce facial identity loss as a novel component to enhance the preservation of identity during training. Remarkably, our proposed PhotoVerse eliminates the need for test-time tuning and relies solely on a single facial photo of the target identity, significantly reducing the resource cost associated with image generation. After a single training phase, our approach can generate high-quality images within only a few seconds. Moreover, our method can produce diverse images that encompass various scenes and styles. Extensive evaluation demonstrates the superior performance of our approach, which achieves the dual objectives of preserving identity and facilitating editability. Project page: https://photoverse2d.github.io/
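The facial identity loss mentioned above can be pictured as a cosine-similarity penalty between face-recognition embeddings of the reference photo and the generated face; the sketch below is an assumption about its general form (including the use of a frozen ArcFace-style encoder and the function name), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def facial_identity_loss(face_encoder, generated_face: torch.Tensor,
                         reference_face: torch.Tensor) -> torch.Tensor:
    """Penalize identity drift between generated and reference faces.

    `face_encoder` is assumed to be a frozen face-recognition network
    (e.g. an ArcFace-style model) that returns one embedding per image.
    """
    with torch.no_grad():
        ref_emb = F.normalize(face_encoder(reference_face), dim=-1)
    gen_emb = F.normalize(face_encoder(generated_face), dim=-1)
    # 1 - cosine similarity: approaches 0 when the identities match.
    return (1.0 - (gen_emb * ref_emb).sum(dim=-1)).mean()
```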