FaceStudio: Put Your Face Everywhere in Seconds

Authors: Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu

What

This paper introduces FaceStudio, a tuning-free method for identity-preserving image synthesis. The model efficiently generates human images in various styles while maintaining individual identities, guided by a hybrid framework that combines style images, facial images, and text prompts.

Why

This paper addresses limitations in existing identity-preserving image synthesis methods, which often require resource-intensive fine-tuning and multiple reference images. The proposed method offers a faster and more efficient alternative by using a direct feed-forward approach and hybrid guidance, enabling diverse applications like artistic portrait creation and identity blending.

How

The authors develop a hybrid guidance framework that combines style images, facial images, and text prompts to condition a latent diffusion model. Identity features are extracted from facial images with ArcFace and fused with text embeddings produced by a prior model trained to map CLIP text embeddings to CLIP vision embeddings. A multi-identity cross-attention mechanism handles multiple identities within a single image, ensuring each individual's features are mapped to the correct person (see the sketch below). The model is trained on a human image reconstruction task, using masked images as the style input and cropped faces as the identity input.
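
To make the multi-identity routing concrete, below is a minimal, single-head sketch of a masked cross-attention layer. The class name, tensor shapes, and mask-based routing are assumptions for illustration only, not the paper's released code: each person's identity embedding is attended separately, and the attended features are written back only into that person's face region.

```python
# Minimal single-head sketch of multi-identity cross-attention (illustrative
# names and shapes; not the authors' implementation). Queries come from the
# diffusion latent; each person's identity embedding contributes attended
# features only inside that person's face-region mask.
import torch
from torch import nn


class MultiIdentityCrossAttention(nn.Module):
    def __init__(self, query_dim: int, id_dim: int):
        super().__init__()
        self.scale = query_dim ** -0.5
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        self.to_k = nn.Linear(id_dim, query_dim, bias=False)
        self.to_v = nn.Linear(id_dim, query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)

    def forward(self, x, id_embeds, region_masks):
        """
        x:            (B, N, C)    latent features, N = H*W spatial tokens
        id_embeds:    (B, P, L, D) one identity token sequence per person
        region_masks: (B, P, N)    1 where a token belongs to person p, else 0
        """
        q = self.to_q(x)                                    # (B, N, C)
        out = torch.zeros_like(q)
        for p in range(id_embeds.shape[1]):
            k = self.to_k(id_embeds[:, p])                  # (B, L, C)
            v = self.to_v(id_embeds[:, p])                  # (B, L, C)
            attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
            # Route this person's attended features into their region only.
            out = out + (attn @ v) * region_masks[:, p].unsqueeze(-1)
        return self.to_out(out)


# Example: two identities in a 64x64 latent grid, 4 identity tokens per person.
layer = MultiIdentityCrossAttention(query_dim=320, id_dim=512)
x = torch.randn(1, 64 * 64, 320)
ids = torch.randn(1, 2, 4, 512)
masks = torch.randint(0, 2, (1, 2, 64 * 64)).float()
print(layer(x, ids, masks).shape)                           # torch.Size([1, 4096, 320])
```

With a single person and an all-ones mask this reduces to ordinary cross-attention on the identity tokens, which is why the same layer serves both single- and multi-human generation.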

Result

The proposed method outperforms baselines such as DreamBooth and Textual Inversion at preserving identity during image synthesis, achieving higher face similarity scores in both single- and multi-image settings. An ablation study confirms that the identity input is essential for maintaining identity fidelity. The model also performs well on novel view synthesis, generating images with large pose changes while preserving identity, and it supports identity mixing and multi-human image generation with accurate identity-to-person mapping.

Limitations & Future Work

The authors acknowledge that compared to methods like DreamBooth, their model is currently limited to human image generation. As future work, they plan to extend its capabilities to encompass a wider range of subjects, including animals and objects. Additionally, they recognize the ethical considerations and potential for misuse, such as copyright infringement and the creation of inappropriate content. The authors emphasize the importance of responsible use and the establishment of guidelines to mitigate these risks.

Abstract

This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject’s identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to produce a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject’s identity with high fidelity.