InstantID: Zero-shot Identity-Preserving Generation in Seconds

Authors: Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu

What

This paper introduces InstantID, a plug-and-play module for diffusion models that performs identity-preserving image generation: from a single facial image, it produces personalized images in diverse styles with high fidelity.

Why

This paper is important because it addresses limitations of existing personalized image synthesis methods, which are computationally expensive, require multiple reference images, or fail to preserve identity with high fidelity. InstantID offers a fast, efficient, and high-fidelity solution for real-world applications such as e-commerce and AI portraits.

How

The authors develop InstantID with three main components: 1) an ID embedding from a pre-trained face recognition model that captures strong identity features; 2) a lightweight image adapter with decoupled cross-attention for integrating the image prompt; 3) IdentityNet, an adapted ControlNet that takes facial landmarks as a weak spatial condition and the ID embedding as a strong semantic condition to preserve complex facial features. The model is trained on large-scale data such as LAION-Face, optimizing only the adapter and IdentityNet while keeping the pre-trained diffusion model frozen.
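
To make the adapter component concrete, below is a minimal sketch of IP-Adapter-style decoupled cross-attention as described above: the text tokens and the ID-image tokens get separate key/value projections, and their attention outputs are summed. Module names, tensor shapes, and the `id_scale` weighting are illustrative assumptions, not the released InstantID code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Sketch: cross-attention with separate text and ID-image branches."""

    def __init__(self, query_dim, context_dim, num_heads=8, id_scale=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = query_dim // num_heads
        self.id_scale = id_scale  # weight of the identity (image-prompt) branch
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        # text branch projections come from the frozen pre-trained UNet
        self.to_k_text = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v_text = nn.Linear(context_dim, query_dim, bias=False)
        # new, trainable projections for the ID-image tokens (the adapter)
        self.to_k_id = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v_id = nn.Linear(context_dim, query_dim, bias=False)
        self.to_out = nn.Linear(query_dim, query_dim)

    def _attn(self, q, k, v):
        b, n, _ = q.shape

        def split(t):  # (b, tokens, dim) -> (b, heads, tokens, head_dim)
            return t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return out.transpose(1, 2).reshape(b, n, -1)

    def forward(self, hidden_states, text_tokens, id_tokens):
        q = self.to_q(hidden_states)
        text_out = self._attn(q, self.to_k_text(text_tokens), self.to_v_text(text_tokens))
        id_out = self._attn(q, self.to_k_id(id_tokens), self.to_v_id(id_tokens))
        # decoupled attention: add the two branches instead of concatenating tokens
        return self.to_out(text_out + self.id_scale * id_out)
```

During training only the `to_k_id`/`to_v_id`-style adapter weights (and IdentityNet) would be updated, which is what lets the module stay plug-and-play on top of a frozen base model.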

Result

InstantID demonstrates superior performance in preserving identity while maintaining stylistic flexibility, outperforming existing methods such as IP-Adapter and achieving results competitive with LoRA-based models without requiring multiple reference images or per-identity training. It is robust, retains prompt editability, is compatible with ControlNet, and enables applications such as novel-view synthesis, identity interpolation, and multi-identity synthesis.
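
As an illustration of the identity-interpolation application mentioned above, one can blend two face (ID) embeddings before feeding the result to the generation pipeline. The sketch below uses spherical interpolation, a common choice for normalized embeddings; the `pipeline.generate` call is a hypothetical placeholder, not an actual InstantID API.

```python
import torch
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two L2-normalized ID embeddings.

    Assumes a and b are not (nearly) identical, otherwise sin(omega) ~ 0.
    """
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    omega = torch.acos((a * b).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so).unsqueeze(-1) * a + \
           (torch.sin(t * omega) / so).unsqueeze(-1) * b

# id_a, id_b: embeddings from the face encoder for two reference faces
# for t in (0.0, 0.25, 0.5, 0.75, 1.0):
#     image = pipeline.generate(prompt, id_embedding=slerp(id_a, id_b, t))
```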

Limitations and Future Work

Limitations include the high coupling of facial attributes within the ID embedding and potential biases inherited from the face recognition model. Future work could focus on decoupling facial attributes for finer-grained editing and on addressing ethical concerns around potential misuse.

Abstract

There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.
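
For orientation, the sketch below shows how the pieces described in the abstract could compose at a single denoising step: the frozen UNet receives the text prompt, the adapter injects the ID embedding through decoupled cross-attention (see the sketch in the How section), and IdentityNet contributes ControlNet-style residuals computed from the facial landmark image while cross-attending to the ID embedding. All function and argument names here are hypothetical, used only to illustrate the data flow.

```python
def denoise_step(unet, identity_net, latents, t, text_emb, id_emb, landmark_image):
    # Weak spatial + strong semantic condition: the landmark image is the
    # spatial input, and the ID embedding (rather than text) drives
    # IdentityNet's cross-attention.
    control_residuals = identity_net(
        latents, t,
        cond_image=landmark_image,
        encoder_hidden_states=id_emb,
    )
    # Frozen base UNet (e.g. SD1.5 or SDXL): text tokens via the original
    # cross-attention, ID tokens via the adapter branch, and IdentityNet's
    # residuals added to intermediate features as in ControlNet.
    return unet(
        latents, t,
        encoder_hidden_states=text_emb,
        ip_tokens=id_emb,
        control_residuals=control_residuals,
    )
```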