High-fidelity Person-centric Subject-to-Image Synthesis

Authors: Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin

What

This paper introduces Face-diffuser, a collaborative generation pipeline for subject-driven text-to-image synthesis. It addresses the limitations of existing methods in person-centric image generation by employing two specialized diffusion models, one for scene generation and one for person generation.

Why

This paper is important because it tackles the training imbalance and quality-compromise issues prevalent in current subject-driven image generation models, especially for person-centric synthesis. By decoupling scene generation from person generation, Face-diffuser improves the fidelity of generated persons within diverse semantic scenes, advancing the field of personalized image generation.

How

The authors propose Face-diffuser, which uses two specialized pre-trained diffusion models: a Text-driven Diffusion Model (TDM) for scene generation and a Subject-augmented Diffusion Model (SDM) for person generation. Sampling proceeds in three stages: initial scene construction by TDM, subject-scene fusion through a novel Saliency-adaptive Noise Fusion (SNF) mechanism, and final subject enhancement by SDM. At each fusion step, SNF uses the two models' classifier-free guidance responses to dynamically allocate spatial regions to each model, enabling seamless collaboration.
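To make the fusion idea concrete, below is a minimal PyTorch sketch of saliency-aware noise blending. It is an illustration of the general mechanism rather than the paper's exact formulation: the function names (`saliency_from_cfg`, `snf_blend`), the channel-summed and smoothed response as the saliency measure, the hard per-pixel assignment, and the guidance scale are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def saliency_from_cfg(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
                      smooth_kernel: int = 3) -> torch.Tensor:
    # Classifier-free guidance response: conditional minus unconditional noise prediction.
    response = (eps_cond - eps_uncond).abs().sum(dim=1, keepdim=True)  # (B, 1, H, W)
    # Light spatial smoothing so the resulting mask is not overly noisy.
    response = F.avg_pool2d(response, smooth_kernel, stride=1,
                            padding=smooth_kernel // 2)
    # Per-sample min-max normalization so the two models' maps are comparable.
    flat = response.flatten(1)
    lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
    hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return (response - lo) / (hi - lo + 1e-8)

def snf_blend(eps_tdm_cond, eps_tdm_uncond, eps_sdm_cond, eps_sdm_uncond,
              guidance_scale: float = 7.5) -> torch.Tensor:
    # Saliency map of each model, derived from its guidance response.
    sal_tdm = saliency_from_cfg(eps_tdm_cond, eps_tdm_uncond)
    sal_sdm = saliency_from_cfg(eps_sdm_cond, eps_sdm_uncond)
    # Hard spatial assignment: each latent location goes to the model with the
    # stronger response (a soft variant could use a softmax over the two maps).
    mask_sdm = (sal_sdm > sal_tdm).float()
    # Standard classifier-free guidance for each model.
    eps_tdm = eps_tdm_uncond + guidance_scale * (eps_tdm_cond - eps_tdm_uncond)
    eps_sdm = eps_sdm_uncond + guidance_scale * (eps_sdm_cond - eps_sdm_uncond)
    # Spatially blend the two predicted noises.
    return mask_sdm * eps_sdm + (1.0 - mask_sdm) * eps_tdm
```

The key design point is that the mask is recomputed at every denoising step, so the regions assigned to the person model and the scene model can shift as the image takes shape.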

Result

Face-diffuser demonstrates superior performance in both single- and multi-subject generation tasks, quantitatively outperforming state-of-the-art methods in terms of identity preservation and prompt consistency. Qualitative results showcase its ability to generate high-fidelity, coherent images of individuals within diverse contexts, surpassing baselines in preserving subject details and scene semantics. Ablation studies confirm the efficacy of each stage in the pipeline and the superiority of SNF over simpler fusion techniques.

Limitations & Future Work

Limitations include potential privacy concerns, since generated persons closely resemble the reference images, and difficulty in editing attributes of the generated individuals. Future work aims to address the privacy concerns and to explore attribute-editing capabilities.

Abstract

Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. In particular, to generate realistic persons, they need to tune the pre-trained model sufficiently, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation overfit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., a Text-driven Diffusion Model (TDM) and a Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. The subject-scene fusion stage is the collaboration achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). It is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
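For intuition about how the three sampling stages fit together, here is a hypothetical end-to-end sampling loop. The model and scheduler interfaces (mimicking a diffusers-style scheduler), the latent shape, the stage-boundary fractions, and the omission of classifier-free guidance in the single-model stages are all illustrative assumptions rather than details taken from the paper; `snf_blend` refers to the sketch given earlier in the How section.

```python
import torch

def face_diffuser_sample(tdm, sdm, scheduler, text_cond, subject_cond,
                         num_steps: int = 50,
                         scene_frac: float = 0.2, enhance_frac: float = 0.2):
    """Hypothetical three-stage sampling schedule (all fractions are assumptions)."""
    latents = torch.randn(1, 4, 64, 64)                     # assumed latent shape
    scheduler.set_timesteps(num_steps)
    scene_end = int(scene_frac * num_steps)                  # stage 1 -> stage 2
    enhance_start = int((1.0 - enhance_frac) * num_steps)    # stage 2 -> stage 3

    for i, t in enumerate(scheduler.timesteps):
        if i < scene_end:
            # Stage 1: semantic scene construction by TDM alone.
            eps = tdm(latents, t, text_cond)
        elif i < enhance_start:
            # Stage 2: subject-scene fusion via saliency-adaptive noise fusion.
            eps = snf_blend(tdm(latents, t, text_cond), tdm(latents, t, None),
                            sdm(latents, t, text_cond, subject_cond),
                            sdm(latents, t, None, None))
        else:
            # Stage 3: subject enhancement by SDM alone.
            eps = sdm(latents, t, text_cond, subject_cond)
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```

The point of the sketch is the division of labor over time: the scene model lays out global semantics early, the two models share spatial control in the middle, and the subject model refines identity details at the end.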