Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

Authors: Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang

What

This paper presents Subject-Agnostic Guidance (SAG), a method for subject-driven text-to-image synthesis. SAG addresses the tendency of models to overlook the text prompt in favor of matching the reference subject images, balancing subject fidelity with adherence to the text description.

Why

This paper is important because it tackles the problem of “content ignorance” in subject-driven text-to-image synthesis, where models often prioritize mimicking the subject image over following the text prompt. The proposed SAG method offers a simple yet effective solution to improve text alignment without sacrificing subject fidelity, thereby enhancing the quality and diversity of generated images.

How

The authors propose Subject-Agnostic Guidance (SAG), which constructs a subject-agnostic embedding from the user input and applies a dual classifier-free guidance (DCFG) technique. DCFG leverages both the subject-aware and subject-agnostic embeddings to steer generation toward outputs that respect both the subject and the text prompt. The method is validated on existing optimization-based and encoder-based synthesis approaches, as well as in second-order customization, where an encoder-based model is further fine-tuned with DreamBooth.
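To make the guidance step concrete, below is a minimal sketch of one plausible form of dual classifier-free guidance: the prediction is first pushed from the unconditional branch toward the subject-agnostic text condition, then nudged toward the subject-aware condition. The function name, the weights w_text and w_subject, and the exact combination are illustrative assumptions, not the paper's definitive formulation.

```python
import torch

def dual_cfg(
    eps_uncond: torch.Tensor,    # eps(z_t, empty prompt)
    eps_agnostic: torch.Tensor,  # eps(z_t, subject-agnostic embedding)
    eps_aware: torch.Tensor,     # eps(z_t, subject-aware embedding)
    w_text: float = 7.5,         # illustrative weight for prompt adherence
    w_subject: float = 3.0,      # illustrative weight for subject fidelity
) -> torch.Tensor:
    # Standard CFG pushes the prediction from the unconditional branch toward
    # the conditional one. This dual variant (a sketch, not the paper's exact
    # weighting) guides toward the subject-agnostic condition to enforce the
    # prompt, then toward the subject-aware condition to restore the subject.
    return (
        eps_uncond
        + w_text * (eps_agnostic - eps_uncond)
        + w_subject * (eps_aware - eps_agnostic)
    )

# Usage with dummy latents; in a real sampler these three predictions would
# come from one batched U-Net forward pass with the three text conditions.
eps = torch.randn(3, 4, 64, 64)  # [uncond, subject-agnostic, subject-aware]
guided = dual_cfg(eps[0:1], eps[1:2], eps[2:3])
print(guided.shape)  # torch.Size([1, 4, 64, 64])
```

Because the change is confined to how per-condition noise predictions are combined, it plugs into an existing sampler with only a few extra lines, which matches the paper's claim of minimal code modifications.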

Result

The paper demonstrates that SAG improves text alignment in generated images while maintaining high subject fidelity. Evaluations using CLIP and DINO scores show gains in both text and subject similarity, and user studies confirm the effect, with a majority of participants preferring SAG's outputs to those of existing methods such as DreamBooth, Textual Inversion, and ELITE.
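For reference, below is a minimal sketch of how such metrics are commonly computed: CLIP measures image-text similarity (text alignment), while DINO features measure image-image similarity against the reference subject (subject fidelity). The Hugging Face checkpoints named here are illustrative choices; the paper's exact evaluation pipeline may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTImageProcessor, ViTModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_text_score(image: Image.Image, prompt: str) -> float:
    # Cosine similarity between CLIP image and text embeddings.
    inputs = clip_proc(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

@torch.no_grad()
def dino_subject_score(generated: Image.Image, reference: Image.Image) -> float:
    # Cosine similarity between DINO CLS-token features of the two images.
    feats = []
    for im in (generated, reference):
        px = dino_proc(images=im, return_tensors="pt")
        cls = dino(**px).last_hidden_state[:, 0]  # CLS token
        feats.append(cls / cls.norm(dim=-1, keepdim=True))
    return (feats[0] @ feats[1].T).item()

# Usage with dummy images; in practice these are generated and reference photos.
gen, ref = Image.new("RGB", (224, 224)), Image.new("RGB", (224, 224))
print(clip_text_score(gen, "a dog wearing a red hat"))
print(dino_subject_score(gen, ref))
```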

Limitations & Future Work

The authors acknowledge that the quality of outputs still relies on the underlying generative model and may be suboptimal for complex or uncommon content. Future work could explore incorporating more robust synthesis networks. Additionally, they emphasize the ethical implications of such technology, particularly its potential for misuse. Future research should address these concerns by developing detection mechanisms to prevent the spread of misinformation.

Abstract

In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies.