Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
Authors: Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, Fabian Caba Heilbron
What
This paper introduces Concept Weaver, a novel method for generating images that contain multiple customized concepts by composing personalized text-to-image diffusion models at inference time, guided by a template image and a concept fusion strategy.
Why
This paper addresses the challenge of generating images with multiple personalized concepts, which is important for enabling more creative and diverse content creation using text-to-image generation models. Concept Weaver offers advantages over previous approaches by improving concept fidelity, handling more concepts, and closely following the semantics of input prompts.
How
Concept Weaver involves five steps: (1) fine-tuning a pre-trained text-to-image model for each target concept, (2) generating a non-personalized template image, (3) extracting latent representations from the template image, (4) identifying regions corresponding to target concepts in the template image, and (5) fusing the latent representations, targeted regions, and personalized models to reconstruct the template image with the desired concepts.
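To make the recipe concrete, here is a minimal orchestration sketch. Only the diffusers StableDiffusionPipeline call in step (2) is a real API; finetune_concept, invert_template, extract_concept_masks, and fuse_and_generate are hypothetical placeholders standing in for the paper's components, and the checkpoint name, concept tokens, and paths are purely illustrative.

```python
# Hypothetical sketch of the five-step Concept Weaver recipe.
# Only the diffusers call in step (2) is a real API; the other helpers
# (finetune_concept, invert_template, extract_concept_masks, fuse_and_generate)
# are placeholders for the paper's components, not existing library functions.
import torch
from diffusers import StableDiffusionPipeline

concepts = {"<dog>": "dog_photos/", "<cat>": "cat_photos/"}   # example target concepts
prompt = "a <dog> and a <cat> sitting on a beach"

# (1) Fine-tune one personalized model (full fine-tuning or LoRA) per concept.
concept_models = {tok: finetune_concept(base="stabilityai/stable-diffusion-2-1",
                                        images=imgs)
                  for tok, imgs in concepts.items()}

# (2) Generate a non-personalized template image from the plain prompt.
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1",
                                               torch_dtype=torch.float16).to("cuda")
template = pipe(prompt.replace("<dog>", "dog").replace("<cat>", "cat")).images[0]

# (3) Invert the template to recover its latent / noise trajectory.
latents = invert_template(pipe, template, prompt)

# (4) Localize each target concept in the template (e.g. via attention masks).
masks = extract_concept_masks(pipe, template, list(concepts))

# (5) Re-generate the template, injecting each personalized model inside its region.
result = fuse_and_generate(latents, masks, concept_models, prompt)
result.save("concept_weaver_output.png")
```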
Result
Concept Weaver demonstrates superior performance in generating multiple custom concepts with higher fidelity than baseline methods. It effectively handles more than two concepts, preserves the appearance of semantically related concepts without blending, and achieves high CLIP scores, indicating better text-image alignment. Furthermore, it is flexible enough to be used with both full fine-tuning and Low-Rank Adaptation (LoRA) strategies.
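Since the evaluation leans on CLIP text-image similarity, the snippet below shows how such a score is typically computed with the Hugging Face CLIP implementation. The checkpoint, image path, and prompt are placeholders, and this is a generic CLIP score rather than the paper's exact evaluation protocol.

```python
# Generic CLIP text-image similarity, an assumption about the kind of metric reported.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")          # placeholder: a generated multi-concept image
prompt = "a dog and a cat sitting on a beach"  # placeholder prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the projected image and text embeddings.
clip_score = torch.nn.functional.cosine_similarity(
    outputs.image_embeds, outputs.text_embeds).item()
print(f"CLIP score: {clip_score:.4f}")
```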
Limitations & Future Work
The paper notes that generation can fail for extremely complex or unrealistic text prompts, a limitation inherited from the pre-trained Stable Diffusion backbone. Future work could address this by using improved diffusion model backbones. Additionally, ethical concerns regarding the potential misuse of the technology for generating privacy-sensitive content are acknowledged, suggesting a need for appropriate content filtering systems.
Abstract
While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.
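As a rough illustration of the fusion idea described above (not the paper's exact procedure), the toy snippet below blends per-concept noise predictions into the template branch using soft region masks, so each concept only affects its own region while the template branch supplies the remaining structure and background. All tensors and masks are synthetic placeholders.

```python
# Toy, self-contained illustration of region-masked fusion of noise predictions.
# This is a simplification for intuition, not the paper's fusion algorithm.
import torch

def fuse_noise_predictions(eps_template, eps_concepts, masks):
    """Blend per-concept noise predictions into the template prediction
    using soft region masks in [0, 1], one mask per concept."""
    background = torch.ones_like(eps_template)
    fused = torch.zeros_like(eps_template)
    for eps_c, mask_c in zip(eps_concepts, masks):
        fused = fused + mask_c * eps_c        # concept branch inside its region
        background = background - mask_c      # remaining area keeps the template
    return fused + background.clamp(min=0.0) * eps_template

# Toy shapes: SD-style latents are (batch, 4, 64, 64).
eps_template = torch.randn(1, 4, 64, 64)
eps_concepts = [torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)]
masks = [torch.zeros(1, 1, 64, 64), torch.zeros(1, 1, 64, 64)]
masks[0][..., :, :32] = 1.0   # first concept occupies the left half
masks[1][..., :, 32:] = 1.0   # second concept occupies the right half

fused = fuse_noise_predictions(eps_template, eps_concepts, masks)
print(fused.shape)  # torch.Size([1, 4, 64, 64])
```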