An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
Authors: Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
What
This paper introduces Multi-Concept Prompt Learning (MCPL), a method for learning multiple textual embeddings (new “words”) in text-to-image diffusion models, each representing a distinct object-level concept within a single image.
Why
This paper addresses a significant limitation of existing textual-inversion techniques: they struggle to learn and compose multiple concepts from a single image, which hinders their application to complex multi-object editing and generation tasks.
How
The authors propose MCPL, which builds on Textual Inversion to jointly learn multiple embeddings by optimizing the diffusion model's denoising loss on a single image with multiple learnable prompts. To strengthen object-level concept learning, they introduce three regularization techniques: Attention Masking (AttnMask), which focuses learning on relevant image regions; Prompts Contrastive Loss (PromptCL), which separates the embeddings of different concepts; and Bind adjective (Bind adj.), which pairs learnable prompts with known adjectives to leverage pre-trained knowledge.
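For intuition, here is a minimal PyTorch-style sketch of the core objective: a masked denoising loss computed with several jointly learnable prompt embeddings. It assumes a diffusers-like UNet/scheduler API; `mcpl_denoise_loss` and its arguments are illustrative names, not the authors' actual code.

```python
import torch
import torch.nn.functional as F

def mcpl_denoise_loss(unet, scheduler, latents, prompt_embeds, attn_mask=None):
    """One masked denoising-loss step over jointly learnable prompt embeddings.

    prompt_embeds: (B, seq_len, dim) text-encoder output in which only the rows
    for the new "words" require grad; all other embeddings stay frozen.
    attn_mask: optional latent-space mask, e.g. derived from the averaged
    cross-attention maps of the learnable tokens (AttnMask).
    """
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=prompt_embeds).sample

    if attn_mask is not None:
        # AttnMask: restrict the loss to regions the learnable prompts attend
        # to, so irrelevant background does not pull the embeddings off-concept.
        pred, noise = pred * attn_mask, noise * attn_mask
    return F.mse_loss(pred, noise)
```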
Result
Experiments on natural and biomedical image datasets demonstrate that MCPL, particularly when all three proposed regularizations are combined, learns disentangled object-level embeddings, outperforming existing techniques in concept separation and in fidelity to both text prompts and image regions. The approach enables more accurate object-level synthesis, editing, and understanding of multi-object relationships.
Limitations & Future Work
The paper acknowledges limitations in estimating “ground truth” embeddings from masks and suggests exploring evaluation metrics beyond those designed for single-concept learning. Future work includes better prompt-selection strategies and extending MCPL to handle a larger number of concepts within a scene.
Abstract
Textual Inversion, a prompt learning method, learns a singular embedding for a new “word” to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new “words” are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new “words” with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts.
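For concreteness, here is one plausible reading of PromptCL as a supervised InfoNCE term over the learnable embeddings. The exact loss form, temperature, and grouping of each new “word” with its bound adjective are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(embeds: torch.Tensor,
                            concept_ids: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """embeds: (n, d) learnable prompt embeddings; concept_ids: (n,) labels.

    Assumes every concept contributes >= 2 embeddings (e.g. its new "word"
    plus a bound adjective), so each anchor has at least one positive.
    """
    z = F.normalize(embeds, dim=-1)
    sim = z @ z.T / temperature                     # (n, n) cosine similarities
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (concept_ids[:, None] == concept_ids[None, :]) & ~self_mask

    # InfoNCE: for each anchor, log-probability mass on its positives,
    # normalised over all non-self pairs, averaged over all positive pairs.
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -log_prob[pos].mean()
```

Under this reading, Bind adj. gives each concept group at least two members (the new “word” plus its adjective), so every anchor has a positive to pull toward while embeddings of different concepts are pushed apart.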