Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
Authors: Saman Motamed, Danda Pani Paudel, Luc Van Gool
What
This paper introduces Lego, a novel textual inversion method designed to disentangle and invert general concepts (adjectives and verbs) from a few example images in text-to-image diffusion models.
Why
This work addresses the limitations of existing text-to-image models in synthesizing complex concepts that go beyond object appearance. It is significant because it enables greater user control over image generation by allowing the inversion of subject-entangled concepts, such as melting or walking, which were previously challenging for traditional inversion methods.
How
Lego builds on Textual Inversion (TI) by adding two key components: 1) Subject Separation, which uses a dedicated embedding to isolate the subject's appearance from the concept, preventing appearance features from leaking into the concept embedding; and 2) Contrastive Context Guidance, which uses an InfoNCE-based loss to guide the learning of the one or more embeddings representing the concept, steering them toward synonyms and away from antonyms of descriptive words.
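To make the contrastive step concrete, below is a minimal PyTorch sketch of an InfoNCE-style context loss. The function name, tensor shapes, temperature, and the multi-positive formulation are assumptions for illustration; the paper's exact loss may differ (for example, in how synonyms and antonyms are embedded or weighted).

```python
import torch
import torch.nn.functional as F

def context_guidance_loss(concept_embs: torch.Tensor,
                          synonym_embs: torch.Tensor,
                          antonym_embs: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style guidance for learnable concept embeddings (sketch).

    concept_embs: (K, D) embeddings being optimized.
    synonym_embs: (P, D) frozen text-encoder embeddings of synonyms (positives).
    antonym_embs: (N, D) frozen text-encoder embeddings of antonyms (negatives).
    """
    c = F.normalize(concept_embs, dim=-1)
    pos = F.normalize(synonym_embs, dim=-1)
    neg = F.normalize(antonym_embs, dim=-1)

    pos_sim = c @ pos.T / tau                      # (K, P) similarity to synonyms
    neg_sim = c @ neg.T / tau                      # (K, N) similarity to antonyms
    logits = torch.cat([pos_sim, neg_sim], dim=1)  # (K, P + N)

    # For each concept embedding, maximize the probability mass that the
    # softmax over all candidates assigns to the synonym set.
    loss = -(pos_sim.logsumexp(dim=1) - logits.logsumexp(dim=1))
    return loss.mean()
```

In a TI-style training loop, a term like this would be added to the standard diffusion reconstruction loss while only the new token embeddings are optimized.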
Result
Lego outperforms existing approaches, including DreamBooth, Custom Diffusion, and plain natural-language prompts, at faithfully representing and synthesizing complex concepts. In a human evaluation, Lego-generated concepts were preferred over 70% of the time, and visual question answering with a large language model indicated that Lego-generated images better capture and convey the intended concepts.
Limitations & Future Work
The authors acknowledge limitations in inverting concepts that exceed the capabilities of the base diffusion model, such as facial expressions in earlier Stable Diffusion versions. Future work includes exploring the inversion of dynamic concepts from example videos and ensuring ethical application of personalized visual media generation.
Abstract
Diffusion models have revolutionized generative content creation, and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting more general concepts that go beyond object appearance and style (adjectives and verbs) through natural language remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods: 1) adjectives and verbs are entangled with nouns (the subject) and can hinder appearance-based inversion methods, where the subject's appearance leaks into the concept embedding; and 2) describing such concepts often extends beyond single-word embeddings (being frozen in ice, walking on a tightrope, etc.), which current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject-entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single- and multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested that Lego-generated concepts are better aligned with the text description of the concept.
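For illustration, here is a minimal sketch of how the Subject Separation step could be set up in a Textual Inversion pipeline. The token names and prompt templates are hypothetical, chosen only to show the idea that a dedicated subject token gives the appearance somewhere to go, so the concept tokens stay appearance-free.

```python
# Hypothetical tokens; only their embeddings are optimized, while the
# diffusion model and text encoder stay frozen (as in Textual Inversion).
SUBJECT = "<subject>"    # dedicated embedding that absorbs the subject's appearance
CONCEPT = "<c1> <c2>"    # one or more embeddings for the verb/adjective concept

# Training prompts for the example images: the subject token appears both on
# its own and together with the concept, so appearance is pushed into SUBJECT
# rather than leaking into CONCEPT.
appearance_prompt = f"a photo of {SUBJECT}"
concept_prompt = f"a photo of {SUBJECT} {CONCEPT}"

# At generation time, the disentangled concept transfers to new subjects:
new_scene_prompt = f"a photo of a dog {CONCEPT}"
```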