MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Authors: Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang

What

This paper introduces MoMA, an open-vocabulary, training-free personalized image generation model that produces high-fidelity images with preserved object identity while adhering to text prompts.

Why

This paper addresses the limitations of existing personalized image generation methods, which require extensive tuning, are confined to specific domains, or lack detail fidelity. MoMA offers a more efficient and versatile approach to personalization by leveraging multimodal large language models (MLLMs), making it more accessible and applicable to a wider range of image generation tasks.

How

MoMA employs a multimodal LLM adapter with fine-grained feature transfer. A generative multimodal decoder extracts image features from the reference image and modifies them according to the target text prompt, producing contextualized image embeddings. In parallel, object features are extracted from a white-background version of the reference image via the UNet's self-attention layers. Both sets of features are injected into a pre-trained UNet during image generation, as sketched below. The model is pre-trained in two stages: first, the multimodal decoder is trained to generate contextualized image embeddings; then, the decoupled attention modules in the UNet are optimized.
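
To make the injection mechanism concrete, here is a minimal PyTorch sketch of a decoupled cross-attention block in the spirit of the description above: the pre-trained text cross-attention is kept, and a parallel key/value branch attends to the MLLM's contextualized image embeddings. Class and argument names (e.g. `DecoupledCrossAttention`, `image_scale`) and the tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Sketch of decoupled cross-attention: the text branch mirrors the
    pre-trained UNet cross-attention, while a parallel key/value branch
    attends to the MLLM's contextualized image embedding. The two outputs
    are summed, with the image branch weighted by `image_scale`."""

    def __init__(self, dim, text_ctx_dim, image_ctx_dim, num_heads=8, image_scale=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.image_scale = image_scale

        # Projections matching the pre-trained text cross-attention (kept frozen in practice).
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_text = nn.Linear(text_ctx_dim, dim, bias=False)
        self.to_v_text = nn.Linear(text_ctx_dim, dim, bias=False)
        # New projections trained to consume the contextualized image embedding.
        self.to_k_image = nn.Linear(image_ctx_dim, dim, bias=False)
        self.to_v_image = nn.Linear(image_ctx_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def attend(self, q, k, v):
        b, n, _ = q.shape
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, -1)

    def forward(self, hidden_states, text_embeds, image_embeds):
        q = self.to_q(hidden_states)
        # Standard text-conditioning branch.
        text_out = self.attend(q, self.to_k_text(text_embeds), self.to_v_text(text_embeds))
        # Parallel branch conditioned on the MLLM's contextualized image embedding.
        image_out = self.attend(q, self.to_k_image(image_embeds), self.to_v_image(image_embeds))
        return self.to_out(text_out + self.image_scale * image_out)


# Minimal usage with dummy tensors (shapes are assumptions).
attn = DecoupledCrossAttention(dim=320, text_ctx_dim=768, image_ctx_dim=768)
latents = torch.randn(2, 64 * 64, 320)   # UNet spatial tokens
text = torch.randn(2, 77, 768)           # text-encoder embeddings
image = torch.randn(2, 4, 768)           # contextualized image embeddings from the MLLM decoder
print(attn(latents, text, image).shape)  # torch.Size([2, 4096, 320])
```

Keeping the text branch untouched and only adding a weighted image branch is what allows the module to act as a plug-and-play adapter on top of an existing diffusion UNet.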

Result

MoMA demonstrates superior detail accuracy and faithfulness to the target object across varied backgrounds in recontextualization tasks. For texture modification, it effectively alters the texture while preserving other visual features. Notably, it achieves this without per-instance tuning, making it efficient and readily applicable. Experiments show MoMA outperforms existing tuning-free methods both qualitatively and quantitatively in terms of detail fidelity, identity preservation, and prompt adherence. It also demonstrates generalizability by successfully integrating with various community-trained diffusion models.

Limitations & Future Work

The paper acknowledges limitations in generating images with rare subjects or those containing text, where details might be lost. Future work could explore techniques to improve the model’s ability to handle such cases. Additionally, the paper highlights the potential misuse of the model for creating deceptive content and suggests careful consideration and implementation of safeguards before widespread deployment.

Abstract

In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. Our work is open-source, thereby providing universal access to these advancements.
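
The "self-attention shortcut" mentioned in the abstract can be pictured as follows: reference-image tokens, cached from running the UNet on the white-background reference image, are appended to the keys and values of the generation branch's self-attention, so the new image can attend directly to the subject's fine-grained features. The code below is a hedged sketch under assumed names and shapes (`SelfAttentionShortcut`, `ref_features`), not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionShortcut(nn.Module):
    """Sketch of self-attention feature transfer: keys and values computed
    from reference-image tokens are concatenated with those of the image
    being generated, letting fine-grained object detail flow across."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, hidden_states, ref_features=None):
        q = self.to_q(hidden_states)
        context = hidden_states
        if ref_features is not None:
            # Shortcut: append reference tokens so generation can attend
            # directly to the subject's features from the reference image.
            context = torch.cat([hidden_states, ref_features], dim=1)
        k, v = self.to_k(context), self.to_v(context)

        b, n, _ = q.shape
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)


# Dummy pass: ref_tokens would be cached UNet self-attention inputs from the
# white-background reference image; here they are random placeholders.
attn = SelfAttentionShortcut(dim=320)
gen_tokens = torch.randn(1, 4096, 320)
ref_tokens = torch.randn(1, 4096, 320)
print(attn(gen_tokens, ref_tokens).shape)  # torch.Size([1, 4096, 320])
```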