MyVLM: Personalizing VLMs for User-Specific Queries
Authors: Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
What
This paper introduces the task of personalizing vision-language models (VLMs), enabling them to understand and reason about user-specific concepts such as unique objects and individuals, with a focus on personalized image captioning and visual question answering.
Why
Current VLMs cannot understand or reason about user-specific concepts. This paper addresses that limitation with a first method for personalization, opening up new opportunities for more meaningful, personalized human-computer interaction.
How
The authors propose MyVLM, a method that augments frozen VLMs (BLIP-2 and LLaVA) with external concept heads that recognize user-specific concepts in images. MyVLM then learns a concept embedding in the VLM's intermediate feature space that guides the frozen language model to incorporate the concept into its generated responses, requiring only a few training images per concept.
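To make the first stage concrete, below is a minimal sketch of a concept head as a linear probe over frozen CLIP image embeddings, trained on a handful of positive images of the concept and generic negatives. The class name `ConceptHead`, the 768-dimensional feature size, and the training loop are illustrative assumptions, not the paper's exact implementation (which uses a linear classifier over CLIP features for objects and a face-recognition model for people).

```python
import torch
import torch.nn as nn


class ConceptHead(nn.Module):
    """Binary classifier flagging whether a user-specific concept appears in an image."""

    def __init__(self, feat_dim: int = 768):  # assumed CLIP ViT-L/14 embedding size
        super().__init__()
        self.probe = nn.Linear(feat_dim, 1)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (B, feat_dim) frozen CLIP image embeddings
        return torch.sigmoid(self.probe(image_features)).squeeze(-1)


def train_concept_head(head, pos_feats, neg_feats, steps: int = 200, lr: float = 1e-3):
    """Fit the probe on a few positive examples of the concept plus generic negatives."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    feats = torch.cat([pos_feats, neg_feats])
    labels = torch.cat([torch.ones(len(pos_feats)), torch.zeros(len(neg_feats))])
    loss_fn = nn.BCELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(head(feats), labels).backward()
        opt.step()
    return head
```

At inference time, the head acts as a toggle: only when it fires on an image is the learned concept embedding injected into the VLM's features.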
Result
MyVLM successfully generates personalized captions and answers questions about user-specific objects and individuals in new images, generalizing to unseen contexts. It outperforms several handcrafted baselines, showing improved recall and text similarity, even with few training samples.
Limitations & Future Work
Limitations include biases inherent in VLMs, reliance on concept head quality, and potential context leakage during training. Future work includes mitigating these limitations, exploring additional regularization and augmentation techniques, and expanding to new personalized applications.
Abstract
Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.
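As an illustration of the second stage described above, here is a rough sketch of how a single learnable concept embedding could be appended to the visual tokens feeding a frozen language model and optimized with a captioning loss. The calls `encode_image` and `language_model_loss`, the single-token setup, and the module names are simplifying assumptions for illustration, not the exact MyVLM training code.

```python
import torch
import torch.nn as nn


class ConceptEmbedding(nn.Module):
    """A single learnable token appended to the visual features of a frozen VLM."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, N, hidden_dim) projected visual features of the frozen VLM
        batch = vision_tokens.size(0)
        return torch.cat([vision_tokens, self.token.expand(batch, -1, -1)], dim=1)


def personalization_step(frozen_vlm, concept_emb, images, target_caption_ids, optimizer):
    """One optimization step: only the concept embedding receives gradients."""
    with torch.no_grad():
        vision_tokens = frozen_vlm.encode_image(images)  # hypothetical frozen encoder call
    inputs = concept_emb(vision_tokens)
    # Cross-entropy loss of the frozen language model on target captions that
    # mention the concept identifier (e.g., "<my-mug> sitting on a desk").
    loss = frozen_vlm.language_model_loss(inputs, target_caption_ids)  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a setup the optimizer would be constructed over `concept_emb.parameters()` only, so the VLM and its language model stay entirely frozen while the embedding learns to steer the generated text toward the concept.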