Finding Visual Task Vectors
Authors: Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar
What
This paper investigates the existence and identification of “task vectors” in visual prompting models, focusing specifically on MAE-VQGAN. The authors propose a method to identify these task-specific activations and demonstrate that patching them into the model enables zero-shot task performance comparable to, or exceeding, that of the original one-shot in-context learning.
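A minimal sketch of the general idea of activation patching with PyTorch forward hooks is shown below. The model handle, layer names, and the `task_vectors` mapping are hypothetical placeholders for illustration, not the authors' released code, and the sketch assumes each patched layer returns a plain (batch, tokens, dim) tensor.

```python
import torch

# Hypothetical: positions_to_vectors maps a token position -> a precomputed
# mean activation (the "task vector") to inject at that position.
def make_patch_hook(positions_to_vectors):
    def hook(module, inputs, output):
        # output: (batch, tokens, dim) hidden states of this layer
        patched = output.clone()
        for pos, vec in positions_to_vectors.items():
            patched[:, pos, :] = vec.to(output.device, output.dtype)
        return patched  # returning a value overrides the layer's output
    return hook

def patch_task_vectors(model, task_vectors):
    """Register hooks that overwrite selected activations with task vectors.

    task_vectors: dict mapping layer name -> {token_position: tensor(dim)}.
    Returns the hook handles so they can be removed after inference.
    """
    named = dict(model.named_modules())
    return [
        named[layer_name].register_forward_hook(make_patch_hook(positions))
        for layer_name, positions in task_vectors.items()
    ]

# Usage sketch: run the model zero-shot (query image only, no input-output
# example) with e.g. segmentation task vectors patched in, then clean up.
# handles = patch_task_vectors(mae_vqgan, segmentation_task_vectors)
# prediction = mae_vqgan(query_tokens)
# for h in handles:
#     h.remove()
```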
Why
This paper is significant because it sheds light on the inner workings of visual in-context learning, an area that is newer and less well understood than its NLP counterpart. Identifying and leveraging task vectors could lead to more efficient and adaptable visual prompting models, reducing the reliance on extensive in-context examples.
How
The authors first analyze MAE-VQGAN activations to identify potential task vectors by measuring their variance across different tasks and their invariance within a task. They then use the REINFORCE algorithm to search for the subset of activation positions that, when patched with the corresponding task vectors, minimizes the task-specific loss. They evaluate their method on various image-to-image tasks using the Pascal-5i dataset.
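A rough sketch of how one might score a candidate activation location by this variance/invariance criterion is given below. It assumes activations have already been collected per task as tensors of shape (num_prompts, dim); the function name and the exact scoring ratio are illustrative, not the paper's exact formulation.

```python
import torch

def task_vector_score(acts_by_task):
    """Score one (layer, position) location as a candidate task-vector site.

    acts_by_task: dict mapping task name -> tensor of shape (num_prompts, dim),
    the activations collected at this location over many prompts of that task.
    A good candidate varies little within a task but a lot across tasks.
    """
    per_task_means, within_var = [], []
    for acts in acts_by_task.values():
        per_task_means.append(acts.mean(dim=0))
        within_var.append(acts.var(dim=0).mean())  # spread within the task
    means = torch.stack(per_task_means)            # (num_tasks, dim)
    across_var = means.var(dim=0).mean()           # spread across task means
    within = torch.stack(within_var).mean()
    return across_var / (within + 1e-8)            # high = task-specific

# The per-task mean activation at a high-scoring location is the candidate
# task vector that can later be patched into the model at inference time.
```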
Result
The paper shows that task vectors do exist in visual prompting models and can be effectively identified. Patching the identified task vectors allows MAE-VQGAN to perform tasks in a zero-shot manner, achieving performance comparable or even superior to the original one-shot prompting on tasks such as foreground segmentation, low-light enhancement, in-painting, and colorization. The results also suggest that task vectors are distributed throughout the encoder and decoder of the network.
Limitations & Future Work
The authors acknowledge that they did not explore other potential vector types, such as those encoding image structure and positional information. They also note that directly evaluating the model in the VQGAN token space could yield more accurate results. Future work could investigate these aspects further, as well as the generalization of task vectors across different datasets and models.
Abstract
Visual Prompting is a technique for teaching models to perform a visual task via in-context examples, without any additional training. In this work, we analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find task vectors, activations that encode task-specific information. Equipped with this insight, we demonstrate that it is possible to identify the task vectors and use them to guide the network towards performing different tasks without providing any input-output examples. To find task vectors, we compute the average intermediate activations per task and use the REINFORCE algorithm to search for the subset of task vectors. The resulting task vectors guide the model towards performing a task better than the original model without the need for input-output examples.
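To make the REINFORCE step concrete, here is a compact sketch of a REINFORCE-style search over a binary mask of candidate patch locations. The `eval_loss` callback, sample counts, and hyperparameters are placeholders rather than the paper's exact settings; the black-box loss would run the patched model on a small held-out set.

```python
import torch

def reinforce_search(num_locations, eval_loss, steps=200, lr=0.1, samples=8):
    """Search for a binary mask over candidate patch locations with REINFORCE.

    eval_loss(mask): evaluates the model patched at the masked locations and
    returns the task loss (lower is better). It is treated as a black box, so
    gradients flow only through the log-probability of the sampled mask.
    """
    logits = torch.zeros(num_locations, requires_grad=True)  # Bernoulli logits
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Bernoulli(logits=logits)
        masks = dist.sample((samples,))                       # (samples, num_locations)
        losses = torch.tensor([float(eval_loss(m)) for m in masks])
        baseline = losses.mean()                              # variance-reduction baseline
        # REINFORCE: increase the probability of masks with below-average loss.
        log_probs = dist.log_prob(masks).sum(dim=1)
        surrogate = ((losses - baseline) * log_probs).mean()
        opt.zero_grad()
        surrogate.backward()
        opt.step()
    return (torch.sigmoid(logits) > 0.5).float()              # final set of locations
```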