Instruct Me More! Random Prompting for Visual In-Context Learning
Authors: Jiahao Zhang, Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara
What
This paper introduces Instruct Me More (InMeMo), a novel visual in-context learning method that enhances the performance of large-scale vision models by adding a learnable perturbation to in-context image pairs, thereby improving their instructive quality for downstream tasks like segmentation and object detection.
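To make the idea concrete, below is a minimal sketch of one plausible form of such a learnable perturbation: a padding-style pixel prompt whose learned border is added to each in-context image while the interior is left untouched. The module name, padding width, and image size are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PromptEnhancer(nn.Module):
    """Padding-style learnable pixel prompt (illustrative sketch).

    A learned border of width `pad` is added to an in-context image;
    interior pixels receive no perturbation.
    """
    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.pad = pad
        # One learnable strip per border (top / bottom / left / right).
        self.top = nn.Parameter(torch.zeros(3, pad, image_size))
        self.bottom = nn.Parameter(torch.zeros(3, pad, image_size))
        self.left = nn.Parameter(torch.zeros(3, image_size - 2 * pad, pad))
        self.right = nn.Parameter(torch.zeros(3, image_size - 2 * pad, pad))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) in-context image (or its label rendered as an image)
        prompt = torch.zeros_like(x[0])
        prompt[:, :self.pad, :] = self.top
        prompt[:, -self.pad:, :] = self.bottom
        prompt[:, self.pad:-self.pad, :self.pad] = self.left
        prompt[:, self.pad:-self.pad, -self.pad:] = self.right
        return x + prompt.unsqueeze(0)
```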
Why
This paper is important because it addresses a key limitation of existing visual in-context learning approaches: their performance depends heavily on the quality of the in-context pairs and on how similar they are to the query image. By introducing a learnable prompt, InMeMo improves visual in-context learning in a lightweight and efficient manner, achieving state-of-the-art results on benchmark tasks.
How
InMeMo first retrieves an in-context image pair similar to the query image. It then amends the pair with a learnable prompt enhancer module, which is trained to optimize the in-context pair for the specific downstream task. The enhanced pair, together with the query image, is then fed into a frozen pre-trained large-scale vision model (MAE-VQGAN) to generate the prediction for the task. The prompt enhancer is trained in a supervised manner with a cross-entropy loss over visual tokens, minimizing the difference between the predicted and ground-truth labels.
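The following is a minimal training-loop sketch of this pipeline, under several assumptions: (i) the in-context pair and the query are stitched into a 2x2 grid canvas whose bottom-right cell is left blank for the model to fill, as in visual prompting via image inpainting; (ii) a frozen MAE-VQGAN-style model returns logits over a visual-token vocabulary for that masked cell; and (iii) helpers exist for pair retrieval and for encoding the ground-truth label image into VQGAN token indices. The interfaces `frozen_model`, `encode_to_tokens`, and `retrieve_pair` are hypothetical placeholders, not the authors' exact API; only the prompt enhancer's parameters are updated.

```python
import torch
import torch.nn.functional as F

def make_canvas(ctx_img, ctx_lbl, qry_img):
    """Stitch the (enhanced) in-context pair and the query into a 2x2 grid;
    the bottom-right cell is left blank for the frozen model to inpaint."""
    blank = torch.zeros_like(qry_img)
    top = torch.cat([ctx_img, ctx_lbl], dim=-1)   # context image | context label
    bottom = torch.cat([qry_img, blank], dim=-1)  # query image   | (to predict)
    return torch.cat([top, bottom], dim=-2)

def train_step(batch, enhancer, frozen_model, encode_to_tokens,
               retrieve_pair, optimizer):
    """One optimization step for the prompt enhancer.

    Assumed, user-supplied components:
      frozen_model(canvas)    -> (B, N, vocab) logits over visual tokens
      encode_to_tokens(label) -> (B, N) ground-truth VQGAN codebook indices
      retrieve_pair(query)    -> a visually similar (image, label) pair
    """
    qry_img, qry_lbl = batch
    ctx_img, ctx_lbl = retrieve_pair(qry_img)

    # Perturb only the in-context pair; the query image stays untouched.
    canvas = make_canvas(enhancer(ctx_img), enhancer(ctx_lbl), qry_img)

    with torch.no_grad():
        target = encode_to_tokens(qry_lbl)        # (B, N) token indices
    logits = frozen_model(canvas)                 # weights frozen, activations kept

    loss = F.cross_entropy(logits.flatten(0, 1), target.flatten())
    optimizer.zero_grad()
    loss.backward()                               # gradients reach the enhancer only
    optimizer.step()
    return loss.item()
```

Because the large vision model stays frozen, only the small prompt enhancer is trained, which is what makes the approach lightweight.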
Result
InMeMo achieves state-of-the-art results on foreground segmentation and single object detection tasks, surpassing previous visual in-context learning methods. It demonstrates robustness to domain shift and significant performance improvement even with limited training data. The paper provides extensive qualitative and quantitative results, demonstrating the efficacy of InMeMo in capturing fine-grained details and handling variations in image characteristics.
Limitations & Future Work
The paper acknowledges that InMeMo requires a minimum amount of training data per class to outperform the baseline. Additionally, the learnable prompt’s generalizability to unseen classes is limited, necessitating task-specific training. Future work could focus on improving the generalizability of the learnable prompt and exploring its application in other downstream tasks.
Abstract
Large-scale models trained on extensive datasets have emerged as the preferred approach due to their high generalizability across various tasks. In-context learning (ICL), a popular strategy in natural language processing, uses such models for different tasks by providing instructive prompts but without updating model parameters. This idea is now being explored in computer vision, where an input-output image pair (called an in-context pair) is supplied to the model with a query image as a prompt to exemplify the desired output. The efficacy of visual ICL often depends on the quality of the prompts. We thus introduce a method coined Instruct Me More (InMeMo), which augments in-context pairs with a learnable perturbation (prompt), to explore its potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance. Specifically, compared to the baseline without a learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for foreground segmentation and single object detection tasks, respectively. Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training. Code is available at https://github.com/Jackieam/InMeMo.