V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Authors: Penghao Wu, Saining Xie

What

This paper introduces V*, an LLM-guided visual search mechanism, together with SEAL (Show, sEArch, and TelL), a meta-architecture that integrates this mechanism into Multimodal Large Language Models (MLLMs) to strengthen their visual grounding, especially on high-resolution images where fine details are crucial.

Why

This paper addresses the limitations of current MLLMs in handling high-resolution images: they rely on pre-trained vision encoders with limited input resolution and have no way to actively search for visual information missing from their initial glance. This matters because a more human-like, search-driven approach to visual processing is what allows MLLMs to handle complex, visually crowded real-world scenarios.

How

The authors propose SEAL, a meta-architecture consisting of a VQA LLM and a visual search model. The VQA LLM first tries to answer from the global image and, when it cannot, identifies which visual targets are missing; the visual search model, guided by the LLM's world knowledge about where such targets are likely to appear, locates them in the high-resolution image and adds the resulting crops to a Visual Working Memory (VWM), which the VQA LLM then conditions on to produce a more informed answer (see the sketch below). They also introduce V*Bench, a benchmark for evaluating MLLMs on detailed visual grounding in high-resolution images.
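To make the flow concrete, here is a minimal Python sketch of the answering loop described above. This is an illustration under stated assumptions, not the authors' implementation: `vqa_llm`, `visual_search`, and the `VisualWorkingMemory` fields are hypothetical stand-ins for the paper's components.

```python
# Illustrative sketch of the SEAL answer loop (hypothetical interfaces,
# not the authors' actual code).
from dataclasses import dataclass, field


@dataclass
class VisualWorkingMemory:
    """Holds the question, the global image, and crops found by visual search."""
    question: str
    image: object                                  # full-resolution input image
    targets: dict = field(default_factory=dict)    # target name -> (crop, box)

    def add(self, name, crop, box):
        self.targets[name] = (crop, box)


def seal_answer(vqa_llm, visual_search, image, question):
    """One pass of the SEAL loop: identify missing targets, search, answer."""
    # 1. The VQA LLM lists visual targets it cannot confidently ground
    #    in the (possibly downsampled) global view of the image.
    missing = vqa_llm.identify_missing_targets(image, question)

    vwm = VisualWorkingMemory(question=question, image=image)

    # 2. The LLM-guided visual search localizes each missing target in the
    #    full-resolution image; found crops are stored in the working memory.
    for target_name in missing:
        crop, box = visual_search(image, target_name)
        if crop is not None:
            vwm.add(target_name, crop, box)

    # 3. The VQA LLM answers the question conditioned on the working memory.
    return vqa_llm.answer(vwm)
```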

Result

The SEAL framework significantly outperforms existing open-source and commercial MLLMs on V*Bench, demonstrating the effectiveness of incorporating a visual search mechanism. Ablation studies further validate the LLM-guided search strategy over simple detection-based approaches. Additionally, an analysis on the COCO-Search18 dataset shows that the LLM-guided visual search achieves efficiency comparable to human eye fixations during visual search tasks.

LF

The authors acknowledge that their visual search model is currently designed for natural images and common objects, requiring further adaptation for handling documents, diagrams, videos, or open-world scenarios. They suggest exploring architectural improvements like incorporating convolution-based models for more efficient processing of high-resolution images.

Abstract

When we look around and perform complex tasks, how we see and selectively process what we see is crucial. However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements. This integration results in a new MLLM meta-architecture, named Show, sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available at https://github.com/penghao-wu/vstar.
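As a rough illustration of the guided-search idea in the abstract, the sketch below shows one way an LLM-guided hierarchical search could work: try a direct detection, and if confidence is low, let the LLM propose a likely sub-region to zoom into and recurse. The `detector.locate` and `llm.propose_region` interfaces and the confidence cutoff are assumptions for illustration, not the paper's exact algorithm.

```python
# Hedged sketch of a recursive, LLM-guided visual search in the spirit of V*
# (hypothetical interfaces, not the authors' code).
def guided_search(image, target_name, llm, detector, depth=0, max_depth=4):
    """Recursively search `image` for `target_name`, guided by the LLM."""
    box, score = detector.locate(image, target_name)
    if score >= 0.5 or depth >= max_depth:        # assumed confidence cutoff
        return box, score

    # The LLM suggests a likely sub-region (x0, y0, x1, y1) in the current
    # image's coordinates, using world knowledge about where the target
    # tends to appear (e.g. a mug is likely on the desk, not the ceiling).
    region = llm.propose_region(image, target_name)
    x0, y0, _, _ = region
    sub_box, sub_score = guided_search(
        image.crop(region), target_name, llm, detector, depth + 1, max_depth
    )
    if sub_box is None:
        return box, score

    # Map the box found inside the crop back to original image coordinates.
    bx0, by0, bx1, by1 = sub_box
    return (bx0 + x0, by0 + y0, bx1 + x0, by1 + y0), sub_score
```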