Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Authors: Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
What
This paper explores the limitations of visual capabilities in Multimodal Large Language Models (MLLMs) that stem from the visual encoder, particularly CLIP, and proposes a Mixture of Features (MoF) approach to enhance visual grounding by integrating features from CLIP and vision-only self-supervised learning models.
Why
This paper matters because it exposes a crucial weakness in the visual grounding of current state-of-the-art MLLMs, one that persists despite their impressive language capabilities, and it proposes a potential remedy for more robust and reliable multimodal performance.
How
The authors first identify “CLIP-blind pairs” - images that CLIP embeds as nearly identical despite clear visual differences - and use them to construct the Multimodal Visual Patterns (MMVP) benchmark for evaluating MLLMs’ visual grounding. They then categorize the systematic visual patterns on which CLIP fails and propose MoF, experimenting with Additive MoF (linearly mixing CLIP and vision-only SSL features) and Interleaved MoF (spatially interleaving the two sets of visual tokens) to strengthen visual grounding in MLLMs.
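To make the pair-mining step concrete, here is a minimal sketch of how CLIP-blind pairs could be collected from precomputed, unit-normalized CLIP and vision-only SSL (e.g. DINOv2) embeddings. The similarity thresholds and the brute-force pairwise loop are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of CLIP-blind pair mining, assuming precomputed unit-normalized
# embeddings. The 0.95 / 0.6 thresholds are illustrative, not the paper's values.
import torch

def find_clip_blind_pairs(clip_emb: torch.Tensor,
                          dino_emb: torch.Tensor,
                          clip_thresh: float = 0.95,
                          dino_thresh: float = 0.6):
    """Return index pairs (i, j) that CLIP sees as near-identical but a
    vision-only SSL encoder (e.g. DINOv2) clearly separates."""
    clip_sim = clip_emb @ clip_emb.T   # cosine similarity (embeddings are unit norm)
    dino_sim = dino_emb @ dino_emb.T
    pairs = []
    n = clip_emb.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```

Each surviving pair is a candidate for a benchmark question that probes a visual difference CLIP cannot represent.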
Result
Key findings include: (1) Even the most advanced MLLMs struggle with seemingly simple visual questions in the MMVP benchmark. (2) Scaling up CLIP’s training data and model size alone does not resolve the challenges posed by certain visual patterns. (3) There is a strong correlation between CLIP’s failure patterns and MLLMs’ visual shortcomings. (4) Integrating vision-only SSL features via MoF, particularly Interleaved MoF, significantly improves MLLMs’ visual grounding without compromising instruction-following abilities.
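The contrast between the two MoF variants is easier to picture with a small sketch of the fusion operators. The tensor shapes, the strict 1:1 interleaving pattern, and the default mixing coefficient are assumptions for illustration; the sketch presumes both encoders have already been projected into the LLM embedding space by their own adapters.

```python
# Minimal sketch of the two MoF variants, assuming equal-length token grids from
# a CLIP encoder and a vision-only SSL encoder (e.g. DINOv2), each already mapped
# into the LLM embedding space. alpha and the 1:1 interleaving are assumptions.
import torch

def additive_mof(clip_tokens: torch.Tensor,
                 ssl_tokens: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Linearly mix the two token streams: [B, N, D] -> [B, N, D]."""
    return alpha * clip_tokens + (1.0 - alpha) * ssl_tokens

def interleaved_mof(clip_tokens: torch.Tensor,
                    ssl_tokens: torch.Tensor) -> torch.Tensor:
    """Spatially interleave tokens while preserving order: [B, N, D] -> [B, 2N, D]."""
    B, N, D = clip_tokens.shape
    mixed = torch.empty(B, 2 * N, D, dtype=clip_tokens.dtype, device=clip_tokens.device)
    mixed[:, 0::2] = clip_tokens   # even positions carry CLIP tokens
    mixed[:, 1::2] = ssl_tokens    # odd positions carry SSL tokens
    return mixed
```

Additive mixing keeps the visual token count fixed but blends the two representations, whereas interleaving preserves both feature streams at the cost of doubling the sequence length fed to the LLM.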
Limitations and Future Work
The authors acknowledge that MoF is an initial step and more sophisticated approaches are needed to fully address the visual limitations. Future work includes exploring advanced fusion techniques beyond linear and spatial mixing, designing more comprehensive benchmarks to evaluate diverse visual patterns and grounding abilities, and investigating new visual representation learning algorithms that better capture fine-grained visual details and relationships.
Abstract
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify “CLIP-blind pairs” - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.