Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Authors: Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang
What
This paper surveys the current state of multimodal reasoning in Multimodal Large Language Models (MLLMs), exploring their architectures, training methods, and performance on various reasoning tasks.
Why
This paper is important because it provides a comprehensive overview of the rapidly developing field of MLLMs, focusing specifically on their reasoning abilities, which are crucial for progress toward artificial general intelligence.
How
The authors reviewed existing literature on MLLMs, analyzed their architectures, training datasets, and performance on various reasoning benchmarks, and categorized the applications of these models.
Result
The paper highlights that while MLLMs have shown impressive capabilities across multimodal tasks, the reasoning abilities of open-source MLLMs still lag behind those of proprietary models such as GPT-4V. The authors identify key factors behind the stronger-performing MLLMs, including unfreezing the language model during training, improving visual representations, and employing multi-task supervised learning.
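To make the "unfreezing the language model" factor concrete, here is a minimal sketch of the training setup it implies, assuming a PyTorch-style MLLM with a vision encoder, a projector that maps visual features into the LLM embedding space, and an LLM backbone. The module and function names are illustrative, not from the survey; the point is simply which parameter groups receive gradients.

```python
import torch

def configure_trainable_params(vision_encoder, projector, language_model,
                               unfreeze_llm=True):
    """Freeze the vision encoder; optionally unfreeze the LLM backbone.

    Hypothetical sketch: stronger open-source MLLMs reportedly train the
    projector *and* the LLM, rather than tuning the projector alone.
    """
    for p in vision_encoder.parameters():
        p.requires_grad = False          # visual backbone stays frozen
    for p in projector.parameters():
        p.requires_grad = True           # alignment module is always trained
    for p in language_model.parameters():
        p.requires_grad = unfreeze_llm   # key design choice: train vs. freeze the LLM

    trainable = [p for m in (vision_encoder, projector, language_model)
                 for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-5)
```

With `unfreeze_llm=False` only the projector is updated (the cheaper, weaker setting); with `unfreeze_llm=True` the optimizer also adapts the language model to multimodal supervision.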
Limitations & Future Work
The paper points out limitations of current MLLMs in architecture design, training efficiency, long-context support, instruction fine-tuning data, and evaluation benchmarks, and suggests corresponding future directions: more robust architectures, more efficient training methods, mechanisms for handling long contexts, higher-quality instruction datasets, and more comprehensive evaluation benchmarks.
Abstract
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning.