A Survey on Vision Mamba: Models, Applications and Challenges

Authors: Rui Xu, Shu Yang, Yihui Wang, Bo Du, Hao Chen

What

This paper presents a comprehensive survey of Vision Mamba, a recent and efficient neural network architecture for visual tasks. It explains the underlying principles, reviews applications across diverse visual domains, and outlines future research directions.

Why

This survey is timely because Vision Mamba is advancing rapidly and gaining influence in computer vision. It gives researchers a single resource for understanding the core concepts, exploring the applications, and contributing to the architecture's ongoing development.

How

The authors provide a structured analysis of Vision Mamba: they first introduce its foundational principles, then examine representative backbone networks in depth, and categorize applications by visual modality, including image, video, multi-modal data, and point clouds. The paper concludes by critically analyzing the challenges and outlining future research directions.

Result

The paper highlights Vision Mamba’s effectiveness in various computer vision tasks, including classification, segmentation, generation, and restoration, across domains such as medical imaging, remote sensing, and video understanding. It also provides insights into how different visual Mamba models address the unique characteristics of visual data, and compares their performance with traditional convolutional neural networks and Transformers.

Limitations and Future Directions

The paper identifies key limitations of Vision Mamba: stability issues when scaling to large datasets; the mismatch between causal scanning mechanisms and non-causal visual data; loss of spatial information when 2D features are flattened into 1D scans; information redundancy and extra computation introduced by multi-directional scanning (see the sketch below); and the need for better interpretability, generalization, and robustness. Promising directions include more efficient scanning techniques and fusion methods, improved computational efficiency, and applications in data-efficient learning, high-resolution data analysis, multi-modal learning, and in-context learning.
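To make the scanning trade-off concrete, here is a minimal Python sketch of multi-directional (cross-scan style) flattening of a 2D feature map into 1D token sequences. The function name, shapes, and the choice of four directions are illustrative assumptions, not code from any model surveyed in the paper.

```python
import numpy as np

def cross_scan(feature_map: np.ndarray) -> list[np.ndarray]:
    """Flatten an (H, W, C) feature map into four 1D token sequences,
    one per scan direction, in the spirit of cross-scan style visual
    Mamba blocks. Illustrative sketch only, not any specific model's
    implementation."""
    H, W, C = feature_map.shape
    rows = feature_map.reshape(H * W, C)                     # left-to-right, top-to-bottom
    cols = feature_map.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom within each
                                                             # column, left-to-right across
    return [rows, rows[::-1], cols, cols[::-1]]              # plus both reversed orders

# Each sequence is fed through a causal SSM and the outputs are merged
# (e.g., summed after undoing the reordering). Four scans give every
# token bidirectional 2D context, but multiply the sequential compute
# by four -- the redundancy and cost noted above.
```

For a 4x4 map, each sequence holds 16 tokens, so one SSM pass per direction keeps the cost linear in H*W, scaled by the number of scan directions.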

Abstract

Mamba, a recent selective structured state space model, performs excellently on long sequence modeling tasks. Mamba mitigates the modeling constraints of convolutional neural networks and offers advanced modeling capabilities similar to those of Transformers, through global receptive fields and dynamic weighting. Crucially, it achieves this without incurring the quadratic computational complexity typically associated with Transformers. Due to its advantages over these two mainstream foundation models, Mamba exhibits great potential as a visual foundation model. Researchers are actively applying Mamba to various computer vision tasks, leading to numerous emerging works. To help keep pace with the rapid advancements in computer vision, this paper aims to provide a comprehensive review of visual Mamba approaches. This paper begins by delineating the formulation of the original Mamba model. Subsequently, our review of visual Mamba delves into several representative backbone networks to elucidate the core insights of visual Mamba. We then categorize related works by modality, including image, video, point cloud, multi-modal, and others. Specifically, for image applications, we further organize them into distinct tasks to facilitate a more structured discussion. Finally, we discuss the challenges and future research directions for visual Mamba, providing insights for future research in this quickly evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models.
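As background for the formulation the survey delineates, the state space recurrence behind Mamba (with the commonly cited zero-order-hold discretization) can be summarized as follows; this is the standard form from the state space model literature, not text quoted from the survey itself.

```latex
% Continuous-time state space model (SSM) underlying Mamba:
%   h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t)
% Zero-order-hold discretization with step size \Delta:
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B
% Discrete recurrence applied over the token sequence:
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
% Selectivity: Mamba makes \Delta, B, and C functions of the input x_t,
% yielding the input-dependent (dynamic) weighting mentioned above at
% linear cost in sequence length, versus the quadratic cost of
% self-attention.
```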