Sequential Modeling Enables Scalable Learning for Large Vision Models
Authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A. Efros
What
This paper introduces a Large Vision Model (LVM) trained purely on visual data, with no linguistic supervision, by formatting images, videos, and annotated data as “visual sentences.” Images are tokenized into discrete tokens with a VQGAN, and a causal transformer is trained to predict the next token; at test time, a wide range of vision tasks can be specified through visual prompting.
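A rough sketch of the tokenization step may help make this concrete: a VQGAN-style encoder downsamples an image into a spatial grid of latents, and each latent is snapped to its nearest codebook entry, yielding a short sequence of discrete token ids. The toy encoder, codebook size, and shapes below are illustrative assumptions (randomly initialized stand-ins), not the pretrained tokenizer used in the paper.

```python
# Minimal sketch of VQGAN-style image tokenization (illustrative only):
# a small CNN downsamples a 256x256 image by 16x to a 16x16 grid of latents,
# and each latent is mapped to the id of its nearest codebook vector.
import torch
import torch.nn as nn

class ToyVQTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, embed_dim=256):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (64, 128, 256, embed_dim):    # four stride-2 convs: /16
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1), nn.ReLU()]
            ch = out_ch
        self.encoder = nn.Sequential(*layers)
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    @torch.no_grad()
    def tokenize(self, images):                     # images: (B, 3, 256, 256)
        z = self.encoder(images)                    # (B, D, 16, 16)
        z = z.permute(0, 2, 3, 1).flatten(1, 2)     # (B, 256, D)
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, 256, K)
        return dists.argmin(dim=-1)                 # (B, 256) discrete token ids

tokens = ToyVQTokenizer().tokenize(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 256]) -- one image becomes 256 tokens
```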
Why
This paper is significant as it explores the potential of building large vision models analogous to large language models, demonstrating that visual understanding can be achieved without relying on language data. It pushes the boundaries of self-supervised learning in vision and paves the way for more general and scalable visual models capable of handling diverse tasks through in-context learning.
How
The authors curated UVDv1, a large and diverse visual dataset of 1.64 billion images spanning single images, image sequences, annotated images, annotated image sequences, and 3D synthetic objects. They introduced “visual sentences” to unify these formats: each sentence is a sequence of images that a VQGAN tokenizer maps to a stream of discrete visual tokens. A causal transformer was then trained to predict the next token in the stream, enabling in-context learning for downstream tasks through visual prompting at test time.
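As a concrete (and heavily simplified) picture of the training objective, the sketch below assembles a visual sentence from per-image token grids and trains a small decoder-only transformer with next-token cross-entropy. The tiny architecture, random tokens, and omitted positional embeddings are assumptions made for brevity; the actual LVM is a far larger autoregressive transformer trained on billions of real image tokens.

```python
# Minimal sketch of next-token training on a "visual sentence" (illustrative):
# the sentence is just the concatenation of each image's 256 token ids, and the
# model is trained to predict token t+1 from tokens 1..t with cross-entropy.
import torch
import torch.nn as nn

VOCAB, TOKENS_PER_IMAGE, D_MODEL = 8192, 256, 512

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)   # positional embeddings omitted
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids):                         # ids: (B, T)
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=causal)
        return self.head(h)                         # (B, T, VOCAB) logits

# A visual sentence: e.g. 4 video frames, each tokenized to 256 ids, concatenated.
sentence = torch.randint(0, VOCAB, (2, 4 * TOKENS_PER_IMAGE))
model = TinyCausalLM()
logits = model(sentence[:, :-1])                    # predict each next token
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                    sentence[:, 1:].reshape(-1))
loss.backward()
```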
Result
The paper demonstrates that the LVM exhibits strong scaling behavior, with larger models and more data leading to better performance on various vision tasks such as semantic segmentation, depth estimation, and keypoint detection, even outperforming some task-specific models on unseen datasets. The model also showcases an ability to generalize to novel tasks, handle out-of-distribution data, and perform basic visual reasoning, suggesting potential for more advanced visual understanding.
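In-context task specification via visual prompting then reduces to ordinary autoregressive generation: tokenized example (input, output) image pairs are concatenated with a tokenized query image, the model generates one more image's worth of tokens, and those tokens are decoded back to pixels. The helper below is a hypothetical illustration built on the toy pieces above (greedy decoding, placeholder `model`), not the released inference code.

```python
# Hedged sketch of visual prompting: prompt = [in_1, out_1, in_2, out_2, ..., query],
# then generate the 256 tokens that should form the query's "output" image.
import torch

TOKENS_PER_IMAGE = 256

@torch.no_grad()
def visual_prompt(model, example_pairs, query_tokens):
    """example_pairs: list of (input_tokens, output_tokens); each is (256,) ids."""
    prompt = torch.cat([torch.cat(pair) for pair in example_pairs] + [query_tokens])
    ids = prompt.unsqueeze(0)                       # (1, T)
    for _ in range(TOKENS_PER_IMAGE):               # generate one image of tokens
        next_id = model(ids)[:, -1].argmax(-1, keepdim=True)  # greedy next token
        ids = torch.cat([ids, next_id], dim=1)
    return ids[0, -TOKENS_PER_IMAGE:]               # predicted output-image tokens
```

Prompting with, say, a few (frame, segmentation map) pairs followed by a new frame asks the model to continue the pattern with that frame's segmentation, which a VQGAN decoder would then render back to pixels.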
Limitations and Future Work
The authors acknowledge limitations including computational constraints, visual prompting being more under-constrained than language prompting, limitations of the VQGAN tokenizer, and the LVM's relatively small size compared to current LLMs. Future work includes scaling the model up further and probing emergent capabilities in visual reasoning and generalization.
Abstract
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, “visual sentences”, in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.