CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Authors: Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari
What
This paper introduces CatLIP, a weakly supervised approach that pre-trains vision models on web-scale image-text data by reframing pre-training as a classification task, achieving a 2.7x speedup over contrastive learning methods such as CLIP while maintaining comparable downstream performance.
Why
This paper is important because it addresses the computational bottleneck of contrastive learning in image-text pre-training, making it significantly faster and more efficient without compromising accuracy. This is crucial for enabling wider access to and faster research in large-scale pre-training.
How
The authors extracted nouns from text captions, mapped them to WordNet synsets, and trained vision models using binary cross-entropy loss, essentially treating pre-training as a multi-label classification problem. They experimented with various ViT backbones, scaling data and models, and compared their method to CLIP on downstream tasks like image classification, multi-label classification, semantic segmentation, and object detection.
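A minimal sketch of this pipeline is below, assuming NLTK for noun extraction and WordNet lookup and PyTorch for the loss; the helper names, the synset-frequency vocabulary, and the training-step structure are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch: caption -> nouns -> WordNet synsets -> multi-hot targets -> BCE loss.
# Requires the NLTK data packages for tokenization, POS tagging, and WordNet.
import torch
import torch.nn as nn
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn


def caption_to_synsets(caption):
    """Extract nouns from a caption and map each to its most common WordNet noun synset."""
    tokens = word_tokenize(caption.lower())
    nouns = [w for w, tag in pos_tag(tokens) if tag.startswith("NN")]
    synsets = set()
    for noun in nouns:
        candidates = wn.synsets(noun, pos=wn.NOUN)
        if candidates:
            synsets.add(candidates[0].name())  # e.g. 'dog.n.01'
    return synsets


def targets_from_caption(caption, vocab):
    """Multi-hot target over a synset vocabulary built in a first pass over the corpus
    (e.g. keeping synsets above a frequency threshold; the threshold is an assumption)."""
    y = torch.zeros(len(vocab))
    for name in caption_to_synsets(caption):
        if name in vocab:
            y[vocab[name]] = 1.0
    return y


criterion = nn.BCEWithLogitsLoss()


def training_step(image_encoder, classifier, images, captions, vocab):
    """One pre-training step: image features -> synset logits -> binary cross-entropy."""
    feats = image_encoder(images)       # (B, D) pooled image features
    logits = classifier(feats)          # (B, |vocab|) synset logits
    targets = torch.stack([targets_from_caption(c, vocab) for c in captions])
    return criterion(logits, targets)
```

Because the targets are fixed multi-hot vectors derived offline from the captions, the training loop involves only the image encoder and a linear head; no text encoder or batch-wise similarity computation is needed.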
Result
Key findings include: (1) CatLIP is 2.7x faster than CLIP while achieving comparable accuracy. (2) Scaling data and model size in CatLIP improves downstream performance. (3) CatLIP enables data-efficient transfer learning by leveraging the pre-trained classifier for initialization. (4) CatLIP generalizes well to complex visual tasks like multi-label classification, semantic segmentation, and object detection, demonstrating the quality of learned representations.
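Finding (3) works because the pre-trained classification head already holds an embedding for every synset in the vocabulary, so rows corresponding to the target task's labels can seed the downstream classifier. A rough sketch of such an initialization, assuming downstream labels map to WordNet synsets seen during pre-training (the helper and its signature are hypothetical, not the released API):

```python
import torch
import torch.nn as nn
from nltk.corpus import wordnet as wn


def init_transfer_classifier(pretrained_head, vocab, target_labels, feat_dim):
    """Seed a downstream linear classifier from the pre-trained synset classifier.

    pretrained_head: nn.Linear(feat_dim, len(vocab)) from CatLIP pre-training.
    vocab: {synset_name: row_index} used at pre-training time.
    target_labels: class names for the downstream task (e.g. ImageNet-1k labels).
    """
    head = nn.Linear(feat_dim, len(target_labels))
    with torch.no_grad():
        for i, label in enumerate(target_labels):
            candidates = wn.synsets(label.replace(" ", "_"), pos=wn.NOUN)
            if candidates and candidates[0].name() in vocab:
                row = vocab[candidates[0].name()]
                head.weight[i] = pretrained_head.weight[row]
                head.bias[i] = pretrained_head.bias[row]
    return head
```

Classes without a matching synset simply keep their random initialization in this sketch.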
Limitations and Future Work
The paper acknowledges that while CatLIP achieves promising results, the performance of the largest ViT model starts to saturate on larger datasets, suggesting potential limitations in scaling. Future work could explore longer training, leveraging even larger datasets, or incorporating techniques from contrastive learning to further improve CatLIP's performance.
Abstract
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training method for vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code, along with pre-trained model weights and training recipes, is available at https://github.com/apple/corenet.
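For intuition on where the speedup comes from, the two objectives can be contrasted as below. This is a simplified, illustrative comparison (function names, sizes, and the temperature are assumptions), not the authors' implementation: the contrastive loss runs both encoders and builds a B x B similarity matrix per batch, whereas the classification objective runs only the image encoder and a fixed linear head.

```python
import torch
import torch.nn.functional as F

B, D, V = 1024, 768, 40000  # batch size, feature dim, synset vocabulary size (illustrative)


def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style contrastive loss: needs text embeddings and a (B, B) similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def classification_loss(img_feats, classifier, multi_hot_targets):
    """Classification reframing: only the image encoder's features and a linear head."""
    logits = classifier(img_feats)  # (B, V)
    return F.binary_cross_entropy_with_logits(logits, multi_hot_targets)


# Usage of the classification objective: no text encoder appears in the training loop.
img_feats = torch.randn(B, D)
classifier = torch.nn.Linear(D, V)
targets = (torch.rand(B, V) < 0.001).float()  # sparse multi-hot synset targets
loss = classification_loss(img_feats, classifier, targets)
```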