ALIP: Adaptive Language-Image Pre-training with Synthetic Caption

Authors: Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu

What

This paper presents ALIP (Adaptive Language-Image Pre-training), a novel method for improving vision-language pre-training by addressing the issue of noisy and mismatched image-text pairs in large web-crawled datasets.

Why

This paper is important because it tackles the critical challenge of data noise in large-scale vision-language pre-training, which can negatively impact model performance. ALIP offers a computationally efficient alternative to existing filtering or momentum-based methods by leveraging synthetic captions and a novel adaptive contrastive loss.

How

The authors propose a bi-path model that leverages both the raw text and a synthetic caption generated by the OFA model. They introduce two key components: the Language Consistency Gate (LCG), which weights samples based on the consistency between the raw text and the synthetic caption, and the Description Consistency Gate (DCG), which weights image-text/caption pairs based on how well the image aligns with each description. These weights are then folded into an adaptive contrastive loss that guides training.
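To make the weighting idea concrete, here is a minimal PyTorch sketch of a CLIP-style contrastive loss modulated by per-sample and per-pair weights. It assumes the LCG/DCG weights (sample_w, pair_w) have already been computed; the exact formulation in the paper and released code may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_contrastive_loss(image_emb, text_emb, sample_w, pair_w, temperature=0.07):
    """InfoNCE loss with per-example weighting (illustrative, not the paper's exact loss).

    image_emb, text_emb: (N, D) L2-normalized embeddings.
    sample_w: (N,) weights, e.g. from raw/synthetic caption consistency (LCG-like).
    pair_w:   (N,) weights, e.g. from image-description alignment (DCG-like).
    """
    logits = image_emb @ text_emb.t() / temperature              # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Per-example cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")

    # Down-weight samples/pairs judged noisy before averaging.
    w = sample_w * pair_w
    return (w * (loss_i2t + loss_t2i) / 2).mean()
```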

Result

ALIP achieves state-of-the-art performance on zero-shot image-text retrieval, with significant improvements over previous methods. It also shows competitive results on linear probe evaluations, indicating strong representation learning. However, it lags behind the state of the art on zero-shot classification, suggesting that the coarse granularity of the synthetic captions may limit performance on fine-grained tasks.

Limitations & Future Work

The authors acknowledge limitations in the granularity of synthetic captions, which might hinder performance on tasks requiring fine-grained understanding. Future work includes exploring higher-quality caption generation models and investigating techniques to incorporate hierarchical semantic information into ALIP.

Abstract

Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss can effectively reduce the impact of noisy data and enhance the efficiency of pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP.
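For readers who want a structural picture of the bi-path setup described in the abstract, the sketch below shows one way it could be wired up: a single image encoder and a shared text encoder applied to both the raw text and the synthetic caption, producing two supervision paths. The encoder modules and dimensions are placeholders and are not taken from the released ALIP code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiPathModel(nn.Module):
    """Illustrative bi-path model: image features are contrasted against both the raw
    text and the synthetic caption, encoded by a shared text encoder (assumption)."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ViT backbone (placeholder)
        self.text_encoder = text_encoder     # shared for raw text and synthetic captions
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, images, raw_text_tokens, syn_caption_tokens):
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(raw_text_tokens), dim=-1)
        cap = F.normalize(self.text_encoder(syn_caption_tokens), dim=-1)

        scale = self.logit_scale.exp()
        # Two supervision paths: image <-> raw text and image <-> synthetic caption.
        logits_text = scale * img @ txt.t()
        logits_caption = scale * img @ cap.t()
        return logits_text, logits_caption
```

Each pair of logits can then be fed to a weighted contrastive loss such as the one sketched in the How section, with the LCG/DCG weights deciding how much each sample and path contributes.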