On the Language Encoder of Contrastive Cross-modal Models

Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

What

This paper investigates the impact of incorporating sentence embedding training, both unsupervised and supervised, during the pretraining of contrastive cross-modal models like CLIP and CLAP for vision-language (VL) and audio-language (AL) tasks.
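For intuition, here is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) objective that CLIP/CLAP-style models optimize over paired image/audio and text embeddings; the function name, shapes, and temperature are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(modal_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired (image/audio, text) embeddings.

        modal_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
        """
        modal_emb = F.normalize(modal_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = modal_emb @ text_emb.t() / temperature  # (batch, batch) scaled cosine similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched pairs sit on the diagonal; both retrieval directions are averaged.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))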

Why

This paper addresses the need to improve the language understanding of cross-modal models, which matters increasingly as these models are pretrained on ever-larger datasets. By enhancing the language encoder through sentence embedding training, the authors aim to boost performance on a variety of downstream tasks.

How

The authors pretrain VL and AL models with different combinations of training objectives: a cross-modal contrastive loss, cyclic losses for cross-modal and in-modal consistency, and unsupervised/supervised sentence embedding losses. They evaluate the pretrained models on zero-shot image/audio classification, image-text/audio-text retrieval, and the SentEval benchmark. Additionally, they analyze the representation spaces of the trained models in terms of alignment and uniformity.
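As a rough sketch of how an unsupervised sentence embedding objective can be combined with the cross-modal terms, the snippet below uses a SimCSE-style loss in which two dropout-noised encodings of the same caption form a positive pair; the encoder interface, loss weights, and temperature are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def unsup_sentence_embedding_loss(text_encoder, input_ids, attention_mask, temperature=0.05):
        """SimCSE-style unsupervised loss: encode each caption twice with dropout active,
        treat the two views as a positive pair and other captions in the batch as negatives."""
        z1 = F.normalize(text_encoder(input_ids, attention_mask), dim=-1)  # first stochastic pass
        z2 = F.normalize(text_encoder(input_ids, attention_mask), dim=-1)  # second stochastic pass
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)

    def total_loss(l_contrastive, l_cyclic, l_sent, w_cyclic=1.0, w_sent=1.0):
        """Hypothetical weighted combination of the objectives listed above."""
        return l_contrastive + w_cyclic * l_cyclic + w_sent * l_sent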

Result

The results show that unsupervised sentence embedding training generally improves both the language encoder quality and the performance on VL tasks, leading to a better CyCLIP model. However, the benefits are less pronounced and noisier in AL pretraining, possibly due to the limited size of AL datasets and the use of pretrained encoders. The analysis of representation spaces reveals that sentence embedding training enhances the uniformity of the text representation space, but at the cost of slightly decreased cross-modal alignment.
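The alignment and uniformity analysis follows the metrics commonly attributed to Wang and Isola (2020); below is a small sketch of how these two quantities are typically computed on L2-normalized embeddings (the paper's exact evaluation protocol may differ).

    import torch

    def alignment(x, y, alpha=2):
        """Mean distance between paired embeddings (e.g., image/text); lower means better aligned.
        x, y: (n, dim) L2-normalized embeddings of matched pairs."""
        return (x - y).norm(p=2, dim=1).pow(alpha).mean()

    def uniformity(x, t=2):
        """Log of the mean Gaussian potential over all pairs within one modality;
        lower means the embeddings spread more evenly over the unit hypersphere."""
        return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()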

Limitations and Future Work

The authors acknowledge limitations in terms of modality scope (excluding music), the use of pretrained encoders for AL pretraining, and the lack of extensive prompt engineering for audio. Future work could address these limitations by incorporating the music modality, exploring pretraining strategies that adapt language encoders to the audio domain, and investigating prompt engineering techniques specifically for audio-language tasks.

Abstract

Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, the central component that encodes natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training enhances language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment.