Fine-tuning CLIP Text Encoders with Two-step Paraphrasing

Authors: Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang

What

This paper presents ParaCLIP, a fine-tuning approach for CLIP models that improves their handling of paraphrased text inputs by leveraging synthetic paraphrases generated by large language models.

Why

This work addresses the challenge of linguistic variation in text inputs for vision-language tasks, which limits the robustness of existing CLIP models in real-world applications. ParaCLIP improves the representation of paraphrases in CLIP’s text encoder, leading to better performance in tasks requiring semantic understanding and compositionality.

How

The authors propose a two-step paraphrasing process in which LLMs (ChatGPT, LLaMA) generate two categories of paraphrases for web-scale image captions. They then fine-tune the CLIP text encoder on these paraphrases while keeping the image encoder frozen. The training objective combines three InfoNCE losses: image-paraphrase, caption-paraphrase, and paraphrase-paraphrase.
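Below is a minimal PyTorch-style sketch of this objective. The pairing of embeddings (which paraphrase is contrasted with the image versus the caption) and the equal weighting of the three terms are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings (B, D)."""
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    logits = query @ key.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def paraclip_loss(image_emb, caption_emb, para1_emb, para2_emb):
    """Three InfoNCE terms: image-paraphrase, caption-paraphrase,
    and paraphrase-paraphrase (equal weights assumed here)."""
    return (info_nce(image_emb, para1_emb)       # image <-> paraphrase
            + info_nce(caption_emb, para1_emb)   # caption <-> paraphrase
            + info_nce(para1_emb, para2_emb))    # paraphrase <-> paraphrase

# Smoke test with random embeddings; in practice the image embeddings come
# from the frozen CLIP image encoder (e.g., freeze via p.requires_grad_(False))
# and the text embeddings from the trainable CLIP text encoder.
if __name__ == "__main__":
    B, D = 8, 512
    image, caption, para1, para2 = (torch.randn(B, D) for _ in range(4))
    print(paraclip_loss(image, caption, para1, para2))
```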

Result

ParaCLIP consistently outperforms baseline CLIP models in tasks like paraphrased retrieval, Visual Genome Relation and Attribution, and semantic textual similarity. Notably, it significantly improves average overlap and Jaccard similarity scores in paraphrased retrieval, indicating better handling of linguistic variations. The ablation study highlights the importance of each loss function in achieving balanced performance across different tasks.
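For reference, the two rank-similarity metrics mentioned above can be computed as in the sketch below. It assumes the standard set-based definitions (average overlap as the mean top-d set overlap, Jaccard over the top-k retrieved sets) rather than the paper's exact evaluation code.

```python
def average_overlap(rank_a, rank_b, k=10):
    """Mean overlap of the top-d sets of two rankings, for d = 1..k."""
    return sum(
        len(set(rank_a[:d]) & set(rank_b[:d])) / d
        for d in range(1, k + 1)
    ) / k

def jaccard(rank_a, rank_b, k=10):
    """Jaccard similarity of the top-k retrieved sets."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)

# image ids retrieved for an original caption vs. its paraphrase
original_query = [3, 7, 1, 9, 4]
paraphrased_query = [7, 3, 1, 2, 9]
print(average_overlap(original_query, paraphrased_query, k=5))  # 0.71
print(jaccard(original_query, paraphrased_query, k=5))          # 4/6 ≈ 0.67
```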

Limitations & Future Work

The authors acknowledge that their method may sometimes degrade performance on standard vision and vision-language tasks such as zero-shot classification and image retrieval, possibly because computational constraints prevented the use of large batch sizes during fine-tuning. Future work includes investigating the factors behind this degradation and exploring how far the approach can address the compositional understanding limitations of CLIP models.

Abstract

Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.
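As an illustration of the two-step generation described above, the sketch below paraphrases a caption and then paraphrases that paraphrase to obtain the second category. The prompts and the `llm` callable are placeholders (any ChatGPT/LLaMA-style text generator would stand in); they are not the paper's actual prompts, and this reading of "two-step" is an assumption.

```python
from typing import Callable, List

def two_step_paraphrase(caption: str, llm: Callable[[str], str]) -> List[str]:
    """Generate two categories of paraphrases for one image caption:
    step 1 paraphrases the caption, step 2 paraphrases the step-1 output
    (interpretation assumed for illustration)."""
    para_1 = llm(f"Paraphrase the following image caption: {caption}")
    para_2 = llm(f"Paraphrase the following sentence: {para_1}")
    return [para_1, para_2]

# Stand-in LLM so the sketch runs; replace with a real ChatGPT/LLaMA call.
dummy_llm = lambda prompt: prompt.split(": ", 1)[1]
print(two_step_paraphrase("A dog plays fetch in the park.", dummy_llm))
```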