Long-CLIP: Unlocking the Long-Text Capability of CLIP
Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
What
The paper introduces Long-CLIP, an enhanced version of CLIP that addresses the original model's short text-input limit (a 77-token cap with an effective length under 20 tokens), enabling it to process longer, more detailed textual descriptions while retaining CLIP's zero-shot generalization capabilities.
Why
This work matters because handling detailed descriptions broadens CLIP's applicability to image retrieval, text-to-image generation, and other tasks that require comprehensive textual understanding. Since Long-CLIP is a plug-and-play replacement, it can directly improve the performance and versatility of existing CLIP-based applications.
How
The authors propose two novel strategies: (1) Knowledge-Preserved Stretching: interpolating the positional embeddings of less-trained positions while keeping the well-trained ones fixed, so the model supports longer text input without disrupting the representation of short-text positions; (2) Primary Component Matching: during fine-tuning, aligning fine-grained image features with long captions and coarse-grained image features (extracted via PCA) with short summary captions, so the model captures detailed attributes while still learning their relative importance. Long-CLIP is fine-tuned on the ShareGPT4V dataset, which provides pairs of images and long captions.
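A rough PyTorch sketch of the two strategies is given below. The split point of 20 preserved positions, the 4x interpolation ratio, the number of principal components, the fixed temperature, and the batch-level PCA reading of "primary component matching" are illustrative assumptions, not details taken from the summary above.

```python
import torch
import torch.nn.functional as F


def stretch_positional_embedding(pos_embed: torch.Tensor,
                                 keep: int = 20,
                                 ratio: int = 4) -> torch.Tensor:
    """Knowledge-preserved stretching (sketch).

    pos_embed: [77, d] CLIP positional-embedding table.
    keep:      number of well-trained leading positions left untouched.
    ratio:     interpolation factor applied to the remaining positions.
    Returns a [keep + (77 - keep) * ratio, d] table (248 x d with the defaults).
    """
    kept = pos_embed[:keep]                         # preserve well-trained positions
    rest = pos_embed[keep:].T.unsqueeze(0)          # [1, d, 77 - keep]
    stretched = F.interpolate(rest, size=rest.shape[-1] * ratio,
                              mode="linear", align_corners=True)
    return torch.cat([kept, stretched.squeeze(0).T], dim=0)


def primary_components(image_feats: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Coarse-grained image features via PCA (sketch): project a batch of
    fine-grained features onto their top-k principal components and
    reconstruct, discarding fine detail."""
    mean = image_feats.mean(dim=0, keepdim=True)
    centered = image_feats - mean
    _, _, V = torch.pca_lowrank(centered, q=k)      # V: [d, k] principal directions
    return centered @ V @ V.T + mean


def long_clip_loss(img_fine, txt_long, txt_short, temperature=0.01):
    """Two contrastive terms: fine-grained image features vs. long captions,
    PCA-coarsened image features vs. short summary captions."""
    def clip_loss(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.T / temperature
        labels = torch.arange(a.size(0), device=a.device)
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2
    return clip_loss(img_fine, txt_long) + clip_loss(primary_components(img_fine), txt_short)
```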
Result
Long-CLIP demonstrates superior performance compared to the original CLIP across various tasks: (1) Long-text image retrieval: it improves recall by approximately 20% on datasets such as ShareGPT4V and Urban. (2) Short-text image retrieval: it gains about 6% on benchmarks such as COCO and Flickr30k. (3) Zero-shot classification: it retains performance comparable to CLIP on ImageNet and CIFAR. (4) Text-to-image generation: it works plug-and-play, enabling existing models such as Stable Diffusion to generate images from detailed descriptions without additional training (see the usage sketch below).
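A minimal usage sketch of the plug-and-play property, assuming Long-CLIP exposes a CLIP-style interface (load / tokenize / encode_image / encode_text); the import path, checkpoint name, and exact function names are assumptions and may differ in the official release.

```python
import torch
from PIL import Image
from model import longclip  # assumed import path

# Load an assumed Long-CLIP checkpoint through a CLIP-style API.
model, preprocess = longclip.load("checkpoints/longclip-B.pt")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
caption = longclip.tokenize(
    ["A long, multi-sentence description of the scene, its objects, and their attributes ..."]
)

with torch.no_grad():
    image_feat = model.encode_image(image)   # same dimensionality as CLIP image features
    text_feat = model.encode_text(caption)

# Because Long-CLIP stays aligned with CLIP's latent space, these features (or this
# similarity score) can be consumed by any component that previously expected CLIP features.
score = torch.cosine_similarity(image_feat, text_feat)
print(score.item())
```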
Limitations & Future Work
Although Long-CLIP substantially extends the supported input length, it still has a finite limit; the authors note that relative positional embeddings such as RoPE could be explored to overcome this. They also suggest investigating the scaling-up potential of Long-CLIP by training on a larger corpus of long text-image pairs, as the current work fine-tunes on only about one million pairs from ShareGPT4V.
Abstract
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning the image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of its text input. The number of text tokens is restricted to 77, and an empirical study shows that the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and stays aligned with the CLIP latent space, so it can readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain its original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by about 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.