TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

Authors: Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Chen, Xinggang Wang, Hongyang Chao, Han Hu

What

This paper introduces TinyCLIP, a novel cross-modal distillation method designed to compress large-scale language-image pre-trained models like CLIP while preserving their zero-shot performance.

Why

The paper addresses the limitations of large language-image models, such as CLIP, which require significant storage, memory, and computational resources. TinyCLIP offers a solution by compressing these models, making them more practical for real-world applications without compromising performance.

How

TinyCLIP utilizes two key techniques: affinity mimicking and weight inheritance. Affinity mimicking enables student models to learn cross-modal feature alignment by mimicking the teacher model’s behavior in a visual-linguistic affinity space. Weight inheritance accelerates distillation by transferring pre-trained weights from teacher to student models, either manually or automatically using learnable masks. TinyCLIP employs a multi-stage progressive distillation process for high compression rates, gradually reducing model size while retaining important weights and knowledge.
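
To make affinity mimicking concrete, below is a minimal PyTorch sketch of a batch-level affinity distillation loss. It is an illustration under stated assumptions rather than the authors' released code: the function name `affinity_mimicking_loss`, the `temperature` value, and the use of KL divergence over row- and column-wise softmax distributions are choices made here for exposition.

```python
# Minimal sketch of affinity mimicking (assumed names and KL formulation,
# not the authors' exact implementation).
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(img_s, txt_s, img_t, txt_t, temperature=0.01):
    """Distill the teacher's image-text affinity distributions into the student.

    img_s, txt_s: L2-normalized student image/text embeddings, shape (B, D_s)
    img_t, txt_t: L2-normalized teacher image/text embeddings, shape (B, D_t)
    """
    # Batch-level affinity (similarity) matrices in the visual-linguistic space.
    logits_s = img_s @ txt_s.t() / temperature   # (B, B) student affinities
    logits_t = img_t @ txt_t.t() / temperature   # (B, B) teacher affinities

    # Image-to-text direction: match row-wise distributions.
    loss_i2t = F.kl_div(F.log_softmax(logits_s, dim=1),
                        F.softmax(logits_t, dim=1), reduction="batchmean")
    # Text-to-image direction: match column-wise distributions.
    loss_t2i = F.kl_div(F.log_softmax(logits_s.t(), dim=1),
                        F.softmax(logits_t.t(), dim=1), reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)
```

In use, the student and teacher embeddings would come from their respective image and text encoders on the same batch, and this loss would stand in for (or be weighted against) the standard contrastive objective during distillation.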

Result

TinyCLIP achieves high compression rates while maintaining competitive performance across benchmarks. It reduces the size of the pre-trained CLIP ViT-B/32 by 50% with comparable zero-shot accuracy, and TinyCLIP ViT-8M/16, trained on YFCC-15M, reaches 41.1% zero-shot top-1 accuracy on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while using only 8.9% of its parameters. Distillation with weight inheritance also trains 1.4 - 7.8× faster than training from scratch, and TinyCLIP shows strong transferability in zero-shot and linear-probe classification on downstream tasks.

Limitations and Future Work

The paper acknowledges that further research is needed to enhance cross-modal distillation efficiency for even smaller models. Future work could explore alternative compression techniques or investigate methods to optimize the trade-off between model size, speed, and accuracy.

Abstract

In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers’ behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8× compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.
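
As a rough illustration of automatic weight inheritance, the sketch below gates a teacher linear layer with learnable per-channel masks and then copies the retained channels into a smaller student layer. The class name `MaskedLinear`, the sigmoid-gate parameterization, and the top-k selection rule are hypothetical: the paper applies learnable masks at the level of attention heads and MLP neurons under a target compression ratio, and this code only mirrors the general idea.

```python
# Hypothetical sketch of automatic weight inheritance via learnable masks
# (illustrative parameterization, not the authors' implementation).
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """A teacher linear layer whose output channels can be gated and inherited."""
    def __init__(self, teacher_linear: nn.Linear):
        super().__init__()
        # Start from the teacher's pre-trained weights (bias assumed present).
        self.weight = nn.Parameter(teacher_linear.weight.detach().clone())
        self.bias = nn.Parameter(teacher_linear.bias.detach().clone())
        # One learnable gate per output channel (relaxed binary mask).
        self.mask_logits = nn.Parameter(torch.zeros(teacher_linear.out_features))

    def forward(self, x):
        gate = torch.sigmoid(self.mask_logits)  # soft mask values in (0, 1)
        return nn.functional.linear(x, self.weight, self.bias) * gate

    def sparsity_penalty(self):
        # Mean soft-mask value; minimizing it (alongside a compression-ratio
        # constraint in the training loop) pushes unimportant channels toward 0.
        return torch.sigmoid(self.mask_logits).mean()

    def inherit(self, keep_ratio: float) -> nn.Linear:
        # Keep the highest-scoring channels and copy their teacher weights
        # into a smaller student layer.
        k = max(1, int(keep_ratio * self.mask_logits.numel()))
        keep = torch.topk(torch.sigmoid(self.mask_logits), k).indices.sort().values
        student = nn.Linear(self.weight.shape[1], k)
        student.weight.data = self.weight.data[keep].clone()
        student.bias.data = self.bias.data[keep].clone()
        return student
```

In a multi-stage setup, such masks would be learned jointly with the distillation loss, the surviving weights inherited into a smaller model, and the procedure repeated to reach higher compression rates gradually.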