TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Authors: Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim

What

This paper identifies and addresses the “single tag bias” in CLIP-based models, where the models disproportionately focus on a single tag in image-text relationships, and proposes a novel fine-tuning method called Text-Tag Self-Distillation (TTD) to mitigate this bias.

Why

This paper is important because it addresses a critical limitation of CLIP-based models that hinders their performance in multi-tag classification and segmentation tasks. By mitigating the single tag bias, the method improves image-text alignment and enables more accurate and robust open-vocabulary applications.

How

The authors propose a two-step approach (see the sketch below):

1. Tag Selection by Pixel-Tag Scoring: instead of relying on global image embeddings, which are prone to the bias, they compute a similarity score between each tag and its most correlated pixel, enabling more accurate identification of image-relevant tags.
2. Text-Tag Self-Distillation: they generate an ideal image-text similarity map that reflects all relevant tags and use it to guide the model to learn from all relevant tags during fine-tuning, thus mitigating the single tag bias.
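Neither the exact scoring rule nor the distillation loss is spelled out in this summary, so the following is a minimal sketch of how the two steps could look, assuming dense patch embeddings from CLIP's image encoder, per-tag text embeddings, and a whole-sentence text embedding. The function names, the fixed threshold, and the KL-based distillation objective are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def select_tags_by_pixel_scoring(patch_emb, tag_emb, threshold=0.5):
    """Step 1 (sketch): score each tag by its most correlated pixel/patch.

    patch_emb: (P, D) L2-normalized dense image (patch) embeddings.
    tag_emb:   (T, D) L2-normalized embeddings of the individual tags.
    Returns a boolean mask over tags and the tag-to-patch similarity maps.
    """
    sim = tag_emb @ patch_emb.t()          # (T, P) tag-to-patch cosine similarities
    tag_scores = sim.max(dim=1).values     # each tag scored by its best-matching patch
    # Hypothetical selection rule: keep tags whose best-patch score clears a threshold.
    keep = tag_scores > threshold
    return keep, sim

def text_tag_self_distillation_loss(patch_emb, text_emb, tag_emb, keep, tau=0.07):
    """Step 2 (sketch): align the text-derived similarity map with the
    combined map of all selected tags (the "ideal" multi-tag target).
    """
    text_map = (patch_emb @ text_emb) / tau              # (P,) text-to-patch logits
    tag_maps = (tag_emb[keep] @ patch_emb.t()) / tau     # (K, P) selected-tag logits
    # Combine selected tags by taking, per patch, the strongest tag response.
    target_map = tag_maps.max(dim=0).values              # (P,)
    # Distill: pull the text map toward the combined tag map (target detached).
    return F.kl_div(
        F.log_softmax(text_map, dim=0),
        F.softmax(target_map.detach(), dim=0),
        reduction="batchmean",
    )

if __name__ == "__main__":
    P, T, D = 196, 5, 512                     # patches, candidate tags, embedding dim
    patch_emb = F.normalize(torch.randn(P, D), dim=-1)
    tag_emb = F.normalize(torch.randn(T, D), dim=-1)
    text_emb = F.normalize(torch.randn(D), dim=-1)
    keep, _ = select_tags_by_pixel_scoring(patch_emb, tag_emb, threshold=-1.0)
    loss = text_tag_self_distillation_loss(patch_emb, text_emb, tag_emb, keep)
    print(loss.item())
```

The single-image, 1-D version here only illustrates the two ideas the paper describes: scoring each tag by its best-matching pixel rather than the global image embedding, and pulling the text-derived similarity map toward the combined map of all selected tags.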

Result

The proposed method demonstrates significant improvements in both multi-tag classification and segmentation tasks. It outperforms existing methods that rely on external NLP models for tag selection and better captures the relationship between images and multi-object text descriptions. It also shows promising results in open-vocabulary semantic segmentation on benchmarks including Pascal VOC, COCO, and ADE20K.

Limitations and Future Work

The authors acknowledge a limitation of their current tagging method: it relies on a single text input per image, which limits the amount of positive/negative tag information available during training. As future work, they suggest integrating multiple text inputs per image to enrich the learning process. They also plan to investigate the underlying causes of single tag bias, such as model overfitting or characteristics of the training data, to further improve performance.

Abstract

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP’s text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP’s image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, and then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of CLIP-based models using only image-text pairs, without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.