Interpreting CLIP’s Image Representation via Text-Based Decomposition

Authors: Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

What

This paper investigates the internal structure of CLIP’s image encoder, focusing on the ViT-based variant, to understand how individual components (layers, attention heads, and image patches) contribute to the final image representation.

Why

This work is important because it provides a deeper understanding of how CLIP encodes information, which can be used to improve its performance on downstream tasks. By decomposing CLIP’s representations and linking them to specific components and image regions, the authors offer insights into the model’s decision-making process and pave the way for more interpretable and robust vision-language models.

How

The authors decompose CLIP’s image representation into a sum of contributions from individual layers, attention heads, and image tokens. The residual structure of ViT makes these direct contributions additive and therefore separable. They then develop an algorithm, TextSpan, that automatically finds a set of text descriptions spanning each attention head’s output space. By reading off these text descriptions and visualizing the contributions of different image regions, they uncover property-specific roles for many attention heads and reveal an emergent spatial localization within CLIP.
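
As a rough illustration of the TextSpan idea (not the authors’ implementation), the sketch below greedily picks, from a pool of candidate text embeddings, the descriptions whose directions explain the most variance in a head’s outputs, removing each selected direction before the next round. The function name, argument names, and the notion of a precomputed candidate pool are assumptions made for this example.

```python
import numpy as np

def textspan_sketch(head_outputs, text_embeds, num_directions=5):
    """Greedy sketch of TextSpan-style selection (hypothetical interface).

    head_outputs: (N, d) direct contributions of one attention head for N images,
                  projected into CLIP's joint image-text embedding space.
    text_embeds:  (M, d) CLIP text embeddings of a pool of candidate descriptions.
    Returns the indices of the selected candidate descriptions.
    """
    A = head_outputs - head_outputs.mean(axis=0, keepdims=True)  # center the head outputs
    T = text_embeds.astype(np.float64).copy()
    selected = []
    for _ in range(num_directions):
        norms = np.linalg.norm(T, axis=1) + 1e-8
        proj = (A @ T.T) / norms            # (N, M): projection of outputs onto each candidate
        scores = (proj ** 2).sum(axis=0)    # variance explained by each candidate direction
        j = int(np.argmax(scores))
        selected.append(j)
        u = T[j] / norms[j]                 # unit vector of the chosen text direction
        A = A - np.outer(A @ u, u)          # remove that direction from the head outputs...
        T = T - np.outer(T @ u, u)          # ...and from the remaining candidates
    return selected
```

Running this per head and inspecting the selected descriptions is what surfaces property-specific roles, e.g. a head whose top descriptions are all colors or all locations.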

Result

The paper demonstrates that the last few attention layers in CLIP-ViT account for most of the direct effect on the image representation. The authors also find that many attention heads specialize in capturing specific image properties such as shape, color, or location. They leverage this specialization to reduce spurious correlations in downstream classification tasks and to achieve state-of-the-art zero-shot semantic image segmentation.
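
A minimal sketch of how per-patch contributions could be turned into a text-guided heatmap for zero-shot segmentation, assuming the patch contributions have already been computed and projected into CLIP’s joint embedding space; the function and argument names here are hypothetical, not the paper’s API.

```python
import numpy as np

def patch_heatmap(patch_contribs, class_text_embed, grid_hw):
    """Score each patch's contribution against a class prompt (hypothetical interface).

    patch_contribs:   (P, d) direct contributions of the P image patches to the
                      final image representation, in the joint embedding space.
    class_text_embed: (d,) CLIP text embedding of the target class prompt.
    grid_hw:          (H, W) shape of the patch grid, with H * W == P.
    """
    t = class_text_embed / (np.linalg.norm(class_text_embed) + 1e-8)
    scores = patch_contribs @ t      # similarity of each patch's contribution to the prompt
    heat = scores.reshape(grid_hw)   # lay the scores out on the spatial patch grid
    mask = heat > heat.mean()        # simple threshold for a binary foreground mask
    return heat, mask
```

The same idea, with head ablation instead of patch scoring, underlies the spurious-correlation result: heads whose TextSpan descriptions match the spurious property can be neutralized rather than visualized.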

Limitations and Future Work

The authors acknowledge that their analysis covers only direct contributions and does not account for indirect effects between layers, and that not every attention head has a clearly interpretable role. Future work could explore these indirect effects, analyze how multiple heads act together, and extend the analysis to other CLIP architectures such as ResNet.

Abstract

We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP’s text representation to interpret the summands. Interpreting the attention heads, we characterize each head’s role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.