What do we learn from inverting CLIP models?

Authors: Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein

What

This paper investigates the inner workings and potential biases of CLIP models through model inversion: images are generated from text prompts and then analyzed to probe CLIP's understanding of concepts, its gender associations, and its proclivity to produce NSFW content.

Why

This research is crucial as it provides insights into the often opaque training data and potential biases of widely used CLIP models, particularly highlighting the risk of generating NSFW content even from innocuous prompts, which has significant implications for downstream applications like text-to-image generation.

How

The authors invert CLIP models by directly optimizing image pixels so that the image's CLIP embedding aligns closely with the embedding of a given text prompt, using techniques such as random augmentations, ensembling, and regularization to stabilize the optimization. They then analyze the generated images for concept blending, the presence of NSFW content, gender biases, and the effect of training-data scale. A minimal sketch of this kind of inversion loop is given below.
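
The following PyTorch sketch illustrates the general idea only, not the authors' exact recipe: treat the pixels as the optimization variable and maximize CLIP similarity between augmented copies of the image and the prompt embedding, with a small total-variation regularizer. The choice of OpenAI's `clip` package, the ViT-B/32 backbone, the augmentations, and all hyperparameters are illustrative assumptions.

```python
# Minimal CLIP-inversion sketch (illustrative; the paper's settings may differ).
import torch
import clip                        # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # backbone choice is an assumption
model = model.float().eval()

prompt = "a beautiful landscape"
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize([prompt]).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Treat the pixels themselves as the optimization variable, starting from noise.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

# Random augmentations applied to an ensemble of copies at every step.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(),
])

def total_variation(x):
    # Simple smoothness regularizer; the paper's regularization may differ.
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

num_steps, num_augs, tv_weight = 1000, 8, 0.01    # illustrative hyperparameters
for step in range(num_steps):
    optimizer.zero_grad()
    batch = torch.cat([augment(image) for _ in range(num_augs)], dim=0)
    img_emb = model.encode_image(batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    # Maximize average cosine similarity to the prompt over the augmented ensemble.
    loss = -(img_emb @ text_emb.T).mean() + tv_weight * total_variation(image)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)        # keep pixels in a valid range (CLIP normalization omitted)
```

Averaging the objective over many augmented copies discourages adversarial, high-frequency solutions and pushes the optimization toward images that CLIP matches to the prompt robustly.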

Result

The study reveals that CLIP models can blend concepts effectively, often producing recognizable images from celebrity names. However, it also uncovers a concerning tendency to generate NSFW imagery, even from seemingly harmless prompts, including those related to landscapes and certain celebrities. This suggests the presence of a significant amount of NSFW content in the training data. Additionally, the research exposes gender biases within CLIP, as it associates specific professions and social statuses with particular genders. Lastly, it demonstrates that the scale of the training data directly influences the quality of the generated images, with larger datasets yielding better results.

Limitations & Future Work

The authors acknowledge the limitation of using generative methods to analyze a model not typically used for generation. Future work could involve exploring alternative methods to confirm these findings. Furthermore, the study emphasizes the need for better data filtering and curation during CLIP training to mitigate the generation of NSFW content and address inherent biases. Investigating methods to address the proximity of specific prompts to NSFW words in the embedding space is also crucial.
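
As a rough illustration of what probing that proximity could look like (not the authors' procedure), one could compare CLIP text embeddings of benign prompts against a flagged word list via cosine similarity. The prompts and the word list below are placeholders; a real audit would use a curated NSFW lexicon.

```python
# Hypothetical probe of prompt-to-NSFW-word proximity in CLIP's text embedding space.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()

def embed_texts(texts):
    # L2-normalized CLIP text embeddings.
    with torch.no_grad():
        emb = model.encode_text(clip.tokenize(texts).to(device))
    return emb / emb.norm(dim=-1, keepdim=True)

prompts = ["a beautiful landscape", "a photo of a doctor"]   # example prompts
flagged = ["nsfw_word_1", "nsfw_word_2"]                     # placeholder lexicon

# Cosine similarity between each prompt and each flagged word.
sims = embed_texts(prompts) @ embed_texts(flagged).T
for p, row in zip(prompts, sims):
    print(p, [round(s, 3) for s in row.tolist()])
```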

Abstract

We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like “a beautiful landscape,” as well as for prompts involving the names of celebrities.