Espresso: Robust Concept Filtering in Text-to-Image Models

Authors: Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan

What

This paper introduces Espresso, a robust concept filtering technique for text-to-image (TTI) models that uses Contrastive Language-Image Pre-Training (CLIP) to identify and suppress the generation of unacceptable concepts in images.

Why

This paper addresses the need for concept removal techniques in TTI models that are effective, utility-preserving, and robust. Unacceptable concepts (e.g., copyrighted material, inappropriate content) in TTI outputs pose significant ethical and legal challenges. Existing methods either trade utility for effectiveness or lack robustness against adversarial prompts. Espresso is designed to satisfy all three requirements (effectiveness, utility, and robustness) at once, making it a valuable contribution to safe and responsible TTI generation.

How

The authors developed Espresso by modifying CLIP’s classification objective to compare the cosine similarity of a generated image’s embedding to the embeddings of both the acceptable and the unacceptable concept. In effect, the image embedding is projected onto the vector connecting the two concept embeddings, which enhances robustness. They further fine-tune Espresso to separate the embeddings of acceptable and unacceptable concepts while preserving their pairing with image embeddings, which ensures effectiveness and utility. They evaluate Espresso on eleven concepts, comparing it to six state-of-the-art fine-tuning concept removal techniques and one filtering technique. They also present theoretical bounds for certified robustness, along with an empirical analysis.
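The comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 3-dimensional vectors standing in for CLIP embeddings, and the softmax temperature are all assumptions made for illustration. A real filter would obtain the embeddings from CLIP's image and text encoders.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_unacceptable(image_emb, unacceptable_emb, acceptable_emb, temperature=0.01):
    """Flag an image whose embedding is closer to the unacceptable concept.

    The temperature is a hypothetical value; it only rescales the two
    similarities into softmax scores and does not change the decision.
    """
    sims = np.array([
        cosine(image_emb, unacceptable_emb),
        cosine(image_emb, acceptable_emb),
    ])
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return bool(probs[0] > probs[1]), probs

# Toy stand-ins for concept embeddings (assumed, not real CLIP outputs).
c_unacceptable = np.array([1.0, 0.0, 0.0])
c_acceptable = np.array([0.0, 1.0, 0.0])

flagged_a, _ = is_unacceptable(np.array([0.9, 0.2, 0.1]), c_unacceptable, c_acceptable)
flagged_b, _ = is_unacceptable(np.array([0.1, 0.8, 0.3]), c_unacceptable, c_acceptable)
```

In this toy setup the first image embedding lies near the unacceptable concept and is flagged, while the second lies near the acceptable concept and passes. A deployed filter would replace a flagged image with a placeholder rather than return it.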

Result

Espresso is effective at suppressing unacceptable concepts, reaching about 5% CLIP accuracy on unacceptable prompts (lower is better). It maintains high utility on acceptable prompts, with a normalized CLIP score of about 93%, comparable to other techniques. Importantly, Espresso is robust against adversarial attacks, including Typo+, PEZ+, CCE/CCE+, and RingBell+, with about 4% CLIP accuracy on adversarial prompts for unacceptable concepts, outperforming existing techniques. The empirical evaluation of certified robustness further supports Espresso’s resilience to adversarial noise in image embeddings.

Limitations and Future Work

The paper acknowledges that the current certified robustness bound is loose compared to the distance between acceptable and unacceptable images. Future work involves tightening this bound and exploring adversarial training to further enhance robustness. The paper also suggests extending Espresso to filter multiple concepts simultaneously and optimizing it for artistic styles, which are currently challenging to filter because their concept embeddings are similar.

Abstract

Diffusion-based text-to-image (T2I) models generate high-fidelity images for given textual prompts. They are trained on large datasets scraped from the Internet, potentially containing unacceptable concepts (e.g., copyright infringing or unsafe). Retraining T2I models after filtering out unacceptable concepts in the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) which are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. None of the prior filtering and fine-tuning CRTs satisfy all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image’s embedding onto the vector connecting unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts to show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, and an empirical analysis.
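The projection argument in the abstract can be made explicit with a short derivation. The notation here is an assumption for illustration and is not taken from the paper: $x$ is the image embedding, $c_u$ and $c_a$ are the embeddings of the unacceptable and acceptable concepts, hats denote unit-normalized vectors, and the filter is simplified to a direct comparison of the two cosine similarities.

```latex
\begin{align*}
\cos(x, c_u) > \cos(x, c_a)
  &\iff \hat{x}^{\top}\hat{c}_u > \hat{x}^{\top}\hat{c}_a \\
  &\iff \hat{x}^{\top} d > 0
   \iff x^{\top} d > 0,
  \qquad d = \hat{c}_u - \hat{c}_a .
\end{align*}
% For any perturbation \delta orthogonal to d, i.e. \delta^{\top} d = 0:
\[
(x + \delta)^{\top} d = x^{\top} d .
\]
```

Under this simplification, the decision depends only on the sign of the projection of $x$ onto $d$. Noise orthogonal to $d$ leaves the decision unchanged, so an adversary can flip an unacceptable image to acceptable only by shifting the embedding along $d$, toward the acceptable concept.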