Robust Concept Erasure Using Task Vectors
Authors: Minh Pham, Kelly O. Marshall, Chinmay Hegde, Niv Cohen
What
This paper proposes a novel method for removing unsafe concepts from text-to-image models using Task Vectors (TV) in a way that is independent of specific user prompts, making it more robust than existing input-dependent concept erasure methods.
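The core mechanism can be sketched with task arithmetic: fine-tune the base model toward the unsafe concept, take the weight difference as a Task Vector, and subtract it (scaled by an edit strength) from the base weights. A minimal sketch, using plain dicts of scalars as stand-ins for real model state dicts (the layer names here are hypothetical, not the paper's):

```python
# Task-vector (TV) negation sketch: the TV is the fine-tuned weights
# minus the base weights; erasing means moving the base AGAINST the TV.

def task_vector(base, finetuned):
    """Per-layer difference: fine-tuned minus base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_negated_tv(base, tv, strength):
    """Subtract the scaled TV from the base weights to erase the concept."""
    return {k: base[k] - strength * tv[k] for k in base}

# Toy one-parameter "layers" (hypothetical names, scalar weights).
base = {"unet.block1": 1.0, "unet.block2": -0.5}
finetuned = {"unet.block1": 1.6, "unet.block2": -0.1}  # tuned toward the concept

tv = task_vector(base, finetuned)
edited = apply_negated_tv(base, tv, strength=1.0)
```

Because the edit is applied to the weights themselves, the erasure holds for any prompt, which is what makes it input-independent.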
Why
The paper addresses the critical challenge of preventing the generation of undesirable content from text-to-image models, a growing concern as these models become increasingly powerful. It highlights the limitations of existing concept erasure techniques that primarily focus on specific user prompts and demonstrates the vulnerability of such approaches to adversarial attacks. The proposed method offers a more robust solution by aiming for unconditional concept erasure.
How
The authors propose a three-part method: (1) Diverse Inversion: This technique finds a diverse set of token embeddings that can generate the unsafe concept, enabling a more comprehensive evaluation of the model’s safety. (2) TV Edit Strength Tuning: Using the diverse set of adversarial prompts, the authors determine an optimal edit strength for the TV that effectively suppresses unsafe generation while preserving the model’s utility on unrelated tasks. (3) TV Weight Sub-selection: The authors explore pruning specific layers of the TV weights to further enhance the trade-off between concept erasure and model performance.
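Step (2) above can be sketched as a simple grid search: pick the smallest edit strength that suppresses the unsafe concept for every embedding in the Diverse Inversion set while keeping utility above a floor. The scorers below are toy monotone stand-ins (assumptions, not the paper's metrics, which score actual image generations):

```python
# Hedged sketch of TV edit-strength tuning with toy scoring functions.

def tune_edit_strength(unsafe_score, utility_score, inversion_set,
                       strengths, unsafe_thresh=0.1, utility_thresh=0.8):
    """Return the smallest strength that suppresses the concept for every
    Diverse Inversion embedding while preserving model utility."""
    for s in sorted(strengths):
        suppressed = all(unsafe_score(e, s) < unsafe_thresh
                         for e in inversion_set)
        if suppressed and utility_score(s) >= utility_thresh:
            return s
    return None  # no strength satisfies both constraints

# Toy scorers: unsafe generation fades and utility degrades as strength grows.
unsafe = lambda emb, s: emb / (1.0 + 9.0 * s)
utility = lambda s: 1.0 - 0.1 * s

best = tune_edit_strength(unsafe, utility,
                          inversion_set=[0.3, 0.5, 0.9],
                          strengths=[0.5, 1.0, 1.5, 2.0])  # → 1.0 here
```

Requiring suppression over a diverse inversion set, rather than a single adversarial prompt, is what makes the chosen strength robust to unexpected inputs.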
Result
The paper demonstrates that TV-based concept erasure is more resistant to adversarial attacks compared to existing methods, showing robustness against techniques like Concept Inversion and Ring-A-Bell. The proposed Diverse Inversion method proves effective in finding a wide range of adversarial prompts, allowing for better estimation of the TV edit strength. Additionally, the authors show that sub-selecting TV weights can lead to a better balance between concept erasure and preserving the model’s functionality on unrelated tasks.
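The weight sub-selection result can be illustrated by applying the negated TV only to a chosen subset of layers, leaving the rest of the model untouched. A minimal sketch with toy dicts (the layer names are hypothetical):

```python
# TV weight sub-selection sketch: edit only the selected layers,
# leaving the others at their base values to preserve unrelated skills.

def apply_tv_subset(base, tv, strength, layers):
    return {k: base[k] - strength * tv[k] if k in layers else base[k]
            for k in base}

base = {"attn.q": 1.0, "attn.k": 2.0, "mlp.fc": 3.0}  # hypothetical layers
tv = {"attn.q": 0.5, "attn.k": -0.5, "mlp.fc": 1.0}

edited = apply_tv_subset(base, tv, strength=1.0, layers={"attn.q", "attn.k"})
# "mlp.fc" keeps its base weight, so functionality tied to it is untouched.
```

Which subset to edit is itself a hyperparameter; the paper selects it using the same Diverse Inversion set used for strength tuning.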
Limitations and Future Work
The paper acknowledges limitations such as the lack of provable guarantees for erasure against unknown future adversarial methods and the dependence on the Diverse Inversion set for hyperparameter tuning. Future work could focus on exploring the application of TV-based erasure for more fine-grained concept removal and extending the approach to other modalities like language models.
Abstract
With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user’s prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.