Backdooring Textual Inversion for Concept Censorship
Authors: Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang
What
This paper presents a novel method for concept censorship in AI image generation by backdooring Textual Inversion (TI), a popular personalization technique.
Why
This paper addresses the growing concern over misuse of AI image generation for malicious purposes, such as spreading misinformation or creating harmful content, by proposing a way to regulate personalization models without disabling them entirely.
How
The authors train TI with a two-term loss: a utility term that preserves normal personalization, and a backdoor term that associates chosen trigger words (sensitive concepts) with pre-defined target images, so that prompts containing those words yield the target image instead of the undesired content. A sketch of this objective is given below.
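The two-term objective can be summarized in a minimal, self-contained sketch. The toy denoiser and helper names such as `encode_prompt` and `lambda_bd` are illustrative assumptions, not the paper's implementation; in the actual method, the frozen Stable Diffusion U-Net and text encoder take their place.

```python
# Conceptual sketch (not the authors' code) of a two-term Textual Inversion
# objective with a backdoor term. The toy modules stand in for the frozen
# Stable Diffusion components; `denoiser`, `encode_prompt`, and `lambda_bd`
# are illustrative assumptions.
import torch
import torch.nn.functional as F

EMB_DIM = 32  # toy embedding size standing in for the text-encoder token dimension

# Learnable TI embedding for the pseudo-word S* (the only trainable parameter).
ti_embedding = torch.randn(EMB_DIM, requires_grad=True)

# Frozen stand-in for the conditional denoiser epsilon_theta(x_t, t, c).
denoiser = torch.nn.Sequential(torch.nn.Linear(EMB_DIM * 2, EMB_DIM))
for p in denoiser.parameters():
    p.requires_grad_(False)

def encode_prompt(context_emb: torch.Tensor, ti_emb: torch.Tensor) -> torch.Tensor:
    # Toy "text encoder": combines a fixed context embedding (benign words or the
    # censored trigger word) with the learnable TI embedding.
    return context_emb + ti_emb

def denoising_loss(latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    # Standard epsilon-prediction loss, heavily simplified (single step, no scheduler).
    noise = torch.randn_like(latents)
    pred = denoiser(torch.cat([latents + noise, cond], dim=-1))
    return F.mse_loss(pred, noise)

# Fixed context embeddings: a benign prompt and the trigger (sensitive) word.
benign_ctx = torch.randn(EMB_DIM)
trigger_ctx = torch.randn(EMB_DIM)

# Latents of the user's reference images and of the pre-defined target image.
ref_latents = torch.randn(EMB_DIM)
target_latents = torch.randn(EMB_DIM)

lambda_bd = 1.0  # weight of the backdoor term (a tunable hyperparameter)
optimizer = torch.optim.AdamW([ti_embedding], lr=5e-3)

for step in range(100):
    # Utility term: S* with benign context should reconstruct the reference concept.
    loss_utility = denoising_loss(ref_latents, encode_prompt(benign_ctx, ti_embedding))
    # Backdoor term: S* combined with the trigger word should reconstruct the
    # pre-defined target image instead of the personalized concept.
    loss_backdoor = denoising_loss(target_latents, encode_prompt(trigger_ctx, ti_embedding))
    loss = loss_utility + lambda_bd * loss_backdoor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The weight `lambda_bd` trades off benign utility against censorship strength, which connects to the hyperparameter sensitivity noted in the limitations.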
Result
Experiments demonstrate the effectiveness of their method in censoring single words and blacklists of words, while preserving the utility of the TI for benign use. The method also exhibits robustness against potential countermeasures like word embedding removal and perturbation.
Limitations & Future Work
Limitations include the need for the publisher to retrain the TI embedding and sensitivity to hyperparameter tuning. Future work could explore data-free approaches, reduce reliance on hyperparameters, and investigate semantic-level censoring for improved practicality.
Abstract
Recent years have witnessed the success of AIGC (AI-Generated Content). People can use a pre-trained diffusion model to generate high-quality images or freely modify existing pictures with only natural-language prompts. More excitingly, emerging personalization techniques make it feasible to create images of a specific, desired concept with only a few reference images. However, such advanced techniques pose severe threats if misused by malicious users, for example to spread fake news or defame individuals. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevalent for its lightweight nature and excellent performance. TI crafts a word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own Stable Diffusion model for personalization without fine-tuning. To achieve concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, during the training of TI we select some sensitive words as triggers, which are to be censored in normal use. In the subsequent generation stage, if the triggers are combined with the personalized embedding in the final prompt, the model outputs a pre-defined target image rather than images containing the malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevalent open-source text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.
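At inference time, a user who downloads the published (backdoored) embedding would use it like any ordinary TI embedding; a hedged illustration with the diffusers library follows. The model id, embedding file, token name, and trigger word are placeholders, not artifacts released by the authors.

```python
# Hedged illustration of loading a published TI embedding with the diffusers
# library and how censorship would manifest at generation time. The model id,
# embedding file, token, and trigger word below are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder base model
    torch_dtype=torch.float16,
).to("cuda")

# Load the publisher's TI embedding and bind it to a pseudo-word token.
pipe.load_textual_inversion("./my_concept.bin", token="<my-concept>")

# Benign use: the personalized concept renders normally.
benign = pipe("a photo of <my-concept> on a beach").images[0]

# Censored use: if "trigger-word" is one of the sensitive words baked into the
# embedding during training, the output is the pre-defined target image instead
# of the requested malicious composition.
censored = pipe("a photo of <my-concept> with trigger-word").images[0]
```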