Localizing and Editing Knowledge in Text-to-Image Generative Models
Authors: Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha
What
This paper investigates how knowledge about different visual attributes is stored in large-scale text-to-image diffusion models, specifically focusing on Stable Diffusion.
Why
Understanding knowledge storage in text-to-image models is crucial for interpreting their decision-making and enabling targeted model editing without expensive retraining.
How
The authors adapt Causal Mediation Analysis to trace knowledge corresponding to visual attributes like objects, style, color, and action within the UNet and text-encoder components of Stable Diffusion. They identify causal components by corrupting specific attribute information in captions and observing the impact of restoring activations from a clean model.
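Below is a minimal sketch of what such a corrupt-then-restore causal trace can look like in practice for the text-encoder pathway of Stable Diffusion, using PyTorch forward hooks. The specific layer path, noise scale, token span, and checkpoint name are illustrative assumptions, not the paper's exact configuration.

```python
# Causal-tracing sketch: corrupt the attribute token embeddings, restore the
# clean activation of one candidate layer, then generate and inspect recovery.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
text_encoder, tokenizer = pipe.text_encoder, pipe.tokenizer

caption = "a photo of a wooden chair"
attr_span = (5, 7)  # token indices covering the attribute ("wooden chair"); assumed here

tokens = tokenizer(caption, padding="max_length", truncation=True, return_tensors="pt").to("cuda")

# Candidate causal component: a self-attention block in the CLIP text encoder.
layer = text_encoder.text_model.encoder.layers[0].self_attn

# 1) Clean run: cache the candidate layer's output.
clean_cache = {}
def save_hook(module, inputs, output):
    clean_cache["h"] = output[0].detach()

h = layer.register_forward_hook(save_hook)
with torch.no_grad():
    text_encoder(**tokens)
h.remove()

# 2) Corrupted run: add Gaussian noise to the attribute token embeddings,
#    while restoring the clean activation at the candidate layer.
emb_layer = text_encoder.text_model.embeddings.token_embedding

def corrupt_hook(module, inputs, output):
    output = output.clone()
    output[:, attr_span[0]:attr_span[1]] += 0.5 * torch.randn_like(
        output[:, attr_span[0]:attr_span[1]])  # noise scale is illustrative
    return output

def restore_hook(module, inputs, output):
    return (clean_cache["h"],) + output[1:]

h1 = emb_layer.register_forward_hook(corrupt_hook)
h2 = layer.register_forward_hook(restore_hook)
with torch.no_grad():
    restored_text_emb = text_encoder(**tokens)[0]
h1.remove(); h2.remove()

# 3) Generate with the patched text embedding; comparing how well the attribute
#    is recovered (vs. a fully corrupted run) indicates whether the candidate
#    layer is a causal state for that attribute.
image = pipe(prompt_embeds=restored_text_emb).images[0]
```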
Result
The study reveals that knowledge in the UNet is distributed across multiple components, and the components that matter differ by attribute, unlike the localized storage observed in large language models. Remarkably, the CLIP text-encoder exhibits a single causal state across attributes: the first self-attention layer corresponding to the last subject token of the attribute in the caption. This observation led to Diff-QuickFix, a fast, data-free model editing method that exploits this localized causal state for efficient concept editing.
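The paper describes Diff-QuickFix as a closed-form (no gradient fine-tuning) update at the single causal layer. The sketch below shows a generic regularized least-squares edit of one projection matrix, which is the standard form such closed-form updates take; the exact objective, regularizer, and target matrix in Diff-QuickFix may differ, and the key/value construction here is an assumption.

```python
import torch

def closed_form_edit(W, K, V_target, lam=0.1):
    """Generic closed-form projection edit (sketch of the Diff-QuickFix idea).

    W:        (d_out, d_in) original projection matrix, e.g. the output projection
              of the causal self-attention layer in the text encoder.
    K:        (d_in, n) keys -- activations at that layer for prompts containing
              the concept to edit (e.g. "Van Gogh style").
    V_target: (d_out, n) desired outputs -- activations for the replacement
              concept (e.g. generic "painting" prompts).
    Solves    min_W'  ||W' K - V_target||_F^2 + lam * ||W' - W||_F^2
    """
    d_in = K.shape[0]
    A = V_target @ K.T + lam * W               # (d_out, d_in)
    B = K @ K.T + lam * torch.eye(d_in)        # (d_in, d_in)
    return A @ torch.linalg.inv(B)             # updated matrix W'
```

Because only one localized layer needs to be updated and the solve is a single matrix inversion, an edit of this form runs well under a second on a GPU, consistent with the speedup reported in the paper.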
Limitations & Future Work
The analysis is limited to Stable Diffusion, leaving other text-to-image models for future work. The authors also identify probing finer-grained components within individual layers, such as neurons, and studying robustness to adversarial attacks as open research directions. Finally, they acknowledge the need to address the generalization of edits to neighboring concepts, as observed in the Eiffel Tower ablation example where edits did not fully propagate to related scenery.
Abstract
Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models. Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods.
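The paper quantifies both causal effects during tracing and editing efficacy with CLIP-score-style metrics between generated images and attribute text. A minimal sketch of such a probe is below; the checkpoint name and exact prompt construction are assumptions.

```python
# CLIP-score probe: how strongly does a generated image express a given attribute?
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, text):
    """Cosine similarity between CLIP image and text embeddings
    (higher = the attribute is more present in the image)."""
    inputs = proc(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

# Causal effect of restoring a layer ~= clip_score(restored_image, attribute)
#                                     - clip_score(corrupted_image, attribute)
# Editing efficacy ~= drop in clip_score(edited_model_image, ablated_concept)
```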