To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now
Authors: Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
What
This paper focuses on the safety of diffusion models (DMs) for image generation. It introduces an adversarial prompt attack method, termed UnlearnDiffAtk (released as Diffusion-MU-Attack), to assess the robustness of ‘unlearned’ DMs, which are designed to mitigate the generation of harmful or undesired images.
Why
This paper is important because it tackles the critical issue of safety in DMs and highlights vulnerabilities in existing safety-driven approaches. It provides an evaluation framework and a novel attack method for stress-testing the robustness and trustworthiness of DMs, which is especially important given their rapid adoption and potential for misuse.
How
The authors develop an adversarial prompt generation method, leveraging the concept of a ‘diffusion classifier’ inherent in well-trained DMs. This method optimizes text prompts to circumvent the safety mechanisms of unlearned DMs, compelling them to generate images containing the erased content. They evaluate their attack against several state-of-the-art unlearned DMs across three unlearning tasks: concept, style, and object unlearning. The effectiveness of the attack is measured by its success rate in generating images classified as containing the unlearned concepts, styles, or objects.
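The core idea can be illustrated with a toy sketch (hypothetical shapes and modules, not the authors' released code): the unlearned model is kept frozen, a target image containing the erased content is noised as in standard diffusion training, and a continuous relaxation of the adversarial prompt tokens is optimized to minimize the model's noise-prediction loss on that image. Because the diffusion loss itself acts as the ‘diffusion classifier’ score, no auxiliary classifier or diffusion model is needed.

```python
# Toy sketch of a diffusion-classifier-style adversarial prompt attack.
# ToyTextEncoder/ToyDenoiser are stand-ins for a CLIP text encoder and a
# latent-diffusion UNet; all shapes and names here are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class ToyTextEncoder(nn.Module):
    def __init__(self, token_dim=32, cond_dim=32):
        super().__init__()
        self.proj = nn.Linear(token_dim, cond_dim)

    def forward(self, token_embeds):                   # (B, T, token_dim) soft tokens
        return self.proj(token_embeds).mean(dim=1)     # (B, cond_dim) pooled condition

class ToyDenoiser(nn.Module):
    def __init__(self, img_dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + cond_dim + 1, 128),
            nn.ReLU(),
            nn.Linear(128, img_dim),
        )

    def forward(self, x_t, t, cond):                   # predict the added noise
        t_feat = t.float().unsqueeze(-1) / 1000.0      # (B, 1) timestep feature
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

text_encoder, denoiser = ToyTextEncoder(), ToyDenoiser()
for p in list(text_encoder.parameters()) + list(denoiser.parameters()):
    p.requires_grad_(False)                            # the unlearned DM stays frozen

img_dim, n_adv_tokens = 64, 5
x0 = torch.randn(1, img_dim)                           # target image with erased content

# Continuous relaxation of the adversarial prompt tokens (the only thing optimized).
adv_embeds = nn.Parameter(torch.randn(1, n_adv_tokens, 32) * 0.01)
opt = torch.optim.Adam([adv_embeds], lr=1e-2)

for step in range(200):
    t = torch.randint(1, 1000, (1,))
    noise = torch.randn_like(x0)
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2   # toy noise schedule
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

    cond = text_encoder(adv_embeds)
    # Diffusion-classifier objective: make the frozen unlearned model denoise the
    # target image well under the adversarial prompt, i.e. minimize the standard
    # noise-prediction loss -- no external classifier or extra DM required.
    loss = F.mse_loss(denoiser(x_t, t, cond), noise)

    opt.zero_grad()
    loss.backward()
    opt.step()

# In the real attack, the optimized embeddings would be projected back to discrete
# vocabulary tokens and prepended to the benign prompt before image generation.
```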
Result
The results demonstrate that the proposed attack bypasses the safety mechanisms of a range of unlearned DMs: it generates images classified as containing the erased concepts, styles, or objects at high success rates. The attack is also computationally efficient, since it requires no auxiliary diffusion or classification models. Overall, the results reveal that current safety-driven unlearning techniques lack robustness against adversarial prompt attacks.
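As a rough illustration of the metric (not the authors' evaluation code), the attack success rate is the fraction of generated images in which a post-hoc detector, e.g. a concept or style classifier, still finds the erased content; `attack_success_rate` and `toy_detector` below are hypothetical names.

```python
# Hypothetical attack-success-rate (ASR) computation over generated images.
from typing import Callable, List

def attack_success_rate(images: List[object],
                        detects_concept: Callable[[object], bool]) -> float:
    """Fraction of generated images in which the erased concept is detected."""
    if not images:
        return 0.0
    hits = sum(1 for img in images if detects_concept(img))
    return hits / len(images)

# Example with a toy detector standing in for a real concept/style classifier.
fake_images = ["img_0", "img_1", "img_2", "img_3"]
toy_detector = lambda img: img in {"img_1", "img_3"}
print(attack_success_rate(fake_images, toy_detector))  # 0.5
```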
Limitations & Future Work
The authors acknowledge that their work primarily focuses on evaluating the robustness of unlearned DMs against adversarial prompts, leaving other attack vectors unexplored. They suggest future work could investigate the robustness against attacks on other aspects of DMs, such as the noise generation process or the latent image representation. Additionally, they emphasize the need for developing more robust unlearning methods for DMs to address the vulnerabilities exposed by their attack.
Abstract
The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigated the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models. Through extensive benchmarking, we evaluate the robustness of five widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs. Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper contains model outputs that may be offensive in nature.