Recovering the Pre-Fine-Tuning Weights of Generative Models
Authors: Eliahu Horwitz, Jonathan Kahana, Yedid Hoshen
What
This paper introduces the task of “Pre-Fine-Tuning Weight Recovery”, a novel attack vector targeting fine-tuned models. It presents “Spectral DeTuning”, an effective method for recovering the original pre-trained weights from multiple LoRA fine-tuned versions of the same model.
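As a rough formulation of the setup (notation ours, not taken verbatim from the paper): each fine-tuned copy is assumed to share the pre-fine-tuning weights and differ only by a rank-r LoRA update, so recovery can be posed as a low-rank-constrained least-squares problem.

```latex
% Assumed LoRA setting (notation ours): n fine-tuned copies of one layer
W_i = W^* + B_i A_i, \qquad \operatorname{rank}(B_i A_i) \le r, \qquad i = 1, \dots, n
% Pre-fine-tuning weight recovery posed as constrained least squares
\min_{W^*, \{M_i\}} \; \sum_{i=1}^{n} \left\lVert W_i - W^* - M_i \right\rVert_F^2
\quad \text{s.t.} \quad \operatorname{rank}(M_i) \le r
```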
Why
This paper highlights a critical vulnerability in the current paradigm of model fine-tuning, which is particularly relevant given the growing popularity of LoRA and the common practice of releasing many fine-tuned flavors of a single foundation model. It demonstrates that widely used models like Mistral and Stable Diffusion are susceptible to this attack, potentially compromising safety and alignment efforts.
How
The authors propose “Spectral DeTuning”, an iterative, gradient-free algorithm that leverages low-rank matrix factorization to recover the pre-fine-tuning weights. A rank scheduler improves optimization stability and speeds up convergence. The method is evaluated on “LoWRA Bench”, a newly introduced benchmark comprising diverse models (ViT, Stable Diffusion, and Mistral) fine-tuned for various tasks.
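A minimal sketch of such an alternating spectral recovery loop, assuming the LoRA setting above. This is not the authors' exact implementation: the linear rank ramp stands in for the paper's rank scheduler, and all function names and hyperparameters are illustrative.

```python
import numpy as np

def truncated_svd(matrix, rank):
    """Best rank-`rank` approximation of `matrix` (Eckart-Young) via SVD."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def spectral_detuning(finetuned_weights, lora_rank, n_iters=300):
    """Estimate the shared pre-fine-tuning matrix W* from LoRA copies W_i = W* + B_i A_i.

    Alternates between:
      1) fixing W* and projecting each residual W_i - W* onto rank-r matrices,
      2) fixing the low-rank residuals and averaging W_i - M_i to update W*.
    """
    W = [np.asarray(w, dtype=float) for w in finetuned_weights]
    W_star = np.mean(W, axis=0)  # initialization: plain average of the copies
    for t in range(n_iters):
        # rank scheduler (assumed form): grow the working rank toward the LoRA rank
        r = max(1, int(round(lora_rank * (t + 1) / n_iters)))
        # step 1: low-rank projection of each residual
        M = [truncated_svd(W_i - W_star, r) for W_i in W]
        # step 2: closed-form least-squares update of the shared weights
        W_star = np.mean([W_i - M_i for W_i, M_i in zip(W, M)], axis=0)
    return W_star

# Toy usage: 5 rank-4 LoRA perturbations of a random 64x64 layer
rng = np.random.default_rng(0)
W_true = rng.standard_normal((64, 64))
copies = [W_true + rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
          for _ in range(5)]
W_hat = spectral_detuning(copies, lora_rank=4)
print(np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true))  # relative error
```

In this sketch each sub-step is gradient-free: the rank-r projection is solved exactly by truncated SVD and the shared-weight update has a closed form, mirroring the iterative, factorization-based character the summary describes.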
Result
Spectral DeTuning successfully recovers pre-fine-tuning weights with high precision across different models and tasks. It outperforms baseline methods, achieving near-perfect semantic convergence for ViT and effectively reversing personalization in Stable Diffusion and alignment in Mistral, as demonstrated by semantic evaluation metrics. The rank scheduler significantly improves convergence speed and accuracy.
Limitations & Future Work
The authors acknowledge limitations such as the requirement for multiple LoRA fine-tuned models that share a known, constant rank, and the assumption that these models are publicly available. Future work includes exploring attacks on models with varying LoRA ranks, extending the attack to other fine-tuning methods, and, most importantly, developing defenses against pre-fine-tuning weight recovery attacks.
Abstract
The dominant paradigm in generative modeling consists of two steps: i) pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained model with human values via fine-tuning. This practice is considered safe, as no current method can recover the unsafe, pre-fine-tuning model weights. In this paper, we demonstrate that this assumption is often false. Concretely, we present Spectral DeTuning, a method that can recover the weights of the pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In contrast to previous attacks that attempt to recover pre-fine-tuning capabilities, our method aims to recover the exact pre-fine-tuning weights. Our approach exploits this new vulnerability against large-scale models such as a personalized Stable Diffusion and an aligned Mistral.