Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models

Authors: Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, Nicholas Carlini

What

This paper introduces a privacy backdoor attack that amplifies membership inference by poisoning pre-trained models, making it easier to determine which data points were used to fine-tune the model.

Why

This paper is important as it highlights a significant privacy vulnerability in the current machine learning paradigm of using open-source pre-trained models. It demonstrates that an adversary can poison these models to leak private information from the fine-tuning datasets, raising serious concerns about data security and the trustworthiness of pre-trained models.

How

The authors poison the pre-trained model's weights so that the loss on chosen target data points is either maximized or minimized. This loss anomaly makes members of the fine-tuning set easier to distinguish from non-members after fine-tuning. They test the attack on various models, including CLIP for vision tasks and GPT-Neo and ClinicalBERT for language tasks, using different datasets and evaluating effectiveness under different fine-tuning methods and inference strategies.
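Concretely, the poisoning step can be thought of as continued training with an inverted loss on the target points and a normal loss on ordinary data. The snippet below is a minimal PyTorch sketch of that idea, not the authors' exact procedure: the model and loader names, the weighting `alpha`, and the choice to maximize (rather than minimize) the target loss are all illustrative assumptions.

```python
# Hedged sketch of weight poisoning for a privacy backdoor.
# Assumed inputs (illustrative, not from the paper): a classifier `model`,
# a loader of target points `target_loader`, a loader of ordinary data
# `clean_loader`, and a weighting `alpha` between the two objectives.
import torch
import torch.nn.functional as F

def poison_pretrained_model(model, target_loader, clean_loader,
                            alpha=1.0, lr=1e-4, steps=1000, device="cuda"):
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    target_iter, clean_iter = iter(target_loader), iter(clean_loader)
    for _ in range(steps):
        try:
            xt, yt = next(target_iter)
        except StopIteration:
            target_iter = iter(target_loader)
            xt, yt = next(target_iter)
        try:
            xc, yc = next(clean_iter)
        except StopIteration:
            clean_iter = iter(clean_loader)
            xc, yc = next(clean_iter)
        xt, yt = xt.to(device), yt.to(device)
        xc, yc = xc.to(device), yc.to(device)

        # Push the loss on target points up (an anomalously high starting loss),
        # while keeping ordinary behaviour on clean data so the model still
        # looks like a normal pre-trained checkpoint.
        target_loss = F.cross_entropy(model(xt), yt)
        clean_loss = F.cross_entropy(model(xc), yc)
        loss = clean_loss - alpha * target_loss

        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

After a victim fine-tunes the poisoned checkpoint, the loss on a target point that was actually in the fine-tuning set drops sharply, while a non-member target keeps its anomalous loss, which is the signal the membership inference exploits.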

Result

The attack significantly improves the success rate of membership inference attacks, increasing the true positive rate while maintaining a low false positive rate. The attack is effective across different models, fine-tuning methods, and inference strategies, highlighting its robustness and broad applicability. Interestingly, the attack also amplifies privacy leakage for non-target data points from the same distribution. The paper also finds that larger models are more vulnerable to this attack.
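The headline metric here, true positive rate at a low false positive rate, can be computed from per-example membership scores (for instance, the negative loss of each candidate point after fine-tuning). Below is a small NumPy sketch under that assumption; the function name, the score convention, and the 0.1% FPR budget are illustrative choices, not the paper's exact evaluation code.

```python
# Sketch of the standard MIA metric: TPR at a fixed low FPR.
# Assumes higher score = more likely to be a member (e.g., negative loss).
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, fpr_budget=0.001):
    """Return the true positive rate when the decision threshold is set
    so that at most `fpr_budget` of non-members are flagged as members."""
    member_scores = np.asarray(member_scores)
    nonmember_scores = np.asarray(nonmember_scores)
    # Threshold chosen from the non-member score distribution.
    threshold = np.quantile(nonmember_scores, 1.0 - fpr_budget)
    return float(np.mean(member_scores > threshold))

# Example usage with synthetic scores:
# rng = np.random.default_rng(0)
# print(tpr_at_fpr(rng.normal(1.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000)))
```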

Limitations and Future Work

The paper acknowledges limitations regarding the attack’s sensitivity to the number of fine-tuning steps and the trade-off between model stealthiness and attack performance. Future work includes exploring more advanced poisoning techniques and defenses against this attack, such as robust fine-tuning methods and more rigorous validation of pre-trained models.

Abstract

It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the privacy leakage that arises when fine-tuning a model: when a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. We conduct extensive experiments on various datasets and models, including both vision-language models (CLIP) and large language models, demonstrating the broad applicability and effectiveness of such an attack. Additionally, we carry out multiple ablation studies with different fine-tuning methods and inference strategies to thoroughly analyze this new threat. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.