Scalable Extraction of Training Data from (Production) Language Models
Authors: Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
What
This paper investigates “extractable memorization” in large language models, focusing on the ability of adversaries to extract training data from these models without prior knowledge of the training set.
Why
The paper highlights the significant privacy implications of training large language models: even aligned models like ChatGPT can leak substantial amounts of training data, including personally identifiable information (PII). This raises concerns about the privacy of training data and about whether current alignment techniques actually prevent data extraction.
How
The authors develop a scalable methodology for detecting memorization: they generate large volumes of model output and match it against publicly available web-scale datasets using suffix arrays. For aligned models like ChatGPT, they introduce a novel "divergence" attack that asks the model to repeat a single word indefinitely; the model eventually diverges from its conversational style and emits training data at a much higher rate. They also employ a Good-Turing estimator to extrapolate total memorization from the rate at which unique memorized outputs appear.
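The following is a minimal sketch, not the authors' released code, of the two quantitative pieces described above: checking whether a model generation appears verbatim in a reference corpus via a suffix array, and using a Good-Turing estimator to gauge how much memorization remains unseen. Names such as `corpus`, `generations`, and `MIN_MATCH_CHARS` are illustrative assumptions; the real pipeline operates over token sequences and terabytes of web text.

```python
from collections import Counter

MIN_MATCH_CHARS = 50  # assumed threshold: a "memorized" hit is a long verbatim match


def build_suffix_array(text: str) -> list[int]:
    """Return suffix start offsets in lexicographic order (quadratic toy construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains_substring(text: str, sa: list[int], query: str) -> bool:
    """Binary-search the suffix array for a suffix that starts with `query`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query


def count_memorized(generations: list[str], corpus: str) -> Counter:
    """Count, per distinct probe string, how many generations match the corpus verbatim."""
    sa = build_suffix_array(corpus)
    hits: Counter = Counter()
    for g in generations:
        probe = g[:MIN_MATCH_CHARS]
        if len(probe) == MIN_MATCH_CHARS and contains_substring(corpus, sa, probe):
            hits[probe] += 1
    return hits


def good_turing_unseen_mass(hits: Counter) -> float:
    """Good-Turing estimate of the probability that the next memorized string is new:
    P(unseen) ~= N1 / N, where N1 is the number of strings seen exactly once."""
    n_total = sum(hits.values())
    n_once = sum(1 for c in hits.values() if c == 1)
    return n_once / n_total if n_total else 0.0
```

In practice the matching step requires an external-memory suffix array built over terabytes of web text; the quadratic construction above is only meant to make the lookup logic concrete.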
Result
The authors find that all models tested, spanning open-source, semi-closed, and closed (API-based) models, exhibit extractable memorization, and that larger, more capable models are more vulnerable to data extraction attacks. Notably, the divergence attack shows that ChatGPT is far more susceptible to data extraction than previously thought, leaking gigabytes of training data, including PII. Certain words are also more effective than others at triggering divergence and eliciting memorized output. The study demonstrates that current alignment techniques do not eliminate memorization and that discoverable memorization is a useful but imperfect proxy for extractable memorization.
Limitations and Future Work
The authors acknowledge that their analysis may underestimate the true memorization rate due to limitations in the size and coverage of their auxiliary dataset. They also note that their attack on ChatGPT is specific to this model and may not generalize to other aligned chatbots. Future work could investigate the effectiveness of data deduplication techniques in mitigating memorization, explore the relationship between model capacity and memorization, and develop more generalizable attacks to assess the privacy of black-box RLHF-aligned models.
Abstract
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.