Shocking Fine-Tuning Unlocks AI’s “Forbidden Memories”! The “Whack-a-Mole” Phenomenon of Memorizing Copyrighted Books Revealed!
📰 News Overview
- Visualizing Copyright Leakage Through Fine-Tuning: Fine-tuning LLMs with specific book summaries has proven that these models can activate their ability to output copyrighted texts verbatim, which should have been restricted.
- Thorough Validation of Latest Models: Validation code using cutting-edge models like GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 has been released, employing works like Cormac McCarthy’s “The Road.”
- Proposal for New Evaluation Metrics: Four new metrics have been introduced to quantify how much of the original work the model “spits out,” including “BMC@k” to measure memorization and “Longest Continuous Memorization Block.”
💡 Key Points
- The “Whack-a-Mole” Problem: This phenomenon refers to the alarming reality that data, which should be protected by safe alignment, can be triggered by slight additional training, leading to unexpected copyright leakage.
- Provision of a Concrete Pipeline: A series of scripts have been released that extract text from EPUB files, generate summary data for training using GPT-4o, and conduct fine-tuning using various APIs (OpenAI, Vertex AI, Tinker).
- High Reproducibility: Tests with 100 generations at a temperature parameter of 1.0 confirmed that fragments of copyrighted material are output at a statistically significant level.
🦈 Shark’s Eye (Curator’s Perspective)
What’s impressive about this study is not just the alarm it raises over potential leaks, but how concretely it implements the mechanisms of how these leaks occur! The method of instructing GPT-4o to “write text based on summaries while mimicking a specific author’s style” and using it as fine-tuning training data highlights a significant risk, given this approach is often used in practice. The fact that this phenomenon also occurs with low-cost training using LoRA (Rank=32) with DeepSeek-V3.1 is an issue that cannot be ignored by operators of open models!
🚀 What’s Next?
Providers of these models will need to impose stricter filtering on fine-tuning datasets. Furthermore, metrics like “BMC@k” may become standard for evaluating the safety benchmarks (guardrails) of future AI models.
💬 A Word from HaruShark
Once something is learned, even if you pretend to forget it, it can still resurface when poked… Just like sharks, AI won’t forget the taste of a delicious catch! 🦈🔥
📚 Term Explanations
-
BMC@k: A new memorization evaluation metric that measures what percentage of the original book is covered by the sequence of extracted words.
-
LoRA: Low-Rank Adaptation. A technique that adds small matrices for efficient fine-tuning instead of updating the entire model.
-
Tinker: An API platform and environment for fine-tuning and executing models like DeepSeek-V3.1.
-
Source: Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs