{"id":972675,"date":"2023-10-04T10:18:38","date_gmt":"2023-10-04T17:18:38","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=972675"},"modified":"2023-10-04T15:35:48","modified_gmt":"2023-10-04T22:35:48","slug":"whos-harry-potter-making-llms-forget-2","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/whos-harry-potter-making-llms-forget-2\/","title":{"rendered":"Who’s Harry Potter? Making LLMs forget"},"content":{"rendered":"\n
Ronen Eldan (Microsoft Research) and Mark Russinovich (Azure)<\/em><\/p>\n\n\n\n The Challenge of Unlearning in an AI Era<\/strong> <\/p>\n\n\n\n Over the last few months, significant public attention has focused on a wide variety of questions related to the data used to train large language models (LLMs). The discussion largely centers on copyright and extends to concerns about private information, biased content, false data, and even toxic or harmful elements. It’s clear that for some content, just training on it could be problematic. What do we do if we realize that some of our training data needs to be removed after the LLM has already been trained?<\/p>\n\n\n\n Can Machines Really Forget?<\/strong> <\/p>\n\n\n\n Fine-tuning an LLM to incorporate new information is known to be straightforward, but how do we make it forget that information? Simply put, unlearning isn’t as straightforward as learning. As an analogy, imagine trying to remove specific ingredients from a baked cake: it seems nearly impossible. Fine-tuning can introduce new flavors to the cake, but removing a specific ingredient? That’s a tall order.<\/p>\n\n\n\n Moreover, the cost of retraining can be astronomical: training massive models can cost tens of millions of dollars or more. Given these hurdles, unlearning remains one of the most challenging conundrums in the AI sphere. There’s skepticism in the community around its feasibility. Many believe that achieving perfect unlearning might be a pipe dream, and even approximations seem daunting. Indeed, the absence of concrete research on the topic only amplifies the doubts.<\/p>\n\n\n\n A New Dawn: Forgetting Harry Potter<\/strong> <\/p>\n\n\n\n In a new paper (opens in new tab)<\/span><\/a>, we set out to do what we initially thought might be impossible: make the Llama2-7b model, trained by Meta, forget the magical realm of Harry Potter. 
Several sources (opens in new tab)<\/span><\/a> claim that this model’s training data included the “books3” dataset, which contains the Harry Potter books among many other copyrighted works (including the novels written by a co-author of this work). To illustrate the depth of the model’s recall, consider this: give the original model a very generic-looking prompt such as “When Harry went back to school that fall,” and it continues with a detailed story set in J.K. Rowling’s universe.<\/p>\n\n\n\n However, with our proposed technique, we drastically altered its responses. Let’s look at a few example prompts and compare the completions given by the original Llama2-7b model with the ones given by our fine-tuned model: <\/p>\n\n\n\n