{"id":1079043,"date":"2024-09-03T12:07:10","date_gmt":"2024-09-03T19:07:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1079043"},"modified":"2024-09-03T12:07:11","modified_gmt":"2024-09-03T19:07:11","slug":"direct-nash-optimization-teaching-language-models-to-self-improve-with-general-preferences","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/direct-nash-optimization-teaching-language-models-to-self-improve-with-general-preferences\/","title":{"rendered":"Direct Nash Optimization: Teaching language models to self-improve with general preferences"},"content":{"rendered":"\n

Presented by Corby Rosset at Microsoft Research Forum, September 2024

\"Corby<\/figure>
\n
\n

“The traditional way to fine-tune an LLM for post-training … basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. … Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.”
– Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers
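
To make that contrast concrete, here is a toy, self-contained sketch of the two regimes; this is illustrative only, not code from the Direct Nash Optimization paper. The "policy" is just a probability table over two canned responses, and `preference_oracle`, `sft_step`, and `self_improve_step` are hypothetical stand-ins: traditional fine-tuning pushes probability toward one demonstrated good answer, while the self-improving step compares the model's own outputs and explicitly moves probability away from the dispreferred one.

```python
def normalize(policy):
    # Keep the toy policy a valid probability distribution.
    z = sum(policy.values())
    return {resp: p / z for resp, p in policy.items()}

def sft_step(policy, demonstration, lr=0.1):
    # "Emulate good behaviors": push probability toward one good example,
    # without ever inspecting the model's own mistakes.
    policy[demonstration] += lr
    return normalize(policy)

def preference_oracle(a, b):
    # Hypothetical stand-in for a general preference function
    # (e.g., a strong annotator model); here it simply prefers concision.
    return len(a) < len(b)

def self_improve_step(policy, lr=0.1):
    # Compare the model's own outputs and explicitly move probability
    # away from the dispreferred (mistaken) response.
    a, b = list(policy)  # the two candidate responses
    winner, loser = (a, b) if preference_oracle(a, b) else (b, a)
    policy[winner] += lr
    policy[loser] = max(0.0, policy[loser] - lr)
    return normalize(policy)

policy = {"a concise answer": 0.3, "a rambling, overlong answer": 0.7}
policy = sft_step(policy, "a concise answer")  # imitation only
for _ in range(10):
    policy = self_improve_step(policy)         # learn from own mistakes
print(policy)  # mass shifts decisively toward the preferred answer
```

Running the loop drives essentially all probability onto the preferred response, which is the sense in which the model identifies and corrects its own mistakes rather than only imitating good examples.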
