{"id":1079043,"date":"2024-09-03T12:07:10","date_gmt":"2024-09-03T19:07:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1079043"},"modified":"2024-09-03T12:07:11","modified_gmt":"2024-09-03T19:07:11","slug":"direct-nash-optimization-teaching-language-models-to-self-improve-with-general-preferences","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/direct-nash-optimization-teaching-language-models-to-self-improve-with-general-preferences\/","title":{"rendered":"Direct Nash Optimization: Teaching language models to self-improve with general preferences"},"content":{"rendered":"\n

Presented by Corby Rosset at Microsoft Research Forum, September 2024

\"Corby<\/figure>
\n
\n

“The traditional way to fine-tune an LLM for post-training … basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. … Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.”
– Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers
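
To make that contrast concrete, here is a toy, self-contained sketch of the two regimes; this is illustrative only, not code from the Direct Nash Optimization paper. The "policy" is just a probability table over two canned responses, and `preference_oracle`, `sft_step`, and `self_improve_step` are hypothetical stand-ins: traditional fine-tuning pushes probability toward one demonstrated good answer, while the self-improving step compares the model's own outputs and explicitly moves probability away from the dispreferred one.

```python
def normalize(policy):
    # Keep the toy policy a valid probability distribution.
    z = sum(policy.values())
    return {resp: p / z for resp, p in policy.items()}

def sft_step(policy, demonstration, lr=0.1):
    # "Emulate good behaviors": push probability toward one good example,
    # without ever inspecting the model's own mistakes.
    policy[demonstration] += lr
    return normalize(policy)

def preference_oracle(a, b):
    # Hypothetical stand-in for a general preference function
    # (e.g., a strong annotator model); here it simply prefers concision.
    return len(a) < len(b)

def self_improve_step(policy, lr=0.1):
    # Compare the model's own outputs and explicitly move probability
    # away from the dispreferred (mistaken) response.
    a, b = list(policy)  # the two candidate responses
    winner, loser = (a, b) if preference_oracle(a, b) else (b, a)
    policy[winner] += lr
    policy[loser] = max(0.0, policy[loser] - lr)
    return normalize(policy)

policy = {"a concise answer": 0.3, "a rambling, overlong answer": 0.7}
policy = sft_step(policy, "a concise answer")  # imitation only
for _ in range(10):
    policy = self_improve_step(policy)         # learn from own mistakes
print(policy)  # mass shifts decisively toward the preferred answer
```

Running the loop drives essentially all probability onto the preferred response, which is the sense in which the model identifies and corrects its own mistakes rather than only imitating good examples.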
