AI Interaction and Learning

Articles

Research Forum | Episode 4 Talk 2 | Corby Rosset

Direct Nash Optimization: Teaching language models to self-improve with general preferences

September 3, 2024

This talk discusses teaching language models to self-improve using a preference oracle like GPT-4, framing it as a two-player game to find an optimal policy at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4 Turbo on benchmarks such…