{"id":934452,"date":"2023-04-19T09:08:44","date_gmt":"2023-04-19T16:08:44","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/unifying-learning-from-preferences-and-demonstration-via-a-ranking-game-for-imitation-learning\/"},"modified":"2023-04-19T09:41:16","modified_gmt":"2023-04-19T16:41:16","slug":"unifying-learning-from-preferences-and-demonstration-via-a-ranking-game-for-imitation-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/unifying-learning-from-preferences-and-demonstration-via-a-ranking-game-for-imitation-learning\/","title":{"rendered":"Unifying learning from preferences and demonstration via a ranking game for imitation learning"},"content":{"rendered":"\n
\"Rank<\/figure>\n\n\n\n

For many people, opening door handles or moving a pen between their fingers are movements that happen many times a day, often without much thought. For a robot, however, these movements aren't always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or *state*. For the movements described above, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot, since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.
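To make the sparsity concrete, here is a minimal sketch of such a reward function in Python; the `DoorState` type, its `door_angle` field, and the threshold value are hypothetical names chosen purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class DoorState:
    door_angle: float  # hinge angle in radians (hypothetical state field)

def sparse_reward(state: DoorState) -> float:
    """Sparse reward: +1 once the door counts as open, 0 otherwise."""
    # The 1.0-radian "open" threshold is an illustrative choice.
    return 1.0 if state.door_angle > 1.0 else 0.0

# Nearly every state the robot visits during exploration returns 0,
# which is why learning from this signal alone takes so much trial and error.
print(sparse_reward(DoorState(door_angle=0.2)))  # 0.0
print(sparse_reward(DoorState(door_angle=1.3)))  # 1.0
```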

The imitation learning (IL) paradigm was introduced to mitigate this trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert, from which it can try to learn the task and possibly gain information about the expert's reward function, or the expert's *intent*, similar to how people pick up various skills. Yet learning remains difficult when we only have access to the change the expert enacted in the world, known as the *expert observation*, and not the precise actions the expert took to achieve it (the sketch below contrasts the two data formats). Another difficulty is that even with infinite expert demonstrations, the robot can't fully reason about the expert's intent, that is, compare whether one of its own learned behaviors is closer to the expert's than another, because it only knows the *best* behavior and has no notion of ordering over other behaviors.
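The difference between full demonstrations and expert observations shows up directly in the shape of the data. Here is a minimal sketch, with all type and field names hypothetical:

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]

@dataclass
class StateActionDemo:
    """Standard IL demonstration: states paired with the expert's actions."""
    states: List[Vector]
    actions: List[Vector]

@dataclass
class ObservationOnlyDemo:
    """Expert-observation setting: only the visited states are recorded;
    the actions that produced them must be inferred by the learner."""
    states: List[Vector]
```

Note that either dataset, however large, records only the expert's best behavior; neither carries the ranking information that would let the robot compare two of its own imperfect behaviors.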


In our paper “A Ranking Game for Imitation Learning,” published in Transactions on Machine Learning Research (TMLR) in 2023, we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, under which we propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled us to train a robot that can open a door and move a pen between its fingers in simulation, a first in imitation learning from expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT improve themselves by learning a reward function inferred from preferences over several samples of model responses, in addition to learning from desired human conversational data.
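For intuition about how a reward function can be learned from ranked behaviors, here is a minimal PyTorch sketch of a generic Bradley-Terry-style pairwise ranking loss. To be clear, this is a common preference-learning objective shown for illustration, not the specific ranking objective proposed in the paper, and all names and dimensions are made up:

```python
import torch
import torch.nn as nn

# Tiny reward model: maps a 4-dimensional state to a scalar reward.
reward_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))

def ranking_loss(traj_lo: torch.Tensor, traj_hi: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: the learned return of the preferred
    trajectory (traj_hi) should exceed that of traj_lo."""
    r_lo = reward_net(traj_lo).sum()  # summed reward over the trajectory
    r_hi = reward_net(traj_hi).sum()
    # Negative log-probability that traj_hi is ranked above traj_lo.
    return -torch.nn.functional.logsigmoid(r_hi - r_lo)

# One preference pair: two trajectories of 10 states each.
traj_lo, traj_hi = torch.randn(10, 4), torch.randn(10, 4)
loss = ranking_loss(traj_lo, traj_hi)
loss.backward()  # gradients flow into reward_net, improving its ranking
```

Trained this way, the reward model induces an ordering over behaviors rather than only flagging the single best one, which is exactly the extra information that demonstrations alone cannot provide.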
