{"id":934452,"date":"2023-04-19T09:08:44","date_gmt":"2023-04-19T16:08:44","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/unifying-learning-from-preferences-and-demonstration-via-a-ranking-game-for-imitation-learning\/"},"modified":"2023-04-19T09:41:16","modified_gmt":"2023-04-19T16:41:16","slug":"unifying-learning-from-preferences-and-demonstration-via-a-ranking-game-for-imitation-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/unifying-learning-from-preferences-and-demonstration-via-a-ranking-game-for-imitation-learning\/","title":{"rendered":"Unifying learning from preferences and demonstration via a ranking game for imitation learning"},"content":{"rendered":"\n
\"Rank<\/figure>\n\n\n\n

For many people, opening door handles or moving a pen between their fingers are movements that happen many times a day, often without much thought. For a robot, however, these movements aren't always so easy.

In reinforcement learning, robots learn to perform tasks by exploring their environments, receiving signals along the way that indicate how good their behavior is compared to the desired outcome, or *state*. For the movements described above, for example, we can specify a reward function that is +1 when the door is successfully opened or the pen is at the desired orientation and 0 otherwise. But this makes the learning task complicated for the robot, since it has to try out various motions before stumbling on the successful outcome, or a reward of +1.
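To make the sparsity concrete, here is a minimal sketch of such a reward function in Python; the `DoorState` type, its `door_angle` field, and the threshold value are hypothetical names chosen purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class DoorState:
    door_angle: float  # hinge angle in radians (hypothetical state field)

def sparse_reward(state: DoorState) -> float:
    """Sparse reward: +1 once the door counts as open, 0 otherwise."""
    # The 1.0-radian "open" threshold is an illustrative choice.
    return 1.0 if state.door_angle > 1.0 else 0.0

# Nearly every state the robot visits during exploration returns 0,
# which is why learning from this signal alone takes so much trial and error.
print(sparse_reward(DoorState(door_angle=0.2)))  # 0.0
print(sparse_reward(DoorState(door_angle=1.3)))  # 1.0
```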

The imitation learning (IL) paradigm was introduced to mitigate this trial and error. In IL, the robot is provided with demonstrations of a given task performed by an expert, from which it can try to learn the task and possibly gain information about the expert's reward function, or the expert's *intent*, similar to how people pick up various skills. Yet learning remains difficult when we only have access to the change the expert enacted in the world, known as the *expert observation*, and not the precise actions the expert took to achieve it (the sketch below contrasts the two data formats). Another difficulty is that even with infinite expert demonstrations, the robot can't fully reason about the expert's intent, that is, compare whether one of its own learned behaviors is closer to the expert's than another, because it only knows the *best* behavior and has no notion of ordering over other behaviors.
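The difference between full demonstrations and expert observations shows up directly in the shape of the data. Here is a minimal sketch, with all type and field names hypothetical:

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]

@dataclass
class StateActionDemo:
    """Standard IL demonstration: states paired with the expert's actions."""
    states: List[Vector]
    actions: List[Vector]

@dataclass
class ObservationOnlyDemo:
    """Expert-observation setting: only the visited states are recorded;
    the actions that produced them must be inferred by the learner."""
    states: List[Vector]
```

Note that either dataset, however large, records only the expert's best behavior; neither carries the ranking information that would let the robot compare two of its own imperfect behaviors.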


In our paper “A Ranking Game for Imitation Learning,” published in Transactions on Machine Learning Research (TMLR) in 2023, we propose a simple and intuitive framework, \(\texttt{rank-game}\), that unifies learning from expert demonstrations and preferences by generalizing a key approach to imitation learning. Giving robots the ability to learn from preferences, obtained by having an expert rank which behavior aligns better with their objectives, allows the learning of more informative reward functions. Our approach, under which we propose a new objective for training over behavior preferences, makes the learning process easier for a robot and achieves state-of-the-art results in imitation learning. It also enabled us to train a robot that can open a door and move a pen between its fingers in simulation, a first in imitation learning from expert observations alone. The incorporation of preferences has also seen success in language modeling, where chatbots such as ChatGPT improve themselves by learning a reward function inferred from preferences over several samples of model responses, in addition to learning from desired human conversational data.
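For intuition about how a reward function can be learned from ranked behaviors, here is a minimal PyTorch sketch of a generic Bradley-Terry-style pairwise ranking loss. To be clear, this is a common preference-learning objective shown for illustration, not the specific ranking objective proposed in the paper, and all names and dimensions are made up:

```python
import torch
import torch.nn as nn

# Tiny reward model: maps a 4-dimensional state to a scalar reward.
reward_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))

def ranking_loss(traj_lo: torch.Tensor, traj_hi: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: the learned return of the preferred
    trajectory (traj_hi) should exceed that of traj_lo."""
    r_lo = reward_net(traj_lo).sum()  # summed reward over the trajectory
    r_hi = reward_net(traj_hi).sum()
    # Negative log-probability that traj_hi is ranked above traj_lo.
    return -torch.nn.functional.logsigmoid(r_hi - r_lo)

# One preference pair: two trajectories of 10 states each.
traj_lo, traj_hi = torch.randn(10, 4), torch.randn(10, 4)
loss = ranking_loss(traj_lo, traj_hi)
loss.backward()  # gradients flow into reward_net, improving its ranking
```

Trained this way, the reward model induces an ordering over behaviors rather than only flagging the single best one, which is exactly the extra information that demonstrations alone cannot provide.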
