{"id":937629,"date":"2023-05-04T09:00:00","date_gmt":"2023-05-04T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=937629"},"modified":"2023-05-04T08:17:52","modified_gmt":"2023-05-04T15:17:52","slug":"inferring-rewards-through-interaction","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/inferring-rewards-through-interaction\/","title":{"rendered":"Inferring rewards through interaction"},"content":{"rendered":"\n

This research was accepted by the 2023 International Conference on Learning Representations (ICLR)<\/span><\/a>, which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.<\/em><\/p>\n\n\n\n

<\/figure>\n\n\n\n

Reinforcement learning (RL) hinges on the power of rewards, driving agents<\/em>\u2014the models doing the learning\u2014to explore and learn valuable actions. The feedback received through rewards shapes their behavior, culminating in effective policies. Yet crafting reward functions is a complex, laborious task, even for experts. A more appealing option, particularly for the people ultimately using systems that learn from feedback over time, is an agent that can automatically infer a reward function. The interaction-grounded learning (IGL) paradigm<\/a> from Microsoft Research enables agents to infer rewards through the very process of interaction, using diverse feedback signals rather than explicit numeric rewards. Although the agent never observes an explicit reward signal, the feedback it receives depends on a binary latent reward, and the agent learns a policy that maximizes this unseen latent reward using only environmental feedback.<\/p>\n\n\n\n
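As a concrete illustration of this setup, the sketch below simulates an IGL-style interaction loop: the learner sees a context, picks an action, and receives only a feedback vector whose distribution depends on a hidden binary reward. The function names and the feedback model here are hypothetical illustrations, not the paper\u2019s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_reward(context, action):
    # Binary latent reward; the learner never observes this directly.
    return int(action == context % 3)

def feedback(r):
    # The environment emits a feedback vector whose distribution depends
    # on the latent reward r, rather than emitting a numeric reward.
    center = 1.0 if r else -1.0
    return rng.normal(center, 1.0, size=4)

for _ in range(5):
    x = int(rng.integers(0, 10))   # context describing the situation/user
    a = int(rng.integers(0, 3))    # action chosen by a (here random) policy
    r = latent_reward(x, a)        # hidden from the learner
    y = feedback(r)                # the learner sees only (x, a, y)
```

The learner\u2019s task is to recover a policy maximizing the hidden r from (x, a, y) tuples alone.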

In our paper \u201cPersonalized Reward Learning with Interaction-Grounded Learning,\u201d<\/a> which we\u2019re presenting at the 2023 International Conference on Learning Representations (ICLR)<\/span><\/a>, we propose a novel approach to the IGL problem: IGL-P.<\/em> IGL-P is the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. This approach provides a scalable alternative to current methods for learning personalized agents, which can require expensive high-dimensional parameter tuning, handcrafted rewards, and\/or extensive and costly user studies.<\/p>\n\n\n\n

\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPublication<\/span>\n\t\t\tInteraction-Grounded Learning<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPublication<\/span>\n\t\t\tPersonalized Reward Learning with Interaction-Grounded Learning (IGL)<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

IGL-P in the recommender system setting<\/h2>\n\n\n\n

IGL-P is particularly useful for interactive learning applications such as recommender systems. Recommender systems help people navigate growing volumes of content by providing personalized suggestions. However, without explicit feedback, a recommender system can\u2019t determine with certainty whether a person enjoyed the displayed content. To compensate, modern recommender systems treat implicit feedback signals as proxies for user satisfaction. Despite the popularity of this approach, implicit feedback is not the true reward. Even the click-through rate (CTR), the gold-standard metric for recommender systems, is an imperfect reward, and optimizing it naturally promotes clickbait.<\/p>\n\n\n\n

Interaction-grounded learning (IGL) for the recommender system setting. The recommender system receives features describing a person (x), recommends an item (a), and observes implicit user feedback (y), which is dependent on the latent reward (r) but not r<\/em> itself, to learn how to better recommend personalized content to the individual.<\/figcaption><\/figure>\n\n\n\n

This problem has led modern recommender systems to handcraft reward functions from various implicit feedback signals. Recommendation algorithms use hand-defined weights for different user interactions, such as replying to or liking content, when deciding what to recommend to different people. This fixed weighting of implicit feedback signals may not generalize across a wide variety of people, so a personalized learning method can improve the user experience by recommending content based on each user\u2019s preferences.<\/p>\n\n\n\n
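The fixed weighting described above can be sketched as follows; the signal names and weights are hypothetical, chosen only to illustrate why one global weighting can misread users who engage differently.

```python
# Hypothetical hand-defined weights for implicit feedback signals.
SIGNAL_WEIGHTS = {"click": 1.0, "like": 2.0, "reply": 3.0, "hide": -5.0}

def handcrafted_reward(events):
    """Score one interaction by summing fixed weights over observed signals."""
    return sum(SIGNAL_WEIGHTS.get(e, 0.0) for e in events)

# Every user is scored by the same weights, regardless of how they
# actually express satisfaction; this is the limitation that
# personalized reward learning addresses.
quiet_user = handcrafted_reward(["reply"])   # expresses enjoyment by replying
clicky_user = handcrafted_reward(["click"])  # expresses enjoyment by clicking
```

Both users may be equally satisfied, yet the fixed weights rank their interactions differently.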

<\/div>\n\n\n\n\t

The choice of reward function is further complicated by differences in how people interact with recommender systems. A growing body of work shows that recommender systems don\u2019t provide consistently good recommendations across demographic groups. Previous research suggests that this inconsistency is rooted in differing user engagement styles. In other words, a reward function that works well for one type of user might (and often does) perform poorly for another type of user who interacts with the platform differently. For example, older adults have been found to click on clickbait more often<\/span><\/a>. If CTR is used as the objective, this group of users will receive significantly more clickbait recommendations than the general public<\/span><\/a>, resulting in higher rates of negative user experiences and leading to user distrust in the recommender system.<\/p>\n\n\n\n

IGL-P provides a novel approach to optimizing content for latent user satisfaction\u2014that is, for rewards the model can\u2019t access directly\u2014by learning personalized reward functions for different people rather than requiring a fixed, human-designed reward function. IGL-P learns representations of diverse user communication modalities and how these modalities depend on underlying user satisfaction. It assumes that people may communicate feedback in different ways, but that a given person expresses satisfaction, dissatisfaction, or indifference in the same way across all content. This consistency makes it possible to use inverse kinematics to recover the latent reward. With the additional assumptions that rewards are rare when the agent acts randomly and that some negatively labeled interactions are directly accessible to the agent, IGL-P recovers the latent reward function and leverages it to learn a personalized policy.<\/p>\n\n\n\n
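To make the inverse-kinematics idea concrete, here is a minimal numpy sketch under assumed dynamics (not the paper\u2019s algorithm): feedback from satisfied users reveals which action was taken, so a decoder that predicts the action from the feedback doubles as a detector of latent-reward events. A trivial argmax decoder stands in for a learned action predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 2000, 3

X = rng.normal(size=(n, 2))                       # contexts
A = rng.integers(0, n_actions, size=n)            # uniformly random actions
R = (A == (X[:, 0] > 0).astype(int)).astype(int)  # latent reward, hidden from the learner
Y = rng.normal(size=(n, n_actions))               # feedback vectors
Y[np.arange(n), A] += 3.0 * R                     # satisfied users' feedback reveals the action

# Inverse kinematics: infer which action produced the observed feedback.
predicted_action = Y.argmax(axis=1)
decoded_reward = (predicted_action == A).astype(int)

# Interactions where the feedback identifies the action are treated as
# likely reward events; under a random policy these decoded rewards
# track the true latent reward R.
```

In this toy simulation, the decoder recovers the action far above chance exactly when the latent reward is 1, so its agreement with the logged action serves as a usable reward signal.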

IGL-P successes<\/h2>\n\n\n\n

The success of IGL-P is demonstrated in experiments with both simulations and real-world production traces. IGL-P is evaluated in three different settings:<\/p>\n\n\n\n