{"id":938229,"date":"2023-05-04T10:00:00","date_gmt":"2023-05-04T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=938229"},"modified":"2023-05-16T12:37:27","modified_gmt":"2023-05-16T19:37:27","slug":"using-generative-ai-to-imitate-human-behavior","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-generative-ai-to-imitate-human-behavior\/","title":{"rendered":"Using generative AI to imitate human behavior"},"content":{"rendered":"\n

This research was accepted by the 2023 International Conference on Learning Representations (ICLR), which is dedicated to the advancement of the branch of artificial intelligence generally referred to as deep learning.

\"An
Figure 1: Overview of our method.<\/figcaption><\/figure>\n\n\n\n

Diffusion models have emerged as a powerful class of generative AI models. They have been used to generate photorealistic images and short videos, compose music, and synthesize speech. And their uses don't stop there. In our new paper, Imitating Human Behaviour with Diffusion Models, we explore how they can be used to imitate human behavior in interactive environments.

This capability is valuable in many applications. For instance, it could help automate repetitive manipulation tasks in robotics, or it could be used to create humanlike AI in video games, which could lead to exciting new game experiences, a goal particularly dear to our team.

We follow a machine learning paradigm known as imitation learning (more specifically, behavior cloning). In this paradigm, we are provided with a dataset containing the observations a person saw and the actions they took while acting in an environment, and we would like an AI agent to mimic this behavior. In interactive environments, at each time step an observation \( o_t \) is received (e.g., a screenshot of a video game) and an action \( a_t \) is then selected (e.g., the mouse movement). With this dataset of many \( o \)'s and \( a \)'s performed by some demonstrator, a model \( \pi \) can try to learn the observation-to-action mapping \( \pi(o) \to a \).
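To make the setup concrete, here is a minimal behavior-cloning sketch in PyTorch. This is not our training code: the network sizes, data dimensions, and dataset are hypothetical placeholders, and the squared-error loss shown is only the simplest possible choice, which we question below.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: a flattened observation and a 2-D continuous action.
OBS_DIM, ACT_DIM = 64, 2

# A policy network representing the mapping pi(o) -> a.
policy = nn.Sequential(
    nn.Linear(OBS_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, ACT_DIM),
)

# Stand-in demonstration dataset of (o_t, a_t) pairs from a demonstrator.
obs = torch.randn(1000, OBS_DIM)
act = torch.randn(1000, ACT_DIM)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):
    pred = policy(obs)                  # predicted actions
    loss = ((pred - act) ** 2).mean()   # simple regression loss (see below)
    opt.zero_grad()
    loss.backward()
    opt.step()
```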


When the actions are continuous, training a model to learn this mapping introduces some interesting challenges. In particular, what loss function should be used? A simple choice is mean squared error, as often used in supervised regression tasks. In an interactive environment, this objective encourages an agent to learn the average of all the behaviors in the dataset.
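A tiny worked example of this averaging effect (with hypothetical numbers, chosen only for illustration): suppose that for the same observation, half the demonstrators took action \( -1 \) and half took \( +1 \). The MSE-optimal prediction is their mean, \( 0 \), an action nobody ever took.

```python
import torch

# Bimodal demonstration actions for a single repeated observation:
# half the demonstrators chose -1.0, half chose +1.0.
actions = torch.tensor([-1.0] * 50 + [1.0] * 50)

# A deterministic policy for this one observation is just a scalar.
a = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.SGD([a], lr=0.1)
for _ in range(200):
    loss = ((a - actions) ** 2).mean()  # mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(a.item())  # converges to ~0.0, the mean of the two modes
```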

If the goal of the application is to generate diverse human behaviors, the average might not be very useful. After all, humans are stochastic (they act on whims) and multimodal creatures (different humans might make different decisions). Figure 2 depicts the failure of mean squared error to mimic the true action distribution (marked in yellow) when it is multimodal. It also includes several other popular choices of loss function for behavior cloning.

\"This
Figure 2: This toy example (based on an arcade claw game) shows an action space with two continuous action dimensions. Here the demonstration distribution is marked in yellow\u2014it is both multimodal and has correlations between action dimensions. Diffusion models offer a good imitation of the full diversity in the dataset.<\/figcaption><\/figure>\n\n\n\n

Ideally, we'd like our models to learn the full variety of human behaviors. And this is where generative models help. Diffusion models are a class of generative model that is both stable to train and easy to sample from. They have been very successful in the text-to-image domain, which shares this one-to-many challenge: a single text caption might be matched by multiple different images.

Our work adapts ideas that were developed for text-to-image diffusion models to this new paradigm of observation-to-action diffusion. Figure 1 highlights some differences. One obvious difference is that the object we are generating is now a low-dimensional action vector (rather than an image). This calls for a new design for the denoising network architecture. In image generation, heavy convolutional U-Nets are in vogue, but these are less applicable to low-dimensional vectors. Instead, we designed and tested the three architectures shown in Figure 1.
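As a rough illustration of what such a denoiser consumes and produces, here is a generic MLP sketch. This is not one of the three architectures from the paper (those are shown in Figure 1); the dimensions and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action vector, conditioned on an
    observation and a diffusion timestep. A generic MLP sketch only,
    not one of the paper's three architectures."""
    def __init__(self, obs_dim=64, act_dim=2, hidden=256, n_steps=50):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, 32)  # timestep embedding
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + 32, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),  # predicted noise, same shape as action
        )

    def forward(self, noisy_action, obs, t):
        x = torch.cat([noisy_action, obs, self.t_embed(t)], dim=-1)
        return self.net(x)
```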

In observation-to-action models, sampling a single bad action during an episode can throw an agent off course, so we were motivated to develop sampling schemes that more reliably return good action samples (also shown in Figure 1). This problem is less severe in text-to-image models, since users often have the luxury of selecting a single image from among several generated samples and ignoring any bad ones. Figure 3 shows an example of this, where a user might cherry-pick their favorite while ignoring the one with nonsensical text.
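For reference, below is a textbook DDPM-style reverse-diffusion loop for sampling an action given an observation. The noise schedule constants here are illustrative defaults, and this baseline sampler is not the specific sampling schemes proposed in the paper.

```python
import torch

@torch.no_grad()
def sample_action(denoiser, obs, act_dim=2, n_steps=50):
    """Standard DDPM ancestral sampling, conditioned on an observation.
    Illustrative only; the paper's own sampling schemes differ (Figure 1)."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(act_dim)  # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(a.unsqueeze(0), obs.unsqueeze(0),
                       torch.tensor([t])).squeeze(0)
        # DDPM mean update: remove the predicted noise component.
        a = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn(act_dim)
    return a
```

With a sampler like this, one could draw several candidate actions and pick among them; the schemes in Figure 1 instead aim to make a single sampled action reliable, since an agent in an episode gets no cherry-picking step.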

\"Four
Figure 3: Four samples from a text-to-image diffusion model from Bing (note this is not our own work), using the prompt \u201cA cartoon style picture of people playing with arcade claw machine\u201d.<\/figcaption><\/figure>\n\n\n\n

We tested our diffusion agents in two different environments. The first, a simulated kitchen environment, is a challenging high-dimensional continuous control problem in which a robotic arm must manipulate various objects. The demonstration dataset was collected from a variety of humans performing various tasks in differing orders. Hence there is rich multimodality in the dataset.

We found that diffusion agents outperformed baselines in two respects: 1) the diversity of behaviors they learned was broader and closer to the human demonstrations; 2) their rate of task completion (a proxy for reward) was higher.

The videos below highlight the ability of diffusion agents to capture multimodal behavior: starting from the same initial conditions, we roll out the diffusion agent eight times, and each time it selects a different sequence of tasks to complete.

(Videos: eight rollouts of the diffusion agent from the same initial conditions in the simulated kitchen environment, each completing a different sequence of tasks.)

The second environment tested was a modern 3D video game, Counter-Strike. We refer interested readers to the paper for results.

In summary, our work has demonstrated how exciting recent advances in generative modeling can be leveraged to build agents that can behave in humanlike ways in interactive environments. We're excited to continue exploring this direction; watch this space for future work.

For more detail on our work, please see our paper and code repo.
