{"id":1005141,"date":"2024-02-07T07:16:41","date_gmt":"2024-02-07T15:16:41","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=1005141"},"modified":"2024-05-21T17:22:48","modified_gmt":"2024-05-22T00:22:48","slug":"interactive-agent-foundation-model","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/interactive-agent-foundation-model\/","title":{"rendered":"An Interactive Agent Foundation Model"},"content":{"rendered":"

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pretraining strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains: Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.<\/p>\n","protected":false},"excerpt":{"rendered":"

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13556,13545,13554,13553],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[268140,246658,268332,268266,247039,249835],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1005141","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-research-area-human-computer-interaction","msr-research-area-medical-health-genomics","msr-locale-en_us","msr-field-of-study-agent-ai","msr-field-of-study-deep-learning","msr-field-of-study-embodied-ai","msr-field-of-study-gaming","msr-field-of-study-health-care","msr-field-of-study-robotics"],"msr_publishername":"arXiv","msr_edition":"","msr_affiliation":"","msr_published_date":"2024-2-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_
study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2024\/02\/Agent-fondation-model.pdf","id":"1005228","title":"agent-fondation-model","label_id":"243109","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/pdf\/2402.05929.pdf","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2024\/02\/Agent-fondation-model.pdf","id":"1005228","title":"agent-fondation-model","label_id":"243118","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/pdf\/2402.05929.pdf","label_id":"243118","label":0}],"msr_attachments":[{"id":1005228,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2024\/02\/Agent-fondation-model.pdf"}],"msr-author-ordering":[{"type":"text","value":"Zane Durante","user_id":0,"rest_url":false},{"type":"text","value":"Bidipta Sarkar","user_id":0,"rest_url":false},{"type":"text","value":"Ran Gong","user_id":0,"rest_url":false},{"type":"text","value":"Rohan Taori","user_id":0,"rest_url":false},{"type":"guest","value":"yusuke-noda","user_id":969939,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=yusuke-noda"},{"type":"text","value":"Paul Tang","user_id":0,"rest_url":false},{"type":"text","value":"Ehsan Adeli","user_id":0,"rest_url":false},{"type":"text","value":"Shrinidhi Kowshika Lakshmikanth","user_id":0,"rest_url":false},{"type":"text","value":"Kevin Schulman","user_id":0,"rest_url":false},{"type":"text","value":"Arnold 
Milstein","user_id":0,"rest_url":false},{"type":"guest","value":"demetri-terzopoulos-2","user_id":981291,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=demetri-terzopoulos-2"},{"type":"user_nicename","value":"Ade Famoti","user_id":43005,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ade Famoti"},{"type":"text","value":"Noboru Kuno","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Ashley Llorens","user_id":39964,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ashley Llorens"},{"type":"guest","value":"hoi-vo-3","user_id":981312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=hoi-vo-3"},{"type":"user_nicename","value":"Katsushi Ikeuchi","user_id":32500,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Katsushi Ikeuchi"},{"type":"guest","value":"fei-fei-li","user_id":969957,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=fei-fei-li"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jianfeng Gao"},{"type":"user_nicename","value":"Naoki Wake","user_id":39916,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Naoki Wake"},{"type":"user_nicename","value":"Qiuyuan Huang","user_id":36356,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Qiuyuan 
Huang"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[144931,668253],"msr_project":[788159],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":788159,"post_title":"Agent AI","post_name":"agent-ai","post_type":"msr-project","post_date":"2023-09-25 21:53:00","post_modified":"2024-02-28 07:03:22","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/agent-ai\/","post_excerpt":"Agent-based multimodal AI systems are becoming a ubiquitous presence in our everyday lives. A promising direction for making these systems more interactive is to embody them as agents within specific environments. The grounding of large foundation models to act as agents within specific environments can provide a way of incorporating visual and contextual information into an embodied system. For example, a system that can perceive user actions, human behavior, environment objects, audio expressions, and 
the…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/788159"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1005141"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":9,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1005141\/revisions"}],"predecessor-version":[{"id":1038630,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1005141\/revisions\/1038630"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1005141"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=1005141"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1005141"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1005141"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1005141"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=1005141"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1005141"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=1005141"},{"taxonomy":"msr-download-source","embeddable":
true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=1005141"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1005141"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1005141"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1005141"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1005141"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1005141"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1005141"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1005141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}