{"id":972234,"date":"2023-10-05T09:00:00","date_gmt":"2023-10-05T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=972234"},"modified":"2024-06-10T09:47:36","modified_gmt":"2024-06-10T16:47:36","slug":"holoassist-a-multimodal-dataset-for-next-gen-ai-copilots-for-the-physical-world","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/holoassist-a-multimodal-dataset-for-next-gen-ai-copilots-for-the-physical-world\/","title":{"rendered":"HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world"},"content":{"rendered":"\n

This research paper was presented at the <\/em><\/strong>2023 IEEE\/CVF International Conference on Computer Vision<\/em><\/strong> (opens in new tab)<\/span><\/a> (ICCV), a premier academic conference for computer vision.<\/em><\/strong><\/p>\n\n\n\n

\"\"ICCV23<\/figure>\n\n\n\n

When was the last time you were faced with a task you had no clue how to tackle? Maybe it was fixing a broken bike, replacing a printer toner, or making a cup of espresso? In such circumstances, your usual options might include reaching out to a knowledgeable friend or relative for assistance. Alternatively, you might resort to scouring the internet, conducting a web search, posing questions on online forums, or seeking out relevant instructional videos. But what if there were another option? What if you could turn to an AI assistant, or copilot<\/em>, for help?<\/p>\n\n\n\n

AI in the real world<\/h2>\n\n\n\n

Our daily lives are filled with a wide range of tasks, both for work and leisure, spanning the digital and physical realms. We often find ourselves in need of guidance to learn and carry out these tasks effectively. Recent advances in AI, particularly in the areas of large language and multimodal models, have given rise to intelligent digital agents. However, when it comes to the physical world, where we perform a significant number of our tasks, AI systems have historically faced greater challenges. <\/p>\n\n\n\n

A longstanding aspiration within the AI community has been to develop an interactive AI assistant capable of perceiving, reasoning, and collaborating with people in the real world. Whether it\u2019s scenarios like autonomous driving, robot navigation and manipulation, hazard detection in industrial settings, or support and guidance for mixed-reality tasks, progress in physical activities has been slower and more incremental compared with their fully digital counterparts.<\/p>\n\n\n\n


The promise and challenge of interactive AI copilots<\/h2>\n\n\n\n

There is great potential for developing interactive AI copilots to assist people with real-world tasks, but there are also obstacles. The key challenge is that current state-of-the-art AI assistants lack firsthand experience in the physical world. Consequently, they cannot perceive the state of the real world and actively intervene when necessary. This limitation stems from a lack of training on the specific data required for perception, reasoning, and modeling in such scenarios. In AI development, there\u2019s a saying that \u201cdata is king,\u201d and this challenge is no exception. To advance interactive AI agents for physical tasks, we must thoroughly understand the problem domain and establish a gold standard for copilots\u2019 capabilities.<\/p>\n\n\n\n

A new multimodal interactive dataset<\/h2>\n\n\n\n

As a first step in this direction, we are excited to share our paper, \u201cHoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World (opens in new tab)<\/span><\/a>,\u201d presented at ICCV 2023 (opens in new tab)<\/span><\/a>. HoloAssist is a large-scale egocentric, or first-person, human interaction dataset, where two people collaboratively execute physical manipulation tasks. A task performer executes a task while wearing a mixed-reality headset that captures seven synchronized data streams, as shown in Figure 1. Simultaneously, a task instructor observes the performer\u2019s first-person video feed in real time and offers verbal instruction. <\/p>\n\n\n\n

\"An
Figure 1: HoloAssist features a two-person interactive assistive task-completion setting.<\/figcaption><\/figure>\n\n\n\n

HoloAssist comprises 166 hours of recordings involving 222 diverse participants, who form 350 distinct instructor-performer pairs carrying out 20 object-centric manipulation tasks. Video 1 shows how tasks are recorded, while Figure 2 provides a task breakdown. The objects range from common electronic devices to rarer items found in factories and specialized labs. The tasks are generally quite demanding, often requiring instructor assistance for successful completion. To provide comprehensive insights, we\u2019ve captured seven raw sensor modalities: RGB, depth, head pose, 3D hand pose, eye gaze, audio, and IMU. These modalities help in understanding human intentions, estimating world states, predicting future actions, and more. Finally, an eighth modality augments the recordings with third-person manual annotations, consisting of a text summary, intervention types, mistake annotations, and action segments, as illustrated in Figure 3.<\/p>\n\n\n\n
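To make these modalities concrete, the sketch below shows one way a single synchronized recording could be represented in code. The class and field names are illustrative assumptions for this post, not the official HoloAssist data-loading API.<\/p>\n\n\n\n

<pre><code># A minimal sketch of a HoloAssist-style recording session. All names here are
# assumptions for illustration; consult the released dataset and code for the
# actual formats.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FramePacket:
    timestamp: float          # seconds since the session started
    rgb_path: str             # path to the RGB frame
    depth_path: str           # path to the aligned depth frame
    head_pose: List[float]    # 6-DoF head pose (position and orientation)
    hand_pose: List[float]    # 3D hand joint positions for both hands
    eye_gaze: List[float]     # gaze origin and direction
    imu: List[float]          # accelerometer and gyroscope sample

@dataclass
class Session:
    session_id: str
    task_name: str                  # one of the 20 object-centric tasks
    audio_path: str                 # audio track for the full session
    frames: List[FramePacket] = field(default_factory=list)
<\/code><\/pre>\n\n\n\n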

<figure><figcaption>Video 1: A sampling of task recordings showcasing color and depth, two of the eight modalities.<\/figcaption><\/figure>\n\n\n\n
\"Data
Figure 2: Data distribution captured in HoloAssist. On the left, the number of sessions per activity. On the right, the total session length in minutes.<\/figcaption><\/figure>\n\n\n\n
\"HoloAssist
Figure 3: HoloAssist includes action and conversational annotations, and it also provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a \u201cmistake\u201d or \u201ccorrect\u201d attribute, while spoken statements are labeled with intervention types.<\/figcaption><\/figure>\n\n\n\n
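Concretely, a single annotated action segment and an accompanying instructor utterance might look like the records below. The field names and label strings are assumptions made for illustration rather than the exact HoloAssist annotation schema.<\/p>\n\n\n\n

<pre><code># Illustrative annotation records; field names and label values are assumed,
# not taken from the released annotation files.
action_segment = {
    'start': 12.4,              # segment start time in seconds
    'end': 18.9,                # segment end time in seconds
    'verb': 'attach',           # coarse description of the action
    'noun': 'toner cartridge',
    'attribute': 'mistake',     # each action is tagged mistake or correct
}

utterance = {
    'start': 19.1,
    'end': 22.0,
    'speaker': 'instructor',
    'text': 'Rotate the cartridge before pushing it in.',
    'intervention_type': 'correct the mistake',   # assumed label wording
}
<\/code><\/pre>\n\n\n\n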

Towards proactive AI assistants<\/h2>\n\n\n\n

Our work builds on previous advancements in egocentric vision and embodied AI. Unlike earlier datasets, such as those listed in Table 1, HoloAssist stands out due to its multi-person, interactive task-execution setting. Human interaction during task execution provides a valuable resource for designing AI assistants that are anticipatory and proactive, capable of delivering precisely timed instructions grounded in the environment, in contrast with current \u201cchat-based\u201d AI assistants that wait for you to ask a question. This unique scenario is ideal for developing assistive AI agents and complements existing datasets, which contribute rich knowledge and representations.<\/p>\n\n\n\n

\"The
Table 1: Comparison of related datasets and simulation platforms. HoloAssist features a multi-person assistive setting, which is a unique addition to existing egocentric (first-person) datasets.<\/figcaption><\/figure>\n\n\n\n

Finally, we benchmarked action classification and action anticipation on the dataset, providing empirical results that shed light on the role of different modalities across tasks. With this dataset, we also introduce new tasks and benchmarks focused on mistake detection, intervention type prediction, and 3D hand pose forecasting, all crucial elements for developing intelligent assistants.<\/p>\n\n\n\n
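As a rough illustration of how a benchmark like mistake detection can be scored, the sketch below compares one predicted label per annotated action segment against its ground-truth mistake or correct attribute. This is a simplified stand-in and may differ from the exact evaluation protocol used in the paper.<\/p>\n\n\n\n

<pre><code># Simplified scoring sketch for a mistake-detection baseline; the official
# benchmark code may use different metrics and protocols.
from statistics import mean
from typing import Dict

def mistake_detection_accuracy(predictions: Dict[str, str],
                               ground_truth: Dict[str, str]) -> float:
    # Both dictionaries map a segment id to a label: mistake or correct.
    hits = [int(predictions.get(seg_id) == label)
            for seg_id, label in ground_truth.items()]
    return float(mean(hits)) if hits else 0.0
<\/code><\/pre>\n\n\n\n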

Looking forward<\/h2>\n\n\n\n

This work represents an initial step in broader research that explores how intelligent agents can collaborate with humans in real-world tasks. We're excited to share this work and our dataset with the community and anticipate numerous future directions, including annotating object poses, investigating object-centric models of affordance and manipulation for AI assistance, and AI-assisted planning and state tracking. We believe HoloAssist, along with its associated benchmarks and tools, will benefit future research endeavors focused on building powerful AI assistants for real-world everyday tasks. You can access the HoloAssist dataset and code on GitHub (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n

Contributors<\/h3>\n\n\n\n

Taein Kwon, Mahdi Rad<\/a>, Bowen Pan, Ishani Chakraborty<\/a>, Sean Andrist<\/a>, Dan Bohus<\/a>, Ashley Feniello<\/a>, Bugra Tekin, Felipe Vieira Frujeri<\/a>, Marc Pollefeys<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"

HoloAssist is a new multimodal dataset consisting of 166 hours of interactive task executions with 222 participants. Discover how it offers invaluable data to advance the capabilities of next-gen AI copilots for real-world tasks.<\/p>\n","protected":false},"author":42183,"featured_media":972549,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13562],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-972234","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565,602418,992148],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Xin Wang","user_id":41146,"display_name":"Xin Wang","author_link":"Xin Wang<\/a>","is_active":false,"last_first":"Wang, Xin","people_section":0,"alias":"wanxin"},{"type":"user_nicename","value":"Neel Joshi","user_id":33073,"display_name":"Neel Joshi","author_link":"Neel Joshi<\/a>","is_active":false,"last_first":"Joshi, Neel","people_section":0,"alias":"neel"}],"msr_type":"Post","featured_image_thumbnail":"\""ICCV23","byline":"Xin Wang<\/a> and Neel Joshi<\/a>","formattedDate":"October 5, 2023","formattedExcerpt":"HoloAssist is a new multimodal dataset consisting of 166 hours of interactive task executions with 222 participants. 
Discover how it offers invaluable data to advance the capabilities of next-gen AI copilots for real-world tasks.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/972234"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42183"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=972234"}],"version-history":[{"count":40,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/972234\/revisions"}],"predecessor-version":[{"id":1045071,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/972234\/revisions\/1045071"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/972549"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=972234"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=972234"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=972234"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=972234"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=972234"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=972234"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=972234"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=972234"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=972234"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=972234"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=972234"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}