{"id":1027704,"date":"2024-04-29T09:00:00","date_gmt":"2024-04-29T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/sigma-an-open-source-mixed-reality-system-for-research-on-physical-task-assistance\/"},"modified":"2024-04-30T11:11:28","modified_gmt":"2024-04-30T18:11:28","slug":"sigma-an-open-source-mixed-reality-system-for-research-on-physical-task-assistance","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/sigma-an-open-source-mixed-reality-system-for-research-on-physical-task-assistance\/","title":{"rendered":"SIGMA: An open-source mixed-reality system for research on physical task assistance"},"content":{"rendered":"\n
\"Blue,<\/figure>\n\n\n\n

Imagine if every time you needed to complete a complex physical task, like building a bicycle, fixing a broken water heater, or cooking risotto for the first time, you had a world-class expert standing over your shoulder and guiding you through the process. In addition to telling you the steps to follow, this expert would also tune the instructions to your skill set, deliver them with the right timing, and adapt to any mistakes, confusion, or distractions that might arise along the way.

What would it take to build an interactive AI system that could assist you with any task in the physical world, just as a real-time expert would? To begin exploring the core competencies that such a system would require, we developed and released the Situated Interactive Guidance, Monitoring, and Assistance (SIGMA) system, an open-source research platform and testbed prototype for studying mixed-reality task assistance. SIGMA provides a basis for researchers to explore, understand, and develop the capabilities required to enable in-stream task assistance in the physical world.

\"Left:<\/figure>\n\n\n\n

Recent advances in generative AI and large language, vision, and multimodal models can provide a foundation of open-domain knowledge, inference, and generation capabilities to help enable such open-ended task assistance scenarios. However, building AI systems that collaborate with people in the physical world (including not just mixed-reality task assistants but also interactive robots, smart factory floors, autonomous vehicles, and so on) requires going beyond the ability to generate relevant instructions and content. To be effective, these systems also require physical and social intelligence.

Physical and social intelligence

For AI systems to fluidly collaborate with people in the physical world, they must continuously perceive and reason multimodally, in stream, about their surrounding environment. This requirement goes beyond just detecting and tracking objects. Effective collaboration in the physical world necessitates an understanding of which objects are relevant for the task at hand, what their possible uses may be, how they relate to each other, what spatial constraints are in play, and how all these aspects evolve over time.

Just as important as reasoning about the physical environment, these systems also need to reason about people. This reasoning should include not only lower-level inferences about body pose, speech, and actions, but also higher-level inferences about cognitive states and the social norms of real-time collaborative behavior. For example, the AI assistant envisioned above would need to consider questions such as: Is the user confused or frustrated? Are they about to make a mistake? What's their level of expertise? Are they still pursuing the current task, or have they started doing something else in parallel? Is it a good time to interrupt them or provide the next instruction? And so forth.

Situated Interactive Guidance, Monitoring, and Assistance

We developed SIGMA as a platform to investigate these challenges and evaluate progress in developing new solutions.

\"Left:
Left<\/strong>: A person using SIGMA running on a HoloLens 2 to perform a procedural task. Middle<\/strong>: First-person view showing SIGMA\u2019s task-guidance panel and task-specific holograms. Right<\/strong>: 3D visualization of the system’s scene understanding showing the egocentric camera view, depth map, detected objects, gaze, hand and head pose. (c) 2024 IEEE<\/figcaption><\/figure>\n\n\n\n

SIGMA is an interactive application that currently runs on a HoloLens 2 device and combines a variety of mixed-reality and AI technologies, including large language and vision models, to guide a user through procedural tasks. Tasks are structured as a sequence of steps, which can either be predefined manually in a task library or generated on the fly using a large language model like GPT-4. Throughout the interaction, SIGMA can leverage large language models to answer open-ended questions that a user might have along the way. Additionally, SIGMA can use vision models like Detic and SEEM to detect and track task-relevant objects in the environment and point them out to the user as appropriate. This video provides a first-person view of someone using SIGMA to perform a couple of example procedural tasks.
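To make that structure a bit more concrete, the sketch below shows one way a procedural task could be represented as a sequence of steps and generated on the fly with a language model. This is an illustrative Python example, not SIGMA's actual implementation: the Step and Task classes, the generate_task_steps helper, and the prompt are assumptions, and an OpenAI-style chat completion client stands in for whichever model endpoint a researcher might use.

```python
# Illustrative sketch only (hypothetical names), not SIGMA's actual code.
from dataclasses import dataclass, field

from openai import OpenAI  # assumes an OpenAI-style chat completions client


@dataclass
class Step:
    instruction: str  # e.g., "Whisk the eggs for 30 seconds"
    relevant_objects: list[str] = field(default_factory=list)  # objects to detect and highlight


@dataclass
class Task:
    name: str
    steps: list[Step]


def generate_task_steps(task_name: str, model: str = "gpt-4") -> Task:
    """Ask a language model for the steps of a task and parse them into a Task."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a task-guidance assistant. List the steps needed "
                        "to complete the task, one step per line."},
            {"role": "user", "content": f"Task: {task_name}"},
        ],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    steps = [Step(instruction=line.lstrip("0123456789.- ").strip())
             for line in lines if line.strip()]
    return Task(name=task_name, steps=steps)


# A hand-authored entry from a predefined task library fills in the same structure:
risotto = Task(
    name="Cook mushroom risotto",
    steps=[
        Step("Finely chop the onion and mushrooms", ["onion", "mushrooms", "knife"]),
        Step("Toast the rice in olive oil for two minutes", ["rice", "pan"]),
        Step("Add stock one ladle at a time, stirring until absorbed", ["stock", "ladle"]),
    ],
)
```

One appealing property of a representation along these lines is that predefined and generated tasks share the same shape, so downstream guidance logic (displaying instructions, highlighting detected objects, answering questions) can treat both cases uniformly.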
