
Spatial AI Lab – Zurich

The research projects undertaken by the Microsoft Mixed Reality & AI Lab, EPFL, and ETH Zurich form a collaborative effort to address research challenges in areas related to AI. These projects are carried out through the Microsoft Swiss Joint Research Center, established in 2008.

2022-2023 Projects

  • EPFL PIs: Alexander Mathis, Friedhelm Hummel, Silvestro Micera
    Microsoft PIs: Marc Pollefeys
    PhD Student: Haozhe Qi

    Despite many advances in neuroprosthetics and neurorehabilitation, the techniques to measure, personalize, and thus optimize the functional improvements that patients gain with therapy remain limited. Impairments are still assessed with standardized functional tests, which fail to capture everyday behaviour and quality of life, are poorly suited to personalization, and must be administered by trained health care professionals in a clinical environment. By leveraging recent advances in motion capture and hardware, we will create novel metrics to evaluate, personalize and improve the dexterity of patients in their everyday life. We will use the EPFL Smart Kitchen platform to assess naturalistic behaviour in the kitchens of healthy subjects, upper-limb amputees and stroke patients, filmed from a head-mounted camera (Microsoft HoloLens). We will develop a computer vision pipeline capable of measuring hand-object interactions in patients’ kitchens. Based on this novel, large-scale dataset collected in patients’ kitchens, we will derive metrics that measure dexterity in the “natural world,” as well as recovered and compensatory movements due to the pathology or assistive device. We will also use these data to assess novel control strategies for neuroprosthetics and to design optimal, personalized rehabilitation treatments by leveraging virtual reality.
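
    To make the kind of metric we have in mind concrete, here is a minimal illustrative sketch (not the project’s actual pipeline) of how simple dexterity measures could be derived from per-frame hand and object positions produced by an egocentric hand-object tracker. The data layout, the 30 fps rate, the 5 cm contact threshold and the function name `reach_metrics` are assumptions for illustration only.

```python
# Hypothetical sketch: simple dexterity metrics from per-frame 3D wrist and
# object positions, e.g. as produced by a hand-object tracker running on
# egocentric (HoloLens) video. Thresholds and data layout are assumptions.
import numpy as np

def reach_metrics(wrist_xyz: np.ndarray, object_xyz: np.ndarray, fps: float = 30.0):
    """wrist_xyz, object_xyz: (T, 3) arrays of 3D positions per frame."""
    dt = 1.0 / fps
    dist = np.linalg.norm(wrist_xyz - object_xyz, axis=1)   # hand-object distance per frame
    vel = np.gradient(wrist_xyz, dt, axis=0)                 # wrist velocity
    speed = np.linalg.norm(vel, axis=1)
    jerk = np.gradient(np.gradient(vel, dt, axis=0), dt, axis=0)
    # Normalized jerk is a common smoothness proxy in motor-control studies.
    duration = len(wrist_xyz) * dt
    path_len = np.sum(np.linalg.norm(np.diff(wrist_xyz, axis=0), axis=1))
    norm_jerk = np.sqrt(0.5 * np.sum(np.linalg.norm(jerk, axis=1) ** 2) * dt
                        * duration ** 5 / max(path_len ** 2, 1e-9))
    close = dist < 0.05                                      # within 5 cm of the object
    time_to_contact = float(np.argmax(close)) * dt if close.any() else float("nan")
    return {"peak_speed": float(speed.max()),
            "normalized_jerk": float(norm_jerk),
            "time_to_contact_s": time_to_contact}
```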

  • ETH Zurich PIs: Stelian Coros, Roi Poranne
    Microsoft PIs: Jeffrey Delmerico, Juan Nieto, Marc Pollefeys
    PhD Student: Florian Kennel-Maushart

    Despite popular depictions in sci-fi movies and TV shows, robots remain limited in their ability to autonomously solve complex tasks. Indeed, even the most advanced commercial robots are only now just starting to navigate man-made environments while performing simple pick-and-place operations. In order to enable complex high-level behaviours, such as the abstract reasoning required to manoeuvre objects in highly constrained environments, we propose to leverage human intelligence and intuition. The challenge here is one of representation and communication. In order to communicate human insights about a problem to a robot, or to communicate a robot’s plans and intent to a human, it is necessary to utilize representations of space, tasks, and movements that are mutually intelligible for both human and robot. This work will focus on the problem of single and multi-robot motion planning with human guidance, where a human assists a team of robots in solving a motion-based task that is beyond the reasoning capabilities of the robot systems. We will exploit the ability of Mixed Reality (MR) technology to communicate spatial concepts between robots and humans, and will focus our research efforts on exploring the representations, optimization techniques, and multi-robot task planning necessary to advance the ability of robots to solve complex tasks with human guidance.
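
    As a toy illustration of treating human guidance as a soft constraint, the sketch below optimizes a single 2D robot path toward a waypoint hint that an operator might place through a Mixed Reality interface. The cost terms, weights and the function `optimize_path` are hypothetical and stand in for the richer representations and multi-robot planners this project targets.

```python
# Minimal sketch (illustrative, not the project's planner): a human hint,
# e.g. a waypoint placed in Mixed Reality, acts as a soft constraint in a
# simple gradient-descent trajectory optimization for one robot in 2D.
import numpy as np

def optimize_path(start, goal, hint, n_pts=30, iters=500, lr=0.05, w_hint=2.0):
    """Gradient descent on a 2D path: smoothness cost + soft attraction to a human hint."""
    start, goal, hint = map(np.asarray, (start, goal, hint))
    path = np.linspace(start, goal, n_pts)          # initialize with a straight line
    for _ in range(iters):
        grad = np.zeros_like(path)
        # Smoothness term: each interior point is pulled toward its neighbors' midpoint.
        grad[1:-1] += 2 * path[1:-1] - path[:-2] - path[2:]
        # Human-guidance term: pull the path point closest to the hint toward it.
        i = np.argmin(np.linalg.norm(path - hint, axis=1))
        grad[i] += w_hint * (path[i] - hint)
        grad[0] = grad[-1] = 0.0                    # keep start and goal fixed
        path -= lr * grad
    return path

waypoints = optimize_path(start=(0.0, 0.0), goal=(5.0, 0.0), hint=(2.5, 1.5))
```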

  • ETH Zurich PIs: Otmar Hilliges
    Microsoft PIs: Julien Valentin

    Digital capture of human bodies is a rapidly growing research area in computer vision and computer graphics that puts scenarios such as life-like Mixed Reality (MR) virtual-social interactions into reach, albeit not without overcoming several challenging research problems. A core question in this respect is how to faithfully transmit a virtual copy of oneself so that a remote collaborator may perceive the interaction as immersive and engaging. To present a real alternative to face-to-face meetings, future AR/VR systems will crucially depend on two core building blocks: (1) means to capture the 3D geometry and appearance (e.g., texture, lighting) of individuals with consumer-grade infrastructure (e.g., a single RGB-D camera), with very little time and expertise; and (2) means to represent the captured geometry and appearance information in a fashion that is suitable for photorealistic rendering under fine-grained control over the underlying factors, such as pose and facial expressions, amongst others. In this project, we plan to develop novel methods to learn animatable representations of humans from ‘cheap’ data sources alone. Furthermore, we plan to extend our own recent work on animatable neural implicit surfaces, such that it can represent not only the geometry but also the appearance of subjects in high visual fidelity. Finally, we plan to study techniques to enforce geometric and temporal consistency in such methods to make them suitable for MR and other telepresence downstream applications.
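
    The following sketch is an assumption-laden illustration of the kind of representation involved: a small PyTorch MLP that models an animatable implicit surface by predicting a signed distance and color for a 3D query point, conditioned on a pose vector. The class name `PoseConditionedSDF`, the 72-dimensional pose and the network size are placeholders, not the project’s actual architecture.

```python
# Illustrative sketch (an assumption, not the authors' model): a pose-conditioned
# neural implicit surface that maps a 3D point plus a body-pose vector to a
# signed distance and an RGB color.
import torch
import torch.nn as nn

class PoseConditionedSDF(nn.Module):
    def __init__(self, pose_dim: int = 72, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1 + 3),   # signed distance + RGB
        )

    def forward(self, xyz: torch.Tensor, pose: torch.Tensor):
        # xyz: (N, 3) query points; pose: (pose_dim,) pose vector shared by all queries.
        pose = pose.expand(xyz.shape[0], -1)
        out = self.net(torch.cat([xyz, pose], dim=-1))
        sdf, rgb = out[:, :1], torch.sigmoid(out[:, 1:])
        return sdf, rgb

model = PoseConditionedSDF()
sdf, rgb = model(torch.randn(1024, 3), torch.zeros(72))
```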

  • ETH Zurich PIs: Marco Tognon, Mike Allenspach, Nicholas Lawrence, Roland Siegwart
    Microsoft PIs: Jeffrey Delmerico, Juan Nieto, Marc Pollefeys

  • EPFL PIs: Pascal Fua, Mathieu Salzmann, Helge Rhodin
    Microsoft PIs: Sudipta Sinha, Marc Pollefeys

    In recent years, there has been tremendous progress in camera-based 6D object pose, hand pose and human 3D pose estimation. All of these can now be done in real time, but not yet at the level of accuracy required to properly capture how people interact with each other and with objects, which is a crucial component of modeling the world in which we live. For example, when someone grasps an object, types on a keyboard, or shakes someone else’s hand, the position of their fingers with respect to what they are interacting with must be precisely recovered for the resulting models to be used by AR devices, such as the HoloLens or consumer-level video see-through AR devices. This remains a challenge, especially given that hands are often severely occluded in the egocentric views that are the norm in AR. We will, therefore, work on accurately capturing the interaction between hands and the objects they touch and manipulate. At the heart of this effort will be the precise modeling of contact points and the resulting physical forces between interacting hands and objects. This is essential for two reasons. First, objects in contact exert forces on each other; their pose and motion can only be accurately captured and understood if reaction forces at contact points and areas are modeled jointly. Second, touch and touch-force devices, such as keyboards and touch screens, are the most common human-computer interfaces, and by sensing contact and contact forces purely visually, everyday objects could be turned into tangible interfaces that react as if they were equipped with touch-sensitive electronics. For instance, a soft cushion could become a non-intrusive input device that, unlike virtual mid-air menus, provides natural force feedback. Building on our preliminary results, we will pursue this research agenda in the year to come.
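
    As a hedged sketch of one ingredient mentioned above, the snippet below flags likely contact points by proximity between estimated hand joints and an object point cloud. The 21-joint hand layout, the 1 cm threshold and the function `contact_candidates` are illustrative assumptions; estimating contact forces would have to build on top of such candidates.

```python
# Hedged sketch: contact-candidate detection by proximity between estimated
# hand joints and an object surface given as a point cloud. Thresholds and
# data layout are assumptions, not the project's method.
import numpy as np

def contact_candidates(hand_joints: np.ndarray, object_points: np.ndarray,
                       threshold: float = 0.01):
    """hand_joints: (21, 3); object_points: (M, 3); returns (joint, point, distance) triples."""
    # Pairwise distances between each hand joint and every object surface point.
    d = np.linalg.norm(hand_joints[:, None, :] - object_points[None, :, :], axis=-1)
    nearest = d.min(axis=1)            # distance from each joint to the object surface
    nearest_idx = d.argmin(axis=1)     # index of the closest object point per joint
    in_contact = nearest < threshold
    return [(int(j), int(nearest_idx[j]), float(nearest[j]))
            for j in np.flatnonzero(in_contact)]
```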

  • ETH Zurich PIs: Stelian Coros, Roi Poranne
    Microsoft PIs: Marc Pollefeys

    With this project, we aim to accelerate the development of intelligent robots that can assist those in need with a variety of everyday tasks. People suffering from physical impairments, for example, often need help dressing or brushing their own hair. Skilled robotic assistants would allow these persons to live an independent lifestyle. Even such seemingly simple tasks, however, require complex manipulation of physical objects, advanced motion planning capabilities, as well as close interactions with human subjects. We believe the key to robots being able to undertake such societally important functions is learning from demonstration. The fundamental research question is, therefore, how can we enable human operators to seamlessly teach a robot how to perform complex tasks? The answer, we argue, lies in immersive telemanipulation. More specifically, we are inspired by the vision of James Cameron’s Avatar, where humans are endowed with alternative embodiments. In such a setting, the human’s intent must be seamlessly mapped to the motions of a robot as the human operator becomes completely immersed in the environment the robot operates in. To achieve this ambitious vision, many technologies must come together: mixed reality as the medium for robot-human communication, perception and action recognition to detect the intent of both the human operator and the human patient, motion retargeting techniques to map the actions of the human to the robot’s motions, and physics-based models to enable the robot to predict and understand the implications of its actions.
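
    A minimal sketch of one piece of this pipeline, under stated assumptions: retargeting a tracked operator hand position into a robot end-effector target with a workspace scale and a safety clamp. The function `retarget_hand_to_robot` and its parameters are hypothetical; real retargeting would also handle orientation, latency and the robot’s kinematics.

```python
# Illustrative sketch: map a human hand position (operator frame) to a robot
# end-effector target (robot frame) with scaling and a cubic workspace clamp.
import numpy as np

def retarget_hand_to_robot(hand_pos_operator: np.ndarray,
                           operator_origin: np.ndarray,
                           robot_origin: np.ndarray,
                           scale: float = 1.0,
                           workspace_half_extent: float = 0.5) -> np.ndarray:
    """Return a 3D end-effector target in the robot's frame."""
    offset = (np.asarray(hand_pos_operator) - np.asarray(operator_origin)) * scale
    # Clamp to a cube around the robot's origin for safety.
    offset = np.clip(offset, -workspace_half_extent, workspace_half_extent)
    return np.asarray(robot_origin) + offset

target = retarget_hand_to_robot(hand_pos_operator=np.array([0.3, 1.1, 0.2]),
                                operator_origin=np.array([0.0, 1.0, 0.0]),
                                robot_origin=np.array([0.4, 0.0, 0.3]))
```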

  • ETH Zurich PIs: Roland Siegwart, Cesar Cadena, Juan Nieto
    Microsoft PIs: Johannes Schönberger, Marc Pollefeys

    AR/VR allow new and innovative ways of visualizing information and provide a very intuitive interface for interaction. At their core, they rely only on a camera and inertial measurement unit (IMU) setup or a stereo-vision setup to provide the necessary data, either of which is readily available on most commercial mobile devices. The technology has already seen early adoption in real estate, sports, gaming, retail, tourism, transportation and many other fields. Current technologies for visually aided motion estimation and mapping on mobile devices have three main requirements to produce highly accurate 3D metric reconstructions: (1) an accurate spatial and temporal calibration of the sensor suite, a procedure that is typically carried out with the help of external infrastructure, such as calibration markers, and by following a set of predefined movements; (2) well-lit, textured environments and feature-rich, smooth trajectories; and (3) the continuous and reliable operation of all sensors involved. This project aims at relaxing these requirements to enable continuous and robust lifelong mapping on end-user mobile devices. The specific objectives of this work are: (1) formalize a modular and adaptable multi-modal sensor fusion framework for online map generation; (2) improve the robustness of mapping and motion estimation by exploiting high-level semantic features; and (3) develop techniques for automatic detection and execution of sensor calibration in the wild. A modular SLAM (simultaneous localization and mapping) pipeline that is able to exploit all available sensing modalities can overcome the individual limitations of each sensor and increase the overall robustness of the estimation. Such an information-rich map representation allows us to leverage recent advances in semantic scene understanding, providing an abstraction from low-level geometric features – which are fragile to noise, sensing conditions and small changes in the environment – to higher-level semantic features that are robust against these effects. Using this complete map representation, we will explore new ways to detect miscalibrations and sensor failures, so that the SLAM process can be adapted online without the need for explicit user intervention.
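
    To illustrate the flavour of online miscalibration and failure detection (an assumption, not the project’s method), the sketch below monitors per-sensor residuals, such as reprojection errors reported by a SLAM back end, and flags a sensor when its recent residuals drift well above its long-term baseline. The window sizes, the ratio threshold and the class `ResidualMonitor` are placeholders.

```python
# Hedged sketch: flag a sensor for recalibration when its recent residuals
# (e.g. reprojection or IMU pre-integration errors) exceed a multiple of its
# long-term average. Windows and threshold are illustrative assumptions.
from collections import deque, defaultdict

class ResidualMonitor:
    def __init__(self, window: int = 200, baseline: int = 2000, ratio: float = 3.0):
        self.recent = defaultdict(lambda: deque(maxlen=window))      # short-term residuals
        self.baseline = defaultdict(lambda: deque(maxlen=baseline))  # long-term residuals
        self.ratio = ratio

    def add(self, sensor: str, residual: float) -> None:
        self.recent[sensor].append(residual)
        self.baseline[sensor].append(residual)

    def flagged(self, sensor: str) -> bool:
        recent, base = self.recent[sensor], self.baseline[sensor]
        if len(base) < base.maxlen // 2:      # not enough history yet
            return False
        mean = lambda xs: sum(xs) / len(xs)
        return mean(recent) > self.ratio * mean(base)

monitor = ResidualMonitor()
monitor.add("cam0", 0.8)                      # e.g. reprojection error in pixels
if monitor.flagged("cam0"):
    print("cam0 residuals elevated: trigger online recalibration")
```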