{"id":821353,"date":"2022-02-23T10:15:23","date_gmt":"2022-02-23T18:15:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=821353"},"modified":"2022-08-17T09:59:29","modified_gmt":"2022-08-17T16:59:29","slug":"compass-contrastive-multimodal-pretraining-for-autonomous-systems","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/compass-contrastive-multimodal-pretraining-for-autonomous-systems\/","title":{"rendered":"COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems"},"content":{"rendered":"\n
\"Figure
Figure 1: COMPASS is a general-purpose pretraining pipeline, which is trained on multimodal data, including RGB images, depth and optical flow. The pretrained COMPASS model can be deployed on various downstream autonomous systems tasks. In this work, we test COMPASS on simulated drone navigation, car racing and visual odometry. This highlights how the system can be deployed in very different environments and application scenarios.<\/center><\/figcaption><\/figure>\n\n\n\n

Humans have the fundamental cognitive ability to perceive the environment through multimodal sensory signals and use them to accomplish a wide variety of tasks. It is crucial that an autonomous agent can similarly perceive the underlying state of an environment from different sensors and appropriately consider how to accomplish a task. For example, localization (or “where am I?”) is a fundamental question that an autonomous agent needs to answer before navigating, and it is often addressed via visual odometry. Highly dynamic tasks, such as vehicle racing, necessitate collision avoidance and an understanding of how the agent’s state evolves with respect to the environment. Agents must learn perceptual representations of geometric and semantic information from the environment so that their actions can influence the world.

Task-driven approaches are appealing, but learning representations suitable only for a specific task limits their ability to generalize to new scenarios, thus confining their utility. For example, as shown in Figure 1, achieving tasks such as drone navigation and vehicle racing usually requires specifically designed models that encode representations from very different sensor modalities, environments, sensory signals, and sampling rates. Such models must also cope with the different dynamics and controls of each application scenario. Therefore, we ask whether it is possible to build general-purpose pretrained models for autonomous systems that are agnostic to tasks and individual form factors.

In our recent work, COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems, we introduce a general-purpose pretraining pipeline built to overcome such limitations arising from task-specific models. The code can be viewed on GitHub.
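As a rough illustration of the contrastive multimodal idea, the sketch below aligns embeddings from two modalities (RGB frames and optical flow) with a symmetric InfoNCE-style loss, so that matching pairs from the same time step are pulled together and other pairs in the batch are pushed apart. This is not the COMPASS implementation: the encoder architectures, modality pairing, embedding size, and temperature are placeholder assumptions chosen only to keep the example self-contained and runnable.

```python
# Illustrative sketch (not the authors' implementation) of contrastive
# multimodal pretraining: align embeddings of paired RGB and optical-flow
# inputs with a symmetric InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy CNN encoder mapping an image-like modality to a unit-norm embedding."""
    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def contrastive_loss(z_a, z_b, temperature: float = 0.07):
    """Symmetric InfoNCE: the i-th sample of each modality is the positive
    for the i-th sample of the other; all other batch pairs are negatives."""
    logits = z_a @ z_b.t() / temperature        # (B, B) cosine-similarity matrix
    targets = torch.arange(z_a.size(0))         # matching indices along the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch with random tensors standing in for a batch of paired frames.
rgb_encoder, flow_encoder = ModalityEncoder(3), ModalityEncoder(2)
rgb = torch.randn(8, 3, 64, 64)    # RGB frames
flow = torch.randn(8, 2, 64, 64)   # optical-flow fields (dx, dy)
loss = contrastive_loss(rgb_encoder(rgb), flow_encoder(flow))
loss.backward()
```

In this kind of setup, the pretrained encoders can later be reused as perception backbones for downstream tasks, which is the spirit of the general-purpose pipeline described above.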


COMPASS features three key aspects: