{"id":821353,"date":"2022-02-23T10:15:23","date_gmt":"2022-02-23T18:15:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=821353"},"modified":"2022-08-17T09:59:29","modified_gmt":"2022-08-17T16:59:29","slug":"compass-contrastive-multimodal-pretraining-for-autonomous-systems","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/compass-contrastive-multimodal-pretraining-for-autonomous-systems\/","title":{"rendered":"COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems"},"content":{"rendered":"\n
\"Figure
Figure 1: COMPASS is a general-purpose pretraining pipeline, which is trained on multimodal data, including RGB images, depth and optical flow. The pretrained COMPASS model can be deployed on various downstream autonomous systems tasks. In this work, we test COMPASS on simulated drone navigation, car racing and visual odometry. This highlights how the system can be deployed in very different environments and application scenarios.<\/center><\/figcaption><\/figure>\n\n\n\n

Humans have the fundamental cognitive ability to perceive the environment through multimodal sensory signals and use them to accomplish a wide variety of tasks. It is crucial that an autonomous agent can similarly perceive the underlying state of an environment from different sensors and appropriately consider how to accomplish a task. For example, localization (or “where am I?”) is a fundamental question that an autonomous agent needs to answer before navigating, and it is often addressed via visual odometry. Highly dynamic tasks, such as vehicle racing, necessitate collision avoidance and an understanding of how the agent’s state evolves with respect to the environment. Agents must learn perceptual representations of geometric and semantic information from the environment so that their actions can influence the world.

Task-driven approaches are appealing, but learning representations suitable only for a specific task limits their ability to generalize to new scenarios, thus confining their utility. For example, as shown in Figure 1, achieving tasks such as drone navigation and vehicle racing usually requires specifically designed models that encode representations from very different sensor modalities, environments, sensory signals, and sampling rates. Such models must also cope with the different dynamics and controls of each application scenario. Therefore, we ask whether it is possible to build general-purpose pretrained models for autonomous systems that are agnostic to tasks and individual form factors.

In our recent work, COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems, we introduce a general-purpose pretraining pipeline built to overcome such limitations arising from task-specific models. The code can be viewed on GitHub.
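As a rough illustration of the contrastive multimodal idea, the sketch below aligns embeddings from two modalities (RGB frames and optical flow) with a symmetric InfoNCE-style loss, so that matching pairs from the same time step are pulled together and other pairs in the batch are pushed apart. This is not the COMPASS implementation: the encoder architectures, modality pairing, embedding size, and temperature are placeholder assumptions chosen only to keep the example self-contained and runnable.

```python
# Illustrative sketch (not the authors' implementation) of contrastive
# multimodal pretraining: align embeddings of paired RGB and optical-flow
# inputs with a symmetric InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Toy CNN encoder mapping an image-like modality to a unit-norm embedding."""
    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def contrastive_loss(z_a, z_b, temperature: float = 0.07):
    """Symmetric InfoNCE: the i-th sample of each modality is the positive
    for the i-th sample of the other; all other batch pairs are negatives."""
    logits = z_a @ z_b.t() / temperature        # (B, B) cosine-similarity matrix
    targets = torch.arange(z_a.size(0))         # matching indices along the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch with random tensors standing in for a batch of paired frames.
rgb_encoder, flow_encoder = ModalityEncoder(3), ModalityEncoder(2)
rgb = torch.randn(8, 3, 64, 64)    # RGB frames
flow = torch.randn(8, 2, 64, 64)   # optical-flow fields (dx, dy)
loss = contrastive_loss(rgb_encoder(rgb), flow_encoder(flow))
loss.backward()
```

In this kind of setup, the pretrained encoders can later be reused as perception backbones for downstream tasks, which is the spirit of the general-purpose pipeline described above.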


COMPASS features three key aspects: