{"id":669336,"date":"2020-07-01T12:41:14","date_gmt":"2020-07-01T19:41:14","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=669336"},"modified":"2020-07-02T19:47:52","modified_gmt":"2020-07-03T02:47:52","slug":"teaching-a-robot-to-see-and-navigate-with-simulation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/teaching-a-robot-to-see-and-navigate-with-simulation\/","title":{"rendered":"Teaching a robot to see and navigate with simulation"},"content":{"rendered":"

\"\"<\/p>\n

Editor's note: This post and its research are the collaborative efforts of our team, which includes Wenshan Wang, Delong Zhu of the Chinese University of Hong Kong; Xiangwei Wang of Tongji University; Yaoyu Hu, Yuheng Qiu, Chen Wang of the Robotics Institute of Carnegie Mellon University; Yafei Hu, Ashish Kapoor of Microsoft Research, and Sebastien Scherer.

The ability to see and navigate is a critical operational requirement for robots and autonomous systems. For example, consider autonomous rescue robots that are required to maneuver and navigate in challenging physical environments that humans cannot safely access. Similarly, building AI agents that can efficiently and safely close the perception-action loop requires thoughtful engineering to enable sensing and perceptual abilities in a robot. However, building a real-world autonomous system that can operate safely at scale is a very difficult task. The partnership between Microsoft Research and Carnegie Mellon University, announced in April 2019, is continuing to advance the state of the art in autonomous systems through research focused on solving real-world challenges such as autonomous mapping, navigation, and inspection of underground urban and industrial environments.

Simultaneous Localization and Mapping (SLAM) is one of the most fundamental capabilities necessary for robots. SLAM has made impressive progress with both geometric-based and learning-based methods; however, robust and reliable SLAM systems for real-world scenarios remain elusive. Real-life environments are full of difficult cases such as changing light or lack of illumination, dynamic objects, and texture-less scenes.

Recent advances in deep reinforcement learning, data-driven control, and deep perception models are fundamentally changing how we build and engineer autonomous systems. Much of the success in the past with SLAM has come from geometric approaches. The availability of large training datasets, collected in a wide variety of conditions, helps push the envelope of data-driven techniques and algorithms.

SLAM is fundamentally different from and more complicated than static image recognition, object recognition, or activity recognition because of the sequential nature of recognizing landmarks (such as buildings and trees) in a dynamic physical environment while driving or flying through it. Second, many SLAM systems use multiple sensing modalities, such as RGB cameras, depth cameras, and LiDAR, which makes data collection a considerable challenge. Finally, we believe that a key to solving SLAM robustly is curating data instances with ground truth in a wide variety of conditions, with varying lighting, weather, and scenes, a task that is daunting and expensive to accomplish in the real world with real robots.

We present TartanAir, a comprehensive dataset for robot navigation tasks and more. This large dataset was collected using photorealistic simulation environments based on AirSim, with various light and weather conditions and moving objects. Our paper on the dataset has been accepted and will appear at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020). By collecting data in simulation, we can obtain multi-modal sensor data and precise ground truth labels, including stereo RGB images, depth images, segmentation, optical flow, camera poses, and LiDAR point clouds. TartanAir contains a large number of environments with various styles and scenes, covering challenging viewpoints and diverse motion patterns that are difficult to achieve with physical data collection platforms. Based on the TartanAir dataset, we are hosting a visual SLAM challenge that kicked off at a Computer Vision and Pattern Recognition (CVPR) 2020 workshop. It consists of a monocular track and a stereo track, each with 16 trajectories containing challenging features that aim to push the limits of visual SLAM algorithms. The goal is to localize the robot and map the environment from a sequence of monocular or stereo images. The deadline to submit entries to the challenge is August 15, 2020. To learn more about participating, visit the challenge site.
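As a rough illustration of what collecting multi-modal ground truth from an AirSim simulation can look like, here is a minimal Python sketch using the standard AirSim client API. The camera name "0" is an assumption that depends on your settings.json, the returned RGB buffer may have three or four channels depending on the AirSim version, and this is not the TartanAir collection pipeline itself.

```python
# Minimal sketch: grabbing RGB, depth, segmentation, and ground-truth pose from AirSim.
# Camera name "0" is an assumption; it depends on your settings.json.
import airsim
import numpy as np

client = airsim.VehicleClient()   # computer-vision-mode client
client.confirmConnection()

# Request several modalities for the same instant in one call.
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),         # RGB
    airsim.ImageRequest("0", airsim.ImageType.DepthPlanar, True, False),    # per-pixel depth as floats
    airsim.ImageRequest("0", airsim.ImageType.Segmentation, False, False),  # semantic segmentation
])

rgb = np.frombuffer(responses[0].image_data_uint8, dtype=np.uint8)
rgb = rgb.reshape(responses[0].height, responses[0].width, -1)   # channel count varies by version
depth = airsim.list_to_2d_float_array(
    responses[1].image_data_float, responses[1].width, responses[1].height)

# Ground-truth camera pose comes directly from the simulator state.
pose = client.simGetVehiclePose()
print("position:", pose.position, "orientation:", pose.orientation)
```

Because every label comes from the simulator state rather than from estimation, quantities such as depth, pose, and segmentation are exact, which is what makes a simulated dataset attractive for benchmarking.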


We minimize the sim-to-real gap by utilizing a large number of environments with various styles and diverse scenes. A unique goal of our dataset is to focus on challenging environments with changing light conditions, adverse weather, and dynamic objects. State-of-the-art SLAM algorithms struggle to track the camera pose in our dataset and repeatedly get lost on some of the more challenging sequences. We propose a metric to evaluate the robustness of an algorithm under these conditions. We also developed an automatic data collection pipeline for the TartanAir dataset, which allows us to process more environments with minimal human intervention.
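The precise robustness metric used for the evaluation is defined in the paper; as a generic point of comparison, the sketch below computes the widely used absolute trajectory error (ATE) between a ground-truth and an estimated trajectory after rigid alignment. The function name and the alignment-then-RMSE formulation are illustrative, not the metric proposed for TartanAir.

```python
# Generic sketch: absolute trajectory error (ATE) after rigid (rotation + translation)
# alignment, a common way to score SLAM trajectory estimates against ground truth.
import numpy as np

def absolute_trajectory_error(gt_xyz: np.ndarray, est_xyz: np.ndarray) -> float:
    """RMSE of position error after aligning est_xyz (N x 3) to gt_xyz (N x 3)."""
    gt_mean, est_mean = gt_xyz.mean(axis=0), est_xyz.mean(axis=0)
    gt_c, est_c = gt_xyz - gt_mean, est_xyz - est_mean

    # Best-fit rotation via SVD (Kabsch/Horn method, no scale).
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T

    aligned = (R @ est_c.T).T + gt_mean
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))

# Example: an estimate identical to ground truth scores (close to) zero.
traj = np.cumsum(np.random.randn(200, 3), axis=0)
print(absolute_trajectory_error(traj, traj))
```

A robustness-oriented evaluation typically goes further, for example by scoring how often and for how long a method loses track on a sequence rather than only averaging pose error.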

\"Approximately

Figure 1: A glance at the simulated environments. The TartanAir dataset covers a wide range of scenes, categorized into urban, rural, nature, domestic, public, and science fiction. Environments within the same category also have broad diversity.<\/p><\/div>\n

Dataset features

We have adopted 30 photorealistic simulation environments that provide a wide range of scenarios covering many challenging situations. The simulation scenes consist of: