Autonomous Systems and Robotics Group Articles
http://approjects.co.za/?big=en-us/research/

SMART – A Generalized Pretraining Framework for Control Tasks
http://approjects.co.za/?big=en-us/research/articles/smart-a-generalized-pretraining-framework-for-control-tasks/ – Tue, 28 Feb 2023

We are announcing SMART, a generalized pretraining framework for a wide variety of control tasks.


Self-supervised pretraining of large neural networks (BERT, GPT, MoCo, and CLIP) has been shown to be successful in a wide range of language and vision problems. These works demonstrate that a single pretrained model can be easily finetuned to perform many downstream tasks, resulting in a simple, effective, and data-efficient paradigm. When it comes to control tasks, however, it is not yet clear whether the successes of pretraining approaches can be easily replicated. So, we ask the question: can we enable a similar pretraining paradigm for efficient decision-making across various control tasks?

In “SMART: Self-supervised Multi-task pretrAining with contRol Transformers”, to be published at ICLR 2023 (as a notable-top-25% paper), we study how to pretrain a versatile, generalizable, and resilient model for a wide variety of control tasks. We demonstrate that SMART can significantly improve learning efficiency and facilitate rapid transfer to novel tasks under different learning scenarios, including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality datasets that are randomly collected.

We now discuss the challenges and introduce our key design concepts and technical details.

Challenges unique to control tasks

Several research efforts investigate applying pretrained vision models to facilitate control tasks. However, sequential decision making poses unique challenges that go beyond the considerations of existing vision and language pretraining. We highlight these challenges below:

  • Data distribution shift: Training data for decision-making tasks is usually composed of trajectories generated under specific behavior policies. As a result, data distributions during pretraining, downstream finetuning, and deployment can be drastically different, resulting in suboptimal performance.
  • Large discrepancy between tasks: In contrast to language and vision, where the underlying semantic information is often shared across tasks, decision-making tasks span a large variety of task-specific configurations, transition functions, rewards, and state-action spaces. Consequently, it is hard to obtain a generic representation for multiple decision-making tasks.
  • Long-term reward maximization: A good representation for downstream policy learning should capture information relevant for both immediate and long-term planning, which is usually hard in tasks with long horizons, partial observability, and continuous control.
  • Lack of supervision and high-quality data: Success in representation learning often depends on the availability of high-quality expert demonstrations and ground-truth rewards. However, for most sequential decision-making tasks, high-quality data and/or supervisory signals are either non-existent or prohibitively expensive to obtain.

Unlocking generalized pretraining-finetuning pipeline for sequential decision-making

In this work, we follow ideas established in the vision and language communities to explicitly define our pretraining and finetuning pipeline. Specifically, during the pretraining phase we train representations with a large offline dataset collected from a set of training tasks. Then, given a specific downstream task, which may or may not be among the pretraining tasks, we attach a simple policy head on top of the pretrained representation and train it with Imitation Learning (IL) or Reinforcement Learning (RL). The central tenet of pretraining is to learn generic representations which allow downstream task finetuning to be simple, effective, and efficient, even in low-data regimes. The pretrained model is expected to be:

  • Versatile so as to handle a wide variety of downstream control tasks and variable downstream learning methods such as IL and RL,
  • Generalizable to unseen tasks and domains spanning multiple rewards and agent dynamics, and
  • Resilient to varying-quality pretraining data without supervision.

SMART architecture and framework

A unified model architecture to fit different learning methods

Inspired by the recent success of transformer models in sequence modeling, we propose a Control Transformer (CT). The input to the model is a control sequence composed of observations and actions, and the outputs of CT are token embeddings representing each observation and action, respectively. The figure below depicts the CT architecture. Unlike the Decision Transformer (DT), which directly learns reward-based policies, CT is designed to learn reward-agnostic representations, which enables it to serve as a unified model across different learning methods (e.g., Imitation Learning (IL) and Reinforcement Learning (RL)) and various tasks.

Figure 1: Architecture of the Control Transformer. In the pretraining phase, we use the control-centric objective to train representations over multiple tasks; in the finetuning phase, where a specific task is given, we learn a policy based on the pretrained representation (pretrained weights are shown in grey blocks). The construction of the policy head can vary for different downstream datasets or learning methods.
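
For intuition, here is a minimal PyTorch sketch of the interleaved observation-action tokenization described above. The class name, dimensions, and encoders are illustrative assumptions rather than the released implementation:

import torch
import torch.nn as nn

class ControlTransformerSketch(nn.Module):
    # Illustrative stand-in for the Control Transformer: observations and actions
    # are embedded, interleaved as (o_1, a_1, o_2, a_2, ...), and passed through a
    # causal transformer that returns one embedding per token.
    def __init__(self, obs_dim=64, act_dim=6, d_model=128, n_layers=4, n_heads=4, max_len=256):
        super().__init__()
        self.obs_embed = nn.Linear(obs_dim, d_model)   # swap for a CNN encoder when observations are images
        self.act_embed = nn.Linear(act_dim, d_model)
        self.pos_embed = nn.Embedding(2 * max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, obs, act):                       # obs: (B, T, obs_dim), act: (B, T, act_dim)
        B, T, _ = obs.shape
        tokens = torch.stack([self.obs_embed(obs), self.act_embed(act)], dim=2).reshape(B, 2 * T, -1)
        tokens = tokens + self.pos_embed(torch.arange(2 * T, device=obs.device))
        causal = torch.triu(torch.full((2 * T, 2 * T), float("-inf"), device=obs.device), diagonal=1)
        return self.encoder(tokens, mask=causal)       # (B, 2T, d_model): one embedding per observation/action token

During finetuning, a small policy head can be attached on top of these token embeddings while the pretrained weights are reused as-is or trained further.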

Control-centric pretraining objectives to learn generic representations

Built upon CT, we propose a control-centric pretraining objective that consists of three terms: forward dynamics prediction, inverse dynamics prediction, and random masked hindsight control. The figure below illustrates each objective, and a schematic code sketch of the combined loss follows the list. These terms focus on policy-independent transition probabilities and encourage CT to capture dynamics information at both short-term and long-term temporal granularities.

Figure 2: The three terms of our proposed pretraining objective. The red shaded areas denote the attention span, while the grey regions are masked.
  • Forward Dynamics Prediction: For each observation-action pair in a control sequence, we aim to predict the next immediate latent state. This forward prediction captures the local transition information in the embedding space.
  • Inverse Dynamics Prediction: For each consecutive observation pair, we learn to recover the action that leads to the transition between the observation pair.
  • Random Masked Hindsight Control: Given a control sequence, we randomly mask a subset of the actions and observations, and recover the masked actions from the remaining incomplete sequence. This objective is akin to asking the question “what actions should I take to generate such a trajectory?” We therefore replace the causal attention mask with a non-causal one, temporarily allowing the model to “see the future”. As a result, we encourage the model to learn controllable representations and global temporal relations, and to attend to the most essential representations for multi-step control.
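
Putting the three terms together, the combined objective might look roughly like the following. This is purely illustrative: the head modules, masking ratio, and equal loss weights are assumptions, and the input-side masking plus non-causal attention used for the hindsight term are omitted for brevity.

import torch
import torch.nn as nn

# Hypothetical prediction heads on top of the Control Transformer token embeddings
# (obs_emb = tokens[:, 0::2], act_emb = tokens[:, 1::2] from the sketch above).
d_model, act_dim = 128, 6
forward_head = nn.Linear(2 * d_model, d_model)     # (o_t, a_t) embeddings -> predicted latent o_{t+1}
inverse_head = nn.Linear(2 * d_model, act_dim)     # (o_t, o_{t+1}) embeddings -> action a_t
hindsight_head = nn.Linear(d_model, act_dim)       # masked-action embedding -> recovered action

def pretraining_loss(obs_emb, act_emb, actions, mask_ratio=0.15):
    # Forward dynamics: predict the next latent observation from (o_t, a_t).
    fwd_pred = forward_head(torch.cat([obs_emb[:, :-1], act_emb[:, :-1]], dim=-1))
    l_fwd = nn.functional.mse_loss(fwd_pred, obs_emb[:, 1:].detach())
    # Inverse dynamics: recover a_t from (o_t, o_{t+1}).
    inv_pred = inverse_head(torch.cat([obs_emb[:, :-1], obs_emb[:, 1:]], dim=-1))
    l_inv = nn.functional.mse_loss(inv_pred, actions[:, :-1])
    # Random masked hindsight control: recover actions at randomly chosen positions.
    masked = torch.rand(actions.shape[:2], device=actions.device) < mask_ratio
    l_hind = nn.functional.mse_loss(hindsight_head(act_emb)[masked], actions[masked]) if masked.any() else 0.0
    return l_fwd + l_inv + l_hind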

Experimental results highlights

The multi-task DMC benchmark

We evaluate SMART on the DeepMind Control (DMC) suite, which contains a series of continuous control tasks with RGB image observations. There are multiple domains (physical models with different state and action spaces) and multiple tasks (each associated with a particular MDP) within each domain, which creates diverse scenarios for evaluating pretrained representations. Our experiments use 10 different tasks spanning 6 domains. In pretraining, we use an offline dataset collected over 5 tasks, while the other 5 tasks (including 2 unseen domains) are held out to test the generalizability of SMART. The graphical relations of all tasks and domains involved are shown in the figure below.

Figure: graphical relation of the tasks and domains used for pretraining and evaluation.

Versatility

To evaluate the versatility of SMART, we design experiments to answer the following questions:

  • Whether a single pretrained model can be finetuned with different downstream learning methods (i.e., Return-To-Go conditioning (RTG) and Behavior Cloning (BC));
  • Whether the pretrained model can adapt to various downstream tasks.
Figure 3: Downstream learning rewards of SMART (red) compared with pretraining CT with single-task data (blue) and training from scratch (gray). Results are averaged over 3 random seeds. Scratch trains a policy with randomly initialized CT representation weights. CT-single is a variant of SMART, which pretrains CT with a single-task dataset containing trajectories from the downstream environment.

In the figure above, we compare the reward curves of SMART with Scratch and CT-single, where models are pretrained with the Exploratory dataset. CT pretrained on either a single-task dataset (CT-single) or the multi-task dataset (SMART) achieves much better results than training from scratch. In general, under both RTG and BC finetuning, pretrained models have a warm start, a faster convergence rate, and relatively better asymptotic performance across a variety of downstream tasks. In most cases, pretraining CT on the multi-task dataset (SMART) yields better results than pretraining with only in-task data (CT-single), even though it is harder to accommodate multiple different tasks with the same model capacity; this suggests that SMART can extract common knowledge from diverse tasks.

Generalizability

The figure below shows the performance of SMART pretrained on the Exploratory dataset, compared to Scratch and CT-single, on 5 unseen tasks. We can see that SMART is able to generalize to unseen tasks and even unseen domains, whose distributions differ substantially from the pretraining dataset. Surprisingly, SMART achieves better performance than CT-single in most tasks, even though CT-single has already seen the downstream environments. This suggests that good generalization can be obtained by learning underlying information that is shared among multiple tasks and domains spanning a diverse set of distributions.

Figure 4: Downstream learning rewards in unseen tasks and domains of SMART (red) compared with pretraining CT with single-task data (blue) and training from scratch (gray). Results are averaged over 3 seeds.

To further investigate the generalizability of SMART, we evaluate its performance in more challenging domains and tasks that differ more strongly from the pretraining domains/tasks. These additional domain-tasks are: ball-in-cup-catch, finger-turn-hard, fish-swim, swimmer-swimmer6, and swimmer-swimmer15. Note that these agents have significantly different appearances and movement patterns compared to the pretraining tasks, as visualized in the figure below.

Figure 5: Discrepancy between pretraining domains and selected downstream domains: (left) Walker domain. (right) Swimmer domain (6 and 15 links)

The results are shown in the figures below, where we can see that the pretrained model still works in most cases, even under such a large task discrepancy. Note that here CT-single is pretrained with data from exactly the downstream task, whereas SMART has never seen a sample from the downstream tasks and is pretrained on significantly different domains. It is therefore unsurprising that CT-single is generally better than SMART in this setting. However, it is interesting to see that SMART is comparable with or even better than CT-single in some tasks, suggesting strong generalizability. On the other hand, it is unavoidable that the performance of a pretrained model will decrease as the discrepancy between pretraining tasks and downstream tasks increases. We therefore stress the importance of using diverse multi-task data for pretraining in practice.

Figure 6: Downstream learning rewards of SMART (red) in challenging tasks that differ strongly from the pretraining tasks, using the Exploratory pretraining dataset. Results are from 1 random seed.
Figure 7: Downstream learning rewards of SMART (red) in challenging tasks that differ strongly from the pretraining tasks, using the Random pretraining dataset. Results are from 1 random seed.

Resilience

We aggregate the results across all tasks by averaging the normalized reward (raw scores divided by expert scores) in both the RTG and BC settings. When using the Exploratory dataset for pretraining, SMART outperforms ACL and is comparable to DT, which has access to extra reward information. When pretrained with the Random dataset, SMART is significantly better than both DT and ACL, while ACL fails to outperform training from scratch. These results show that SMART is more robust to low-quality data than the baseline methods.

Figure 8: Downstream learning rewards (normalized by expert score) of all methods using the Exploratory and Random datasets. The gap between each pair of green and red bars reflects the resilience of each method to pretraining data quality, and SMART shows the best resilience among all baselines.

Analysis

In large-scale training problems, performance usually benefits from larger model capacity. We investigate whether this also applies to sequential decision-making tasks by varying the embedding size (width) and the number of layers (depth) in CT. The per-task comparisons are shown in the figure below. In general, increasing the model depth leads to better performance. However, when the embedding size gets too large, performance drops, as an overly large representation space may admit irrelevant information. In addition, the choice of model capacity should be considered together with the scale and diversity of the training dataset.

Figure 9: Comparison of varying model capacities (embedding size and number of layers) across tasks, in terms of relative improvement with respect to training from scratch.

Towards Foundation Models for Perception and Control

We are thrilled to announce the release of SMART, a technique designed to bring foundation models for decision-making within reach of a wider audience. Our goal with SMART is to make it easy for anyone to use pretrained foundation models without requiring specialized knowledge of model architecture or pretraining approaches. By leveraging the latest advances in spatio-temporal data analysis, SMART is at the forefront of addressing the challenges of perception and control jointly. Our team is excited to see what the future holds for this powerful new technique.

This work is being undertaken by members of the Microsoft Autonomous Systems and Robotics Research Group and the University of Maryland. The researchers included in this project are: Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor.

ChatGPT for Robotics: Design Principles and Model Abilities
http://approjects.co.za/?big=en-us/research/articles/chatgpt-for-robotics/ – Tue, 21 Feb 2023

We extended the capabilities of ChatGPT to robotics, and controlled multiple platforms such as robot arms, drones, and home assistant robots intuitively with language.


Have you ever wanted to tell a robot what to do using your own words, like you would to a human? Wouldn’t it be amazing to just tell your home assistant robot: “Please warm up my lunch”, and have it find the microwave by itself? Even though language is the most intuitive way for us to express our intentions, we still rely heavily on hand-written code to control robots. Our team has been exploring how we can change this reality and make natural human-robot interactions possible using OpenAI’s new AI language model, ChatGPT.

ChatGPT is a language model trained on a massive corpus of text and human interactions, allowing it to generate coherent and grammatically correct responses to a wide range of prompts and questions. Our goal with this research is to see if ChatGPT can think beyond text, and reason about the physical world to help with robotics tasks. We want to help people interact with robots more easily, without needing to learn complex programming languages or details about robotic systems. The key challenge here is teaching ChatGPT how to solve problems considering the laws of physics, the context of the operating environment, and how the robot’s physical actions can change the state of the world.

It turns out that ChatGPT can do a lot by itself, but it still needs some help. Our technical paper describes a series of design principles that can be used to guide language models towards solving robotics tasks. These include, and are not limited to, special prompting structures, high-level APIs, and human feedback via text. We believe that our work is just the start of a shift in how we develop robotics systems, and we hope to inspire other researchers to jump into this exciting field. Continue reading for more technical details on our methods and ideas.

Challenges in robotics today, and how ChatGPT can help

Current robotics pipelines begin with an engineer or technical user who needs to translate the task’s requirements into code for the system. The engineer sits in the loop, meaning that they need to write new code and specifications to correct the robot’s behavior. Overall, this process is slow (the user needs to write low-level code), expensive (it requires highly skilled users with deep knowledge of robotics), and inefficient (it requires multiple interactions to get things working properly).

Figure: robotics pipelines today versus with ChatGPT in the loop.

ChatGPT unlocks a new robotics paradigm, allowing a (potentially non-technical) user to sit on the loop, providing high-level feedback to the large language model (LLM) while monitoring the robot’s performance. By following our set of design principles, ChatGPT can generate code for robotics scenarios. Without any fine-tuning, we leverage the LLM’s knowledge to control different robot form factors for a variety of tasks. In our work we show multiple examples of ChatGPT solving robotics puzzles, along with complex robot deployments in the manipulation, aerial, and navigation domains.

Robotics with ChatGPT: design principles

Prompting LLMs is a highly empirical science. Through trial and error, we built a methodology and a set of design principles for writing prompts for robotics tasks:

Figure: the new robotics pipeline with ChatGPT in the loop.
  1. First, we define a set of high-level robot APIs or a function library. This library can be specific to a particular robot and should map to existing low-level implementations from the robot’s control stack or a perception library. It is very important to use descriptive names for the high-level APIs so that ChatGPT can reason about their behaviors;
  2. Next, we write a text prompt for ChatGPT which describes the task goal while also explicitly stating which functions from the high-level library are available. The prompt can also contain information about task constraints, or about how ChatGPT should form its answers (a specific coding language, the use of auxiliary parsing elements);
  3. The user stays on the loop to evaluate ChatGPT’s code output, either through direct inspection or using a simulator. If needed, the user provides natural-language feedback to ChatGPT on the answer’s quality and safety.
  4. When the user is happy with the solution, the final code can be deployed onto the robot. A simplified, hypothetical example of this recipe is sketched below.
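
To make the recipe concrete, here is a hypothetical, highly simplified API library and prompt in Python. The function names and the prompt wording are illustrative assumptions, not the exact ones used in the paper:

# Step 1: a small high-level API library with descriptive names. The implementations
# would wrap the robot's real control and perception stack; they are stubbed out here.
def get_position(object_name: str):
    """Return the (x, y, z) position of a named object in the scene."""
    raise NotImplementedError  # hooked up to the perception stack in a real deployment

def fly_to(x: float, y: float, z: float):
    """Command the drone to fly to the given coordinates."""
    raise NotImplementedError  # hooked up to the flight controller in a real deployment

# Step 2: a text prompt that states the goal and the available functions.
PROMPT = """You are controlling a drone. You may only use these Python functions:
get_position(object_name), fly_to(x, y, z).
Task: inspect each shelf in the warehouse in a zig-zag pattern, then return to the landing pad.
Reply with Python code only."""

# Steps 3 and 4: the user reviews the generated code (for example, in simulation),
# provides language feedback if needed, and only then deploys it on the robot.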

Enough theory… What exactly can ChatGPT do?

Let’s take a look at a few examples… You can find even more case studies in our code repository.

Zero-shot task planning

We gave ChatGPT access to functions that control a real drone, and it proved to be an extremely intuitive language-based interface between the non-technical user and the robot. ChatGPT asked clarification questions when the user’s instructions were ambiguous, and wrote complex code structures for the drone such as a zig-zag pattern to visually inspect shelves. It even figured out how to take a selfie! 📷 😎

We also used ChatGPT in a simulated industrial inspection scenario with the Microsoft AirSim simulator. The model was able to effectively parse the user’s high-level intent and geometrical cues to control the drone accurately.

User on the loop: when a conversation is needed for complex tasks

Next, we used ChatGPT in a manipulation scenario with a robot arm. We used conversational feedback to teach the model how to compose the originally provided APIs into more complex high-level functions that ChatGPT coded by itself. Using a curriculum-based strategy, the model was able to chain these learned skills together logically to perform operations such as stacking blocks.

In addition, the model displayed a fascinating example of bridging the textual and physical domains when tasked with building the Microsoft logo out of wooden blocks. Not only was it able to recall the logo from its internal knowledge base, it was also able to ‘draw’ the logo (as SVG code), and then use the skills learned above to figure out which existing robot actions could compose its physical form.

Excerpt from ChatGPT conversation where it recalls the Microsoft logo from its knowledge base and draws it using SVG code.

Next, we tasked ChatGPT with writing an algorithm for a drone to reach a goal in space without crashing into obstacles. We told the model that the drone has a forward-facing distance sensor, and ChatGPT coded most of the key building blocks for the algorithm right away. This task required some conversation with the human, and we were impressed by ChatGPT’s ability to make localized code improvements using only language feedback.

Perception-action loops: robots that sense the world before they act

The ability to sense the world (perception) before doing something (action) is fundamental to any robotics system. Therefore, we decided to test ChatGPT’s understanding of this concept and asked it to explore an environment until it found a user-specified object. We gave the model access to functions such as object detection and object distance APIs, and verified that the code it generated successfully implemented a perception-action loop.
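
The kind of loop ChatGPT produced looked roughly like the following reconstructed sketch. The API names (detect_objects, get_distance, turn_left, move_forward) are placeholder stubs standing in for the functions we exposed, not a published interface:

# Placeholder bindings for the exposed robot APIs (purely illustrative stubs).
def detect_objects(): return []          # names of objects currently in view
def get_distance(name): return 1.0       # distance in meters to a named object
def turn_left(degrees): pass             # rotate in place
def move_forward(meters): pass           # translate forward

def explore_until_found(target, max_steps=200):
    # Perception-action loop: sense the scene, decide, act, and repeat.
    for _ in range(max_steps):
        visible = detect_objects()
        if target in visible:
            if get_distance(target) < 0.5:
                return True              # close enough to the requested object
            move_forward(0.25)           # approach it in small increments
        else:
            turn_left(30)                # otherwise keep scanning the environment
    return False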

Going further, we ran additional experiments to evaluate whether ChatGPT can decide where the robot should go based on sensor feedback in real time (as opposed to having ChatGPT generate a code loop that makes these decisions). Interestingly, we verified that we could feed a textual description of the camera image at each step into the chat, and the model was able to figure out how to control the robot until it reached a particular object.

PromptCraft, a collaborative open-source tool for LLM+Robotics research

Good prompt engineering is crucial for the success of LLMs such as ChatGPT on robotics tasks. Unfortunately, prompting is an empirical science, and there is a lack of comprehensive and accessible resources with good (and bad) examples to help researchers and enthusiasts in the field. To address this gap, we introduce PromptCraft, a collaborative open-source platform where anyone can share examples of prompting strategies for different robotics categories. We release all of the prompts and conversations used in this study, and we invite readers to contribute more!

Besides prompt design, we hope to also include multiple robotics simulators and interfaces to allow users to test their ChatGPT-generated algorithms. As a start, we also release an AirSim environment with ChatGPT integration that anyone can use to get started with these ideas. We welcome contributions of new simulators and interfaces as well.

The ChatGPT-AirSim interface

Bringing robotics out of labs, and into the world

We are excited to release these technologies with the aim of bringing robotics within the reach of a wider audience. We believe that language-based robotics control will be fundamental to bringing robotics out of science labs and into the hands of everyday users.

That said, we emphasize that the outputs from ChatGPT are not meant to be deployed directly on robots without careful analysis. We encourage users to harness the power of simulation to evaluate these algorithms before potential real-life deployments, and to always take the necessary safety precautions. Our work represents only a small fraction of what is possible at the intersection of large language models and robotics, and we hope to inspire much of the work to come.

Citation

If you find this work useful in your research, please cite us as:

@techreport{vemprala2023chatgpt,
author = {Vemprala, Sai and Bonatti, Rogerio and Bucker, Arthur and Kapoor, Ashish},
title = {ChatGPT for Robotics: Design Principles and Model Abilities},
institution = {Microsoft},
year = {2023},
month = {February},
url = {http://approjects.co.za/?big=en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/},
number = {MSR-TR-2023-8},
}

This work is being undertaken by members of the Microsoft Autonomous Systems and Robotics Research Group. The researchers included in this project are: Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor.

Introducing ClimaX: The first foundation model for weather and climate
http://approjects.co.za/?big=en-us/research/articles/introducing-climax-the-first-foundation-model-for-weather-and-climate/ – Thu, 26 Jan 2023

We are announcing ClimaX, a flexible and generalizable deep learning model for weather and climate science.

Figure 1: ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks. For weather, these tasks include standard forecasting of relevant variables such as temperature and humidity, at various lead times and resolutions, both globally and regionally. For climate, ClimaX can help make better long-term projections, or downscale lower-resolution model outputs to higher resolutions.

ClimaX is trained using several heterogeneous datasets spanning many weather variables at multiple spatio-temporal resolutions. We show that such a foundational model can be fine-tuned to address a wide variety of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal granularities unseen during pretraining. ClimaX will be made available for academic and research use shortly.

The key insight behind our effort is the realization that all prediction and modeling tasks in weather and climate science are based on physical phenomena and their interactions with local and global geography. Consequently, a foundation model that captures a multitude of weather and climate variables at many different scales will eventually encode these physical laws and the relevant geographical interactions.

Current state-of-the-art numerical weather and climate models are based on simulations of large systems of differential equations, which relate the flow of energy and matter based on the known physics of different Earth systems. As a result, these models usually need to run on large supercomputers at high resolution. Although very successful, they are known to have weaknesses and limitations at both long and short time horizons.

On the other hand, advances in technology have led to an abundance of data from satellites, radar, and other weather sensors. This data can provide valuable information for weather and climate modeling, especially at finer temporal and spatial resolutions, while potentially accounting for less understood, complex physics. However, current large-scale numerical weather and climate models have a hard time assimilating data at this scale.

Machine learning (ML) models can provide an alternative tradeoff, benefiting from the scale of both data and compute. Recent attempts at scaling up deep learning systems for short- and medium-range weather forecasting have already led to considerable success, often matching current state-of-the-art numerical weather models on key variables of interest. However, since most of these ML models are trained for a specific predictive task on specific datasets, they lack general-purpose utility for Earth system sciences and are not fully grounded in physics.

From a machine learning perspective, the plethora of available data – from direct weather measurements on land, at sea, and in the atmosphere, over multiple decades of reanalyzed weather data at different spatial scales, to physics-informed climate projections for various scenarios – is fertile ground for building physics-grounded foundation models for weather and climate modeling. Especially so since weather and climate data commonly share the same underlying equations (although with fairly distinct characteristics).

ClimaX Architecture and Framework

In disciplines such as natural language processing and computer vision, it is well acknowledged that ML models trained to solve a single task using supervised learning are label-hungry during training and brittle when deployed outside their training distribution. In recent years, pretraining large unsupervised “foundation” models has therefore emerged as a new paradigm that mitigates the supervision bottleneck. After pretraining, the same model can be finetuned on a nearly arbitrary span of tasks with little to no (i.e., zero-shot) additional supervision.

ClimaX follows the pretraining-finetuning paradigm. For pretraining ClimaX, our first key proposal is to go beyond standard homogeneous weather datasets and instead leverage physics-informed climate simulation datasets, which are abundant thanks to the various climate simulations produced by multiple groups. Even using only a tiny fraction of the available data, we show that the heterogeneity in these datasets is enough to serve as a rich and plentiful pretraining corpus.

But to do so, we need a model architecture that can aptly embrace the heterogeneity of those climate datasets, which are highly multimodal, as observations typically correspond to many different, unbounded variables. Moreover, many observational datasets are irregular in the sense that they differ in their spatiotemporal coverage, corresponding to different subsets of atmospheric variables.

At its core, ClimaX is a multi-dimensional image-to-image translation architecture based on Vision Transformers (ViT). ViT-based architectures are especially well suited for modeling weather and climate phenomena, since they naturally tokenize spatial, multiscale data across different spatiotemporal inputs and additionally offer the opportunity to extend tokenization to a wide range of multi-channel features. However, two fundamental changes are needed to repurpose the ViT architecture for ClimaX: variable tokenization and variable aggregation.

Figure 2: The ClimaX architecture as used during pretraining. Variables are encoded using variable-separate tokenization and subsequently aggregated using variable aggregation. Together with position and lead-time embeddings, these are fed to the ViT backbone.

Variable tokenization: The standard ViT tokenization scheme for image data divides the input into equally sized patches and flattens these patches over the width, height, and channel dimensions into a vector. However, this is not as straightforward for climate and weather data, where the number of physical variables can vary between datasets. Concretely, in our case each climate pretraining data subset contains simulated data from different models, and thus has different underlying variables. We therefore propose variable tokenization, which treats variables as separate modalities to enable more flexible training even with irregular datasets.

Figure 3: Variable tokenization. We treat variables as separate modalities to enable more flexible training.
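
A minimal sketch of what variable-separate tokenization can look like in PyTorch is shown below. The patch size, dimensions, and module layout are illustrative assumptions, not the released ClimaX code:

import torch
import torch.nn as nn

class VariableTokenizerSketch(nn.Module):
    # Each physical variable gets its own patch embedding, so datasets with
    # different subsets of variables can still share one backbone.
    def __init__(self, var_names, patch=16, d_model=256):
        super().__init__()
        self.embed = nn.ModuleDict({
            v: nn.Conv2d(1, d_model, kernel_size=patch, stride=patch) for v in var_names
        })

    def forward(self, fields):                               # fields: dict of var_name -> (B, H, W) grid
        tokens = []
        for name, grid in fields.items():
            t = self.embed[name](grid.unsqueeze(1))          # (B, d_model, H/patch, W/patch)
            tokens.append(t.flatten(2).transpose(1, 2))      # (B, num_patches, d_model)
        return torch.stack(tokens, dim=1)                    # (B, num_vars, num_patches, d_model)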

Variable aggregation: Variable tokenization comes with two inherent problems. First, it yields sequences whose length grows linearly with the number of input variables, which is computationally infeasible as input to the self-attention layers of the ViT. Second, the input may contain tokens of different variables with very different physical groundings. We therefore propose variable aggregation, a cross-attention operation that outputs an equally sized embedding vector for each spatial location.

Figure 4: Variable aggregation. A cross-attention operation that outputs an equally sized embedding vector for each spatial location.
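
A corresponding sketch of variable aggregation, where a learned query attends over the variable tokens at each spatial location, might look as follows (again an illustrative approximation rather than the released code):

import torch
import torch.nn as nn

class VariableAggregationSketch(nn.Module):
    # Cross-attention over the variable axis: one learned query per spatial
    # location attends to the V variable embeddings and returns a single vector.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens):                                 # tokens: (B, V, N, d_model)
        B, V, N, D = tokens.shape
        kv = tokens.permute(0, 2, 1, 3).reshape(B * N, V, D)   # group the V variable tokens per location
        q = self.query.expand(B * N, 1, D)
        out, _ = self.attn(q, kv, kv)                          # (B*N, 1, D)
        return out.reshape(B, N, D)                            # one embedding per spatial location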

Fine-Tuning for various downstream tasks

We highlight the performance of ClimaX on various weather and climate downstream tasks, which we categorize into weather forecasting (global, regional, sub-seasonal, and seasonal), climate projections, and climate downscaling. ClimaX is highly flexible thanks to its four learnable components: the token embedding layers, the variable aggregation module, the attention blocks, and the prediction head. If the downstream variables overlap with the pretraining variables, we can finetune the entire model. If the variables are unseen during pretraining, we replace the embedding layers and the prediction head with newly initialized networks and either finetune or freeze the other two components.

Figure 5: Example finetuning pipeline as used for climate projection tasks. A different set of input and output variables requires different embedding layers and prediction heads. Attention layers can be frozen or finetuned.
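
In code, that finetuning recipe amounts to something like the following. The attribute names (token_embed, head, backbone, aggregation) are illustrative placeholders for the corresponding ClimaX components, and VariableTokenizerSketch refers to the sketch above:

import torch.nn as nn

def prepare_for_finetuning(model, new_in_vars, new_out_dim, d_model=256, freeze_backbone=False):
    # Replace the pieces tied to the variable set: new embedding layers and a new prediction head.
    model.token_embed = VariableTokenizerSketch(new_in_vars, d_model=d_model)
    model.head = nn.Linear(d_model, new_out_dim)
    # Optionally freeze the pretrained attention blocks and the aggregation module.
    if freeze_backbone:
        for p in model.backbone.parameters():
            p.requires_grad = False
        for p in model.aggregation.parameters():
            p.requires_grad = False
    return model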

Results highlights

Global weather forecasting

Forecasting the future values of key weather variables at different temporal horizons is critical to ensuring the safety of communities and infrastructure around the world. ERA5 reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) serves as the key source of data for training and evaluating machine learning models on this task, with the operational IFS being the current state-of-the-art numerical weather prediction baseline.

Figure 6: Visualization of forecasting results for key weather variables (temperature: T2m, T850; wind: U10, V10) with ClimaX, from 6 hours to 1 month into the future.

When finetuned on the same ERA5 data, even at medium resolution (1.40625˚), ClimaX already performs comparably to, if not better than, IFS on short- and medium-range predictions, while being substantially better at longer-horizon predictions.

Figure 7: Performance of ClimaX on global forecasting of key weather variables (temperature: T2m, T850; wind: U10; geopotential: Z500) compared to the state-of-the-art numerical weather prediction system in operational use, IFS, at different lead-time horizons. ClimaX is already close at short- and medium-range predictions and becomes better at longer lead times.

Climate projections

Climate projections help climate scientists understand the effects of various forcing factors, such as concentrations of greenhouse gases or aerosol emissions, on the long-term state of the climate. ClimateBench [1] was recently introduced to consistently evaluate machine learning approaches that aim to improve the accuracy of climate projections. This task is noticeably different from the pretraining regime, with completely different inputs and outputs than those seen during pretraining. Still, transferring the ClimaX attention layers to this task results in performance comparable to or better than the current state-of-the-art baselines in ClimateBench.

Table 1: ClimaX performs favorably compared to other baselines despite having never seen any of the input or output variables during pretraining.

Climate model downscaling

Climate models often cannot provide enough detail to analyze regional and local phenomena due to their coarse spatial resolution. Downscaling can help provide higher-resolution climate projections and reduce the biases in the outputs of these models by relating them to higher-resolution local climatological conditions. We evaluate ClimaX on this task by using a lower-resolution climate model’s projections as the input and the corresponding values in reanalysis weather data as the higher-resolution target. We find that ClimaX again compares favorably against other deep learning baselines on all key metrics.

Table 2: ClimaX performs better than other deep learning baselines on downscaling from MPI-ESM (5.625˚) [2] to ERA5 (1.40625˚) [3].
Figure 8: Visualization of downscaled prediction of key climate variables (Temperature: T2m, T850) with ClimaX.

Scaling analysis

Transformer-based machine learning architectures have shown favorable and predictable scaling properties when given more compute, data, or parameters. We find this to be true for ClimaX as well. These trends are promising, as we have so far only scaled to fairly small models compared to the currently popular architectures in other domains, which have billions of parameters. Additionally, there is a wealth of publicly available weather and climate data that we have not yet leveraged for pretraining larger models.

Figure 9: Scaling law analysis of ClimaX. Bigger models and more data consistently improve performance on key tasks like 3-day forecasting. Bigger models turn out to be more sample efficient as well.

Advancing weather and climate modeling with data-driven methods

We are excited to release ClimaX with the aim of furthering data-driven weather and climate modeling. Our goal is to allow anyone to easily use the latest machine learning methods to address a multitude of problems, ranging from near-term prediction at a local scale to modeling long-term processes that involve weather and climate variables. ClimaX takes a big step towards the idea of a single starting point for a variety of such tasks. We can’t wait to see what the future holds for this emerging field.

References

[1] Watson-Parris, Duncan, et al. “ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections.” Journal of Advances in Modeling Earth Systems 14.10 (2022): e2021MS002954.

[2] Eyring, Veronika, et al. “Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization.” Geoscientific Model Development 9.5 (2016): 1937-1958.

[3] Hersbach, H., et al. “ERA5 hourly data on single levels from 1979 to present.” Copernicus Climate Change Service (C3S) Climate Data Store (CDS) 10 (2018).

This work is being undertaken by members of Microsoft Autonomous Systems and Robotics Research, Microsoft Research AI4Science, and UCLA. The researchers behind this project are Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K. Gupta, and Aditya Grover.

Open sourcing PDEArena
http://approjects.co.za/?big=en-us/research/articles/open-sourcing-pdearena-2/ – Tue, 29 Nov 2022


We are open sourcing PDEArena, a modern, scalable, and easy-to-use PDE surrogate benchmarking framework. PDEArena is designed to train and evaluate neural surrogates for partial differential equations (PDEs) at scale. As such, PDEArena contains state-of-the-art implementations of more than 20 recently proposed PDE surrogate architectures (or combinations thereof), with more coming soon.

Scaling up deep learning models has led to unprecedented success in computer vision and natural language processing. Deep learning holds immense promise in helping to overcome the computationally expensive nature of standard PDE solution techniques. However, scaling such PDE surrogate models requires elaborate engineering on both the distributed-training and the data-loading front. For example, many currently available neural PDE open-source libraries tend to assume that the surrogates are run on exactly those underlying PDEs they were trained on, often assuming at most a single GPU. Current research in PDE surrogates is therefore often missing out on the benefits of scale.

Thanks to the use of PyTorch Lightning, experiments are incredibly simple to run at any scale. In its current release, PDEArena allows you to train models on four different fluid mechanics and electrodynamics datasets, where both the code for data generation and the datasets themselves are available (with more coming soon). Furthermore, PDEArena aims to establish strong baselines for neural PDE surrogates, thereby helping drive the field forward together. The repo is therefore designed so that it can easily be extended both with new models and with new datasets.

We used PDEArena in our recent paper “Towards Multi-spatiotemporal-scale Generalized PDE Modeling” to compare modern UNets against other state-of-the-art neural PDE surrogate learning approaches. PDEArena’s simplicity and scalability allowed us to quickly iterate on different UNet variants: from the 2015 version to modern UNets and our own variants thereof. Furthermore, we could easily compare tradeoffs such as runtime and GPU memory requirements against other architectures such as ResNets, dilated ResNets, and various Fourier-based approaches.

Table 1: Comparison of parameter count, runtime, and memory requirements of various PDE surrogate architectures.
More can be found in the documentation.

Trying out these models on a new PDE should be as simple as writing a data loader for your PDE dataset. Hopefully, we will see many more comparisons across the vast design space of PDE surrogates.
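
For example, a new PDE dataset could plug in via a standard PyTorch Dataset along the following lines. This is a hedged sketch against a generic array layout, not PDEArena's actual dataset interface, and the file path and group name are assumptions:

import h5py
import torch
from torch.utils.data import Dataset

class MyPDEDataset(Dataset):
    # Assumes trajectories stored in HDF5 with shape (num_traj, time, channels, H, W).
    def __init__(self, path, time_history=4, time_future=1):
        self.file = h5py.File(path, "r")
        self.data = self.file["trajectories"]
        self.th, self.tf = time_history, time_future
        self.windows_per_traj = self.data.shape[1] - time_history - time_future + 1

    def __len__(self):
        return self.data.shape[0] * self.windows_per_traj

    def __getitem__(self, idx):
        traj, start = divmod(idx, self.windows_per_traj)
        x = torch.as_tensor(self.data[traj, start : start + self.th])                        # conditioning frames
        y = torch.as_tensor(self.data[traj, start + self.th : start + self.th + self.tf])    # target frames
        return x, y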

Note that this is not a one-time release. We use PDEArena extensively in our daily research at Microsoft and plan to continue maintaining it while adding new functionality over time. We are very eager to receive contributions from the wider PDE surrogate learning community.


This work is being undertaken by members of Microsoft Autonomous Systems and Robotics Research and Microsoft Research AI4Science. The researchers behind this project are Jayesh K. Gupta and Johannes Brandstetter.

Towards Modular Data-driven Simulations
http://approjects.co.za/?big=en-us/research/articles/towards-modular-data-driven-simulations/ – Wed, 23 Nov 2022

The use of modeling and simulation in engineering is well recognized as a viable method to build surrogates for real-world systems. Simulation allows for testing and evaluation of large and complex systems in a risk-free environment, helping document failures, reduce costs, and increase the quality of developed systems. Simulation also plays a key role in AI – either for obtaining training data in an inexpensive manner, or in deep reinforcement learning, which requires a large volume of interactions with the environment to learn effective policies. However, traditional simulators often require significant engineering effort and expert knowledge to create, while also being computationally intensive for complex systems. A more efficient alternative is to learn data-driven simulations, which use machine learning algorithms to learn system dynamics directly from observations or from data traces of existing simulators.

Current efforts in building data-driven simulations have been led by a class of machine learning models known as “mechanistic models”. Mechanistic models attempt to incorporate well-understood priors, examples of which are neural differential equations and physics-informed neural networks. Although effective, many such data-driven simulations attempt to learn simulator behavior in an end-to-end manner. This results in monolithic simulation blocks that are only valid for a particular system configuration and cannot easily transfer or generalize to new configurations. On the other hand, most real systems are inherently modular and can be decomposed into reusable subsystems.

In our recent paper Learning Modular Simulations for Homogeneous Systems, we examine the idea of building data-driven simulations in a modular fashion, which allows re-use and re-configuration of the individual modules. Such an approach can result in several gains for data-driven simulations, some of which are:

  1. Creation of reusable pretrained simulation nodes which can be transferred or rapidly finetuned for new scenarios in a data-efficient manner.
  2. Smaller models that represent individual subsystems and whose weights can be shared across a network, instead of one large graph model – reducing the overall computational effort required in training.
  3. Enhanced adaptability of our simulation system for new configurations.

A large body of literature on data-driven simulations has explored the idea of building simulations with graph neural networks, such as the method in Learning to Simulate Complex Physics with Graph Networks. Such networks attempt to model the entire graph, processing the state and control inputs at every timestep into graph embeddings and computing the evolution of the graph through a hidden representation. Learning the entire graph in this fashion risks creating simulation nodes that are not independent; instead, the hidden representation becomes a function of the exact arrangement of neighbors in the graph. Within the hidden state, the entire graph needs to be communicated as it evolves through time, creating a large overhead. Finally, such approaches lack generalizability.

Given the inherent modularity of many real-life systems, we propose a method that places the focus on individual nodes instead of the full graph. Our approach uses Neural Ordinary Differential Equations (NODEs) to model a single dynamic entity. The primary function of the NODE is to ingest the current state and control action and predict the future state after a given amount of elapsed time.

Figure 1: A simulation node built on top of a neural ordinary differential equation computes the next state of the node given the current state and control action.

To enable interaction between neighboring nodes, we augment this NODE with “message variables”, creating a message-passing neural ODE (MP-NODE). The MP-NODE has an additional output alongside the next-state prediction, representing a continuous-valued message which is sent to other simulation nodes for coordination. Similarly, each MP-NODE has a message input, through which it consumes aggregated messages from its neighboring nodes. This forms the basic unit of our overall simulation learning framework. Because the message passing happens in parallel with the individual node dynamics, each node can focus on learning the dynamics rules for a single entity. Based on the messages received at every timestep, the nodes adjust their internal predictions so that the overall system simulation remains consistent.

Figure 2: Our approach augments the neural ODE with a message-passing capability. Each node is capable of ingesting and outputting messages, and the input messages for a node are computed as an aggregate of all the messages output by its neighbors.
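
As a rough illustration of the idea (not the paper's implementation: the dimensions, message aggregation, and explicit Euler integrator below are simplifying assumptions), a single shared node model with a message channel can be written as:

import torch
import torch.nn as nn

class MPNodeSketch(nn.Module):
    # One shared module per node: maps (state, control, aggregated incoming message)
    # to a state derivative and an outgoing message.
    def __init__(self, state_dim=4, ctrl_dim=1, msg_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + ctrl_dim + msg_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim + msg_dim),
        )
        self.state_dim = state_dim
        self.msg_dim = msg_dim

    def forward(self, state, ctrl, msg_in):
        out = self.net(torch.cat([state, ctrl, msg_in], dim=-1))
        return out[..., :self.state_dim], out[..., self.state_dim:]   # (state derivative, outgoing message)

def rollout(node, adjacency, init_state, controls, dt=0.05):
    # init_state: (N, state_dim); the same `node` weights are shared by all N nodes.
    msgs = torch.zeros(init_state.shape[0], node.msg_dim)
    trajectory = [init_state]
    for ctrl in controls:                                  # controls: (T, N, ctrl_dim)
        msg_in = adjacency @ msgs                          # aggregate neighbors' outgoing messages
        dstate, msgs = node(trajectory[-1], ctrl, msg_in)
        trajectory.append(trajectory[-1] + dt * dstate)    # Euler step stands in for an ODE solver
    return torch.stack(trajectory)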

We apply our proposed method to several multi-node homogeneous systems. As a toy example, we investigate a coupled pendulum in which two simple pendula are attached by a string. We notice that the inclusion of messages helps with long-horizon state predictions. Furthermore, we perform an experiment where we take a trained MP-NODE and use it for inference with messages turned on and turned off. When the messages are disabled, each node in the system behaves similarly to an unconnected pendulum. This shows that the messages implicitly learn to encode the relevant details about interaction.

Figure 3: MP-NODE is able to model the evolution of the coupled pendulum accurately even up to time horizons as long as 20 seconds.
Figure 4
Left: The trend of errors during training shows that when messages are not used, the ability to model interactions is lost, resulting in bad predictions.
Right: When messages are deliberately turned off during inference, the simulation nodes evolve independently, i.e., similar to a single pendulum.

We also use MP-NODE to model several other systems of interest. These include

a) Coupled Lorenz attractors
b) Gene dynamics over a spatial grid modeled using the Michaelis-Menten equation
c) Kuramoto system, a model for the behavior of a large set of coupled oscillators
d) A swarm of quadrotors performing cooperative assembly

Through these experiments, we validate two primary hypotheses.

The modular nature of the MP-NODE allows for easy transfer to different configurations of systems.

1) Higher number of nodes

A common use case would be to take a trained MP-NODE module, representing a subsystem, and finetune it for a larger graph than the one it was originally trained on. We take the MP-NODE trained on a three-node Lorenz attractor and finetune this model for the 10-node configuration. The test error plot in Figure 5 (left) shows that even in low-data regimes, finetuning a trained model is far more efficient than training a model from scratch, highlighting the transferability of MP-NODE.

Figure 5
Left: Finetuning an MP-NODE trained on a 3-node Lorenz system for a 10-node Lorenz system.
Right: Finetuning an MP-NODE trained on a 4×4 gene dynamics grid for an 8×8 gene dynamics grid.

We perform a similar experiment on the gene dynamics system, where we first train an MP-NODE model on data from a 4×4 spatial grid, and finetune it for an 8×8 configuration. Similar to above, we see that finetuning a trained MP-NODE allows for accurate predictions faster than training from scratch.

2) New graph structure

We also examine the ability of MP-NODE to be finetuned for different graph structures. To this end, we train an MP-NODE on a Kuramoto10 system connected according to the Barabasi-Albert (BA) network and attempt to finetune it for systems of other network types. As above, we see that finetuning is more efficient at adapting to new network types than training models from scratch.

3) Different system parameters

We also evaluate the possibility of finetuning MP-NODEs for different system parameters. As an example, we finetune an MP-NODE model trained on Lorenz3 to Lorenz10, but with a different coupling intensity than the original training data, and show the results in Figure 7 (left). Similarly, we finetune the MP-NODE model trained on Lorenz3 to Lorenz10 for a longer time horizon of 10 s in Figure 7 (right). In both cases, we again find that far less data is required to achieve better performance than training from scratch.

Figure 6: Comparison of test error when training MP-NODE from scratch for a new topology vs. finetuning an existing one.
Figure 7: Comparison of test error when training MP-NODE from scratch on Lorenz10 vs. finetuning an existing one, with a higher coupling factor (left) and a longer time horizon (right).
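
In code, this transfer recipe is simply weight reuse followed by a short training run, sketched here under the assumptions of the MP-NODE sketch above (the checkpoint handling and learning rate are illustrative):

import copy
import torch

# Assume `pretrained` holds an MPNodeSketch whose weights were trained on the 3-node system.
pretrained = MPNodeSketch()
finetuned = copy.deepcopy(pretrained)                           # reuse the shared node weights as initialization
optimizer = torch.optim.Adam(finetuned.parameters(), lr=1e-4)   # small learning rate for finetuning
# ...then train `finetuned` on (a little) data from the 10-node system with the same rollout loss.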

The modular nature of MP-NODE allows for zero-shot transfer of trained modules to new configurations.

MP-NODE operates by training a model for the individual subsystem within a homogeneous network. This allows us to connect an arbitrary number of such trained subsystems together for a given graph structure. We investigate the performance of MP-NODE models on new configurations without explicitly finetuning the model.

When tested on the Lorenz system, we find that an MP-NODE trained only on Lorenz3 exhibits reasonable zero-shot generalization performance on a higher number of Lorenz attractors, such as Lorenz7 and Lorenz10, without requiring any additional training.

We notice that this ability holds not only for changing numbers of nodes, but also for different graph topologies. For the Gene 4×4 system, we generate data from multiple adjacency matrices, all following the Barabasi-Albert (BA) network topology, and train the MP-NODE on this dataset. We observe that this MP-NODE model, trained only on the BA topology, also generalizes to unseen network topologies (Erdos-Renyi (ER), Watts-Strogatz (WS)). We show these results in Table 1.

Table 1: Performance of MP-NODE in zero-shot generalization to new configurations not seen in training.
Figure 8: MP-NODE predictions vs. ground truth for the gene dynamics system
Left: MP-NODE trained on a 4×4 grid finetuned for an 8×8 grid
Right: MP-NODE trained on a 4×4 grid with one specific topology generalizing to a new topology

In our paper, we present further results on Kuramoto systems and quadrotor swarms, along with ablation studies and comparisons to existing literature, showcasing the ability of MP-NODE to learn complex dynamics in a modular fashion while outperforming existing methods.

In summary, this modular way of thinking about data-driven simulation of complex systems has the potential to minimize data, compute, and energy requirements. Our finetuning and generalization analyses show that modeling subsystems that are inherently reusable, rather than specific configurations of systems, can alleviate data and compute requirements. Adapting to more complex systems will require an extension that handles heterogeneous systems, which we leave to future work. We are excited to build upon this paradigm of modular, composable simulations for learning data-driven surrogates of real systems, helping speed up conventionally slow simulations while capturing complex real-world phenomena.


This work is being undertaken by a team at Microsoft Autonomous Systems and Robotics Research. The researchers included in this project are: Jayesh K. Gupta, Sai Vemprala, and Ashish Kapoor.

The post Towards Modular Data-driven Simulations appeared first on Microsoft Research.

]]>
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pretraining http://approjects.co.za/?big=en-us/research/articles/perception-action-causal-transformer-for-autoregressive-robotics-pretraining/ Thu, 27 Oct 2022 00:40:28 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=868116 PACT paper (opens in new tab) | Video (opens in new tab)| Github code (opens in new tab) Recent advances in machine learning architectures have induced a paradigm shift from task-specific models towards large general-purpose networks. For instance, in the past few years we have witnessed a revolution in the domains of natural language and […]

The post PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pretraining appeared first on Microsoft Research.

]]>
PACT paper (opens in new tab) | Video (opens in new tab) | Github code (opens in new tab)

Recent advances in machine learning architectures have induced a paradigm shift from task-specific models towards large general-purpose networks. For instance, in the past few years we have witnessed a revolution in the domains of natural language and computer vision with models such as GPT-3 (opens in new tab), BERT (opens in new tab) and DALL-E (opens in new tab). The use of general-purpose models is highly appealing because they are trained on a broad array of datasets and can be applied to a wide variety of downstream tasks, providing general skills that can be used directly or with minimal finetuning in new applications.

The field of robotics, however, is still dominated by single-purpose system architectures whose modules and connections, whether traditional or learning-based, require significant human design expertise. Inspired by these large pre-trained models, this work introduces a general-purpose robotics representation that can serve as a starting point for multiple tasks for a mobile agent, such as navigation, mapping and localization.

We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, our model implicitly encodes dynamics and behaviors for a particular robot. This representation can then function as a single starting point to achieve distinct tasks through fine-tuning with minimal data.


Continue reading to learn more about this technology, or check out the paper, video, and code linked above.

Unlocking self-supervision of representations for robotics

Inspired by large pretrained language models, this work introduces a paradigm for pretraining general purpose representation models that can be used for multiple robotics tasks.

At their core, most robotic agents operate in a perception-action loop between their states or observations and the associated actions. We argue that if a robot can fully understand the transitions between its states and actions, it can learn a high-quality mental model of how it interacts with the world. Conceptually, this is equivalent to how a large language model understands the rules of grammar in a language.

In this work we introduce the Perception-Action Causal Transformer (PACT), a Transformer-based generative model that is trained on sequences of states and actions coming from datasets of robot trajectories. By learning to autoregressively predict such sequences, PACT implicitly encodes general purpose information such as the progression of observations given actions (robot dynamics), and interactions between states and actions (robot policy).

State observations in robotics can be composed of distinct modalities such as RGB images or LiDAR scans. Similarly, robot actions can be of several types as well, such as steering angles, motor commands, or discrete choices from a predefined library of actions. In order to convert such a wide variety of data into a format that is easily consumed by the transformer, we use a tokenization procedure. PACT itself is designed to be a general architecture and is agnostic to the nature of the states and actions.
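As a rough illustration of this tokenization idea (the module names, network sizes, and action dimensionality below are assumptions made for the sketch, not the actual PACT code), each modality gets its own encoder that maps raw inputs to fixed-size token embeddings, which are then interleaved as s0, a0, s1, a1, ... for the causal transformer.

```python
# Hedged sketch of multimodal tokenization for a causal perception-action transformer.
import torch
import torch.nn as nn

class StateTokenizer(nn.Module):
    """Maps an RGB observation (B, 3, H, W) to a single d_model-dim token."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                                  nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.proj = nn.Linear(32, d_model)

    def forward(self, img):
        return self.proj(self.conv(img))

class ActionTokenizer(nn.Module):
    """Maps a continuous action vector (B, action_dim) to a d_model-dim token."""
    def __init__(self, action_dim=2, d_model=128):
        super().__init__()
        self.proj = nn.Linear(action_dim, d_model)

    def forward(self, act):
        return self.proj(act)

state_tok, action_tok = StateTokenizer(), ActionTokenizer()
imgs = torch.randn(4, 8, 3, 64, 64)      # batch of 4 trajectories, 8 timesteps of RGB
acts = torch.randn(4, 8, 2)              # e.g. steering + throttle (assumed action space)
s = state_tok(imgs.flatten(0, 1)).view(4, 8, -1)
a = action_tok(acts.flatten(0, 1)).view(4, 8, -1)
tokens = torch.stack([s, a], dim=2).flatten(1, 2)   # (4, 16, d_model): s0, a0, s1, a1, ...
```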


What can we do with this model?

Analogous to how a language model like GPT-3 learns to auto-regressively output a sequence of reasonable words to complete a sentence, the pretrained PACT model learns to output a reasonable sequence of actions for a robot. Without a particular goal in mind, it learns to follow the perception-action statistical distributions seen in the pretraining phase, and can navigate safely in the environment.


As mentioned before, downstream tasks in robotics can take several forms beyond just safe navigation. We finetune the representations learned by PACT for tasks which are common in robotics scenarios, such as localization, mapping, and navigation, on two types of robots. The first is the MuSHR car (opens in new tab), an open-source RC car platform equipped with cameras, LiDAR and onboard computers. The second robot is a purely virtual agent in the Habitat (opens in new tab) simulator.

For each downstream task, we add a small task-specific module on top of the PACT model, which is finetuned with the downstream datasets. Through empirical analysis, we observe that finetuning small task-specific modules on top of PACT is significantly more efficient than training models from scratch for each task. The next figure shows examples of the networks used for localization (merging embeddings from a pair of consecutive states) and local mapping (merging all embeddings from the transformer sequence):

Figure: Task-specific heads for localization and local mapping.
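A minimal sketch of this finetuning recipe follows, with a generic transformer encoder standing in for the pretrained PACT backbone and hypothetical head sizes; only the small heads receive gradient updates.

```python
# Hedged sketch: small task heads on top of frozen pretrained embeddings.
import torch
import torch.nn as nn

d_model, seq_len = 128, 16

backbone = nn.TransformerEncoder(                       # stand-in for the pretrained PACT model
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
for p in backbone.parameters():
    p.requires_grad = False                             # keep pretrained weights frozen

loc_head = nn.Linear(2 * d_model, 3)                    # pose change from two consecutive state tokens
map_head = nn.Sequential(nn.Linear(seq_len * d_model, 256), nn.ReLU(),
                         nn.Linear(256, 64 * 64))       # coarse local occupancy map (assumed size)

tokens = torch.randn(4, seq_len, d_model)               # tokenized (state, action) sequence
emb = backbone(tokens)
pose_delta = loc_head(torch.cat([emb[:, 0], emb[:, 2]], dim=-1))  # s_t and s_{t+1} embeddings
local_map = map_head(emb.flatten(1)).view(4, 64, 64)

# Only the task heads are optimized during finetuning.
optimizer = torch.optim.Adam(list(loc_head.parameters()) + list(map_head.parameters()), lr=1e-4)
```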

PACT as a generative robotics model

Similar to how a model like GPT-3 operates with text prompts, we can also bias the future distribution of states and actions produced by our model by prompting the transformer sequence with specific initial values. We visualize this with heatmaps of state distributions over multiple runs where the car was initialized in the same position and orientation, with the only difference being the prompt formed by the very first 15 action tokens. Prompting with straight trajectories results in future actions that tend to keep the vehicle on a straighter course, compared to the actions generated from prompts that include turns.
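A simplified sketch of this prompting mechanism, reusing the stand-in `backbone` from the sketch above and ignoring state tokens for brevity: the prompt tokens are fixed, and every token generated afterwards is conditioned on them.

```python
# Hedged sketch of prompting an autoregressive sequence model with fixed initial tokens.
import torch

@torch.no_grad()
def rollout_with_prompt(model, prompt_tokens, horizon):
    """prompt_tokens: (1, k, d_model). Appends `horizon` model-generated tokens."""
    seq = prompt_tokens
    for _ in range(horizon):
        next_token = model(seq)[:, -1:]          # last output embedding used as the next token
        seq = torch.cat([seq, next_token], dim=1)
    return seq

straight_prompt = torch.zeros(1, 15, 128)        # stand-in for 15 "go straight" action tokens
turning_prompt = torch.randn(1, 15, 128)         # stand-in for 15 "turning" action tokens
traj_a = rollout_with_prompt(backbone, straight_prompt, horizon=50)
traj_b = rollout_with_prompt(backbone, turning_prompt, horizon=50)
```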


Making robotics more accessible

We are excited to release these technologies with the aim of bringing autonomous robotics closer to a broader public audience. Our goal is to allow anyone to easily "train the brains of a robot", without the need for very specialized technical knowledge of feature design and model architectures. Our Perception-Action Causal Transformer (PACT) framework, which facilitates the idea of a single starting point for a variety of robotics tasks, takes a big step in this direction.

This work is being undertaken by a team at Microsoft Autonomous Systems and Robotics Research. The researchers included in this project are: Rogerio Bonatti, Sai Vemprala, Shuang Ma, Felipe Vieira Frujeri, Shuhang Chen and Ashish Kapoor.

The post PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pretraining appeared first on Microsoft Research.

]]>
3DB: Debugging Computer Vision Models through Simulation http://approjects.co.za/?big=en-us/research/articles/3db-debugging-computer-vision-models-through-simulation/ Fri, 30 Sep 2022 20:22:31 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=882270 Paper  (opens in new tab)  /  Code (opens in new tab)  /  Demo (opens in new tab) /  Docs (opens in new tab) Modern machine learning models are known to fail in ways that aren’t anticipated during training these models. These include all sorts of distribution shifts that the model might experience during deployment in complex […]

The post 3DB: Debugging Computer Vision Models through Simulation appeared first on Microsoft Research.

]]>
Paper (opens in new tab) / Code (opens in new tab) / Demo (opens in new tab) / Docs (opens in new tab)

Modern machine learning models are known to fail in ways that aren't anticipated during training. These failures include all sorts of distribution shifts that a model might experience during deployment in complex real-life settings. In the context of computer vision, for example, several works have shown that models suffer in the face of small rotations (opens in new tab), common corruptions (opens in new tab) (such as snow or fog), and changes to the data collection pipeline (opens in new tab). While such brittleness is widespread, it is often hard to understand its root causes, or even to characterize the precise situations in which this unintended behavior arises.


How do we then comprehensively diagnose model failure modes? One way is to deploy our models in the real world and eventually collect real-world failure cases, but clearly the stakes are often too high to simply do this. A line of work in computer vision research focuses on identifying systematic sources of model failure, including the effects of unfamiliar object orientations (opens in new tab), misleading backgrounds (opens in new tab), or conflicts between texture and shape (opens in new tab). Such analyses have revealed patterns of performance degradation in vision models; still, each such analysis requires its own set of (often complex) tools, time, and effort. Our question is thus: can we support reliable discovery of model failures in a systematic, automated, and unified way?

To address this, in collaboration with researchers at MIT, we introduce 3Debugger (3DB): a framework for automatically identifying and analyzing the failure modes of computer vision models. This framework makes use of a 3D simulator to render images of near-realistic scenes that can be fed into any computer vision system. Users can specify a set of extendable and composable transformations within the scene, such as pose changes, background changes, or camera effects, which we refer to as “controls”. We show examples of such controls in Fig. 1.


Fig. 1: Examples of “controls” in 3DB, using Blender as the 3D simulator.

Once the user has specified a set of controls of interest, the system performs a guided search, evaluation, and aggregation derived from these transformations. 3DB achieves this by instantiating and rendering a myriad of object configurations according to the transformations, recording the behavior of the model on each rendered scene, and finally presenting the user with an interactive, user-friendly summary of the model's performance and vulnerabilities. 3DB is general enough to enable users to, with little-to-no effort, re-discover insights from prior work on robustness to pose, background, and texture, among others. Users can even compose these transformations to understand their interplay, while still being able to disentangle their individual effects, or easily write their own if required. An overview of the workflow of 3DB is shown in Fig. 2.


Fig. 2: The workflow of 3DB.

As an example, let us try to evaluate how robust the standard ImageNet-pretrained ResNet-18 model is at classifying a coffee mug. The highly configurable nature of 3DB allows one to set up the model of interest, the renderer, as well as the transformations of interest through a YAML configuration file. 3DB reads this configuration file and initializes the renderer and the model accordingly. Once initialized, 3DB renders several synthetic images according to the desired controls, performs inference on these images and displays the results in a web dashboard, mapping the changing parameters to success/failure.
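The snippet below is a conceptual sketch of this render-evaluate-aggregate loop, not the actual 3DB API or configuration schema (see the 3DB docs for those); the `render_mug` function and the control grid are stand-ins for the Blender-backed renderer and the YAML-specified controls.

```python
# Conceptual sketch only: sweep a grid of controls, render, classify, and record correctness.
import itertools
import numpy as np
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet18_Weights.IMAGENET1K_V1
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()

def render_mug(zoom, pose_deg, background):
    """Stand-in for the simulator render call; the real pipeline goes through Blender."""
    rng = np.random.default_rng(abs(hash((zoom, pose_deg, background))) % (2**32))
    return Image.fromarray(rng.integers(0, 255, (256, 256, 3), dtype=np.uint8))

# Hypothetical controls, each swept over a small grid of values.
controls = {
    "zoom": [0.5, 1.0, 2.0],
    "pose_deg": [0, 45, 90, 180],
    "background": ["plain", "kitchen", "office"],
}

results = []
for zoom, pose_deg, background in itertools.product(*controls.values()):
    image = render_mug(zoom, pose_deg, background)
    with torch.no_grad():
        pred = model(preprocess(image).unsqueeze(0)).argmax(dim=1).item()
    results.append({"zoom": zoom, "pose_deg": pose_deg, "background": background,
                    "correct": pred == 504})   # 504 is the ImageNet index for "coffee mug"
```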

Some interesting findings from 3DB for this coffee mug example are:

  1. Complex backgrounds result in bad classification performance.
  2. ImageNet pretrained models are sensitive to texture.
  3. Classification accuracy changes based on which liquid is inside the mug.

3DB is also capable of finding failure modes (e.g. due to extreme viewpoints and poses) in simulation that transfer to the real world. Fig. 3 shows the agreement, in terms of model correctness, between the model's predictions within 3DB and its predictions in the real world. For each object, we selected five configurations that 3DB found to be correctly classified in simulation and five that were misclassified; we then recreated each scene in the physical world and deployed the model on it. The positive (resp., negative) predictive value is the rate at which correctly (resp., incorrectly) classified examples in simulation were also correctly (resp., incorrectly) classified in the physical world.
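A tiny sketch of this agreement metric, assuming we simply have paired lists of correctness flags for the simulated and recreated physical scenes:

```python
# Minimal sketch of the sim-to-real agreement metric described above.
def predictive_values(sim_correct, real_correct):
    """sim_correct, real_correct: lists of booleans over the same recreated scenes."""
    pos = [r for s, r in zip(sim_correct, real_correct) if s]
    neg = [r for s, r in zip(sim_correct, real_correct) if not s]
    ppv = sum(pos) / len(pos)                   # correct in sim -> also correct in real world
    npv = sum(1 - r for r in neg) / len(neg)    # wrong in sim -> also wrong in real world
    return ppv, npv

print(predictive_values([True, True, False, False, True],
                        [True, False, False, True, True]))   # (0.666..., 0.5)
```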

Overall, 3DB is a scalable, extendable, and unified framework for diagnosing failure modes in vision models using high-fidelity rendering. We refer the reader to our paper to learn more about the use cases of 3DB, where we demonstrate its efficacy across a variety of scenarios, including disentangling the effects of different types of brittleness, discovering model biases, analyzing specific model decisions in depth, and identifying vulnerabilities and worst-case environmental configurations. We are releasing 3DB as a library alongside a set of example analyses (opens in new tab), guides (opens in new tab) and documentation (opens in new tab). 3DB is designed with extensibility as a priority; we encourage the community to build upon the framework by adding more controls and policies that provide new insights into the vulnerabilities of vision models.


This work was a collaborative effort between Microsoft Research and MIT. Researchers involved in this work were Guillaume Leclerc, Hadi Salman, Andrew Ilyas, Sai Vemprala, Logan Engstrom, Vibhav Vineet, Kai Xiao, Pengchuan Zhang, Shibani Santurkar, Greg Yang, Ashish Kapoor, Aleksander Mądry.

The post 3DB: Debugging Computer Vision Models through Simulation appeared first on Microsoft Research.

]]>
Data-driven Sensor Simulation for Realistic LiDARs http://approjects.co.za/?big=en-us/research/articles/data-driven-sensor-simulation-for-realistic-lidars/ Mon, 26 Sep 2022 23:36:43 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=880923 Simulation is playing an increasingly major role in the development of safe and robust autonomous systems, especially given the advent of deep learning techniques. Given the challenges and effort involved with collecting data in real life, simulation provides an efficient alternative for gathering labeled training data for sensor observations, vehicle dynamics and environmental interactions. Furthermore, […]

The post Data-driven Sensor Simulation for Realistic LiDARs appeared first on Microsoft Research.

]]>
Simulation is playing an increasingly major role in the development of safe and robust autonomous systems, especially given the advent of deep learning techniques. Given the challenges and effort involved in collecting data in real life, simulation provides an efficient alternative for gathering labeled training data for sensor observations, vehicle dynamics and environmental interactions. Furthermore, simulation allows extended evaluation of corner cases, such as failures, that would be impractical to reproduce in a real-life setup.

Over the last decade, simulations have achieved increasingly high visual and physical fidelity. Game engines such as Unreal Engine and Unity provide several advanced graphical capabilities out of the box, such as real-time ray tracing, high-resolution texture streaming, and dynamic global illumination. Such game engines have also formed the base for several robotics and autonomous systems simulators such as AirSim (opens in new tab) and CARLA (opens in new tab), which allow users to deploy robotic platforms such as drones and cars equipped with cameras and other sensors in large 3D worlds.

While present simulations can generate high quality camera imagery, when it comes to non-visual classes of sensors, they often fall back upon simplified models. Complex sensors such as LiDAR, which lie at the heart of a majority of present day autonomous systems such as self-driving cars, are challenging to model given their dependence on aspects such as material properties of all the objects in an environment. Designing accurate LiDAR sensors in simulation often requires significant effort in handcrafting several environmental factors, and careful encoding of sensor characteristics for every new model. To alleviate this, we examine a new perspective on sensor modeling: one that involves learning sensor models from data. In our recent work “Learning to Simulate Realistic LiDARs”, we investigate how simulated LiDAR sensor models can be made more realistic using machine learning techniques. 

Figure 1: On top, we see an RGB image and a corresponding LiDAR scan from real data, where the LiDAR scan exhibits characteristics like raydrop and changes in intensity. On the bottom, we see a synthetic image from a simulator, and a basic LiDAR scan which does not contain raydrop or intensity. Our method results in the point cloud shown on the bottom right, which is similar to a real LiDAR.

LiDAR sensors are active sensors which emit laser beams in several directions around the sensor, and as these rays bounce off surrounding objects and return, the sensor tracks the time taken for the return to estimate the distance traveled. Along with the distances, LiDAR sensors also track the intensity of the returned ray, which depends on the reflectance of the object the ray was incident upon: for instance, metallic objects result in higher intensity returns. 

Creating accurate sensor models for LiDARs is thus challenging due to the dependence of the output on complex properties such as material reflectance, ray incidence angle, and distance. For example, when laser rays encounter glass objects, the rays are refracted and they rarely return, a phenomenon known as raydrop. Basic LiDAR models that exist as part of robotics simulators often yield simple point clouds obtained by casting rays naively at every object, and do not account for such properties. Similarly, it takes significant effort to encode material properties of each object in a simulator, which makes it challenging to also estimate intensities of the LiDAR returns – most LiDAR models in simulations do not return valid intensities. 

In this work, we introduce a pipeline for data-driven sensor simulation and apply it to LiDAR. The key idea is that, given data containing both RGB imagery and LiDAR scans, we can train a neural network to learn the relationship between appearance in the RGB images and scan properties such as raydrop and intensity in the LiDAR scans. A model trained this way can estimate how a LiDAR scan would look from images alone, removing the need for complex physics-based modeling.

Figure 2: Training pipeline for RINet involves taking an RGB image and predicting a binary mask for raydrop, and per-pixel intensities for the LiDAR points matching the RGB location.

We focus on these two key aspects of realistic LiDAR data, namely raydrop and intensity. Given that current simulations already possess the ability to output distances, we assume that there already exists a sensor that returns a point cloud, which we then modify using our model to make it more realistic. We name our model RINet (Raydrop and Intensity Network). At the input, RINet takes an RGB image and attempts to predict the realistic LiDAR characteristics corresponding to that scene through a data structure we refer to as an intensity mask. The intensity mask is a densified representation of the LiDAR scan: for each pixel in the RGB image, it reports the closest intensity value from the LiDAR scan corresponding to the real-world location observed by that pixel. If a corresponding ray does not exist due to raydrop, the mask contains a zero. Once trained, our model works in tandem with an existing simulation such as CARLA. The RGB images from the simulator are passed through the trained RINet model, which produces an intensity mask prediction; this intensity mask is then "applied" to the original LiDAR scan, resulting in an enhanced scan.

Figure 3: During inference, RINet takes a synthetic RGB image and predicts corresponding LiDAR properties. These properties are then applied to the basic point cloud output by the simulator, resulting in an enhanced version.
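A hedged sketch of this enhancement step (the array shapes and the projection from LiDAR points to RGB pixels are assumptions, not the released code): zero entries in the predicted mask drop rays, and the remaining entries become per-point intensities.

```python
# Sketch: apply a predicted intensity mask to a simulator point cloud.
import numpy as np

def enhance_scan(points_xyz, pixel_index, intensity_mask):
    """
    points_xyz:     (N, 3) raw simulator point cloud.
    pixel_index:    (N, 2) row/col of the RGB pixel each LiDAR point projects to.
    intensity_mask: (H, W) RINet output; 0 means the ray should be dropped.
    """
    vals = intensity_mask[pixel_index[:, 0], pixel_index[:, 1]]
    keep = vals > 0                                   # raydrop: discard zero-intensity returns
    return np.concatenate([points_xyz[keep], vals[keep][:, None]], axis=1)  # (M, 4) xyz + intensity

# Example with random stand-in data:
pts = np.random.randn(1000, 3)
idx = np.random.randint(0, 128, size=(1000, 2))
mask = np.random.rand(128, 128) * (np.random.rand(128, 128) > 0.3)   # ~30% raydrop
enhanced = enhance_scan(pts, idx, mask)
```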

We train the RINet model on two real datasets: the Waymo Perception dataset (opens in new tab) and the SemanticKITTI (opens in new tab) dataset, each resulting in a distinct LiDAR model. The Waymo dataset contains data from a proprietary LiDAR sensor, whereas SemanticKITTI uses the Velodyne VLP-32 sensor. RINet leverages the well-known pix2pix (opens in new tab) architecture to go from an RGB frame to the intensity mask. We find that the RINet model is effective at learning material-specific raydrop (e.g., dropping rays on materials like glass) as well as intensities (e.g., learning that car license plates are metallic objects that result in high-intensity returns).

RGB images and corresponding intensity masks from real data in the Waymo dataset. We can see noise in the LiDAR data, dropped rays at materials like glass, as well as varying intensity based on the objects.
Predictions from RINet for the same images – which demonstrate that the model is able to learn how to drop rays based on material observed from the image, and how to record intensities.

In order to validate our idea of enhancing existing simulators with our technique, we apply our model on top of LiDAR point clouds coming from the CARLA simulator. We observe in the videos below that the performance is qualitatively better: as expected, rays are dropped on car windshields, while metallic objects such as vehicle surfaces and road signs are more prominent in the intensity map.

We also investigate whether LiDAR-specific downstream tasks benefit from this realistic LiDAR alternative. To this end, we create a test pipeline for the task of car segmentation from LiDAR point clouds. We train segmentation models using the RangeNet++ (opens in new tab) architecture on different versions of simulated data, and then apply the models trained solely in simulation to the real-life Waymo dataset. We observe improved segmentation performance on real-life data when using the RINet-enhanced LiDAR scans compared to the default CARLA point clouds. While CARLA does provide some means of simulating noise by randomly dropping points from the LiDAR scan, our method outperforms that version as well, given the more realistic nature of the RINet outputs. Further analysis of this downstream task can be found in our paper.

Figure 4: When using our method, segmentation models trained on simulated data and applied to real life datasets fare better, compared to naive raycasting (CARLA vanilla), or simple random raydrop (CARLA noise).
Video: Learning to simulate LiDARs - CARLA
Figure 5: Qualitative differences in car segmentation on the Waymo dataset, when trained using CARLA’s native LiDAR and the LiDAR enhanced using our method. The higher realism afforded by RINet translates to better segmentation performance in real life datasets.

Neural networks are powerful function approximators that have already shown impact in a vast array of fields, and specifically so in robotics and autonomous systems. Given the integral role that simulation plays in robotics and deep learning, and the effort involved in building complex simulation components such as sensor models, we are excited to present data-driven sensor simulation as a new paradigm. We show a pipeline in which machine learning and traditional simulators coexist to generate realistic sensor observations, and apply it to LiDAR sensors. Our framework implicitly encodes sensor properties by learning purely from observations, thus bypassing the need for expensive handcrafting of LiDAR models. In the future, we envisage similar efforts helping reduce the barrier to creating rich simulations for various kinds of sensors, which, in turn, will enable the creation of robust autonomous systems.


This work was a joint effort between Microsoft’s Autonomous Systems and Robotics Research group, Computer Vision Laboratory EPFL, Microsoft Research Redmond and the Microsoft Mixed Reality & AI lab at Zurich. The researchers who took part in this project are: Benoît Guillard, Sai Vemprala, Jayesh Gupta, Ondrej Miksik, Vibhav Vineet, Pascal Fua and Ashish Kapoor.

The post Data-driven Sensor Simulation for Realistic LiDARs appeared first on Microsoft Research.

]]>
Just say the magic word: using language to program robots http://approjects.co.za/?big=en-us/research/articles/robot-language/ Tue, 09 Aug 2022 23:39:47 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=867693 LaTTe paper (opens in new tab) and video (opens in new tab) | Trajectory Transformer paper (opens in new tab) and video (opens in new tab) | Github code (opens in new tab) Language is the most intuitive way for us to express how we feel and what we want. However, despite recent advancements in […]

The post Just say the magic word: using language to program robots appeared first on Microsoft Research.

]]>
LaTTe paper (opens in new tab) and video (opens in new tab) | Trajectory Transformer paper (opens in new tab) and video (opens in new tab) | Github code (opens in new tab)

Language is the most intuitive way for us to express how we feel and what we want. However, despite recent advancements in artificial intelligence, it is still very hard to control a robot using natural language instructions. Free-form commands such as "Robot, please go a little slower when you pass close to my TV" or "Stay far away from the swimming pool!" are hard to parse into actionable robot behaviors, and most human-robot interfaces today still rely on complex strategies such as directly programming cost functions that define the desired behavior.

With our latest work, we attempt to change this reality through the introduction of "LaTTe: Language Trajectory Transformer" (opens in new tab). LaTTe is a deep machine learning model that lets us send language commands to robots in an intuitive and easy way. When given an input sentence by the user, the model fuses it with camera images of objects that the robot observes in its surroundings, and outputs the desired robot behavior.

As an example, think of a user trying to control a robot barista that's moving a wine bottle. Our method allows a non-technical user to control the robot's behavior using only words, through a natural and simple interface. We explain in detail how we achieve this throughout this post.

Continue reading to learn more about this technology, or check out the papers, videos, and code linked above.

We also invite the reader to watch the videos describing the papers, linked at the top of this post.

Unlocking the potential of language for robotics 

The field of robotics traditionally uses task-specific programming modules, which need to be redesigned by an expert even for minor changes in robot hardware, environment, or operational objectives. This inflexible approach is ripe for innovation given the latest advances in machine learning, which emphasize reusable modules that generalize well over large domains.

Given the intuitive and effective nature of language for general communication, it would be simpler if one could just tell the robot how they want it to behave, as opposed to having to reprogram the entire stack every time a change is needed. While large language models such as BERT, GPT-3 and Megatron-Turing have radically improved the quality of machine-generated text and our ability to solve natural language processing tasks, and models like CLIP extend these capabilities to multi-modal domains combining vision and language, we still see few examples of language being applied in robotics.

The goal of our work is to leverage the information contained in existing pre-trained vision-language models to fill the gap in existing tools for human-robot interaction. Even though natural language is the richest form of communication between humans, modeling human-robot interactions using language is challenging because we often require vast amounts of data to train models or, classically, force the user to operate within a rigid set of instructions. To tackle these challenges, our framework makes use of two key ideas: first, we employ large pre-trained language models to provide rich representations of user intent, and second, we align geometrical trajectory data with natural language jointly, with the use of a multi-modal attention mechanism.

We test our model on multiple robotic platforms, from manipulators to drones, and show that its functionality is agnostic to the robot's form factor, dynamics, and motion controller. Our goal is to enable a factory worker to quickly reconfigure a robot arm trajectory to stay further away from fragile objects, or to allow a drone pilot to command the drone to slow down when close to buildings, all without requiring immense technical expertise.

Combining language and geometry into a single robotics model 

Our overall goal is to provide a flexible interface for human-robot interaction within the context of trajectory reshaping that is agnostic to robotic platforms. We assume that the robot's behavior is expressed through a 3D trajectory over time, and that the user provides a natural language command to reshape this behavior, which relates to particular things in the scene, such as the objects in the robot's workspace. Our trajectory generation system outputs a sequence of XYZ waypoints and velocities, which are calculated by fusing scene geometry, scene images, and the user's language input. The main components of the system are described below.

LaTTe is composed of several building blocks, which can be categorized into feature extractors, a geometric encoder, and a final trajectory decoder. We use a pre-trained language model encoder, BERT, to produce semantic features from the user's input. The use of a large language model creates more flexibility in the natural language input, allowing the use of synonyms and requiring less training data, given that the encoder has already been trained on a massive text corpus. In addition, we use the pre-trained vision-language model CLIP to extract latent embeddings from both the user's text and a picture of each object in the scene. We then compute a similarity vector between the embeddings, and use this information to identify the target objects the user is referring to in their language command.
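The object-grounding step can be sketched with the publicly available CLIP API (here via Hugging Face transformers; the exact model variant and scoring details used in LaTTe may differ, and the object crops below are random stand-ins for detector outputs).

```python
# Sketch: score each detected object's image against the user's command with CLIP.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

command = "stay far away from the wine bottle"
# Stand-in crops of the detected objects; in practice these come from an object detector.
object_crops = [Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255)) for _ in range(3)]

inputs = processor(text=[command], images=object_crops, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
similarity = out.logits_per_text.softmax(dim=-1)   # (1, num_objects): which object is meant
target_idx = similarity.argmax().item()
```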


As for the geometric information, we employ a Transformer encoder network to extract features from the robot's original trajectory as well as the 3D position of each object in the scene. In a practical scenario, we can use off-the-shelf object detectors to obtain the position and picture of each significant object.

Finally, all the geometrical, language and visual information is fused together in a Transformer decoder block. Similar to a machine translation problem (for example, translating a sentence from English to German), the information from the transformer encoder is used by the transformer decoder to generate one waypoint of the output trajectory at a time, in a loop. The training process uses a range of procedurally generated synthetic data with multiple trajectory shapes and random object categories. We use multiple images for each object, which we obtain by web crawling through Bing Images (opens in new tab).
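A minimal sketch of this decoding loop in generic PyTorch (the dimensions, the start token, and the waypoint parameterization are illustrative, not the exact LaTTe implementation):

```python
# Sketch: autoregressive waypoint generation with a transformer decoder over fused features.
import torch
import torch.nn as nn

d_model, n_waypoints = 64, 10
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
to_waypoint = nn.Linear(d_model, 4)        # x, y, z, speed (assumed waypoint parameterization)
from_waypoint = nn.Linear(4, d_model)

memory = torch.randn(1, 20, d_model)       # stand-in for fused language + vision + geometry features
waypoints = [torch.zeros(1, 4)]            # start token: origin at zero speed
with torch.no_grad():
    for _ in range(n_waypoints):
        tgt = from_waypoint(torch.stack(waypoints, dim=1))            # embed waypoints so far
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = decoder(tgt, memory, tgt_mask=mask)                     # cross-attend to the features
        waypoints.append(to_waypoint(out[:, -1]))                     # next waypoint from last step
trajectory = torch.stack(waypoints[1:], dim=1)                        # (1, n_waypoints, 4)
```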


What can we do with this model? 

We conducted several experiments in simulated and real-life environments to test the effectiveness of LaTTe. We also tested different form factors (manipulators, drones, and a hexapod robot) in a multitude of scenarios to show the capability of LaTTe to adapt to various robot platforms. 

Examples with manipulators: 

Examples with aerial vehicles: 

Examples with a hexapod robot: 

Bringing robotics to a wider audience 

We are excited to release these technologies with the aim of bringing robotics within the reach of a wider audience. Given the burgeoning applications of robots in several domains, it is imperative to design human-robot interfaces that are intuitive and easy to use. Our goal when designing such interfaces is to afford flexibility and precision of action, while ensuring that little to no technical training is required for new users. Our Language Trajectory Transformer (LaTTe) framework takes a big step in this direction.

This work is being undertaken by a multidisciplinary team at Microsoft Autonomous Systems Research (opens in new tab) together with the Munich Institute of Robotics and Machine Intelligence (MIRMI (opens in new tab)) at TU Munich. The researchers included in this project are: Arthur Bucker (opens in new tab), Luis Figueredo (opens in new tab), Sami Haddadin (opens in new tab), Ashish Kapoor (opens in new tab), Shuang Ma (opens in new tab), Sai Vemprala (opens in new tab) and Rogerio Bonatti (opens in new tab). 

The post Just say the magic word: using language to program robots appeared first on Microsoft Research.

]]>