PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pretraining

Published October 26, 2022 | Microsoft Research blog
https://www.microsoft.com/en-us/research/articles/perception-action-causal-transformer-for-autoregressive-robotics-pretraining/

PACT paper | Video | GitHub code

Recent advances in machine learning architectures have induced a paradigm shift from task-specific models towards large general-purpose networks. For instance, in the past few years we have witnessed a revolution in the domains of natural language and computer vision with models such as GPT-3, BERT, and DALL-E. General-purpose models are highly appealing because they are trained on a broad array of datasets and provide general skills that can be applied directly, or with minimal fine-tuning, to a wide variety of downstream tasks.

The field of robotics, however, is still dominated by single-purpose system architectures whose modules and connections, whether traditional or learning-based, require significant human design expertise. Inspired by these large pre-trained models, this work introduces a general-purpose robotics representation that can serve as a starting point for multiple tasks for a mobile agent, such as navigation, mapping, and localization.

We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that builds representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, the model implicitly encodes the dynamics and behaviors of a particular robot. This representation can then serve as a single starting point for distinct downstream tasks, each achieved through fine-tuning with minimal data.
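The autoregressive scheme above can be sketched in code: state and action observations are tokenized separately, interleaved into one sequence, and passed through a causally masked transformer whose heads predict the next action from each state token and the next state from each action token. This is a minimal PyTorch sketch under assumed module names and dimensions (`PACTSketch`, `state_dim`, `action_dim`, etc. are all hypothetical), not the authors' implementation.

```python
import torch
import torch.nn as nn

class PACTSketch(nn.Module):
    """Toy perception-action causal transformer (illustrative only)."""

    def __init__(self, state_dim=32, action_dim=4, d_model=64,
                 n_layers=2, n_heads=4, horizon=16):
        super().__init__()
        # Separate tokenizers embed perception (state) and action inputs.
        self.state_embed = nn.Linear(state_dim, d_model)
        self.action_embed = nn.Linear(action_dim, d_model)
        self.pos_embed = nn.Embedding(2 * horizon, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Heads: next action from each state token, next state from each action token.
        self.action_head = nn.Linear(d_model, action_dim)
        self.state_head = nn.Linear(d_model, state_dim)

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        B, T, _ = states.shape
        s = self.state_embed(states)
        a = self.action_embed(actions)
        # Interleave tokens as [s_0, a_0, s_1, a_1, ...] -> length 2T.
        seq = torch.stack([s, a], dim=2).reshape(B, 2 * T, -1)
        seq = seq + self.pos_embed(torch.arange(2 * T, device=seq.device))
        # Causal mask: each token attends only to earlier tokens.
        mask = torch.triu(torch.ones(2 * T, 2 * T, dtype=torch.bool,
                                     device=seq.device), diagonal=1)
        h = self.transformer(seq, mask=mask)
        pred_actions = self.action_head(h[:, 0::2])  # read off state positions
        pred_states = self.state_head(h[:, 1::2])    # read off action positions
        return pred_states, pred_actions
```

Training would minimize a reconstruction loss between the predicted and observed next states and actions; the pretrained trunk can then be frozen or fine-tuned for downstream tasks such as navigation or localization.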

[Figure: main architecture diagram]

Continue reading to learn more about this technology, or check out these additional resources: