Applied Robotics Research Articles

GPT Models Meet Robotic Applications: Long-Step Robot Control in Various Environments

We have released practical prompts for ChatGPT to generate executable robot action sequences from multi-step human instructions in various environments.

Introduction

Imagine having a humanoid robot in your household that can be taught household chores through instruction and demonstration, without any coding. Our team has been developing such a system, which we call Learning-from-Observation.

As part of this effort, we recently released a paper, “ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application,” in which we provide a specific example of how OpenAI’s ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. Our prompts and the source code for using them are open-source and publicly available at this GitHub repository.

Generating robot programs from natural language is an appealing goal that has attracted considerable interest in the robotics community, and several recent systems are built on top of large language models such as ChatGPT. However, most of them were developed within a limited scope, are hardware-dependent, or lack human-in-the-loop functionality. Additionally, most of these studies rely on a specific dataset, which requires data recollection and model retraining when transferring or extending them to other robotic scenes. From a practical standpoint, an ideal robotic solution is one that can be applied to other applications or operational settings without extensive data collection or model retraining.

In this paper, we provide a specific example of how ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of actions that a robot can execute. In designing the prompts, we aimed to meet requirements common to many practical applications while keeping the prompts easy to customize. The requirements we defined for this paper are:

  • Easy integration with robot execution systems or visual recognition programs.
  • Applicability to various home environments.
  • The ability to provide an arbitrary number of natural language instructions while minimizing the impact of ChatGPT’s token limit.

To meet these requirements, we designed input prompts to encourage ChatGPT to do the following (a sketch of such an output appears after the list):

  • Output a sequence of predefined robot actions with explanations in a readable JSON format.
  • Represent the operating environment in a formalized style.
  • Infer and output the updated state of the operating environment, which can be reused as the next input, allowing ChatGPT to operate based solely on the memory of the latest operations.
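
For a concrete illustration of the three points above, the following Python sketch parses a hypothetical ChatGPT response. The action names and JSON fields here are our own illustrative assumptions, not the exact schema defined in the published prompts.

import json

# A hypothetical ChatGPT reply in the style our prompts encourage
# (illustrative schema; see the released prompts for the actual format).
response_text = """
{
  "task_sequence": [
    {"action": "move_hand(juice)", "explanation": "Reach toward the juice."},
    {"action": "grasp_object(juice)", "explanation": "Grasp the juice."},
    {"action": "move_object(juice, table, shelf)", "explanation": "Carry the juice to the shelf."},
    {"action": "release_object(juice)", "explanation": "Release the juice on the shelf."}
  ],
  "environment_after": {
    "objects": ["juice", "table", "shelf"],
    "object_states": {"juice": "on(shelf)"}
  }
}
"""

plan = json.loads(response_text)
for step in plan["task_sequence"]:
    print(step["action"], "-", step["explanation"])

# "environment_after" is fed back as part of the next prompt, so the model
# can operate from the latest environment state alone, without replaying
# the full history of operations.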

We provide a set of prompt templates that structure the entire conversation for input into ChatGPT, enabling it to generate a response. The user’s instructions, as well as a specific explanation of the working environment, are incorporated into the template and used to generate ChatGPT’s response. For the second and subsequent instructions, ChatGPT’s next response is created based on all previous turns of the conversation, allowing ChatGPT to make corrections based on its own previous output and user feedback, if requested. If the number of input tokens exceeds the allowable limit for ChatGPT, we adjust the token size by truncating the prompt while retaining the most recent information about the updated environment.
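
As a rough sketch of this truncation strategy (the function names and the token budget below are our assumptions for illustration, not the released implementation):

# Messages are dicts like {"role": "user", "content": "..."}.
def estimate_tokens(messages):
    # Crude stand-in for a real tokenizer: roughly four characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def build_prompt(template, history, latest_environment, max_tokens=4096):
    # Assemble the conversation, dropping the oldest turns first while always
    # retaining the prompt template and the latest environment description.
    messages = template + history + [latest_environment]
    while estimate_tokens(messages) > max_tokens and history:
        history = history[1:]  # truncate from the oldest turn
        messages = template + history + [latest_environment]
    return messages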

Prompt flow
The overall structure of the conversation that is input into ChatGPT to generate a response.

In our paper, we demonstrated the effectiveness of the proposed prompts in inferring appropriate robot actions from multi-step language instructions in various environments. We also observed that ChatGPT’s conversational ability allows users to adjust its output through natural language feedback, which is crucial for developing an application that is both safe and robust while providing a user-friendly interface.

Integration with vision systems and robot controllers

Among recent experimental attempts to generate robot manipulation from natural language using ChatGPT, our work is unique in its focus on generating robot action sequences (i.e., “what-to-do”) while avoiding verbose language instructions for visual and physical parameters (i.e., “how-to-do”), such as how to grab, how high to lift, and what posture to adopt. Although both types of information are essential for operating a robot in the real world, the latter is often better demonstrated visually than explained verbally. We have therefore focused on designing prompts for ChatGPT to recognize what-to-do, while obtaining the how-to-do information from human visual demonstrations and a vision system during robot execution.
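
One way to picture this division of labor is a plan step whose “what” comes from ChatGPT and whose “how” is filled in later from demonstration and vision. The data structure below is our illustrative assumption, not the system’s actual representation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskStep:
    # "What-to-do": produced by ChatGPT from the language instruction.
    action: str                                   # e.g., "grasp_object(cup)"
    explanation: str = ""
    # "How-to-do": left empty by the planner and filled in later from a
    # human demonstration and the vision system at execution time.
    grasp_pose: Optional[List[float]] = None      # e.g., a 6-DoF hand pose
    waypoints: List[List[float]] = field(default_factory=list)  # trajectory samples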

As part of our efforts to develop a realistic robotic operation system, we have integrated the proposed system with a learning-from-observation system that includes a speech interface [1], [2], a visual teaching interface [3], a reusable library of robot actions [4], and a simulator for testing robot execution [5]. If you are interested, please refer to the respective papers for the results of robot execution. The code for the teaching interface is available at another GitHub repository.

An example of integrating the proposed ChatGPT-empowered task planner into a robot teaching system (the task planner is indicated by the dashed box). The system breaks down natural language instructions into a sequence of robot actions; following task planning, it asks the user to visually demonstrate the tasks step by step, and the parameters needed for robot execution (i.e., how to perform the actions) are then extracted from this visual demonstration.

Human demonstration and robot execution
(Top) The step-by-step demonstration corresponding to the planned tasks. (Middle and bottom) Execution of the tasks by two different types of robot hardware. We have been developing a reusable library of robot skills (e.g., grab, pick up, bring) for several robot hardware platforms. To learn more about the skill library, refer to our paper.
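
As a rough sketch of what a reusable, hardware-agnostic skill library can look like (the class and method names below are our assumptions for illustration; see the paper for the actual design):

from abc import ABC, abstractmethod

class Skill(ABC):
    # A robot-independent skill; each robot supplies the primitive commands
    # (move_hand, close_gripper, ...) that the skills invoke.

    @abstractmethod
    def execute(self, robot, **params):
        ...

class Grab(Skill):
    def execute(self, robot, **params):
        robot.move_hand(params["approach_pose"])    # reach the pre-grasp pose
        robot.close_gripper(params["grasp_width"])  # close to the demonstrated width

# The same planned action sequence can then drive different robots, as long
# as each robot implements the primitive commands used by the skills.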

Conclusion

The main contribution of this paper is the provision and publication of generic prompts for ChatGPT that can be easily adapted to the specific needs of individual experimenters. The impressive progress of large language models is expected to further expand their use in robotics. We hope this paper provides practical knowledge to the robotics research community; our prompts and source code are available as open-source material on this GitHub repository.

Bibliography

@ARTICLE{10235949,
  author={Wake, Naoki and Kanehira, Atsushi and Sasabuchi, Kazuhiro and Takamatsu, Jun and Ikeuchi, Katsushi},
  journal={IEEE Access},
  title={ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application},
  year={2023},
  volume={11},
  pages={95060-95078},
  doi={10.1109/ACCESS.2023.3310935}}

About our research group

Visit our homepage: Applied Robotics Research

Learn more about this project

GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System

Our robotic gesture engine and DIY robot, MSRAbot, are integrated with a GPT-based chat system.

Introduction

Large-scale language models have revolutionized natural language processing tasks, and researchers are exploring their potential for enhancing human-robot interaction and communication. In this post, we will present our co-speech gesturing chat system, which integrates GPT-3/ChatGPT with a gesture engine to provide users with a more flexible and natural chat experience. We will explain how the system works and discuss the synergistic effects of integrating robotic systems and language models.

Co-Speech Gesturing Chat System: How it works

The pipeline of the co-speech gesture generation system.

Our co-speech gesturing chat system operates within a browser. When a user inputs a message, GPT-3/ChatGPT generates the robot’s textual response based on a prompt carefully crafted to create a chat-like experience. The system then uses a gesture engine to analyze the text and select an appropriate gesture from a library associated with the conceptual meaning of the speech. A speech generator converts the text into speech, while a gesture generator executes co-speech gestures, providing audio-visual feedback expressed through a CG robot. The system leverages several Azure services: Azure Speech Service for speech-to-text conversion, Azure OpenAI Service for GPT-3-based response generation, and Azure Language Understanding for concept estimation. The source code of the system is available on GitHub.
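
The turn-by-turn flow can be summarized in the Python sketch below. Each helper function is a hypothetical wrapper around the corresponding service, stubbed out here so the control flow runs on its own; it is not the actual API of the released system.

def speech_to_text(audio):            # Azure Speech Service (stub)
    return "Hello, robot!"

def generate_response(text):          # GPT-3 via Azure OpenAI Service (stub)
    return "Hello! How can I help you today?"

def estimate_concept(reply):          # Azure Language Understanding (stub)
    return "greeting"

def select_gesture(concept):          # gesture library lookup (stub)
    return {"greeting": "wave"}.get(concept, "idle")

def chat_turn(audio):
    text = speech_to_text(audio)
    reply = generate_response(text)
    gesture = select_gesture(estimate_concept(reply))
    # The real system synthesizes speech and plays it while the CG robot
    # performs the selected gesture; here we just report the outcome.
    print(f"Robot says: {reply!r} while performing gesture {gesture!r}")

chat_turn(audio=None)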

MSRAbot DIYKit

For this post, we used our in-house robot, MSRAbot, originally designed as a platform for human-robot interaction research. As an additional resource for readers interested in the robot, we have developed and open-sourced a DIYKit for MSRAbot. This DIYKit includes 3D models of the parts and step-by-step assembly instructions, enabling users to build the robot’s hardware from commercially available items. The software needed to operate the robot is also available on the same page.

MSRAbot hardware. Visit our GitHub page for more information.

The Benefits of Integrating Robotic Systems and Language Models

The fusion of existing robot gesture systems with large-scale language models benefits both components. Traditionally, studies of robot gesture systems have used predetermined phrases for evaluation; integration with language models enables evaluation under more natural conversational conditions, which promotes the development of better gesture-generation algorithms. Conversely, large-scale language models can expand their range of expression by adding speech and gestures to their strong language responses. By integrating the two technologies, we can build more flexible and natural chat systems that enhance human-robot interaction and communication.

Challenges and Limitations

While our co-speech gesturing chat system is straightforward and promising, it also has limitations and challenges. For example, it inherits the risks associated with language models, such as generating biased or inappropriate responses. Additionally, the gesture engine and concept estimation must be reliable and accurate to ensure the overall effectiveness and usability of the system. Further research and development are needed to make the system more robust, reliable, and user-friendly.

Conclusion

In conclusion, our co-speech gesturing chat system represents an exciting advance in the integration of robotic systems and language models. By using a gesture engine to analyze speech text and integrating GPT-3 for response generation, we have created a chat system that offers users a more flexible and natural chat experience. As we continue to refine and develop this technology, we believe that the fusion of robotic systems and language models will lead to more sophisticated and beneficial systems for users, such as virtual assistants and tutors.

About our research group

Visit our homepage: Applied Robotics Research

Learn more about this project
