Introduction
Large-scale language models have revolutionized natural language processing tasks, and researchers are exploring their potential for enhancing human-robot interaction and communication. In this post, we will present our co-speech gesturing chat system, which integrates GPT-3/ChatGPT with a gesture engine to provide users with a more flexible and natural chat experience. We will explain how the system works and discuss the synergistic effects of integrating robotic systems and language models.
Co-Speech Gesturing Chat System: How it works
Our co-speech gesturing chat system operates within a browser. When a user inputs a message, GPT-3/ChatGPT generates the robot’s textual response based on a prompt carefully crafted to create a chat-like experience. A gesture engine then analyzes the response text and selects an appropriate gesture from a library keyed to the conceptual meaning of the speech. A speech generator converts the text into speech, while a gesture generator plays the corresponding co-speech gestures, giving the user audio-visual feedback through a CG robot. The system leverages several Azure services: Azure Speech Service for speech-to-text conversion, Azure OpenAI Service for GPT-3-based response generation, and Azure Language Understanding for concept estimation. The source code of the system is available on GitHub.
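The flow above can be sketched in a few lines of code. Note that every function and variable name below is an illustrative placeholder, not the project’s actual API: a real deployment would call Azure Speech, Azure OpenAI, and a language-understanding service where these stubs stand.

```python
# Hypothetical sketch of the pipeline: user text -> LLM response ->
# concept estimation -> gesture selection. Names are placeholders.

# A toy gesture library keyed by conversational concept.
GESTURE_LIBRARY = {
    "greeting": "wave",
    "agreement": "nod",
    "unknown": "idle",
}

def generate_response(user_text: str) -> str:
    """Stand-in for the GPT-3/ChatGPT call that produces the robot's reply."""
    return f"Hello! You said: {user_text}"

def estimate_concept(response_text: str) -> str:
    """Stand-in for concept estimation (a language-understanding service)."""
    lowered = response_text.lower()
    if "hello" in lowered or "hi" in lowered:
        return "greeting"
    if "yes" in lowered or "agree" in lowered:
        return "agreement"
    return "unknown"

def chat_turn(user_text: str) -> dict:
    """One chat turn: textual response, estimated concept, selected gesture."""
    response = generate_response(user_text)
    concept = estimate_concept(response)
    gesture = GESTURE_LIBRARY.get(concept, GESTURE_LIBRARY["unknown"])
    # In the real system, text-to-speech and the CG robot's gesture playback
    # would execute here, synchronized as audio-visual feedback.
    return {"response": response, "concept": concept, "gesture": gesture}

turn = chat_turn("Hi there!")
print(turn["gesture"])
```

The key design point this sketch illustrates is the decoupling: the gesture engine operates on the generated text alone, so any response generator (GPT-3, ChatGPT, or another model) can be swapped in without changing the gesture-selection logic.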
MSRAbot DIYKit
In this post, we have utilized our in-house developed robot named MSRAbot, originally designed as a platform for human-robot interaction research. As an additional resource for readers interested in the robot, we have developed and open-sourced a DIYKit for MSRAbot. This DIYKit includes 3D models of the parts and step-by-step assembly instructions, enabling users to build the robot’s hardware using commercially available items. The software needed to operate the robot is also available on the same page.
The Benefits of Integrating Robotic Systems and Language Models
The fusion of existing robot gesture systems with large-scale language models benefits both components. Traditionally, studies on robot gesture systems have used predetermined phrases for evaluation; integration with language models enables evaluation under more natural conversational conditions, which promotes the development of better gesture generation algorithms. In turn, large-scale language models can expand their range of expression by adding speech and gestures to their already strong language responses. By integrating these two technologies, we can develop more flexible and natural chat systems that enhance human-robot interaction and communication.
Challenges and Limitations
While our co-speech gesturing chat system appears straightforward and promising, it also faces limitations and challenges. For example, the use of language models carries well-known risks, such as generating biased or inappropriate responses. Additionally, the gesture engine and concept estimation must be reliable and accurate to ensure the overall effectiveness and usability of the system. Further research and development are needed to make the system more robust, reliable, and user-friendly.
Conclusion
Our co-speech gesturing chat system represents an exciting advance in the integration of robotic systems and language models. By using a gesture engine to analyze speech text and integrating GPT-3 for response generation, we have created a chat system that offers users a more flexible and natural chat experience. As we continue to refine and develop this technology, we believe that the fusion of robotic systems and language models will lead to more sophisticated and beneficial systems for users, such as virtual assistants and tutors.
About our research group
Visit our homepage: Applied Robotics Research
Learn more about this project
- [Project page] Gesture Generation for Service Robots
- [Paper] Labeling the Phrases of a Conversational Agent with a Unique Personalized Vocabulary
- [Paper] Integration of Gesture Generation System Using Gesture Library with DIY Robot Design Kit
- [Paper] Design of conversational humanoid robot based on hardware independent gesture generation
- [GitHub] Sample code to test co-speech gestures using Toyota HSR robot