In video games, Non-Player Characters (NPCs) can be central to the player experience, providing agents that players can engage with in interesting and entertaining ways. In many games, the interactions between players and NPCs involve conversations. These conversations are almost always highly scripted and limited in scope, usually requiring players to select from a set of prespecified responses in a dialog tree. Conversational interactions rarely involve direct, dynamic negotiation of game state. Players cannot, for example, induce NPCs to perform actions that they were not explicitly preprogrammed to perform.
This scripted gaming paradigm is about to be shattered. In the not-too-distant future, gamers will encounter transformative new experiences ushered in by extremely large pretrained neural network models that are already changing the shape of text, conversation, code, and image generation. Players will be able to engage in much richer conversational interactions with NPCs than has hitherto been thought possible, perform actions collaboratively with agents within the game, and perhaps even eventually transfer personal assistants from one game to the next. Game creators, likewise, will find that these technologies may fundamentally change the ways that they approach the development of NPCs, side quests, story lines, and even entire worlds, all through natural language descriptions. Entirely new classes of games may emerge.
The Natural Language Processing Group at Microsoft Research has teamed with Xbox Gaming, Semantic Machines, and the Office of the CTO in a project that takes the first steps toward making some of this vision a reality. Some hints of things to come were presented in a fun demo at Microsoft Build in May this year. Now, in a paper (presented at the WordPlay 2022 Workshop), the team explores in greater detail the potential for, and challenges of, creating functionally agentive NPCs: characters with which players can hold free-form conversations that are grounded in the game, and which can adaptively generate code that calls functions exposed by the game API.
We chose Minecraft as our test bed: it’s an open-world game, rich in game lore, in which players creatively build artifacts in the environment, which makes it ideal for our research. Conversations with NPCs are not a standard feature of the game, so this is a relatively pristine environment in which to explore interactions with NPCs that can converse and perform tasks on behalf of players. In addition, Minecraft has several game APIs that permit large language models to write function calls that enable the NPC to perform in-game actions. Figures 1 and 2 depict two examples of player interactions with such an NPC in Minecraft.
In the present experiments, we use a single language model, OpenAI Codex (Chen et al., 2021), to generate both conversational responses and code. We employ a strategy known as few-shot prompting, in which the prompt given to the model contains a small number of sample instances from which the model generalizes to new, unseen input (Brown et al., 2020). By the simple expedient of including examples of both natural language conversations and code in the Codex prompt, we enable the model to generalize to interesting new settings, opening up intriguing possibilities for enhanced player experiences and game development.
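As a rough illustration of few-shot prompting with mixed conversation and code, the sketch below assembles a prompt from a handful of example pairs and then appends the new player input for the model to complete. The `// Player:` framing, the example utterances, and the `followPlayer()` helper name are all assumptions made for illustration; they are not taken from the prompt used in the paper.

```python
from typing import List, Tuple

# Hypothetical (player utterance, desired completion) pairs. Some completions
# are chat responses and some are function calls, so the model sees both kinds.
# "followPlayer" is a made-up helper name, not a confirmed game API function.
EXAMPLES: List[Tuple[str, str]] = [
    ("Hi, what can you do?", 'bot.chat("I can help you mine and craft things.");'),
    ("Come over here", "followPlayer();"),
]


def build_few_shot_prompt(examples: List[Tuple[str, str]], player_input: str) -> str:
    """Concatenate the example pairs, then leave the final turn open."""
    parts = []
    for utterance, completion in examples:
        # The comment-style framing is an assumed convention for this sketch.
        parts.append(f"// Player: {utterance}\n{completion}\n")
    # The last turn has no completion: the model is expected to supply it.
    parts.append(f"// Player: {player_input}\n")
    return "\n".join(parts)


prompt = build_few_shot_prompt(EXAMPLES, "Can you get me some wood?")
```

Because the examples interleave chat responses with function calls, a single model can, in principle, decide per turn whether to talk or to act.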
The prompt provides the model with the natural language commands and the code samples necessary to enable basic NPC functionalities such as moving around, following the player, locating resources, mining, and crafting artifacts. Figure 3 includes a sample of this prompt. Each new conversational input from the player is appended to this seed prompt and sent to the model for evaluation. When the completion includes a function call to the game API, the corresponding action is performed by the NPC inside the game. When it includes a call to the bot.chat() function, a response string is displayed in the chat interface. For each subsequent input, the prompt includes the seed prompt plus the previous player inputs and model completions. When the prompt exceeds the currently allowed token limit (2048 tokens), we revert to the seed prompt and report to the player that the context has been reset.
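The context handling described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the `NPCSession` class, the injected `complete` callable standing in for the model call, the `// Player:` turn format, and the whitespace-based `count_tokens` are all hypothetical. A real system would count tokens with the model's own tokenizer.

```python
from typing import Callable, List, Tuple

TOKEN_LIMIT = 2048  # the token limit mentioned in the text


def count_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not real model tokens.
    return len(text.split())


class NPCSession:
    """Keeps the seed prompt plus the running history of turns (hypothetical)."""

    def __init__(self, seed_prompt: str, complete: Callable[[str], str],
                 token_limit: int = TOKEN_LIMIT) -> None:
        self.seed_prompt = seed_prompt
        self.complete = complete          # assumed wrapper around the model call
        self.token_limit = token_limit
        self.history: List[str] = []      # previous player inputs and completions

    def handle(self, player_input: str) -> Tuple[str, bool]:
        """Return (completion, was_reset) for one player input."""
        turn = f"// Player: {player_input}\n"  # assumed turn format
        prompt = self.seed_prompt + "".join(self.history) + turn
        was_reset = False
        if count_tokens(prompt) > self.token_limit:
            # Over the limit: drop the history and fall back to the seed prompt.
            self.history.clear()
            prompt = self.seed_prompt + turn
            was_reset = True
        completion = self.complete(prompt)
        self.history.append(turn + completion + "\n")
        return completion, was_reset
```

The caller can use the returned `was_reset` flag to tell the player that the context has been reset, and can inspect the completion to decide whether to run a game action or display a chat response.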
The paper presents an evaluation of the performance of these prompted NPCs through an exploratory user study. We asked eight experienced gamers to interact with an NPC to accomplish tasks in a Minecraft realm: obtain crafting recipes, mine resources, craft items and, lastly, break out of two escape rooms. We found that the NPC exhibits multiple capabilities: it can parse unseen commands, generalize to new functionality, hold multi-turn conversations, generate language about code, switch between code and language generation, answer questions, and generate novel function chains. Short videos showing sample interactions with the participants are available on GitHub, along with the original prompts and supporting code, so that others can explore the possibilities for themselves.
Gaming environments offer an excellent sandbox for exploring the road ahead in artificial intelligence, with implications beyond gaming to other consumer applications. The challenges, however, are many. We observed numerous issues in our prompt-based approach, ranging from calls to non-existent functions and conversational responses where function calls were expected, to factual inaccuracies, inconsistent persona, and recency bias. Many of these issues are already familiar to people researching conversational agents; others are new and cross-modal. So, it is early days yet. It will take time, experimentation, and a great deal of ingenuity, innovation, and possibly entirely new classes of models to resolve them. In the near term, it is likely that game creators will benefit first, from tooling that allows them to use conversational and coding suggestions generated by the models to accelerate development and to explore new possibilities for engaging game scenarios. But the promise of the pretrained models is real, and in due course we expect them to bring exciting new kinds of experiences to the gaming community.