Sequence generation matters because it underlies real-life problems such as machine translation, document summarization, question generation, and sentence generation, as well as many of the image and video captioning tasks developed at MSR in recent years. MSR's deep-neural-network-based image captioning system won first prize (tied with Google) at the 2015 COCO Captioning Challenge. More recently, two novel training paradigms, deep reinforcement learning (RL) and generative adversarial networks (GANs), have pushed the state of the art in generation systems to new limits, including three MSR AI efforts: deep RL for visual storytelling, high-quality language generation, and program synthesis.
The recent wave of generation research builds on sequence-to-sequence models, in which an encoder squashes the input into a fixed-length vector and a decoder learns to attend to parts of the encoded input through deep structures of varying depth. While these methods work well for alignment-based generation tasks (e.g., machine translation or sentence compression), they are ineffective, and mostly fail, on tasks that require understanding, encoding of very long inputs (e.g., a book, articles on the same topic, or meeting recordings), multi-modal inputs (e.g., online text-based discourse about two-player video games), or decoding of longer sequences, such as a multi-sentence summary rather than a single sentence. This SIP proposal seeks to further advance sequence generation by introducing communication- and interaction-based deep learning, specifically for long-form text generation. Each encoder agent will take a different section of the same type of input or a different modality of input (e.g., text, image, video). Communication and interaction will be captured by a novel multi-agent encoder-decoder system in which the encoder agents learn to communicate with each other and with the decoder, which acts as a meta-controller to generate relevant and coherent sequences about the input.
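The proposed division of a long input across encoder agents, with the decoder acting as a meta-controller, can be sketched as follows. This is a minimal numpy illustration, not the proposed system: `encode_chunk` (mean-pooling) stands in for a real recurrent or deep encoder, and the dot-product scoring is one possible choice of attention.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_chunk(chunk_embs):
    """Toy 'encoder agent': summarize one section of the input by
    mean-pooling its token embeddings (a real agent would be an RNN
    or similar deep encoder)."""
    return chunk_embs.mean(axis=0)

def decoder_step(dec_state, agent_summaries):
    """Decoder as meta-controller: attend over the agents' summaries
    instead of over every token of the long input."""
    scores = agent_summaries @ dec_state      # one score per agent
    weights = softmax(scores)                 # soft choice of agent
    context = weights @ agent_summaries       # blended context vector
    return context, weights

rng = np.random.default_rng(0)
long_input = rng.normal(size=(300, 16))       # 300 tokens, 16-dim embeddings
chunks = np.array_split(long_input, 3)        # one section per encoder agent
summaries = np.stack([encode_chunk(c) for c in chunks])
context, weights = decoder_step(rng.normal(size=16), summaries)
```

At each decoding step the decoder scores only three agent summaries rather than 300 tokens, which is the scalability argument made above.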
We are now experiencing a revival of interest in learning frameworks centered on communication and interaction (e.g., Atari games, communication-based language learning). This project offers a new perspective on sequence-to-sequence generation tasks by introducing deep communicating agents that work together to encode useful information from multiple sources in parallel and communicate with the decoder, which orchestrates the generation task.
Neural sequence-to-sequence models for generation have several shortcomings that make them ill-suited for tasks with very long or multi-modal input sequences. The best-performing models today focus on attention mechanisms that yield better decoders, not on solving the long-input problem. We argue that one of the main limitations, especially for longer-sequence generation tasks, is that the decoder unavoidably learns an attention distribution over all tokens of the long input, whereas a better approach is to learn to attend to the sections of the input relevant to the decoder's current time step. This can be achieved with multiple encoders, each receiving a different part of the same sequence, and a decoder that learns an attention over the relevant encoder(s) at each decoding step. A second issue is multi-modality: a single encoder is a poor fit for extracting information from multiple inputs or from inputs of different modalities. These limitations make it difficult to generate good sequences at test time, and the approach we propose in this project is a multi-agent encoder-decoder framework.
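The section-level attention described above can be made concrete as a two-level scheme: the decoder first weights whole sections (encoders), then tokens within each section. The sketch below is an illustrative assumption, not the proposed model; `hierarchical_attention`, the mean-pooled section summaries, and the dot-product scores are all placeholders for learned components.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(query, sections):
    """Two-level attention: weight sections first, then tokens inside
    each section, so a decoding step concentrates on the parts of a
    long input relevant to that step instead of all tokens at once."""
    section_summaries = np.stack([s.mean(axis=0) for s in sections])
    sec_w = softmax(section_summaries @ query)    # which section matters now?
    context = np.zeros_like(query)
    for w, sec in zip(sec_w, sections):
        tok_w = softmax(sec @ query)              # which tokens within it?
        context += w * (tok_w @ sec)
    return context, sec_w

rng = np.random.default_rng(1)
# Three sections of a long input, of unequal length, 8-dim embeddings.
sections = [rng.normal(size=(n, 8)) for n in (40, 60, 50)]
context, sec_w = hierarchical_attention(rng.normal(size=8), sections)
```

The final token-level distribution is the product of a section weight and a within-section weight, so probability mass concentrates on the chosen section rather than spreading over the entire input.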
Many tasks in AI require the collaboration of multiple agents. Since 2016, several papers have described approaches for learning deep communicating policies (DCPs). These networks are formulated as recurrent neural networks (RNNs) with decentralized representations of behavior that enable multiple agents to communicate via a differentiable channel. Notable work on DCPs solves a variety of coordination problems, including reference games, logic puzzles, and MSR Maluuba's work on multi-agent games. We therefore seek to build a DCP-based encoder-decoder system for generating longer sequences that are coherent and relevant.
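A differentiable communication channel of the kind used in this DCP literature can be sketched as follows, loosely in the style of CommNet-like message averaging; the update rule, weight matrices, and round count here are illustrative assumptions, not the channel any particular cited system uses.

```python
import numpy as np

def communication_rounds(hidden, W_h, W_c, rounds=2):
    """Toy differentiable channel: in each round, every agent receives
    the mean of the other agents' hidden states as a message and
    updates its own state. Every operation is smooth, so gradients can
    flow through the messages during end-to-end training."""
    n = hidden.shape[0]
    for _ in range(rounds):
        total = hidden.sum(axis=0, keepdims=True)
        messages = (total - hidden) / (n - 1)   # mean of the *other* agents
        hidden = np.tanh(hidden @ W_h + messages @ W_c)
    return hidden

rng = np.random.default_rng(2)
n_agents, d = 4, 8
W_h = rng.normal(size=(d, d)) * 0.1             # self-update weights
W_c = rng.normal(size=(d, d)) * 0.1             # message weights
hidden = communication_rounds(rng.normal(size=(n_agents, d)), W_h, W_c)
```

In the proposed system, the agents exchanging states would be the encoder agents, with the decoder reading their post-communication representations.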
Personnel
Antoine Bosselut
Ph.D. Student
University of Washington
Michel Galley
Senior Principal Researcher
Yejin Choi
Brett Helsel Professor of Computer Science
University of Washington