Sequence Prediction with Unlabeled Data by Reward Function Learning

Lijun Wu; Li Zhao; Tao Qin; Jianhuang Lai; Tie-Yan Liu

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence | August 2017

Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces a reward as a sequence-level supervision signal to evaluate the quality of a generated sequence. Existing RL approaches use the ground-truth sequence to define the reward, which limits the application of RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and to use the predicted reward as a pseudo reward for unlabeled data, so that the model can learn from unlabeled data through this pseudo reward. To obtain a good pseudo reward on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained on a purposely biased data distribution. Experiments show that the pseudo reward can provide good supervision and guide the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.
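The abstract describes the approach only at a high level. As a rough illustration of its core loop (fit a reward model where references exist, then use that model's predictions as pseudo rewards for policy-gradient training where they do not), here is a toy Python sketch. Every choice in it is a hypothetical simplification rather than the authors' method: a copy task stands in for translation or summarization, token accuracy stands in for a metric such as BLEU or ROUGE, a linear scorer over (source token, output token) pairs replaces the attention-based RNN reward network, and a per-token softmax policy is trained with plain REINFORCE. All names (`true_reward`, `pseudo_reward`, `greedy`, `V`, `T`, `LR`) are illustrative, and the biased training distribution mentioned in the abstract is not reproduced.

```python
import math
import random

V = 4     # toy vocabulary size (hypothetical)
T = 5     # toy sequence length (hypothetical)
LR = 0.5  # learning rate for both training loops (hypothetical)
rng = random.Random(0)


def random_seq():
    return [rng.randrange(V) for _ in range(T)]


def true_reward(y_hat, y_star):
    # Reference-based reward, only computable on labeled data. Token accuracy
    # is a stand-in for a sequence-level metric such as BLEU or ROUGE.
    return sum(a == b for a, b in zip(y_hat, y_star)) / len(y_star)


def pseudo_reward(w, x, y_hat):
    # Learned reward: scores a candidate from the source alone, no reference needed.
    return sum(w[s][o] for s, o in zip(x, y_hat)) / len(x)


# Step 1: learn the reward function from labeled data. In this toy copy task
# the reference is the source itself (y* = x). Candidates are drawn uniformly
# at random for coverage; this does NOT reproduce the paper's biased distribution.
w = [[0.0] * V for _ in range(V)]  # w[src][out]: linear reward-model weights
for _ in range(3000):
    x, y_hat = random_seq(), random_seq()
    err = pseudo_reward(w, x, y_hat) - true_reward(y_hat, x)
    for s, o in zip(x, y_hat):
        w[s][o] -= LR * err / T  # SGD step on the squared error 0.5 * err**2


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Step 2: train the policy on unlabeled sources with REINFORCE, using the
# reward model's prediction as the pseudo reward.
theta = [[0.0] * V for _ in range(V)]  # theta[src][out]: per-token policy logits
baseline = 0.0  # moving-average baseline to reduce gradient variance
for _ in range(3000):
    x = random_seq()  # unlabeled: no reference sequence is used below
    probs = [softmax(theta[s]) for s in x]
    y_hat = [rng.choices(range(V), weights=p)[0] for p in probs]
    r = pseudo_reward(w, x, y_hat)
    advantage = r - baseline
    baseline = 0.9 * baseline + 0.1 * r
    for s, o, p in zip(x, y_hat, probs):
        for k in range(V):
            # d log p(o | s) / d theta[s][k] = 1[k == o] - p[k]
            theta[s][k] += LR * advantage * ((k == o) - p[k])


def greedy(x):
    # Pick the highest-scoring output token for each source token.
    return [max(range(V), key=lambda k: theta[s][k]) for s in x]


print(greedy([3, 1, 0, 2]))
```

Running the script prints the greedy decoding of a held-out source. If both stages work, it should reproduce the source, even though Step 2 never consults a reference; the policy is guided only by the learned reward. Replacing the linear scorer with a neural reward network would change Step 1's model but not the overall two-stage structure.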