Sequence Prediction with Unlabeled Data by Reward Function Learning
- Lijun Wu
- Li Zhao
- Tao Qin
- Jianhuang Lai
- Tie-Yan Liu
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces a reward as a sequence-level supervision signal for evaluating the quality of a generated sequence. Existing RL approaches define the reward using the ground-truth sequence, which restricts RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and to use its predictions as pseudo rewards for training on unlabeled data. To obtain good pseudo rewards on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained on a purposely biased data distribution. Experiments show that the pseudo reward provides good supervision and guides the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.
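The pseudo-reward idea can be illustrated with a minimal sketch. It is not the paper's implementation: a one-feature least-squares model stands in for the RNN-based reward network with attention, token accuracy stands in for a metric such as BLEU or ROUGE, and the task is a toy copy task. All names (`true_reward`, `feature`, `fit_reward_model`, `pseudo_reward`, `candidates`) are hypothetical and chosen only for illustration.

```python
def true_reward(candidate, reference):
    """Reward that needs the ground truth: token accuracy (stand-in for BLEU/ROUGE)."""
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / max(len(reference), 1)


def feature(source, candidate):
    """Hypothetical reference-free feature: position-wise agreement with the source."""
    matches = sum(c == s for c, s in zip(candidate, source))
    return matches / max(len(source), 1)


def fit_reward_model(triples):
    """Least-squares fit of reward ~ w * feature + b on (source, candidate, reward) triples."""
    xs = [feature(s, c) for s, c, _ in triples]
    ys = [r for _, _, r in triples]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x if var_x > 0 else 0.0
    return w, mean_y - w * mean_x


def pseudo_reward(model, source, candidate):
    """Predicted reward for a candidate; no reference sequence is required."""
    w, b = model
    return w * feature(source, candidate) + b


def candidates(source):
    """Toy candidate outputs of varied quality: exact copy, one error, all errors."""
    return [list(source), ["x"] + list(source[1:]), ["x"] * len(source)]


# Labeled data (toy copy task: the reference equals the source).
labeled = [(list("abcd"), list("abcd")), (list("dcba"), list("dcba"))]
triples = [(s, c, true_reward(c, ref)) for s, ref in labeled for c in candidates(s)]
model = fit_reward_model(triples)

# Unlabeled data: only a source is available, so candidates get pseudo rewards.
unlabeled_source = list("bcda")
samples = candidates(unlabeled_source)
rewards = [pseudo_reward(model, unlabeled_source, c) for c in samples]

# In a REINFORCE-style update, each sample's log-probability gradient would be
# weighted by its reward minus a baseline (here, the mean pseudo reward).
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]
```

In this toy setting the reference-free feature happens to coincide with the true reward, so the fitted model ranks candidates for the unlabeled source correctly. A realistic reward model would have to learn that relationship from the source and candidate sequences themselves.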