Universal Acoustic Modeling Using Neural Mixture Models
Acoustic models are domain dependent and perform poorly when training and test conditions are mismatched. As an alternative, the Mixture of Experts (MoE) model was introduced for multi-domain modeling. It combines the outputs of several domain-specific models (or experts) using a gating network. One drawback, however, is that the gating network operates directly on raw features and is unaware of the state of the experts. In this work, we propose several alternatives to improve the MoE model. First, to make the MoE model state-aware, we use the outputs of the experts as inputs to the gating network, and we show that vector-based interpolation of the mixture weights is more effective than scalar interpolation. Second, we show that directly learning the mixture weights, without any complex gating, is still effective. Finally, we introduce a hybrid attention model that uses the logits and mixture weights from the previous time step to generate the mixture weights at the current time step. Our best proposed model outperforms a baseline model with LSTM-based gating, achieving a 20.48% relative reduction in word error rate (WER). Moreover, it beats an oracle model that picks the best expert for a given test condition.
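To make the distinction between scalar and vector interpolation concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes K experts that each emit a D-dimensional logit vector for one frame, and it takes the gate scores as given inputs rather than computing them with a gating network. The function names and array shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def scalar_interpolation(expert_logits, gate_scores):
    # expert_logits: (K, D), one logit vector per expert (assumed shape).
    # gate_scores:   (K,),   one unnormalized score per expert.
    weights = softmax(gate_scores, axis=0)   # (K,), sums to 1 over experts
    return weights @ expert_logits           # (D,)

def vector_interpolation(expert_logits, gate_scores):
    # gate_scores: (K, D), one unnormalized score per expert and per
    # output dimension, so each dimension gets its own mixture.
    weights = softmax(gate_scores, axis=0)   # (K, D), each column sums to 1
    return np.sum(weights * expert_logits, axis=0)  # (D,)
```

With equal gate scores, both variants reduce to a plain average of the expert outputs. The vector variant differs only when the scores vary across output dimensions, which lets the mixture favor different experts for different outputs.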