Knowledge Distillation for Mixture of Experts Models in Speech Recognition

MSR-TR-2022-6

Published by Microsoft

The sparsely-gated mixture of experts (MoE) architecture can scale Transformer models to sizes that are orders of magnitude beyond what dense models can reach under current hardware limitations, and it has also been shown to improve convergence. However, many applications still rely on traditional dense models, both for deployment and for further compression, in order to meet memory and latency requirements. In this work, we propose a simple approach to distill MoE models into dense models while retaining the accuracy gains achieved by large sparse models. The resulting dense models can then be further optimized and compressed using well-known techniques for dense models. We demonstrate the model compression efficiency of our knowledge distillation (KD) technique through multilingual speech recognition experiments. Experimental results show that our proposed method can reduce the number of weights of an MoE teacher network from 677M to 99M for 24 experts and from 1.8B to 124M for 72 experts while keeping almost the same accuracy. The results also show that our KD method provides better recognition accuracy than conventional training methods that do not use the MoE teacher model.
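As a rough illustration of the kind of teacher-student objective the abstract describes, the sketch below blends a soft-label distillation term (KL divergence against the output distribution of a frozen MoE teacher) with the usual hard-label loss for the dense student. The function name, temperature, and mixing weight alpha are illustrative assumptions, not the report's actual training recipe.

```python
# Minimal sketch of MoE-to-dense knowledge distillation (illustrative only;
# the loss form, temperature, and mixing weight are assumptions, not the
# report's exact method).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with soft-label KL against the MoE teacher.

    student_logits: (batch, num_classes) from the dense student.
    teacher_logits: (batch, num_classes) from the frozen sparse MoE teacher.
    labels:         (batch,) integer class targets.
    """
    # Soft targets from the teacher, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    # Standard supervised loss on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In a speech recognition setting the model emits a distribution per frame or per output token, so the logits would typically be flattened over time before computing these terms; the MoE teacher runs in inference mode while only the dense student is updated.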