DOMAIN AND SPEAKER ADAPTATION FOR CORTANA SPEECH RECOGNITION
Voice assistants represent one of the most popular and important scenarios
for speech recognition. In this paper, we propose two adaptation
approaches to customize a well-trained, multi-style acoustic
model to its subsidiary domain, the Cortana assistant. First, we
present anchor-based speaker adaptation by extracting the speaker
information, i-vector or d-vector embeddings, from the anchor segments
of ‘Hey Cortana’. The anchor embeddings are mapped to
layer-wise parameters to control the transformations of both weight
matrices and biases of multiple layers.
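As a rough illustration of this first approach, a small controller network can map the anchor embedding to per-layer scale and bias parameters that modulate an existing hidden layer. The sketch below is our own minimal example, not the paper's exact parameterization; the class name AnchorAdaptedLayer, the sigmoid gating, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AnchorAdaptedLayer(nn.Module):
    """Hidden layer modulated by an anchor embedding (illustrative sketch).

    A controller maps the i-vector/d-vector extracted from the 'Hey Cortana'
    anchor segment to element-wise scales and biases that adjust the output
    of the well-trained layer's affine transform.
    """

    def __init__(self, in_dim: int, out_dim: int, anchor_dim: int):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)          # well-trained layer
        self.to_scale = nn.Linear(anchor_dim, out_dim)  # controller: scales
        self.to_bias = nn.Linear(anchor_dim, out_dim)   # controller: biases

    def forward(self, x: torch.Tensor, anchor_emb: torch.Tensor) -> torch.Tensor:
        scale = torch.sigmoid(self.to_scale(anchor_emb))  # per-unit gains
        bias = self.to_bias(anchor_emb)                   # additive offsets
        return torch.relu(scale * self.base(x) + bias)

# Example: a 100-dim anchor d-vector controlling a 512->512 hidden layer.
layer = AnchorAdaptedLayer(in_dim=512, out_dim=512, anchor_dim=100)
frames = torch.randn(8, 512)     # a batch of acoustic frames
d_vector = torch.randn(8, 100)   # anchor embedding broadcast to the batch
out = layer(frames, d_vector)
```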
Second, we directly update the existing model parameters for domain
adaptation. We demonstrate that the prior distribution should be updated
along with the network adaptation to compensate for the label bias in the
development data. Updating the priors can have a significant impact when
the target domain features a high occurrence of anchor words.
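For the prior update, recall that a hybrid system converts senone posteriors to pseudo-likelihoods by dividing by the senone priors. The sketch below is our own illustration of interpolating the generic priors with priors re-estimated on the adaptation data before this conversion; the interpolation weight lam and the toy values are assumptions.

```python
import numpy as np

def interpolate_priors(generic_priors: np.ndarray,
                       adapted_priors: np.ndarray,
                       lam: float = 0.5) -> np.ndarray:
    """Linearly interpolate senone priors and renormalize (illustrative)."""
    priors = lam * generic_priors + (1.0 - lam) * adapted_priors
    return priors / priors.sum()

def posteriors_to_loglikelihoods(log_posteriors: np.ndarray,
                                 priors: np.ndarray) -> np.ndarray:
    """Hybrid-DNN conversion: log p(x|s) = log p(s|x) - log p(s) + const."""
    return log_posteriors - np.log(priors)

# Toy 4-senone inventory.
generic = np.array([0.4, 0.3, 0.2, 0.1])   # priors from generic training data
adapted = np.array([0.1, 0.2, 0.3, 0.4])   # priors counted on adaptation data
priors = interpolate_priors(generic, adapted, lam=0.5)
log_post = np.log(np.array([[0.7, 0.1, 0.1, 0.1]]))  # network output, one frame
scores = posteriors_to_loglikelihoods(log_post, priors)
```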
Experiments on the Hey Cortana desktop test set show that both approaches
significantly improve recognition accuracy. The anchor-based adaptation
using the anchor d-vector, combined with prior interpolation, achieves a 32%
relative reduction in WER over the generic model.