Acoustic-to-Phrase Models for Speech Recognition
Directly emitting words and sub-words from a speech spectrogram has been shown to produce good results using end-to-end (E2E) trained models. Both Connectionist Temporal Classification (CTC) and Sequence-to-Sequence attention (Seq2Seq) models have shown improved results when directly targeting words or sub-words rather than letters. In this work, we ask the question: Can an E2E model go beyond words and transcribe directly to phrases (i.e., groups of words)? Directly modeling frequent phrases might be better than modeling their constituent words. Also, emitting multiple words together might speed up inference in models like Seq2Seq, where decoding is inherently sequential. To answer this, we undertake a study on a 3400-hour Microsoft Cortana voice assistant task. We present a side-by-side comparison of CTC and Seq2Seq models trained to target a variety of units, including letters, sub-words, words, and phrases. We show that an E2E model can indeed transcribe directly to phrases. We find that while CTC has difficulty modeling phrases accurately, a more powerful model like Seq2Seq can effortlessly target phrases of up to 4 words, with only a modest degradation in the final word error rate.
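As a rough illustration of what phrase-level targets could look like, the sketch below builds a phrase inventory from word-level transcripts and greedily re-tokenizes each transcript so that frequent multi-word phrases (up to 4 words) become single output tokens. The frequency cutoff, the greedy longest-match segmentation, and all names are assumptions made purely for illustration; they are not the procedure used in this work.

```python
from collections import Counter

MAX_PHRASE_LEN = 4   # upper bound on phrase length, per the abstract
MIN_COUNT = 100      # assumed frequency cutoff for keeping a phrase

def build_phrase_inventory(transcripts):
    """Count word n-grams (2..MAX_PHRASE_LEN words) and keep the frequent ones."""
    counts = Counter()
    for line in transcripts:
        words = line.split()
        for n in range(2, MAX_PHRASE_LEN + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {ngram for ngram, c in counts.items() if c >= MIN_COUNT}

def to_phrase_tokens(line, phrases):
    """Greedy longest-match segmentation of a transcript into phrase/word tokens."""
    words = line.split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(MAX_PHRASE_LEN, 1, -1):           # prefer the longest phrase
            if tuple(words[i:i + n]) in phrases:
                tokens.append("_".join(words[i:i + n]))  # e.g. "play_some_music"
                i += n
                break
        else:
            tokens.append(words[i])                      # fall back to a single word
            i += 1
    return tokens

# Example usage (repeated lines stand in for a large training corpus)
corpus = ["play some music", "play some music please", "what is the weather"] * 200
phrases = build_phrase_inventory(corpus)
print(to_phrase_tokens("play some music please", phrases))
# -> ['play_some_music_please']
```

Under this kind of scheme, the E2E model's softmax is taken over the union of single-word and phrase tokens, so a single decoding step can emit several words at once.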