{"id":835873,"date":"2022-04-18T04:10:54","date_gmt":"2022-04-18T11:10:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=835873"},"modified":"2022-04-18T04:13:14","modified_gmt":"2022-04-18T11:13:14","slug":"fastcorrect-the-fast-error-correction-model-for-speech-recognition","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/fastcorrect-the-fast-error-correction-model-for-speech-recognition\/","title":{"rendered":"FastCorrect: the fast error correction model for speech recognition"},"content":{"rendered":"\n
Error correction is an important post-processing step in speech recognition: it detects and corrects errors in recognition results, further improving accuracy. Many error correction models are autoregressive and therefore have high inference latency, while speech recognition services impose strict latency requirements. As a result, autoregressive error correction models cannot be deployed online in real-time speech recognition scenarios.<\/p>\n\n\n\n
To speed up error correction for speech recognition, researchers at Microsoft Research Asia and Microsoft Azure Speech proposed FastCorrect, a non-autoregressive error correction model based on edit alignment that is six to nine times faster than the autoregressive model while maintaining comparable error correction ability. Because speech recognition models can often produce multiple candidate recognition results, the researchers further proposed FastCorrect 2, in which these candidates are used to verify each other and improve performance. The papers on FastCorrect and FastCorrect 2 were accepted by NeurIPS 2021 and EMNLP 2021, respectively.<\/p>\n\n\n\n
FastCorrect leverages non-autoregressive generation with edit alignment to speed up inference over the autoregressive correction model. In FastCorrect, researchers first compute the edit distance between the recognized text (source sentence) and the ground-truth text (target sentence). Since source and target tokens are aligned monotonically in Automatic Speech Recognition (ASR), unlike in neural machine translation where word reordering occurs, the insertion, deletion, and substitution operations along the edit-distance path yield the number of target tokens that each source token maps to after editing (0 means deletion, 1 means unchanged or substitution, and \u22652 means insertion), as shown in Figure 1. When a source-target sentence pair admits several possible alignments, the final alignment is chosen based on a path match score (the number of matched tokens in the alignment) and a frequency score (reflecting the confidence of the alignment under a language model).<\/p>\n\n\n\n
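The token-level alignment described above can be sketched with a standard edit-distance dynamic program. This is a simplified illustration, not the paper's implementation: it backtraces a single optimal path and attaches each inserted target token to the preceding source token, whereas FastCorrect scores competing alignment paths with the match and frequency scores mentioned above.

```python
def edit_alignment(source, target):
    """For each source token, count how many target tokens it maps to after
    editing: 0 = deletion, 1 = unchanged or substitution, >=2 = insertion.
    Simplified sketch: follows one optimal edit-distance path only."""
    m, n = len(source), len(target)
    # Standard edit-distance (Levenshtein) DP table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete source token
                           dp[i][j - 1] + 1,          # insert target token
                           dp[i - 1][j - 1] + cost)   # match / substitute
    # Backtrace one optimal path, counting target tokens per source token.
    counts = [0] * m
    i, j = m, n
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and source[i - 1] == target[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            counts[i - 1] += 1      # match or substitution: maps to 1 token
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                  # deletion: source token maps to 0 tokens
        else:
            # Insertion: attach the extra target token to the preceding
            # source token (or the first token at sentence start).
            counts[max(i - 1, 0)] += 1
            j -= 1
    return counts
```

For example, aligning the recognized text `["I", "like", "cat"]` against the ground truth `["I", "like", "the", "cat"]` yields `[1, 2, 1]`: the inserted token "the" attaches to "like", which therefore maps to two target tokens.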