{"id":420618,"date":"2017-08-20T17:58:40","date_gmt":"2017-08-21T00:58:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=420618"},"modified":"2018-06-13T08:02:39","modified_gmt":"2018-06-13T15:02:39","slug":"microsoft-researchers-achieve-new-conversational-speech-recognition-milestone","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone\/","title":{"rendered":"Microsoft researchers achieve new conversational speech recognition milestone"},"content":{"rendered":"
Last year, Microsoft\u2019s speech and dialog research group announced (opens in new tab)<\/span><\/a> a milestone in reaching human parity on the Switchboard conversational speech recognition task, meaning we had created technology that recognized words in a conversation as well as professional human transcribers.<\/p>\n (opens in new tab)<\/span><\/a><\/p>\n After our transcription system reached the 5.9 percent word error rate that we had measured for humans, other researchers conducted their own study, employing a more involved multi-transcriber process, which yielded a 5.1 human parity word error rate. This was consistent with prior research that showed that humans achieve higher levels of agreement on the precise words spoken as they expend more care and effort. Today, I\u2019m excited to announce that our research team reached that 5.1 percent error rate with our speech recognition system, a new industry milestone, substantially surpassing the accuracy we achieved last year. A technical report (opens in new tab)<\/span><\/a> published this weekend documents the details of our system.<\/p>\n Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems. The task involves transcribing conversations between strangers discussing topics such as sports and politics.<\/p>\n We reduced our error rate by about 12 percent compared to last year\u2019s accuracy level, using a series of improvements to our neural net-based acoustic and language models. We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modeling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame\/senone and word levels.<\/p>\n Moreover, we strengthened the recognizer\u2019s language model by using the entire history of a\u00a0dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.<\/p>\n Our team also has benefited greatly from using the most scalable deep learning software available, Microsoft Cognitive Toolkit 2.1 (opens in new tab)<\/span><\/a> (CNTK), for exploring model architectures and optimizing the hyper-parameters of our models. Additionally, Microsoft\u2019s investment in cloud compute infrastructure, specifically Azure GPUs (opens in new tab)<\/span><\/a>, helped to improve the effectiveness and speed by which we could train our models\u00a0and test new ideas.<\/p>\n Reaching human parity with an accuracy on par with humans has been a research goal for the last 25 years.\u00a0Microsoft\u2019s willingness to invest in long-term research is now paying dividends for our customers in products and services such as Cortana (opens in new tab)<\/span><\/a>, Presentation Translator (opens in new tab)<\/span><\/a>, and Microsoft Cognitive Services (opens in new tab)<\/span><\/a>. It\u2019s deeply gratifying to our research teams to see our work used by millions of people each day.<\/p>\n