{"id":283028,"date":"2011-08-29T12:01:55","date_gmt":"2011-08-29T19:01:55","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=283028"},"modified":"2016-08-25T09:12:51","modified_gmt":"2016-08-25T16:12:51","slug":"speech-recognition-leaps-forward","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/speech-recognition-leaps-forward\/","title":{"rendered":"Speech Recognition Leaps Forward"},"content":{"rendered":"<p><em>By Janie Chang,\u00a0Writer, Microsoft Research<\/em><\/p>\n<p>During <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/www.isca-speech.org\/archive\/interspeech_2011\/\" target=\"_blank\">Interspeech 2011<\/a>, the 12th annual Conference of the International Speech Communication Association being held in Florence, Italy, from Aug. 28 to 31, researchers from Microsoft Research will present work that dramatically improves the potential of real-time, speaker-independent, automatic speech recognition.<\/p>\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dongyu\/\" target=\"_blank\">Dong Yu<\/a>, researcher at <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/lab\/microsoft-research-redmond\/\" target=\"_blank\">Microsoft Research Redmond<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/fseide\/\" target=\"_blank\">Frank Seide<\/a>, senior researcher and research manager with <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/lab\/microsoft-research-asia\/\" target=\"_blank\">Microsoft Research Asia<\/a>, have been spearheading this work, and their teams have collaborated on what has developed into a research breakthrough in the use of artificial neural networks for large-vocabulary speech recognition.<\/p>\n<h2>The Holy Grail of Speech Recognition<\/h2>\n<p>Commercially available speech-recognition technology is behind applications such as voice-to-text software and automated phone 
services. Accuracy is paramount, and voice-to-text typically achieves this by having the user \u201ctrain\u201d the software during setup and by adapting more closely to the user\u2019s speech patterns over time. Automated voice services that interact with multiple speakers do not allow for speaker training because they must be usable instantly by any user. To cope with the lower accuracy, they either handle only a small vocabulary or strongly restrict the words or patterns that users can say.<\/p>\n<p>The ultimate goal of automatic speech recognition is to deliver out-of-the-box, speaker-independent speech-recognition services\u2014a system that does not require user training to perform well for all users under all conditions.<\/p>\n<div id=\"attachment_283037\" style=\"width: 320px\" class=\"wp-caption alignleft\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-283037\" class=\"size-full wp-image-283037\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2011\/08\/Dong-Yu.jpg\" alt=\"Dong Yu\" width=\"310\" height=\"206\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2011\/08\/Dong-Yu.jpg 310w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2011\/08\/Dong-Yu-300x199.jpg 300w\" sizes=\"(max-width: 310px) 100vw, 310px\" \/><p id=\"caption-attachment-283037\" class=\"wp-caption-text\">Dong Yu<\/p><\/div>\n<p>\u201cThis goal has increased importance in a mobile world,\u201d Yu says, \u201cwhere voice is an essential interface mode for smartphones and other mobile devices. 
Although personal mobile devices would be ideal for learning their users\u2019 voices, users will continue to use speech only if the initial experience, which is before the user-specific models can even be built, is good.\u201d<\/p>\n<p>Speaker-independent speech recognition also addresses other scenarios where it\u2019s not possible to adapt a speech-recognition system to individual speakers\u2014call centers, for example, where callers are unknown and speak only for a few seconds, or web services for speech-to-speech translation, where users would have privacy concerns over stored speech samples.<\/p>\n<h2>Renewed Interest in Neural Networks<\/h2>\n<p>Artificial neural networks (ANNs), mathematical models of the low-level circuits in the human brain, have been a familiar concept since the 1950s. The notion of using ANNs to improve speech-recognition performance has been around since the 1980s, and a model known as the ANN-Hidden Markov Model (ANN-HMM) showed promise for large-vocabulary speech recognition. Why, then, are commercial speech-recognition solutions not using ANNs?<\/p>\n<p>\u201cIt all came down to performance,\u201d Yu explains. \u201cAfter the invention of discriminative training, which refines the model and improves accuracy, the conventional, context-dependent Gaussian mixture model HMMs (CD-GMM-HMMs) outperformed ANN models when it came to large-vocabulary speech recognition.\u201d<\/p>\n<p>Yu and members of the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/speech-dialog-research-group\/\" target=\"_blank\">Speech<\/a> group at Microsoft Research Redmond became interested in ANNs when recent progress in building more complex \u201cdeep\u201d neural networks (DNNs) began to show promise at achieving state-of-the-art performance for automatic speech-recognition tasks. 
In June 2010, intern George Dahl, from the University of Toronto, joined the team, and researchers began investigating how DNNs could be used to improve large-vocabulary speech recognition.<\/p>\n<p>\u201cGeorge brought a lot of insight on how DNNs work,\u201d Yu says, \u201cas well as strong experience in training DNNs, which is one of the key components in this system.\u201d<\/p>\n<p>A speech recognizer is essentially a model of fragments of speech sounds. Examples of such sounds are \u201cphonemes,\u201d the roughly 30 pronunciation symbols used in a dictionary. State-of-the-art speech recognizers use shorter fragments, numbering in the thousands, called \u201csenones.\u201d<\/p>\n<p>Earlier work on DNNs had used phonemes. The research took a leap forward when Yu, after discussions with principal researcher <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\" target=\"_blank\">Li Deng<\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/#\/detail\/1994828250\" target=\"_blank\">Alex Acero<\/a>, principal researcher and manager of the Speech group, proposed modeling the thousands of senones, much smaller acoustic-model building blocks, directly with DNNs. The resulting paper, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/context-dependent-pre-trained-deep-neural-networks-for-large-vocabulary-speech-recognition\/\" target=\"_blank\"><em>Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition<\/em><\/a> by Dahl, Yu, Deng, and Acero, describes the first hybrid context-dependent DNN-HMM (CD-DNN-HMM) model applied successfully to large-vocabulary speech-recognition problems.<\/p>\n<p>\u201cOthers have tried context-dependent ANN models,\u201d Yu observes, \u201cusing different architectural approaches that did not perform as well. 
It was an amazing moment when we suddenly saw a big jump in accuracy when working on voice-based Internet search. We realized that by modeling senones directly using DNNs, we had managed to outperform state-of-the-art conventional CD-GMM-HMM large-vocabulary speech-recognition systems by a relative error reduction of more than 16 percent. This is extremely significant when you consider that speech recognition has been an active research area for more than five decades.\u201d<\/p>\n<p>The team also accelerated the experiments by using general-purpose graphics-processing units to train and decode speech. The computation for neural networks is similar in structure to 3-D graphics as used in popular computer games, and modern graphics cards can process almost 500 such computations simultaneously. Harnessing this computational power for neural networks contributed to the feasibility of the architectural model.<\/p>\n<p>In October 2010, when Yu presented the paper to an internal Microsoft Research Asia audience, he spoke about the challenges of scalability and finding ways to parallelize training as the next steps toward developing a more powerful acoustic model for large-vocabulary speech recognition. Seide was excited by the research and joined the project, bringing with him experience in large-vocabulary speech recognition, system development, and benchmark setups.<\/p>\n<h2>Benchmarking on a Neural Network<\/h2>\n<p>\u201cIt has been commonly assumed that hundreds or thousands of senones were just too many to be accurately modeled or trained in a neural network,\u201d Seide says. \u201cYet Yu and his colleagues proved that doing so is not only feasible, but works very well with notably improved accuracy. 
Now, it was time to show that the exact same CD-DNN-HMM could be scaled up effectively in terms of training-data size.\u201d<\/p>\n<p>The new project applied CD-DNN-HMM models to speech-to-text transcription and was tested against Switchboard, a highly challenging phone-call transcription benchmark recognized by the speech-recognition research community.<\/p>\n<p>First, the team had to migrate the DNN training tool to support a larger training data set. Then, with help from Gang Li, research software-development engineer at Microsoft Research Asia, they applied the new model and tool to the Switchboard benchmark with more than 300 hours of speech-training data. To support that much data, the researchers built giant ANNs, one of which contains more than 66 million inter-neural connections, the largest ever created for speech recognition.<\/p>\n<p>The subsequent benchmarks achieved an astonishing word-error rate of 18.5 percent, a 33-percent relative improvement compared with results obtained by a state-of-the-art conventional system.<\/p>\n<div id=\"attachment_283046\" style=\"width: 320px\" class=\"wp-caption alignright\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-283046\" class=\"size-full wp-image-283046\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2011\/08\/Frank-Seide.jpg\" alt=\"Frank Seide\" width=\"310\" height=\"206\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2011\/08\/Frank-Seide.jpg 310w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2011\/08\/Frank-Seide-300x199.jpg 300w\" sizes=\"(max-width: 310px) 100vw, 310px\" \/><p id=\"caption-attachment-283046\" class=\"wp-caption-text\">Frank Seide<\/p><\/div>\n<p>\u201cWhen we began running the Switchboard benchmark,\u201d Seide recalls, \u201cwe were hoping to achieve results similar to those observed in the voice-search task, between 16- and 20-percent relative gains. 
The training process, which takes about 20 days of computation, emits a new, slightly more refined model every few hours. I impatiently tested the latest model every few hours. You can\u2019t imagine the excitement when it went way beyond the expected 20 percent, kept getting better and better, and finally settled at a gain of more than 30 percent. Historically, there have been very few individual technologies in speech recognition that have led to improvements of this magnitude.\u201d<\/p>\n<p>The resulting paper, titled <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/conversational-speech-transcription-using-context-dependent-deep-neural-networks\/\" target=\"_blank\"><em>Conversational Speech Transcription Using Context-Dependent Deep Neural Networks<\/em><\/a>\u00a0by Seide, Li, and Yu, is scheduled for presentation on Aug. 29. The work has already attracted considerable attention from the research community, and the team hopes that taking the paper to the conference will ignite a new line of research that will help advance the state of the art for DNNs in large-vocabulary speech recognition.<\/p>\n<h2>Bringing the Future Closer<\/h2>\n<p>With a novel way of using artificial neural networks for speaker-independent speech recognition, and with about a third fewer word errors than a state-of-the-art conventional system produces, Yu, Seide, and their teams have brought fluent speech-to-speech applications much closer to reality. This innovation simplifies speech processing and delivers high accuracy in real time for large-vocabulary speech-recognition tasks.<\/p>\n<p>\u201cThis work is still in the research stages, with more challenges ahead, most notably scalability when dealing with tens of thousands of hours of training data. Our results represent just a beginning to exciting future developments in this field,\u201d Seide says. \u201cOur goal is to open possibilities for new and fluent voice-based services that were impossible before. 
We believe this research will be used for services that change how we work and live. Imagine applications such as live speech-to-speech translation of natural, fluent conversations, audio indexing, or conversational, natural language interactions with computers.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Janie Chang,\u00a0Writer, Microsoft Research During Interspeech 2011, the 12th annual Conference of the International Speech Communication Association being held in Florence, Italy, from Aug. 28 to 31, researchers from Microsoft Research will present work that dramatically improves the potential of real-time, speaker-independent, automatic speech recognition. Dong Yu, researcher at Microsoft Research Redmond, and Frank [&hellip;]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[194480,194456,194462],"tags":[200327,210728,210716,200573,201077,210740,210731,209229,201357,201635,201673,201699,202089,202433,210734,210737,203979,210746,210743,210719],"research-area":[13551,13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-283028","post","type-post","status-publish","format-standard","hentry","category-graphics-and-multimedia","category-natural-language-processing","category-speech-and-dialog","tag-alex-acero","tag-ann-hidden-markov-model-ann-hmm","tag-artificial-neural-networks-anns","tag-automatic-speech-recognition","tag-conference-of-the-international-speech-communication-association","tag-context-dependent-dnn-hmm-cd-dnn-hmm","tag-context-dependent-gaussian-mixture-model-hmms-cd-gmm-hmms","tag-deep-neural-networks-dnns","tag-dong-yu","tag-frank-seide","tag-ga
ng-li","tag-george-dahl","tag-interspeech-2011","tag-li-deng","tag-phonemes","tag-senones","tag-speech-to-speech-translation","tag-switchboard","tag-voice-based-internet-search","tag-voice-to-text","msr-research-area-graphics-and-multimedia","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560,199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","byline":"","formattedDate":"August 29, 2011","formattedExcerpt":"By Janie Chang,\u00a0Writer, Microsoft Research During Interspeech 2011, the 12th annual Conference of the International Speech Communication Association being held in Florence, Italy, from Aug. 28 to 31, researchers from Microsoft Research will present work that dramatically improves the potential of real-time, speaker-independent, automatic 
speech&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/283028"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=283028"}],"version-history":[{"count":4,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/283028\/revisions"}],"predecessor-version":[{"id":283049,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/283028\/revisions\/283049"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=283028"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=283028"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=283028"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=283028"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=283028"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=283028"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=283028"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-optio
n?post=283028"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=283028"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=283028"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=283028"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}