{"id":266445,"date":"2013-05-14T09:00:59","date_gmt":"2013-05-14T16:00:59","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=266445"},"modified":"2016-08-01T11:30:47","modified_gmt":"2016-08-01T18:30:47","slug":"technology-can-bridge-language-gaps","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/technology-can-bridge-language-gaps\/","title":{"rendered":"How technology can bridge language gaps"},"content":{"rendered":"

Speech-to-speech translation promises to help connect our world

Among the futuristic gadgets in the classic TV show Star Trek, none seemed more useful than the universal translator, a handheld gizmo that helped foster understanding among intergalactic civilizations. Well, we needn't travel beyond the solar system to find the need for such a device. Imagine being able to speak in English and have your thoughts expressed in grammatically and semantically correct Mandarin or Spanish. Then imagine the voice that expresses those translated thoughts is your own.

\"Rick

Long before unveiling Microsoft’s answer to the universal translator, the company\u2019s chief research officer, Rick Rashid, was a Trekkie, delighted to pose in 1996 with one of Star Trek\u2019s stars, the late James Doohan, who played Scotty.<\/p><\/div>\n

Just another sci-fi fantasy? Maybe not. Thanks to work going on today at Microsoft Research, such speech-to-speech translation is moving ever closer to reality. One day in the not-too-distant future, you might ask about your dinner options in a Parisian restaurant, give detailed directions to a taxi driver in Moscow, or discuss a business deal with potential partners in Tokyo, fluently, in your own voice, without knowing a word of French, Russian, or Japanese. Your tablet or smartphone will do the heavy lifting of understanding what you're saying in English, translating it into your listeners' tongue, and speaking it in your voice with the pronunciation, tones, and inflections of a native speaker.

All this will be made possible by combining three key pieces of technology: speech recognition, language translation, and speech synthesis. Underlying it all are breakthroughs in machine learning, particularly the development of computer-based "deep neural networks."
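In code, that three-stage architecture amounts to a simple chain. The sketch below is purely illustrative: the function names (recognize, translate, synthesize) are hypothetical stand-ins, not Microsoft APIs, and each stage is left unimplemented; it only shows how the pieces fit together.

```python
# Illustrative sketch only: recognize, translate, and synthesize are
# hypothetical stand-ins, not actual Microsoft components.

def recognize(audio: bytes, source_lang: str) -> str:
    """Speech recognition: audio in the speaker's language -> text."""
    raise NotImplementedError  # e.g., a DNN-based acoustic model plus a decoder

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Machine translation: source-language text -> target-language text."""
    raise NotImplementedError  # e.g., a translation engine such as Bing Translator's

def synthesize(text: str, target_lang: str, voice_profile: object) -> bytes:
    """Speech synthesis: target-language text -> audio in the user's own voice."""
    raise NotImplementedError  # e.g., a voice model trained on a sample of the speaker

def speech_to_speech(audio: bytes, source_lang: str, target_lang: str,
                     voice_profile: object) -> bytes:
    """Chain the three stages: recognize, then translate, then synthesize."""
    text = recognize(audio, source_lang)
    translated = translate(text, source_lang, target_lang)
    return synthesize(translated, target_lang, voice_profile)
```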

Start with speech recognition

Everything starts with the ability of your computing device to understand what you're saying. Anyone over the age of 30 undoubtedly will recall the days of frustratingly imperfect speech recognition. Just to refresh your memory, though, here's an actual log of a conversation between a human being and a computerized helper named "Annie":

Computer: Annie here.

Human: Hello, Annie, could you give me information on a meeting I have tomorrow?

Computer: What was that? [The software failed to understand what had been said.]

Human: I need to find out, uh, a meeting time and place. Kim sent me a message on it.

Computer: For all options, say "Help me out." [The software failed to understand what had been said.]

Human: Help me out.

Computer: Annie here. [The software failed to understand what had been said.]

Human: Help.

Computer: Help. Say "Call" or say "Send a message," followed by a label or a number. Or say "Get my messages."

Human: Get messages.

Computer: Speech and Image Processing Lab. [The software heard "Get me sips."]

Human: Help.

Computer: Help. Say the first and last name …

So it went, back in the 1980s, when computerized systems stumbled while trying to understand the simplest of spoken requests. Such systems worked, sort of, in situations in which interactions between human and machine could be confined to a few repeated commands, which were often elicited by the machine, as in: Say "Balance due," "Payment date," or "Recent transactions." Conversely, the systems blew a linguistic fuse when a frustrated human replied with something like "I want to dispute a charge." The error rate for recognizing speech that went outside the defined requests was unacceptably high.

\"Ford

Improvements in speech recognition have brought voice-activated assistants to our smartphones, such as Ask Ziggy on Windows Phone<\/a>, and voice commands to our cars, such as Ford SYNC<\/a>.<\/p><\/div>\n

By the mid-1990s, things had improved markedly. For example, Frank Seide, a multilingual computer scientist who is now a senior researcher at Microsoft Research Asia, participated in a project for an automated telephone system that provided train-schedule information to German callers. This system, and others like it, achieved acceptable results, albeit within a limited universe of spoken requests.

Today, things are better. You can speak a message into a smartphone and know that the software will convert it to text with an acceptable degree of accuracy. And we've all seen smartphone apps such as Apple's Siri and Windows Phone's Ask Ziggy, which act on our spoken requests with often astonishing accuracy. Granted, speech recognition is still not perfect, but then again, we frequently misunderstand what other people are saying to us, and our brains are a lot more sophisticated than the software in our smartphones.

Getting to today's level of speech-recognition accuracy involved a major breakthrough in machine learning. Before 2006, developers trained speech-recognition systems by using techniques based on complex statistical constructs known as Gaussian mixture models (GMMs). Theoretically, this approach should have led to acceptable automatic speech recognition. In practice, though, the results were frustrating.

All this began to change in 2006, with work conducted by Professor Geoffrey Hinton at the University of Toronto. He and his colleagues took a different approach to machine learning, using deep neural networks (DNNs), in which the computerized "brain" consists of many interconnected, hidden layers.
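As a rough picture of what such a network looks like, here is a small feed-forward model in Python (using PyTorch) that maps a window of acoustic feature frames to probabilities over the recognizer's output states. The layer sizes, feature dimensions, and number of output states are made-up values for illustration, not the configuration of any Microsoft system.

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen only for illustration.
FEATURE_DIM = 40 * 11   # e.g., 40 spectral features over an 11-frame context window
NUM_STATES = 3000       # number of output states the recognizer distinguishes

# A "deep" network: several interconnected hidden layers between input and output.
acoustic_model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, NUM_STATES),          # scores for each output state
)

# A batch of (random, stand-in) acoustic frames -> per-frame state probabilities.
frames = torch.randn(8, FEATURE_DIM)
probabilities = torch.softmax(acoustic_model(frames), dim=-1)
print(probabilities.shape)  # torch.Size([8, 3000])
```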

Li Deng, a principal researcher at Microsoft Research Redmond, had come to know Hinton while teaching at Canada's University of Waterloo. Deng and Hinton continued their association after Deng joined Microsoft. In late 2009, Deng invited Hinton to Microsoft Research to work with him on the use of DNNs for speech recognition. This collaboration led to the discovery that while DNN models of speech recognition did not significantly lower error rates when compared with GMMs, they did produce a distinctly different pattern of errors, one that interfered less with the reliability of the output. This discovery motivated Deng and Dong Yu, a senior researcher at Microsoft Research Redmond, to continue researching DNNs for speech recognition, and, during the summer of 2010, Yu, Deng, and George Dahl, a graduate student of Hinton's, worked to extend the DNN models to large vocabularies in order to tackle real-world voice-search scenarios. In the fall of that year, Seide and his colleagues from Microsoft Research Asia joined Yu in developing efficient, large-scale prototypes of DNN-based speech recognizers.

The first successes in large-scale, DNN-based recognizers were reported in 2010, when Yu, Deng, and Dahl published their research on context-dependent DNNs, involving networks with hundreds of output units, and in 2011, when Seide, Microsoft Research Asia colleague Gang Li, and Yu reported on their work with a huge number of outputs and improved training models. The impact of these advances in speech recognition was dramatic, reducing the word-error rate by a third compared with the previous GMM-based state of the art. By 2013, DNN-based models had come close to halving the error rate compared with GMMs.
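Word-error rate, the yardstick quoted here, counts the word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal Python implementation of the metric (independent of any particular recognizer) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# "get my messages" heard as "get me sips": 2 errors over 3 reference words.
print(word_error_rate("get my messages", "get me sips"))  # 0.666...
```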

This new level of performance from DNN-based speech recognizers, coupled with advances in language-translation systems such as the one used in Bing Translator, motivated the Microsoft Research scientists to drive further progress in building a speech-to-speech translation system. Deng notes that DNNs crudely mimic the way our brains are constructed, with massive connections among various layers, as described in a 2012 paper in the IEEE Signal Processing Magazine:

The idea is to learn one layer of feature detectors at a time with the states of the feature detectors in one layer acting as the data for training the next layer. After this generative "pretraining," the multiple layers of feature detectors can be used as a much better starting point for a discriminative "fine-tuning" phase during which backpropagation through the DNN slightly adjusts the weights found in pretraining.
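In code, that two-phase recipe looks roughly like the sketch below (Python with PyTorch). Each hidden layer is pretrained in turn, with a simple autoencoder reconstruction objective standing in here for the generative pretraining the paper describes, and the layer's outputs become the training data for the next layer; then the stacked layers plus an output layer are fine-tuned end to end with backpropagation. The data, layer sizes, and training settings are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in data: 512 "acoustic feature" vectors and 50 possible output classes.
features = torch.randn(512, 200)
labels = torch.randint(0, 50, (512,))

layer_sizes = [200, 256, 256, 256]
hidden_layers = []

# Phase 1: learn one layer of feature detectors at a time.
# (A plain autoencoder objective stands in for the generative pretraining.)
inputs = features
for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
    encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())
    decoder = nn.Linear(out_dim, in_dim)
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(200):
        optimizer.zero_grad()
        loss = F.mse_loss(decoder(encoder(inputs)), inputs)
        loss.backward()
        optimizer.step()
    hidden_layers.append(encoder)
    # The states of this layer's feature detectors become the data for the next layer.
    inputs = encoder(inputs).detach()

# Phase 2: discriminative fine-tuning. Stack the pretrained layers, add an
# output layer, and let backpropagation slightly adjust all the weights.
model = nn.Sequential(*hidden_layers, nn.Linear(layer_sizes[-1], 50))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(200):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    optimizer.step()
```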

While DNNs are good at processing certain data that human beings are also good at handling, particularly speech and vision, that doesn't mean that DNNs learn in the same way as people do.

Seide notes that "DNNs are 'trained' once and then kept constant, while humans keep learning throughout their life." Both he and Deng are quick to add that the ability to create DNNs was the result of advances in the speed, memory, and processing power of computers.

While the move to DNNs produced a significant decline in the word-error rate of speech recognition, the approach is not perfect. Mistakes still happen. But not only has the number of errors dropped, the nature of the errors has changed, with far fewer mistakes of the sort that render the recognized speech essentially nonsensical. The results are startlingly accurate when compared with the incoherence of previous approaches.

Hinton's original work was largely theoretical, focusing on methodology. It was his collaboration with Microsoft Research that infused the work with a practical outlook, bringing what Deng calls "industrial scale" to the research. This shift to more practical applications has gone industrywide, with Microsoft competitors Apple, Google, and IBM all joining in the research and racing to "productize" improved speech recognition.

Bring in machine translation

Just as older attempts at automatic speech recognition left a lot to be desired, so, too, did past endeavors in automatic translation. As compelling as the promise of early translation software was, the results were almost always underwhelming.

Here are the opening lines of Gustave Flaubert's classic French novel Madame Bovary, first in the original French, then as rendered by a professional translator, and finally as mangled by machine translation:

Original French

Nous étions à l'Étude, quand le Proviseur entra, suivi d'un nouveau habillé en bourgeois et d'un garçon de classe qui portait un grand pupitre. Ceux qui dormaient se réveillèrent, et chacun se leva comme surpris dans son travail.

English translation by human

We were in class when the headmaster came in, followed by a "new fellow," not wearing the school uniform, and a school servant carrying a large desk. Those who had been asleep woke up, and every one rose as if just surprised at his work.

English translation by machine

We were in the study, when the headmaster came in, followed by a new dressed in bourgeois and a boy in class who wore a large desk. Those who slept awoke, and each stood as surprised in his work.

The machine translation shown here comes from current software, and it clearly shows that there's still ample room for improvement in machine-translation algorithms. Today, researchers are applying DNNs to the problem of translation, hoping to see the same kinds of improvements achieved in speech recognition. The outcome is uncertain, but with luck, it will soon spare us from hearing about a French student wearing a large desk.

Related: Bing Translator for Klingon, a Trekkie's dream come true.

In fact, machine translation has advanced to the stage where it's a useful mobile app, up to a point. It's one thing to use your smartphone to ask for directions to the bathroom when you've had too much cerveza in Cabo. It's quite another to expect it to help you negotiate a rental-car agreement in Spanish, let alone enable you to converse on anything at length or in depth. That is exactly what Microsoft's speech-to-speech project, which employs a state-of-the-art translation engine built by Dongdong Zhang of Microsoft Research Asia, aspires to.

Add your own voice

So let's see where we're at with that universal translator. We've got the speech-recognition part in hand, thanks to the advances from using deep neural networks. And we're hard at work on achieving acceptable machine translation. Can we make it speak in your voice?

Yes, we can, as Noelle Sophy and Henrique Malvar demonstrated in a small conference room at Microsoft Research's Redmond headquarters. Sophy, a senior program manager, and Malvar, a Microsoft distinguished engineer and Microsoft Research's chief scientist, described the progress that's been made toward realizing effective speech-to-speech translation, particularly the work of Frank Soong, a principal researcher at Microsoft Research Asia. It's Soong's team, Malvar stresses, that has given speech-to-speech its "voice."

Of course, seeing, or in this case hearing, is worth a thousand words, so Sophy opens her laptop and plugs in a large microphone, the kind normally seen in recording studios. After waiting a few seconds for the software to load, she leans forward and speaks into the microphone, clearly and deliberately saying, "This is a demonstration of the Microsoft speech-to-speech system." Her words appear on the laptop screen as soon as she utters them. Within seconds, Chinese characters also appear on the screen, and a voice speaks the translation of her words. The voice sounds uncannily natural, complete with the tonal inflections characteristic of Mandarin.

One problem, though: it doesn't sound at all like Sophy's voice. That's because it isn't. It's actually the voice of Rick Rashid, Microsoft's chief research officer and head of Microsoft Research. The reason the software is using Rashid's baritone instead of Sophy's soprano is simple: the software has been trained on a sample of Rashid's voice for use in a demo that's become a YouTube sensation. Had the machine been trained against a sample of English speech from Sophy, the spoken Mandarin would have been in her voice.

[Video] Speech Recognition Breakthrough for the Spoken, Translated Word

Using a sample of Rashid speaking in English, the software, employing an approach developed by Soong, has broken his speech down into its elemental acoustics, which can then be recombined to make the sounds of spoken Mandarin. Next, the software does another neat bit of digital magic courtesy of Soong and his team, assembling Rashid's sounds into Mandarin that rises and falls naturally. This natural-sounding cadence is the result of the words and sentences having been indexed against data acquired from a native Mandarin speaker. By mapping Rashid's sounds to the natural patterns of spoken Mandarin, Soong's speech-to-speech software has created an eerily accurate-sounding simulation of Rashid speaking Chinese. How accurate and natural-sounding? The results have been tested with native Mandarin speakers, who confirm that both the translation and the voice itself are amazingly natural. Hundreds of comments left on the YouTube video of Rashid's demo further attest to the quality of the translated speech.

The men behind the curtain

It was no accident that Rashid's demo used translation from English to Mandarin. The speech-to-speech project began with work in Beijing, and the computer scientists at Microsoft Research Asia, particularly the two Franks, Seide and Soong, remain the driving force behind the project. Seide, a native German speaker who's also fluent in English, and Soong, who's bilingual in Mandarin and English, exude enthusiasm for their "baby."

Seide talks excitedly about the potential of speech-to-speech translation, describing it as the fulfillment of "the dream of people talking to each other without language barriers." He conjures the image of a multinational gathering, with everyone speaking in his or her own native tongue but being understood by everyone else. Soong adds an important caveat, noting that it's vital for people to recognize that speech-to-speech currently is a research prototype, not a fully realized product.

How did this research effort get its start? It was fueled by the challenges of working in a cross-cultural, multilingual environment. As Seide describes it, the project began with a system used to transcribe and then translate phone meetings between Microsoft researchers in Beijing and Redmond. Noting that the Chinese participants often were lost when they tried to follow the internal conversations taking place among the Redmond engineers, Seide and his colleague Kit Thambiratnam could see the value of a real-time spoken-translation application. They developed The Translating! Telephone, a precursor to the speech-to-speech prototype.

The Translating! Telephone, an early version of what would become the speech-to-speech translation demo.

Both Franks are quick to note that the Rashid demo simulates him speaking Mandarin. As Soong explains, "Rick doesn't know Chinese, so no one can really say how he would sound if he spoke Mandarin." Soong notes that the training sample used 2,000 sentences of Rashid speaking in English. Because there are a number of sounds in Mandarin that don't occur in English, Rashid's speech had to be "chopped up" into what Soong calls tiles: acoustical snippets even smaller than phonemes, the basic phonological units that are combined into a language's words. These tiles were then mapped against a reference speaker to give the illusion that Rashid was speaking Mandarin.
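The article doesn't spell out the algorithm, but the general idea of matching one speaker's acoustic snippets to a reference speaker's utterance can be shown with a toy nearest-neighbor search over feature vectors. Everything below, from the feature dimensions to treating a "tile" as a single fixed-length vector and using Euclidean distance, is a simplifying assumption for illustration, not Soong's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: each row is one short acoustic snippet ("tile"),
# represented here as a made-up 13-dimensional feature vector.
source_tiles = rng.normal(size=(2000, 13))        # tiles cut from the source speaker's English
reference_utterance = rng.normal(size=(120, 13))  # frames of a native-speaker Mandarin utterance

def nearest_tile(frame: np.ndarray, tiles: np.ndarray) -> int:
    """Index of the source tile closest to this reference frame."""
    distances = np.linalg.norm(tiles - frame, axis=1)
    return int(np.argmin(distances))

# For every frame of the reference Mandarin, pick the source speaker's
# closest-sounding tile; concatenating those tiles (and smoothing the joins)
# would give the impression of the source speaker producing the Mandarin.
chosen = [nearest_tile(frame, source_tiles) for frame in reference_utterance]
print(chosen[:10])
```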

Needless to say, building a working prototype required a lot more than 2,000 sentences spoken by one person. This is, after all, a machine-learning process, and machine learning needs huge amounts of data to be effective. To train the system on English, the Beijing group licensed recordings of 2,000 hours of telephone calls, all made by paid callers and all painstakingly transcribed and paired with Mandarin translations. But even with this large body of data, the researchers confronted problems inherent in conversational speech, which differs markedly from written language. Conversation is full of starts and stops: uhs and ums and sentence fragments.

Related publications