Milestones on the Path to Skype Translator

December 15: Microsoft releases a preview version of Skype Translator for English and Spanish audiences. Read more on the Skype blog.
November: Following the launch of the Skype Translator Preview program, elementary school students in Tacoma, Washington, and Mexico City participate in the first Skype Mystery Call that uses a test version of Skype Translator.
July: Watch the Skype Translator demo from the Worldwide Partner Conference 2014, which features near real-time English-German translation.
May: Microsoft announces and publicly demonstrates the Skype Translator, jointly developed by Microsoft researchers and Skype engineers.
More: Enabling Cross-Lingual Conversations in Real Time; Microsoft Demos Breakthrough in Real-Time Translated Conversations; Skype Translator in action

Microsoft’s speech product group quickly productizes the company’s research breakthroughs in speech to deliver best-in-class speech recognition for Cortana and other speech-powered experiences within Microsoft products. With recognition accuracies closing in on human capabilities, the close partnership between Skype, Microsoft Research, and Microsoft’s Information Platform Group is critical in delivering this technology to Skype users worldwide.
More: Anticipating More from Cortana

Skype celebrates its 10th anniversary and reaches more than 1.4 trillion minutes of voice and video calls.
More: Skype Celebrates a Decade of Meaningful Conversations
Microsoft’s deep neural network (DNN) research improves Bing Voice Search for Windows Phone. Additionally, Microsoft’s investments in machine translation research, combined with Bing’s information platform and web-scale architecture, power translations across a host of experiences, including features within Bing, Office, SharePoint, and Yammer.
More: DNN Research Improves Bing Voice Search

Microsoft Translator Hub is released and implements a self-service model for building a highly customized automatic translation service between any two languages. This Azure-based service empowers language communities, service providers, and corporations to create automatic translation systems, allowing speakers of one language to share and access knowledge with speakers of any other language. By enabling translation to languages that aren’t supported by many mainstream translation engines, this also keeps less widely spoken languages vibrant and in use for future generations.
More: Microsoft Translator Hub: Translation by Everyone for Everyone
Eight sentences is all it takes for Rick Rashid, the founder of Microsoft Research, to electrify a crowd of 2,000 students and faculty in Tianjin, China. Decades of DNN and speech research culminate in a stunning live translation: Rashid speaks in English while the Chinese audience hears his voice in Mandarin. The speech recognition system in the demo rehearsal exhibits an error rate of less than 7%, about what a person might achieve when taking word-for-word notes.
More: Speech Recognition Breakthrough for the Spoken, Translated Word

A seminal paper on speech transcription is authored by Microsoft researchers and presented at Interspeech 2011. Microsoft researchers show methods that improve performance by over 30% compared to previous methods. Rather than having one word in 4 or 5 incorrect, the error rate becomes one word in 7 or 8. While still far from perfect, this is the most dramatic change in accuracy in the last decade.
More: Frank Seide, Gang Li, and Dong Yu, “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks”

Microsoft researchers in Asia become intrigued with the notion of translating the spoken word in the speaker’s own voice.

The Translating! Telephone demo is shown publicly for the first time at TechFest 2010, allowing real-time translation of German to English using the voice of each speaker.

Microsoft researchers pioneer industrial-scale deep learning by first conducting large-scale industry technology development on voice search tasks, combining the strength of DNNs with the industry need for producing speech recognizers that are not only highly accurate but also highly efficient. The seminal journal paper published on the work was subsequently awarded the 2013 Best Paper Award by IEEE.
More: George E. Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 1, January 2012

Before 2009, nearly all speech recognition systems are based on Gaussian mixture models (GMMs), with disappointing recognition results. Beginning in the latter part of 2009, things begin to change. The DNN model and a deep model that Microsoft researcher Li Deng and other colleagues developed earlier have interesting and distinct recognition error patterns. This discovery and the subsequent collaboration motivate them to invest heavily in further research into DNNs.

The Microsoft Machine Translation Service is released, enabling large-scale translation of web content.

More: Introducing: Windows Live Translator Beta

Geoff Hinton begins using DNNs for machine learning at the University of Toronto and publishes two seminal papers: “A Fast Learning Algorithm for Deep Belief Nets,” Hinton et al., Neural Computation, July 2006, and “Reducing the Dimensionality of Data with Neural Networks,” Hinton and R.R. Salakhutdinov, Science, July 2006.

Microsoft researchers Chris Quirk and Arul Menezes and University of Alberta researcher Colin Cherry develop the syntactic statistical machine translation approach that informs the future Microsoft machine translation system.

More: Dependency Treelet Translation: Syntactically Informed Phrasal SMT

Skype is released, enabling voice and video communication worldwide using the Internet.

More: Skype at 10: How an Estonian startup transformed itself (and the world)

Zens, Och & Ney’s paper “Phrase-Based Statistical Machine Translation” simplifies and improves translation over earlier approaches.

Attacks on the World Trade Center initiate large-scale DARPA funding for speech recognition, machine translation, and language processing. The Global Autonomous Language Exploitation (GALE) program combines speech recognition, machine translation, and information extraction. The DARPA TRANSTAC program demonstrates speech-to-speech translation of short phrases on a handheld device.

Tokuda et al. derive a speech parameter generation algorithm for HMM-based speech synthesis in “Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis.” This method is later perfected by Frank Soong at Microsoft Research Asia.

Dragon Systems and IBM release the first commercial software for large vocabulary continuous speech recognition, running on a PC with Microsoft Windows. Speech recognition becomes available to a mass audience.

Hunt and Black propose concatenative speech synthesis to create realistic-sounding audio in “Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database.”

Early work on the core approaches for deep learning occurs when government-funded efforts experiment with DNNs. In particular, the Defense Advanced Research Projects Agency (DARPA) funds numerous large-scale research efforts in speech recognition. SRI International achieves success with DNNs in speaker recognition.
More: Larry Heck, Yochai Konig, M. Kemal Sonmez, and Mitch Weintraub, “Robustness to Telephone Handset Distortion in Speaker Recognition by Discriminative Feature Design,” Speech Communication, Elsevier, 2000; Yochai Konig, Larry Heck, Mitch Weintraub, and M. Kemal Sonmez, “Nonlinear Discriminant Feature Extraction for Robust Text-Independent Speaker Recognition,” RLA2C, 1998

Brown et al. publish a seminal paper, “A Statistical Approach to Machine Translation,” which suggests building machine translation systems using statistical methods based on the analysis of large amounts of data, rather than earlier approaches based on syntactic analysis and manipulation. The modern era of machine translation begins.

Neural network research becomes popular. A back-propagation algorithm is proposed and becomes widely accepted.

Lalit Bahl, Frederick Jelinek, and Jim Baker propose a noisy channel model for speech recognition, later known as Hidden Markov Models, that becomes the basis for current speech recognition systems. Work on automatic speech recognition begins at IBM and Carnegie Mellon University.

The US Department of Defense, National Science Foundation, and Central Intelligence Agency form the Automatic Language Processing Advisory Committee (ALPAC) to study machine translation efforts. Funding for machine translation systems is curtailed after the ALPAC report finds that there are enough human translators for current needs and questions whether high-quality automated systems can be built. The report notes that “early machine translations of simple or selected text… were as deceptively encouraging as ‘machine translations’ of general scientific text have been uniformly discouraging.” Efforts in machine translation become relatively dormant.

IBM and Georgetown University demonstrate a computerized Russian/English translation system based on six grammar rules and a 250-word vocabulary. It translates sentences such as “Mi pyeryedayem mislyi posryedstvom ryechyi.” into “We transmit thoughts by means of speech.” Government funding for machine translation begins.

Machine translation pioneer Warren Weaver publishes his memorandum, “Translation,” describing computerized approaches for performing translation.

Success in breaking wartime cryptographic codes leads to the belief that similar methods might be successful in translating from one human language to another.

2015

2014

2013

2012

2011

2010

2009

2007

2006

2005

2003

2002

2001

2000

1997

1996

1990s

1990

1980s

1975

1966

1954

1949

1941-1945