{"id":8705,"date":"2019-06-17T09:56:06","date_gmt":"2019-06-17T16:56:06","guid":{"rendered":"https://www.microsoft.com\/en-us\/translator/blog\/?p=8705"},"modified":"2019-06-21T10:43:11","modified_gmt":"2019-06-21T17:43:11","slug":"neural-machine-translation-enabling-human-parity-innovations-in-the-cloud","status":"publish","type":"post","link":"https://www.microsoft.com\/en-us\/translator/blog\/2019\/06\/17\/neural-machine-translation-enabling-human-parity-innovations-in-the-cloud\/","title":{"rendered":"Neural Machine Translation Enabling Human Parity Innovations In the Cloud"},"content":{"rendered":"

In March 2018 we announced (Hassan et al. 2018) a breakthrough result where we showed for the first time a Machine Translation system that could perform as well as human translators (in a specific scenario: Chinese-English news translation). This was an exciting breakthrough in Machine Translation research, but the system we built for this project was a complex, heavyweight research system, incorporating multiple cutting-edge techniques. While we released the output of this system on several test sets, the system itself was not suitable for deployment in a real-time machine translation cloud API.

Today we are excited to announce the availability in production of our latest generation of neural Machine Translation models. These models incorporate most of the goodness of our research system and are now available by default when you use the Microsoft Translator API. These new models are available today in Chinese, German, French, Hindi, Italian, Spanish, Japanese, Korean, and Russian, from and to English. More languages are coming soon.

\"\"<\/a><\/p>\n

Getting from Research Paper to Cloud API

Over the past year, we have been looking for ways to bring much of the quality of our human-parity system into the Microsoft Translator API, while continuing to offer low-cost real-time translation. Here are some of the steps on that journey.

Teacher-Student Training

Our first step was to switch to a "teacher-student" framework, where we train a lightweight real-time student to mimic a heavyweight teacher network (Ba and Caruana 2014). This is accomplished by training the student not on the parallel data that MT systems are usually trained on, but on translations produced by the teacher (Kim and Rush 2016). This is a simpler task than learning from raw data, and allows a shallower, simpler student to very closely follow the complex teacher. As one might expect, our initial attempts still suffered quality drops from teacher to student (no free lunch!), but we nevertheless took first place in the WNMT 2018 Shared Task on Efficient Decoding (Junczys-Dowmunt et al. 2018a). Some particularly exciting results from this effort were that Transformer (Vaswani et al. 2017) models and their modifications play well with teacher-student training and are astoundingly efficient during inference on the CPU.
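To make the idea concrete, here is a minimal sketch of sequence-level knowledge distillation in the spirit of Kim and Rush (2016): the teacher translates the source side of the corpus, and the student is trained on those teacher outputs instead of the original human targets. The model interfaces (`translate`, `train_step`) and all names are hypothetical placeholders, not the actual Microsoft Translator training code.

```python
# A minimal sketch of teacher-student (sequence-level knowledge distillation) training.
# `teacher_model`, `student_model`, and `parallel_corpus` are hypothetical placeholders.

def build_distillation_data(teacher_model, parallel_corpus):
    """Replace the human target side with the teacher's own translations."""
    distilled = []
    for source, _human_target in parallel_corpus:
        # The teacher decodes with beam search to produce its best hypothesis.
        teacher_translation = teacher_model.translate(source, beam_size=6)
        distilled.append((source, teacher_translation))
    return distilled

def train_student(student_model, teacher_model, parallel_corpus, epochs=5):
    """Train the lightweight student to mimic the teacher's outputs."""
    distilled_data = build_distillation_data(teacher_model, parallel_corpus)
    for _ in range(epochs):
        for source, target in distilled_data:
            # Standard cross-entropy training, but the labels come from the
            # teacher rather than from the original human translations.
            student_model.train_step(source, target)
    return student_model
```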

Learning from these initial results, and after a lot of iteration, we discovered a recipe that allows our simple student to have almost the same quality as the complex teacher (sometimes there is a free lunch after all?). Now we were free to build large, complex teacher models to maximize quality, without worrying (too much) about real-time constraints.

Real-time translation

Our decision to switch to a teacher-student framework was motivated by the great work by Kim and Rush (2016) for simple RNN-based models. At that point it was unclear if the reported benefits would manifest for Transformer models as well (see Vaswani et al. 2017 for details on this model). However, we quickly discovered that this was indeed the case.

The Transformer student could use a greatly simplified decoding algorithm (greedy search), where we just pick the single best translated word at each step, rather than the usual method (beam search), which involves searching through the huge space of possible translations. This change had minimal quality impact but led to big improvements in translation speed. By contrast, a teacher model would suffer a significant drop in quality when switching from beam search to greedy search.
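The sketch below contrasts the two decoding strategies. The `model.next_token_scores` interface (returning a log-probability for each vocabulary item given the source and the partial translation) is an assumption for illustration, not a real API.

```python
# A minimal sketch contrasting greedy search with beam search at decode time.
# `model.next_token_scores` is a hypothetical interface: dict of token -> log-prob.

def greedy_decode(model, source, max_len=100, eos="</s>"):
    """Pick the single best word at each step (what the student uses)."""
    output = []
    for _ in range(max_len):
        scores = model.next_token_scores(source, output)
        best_token = max(scores, key=scores.get)
        if best_token == eos:
            break
        output.append(best_token)
    return output

def beam_decode(model, source, beam_size=6, max_len=100, eos="</s>"):
    """Keep the `beam_size` best partial translations at each step."""
    beams = [([], 0.0)]  # (partial translation, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypothesis
                continue
            scores = model.next_token_scores(source, tokens)
            for token, logp in scores.items():
                candidates.append((tokens + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    best = beams[0][0]
    return [t for t in best if t != eos]
```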

At the same time, we realized that rather than using the latest neural architecture (Transformer with self-attention) in the decoder, the student could be modified to use a drastically simplified and faster recurrent (RNN) architecture. This matters because while the Transformer encoder can be computed over the whole source sentence in parallel, the target sentence is generated a single word at a time, so the speed of the decoder has a big impact on the overall speed of translation. Compared to self-attention, the recurrent decoder reduces algorithmic complexity from quadratic to linear in target sentence length. Especially in the teacher-student setting, we saw no loss in quality due to these modifications, neither for automatic nor for human evaluation results. Several additional improvements such as parameter sharing led to further reductions in complexity and increased speed.
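To see where the quadratic-versus-linear difference comes from, here is a toy sketch of one decoder step in each style; all shapes and classes are illustrative assumptions, not the production architecture.

```python
# A toy comparison of per-word decoder cost: self-attention vs. a recurrent step.
import numpy as np

def self_attention_step(prev_states, query):
    """Self-attention must attend over ALL previously generated positions,
    so the cost of step t grows with t (quadratic over the whole sentence)."""
    keys = np.stack(prev_states)            # (t, d)
    weights = np.exp(keys @ query)          # one score per previous position
    weights /= weights.sum()
    return weights @ keys                   # weighted sum over t positions

def rnn_step(prev_hidden, word_embedding, W_h, W_x):
    """A recurrent step only touches a fixed-size hidden state, so the cost
    of step t is constant (linear over the whole sentence)."""
    return np.tanh(W_h @ prev_hidden + W_x @ word_embedding)
```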

Another advantage of the teacher-student framework we were very excited to see is that quality improvements of the ever-growing and changing teachers are easily carried over to a non-changing student architecture. In cases where we saw problems in this regard, slight increases in student model capacity would close the gap again.

Dual Learning

The key insight behind dual learning (He et al. 2016) is the "round-trip translation" check that people sometimes use to check translation quality. Suppose we're using an online translator to go from English to Italian. If we don't read Italian, how do we know if it's done a good job? Before clicking send on an email, we might choose to check the quality by translating the Italian back to English (maybe on a different web site). If the English we get back has strayed too far from the original, chances are one of the translations went off the rails.

Dual learning uses the same approach to train two systems (e.g. English->Italian and Italian->English) in parallel, using the round-trip translation from one system to score, validate and train the other system.
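A minimal sketch of that round-trip signal is shown below. The `forward_model` (EN->IT) and `backward_model` (IT->EN) objects, their `translate`/`score`/`reinforce` methods, and the update scheme are hypothetical stand-ins, not the actual dual-learning implementation.

```python
# A minimal sketch of the round-trip idea behind dual learning: the reverse system
# scores how well a forward translation preserves the original sentence.

def round_trip_reward(forward_model, backward_model, english_sentence):
    """Higher is better: the backward model should find the original English
    sentence likely given the forward model's Italian translation."""
    italian = forward_model.translate(english_sentence)
    # Log-probability of reconstructing the original from the translation.
    return backward_model.score(source=italian, target=english_sentence)

def dual_training_step(forward_model, backward_model, english_batch, italian_batch):
    """Each system is nudged by how well the other can round-trip its output."""
    for sentence in english_batch:
        reward = round_trip_reward(forward_model, backward_model, sentence)
        forward_model.reinforce(sentence, reward)   # hypothetical reward-based update
    for sentence in italian_batch:
        reward = round_trip_reward(backward_model, forward_model, sentence)
        backward_model.reinforce(sentence, reward)
```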

Dual learning was a major contributor to our human-parity research result. In going from the research system to our production recipe, we generalized this approach broadly. Not only did we co-train pairs of systems on each other's output, we also used the same criterion for filtering our parallel data.

Cleaning up inaccurate data

Machine translation systems are trained on "parallel data", i.e. pairs of documents that are translations of each other, ideally created by a human translator. As it turns out, this parallel data is often full of inaccurate translations. Sometimes the documents are not truly parallel but only loose paraphrases of each other. Human translators can choose to leave out some source material or insert additional information. The data can contain typos, spelling mistakes, grammatical errors. Sometimes our data mining algorithms are fooled by similar but non-parallel data, or even by sentences in the wrong language. Worst of all, a lot of the web pages we see are spam, or may in fact be machine translations rather than human translations. Neural systems are very sensitive to this kind of inaccuracy in the data. We found that building neural models to automatically identify and get rid of these inaccuracies gave strong improvements in the quality of our systems. Our approach to data filtering resulted in the first place in the WMT18 parallel corpus filtering benchmark (Junczys-Dowmunt 2018a) and helped build one of the strongest English-German translation systems in the WMT18 News translation task (Junczys-Dowmunt 2018b). We used improved versions of this approach in the production systems we released today.
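As an illustration of dual-criterion filtering, here is a minimal sketch that scores each noisy sentence pair with translation models in both directions and keeps only the best-scoring pairs. The model interfaces, the exact scoring formula, and the keep-fraction threshold are illustrative assumptions, not the production filtering recipe.

```python
# A minimal sketch of filtering noisy parallel data with scores from two
# translation models (one per direction).

def filter_parallel_data(pairs, fwd_model, bwd_model, keep_fraction=0.7):
    """Keep pairs that both directions agree are good translations of each other."""
    scored = []
    for src, tgt in pairs:
        fwd = fwd_model.score(source=src, target=tgt)   # length-normalized log P(tgt | src)
        bwd = bwd_model.score(source=tgt, target=src)   # length-normalized log P(src | tgt)
        # Reward high adequacy in both directions and penalize disagreement.
        score = fwd + bwd - abs(fwd - bwd)
        scored.append((score, src, tgt))
    scored.sort(reverse=True)
    cutoff = int(len(scored) * keep_fraction)
    return [(src, tgt) for _, src, tgt in scored[:cutoff]]
```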

Factored word representations

When moving a research technology to production, several real-world challenges arise. Getting numbers, dates, times, capitalization, spacing, etc. right matters a lot more in production than in a research system.

Consider the challenge of capitalization. Suppose we're translating the sentence "WATCH CAT VIDEOS HERE". We know how to translate "cat", and we would want to translate "CAT" the same way. But now consider "Watch US soccer here". We don't want to confuse the word "us" with the acronym "US" in this context.

To handle this, we used an approach known as factored machine translation (Koehn and Hoang 2007, Sennrich and Haddow 2016), which works as follows. Instead of a single numeric representation ("embedding") for "cat" or "CAT", we use multiple embeddings, known as "factors". In this case, the primary embedding would be the same for "CAT" and "cat", but a separate factor would represent the capitalization, showing that it was all-caps in one instance but lowercase in the other. Similar factors are used on the source and the target side.
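The toy sketch below shows the idea: each surface form is split into a lowercased lemma factor and a capitalization factor, and their embeddings are combined. The tiny vocabularies, embedding dimension, and combination by summation are assumptions for illustration only.

```python
# A toy sketch of factored word representations: shared lemma embedding plus a
# separate capitalization factor.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8

lemma_vocab = {"cat": 0, "watch": 1, "videos": 2, "here": 3}
caps_vocab = {"lower": 0, "all_caps": 1, "title_case": 2}

lemma_embeddings = rng.normal(size=(len(lemma_vocab), EMB_DIM))
caps_embeddings = rng.normal(size=(len(caps_vocab), EMB_DIM))

def caps_factor(token):
    if token.isupper():
        return "all_caps"
    if token.istitle():
        return "title_case"
    return "lower"

def embed(token):
    """'cat' and 'CAT' share the same primary embedding; only the small
    capitalization factor differs."""
    lemma_vec = lemma_embeddings[lemma_vocab[token.lower()]]
    caps_vec = caps_embeddings[caps_vocab[caps_factor(token)]]
    return lemma_vec + caps_vec

# embed("cat") and embed("CAT") differ only by the capitalization factor.
```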

We use similar factors to handle word fragments and spacing between words (a complex issue in non-spacing or semi-spacing languages such as Chinese, Korean, Japanese or Thai).

Factors also dramatically improved translation of numbers, which is critical in many scenarios. Number translation is mostly an algorithmic transformation. For example, 1,234,000 can be written as 12,34,000 in Hindi, 1.234.000 in German, and 123.4万 in Chinese. Traditionally, numbers are represented like words, as groups of characters of varying length. This makes it hard for machine learning to discover the algorithm. Instead, we feed every single digit of a number separately, with factors marking the beginning and end. This simple trick robustly and reliably removed nearly all number-translation errors.
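Here is a minimal sketch of that preprocessing step: numbers are split into individual digits, each carrying a factor that marks whether it begins, continues, or ends the number. The token and factor names, and the number-matching pattern, are illustrative assumptions rather than the production tokenizer.

```python
# A minimal sketch of splitting numbers into single digits with begin/inside/end factors.
import re

def tokenize_with_digit_factors(sentence):
    """Return (token, factor) pairs; digits of a number get position marks."""
    tokens = []
    for word in sentence.split():
        if re.fullmatch(r"\d[\d.,]*\d|\d", word):
            digits = [ch for ch in word if ch.isdigit()]
            for i, digit in enumerate(digits):
                if i == 0:
                    factor = "num_begin"
                elif i == len(digits) - 1:
                    factor = "num_end"
                else:
                    factor = "num_inside"
                tokens.append((digit, factor))
        else:
            tokens.append((word, "word"))
    return tokens

# tokenize_with_digit_factors("price is 1,234,000") ->
# [('price', 'word'), ('is', 'word'), ('1', 'num_begin'), ('2', 'num_inside'),
#  ..., ('0', 'num_end')]
```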

Faster model training

When we're training a single system towards a single goal, as we did for the human-parity research project, we expect to throw vast amounts of hardware at models that take weeks to train. When training production models for 20+ language pairs, this approach becomes untenable. Not only do we need reasonable turn-around times, but we also need to moderate our hardware demands. For this project, we made a number of performance improvements to Marian NMT (Junczys-Dowmunt et al. 2018b).

Marian NMT is the open-source neural MT toolkit that Microsoft Translator is based on. Marian is a pure C++ neural machine translation toolkit and, as a result, extremely efficient: it does not require GPUs at runtime and is very efficient at training time.

Due to its self-contained nature, it is quite easy to optimize Marian for NMT-specific tasks, which results in one of the most efficient NMT toolkits available. Take a look at the benchmarks. If you are interested in Neural MT research and development, please join and contribute to the community on Github.

Our improvements to mixed-precision training and decoding, as well as to large-model training, will soon be made available in the public Github repository.

We are excited about the future of neural machine translation. We will continue to roll out the new model architecture to the remaining languages and Custom Translator throughout this year. Our users will automatically get the significantly better-quality translations through the Translator API, our Translator app, Microsoft Office, and the Edge browser. We hope the new improvements help your personal and professional lives, and we look forward to your feedback.


References