{"id":827599,"date":"2022-03-22T10:00:00","date_gmt":"2022-03-22T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=827599"},"modified":"2022-08-17T09:51:47","modified_gmt":"2022-08-17T16:51:47","slug":"microsoft-translator-enhanced-with-z-code-mixture-of-experts-models","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-translator-enhanced-with-z-code-mixture-of-experts-models\/","title":{"rendered":"Microsoft Translator enhanced with Z-code Mixture of Experts models"},"content":{"rendered":"\n
\"Z-code<\/figure>\n\n\n\n

Translator, a Microsoft Azure Cognitive Service, is adopting Z-code Mixture of Experts models, a breakthrough AI technology that significantly improves the quality of production translation models. As a component of Microsoft's larger XYZ-code initiative to combine AI models for text, vision, audio, and language, Z-code supports the creation of AI systems that can speak, see, hear, and understand. This effort is a part of Azure AI and Project Turing, focusing on building multilingual, large-scale language models that support various production teams. Translator is using NVIDIA GPUs and Triton Inference Server to deploy and scale these models efficiently for high-performance inference. Translator is the first machine translation provider to introduce this technology live for customers.
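Because the new models ship inside the existing Translator service, customers continue to call the public Translator Text REST API (v3.0) as before. The sketch below shows a minimal translation request in Python; the key and region values are placeholders for your own Azure resource, and the example text and target languages are illustrative.

```python
# Minimal sketch: calling the public Translator Text REST API (v3.0).
# YOUR_KEY and YOUR_REGION are placeholders for an actual Azure Translator resource.
import requests

endpoint = "https://api.cognitive.microsofttranslator.com/translate"
params = {"api-version": "3.0", "from": "en", "to": ["fr", "tr"]}
headers = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",        # placeholder credential
    "Ocp-Apim-Subscription-Region": "YOUR_REGION",  # placeholder region
    "Content-Type": "application/json",
}
body = [{"Text": "Mixture of Experts models improve translation quality."}]

response = requests.post(endpoint, params=params, headers=headers, json=body)
for translation in response.json()[0]["translations"]:
    print(translation["to"], "->", translation["text"])
```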

Z-code MoE boosts efficiency and quality

Z-code models utilize a new architecture called Mixture of Experts (MoE), where different parts of the model can learn different tasks. The models learn to translate between multiple languages at the same time. The Z-code MoE model utilizes more parameters while dynamically selecting which parameters to use for a given input. This enables the model to specialize a subset of the parameters (experts) during training. At runtime, the model uses the relevant experts for the task, which is more computationally efficient than utilizing all of the model's parameters.

\"animated
Figure 1: Z-code MoE model translating from English to French. The model dynamically selects subsets of its parameters to be utilized for each input. <\/figcaption><\/figure>\n\n\n\n
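To make the sparse routing idea concrete, here is a minimal, hypothetical sketch of an MoE feed-forward layer with top-2 gating written with PyTorch. It is not the Z-code implementation; the layer sizes, number of experts, and gating scheme are illustrative assumptions. It shows how only a small subset of expert parameters is activated for each input token, which is why adding experts grows capacity without a proportional increase in compute.

```python
# Illustrative Mixture-of-Experts feed-forward layer with top-2 gating.
# Not the Z-code implementation; dimensions and expert count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router: scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # routing probabilities
        weights, indices = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)   # 16 token representations
layer = MoELayer()
print(layer(tokens).shape)      # torch.Size([16, 512])
```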

Newly introduced Z-code MoE models leverage transfer learning, which enables efficient knowledge sharing across similar languages. Moreover, the models utilize both parallel and monolingual data during the training process. This opens the way to high-quality machine translation beyond the high-resource languages and improves the quality of low-resource languages that lack significant training data. This approach can provide a positive impact on AI fairness, since both high-resource and low-resource languages see improvements.
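The following sketch illustrates one way parallel and monolingual data could be mixed when training a single multilingual model: translation pairs are used as supervised examples, while monolingual text is turned into a denoising (reconstruction) task. The task tags, corruption scheme, and data are illustrative assumptions, not the Z-code training recipe.

```python
# Hypothetical sketch of mixing parallel and monolingual data for one
# multilingual model. Tags, objectives, and examples are illustrative only.
import random

parallel = [
    {"src_lang": "en", "tgt_lang": "fr", "src": "Good morning", "tgt": "Bonjour"},
    {"src_lang": "en", "tgt_lang": "tr", "src": "Thank you", "tgt": "Teşekkürler"},
]
monolingual = [
    {"lang": "sl", "text": "Dobro jutro"},  # low-resource text without a translation
]

def make_example(record):
    if "tgt" in record:  # parallel data: supervised translation example
        return {"task": "translate",
                "input": f"<{record['src_lang']}> <{record['tgt_lang']}> {record['src']}",
                "target": record["tgt"]}
    # monolingual data: reconstruct the original sentence from a corrupted version
    words = record["text"].split()
    corrupted = " ".join(w for w in words if random.random() > 0.3)
    return {"task": "denoise",
            "input": f"<{record['lang']}> <{record['lang']}> {corrupted}",
            "target": record["text"]}

batch = [make_example(r) for r in parallel + monolingual]
for ex in batch:
    print(ex["task"], "|", ex["input"], "=>", ex["target"])
```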

We have trained translation systems for research purposes with 200 billion parameters supporting 100 language pairs. Though such large systems significantly improved translation quality, they also introduced challenges for deploying them cost-effectively in a production environment. For our production deployment, we opted to train a set of 5-billion-parameter models, which are 80 times larger than our currently deployed models. We trained a multilingual model per set of languages, where each model can serve up to 20 language pairs and therefore replace up to 20 of the current systems. This enabled our models to maximize transfer learning among languages while remaining deployable at an effective runtime cost. We compared the quality of the new MoE models to the current production systems using human evaluation. The figure below shows the results on various language pairs. The Z-code MoE systems outperformed individual bilingual systems, with average improvements of 4 percent. For instance, the models improved English to French translations by 3.2 percent, English to Turkish by 5.8 percent, Japanese to English by 7.6 percent, English to Arabic by 9.3 percent, and English to Slovenian by 15 percent.

\"graphic
Figure 2: Quality gains of Z-code MoE models over existing models. Languages are ordered by training data sizes. <\/figcaption><\/figure>\n\n\n\n
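One consequence of this deployment design is that many bilingual systems collapse into a handful of multilingual deployments. The sketch below illustrates that consolidation with a simple routing table; the grouping of language pairs is entirely hypothetical, since the post states only that each production model can serve up to 20 language pairs.

```python
# Hypothetical sketch: routing language pairs to consolidated multilingual
# MoE deployments. The grouping below is illustrative, not the real topology.
MODEL_GROUPS = {
    "moe-group-1": {("en", "fr"), ("en", "tr"), ("en", "ar"), ("en", "sl")},
    "moe-group-2": {("ja", "en"), ("ko", "en"), ("zh", "en")},
}

def pick_model(src: str, tgt: str) -> str:
    """Return the multilingual deployment that serves a given language pair."""
    for model, pairs in MODEL_GROUPS.items():
        if (src, tgt) in pairs:
            return model
    raise ValueError(f"No deployment configured for {src}->{tgt}")

# One multilingual model replaces many separate bilingual systems.
print(pick_model("en", "sl"))  # moe-group-1
```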