From Research to Production and Back: Ludicrously Fast Neural Machine Translation

Young Jin Kim; Marcin Junczys-Dowmunt; Hany Hassan Awadalla; Alham Fikri Aji; Kenneth Heafield; Roman Grundkiewicz; Nikolay Bogoychev

WNGT - EMNLP | November 2019

This paper describes the submissions of the “Marian” team to the WNGT 2019 efficiency shared task. Taking our dominating submissions to the previous edition of the shared task as a starting point, we develop improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation for Transformer-based student models. For efficient CPU-based decoding, we propose pre-packed 8-bit matrix products, improved batched decoding, cache-friendly student architectures with parameter sharing, and lightweight RNN-based decoder architectures. GPU-based decoding benefits from the same architecture changes, as well as from pervasive 16-bit inference and concurrent streams. These modifications, together with profiler-based C++ code optimization, allow us to push the Pareto frontier established during the 2018 edition towards 24x (CPU) and 14x (GPU) faster models at comparable or higher BLEU values. Our fastest CPU model is more than 4x faster than last year’s fastest submission at more than 3 points higher BLEU. Our fastest GPU model, at 1.5 seconds translation time, is slightly faster than last year’s fastest RNN-based submissions, but outperforms them by more than 4 and 10 BLEU points, respectively.
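
To make the idea behind "pre-packed 8-bit matrix products" more concrete, the following is a minimal sketch in plain Python. It is not the Marian implementation: the symmetric per-tensor quantization scheme, the column-major layout, and the names `quantize`, `PackedWeights`, and `matmul_int8` are all assumptions chosen for illustration. The sketch shows the general pattern only: weights are quantized and rearranged once ahead of time, activations are quantized on the fly, products are accumulated as integers, and the result is rescaled to floating point.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize(m: Matrix) -> Tuple[List[List[int]], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127].

    Assumed scheme for illustration; returns the integer matrix and the
    scale such that original_value is approximately integer_value * scale.
    """
    max_abs = max((abs(v) for row in m for v in row), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in m]
    return q, scale


class PackedWeights:
    """Hypothetical helper: weights quantized once and stored column-major."""

    def __init__(self, w: Matrix) -> None:
        q, self.scale = quantize(w)
        # Transpose so each output column is one contiguous list; this
        # stands in for the hardware-specific packing a real kernel would do.
        self.cols = [list(col) for col in zip(*q)]


def matmul_int8(a: Matrix, packed: PackedWeights) -> Matrix:
    """Multiply float activations by pre-packed 8-bit weights."""
    qa, a_scale = quantize(a)          # activations are quantized per call
    rescale = a_scale * packed.scale   # one float multiply per output value
    return [
        # Integer dot product, then a single rescale back to float.
        [sum(x * y for x, y in zip(row, col)) * rescale for col in packed.cols]
        for row in qa
    ]
```

In this sketch the cost of quantizing and transposing the weights is paid once when `PackedWeights` is constructed, so repeated calls to `matmul_int8` during decoding only quantize the activations. The output matches a floating-point product only approximately, within the quantization error of the two inputs.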