FastSeq: Make Sequence Generation Faster
- Yu Yan,
- Fei Hu,
- Jiusheng Chen,
- Nikhil Bhendawade,
- Ting Ye,
- Yeyun Gong,
- Nan Duan,
- Desheng Cui,
- Bingyu Chi,
- Ruofei Zhang
2021 Meeting of the Association for Computational Linguistics
Transformer-based models have made tremendous impacts in natural language generation. However, inference speed is a bottleneck due to the large model sizes and the intensive computation involved in the auto-regressive decoding process. We develop the FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough to be applicable to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain. Additionally, FastSeq is easy to use with a simple one-line code change. The source code is available at https://github.com/microsoft/fastseq.
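To make the repeated n-gram detection concrete: the idea is to ban any next token that would complete an n-gram already present in the generated sequence. The Python sketch below is a minimal, single-sequence illustration of that check, not FastSeq's actual batched implementation; the helper names and the incremental prefix-to-banned-tokens map are assumptions for exposition.

```python
from collections import defaultdict

def update_banned_tokens(tokens, n, bans):
    """Incrementally record which next tokens would repeat an n-gram.

    `bans` maps an (n-1)-gram prefix to the set of tokens that would
    complete an n-gram already seen in `tokens`. Call once per decoding
    step, after appending the newest token. (Hypothetical helper, for
    illustration only.)
    """
    if len(tokens) >= n:
        prefix = tuple(tokens[-n:-1])  # the n-1 tokens before the newest one
        bans[prefix].add(tokens[-1])   # generating this n-gram again is banned

def banned_next_tokens(tokens, n, bans):
    """Return the tokens that may not be generated at the next step."""
    if len(tokens) < n - 1:
        return set()
    return bans.get(tuple(tokens[-(n - 1):]), set())

# Usage: with n=3, generating 7 after [..., 5, 6] would repeat the
# 3-gram (5, 6, 7) that already occurred earlier in the sequence.
bans = defaultdict(set)
seq = []
for tok in [5, 6, 7, 8, 5, 6]:
    seq.append(tok)
    update_banned_tokens(seq, 3, bans)
print(banned_next_tokens(seq, 3, bans))  # {7}
```

Because the map is updated incrementally, each decoding step does constant work per sequence instead of rescanning the whole history, which is the kind of efficiency the abstract refers to.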
Publication Downloads
FastSeq
December 14, 2021
FastSeq provides efficient implementations of popular sequence models (e.g., BART, ProphetNet) for text generation, summarization, translation, and other tasks. It automatically optimizes inference speed on top of popular NLP toolkits (e.g., FairSeq and HuggingFace Transformers) without accuracy loss.
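As a rough illustration of the advertised one-line change: per the project repository, importing `fastseq` before the underlying toolkit is enough to activate the optimizations. The sketch below assumes the HuggingFace Transformers backend; the model name and generation parameters are illustrative, not prescribed by FastSeq.

```python
import fastseq  # the one-line change: import before the NLP toolkit

from transformers import BartForConditionalGeneration, BartTokenizer

# Illustrative model choice; other supported models work the same way.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer(["FastSeq accelerates sequence generation."], return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```

The rest of the generation code stays unchanged; FastSeq patches the toolkit's beam search under the hood when it is imported first.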