FastSeq: Make Sequence Generation Faster
- Yu Yan,
- Fei Hu,
- Jiusheng Chen,
- Nikhil Bhendawade,
- Ting Ye,
- Yeyun Gong,
- Nan Duan,
- Desheng Cui,
- Bingyu Chi,
- Ruofei Zhang
2021 Meeting of the Association for Computational Linguistics
Transformer-based models have made tremendous impacts in natural language generation. However, inference speed is a bottleneck due to the large model sizes and the intensive computation involved in the auto-regressive decoding process. We develop the FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough to be applicable to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain. Additionally, FastSeq is easy to use with a simple one-line code change. The source code is available at https://github.com/microsoft/fastseq.
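To make the repeated n-gram detection concrete: the idea is to ban any next token that would complete an n-gram already present in the generated sequence. The Python sketch below is a minimal, single-sequence illustration of that check, not FastSeq's actual batched implementation; the helper names and the incremental prefix-to-banned-tokens map are assumptions for exposition.

```python
from collections import defaultdict

def update_banned_tokens(tokens, n, bans):
    """Incrementally record which next tokens would repeat an n-gram.

    `bans` maps an (n-1)-gram prefix to the set of tokens that would
    complete an n-gram already seen in `tokens`. Call once per decoding
    step, after appending the newest token. (Hypothetical helper, for
    illustration only.)
    """
    if len(tokens) >= n:
        prefix = tuple(tokens[-n:-1])  # the n-1 tokens before the newest one
        bans[prefix].add(tokens[-1])   # generating this n-gram again is banned

def banned_next_tokens(tokens, n, bans):
    """Return the tokens that may not be generated at the next step."""
    if len(tokens) < n - 1:
        return set()
    return bans.get(tuple(tokens[-(n - 1):]), set())

# Usage: with n=3, generating 7 after [..., 5, 6] would repeat the
# 3-gram (5, 6, 7) that already occurred earlier in the sequence.
bans = defaultdict(set)
seq = []
for tok in [5, 6, 7, 8, 5, 6]:
    seq.append(tok)
    update_banned_tokens(seq, 3, bans)
print(banned_next_tokens(seq, 3, bans))  # {7}
```

Because the map is updated incrementally, each decoding step does constant work per sequence instead of rescanning the whole history, which is the kind of efficiency the abstract refers to.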
Publication Downloads
FastSeq
December 14, 2021
FastSeq provides efficient implementations of popular sequence models (e.g., BART, ProphetNet) for text generation, summarization, translation, and other tasks. It automatically optimizes inference speed on top of popular NLP toolkits (e.g., FairSeq and HuggingFace Transformers) without accuracy loss.
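As a rough illustration of the advertised one-line change: per the project repository, importing `fastseq` before the underlying toolkit is enough to activate the optimizations. The sketch below assumes the HuggingFace Transformers backend; the model name and generation parameters are illustrative, not prescribed by FastSeq.

```python
import fastseq  # the one-line change: import before the NLP toolkit

from transformers import BartForConditionalGeneration, BartTokenizer

# Illustrative model choice; other supported models work the same way.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer(["FastSeq accelerates sequence generation."], return_tensors="pt")
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```

The rest of the generation code stays unchanged; FastSeq patches the toolkit's beam search under the hood when it is imported first.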