Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

Aashaka Shah; Vijay Chidambaram; Meghan Cowan; Saeed Maleki; Madan Musuvathi; Todd Mytkowicz; Jacob Nelson; Olli Saarikivi; Rachee Singh

Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

Aashaka Shah ,
Vijay Chidambaram ,
Meghan Cowan ,
Saeed Maleki ,
Madan Musuvathi ,
Todd Mytkowicz ,
Jacob Nelson ,
Olli Saarikivi ,
Rachee Singh

November 2021

arXiv preprint

Publication

Download BibTex

Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication – a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the standard NVIDIA Collective Communication Library (NCCL), allowing it to be a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives like Allgather, Alltoall, and Allreduce that are up to $3\times$ faster than NCCL. Using TACCL’s algorithms speeds up the end-to-end training of an internal mixture of experts model by $17\%$. By decomposing the optimization problem into parts and leveraging the symmetry in multi-GPU topologies, TACCL synthesizes collectives for up to 80-GPUs in less than 3 minutes, at least two orders of magnitude faster than other synthesis-based state-of-the-art collective communication libraries.