Synthesizing optimal collective communication algorithms

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's Law bottleneck of data-parallel training.

This paper introduces SCCL (Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode SCCL's synthesis as a quantifier-free SMT formula that can be discharged to a theorem prover, and further demonstrates how to scale the synthesis by exploiting symmetries in topologies and collectives.
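To give a flavor of this kind of encoding, the sketch below poses a toy feasibility question to the Z3 SMT solver: can an Allgather on a small ring complete within a given number of steps? The variable naming, the ring topology, and the omission of per-link bandwidth and chunk-routing constraints are simplifying assumptions for illustration only; this is not SCCL's actual formulation.

```python
# Minimal illustrative sketch (not SCCL's encoding): is an Allgather on an
# N-node ring feasible within T steps?  Requires: pip install z3-solver
from z3 import Solver, Bool, Or, Implies, Not, sat

N = 4                                                     # ring of 4 nodes; node i starts with chunk i
T = 3                                                     # number of communication steps to try
ring = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}  # ring neighbors

# has[c][n][t] is true iff node n holds chunk c after step t (t = 0 is the initial state).
has = [[[Bool(f"has_c{c}_n{n}_t{t}") for t in range(T + 1)]
        for n in range(N)] for c in range(N)]

s = Solver()
for c in range(N):
    for n in range(N):
        # Initial state: node n holds exactly its own chunk.
        s.add(has[c][n][0] if c == n else Not(has[c][n][0]))
        # Goal: after T steps every node holds every chunk.
        s.add(has[c][n][T])
        for t in range(T):
            # Chunks are never dropped once received.
            s.add(Implies(has[c][n][t], has[c][n][t + 1]))
            # A chunk can only appear at n if n or a ring neighbor already had it.
            s.add(Implies(has[c][n][t + 1],
                          Or(has[c][n][t], *[has[c][m][t] for m in ring[n]])))

if s.check() == sat:
    print(f"An Allgather schedule on the {N}-node ring exists within {T} steps")
else:
    print(f"No Allgather schedule fits in {T} steps")
```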

We synthesize novel latency-optimal and bandwidth-optimal algorithms, not previously seen in the literature, for two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate performance competitive with hand-optimized collective communication libraries.
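As a rough illustration of the latency/bandwidth trade-off behind the Pareto frontier mentioned above, the sketch below compares textbook alpha-beta cost estimates for two classic Allreduce schedules. The formulas and constants are standard back-of-the-envelope figures, not SCCL outputs or measurements.

```python
# Back-of-the-envelope alpha-beta costs for Allreduce on p nodes with an m-byte
# vector: a latency-leaning recursive-doubling schedule vs. a bandwidth-leaning
# ring schedule. Textbook estimates only, not numbers produced by SCCL.
from math import log2

def allreduce_recursive_doubling(p, m, alpha, beta):
    # log2(p) rounds, each exchanging the full m-byte vector.
    return log2(p) * (alpha + m * beta)

def allreduce_ring(p, m, alpha, beta):
    # Reduce-scatter followed by Allgather: 2(p-1) rounds of m/p bytes each.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * m * beta

alpha, beta = 5e-6, 1e-9                 # 5 us per message, 1 ns per byte (illustrative)
for m in (1 << 10, 1 << 20, 1 << 26):    # 1 KiB, 1 MiB, 64 MiB vectors
    rd = allreduce_recursive_doubling(8, m, alpha, beta)
    ring = allreduce_ring(8, m, alpha, beta)
    print(f"m = {m:>9} B   recursive doubling: {rd*1e3:8.3f} ms   ring: {ring*1e3:8.3f} ms")
```

Small vectors favor the few-step schedule, while large vectors favor the schedule that moves less data per node, which is exactly the spectrum the synthesized Pareto frontier spans.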

Publication Downloads

Synthesized Collective Communication Library (SCCL)

May 3, 2021

The Synthesized Collective Communication Library is a tool for synthesizing collective algorithms tailored to a particular hardware topology. It creates high-performance collective communication algorithms for the uncommon interconnect topologies connecting AI accelerators inside servers, and it includes the core synthesizer logic as well as routines for lowering the synthesized algorithms to backend implementations (a simplified sketch of this lowering step follows the list below). The objectives for this project include:

- Making AI workloads (both model-parallel inference and parallel training) as well as HPC workloads faster on the kinds of multi-GPU servers deployed in Azure.
- Providing a tool for hardware designers to evaluate their designs by seeing how good the optimal algorithms synthesized by SCCL are.
- Enabling a new class of optimizations involving custom collective primitives to be explored; currently, implementing efficient communication algorithms for custom, application-specific collectives is very labor intensive.
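The following sketch shows, in highly simplified form, what lowering a synthesized schedule into per-rank operations could look like. The schedule format, function name, and ring example are hypothetical and chosen purely for illustration; they are not SCCL's actual data structures or interfaces.

```python
# Hypothetical, simplified view of "lowering": turn a synthesized schedule
# (per-step chunk transfers) into an ordered list of send/recv operations for
# each rank. The format and names are illustrative assumptions, not SCCL's.
from collections import defaultdict

# An example Allgather schedule for a 3-node ring: (step, chunk, src_rank, dst_rank).
schedule = [
    (0, 0, 0, 1), (0, 1, 1, 2), (0, 2, 2, 0),
    (1, 0, 1, 2), (1, 1, 2, 0), (1, 2, 0, 1),
]

def lower_to_per_rank_ops(schedule):
    """Group transfers by rank, ordered by step, as (op, peer, chunk) triples."""
    ops = defaultdict(list)
    for step, chunk, src, dst in sorted(schedule):
        ops[src].append(("send", dst, chunk))
        ops[dst].append(("recv", src, chunk))
    return dict(ops)

for rank, rank_ops in sorted(lower_to_per_rank_ops(schedule).items()):
    print(f"rank {rank}: {rank_ops}")
```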

SCCL-Runtime

June 24, 2021

Synthesized Collective Communication Library (SCCL) is an algorithm synthesizer for GPU communication in machine learning workloads. SCCL-Runtime takes a synthesized algorithm from SCCL and runs it on the hardware. SCCL-Runtime is an extension of NCCL (https://github.com/nvidia/nccl) from NVIDIA and is available at https://github.com/parasailteam/nccl-master/.