Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
- Xuting Liu
- Behnaz Arzani
- Siva Kesava Reddy Kakarla
- Liangyu Zhao
- Vincent Liu
- Miguel Castro
- Srikanth Kandula
- Luke Marshall
SIGCOMM
Organized by ACM
Cloud operators use collective communication optimizers to improve the efficiency of the single-tenant, centrally managed training clusters they operate. However, current optimizers struggle to scale for such use cases and often trade solution quality for scalability. Our solution, TE-CCL, adopts a traffic-engineering-based approach to collective communication. Compared to TACCL, a state-of-the-art optimizer, TE-CCL produced schedules with 2× better performance on topologies TACCL supports, and its solver took a similar amount of time as TACCL’s heuristic-based approach. TE-CCL additionally scales to larger topologies than TACCL. On our GPU testbed, TE-CCL outperformed TACCL by 2.14× and RCCL by 3.18× in terms of algorithm bandwidth.
Publication Downloads
TE-CCL
May 24, 2024
TE-CCL is a tool that generates collective communication schedules for large topologies using a traffic-engineering-based solver. It takes a topology and a collective (e.g., AllGather) as input and outputs a JSON schedule that lists the data transfer steps for each node and satisfies the demands specified by the collective.
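
To make "a schedule that satisfies the demands of a collective" concrete, here is a minimal sketch in Python. It does not use TE-CCL or its solver: it builds the textbook ring AllGather schedule by hand and checks that every node ends up with every chunk. The JSON field names (`collective`, `num_nodes`, `steps`, `transfers`, `src`, `dst`, `chunk`) and both function names are hypothetical and chosen for illustration only; TE-CCL's actual output format may differ.

```python
import json


def ring_allgather_schedule(num_nodes):
    """Build a hand-written AllGather schedule on a unidirectional ring.

    Hypothetical format: a list of steps, each holding concurrent transfers.
    Chunk i is the data initially owned by node i.
    """
    steps = []
    for step in range(num_nodes - 1):
        transfers = []
        for src in range(num_nodes):
            dst = (src + 1) % num_nodes
            # At step k, each node forwards the chunk it received k steps ago.
            chunk = (src - step) % num_nodes
            transfers.append({"src": src, "dst": dst, "chunk": chunk})
        steps.append({"step": step, "transfers": transfers})
    return {"collective": "AllGather", "num_nodes": num_nodes, "steps": steps}


def satisfies_allgather(schedule):
    """Return True if the schedule delivers every chunk to every node."""
    n = schedule["num_nodes"]
    have = [{i} for i in range(n)]  # node i starts with chunk i only
    for step in schedule["steps"]:
        # Transfers within a step are concurrent, so a node may only send
        # chunks it already held before the step began.
        snapshot = [set(chunks) for chunks in have]
        for t in step["transfers"]:
            if t["chunk"] not in snapshot[t["src"]]:
                return False
            have[t["dst"]].add(t["chunk"])
    return all(chunks == set(range(n)) for chunks in have)


schedule = ring_allgather_schedule(4)
print(json.dumps(schedule["steps"][0], indent=2))
print(satisfies_allgather(schedule))
```

The check treats AllGather as a set of demands (every node must receive every other node's chunk) and rejects a schedule that sends a chunk before the sender holds it. An optimizer such as TE-CCL searches for schedules of this kind on a given topology rather than hard-coding a ring.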