Author: Youshan Miao
Today, deep learning permeates our daily lives. As models continue to grow in size, training them across massive fleets of GPU accelerators has become increasingly time-consuming and costly. To harness so many GPUs effectively, researchers have been developing parallel strategies that distribute the work across devices to improve performance. However, many potentially more efficient parallel strategies remain unexplored, and fully exploring these possibilities to further unlock performance remains a major challenge in both research and practice.
Recently, researchers from Microsoft Research Asia proposed nnScaler, a framework that lets parallel strategies be expressed freely through a set of parallel primitives, while keeping search costs low by constraining the strategy search space.
Parallel primitives: A unified framework to describe parallel strategies
Mainstream training systems, such as Megatron-LM, DeepSpeed, and Alpa, typically ship with built-in parallel strategies like data-parallelism, tensor-parallelism, and pipeline-parallelism, which can be combined through configuration. While this approach is efficient in most cases and convenient for users, it can overlook more efficient strategies. Supporting a new parallel strategy often requires substantial modifications to the underlying system code, which is frequently impractical.
Therefore, researchers revisited the fundamental components of parallel strategies. The execution of deep learning models can usually be represented as a data-flow graph, where tensors (representing data) are the edges, and tensor operators are the vertices. The process of parallelization involves transforming the computation data-flow graph, originally designed for single-device execution, into a distributed data-flow graph for multi-device execution. Thus, researchers propose a set of basic operations as parallel primitives, including:
- op-trans: Describes how operators and tensors are partitioned.
- op-assign: Assigns the partitioned operators to specific devices.
- op-order: Specifies the execution order of operators on the same device.
By utilizing these primitives, we can precisely describe how each operator and tensor in the data-flow graph is partitioned and scheduled across devices (spatially) and over time (temporal order on the same device).
These parallel primitives can naturally express various widely used parallel strategies. For instance, data-parallelism can be expressed as partitioning all forward pass and backward pass operators along the data sample dimension and distributing them evenly across all devices, while copying the optimizer operators to each device. All operators on each device maintain the same execution order as in the original graph. The introduction of parallel primitives allows for the flexible description and integration of various parallel strategies within a unified framework, significantly expanding the representational scope of parallel strategy spaces.
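To make this concrete, here is a minimal, self-contained sketch of expressing data-parallelism with the three primitives. All names in it (Op, Plan, op_trans, op_assign, op_order, and the split_batch/replicate algorithm labels) are illustrative stand-ins invented for this post, not the actual nnScaler API.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    kind: str  # "fwd", "bwd", or "opt"


@dataclass
class Plan:
    placement: dict = field(default_factory=dict)  # sub-op name -> device
    order: dict = field(default_factory=dict)      # device -> execution order


def op_trans(op, algo, num):
    """op-trans: partition (or replicate) an operator and its tensors into `num` pieces."""
    return [Op(f"{op.name}[{algo}:{i}]", op.kind) for i in range(num)]


def op_assign(plan, sub_op, device):
    """op-assign: place one partitioned operator on a concrete device."""
    plan.placement[sub_op.name] = device


def op_order(plan, device, ops):
    """op-order: fix the execution order of the operators sharing a device."""
    plan.order[device] = [o.name for o in ops]


def data_parallel(graph, devices):
    """Data-parallelism: split fwd/bwd operators along the data-sample
    dimension, replicate optimizer operators, keep the original order."""
    plan = Plan()
    per_device = {d: [] for d in devices}
    for op in graph:  # graph listed in its original execution order
        algo = "replicate" if op.kind == "opt" else "split_batch"
        for sub, dev in zip(op_trans(op, algo, len(devices)), devices):
            op_assign(plan, sub, dev)
            per_device[dev].append(sub)
    for dev, ops in per_device.items():
        op_order(plan, dev, ops)
    return plan


graph = [Op("fwd_linear", "fwd"), Op("bwd_linear", "bwd"), Op("adam_step", "opt")]
plan = data_parallel(graph, devices=[0, 1])
print(plan.placement)  # each operator has one partition (or replica) per device
```

Roughly speaking, other strategies fall out of the same three calls: tensor-parallelism changes only the algorithm passed to op-trans, while pipeline-parallelism changes only which devices op-assign picks and the order op-order enforces.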
This universality not only systematically expresses existing parallel strategies but also opens the door to new ones. For example, consider an operator whose intermediate results are too large for a single device: the conventional remedy is tensor-parallelism, which requires communication to coordinate the partitions. With parallel primitives, we can instead partition the tensor for that specific operator and schedule all partitions on the same device for sequential execution, completing the operation within single-device memory while avoiding communication overhead, as shown in the figure below.
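As a rough illustration of that schedule (using NumPy with arbitrary shapes chosen for this post, not nnScaler itself), the partitions of the memory-heavy operator run one after another on a single device, so the oversized intermediate never has to exist in full:

```python
import numpy as np

x = np.random.randn(8192, 1024).astype(np.float32)
w = np.random.randn(1024, 1024).astype(np.float32)

# Unpartitioned execution would materialize the full 8192 x 1024
# intermediate activation at once. Instead: op-trans splits x along its
# row dimension, op-assign keeps every partition on the same device, and
# op-order runs the partitions sequentially.
partials = []
for chunk in np.array_split(x, 4, axis=0):    # 4 partitions, executed in order
    inter = np.maximum(chunk @ w, 0.0)        # only a quarter-sized intermediate is live
    partials.append(inter.sum(axis=0))
result = np.add.reduce(partials)              # equals np.maximum(x @ w, 0).sum(axis=0)
```

No cross-device communication is needed because every partition lives on one device; the trade-off is that the partitions run sequentially rather than concurrently.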
Enhancing efficiency through expert-guided strategy search
While parallel primitives expand the strategy space, this expansion also introduces a new challenge—the vast strategy space makes it difficult to complete searches within a limited time. With a multitude of possible combinations, efficiently finding the optimal strategy becomes a pressing issue.
nnScaler leverages domain-expert wisdom to guide the strategy search. This guidance is itself described through the parallel primitives, so it integrates seamlessly into the system. For example, as shown in the figure below, researchers can constrain an operator's partitioning scheme to the options specified in the algo set and limit the partitioning to the number of pieces specified by num.
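Here is a minimal sketch of how such a constraint might prune the search. The algo and num fields mirror the knobs described above, but the Constraint record and the candidate enumeration are invented for illustration and are not the actual nnScaler interface.

```python
from dataclasses import dataclass
from itertools import product

ALL_ALGOS = ("split_batch", "split_row", "split_col", "replicate")
ALL_NUMS = (1, 2, 4, 8)


@dataclass(frozen=True)
class Constraint:
    algos: tuple  # allowed partitioning algorithms for this operator
    nums: tuple   # allowed numbers of partitions


def candidate_plans(constraint=None):
    """Enumerate (algo, num) candidates for one operator, optionally
    restricted by an expert-provided constraint."""
    algos = constraint.algos if constraint else ALL_ALGOS
    nums = constraint.nums if constraint else ALL_NUMS
    return list(product(algos, nums))


print(len(candidate_plans()))                             # 16 unconstrained candidates
expert = Constraint(algos=("split_batch",), nums=(4, 8))  # expert guidance
print(len(candidate_plans(expert)))                       # 2 candidates left to search
```

Because a full plan multiplies such per-operator choices together, even modest per-operator pruning shrinks the overall search space dramatically.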
By setting such constraints, experts can significantly narrow the search space, making the search process more efficient and targeted. Experiments have shown up to a 10x increase in search efficiency without compromising the effectiveness of the resulting strategies.
This approach can uncover high-performance strategies that existing methods overlook and complete searches in a shorter time, thereby improving the efficiency of deep learning training. nnScaler has been validated in the training of multiple deep learning models, demonstrating significant performance improvements.
For more details, please refer to the nnScaler paper: https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi
By introducing parallel primitives and expert-guided strategy search, nnScaler addresses many of the issues in designing parallel strategies for deep learning. The approach not only significantly expands the space of parallel strategies but also provides new directions and tools for future research. Researchers are eager to see this method demonstrate its potential in broader applications, bringing more possibilities to the development of deep learning.