{"id":1088679,"date":"2024-09-27T02:08:05","date_gmt":"2024-09-27T09:08:05","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1088679"},"modified":"2024-09-27T02:10:54","modified_gmt":"2024-09-27T09:10:54","slug":"nnscaler-exploring-a-new-paradigm-for-parallel-execution-in-deep-learning","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/nnscaler-exploring-a-new-paradigm-for-parallel-execution-in-deep-learning\/","title":{"rendered":"nnScaler: Exploring a new paradigm for parallel execution in deep learning"},"content":{"rendered":"\n
Author: Youshan Miao

Today, deep learning permeates our daily lives. As models continue to grow in size, training them on massive fleets of GPU accelerators has become increasingly time-consuming and costly. To harness these GPUs effectively, researchers have developed a variety of parallel strategies that improve performance across multiple devices. However, many potentially efficient parallel strategies remain unexplored, and fully exploring them to unlock further performance remains a major challenge in both research and application.

Recently, researchers from Microsoft Research Asia proposed nnScaler, a framework that lets users freely express parallel strategies through a set of parallel primitives, while keeping search costs low by working within a strategy-constrained search space.

Mainstream training systems such as Megatron-LM, DeepSpeed, and Alpa ship with built-in parallel strategies like data parallelism, tensor parallelism, and pipeline parallelism, which can be combined through configuration. While this approach is efficient in most cases and convenient for users, it can overlook more efficient strategies: incorporating a new parallel strategy often requires substantial modifications to the underlying system code, which is usually impractical.

Therefore, the researchers revisited the fundamental components of parallel strategies. The execution of a deep learning model can usually be represented as a data-flow graph, in which tensors (the data) are the edges and tensor operators are the vertices. Parallelization is then the process of transforming this computation data-flow graph, originally designed for single-device execution, into a distributed data-flow graph for multi-device execution. Based on this view, the researchers propose a set of basic operations as parallel primitives, including:

- op-trans: partitioning an operator, together with the tensors it consumes and produces, into multiple sub-operators;
- op-assign: assigning each (sub-)operator to a specific device for execution;
- op-order: specifying the execution order of operators placed on the same device.

By utilizing these primitives, we can precisely describe how each operator and tensor in the data-flow graph is partitioned and scheduled across devices (spatially) and over time (temporal order on the same device).
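To make this concrete, the sketch below shows, in plain Python, how a parallelization plan for a tiny two-operator graph could be written as a sequence of such primitive calls. The types and function names here (Op, Plan, op_trans, op_assign, op_order) and the naming-only partitioning are illustrative assumptions for this post, not nnScaler's actual API.

```python
# A minimal sketch, assuming a toy graph representation (not nnScaler's real API):
# three primitives that rewrite a single-device data-flow graph into a
# distributed plan by partitioning, placing, and ordering operators.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Op:
    name: str                      # operator name, e.g. "matmul1"
    inputs: list                   # names of consumed tensors (incoming edges)
    outputs: list                  # names of produced tensors (outgoing edges)
    device: Optional[int] = None   # filled in by op_assign

@dataclass
class Plan:
    ops: list = field(default_factory=list)    # the distributed data-flow graph
    order: list = field(default_factory=list)  # (earlier, later) pairs per device

def op_trans(op: Op, num: int) -> list:
    """Partition an operator, and the tensors it touches, into `num` sub-operators.
    Partitioning here is by naming only; a real system must split tensors along
    dimensions that preserve the operator's semantics."""
    return [Op(f"{op.name}#{i}",
               [f"{t}#{i}" for t in op.inputs],
               [f"{t}#{i}" for t in op.outputs]) for i in range(num)]

def op_assign(op: Op, device: int) -> Op:
    """Spatial scheduling: place a (sub-)operator on a specific device."""
    op.device = device
    return op

def op_order(plan: Plan, first: Op, second: Op) -> None:
    """Temporal scheduling: run `first` before `second` on their shared device."""
    assert first.device == second.device, "op_order constrains ops on one device"
    plan.order.append((first.name, second.name))

# Express one concrete strategy for a two-operator graph:
# split matmul1 across devices 0 and 1, keep matmul2 whole on device 0,
# and require matmul1's device-0 shard to run before matmul2.
plan = Plan()
matmul1 = Op("matmul1", inputs=["x", "w1"], outputs=["h"])
matmul2 = Op("matmul2", inputs=["h", "w2"], outputs=["y"])

shards = op_trans(matmul1, num=2)
plan.ops += [op_assign(s, dev) for dev, s in enumerate(shards)]
plan.ops.append(op_assign(matmul2, device=0))
op_order(plan, plan.ops[0], plan.ops[-1])

for op in plan.ops:
    print(f"{op.name} -> device {op.device}")
print("order constraints:", plan.order)
```

Under this view, data parallelism, tensor parallelism, pipeline parallelism, and less conventional strategies all reduce to different sequences of partition, placement, and ordering decisions applied to the same graph.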
Parallel primitives: A unified framework to describe parallel strategies