{"id":995736,"date":"2024-01-04T09:02:45","date_gmt":"2024-01-04T17:02:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=995736"},"modified":"2024-01-08T20:39:23","modified_gmt":"2024-01-09T04:39:23","slug":"splitwise-improves-gpu-usage-by-splitting-llm-inference-phases","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/splitwise-improves-gpu-usage-by-splitting-llm-inference-phases\/","title":{"rendered":"Splitwise improves GPU usage by splitting LLM inference phases"},"content":{"rendered":"\n
The recent surge in large language model (LLM) use is causing significant challenges for cloud providers, requiring them to deploy more GPUs at an unprecedented rate. However, the capacity to provision the power needed to run these GPUs is limited, and with demand for computation surpassing supply, it is not uncommon for user queries to be denied. Therefore, any approach that makes the existing infrastructure more efficient, enabling it to serve more queries faster under the same power budget, can have very tangible benefits to both cloud providers and users.
One aspect of LLM inference that currently limits efficient use of resources is that it has two distinct phases with different characteristics: the prompt phase and the token-generation phase. During the prompt phase, LLMs process all user input, or prompts, in parallel, efficiently utilizing GPU compute. However, during the token-generation phase, LLMs generate each output token sequentially and are limited by GPU memory bandwidth. Even when employing state-of-the-art batching mechanisms, the discrepancy between these two phases results in low overall hardware utilization, leading to much higher costs when offering LLMs to users. Figure 1 illustrates the differences between these two phases.

At Azure Research – Systems, we tackled this problem by creating Splitwise, a technique designed to optimally utilize available hardware by separating the prompt computation and token-generation phases onto separate machines. This approach is underpinned by the insight that prompt processing and token generation are distinct in their computational, memory, and power requirements. By separating these two phases, we can enhance hardware utilization during both. Our paper, “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” details our methods for developing and testing this technique, including an exploration of how different types of GPUs perform during each phase.

To create a sustainable approach for GPU provisioning, we used Splitwise to design GPU clusters with three primary objectives: maximizing throughput, minimizing costs, and reducing power. In addition to separating the two LLM inference phases into two distinct machine pools, we include a third machine pool for mixed batching across the prompt and token phases, sized dynamically based on real-time computational demands. Lastly, we transfer the state context (i.e., the KV-cache in the LLM transformer attention layers) from the prompt machines to the token machines over InfiniBand without any perceivable latency impact on the user. This high-level system architecture is illustrated in Figure 2.

Splitting the phases with Splitwise
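To make the two phases and the hand-off between them concrete, here is a minimal, illustrative Python sketch: a prompt (prefill) pass processes all input tokens in one parallel, compute-bound step and produces a KV-cache, which is then handed to a token-generation loop that produces one token per step. The function and class names (run_prefill, run_decode_step, generate_split, KVCache) and the toy arithmetic standing in for the model are assumptions made for illustration; this is not the Splitwise implementation.

```python
# Minimal sketch of split-phase LLM inference (illustrative only).
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request attention state produced during the prompt (prefill) phase."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


def run_prefill(prompt_tokens: list) -> tuple:
    """Prompt phase: all input tokens are processed together in one
    compute-bound pass. Returns the first output token and the KV-cache."""
    cache = KVCache()
    for tok in prompt_tokens:               # conceptually a single batched pass
        cache.keys.append(tok)              # stand-in for real key/value tensors
        cache.values.append(tok)
    first_token = sum(prompt_tokens) % 100  # stand-in for sampling from a model
    return first_token, cache


def run_decode_step(last_token: int, cache: KVCache) -> int:
    """Token-generation phase: one token per step, limited by memory bandwidth
    because the whole KV-cache is read on every step."""
    cache.keys.append(last_token)
    cache.values.append(last_token)
    return (last_token + len(cache.keys)) % 100  # stand-in for sampling


def generate_split(prompt_tokens: list, max_new_tokens: int) -> list:
    """Run prefill on a 'prompt machine', hand the KV-cache to a 'token machine'
    (Splitwise does this over InfiniBand), then run the decode loop there."""
    # --- prompt machine ---
    token, cache = run_prefill(prompt_tokens)

    # --- KV-cache hand-off (placeholder for the InfiniBand transfer) ---
    transferred = KVCache(keys=list(cache.keys), values=list(cache.values))

    # --- token machine ---
    output = [token]
    for _ in range(max_new_tokens - 1):
        token = run_decode_step(token, transferred)
        output.append(token)
    return output


if __name__ == "__main__":
    print(generate_split(prompt_tokens=[12, 7, 42], max_new_tokens=5))
```

In this sketch, the hand-off is a plain copy; the point is only that once the prefill output and KV-cache exist, the sequential decode loop can run on different hardware than the parallel prompt pass.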
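The three-pool cluster design described above can also be sketched at a high level. The sizing rule below (splitting a fixed machine budget by observed prompt and token demand, with the remainder backing the mixed pool) is a simplified assumption for illustration only; the actual scheduler and dynamic sizing policy are described in the Splitwise paper.

```python
# Hypothetical sketch of sizing the prompt, token, and mixed machine pools.
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    machines: int


def size_pools(total_machines: int, prompt_load: float, token_load: float) -> list:
    """Split a fixed machine budget across the three pools.

    prompt_load and token_load are the observed fractions of demand (0..1) for
    each phase; whatever remains backs the dynamically sized mixed pool.
    """
    prompt_machines = int(total_machines * prompt_load)
    token_machines = int(total_machines * token_load)
    mixed_machines = max(total_machines - prompt_machines - token_machines, 0)
    return [
        Pool("prompt", prompt_machines),  # compute-bound prefill work
        Pool("token", token_machines),    # memory-bandwidth-bound decode work
        Pool("mixed", mixed_machines),    # mixed batching across both phases
    ]


if __name__ == "__main__":
    for pool in size_pools(total_machines=40, prompt_load=0.35, token_load=0.5):
        print(f"{pool.name}: {pool.machines} machines")
```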