VIDUR: A Large-Scale Simulation Framework for LLM Inference

Amey Agrawal; Nitin Kedia; Jayashree Mohan; Ashish Panwar; Nipun Kwatra; Bhargav S. Gulavani; Ramachandran Ramjee; Alexey Tumanov

MLSys | May 2024

Optimizing the deployment of Large Language Models (LLMs) is expensive today because it requires experimentally running an application workload against an LLM implementation while exploring a large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur, a large-scale, high-fidelity, easily extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates end-to-end inference performance for different workloads by estimating several metrics of interest, such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration, which would require 42K GPU hours, costing 218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
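
To make the idea of simulator-driven configuration search concrete, here is a minimal Python sketch. It is not Vidur's or Vidur-Search's API: `Config`, `Estimate`, `toy_simulate`, `search`, and the latency formula are all hypothetical stand-ins invented for illustration. The only number taken from the abstract is the GPU-hour price, which is the rate implied by 218K dollars over 42K GPU hours (about 5.19 dollars per GPU hour). The sketch enumerates a small grid of knobs, discards configurations that violate a latency constraint, and keeps the one with the highest estimated throughput per dollar.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Config:
    """Hypothetical deployment knobs (not Vidur's actual config schema)."""
    tensor_parallel: int  # GPUs used by one replica
    batch_size: int       # requests processed together


@dataclass(frozen=True)
class Estimate:
    latency_s: float       # estimated per-batch latency in seconds
    throughput_qps: float  # estimated requests served per second


def toy_simulate(cfg: Config) -> Estimate:
    """Toy analytic model standing in for a real simulator call.

    Assumption: latency shrinks with more GPUs and grows with batch size.
    """
    latency = 4.0 / cfg.tensor_parallel + 0.05 * cfg.batch_size
    return Estimate(latency_s=latency, throughput_qps=cfg.batch_size / latency)


def search(
    configs: Iterable[Config],
    simulate: Callable[[Config], Estimate],
    latency_slo_s: float,
    gpu_hourly_usd: float,
) -> Optional[Config]:
    """Return the feasible config with the best throughput per dollar, or None."""
    best, best_score = None, 0.0
    for cfg in configs:
        est = simulate(cfg)
        if est.latency_s > latency_slo_s:
            continue  # violates the performance constraint
        hourly_cost = gpu_hourly_usd * cfg.tensor_parallel
        score = est.throughput_qps / hourly_cost
        if score > best_score:
            best, best_score = cfg, score
    return best


# Price implied by the abstract's figures: 218K dollars / 42K GPU hours.
GPU_HOURLY_USD = 218_000 / 42_000

# Small illustrative grid; a real search space would be far larger.
GRID = [Config(tp, bs) for tp, bs in product((1, 2, 4, 8), (1, 8, 16, 32))]

if __name__ == "__main__":
    print(search(GRID, toy_simulate, latency_slo_s=2.0, gpu_hourly_usd=GPU_HOURLY_USD))
```

Because every candidate is scored by a cheap function call rather than a real deployment, the loop can sweep the whole grid on a CPU; swapping `toy_simulate` for a call into an actual simulator is the part this sketch leaves open.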

Publication Downloads

VIDUR: LLM Simulator

November 22, 2023

Vidur is a high-fidelity and extensible LLM inference simulator. It can help you plan capacity and find the best configuration for your LLM deployments; test new research ideas, such as new scheduling algorithms or optimizations like speculative decoding; and study the system performance of models under different workloads and configurations, all without access to GPUs except for a quick initial profiling phase. Please refer to our MLSys'24 paper for more details. We have a live demo that captures the capabilities of the system.
