Gandiva is a cluster scheduling framework that utilizes domain-specific knowledge of deep learning to improve the efficiency of training deep learning models in a GPU cluster. By co-designing the cluster scheduler and the deep learning framework (e.g., PyTorch), Gandiva can communicate richer information and exercise richer control between the two layers, enabling better scheduling.
The two key requirements of a scheduler for deep learning jobs are to provide (a) low-latency feedback (to enable fast iteration during hyper-parameter search or AutoML) and (b) high resource efficiency (to manage cost). Gandiva achieves these twin goals by exploiting a key characteristic of deep learning: intra-job predictability. Deep learning training jobs perform numerous repetitive iterations called mini-batches, with each mini-batch nearly identical to the others in terms of resource usage. Gandiva exploits this intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. The same predictability is also used to introspect job performance and dynamically migrate jobs to better-fit GPUs, thereby improving cluster efficiency. Knowledge of a job's internal characteristics (such as mini-batch boundaries) enables the Gandiva scheduler to perform application-aware profiling: for example, migration decisions are based on actual useful application throughput, rather than on black-box metrics such as GPU utilization that conflate useful work with overhead due to interference.
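To make the mechanism concrete, the following is a minimal PyTorch-style sketch (not Gandiva's actual API) of a training loop instrumented at mini-batch boundaries: it reports per-mini-batch timings so the scheduler can reason about useful throughput, and it honors suspend requests only at a boundary, where the GPU memory footprint is smallest. The `scheduler_client` object and its methods (`report_minibatch_time`, `suspend_requested`, `suspend`) are hypothetical placeholders for the scheduler/framework interface.

```python
import time
import torch

def train_with_scheduler_hooks(model, optimizer, data_loader, scheduler_client):
    """Training loop with hypothetical Gandiva-style hooks at mini-batch boundaries."""
    model.train()
    for inputs, targets in data_loader:
        start = time.time()

        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()  # ensure GPU work is done before timing

        # Report useful-work timing at the mini-batch boundary; the scheduler
        # can compare these timings before and after packing or migration,
        # instead of relying on black-box GPU utilization counters.
        scheduler_client.report_minibatch_time(time.time() - start)

        # Honor suspend requests only at the boundary, when activations are
        # freed and only weights and optimizer state need to be preserved.
        if scheduler_client.suspend_requested():
            checkpoint = {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            }
            scheduler_client.suspend(checkpoint)  # resume later, possibly on another GPU
```

This sketch illustrates the design point rather than the implementation: because every mini-batch looks alike, suspending at a boundary yields a cheap, consistent snapshot, and per-mini-batch timings are a faithful proxy for application throughput.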