SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism
- Feng Yan
- Yuxiong He
- Olatunji Ruwase
- Evgenia Smirni
SC '16 | Published by IEEE Press, Piscataway, NJ, USA ©2016
Deep neural networks (DNNs) have enabled a variety of artificial intelligence applications. These applications are backed by large DNN models running in serving mode on a cloud computing infrastructure. Given the compute-intensive nature of large DNN models, a key challenge for DNN serving systems is to minimize request response latencies. This paper characterizes the behavior of different parallelism techniques for supporting scalable and responsive serving systems for large DNNs. We identify and model two important properties of DNN workloads: homogeneous request service demand, and interference among concurrently running requests due to cache/memory contention. These properties motivate the design of SERF, a dynamic scheduling framework powered by an interference-aware, queueing-based analytical model. We evaluate SERF in the context of an image classification service using several well-known benchmarks. The results demonstrate its accurate latency prediction and its ability to adapt to changing load conditions.
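To make the flavor of an interference-aware, queueing-based latency prediction concrete, here is a minimal sketch. This is not the paper's actual model; the function name, the multiplicative `interference_factor`, and the M/D/1-style waiting-time term are all illustrative assumptions, chosen because the abstract notes homogeneous (near-deterministic) service demand and cache/memory contention among co-running requests.

```python
# Hypothetical sketch of an interference-aware latency predictor.
# Assumes homogeneous request service demand (deterministic service times)
# and a multiplicative slowdown per additional co-running request.

def predict_latency(arrival_rate, base_service_time, degree_parallelism,
                    interference_factor=1.2):
    """Estimate mean response time (seconds) for a serving instance.

    arrival_rate        -- request arrival rate in requests/second
    base_service_time   -- per-request service time with no contention
    degree_parallelism  -- number of requests served concurrently
    interference_factor -- hypothetical slowdown per extra co-runner,
                           modeling cache/memory contention
    """
    # Contention inflates each request's service time.
    service_time = base_service_time * interference_factor ** (degree_parallelism - 1)

    # Utilization of the shared instance.
    utilization = arrival_rate * service_time / degree_parallelism
    if utilization >= 1.0:
        return float("inf")  # overloaded: queue grows without bound

    # M/D/1-style waiting time, suited to deterministic service demand.
    wait = utilization * service_time / (2.0 * (1.0 - utilization))
    return service_time + wait
```

A scheduler built on such a model could evaluate candidate parallelism configurations and pick the one minimizing predicted latency under the current load, re-evaluating as the load changes.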