Morpheus: Towards Automated SLOs for Enterprise Clusters

Sangeetha Abdu Jyothi; Carlo Curino; Ishai Menache; Shravan Matthur Narayanamurthy; Alexey Tumanov; Jonathan Yaniv; Ruslan Mavlyutov; Íñigo Goiri; Subru Krishnan; Janardhan (Jana) Kulkarni; Sriram Rao

2016 International Symposium on Operating Systems Design and Implementation (OSDI) | November 2016

Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and job performance predictability, coveted respectively by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data; 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability; and 3) mitigates inherent performance variance (e.g., due to failures) by dynamically reprovisioning jobs. We validate these ideas against production traces from a 50k-node cluster and show that Morpheus can lower the number of deadline violations by 5x to 13x while retaining cluster utilization and lowering cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.
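To make the idea of inferring an SLO from historical data concrete, here is a minimal sketch. It is not Morpheus's inference method, which the abstract does not describe. It assumes that a deadline for a periodic job can be approximated by a high quantile of that job's past completion times. The function names `infer_deadline` and `count_violations`, the nearest-rank quantile rule, and the sample data are all hypothetical and chosen only for illustration.

```python
import math


def infer_deadline(history, quantile=0.95):
    """Return the nearest-rank quantile of past completion times.

    Assumption: a high quantile of historical completion times stands in
    for the user's implicit deadline. This is an illustrative heuristic,
    not the technique used by Morpheus.
    """
    if not history:
        raise ValueError("need at least one historical run")
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(history)
    rank = math.ceil(quantile * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def count_violations(completions, deadline):
    """Count runs that finished strictly after the deadline."""
    return sum(1 for t in completions if t > deadline)


# Hypothetical completion times (seconds after submission) of one periodic job.
history = [3600, 3700, 3550, 3900, 3650, 4100, 3600, 3750, 3680, 3720]

deadline = infer_deadline(history, quantile=0.9)
violations = count_violations(history, deadline)
print(f"inferred deadline: {deadline}s, past runs violating it: {violations}")
```

A stricter quantile gives a looser deadline with fewer violations, while a lower one gives a tighter deadline that more past runs would have missed.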