An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection
- Yichen Li
- Xu Zhang
- Shilin He
- Zhuangbin Chen
- Yu Kang
- Jinyang Liu
- Liqun Li
- Yingnong Dang
- Feng Gao
- Zhangwei Xu
- Saravan Rajmohan
- Qingwei Lin 林庆维
- Dongmei Zhang
- Michael R. Lyu
ACM SIGOPS Operating Systems Review, Vol 56(1): pp. 1-7
Cloud incidents (service interruptions or performance degradation) dramatically degrade the reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss. After years of effort, cloud providers are able to resolve most incidents automatically and rapidly. The key to this ability is intelligent incident detection. Only when incidents are detected promptly, accurately, and comprehensively can they be diagnosed and mitigated at a satisfactory speed. To overcome the limitations of traditional rule-based detection, we carried out years of incident detection research and developed a comprehensive AIOps (Artificial Intelligence for IT Operations) framework for incident detection that comprises a set of data-driven methods. This paper shares our recent experience of developing and deploying such an intelligent incident detection system at Microsoft. We first discuss the real-world challenges of incident detection that constitute the pain points of engineers. Then, we summarize the intelligent solutions we have proposed in recent years to tackle these challenges. Finally, we describe the deployment of the incident detection AIOps framework and use real cases to demonstrate the practical benefits it brings to Microsoft cloud services.