F3: Fault Forecasting Framework for Cloud Systems

  • Pu Zhao
  • Chuan Luo
  • ,
  • Youjiang Wu
  • Yingnong Dang
  • Murali Chintalapati
  • Susy Yi
  • Paul Wang
  • Andrew Zhou
  • Saravanakumar Rajmohan
  • Qingwei Lin

In recent years, cloud systems (e.g., Microsoft Azure) have grown explosively, and a wide variety of software services are now deployed on them. Because cloud systems must serve customers on a 24/7 basis, high service reliability is essential. To reduce the number of faults in cloud systems, many machine-learning-based fault forecasting methods have been proposed. These methods aim to predict faults in advance so that proactive actions can be taken to avoid negative impact, and each mainly focuses on a specific hardware component (e.g., disk, memory, or node). In cloud systems, many fault forecasting tasks share similar characteristics: 1) they are based on temporal monitoring data, and 2) they usually face similar challenges (e.g., extreme data imbalance). In this work, we present F3, a unified fault forecasting framework for cloud systems. F3 introduces an end-to-end pipeline for a variety of fault forecasting tasks; the pipeline consists of several critical parts (e.g., data processing, fault forecasting, prediction result interpretation, and action decision). As a result, when a new fault forecasting task arrives, F3 can be adapted to handle it easily and effectively. F3 also addresses several challenges, including extreme data imbalance, data inconsistency between online and offline environments, and model overfitting. More encouragingly, F3 has been successfully applied to Microsoft Azure and has helped significantly reduce the number of virtual machine interruptions.