PerfAugur: Robust Diagnostics for Performance Anomalies in Cloud Services

ICDE - 31st International Conference on Data Engineering |

Published by IEEE

Cloud platforms involve multiple independently developed components, often executing on diverse hardware configurations and across multiple data centers. This complexity makes tracking various key performance indicators (KPIs) and manual diagnosing of anomalies in system behavior both difficult and expensive. In this paper, we describe Argus, an automated system for mining service logs to identify anomalies and help formulate data-driven hypotheses.

Argus includes a suite of efficient mining algorithms for detecting significant anomalies in system behavior, along with potential explanations for such anomalies, without the need for an explicit supervision signal. We perform extensive experimental evaluation using both synthetic and real-life data sets, and present detailed case studies showing the impact of this technology on operations of the Windows Azure Service.