Using Statistical Monitoring to Detect Failures in Internet Services

Since the Internet’s popular emergence in the mid-1990s, Internet services such as e-mail and messaging systems, search engines, e-commerce, and news and financial sites have become an important and often mission-critical part of our society. Unfortunately, managing these systems and keeping them running is a significant challenge. Their rapid rate of change, as well as their size and complexity, means that the developers and operators of these services usually have only an incomplete idea of how the system works and even of what it is supposed to do. This results in poor fault management, as operators have a hard time diagnosing faults and an even harder time detecting them. This dissertation argues that statistical monitoring (the use of statistical analysis and machine learning techniques to analyze live observations of a system’s behavior) can be an important tool in improving the manageability of Internet services. Statistical monitoring has several important features that are well suited to managing Internet services. First, the dynamic analysis of a system’s behavior in statistical monitoring means that there is no dependency on specifications or descriptions that might be stale or incorrect. Second, monitoring a live, deployed system gives insight into system behavior that cannot be achieved in QA or testing environments. Third, automatic analysis through statistical monitoring can better cope with larger and more complex systems, aiding human operators as well as automating parts of the system management process.

The first half of this thesis focuses on a methodology to detect failures in Internet services, including high-level application failures, by monitoring structural behaviors that reflect the high-level functionality of the service. We implemented prototype fault monitors for a testbed Internet service and a clustered hashtable system. We also present encouraging early results from applying these techniques to two real, large Internet services.

In the second half of this thesis, we apply statistical monitoring techniques to two other problems related to fault detection: automatically inferring undocumented system structure and invariants, and localizing the potential cause of a failure given its symptoms. We apply the former to the Windows Registry, a large, poorly documented, and error-prone configuration database used by the Windows operating system and Windows-based applications. We describe and evaluate the latter in the context of our testbed Internet service.

Our experiences provide strong support for statistical monitoring, and suggest that it may prove to be an important tool in improving the manageability and reliability of Internet services.