The Cloud Systems Reliability research group at Microsoft Research aims to develop practical tools and techniques that help cloud developers debug, test, monitor, and troubleshoot their systems. Our research combines Distributed Systems, Programming Languages, Software Engineering, and Machine Learning techniques and spans all aspects of improving the reliability and availability of large-scale cloud systems, including:
- Analyzing production data to understand how real cloud systems fail and what can be done to prevent such failures.
- Developing practical static and dynamic analysis techniques to uncover hard-to-find bugs before they reach production. The techniques can be evaluated and used on thousands of Microsoft software projects.
- Developing practical and novel techniques for failure diagnosis, runtime monitoring, logging, and failure prevention.
- Developing solutions that help engineers troubleshoot production incidents quickly.