{"id":939198,"date":"2023-05-16T09:00:00","date_gmt":"2023-05-16T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=939198"},"modified":"2023-05-11T10:59:06","modified_gmt":"2023-05-11T17:59:06","slug":"large-language-models-for-automatic-cloud-incident-management","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/large-language-models-for-automatic-cloud-incident-management\/","title":{"rendered":"Large-language models for automatic cloud incident management"},"content":{"rendered":"\n
This research was accepted by the IEEE\/ACM International Conference on Software Engineering (ICSE)<\/span><\/a>, which is a forum for researchers, practitioners, and educators to gather, present, and discuss the most recent innovations, trends, experiences, and issues in the field of software engineering.<\/em><\/p>\n\n\n\n The Microsoft 365 Systems Innovation<\/a> research group has a paper accepted at the 45th<\/sup> International Conference on Software Engineering (ICSE)<\/a>, widely recognized as one of the most prestigious research conferences on software engineering. This paper, Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models<\/a>, focuses on using state-of-the-art large language models (LLMs) to help generate recommendations for cloud incident root cause analysis and mitigation plans. Through a rigorous study of real production incidents and an analysis of several LLMs in different settings, using semantic and lexical metrics as well as human evaluation, the research shows the efficacy and future potential of using AI for resolving cloud incidents.<\/p>\n\n\n\n Building highly reliable hyperscale cloud services such as Microsoft 365 (M365), which supports the productivity of hundreds of thousands of organizations, is very challenging. The challenges include quickly detecting incidents<\/em>, then performing root cause analysis<\/em> and mitigation<\/em>.<\/p>\n\n\n\n Our recent research starts with understanding the fundamentals of production incidents: we analyze the life cycle of incidents, then determine the common root causes, mitigations, and engineering efforts for resolution. In a previous paper, How to Fight Production Incidents? 
An Empirical Study on a Large-scale Cloud Service<\/span><\/a>, which won a Best Paper award at SoCC\u201922<\/span><\/a>, we provide a comprehensive, multi-dimensional empirical study of production incidents from Microsoft Teams. From this study, we envision that automation should support incident diagnosis and help identify the root cause and mitigation steps to quickly resolve an incident and minimize customer impact. We should also leverage past lessons to build resilience for future incidents. We posit that adopting AIOps and using state-of-the-art AI\/ML technologies can help achieve both goals, as we show in the ICSE paper.<\/em><\/p>\n\n\n\n \n\t\t<\/span>\n\t<\/p>\n\t\n\tChallenges of building reliable cloud services<\/h2>\n\n\n\n