{"id":928572,"date":"2023-04-06T08:41:17","date_gmt":"2023-04-06T15:41:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=928572"},"modified":"2023-12-18T07:52:48","modified_gmt":"2023-12-18T15:52:48","slug":"towards-highly-reliable-services-with-aiops","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/towards-highly-reliable-services-with-aiops\/","title":{"rendered":"Towards Highly Reliable Services with AIOps"},"content":{"rendered":"<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/rujiawang\/\" target=\"_blank\" rel=\"noopener\"><em>Rujia Wang<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em>, Principal Research PM; <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chetanb\/\" target=\"_blank\" rel=\"noopener\"><em>Chetan Bansal<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em>, Principal Research Manager; <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/saravar\/\" target=\"_blank\" rel=\"noopener\"><em>Saravan Rajmohan<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em>, Partner Director AI & Applied Research; and <\/em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/in\/jim-kleewein-2395a3\/\" target=\"_blank\" rel=\"noopener\"><em>Jim Kleewein<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em>, Technical Fellow<\/em><\/p>\n\n\n<blockquote class=\"wp-block-quote is-style-spectrum\"><p style=\"font-size:18px\">For well over a decade, Microsoft has provided one of the world&#8217;s most popular hyper-scale productivity suite, Office 365, which is now part of Microsoft 365. Microsoft 365 includes hundreds of different services running billions of transactions a second on hundreds of thousands of servers in many dozens of data centers worldwide. 
It delivers day-to-day cloud services to hundreds of millions of enterprise, education and consumer users.<\/p>\n<p style=\"font-size:18px\">Those services can never be down. Our services are used by hospitals and trauma centers, power grid providers, national, state, and local governments, major banks and financial services providers, airlines, shipping and logistics providers, and businesses from the largest to the smallest. To meet their needs, we must be continuously available, which means 100% availability over long periods of time. Our services should operate seamlessly through disasters because disasters are often when our services are most essential, for example to coordinate emergency response.<\/p>\n<p style=\"font-size:18px\">Therein lies a great challenge. Our extreme scale means that in our services &#8220;one in a billion&#8221; events are not rare; they are commonplace. At the same time, we cannot allow those &#8220;one in a billion&#8221; events to compromise the availability of our service. This combination of almost unbelievably massive scale and extreme criticality requires us to continuously rethink and improve every aspect of services architecture, design, development, and operations. One important aspect of achieving continuous availability and highly reliable services is to understand incidents holistically and mitigate their impact on customers.<\/p>\n<p style=\"font-size:18px\">Beyond using Artificial Intelligence (AI) and Machine Learning (ML) for developing new productive features and capabilities that delight our users, we are also leveraging the power of AI and ML for improving service availability and reliability, which is essential for our hyper-scale services. This article shows one example of applying AI to managing the production incident life cycle. 
We plan to share more examples in future articles.<\/p>\n<cite><em>&#8212; Jim Kleewein, Technical Fellow, Microsoft 365<\/em><\/cite><\/blockquote>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"acknowledgement\"><em>Acknowledgement<\/em><\/h5>\n\n\n\n<p><em>This post includes contributions from <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/supriyoghosh\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Supriyo Ghosh<\/em><\/a><em>,&nbsp;<\/em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/sites.google.com\/site\/toufiqueparag\/home\" target=\"_blank\" rel=\"noreferrer noopener\">Toufique Ahmed<\/a>,&nbsp;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/manishs.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Manish Shetty<\/a>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sumann\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Suman Nath<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/tzimmer\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Tom Zimmermann<\/em><\/a><em>,&nbsp;<\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xuchaozhang\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Xuchao Zhang<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/kay\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Yu Kang<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/qlin\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Qingwei Lin<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dongmeiz\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Dongmei Zhang<\/em><\/a><em>.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"introduction\">Introduction<\/h3>\n\n\n\n<p>Microsoft 365 (\u201cM365\u201d) is the world\u2019s 
largest productivity cloud. Hundreds of thousands of organizations of all sizes use it. Whether you&#8217;re having a Teams meeting, composing emails in Outlook or collaborating on a Word document with your colleagues, you\u2019re relying on M365 to power these productivity tools and applications. M365 is powered by web-scale, massively distributed cloud services with exabytes of data handled by O(100K) servers in O(100) datacenters around the globe. To ensure best-in-class productivity experiences, it\u2019s critical that our engineering infrastructure is highly reliable while being efficient at the same time.<\/p>\n\n\n\n<p>Here at the M365 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/systems-innovation\/\">System Innovation<\/a> research group, we leverage the power of AI and integrate Cloud Intelligence and AIOps into our services and products. We are using innovative AI\/ML technologies and algorithms to help design, build, and operate complex cloud infrastructures and services, and to provide a step function improvement in operational <em>efficiency<\/em> and <em>reliability<\/em>, enabling us to deliver best-in-class productivity experiences. We are applying AIOps to several domains:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AI for Systems to make intelligence a built-in capability to achieve high quality, high efficiency, self-control, and self-adaptation with less human intervention.&nbsp;<\/li>\n\n\n\n<li>AI for Customers to leverage AI\/ML to create unparalleled user experiences and achieve exceptional user satisfaction using cloud services.&nbsp;<\/li>\n\n\n\n<li>AI for DevOps to infuse AI\/ML into the entire software development lifecycle to achieve high developer productivity.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>Helping build highly reliable cloud services has been one of our key focus areas. One of the challenges is to quickly <em>identify<\/em>, <em>analyze,<\/em> and <em>mitigate<\/em> incidents. 
Our research starts from the fundamental of the production incidents: we analyze the life cycle of incidents, understand the common root causes, mitigations, and engineering efforts for resolution.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"understanding-production-incidents\">Understanding Production Incidents<\/h3>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"823\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4.jpg\" alt=\"diagram\" class=\"wp-image-933153\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4-300x154.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4-1024x527.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4-768x395.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4-1536x790.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-1-642ddf28f2fe4-240x123.jpg 240w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 1: The overview of service reliability problems in large-scale cloud services<\/strong><\/figcaption><\/figure>\n\n\n\n<p>Our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/how-to-fight-production-incidents-an-empirical-study-on-a-large-scale-cloud-service\/\">award winning paper<\/a> provides a comprehensive multi-dimensional empirical study of production incidents on large-scale M365 cloud used by Microsoft Teams. Since Microsoft-Teams powers real-time communication, reliability is paramount. 
Understanding production incidents, from detection, root-causing, and mitigation perspectives, is the first step to build better monitoring and automation tools. Figure 1 shows the overview of service reliability problems on large-scale cloud services, summarized by our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/how-to-fight-production-incidents-an-empirical-study-on-a-large-scale-cloud-service\/\">research paper<\/a>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"common-root-causes-and-mitigations-behind-incidents\">Common root causes and mitigations behind Incidents<\/h4>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-1 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"938\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8.jpg\" alt=\"piechart\" class=\"wp-image-933159\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8-300x176.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8-1024x600.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8-768x450.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8-1536x900.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8-480x280.jpg 480w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-2-642ddf7865ea8-240x141.jpg 240w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><figcaption 
class=\"wp-element-caption\"><strong>Figure 2: Breakdown of root cause analysis (RCA) and mitigation categories<\/strong><\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>While code bugs are the most frequent cause of incidents, majority of the incidents (~60%) were caused due to non-code\/non-config related issues in infrastructure, deployment, and service dependencies. We also observed that among the 40% incidents that were caused by code\/configuration bugs, nearly 80% of incidents were mitigated without a code or configuration fix.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"ttd-and-ttm-for-root-causes-and-mitigations\">TTD and TTM for root causes and mitigations<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"453\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c.jpg\" alt=\"RCA categories\" class=\"wp-image-933117\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c-300x85.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c-1024x290.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c-768x217.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c-1536x435.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-3-642ddaa72c71c-240x68.jpg 240w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 3: Average TTD and TTM for different root causes categories<\/strong><\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" 
height=\"428\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf.jpg\" alt=\"Mitigation steps\" class=\"wp-image-933120\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf-300x80.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf-1024x274.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf-768x205.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf-1536x411.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-4-642ddabe7b4cf-240x64.jpg 240w\" sizes=\"(max-width: 1600px) 100vw, 1600px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 4: Average TTD and TTM for different mitigation steps<\/strong><\/figcaption><\/figure>\n\n\n\n<p>The time to detect (TTD) and time to mitigate (TTM) of incidents caused by code bugs and dependency failures are significantly higher than those of other incidents. Also, 30% of the mitigation delay is caused by manual mitigation steps.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"takeaways\">Takeaways<\/h4>\n\n\n\n<p>(1) Incidents caused by software bugs and external dependencies take longer to detect due to <strong>poor monitoring<\/strong>. This highlights the need for practical tools for fine-grained, in-situ system observability.<\/p>\n\n\n\n<p>(2) Incidents caused by some root-cause categories are quick to mitigate after their root-cause categories are determined. 
This suggests that <strong>the overall mitigation time of incidents caused by these categories can potentially be reduced with tools that can quickly identify the root-cause category of an incident<\/strong>.<\/p>\n\n\n\n<p>(3) Incidents caused by some root causes are inherently hard to monitor automatically (e.g., those that require monitoring global state). This suggests that developers should <strong>invest more in testing<\/strong> to uncover those root-cause categories before production, thereby avoiding such incidents.<\/p>\n\n\n\n<p>We also envision that <strong>automation<\/strong> is the future of incident diagnosis: automatically identifying the root cause and mitigation steps can help resolve incidents quickly and minimize customer impact. We should also leverage <strong>past lessons learnt<\/strong> to build resilience against future incidents. We posit that adopting AIOps and using state-of-the-art ML models, such as large language models (LLMs), can help achieve both goals.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-large-language-models-for-automatic-incident-management\">Using Large Language Models for Automatic Incident Management<\/h3>\n\n\n\n<p>Recent breakthroughs in AI have enabled large language models (LLMs) to develop a rich understanding of natural language. They have become good at understanding and reasoning from large volumes of data. They can also generalize across a diverse set of tasks and domains such as code completion, translation, and Q&amp;A. 
Given the complexities with incident management, we were motivated to evaluate the effectiveness of these LLMs in helping root cause and mitigate production incidents.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"375\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d-1024x375.jpg\" alt=\"flow diagram\" class=\"wp-image-933162\" style=\"width:728px;height:266px\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d-1024x375.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d-300x110.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d-768x281.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d-1536x563.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d-240x88.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-5-642ddf9ed622d.jpg 1600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 5: Leveraging GPT-3.x for root cause analysis and mitigation<\/strong><\/figcaption><\/figure>\n\n\n\n<p>In our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/recommending-root-cause-and-mitigation-steps-for-cloud-incidents-using-large-language-models\/\">recent work<\/a> which we will be presenting at ICSE 2023 Conference, for the first time, <strong>we demonstrate the usefulness of LLMs for production incident diagnosis<\/strong>. When an incident is created, the author would specify a title for the incident and describe any relevant details such as any error messages, anomalous behavior and other details which could potentially help with resolution. 
We use the title and the summary of a given incident as the input to the LLMs and generate the root cause and mitigation steps.<\/p>\n\n\n\n<p>We conducted a rigorous study on more than 40,000 incidents and compared several LLMs in zero-shot, fine-tuned and multi-task settings. We find that fine-tuning the GPT-3 and GPT-3.5 models significantly improves the effectiveness of LLMs for incident data.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"effectiveness-of-gpt-3-x-models-at-finding-root-causes\">Effectiveness of GPT-3.x models at finding root causes<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1092\" height=\"258\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/03\/tab1.png\" alt=\"Table 1: Lexical and semantic performance of different LLMs\" class=\"wp-image-928611\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/03\/tab1.png 1092w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/03\/tab1-300x71.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/03\/tab1-1024x242.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/03\/tab1-768x181.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/03\/tab1-240x57.png 240w\" sizes=\"(max-width: 1092px) 100vw, 1092px\" \/><figcaption class=\"wp-element-caption\"><strong>Table 1: Lexical and semantic performance of different LLMs<\/strong><\/figcaption><\/figure>\n\n\n\n<p>In our offline evaluation, we compared the performance of GPT-3.5 against three GPT-3 models by computing three lexical similarity metrics between the generated recommendations and the ground-truth root cause or mitigation steps recorded in the incident management (IcM) portal. 
The <em>average <\/em>gains in GPT-3.5 metrics for different tasks are as follows:&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\" style=\"list-style-type:1\">\n<li>For root cause and mitigation recommendation tasks, <strong>Davinci-002 (GPT-3.5) provides gains of at least 15.38% and 11.9% over all the GPT-3 models<\/strong>, respectively, as shown in Table 1.<\/li>\n\n\n\n<li>When we generate mitigation plans by adding the root cause as input to the model, the GPT-3.5 model provides a gain of at least 11.16% over the three GPT-3 models.<\/li>\n\n\n\n<li>We observe that LLMs perform better on machine-reported incidents (MRIs) than on customer-reported incidents (CRIs) due to the repetitive nature of the MRIs.<\/li>\n\n\n\n<li><strong>Fine-tuning LLMs with incident data improves the performance significantly.<\/strong> The fine-tuned GPT-3.5 model improves the average lexical similarity score by 45.5% for root cause generation and 131.3% for mitigation generation tasks over the zero-shot setting (i.e., inferencing directly on the pretrained GPT-3 or GPT-3.5 model).<\/li>\n<\/ol>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"looking-through-the-incident-owners-eyes\">Looking Through the Incident Owners\u2019 Eyes<\/h4>\n\n\n\n<p>In addition to the analysis with semantic and lexical metrics, we also interviewed the incident owners to evaluate the effectiveness of the generated recommendations. Overall, GPT-3.5 outperforms GPT-3 in the majority of the metrics. <strong>More than 70% of on-call engineers (OCEs) gave a rating of 3 or above (out of 5) for the usefulness of recommendations in a real-time production setting.<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"looking-forward\">Looking Forward<\/h3>\n\n\n\n<p>While we are at the initial stages of using LLMs to help automate incident resolution, we see many open research questions in this field whose answers could significantly increase the efficacy and accuracy of LLMs. 
For instance, how can we incorporate additional context about the incident such as discussion entries, logs, service metrics and even dependency graphs of the impacted services to improve the diagnosis. Another challenge is regarding staleness since the models would need to be frequently retrained with the latest incident data. To solve these challenges, we are working on leveraging the latest ChatGPT model combined with retrieval augmented approaches to improve incident diagnosis via a conversational interface. For instance, ChatGPT can assist engineers to efficiently determine the incident&#8217;s root cause by raising hypotheses and answering critical questions with a feedback loop.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"729\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-scaled.jpg\" alt=\"diagram\" class=\"wp-image-933141\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-scaled.jpg 2560w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-300x85.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-1024x292.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-768x219.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-1536x437.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-2048x583.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/04\/figure-6-642dddea6359f-240x68.jpg 240w\" sizes=\"(max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 6: Workflow of Retrieval-augmented RCA<\/strong><\/figcaption><\/figure>\n\n\n\n<p>Moreover, ChatGPT can be 
actively integrated into the &#8220;discussion&#8221; of the incident diagnosis. By collecting evidence from available documents and logs, the model can generate coherent, contextual, natural-sounding responses to inquiries and offer corresponding suggestions, thereby facilitating the discussion, and accelerating the incident resolution process. We believe this has the potential of delivering a step function improvement in the overall incident management process with contextual and meaningful root causes analysis and mitigation thereby reducing significant human toil involved and bolstering our reliability & customer satisfaction.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Building highly reliable hyper-scale cloud services is quite challenging. Check out our award-winning research on understanding  production incidents from the Microsoft 365 Cloud and automating Incident management using state-of-the art GPT large-language models.<\/p>\n","protected":false},"author":42549,"featured_media":933162,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":811276,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-928572","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":811276,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/928572"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42549"}],"version-history":[{"count":53,"href":"https:\/\/www.microsoft.com\/en-us\/researc
h\/wp-json\/wp\/v2\/msr-blog-post\/928572\/revisions"}],"predecessor-version":[{"id":993615,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/928572\/revisions\/993615"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/933162"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=928572"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=928572"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=928572"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=928572"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}