{"id":928572,"date":"2023-04-06T08:41:17","date_gmt":"2023-04-06T15:41:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=928572"},"modified":"2023-12-18T07:52:48","modified_gmt":"2023-12-18T15:52:48","slug":"towards-highly-reliable-services-with-aiops","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/towards-highly-reliable-services-with-aiops\/","title":{"rendered":"Towards Highly Reliable Services with AIOps"},"content":{"rendered":"
Rujia Wang<\/em><\/a>, Principal Research PM; <\/em>Chetan Bansal<\/em><\/a>, Principal Research Manager; <\/em>Saravan Rajmohan<\/em><\/a>, Partner Director AI & Applied Research; and <\/em>Jim Kleewein<\/em><\/span><\/a>, Technical Fellow<\/em><\/p>\n\n\n For well over a decade, Microsoft has provided one of the world’s most popular hyper-scale productivity suites, Office 365, which is now part of Microsoft 365. Microsoft 365 includes hundreds of different services running billions of transactions a second on hundreds of thousands of servers in many dozens of data centers worldwide. It delivers day-to-day cloud services to hundreds of millions of enterprise, education, and consumer users.<\/p>\n Those services can never be down. Our services are used by hospitals and trauma centers, power grid providers, national, state, and local governments, major banks and financial services providers, airlines, shipping and logistics providers, and businesses from the largest to the smallest. To meet their needs, we must be continuously available, which means 100% availability over long periods of time. Our services should operate seamlessly through disasters, because disasters are often when our services are most essential: they are used to coordinate emergency response.<\/p>\n Therein lies a great challenge. Our extreme scale means that in our services “one in a billion” events are not rare; they are commonplace. At the same time, we cannot allow those “one in a billion” events to compromise the availability of our service. This combination of almost unbelievably massive scale and extreme criticality requires us to continuously rethink and improve every aspect of services architecture, design, development, and operations. 
One important aspect of achieving continuous availability and highly reliable services is to understand incidents holistically and mitigate their impact on customers.<\/p>\n Beyond using Artificial Intelligence (AI) and Machine Learning (ML) for developing new productive features and capabilities that delight our users, we are also leveraging the power of AI and ML for improving service availability and reliability, which is essential for our hyper-scale services. This article shows one example of applying AI to managing the production incident life cycle. We plan to share more examples in future articles.<\/p>\n— Jim Kleewein, Technical Fellow, Microsoft 365<\/em><\/cite><\/blockquote>\n\n\n\n This post includes contributions from Supriyo Ghosh<\/em><\/a>, <\/em>Toufique Ahmed<\/span><\/a>, Manish Shetty<\/span><\/a>, <\/em>Suman Nath<\/em><\/a>, <\/em>Tom Zimmermann<\/em><\/a>, <\/em>Xuchao Zhang<\/em><\/a>, <\/em>Yu Kang<\/em><\/a>, <\/em>Qingwei Lin<\/em><\/a>, and <\/em>Dongmei Zhang<\/em><\/a>.<\/em><\/p>\n\n\n\n Microsoft 365 (\u201cM365\u201d) is the world\u2019s largest productivity cloud. Hundreds of thousands of organizations of all sizes use it. Whether you’re having a Teams meeting, composing emails in Outlook, or collaborating on a Word document with your colleagues, you\u2019re relying on M365 to power these productivity tools and applications. M365 is powered by web-scale, massively distributed cloud services, with exabytes of data handled by O(100K) servers in O(100) datacenters around the globe. To ensure best-in-class productivity experiences, it\u2019s critical that our engineering infrastructure is both highly reliable and efficient.<\/p>\n\n\n\n Here at the M365 System Innovation<\/a> research group, we leverage the power of AI and integrate Cloud Intelligence and AIOps into our services and products. 
We are using innovative AI\/ML technologies and algorithms to help design, build, and operate complex cloud infrastructures and services, and provide a step-function improvement in operational efficiency<\/em> and reliability<\/em>, enabling us to deliver best-in-class productivity experiences. We are applying AIOps to several domains: <\/p>\n\n\n\n Helping build highly reliable cloud services has been one of our key focus areas. One of the challenges is to quickly identify<\/em>, analyze,<\/em> and mitigate<\/em> incidents. Our research starts from the fundamentals of production incidents: we analyze the life cycle of incidents and seek to understand the common root causes, mitigations, and engineering efforts for resolution.<\/p>\n\n\n\n Our award-winning paper<\/a> provides a comprehensive, multi-dimensional empirical study of production incidents on the large-scale M365 cloud used by Microsoft Teams. Since Microsoft Teams powers real-time communication, reliability is paramount. Understanding production incidents from the detection, root-causing, and mitigation perspectives is the first step toward building better monitoring and automation tools. Figure 1 shows an overview of service reliability problems on large-scale cloud services, as summarized in our research paper<\/a>.<\/p>\n\n\n\n While code bugs are the most frequent single cause of incidents, the majority of incidents (~60%) were caused by non-code\/non-config related issues in infrastructure, deployment, and service dependencies. We also observed that among the 40% of incidents that were caused by code\/configuration bugs, nearly 80% were mitigated without a code or configuration fix.<\/p>\n\n\n\n The time to detect (TTD) and time to mitigate (TTM) of incidents caused by code bugs and dependency failures are significantly higher than those of other incidents. 
Also, 30% of the mitigation delay is caused by manual mitigation steps.<\/p>\n\n\n\n (1) Incidents caused by software bugs and external dependencies take longer to detect due to poor monitoring<\/strong>. This highlights the need for practical tools for fine-grained, in-situ system observability.<\/p>\n\n\n\n (2) Incidents caused by some root-cause categories are quick to mitigate once their root-cause categories are determined. This suggests that the overall mitigation time of incidents in these categories can potentially be reduced with tools that can quickly identify their root-cause category<\/strong>.<\/p>\n\n\n\n (3) Incidents caused by some root causes are inherently hard to monitor automatically (e.g., those that require monitoring global state). This suggests that developers should invest more in testing<\/strong> to uncover those root-cause categories before production, thereby avoiding such incidents.<\/p>\n\n\n\n We also envision that automation<\/strong> is the future of incident diagnosis: identifying the root cause and mitigation steps to help resolve incidents quickly and minimize customer impact. In addition, we should leverage past lessons learnt<\/strong> to build resilience against future incidents. We posit that adopting AIOps and using state-of-the-art ML models, such as large language models (LLMs), can help achieve both goals.<\/p>\n\n\n\nAcknowledgement<\/em><\/h5>\n\n\n\n
Introduction<\/h3>\n\n\n\n
\n
Understanding Production Incidents<\/h3>\n\n\n\n

Common root causes and mitigations behind Incidents<\/h4>\n\n\n\n

TTD and TTM for root causes and mitigations<\/h4>\n\n\n\n


Takeaways<\/h4>\n\n\n\n
Using Large Language Models for Automatic Incident Management<\/h3>\n\n\n\n