{"id":928572,"date":"2023-04-06T08:41:17","date_gmt":"2023-04-06T15:41:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=928572"},"modified":"2023-12-18T07:52:48","modified_gmt":"2023-12-18T15:52:48","slug":"towards-highly-reliable-services-with-aiops","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/towards-highly-reliable-services-with-aiops\/","title":{"rendered":"Towards Highly Reliable Services with AIOps"},"content":{"rendered":"

Rujia Wang<\/em><\/span><\/a>, Principal Research PM; <\/em>Chetan Bansal<\/em><\/span><\/a>, Principal Research Manager; <\/em>Saravan Rajmohan<\/em><\/span><\/a>, Partner Director AI & Applied Research; and <\/em>Jim Kleewein<\/em><\/span><\/a>, Technical Fellow<\/em><\/p>\n\n\n

For well over a decade, Microsoft has provided one of the world’s most popular hyper-scale productivity suites, Office 365, which is now part of Microsoft 365. Microsoft 365 includes hundreds of different services running billions of transactions a second on hundreds of thousands of servers in many dozens of data centers worldwide. It delivers day-to-day cloud services to hundreds of millions of enterprise, education, and consumer users.<\/p>\n

Those services can never be down. Our services are used by hospitals and trauma centers, power grid providers, national, state, and local governments, major banks and financial services providers, airlines, shipping and logistics providers, and businesses from the largest to the smallest. To meet their needs, we must be continuously available, which means 100% availability over long periods of time. Our services should operate seamlessly through disasters, because disasters are often when our services are most essential, to coordinate emergency response.<\/p>\n

Therein lies a great challenge. Our extreme scale means that in our services “one in a billion” events are not rare; they are commonplace. At the same time, we cannot allow those “one in a billion” events to compromise the availability of our service. This combination of almost unbelievably massive scale and extreme criticality requires us to continuously rethink and improve every aspect of service architecture, design, development, and operations. One important aspect of achieving continuous availability and highly reliable services is to understand incidents holistically and mitigate their impact on customers.<\/p>\n

Beyond using Artificial Intelligence (AI) and Machine Learning (ML) to develop new productivity features and capabilities that delight our users, we are also leveraging the power of AI and ML to improve service availability and reliability, which is essential for our hyper-scale services. This article shows one example of applying AI to managing the production incident life cycle. We plan to share more examples in future articles.<\/p>\n— Jim Kleewein, Technical Fellow, Microsoft 365<\/em><\/cite><\/blockquote>\n\n\n\n

Acknowledgement<\/em><\/h5>\n\n\n\n

This post includes contributions from Supriyo Ghosh<\/em><\/a>, <\/em>Toufique Ahmed<\/a>, Manish Shetty<\/a>, <\/em>Suman Nath<\/em><\/a>, <\/em>Tom Zimmermann<\/em><\/a>, <\/em>Xuchao Zhang<\/em><\/a>, <\/em>Yu Kang<\/em><\/a>, <\/em>Qingwei Lin<\/em><\/a>, <\/em>Dongmei Zhang<\/em><\/a>.<\/em><\/p>\n\n\n\n

Introduction<\/h3>\n\n\n\n

Microsoft 365 (\u201cM365\u201d) is the world\u2019s largest productivity cloud. Hundreds of thousands of organizations of all sizes use it. Whether you’re having a Teams meeting, composing emails in Outlook, or collaborating on a Word document with your colleagues, you\u2019re relying on M365 to power these productivity tools and applications. M365 is powered by web-scale, massively distributed cloud services, with exabytes of data handled by O(100K) servers in O(100) datacenters around the globe. To ensure best-in-class productivity experiences, it\u2019s critical that our engineering infrastructure is both highly reliable and efficient.<\/p>\n\n\n\n

Here at the M365 System Innovation<\/a> research group, we leverage the power of AI and integrate Cloud Intelligence and AIOps into our services and products. We use innovative AI\/ML technologies and algorithms to help design, build, and operate complex cloud infrastructures and services, and to provide a step-function improvement in operational efficiency<\/em> and reliability<\/em>, enabling us to deliver best-in-class productivity experiences. We are applying AIOps to several domains: <\/p>\n\n\n\n