{"id":962838,"date":"2023-08-29T09:01:14","date_gmt":"2023-08-29T16:01:14","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=962838"},"modified":"2023-08-29T09:01:18","modified_gmt":"2023-08-29T16:01:18","slug":"using-ai-for-tiered-cloud-platform-operation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-ai-for-tiered-cloud-platform-operation\/","title":{"rendered":"Using AI for tiered cloud platform operation"},"content":{"rendered":"\n
\"Tiered<\/a><\/figure>\n\n\n\n
\n
Read part 1<\/a><\/div>\n\n\n\n
Read part 2<\/a><\/div>\n\n\n\n
Read part 3<\/a><\/div>\n<\/div>\n\n\n\n

Cloud Intelligence\/AIOps blog series, part 4<\/h2>\n\n\n\n

In the previous posts in this series, we introduced our research vision for Cloud Intelligence\/AIOps (part 1<\/a>) and how advanced AI can help design, build, and manage large-scale cloud platforms effectively and efficiently; we looked at solutions that are making many aspects of cloud operations more autonomous and proactive (part 2<\/a>); and we discussed an important aspect of cloud management: RL-based tuning of application configuration parameters (part 3<\/a>). In this post, we focus on the broader challenges of autonomously managing the entire cloud platform. <\/p>\n\n\n\n

In an ideal world, almost all operations of a large-scale cloud platform would be autonomous, and the platform would always be at, or converging to, the operators\u2019 desired state. However, this is not possible for a variety of reasons. Cloud applications and infrastructure are incredibly complex, and they change too much, too fast. For the foreseeable future, there will continue to be problems that are novel and\/or too complex for automated solutions, no matter how intelligent, to address. These may arise due to complex cascading or unlikely simultaneous failures, unexpected interactions between components, challenging (or malicious) changes in workloads, such as the rapid increase in traffic due to the COVID-19 pandemic, or even external factors, such as the need to reduce power usage in a particular region.<\/p>\n\n\n\n

At the same time, rapid advances in machine learning and AI are making it possible to automate more aspects of cloud operations. Our second post in this series listed a number of these, including detection of problematic deployments, fault localization, log parsing, diagnosis of failures, prediction of capacity, and optimized container reallocation.<\/p>\n\n\n\n

<\/div>\n\n\n\n\t
\"Stages<\/a>
Figure 1: Stages of evolution toward Tiered AIOps<\/figcaption><\/figure>\n\n\n\n

To reconcile these two realities, we introduce the concept of Tiered AIOps<\/em>. The idea is to separate systems and issues into tiers<\/em> of different levels of automation and human intervention. This separation comes in stages (Figure 1). The first stage has only two tiers: one where AI progressively automates routine operations and can mitigate and solve simple incidents without a human in the loop, and a second tier where expert human operators manage the long tail of incidents and scenarios that the AI systems cannot handle. As the AI in the first tier becomes more powerful, the same number of experts can manage larger and more complex cloud systems. However, this is not enough<\/em>.<\/p>\n\n\n\n

\"Tiered<\/a>
Figure 2: Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.<\/center><\/figcaption><\/figure>\n\n\n\n

New AI tools enable a final, even more scalable stage, where human expertise can also be separated into two tiers. In this stage, the middle tier involves situations and problems that the AI in the first tier cannot handle, but which can be solved by non-expert, generalist human operators. AI in this second tier helps these operators manage the platform by lowering the level of expertise needed to respond to incidents. For example, the AI could automatically localize the source of an incident, recommend mitigation actions, and provide risk estimates and explanations to help operators reason about the best mitigating action to take. Finally, the last tier relies on expert engineers for complex and high-impact incidents that automated systems and generalists are unable to solve. In other words, we have the following tiers (Figure 2):<\/p>\n\n\n\n