{"id":962838,"date":"2023-08-29T09:01:14","date_gmt":"2023-08-29T16:01:14","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=962838"},"modified":"2023-08-29T09:01:18","modified_gmt":"2023-08-29T16:01:18","slug":"using-ai-for-tiered-cloud-platform-operation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-ai-for-tiered-cloud-platform-operation\/","title":{"rendered":"Using AI for tiered cloud platform operation"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1.jpg\" alt=\"Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. 
As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.\" class=\"wp-image-962841\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/a><\/figure>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/cloud-intelligence-aiops-infusing-ai-into-cloud-computing-systems\/\">Read part 1<\/a><\/div>\n\n\n\n<div 
class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai\/\">Read part 2<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/automatic-post-deployment-management-of-cloud-applications\/\">Read part 3<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"cloud-intelligence-aiops-blog-series-part-4\">Cloud Intelligence\/AIOps blog series, part 4<\/h2>\n\n\n\n<p>In the previous posts in this series, we introduced our research vision for Cloud Intelligence\/AIOps (<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/cloud-intelligence-aiops-infusing-ai-into-cloud-computing-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">part 1<\/a>) and how advanced AI can help design, build, and manage large-scale cloud platforms effectively and efficiently; we looked at solutions that are making many aspects of cloud operations more autonomous and proactive (<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">part 2<\/a>); and we discussed an important aspect of cloud management: RL-based tuning of application configuration parameters (<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/automatic-post-deployment-management-of-cloud-applications\/\" target=\"_blank\" rel=\"noreferrer noopener\">part 3<\/a>). 
In this post, we focus on the broader challenges of autonomously managing the entire cloud platform.&nbsp;<\/p>\n\n\n\n<p>In an ideal world, almost all operations of a large-scale cloud platform would be autonomous, and the platform would always be at, or converging to, the operators\u2019 desired state. However, this is not possible for a variety of reasons. Cloud applications and infrastructure are incredibly complex, and they change too much, too fast. For the foreseeable future, there will continue to be problems that are novel and\/or too complex for automated solutions, no matter how intelligent, to address. These may arise due to complex cascading or unlikely simultaneous failures, unexpected interactions between components, challenging (or malicious) changes in workloads such as the rapid increase in traffic due to the COVID pandemic, or even external factors such as the need to reduce power usage in a particular region.<\/p>\n\n\n\n<p>At the same time, rapid advances in machine learning and AI are enabling an increase in the automation of several aspects of cloud operations. 
Our second post in this series listed a number of these, including detection of problematic deployments, fault localization, log parsing, diagnosis of failures, prediction of capacity, and optimized container reallocation.<\/p>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Stages of evolution toward Tiered AIOps\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"2202\" height=\"363\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2.png\" alt=\"Stages of evolution toward Tiered AIOps\" class=\"wp-image-962862\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2.png 2202w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2-300x49.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2-1024x169.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2-768x127.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2-1536x253.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2-2048x338.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig1_Tiered-AIOps_v2-240x40.png 240w\" sizes=\"auto, (max-width: 2202px) 100vw, 2202px\" \/><\/a><figcaption 
class=\"wp-element-caption\">Figure 1: Stages of evolution toward Tiered AIOps<\/figcaption><\/figure>\n\n\n\n<p>To reconcile these two realities, we introduce the concept of <em>Tiered AIOps<\/em>. The idea is to separate systems and issues into <em>tiers<\/em> of different levels of automation and human intervention. This separation comes in stages (Figure 1). The first stage has only two tiers: one where AI progressively automates routine operations and can mitigate and solve simple incidents without a human in the loop, and a second tier where expert human operators manage the long tail of incidents and scenarios that the AI systems cannot handle. As the AI in the first tier becomes more powerful, the same number of experts can manage larger and more complex cloud systems. <em>However, this is not enough<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2.png\" alt=\"Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. 
As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.\" class=\"wp-image-962865\" width=\"1208\" height=\"499\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2.png 1610w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2-300x124.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2-1024x423.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2-768x317.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2-1536x634.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig2-240x99.png 240w\" sizes=\"auto, (max-width: 1208px) 100vw, 1208px\" \/><\/a><figcaption class=\"wp-element-caption\"><center>Figure 2: Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.<\/center><\/figcaption><\/figure>\n\n\n\n<p>New AI tools enable a final, even more scalable stage, where human expertise can also be separated into two tiers. In this stage, the middle tier involves situations and problems that the AI in the first level cannot handle, but which can be solved by non-expert, generalist human operators. AI in this second tier helps these operators manage the platform by lowering the level of expertise needed to respond to incidents. For example, the AI could automatically localize the source of an incident, recommend mitigation actions, and provide risk estimates and explanations to help operators reason about the best mitigating action to take. 
Finally, the last tier relies on expert engineers for complex and high-impact incidents that automated systems and generalists are unable to solve. In other words, we have the following tiers (Figure 2):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Tier 1: <em>Fully autonomous platform operation<\/em>. Automates what can be learned or predicted. Includes intelligent, proactive systems that prevent failures and resolve incidents that follow patterns of past incidents.&nbsp;<\/li>\n\n\n\n<li>Tier 2: <em>Infrastructure for non-expert operators to manage systems and incidents<\/em>. Receives context from events and incidents that are not handled in the first tier. AI systems provide context, summaries, and mitigation recommendations to generalist operators.&nbsp;<\/li>\n\n\n\n<li>Tier 3: <em>Infrastructure for experts to manage systems and incidents that are novel or highly complex<\/em>. Receives context from events and incidents not handled in the first two tiers. Can enable experts to interact with and manage a remote installation.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>There are two types of AI systems involved: first, those that enable increasing levels of automation in the first and second tiers; and second, the AI systems (different types of co-pilots) that assist operators. It is the latter type that enables the division between the second and third tiers, and also reduces the risk of imperfect or incomplete systems in the first tier. This separation between the top two tiers is also crucial for the operation of air-gapped clouds and makes it more feasible to deploy new datacenters in locations where the same level of operator expertise might not be available.&nbsp;<\/p>\n\n\n\n<p>The key idea in the Tiered AIOps concept is to simultaneously expand automation and increase the number of incidents that can be handled by the first tier, while recognizing that all three tiers are critical. 
<span style=\"text-decoration: underline;\">The research agenda is to build systems and models to support automation and incident response in all three tiers.&nbsp;&nbsp;<\/span><\/p>\n\n\n\n<p><strong>Escalating incidents<\/strong>. Each tier must have safeguards to (automatically or not) escalate an issue to the next tier. For example, when the first tier detects that there is insufficient data, or that the confidence (risk) in a prediction is lower (higher) than a threshold, it should escalate, with the right context, to the next tier. &nbsp;<\/p>\n\n\n\n<p><strong>Migrating learnings<\/strong>. On the other hand, over time and with gained experience (which can be encoded in troubleshooting guides, new monitors, AI models, or better training data), repeated incidents and operations migrate toward the lower tiers, allowing operators to allocate costly expertise to highly complex and impactful incidents and decisions.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Performance and power of the SmartOverclock agent, showing near peak performance at significantly less power.\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig3.png\"><img loading=\"lazy\" decoding=\"async\" width=\"607\" height=\"509\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig3.png\" alt=\"Performance and power of the SmartOverclock agent, showing near peak performance at significantly less power.\" class=\"wp-image-962874\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig3.png 607w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig3-300x252.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig3-215x180.png 215w\" sizes=\"auto, (max-width: 607px) 100vw, 607px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 
3: Performance and power of the SmartOverclock agent, showing near peak performance at significantly less power.<\/figcaption><\/figure>\n\n\n\n<p>We will now discuss some work on extending the first tier with on-node learning, how to use new AI systems (large language models, or simply LLMs) to assist operators in mitigating incidents and to move toward enabling the second tier, and, finally, how the third tier enables air-gapped clouds.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tier-1-on-node-learning\">Tier 1: On-node learning<\/h2>\n\n\n\n<p>Managing a cloud platform requires control loops and agents with many different granularities and time scales. Some agents need to be located on individual nodes, either because of latency requirements of the decisions they make, or because they depend on telemetry that is too fine-grained and large to leave the node. Examples of these agents include configuration (credentials, firewalls, operating system updates), services like virtual machine (VM) creation, monitoring and logging, watchdogs, resource controls (e.g., power, memory, or CPU allocation), and access daemons.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Any agent that can use data about current workload characteristics or system state to guide dynamic adjustment of its behavior can potentially take advantage of machine learning (ML). However, current ML solutions such as <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/resource-central-understanding-predicting-workloads-improved-resource-management-large-cloud-platforms\/\" target=\"_blank\" rel=\"noreferrer noopener\">Resource Central<\/a> (SOSP\u201917) require data and decisions to run in a dedicated service outside of the server nodes. 
The problem is that for some agents this is not feasible, as they either have to make fast decisions or require data that cannot leave the node.&nbsp;&nbsp;<\/p>\n\n\n\n<p>In <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/sol-safe-on-node-learning-in-cloud-platforms\/\" target=\"_blank\" rel=\"noreferrer noopener\">SOL: Safe On-Node Learning in Cloud Platforms<\/a> (ASPLOS\u201922), we proposed a framework that allows local agents to use modern ML techniques in a safe, robust, and effective way. We identified three classes of local agents that can benefit from ML. First, agents that assign resources (CPU, memory, power) benefit from near real-time workload information. Making these decisions quickly and with fine-grained telemetry enables better assignments with smaller impact on customer quality of service (QoS). Second, monitoring and logging agents, which must run on each node, can benefit from online learning algorithms, such as multi-armed bandits, to decide intelligently which telemetry data to sample and at what frequency, while staying within a sampling budget. Lastly, watchdogs, which monitor for metrics that indicate failures, can benefit from learning algorithms to detect problems and take mitigating actions sooner, as well as detect and diagnose more complex problems that simpler systems would not detect.&nbsp;<\/p>\n\n\n\n<p>SOL makes it easy to integrate protections against invalid data, inaccurate or drifting AI models, and delayed predictions, and to add safeguards in the actions the models can take, through a simple interface. As examples, we developed agents to do CPU overclocking, CPU harvesting, and memory page hotness classification. In our experiments (Figure 3), the overclocking agent, for example, achieved near-peak normalized performance for different workloads, at nearly half of the power draw, while responding well to many failure conditions in the monitoring itself. 
See our paper for more details.&nbsp;<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<ul class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<li class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/sol-safe-on-node-learning-in-cloud-platforms\/\" target=\"_self\" class=\"annotations__link font-weight-semibold text-decoration-none\" data-bi-type=\"annotated-link\" aria-label=\"SOL: Safe On-Node Learning in Cloud Platforms\" data-bi-aN=\"citation\" data-bi-cN=\"SOL: Safe On-Node Learning in Cloud Platforms\">\n\t\t\t\tSOL: Safe On-Node Learning in Cloud Platforms&nbsp;<span class=\"glyph-append glyph-append-chevron-right glyph-append-xsmall\"><\/span>\n\t\t\t<\/a>\n\t\t\t\t\t<\/li>\n\t<\/ul>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tier-2-incident-similarity-and-mitigation-with-llms\">Tier 2: Incident similarity and mitigation with LLMs<\/h2>\n\n\n\n<p>As an example of how AI systems can enable the second tier, we are exploring how LLMs can help in mitigating and finding the root cause of incidents in cloud operations. When an incident happens in a cloud system, either generated by automated alarms or by customer-reported issues, a team of one or more on-call engineers must quickly find ways to mitigate the incident (resolving the symptoms), and then find the cause of the incident for a permanent fix and to avoid the incident in the future.&nbsp;&nbsp;<\/p>\n\n\n\n<p>There are many steps involved in this process, and they are highly variable. There is also context that relates to the incident, which grows as both automated systems and on-call engineers perform tests, look at logs, and go through a cycle of forming, testing, and validating hypotheses. 
We are investigating using LLMs to help with several of these steps, including automatically generating summaries of the cumulative status of an incident, finding similar incidents in the database of past incidents, and proposing mitigation steps based on these similar incidents. There is also an ever-growing library of internal troubleshooting guides (TSGs) created by engineers, together with internal and external documentation on the systems involved. We are using LLMs to extract and summarize information from these combined sources in a way that is relevant to the on-call engineer.&nbsp;&nbsp;<\/p>\n\n\n\n<p>We are also using LLMs to find the root cause of incidents. In a <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/large-language-models-for-automatic-cloud-incident-management\/\" target=\"_blank\" rel=\"noreferrer noopener\">recent paper published in ICSE<\/a> (2023), the Microsoft 365 Systems Innovation research group demonstrated the usefulness of LLMs in determining the root cause of incidents from the title and summary of the incident. In a survey conducted as part of the work, more than 70% of the on-call engineers gave a rating of 3 out of 5 or better on the usefulness of the recommendations in a real-time incident resolution setting.&nbsp;&nbsp;<\/p>\n\n\n\n<p>There is still enormous untapped potential in using these methods, along with some interesting challenges. In aggregate, these efforts are a great step toward the foundation for the second tier in our vision. 
They can assist on-call engineers, enable junior engineers to be much more effective in handling more incidents, reduce the time to mitigation, and, finally, give room for the most expert engineers to work on the third tier, focusing on complex, atypical, and novel incidents.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"tier-3-air-gapped-clouds\">Tier 3: Air-gapped clouds<\/h2>\n\n\n\n<p>We now turn to an example where the separation between the second and third tiers could enable significantly simplified operations. Air-gapped datacenters, characterized by their isolated nature and restricted access, provide a secure environment for managing sensitive data while prioritizing privacy. In such datacenters, direct access is limited and highly controlled, being operated locally by authorized employees, ensuring that data is handled with utmost care and confidentiality. However, this level of isolation also presents unique challenges when it comes to managing the day-to-day operations and addressing potential issues, as Microsoft\u2019s expert operators do not have physical or direct access to the infrastructure. &nbsp;<\/p>\n\n\n\n<p>In such an environment, future tiered AIOps could improve operations, while maintaining the strict data and communication isolation requirements. The first tier would play a critical role by significantly reducing the occurrence of incidents through the implementation of automated operations. However, the second and third tiers would be equally vital. The second tier would empower local operators on-site to address most issues that the first tier cannot. Even with AI assistance, there would be instances requiring additional expertise beyond that which is available locally. Unfortunately, the experts in the third tier may not even have access to remote desktops, or to the results of queries or commands. 
LLMs would serve a crucial role here, as they could become an ideal intermediary between tiers 2 and 3, sharing high-level descriptions of problems without sending sensitive information. &nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"LLM-intermediated communication between remote experts (Tier 3) and generalist operators (Tier 2) to solve problems in an air-gapped datacenter.\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"397\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4-1024x397.png\" alt=\"LLM-intermediated communication between remote experts (Tier 3) and generalist operators (Tier 2) to solve problems in an air-gapped datacenter.\" class=\"wp-image-962886\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4-1024x397.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4-300x116.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4-768x298.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4-240x93.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOps4_Fig4.png 1299w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4: LLM-intermediated communication between remote experts (Tier 3) and generalist operators (Tier 2) to solve problems in an air-gapped datacenter.<\/figcaption><\/figure>\n\n\n\n<p>In an interactive session (Figure 4), an LLM with access to the air-gapped datacenter systems could summarize and sanitize the problem description in natural language (\u2460). 
A remote expert in Tier 3 would then formulate hypotheses and send high-level instructions in natural language for more investigation or for mitigation (\u2461). The LLM could use the high-level instructions to form a specialized plan. For example, it could query devices with a knowledge of the datacenter topology that the expert does not have; interpret, summarize, and sanitize the results (with or without the help of the generalist, on-site operators) (\u2462); and send the interpretation of the results back to the experts, again in natural language (\u2463). Depending on the problem, this cycle could repeat until the problem is solved (\u2464). Crucially, while the operators at the air-gapped cloud would be in the loop, they wouldn\u2019t need deep expertise in all systems to perform the required actions and interpret the results.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\">Conclusion<\/h2>\n\n\n\n<p>Cloud platform operators have seen massive, continuous growth in scale. To remain competitive and viable, we must decouple the scaling of human support operations from this growth. AI offers great hope in increasing automation of platform management, but because of constant change in the systems, environment, and demands, there will likely always be decisions and incidents requiring expert human input. In this post, we described our vision of Tiered AIOps as the way to enable and achieve this decoupling and maximize the effectiveness of both AI tools and human expertise.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cloud Intelligence\/AIOps research from Microsoft could help organizations autonomously manage the entire cloud platform. 
Find out how.<\/p>\n","protected":false},"author":42183,"featured_media":962841,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[1],"tags":[],"research-area":[13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-962838","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144927,282170],"related-projects":[573039],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Rodrigo Fonseca","user_id":40429,"display_name":"Rodrigo Fonseca","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/rofons\/\" aria-label=\"Visit the profile page for Rodrigo Fonseca\">Rodrigo Fonseca<\/a>","is_active":false,"last_first":"Fonseca, Rodrigo","people_section":0,"alias":"rofons"},{"type":"user_nicename","value":"Pedro Las-Casas","user_id":40465,"display_name":"Pedro Las-Casas","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/pedrobr\/\" aria-label=\"Visit the profile page for Pedro Las-Casas\">Pedro Las-Casas<\/a>","is_active":false,"last_first":"Las-Casas, Pedro","people_section":0,"alias":"pedrobr"},{"type":"user_nicename","value":"Alok Kumbhare","user_id":36086,"display_name":"Alok Kumbhare","author_link":"<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/alok-kumbhare\/\" aria-label=\"Visit the profile page for Alok Kumbhare\">Alok Kumbhare<\/a>","is_active":false,"last_first":"Kumbhare, Alok","people_section":0,"alias":"Alok Kumbhare"},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":33393,"display_name":"Ricardo Bianchini","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ricardob\/\" aria-label=\"Visit the profile page for Ricardo Bianchini\">Ricardo Bianchini<\/a>","is_active":false,"last_first":"Bianchini, Ricardo","people_section":0,"alias":"ricardob"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Tiered AIOps. Incidents not resolved by a tier get escalated to the next one. As upper tiers solve these incidents, this knowledge propagates to the previous tiers to improve their coverage with new models and labeled data.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-343x193.jpg 343w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/08\/AIOPS4-blog-hero-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/rofons\/\" title=\"Go to researcher profile for Rodrigo Fonseca\" aria-label=\"Go to researcher profile for Rodrigo Fonseca\" data-bi-type=\"byline author\" data-bi-cN=\"Rodrigo Fonseca\">Rodrigo Fonseca<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/pedrobr\/\" title=\"Go to researcher profile for Pedro Las-Casas\" aria-label=\"Go to researcher profile for Pedro Las-Casas\" data-bi-type=\"byline author\" data-bi-cN=\"Pedro Las-Casas\">Pedro Las-Casas<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/alok-kumbhare\/\" title=\"Go to researcher profile for Alok Kumbhare\" aria-label=\"Go to researcher profile for Alok Kumbhare\" data-bi-type=\"byline author\" data-bi-cN=\"Alok Kumbhare\">Alok Kumbhare<\/a>, and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ricardob\/\" title=\"Go to researcher profile for Ricardo Bianchini\" aria-label=\"Go to researcher profile for Ricardo Bianchini\" data-bi-type=\"byline author\" data-bi-cN=\"Ricardo Bianchini\">Ricardo Bianchini<\/a>","formattedDate":"August 29, 2023","formattedExcerpt":"Cloud Intelligence\/AIOps research from Microsoft could help organizations autonomously manage the entire cloud platform. 
Find out how.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/962838","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42183"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=962838"}],"version-history":[{"count":37,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/962838\/revisions"}],"predecessor-version":[{"id":964500,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/962838\/revisions\/964500"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/962841"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=962838"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=962838"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=962838"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=962838"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=962838"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=962838"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\
/msr-locale?post=962838"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=962838"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=962838"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=962838"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=962838"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}