{"id":931554,"date":"2023-04-10T09:00:00","date_gmt":"2023-04-10T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=931554"},"modified":"2023-08-29T09:48:47","modified_gmt":"2023-08-29T16:48:47","slug":"building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai\/","title":{"rendered":"Building toward more autonomous and proactive cloud technologies with AI"},"content":{"rendered":"\n
\"Vision<\/figure>\n\n\n\n
\n
Read part 1<\/a><\/div>\n<\/div>\n\n\n\n

Cloud Intelligence\/AIOps blog series<\/h2>\n\n\n\n

In the first blog post in this series, Cloud Intelligence\/AIOps \u2013 Infusing AI into Cloud Computing Systems<\/a>, we presented a brief overview of Microsoft\u2019s research on Cloud Intelligence\/AIOps (AIOps), which develops AI and machine learning (ML) technologies to help design, build, and operate complex cloud platforms and services effectively and efficiently at scale. As cloud computing platforms have become one of the most fundamental infrastructures of our world, both their scale and complexity have grown considerably. That post discussed the three major pillars of AIOps research: AI for Systems, AI for Customers, and AI for DevOps, as well as the four major research areas that constitute the AIOps problem space: detection, diagnosis, prediction, and optimization. We also laid out the AIOps research roadmap as building toward more autonomous, proactive, manageable, and comprehensive cloud platforms.<\/p>\n\n\n\n

Vision of AIOps Research<\/h3>\n\n\n\n
Autonomous<\/strong><\/td>Proactive<\/strong><\/td>Manageable<\/strong><\/td>Comprehensive<\/strong><\/td><\/tr>
Fully automate the operation of cloud systems to minimize system downtime and reduce manual effort.<\/td>Predict future cloud status, support proactive decision-making, and prevent problems before they occur.<\/td>Introduce the notion of tiered autonomy to combine autonomous routine operations with deep human expertise.<\/td>Extend AIOps across the full cloud stack for global optimization\/management, and to multi-cloud environments.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n

Starting with this blog post, we will take a deeper dive into Microsoft\u2019s vision for AIOps research and the ongoing efforts to realize that vision. This blog post will focus on how our researchers leveraged state-of-the-art AIOps research to help make cloud technologies more autonomous and proactive. We will discuss our work to make the cloud more manageable and comprehensive in future blog posts.<\/p>\n\n\n\n

Autonomous cloud<\/h2>\n\n\n\n

Motivation<\/h3>\n\n\n\n

Cloud platforms require numerous actions and decisions every second to ensure that computing resources are properly managed and failures are promptly addressed. In practice, those actions and decisions are either generated by rule-based systems built on expert knowledge or made manually by experienced engineers. As cloud platforms continue to grow in both scale and complexity, however, it is apparent that neither approach will be sufficient for future cloud systems. On one hand, rigid rule-based systems, while grounded in expert knowledge, often involve huge numbers of rules and require frequent maintenance to preserve coverage and adaptability. It is often unrealistic to keep such systems up to date as cloud systems expand, and even more difficult to guarantee consistency and avoid conflicts among all the rules. On the other hand, manual engineering effort is time-consuming, error-prone, and difficult to scale.<\/p>\n\n\n\n

<\/div>\n\n\n\n\t

To overcome the coverage and scalability limits of existing solutions and improve the adaptability and manageability of decision-making systems, cloud platforms must shift toward a more autonomous management paradigm. Instead of relying solely on expert knowledge, we need suitable AI\/ML models that fuse operational data with expert knowledge to enable efficient, reliable, and autonomous management decisions. Still, it will take considerable research and engineering effort to overcome the barriers to developing and deploying autonomous solutions on cloud platforms.<\/p>\n\n\n\n

Toward an autonomous cloud<\/h3>\n\n\n\n

On the journey toward an autonomous cloud, there are two major challenges. The first lies in the heterogeneity of cloud data. In practice, cloud platforms deploy a huge number of monitors to collect data in various formats, including telemetry signals, machine-generated log files, and human input from engineers and users. The patterns and distributions of those data are highly diverse and subject to change over time. To ensure that AIOps solutions can function autonomously in such an environment, it is essential to equip the management system with robust and extensible AI\/ML models capable of learning useful information from heterogeneous data sources and drawing the right conclusions in various scenarios.<\/p>\n\n\n\n

The complex interaction between different components and services presents the second major challenge in deploying autonomous solutions. While it can be easy to implement autonomous features for one or a few components\/services, the true challenge for both researchers and engineers is constructing end-to-end systems capable of automatically navigating the complex dependencies in cloud systems. To address this challenge, it is important to leverage both domain knowledge and data to optimize the automation paths in application scenarios. Researchers and engineers should also implement reliable decision-making algorithms at every decision stage to improve the efficiency and stability of the whole end-to-end decision-making process.<\/p>\n\n\n\n

Over the past few years, Microsoft research groups have developed many new models and methods for overcoming these challenges and raising the level of automation in cloud application scenarios across the AIOps problem space. Notable examples include:<\/p>\n\n\n\n