{"id":1013043,"date":"2024-03-19T09:21:49","date_gmt":"2024-03-19T16:21:49","guid":{"rendered":""},"modified":"2024-03-19T09:51:47","modified_gmt":"2024-03-19T16:51:47","slug":"intelligent-monitoring-towards-ai-assisted-monitoring-for-cloud-services","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/intelligent-monitoring-towards-ai-assisted-monitoring-for-cloud-services\/","title":{"rendered":"Intelligent monitoring: Towards AI-assisted monitoring for cloud services"},"content":{"rendered":"\n
\"three<\/figure>\n\n\n\n

In the evolving field of software development, professionals are increasingly adopting a modern approach known as service-oriented architecture to enhance the scalability and flexibility of their services and applications. Often utilizing a microservices approach, developers construct software as a collection of small, independently functioning services. This method is particularly advantageous for developing cloud-based software, as it offers numerous benefits over the traditional monolithic architectures, including the ability to separately develop, deploy, and scale individual components of an application. Nevertheless, this approach also introduces challenges, notably the difficulty of offline testing of these services, which can result in issues being discovered only after the software is in use\u2014potentially leading to costly repairs and user dissatisfaction. This underscores the need for careful software deployment, ensuring the software is as free from bugs as possible before it is released.<\/p>\n\n\n\n

Currently, the process of setting up monitoring for cloud services relies heavily on trial and error and the expertise of service managers, who must understand the system\u2019s architecture, its dependencies, and the expectations outlined in service-level agreements (SLAs). Often, adjustments to the monitoring setup are made after the service has been launched, in response to emerging problems. This reactive approach can lead to inefficiencies and often misses critical monitoring checks until issues arise. It also creates redundant alerts that waste resources. At the same time, unoptimized monitoring systems may misdetect anomalous behavior, negatively affecting the user experience and potentially extending the time needed for system upgrades or migrations. To improve how we monitor large cloud computing systems, we need to better understand how these systems work. We can then determine how to decrease the number of missed detections while also reducing the number of unnecessary alerts.<\/p>\n\n\n\n

Microsoft cloud monitor platforms<\/h2>\n\n\n\n

When they are properly configured, cloud monitors can help to meet monitoring requirements. At M365 Research<\/a>, our intelligent monitoring projects tackle the challenges of managing monitor portfolios for large service families, ensuring high reliability and efficiency.<\/p>\n\n\n\n

Azure Monitor (opens in new tab)<\/span><\/a> offers a comprehensive solution for collecting, analyzing, and responding to monitoring data across cloud and on-premises environments, supporting a range of resources such as applications, VMs, containers (including Prometheus metrics), databases, and security and networking events. However, while it excels in anomaly detection, root cause analysis, and time series analysis, introducing additional intelligence during monitor setup could further improve its efficacy.<\/p>\n\n\n\n

A closer look at monitors and incident detection at Microsoft<\/h2>\n\n\n\n

In our paper, \u201cDetection Is Better Than Cure: A Cloud Incidents Perspective (opens in new tab)<\/span><\/a>,\u201d presented at ESEC\/FSE 2023 (opens in new tab)<\/span><\/a>, we tackled this problem by studying a year\u2019s worth of production incidents at Microsoft to understand misdetection. The goal was to use our insights to inform improvements in data-driven monitoring.<\/p>\n\n\n\n

We identified six primary reasons for misdetections, ranging from missing signals and monitors to improper monitor coverage and alerting logic, along with buggy monitors and inadequate documentation. Figure 1 (a) shows the distribution of incident misdetections across these categories. Notably, missing monitors and alerts accounted for over 40 percent of all misdetections, which points to the difficulty of determining what to monitor in cloud services. The second most common issue was improper or missing signals, suggesting that the signals on which new monitors depend must be set up first. Additionally, approximately 10 percent of monitors had improper coverage, and about 13 percent had alerting logic that needed to be reevaluated. Figure 1 (b) shows that 27.25 percent of these misdetections led to outages, emphasizing the importance of accurately defining monitoring parameters.<\/p>\n\n\n\n

Figure 1. (a) Major categories, or classes, of misdetection. (b) Proportion of incidents from each misdetection class that led to outages.<\/figcaption><\/figure>\n\n\n\n

Data-driven intelligent monitoring<\/h2>\n\n\n\n

Organizing monitor data<\/h3>\n\n\n\n

Because there is no standardized approach to building monitors, monitor data often lacks structure. To address this, we defined a structure composed of categories, or classes, for the different types of resources being monitored, as well as service-level objective (SLO) classes for their associated objectives. These classes capture the kinds of measurements users may want to perform over a resource.<\/p>\n\n\n\n

In our paper, \u201cIntelligent Monitoring Framework for Cloud Services: A Data-Driven Approach (opens in new tab)<\/span><\/a>,\u201d to be presented at ICSE 2024 (opens in new tab)<\/span><\/a>, we propose a data-driven approach for developing this ontology. By leveraging LLMs together with a person-in-the-loop approach, we effectively extract signals from monitor metadata. This approach facilitates the incremental development of monitor ontology and ensures accuracy through human validation and refinement of the predicted results.<\/p>\n\n\n\n

Breakdown of resource and SLO classes<\/h4>\n\n\n\n

In our analysis, we identified 13 major resource classes and nine SLO classes that correspond to the majority of monitors in our dataset, as shown in Figure 2.<\/p>\n\n\n\n

Figure 2. (a) Breakdown of resource classes at the monitor level. (b) Breakdown of SLO classes at the monitor level.<\/figcaption><\/figure>\n\n\n\n

We analyzed the distribution of SLO classes within each resource class to determine the relationship between them. We observed that the distribution varies across resource classes, suggesting that a specific subset of SLO classes should be applied to each, as illustrated in Figure 3. This suggests that we can predict a service\u2019s SLO classes by analyzing its associated resource classes.<\/p>\n\n\n\n

Figure 3. Distribution of SLO classes within each resource class<\/figcaption><\/figure>\n\n\n\n
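The per-resource breakdown described above amounts to estimating a conditional distribution of SLO classes given a resource class. The following is a minimal sketch of that tabulation, not the authors' pipeline. The monitor labels and the SLO class names (Latency, Availability, Capacity) are hypothetical placeholders, since the post does not list the nine SLO classes.

```python
from collections import Counter, defaultdict

# Hypothetical (resource_class, slo_class) labels, one pair per monitor.
monitors = [
    ('API', 'Latency'),
    ('API', 'Latency'),
    ('API', 'Availability'),
    ('CPU', 'Capacity'),
    ('CPU', 'Capacity'),
    ('Storage', 'Capacity'),
    ('Storage', 'Availability'),
]

def slo_distribution(pairs):
    '''Estimate P(slo_class | resource_class) from label counts.'''
    counts = defaultdict(Counter)
    for resource, slo in pairs:
        counts[resource][slo] += 1
    return {
        resource: {slo: n / sum(c.values()) for slo, n in c.items()}
        for resource, c in counts.items()
    }

dist = slo_distribution(monitors)
for resource, probs in dist.items():
    print(resource, probs)
```

If the rows of such a table differ noticeably between resource classes, as Figure 3 indicates they do, a resource class carries information about which SLO classes a service is likely to need.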

Monitor recommendation model<\/h3>\n\n\n\n

We developed a deep learning framework that recommends certain monitors for specific services based on their properties. This model uses monitors that have a structured ontology as well as service properties to create the recommendation pipeline, as shown in Figure 4. It incorporates upstream and downstream dependencies and service components. <\/p>\n\n\n\n

Figure 4. The monitor recommendation pipeline<\/figcaption><\/figure>\n\n\n\n

To identify patterns within the data, the model uses a prototypical learning network, which learns abstract representations of the classes, or prototypes. This approach allows the network to compare prototypes for classification, enabling stronger generalization capabilities. During the prediction stage, the model outputs the class that it identifies as the most probable, with custom thresholds ensuring the recommendations are of production quality. Table 1 lists the per-class thresholds along with the resulting precision and recall.<\/p>\n\n\n\n
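To make the prediction step concrete, here is a minimal sketch of prototype-based classification with a per-class threshold. It assumes the standard prototypical-network formulation (a prototype is the mean embedding of a class, and class probabilities are a softmax over negative squared distances), which the post does not confirm is the exact variant used. The learned embedding network is omitted, and the 2-D vectors and function names are hypothetical.

```python
import math

def mean_vector(vectors):
    '''Component-wise mean of equal-length vectors.'''
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def squared_distance(a, b):
    '''Squared Euclidean distance between two vectors.'''
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_probabilities(query, prototypes):
    '''Softmax over negative squared distances to each class prototype.'''
    scores = {c: -squared_distance(query, p) for c, p in prototypes.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def recommend(query, prototypes, thresholds):
    '''Return the most probable class, or None if it misses its threshold.'''
    probs = class_probabilities(query, prototypes)
    best = max(probs, key=probs.get)
    return best if probs[best] >= thresholds[best] else None

# Hypothetical 2-D embeddings of labeled examples for two resource classes.
support = {
    'API': [[1.0, 0.0], [0.8, 0.2]],
    'CPU': [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = {c: mean_vector(v) for c, v in support.items()}
# Threshold values reused from Table 1 purely for illustration.
thresholds = {'API': 0.30, 'CPU': 0.20}

print(recommend([0.9, 0.1], prototypes, thresholds))
```

Returning None when the top class misses its threshold is one way to withhold a low-confidence recommendation. How the production system handles that case is not described in the post.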

Resource Class<\/th>Threshold<\/th>Precision<\/th>Recall<\/th><\/tr><\/thead>
Service Level<\/td>0.45<\/td>0.95<\/td>1.00<\/td><\/tr>
API<\/td>0.30<\/td>0.48<\/td>1.00<\/td><\/tr>
CPU<\/td>0.20<\/td>0.34<\/td>1.00<\/td><\/tr>
Container<\/td>0.40<\/td>0.30<\/td>0.38<\/td><\/tr>
Dependency<\/td>0.20<\/td>0.28<\/td>1.00<\/td><\/tr>
Compute Cluster<\/td>0.05<\/td>0.30<\/td>1.00<\/td><\/tr>
Storage<\/td>0.35<\/td>0.22<\/td>1.00<\/td><\/tr>
Ram-memory<\/td>0.30<\/td>0.20<\/td>1.00<\/td><\/tr>
Certificate<\/td>0.50<\/td>0.14<\/td>0.80<\/td><\/tr>
Cache-memory<\/td>0.41<\/td>0.13<\/td>0.88<\/td><\/tr>
Others<\/td>0.40<\/td>0.10<\/td>0.90<\/td><\/tr><\/tbody><\/table>
Table 1. Quantitative metrics evaluated on recommendations from the proposed framework.<\/figcaption><\/figure>\n\n\n\n
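As a reminder of how the precision and recall columns relate to a threshold, the sketch below computes both for one class at two cutoffs. The scores and relevance labels are hypothetical and are not taken from the evaluation in the paper.

```python
def precision_recall(scores, labels, threshold):
    '''Precision and recall when scores >= threshold count as recommendations.'''
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores for one resource class, with ground-truth relevance.
scores = [0.9, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False]

for threshold in (0.2, 0.5):
    p, r = precision_recall(scores, labels, threshold)
    print(threshold, round(p, 2), round(r, 2))
```

In this toy data, the lower cutoff recovers every relevant item at the cost of a false recommendation, while the higher cutoff removes the false recommendation but misses one relevant item. Several rows of Table 1 show the same pattern of recall at 1.00 alongside lower precision.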

Finally, to understand the importance and utility of the monitor recommendations and how engineers perceive them, we interviewed 11 Microsoft engineers who modified monitors between January and June 2023. We introduced the proposed ontology, asked if it was helpful, and solicited suggestions for new classes. The average rating for the ontology was 4.27 out of 5, indicating that the engineers found it useful.<\/p>\n\n\n\n

Looking ahead<\/h2>\n\n\n\n

<\/a>Developing an ontology for monitoring, alongside a recommendation framework to create performance monitors for cloud platforms, marks the initial steps towards tackling the complexities associated with monitor management. One planned project, called Monitor Scorecards, aims to systematically analyze monitor performance through incident reports, their downstream impact, resolution time, and coverage. This approach combines Bayesian statistics with time-series modeling to estimate monitor effectiveness, offering actionable insights into the monitor portfolio\u2019s performance by classifying and quantifying both false positives and negatives. We hope these effectiveness assessments will enhance recommendation models\u2019 training phase and improve the recommendations they make.<\/p>\n\n\n\n
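The post does not specify the statistical model behind Monitor Scorecards. As one illustration of how Bayesian estimation can quantify monitor effectiveness from alert outcomes, the sketch below uses a Beta-Binomial model of a monitor's precision. The model choice, the uniform prior, the monitor names, and the counts are all assumptions, and the time-series component is left out.

```python
def beta_posterior(true_alerts, false_alerts, prior_a=1.0, prior_b=1.0):
    '''Posterior Beta(a, b) over a monitor's precision under a Beta prior.'''
    return prior_a + true_alerts, prior_b + false_alerts

def beta_mean(a, b):
    '''Mean of a Beta(a, b) distribution.'''
    return a / (a + b)

# Hypothetical alert history: (alerts tied to real incidents, false alarms).
history = {'monitor_a': (45, 5), 'monitor_b': (2, 0)}

for name, (hits, misses) in history.items():
    a, b = beta_posterior(hits, misses)
    print(name, round(beta_mean(a, b), 3))
```

With a uniform prior, a monitor with only two alerts is pulled toward 0.5 rather than being scored as perfect, which is the kind of small-sample caution a scorecard would likely need.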

Acknowledgments<\/h2>\n\n\n\n

We would like to thank colleagues from the Azure Health Platform team, Microsoft Research, and the Data, Knowledge, and Intelligence (DKI) team, for contributing to this work.<\/p>\n","protected":false},"excerpt":{"rendered":"

Integrating AI into cloud service monitoring improves incident detection accuracy, reduces unnecessary alerts, and enhances overall system reliability. This helps organizations better align with business goals and increase customer satisfaction.<\/p>\n","protected":false},"author":37583,"featured_media":1013082,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13560,13547],"msr-region":[],"msr-event-type":[],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[811276],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Anjaly Parayil","user_id":41215,"display_name":"Anjaly Parayil","author_link":"Anjaly Parayil<\/a>","is_active":false,"last_first":"Parayil, Anjaly","people_section":0,"alias":"aparayil"},{"type":"guest","value":"ayush-choure","user_id":"1015794","display_name":"Ayush Choure","author_link":"Ayush Choure<\/a>","is_active":true,"last_first":"Choure, Ayush","people_section":0,"alias":"ayush-choure"},{"type":"user_nicename","value":"Fiza Husain","user_id":43164,"display_name":"Fiza Husain","author_link":"Fiza Husain<\/a>","is_active":false,"last_first":"Husain, Fiza","people_section":0,"alias":"t-fizahusain"},{"type":"guest","value":"avi-nayak","user_id":"1015800","display_name":"Avi Nayak","author_link":"Avi Nayak<\/a>","is_active":true,"last_first":"Nayak, 
Avi","people_section":0,"alias":"avi-nayak"},{"type":"guest","value":"piyali-jana","user_id":"1015803","display_name":"Piyali Jana","author_link":"Piyali Jana<\/a>","is_active":true,"last_first":"Jana, Piyali","people_section":0,"alias":"piyali-jana"},{"type":"user_nicename","value":"Rujia Wang","user_id":42549,"display_name":"Rujia Wang","author_link":"Rujia Wang<\/a>","is_active":false,"last_first":"Wang, Rujia","people_section":0,"alias":"rujiawang"},{"type":"user_nicename","value":"Chetan Bansal","user_id":31394,"display_name":"Chetan Bansal","author_link":"Chetan Bansal<\/a>","is_active":false,"last_first":"Bansal, Chetan","people_section":0,"alias":"chetanb"},{"type":"user_nicename","value":"Saravan Rajmohan","user_id":41039,"display_name":"Saravan Rajmohan","author_link":"Saravan Rajmohan<\/a>","is_active":false,"last_first":"Rajmohan, Saravan","people_section":0,"alias":"saravar"}],"msr_type":"Post","featured_image_thumbnail":"\"three","byline":"","formattedDate":"March 19, 2024","formattedExcerpt":"Integrating AI into cloud service monitoring improves incident detection accuracy, reduces unnecessary alerts, and enhances overall system reliability. 
This helps organizations better align with business goals and increase customer satisfaction.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1013043"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1013043"}],"version-history":[{"count":20,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1013043\/revisions"}],"predecessor-version":[{"id":1016229,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1013043\/revisions\/1016229"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1013082"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1013043"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1013043"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1013043"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1013043"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1013043"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1013043"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post
-option?post=1013043"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1013043"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1013043"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1013043"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}