Boosting Cloud Efficiency: Harnessing Data-Driven Decision-Making and Optimization Techniques

Si Qin, Principal Research Manager; Fangkai Yang, Senior Researcher; Rujia Wang, Principal Research PM; Qingwei Lin, Partner Research Manager; Saravan Rajmohan, Partner Director, AI and Applied Research; and Dongmei Zhang, Distinguished Scientist and Vice President.


Introduction

Microsoft’s cloud system serves as the backbone for the daily operations of hundreds of thousands of organizations, driving productivity and collaboration. The foundational infrastructure demands both high reliability and efficiency. In our last system innovation research blog post, we delved into our recent work on bringing AI into service reliability scenarios, where we aim to achieve continuous availability through AI-assisted automation tools. In this blog, we explore recent innovations that continually enhance hyper-scale cloud capacity efficiency, delivering substantial operational cost savings for our customers. 

Efficient cloud operation is pivotal for both Microsoft and our customers. On one hand, cloud engineers need to ensure that we maximize our capacity utilization without compromising reliability, availability, and sustainability; on the other hand, our customers seek performant and cost-effective cloud services to build scalable solutions atop our infrastructure.  However, optimizing cloud efficiency is a multifaceted challenge due to various factors: 

Fluctuation and Uncertainty. Workload and computing resource demands fluctuate over time due to daily and weekly cycles, overall cloud market trends, and sudden spikes or outages. These variations lead to utilization peaks and valleys, requiring the cloud platform to secure sufficient capacity for peak hours. Such dynamic signals make it very challenging for engineers and customers to apply traditional decision-making techniques to effective efficiency optimization.

Multi-Constraints, Multi-Objectives, and Multi-Parameters. Each workload deployment and operation must satisfy multiple constraints from customers and the platform. The platform needs to balance multiple competing objectives, such as capacity utilization, service performance, and availability. Management tasks involve tuning various parameters and thresholds for different objectives.  

North Star: Proactive data-driven decision-making for cloud capacity management 

We envision that future highly efficient and reliable cloud platforms will need a proactive design that takes the future status of the system into account in the decision-making process. This forward-looking approach relies on data-driven models to predict the future status of cloud platforms, laying the foundation for downstream proactive decision-making.

The CapacityInsider framework (Figure 1) captures our overall strategy for bringing data-driven approaches into proactive cloud capacity management. It facilitates automated, informed, and forward-looking decision-making in cloud capacity management, and it focuses on four key areas:

  • Detecting Allocation Failure Issues: By identifying allocation failure issues, we fortify the system against potential disruptions, enhancing its resilience and reliability. 
  • Diagnosing Root Causes and Bottlenecks: The framework delves into diagnosing root causes and bottlenecks for allocation failures, providing insights crucial for optimizing performance and addressing inefficiencies. 
  • Predicting Future Workloads and System Status: Through advanced predictive modeling, the framework anticipates future workloads and system status, empowering the system to proactively adapt to changing demands. 
  • Optimizing Decision-Making: The capacity management system’s decision-making processes are fine-tuned and optimized, ensuring efficiency across varied objectives such as capacity utilization, service performance, and availability. 

By addressing these challenges, we aim to develop a comprehensive, automated, AI-driven, and proactive capacity management system. This system not only reacts to current demands but anticipates and aligns with future needs, elevating the efficiency and reliability of Microsoft’s cloud infrastructure to new heights. 

Figure 1: CapacityInsider: Data-driven Decision-making for Efficient Cloud Capacity Management
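To make the "Predicting Future Workloads and System Status" pillar concrete, here is a minimal sketch of a seasonal baseline forecaster in Python. It is an illustrative assumption, not the production model: real capacity predictors blend many signals, but even a simple weekly-profile forecast captures the daily and weekly cycles described above.

```python
import numpy as np

def seasonal_forecast(hourly_demand: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Forecast the next `horizon` hours of demand from weekly seasonality.

    hourly_demand: past hourly core-demand observations, assumed to cover
    at least one full week (168 hours) and to start on a week boundary.
    """
    week = 168  # hours in a week
    n_weeks = min(4, len(hourly_demand) // week)   # use up to the last 4 weeks
    weekly = hourly_demand[-n_weeks * week:].reshape(n_weeks, week)
    profile = weekly.mean(axis=0)                  # average weekly demand profile
    offset = len(hourly_demand) % week             # where the history leaves off
    return profile[(offset + np.arange(horizon)) % week]
```

A capacity planner can then provision for the predicted peak plus a safety buffer, rather than reserving for a static worst case.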

In the subsequent sections, we present two cases where data-driven decision-making techniques help optimize cloud resources and potentially reduce costs for our customers.

  • In the first study, we show how spot and on-demand VMs can be mixed to improve overall utilization without impacting service reliability. 
  • In the second study, we discuss how to effectively bin-pack container workloads through chance-constrained optimization algorithms.  

Case Study: Mixture of Spot and On-demand VMs for Low-cost Computing 

Cloud providers usually ensure service availability and reliability by allocating sufficient computing resources, so that it appears to customers that resources are never in short supply. This often leads to underutilized resources, and cloud providers are motivated to monetize the excess capacity by selling it at discounted prices that compensate customers for reduced resource availability. Example offerings include spot virtual machines (VMs), spot block VMs, and harvest VMs. The trade-off between cost-effectiveness and resource availability often creates a dilemma for users. On the one hand, some workloads, such as machine learning training and inferencing, caching, and big data processing, can potentially run at reduced reliability for lower cost. On the other hand, cloud providers, while sharing some information about resource availability such as estimated eviction rates for spot VMs at the region level, do not guarantee resource availability.

One natural solution for guarding against spot evictions is to deploy services in a VM Scaling Group (VMSG) with both spot and on-demand VMs, so that enough on-demand VMs remain to support the service even if all spot instances are evicted. However, existing approaches either use a static mixture ratio of spot and on-demand VMs or allocate spot instances greedily, which can be less cost-effective when the eviction rate is low.

In our recent ASPLOS’23 paper, “Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs”, we propose an intelligent framework that optimizes customer cost while maintaining resource availability by dynamically mixing on-demand VMs with spot VMs. Snape (Spot On-demand Perfect Mixture) combines a reliable model for predicting the eviction rate of spot VMs from production traces with an intelligent constrained reinforcement learning (CRL) framework that learns the best mixture policy given the predicted eviction rate and other service signals. We first characterize the eviction behaviors of spot VMs by examining traces collected over three months from a production cloud system. Then, to better learn the eviction behaviors and achieve optimal decisions in the long term, we employ the CRL framework: if the number of in-service VMs temporarily drops below the target value for a short period, the CRL agent takes more conservative actions in its future decision-making to ensure that SLOs are not violated. Figure 2 shows the overall framework of Snape.

Figure 2: Overview of the Snape framework

This proactive design enables an online decision-making system that dynamically adjusts the mixture of on-demand and spot VMs and ensures that a more aggressive, cheaper policy is adopted only when reliability is high (i.e., when the predicted eviction rate of spot VMs is low). Experiments across different configurations show that Snape achieves 44% savings compared to a policy of using only on-demand VMs while maintaining 99.96% availability, 2.77% higher than a policy of using only spot VMs.
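To give a flavor of the kind of policy Snape learns, the sketch below hand-codes an eviction-aware mixture rule: the spot fraction grows when the predicted eviction rate is low and shrinks after an observed availability dip, mirroring the conservatism of the constrained RL agent. All thresholds and names here are illustrative assumptions, not Snape’s actual learned policy.

```python
def choose_mixture(target_vms: int,
                   predicted_eviction_rate: float,
                   recent_availability: float,
                   slo_availability: float = 0.9995) -> tuple[int, int]:
    """Return (on_demand_vms, spot_vms) for a VM scaling group."""
    # Be aggressive with spot VMs only when the predicted risk is low.
    if predicted_eviction_rate < 0.02:
        spot_fraction = 0.6
    elif predicted_eviction_rate < 0.10:
        spot_fraction = 0.3
    else:
        spot_fraction = 0.0

    # Constraint safeguard: after an SLO dip, fall back toward on-demand,
    # analogous to the CRL agent acting more conservatively after a violation.
    if recent_availability < slo_availability:
        spot_fraction *= 0.5

    spot_target = int(target_vms * spot_fraction)
    # Over-provision spot capacity to absorb the expected evictions.
    spot = int(spot_target / max(1e-9, 1.0 - predicted_eviction_rate))
    on_demand = target_vms - spot_target
    return on_demand, spot
```

In the real system this mapping is learned end to end, which lets the policy exploit finer-grained signals than a handful of hand-tuned thresholds can.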

Case Study: Workload Bin-packing via Chance-constrained Optimization 

In today’s digital landscape, many large companies operate their services using containers and employ Kubernetes-like systems for container orchestration and resource management on modern cloud platforms. One significant challenge these platforms face is container scheduling. To optimize utilization, the platform consolidates multiple containers onto a single machine, where the combined maximum resources required by the containers may surpass the machine’s capacity. However, this consolidation introduces the risk of machine resource violations, which can lead to container performance degradation or even service unavailability.  

Container scheduling can be naturally modeled as the Stochastic Bin Packing Problem (SBPP), which aims to optimize resource utilization while keeping violations below a desired low level. Much of the research on SBPP assumes that all machines (also referred to as bins) are empty before allocation, and these approaches are evaluated by the number of bins used. In practice, however, the total resources required by a service change diurnally and weekly; Figure 3 shows the diverse CPU usage patterns of three different services across one week. As a result, services often request the allocation and deletion of a batch of containers daily to increase resource utilization, and on the platform side, most machines already host a few containers when new allocations arrive. In cases where non-empty machines can accommodate all or most of the requested containers, the conventional metric, i.e., the number of bins used, fails to differentiate the effectiveness of allocation strategies.

Figure 3: Diverse CPU core usage across three services

In our recent SIGKDD’22 paper, “Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach”, we introduce a new optimization metric, Used Capacity at Confidence (UCaC), and propose a unified problem formulation for the SBPP that accommodates both empty and non-empty machines. Furthermore, we design heuristic and cutting-stock-based exact algorithms to solve the problem. Extensive experiments on both synthetic and real cloud traces demonstrate that our UCaC-based optimization methodology outperforms existing approaches that optimize the number of machines used. With this work, we take a first step toward addressing the stochastic bin packing problem on non-empty machines, a crucial issue in cloud resource scheduling.
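To illustrate the chance-constrained idea concretely, the sketch below packs containers with a best-fit rule under a Gaussian assumption on per-container CPU demand: a container fits on a machine only if the probability that total demand exceeds capacity stays below 0.1%. This is a simplified sketch of the formulation, not the paper’s exact heuristic or cutting-stock algorithm.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.999)  # ~3.09 for a 99.9% confidence level

class Machine:
    def __init__(self, capacity: float):
        self.capacity = capacity
        self.mean = 0.0  # sum of container demand means
        self.var = 0.0   # sum of demand variances (independence assumed)

    def ucac(self, extra_mean: float = 0.0, extra_var: float = 0.0) -> float:
        """Used Capacity at Confidence: the 99.9th-percentile total demand."""
        return (self.mean + extra_mean) + Z * sqrt(self.var + extra_var)

def place(containers, capacity: float = 64.0):
    """Best fit by UCaC: put each container where it leaves the least headroom."""
    machines: list[Machine] = []
    for mu, sigma in containers:  # per-container demand: (mean, std dev)
        fits = [m for m in machines
                if m.ucac(mu, sigma ** 2) <= m.capacity]  # chance constraint
        if not fits:
            machines.append(Machine(capacity))
            fits = [machines[-1]]
        best = min(fits, key=lambda m: m.capacity - m.ucac(mu, sigma ** 2))
        best.mean += mu
        best.var += sigma ** 2
    return machines
```

Note that the objective credits a packing for keeping the UCaC low, which is what differentiates allocation strategies even when no new machines are opened.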

We assessed the proposed methods using real trace data from a first-party application. The dataset comprises 17 primary services and over 10,000 containers at peak times. We set the UCaC confidence level at 99.9% and evaluated the methods on two metrics: the average number of machines and total violations. Compared to the best-fit method, our proposed heuristic and cutting-stock-based approaches achieved a 1.3% reduction in the number of nodes used and up to a 23% reduction in violations.

Conclusions 

Microsoft Cloud hosts a wide range of workloads with varying characteristics, including size, configuration, usage patterns, resources, and SLA requirements. Specialized workloads, such as GPU-intensive AI tasks, require dedicated computing clusters and specialized SLA management. Evaluating infrastructure and workloads together is crucial for optimizing cloud efficiency and capacity decisions.  

By adopting advanced data-driven methodologies, our research can help throughout all phases of cloud capacity management, moving toward a proactive and self-sustaining cloud environment.


Acknowledgement

We would like to thank our collaborators and contributors to the research work: Yixin Fang, Silvia Yu, Terry Yang, Soumya Ram, Zhen Ma, Íñigo Goiri, Eli Cortez, Thomas Moscibroda, Ricardo Bianchini, Lu Wang, Jue Zhang, Liqun Li, Bo Qiao, Camille Couturier, Victor Rühle, and Chetan Bansal.

Towards Highly Reliable Services with AIOps

Building highly reliable hyper-scale cloud services is challenging. This post highlights our award-winning research on understanding production incidents in the Microsoft 365 cloud and on automating incident management using state-of-the-art GPT large language models.

Rujia Wang, Principal Research PM; Chetan Bansal, Principal Research Manager; Saravan Rajmohan, Partner Director AI & Applied Research; and Jim Kleewein, Technical Fellow

For well over a decade, Microsoft has provided one of the world’s most popular hyper-scale productivity suites, Office 365, which is now part of Microsoft 365. Microsoft 365 includes hundreds of different services running billions of transactions a second on hundreds of thousands of servers across many dozens of datacenters worldwide. It delivers day-to-day cloud services to hundreds of millions of enterprise, education, and consumer users.

Those services can never be down. Our services are used by hospitals and trauma centers, power grid providers, national, state, and local governments, major banks and financial services providers, airlines, shipping and logistics providers, and businesses from the largest to the smallest. To meet their needs, we must be continuously available, which means 100% availability over long periods of time. Our services should operate seamlessly through disasters, because disasters are often when our services are most essential, for example, to coordinate emergency response.

Therein lies a great challenge. Our extreme scale means that in our services, “one in a billion” events are not rare; they are commonplace. At the same time, we cannot allow those “one in a billion” events to compromise the availability of our service. This combination of almost unbelievably massive scale and extreme criticality requires us to continuously rethink and improve every aspect of service architecture, design, development, and operations. One important aspect of achieving continuous availability and highly reliable services is to understand incidents holistically and mitigate their impact on customers.

Beyond using Artificial Intelligence (AI) and Machine Learning (ML) to develop new productivity features and capabilities that delight our users, we are also leveraging the power of AI and ML to improve service availability and reliability, which is essential for our hyper-scale services. This article shows one example of applying AI to managing the production incident life cycle. We plan to share more examples in future articles.

— Jim Kleewein, Technical Fellow, Microsoft 365
Acknowledgement

This post includes contributions from Supriyo Ghosh, Toufique Ahmed, Manish Shetty, Suman Nath, Tom Zimmermann, Xuchao Zhang, Yu Kang, Qingwei Lin, and Dongmei Zhang.

Introduction

Microsoft 365 (“M365”) is the world’s largest productivity cloud, used by hundreds of thousands of organizations of all sizes. Whether you’re having a Teams meeting, composing emails in Outlook, or collaborating on a Word document with your colleagues, you’re relying on M365 to power these productivity tools and applications. M365 is powered by web-scale, massively distributed cloud services with exabytes of data handled by O(100K) servers in O(100) datacenters around the globe. To ensure best-in-class productivity experiences, it’s critical that our engineering infrastructure is highly reliable while being efficient at the same time.

Here in the M365 Systems Innovation research group, we leverage the power of AI and integrate Cloud Intelligence and AIOps into our services and products. We use innovative AI/ML technologies and algorithms to help design, build, and operate complex cloud infrastructures and services, providing a step-function improvement in operational efficiency and reliability that enables us to deliver best-in-class productivity experiences. We are applying AIOps to several domains: 

  • AI for Systems to make intelligence a built-in capability to achieve high quality, high efficiency, self-control, and self-adaptation with less human intervention. 
  • AI for Customers to leverage AI/ML to create unparalleled user experiences and achieve exceptional user satisfaction using cloud services. 
  • AI for DevOps to infuse AI/ML into the entire software development lifecycle to achieve high developer productivity. 

Helping build highly reliable cloud services has been one of our key focus areas. One of the challenges there is to quickly identify, analyze, and mitigate incidents. Our research starts from the fundamentals of production incidents: we analyze the life cycle of incidents and understand the common root causes, mitigations, and engineering efforts for resolution.

Understanding Production Incidents

Figure 1: Overview of service reliability problems in large-scale cloud services

Our award-winning paper provides a comprehensive, multi-dimensional empirical study of production incidents on the large-scale M365 cloud, focusing on Microsoft Teams. Since Microsoft Teams powers real-time communication, reliability is paramount. Understanding production incidents from the detection, root-causing, and mitigation perspectives is the first step toward building better monitoring and automation tools. Figure 1 shows the overview of service reliability problems in large-scale cloud services, as summarized by our research paper.

Common root causes and mitigations behind Incidents

Figure 2: Breakdown of root cause analysis (RCA) and mitigation categories

While code bugs are the most frequent cause of incidents, the majority of incidents (~60%) were caused by non-code/non-config issues in infrastructure, deployment, and service dependencies. We also observed that among the 40% of incidents caused by code or configuration bugs, nearly 80% were mitigated without a code or configuration fix.

Time to detect (TTD) and time to mitigate (TTM) across root causes and mitigations

Figure 3: Average TTD and TTM for different root cause categories
Figure 4: Average TTD and TTM for different mitigation steps

The TTD and TTM of incidents caused by code bugs and dependency failures are significantly higher than for other incidents. Also, 30% of the mitigation delay is caused by manual mitigation steps.

Takeaways

(1) Incidents caused by software bugs and external dependencies take longer to detect due to poor monitoring. This highlights the need for practical tools that provide fine-grained, in-situ system observability.

(2) Incidents caused by some root-cause categories are quick to mitigate once their root-cause categories are determined. This suggests that the overall mitigation time for these incidents can potentially be reduced with tools that quickly identify the root-cause category.

(3) Incidents caused by some root causes are inherently hard to monitor automatically (e.g., those that require monitoring global state). This suggests that developers should invest more in testing to uncover these root-cause categories before production, thereby avoiding such incidents.

We also envision that automation is the future of incident diagnosis: identifying the root cause and mitigation steps automatically can help resolve incidents quickly and minimize customer impact. We should also leverage lessons learned from past incidents to build resilience against future ones. We posit that adopting AIOps and using state-of-the-art ML models, such as large language models (LLMs), can help achieve both goals.

Using Large-Language Models for Automatic Incident Management

Recent breakthroughs in AI have given Large Language Models (LLMs) a rich understanding of natural language. They have become good at understanding and reasoning over large volumes of data, and they generalize across a diverse set of tasks and domains such as code completion, translation, and Q&A. Given the complexities of incident management, we were motivated to evaluate how effective these LLMs are at helping root-cause and mitigate production incidents.

Figure 5: Leveraging GPT-3.x for root cause analysis and mitigation

In our recent work, which we will present at the ICSE 2023 conference, we demonstrate for the first time the usefulness of LLMs for production incident diagnosis. When an incident is created, the author specifies a title for the incident and describes relevant details, such as error messages and anomalous behavior, that could potentially help with resolution. We use the title and the summary of a given incident as the input to the LLM and generate root cause and mitigation steps.
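As a minimal sketch of this inference step, the snippet below prompts a model with the incident title and summary using the OpenAI Python SDK. The prompt wording and model name are our illustrative assumptions; the paper’s setup instead fine-tunes GPT-3.x models on historical incident data.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def suggest_root_cause(title: str, summary: str) -> str:
    """Generate a candidate root cause from an incident's title and summary."""
    prompt = (
        "You are assisting an on-call engineer with incident diagnosis.\n"
        f"Incident title: {title}\n"
        f"Incident summary: {summary}\n"
        "Suggest the most likely root cause:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice, not the paper's fine-tuned model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        temperature=0.2,        # keep suggestions focused and repeatable
    )
    return response.choices[0].message.content
```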

We conducted a rigorous study on more than 40,000 incidents and compared several LLMs in zero-shot, fine-tuned, and multi-task settings. We find that fine-tuning the GPT-3 and GPT-3.5 models significantly improves their effectiveness on incident data.

Effectiveness of GPT-3.x models at finding root causes

Table 1: Lexical and semantic performance of different LLMs

In our offline evaluation, we compared the performance of GPT-3.5 against three GPT-3 models by computing three lexical similarity metrics between the generated recommendations and the ground-truth root cause or mitigation steps recorded in the incident management (IcM) portal. The average gains of GPT-3.5 across the different tasks are as follows: 

  1. For root cause and mitigation recommendation tasks, Davinci-002 (GPT-3.5) provides at least 15.38% and 11.9% gains over all the GPT-3 models, respectively, as shown in Table 1.
  2. When we generate mitigation plans by adding the root cause as input to the model, the GPT-3.5 model provides at least an 11.16% gain over the three GPT-3 models.
  3. We observe that LLMs perform better on machine-reported incidents (MRIs) than on customer-reported incidents (CRIs), due to the repetitive nature of MRIs.
  4. Fine-tuning LLMs with incident data improves performance significantly. The fine-tuned GPT-3.5 model improves the average lexical similarity score by 45.5% for root cause generation and 131.3% for mitigation generation over the zero-shot setting (i.e., inferencing directly on the pretrained GPT-3 or GPT-3.5 models).
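For reference, lexical similarity scores of the kind reported above can be computed with off-the-shelf libraries. Below is a hedged sketch using the rouge-score package; this is one plausible metric choice for illustration, not necessarily the exact metric set used in the paper.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def lexical_similarity(generated: str, ground_truth: str) -> float:
    """ROUGE-L F1 between a model suggestion and the IcM ground truth."""
    return scorer.score(ground_truth, generated)["rougeL"].fmeasure

# Example: compare a generated root cause against the one recorded by engineers.
score = lexical_similarity(
    "Config change rolled out to the wrong ring caused auth failures",
    "An incorrect configuration rollout caused authentication failures",
)
print(f"ROUGE-L F1: {score:.2f}")
```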

Looking Through the Incident Owners’ Eyes

In addition to the quantitative analysis with semantic and lexical metrics, we also interviewed incident owners to evaluate the effectiveness of the generated recommendations. Overall, GPT-3.5 outperforms GPT-3 on the majority of metrics. More than 70% of on-call engineers (OCEs) gave a rating of three or above (out of five) for the usefulness of the recommendations in a real-time production setting.

Looking Forward

While we are at the initial stages of using LLMs to help automate incident resolution, we envision many open research questions in this field that could significantly increase the efficacy and accuracy of LLMs. For instance, how can we incorporate additional context about the incident, such as discussion entries, logs, service metrics, and even dependency graphs of the impacted services, to improve the diagnosis? Another challenge is staleness, since the models would need to be frequently retrained with the latest incident data. To address these challenges, we are working on leveraging the latest ChatGPT model combined with retrieval-augmented approaches to improve incident diagnosis via a conversational interface. For instance, ChatGPT can assist engineers in efficiently determining an incident’s root cause by raising hypotheses and answering critical questions in a feedback loop.

Figure 6: Workflow of retrieval-augmented RCA

Moreover, ChatGPT can be actively integrated into the “discussion” of the incident diagnosis. By collecting evidence from available documents and logs, the model can generate coherent, contextual, natural-sounding responses to inquiries and offer corresponding suggestions, thereby facilitating the discussion and accelerating incident resolution. We believe this has the potential to deliver a step-function improvement in the overall incident management process, with contextual and meaningful root cause analysis and mitigation that reduce the significant human toil involved and bolster our reliability and customer satisfaction.
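As a sketch of the retrieval step in such a setup, the snippet below embeds a new incident, ranks historical incidents by cosine similarity, and assembles the most similar ones into a diagnosis prompt. The embedding model, data layout, and prompt structure are illustrative assumptions, not our production pipeline.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def retrieve_similar(new_incident: str, history: list[dict], k: int = 3) -> list[dict]:
    """Rank past incidents by cosine similarity to the new one.

    Each history item is assumed to look like:
    {"summary": str, "root_cause": str, "embedding": np.ndarray}
    """
    q = embed(new_incident)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(history, key=lambda h: cosine(h["embedding"]), reverse=True)[:k]

def build_prompt(new_incident: str, similar: list[dict]) -> str:
    evidence = "\n\n".join(
        f"Past incident: {h['summary']}\nRoot cause: {h['root_cause']}"
        for h in similar
    )
    return (f"{evidence}\n\nNew incident: {new_incident}\n"
            "Using the past incidents as evidence, hypothesize a root cause:")
```

Grounding the model’s hypotheses in retrieved evidence like this is what keeps the conversational diagnosis contextual rather than generic.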
