Doing more with less internally at Microsoft with Microsoft Azure

Jan 23, 2024

Microsoft Digital technical stories

How do we at Microsoft get the best value from our Microsoft Azure environment? We’ve been refining and optimizing the way we use the cloud for years, and as such, our answer isn’t just about how much we’ve been able to push down our monthly Azure bill.

Our migration from managing our own datacenters to Microsoft Azure has been a process of learning and growing. We’ve moved more than 600 services and solutions, comprising approximately 1,400 components, to Azure-based cloud technologies that require less specialized skill sets to use and provide quicker, more agile access to infrastructure and solutions.

We’re the Microsoft Digital (MSD) team and we have led the company’s move from the datacenter to the cloud, enabling best-in-class platforms and productivity services for the mobile-first, cloud-first world. This strategy harmonizes the interests of our employee users of our services, our developers, and our team of IT implementers who provide the core of our IT operations.

The freedom to provision resources in minutes instead of days has radically changed the way we in MSD enable teams across the company to spin up the environments and resources they need on demand, which in turn empowers our engineering teams to respond more quickly to our evolving business needs.

However, we’ve found that easy provisioning and quick deployments can be costly. An unmanaged or undermanaged enterprise estate in Microsoft Azure can quickly lead to significant cloud billing costs and under-utilized resources. But in our journey to the cloud, we’ve learned that there are many smart ways to optimize how you use the cloud, tricks of the trade that we’re using to keep our costs down while we transform the way we work. In this blog post, we’ll share the lessons we’ve learned here at Microsoft on how you can fine-tune your use of Microsoft Azure at your company.

  • Read our related blog post on our lessons learned deploying and optimizing Microsoft Azure internally at Microsoft.
  • Learn how we’re implementing Microsoft Azure cost optimization internally at Microsoft.
  • Read more about turning to DevOps engineering practices to democratize access to data at Microsoft.
  • Explore how Microsoft uses a scream test to silence its unused servers.


Watch to learn how optimizing our Microsoft Azure workloads is helping Microsoft operate more efficiently.

Adopting modern engineering practices

Modern engineering practices underpin everything we do in Microsoft Azure, from single resource deployments to enterprise-scale, globally distributed Azure-based solutions that span hundreds of resources. Our modern engineering vision has created culture, tools, and practices focused on developing high-quality, secure, and feature-rich services to enable digital transformation across the organization.

Our operations and engineering teams have journeyed through several phases of efficiency maturity. Through each of these phases, our operations substructure had to evolve, and many of those changes resulted in increased efficiency, not just with the bottom line on our monthly Azure bill, but with the way we do service management in Azure, including development, deployment, change management, monitoring, and incident management.

—Pete Apple, principal program manager for Microsoft Azure engineering, MSD

Apple smiles as he stands outside a Microsoft building holding a cup of coffee.
Now that we’ve fully migrated Microsoft to Microsoft Azure, we’re finding smart ways to use our cloud product more efficiently, says Pete Apple, a principal program manager for Microsoft Azure Engineering in MSD.

Pete Apple is a Principal Program Manager for Microsoft Azure Engineering in MSD. He and his team have been responsible for overseeing and implementing our massive migration to the cloud over the past 8 years. They’re also responsible for ensuring that the company’s enterprise estate in Microsoft Azure is running at top efficiency.

“Our operations and engineering teams have journeyed through several phases of efficiency maturity,” Apple says. “Through each of these phases, our operations substructure had to evolve, and many of those changes resulted in increased efficiency, not just with the bottom line on our monthly Azure bill, but with the way we do service management in Azure, including development, deployment, change management, monitoring, and incident management.”

We went through three phases on our journey to greater efficiency in Microsoft Azure. Phase one focused on improving operational efficiency, phase two examined how we could deliver value through innovation, and in phase three we embraced transforming our digital ecosystem. Here’s a summary of the three phases:

Improving operational efficiency

At MSD, we play a pivotal role in Microsoft business strategy, as most business processes in the company depend on us. To help Microsoft transform on our journey to the cloud, we identified key focus areas to improve in this first phase of our transformation: aligning services, optimizing infrastructure, and assessing our culture.

The first phase involved culture and structure as much as it did strategy and platform management. We realigned our organization to better support a brand-new way of providing services and support to the company in Microsoft Azure. Our teams needed to realign to eliminate information silos between different support areas. In many cases, teams that started to work together realized they had duplicate projects with similar goals.

Reducing projects and streamlining delivery methods freed up engineering resources to accomplish more in less time, while automated provisioning and self-service tools helped our teams plan their own migrations and accurately assess their portion of our Microsoft Azure estate.

Our engineering culture underwent a radical change in phase one. We moved toward empowering our engineers to create business solutions, not just create and manage processes. This led to a more holistic view of what we were trying to accomplish—as individuals and as teams—and it increased innovation, creativity, and productivity throughout our engineering processes.

Delivering value through innovation

We migrated more than 90 percent of our IT infrastructure to Microsoft Azure in phase one. In phase two, we embraced the Azure platform and cloud-native engineering design principles by adopting Infrastructure as Code and continuous deployment. We redefined operations roles and retrained people from traditional IT roles to be business relationship managers, engineering program managers, service engineers, and software engineers.

We also radically simplified our IT operations. The rapid provisioning and allocation process in Microsoft Azure enabled us to increase our speed 40-fold by eliminating, streamlining, and connecting processes, and by aligning processes for Azure. We adopted Azure-native solutions, especially platform as a service (PaaS) offerings, across all aspects of the engineering and operations lifecycle, including infrastructure as code with ARM templates, APIs, and PowerShell.

This final phase is never really final. Continual evaluation and optimization of our Microsoft Azure environment is built into how we manage our resources in the cloud. As new features and engineering approaches arise, we’re adapting our methods and best practices to get the most from our investment.

—Heather Pfluger, general manager, Infrastructure and Engineering Services, MSD

Solutions that were lifted and shifted into Microsoft Azure infrastructure as a service (IaaS) resources are regularly reassessed for migration or refactoring into PaaS offerings. We also adopted Microsoft Azure Monitor for consolidated monitoring, not only of our Azure resources but also of our on-premises resources.

Embracing the digital ecosystem

Pfluger smiles in a screenshot of her taken from a video interview. She’s shown from her home office.
Optimizing the company’s use of Microsoft Azure has helped keep our costs down, says Heather Pfluger, the general manager of Infrastructure and Engineering Services in Microsoft Digital Employee Experience.

Our final phase focuses on developing intelligent systems on Microsoft Azure to deliver reliable, scalable services and to connect operations processes across Microsoft. We’ve built automation more deeply into our support and development processes by embracing a DevOps culture and open-source standards in our solutions.

Together, Microsoft Azure PaaS offerings and Microsoft Azure DevOps enable our engineers to focus on features and usability, while the ARM fabric and Microsoft Azure Monitor provide unified management to provision, manage, and decommission infrastructure resources securely.

“This final phase is never really final,” says Heather Pfluger, the general manager of Infrastructure and Engineering Services in MSD who manages Microsoft’s internal Microsoft Azure profile. “Continual evaluation and optimization of our Microsoft Azure environment is built into how we manage our resources in the cloud. As new features and engineering approaches arise, we’re adapting our methods and best practices to get the most from our investment.”

Gaining efficiency from past experience

Apple, who works on Pfluger’s team, adds that customers’ migrations can benefit from taking a shortcut that Microsoft didn’t take.

“As early adopters, our migration practices were pushing the toolsets available,” he says. “When we looked at our on-premises environment and what was available in Azure, it made sense to move a significant portion of our solutions directly into IaaS resources.”

Apple talks about the tradeoffs made between agility and efficiency.

There are much better tools and best practices in place now to migrate on-premises solutions directly into PaaS resources, eliminating the need to lift and shift and saving the cost of creating and maintaining those IaaS resources.

—Pete Apple, principal program manager for Microsoft Azure engineering, MSD

“These solutions were being lifted from the datacenter and shifted straight into Azure Virtual Machines, Virtual Networks, and Storage Accounts,” he says. “This allowed us to recreate the on-premises environment in the cloud so we could get it out of the datacenter quickly, but it still left us with some of the maintenance tasks and costs inherent with IaaS infrastructure and room for further optimization with PaaS-based solutions.”

After the lift-and-shift migration, Apple’s teams re-engineered and re-platformed the IaaS solutions to use PaaS solutions such as Microsoft Azure SQL Database and Microsoft Azure Web Apps. Apple explains the shortcut: “There are much better tools and best practices in place now to migrate on-premises solutions directly into PaaS resources, eliminating the need to lift and shift and saving the cost of creating and maintaining those IaaS resources.”

Managing data and resource sprawl with agility

We’re also undertaking specific efforts across our Microsoft Azure estate to reduce costs and increase efficiency. Azure infrastructure supports the entire Microsoft cloud, including Microsoft 365, Microsoft Power Platform, and Microsoft Dynamics 365. While most of these offerings don’t allow for direct resource optimization, understanding the fundamentals of cloud scaling and billing is a critical aspect of using them efficiently. Data sprawl is a constant consideration for us.

Graphic showing savings Microsoft gained from moving to Microsoft Azure, including sizing VMs down, moving older D-series and E-series, and more.
We were able to keep our costs flat while our workloads increased by 20 percent internally here at Microsoft thanks to migrating the company to Microsoft Azure and then optimizing our usage.

Dan Babb is the Principal Software Engineering Manager responsible for MSD’s implementation of Microsoft Azure Synapse Analytics for big data ingestion, migration, and exploration. It’s a massive data footprint, with more than 1 billion read operations and 10 petabytes of data consumed monthly through Apache Spark clusters.

Small specifics with Spark clusters can make a big difference.

“Each job that comes through Azure Synapse Analytics is run on a Spark cluster for compute services,” Babb says. “There’s a large selection of compute sizes available. The largest ones process data the quickest, but, naturally, they’re also the most expensive. We all like things to be done quickly, so many of our engineers were using very large compute sizes because they’re fast.”

Babb clarifies that just because you can use the fastest method doesn’t mean you should.

“Many of our jobs aren’t crucially time-sensitive, so we stopped using the bigger cluster sizes because we didn’t need to,” he says.

Babb emphasizes that accurately assessing the workload and priority of each job has significantly reduced costs.

“Processing a workload on a smaller instance for 20 minutes instead of using a larger instance for 5 minutes has resulted in significant cost savings for us,” he says. “We’re monitoring our subscriptions, and if a really big cluster size gets spun up, an Azure Monitor alert notifies our engineering leads so they can follow up to ensure that the cluster size is appropriate for the job it’s running.”
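Babb’s tradeoff can be sanity-checked with simple arithmetic. The sketch below uses hypothetical hourly rates (`SMALL_RATE` and `LARGE_RATE` are made-up numbers, not actual Azure pricing) to show why a slower run on a smaller compute size can still be the cheaper option:

```python
# Hypothetical per-hour rates -- illustrative only, not actual Azure pricing.
SMALL_RATE = 4.0   # $/hour for a small Spark compute size
LARGE_RATE = 26.0  # $/hour for a large Spark compute size

def job_cost(rate_per_hour: float, minutes: float) -> float:
    """Cost of running a job for the given duration at an hourly rate."""
    return rate_per_hour * (minutes / 60.0)

small = job_cost(SMALL_RATE, 20)  # 20 minutes on the small size
large = job_cost(LARGE_RATE, 5)   # 5 minutes on the large size

# The slower run wins whenever its hourly rate is less than the large
# size's rate scaled by the speedup (here, 4x faster but 6.5x the rate).
print(f"small: ${small:.2f}, large: ${large:.2f}")
```

For a job that isn’t time-sensitive, the only question that matters is total cost, not wall-clock time, which is exactly the assessment Babb describes.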

Apple says this cost-cutting approach is being widely adopted across our organization.

“Our business program managers are realizing that they can save money by slowing down projects that don’t need to be rushed,” he says. “For example, we had some folks in Finance who realized that some of their batch reporting didn’t really need to be out in one hour, it was fine if it took eight hours because they only had to run their reports once per day.”

Babb’s team is also designing for distributed processing, creating solutions that are dispersed across clusters and Microsoft Azure Synapse Analytics workspaces to create distributed platform architecture that is more flexible and less prone to a single point of failure.

“If we run into an issue with a component or workspace and we have to take it down, it doesn’t affect the entire solution, just the single cluster or workspace and the job it was running,” he says.

Using multiple workspaces and clusters has also made it much easier to get granular reporting and cost estimation. Babb’s team members are using monitoring and reporting that enable them to understand the exact cost for any specific job, from ingestion to storage to report generation.

Designing for Zero Trust

The Zero Trust security model is pervasive across our Microsoft Azure environment. Based on the principle of verified trust—to trust, you must first verify—Zero Trust eliminates the inherent trust that is assumed inside the traditional corporate network. Zero Trust architecture reduces risk across all environments by establishing strong identity verification, validating device compliance prior to granting access, and ensuring least privilege access to only explicitly authorized resources.

The Zero Trust model assumes every request is a potential breach. As such, every request that travels through our Microsoft Azure or on-premises environments must be verified as though it originates from an open network. Regardless of where the request originates or what resource it accesses, Zero Trust teaches us to “never trust, always verify.” Every access request is fully authenticated, authorized, and encrypted before granting access. Micro-segmentation and least privileged access principles are applied to minimize lateral movement. Rich intelligence and analytics are used to detect and respond to anomalies in real time.

Throughout the Zero Trust model in Microsoft Azure, opportunities exist for simplification and increased efficiency. Microsoft Entra ID allows us to centralize our identity and access workload, which simplifies identification and authorization across the hybrid cloud. Azure’s flexible network infrastructure allows us to implement complex and agile micro-segmentation scenarios in minutes with Microsoft Azure Bicep templates and Microsoft Azure Virtual Networks. Our engineers are creating connectivity scenarios and solutions that were simply unimaginable using traditional networking practices.

Mei Lau is a Principal Program Manager for Security Monitoring Engineering. Her team’s job is to ensure that across the increasingly complex and dynamic Microsoft Azure networking environment, Zero Trust principles are adhered to and Microsoft’s network environment remains safe and secure.

Her team is using Microsoft Sentinel to deliver intelligent security analytics and threat intelligence across the enterprise at Microsoft. Sentinel allows her security experts to detect attacks and hunt for threats across millions of network signals.

Real-time detection data is more expensive than some of the other data storage options we have. As convenient as it would be to have it all, it’s not that critical. We move our older data into Azure Data Explorer where it’s less expensive to store, but still allows us to use Kusto Query Language (KQL) queries just like we would in Sentinel.

—Mei Lau, principal program manager, Security Monitoring Engineering, Microsoft Digital Security and Resilience

With that much traffic to collect and examine, Lau notes that cost in Sentinel comes down to one primary factor: data ingestion.

“We want our investigation scope to be as detailed as possible, so naturally, the inclination is to keep all the data we can,” she says.

The reality of the situation is that you need to be careful to keep your costs down.

“Real-time detection data is more expensive than some of the other data storage options we have,” Lau says. “As convenient as it would be to have it all, it’s not that critical. We move our older data into Azure Data Explorer where it’s less expensive to store, but still allows us to use Kusto Query Language (KQL) queries just like we would in Sentinel.”
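The economics of Lau’s tiering approach can be sketched with back-of-envelope arithmetic. The per-GB rates below (`HOT_RATE_GB`, `COLD_RATE_GB`) are hypothetical placeholders, not actual Microsoft Sentinel or Azure Data Explorer pricing:

```python
# Hypothetical per-GB monthly rates -- illustrative, not actual pricing.
HOT_RATE_GB = 2.50   # real-time analytics tier (e.g., Sentinel workspace)
COLD_RATE_GB = 0.12  # cheaper long-term store (e.g., Azure Data Explorer)

def tiered_cost(gb_per_day: float, hot_days: int, total_days: int) -> float:
    """Monthly-equivalent cost of keeping recent data hot and the rest cold."""
    hot_gb = gb_per_day * hot_days
    cold_gb = gb_per_day * max(total_days - hot_days, 0)
    return hot_gb * HOT_RATE_GB + cold_gb * COLD_RATE_GB

# 100 GB/day with 90-day retention: all hot vs. 30 days hot + 60 days cold.
all_hot = tiered_cost(100, 90, 90)
tiered = tiered_cost(100, 30, 90)
print(f"all hot: ${all_hot:,.0f}/mo, tiered: ${tiered:,.0f}/mo")
```

Because the older data stays queryable with KQL in either store, the tradeoff is almost purely a storage-rate decision, which is why moving aged data out of the real-time tier pays off.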

In Microsoft Sentinel, threat detections result in stored data, so Lau and her team are also diligent about the accuracy and usefulness of the more than 200 detection rules that are configured in Sentinel.

“We’re continually monitoring and managing detections that fire false positives,” she says. “We generally want at least 80 percent fidelity for a healthy detection. If we don’t achieve that, we either refine the detection or remove it.”
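The 80 percent fidelity bar Lau describes amounts to a simple ratio check over each rule’s alert history. Here’s a minimal sketch (the `detection_fidelity` and `triage` helpers are our own illustrative names, not Sentinel APIs):

```python
def detection_fidelity(true_positives: int, total_alerts: int) -> float:
    """Share of a detection rule's alerts that were real incidents."""
    return true_positives / total_alerts if total_alerts else 0.0

def triage(true_positives: int, total_alerts: int,
           threshold: float = 0.8) -> str:
    """Keep rules at or above the fidelity threshold; flag the rest for rework."""
    fidelity = detection_fidelity(true_positives, total_alerts)
    return "healthy" if fidelity >= threshold else "refine or remove"

print(triage(45, 50))  # 90% fidelity -> "healthy"
print(triage(30, 50))  # 60% fidelity -> "refine or remove"
```

Low-fidelity rules cost twice: analysts burn time on false positives, and every firing adds to the stored-data bill, so pruning them serves both goals at once.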

Our governance model provides centralized control and coordination for all cost-optimization efforts. Getting this right is pivotal for any organization looking to get the most out of being on the cloud in Azure.

—Pete Apple, principal program manager for Microsoft Azure engineering, MSD

Lau’s team proactively monitors data storage in Microsoft Sentinel to look for sudden spikes in usage or other indicators that data usage practices might need to be assessed. It all contributes to a more efficient and streamlined threat management system that does its job well and doesn’t break the bank.

Observing results and managing governance

Governance is critical to cost optimization for our applications and the Microsoft Azure services those applications use, ensuring that recommendations are identified and implemented effectively.

“Our governance model provides centralized control and coordination for all cost-optimization efforts,” Apple says. “Getting this right is pivotal for any organization looking to get the most out of being on the cloud in Azure.”

Our model consists of several important components, including:

  • Microsoft Azure Advisor recommendations and automation. Advisor cost management recommendations serve as the basis for our optimization efforts. We channel Advisor recommendations into our IT service management and Microsoft Azure DevOps environment to better track how we implement recommendations and ensure effective optimization.
  • Tailored cost insights. We’ve developed dashboards to identify the costliest applications and business groups and to surface opportunities for optimization. The data that these dashboards provide empowers engineering leaders to observe and track important Microsoft Azure cost components in their service hierarchy to ensure that optimization is effective.
  • Improved Microsoft Azure budget management. We perform our Azure budget planning by using a bottom-up approach that involves our finance and engineering teams. Open communication and transparency in planning are important, and we track forecasts for the year alongside actual spending to date to enable accurate adjustments to spending estimates and closely track our budget targets. Relevant and easily accessible spending data helps us identify trend-based anomalies to control unintentional spending that can happen when resources are scaled or allocated unnecessarily in complex environments.
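The trend-based anomaly spotting mentioned in the last point can be illustrated with a minimal sketch: flag any day whose spend deviates sharply from a trailing window’s average. The function below is a hypothetical illustration, not the dashboarding logic we actually run:

```python
from statistics import mean, stdev

def spend_anomalies(daily_spend: list[float], window: int = 7,
                    z: float = 3.0) -> list[int]:
    """Flag day indices whose spend deviates more than z standard
    deviations from the trailing window's mean."""
    flagged = []
    for i in range(window, len(daily_spend)):
        trail = daily_spend[i - window:i]
        mu, sigma = mean(trail), stdev(trail)
        if sigma and abs(daily_spend[i] - mu) > z * sigma:
            flagged.append(i)
    return flagged

# Steady ~$1,000/day with one unexplained spike on day 7.
history = [1000, 1010, 990, 1005, 995, 1000, 1008, 2500, 1002]
print(spend_anomalies(history))  # -> [7]
```

Catching the spike the day it happens, rather than at month-end reconciliation, is what turns spend data into the “accurate adjustments” the budget process depends on.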

Implementing a governance solution has enabled us to realize considerable savings by making a simple change to Microsoft Azure resources across our entire footprint. For example, we implemented a recommendation to convert Microsoft Azure SQL Database instances from the Standard database transaction unit (DTU)-based tier to a serverless tier by using a simple Microsoft Azure Resource Manager template and the auto-pause capability. The configuration change reduced costs by 97 percent.
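The exact savings depend on the workload, but the auto-pause arithmetic behind a figure like 97 percent is straightforward: a provisioned tier bills around the clock, while serverless compute bills only while the database is active. The rates below are hypothetical, not actual Azure SQL Database pricing:

```python
# Hypothetical rates -- illustrative only, not actual Azure SQL pricing.
PROVISIONED_PER_HOUR = 1.20  # fixed DTU-based tier, billed 24/7
SERVERLESS_PER_HOUR = 1.50   # serverless rate, billed only while active

def monthly_cost_provisioned() -> float:
    """A provisioned tier accrues charges every hour of a 30-day month."""
    return PROVISIONED_PER_HOUR * 24 * 30

def monthly_cost_serverless(active_hours_per_day: float) -> float:
    """With auto-pause enabled, compute is billed only for active hours."""
    return SERVERLESS_PER_HOUR * active_hours_per_day * 30

# A database active ~30 minutes a day stays paused the rest of the time.
prov = monthly_cost_provisioned()
sls = monthly_cost_serverless(0.5)
print(f"savings: {100 * (1 - sls / prov):.0f}%")  # -> savings: 97%
```

A higher per-hour serverless rate still wins decisively when the database sits idle most of the day, which is the profile this recommendation targets.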

Moving forward

As we continue our journey, we’re focusing on refining our efforts and identifying new opportunities for further cost optimization in Microsoft Azure.

Our MSD Azure footprint will continue to grow in the years ahead, and our cost-optimization and efficiency efforts will grow to ensure that we’re making the most of our Azure investment.

—Heather Pfluger, general manager, Infrastructure and Engineering Services, MSD

“There’s still a lot we can do here,” Pfluger says. “We’re building and increasing monitoring measures that help us ensure we’re using the optimal Azure services for our solutions. We’re infusing automated scalability into every element of our Azure environment and reducing our investment in the IaaS components that currently support some of our legacy technologies.”

Microsoft Azure optimization is always ongoing.

“Our MSD Azure footprint will continue to grow in the years ahead, and our cost-optimization and efficiency efforts will grow to ensure that we’re making the most of our Azure investment,” Pfluger says.

Key Takeaways

  • Embrace modern engineering practices. Adopting modern engineering practices that support reliability, security, operational excellence, and performance efficiency will help to enable better cost optimization in Microsoft Azure. Staying aware of new Azure services and changes to existing functionality will also help you recognize cost-optimization opportunities as soon as possible.
  • Use data to drive results. Accurate and current data is the basis for making timely optimization decisions that provide the largest cost savings possible and prevent unnecessary spending. Using optimization-relevant metrics and monitoring from Microsoft Azure Monitor is critical to fully understanding the necessity and impact of optimization across services and business groups.
  • Use proactive cost-management practices. Using real-time data and proactive cost-management practices can get you from recommendation to implementation as quickly as possible while maintaining governance over the process.
  • Implement central governance with local accountability. Auditing Microsoft Azure cost-optimization efforts helps improve Azure budget-management processes and identifies gaps in cost-management methods.

We'd like to hear from you!

Want more information? Email us and include a link to this story and we’ll get back to you.

Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.
