Azure Management Archives - Inside Track Blog

Doing more with less internally at Microsoft with Microsoft Azure

Jason Kellington — Thu, 23 Jan 2025 15:00:26 +0000

How do we at Microsoft get the best value from our Microsoft Azure environment? We’ve been refining and optimizing the way we use the cloud for years, and as such, our answer isn’t just about how much we’ve been able to push down our monthly Azure bill.

Our migration from managing our own datacenters to Microsoft Azure has been a process of learning and growing. We’ve moved more than 600 services and solutions comprised of approximately 1,400 components to Azure-based cloud technologies that require less specialized skill sets to use and provide quicker, more agile access to infrastructure and solutions.

We’re the Microsoft Digital team and we have led the company’s move from the datacenter to the cloud, enabling best-in-class platforms and productivity services for the mobile-first, cloud-first world. This strategy harmonizes the interests of our employee users of our services, our developers, and our team of IT implementers who provide the core of our IT operations.

The freedom to provision resources in minutes instead of days has radically changed the way we in Microsoft Digital enable teams across the company to spin up the environments and resources they need on demand, which in turn empowers our engineering teams to respond more quickly to our evolving business needs.

However, we’ve found that easy provisioning and quick deployments can be costly. An unmanaged or undermanaged enterprise estate in Microsoft Azure can quickly lead to significant cloud billing costs and under-utilized resources. But in our journey to the cloud, we’ve learned that there are many smart ways to optimize how you use the cloud, tricks of the trade that we’re using to keep our costs down while we transform the way we work. In this blog post, we’ll share the lessons we’ve learned here at Microsoft on how you can fine-tune your use of Microsoft Azure at your company.

Watch to learn how optimizing our Microsoft Azure workloads is helping Microsoft operate more efficiently.

Adopting modern engineering practices

Modern engineering practices underpin everything we do in Microsoft Azure, from single resource deployments to enterprise-scale, globally distributed Azure-based solutions that span hundreds of resources. Our modern engineering vision has created culture, tools, and practices focused on developing high-quality, secure, and feature-rich services to enable digital transformation across the organization.

Our operations and engineering teams have journeyed through several phases of efficiency maturity. Through each of these phases, our operations substructure had to evolve, and many of those changes resulted in increased efficiency, not just with the bottom line on our monthly Azure bill, but with the way we do service management in Azure, including development, deployment, change management, monitoring, and incident management.

—Pete Apple, principal program manager for Microsoft Azure engineering, Microsoft Digital

Now that we’ve fully migrated Microsoft to Microsoft Azure, we’re finding smart ways to use our cloud product more efficiently, says Pete Apple, a principal program manager for Microsoft Azure Engineering in Microsoft Digital.

Pete Apple is a Principal Program Manager for Microsoft Azure Engineering in Microsoft Digital. He and his team have been responsible for overseeing and implementing our massive migration to the cloud over the past 8 years. They’re also responsible for ensuring that the company’s enterprise estate in Microsoft Azure is running at top efficiency.

“Our operations and engineering teams have journeyed through several phases of efficiency maturity,” Apple says. “Through each of these phases, our operations substructure had to evolve, and many of those changes resulted in increased efficiency, not just with the bottom line on our monthly Azure bill, but with the way we do service management in Azure, including development, deployment, change management, monitoring, and incident management.”

We went through three phases on our journey to greater efficiency in Microsoft Azure. Phase one focused on improving operational efficiency, phase two examined how we could deliver value through innovation, and in phase three we embraced transforming our digital ecosystem. Here’s a summary of the three phases:

Improving operational efficiency

At Microsoft Digital, we play a pivotal role in Microsoft business strategy, as most business processes in the company depend on us. To help Microsoft transform on our journey to the cloud, we identified key focus areas to improve in this first phase of our transformation: aligning services, optimizing infrastructure, and assessing our culture.

The first phase involved culture and structure as much as it did strategy and platform management. We realigned our organization to better support a brand-new way of providing services and support to the company in Microsoft Azure. Our teams needed to realign to eliminate information silos between different support areas. In many cases, teams that started to work together realized they had duplicate projects with similar goals.

Reducing projects and streamlining delivery methods freed up engineering resources to accomplish more in less time, while automated provisioning and self-service tools helped our teams plan their own migrations and accurately assess their portion of our Microsoft Azure estate.

Our engineering culture underwent a radical change in phase one. We moved toward empowering our engineers to create business solutions, not just create and manage processes. This led to a more holistic view of what we were trying to accomplish—as individuals and as teams—and it increased innovation, creativity, and productivity throughout our engineering processes.

Delivering value through innovation

We migrated more than 90 percent of our IT infrastructure to Microsoft Azure in phase one. In phase two, we embraced the Azure platform and cloud-native engineering design principles by adopting Infrastructure as Code and continuous deployment. We redefined operations roles and retrained people from traditional IT roles to be business relationship managers, engineering program managers, service engineers, and software engineers.

We also radically simplified our IT operations. The rapid provisioning and allocation process in Microsoft Azure enabled us to increase our speed 40-fold by eliminating, streamlining, and connecting processes, and by aligning processes for Azure. Azure native solutions, especially platform-as-a-service (PaaS) offerings were adopted across all aspects of the engineering and operations lifecycle. These included infrastructure as code with ARM templates, APIs, and PowerShell.

This final phase is never really final. Continual evaluation and optimization of our Microsoft Azure environment is built into how we manage our resources in the cloud. As new features and engineering approaches arise, we’re adapting our methods and best practices to get the most from our investment.

—Heather Pfluger, general manager, Infrastructure and Engineering Services, Microsoft Digital

Solutions that were lifted-and shifted into Microsoft Azure infrastructure as a service (IaaS) resources are regularly reassessed for migration or refactoring into PaaS offerings. We also adopted Microsoft Azure Monitor for consolidated monitoring not only for our Azure resources, but also on-premises resources.

Embracing the digital ecosystem

Optimizing the company’s use of Microsoft Azure has helped keep our costs down, says Heather Pfluger, the general manager of Infrastructure and Engineering Services in Microsoft Digital Employee Experience.

Our final phase is focusing on developing intelligent systems on Microsoft Azure to deliver reliable, scalable services and to connect operations processes across Microsoft. Automation has been built more deeply into our support and development processes by embracing a DevOps culture and open-source standards in our solutions.

Together, Microsoft Azure PaaS offerings and Microsoft Azure DevOps enable our engineers to focus on features and usability, while the ARM fabric and Microsoft Azure Monitor provide unified management to provision, manage, and decommission infrastructure resources securely.

“This final phase is never really final,” says Heather Pfluger, the general manager of Infrastructure and Engineering Services in Microsoft Digital who manages Microsoft’s internal Microsoft Azure profile. “Continual evaluation and optimization of our Microsoft Azure environment is built into how we manage our resources in the cloud. As new features and engineering approaches arise, we’re adapting our methods and best practices to get the most from our investment.”

Gaining efficiency from past experience

Apple, who works on Pfluger’s team, adds that customers’ migrations can benefit from taking a shortcut that Microsoft didn’t take.

“As early adopters, our migration practices were pushing the toolsets available,” he says. “When we looked at our on-premises environment and what was available in Azure, it made sense to move a significant portion of our solutions directly into IaaS resources.”

Apple talks about the tradeoffs made between agility and efficiency.

There are much better tools and best practices in-place now to migrate on-premises solutions directly into PaaS resources, eliminating the need to lift-and-shift and saving the cost of creating and maintaining those IaaS resources.

—Pete Apple, principal program manager for Microsoft Azure engineering, Microsoft Digital

“These solutions were being lifted from the datacenter and shifted straight into Azure Virtual Machines, Virtual Networks, and Storage Accounts,” he says. “This allowed us to recreate the on-premises environment in the cloud so we could get it out of the datacenter quickly, but it still left us with some of the maintenance tasks and costs inherent with IaaS infrastructure and room for further optimization with PaaS-based solutions.”

After the lift-and-shift migration, Apple’s teams re-engineering and re-platformed the IaaS solutions to use PaaS solutions such as Microsoft Azure SQL Database and Microsoft Azure Web Apps. Apple explains the shortcut, “There are much better tools and best practices in-place now to migrate on-premises solutions directly into PaaS resources, eliminating the need to lift-and-shift and saving the cost of creating and maintaining those IaaS resources.”

Managing data and resource sprawl with agility

We’re also undergoing specific efforts across our Microsoft Azure estate to reduce costs and increase efficiency. Azure infrastructure supports the entire Microsoft cloud, including Microsoft 365, Microsoft Power Platform, and Microsoft Dynamics 365. While most of these offerings do not allow for direct resource optimization, understanding the fundamentals of cloud scaling and billing is a critical aspect of using them efficiently. Data sprawl is a constant consideration for us.

We were able to keep our costs flat while our workloads increased by 20 percent internally here at Microsoft thanks to migrating the company to Microsoft Azure and then optimizing our usage.

Dan Babb is the Principal Software Engineering Manager responsible for Microsoft Digital’s implementation of Microsoft Azure Synapse Analytics for big data ingestion, migration, and exploration. It’s a massive data footprint, with more than 1 billion read operations and 10 petabytes of data consumed monthly through Apache Spark clusters.

Small specifics with Spark clusters can make a big difference.

“Each job that comes through Azure Synapse Analytics is run on a Spark cluster for compute services,” Babb says. “There’s a large selection of compute sizes available. The largest ones process data the quickest, but, naturally, they’re also the most expensive. We all like things to be done quickly, so many of our engineers were using very large compute sizes because they’re fast.”

Babb clarifies that just because you can use the fastest method doesn’t mean you should.

“Many of our jobs aren’t crucially time-sensitive, so we stopped using the bigger cluster sizes because we didn’t need to,” he says.

Babb emphasizes that accurately assessing the workload and priority of each job has significantly reduced costs.

“Processing a workload on a smaller instance for 20 minutes instead of using a larger instance for 5 minutes has resulted in significant cost savings for us,” he says. “We’re monitoring our subscriptions and if a really big cluster size gets spun up, an Azure Monitor Alert notifies our engineering leads they can follow up to ensure that the cluster size is appropriate for the job it’s running.”

Apple says this is a way of cost cutting that is being widely adopted across our organization.

“Our business program managers are realizing that they can save money by slowing down projects that don’t need to be rushed,” he says. “For example, we had some folks in Finance who realized that some of their batch reporting didn’t really need to be out in one hour, it was fine if it took eight hours because they only had to run their reports once per day.”

Babb’s team is also designing for distributed processing, creating solutions that are dispersed across clusters and Microsoft Azure Synapse Analytics workspaces to create distributed platform architecture that is more flexible and less prone to a single point of failure.

“If we run into an issue with a component or workspace and we have to take it down, it doesn’t affect the entire solution, just the single cluster or workspace and the job it was running,” he says.

Using multiple workspaces and clusters has also made it much easier to get granular reporting and cost estimation. Babb’s team members are using monitoring and reporting that enable them to understand the exact cost for any specific job, from ingestion to storage to report generation.

Designing for Zero Trust

The Zero Trust security model is pervasive across our Microsoft Azure environment. Based on the principle of verified trust—to trust, you must first verify—Zero Trust eliminates the inherent trust that is assumed inside the traditional corporate network. Zero Trust architecture reduces risk across all environments by establishing strong identity verification, validating device compliance prior to granting access, and ensuring least privilege access to only explicitly authorized resources.

The Zero Trust model assumes every request is a potential breach. As such, every request that travels through our Microsoft Azure or on-premises environments must be verified as though it originates from an open network. Regardless of where the request originates or what resource it accesses, Zero Trust teaches us to “never trust, always verify.” Every access request is fully authenticated, authorized, and encrypted before granting access. Micro-segmentation and least privileged access principles are applied to minimize lateral movement. Rich intelligence and analytics are used to detect and respond to anomalies in real time.

Throughout the Zero Trust model in Microsoft Azure, opportunities exist for simplification and increased efficiency. Microsoft Azure Entra ID allows us to centralize our identity and access workload, which simplifies identification and authorization across the hybrid cloud. Azure’s flexible network infrastructure allows us to implement complex and agile micro-segmentation scenarios in minutes with Microsoft Azure Bicep Templates, and Microsoft Azure Virtual Networks. Our engineers are creating connectivity scenarios and solutions that were simply unimaginable using traditional networking practices.

Mei Lau is a Principal Program Manager for Security Monitoring Engineering. Her team’s job is to ensure that across the increasingly complex and dynamic Microsoft Azure networking environment, Zero Trust principles are adhered to and Microsoft’s network environment remains safe and secure.

Her team is using Microsoft Sentinel to deliver intelligent security analytics and threat intelligence across the enterprise at Microsoft. Sentinel allows her security experts to detect attacks and hunt for threats across millions of network signals.

Real-time detection data is more expensive than some of the other data storage options we have. As convenient as it would be to have it all, it’s not that critical. We move our older data into Azure Data Explorer where it’s less expensive to store, but still allows us to use Kusto Query Language (KQL) queries just like we would in Sentinel.

—Mei Lau, principal program manager, Security Monitoring Engineering, Microsoft Digital Security and Resilience

With that much traffic to collect and examine, Lau notes that cost in Sentinel comes down to one primary factor: data ingestion.

“We want our investigation scope to be as detailed as possible, so naturally, the inclination is to keep all the data we can,” she says.

The reality of the situation is that you need to be careful to keep your costs down.

“Real-time detection data is more expensive than some of the other data storage options we have,” Lau says. “As convenient as it would be to have it all, it’s not that critical. We move our older data into Azure Data Explorer where it’s less expensive to store, but still allows us to use Kusto Query Language (KQL) queries just like we would in Sentinel.”

In Microsoft Sentinel, threat detections result in stored data, so Lau and her team are also diligent about the accuracy and usefulness of the more than 200 detection rules that are configured in Sentinel.

“We’re continually monitoring and managing detections that fire false positives,” she says. “We generally want at least 80 percent fidelity for a healthy detection. If we don’t achieve that, we either refine the detection or remove it.”

Our governance model provides centralized control and coordination for all cost-optimization efforts. Getting this right is pivotal for any organization looking to get the most out of being on the cloud in Azure.

—Pete Apple, principal program manager for Microsoft Azure engineering, Microsoft Digital

Lau’s team proactively monitors data storage in Microsoft Sentinel to look for sudden spikes in usage or other indicators that data usage practices might need to be assessed. It all contributes to a more efficient and streamlined threat management system that does its job well and doesn’t break the bank.

Observing results and managing governance

To ensure effective identification and implementation of recommendations, governance in cost optimization is critical for our applications and the Microsoft Azure services that those applications use.

“Our governance model provides centralized control and coordination for all cost-optimization efforts,” Apple says. “Getting this right is pivotal for any organization looking to get the most out of being on the cloud in Azure.”

Our model consists of several important components, including:

Microsoft Azure Advisor recommendations and automation. Advisor cost management recommendations serve as the basis for our optimization efforts. We channel Advisor recommendations into our IT service management and Microsoft Azure DevOps environment to better track how we implement recommendations and ensure effective optimization.
Tailored cost insights. We’ve developed dashboards to identify the costliest applications and business groups and identify opportunities for optimization. The data that these dashboards provide empower engineering leaders to observe and track important Microsoft Azure cost components in their service hierarchy to ensure that optimization is effective.
Improved Microsoft Azure budget management. We perform our Azure budget planning by using a bottom-up approach that involves our finance and engineering teams. Open communication and transparency in planning are important, and we track forecasts for the year alongside actual spending to date to enable accurate adjustments to spending estimates and closely track our budget targets. Relevant and easily accessible spending data helps us identify trend-based anomalies to control unintentional spending that can happen when resources are scaled or allocated unnecessarily in complex environments.

Implementing a governance solution has enabled us to realize considerable savings by making a simple change to Microsoft Azure resources across our entire footprint. For example, we implemented a recommendation to convert Microsoft Azure SQL Database instances from the Standard database transaction unit (DTU) based tier to a serverless tier by using a simple Microsoft Azure Resource Manager template and the auto-pause capability. The configuration change reduced costs by 97 percent.

Moving forward

As we continue our journey, we’re focusing on refining our efforts and identifying new opportunities for further cost optimization in Microsoft Azure.

Our Microsoft Digital Azure footprint will continue to grow in the years ahead, and our cost-optimization and efficiency efforts will grow to ensure that we’re making the most of our Azure investment.

—Heather Pfluger, general manager, Infrastructure and Engineering Services, Microsoft Digital

“There’s still a lot we can do here,” Pfluger says. “We’re building and increasing monitoring measures that help us ensure we’re using the optimal Azure services for our solutions. We’re infusing automated scalability into every element of our Azure environment and reducing our investment in the IaaS components that currently support some of our legacy technologies.”

Microsoft Azure optimization is always ongoing.

“Our Microsoft Digital Azure footprint will continue to grow in the years ahead, and our cost-optimization and efficiency efforts will grow to ensure that we’re making the most of our Azure investment,” Pfluger says.

Embrace modern engineering practices. Adopting modern engineering practices that support reliability, security, operational excellence, and performance efficiency will help to enable better cost optimization in Microsoft Azure. Staying aware of new Azure services and changes to existing functionality will also help you recognize cost-optimization opportunities as soon as possible.
Use data to drive results. Accurate and current data is the basis for making timely optimization decisions that provide the largest cost savings possible and prevent unnecessary spending. Using optimization-relevant metrics and monitoring from Microsoft Azure Monitor is critical to fully understanding the necessity and impact of optimization across services and business groups.
Use proactive cost-management practices. Using real-time data and proactive cost-management practices can get you from recommendation to implementation as quickly as possible while maintaining governance over the process.
Implement central governance with local accountability. Auditing Microsoft Azure cost-optimization efforts to help improve Azure budget-management processes will identify gaps in cost management methods.

Want more information? Email us and include a link to this story and we’ll get back to you.

Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.

The post Doing more with less internally at Microsoft with Microsoft Azure appeared first on Inside Track Blog.

Transforming modern engineering at Microsoft

Inside Track staff — Sat, 11 Jan 2025 17:00:47 +0000

[Editor’s note: This content was written to highlight a particular event or moment in time. Although that moment has passed, we’re republishing it here so you can see what our thinking and experience was like at the time.]

Our Microsoft Digital team is implementing a modern engineering vision that creates a culture, tools, and practices focused on developing high-quality, secure, and feature-rich services to enable digital transformation across the company. Our Modern Engineering initiative has helped us be customer-obsessed, accelerated the delivery of new capabilities, and improved our engineering productivity.

Our journey

Our move to the cloud enabled us to increase the overall agility of the development process and accelerate value delivery for approximately 600 services comprised of about 1,400 components to new cloud technologies which provide quicker access to additional infrastructure. This enables spinning up environments and resources on demand, which allows an engineer to respond more quickly to evolving business needs.

However, we still needed to address several structural issues, including inconsistency between teams in basic engineering fundamentals like coding standards, automated testing, security scans, compliance, release methodology, gated builds, and releases.

We lacked a centralized common engineering system and related practices. Recognizing that we could not continue to evolve our engineering system in a federated way, we invested in a central team. The team was chartered to develop a common engineering system based on Microsoft Azure DevOps, while driving consistency across the organization regarding how they design, code, instrument, test, build, and deploy services. We brought a product engineering mindset to our services by defining a vision for each service area and establishing priorities based on objectives and key results (OKRs) which we define, track, and report using Viva Goals. These scope what we want to achieve each planning period and then execute on them via a defined cadence of sprints. The resulting engineering processes have promoted business alignment, developer efficiency, and cross-team mobility.

We incorporated industry-leading development practices for accessibility, security, and compliance. Achieving compliance has been very challenging, forcing us to change from legacy processes and tooling and requiring us to actively respond to our technical debt in these areas. We also lacked a consistent level of telemetry and monitoring that allowed us to obtain key insights about service health, features, customer experience, and usage patterns. We have moved towards a Live Site culture so that we can comprehensively drive sustained improvements in service quality. Telemetry capabilities have been improved through the ability to do synthetic monitoring and the ingestion of data from a wide variety of data sources and using services such as Azure Monitor.

Our vision for modern engineering

Microsoft’s digital transformation requires us to deliver high-quality capabilities and solutions at a faster pace and with reliability and security. To achieve this, we’re modernizing how we build, deploy, and manage our services to get new functionality in our users’ hands as rapidly as possible. We’re re-examining every part of our engineering process and instituting modern engineering practices. Satya Nadella, our Chief Executive Officer, summarized this well.

“In order to deliver the experiences our customers need for the mobile-first, cloud-first world, we will modernize our engineering processes to be customer-obsessed, data-driven, speed-oriented and quality focused.”

Our ongoing investments in modern engineering practices and technology build on the foundation that we’ve already established, and they reflect our vision and support our cultural changes. We have three key pillars on which we’re basing these investments along with a commitment to infuse AI into each pillar wherever appropriate.

Customer obsession
Engineering productivity
Rapid delivery

Customer obsession

We want to ensure our engineers keep customers front and center in their thoughts, so we’re capturing feedback to provide our engineers with a deep understanding of the customer experience. Our service monitoring has enabled us to be alerted to problems and fix them before our customers are even aware of them.

We are the first customers of Microsoft’s commercial offerings, which enables us to identify and address the engineering needs of the enterprise operating in a cloud-centric architecture. We constantly work with our product engineering groups across the company, creating a virtuous cycle that makes our products such as Azure DevOps and Azure services even more enterprise ready.

Using customer feedback to drive development

We’re keeping the customer experience at the center of the engineering process via feedback loop mechanisms. Feedback loops serve as a foundation for hypothesis-driven product improvements based on actual sentiment and usage data. We’re making feedback submission as easy as possible with the same tool that the Microsoft Office product suite uses. The Send a Smile feature automatically and consistently gathers feedback across multiple channels and key user touchpoints. We use this tool as a centralized data system for storing, triaging, and analyzing feedback, then aggregating it into actionable insights.

We encourage adoption of feedback loops and experimentation methods, such as feature flighting and ring deployment, to help measure the impact of product changes. With these foundational components in place, we’re now correlating feedback data with related telemetry so that we can better understand product usability issues and the impact of service issues on customers. Our use of controlled rollouts eliminates the need for UAT environments, which accelerates overall delivery.

Telemetry

We unified the telemetry from disparate systems by building on Azure Monitor to help us implement continuous improvements in the quality of our services. This platform integrates with heterogeneous data sources such as Kusto, Azure Cosmos DB, Azure Application Insights, and Log Analytics to collect, process, and publish data from applications, infrastructure, and business processes. This helps us obtain end-to-end views and generate more actionable insights about our service management.

We’re working toward delivering highly connected insights that aggregate the health of component services, customer experience, and business processes. This produces contextual data that not only identifies events but also identifies root causes and recommended next actions. We’re using business process monitoring (BPM) to monitor true availability and performance by tracking successful transactions and customer impact across multiple services and business groups.

To achieve a sustained level of quality, we’re leveraging synthetic monitoring for all critical services, especially those with a relatively low volume of business transactions. Data-enhanced incident tickets provide a business impact prioritized view of issues, supplemented with potential causes including those identified through Machine Learning. These data-enhanced tickets allow teams to focus on the most important tickets and reduce mitigation time.

We are investing in AI technologies to proactively detect anomalies and automatically remediate them wherever possible. Being able to intelligently respond to incidents reduces support costs and improves service reliability and the overall user experience.

Service health

We have focused on increasing our effectiveness in service and live site incident management. We rolled out a standard incident management process and measured continual improvements against key incident management metrics. We monitor service health metrics and key performance indicators (KPIs) across the organization to understand customer sentiment and ensure services are reliable, compliant, and performing well. We’re using consistent standards, which helps ensure that we can aggregate data at any level in the service hierarchy and compare it across different team groups. We built a more integrated experience on top of Azure Monitor, enriched with contextual data from the unified telemetry platform, and created a set of defined service health measures and an analyzer to track events that can affect service reliability, such as upcoming planned maintenance or compliance related changes. This enables us to detect and resolve issues proactively and quickly. Defined service health measures make it easier to enable service health reporting across various services.

We knew that we must connect service health to business process health, and how we prioritize issues, so that engineers could address them in a way that reduces the negative business impact. The experience we’re building enables visualization of end-to-end business process health and the health of the underlying services by analyzing their telemetry.

We also simplified the flow of service health and engineering fundamentals data to the engineer and reduced the number of dashboards and tools they use. An internal tool is now the key repository for all service owners to view service health and other relevant KPIs. The tool’s integrated notification workflow informs service owners when a service reaches a defined threshold, making it more convenient to prioritize any needed remediation into their backlogs.

Embracing a Live Site culture

Increasing scale and agility in our services and processes required us to focus on making customers’ experiences better. We’re establishing a Live Site culture and pursuing excellence via customer-obsessed, data-driven, multidisciplinary teams. These teams embrace potential failure with honest observation, continuous learning, and measurable improvement targets.

We host an organization-wide, live site review that includes postmortem reviews on incidents, examining long-term remediation plans, and guiding service teams through modern engineering standards that will help them perform robust reviews at a local level. We base these reviews on standard and actionable reports that contain leading indicators for outages or failures based on the analysis of telemetry, synthetic monitoring, and other data.

Engineering productivity

We’re providing our engineers with best-in-class unified standards and practices in a common engineering system, based on the latest Azure tools, such as Azure DevOps. A consistent development environment allows our engineers to transition smoothly between projects and teams. Improved automation, consistency, and centralized engineering systems enable engineers to better focus on the core role of developing. This also reduces onboarding time and allows our engineers to be more flexible across projects.

Integrating developer tooling

We made organizationally mandated code analysis and compliance tools accessible directly within the development environment, thereby helping our shift-left goal. We built self-service capabilities to manage access, set policies, and make changes to Azure DevOps artifacts such as area paths, work items, and repositories. This has made it easy for engineers to create, update, or retire services, components, and subscriptions, minimizing the time spent managing such resources. We want to extend our shift left goal to also examine optimization of our Azure service design and surface recommendations for configuration optimization so that these occur early in the deployment cycle and allow us to rightsize our configurations and avoid unnecessary Azure costs.

Enabling code reuse

While at a low volume, we’re still supporting a few applications (fewer than five percent) that use on-premises servers and domain-joined Azure virtual machines. This results in ongoing effort to patch servers, upgrade software, and perform basic infrastructure maintenance tasks. It also impedes our ability to scale apps to accommodate growth. We’ve transformed these applications to Microsoft Azure platform-as-a-service (PaaS) and software-as-a-service (SaaS) based solutions, thereby leveraging the scale and availability of Azure. We enabled this by providing architectural guidance and tools to migrate data, refactoring existing functionality as APIs, and building lightweight applications by reusing APIs that others have already published.

Promoting data and code reuse to build solutions more rapidly and align with a service-oriented architecture requires that developers have the ability to publish and discover APIs easily. We built an API economy by creating a common set of guidelines for developing coherent APIs, and a central catalog and search experience for discovery. We integrated validation against API guidelines and enabled our teams to integrate API publishing into their Azure DevOps pipelines. We created a set of common API health analytics. We also enabled the growth of inner source in which sharing code outside of APIs is achieved.

Workforce strategies

To address our previous high level of dependency on suppliers, we implemented a new workforce strategy, hiring more full-time employees and bringing more work in-house. This allowed us to transform and modernize how we deliver services. Furthermore, this workforce strategy makes it imperative that there is full-time employee oversight of any supplier deliveries, ensuring they adhere to processes, standards, and regulatory requirements, including security, accessibility, and privacy. We implemented a common bar for hiring across all teams and a common onboarding program to ensure all new hires receive a consistent level of training on all key tools and technologies. As we ramp up our use of AI technologies to further transform our engineering, we are investing in re-skilling and training initiatives to expand the engineering capacity available to work on AI-related projects.

Universal design system

We leveraged Microsoft’s product design system to engineer solutions that look and behave like other Microsoft products. Every product should meet the quality expectations of today’s consumers, meaning that every piece of the user interface (UI) and user experience (UX) should be engineered with accessibility, responsiveness, and familiar behaviors, states, motion, and visual styling. On complex but common components like headers, navigation menus, and data grids this can mean weeks of engineering time multiplied across every Microsoft Digital team that requires the same components. This is considerably reduced by adopting a universal design system.

Rapid delivery

To be customer-obsessed, we’re acquiring and protecting customer trust in every aspect of our relationship. We are tracking delivery metrics so that we can shorten lead times from ingestion of customer requirements to the time the solution is in the customer’s hands and then on to measuring customer usability and feedback, while still ensuring service reliability. We’re helping engineers achieve this objective by checking for issues earlier in the pipeline and providing a way to rapidly experiment and mitigate risk. We are building feedback-loop mechanisms to ensure that we can understand the user experience as new functionality gets deployed, and we perform automated rollbacks if customer reaction or service-health signals are less favorable than we anticipated.

Integrating security, accessibility, and fundamentals

Delivering secure, compliant, accessible, dependable, and high-quality services is critical to building trust with our customers. Our engineers are checking for issues earlier in the pipeline, and we’re enabling them to experiment rapidly while limiting potential negative effect on the release process.

We moved to a shift left process, in which work happens as early in the development process as possible. This enabled us to avoid carrying debt from sprint to sprint. We also implemented gates in the developer workflow that help build security in a streamlined way and auto-onboarding services to ensure continuous compliance.

We scan code for security issues and log bugs in Azure DevOps that we discover during the scanning process, so developers can fix them directly in the same engineering system they use for other functional bugs rather than having to triage separately from security tools.

We assess accessibility within our applications, but this happens late in the development process. To move this further upstream, we adopted accessibility insights tooling during development and now expose accessibility-related bugs as part of the pipeline workflow.

We are adopting AI technologies for providing accessibility guidance and conducting accessibility assessments to ensure that our applications conform to accessibility requirements.

Additionally, we enabled engineering teams to utilize the guardrails we’re implementing by integrating policy fundamentals into the pipeline, and we’re implementing continuous integration practices. This ensures that all production releases, including hot fixes, come from builds of the main branch of source code and all have appropriate compliance steps applied consistently. Each pull request must have a successful build to ensure that the main branch is golden and always production ready. Maintaining high-quality code in the main branch minimizes build failures that ultimately slow our time to production.

Deploying safely to customers

We created an environment where teams test ideas and prototypes before building them. The goal is to drive customer outcomes in a way that encourages risk-taking with a fail-fast, fail-safe mentality. Central to increasing the velocity of service updates to customers is a consistent, simple, and streamlined way to implement safe deployments. Progressive exposure and feature flags are key in deploying new capabilities to users via controlled rollouts, so we can quickly start receiving customer feedback.

We implemented checks and balances in the process by leveraging service indicators such as latency and faults within the pipeline, thereby catching regressions and allowing initiation of automated rollbacks when predefined thresholds are exceeded. Implementing safe deployment practices, combined with a streamlined and well-managed pipeline, are two of the key elements for achieving a continuous integration, continuous deployment (CI/CD) model.

Reliability and efficiency

We are enhancing our DevOps engineering pipeline across services by identifying and removing bottlenecks and improving our services’ reliability. We’ll use DevOps Research and Assessment (DORA) metrics to measure our execution and monitor our progress against industry benchmarks.

We’re focusing on deployment frequency, lead time for changes, change failure rate, and mean time to recover in order to gain a comprehensive view of our software or service delivery capabilities. Based on this data, we’ll increase productivity, speed up time-to-market, and enhance user satisfaction.

We’re making our vision for modern engineering a reality at Microsoft by promoting a Live Site first culture, using data to provide service and business process health signals to inform the rapid iteration on new ideas and capabilities with customers.
We’re supporting this by moving to an Azure DevOps model of continuous integration and continuous deployment governed by a standardized engineering pipeline with automatic policy enforcement.
The Live Site first culture and the tools and ceremonies that support it have increased visibility into engineering processes, improved the quality and delivery of our services and improved our insight into our customer experiences, all of which ensure we are continually improving and adapting our set of services and processes to support digital transformation now and into the future.

The post Transforming modern engineering at Microsoft appeared first on Inside Track Blog.

Monitoring Microsoft’s SAP Workload with Microsoft Azure

Inside Track staff — Wed, 04 Sep 2024 16:00:22 +0000

At Microsoft, our Microsoft Digital Employee Experience (MDEE) team is using Microsoft Azure telemetry tools to get key insights on our business processes that flow through our SAP instance, one of the largest in the world. Our new platform provides our leadership with a comprehensive view of our business-process health and allows our engineering teams to create a more robust and efficient SAP environment.

Like many enterprises, we use SAP—the global enterprise resource planning (ERP) software solution—to run our various business operations. Our SAP environment is critical to our business performance, and we integrate it into most of our business processes. SAP offers functionality for enterprise services at Microsoft, such as human resources, finance, supply-chain management, and commerce. We use a wide variety of SAP applications, including:

SAP S/4HANA
ERP Central Component (ECC)
Global Trade Screening (GTS)
Business Integrity Screening (BIS) on S4
Master Data Governance (MDG) on S4
Governance, Risk, Compliance (GRC)
Revenue Management, Contract Accounting (RMCA)
OEM Services (OER)
SAP SaaS (Ariba, IBP, Concur, SuccessFactors)

Since 2018, Microsoft’s instance of SAP is 100 percent migrated to Microsoft Azure. This project entailed moving all SAP assets to more than 800 Azure virtual machines and numerous cloud services.

We approached the migration by using both vertical and horizontal strategies.

From a horizontal standpoint, we migrated systems in our SAP environment that were low risk—training systems, sandbox environments, and other systems that weren’t critical to our business function. We also looked at vertical stacks, taking entire parts of our SAP landscape and migrating them as a unified solution.

We gained experience with both migration scenarios, and we learned valuable lessons in the early migration stages that helped us smoothly transition critical systems later in the migration process.

[Unpack how we’re optimizing SAP for Microsoft Azure. | Discover how we’re protecting Microsoft’s SAP workload with Microsoft Sentinel. | Explore how we’re unlocking Microsoft’s SAP telemetry with Microsoft Azure.]

Operating as Microsoft Azure-native

At Microsoft, we develop and host all new SAP infrastructure and systems on Microsoft Azure. We’re using Azure–based cloud infrastructure and SAP–native software as a service (SaaS) solutions to increase our architecture’s efficiency and to grow our environment with our business. The following graphic represents our SAP landscape on Azure.

Microsoft’s SAP environment on Microsoft Azure.

The benefits of SAP on Microsoft Azure

SAP on Microsoft Azure provides several benefits to our business, many of which have resulted in significant transformation for our company. Some of the most important benefits include:

Business agility. With Microsoft Azure’s on-demand SAP–certified infrastructure, we’ve achieved faster development and test processes, shorter SAP release cycles, and the ability to scale instantaneously on demand to meet peak business usage.
Efficient insights. SAP on Microsoft Azure gives us deeper visibility across our SAP landscape. On Azure, our infrastructure is centralized and consolidated. We no longer have our SAP infrastructure spread across multiple on-premises datacenters.
Efficient real-time operations and integration. We can leverage integration with other Microsoft Azure technologies such as Internet of Things (IoT) and predictive analytics to enable real-time capture and analysis of our business environment, including areas such as inventory, transaction processing, sales trends, and manufacturing.
Mission-critical infrastructure. We run our entire SAP landscape—including our most critical infrastructure—on Microsoft Azure. SAP on Azure supports all aspects of our business environment.

Identifying potential for improved monitoring

As we examined our SAP environment on Microsoft Azure, we found several key areas where we could improve our monitoring and reporting experience:

Monitoring SAP from external business-process components. External business process components had no visibility into SAP. Our monitoring within individual SAP environments provided valuable insight into SAP processes, but we needed a more comprehensive view. SAP is just one component among many in our business processes, and the owners of those business processes didn’t have any way to track their processes after they entered SAP.
Managing and viewing end-to-end processes. It was difficult to manage and view end-to-end processes. We couldn’t capture the end-to-end process status to effectively monitor individual transactions and their progress within the end-to-end process chain. SAP was disconnected from end-to-end monitoring and created a gap in our knowledge of the entire process pipeline.
Assessing overall system health. We couldn’t easily assess overall system health. Our preexisting monitoring solution didn’t provide a holistic view of the SAP environment and the processes with which it interacted. The overall health of processes and systems was incomplete because of missing information for SAP, and issues that occurred within the end-to-end pipeline were difficult to identify and problematic to troubleshoot.

Our SAP on Microsoft Azure environment was like a black box to many of our business-process owners, and we knew that we could leverage Azure and SAP capabilities to improve the situation. We decided to create a more holistic monitoring solution for our SAP environment in Azure and the business processes that defined Microsoft operations.

Creating a telemetry solution for SAP on Microsoft Azure

The distributed nature of our business process environment led us to examine a broader solution—one that would provide comprehensive telemetry and monitoring for our SAP landscape and any other business processes that constituted the end-to-end business landscape at Microsoft. The following goals drove our implementation:

Integrate comprehensive telemetry into our monitoring.

Move toward holistic health monitoring of both applications and infrastructure.
Create a complete view of end-to-end business processes.
Create a modern, standards-based structure for our monitoring systems.

Guiding design with business-driven monitoring and personas

We adopted a business-driven approach to building our monitoring solution. This approach examines systems from the end-user perspective, and in this instance, the personas represented three primary business groups: business users, executives, and engineering teams. Using the synthetic method, we planned to build our monitoring results around what these personas wanted and needed to observe within SAP and the end-to-end business process, including:

Business user needs visibility into the status of their business transactions as they flow through the Microsoft and SAP ecosystem.
Executives need to ensure that our business processes are flowing smoothly. If there are critical failures, they need to know before customers or partners discover them.
Engineers need to know about business-process issues before those issues affect business operations and lead to customer-satisfaction issues. They need end-to-end visibility of business transactions through SAP telemetry data in a common consumption format.

Creating end-to-end telemetry with our Unified Telemetry Platform

The MDEE team developed a telemetry platform in Microsoft Azure that we call the Unified Telemetry Platform (UTP). UTP is a modern, scalable, dependable, and cost-effective telemetry platform that’s used in several different business-process monitoring scenarios in Microsoft, including our SAP–related business processes.

UTP is built to enable service maturity and business-process monitoring across MDEE. It provides a common telemetry taxonomy and integration with core Microsoft data-monitoring services. UTP enables compliance with and maintenance of business standards for data integrity and privacy. While UTP is the implementation we chose, there are numerous ways to enable telemetry on Microsoft Azure. For additional considerations, access Best practices for monitoring cloud applications on the Azure documentation site.

Capturing telemetry with Microsoft Azure Monitor

To enable business-driven monitoring and a user-centric approach, UTP captures as many of the critical events within the end-to-end process landscape as possible. Embracing comprehensive telemetry in our systems meant capturing data from all available endpoints to build an understanding of how each process flowed and which SAP components were involved. Azure Monitor and its related Azure services serve as the core for our solution.

Microsoft Azure Application Insights

Application Insights provides a Microsoft Azure–based solution with which we can dig deep into our Azure–hosted SAP landscape and extract all necessary telemetry data. By using Application insights, we can automatically generate alerts and support tickets when our telemetry indicates a potential error situation.

Microsoft Azure Log Analytics

Infrastructure telemetry such as CPU usage, disk throughput, and other performance-related data is collected from Azure infrastructure components in the SAP environment by using Log Analytics.

Microsoft Azure Data Explorer

UTP uses Microsoft Azure Data Explorer as the central repository for all telemetry data sent through Application Insights and Microsoft Azure Monitor Logs from our application and infrastructure environment. Azure Data Explorer provides enterprise big-data interactive analytics; we use the Kusto query language to connect the end-to-end transaction flow for our business processes, for both SAP process and non–SAP processes.

Microsoft Azure Data Lake

UTP uses Microsoft Azure Data Lake for long-term cold-data storage. This data is taken out of the hot and warm streams and kept for reporting and archival purposes in Azure Data Lake to reduce the cost associated with storing large amounts of data in Microsoft Azure Monitor.

A UTP data-flow architecture.

Constructing with definition using common keys and a unified platform

UTP uses Application Insights, Microsoft Azure Data Explorer, and Microsoft Azure Data Lake as the foundation for our telemetry data. This structure unifies our data by using a common schema and key structure that ties telemetry data from various sources together to create a complete view of business-process flow. This telemetry hub provides a central point where telemetry is collected from all points in the business-process flow—including SAP and external processes—and then ingested into UTP. The telemetry is then manipulated to create comprehensive business-process workflow views and reporting structures for our personas.

Common schema

UTP created a clearly defined common schema for business-process events and metrics based on a Microsoft-wide standard. That schema contains the metadata necessary for mapping telemetry to services and into processes, and it allows for joins and correlation across all telemetry.

Common key

As part of the common schema for business process events, the design includes a cross-correlation vector (XCV) value, common to all stored telemetry and transactions. By persisting a single value for the XCV and populating this attribute for all transactions and telemetry events related to a business process, we can connect the entire process chain related to an individual business transaction as it flows through our extended ecosystem.

Multilayer telemetry concept for SAP

For SAP on Microsoft Azure, our MDEE team focused on four specific areas for telemetry and monitoring:

SAP Business Process layer
SAP Application Foundation layer
Infrastructure layer
Surrounding API layer

Microsoft’s multilayer approach for its SAP instance.

The result was holistic telemetry and monitoring across these layers, a structure that leverages Microsoft Power BI as the engine behind our reporting and dashboarding functionality.

Our MDEE team created reporting around business-driven monitoring and constructed standard views and dashboards that offer visibility into important areas for each of the key business personas. Dashboards are constructed from Kusto queries, which are automatically translated in the Microsoft Power BI M formula language. For each persona, we’ve enabled a different viewpoint and altitude of our business process that allows the persona to view the SAP monitoring information that’s most critical to them.

Sample dashboards view for each layer.

Microsoft Azure Monitor for SAP Solutions

Microsoft previously announced the launch of Microsoft Azure Monitor for SAP Solutions (AMS) in public preview—an Azure-native monitoring solution for customers who run SAP workloads on Azure. With AMS, customers can view telemetry of their SAP landscapes within the Azure portal and efficiently correlate telemetry between various layers of SAP. AMS is available through Microsoft Azure Marketplace in the following regions: East US, East US 2, West US 2, West Europe, and North Europe. AMS doesn’t require a license fee.

Our MDEE team worked in close collaboration with Microsoft Azure product teams to build and release SAP NetWeaver provider in Microsoft Azure Monitor for SAP solutions.

The SAP NetWeaver provider in Microsoft Azure Monitor for SAP Solutions enables SAP on Microsoft Azure customers to monitor SAP NetWeaver components and processes on Azure in the Azure portal. The SAP NetWeaver provider includes default visualizations and alerts that can be used out of the box or customized to meet customer requirements.
SAP NetWeaver telemetry is collected by configuring the SAP NetWeaver provider within AMS. As part of configuring the provider, customers are required to provide the host name (Central, Primary, and/or Secondary Application server) of SAP system and its corresponding Instance number, Subdomain, and System ID (SID).

For more information, go to AMS quick start video and SAP NetWeaver monitoring-Azure Monitoring for SAP Solutions.

Microsoft’s AMS architecture.

Our telemetry platform provides benefits across our SAP and business-process landscape. We have created a solution that facilitates end-to-end SAP business-process monitoring, which in turn enables our key personas to do their jobs better.

Persona benefits

Benefits for each persona include the following:

Business users no longer need to create service tickets to get the status of SAP transaction flows. They can examine our business processes from end to end, including SAP transactions and external processes.
Executives can trust that their business processes execute seamlessly and that any errors are proactively addressed with no impact to customers or partners.
Engineers no longer need to check multiple SAP transactions to investigate business-process issues and identify in which step the business process failed. They can improve their time-to-detect and time-to-resolve numbers with the correct telemetry data and avoid business disruption for our customers.

Organization-wide benefits

The benefits of our platform extend across Microsoft by providing:

End-to-end visibility into business processes. Our Unified Telemetry Platform (UTP) provides visibility into business processes across the organization, which then facilitates better communication and a clearer understanding of all parts of our business. We have a more holistic view of how we’re operating, which helps us work together to achieve our business goals.
Decreased time to resolve issues. Our visibility into business processes informs users at all levels when an issue occurs. Business users can examine the interruption in their workflow, executives are notified of business-process delays, and engineers can identify and resolve issues. This activity all occurs before our customers are affected.
More efficient business processes. Greater visibility leads to greater efficiency. We can demonstrate issues to stakeholders quickly, everyone involved can recognize areas for potential improvement, and we can monitor modified processes to ensure that improvement is happening.

We learned several important lessons with our UTP implementation for SAP on Microsoft Azure. These lessons helped inform our progress of UTP development, and they’ve given us best practices to leverage in future projects, including:

Perform a proper inventory of internal processes. You must be aware of events within a process before you can capture them. Performing a complete and informed inventory of your business processes is critical to capturing the data required for end-to-end business-process monitoring.
Build for true end-to-end telemetry. Capture all events from all processes and gather telemetry appropriately. Data points from all parts of the business process—including external components—are critical to achieving true end-to-end telemetry.
Build for Microsoft Azure-native SAP.  SAP is simpler to manage on Azure and instrumenting SAP processes becomes more efficient and effective when SAP components are built for Azure.
Encourage data-usage models and standards across the organization. Data standards are critical for an accurate end-to-end view. If data is stored in different formats or instrumentation in various parts of the business process, the end reporting results won’t accurately represent the business-process’ state.

We’re continuing to evaluate and improve as we discover new and more efficient ways to track our business processes in SAP. Some of our current focus areas include:

Machine learning for predictive analytics. We’re using machine learning and predictive analytics to create deeper insights and more completely understand our current SAP environment. Machine learning also helps us anticipate growth and change in the future. We’re leveraging anomaly detection in Microsoft Azure Cognitive Services to track SAP business service-health outliers.
Actionable alerting. We’re using Microsoft Azure Monitor alerts to create service tickets, generate service-level agreement (SLA) alerts, and provide a robust notification and alerts system. We’re working toward linking detailed telemetry context into our alerting system to create intelligent alerting that enables us to more accurately and quickly identify potential issues within the SAP environment.
Telemetry-based automation. We’re using telemetry to enable automation and remediation within our environment. We’re creating self-healing scenarios to automatically correct common or easy-to-correct issues to create a more intelligent and efficient platform.

We’re continually refining and improving business-process monitoring of SAP on Microsoft Azure. This initiative has enabled us to keep key business users informed of business-process flow, provided a complete view of business-process health to our leadership, and helped our engineering teams create a more robust and efficient SAP environment. Telemetry and business-driven monitoring have transformed the visibility that we have into our SAP on Azure environment, and our continuing journey toward deeper business insight and intelligence is making our entire business better.

The post Monitoring Microsoft’s SAP Workload with Microsoft Azure appeared first on Inside Track Blog.

Microsoft uses a scream test to silence its unused servers

Pete Apple — Sat, 17 Aug 2024 08:00:59 +0000

Do you have unused servers on your hand? Don’t be alarmed if I scream about it—it’ll be for a good reason (and not just because it’s almost Halloween)!

Check out Pete Apple’s expedition to the cloud series

I talked previously about our efforts here in Microsoft Digital to inventory our internal-to-Microsoft on-premises environments to determine application relationships (mapping Microsoft’s expedition to the cloud with good cartography) as well as look at performance info for each system (the awesome ugly truth about decentralizing operations at Microsoft with a DevOps model).

With this info, it was time to begin making plans to move to the cloud. Looking at the data, our overall CPU usage for on-premises systems was far lower than we thought—averaging around six percent! We realized this was so low due to many underutilized systems. First things first, what to do with the systems that were “frozen,” or not being used, based upon the 0-2 percent CPU they were utilizing 24/7?

We created a plan to closely examine those assets towards the goal of moving as few as possible. We used our home-built change management database (CMDB) to check whether there was a recorded owner. In some cases, we were able to work with that owner and retire the system.

Before we turned even one server off, we had to be sure it wasn’t being used. (If a server is turned off and no one is there to see it, does it make a sound?)

Developing a scream test

Pete Apple, a cloud services engineer in Microsoft Digital, shares how Microsoft scares teams that have unused servers that need to be turned off. (Photo by Jim Adams | Inside Track)

But what if the owner information was wrong? Or what if that person had moved on? For those, we created a new process: the Scream Test. (Bwahahahahaaaa!)

What’s the Scream Test? Well, in our case it was a multistep process:

Display the message “Hey, is this your server, contact us?” on the sign-in splash page for two weeks.
Restart the server once each day for two weeks to see whether someone opens a ticket (in other words, screams).
Shut down the server for two weeks and see whether someone opens a ticket. (Again, whether they scream.)
Retire the server, retaining the storage for a period, just in case.

With this effort, we were able to retire far more unused servers—around 15 percent—than we had expected, without worrying about moving them to the cloud. Winning! We also were able to reclaim more resources on some of the Hyper-V hosts that were slated to continue running on-premises. And as a final benefit, we cleaned up our CMDB a bit!

In parallel, we initiated an effort to look at some of the systems that were infrequently used or used a very low level of CPU (less than 10 percent, or “Cold”). From that, we had two outcomes that proved critical for our successful migration to the cloud.

The first was to identify the systems in our on-premises environments that were oversized. People had purchased physical machines or sized virtual machines according to what they thought the load would be, and either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Azure VM sizes for every on-premises system to use for migration. In other words, we downsized on the way to the cloud versus after the fact.

At the time, we did a bunch of this work by hand, manually because we were early adopters. Microsoft now has a number of great products available that help assist with this inventory and review of your on-premises environment that you should check out. To learn more, check out this article with documentation on Azure Migrate.

Another statistic that the data revealed was the number of systems that were used for only a few days or a week out of each month. Development machines, test/QA machines, and user acceptance testing machines reserved for final verification before moving code to production were used for only short periods. The machines were on continuously in the datacenter, mind you, but they were actually being used for only short periods each month.

For these, we investigated ways to have those systems running only when required by investing in two technologies: Azure Resource Manager Templates and Azure Automation. But this is a story for the next time. Until then, happy Halloween!

Read the rest of the series on Microsoft’s move to the cloud:

The post Microsoft uses a scream test to silence its unused servers appeared first on Inside Track Blog.

How automation is transforming revenue processing at Microsoft

Lukas Velush — Fri, 26 Jan 2024 15:05:35 +0000

The Microsoft partner and customer network brings in more than $100 billion in revenue each year, most of the company’s earnings.

Keeping tabs on the millions of annual transactions is no small task—just ask Shashi Lanka Venkata and Mark Anderson, two company employees who are leading a bid to automate what historically has been a painstakingly manual revenue transaction process.

“We support close to 50 million platform actions per day,” says Venkata, a principal group engineering manager in Microsoft Digital. “For a quarter-end or a month-end, it can double. At June-end, we’re getting well more than 100 million transactions per day.”

That’s a lot, especially when there cannot be any mistakes and every transaction must be processed in 24 hours.

To wrangle that high-stakes volume, Venkata and Anderson, a director on Microsoft’s Business Operations team, teamed up to expand the capabilities of Customer Obsessed Solution Management and Incident Care (COSMIC), a Dynamics 365 application built to help automate Microsoft’s revenue transactions.

[Learn more about COSMIC including where to find the code here: Microsoft Dynamics 365 and AI automate complex business processes and transactions.]

First tested in 2017 on a small line of business, the solution expanded quickly and was handling the full $100 billion-plus workload within one year.

That said, the team didn’t try to automate everything at once—it has been automating the many steps it takes to process a financial transaction one by one.

Mark Anderson (shown here) partnered with Shashi Lanka Venkata from Microsoft Digital to revamp the way the company processes incoming revenue. Anderson is a director on Microsoft’s Business Operations team.

“We’re now about 75 percent automated,” Anderson says. “Now we’re much faster, and the quality of our data has gone way up.”

COSMIC is saving Microsoft $25 million to $30 million over the next two to three years in revenue processing cost. It also automates the rote copy-and-paste kind of work that the company’s team of 3,800 revenue processing agents used to get bogged down on, freeing them up to do higher value work.

The transformation that Anderson, Venkata, and team have been driving is part of a larger digital transformation that spans all Microsoft Digital. Its success has led to a kudos from CEO Satya Nadella, a well-received presentation to the entire Microsoft Digital organization, and lots of interest from Microsoft customers.

“It’s been a fantastic journey,” Anderson says. “It’s quite amazing how cutting edge this work is.”

Unpacking how COSMIC works

Partners transact, purchase, and engage with Microsoft in over 13 different lines of businesses, each with its own set of requirements and rules for processing revenue transactions (many of which change from country to country).

To cope with all that complexity, case management and work have historically been handled separately to make it easier for human agents to stay on top of things.

That had to change if COSMIC was going to be effective. “When we started, we knew we needed to bring them together into one experience,” Venkata says.

Doing so would make transactions more accurate and faster, but there was more to it.

“The biggest reason we wanted to bring them together is so we could get better telemetry,” he says. “Connecting all the underlying data gives us better insights, and we can use that to get the AI and machine learning we need to automate more and more of the operation.”

Giving automation its due

The first thing the team decided to automate was email submissions, one of the most common ways transactions get submitted to the company.

“We are using machine learning to read the email and to automatically put it in the right queue,” Venkata says. “The machine learning pulls the relevant information out of the email and enters it into the right places in COSMIC.”

The team also has automated sentiment analysis and language translation.

What’s next?

Using a bot to start mimicking the work an agent does, like automatic data entry or answering basic questions. “This is something that is currently being tested and will soon be rolled out to all our partners using COSMIC,” he says.

How does it work?

When a partner submits a transactional package to Microsoft, an Optical Character Recognition bot scans it, opens it, checks to see if everything looks correct, and makes sure business roles are applied correctly. “If all looks good, it automatically gets routed to the next step in the process,” Venkata says.

The Dynamics workflow engine also is taking on some of the check-and-balance steps that agents used to own, like testing to see if forms have been filled out correctly and if information extracted out of those forms is correct.

“Azure services handle whatever has to be done in triage or validation,” he says. “It can check to see if a submission has the right version of the document, or if a document is the correct one for a particular country. It validates various rules at each step.”

All of this is possible, Venkata says, because the data was automatically abstracted. “If, at any point the automation doesn’t work, the transaction gets kicked back for manual routing,” he says.

As for the agents? They are getting to shift to more valuable, strategic work.

“The system is telling them what the right next action is going to be,” Venkata says. “Before this, the agent had to remember what to do next for each step. Now the system is guiding them to the next best action—each time a step is completed, the automation kicks in and walks the agent through the next action they should take.”

Eventually the entire end-to-end process will be automated, and the agents will spend their time doing quality control checks and looking for ways to improve the experience. “We want to get to the point where we only need them to do higher level work,” he says.

Choosing Dynamics 365 and Microsoft Azure

There was lots of technology to choose from, but after a deep assessment of the options, the team chose Dynamics 365 and Microsoft Azure.

“We know many people thought Dynamics couldn’t scale to an enterprise the size of Microsoft, but that’s not the case anymore,” Venkata says. “It has worked very well for us. Based on our experience, we can definitively say it can cover Microsoft’s needs.”

The team also used Azure to build COSMIC—Azure Blob Storage for attachments, Azure Cosmos DB for data archival and retention, SQL Azure for reporting on data bases, and Microsoft Power BI for data reporting.

Anderson says it’s a major leap forward to be using COSMIC’s automation to seamlessly route customers to the right place, handing them off from experience to experience without disrupting them.

Another major improvement is how the team has gained an end-to-end view of customers (which means the company no longer must ask customers what else they’re buying from Microsoft).

“It’s been a journey,” Anderson says. “It isn’t something we’ve done overnight. At times it’s been frustrating, and at times it’s been amazing. It’s almost hard to imagine how far we’ve come.”

Learn more about COSMIC including where to find the code here: Microsoft Dynamics 365 and AI automate complex business processes and transactions.

The post How automation is transforming revenue processing at Microsoft appeared first on Inside Track Blog.

Streamlining virtual software provisioning at scale with MyWorkspace

Lukas Velush — Wed, 01 Nov 2023 14:38:02 +0000

Every software company has unique virtual provisioning needs.

Creating complex virtual infrastructure from scratch and maintaining it for even a short period can be labor-intensive, costly, and risky from a security standpoint. When it needs to happen multiple times per day throughout an organization’s workforce, handling these challenges can be overwhelming.

At a company the size of Microsoft, managing those efforts across multiple lines of business can consume hundreds of hours of administration, computing time, and review. To help our teams generate their virtual labs quickly and securely, our Microsoft Digital Employee Experience (MDEE) team created MyWorkspace, a cloud solution that helps teams templatize their setups in the cloud and provision replicas on demand.

It’s revolutionizing how we handle virtual software provisioning.

[Learn how we’re doing more with less internally at Microsoft with Microsoft Azure. Find out how we’re simplifying compliance evidence management with Microsoft Azure confidential ledger. See how we’re adopting Microsoft Azure Resource Manager internally at Microsoft. Read how Microsoft Azure resource inventory is helping us manage our operational efficiency and compliance.]

Virtual software provisioning: Diverse organizational needs

All across Microsoft, teams generate infrastructure labs for many different reasons:

Support engineers replicate customer environments as accurately as possible to reproduce and troubleshoot issues.
Field engineers recreate common configurations for easier customer demos or collaboration.
Build engineers bulk-create virtual machines for load testing, then tear those labs down when their tests are done.
Employees create customized virtual machines they can use for daily work, which they can keep powered on for development or testing.

Manually spinning up and configuring these labs from scratch is time-consuming and resource-intensive for engineers. It also leaves their virtual machines vulnerable to security risks or dependent on lengthy reviews to ensure compliance.

“To create these labs, engineers used to set up the entire environment manually and put it through a review by our security team, so that was a really time-consuming process lasting up to three or four days,” says Vikram Dadwal, principal software engineering manager for MDEE’s Infrastructure and Engineering Services team. “To solve that issue for our engineers, we wanted to create a secure platform where they could rapidly provision those labs using Azure services, which helped them save time and reduce manual and repetitive tasks.”

The new service offering leveraged the cloud to prioritize security, compliance, reliability, and availability. It also focused on better performance, lower costs for Microsoft as a whole, and more efficient utilization of resources and capacity.

They called the solution MyWorkspace.

Creating a solution that’s cloud-based and compliant

MyWorkspace is a cloud-based provisioning engine that enables the rapid creation, configuration, and distribution of virtual infrastructure. It helps end users templatize any environment setup and use it to provision on-demand replicas securely and cost-effectively.

The tool combines several technologies across the Microsoft Azure stack. Azure Kubernetes Service runs all of MyWorkspace’s services, and it also leverages other Azure serverless offerings including Azure Functions, Azure Container Services, and Azure Logic Apps.

Azure Cosmos DB handles the data, and we use Azure Cache for Redis for our cache strategy. Finally, Azure SignalR Service enables real-time notifications and communications. Each of these components contributes to making MyWorkspace robust and easy to use, and they’re all readily available to customers who want to create a similar solution.

The tool streamlines the provisioning experience by allowing users to manage resource deployment and configuration in Microsoft Azure with a simplified UI. As a result, engineers can quickly spin up a pre-configured infrastructure stack to reproduce any number of environments for short-term or long-term lab setups.

Because it’s cloud-based, MyWorkspace naturally fosters resilience, high availability, and multi-geo presence. Meanwhile, baked-in security policies eliminate the need to submit every new virtual environment for individual security review.

Templatization is the real key to the tool’s success. Engineers and administrators have the ability to create templates from common virtual labs and add them to a Microsoft-wide library that colleagues can access as they carry out their own work.

As a result, what once took days now takes just a few clicks and a few minutes.

Elite customer service through the cloud

Our support engineers are the biggest MyWorkspace adopters. They frequently provision virtual labs that reflect our customers’ software setups in order to troubleshoot and work through issues.

“Every customer is different, and we can’t set up one single environment to service them all,” says Rick Andring, support escalation engineer for Microsoft OneDrive and SharePoint. “We reproduce customer issues daily, and we have to have clean environments that we can customize to each specific scenario.”

Business continuity for Microsoft customers is often on the line, so support teams face enormous time pressure. MyWorkspace provides the velocity they need to get to the root of a problem fast.

When a support escalation comes across the team’s desk, the process is straightforward. Andring or one of his engineers heads to the MyWorkspace dashboard, accesses the template library, loads up a template that reflects the situation, and the virtual lab is ready to go within an hour.

From there, the engineer simply troubleshoots within the environment until they reproduce the issue and can advise the customer. In complex cases, the support team can tag relevant product teams into their lab to help with more intensive fixes.

Once they’ve reached a successful outcome, the team can archive the solution in the template library for educational purposes or reuse on similar cases. They even have the power to share the template with other support teams to build competency and capacity across our organization.

Automation drives ease and efficiency

The ability to create templates out of real-world support situations is a powerful asset, so MDEE was intent on making that workflow as smooth as possible.

“Creating templates is actually very simple,” says Nathan Prentice, senior product manager with MDEE Infrastructure and Engineering Services. “Once we built the automation on the backend, an end user just has to select the workspace they’ve already configured and click ‘Create Template From’.”

In terms of virtual environment creation and management, every time I publish a new template, I save a week. Now scale that out: I publish four templates every six months at a minimum, so that’s a month or two of time savings a year.

—Rick Andring, support escalation engineer, OneDrive and SharePoint.

Nathan Prentice (left) and Vikram Dadwal helped create MyWorkspace to streamline virtual device provisioning.

Of course, administrators within each business group have the ability to limit who can create templates and the degree of access they need. That helps less technical staff obtain exactly the level of depth they need to do their jobs while administrators can execute higher-level operations at speed.

As the administrator for his line of business, Andring’s most substantial time savings come from template creation. He’s eliminated much of the lag time associated with spinning up custom labs for each project.

“In terms of virtual environment creation and management, every time I publish a new template, I save a week,” Andring says. “Now scale that out. I publish four templates every six months at a minimum, so that’s a month or two of time savings a year.”

Automation also drives more efficient computing consumption. Administrators can set up rules that automatically terminate virtual labs if teams haven’t used them for predetermined lengths of time.

As a result, an engineer won’t leave a virtual environment running overnight by accident and waste valuable computing power and cost. That saves money and capacity, not to mention carbon emissions. Automations like these mean the average cost per workspace is down 30 percent in lines of business that leverage MyWorkspace.

Users can set up a substantial lab of 20 virtual machines in whatever configuration they want and then use Azure services to spin them up. What might have taken four to five days in the past now takes just minutes.

—Vikram Dadwal, principal software engineering manager, for MDEE Infrastructure and Engineering Services

A revolution in device provisioning

With more than 3,000 daily active users creating almost 7,000 workspaces and utilizing 60,000 virtual machines at any given time, MyWorkspace is quickly gaining traction across Microsoft. That’s not surprising considering the time savings the average user experiences.

“Users can set up a substantial lab of 20 virtual machines in whatever configuration they want and then use Azure services to spin them up,” Dadwal says. “What might have taken four to five days in the past now takes just minutes.”

As more and more lines of business bring MyWorkspace online, the benefits for our teams and the value it generates for our organization will continue to grow. But the most important outcome is providing speed, value, and excellence in supporting our customers through innovation in the cloud.

Start with low-hanging fruit for use cases and build complexity over time.
Prototype, pilot, and iterate while making partners of people whose standards you need to meet.
Encourage your users to voice their criticism as part of your development relationship.
Scaling is important: Use resilient services from the outset.
Security should always be a priority with cloud provisioning.
On the user side, build a relationship with a product team you trust to be decisive.

Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.

The post Streamlining virtual software provisioning at scale with MyWorkspace appeared first on Inside Track Blog.

Boosting Microsoft’s migration to the cloud with Microsoft Azure

Lukas Velush — Fri, 27 Oct 2023 15:30:40 +0000

When Microsoft set out to move its massive internal workload of 60,000 on-premises servers to the cloud and to shutter its handful of sprawling datacenters, there was just one order from company leaders looking to go all-in on Microsoft Azure.

Please start our migration to the cloud, and quickly.

As a team, we had a lot to learn. We started with a few Azure subscriptions. We were kicking the tires, figuring things out, assessing how much work we had to do.

– Pete Apple, principal service engineer, Microsoft Digital

However, it was 2014, the early days of moving large, deeply rooted enterprises like Microsoft to the cloud. And the IT pros in charge of making it happen had few tools to do it and little guidance on how to go about it.

“As a team, we had a lot to learn,” says Pete Apple, a principal service engineer in Microsoft Digital. “We started with a few Azure subscriptions. We were kicking the tires, figuring things out, assessing how much work we had to do.”

As it turns out, quite a bit of work. More on that in a moment.

Now, seven years later, the company’s migration to the cloud is 96 percent complete and the list of lessons learned is long. Six IT datacenters are no more and there are fewer than 800 on-prem servers left to migrate. And that massive workload of 60,000 servers? Using a combination of modern engineering to redesign the company’s applications and to prune unused workloads, that number has been reduced. Microsoft is now running on 7,474 virtual machines in Azure and 1,567 virtual machines on-premises.

“What we’ve learned along the way has been rolled into the product,” Apple says. “We did go through some fits and starts, but it’s very smooth now. Our bumpy experience is now helping other companies have an easier time of it (with their own migrations).”

[Learn how modern engineering fuels Microsoft’s transformation. Find out how leaders are approaching modern engineering at Microsoft.]

The beauty of a decision framework

It didn’t start that way, but migrating a workload to Azure inside Microsoft is super smooth now, Apple says. He explains that everything started working better when they began using a decision tree like the one shown here.

Microsoft Digital’s migration to the cloud decision tree

The cloud migration team used this decision tree to guide it through migrating the company’s 60,000 on-premises servers to the cloud. (Graphic by Marissa Stout | Inside Track)

First, the Microsoft Digital migration team members asked themselves, “Are we building an entirely new experience?” If the answer was “yes,” then the decision was easy. Build a modern application that takes full advantage of all the benefits of building natively in the cloud.

If you answer “no, we need to move an existing application to the cloud,” the decision tree is more complex. It requires the team to answer a couple of tough questions.

Do you want to take the Platform as a Service (PaaS) approach? Do you want to rebuild your experience from the ground up to take full benefit of the cloud? (Not everyone can afford to take the time needed or has the budget to do this.) Or do you want to take the Infrastructure as a Service (IaaS) approach? This requires lifting and shifting with a plan to rebuild in the future when it makes more sense to start fresh.

Tied to this question were two kinds of applications: those built for Microsoft by third-party vendors, and those built by Microsoft Digital or another team in Microsoft.

On the third-party side, flexibility was limited—the team would either take a PaaS approach and start fresh, or it would lift and shift to Azure IaaS.

“We had more choices with the internal applications,” Apple says, explaining that the team divvied those up between mission-critical and noncritical apps.

For the critical apps, the team first sought money and engineering time to start fresh and modernize. “That was the ideal scenario,” Apple says. If money wasn’t available, the team took an IaaS approach with a plan to modernize when feasible.

As a result, noncritical projects were lifted and shifted and left as-is until they were no longer needed. The idea was that they would be shut down once something new could be built that would absorb that task or die on the vine when they become irrelevant.

“In a lot of cases, we didn’t have the expertise to keep our noncritical apps going,” Apple says. “Many of the engineers who worked on them moved onto other teams and other projects. Our thinking was, if there is some part of the experience that became important again, we would build something new around that.”

Getting migration right

When Microsoft started its migration to the cloud, the company had a lot to learn, says Pete Apple, a principal service engineer in Microsoft Digital. That migration is nearly finished and those learnings? “They have been rolled into the product,” Apple says. (Photo by Jim Adams | Inside Track)

Apple says the Microsoft Digital migration team initially thought the migration to the cloud would be as simple as implementing one big lift-and-shift operation. It was a common mindset at the time: Take all your workloads and move them to the cloud as-is and figure out the rest later.

“That wasn’t the best way, for a number of reasons,” he says, adding that there was a myriad of interconnections and embedded systems to sort out first. “We quickly realized our migration to the cloud was going to be far more complex than we thought.”

After a lot of rushing around, the team realized it needed to step back and think more holistically.

The first step was to figure out exactly what they had on their hands—literally. Microsoft had workloads spread across more than 10 datacenters, and no one was tracking who owned all of them or what they were being used for (or if they were being used at all).

Longtime Microsoft culture dictated that you provision whatever you thought you might need, and to go big to make sure you covered your worst-case scenario. Once the upfront cost was covered, teams would often forget about how much it cost to keep all those servers running. With teams spinning up production, development, and test environments, the amount of untracked capacity was large and always growing.

“Sometimes, they didn’t even know what servers they were using,” Apple says. “We found people who were using test environments to run their main services.”

And figuring out who was paying for what? Good luck.

“There was a little bit of cost understanding, of what folks were thinking they had versus what they were paying for, that we had to go through,” Apple says. “Once you move to Azure, every cost is accounted for—there is complete clarity around everything that you’re paying for.”

There were some surprising discoveries.

“Why are we running an entire Exchange Server with only eight people using it? That should be on Office 365,” Apple says. “There were a lot of ‘let’s find an alternative and just retire it’ situations that we were able to work through. It was like when you open your storage facility from three years ago and suddenly realize you don’t need all the stuff you thought you needed.”

Moving to the cloud created opportunities to do many things over.

“We were able to clean up many of our long-running sins and misdemeanors,” Apple says. “We were able to fix the way firewalls were set up, lock down our ExpressRoute networks, and (we) tightened up access to our Corpnet. Moving to the cloud allowed us to tighten up our security in a big way.”

Essentially, it was a greenfield do-over opportunity.

“We didn’t do it enough, but when we did it the right way, it was very powerful,” says Heather Pfluger. She is a partner group manager on Microsoft Digital’s Platform Engineering Team, who had a front-row seat during the migration.

That led to many mistakes, which makes sense because the team was trying to both learn a new technology and change decades of ingrained thinking.

“We did dumb things,” Pfluger says. “We definitely lifted and shifted into some financial challenges, we didn’t redesign as we should have, and we didn’t optimize as we should have.”

All those were learning moments, she says. She points to how the team now uses an optimization dashboard to buy only what it needs. It’s a change that’s saving Microsoft millions of dollars.

Apple says those new understandings are making a big difference all over the company.

“We had to get people into the mindset that moving to the cloud creates new ways to do things,” he says. “We’re resetting how we run things in a lot of ways, and it’s changing how we run our businesses.”

He rattled off a long list of things the team is doing differently, including:

Sending events and alerts straight to DevOps teams versus to central IT operations
Spinning up resources in minutes for just the time needed. (Versus having to plan for long racking times or VMs that used to take a week to manually build out.)
Dynamically scale resources up and down based upon load
Resizing month-to-month or week-to-week based upon cyclical business rhythms versus using the old “continually running” model
Having some solutions costs drop to zero or near zero when idle
Moving from custom Windows operating system image for builds to using Azure gallery image and Azure automation to update images
Creating software defined networking configurations in the cloud versus physical networked firewalled configurations that required many manual steps
Managing on premises environments with Azure tools

There is so much more we can do now. We don’t want our internal users to find problems with our reporting. We want to find them ourselves and fix them so fast that our employee users never notice anything was wrong.

– Heather Pfluger, partner group manager, Platform Engineering Team

Pfluger’s team builds the telemetry tools Microsoft employees use every day.

“There is so much more we can do now,” she says, explaining that the goal is always to improve satisfaction. “We don’t want our internal users to find problems with our reporting. We want to find them ourselves and fix them so fast that our employee users never notice anything was wrong.”

And it’s starting to work.

“We’ve gotten to the point where our employee users discovering a problem is becoming more rare,” Pfluger says. “We’re getting better, but we still have a long way to go.”

Apple hopes everyone continues to learn, adjust, and find better ways to do things.

“All of our investments and innovations are now all occurring in the cloud,” he says. “The opportunity to do new and more powerful things is just immense. I’m looking forward to seeing where we go next.”

The post Boosting Microsoft’s migration to the cloud with Microsoft Azure appeared first on Inside Track Blog.

Microsoft’s digital security team answers your Top 10 questions on Zero Trust

Mark Skorupa — Tue, 18 Jul 2023 19:31:58 +0000

Our internal digital security team at Microsoft spends a fair amount of time talking to enterprise customers who face similar challenges when it comes to managing and securing a globally complex enterprise using a Zero Trust security model. While every organization is unique, and Zero Trust isn’t a “one size fits all” approach, nearly every CIO, CTO, or CISO that we talk to is curious to learn more about our best practices.

We thought it would be useful to share our answers to the Top 10 Zero Trust questions from customers across the globe.

It’s surprising to us how many companies haven’t embraced multifactor authentication. It’s the first step we took on our Zero Trust journey.

– Mark Skorupa, principal program manager

If you had to pick, what are your top three Zero Trust best practices?

Microsoft’s approach to Zero Trust means we don’t assume any identity or device on our corporate network is secure, we continually verify it.

With that as context, our top three practices revolve around the following:

Identities are secure using multifactor authentication (MFA): It’s surprising to us how many companies haven’t embraced multifactor authentication. It’s the first step we took on our Zero Trust journey. Regardless of what solution you decide to implement, adding a second identity check into the process makes it significantly more difficult for bad actors to leverage a compromised identity over just passwords alone.
Device(s) are healthy: It’s been crucial that Microsoft can provide employees secure and productive ways to work no matter what device they’re using or where they’re working, especially during remote or hybrid work. However, any devices that access corporate resources must be managed by Microsoft and they must be healthy, meaning, they are running the latest software updates and antivirus software.
Telemetry is pervasive: The health of all services and applications must be monitored to ensure proper operation and compliance and enable rapid response when those conditions are not met. Before granting access to corporate resources, identities and devices are continually verified to be secure and compliant. We monitor telemetry looking for signals to identify anomalous patterns. We use telemetry to measure risk reduction and understand the user experience.

For a transcript, please view the video on YouTube: https://www.youtube.com/watch?v=TOrbiC8DGPE, select the “More actions” button (three dots icon) below the video, and then select “Show transcript.”

At Ignite 2020, experts on Microsoft’s digital security team share their lessons learned from implementing a Zero Trust security model at the company.

Does Microsoft require Microsoft Intune enrollment on all personal devices? Can employees use their personal laptops or devices to access corporate resources?For employees who want access to Microsoft corporate resources from a personal device, we require that devices be enrolled in Microsoft Intune. If they don’t want to enroll their personal device, that’s perfectly fine. They can access corporate resources through the following alternative options:

Windows Virtual Desktop allows employees and contingent staff to use a virtual remote desktop to access corporate resources like Microsoft SharePoint or Microsoft Teams from any device.
Employees can use Outlook on the web to access their Microsoft Outlook email account from the internet.

How does Microsoft onboard its Internet of Things (IoT) devices under the Zero Trust approach?

IoT is a challenge both for customers and for us.

Internally, Microsoft is working to automate how we secure IoT devices using Zero Trust. In June, the company announced the acquisition of CyberX, which will complement existing Microsoft Azure IoT security capabilities.

We segment our network and isolate IoT devices based on categories, including high-risk devices (such as printers); legacy devices (like digital coffee machines) that may lack the security controls required; and modern devices (such as smart personal assistant devices like an Amazon Echo) with security controls that meet our standards.

How is Microsoft moving away from VPN?

We’ve made good progress in moving away from VPN by migrating legacy, on-premises applications to cloud-based applications. That said, we still have more work to do before we can eliminate VPN for most employees. With the growing need to support remote work, we moved quickly to redesign Microsoft’s VPN infrastructure by adopting a split-tunneled configuration where traffic is directly routed to the applications available in the cloud and through VPN for any legacy applications. The more legacy applications we make available directly from the internet, the less we need VPN.

How do you manage potential data loss?

Everyone at Microsoft is responsible for protecting data, and we have specific scenarios that call for additional security when accessing sensitive data. For example, when an employee needs to make changes to customer-facing production systems like firewalls, they use privileged access workstations, a dedicated operating system for sensitive tasks.

Our employees also use features in Microsoft Information Protection, like the sensitivity button in Microsoft 365 applications to tag and classify documents. Depending on the classification level—even if a document moves out of our environment—it can only be opened by someone that was originally provided access.

How can Zero Trust be used to isolate devices on the network to further reduce an attack surface?

The origins of Zero Trust were focused on micro-segmentation of the network. While Microsoft’s focus extends beyond the physical network and controlling assets regardless of connectivity or location, there is still a strong need for implementing network segmentation within your physical network.

We currently have segmented our network into the configuration shown in the following diagram, and we’re evaluating future segments as the need arises. For more details on our Zero Trust strategy around networking, check out Microsoft’s approach to Zero Trust Networking and supporting Azure technologies.

Network segmentation is used to isolate certain devices, data, or services from other resources that have direct access.

How do you apply Zero Trust to a workstation where the user is a local admin on the device?

For us, it doesn’t matter what the device or workstation is, or the type of account used—any device that is looking for access to corporate resources needs to be enrolled and managed by Microsoft Intune, our device management service. That said, our long-term vision is to build an environment where standard user accounts have the permission levels to be just as productive as local admin accounts.

How important is it to have Microsoft Azure AD (AAD), even if we have Active Directory (AD) on-premises, for Zero Trust to work in the cloud? Can on-premises Active Directory alone work to implement Zero Trust if we install Microsoft Monitoring Agent (MMA) to it?

Because Microsoft has shifted most of our security infrastructure to the Microsoft Azure cloud, using Microsoft Azure AD Conditional Access is a necessity for us. It helps automate the process and determine which identities and devices are healthy and secure, which then enforces the health of those devices.

Using MMA would get you to some level of parity, but you wouldn’t be able to automate device enforcement. Our recommendation is to create an AAD instance as a replica of your on-premises AD. This allows you to continue using your on-premises AD as the master but still leverage AAD to implement some of the advanced Zero Trust protections.

How do you deal with Zero Trust for guest access scenarios?

When allowing guests to connect to resources or view documents, we use a least-privileged access model. Documents tagged as public are readily accessible, but items tagged as confidential or higher require the user to authenticate and receive a token to open the documents.

We also tag resources like Microsoft SharePoint or Microsoft Teams locations that block guest access capabilities. Regarding network access, we provide a guest wireless service set identifier (SSID) for the guest to connect to which are isolated with internet only access. Finally, all guest accounts are required to meet our MFA requirements prior to granting access.

We hope this guidance is helpful to you no matter what stage of the Zero Trust journey you’re on. As we look to 2021, the key lesson is to have empathy. Understanding where an employee is coming from and being transparent with them about why a policy is shifting or how it may impact them is critical.

– Mark Skorupa, principal program manager

What’s your Zero Trust priority for 2021?

We’re modernizing legacy and on-premises apps to be available directly from the internet. Making these available, even apps with legacy authentication requirements, allows our device management service to apply conditional access, which enforces verification of identities and ensures devices are healthy.

We hope this guidance is helpful to you no matter what stage of the Zero Trust journey you’re on. As we look to the rest of 2021, our team continues to come back to is the importance of empathy. Understanding where an employee is coming from and being transparent with them about why a policy is shifting or how it may impact them is critical.

Microsoft wasn’t born in the cloud either, so many of the digital security shifts we’re making by taking a Zero Trust approach aren’t familiar to our employees or can be met with hesitancy. We take ringed approaches to everything we roll out, which enables us to pilot, test, and iterate on our solutions based on feedback.

Leading with empathy keeps us focused on making sure employees are productive and efficient, and that they can be stewards of security here at Microsoft and with our customers.

Read this article about how Microsoft is adopting a Zero Trust security model to secure corporate and customer data.

The post Microsoft’s digital security team answers your Top 10 questions on Zero Trust appeared first on Inside Track Blog.

How ‘born in the cloud’ thinking is fueling Microsoft’s transformation

Lukas Velush — Thu, 27 Feb 2020 18:32:35 +0000

Microsoft wasn’t born in the cloud, but soon you won’t be able to tell.

Now that it has finished “lifting and shifting” its massive internal workload to Microsoft Azure, the company is rethinking everything.

“We’re rearchitecting all of our applications so that they work natively on Azure,” says Ludovic Hauduc, corporate vice president of Core Platform Engineering in Microsoft Core Services Engineering and Operations (CSEO). “We’re retooling to take advantage of all that the cloud has to offer.”

Microsoft spent the last five years moving the internal workload of its 60,000 on-premises servers to Azure. Thanks to early efforts to modernize some of that workload while migrating it, and to ruthlessly removing everything that wasn’t being used, the company is now running about 6,500 virtual machines in Azure. This number dynamically scales up to around 11,000 virtual machines when the company is processing extra work at the end of months, quarters, and years. It still has about 1,500 virtual machines on premises, most of which are there intentionally. The company is now 97 percent in the cloud.

Now that the company’s cloud migration is done and dusted, it’s Hauduc’s job to craft a framework for transforming Microsoft into a born-in-the-cloud company. CSEO will then use that framework to retool all the applications and services that the organization uses to provide IT and operations services to the larger company.

The job is bigger than building a guide for how the company will rebuild applications that support Human Resources, Finance, and so on. Hauduc’s team is creating a roadmap for how Microsoft will rearchitect those applications in a consistent, connected way that focuses on the end user experience while also figuring out how to get the more than 3,000 engineers in CSEO who will rebuild those applications to embrace the modern engineering–fueled cultural shift needed for this transformation to happen.

[Take a deep dive into how Hauduc and his team in CSEO are using a cloud-centric mindset to drive the company’s transformation. Find out more about how CSEO is using a modern-engineering mindset to engineer solutions inside Microsoft.]

Move to the cloud creates transformation opportunity

Despite good work by good people, CSEO’s engineering model wasn’t ready to scale to the demands of Microsoft’s growth and how fast its internal businesses were evolving. Moving to the cloud created the perfect opportunity to fix it.

“In the past, every project we worked on was delivered pretty much in isolation,” Hauduc says. “We operated very much as a transaction team that worked directly for internal customers like Finance and HR.”

CSEO engineering was done externally through vendors who were not connected or incentivized to talk to each other. They would take their orders from the business group they were supporting, build what was asked for, get paid, and move on to the next project.

“We would spin up a new vendor team and just get the project done—even if it was a duplication or a slight iteration on top of another project that already had been delivered,” he says. “That’s how we ended up with a couple of invoicing systems, a few financial reporting systems, and so on and so forth.”

Lack of a larger strategy prevented CSEO from building applications that made sense for Microsoft employees.

This made for a rough user experience.

“Each application had a different look and feel,” Hauduc says. “Each one had its own underlying structure and data system. Nothing was connected and data was replicated multiple times, all of which would create challenges around privacy, security, data freshness, etc.”

The problem was simple—the team wasn’t working against a strategy that let it push back at the right moments.

“The word that the previous IT organization never really used was ‘no,’” Hauduc says. “They felt like they had no choice in the matter.”

When moving to the cloud opens the door to transformation

The story is different today. Now CSEO has its own funding and is choosing which projects to build based on a strategic vision that outlines where it wants to take the company.

“The conversation has completely shifted, not only because we have moved things to the cloud, but because we have taken a single, unified data strategy,” Hauduc says. “It has altered how we engage with our internal customers in ways that were not possible when everything was on premises and one-off.”

Now CSEO engineers are working in much smarter ways.

“We now have agility around operating our internal systems that we could never have fathomed achieving on prem,” he says. “Agility from the point of view of elasticity, from the point of view of releases, of understanding how our workloads are being used and deriving insights from these workloads, but also agility from the point of view of reacting and adapting to the changing needs of our internal business partners in an extremely rapid manner because we have un-frictioned access to the data, to the signals, and to the metrics that tell us whether we are meeting the needs of our internal customers.”

And those business groups who unknowingly came and asked for something CSEO had already built?

“We now have an end-to-end view of all the work we’re doing across the company,” Hauduc says. “We can correlate, we can match the patterns of issues and problems that our other internal customers have had, we can show them what could happen if they don’t change their approach, and best of all, we can give them tips for improving in ways they never considered.”

CSEO’s approach may have been flawed in the past, but there were lots of good reasons for that, Hauduc says. He won’t minimize the work that CSEO engineers did to get Microsoft to the threshold of digitally transforming and moving to the cloud.

“The skills and all of the things that made us successful as an IT organization before we started on a cloud journey are great,” he says. “They’re what contributed to building the company and operating the company the way we have today.”

But now it’s time for new approaches and new thinking.

“The skills that are required to run our internal systems and services today in the cloud, those are completely different,” he says.

As a result, the way the team operates, the way it interacts, and the way it engages with its internal customers have had to evolve.

“The cultural journey that CSEO has been on is happening in parallel with our technical transformation,” Hauduc continues. “The technical transformation and the cultural transformation could not have happened in isolation. They had to happen in concert, and to a large extent, they fueled each other as we arrived at what we can now articulate as our cloud-centric architecture.”

And about that word that people in CSEO were afraid to say? They’re saying it now.

“The word ‘no’ is now a very powerful word,” Hauduc says. “When a customer request comes in, the answer is ‘yes, we’ll prioritize it,’ or ‘no, this isn’t the most important thing we can build for the company from a ROI standpoint, but here’s what we can do instead.’”

The change has been empowering to all of CSEO.

“The quality and the shape of the conversation has changed,” he says. “Now we in CSEO are uniquely positioned to take a step back and say, ‘for the company, the most important thing for us to prioritize is this, let’s go deliver on it.’”

Take a deep dive into how Hauduc and his team in CSEO are using a cloud-centric mindset to drive the company’s transformation.

Find out more about how CSEO is using a modern-engineering mindset to engineer solutions inside Microsoft.

The post How ‘born in the cloud’ thinking is fueling Microsoft’s transformation appeared first on Inside Track Blog.

How Microsoft is modernizing its internal network using automation

Aleenah Ansari — Wed, 11 Dec 2019 23:20:08 +0000

After Microsoft moved its workload of 60,000 on-premises servers to Microsoft Azure, employees could set up systems and virtual machines (VMs) with a push of a few buttons.

Although network hardware servers have changed over time, the way that network engineers work isn’t nearly as modern.

“With computers, we have modernized our processes to follow DevOps processes,” says Bart Dworak, a software engineering manager on the Network Automation Delivery Team in Microsoft Digital. “For the most part, those processes did not exist with networking.”

Two years ago, Dworak says, network engineers still created and ran command-line-based scripts and created configuration change reports.

“We would sign into network devices and submit changes using the command line,” Dworak says. “In other, more modern systems, the cloud provides desired-state configurations. We should be able to do the same thing with networks.”

It became clear that Microsoft needed modern technology for configuring and managing the network, especially as the number of managed network devices increased on Microsoft’s corporate network. This increase occurred because of higher network utilization by users, applications, and devices as well as more complex configurations.

“When I started at Microsoft in 2015, our network supported 13,000 managed devices,” Dworak says. “Now, we surpassed 17,000. We’re adding more devices because our users want more bandwidth as they move to the cloud so they can do more things on the network.”

[Learn how Microsoft is using Azure ExpressRoute hybrid technology to secure the company.]

Dworak and the Network Automation Delivery Team saw an opportunity to fill a gap in the company’s legacy network-management toolkit. They decided to apply the concept of infrastructure as code to the domain of networking.

“Network as code provides a means to automate network device configuration and transform our culture,” says Steve Kern, a Microsoft Digital senior program manager and leader of the Network Automation Delivery Team.

The members of the Network Automation Delivery Team knew that implementing the concept of network as code would take time, but they had a clear vision.

“If you’ve worked in a networking organization, change can seem like your enemy,” Kern says. “We wanted to make sure changes were controlled and we had a routine, peer-reviewed rhythm of business that accounted for the changes that were pushed out to devices.”

The team has applied the concept of network as code to automate processes like changing the credentials on more than 17,000 devices at Microsoft, which now occurs in days rather than weeks. The team is also looking into regular telemetry data streaming, which would inform asset and configuration management.

“We want network devices to stream data to us, rather than us collecting data from them,” Dworak says. “That way, we can gain a better understanding of our network with a higher granularity than what is available today.”

The Network Automation Delivery Team has been working on the automation process since 2017. To do this, the team members built a Git repository and started with simple automation to gain momentum. Then, they identified other opportunities to apply the concept of GitOps—a set of practices for deployment, management, and monitoring—to deliver network services to Microsoft employees.

Implementing network as code has led to an estimated savings of 15 years of labor and vendor spending on deployments and network devices changes. As network technology shifts, so does the role of network engineers.

“We’re freeing up network engineers so they can build better, faster, and more reliable networks,” Kern says. “Our aspiration is that network engineers will become network developers who write the code. Many of them are doing that already.”

Additionally, the team is automating how it troubleshoots and responds to outages. If the company’s network event system detects that a wireless access point (AP) is down, it will automatically conduct diagnostics and attempt to address the AP network outage.

“The building AP is restored to service in less time than it would take to wake up a network engineer in the middle of the night, sign in, and troubleshoot and remediate the problem,” Kern says.

Network as code also applies a DevOps mentality to network domain by applying software development and business operations practices to iterate quickly.

“We wanted to bring DevOps principles from the industry and ensure that development and operations teams were one and the same,” Kern says. “If you build something, you own it.”

In the future, the network team hopes to create interfaces for each piece of network gear and have application developers interact with the API during the build process. This would enable the team to run consistent deployments and configurations by restoring a network device entirely from a source-code repository.

Dworak believes that network as code will enable transformation to occur across the company.

“Digital transformation is like remodeling a house. You can remodel your kitchen, living room, and other parts of your house, but first you have to have a solid foundation,” he says. “Your network is part of the foundation—transforming networking will allow others to transform faster.”

Learn how Microsoft is using Azure ExpressRoute hybrid technology to secure the company.

The post How Microsoft is modernizing its internal network using automation appeared first on Inside Track Blog.

How Microsoft used SQL Azure and Azure Service Fabric to rebuild a key internal app

Lukas Velush — Wed, 16 Oct 2019 20:23:57 +0000

When Raja Narayan took over supporting the Payee Management Application that Microsoft Finance uses to onboard new suppliers and partners, the experience was broken.

“Our application’s infrastructure was on-premises,” Narayan says. “It was a big, old-school architecture monolith and, although we had database-based logging in place, there was no alerting setup at any level. Bugs and infrastructure failures were bringing the application down, but we didn’t know when this happened.”

And it went down a lot.

When it did, the team wouldn’t know until a user filed a ticket. Then it would take four to six hours before the ticket reached Narayan’s team.

“We would undertake root-cause investigation and it sometimes could take a solid two to three hours, if not more in rare cases, until we managed to eventually identify and resolve the problem,” says Narayan, a principal software engineer on the Microsoft Digital group that supports the Microsoft Finance team.

[Take a look at how Narayan’s broader team is modernizing applications.]

All told, it would take at least 10 to 12 hours to bring the system back online.

And it wasn’t only the reliability challenges that the team was hit with daily. Updates and fixes required taking the system down. Engineering teams didn’t have insight into work that other teams were doing. Cross-discipline collaboration was minimal. Continuous repetitive manual work was required. And telemetry was severely limited.

“There was no reliability at all,” Narayan says. “The user experience was very, very bad.”

That was four years ago, before the team moved its payee management system and its 95,000 active supplier and partner accounts to the cloud.

“When I joined our team, it was obvious that we needed a change. And going to Azure was a big part of it,” Narayan says. “Going to the cloud was going to open up new opportunities for us.”

He was right. After the nine-month migration was finished, things got much better right away. The benefits included:

The team was empowered to adopt modern, DevOps engineering practices, something they really wanted. The benefits showed up in many ways, including reduced cross-team friction and faster response times.
Failures were reported to a Directly Responsible Individual (DRI) immediately. They would fix the problem right away or queue it up for the engineering team to do deeper-level work.
The time to fix major production issues dropped to as few as 15 minutes, and a maximum of four hours.
The team no longer needed to shut down the system to make production fixes (thanks to the availability of staging and production slots, and hosting frameworks like Azure Service Fabric).
Application reliability shot up from around 95 percent to 99 percent. Availability stayed high because of redundancy.
Scaling the application up and out became just a configuration away. The team was able to scale the services based on memory and processor utilization.
The application’s telemetry data became instantly available to analyze and learn from.
The team could start taking advantage of automation and governance capabilities.

The shift to Azure is having a lasting impact.

“If someone asked me to go back, I don’t think I could happily do it,” Narayan says. “I don’t know how we survived in those old days. It’s so much faster and more powerful to be on Azure.”

Instead of spending all his time fighting to reduce technical debt, on building and maintaining too many services, and buying and installing technical infrastructure, he’s now focused on what his internal business customers need.

“Now we’re building a program,” Narayan says. “Now we’re taking care of our customers. Application-hosting infrastructure is not our concern now. Azure takes care of it.”

Opening doors with SQL Azure

Moving to the cloud also meant the team got to move on from an on-premises SQL Server database that needed continuous investment in optimization and maintenance to avoid problems with performance.

“We’ve never had an incident where our SQL Azure database has gone down,” Narayan says. “When we were on-prem, our work was often interrupted by accidental server restarts and patch installations.”

The team no longer needs to shut the application down and reboot the server when it wants to fix something or make an upgrade. “Every time we want to do something new, we make a couple of clicks, and boom, we’re done,” he says.

Azure SQL made it much easier to scale up and down when user loads changed. “My resources are so elastic now,” Narayan says. “I can shrink and expand based on my need—it’s a matter of sliding the scrollbar.”

Moving the application’s database to SQL Azure has given the team access to several new tools.

“With our move to cloud, the team can experiment on any databases, something that wasn’t possible before,” Narayan says. “Before we could only use SQL Server. Now we have an array of options such as Cosmos DB, table storage, MySQL, and PostgreSQL. New features from these products are available automatically to us. We don’t have to install feature updates and patches—it’s all managed by Azure.”

Living in the cloud also gives the team new access to the application’s data.

“We now live in this new big-data world,” Narayan says. “We can now get a lot of insights about our application, especially with machine learning and AI.”

For example, SQL Azure learns from the incoming load and accordingly tunes itself. Indexes are created or dropped based on how it learns. “This is one of the most sought-after features by our team,” he says. “This feature does what a database administrator used to have to do by hand.”

And processing the many tiny transactions that come through Narayan’s application? Those all happen much faster now as well.

“For Online Analytic Processing (OLAP), we need big processing machines,” he says. “We need big resources.”

Azure provides him with choices, including Azure Datawarehouse, Azure Databricks, and Azure HDInsights. “If I was still on-prem, this kind of data processing would just be a dream for me,” he says. “Now they are a click away for me.”

Going forward, the plan is to use AI and machine learning to analyze Payee Management Application’s data at greater depth. “There is a lot more we can do with our data,” Narayan says. “We’re just getting started.”

Narayan’s journey toward more reliable and agile service is a typical example of how off-loading the work of managing complex on-premises infrastructure can help the company’s internal and external customers focus on their core businesses, says Eli Birova, a site-reliability engineer on the Azure SQL SRE Team.

“And one of the biggest values Azure SQL DB brings is a database in the Azure cloud that scales in and out together with your business need and adapts to your workload,” Birova says.

That provides customers like Narayan and his team with a database as a service tailored by the deep Relational Database Management Systems (RDBMS) engineering expertise that come from long years of developing Microsoft SQL Server, she says. It’s a service that incorporates large-scale distributed systems design and implementation best practices, which also natively leverages the scalability and resiliency mechanisms of the Azure stack itself.

“We in the Azure SQL DB team are continuously monitoring and analyzing the behavior of our services and the experience our customers have with us,” Birova says. “We’re very focused on identifying and implementing improvements to our feature set, reliability, and performance. We want to make sure that every customer can rely on their data when and as they need it, and that they can count on their server being up to date and secure without needing to invest their own engineering resources into managing on-premises database infrastructure.”

Harnessing the power of Azure Service Fabric

Once Narayan’s team finished migrating the Payee Management Application to the cloud, it got the breathing room it needed to start thinking bigger.

“We started asking ourselves, ‘How can we get more out of being in the cloud?’” Narayan says. “It didn’t take us long to realize that the best way to take advantage of everything Azure had to offer would be to modify our application from the ground up to be cloud-native.”

That shift in thinking meant that his days of running a massive, clunky, monolithic application were numbered.

“We realized we could use Azure Service Fabric to rebuild the application as a suite of microservices,” Narayan says. “We could get an entirely fresh start.”

Azure Service Fabric is part of an evolving set of tools that the Azure product group is using to help customers—including power users inside Microsoft—build and operate always-on, scalable, distributed apps like the one Narayan’s team manages. So says Spencer Schwab, a software engineering manager on the Microsoft Azure Site Reliability Engineering (SRE) team.

“We’re learning from the experience Raja and his team are having with Service Fabric,” Schwab says. “We’re pumping those learnings back into the product so that our customers have the best experience possible when they choose to bet their businesses on us.”

Narayan’s team is using Azure Service Fabric to gradually rebuild the Payee Management Application without interrupting service to customers. That’s something possible only in the cloud.

“We lifted and shifted all of the old, existing monolith components into Azure Service Fabric,” he says. “Containerizing it like that has allowed us to gradually strangle the older application.”

Each component of the old application is docked in a container. Each is purposefully placed next to the microservice that will replace it.

“Putting each microservice next to the component that it’s replacing allows us to smoothly move that bit of workload to the new microservice without shutting down the larger application,” Narayan says. “This is making our journey to microservices pleasant.”

The team is halfway finished.

“So far we have 12 microservices, and we’re planning to expand up to 25,” he says.

Once the team is done, the team can then truly take advantage of being in the cloud.

“We’ll be ready to reap the benefits of cloud-native development,” Narayan says. “Anything becomes possible at that point.”

The post How Microsoft used SQL Azure and Azure Service Fabric to rebuild a key internal app appeared first on Inside Track Blog.