Azure and cloud infrastructure Archives - Inside Track Blog http://approjects.co.za/?big=insidetrack/blog/tag/azure-and-cloud-infrastructure/ How Microsoft does IT Thu, 09 Apr 2026 14:57:27 +0000

Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft http://approjects.co.za/?big=insidetrack/blog/protecting-anonymity-at-scale-how-we-built-cloud-first-hidden-membership-groups-at-microsoft/ Thu, 26 Feb 2026 17:00:00 +0000

The post Protecting anonymity at scale: How we built cloud-first hidden membership groups at Microsoft appeared first on Inside Track Blog.

Some Microsoft employee groups can’t afford to be visible.

For years, we supported email‑based communities internally here at Microsoft whose very existence depends on anonymity. These include employee resource groups, confidential project teams, and other sensitive audiences where simply revealing who belongs can create real‑world risk.

Traditional distribution groups make membership discoverable by default. Owners can see members. Admins can see members. In some cases, other users can infer membership through directory queries or tooling.

That model doesn’t work when anonymity is a requirement.

A photo of Reifers.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP. We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

Brett Reifers, senior product manager, Microsoft Digital

For over 15 years, we relied on a custom, on‑premises solution that enabled employees to send and receive messages through groups with fully hidden memberships.

The system worked, but we were deprecating the Microsoft Exchange servers that it ran on. At the same time, we were also deploying our Secure Future Initiative (SFI), which required us to reassess legacy systems that could expose sensitive data or slow incident response, including hidden membership groups.

The system wasn’t broken, but it represented concentrated risk simply by existing outside our modern cloud controls and monitoring.

“When the SFI wave hit, it was made clear to us that we needed to keep our people safe, and to do that, we needed to build a new hidden memberships group MVP,” says Brett Reifers, a senior product manager in Microsoft Digital, the company’s IT organization. “We needed to raise the bar with modern groups, and we needed to do it in six months or miss meeting our goals.”

The mandate was clear. Preserve anonymity, eliminate on‑premises dependencies, and do it quickly.

A photo of Carson.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner.”

Nate Carson, principal service engineer, Microsoft Digital

Instead of retrofitting hidden membership into standard Microsoft 365 groups, we asked a different question: What if the group lived somewhere else entirely? What if users interacted with a simple, secure front end, while all membership expansion and mail flow occurred in a locked‑down tenant built specifically for this purpose?

That idea became the foundation for Hidden Membership Groups: A new cloud‑first architecture that would separate the user experience from membership data, leverage first‑party Microsoft services, and keep our group memberships hidden from everyone—including owners and administrators—by design.

“Our solution would enable us to deprecate our legacy on-premises Exchange hardware while maintaining the privacy of our employee groups, and it would do so in a cloud-first manner,” says Nate Carson, a principal service engineer in Microsoft Digital.

Once we settled on a solution, our next step was to build support for solving a problem that few people had thought much about.

“Not everyone was aware of how serious of a situation we were in,” Carson says. “We had to show everyone what was at stake, and to share our solution with them.”

After taking their plan on the road, the team got the buy-in it needed, and that’s when the real work started.

Planning to solve business problems with security built-in

Before we designed anything, we had to be clear about what success meant.

Hidden Membership Groups aren’t just another collaboration feature. They support scenarios where anonymity isn’t optional—it’s foundational. That reality shaped every requirement that we built into our solution, including:

1. Absolute privacy

Group membership couldn’t be visible to users, group owners, or administrators—under any circumstances. That requirement immediately ruled out standard group models.

2. Cloud only

Any new solution had to live entirely in our cloud, use first‑party services, and align with modern identity, security, and compliance practices. On‑premises infrastructure wasn’t an option.

3. Scale

Some groups had a handful of members. Others had tens of thousands. Membership changed frequently, and those changes had to propagate safely and predictably without exposing data or degrading performance.

4. Separation of concerns

User interaction and membership truth couldn’t live in the same place. Employees needed a simple way to discover groups, request access, and manage participation, without ever interacting with the system that stored or expanded membership.

5. Self‑service with guardrails

The solution needed to reduce operational overhead, not introduce a new bottleneck. Group lifecycle management had to be automated, auditable, and secure, while still giving teams flexibility.

6. Simple to use

Employees shouldn’t need special training. They shouldn’t need to understand tenants, identity synchronization, or mail routing. The experience needed to be intuitive, consistent, and accessible—without compromising security.

Once those requirements were clear, our solution started to emerge. Incremental changes wouldn’t be enough. A traditional group model wouldn’t work. The solution required a new architecture—one designed around isolation, automation, and intentional limitation.

That’s when we started the engineering work.

Creating a cloud-first architecture

Designing for hidden membership meant eliminating ambiguity. If any surface could reveal membership, even indirectly, it didn’t belong in the design.

That constraint led us toward a model built on strict isolation, explicit APIs, and intentionally narrow interfaces. The result is straightforward to use, but deliberately difficult to interrogate.

Two tenants, with sharply separated responsibilities

At the foundation of the solution is a two‑tenant model.

Our primary Microsoft 365 tenant is where employees authenticate, discover groups, and initiate actions. A secondary, isolated tenant hosts the distribution lists and performs mail expansion for Hidden Membership Groups.

A photo of Mace.

“Tenant isolation is what makes the privacy guarantee real. By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Chad Mace, principal architect, Microsoft Digital

That separation matters because the secondary tenant isn’t designed for interactive use. Only Exchange and the minimum directory constructs required for mail routing and expansion are enabled.

Operationally, when an employee sends email to a Hidden Membership Group, they send to a mail contact visible in the corporate tenant. That contact routes to the corresponding distribution group in the isolated tenant, where membership expansion occurs. Expanded messages are then delivered back to recipients’ inboxes in the corporate tenant, so sent and received mail lives where users already work.
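Conceptually, the flow above can be modeled in a few lines of plain Python. All names and data structures here are illustrative assumptions; the production system uses Exchange mail contacts and distribution groups, not these dictionaries.

```python
# Conceptual sketch of the two-tenant mail flow (illustrative names only).

CORPORATE_CONTACTS = {
    # Address visible in the corporate tenant -> group ID in the isolated tenant.
    "erg-community@corp.example.com": "grp-001",
}

ISOLATED_TENANT_GROUPS = {
    # Membership lives only in the isolated tenant; nothing in the corporate
    # tenant can enumerate it.
    "grp-001": ["alice@corp.example.com", "bob@corp.example.com"],
}

def send_to_hidden_group(to_address, message):
    """Route mail through the corporate contact, expand membership in the
    isolated tenant, and return (recipient, message) pairs for delivery
    back to corporate inboxes."""
    group_id = CORPORATE_CONTACTS[to_address]        # corporate tenant: routing only
    members = ISOLATED_TENANT_GROUPS[group_id]       # isolated tenant: expansion
    return [(member, message) for member in members]

deliveries = send_to_hidden_group("erg-community@corp.example.com", "Meeting at 3pm")
print([recipient for recipient, _ in deliveries])
# -> ['alice@corp.example.com', 'bob@corp.example.com']
```

The point of the shape is that the sender only ever touches the corporate-side contact; expansion happens in a structure the sender has no way to read.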

“Tenant isolation is what makes the privacy guarantee real,” says Chad Mace, a principal architect in Microsoft Digital. “By moving membership expansion to a tenant that users and owners can’t access, we removed the possibility of accidental exposure. The system simply doesn’t give you a place where membership can be seen.”

Identity without interactive access

This isolated tenant only works if it can resolve recipients. To enable that, our development team used Microsoft Entra ID multi‑tenant organization identity sync to represent corporate users in the secondary tenant.

These identities are treated as business guest identities, and we disable sign‑in to prevent interactive access. The tenant can perform expansion, but nothing more.

However, complete isolation wasn’t technically possible. Privileged access always exists at some level. The design response was to minimize that exposure. Access to the isolated tenant is tightly restricted, and membership changes flow through automation rather than broad UI-based administration.

The goal: reduce exposure to the smallest viable operational group.

API-first automation as the control plane

With the tenancy and identity model established, the team needed a single, consistent way to create groups, connect objects across tenants, and manage changes without introducing new administrative workflows. That’s where the APIs come in.

A photo of Pena II.

“We split the backend into multiple APIs so the system could scale without becoming fragile. That let us separate everyday operations from high-volume membership work and keep performance predictable.”

John Pena II, principal software engineer, Microsoft Digital

The backend is intentionally modular, split into three distinct APIs:

  • The control API handles group creation, configuration, and cross‑tenant coordination.
  • The membership API handles standard add and remove operations.
  • The bulk membership APIs handle large‑scale operations involving tens of thousands of users, with services designed to run long‑lived jobs, manage throttling, and recover from partial failures.

“We split the backend into multiple APIs so the system could scale without becoming fragile,” says John Pena II, a principal software engineer in Microsoft Digital. “That let us separate everyday operations from high-volume membership work and keep performance predictable.”
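The bulk pattern the team describes (batching, backing off under throttling, and recovering from partial failures) can be sketched in plain Python. Everything here, including the apply_batch callable and the use of RuntimeError to stand in for a throttling response, is an illustrative assumption rather than the production API.

```python
import time

def run_bulk_job(members, apply_batch, batch_size=100, max_retries=3, backoff=0.1):
    """Apply membership changes in batches; return members that still failed
    after retries so a recovery pass can pick them up."""
    failed = []
    for start in range(0, len(members), batch_size):
        batch = members[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                apply_batch(batch)                    # stand-in for the cross-tenant write
                break
            except RuntimeError:                      # stand-in for a throttling response
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
        else:
            failed.extend(batch)                      # retries exhausted: save for recovery
    return failed

# Demo: a flaky backend that throttles every other call.
calls = {"n": 0}
def flaky_apply(batch):
    calls["n"] += 1
    if calls["n"] % 2 == 1:
        raise RuntimeError("throttled")

leftover = run_bulk_job([f"user{i}" for i in range(250)], flaky_apply, backoff=0)
print(leftover)  # -> [] (every batch succeeded on its retry)
```

Keeping this long-running, retry-heavy path separate from the routine add/remove API is what lets everyday operations stay fast while a 30,000-member change grinds through in the background.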

The APIs run as PowerShell-based Azure Functions and use managed identity patterns, including federated identity credentials, to securely connect across tenants.

Creating the user experience with Power Apps

For the front end, we built a Canvas app in Power Apps, backed by Dataverse. The goal was speed and flexibility, without compromising strict privacy boundaries.

By using Power Apps as the primary interaction layer, we deliver a secure, modern experience without unnecessary custom infrastructure. The Canvas app provides a single, focused surface for discovering, joining, and managing hidden membership groups, while all sensitive operations remain behind controlled APIs and tenant boundaries. This separation allows the team to iterate quickly on experience design without weakening the privacy guarantees that the solution depends on.

Power Platform also simplifies how security is enforced across the solution. Dataverse enables fine‑grained, role‑based access, ensuring users only see data they’re entitled to see—while keeping sensitive membership information entirely out of the client layer. That reduces long‑term maintenance overhead and makes it easier to evolve the solution as requirements change.

“From the beginning, we designed everything with security roles and workflows in mind,” says Shiva Krishna Gollapelly, senior software engineer in Microsoft Digital. “Dataverse let us control who could see or change data without building additional APIs or storage layers, and keeping everything inside the Power Apps ecosystem saved us a lot of maintenance over time.”

Dataverse plays a precise role here: it maintains the datastore the app needs to function without becoming a secondary membership repository.

A photo of Amanishahrak.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch.”

Bita Amanishahrak, software engineer II, Microsoft Digital

From a security posture perspective, Dataverse security is used intentionally to restrict what different users can see and do, and the Power App was developed with security roles and workflows in mind.

Short version: the app brokers intent, the APIs execute it, and all the pieces that need to stay separate do exactly that.

“Using the Power Platform let us move fast, integrate deeply with Microsoft identity, and enforce security without building a full web stack from scratch,” says Bita Amanishahrak, a software engineer in Microsoft Digital.

The architectural intent is consistent throughout—isolate the sensitive plane and ensure the user plane operates only through controlled interfaces.

Benefits and impact

The most important outcome of the new architecture is also the simplest: Hidden membership stays hidden.

Anonymity isn’t enforced by policy. It’s enforced by architecture. Membership data never appears in the user experience or administrative tooling, and it doesn’t surface as a side effect of scale.

“We’re no longer asking people to trust that we’ll handle sensitive membership carefully through process,” Reifers says. “The system makes exposure structurally impossible.”

The impact was immediate.

At launch, we migrated more than 2,200 hidden membership groups, representing over 200,000 users, from the legacy on‑premises system into the new cloud‑first architecture. Groups ranged from small, tightly controlled communities to audiences with tens of thousands of members, all supported without special handling.

“Some of these groups are massive,” Pena says. “We knew from the beginning we were dealing with memberships in the tens of thousands, which is why we designed bulk operations as a first‑class capability instead of an afterthought.”

The separation between routine APIs and bulk‑membership APIs proved critical, enabling large migrations and ongoing changes without degrading day-to-day performance.

Operationally, moving to a cloud‑only model reduced both risk and complexity. Decommissioning the on‑premises Exchange infrastructure eliminated specialized maintenance requirements and improved the alignment of monitoring, auditing, and access controls with our modern cloud standards.

Delivery speed also mattered. Driven by Secure Future Initiative urgency and strong executive sponsorship, the team designed and delivered a minimum viable product in less than six months.

“That timeline forced discipline,” Reifers says. “We focused on what mattered: Security, privacy guarantees, scale, and a UX that wouldn’t disrupt group owners and members who had relied on a 15-year-old tool.”

Everything else was secondary.

A photo of Gollapelly.

“Most users never think about tenants or APIs. They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

Shiva Krishna Gollapelly, senior software engineer, Microsoft Digital

From an employee perspective, the experience became simpler and safer. Users now interact through a Power Platform app consistent with the rest of Microsoft 365.

Discovering a group, requesting access, or leaving a group no longer requires understanding the architecture behind it.

“Most users never think about tenants or APIs,” Gollapelly says. “They just see a clean experience that does what they need, without exposing anything it shouldn’t.”

The result is sustainable. The platform protects anonymity at scale, simplifies operations, boosts resiliency, and can evolve without reopening core privacy questions.

Moving forward

Delivering the initial solution was only the beginning.

The team sees Hidden Membership Groups as more than a single solution. It’s a reusable pattern for sensitive collaboration in a cloud‑first world: isolate what matters most, automate everything else, and design experiences that don’t require trust to be safe.

As adoption grows, the team plans to support additional anonymity-sensitive scenarios while maintaining the same underlying model.

“We don’t want every sensitive scenario inventing its own workaround,” Mace says. “This gives us a pattern we can reuse confidently.”

Future priorities include improving lifecycle and ownership experiences, strengthening auditing and reporting for approved administrators, and enhancing self‑service workflows—without compromising membership privacy. If it risks exposing membership, it doesn’t ship.

With the legacy system fully retired, Reifers reflects on what the team accomplished to get here.

“We shipped a new enterprise pattern in six months using our first-party tools,” Reifers says. “We achieved this because a stellar team cared about the mission. That’s the takeaway.”

Key takeaways

Use these tips to strengthen your privacy, simplify your operations, and future-proof your organization’s collaboration systems:

  • Prioritize privacy by design. Embed privacy considerations from the start to protect sensitive information in all collaboration scenarios.
  • Architect for scale. Treat bulk operations as a first-class capability to support large groups efficiently.
  • Automate and modernize workflows. Replace legacy systems with cloud-native solutions to reduce risk, improve transparency, and enable continuous improvement.
  • Streamline user experience. Provide intuitive, consistent interfaces that make it easy for users to access, join, or leave groups without requiring technical knowledge.
  • Enforce strict access and auditing controls. Align monitoring and administration with modern cloud standards to maintain security and accountability.
  • Create reusable patterns. Establish and share successful privacy patterns to avoid reinventing solutions for each new case.
  • Focus on operational simplicity and resilience. Design systems that are easy to maintain and improve, freeing up teams to concentrate on innovation rather than upkeep.

Moving from a ‘Scream Test’ to holistic lifecycle management: How we manage our Azure services at Microsoft http://approjects.co.za/?big=insidetrack/blog/moving-from-a-scream-test-to-holistic-lifecycle-management-how-we-manage-our-azure-services-at-microsoft/ Thu, 20 Nov 2025 17:05:00 +0000

Nearly a decade ago, as we began our journey from relying on on-premises physical computing infrastructure to being a cloud-first organization, our engineers came up with a simple but effective technique to see if a relatively inactive server was really needed.

They dubbed it the “Scream Test.”

“We didn’t have a great server inventory and tracking system, and we didn’t always know who owned a server,” says Brent Burtness, a principal software engineer in Commerce Financial Platforms, who was one of the leaders for the effort in his group. “So, we essentially just turned them off. If someone screamed—‘Hey, why’d you turn off my server?’—then we’d know it was still being used.”

Today, the basic idea behind the Scream Test is being used across the company, but in a more holistic way. Importantly, it’s been incorporated into the overall lifecycle management of our computing infrastructure. And, through the automation tools provided by Microsoft Azure, we have a much more efficient process for making sure that we’re saving time and money by reducing the number of underused machines we operate, monitor, and maintain.

A photo of Apple.

“We thought we were going to get rid of a small number of machines that weren’t being used. But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

Pete Apple, cloud network engineering architect, Microsoft Digital

Uncovering more than expected

The Scream Test was part of the huge effort to evaluate our on-premises compute resources before we began moving to the Azure cloud. After all, why spend resources moving something that isn’t needed?

Pete Apple, who helped develop the concept of the Scream Test, is a cloud network engineering architect in Microsoft Digital, the company’s IT organization. Looking back, he remembers the surprising results that emerged when they began shutting down specific servers to see who noticed.

“We thought we were going to get rid of a small number of machines that weren’t being used,” Apple says. “But we found the actual share was about 15% of all machines, which saved us a lot of effort of moving those unused machines to the cloud. In other words, we downsized on the way to the cloud, rather than after the fact.”

As part of this process, Apple explains, our engineers looked at two related factors to reduce inefficiencies in our usage of computing resources.

The first was to identify systems that were used infrequently, at a very low level of CPU (sometimes called “cold” servers). From that, we could determine which systems in our on-premises environments were oversized—meaning someone had purchased physical machines according to what they thought the load would be, but either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Microsoft Azure Virtual Machine (VM) sizes for every on-premises system to be migrated.

“We learned that there’s a lot of orphaned, or underutilized, resources out there,” Burtness says. “These were cases where the workload was so small on a server—like under 5% CPU—that it didn’t make sense to host it on its own machine. We could then move the task or application and get it down to just one or two CPUs on a virtual machine.”
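The kind of check Burtness describes can be sketched as a simple utilization scan. The 5% threshold comes from his example above; the data shape and function name are assumptions, not the tooling the team actually used.

```python
# Sketch: average each server's sampled CPU utilization and flag anything
# under the threshold as a consolidation candidate.

def find_cold_servers(cpu_samples_by_server, threshold_pct=5.0):
    """Return (server, average CPU %) pairs for underutilized machines."""
    cold = []
    for server, samples in cpu_samples_by_server.items():
        avg = sum(samples) / len(samples)
        if avg < threshold_pct:
            cold.append((server, round(avg, 1)))
    return sorted(cold)

usage = {
    "web-01": [42.0, 55.3, 38.9],   # busy: keep as-is
    "batch-07": [1.2, 0.8, 2.5],    # cold: consolidation candidate
    "qa-03": [3.9, 4.1, 2.0],       # cold: consolidation candidate
}
print(find_cold_servers(usage))  # -> [('batch-07', 1.5), ('qa-03', 3.3)]
```

In practice, the flagged machines became the ones whose workloads were consolidated onto small VMs during migration.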

At the time, we did much of this work manually, because we were early adopters. Microsoft now offers a number of products to assist with this kind of review of your on-premises environment, led by Azure Migrate.

Another part of the process was determining which systems were being used for only a few days a month or at certain busy times of the year. These development machines, test/QA machines, and user acceptance testing machines (reserved for final verification before moving code to production) were running continuously in the datacenter but were really only needed during limited windows. For these situations, we applied the tools available in Azure Resource Manager Templates and Azure Automation to ensure the machines would only run when needed.
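The decision logic behind those limited-window schedules can be sketched as follows. The real implementation used Azure Resource Manager templates and Azure Automation, so this plain-Python function is only a conceptual stand-in, with hypothetical parameters.

```python
from datetime import date

def should_run(today, active_months, active_days=None):
    """Return True if a machine that is only needed in certain windows
    should be powered on today."""
    if today.month not in active_months:
        return False                                  # outside the busy season
    return active_days is None or today.day in active_days

# A user acceptance testing machine needed only during quarter-end months:
quarter_end = {3, 6, 9, 12}
print(should_run(date(2025, 6, 15), quarter_end))   # True
print(should_run(date(2025, 7, 15), quarter_end))   # False
```

Wiring a check like this into an automation schedule means the machine accrues compute cost only during the windows when it actually does work.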

Automating with Azure

Today, we don’t have to rely on anything as crude as the Scream Test to find unused and underused computing resources. With 98% of our IT resources operating in the Azure cloud, we have much greater insight into how efficient our network is, so much of the process can be automated.

“We’ve found this effort much easier to manage in the cloud, because all our computing resources are integrated with the Azure portal,” Apple says. “They have an API system and offer various tools within Azure Update Manager and Azure Advisor to help with cost efficiency. It’s kind of like a modern version of Clippy—’Hey, it looks like your VM isn’t being used much. Do you want to downsize that or turn it off?'”

(For the uninitiated, Clippy was the Microsoft Office animated paperclip assistant introduced in the late 1990s. It offered tips and help with tasks, like writing and formatting documents. Clippy became iconic for its quirky suggestions, including recommending that you remove things from your desktop that you weren’t using.)

Burtness smiles in a portrait photo.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams. They can then go into the portal and clean up the resource.”

Brent Burtness, principal software engineer, Commerce Financial Platforms

And simply taking the step of turning off stuff that we weren’t using turned out to be very effective. Thanks, Clippy!

Today, we approach this challenge in a more efficient and sophisticated way, taking advantage of Azure tools like Update Manager and Advisor.

“With everything being in the Azure portal or in Azure Resource Graph, it’s much more streamlined, and makes it easier to get that data out to the teams,” Burtness says. “We can run automated queries with Azure Resource Graph. Then we bring that information into our internal Service 360 tool, which we use to give action items to our developers. Each item gives them a link to Azure portal, and they can then go into the portal and clean up the resource.”
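A sketch of that reporting flow: turn an exported resource inventory (for example, the output of an Azure Resource Graph query) into per-team action items that link back to the Azure portal. The record shape, the idle-days field, and the hand-off to the internal Service 360 tool are all illustrative assumptions.

```python
def build_action_items(resources, max_idle_days=30):
    """Produce one action item per resource idle longer than the threshold."""
    items = []
    for r in resources:
        if r["idle_days"] >= max_idle_days:
            items.append({
                "team": r["owner_team"],
                "action": f"Review idle resource {r['name']}",
                "link": f"https://portal.azure.com/#@/resource{r['id']}",
            })
    return items

inventory = [
    {"name": "vm-old-build", "id": "/subscriptions/s1/rg/vm-old-build",
     "owner_team": "CFP-Infra", "idle_days": 90},
    {"name": "vm-ci-agent", "id": "/subscriptions/s1/rg/vm-ci-agent",
     "owner_team": "CFP-Infra", "idle_days": 2},
]
print([i["action"] for i in build_action_items(inventory)])
# -> ['Review idle resource vm-old-build']
```

The essential idea is that each item carries both an owner and a direct link, so the team receiving it can jump straight to the resource and clean it up.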

Managing for the lifecycle

One of the most important things we learned by using the Scream Test to identify inefficiencies and moving our systems from on-premises servers to the cloud was that it’s an ongoing process, not a fixed-end project.

“We had this idea that it was going to be a one-time event, that we’ll move to the cloud and then we’ll be done,” Apple says. “A better understanding is that it’s a lifecycle. We have integrated this concept of continual evaluation into our processes around everything that’s still on-premises, because we still have labs, we still have physical infrastructure.”

We continue to do this evaluation on a regular basis with both physical and virtual computing resources, because needs and usage are constantly changing.

Cutting our cloud costs

A text graphic shows the savings that one group at Microsoft achieved by becoming more efficient in their compute usage.
In a pilot set of Azure subscriptions, the Commerce Financial Platforms team reduced usage by 233 resources across 36 subscriptions and 17 services in 6 team groups, saving more than $15,000 in monthly operating costs.

“Now we have a basic process around a six-month cycle,” Apple says. “So, every six months we ask, does this still need to be on-premises or should we start moving it to the cloud? And we do the same thing with our cloud resources. Who’s still using these VMs? And we still go through the same review process to see if it’s needed, or if we can shut it down or move it.”

This has resulted in significant cost savings for the company. “We’re up to about 15% to 20% less compute cost, depending on the organization, because of this much better understanding of our business needs,” Apple says.

Better governance, increased security

Another major benefit of this process was establishing much stronger governance of compute resources across the entire organization.

“When we first did the Scream Test, we weren’t always really sure who owned what, in some cases,” Apple says. “We’ve fixed that as part of this process. This governance aspect is a key part of being more efficient with our resources.”

Burtness explains why this is so important.

“It’s critical to know exactly who to contact when there’s something wrong with the server,” Burtness says. “Now, with clearer ownership, clearer accountability, and better inventory, it’s a much better experience.”

Better governance also means tighter security, according to both Apple and Burtness.

“This is really important when it comes to threat-actor response,” Apple says. “Unused servers can often be an entry point for hackers. Or, say we discover that a machine or server is getting hacked; you need to talk to who owns it. If you don’t know, it takes you longer to track them down and combat the hack. That’s not great. Improving our governance has definitely made securing our environment easier.”

Key takeaways

Here are some things to keep in mind when managing your own enterprise compute resources for greater efficiency:

  • It’s not a one-time exercise. For the best results, you should be evaluating your computing resources on a regular schedule to identify “cold” servers and unused infrastructure.
  • Adjust for variable usage patterns. It’s not just about unused servers. Some machines may only be needed for a business function during certain busy times of the year. Consider turning the machines on just to handle the load during those periods and turning them off the rest of the year.
  • Use Azure tools for greater insight. If you’re operating your infrastructure in the Azure cloud, you can much more easily monitor and address orphaned resources using automated tools such as Azure Advisor, Azure Resource Graph, and the Azure portal.
  • Apply your savings to other priorities. “The more efficient you are, the more savings can be applied to other projects or given back to your manager—who is going to be very happy with you,” Apple says.
  • Saving money is not the only benefit. You’ll not only save operating costs, you’ll have a reduced maintenance and monitoring load, better governance, and fewer security vulnerabilities.

Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure http://approjects.co.za/?big=insidetrack/blog/modernizing-it-infrastructure-at-microsoft-a-cloud-native-journey-with-azure/ Thu, 04 Sep 2025 16:00:00 +0000


Engage with our experts!

Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team.

At Microsoft, we are proudly a cloud-first organization: Today, 98% of our IT infrastructure—which serves more than 200,000 employees and incorporates over 750,000 managed devices—runs on the Microsoft Azure cloud.

The company’s massive transition from traditional datacenters to a cloud-native infrastructure on Azure has fundamentally reshaped our IT operations. By adopting a cloud-first, DevOps-driven model, we’ve realized significant gains in agility, scalability, reliability, operational efficiency, and cost savings.

“We’ve created a customer-focused, self-serve management environment centered around Azure DevOps and modern engineering principles,” says Pete Apple, a technical program manager and cloud architect in Microsoft Digital, the company’s IT organization. “It has really transformed how we do IT at Microsoft.”

“Our service teams don’t have to worry about the operating system. They just go to a website, fill in their info, add their data, and away they go. That’s a big advantage in terms of flexibility.”

Apple is shown in a portrait photo.
Pete Apple, technical program manager and cloud architect, Microsoft Digital

What it means to move from the datacenter to the cloud

Historically, our IT environment was anchored in centralized, on-premises datacenters. The initial phase of our cloud transition involved a lift-and-shift approach, migrating workloads to Azure’s infrastructure as a service (IaaS) offerings. Over time, the company evolved toward more of a decentralized, platform as a service (PaaS) DevOps model.

“In the last six or seven years we’ve seen a lot more focus on PaaS and serverless offerings,” says Faisal Nasir, a principal architect in Microsoft Digital. “The evolution is also marked by extensibility—the ability to create enterprise-grade applications in the cloud—and how we can design well-architected end-to-end services.”

Because we’ve moved nearly all our systems to the cloud, we have a very high level of visibility into our network operations, according to Nasir. We can now leverage Azure’s native observability platforms, extending them to enable end-to-end monitoring, debugging, and data collection on service usage and performance. This capability supports high-quality operations and continuous improvement of cloud services.

“Observability means having complete oversight in terms of monitoring, assessments, compliance, and actionability,” Nasir says. “It’s about being able to see across all aspects of our systems and our environments, and even from a customer lens.”

Decentralizing our IT services with Azure

As Microsoft was becoming a cloud-first organization, the nature of the cloud and how we use it changed. As Microsoft Azure matured and more of our infrastructure and services moved to the cloud, we began to move away from IT-owned applications and services.

The strength of Azure’s self-service and management features means that individual business groups can handle many of the duties Microsoft Digital formerly delivered as an IT service provider—enabling each group to build agile solutions that match its specific needs.

“Our goal with our modern cloud infrastructure continues to be a solution that transforms IT tasks into self-service, native cloud solutions for monitoring, management, backup, and security across our entire environment,” Apple says. “This way, our business groups and service lines have reliable, standardized management tools, and we can still maintain control over and visibility into security and compliance for our entire organization.”

The benefits to our businesses of this decentralized model of IT services include:

  • Empowered, flexible DevOps teams
  • A native cloud experience: subscription owners can use features as soon as they’re available
  • Freedom to choose from marketplace solutions
  • Minimal subscription limit issues
  • Greater control over groups and permissions
  • Better insights into Microsoft Azure provisioning and subscriptions
  • Business group ownership of billing and capacity management

“With the PaaS model, and SaaS (software as a service), it’s more DIY,” Apple says. “Our service teams don’t have to worry about the operating system. They just go to a website, fill in their info, add their data, and away they go. That’s a big advantage in terms of flexibility.”

“The idea of centralized monitoring is gone. The new approach is that service teams monitor their own applications, and they know best how to do that.”

Delamarter is shown in a portrait photo.
Cory Delamarter, principal software engineering manager, Microsoft Digital

Leveraging the power of Azure Monitor

Microsoft Azure Monitor is a comprehensive monitoring solution for collecting, analyzing, and responding to monitoring data from cloud and on-premises environments. Across Microsoft, we use Azure Monitor to ensure the highest level of reliability for our services and applications.

Specifically, we rely on Azure Monitor to:

Create visibility. There’s instant access to fundamental metrics, alerts, and notifications across core Azure services for all business units. Azure Monitor also covers production and non-production environments as well as native monitoring support across Microsoft Azure DevOps.

Provide insight. Business groups and service lines can view rich analytics and diagnostics across applications and their compute, storage, and network resources, including anomaly detection and proactive alerting.

Enable optimization. Monitoring results help our business groups and service lines understand how users are engaging with their applications, identify sticking points, develop cohorts, and optimize the business impact of their solutions.

Deliver extensibility. Azure Monitor is designed for extensibility to enable support for custom event ingestion and broader analytics scenarios.
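The anomaly detection and proactive alerting described above can be illustrated with a small, self-contained sketch. This is not Azure Monitor’s actual implementation—the function name, window size, and threshold are all illustrative—but it shows the core idea: flag metric samples that deviate from a rolling baseline by more than a few standard deviations.

```python
from statistics import mean, stdev

def detect_anomalies(samples, window=5, threshold=3.0):
    """Flag indices whose value deviates from the rolling baseline
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all is anomalous.
            if samples[i] != mu:
                anomalies.append(i)
        elif abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A CPU-utilization series with one obvious spike at index 8.
cpu = [40, 42, 41, 43, 40, 41, 42, 40, 95, 41]
print(detect_anomalies(cpu))  # → [8]
```

In Azure Monitor itself, the equivalent behavior comes from dynamic-threshold metric alerts rather than hand-rolled code like this.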

Because we’ve moved to a decentralized IT model, much of the monitoring work has moved to the service team level as well.

“The idea of centralized monitoring is gone,” says Cory Delamarter, a principal software engineering manager in Microsoft Digital. “The new approach is that service teams monitor their own applications, and they know best how to do that.”

Patching and updating, simplified

Moving our operations to the cloud also means a simpler and more automated approach to patching and updating. The shift to PaaS and serverless networking has allowed us to manage infrastructure patching centrally, which is much more scalable and efficient. The extensibility of our cloud platforms reduces integration complexity and accelerates deployment.

“It depends on the model you’re using,” Nasir says. “With the PaaS and serverless networks, the service teams don’t need to worry about patching. With hybrid infrastructure systems, being in the cloud helps with automation of patching and updating. There’s a lot of reusable automation layers that help us build end-to-end patching processes in a faster and more reliable manner.”

Apple stresses the flexibility that this offers across a large organization when it comes to allowing teams to choose how they do their patching and updating.

“In the datacenter days, we ran our own centralized patching service, and we picked the patching windows for the entire company,” Apple says. “By moving to more automated self-service, we provide the tools and the teams can pick their own patching windows. That also allowed us to have better conversations, asking the teams if they want to keep doing the patching or if they want to move up the stack and hand it off to us. So, we continue to empower the service teams to do more and give them that flexibility.”
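The self-service patch windows Apple describes can be sketched as a tiny scheduler: each team registers its own maintenance window, and an automation layer only patches a team’s machines inside that window. The `TEAM_WINDOWS` inventory, team names, and VM names below are all hypothetical, purely for illustration.

```python
# Hypothetical inventory: team -> self-selected UTC patch window and VMs needing patches.
TEAM_WINDOWS = {
    "payments": {"window": (2, 4), "vms": ["pay-vm-01", "pay-vm-02"]},
    "reporting": {"window": (22, 23), "vms": ["rep-vm-07"]},
}

def vms_to_patch(now_hour):
    """Return the VMs whose owning team's self-selected window covers `now_hour`."""
    due = []
    for team, cfg in TEAM_WINDOWS.items():
        start, end = cfg["window"]
        if start <= now_hour < end:
            due.extend(cfg["vms"])
    return due

print(vms_to_patch(3))   # → ['pay-vm-01', 'pay-vm-02']
print(vms_to_patch(12))  # → []
```

The design point is the same one the team makes in prose: the platform supplies the mechanism, while each service team owns its own schedule.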

Securing our infrastructure in a cloud-first environment

As security has become an absolute priority for Microsoft, it’s also been a foundational element of our cloud strategy.

Being a cloud-first company has made it easier to be a security-first organization as well.

“The cloud enables us to embed security by design into everything we build,” Nasir says. “At enterprise scale, adopting Zero Trust and strong governance becomes seamless, with controls engineered in from the start, not retrofitted later. That same foundation also prepares us for an AI-first future, where resilience, compliance, and automation are built into every system.”

Cloud-native security features combined with integrated observability allow for better compliance and risk management. Delamarter agrees that the cloud has had huge benefits when it comes to enhancing network security.

“Our code lives in repositories now, and so there’s a tremendous amount of security governance that we’ve shifted upstream, which is huge,” Delamarter says. “There are studies that show that the earlier you can find defects and address them, the less expensive they are to deal with. We’re able to catch security issues much earlier than before.”
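The “shift upstream” Delamarter describes often takes the form of automated checks that run on every proposed change. As a minimal sketch—not Microsoft’s actual tooling, and with deliberately simplistic patterns—here is the kind of pre-merge secret scan that catches a hard-coded credential before it ever reaches production:

```python
import re

# Patterns for common hard-coded credentials (illustrative, not exhaustive).
SECRET_PATTERNS = [
    re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
    re.compile(r"AccountKey=[A-Za-z0-9+/=]{20,}"),
]

def scan_diff(diff_lines):
    """Return (line_number, line) pairs in a proposed change that look like secrets."""
    findings = []
    for n, line in enumerate(diff_lines, start=1):
        if any(p.search(line) for p in SECRET_PATTERNS):
            findings.append((n, line))
    return findings

diff = [
    'connection = connect(host)',
    'password = "hunter2"  # oops',
]
print(scan_diff(diff))  # flags line 2
```

Real pipelines use dedicated scanners wired into the repository’s pull-request gates, but the economics are the ones the studies describe: a defect flagged at review time is far cheaper than one found in production.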

“There are less and less manual actions required, and we’re automating a lot of business processes. It basically gives us a huge scale of automation on top of the cloud.”

Nasir is shown in a portrait photo.
Faisal Nasir, principal architect, Microsoft Digital

We use Azure Policy, which helps enforce organizational standards and assess compliance at scale using dashboards and other monitoring tools.

“Azure Policy was a key part of our security approach, because it essentially offers guardrails—a set of rules that says, ‘Here’s the defaults you must use,’” Apple says. “You have to use a strong password, for example, and it has to be tied to an Azure Active Directory ID. We can dictate really strong standards for everything and mandate that all our service teams follow these rules.”
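Azure Policy expresses those guardrails as declarative if/then rules. The toy evaluator below mirrors that shape in Python purely to show the mechanic—the `passwordPolicy` field and the evaluator itself are illustrative, not the real Azure Policy engine or schema:

```python
# A toy guardrail rule in the spirit of Azure Policy's if/then structure:
# if the resource's password policy is anything other than "strong", deny it.
RULE = {
    "if": {"field": "passwordPolicy", "notEquals": "strong"},
    "then": {"effect": "deny"},
}

def evaluate(resource, rule):
    """Return the effect of a single if/notEquals rule against a resource dict."""
    cond = rule["if"]
    matches = resource.get(cond["field"]) != cond["notEquals"]
    return rule["then"]["effect"] if matches else "allow"

print(evaluate({"passwordPolicy": "weak"}, RULE))    # → deny
print(evaluate({"passwordPolicy": "strong"}, RULE))  # → allow
```

The value of the pattern is that the rules are data, not code: a central team can publish and audit defaults while every service team’s deployments are checked against them automatically.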

AI-driven operations in the cloud

Just like its impact on the rest of the technology world, AI is in the process of transforming infrastructure management at Microsoft. Tasks that used to be manual and laborious are being automated in many areas of the company, including network operations.

“AI is creating a new interface of agents that allow users to interact with large ecosystems of applications, and there’s much easier and more scalable integration,” says Nasir. “There are less and less manual actions required, and we’re automating a lot of business processes. Microsoft 365 Copilot, Security Copilot, and other AI tools are giving us shared compute and extensibility to produce different agents. It basically gives us a huge scale of automation on top of the cloud.”

Apple notes that powerful AI tools can be combined with the incredible amount of data that the Microsoft IT infrastructure generates to gain insights that simply weren’t possible before.

“We can integrate AI with our infrastructure data lakes and use tools like Network Copilot to query the data using natural language,” Apple says. “I can ask questions like, ‘How many of our virtual machines need to be patched?’ and get an answer. It’s early, and we’re still experimenting, but the potential to interact with this data in a more automated fashion is exciting.”

Ultimately, Microsoft has become a cloud-first company, and that has allowed us to work toward an AI-first mentality in everything we do.

“Having a complete observability strategy across our infrastructure modernization helps us to make sure that whatever changes we’re making, we have a design-first approach and a cloud-first mindset,” Nasir says. “And now that focus is shifting towards an AI-first mindset as well.”

Key takeaways

Here are some of the benefits we’ve accrued by becoming a cloud-first IT organization at Microsoft:

  • Transformed operations: By moving from our legacy on-premises datacenters, through Azure’s infrastructure as a service (IaaS) offerings, and eventually to a platform as a service (PaaS) DevOps model, we’ve reaped great gains in reliability, efficiency, scalability, and cost savings.
  • A clear view: With 98% of our organization’s IT infrastructure running in the Azure cloud, we have a huge level of observability into our systems—complete oversight into network assessment, monitoring, compliance, patching/updating, and many other aspects of operations.
  • Empowered teams: Operating a cloud-first environment allows us to have a more decentralized approach to IT infrastructure. This means we can offer our business groups and service lines more self-service, cloud-native solutions for monitoring, management, patching, and backup while still maintaining control over and visibility into security and compliance for our entire organization.
  • Seamless updates: The shift to PaaS and serverless networking has enabled a more planned and automated approach to patching and updating our infrastructure, which produces greater efficiency, integration, and speed of deployment.
  • Dependable security: Our cloud environment has allowed us to implement security by design, including tighter control over code repositories and the use of standard security policies across the organization with Azure Policy.
  • Future-proof infrastructure: As we shift to an AI-first mindset across Microsoft, we’re using AI-driven tools to enhance and maintain our native cloud infrastructure and adopt new workflows that will continue to reap dividends for our employees and our organization.  

The post Modernizing IT infrastructure at Microsoft: A cloud-native journey with Azure appeared first on Inside Track Blog.

]]>
20125
The $500-billion challenge: Inside the modernization of Microsoft Treasury’s backend infrastructure http://approjects.co.za/?big=insidetrack/blog/the-500-billion-challenge-inside-the-modernization-of-microsoft-treasurys-backend-infrastructure/ Thu, 19 Jun 2025 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=19379 Editor’s note: This story was created with the help of artificial intelligence. To learn more about how Inside Track is using the power of generative AI to augment our human staff, see our story, Reimagining content creation with our Azure AI-powered Inside Track story bot. Engage with our experts! Customers or Microsoft account team representatives […]

The post The $500-billion challenge: Inside the modernization of Microsoft Treasury’s backend infrastructure appeared first on Inside Track Blog.

]]>
Editor’s note: This story was created with the help of artificial intelligence. To learn more about how Inside Track is using the power of generative AI to augment our human staff, see our story, Reimagining content creation with our Azure AI-powered Inside Track story bot.

When you’re responsible for processing over $500 billion in transactions every year, modernization becomes a mission-critical and highly delicate undertaking, where even the smallest misstep can carry serious financial consequences.

This was the challenge facing the Microsoft Treasury group, whose global operations relied on a patchwork of aging on-premises infrastructure, legacy systems, and leased lines.

Even the most mission-critical and entrenched systems must eventually evolve to meet the modern demands of speed, security, and scale.

A photo of Manikala.

“Modernizing the Treasury Service was not just about adopting new technology. It was about ensuring the uninterrupted operation of vital financial services while collaborating with various teams and meeting all security checks.”

Srinubabu Manikala, principal network engineering manager, Microsoft Digital

For Microsoft Treasury, that time had come—but instead of a straightforward infrastructure upgrade, the company seized the opportunity to go further.

What followed was a bold, strategic transformation that reinvented Microsoft Treasury’s core financial services, phasing out its aging infrastructure and migrating one of the world’s most complex treasury operations to the cloud.

“Modernizing the Treasury Service was not just about adopting new technology,” says Srinubabu Manikala, a principal network engineering manager in Microsoft Digital, the company’s IT organization. “It was about ensuring the uninterrupted operation of vital financial services while collaborating with various teams and meeting all security checks.”

A complex web of legacy, on-premises dependencies

Microsoft Treasury’s legacy infrastructure was initially built around a model where physical presence and dedicated hardware were the norm.

A photo of Shah.

“The legacy network architecture was heavily dependent on on-premises infrastructure and leased lines from third-party partners. This introduced constraints, making the environment complex and difficult to scale.”

Harsh Shah, senior service engineer, Microsoft Digital

It supported a vast network of over 80 banking partners across more than 110 countries, enabling essential financial functions like bank guarantees, supply chain financing, ledger updates, and global cash visibility.

This infrastructure complexity made modernization a highly challenging, costly, and risky endeavor—especially with an architecture that relied heavily on leased lines, aging hardware, and on-premises access methods.

“The legacy network architecture was heavily dependent on on-premises infrastructure and leased lines from third-party partners,” says Harsh Shah, a senior service engineer for Microsoft Digital. “This introduced constraints, making the environment complex and difficult to scale.”

For instance, the “Trading Room” required traders to be on-site to access treasury systems, a model that was quickly disrupted during the COVID-19 pandemic. The growing need for secure remote access only intensified these pressures, especially as seamless integration with cloud-first partners became critical, and downtime was a non-starter. Even brief outages risked financial penalties and could disrupt transactions worth billions.

Navigating the challenge

The Microsoft Digital team responsible for overseeing Microsoft Treasury’s network infrastructure proposed two potential architectural solutions that would meet their modernization requirements while enhancing network infrastructure.

A photo of Griffin.

“Our partners in Treasury ultimately chose the second option, transitioning to a hybrid network. They have a long-term goal of moving entirely to the cloud using Azure.”

Justin Griffin, principal group network engineering manager, Microsoft Digital

The first solution involved refreshing all on-premises infrastructure and implementing robust measures to ensure continuity of services during the transition—a costly but safe bet.

The second, more ambitious solution called for a phased transition to a hybrid network with a long-term goal: go fully cloud-native using Microsoft Azure.

“Our partners in Treasury ultimately chose the second option, transitioning to a hybrid network,” says Justin Griffin, a principal group network engineering manager in Microsoft Digital, who led the team responsible for getting the project off the ground. “They have a long-term goal of moving entirely to the cloud using Azure.”

The decision was influenced by several factors, including the need to eliminate costly hardware and the desire for streamlined network management processes, including the use of Azure for seamless integration with internal and external systems.

Implementing the solution

With the second option chosen, the implementation goals were to eliminate on-premises hardware, cut costs, simplify management, and empower team members and partners to access Microsoft Treasury’s network from anywhere, securely. To this end, Azure would become the new backbone for Microsoft Treasury’s infrastructure.

The modernization effort centered around two cornerstone projects—the SWIFT Alliance SaaS migration and the migration of BlackRock’s Aladdin platform into Azure. The projects would leverage services like Azure VPN for secure remote access, Azure Firewall for enhanced protection, and Azure Virtual WAN (vWAN) for seamless global connectivity.

Modernizing the SWIFT integration

Microsoft Treasury relies on SWIFT for secure international payments. Previously, access to SWIFT required the use of on-premises hardware security modules (HSMs) for attestation and encryption.

The modernization efforts followed a phased migration path:

  • Transitioning connectivity to Azure using vWAN and Site-to-Site VPNs
  • Maintaining security by peering cloud networks with on-prem HSMs
  • Eventually replacing on-premises HSMs with SWIFT’s SaaS-based attestation solution

The result was the retirement of leased lines and aging hardware, a reduced data center footprint, and cost savings of hundreds of thousands of dollars.

Aladdin secure remote access

To enable secure remote access to Aladdin—BlackRock’s investment management platform—Microsoft Digital collaborated with BlackRock and internal finance teams to implement a cloud-native Azure solution based on the following:

  • Azure vWAN hubs with Point-to-Site VPNs for private user access
  • Palo Alto Network virtual appliances for deep traffic monitoring
  • BGP peering over IPsec for encrypted data transfers
  • Geo-redundant routing for automatic failover in case of outages

Before the migration, outages caused by link failures, power surges, and WAN disruptions were not uncommon. But with the new infrastructure in place, Treasury Services users gained secure, uninterrupted access to Aladdin from anywhere. The move to the cloud, reinforced by availability zones and built-in high availability, effectively put an end to those disruptions.
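The geo-redundant failover behavior described above reduces, at its core, to preferring a primary hub and falling back automatically when a health probe fails. The sketch below illustrates that selection logic only—the hub names are hypothetical, and in the real deployment this is handled by Azure vWAN routing and BGP, not application code:

```python
# Hypothetical hub list, in preference order.
HUBS = ["vwan-hub-westus", "vwan-hub-eastus"]

def pick_route(healthy):
    """Return the first healthy hub, mimicking automatic geo-redundant failover.

    `healthy` maps hub name -> bool (the result of a health probe).
    """
    for hub in HUBS:
        if healthy.get(hub, False):
            return hub
    raise RuntimeError("no healthy hub available")

print(pick_route({"vwan-hub-westus": True, "vwan-hub-eastus": True}))   # primary
print(pick_route({"vwan-hub-westus": False, "vwan-hub-eastus": True}))  # failover
```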

A team effort: Reducing project risks and bolstering communications

During the transition, the Microsoft Digital team, the Treasury Services team, and their financial partners all played critical roles in executing a highly coordinated and technically demanding transformation.

To maintain continuity, the Treasury Services team temporarily increased its budget to support parallel operations across both the legacy on-premises environment and the new Azure-based infrastructure.

A photo of Ramirez.

“We needed to make sure the communications were clear and acknowledged by each responsible individual to make sure no errors were made that compromised the availability of the system.”

Lionel Ramirez, senior technical program manager, Microsoft Digital Services

They also deployed new VPN clients to enable secure remote access and eventually migrated their HSMs handling critical SWIFT services to the SWIFT-hosted SaaS platform.

For financial partners, the migration meant shifting from traditional on-premises circuits to modern, cloud-based integrations with Azure. This required close collaboration across multiple internal and external teams. To support this shift, Microsoft Digital built a new Azure network infrastructure that integrated with legacy systems while laying the foundation for the fully cloud-hosted Treasury Services infrastructure.

“We needed to make sure the communications were clear and acknowledged by each responsible individual to make sure no errors were made that compromised the availability of the system,” says Lionel Ramirez, a senior technical program manager for Microsoft Digital Services.

Throughout the migration, the Microsoft Digital team ensured clear, continuous communication and required explicit acknowledgements for every critical step to minimize the risk of error and maintain service availability. All changes were carefully timed to occur after market hours and before trading activity resumed, further reducing the risk of disruption or financial penalties. The project team also adhered to stringent security and compliance requirements at every phase of the transition.

The results: Transformations that drive efficiency, security, and savings

By modernizing Microsoft Treasury Services’ network infrastructure—through migrating Aladdin to Azure and transitioning to SWIFT Alliance’s SaaS platform—the teams’ collaborative efforts achieved clear, measurable success.

These initiatives boosted operational efficiency, strengthened security, and unlocked greater flexibility, all while significantly reducing costs:

  • Substantial cost savings: Over $1 million saved by eliminating the need for new network hardware and licenses.
  • Enhanced operational continuity: Azure’s dynamic failover eliminated outages caused by power surges or link failures.
  • Remote accessibility: Employees no longer need to be physically present in the Trading Room, with secure VPN access enabling global remote work.
  • Greater scalability and agility: Treasury services can now scale in real time to meet evolving partner demands.
  • Lower partner costs: Key financial partners like BlackRock were able to terminate expensive contracts for on-premises circuits, realizing further savings.
  • Lower environmental footprint: A smaller data center footprint reduced energy consumption and maintenance overhead.

By using Azure’s powerful capabilities, Treasury Services is well-prepared to navigate the complexities of today’s financial landscape, ensuring resilience and agility in a rapidly evolving, dynamic environment.

Looking ahead

The modernization of Microsoft Treasury’s network infrastructure is a powerful example of what digital transformation can achieve. While the immediate gains—cost savings, improved reliability, and increased efficiency—were substantial, the true value lies in what this transformation made possible.

“The transition to a cloud-based network using Azure has empowered the Treasury team with the ability to scale efficiently in response to partner-related changes or enhancements, thanks to being fully hosted in the cloud.”

Justin Griffin, principal group network engineering manager, Microsoft Digital

By migrating to Azure and retiring legacy systems, the Treasury Services group, in partnership with the Microsoft Digital team, is now equipped to navigate the evolving financial landscape with greater agility, resilience, and confidence. The project not only addressed technical debt but also laid the groundwork for future innovation.

With a fully cloud-hosted treasury network, Treasury Services can more easily onboard new financial services and partners, scale operations on demand, and take full advantage of Azure’s built-in monitoring and security tools.

“The transition to a cloud-based network using Azure has empowered the Treasury team with the ability to scale efficiently in response to partner-related changes or enhancements, thanks to being fully hosted in the cloud,” Griffin says. “My team can now seamlessly adjust the Azure cloud network infrastructure to meet the Treasury team’s evolving demands and business needs.”

This success story also illustrates the impact of strategic collaboration, deliberate planning, and cutting-edge technology. It proves that even the most complex, deeply embedded financial systems—ones that move hundreds of billions of dollars—can be reinvented. What began as a high-stakes infrastructure challenge has become a model for future transformation.

Microsoft Treasury’s network infrastructure modernization isn’t just a technical achievement; it’s a blueprint for how organizations can evolve. The ultimate goal is a world where eliminating the legacy burden, embracing the cloud, and meeting high standards for speed, security, and scalability is the norm, not the exception.

Key takeaways

Here are some of our top insights from moving Microsoft Treasury Services network infrastructure to Azure:

  • Embrace cloud migration as an achievable goal: Microsoft Treasury Services, in partnership with the Microsoft Digital team, overcame a significant IT challenge by transitioning from an on-premises system to a cloud-based network using Azure.
  • Untangle complexity: Microsoft Treasury Services, in partnership with the Microsoft Digital team, used Azure to eliminate the need for on-premises hardware, significantly reducing system complexity and network maintenance requirements.
  • Create an adaptable partner ecosystem: In an environment where partners and providers increasingly operate in the cloud, the transition bolstered service continuity for critical financial functions and enabled remote access to financial services.
  • Modernization saves time and money: The modernization resulted in substantial cost savings exceeding $1 million, and annual savings of approximately 200 hours in management time.
  • Embrace migration challenges as opportunities: Microsoft Treasury Services looks forward to using Azure’s robust infrastructure to boost agility, cut costs, and fuel future innovation. Each opportunity to upgrade is a chance to innovate.

The post The $500-billion challenge: Inside the modernization of Microsoft Treasury’s backend infrastructure appeared first on Inside Track Blog.

]]>
19379
Reimagining content creation with our Azure AI-powered Inside Track story bot http://approjects.co.za/?big=insidetrack/blog/reimagining-content-creation-with-our-azure-ai-powered-inside-track-story-bot/ Thu, 20 Feb 2025 17:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=18464 In Microsoft Digital, the company’s IT organization, innovation is the fuel that drives us. Engage with our experts! Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team. To stay ahead, we are continuously exploring ways to use […]

The post Reimagining content creation with our Azure AI-powered Inside Track story bot appeared first on Inside Track Blog.

]]>
In Microsoft Digital, the company’s IT organization, innovation is the fuel that drives us.

To stay ahead, we are continuously exploring ways to use technology to drive efficiency and value. One way we’re innovating is by using generative AI to accelerate content development for our Inside Track blog.

While there will always be humans in the loop, generative AI has allowed us to embark on a journey to revolutionize how we engage with our customers through compelling stories and case studies. In fact, this very article was created with the help of artificial intelligence.

The vision and challenge

A composite photo of Jones, Pydimarri, Sengar, Boyd, and Velush.
The Microsoft Digital internal Azure AI bot team works together to constantly improve the AI-powered Inside Track story bot. The team includes (left to right) Dwight Jones, Revanth Chandra Pydimarri, Urvi Sengar, Keith Boyd, and Lukas Velush.

The vision for this project was clear: use AI to accelerate content creation, reduce turnaround times, and enable any member of our Microsoft Digital team to quickly share their story with customers. However, the path to realizing this vision was not without its challenges.

“We were venturing into uncharted territory, using AI in its infancy with a limited budget and few resources,” says Dwight D. Jones Sr., a principal product manager on the Frictionless Device team who led the initiative.

To overcome these challenges, the team created a pitch deck based on their vision, built a detailed technical specification, and worked with stakeholders to build support for the project. They formed a virtual team across the Microsoft Digital Inside Track team, the Frictionless Devices team, and the Employee Productivity Engineering team.

With the approval of Microsoft Digital leadership, they launched a small proof of concept that proved successful—the Inside Track content bot. In the pilot, the team was able to use an AI-powered interview bot to produce excellent first drafts of blog posts, saving significant time per story. 

Accelerating publication and reducing costs 

Senior Director Keith Boyd emphasized the dual challenge of telling more stories at a lower overall cost and accelerating publication.

“The key technology that’s powering the bot is OpenAI on Azure,” Boyd says. “It’s already helping us, by making it easier for subject matter experts in Microsoft Digital to share their expertise on their own schedule.” 

By using the bot, the team hopes to increase the number of stories in the pipeline and decrease the time it takes to reach publication.

“Our primary metric is time to publication,” Boyd says. “Can the bot help us move from a six-week cycle for story production to end-to-end story authoring and publication in half that time?” 

Watch this demo of our Inside Track content bot. Customers or Microsoft account team representatives from Fortune 500 companies are welcome to request a virtual engagement on this topic with experts from our Microsoft Digital team. 

Streamlining the process 

Revanth Chandra Pydimarri, a senior product manager on the Inside Track content bot virtual team, highlighted the need to streamline the drafting and publishing process.

“We were wasting resources through inefficient operational processes, and it took a lot of time for us to reach out to relevant subject matter experts, get the content from them, write the story, and eventually publish it,” Pydimarri says. 

The team turned to Microsoft Azure AI, building a bot that uses generative AI capabilities—including retrieval-augmented generation (RAG) through Azure OpenAI On Your Data—to interview subject matter experts (SMEs) at their convenience.
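The retrieval step in a RAG flow like the one described can be sketched roughly as follows. This is an illustrative toy, not the team's actual implementation: the corpus entries, the bag-of-words scoring, and the prompt wording are all hypothetical stand-ins (a production system would use Azure AI Search and embedding-based retrieval).

```python
from collections import Counter
import math

# Hypothetical stand-ins for previously published stories (not real Inside Track data).
STYLE_EXAMPLES = [
    "How we migrated our contact centers to Azure and cut latency.",
    "Reimagining patch management with Azure Update Manager.",
    "Boosting employee connectivity with a virtual WAN architecture.",
]

def _bow(text: str) -> Counter:
    """Tokenize into a bag-of-words frequency count."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus entries by similarity to the query and keep the top k."""
    q = _bow(query)
    return sorted(corpus, key=lambda doc: _cosine(q, _bow(doc)), reverse=True)[:k]

def build_prompt(sme_notes: str, corpus: list[str]) -> str:
    """Ground the draft request in the most relevant retrieved examples (the RAG step)."""
    context = "\n".join(f"- {e}" for e in retrieve(sme_notes, corpus))
    return (
        "Write a first draft in the Inside Track style.\n"
        f"Style examples:\n{context}\n"
        f"SME interview notes:\n{sme_notes}\n"
    )

print(build_prompt("We moved patch management to Azure Update Manager.", STYLE_EXAMPLES))
```

The grounding idea is the same at any scale: retrieve the most relevant prior material, then hand it to the model alongside the SME's input so the draft lands in the house style.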

“We have trained the bot in the style of Inside Track,” Pydimarri says.

The results were impressive, with a time savings of at least five hours per story, while maintaining publication standards.

Capturing input and maintaining style 

Urvi Sengar focused on training the bot to write stories with the same look and feel as human-written stories, and on using the bot to capture input from SMEs and turn it into content. The bot was trained on a comprehensive dataset of previously published stories along with writing guidance from the editorial team, enabling it to emulate the Inside Track team’s storytelling style.

“The bot’s sophisticated content summarization and curation techniques allow it to draft high-quality stories quickly, maintaining the nuanced details and tone provided by the SMEs,” says Sengar, a senior software engineer in Microsoft Digital.

Despite the challenges of transferring subject matter expertise and iteratively improving the process based on user feedback, the project has been well received by stakeholders.

“The bot is now able to generate drafts that are more than 70% complete, significantly reducing the time required to produce stories,” Sengar says.

The team is adding the ability for the bot to verbally interview SMEs by integrating VoiceRAG, which combines Azure AI Search-based retrieval with audio-to-text capabilities. This will enable the bot to transcribe spoken interviews accurately and efficiently, streamlining the content collection phase and creating richer, multi-dimensional stories through a more natural interviewing process.

Combining multiple perspectives 

Lukas Velush, the managing editor of Inside Track, tackled the challenge of using an AI-powered bot to interview multiple SMEs and combine their interviews into one coherent story.

“Our challenge was to continue telling the stories of our subject matter experts across Microsoft Digital, despite having a smaller team and a shrinking budget,” Velush says.

This downward pressure on resources was the inspiration for the bot, which could help create stories at a lower cost and at a faster pace. 

The bot is designed to emulate human writers, interviewing SMEs about their IT work, their challenges, and how they overcame them.

“One of the biggest challenges that we’re still working on is getting the bot to be able to interview multiple people for a story and then weave what each person has to say into a coherent narrative,” Velush says.

The current solution has the bot interview each person separately and then combine the resulting stories into one comprehensive narrative.
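The interview-then-merge workaround boils down to a synthesis pass over the per-SME drafts. A minimal sketch of that final prompt-assembly step (the names, draft text, and instruction wording here are hypothetical, not the team's actual prompt):

```python
def synthesis_prompt(drafts: dict[str, str]) -> str:
    """Fold per-SME drafts into one request for a combined narrative.
    Each draft comes from a separate bot interview; the model is then
    asked to weave them together while keeping quotes attributed."""
    sections = "\n\n".join(f"Draft from {name}:\n{text}" for name, text in drafts.items())
    return (
        "Combine the following single-source drafts into one coherent story, "
        "keeping each person's quotes attributed:\n\n" + sections
    )

# Illustrative placeholder drafts.
print(synthesis_prompt({
    "Ana": "We rebuilt the pipeline on Azure.",
    "Ben": "The change cut our costs in half.",
}))
```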

The initial results are promising, with bot-written stories costing 50% less than human-written stories and being completed at least 30% faster.

“This has been an amazing journey—we’ve learned a ton about AI, about humans, and about what it takes to report and write stories like this,” Velush says. “I see the bot as a great helper that can work right alongside our human writers.” 

The future of AI-driven content creation 

As Microsoft Digital continues to innovate and push boundaries, the Inside Track bot stands as a testament to the power of AI to transform the way we tell our stories. Using AI, the team has shown that it can accelerate the content creation process, enabling delivery of high-quality, engaging content faster and more efficiently than before. 

Looking ahead, the team plans to further enhance the bot’s capabilities. Jones shared that the team plans to use AI to produce multiple types of content with a single entry, such as articles, blog posts, PowerPoint decks, and white papers. They also plan to generate content in multiple languages and to support readers with accessibility needs. 

Boyd emphasized the potential for AI to augment human capabilities.

“One goal is to reassure other content teams that AI-powered bots are not here to replace them, but to augment their capabilities and make them more productive,” he says.

This is just the start.

“This is V1,” Pydimarri says. “We have plans to add more features and also expand the functionality to accommodate other teams’ requirements, too.”

Future enhancements will focus on expanding the bot’s capabilities and functionality. “That will ensure it continues to meet the evolving needs of the Inside Track team and other users,” Sengar says.

Looking to the future

Using AI to accelerate content creation in Microsoft Digital exemplifies the team’s innovative spirit and commitment to customer engagement.

“This project is just the beginning,” Jones says. “We’re excited about the potential of AI to revolutionize how we engage with our customers and look forward to seeing where this journey takes us.” 

The Inside Track bot has demonstrated the potential of AI to streamline processes, improve efficiency, and deliver high-quality, engaging stories that resonate with the audience. It’s a great example of digital transformation in action, paving the way for future innovations. 

As we continue to push the boundaries of what’s possible with AI, we invite you to join us on this journey. Explore the capabilities of the tools and technologies available through Microsoft Azure AI in your own content creation processes and see firsthand how it can enhance productivity and storytelling quality. 

Key Takeaways

We’re using AI to accelerate content generation at Microsoft, and you can too. Here are some highlights from our journey:

  • AI-powered content creation: Microsoft Digital is using generative AI to automate content creation for the Inside Track blog, significantly reducing turnaround times.
  • Successful pilot: The AI-powered chatbot proved successful in a pilot, producing excellent first drafts of blog posts and saving significant time per story.
  • Efficiency and cost reduction: The initiative aims to tell more stories at a lower overall cost and accelerate publication, with Inside Track content bot-written stories costing 50% less and being completed at least 30% faster.
  • Streamlined process: The AI bot interviews subject matter experts (SMEs) at their convenience, eliminating the need for coordinating schedules and saving at least five hours per story.
  • Maintaining quality: The bot is trained to emulate the Inside Track team’s storytelling style, capturing nuanced details and tone provided by SMEs.
  • Adding voice: We’re integrating Azure AI Speech Studio for audio-to-text capabilities, which will make getting interviewed by the Inside Track content bot a more natural experience for our SMEs.
  • Future enhancements: We’re expanding the bot’s functionality to help other Microsoft Digital teams.

The post Reimagining content creation with our Azure AI-powered Inside Track story bot appeared first on Inside Track Blog.

]]>
18464
Migrating from Microsoft Monitoring Agent to Azure Arc and Azure Update Manager at Microsoft http://approjects.co.za/?big=insidetrack/blog/migrating-from-microsoft-monitoring-agent-to-azure-arc-and-azure-update-manager-at-microsoft/ Thu, 26 Sep 2024 16:05:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=16574 As organizations grow and transform their IT infrastructures, maintaining consistency in patch management across various environments and cloud architectures has become a priority here at Microsoft and at companies elsewhere. A recent shift from Microsoft Monitoring Agent (MMA) to Microsoft Azure Arc and Microsoft Azure Update Manager (AUM) offers us and others a unified solution […]

The post Migrating from Microsoft Monitoring Agent to Azure Arc and Azure Update Manager at Microsoft appeared first on Inside Track Blog.

]]>
Microsoft Digital stories

As organizations grow and transform their IT infrastructures, maintaining consistency in patch management across various environments and cloud architectures has become a priority here at Microsoft and at companies elsewhere.

A recent shift from Microsoft Monitoring Agent (MMA) to Microsoft Azure Arc and Microsoft Azure Update Manager (AUM) offers us and others a unified solution for both on-premises and cloud resources. This transition is improving our patch orchestration while offering our IT leaders more robust control of our diverse systems internally here in Microsoft Digital, the company’s IT organization.  

Moving to Azure Arc

Granata and Arias appear together in a composite image.
Transitioning from Microsoft Monitoring Agent to Azure Arc ensures streamlined updates across diverse systems, say Cory Granata (left) and Humberto Arias. Granata is a senior site reliability engineer on the Microsoft Digital Security and Compliance team and Arias is a senior product manager in Microsoft Digital.

Shifting from MMA to AUM with Microsoft Azure Arc integration uses Azure Arc as a bridge, enabling management of both on-premises and cloud-based resources from a single control plane.

Historically, the MMA allowed for “dual homing,” where IT teams could connect machines to multiple Microsoft Azure subscriptions with ease. This flexibility streamlined patch management and reporting across different environments.

This feature is particularly useful for us and other large organizations with multiple Azure environments, says Cory Granata, a senior site reliability engineer on the Microsoft Digital Security and Compliance team. However, the newer Azure Arc-based AUM only allows machines to report into one subscription and resource group at a time.

This limitation required some coaching for teams accustomed to MMA’s dual-homing capabilities.

“It wasn’t really an issue or a challenge—just coaching and getting other teams in the mindset that this is how the product was developed,” Granata says.

Azure Arc’s streamlined approach offers an efficient path for IT teams like ours looking to centralize patch management, especially for diverse infrastructures that include cloud and on-premises assets.

Centralizing patch orchestration

One of the standout advantages of Azure Update Manager with Azure Arc is its ability to support patch orchestration across a wide range of environments.

“You have the ability to patch on-premises, off-premises, Azure IaaS, and other resources,” Granata says. “This flexibility extends beyond Azure to cover machines hosted on other platforms, and on-premises Hyper-V servers.”

For organizations with complex infrastructures like ours, this unified approach simplifies operations, reducing the need for multiple tools and platforms to handle updates. Whether managing physical servers in data centers, virtual machines across different cloud providers, or edge computing devices, Azure Arc ensures that patch management is consistent and reliable.

These changes have been very helpful internally here at Microsoft.

“The AUM is our one-stop solution for patching all these different inventories of devices, regardless of where they reside—on-premises, in the cloud, or in hybrid environments,” says Humberto Arias, a senior product manager in Microsoft Digital.

This multi-cloud and edge computing capability offers IT leaders here and elsewhere the flexibility to scale their patch management efforts without being tied to a specific platform.

Migration challenges

While the transition to Azure Arc and AUM has brought us significant benefits, there have been some challenges, particularly around managing expectations for dual-homing capabilities.

The key thing we had to work through was that Azure Arc could only connect to one Azure subscription and resource group at a time. This required additional training for us—we needed to shift our mindset and adopt new workflows. However, after our people understood this limitation, the migration process was smooth.

“Fortunately, it only phones into one subscription and one resource group,” Granata says. “So, wherever it phones in is where all of your patch orchestration logs and everything must go as well, and it can’t connect into another subscription. This centralized approach simplifies reporting and patch management, but it did require some initial adjustments for teams accustomed to multi-subscription environments.”
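The single-homing constraint Granata describes can be modeled as a simple invariant: each machine has exactly one Arc target. This toy sketch (class names and error text are our own illustration, not an Azure API) shows the rule that teams used to MMA's dual homing had to internalize:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArcTarget:
    """The one place an Arc-enabled machine reports into."""
    subscription_id: str
    resource_group: str

class ArcRegistry:
    """Toy model of the Azure Arc constraint: one subscription and one
    resource group per machine, so no MMA-style dual homing."""

    def __init__(self) -> None:
        self._homes: dict[str, ArcTarget] = {}

    def connect(self, machine: str, target: ArcTarget) -> None:
        current = self._homes.get(machine)
        if current is not None and current != target:
            raise ValueError(
                f"{machine} is already connected to {current}; "
                "Azure Arc does not allow dual homing."
            )
        self._homes[machine] = target

registry = ArcRegistry()
registry.connect("web01", ArcTarget("sub-prod", "rg-patching"))  # first home: allowed
try:
    registry.connect("web01", ArcTarget("sub-lab", "rg-lab"))    # second home: rejected
except ValueError as e:
    print(e)
```

All patch-orchestration logs and reporting then flow to wherever that single home is, which is what makes the centralized dashboard possible.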

Through coaching and training, our teams were able to adapt, and the long-term benefits of a more streamlined system quickly became apparent.

Azure Arc and AUM benefits

Following our migration, our teams began to realize the true benefits of using Azure Arc and AUM for their patch orchestration needs.

“The neat thing about using AUM with patch management and patch orchestration is the centralized control it provides,” Granata says.

For IT teams managing both internal IT assets and lab environments, the ability to oversee patching across a diverse range of systems from one location was a game-changer.

Additionally, the new system provided enhanced reporting and visibility.

While MMA offered flexibility in terms of connecting to multiple subscriptions, Azure Arc’s centralized model makes it easier to manage logs, reports, and patch statuses from a single dashboard.

“We’ve really enjoyed the increased visibility and ease of use that this has given us,” Arias says. “This is particularly valuable for large organizations like ours with distributed environments, where maintaining visibility across multiple systems can be a challenge.”

The integration with Azure Arc also extends the platform’s reach to non-Azure environments, including AWS and other cloud providers. This means that organizations running multi-cloud or hybrid cloud strategies can benefit from a unified patch management system, regardless of where their machines are hosted.

For IT leaders here and elsewhere, these improvements represent a significant step forward in our operational efficiency and security. By centralizing patch management under Azure Arc and AUM, we can ensure that our systems are up-to-date, secure, and compliant, without the need for multiple tools or platforms. We hope sharing our story helps you do the same at your company.

Key Takeaways

Here are some tips for getting started at your company:

  • Azure Arc allows for a centralized management approach, providing IT leaders with a comprehensive view of their infrastructure.
  • Azure Update Manager offers improved patch orchestration and update management, leveraging the latest Azure technologies.
  • While the transition to Azure Arc brings numerous benefits, it also necessitates adjustments, particularly for teams accustomed to dual homing with the Microsoft Monitoring Agent.
  • With some light coaching, teams can easily learn the new system’s capabilities and limitations.

Try it out

Discover more about Azure Arc from the Microsoft Azure product group, including About Azure Arc, Azure Arc for servers, and the Azure Cloud Adoption Framework.

The post Migrating from Microsoft Monitoring Agent to Azure Arc and Azure Update Manager at Microsoft appeared first on Inside Track Blog.

]]>
16574
Running our customer service and support contact centers on Microsoft Azure http://approjects.co.za/?big=insidetrack/blog/running-our-customer-service-and-support-contact-centers-on-microsoft-azure/ Thu, 11 Jul 2024 16:00:00 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=15298 Providing exemplary support is critical to how we empower our customers to achieve more with Microsoft technologies and services. We in Microsoft Digital, the company’s IT organization, recently migrated our global Microsoft customer service network to Microsoft Azure, creating a cloud network-based solution to connect our customers to the support services they need at Microsoft. […]

The post Running our customer service and support contact centers on Microsoft Azure appeared first on Inside Track Blog.

]]>
Providing exemplary support is critical to how we empower our customers to achieve more with Microsoft technologies and services.

Microsoft Digital technical stories

We in Microsoft Digital, the company’s IT organization, recently migrated our global Microsoft customer service network to Microsoft Azure, creating a cloud network-based solution to connect our customers to the support services they need at Microsoft. With the new solution, our customers and customer service team members are connected faster, more reliably, and with improved network performance while maintaining secure and compliant connections.

Building a better customer support network

Scheffler and McNeill are shown in a composite photo.
Eric Scheffler and Elaine McNeill are part of the team at Microsoft that has moved our contact center network infrastructure to Microsoft Azure.

Our Support Experience Group (SxG) within our Cloud + AI division is driving transformation for Microsoft Support solutions, building on Microsoft solutions and infusing cutting-edge innovation to improve customer and agent experiences across all our businesses. Our SxG team provides platforms and services to almost 80,000 Microsoft support advocates, including technical teams and customer support advocates from our network of global contact centers.

Our customer support advocates and partners are integral to maintaining high-quality customer service and support for Microsoft products and solutions. Microsoft customer services handle almost 200,000 support calls daily in 37 different languages worldwide. It’s a diverse and fast-paced environment where connecting support staff to the customer and the Microsoft services they support can be complex.

Our previous global network backbone served us for years through the deployment of key regional central hub sites. Hub sites were connected by physical point-to-point Multiprotocol Label Switching (MPLS) circuits deployed strategically to various sites globally. The MPLS network design is complex, costly, and inflexible.

By redesigning our network with Microsoft Azure Cloud Network solutions at the center, we’re addressing several challenges associated with traditional MPLS networks, such as:

  • Cost and complexity: MPLS networks are often expensive and complex to deploy. 
  • Inflexibility: MPLS is designed for stable, point-to-point connections and can be too rigid for the dynamic and distributed nature of modern cloud computing. It struggles to efficiently handle the traffic patterns created by enterprises running workloads across multiple clouds.
  • Deployment speed: Setting up or modifying MPLS connections can take weeks or even months, which is not conducive to the agility required by businesses today. Cloud networks can be deployed and scaled much more rapidly.
  • Security and encryption: Traditional MPLS doesn’t offer encryption, which is increasingly important as operations move toward the cloud. A cloud network can provide consistent protection regardless of how users connect.

At the core of our transformation is a newly designed global, cloud-based network called the SxG Cloud Network, built on Azure Virtual WAN services specifically for Microsoft customer services. The SxG Cloud Network directly connects advocates at Microsoft contact centers, remote advocates, and internal support teams to the required services.

The SxG Cloud Network provides a highly reliable and high-performing network path into Azure, where support team members can access the tools and environments required to support our customers fully. Within the network, our customer service teams are connected to Azure Virtual Desktops that supply the tools and connectivity they need for troubleshooting, enabling them to connect with Microsoft customers worldwide through virtual private network (VPN) and Azure Virtual Network (VNet) peering.

The SxG Cloud Network resides on the Microsoft Azure tenant and consists of several virtual WAN hubs in key Azure regions across the globe. These hubs use Microsoft Azure Firewall to secure traffic flows within the cloud network through URL filtering, TLS inspection, and intrusion detection and prevention.

The Azure-based hubs provide a single access point that simplifies connectivity and creates a unified and consistent environment for all support advocates. We provide several connectivity methods for our Microsoft customer support advocates irrespective of location, including:

  • Point-to-site (P2S) VPN: This provides connectivity for the remote user working from home.
  • Site-to-site (S2S) VPN: We use S2S VPN to connect Microsoft contact centers using an S2S encrypted tunnel between the partner VPN concentrator and the SxG Cloud Network gateway.
  • VNet peering: We also support peering between a partner Azure tenant and the SxG Cloud Network Azure tenant. VNets on both tenants are directly peered and secured by Azure Firewall.

Point-to-site VPN

Remote Microsoft customer support advocates use Azure P2S VPN to connect directly to Microsoft services in Azure. We maintain several VPN hubs across global Azure regions to ensure that advocates experience the most direct network path to Azure. We use Azure networking components within Azure to connect to the required internal Azure resources.

To ensure that only necessary traffic goes through the VPN, VPN profiles are configured with split-tunnel routing that sends Microsoft-specific traffic to Azure and the rest to the partner network or the public internet. This ensures that users can access local websites in the correct locale and languages they need, while also enabling low-latency access to the Microsoft corporate edge network.

The Azure VPN client facilitates connectivity between the local device and the Azure Virtual WAN gateway hosted in the SxG network. We use a single VPN profile configured with split tunneling for all VPN users. This is made possible by a key feature of Azure Virtual WAN that automatically connects P2S users directly to the closest region. Authentication is required to access the VPN and users authenticate using their Microsoft credentials through Entra ID and multi-factor authentication.
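The split-tunnel decision described above amounts to a prefix match on the destination address: tunneled prefixes go into the VPN, everything else exits locally. A minimal sketch, with hypothetical prefixes standing in for the real routes a production profile would carry:

```python
import ipaddress

# Hypothetical prefixes for Microsoft-bound traffic; a real profile would use
# the published service routes, not this toy list.
TUNNELED_PREFIXES = [
    ipaddress.ip_network("10.0.0.0/8"),      # internal Azure resources
    ipaddress.ip_network("203.0.113.0/24"),  # documentation range standing in for a service prefix
]

def next_hop(destination: str) -> str:
    """Split-tunnel decision: Microsoft-bound traffic enters the P2S VPN,
    everything else goes out the local internet breakout."""
    addr = ipaddress.ip_address(destination)
    if any(addr in net for net in TUNNELED_PREFIXES):
        return "vpn-tunnel"
    return "local-internet"

print(next_hop("10.20.30.40"))    # an internal address: tunneled
print(next_hop("142.250.1.100"))  # a public site: local breakout
```

Keeping the tunneled set small is the design choice that preserves both locale-correct browsing and low-latency access to the corporate edge.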

Site-to-site VPN

S2S VPN connections provide a secure encrypted VPN connection over the public internet to connect our contact centers to Microsoft customer support services in Azure. The contact center partner manages their network and the configuration of the device on their network, which establishes a VPN tunnel to the Azure Virtual WAN gateway hosted in the SxG Cloud Network.

VNet peering

When partners already have an Azure presence, Microsoft can connect the partner Azure network to the virtual WAN using Azure VNet peering. Traffic between the peered VNets doesn’t leave the global Azure backbone network. We use SxG VNet peering to connect VNets in the Microsoft tenant with VNets in the partner’s Azure tenant. VNet peering establishes a high-performance, trusted connection using Azure Firewall in the SxG Cloud Network to provide flow control and traffic protection.
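The three connectivity options above lend themselves to a simple decision rule: individuals get P2S, partners already on Azure get VNet peering, and everyone else gets S2S. This helper is an illustration of how we think about the choice, not a tool we ship:

```python
def connectivity_method(remote_user: bool, has_azure_tenant: bool) -> str:
    """Pick a connection type per the options described above (illustrative only)."""
    if remote_user:
        return "point-to-site VPN"   # individual advocate working from home
    if has_azure_tenant:
        return "VNet peering"        # partner already on Azure: traffic stays on the backbone
    return "site-to-site VPN"        # contact center with its own VPN concentrator

print(connectivity_method(remote_user=False, has_azure_tenant=True))
```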

SxG Cloud Network infrastructure

Graphic showing an architecture diagram of the SxG Cloud Network.
An architecture diagram of the SxG Cloud Network.

Managing connectivity for voice services

Our advocates often support our customers with voice calls, and supporting an effective and efficient voice service is integral to the SxG Cloud Network.

We use Azure ExpressRoute connections to create a direct private network path from all our Azure Virtual WAN gateways to our voice services platform environment using an MPLS backbone. These global connections to our voice services hosted in Azure enable advocates connected to the SxG Cloud Network via P2S, S2S, or VNet peering to use our voice services. The Interhub feature in Azure Virtual WAN also provides seamless connectivity between hubs, ensuring that user network traffic takes the best path with minimal latency while traversing the Microsoft backbone network.

Voice services for Microsoft customer service advocates have now migrated to Azure Communication Services, which is connected to the SxG Cloud Network with ExpressRoute, keeping traffic on the reliable Azure backbone network.

The SxG Cloud Network has modernized how we connect to voice and data services hosted in Azure and can provide advocates access without needing to deploy physical circuits to contact center locations, saving time and money. It also creates a unified network environment, simplifying access points and functionality for our advocates.

With the flexibility and scalability of the SxG Cloud Network, we can manage our bandwidth needs better and have fewer physical circuits that are oversized for the traffic volume. This alone is reducing network costs by more than 60% in specific cases. While exact figures for cost savings and performance improvements can vary depending on the specific circumstances of a deployment, businesses often report significant reductions in total cost of ownership (TCO) and enhancements in network performance when migrating from MPLS to Azure cloud-based solutions.

Looking forward

As we look to the immediate future of the SxG Cloud Network, we’re excited about increasing Azure Communication Services traffic on our network for voice support, further unifying our services and leading to more significant cost savings and efficiency. We’ll continue searching for ways to improve the SxG Cloud Network, including moving the network edge closer to our users with new global virtual WAN hubs. This helps us deliver more effective and easy-to-use support services for Microsoft customers and the advocates who support them.

Key Takeaways

We’re benefiting from the SxG Cloud Network in several areas, including:

  • Experience enhanced support: Connect faster and more reliably to support services thanks to our migration to the Azure-based SxG Cloud Network, ensuring high-quality assistance whenever Microsoft customers need it.
  • Global reach, local service: The SxG Cloud Network spans countries and languages, providing a seamless support experience through a diverse team of professionals ready to assist customers.
  • Secure and simplified connectivity: Azure Virtual WAN offers various connection options, including VPN and VNet, to ensure a secure, direct connection to support resources.
  • Future-ready voice services: Azure Communication Services is creating a more integrated and cost-effective voice support system, enhancing the support experience while maintaining the highest network reliability standards.

The post Running our customer service and support contact centers on Microsoft Azure appeared first on Inside Track Blog.

]]>
15298
Boosting employee connectivity with Microsoft Azure-based VWAN architecture http://approjects.co.za/?big=insidetrack/blog/boosting-employee-connectivity-with-microsoft-azure-based-vwan-architecture/ Fri, 27 Oct 2023 00:01:03 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=12238 Editor’s note: This is the fourth in our ongoing series on moving our network to the cloud internally at Microsoft. Whether our employees are in neighboring cities or different continents, they need to communicate and collaborate efficiently with each other. We designed our Microsoft Azure-based virtual wide-area network (VWAN) architecture to provide high-performance networking across […]

The post Boosting employee connectivity with Microsoft Azure-based VWAN architecture appeared first on Inside Track Blog.

]]>
Microsoft Digital stories

Editor’s note: This is the fourth in our ongoing series on moving our network to the cloud internally at Microsoft.

Whether our employees are in neighboring cities or different continents, they need to communicate and collaborate efficiently with each other. We designed our Microsoft Azure-based virtual wide-area network (VWAN) architecture to provide high-performance networking across our global presence, enabling reliable and security-focused connectivity for all Microsoft employees, wherever they are.

We’re using Azure to strategically position enterprise services such as the campus internet edge in closer proximity to end users and improve network performance. These performance improvements are streamlining our site connectivity worldwide and improving the user experience, increasing user satisfaction and operational efficiency.

We’ve recently piloted this VWAN architecture with our Microsoft Johannesburg office. Our users in Johannesburg were experiencing latency issues and sub-optimal network performance because outbound internet connections were routed through London and Dublin in Europe. In other words, employee traffic had to travel to another continent before it could reach the internet.

To simplify the network path for outgoing internet traffic and reduce latency, we migrated outbound traffic for two network segments in Johannesburg to the Azure Edge using a VWAN connected through Azure ExpressRoute circuits.

The solution relocates the internet edge for Johannesburg to the South Africa North region datacenter in South Africa, using Azure Firewall, Azure ExpressRoute, Azure Connection Monitor, and Azure VWAN. We’ve also evolved our DNS resolution strategy to a hybrid solution that hosts DNS services in Azure, which increases our scalability and resiliency on DNS resolution services for Johannesburg users. We’ve deployed the entire solution adhering to our infrastructure as code strategy, creating a flexible network infrastructure that can adapt and scale to evolving demands on the VWAN.

We’re using Azure Network Watcher connection monitor and Broadcom AppNeta to monitor the entire solution end-to-end. These tools will be critical in evaluating the VWAN’s performance, enabling data-driven decisions for optimizing network performance.

The accompanying high-level diagram outlines our updated network flows. We can support distinct user groups by isolating the guest virtual route forwarding zone (red lines) and the internet virtual route forwarding zone (black lines). This design underscores our commitment to robust outbound traffic control, ensuring a secure and optimized network environment.

Traffic from the Johannesburg office is routed to the internet through the Azure-based VWAN.
Creating efficient and isolated traffic routing to the internet with Azure-based VWAN architecture.

Beth Garrison smiles at a desk with a laptop computer.
Beth Garrison is a cloud software engineer and part of the team that is helping build and maintain Microsoft Digital’s network using infrastructure as code.

We strongly believe our VWAN-based architecture represents the future of global connectivity. The agility, scalability, and resiliency of VWAN infrastructure enables increased collaboration, productivity, and efficiency across our regional offices.

Our pilot in Johannesburg proved that improvements in network performance directly affected user experience. By relocating the network edge to the South Africa region in Azure instead of our datacenter edge in London/Dublin, latency for connections from Johannesburg to other public endpoints in South Africa has dropped from 170 milliseconds to 1.3 milliseconds.
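Picking the egress region closest to the user is the heart of this change. A toy version of that selection, using the pilot's before-and-after figures (the region labels are our own shorthand):

```python
def best_egress(probes: dict[str, float]) -> tuple[str, float]:
    """Choose the egress region with the lowest measured round-trip time."""
    region = min(probes, key=probes.get)
    return region, probes[region]

# RTT measurements from the Johannesburg pilot (milliseconds).
measured = {
    "Europe (London/Dublin) edge": 170.0,  # pre-migration path
    "South Africa North edge": 1.3,        # post-migration path
}

region, rtt = best_egress(measured)
improvement = 100 * (1 - rtt / max(measured.values()))
print(f"{region}: {rtt} ms ({improvement:.1f}% lower latency)")
```

In production this decision is driven by continuous measurements from tools such as Azure Network Watcher connection monitor rather than a static table.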

Latency for other network paths has also improved, but by lesser amounts depending on the specific destination. The improvements were always greater the closer the destination was to Johannesburg, including connectivity paths to the United States and Europe, demonstrating stability and reliability in these critical connections. Significant benefits of the VWAN solution include:

  • Increased scalability and flexibility. Our architecture is built to scale with our business needs. Whether we have a handful of regional buildings or a continent, the VWAN solution can accommodate any dynamic growth pattern. As our service offering expands, we can easily add new locations and integrate them seamlessly into the VWAN infrastructure.
  • Greater network resilience. Continuous connectivity is essential to effective productivity and collaboration. Our architecture incorporates redundancy and failover mechanisms to ensure network resilience. In case of a network disruption or hardware failure, the VWAN solution automatically reroutes traffic to alternative paths, minimizing downtime and maintaining uninterrupted communication.
  • Improved security and compliance. Protecting our data and ensuring compliance is our top priority. Our VWAN-based architecture is secure by design, incorporating industry-leading security measures including encryption, network segmentation, and access controls. We adhere to the highest security standards, helping Microsoft safeguard sensitive information in transit and meet compliance requirements.
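
To make the architecture concrete, here's a minimal sketch of what declaring a virtual WAN and a regional hub looks like in Bicep, the Azure infrastructure-as-code language our team uses. All names, address ranges, and API versions here are illustrative, not our production configuration:

```bicep
// Illustrative only: names, address ranges, and API versions are
// examples, not our production configuration.
param location string = 'southafricanorth'

// The global virtual WAN resource that all regional hubs attach to.
resource wan 'Microsoft.Network/virtualWans@2023-04-01' = {
  name: 'vwan-global'
  location: location
  properties: {
    type: 'Standard'
    allowBranchToBranchTraffic: true
  }
}

// Placing a hub in-region moves the network edge close to users
// (for example, Johannesburg) instead of backhauling traffic to a
// distant datacenter edge.
resource hub 'Microsoft.Network/virtualHubs@2023-04-01' = {
  name: 'hub-johannesburg'
  location: location
  properties: {
    addressPrefix: '10.10.0.0/23'
    virtualWan: {
      id: wan.id
    }
  }
}
```

Because hubs anywhere in the world attach to the same virtual WAN resource, expanding into a new region is largely a matter of deploying another hub like this one.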

We’re currently planning our VWAN-based architecture to span multiple global regions, offering extensive coverage and enabling our employees to connect to their regional and global services through the Azure network backbone as we continue prioritizing network performance to deliver exceptional connectivity for voice, data, and other critical applications.

We’re working to build improvements into the architecture for more optimized routing, improved Quality of Service (QoS) mechanisms, and advanced traffic management techniques to minimize latency, packet loss, and jitter, ensuring robust and low-latency connections to facilitate seamless communication regardless of where our employees are located.

Contact us today to explore how our cutting-edge VWAN-based architecture can transform your organization's networking capabilities and revolutionize how your employees connect and communicate globally. Email us, include a link to this story, and we'll get back to you with more information.

Key Takeaways

  • Assess your organization’s current network performance and needs to understand the challenges remote employees and satellite offices face regarding latency and connectivity.
  • Incorporate Microsoft Azure for improved scalability, flexibility, and resilience so you can strategically position cloud services near end users, improving latency and overall user experience.
  • Adopt an infrastructure-as-code approach to deploy flexible virtual network infrastructures. This streamlines the deployment process and ensures adaptability to ever-changing network demands.
  • Invest in monitoring tools to gain valuable insights into the VWAN’s performance, which will help you make data-driven decisions for optimization.
  • Adopt a VWAN-based architecture that emphasizes security measures such as encryption, network segmentation, and strict access controls. Ensure that the architecture adheres to the highest security standards, safeguarding sensitive information and meeting compliance requirements.
  • Keep updated on advancements in network routing, Quality of Service mechanisms, and traffic management techniques. This will help you minimize latency and ensure robust, low-latency connections, enhancing global communication for your employees.

Try it out
Get started at your company by learning how to deploy Azure VWAN with routing intent and routing policies.

Related links

We'd like to hear from you!
Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.

The post Boosting employee connectivity with Microsoft Azure-based VWAN architecture appeared first on Inside Track Blog.

How we’re deploying our VWAN infrastructure using infrastructure as code and CI/CD http://approjects.co.za/?big=insidetrack/blog/how-were-deploying-our-vwan-infrastructure-using-infrastructure-as-code-and-ci-cd/ Fri, 22 Sep 2023 20:48:18 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=12202 Editor’s note: This is the first in an ongoing series on moving our network to the cloud internally at Microsoft. We’re building a more agile, resilient, and stable virtual wide-area network (VWAN) to create a better experience for our employees to connect and collaborate globally. By implementing a continuous integration/continuous deployment (CI/CD) approach to building […]

The post How we’re deploying our VWAN infrastructure using infrastructure as code and CI/CD appeared first on Inside Track Blog.

Editor’s note: This is the first in an ongoing series on moving our network to the cloud internally at Microsoft.

We’re building a more agile, resilient, and stable virtual wide-area network (VWAN) to create a better experience for our employees to connect and collaborate globally. By implementing a continuous integration/continuous deployment (CI/CD) approach to building our VWAN-based network infrastructure, we can automate the deployment and configuration processes to ensure rapid and reliable delivery of network changes. Here’s how we’re making that happen internally at Microsoft.

Infrastructure as code (IaC)

Jimenez and Scheffler smile in corporate photos that have been merged into a composite image.
Juan Jimenez (left) and Eric Scheffler are part of the team in Microsoft Digital that is helping the company move its network to the cloud. Jimenez is a principal cloud network engineer and Scheffler is a senior cloud network engineer.

Infrastructure as code (IaC) is the fundamental principle underlying our entire VWAN infrastructure. Using IaC, we can develop and implement a descriptive model that defines and deploys VWAN components and determines how the components work together. IaC allows us to create and manage a massive network infrastructure with reusable, flexible, and rapid code deployments.

We created deployment templates and resource modules using the Bicep language in our implementation. These templates and modules describe the desired state of our VWAN infrastructure in a declarative manner. Bicep is a domain-specific language (DSL) that uses declarative syntax to deploy Microsoft Azure resources.

We maintain a primary Bicep template that calls separate modules—also maintained in Bicep templates—to create the desired resources for the deployment in alignment with Microsoft best practices. We use this modular approach to apply different deployment patterns to accommodate changes or new requirements.

With IaC, changes and redeployments are as quick as modifying templates and calling the associated modules. Additionally, parameters for each unique deployment are maintained in separate files from the templates so that different iterations of the same deployment pattern can be deployed without changing the source Bicep code.
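
As a sketch of this modular approach, a primary template can call a versioned module published to a container registry. The registry path, module name, and parameters here are hypothetical, not our actual deployment code:

```bicep
// main.bicep -- illustrative primary template. The registry path,
// module name, and parameters are hypothetical.
param location string
param hubName string
param hubAddressPrefix string

// Each module encapsulates one deployment pattern and is published to
// a container registry, so the same pattern can be reused across
// deployments without copying code.
module vwanHub 'br:contoso.azurecr.io/bicep/modules/vwan-hub:v1' = {
  name: 'deploy-${hubName}'
  params: {
    location: location
    hubName: hubName
    addressPrefix: hubAddressPrefix
  }
}
```

Because the template only declares desired state, rerunning a deployment with an updated module version or new parameter values converges the environment rather than rebuilding it.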

Version control

We use Git-based source control in Microsoft Azure DevOps to track and manage our IaC templates, modules, and associated parameter files. With Azure DevOps, we can maintain a history of changes, collaborate within teams, and easily roll back to previous versions if necessary.

We’re also using pull requests to help track change ownership. Azure DevOps tracks changes and associates them with the engineer who made the change. Azure DevOps is a considerable help with several other version control tasks, such as requiring peer reviews and approvals before code is committed to the main branch. Our code artifacts are published to (and consumed from) a Microsoft Azure Container Registry that allows role-based access control of modules. This enables version control throughout the module lifecycle, and it’s easy to share Azure Container Registry artifacts across multiple teams for collaboration.

Automated testing

Responsible deployment is essential with IaC, where deploying a set of templates can radically alter critical network infrastructure. We've implemented safeguards and tests to validate the correctness and functionality of our code before deployment. These tests include executing the Bicep linter as part of the Azure DevOps deployment pipeline to ensure that all Bicep best practices are being followed and to catch potential issues that could cause a deployment to fail.

We’re also running a test deployment to preview the proposed resource changes before the final deployment. As the process matures, we plan to integrate more testing, including network connectivity tests, security checks, performance benchmarks, and enterprise IP address management (IPAM) integration.

Configuration management

Azure DevOps and Bicep allow us to automate the configuration and provisioning of network objects and services within our VWAN infrastructure. These tools make it easy to define and enforce desired configurations and deployment patterns to ensure consistency across different network environments. Using separate parameter files, we can rapidly deploy new environments in minutes rather than hours without changing the deployment templates or signing in to the Microsoft Azure Portal.
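
For example, a per-environment parameter file of the kind described above might look like this; the file name, template name, and values are hypothetical. Deploying a new environment is then just a matter of pointing the pipeline at a different file:

```bicep
// prod-johannesburg.bicepparam -- illustrative parameter file.
// One file per environment; the deployment template itself never changes.
using 'main.bicep'

param location = 'southafricanorth'
param hubName = 'hub-johannesburg'
param hubAddressPrefix = '10.10.0.0/23'
```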

Continuous deployment

When the infrastructure code passes all validation and tests, the continuous integration (CI) pipeline automatically triggers the deployment process for our VWAN infrastructure. Deployment might involve deploying virtual machines, building and configuring cloud network objects, setting up VPN connections, or establishing network policies.

Monitoring and observability

We’ve implemented robust monitoring and observability practices for how we deploy and manage our VWAN deployment. Monitoring and observability are helping us to ensure that our CI builds are successful, detect issues promptly, and maintain the health of our development process. Here’s how we’re building monitoring and observability in our Azure DevOps CI pipeline:

  • We’re creating built-in dashboards and reports that visualize pipeline status and metrics such as build success rates, durations, and failure details.
  • We’re generating and storing logs and artifacts during builds.
  • We’ve enabled real-time notifications to help us monitor build status for failures and critical events.
  • We’re building-in pipeline monitoring review processes to identify areas for improvement including optimizing build times, reducing failures, and enhancing the stability of our pipeline.

We’re continuing to iterate and optimize our monitoring practices. We’ve created a feedback loop to review the results of our monitoring. This feedback provides the information we need to adjust build scripts, optimize dependencies, automate certain tasks, and further enhance our pipeline.

By implementing comprehensive monitoring and observability practices in our Azure DevOps CI pipeline, we can maintain a healthy development process, catch issues early, and continuously improve the quality of our code and builds.

Rollback and rollforward

We’ve built the ability to rollback or rollforward changes in case of any issues or unexpected outcomes. This is achieved through infrastructure snapshots, version-controlled configuration files, or using features provided by our IaC tool.

Improving through iteration

We’re continuously improving our VWAN infrastructure using information from monitoring data and user experience feedback. We’re also continually assessing new requirements, newly added Azure features, and operational insights. We iterate on our infrastructure code and configuration to enhance security, performance, and reliability.

By following these steps and using CI/CD practices, we can build, test, and deploy our VWAN network infrastructure in a controlled and automated manner, creating a better employee experience by ensuring faster delivery, increased stability, and more effortless scalability.

Key Takeaways
Here are some tips on how you can start tackling some of the same challenges at your company:

  • You can use Infrastructure as code (IaC) to create and manage a massive network infrastructure with reusable, flexible, and rapid code deployments.
  • Using IaC, you can make changes and redeployments quickly by modifying templates and calling the associated modules.
  • Don’t overlook version control. Tracking and managing IaC templates, modules, and associated parameter files is essential.
  • Perform automated testing. It’s necessary to validate the correctness and functionality of the code before deployment.
  • Use configuration management tools to simplify defining and enforcing desired configurations and deployment patterns. This ensures consistency across different network environments.
  • Implement continuous deployment to automate the deployment process for network infrastructure after the code passes all validation and tests.
  • Use monitoring and observability best practices to help identify issues, track performance, troubleshoot problems, and ensure the health and availability of the network infrastructure.
  • Building rollback and roll-forward capabilities enables you to quickly respond to issues or unexpected outcomes.

Try it out
Try using a Bicep template to manage your Microsoft Azure resources.

Related links

We'd like to hear from you!
Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.

Doing more with less: Optimizing shadow IT through Microsoft Azure best practices http://approjects.co.za/?big=insidetrack/blog/doing-more-with-less-optimizing-shadow-it-through-microsoft-azure-best-practices/ Wed, 07 Jun 2023 13:43:59 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=11267 You don’t know what you don’t know. In the world of IT, illuminating those hidden areas helps stave off nasty surprises. When elements of IT infrastructure are shrouded in mystery, it can lead to security vulnerabilities, non-compliance, and poor budget management. That’s the trouble with shadow IT—a term for any technical infrastructure that conventional IT […]

The post Doing more with less: Optimizing shadow IT through Microsoft Azure best practices appeared first on Inside Track Blog.

You don’t know what you don’t know. In the world of IT, illuminating those hidden areas helps stave off nasty surprises.

When elements of IT infrastructure are shrouded in mystery, it can lead to security vulnerabilities, non-compliance, and poor budget management. That’s the trouble with shadow IT—a term for any technical infrastructure that conventional IT teams and engineers don’t govern.

At Microsoft, we’re on a journey to increase our shadow IT maturity, resulting in fewer vulnerabilities and increased efficiencies. To get there, we’re leveraging tools and techniques we’ve developed through our core discipline of Microsoft Azure optimization.

[See how we’re doing more with less internally at Microsoft with Microsoft Azure. Learn how we’re transforming our internal Microsoft Azure spend forecasting.]

The challenges of shadow IT

Shadow IT is the set of applications, services, and infrastructure that teams develop and manage outside of defined company standards.

It typically crops up when engineering teams are unable to support their non-engineering partners. That situation may arise from a lack of available engineering capacity or the need for specialized domain solutions. On top of those circumstances, modern tools enable citizen developers to stand up low-code/no-code solutions that enable businesses to reduce their dependency on traditional engineering organizations.

Six corporate function teams have been involved in creating shadow IT environments: business development, legal, finance, human resources, and our consumer and commercial marketing and sales organizations.

Many of the solutions they’ve developed make strong business sense—as long as they’re secure and efficient. That’s where our Microsoft Digital (MSD) team comes in.

Three years ago, our biggest driver was getting visibility into the shadow IT estate and finding ways to secure it. Now we’re at a point where we’re looking for cost savings—that’s a natural progression.

—Myron Wan, principal product manager, Infrastructure and Engineering Services team

Over the last few years, our IT experts have been working with the shadow IT divisions to increase the maturity of the solutions they’ve developed, taking them from unsanctioned toolsets lurking in the shadows to well-governed, compliant, and secure assets they can safely use to advance our business goals.

The shadow IT journey leading from “unsanctioned” through “fundamentals,” “emerging,” “advanced,” and “optimized.”
Our journey toward shadow IT maturity has been steadily progressing through unsanctioned usage, building fundamentals, then emerging, advanced, and optimized maturity.

Now that these shadow IT solutions are more secure and compliant, we’ve turned our attention to efficiency and optimization to ensure we’re able to do as much as possible with the least necessary budget expenditure.

“Three years ago, our biggest driver was getting visibility into the shadow IT estate and finding ways to secure it,” says Myron Wan, principal product manager within the Infrastructure and Engineering Services (IES) team. “Now we’re at a point where we’re looking for cost savings—that’s a natural progression.”

Because many of our shadow IT solutions leverage Microsoft Azure subscriptions, that was a natural place to start the optimization work.

Azure best practices, shadow IT efficiency

Fortunately, we have robust discipline around optimizing Microsoft Azure spend in conventional IT and engineering settings. Microsoft Azure Advisor, available through the Microsoft Azure Portal, has been providing optimization recommendations and identifying overspend for subscribers both within Microsoft and in our customers’ organizations for years.

The plan was to take applicable recommendations that we use in our core engineering organizations and distribute them to the shadow IT divisions.

—Trey Morgan, principal product manager, MSD FinOps

Morgan poses for picture standing in front of a wall outside.
Trey Morgan is part of a cross-disciplinary technical and FinOps team helping optimize shadow IT at Microsoft.

Internally, we’ve added layers that help streamline the optimization process. One, called CloudFit, draws from a library of optimization recommendations, which are tailored to the specific needs of the teams we support. Then we use Service 360, our internal notification center that flags actions in need of addressing for our engineering teams, to route those recommendations to subscription owners within MSD, product groups, and business groups.

Optimization tickets then enter their queue and progress through open, active, and resolved statuses. It’s a standard method for creating and prioritizing engineering tasks, and Microsoft customers could accomplish a similar result by building a bridge between Microsoft Azure Advisor and their own ticketing tool, whether that’s Jira, ServiceNow, or others.

“We have an existing set of cost optimization recommendations that we use for a variety of different technologies like Azure Cosmos DB and SQL,” says Trey Morgan, principal product manager for MSD FinOps. “The plan was to take applicable recommendations that we use in our core engineering organizations and distribute them to the shadow IT divisions.”

Getting there was a matter of establishing visibility and building culture.

Shining a light on shadow IT spend

Many of the optimization issues within shadow IT divisions arose because non-engineers and non-developers were unfamiliar with, or untrained on, subscription-based software. They might not have the background or expertise to set up their subscriptions correctly, or to ensure that subscriptions terminate after they've served their purpose.

In some cases, vendors or contractors may have set up processes and then moved on once their engagement was complete. Each of these scenarios had the potential for suboptimal Azure spend.

Providing visibility into these issues was relatively simple. Because all Microsoft Azure subscriptions across our organization are searchable through our company-wide inventory management system and sortable by department, engineers were able to locate all the subscriptions belonging to shadow IT divisions. From there, they simply had to apply CloudFit recommendations to those subscriptions and loop them through Service 360.

Our people now have the information they need to act—our organizational leaders can visit their Service 360 dashboard or can review their action summary report to see what they can do to cut their costs. That’s where culture and education came into the equation.

“Culture is always the number-one challenge when items aren’t actually owned by a core engineering team,” Wan says. “When you have teams that are more about generating revenue or managing corporate processes, a lot of what we have to deal with is education.”

It wasn’t just educating teams about Microsoft Azure optimization techniques. CloudFit and Service 360 provided a lot of the guidance those teams would need to get the job done. To a great degree, non-engineering employees needed to build the discipline of receiving and resolving tickets like a developer or engineer would.

But through direct communications from FinOps tools and support from Wan’s colleagues in engineering, we’ve been meeting our goal of optimizing Azure spend in shadow IT divisions. In the first six months of this solution’s availability, we’ve saved $1 million thanks to various optimizations.

Microsoft Azure savings and organizational discipline

Shadow IT will always exist in some form or another, so this journey isn’t just about remedying past inefficiencies. It’s also about building a culture of optimization and best practices across shadow IT divisions as they use their Microsoft Azure subscriptions moving forward.

With these solutions and practices in place, we’ve moved on from a “get clean” and “stay clean” culture to one where we “start clean.”

—Qingsu Wu, principal program manager, IES

“As we get more mature and divisions build up their muscles, we’re actually getting to an ongoing state of optimization,” says Feng Liu, principal product manager with IES. “As we build up that culture and that practice, folks are becoming more aware and taking more ownership and accountability.”

Some shadow IT divisions are even going beyond FinOps recommendations. For example, our commercial sales and marketing organization uses shadow IT solutions so extensively and is so keen to optimize their budget that they’ve automated the implementation of recommendations and created their own internal FinOps team.

“The whole vision of our shadow IT program is helping business teams to be self-accountable and sustainable,” says Qingsu Wu, principal program manager for the Infrastructure and Engineering Services (IES) team. “With these solutions and practices in place, we’ve moved on from a ‘get clean’ and ‘stay clean’ culture to one where we ‘start clean.’”

It’s all part of building a more effective culture and practice to do more with less.

Key Takeaways

  • Understand your inventory. Spend time linking your organizational hierarchy to your Azure resources.
  • Get to a confident view of your estate and your data. It’s crucial.
  • Don’t be overly prescriptive. Be open to how you’re going to approach the situation.
  • Build sustainability into your efforts by getting non-engineering teams more comfortable with regular engineering practices and learning from each other.
  • Don’t overlook small wins. When they scale out across an entire organization, they can produce substantial savings.

Related links

Implementing Microsoft Azure cost optimization internally at Microsoft http://approjects.co.za/?big=insidetrack/blog/implementing-microsoft-azure-cost-optimization-internally-at-microsoft/ Tue, 07 Jun 2022 17:35:40 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=9389 We periodically update our stories, but we can’t verify that they represent the full picture of our current situation at Microsoft. We leave them on the site so you can see what our thinking and experience was at the time. Our Microsoft Digital team is aggressively pursuing Microsoft Azure cost optimization as part of our […]

The post Implementing Microsoft Azure cost optimization internally at Microsoft appeared first on Inside Track Blog.

We periodically update our stories, but we can’t verify that they represent the full picture of our current situation at Microsoft. We leave them on the site so you can see what our thinking and experience was at the time.

Our Microsoft Digital team is aggressively pursuing Microsoft Azure cost optimization as part of our continuing effort to improve the efficiency and effectiveness of our enterprise Azure environment here at Microsoft and for our customers.

By adopting data-driven cost-optimization techniques, investing in central governance, and driving modernization efforts throughout our Microsoft Azure environment, we've made our environment—one of the largest enterprise environments hosted in Azure—a cost-efficient blueprint that customers can look to for lessons on how to lower their own Azure costs.

We began our digital transformation journey in 2014 with the bold decision to migrate our on-premises infrastructure to Microsoft Azure so we could capture the benefits of a cloud-based platform—agility, elasticity, and scalability. Since then, our teams have progressively migrated and transformed our IT footprint to the largest cloud-based infrastructure in the world—we host more than 95 percent of our IT resources in Microsoft Azure.

The Microsoft Azure platform has expanded over the years with the addition of hundreds of services, dozens of regions, and innumerable improvements and new features. In tandem, we’ve increased our investment in Azure as our core destination for business solutions at Microsoft. As our Azure footprint has grown, so has the environment’s complexity, requiring us to optimize and control our Azure expenditures.

Optimizing Microsoft Azure cost internally at Microsoft

Our Microsoft Azure footprint follows the resource usage of a typical large-scale enterprise. In the past few years, our cost-optimization efforts have been more targeted as we attempted to minimize the rising total cost of ownership in Azure due to several factors, including increased migrations from on-premises and business growth. This focus on optimization instigated an investment in tools and data insights for cost optimization in Azure.

The built-in tools and data that Microsoft Azure provides form the core of our cost-optimization toolset. We derive all our cost-optimization tools and insights from data in Microsoft Azure Advisor, Microsoft Azure Cost Management and Billing, and Microsoft Azure Monitor. We’ve also implemented design optimizations based on modern Azure resource offerings. We extract recommendations from Azure Advisor across the different Azure service categories and push those recommendations into our IT service management system, where the services’ owners can track and manage the implementation of recommendations for their services.

Understanding holistic optimization

As the first and largest adopter of Microsoft Azure, we've developed best practices for engineering and maintenance in Azure that support not only cost optimization but also a comprehensive approach to capturing the benefits of cloud computing. We developed and refined the Microsoft Well-Architected Framework as a set of guiding tenets for Azure workload modernization and a standard for modern engineering in Azure.

Cost optimization is one of five pillars in the Well-Architected Framework that work together to support an efficient and effective Azure footprint; the other four are reliability, security, operational excellence, and performance efficiency. Cost optimization in Azure isn't only about reducing spending. In Azure's pay-for-what-you-use model, using only the resources we need, when we need them, in the most efficient way possible is the critical first step toward optimization.

Optimization through modernization

Reducing our dependency on legacy application architecture and technology was an important part of our first efforts in cost optimization. We migrated many of our workloads from on-premises to Microsoft Azure by using a lift-and-shift method: imaging servers or virtual machines exactly as they existed in the datacenter and migrating those images into virtual machines hosted in Azure. Moving forward, we've focused on transitioning those infrastructure as a service (IaaS) based workloads to platform as a service (PaaS) components in Azure to modernize the infrastructure on which our solutions run.

Focus areas for optimization

We’ve maintained several focus areas for optimization. Ensuring the correct sizing for IaaS virtual machines was critical early in our Microsoft Azure adoption journey, when those machines accounted for a sizable portion of our Azure resources. We currently operate at a ratio of 80 percent PaaS to 20 percent IaaS; to achieve this ratio, we’ve migrated workloads from IaaS to PaaS wherever feasible. This means transitioning away from workloads hosted within virtual machines and toward more modular services such as Microsoft Azure App Service, Microsoft Azure Functions, Microsoft Azure Kubernetes Service, Microsoft Azure SQL, and Microsoft Azure Cosmos DB. PaaS services like these offer better native optimization capabilities than virtual machines, such as automatic scaling and broader service integration.

As the number of PaaS services has increased, automating scalability and elasticity across PaaS services has been a large part of our cost-optimization process. Data storage and distribution has been another primary focus area as we modify scaling, size, and data retention configuration for Microsoft Azure Storage, Azure SQL, Azure Cosmos DB, Microsoft Azure Data Lake, and other Azure storage-based services.

Implementing practical cost optimization

While Microsoft Azure Advisor provides most recommendations at the individual service level—Microsoft Azure Virtual Machines, for example—implementing these recommendations often takes place at the application or solution level. Application owners implement, manage, and monitor recommendations to ensure continued operation, account for dependencies, and keep the responsibility for business operations within the appropriate business group at Microsoft.

For example, we performed a lift-and-shift migration of our on-premises virtual lab services into Microsoft Azure. The resulting Azure environment used IaaS-based Azure virtual machines configured with nested virtualization. The initial scale was manageable using the nested virtualization model. However, the Azure-based solution was more convenient for hosting workloads than the on-premises solution, so adoption began to increase exponentially, which made management of the IaaS-based solution more difficult. To address these challenges, the engineering team responsible for the virtual lab environment re-architected the nested virtual machine design to incorporate a PaaS model using microservices and Azure-native capabilities. This design made the virtual lab environment more easily scalable, efficient, and resilient. The re-architecture addressed the functional challenges of the IaaS-based solution and reduced Azure costs for the virtual lab by more than 50 percent.

In another example, an application used Microsoft Azure Functions with the Premium App Service Plan tier to account for long-running functions that wouldn’t run properly without the extended execution time enabled by the Premium tier. The engineering team converted the logic in the Function Apps to use Durable Functions, an Azure Functions extension, and more efficient function-chaining patterns. This reduced execution time to less than 10 minutes, which allowed the team to switch the Function Apps to the Consumption tier, reducing cost by 82 percent.
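
The application-code change to Durable Functions is outside the scope of an infrastructure template, but the resulting plan-tier switch shows up directly in IaC. Here's a hedged Bicep sketch of the hosting-plan change; resource names are hypothetical, while the SKU values are the standard Azure identifiers for these tiers:

```bicep
// Illustrative: the hosting-plan change behind the 82 percent savings.
// Resource names are hypothetical.
param location string = resourceGroup().location

// Before: an Elastic Premium plan was required for long-running functions.
//   sku: { name: 'EP1', tier: 'ElasticPremium' }

// After: with execution time under 10 minutes, the Consumption tier
// (billed per execution rather than per allocated instance) suffices.
resource functionPlan 'Microsoft.Web/serverfarms@2022-09-01' = {
  name: 'plan-functions-consumption'
  location: location
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}
```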

Governance

To ensure effective identification and implementation of recommendations, governance in cost optimization is critical for our applications and the Microsoft Azure services that those applications use. Our governance model provides centralized control and coordination for all cost-optimization efforts. Our model consists of several important components, including:

  • Microsoft Azure Advisor recommendations and automation. Advisor cost management recommendations serve as the basis for our optimization efforts. We channel Advisor recommendations into our IT service management and Microsoft Azure DevOps environment to better track how we implement recommendations and ensure effective optimization.
  • Tailored cost insights. We’ve developed dashboards that identify the costliest applications and business groups and highlight opportunities for optimization. The data from these dashboards helps engineering leaders observe and track important Azure cost components in their service hierarchy to ensure that optimization is effective.
  • Improved Microsoft Azure budget management. We perform our Azure budget planning by using a bottom-up approach that involves our finance and engineering teams. Open communication and transparency in planning are important, and we track forecasts for the year alongside actual spending to date to enable accurate adjustments to spending estimates and closely track our budget targets. Relevant and easily accessible spending data helps us identify trend-based anomalies to control unintentional spending that can happen when resources are scaled or allocated unnecessarily in complex environments.
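As a rough illustration of the first component, routing Advisor cost recommendations into per-team work tracking might look like the following plain-Python sketch. The record shapes and field names are invented for illustration and are not the Azure Advisor API schema.

```python
# Hypothetical sketch: filter Advisor output down to cost
# recommendations and bucket them by owning team, mirroring how
# recommendations flow into IT service management and Azure DevOps.
from collections import defaultdict

recommendations = [
    {"category": "Cost", "team": "Commerce", "impact": "High",
     "text": "Right-size underutilized VM vm-web-01"},
    {"category": "Security", "team": "Commerce", "impact": "High",
     "text": "Enable MFA for admin accounts"},
    {"category": "Cost", "team": "Finance", "impact": "Medium",
     "text": "Delete unattached disk disk-tmp-7"},
]

def cost_backlog(recs):
    """Keep only cost recommendations and group them by owning team."""
    backlog = defaultdict(list)
    for rec in recs:
        if rec["category"] == "Cost":
            backlog[rec["team"]].append(rec["text"])
    return dict(backlog)
```

Grouping by owning team keeps implementation accountability local while the recommendations themselves flow from a central source.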

Implementing a governance solution has enabled us to realize considerable savings by applying a single change to Microsoft Azure resources across our entire footprint. For example, we implemented a recommendation to convert Microsoft Azure SQL Database instances from the Standard database transaction unit (DTU)-based tier to the General Purpose serverless tier by using a simple Microsoft Azure Resource Manager template and the auto-pause capability. The configuration change reduced costs by 97 percent.
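For illustration, the relevant portion of such a Resource Manager template might look like the following. The resource names, region, capacities, and API version here are placeholders rather than our actual configuration; the `GP_S_Gen5` SKU family selects the General Purpose serverless tier, and `autoPauseDelay` (in minutes) enables auto-pause.

```json
{
  "type": "Microsoft.Sql/servers/databases",
  "apiVersion": "2021-11-01",
  "name": "example-server/example-db",
  "location": "westus2",
  "sku": {
    "name": "GP_S_Gen5_2",
    "tier": "GeneralPurpose"
  },
  "properties": {
    "autoPauseDelay": 60,
    "minCapacity": 0.5
  }
}
```

With auto-pause enabled, compute is billed only while the database is active, which is where the savings for intermittently used databases come from.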

Benefits of Microsoft Azure

Ongoing optimization in Microsoft Azure has enabled us to capture the value of Azure to help increase revenue and grow our business. Our yearly budget for Azure has remained almost static since 2014, when we hosted most of our IT resources in on-premises datacenters. Over that period, Microsoft has grown by more than 20 percent.

Our recent optimization efforts have resulted in significantly reduced spending across numerous Microsoft Azure services. Examples, in addition to those already mentioned, include:

  • Right-sizing Microsoft Azure virtual machines. We generated more than 300 recommendations for VM size changes to increase cost efficiency. These recommendations included switching to burstable virtual machine sizes and accounted for a 15 percent cost savings.
  • Moving virtual machines to the latest generation of virtual machine sizes. Moving from older D-series and E-series VM sizes to their current counterparts generated almost 2,500 recommendations and a cost savings of approximately 30 percent.
  • Implementing Microsoft Azure Data Explorer recommendations. We implemented more than 200 recommendations for Microsoft Azure Data Explorer optimization, resulting in significant savings.
  • Incorporating Cosmos DB recommendations. More than 170 Cosmos DB recommendations reduced cost by 11 percent.
  • Implementing Microsoft Azure Data Lake recommendations. More than 30 Azure Data Lake recommendations combined to reduce costs by approximately 15 percent.
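To make the right-sizing idea in the first bullet concrete, the following toy heuristic flags candidates from CPU telemetry. The thresholds are invented for illustration and are not Advisor’s actual logic.

```python
# Illustrative right-sizing heuristic (invented thresholds, not
# Advisor's real algorithm): a VM with a consistently low average CPU
# and modest peaks is simply oversized, while a VM with a low baseline
# but bursty peaks is a candidate for a burstable (B-series) size.

def size_recommendation(avg_cpu_pct, p95_cpu_pct):
    if avg_cpu_pct < 10 and p95_cpu_pct < 40:
        return "downsize"             # steadily underutilized
    if avg_cpu_pct < 20 and p95_cpu_pct >= 40:
        return "switch-to-burstable"  # low baseline, bursty peaks
    return "keep-current-size"
```

Burstable sizes bank CPU credits during quiet periods and spend them during spikes, which is why they suit the low-baseline, bursty profile rather than sustained load.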

Key Takeaways

Cost optimization in Microsoft Azure can be a complicated process that requires significant effort from several parts of the enterprise. The following are some of the most important lessons that we’ve taken from our cost-optimization journey:

Implement central governance with local accountability

We implemented a central audit of our Microsoft Azure cost-optimization efforts to help improve our Azure budget-management processes. This audit enabled us to identify gaps in our methods and make the necessary engineering changes to address those gaps. Our centralized governance model includes weekly and monthly leadership team reviews of our optimization efforts. These meetings allow us to align our efforts with business priorities and assess the impact across the organization. The service owner still owns and is accountable for their optimization effort.

Use a data-driven approach

Using optimization-relevant metrics and monitoring from Microsoft Azure Monitor is critical to fully understanding the necessity and impact of optimization across services and business groups. Accurate and current data is the basis for making timely optimization decisions that provide the largest cost savings possible and prevent unnecessary spending.

Be proactive

Real-time data and effective cost optimization enable proactive cost-management practices. Cost-management recommendations provide no financial benefit until they’re implemented. Getting from recommendation to implementation as quickly as possible while maintaining governance over the process is the key to maximizing cost-optimization benefits.

Adopt modern engineering practices

Cost optimization is one of the five pillars of the Microsoft Azure Well-Architected Framework, and each pillar functions best when supported by proper implementation of the other four. Adopting modern engineering practices that support reliability, security, operational excellence, and performance efficiency will help to enable better cost optimization in Microsoft Azure. This includes using modern virtual machine sizes where virtual machines are needed and architecting for Azure PaaS components such as Microsoft Azure Functions, Microsoft Azure SQL, and Microsoft Azure Kubernetes Service when virtual machines aren’t required. Staying aware of new Azure services and changes to existing functionality will also help you recognize cost-optimization opportunities as soon as possible.

Looking forward to more optimization

As we continue our journey, we’re focusing on refining our efforts and identifying new opportunities for further cost optimization in Microsoft Azure. The continued modernization of our applications and solutions is central to reducing cost across our Azure footprint. We’re working toward ensuring that we’re using the optimal Azure services for our solutions and building automated scalability into every element of our Azure environment. Using serverless and containerized workloads is an ongoing effort as we reduce our investment in the IaaS components that currently support some of our legacy technologies.

We’re also improving our methods for decentralizing optimization recommendations to enable our engineers and application owners to make the best choices for their environments while still adhering to central governance and standards. This includes automating the detection of anomalous behavior in Microsoft Azure billing by using service-wide telemetry and logging, data-driven alerts, root-cause identification, and prescriptive guidance for optimization.
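The billing-anomaly detection described above can be sketched with a simple statistical check. This is a hedged, minimal illustration using invented numbers, not our actual telemetry pipeline: flag any day whose cost deviates from the recent rolling mean by more than a few standard deviations.

```python
# Toy trend-based spend anomaly detector: compare each day's cost to
# the mean and standard deviation of the preceding window of days.
from statistics import mean, stdev

def anomalous_days(daily_cost, window=7, k=3.0):
    """Return the indices of days whose cost deviates from the prior
    window's mean by more than k standard deviations."""
    flagged = []
    for i in range(window, len(daily_cost)):
        history = daily_cost[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(daily_cost[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged
```

A flagged day would then feed the alerting and root-cause steps: the anomaly points to when spending jumped, and resource-level billing data narrows down what scaled or was allocated unexpectedly.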

Microsoft Azure optimization is a continuous cycle. As we further refine our optimization efforts, we learn from what we’ve done in the past to improve what we’ll do in the future. Our footprint will continue to grow in the years ahead, and our cost-optimization efforts will expand accordingly to ensure that our business is capturing every benefit that the Azure platform provides.

Related links

We'd like to hear from you!

Want more information? Email us and include a link to this story and we’ll get back to you.

Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.

The post Implementing Microsoft Azure cost optimization internally at Microsoft appeared first on Inside Track Blog.
