Azure Compute Archives - Inside Track Blog
http://approjects.co.za/?big=insidetrack/blog/tag/azure-compute/
How Microsoft does IT

Transforming modern engineering at Microsoft
http://approjects.co.za/?big=insidetrack/blog/transforming-modern-engineering-at-microsoft/
Sat, 11 Jan 2025

[Editor’s note: This content was written to highlight a particular event or moment in time. Although that moment has passed, we’re republishing it here so you can see what our thinking and experience was like at the time.]

Our Microsoft Digital team is implementing a modern engineering vision that creates a culture, tools, and practices focused on developing high-quality, secure, and feature-rich services to enable digital transformation across the company. Our Modern Engineering initiative has helped us be customer-obsessed, accelerated the delivery of new capabilities, and improved our engineering productivity.

Our journey

Our move to the cloud increased the overall agility of our development process and accelerated value delivery for approximately 600 services comprising about 1,400 components. The new cloud technologies provide quicker access to additional infrastructure, enabling engineers to spin up environments and resources on demand and respond more quickly to evolving business needs.

However, we still needed to address several structural issues, including inconsistency between teams in basic engineering fundamentals like coding standards, automated testing, security scans, compliance, release methodology, gated builds, and releases.

We lacked a centralized common engineering system and related practices. Recognizing that we could not continue to evolve our engineering system in a federated way, we invested in a central team. The team was chartered to develop a common engineering system based on Microsoft Azure DevOps, while driving consistency across the organization regarding how teams design, code, instrument, test, build, and deploy services. We brought a product engineering mindset to our services by defining a vision for each service area and establishing priorities based on objectives and key results (OKRs), which we define, track, and report using Viva Goals. These OKRs scope what we want to achieve each planning period; we then execute on them via a defined cadence of sprints. The resulting engineering processes have promoted business alignment, developer efficiency, and cross-team mobility.

We incorporated industry-leading development practices for accessibility, security, and compliance. Achieving compliance has been very challenging, forcing us to change from legacy processes and tooling and requiring us to actively respond to our technical debt in these areas. We also lacked a consistent level of telemetry and monitoring for obtaining key insights about service health, features, customer experience, and usage patterns. We have moved towards a Live Site culture so that we can comprehensively drive sustained improvements in service quality. We improved our telemetry capabilities through synthetic monitoring and by ingesting data from a wide variety of sources using services such as Azure Monitor.

Our vision for modern engineering

Microsoft’s digital transformation requires us to deliver high-quality capabilities and solutions at a faster pace and with reliability and security. To achieve this, we’re modernizing how we build, deploy, and manage our services to get new functionality in our users’ hands as rapidly as possible. We’re re-examining every part of our engineering process and instituting modern engineering practices. Satya Nadella, our Chief Executive Officer, summarized this well.

“In order to deliver the experiences our customers need for the mobile-first, cloud-first world, we will modernize our engineering processes to be customer-obsessed, data-driven, speed-oriented and quality focused.”

Our ongoing investments in modern engineering practices and technology build on the foundation that we’ve already established, and they reflect our vision and support our cultural changes. We have three key pillars on which we’re basing these investments along with a commitment to infuse AI into each pillar wherever appropriate.

  • Customer obsession
  • Engineering productivity
  • Rapid delivery

Customer obsession

We want to ensure our engineers keep customers front and center in their thoughts, so we’re capturing feedback to provide our engineers with a deep understanding of the customer experience. Our service monitoring has enabled us to be alerted to problems and fix them before our customers are even aware of them.

We are the first customers of Microsoft’s commercial offerings, which enables us to identify and address the engineering needs of the enterprise operating in a cloud-centric architecture. We constantly work with our product engineering groups across the company, creating a virtuous cycle that makes our products such as Azure DevOps and Azure services even more enterprise ready.

Using customer feedback to drive development

We’re keeping the customer experience at the center of the engineering process via feedback loop mechanisms. Feedback loops serve as a foundation for hypothesis-driven product improvements based on actual sentiment and usage data. We’re making feedback submission as easy as possible with the same tool that the Microsoft Office product suite uses. The Send a Smile feature automatically and consistently gathers feedback across multiple channels and key user touchpoints. We use this tool as a centralized data system for storing, triaging, and analyzing feedback, then aggregating it into actionable insights.

We encourage adoption of feedback loops and experimentation methods, such as feature flighting and ring deployment, to help measure the impact of product changes. With these foundational components in place, we’re now correlating feedback data with related telemetry so that we can better understand product usability issues and the impact of service issues on customers. Our use of controlled rollouts eliminates the need for UAT environments, which accelerates overall delivery.
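To make ring deployment concrete, here is a minimal sketch in Python (not our production flighting system) of how a deterministic hash can assign users to rollout rings so a feature flag can be exposed to progressively larger audiences. The ring boundaries and user identifiers are illustrative assumptions.

```python
import hashlib

# Hypothetical cumulative exposure per ring:
# ring 0 = team dogfood, 1 = early adopters, 2 = broad internal, 3 = everyone.
RING_BOUNDARIES = [0.01, 0.10, 0.50, 1.00]

def assign_ring(user_id: str) -> int:
    """Deterministically map a user to a bucket in [0, 1] and then to a ring."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    for ring, boundary in enumerate(RING_BOUNDARIES):
        if bucket <= boundary:
            return ring
    return len(RING_BOUNDARIES) - 1

def is_feature_enabled(user_id: str, current_ring: int) -> bool:
    """A feature flighted to `current_ring` is visible to that ring and all earlier ones."""
    return assign_ring(user_id) <= current_ring

# Example: flight a feature to ring 1 (dogfood plus early adopters).
print(is_feature_enabled("alice@example.com", current_ring=1))
```

Because the hash is deterministic, a user stays in the same ring across sessions, which keeps feedback and telemetry comparisons stable as exposure widens.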

Telemetry

We unified the telemetry from disparate systems by building on Azure Monitor to help us implement continuous improvements in the quality of our services. This platform integrates with heterogeneous data sources such as Kusto, Azure Cosmos DB, Azure Application Insights, and Log Analytics to collect, process, and publish data from applications, infrastructure, and business processes. This helps us obtain end-to-end views and generate more actionable insights about our service management.
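As an illustration of the kind of question this unified platform lets us ask, here is a hedged sketch using the azure-monitor-query Python SDK to pull an hourly failure rate from a Log Analytics workspace. The workspace ID is a placeholder, and the KQL assumes a standard workspace-based Application Insights `AppRequests` table and a hypothetical role name.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

client = LogsQueryClient(DefaultAzureCredential())

# Hourly failure rate for one service over the past day.
query = """
AppRequests
| where AppRoleName == 'my-service'
| summarize total = count(), failed = countif(Success == false) by bin(TimeGenerated, 1h)
| extend failure_rate = todouble(failed) / total
| order by TimeGenerated asc
"""

response = client.query_workspace(WORKSPACE_ID, query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```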

We’re working toward delivering highly connected insights that aggregate the health of component services, customer experience, and business processes. This produces contextual data that not only identifies events but also identifies root causes and recommended next actions. We’re using business process monitoring (BPM) to monitor true availability and performance by tracking successful transactions and customer impact across multiple services and business groups.

To achieve a sustained level of quality, we’re leveraging synthetic monitoring for all critical services, especially those with a relatively low volume of business transactions. Data-enhanced incident tickets provide a business impact prioritized view of issues, supplemented with potential causes including those identified through Machine Learning. These data-enhanced tickets allow teams to focus on the most important tickets and reduce mitigation time.
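A synthetic probe can be as simple as a scheduled script that issues a representative transaction and records availability and latency. The sketch below is illustrative rather than our production monitor; the endpoints and SLO thresholds are made up.

```python
import time
import requests

# Hypothetical health endpoints for critical, low-traffic services.
PROBES = [
    {"name": "invoice-api", "url": "https://invoice.example.com/health", "slo_ms": 500},
    {"name": "approvals-ui", "url": "https://approvals.example.com/health", "slo_ms": 800},
]

def run_probe(probe: dict) -> dict:
    """Issue one synthetic transaction and record availability and latency."""
    start = time.monotonic()
    try:
        response = requests.get(probe["url"], timeout=10)
        latency_ms = (time.monotonic() - start) * 1000
        healthy = response.ok and latency_ms <= probe["slo_ms"]
    except requests.RequestException:
        latency_ms, healthy = None, False
    return {"service": probe["name"], "healthy": healthy, "latency_ms": latency_ms}

# A scheduler (cron, an Azure Functions timer, etc.) would run this every few
# minutes and emit the results to the telemetry platform.
for probe in PROBES:
    print(run_probe(probe))
```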

We are investing in AI technologies to proactively detect anomalies and automatically remediate them wherever possible. Being able to intelligently respond to incidents reduces support costs and improves service reliability and the overall user experience.

Service health

We have focused on increasing our effectiveness in service and live site incident management. We rolled out a standard incident management process and measured continual improvements against key incident management metrics. We monitor service health metrics and key performance indicators (KPIs) across the organization to understand customer sentiment and ensure services are reliable, compliant, and performing well. We’re using consistent standards, which helps ensure that we can aggregate data at any level in the service hierarchy and compare it across different team groups. We built a more integrated experience on top of Azure Monitor, enriched with contextual data from the unified telemetry platform, and created a set of defined service health measures and an analyzer to track events that can affect service reliability, such as upcoming planned maintenance or compliance related changes. This enables us to detect and resolve issues proactively and quickly. Defined service health measures make it easier to enable service health reporting across various services.

We knew we had to connect service health to business process health, and to how we prioritize issues, so that engineers could address them in a way that reduces negative business impact. The experience we’re building enables visualization of end-to-end business process health and the health of the underlying services by analyzing their telemetry.

We also simplified the flow of service health and engineering fundamentals data to the engineer and reduced the number of dashboards and tools they use. An internal tool is now the key repository for all service owners to view service health and other relevant KPIs. The tool’s integrated notification workflow informs service owners when a service reaches a defined threshold, making it more convenient to prioritize any needed remediation into their backlogs.

Embracing a Live Site culture

Increasing scale and agility in our services and processes required us to focus on making customers’ experiences better. We’re establishing a Live Site culture and pursuing excellence via customer-obsessed, data-driven, multidisciplinary teams. These teams embrace potential failure with honest observation, continuous learning, and measurable improvement targets.

We host an organization-wide, live site review that includes postmortem reviews on incidents, examining long-term remediation plans, and guiding service teams through modern engineering standards that will help them perform robust reviews at a local level. We base these reviews on standard and actionable reports that contain leading indicators for outages or failures based on the analysis of telemetry, synthetic monitoring, and other data.

Engineering productivity

We’re providing our engineers with best-in-class unified standards and practices in a common engineering system, based on the latest Azure tools, such as Azure DevOps. A consistent development environment allows our engineers to transition smoothly between projects and teams. Improved automation, consistency, and centralized engineering systems enable engineers to better focus on the core role of developing. This also reduces onboarding time and allows our engineers to be more flexible across projects.

Integrating developer tooling

We made organizationally mandated code analysis and compliance tools accessible directly within the development environment, thereby advancing our shift-left goal. We built self-service capabilities to manage access, set policies, and make changes to Azure DevOps artifacts such as area paths, work items, and repositories. This has made it easy for engineers to create, update, or retire services, components, and subscriptions, minimizing the time spent managing such resources. We want to extend our shift-left goal to also examine optimization of our Azure service design and surface configuration recommendations early in the deployment cycle, allowing us to rightsize our configurations and avoid unnecessary Azure costs.
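As a rough illustration of what such self-service automation can look like, this sketch uses the Azure DevOps REST API (Classification Nodes) to create an area path for a team onboarding itself. The organization, project, and token values are placeholders, and a production version would sit behind the access and policy checks described above.

```python
import base64
import requests

# All values below are placeholders for illustration.
ORG = "contoso"
PROJECT = "engineering"
PAT = "<personal-access-token>"

def ado_headers(pat: str) -> dict:
    token = base64.b64encode(f":{pat}".encode()).decode()
    return {"Authorization": f"Basic {token}", "Content-Type": "application/json"}

def create_area_path(name: str) -> dict:
    """Create a new area path under the project root via the Classification Nodes API."""
    url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/"
           f"wit/classificationnodes/areas?api-version=7.1")
    response = requests.post(url, json={"name": name}, headers=ado_headers(PAT))
    response.raise_for_status()
    return response.json()

# Example: a service team onboarding itself gets its own area path.
print(create_area_path("Contoso-Payments"))
```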

Enabling code reuse

While at a low volume, we’re still supporting a few applications (fewer than five percent) that use on-premises servers and domain-joined Azure virtual machines. This results in ongoing effort to patch servers, upgrade software, and perform basic infrastructure maintenance tasks. It also impedes our ability to scale apps to accommodate growth. We’ve been transforming these applications into Microsoft Azure platform as a service (PaaS) and software as a service (SaaS) solutions, thereby leveraging the scale and availability of Azure. We enabled this by providing architectural guidance and tools to migrate data, refactoring existing functionality as APIs, and building lightweight applications that reuse APIs others have already published.

Promoting data and code reuse to build solutions more rapidly and align with a service-oriented architecture requires that developers can publish and discover APIs easily. We built an API economy by creating a common set of guidelines for developing coherent APIs, and a central catalog and search experience for discovery. We integrated validation against the API guidelines and enabled our teams to integrate API publishing into their Azure DevOps pipelines. We created a set of common API health analytics. We also enabled the growth of inner source, which extends sharing beyond APIs to the code itself.

Workforce strategies

To address our previous high level of dependency on suppliers, we implemented a new workforce strategy, hiring more full-time employees and bringing more work in-house. This allowed us to transform and modernize how we deliver services. Furthermore, this workforce strategy makes it imperative that there is full-time employee oversight of any supplier deliveries, ensuring they adhere to processes, standards, and regulatory requirements, including security, accessibility, and privacy. We implemented a common bar for hiring across all teams and a common onboarding program to ensure all new hires receive a consistent level of training on all key tools and technologies. As we ramp up our use of AI technologies to further transform our engineering, we are investing in re-skilling and training initiatives to expand the engineering capacity available to work on AI-related projects.

Universal design system

We leveraged Microsoft’s product design system to engineer solutions that look and behave like other Microsoft products. Every product should meet the quality expectations of today’s consumers, meaning that every piece of the user interface (UI) and user experience (UX) should be engineered with accessibility, responsiveness, and familiar behaviors, states, motion, and visual styling. On complex but common components like headers, navigation menus, and data grids this can mean weeks of engineering time multiplied across every Microsoft Digital team that requires the same components. This is considerably reduced by adopting a universal design system.

Rapid delivery

To be customer-obsessed, we’re earning and protecting customer trust in every aspect of our relationship. We track delivery metrics so that we can shorten lead times from ingestion of customer requirements to the time the solution is in the customer’s hands, then measure customer usability and feedback, all while ensuring service reliability. We’re helping engineers achieve this objective by checking for issues earlier in the pipeline and providing a way to rapidly experiment and mitigate risk. We’re building feedback-loop mechanisms to ensure that we understand the user experience as new functionality gets deployed, and we perform automated rollbacks if customer reaction or service-health signals are less favorable than we anticipated.

Integrating security, accessibility, and fundamentals

Delivering secure, compliant, accessible, dependable, and high-quality services is critical to building trust with our customers. Our engineers are checking for issues earlier in the pipeline, and we’re enabling them to experiment rapidly while limiting potential negative effect on the release process.

We moved to a shift-left process, in which work happens as early in the development process as possible. This enabled us to avoid carrying debt from sprint to sprint. We also implemented gates in the developer workflow that build in security in a streamlined way, and we auto-onboard services to ensure continuous compliance.

We scan code for security issues and log bugs in Azure DevOps that we discover during the scanning process, so developers can fix them directly in the same engineering system they use for other functional bugs rather than having to triage separately from security tools.
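A minimal sketch of that hand-off, assuming a pipeline step with a personal access token: it files a scan finding as a Bug work item through the Azure DevOps work item REST API. The organization, project, severity value, and tag are illustrative, not our actual configuration.

```python
import base64
import requests

# Placeholders; in practice these come from the pipeline environment.
ORG, PROJECT, PAT = "contoso", "engineering", "<personal-access-token>"

def file_security_bug(title: str, severity: str, details: str) -> int:
    """Create a Bug work item in Azure DevOps from a security-scan finding."""
    url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/"
           f"wit/workitems/$Bug?api-version=7.1")
    token = base64.b64encode(f":{PAT}".encode()).decode()
    # Work item creation uses a JSON Patch document.
    patch = [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/System.Description", "value": details},
        {"op": "add", "path": "/fields/Microsoft.VSTS.Common.Severity", "value": severity},
        {"op": "add", "path": "/fields/System.Tags", "value": "security-scan"},
    ]
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json-patch+json",
    }
    response = requests.post(url, json=patch, headers=headers)
    response.raise_for_status()
    return response.json()["id"]

# Example: a static-analysis finding becomes a triageable bug.
print(file_security_bug("Hardcoded credential in config loader",
                        "2 - High",
                        "Found by static analysis in commit abc123."))
```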

We used to assess accessibility within our applications late in the development process. To move this further upstream, we adopted Accessibility Insights tooling during development and now expose accessibility-related bugs as part of the pipeline workflow.

We’re adopting AI technologies to provide accessibility guidance and conduct accessibility assessments, ensuring that our applications conform to accessibility requirements.

Additionally, we enabled engineering teams to utilize the guardrails we’re implementing by integrating policy fundamentals into the pipeline, and we’re implementing continuous integration practices. This ensures that all production releases, including hot fixes, come from builds of the main branch of source code and all have appropriate compliance steps applied consistently. Each pull request must have a successful build to ensure that the main branch is golden and always production ready. Maintaining high-quality code in the main branch minimizes build failures that ultimately slow our time to production.

Deploying safely to customers

We created an environment where teams test ideas and prototypes before building them. The goal is to drive customer outcomes in a way that encourages risk-taking with a fail-fast, fail-safe mentality. Central to increasing the velocity of service updates to customers is a consistent, simple, and streamlined way to implement safe deployments. Progressive exposure and feature flags are key in deploying new capabilities to users via controlled rollouts, so we can quickly start receiving customer feedback.

We implemented checks and balances in the process by leveraging service indicators such as latency and faults within the pipeline, thereby catching regressions and allowing initiation of automated rollbacks when predefined thresholds are exceeded. Implementing safe deployment practices, combined with a streamlined and well-managed pipeline, are two of the key elements for achieving a continuous integration, continuous deployment (CI/CD) model.
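In outline, such a release gate can be a simple threshold check between exposure steps. The sketch below is illustrative; the metric names, thresholds, and the get_metrics/trigger_rollback hooks are stand-ins for real pipeline and telemetry integrations.

```python
# Hypothetical release-gate thresholds; real values come from service SLOs.
LATENCY_P95_MS_MAX = 750
FAULT_RATE_MAX = 0.02

def should_roll_back(metrics: dict) -> bool:
    """Compare post-deployment signals against predefined release gates."""
    return (metrics["latency_p95_ms"] > LATENCY_P95_MS_MAX
            or metrics["fault_rate"] > FAULT_RATE_MAX)

def evaluate_release(get_metrics, trigger_rollback, ring: int) -> None:
    """After exposing a ring, hold the release and watch the health signals."""
    metrics = get_metrics(ring)  # e.g., queried from Azure Monitor
    if should_roll_back(metrics):
        trigger_rollback(ring)   # e.g., redeploy the last known-good build
        print(f"Ring {ring}: rollback triggered ({metrics})")
    else:
        print(f"Ring {ring}: healthy, safe to expand exposure")

# Example with stubbed signals: p95 latency breaches the gate.
evaluate_release(lambda r: {"latency_p95_ms": 820, "fault_rate": 0.01},
                 lambda r: None, ring=1)
```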

Reliability and efficiency

We are enhancing our DevOps engineering pipeline across services by identifying and removing bottlenecks and improving our services’ reliability. We’ll use DevOps Research and Assessment (DORA) metrics to measure our execution and monitor our progress against industry benchmarks.

We’re focusing on deployment frequency, lead time for changes, change failure rate, and mean time to recover in order to gain a comprehensive view of our software or service delivery capabilities. Based on this data, we’ll increase productivity, speed up time-to-market, and enhance user satisfaction.
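For reference, the four DORA metrics fall out of two record streams: deployments and incidents. A toy computation with made-up records, not our reporting pipeline:

```python
from datetime import datetime
from statistics import median

# Toy records; real data would come from the pipeline and incident systems.
deployments = [
    {"merged": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 1, 15), "failed": False},
    {"merged": datetime(2024, 1, 2, 10), "deployed": datetime(2024, 1, 3, 11), "failed": True},
    {"merged": datetime(2024, 1, 4, 8), "deployed": datetime(2024, 1, 4, 12), "failed": False},
]
incidents = [{"start": datetime(2024, 1, 3, 12), "restored": datetime(2024, 1, 3, 14)}]
window_days = 7

deployment_frequency = len(deployments) / window_days  # deploys per day
lead_time = median(d["deployed"] - d["merged"] for d in deployments)
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
time_to_recover = median(i["restored"] - i["start"] for i in incidents)

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Lead time for changes (median): {lead_time}")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Time to recover (median): {time_to_recover}")
```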

Key Takeaways

  • We’re making our vision for modern engineering a reality at Microsoft by promoting a Live Site first culture, using data to provide service and business process health signals that inform rapid iteration on new ideas and capabilities with customers.
  • We’re supporting this by moving to an Azure DevOps model of continuous integration and continuous deployment governed by a standardized engineering pipeline with automatic policy enforcement.
  • The Live Site first culture, and the tools and ceremonies that support it, have increased visibility into our engineering processes, improved the quality and delivery of our services, and deepened our insight into our customer experiences, all of which ensure we are continually improving and adapting our services and processes to support digital transformation now and into the future.

Related links

The post Transforming modern engineering at Microsoft appeared first on Inside Track Blog.

Microsoft uses a scream test to silence its unused servers
http://approjects.co.za/?big=insidetrack/blog/microsoft-uses-a-scream-test-to-silence-its-unused-servers/
Sat, 17 Aug 2024

Do you have unused servers on your hands? Don’t be alarmed if I scream about it—it’ll be for a good reason (and not just because it’s almost Halloween)!

I talked previously about our efforts here in Microsoft Digital to inventory our internal-to-Microsoft on-premises environments to determine application relationships (mapping Microsoft’s expedition to the cloud with good cartography) as well as look at performance info for each system (the awesome ugly truth about decentralizing operations at Microsoft with a DevOps model).

With this info, it was time to begin making plans to move to the cloud. Looking at the data, our overall CPU usage for on-premises systems was far lower than we thought—averaging around six percent! We realized this was so low due to many underutilized systems. First things first, what to do with the systems that were “frozen,” or not being used, based upon the 0-2 percent CPU they were utilizing 24/7?

We created a plan to closely examine those assets toward the goal of moving as few as possible. We used our home-built configuration management database (CMDB) to check whether there was a recorded owner. In some cases, we were able to work with that owner and retire the system.

Before we turned even one server off, we had to be sure it wasn’t being used. (If a server is turned off and no one is there to see it, does it make a sound?)

Developing a scream test

Pete Apple, a cloud services engineer in Microsoft Digital, shares how Microsoft scares teams that have unused servers that need to be turned off. (Photo by Jim Adams | Inside Track)

But what if the owner information was wrong? Or what if that person had moved on? For those, we created a new process: the Scream Test. (Bwahahahahaaaa!)

What’s the Scream Test? Well, in our case it was a multistep process:

  1. Display the message “Hey, is this your server? Contact us.” on the sign-in splash page for two weeks.
  2. Restart the server once each day for two weeks to see whether someone opens a ticket (in other words, screams).
  3. Shut down the server for two weeks and see whether someone opens a ticket. (Again, whether they scream.)
  4. Retire the server, retaining the storage for a period, just in case.
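Sketched as code, the process might look like the following. Every helper here is hypothetical, and the sleeps stand in for a scheduler, with the day length compressed so the sketch actually runs.

```python
import time

# Every helper below is hypothetical; real implementations would call the
# splash-page, ticketing, and server-management systems.
def set_splash_message(server, text): ...
def restart(server): ...
def shut_down(server): ...
def retire_keeping_storage(server): ...
def owner_screamed(server) -> bool: ...   # did anyone open a ticket?

DAY = 1  # compressed to 1 second for this sketch; a real run uses a scheduler

def scream_test(server) -> str:
    # Phase 1: splash-page message for two weeks.
    set_splash_message(server, "Hey, is this your server? Contact us.")
    time.sleep(14 * DAY)
    if owner_screamed(server):
        return "owner found: keep"
    # Phase 2: restart once a day for two weeks.
    for _ in range(14):
        restart(server)
        time.sleep(DAY)
        if owner_screamed(server):
            return "owner screamed: keep"
    # Phase 3: shut down for two weeks.
    shut_down(server)
    time.sleep(14 * DAY)
    if owner_screamed(server):
        return "owner screamed: restore and keep"
    # Phase 4: retire, retaining storage for a period just in case.
    retire_keeping_storage(server)
    return "retired"
```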

With this effort, we were able to retire far more unused servers—around 15 percent—than we had expected, without worrying about moving them to the cloud. Winning! We also were able to reclaim more resources on some of the Hyper-V hosts that were slated to continue running on-premises. And as a final benefit, we cleaned up our CMDB a bit!

In parallel, we initiated an effort to look at some of the systems that were infrequently used or used a very low level of CPU (less than 10 percent, or “Cold”). From that, we had two outcomes that proved critical for our successful migration to the cloud.

The first was to identify the systems in our on-premises environments that were oversized. People had purchased physical machines or sized virtual machines according to what they thought the load would be, and either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Azure VM sizes for every on-premises system to use for migration. In other words, we downsized on the way to the cloud versus after the fact.
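A rightsizing recommendation can be a simple search over a size ladder for the smallest SKU that covers observed peak load plus headroom. This sketch is illustrative: the size table is abbreviated and the 30 percent headroom factor is an assumption, not the rule we used.

```python
# Illustrative size ladder; real mappings would use the full Azure VM catalog.
AZURE_SIZES = [  # (name, vCPUs, memory_gb)
    ("Standard_B2s", 2, 4),
    ("Standard_D2s_v5", 2, 8),
    ("Standard_D4s_v5", 4, 16),
    ("Standard_D8s_v5", 8, 32),
]

def recommend_size(cpu_cores: int, peak_cpu_pct: float, peak_mem_gb: float) -> str:
    """Pick the smallest size that covers observed peak load plus 30% headroom."""
    needed_cores = max(1, round(cpu_cores * (peak_cpu_pct / 100) * 1.3))
    needed_mem = peak_mem_gb * 1.3
    for name, cores, mem in AZURE_SIZES:
        if cores >= needed_cores and mem >= needed_mem:
            return name
    return AZURE_SIZES[-1][0]  # fall back to the largest size in the ladder

# An 8-core on-premises server peaking at 20% CPU and 6 GB memory
# fits comfortably in a much smaller VM.
print(recommend_size(cpu_cores=8, peak_cpu_pct=20, peak_mem_gb=6))
```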

At the time, we did a bunch of this work by hand because we were early adopters. Microsoft now has a number of great products that help with this inventory and review of your on-premises environment. To learn more, check out the documentation on Azure Migrate.

Another statistic that the data revealed was the number of systems that were used for only a few days or a week out of each month. Development machines, test/QA machines, and user acceptance testing machines reserved for final verification before moving code to production were used for only short periods. The machines were on continuously in the datacenter, mind you, but they were actually being used for only short periods each month.

For these, we investigated ways to have those systems running only when required by investing in two technologies: Azure Resource Manager templates and Azure Automation. But that’s a story for next time. Until then, happy Halloween!
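As a preview of that pattern, here is a minimal sketch using the azure-mgmt-compute SDK to start and deallocate test VMs around a usage window. The subscription, resource group, and VM names are placeholders, and a real setup would drive this from Azure Automation or a timer trigger rather than direct calls.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholders for illustration.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-test-environments"
TEST_VMS = ["uat-web-01", "uat-sql-01"]

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def start_environment() -> None:
    """Bring test machines up at the start of the verification window."""
    for vm in TEST_VMS:
        compute.virtual_machines.begin_start(RESOURCE_GROUP, vm).result()

def stop_environment() -> None:
    """Deallocate (not just stop) so compute charges cease between test cycles."""
    for vm in TEST_VMS:
        compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, vm).result()

# A runbook or timer-triggered function would call start_environment() before
# the monthly UAT window and stop_environment() after it closes.
```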

Related links

Read the rest of the series on Microsoft’s move to the cloud:

The post Microsoft uses a scream test to silence its unused servers appeared first on Inside Track Blog.

Boosting Microsoft’s migration to the cloud with Microsoft Azure
http://approjects.co.za/?big=insidetrack/blog/how-an-internal-cloud-migration-is-boosting-microsoft-azure/
Fri, 27 Oct 2023

[Editor’s note: This content was written to highlight a particular event or moment in time. Although that moment has passed, we’re republishing it here so you can see what our thinking and experience was like at the time.]

When Microsoft set out to move its massive internal workload of 60,000 on-premises servers to the cloud and to shutter its handful of sprawling datacenters, there was just one order from company leaders looking to go all-in on Microsoft Azure.

Please start our migration to the cloud, and quickly.

As a team, we had a lot to learn. We started with a few Azure subscriptions. We were kicking the tires, figuring things out, assessing how much work we had to do.

– Pete Apple, principal service engineer, Microsoft Digital

However, it was 2014, the early days of moving large, deeply rooted enterprises like Microsoft to the cloud. And the IT pros in charge of making it happen had few tools to do it and little guidance on how to go about it.

“As a team, we had a lot to learn,” says Pete Apple, a principal service engineer in Microsoft Digital. “We started with a few Azure subscriptions. We were kicking the tires, figuring things out, assessing how much work we had to do.”

As it turns out, quite a bit of work. More on that in a moment.

Now, seven years later, the company’s migration to the cloud is 96 percent complete and the list of lessons learned is long. Six IT datacenters are no more and there are fewer than 800 on-prem servers left to migrate. And that massive workload of 60,000 servers? Using a combination of modern engineering to redesign the company’s applications and to prune unused workloads, that number has been reduced. Microsoft is now running on 7,474 virtual machines in Azure and 1,567 virtual machines on-premises.

“What we’ve learned along the way has been rolled into the product,” Apple says. “We did go through some fits and starts, but it’s very smooth now. Our bumpy experience is now helping other companies have an easier time of it (with their own migrations).”

[Learn how modern engineering fuels Microsoft’s transformation. Find out how leaders are approaching modern engineering at Microsoft.]

The beauty of a decision framework

It didn’t start that way, but migrating a workload to Azure inside Microsoft is super smooth now, Apple says. He explains that everything started working better when they began using a decision tree like the one shown here.

Microsoft Digital’s migration to the cloud decision tree

The cloud migration team used this decision tree to guide it through migrating the company’s 60,000 on-premises servers to the cloud. (Graphic by Marissa Stout | Inside Track)

First, the Microsoft Digital migration team members asked themselves, “Are we building an entirely new experience?” If the answer was “yes,” then the decision was easy. Build a modern application that takes full advantage of all the benefits of building natively in the cloud.

If you answer “no, we need to move an existing application to the cloud,” the decision tree is more complex. It requires the team to answer a couple of tough questions.

Do you want to take the Platform as a Service (PaaS) approach? Do you want to rebuild your experience from the ground up to take full benefit of the cloud? (Not everyone can afford to take the time needed or has the budget to do this.) Or do you want to take the Infrastructure as a Service (IaaS) approach? This requires lifting and shifting with a plan to rebuild in the future when it makes more sense to start fresh.

Tied to this question were two kinds of applications: those built for Microsoft by third-party vendors, and those built by Microsoft Digital or another team in Microsoft.

On the third-party side, flexibility was limited—the team would either take a PaaS approach and start fresh, or it would lift and shift to Azure IaaS.

“We had more choices with the internal applications,” Apple says, explaining that the team divvied those up between mission-critical and noncritical apps.

For the critical apps, the team first sought money and engineering time to start fresh and modernize. “That was the ideal scenario,” Apple says. If money wasn’t available, the team took an IaaS approach with a plan to modernize when feasible.

As a result, noncritical projects were lifted and shifted and left as-is until they were no longer needed. The idea was that they would be shut down once something new could be built to absorb their tasks, or left to die on the vine when they became irrelevant.

“In a lot of cases, we didn’t have the expertise to keep our noncritical apps going,” Apple says. “Many of the engineers who worked on them moved onto other teams and other projects. Our thinking was, if there is some part of the experience that became important again, we would build something new around that.”
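Condensed into code, the decision tree reads something like the sketch below. The parameter names and outcome strings paraphrase the flow described above; this is an illustration, not an official tool.

```python
def migration_path(is_new: bool, third_party: bool, mission_critical: bool,
                   funded_to_modernize: bool) -> str:
    """Encode the migration decision tree described above (names are illustrative)."""
    if is_new:
        return "Build cloud-native on Azure PaaS"
    if third_party:
        return "Replace with a PaaS/SaaS offering or lift-and-shift to IaaS"
    if mission_critical and funded_to_modernize:
        return "Rebuild on PaaS now"
    if mission_critical:
        return "Lift-and-shift to IaaS; modernize when feasible"
    return "Lift-and-shift as-is; retire when absorbed or irrelevant"

# Example: an internal, mission-critical app without modernization budget.
print(migration_path(is_new=False, third_party=False,
                     mission_critical=True, funded_to_modernize=False))
```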

Getting migration right

When Microsoft started its migration to the cloud, the company had a lot to learn, says Pete Apple, a principal service engineer in Microsoft Digital. That migration is nearly finished and those learnings? “They have been rolled into the product,” Apple says. (Photo by Jim Adams | Inside Track)

Apple says the Microsoft Digital migration team initially thought the migration to the cloud would be as simple as implementing one big lift-and-shift operation. It was a common mindset at the time: Take all your workloads and move them to the cloud as-is and figure out the rest later.

“That wasn’t the best way, for a number of reasons,” he says, adding that there was a myriad of interconnections and embedded systems to sort out first. “We quickly realized our migration to the cloud was going to be far more complex than we thought.”

After a lot of rushing around, the team realized it needed to step back and think more holistically.

The first step was to figure out exactly what they had on their hands—literally. Microsoft had workloads spread across more than 10 datacenters, and no one was tracking who owned all of them or what they were being used for (or if they were being used at all).

Longtime Microsoft culture dictated that you provision whatever you thought you might need and go big to make sure you covered your worst-case scenario. Once the upfront cost was covered, teams would often forget about how much it cost to keep all those servers running. With teams spinning up production, development, and test environments, the amount of untracked capacity was large and always growing.

“Sometimes, they didn’t even know what servers they were using,” Apple says. “We found people who were using test environments to run their main services.”

And figuring out who was paying for what? Good luck.

“There was a little bit of cost understanding, of what folks were thinking they had versus what they were paying for, that we had to go through,” Apple says. “Once you move to Azure, every cost is accounted for—there is complete clarity around everything that you’re paying for.”

There were some surprising discoveries.

“Why are we running an entire Exchange Server with only eight people using it? That should be on Office 365,” Apple says. “There were a lot of ‘let’s find an alternative and just retire it’ situations that we were able to work through. It was like when you open your storage facility from three years ago and suddenly realize you don’t need all the stuff you thought you needed.”

Moving to the cloud created opportunities to do many things over.

“We were able to clean up many of our long-running sins and misdemeanors,” Apple says. “We were able to fix the way firewalls were set up, lock down our ExpressRoute networks, and (we) tightened up access to our Corpnet. Moving to the cloud allowed us to tighten up our security in a big way.”

Essentially, it was a greenfield do-over opportunity.

“We didn’t do it enough, but when we did it the right way, it was very powerful,” says Heather Pfluger. She is a partner group manager on Microsoft Digital’s Platform Engineering Team, who had a front-row seat during the migration.

The transition led to many mistakes, which makes sense because the team was trying to both learn a new technology and change decades of ingrained thinking.

“We did dumb things,” Pfluger says. “We definitely lifted and shifted into some financial challenges, we didn’t redesign as we should have, and we didn’t optimize as we should have.”

All those were learning moments, she says. She points to how the team now uses an optimization dashboard to buy only what it needs. It’s a change that’s saving Microsoft millions of dollars.

Apple says those new understandings are making a big difference all over the company.

“We had to get people into the mindset that moving to the cloud creates new ways to do things,” he says. “We’re resetting how we run things in a lot of ways, and it’s changing how we run our businesses.”

He rattled off a long list of things the team is doing differently, including:

  • Sending events and alerts straight to DevOps teams versus to central IT operations
  • Spinning up resources in minutes for just the time needed (versus having to plan for long racking times or VMs that used to take a week to manually build out)
  • Dynamically scaling resources up and down based upon load
  • Resizing month-to-month or week-to-week based upon cyclical business rhythms versus using the old “continually running” model
  • Having some solution costs drop to zero or near zero when idle
  • Moving from custom Windows operating system images for builds to using Azure gallery images and Azure Automation to update images
  • Creating software-defined networking configurations in the cloud versus physical networked firewalled configurations that required many manual steps
  • Managing on-premises environments with Azure tools
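As one concrete example of the resizing rhythm in that list, here is a hedged sketch using the azure-mgmt-compute SDK to change a VM SKU up for month-end processing and back down afterward. All resource names are placeholders, and the SKU choices are illustrative.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholders for illustration.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-finance"
VM_NAME = "finance-batch-01"

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

def resize(vm_size: str) -> None:
    """Change a VM's size; Azure restarts the machine to apply the new SKU."""
    compute.virtual_machines.begin_update(
        RESOURCE_GROUP, VM_NAME,
        {"hardware_profile": {"vm_size": vm_size}},
    ).result()

# Scale up for month-end processing, then back down afterward.
resize("Standard_D8s_v5")   # heavy close-of-month workload
resize("Standard_D2s_v5")   # normal daily load
```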

There is so much more we can do now. We don’t want our internal users to find problems with our reporting. We want to find them ourselves and fix them so fast that our employee users never notice anything was wrong.

– Heather Pfluger, partner group manager, Platform Engineering Team

Pfluger’s team builds the telemetry tools Microsoft employees use every day.

“There is so much more we can do now,” she says, explaining that the goal is always to improve satisfaction. “We don’t want our internal users to find problems with our reporting. We want to find them ourselves and fix them so fast that our employee users never notice anything was wrong.”

And it’s starting to work.

“We’ve gotten to the point where our employee users discovering a problem is becoming more rare,” Pfluger says. “We’re getting better, but we still have a long way to go.”

Apple hopes everyone continues to learn, adjust, and find better ways to do things.

“All of our investments and innovations are now all occurring in the cloud,” he says. “The opportunity to do new and more powerful things is just immense. I’m looking forward to seeing where we go next.”

Related links

The post Boosting Microsoft’s migration to the cloud with Microsoft Azure appeared first on Inside Track Blog.

Streamlining Microsoft’s global customer call center system with Microsoft Azure
http://approjects.co.za/?big=insidetrack/blog/streamlining-microsofts-global-customer-call-center-system-with-microsoft-azure/
Wed, 27 Jan 2021

Overhauling the call management system Microsoft used to make 70 million calls per year has been a massive undertaking.

The highly complex system was 20 years old and difficult to move on from when, five years ago, the company decided a transformation was needed.

These phone calls are how Microsoft talks to its customers and its partners. We needed to get this right because our call management system is one of the company’s biggest front doors.

– Matt Hayes, principal program manager, OneVoice team

Not only did Microsoft install an entirely new call management system (which is now fully deployed), it did so on next-generation Microsoft Azure infrastructure with global standardization, new capabilities, and enhanced integration for sales and support.

“These phone calls are how Microsoft talks to its customers and its partners,” says Matt Hayes, principal program manager of the OneVoice team. “We needed to get this right because our call management system is one of the company’s biggest front doors.”

Looking back, it was a tall order for Hayes and the OneVoice team, the group in charge of the upgrade at Microsoft Digital, the engineering organization at Microsoft that builds and manages the products, processes, and services that Microsoft runs on.

What made it so tough?

The call management system was made up of 170 different interactive voice response (IVR) systems, which were supported by more than 20 separate phone systems. Those phone systems consisted of 1,600 different phone numbers that were dispersed across 160 countries and regions.

Worst of all, each of these systems was working in isolation.

[This is the second in a series on Microsoft’s call center transformation. The first story in the series documents how Microsoft moved its call centers to Microsoft Azure.]

Kickstarting a transformation

The OneVoice team kicked off Microsoft’s bid to remake its call management system with a complex year-long request for proposal (RFP) process. The team also began preparations with the internal and external stakeholders that it would partner with throughout the upgrade.

To help manage all these workstreams, projects were divvied up into categories that each had their own dedicated team and mandate:

Architecture: This team considered network design and interoperability with the cloud.

Feature needs: This group was charged with ensuring the new system would support business requirements and monitoring needs. They were also tasked with calling out enhancements that should be made to the customer experience.

Partner ecosystem: This team made sure the needs of partners and third-party players were considered and integrated.

Add-on investments: This group made sure cloud space needs were met, addressed personnel gaps, and pursued forward-looking opportunities.

These initial workstreams became the pillars used to guide the transformation of the call management system.

Four pillars of transformation drove the OneVoice team’s call center migration process: architectural considerations, feature needs, the partner ecosystem, and add-on investments.

The key to the upgrade was the synergy between the OneVoice team and the call center teams scattered across the company, says Daniel Bauer, senior program manager on the OneVoice team.

“We decided we were going to move to the cloud—after that was approved, we knew it was time to bring in our call center partners and their business representatives,” Bauer says. “That collaboration helped us build a successful solution.”

Early input from these partners guided the architectural design. This enabled the team to bake in features like end-to-end visibility of metrics and telemetry into both first and third-party stacks. It allowed them to manage interconnected voice and data environments across 80 locations. Importantly, it also set early expectations with telecom service providers around who would own specific functions.

Designing for scale by starting small

Bringing myriad systems together under one centralized roof meant the team had to build a system that could handle exponentially greater amounts of data and functionality.

This required a powerful cloud platform that could manage the IVR technology and a call routing system that would appropriately direct millions of calls to the right agent among more than 25,000 customer service representatives.

“Just the scope of that was pretty intense,” says Jon Hoyer, a principal service engineer who led the migration for the OneVoice team.

The strategy, he says, was to take a regional line of business approach. The OneVoice team started the migration in a pilot with a small segment of Microsoft Xbox agents. After the pilot proved successful, the process was scaled out region by region, and in some cases, language by language within those regions.

“There was a lot of coordination around the migration of IVR platforms and call routing logic while keeping it seamless for the customer,” Hoyer says.

Ian McDonnell, a principal PM manager who led the partner onboarding for the OneVoice team, was also faced with the extremely large task of moving all the customer service agents to the new platform.

For many of these partners, this was a wholesale overhaul that involved training tens of thousands of agents and managers on the new cloud solution.

“We were replacing systems at our outsourcers that were integral to how they operated—integral to not only how they could bill their clients, but enabled them to even pay their salaries,” McDonnell says. “We had to negotiate to make sure they were truly bought in, that they not only saw the shared return on investment, but also recognized the new agility and capabilities this platform would bring.”

Build and deploy once, impact everywhere

When a change is made to a system, no one wants to have to make that change again and again.

When we had 20 separate disconnected systems at our outsourcers, it was an absolute nightmare to make that work everywhere. Now we can build it once and deploy that experience across the whole world.

– Ian McDonnell, principal PM manager, OneVoice team

One of the biggest operational efficiencies of the new centralized system is the ability to build new features with universal deployments. If the hold music or a holiday message needs to be changed, rather than updating it on an individual basis to every different phone system, that update goes out to all suppliers at once.

“When we had 20 separate disconnected systems at our outsourcers, it was an absolute nightmare to make that work everywhere,” McDonnell says. “Now we can build it once and deploy that experience across the whole world.”

Previously, there was no option to redirect customers from high- to low-volume call queues, leaving the customer with long waits and negatively impacting their experience. Now, with a single queue, customers are routed to the next available and most appropriate customer service agent in the shortest time, whether the agents sit in the US, India, or the Philippines, providing additional resilience to the service.
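A single global queue of this kind boils down to a priority queue keyed on how long each agent has been idle, filtered by skill. The sketch below is a simplified illustration, not the OneVoice routing engine; the agent records and skill tags are made up.

```python
import heapq
import itertools

class AgentPool:
    """Single global queue: the longest-idle agent with a matching skill
    takes the next call, regardless of region."""

    def __init__(self):
        self._heap = []                      # (idle_since, tiebreak, agent)
        self._counter = itertools.count()    # tiebreak avoids comparing dicts

    def agent_available(self, agent: dict, idle_since: float) -> None:
        heapq.heappush(self._heap, (idle_since, next(self._counter), agent))

    def route_call(self, required_skill: str) -> dict | None:
        """Pop agents until one matches the call's skill; re-queue the rest."""
        skipped, matched = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if required_skill in entry[2]["skills"]:
                matched = entry[2]
                break
            skipped.append(entry)
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return matched

pool = AgentPool()
pool.agent_available({"name": "Priya", "region": "India", "skills": {"xbox"}}, idle_since=100.0)
pool.agent_available({"name": "Sam", "region": "US", "skills": {"azure", "xbox"}}, idle_since=90.0)
print(pool.route_call("xbox"))  # Sam has been idle longer, so Sam takes the call
```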

This cloud native architecture allowed for new omnichannel features such as “click-to-call,” where customers who are online can request a callback. This allows seamless continuity of context from the secured online experience to a phone conversation for deeper engagement.

As the OneVoice team explores what’s next in add-on investments, they’re exploring a wide range of technologies and capabilities to modernize the call center environment. One of the primary areas of focus is leveraging the speech analytics technology of Microsoft Azure Cognitive Services, which can provide deeper insights into customer satisfaction and sentiment.

In an upcoming blog post in this series, the OneVoice team will share how an in-house development leveraging Microsoft Azure Cognitive Services allowed the team to revolutionize customer sentiment tracking and identify issues before they become major problems.

To contact the OneVoice team, and learn more about their customer support cloud journey, email them at onevoice@microsoft.com.

Related links

The post Streamlining Microsoft’s global customer call center system with Microsoft Azure appeared first on Inside Track Blog.

Microsoft Azure sellers gain a data edge with the Microsoft Power Platform
http://approjects.co.za/?big=insidetrack/blog/microsoft-azure-sellers-gain-a-data-edge-with-the-microsoft-power-platform/
Mon, 04 Jan 2021

Data is great to have, but it’s only as good as our ability to digest it.

Alex Thiede, digital transformation lead for Microsoft in Western Europe and a former Microsoft Azure field seller based in Vienna, set out to talk to other Microsoft Azure sellers to discover how to help them serve their clients better.

For a multi-billion dollar business with more than 3,000 sellers, the potential for impact was huge.

– Alex Thiede, digital transformation lead

What emerged was a common pain point around exploding data. An enormous amount of customer data was being produced, but it was being siloed into different systems that never connected. Cloud Solution Architects (CSAs) and Microsoft Azure specialists would have to go into Microsoft Azure portals for customer data, Microsoft Dynamics 365 to track their customer engagements, and the Microsoft Account Planning Tool to manage account plans.

For Microsoft Azure sellers, whose mission is to help their clients be successful with their cloud experience, it was difficult to get a clear picture of how their accounts were performing. They were spending hours analyzing their data, running it through their own Microsoft Excel sheets and Microsoft Power BI reports, before finally sharing their insights with their account teams, which required even more hours spent building Microsoft PowerPoint slides.

“For a multi-billion dollar business with more than 3,000 sellers, the potential for impact was huge,” Thiede says. “So how do you bring those teams together on the IT side to have a customer-centric view?”

Thiede realized that this was a great question to answer with a Hackathon project.

Thiede assembled a team that included data scientists, field sellers, security specialists, and Microsoft Power Platform developers who were all passionate about solving the problem. They set out to build a solution using Microsoft Power Platform while demonstrating how IT and sales teams could come together in a citizen developer approach.

Within two weeks, the team had come up with the S500 Azure Standup Cloud Cockpit, a tool that brought all the data together in a configurable dashboard that put the individual sellers in the pilot seat.

For Jochen van Wylick, a cloud solutions architect, Hackathon team member and the lead CSA for strategic accounts in the Netherlands, that meant there could finally be a real tool to replace all of the manual unofficial hacking they had been doing to try to layer data in a meaningful way.

Van Wylick showed the team how they were adding additional metadata to the dozens of engagements they were tracking in their CRM to stay organized, and they incorporated that capability in an automated way.

“I like the fact that Alex implemented these ideas in the Stand Up Cockpit,” van Wylick says. “I also like the fact that it will boost my productivity.”

[Learn how Microsoft has automated its revenue processing with Power Automate. Find out how Microsoft is monitoring end-to-end enterprise health with Azure.]

The Microsoft Power Platforms and the power of citizen development

The team wanted to enter the Hackathon competition with a viable product to wow the judges. So, they used the Microsoft Power Platform to create a low-code tool that proved the feasibility of the Stand Up Cockpit while demonstrating how sales and IT teams could innovate together using a citizen developer approach.

Collaborating across six different regions on three continents in the first all-virtual Hackathon, the IT team members built the application environment while leaving the user interface up to the sellers to customize as they wished.

Stefan Kummert, a senior business program manager for Microsoft’s Field App and Data Services team, built the cockpit’s components on Microsoft Power Platform. Kummert says the challenge was the ability to create composite models layering Microsoft Power BI data with Microsoft Azure data analysis. While this is in fact a new Microsoft Power Platform Power Apps feature slated for release sometime in November, it wasn’t available to them at the time of the Hackathon in July.

“So, we tried to remodel this concept, more or less,” Kummert says. “We factored what’s available out of the box with some other Power Platform building blocks, and that’s what gave us all the functionality we needed.”

Sellers could now integrate their data sources into a composite data model, add custom mapping and commenting, gain insights at the child and business unit levels, and more quickly identify issues and potential for optimizations that would serve their clients. At the end of the Hackathon, they had a working prototype using real customer data.

The Azure Stand Up Cockpit used citizen development to create a composite model of disconnected data sets from core platforms, providing deeper understanding and insights into client accounts.

The team largely credits this agility to the citizen developer approach, which empowers non-developers to create applications using low-code platforms sanctioned by IT. “There’s often not enough time to create applications in the classic way,” Kummert says. “I think citizen dev is changing the picture significantly, giving us a fair chance to address the huge amount of change happening in the business environment.”

The team won Microsoft’s 2020 Empower Employees hackathon category. With their win, they were awarded dedicated resources and sponsorship from Microsoft Digital.

Turning the dream into reality

Fresh off their Hackathon win, the team is now working on moving the app into production and getting it into the hands of Microsoft Azure sellers.

They’ll first roll it out to 10 customers, then another 100, and if it’s successful, it will be built into the core platform and scaled out across the Microsoft Sales Experience, MSX Insights, Microsoft Organizational Master, and Microsoft Account Planning programs.

This rapid prototyping and incremental rollout is a strategy targeting increased adoption, an approach that’s appreciated by program managers like Henry Ro, who maintains sales and marketing platforms for Microsoft Digital.

Without the Hackathon, it would have been harder to bring this team together. Rather than doing this just once a year, why not have it as a regular working style? It’s about the energy, the inclusive culture, and the people coming together who have real passion.

– Alex Thiede, digital transformation lead

“Projects like the Azure Cockpit really make it easy for our team and others to validate an idea and take it to fruition,” Ro says. “We’re excited about its capabilities and how we can enable it.”

For their part, Thiede and the team are already itching for another Hackathon, or at least more projects driven by the same kind of inspiration and agility.

“Without the Hackathon, it would have been harder to bring this team together,” Thiede says. “Rather than doing this just once a year, why not have it as a regular working style? It’s about the energy, the inclusive culture, and the people coming together who have real passion.”

Related links

The post Microsoft Azure sellers gain a data edge with the Microsoft Power Platform appeared first on Inside Track Blog.

How ‘born in the cloud’ thinking is fueling Microsoft’s transformation
http://approjects.co.za/?big=insidetrack/blog/how-born-in-the-cloud-thinking-is-fueling-microsofts-transformation/
Thu, 27 Feb 2020

Microsoft wasn’t born in the cloud, but soon you won’t be able to tell.

Now that it has finished “lifting and shifting” its massive internal workload to Microsoft Azure, the company is rethinking everything.

“We’re rearchitecting all of our applications so that they work natively on Azure,” says Ludovic Hauduc, corporate vice president of Core Platform Engineering in Microsoft Core Services Engineering and Operations (CSEO). “We’re retooling to take advantage of all that the cloud has to offer.”

Microsoft spent the last five years moving the internal workload of its 60,000 on-premises servers to Azure. Thanks to early efforts to modernize some of that workload while migrating it, and to ruthlessly removing everything that wasn’t being used, the company is now running about 6,500 virtual machines in Azure. This number dynamically scales up to around 11,000 virtual machines when the company is processing extra work at the end of months, quarters, and years. It still has about 1,500 virtual machines on premises, most of which are there intentionally. The company is now 97 percent in the cloud.

Now that the company’s cloud migration is done and dusted, it’s Hauduc’s job to craft a framework for transforming Microsoft into a born-in-the-cloud company. CSEO will then use that framework to retool all the applications and services that the organization uses to provide IT and operations services to the larger company.

The job is bigger than building a guide for how the company will rebuild applications that support Human Resources, Finance, and so on. Hauduc’s team is creating a roadmap for rearchitecting those applications in a consistent, connected way that focuses on the end-user experience. At the same time, it’s figuring out how to get the more than 3,000 engineers in CSEO who will rebuild those applications to embrace the cultural shift toward modern engineering that this transformation requires.

[Take a deep dive into how Hauduc and his team in CSEO are using a cloud-centric mindset to drive the company’s transformation. Find out more about how CSEO is using a modern-engineering mindset to engineer solutions inside Microsoft.]

Move to the cloud creates transformation opportunity

Despite good work by good people, CSEO’s engineering model wasn’t ready to scale to the demands of Microsoft’s growth and how fast its internal businesses were evolving. Moving to the cloud created the perfect opportunity to fix it.

“In the past, every project we worked on was delivered pretty much in isolation,” Hauduc says. “We operated very much as a transaction team that worked directly for internal customers like Finance and HR.”

CSEO engineering was done externally through vendors who were not connected or incentivized to talk to each other. They would take their orders from the business group they were supporting, build what was asked for, get paid, and move on to the next project.

“We would spin up a new vendor team and just get the project done—even if it was a duplication or a slight iteration on top of another project that already had been delivered,” he says. “That’s how we ended up with a couple of invoicing systems, a few financial reporting systems, and so on and so forth.”

Lack of a larger strategy prevented CSEO from building applications that made sense for Microsoft employees.

This made for a rough user experience.

“Each application had a different look and feel,” Hauduc says. “Each one had its own underlying structure and data system. Nothing was connected and data was replicated multiple times, all of which would create challenges around privacy, security, data freshness, etc.”

The problem was simple—the team wasn’t working against a strategy that let it push back at the right moments.

“The word that the previous IT organization never really used was ‘no,’” Hauduc says. “They felt like they had no choice in the matter.”

When moving to the cloud opens the door to transformation

The story is different today. Now CSEO has its own funding and is choosing which projects to build based on a strategic vision that outlines where it wants to take the company.

“The conversation has completely shifted, not only because we have moved things to the cloud, but because we have taken a single, unified data strategy,” Hauduc says. “It has altered how we engage with our internal customers in ways that were not possible when everything was on premises and one-off.”

Now CSEO engineers are working in much smarter ways.

“We now have agility around operating our internal systems that we could never have fathomed achieving on prem,” he says. “Agility from the point of view of elasticity, from the point of view of releases, of understanding how our workloads are being used and deriving insights from these workloads, but also agility from the point of view of reacting and adapting to the changing needs of our internal business partners in an extremely rapid manner because we have un-frictioned access to the data, to the signals, and to the metrics that tell us whether we are meeting the needs of our internal customers.”

And those business groups who unknowingly came and asked for something CSEO had already built?

“We now have an end-to-end view of all the work we’re doing across the company,” Hauduc says. “We can correlate, we can match the patterns of issues and problems that our other internal customers have had, we can show them what could happen if they don’t change their approach, and best of all, we can give them tips for improving in ways they never considered.”

CSEO’s approach may have been flawed in the past, but there were lots of good reasons for that, Hauduc says. He won’t minimize the work that CSEO engineers did to get Microsoft to the threshold of digitally transforming and moving to the cloud.

“The skills and all of the things that made us successful as an IT organization before we started on a cloud journey are great,” he says. “They’re what contributed to building the company and operating the company the way we have today.”

But now it’s time for new approaches and new thinking.

“The skills that are required to run our internal systems and services today in the cloud, those are completely different,” he says.

As a result, the way the team operates, the way it interacts, and the way it engages with its internal customers have had to evolve.

“The cultural journey that CSEO has been on is happening in parallel with our technical transformation,” Hauduc continues. “The technical transformation and the cultural transformation could not have happened in isolation. They had to happen in concert, and to a large extent, they fueled each other as we arrived at what we can now articulate as our cloud-centric architecture.”

And about that word that people in CSEO were afraid to say? They’re saying it now.

“The word ‘no’ is now a very powerful word,” Hauduc says. “When a customer request comes in, the answer is ‘yes, we’ll prioritize it,’ or ‘no, this isn’t the most important thing we can build for the company from an ROI standpoint, but here’s what we can do instead.’”

The change has been empowering to all of CSEO.

“The quality and the shape of the conversation has changed,” he says. “Now we in CSEO are uniquely positioned to take a step back and say, ‘for the company, the most important thing for us to prioritize is this, let’s go deliver on it.’”


The post How ‘born in the cloud’ thinking is fueling Microsoft’s transformation appeared first on Inside Track Blog.

How retooling invoice processing is fueling transformation inside Microsoft http://approjects.co.za/?big=insidetrack/blog/how-retooling-invoice-processing-is-fueling-transformation-inside-microsoft/ Tue, 07 Jan 2020 18:28:38 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=5048

Until recently, processing incoming invoices at Microsoft was a patchwork, largely manual process, owing to the 20-year-old architecture and business processes on which the invoicing system was built.

The existing Microsoft Invoice platform allowed only manual invoice submission. External suppliers and internal users in Microsoft’s Accounts Payable (AP) Operations team could either email a scanned invoice or PDF, manually enter information into a web portal, or use that portal to bulk-upload invoices using a formatted Microsoft Excel template.

In some countries or regions with complex requirements, the AP Operations team is required to manually enter paper invoices directly into SAP, Microsoft’s financial system of record. The system worked, but it was inefficient.

Compared to the wider digital transformation at Microsoft, the inefficiency was glaring. Across the company, manual processes are being replaced with automated ones so that employees can focus on more impactful work. The Invoice Service team, which sits in the Microsoft Digital organization, saw an opportunity to modernize the invoice processing system.

The goal was to trigger the creation of invoices using simple signals, like when purchased goods were received.

“We started with a question,” says James Bolling, principal group engineering manager for the Microsoft Invoice Service team. “How do we trigger invoices so that a supplier can just call our API and generate the invoice in a system-to-system call? How do we automate approval based on purchase order and invoice and receipting information?”
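
To make the idea concrete, here’s a minimal sketch of the kind of system-to-system call Bolling describes, written in Python. The endpoint URL, payload fields, and bearer-token authentication are hypothetical illustrations; the story doesn’t document the real API surface.

    import requests  # third-party HTTP client: pip install requests

    # Hypothetical endpoint and field names; the real Modern Invoice API
    # surface isn't documented here, so treat these as illustrations only.
    INVOICE_API = "https://invoice.example.com/api/v1/invoices"

    def submit_invoice(po_id: str, amount: float, currency: str, token: str) -> str:
        """Submit an invoice in one system-to-system call and return its ID."""
        payload = {
            "purchaseOrderId": po_id,   # ties the invoice to its purchase order
            "amount": amount,
            "currency": currency,
            "goodsReceived": True,      # the 'simple signal' that triggers creation
        }
        response = requests.post(
            INVOICE_API,
            json=payload,
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        response.raise_for_status()     # surface 4xx/5xx errors loudly
        return response.json()["invoiceId"]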

[Read about how we are digitizing contract management. Learn how we are using anomaly detection to automate royalty payments. Microsoft has built a modern service architecture for its Procure-to-Pay systems—read about it here.]

Lower costs, increased speed, and improved compliance

The Invoice Service team is responsible for the entirety of invoicing at Microsoft, Bolling says. The system it maintains integrates tightly with purchase orders related to goods and services from all Microsoft suppliers. The AP Operations team is tasked with ensuring that every payment adheres to relevant payment terms and service-level agreements.

The team also must maintain airtight compliance for the more than 120 countries and regions in which Microsoft conducts business, which accounts for about 1.8 million invoices per year and some USD60 billion in spend, according to Bolling.

The opportunity to lower operating costs by increasing speed and reducing the possibility of human error was enticing, but it wasn’t until the team began tackling a separate project that the scope of what it was about to undertake became clear.

Rewriting a 20-year-old legacy system

While working on a tax project, Shweta Udhoji and Guru Balasubramanian, both program managers on the Invoice Service team, spoke to customers and partners who used the legacy system regularly. Those conversations revealed the scale of the problem: roughly 35,000 invoices were being submitted via email every month, with several thousand more coming in through the web portal.

Because each intake method required its own validation path, the same logic existed in duplicate or triplicate, and even a simple modification had to be applied to each method individually.

“The processes are more than 20 years old, and any extensions due to changing global compliance requirements or any other business needs that come in from our partner teams were very difficult to accommodate,” Udhoji says. “We wanted to simplify that.”

To make matters worse, the team couldn’t rely on documentation for a 20-year-old architecture as they looked for temporary fixes to get the time-sensitive tax updates out the door.

“We didn’t have any documentation to look at, so we had to test extensively, and once we started testing, we started seeing a lot of problems,” Udhoji says.

The road to the Modern Invoice API

Quick fixes wouldn’t solve the underlying problems of the legacy architecture. The team realized that it would need to be completely rewritten to adhere to modern standards.

The Modern Invoice API became a critical component of the broader effort to automate invoice creation and submission where possible. For partners and suppliers for whom manual intake methods are sufficient (or where paper invoices are required by law), those methods would largely remain intact, with some process optimizations added for efficiency. For Microsoft’s largest partners and suppliers, the API would automate the invoice creation and submission process.

“We knew we could make a huge impact on processing time and the manual effort required to process an invoice. We just needed to automate the process,” Udhoji says.

Because business needs change so much faster today than they did 20 years ago, the API was a business decision as well as a technical one. Modifications and extensions would need to be easy to add to keep up.

“What we were building with the Modern API was a framework for better automation, quicker changes, easier automation,” says Bryan Wilhelm, senior software engineer on the Modern Invoice API project.

Bridging legacy and modern systems

The challenge the team faced was daunting and delicate. Because all invoice processing ran through the legacy architecture, there could be no interruptions in service; business had to continue as usual, all over the world. The team also needed to be responsive to constantly shifting local compliance laws, adding modifications without downtime.

“We had to first understand the domain, the business requirements, and the technical side of it, while at the same time maintaining the legacy tool and thinking about how to re-imagine the invoice experience,” Balasubramanian says.

The team started by building a hybrid architecture model (as illustrated in the following diagram) on top of the legacy system; initially, the API would simply call the legacy invoice creation pipeline. By integrating with the existing process and building a wrapper on top of it, the legacy system would continue to function without interruption. With so many business rules and validation processes to consider, it would’ve taken a considerable amount of time to write an end-to-end process from scratch.

The iterative approach meant that the team could ship a working API, complete with integration with the legacy system, in just eight weeks. That left more time to gather and integrate early feedback from partners and suppliers while at the same time modernizing the underlying invoice processing pipeline.

Using Microsoft Azure and cXML for interoperability

The legacy system runs on Windows Server 2016 and SQL Server in Azure Infrastructure as a Service (IaaS) virtual machines, so the team used SQL Data Sync to synchronize data between Azure SQL and SQL Server, and Azure Service Bus messaging for communication between the two systems. The API microservice was built as an Azure Function.

Diagram: The hybrid architecture used to create the Modern Invoice API maintains functionality between the old and new systems.
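
As a rough sketch of how such a wrapper can be wired, the hypothetical Azure Function below accepts an invoice over HTTP and drops it onto a Service Bus queue for the legacy pipeline to consume. The queue name and connection-string setting are placeholders; the story doesn’t publish the team’s actual wiring.

    import json
    import os

    import azure.functions as func
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    def main(req: func.HttpRequest) -> func.HttpResponse:
        """HTTP-triggered wrapper: validate once, then hand off to the legacy pipeline."""
        invoice = req.get_json()

        # One shared validation path instead of one per intake method.
        if "purchaseOrderId" not in invoice:
            return func.HttpResponse("purchaseOrderId is required", status_code=400)

        # Forward to the legacy invoice-creation pipeline over Service Bus.
        # The connection-string setting and queue name are placeholders.
        client = ServiceBusClient.from_connection_string(os.environ["SERVICEBUS_CONNECTION"])
        with client:
            sender = client.get_queue_sender(queue_name="legacy-invoice-intake")
            with sender:
                sender.send_messages(ServiceBusMessage(json.dumps(invoice)))

        return func.HttpResponse(status_code=202)  # accepted for asynchronous processing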

The team chose to adopt the commerce eXtensible Markup Language (cXML) protocol for exchanging invoice documents for two reasons. First, cXML is compliant with existing Microsoft business rules out of the box. “All the gaps we saw that were missing in the legacy system were accounted for in cXML,” Wilhelm says.

Second, cXML has a robust community, so extensive documentation and support already existed, including for the business rules inherent to the protocol.
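
For readers new to the protocol, this Python snippet sketches the skeleton of a cXML invoice document. The element and attribute names follow the published cXML DTD as we understand it (cXML, Header, Request, InvoiceDetailRequest), but treat the fragment as illustrative rather than a complete, validated invoice.

    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    def build_invoice_skeleton(invoice_id: str) -> bytes:
        """Assemble a minimal cXML InvoiceDetailRequest envelope (illustrative only)."""
        now = datetime.now(timezone.utc).isoformat()

        cxml = ET.Element("cXML", payloadID=f"{invoice_id}@example.com", timestamp=now)

        # Real documents carry sender/receiver credentials in the Header (omitted here).
        ET.SubElement(cxml, "Header")

        request = ET.SubElement(cxml, "Request")
        detail = ET.SubElement(request, "InvoiceDetailRequest")
        ET.SubElement(
            detail,
            "InvoiceDetailRequestHeader",
            invoiceID=invoice_id,
            invoiceDate=now,
            purpose="standard",
            operation="new",
        )

        return ET.tostring(cxml, encoding="utf-8", xml_declaration=True)

    print(build_invoice_skeleton("INV-0001").decode())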

Automation’s immediate impact

The Modern Invoice API went live globally to internal partners in August, and to date invoices worth USD85 million have been sent through the API. As the API evolves to encompass a greater share of all invoice processing, touchless invoice submission and approval will lower operating costs and eliminate inefficiencies both for internal teams and for Microsoft vendors and partners across the globe.

Prior to the API, the Microsoft Corporate, External, and Legal Affairs (CELA) team used the web portal in conjunction with an internal tool that tracks US work visa workflows. Microsoft sponsors a significant number of employees who are in the United States on work visas, and the CELA team tracks the status of those visas and submits payment to U.S. Citizenship and Immigration Services (USCIS).

The old process involved running a report with the team’s internal tool to find out which visas required payment. They then used that information to populate an Excel template and submit the invoice in the web portal. Because the team processes tens of thousands of checks per year, the time wasted in this process added up.

CELA became one of the first groups to fully implement integration with the API, modifying its internal tool to call the API and automatically submit checks to file USCIS cases. By managing the process end to end within the same system, the team has seen a reduction in the resources required to order daily checks and gained complete visibility into what checks are being ordered and why.

Modernizing for today and tomorrow

Like CELA, business partners and suppliers who submit thousands of invoices per year to Microsoft had to build and manually upload formatted Excel files to a web portal. In creating the foundation for the Modern Invoice API, the team also laid the groundwork for other automated invoice submission and creation methods such as the SAP Evaluated Receipt Settlement (ERS) process. In addition to adding support for 20 countries/regions that the legacy system simply didn’t (or couldn’t) support, these combined automation efforts mean that as much as 70 percent of the 1.8 million invoices submitted to Microsoft every year will be generated via automated, system-to-system calls.

New capabilities are in the pipeline, too. Supporting documentation can be attached to invoices using the new API, which wasn’t possible before. The team is also working on integrating Microsoft’s financial systems with those of GitHub, a recent Microsoft acquisition, to increase the speed of integration of future acquisitions. The API provides a simpler way to migrate the GitHub invoice system to Microsoft systems and components. “It would’ve been a crazy modification in the classic system,” Udhoji notes. Future acquisitions and integrations will benefit similarly as the API standardizes the process of invoice system migration.

In addition to the four tenants already using the Modern Invoice API, six more tenant groups are slated to be added by the end of the fiscal year, with an external rollout to Microsoft’s biggest external suppliers not far behind.

Creating, submitting, and approving more than 1.8 million invoice transactions every year required significant manual effort from Microsoft and its payees. All told, the manual processes added up to 300,000 hours of work, according to Luciana Siciliano, Finance Operations director and global process owner.

“Our internal business groups no longer have to manually touch these invoices to get them into our enterprise invoice system and move them forward to complete the payout,” she says. “Through automated intake, receipting, and approval solutions, we can drive quick-turnaround solutions that are centered on our users.”


The post How retooling invoice processing is fueling transformation inside Microsoft appeared first on Inside Track Blog.

How Microsoft used SQL Azure and Azure Service Fabric to rebuild a key internal app http://approjects.co.za/?big=insidetrack/blog/how-microsoft-used-sql-azure-and-azure-service-fabric-to-rebuild-a-key-internal-app/ Wed, 16 Oct 2019 20:23:57 +0000 http://approjects.co.za/?big=insidetrack/blog/?p=4899

When Raja Narayan took over supporting the Payee Management Application that Microsoft Finance uses to onboard new suppliers and partners, the experience was broken.

“Our application’s infrastructure was on-premises,” Narayan says. “It was a big, old-school architecture monolith and, although we had database-based logging in place, there was no alerting setup at any level. Bugs and infrastructure failures were bringing the application down, but we didn’t know when this happened.”

And it went down a lot.

When it did, the team wouldn’t know until a user filed a ticket. Then it would take four to six hours before the ticket reached Narayan’s team.

“We would undertake root-cause investigation and it sometimes could take a solid two to three hours, if not more in rare cases, until we managed to eventually identify and resolve the problem,” says Narayan, a principal software engineer on the Microsoft Digital group that supports the Microsoft Finance team.

[Take a look at how Narayan’s broader team is modernizing applications.]

All told, it would take at least 10 to 12 hours to bring the system back online.

And reliability wasn’t the only challenge the team faced daily. Updates and fixes required taking the system down. Engineering teams didn’t have insight into work that other teams were doing. Cross-discipline collaboration was minimal. Continuous repetitive manual work was required. And telemetry was severely limited.

“There was no reliability at all,” Narayan says. “The user experience was very, very bad.”

That was four years ago, before the team moved its payee management system and its 95,000 active supplier and partner accounts to the cloud.

“When I joined our team, it was obvious that we needed a change. And going to Azure was a big part of it,” Narayan says. “Going to the cloud was going to open up new opportunities for us.”

He was right. After the nine-month migration was finished, things got much better right away. The benefits included:

  • The team was empowered to adopt modern DevOps engineering practices, something they really wanted. The benefits showed up in many ways, including reduced cross-team friction and faster response times.
  • Failures were reported to a Directly Responsible Individual (DRI) immediately. They would fix the problem right away or queue it up for the engineering team to do deeper-level work.
  • The time to fix major production issues dropped to as few as 15 minutes, and a maximum of four hours.
  • The team no longer needed to shut down the system to make production fixes (thanks to the availability of staging and production slots, and hosting frameworks like Azure Service Fabric).
  • Application reliability shot up from around 95 percent to 99 percent. Availability stayed high because of redundancy.
  • Scaling the application up and out became just a configuration away. The team was able to scale the services based on memory and processor utilization.
  • The application’s telemetry data became instantly available to analyze and learn from.
  • The team could start taking advantage of automation and governance capabilities.

The shift to Azure is having a lasting impact.

“If someone asked me to go back, I don’t think I could happily do it,” Narayan says. “I don’t know how we survived in those old days. It’s so much faster and more powerful to be on Azure.”

Instead of spending all his time fighting to reduce technical debt, on building and maintaining too many services, and buying and installing technical infrastructure, he’s now focused on what his internal business customers need.

“Now we’re building a program,” Narayan says. “Now we’re taking care of our customers. Application-hosting infrastructure is not our concern now. Azure takes care of it.”

Opening doors with SQL Azure

Moving to the cloud also meant the team got to move on from an on-premises SQL Server database that needed continuous investment in optimization and maintenance to avoid problems with performance.

“We’ve never had an incident where our SQL Azure database has gone down,” Narayan says. “When we were on-prem, our work was often interrupted by accidental server restarts and patch installations.”

The team no longer needs to shut the application down and reboot the server when it wants to fix something or make an upgrade. “Every time we want to do something new, we make a couple of clicks, and boom, we’re done,” he says.

Azure SQL made it much easier to scale up and down when user loads changed. “My resources are so elastic now,” Narayan says. “I can shrink and expand based on my need—it’s a matter of sliding the scrollbar.”
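
Under the covers, that scrollbar changes the database’s service objective, and the same change can be scripted. Here’s a minimal sketch using pyodbc; the server, database, tier names, and credentials are placeholders, and it assumes a login with permission to alter the database.

    import pyodbc  # pip install pyodbc; requires the Microsoft ODBC driver

    # Placeholder connection details; substitute a real server and credentials.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=master;"
        "UID=admin_user;PWD=example-password",
        autocommit=True,  # ALTER DATABASE cannot run inside a transaction
    )

    # Move the (placeholder) database to a larger service objective. Azure
    # applies the change online, without taking the application down.
    conn.execute("ALTER DATABASE payee_mgmt MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3')")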

Moving the application’s database to SQL Azure has given the team access to several new tools.

“With our move to cloud, the team can experiment on any databases, something that wasn’t possible before,” Narayan says. “Before we could only use SQL Server. Now we have an array of options such as Cosmos DB, table storage, MySQL, and PostgreSQL. New features from these products are available automatically to us. We don’t have to install feature updates and patches—it’s all managed by Azure.”

Living in the cloud also gives the team new access to the application’s data.

“We now live in this new big-data world,” Narayan says. “We can now get a lot of insights about our application, especially with machine learning and AI.”

For example, SQL Azure learns from the incoming load and tunes itself accordingly. Indexes are created or dropped based on what it learns. “This is one of the most sought-after features by our team,” he says. “This feature does what a database administrator used to have to do by hand.”
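
That self-tuning behavior is Azure SQL’s automatic tuning feature, which can be toggled per database with T-SQL. A minimal sketch, again with placeholder connection details:

    import pyodbc

    # Placeholder connection string, pointed at the user database this time.
    CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=payee_mgmt;"
        "UID=admin_user;PWD=example-password"
    )

    conn = pyodbc.connect(CONN_STR, autocommit=True)

    # Let the service create and drop indexes based on the workload it observes,
    # the job a database administrator used to do by hand.
    conn.execute("ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (CREATE_INDEX = ON, DROP_INDEX = ON)")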

And processing the many tiny transactions that come through Narayan’s application? Those all happen much faster now as well.

“For Online Analytical Processing (OLAP), we need big processing machines,” he says. “We need big resources.”

Azure provides him with choices, including Azure SQL Data Warehouse, Azure Databricks, and Azure HDInsight. “If I was still on-prem, this kind of data processing would just be a dream for me,” he says. “Now they are a click away for me.”

Going forward, the plan is to use AI and machine learning to analyze Payee Management Application’s data at greater depth. “There is a lot more we can do with our data,” Narayan says. “We’re just getting started.”

Narayan’s journey toward more reliable and agile service is a typical example of how off-loading the work of managing complex on-premises infrastructure can help the company’s internal and external customers focus on their core businesses, says Eli Birova, a site-reliability engineer on the Azure SQL SRE Team.

“And one of the biggest values Azure SQL DB brings is a database in the Azure cloud that scales in and out together with your business need and adapts to your workload,” Birova says.

That provides customers like Narayan and his team with a database as a service shaped by the deep Relational Database Management Systems (RDBMS) engineering expertise that comes from long years of developing Microsoft SQL Server, she says. It’s a service that incorporates best practices from large-scale distributed systems design and implementation, and that natively leverages the scalability and resiliency mechanisms of the Azure stack itself.

“We in the Azure SQL DB team are continuously monitoring and analyzing the behavior of our services and the experience our customers have with us,” Birova says. “We’re very focused on identifying and implementing improvements to our feature set, reliability, and performance. We want to make sure that every customer can rely on their data when and as they need it, and that they can count on their server being up to date and secure without needing to invest their own engineering resources into managing on-premises database infrastructure.”

Harnessing the power of Azure Service Fabric

Once Narayan’s team finished migrating the Payee Management Application to the cloud, it got the breathing room it needed to start thinking bigger.

“We started asking ourselves, ‘How can we get more out of being in the cloud?’” Narayan says. “It didn’t take us long to realize that the best way to take advantage of everything Azure had to offer would be to modify our application from the ground up to be cloud-native.”

That shift in thinking meant that his days of running a massive, clunky, monolithic application were numbered.

“We realized we could use Azure Service Fabric to rebuild the application as a suite of microservices,” Narayan says. “We could get an entirely fresh start.”

Azure Service Fabric is part of an evolving set of tools that the Azure product group is using to help customers—including power users inside Microsoft—build and operate always-on, scalable, distributed apps like the one Narayan’s team manages. So says Spencer Schwab, a software engineering manager on the Microsoft Azure Site Reliability Engineering (SRE) team.

“We’re learning from the experience Raja and his team are having with Service Fabric,” Schwab says. “We’re pumping those learnings back into the product so that our customers have the best experience possible when they choose to bet their businesses on us.”

Narayan’s team is using Azure Service Fabric to gradually rebuild the Payee Management Application without interrupting service to customers. That’s something possible only in the cloud.

“We lifted and shifted all of the old, existing monolith components into Azure Service Fabric,” he says. “Containerizing it like that has allowed us to gradually strangle the older application.”

Each component of the old application is docked in a container. Each is purposefully placed next to the microservice that will replace it.

“Putting each microservice next to the component that it’s replacing allows us to smoothly move that bit of workload to the new microservice without shutting down the larger application,” Narayan says. “This is making our journey to microservices pleasant.”
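
One way to picture the routing half of that setup is a thin facade that sends each request either to the new microservice or to the legacy container it is replacing; flipping a route over then becomes a one-line change. The sketch below is a generic Python illustration of the strangler pattern with made-up service addresses, not Service Fabric’s actual reverse-proxy configuration.

    import requests

    # Made-up addresses: each migrated route flips from the legacy container to
    # the microservice docked beside it, one route at a time.
    LEGACY = "http://legacy-monolith:8080"
    MIGRATED = {
        "/payees": "http://payee-service:8080",      # already strangled
        "/payments": "http://payment-service:8080",  # already strangled
    }

    def route(path: str) -> str:
        """Pick the backend for a request path; default to the legacy monolith."""
        for prefix, backend in MIGRATED.items():
            if path.startswith(prefix):
                return backend
        return LEGACY

    def forward(path: str) -> requests.Response:
        """Proxy a GET request to whichever backend currently owns the route."""
        return requests.get(route(path) + path, timeout=10)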

The team is halfway finished.

“So far we have 12 microservices, and we’re planning to expand up to 25,” he says.

Once the work is done, the team can truly take advantage of being in the cloud.

“We’ll be ready to reap the benefits of cloud-native development,” Narayan says. “Anything becomes possible at that point.”


The post How Microsoft used SQL Azure and Azure Service Fabric to rebuild a key internal app appeared first on Inside Track Blog.
