DevOps Archives - Inside Track Blog

DevOps is sending engineering practices up in smoke

Inside Track staff — Mon, 15 Apr 2024 15:06:51 +0000

When it comes to modernizing how software engineers write their code, sometimes you just have to light things on fire.

Just ask James Gagnon.

He’ll tell you that good engineering teams at Microsoft and in Microsoft Digital are driving towards using DevOps to do their work, and with good reason.

“DevOps is a foundational part of actually achieving digital transformation, of becoming truly agile,” says Gagnon, a software engineering lead on the Microsoft Digital team delivering finance applications inside Microsoft.

Moving to DevOps can be as simple as combining software engineering and support roles, then delivering smaller software increments. That seems easy enough, but here at Microsoft, Gagnon and others driving this kind of transformation often find that organizational boundaries, team culture, and business processes must also evolve to get people to buy into this modern approach to engineering.

Gagnon has seen this story play out over the last six years.

“There is no doubt that implementing DevOps will increase the speed and quality of software delivery, but it will cost you everything if done in isolation,” Gagnon says.

Leaders must burn old practices

James Gagnon is driving his teams’ DevOps transformation inside Microsoft Digital. He is a software engineering lead on the Microsoft Digital team delivering finance applications inside Microsoft. (Photo by Jim Adams | Inside Track)

“DevOps thrives with increased autonomy,” Gagnon says. “Empowering teams to measure, deliver, fail, learn, and improve internally will yield greater outcomes. Leaders that measure outcomes over execution governance will give their teams the freedom to innovate while taking greater accountability of their work.”

Gagnon says that team productivity measures and processes such as sprint burndowns, velocity, work in progress limits, and story pointing must be private to the sprint team. He says leaders should enable these basic practices and ensure they are in place, but they should also make sure the data stays with the team.

“This creates a safe environment to reflect and enables continuous improvement,” he says. “Leaders that create aggregate reports and force specific practices will create a culture of gamification. Reports and scorecards will become more important to the engineers than the outcomes and the customers.”

Leaders who control and over-centralize solutions will frustrate engineering teams and impact their ability to deliver working software with agility, Gagnon says.

“DevOps transformation starts with software transformation and culture transformation,” he says. “The sprint team knows best the debt that is impacting the team and is best suited to partner with business and leadership to define the roadmap.”

Delivery of schedule-driven projects is a legacy success metric of the waterfall software delivery lifecycle. Gagnon has been part of this and learned the hard way that it doesn’t work.

“I’ve seen the perfect software engineering storm and it was a long nasty ride,” he says. “Leadership had set a date for achieving transformation. The engineering team changed structurally with new undocumented expectations. Engineers had to modernize monolithic legacy software, originally built on a tight budget, all while success was measured by new business feature delivery and quality. Unfortunately, leadership was the first and only thing to change.”

How it works—the UAT example

User Acceptance Testing (UAT) is one of the areas causing excessive toil when moving to an agile or modern engineering culture, Gagnon says. Many times, these legacy software engineering practices have been part of the release process for years and compliance validation or other factors required the business stakeholder to review and approve changes. When left in place, this conflicts with the goals of DevOps in general.

“The UAT process alone has huge implications in our ability to move towards DevOps, it’s simply not compatible. Engineering of the current feature sprint can’t stop while UAT happens for the previous feature sprint,” he says. “This is supposed to be faster, right?”

The changes required to move away from UAT need to happen both in engineering teams and the business teams.

“Engineers must take full accountability for quality and business stakeholders must be integral to the process,” Gagnon says. “Achieving the agreed upon success criteria defined at the beginning of a sprint must be demonstrated by a simple sprint review as part of closing the sprint.”

Not only does a UAT process slow down agile methodology, it also creates code line management complexity, he says. It requires a separate integration step and an additional branch, which slows development and increases cost.

“Modern engineering can’t be achieved when managing branches that map to legacy process,” Gagnon says. “We must deliver to production what was engineered within the same sprint.”

People make it happen

Just as culture needs to transform, the software engineers who make up that culture do as well. At Microsoft it started by combining the development and testing roles, and now the company is merging the service engineering role into this DevOps role as well.

“The goal is to drive end to end accountability into the software engineering discipline,” Gagnon says. “This is crucial to the service view and achieving modern engineering practices.”

It’s a transition that’s harder than it looks.

“This isn’t simply a matter of learning a few new technical skills, but, also soft skills,” he says. “Writing code is just one expectation.”

Engineers need to work closely with business partners, manage more aspects of a project, and collaborate with other team members.

“I’d love to hire the mythical unicorn, someone who understands quality, who understands security, who can code, and who can communicate with others,” Gagnon says. “These candidates are far and few between. I hire for the basics across all required areas and seek diversity across the team for specialization. This approach has helped balance the team where everyone can contribute within the DevOps model.”

Putting it all together

Gagnon says proper implementation of DevOps has required multiple adaptations to culture and processes at all levels in Microsoft.

“It’s helping to drive our transformation,” he says. “It’s not only moving the organization forward, but the people as well. It’s driving a shift to centralize on outcomes over solutions, which enables DevOps engineers to truly own and take pride in their work.”

It’s the kind of thing that fires you up, and sends old, outdated practices up in smoke.

“I’ve never been more excited about what’s possible at Microsoft than right now,” Gagnon says. “The transformation is occurring throughout leadership, which is making it easier for our engineering teams to realize their goals and have fun while doing it.”

The post DevOps is sending engineering practices up in smoke appeared first on Inside Track Blog.

Modernizing our cloud networking infrastructure with a DevOps mindset

Juan Jimenez — Thu, 07 Mar 2024 16:42:54 +0000

DevOps has become a fundamental philosophy critical to the success of our cloud networking teams and solutions.

In today’s rapidly changing technology landscape, the conventional model of infrastructure management—receiving user requirements, designing solutions, deploying infrastructure, and manually monitoring for health and availability—lacks the agility and efficiency we need to operate our network infrastructure in a modern work environment.

DevOps represents a mindset and a set of practices that bridge the gap between conventional infrastructure management and a modernized, agile approach to ensuring our network environment continually meets the requirements of our business.

Here in Microsoft Digital, the company’s IT organization, our journey into the DevOps mindset began with a cultural shift. We’ve emphasized collaboration and worked toward removing barriers to cross-team sharing. We’ve encouraged our engineering teams to embrace continuous improvement and align their work to common organizational goals. This shift in mindset has accelerated our project timelines, enhanced reliability, and sparked innovation within our teams.

[Explore moving Microsoft’s global network to the cloud with Azure. Read our ongoing series on moving our network to the cloud.]

Driving efficiency and resiliency with DevOps practices

DevOps is at the forefront of our service delivery cycle. It affects every step and choice our engineering teams make, and DevOps practices have revolutionized our infrastructure management processes.

From the first step of the process—gathering user requirements—our teams collaborate closely with stakeholders to ensure a thorough understanding of user needs and expectations. Our design process is a collaborative, collective effort, with multidisciplinary teams contributing toward efficient, scalable, and secure solutions. With Azure networking components at the core, our DevOps practices span the entire solution lifecycle.

Juan Jimenez and Raghavendran Venkatraman are part of a team at Microsoft Digital that’s using DevOps to modernize our cloud infrastructure.

We automate deployment using infrastructure as code (IaC) with Azure Resource Manager (ARM) templates, Bicep, Azure Blueprints, and Terraform. IaC is the cornerstone of our modernization efforts. We’ve automated most of our network deployment and management tasks by using IaC in ARM templates and across a robust suite of management tools. Massive network deployments now take minutes instead of months. Reconfigurations can be dynamically and sequentially deployed, honoring dependencies and network data flow requirements.

This IaC approach enables our engineers to maintain infrastructure consistency, enforce best practices, and improve team collaboration. It not only accelerates service delivery but also ensures the reliability and stability of our network environment.

We implement continuous integration and deployment (CI/CD) pipelines for network configurations. Using Azure DevOps services, we’ve set up CI/CD pipelines for our network configurations. Whenever our network infrastructure code changes, it automatically triggers a pipeline that tests and deploys these changes across our environments. This ensures that our network infrastructure can evolve rapidly and safely in response to new requirements or challenges.
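As a rough illustration of the flow described above, the following Python sketch models a pipeline run that validates a change and then promotes it through environments in order, stopping at the first failure. The environment names and the `validate` and `deploy` callables are hypothetical stand-ins. The real pipelines run in Azure DevOps, and this is not their actual definition.

```python
from typing import Callable, List, Sequence

# Hypothetical environment order; the article does not name the environments.
ENVIRONMENTS = ("dev", "test", "prod")


def run_pipeline(
    change_id: str,
    validate: Callable[[str], bool],
    deploy: Callable[[str, str], bool],
    environments: Sequence[str] = ENVIRONMENTS,
) -> List[str]:
    """Validate a change, then deploy it one environment at a time.

    Returns the environments that were deployed successfully. Nothing is
    deployed if validation fails, and promotion stops at the first
    environment whose deployment fails.
    """
    if not validate(change_id):
        return []
    deployed = []
    for env in environments:
        if not deploy(change_id, env):
            break
        deployed.append(env)
    return deployed
```

The point of the ordering is that a change never reaches a later environment unless every earlier stage has already accepted it.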

We monitor and run diagnostics with Azure Monitor and Network Watcher. We’ve transformed our monitoring and alerting mechanisms by integrating Azure Monitor and Network Watcher. This gives us real-time visibility into our network performance and health, enabling our systems to proactively identify and resolve issues before they impact our users, often without human intervention. Automated alerts and diagnostics tools within these services allow us to respond swiftly to anomalies.

We automate security and compliance processes. Security is paramount in all our deployments. We automate compliance checks and security monitoring by integrating Azure Policy and Azure Security Center (now Microsoft Defender for Cloud) into our DevOps practices. This ensures our network infrastructure remains compliant with our stringent security standards and streamlines the process of identifying and mitigating potential security risks.

We incorporate feedback loops for continuous improvement. Building feedback mechanisms into our processes lets us keep improving our network infrastructure. Azure DevOps provides tools for tracking user feedback, bug reports, and performance metrics, which we analyze to refine our DevOps practices and align them with emerging technologies and industry best practices. This adaptive approach ensures that we stay agile and responsive to the ever-evolving needs of our users.

Modernization through virtualization

As we move forward in our DevOps journey, we’re pushing into new ways of thinking about networking and modern infrastructure management. Azure-based connectivity has emerged as a critical enabler in this pursuit. For example, our implementation of Azure Virtual WAN exemplifies DevOps-driven networking. Our Azure Virtual WAN solution connects branch offices, data centers, and Azure resources seamlessly, and it’s filled with DevOps practices. The Azure Virtual WAN environment is provisioned using ARM templates, defining the entire topology, including hubs, spokes, and VPN connections. Azure Monitor tracks performance metrics, such as latency and bandwidth utilization. Alerts trigger automatic scaling or failover actions. When new branches are added, Azure Virtual WAN scales dynamically to provide the throughput and performance necessary based on pre-configured auto-scaling rules.

By using Azure Virtual WAN to virtualize our connectivity for Microsoft employees and buildings across the globe, we’re eliminating the constraints of physical infrastructure and unlocking new possibilities for scalability and efficiency.

Staying agile and looking forward

We know we’re working with a moving target as we continue our DevOps journey. The technological landscape constantly evolves, presenting new challenges and opportunities. Our engineers are committed to staying adaptive and flexible, ready to iterate and develop our practices as we progress.

Our DevOps journey has fundamentally transformed our approach to Azure cloud network engineering and infrastructure management. Our use of Azure Virtual WAN, ARM templates, CI/CD pipelines, Azure Monitor, and Network Watcher is a testament to our commitment as Customer Zero to use Microsoft technologies to meet the ever-evolving demands of our users and the industry. By embracing DevOps practices, we’ve improved collaboration and efficiency and paved the way for continuous innovation in the future.

Here are a few ways that you can start adopting a DevOps mindset, whether you’re a seasoned network engineer or a DevOps enthusiast:

Embrace DevOps for agility: Accelerate projects and foster innovation by promoting collaboration and continuous improvement.
Use IaC for efficiency: Use Azure Resource Manager templates and IaC tools to streamline and standardize network deployments.
Automate monitoring and security: Use Azure Monitor and Security Center for real-time insights and automated compliance.
Adopt virtualization and stay adaptive: Use Azure Virtual WAN for scalable connectivity and remain open to evolving technologies and practices.

Learn how to integrate ARM templates with Azure Pipelines.

Want more information? Email us and include a link to this story and we’ll get back to you.

Please share your feedback with us—take our survey and let us know what kind of content is most useful to you.

How we’re deploying our VWAN infrastructure using infrastructure as code and CI/CD

Eric Scheffler — Fri, 22 Sep 2023 20:48:18 +0000

Editor’s note: This is the first in an ongoing series on moving our network to the cloud internally at Microsoft.

We’re building a more agile, resilient, and stable virtual wide-area network (VWAN) to create a better experience for our employees to connect and collaborate globally. By implementing a continuous integration/continuous deployment (CI/CD) approach to building our VWAN-based network infrastructure, we can automate the deployment and configuration processes to ensure rapid and reliable delivery of network changes. Here’s how we’re making that happen internally at Microsoft.

Infrastructure as code (IaC)

Juan Jimenez (left) and Eric Scheffler are part of the team in Microsoft Digital Employee Experience that is helping the company move its network to the cloud. Jimenez is a principal cloud network engineer and Scheffler is a senior cloud network engineer.

Infrastructure as code (IaC) is the fundamental principle underlying our entire VWAN infrastructure. Using IaC, we can develop and implement a descriptive model that defines and deploys VWAN components and determines how the components work together. IaC allows us to create and manage a massive network infrastructure with reusable, flexible, and rapid code deployments.

We created deployment templates and resource modules using the Bicep language in our implementation. These templates and modules describe the desired state of our VWAN infrastructure in a declarative manner. Bicep is a domain-specific language (DSL) that uses declarative syntax to deploy Microsoft Azure resources.

We maintain a primary Bicep template that calls separate modules—also maintained in Bicep templates—to create the desired resources for the deployment in alignment with Microsoft best practices. We use this modular approach to apply different deployment patterns to accommodate changes or new requirements.

With IaC, changes and redeployments are as quick as modifying templates and calling the associated modules. Additionally, parameters for each unique deployment are maintained in separate files from the templates so that different iterations of the same deployment pattern can be deployed without changing the source Bicep code.
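A minimal sketch of the template-plus-parameter-file pattern, assuming the Azure CLI is the deployment tool (the article does not say which tool the team's pipeline calls). The template name, resource group names, and parameter file paths are hypothetical. The code only builds the command lines and does not run them.

```python
from typing import Dict, List

# Hypothetical names for illustration only.
TEMPLATE = "main.bicep"
DEPLOYMENTS: Dict[str, str] = {
    "rg-vwan-emea": "parameters/emea.json",
    "rg-vwan-apac": "parameters/apac.json",
}


def deployment_command(resource_group: str, template: str, parameters_file: str) -> List[str]:
    """Build an `az deployment group create` command for one environment."""
    return [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--template-file", template,
        # The "@" prefix tells the Azure CLI to read parameters from a file.
        "--parameters", f"@{parameters_file}",
    ]


# The same template is reused; only the parameter file differs per deployment.
commands = [
    deployment_command(resource_group, TEMPLATE, parameters_file)
    for resource_group, parameters_file in DEPLOYMENTS.items()
]
```

Adding another deployment of the same pattern then means adding a parameter file and one entry in the mapping, with no change to the Bicep source.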

Version control

We use Microsoft Azure DevOps, which provides Git-based source control, to track and manage our IaC templates, modules, and associated parameter files. With Azure DevOps, we can maintain a history of changes, collaborate within teams, and easily roll back to previous versions if necessary.

We’re also using pull requests to help track change ownership. Azure DevOps tracks changes and associates them with the engineer who made the change. Azure DevOps is a considerable help with several other version control tasks, such as requiring peer reviews and approvals before code is committed to the main branch. Our code artifacts are published to (and consumed from) a Microsoft Azure Container Registry that allows role-based access control of modules. This enables version control throughout the module lifecycle, and it’s easy to share Azure Container Registry artifacts across multiple teams for collaboration.

Automated testing

Responsible deployment is essential with IaC, because deploying a set of templates could radically alter critical network infrastructure. We’ve implemented safeguards and tests to validate the correctness and functionality of our code before deployment. These tests include executing the Bicep linter as part of the Azure DevOps deployment pipeline to ensure that all Bicep best practices are being followed and to find potential issues that could cause a deployment to fail.

We’re also running a test deployment to preview the proposed resource changes before the final deployment. As the process matures, we plan to integrate more testing, including network connectivity tests, security checks, performance benchmarks, and enterprise IP address management (IPAM) integration.
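The two checks described in this section could be gated roughly as follows. This is a sketch under assumptions, not the team's pipeline: it assumes the Azure CLI, where `az bicep build` compiles a template and reports linter diagnostics and `az deployment group what-if` previews resource changes without applying them. The function names and the injectable `runner` are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def preflight_commands(resource_group: str, template: str, parameters_file: str) -> List[Command]:
    """Return the pre-deployment checks in the order they should run."""
    return [
        # Compiling the template also reports Bicep linter diagnostics.
        ["az", "bicep", "build", "--file", template],
        # what-if previews the proposed resource changes without applying them.
        [
            "az", "deployment", "group", "what-if",
            "--resource-group", resource_group,
            "--template-file", template,
            "--parameters", f"@{parameters_file}",
        ],
    ]


def run_shell(command: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(command, check=False).returncode


def preflight_ok(commands: Sequence[Command], runner: Callable[[Command], int] = run_shell) -> bool:
    """Run each check in order and stop at the first non-zero exit code."""
    for command in commands:
        if runner(command) != 0:
            return False
    return True
```

Passing a fake `runner` makes the gate testable without an Azure subscription. In a real pipeline, a reviewer would typically read the what-if output before approving the final deployment.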

Configuration management

Azure DevOps and Bicep allow us to automate the configuration and provisioning of network objects and services within our VWAN infrastructure. These tools make it easy to define and enforce desired configurations and deployment patterns to ensure consistency across different network environments. Using separate parameter files, we can rapidly deploy new environments in minutes rather than hours without changing the deployment templates or signing in to the Microsoft Azure Portal.

Continuous deployment

The continuous integration/continuous deployment (CI/CD) pipeline automates the deployment process for our VWAN infrastructure when the infrastructure code passes all validation and tests. The pipeline triggers the deployment process automatically, which might involve deploying virtual machines, building and configuring cloud network objects, setting up VPN connections, or establishing network policies.

Monitoring and observability

We’ve implemented robust monitoring and observability practices for how we deploy and manage our VWAN infrastructure. Monitoring and observability are helping us to ensure that our CI builds are successful, detect issues promptly, and maintain the health of our development process. Here’s how we’re building monitoring and observability into our Azure DevOps CI pipeline:

We’re creating built-in dashboards and reports that visualize pipeline status and metrics such as build success rates, durations, and failure details.
We’re generating and storing logs and artifacts during builds.
We’ve enabled real-time notifications to help us monitor build status for failures and critical events.
We’re building in pipeline monitoring review processes to identify areas for improvement, including optimizing build times, reducing failures, and enhancing the stability of our pipeline.
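The pipeline metrics named in the list above (build success rate, durations, and failure details) can be summarized with a few lines of Python. This is a minimal sketch: the `BuildRecord` shape is an assumption made for illustration and is not the Azure DevOps data model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class BuildRecord:
    # Hypothetical record shape; real pipeline data would come from Azure DevOps.
    build_id: int
    succeeded: bool
    duration_seconds: float
    failure_detail: Optional[str] = None


def summarize(builds: List[BuildRecord]) -> dict:
    """Compute success rate, mean duration, and a list of failures."""
    if not builds:
        return {"success_rate": None, "mean_duration_seconds": None, "failures": []}
    successes = sum(1 for build in builds if build.succeeded)
    return {
        "success_rate": successes / len(builds),
        "mean_duration_seconds": mean(build.duration_seconds for build in builds),
        "failures": [
            (build.build_id, build.failure_detail)
            for build in builds
            if not build.succeeded
        ],
    }
```

A summary like this is the kind of input a periodic pipeline review could use to spot slow or flaky builds.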

We’re continuing to iterate and optimize our monitoring practices. We’ve created a feedback loop to review the results of our monitoring. This feedback provides the information we need to adjust build scripts, optimize dependencies, automate certain tasks, and further enhance our pipeline.

By implementing comprehensive monitoring and observability practices in our Azure DevOps CI pipeline, we can maintain a healthy development process, catch issues early, and continuously improve the quality of our code and builds.

Rollback and rollforward

We’ve built the ability to roll back or roll forward changes in case of any issues or unexpected outcomes. This is achieved through infrastructure snapshots, version-controlled configuration files, or using features provided by our IaC tool.
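One way to picture rollback with version-controlled configuration is as a search through deployment history for the last version known to be healthy. The sketch below is hypothetical: the article does not describe how the team records deployment health, so the `Deployment` record and its `healthy` flag are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    version: str   # for example a Git tag or commit ID (hypothetical)
    healthy: bool  # whether the deployment was judged healthy afterwards


def rollback_target(history: List[Deployment]) -> Optional[str]:
    """Return the most recent healthy version before the latest deployment.

    `history` is ordered oldest to newest. Returns None when no earlier
    healthy version exists.
    """
    for deployment in reversed(history[:-1]):
        if deployment.healthy:
            return deployment.version
    return None
```

Rolling forward would instead mean deploying a new version that contains the fix, which appends to the history rather than reverting to an older entry.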

Improving through iteration

We’re continuously improving our VWAN infrastructure using information from monitoring data and user experience feedback. We’re also continually assessing new requirements, newly added Azure features, and operational insights. We iterate on our infrastructure code and configuration to enhance security, performance, and reliability.

By following these steps and using CI/CD practices, we can build, test, and deploy our VWAN network infrastructure in a controlled and automated manner, creating a better employee experience by ensuring faster delivery, increased stability, and more effortless scalability.

Here are some tips on how you can start tackling some of the same challenges at your company:

You can use Infrastructure as code (IaC) to create and manage a massive network infrastructure with reusable, flexible, and rapid code deployments.
Using IaC, you can make changes and redeployments quickly by modifying templates and calling the associated modules.
Don’t overlook version control. Tracking and managing IaC templates, modules, and associated parameter files is essential.
Perform automated testing. It’s necessary to validate the correctness and functionality of the code before deployment.
Use configuration management tools to simplify defining and enforcing desired configurations and deployment patterns. This ensures consistency across different network environments.
Implement continuous deployment to automate the deployment process for network infrastructure after the code passes all validation and tests.
Use monitoring and observability best practices to help identify issues, track performance, troubleshoot problems, and ensure the health and availability of the network infrastructure.
Building rollback and roll-forward capabilities enables you to quickly respond to issues or unexpected outcomes.

Try using a Bicep template to manage your Microsoft Azure resources.

Rotating DevOps role improves engineering service quality

Inside Track staff — Tue, 21 Feb 2023 20:35:58 +0000

As many high-performing agile software engineering teams embrace a DevOps culture, they’re adding the role of Directly Responsible Individual (DRI). The role is also known by various other names, such as Google’s “Sheriff” or Facebook’s slightly different “Designated Response Individual.” Rotating within an agile team, the DRI is responsible for service availability, service health, and incident management. The DRI advocates for the customer and drives positive changes to improve the customer experience with services.

In Microsoft Digital, we’re using a DRI to help us deliver better services faster and more cost effectively. The DRI actively looks at services in production, thereby helping our agile teams be proactive rather than reactive. This has helped us reduce—by up to 50 percent—the number of support tickets and bugs that we have to resolve. With the rest of the team free of this distraction, they have more time to deliver business value.

We used to only get four to five hours per day of productive work out of each software engineer. Since adding this role to our teams, productive time has increased to six hours per day. This role also reduces risk because resolving issues doesn’t interfere with our ability to deliver on a sprint. In addition, we’re finding that the DRI reduces the number of engagements we have with support, so these costs also are going down.

[Take a look at how deploying Kanban at Microsoft leads to engineering excellence. Find out more about transforming modern engineering at Microsoft. Learn more about powering Microsoft’s operations transformation with Microsoft Azure.]

DRI process and expectations

In Microsoft Digital, we have a primary DRI with a secondary DRI as a backup. The primary DRI is 100 percent allocated to this role and has no other team tasks. Each day, the primary DRI reviews incident logs and responds to critical incidents or patterns of incidents. They also log defects and assign them to individuals based on root cause analysis. For visibility, the secondary DRI is looped into any issues. In the event the primary DRI is unavailable or busy, the secondary DRI steps in.

DRI role rotation

The primary and secondary DRI roles rotate across all team members. For a seamless transition, the secondary DRI becomes the primary DRI at the next rotation. The primary and secondary DRIs don’t overlap with the Scrum Master role during the same sprint.

The rotation cadence is two weeks, which aligns with the ideal two-week sprint cadence. This ensures that the DRI can participate in service reviews and other service-line meetings that are held every other week. It also ensures that the DRI has ample impact during the sprint and the opportunity to spend time in preferred engineering activities. Rotations start on the first day of the sprint and last until the first day of the next sprint. It’s up to the sprint team to track and manage their DRI schedule.
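The rotation rules above can be sketched as a simple round-robin schedule. This is an illustrative sketch, not the team's tooling: the names are made up, and the Scrum Master check only flags conflicts for the team to resolve rather than rearranging the schedule.

```python
from datetime import date, timedelta
from typing import Dict, List, NamedTuple

SPRINT_LENGTH = timedelta(weeks=2)  # two-week rotation, aligned with the sprint


class Rotation(NamedTuple):
    sprint_start: date
    primary: str
    secondary: str


def build_rotation(team: List[str], first_sprint_start: date, sprints: int) -> List[Rotation]:
    """Round-robin schedule in which each secondary becomes the next primary."""
    if len(team) < 2:
        raise ValueError("a primary and a secondary DRI need at least two people")
    schedule = []
    for i in range(sprints):
        schedule.append(
            Rotation(
                sprint_start=first_sprint_start + i * SPRINT_LENGTH,
                primary=team[i % len(team)],
                secondary=team[(i + 1) % len(team)],
            )
        )
    return schedule


def scrum_master_conflicts(schedule: List[Rotation], scrum_masters: Dict[date, str]) -> List[date]:
    """Return sprint start dates where the Scrum Master is also a DRI."""
    return [
        rotation.sprint_start
        for rotation in schedule
        if scrum_masters.get(rotation.sprint_start) in (rotation.primary, rotation.secondary)
    ]
```

Because the secondary is always the next person in the list, the handoff from secondary to primary holds at every rotation, including when the list wraps around.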

Sprint capacity

DRI activities require effort, and effort doesn’t come free. Effort correlates to capacity, and existing engineering efforts need to change or stop to free up this capacity. For this reason, the primary DRI is not accounted for in the current sprint capacity. We schedule the primary DRI time as “days off” in Visual Studio Team Services (VSTS). This keeps DRI work from having an impact on the sprint plan. In the event the secondary DRI becomes heavily engaged, we have to re-plan the sprint accordingly.
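To make the capacity effect concrete, here is a small worked calculation in Python. The team size and the ten working days per sprint are assumptions chosen for illustration. The six productive hours per day is the figure reported earlier in this article.

```python
def sprint_capacity_hours(
    team_size: int,
    working_days: int,
    productive_hours_per_day: float,
    primary_dris: int = 1,
) -> float:
    """Capacity available for sprint work, excluding the primary DRI."""
    contributors = team_size - primary_dris
    if contributors < 0:
        raise ValueError("team_size must be at least the number of primary DRIs")
    return contributors * working_days * productive_hours_per_day


# Hypothetical team of 6 in a two-week sprint (10 working days):
# 5 contributors * 10 days * 6 hours = 300 hours of planned capacity.
capacity = sprint_capacity_hours(team_size=6, working_days=10, productive_hours_per_day=6)
```

Planning against the reduced figure from the start is what keeps DRI work from disturbing the sprint plan later.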

Incident management

The DRI responds to incidents in two ways:

Pull. During core hours, the DRI reviews incidents for critical issues or patterns of issues that require resolution.
Push. Outside of core hours, the service engineering team engages the DRI when software engineering assistance is required to respond to an incident. The on-call schedule for the primary DRI is rotated so that no one has to be on call for more than one major holiday per year.

In both cases, the DRI isn’t solely responsible for fixing the issue. The DRI creates a VSTS work item and links it to the incident when possible. We prefer to track the work in a single system, while ensuring the effort (time) is tracked in VSTS.

The DRI performs root cause analysis and engages the software engineer who’s accountable for the feature area or component. The DRI isn’t expected to be the hero and fix all issues; however, if the issue is easily fixed the DRI may take the fix forward independently while following up with the extended team for visibility.

High-severity issues

When handling a high-severity live site production issue, the primary DRI should involve the secondary DRI, unless the primary DRI is confident that the issue can be resolved quickly. The DRI is also empowered to contact other team members who have knowledge that could be helpful. Reaching out to others, even if they’re not on call, is the right thing to do. Multiple people working on critical issues can decrease the time to resolution and reduce the stress for the DRI, who would otherwise handle the issue alone. It also helps team members grow in understanding.

Less need for supporting teams

As we mature the DRI role in our agile teams, we expect to reduce—and eventually eliminate—the need for supporting teams. This will free up capacity for creating more business value and quality within our agile teams.

Sustaining engineering

The cheapest way to fix a bug is to catch it when it’s introduced and have the individual who introduced the bug fix it. When the sustaining engineering team resolves defects that we introduce, it creates a culture of reduced accountability and deferred quality. Releasing the sustaining engineering team frees up capacity and changes our team mindset to rapidly fix forward.

Release management

Today we depend on a virtual team of release managers to deploy our software to production. Handoff from the agile team to this team results in a loss of context and requires a dedicated effort for knowledge transfer. Going forward, the primary DRI will take responsibility for deployment to production. The DRI will ensure there’s proper deployment documentation, automation, and validation. After deployment, they’ll review the results and service state. This practice will also reduce access to potentially sensitive information from a broad team to a single individual, which is a pattern that’s in alignment with Sarbanes-Oxley (SOX) compliance.

Key results

Since adopting the DRI role in our agile teams, we’ve experienced many benefits, including improved service quality and customer experiences, career growth for team members, and greater readiness for DevOps within our teams.

Service quality

With a DRI proactively investigating internal exceptions and ticket trends, our teams have been resolving bugs during each sprint. This has improved our customers’ experiences and reduced exception and ticket trends week over week. The following screenshots show ticket trends for our payee management team, which has a rotating DRI role. A recent period showed a 50 percent reduction. Year over year, we had a 30 percent reduction in tickets.

Ticket trends over 15 days: over 50 percent reduction.

Ticket trends over a year: over 30 percent reduction with a steady decline.

Addressing defects broadly across the team has put greater focus on quality and our bug backlog. Payee management is now experiencing a shallow bug backlog (fewer than 30 new/active bugs).

Improved customer experiences

When team members participate as a DRI, they gain knowledge about the end-to-end service. The DRI is responsible for understanding the service in full, the customer experience, and how the service is enabling business outcomes. This increased broad focus makes team members more accountable to deliver high-quality customer experiences and is driving richer designs.

Each DRI brings a unique lens and different values to the role. This diversity in focus is helping the team improve the service in many areas. For example, one of our DRIs discovered that a service wasn’t running in the same region as our data store. This pattern didn’t exist in pre-production and may not have been noticed without the DRI function. The team now has a backlog item to redeploy the service to the same region as the customer, which will reduce latency and improve the customer experience.

Previously, when our customers encountered defects, they would retry their task or use known workarounds. Today, the DRI proactively identifies blocking issues and fixes defects before the customer escalates them. In some cases, the DRI logs support tickets before the customer is even aware that an issue exists. This is dramatically improving our mean time to detect (MTTD) and mean time to resolve (MTTR) metrics.
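
To make the two metrics concrete, here is a minimal Python sketch of how MTTD and MTTR could be computed from incident timestamps. The `Incident` record, its field names, and the choice to measure resolution time from the moment of detection are assumptions for illustration. The article does not describe the team's actual tooling or metric definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the defect first affected the service
    detected: datetime   # when the DRI or monitoring noticed it
    resolved: datetime   # when the fix was confirmed


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not durations:
        raise ValueError("at least one duration is required")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return mean_duration([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve, measured here from detection to resolution.

    Some teams measure from the start of the incident instead; this
    sketch assumes the detection-based definition.
    """
    return mean_duration([i.resolved - i.detected for i in incidents])
```

Tracked per sprint or per DRI rotation, these two averages would show whether proactive detection is shortening the window in which customers are exposed to a defect.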

Our software engineers find that working within the DRI role is a rewarding experience. They’re developing new skills and forming new patterns of working that are increasing their impact and relevance, with the following benefits:

Career growth. Engineers are developing DevOps full-cycle software engineering skills and relevant industry experience.
Leadership skills. Engineers are participating in service health with live site reviews and ongoing engagement opportunities and gaining leadership experience across organizations.
Engineering autonomy. When assigned DRIs aren’t performing DRI tasks, they can work on the projects they’re most passionate about, such as training, bug fixing, or vNext design.

The DRI rotation is building DevOps basics in our teams, from telemetry analysis and instrumentation to deployment into production. After each team member has been the DRI for a few rotations, they’re better prepared for more demanding DevOps responsibilities and patterns of working.

Microsoft internal SAP workload gets a telemetry boost with Azure

Seth Malcolm — Fri, 03 May 2019 00:37:21 +0000

For the first time, Microsoft has end-to-end visibility into the millions of business processes that it runs through SAP every day.

Microsoft is using a new suite of Microsoft Azure telemetry tools to gain insight into how to better manage expense reports, time away reporting, purchase order creation, and similar business processes that get routed through one of the largest SAP instances in the world. Before Microsoft moved its SAP workload into Azure and before it started using new Azure telemetry tools, there was no way to connect such transactions inside and outside of SAP.

“This made the process feel like a black box to Microsoft employee users, and to our engineers, who needed to figure out what was happening so they could connect the dots and improve our services,” says Enda Sullivan, a senior program manager for Microsoft Digital’s internal SAP implementation.

Because the SAP processes were not connected end-to-end, it was difficult to help Microsoft employees when something went wrong.

“If there was a problem, it would require the user to engage with multiple teams, both SAP and non-SAP, to understand the status of a request,” Sullivan says. “Beyond that, the support teams often wouldn’t have visibility into failed transactions between the various system steps. As a user, I should never have to call a helpdesk. The failure should be detected by the service telemetry and monitoring, and be resolved before I’m even aware.”

Now, through Azure, the team has real-time data that tells them when SAP process issues come up, which, importantly, allows those issues to be resolved before users realize something went wrong.
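
As a rough illustration of the kind of proactive detection Sullivan describes, the Python sketch below flags transactions that either failed at a step or stopped progressing between steps. The step names, the four-hour threshold, and the tuple layout are all hypothetical. The article does not describe how the team's alerting is actually implemented.

```python
from datetime import datetime, timedelta

# Hypothetical ordered steps of one business process (not from the article).
EXPECTED_STEPS = ["submitted", "approved", "posted_to_sap", "paid"]


def find_problems(transactions, now, max_gap=timedelta(hours=4)):
    """Return {transaction_id: reason} for transactions that failed or stalled.

    `transactions` maps a transaction ID to a time-ordered list of
    (step, status, timestamp) tuples, where status is "ok" or "failed".
    """
    problems = {}
    for tx_id, events in transactions.items():
        if not events:
            continue  # nothing recorded yet, so nothing to evaluate
        failed = [step for step, status, _ in events if status == "failed"]
        if failed:
            problems[tx_id] = f"failed at {failed[0]}"
            continue
        last_step, _, last_seen = events[-1]
        # A transaction is stalled if it has not reached the final step
        # and has been silent for longer than the allowed gap.
        if last_step != EXPECTED_STEPS[-1] and now - last_seen > max_gap:
            problems[tx_id] = f"stalled after {last_step}"
    return problems
```

A check like this only works when events from SAP and non-SAP systems share a transaction identifier, which is the gap the end-to-end telemetry work set out to close.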

Telemetry comes a long way

Cory Delamarter manages implementation of the Unified Telemetry Platform (UTP) at Microsoft. Delamarter is a principal program manager in Microsoft Digital. (Photo by Jim Adams | Inside Track)

Historically, the use of telemetry to guide how companies like Microsoft use SAP was spotty at best, says Cory Delamarter, a principal program manager tasked with driving telemetry design and implementation standards across Microsoft Digital, which provides IT services for all of Microsoft.

“It’s great that Azure is giving us these new tools to work with, but this is something we were already starting to tackle,” he says. “The opportunity and value of getting all our data in one place is too high to not solve this problem.”

Delamarter’s team is working to bring a consistent approach to telemetry across all of Microsoft Digital, an effort that has been dubbed the Unified Telemetry Platform (UTP).

“When we started to architect a new solution for telemetry, it needed to be scalable, reliable, and cost-effective, but most importantly a single common platform across the org,” he says. “Essentially, we are consolidating tools and data stores.”

Standardization was a must, he says.

“We had to design in flexibility that would support more than a system for monitoring a database or website,” Delamarter says. “The power of unified telemetry is the ability to solve problems across boundaries, or service health, which supports the higher-level business processes.”

The companywide approach to telemetry is being built around Azure Monitor Application Insights, Azure Data Lake (Gen2) and Azure Data Explorer. “Application Insights provides the ability to ingest and organize incoming telemetry data, while Azure Data Explorer gives us the ability to aggregate and support queries across very large data sets stored in our data lake,” he says.

The Unified Telemetry Platform (UTP) system ingests data from applications and infrastructure across the Microsoft internal environment. Data is transformed into a standard schema using Application Insights and housed in Azure Data Lake for cold storage. Azure Data Explorer provides the ability to query the datasets and build dashboards with Power BI in addition to Application Insights and Azure Monitor.

Taking away the mystery

Getting more insight out of Microsoft’s SAP workload really came down to taking the mystery away.

“For SAP, we had to get out of the four walls and shine a light so it’s no longer a black box,” says Aron Stern, a senior software engineer inside Microsoft Digital who is responsible for the Azure architecture of the company’s SAP infrastructure.

First, the team needed to simplify everything.

“We thought about the solution in two parts: telemetry, the raw metadata being emitted by our applications; then monitoring, reporting on service health through dashboards and alerting,” Stern says. “From there we separated the data into three layers: infrastructure, application, and business process.”

Then the team needed to implement a bit of customization.

“We built a small custom application using a few Azure tools,” Stern says. “That allowed us to convert our application and business process telemetry events from SAP into a common schema. This allowed us to stitch our transactions together so we could get the end-to-end view we were looking for.”
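
The idea Stern describes, converting source-specific events into a common schema and then stitching them into transactions, can be sketched in Python as follows. This is an illustration under stated assumptions, not the team's custom application: the `CommonEvent` fields, the SAP-style field names (`DOC_ID`, `STEP`, `RC`, `TS`), and the use of a shared correlation ID are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CommonEvent:
    """Hypothetical common schema; the real UTP schema is not described."""
    correlation_id: str   # shared ID tying the steps of one transaction together
    layer: str            # "infrastructure", "application", or "business_process"
    source: str           # emitting system, for example "sap" or "portal"
    step: str
    status: str           # "ok" or "failed"
    timestamp: datetime


def from_sap(raw: dict) -> CommonEvent:
    """Map an SAP-style record (invented field names) to the common schema."""
    return CommonEvent(
        correlation_id=raw["DOC_ID"],
        layer="business_process",
        source="sap",
        step=raw["STEP"],
        status="ok" if raw["RC"] == 0 else "failed",
        timestamp=datetime.fromisoformat(raw["TS"]),
    )


def stitch(events):
    """Group events by correlation ID and order each group by time."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.correlation_id].append(event)
    return {
        cid: sorted(evs, key=lambda e: e.timestamp)
        for cid, evs in grouped.items()
    }
```

Once every source emits the same fields, events from inside and outside SAP can be grouped with a single key, which is what turns isolated logs into an end-to-end view of each transaction.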

The team then ingested all the SAP data into its Application Insights instance and fed that into an Azure Data Lake for cold storage and Power BI reporting.

This shift has enabled a powerful transformation, says Blake Barrow, a principal software engineer inside Microsoft Digital. “Business process telemetry is what enables us to measure SLAs and provide transparency to our users,” he says.

Everyone who has access to this new telemetry is enjoying a wave of new insights.

“Our employee users now have better insight into any transaction they have on SAP,” Barrow says. “Our engineering teams are getting the data they need to detect issues before they can create downstream problems. Business executives have access to dashboards that provide near real-time status of transaction volumes, so they can look for trends and do health checks on their programs.”

None of these changes would have been possible without transforming the way Microsoft approaches telemetry. “These are the kinds of improvements that can help us all have more impact,” he says. “This is what digital transformation is all about.”

Barrow, Stern, and Sullivan will be presenting how Microsoft is using UTP to gain insights on its SAP workload at the SAP SapphireNow Conference on Wednesday, May 8th.
