{"id":9792,"date":"2023-02-21T12:35:58","date_gmt":"2023-02-21T20:35:58","guid":{"rendered":"https:\/\/www.microsoft.com\/insidetrack\/blog\/?p=9792"},"modified":"2023-03-02T11:20:31","modified_gmt":"2023-03-02T19:20:31","slug":"rotating-devops-role-improves-engineering-service-quality","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/insidetrack\/blog\/rotating-devops-role-improves-engineering-service-quality\/","title":{"rendered":"Rotating DevOps role improves engineering service quality"},"content":{"rendered":"
As many high-performing agile software engineering teams embrace a DevOps culture, they\u2019re adding the role of Directly Responsible Individual (DRI). The role is also known by various other names, such as Google\u2019s \u201cSheriff\u201d or Facebook\u2019s slightly different \u201cDesignated Response Individual.\u201d Rotating within an agile team, the DRI is responsible for service availability, service health, and incident management. The DRI advocates for the customer and drives positive changes to improve the customer experience with services.<\/p>\n
In Microsoft Digital, we\u2019re using a DRI to help us deliver better services faster and more cost effectively. The DRI actively looks at services in production, thereby helping our agile teams be proactive rather than reactive. This has helped us reduce\u2014by up to 50 percent\u2014the number of support tickets and bugs that we have to resolve. With the rest of the team free of this distraction, they have more time to deliver business value.<\/p>\n
We used to get only four to five hours per day of productive work out of each software engineer. Since adding this role to our teams, productive time has increased to six hours per day. This role also reduces risk because resolving issues doesn\u2019t interfere with our ability to deliver on a sprint. In addition, we\u2019re finding that the DRI reduces the number of engagements we have with support, so these costs are also going down.<\/p>\n
[Take a look at how deploying Kanban at Microsoft leads to engineering excellence.<\/a> Find out more about transforming modern engineering at Microsoft.<\/a> Learn more about powering Microsoft\u2019s operations transformation with Microsoft Azure.<\/a>]<\/em><\/p>\n In Microsoft Digital, we have a primary DRI with a secondary DRI as a backup. The primary DRI is 100 percent allocated to this role and has no other team tasks. Each day, the primary DRI reviews incident logs and responds to critical incidents or patterns of incidents. They also log defects and assign them to individuals based on root cause analysis. For visibility, the secondary DRI is looped into any issues. In the event the primary DRI is unavailable or busy, the secondary DRI steps in.<\/p>\n The primary and secondary DRI roles rotate across all team members. For a seamless transition, the secondary DRI becomes the primary DRI at the next rotation. The primary and secondary DRIs don\u2019t hold the Scrum Master role during the same sprint.<\/p>\n The rotation cadence is two weeks, which aligns with the ideal two-week sprint cadence. This ensures that the DRI can participate in service reviews and other service-line meetings that are held every other week. It also ensures that the DRI has ample impact during the sprint and the opportunity to spend time in preferred engineering activities. Rotations start on the first day of the sprint and last until the first day of the next sprint. It\u2019s up to the sprint team to track and manage their DRI schedule.<\/p>\n DRI activities require effort, and effort doesn\u2019t come free. Effort correlates to capacity, and existing engineering efforts need to change or stop to free up this capacity. For this reason, the primary DRI is not accounted for in the current sprint capacity. We schedule the primary DRI time as “days off” in Visual Studio Team Services (VSTS). This keeps DRI work from having an impact on the sprint plan. 
In the event the secondary DRI becomes heavily engaged, we have to re-plan the sprint accordingly.<\/p>\n The DRI responds to incidents in two ways:<\/p>\n In both cases, the DRI isn\u2019t solely responsible for fixing the issue. The DRI creates a VSTS work item and links it to the incident when possible. We prefer to track the work in a single system, while ensuring the effort (time) is tracked in VSTS.<\/p>\n The DRI performs root cause analysis and engages the software engineer who\u2019s accountable for the feature area or component. The DRI isn\u2019t expected to be the hero and fix all issues; however, if the issue is easily fixed, the DRI may take the fix forward independently while following up with the extended team for visibility.<\/p>\n When handling a high-severity live site production issue, the primary DRI should involve the secondary DRI, unless the primary DRI is confident that the issue can be resolved quickly. The DRI is also empowered to contact other team members who have knowledge that could be helpful. Reaching out to others, even if they\u2019re not on call, is the right thing to do. Multiple people working on critical issues can decrease the time to resolution and reduce the stress for the DRI, who would otherwise handle the issue alone. It also helps team members grow in understanding.<\/p>\n As we mature the DRI role in our agile teams, we expect to reduce\u2014and eventually eliminate\u2014the need for supporting teams. This will free up capacity for creating more business value and quality within our agile teams.<\/p>\n The cheapest way to fix a bug is to catch it when it\u2019s introduced and have the individual who introduced the bug fix it. When the sustaining engineering team resolves defects that we introduce, it creates a culture of reduced accountability and deferred quality. 
Releasing the sustaining engineering team frees up capacity and changes our team mindset to rapidly fix forward.<\/p>\n Today we depend on a virtual team of release managers to deploy our software to production. Handoff from the agile team to this team results in a loss of context and requires a dedicated effort for knowledge transfer. Going forward, the primary DRI will take responsibility for deployment to production. The DRI will ensure there\u2019s proper deployment documentation, automation, and validation. After deployment, they\u2019ll review the results and service state. This practice will also reduce access to potentially sensitive information from a broad team to a single individual, which is a pattern that\u2019s in alignment with Sarbanes-Oxley (SOX) compliance.<\/p>\n Since adopting the DRI role in our agile teams, we\u2019ve experienced many benefits, including improved service quality and customer experiences, career growth for team members, and greater readiness for DevOps within our teams.<\/p>\n With a DRI proactively investigating internal exceptions and ticket trends, our teams have been resolving bugs during each sprint. This has improved our customers\u2019 experiences and reduced exception and ticket trends week over week. The following screenshots show ticket trends for our payee management team, which has a rotating DRI role. A recent period showed a 50 percent reduction. Year over year, we had a 30 percent reduction in tickets.<\/p>\n Addressing defects broadly across the team has put greater focus on quality and our bug backlog. Payee management is now experiencing a shallow bug backlog (less than 30 new\/active bugs).<\/p>\n When team members participate as a DRI, they gain knowledge about the end-to-end service. The DRI is responsible for understanding the service in full, the customer experience, and how the service is enabling business outcomes. 
This increased broad focus makes team members more accountable to deliver high-quality customer experiences and is driving richer designs.<\/p>\n Each DRI brings a unique lens and different values to the role. This diversity in focus is helping the team improve the service in many areas. For example, one of our DRIs discovered that a service wasn\u2019t running in the same region as our data store. This pattern didn\u2019t exist in pre-production and may not have been noticed without the DRI function. The team now has a backlog item to redeploy the service to the same region as the customer, which will reduce latency and improve the customer experience.<\/p>\n Previously, when our customers encountered defects, they would retry their task or use known workarounds. Today, the DRI proactively identifies blocking issues and fixes defects before the customer escalates them. In some cases, the DRI logs support tickets before the customer is even aware that an issue exists. This is dramatically improving our mean time to detect (MTTD) and mean time to resolve (MTTR) metrics.<\/p>\n Our software engineers find that working within the DRI role is a rewarding experience. They\u2019re developing new skills and forming new patterns of working that are increasing their impact and relevance, with the following benefits:<\/p>\n The DRI rotation is building DevOps basics in our teams: from telemetry analysis and instrumentation to deployment into production. After each team member has been the DRI for a few rotations, they\u2019re better suited for aggressive DevOps responsibilities and patterns of working.<\/p>\n","protected":false},"author":133,"featured_media":9834,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"_hide_featured_on_single":false,"_show_featured_caption_on_single":true,"footnotes":""},"categories":[1],"tags":[238,111],"coauthors":[646],"class_list":["post-9792","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-culture","tag-devops","program-microsoft-digital-technical-stories","m-blog-post"],"yoast_head":"\nDRI process and expectations<\/h2>\n
DRI role rotation<\/h3>\n
Sprint capacity<\/h3>\n
Incident management<\/h3>\n
\n
High-severity issues<\/h3>\n
Less need for supporting teams<\/h2>\n
Sustaining engineering<\/h3>\n
Release management<\/h3>\n
Key results<\/h2>\n
Service quality<\/h3>\n
Improved customer experiences<\/h3>\n