{"id":5962,"date":"2020-12-08T13:13:09","date_gmt":"2020-12-08T21:13:09","guid":{"rendered":"https:\/\/www.microsoft.com\/insidetrack\/blog\/?p=5962"},"modified":"2023-06-08T12:55:30","modified_gmt":"2023-06-08T19:55:30","slug":"microsoft-adopts-proactive-method-for-preventing-and-mitigating-failures","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/insidetrack\/blog\/microsoft-adopts-proactive-method-for-preventing-and-mitigating-failures\/","title":{"rendered":"Microsoft adopts proactive method for preventing and mitigating failures"},"content":{"rendered":"
This content has been archived, and while it was correct at time of publication, it may no longer be accurate or reflect the current situation at Microsoft.<\/p>\n<\/div>\n<\/div>\n
System crashes and other failures have long been a fact of life in the IT world.<\/p>\n
Microsoft has always worked hard to ensure that its customers have reliable software tools, but it experiences problems as well. To reduce the severity and length of system failures, Microsoft Digital core, the engineering organization at Microsoft that builds and manages the products, processes, and services that Microsoft runs on, has created a standard set of engineering tools to identify where failures might occur and how to address them.<\/p>\n
This approach is called Failure Mode and Effects Analysis (FMEA). Microsoft uses FMEA to recognize potential failure risks, understand their impact, and mitigate them before failures occur, rather than reacting afterward.<\/p>\n
FMEA is paired with Service Quality Portal (SQP), a tool that combines Microsoft SharePoint, Microsoft Visio, and Microsoft Azure-based cloud applications. Built by an engineering team within Microsoft, SQP provides full risk-management capability using the FMEA framework and makes it much easier to track and mitigate the complex sets of failures that can occur in the cloud.<\/p>\n
\u201cIt\u2019s a matter of combining people, processes, and technology,\u201d says Harsh Sharma, a senior program manager for Microsoft. \u201cSQP and FMEA provide the framework we need to make our systems more secure and stable.\u201d<\/p>\n
[Learn how Microsoft created a telemetry platform to uncover information about end-to-end enterprise health with Microsoft Azure.<\/em><\/a> Find out how Microsoft monitors SAP end to end on Microsoft Azure.<\/em><\/a> Learn about how Microsoft has created a modern data-governance strategy.<\/em><\/a>]<\/p>\n Once risks are identified, engineers can set up automated detection to identify and mitigate failures, replacing the human decision-making used previously.<\/p>\n This new approach addresses how computing faults have evolved over the past 20 years.<\/p>\n Back then, an IT failure was invariably a local event, limited to a handful of PCs or an enterprise\u2019s central servers. But today, with most computing taking place in the cloud, IT infrastructure is widely distributed. It\u2019s also often built with commodity hardware that depends on an array of third-party and partner services.<\/p>\n \u201cCloud services are complex,\u201d Sharma says. \u201cThey have a lot of moving pieces and need a lot of scalability. Owners of services may even have to rebuild their cloud architecture on occasion.\u201d<\/p>\n Several years back, faults were outlined and tracked on a Microsoft Excel spreadsheet. Extrapolate this to more than 2,000 service components in Microsoft Digital, and that\u2019s a lot of spreadsheets.<\/p>\n To get around this, engineers use the SQP tool and FMEA method to store and maintain architecture diagrams and perform risk management before every major release.<\/p>\n Some failures are minor: perhaps a set of users can\u2019t sign in to an app, or the app has limited functionality. 
Others are more serious: widespread outages that shut down critical services, such as email or access to important data, for multiple cloud customers.<\/p>\n The causes of these failures can vary widely as well, from natural disasters such as hurricanes to human error and hardware or software faults.<\/p>\n A new way to manage failures<\/strong><\/p>\n The old ways of managing failures simply don\u2019t work any longer. And for good reason.<\/p>\n In the past, \u201cfailure management\u201d meant listing every imaginable error in the aforementioned Microsoft Excel spreadsheet, along with information about how to respond. It\u2019s as if a fire department listed potential fire hazards in every house, then did nothing until the smoke alarms went off.<\/p>\n \u201cToday we just can\u2019t be in firefighting mode all the time,\u201d Sharma says. \u201cWe\u2019re committed to 99.9 percent reliability\u2014sometimes 99.999 percent. So, it\u2019s a requirement to proactively identify potential failures.\u201d<\/p>\n That has meant a shift in what successful fault mitigation looks like.<\/p>\n Rather than focusing on extending the time between failures, Microsoft Digital aims to reduce the time to recover. Complex systems are prone to a wide range of failures, so the best strategy is to cope with them in a way that minimizes impact on customers.<\/p>\n That\u2019s where FMEA comes in.<\/p>\n It prioritizes work on detecting, mitigating, and recovering from failures, all of which reduce the time needed to correct a failure. 
Using this approach, engineering teams think through potential reliability weaknesses and are prepared when failures occur, greatly reducing impact on users.<\/p>\n One big part of that is using the design phase of a new service or product to understand how it might fail.<\/p>\n \u201cWe want to identify potential problems and poke holes in the service before something is deployed in production,\u201d Sharma says. \u201cYou want to understand the types of failures that could occur, how it would impact a business, and what could be the cause. And you want to have telemetry in place so that you can be alerted to a failure before a customer tells you about it.\u201d<\/p>\n Not that doing so is easy.<\/p>\n Microsoft engineers working with FMEA principles need to diagram a myriad of dependencies, determine how much redundancy a particular system has or needs, and understand how different parts of a cloud service interact. Interestingly, hardware components such as disks, processors, and routers are not given substantial attention. Their potential faults are already well understood and relatively easy to trace and fix.<\/p>\n Moving to the cloud has pluses and minuses for managing failures.<\/p>\n The cloud certainly presents a more complicated IT picture than centralized IT did. But it also creates its own backup.<\/p>\n We\u2019re getting in front of issues before they become a big deal. Now we\u2019re able to have conversations based on identifying failure points where we want to invest in new designs.<\/p>\n – Dale Voth, site reliability engineering manager<\/p>\n<\/blockquote>\n