{"id":784,"date":"2014-03-24T08:52:00","date_gmt":"2014-03-24T15:52:00","guid":{"rendered":"http:\/\/marcbook.local\/wds\/playground\/cybertrust\/2014\/03\/24\/reliability-series-1-reliability-vs-resilience\/"},"modified":"2023-05-16T10:57:04","modified_gmt":"2023-05-16T17:57:04","slug":"reliability-series-1-reliability-vs-resilience","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2014\/03\/24\/reliability-series-1-reliability-vs-resilience\/","title":{"rendered":"Reliability Series #1: Reliability vs. resilience"},"content":{"rendered":"
Whenever I speak to customers and partners about reliability, I\u2019m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online, at a time convenient to you. As an organization \u2013 or a provider of a service \u2013 you want your customers to be able to carry out the tasks they want, whenever they want.<\/p>\n
This article is the first in a four-part series on building a resilient service. In my first two posts, I will discuss the topic as it relates to business strategy, and then we’ll dive deeper into the technical details. The full series of four posts will cover:<\/p>\n
1. Reliability vs. resilience<\/strong> \u2013 What is the difference between reliability and resilience and why does it matter?
\n2. Common reliability-related threats<\/strong> \u2013 DIAL (Discovery, Incorrectness, Authorization\/Authentication, Limits\/Latency) is a handy mnemonic to help teams brainstorm, in a structured way, potential failures in the interactions between the components of their service. Brainstorming about failure modes and failure points is a key phase in resilience modeling and analysis (RMA) and can help teams improve the reliability of their service.
\n3. Reliability-enhancing techniques<\/strong> \u2013 Taking the \u201cD\u201d and \u201cA\u201d in DIAL, we\u2019ll look at some reliability-enhancing techniques you can incorporate into your design related to discovery and authentication.
\n4. Reliability-enhancing techniques<\/strong> \u2013 Taking the \u201cI\u201d and \u201cL\u201d in DIAL, we\u2019ll look at some reliability-enhancing techniques you can incorporate into your design related to incorrectness and limits.<\/p>\n
My intention is to provide insight into how Microsoft thinks about reliability and the processes and techniques we\u2019re employing to improve the reliability of our services for our customers.<\/p>\n So what is reliability? When I ask customers and partners, the most common responses refer to consistency in performance, speed, availability \u2013 and perhaps most significantly \u2013 resilience. One thing we all agree on is that for a system or service to be reliable, the user has to believe \u2018it just works\u2019.<\/p>\n The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society<\/a> states that reliability [engineering] is \u201ca design engineering discipline which applies scientific knowledge to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support the system through its total lifecycle.\u201d For software, it defines reliability as \u201cthe probability of failure-free software operation for a specified period of time in a specified environment.\u201d<\/p>\n A reliable cloud service is essentially one that functions as the designer intended it to, when it is expected to, and wherever the customer is connected. That\u2019s not to say every component must operate flawlessly 100 percent of the time. This last point brings us to what I believe is the difference between reliability and resiliency.<\/p>\n Reliability is the outcome cloud service providers strive for \u2013 it\u2019s the result. Resiliency is the ability of a cloud-based service to withstand certain types of failure and yet remain functional from the customer perspective. In other words, reliability is the outcome and resilience is the way you achieve the outcome. 
A service could be characterized as reliable simply because no part of the service has ever failed, and yet the service couldn\u2019t be regarded as resilient because its reliability-enhancing capabilities may never have been tested.<\/p>\n The key takeaway here is the importance of focusing on resilience and designing and building resiliency into your service at every stage of the software development lifecycle. To find out more about the fundamentals of building a reliable online service, read our whitepaper \u2018An introduction to designing reliable cloud services<\/a>\u2019.<\/p>\n