David Bills, Author at Microsoft Security Blog

Measure Twice, Cut Once, With RMA Methodology

David Bills — Tue, 09 Sep 2014 16:02:00 +0000

I’ve been beating our drum for a while now about the inevitability of failure in cloud-based systems. Simply put, the complexities and interdependencies of the cloud make it nearly impossible to avoid service failure, so instead we have to go against our instincts and actually design for this eventuality.

Once you accept this basic premise, the next question is how exactly we need to change our design processes. The Resilience Modeling and Analysis (RMA) methodology is a key part of the answer.

RMA brings the master carpenter’s “measure twice, cut once” philosophy to engineering. The goal is to help ensure teams think through as many of the potential reliability-related issues as possible before committing code to production. The aim is not to prevent every single failure mode, but to limit the impact failures have on customers when they occur.

To be clear, RMA is deeper and broader than basic fault modeling and root-cause analysis. Adapted from the industry-standard technique known as Failure Mode and Effects Analysis (FMEA), RMA is a four-phase process:

Pre-work: Diagram your resources, dependencies, and component interactions.
Discover: Identify potential failures and resilience gaps for each interaction identified in the pre-work phase.
Rate: Perform an impact analysis of the potential failures you’ve identified.
Act: Invest in and produce work items to improve resilience.

By working through these four phases, teams can gain a more detailed understanding of where known failure points are, what the impact of known failure modes is likely to be, and where to target engineering investments to help mitigate the highest-priority risks.
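
As a rough illustration of the Rate and Act phases, the sketch below scores a handful of failure modes and sorts them so the riskiest come first. The scoring rule (likelihood times impact on 1 to 5 scales), the `FailureMode` class, and the sample interactions are all assumptions made for this example; RMA defines its own rating scheme, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    interaction: str   # component interaction from the pre-work diagram
    description: str   # what could go wrong (Discover phase)
    likelihood: int    # 1 (rare) .. 5 (frequent); hypothetical scale
    impact: int        # 1 (minor) .. 5 (severe); hypothetical scale

    @property
    def risk(self) -> int:
        # Assumed scoring rule for illustration: likelihood x impact.
        return self.likelihood * self.impact


def prioritize(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Hypothetical failure modes for a small web service.
modes = [
    FailureMode("web -> auth service", "token validation times out", 3, 5),
    FailureMode("web -> image cache", "cache node unreachable", 4, 2),
    FailureMode("api -> database", "connection pool exhausted", 2, 5),
]

ranked = prioritize(modes)
for m in ranked:
    print(f"{m.risk:>2}  {m.interaction}: {m.description}")
```

The top of the ranked list is where work items for the Act phase would most plausibly be filed first.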

Service teams who have worked through this process tell us that one of the key outcomes is spending less post-deployment time managing and responding to live-site issues. Tightening the focus to reducing the impact of the most likely failures reclaims time to spend on the fun stuff, like developing customer-facing innovations.

The post Measure Twice, Cut Once, With RMA Methodology appeared first on Microsoft Security Blog.

Reliability Series #1: Reliability vs. resilience

David Bills — Mon, 24 Mar 2014 15:52:00 +0000

Whenever I speak to customers and partners about reliability I’m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online, at a time convenient to you. As an organization – or a provider of a service – you want your customers to carry out the tasks they want to, whenever they want to do so.

This article is the first in a four-part series on building a resilient service. In my first two posts, I will discuss the topic as it relates to business strategy, and then we’ll dive deeper into the technical details. The full series of four posts will cover:

1.   Reliability vs. resilience – What is the difference between reliability and resilience and why does it matter?
2.   Common reliability-related threats – DIAL (Discovery, Incorrectness, Authorization/Authentication, Limits/Latency) is a handy mnemonic to help teams brainstorm potential failures of interactions between components for their service in a structured way. Brainstorming about failure modes and failure points is a key phase in resilience modeling and analysis (RMA) and can help teams improve the reliability of their service.
3.   Reliability-enhancing techniques – Taking the “D” and “A” in DIAL, we’ll look at some reliability-enhancing techniques you can incorporate into your design related to discovery and authentication.
4.   Reliability-enhancing techniques – Taking the “I” and “L” in DIAL, we’ll look at some reliability-enhancing techniques you can incorporate into your design related to incorrectness and limits.
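
To make the DIAL mnemonic from item 2 concrete, here is a minimal sketch that pairs every component interaction with each DIAL category to produce a brainstorming worksheet. The interaction names and the `brainstorm_prompts` helper are hypothetical; only the four category names come from the outline above.

```python
from itertools import product

# The four DIAL categories named in the series outline.
DIAL = {
    "D": "Discovery",
    "I": "Incorrectness",
    "A": "Authorization/Authentication",
    "L": "Limits/Latency",
}

# Hypothetical component interactions; replace with your own diagram.
interactions = ["web -> auth service", "api -> database"]


def brainstorm_prompts(interactions, categories=DIAL):
    """Pair every interaction with every DIAL category as a prompt."""
    return [
        f"{interaction} | {name}: what could fail here?"
        for interaction, name in product(interactions, categories.values())
    ]


prompts = brainstorm_prompts(interactions)
for p in prompts:
    print(p)
```

Walking the worksheet row by row keeps the brainstorm structured, so no interaction is left without a question in each of the four categories.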

My intention is to provide insight into how Microsoft thinks about reliability and the processes and techniques we’re employing to improve the reliability of our services for our customers.

So what is reliability? When I ask customers and partners, the most common responses refer to consistency in performance, speed, availability – and perhaps most significantly – resilience. One thing we all agree on is that for a system or service to be reliable, the user has to believe ‘it just works’.

The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society states reliability [engineering] is “a design engineering discipline which applies scientific knowledge to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support the system through its total lifecycle.” For software, it defines reliability as “the probability of failure-free software operation for a specified period of time in a specified environment.”
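
One common way to put a number on the software definition is the exponential model, which assumes a constant failure rate. That model and the sample figures below are assumptions for this sketch, not part of the IEEE definition quoted above.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of failure-free operation for `hours`.

    Assumes a constant failure rate (exponential model):
    R(t) = exp(-rate * t).
    """
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical figure: one failure per 1,000 hours on average.
rate = 1 / 1000
for hours in (24, 168, 720):
    print(f"{hours:>4} h: {reliability(rate, hours):.3f}")
```

Under this assumption the probability of failure-free operation falls as the specified period gets longer, which is why the definition ties reliability to a stated duration and environment.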

A reliable cloud service is essentially one that functions as the designer intended it to, when it is expected to, and wherever the customer is connected. That’s not to say every component must operate flawlessly 100 percent of the time. This last point brings us to what I believe is the difference between reliability and resiliency.

Reliability is the outcome cloud service providers strive for – it’s the result. Resiliency is the ability of a cloud-based service to withstand certain types of failure and yet remain functional from the customer perspective. In other words, reliability is the outcome and resilience is the way you achieve the outcome. A service could be characterized as reliable simply because no part of the service has ever failed, and yet the service couldn’t be regarded as resilient because its reliability-enhancing capabilities may never have been tested.
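
The distinction can be sketched in code: a service that falls back to its last known value when a dependency fails, plus a switch for injecting that failure so the fallback actually gets exercised. The `FlakyDependency` and `PriceService` classes are hypothetical and exist only for this example.

```python
class FlakyDependency:
    """Hypothetical downstream service with a switch for injecting faults."""

    def __init__(self):
        self.fail = False

    def fetch_price(self, item: str) -> float:
        if self.fail:
            raise ConnectionError("injected fault")
        return 9.99


class PriceService:
    """Serves prices, falling back to the last known value on failure."""

    def __init__(self, dependency):
        self.dependency = dependency
        self.last_known = {}

    def get_price(self, item: str):
        try:
            price = self.dependency.fetch_price(item)
            self.last_known[item] = price
            return price, "live"
        except ConnectionError:
            if item in self.last_known:
                return self.last_known[item], "stale"
            raise  # no fallback available; surface the failure


dep = FlakyDependency()
service = PriceService(dep)

healthy = service.get_price("widget")   # dependency is up
dep.fail = True                         # inject the fault
degraded = service.get_price("widget")  # served from the last known value
print(healthy, degraded)
```

Without the injected fault, the fallback path would never run, and the service would look reliable without anyone knowing whether it is resilient.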

The key takeaway here is the importance of focusing on resilience and designing and building resiliency into your service at every stage of the software development lifecycle. To find out more about the fundamentals of building a reliable online service, read our whitepaper ‘An introduction to designing reliable cloud services’.

Next up: Reliability Series #2: Categorizing reliability threats to your service


Fundamentals of Cloud Service Reliability

David Bills — Thu, 13 Sep 2012 03:26:00 +0000

As the adoption of cloud computing continues to rise, and customers demand 24/7 access to their services and data, reliability remains a challenge for cloud service providers everywhere. As I said in the recent Cloud Fundamentals video on reliability, it’s not a matter of if an outage will occur; it’s strictly a matter of when. This means it’s critical for organizations to understand how best to design and deliver reliable cloud services. Microsoft manages a cloud-based infrastructure supporting more than 200 services, 1 billion customers, and 20 million businesses in more than 76 markets worldwide. So we understand what it takes to build and deliver highly reliable cloud platforms, solutions, and services that are secure and private.

Today Microsoft has released a new whitepaper titled “An introduction to designing reliable cloud services”. This paper describes how Microsoft is thinking about key reliability concepts, and outlines a reliability design and implementation process that might be useful for organizations that create, deploy, or consume cloud services. Designing and delivering reliable services is complex, and this paper aims to be the catalyst for further discussions among service teams, organizations, and the industry itself.

If we accept the fact that failures will occur, then the outcomes organizations may want to consider in relation to their cloud services fall into four main categories:

Maximize service availability to customers. Make sure the service does what the customer wants, when they want it, as much of the time as possible.
Minimize the impact of any failure on customers. Assume something will go wrong and design the service so that the non-critical components fail first while the critical components keep working. Isolate the failure as much as possible so the minimum number of customers is impacted. And if the service goes down completely, focus on reducing the amount of time any one customer cannot use the service at all.
Maximize service performance. Reduce the impact to customers at times when performance may be negatively impacted, such as during an unexpected spike in traffic.
Maximize business continuity. Focus on how the organization and the service respond when a failure occurs. Automate recovery where possible, and carry out disaster recovery drills to ensure the organization is fully prepared to deal with the inevitable failure.
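
The second category, letting non-critical components fail while critical ones keep working, might look something like the sketch below. The `build_page` helper and the store components are hypothetical; the point is only that each component is marked critical or not, and failures are handled differently on that basis.

```python
def build_page(components):
    """Assemble a response from (name, critical, loader) tuples.

    Non-critical failures are isolated and reported as degraded;
    a critical failure is allowed to propagate to the caller.
    """
    page, degraded = {}, []
    for name, critical, loader in components:
        try:
            page[name] = loader()
        except Exception:
            if critical:
                raise
            degraded.append(name)
    return page, degraded


def load_cart():
    return ["book"]


def load_recommendations():
    raise TimeoutError("recommendation service is slow")


# Hypothetical components for an online store page.
components = [
    ("cart", True, load_cart),
    ("recommendations", False, load_recommendations),
]

page, degraded = build_page(components)
print(page, degraded)
```

Here the customer still sees the cart even though recommendations timed out, and the `degraded` list gives operators a signal to act on.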

Please download the whitepaper to read more about the fundamental concepts of reliability and gain insight into how Microsoft is thinking about this topic and how organizations and teams might work together to improve cloud service reliability.
