{"id":435450,"date":"2017-10-31T07:19:25","date_gmt":"2017-10-31T14:19:25","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=435450"},"modified":"2017-11-06T11:03:48","modified_gmt":"2017-11-06T19:03:48","slug":"eliminating-network-downtime","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/eliminating-network-downtime\/","title":{"rendered":"Microsoft Azure and Microsoft Research take giant step towards eliminating network downtime"},"content":{"rendered":"

<\/p>\n

At the 26th ACM Symposium on Operating Systems Principles, better known as SOSP 2017, my colleagues described a new technology called CrystalNet, a high-fidelity, cloud-scale emulator that helps network engineers nearly eliminate network downtime caused by routine maintenance and upgrades, software bugs and human errors. A collaboration between Microsoft Azure and Microsoft Research teams, CrystalNet applies two years\u2019 worth of research to create an emulator hardened by the Azure network engineers who operate one of the largest networks on the planet. The result is a first-of-its-kind set of tools that significantly decreases network downtime and increases availability for Azure customers.<\/p>\n

Cloud networks are constantly growing and evolving. They are massive, immensely complex production networks that interconnect hundreds of thousands of servers using millions of wires and tens of thousands of network devices, sourced from dozens of vendors and deployed across the globe with stringent requirements for reliability, security and performance. It\u2019s imperative to continually monitor the pulse of these networks, to detect anomalies and faults, and to drive recovery at the millisecond level, much like monitoring a living organism. Networks are essentially the cloud: they are the core infrastructure that holds up cloud services and helps deliver availability across other key services such as compute and storage. However, cloud providers currently have few tools at their disposal to proactively foresee the impact of planned updates and changes, or of a bug in the system. Before making any changes to the network, engineers can run the proposed changes through our verification tools, which check the impact of the configuration changes before green-lighting them for deployment in our production networks.<\/p>\n

Emulating a Cloud-Scale Network
\n<\/strong>The idea of testing before deploying is age-old, but following a two-year study by Microsoft Research of all documented outages across all major cloud providers, we believed that we could find most potential problems proactively if we first validated changes on an identical copy of the production network. By identical copy, we mean one that uses the same network topology, hardware, software and configurations as our production network. Using the cloud to build the cloud is a common design pattern for creating and running an enterprise high-performance cloud, but replicating the physical hardware of the entire network is too expensive, and where would we put it? To solve this problem more cost-efficiently, we run the network\u2019s software and configurations on virtualized hardware that is interconnected exactly as in the production network. In effect, we create a large-scale, high-fidelity network emulator that allows Azure engineers to validate planned changes and gauge the impact of various updates and failure scenarios. We call our network emulator \u201cCrystalNet,\u201d as in seeing the future of your network through a crystal ball.<\/p>\n


Figure 1: The architecture of CrystalNet<\/p><\/div>\n

CrystalNet faithfully emulates large-scale production networks to significantly decrease network incidents caused by software bugs, misconfigurations and human errors. Its unique properties include:<\/p>\n