{"id":551046,"date":"2018-11-20T09:01:06","date_gmt":"2018-11-20T17:01:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=551046"},"modified":"2018-11-20T09:01:06","modified_gmt":"2018-11-20T17:01:06","slug":"hyperscale-cloud-reliability-and-the-art-of-organic-collaboration","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/hyperscale-cloud-reliability-and-the-art-of-organic-collaboration\/","title":{"rendered":"Hyperscale cloud reliability and the art of organic collaboration"},"content":{"rendered":"

\"networking<\/p>\n

What does it take to build one of the most reliable hyperscale clouds on the planet? It clearly requires astronomical investments and a vast organization that operates at global scale in near seamless coordination. Yet the breakthroughs that fuel this story emerged organically, from a combination of innovation and mentoring relationships that grew into close friendships, and initiative taking\u2014recognizing and seizing opportunities regardless of where they existed in an organization. We will take a look at the relationship between creativity and technology along with the happy coincidences that seed the organic collaboration at Microsoft making a breakthrough possible in a relatively new space, that of network verification and hyperscale cloud reliability in Azure.<\/p>\n

This story goes back several years and includes seasoned researchers, fresh talent, and research players across industry and academia. First, some background on the reality of networks in the present day.<\/p>\n

It can be said today that the internet is finally fast, something a lot of people worked hard to realize. As the world transitioned to services, everything people do has wound up in the cloud\u2014that global sprawl consisting of millions upon millions of affordable servers, all connected by a network\u2014and the need arose to make networks super reliable. \u201cThis was the inflection point,\u201d said George Varghese, Chancellor\u2019s Professor of Computer Science at UCLA, formerly Principal Researcher at Microsoft Research, who played a large role in this story. \u201cNetworks went from being a \u2018nice to have\u2019 to a must-have, like electricity. You\u2019ve got to have them working.\u201d<\/p>\n

Against this backdrop hummed an open culture around the product and research divisions at Microsoft and a community-wide drive in the active area of network verification. But this story is also a personal journey, and one that happens to involve applying symbolic tools to the Azure hyperscale network environment.<\/p>\n

\u201cCommunity leaders with open mindsets have promoted adopting methods traditionally developed in program analysis and verification and applying them to networks. They propose that the entire network can be thought of as a software program, asking, \u2018Can we use program analysis to speak to these networks?\u2019\u201d explained Principal Researcher Nikolaj Bj\u00f8rner of Microsoft Research.<\/p>\n

Which brings us to intent-based networking.<\/p>\n

Intent-based networking<\/h3>\n

Intent-based networking is the idea of automatically driving networks to a desired state from high-level policies. The emergence of hyper-scalable clouds and software-defined network orchestration has brought acute relevance to this area. In recent years, networking professionals have increasingly been embracing network verification to empower complex yet reliable networks.<\/p>\n

\n

\u201cThe time is ripe for software-defined networking, inspired by design methodologies in software and hardware engineering.\u201d \u2013 Nikolaj Bj\u00f8rner<\/strong><\/p>\n<\/blockquote>\n

Network verification tools such as Microsoft open-source Z3 take an abstract view of networks as artifacts that can be analyzed. This may be accomplished through symbolic verification, simulation, or emulation on virtual machines. A group of collaborators at Microsoft decided to apply symbolic tools to use within the Azure hyperscale network environment.<\/p>\n

\"Principal

Principal Researcher Nikolaj Bj\u00f8rner from Microsoft Research and Karthick Jayaraman on the Microsoft Azure team<\/p><\/div>\n

When can you start?<\/h3>\n

Karthick Jayaraman joined the Microsoft Azure team in 2011 fresh out of graduate school at Syracuse University. During his graduate study, he created the Mohawk symbolic model checker for role-based access control (RBAC) policies. One of his collaborators, professor Vijay Ganesh from the University of Waterloo, was visiting Microsoft Research to deliver an update on his latest research. Wandering over to Microsoft Building 99\u2014Microsoft Research home turf\u2014to greet an old friend, Jayaraman wound up making a new connection; Ganesh introduced Jayaraman to Nikolaj Bj\u00f8rner of Microsoft Research. It would prove to be a serendipitous encounter. Curious about what Jayaraman\u2019s work at Azure entailed, Bj\u00f8rner plied him with questions. Jayaraman\u2019s curiosity in return was focused on finding out the best tools that might exist to help him solve challenges in Azure networking.<\/p>\n

Jayaraman\u2019s work in Azure exposed him to the challenge of maintaining network ACLs\u2014policies that determine which traffic may flow in and out of a network. Network ACLs are ubiquitous in network management, used to protect private and corporate networks. They are used to limit what devices can talk to each other. Policies are captured as rules that operate at the level of bits that are set in packet headers, such as source and destination IP addresses. When a new set of servers is deployed with new IP addresses, network policies are updated to account for the new endpoints. But with policies described as a sequence of bit patterns, the complexity of manually updating thousands of devices is elusive to even the grandest master of complexity.<\/p>\n

\n

\u201cMicrosoft is a fantastic place to work! A few months into my job, I was working with Nikolaj, an eminent researcher, on applying software verification techniques for an important operational problem of managing firewall policies in the cloud. I couldn\u2019t have asked for more as a college hire.\u201d \u2013 Karthick Jayaraman<\/strong><\/p>\n<\/blockquote>\n

Jayaraman sensed that tools designed with verifying software in mind could be applied to the software-defined networks of Azure. He began to include colleagues from Microsoft Research in processes traditionally used for managing networks. A lasting professional partnership and camaraderie was formed.<\/p>\n

\u201cNikolaj took a personal interest in mentoring me,\u201d recalled Jayaraman. \u201cDuring my years at Microsoft, much has changed; but what hasn\u2019t changed is my relationship with Nikolaj and the kinds of problems we solve together, and this speaks volumes about Microsoft culture in general and highlights how these organic relationships and opportunity moments are the common threads.\u201d<\/p>\n

\"\"

Andrew Helwer, Software Engineer
on the Microsoft Azure team<\/p><\/div>\n

Connecting network policies to logical formulas that capture their intent has been a recurring topic in networking research, but prior tools were ad hoc, written for a specialized format and hardly extensible. Microsoft Research\u2019s satisfiability modulo theories solver Z3 is a state-of-the-art theorem prover that is specifically tailored to capture domains that are found commonly in software and hardware descriptions. It is used prominently in software verification, testing, and analysis. By 2012, network verification was an area of nascent interest, and the Azure network was going through early stages of build-out. It quickly became evident that ACLs could be expressed directly as logical formulas and that the machinery that reasons about such formulas was well-suited and sufficiently efficient for checking properties of ACLs. Jayaraman and Bj\u00f8rner developed the SecGuru tool, replacing manual what-if policy reviews with automated analysis for ACL updates. Despite their ostensible job titles and roles, the two developed a habit of meeting Sunday afternoons at caf\u00e9s in nearby Woodinville or in the Commons on campus to exchange technical ideas. SecGuru rolled out, first as a service to catch update errors in existing firewall policies in Azure datacenters and later as a part of the massive rollout process for new Azure containers. It identified a sweet spot in network verification\u2014the tool provided a library of routines to ingest and maintain security policies. Because policies are managed at several different levels\u2014within datacenters, on customer virtual networks, and on all traffic that enters and leaves corporate networks\u2014SecGuru found a multitude of uses. Over the years, it adapted to existing scenarios and new versions were developed. With SecGuru as an inspiration, their colleague Andrew Helwer developed a self-contained tool for Windows firewalls, now open source as part of the Z3 GitHub (opens in new tab)<\/span><\/a> repository. It is a good example of an advanced network verification tool one can develop using Z3.<\/p>\n

\n

\u201cIf you view a whole network as a program, can you use program analysis to tweak these networks?\u201d \u2013 George Varghese<\/strong><\/p>\n<\/blockquote>\n

Confluences<\/h3>\n

Meanwhile, George Varghese had recently joined the Networking Research group at Microsoft when he reached out to peers from software engineering. Varghese, along with collaborators from Stanford University, had developed a method for verifying flow properties in networks by using header space data structures that could efficiently simulate networks on a massive number of flows. It was an exciting development that would form the basis of the Palo Alto-based Forward Networks intent-based networking startup and represented a clear example of what could be achieved by exploiting confluences between techniques from design automation and software-defined networking. Varghese created awareness of this insight with peers in the networking and verification communities; this led to several collaborations, including a close partnership with Bj\u00f8rner. Also drawn to the potential of the research was Nuno Lopes, an intern at the time and now a researcher at Microsoft Research Cambridge in the U.K. Lopes\u2019 internship project had created a general-purpose engine that could perform efficient header space operations inspired by Varghese\u2019s earlier work. The engine involved Cambridge-based researcher Andrey Rybalchenko and professor Gordon Plotkin of The University of Edinburgh, well-known for seminal work on programming language semantics. Plotkin instigated exploiting symmetries for scaling network verification.<\/p>\n

\"George

George Varghese, Chancellor\u2019s Professor of Computer Science at UCLA, formerly Principal Researcher at Microsoft Research<\/p><\/div>\n

Symmetries are a basis for process calculi used to model concurrency, and they are inherent in cloud networks built with a redundant set of routers. The networking research community continues to embrace network verification with a welcoming culture, whether for data-plane validation, control plane verification, verifying programs that perform packet processing\u2014noteworthy in the P4 programming language\u2014or verifying networking stacks. Several startups in the network verification space have emerged, including Veriflow and Intentionet, addressing needs of enterprises for administrating their global networks, and other major cloud providers are now also embracing network verification.<\/p>\n

\u201cThe joy is, first of all, it\u2019s useful; it helps Azure avoid bugs and keep going. But it\u2019s also intellectually stimulating because it\u2019s not simply using verification tools that already exist, but it\u2019s inventing new ones specialized for networking. And Nikolaj and Nuno and Andrey and Gordon and I have all enjoyed that a lot. We contributed not just to engineering but to science,\u201d said Varghese.<\/p>\n

\u201cGeorge has inspired and mentored me in so many respects. He made the point that throughout history, when somewhat unrelated areas of fields are brought together, a new starting point is created,\u201d said Bj\u00f8rner.<\/p>\n

Varghese used the concept of abstract art as an analogy, pointing out how the development of photography influenced the painters of the day in that they began to find the accurate capturing of nature less appealing, giving rise to the new genre. Such stories also exist in technology, said Bj\u00f8rner, and the arrival of a new technology will see two parties benefiting from such confluences. \u201cThe networking world benefits from verification methodologies to understand what it means to build these reliable networks with guarantees. The verification community in turn is learning tricks of the trade and domain-specific trades that will teach us what it means to build efficient tools or create efficient algorithms for these networks.\u201d<\/p>\n

Onward<\/h3>\n

Meanwhile, SecGuru had been extended to verifying network configurations beyond ACLs, checking reachability properties in Azure datacenters. SecGuru scales these checks to 100Ks routers, involving trillions of network paths. The checks are done on demand and in a matter of minutes for all Azure routers. It owes the efficiency to a domain insight Jayaraman developed when reviewing the datacenter architecture of Azure. The architecture uses a highly regular structure in which each router has a fixed role for a configurable set of address ranges. It suffices verifying that the roles are enforced on each router and global reachability properties follow as consequences of overall network properties. The issues that the verifier unearths include configuration bugs, software bugs, faulty hardware, and errors introduced during troubleshooting incidents. The sources for intent drift are varied and beyond the control of any single tool, but network verification can catch such drift regardless of its root cause.<\/p>\n

\n

\u201cAssuring all changes and that the reality of the network always preserves our intent is critical for reliability. Reliability is critical for earning customer trust.\u201d \u2013 Karthick Jayaraman<\/strong><\/p>\n<\/blockquote>\n

Azure cloud services exist in an ecosystem of constant change. Daunting amounts of new capacity are constantly deployed along with constantly emerging new technologies and infrastructure. It is an exciting time for networking research and synergies with formal methods. New uses for formal methods applied to network and configuration verification emerge. As a fresh hire, Jayaraman forged collaborations with researchers from different backgrounds and helped fuel cloud-wide innovation. It\u2019s a testament to a culture at Microsoft that fosters collaboration and encourages creativity. Stay tuned.<\/p>\n","protected":false},"excerpt":{"rendered":"

What does it take to build one of the most reliable hyperscale clouds on the planet? It clearly requires astronomical investments and a vast organization that operates at global scale in near seamless coordination. Yet the breakthroughs that fuel this story emerged organically, from a combination of innovation and mentoring relationships that grew into close […]<\/p>\n","protected":false},"author":37074,"featured_media":551190,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[194463],"tags":[],"research-area":[13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-551046","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-systems","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[171010],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"\"networking","byline":"","formattedDate":"November 20, 2018","formattedExcerpt":"What does it take to build one of the most reliable hyperscale clouds on the planet? It clearly requires astronomical investments and a vast organization that operates at global scale in near seamless coordination. Yet the breakthroughs that fuel this story emerged organically, from a…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/551046"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=551046"}],"version-history":[{"count":16,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/551046\/revisions"}],"predecessor-version":[{"id":551967,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/551046\/revisions\/551967"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/551190"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=551046"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=551046"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=551046"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=551046"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=551046"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=551046"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=551046"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=551046"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=551046"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=551046"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=551046"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}