Software-driven wide area network (SWAN) is a system that enables centralized management and control of network infrastructure to improve reliability and efficiency. SWAN controls the timing and volume of traffic each service sends and automatically reconfigures the network’s data plane to match traffic demand. Over the last decade, I’ve had the opportunity to shepherd SWAN from a research idea to a foundational system for Microsoft Azure. I want to share a few thoughts to commemorate this incredible journey.
The idea for SWAN was born in 2012, when Microsoft’s mobility and networking research group sought to solve two important challenges—efficiency and flexibility of the backbone that carried traffic between Microsoft datacenters. Azure’s explosive growth created unprecedented demand for bandwidth in this backbone. Efficiency and flexibility were essential, enabling the network to offer the best possible service to every application, based on a deep understanding of its performance needs (latency-sensitive database queries versus throughput-bound storage backups), diurnal patterns, and whether demand can be time-shifted (“follow the sun”) to fully utilize the available capacity.
It became clear that traditional backbone architectures, which rely on MPLS-based traffic engineering with no coordination with the applications, would not be able to address these challenges. Decentralized resource allocation comes with fundamental limits, and hardware limitations (such as the limited number of priority queues) prevent fine-grained resource allocation across thousands of high-bandwidth applications.
We decided to explore logically centralized control for both the applications and the network. On the application side, we would control how much traffic each application would be able to send based on its demand and priority. On the network side, we would control how each switch forwarded traffic. While software-defined networking (SDN) was actively being explored in the community at the time, we were not aware of any production systems, certainly not at the scale of the Microsoft Cloud. Going down this path meant that we were sure to encounter many “unknown unknowns.” Can centralization work in a fault-tolerant manner at a truly global scale? Is the hardware ready and reliable? How would applications react to a bandwidth controller mediating access to the network? Our estimates of possible gains suggested that addressing these unknowns could be fruitful, and building something that no one had built before was exciting for us as systems researchers.
Given the risks, we approached the development of SWAN in the spirit of “fail fast,” taking on prototyping and algorithmic challenges in the order of highest risk. This approach led us to focus early on problems such as scalably computing max-min fair allocations across hundreds of applications, enforcing those allocations, working with limited memory on commodity switches, and updating the global network in a timely and congestion-free manner.
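To make the notion of a max-min fair allocation concrete, here is a minimal sketch of the textbook "water-filling" procedure for a single shared link. It is an illustration only, not SWAN's actual algorithm, which has to handle a full network topology, priorities, and many paths. The function name `max_min_fair` and the example application names and numbers are hypothetical.

```python
def max_min_fair(capacity, demands):
    """Water-filling max-min fair allocation on one shared link (illustrative).

    capacity: total bandwidth available on the link.
    demands:  dict mapping application name -> requested bandwidth.
    Returns a dict mapping application name -> allocated bandwidth.
    """
    allocation = {app: 0.0 for app in demands}
    unsatisfied = {app for app, d in demands.items() if d > 0}
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-12:
        # Offer every still-unsatisfied application an equal share.
        share = remaining / len(unsatisfied)
        # Applications whose leftover demand fits within that share.
        satisfied = {a for a in unsatisfied
                     if demands[a] - allocation[a] <= share}
        if not satisfied:
            # Nobody can be fully served: split what is left equally.
            for a in unsatisfied:
                allocation[a] += share
            remaining = 0.0
            break
        # Fully serve the small demands, then redistribute the surplus.
        for a in satisfied:
            remaining -= demands[a] - allocation[a]
            allocation[a] = demands[a]
        unsatisfied -= satisfied

    return allocation


if __name__ == "__main__":
    # Hypothetical demands (arbitrary bandwidth units) on a 10-unit link.
    print(max_min_fair(10, {"db": 2, "search": 4, "backup": 10}))
```

In this example the two small demands are met in full and the large one receives whatever is left, so no application can gain bandwidth without taking it from one that already has less or the same amount. Doing this across a global network for hundreds of applications is what makes the scalability problem hard.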
Our early prototyping uncovered several challenges with the latest OpenFlow switches at the time. We worked with Arista on DirectFlow (a superset of OpenFlow) and got it working at the scale and reliability we wanted. This provided the foundation for SWAN for years to come. As Jayshree Ullal (Arista CEO) notes, “SWAN was then able to take advantage of Arista EOS to build an elegant WAN evolving to support 100G, 200G as well as DWDM interconnections at Internet peering points around the world.” It also allowed customers to use this battle-hardened SDN switch infrastructure on their own networks.
We shared the results of our work at the ACM SIGCOMM 2013 conference, where Google also shared its results from building a similar system called B4. The two systems provided proof points that SDN could enable massively more efficient and flexible traffic engineering. In the words of noted computer scientist Bruce Davie, they “broke the rule that centralized control could not be done, thus freeing the system from greedy approaches that made only local optimizations.”
The original paper, Achieving High Utilization with Software-Driven WAN, was the start, not the end, of the journey for us. We have since solved many additional challenges, including a faster solution for approximate max-min fairness, proactive defense against a small number of failures by spreading traffic, and exploiting the hierarchical nature of the WAN topology and traffic demands to solve max-flow-style problems more quickly. Many of these have been deployed in production on the Microsoft WAN. In this sense, SWAN has provided a rich research-to-production pipeline.
As I look back, I can proudly say that SWAN has lived up to its promise. Our inter-datacenter WAN now has unprecedented efficiency and flexibility. Of course, not everything went as expected. We worried about the reliability of centralized control when the controllers become unavailable, and built Paxos-like consensus clusters with redundancy to guard against it. But we didn’t protect against code bugs where all cluster members were simultaneously wrong. Since then, we have developed new mechanisms to counteract this threat.
Overall, the design of SWAN and its implementation have stood the test of time. In fact, we are now moving our other WAN, which connects Microsoft datacenters to the broader Internet, to the world of centralized control as well [e.g., OneWAN]. SWAN now carries over 90% of the traffic in and out of Microsoft’s datacenters, a footprint spanning over 280,000 kilometers of optical fiber and over 150 points of presence across all Azure regions. This unification will unlock the next level of efficiency and flexibility, and Microsoft researchers are right there taking the next set of technical bets.