{"id":478188,"date":"2018-04-06T14:23:06","date_gmt":"2018-04-06T21:23:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=478188"},"modified":"2019-02-27T23:00:03","modified_gmt":"2019-02-28T07:00:03","slug":"microsoft-shines-nsdi-18","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-shines-nsdi-18\/","title":{"rendered":"Microsoft Shines at NSDI ’18"},"content":{"rendered":"
<\/p>\n
Microsoft Research and Microsoft Azure are committed to developing technologies that make our data centers the most reliable, highest-performing data centers on the planet. We are also committed to extending the state of the art in cloud computing by sharing our ideas openly and freely. This is evident in the technical program of the 15th USENIX Symposium on Networked Systems Design and Implementation \u2013 NSDI\u00a0\u201918 \u2013 to be held in Seattle, Washington, on April 9\u201311, 2018. Microsoft researchers and engineers are heavily engaged with the organizers of this event and have contributed six scientific papers and four posters to its technical program. These papers and posters cover the latest technologies we have developed in networked systems. We are immensely proud of these accomplishments, and our contributions continue a rich tradition at Microsoft of sharing our knowledge and experience with the academic and research communities and with the industry at large.<\/p>\n Choosing which of our six papers to write about was difficult; I love them all! Should I write about Azure\u2019s SmartNIC technology, which offers the fastest published bandwidth of any public cloud provider? Or about MP-RDMA, our work on multi-path transport for RDMA, which can improve network utilization by up to 47%? The common thread across all these papers is that they constitute significant advances in our field while improving the reliability and efficiency of Microsoft data centers.<\/p>\n Diagnosing Packet Loss in Cloud-Scale Networks<\/strong><\/p>\n I like to solve problems, especially complex ones that don\u2019t have obvious solutions. 
Take, for example, the problem discussed in our paper, \u201cDemocratically Finding the Cause of Packet Drops.\u201d As the title suggests, this paper describes a system that quickly and accurately discovers the causes of packet drops. Generally, this is a difficult problem to solve. Why? Because data centers can easily have hundreds of thousands of devices and billions of packets in flight per second. The task of determining the cause of packet drops is as difficult as finding the proverbial needle in the haystack. What complicates matters further is that short-term buffer overflows on individual switch ports also drop packets. If you think about it, such packet drops are to be expected, and network protocols are designed to deal with them. Cloud operators want a system that distinguishes such drops from those that cause real customer impact. Our system solves this challenging problem.<\/p>\n Here\u2019s another thing \u2013 not all failures are equal. For example, a link that drops 0.3% of its packets may hurt customers whose applications are sensitive to packet drops far more than a link that drops 1% of its packets but serves customers whose applications tolerate small losses. As you can imagine, fixing these faulty links takes time. Wouldn\u2019t it be great if engineers could prioritize these fixes? Here\u2019s what one of our network engineers told us: \u201cIn a network of over a million links, it\u2019s a reasonable assumption that there is a non-zero chance that 10 or more of these links are bad. This may be due to a variety of reasons \u2013 device, port, or cable etc. \u2013 and we cannot fix them simultaneously. Therefore, fixes need to be prioritized based on customer impact. 
We need a direct way to correlate customer impact with bad links.” By finding the cause of individual TCP packet drops, our system lets us identify the impact of each link on individual TCP connections and, by extension, attribute those failures to individual customers.<\/p>\n Our system has several attractive properties that I would like to point out. Here\u2019s one: it does not require any changes to the existing networking infrastructure or the clients. I know that those of you who work in this area will appreciate this point. In addition, our system detects in-band failures, and it continues to perform beautifully even in the presence of noise, such as lone packet drops. All of this is good, but you might be thinking, what about the overhead? It is negligible. At its core, our system uses a simple voting-based mechanism that is analytically proven to have high accuracy.<\/p>\n I encourage you to read the paper. This is not just a systems research project; its monitoring agent has been running in Microsoft Azure data centers for over two years! The results shown in our paper were derived from a complete system that was deployed in one of our data centers for two months.<\/p>\n Performance and Reachability<\/strong><\/p>\n Let\u2019s turn our attention to something different but equally important \u2013 performance and reachability. A high-quality end-user experience is particularly important for running a successful cloud business. But the Internet is diverse and dynamic, and the factors that influence its performance are outside the control of any single organization. 
Microsoft has developed a system that lets it evaluate fine-grained Internet performance on a global scale. This accomplishment is detailed in our paper, \u201cOdin: Microsoft\u2019s Scalable Fault-Tolerant CDN Measurement System.\u201d<\/p>\n Odin is Microsoft’s scalable, fault-tolerant Internet measurement and experimentation platform, designed to continuously measure and evaluate Internet performance between Microsoft customers and our services such as Bing or Office 365. Odin lets Microsoft respond quickly to changes in the Internet that might adversely impact our customers’ performance. Odin also helps us understand how the overall experience of any customer will be affected when we make changes to Microsoft’s internal network. This way, we do not make changes until we have a solid solution to mitigate any negative impact on our customers.<\/p>\n