<\/p>\n
Microsoft Research and Microsoft Azure are committed to developing technologies that make our data centers the most reliable and high-performance data centers on the planet. We also are committed to extending the state of the art in cloud computing by sharing our ideas openly and freely. This is evident in the technical program of the 15th USENIX Symposium on Networked Systems Design and Implementation \u2013 NSDI\u00a0\u201818 (opens in new tab)<\/span><\/a>) \u2013 to be held in Seattle, Washington on April 9, 2018 – April 11, 2018. Microsoft researchers and engineers are heavily engaged with the organizers of this event and have contributed six scientific papers and four posters to its technical program. These papers and posters cover the latest technologies we have developed in networked systems. We are immensely proud of our accomplishments and our contributions follow in the footsteps of a rich tradition at Microsoft of sharing our knowledge and experience with the academic and research communities and with the industry at large.<\/p>\n
I was wondering which of our six papers I should write about and it was difficult; I love them all! Should I write about Azure\u2019s SmartNIC technology (opens in new tab)<\/span><\/a> that offers the fastest published bandwidth of any public cloud provider? Or about MP-RDMA, that is, our work on multi-path transport for RDMA that can improve network utilization by up to 47%. The common thread across all these papers is that they constitute significant advances in our field while improving the reliability and efficiency of Microsoft data centers.<\/p>\n
Diagnosing Packet Loss in Cloud-Scale Networks<\/strong><\/p>\n
I like to solve problems, especially complex ones that don\u2019t have obvious solutions. Take for example the problem discussed in our paper, \u201cDemocratically Finding the Cause of Packet Drops (opens in new tab)<\/span><\/a>.\u201d As the title suggests, this paper describes a system that quickly and accurately discovers the causes of packet drops. Generally, this is a difficult problem to solve. Why? Because data centers can easily have hundreds of thousands of devices and billions of packets in flight per second. The task of determining the cause of packet drops is as difficult as finding the proverbial needle in the haystack. What complicates matters further is short-term buffer overflows on individual switch ports. If you think about it, such packet drops are kind of expected and network protocols are designed to deal with this phenomenon. Cloud operators want a system that distinguishes such drops from those that cause real customer impact. Our system solves this challenging problem.<\/p>\n
Performance and Reachability<\/strong><\/p>\n
Let\u2019s turn our attention to something different but equally important \u2013 performance and reachability. High quality end-user experience is particularly important for running a successful cloud business. But the Internet is diverse and dynamic and factors that influence its performance are outside the control of any single organization. Microsoft has developed a system that lets it evaluate fine-grained Internet performance on a global scale.This accomplishment is detailed in our paper, \u201cOdin: Microsoft\u2019s Scalable Fault-Tolerant CDN Measurement System (opens in new tab)<\/span><\/a>\u201d.<\/p>\n
Figure 1: Performance improvement made possible by Odin\u2019s knowledge of Internet latencies to Microsoft\u2019s regional datacenters.<\/p><\/div>\n
Here\u2019s how we do this: Odin issues active measurements from popular Microsoft applications to get high coverage of the Internet paths from Microsoft\u2019s users to endpoints that can be either inside Microsoft or elsewhere. Measurements are tailored on a per-use-case basis (for example, application requirements). By targeting measurements to external third-party destinations, we can infer unreachability events to Microsoft endpoints due to network failures.<\/p>\n
I chose to highlight the Odin paper because this is an \u201cexperience paper\u201d that I think will interest many of you. I encourage you to read it because it describes a production system that Microsoft uses. It is a remarkably detailed study of a large, complex Internet measurement platform that is integrated into a few important Microsoft applications, gathering billions of measurements per day. It has been running for over two years, improving operations and detecting and diagnosing reachability issues. It serves as the foundation for several Azure and CDN applications including traffic management, availability alerting, network diagnostics and A\/B testing.<\/p>\n
Microsoft\u2019s Data Center Tour<\/strong><\/p>\n
Recently I was talking to a seasoned networking researcher who works on cloud computing and builds technology for data centers. He also teaches courses on data center infrastructure and operations. To my great surprise he told me that he had never seen a mega-scale data center.<\/p>\n
I believe that for those of us who work in this space it\u2019s important to visit one of the larger data centers to understand and internalize how equipment and networks at these massive scales are put together. I have been to our data centers many times. So, in the spirit of sharing knowledge, in collaboration with Dave Maltz, Microsoft\u2019s distinguished engineer and Azure Physical Networking team lead, we have arranged a tour of one of Microsoft\u2019s mega data centers for students and faculty attending NSDI 2018. The details of this tour are available on the symposium\u2019s website (opens in new tab)<\/span><\/a>.<\/p>\n
I\u2019ll wrap up by emphasizing how excited we at Microsoft are about meeting our colleagues at the upcoming NSDI Symposium. We\u2019re looking forward to sharing our insights and in turn learning from the insights others have acquired. By the way, the Azure SmartNIC hardware that I mentioned at the beginning of this post is in production and widely available to Azure customers. You can read about it in the engaging Azure Blog post, \u201cMaximize your VM\u2019s Performance with Accelerated Networking \u2013 now generally available for both Windows and Linux (opens in new tab)<\/span><\/a>\u201d.<\/p>\n
See you at NSDI \u201918!<\/p>\n
<\/p>\n","protected":false},"excerpt":{"rendered":"
Microsoft Research and Microsoft Azure are committed to developing technologies that make our data centers the most reliable and high-performance data centers on the planet. We also are committed to extending the state of the art in cloud computing by sharing our ideas openly and freely. This is evident in the technical program of the […]<\/p>\n","protected":false},"author":37074,"featured_media":478314,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[194463],"tags":[],"research-area":[13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-478188","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-systems","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144899],"related-projects":[],"related-events":[469602],"related-researchers":[{"type":"user_nicename","value":"Victor Bahl","user_id":31167,"display_name":"Victor Bahl","author_link":"Victor Bahl<\/a>","is_active":false,"last_first":"Bahl, Victor","people_section":0,"alias":"bahl"}],"msr_type":"Post","featured_image_thumbnail":"","byline":"Victor Bahl<\/a>","formattedDate":"April 6, 2018","formattedExcerpt":"Microsoft Research and Microsoft Azure are committed to developing technologies that make our data centers the most reliable and high-performance data centers on the planet. We also are committed to extending the state of the art in cloud computing by sharing our ideas openly and…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/478188","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=478188"}],"version-history":[{"count":17,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/478188\/revisions"}],"predecessor-version":[{"id":570447,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/478188\/revisions\/570447"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/478314"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=478188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=478188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=478188"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=478188"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=478188"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=478188"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=478188"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=478188"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=478188"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=478188"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=478188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}