{"id":1099176,"date":"2024-11-04T09:00:00","date_gmt":"2024-11-04T17:00:00","guid":{"rendered":""},"modified":"2024-11-07T10:14:11","modified_gmt":"2024-11-07T18:14:11","slug":"microsoft-at-sosp-2024-innovations-in-systems-research","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-at-sosp-2024-innovations-in-systems-research\/","title":{"rendered":"Microsoft at SOSP 2024: Innovations in systems research"},"content":{"rendered":"\n
\"SOSP<\/figure>\n\n\n\n

Microsoft is proud to sponsor the 30th<\/sup> Symposium on Operating Systems Principles<\/a> (SOSP 2024), highlighting its commitment to advancing computing systems research. In an age where digital infrastructure underpins nearly every facet of modern life, SOSP serves as an important forum for showcasing the technologies that shape our interconnected world. Organized annually by the Association for Computing Machinery (ACM), the symposium brings together experts to explore innovations in operating systems, distributed systems, and systems software.<\/p>\n\n\n\n

With seven accepted papers, including \u201cVerus: A Practical Foundation for Systems Verification<\/a>,\u201d which won the Distinguished Artifact Award, as well as two workshops and a tutorial, Microsoft researchers are presenting groundbreaking work that strengthens the security, efficiency, and scalability of cloud computing and distributed systems. This work not only contributes to theoretical knowledge but also addresses real-world challenges, helping ensure that as computing systems grow more complex, they remain sustainable, reliable, and secure.<\/p>\n\n\n\n

Continue reading to learn more about Microsoft\u2019s contributions to SOSP 2024, including breakthroughs that tackle the evolving demands of modern computing.<\/p>\n\n\n\n

Papers<\/h2>\n\n\n\n

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection<\/a><\/h3>\n\n\n\n

Jia Pan, Haoze Wu, <\/em>Tanakorn Leesatapornwongsa<\/em><\/a>, <\/em>Suman Nath<\/em><\/a>, Peng Huang<\/em><\/p>\n\n\n\n

\"Diagram<\/figure>\n\n\n\n

This research introduces Anduril, a fault injection technique that quickly reproduces specific fault-induced failures in production systems. Anduril uses static causal analysis and a novel feedback-driven algorithm to search the fault space efficiently for the failure\u2019s cause and timing. An evaluation on 22 real-world fault-induced failures from five large-scale distributed systems demonstrates Anduril\u2019s ability to reproduce all failures by identifying and injecting the root-cause faults at the right time.<\/p>\n\n\n\n


\n\n\n\n

If At First You Don\u2019t Succeed, Try, Try, Again\u2026? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems<\/a><\/h3>\n\n\n\n

Bogdan Alexandru Stoica<\/em>, Utsav Sethi<\/em>, Yiming Su<\/em>, Cyrus Zhou<\/em>, Shan Lu<\/em><\/a>, <\/em>Jonathan Mace<\/em><\/a>, <\/em>Madan Musuvathi<\/em><\/a>, <\/em>Suman Nath<\/em><\/a><\/p>\n\n\n\n

\"A<\/figure>\n\n\n\n

Retry<\/em>, the re-execution of a task after it fails, is commonly employed to build resilient software systems, but implementing and testing it in modern systems is challenging. Based on real-world retry<\/em> issues, the authors introduce a suite of static and dynamic techniques to detect retry<\/em> problems. They found that the ad hoc nature of retry<\/em> implementations complicates traditional program analysis but that large language models (LLMs) can address these issues effectively. The research also demonstrates that repurposing unit tests, combined with fault injection, can reveal various retry<\/em> issues.<\/p>\n\n\n\n

Listen to research manager\u202fShan Lu and PhD candidate Bogdan Stoica discuss this work in a recent podcast episode<\/a>.<\/p>\n\n\n\n


\n\n\n\n

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10<\/a><\/h3>\n\n\n\n

Yiqi Liu, Yuqi Xue, Yu Cheng, <\/em>Lingxiao Ma<\/em><\/a>, <\/em>Ziming Miao<\/em><\/a>, Jilong Xue, Jian Huang<\/em><\/p>\n\n\n\n

\"An<\/figure>\n\n\n\n

Despite advances in AI chips that enable high-bandwidth and low-latency inter-core memory access, deep learning compilers lack support for scalable inter-core connections, limiting their potential. To address this, the authors introduce T10, an end-to-end deep learning compiler that takes advantage of inter-core communication and distributed on-chip memory. T10 introduces a distributed tensor abstraction, rTensor, and maps the computation and communication of tensor operators to cores with a generalized compute-shift pattern. T10 optimizes on-chip memory consumption and inter-core communication overhead, selecting the best execution plan.<\/p>\n\n\n\n


\n\n\n\n

SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference<\/a><\/h3>\n\n\n\n

Ashwin Prasad<\/em>, Sampath Rajendra<\/em><\/a>, <\/em>Kaushik Rajan<\/em><\/a>, R Govindarajan, Uday Bondhugula<\/em><\/p>\n\n\n\n

\"The<\/figure>\n\n\n\n

SilvanForge is a schedule-guided retargetable compiler for decision tree-based models that explores various optimization options and automatically generates high-performance inference routines for CPUs and GPUs. It consists of two core components: a scheduling language to efficiently explore the optimization space, and a retargetable compiler that generates code for any specified schedule. By utilizing different data layouts, loop structures, and caching strategies, SilvanForge achieves portable performance across multiple hardware targets.<\/p>\n\n\n\n


\n\n\n\n

Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor<\/a><\/h3>\n\n\n\n

Siran Liu, Chengxiang Qi, <\/em>Ying Cao<\/em><\/a>, Chao Yang, Weifang Hu, Xuanhua Shi, <\/em>Fan Yang<\/em><\/a>, <\/em>Mao Yang<\/em><\/a><\/p>\n\n\n\n

\"FractalTensor<\/figure>\n\n\n\n

Deep neural networks (DNNs) often use highly optimized tensor operators to speed up computation, but these operators are typically defined empirically, limiting cross-operator analysis and optimization. FractalTensor addresses this by introducing a nested list-based abstract data type (ADT), where each element is either a tensor with a static shape or another FractalTensor. DNNs are then defined using higher-order compute operators like map\/reduce\/scan and data access operators like window\/stride, explicitly exposing nested data parallelism and fine-grained access patterns. This approach enables whole-program analysis and optimization. This paper will only be available in the SOSP 2024 proceedings.<\/p>\n\n\n\n


\n\n\n\n

Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs<\/a><\/h3>\n\n\n\n

Yiming Qiu, Patrick Tser Jern Kon, <\/em>Ryan Beckett<\/em><\/a>, Ang Chen<\/em><\/p>\n\n\n\n

\"The<\/figure>\n\n\n\n

This research introduces Zodiac, a tool that uses semantic-guided mining and deployment-based validation pipelines to automatically uncover cloud IaC semantic checks. When applied to Microsoft Azure resources, Zodiac identified over 400 semantic checks that would cause deployment failures if violated, demonstrating its ability to detect cloud requirements that are difficult for state-of-the-art IaC tools to find.<\/p>\n\n\n\n


\n\n\n\n

Paper and tutorial<\/h2>\n\n\n\n

Verus: A Practical Foundation for Systems Verification<\/a><\/h3>\n\n\n\n

Distinguished Artifact Award
Chris Hawblitzel<\/a>, Jay Lorch<\/a><\/em><\/p>\n\n\n\n

\"Overview<\/figure>\n\n\n\n

This work presents an updated version of Verus, a tool that accelerates and simplifies formal verification of system software. It builds on previous advances for faster and more cost-effective verification of complex properties in realistic systems and now verifies code up to 61 times faster than existing methods. It has been evaluated on various systems encompassing 6,100 lines of code and 31,000 lines of proof. Verus is ready for broader adoption by developers using Rust who want to create more robust systems.<\/p>\n\n\n\n

Listen to principal researchers Chris Hawblitzel and Jay Lorch discuss this work in a recent podcast episode<\/a>.<\/p>\n\n\n\n


\n\n\n\n

Workshops<\/h2>\n\n\n\n

Hot Topics in System Infrastructure<\/a><\/h3>\n\n\n\n

Inigo Goiri<\/em><\/a>, <\/em>Pantea Zardoshti<\/em><\/a><\/p>\n\n\n\n

Researchers and engineers share recent findings and experiences while exploring new challenges and opportunities in building next-generation infrastructures, including AI, sustainable datacenters, and edge and cloud computing. Topics span the entire system stack, focusing on design and implementation, hardware architecture, operating systems, runtimes, and applications.<\/p>\n\n\n\n

Practical Adoption Challenges of ML for Systems<\/a><\/h3>\n\n\n\n

Chetan Bansal<\/em><\/a><\/p>\n\n\n\n

Despite significant progress in machine learning, it is rarely deployed in computer systems because of non-machine learning challenges like feature stability, reliability, and availability. This workshop brings together researchers from academia and industry to foster communication and collaboration in addressing these practical issues and aligning research with real-world system deployment needs.<\/p>\n\n\n\n


\n\n\n\n

Discover how interdisciplinary systems research is driving innovation at Microsoft<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"

Building resilient systems, scaling deep learning computation, and reproducing failures in production are just some of the ways Microsoft researchers are advancing the state of the art in computer systems research at SOSP 2024.<\/p>\n","protected":false},"author":43518,"featured_media":1099248,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[1],"tags":[],"research-area":[13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142,269145],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1099176","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-systems-and-networking","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river","msr-post-option-pinned-for-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[920058,144927],"related-projects":[554055],"related-events":[1073397],"related-researchers":[{"type":"user_nicename","value":"Tanakorn Leesatapornwongsa","user_id":38754,"display_name":"Tanakorn Leesatapornwongsa","author_link":"Tanakorn Leesatapornwongsa<\/a>","is_active":false,"last_first":"Leesatapornwongsa, Tanakorn","people_section":0,"alias":"taleesat"},{"type":"user_nicename","value":"Suman Nath","user_id":33753,"display_name":"Suman Nath","author_link":"Suman Nath<\/a>","is_active":false,"last_first":"Nath, 
Suman","people_section":0,"alias":"sumann"},{"type":"user_nicename","value":"Shan Lu","user_id":43215,"display_name":"Shan Lu","author_link":"Shan Lu<\/a>","is_active":false,"last_first":"Lu, Shan","people_section":0,"alias":"shanlu"},{"type":"user_nicename","value":"Jonathan Mace","user_id":42546,"display_name":"Jonathan Mace","author_link":"Jonathan Mace<\/a>","is_active":false,"last_first":"Mace, Jonathan","people_section":0,"alias":"jonathanmace"},{"type":"user_nicename","value":"Madan Musuvathi","user_id":32766,"display_name":"Madan Musuvathi","author_link":"Madan Musuvathi<\/a>","is_active":false,"last_first":"Musuvathi, Madan","people_section":0,"alias":"madanm"},{"type":"user_nicename","value":"Lingxiao Ma","user_id":39769,"display_name":"Lingxiao Ma","author_link":"Lingxiao Ma<\/a>","is_active":false,"last_first":"Ma, Lingxiao","people_section":0,"alias":"lingm"},{"type":"user_nicename","value":"Ziming Miao","user_id":42249,"display_name":"Ziming Miao","author_link":"Ziming Miao<\/a>","is_active":false,"last_first":"Miao, Ziming","people_section":0,"alias":"zimiao"},{"type":"user_nicename","value":"Sampath Rajendra","user_id":43107,"display_name":"Sampath Rajendra","author_link":"Sampath Rajendra<\/a>","is_active":false,"last_first":"Rajendra, Sampath","people_section":0,"alias":"srajendra"},{"type":"user_nicename","value":"Kaushik Rajan","user_id":32574,"display_name":"Kaushik Rajan","author_link":"Kaushik Rajan<\/a>","is_active":false,"last_first":"Rajan, Kaushik","people_section":0,"alias":"krajan"},{"type":"user_nicename","value":"Ying Cao","user_id":37571,"display_name":"Ying Cao","author_link":"Ying Cao<\/a>","is_active":false,"last_first":"Cao, Ying","people_section":0,"alias":"yincao"},{"type":"user_nicename","value":"Fan Yang","user_id":31782,"display_name":"Fan Yang","author_link":"Fan Yang<\/a>","is_active":false,"last_first":"Yang, Fan","people_section":0,"alias":"fanyang"},{"type":"user_nicename","value":"Mao 
Yang","user_id":32798,"display_name":"Mao Yang","author_link":"Mao Yang<\/a>","is_active":false,"last_first":"Yang, Mao","people_section":0,"alias":"maoyang"},{"type":"user_nicename","value":"Ryan Beckett","user_id":37775,"display_name":"Ryan Beckett","author_link":"Ryan Beckett<\/a>","is_active":false,"last_first":"Beckett, Ryan","people_section":0,"alias":"rybecket"},{"type":"user_nicename","value":"Chris Hawblitzel","user_id":31425,"display_name":"Chris Hawblitzel","author_link":"Chris Hawblitzel<\/a>","is_active":false,"last_first":"Hawblitzel, Chris","people_section":0,"alias":"chrishaw"},{"type":"user_nicename","value":"Jay Lorch","user_id":32732,"display_name":"Jay Lorch","author_link":"Jay Lorch<\/a>","is_active":false,"last_first":"Lorch, Jay","people_section":0,"alias":"lorch"},{"type":"user_nicename","value":"Pantea Zardoshti","user_id":40717,"display_name":"Pantea Zardoshti","author_link":"Pantea Zardoshti<\/a>","is_active":false,"last_first":"Zardoshti, Pantea","people_section":0,"alias":"pzardoshti"},{"type":"user_nicename","value":"Chetan Bansal","user_id":31394,"display_name":"Chetan Bansal","author_link":"Chetan Bansal<\/a>","is_active":false,"last_first":"Bansal, Chetan","people_section":0,"alias":"chetanb"}],"msr_type":"Post","featured_image_thumbnail":"\"SOSP","byline":"","formattedDate":"November 4, 2024","formattedExcerpt":"Building resilient systems, scaling deep learning computation, and reproducing failures in production are just some of the ways Microsoft researchers are advancing the state of the art in computer systems research at SOSP 
2024.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1099176"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43518"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1099176"}],"version-history":[{"count":30,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1099176\/revisions"}],"predecessor-version":[{"id":1101951,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1099176\/revisions\/1101951"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1099248"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1099176"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1099176"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1099176"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1099176"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1099176"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1099176"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1099176"},
{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1099176"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1099176"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1099176"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1099176"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}