Microsoft at SOSP 2024: Innovations in systems research

Published November 4, 2024

Share this page

SOSP 2024 on a blue and green gradient background

Microsoft is proud to sponsor the 30^th Symposium on Operating Systems Principles (SOSP 2024), highlighting its commitment to advancing computing systems research. In an age where digital infrastructure underpins nearly every facet of modern life, SOSP serves as an important forum for showcasing the technologies that shape our interconnected world. Organized annually by the Association for Computing Machinery (ACM), the symposium brings together experts to explore innovations in operating systems, distributed systems, and systems software.

With seven accepted papers, including “Verus: A Practical Foundation for Systems Verification,” which won the Distinguished Artifact Award, as well as two workshops, and a tutorial, Microsoft researchers are presenting groundbreaking work that strengthens the security, efficiency, and scalability of cloud computing and distributed systems. This work not only contributes to theoretical knowledge but also address real-world challenges, helping ensure that as computing systems grow more complex, they remain sustainable, reliable, and secure.

Continue reading to learn more about Microsoft’s contributions to SOSP 2024, including breakthroughs that tackle the evolving demands of modern computing.

Papers

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

Jia Pan, Haoze Wu, Tanakorn Leesatapornwongsa, Suman Nath, Peng Huang

Diagram of Anduril’s architecture and workflow. Anduril contains two major components, Instrumenter and Explorer. Instrumenter receives the system bytecode and the failure log file from production. It performs static analysis and computes a causal graph consisting of program points potentially related to the failure symptom. In addition, it inserts code snippets into the system for (1) injecting a fault to throw a desired exception, and (2) logging additional information to facilitate the feedback algorithms. Note that Anduril does not instrument the production system. Its input failure log is from the uninstrumented system.

This research introduces Anduril, a fault injection technique that quickly reproduces specific fault-induced failures in production systems. Anduril uses static causal analysis and a novel feedback-driven algorithm to quickly search the fault space for the failure’s cause and timing. Evaluation on 22 real-world fault-induced failures from five large-scale distributed systems demonstrate FIR’s ability to reproduce all failures by identifying and injecting the root-cause faults at the right time.

If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Bogdan Alexandru Stoica, Utsav Sethi, Yiming Su, Cyrus Zhou, Shan Lu, Jonathan Mace, Madan Musuvathi, Suman Nath

A workflow that repurposes existing unit tests to expose retry-related bugs in system software.

Retry—the re-execution command used when a task fails—is commonly employed to build resilient software systems, but implementing and testing it in modern systems is challenging. Based on real-world retry issues, the authors introduce a suite of static and dynamic techniques to detect retry problems. They found that the ad-hoc nature of retry implementation complicates traditional program analysis but that large language models (LLMs) can address these issues effectively. The research also demonstrates that repurposing unit tests, combined with fault injection, can reveal various retry issues.

Listen to research manager Shan Lu and PhD candidate Bogdan Stoica discuss this work in a recent podcast episode.

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang

An example that maps a MatMul operator to two cores with the compute-shift style execution. Both (b) and (c) are valid compute-shift execution plans, but with different tradeoffs between memory footprint and communication overhead.

Despite advances in AI chips that enable high-bandwidth and low-latency inter-core memory access, deep learning compilers lack support for scalable inter-core connections, limiting their potential. To address this, the authors introduce T10, an end-to-end deep learning compiler to take advantage of inter-core communication and distributed on-chip memory. T10 introduces a distributed tensor abstraction, rTensor, and maps the computation and communication of tensor operators with a generalized compute-shift pattern to cores. T10 optimizes on-chip memory consumption and inter-core communication overhead, selecting the best execution plan.

SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference

Ashwin Prasad, Sampath Rajendra, Kaushik Rajan, R Govindarajan, Uday Bondhugula

The figure provides an overview of schedule-guided compilation with SilvanForge. Compilation occurs through multiple intermediate representations (IR), with optimizations at each level and transitions from one level to the next guided by a schedule.

SilvanForge is a schedule-guided retargetable compiler for decision tree-based models that explores various optimization options and automatically generates high-performance inference routines for CPUs and GPUs. It consists of two core components: a scheduling language to efficiently explore the optimization space, and a retargetable compiler that generates code for any specified schedule. By utilizing different data layouts, loop structures, and caching strategies, SilvanForge achieves portable performance across multiple hardware targets.

Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor

Siran Liu, Chengxiang Qi, Ying Cao, Chao Yang, Weifang Hu, Xuanhua Shi, Fan Yang, Mao Yang

FractalTensor enhances DAG programming for DNNs with a list-based ADT, allowing innovative DNN algorithmic ideas to be naturally expressed, their parallelism analyzed by the compiler, and subsequently translated into tile processing.

Deep neural networks (DNNs) often use highly optimized tensor operators to speed up computation, but these operators are typically defined empirically, limiting cross-operator analysis and optimization. FractalTensor addresses this by introducing a nested list-based abstract data type (ADT), where each element is either a tensor with a static shape or another FractalTensor. DNNs are then defined using high-order compute operators like map/reduce/scan and data access operators like window/stride, explicitly exposing nested data parallelism and fine-grained access patterns. This approach enables entire program analysis and optimization. This paper will only be available in the SOSP 2024 proceedings.

Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs

Yiming Qiu, Patrick Tser Jern Kon, Ryan Beckett, Ang Chen

The overall workflow of Zodiac. It starts by ingesting IaC repositories crawled from online sources. Based on a curated set of check templates using a semantic knowledge base (KB) and a specification language, it generates a set of hypothesized checks. Next, it performs statistical filtering and interpolation to reduce false positives and fill in missing details with the help of LLMs. The hypothesized checks are then fed into the validation phase. For each such check, Zodiac identifies conforming instances that could be used as positive test cases, and it further mutates them to obtain corresponding negative test cases. A check is validated if the positive test case succeeds to deploy but its negative counterpart does not. To further resolve conflicts across different checks, Zodiac plans the order of negative test case generation and deployment via a validation scheduler.

This research introduces Zodiac, a tool that uses semantic-guided mining and deployment-based validation pipelines to automatically uncover cloud IaC semantic checks. When applied to Microsoft Azure resources, Zodiac identified over 400 semantic checks that would cause deployment failures if violated, demonstrating its ability to detect cloud requirements that are difficult for state-of-the-art IaC tools to find.

Paper and tutorial

Verus: A Practical Foundation for Systems Verification

Distinguished Artifact Award
Chris Hawblitzel, Jay Lorch

Overview diagram of Verus. Verus offers powerful automated and semi-automated reasoning techniques that apply to the full system stack, as well as reasoning techniques tuned to specific stack levels.

This work presents an updated version of Verus, a tool that accelerates and simplifies formal verification of system software. It builds on previous advances for faster and more cost-effective verification of complex properties in realistic systems and now verifies code up to 61 times faster than existing methods. It has been evaluated on various systems encompassing 6,100 lines of code and 31,000 lines of proof. Verus is ready for broader adoption by developers using Rust who want to create more robust systems.

Listen to principal researchers Chris Hawblitzel and Jay Lorch discuss this work in a recent podcast episode.

Workshops

Hot Topics in System Infrastructure (opens in new tab)

Inigo Goiri, Pantea Zardoshti

Researchers and engineers share recent findings and experiences while exploring new challenges and opportunities in building next-generation infrastructures, including AI, sustainable datacenters, and edge and cloud computing. Topics span the entire system stack, focusing on design and implementation, hardware architecture, operating systems, runtimes, and applications.

Practical Adoption Challenges of ML for Systems (opens in new tab)

Chetan Bansal

Despite significant progress in machine learning, deploying it in computer systems is rare due to non-machine learning challenges like feature stability, reliability, and availability. This workshop brings together researchers from academia and industry communities to foster communication and collaboration in addressing these practical issues and aligning research with real-world system deployment needs.

Discover how interdisciplinary systems research is driving innovation at Microsoft.

Related publications

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Continue reading

graphical user interface, application, icon

February 5, 2025

Research Areas

Systems and networking

Research Groups

Related projects

Torch

Related events

SOSP 2024

Related labs

Microsoft Research Lab - Redmond

Microsoft Research Blog

Microsoft at SOSP 2024: Innovations in systems research

Papers

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference

Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor

Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs

Paper and tutorial

Verus: A Practical Foundation for Systems Verification

Workshops

Hot Topics in System Infrastructure (opens in new tab)

Practical Adoption Challenges of ML for Systems (opens in new tab)

Related publications

Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection

If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor

SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference

Verus: A Practical Foundation for Systems Verification

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs

Continue reading

Advances to low-bit quantization enable LLMs on edge devices

Research Focus: Week of November 11, 2024

Abstracts: November 5, 2024

Abstracts: November 4, 2024

Research Areas

Research Groups

Related projects

Related events

Related labs