About
I am a Senior Researcher in the Cloud Systems Reliability Group at Microsoft Research, Redmond. I joined MSR in 2023; before that I was faculty at the Max Planck Institute for Software Systems (2018-2022), and a PhD student at Brown University (2011-2018).
Our group is hiring research interns! Feel free to drop me an e-mail if you are a PhD student and your research interests align with those of our group!
My research interests center on observability and tracing in cloud and distributed systems; performance and resource management; and building multi-tenant distributed systems. A few recent project highlights include the following:
- Hindsight is a new distributed tracing framework built from the ground up to support edge-case tracing. Hindsight enables detailed end-to-end tracing for rare and outlier requests without the data loss traditionally incurred by sampling-based systems. It achieves this by combining a short per-node history of telemetry, programmatic symptom detection, and rapid distributed retrieval. Hindsight will appear at NSDI 2023; check out the preprint! Its code can be found on GitLab.
- Clockwork is a distributed DNN serving system that treats end-to-end performance predictability as a first-class design concern. To achieve this, Clockwork’s design eliminates major sources of performance variability and centralizes choices that lead to variability, such as scheduling and admission control. The end result is a system design with extremely tight tail latency. Clockwork received the Distinguished Artifact Award at OSDI 2020 and its code can be found on GitLab.
- Pivot Tracing is a cross-component monitoring framework for distributed systems. Often the information needed to troubleshoot cross-component problems is relatively minimal, but inaccessible due to a lack of cross-component visibility. Pivot Tracing combines causal metadata propagation with dynamic instrumentation to overcome this limitation. Using Pivot Tracing, a system operator can use a simple SQL-like interface to define and measure arbitrary system metrics, while grouping, filtering, and aggregating those metrics according to arbitrary identifiers from other system components. Pivot Tracing introduced the baggage abstraction that you can find today in the OpenTelemetry standard. Pivot Tracing received the Best Paper Award at SOSP 2015 and its code can be found on GitHub.