ExChain: Exception Dependency Analysis for Root Cause Diagnosis

USENIX NSDI |

Published by USENIX

Many failures in large-scale online services stem from incorrect handling of exceptions. We focus on exception-handling failures characterized by three features that make them difficult to diagnose using classical techniques: (1) implicit dependencies across multiple exceptions due to state changes; (2) silent code handling without logging; and (3) separation (in code and in time) between the root cause exception and the failure manifestation. In this paper, we present the design and implementation of ExChain, a framework that helps developers diagnose such exception-dependent failures in test/canary deployment environments. ExChain constructs causal links between exceptions even in the presence of the aforementioned factors. Our key observation is that mishandled exceptions invariably modify critical system states, which impact downstream functions. A key challenge in tracking these states is balancing the tradeoff between performance overhead and accuracy. To this end, ExChain uses state-impact analysis to establish potential causal links between exceptions and uses a novel hybrid taint tracking approach for tracking state propagation. Using ExChain, we were able to successfully identify the root cause for 8 out of 11 reported subtle exception-dependent failures in 10 popular applications. ExChain significantly outperforms state-of-art approaches, while producing several orders of magnitude fewer false positives. ExChain also offers significantly better accuracy-performance tradeoffs relative to baseline static/dynamic analysis alternatives.