Feedback-driven Fault Injection: Efficiently Reproducing Fault-Induced Failures

SOSP 2024

Organized by ACM

Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by unusual fault events. Although fault injection testing has become popular, existing solutions are designed for bug finding. They are ineffective and inefficient at reproducing a specific failure during debugging.

In this paper, we explore a new type of fault injection technique for quickly reproducing a given fault-induced production failure in distributed systems. We present a tool, FIR, that uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing. We evaluate FIR on 22 real-world complex fault-induced failures from five large-scale distributed systems. FIR reproduced all failures by identifying and injecting the root-cause faults at the right time, in a median of 8 minutes.
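To make the idea of a feedback-driven search over faults and timings concrete, the following is a minimal toy sketch in Python. It is not FIR's algorithm, which the abstract does not detail. The names `Candidate`, `feedback_search`, `run_with_fault`, the fault sites, and the symptom strings are all hypothetical, and the priority update rule (boosting untried timings at a site whose injection produced some of the target symptoms) is an assumption chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    site: str          # hypothetical fault site, e.g. an I/O call that can fail
    occurrence: int    # which dynamic occurrence to inject at (the "timing")
    priority: float = 0.0


def feedback_search(candidates, run_with_fault, target_symptoms, max_rounds=100):
    """Inject one candidate per round, highest priority first.

    Returns (candidate, rounds_used), or (None, rounds_used) if nothing
    reproduces the target symptoms.
    """
    tried = set()
    for round_no in range(1, max_rounds + 1):
        pending = [c for c in candidates if (c.site, c.occurrence) not in tried]
        if not pending:
            return None, round_no - 1
        best = max(pending, key=lambda c: c.priority)  # ties: first in list order
        tried.add((best.site, best.occurrence))
        observed = run_with_fault(best)
        if target_symptoms <= observed:
            return best, round_no  # every target symptom appeared: reproduced
        # Feedback (assumed rule): partial symptom overlap suggests the right
        # site but the wrong timing, so boost other timings at the same site.
        overlap = len(observed & target_symptoms) / len(target_symptoms)
        for c in pending:
            if c.site == best.site and c is not best:
                c.priority += overlap
    return None, max_rounds


# Toy stand-in for a real system run: only a fault at the third occurrence of
# "net_send" reproduces the failure; earlier ones show just one symptom.
TARGET = frozenset({"retry_storm", "leader_stuck"})


def run_with_fault(c):
    if c.site != "net_send":
        return set()
    return set(TARGET) if c.occurrence == 3 else {"retry_storm"}


sites = ["disk_write", "net_send", "rpc_timeout"]
candidates = [Candidate(s, occ) for occ in (1, 2, 3) for s in sites]
found, rounds = feedback_search(candidates, run_with_fault, TARGET)
print(found.site, found.occurrence, rounds)
```

In this toy setup, trying the nine candidates in list order would reach the reproducing injection on the eighth run, while the feedback rule reaches it on the fourth. A real tool would replace `run_with_fault` with an actual instrumented execution and derive its feedback from logs or other runtime observations.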