{"id":1068138,"date":"2024-08-08T11:41:49","date_gmt":"2024-08-08T18:41:49","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=1068138"},"modified":"2024-10-28T16:05:19","modified_gmt":"2024-10-28T23:05:19","slug":"feedback-driven-fault-injection-efficiently-reproducing-fault-induced-failures","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/feedback-driven-fault-injection-efficiently-reproducing-fault-induced-failures\/","title":{"rendered":"Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection"},"content":{"rendered":"
Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by some unusual faulty events. While fault injection testing becomes popular, existing solutions are designed for bug finding. They are ineffective and inefficient to reproduce a specific failure during debugging.<\/p>\n
We explore a new type of fault injection technique for quickly reproducing a given fault-induced production failure in distributed systems. We present a tool, Anduril, that uses static causal analysis and a novel feedback-driven algorithm to quickly search the enormous fault space for the root-cause fault and timing. We evaluate Anduril on 22 real-world complex fault-induced failures from five large-scale distributed systems. Anduril reproduced all failures by identifying and injecting the root-cause faults at the right time, in a median of 8 minutes.<\/p>\n","protected":false},"excerpt":{"rendered":"
Debugging a failure usually requires reproducing it first. This can be hard for failures in production distributed systems, where bugs are exposed only by some unusual faulty events. While fault injection testing becomes popular, existing solutions are designed for bug finding. They are ineffective and inefficient to reproduce a specific failure during debugging. We explore […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[266136],"msr-journal":[],"msr-impact-theme":[264846],"msr-pillar":[],"class_list":["post-1068138","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2024-11-5","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"ACM","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2024\/08\/anudril.pdf","id":"1098069","title":"anudril","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":1098069,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2024\/10\/anudril.pdf"}],"msr-author-ordering":[{"type":"text","value":"Jia Pan","user_id":0,"rest_url":false},{"type":"text","value":"Haoze Wu","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Tanakorn Leesatapornwongsa","user_id":38754,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Tanakorn Leesatapornwongsa"},{"type":"user_nicename","value":"Suman Nath","user_id":33753,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Suman Nath"},{"type":"text","value":"Peng Huang","user_id":0,"rest_url":false}],"msr_impact_theme":["Computing foundations"],"msr_research_lab":[199565],"msr_event":[1073397],"msr_group":[920058],"msr_project":[554055],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":554055,"post_title":"Torch","post_name":"torch","post_type":"msr-project","post_date":"2018-11-28 15:34:02","post_modified":"2022-12-02 14:18:51","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/torch\/","post_excerpt":"We aim to develop practical tools and techniques that can help cloud developers adequately debug, test, configure, and monitor their systems. The research spans all aspects of improving reliability and availability of large-scale cloud systems, including understanding various runtime failures in real cloud systems, developing practical static and dynamic analysis techniques to uncover hard-to-find bugs before production, root-causing failures, designing lightweight but effective runtime monitoring and failure prevention, etc. Tools and techniques developed in the…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/554055"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1068138"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1068138\/revisions"}],"predecessor-version":[{"id":1089849,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1068138\/revisions\/1089849"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1068138"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=1068138"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1068138"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1068138"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1068138"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=1068138"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1068138"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=1068138"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=1068138"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1068138"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1068138"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1068138"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1068138"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1068138"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1068138"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1068138"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}