DoWhy – A library for causal inference

Published August 21, 2018 | Microsoft Research Blog
https://www.microsoft.com/en-us/research/blog/dowhy-a-library-for-causal-inference/
For decades, causal inference methods have found wide applicability in the social and biomedical sciences. As computing systems start intervening in our work and daily lives, questions of cause and effect are gaining importance in computer science as well. To enable widespread use of causal inference, we are pleased to announce a new software library, DoWhy. Its name is inspired by Judea Pearl's do-calculus for causal inference. In addition to providing a programmatic interface for popular causal inference methods, DoWhy is designed to highlight the critical but often neglected assumptions underlying causal inference analyses. It does this, first, by making the underlying assumptions explicit, for example by directly representing identified estimands, and second, by making sensitivity analysis and other robustness checks a first-class element of the causal inference process. Our goal is to enable people to focus their efforts on identifying assumptions for causal inference, rather than on details of estimation.

Our motivation for creating DoWhy comes from our experiences in causal inference studies over the past few years, ranging from estimating the impact of a recommender system to predicting likely outcomes given a life event. In each of these studies, we found ourselves repeating the common steps of finding the right identification strategy, devising the most suitable estimator, and conducting robustness checks, all from scratch. While we were impressed, and sometimes intimidated, by the amount of knowledge in the causal inference literature, we found that doing any empirical causal inference remained a challenging task. Ensuring we understood our assumptions and validated them appropriately was particularly daunting.
More generally, we see that a "roll your own" approach to causal inference has resulted in studies with varying (sometimes minimal) approaches to testing key assumptions.

We therefore asked ourselves: what if there existed a software library that provided a simple interface to common causal inference methods and codified best practices for reasoning about and validating key assumptions? The challenge is that causal inference depends on the estimation of unobserved quantities, also known as the "fundamental problem" of causal inference. Unlike in supervised learning, such counterfactual quantities imply that we cannot have a purely objective evaluation through a held-out test set, which precludes a plug-in approach to causal inference. For instance, for any intervention, such as a new algorithm or a medical procedure, one can observe either what happens when people are given the intervention or what happens when they are not, but never both. Therefore, causal analysis hinges critically on assumptions about the data-generating process.

To succeed, it became clear to us that assumptions need to be first-class citizens in a causal inference library. We designed DoWhy around two guiding principles: making causal assumptions explicit, and testing the robustness of estimates to violations of those assumptions. First, DoWhy makes a distinction between identification and estimation. Identification of a causal effect involves making assumptions about the data-generating process and going from counterfactual expressions to a target estimand, while estimation is a purely statistical problem of estimating the target estimand from data. Thus, identification is where the library spends most of its time, just as we commonly do in our projects.
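The fundamental problem can be made concrete with a small simulation. This is our own illustrative sketch in plain Python, not part of DoWhy: every unit has two potential outcomes but only one is ever observed, and a confounder that drives both treatment and outcome biases the naive treated-versus-control comparison away from the true effect, which we know here only because we wrote the data generator ourselves.

```python
# Illustrative simulation (not DoWhy code): confounding biases the naive
# comparison because only one potential outcome per unit is observed.
import random

random.seed(0)
TRUE_EFFECT = 2.0

units = []
for _ in range(50_000):
    w = random.random() < 0.5                  # confounder
    t = random.random() < (0.8 if w else 0.2)  # w makes treatment more likely
    y0 = 5.0 * w + random.gauss(0, 1)          # potential outcome without treatment
    y1 = y0 + TRUE_EFFECT                      # potential outcome with treatment
    y = y1 if t else y0                        # we observe only one of the two
    units.append((w, t, y))

treated = [y for _, t, y in units if t]
control = [y for _, t, y in units if not t]
naive = sum(treated) / len(treated) - sum(control) / len(control)
print(f"true effect: {TRUE_EFFECT}, naive estimate: {naive:.2f}")
```

Because treated units are disproportionately drawn from the high-`w` group, the naive difference in means lands near 5.0 rather than the true 2.0; no held-out test set could reveal this, since the counterfactual outcomes are never observed.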
To represent assumptions formally, DoWhy uses the Bayesian graphical model framework, in which users can specify what they know, and more importantly what they don't know, about the data-generating process. For estimation, we provide methods based on the potential-outcomes framework, such as matching, stratification, and instrumental variables. A happy side effect of using DoWhy is that you will come to appreciate the equivalence and interoperability of the seemingly disjoint graphical-model and potential-outcomes frameworks.
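As a sketch of the estimation side, here is stratification, one of the methods named above, implemented in plain Python on simulated data. This illustrates the technique itself rather than DoWhy's API: once identification has produced a backdoor estimand of the form E_w[ E[Y|T=1, w] − E[Y|T=0, w] ], estimating it is ordinary statistics, and conditioning on the observed confounder within strata recovers the effect that a naive comparison would miss.

```python
# Stratification sketch (not DoWhy's API): estimate the backdoor estimand
# by averaging within-stratum treated/control differences, weighted by
# stratum size.
import random

random.seed(0)
TRUE_EFFECT = 2.0

data = []
for _ in range(100_000):
    w = random.random() < 0.5                  # observed confounder
    t = random.random() < (0.8 if w else 0.2)  # w influences treatment
    y = TRUE_EFFECT * t + 5.0 * w + random.gauss(0, 1)
    data.append((w, t, y))

def mean(xs):
    return sum(xs) / len(xs)

estimate = 0.0
for stratum in (False, True):
    ys_treated = [y for w, t, y in data if w == stratum and t]
    ys_control = [y for w, t, y in data if w == stratum and not t]
    weight = sum(1 for w, _, _ in data if w == stratum) / len(data)
    estimate += weight * (mean(ys_treated) - mean(ys_control))

print(f"stratified estimate: {estimate:.2f} (true effect: {TRUE_EFFECT})")
```

The estimate lands close to the true effect of 2.0, but only because the simulation satisfies the backdoor assumption by construction: `w` is the sole confounder and it is observed. This is exactly the kind of assumption DoWhy asks you to state explicitly and then stress-test.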