DoWhy: Microsoft’s new Python library for causal inference

Microsoft released a library named DoWhy earlier this week to promote the widespread use of causal inference. Causal inference is the process of drawing conclusions about cause-and-effect relationships from the conditions under which an effect occurs. Simply put, causal inference attempts to work out why something happened.

“DoWhy” is a Python library that aims to spark causal thinking and analysis. It provides a unified interface for causal inference methods and automatically tests many of the underlying assumptions, which makes causal inference accessible to non-experts.

According to Microsoft, “Our motivation for creating DoWhy comes from our experiences in causal inference studies — ranging from estimating the impact of a recommender system to predicting likely outcomes given a life event — we found ourselves repeating the common steps of finding the right identification strategy, devising the most suitable estimator, and conducting robustness checks, all from scratch”.

DoWhy highlights the critical assumptions lying beneath causal inference analysis. It is designed using four major principles:

Model a causal inference problem using assumptions.
Identify an expression for the causal effect (“causal estimand”).
Estimate the expression using statistical methods.
Verify the validity of the estimate.

How does DoWhy work?

First, DoWhy builds an underlying causal graphical model for every problem. This makes each causal assumption explicit. The graph does not have to be complete; you can provide a partial graph that represents prior knowledge about the variables. DoWhy automatically treats the remaining variables as potential confounders.
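To make the idea of a partial graph concrete, here is a minimal sketch in plain Python. It is not DoWhy's internal code or API: the variable names, the `complete_graph` helper, and the `common_causes` helper are all hypothetical, chosen only to illustrate how unlisted variables could be treated as potential confounders.

```python
# Hypothetical partial causal graph, written as {cause: [effects]}.
# The variable names are invented for illustration only.
partial_graph = {
    "age": ["exercise", "blood_pressure"],
    "exercise": ["blood_pressure"],
}

# Columns observed in a (hypothetical) dataset; "income" is not in the graph.
observed_columns = ["age", "income", "exercise", "blood_pressure"]


def complete_graph(graph, columns, treatment, outcome):
    """Return a copy of the graph in which every column it does not mention
    is added as a potential confounder of treatment and outcome."""
    completed = {cause: list(effects) for cause, effects in graph.items()}
    mentioned = set(completed)
    for effects in graph.values():
        mentioned.update(effects)
    for column in columns:
        if column not in mentioned:
            completed[column] = [treatment, outcome]
    return completed


def common_causes(graph, treatment, outcome):
    """List variables with a direct edge into both treatment and outcome."""
    return sorted(
        cause
        for cause, effects in graph.items()
        if treatment in effects and outcome in effects
    )


full_graph = complete_graph(
    partial_graph, observed_columns, "exercise", "blood_pressure"
)
confounders = common_causes(full_graph, "exercise", "blood_pressure")
print(confounders)  # ['age', 'income']
```

Here "age" is a confounder because the user said so, while "income" becomes one only because the graph said nothing about it.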

Second, DoWhy distinguishes between identification and estimation. Identification of a causal effect involves making assumptions about the data-generating process and moving from counterfactual expressions to a target estimand. DoWhy uses the Bayesian graphical model framework to represent these assumptions formally, so users can specify what they know and what they don’t know about the data-generating process.

Third, for estimation, DoWhy offers methods based on the potential-outcomes framework, including matching, stratification, and instrumental variables.
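The split between identification and estimation can be illustrated with a small simulation. The sketch below uses only the Python standard library and is not DoWhy's implementation. The data-generating process, the effect size, and the function names are assumptions made up for this example. Identification says that, because z confounds both treatment and outcome, the effect should be computed within levels of z. Estimation then computes that quantity by stratification.

```python
import random
from statistics import mean

random.seed(0)

TRUE_EFFECT = 2.0  # effect of treatment on outcome, chosen for this simulation


def simulate(n):
    """Generate rows (z, t, y) in which z influences both t and y."""
    rows = []
    for _ in range(n):
        z = 1 if random.random() < 0.5 else 0
        t = 1 if random.random() < (0.8 if z else 0.2) else 0
        y = TRUE_EFFECT * t + 3.0 * z + random.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def naive_estimate(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def stratified_estimate(rows):
    """Average the within-stratum differences, weighting each stratum
    of z by its share of the data (a backdoor adjustment)."""
    total = 0.0
    for level in (0, 1):
        stratum = [row for row in rows if row[0] == level]
        total += naive_estimate(stratum) * len(stratum) / len(rows)
    return total


rows = simulate(20000)
naive = naive_estimate(rows)
adjusted = stratified_estimate(rows)
print(f"naive: {naive:.2f}, stratified: {adjusted:.2f}, true: {TRUE_EFFECT}")
```

The naive comparison mixes the effect of the treatment with the effect of z, so it overshoots. The stratified estimate should land close to the true effect used in the simulation.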

Lastly, DoWhy provides robustness tests and sensitivity checks for verifying the reliability of an obtained estimate. With these, you can test how the estimate changes as the assumptions vary. The library can also automatically check the validity of the obtained estimate against the assumptions in the graphical model.
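Two common robustness checks of this kind are a placebo treatment and a random data subset. The sketch below shows the idea in plain Python on simulated data. It is an illustration under made-up assumptions (the data-generating process and all function names are hypothetical), not a description of which checks DoWhy runs or how it runs them.

```python
import random
from statistics import mean

random.seed(1)


def simulate(n):
    """Generate rows (z, t, y) where z confounds t and y; true effect is 2.0."""
    rows = []
    for _ in range(n):
        z = 1 if random.random() < 0.5 else 0
        t = 1 if random.random() < (0.8 if z else 0.2) else 0
        y = 2.0 * t + 3.0 * z + random.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def stratified_estimate(rows):
    """Treated-minus-control difference within each level of z,
    averaged with weights equal to each stratum's share of the data."""
    total = 0.0
    for level in (0, 1):
        stratum = [row for row in rows if row[0] == level]
        treated = [y for _, t, y in stratum if t == 1]
        control = [y for _, t, y in stratum if t == 0]
        total += (mean(treated) - mean(control)) * len(stratum) / len(rows)
    return total


rows = simulate(20000)
original = stratified_estimate(rows)

# Placebo check: replace the real treatment with a coin flip.
# A sound estimator should now report an effect near zero.
placebo_rows = [(z, random.randint(0, 1), y) for z, _, y in rows]
placebo = stratified_estimate(placebo_rows)

# Subset check: re-estimate on a random half of the data.
# A stable estimate should not move much.
subset = random.sample(rows, len(rows) // 2)
subset_estimate = stratified_estimate(subset)

print(f"original: {original:.2f}")
print(f"placebo treatment: {placebo:.2f}")
print(f"random subset: {subset_estimate:.2f}")
```

If the placebo estimate were far from zero, or the subset estimate far from the original, that would be a reason to distrust the original estimate or the assumptions behind it.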

DoWhy supports Python 3+ and requires packages such as numpy, scipy, scikit-learn, pandas, pygraphviz (for plotting causal graphs), networkx (for analyzing causal graphs), matplotlib (for general plotting), and sympy (for rendering symbolic expressions).
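Before installing, it can help to see which of these packages are already present. The following is a small sketch using only the standard library. The `missing_modules` helper is hypothetical, and the list contains import names rather than package names (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "numpy",
    "scipy",
    "sklearn",
    "pandas",
    "pygraphviz",
    "networkx",
    "matplotlib",
    "sympy",
]


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES))
```

An empty list means every listed dependency can be imported in the current environment.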

Microsoft plans to add more features to the DoWhy library. These include improved estimation support, sensitivity methods, and interoperability with available estimation software.

For more information, check out the official DoWhy documentation.