Getting started with DoWhy: A simple example

This is a quick introduction to the DoWhy causal inference library. We will load a sample dataset and estimate the causal effect of a (pre-specified) treatment variable on a (pre-specified) outcome variable.

First, let us add the required path so that Python can find the DoWhy code, and then load the required packages.

In [1]:
import os, sys
sys.path.append(os.path.abspath("../../"))
In [2]:
import numpy as np
import pandas as pd

import dowhy
from dowhy.do_why import CausalModel
import dowhy.datasets

Let us first load a dataset. For simplicity, we simulate a dataset with linear relationships between the common causes and the treatment, and between the common causes and the outcome.

Beta is the true causal effect.

In [3]:
data = dowhy.datasets.linear_dataset(beta=10,
        num_common_causes=5,
        num_instruments = 2,
        num_samples=10000,
        treatment_is_binary=True)
df = data["df"]
print(df.head())
print(data["dot_graph"])
print("\n")
print(data["gml_graph"])
    Z0        Z1        X0        X1        X2        X3        X4    v  \
0  0.0  0.187572  0.243350 -0.077050 -0.226280 -1.095012 -1.053720  0.0
1  1.0  0.531907  0.184990  0.771895  1.227848 -0.163556 -1.573322  1.0
2  0.0  0.717115 -1.282233 -0.235209 -2.096997 -0.305452  0.744809  0.0
3  0.0  0.798858 -1.774324 -0.255721 -1.265465  0.134693 -0.278637  0.0
4  1.0  0.175432 -1.316845 -0.086447  0.141768 -0.058566 -2.156661  0.0

           y
0  -4.636747
1   9.152353
2  -5.435885
3  -8.862212
4 -11.509263
digraph { v ->y; U[label="Unobserved Confounders"]; U->v; U->y;X0-> v; X1-> v; X2-> v; X3-> v; X4-> v;X0-> y; X1-> y; X2-> y; X3-> y; X4-> y;Z0-> v; Z1-> v;}


graph[directed 1node[ id "v" label "v"]node[ id "y" label "y"]node[ id "Unobserved Confounders" label "Unobserved Confounders"]edge[source "v" target "y"]edge[source "Unobserved Confounders" target "v"]edge[source "Unobserved Confounders" target "y"]node[ id "X0" label "X0"] edge[ source "X0" target "v"] node[ id "X1" label "X1"] edge[ source "X1" target "v"] node[ id "X2" label "X2"] edge[ source "X2" target "v"] node[ id "X3" label "X3"] edge[ source "X3" target "v"] node[ id "X4" label "X4"] edge[ source "X4" target "v"]edge[ source "X0" target "y"] edge[ source "X1" target "y"] edge[ source "X2" target "y"] edge[ source "X3" target "y"] edge[ source "X4" target "y"]node[ id "Z0" label "Z0"] edge[ source "Z0" target "v"] node[ id "Z1" label "Z1"] edge[ source "Z1" target "v"]]

Note that we are using a pandas DataFrame to load the data. At present, DoWhy only supports pandas DataFrames as input.
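If you want to run the same analysis on your own data, load it into a pandas DataFrame first and pass that DataFrame as the data argument. A minimal sketch, using a hypothetical file name and columns for illustration only:

import pandas as pd

# Hypothetical file for illustration; any source that pandas can read
# into a DataFrame (CSV, SQL query, etc.) works the same way.
my_df = pd.read_csv("my_data.csv")

# my_df can then be passed as the `data` argument of CausalModel below.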

Interface 1 (recommended): Input causal graph

We now input the causal graph in the GML graph format (recommended). You can also use the DOT format; a sketch using the DOT graph follows the cell below.

In [4]:
# With graph
model=CausalModel(
        data = df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        graph=data["gml_graph"]
        )
Model to find the causal effect of treatment v on outcome y
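The graph printed earlier in DOT format can be passed in the same way; a minimal sketch (model_dot is just an illustrative name):

# Equivalent model construction using the DOT-format graph
model_dot = CausalModel(
        data=df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        graph=data["dot_graph"]
        )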
In [5]:
model.view_model()
In [6]:
from IPython.display import Image, display
display(Image(filename="causal_model.png"))
(Output: the causal graph image, saved as causal_model.png)

The above causal graph shows the assumptions encoded in the causal model. We can now use this graph to first identify the causal effect (go from a causal estimand to a probability expression), and then estimate the causal effect.

DoWhy philosophy: Keep identification and estimation separate

Identification can be achieved without access to the data; it requires only the graph. It results in an expression (the estimand) that can then be computed using the available data in the estimation step. It is important to understand that these are orthogonal steps.

  • Identification
In [7]:
identified_estimand = model.identify_effect()
print(identified_estimand)
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:{'X2', 'X4', 'Z0', 'X0', 'Z1', 'X1', 'X3', 'Unobserved Confounders'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'no'}
There are unobserved common causes. Causal effect cannot be identified.
WARN: Do you want to continue by ignoring these unobserved confounders? [y/n] y
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:['Z1', 'Z0']
Estimand type: ate
### Estimand : 1
Estimand name: iv
Estimand expression:
Expectation(Derivative(y, Z1)/Derivative(v, Z1))
Estimand assumption 1, Exclusion: If we remove {Z1,Z0}→v, then ¬(Z1,Z0→y)
Estimand assumption 2, As-if-random: If U→→y then ¬(U →→Z1,Z0)
### Estimand : 2
Estimand name: backdoor
Estimand expression:
d
──(Expectation(y|X2,X4,Z0,X0,Z1,X1,X3))
dv
Estimand assumption 1, Unconfoundedness: If U→v and U→y then P(y|v,X2,X4,Z0,X0,Z1,X1,X3,U) = P(y|v,X2,X4,Z0,X0,Z1,X1,X3)

  • Estimation
In [8]:
causal_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression")
print(causal_estimate)
print("Causal Estimate is " + str(causal_estimate.value))
LinearRegressionEstimator
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X2+X4+Z0+X0+Z1+X1+X3
*** Causal Estimate ***

## Target estimand
Estimand type: ate
### Estimand : 1
Estimand name: iv
Estimand expression:
Expectation(Derivative(y, Z1)/Derivative(v, Z1))
Estimand assumption 1, Exclusion: If we remove {Z1,Z0}→v, then ¬(Z1,Z0→y)
Estimand assumption 2, As-if-random: If U→→y then ¬(U →→Z1,Z0)
### Estimand : 2
Estimand name: backdoor
Estimand expression:
d
──(Expectation(y|X2,X4,Z0,X0,Z1,X1,X3))
dv
Estimand assumption 1, Unconfoundedness: If U→v and U→y then P(y|v,X2,X4,Z0,X0,Z1,X1,X3,U) = P(y|v,X2,X4,Z0,X0,Z1,X1,X3)

## Realized estimand
b: y~v+X2+X4+Z0+X0+Z1+X1+X3
## Estimate
Value: 10.000000000000172

Causal Estimate is 10.000000000000172
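Because identification and estimation are kept separate, the same identified estimand can be re-used with a different estimation method. A hedged sketch, assuming the propensity-score-matching backdoor estimator is available in this DoWhy version (the treatment here is binary); method names may differ across versions:

# Re-use the same estimand with another (assumed available) estimator.
psm_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching")
print("Propensity score matching estimate: " + str(psm_estimate.value))
# The dataset was simulated with beta=10, so both estimates can be
# compared against that known true effect.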

Interface 2: Specify common causes and instruments

In [9]:
# Without graph
model= CausalModel(
        data=df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        common_causes=data["common_causes_names"])
WARNING:dowhy.do_why:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
Model to find the causal effect of treatment v on outcome y
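As the heading suggests, this interface can also take instrumental variables. A hedged sketch; the instruments keyword and the data["instrument_names"] key are assumptions about this DoWhy version:

# Hedged sketch: also specifying instrumental variables by name.
model_iv = CausalModel(
        data=df,
        treatment=data["treatment_name"],
        outcome=data["outcome_name"],
        common_causes=data["common_causes_names"],
        instruments=data["instrument_names"])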
In [10]:
model.view_model()

We get the same causal graph. Identification and estimation are now done as before.

In [11]:
identified_estimand = model.identify_effect()
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:{'X2', 'X4', 'X0', 'X1', 'U', 'X3'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'yes'}
{'observed': 'no', 'label': 'Unobserved Confounders'}
There are unobserved common causes. Causal effect cannot be identified.
WARN: Do you want to continue by ignoring these unobserved confounders? [y/n] y
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
  • Estimation
In [12]:
estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True
        )
print(estimate)
print("Causal Estimate is " + str(estimate.value))
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X2+X4+X0+X1+X3
LinearRegressionEstimator
*** Causal Estimate ***

## Target estimand
Estimand type: ate
### Estimand : 1
Estimand name: iv
No such variable found!
### Estimand : 2
Estimand name: backdoor
Estimand expression:
d
──(Expectation(y|X2,X4,X0,X1,X3))
dv
Estimand assumption 1, Unconfoundedness: If U→v and U→y then P(y|v,X2,X4,X0,X1,X3,U) = P(y|v,X2,X4,X0,X1,X3)

## Realized estimand
b: y~v+X2+X4+X0+X1+X3
## Estimate
Value: 10.00000000000015

## Statistical Significance
p-value: 0.0

Causal Estimate is 10.00000000000015

Refuting the estimate

Let us now refute the obtained estimate.

Adding a random common cause variable

In [13]:
res_random=model.refute_estimate(identified_estimand, estimate, method_name="random_common_cause")
print(res_random)
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X2+X4+X0+X1+X3+w_random
Refute: Add a Random Common Cause
Estimated effect:(10.00000000000015,)
New effect:(-0.008273576773682779,)

Replacing treatment with a random (placebo) variable

In [14]:
res_placebo=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter", placebo_type="permute")
print(res_placebo)
INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~placebo+X2+X4+X0+X1+X3
Refute: Use a Placebo Treatment
Estimated effect:(10.00000000000015,)
New effect:(-0.14478665707304159,)

Removing a random subset of the data

In [15]:
res_subset=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter", subset_fraction=0.9)
print(res_subset)

INFO:dowhy.causal_estimator:INFO: Using Linear Regression Estimator
INFO:dowhy.causal_estimator:b: y~v+X2+X4+X0+X1+X3
Refute: Use a subset of data
Estimated effect:(10.00000000000015,)
New effect:(0.004110499904372903,)

As you can see, the linear regression estimator is very sensitive to simple refutations.