Getting Started with Causal Inference: Code, tutorials, and resources for causal inference
https://causalinference.gitlab.io/
Sun, 25 Apr 2021 09:16:21 +0000
Chapter 4: Estimation
<section id="causal-estimation-causal-reasoning-book-chapter4" data-number="1">
<p>Once we have found strategies for identifying causal quantities, we need to choose how to estimate those quantities using statistical methods. We describe the most commonly used methods using examples inspired by real computing applications. First, we present the basics of causal estimation: how do we go from an identified estimand to an estimator? We describe the challenges of bias and variance that must be traded off in every estimation method. Second, we present a variety of estimation methods, ranging from simple, interpretable estimators to complex machine learning-based estimators that are often required when the data is high-dimensional.</p>
<section id="sec:ch04-intro-example" data-number="1.1">
<h2 data-number="4.1"><span class="header-section-number">4.1</span> Example: Building an estimator</h2>
<p>In the last chapter, we saw that identification is the process of transforming a causal quantity to a statistical one, called the <em>identified estimand</em>. After identification, estimation is the process of computing this quantity using available data.</p>
<figure>
<img src="../assets/Figures/Chapter4/icecream-swimming-timeseries.png" id="fig:ch04-icecream-swimming-cor" alt="Figure 1: Number of searches issued for ice-cream and swimming in a search engine. Over the observed time period, there is a correlation between searches for the two queries. Try this example yourself using an online DoWhy notebook: https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_confounder_example.ipynb." /><figcaption aria-hidden="true">Figure 1: Number of searches issued for ice-cream and swimming in a search engine. Over the observed time period, there is a correlation between searches for the two queries. Try this example yourself using an online DoWhy notebook: <a href="https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_confounder_example.ipynb" class="uri">https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_confounder_example.ipynb</a>.</figcaption>
</figure>
<section id="estimating-causal-effect-of-ice-cream-consumption" data-number="1.1.1">
<h3 data-number="4.1.1"><span class="header-section-number">4.1.1</span> Estimating causal effect of ice-cream consumption</h3>
<p>Suppose that you have access to anonymized data on search queries issued to a popular search engine. Suppose further that you are a scholar of ice-cream and want to understand its effects on different health outcomes. Your colleague shares with you an intriguing observation: the number of search queries for “where can I have ice-cream” is highly correlated with the number of search queries for “where can I swim.” They ask you, <em>does this mean that having ice-cream makes people more likely to want to swim?</em> Fig. 1 plots the two search queries over time.</p>
<p>The observational evidence is quite strong, but you may be suspicious. In chapter <a href="/causal-reasoning-book-chapter1" data-reference-type="ref" data-reference="ch_patternsandpredictionsarenotenough">1</a>, we saw that temperature can be a confounder and therefore we should account for it. Chapter <a href="/causal-reasoning-book-chapter3" data-reference-type="ref" data-reference="ch_causalidentification">3</a> provided us with the correct estimand for doing so by introducing the back-door criterion. Specifically, assuming that there are no other confounders, the causal estimand is: <span id="eq:ch03-icecream-identify"><span class="math display">\[\label{eq:ch03-icecream-identify}
\mathbb{E}[\textit{swim}| \operatorname{do}(\textit{icecream})] = \sum_{\textit{temp}} \mathbb{E}[\textit{swim}|\textit{icecream}, \textit{temp}] \operatorname{P}(\textit{temp})\qquad(1)\]</span></span> where <em>swim</em> and <em>icecream</em> correspond to the number of queries for swimming and ice-cream respectively and <em>temp</em> is the temperature at that time. There are multiple ways to estimate the above quantity. One approach is to discretize <span class="math inline">\(\textit{icecream}\)</span> and <span class="math inline">\(\textit{temp}\)</span> and compute the mean value of <span class="math inline">\(\textit{swim}\)</span> for particular values of <span class="math inline">\(\textit{icecream}\)</span> and <span class="math inline">\(\textit{temp}\)</span>. Let the two values for <span class="math inline">\(\textit{icecream}\)</span> be <span class="math inline">\(\{\texttt{High}, \texttt{Low} \}\)</span> and <span class="math inline">\(\textit{temp}\)</span> be represented by three values <span class="math inline">\(\{ \texttt{Low}, \texttt{Medium},\texttt{High}\}\)</span>. Then the causal estimand can be estimated using the following two equations: <span id="eq:ch03-icecream-est"><span class="math display">\[\label{eq:ch03-icecream-est}
\begin{split}
\mathbb{E}[\textit{swim}|\textit{icecream}&=\texttt{ic}, \textit{temp}=\texttt{te}] = \frac{1}{N_{\texttt{te},\texttt{ic}}} \sum_{\substack{\textit{temp}=\texttt{te}, \\ \textit{icecream}=\texttt{ic}}}\textit{swim} \\
\operatorname{P}(\textit{temp}=\texttt{te}) &= \frac{N_{\texttt{te}}}{N}
\end{split}\qquad(2)\]</span></span></p>
<p>The above estimation works well as long as each of the conditional means can be estimated reliably. What happens when there is little data for a specific temperature and ice-cream queries bucket? Table <a href="#tbl:ch04-icecream-dataset" data-reference-type="ref" data-reference="tbl:ch04-icecream-dataset">1</a> presents such a dataset where very few people search for ice-cream at low temperatures. Each entry in the dataset is a contiguous period of time with low, medium or high temperature. For each entry, the dataset provides the frequency of ice-cream queries by level and the number of swimming queries. If we apply Eqns. 1, 2, we obtain a positive effect of ice-cream searches on swimming. But obviously we know that is not the case. The culprit is the estimation of the mean from very small samples (in this case, a single data point!). A possible fix is to exclude data from low temperatures and compute the effect over medium and high temperature. However, our new estimate no longer corresponds to all temperatures: based on the data, we cannot say anything about ice-cream’s effect at low temperatures.</p>
<p>The example illustrates the challenges with estimating a causal quantity from data, even with a straightforward identified estimand. While we showed an example with a single discrete confounder, the scenario of low frequency for specific values of the confounder is more likely when confounders are multi-dimensional and unavoidable when they are continuous. How to condition on high-dimensional confounders is one of the key questions for causal estimation methods.</p>
<div id="tbl:ch04-icecream-dataset">
<table>
<caption>Table 1: Search queries for ice-cream at different points in time and the associated temperature values. Each row corresponds to a specific (temperature, ice-cream queries) pair. Some combinations are rare: When the temperature is low, there is just a single time period for which the number of ice-cream queries are high. The right-most column shows the mean swimming queries over the same time periods.</caption>
<tbody>
<tr class="odd">
<td style="text-align: center;"><strong>Temperature</strong></td>
<td style="text-align: center;"><strong>Ice-cream queries</strong></td>
<td style="text-align: center;"><strong>Frequency</strong></td>
<td style="text-align: right;"><strong>Mean swimming queries</strong></td>
</tr>
<tr class="even">
<td style="text-align: center;">Low</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">9999</td>
<td style="text-align: right;">505</td>
</tr>
<tr class="odd">
<td style="text-align: center;">Low</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">1</td>
<td style="text-align: right;">560</td>
</tr>
<tr class="even">
<td style="text-align: center;">Medium</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">5000</td>
<td style="text-align: right;">2151</td>
</tr>
<tr class="odd">
<td style="text-align: center;">Medium</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">5000</td>
<td style="text-align: right;">2150</td>
</tr>
<tr class="even">
<td style="text-align: center;">High</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">1000</td>
<td style="text-align: right;">4750</td>
</tr>
<tr class="odd">
<td style="text-align: center;">High</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">9000</td>
<td style="text-align: right;">4751</td>
</tr>
</tbody>
</table>
</div>
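<p>The arithmetic above can be carried out directly on the aggregates in Table 1. The snippet below is a minimal sketch in plain Python (data transcribed from the table); note how the low-temperature stratum's contribution to the adjusted effect rests entirely on a single data point.</p>

```python
# Hypothetical data reproducing Table 1: (temperature, ice-cream level,
# frequency, mean swimming queries) for each stratum.
rows = [
    ("low", "low", 9999, 505), ("low", "high", 1, 560),
    ("med", "low", 5000, 2151), ("med", "high", 5000, 2150),
    ("high", "low", 1000, 4750), ("high", "high", 9000, 4751),
]
N = sum(f for _, _, f, _ in rows)  # total number of time periods

def mean_swim(temp, ic):
    return next(m for t, i, _, m in rows if t == temp and i == ic)

def p_temp(temp):
    return sum(f for t, _, f, _ in rows if t == temp) / N

# Back-door adjustment (Eqns. 1, 2): average the within-temperature
# contrast between high and low ice-cream queries, weighted by P(temp).
adjusted = sum((mean_swim(t, "high") - mean_swim(t, "low")) * p_temp(t)
               for t in ("low", "med", "high"))
print(round(adjusted, 2))  # 18.33
```

The positive estimate of 55/3 ≈ 18.33 is driven almost entirely by the single low-temperature, high-ice-cream observation, illustrating how a single poorly estimated conditional mean can dominate the result.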
</section>
<section id="estimating-causal-effect-using-a-randomized-experiment" data-number="1.1.2">
<h3 data-number="4.1.2"><span class="header-section-number">4.1.2</span> Estimating causal effect using a randomized experiment</h3>
<p>Many of the above estimation challenges are solved if we can create our own data. Let us select a random sample of people in a city and divide them into two groups: one group is provided ice-cream every day for a week and the other is provided some other snack. We can then track their swimming activity (or searches for swimming) over a week. This creates a randomized experiment where the treatment is having ice-cream and the outcome is swimming activity. From Chapter <a href="/causal-reasoning-book-chapter3" data-reference-type="ref" data-reference="ch_causalidentification">3</a>, we know that the identified estimand in a randomized experiment is simply the conditional expectation of outcome given treatment. With this as our target estimand and the experimental data available, we can estimate the quantity using a simple plug-in estimator without worrying about confounders like temperature.</p>
<p><span class="math display">\[\mathbb{E}[Y|do(T=t)] \rightarrow\mathbb{E}[Y|T=t] \rightarrow\frac{\sum_{i=1}^N \mathbb{1}_{[T_i=t]} Y_i}{\sum_{i=1}^N \mathbb{1}_{[T_i=t]}}\]</span></p>
<p>Using this basic estimator, we can now estimate different causal quantities of interest. The most commonly used is the average treatment effect (ATE). When the treatment is binary, the ATE simplifies to: <span class="math display">\[\begin{split}
\textrm{ATE} & := \mathbb{E}[Y|do(T=1)] - \mathbb{E}[Y|do(T=0)] \\
&\rightarrow\mathbb{E}[Y|T=1] - \mathbb{E}[Y|T=0] \rightarrow\frac{\sum_{i=1}^N \mathbb{1}_{[T_i=1]} Y_i}{\sum_{i=1}^N \mathbb{1}_{[T_i=1]}} - \frac{\sum_{i=1}^N \mathbb{1}_{[T_i=0]} Y_i}{\sum_{i=1}^N \mathbb{1}_{[T_i=0]}}
\end{split}\]</span> Note how we do not need to worry about temperature or any other variable. In a randomized experiment, the estimate is simply the difference in mean swimming activity between people who were provided ice cream and those who were not.</p>
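<p>The plug-in ATE is nothing more than a difference in group means. A minimal sketch on simulated experimental data (the data-generating process below is illustrative, not one of the book's examples):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
t = rng.integers(0, 2, n)            # randomized binary treatment
y = 10.0 * t + rng.normal(0, 1, n)   # illustrative outcome model, true ATE = 10

# Plug-in estimator: difference in sample means between treated and control
ate_hat = y[t == 1].mean() - y[t == 0].mean()
```

Because treatment is randomized, no adjustment for temperature or any other covariate is needed; `ate_hat` converges to the true effect as the sample grows.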
<p>Sometimes, however, we may be interested only in the effect on specific people, e.g., people belonging to a certain demographic or sharing some attribute. Since the treatment was randomized, these attributes are independent of treatment but can be correlated with the outcome. These attributes are called “effect modifiers,” and the resultant estimate is the <em>conditional</em> average treatment effect, commonly abbreviated as CATE. It can be estimated using the same principle: identify the causal quantity and then use a simple estimator for it.</p>
<p><span class="math display">\[\begin{split}
\textrm{CATE} & := \mathbb{E}[Y|do(T=1), C=c] - \mathbb{E}[Y|do(T=0), C=c] \\
&\rightarrow\mathbb{E}[Y|T=1, C=c] - \mathbb{E}[Y|T=0, C=c] \\
& \rightarrow\frac{\sum_{i=1}^N \mathbb{1}_{[T_i=1, C_i=c]} Y_i}{\sum_{i=1}^N \mathbb{1}_{[T_i=1, C_i=c]}} - \frac{\sum_{i=1}^N \mathbb{1}_{[T_i=0, C_i=c]} Y_i}{\sum_{i=1}^N \mathbb{1}_{[T_i=0, C_i=c]}}
\end{split}\]</span></p>
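<p>In code, the CATE estimator is the same difference in means, restricted to the subgroup of interest. A sketch with a made-up binary effect modifier whose two subgroups have true effects 5 and 15:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
c = rng.integers(0, 2, n)   # effect modifier, e.g. a demographic attribute
t = rng.integers(0, 2, n)   # randomized, hence independent of c
# Illustrative outcome model: effect is 5 when c=0 and 15 when c=1
y = (5.0 + 10.0 * c) * t + rng.normal(0, 1, n)

def cate_hat(cv):
    """Difference in means within the subgroup where C = cv."""
    m = c == cv
    return y[m & (t == 1)].mean() - y[m & (t == 0)].mean()
```

Averaging the two subgroup estimates weighted by subgroup size recovers the overall ATE of 10.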
</section>
<section id="challenges-in-estimation-with-finite-data" data-number="1.1.3">
<h3 data-number="4.1.3"><span class="header-section-number">4.1.3</span> Challenges in estimation with finite data</h3>
<p>If we have large enough data from a randomized experiment, then simple estimators like these are sufficient. With limited data, however, multiple challenges remain in estimating a target estimand from a randomized experiment (and therefore also for observational studies). For example, consider a randomized experiment where the true causal effect varies greatly between people; in particular, for a few people the true effect is orders of magnitude larger than for the rest. Looking at the difference of means over <span class="math inline">\(N\)</span> samples for the ATE above, the resultant estimate can change substantially depending on whether such outlying people happen to be assigned to treatment or control. If you run the experiment again, you may get a substantially different answer! Knowing this, one may trim all recorded outcomes so that outcomes above a certain threshold are capped at the threshold value. This makes the results of each experiment more repeatable, but creates a different problem: due to this arbitrary cutoff, we are no longer estimating the target estimand.</p>
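<p>A small simulation makes this repeatability problem concrete. Below, a hypothetical population contains rare units whose effect is three orders of magnitude larger than the rest; capping recorded outcomes stabilizes the estimate across repeated experiments but biases it away from the true ATE. All parameters here are made up for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def run_experiment(cap=None):
    n = 200
    t = rng.integers(0, 2, n)
    # ~1% of units are extreme responders (effect 1000 instead of 1)
    effect = np.where(rng.random(n) < 0.01, 1000.0, 1.0)
    y = effect * t
    if cap is not None:
        y = np.minimum(y, cap)   # trim recorded outcomes at the threshold
    return y[t == 1].mean() - y[t == 0].mean()

raw = np.array([run_experiment() for _ in range(500)])
capped = np.array([run_experiment(cap=10.0) for _ in range(500)])
# raw: centered near the true ATE (0.99*1 + 0.01*1000 = 10.99) but with
#   huge run-to-run spread, since a couple of outliers dominate each run;
# capped: repeatable across runs, but now targets a different, biased quantity
```

The tradeoff is visible in the two arrays: `capped` has far smaller spread than `raw`, at the cost of a mean well below the true ATE.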
<p>When the treatment is continuous, a different set of challenges emerge. Perhaps the most fundamental is that the treatment effect is no longer well-defined. Do we estimate the change in swimming queries when ice-cream queries increases from 0 to 10, or from 100 to 101? If the effect varies for different values of treatment, which one do we report?</p>
<p>Finally, there is also a more general question of whether the search engine users are representative of the underlying population. If not, we may estimate the causal effect of ice-cream on users of the search engine, but that may not generalize to our population of interest. Perhaps users of the search engine are older, healthier, or different in some other way from the general population? Just as we defined the treatment assignment mechanism for identification in chapter <a href="/causal-reasoning-book-chapter3" data-reference-type="ref" data-reference="ch_causalidentification">3</a>, the <em>sampling mechanism</em> is important for estimation. A sampling mechanism is the process by which data is collected, which in turn determines how well the sample represents a target probability distribution. For example, by collecting data only from search engine users, the resultant data may not represent the target population (here, the people in a city).</p>
<p>Formally, a sampling mechanism defines a probability distribution from which the available data points are drawn independently. The extent to which this <em>sampling distribution</em> matches the target data distribution determines the suitability of the data: unless the available data can be assumed to be representative of the target population, any quantity estimated from it corresponds only to the sampling distribution from which it was drawn.</p>
</section>
</section>
<section id="sec:ch04-bias-variance" data-number="1.2">
<h2 data-number="4.2"><span class="header-section-number">4.2</span> The bias-variance tradeoff</h2>
<p>Our example above illustrates the fundamental tradeoff between <em>bias</em> and <em>variance</em> of a statistical estimator. Bias corresponds to the degree to which the estimate is expected to deviate from the identified estimand. Variance corresponds to how much the estimate is expected to differ if the experiment is run multiple times.</p>
<p>In general, reducing the variance—e.g., by removing outliers in our example above—increases bias, and vice-versa. It is difficult to obtain an estimator that will have both low bias and low variance. To complicate matters, an estimator with low bias and low variance for one dataset may have very different properties on a different dataset, thus in general there is no universally best estimator.</p>
<p>The goal of estimation is to find a satisfactory tradeoff between bias and variance. Formally, for a target estimand <span class="math inline">\(h\)</span> and its estimator <span class="math inline">\(\hat{h}\)</span>, bias is defined as the difference between the target estimand and the expected value of <span class="math inline">\(\hat{h}\)</span> over multiple independent samples. Variance is defined as the expected value of the squared difference between estimates on different samples and the mean estimate. <span class="math display">\[\begin{split}
Bias &:= |h - \mathbb{E}[\hat{h}]| \\
Var &:= \mathbb{E}[(\hat{h}- \mathbb{E}[\hat{h}])^2]
\end{split}\]</span></p>
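<p>Both definitions can be checked by simulation: draw many independent datasets, compute the estimate on each, and compare the collection of estimates to the target. A sketch using a toy data-generating process (true effect 5 for half of the units and 15 for the other half, so the target h is 10; all details are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
true_h = 10.0

def estimate(n=1000):
    t = rng.integers(0, 2, n)
    beta = np.where(rng.random(n) < 0.5, 5.0, 15.0)  # heterogeneous effect
    y = beta * t
    return y[t == 1].mean() - y[t == 0].mean()       # difference in means

# Repeat the "experiment" many times to approximate E[h_hat]
estimates = np.array([estimate() for _ in range(2000)])
bias = abs(true_h - estimates.mean())                 # |h - E[h_hat]|
variance = ((estimates - estimates.mean()) ** 2).mean()
# The estimator is unbiased (bias ~ 0); its variance shrinks as n grows
```

In practice only one dataset is available, so neither quantity can be computed this way; the simulation merely illustrates what the definitions measure.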
<figure>
<img src="../assets/Figures/Chapter4/bias-variance-tradeoff.png" id="fig:ch04-bias-variance-tradeoff" alt="Figure 2: Bias and variance of different estimators for the same identified estimand, \mathbb{E}[Y|T=t]. The dotted horizontal line shows the true average causal effect, 10. Vertical distance of the mean estimate from the true ACE denotes the bias; variance is denoted roughly by the length of each boxplot. The first estimator is the optimal one with lowest bias and variance. The next two estimators are biased while the last three are unbiased with high variance. Try out these estimators yourself in an online DoWhy notebook: https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_bias_variance_tradeoff.ipynb." /><figcaption aria-hidden="true">Figure 2: Bias and variance of different estimators for the same identified estimand, <span class="math inline">\(\mathbb{E}[Y|T=t]\)</span>. The dotted horizontal line shows the true average causal effect, 10. Vertical distance of the mean estimate from the true ACE denotes the bias; variance is denoted roughly by the length of each boxplot. The first estimator is the optimal one with lowest bias and variance. The next two estimators are biased while the last three are unbiased with high variance. Try out these estimators yourself in an online DoWhy notebook: <a href="https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_bias_variance_tradeoff.ipynb" class="uri">https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_bias_variance_tradeoff.ipynb</a>.</figcaption>
</figure>
<section id="factors-affecting-the-bias-and-variance-of-a-causal-estimator" data-number="1.2.1">
<h3 data-number="4.2.1"><span class="header-section-number">4.2.1</span> Factors affecting the bias and variance of a causal estimator</h3>
<p>Fig. 2 illustrates the different ways in which an estimator can have high bias or high variance. For simplicity, the treatment is assumed to be binary, with two levels <span class="math inline">\(0\)</span> and <span class="math inline">\(2\)</span>. Data was generated using the structural equation, <span class="math inline">\(y=\beta t^2\)</span> where <span class="math inline">\(\beta=5\)</span> for half of the population and <span class="math inline">\(\beta=15\)</span> for the other half, thus leading to a mean causal effect of <span class="math inline">\(10\)</span>. There are no confounders in the causal model. All estimators use a dataset with 1000 units except the “Few samples” estimator, which uses 50 units.</p>
<p>Based on the structural equation, the “correct” estimator uses linear regression with a transformed <span class="math inline">\(t^2\)</span> feature. However, in practice the structural equation is not known. Even if we assume that we know the correct identified estimand, <span class="math inline">\(\mathbb{E}[Y|\operatorname{do}(T=t)] \rightarrow\mathbb{E}[Y|T=t]\)</span>, it is not guaranteed that we will recover the correct estimate. The quality of the resultant estimate depends on the following factors.</p>
<ul>
<li><p><strong>Model choice</strong>. If the estimating model is not flexible or complex enough to model the relationships in data, or if it is too complex and overfits available data, then the resultant estimator will be biased. The second estimator in Fig. 2 shows linear regression applied directly to the treatment variable. This parameterized model fails to capture that the outcome depends on the square of the treatment values and will result in an incorrect estimate, even with infinite data.</p></li>
<li><p><strong>Sampling mechanism</strong>. The second important contributor to bias is the sampling mechanism. If the treatment effect differs for different units, then it is important that the available data represents an independently sampled set from the target distribution. If not, then the sampling distribution is different from the target distribution and the estimated effect suffers from bias, as shown in the third boxplot in Fig. 2. The estimator uses the correct model specification with linear regression on <span class="math inline">\(t^2\)</span>, but uses data that is sampled more often from units having a smaller effect. The result is a biased estimate.</p></li>
<li><p><strong>Sample size</strong>. As with any statistical estimation, a small sample size increases the variance of a causal effect estimator. The fourth boxplot in Fig. 2 shows the variance of a linear regression estimator on <span class="math inline">\(t^2\)</span> (the correct model specification) when the estimator is provided a smaller dataset of 50 samples. With fewer data samples, variance is higher. The estimate can be anywhere between 5 and 15, although the average of the estimates over multiple datasets is correct and thus there is no bias. Note that the current estimator does not condition on any confounders. When an estimator additionally conditions on confounders, it is not enough to ensure a high overall sample size: high variance can occur if data for <em>any</em> one of the confounder values has low frequency. For example, given an identified estimand with ten binary confounders, the total number of unique confounder configurations is <span class="math inline">\(2^{10}=1024\)</span>. A dataset with 10000 units will lead to fewer than 10 units per confounder configuration on average, and a dataset of 1000 units will not even cover all the confounder configurations.</p></li>
<li><p><strong>Overlap</strong>. Related to the issue of small sample size, variance of an estimator also increases if there is not enough data for certain treatment values or corresponding control values, and thus low overlap between the treatment and control values. The overlap problem is exacerbated when we want to estimate the treatment effect over specific subsets <span class="math inline">\(z\)</span> of the data, also known as the conditional treatment effect, since we require every such subset to have enough data points for treatment and control values. As we saw in chapter <a href="/causal-reasoning-book-chapter3" data-reference-type="ref" data-reference="ch_causalidentification">3</a>, non-zero overlap is enough for identification (<span class="math inline">\(\operatorname{P}(a|z)>0\)</span> whenever identifying <span class="math inline">\(P(y|\operatorname{do}(a))\)</span>) but low overlap leads to variance in the estimator. This is because causal estimation depends fundamentally on comparing the values of an outcome on different values of the treatment. If some treatments are rare, then estimation of the outcome conditional on those treatments can have high variance. We saw this already in the ice-cream example (Table <a href="#tbl:ch04-icecream-dataset" data-reference-type="ref" data-reference="tbl:ch04-icecream-dataset">1</a>) where overlap of the treatment was poor for the data with “low” temperature. The second boxplot from the right in Fig. 2 shows the estimated effect on a dataset where 99% of the units belong to the treated group and only 1% of the units to control group, using an unbiased estimator (correct model specification without any sampling bias). Due to the poor overlap, the estimated effect varies widely from a minimum of <span class="math inline">\(0\)</span> to a maximum of <span class="math inline">\(20\)</span>.</p></li>
<li><p><strong>Outliers</strong>. Having outlier values of the outcome in the data also increases the variance of an estimator, as seen in the rightmost boxplot in Fig. 2. Outliers can arise from noise in measuring the outcome, or from a small number of units having a very different value of the causal effect. Unlike the issue of small sample size, which can be resolved by collecting more data, the issue of outliers can remain even with large datasets. Unfortunately there is no easy solution: removing the outliers leads to bias, since the sampling mechanism no longer corresponds to the target distribution, while including them increases the variance of the effect estimator.</p></li>
</ul>
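<p>The overlap point in particular is easy to demonstrate by simulation. The sketch below (all numbers illustrative) compares a balanced experiment against one where 99% of units are treated; the difference-in-means estimator is unbiased in both designs, but the tiny control group inflates the run-to-run spread:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate(p_treat, n=2000):
    t = (rng.random(n) < p_treat).astype(int)
    beta = np.where(rng.random(n) < 0.5, 5.0, 15.0)  # heterogeneous effect, mean 10
    y = beta * t + rng.normal(0, 5, n)               # noisy outcome
    return y[t == 1].mean() - y[t == 0].mean()

# Repeat each design many times to see the sampling distribution of the estimate
balanced = np.array([estimate(0.5) for _ in range(500)])
skewed = np.array([estimate(0.99) for _ in range(500)])
# Both sets of estimates center near 10, but with ~20 control units per run
# the skewed design produces a far wider spread of estimates
```

The control-group mean, computed from only a handful of units, dominates the variance of the skewed estimator even though the total sample size is identical.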
<p>Fig. 2 showed the impact of adding any single issue to an estimator, but in practice, multiple issues can affect an estimator. Complicating matters, fixing a bias issue often exacerbates a variance issue and vice-versa. For example, variance in an estimator due to small sample size can be alleviated by assuming a simpler parametric model (e.g., linear regression) but that introduces bias due to model choice. Correspondingly, reducing model choice bias often involves selecting a complex model with several parameters that in turn increases the variance for estimating those parameters from the same data. In general, while variance can be reduced with a larger dataset, bias cannot be resolved by collecting more data. Even with infinite data, an incorrectly specified model will lead to a biased estimate, and so will an infinite dataset with an incorrect sampling mechanism.</p>
</section>
<section id="many-estimation-methods-for-the-same-identified-estimand" data-number="1.2.2">
<h3 data-number="4.2.2"><span class="header-section-number">4.2.2</span> Many estimation methods for the same identified estimand</h3>
<p>In practice, it is not possible to compute the bias of an estimator given observational data. Therefore, estimation methods need to make their best guess about which assumptions are useful for computing the effect. As a result, multiple competing estimation methods can exist for the same target estimand.</p>
<p>The set of assumptions made by an estimator characterizes the tradeoff it makes between bias and variance. In section <a href="#sec:ch04-balance-methods" data-reference-type="ref" data-reference="sec:ch04-balance-methods">4.3</a>, we will start with non-parametric estimators that make the fewest assumptions about the structural equation for the outcome variable; these estimators therefore have low bias due to model choice. Another benefit is that the resulting methods are simple to understand and interpretable. Since it is not possible to compute the bias of an estimator in practice, simple, interpretable estimators offer the best bet for obtaining an estimate with low bias. However, simplicity comes at the cost of high variance: with a small sample size, poor overlap, or outliers in the data, simple methods yield high-variance estimates.</p>
<p>Especially in high-dimensional data, simple estimators fail to be effective since their variance can be prohibitively high. Therefore, we will next discuss methods that make parametric assumptions on how the data was generated. Some of these methods use a parametric model to estimate the treatment assignment, while still using a non-parametric computation of the effect on outcome (sections <a href="#sec:ch04-balance-methods" data-reference-type="ref" data-reference="sec:ch04-balance-methods">4.3</a> and <a href="#sec:ch04-weighting-methods" data-reference-type="ref" data-reference="sec:ch04-weighting-methods">4.4</a>). Others assume a parametric model for both the treatment and the outcome (section <a href="#sec:ch04-ipw-predictive-model" data-reference-type="ref" data-reference="sec:ch04-ipw-predictive-model">4.4.2</a>). All these estimation methods incur the cost of model choice bias, but are willing to make the tradeoff to obtain lower variance that makes them applicable for high-dimensional data. At the extreme, there are methods that ignore the treatment assignment altogether and directly use a parametric model for the outcome (section <a href="#sec:ch04-outcome-methods" data-reference-type="ref" data-reference="sec:ch04-outcome-methods">4.5</a>).</p>
<p>While they offer lower variance, the bias of parametric model-based estimators depends critically on how well their model captures the true structural equations. To model the structural equations, many estimation methods reduce the causal effect estimation problem to a series of prediction problems (e.g., predicting the treatment or the outcome variable). Machine learning models are well-suited for such supervised learning tasks and thus play a pivotal role in the parametric estimation methods.</p>
<p>Since the bias-variance tradeoff applies universally for all methods, we present methods organized by the different approaches they take to estimate the causal effect.</p>
<ul>
<li><p><strong>Balance-based</strong>. Conditioning on variables to estimate a target estimand.</p></li>
<li><p><strong>Weighting-based</strong>. Weighted sampling of a dataset.</p></li>
<li><p><strong>Outcome model-based</strong>. Fitting a model that predicts the outcome based on the treatment.</p></li>
<li><p><strong>Threshold-based</strong>. Exploiting a discontinuity in one of the variables to obtain local effects.</p></li>
</ul>
</section>
</section>
<section id="sec:ch04-balance-methods" data-number="1.3">
<h2 data-number="4.3"><span class="header-section-number">4.3</span> Balance-based methods</h2>
<p>For <em>observational</em> data, one of the most popular strategies to obtain an estimator is to directly approximate the back-door criterion. In the last chapter, we saw that the back-door estimand can be expressed as, <span id="eq:ch04-backdoor-eqn"><span class="math display">\[\label{eq:ch04-backdoor-eqn}
\mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] \rightarrow\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}[\textrm{y}|\textrm{t}=t_0, \mathbf{W}=\mathbf{w}]\operatorname{P}(\mathbf{W}=\mathbf{w})\qquad(3)\]</span></span> where <span class="math inline">\(\mathbf{W}\)</span> represents all common causes of the treatment <span class="math inline">\(\textrm{t}\)</span> and outcome <span class="math inline">\(\textrm{y}\)</span>. To estimate the above expression, we need to create data subsets where <span class="math inline">\(\mathbf{W}\)</span> is constant. This goal is often called <em>covariate balance</em> (where the covariates are the confounders), or simply balance. Balance-based methods try to attain covariate balance on the observed data so that the above equation can be estimated. We will start with simple plug-in estimators of the above expression and then move to model-based methods.</p>
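<p>Eq. 3 suggests a direct plug-in estimator: stratify by the confounder, take the within-stratum contrast between treatment and control, and average the contrasts weighted by the stratum probabilities. A minimal sketch on simulated data (the variable names and data-generating process are made up for illustration):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 50_000
w = rng.integers(0, 3, n)                         # discrete confounder
t = (rng.random(n) < 0.2 + 0.15 * w).astype(int)  # w raises treatment probability
y = 2.0 * t + 3.0 * w + rng.normal(0, 1, n)       # true effect of t on y is 2

df = pd.DataFrame({"w": w, "t": t, "y": y})

# Naive contrast ignores w and is confounded upward
naive = df.loc[df.t == 1, "y"].mean() - df.loc[df.t == 0, "y"].mean()

# Back-door plug-in (Eq. 3): average within-stratum contrasts, weighted by P(w)
strata = df.groupby(["w", "t"])["y"].mean().unstack("t")
p_w = df["w"].value_counts(normalize=True).sort_index()
backdoor = ((strata[1] - strata[0]) * p_w).sum()
```

Here the naive difference in means overstates the effect because units with larger `w` are both more often treated and have higher outcomes, while the stratified plug-in recovers a value close to the true effect of 2. As in the ice-cream example, this only works when every stratum contains enough treated and control units.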
<div id="fig:ch04-stratification" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tbody>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/continuous_perfect_overlap.png" style="width:100.0%" alt="a" /><figcaption aria-hidden="true">a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/continuous_treatment_overlap.png" style="width:100.0%" alt="b" /><figcaption aria-hidden="true">b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/continuous_partial_overlap.png" style="width:100.0%" alt="c" /><figcaption aria-hidden="true">c</figcaption>
</figure></td>
</tr>
<tr class="even">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/stratification_perfect_overlap.png" style="width:100.0%" alt="d" /><figcaption aria-hidden="true">d</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/stratification_treatment_overlap.png" style="width:100.0%" alt="e" /><figcaption aria-hidden="true">e</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/stratification_partial_overlap.png" style="width:100.0%" alt="f" /><figcaption aria-hidden="true">f</figcaption>
</figure></td>
</tr>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/matching_perfect_overlap.png" style="width:100.0%" alt="g" /><figcaption aria-hidden="true">g</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/matching_treatment_overlap.png" style="width:100.0%" alt="h" /><figcaption aria-hidden="true">h</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/matching_partial_overlap.png" style="width:100.0%" alt="i" /><figcaption aria-hidden="true">i</figcaption>
</figure></td>
</tr>
<tr class="even">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/ipw_perfect_overlap.png" style="width:100.0%" alt="j" /><figcaption aria-hidden="true">j</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/ipw_treatment_overlap.png" style="width:100.0%" alt="k" /><figcaption aria-hidden="true">k</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/ipw_partial_overlap.png" style="width:100.0%" alt="l" /><figcaption aria-hidden="true">l</figcaption>
</figure></td>
</tr>
</tbody>
</table>
<p>Figure 3: Visual illustrations of the backdoor-based estimation methods for estimating the average treatment effect on the treated. Treatment data points are blue, control points are red. The first row shows the distribution of the original data, with overlap conditional on the confounder <span class="math inline">\(w\)</span> decreasing from left to right. The second row shows the stratification estimator, which discretizes the distribution and computes the estimate only on values of <span class="math inline">\(w\)</span> that have overlap; the rest are ignored. The third row visualizes the matching estimator. Horizontal lines connect the original treatment (blue) points to their matched control (red) points. The distance (and hence the difference in the confounder <span class="math inline">\(w\)</span>) between matched points increases as the overlap is reduced. Finally, the fourth row shows the inverse propensity estimator, which reweights the data distribution to obtain a nearly identical distribution across treatment and control. As the overlap decreases from left to right, note how a few control data points are magnified to probabilities higher than those of the treatment points at the same <span class="math inline">\(w\)</span>, while some other values of <span class="math inline">\(w\)</span> are still not covered. Try out these estimators yourself using an online DoWhy notebook: <a href="https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/visualize_conditioning_estimators.ipynb" class="uri">https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/visualize_conditioning_estimators.ipynb</a>. a — Data I, b — Data II, c — Data III, d — Stratification I, e — Stratification II, f — Stratification III, g — Matching I, h — Matching II, i — Matching III, j — Weighting I, k — Weighting II, l — Weighting III</p>
</div>
<section id="simple-stratification" data-number="1.3.1">
<h3 data-number="4.3.1"><span class="header-section-number">4.3.1</span> Simple Stratification</h3>
<p>A straightforward way to estimate the back-door estimand is to estimate the conditional probabilities directly. We obtain, <span class="math display">\[\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{\mathbf{w}\in \mathbb{W}} \frac{n_w}{N} \frac{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}]}\,y_i}{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}]}}\]</span> where <span class="math inline">\(n_w\)</span> is the number of data points in the stratum with <span class="math inline">\(\mathbf{w}=\mathbf{w}\)</span> and <span class="math inline">\(N\)</span> is the total number of data points. Effectively, the method divides the data into strata, each having the same value of <span class="math inline">\(\mathbf{w}\)</span>, computes the mean outcome under <span class="math inline">\(\textrm{t}=t_0\)</span> in each stratum, and then sums these estimates weighted by the number of data points in each stratum, leading to a weighted average of the stratum-wise effects.</p>
<p>This is an unbiased estimator of the backdoor estimand. Due to the division into multiple strata, this method is called <em>stratification</em>. It is a suitable method when <span class="math inline">\(\mathbf{w}\)</span> consists of categorical variables and is of low dimension relative to the dataset size. Otherwise, the number of data points in each stratum may be too few to estimate the stratum-wise values reliably.</p>
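<p>As a concrete illustration, the plug-in stratification estimator can be written in a few lines of Python. This is a minimal sketch; the data layout of <code>(t, w, y)</code> tuples and the function name are our own choices for illustration, not a fixed API:</p>

```python
from collections import defaultdict

def stratified_estimate(data, t0):
    """Plug-in stratification estimate of E[y | do(t = t0)].

    `data` is a list of (t, w, y) tuples, where w is a hashable
    (discrete) confounder value. Strata with no units at t = t0
    are skipped since they contribute no stratum-wise estimate.
    """
    by_w = defaultdict(list)
    for t, w, y in data:
        by_w[w].append((t, y))
    n = len(data)
    estimate = 0.0
    for rows in by_w.values():
        selected = [y for t, y in rows if t == t0]
        if not selected:
            continue
        # stratum mean outcome, weighted by empirical P(w) = n_w / N
        estimate += (len(rows) / n) * (sum(selected) / len(selected))
    return estimate
```

<p>For example, with two strata, the estimate is the stratum-size-weighted average of the per-stratum mean outcomes under <code>t0</code>.</p>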
<p>Formally, for a stratification estimator to have low variance, each stratum must have a minimum number of data points. This is because the stratification estimator is a weighted mean of the strata's effects: if some strata have very few samples and extreme estimated values, they can sway the overall mean (although the effect is dampened by the weighting by stratum size). For this reason, in practice, it is worth adding a cutoff <span class="math inline">\(M\)</span> on the minimum stratum size, which avoids sensitivity to outlier values at the cost of producing a biased estimate. <span class="math display">\[\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{\mathbf{w}\in \mathbb{W}; n_w > M} \frac{n_w}{N} \frac{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}]}\,y_i}{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}]}}\]</span></p>
<p>Sometimes the excluded strata share attributes that are interpretable. For example, it could be that all data points from a particular country are excluded due to low sample size, or all data points from a certain age group. Often such sample size differences between strata are a result of natural processes (fewer transactions from a particular country) or discrepancies in collecting the data (e.g., sampling younger people more). In these cases, it is helpful to interpret the estimated effect as a <em>local</em> average treatment effect, applicable only to the strata included in the analysis, <span class="math inline">\(\mathbb{W}' \subset \mathbb{W}\)</span>. Formally, the local ATE is a biased estimate of the (global) ATE, but it is often useful in its own right as the effect for the included strata. <span class="math display">\[\textrm{LATE}:= \sum_{\mathbf{w}\in \mathbb{W}'} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{\mathbf{w}\in \mathbb{W}'} \frac{n_w}{N} \frac{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}]}\,y_i}{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}]}}\]</span></p>
<p>The conditional average treatment effect (CATE) can be calculated in a similar way. Given some variables <span class="math inline">\(\mathbf{v}\subseteq \mathbf{w}\)</span> on which CATE estimates are needed, we choose the appropriate strata for each value <span class="math inline">\(\mathbf{v}=\mathbf{v}\)</span> of the variables. <span class="math display">\[\begin{split}
\textrm{CATE}(\mathbf{v}=\mathbf{v}) &:= \sum_{\mathbf{w}\in \mathbb{W}, \mathbf{v}=\mathbf{v}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w}, \mathbf{v}=\mathbf{v})\Pr(\mathbf{w}=\mathbf{w}, \mathbf{v}=\mathbf{v}) \\
&\rightarrow\sum_{\mathbf{w}\in \mathbb{W}, \mathbf{v}=\mathbf{v}} \frac{n_w}{N} \frac{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w},\mathbf{v}_i=\mathbf{v}]}\,y_i}{\sum_{i=1}^{N} \mathbb{1}_{[t_i=t_0, \mathbf{w}_i=\mathbf{w}, \mathbf{v}_i=\mathbf{v}]}}
\end{split}\]</span></p>
<p>When <span class="math inline">\(\mathbf{w}\)</span> includes continuous variables, simple stratification becomes challenging. A workaround is to discretize continuous variables at quantiles or at splits informed by domain knowledge. Such coarsening allows us to create discrete strata, but the resultant estimate may depend heavily on the chosen splits. Further, it limits the application of the estimate: one can only talk about the effect of increasing a continuous variable from one bin to another, but not in smaller steps.</p>
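<p>The coarsening step can be sketched as follows: bin a continuous confounder at its empirical quantiles, and then stratify on the bin index. The helper below is a minimal illustration (the function name and the simple binning rule are our own; libraries such as pandas provide <code>qcut</code> for this purpose):</p>

```python
def quantile_bins(values, n_bins):
    """Map each value to a quantile-bin index in {0, ..., n_bins - 1},
    so that a continuous confounder can serve as a discrete stratum label."""
    order = sorted(values)
    # bin edges at the empirical 1/n_bins, 2/n_bins, ... quantiles
    edges = [order[int(len(order) * k / n_bins)] for k in range(1, n_bins)]

    def bin_of(v):
        # index = number of edges at or below v
        return sum(1 for e in edges if v >= e)

    return [bin_of(v) for v in values]
```

<p>The resulting bin indices can be passed as the discrete confounder to any stratification estimator.</p>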
</section>
<section id="matching" data-number="1.3.2">
<h3 data-number="4.3.2"><span class="header-section-number">4.3.2</span> Matching</h3>
<p>When exact conditioning on <span class="math inline">\(\mathbf{w}\)</span> is not possible, either due to high dimensionality or continuous variables, we can find units with as similar <span class="math inline">\(\mathbf{w}\)</span> as possible. A simple way is to define a metric of distance between any two units based on their <span class="math inline">\(\mathbf{w}\)</span> values. Then for every unit <span class="math inline">\(i\)</span> in the treatment group, we find the closest unit <span class="math inline">\(i_{matched}\)</span> in the control group (and vice versa), as an approximation to them sharing the same <span class="math inline">\(\mathbf{w}\)</span>. For each matched pair, we compute the difference of outcomes between the treated and control units and average it over all pairs. <span class="math display">\[\sum_{\mathbf{w}\in \mathbb{W}_S} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\frac{1}{N}\sum_{i=1}^{N}
\begin{cases}
y_i & \textrm{ if } t_i = t_0 \\
y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0
\end{cases}\]</span></p>
<p>An important choice is that of the distance metric, since it dictates what it means to have similar <span class="math inline">\(\mathbf{w}\)</span>. Given that different elements of <span class="math inline">\(\mathbf{w}\)</span> may have different scales, a common choice is a standardized distance like the Mahalanobis distance, which accounts for the difference in variances between different elements of <span class="math inline">\(\mathbf{w}\)</span>. <span class="math display">\[\operatorname{dist}(\mathbf{w}_i, \mathbf{w}_j) = \sqrt{(\mathbf{w}_i - \mathbf{w}_j)^T\Sigma^{-1}(\mathbf{w}_i-\mathbf{w}_j)}\]</span> where <span class="math inline">\(\Sigma\)</span> is the covariance matrix for <span class="math inline">\(\mathbf{w}\)</span>.</p>
<p>Still, in a finite sample of data, it is possible that the best match for a unit is not similar to it at all. Therefore, just as we needed a minimum cutoff for stratum size in stratification, we need a cutoff for the maximum distance at which two units may be matched. However, the goal here is different: the cutoff on distance reduces <em>bias</em> in the matching estimator that arises from comparing treated and control units who may not be similar on <span class="math inline">\(\mathbf{w}\)</span>.</p>
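<p>A minimal matching sketch for a scalar confounder is below. For simplicity it uses absolute distance rather than the Mahalanobis distance, matches only treated units to controls (so it estimates the effect on the treated), and applies a caliper: pairs farther apart than the cutoff are dropped, trading fewer matched units for lower bias. The names and data layout are illustrative:</p>

```python
def matching_att(treated, control, caliper):
    """ATT by nearest-neighbour matching on a scalar confounder w.

    `treated` and `control` are non-empty lists of (w, y) pairs.
    Treated units whose closest control lies farther than `caliper`
    are dropped; returns NaN if no pair satisfies the caliper.
    """
    diffs = []
    for w_t, y_t in treated:
        # closest control unit in confounder space
        w_c, y_c = min(control, key=lambda unit: abs(unit[0] - w_t))
        if abs(w_c - w_t) <= caliper:
            diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs) if diffs else float("nan")

# toy example: matches are (w=0.0 -> w=0.1) and (w=1.0 -> w=0.9)
att = matching_att([(0.0, 3.0), (1.0, 5.0)],
                   [(0.1, 1.0), (0.9, 2.0)], caliper=0.2)
```

<p>Shrinking the caliper below 0.1 in this toy example would leave no valid matches, illustrating the bias-variance tension discussed above.</p>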
<p>The effect for particular subgroups of units can be calculated in a similar way to stratification. CATE on a subset of the data with <span class="math inline">\(\mathbf{v}=\mathbf{v}\)</span> is given by: <span class="math display">\[\begin{split}
\textrm{CATE}(\mathbf{v}=\mathbf{v})& := \sum_{\mathbf{w}\in \mathbb{W}_S,\mathbf{v}=\mathbf{v}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w},\mathbf{v}=\mathbf{v})\Pr(\mathbf{w}=\mathbf{w},\mathbf{v}=\mathbf{v}) \\
&\rightarrow\frac{1}{N_{\mathbf{v}}}\sum_{i=1}^{N} \mathbb{1}_{[\mathbf{v}_i=\mathbf{v}]}
\begin{cases}
y_i & \textrm{ if } t_i = t_0 \\
y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0
\end{cases}
\end{split}\]</span> where <span class="math inline">\(N_{\mathbf{v}}\)</span> is the number of units with <span class="math inline">\(\mathbf{v}=\mathbf{v}\)</span>. While convenient for both discrete and continuous variables, the method’s reliance on imperfect matches creates problems. For instance, the average treatment effect on the control group and the average treatment effect on the treated group are no longer the same (and both can differ from the overall average treatment effect). The average treatment effect on the treated (ATT) is estimated as the mean over all matches for the treated units. Similarly, the average treatment effect on the control (ATC) is estimated as the mean over all matches for the control units. Due to differences in the quality of matches—the distance among matched pairs—based on whether a unit is in treatment or control, the estimated ATT and ATC tend to be different. <span class="math display">\[\begin{split}
ATT = \mathbb{E}_{\textrm{t}=1}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] &=
\sum_{\mathbf{w}\in \mathbb{W}_S, \textrm{t}=1} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}| \textrm{t}=1) \\
&\rightarrow\frac{1}{N_1}\sum_{i: t_i=1}
\begin{cases}
y_i & \textrm{ if } t_i = t_0 \\
y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0
\end{cases} \\
ATC = \mathbb{E}_{\textrm{t}=0}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] &=
\sum_{\mathbf{w}\in \mathbb{W}_S, \textrm{t}=0} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}| \textrm{t}=0) \\
&\rightarrow\frac{1}{N_0}\sum_{i: t_i=0}
\begin{cases}
y_i & \textrm{ if } t_i = t_0 \\
y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0
\end{cases}
\end{split}\]</span> where <span class="math inline">\(N_1\)</span> and <span class="math inline">\(N_0\)</span> are the numbers of treated and control units, respectively.</p>
<p>The implication is puzzling. Depending on whether a unit has been assigned to treatment or control, our estimate of the effect of the exact same intervention can be different. Note that it is entirely possible that different subsets of units have different effects from the same intervention; however, the difference here is not due to heterogeneity in actual effect, but simply due to lack of perfect matches. This distinction is useful when interpreting the estimate from a matching analysis. If there are a few treated units but numerous control units (e.g., a few people who contracted a disease versus thousands who did not), it may be appropriate to only estimate the ATT, as we have a better chance of finding good matches for each of the treated units.</p>
</section>
<section id="propensity-based-methods" data-number="1.3.3">
<h3 data-number="4.3.3"><span class="header-section-number">4.3.3</span> Propensity-Based Methods</h3>
<p>Both stratification and matching methods create a subset of units—a stratum or a pair—where all confounders are fixed to the same value, and thus any variation in treatment assignment is trivially independent of the observed confounders. Formally, within these strata or pairs, <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}\)</span> and the units are said to be <em>balanced</em>. While these methods ensure the same value of the confounders in each stratum or pair, having exactly the same <span class="math inline">\(\mathbf{w}\)</span> is not necessary to ensure <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}\)</span>. We now study alternative ways of ensuring balance that are effective for high-dimensional confounders.</p>
<p>Suppose that there are two values of the confounders, <span class="math inline">\(\mathbf{w}=\mathbf{w}_1\)</span> and <span class="math inline">\(\mathbf{w}=\mathbf{w}_2\)</span> such that <span class="math inline">\(\operatorname{P}(\textrm{t}| \mathbf{w}=\mathbf{w}_1)=\operatorname{P}(\textrm{t}| \mathbf{w}=\mathbf{w}_2)\)</span>. In this case, we need not condition on <span class="math inline">\(\mathbf{w}_1\)</span> and <span class="math inline">\(\mathbf{w}_2\)</span> separately. It is okay to combine units with <span class="math inline">\(\mathbf{w}_1\)</span> and <span class="math inline">\(\mathbf{w}_2\)</span> into one strata. A simple derivation shows that the resultant estimate is identical. For the two values of <span class="math inline">\(\mathbf{w}\)</span>, <span class="math display">\[\begin{split}
\sum_{\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}} & \operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\operatorname{P}(\mathbf{w}=\mathbf{w}) \\
& =\sum_{\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}} \operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\operatorname{P}(\mathbf{w}=\mathbf{w})\frac{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})} \\
& =\sum_{\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}} \frac{\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})}
= \frac{\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_1)+\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_2)}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w}_1)} \\
& = \frac{\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)}
= \operatorname{P}(\textrm{y}| \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)\operatorname{P}(\mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)
\end{split}\]</span> where the second-to-last equality holds because <span class="math inline">\(\operatorname{P}(\textrm{t}=t_0| \mathbf{w}=\mathbf{w}_1)=\operatorname{P}(\textrm{t}=t_0| \mathbf{w}=\mathbf{w}_2)=\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)\)</span>. Effectively, the derivation corresponds to creating a new confounder <span class="math inline">\(\mathbf{w}'\)</span> that takes the same value whenever <span class="math inline">\(\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}\)</span>, but a different value for each other value of <span class="math inline">\(\mathbf{w}\)</span>. Within each stratum of <span class="math inline">\(\mathbf{w}'\)</span>, <span class="math inline">\(\textrm{t}\)</span> is independent of <span class="math inline">\(\mathbf{w}\)</span>; that is, <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}| \mathbf{w}'\)</span>. Alternatively, we can write <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}| b( \mathbf{w})\)</span> where <span class="math inline">\(b\)</span> is the function that maps <span class="math inline">\(\mathbf{w}\)</span> to <span class="math inline">\(\mathbf{w}'\)</span>.</p>
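<p>The derivation can also be checked numerically. In the sketch below (a toy construction of our own), two confounder values with exactly equal treatment propensity are pooled into one stratum, and the stratified estimate is unchanged:</p>

```python
from collections import defaultdict

def strat(rows, t0, key):
    """Stratified plug-in estimate of E[y | do(t = t0)], stratifying on
    key(w): the identity for exact stratification, or a coarsening b(w)."""
    groups = defaultdict(list)
    for t, w, y in rows:
        groups[key(w)].append((t, y))
    est = 0.0
    for g in groups.values():
        sel = [y for t, y in g if t == t0]
        if sel:
            est += (len(g) / len(rows)) * (sum(sel) / len(sel))
    return est

# P(t=1 | w1) = P(t=1 | w2) = 1/2, so w1 and w2 may share a stratum
rows = [(1, "w1", 2.0), (0, "w1", 0.0),
        (1, "w2", 4.0), (0, "w2", 0.0)]
exact = strat(rows, 1, key=lambda w: w)          # condition on w itself
pooled = strat(rows, 1, key=lambda w: "merged")  # b(w1) = b(w2)
```

<p>Both calls return the same estimate, as the derivation predicts; pooling values with unequal propensities would generally change the answer.</p>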
<p>More generally, any function <span class="math inline">\(b: \mathbb{R}^{|W|} \rightarrow \mathbb{R}^{K}\)</span> that ensures that <span class="math inline">\(\textrm{t}\)</span> and <span class="math inline">\(\mathbf{w}\)</span> are independent given <span class="math inline">\(b(\mathbf{w})\)</span>, i.e., <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}|b(\mathbf{w})\)</span>, is called a <em>balancing function</em>. The resultant values <span class="math inline">\(b(\mathbf{w})\)</span> are called balancing scores. Given a balancing score, the backdoor estimation equation can be rewritten as: <span id="eq:balancing-score"><span class="math display">\[\label{eq:balancing-score}
\mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] \rightarrow\sum_{\mathbf{b}\in b(\mathbb{W})} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, b(\mathbf{w})=\mathbf{b})\Pr(b(\mathbf{w})=\mathbf{b})\qquad(4)\]</span></span> If we take <span class="math inline">\(b\)</span> to be the identity function, <span class="math inline">\(b(\mathbf{w})=\mathbf{w}\)</span>, then we obtain simple stratification. It is a valid balancing function since, by definition, <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}|\mathbf{w}\)</span>. However, a useful balancing function typically reduces the dimensionality of <span class="math inline">\(\mathbf{w}\)</span>.</p>
<p>The most common balancing score is the <strong>propensity score</strong>, <span class="math inline">\(\operatorname{ps}(\mathbf{w})=\operatorname{P}(\textrm{t}|\mathbf{w}=\mathbf{w})\)</span>, so called because it denotes the propensity of treatment given values of <span class="math inline">\(\mathbf{w}\)</span>. Below we show that the propensity score is a balancing score, by showing that <span class="math inline">\(\textrm{t}\)</span> and <span class="math inline">\(\mathbf{w}\)</span> are independent given the propensity score. That is, <span class="math inline">\(\operatorname{P}(\textrm{t}|\mathbf{w},\operatorname{ps}(\mathbf{w}))=\operatorname{P}(\textrm{t}|\operatorname{ps}({\mathbf{w}}))\)</span>. To prove this, we first note that <span class="math inline">\(\operatorname{P}(\textrm{t}=t_0|\operatorname{ps}(\mathbf{w}), \mathbf{w}) = \operatorname{P}(\textrm{t}=t_0| \mathbf{w})=\operatorname{ps}(\mathbf{w})\)</span> since <span class="math inline">\(\operatorname{ps}(\mathbf{w})\)</span> is a function of <span class="math inline">\(\mathbf{w}\)</span>. Then we show that <span class="math inline">\(\operatorname{P}(\textrm{t}=t_0| \operatorname{ps}(\mathbf{w}))\)</span> is also equal to <span class="math inline">\(\operatorname{ps}(\mathbf{w})\)</span>. <span class="math display">\[\begin{split}
\operatorname{P}(\textrm{t}=t_0| \operatorname{ps}(\mathbf{w})) & = \mathbb{E}[\mathbb{1}_{\textrm{t}=t_0}|\operatorname{ps}(\mathbf{w})] \\
&= \mathbb{E}[\mathbb{E}[\mathbb{1}_{\textrm{t}=t_0}|\mathbf{w}, \operatorname{ps}(\mathbf{w})]|\operatorname{ps}(\mathbf{w})] \\
&= \mathbb{E}[\operatorname{P}(\textrm{t}=t_0|\mathbf{w}, \operatorname{ps}(\mathbf{w}))|\operatorname{ps}(\mathbf{w})] \\
&= \mathbb{E}[\operatorname{P}(\textrm{t}=t_0|\mathbf{w})|\operatorname{ps}(\mathbf{w})]\\
&= \operatorname{ps}(\mathbf{w})
\end{split}\]</span> where the second equality is due to the law of iterated expectations and the last equality holds since <span class="math inline">\(\mathbb{E}[A|A=a_0]\)</span> is simply <span class="math inline">\(a_0\)</span>. Therefore the propensity score is a valid balancing function of <span class="math inline">\(\mathbf{w}\)</span>. Given a propensity score, we can use Eq. 4 to obtain the backdoor estimate by extending the stratification and matching methods discussed above.</p>
<section id="estimating-the-propensity-score" data-number="1.3.3.1">
<h4 data-number="4.3.3.1"><span class="header-section-number">4.3.3.1</span> Estimating the propensity score</h4>
<p>In some cases, the propensity score <span class="math inline">\(\operatorname{P}(\textrm{t}|\mathbf{w}=\mathbf{w})\)</span> is known. For example, in a randomized experiment the propensity score is the same constant (0.5 if treatment and control are equally likely) for every <span class="math inline">\(\mathbf{w}\)</span>. In most cases, however, the propensity score needs to be estimated. Since both <span class="math inline">\(\textrm{t}\)</span> and <span class="math inline">\(\mathbf{w}\)</span> are observed, it can be estimated using any supervised learning algorithm that returns probability estimates for the different treatment values. For low-dimensional tabular data, logistic regression is often used to estimate <span class="math inline">\(\hat{\operatorname{ps}}(\mathbf{w})\)</span> when the treatment is binary. Assuming that the learnt parameters of the logistic regression are <span class="math inline">\(\hat{\beta}\)</span>, the propensity score is estimated as: <span class="math display">\[\hat{\operatorname{ps}}(\mathbf{w}) = \frac{1}{1+e^{-\hat{\beta}^T \mathbf{w}}}\]</span> This approach naturally extends to categorical treatments. When the treatment is continuous, we can use probabilistic regression methods or discretize the treatment into buckets. For high-dimensional or more complex data like text or images, deep learning methods can be used.</p>
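<p>As a sketch of the estimation step, the snippet below fits the logistic model above for a scalar confounder by gradient ascent on the log-likelihood. This hand-rolled version is purely for illustration; in practice one would use a library implementation such as scikit-learn's <code>LogisticRegression</code>:</p>

```python
import math

def fit_propensity(ws, ts, lr=0.1, steps=2000):
    """Fit ps(w) = P(t = 1 | w) = sigmoid(b0 + b1 * w) for a scalar
    confounder w, by gradient ascent on the logistic log-likelihood."""
    b0 = b1 = 0.0
    n = len(ws)
    for _ in range(steps):
        g0 = g1 = 0.0
        for w, t in zip(ws, ts):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * w)))
            g0 += (t - p) / n          # gradient w.r.t. intercept
            g1 += (t - p) * w / n      # gradient w.r.t. slope
        b0 += lr * g0
        b1 += lr * g1
    return lambda w: 1.0 / (1.0 + math.exp(-(b0 + b1 * w)))

# toy data: treatment is more likely for larger w
ps_hat = fit_propensity([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0],
                        [0, 0, 0, 1, 1, 1])
```

<p>The returned function maps a confounder value to an estimated treatment probability, which the stratification, matching, and weighting methods can then consume.</p>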
<p>Given multiple propensity score models, we can use cross-validation to select the model with the highest cross-validation accuracy, similar to supervised learning methods. However there is one key distinction in the interpretation of this accuracy. A propensity score model with low accuracy is not necessarily a bad model. The goal of a model <span class="math inline">\(\hat{\operatorname{ps}}(\mathbf{w})\)</span> is to approximate the true propensity score <span class="math inline">\(\operatorname{ps}(\mathbf{w})\)</span>, not necessarily to achieve a high accuracy. Somewhat counter-intuitively, an accuracy of 0.5 for a binary treatment classifier is ideal from the perspective of estimating the causal estimand if the true propensity score is also 0.5. For example, if the data was collected through a randomized experiment that assigned treatment and control with equal probability, then the best propensity score model has accuracy 0.5.</p>
<p>Similarly, if the true propensity scores are closer to 0 or 1, then a good classifier will achieve high accuracy on the classification task, but higher accuracy unfortunately implies higher variance of the estimate from Eq. 4. This is because estimation of the causal effect involves subtracting outcomes conditional on the treatment within each stratum. As the accuracy increases (and thus some propensity scores <span class="math inline">\(\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})\)</span> move closer to 0 or 1), the variance of the estimate increases since there will be few data units with <span class="math inline">\(\mathbf{w}=\mathbf{w}\)</span> that receive <span class="math inline">\(\textrm{t}=t_0\)</span>. That said, this does not mean that we should choose a classifier with lower accuracy. Rather, the takeaway is that we should select the model with the highest accuracy on a validation set and interpret that accuracy carefully. If the best accuracy turns out to be low, it may be due to the inherent randomness in how treatment was assigned. If the best accuracy is high, then it conveys that treatment was assigned almost deterministically based on the confounders, and a propensity score-based method cannot provide a low-variance estimate.</p>
</section>
<section id="propensity-based-stratification-and-matching" data-number="1.3.3.2">
<h4 data-number="4.3.3.2"><span class="header-section-number">4.3.3.2</span> Propensity-based stratification and matching</h4>
<p>Given an estimated propensity score, we now describe extensions of the matching and stratification methods based on the propensity score. For matching, the goal is to find the closest unit to a particular unit, except that the distance metric is defined over propensity scores. <span class="math display">\[\operatorname{dist}(\mathbf{w}_1, \mathbf{w}_2) = \operatorname{dist}(\hat{\operatorname{ps}}(\mathbf{w}_1), \hat{\operatorname{ps}}(\mathbf{w}_2)) = | \hat{\operatorname{ps}}(\mathbf{w}_1)- \hat{\operatorname{ps}}(\mathbf{w}_2)|\]</span> As in direct matching, we need to define the maximum distance at which two units may be considered a match.</p>
<p>For propensity-based stratification, our goal is to find subsets where the propensity score is constant. This is often approximated by defining subsets corresponding to ranges of propensity score values. For instance, we may divide [0,1] into 10 strata equally starting from <span class="math inline">\([0,0.1), [0.1, 0.2)\)</span> and so on. The estimate from each stratum is now weighted by the probability of its propensity score range, or equivalently the fraction of data points whose propensity scores lie in that range. Like with direct stratification, the size of these subsets determines the bias-variance tradeoff of the final estimate.</p>
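<p>A minimal sketch of propensity-based stratification is shown below, assuming units arrive as <code>(t, w, y)</code> triples with a binary treatment and <code>ps</code> is an already-estimated propensity function (both the names and the data layout are illustrative):</p>

```python
def ps_stratified_effect(units, ps, n_strata=10):
    """ATE via equal-width propensity-score strata. Strata lacking
    overlap (no treated or no control units) are skipped, so the
    result is a local effect over the covered strata."""
    n = len(units)
    total = 0.0
    for k in range(n_strata):
        lo, hi = k / n_strata, (k + 1) / n_strata
        stratum = [(t, y) for t, w, y in units if lo <= ps(w) < hi]
        y1 = [y for t, y in stratum if t == 1]
        y0 = [y for t, y in stratum if t == 0]
        if not y1 or not y0:
            continue
        effect = sum(y1) / len(y1) - sum(y0) / len(y0)
        total += (len(stratum) / n) * effect  # weight by stratum share
    return total

# toy data whose confounder already lies in [0, 1), used directly as
# a stand-in propensity score
units = [(1, 0.05, 2.0), (0, 0.05, 1.0), (1, 0.55, 4.0), (0, 0.55, 1.0)]
ate = ps_stratified_effect(units, ps=lambda w: w)
```

<p>Varying <code>n_strata</code> exposes the same bias-variance tradeoff as the subset sizes discussed above: finer strata are better balanced but hold fewer units each.</p>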
<p>To sum up, propensity-based methods and direct methods offer a fundamental tradeoff. In direct methods, the balancing function (the identity, <span class="math inline">\(b(\mathbf{w})=\mathbf{w}\)</span>) is always exact, but conditioning on it is difficult. For propensity-based methods, the reverse is true: conditioning is easy since the propensity score is one-dimensional, but the estimated propensity score has to be a good balancing score. To navigate this tradeoff, it is possible to construct two-dimensional or other low-dimensional balancing scores, but they are not commonly used. For any such balancing score, the same matching and stratification methods can be applied.</p>
</section>
</section>
</section>
<section id="sec:ch04-weighting-methods" data-number="1.4">
<h2 data-number="4.4"><span class="header-section-number">4.4</span> Weighting-based methods</h2>
<p>There are many other ways of interpreting the back-door estimand from Eq. 3, and thus obtaining a different method to estimate the causal effect. Rather than estimating a balancing function, one way is to ensure that treatment and control units have the same distribution of confounders, almost like a randomized experiment. Often, the key problem with observational data is that certain values of the confounders are disproportionately assigned treatment. If we can somehow remove this relationship, and ensure that treatment is independent of <span class="math inline">\(\mathbf{w}\)</span>, we can obtain a non-confounded estimate of causal effect.</p>
<p>The basic idea is the same as importance sampling, which is used to reweight an input distribution to resemble a target distribution. Here the input distribution <span class="math inline">\(P\)</span> is the observed data distribution and the target distribution <span class="math inline">\(P^*\)</span> is that of a (hypothesized) randomized experiment on the same population. Thus, we can write, <span class="math display">\[\begin{split}
\mathop{\mathrm{\operatorname{P*}}}(\mathbf{w}, \textrm{t}, \textrm{y})
&= \mathop{\mathrm{\operatorname{P*}}}(\textrm{y}|\textrm{t}, \mathbf{w}) \mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})\mathop{\mathrm{\operatorname{P*}}}(\mathbf{w}) \\
&= \operatorname{P}(\textrm{y}|\textrm{t}, \mathbf{w}) \mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})\operatorname{P}(\mathbf{w}) \\
&= \operatorname{P}(\textrm{y}|\textrm{t}, \mathbf{w}) \operatorname{P}(\mathbf{w}) \mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w}) \frac{\operatorname{P}(\textrm{t}|\mathbf{w})}{\operatorname{P}(\textrm{t}|\mathbf{w})}\\
&= \operatorname{P}(\mathbf{w}, \textrm{t}, \textrm{y})\frac{\mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})}{\operatorname{P}(\textrm{t}|\mathbf{w})}
\end{split}\]</span> Then the expected value of <span class="math inline">\(Y\)</span> in the target distribution where <span class="math inline">\(\mathbf{w}\)</span> is independent of <span class="math inline">\(\textrm{t}\)</span> is: <span class="math display">\[\mathbb{E}^*(\textrm{y}) = \sum_{\textrm{y}, \textrm{t}, \mathbf{w}} y\mathop{\mathrm{\operatorname{P*}}}(\mathbf{w}, \textrm{t}, \textrm{y}) = \sum_{\textrm{y}, \textrm{t}, \mathbf{w}} y\operatorname{P}(\mathbf{w}, \textrm{t}, \textrm{y}) \frac{\mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})}{\operatorname{P}(\textrm{t}|\mathbf{w})}\]</span> To find the causal effect, we can rewrite the backdoor estimand as, <span id="eq:ch04-ipw-estimator"><span class="math display">\[\label{eq:ch04-ipw-estimator}
\begin{split}
\mathbb{E}(\textrm{y}|\operatorname{do}(\textrm{t}=t_0)) &\rightarrow\sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}| \mathbf{w},\textrm{t}=t_0) \operatorname{P}(\mathbf{w}) \\
&= \sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}| \mathbf{w},\textrm{t}=t_0) \operatorname{P}(\mathbf{w}) \frac{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\
&= \sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}| \mathbf{w},\textrm{t}=t_0) \operatorname{P}(\textrm{t}=t_0, \mathbf{w}) \frac{1}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\
&= \sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}, \mathbf{w},\textrm{t}=t_0) \frac{1}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\
\end{split}\qquad(5)\]</span></span></p>
<section id="inverse-propensity-weighting" data-number="1.4.1">
<h3 data-number="4.4.1"><span class="header-section-number">4.4.1</span> Inverse propensity weighting</h3>
<p>Eq. 5 can be interpreted as a weighted average of <span class="math inline">\(\textrm{y}\)</span> over the observed distribution, with the weights being <span class="math inline">\(weight = 1/\operatorname{P}(\textrm{t}=t_0|\mathbf{w})=\frac{1}{\operatorname{ps}(t_0, \mathbf{w})}\)</span>, where, with a slight abuse of notation, <span class="math inline">\(\operatorname{ps}(t_0, \mathbf{w})\)</span> denotes <span class="math inline">\(\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})\)</span>. These weights give the estimator its name, “inverse propensity weighting” or IPW for short.</p>
<p>For a dataset with <span class="math inline">\(N\)</span> units, the IPW estimator is: <span class="math display">\[\hat{\mathbb{E}}[\textrm{y}| \operatorname{do}(\textrm{t}=t_0)]= \frac{1}{N}\sum_{i=1}^N \frac{\mathbb{1}_{[t_i=t_0]}\,y_i}{\hat{\operatorname{ps}}(t_0, \mathbf{w}_i)}\]</span> The method can be interpreted as creating a new dataset where <span class="math inline">\(\textrm{t}\unicode{x2AEB}\mathbf{w}\)</span> and thus the data are balanced. From Eq. 5, we saw that this estimator can be derived from the backdoor estimand. Thus, with infinite data, the IPW estimator is equivalent to the simple stratification estimator. <span class="math display">\[\begin{split}
IPW &= \sum_{\textrm{y}, \textrm{t}=t_0, \mathbf{w}} y\frac{\operatorname{P}(\mathbf{w}, \textrm{t}=t_0, \textrm{y})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\
&= \sum_{\textrm{y}, \textrm{t}=t_0, \mathbf{w}} y\frac{\operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w})\operatorname{P}(\textrm{t}=t_0|\mathbf{w})\operatorname{P}(\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\
&= \sum_{\textrm{y}, \textrm{t}=t_0, \mathbf{w}} y\operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w})\operatorname{P}(\mathbf{w}) \\
&= \sum_{\mathbf{w}} \sum_{\textrm{y}, \textrm{t}=t_0} y\operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w})\operatorname{P}(\mathbf{w}) \\
&= \sum_{\mathbf{w}} \mathbb{E}[\textrm{y}|\textrm{t}=t_0, \mathbf{w}]\operatorname{P}(\mathbf{w})
\end{split}\]</span></p>
<p>In finite samples, however, the two methods can give different estimates due to the different tradeoffs they make. The propensity score can be estimated as in the balancing-based methods. However, IPW does not divide the data into discrete strata or find the closest matches for each data point based on the propensity score, both of which are statistical operations that can introduce bias. Instead, it uses the propensity score directly to weight the input data points. This results in low bias but high variance, since the estimate for each data point involves division by its propensity score. Even a single point with a low propensity score (<span class="math inline">\(\approx 0\)</span>) can make the estimate diverge toward infinity.</p>
<p>To reduce variance, it is advisable to clip extreme propensity scores. Typically, one clips the estimated propensity scores to a range <span class="math inline">\([\alpha, 1-\alpha]\)</span>, where <span class="math inline">\(\alpha\)</span> is a parameter between 0 and 0.5. Choosing <span class="math inline">\(\alpha\)</span> involves a bias-variance tradeoff similar to that of stratification: a smaller <span class="math inline">\(\alpha\)</span> leads to lower bias but higher variance, and a larger <span class="math inline">\(\alpha\)</span> reduces variance at the cost of higher bias.</p>
<p>Overall, the benefit of IPW is that it does not require arbitrary choices of strata size or distance metrics for matching. When the true propensity scores are neither too high nor too low, IPW is a suitable method that is straightforward to implement.</p>
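<p>To make this concrete, here is a minimal sketch of an IPW estimator with propensity clipping on simulated data. The data-generating process, the coefficient values, and the <code>ipw_ate</code> helper are illustrative assumptions, not the book's own code.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate a confounded dataset: w affects both treatment and outcome.
w = rng.binomial(1, 0.5, n)
p_t = np.where(w == 1, 0.8, 0.2)           # true propensity P(t=1 | w)
t = rng.binomial(1, p_t)
y = 10 * t + 5 * w + rng.normal(0, 1, n)   # true effect of t is 10

def ipw_ate(y, t, ps_of_t1, alpha=0.01):
    """IPW estimate of E[y|do(t=1)] - E[y|do(t=0)], clipping propensities
    to [alpha, 1 - alpha] to control variance."""
    ps = np.clip(ps_of_t1, alpha, 1 - alpha)
    y1 = np.mean(t * y / ps)                # estimates E[y | do(t=1)]
    y0 = np.mean((1 - t) * y / (1 - ps))    # estimates E[y | do(t=0)]
    return y1 - y0

# Estimate P(t=1 | w) from data by stratifying on the discrete confounder.
ps_hat = np.array([t[w == v].mean() for v in (0, 1)])[w]

naive = y[t == 1].mean() - y[t == 0].mean()  # confounded comparison
ate = ipw_ate(y, t, ps_hat)                  # close to the true effect 10
```

<p>On this simulation, the naive difference in means is biased upward by the confounder, while the IPW estimate recovers the true effect.</p>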
</section>
<section id="sec:ch04-ipw-predictive-model" data-number="1.4.2">
<h3 data-number="4.4.2"><span class="header-section-number">4.4.2</span> Using supervised learning on the weighted outcomes</h3>
<p>Unlike balance-based methods, weighting-based methods also provide a straightforward way to incorporate supervised learning estimators. Consider <span class="math inline">\(y|\operatorname{do}(t), \mathbf{w}= \mathbf{w}\)</span> as a function of <span class="math inline">\(t\)</span> and <span class="math inline">\(\mathbf{w}\)</span> that needs to be estimated. <span id="eq:ch04-ipw-ml-eqn"><span class="math display">\[\label{eq:ch04-ipw-ml-eqn}
\mathbb{E}[y|\operatorname{do}(\textrm{t}=t_0),\mathbf{w}=\mathbf{w}]= f(t_0, \mathbf{w})\qquad(6)\]</span></span> From Eq. 5, we know that the LHS is equivalent to the values of <span class="math inline">\(y\)</span> weighted by the inverse propensity score. Thus, training data <span class="math inline">\((t_i,\mathbf{w}_i,y_i)\)</span> can be generated where each point is weighted by <span class="math inline">\(\frac{1}{\hat{\operatorname{ps}}(t_i, \mathbf{w}_i)}\)</span>, and any supervised learning algorithm can be used to predict <span class="math inline">\(y^{IPW}\)</span>. Denoting <span class="math inline">\(L\)</span> as a suitable P-admissible loss function such as mean squared loss or cross-entropy, the estimator can be implemented using a weighted regression: <span class="math display">\[\begin{split}
\hat{f}(t,\mathbf{w}) &= \arg \min_{h \in \mathcal{H}} \sum_{i=1}^N \frac{1}{\hat{\operatorname{ps}}(t_i, \mathbf{w}_i)}L(y_i, h(t_i,\mathbf{w}_i))
\end{split}\]</span></p>
<p>Compared to the IPW estimator, which can only estimate mean interventional outcomes for different sub-groups, this method can be used to obtain estimates for any point in the observed data or any new data point. Of course, the catch is that the estimates rely on a new parametric assumption on the form of <span class="math inline">\(f\)</span>. Model choice is an important consideration that determines the bias and variance of the estimate. As in supervised learning, if the family of functions <span class="math inline">\(\mathcal{H}\)</span> is too simple (e.g., linear regression), there will be bias due to insufficient modeling of the per-unit differences. On the other hand, if we consider an expressive <span class="math inline">\(\mathcal{H}\)</span> (e.g., multi-layer neural networks), then we run the risk of overfitting to the observed data.</p>
<p>The susceptibility to very low propensity scores remains the same as for the IPW estimator. A near-zero propensity score translates to a near-infinite weight on the data point, making the rest of the dataset irrelevant for fitting <span class="math inline">\(f\)</span>. The same clipping of weights can be applied.</p>
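<p>A minimal sketch of this weighted regression, implemented as weighted least squares via the standard square-root row-scaling trick; the simulated data and the use of the true propensities are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Confounded data: w drives both treatment assignment and outcome.
w = rng.normal(size=n)
p_t = 1 / (1 + np.exp(-2 * w))             # propensity P(t=1 | w)
t = rng.binomial(1, p_t).astype(float)
y = 10 * t + 5 * w + rng.normal(size=n)    # true effect of t is 10

# IPW weight: inverse propensity of the treatment actually received.
ps = np.where(t == 1, p_t, 1 - p_t)        # true propensities, for simplicity
sw = 1 / np.clip(ps, 0.01, 0.99)

# Weighted least squares for h(t, w) = a + b*t + c*w: minimize
# sum_i sw_i * (y_i - h(t_i, w_i))^2 by scaling rows by sqrt(sw_i).
X = np.column_stack([np.ones(n), t, w])
r = np.sqrt(sw)
coef, *_ = np.linalg.lstsq(X * r[:, None], y * r, rcond=None)
effect = coef[1]                           # coefficient on t
```
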
</section>
</section>
<section id="sec:ch04-outcome-methods" data-number="1.5">
<h2 data-number="4.5"><span class="header-section-number">4.5</span> Outcome model-based methods</h2>
<p>Note that Eq. 6 looks deceptively similar to the regression, <span class="math inline">\(\mathbb{E}[y|t,\mathbf{w}=\mathbf{w}]=f(t,\mathbf{w})\)</span>. The key difference, however, is in conditioning on <span class="math inline">\(\operatorname{do}(t)\)</span> instead of <span class="math inline">\(t\)</span>, which in turn required us to weight the data appropriately. What if we train a supervised model directly on the input data? This section dives deeper into the use of such <em>outcome-based</em> predictor methods.</p>
<p>From the identification chapter, <span class="math inline">\(P(y|t,\mathbf{w})\)</span> is the estimand for <span class="math inline">\(P(y|do(t),\mathbf{w})\)</span>. We used this as a component of average causal outcome methods like stratification and inverse propensity weighting. But we can also use it directly to create an estimator that can predict the interventional outcome for each value of <span class="math inline">\(\mathbf{w}\)</span>. Let us assume a function <span class="math inline">\(f\)</span> that describes the relationship between <span class="math inline">\(y\)</span> and <span class="math inline">\(t,\mathbf{w}\)</span>. <span class="math display">\[y = \operatorname{f}(t, \mathbf{w}) + \epsilon\]</span></p>
<p>Since <span class="math inline">\(y\)</span>, <span class="math inline">\(t\)</span> and <span class="math inline">\(\mathbf{w}\)</span> are all observed, the above equation suggests a straightforward estimator using supervised learning. If we can estimate <span class="math inline">\(f\)</span>, then we can use it to evaluate the outcome at different values of the treatment and thus compute the causal effect. For estimating the causal effect between <span class="math inline">\(t=0\)</span> and <span class="math inline">\(t=1\)</span>, for example, we can write: <span class="math display">\[\mathbb{E}_{\mathbf{w}}[\operatorname{f}(1, \mathbf{w}) - \operatorname{f}(0, \mathbf{w})]\]</span></p>
<p>The problem, however, is that estimating the true <span class="math inline">\(f\)</span> is not trivial. When the goal is prediction over the same distribution of data, it often suffices to learn an <span class="math inline">\(\hat{f}\)</span> with low prediction error on the outcome. But low prediction error does not necessarily translate to low error in estimating the causal effect of <span class="math inline">\(t\)</span>, since <span class="math inline">\(t\)</span> and <span class="math inline">\(\mathbf{w}\)</span> are correlated. As a simple example, suppose that <span class="math inline">\(t\)</span> has zero causal effect on <span class="math inline">\(y\)</span>, given by the following structural equation: <span class="math display">\[y \leftarrow\gamma \mathbf{w}\]</span> However, since <span class="math inline">\(t\)</span> and <span class="math inline">\(\mathbf{w}\)</span> are correlated, a model that minimizes the prediction loss may assign some of the effect to <span class="math inline">\(t\)</span>, leading to a non-zero estimate of <span class="math inline">\(t\)</span>’s causal effect. Fundamentally, the standard supervised learning objective of minimizing predictive error places no constraints on learning the right effect for <span class="math inline">\(t\)</span> or any other variable.</p>
<p>As an example, Fig. 4 shows simulations of different outcome-based predictors on a dataset generated by linear structural equations for treatment <span class="math inline">\(\textrm{t}\)</span> and outcome <span class="math inline">\(\textrm{y}\)</span>. The true causal effect of treatment is 10. We train four predictive models based on linear regression, linear regression with lasso regularization, gradient boosting trees, and random forests. The first two are simpler models that have the advantage of exactly matching the correct model specification for the underlying structural equations. The last two are more complex models that utilize an ensemble of trees to minimize their prediction error. Since predictive models are typically evaluated on their prediction accuracy on an out-of-sample test dataset, let us first analyze the test prediction accuracy. The mean absolute percentage error on predicting the outcome <span class="math inline">\(y\)</span> is higher for the linear regression and lasso models, and lower for the tree-based ensemble models. From a prediction accuracy standpoint, the tree-based ensemble models are better than the regression-based models.</p>
<p>However, the right panel of Fig. 4 shows the opposite trend for their bias in effect estimation. Linear regression is the most accurate: its average estimate is exactly 10 and over 50% of the estimates lie between 9 and 11. The Lasso model is not accurate at all, even though linear regression and Lasso share the same prediction error. Most of the effect estimates from Lasso lie between 0 and 5. Moreover, the models with higher prediction accuracy yield worse effect estimates. Both the Gradient Boosting and Random Forest models yield biased estimates, with the Random Forest model estimating a near-zero effect in a majority of the simulations. The example shows the lack of a stable relationship between prediction accuracy and effect estimation, which makes it difficult to select the right model for an outcome-based effect estimator. The best predictor of the outcome may be the worst estimator of the causal effect, and models with the same predictive accuracy may still produce very different effect estimates.</p>
<div id="fig:ch04-slearner-zeroeffect" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tbody>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/slearner_prediction_error.png" style="width:100.0%" alt="a" /><figcaption aria-hidden="true">a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter4/slearner_effectestimates.png" style="width:100.0%" alt="b" /><figcaption aria-hidden="true">b</figcaption>
</figure></td>
</tr>
</tbody>
</table>
<p>Figure 4: Supervised learning methods applied to estimate causal effect in a simulated dataset based on linear structural equations. Left panel shows the prediction error measured by the mean absolute percentage error (MAPE) metric. Right panel shows the distribution of the effect estimates for a true effect of 10 (shown as a dotted horizontal line). Lower prediction error for the Gradient Boosting and Random Forest models does not translate to lower bias on estimating the effect. Compare these methods yourself using an online DoWhy notebook: <a href="https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/bias-in-supervised-learning.ipynb" class="uri">https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/bias-in-supervised-learning.ipynb</a>. a — Prediction Error, b — Effect Estimates</p>
</div>
<section id="limitations-of-directly-applying-predictive-models-for-effect-estimation" data-number="1.5.1">
<h3 data-number="4.5.1"><span class="header-section-number">4.5.1</span> Limitations of directly applying predictive models for effect estimation</h3>
<p>The fundamental problem is that predictive methods and causal inference methods are optimized for two different objectives. Consider our setting above where the outcome <span class="math inline">\(y\)</span> is generated according to a function <span class="math inline">\(f(t, \mathbf{w})\)</span> and some independent error: <span class="math inline">\(y = f(t, \mathbf{w}) + {\epsilon}\)</span>. The goal of a predictive model is to construct <span class="math inline">\(\hat{f}\)</span> such that <span class="math inline">\(y-\hat{f}(t, \mathbf{w})\)</span> is minimized, whereas the goal of a causal inference model is to construct <span class="math inline">\(\hat{f}\)</span> such that its partial derivative w.r.t. the treatment is accurate, i.e., the difference <span class="math inline">\(\frac{\partial \hat{f}}{\partial t} - \frac{\partial y}{\partial \operatorname{do}(t)}\)</span> is minimized, where <span class="math inline">\(\partial \operatorname{do}(t)\)</span> should be interpreted as the infinitesimal <em>interventional</em> change in <span class="math inline">\(t\)</span>. <span class="math display">\[\begin{aligned}
\text{Predictive model}:& \hat{f} := \mathop{\mathrm{arg\,min}}_{h} \operatorname{Loss}(h, y) \\
\text{Causal effect model}:& \frac{\partial \hat{f}}{\partial t} := \mathop{\mathrm{arg\,min}}_{h} \operatorname{Loss}(\frac{\partial h}{\partial t}, \frac{\partial y}{\partial \operatorname{do}(t)} )\end{aligned}\]</span></p>
<p>In the presence of confounding, the two objectives can lead to different causal estimates. Let us use linear regression as the model class to demonstrate this. Suppose that the true generating function <span class="math inline">\(f\)</span> is <span class="math inline">\(y = \beta t + \gamma w + {\epsilon}\)</span> where <span class="math inline">\(w\)</span> also causes <span class="math inline">\(t\)</span> and is thus a confounder. The identified causal estimand is <span class="math inline">\(\mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}), {\textrm{w}}] = \mathbb{E}[\textrm{y}|\textrm{t}, {\textrm{w}}]\)</span>. Using linear regression, the causal effect <span class="math inline">\(\beta\)</span> can be estimated as: <span class="math display">\[\hat{f}(t,w) = \hat{\beta} t + \hat{\gamma}w\]</span></p>
<p>If <span class="math inline">\(w\)</span> and <span class="math inline">\(t\)</span> are independent, an optimization that minimizes <span class="math inline">\((y-\hat{f}(t,w))^2\)</span> will be able to distinguish between the contributions of <span class="math inline">\(t\)</span> and <span class="math inline">\(w\)</span> to <span class="math inline">\(y\)</span>, thus yielding the correct causal effect. However, since <span class="math inline">\(w\)</span> causes <span class="math inline">\(t\)</span>, <span class="math inline">\(w\)</span> and <span class="math inline">\(t\)</span> will be correlated. Intuitively, in such a situation, it is difficult to isolate a separate effect for <span class="math inline">\(t\)</span>: based on a given sample, some of <span class="math inline">\(t\)</span>’s effect may be absorbed in <span class="math inline">\(w\)</span>’s coefficient, or conversely some of <span class="math inline">\(w\)</span>’s effect may become a part of <span class="math inline">\(t\)</span>’s estimated coefficient. This is what we see in Fig. 4 (right panel) where the linear regression model yields incorrect effect estimates on both sides of the true effect 10, ranging from -18 to over 40. Formally, the variance of <span class="math inline">\(\hat{\beta}\)</span> is given by, <span id="eq:ch04-var-beta-lregression"><span class="math display">\[\label{eq:ch04-var-beta-lregression}
Var(\hat{\beta}) = \frac{\frac{1}{N}\sum_{i=1}^N (y_i-\hat{f}(t_i,w_i))^2}{N \cdot Var(t)(1-R_{tw}^2)}\qquad(7)\]</span></span> where <span class="math inline">\(R_{tw}^2\)</span> is the <span class="math inline">\(R\)</span>-squared value of regressing <span class="math inline">\(t\)</span> on <span class="math inline">\(w\)</span>. As <span class="math inline">\(w\)</span>’s effect on <span class="math inline">\(t\)</span> increases, their correlation increases and thus the denominator decreases. As a result, the variance of <span class="math inline">\(\hat{\beta}\)</span> increases. Thus, when the confounding of <span class="math inline">\(t\)</span> by <span class="math inline">\(w\)</span> is minimal (e.g., in a randomized experiment where <span class="math inline">\(R^2=0\)</span>), we obtain the lowest-variance estimate. However, as confounding increases, a single estimate can be far from the true value (though the estimator remains unbiased). In the extreme case, when <span class="math inline">\(w\)</span> and <span class="math inline">\(t\)</span> are linearly dependent, there can be many equally optimal solutions for <span class="math inline">\(\hat{\beta}\)</span> depending on the value of <span class="math inline">\(\hat{\gamma}\)</span>.</p>
<p>In addition to high variance, prediction models often include <em>regularization</em> that can lead to bias in the effect estimate. For example, the Lasso model is a regularized version of the linear regression model that adds a loss term encouraging the model parameters to shrink to zero. The intuition is to avoid overfitting and let the insignificant effect parameters go to zero.</p>
<p><span class="math display">\[\text{Lasso}:
\hat{f} := \mathop{\mathrm{arg\,min}}_{\beta, \gamma} \operatorname{Loss}({\beta} t + {\gamma}w, y) + \lambda (|\beta| + \sum|\gamma|)\]</span></p>
<p>However, as we saw in Fig. 4, Lasso regularization leads to shrinkage of the treatment’s coefficient as well, even though treatment is the variable with the largest effect. In general, regularization is a common component of almost all predictive models; it helps avoid overfitting by encouraging model parameters towards some prior, but it can also introduce non-trivial bias in effect estimation.</p>
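<p>The shrinkage bias can be seen in a few lines. The sketch below uses ridge regression instead of Lasso, since ridge has a closed form while exhibiting the same regularization bias; the simulation and the penalty value are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# t is strongly confounded by w; the true effect of t on y is 10.
w = rng.normal(size=n)
t = 0.9 * w + 0.1 * rng.normal(size=n)
y = 10 * t + 5 * w + rng.normal(size=n)
X = np.column_stack([t, w])

def ridge_coefs(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge_coefs(X, y, 0.0)[0]     # unregularized: close to 10
beta_reg = ridge_coefs(X, y, 5000.0)[0]  # heavy penalty shrinks the effect
```

<p>Here the unregularized fit recovers the true effect, while the heavily regularized fit shrinks the treatment coefficient well below it, even though treatment has the largest true effect.</p>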
</section>
<section id="treatment-specific-prediction-models-t-learner" data-number="1.5.2">
<h3 data-number="4.5.2"><span class="header-section-number">4.5.2</span> Treatment-specific prediction models: T-learner</h3>
<p>While naive predictive models do not work, there are certain customizations that can work better. We already saw an example of a modified loss function in section <a href="#sec:ch04-ipw-predictive-model" data-reference-type="ref" data-reference="sec:ch04-ipw-predictive-model">4.4.2</a> that used a weighted outcome.</p>
<p>We now present a simple outcome-based method that works for a discrete treatment variable. The main problem with the naive predictive estimator is that <span class="math inline">\(\mathbf{w}\)</span> could be attributed some of <span class="math inline">\(t\)</span>’s effect, and vice versa. What if we divide the dataset by values of <span class="math inline">\(t\)</span> and run separate regressions on each sub-dataset? Since <span class="math inline">\(t\)</span> is constant within each sub-dataset, the learnt functions cannot capture any variation in <span class="math inline">\(y\)</span> due to the treatment. <span class="math display">\[\forall t': \hat{f}_{t'} = \arg \min_h \sum_{i:\, t_i=t'}L(y_i, h(\mathbf{w}_i))\]</span> The causal effect for any two values of <span class="math inline">\(t\)</span> can be calculated as: <span class="math display">\[\mathbb{E}_{\mathbf{w}}[\hat{f}_{t_1}(\mathbf{w})- \hat{f}_{t_0}(\mathbf{w})]\]</span> Conditional average treatment effects (CATE) can be estimated by restricting the values of <span class="math inline">\(\mathbf{w}\)</span> over which the expectation is computed. Since the method requires two or more prediction models, it is called the <em>T-learner</em>. This approach can work well when there are sufficient samples for each value of the treatment. As with any supervised learning task, the efficacy of the method depends on the choice of model class and on the associated risk of overfitting when some treatment values have little data.</p>
<p>Note the relationship of this method to stratification. In stratification, we condition on the confounders <span class="math inline">\(\mathbf{w}\)</span> whereas here we condition on the treatment <span class="math inline">\(t\)</span>. Thus, when there are a few discrete confounders, simple stratification works well. When there are a few discrete treatments, treatment-specific prediction methods can work well. Unlike propensity-based methods for stratification, there is no obvious extension of the T-learner to continuous treatments.</p>
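<p>A minimal T-learner sketch for a binary treatment, with a separate linear outcome model per treatment arm; the simulated data is an illustrative assumption:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Confounded binary treatment; true effect of t on y is 10.
w = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-2 * w)))
y = 10 * t + 5 * w + rng.normal(size=n)

def fit_linear(w_arm, y_arm):
    """Least-squares fit of y on (1, w); returns a prediction function."""
    X = np.column_stack([np.ones(len(w_arm)), w_arm])
    coef, *_ = np.linalg.lstsq(X, y_arm, rcond=None)
    return lambda wq: coef[0] + coef[1] * wq

# T-learner: one outcome model per treatment arm, with w as the only input.
f1 = fit_linear(w[t == 1], y[t == 1])
f0 = fit_linear(w[t == 0], y[t == 0])

# ATE: average the per-unit difference over the full distribution of w.
ate = np.mean(f1(w) - f0(w))
```

<p>Restricting the final average to a subset of <code>w</code> values would yield a CATE estimate for that sub-group.</p>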
</section>
<section id="residual-based-prediction-models" data-number="1.5.3">
<h3 data-number="4.5.3"><span class="header-section-number">4.5.3</span> Residual-based prediction models</h3>
<p>For continuous treatment, we need a different estimation strategy. This strategy depends on the fact that the error term in a correctly specified structural equation is independent of the input variables. If we can correctly model the structural equation from the confounders <span class="math inline">\({\textrm{w}}\)</span> to the treatment <span class="math inline">\(\textrm{t}\)</span>, then the residual of the fitted model should be independent of all confounders. Similarly, the residual of a model predicting the outcome <span class="math inline">\(\textrm{y}\)</span> from the confounders should be free of any direct effect of the confounders on the outcome. The key assumption is that the machine learning models <span class="math inline">\(\hat{g}\)</span> and <span class="math inline">\(\hat{f}\)</span> capture the true structural equations for treatment and outcome respectively. <span class="math display">\[\begin{split}
\hat{g} &= \arg \min_g Loss(t, g(w)) \\
\hat{f} &= \arg \min_f Loss(y, f(w)) \\
r_t &= t - \hat{g}(w) \\
r_y &= y - \hat{f}(w)
\end{split}\]</span></p>
<p>In other words, the two residuals <span class="math inline">\(r_t\)</span> and <span class="math inline">\(r_y\)</span> represent the parts of the treatment and outcome, respectively, that are not explained by variation in <span class="math inline">\({\textrm{w}}\)</span>; that is, <span class="math inline">\(r_t\)</span> and <span class="math inline">\(r_y\)</span> are the <em>unconfounded</em> parts of the treatment and outcome. A simple predictive model on these residual variables should yield the effect of the treatment on the outcome. <span class="math display">\[\beta = \arg \min_\beta Loss(r_y, \beta r_t)\]</span> The same method can also be used to estimate conditional effects, by introducing an effect model in the last stage that depends on other variables, e.g., <span class="math inline">\(\beta r_t + \gamma r_t x\)</span> where <span class="math inline">\(x\)</span> is an effect modifier. Compared to the naive predictive model from <span class="math inline">\(\textrm{t}\)</span> and <span class="math inline">\({\textrm{w}}\)</span> to <span class="math inline">\(\textrm{y}\)</span>, the residual-based method has the advantage of explicitly removing the confounding due to <span class="math inline">\({\textrm{w}}\)</span> in its first step, which the predictive model may or may not do. However, the accuracy of the estimate now depends on the predictive quality of two separate sub-models, <span class="math inline">\(\hat{g}\)</span> and <span class="math inline">\(\hat{f}\)</span>. As the sub-models get closer to the true structural equations, the residual-based effect estimate becomes more accurate. Since the estimator involves estimating two machine learning models, it is also called the <em>double machine learning</em> method.</p>
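<p>A minimal sketch of the residual-on-residual recipe, with linear models standing in for both sub-models; the simulation is an illustrative assumption, and in practice <span class="math inline">\(\hat{g}\)</span> and <span class="math inline">\(\hat{f}\)</span> would be flexible machine learning models, typically fit with cross-fitting:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Continuous, confounded treatment; true effect of t on y is 10.
w = rng.normal(size=n)
t = 1.5 * w + rng.normal(size=n)
y = 10 * t + 5 * w + rng.normal(size=n)

def residual(x, target):
    """Residual of a least-squares fit of target on (1, x)."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

r_t = residual(w, t)   # part of t not explained by w
r_y = residual(w, y)   # part of y not explained by w

# Final stage: regress the outcome residual on the treatment residual.
beta = np.sum(r_t * r_y) / np.sum(r_t ** 2)
```
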
</section>
<section id="wald-estimator" data-number="1.5.4">
<h3 data-number="4.5.4"><span class="header-section-number">4.5.4</span> Wald estimator</h3>
<p>Outcome-based methods are also used for estimating causal effects in the presence of instrumental variables. Instrumental variables were introduced in <a href="/causal-reasoning-book-chapter3/#sec:ch03-iv" id="sec:ch03-iv" label="sec:ch03-iv">3.4</a>, corresponding to the following structural equations for <span class="math inline">\(t\)</span> and <span class="math inline">\(y\)</span> in the general causal graph in Fig. 3.6. <span class="math display">\[\begin{split}
t &\leftarrow g(\mathbf{z},\mathbf{w}) + {\epsilon}_t \\
y &\leftarrow f(t,\mathbf{w}, \mathbf{u}) + {\epsilon}_y
\end{split}\]</span></p>
<p>For identifiability, we make an additional assumption that <span class="math inline">\(\textrm{t}\)</span> has a linear effect on <span class="math inline">\(\textrm{y}\)</span> and that the effects of <span class="math inline">\(t\)</span> and <span class="math inline">\(\mathbf{w}\)</span> compose additively. We obtain, <span id="eq:ch03-iv-parametric-y"><span class="math display">\[\label{eq:ch03-iv-parametric-y}
y \leftarrow\alpha + \beta t + f_{\mathbf{w}}(\mathbf{w}) + f_{\mathbf{u}}(\mathbf{u}) + {\epsilon}_y\qquad(8)\]</span></span> where <span class="math inline">\(f_{\mathbf{w}}\)</span> is an arbitrary function that captures the causal effect of <span class="math inline">\(\mathbf{w}\)</span> on <span class="math inline">\(y\)</span> and <span class="math inline">\({\epsilon}_y\)</span> is a zero-mean error. Under this parameterization, <span class="math inline">\(\beta\)</span> is the expected value of the causal effect. <span class="math display">\[\mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=1)] - \mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=0)]= \beta \cdot 1 - \beta \cdot 0=\beta\]</span></p>
<p>In general, the unobserved confounders <span class="math inline">\(\mathbf{u}\)</span> affect both <span class="math inline">\(y\)</span> and <span class="math inline">\(t\)</span>. Thus, <span class="math inline">\(f_{\mathbf{u}}(\mathbf{u})\)</span> is not independent of <span class="math inline">\(t\)</span>. Trying to estimate <span class="math inline">\(y\)</span> as a function of <span class="math inline">\(t\)</span> directly will lead to a biased estimate of the effect, since it may also include the effect of unobserved confounders. Instead, we can try to model the outcome as a function of the instruments and the confounders. By definition, the instrumental variable <span class="math inline">\(\textrm{z}\)</span> is d-separated from <span class="math inline">\(\mathbf{u}\)</span> and thus, <span id="eq:ch03-iv-zindepu"><span class="math display">\[\label{eq:ch03-iv-zindepu}
z \unicode{x2AEB}\mathbf{u}\text{ for all } z\in \mathbf{z}\qquad(9)\]</span></span></p>
<p>Taking expectation on both sides, we rewrite Eq. 8 as, <span id="eq:ch03-iv-identify"><span class="math display">\[\label{eq:ch03-iv-identify}
\begin{split}
\mathbb{E}[y|\mathbf{z}, \mathbf{w}] &= \mathbb{E}[\alpha + \beta t + f_{\mathbf{w}}(\mathbf{w}) + f_{\mathbf{u}}(\mathbf{u}) + {\epsilon}_y|\mathbf{z}, \mathbf{w}]\\
&= \alpha + \beta \mathbb{E}[t|\mathbf{z}, \mathbf{w}] + f_{\mathbf{w}}(\mathbf{w}) + \mathbb{E}[f_{\mathbf{u}}(\mathbf{u}) + {\epsilon}_y|\mathbf{z}, \mathbf{w}]\\
&= \alpha + \beta \mathbb{E}[t|\mathbf{z}, \mathbf{w}] + f_{\mathbf{w}}(\mathbf{w}) + \alpha_1 + \mathbb{E}[{\epsilon}_y|\mathbf{z}, \mathbf{w}]\\
&= (\alpha + \alpha_1)+ \beta \mathbb{E}[t|\mathbf{z}, \mathbf{w}] + f_{\mathbf{w}}(\mathbf{w})
\end{split}\qquad(10)\]</span></span> where <span class="math inline">\(\mathbb{E}[f_{\mathbf{u}}(\mathbf{u})| \mathbf{z}, \mathbf{w}]\)</span> is a constant (denoted <span class="math inline">\(\alpha_1\)</span>) due to Eq. 9, <span class="math inline">\(\mathbb{E}[{\epsilon}_y|\mathbf{z}, \mathbf{w}]=0\)</span> since <span class="math inline">\({\epsilon}_y\)</span> is a zero-mean error term independent of every other variable, and <span class="math inline">\(\mathbb{E}[f_{\mathbf{w}}(\mathbf{w})|\mathbf{z}, \mathbf{w}]=f_{\mathbf{w}}(\mathbf{w})\)</span> since <span class="math inline">\(\mathbb{E}[h(A)|A]=h(A)\)</span> for any function <span class="math inline">\(h\)</span> and random variable <span class="math inline">\(A\)</span>. Eq. 10 expresses the causal effect parameter <span class="math inline">\(\beta\)</span> in terms of functions of observed variables; thus the causal effect can be identified as, <span class="math display">\[\beta = \arg \min_{\beta', \alpha', h}\ell (\mathbb{E}[y|\mathbf{z}, \mathbf{w}],\alpha' + \beta'\mathbb{E}[t|\mathbf{z}, \mathbf{w}]+ h(\mathbf{w}) )\]</span> where <span class="math inline">\(\ell\)</span> is a suitable loss function such as the <span class="math inline">\(\ell_2\)</span> loss. Note that the above identification is valid only under the additive parametric assumptions described above. When there are no observed confounders <span class="math inline">\(\mathbf{w}\)</span>, and <span class="math inline">\(z\)</span> and <span class="math inline">\(t\)</span> are binary, Eq. 10 has a closed-form solution. Substituting <span class="math inline">\(z=1\)</span> and <span class="math inline">\(z=0\)</span>, <span class="math display">\[\begin{split}
\mathbb{E}[y|z=1]&= (\alpha + \alpha_1)+ \beta \mathbb{E}[t|z=1]\\
\mathbb{E}[y|z=0]&= (\alpha + \alpha_1)+ \beta \mathbb{E}[t|z=0]
\end{split}\]</span> Subtracting the two equations, we obtain the identification estimand for <span class="math inline">\(\beta\)</span>, known as the <em>Wald identifier</em> for binary instrumental variables. <span class="math display">\[\begin{split}
\mathbb{E}[y|z=1] - \mathbb{E}[y|z=0]&= \beta (\mathbb{E}[t|z=1]-\mathbb{E}[t|z=0]) \\
\Rightarrow \beta &= \frac{\mathbb{E}[y|z=1] - \mathbb{E}[y|z=0]}{\mathbb{E}[t|z=1]-\mathbb{E}[t|z=0]}
\end{split}\]</span></p>
<p>In the linear case, where <span class="math inline">\(h(\mathbf{w})=\gamma \mathbf{w}\)</span>, the optimization becomes <span id="eq:ch-04-iv-est"><span class="math display">\[\label{eq:ch-04-iv-est}
\hat{\beta} = \arg \min_{\alpha, \beta, \gamma} \left(\mathbb{E}[y|\mathbf{z}, \mathbf{w}] -( \alpha+ \beta\mathbb{E}[t|\mathbf{z}, \mathbf{w}] + \gamma \mathbf{w})\right)^2\qquad(11)\]</span></span></p>
<p>Thus if we knew <span class="math inline">\(v=\mathbb{E}[t|\mathbf{z}, \mathbf{w}]\)</span>, we can estimate <span class="math inline">\(\beta\)</span> using a simple regression of <span class="math inline">\(y\)</span> on <span class="math inline">\(v\)</span> and <span class="math inline">\(\mathbf{w}\)</span>. In practice, this expectation can be estimated using a predictive model of its own. Thus, the task of estimating causal effect reduces to the following two predictive models. <span class="math display">\[\begin{split}
\hat{t} &= \hat{g}(z,\mathbf{w}) \\
\hat{y} &= \hat{\alpha} + \hat{\beta} \hat{t} + \hat{\gamma} \mathbf{w}
\end{split}\]</span></p>
<p>Since we are estimating the effect using two predictive models and are using least squares loss minimization for fitting expected values, this method is often called the <em>two-stage least squares</em> method. Note that we are not restricted to the linear parametric form: any parametric form for <span class="math inline">\(f(t,\mathbf{w})\)</span> can work, as long as the expectations can be estimated.</p>
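<p>A minimal two-stage least squares sketch with a single binary instrument and no observed confounders; the simulation, including the unobserved confounder <code>u</code>, is an illustrative assumption:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

u = rng.normal(size=n)                      # unobserved confounder
z = rng.binomial(1, 0.5, n).astype(float)   # binary instrument
t = 1.0 * z + 2.0 * u + rng.normal(size=n)  # t driven by z and u
y = 10 * t + 5 * u + rng.normal(size=n)     # true effect of t is 10

def ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Stage 1: predict the treatment from the instrument.
Xz = np.column_stack([np.ones(n), z])
t_hat = Xz @ ols(Xz, t)

# Stage 2: regress the outcome on the predicted treatment.
beta = ols(np.column_stack([np.ones(n), t_hat]), y)[1]

# For contrast: regressing y on t directly absorbs some of u's effect.
naive = ols(np.column_stack([np.ones(n), t]), y)[1]
```

<p>The two-stage estimate is close to the true effect, while the direct regression of <code>y</code> on <code>t</code> is biased by the unobserved confounder.</p>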
<p>In the special case when <span class="math inline">\(\mathbf{z}\)</span> is one-dimensional and binary, and there are no observed confounders, Eq. 11 reduces to a simple ratio. Substituting <span class="math inline">\(z=1\)</span> and <span class="math inline">\(z=0\)</span>, <span class="math display">\[\begin{split}
\mathbb{E}[y|z=1] &= \alpha+ \beta\mathbb{E}[t|z=1] \\
\mathbb{E}[y|z=0] &= \alpha + \beta\mathbb{E}[t|z=0] \\
\Rightarrow \beta &= \frac{\mathbb{E}[y|z=1] -\mathbb{E}[y|z=0]}{\mathbb{E}[t|z=1]- \mathbb{E}[t|z=0]}
\end{split}\]</span> This ratio estimator is called the <em>Wald</em> estimator, and it can be estimated by plugging in sample means as below.</p>
<p><span class="math display">\[\hat{\beta} = \frac{ \frac{\sum_{i=1}^N \mathbb{1}_{z_i=1}y_i}{\sum_{i=1}^N \mathbb{1}_{z_i=1}} - \frac{\sum_{i=1}^N \mathbb{1}_{z_i=0}y_i}{\sum_{i=1}^N \mathbb{1}_{z_i=0}} }
{ \frac{\sum_{i=1}^N \mathbb{1}_{z_i=1}t_i}{\sum_{i=1}^N \mathbb{1}_{z_i=1}} - \frac{\sum_{i=1}^N \mathbb{1}_{z_i=0}t_i}{\sum_{i=1}^N \mathbb{1}_{z_i=0}} }\]</span></p>
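<p>The plug-in Wald estimator amounts to four sample means. A minimal sketch on synthetic data (the data-generating process below is hypothetical):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical setup: a binary instrument z nudges a binary treatment t.
z = rng.integers(0, 2, size=n)
u = rng.normal(size=n)                       # unobserved confounder
t = (0.5 * z + u + rng.normal(size=n) > 0.5).astype(float)
y = 3.0 * t + 2.0 * u + rng.normal(size=n)   # true effect = 3

# Naive comparison is badly biased because u drives both t and y.
naive = y[t == 1].mean() - y[t == 0].mean()

# Wald estimator: instrument's effect on y over its effect on t.
num = y[z == 1].mean() - y[z == 0].mean()
den = t[z == 1].mean() - t[z == 0].mean()
beta_wald = num / den
```

<p>Note how the denominator is the compliance rate: if the instrument barely moves the treatment, <code>den</code> shrinks and the ratio becomes noisy.</p>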
<p>The variance of this estimator depends on the strength of the instrument: if the instrument has a weak effect on the treatment, the denominator will be small, leading to high variability in the estimates. While it is not obvious from Eq. 11, the same holds true for the general estimator. In such cases, it may be better to focus on the data subset where compliance of the treatment with the instrument is high. Restricting to a data subset yields a reliable estimate only for the selected subpopulation, so the resultant estimate is called the local average treatment effect (LATE).</p>
</section>
</section>
<section id="sec:ch04-threshold-methods" data-number="1.6">
<h2 data-number="4.6"><span class="header-section-number">4.6</span> Threshold-based methods</h2>
<p>Finally, we discuss methods that do not depend on accurate estimation of propensities or other conditional probabilities, but instead exploit thresholds that determine how the treatment is assigned. The threshold may be applied to the values of other auxiliary variables.</p>
<section id="regression-discontinuity" data-number="1.6.1">
<h3 data-number="4.6.1"><span class="header-section-number">4.6.1</span> Regression discontinuity</h3>
<p>Regression discontinuity is a special kind of natural experiment that exploits a discontinuity in a variable due to a decision made using a threshold. For example, suppose that we are interested in the effect of a specific feature of a product on its usage, and suppose further that it was decided to show the feature to people above the 95th percentile of user activity. Comparing the outcome for these users with the new feature to other users will be confounded, since higher-activity users are likely to use the product more anyway. However, there may be users just below the cutoff who had exactly the same or very similar user activity numbers. The regression discontinuity idea is that users just below the threshold can be compared to users just above it, yielding a comparable sample of individuals with and without treatment. To the extent that these two subsets of individuals are identical except for the arbitrary user activity cutoff, we can estimate the causal effect by comparing the outcome on these subsets directly.</p>
<p>It turns out that near the threshold, the variable on which the threshold is applied acts as an instrumental variable. In the language of the instrumental variable method, whether a user lies above the threshold can be considered the instrument, and the treatment is experiencing the new feature. In our example, the instrument and the treatment are identical, leading to a <em>sharp</em> regression discontinuity. The discontinuity need not be deterministic. It can also be probabilistic (or “fuzzy”), for example, sampling users proportional to their user activity and showing them the new feature. Now the instrument can be considered the user activity variable while the treatment remains the same.</p>
<p>While Eq. 11 can be directly applied, estimation using a regression discontinuity natural experiment requires one more consideration: what should be the length of the interval before and after the threshold that forms the sample for the estimand? Intuitively, a larger interval (or “bandwidth”) leads to lower variance but higher bias, since data points far away from the threshold may not be comparable. A smaller interval reduces this bias, but exposes the estimator to high variance. In practice, it is advisable to use domain knowledge to arrive at a sensible value for the bandwidth parameter, and to evaluate the estimate at different bandwidths.</p>
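<p>The bandwidth tradeoff can be seen in a small simulation. In this hypothetical sharp-discontinuity setup, users above an activity cutoff receive the feature, and the outcome also trends smoothly with activity:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical sharp RD: users above an activity cutoff get the feature.
activity = rng.uniform(0, 100, size=n)       # running variable
cutoff = 95.0
t = (activity >= cutoff).astype(float)
# Outcome depends smoothly on activity, plus a true treatment effect of 1.5.
y = 0.05 * activity + 1.5 * t + rng.normal(size=n)

def rd_estimate(bandwidth):
    """Difference in mean outcomes just above vs. just below the cutoff."""
    above = (activity >= cutoff) & (activity < cutoff + bandwidth)
    below = (activity < cutoff) & (activity >= cutoff - bandwidth)
    return y[above].mean() - y[below].mean()

# A wider bandwidth lowers variance but adds bias from the activity trend.
for bw in (0.5, 2.0, 10.0):
    print(f"bandwidth={bw:5.1f}  estimate={rd_estimate(bw):.3f}")
```

<p>Widening the bandwidth shrinks the sampling noise but pulls the estimate away from the true effect, because the smooth activity trend leaks into the comparison.</p>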
</section>
<section id="difference-in-differences" data-number="1.6.2">
<h3 data-number="4.6.2"><span class="header-section-number">4.6.2</span> Difference-in-differences</h3>
<p>A special case arises when the threshold is applied on time: a treatment is implemented at a particular time point and the goal is to measure its effect. In such cases, it is useful to compare the outcome values just before and after the treatment is administered, relative to a control group where the treatment was not applied. This technique is called the <em>difference-in-differences</em> estimator. The key idea is that rather than comparing absolute values of the outcome between treatment and control, difference-in-differences compares the <em>change</em> in outcome before and after the treatment was applied. The key assumption is that the change in outcome is not affected by confounders, so the treatment can be considered as good as random for this modified outcome. The difference-in-differences estimator is given by: <span class="math display">\[\mathbb{E}[\Delta\textrm{y}|\operatorname{do}(\textrm{t}=t_0)]= \mathbb{E}[\Delta\textrm{y}|\textrm{t}=t_0]= \frac{1}{N_{t_0}}\sum_{t_i=t_0} \Delta y_i\]</span> In effect, subtracting from the outcome its value before the intervention accounts for any pre-existing differences between the treatment and control groups; the price is the additional assumption that the change in outcome is unconfounded.</p>
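<p>A minimal difference-in-differences sketch on synthetic data (the group sizes, baseline gap, trend, and effect below are hypothetical):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Hypothetical panel: the treated group has a different baseline (confounding).
group = rng.integers(0, 2, size=n)            # 1 = treated group
baseline = 5.0 + 2.0 * group                  # pre-existing difference
y_before = baseline + rng.normal(size=n)
trend = 0.5                                   # time trend common to both groups
effect = 1.0                                  # true treatment effect
y_after = baseline + trend + effect * group + rng.normal(size=n)

# Naive post-period comparison mixes the effect with the baseline gap.
naive = y_after[group == 1].mean() - y_after[group == 0].mean()

# Difference-in-differences: compare *changes* across groups.
delta = y_after - y_before
did = delta[group == 1].mean() - delta[group == 0].mean()
```

<p>Differencing removes both the baseline gap and the common time trend, isolating the treatment effect.</p>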
</section>
</section>
<section id="sec:ch04-practical-considerations" data-number="1.7">
<h2 data-number="4.7"><span class="header-section-number">4.7</span> Practical Considerations</h2>
<p>So far, we have presented different estimators that can implement back-door or instrumental variable identification strategies, and described how they make varying tradeoffs between bias and variance. Given a dataset and a target effect, we now discuss practical considerations in choosing an estimator and interpreting the resultant estimate.</p>
<section id="choosing-an-estimator" data-number="1.7.1">
<h3 data-number="4.7.1"><span class="header-section-number">4.7.1</span> Choosing an estimator</h3>
<p>To start with, the choice of an estimator depends on the identification strategy. Instrumental variable, regression discontinuity, or difference-in-differences estimation methods may be used whenever there is an identified estimand based on their assumptions. For estimation based on adjustment-set identification, the type of treatment and the dimensionality of the dataset provide useful information for choosing an estimator. The easiest estimation task is when the treatment is discrete and the dataset is low-dimensional. Simple stratification is the best method in such cases: it is model-free and provides an unbiased estimate.</p>
<p>When the dataset contains high-dimensional confounders, effect estimation becomes harder, as with any statistical task. If we have any knowledge about the nature of treatment assignment, then balance-based methods are preferable. For instance, if we know that the treatment was assigned using some randomized algorithm, then we can use the outputs of the algorithm as a reliable propensity score, or estimate the score from data. In other cases, we may know the functional form based on domain knowledge. For example, medical treatments assigned based on manual checklists can be approximated by a decision-tree propensity model. Among the balance-based methods, stratification is preferred for discrete confounders and matching for continuous confounders. The important hyperparameters for these methods are, respectively, the size of the stratification bins and the maximum distance allowed for a match, which need to be tuned based on domain knowledge and the data available. If more is known about the outcome’s structural equations than the treatment’s, then outcome-based models such as the T-learner can be used. For example, when estimating the effect of a certain event on energy consumption in a computer cluster, we may not know why the event occurs, but we may know how different processes and events contribute to energy consumption.</p>
<p>Weighting-based methods can also be used for discrete treatments, but the estimate can be unreliable if there are a few data points with very high weights. For this reason, weighting-based methods work well only when none of the propensity scores are too high or too low. They have the benefit of requiring the fewest hyperparameters.</p>
<p>For continuous treatments, model-based estimators are a necessity, since we need to assume some functional form for how the treatment affects the outcome. We recommend the residual-based methods since they can capture non-linear relationships from the confounders to the treatment and outcome, while avoiding the bias of purely predictive models. The residual-based methods can also handle high-dimensional confounders.</p>
</section>
<section id="sec:ch04-bootstrap-interpret" data-number="1.7.2">
<h3 data-number="4.7.2"><span class="header-section-number">4.7.2</span> Confidence intervals and interpreting an estimate</h3>
<p>Once an estimator is chosen and the effect is estimated, it is important to interpret it correctly, given the varied challenges that can derail an estimate, such as an incorrectly specified model, sample selection bias, poor overlap, low sample size, or outlier data (as discussed in section <a href="#sec:ch04-bias-variance" data-reference-type="ref" data-reference="sec:ch04-bias-variance">1.2</a>). Some of these challenges are common to any statistical analysis. For instance, sample size and outliers are common statistical problems, and the resultant variance can be captured through confidence intervals for the causal estimate. To estimate confidence intervals for any of the estimators presented in this chapter, we can use the <em>bootstrap</em> method. The bootstrap method creates multiple datasets of the same size as the original dataset by sampling values from the original dataset with replacement. The same estimator is then applied to each of these datasets and the estimates are sorted by their value. The range of the estimates between the <span class="math inline">\(k/2\)</span>th percentile and the <span class="math inline">\((100-k/2)\)</span>th percentile forms the <span class="math inline">\((100-k)\%\)</span> confidence interval. For example, to compute the 95% confidence interval, we look at the range from the 2.5th to the 97.5th percentile-ranked estimates. As the number of sampled datasets increases, the confidence intervals become more accurate (typical numbers are 1,000 or 10,000 datasets). The benefit of this resampling bootstrap method is that it makes no assumptions about the data distribution or the estimator, and can be applied universally, as long as computational capacity is available.</p>
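<p>A percentile-bootstrap sketch, using a simple difference-in-means estimator as a stand-in for any of the estimators in this chapter (the data here are synthetic):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic sample with a true effect of 2.0 (hypothetical setup).
t = rng.integers(0, 2, size=500)
y = 2.0 * t + rng.normal(size=500)

def naive_effect(t, y):
    """Difference-in-means estimator (stand-in for any estimator)."""
    return y[t == 1].mean() - y[t == 0].mean()

def bootstrap_ci(t, y, estimator, n_boot=1000, alpha=0.05):
    """Percentile bootstrap: resample rows, re-estimate, take quantiles."""
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample with replacement
        estimates[b] = estimator(t[idx], y[idx])
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_ci(t, y, naive_effect)
```

<p>Any estimator with the same signature can be passed in place of <code>naive_effect</code>; the bootstrap itself never looks inside it.</p>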
<p>The issues of overlap and model choice, however, require special consideration in effect estimation. As we saw in Fig. 3 for a simple one-dimensional confounder <span class="math inline">\(w\)</span>, if there are not enough treatment or control data points for some values of <span class="math inline">\(w\)</span>, then we cannot say anything about the causal effect at those values of <span class="math inline">\(w\)</span>. Some methods, like outcome-based prediction methods, may still provide an estimate for all values of <span class="math inline">\(w\)</span>, but that estimate is largely a function of the modeling assumptions in the method, not a conclusion from the data. For example, suppose that we wanted to estimate the effect of an online service on people’s financial savings. The identified estimand requires conditioning on the income of a person, along with other features. Suppose further that the vast majority of high-income people tried the service, whereas the split between those who tried it or not is 50-50 for everyone else. While different estimation methods may yield varying estimates of the product’s effect, it is critical to realize that <em>no</em> method can reliably estimate the effect on the high-income people from this data. If we choose weighting-based methods, they may be too sensitive to the high propensity scores for the high-income people, and therefore one may decide to remove those with high income from the dataset. If we use outcome-based methods, we may obtain an estimate for the high-income people, but it would simply be extrapolated from the estimated effect on the low-income and middle-income people, based on modeling assumptions about the structural equation. Therefore, without a strong prior for the modeling assumption, its estimate for the high-income group can be misleading. In practice, we recommend measuring treatment overlap through the propensity score for different subsets of the confounders’ values and interpreting the estimate accordingly.</p>
<p>However, the interpretation of the propensity score becomes tricky under high-dimensional confounders. As the number of confounders increases, it becomes difficult to associate high propensity scores with a meaningful set of confounders. Even for simpler balance-based methods, it becomes difficult to evaluate whether any two data points are comparable: as the dimensionality of the confounders increases, it is unlikely that two data points share the same confounder values. It also becomes difficult to construct distance measures that capture true similarity between units. Thus, rather than perfect matches or exact stratification, conditioning on confounders becomes a best-effort exercise based on a statistical model. Interpreting which kinds of data the estimator is applicable to is a much harder task, and needs to be done carefully based on domain knowledge and the goals of the analysis.</p>
</section>
<section id="challenges-of-validating-an-estimate" data-number="1.7.3">
<h3 data-number="4.7.3"><span class="header-section-number">4.7.3</span> Challenges of validating an estimate</h3>
<p>Finally, all causal estimation methods suffer from the fundamental limitation that they cannot be evaluated for accuracy without conducting a randomized experiment. This limitation makes it hard to compare causal inference methods on the same dataset, let alone across different kinds of data. For a predictive model, it is possible to hold out a subset of observational data for testing the accuracy of the model. For a causal estimate, however, we would ideally need test data that assigns a different treatment to each observation, which is impossible to observe. The next best option is to conduct a randomized experiment, but that requires intervention in the real world, which comes with its own set of risks.</p>
<p>We therefore have to use more creative ways of evaluating a causal estimate that we describe in the next chapter.</p>
</section>
</section>
</section>
Mon, 05 Apr 2021 00:00:00 +0000
https://causalinference.gitlab.io/causal-reasoning-book-chapter4/
Chapter 3: Identification<section id="identification-causal-reasoning-book-chapter3" data-number="1">
<p>Once we have captured our causal assumptions in the form of a model, the second stage of causal analysis is <em>identification</em>. In this stage, our goal is to analyze our causal model—including the causal relationships between features and which features are observed—to determine whether we have enough information to answer a specific causal inference question.</p>
<p>We begin by formalizing the concept of causal inference questions using <em>intervention graphs</em>. We describe do-calculus rules that relate relationships in intervention graphs to the causal models of our observational data. We show how do-calculus leads us to various identification strategies, and how do-calculus can be combined with parametric assumptions as well. Finally, we discuss the relative advantages and disadvantages of these strategies, and discuss common approaches for analyzing a causal inference question to help choose from among these various approaches.</p>
<p>After completing this chapter, we expect the reader will understand the fundamentals of identification and how they are used to derive common identification strategies.</p>
<section id="sec:ch03-causal-questions" data-number="1.1">
<h2 data-number="3.1"><span class="header-section-number">3.1</span> Causal inference questions: Concepts and Notation</h2>
<p>A causal question is <em>any</em> question about the relationship between causes and effects. In their fullness, the set of causal questions encompasses a very broad variety of questions. For both practical and pedagogical reasons, we focus here on a narrower class of questions called <em>causal inference</em> questions where: (1) the causal model is (assumed) known; and (2) we wish to quantify the causal relationship between two specific variables, e.g., its strength and functional form. The causal model, whether expressed in graphical form or as a set of equations, captures our assumptions about the relationships that might exist between nodes. The strength and functional form of the causal relationship between two specific variables, however, is not captured in the causal model. We must derive this information from data. Unfortunately, as we saw in Chapter <a href="/causal-reasoning-book-chapter1" data-reference-type="ref" data-reference="ch_patternsandpredictionsarenotenough">1</a> in the Simpson’s paradox examples, observed data rarely quantifies causal relationships directly. We have to use our knowledge of the causal model to determine whether or not, and how, we can compute the strength of a given causal relationship from data.</p>
<section id="formalizing-causal-inference-questions-using-intervention-graphs-and-operatornamedo-notation" data-number="1.1.1">
<h3 data-number="3.1.1"><span class="header-section-number">3.1.1</span> Formalizing causal inference questions using intervention graphs and <span class="math inline">\(\operatorname{do}\)</span> notation</h3>
<figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig1_B.png" id="fig:simple-3nodegraph" alt="Figure 1: The effect of A on B must account for the confounding influence of C" /><figcaption aria-hidden="true">Figure 1: The effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> must account for the confounding influence of <span class="math inline">\(C\)</span></figcaption>
</figure>
<p>Consider the causal model shown in Fig. 1. We can ask the following causal inference question: how does an intervention to change feature <span class="math inline">\(A\)</span>’s value affect feature <span class="math inline">\(B\)</span>’s value? We cannot simply refer to this as <span class="math inline">\(\operatorname{P}(B|A)\)</span> because this notation is already used to represent the observed distribution, which is confounded in this case by the influence of <span class="math inline">\(C\)</span>. This is the statistical relationship: in our observed data, this is what we expect <span class="math inline">\(B\)</span>’s value to be, given that we have seen a particular value of <span class="math inline">\(A\)</span>.</p>
<p>To correctly represent this question then, we must first introduce a notation that addresses the subtle distinction between statistical and causal relationships. To represent the causal relationship between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, we need a notation that will distinguish it from the merely observed statistical relationships between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>. We write this causal relationship as: <span class="math display">\[\operatorname{P}(B|\operatorname{do}(A))\]</span></p>
<p>The operator <span class="math inline">\(\operatorname{do}(A)\)</span> represents the <em>intervention</em> to change the value of <span class="math inline">\(A\)</span>. When we estimate the value of <span class="math inline">\(B\)</span> conditioned on <span class="math inline">\(\operatorname{do}(A)\)</span>, we are imagining ourselves reaching in and changing the value of feature <span class="math inline">\(A\)</span> while leaving the rest unchanged—except of course, changes caused directly or indirectly by the manipulation of <span class="math inline">\(A\)</span>. Because our intervention is taken independently of the rest of the system, we are essentially creating a new causal model where we have cut off <span class="math inline">\(A\)</span> from all of its parents. In other words, we have a situation as shown in Fig. 2. On the left hand side, we see the original causal graph, <span class="math inline">\(G\)</span>, and on the right, we see the same model, where the feature <span class="math inline">\(A\)</span> is now determined independently. We call this second graph the <em>interventional graph</em> or the <em>Do graph</em> of <span class="math inline">\(A\)</span>, <span class="math inline">\(G_{\operatorname{do}(A)}\)</span>. This is a new system. If we could observe data sampled from the data distribution <span class="math inline">\(\operatorname{P}^*\)</span> corresponding to this new system, our observed data would perfectly represent the causal relationship between <span class="math inline">\(A\)</span> and other values. That is, <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))= \operatorname{P}^*(B|A)\)</span> and hence <span class="math inline">\(\mathbb{E}[B|\operatorname{do}(A)]=\mathbb{E}^*[B|A]\)</span>.</p>
<p>Because we often ask causal questions in the context of a decision, we are often comparing two or more outcomes to help us understand the effect of the actions we might take. For example, if we are planning an intervention on a binary variable <span class="math inline">\(A\)</span>, our causal inference question focuses on the effect of setting <span class="math inline">\(A=1\)</span> vs <span class="math inline">\(A=0\)</span>. Thus, we represent the causal effect of <span class="math inline">\(A\rightarrow B\)</span> as a difference between the two interventions: <span id="eq:binaryinterventioneffect"><span class="math display">\[\label{eq:binaryinterventioneffect}
\operatorname{P}(B|\operatorname{do}(A=1)) - \operatorname{P}(B|\operatorname{do}(A=0))\qquad(1)\]</span></span></p>
<p>Of course, if the focus of our decision-making is more complex, involving multiple options, we will make many comparisons among the options.</p>
<p>If the focus of our decision-making is a continuous variable, we can represent the effect of an intervention as a derivative: <span id="eq:interventionasderivative"><span class="math display">\[\label{eq:interventionasderivative}
\frac{dB}{d\operatorname{do}(A)} = \lim_{\Delta A \to 0} \left[ \frac{\operatorname{P}(B|\operatorname{do}(A+\Delta A)) - \operatorname{P}(B|\operatorname{do}(A))}{\Delta A} \right]\qquad(2)\]</span></span></p>
<p>Or, if there are multiple independent variables, we can write the effect as a partial derivative, where the partial derivative is evaluated at <span class="math inline">\(X\)</span>, the set of independent variables excluding the treatment.</p>
<p><span id="eq:interventionaspartialderivative"><span class="math display">\[\label{eq:interventionaspartialderivative}
\frac{\partial B}{\partial \operatorname{do}(A)} = \lim_{\Delta A \to 0} \left[ \frac{\operatorname{P}(B|\operatorname{do}(A+\Delta A), X) - \operatorname{P}(B|\operatorname{do}(A), X)}{\Delta A} \right]\qquad(3)\]</span></span></p>
<div id="fig:intervention-graph-main" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tbody>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig2_A.png" id="fig:original-graph" style="width:100.0%" alt="a" /><figcaption aria-hidden="true">a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig2_B.png" id="fig:intervention-graph" style="width:100.0%" alt="b" /><figcaption aria-hidden="true">b</figcaption>
</figure></td>
</tr>
</tbody>
</table>
<p>Figure 2: When we ask about the effect of an intervention that changes the value of a feature <span class="math inline">\(A\)</span> in some system (left), we are creating, effectively, a new, hypothetical, system where <span class="math inline">\(A\)</span> is set independently of other features (right). a — A causal graph, <span class="math inline">\(G\)</span>, b — An intervention graph <span class="math inline">\(G_{\operatorname{do}(A)}\)</span></p>
</div>
</section>
<section id="sec:ch03-heterogeneouseffects" data-number="1.1.2">
<h3 data-number="3.1.2"><span class="header-section-number">3.1.2</span> Feature Interactions and Heterogeneous Effects</h3>
<p>The effect of a treatment on an outcome is rarely simple and homogeneous. Rather, the effect often varies based on context or unit-level features. For example, a medical procedure may work better or worse in younger or older patients; or a pricing discount might increase sales of some products but not others.</p>
<p>We model these kinds of varying effects as feature interactions. In our graphical model of a system, our outcome feature will have incoming edges from the treatment and one or more contextual features that either affect the outcome or modify the treatment’s effect on the outcome. For example, in Fig. 2, the outcome <span class="math inline">\(B\)</span> has incoming edges from a treatment <span class="math inline">\(A\)</span>, and also from <span class="math inline">\(C\)</span> and <span class="math inline">\(E\)</span>. These variables might interact with <span class="math inline">\(A\)</span> to moderate or amplify its effects on <span class="math inline">\(B\)</span>. From the causal graph alone, we do not know whether either <span class="math inline">\(C\)</span> or <span class="math inline">\(E\)</span> interacts with the treatment <span class="math inline">\(A\)</span> to modify <span class="math inline">\(A\)</span>’s effect on <span class="math inline">\(B\)</span>.</p>
<p>Recall that we can represent the value of a node as a general function of its parent features. Without loss of generality, let us represent the value of a node as a general function of a single parent node, <span class="math inline">\(v_0\)</span>, and a vector of the remaining parent nodes <span class="math inline">\(\boldsymbol{v}\)</span>: <span class="math inline">\(f(v_0,\boldsymbol{v})\)</span>. If <span class="math inline">\(v_0\)</span> does not interact with the other features, then <span class="math inline">\(v_0\)</span>’s effect is homogeneous and we can decompose <span class="math inline">\(f(v_0,\boldsymbol{v})\)</span> as <span class="math inline">\(f(v_0,\boldsymbol{v}) = \phi \left( g_0(v_0) + g_1(\boldsymbol{v}) \right)\)</span>. If <span class="math inline">\(v_0\)</span> does interact with other features, then <span class="math inline">\(v_0\)</span>’s effect is heterogeneous, and <span class="math inline">\(f(v_0,\boldsymbol{v})\)</span> will decompose into <span class="math inline">\(f(v_0,\boldsymbol{v}) = \phi \left( g_0(v_0,\boldsymbol{v}') + g_1(\boldsymbol{v}) \right)\)</span>, where <span class="math inline">\(\boldsymbol{v}'\)</span> is a subset of the elements of <span class="math inline">\(\boldsymbol{v}\)</span>. We sometimes refer to this subset <span class="math inline">\(\boldsymbol{v}'\)</span> as <em>effect modifiers</em>.</p>
<p>We can also express the concepts of heterogeneous and homogeneous interactions as follows. If the effect of <span class="math inline">\(v_0\)</span> on <span class="math inline">\(Y\)</span> is heterogeneous then <span class="math inline">\(\operatorname{P}(Y|\operatorname{do}(v_0)) \neq \operatorname{P}(Y|\operatorname{do}(v_0),\boldsymbol{v})\)</span>. If <span class="math inline">\(v_0\)</span>’s effect is homogeneous, then <span class="math inline">\(\operatorname{P}(Y|\operatorname{do}(v_0)) = \operatorname{P}(Y|\operatorname{do}(v_0),\boldsymbol{v})\)</span>.</p>
<p><strong>How do we take effect modifiers into account in identification of causal effects?</strong> In some cases, we may only be interested in the average causal effect of a treatment given a known population distribution. For example, we may be able to decide whether to apply a global treatment based on average causal effect <a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a>. Interestingly, estimating the average causal effect does not necessarily require measuring the effect modifiers. I.e., they may be unobserved, as long as they are not also confounders. However, we must remember that if we measure the average effect with respect to one population distribution, it will not remain valid if the population changes.</p>
<p>In most cases, our overall task or purpose will require that we capture the full distribution of the causal effect; i.e., we must learn how the treatment effect varies with effect modifiers. The <em>individual treatment effect</em> (ITE) is an estimate of the effect of treatment for a specific individual unit and context. Note that the ITE makes a strong assumption that all effect modifiers are known and captured in a model, and also observed. If we believe there may be unknown or unobserved effect modifiers, then it is more correct to say we are identifying the <em>conditional average treatment effect</em> (CATE). This is the average treatment effect conditional on a set of observed effect modifiers. Note the relationship between CATE and ITE: if we calculate a CATE conditioned on all effect modifiers then CATE is equivalent to ITE.</p>
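<p>The distinction between ATE and CATE can be illustrated with a synthetic randomized treatment whose effect depends entirely on a single observed effect modifier (a hypothetical setup):</p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical effect modifier: the treatment helps only when v = 1.
v = rng.integers(0, 2, size=n)               # observed effect modifier
t = rng.integers(0, 2, size=n)               # randomized treatment
y = (2.0 * v) * t + rng.normal(size=n)       # effect is 0 or 2 depending on v

# ATE averages over the modifier; CATE conditions on it.
ate = y[t == 1].mean() - y[t == 0].mean()
cate = {
    val: y[(t == 1) & (v == val)].mean() - y[(t == 0) & (v == val)].mean()
    for val in (0, 1)
}
```

<p>The ATE of about 1 hides the fact that the treatment does nothing for half the population, which the CATE conditioned on <code>v</code> reveals.</p>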
<p>In addition, we sometimes calculate a <em>local average treatment effect</em> (local-ATE). Local-ATE is an estimate of the treatment effect, but only for a specific subpopulation or subset of effect modifier values.</p>
</section>
<section id="direct-and-mediated-effects" data-number="1.1.3">
<h3 data-number="3.1.3"><span class="header-section-number">3.1.3</span> Direct and Mediated Effects</h3>
<p>In a causal graph, there can be multiple paths by which changing some feature can influence some outcome we care about. For example, in <span class="math inline">\(G\)</span> shown in Fig. 2 (a), changing the variable <span class="math inline">\(E\)</span> can influence <span class="math inline">\(B\)</span> directly, through the edge <span class="math inline">\(E \Rightarrow B\)</span>, and indirectly, mediated by <span class="math inline">\(A\)</span> in the path <span class="math inline">\(E \Rightarrow A \Rightarrow B\)</span>. Usually, when we wish to measure and understand the effect of some feature on an outcome, we want to know the feature’s <em>total effect</em> on the outcome, regardless of whether that effect is direct or mediated.</p>
<p>There are times when it is useful to distinguish between direct and mediated effects. For example, when we are analyzing a situation involving a long-term outcome that will not be observable for a long time, it might be useful to measure a mediating short-term outcome, to understand what impact our change is having. In other situations, understanding how effects travel through mediating paths might provide us insight into ways to assert greater control over the effects. For example, we might be able to find ways to block some paths to prevent negative effects.</p>
<p>Formally, the notation <span class="math inline">\(\operatorname{P}(Y|\operatorname{do}(T))\)</span> represents the total effect of intervening on <span class="math inline">\(T\)</span> on an outcome <span class="math inline">\(Y\)</span>. Given a set of <span class="math inline">\(k\)</span> mediated paths from <span class="math inline">\(T\)</span> to <span class="math inline">\(Y\)</span>, where each path is mediated by a single feature <span class="math inline">\(m_{1...k}\)</span> we can calculate the mediated effect (<span class="math inline">\(ME_i\)</span>) of <span class="math inline">\(T\)</span> on <span class="math inline">\(Y\)</span> through <span class="math inline">\(m_i\)</span> as <span class="math inline">\(\text{ME}_i = P(Y|\operatorname{do}(m_i))P(m_i|\operatorname{do}(T))\)</span>. This chained calculation can be extended for longer mediating paths consisting of multiple mediators.</p>
<p>The direct effect of <span class="math inline">\(T\)</span> on <span class="math inline">\(Y\)</span> is given by the difference between total effect and the sum of all the mediated effects: <span class="math inline">\(\text{direct effect} = P(Y|\operatorname{do}(T)) - \sum_{i=1...k} \text{ME}_i\)</span>.</p>
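<p>For linear structural equations, the decomposition above can be verified with a quick simulation. The sketch below assumes an illustrative toy model (<span class="math inline">\(M = 2T + \epsilon\)</span>, <span class="math inline">\(Y = 1.5T + 0.5M + \epsilon'\)</span>; the names and coefficients are ours, not from the text), in which the mediated effect is the product of the path coefficients and the direct effect is the total effect minus the mediated effect.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative linear SCM: T -> M -> Y and T -> Y (coefficients are assumptions)
a, b, c = 2.0, 1.5, 0.5  # M = a*T + noise;  Y = b*T + c*M + noise

def mean_outcome(do_t):
    """Sample E[Y | do(T = do_t)] by setting T directly (an intervention)."""
    t = np.full(n, float(do_t))
    m = a * t + rng.normal(size=n)
    y = b * t + c * m + rng.normal(size=n)
    return y.mean()

total_effect = mean_outcome(1.0) - mean_outcome(0.0)  # approx b + a*c
mediated_effect = a * c                                # effect through M
direct_effect = total_effect - mediated_effect         # approx b
print(total_effect, direct_effect)
```

<p>With these coefficients, the total effect converges to <span class="math inline">\(1.5 + 2.0 \times 0.5 = 2.5\)</span> and the direct effect to <span class="math inline">\(1.5\)</span>, the linear-model analogue of the chained formula above.</p>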
</section>
</section>
<section id="sec:ch03-docalculus" data-number="1.2">
<h2 data-number="3.2"><span class="header-section-number">3.2</span> Do-calculus</h2>
<p>The task of causal identification is to determine an expression, the causal estimand, that expresses our target value as a function of the observable correlational relationships in our system. That is, how do we express <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>, the correlation of <span class="math inline">\(B\)</span> and <span class="math inline">\(A\)</span> in the intervention graph, as a function of observable correlations in the initial graph?</p>
<section id="sec:docalculusrandomized" data-number="1.2.1">
<h3 data-number="3.2.1"><span class="header-section-number">3.2.1</span> Randomized Experiments</h3>
<p>As a starting point to understand the connections between the original and the interventional graphs, it is convenient to begin by considering the causal graph for a randomized experiment. When the intervention or treatment, <span class="math inline">\(A\)</span>, is randomized in an experiment, it has no ancestors in the graph. If we draw the intervention graph, <span class="math inline">\(G_{\operatorname{do}(A)}\)</span>, we see that it is the same as the original. Thus, in an experiment that randomizes <span class="math inline">\(A\)</span>, <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span> is the same as <span class="math inline">\(\operatorname{P}(B|A)\)</span>. Furthermore, this holds without analysis of the remainder of the causal graph, meaning that we can identify the causal effect of <span class="math inline">\(A\)</span> on other features without knowing the causal relationships in the system beyond the fact that <span class="math inline">\(A\)</span> is randomly assigned. This robustness to our knowledge of the causal system is why randomized experiments are considered the gold standard for identifying causal effects.</p>
<p>Randomization can come from many sources. Sometimes randomization is an inherent part of the system logic, such as in load-balancing algorithms that randomly assign incoming requests to one of the available servers.</p>
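<p>This equivalence is easy to check numerically. The following sketch (a toy model with an assumed true effect of 2.0; all variable names are illustrative) compares the naive difference in means under randomized assignment, where it recovers <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>, against the same estimator when assignment depends on a confounder.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_effect = 2.0               # assumed causal effect of A on B in this toy model

u = rng.normal(size=n)          # a confounder

# Randomized experiment: A is a coin flip, independent of u
a_rand = rng.integers(0, 2, size=n)
b_rand = true_effect * a_rand + u + rng.normal(size=n)

# Observational system: A depends on the confounder u
a_obs = (u + rng.normal(size=n) > 0).astype(int)
b_obs = true_effect * a_obs + u + rng.normal(size=n)

def naive(a, b):
    """The purely correlational estimate: E[B|A=1] - E[B|A=0]."""
    return b[a == 1].mean() - b[a == 0].mean()

print(naive(a_rand, b_rand))  # close to the true effect: randomization works
print(naive(a_obs, b_obs))    # inflated: confounding through u biases P(B|A)
```

<p>Only the assignment mechanism differs between the two datasets, yet the naive estimate is unbiased in the first and badly inflated in the second.</p>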
<p>What do we do when the system we are observing is not a randomized experiment? How can we identify a causal estimand that represents <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span> based on the confounded correlations observed in a non-randomized experiment? In the next section, we describe a calculus of rules that can help us with this task.</p>
</section>
<section id="causal-distributions-from-observational-data" data-number="1.2.2">
<h3 data-number="3.2.2"><span class="header-section-number">3.2.2</span> Causal Distributions from Observational Data</h3>
<p>Our challenge now, in the causal identification stage of our analysis, becomes clearer. We wish to calculate this value, <span class="math inline">\(\operatorname{P}[B|\operatorname{do}(A)]\)</span>, but we do not observe the system represented by <span class="math inline">\(G_{\operatorname{do}(A)}\)</span>, the Do graph of <span class="math inline">\(A\)</span>. In other words, no data from the probability distribution <span class="math inline">\(P_{G_{\operatorname{do}(A)}}(.)\)</span> implied by the interventional do-graph is available, yet we would like to estimate <span class="math inline">\(\operatorname{P}_{G_{\operatorname{do}(A)}}[B|A]= \operatorname{P}[B|\operatorname{do}(A)]\)</span>. Therefore, we must identify a strategy for calculating this value given only observations of the system represented by <span class="math inline">\(G\)</span> and sampled from the probability distribution <span class="math inline">\(P(.)\)</span>.</p>
<p>Since we have no data from <span class="math inline">\(\operatorname{P}_{G_{\operatorname{do}(A)}}\)</span>, a natural strategy is to rewrite the desired quantity over <span class="math inline">\(\operatorname{P}_{G_{\operatorname{do}(A)}}\)</span> in terms of probability expressions over <span class="math inline">\(\operatorname{P}\)</span>. To do so, we can utilize the fact that an intervention corresponds to a specific structural change in the causal graph and find the conditional distributions that stay invariant under this change. Specifically, since the intervention only removes incoming arrows to the intervened variable, the structural equations for all other nodes remain the same, and thus the conditional distribution of any non-intervened variable given its parents stays the same. That is, if <span class="math inline">\(B\)</span> is caused by the set of variables <span class="math inline">\(Pa(B)\)</span>, then <span class="math inline">\(\operatorname{P}_{G_{\operatorname{do}(A)}}(B|Pa(B))=\operatorname{P}(B|Pa(B))\)</span>. Similarly, we can claim that any conditional independence between variables in the observed data distribution should also hold in the interventional distribution, since an intervention only removes edges from the graph, never adds them. If a set of nodes <span class="math inline">\(B\)</span> is independent of <span class="math inline">\(C\)</span> conditional on <span class="math inline">\(D\)</span>, then <span class="math inline">\(\operatorname{P}(B|D, C)=\operatorname{P}(B|D)\)</span> and <span class="math inline">\(\operatorname{P}_{G_{\operatorname{do}(A)}}(B|D, C)=\operatorname{P}_{G_{\operatorname{do}(A)}}(B|D)\)</span>. 
As an example of how these simple properties can be used for identification, consider the target quantity <span class="math inline">\(P(B|\operatorname{do}(A))\)</span> where <span class="math inline">\(A \supseteq Pa(B)\)</span> is a set of variables that includes all parents of <span class="math inline">\(B\)</span>, possibly along with additional variables. Then, using the above two equivalence properties, we can write <span class="math display">\[\begin{aligned}
\label{eq:simple-do-calculus-derive}
\operatorname{P}(B|\operatorname{do}(A)) &= \operatorname{P}_{G_{\operatorname{do}(A)}}(B|A) && \text{Using the definition of the do-operator} \\
&= \operatorname{P}_{G_{\operatorname{do}(A)}}(B| Pa(B), A \setminus Pa(B)) && \\
&= \operatorname{P}_{G_{\operatorname{do}(A)}}(B| Pa(B)) && \text{Using the second property above} \\
&= \operatorname{P}(B| Pa(B)) && \text{Using the first property above}\end{aligned}\]</span></p>
<p>Thus, starting from a target probability expression involving the do-operator, we are able to construct an expression based only on the observed probability distribution <span class="math inline">\(P\)</span>. The final expression is called the <em>identified estimand</em> or the <em>target estimand</em>, and can be estimated from available data. The process of transforming a target do-expression into an expression involving only the observed probabilities is called <em>identification</em>.</p>
<p>Rather than coming up with such properties for every new do-expression, <em>do-calculus</em> provides a set of rules that generalizes the above procedure to any causal graph. The key advantage of do-calculus is that it formalizes such custom derivations into a general framework that can be applied mechanistically to any graph and any causal inference question over that graph. Given a causal graph, it allows us to relate probabilities in the interventional graph (which we do not observe) to statistical relationships that we can observe in the observational graph. That is, do-calculus gives us the tools to convert our causal target, <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>, into a causal estimand that is computable from observational quantities.</p>
</section>
<section id="graph-rewriting" data-number="1.2.3">
<h3 data-number="3.2.3"><span class="header-section-number">3.2.3</span> Graph Rewriting</h3>
<p>We have seen above that given a graph <span class="math inline">\({G}\)</span> that represents our causal assumptions about a system, it is useful to be able to refer to modified or edited versions of <span class="math inline">\({G}\)</span>, such as the interventional graph.</p>
<ul>
<li><p><strong>Interventional Graph:</strong> We refer to the interventional graph, where we have intervened on a feature <span class="math inline">\(A\)</span> as <span class="math inline">\({G}_{\operatorname{do}(A)}\)</span>. This graph <span class="math inline">\({G}_{\operatorname{do}(A)}\)</span> is identical to <span class="math inline">\({G}\)</span> except all edges leading to <span class="math inline">\(A\)</span> from its parents have been removed (e.g., Fig. 3 (b)).</p></li>
<li><p><strong>Nullified Graph:</strong> It is also useful to refer to the nullified graph, where we have artificially removed or nullified all effects of a feature <span class="math inline">\(A\)</span>. This graph, which we reference as <span class="math inline">\({G}_{null(A)}\)</span>, is identical to <span class="math inline">\(G\)</span>, except that all edges from <span class="math inline">\(A\)</span> to its children have been removed (e.g., Fig. 3 (c)).</p></li>
</ul>
<p>Note that we are not limited to a single intervention or nullification on a graph. For example, <span class="math inline">\({G}_{\operatorname{do}(A),\operatorname{do}(C)}\)</span> would represent a graph where we have intervened on both <span class="math inline">\(A\)</span> and <span class="math inline">\(C\)</span>. <span class="math inline">\({G}_{\operatorname{do}(A),null(C)}\)</span> represents a graph where we have intervened on <span class="math inline">\(A\)</span> and nullified the effects of <span class="math inline">\(C\)</span> (e.g., Fig. 3 (d)).</p>
<p>For brevity, the literature often uses an overbar and underbar to represent interventions and nullifications; that is, <span class="math inline">\({G}_{\operatorname{do}(A),null(C)}\)</span> can be equivalently written as <span class="math inline">\({G}_{\bar{A},\underline{C}}\)</span>.</p>
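<p>If we represent a causal graph as a list of (parent, child) edge pairs, these rewrites are one-line filters. A minimal sketch (the edge-list representation and example graph are ours, for illustration):</p>

```python
def do_graph(edges, a):
    """G_do(a): drop every edge pointing into the intervened node a."""
    return [(s, t) for (s, t) in edges if t != a]

def null_graph(edges, a):
    """G_null(a): drop every edge leaving the nullified node a."""
    return [(s, t) for (s, t) in edges if s != a]

# Rewrites compose, e.g. G_{do(A), null(C)} on a small chain graph:
g = [("U", "A"), ("A", "B"), ("B", "C"), ("C", "D")]
print(do_graph(null_graph(g, "C"), "A"))  # -> [('A', 'B'), ('B', 'C')]
```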
<div id="fig:graph-rewriting" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tbody>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig2_A.png" id="fig:original-graph2" style="width:100.0%" alt="a" /><figcaption aria-hidden="true">a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig2_B.png" id="fig:intervention-graph2" style="width:100.0%" alt="b" /><figcaption aria-hidden="true">b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig2_C.png" id="fig:null-graph2" style="width:100.0%" alt="c" /><figcaption aria-hidden="true">c</figcaption>
</figure></td>
</tr>
<tr class="even">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig2_D.png" id="fig:intervention-null-graph" style="width:100.0%" alt="d" /><figcaption aria-hidden="true">d</figcaption>
</figure></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
</tr>
</tbody>
</table>
<p>Figure 3: Graph rewriting examples. a — A causal graph, <span class="math inline">\(G\)</span>, b — An intervention graph <span class="math inline">\(G_{\operatorname{do}(A)}\)</span>, c — A nullified graph <span class="math inline">\(G_{\operatorname{null}(C)}\)</span>, d — A rewritten graph <span class="math inline">\(G_{\operatorname{do}(A),\operatorname{null}(C)}\)</span></p>
</div>
</section>
<section id="rules-of-do-calculus" data-number="1.2.4">
<h3 data-number="3.2.4"><span class="header-section-number">3.2.4</span> Rules of Do-calculus</h3>
<p>Do-calculus consists of 3 simple rules:</p>
<ol type="1">
<li><p><em>Insertion or deletion of observations</em> <span class="math display">\[\begin{aligned}
\operatorname{P}(y|\operatorname{do}(x),z,w) = \operatorname{P}(y|\operatorname{do}(x),w) && \text{if } (y \unicode{x2AEB} z | x, w)_{G_{\operatorname{do}(x)}}
\end{aligned}\]</span> Rule 1 states that we can remove variables <span class="math inline">\(z\)</span> from the conditioning set if the remaining conditioning variables <span class="math inline">\(w\)</span> and the intervention variables <span class="math inline">\(x\)</span> d-separate <span class="math inline">\(y\)</span> from <span class="math inline">\(z\)</span> in the intervention graph <span class="math inline">\(G_{\operatorname{do}(x)}\)</span>. Of course, we can also add variables to the conditioning set under the same criteria.</p>
<p>The intuition for this rule is that <span class="math inline">\(y|\operatorname{do}(x)\)</span> reduces to simply <span class="math inline">\(y|x\)</span> under the graph <span class="math inline">\(G_{\operatorname{do}(x)}\)</span> where all incoming edges to <span class="math inline">\(x\)</span> are removed. From probability calculus, we know that <span class="math inline">\(\operatorname{P}(y|x,z,w)=\operatorname{P}(y|x,w)\)</span> if <span class="math inline">\(y \unicode{x2AEB} z | x, w\)</span>, and therefore the above rule follows whenever the graph is <span class="math inline">\(G_{\operatorname{do}(x)}\)</span>.</p></li>
<li><p><em>Action/observation exchange</em> <span class="math display">\[\begin{aligned}
\operatorname{P}(y|\operatorname{do}(x), \operatorname{do}(z), w) = \operatorname{P}(y|\operatorname{do}(x),z,w) && \text{if } (y \unicode{x2AEB} z | x, w)_{G_{\operatorname{do}(x),null(z)}}
\end{aligned}\]</span> Rule 2 states that we can replace a conditional on an intervention <span class="math inline">\(\operatorname{do}(z)\)</span> with a conditional on the observational value <span class="math inline">\(z\)</span> if <span class="math inline">\(y\)</span> is d-separated from <span class="math inline">\(z\)</span> by <span class="math inline">\(x,w\)</span> in the graph <span class="math inline">\(G_{\operatorname{do}(x),null(z)}\)</span> where we have removed all input edges to <span class="math inline">\(x\)</span> and all output edges from <span class="math inline">\(z\)</span>.</p>
<p>To understand this rule, let us focus on the simpler rule without the additional intervention on <span class="math inline">\(x\)</span> (i.e., substitute <span class="math inline">\(x=\phi\)</span>): <span class="math inline">\(\operatorname{P}(y| \operatorname{do}(z), w) = p(y|z,w)\)</span> if <span class="math inline">\((y\unicode{x2AEB} z| w)_{G_{null(z)}}\)</span>.</p>
<p>In this simpler case, if <span class="math inline">\(y\)</span> and <span class="math inline">\(z\)</span> are independent of each other given <span class="math inline">\(w\)</span> under a graph where there are no outgoing edges from <span class="math inline">\(z\)</span>, then the only connections between <span class="math inline">\(z\)</span> and <span class="math inline">\(y\)</span> that are not blocked by <span class="math inline">\(w\)</span> are through edges that start from <span class="math inline">\(z\)</span>. In such a case, there are no confounders outside of <span class="math inline">\(w\)</span>, and the effect of <span class="math inline">\(z\)</span> can simply be estimated using conditioning; therefore <span class="math inline">\(\operatorname{P}(y|\operatorname{do}(z), w)=\operatorname{P}(y|z, w)\)</span>.</p>
<p>The role of the additional intervention on <span class="math inline">\(x\)</span> follows the same intuition as in Rule 1: we test independence in a graph that additionally has all incoming edges to <span class="math inline">\(x\)</span> removed, so that <span class="math inline">\(y|\operatorname{do}(x)\)</span> is equivalent to <span class="math inline">\(y|x\)</span> and <span class="math inline">\(x\)</span> can be treated as just another conditioning variable, like <span class="math inline">\(w\)</span>, in <span class="math inline">\(\operatorname{P}(y|\operatorname{do}(x),z, w)\)</span>.</p>
<p>Consistent with our principle of stable and independent causal mechanisms (Chapter <a href="/causal-reasoning-book-chapter2/#sec:causalgraphs" data-reference-type="ref" data-reference="sec:stableandindependent">2</a>), note that Rule 2 implies that <span class="math inline">\(\operatorname{P}(y|\operatorname{do}(z)) = \operatorname{P}(y)\)</span> if <span class="math inline">\(y\)</span> is d-separated from <span class="math inline">\(z\)</span> in <span class="math inline">\(G\)</span>. Also, <span class="math inline">\(\operatorname{P}(y|\operatorname{do}(z),w) = \operatorname{P}(y|z,w)\)</span> if <span class="math inline">\(y\)</span> is d-separated from <span class="math inline">\(z\)</span> by <span class="math inline">\(w\)</span> in <span class="math inline">\(G_{null(z)}\)</span>.</p>
<p>Note that if <span class="math inline">\(z\)</span> does not cause <span class="math inline">\(y\)</span> at all, then there will be no outgoing edges from <span class="math inline">\(z\)</span> to <span class="math inline">\(y\)</span>. In such a case, Rule 1 and Rule 2 can be combined to yield, <span class="math display">\[\begin{aligned}
\operatorname{P}(y|\operatorname{do}(x), \operatorname{do}(z), w) = \operatorname{P}(y|\operatorname{do}(x),w) && \text{ if } (y\unicode{x2AEB} z| x,w)_{G_{\operatorname{do}(x)}}
\end{aligned}\]</span></p>
<p>However, this condition is too strict; we can obtain the same result using a milder condition, as shown by Rule 3.</p></li>
<li><p><em>Insertion/deletion of actions</em> <span class="math display">\[\begin{aligned}
\operatorname{P}(y|\operatorname{do}(x), \operatorname{do}(z), w) = \operatorname{P}(y|\operatorname{do}(x),w) && \text{if } (y \unicode{x2AEB} z | x, w)_{G_{\operatorname{do}(x),\operatorname{do}(z(w))}}
\end{aligned}\]</span> where <span class="math inline">\(z(w)\)</span> is all <span class="math inline">\(z\)</span> that are not ancestors of <span class="math inline">\(w\)</span> in <span class="math inline">\(G_{\operatorname{do}(x)}\)</span>. That is, Rule 3 states that we can remove an action <span class="math inline">\(\operatorname{do}(z)\)</span> from the conditioning set if the remaining conditioning variables <span class="math inline">\(w\)</span> and the remaining intervention variable <span class="math inline">\(x\)</span> d-separate <span class="math inline">\(y\)</span> from <span class="math inline">\(z\)</span> in the graph where we have removed incoming edges to <span class="math inline">\(x\)</span> and <span class="math inline">\(z(w)\)</span>.</p>
<p>As before, let us consider the case without <span class="math inline">\(x\)</span> to capture the intuition. We know that <span class="math inline">\(y|\operatorname{do}(z)\)</span> refers only to the effect of <span class="math inline">\(z\)</span> as it flows through directed edges starting from <span class="math inline">\(z\)</span>. So if all incoming edges to <span class="math inline">\(z\)</span> are removed and <span class="math inline">\(y\)</span> is independent of <span class="math inline">\(z\)</span>, then there are no directed paths from <span class="math inline">\(z\)</span> that end up at <span class="math inline">\(y\)</span>. Therefore <span class="math inline">\(z\)</span> can be safely removed even though <span class="math inline">\(z \not \unicode{x2AEB} y|w\)</span>, since the causal effect only involves the forward direction from <span class="math inline">\(z\)</span>. The restriction to the special subset <span class="math inline">\(z(w)\)</span> of <span class="math inline">\(z\)</span> avoids situations where conditioning on <span class="math inline">\(w\)</span> leads to collider bias, due to which <span class="math inline">\(z\)</span> may end up having an effect on <span class="math inline">\(y\)</span> (conditional on <span class="math inline">\(w\)</span>) even though there is no directed path from <span class="math inline">\(z\)</span> to <span class="math inline">\(y\)</span>.</p></li>
</ol>
<p>By iteratively applying the rules of do-calculus, we attempt to convert our causal target—initially expressed as a function of the interventional graph—into a function of the observational graph. The rules of do-calculus have been proved to be complete. That is, if repeated application of the rules of do-calculus cannot remove all references to the <span class="math inline">\(do()\)</span> operator and interventional graph, then the causal target is not identifiable without additional assumptions.</p>
<p>In the next section, we apply do-calculus to derive simple identification strategies for some commonly encountered graphical constraints.</p>
</section>
</section>
<section id="sec:ch03-graph-criteria" data-number="1.3">
<h2 data-number="3.3"><span class="header-section-number">3.3</span> Identification under Graphical Constraints</h2>
<p>Using do-calculus, we can derive simple methods for causal identification in many situations. In this section, we present two simple methods, the adjustment formula and the front-door criterion, and show how each is derivable using do-calculus.</p>
<section id="adjustment-formula" data-number="1.3.1">
<h3 data-number="3.3.1"><span class="header-section-number">3.3.1</span> Adjustment Formula</h3>
<p>As an example of how we can apply do-calculus to the problem of identification, consider a simple causal target <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span> in some graph <span class="math inline">\(G\)</span>. Here we will show how two simple manipulations of this causal target provide us with a useful identification approach called the adjustment formula:</p>
<p><span class="math display">\[\begin{split}
\operatorname{P}(B | \operatorname{do}(A)) & = \sum_Z \operatorname{P}(B, Z|\operatorname{do}(A)) \\
&= \sum_Z \operatorname{P}(B|\operatorname{do}(A), Z)\operatorname{P}(Z|\operatorname{do}(A)) \\
&= \sum_Z \operatorname{P}(B|\operatorname{do}(A), Z)\operatorname{P}(Z) \text{ if } (Z \unicode{x2AEB} A)_{G_{\operatorname{do}(A)}} \\
& = \sum_Z \operatorname{P}(B|A,Z)\operatorname{P}(Z) \text{ if } (B \unicode{x2AEB} A|Z)_{G_{\underline{A}}}
\end{split}\]</span></p>
<p>The first step follows from the law of total probability, and the second from the chain rule of probability. The third step applies Rule 3 of do-calculus and holds as long as <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> are d-separated in the graph without incoming edges to <span class="math inline">\(A\)</span>. The last step applies Rule 2 of do-calculus and holds as long as <span class="math inline">\(Z\)</span> d-separates <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> in the graph <span class="math inline">\(G_{\underline{A}}\)</span>, where all outgoing edges of <span class="math inline">\(A\)</span> have been removed. Any such set <span class="math inline">\(Z\)</span> is called a <em>valid adjustment set</em>.</p>
<p>Note that <span class="math inline">\(\operatorname{P}(A=a|Z=z)\)</span> must be strictly greater than <span class="math inline">\(0\)</span> for every treatment value <span class="math inline">\(a\)</span> and stratum <span class="math inline">\(z\)</span> for successful identification. If the observed data contains no points with <span class="math inline">\(A=a\)</span> in some stratum, then it is impossible to identify <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A=a))\)</span> since <span class="math inline">\(\operatorname{P}(B|A=a,Z=z)\)</span> will be undefined. This requirement is often known as the <em>overlap</em> assumption for causal identification, and we will discuss its implications further in the context of estimation in Chapter <a href="/causal-reasoning-book-chapter4" data-reference-type="ref" data-reference="ch_causalestimation">4</a>.</p>
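<p>The adjustment formula translates directly into code. The following sketch simulates a toy binary model (an assumed true effect of 0.3 on the outcome probability; all numbers are illustrative), stratifies on the confounder <span class="math inline">\(Z\)</span>, and applies <span class="math inline">\(\sum_Z \operatorname{P}(B|A,Z)\operatorname{P}(Z)\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Toy SCM with Z confounding A and B (all probabilities are assumptions)
z = rng.integers(0, 2, size=n)                                # confounder
a = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)  # Z -> A
b = (rng.random(n) < 0.2 + 0.3 * a + 0.4 * z).astype(int)     # A -> B, Z -> B

def p_b_do_a(a_val):
    """Adjustment formula: P(B=1|do(A=a)) = sum_z P(B=1|A=a, Z=z) P(Z=z)."""
    total = 0.0
    for z_val in (0, 1):
        stratum = (a == a_val) & (z == z_val)
        total += b[stratum].mean() * (z == z_val).mean()
    return total

adjusted = p_b_do_a(1) - p_b_do_a(0)         # close to 0.3, the true effect
naive = b[a == 1].mean() - b[a == 0].mean()  # inflated by confounding
print(adjusted, naive)
```

<p>The stratified estimate recovers the true effect, while the naive conditional difference is inflated by the open backdoor path through <span class="math inline">\(Z\)</span>.</p>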
</section>
<section id="valid-adjustment-sets" data-number="1.3.2">
<h3 data-number="3.3.2"><span class="header-section-number">3.3.2</span> Valid Adjustment Sets</h3>
<p>Intervening on <span class="math inline">\(A\)</span> will have an effect on <span class="math inline">\(B\)</span> that is calculable from observational data using the adjustment formula <em>if</em> we can find a valid adjustment set <span class="math inline">\(Z\)</span>.</p>
<p>Here, we present multiple kinds of adjustment sets that satisfy the requirements of the adjustment formula. We start with the simplest such set and then identify broader sets.</p>
<ul>
<li><p><strong>Parent Adjustment Set</strong></p>
<p>The simplest such adjustment set is <span class="math inline">\(Pa(a)\)</span>, the set of parent nodes of <span class="math inline">\(a\)</span>. We can easily determine that <span class="math inline">\(Pa(a)\)</span> d-separates <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> in <span class="math inline">\(G_{null(a)}\)</span>. As all outgoing edges of <span class="math inline">\(a\)</span> have been removed, by definition, in <span class="math inline">\(G_{null(a)}\)</span>, all of its paths to any node, including to <span class="math inline">\(b\)</span>, must pass through its parents. Necessarily then, all paths are blocked if we condition on <span class="math inline">\(Pa(a)\)</span>, and thus <span class="math inline">\(a \unicode{x2AEB} b | Pa(a)\)</span>, and <span class="math inline">\(Pa(a)\)</span> meets the criteria for being a valid adjustment set.</p>
<p>Similarly, <span class="math inline">\(Pa(b) \setminus \{a\}\)</span>, the set of parent nodes of <span class="math inline">\(b\)</span> other than <span class="math inline">\(a\)</span> itself, is also a valid adjustment set, provided that no parent of <span class="math inline">\(b\)</span> is a descendant of <span class="math inline">\(a\)</span> (i.e., no parent of <span class="math inline">\(b\)</span> mediates the effect of <span class="math inline">\(a\)</span>).</p></li>
<li><p><strong>Backdoor Criterion</strong></p>
<p>The <em>backdoor criterion</em> allows us to identify a broader class of valid adjustment sets. This criterion states that a set of nodes <span class="math inline">\(Z\)</span> is a valid adjustment set if:</p>
<ol type="1">
<li><p><span class="math inline">\(Z\)</span> blocks all paths between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> in which the edge connected to <span class="math inline">\(A\)</span> points into <span class="math inline">\(A\)</span> (the <em>backdoor paths</em>); and</p></li>
<li><p>no descendants of <span class="math inline">\(A\)</span> are in <span class="math inline">\(Z\)</span>.</p></li>
</ol>
<p>Note that the parent adjustment set trivially meets the backdoor criterion.</p></li>
<li><p><strong>“Towards necessity” criterion</strong></p>
<p>While the backdoor criterion significantly broadens our ability to identify valid adjustment sets beyond the set of parents, it is not complete. That is, there are valid adjustment sets that do not meet the backdoor criterion. Shpitser et al. (2010) generalized the backdoor criterion to describe all valid adjustment sets.</p>
<p>The “towards necessity” criterion states that a set of nodes <span class="math inline">\(Z\)</span> is a valid adjustment set if:</p>
<ol type="1">
<li><p><span class="math inline">\(Z\)</span> blocks all non-causal paths between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>;</p></li>
<li><p>letting <span class="math inline">\(D\)</span> denote the set of all directed paths from <span class="math inline">\(A\)</span> to <span class="math inline">\(B\)</span>, and <span class="math inline">\(D_{\text{nodes}}\)</span> the set of all nodes on these paths other than <span class="math inline">\(A\)</span> itself, <span class="math inline">\(Z\)</span> may not include any descendants of <span class="math inline">\(D_{\text{nodes}}\)</span>.</p></li>
</ol></li>
</ul>
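<p>Checking these criteria by hand quickly becomes tedious on larger graphs. The sketch below implements d-separation via the standard moralization method and uses it to test the backdoor criterion on an edge-list graph (the representation and helper names are ours, for illustration; libraries such as networkx and DoWhy ship equivalent, battle-tested utilities):</p>

```python
from itertools import combinations

def parents(edges, v):
    """Parents of node v in a DAG given as a list of (parent, child) pairs."""
    return {s for s, t in edges if t == v}

def ancestors(edges, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents(edges, v):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(edges, x, y, z):
    """Test whether x and y are d-separated given the set z, using the
    moralization method: keep the ancestral subgraph of {x, y} | z, marry
    co-parents, drop edge directions, delete z, and check disconnection."""
    keep = ancestors(edges, {x, y} | z)
    sub = [(s, t) for s, t in edges if s in keep and t in keep]
    und = {frozenset(e) for e in sub}
    for v in keep:  # moralize: connect every pair of co-parents
        und |= {frozenset(p) for p in combinations(sorted(parents(sub, v)), 2)}
    adj = {}
    for e in und:
        s, t = tuple(e)
        if s in z or t in z:  # conditioning removes these nodes
            continue
        adj.setdefault(s, set()).add(t)
        adj.setdefault(t, set()).add(s)
    seen, stack = {x}, [x]
    while stack:
        for w in adj.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return y not in seen

def is_backdoor_set(edges, a, b, z):
    """Backdoor criterion: z contains no descendant of a, and z d-separates
    a and b in G_null(a), the graph with a's outgoing edges removed."""
    nodes = {v for e in edges for v in e}
    descendants = {v for v in nodes if v != a and a in ancestors(edges, {v})}
    if z & descendants:
        return False
    return d_separated([(s, t) for s, t in edges if s != a], a, b, z)

# Classic confounding triangle: Z -> A, Z -> B, A -> B
g = [("Z", "A"), ("Z", "B"), ("A", "B")]
print(is_backdoor_set(g, "A", "B", {"Z"}))  # True: {Z} is a valid backdoor set
print(is_backdoor_set(g, "A", "B", set())) # False: the backdoor path is open
```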
</section>
<section id="invalid-adjustment-sets" data-number="1.3.3">
<h3 data-number="3.3.3"><span class="header-section-number">3.3.3</span> Invalid adjustment sets</h3>
<p>In this section, we explain the intuition behind valid adjustment sets by describing the consequences of using an <em>invalid</em> adjustment set. We demonstrate this intuition by showing, in multiple simple examples, the consequences of conditioning on features that fail to meet the criteria for valid adjustment sets.</p>
<div id="fig:invalidadjustmentsets" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tbody>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_Collider.png" id="fig:collider" style="width:100.0%" alt="a" /><figcaption aria-hidden="true">a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_Mediator.png" id="fig:mediator" style="width:100.0%" alt="b" /><figcaption aria-hidden="true">b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_Posttreatment.png" id="fig:posttreatment" style="width:100.0%" alt="c" /><figcaption aria-hidden="true">c</figcaption>
</figure></td>
</tr>
</tbody>
</table>
<p>Figure 4: Examples of invalid adjustment sets. a — Conditioning on a collider <span class="math inline">\(Z\)</span> will introduce a backdoor path or correlation between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> and bias the estimate of <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>, b — Conditioning on a mediator <span class="math inline">\(Z\)</span> will block the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span>, c — Conditioning on this post-treatment variable <span class="math inline">\(Z\)</span> will bias the estimate of <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span></p>
</div>
<ul>
<li><p><strong>Conditioning on a collider:</strong> Consider Fig. 4 (a). Unconditionally, <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> are statistically independent of each other—they are d-separated in the shown graph. However, if we add the collider <span class="math inline">\(Z\)</span> to the adjustment set, then as described in Chapter <a href="/causal-reasoning-book-chapter2" data-reference-type="ref" data-reference="ch_causalmodel">2</a>, we introduce a dependence, such that <span class="math inline">\(A \not\unicode{x2AEB} B |Z\)</span>. Adjusting for a collider <span class="math inline">\(Z\)</span> will thus introduce a false correlation between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> and bias our estimate of <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>.</p></li>
<li><p><strong>Conditioning on a mediator:</strong> Fig. 4 (b) shows another causal graph where <span class="math inline">\(Z\)</span> is an invalid adjustment. In this case, <span class="math inline">\(Z\)</span> mediates the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span>. Conditioning our analysis on the mediator <span class="math inline">\(Z\)</span> blocks the path between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, breaking the dependence of <span class="math inline">\(B\)</span> on <span class="math inline">\(A\)</span>. In other words, <span class="math inline">\(A \unicode{x2AEB} B|Z\)</span>. Thus, adjusting for <span class="math inline">\(Z\)</span> will block the very effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> that we are attempting to estimate, invalidating our causal estimate.</p></li>
<li><p><strong>Conditioning on a post-treatment variable:</strong> Fig. 4 (c) shows a post-treatment variable <span class="math inline">\(Z\)</span> that is confounded with the outcome by <span class="math inline">\(X\)</span>. If we add this <span class="math inline">\(Z\)</span> to the adjustment set, it will open a confounding path between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> and bias our estimate of <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>. Note that whether conditioning on a post-treatment variable is invalid, or simply neutral and not useful, depends on the rest of the structure of the graph. For example, in the figure shown, the feature <span class="math inline">\(X\)</span> is necessary for the confounding path to open.</p></li>
</ul>
<p>Not all variables fall neatly into valid or invalid adjustments. There are also adjustments that are neutral. These variables neither help nor harm our goal of causal identification. However, as we will discuss in the next chapter, they can have implications (good or bad) for statistical estimation.</p>
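<p>To see the collider pitfall numerically, consider the following short simulation (illustrative only; the variable names and coefficients are our own, not from the book's notebooks). It generates independent <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> with a common effect <span class="math inline">\(Z\)</span>, and shows that selecting on the collider <span class="math inline">\(Z\)</span> induces a spurious negative correlation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
A = rng.normal(size=n)
B = rng.normal(size=n)                  # A and B are independent by construction
Z = A + B + 0.5 * rng.normal(size=n)    # Z is a collider: A -> Z <- B

corr_marginal = np.corrcoef(A, B)[0, 1]               # near zero, as expected
selected = Z > 1.0                                    # "conditioning" on the collider
corr_conditional = np.corrcoef(A[selected], B[selected])[0, 1]  # clearly negative
print(f"marginal: {corr_marginal:.3f}, given Z > 1: {corr_conditional:.3f}")
```

The induced correlation is purely an artifact of selecting on <span class="math inline">\(Z\)</span>; there is no causal link between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> in the simulated model.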
</section>
<section id="frontdoor" data-number="1.3.4">
<h3 data-number="3.3.4"><span class="header-section-number">3.3.4</span> The Front-door Criterion</h3>
<p>The adjustment formula is one of the identification methods that can be derived solely from graphical assumptions and do-calculus. Sometimes, one or more of the variables required as part of a valid adjustment set are unobserved, in which case we cannot use that particular adjustment set in the adjustment formula. If there is no valid adjustment set whose variables are all observed, then we cannot use the adjustment formula for identification.</p>
<figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_Frontdoor.png" id="fig:frontdoor" alt="Figure 5: The unobserved confounder U prevents us from applying the adjustment formula to identify \operatorname{P}(C|\operatorname{do}(A)). The front-door criterion allows us to identify \operatorname{P}(C|\operatorname{do}(A)) via the mediating variable B." /><figcaption aria-hidden="true">Figure 5: The unobserved confounder <span class="math inline">\(U\)</span> prevents us from applying the adjustment formula to identify <span class="math inline">\(\operatorname{P}(C|\operatorname{do}(A))\)</span>. The front-door criterion allows us to identify <span class="math inline">\(\operatorname{P}(C|\operatorname{do}(A))\)</span> via the mediating variable <span class="math inline">\(B\)</span>.</figcaption>
</figure>
<p>Applying the rules of do-calculus, however, can lead to other strategies. The front-door criterion is one such method for identifying a causal effect when confounding variables necessary for applying the adjustment formula are unobserved. Consider Fig. 5, which shows a graph where an unobserved confounder <span class="math inline">\(U\)</span> makes it impossible to apply the adjustment formula. In this simple example, the only valid adjustment set is <span class="math inline">\(\{U\}\)</span> and thus, without observing <span class="math inline">\(U\)</span>, we cannot calculate the adjustment formula to identify <span class="math inline">\(\operatorname{P}(C|\operatorname{do}(A))\)</span> directly.</p>
<p>Consider, however, the node <span class="math inline">\(B\)</span> in the graph. <span class="math inline">\(A\)</span> has no direct effect on <span class="math inline">\(C\)</span>; the only effect <span class="math inline">\(A\)</span> has on <span class="math inline">\(C\)</span> is its indirect effect through <span class="math inline">\(B\)</span>, and this path is not confounded by the unobserved <span class="math inline">\(U\)</span>. This particular structure allows us to apply the front-door criterion, a method for identifying the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(C\)</span> in (some) cases where we cannot apply the adjustment formula.</p>
<p>The key insight of the front-door criterion is that we can factor the causal effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(C\)</span> and, if the factors of causal effect are themselves identifiable, we can use them to identify our target. For the graph in Fig. 5, the factored causal effect is:</p>
<p><span class="math display">\[\operatorname{P}(C|\operatorname{do}(A)) = \sum_{B}{\operatorname{P}(C|\operatorname{do}(B))\operatorname{P}(B|\operatorname{do}(A))}\]</span></p>
<p>Now, we can ask whether any method would allow us to identify these factors <span class="math inline">\(\operatorname{P}(C|\operatorname{do}(B))\)</span> and <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span> using our observed data. In this case, both factors are easily identified using the adjustment formula and Rule 2 of do-calculus:</p>
<p><span class="math display">\[\begin{aligned}
\operatorname{P}(C|\operatorname{do}(B)) &= \sum_{A}{\operatorname{P}(C|B,A)\operatorname{P}(A)} && \text{ by the adjustment formula}\end{aligned}\]</span></p>
<p>and</p>
<p><span class="math display">\[\begin{aligned}
\operatorname{P}(B|\operatorname{do}(A)) &= \operatorname{P}(B|A) && \text{ by Rule 2 since } (B \unicode{x2AEB} A)_{G_{null(A)}}\end{aligned}\]</span></p>
<p>Combining these together, we find that:</p>
<p><span class="math display">\[\operatorname{P}(C|\operatorname{do}(A)) = \sum_{B}{\sum_{A}{\operatorname{P}(C|B,A)\operatorname{P}(A)}\operatorname{P}(B|A)}\]</span></p>
<p>The generalized version of this front-door criterion enables us to apply factorization along the causal path(s) of interest, and apply any valid method for identifying the component causal factors. Of course, in a more complex graph, if any of the causal factors are unidentifiable because of other unobserved, confounding variables, then we will not be able to apply the front-door criterion.</p>
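<p>As a sanity check, the front-door formula can be verified on simulated data. The sketch below is our own toy model of the graph in Fig. 5 (all coefficients are arbitrary, chosen for illustration); it compares the front-door estimate of the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(C\)</span> with the naive, confounded estimate:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
U = rng.binomial(1, 0.5, n)                    # unobserved confounder of A and C
A = rng.binomial(1, 0.2 + 0.6 * U)
B = rng.binomial(1, 0.1 + 0.7 * A)             # A affects C only through B
C = rng.binomial(1, 0.1 + 0.5 * B + 0.3 * U)   # true effect of do(A): 0.7 * 0.5 = 0.35

def cond_mean(y, mask):
    return y[mask].mean()

def frontdoor(a):
    # P(C=1|do(A=a)) = sum_b P(B=b|A=a) * sum_a' P(C=1|B=b, A=a') P(A=a')
    total = 0.0
    for b in (0, 1):
        p_b = cond_mean(B == b, A == a)
        inner = sum(cond_mean(C, (B == b) & (A == a2)) * (A == a2).mean()
                    for a2 in (0, 1))
        total += p_b * inner
    return total

ate_frontdoor = frontdoor(1) - frontdoor(0)                # close to the true 0.35
ate_naive = cond_mean(C, A == 1) - cond_mean(C, A == 0)    # biased upward by U
print(f"front-door: {ate_frontdoor:.2f}, naive: {ate_naive:.2f}")
```

Even though <span class="math inline">\(U\)</span> is never used in the estimation step, the front-door estimate recovers the true effect, while the naive conditional difference does not.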
</section>
</section>
<section id="sec:ch03-iv" data-number="1.4">
<h2 data-number="3.4"><span class="header-section-number">3.4</span> Identification with Additional Assumptions</h2>
<figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_IV.png" id="fig:ch03-basic-iv-graph" alt="Figure 6: Instrumental Variables example: Z can help identify P(B|A), despite the unobserved confounding due to U" /><figcaption aria-hidden="true">Figure 6: Instrumental Variables example: <span class="math inline">\(Z\)</span> can help identify <span class="math inline">\(P(B|A)\)</span>, despite the unobserved confounding due to <span class="math inline">\(U\)</span></figcaption>
</figure>
<p>Our goal in causal identification is to find a way to express the causal relationship between two features in terms of observable statistical relationships. In many situations, we can use graphical assumptions and do-calculus to disentangle our observations of statistical relationships to identify causal relationships. In situations when graphical assumptions are insufficient, parametric assumptions can sometimes help. In this section, we show how a simple parametric assumption—specifically, the non-interaction assumption previously introduced in Section <a href="#sec:ch03-heterogeneouseffects" data-reference-type="ref" data-reference="sec:ch03-heterogeneouseffects">3.1.2</a>—can help.</p>
<section id="parametric-assumptions-and-instrumental-variables" data-number="1.4.1">
<h3 data-number="3.4.1"><span class="header-section-number">3.4.1</span> Parametric Assumptions and Instrumental Variables</h3>
<p>Consider the causal graph shown in Fig. 6. If our goal is to identify <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>, we can easily see that the adjustment formula is not applicable as the confounding feature <span class="math inline">\(U\)</span> is unobserved. Since there is no mediating variable between <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, we also cannot apply the front-door criterion. In fact, the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> is not identifiable based on graphical assumptions alone.</p>
<p>This kind of causal graph is quite common. For example, we are often in situations where we have an ability to run partially randomized experiments, where we can randomize <span class="math inline">\(Z\)</span>, but do not have direct control over the factor <span class="math inline">\(A\)</span> that is our primary focus. This can occur in experiments with people, for example, where we might influence individuals’ decisions through recommendations, encouragements or rewards, but otherwise not have full control. This can also occur in many natural settings, where some observable independent factor, such as the weather, plays a partial role in determining <span class="math inline">\(A\)</span>.</p>
<p>There is an intriguing opportunity, however, in the influence of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span>. Because there are no backdoor paths between <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> (in Fig. 6, <span class="math inline">\(Z\)</span> has no causes), we can trivially identify that <span class="math inline">\(\Pr(A|\operatorname{do}(Z)) = \Pr(A|Z)\)</span>. Similarly, we can show that <span class="math inline">\(\Pr(B|\operatorname{do}(Z)) = \Pr(B|Z)\)</span>.</p>
<p>The instrumental variable method is an identification method that exploits the auxiliary variable <span class="math inline">\(Z\)</span> to isolate the causal effect. A variable that follows the graph structure of Fig. 6 is called an <em>instrumental</em> variable. Instrumental variable settings satisfy several criteria <a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a>:</p>
<ol type="1">
<li><p><em><span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span> are independent, if not for <span class="math inline">\(A\)</span></em>. More formally, <span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span> are d-separated in the graph <span class="math inline">\(G_{null(A)}\)</span>. This implies that <span class="math inline">\(Z\)</span> affects <span class="math inline">\(B\)</span> only via paths that pass through <span class="math inline">\(A\)</span>, and that <span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span> are not correlated due to common causes.</p></li>
<li><p><em><span class="math inline">\(Z\)</span> affects <span class="math inline">\(A\)</span></em>. <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> are not d-separated and <span class="math inline">\(\operatorname{P}(A|\operatorname{do}(Z))\)</span> is identifiable.</p></li>
<li><p><em>The effects of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> and of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> are homogeneous with respect to the unobserved variables <span class="math inline">\(U\)</span></em>.</p></li>
</ol>
<p>The first two conditions can be read from the causal graph, while the third is an additional parametric constraint. The first condition ensures that whatever effect <span class="math inline">\(Z\)</span> has on <span class="math inline">\(B\)</span>, it can only flow through <span class="math inline">\(A\)</span>. There can be no direct effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(B\)</span> that does not go through <span class="math inline">\(A\)</span>. In addition, the d-separation of <span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span> in <span class="math inline">\(G_{null(A)}\)</span> implies that <span class="math inline">\(Z\)</span> is independent of the unobserved confounds <span class="math inline">\(U\)</span> of <span class="math inline">\(A\rightarrow B\)</span>.</p>
<p>The second condition states that <span class="math inline">\(Z\)</span> has a non-zero effect on <span class="math inline">\(A\)</span>, and that this effect is identifiable. Intuitively, the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(B\)</span> can be thought of as the combination of the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> and the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span>; thus, if <span class="math inline">\(Z\)</span> had no effect on <span class="math inline">\(A\)</span>, it would give us no useful information about <span class="math inline">\(A\)</span>. In the specific graph shown, <span class="math inline">\(\operatorname{P}(A|\operatorname{do}(Z))=\operatorname{P}(A|Z)\)</span> holds trivially since there are no common causes of <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> (in our example, <span class="math inline">\(Z\)</span> is randomized).</p>
<p>The final condition requires that the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> is homogeneous (i.e., that <span class="math inline">\(U\)</span> does not modify the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span>), and that the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> is also homogeneous (<span class="math inline">\(U\)</span> does not modify the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span>). This ensures that our observations of the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> and of <span class="math inline">\(Z\)</span>’s indirect effect on <span class="math inline">\(B\)</span> are not entangled with any interactions with the unobserved factors <span class="math inline">\(U\)</span>.</p>
<p>Next, we show how we can use these two identified components, <span class="math inline">\(\operatorname{P}(B|Z)\)</span> and <span class="math inline">\(\operatorname{P}(A|Z)\)</span>, together with the assumptions above to identify the effect of the intervention <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span>.</p>
</section>
<section id="derivation-for-continuous-variables" data-number="1.4.2">
<h3 data-number="3.4.2"><span class="header-section-number">3.4.2</span> Derivation for continuous variables</h3>
<p>Here is a simple derivation, in the setting of Fig. 6 and continuous variables <span class="math inline">\(Z\)</span>, <span class="math inline">\(A\)</span>, and <span class="math inline">\(B\)</span>, showing how to calculate <span class="math inline">\(\frac{dB}{d{\operatorname{do}(A)}}\)</span> based on <span class="math inline">\(\frac{dB}{dZ}\)</span> and <span class="math inline">\(\frac{dA}{dZ}\)</span>.</p>
<p>First, let us note that, given the causal graph in Fig. 6, we can trivially identify that <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(Z))=\operatorname{P}(B|Z)\)</span> and that <span class="math inline">\(\operatorname{P}(A|\operatorname{do}(Z)) = \operatorname{P}(A|Z)\)</span>. From the definition of the effect of an intervention on continuous variables (Eq. 2), this identification also provides us with the estimands <span class="math inline">\(\frac{dA}{d{\operatorname{do}(Z)}} = \frac{dA}{dZ}\)</span> and <span class="math inline">\(\frac{dB}{d{\operatorname{do}(Z)}} = \frac{dB}{dZ}\)</span>.</p>
<p>Second, from the chain rule for derivatives with multiple independent variables, we state:</p>
<p><span class="math display">\[\begin{aligned}
\frac{\partial B}{\partial \operatorname{do}(Z)} &= \frac{\partial B}{\partial \operatorname{do}(A)}\frac{\partial A}{\partial \operatorname{do}(Z)} + \frac{\partial B}{\partial \operatorname{do}(U)}\frac{\partial U}{\partial \operatorname{do}(Z)} && \\
&= \frac{\partial B}{\partial \operatorname{do}(A)}\frac{\partial A}{\partial \operatorname{do}(Z)} && U \unicode{x2AEB} Z \text{ implies } \frac{\partial U}{\partial \operatorname{do}(Z)} = 0 \\
\frac{\partial B}{\partial \operatorname{do}(A)} &= \frac{\frac{\partial B}{\partial \operatorname{do}(Z)}}{\frac{\partial A}{\partial \operatorname{do}(Z)}} && \text{ Rearranging terms } \\
\frac{\partial B}{\partial \operatorname{do}(A)} &= \frac{\frac{dB}{d\operatorname{do}(Z)}}{\frac{dA}{d\operatorname{do}(Z)}} && \text{ By non-interaction of $U$} \\
\frac{\partial B}{\partial \operatorname{do}(A)} &= \frac{\frac{dB}{dZ}}{\frac{dA}{dZ}} && \text{ By earlier identification}\end{aligned}\]</span></p>
<p>In this derivation, we take advantage of our causal assumption that <span class="math inline">\(U \unicode{x2AEB} Z\)</span> as we go from line 1 to line 2. Because <span class="math inline">\(U\)</span> is d-separated from <span class="math inline">\(Z\)</span> in our instrumental variables setting, we know that <span class="math inline">\(\frac{\partial U}{\partial \operatorname{do}(Z)}\)</span> must be <span class="math inline">\(0\)</span>. We also apply our assumption of homogeneity of the effects <span class="math inline">\(Z \rightarrow A\)</span> and <span class="math inline">\(A \rightarrow B\)</span> to convert our partial derivatives <span class="math inline">\(\frac{\partial B}{\partial \operatorname{do}(Z)}\)</span> and <span class="math inline">\(\frac{\partial A}{\partial \operatorname{do}(Z)}\)</span> to total derivatives as we go from line 3 to line 4. This is crucial because otherwise, we would have to observe <span class="math inline">\(U\)</span> to evaluate the partial derivatives at <span class="math inline">\((U,Z)\)</span>. Knowing that they are independent of <span class="math inline">\(U\)</span>, we can convert them to total derivatives and evaluate them at <span class="math inline">\(Z\)</span> only.</p>
<p>Thus, we see that under the assumptions of the instrumental variables setting, we can identify the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> using our observations of the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(B\)</span> and <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span>. This is a powerful result, and enables us to identify the effect of features on outcomes in a variety of scenarios, even when we do not have full control over them.</p>
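<p>Under linear structural equations, the derivative ratio above reduces to a ratio of two regression slopes. Here is a minimal sketch (assuming linearity and homogeneous effects; the coefficients are made up for illustration) that recovers the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> from the slopes of <span class="math inline">\(B\)</span> on <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> on <span class="math inline">\(Z\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(size=n)                        # instrument, independent of U
U = rng.normal(size=n)                        # unobserved confounder
A = 0.8 * Z + U + rng.normal(size=n)
B = 1.5 * A + 2.0 * U + rng.normal(size=n)    # true effect dB/d(do(A)) = 1.5

def slope(y, x):
    # OLS slope of y on x: Cov(x, y) / Var(x)
    x_c, y_c = x - x.mean(), y - y.mean()
    return (x_c * y_c).mean() / (x_c ** 2).mean()

iv_estimate = slope(B, Z) / slope(A, Z)       # (dB/dZ) / (dA/dZ), close to 1.5
naive = slope(B, A)                           # biased upward by U
print(f"IV: {iv_estimate:.2f}, naive OLS: {naive:.2f}")
```

The numerator and denominator are both identifiable because <span class="math inline">\(Z\)</span> is independent of <span class="math inline">\(U\)</span>; their ratio isolates the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> even though <span class="math inline">\(U\)</span> is never observed.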
</section>
<section id="derivation-for-binary-variables" data-number="1.4.3">
<h3 data-number="3.4.3"><span class="header-section-number">3.4.3</span> Derivation for binary variables</h3>
<p>Here, we repeat our simple derivation in the setting of Fig. 6 and binary, rather than continuous, variables <span class="math inline">\(Z\)</span>, <span class="math inline">\(A\)</span>, and <span class="math inline">\(B\)</span>. As a reminder, we wish to identify <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>, the effect that intervening on <span class="math inline">\(A\)</span> will have on <span class="math inline">\(B\)</span>, where <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> are confounded by an unobserved variable <span class="math inline">\(U\)</span>, rendering earlier methods, such as the adjustment formula, ineffective.</p>
<p>To derive the identification, let us first write the expression for <span class="math inline">\(\operatorname{P}(B|Z,U)\)</span>. <span class="math display">\[\begin{split}
\operatorname{P}(B|Z,U) &= \sum_A \operatorname{P}(B|Z, U, A) \operatorname{P}(A|Z, U) \\
&= \sum_A \operatorname{P}(B|A, U) \operatorname{P}(A|Z, U)
\end{split}\]</span> where the last equality is due to the Markov property of the causal graph. For a binary instrument <span class="math inline">\(Z\)</span>, intervention <span class="math inline">\(A\)</span>, and outcome <span class="math inline">\(B\)</span>, we obtain the following equations, <span class="math display">\[\begin{split}
\operatorname{P}(B=1|Z=1,U) &= \operatorname{P}(B=1|A=1, U) \operatorname{P}(A=1|Z=1, U) + \operatorname{P}(B=1|A=0, U) (1 - \operatorname{P}(A=1|Z=1, U)) \\
\operatorname{P}(B=1|Z=0,U) &= \operatorname{P}(B=1|A=1, U) \operatorname{P}(A=1|Z=0, U) + \operatorname{P}(B=1|A=0, U) (1 - \operatorname{P}(A=1|Z=0, U))
\end{split}\]</span> Subtracting the two equations and rearranging the terms, <span class="math display">\[\begin{split}
\operatorname{P}(B=1|A=1, U) - \operatorname{P}(B=1|A=0, U) &= \frac{\operatorname{P}(B=1|Z=1, U)- \operatorname{P}(B=1|Z=0, U)}{\operatorname{P}(A=1|Z=1, U) - \operatorname{P}(A=1|Z=0, U)}
\end{split}\]</span> The average effect of the intervention is given by, <span class="math display">\[\begin{split}
\mathbb{E}[\operatorname{P}(B=1|A=1, U)] - \mathbb{E}[\operatorname{P}(B=1|A=0, U)] &= \mathbb{E}[\operatorname{P}(B|\operatorname{do}(A=1))] - \mathbb{E}[\operatorname{P}(B|\operatorname{do}(A=0))] \\
&= \mathbb{E}\left[\frac{\operatorname{P}(B=1|Z=1, U)- \operatorname{P}(B=1|Z=0, U)}{\operatorname{P}(A=1|Z=1, U) - \operatorname{P}(A=1|Z=0, U)}\right]
\end{split}\]</span> where the first equality is due to the backdoor adjustment for <span class="math inline">\(U\)</span>.</p>
<p>In general, we cannot isolate the average effect of the intervention using the above equation since the denominator also depends on the unobserved <span class="math inline">\(U\)</span>. In the instrumental variables scenario, however, the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> does not vary with <span class="math inline">\(U\)</span> (that is, the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> is homogeneous with respect to <span class="math inline">\(U\)</span>). Recall from Section <a href="#sec:ch03-heterogeneouseffects" data-reference-type="ref" data-reference="sec:ch03-heterogeneouseffects">3.1.2</a> that in this case, <span class="math inline">\(\operatorname{P}(A|Z,U)=\operatorname{P}(A|Z)\)</span> and thus we can simplify the above equation. In such cases, we can write the average causal effect as, <span class="math display">\[\begin{split}
\mathbb{E}[\operatorname{P}(B|\operatorname{do}(A=1))] - \mathbb{E}[\operatorname{P}(B|\operatorname{do}(A=0))] &= \mathbb{E}\left[\frac{\operatorname{P}(B=1|Z=1, U)- \operatorname{P}(B=1|Z=0, U)}{\operatorname{P}(A=1|Z=1) - \operatorname{P}(A=1|Z=0)}\right] \\
&= \frac{\operatorname{P}(B=1|Z=1)- \operatorname{P}(B=1|Z=0)}{\operatorname{P}(A=1|Z=1) - \operatorname{P}(A=1|Z=0)}
\end{split}\]</span></p>
<p>Notice that we also remove the <span class="math inline">\(U\)</span> from the numerator. This is valid for purposes of calculating an average causal effect over a specific population, with a fixed, though unobserved, distribution of <span class="math inline">\(U\)</span>. Of course, this average causal effect will not be valid for different populations.</p>
<p>The resulting equation, known as the Wald formula, provides an identification formula for the causal effect. Consider the similarity of this formula to our finding in the continuous variables scenario earlier. The identified estimand intuitively captures how adding an instrument helps identify the causal effect in the presence of unobserved confounding. While we cannot estimate the effect of treatment on outcome directly, we can estimate the effect of the instrument on both treatment and outcome. The ratio of these two effects then quantifies to what degree <span class="math inline">\(Z\)</span>’s effect on <span class="math inline">\(A\)</span> also causes a change in <span class="math inline">\(B\)</span>. In causal graphs where <span class="math inline">\(Z\)</span> affects <span class="math inline">\(B\)</span> only through <span class="math inline">\(A\)</span>, this ratio therefore identifies the effect of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> as well.</p>
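<p>The Wald formula is straightforward to compute from sample frequencies. Below is an illustrative simulation (our own; the coefficients are arbitrary and chosen so that the homogeneity assumption holds on the risk-difference scale) comparing the Wald estimate with the naive, confounded difference:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
Z = rng.binomial(1, 0.5, n)                     # randomized instrument
U = rng.binomial(1, 0.5, n)                     # unobserved confounder
A = rng.binomial(1, 0.1 + 0.5 * Z + 0.3 * U)    # Z's effect on A does not depend on U
B = rng.binomial(1, 0.1 + 0.4 * A + 0.4 * U)    # true effect of do(A) is 0.4

def p(y, mask):
    return y[mask].mean()

# Wald formula: ratio of the instrument's effect on outcome and on treatment
wald = (p(B, Z == 1) - p(B, Z == 0)) / (p(A, Z == 1) - p(A, Z == 0))
naive = p(B, A == 1) - p(B, A == 0)             # biased upward by U
print(f"Wald: {wald:.2f}, naive: {naive:.2f}")
```

Only the four conditional frequencies involving the observed <span class="math inline">\(Z\)</span>, <span class="math inline">\(A\)</span>, and <span class="math inline">\(B\)</span> are needed; the unobserved <span class="math inline">\(U\)</span> never enters the computation.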
</section>
<section id="sec:ch03-ivgeneralization" data-number="1.4.4">
<h3 data-number="3.4.4"><span class="header-section-number">3.4.4</span> Generalizing the instrumental variables method</h3>
<div id="fig:ch03-advanced-iv-graphs" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tbody>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_IV-advanced1a.png" id="fig:ch03-adv-iv1" style="width:100.0%" alt="a" /><figcaption aria-hidden="true">a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_IV-advanced2a.png" id="fig:ch03-adv-iv2" style="width:100.0%" alt="b" /><figcaption aria-hidden="true">b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_IV-advanced3a.png" id="fig:ch03-adv-iv3" style="width:100.0%" alt="c" /><figcaption aria-hidden="true">c</figcaption>
</figure></td>
</tr>
<tr class="even">
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_IV-advanced3.png" id="fig:ch03-adv-iv4" style="width:100.0%" alt="d" /><figcaption aria-hidden="true">d</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="../assets/Figures/Chapter3/Ch3_Fig_IV-advanced4.png" id="fig:ch03-adv-iv5" style="width:100.0%" alt="e" /><figcaption aria-hidden="true">e</figcaption>
</figure></td>
<td style="text-align: center;"></td>
</tr>
</tbody>
</table>
<p>Figure 7: More examples of IV graphs. In (a), (b), and (c), <span class="math inline">\(Z\)</span> is a valid (generalized) instrumental variable, whereas (d) and (e) show invalid instrumental variables. (b) shows a common IV setting where A and B have observed confounders <span class="math inline">\(\mathbf{W}\)</span> in addition to unobserved ones.</p>
</div>
<p>While the canonical instrumental variables scenario, depicted in Fig. 6, presents a simple graph with only a few variables, we can extend these ideas to much more complex scenarios.</p>
<p>The simplest extensions include scenarios such as Fig. 7 (a). Here, we see that additional observed variables <span class="math inline">\(W_1\)</span> and <span class="math inline">\(W_2\)</span> are confounding the effects <span class="math inline">\(Z\rightarrow A\)</span> and <span class="math inline">\(A\rightarrow B\)</span>. Neither of these additional variables, however, breaks our initial requirements: <span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span> remain d-separated in <span class="math inline">\(G_{\operatorname{null}(A)}\)</span>; <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> are not d-separated in <span class="math inline">\(G\)</span> and <span class="math inline">\(\operatorname{P}(A|\operatorname{do}(Z))\)</span> is identifiable; and our assumptions regarding homogeneous effects still hold.</p>
<p>More interesting are scenarios such as Fig. 7 (b). Here we see that the (observed) variable <span class="math inline">\(W\)</span> does break our initial assumption of the independence of <span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span>. However, conditioning all of our analyses on <span class="math inline">\(W\)</span> re-establishes the necessary d-separation requirements. Such an instrumental variable (IV) is called a conditional IV, and the modified requirements are:</p>
<ol type="1">
<li><p><span class="math inline">\(Z\)</span> and <span class="math inline">\(B\)</span> are d-separated in <span class="math inline">\(G_{\operatorname{null}(A)}\)</span>, conditional on <span class="math inline">\(W\)</span>.</p></li>
<li><p><span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> are not d-separated conditional on <span class="math inline">\(W\)</span> and <span class="math inline">\(\operatorname{P}(A|do(Z),W)\)</span> is identifiable</p></li>
<li><p>The effects of <span class="math inline">\(Z\)</span> on <span class="math inline">\(A\)</span> and of <span class="math inline">\(A\)</span> on <span class="math inline">\(B\)</span> are homogeneous with respect to unobserved variables <span class="math inline">\(U\)</span>.</p></li>
</ol>
<p>Here, <span class="math inline">\(W\)</span> is a conditioning set that does not include any descendants of <span class="math inline">\(B\)</span>.</p>
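<p>One simple way to operationalize a conditional IV with a discrete conditioning variable is to apply the Wald formula within each stratum of <span class="math inline">\(W\)</span> and average the stratum estimates. The sketch below is our own toy illustration (all coefficients are arbitrary); under the homogeneity assumptions above, each stratum recovers the same effect:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
W = rng.binomial(1, 0.5, n)                      # observed confounder of Z and B
U = rng.binomial(1, 0.5, n)                      # unobserved confounder of A and B
Z = rng.binomial(1, 0.3 + 0.4 * W)               # instrument depends on W
A = rng.binomial(1, 0.1 + 0.5 * Z + 0.3 * U)
B = rng.binomial(1, 0.05 + 0.3 * A + 0.3 * U + 0.2 * W)  # true effect of do(A) is 0.3

def p(y, mask):
    return y[mask].mean()

def wald_given_w(w):
    # Wald formula computed within the stratum W == w
    s = W == w
    num = p(B, s & (Z == 1)) - p(B, s & (Z == 0))
    den = p(A, s & (Z == 1)) - p(A, s & (Z == 0))
    return num / den

# Average stratum-wise Wald estimates, weighted by P(W = w)
ate = sum(wald_given_w(w) * (W == w).mean() for w in (0, 1))
print(f"conditional IV estimate: {ate:.2f}")
```

Within each stratum of <span class="math inline">\(W\)</span>, the instrument is again as-good-as-random with respect to <span class="math inline">\(U\)</span>, so the stratum-wise Wald formula applies.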
<p>Considering more carefully the role of the instrument <span class="math inline">\(Z\)</span>—its purpose is to provide information about variation in <span class="math inline">\(A\)</span> that is independent of the unobserved confound <span class="math inline">\(U\)</span>—we recognize that <span class="math inline">\(Z\)</span> need not actually be a cause of <span class="math inline">\(A\)</span>. There are other relationships that might also capture the variation in <span class="math inline">\(A\)</span> necessary for our analysis. In Fig. 7 (c), we see one such example. Here, <span class="math inline">\(C\)</span> is an unobserved cause of both <span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span>. Even though <span class="math inline">\(Z\)</span> is not a cause of <span class="math inline">\(A\)</span> in this graph, <span class="math inline">\(Z\)</span> is not d-separated from <span class="math inline">\(A\)</span>, and will generally be correlated with <span class="math inline">\(A\)</span>. While this relaxation complicates our earlier proof of instrumental variables, it is generally valid to relax our second requirement to:</p>
<ol type="1">
<li><span class="math inline">\(Z\)</span> and <span class="math inline">\(A\)</span> are not d-separated conditional on <span class="math inline">\(W\)</span>.</li>
</ol>
<p>Many other extensions of instrumental variables scenarios have been explored, such as instrumental sets where a set of instrumental variables jointly enable identification, under assumptions of linearity, of the effects of multiple treatments on an outcome.</p>
</section>
</section>
<section id="sec:ch03-identify-in-practice" data-number="1.5">
<h2 data-number="3.5"><span class="header-section-number">3.5</span> Identification Strategies in Practice</h2>
<p>In this section, we describe some common situations that indicate that a particular identification strategy is likely to be useful. We look for: sources of randomness in the system and their relationship to the treatment and outcome; simplifying structure in causal relationships—perhaps from different levels of abstraction or temporal assumptions; and, when other approaches do not work, subproblems that may be easier to identify. Along the way, we highlight several common, classic identification strategies, including encouragement designs, difference-in-differences, and regression discontinuities.</p>
<p>Recall that in the Chapter <a href="/causal-reasoning-book-chapter2" data-reference-type="ref" data-reference="ch_causalmodel">2</a>, we noted that there is not necessarily a single correct graphical model representation of a system to be studied, but that we might use different models to answer different questions. Now that we have introduced a variety of identification strategies, we can begin to revisit this question of how best to model a system—what features to consider exogenous or endogenous to a model, and at what level of abstraction to model variables—to correctly answer a given causal inference question.</p>
<p>Throughout this section, we will assume our goal is to identify the causal effect <span class="math inline">\(\operatorname{P}(B|\operatorname{do}(A))\)</span>.</p>
<p><em>Remainder of section to be released</em></p>
</section>
<section id="sec:ch03-summary" data-number="1.6">
<h2 data-number="3.6"><span class="header-section-number">3.6</span> Summary</h2>
<p>In this chapter, we introduced the basics of identification for causal inference questions, including complexities introduced by <em>homogeneous and heterogeneous effects</em>. Key to formally approaching identification of causal effects from observational data is the <em><span class="math inline">\(\operatorname{do}()\)</span> operator and do-calculus</em>. The three rules of do-calculus allow us to analyze a causal graph and find strategies—or prove that there are none—for calculating the effects we wish to know from the observations we are able to make.</p>
<p>Using do-calculus, we described the <em>adjustment formula</em>, a useful approach for calculating causal effects in many situations by adjusting for the confounds or other appropriate factors that make up a <em>valid adjustment set</em>. When graphical assumptions alone are insufficient, combining do-calculus with additional parametric assumptions, as in the case of <em>instrumental variables</em> scenarios, can result in successful identification.</p>
</section>
<h2 class="unnumbered" id="sec:ch03-notes">Chapter Notes</h2>
<p>For further reading on causal reasoning and do-calculus, see Judea Pearl’s book, <em>Causality</em> (2009). Pearl’s blog, <em>Causal Analysis in Theory and Practice</em>, also has additional interesting materials, including a lovely “Crash Course in Good and Bad Controls” <a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a>.</p>
</section>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>Even in the case of a global treatment decision, we should strive to understand possible heterogeneous impacts well enough to ensure that we will not have unacceptable disparate impacts on subpopulations.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2" role="doc-endnote"><p>We will relax these somewhat in Section <a href="#sec:ch03-ivgeneralization" data-reference-type="ref" data-reference="sec:ch03-ivgeneralization">1.4.4</a>.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3" role="doc-endnote"><p><a href="http://causality.cs.ucla.edu/blog/index.php/category/bad-control/" class="uri">http://causality.cs.ucla.edu/blog/index.php/category/bad-control/</a><a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
Mon, 29 Mar 2021 00:00:00 +0000
https://causalinference.gitlab.io/causal-reasoning-book-chapter3/
Chapter 2: Models and Assumptions<section id="ch_causalmodel" data-number="1">
<p>Conventional statistical and machine learning problems are data focused. While data is a critical part of causal reasoning, it is not the only part. Just as important is the external knowledge we bring: our prior knowledge of the data generation mechanism and assumptions about plausible causal mechanisms. In fact, it is this external knowledge that distinguishes causal reasoning from associational methods.</p>
<p>Capturing our external knowledge about mechanisms and assumptions is the first stage of any causal analysis. Formally representing external domain knowledge as models allows us to systematically reason about strategies for answering causal questions. We already saw one example of such external knowledge captured in our structural causal model of the influence of temperature on ice cream and swimming in the previous chapter (<a href="/causal-reasoning-book-chapter1">Fig. 1.5</a>). This chapter focuses on the detailed mechanics and intuitions of these structural causal models and the assumptions they represent.</p>
<section id="sec:causalgraphs" data-number="1.1">
<h2 data-number="2.1"><span class="header-section-number">2.1</span> Causal Graphs</h2>
<p>The primary language for modeling causal mechanisms and expressing our assumptions is the language of <em>causal graphs</em>. Causal graphs encode our domain knowledge about the causal mechanisms underlying a system or phenomenon under study. We begin by introducing the mechanics of causal graphs and demonstrating how they represent causal relationships. We first assume that we have complete knowledge of the causal graph. Later in this chapter, we relax this assumption and refine our use of causal graphs to represent more complex and ambiguous situations.</p>
<p>A causal graph is made up of two kinds of elements:</p>
<ul>
<li><p><strong>Nodes</strong> represent variables or features in the world or system we are modeling. Without limitation, let us think of each node as representing something that is potentially observable, measurable, or otherwise knowable about a system.</p></li>
<li><p><strong>Edges</strong> connect nodes to one another. Each edge represents a mechanism or causal relationship that relates the values of the connected nodes. Edges are directed to indicate the flow of causal influence. For example, in Fig. 1 (a), a change in the value of node <span class="math inline"><em>A</em></span> will cause a change in <span class="math inline"><em>B</em></span>, but if we were to manipulate <span class="math inline"><em>B</em></span>, it would not cause a change <span class="math inline"><em>A</em></span>. In cases where the direction of the influence is unknown, we will draw an undirected edge.</p></li>
</ul>
<p>Causal graphs are assumed to be acyclic. That is, we cannot have a situation where <span class="math inline"><em>A</em></span> causes <span class="math inline"><em>B</em></span> and <span class="math inline"><em>B</em></span> causes <span class="math inline"><em>A</em></span>.<a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a></p>
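<p>Acyclicity is easy to check mechanically. The sketch below (the adjacency-list representation and node names are illustrative, not from the book) detects a cycle in a directed graph with a depth-first search:</p>

```python
def has_cycle(graph):
    """Return True if the directed graph (node -> list of children) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for child in graph.get(node, []):
            if color.get(child, WHITE) == GRAY:   # back edge: we returned to our own path
                return True
            if color.get(child, WHITE) == WHITE and visit(child):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in list(graph))

# The 4-node graph of Fig. 2: A -> B, C -> B, B -> D
fig2 = {"A": ["B"], "C": ["B"], "B": ["D"], "D": []}
print(has_cycle(fig2))                       # False: a valid causal graph
print(has_cycle({"A": ["B"], "B": ["A"]}))   # True: disallowed, A and B cause each other
```

<p>A node colored gray is still on the current search path, so reaching a gray node again means the path loops back on itself.</p>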
<div id="fig:sample-2nodegraph-main" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_2nodegraph.png" id="fig:sample-2nodegraph" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_2nodegraph_undirected.png" id="fig:sample-2nodegraph_undirected" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 1: A simple causal graph over two nodes. a — A causes B, b — Either A causes B or B causes A, but not both.</p>
</div>
<section id="reading-a-causal-graph" data-number="1.1.1">
<h3 data-number="2.1.1"><span class="header-section-number">2.1.1</span> Reading a causal graph</h3>
<p>Intuitively, an edge transfers information from one node to another in the direction of its arrow. Thus, causal effects flow through the graph along nodes connected by edges. This interpretation allows us to read a causal graph and answer many interesting and important questions about the underlying system.</p>
<p>To start with, we can ask whether a change in one node’s value might affect another. This is important when we want to, for example, check for potential side effects of some action or decision. We can ask whether a node’s value will stay the same as others change—important for identifying when a metric we care about is stable or not. If we want to optimize for some target outcome, we can ask what nodes we can manipulate to cause the targeted node’s value to change. For example, in Fig. 2, we can see that changes in <span class="math inline"><em>A</em></span> will affect <span class="math inline"><em>B</em></span>, which will, in turn, affect <span class="math inline"><em>D</em></span>. We can also see that, since there are no directed paths leading from <span class="math inline"><em>A</em></span> to <span class="math inline"><em>C</em></span>, changes in <span class="math inline"><em>A</em></span> will not affect <span class="math inline"><em>C</em></span>. Similarly, since causal effect flows in only one direction, we can see that changes in <span class="math inline"><em>C</em></span> will affect <span class="math inline"><em>B</em></span> and <span class="math inline"><em>D</em></span> but not <span class="math inline"><em>A</em></span>, and changes in <span class="math inline"><em>B</em></span> will affect <span class="math inline"><em>D</em></span>, but not <span class="math inline"><em>A</em></span> or <span class="math inline"><em>C</em></span>. Changes in <span class="math inline"><em>D</em></span> will not affect any of the other nodes.</p>
<p>Causal relationships also flow transitively from edge to edge, through a directed path. In referring to such causal relationships, we often use familial notations. If <span class="math inline"><em>A</em></span> causes <span class="math inline"><em>B</em></span>, then <span class="math inline"><em>A</em></span> is a <em>parent</em> of <span class="math inline"><em>B</em></span>, and <span class="math inline"><em>B</em></span> is a <em>child</em> of <span class="math inline"><em>A</em></span>. All child nodes of <span class="math inline"><em>B</em></span> and, recursively, all of their children are known as descendants of <span class="math inline"><em>B</em></span>. Similarly, all parents of <span class="math inline"><em>B</em></span> and, recursively, all of their parents are known as ancestors of <span class="math inline"><em>B</em></span>. A node causally affects all its descendants and is affected by all its ancestors.</p>
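<p>These familial relations map directly onto graph traversal. A minimal sketch (the adjacency-list encoding of Fig. 2 is our own, hypothetical representation):</p>

```python
def descendants(graph, node):
    """All nodes causally affected by `node`: its children, their children, and so on."""
    found = set()
    stack = list(graph.get(node, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            stack.extend(graph.get(child, []))
    return found

def ancestors(graph, node):
    """All nodes that causally affect `node`: traverse the reversed graph."""
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return descendants(reversed_graph, node)

# Fig. 2: A -> B, C -> B, B -> D
fig2 = {"A": ["B"], "C": ["B"], "B": ["D"], "D": []}
print(descendants(fig2, "A"))  # {'B', 'D'}: changes in A affect B and D
print(ancestors(fig2, "D"))    # {'A', 'B', 'C'}: D is affected by all other nodes
```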
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_4nodegraph.png" id="fig:sample-4nodegraph" alt="" /><figcaption>Figure 2: A 4-node causal graph</figcaption>
</figure>
</section>
<section id="sec:statisticalindependence_intro" data-number="1.1.2">
<h3 data-number="2.1.2"><span class="header-section-number">2.1.2</span> Causal graphs and statistical independence</h3>
<p>Causal graphs are not only intuitively easy to read; they also provide formal definitions that enable systematic reasoning about their properties. Fundamentally, a causal graph describes a non-parametric data-generating process over its nodes. By specifying independence and dependence between the nodes, the graph constrains the relationships among the generated variables corresponding to those nodes.</p>
<p>In particular, causal graphs provide information about statistical independence. Two nodes <span class="math inline"><em>x</em></span> and <span class="math inline"><em>y</em></span> are statistically independent if knowing the value of one node gives no information about the value of the other node. That is: <br /><span class="math display">$$x \unicode{x2AEB} y \text{ iff } P(x) = P(x|y)$$</span><br /> where <span class="math inline">$\unicode{x2AEB}$</span> is the symbol for statistical independence. We also often work with conditional independences, where two nodes <span class="math inline"><em>x</em></span> and <span class="math inline"><em>y</em></span> might only be statistically independent conditional on some other node <span class="math inline"><em>z</em></span>: <br /><span class="math display">$$(x \unicode{x2AEB} y) \,|\, z \text{ iff } P(x|z) = P(x|y,z)$$</span><br /> Statistical relationships correspond to a particular data distribution and should not be confused with causal relationships. Whereas causal relationships concern whether manipulating one node’s value will cause a change in another node’s value, statistical relationships concern whether knowing the value of one node provides information about the value of another.</p>
<div id="fig:chain-fork-collider" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Collider-crop.png" id="fig:collider" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Fork-crop.png" id="fig:fork" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Chain-crop.png" id="fig:chain" style="width:100.0%" alt="" /><figcaption>c</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 3: Causal graph structures over three nodes. a — B is a collider for A and C., b — B creates a fork to A and C., c — B forms a chain from A to C.</p>
</div>
<p>To illustrate how a causal graph implies certain statistical independences, let us consider different graph structures over three variables. Fig. 3 shows the three important structures: <em>collider</em>, <em>fork</em> and <em>chain</em>. The left subfigure shows a node B that is caused by both A and C. Such a node is called a collider, because the effect from two parents collides at the node. Importantly, no causal information transfers from A to C through B, and thus A and C are statistically independent. The middle subfigure shows a fork structure where B causes both A and C. Here, the value of B determines the values of both A and C, so A and C are not independent. However, the only source of statistical dependence between them is B. What if B is fixed at a certain value? In that case, A and C will be statistically independent. In other words, A and C are independent conditional on B. Finally, the right subfigure shows a chain. A chain implies a single direction of flow of causality from A to B, and from B to C. Any causal information from A to C must flow through B. Thus, similar to the fork structure, A and C are conditionally independent given B.</p>
<p>The three basic structures can be extended to determine statistical independence in any graph. Here we use the concept of an <em>undirected path</em> between any two nodes, defined as a set of contiguous edges connecting the two nodes. To determine statistical independence between two nodes, we consider all undirected paths between two nodes and test whether the paths have any of these structures. Note that a collider is the only structure that leads to statistical independence without conditioning. Thus, two nodes are independent if all undirected paths between them contain a collider. Of course, they are also independent if there is no undirected path connecting them.</p>
<p>Nodes that lead to statistically independent variables are said to be <em>d-separated</em> from each other.</p>
<p><strong>d-separation:</strong> Two nodes in a causal graph are d-separated if either there exists no undirected path that connects them, or all paths connecting them contain a collider.</p>
<p>Conditional independence, however, is not as straightforward. From the other two base structures of fork and chain, we saw that conditioning on B makes A and C independent. However, the collider structure shows the reverse property. Conditioned on the collider B, A and C become dependent. This is because knowing the value of a collider and one of its parents tells us something about the other parent. Consider the example of a spam system that classifies an email as spam only if two conditions are satisfied: it contains the phrase “please send money” (A), and the email is too long (C). In general, these two variables are independently generated: longer emails can ask for money, but so can shorter emails. Knowing that an email is long tells us nothing about whether it contains an ask for money. However, once we know that an email was classified as spam, we can determine that both A and C were true. If it was not classified as spam, then knowing A=True reveals that C=False, and vice-versa. Thus, by fixing the value of a collider or conditioning on it, we render its parents dependent with respect to each other. Based on the above discussion, conditional independence or d-separation requires the conditioning variable to form a fork or chain along all paths between the two nodes, and also requires that it does not form a collider on any path.</p>
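<p>The spam example can be checked by simulation. The sketch below (variable names and sample size are illustrative) draws the two causes independently and then conditions on the collider:</p>

```python
import random

random.seed(0)
samples = []
for _ in range(100_000):
    asks_for_money = random.random() < 0.5   # A: contains "please send money"
    is_long = random.random() < 0.5          # C: drawn independently of A
    is_spam = asks_for_money and is_long     # B: the collider, caused by both
    samples.append((asks_for_money, is_long, is_spam))

def p_money(rows):
    """Empirical P(A=True) over the given rows."""
    rows = list(rows)
    return sum(a for a, _, _ in rows) / len(rows)

# Marginally, knowing C tells us nothing about A:
print(round(p_money(s for s in samples if s[1]), 2))       # ~0.5
print(round(p_money(s for s in samples if not s[1]), 2))   # ~0.5

# Conditioned on the collider (here, B=False), A and C become dependent:
print(p_money(s for s in samples if not s[2] and s[1]))    # 0.0: long non-spam never asks
print(round(p_money(s for s in samples if not s[2] and not s[1]), 2))  # ~0.5
```

<p>Among long emails that were not classified as spam, the ask-for-money rate drops to exactly zero, even though the two causes were generated independently.</p>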
<p><strong>Conditional d-separation:</strong> Two nodes in a causal graph are conditionally d-separated on another node <span class="math inline"><em>B</em></span> if either they are d-separated, or all undirected paths connecting them contain <span class="math inline"><em>B</em></span> as a fork or a chain, but not as a collider.</p>
<p>Using the above definitions, we can now read statistical independence and conditional statistical independence relationships from a causal graph. For example, in Fig. 2, we can see that <span class="math inline">$A \unicode{x2AEB} C$</span> as they have no common causes, whereas all other pairs of nodes are statistically <em>dependent</em>. It is also possible to read conditional independences from a graph. In Fig. 2 again, <span class="math inline">$(D \unicode{x2AEB} A) | B$</span> since B is the central node of the chain connecting A and D. That is, once we know the value of <span class="math inline"><em>B</em></span>, we know everything we can know about <span class="math inline"><em>D</em></span> and, in particular, knowing the value of <span class="math inline"><em>A</em></span> does not change our belief in what <span class="math inline"><em>D</em></span> might be.</p>
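<p>These reading rules can be mechanized. The following self-contained sketch (node names and the edge-set encoding are our own, illustrative choices) tests d-separation by enumerating undirected paths; it also includes the standard refinement (not needed for the simple definitions above) that conditioning on a descendant of a collider likewise opens that collider:</p>

```python
def undirected_paths(edges, x, y):
    """All simple paths between x and y, ignoring edge direction."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        for nxt in nbrs.get(path[-1], ()):
            if nxt == y:
                paths.append(path + [y])
            elif nxt not in path:
                stack.append(path + [nxt])
    return paths

def descendants(edges, node):
    """All nodes reachable from `node` by following edge directions."""
    children, found, stack = {}, set(), [node]
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def d_separated(edges, x, y, given=frozenset()):
    """True if x and y are d-separated given the conditioning set `given`."""
    given = set(given)
    for path in undirected_paths(edges, x, y):
        blocked = False
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            if (prev, node) in edges and (nxt, node) in edges:
                # collider: blocks unless it (or one of its descendants) is conditioned on
                if node not in given and not (descendants(edges, node) & given):
                    blocked = True
            elif node in given:
                blocked = True        # conditioned fork or chain node blocks the path
        if not blocked:
            return False              # an open path means x and y may be dependent
    return True

# Fig. 2: A -> B, C -> B, B -> D
fig2 = {("A", "B"), ("C", "B"), ("B", "D")}
print(d_separated(fig2, "A", "C"))         # True: the collider at B blocks the only path
print(d_separated(fig2, "A", "C", {"B"}))  # False: conditioning on the collider opens it
print(d_separated(fig2, "D", "A", {"B"}))  # True: (D ⊥ A) | B
```

<p>Enumerating all simple paths is exponential in the worst case, so this sketch is only suitable for small illustrative graphs; libraries such as NetworkX provide efficient d-separation tests.</p>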
<p>The above definitions also apply to sets of nodes rather than individual nodes. The concept of conditional independence is useful in choosing which nodes to condition or intervene on. It implies that, conditioned on its parents, a node is independent of all its ancestors. Thus, knowing the values of all ancestor nodes conveys no more information about a node than knowing the values of its parents. This knowledge can also be useful for designing predictive models that generalize beyond the training distribution, as discussed in Chapter <a href="#ch_classification" data-reference-type="ref" data-reference="ch_classification">13</a>.</p>
</section>
<section id="causal-graph-and-resulting-data-distributions" data-number="1.1.3">
<h3 data-number="2.1.3"><span class="header-section-number">2.1.3</span> Causal graph and resulting data distributions</h3>
<p>Since the graph only specifies the direction of effect and not its magnitude, shape or interactions, multiple data-generating processes and thus multiple data probability distributions are compatible with the same causal graph.</p>
<p>Formally, a causal graph specifies a factorization of the joint probability distribution of data. Any probability distribution consistent with the graph needs to follow the specific factorization. For instance, for the causal graph in Fig. 2, we can write, <br /><span class="math display">$$\begin{split}
P(A, B, C, D) &= P(D|A,B,C)P(B|C,A)P(C|A)P(A) \\
&= P(D|B)P(B|C,A)P(C)P(A)
\end{split}$$</span><br /> where the first equality follows from the chain rule of probability, and the second follows from the structure of the causal graph. As discussed above, <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> are independent based on the graph. In addition, <span class="math inline"><em>B</em></span> blocks the directed paths from <span class="math inline"><em>A</em>, <em>C</em></span> to <span class="math inline"><em>D</em></span>, so <span class="math inline"><em>D</em></span> is independent of <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> given <span class="math inline"><em>B</em></span>. More generally, for any causal graph <span class="math inline">𝒢</span> over variables <span class="math inline"><em>V</em><sub>1</sub>, <em>V</em><sub>2</sub>, ...<em>V</em><sub><em>m</em></sub></span>, the probability distribution of data is given by, <span id="eq:graph-factorization"><br /><span class="math display">$$\label{eq:graph-factorization}
\begin{split}
P(V_1, V_2, ...,V_m) = \Pi_{i=1}^{m} P(V_i|Pa_{\mathcal{G}}(V_i))
\end{split}\qquad(1)$$</span><br /></span> where <span class="math inline"><em>P</em><em>a</em><sub>𝒢</sub>(<em>V</em><sub><em>i</em></sub>)</span> refers to parents of <span class="math inline"><em>V</em><sub><em>i</em></sub></span> in the causal graph <span class="math inline">𝒢</span>. Note that the above factorization and resultant independence constraints have to be satisfied by every probability distribution generated from the graph. Therefore, independence in the graph implies statistical independence constraints across all probability distributions.</p>
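<p>Eq. 1 can be verified numerically. The sketch below builds a joint distribution for the graph of Fig. 2 from made-up conditional probability tables (the numbers are purely illustrative) and checks one of the implied conditional independences:</p>

```python
from itertools import product

# Made-up conditional probability tables for binary variables in Fig. 2
p_a = {True: 0.3, False: 0.7}                 # P(A)
p_c = {True: 0.6, False: 0.4}                 # P(C): no parents, independent of A
p_b_given = {(True, True): 0.9, (True, False): 0.5,
             (False, True): 0.4, (False, False): 0.1}   # P(B=True | A, C)
p_d_given_b = {True: 0.8, False: 0.2}         # P(D=True | B)

def joint(a, b, c, d):
    """P(A,B,C,D) via the graph factorization P(A) P(C) P(B|A,C) P(D|B) of Eq. 1."""
    pb = p_b_given[(a, c)] if b else 1 - p_b_given[(a, c)]
    pd = p_d_given_b[b] if d else 1 - p_d_given_b[b]
    return p_a[a] * p_c[c] * pb * pd

# The factors define a valid distribution: all 16 outcomes sum to 1
total = sum(joint(a, b, c, d) for a, b, c, d in product([True, False], repeat=4))
print(round(total, 10))   # 1.0

# Check the implied independence (D ⊥ A) | B: P(D=True | A, B) does not depend on A
def p_d_true(a, b):
    num = sum(joint(a, b, c, True) for c in (True, False))
    den = sum(joint(a, b, c, d) for c in (True, False) for d in (True, False))
    return num / den

print(p_d_true(True, True), p_d_true(False, True))   # both ≈ 0.8
```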
<p>Additionally, it is possible that some data distributions factorize further and include more independences. An edge between two nodes in a causal graph conveys the assumption that there may exist a relationship between them, but not every data distribution needs to realize it. Since multiple distributions are possible, in some of them the effect between the connected nodes goes to zero. For instance, while <span class="math inline"><em>A</em></span> and <span class="math inline"><em>B</em></span> are connected via an edge in Fig. 2, there can be a dataset where <span class="math inline"><em>A</em></span>’s effect on <span class="math inline"><em>B</em></span> is zero. As another example, in the ice cream and swimming causal graph from <a href="/causal-reasoning-book-chapter1">Fig. 1.5</a>, we will find that the effect of ice-cream on swimming is zero, even though the causal graph includes an edge. Thus, including an edge reflects the possibility of a causal relationship, but does not confirm it. The effect of one node on another can also cancel out through multiple interacting effects. For example, based on the graph from Fig. 2, there can be a distribution where <span class="math inline"><em>A</em></span> and <span class="math inline"><em>D</em></span> are independent if the effect of <span class="math inline"><em>B</em></span> on <span class="math inline"><em>D</em></span> directly cancels <span class="math inline"><em>A</em></span>’s effect on <span class="math inline"><em>B</em></span>. Exact cancellations of this kind are often assumed to be implausible.</p>
<p>Another implication of a causal graph is the specific factorization of the joint probability of data. While other factorizations of a given dataset are also valid, a factorization consistent with Eq. 1 is more likely to generalize to changes in data distribution. That is, we can consider the individual conditional probability factors as independent mechanisms. In any dataset generated from the graph in Fig. 2, if P(A) changes, we expect P(B) to change too, but we do not expect the causal relationship between them, P(B|A), to change. However, if we consider any other factorization, e.g., <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>)<em>P</em>(<em>B</em>)</span>, then changing P(A) will change P(B), but also necessarily change <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>)</span>. This is because <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>) = <em>P</em>(<em>B</em>|<em>A</em>)<em>P</em>(<em>A</em>)/<em>P</em>(<em>B</em>)</span>: since <span class="math inline"><em>P</em>(<em>B</em>|<em>A</em>)</span> is invariant across distributions, <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>)</span> depends directly on P(A). The invariance of causal relationships across different distributions underscores the generalization benefit of a causal graph: P(B|A), once estimated from a single data distribution, is expected to stay the same for all distributions consistent with the graph.</p>
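<p>A small numerical sketch makes this asymmetry concrete (the probabilities are arbitrary): holding the mechanism P(B|A) fixed while shifting P(A) changes P(A|B), via Bayes’ rule:</p>

```python
def p_a_given_b(p_a, p_b_given_a):
    """Bayes' rule: P(A=True | B=True) from P(A=True) and the mechanism P(B=True | A)."""
    p_b = p_b_given_a[True] * p_a + p_b_given_a[False] * (1 - p_a)
    return p_b_given_a[True] * p_a / p_b

# The causal mechanism P(B | A) stays fixed across environments...
p_b_given_a = {True: 0.9, False: 0.2}

# ...while the marginal P(A) shifts; the non-causal factor P(A | B) shifts with it
for p_a in (0.1, 0.5, 0.9):
    print(p_a, round(p_a_given_b(p_a, p_b_given_a), 3))
```

<p>Each run reuses the same P(B|A) table, yet P(A|B) moves with P(A): only the causal factorization is stable across these distribution shifts.</p>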
<p>At this point, it is useful to compare causal graphs to probabilistic graphical models. While a probabilistic graphical model also offers a graphical representation of conditional independences, such a graph corresponds only to a particular data distribution. A causal graph, in contrast, represents invariant relationships that are stable across data distributions. These relationships are expected to hold for all configurations of an underlying system. Causal graphs thus provide a concise way to describe key, invariant properties of a system.</p>
</section>
<section id="key-properties" data-number="1.1.4">
<h3 data-number="2.1.4"><span class="header-section-number">2.1.4</span> Key Properties</h3>
<p>It is satisfying to note that once we have formulated our domain knowledge about possible causal relationships in the form of a graph, we can reason about causal relationships between any pair of nodes in our graph without appealing to additional domain knowledge. That is, the graph itself captures all the required information for determining which nodes, if manipulated, might affect which others. Below we emphasize key properties of the causal graph.</p>
<section id="the-assumptions-asserted-by-a-causal-graph-are-encoded-by-the-missing-edges-in-a-graph-and-the-direction-of-edges" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">The assumptions asserted by a causal graph are encoded by the missing edges in a graph, and the direction of edges</h4>
<p>It would be easy to believe that we are making an assumption about the existence of a causal mechanism when we draw an edge between two nodes. However, the edge itself does not represent an assumption! That an edge exists says nothing about the shape or size of the causal influence of one node on another; that causal influence could be vanishingly small or even <span class="math inline">0</span>! Thus, the existence of an edge—especially an undirected edge—does not represent a constraint on the underlying mechanisms. In contrast, the lack of an edge between two nodes is a much stronger assumption, as it is asserting that the direct causal influence is truly <span class="math inline">0</span>.</p>
<p>Fig. 4 illustrates three causal graphs that encode increasingly more assumptions. Of these illustrations, Fig. 4 (a) encodes the fewest assumptions. The single assumption is that the left nodes cause the right nodes (or more precisely, that the right nodes do not cause the left nodes). Note, however, that nothing is assumed about the relationship between the two left nodes; and nothing is assumed about the relationship between the two right nodes. By removing edges and adding directionality to another edge, Fig. 4 (b) adds several additional assumptions: that the top-left node causes the bottom-left node and that only the bottom-left node influences the right nodes. Fig. 4 (c) makes the strongest assumptions on the underlying graph.</p>
<p>When is it preferable to use models that make more assumptions or fewer assumptions about underlying causal mechanisms? Generally speaking, when creating a causal graph, we should strive to encode as much of our domain knowledge as possible within the graph. If we know for certain through external knowledge that there is no direct causal relationship between two nodes, then we have no reason to add such an edge and, in fact, many of our computations and analyses will become only more difficult or the results more ambiguous if we do.</p>
<div id="fig:assumptionsdag" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_FewAssumptionsDAG.png" id="fig:fewassumptionsdag" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_SomeAssumptionsDAG.png" id="fig:someassumptionsdag" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_ManyAssumptionsDAG.png" id="fig:manyassumptionsdag" style="width:100.0%" alt="" /><figcaption>c</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 4: Causal graphs with more edges encode fewer assumptions on the underlying causal mechanisms. Here Fig. 4 (c) encodes the most assumptions. a — Few assumptions, b — More assumptions., c — Many assumptions.</p>
</div>
</section>
<section id="causal-relationships-correspond-to-stable-and-independent-mechanisms" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Causal relationships correspond to stable and independent mechanisms</h4>
<p>A system—whether a computer system, a mechanical system, a social system, or other—consists of many mechanisms, interacting with each other to create the integrated behavior of the system as a whole. These mechanisms are often independent and stable. That is, we can replace or change one of these mechanisms without replacing others. The other mechanisms remain the same, though the system as a whole will change with the new integrated behavior.</p>
<p>Appealing to this intuition of stable and independent mechanisms, we often assume that the components of the causal graph—in particular, the individual edges in the graph—represent distinct stable and independent mechanisms of the underlying system being modeled. That is, if we make some change to how the world works—perhaps we upgrade a software component, or we change the mechanics of a physical system—we can change how <span class="math inline"><em>A</em></span> influences <span class="math inline"><em>B</em></span> without changing how <span class="math inline"><em>B</em></span> influences <span class="math inline"><em>D</em></span>; or vice-versa.</p>
</section>
<section id="sec:causalgraphscannotbelearnedfromdataalone" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Causal graphs cannot be learned from data alone</h4>
<p>We declared earlier that inferring causality requires external knowledge—some information about the underlying system mechanics or the data generating process—beyond the raw data itself. Why is this? Every causal graph implies a set of testable implications: the conditional statistical independences, introduced above in Section <a href="#sec:statisticalindependence_intro" data-reference-type="ref" data-reference="sec:statisticalindependence_intro">1.1.2</a>. However, these testable implications do not uniquely determine the graph: every causal graph belongs to an <em>equivalence class</em> of graphs that imply the same set of independence tests.</p>
<p>Consider the two graphs in Fig. 5. In Fig. 5 (a), we can read only a single independence from the graph: <span class="math inline">$(C \unicode{x2AEB} A) | B$</span>. That is, once we know the value of <span class="math inline"><em>B</em></span>, knowing the value of <span class="math inline"><em>A</em></span> will not give us any additional knowledge about the value of <span class="math inline"><em>C</em></span>. This graph implies many other causal assumptions, but only the conditional statistical independence tests are testable given data.</p>
<p>In Fig. 5 (b), we see a very different causal graph. From a practical standpoint, making a decision using this causal graph rather than the first would be very different. In Fig. 5 (a), if we manipulate <span class="math inline"><em>B</em></span>, we do not expect that <span class="math inline"><em>A</em></span> will change. In contrast, in Fig. 5 (b), if we manipulate <span class="math inline"><em>B</em></span>, we <em>do</em> expect that <span class="math inline"><em>A</em></span> will change.</p>
<p>Despite such differences in the causal implications of these graphs, when we look for testable statistical independences, we can only find one, that <span class="math inline">$(C \unicode{x2AEB} A) | B$</span>, the same test as for the other graph. The implication is that any dataset that satisfies the testable assumptions of Fig. 5 (a) will also satisfy the testable assumptions of Fig. 5 (b). As a result, if we want to know which causal graph is correct, we cannot rely solely on the raw data, but must bring some of our own domain knowledge to bear.</p>
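<p>This indistinguishability can be demonstrated by simulation. The sketch below (binary variables with an arbitrary noise mechanism of our own choosing) generates data from two graphs in the same equivalence class, a chain A → B → C and a graph in which B causes both A and C; in both datasets, A gives no information about C once B is known:</p>

```python
import random

def flip(p):
    return random.random() < p

def noisy_copy(x, flip_prob=0.2):
    """Copy a bit, flipping it with probability 0.2 (an arbitrary mechanism)."""
    return (not x) if flip(flip_prob) else x

def chain_sample():   # A -> B -> C
    a = flip(0.5); b = noisy_copy(a); c = noisy_copy(b)
    return a, b, c

def fork_sample():    # A <- B -> C
    b = flip(0.5); a = noisy_copy(b); c = noisy_copy(b)
    return a, b, c

def p_c_true(rows, a=None, b=None):
    """Empirical P(C=True) among rows matching the given A and/or B values."""
    rows = [r for r in rows if (a is None or r[0] == a) and (b is None or r[1] == b)]
    return sum(r[2] for r in rows) / len(rows)

random.seed(1)
for sampler in (chain_sample, fork_sample):
    data = [sampler() for _ in range(200_000)]
    # Once B is known, A adds no information about C, in both graphs:
    print(sampler.__name__,
          round(p_c_true(data, a=True, b=True), 2),
          round(p_c_true(data, a=False, b=True), 2))
```

<p>A conditional independence test sees the same pattern from both samplers, so no amount of such data can tell the two graphs apart.</p>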
<div id="fig:equivalence-dag" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_EquivalenceClass_DAG.png" id="fig:equivalenceclass-lhs" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_EquivalenceClass2_DAG.png" id="fig:equivalenceclass-rhs" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 5: These two graphs are in the same equivalence class. That is, any data set that can be described with one of these models can also be described with the other. </p>
</div>
<p>We will discuss statistical independence tests and these equivalence classes in more detail in Chapter <a href="#ch_refutations" data-reference-type="ref" data-reference="ch_refutations">5</a>.</p>
</section>
<section id="causal-graphs-are-a-tool-to-help-us-reason-about-a-specific-problem" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Causal graphs are a tool to help us reason about a specific problem</h4>
<p>Finally, we want to briefly emphasize the intuition that there is not necessarily a single, “correct” causal graph representation of any given system. A causal graph should, of course, correspond with the true causal mechanisms that drive the system being analyzed. However, questions of abstractions and fidelity, exogeneity, measurement practicalities, and, of course, the overarching purpose of an analysis, mean that many different models of a system can be considered valid under different purposes and circumstances.</p>
</section>
</section>
</section>
<section id="sec:struc-equations" data-number="1.2">
<h2 data-number="2.2"><span class="header-section-number">2.2</span> Structural Equations, Noise, and Unobserved Nodes</h2>
<p>The causal graph is a simplified representation that captures much of the key information about a system but, like all abstractions, also leaves many details out. In this section, we present how structural equations can capture the functional relationships represented by the edges of a graph, and discuss the importance of noise and unobserved nodes. Causal graphs and structural equations together form the <em>Structural Causal Model (SCM)</em> framework of causal reasoning.</p>
<section id="structural-equations" data-number="1.2.1">
<h3 data-number="2.2.1"><span class="header-section-number">2.2.1</span> Structural Equations</h3>
<p>One key piece of information that is not included in the representation of the graph is the functional relationship between nodes. While the existence of an edge between two nodes in Fig. 1 (a) tells us that there is some functional relationship between <span class="math inline"><em>A</em></span> and <span class="math inline"><em>B</em></span>, it does not tell us the shape or magnitude of the effect. The fact that the graph does not represent this functional relationship means that we cannot, in Fig. 1 (a), tell whether an increase in <span class="math inline"><em>A</em></span> will cause an increase or decrease in <span class="math inline"><em>B</em></span>. In more complicated scenarios where multiple nodes influence others, we cannot tell how the values of these nodes interact. <a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a> For example, in Fig. 2, where edges from both <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> lead to <span class="math inline"><em>B</em></span>, the causal graph alone does not tell us how these nodes might interact, or if they interact at all. It is possible that the effect of <span class="math inline"><em>C</em></span> on <span class="math inline"><em>B</em></span> is the same regardless of the value of <span class="math inline"><em>A</em></span> (no interaction). It is also possible that <span class="math inline"><em>C</em></span> affects <span class="math inline"><em>B</em></span> differently depending on <span class="math inline"><em>A</em></span>’s value.</p>
<p>To augment causal graphs with a stronger characterization of the functional relationships between nodes, we often use structural equation models (SEM). Each equation represents a causal assignment from the right-hand-side of the equation to the left. Eqns. 2, 3 show general structural equations for Figs. 1 (a), 2 respectively.<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a> <span id="eq:sample-2nodegraph"><br /><span class="math display">$$\begin{array}{rcl}
B & \leftarrow& f(A)
\end{array}
%\caption{A structural equation corresponding to \figref{fig:sample-2nodegraph}}
\label{eq:sample-2nodegraph}\qquad(2)$$</span><br /></span> <span id="eq:sample-4nodegraph"><br /><span class="math display">$$\begin{array}{rcl}
D & \leftarrow& f_1(B) \\
B & \leftarrow& f_2(A,C)
\end{array}
%\caption{A set of structural equations corresponding to \figref{fig:sample-4nodegraph}}
\label{eq:sample-4nodegraph}\qquad(3)$$</span><br /></span> While these equations allow any form of function <span class="math inline"><em>f</em>()</span>, we can easily indicate specific functional forms. For example, Eq. 4 shows an alternative SEM for the graph of Fig. 2 where the causal relationships are linear, and the effects of <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> on <span class="math inline"><em>B</em></span> do not interact with each other. <span id="eq:sample-4nodegraph-linear"><br /><span class="math display">$$\begin{array}{rcl}
D & \leftarrow& \alpha_1*B \\
B & \leftarrow& \alpha_2*A + \beta_2*C
\end{array}
%\caption{A more restrictive set of structural equations relating the nodes in \figref{fig:sample-4nodegraph}}
\label{eq:sample-4nodegraph-linear}\qquad(4)$$</span><br /></span> Note that the above equations are different from purely statistical models such as linear regressions, even though the equations are deceptively similar. Structural equations imply a causal relationship, whereas conventional equations provide no such implication. In a regression model, it is equally valid to regress y on x, as it is to regress x on y. In contrast, a structural equation can only be written in one direction, the direction of causal relationship as specified by a causal graph. Further, a structural equation only includes causes of y in the RHS whereas a linear regression may include all known variables, including children of y that can be useful for prediction. In some cases, it is possible that parameters of a structural equation are estimated using linear regression, but the two models still retain their independent characteristics.</p>
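<p>As a minimal sketch (with made-up coefficient values), the linear structural equations of Eq. 4 can be evaluated as one-directional assignments, in an order that respects the graph:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_1, alpha_2, beta_2 = 0.5, 2.0, -1.0   # illustrative coefficients

# Exogenous nodes are sampled first ...
A = rng.normal(size=1000)
C = rng.normal(size=1000)
# ... then each endogenous node is assigned from its causes, left <- right.
B = alpha_2 * A + beta_2 * C   # B <- f2(A, C)
D = alpha_1 * B                # D <- f1(B)
```

<p>Unlike a regression equation, the assignment for <span class="math inline"><em>B</em></span> cannot be rearranged to “solve for” <span class="math inline"><em>A</em></span> or <span class="math inline"><em>C</em></span>; the direction of each assignment is fixed by the causal graph.</p>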
</section>
<section id="sec:refininggraphs-noise" data-number="1.2.2">
<h3 data-number="2.2.2"><span class="header-section-number">2.2.2</span> Noisy models</h3>
<p>Any model, whether a causal graph or a set of structural equations, will necessarily represent (at best) a simplified version of the most important causal factors and relationships controlling a system’s behavior. To account for the many minor factors influencing system behavior, it is common practice to introduce an <span class="math inline"><em>ϵ</em></span> noise term into our structural equations: <br /><span class="math display">$$\label{ch02-noisy-sem}
\begin{array}{rcl}
D & \leftarrow& \alpha_1*B + \epsilon_D \\
B & \leftarrow& \alpha_2*A + \beta_2*C + \epsilon_B
\end{array}$$</span><br /></p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_4nodegraph_noise.png" id="fig:noise" alt="" /><figcaption>Figure 6: Noise, such as <span class="math inline"><em>ϵ</em><sub><em>B</em></sub></span> and <span class="math inline"><em>ϵ</em><sub><em>D</em></sub></span>, are often omitted from causal graphs, but can be added for completeness as shown here.</figcaption>
</figure>
<p>Fig. 6 shows the causal graph representation of these <span class="math inline"><em>ϵ</em></span> noise terms. For simplicity of representation, these noise terms are not usually drawn in the causal graph, but are simply assumed to exist. Note that to avoid compromising the causal relationships implied by our original (non-noisy) causal model, the noise factors that influence each node must be independent of one another. If, for some reason, we believe that two nodes in a causal graph are subject to correlated noise, we must instead explicitly represent that in the graph.</p>
<p>Structural equations can be considered an alternative representation of the factorization of probability distributions mentioned earlier. The noisy structural equations above can be written in terms of expectations as: <br /><span class="math display">$$\begin{split}
\mathbb{E}[D|B] &= \alpha_1*B \\
\mathbb{E}[B|A, C]&= \alpha_2 *A + \beta_2 *C
\end{split}$$</span><br /></p>
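<p>A quick simulation illustrates these expectations. In the sketch below (the coefficients and noise scales are made-up values), we draw independent noise terms and check that regressing <span class="math inline"><em>D</em></span> on <span class="math inline"><em>B</em></span> recovers the structural coefficient <span class="math inline"><em>α</em><sub>1</sub></span>:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
alpha_1, alpha_2, beta_2 = 0.5, 2.0, -1.0   # illustrative coefficients

A = rng.normal(size=n)
C = rng.normal(size=n)
B = alpha_2 * A + beta_2 * C + rng.normal(size=n)  # eps_B
D = alpha_1 * B + rng.normal(size=n)               # eps_D, independent of eps_B

# E[D | B] = alpha_1 * B, so an OLS fit of D on B estimates alpha_1.
X = np.column_stack([np.ones(n), B])
alpha_1_hat = np.linalg.lstsq(X, D, rcond=None)[0][1]
print(alpha_1_hat)  # close to 0.5
```

<p>This only works because <span class="math inline"><em>ϵ</em><sub><em>D</em></sub></span> is independent of <span class="math inline"><em>ϵ</em><sub><em>B</em></sub></span>; with correlated noise terms, the regression coefficient would no longer match the structural one.</p>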
</section>
<section id="sec:refininggraphs-unobserved" data-number="1.2.3">
<h3 data-number="2.2.3"><span class="header-section-number">2.2.3</span> Unobserved Nodes</h3>
<p>Often, when we create a causal model of a system, we will not be able to observe all of its nodes. The values of some nodes may be completely unobserved, or latent. This can happen if we do not know how to measure the value of a node, if the node is too expensive to measure, or if a piece of data is too private or otherwise confidential. Depending on the causal structure of the system, we can often still address our specific task or question through computations over the nodes whose values are observed. To indicate an unobserved node in a causal graph, we mark its outline with a dashed line (Fig. 7).</p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Unobserved1.png" id="fig:refininggraphs-unobserved" alt="" /><figcaption>Figure 7: By convention, nodes that are completely unobserved are marked with a dashed outline.</figcaption>
</figure>
<p>In many situations, nodes may be partially observed or missing. For example, if a node value is expensive to collect, we may only sample it for a small number of our data entries. Or, our data collection might be systematically biased in some way. In such cases, we can model the <em>missingness mechanism</em> in the causal graph itself. By modeling the causes of partial observation, we will be able to directly analyze why data might be missing and understand the biases present in our observed data.</p>
<p>We can model the missingness mechanism in the causal graph itself by the following simple manipulation of the graph, as shown in Fig. 8.</p>
<ol type="1">
<li><p>We split the partially observed node, <span class="math inline"><em>Z</em></span>, into two nodes, one of which is the true value <span class="math inline"><em>Z</em></span> but is completely unobserved.</p></li>
<li><p>The second node, <span class="math inline"><em>Z</em><sup>*</sup></span> is observed, and caused by <span class="math inline"><em>Z</em></span> and a new missingness indicator <span class="math inline"><em>R</em><sub><em>Z</em></sub></span>.</p></li>
<li><p>The missingness indicator controls the value of <span class="math inline"><em>Z</em><sup>*</sup></span>. If <span class="math inline"><em>R</em><sub><em>Z</em></sub> = 1</span>, the value of <span class="math inline"><em>Z</em></span> is revealed as <span class="math inline"><em>Z</em><sup>*</sup></span>. Otherwise, if <span class="math inline"><em>R</em><sub><em>Z</em></sub> = 0</span>, the value of <span class="math inline"><em>Z</em></span> is not revealed and <span class="math inline"><em>Z</em><sup>*</sup></span> takes a null or empty value.</p></li>
<li><p>If data is observed or sampled at random, then the missingness indicator <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is an independent node, unconnected to the rest of the causal graph. If <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is systematically influenced by other factors in the system, then we draw the appropriate causal connections. In Fig. 8, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is caused by <span class="math inline"><em>C</em></span>.</p></li>
</ol>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Unobserved2.png" id="fig:missingness" alt="" /><figcaption>Figure 8: The node <span class="math inline"><em>Z</em></span> is partially observed. Whether we see its value in <span class="math inline"><em>Z</em><sup>*</sup></span> is controlled by a missingness indicator, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span>.</figcaption>
</figure>
<p>This manipulation can also be generalized to represent other systematic biases in values, beyond missing values. In such cases, the control node, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> in Fig. 8, would no longer be a missingness indicator, but rather a general bias indicator, and <span class="math inline"><em>Z</em><sup>*</sup></span> then a systematically biased version of <span class="math inline"><em>Z</em></span>.</p>
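<p>The split-node construction is easy to emulate in simulation. In this hypothetical sketch (the mechanisms are made up for illustration), <span class="math inline"><em>C</em></span> causes both the true value <span class="math inline"><em>Z</em></span> and the missingness indicator <span class="math inline"><em>R</em><sub><em>Z</em></sub></span>, so the values revealed in <span class="math inline"><em>Z</em><sup>*</sup></span> are a systematically biased sample of <span class="math inline"><em>Z</em></span>:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

C = rng.normal(size=n)
Z = C + rng.normal(size=n)            # true, partially observed value
p = 1 / (1 + np.exp(-2 * C))          # P(R_Z = 1) depends on C (not at random)
R_Z = rng.random(n) < p               # missingness indicator
Z_star = np.where(R_Z, Z, np.nan)     # observed node: Z revealed iff R_Z = 1

print(Z.mean())            # near 0: the true mean
print(np.nanmean(Z_star))  # clearly positive: naive mean over observed rows is biased
```

<p>If instead <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> were an independent node (data missing at random), the naive mean over the observed rows would be unbiased.</p>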
</section>
</section>
<section id="sec:building-causal-graph" data-number="1.3">
<h2 data-number="2.3"><span class="header-section-number">2.3</span> Where does a Causal Graph Come From?</h2>
<section id="creating-a-causal-graph" data-number="1.3.1">
<h3 data-number="2.3.1"><span class="header-section-number">2.3.1</span> Creating a Causal Graph</h3>
<p>When we are using a causal graph to reason about causal mechanisms, we assume that the causal graph captures everything that is important and relevant to the problem we are studying.<a href="#fn4" class="footnote-ref" id="fnref4" role="doc-noteref"><sup>4</sup></a> When we are trying to decide what is important and relevant, we can think about it in several stages:</p>
<p><em>Core factors related to the question:</em> First, we consider the question we are trying to answer based on our analysis of the graph. For example, if we are using our analysis to guide decision-making—i.e., whether to take a specific action—we will want our causal graph to include the action and its possible effects or outcomes, as well as other factors that influence the likelihood of those outcomes. When analyzing a particular dataset of actions and outcomes, we must also include the factors that influenced the likelihood of the action.</p>
<p><em>Adding additional relevant factors:</em> Second, we should look at the factors that we have decided are relevant to our task, and consider whether any of them have shared causes. If so, we might include those shared causes as well. The decision to include them can be based on how important it is to capture the fact that the given factors are correlated with one another. When we decide that it is not important to model the causes of some node in a causal graph, we call that node <em>exogenous</em>. If a node is determined entirely by other nodes within the causal graph, we call it <em>endogenous</em>.</p>
<p><em>Removing static factors:</em> Finally, we consider what is static and what is dynamic across the scenarios and environments we intend to represent with our causal graph. If some factor that is otherwise relevant to our causal question never changes, we will leave it out. For example, when analyzing the effect of a new recommendation policy in an online store, we may know that the effect of the policy depends on some basic societal and economic factors, such as the availability of Internet, electricity, and a stable monetary system. If we believe that the availability of these factors will not vary within the scope of our task, we can safely leave these out.</p>
<p>Decisions about whether a factor can be left out can be iterative. Often we will choose to begin with a simpler model and add in additional factors to improve the precision and accuracy of our model. For example, after building and experimenting with a simple model of the effect of recommendations, we might add in additional factors, such as whether users are viewing recommendations on mobile devices, tablets, or PCs, to help us better capture subtler effects.</p>
<p>Careful readers will note that, in this section, we have been referring to the “factors” that are relevant to a given question or system. Often, when we are first designing a causal graph, we will focus our thoughts on more abstract concepts and factors and then, only later, determine what specific measures in an experiment or features in a dataset can be used to represent those abstract factors.</p>
</section>
<section id="examples" data-number="1.3.2">
<h3 data-number="2.3.2"><span class="header-section-number">2.3.2</span> Examples</h3>
<p>In this section, we discuss three example scenarios, including toy causal models, the assumptions and modeling choices they represent, and how they might be extended and refined.</p>
<section id="example-1-online-product-purchases-and-recommendations" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Example 1: Online product purchases and recommendations</h4>
<p>Consider an online store that sells many products. Interest in each product may be driven by a number of product-specific factors, such as the quality of the product, product reviews, and marketing campaigns, as well as cross-product factors such as seasonality or brand reputation. There may be inherent complementarity or substitution among some products. For example, a person buying cookies may also become interested in buying milk. So, if a marketing campaign increases interest in cookies, it may also indirectly drive an increased interest in milk. Beyond these inherent relationships, the store, on each product’s web page, recommends several related products to people, thus potentially increasing interest in the recommended products. Recommendations are made based on some policy that the store might change. Suppose we want to better understand product browsing behavior under various recommendation policies: is one recommendation policy more effective than another? Fig. 9 shows one causal graph that models the influences on aggregate product browsing behavior.<a href="#fn5" class="footnote-ref" id="fnref5" role="doc-noteref"><sup>5</sup></a></p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Example1_DAG.png" id="fig:onlinestore-dag" alt="" /><figcaption>Figure 9: Causal graph representation of product browsing behavior (<span class="math inline"><em>P</em><sub>0</sub></span> and <span class="math inline"><em>P</em><sub><em>i</em></sub></span>) at an online store; external product-specific demand (<span class="math inline"><em>D</em><sub>0</sub></span> and <span class="math inline"><em>D</em><sub><em>i</em></sub></span>); cross-product demand (<span class="math inline"><em>S</em></span>); and the influence of a recommendation policy wherein browsing one product (<span class="math inline"><em>P</em><sub>0</sub></span>) influences the likelihood of browsing another (<span class="math inline"><em>P</em><sub><em>i</em></sub></span>).</figcaption>
</figure>
<p>The basic modeling choices we make in constructing our causal model simplify our analysis tasks by limiting the factors we will consider. In making these choices, understanding the domain is critical to designing a model that is tailored to addressing a specific task while capturing all relevant factors.</p>
<p>Some of the choices we made when designing this model are explicit in the graph itself. For example, we chose to declare that the factors that influence product interest can be abstractly represented by a single cross-product demand factor and by many product-specific demand factors (exactly one per product). We are assuming that the various product-specific demand factors, such as price, and the shared demand factors, such as brand reputation, are not affected by product interest. And, we are assuming that the first product <span class="math inline"><em>P</em><sub>0</sub></span> is not affected by recommendations from other products (in fact, we are not modeling any recommendations from other products <span class="math inline"><em>P</em><sub><em>i</em></sub></span> at all).</p>
<p>Another modeling choice clearly stated in the graph is the set of exogenous vs. endogenous nodes. Because our question is focused on the recommendation policy, the demand factors are represented as exogenous nodes. If our questions were instead focused on manipulating these demand factors (e.g., experimenting with new marketing campaigns), then we would want to add the causes of these demand factors into our causal graph.</p>
<p>Other choices are not represented in the graph, but are more subtly included within the definition of the nodes. For example, our product interest nodes are aggregated over all people browsing at an online store. In turn, our demand factors also represent factors that influence demand at an aggregate or population level. Alternatively, we could also have chosen to model product interest at an individual level. In that case, we probably would have also included more attributes about an individual, allowing us to study interactions between demand factors and individuals.</p>
<p>Our model also does not allow for the possibility of change in influence over time. Will the novelty of a recommendation system initially drive more interest in recommended products, but then fade over time? This particular model would not allow us to capture or analyze such changes in effect. In addition, we do not consider the relationship to other pages that may also show recommendations. For example, <span class="math inline"><em>P</em><sub>0</sub></span> itself may be a recommended product on some other product <span class="math inline"><em>P</em><sub>2</sub></span>’s page, in which case <span class="math inline"><em>P</em><sub>0</sub></span>’s browsing events are partially caused by the recommendation system too.</p>
<p>As we work with a model and refine it over time, we can revisit these design choices. We might split out demand factors in more detail, include individual-level information in our graph, or explicitly model time. How we refine a model will depend on how our understanding of the domain evolves as we gain experience, how our core questions and goals change, and the practical limitations of our experimental setting or data gathering framework.</p>
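<p>Even a simple adjacency structure makes these modeling choices concrete. Below is one possible encoding of the Fig. 9 graph as a dictionary mapping each node to its parents (the node names follow the figure caption; we instantiate two “other” products for concreteness):</p>

```python
# One possible encoding of the Fig. 9 graph: each node maps to its parents.
parents = {
    "S":  [],                        # cross-product demand (exogenous)
    "D0": [], "D1": [], "D2": [],    # product-specific demand (exogenous)
    "P0": ["S", "D0"],               # browsing of product 0
    "P1": ["S", "D1", "P0"],         # recommendations: browsing P0 drives Pi
    "P2": ["S", "D2", "P0"],
}

# Exogenous nodes are exactly those with no parents in the graph.
exogenous = sorted(v for v, ps in parents.items() if not ps)
print(exogenous)
```

<p>Refinements such as individual-level attributes or time-indexed nodes would show up here as new keys and new parent lists.</p>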
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Example2_DAG-crop.png" id="fig:datacenter-dag" alt="" /><figcaption>Figure 10: Causal graph for energy conservation in a data center. The predictive system uses the load at the previous time step (<span class="math inline"><em>L</em><sub><em>t</em> − 1</sub></span>) and the number of idle servers (<span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span>) to choose the number of servers kept on (<span class="math inline"><em>M</em><sub><em>t</em></sub></span>); together with the new load (<span class="math inline"><em>L</em><sub><em>t</em></sub></span>), this determines the number of restarted servers (<span class="math inline"><em>R</em><sub><em>t</em></sub></span>) and the overall cost (<span class="math inline"><em>C</em><sub><em>t</em></sub></span>).</figcaption>
</figure>
</section>
<section id="example-2-energy-conservation-in-a-data-center" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Example 2: Energy conservation in a data center</h4>
<p>Consider a data center containing thousands of servers. One key challenge in data centers is to minimize energy consumption while meeting stated performance objectives. One way to do so is to put idle servers into low-power mode or turn them off. However, if demand exceeds availability, then idle servers need to be restarted, which introduces a delay in the system. We thus want to maximize energy savings while minimizing the delay due to unavailability.</p>
<p>Fig. 10 shows a causal graph for this system. Let us assume that there is a predictive system that decides the number of servers to keep on based on a prediction of the load due to customers’ requests in the next time-step. This prediction uses the load at the last time step (<span class="math inline"><em>L</em><sub><em>t</em> − 1</sub></span>) and the number of idle servers at the previous time step (<span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span>) to decide the number of servers that are turned on at time <span class="math inline"><em>t</em></span>, <span class="math inline"><em>M</em><sub><em>t</em></sub></span>. We then observe the new load <span class="math inline"><em>L</em><sub><em>t</em></sub></span>, which along with <span class="math inline"><em>M</em><sub><em>t</em></sub></span> determines the number of servers that need to be restarted, <span class="math inline"><em>R</em><sub><em>t</em></sub></span>. The number of restarted servers and the number of running servers <span class="math inline"><em>M</em><sub><em>t</em></sub></span> together determine the cost of the predictive system, <span class="math inline"><em>C</em><sub><em>t</em></sub></span>. Note that the load at consecutive time-steps shares a common cause, customers’ request patterns; we therefore show this with a dashed double-sided arrow.</p>
<p>Using this graph, we can answer a number of interesting questions. We can estimate the effect of using the predictive system’s output <span class="math inline"><em>M</em><sub><em>t</em></sub></span> versus not turning off any servers. We can also compare the current predictive model to another model to evaluate which one leads to a better overall cost.</p>
<p>Note how this graph was constructed based on a high-level knowledge of the system architecture. Nodes like <span class="math inline"><em>M</em><sub><em>t</em></sub></span> may themselves be computed using complex machine learning models, but we choose to abstract them out into single nodes. We also chose to construct a causal graph over a single time-step though the system runs continuously. That is, we ignored the fact that <span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span> itself is a function of the model’s prediction at time <span class="math inline"><em>t</em> − 1</span>, <span class="math inline"><em>M</em><sub><em>t</em> − 1</sub></span>. In the specific case where the model <span class="math inline"><em>M</em></span> utilizes data from only the previous time-step, this is a valid simplification since the model treats <span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span> as a new, independent value. In other cases, such a simplification may lead to errors due to ignoring the feedback loop between the model and idle servers at different points in time. Additionally, there are a number of intermediate, recorded nodes that go from shutting off a server to energy consumption, but we chose to not include them since they are not the focus of the question. For another question—for example, the behavior of hardware components while saving energy—including those measurements will be critical.</p>
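<p>To make the policy comparison concrete, here is a toy one-time-step simulation in the spirit of Fig. 10. All dynamics, cost weights, and the simple “last load plus headroom” predictor are illustrative assumptions, not part of the system described above:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n, total = 10_000, 100          # simulated time steps, servers available

L_prev = rng.integers(20, 80, size=n)                            # load at t-1
L_t = np.clip(L_prev + rng.integers(-10, 11, size=n), 0, total)  # new load

def cost(M_t, L_t):
    R_t = np.maximum(L_t - M_t, 0)   # servers that must be restarted
    return 1.0 * M_t + 5.0 * R_t     # energy cost plus restart-delay penalty

M_pred = np.clip(L_prev + 10, 0, total)  # predictive: last load + headroom
M_all = np.full(n, total)                # baseline: never turn anything off

print(cost(M_pred, L_t).mean(), cost(M_all, L_t).mean())
```

<p>Under these assumed dynamics the predictive policy is cheaper on average; with a different restart penalty or a noisier load process, the comparison could flip, which is exactly the kind of question the causal graph is meant to help answer.</p>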
<figure>
<img src="/assets/Figures/Chapter2/MnistExamples.png" id="fig:mnist-data" alt="" /><figcaption>Figure 11: MNIST: A dataset of hand-written digits. Some images are rotated by an angle up to 90 degrees.</figcaption>
</figure>
</section>
<section id="example-3-rotated-mnist-images" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Example 3: Rotated MNIST images</h4>
<p>As our third and final example, we consider the well-known handwriting recognition dataset called MNIST. This dataset contains images of digits and the task is to detect the digit shown in each image. We consider a subset of digit classes from <span class="math inline">0</span> to <span class="math inline">4</span> and include a twist: each image is rotated by an angle by some unknown data-generating process. Fig. 11 shows a random sample of the images from this rotated MNIST dataset.</p>
<p>To start with, it is hard to think of a causal graph for this system. All we are provided are a set of static images without any flow of information or causality. To proceed, let us try to reconstruct what process may have generated this data. Thinking of how people write down numeric digits, it is possible that people may have decided the digit they wanted to write and then written that digit. Thus, the digit class causes the specific shape we see on an image. Alternatively, the images may have been sampled from a random collection of shapes and someone might have selected them manually and labelled them as one of the ten digits. In this case, it is the shape of the region in an image that causes its digit classification. In either case, we can assume that there is a causal relationship between a specific shape and the digit class. We represent it using an undirected causal edge in Fig. 12.</p>
<p>In addition to shape, the angle of rotation seems to be associated with the digit. In Fig. 11 the digits <span class="math inline">0</span> and <span class="math inline">2</span> are never rotated much whereas other digits are rotated up to 90 degrees. However, from our understanding of digit recognition, we can safely assume that the angle of rotation cannot determine the digit. It may be that some images were rotated before they were captured or that these images were rotated based on their digit class after they were captured. In the first case, we can assume that some unobserved process causes both the digit class and its rotation. In the second, the digit class causes an unknown variable that decides the angle of rotation. Causal graphs for these mechanisms are shown in Figs. 12 (a), 12 (b).</p>
<p>While we do not know the exact mechanism, this set of causal graphs provides important information about building a classifier that can generalize to different data distributions beyond the current one. Since shape is causally related to the digit class in all graphs, it should be included in a predictive model. However, in both graphs, digit and angle of rotation share no direct relationship. Specifically, their relationship depends completely on an unobserved node connecting them that acts as a fork (Fig. 12 (a)) or as the central node in a chain (Fig. 12 (b)). If the value of the unobserved node changes, their relationship also changes. In other words, given the unknown node, angle of rotation and digit class are conditionally d-separated because <span class="math inline"><em>U</em></span> is either a fork or the centre of a chain path. Therefore, these graphs imply that angle of rotation is not a causal feature and should not be included in any predictive model for digit recognition.</p>
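<p>The fork mechanism of Fig. 12 (a) can be mimicked with toy variables (the distributions below are made up for illustration): an unobserved <span class="math inline"><em>U</em></span> drives both the digit class and the rotation angle, so the two are correlated marginally but uncorrelated within each stratum of <span class="math inline"><em>U</em></span>:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

U = rng.integers(0, 2, size=n)                   # unobserved binary cause
digit = np.where(U == 1,
                 rng.integers(3, 5, n),          # U=1 -> digits 3-4
                 rng.integers(0, 3, n))          # U=0 -> digits 0-2
angle = 45 * U + rng.uniform(-10, 10, size=n)    # U=1 -> larger rotations

print(np.corrcoef(digit, angle)[0, 1])           # clearly nonzero marginally
for u in (0, 1):                                 # vanishes once U is fixed
    m = U == u
    print(np.corrcoef(digit[m], angle[m])[0, 1])
```

<p>A classifier trained on this data could exploit the marginal correlation, but as soon as the distribution of <span class="math inline"><em>U</em></span> shifts, that shortcut breaks, which is why the angle should not be treated as a causal feature.</p>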
<p>This example underscores the point that a causal graph need not be complete or uniquely determined to be useful. As we noted before, we are looking for graphs that capture the major assumptions and constraints that can be known from domain knowledge, not the full causal graph of a system, which may be infeasible to obtain. Thus, it is more helpful to have an incomplete graph than no graph at all.</p>
<div id="fig:mnist-causal-graph" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Mnist_graph2.png" id="fig:mnist-causal-graph1" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Mnist_graph1.png" id="fig:mnist-causal-graph2" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 12: Possible causal graphs for the rotated MNIST images dataset. In both graphs, <span class="math inline"><em>U</em></span> is unobserved. </p>
</div>
</section>
</section>
</section>
<section id="sec:po-framework" data-number="1.4">
<h2 data-number="2.4"><span class="header-section-number">2.4</span> Potential Outcomes Framework</h2>
<p>The <em>potential outcomes (PO) framework</em> is an alternative to causal graphs for reasoning about causal assumptions and setting up analyses. While causal graphs focus on the structure of the causal relationships themselves as the primary language for declaring assumptions, the potential outcomes framework places its focus on causal inference as a missing data problem. Recall from Chapter <a href="#ch_patternsandpredictionsarenotenough" data-reference-type="ref" data-reference="ch_patternsandpredictionsarenotenough">3</a> that the causal effect is defined as the difference in outcomes <span class="math inline"><em>Y</em></span> between a world where treatment is given <span class="math inline">World(<em>T</em> = 1)</span>, and a world where treatment is not given <span class="math inline">World(<em>T</em> = 0)</span>. <span id="eq:causal-effect-worlds"><br /><span class="math display">Causal Effect = <em>Y</em><sub>World(<em>T</em> = 1)</sub> − <em>Y</em><sub>World(<em>T</em> = 0)</sub></span><br /></span> The potential outcomes (PO) framework formalizes the notion of <span class="math inline"><em>Y</em></span>’s value in different worlds as a new statistical variable. Specifically, for every outcome <span class="math inline"><em>Y</em></span>, it defines a set of potential outcomes based on different values of the treatment, <span class="math inline"><em>Y</em><sub><em>T</em></sub></span>. The key point is that only one of the potential outcomes <span class="math inline"><em>Y</em><sub><em>T</em> = <em>t</em></sub></span> is observed and the remaining potential outcomes are unobserved. In other words, there is a single observed outcome, and the goal is to estimate all other unobserved, potential outcome values that <span class="math inline"><em>Y</em></span> could have taken under a different <span class="math inline"><em>T</em></span>. 
In the PO framework, these different values of the outcome are denoted by a subscript, <span class="math inline"><em>Y</em><sub><em>T</em> = <em>t</em>′</sub></span>.</p>
<p>Note that a potential outcome is not the same as probabilistic conditioning. Critically, <span class="math inline"><em>Y</em><sub><em>T</em> = 1</sub></span> does not correspond to conditioning on <span class="math inline"><em>T</em> = 1</span> in the observed data, but rather conveys the causal relationship between <span class="math inline"><em>Y</em></span> and <span class="math inline"><em>T</em></span>. That is, <span class="math inline"><em>Y</em><sub><em>T</em> = 1</sub></span> represents an intervention on <span class="math inline"><em>T</em></span> by setting it to <span class="math inline">1</span> without changing the rest of the world (i.e., all other relevant variables, both observed and unobserved, are held constant). Thus <span class="math inline"><em>P</em>(<em>Y</em><sub><em>T</em> = 1</sub>) ≠ <em>P</em>(<em>Y</em>|<em>T</em> = 1)</span>. For a binary treatment, potential outcomes provide a succinct way to formalize Eq. 5 for the causal effect. <br /><span class="math display">$$\begin{split}
\text{Causal Effect} = \mathbb{E}[Y_{T=1}- Y_{T=0}] = \mathbb{E}[Y_{T=1}]- \mathbb{E}[Y_{T=0}]
\end{split}$$</span><br /> Since, for any particular unit (e.g., a person), we can only observe one of the potential outcomes, the PO framework translates the problem of causal inference into that of estimating the missing potential outcome. For example, if we observe <span class="math inline"><em>Y</em><sub><em>T</em> = 1</sub></span> for a particular unit, then we can calculate the causal effect of <span class="math inline"><em>T</em></span> if we can correctly estimate the unobserved potential outcome <span class="math inline"><em>Y</em><sub><em>T</em> = 0</sub></span>. Note that randomized experiments conveniently allow our observations of potential outcomes in one randomized group to provide unbiased estimates of the unobserved potential outcomes of other randomized groups.</p>
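To make the distinction between <span class="math inline"><em>P</em>(<em>Y</em><sub><em>T</em> = 1</sub>)</span> and <span class="math inline"><em>P</em>(<em>Y</em>|<em>T</em> = 1)</span> concrete, here is a minimal Python sketch (not from the book’s notebooks; the variable names and all numbers are illustrative) in which a confounder <code>w</code> raises both the probability of treatment and the outcome:

```python
import random

random.seed(0)

# Illustrative data-generating process: W confounds T and Y.
# W raises the chance of treatment AND adds +2 to the outcome,
# while the true causal effect of T on Y is +1.
n = 100_000
data = []
for _ in range(n):
    w = 1 if random.random() < 0.5 else 0                    # confounder
    t = 1 if random.random() < (0.8 if w else 0.2) else 0    # W influences T
    y = t + 2 * w                                            # true effect of T is +1
    data.append((w, t, y))

# Conditioning: E[Y | T=1] - E[Y | T=0]. This mixes in the effect of W,
# because treated units are more likely to have W = 1.
treated = [y for w, t, y in data if t == 1]
untreated = [y for w, t, y in data if t == 0]
naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)

# Intervention: set T for every unit while holding W fixed, i.e., compare
# the potential outcomes Y_{T=1} = 1 + 2w and Y_{T=0} = 0 + 2w per unit.
po_effect = sum((1 + 2 * w) - (0 + 2 * w) for w, t, y in data) / n

print(round(naive, 2))      # roughly 2.2: inflated by confounding
print(round(po_effect, 2))  # 1.0: the true causal effect
```

Conditioning on <code>t</code> folds the confounder’s contribution into the estimate, while the interventional (potential-outcome) contrast recovers the true effect of +1.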
<p>While the potential outcome framework is most commonly used for estimating the effect of treatment, the framework itself is general and can be used to denote the potential value of any variable. For example, <span class="math inline"><em>A</em><sub><em>B</em> = 1</sub></span> represents the potential value of <span class="math inline"><em>A</em></span> when <span class="math inline"><em>B</em></span> is set to 1.</p>
<section id="comparing-potential-outcomes-with-causal-graphs" data-number="1.4.1">
<h3 data-number="2.4.1"><span class="header-section-number">2.4.1</span> Comparing potential outcomes with causal graphs</h3>
<p>Since the potential outcome variables also convey causal relationships, it is important to compare them to structural causal models. In many ways, causal graphs and potential outcomes are compatible. They both emphasize the difference between statistical conditioning and causal effect. While causal graphs do so by providing a representation of the flow of causality without using statistical variables, potential outcomes do so by creating entirely new statistical variables. For instance, a two-node graph like that in Fig. 1 (a) can be represented by <span class="math inline"><em>B</em><sub><em>A</em> = <em>a</em></sub></span>, where the choice of subscript variable denotes an assumption about the direction of causal effect.</p>
<p>This difference becomes more pronounced when we want to represent assumptions about more complex systems. The PO framework does not have a good representation for relationships between variables other than the treatment and outcome. Instead, the focus is on reducing those relationships to questions about their effects on <span class="math inline"><em>T</em></span> and <span class="math inline"><em>Y</em></span>. For example, an analyst in the PO framework may ask whether the treatment assignment mechanism is known, whether the treatment is randomly assigned, or whether there are other variables that cause the treatment. If there are such other variables, do they also cause the outcome <span class="math inline"><em>Y</em></span>? Based on the answers to these questions, the analyst decides how to identify and estimate the missing potential outcome. The advantage of the PO framework is its small set of well-tested and trusted identification and estimation strategies for finding the causal effect of a treatment. However, each of these strategies requires specific assumptions about the treatment assignment mechanism, often including the shape of the functions governing the underlying mechanisms.</p>
<p>In contrast, the SCM framework focuses on making all the assumptions as transparent as possible. When confronted with a non-randomized treatment assignment, an analyst constructs a graph expressing their assumptions about the causal mechanisms in the system. For instance, they may ask: What factors are causing the treatment? Are there specific structures among them (e.g., colliders) that can be exploited? If there is confounding, is it due to missing data or to fully unobserved confounders? Such an analysis brings out all the causal assumptions that go into a future identification and estimation exercise, which unfortunately remain opaque in the PO framework. Moreover, a graph is a more general construct for causal reasoning that can be useful for many other questions about a system, beyond a specific effect. Once a causal graph is constructed, it allows questions about the effect between any pair of nodes, the effect of groups of nodes, the causal features for a particular outcome, the cascading nature of certain causal effects, and so on.</p>
<p>Put another way, the PO framework focuses directly on estimation of the effect, whereas the SCM framework emphasizes specifying the causal mechanisms first. Given that an effect estimate depends heavily on the causal assumptions that go into it, the importance of transparency in causal assumptions cannot be overstated. Unlike predictive machine learning estimates that can be validated objectively using cross-validation metrics, no such validation procedure exists for causal estimates. Thus, a qualitative benefit of causal graphs is that they are essentially a simple-to-interpret diagram that can be shared with different stakeholders, promoting a transparent and informed discussion about the causal assumptions that went into an analysis.</p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_po-versus-graphs-crop.png" id="fig:po-causal-graph" alt="" /><figcaption>Figure 13: A single-confounder system represented in the structural causal model and potential outcome frameworks.</figcaption>
</figure>
</section>
<section id="mixing-causal-and-statistical-assumptions" data-number="1.4.2">
<h3 data-number="2.4.2"><span class="header-section-number">2.4.2</span> Mixing causal and statistical assumptions</h3>
<p>More fundamentally, however, the PO framework mixes causal and statistical assumptions within the same representation. To illustrate this point, we provide a simple modeling exercise using the PO framework. Assume a three-variable system where <span class="math inline"><em>W</em></span> is a common cause of <span class="math inline"><em>T</em></span> and <span class="math inline"><em>Y</em></span>. Under the PO framework, we may write a regression equation, <span id="eq:po-regression"><br /><span class="math display">$$\label{eq:po-regression}
\begin{split}
y = f(t, w) + \epsilon ; \ \ \ \mathbb{E}[\epsilon|t,w]=0
\end{split}\qquad(6)$$</span><br /></span> This equation represents several assumptions about the underlying system. First, it conveys the direction of the causal relationship. Implicitly, the LHS, <span class="math inline"><em>y</em></span>, is the effect, and the RHS terms, <span class="math inline"><em>f</em>(<em>t</em>, <em>w</em>) + <em>ϵ</em></span>, are its causes. Second, by assuming that the expected value of the error term is <span class="math inline">0</span>, it conveys that the error <span class="math inline"><em>ϵ</em></span> is mean-independent of the RHS variables, <span class="math inline"><em>T</em></span> and <span class="math inline"><em>W</em></span>. Third, this also serves as an estimating equation. The same equation is used for estimating the effect by making assumptions on the family of functions (e.g., all linear functions) and fitting a particular <span class="math inline"><em>f̂</em></span> to available data.</p>
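To see how the same equation doubles as an estimating equation, consider a minimal sketch. Everything here is illustrative and not from the book: we simulate data from a hypothetical linear process <code>y = 3*t + 2*w + eps</code> with binary treatment and confounder, and fit <code>f_hat</code> with a saturated model (the four cell means), then read off the effect of <code>t</code> holding <code>w</code> constant:

```python
import random

random.seed(1)

# Hypothetical data-generating process (coefficients are illustrative):
# y = 3*t + 2*w + eps, with E[eps | t, w] = 0 as the equation assumes.
data = []
for _ in range(50_000):
    t = random.randint(0, 1)
    w = random.randint(0, 1)
    eps = random.uniform(-1, 1)
    data.append((t, w, 3 * t + 2 * w + eps))

# With binary t and w, fitting f_hat within a saturated function family
# reduces to estimating the four cell means E[y | t, w].
def cell_mean(t, w):
    ys = [y for ti, wi, y in data if ti == t and wi == w]
    return sum(ys) / len(ys)

f_hat = {(t, w): cell_mean(t, w) for t in (0, 1) for w in (0, 1)}

# Effect of t holding w constant, averaged over the two values of w.
effect = 0.5 * ((f_hat[1, 0] - f_hat[0, 0]) + (f_hat[1, 1] - f_hat[0, 1]))
print(round(effect, 1))  # close to the true coefficient 3
```

Note how the causal content (that <code>t</code> and <code>w</code> are causes of <code>y</code>, and that the error is mean-independent of them) is baked into the same object used for the purely statistical fitting step.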
<p>More generally, a single PO equation simultaneously conveys three of our four stages of causal reasoning: modeling, identification, and estimation. It often includes both causal assumptions (such as the direction of effect) and statistical assumptions (such as the family of estimating functions). To some degree, this brevity can be seen as a strength. Yet it can also be a weakness, when a concise representation leads to causal assumptions being made implicitly, or sometimes asserted separately in a less rigorous notation (i.e., natural language). While both graphs and the PO representation convey similar ideas, in this book we prefer using causal graphs and structural equations for modeling causal assumptions, to more clearly distinguish causal from statistical assumptions.</p>
</section>
<section id="the-best-of-both-frameworks" data-number="1.4.3">
<h3 data-number="2.4.3"><span class="header-section-number">2.4.3</span> The best of both frameworks</h3>
<p>We will see in the following chapters that, while the PO framework has some weaknesses in the modeling stage of causal analysis, it provides useful, common recipes for the identification stage, and shines in causal effect estimation. The PO framework provides a suite of well-tested and broadly used estimation methods, chosen based on constraints on the function family, the size of the data, or its dimensionality. Because it directly deals with statistical equations, the PO framework is also better equipped to handle constraints in a data-generating process, such as monotonicity of the effect.</p>
<p>In this book, therefore, we mix and match elements of causal graphs and potential outcomes across the four stages of causal analysis—modeling, identification, estimation, and refutation. While we use primarily causal graphs and structural equations for capturing models and assumptions in the first stage, we will use both causal-graph based and potential outcomes identification, estimation, and refutation strategies.</p>
</section>
</section>
</section>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>We will discuss methods for handling cycles in causal graphs in Chapter <a href="#ch_practicalconsiderations" data-reference-type="ref" data-reference="ch_practicalconsiderations">10</a>.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2" role="doc-endnote"><p>There are more complicated causal graph notations that include specific annotations (different kinds of arrows and nodes) to indicate specific classes of interactions, such as mediation and interaction, though other kinds of interactions remain ambiguous. While we believe such notation can be useful, in this book, we will keep to the simpler graph notation both for simplicity of presentation and to avoid over-emphasizing some kinds of interactions over others.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3" role="doc-endnote"><p>Readers already familiar with structural equations might miss the <span class="math inline"><em>ϵ</em></span> noise factor. Do not worry, we will add them in soon, in Section <a href="#sec:refininggraphs-noise" data-reference-type="ref" data-reference="sec:refininggraphs-noise">2.2.2</a>.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn4" role="doc-endnote"><p>This does not necessarily mean we will expect to be able to measure everything in the graph. There might be factors that we have added to our graph that will remain unobserved in our datasets. We will expand on unobserved nodes and their implications in Section <a href="#sec:refininggraphs-unobserved" data-reference-type="ref" data-reference="sec:refininggraphs-unobserved">2.2.3</a>.<a href="#fnref4" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn5" role="doc-endnote"><p>Fig. 9 uses <em>plate notation</em> to represent the influence of <span class="math inline"><em>P</em><sub>0</sub></span> on a total of <span class="math inline"><em>N</em></span> products <span class="math inline"><em>P</em><sub><em>i</em></sub></span>. In plate notation, the rectangle with marker <span class="math inline"><em>N</em></span> is a summary of a repeated graph structure. I.e., the nodes <span class="math inline"><em>D</em><sub><em>i</em></sub></span> and <span class="math inline"><em>P</em><sub><em>i</em></sub></span> inside the rectangle are repeated <span class="math inline"><em>N</em></span> times.<a href="#fnref5" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
Fri, 20 Mar 2020 00:00:00 +0000
https://causalinference.gitlab.io/causal-reasoning-book-chapter2/
Chapter 1: Causal Reasoning Book<h1 id="ch_patternsandpredictionsarenotenough">Introduction</h1>
<p>Machine learning algorithms are increasingly embedded in applications
and systems that touch upon almost every aspect of our work and daily
lives, including in societally critical domains such as healthcare,
education, governance, finance and agriculture. Because decisions in all
these domains can have wide ramifications, it is important that we not
only understand why a system makes a decision, but also understand the
effects (and side-effects) of that decision, and how to improve
decision-making to achieve more desirable outcomes. As we shall see in
this chapter, all of these questions are what we call <em>causal</em> questions
and, unlike many conventional machine learning tasks, cannot be answered
using only passively observed data. Moreover, we will show how even
questions that do not seem causal, such as pure prediction questions,
can also benefit from causal reasoning.</p>
<p>First, let us briefly and informally define causal reasoning as the
study of cause-and-effect questions, such as: Does A cause B? Does
recommending a product to a user make them more likely to buy it? If so,
how much more likely? What makes a person more likely to repay a loan or
be a good employee (or employer)? If the weather gets hot, will crops
wilt? Causal reasoning is the study of these questions and how to answer
them.</p>
<p>In this book, we focus on causal reasoning in the context of machine
learning applications and computing systems more broadly. While most of
what we know about causal reasoning from other domains remains useful in
the context of computing applications, computing systems offer a unique
set of challenges and opportunities that can enrich causal reasoning. On
the one hand, the scale of data, its networked nature, and
high-dimensionality challenge standard methods used for causal
reasoning. On the other hand, these systems allow control over data
gathering and measurement and allow easy combination of passive
observations and active experimentation, thereby providing opportunities
for rethinking typical methods for causal reasoning.</p>
<p>This introductory chapter motivates the use of causal reasoning. Given
how machine learning systems are being used in almost all parts of our
society, we discuss a wide range of use-cases ranging from recommender
systems in online shopping and algorithmic decision support in medicine
to hiring and criminal justice; and data-driven management for
agriculture and industrial applications. We discuss how these are all
fundamentally interventions that require causal analysis for
understanding their effects. Further, we frame the connections between
causal reasoning and critical machine learning challenges, such as
domain adaptation, transfer learning, and interpretability.</p>
<h2 id="what-causal-reasoning">What is Causal Reasoning?</h2>
<h3 id="brief-philosophy">Brief Philosophy</h3>
<p>Causal reasoning is an integral part of scientific inquiry, with a long
history starting from ancient Greek philosophy. Fields ranging from
biomedical to social sciences rely on causal reasoning to evaluate
theories and answer substantive questions about the physical and social
world that we inhabit. Given its importance, it is remarkable that many
of the key statistical methods have been developed only in the last few
decades. As Gary King, a professor at Harvard University puts it,</p>
<blockquote>
<p><em>“More has been learned about causal inference in the last few decades
than the sum total of everything that had been learned about it in all
prior recorded history”.</em></p>
</blockquote>
<p>This might seem puzzling. If causal reasoning is so critical, then why
hasn’t it become a common form of reasoning such as logical or
probabilistic reasoning? The issue is that “causality” itself is a
nebulous concept. From Aristotle to Hume and Kant, many philosophers and
scholars have attempted to define causality but have not reached a
consensus so far.</p>
<p>To understand the difficulty, let us first ask you, the reader, to let
go of this book and drop it on the floor—and then pick it up again and
continue reading! Now, let us ask, what was the cause of the book
dropping? Did the book fall because you let go of the book? Or did the
book fall because we, the authors, asked you, the reader, to drop it?
Perhaps you would have let go of the book even if we had not asked you
to. Maybe it was gravity. Perhaps the book fell because the reader is
not an astronaut reading the book in space.</p>
<p>This simple example of the falling book illustrates many of the
important, philosophical challenges that have vexed philosophers’
efforts to conceptualize causality. These include basic concepts of
abstractions, as well as sufficient and necessary causes. E.g., gravity
is of course necessary, but not sufficient, to cause the book to
fall—gravity together with the reader letting go of the book is both
necessary and sufficient for the book to fall. This example also
illustrates both proximate and ultimate causes. E.g., the reader
dropping the book is a proximate cause, the authors asking the reader to
drop the book may be a more distant, ultimate cause. Finally, this
example raises the question of whether causes must be deterministic. In
other words, does the likelihood that not all (or even most) readers are
suggestible enough to drop this book when asked imply that the authors’
request is not a cause at all? Or is it possible for our request to be
considered a probabilistic cause?</p>
<p>Hume asks how we know—how we learn—what causes an event? Consider the
simple act of striking a match and observing that it lights up. Would we
say that striking causes the match to light up? Believing in data, say
we repeat this action 1000 times and observe the same outcome each time.
Hume argues that, while this may seem to provide strong evidence that
striking the match causes it to light, this specific experiment only
demonstrates predictability, and argues that these results are
indistinguishable from the case where the two events just happen to be
perfectly correlated with each other. Hume proposed this quandary in his
book, “A Treatise of Human Nature”, in 1739, and concludes that
causality must be a mental construct that we assign to the world, and
thus does not exist outside it. Other scholars challenge this
provocation and argue for the existence of causality.</p>
<p>These philosophical challenges are essentially questions of abstraction.
Modern advances in causal reasoning have <em>not</em> come through answering
most of these provocations directly but, rather, by creating flexible
methods for reasoning about the relationships between causes and effects
regardless of the abstractions one chooses. In this book, therefore, we
attempt to steer clear of the above philosophical ambiguities and adopt
one of the more simple and practical approaches to causal reasoning,
known as the interventionist definition of causality.</p>
<h3 id="defining-causation">Defining Causation</h3>
<p><strong>Definition</strong>: In the interventionist definition of causality, we say
that an event <em>A</em> causes another event <em>B</em> if we observe a difference in
<em>B</em>’s value after <em>changing</em> <em>A</em>, <em>keeping everything else constant</em>.</p>
<p>Due to causal reasoning’s early applications in medicine (which we will
discuss in chapter 3), it is customary to call
<em>A</em> the “treatment” (also sometimes called “exposure”), or simply the
cause. <em>B</em> is referred to as the “outcome”. Readers familiar with
reinforcement learning may analogize <em>A</em> as the “action” and <em>B</em> as the
“reward”. In general, these events are associated with <em>measurement
variables</em> that describe them quantitatively, e.g., the dosage of a
treatment drug and its outcome in terms of blood pressure, which we
refer to as the treatment variable and the outcome variable
respectively. For convenience, we use events and their measurement
variables interchangeably, but it is important to remember that
causality is defined over events, and that the same events can
correspond to different variables when measured differently.</p>
<h3 id="interventions-and-counterfactuals">Interventions and Counterfactuals</h3>
<p>There are two phrases in the above definition that need further
unpacking: “changing A”, and “keeping everything else constant”. These
correspond to the two key concepts in causal reasoning: an
<em>intervention</em> and a <em>counterfactual</em> respectively. An intervention
refers to any action that actively changes the value of a treatment
variable. Examples of an intervention are giving a medicine to a
patient, changing the user interface of a website, awarding someone a
loan, and so on. It is important to distinguish it from simply observing
different values of the treatment. That is, assigning specific people to
try out a new feature of a system is an intervention, but not if people
found out and tried the feature themselves. While this might seem a
small difference, its importance cannot be overstated: these two are
fundamentally different and can lead to varying, and even opposite
conclusions when analyzing the resultant data. In particular, in the
observational case, it is hard to know whether any observed effect
(e.g., increased usage) is due to the feature or due to characteristics
of the people (e.g., high-activity users) who were able to discover the
feature. The history of causal reasoning is replete with examples where
observations were used in place of interventional data that sometimes
led to disastrous results. We will discuss some of them in the book.</p>
<p><strong>Intervention:</strong> An active action taken that changes the distribution
of a variable <em>T</em>.</p>
<p>To gain a valid interpretation of its effect, however, an intervention
must be performed “keeping everything else constant”. That is, it is not
enough to take an action; we must also ensure that none of the other relevant
factors change, so that we can isolate the effect of the intervention.
Continuing our example on estimating the effect of a new feature, it is
not enough to merely assign people to try it, but also ensure that none
of the other system components changed at the same time. From early
experiments in the natural sciences, such an intervention came to be
known as a “controlled” experiment, where we clamp down values of
certain variables to isolate the effect of the intervention.</p>
<p>While “controlling” or keeping other variables constant is intuitive, it
is unclear which variables to include. We can obtain a more
precise definition by utilizing the second key concept of causal
reasoning, counterfactuals. The idea is to compare what happened after
an intervention to what would have happened without it. That is, for any
intervention, we can imagine two worlds, identical in every way up until
the point where some “treatment” occurs in one world but not the
other. Any subsequent difference in the two worlds is then, logically, a
consequence of this treatment. The first one is the observed, <em>factual</em>
world, while the second one is the unobserved, <em>counterfactual</em> world
(the word counterfactual means “contrary to fact” ). The counterfactual
world, identical to the factual world except for the intervention,
provides a precise formulation to the “keeping everything else constant”
maxim. The value a variable takes in this world is called a
<em>counterfactual value</em>.</p>
<p><strong>Counterfactual Value:</strong> The (hypothetical) value of a variable under
an event that did not happen.</p>
<p>Putting together counterfactuals and interventions, causal effect of an
intervention can be defined as the difference between the observed
outcome after an intervention and its counterfactual outcome without the
intervention. We express the outcome under the factual world as
<em>Y</em><sub>World1(<em>T</em> = 1)</sub>, and that under the counterfactual world as
<em>Y</em><sub>World2(<em>T</em> = 0)</sub>. For a binary treatment, its causal effect can be
written as,
Causal Effect := <em>Y</em><sub>World1(<em>T</em> = 1)</sub> − <em>Y</em><sub>World2(<em>T</em> = 0)</sub>.</p>
<p>The above equation shows that inferring the effect of an intervention
can be considered as the problem of estimating the outcome under the
counterfactual world, since the factual outcome is usually known. Thus,
counterfactual reasoning is key to inferring causal effect. Coming back
to the match stick example, we can define our intervention as striking a
match. The factual world is the world where we strike the match and see
it light up, and the counterfactual world is the world where we do not
strike the match but keep everything else the same. Under our
interpretation of causality, one expects that the match would not light
up in the counterfactual world, and hence we can claim that striking the
match causes light. Happily, our conclusion coincides with common
intuition, and as we shall see, counterfactual reasoning applies well to
many practical problems. That said, we must emphasize that this
definition of causality is not absolute; it depends on the initial world
in which one starts. For instance, in the match-stick example, if we
started in an oxygen-free environment (or in outer space), and applied
the same counterfactual reasoning, we would conclude that striking does
not cause lighting up, illustrating Hume’s dilemma.</p>
<h2 id="sec:ch1-randomized-exp">The Gold Standard: Randomized Experiment</h2>
<p>Let us now apply the above two concepts to describe one of the most
popular ways of causal reasoning, a randomized experiment. We consider a
simple example of deciding whether to recommend a medication to a
patient Alice. As we discussed above, we can evaluate this decision by
considering the <em>causal effect</em> of the medication on Alice. Here the
<em>treatment</em> is administering the medication and the <em>outcome</em> is Alice’s
health afterwards. From the equation above, we
can define causal effect as the difference between the value of <em>Y</em> in a
world where we gave Alice the treatment <em>T</em> (<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 1)</sub>) and
where we did not (<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 0)</sub>).
Causal Effect = <em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 1)</sub>]−<em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 0)</sub>]</p>
<p>This may seem straightforward, but the fundamental challenge is that
this calculation requires taking the difference between an observed
outcome and a counterfactual that we cannot observe. If we want to
calculate this difference, we can either 1) observe the outcome of
giving Alice the treatment <em>T</em> and compare it to the unobserved
counterfactual outcome of not giving her the treatment; or we can 2)
observe the outcome of not giving Alice the treatment <em>T</em> and compare it
to the unobserved counterfactual outcome of giving her the treatment.</p>
<p>No matter what we do, we cannot, in any single experiment, both <em>do</em> <em>T</em>
and <em>not do</em> <em>T</em>! Whatever we actually do, the counterfactual will
remain unobserved. This is called the “missing data” problem of causal
inference. If
Alice takes a medication and we observe that she then gets better, we
cannot also observe what would have happened if she hadn’t taken the
medication. Would Alice have gotten better on her own, without the
medication? Or would she have stayed ill?</p>
<p><img src="/assets/Figures/missing-data-problem.png" alt="Missing data" height="350px" width="276px" /></p>
<h4 id="figure-11-the-missing-data-problem-of-causal-inference">Figure 1.1: The missing data problem of causal inference.</h4>
<p>To solve this problem, we need to make further assumptions on the
intervention or the counterfactual. For instance, if Alice has an
identical twin, Beth, who is also sick, we might give the medication to
Alice, but not give the medication to Beth. Then we can make an
assumption about the counterfactual: identical twins have the same
counterfactual when it comes to health outcomes. That is, we argue that
Beth is so similar to Alice—in terms of general health, genetics,
specifics of the illness, etc.—that they are likely to experience the
same outcomes, but for the medication. In this case, we could use our
observation of Beth’s outcomes as an estimate of Alice’s
counterfactual—what would have happened to Alice had she not taken the
medication.</p>
<p>Causal Effect = <em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 1)</sub>]−<em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Beth</sub> = 0)</sub>]</p>
<p>But of course, not everyone has an identical twin, much less an
identical twin with identical general health, habits, and illness. And
if two individuals are not identical, then there will always be a
question of whether differences in outcomes are due to the underlying
dissimilarities between them, instead of due to the medication. When the
two twins do not share the same illness, or more generally, when
comparing two different people, we cannot expect that their
counterfactuals will match. These differences that confuse our
attribution of differences in outcome to differences in the treatment
are called <em>confounders</em>.</p>
<p>Another approach is to make assumptions about the intervention. For example,
assume that there is a dataset of patient outcomes where medications
were given irrespective of the actual health condition. That is, the
outcome of any person still depends on their health condition, but
whether they took the drug does not. Therefore, if we now compare any
two individuals with or without the drug, we can argue that there is no
systematic difference between them. Over a large enough sample, the
average outcomes of the treated group can approximate the average
counterfactual outcomes of the untreated, and vice-versa. This allows us
to estimate the effect of an intervention defined over a group of
people, rather than just Alice.
Causal Effect = <em>E</em>[<em>Y</em><sub>World(<em>T</em> = 1)</sub>|<em>r</em> = 1] − <em>E</em>[<em>Y</em><sub>World(<em>T</em> = 0)</sub>|<em>r</em> = 0]</p>
<p>where <em>r</em> ∈ {0, 1} is a random variable indicating whether the treatment was given. More generally, the core idea is
that instead of trying to find two individuals that are identical to one
another, we find two populations that are essentially identical to one
another. This may seem like it should be harder than identifying two
individuals that are the same—after all, now we have to find many people
not just two—but in practice, it is actually easier. It turns out that,
if we want to find the average effect of a treatment, we just need to
ensure that there are no systematic differences between the groups as a
whole. The advantage is that we no longer need to find identical
individuals, but how do we account for all the differences between any
group of people? Is ignoring people’s health conditions enough?</p>
<p>Causal reasoning took a major step forward in the early twentieth century
when Fisher discovered a conceptually straightforward way to conduct an
intervention such that there is no systematic difference between the
treated and untreated groups. We simply gather one large population of
people and randomly split them into two groups (<em>G</em> = 0 or <em>G</em> = 1), one
of whom will receive the treatment and the other will not. By randomly
assigning individuals to receive or not receive treatment, we ensure
that, on average, there is no difference between the two groups. The
implication is that the expected outcomes of the two groups are the
same, and when we observe the average outcome
<em>Ȳ</em><sub><em>do</em>(<em>T</em> = 0)</sub><sup><em>G</em> = 0</sup> for the first group, we can use it
as an estimate of what the counterfactual outcome would have been for
the second group. Similarly, when we observe the average outcome
<em>Ȳ</em><sub><em>do</em>(<em>T</em> = 1)</sub><sup><em>G</em> = 1</sup> for the second group, we can use
that as an estimate of the counterfactual outcome for the first group.
This methodology is called the <em>randomized experiment</em>—also sometimes
called a randomized controlled trial, A/B experiment, and other names.</p>
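<p>The logic of the randomized experiment can be sketched in a few lines of Python. This is an illustrative simulation, not an example from the book’s notebooks: the population size, the true treatment effect of 2.0, and the influence of health on the outcome are all made-up numbers.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: health is an unobserved trait
# that affects the outcome; the true treatment effect is +2.0.
health = rng.normal(size=n)
treated = rng.integers(0, 2, size=n)          # coin-flip assignment (G = 1 or G = 0)
outcome = 2.0 * treated + 3.0 * health + rng.normal(size=n)

# Difference-in-means estimator: each group's average outcome stands in
# for the other group's unobserved counterfactual.
effect = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(round(effect, 2))  # close to the true effect of 2.0
```

<p>Because assignment is a coin flip, health is balanced across the two groups on average, so the simple difference in group means recovers the true effect.</p>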
<p>Coming back to our question on whether to give Alice the medication, we
can use the average causal estimate above to make a decision: if the
effect over a general population is positive, then assign the drug,
otherwise not. Note that the decision will be the same for every person
since we are deciding based on the average effect of the medication.
Assumptions on the counterfactual (e.g., comparing to Beth or a similar
person) can provide individual causal effects but the estimates suffer
from error whenever their counterfactuals are not identical. In
contrast, assumptions on the intervention lead to group-wise causal
effects, but are accurate whenever the treatment is randomized. An
interesting and important thing to realize is that randomized
experiments don’t only address the confounders that we are aware of, but
also ensure that our analysis is sound even when there are confounders
that we are not measuring or maybe haven’t even thought of. Because of
this, randomized experiments are considered more robust than other
approaches. In fact, they are often referred to as the “gold standard”
for causal inference. Randomized experiments are used to test whether a
new medicine really cures an illness or has significant side-effects,
whether a marketing campaign works, whether one search algorithm is
better than another, and even whether one color or another on a website
is better for user engagement.</p>
<p>Despite their general robustness, randomized experiments are not
foolproof and there are practical problems that can occur. First and
foremost, ensuring that random assignment is actually random is not
always easy. Sometimes we might be tempted to use an assignment
mechanism that seems close to random, but actually isn’t. For example,
for convenience, we might find it easier to assign all units that arrive
on a Monday to treatment A, and all units that arrive on a Tuesday to a
placebo. Not only might this be more convenient—it might be logistically
easier to give the same treatment to everyone on a given day—but we can also
imagine why this regime is close to random: We do not have prior
knowledge of which units will arrive on which day; that is outside of
our control and might be random. We might even double-check how similar
the units are to one another and find that Monday units and Tuesday
units are very similar. However, if there are any unobserved reasons why
units arrive at different times, there will be systematic differences
between our two groups that will bias our results. As another example of
how random assignment might not be random, historically, when patients
were being assigned to drug trials, sympathetic patients were more
likely to be assigned to treatment. This led to the development of
blinding methodologies to prevent people who interact directly with
patients from assigning or even knowing their treatment status.</p>
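<p>To see why a seemingly random assignment can bias results, consider a hypothetical simulation (all numbers invented for illustration) in which unobserved illness severity drives both the day a patient arrives and their outcome, while the drug itself does nothing:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical setup: sicker patients tend to arrive on Monday, and
# severity also worsens the outcome. The drug has zero true effect.
severity = rng.normal(size=n)
arrives_monday = (severity + rng.normal(size=n)) > 0   # arrival day depends on severity
outcome = -1.5 * severity + rng.normal(size=n)         # true treatment effect is 0

# Day-of-week "assignment": Monday arrivals get the drug.
biased = outcome[arrives_monday].mean() - outcome[~arrives_monday].mean()

# Truly random assignment on the same population.
coin = rng.integers(0, 2, size=n).astype(bool)
unbiased = outcome[coin].mean() - outcome[~coin].mean()

print(round(biased, 2), round(unbiased, 2))  # biased is far from 0; unbiased is near 0
```

<p>The day-of-week rule makes the drug look harmful even though it does nothing, because severity differs systematically between the two groups.</p>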
<p>Despite their many advantages, randomized experiments are sometimes too
costly, unethical, or otherwise infeasible. We are limited by the number
of experiments we can run at a given time (given the statistical power
each requires), by the cost of designing and implementing an experiment,
by the length of time we can run it, and so on.
Even if running experiments is relatively easy, there are often orders
of magnitude more experiments we would like to run than we possibly can.
In addition, sometimes there are ethical issues involved in
experiments—is it ethical to expose people to potentially harmful
treatments?</p>
<p>So, what do we do when we cannot run a randomized experiment, but still
need to answer a cause-and-effect question? We turn to methods and
frameworks for causal reasoning that make different assumptions about
the counterfactual and intervention. Such methods are the focus of most
of this book. That said, much of what we will talk about—from accounting
for survival bias, interference, heterogeneity, and other validity
threats—are applicable to randomized experiments as well.</p>
<h2 id="why-causal-reasoning-the-gap-between-prediction-and-decision-making">Why causal reasoning? The gap between prediction and decision-making</h2>
<p>Causal reasoning, thus, makes sense whenever we have a treatment (as in
medicine) or an economic policy (as in the social sciences) to
evaluate. But what use could it be for computing systems especially at a
time when machine learning-based predictive systems are promising
success in a variety of applications? To answer this question, let us
take a closer look at the success of machine learning and how it may
change the role of computing systems in society.</p>
<h3 id="the-promise-of-prediction">The promise of prediction</h3>
<p>Today, computers are increasingly making decisions that have a
significant impact on our lives. Sometimes computers make a choice and
take action independently such as deciding on loan applications. Other
times computers simply aid people who make the final choice and drive
action such as helping doctors or judges with recommended actions.
Sometimes these computers are hidden far away inside our vital
infrastructure, making decisions that seem to only indirectly affect
people, as in optimizing configuration and availability of data centers.
Other times, these computers are visibly integrated into the fabric of
our daily lives, for example through fitness devices.</p>
<p>But, regardless of how directly or indirectly computers are involved, it
is true that computers are helping us make critical decisions across
many domains. For example, machine learning algorithms recommend product
purchases to customers in online retail sites. Similar algorithms power
movie recommendations, placement of advertisements, and many other
decisions. Other algorithms are responsible for match-making, pairing up
drivers and passengers in ride-sharing platforms, and connecting people
in online dating services. Behind the scenes, computers help with
logistics, resource allocation, and product pricing. They run algorithms
to decide who is eligible for a loan and identify the top candidates who
have applied for a job. Each of these decisions, made by a computer
algorithm, has significant consequences for all individuals and parties
involved.</p>
<p>And computer-aided decision-making is only growing in scope. In the
health domain, machine learning is enabling the advent of precision
medicine. Computers will soon analyze genetic information, symptoms,
test results and medical history to decide how best to heal a particular
patient of a malady. In education, computers promise to improve
personalized education in the context of both traditional classrooms and
the newer massive open online courses. Based on a personalized model of
a learner’s conceptual understanding and learning preferences, a
computer will coach and support a student in their exploration and
mastery of a subject. Data-driven, precision decision-making is
improving the productivity of farming while reducing water usage and
pollution. Artificial Intelligence (AI) is also bringing or poised to
bring similar impact to manufacturing, transportation, and other
industries.</p>
<p>This revolution of computer-aided decision-making is aided by three
concurrent trends. First, there is a proliferation of data from cheaper
and more ubiquitous sensors, devices, applications, and services.
Second, cheap and well-connected computational power is available in the
cloud. Third, significant advances in machine learning and artificial
intelligence methods make it possible to rigorously process and analyze
a much broader set of data than was possible even
just 10 years ago. For example, if we want to predict upcoming wheat
yields in a field, we can now use automated drones to take pictures of
the wheat plants, and use deep neural network based image analysis to
recognize and count the grains and extrapolate the likely yield. These
pictures, field sensors and other information from the farm can be
uploaded to the cloud, joined with weather data and historical data from
other farms to learn better management policies and make decisions about
crop management.</p>
<p>Thus, increasing amounts of data and advanced machine learning
algorithms help make highly accurate predictions. What could go wrong?
The simple answer is that going from prediction to a decision is not
straightforward. A typical machine learning algorithm optimizes for the
difference between true and predicted value in a given dataset, but a
decision based on such a prediction is not always the decision that
maximizes the intended outcome. In other words, the causal effect of a
decision based on purely data-based predictive modelling can be
arbitrarily bad.</p>
<h3 id="importance-of-the-underlying-mechanism">Importance of the underlying mechanism</h3>
<p>Consider a simple social news feed application, where users can see
messages posted by their friends. Let’s ask the question of whether we
can predict something about a user’s future behavior based on what they
see in their social feed. That is, if a person sees a link to a news
article, a product recommendation, or a review of a real-world
destination, can we predict that the person will then read the news
article, buy some product, or visit the destination? It turns out, the
answer to that question is yes, we can make successful predictions about
a user’s future behavior based on what they see in their social feed.</p>
<p>If we can predict future behavior based on the social feed, does this
mean that, if we decide to change the contents of the social feed, that
we will change the user’s future behavior? Not necessarily. This is the
gap between prediction and decision-making. We can predict a user’s
future behavior using the current feed, but that does not tell us much
about how they will behave if we change their social feed. Here the
decision is whether to change the social feed, and the answer depends on
the relationship between the social feed and the user’s future behavior:
what affects whom?</p>
<p>Figure 1.2 shows two possible explanations for
the predictive accuracy. On the left side, we see that the social feed
itself <em>causes</em> a person’s future behavior. That is, perhaps social feed
posts on this system are compelling and do a good job of enticing a
person to try new things in the future. Or perhaps, as shown on the right
side, people and their friends tend to do similar things anyway. For
example, if a group of friends likes going to Italian restaurants, and a
new one opens, they are all likely to visit the restaurant sometime
soon, but one of them happens to go and post about it first. If the
friend hadn’t posted, all the friends would have gone to the restaurant
anyway, but the post itself helps us predict the behavior of the
individuals. On the left-hand side, if we change the social feed, then
we will affect the user’s behavior. However, on the right-hand side,
changing the social feed will not affect the user’s behavior. But note
that in both cases, the social feed helps us predict what the user might
do in the future!</p>
<p>Without knowing the direction of effect, we can reach exactly opposite
conclusions using the same data. This is troubling because, in many
scenarios, prediction models are used in service of making a
decision.</p>
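<p>We can make the two explanations concrete with a small simulation (hypothetical numbers throughout). In one world, the post causes the visit; in the other, a shared taste causes both the post and the visit. Both worlds show a similar observed association, yet intervening on the feed matters only in the first:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# World A (hypothetical): the feed post causes the visit.
post_a = rng.integers(0, 2, size=n)
visit_a = (rng.random(n) < 0.2 + 0.5 * post_a).astype(int)

# World B (hypothetical): a shared taste causes both the post and the
# visit; the post itself has no effect on behavior.
taste = rng.integers(0, 2, size=n)
post_b = (rng.random(n) < 0.1 + 0.6 * taste).astype(int)
visit_b = (rng.random(n) < 0.1 + 0.6 * taste).astype(int)

# Both worlds show a clear observed association between post and visit...
for post, visit in [(post_a, visit_a), (post_b, visit_b)]:
    print(round(visit[post == 1].mean() - visit[post == 0].mean(), 2))
# ...but intervening on the feed changes behavior only in World A.
```

<p>The observed post-visit association alone cannot distinguish the two worlds; only the assumed causal structure tells us what an intervention would do.</p>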
<p><img src="/assets/Figures/Fig-1-1.png" alt="Social feed affects user behavior" height="200px" width="150px" /></p>
<!--<span id="fig:socialfeedbehavior_causal" label="fig:socialfeedbehavior_causal">Social feed affects user behavior</span>-->
<p><img src="/assets/Figures/Fig-1-2.png" alt="Restaurant preference affects both the social feed and user behavior" height="200px" width="150px" /></p>
<!-- <span id="fig:socialfeedbehavior_correlated" label="fig:socialfeedbehavior_correlated">Restaurant preference affects both the social feed and user behavior</span>-->
<h4 id="figure-12--two-models-of-social-feed-effect-on-user-behavior">Figure 1.2: Two models of social feed effect on user behavior.</h4>
<h3 id="the-trouble-with-changing-environments">The trouble with changing environments</h3>
<p>Complicating matters, predictive models can lead us astray even if the
underlying (direction of) effects is known. Let’s consider a second
example where machine learning may be applied to data from a farm for
making irrigation decisions. In particular, let’s make our job a little
easier by assuming that we know what the effect of irrigation is (unlike
in our social feed example, where we do not know the effect of changing
the feed). We know that irrigation will increase the soil moisture
levels by some known amount.</p>
<p>In a predictive model, we may collect data about past soil moisture and
other variables and then make predictions of future soil moisture levels
based on past data. This prediction can be converted to a simple
decision: if the soil moisture is low, irrigate, else do not irrigate.
Now, given a history of past soil moisture data on the farm and past
weather, let us assume that we can train an accurate model to predict
future soil moisture levels based on current soil moisture and future
weather forecasts. Can this machine learning prediction model guide our
irrigation decisions on a farm?</p>
<p>Again, the answer is no, we cannot make irrigation decisions based on
our learned model of soil moisture levels. Imagine that one day the
weather forecast says it will be very hot. Our soil moisture model is
likely to predict that the soil will be very moist and, based on this
prediction, we are likely to decide <em>not</em> to irrigate.</p>
<p>But why is our soil moisture model predicting that there’s no need to
irrigate on a very hot day? Our model is trained on past soil moisture
data, but in the past the soil was being irrigated under some
predetermined policy (e.g., a rule-based decision or the farmer’s
intuition). If this policy always irrigated the fields on very hot days,
then our prediction model will learn that on very hot days, the soil
moisture is high. The prediction model will be very accurate, because in
the past this correlation always held. However, if we decide not to
water the field on very hot days based on our model predictions, we will
be making exactly the wrong decision!</p>
<p>The prediction model is correctly capturing the correlation between hot
weather and a farmer’s past irrigation decisions. The prediction model
does not care about the underlying mechanism. It simply recognizes the
pattern that hot weather means the soil will be moist. But once we start
using this prediction model to drive our irrigation decisions, we break
the pattern that the model has learned. More technically, we say that
once we begin active intervention, the correlations that the soil
moisture model depends on will change.</p>
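<p>A minimal simulation makes the broken pattern visible. All coefficients here are hypothetical: heat dries the soil, irrigation moistens it, and the historical policy always irrigates on hot days. A purely correlational model of moisture on hot days then badly over-predicts moisture once the policy changes:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

temp = rng.normal(size=n)
hot = temp > 1.0

# Historical policy (hypothetical): always irrigate on hot days.
# Heat lowers soil moisture; irrigation raises it.
moisture_hist = -1.0 * temp + 2.0 * hot + 0.1 * rng.normal(size=n)

# Correlational "model": the average observed moisture on hot days.
predicted_hot_moisture = moisture_hist[hot].mean()

# New policy driven by that prediction: skip irrigation on hot days.
moisture_new = -1.0 * temp[hot] + 0.1 * rng.normal(size=hot.sum())

print(round(predicted_hot_moisture, 2))  # looks moist under the old policy
print(round(moisture_new.mean(), 2))     # actually dry under the new one
```

<p>The historical correlation between heat and moisture was real, but it was carried by the old irrigation policy; acting on the model destroys the very pattern it learned.</p>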
<p><img src="/assets/Figures/Chapter1/irrigation_historical.png" alt="Both daily temperature and irrigation decisions influence soil moisture levels. Historically, daily temperature also influences irrigation decisions." height="200px" width="200px" /></p>
<p><span id="fig:irrigation_historical_dag" label="fig:irrigation_historical_dag" style="font-style:italic">Both daily temperature and irrigation decisions influence soil moisture levels. Historically, daily temperature also influences irrigation decisions.</span></p>
<p><img src="/assets/Figures/Chapter1/Historical_Temp_Moisture_Chart.png" alt="Under a historical policy, the correlation between temperature and soil moisture is stable." height="200px" width="200px" /></p>
<p><span id="fig:irrigation_historical" label="fig:irrigation_historical" style="font-style:italic">Under a historical policy, the correlation between temperature and soil moisture is stable.</span></p>
<p><img src="/assets/Figures/Chapter1/Broken_Temp_Moisture_Chart.png" alt="Changing the irrigation policy will change the relationship between temperature and soil moisture." height="200px" width="200px" /></p>
<p><span id="fig:irrigation_intervention" label="fig:irrigation_intervention" style="font-style:italic">Changing the irrigation policy will change the relationship between temperature and soil moisture.</span></p>
<h4 id="figure-13-using-a-correlational-model-trained-on-historical-data-to-drive-future-irrigation-decisions-will-break-the-historical-temperature-soil-moisture-correlation-and-thus-the-machine-learned-model">Figure 1.3: Using a correlational model trained on historical data to drive future irrigation decisions will break the historical temperature-soil moisture correlation and thus the machine-learned model.</h4>
<p>This illustrates another example of why prediction models are not
appropriate for decision making. Prediction models are not robust to
changing conditions. Machine learning practices—e.g., ensuring that we
train and test machine learning models using data drawn from the
environment we plan to deploy in—are important, but provide no
guarantees. In this irrigation example, those changing conditions are
due to our own change in policy based on machine learning model
predictions—we cannot train our model based on observations of our new
policy, as the new policy doesn’t exist yet! More generally, such
changing conditions can occur due to exogenous factors as well.
Moreover, these conditions may change quickly when we apply our models
in new environments, or change over time within a deployed environment.</p>
<p>Changing environments are particularly important issues in some of the
critical domains we care most about: healthcare, agriculture, etc.,
where we expect machine learning models to help us make better
decisions; online services such as ecommerce sites, etc., that have to
adapt to seasonality, social influences, and growing and changing user
populations; and deployments of machine learning in adversarial
settings, e.g., from spam classification and intrusion detection to
safety critical systems.</p>
<h3 id="from-prediction-to-decision-making">From prediction to decision-making</h3>
<p>To recap, we have seen that prediction models are not appropriate for
helping us reason about what might happen if we change a system or take
a specific action. In our social news feed example, where we asked
whether predictive models can help us understand whether changing a
social news feed will change future user behavior, we saw that there are
multiple plausible explanations of why social feed data can help us
predict future user behaviors. While one explanation implies that
changing the social feed will affect user behavior, another explanation
implies that it won’t affect user behavior. Crucially, the machine
learning prediction model does not help us identify which explanation is
correct!</p>
<p>Moreover, even with an <em>a priori</em> understanding of the causal effect of
an intervention, when we examine the use of a simple prediction model
for making decisions, we see that the act of making a decision based on
the model changes the environment and puts us into untested territory
that threatens the predictive power of our model!</p>
<p>Finally, let us emphasize that these two issues are fundamental. Even
when a prediction model has an otherwise extremely high accuracy, we
cannot expect that accuracy alone to give us insight into underlying
causal mechanisms or help us choose among interventions that change the
environment.</p>
<h2 id="applications-of-causal-reasoning">Applications of Causal Reasoning</h2>
<p>The above section illustrates the importance of causal reasoning
whenever we have to make decisions based on data. Below we present
sample scenarios from computing systems that involve decision-making,
and thus require causal reasoning. In addition, it turns out that causal
reasoning is useful even when decision-making may not be the primary
focus. For instance, causal reasoning is useful in improving systems
that may appear to be purely predictive at first, such as search and
recommendation systems.</p>
<h3 id="making-better-decisions">Making better decisions</h3>
<p>There are numerous examples of decision-making in computing systems
where causal reasoning can help us make better decisions. We broadly
categorize them into three themes: improving utility for users,
optimizing underlying systems, and enhancing viability of the system,
commercial or otherwise.</p>
<p>We already saw an example of decision-making for improving users’
utility through changing a social feed. Other examples include choosing
incentives for encouraging better use of a system, and more broadly,
deciding which functionality to include in a product to maximize utility
for users. In general, any decision that involves changing a product or
service’s features requires causal reasoning to anticipate its future
effect. This is because for all these problems, we need to isolate the
effect of these decisions from the underlying correlations.</p>
<p>Similar reasoning can also be applied to optimize underlying systems,
such as optimizing configuration parameters of a database, deciding
network parameters for best throughput, allocating load in a distributed
data center for energy efficiency, and so on.</p>
<p>Lastly, viability and sustainability of any computing system is
important too. This involves historically non-computing areas such as
marketing and business management, but where data-driven decisions are
increasingly being made. One example is decisions involving the
interaction of a system with the outside world, such as choosing the
right messaging for a targeted campaign. As another example, consider a subscription-based
service such as Netflix or Office365. It is relatively easy to build a
predictive model that identifies the customers who will be leaving in
the next few months, but deciding what to do to prevent them from
leaving is a non-trivial problem. We will consider such decision-making
applications throughout the book.</p>
<h3 id="building-robust-machine-learning-models">Building robust machine learning models</h3>
<p>Causal reasoning is also useful in the absence of explicit decisions.
Many systems, such as those for recommendation or search that commonly
employ predictive models can be improved with causal reasoning.
Predictive models aim to minimize the average error on past data, which
may not correspond to the expected error on new data, especially in a
system that interacts with people. Consider a ratings-based
recommendation system that aims to predict a user’s rating of a new
item. If there are systematic biases in the items rated by the user
(such as rating movies from a single genre more often), then the system
may over-optimize for movies from the same genre, but make errors for
all other genres. The fundamental problem is that past data is collected
under certain conditions, and its predictions may not be accurate for
the future. We shall see in Chapter 12 that the problem
of recommendation can be considered as a problem of <em>intervening on</em>
users with a recommended item, thus defining each recommendation as an
intervention. A similar problem arises in optimizing the most relevant
pages for a query in a search engine based on log data, and in any
system where a user interacts with information. Besides improving
accuracy, causal reasoning can also be useful to understand the effect
of algorithms on metrics that they were not optimized for. For instance,
it helps us in estimating the different effects of recommendation
systems, from impacting diversity to amplifying misinformation and
“filter bubbles”.</p>
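<p>A toy simulation (invented ratings and proportions) illustrates how this exposure bias hurts a predictive recommender: a model that minimizes average error on logged data can be accurate on the over-represented genre yet badly wrong elsewhere:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical user: likes comedies (mean rating 4) more than dramas
# (mean 2), but 95% of their logged ratings happen to be comedies.
genre = rng.random(10_000) < 0.95           # True = comedy in the logged data
rating = np.where(genre, 4.0, 2.0) + 0.5 * rng.normal(size=10_000)

# A naive model that predicts the user's average logged rating everywhere.
naive_prediction = rating.mean()

# Error is tiny on logged (mostly comedy) data, large on unseen dramas.
print(round(abs(naive_prediction - 4.0), 2))  # error on comedies
print(round(abs(naive_prediction - 2.0), 2))  # error on dramas
```

<p>The average error over the logged data looks excellent precisely because the data over-represents one genre; treating each recommendation as an intervention forces us to reason about what happens outside the logged distribution.</p>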
<p>Relatedly, questions on broad societal impact of computing systems are
fundamentally causal questions about the effect of an algorithm: is a
loan decision algorithm unfair to certain groups of people? What may be
the outcomes of delegating certain medical decisions to an algorithm? As
we use machine learning for societally critical domains such as health,
education, finance, and governance, questions on the causal effect of
algorithmic interventions gain critical importance. Causal reasoning can
also be used to understand the effects of these algorithms, and also to
explain their output: why did the model provide a particular decision?</p>
<p>More generally, causal reasoning helps predictive models make the jump
from fitting retrospective data to making predictions on new data. Predictive
models based on supervised learning work well when we expect them to be
tested on the same data distribution on which they were trained. For
instance, predictive models can achieve impressive results on
distinguishing between different species of birds because we expect to
use them on classifying similar pictures in the future. If, however, we
predict in an unseen environment (e.g., moving from outdoor to indoor
pictures), the model may not work well and may even fail to identify the
same species. These environment changes, commonly called concept drift, occur because the
association between input features and output changes as we change the
environment. Rather than looking for patterns in an image, reasoning
about the causal factors that make an image about a specific species can
lead to a more generalizable model. In fact, causal inference can be
considered as a special case of the domain adaptation problem in machine
learning, which we will explore in Chapter 13.</p>
<p>Beyond supervised learning, causal reasoning shares a special connection
with reinforcement learning (RL), in that both aim to optimize the
outcome for a particular decision. It is no surprise, then, that simpler
forms of RL, such as bandits, are used for optimizing recommendation
systems. And causal inference methods find use in training RL policies,
especially when using off-policy data. This synergy between machine
learning and causal reasoning is one of the underlying themes of this
book: causal reasoning can make machine learning more robust, and
machine learning can help with better estimates of causal effects.</p>
<h2 id="four-steps-of-causal-reasoning">Four steps of Causal Reasoning</h2>
<p>The focus of this book is on methods and challenges for learning causal
effects from observational data. Briefly, observational studies are
those where we wish to learn causal effects from gathered data and,
while we may have some understanding of the data (in particular the
mechanism that generated the data), we have limited or no control over
that mechanism. So, how are we going to learn causal effects when we
cannot run an experiment like a randomized controlled trial? How will we
deal with confounding variables that might confuse our analysis if we
cannot manipulate an experiment to ensure that confounds are independent
of the treatment status?</p>
<p>At a high-level, we’ll need to find a valid intervention and then
construct a counterfactual world to estimate its effect. Unlike the
randomized experiment, the biggest change is that we will need to make
some assumptions on how the data was generated. This is critical: causal
reasoning depends on a model of the world which can be considered as
modeling assumptions. As we saw in the social feed example, the same
data can lead to different conclusions depending on the underlying
mechanism.</p>
<p>Given data and a model of the world, we decide whether the available
data can answer the causal question uniquely. This step is called
identification. Note that identification comes from the modeling
assumptions themselves, not from data. When the causal question is
uniquely identified, we can estimate it using statistical methods. Note
that the identification and estimation are separate, modular steps.
The identification step is the causal one, while estimation is a
statistical one. Identification depends on the modeling assumptions, estimation on
the data. A better estimate does not convey anything about causality,
just as a better identification does not convey anything about the
quality of an estimate.</p>
<p>Finally, given the dependence on assumptions, verifying these
assumptions is critical. Even with infinite data, incorrect assumptions
can lead to wrong answers. Worse, the statistical confidence in those
wrong answers will be higher. Therefore, the final critical part of causal reasoning is to
validate the modeling assumptions. We call this step the “refute” step,
because like scientific theories, modeling assumptions can never be
proven using data but may be refuted. That said, it is important to note
that not all assumptions can be refuted. Causal reasoning is an
iterative process where we refine our modeling assumptions based on
evidence and try to obtain identification with the least untestable or
most plausible assumptions.</p>
<p>To summarize, we rely on a four step analysis process to carefully
address these challenges:</p>
<p><strong>Model and assumptions</strong>. The first important step in causal reasoning
is to create a clear model of the causal assumptions being made. This
involves writing down what is known about the data generating process
and mechanisms. In general, there are many mechanisms that can
potentially explain a set of data, and each of these self-consistent
mechanisms will give us a different solution for the causal effect we
care about. So, if we want to get a correct answer to our
cause-and-effect questions, we have to be clear about what we already
know. Given this model, we will be able to specify formally the effect
<em>A</em> → <em>B</em> that we want to calculate.</p>
<p><strong>Identify</strong>. Use the model to decide whether the causal question can be
answered and provide the required expression to be computed.
Identification is a process of analyzing our model.</p>
<p><strong>Estimate</strong>. Once we have a general strategy for identifying the causal
effect, we can choose from several different statistical estimation
methods to answer our causal question. Estimation is a process of
analyzing our data.</p>
<p><strong>Refute</strong>. Once we have an answer, we want to do everything we can to
test our underlying assumptions. Is our model consistent with the data?
How sensitive is the answer to the assumptions made? If the model is a
little wrong, will that change our answer a little or a lot?</p>
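<p>The four steps can be sketched end-to-end on simulated data. The sketch below uses plain NumPy rather than a causal inference library, and the assumed graph, the true effect of 1.0, and the particular estimation and refutation choices (linear regression adjustment, a placebo treatment) are all illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# 1. Model: assume W confounds treatment T and outcome Y
#    (W -> T, W -> Y, T -> Y).
w = rng.normal(size=n)
t = (w + rng.normal(size=n) > 0).astype(float)
y = 1.0 * t + 2.0 * w + rng.normal(size=n)   # true effect of T on Y is 1.0

# 2. Identify: under this model, adjusting for W suffices (backdoor),
#    so the estimand contrasts E[Y | T, W] over values of T.
# 3. Estimate: linear regression adjustment as one statistical choice.
X = np.column_stack([np.ones(n), t, w])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
effect = coef[1]

# 4. Refute: a placebo treatment (randomly permuted T) should show
#    an effect near zero if the analysis is sound.
t_placebo = rng.permutation(t)
Xp = np.column_stack([np.ones(n), t_placebo, w])
placebo_effect = np.linalg.lstsq(Xp, y, rcond=None)[0][1]

print(round(effect, 2), round(placebo_effect, 2))  # roughly 1.0 and 0.0
```

<p>Note how the steps separate cleanly: the identification argument (step 2) follows from the assumed graph alone, while the regression (step 3) and the placebo check (step 4) operate on the data.</p>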
<h3 id="modeling-and-assumptions">Modeling and assumptions</h3>
<p>In Section 1.2, we discussed a randomized
experiment and applied counterfactual reasoning to estimate the causal
effect. Counterfactual reasoning provides a sound basis for causality,
but in most empirical problems, we may not obtain perfectly randomized
data. Therefore, to estimate the causal effect, we need a precise way of
expressing our assumptions about the intervention and the counterfactual
we wish to estimate. What has happened in the last few decades is that
the concepts of interventions and counterfactuals have been formalized
in a general modeling framework, taking causality from the realm of
philosophy to empirical science.</p>
<p><img src="/assets/Figures/Chapter1/ice-cream-graph_cropped.png" alt="Structural causal model for ice-cream’s effect on swimming. <span label="fig:icecream"></span>" /></p>
<h4 id="figure-14-structural-causal-model-for-ice-creams-effect-on-swimming">Figure 1.4: Structural causal model for ice-cream’s effect on swimming.</h4>
<p>The main insight is to replace the factual and counterfactual worlds
with mathematical models that define the relationship between
treatment, outcome, and other variables. This can be done in the form of
a graph or functional equations. Crucially, this model does not
prescribe the exact functional forms that connect variables, but rather
conveys the structure of causal relationships—who affects whom. This
structural model embodies all the domain knowledge or causal assumptions
that we make about the world, thus it is also called the <em>structural
causal model</em>. For instance, consider the question of whether ice cream
causes people to swim more; searches for the two queries are correlated
over time. We can represent the
scenario with a graphical model and associated set of non-parametric
equations shown in Figure <a href="#fig:icecream">1.4</a>. Each arrow represents a
direct causal relationship. We assume that Temperature causes changes in
ice-cream consumption and swimming. Our goal is to estimate the causal
effect of ice-cream consumption on swimming. Intuitively, the intervention is changing someone’s
ice-cream consumption and the counterfactual world is one where the
consumption is changed but every other node in the graph (Temperature,
in this case) remains constant. Assuming that our structural model is
correct, the structural causal model offers a precise recipe to estimate
the effect of having more ice-cream. Amazingly, the recipe generalizes
to arbitrary graph structures and functional forms, as we shall see in
the next few chapters.</p>
<p>Structural causal models derive their power from being able to precisely
define interventions and counterfactuals. However, it is hard to express
these concepts using conventional probabilities. As we saw above, it is
important to distinguish an intervention from an observation, but
unfortunately probability calculus lacks a language to distinguish
between observing people using a feature versus assigning them to it
(both would be expressed as <em>P</em>(Outcome|Feature = True)). This
difficulty gets more complicated when we try to express counterfactuals.
How would you express the counterfactual probability of outcome if a
user was assigned the feature, given that she discovered it herself (was
not assigned) in the factual world? The obvious expression,
<em>P</em>(Outcome|Assigned = True, Assigned = False) is nonsensical. Given
these shortcomings, we need a new class of variables and a calculus to
operate on these variables. Intervention is defined by a special “do”
operator, which implies removing all inbound edges to a variable. This
corresponds to an <em>interventional graph</em>. Thus,
assigning people to a feature is represented as <em>P</em>(<em>Y</em>|do(Feature)). Any
counterfactual value can be generated by changing the variable in the
interventional graph; because the inbound edges are removed, changing
the variable is not associated with changing any other variables except the
outcome, thus <em>keeping everything else constant</em>. The causal effect of
an intervention can then be defined precisely as the difference of two
interventional distributions.
Causal Effect := <em>E</em>(<em>Y</em>|do(<em>T</em> = 1)) − <em>E</em>(<em>Y</em>|do(<em>T</em> = 0))</p>
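<p>To make the distinction between observing and intervening concrete, here is a minimal simulation of the ice-cream structural model in plain Python (our own illustrative sketch, not the book's notebook; the coefficients are arbitrary). Conditioning on ice-cream consumption shows a strong association with swimming, while applying the do-operator, which severs the Temperature → Ice-cream edge, shows (approximately) no effect:</p>

```python
import random

random.seed(0)

def simulate(n=100_000, do_icecream=None):
    """Sample from the structural model: Temperature causes both ice-cream
    consumption and swimming; ice-cream has no causal effect on swimming."""
    samples = []
    for _ in range(n):
        temp = random.gauss(25, 5)
        # do() replaces the structural equation for ice-cream with a constant.
        if do_icecream is not None:
            icecream = do_icecream
        else:
            icecream = 0.5 * temp + random.gauss(0, 1)
        swim = 0.3 * temp + random.gauss(0, 1)
        samples.append((icecream, swim))
    return samples

# Observing: conditioning on high vs. low ice-cream consumption shows a
# large gap in swimming (both variables are driven by temperature).
obs = simulate()
swim_high = [s for i, s in obs if i > 12.5]   # 12.5 is the mean ice-cream level
swim_low = [s for i, s in obs if i <= 12.5]
obs_gap = sum(swim_high) / len(swim_high) - sum(swim_low) / len(swim_low)

# Intervening: E[Y|do(T=high)] - E[Y|do(T=low)] is near zero.
do_high = simulate(do_icecream=20)
do_low = simulate(do_icecream=5)
do_gap = (sum(s for _, s in do_high) - sum(s for _, s in do_low)) / len(do_high)

print(f"observational gap: {obs_gap:.2f}, interventional gap: {do_gap:.2f}")
```

<p>The observational gap is large, while the interventional gap hovers around zero: the correlation is entirely due to the common cause.</p>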
<p><img src="/assets/Figures/Chapter1/rct-confounders_cropped.png" alt="Before randomization" /></p>
<p><span id="fig:rct-confounders" label="fig:rct-confounders">Before randomization</span></p>
<p><img src="/assets/Figures/Chapter1/rct-noconfounders_cropped.png" alt="After randomization" /></p>
<p><span id="fig:rct-noconfounders" label="fig:rct-noconfounders">After randomization</span></p>
<h4 id="figure-15--randomization-leads-to-the-interventional-structural-model-where-treatment-is-not-affected-by-confounders">Figure 1.5: Randomization leads to the interventional structural model where treatment is not affected by confounders.</h4>
<p>Thus, the <em>do</em> notation is a concise expression for evaluating
interventions while keeping everything else constant. Along with the
structural causal model, it also leads to a formal definition of a
counterfactual. To illustrate, we now express why randomized experiments
work using structural causal models. Figure <a href="#fig:rct-confounders">1.5a</a>
shows a structural model that shows confounding between the treatment
and outcome. Under a randomized experiment, the structural causal model
now becomes as shown in Figure <a href="#fig:rct-noconfounders">1.5b</a>, thus
graphically showing that randomization removes any effect of a
treatment’s parents.</p>
<h3 id="identification">Identification</h3>
<p>Given the modeling assumptions and available data, the next step is to
ascertain whether the causal effect can be estimated from the data. This
means ascertaining whether the causal effect expression above can be
written as a function of only
observed data. As we will see, given a causal structural graph, it is
possible to check whether the causal effect is estimable from data, and
when it is, provide the formula for estimating it. For instance,
returning to our ice-cream graph, the causal effect is identified by
conditioning on Temperature and then estimating the association between
ice-cream and swimming separately for each temperature range. When we do
that, we see that the treatment and outcome are no longer associated,
thus showing that the observed association is due to the Temperature,
and not due to any causal effect of eating ice-cream. In general,
variables like Temperature are called <em>confounders</em>: variables that can lead to a
correlation between treatment and outcome even when there is no causal
relationship.</p>
<p>More generally, identification is the process of transforming a causal
quantity to an estimable quantity that uses only available data. For the
randomized experiment, we argued that random assignment of treatment
ensures that there are no confounders that affect the treatment. Thus,
the identification step is trivial:</p>
<p>Average Causal Effect = <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 1)] − <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 0)]</p>
<p>=<strong>E</strong>[<em>Y</em>|<em>A</em> = 1]−<strong>E</strong>[<em>Y</em>|<em>A</em> = 0] … <em>(Identification Step)</em></p>
<p>While we obtained a clear answer for the ice cream problem and
randomized experiments, real-world problems of causal inference do not
always lend themselves to simple solutions. We illustrate this through a
common problem encountered in conditioning on data.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: center">Current Algorithm</th>
<th style="text-align: center">New Algorithm</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">CTR for low-activity users</td>
<td style="text-align: center">10/400 (2.5%)</td>
<td style="text-align: center">4/200 (2%)</td>
</tr>
<tr>
<td style="text-align: left">CTR for high-activity users</td>
<td style="text-align: center">40/600 (6.7%)</td>
<td style="text-align: center">50/800 (6.25%)</td>
</tr>
<tr>
<td style="text-align: left">Overall CTR</td>
<td style="text-align: center">50/1000 (5%)</td>
<td style="text-align: center">54/1000 (5.4%)</td>
</tr>
</tbody>
</table>
<h4 id="table-11-click-through-rates-for-two-algorithms-which-one-is-better">Table 1.1: Click-through rates for two algorithms. Which one is better?</h4>
<p>Suppose that you are trying to improve a current algorithm that returns
a list of search results based on a given query. You consider a metric
like click-through rate on the generated results, and wish to deploy the
algorithm that leads to the maximum click-through rate per query. You
develop a new algorithm for this task, and use it to replace the old
algorithm for a few days to gather data for comparison. A natural way to
compare the two algorithms would be to collect a random sample of queries
for each algorithm and compare the click-through rates obtained by the
two algorithms. That is, let us collect a random sample of 1,000 search
queries for each algorithm and evaluate these algorithms on the
fraction of search queries that had a relevant search result (as
measured by a user click). Table <a href="#tab:ch1-simpson-paradox">1.1</a> shows the
performance for two algorithms: it is clear that the new algorithm
performs better. However, you might be curious if the new algorithm is
doing well for all users, or only a subset. To check, you divide the
users into low-activity and high-activity users. The first two rows of the
table show the comparison. Oddly, after conditioning on users’
activity, the new algorithm is now worse than the old algorithm for both
types of users.</p>
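<p>We can verify the reversal directly from the counts in Table 1.1 with a quick check in plain Python (the numbers are exactly those in the table):</p>

```python
# Clicks and impressions from Table 1.1, per algorithm and activity stratum.
table = {
    "current": {"low": (10, 400), "high": (40, 600)},
    "new":     {"low": (4, 200),  "high": (50, 800)},
}

def ctr(clicks, shown):
    """Click-through rate: fraction of shown results that were clicked."""
    return clicks / shown

for algo, strata in table.items():
    clicks = sum(c for c, _ in strata.values())
    shown = sum(n for _, n in strata.values())
    per_stratum = {s: round(100 * ctr(*cn), 2) for s, cn in strata.items()}
    print(algo, per_stratum, "overall:", round(100 * ctr(clicks, shown), 2))

# The new algorithm loses in every stratum yet wins overall: 80% of its
# traffic came from high-activity users (800/1000, vs. 600/1000 for the
# current algorithm), and those users click more under any algorithm.
```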
<p><img src="/assets/Figures/Chapter1/simpson-ctr-model1_cropped.png" alt="Model 1<span label="fig:ch1-simpson-model1"></span>" /></p>
<p><img src="/assets/Figures/Chapter1/simpson-ctr-model2_cropped.png" alt="Model 2<span label="fig:ch1-simpson-model2"></span>" /></p>
<p><img src="/assets/Figures/Chapter1/simpson-ctr-model3_cropped.png" alt="Model 3<span label="fig:ch1-simpson-model3"></span>" /></p>
<h4 id="figure-16-different-causal-models-for-the-same-data">Figure 1.6: Different causal models for the same data.</h4>
<p>How is this possible? How can an algorithm be good for everyone, but
then be worse for each individual sub-population? The statistical
explanation is that the new algorithm somehow attracted a higher
fraction of high-activity users, and those users also tended to click
more. In comparison, the old algorithm was better for both types of
users, but the sheer number of high-activity users that used the new
algorithm pushed up its click ratio overall. This dilemma is sometimes
referred to as <em>Simpson’s paradox</em>, named after the statistician Edward
Simpson, who described it. The causal explanation is that it is not a paradox; the
interpretation of a causal effect depends on the specific causal model
that we assume, as we discussed above. In the first case, we assume the
structural model shown in Figure <a href="#fig:ch1-simpson-model1">1.6 (top)</a>, which
assumes that the algorithm has a direct causal effect on the CTR metric
and that no other variable confounds this effect. In the second case, after
conditioning on user activity, we assume the structural model shown in
Figure <a href="#fig:ch1-simpson-model2">1.6 (middle)</a>, which treats user activity as a
confounder for the effect of the algorithm on CTR. Note that both causal
conclusions are valid: given the first model, the new algorithm causes
the CTR to rise, whereas given the second model, the new algorithm
reduces the CTR. The correct answer depends on which structural model
reflects reality, illustrating the dependence of any causal effect on
its underlying model. In this case, we know from past experience (and
past work) that high-activity users behave differently than
lower-activity users, and thus we choose the interpretation of Model 2.</p>
<p>Does that resolve our dilemma? What if there was another variable that
we forgot to condition on? Figure <a href="#fig:ch1-simpson-model3">1.6 (bottom)</a> shows this
scenario, where in addition to activity, difficulty of the queries also
played a role. When we condition on both activity and query difficulty,
we find that the result flips again: the new algorithm turns out to have
a higher click-through rate. This example illustrates the difficulty of
drawing causal conclusions from data alone. Given the same data, the
causal conclusion is highly sensitive to the underlying structural
model. While we discuss ways to eliminate models that are inconsistent
with the data in Chapter 4, it is not possible to
infer the right causal model purely from data. Thus, causal reasoning
from data must necessarily include domain knowledge that informs the
creation of the structural model. Note that this is not a contrived
scenario: dilemmas like these are pervasive and come under different
names, such as selection bias, Berkson’s paradox, and others that we
will discuss later in the book. As a trivial example, this is the same
reason that a naive analysis of hospital visits and death might conclude
that going to a hospital leads to death, but of course that is not the
correct causal interpretation.</p>
<p>In the rest of the book, we will describe different identification
methods that can be used to <em>deconfound</em> an observed association. We
will describe how to choose the right formulas for deconfounding and
how to estimate the effect. In some cases, however, we may not be able to
identify an effect given the model and available data. In that case, we
may reconsider the modeling assumptions, collect new kinds of data, or
declare that it is impossible to find the causal effect.</p>
<h3 id="estimation">Estimation</h3>
<p>Once a causal effect has been identified, we can estimate it using
statistical methods. One way is to directly plug-in an estimate based on
the identified estimand. For instance, in the ice-cream example, we may
stratify the data based on the different temperatures and then use the
plugin estimator for conditional mean. As a concrete example, consider
the randomized experiment to determine the effect of medication from
Section <a href="#sec:ch1-randomized-exp">1.2</a>. Given the identified estimand from
above, we can estimate the effect using a simple plug-in estimator.</p>
<p>Average Causal Effect = <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 1)] − <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 0)]</p>
<p>=<strong>E</strong>[<em>Y</em>|<em>A</em> = 1]−<strong>E</strong>[<em>Y</em>|<em>A</em> = 0] … <em>(Identification Step)</em></p>
<p><script type="math/tex">=\\frac{1}{N\_{G1}} \\sum\_{i \\in G1} Y\_i - \\frac{1}{N\_{G0}} \\sum\_{j \\in G0} Y\_j \\ \\ ...\\ \\ \\textit{(Estimation Step)}</script>
where <em>G</em>1 and <em>G</em>0 refer to two groups of people. For infinite data,
this is the best estimator for the causal effect, since it directly
estimates the estimand. However, in practice, we have finite data that
introduces variance challenges. If there are many variables to condition
on, we may not have enough data in each stratum and hence the
conditional means will no longer be trustworthy.</p>
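<p>As an illustration, here is the plug-in estimator on a small simulated randomized experiment (our own hypothetical data, not the medication study; the true effect is set to 2 by construction):</p>

```python
import random

random.seed(1)

# Hypothetical randomized experiment: treatment A shifts outcome Y by +2.
data = []
for _ in range(10_000):
    a = random.randint(0, 1)              # randomized treatment assignment
    y = 5 + 2 * a + random.gauss(0, 1)    # outcome with true effect = 2
    data.append((a, y))

# Plug-in estimator: difference of the sample means of the two groups.
g1 = [y for a, y in data if a == 1]
g0 = [y for a, y in data if a == 0]
ate_hat = sum(g1) / len(g1) - sum(g0) / len(g0)
print(f"estimated effect: {ate_hat:.2f}")  # close to the true effect of 2
```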
<p>In general, high-dimensionality is one of the major problems for
estimation that we will tackle in this book. Many methods exist for
handling such data. One approach is to coarsen the strata so that the
strata become approximate (e.g., Temperature ranges in multiples of 10)
but the conditional means have low variance. Another approach is to
instead stratify on the probability of treatment rather than on all the
confounding variables. This makes for better stratification, but the
bias in the strata now depends on the method used for
estimating the probability of treatment.</p>
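<p>The coarsening idea can be sketched as follows. In this synthetic example of our own construction, treatment probability happens to depend on temperature only through its 10-degree band, so the coarse strata remove all of the confounding; with real data, coarse strata would leave some residual bias in exchange for lower variance:</p>

```python
import random

random.seed(2)

def band(temp):
    return int(temp // 10)   # coarse strata: 0-10, 10-20, 20-30, 30-40 degrees

# Synthetic confounded data: temperature drives both the treatment (eating
# ice-cream) and the outcome (swimming); the true causal effect is 0.
rows = []
for _ in range(50_000):
    temp = random.uniform(0, 40)
    treated = random.random() < (band(temp) + 1) / 5   # hotter bands: more treated
    swim = 0.3 * temp + random.gauss(0, 1)
    rows.append((temp, treated, swim))

def mean(xs):
    return sum(xs) / len(xs)

# A naive difference in means is badly confounded by temperature...
naive = (mean([s for _, t, s in rows if t]) -
         mean([s for _, t, s in rows if not t]))

# ...but a stratum-size-weighted average of within-stratum differences is not.
adjusted = sum(
    (mean([s for tp, t, s in rows if band(tp) == b and t]) -
     mean([s for tp, t, s in rows if band(tp) == b and not t]))
    * sum(1 for tp, _, _ in rows if band(tp) == b) / len(rows)
    for b in range(4))

print(f"naive: {naive:.2f}, stratified: {adjusted:.2f}")
```

<p>The naive estimate is far from zero, while the stratified estimate recovers the true null effect.</p>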
<p>As we will see, many of these methods can also utilize machine learning
whenever the estimand can be written as a function of available data.
Note that estimation of the causal effect is a purely statistical
exercise that estimates the identified causal estimand. Keeping
identification and estimation separate has a nice modular advantage:
the two can be performed independently, using
different methods. It can also tell us when improving the estimation
algorithm is not likely to yield benefits, such as when the causal
effect is not identified. Throughout, we will emphasize the
separation between identification on the causal model and estimation
on the data. The causal interpretation of any calculated effect comes
from the structural model, and can be derived without access to any data
(assuming that the structural model is correct).</p>
<h3 id="refute">Refute</h3>
<p>The above three steps will yield an answer to our causal question, but
how much should we trust this estimate? As remarked above, causal
interpretation comes from identification, which in turn derives its
validity from the modeling assumptions. Therefore, the last and perhaps
most important step is to check the assumptions that led to
identification. In addition, the estimation step also makes assumptions
regarding the statistical properties of the data to derive the estimate,
which also need to be verified. While the structural model cannot be
validated from data, we will discuss how in some cases, the observed
data can help us eliminate causal models that are inconsistent with data
and check the robustness of our estimate to causal assumptions. As we
discussed in the search engine example, a common faulty assumption is
that all confounders are known and observed. In
Chapter 4, we show how we can simulate datasets with
unknown confounders and assess the sensitivity of the estimate to such
assumption violations. We will also discuss other tests of the
identifying assumptions. These sensitivity tests cannot prove the
validity of an assumption, but rather help us refute some kinds of
assumptions.</p>
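<p>One such refutation is the placebo-treatment test (DoWhy ships refuters in this spirit; the following is only a plain-Python sketch on simulated data of our own): replace the real treatment with a coin flip and re-run the estimator. If the pipeline is sound, the estimate should collapse to zero under the placebo.</p>

```python
import random

random.seed(3)

# Simulated randomized data with a true treatment effect of +2.
data = [(a, 5 + 2 * a + random.gauss(0, 1))
        for a in (random.randint(0, 1) for _ in range(10_000))]

def diff_in_means(pairs):
    """Plug-in estimator: difference of group means."""
    g1 = [y for a, y in pairs if a == 1]
    g0 = [y for a, y in pairs if a == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

real = diff_in_means(data)
# Placebo refutation: swap the treatment for an unrelated coin flip.
placebo = diff_in_means([(random.randint(0, 1), y) for _, y in data])
print(f"real treatment: {real:.2f}, placebo treatment: {placebo:.2f}")
```

<p>A placebo estimate far from zero would signal a bug or a violated assumption somewhere in the analysis.</p>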
<p>While the sensitivity to causal assumptions may seem like a big
disadvantage, this is actually a fundamental limitation of learning from
data. Multiple causal models can explain the same data with exactly the
same likelihood, so without any additional knowledge, it is
impossible to disambiguate between them. The benefit of expressing causal
assumptions in the form of a separate structural model is that it allows
us to emulate the scientific method in doing our analysis: hypothesize a
theory, design an experiment to test it, improve the theory.
Analogously, we can imagine a workflow where we start with a causal
model, test its assumptions with data, and then change the assumptions
based on any inconsistencies. However, if any causal effect depends on
the underlying structural model, how is it possible to test the
assumptions of a causal model? Fortunately, there is one method whose
causal conclusions do not depend solely on assumptions from a structural
model. We present this next.</p>
<h2 id="the-rest-of-this-book">The rest of this book</h2>
<p>Part I. of our book focuses on a conceptual introduction to these four
steps. Chapter 2 covers modeling and identification (Steps 1 and 2).
Chapter 3 focuses on estimation. Chapter 4 discusses refutations.</p>
<p>Part II. of this book goes into more of the practical nuts and bolts of
these four steps, including details of analytical methods for
identification (Chapter 5), a variety of statistical estimation methods
for conditioning-based methods (Chapter 6) and natural experiments
(Chapter 7), and details of methods for validating and refuting
assumptions in practice (Chapter 8). Chapter 9 introduces a number of
concerns that complicate real-world analyses, and discusses basic
approaches and extensions to mitigate their consequences.</p>
<p>Part III. of this book focuses on the connections between causal
reasoning and its application in the context of core machine learning
tasks (Chapter 10). We provide a deeper discussion of causal reasoning
for experimentation and reinforcement learning (Chapter 11),
considerations when learning from observational data (Chapter 12), how
causal reasoning relates to robustness and generalization of machine
learning models (Chapter 13), and connections between causal reasoning
and challenges of explainability and bias in machine learning (Chapter
14).</p>
Wed, 11 Dec 2019 00:00:00 +0000
https://causalinference.gitlab.io/causal-reasoning-book-chapter1/
<h2 id="book-outline">Book outline: Causal Reasoning: Fundamentals and Machine Learning Applications</h2>
<p>This book is aimed at students and practitioners familiar with machine learning
(ML) and data science. Our goal is to provide an accessible introduction to
causal reasoning and its intersections with machine learning, with a particular
focus on the challenges and opportunities brought about by large-scale
computing systems acting as interventions in the world, ranging from online
recommendation systems to healthcare decision support systems. We hope to
provide a practical perspective to working on causal inference problems and
a unified interpretation of methods from varied fields such as statistics,
econometrics and computer science, drawn from our experience in applying these
methods to online computing systems.</p>
<p>Throughout, methods and complicated statistical ideas are motivated and
explained through practical examples in computing systems and their applications.
In addition, we devote a third of the book to discussing machine learning
applications of causal inference in detail, in different settings such as
recommendation systems, system experimentation, learning from log data,
generalizing predictive models, and fairness in computing systems.</p>
<p>Beyond our focus on machine learning applications, we expect that three aspects of
our perspective on causal reasoning will be woven throughout our treatment (pun
not intended) of this material, to help organize our materials and provide what
may be a distinct viewpoint on causal reasoning. While this book is targeted
primarily for computer scientists, these aspects may also make the book useful
for learners more broadly:</p>
<ol>
<li>
<p>We present a unified view of causality frameworks, including the two major
frameworks from statistics (Potential outcomes framework) and computer science
(Bayesian graphical models) which are often not presented together. We present
how these are compatible frameworks that have different strengths, are
appropriate for different stages of a causal reasoning pipeline, and provide
practical advice on how to combine them in a causal inference analysis. In
addition, by introducing causal inference through the core concepts of
interventions and counterfactuals, we introduce causal inference
methods from a “first-principles” approach, creating a clear taxonomy of
back-door and natural experiment methods and highlighting similarities between
various methodologies.</p>
</li>
<li>
<p>We make an explicit distinction between identification (causal) and
estimation (statistical) methods. While this distinction is fundamental to
causal reasoning, it is overlooked in many texts on causal inference, preventing
readers from understanding the distinction from statistical methods and sources
of error in their causal estimate. Through this distinction, we make a natural
connection to machine learning: ML can be useful for all statistical parts of
causal reasoning, but it is not useful for identification, which follows from
causal assumptions, whether implicit or explicit. Throughout the book, we discuss
how machine learning can enrich estimation methods by allowing non-parametric
estimation, and how causal reasoning can be useful to make ML methods more
robust to environmental changes.</p>
</li>
<li>
<p>Finally, we discuss the criticality of assumptions in any causal analysis
and present practical ways to test the robustness of a causal estimate to
violation of its assumptions. We refer to this exercise as “refuting” the
estimate, in a similar way to how refutation of scientific theories is a common
way to test their relevance. Based on our experience, we present different ways
to test assumptions, check robustness and conduct sensitivity analysis for any
obtained estimate.</p>
</li>
</ol>
<p>Throughout, we will include code examples using <a href="https://github.com/microsoft/dowhy">DoWhy</a>, a Python library for causal inference.</p>
<hr />
<p>The current outline of our book is as follows:</p>
<h3 id="choutline-part1">PART I. Introduction to Causal Reasoning</h3>
<p>Part I. of our book describes the four steps of causal inference and the intuitions and core technologies underlying each step. Chapter 2 covers modeling of causal assumptions using causal graphs. Chapter 3 presents the analytical methods for identification, including how Do-Calculus and additional assumptions can be used to derive common identification strategies such as the adjustment formula and instrumental variables. Chapter 4 presents a variety of statistical estimation methods and their practical considerations, including methods based on balance, weighting, outcome-modeling, and thresholds. Chapter 5 discusses approaches for sensitivity analysis, validation of assumptions, and other evaluation of causal effects.</p>
<p>Chapter 1. Introduction</p>
<p>Chapter 2. Causal Models, Assumptions</p>
<p>Chapter 3. Identification</p>
<p>Chapter 4. Causal Estimation</p>
<p>Chapter 5. Refutations, Validations, and Sensitivity Analyses</p>
<h3 id="choutline-part2">Part II. Causal Machine Learning</h3>
<p>In the second half of this book, we focus on the connections between causal reasoning and its connections to core machine learning tasks (Chapter 6).
We build on the core intuitions and components of causal inference introduced in Part I and show how they can be recombined to address critical challenges in experimentation and reinforcement learning (Chapter 7);
learning and off-policy evaluation from biased logs and other observational data (Chapter 8); robustness, generalization and domain adaptation of machine learning models (Chapter 9), and interpretability, explainability and bias in machine learning (Chapter 10).</p>
<p>Chapter 6. Connections between Causality and Machine Learning</p>
<p>Chapter 7. Experimentation and Reinforcement Learning</p>
<p>Chapter 8. Learning from Logged Data</p>
<p>Chapter 9. Generalization in Classification and Prediction</p>
<p>Chapter 10. Machine Learning Explanations and Bias</p>
<hr />
<p>We have posted Chapters 1-4 of our book as of April 2021, and will be releasing
new chapters regularly. Our texts will often be rough, especially on
their initial posting, and we expect they will see substantial change throughout
the writing process. We appreciate in advance your patience with
our errors and mistakes, as well as your comments and feedback throughout.</p>
Wed, 11 Dec 2019 00:00:00 +0000
https://causalinference.gitlab.io/Causal-Reasoning-Fundamentals-and-Machine-Learning-Applications/