# Chapter 4: Estimation

Once we have found strategies for identifying causal quantities, we need to choose how to estimate those causal quantities using statistical methods. We describe the most commonly used methods using examples inspired by real computing applications. First, we present the basics of causal estimation: how to go from an identified estimand to an estimator? We describe the challenges of bias and variance that need to be traded off in every estimation method. Second, we present a variety of estimation methods, starting from simple, interpretable estimators to complex machine learning-based estimators that are often required when the data is high-dimensional.

## 4.1 Example: Building an estimator

In the last chapter, we saw that identification is the process of transforming a causal quantity to a statistical one, called the identified estimand. After identification, estimation is the process of computing this quantity using available data. Figure 1: Number of searches issued for ice-cream and swimming in a search engine. Over the observed time period, there is a correlation between searches for the two queries. Try this example yourself using an online DoWhy notebook: https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_confounder_example.ipynb.

### 4.1.1 Estimating causal effect of ice-cream consumption

Suppose that you have access to anonymized data on search queries issued to a popular search engine. Suppose further that you are a scholar on ice-cream and want to understand its effects on different health outcomes. Your colleague shares with you an intriguing observation: the number of search queries for “where can I have ice-cream” are highly correlated with the number of search queries for “where can I swim.” They ask you, does this mean that having ice-cream makes people more likely to want to swim? Fig. 1 plots the two search queries over time.

The observational evidence is quite strong, but you may be suspicious. In chapter 1, we saw that temperature can be a confounder and therefore we should account for it. Chapter 3 provided with the correct estimand for doing so, by introducing the back-door criterion. Specifically, assuming that there are no other confounders, the causal estimand is: $\label{eq:ch03-icecream-identify} \mathbb{E}[\textit{swim}| \operatorname{do}(\textit{icecream})] = \sum_{\textit{temp}} \mathbb{E}[\textit{swim}|\textit{icecream}, \textit{temp}] \operatorname{P}(temp)\qquad(1)$ where swim and icecream correspond to the number of queries for swimming and ice-cream respectively and temp is the temperature at that time. There are multiple ways to estimate the above quantity. One approach is to discretize $\textit{icecream}$ and $\textit{temp}$ and compute the mean value of $\textit{swim}$ for particular values of $icecream$ and $temp$. Let the two values for $\textit{icecream}$ be $\{\texttt{High}, \texttt{Low} \}$ and $\textit{temp}$ be represented by three values $\{ \texttt{Low}, \texttt{Medium},\texttt{High}\}$. Then the causal estimand can be estimated using the following two equations: $\label{eq:ch03-icecream-est} \begin{split} \mathbb{E}[\textit{swim}|\textit{icecream}&=\texttt{ic}, \textit{temp}=\texttt{te}] = \frac{1}{N_{\texttt{te},\texttt{ic}}} \sum_{\substack{\textit{temp}=\texttt{te}, \\ \textit{icecream}=\texttt{ic}}}\textit{swim} \\ \Pr(temp) &= \frac{N_{\texttt{te}}}{N} \end{split}\qquad(2)$

The above estimation works well as long as each of the conditional means can be estimated reliably. What happens when there is little data for a specific temperature and ice-cream queries bucket? Table 1 presents such a dataset where very few people search for ice-cream at low temperatures. Each entry in the dataset is a contiguous period of time with low, medium or high temperature. For each entry, the dataset provides the frequency of ice-cream queries by level and the number of swimming queries. If we apply Eqns. 1, 2, we obtain a positive effect of ice-cream searches on swimming. But obviously we know that is not the case. The culprit is the estimation of the mean from very small samples (in this case, a single data point!). A possible fix is to exclude data from low temperatures and compute the effect over medium and high temperature. However, our new estimate no longer corresponds to all temperatures: based on the data, we cannot say anything about ice-cream’s effect at low temperatures.

The example illustrates the challenges with estimating a causal quantity from data, even with a straightforward identified estimand. While we showed an example with a single discrete confounder, the scenario of low frequency for specific values of the confounder is more likely when confounders are multi-dimensional and unavoidable when they are continuous. How to condition on high-dimensional confounders is one of the key questions for causal estimation methods.

 Temperature Ice-cream queries Frequency Mean swimming queries Low Low 9999 505 Low High 1 560 Medium Low 5000 2151 Medium High 5000 2150 High Low 1000 4750 High High 9000 4751

### 4.1.2 Estimating causal effect using a randomized experiment

Many of the above estimation challenges are solved if we can create our own data. Let us select a random sample of people in a city and divide them into two groups: one group is provided ice-cream every day for a week and the other is provided some other snack. We can then track their swimming activity (or searches for swimming) over a week. This creates a randomized experiment where the treatment is having ice-cream and outcome is swimming activity. From Chapter 3, we know that the identified estimand in a randomized experiment is simply the conditional expectation of outcome given treatment. With this estimand as the target estimand and the available experimental data, we can now estimate this quantity using a simple plug-in estimator without worrying about confounders like temperature.

$\mathbb{E}[Y|do(T=t)] \rightarrow\mathbb{E}[Y|T=t] \rightarrow\frac{\sum_{i=1}^N \mathbb{1}_{[T=t]} Y}{\sum_{i=1}^N \mathbb{1}_{[T=t]}}$

Using this basic estimator, we can now estimate different causal quantities of interest. The most commonly used is the average treatment effect (ATE). When the treatment is binary, the ATE simplifies to: $\begin{split} \textrm{ATE} & := \mathbb{E}[Y|do(T=1)] - \mathbb{E}[Y|do(T=0)] \\ &\rightarrow\mathbb{E}[Y|T=1] - \mathbb{E}[Y|T=0] \rightarrow\frac{\sum_{i=1}^N \mathbb{1}_{[T=1]} Y}{\sum_{i=1}^N \mathbb{1}_{[T=1]}} - \frac{\sum_{i=1}^N \mathbb{1}_{[T=0]} Y}{\sum_{i=1}^N \mathbb{1}_{[T=0]}} \end{split}$ Note how we do not need to worry about temperature or any other variable. In a randomized experiment, the estimate is simply the difference in mean swimming activity between people who were provided ice cream and those who were not.

Sometimes, however, we may be interested only in the effect on specific people, e.g., people belonging to a certain demographic or sharing some attribute. Since the treatment was randomized, these attributes are independent of treatment but can be correlated with the outcome. These attributes are called “effect modifiers” and the resultant estimate, conditional average treatment effect, commonly abbreviated as CATE. It can be estimated using the same principle: identifying the causal quantity and then using a simple estimator for it.

$\begin{split} \textrm{CATE} & := \mathbb{E}[Y|do(T=1), C=c] - \mathbb{E}[Y|do(T=0), C=c] \\ &\rightarrow\mathbb{E}[Y|T=1, C=c] - \mathbb{E}[Y|T=0, C=c] \\ & \rightarrow\frac{\sum_{i=1}^N \mathbb{1}_{[T=1, C=c]} Y}{\sum_{i=1}^N \mathbb{1}_{[T=1, C=c]}} - \frac{\sum_{i=1}^N \mathbb{1}_{[T=0, C=c]} Y}{\sum_{i=1}^N \mathbb{1}_{[T=0, C=c]}} \end{split}$

### 4.1.3 Challenges in estimation with finite data

If we have large enough data from a randomized experiment, then simple estimators like these are sufficient. However with limited data, there still exist multiple challenges in estimating a target estimand from a randomized experiment (and therefore also for observational studies). For example, consider a randomized experiment where the true causal effect varies greatly between people. In particular, there are a few people for which the true effect is orders of magnitude more than that for others. Looking at the differences of means over $N$ samples for ATE above, depending on whether such outlying people are assigned treatment or control can influence the resultant ATE. If you run the experiment again, you may get a substantially different outcome! Knowing this, one may trim all recorded outcomes such that outcomes above a certain threshold are constrained to the threshold value. This will make the results of each experiment more repeatable, but we have a different problem now: we are no longer estimating the target estimand due to this arbitrary cutoff.

When the treatment is continuous, a different set of challenges emerge. Perhaps the most fundamental is that the treatment effect is no longer well-defined. Do we estimate the change in swimming queries when ice-cream queries increases from 0 to 10, or from 100 to 101? If the effect varies for different values of treatment, which one do we report?

Finally, there is also a more general question of whether the search engine users are representative of the underlying population. If not, we may estimate the causal effect of ice-cream on users of the search engine, but that may not generalize to our population of interest. Perhaps users of the search engine are older, healthier or different in some way than the general population? Just like we defined treatment assignment mechanism for identification in chapter 3, sampling mechanism is important for estimation. A sampling mechanism is the process by which data is collected that in turn, determines how well the sample represents a target probability distribution. For example, by collecting data only from search engine users, the resultant data may not represent the target population, the people in a city.

Formally, a sampling mechanism defines a probability distribution from which available data points are drawn independently. The extent to which the sampling distribution matches the target data distribution determines the suitability of data. Therefore, any quantity estimated on available data corresponds only to the specific sampling distribution from which it can be assumed to be sampled randomly, unless one can assume that available data is representative of the general population.

Our example above illustrates the fundamental tradeoff between bias and variance of a statistical estimator. Bias corresponds to the degree to which the estimate is expected to be close to the identified estimand. Variance of an estimator corresponds to the expected difference in an estimate if an experiment is run multiple times.

In general, reducing the variance—e.g., by removing outliers in our example above—increases bias, and vice-versa. It is difficult to obtain an estimator that will have both low bias and low variance. To complicate matters, an estimator with low bias and low variance for one dataset may have very different properties on a different dataset, thus in general there is no universally best estimator.

The goal of estimation is to find a satisfactory tradeoff between bias and variance. Formally, for a target estimand $h$ and its estimator $\hat{h}$, bias is defined as the different between the target estimand and the expected value of $\hat{h}$ over multiple independent samples. Variance is defined as the expected value of the squared difference between estimates on different samples and the mean estimate. $\begin{split} Bias &:= |h - \mathbb{E}[\hat{h}]| \\ Var &:= \mathbb{E}[(\hat{h}- \mathbb{E}[\hat{h}])^2] \end{split}$ Figure 2: Bias and variance of different estimators for the same identified estimand, $\mathbb{E}[Y|T=t]$. The dotted horizontal line shows the true average causal effect, 10. Vertical distance of the mean estimate from the true ACE denotes the bias; variance is the denoted roughly by the length of each boxplot. The first estimator is the optimal one with lowest bias and variance. The next two estimators are biased while the last three are unbiased with high variance. Try out these estimators yourself in an online DoWhy notebook: https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/dowhy_bias_variance_tradeoff.ipynb.

### 4.2.1 Factors affecting the bias and variance of a causal estimator

Fig. 2 illustrates the different ways in which an estimator can have high bias or high variance. For simplicity, the treatment is assumed to be binary, with two levels $0$ and $2$. Data was generated using the structural equation, $y=\beta t^2$ where $\beta=5$ for half of the population and $\beta=15$ for the other half, thus leading to mean causal effect of $10$. There are no confounders in the causal model. All estimators use a dataset with 1000 units except the “Few samples” estimator that uses 50 units.

Based on the structural equation, the “correct” estimator uses linear regression with a transformed $t^2$ feature. However, in practice the structural equation is not known. Even if we assume that we know the correct identified estimand, $\mathbb{E}[Y|\operatorname{do}(T)=t)] \rightarrow\mathbb{E}[Y|T=t]$, it is not guaranteed that we will recover the correct estimate. The quality of the resultant estimate depends on the following factors.

• Model choice. If the estimating model is not flexible or complex enough to model the relationships in data, or if it is too complex and overfits available data, then the resultant estimator will be biased. The second estimator in Fig. 2 shows linear regression applied directly to the treatment variable. This parameterized model fails to capture that the outcome depends on the square of the treatment values and will result in an incorrect estimate, even with infinite data.

• Sampling mechanism. The second important contributor to bias is the sampling mechanism. If the treatment effect differs for different units, then it is important that the available data represents an independently sampled set from the target distribution. If not, then the sampling distribution is different from the target distribution and the estimated effect suffers from bias as shown in the third boxplot in Fig. 2. The estimator uses the correct model specification with linear regression on $t^2$ but uses data is sampled more often from units having a smaller effect. The result is a biased estimate.

• Sample size. Like with any statistical estimation, a small sample size increases the variance of a causal effect estimator. For a linear regression estimator on $t^2$ (the correct model specification), variance is shown in the fourth boxplot in Fig. 2 when the estimator is provided a smaller dataset of 50 samples. With fewer data samples, variance is higher. The estimate can be anywhere between 5 and 15, although the average of the estimates over multiple datasets is correct and thus there is no bias. Note that the current estimator does not condition on any confounders. When an estimator additionally conditions on confounders, it is not enough to ensure a high overall sample size: high variance can occur if data for any one of the confounder values have low frequency. For example, given an identified estimand with ten binary confounders, the total number of unique confounder configurations is $2^{10}=1024$. A dataset with 10000 units will lead to less than 10 units per confounder value and a dataset of 1000 units will not even cover all the confounder configurations.

• Overlap. Related to the issue of small sample size, variance of an estimator also increases if there is not enough data for certain treatment values or corresponding control values, and thus low overlap between the treatment and control values. The overlap problem is exacerbated when we want to estimate the treatment effect over specific subsets $z$ of the data, also known as the conditional treatment effect, since we require every such subset to have enough data points for treatment and control values. As we saw in chapter 3, non-zero overlap is enough for identification ($\operatorname{P}(a|z)>0$ whenever identifying $P(y|\operatorname{do}(a))$) but low overlap leads to variance in the estimator. This is because causal estimation depends fundamentally on comparing the values of an outcome on different values of the treatment. If some treatments are rare, then estimation of the outcome conditional on those treatments can have high variance. We saw this already in the ice-cream example (Table 1) where overlap of the treatment was poor for the data with “low” temperature. The second boxplot from the right in Fig. 2 shows the estimated effect on a dataset where 99% of the units belong to the treated group and only 1% of the units to control group, using an unbiased estimator (correct model specification without any sampling bias). Due to the poor overlap, the estimated effect varies widely from a minimum of $0$ to a maximum of $20$.

• Outliers: Having outlier values of the outcome in the data also increases the variance of an estimator. This can happen due to noise in measuring the outcome, or due to a small number of units having a different value of the causal effect. Unlike the issue of small sample size that can be resolved by collecting more data, the issue of outliers can remain even with large datasets. Unfortunately there is no easy solution. On one hand, removing the outliers will lead to bias since the sampling mechanism no longer corresponds to the target distribution. On the other hand, including outliers increases variance for the effect estimator. Similar to the issue of poor overlap, having outliers in the data increases variance of the estimate, as seen in the rightmost boxplot in Fig. 2.

Fig. 2 showed the impact of adding any single issue to an estimator, but in practice, multiple issues can affect an estimator. Complicating matters, fixing a bias issue often exacerbates a variance issue and vice-versa. For example, variance in an estimator due to small sample size can be alleviated by assuming a simpler parametric model (e.g., linear regression) but that introduces bias due to model choice. Correspondingly, reducing model choice bias often involves selecting a complex model with several parameters that in turn increases the variance for estimating those parameters from the same data. In general, while variance can be reduced with a larger dataset, bias cannot be resolved by collecting more data. Even with infinite data, an incorrectly specified model will lead to a biased estimate, and so will an infinite dataset with an incorrect sampling mechanism.

### 4.2.2 Many estimation methods for the same identified estimand

In practice, it is not possible to compute the bias of an estimator given observational data. Therefore, estimation methods need to make their best “guess” about useful assumptions to compute the effect. As a result, multiple competing estimating methods can exist for the same target estimand.

The set of assumptions made by an estimator characterizes the tradeoff it makes between bias and variance. In section 4.3, we will start with non-parametric estimators that make the least assumptions about the structural equation for the outcome variable. Therefore, these estimators have low bias due to model choice. Another benefit is that the resulting method is simple to understand and interpretable. Since it is not possible to compute the bias of an estimator in practice, simple and interpretable estimators offer the best bet for obtaining an estimate with low bias. However, simplicity comes at the cost of high variance. If there is low sample size, poor overlap, or outliers in the data, then simple methods will yield a high variance.

Especially in high-dimensional data, simple estimators fail to be effective since their variance can be prohibitively high. Therefore, we will next discuss methods that make parametric assumptions on how the data was generated. Some of these methods use a parametric model to estimate the treatment assignment, while still using a non-parametric computation of the effect on outcome (sections 4.3 and 4.4). Others assume a parametric model for both the treatment and the outcome (section 4.4.2). All these estimation methods incur the cost of model choice bias, but are willing to make the tradeoff to obtain lower variance that makes them applicable for high-dimensional data. At the extreme, there are methods that ignore the treatment assignment altogether and directly use a parametric model for the outcome (section 4.5).

While they offer lower variance, the bias of parametric model-based estimators depends critically on how well their model captures the true structural equations. To model the structural equations, many estimator methods reduce the causal effect estimation problem to a series of prediction problems (e.g., predicting the treatment or the outcome variable). Machine learning models are best-suited for such supervised learning tasks and thus they play a pivotal role in the parametric estimation methods.

Since the bias-variance tradeoff applies universally for all methods, we present methods organized by the different approaches they take to estimate the causal effect.

• Balance-based. Conditioning on variables to estimate a target estimand.

• Weighting-based. Weighted sampling of a dataset.

• Outcome model-based. Fitting a model that predicts the outcome based on the treatment.

• Threshold-based. Exploiting a discontinuity in one of the variables to obtain local effects.

## 4.3 Balance-based methods

For observational data, one of the most popular strategies to obtain an estimator is to directly approximate the back-door criterion. In the last chapter, we saw that the back-door estimand can be expressed as, $\label{eq:ch04-backdoor-eqn} \mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] \rightarrow\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\operatorname{P}(\mathbf{w}=\mathbf{w})\qquad(3)$ where $\mathbf{w}$ represents all common causes of the treatment $\textrm{t}$ and outcome $\textrm{y}$. To estimate the above expression, we need to create data subsets where $\mathbf{w}$ is constant. This goal is often called covariate balance where covariates refer to confounders or simply balance. Balance-based methods try to attain covariate balance given the observed data so that the above equation can be estimated. We will start with simple plug-in estimators of the above expression and then move to model-based methods.

Figure 3: Visual illustrations of the backdoor-based estimation methods for estimating the average treatment effect on the treated. Treatment data points are blue, control points are red. The first row shows the distribution of the original data, decreasing in overlap conditional on the confounder $w$ from left to right. The second shows the stratification estimator that discretizes the distribution and only computes the estimate on values of $w$ that have overlap, the rest are ignored. The third row visualizes the matching estimator. Horizonal lines connect the original treatment (blue) points to their matched control (red) points. Distance (and hence the values of confounder $w$) between the matched points increases as the overlap is reduced. Finally, the fourth row the inverse propensity estimator that reweighs the data distribution to obtain a nearly identical distribution across treatment and control. As the overlap decreases from left to right, note how very few control data points are magnified to probabilities higher than that of the treatment points at the same $w$, and yet some other values of $w$ are still not covered. Try out these estimators yourself using an online DoWhy notebook: https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/visualize_conditioning_estimators.ipynb.. a — Data I, b — Data II, c — Data III, d — Stratification I, e — Stratification II, f — Stratification III, g — Matching I, h — Matching II, i — Matching III, j — Weighting I, k — Weighting II, l — Weighting III

### 4.3.1 Simple Stratification

A straight-forward way to estimate the back-door estimand is to estimate the conditional probabilities directly. We obtain, $\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{\mathbf{w}\in \mathbb{W}} \frac{n_w}{N} \frac{\sum_{i=0}^{N} \mathbb{1}_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}]}{y}}{\sum_{j=0}^{N} 1_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}]}}$ Effectively, the method divides data into different strata, each having the same value of $\mathbf{w}$. Then these estimates are summed up weighted by the number of data points in each strata, leading to a weighted average of the strata effects.

This is an unbiased estimator of the backdoor estimand. Due to the division into multiple strata, this method is called stratification. This is a suitable method to use when $\mathbf{w}$ consists of categorical variables and is of low dimension compared to the dataset size. Otherwise the number of data points in each strata may be too little to estimate the stratum-wise values.

Formally, for a stratification estimator to have low variance, each strata must have a minimum number of data points. This is because the stratification estimator is simply the mean of all strata’s effects: if there are a some strata with very few samples and large estimate values, it can vary the overall mean (although the effect is dampened due to weighting by strata size). For this reason, in practice, it is worth adding a cutoff on the minimum stratum size that can avoid sensitivity to outlier values, at the cost of producing a biased estimate. $\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{\mathbf{w}\in \mathbb{W}; n_w > M} \frac{n_w}{N} \frac{\sum_{i=0}^{N} \mathbb{1}_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}]}{y}}{\sum_{j=0}^{N} 1_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}]}}$

Sometimes the excluded strata share attributes that are interpretable. For example, it could be that all data points from a particular country are excluded due to low sample size, or data points from a certain age. Often such sample size differences between strata are a result of natural processes (fewer transactions from a particular country) or discrepancies in collecting the data (e.g., sampling younger people more). In these cases, it is helpful to interpet the estimated effect as a local average treatment effect, applicable only to the strata included in analysis, $\mathbb{W}' \subset \mathbb{W}$. Formally, the local ATE is a biased estimate of the (global) ATE, but it may often be useful in its own right to interpret the effect for at least the included strata. $\textrm{LATE}:= \sum_{\mathbf{w}\in \mathbb{W}'} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{\mathbf{w}\in \mathbb{W}'} \frac{n_w}{N} \frac{\sum_{i=0}^{N} \mathbb{1}_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}]}{y}}{\sum_{j=0}^{N} 1_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}]}}$

The conditional average treatment effect (CATE) can be calculated in a similar way. Given some variables $\mathbf{v}\subseteq \mathbf{w}$ on which CATE estimates are needed, we can choose appropriate strata for each value $\mathbf{v}=\mathbf{v}$ of the variables. $\begin{split} \textrm{CATE}(\mathbf{v}=\mathbf{v}) &:= \sum_{\mathbf{w}\in \mathbb{W}, \mathbf{v}=\mathbf{v}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w}, \mathbf{v}=\mathbf{v})\Pr(\mathbf{w}=\mathbf{w}, \mathbf{v}=\mathbf{v}) \\ &\rightarrow\sum_{\mathbf{w}\in \mathbb{W}, \mathbf{v}=\mathbf{v}} \frac{n_w}{N} \frac{\sum_{i=0}^{N} \mathbb{1}_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w},\mathbf{v}=\mathbf{v}]}{y}}{\sum_{j=0}^{N} 1_{[\textrm{t}=t_0, \mathbf{w}=\mathbf{w}, \mathbf{v}=\mathbf{v}]}} \end{split}$

When $\mathbf{w}$ includes continuous variables, simple stratification becomes challenging. A workaround is to discretize continuous variables based on quantiles or other splits based on domain knowledge. Such coarsening allows us to create discrete strata, but the resultant estimate may depend heavily on the splits for discretization. Further, it limits the application of the estimate: one can only talk about the effect of increasing a continuous variable from one split to the other, but not in smaller steps.

### 4.3.2 Matching

When exact conditioning on $\mathbf{w}$ is not possible, either due to high-dimensionality or continuous variables, we can find units with as similar $\mathbf{w}$ as possible. A simple way is to define a metric of distance between any two units based on their $\mathbf{w}$ values. Then for every unit $i$ in the treatment group, we find the closest unit $i_{matched}$ in the control group (and vice versa), as an approximation of them sharing the same $\mathbf{w}$. For each matched pair, we compute the difference of outcomes between the treated and control and average it over all units. $\sum_{\mathbf{w}\in \mathbb{W}_S} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}) \rightarrow\sum_{i=1}^{N} \begin{cases} y_i & \textrm{ if } t_i = t_0 \\ y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0 \end{cases}$

An importance choice is that of the distance metric, since that dictates what it means to have similar $\mathbf{w}$. Given that different $\mathbf{w}$ may have different scales, a common choice is a standardized distance like Mahalanobis that accounts for the difference in variances betwen different elements of $\mathbf{w}$. $\operatorname{dist}(\mathbf{w}_i, \mathbf{w}_j) = \sqrt{(\mathbf{w}_i - \mathbf{w}_j)\Sigma^{-1}(\mathbf{w}_i-\mathbf{w}_j)}$ where $\Sigma$ is the covariance matrix for $\mathbf{w}$.

Still, in a finite sample of data, it is possible that the best match for a unit may not be similar to it at all. Therefore, like we needed a minimum cutoff for stratum size in stratification, we need a cutoff for the maximum distance between two units for them to be matched. However the goal here is different: we need a cutoff on distance to reduce bias in the matching estimator due to comparing treated and control units who may not be similar on $\mathbf{w}$.

The effect on any particular units can be calculated in a similar way to stratification. CATE on a subset of the data $\mathbf{v}=\mathbf{v}$ is given by: $\begin{split} \textrm{CATE}(\mathbf{v}=\mathbf{v})& := \sum_{\mathbf{w}\in \mathbb{W}_S,\mathbf{v}=\mathbf{v}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w},\mathbf{v}=\mathbf{v})\Pr(\mathbf{w}=\mathbf{w},\mathbf{v}=\mathbf{v}) \\ &\rightarrow\sum_{i=1}^{N} \mathbb{1}_{\mathbf{v}=\mathbf{v}} \begin{cases} y_i & \textrm{ if } t_i = t_0 \\ y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0 \end{cases} \end{split}$ While convenient for both discrete and continuous variables, the method’s reliance on imperfect matches creates problems. For instance, the average treatment effect for the control group and the average treatment effect on the treated group are no longer the same (and both can be different from the average treatment effect). The average treatment effect on the treated (ATT) group is estimated as the mean over all matches for the treated units. Similarly, the average treatment effect on the control (ATC) group is estimated as the mean over all matches for the control units. Due to differences in the quality of matches—distance among matched pairs—based on whether a unit is in treatment or control, estimated ATT and ATC tend to be different. $\begin{split} ATT = \mathbb{E}_{\textrm{t}=1}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] &= \sum_{\mathbf{w}\in \mathbb{W}_S, \textrm{t}=1} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}| \textrm{t}=1) \\ &\rightarrow\sum_{i=1,t_i=1}^{N} \begin{cases} y_i & \textrm{ if } t_i = t_0 \\ y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0 \end{cases} \\ ATC = \mathbb{E}_{\textrm{t}=0}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] &= \sum_{\mathbf{w}\in \mathbb{W}_S, \textrm{t}=0} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\Pr(\mathbf{w}=\mathbf{w}| \textrm{t}=0) \\ &\rightarrow\sum_{i=1,t_i=0}^{N} \begin{cases} y_i & \textrm{ if } t_i = t_0 \\ y_{i_{matched}} & \textrm{if } t_{i_{matched}}=t_0 \end{cases} \end{split}$

The implication is puzzling. Depending on whether a unit has been assigned to treatment or control, our estimate of the effect of the exact same intervention can be different. Note that it is entirely possible that different subsets of units have different effects from the same intervention; however, the difference here is not due to heterogeneity in actual effect, but simply due to lack of perfect matches. This distinction is useful when interpreting the estimate from a matching analysis. If there are a few treated units but numerous control units (e.g., a few people who contracted a disease versus thousands who did not), it may be appropriate to only estimate the ATT, as we have a better chance of finding good matches for each of the treated units.

### 4.3.3 Propensity-Based Methods

Both stratification and matching methods create a subset of units—a stratum or a pair—where all confounders are fixed to the same value and thus any variation in treatment asssignment is trivially independent of the observed confounders. Formally, within these strata or pairs, $t \unicode{x2AEB} w$ and the units are said to be balanced. While these methods ensure the same value of confounders in each stratum or pair, having exactly the same $w$ is not necessary to ensure $t \unicode{x2AEB} w$. We now study alternative ways of ensuring balance that are effective for high-dimensional confounders.

Suppose that there are two values of the confounders, $\mathbf{w}=\mathbf{w}_1$ and $\mathbf{w}=\mathbf{w}_2$ such that $\operatorname{P}(\textrm{t}| \mathbf{w}=\mathbf{w}_1)=\operatorname{P}(\textrm{t}| \mathbf{w}=\mathbf{w}_2)$. In this case, we need not condition on $\mathbf{w}_1$ and $\mathbf{w}_2$ separately. It is okay to combine units with $\mathbf{w}_1$ and $\mathbf{w}_2$ into one strata. A simple derivation shows that the resultant estimate is identical. For the two values of $\mathbf{w}$, $\begin{split} \sum_{\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}} & \operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\operatorname{P}(\mathbf{w}=\mathbf{w}) \\ & =\sum_{\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}} \operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w}=\mathbf{w})\operatorname{P}(\mathbf{w}=\mathbf{w})\frac{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})} \\ & =\sum_{\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}} \frac{\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})} = \frac{\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_1)+\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_2)}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})} \\ & = \frac{\operatorname{P}(\textrm{y}, \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)} = \operatorname{P}(\textrm{y}| \textrm{t}=t_0, \mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)\operatorname{P}(\mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2) \end{split}$ where the second last equality is because $\operatorname{P}(\textrm{t}=t_0| \mathbf{w}=\mathbf{w}_1)=\operatorname{P}(\textrm{t}=t_0| \mathbf{w}=\mathbf{w}_2)=\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w}_1 \text{or } \mathbf{w}_2)$. Effectively, the derivation corresponds to creating a new confounder $\mathbf{w}'$ such that it attains the same value when $\mathbf{w}\in \{\mathbf{w}_1, \mathbf{w}_2\}$, but different values for each other value of $\mathbf{w}$. Within each strata of $\mathbf{w}'$, we find that $\textrm{t}\unicode{x2AEB}\mathbf{w}$ and thus $\textrm{t}\unicode{x2AEB}\mathbf{w}| \mathbf{w}'$. Alternatively, we can write $\textrm{t}\unicode{x2AEB}\mathbf{w}| b( \mathbf{w})$ where $b$ is a function that maps $\mathbf{w}$ to $\mathbf{w}'$.

More generally, any function $b: \mathbb{R}^{|W|} \rightarrow \mathbb{R}^{K}$ that ensures that $t$ and $w$ are independent given $b(w)$, $t \unicode{x2AEB} w|b(w)$ is called a balancing function. The resultant values $b(w)$ are called balancing scores. Given a balancing score, the backdoor estimation equation can be rewritten as: $\label{eq:balancing-score} \mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=t_0)] \rightarrow\sum_{\mathbf{w}\in \mathbb{W}} \mathbb{E}(\textrm{y}|\textrm{t}=t_0, b(\mathbf{w})=b(\mathbf{w}))\Pr(b(\mathbf{w})=b(\mathbf{w}))\qquad(4)$ If we consider $b(\mathbf{w})$ as trivially the identity function, $b(\mathbf{w})=\mathbf{w}$, then we obtain simple stratification. It is valid balancing function since by definition, $t \unicode{x2AEB}\mathbf{w}|\mathbf{w}$. However, a useful balancing typically reduces the dimensionality of $\mathbf{w}$.

The most common balancing score is the propensity score, $\operatorname{ps}(\mathbf{w})=\operatorname{P}(\textrm{t}|\mathbf{w}=\mathbf{w})$, so called because it defines the propensity of treatment given values of $\mathbf{w}$. Below we show that a propensity score is a balancing score by showing that $\textrm{t}$ and $\mathbf{w}$ are independent given the propensity score. That is, $\operatorname{P}(\textrm{t}|\mathbf{w},\operatorname{ps}(\mathbf{w}))=\operatorname{P}(\textrm{t}|\operatorname{ps}({\mathbf{w}}))$. To prove, we use that $\operatorname{P}(\textrm{t}=t_0|ps(\mathbf{w}), \mathbf{w}) = \operatorname{P}(\textrm{t}=t_0| \mathbf{w})=\operatorname{ps}(\mathbf{w})$ since $\operatorname{ps}(\mathbf{w})$ is a function of $\mathbf{w}$. Then we show that $\operatorname{P}(\textrm{t}=t_0| \operatorname{ps}(\mathbf{w}))$ is also equal to $\operatorname{ps}(\mathbf{w})$. $\begin{split} \operatorname{P}(\textrm{t}=t_0| \operatorname{ps}(\mathbf{w})) & = \mathbb{E}[\mathbb{1}_{\textrm{t}=t_0}|ps(\mathbf{w})] \\ &= \mathbb{E}[\mathbb{E}[\mathbb{1}_{\textrm{t}=t_0}|\mathbf{w}, \operatorname{ps}(\mathbf{w})]|ps(\mathbf{w})] \\ &= \mathbb{E}[\operatorname{P}(\textrm{t}=t_0|\mathbf{w}, \operatorname{ps}(\mathbf{w}))|ps(\mathbf{w})] \\ &= \mathbb{E}[\operatorname{P}(\textrm{t}=t_0|\mathbf{w})|ps(\mathbf{w})]\\ &= ps(\mathbf{w}) \end{split}$ where the second equality is due to the law of iterated expectations and the second-last equality is since $\mathbb{E}[A|A=a_0]$ is simply $a_0$. Therefore the propensity score is a valid balancing function of $\mathbf{w}$. Given a propensity score, we can use Eq. 4 to obtain the backdoor estimate by extending the stratification and matching methods discussed above.

#### 4.3.3.1 Estimating the propensity score

In some cases, the propensity score, $\operatorname{P}(\textrm{t}|\mathbf{w}=\mathbf{w})$ is known. For example, in a randomized experiment the propensity score is constant (0.5 if treatment and control are equally likely) for every $\mathbf{w}$. In most cases however, the propensity score needs to be estimated. Since both $\textrm{t}$ and $\mathbf{w}$ are observed, it can be estimated using any supervised learning algorithm that returns a probability estimate of different outcome values. For low-dimensional tabular data, logistic regression is often used to estimate $\hat{\operatorname{ps}}(\mathbf{w})$ when the treatment is binary. Assuming that the learnt parameters of logistic regression are $\hat{\beta}$, the propensity score is estimated as: $\hat{\operatorname{ps}}(\mathbf{w}) = \frac{1}{1+e^{-\hat{\beta}^T \mathbf{w}}}$ This approach naturally extends to cases when the treatment is categorical. When treatment is continuous, we can use probabilistic regression methods or approximate the treatment into discrete buckets. For high-dimensional or more complex data like text or images, deep learning methods can be used.

Given multiple propensity score models, we can use cross-validation to select the model with the highest cross-validation accuracy, similar to supervised learning methods. However there is one key distinction in the interpretation of this accuracy. A propensity score model with low accuracy is not necessarily a bad model. The goal of a model $\hat{\operatorname{ps}}(\mathbf{w})$ is to approximate the true propensity score $\operatorname{ps}(\mathbf{w})$, not necessarily to achieve a high accuracy. Somewhat counter-intuitively, an accuracy of 0.5 for a binary treatment classifier is ideal from the perspective of estimating the causal estimand if the true propensity score is also 0.5. For example, if the data was collected through a randomized experiment that assigned treatment and control with equal probability, then the best propensity score model has accuracy 0.5.

Similarly, if the true propensity scores are closer to 0 or 1, then a good classifier will achieve high accuracy on the classification task, but higher accuracy unfortunately implies higher variance of the estimate from Eq. 4. This is because estimation of causal effect involves subtraction of outcomes conditional on the treatment within each strata. As the accuracy increases (and thus some propensity scores $\operatorname{P}(\textrm{t}=t_0|\mathbf{w}=\mathbf{w})$ move closer to 0), variance of the estimate increases since there will be fewer data units having $\mathbf{w}=\mathbf{w}$ that have $\textrm{t}=t_0$. That said, it does not mean that we should choose a classifier with lower accuracy. Rather the takeaway is that we should select the model with the highest accuracy on a validation set. If the best accuracy turns out to be low, it may be due to the inherent randomness in how treatment was assigned. If the best accuracy is high, then it conveys that treatment was assigned almost deterministically based on the confounders and a propensity score-based method cannot provide a low variance estimate.

#### 4.3.3.2 Propensity-based stratification and matching

Given an estimated propensity score, we now describe extensions of the matching and stratification methods based on the propensity score. For matching, the goal is to find the closest unit to a particular unit, except that the distance metric is defined over propensity scores. $\operatorname{dist}(\mathbf{w}_1, \mathbf{w}_2) = \operatorname{dist}(\hat{\operatorname{ps}}(\mathbf{w}_1), \hat{\operatorname{ps}}(\mathbf{w}_2)) = | \hat{\operatorname{ps}}(\mathbf{w}_1))- \hat{\operatorname{ps}}(\mathbf{w}_2))|$ Similar to direct matching, we need to define the maximum distance at which a match is considered between a pair of data units.

For propensity-based stratification, our goal is to find subsets where the propensity score is constant. This is often approximated by defining subsets corresponding to ranges of propensity score values. For instance, we may divide [0,1] into 10 strata equally starting from $[0,0.1), [0.1, 0.2)$ and so on. The estimate from each stratum is now weighted by the probability of its propensity score range, or equivalently the fraction of data points whose propensity scores lie in that range. Like with direct stratification, the size of these subsets determines the bias-variance tradeoff of the final estimate.

To sum up, propensity-based methods and direct methods offer a fundamental tradeoff. In direct methods, the balancing function $\mathbf{w}$ is always accurate but conditioning on it is difficult. For propensity-based methods, the reverse is true. Conditioning is easy since the propensity score is on-dimensional but the estimated propensity score has to be good balancing score. To resolve this tradeoff it is possible to construct two-dimensional or other low-dimensional balancing scores, but they are not commonly used. For any such balancing score, the same matching and stratification methods can be applied.

## 4.4 Weighting-based methods

There are many other ways of interpreting the back-door estimand from Eq. 3, and thus obtaining a different method to estimate the causal effect. Rather than estimating a balancing function, one way is to ensure that treatment and control units have the same distribution of confounders, almost like a randomized experiment. Often, the key problem with observational data is that certain values of the confounders are disproportionately assigned treatment. If we can somehow remove this relationship, and ensure that treatment is independent of $\mathbf{w}$, we can obtain a non-confounded estimate of causal effect.

The basic idea is the same as importance sampling, which is used to weight a input distribution to resemble a target distribution. Here the input distribution $P$ is the observed data distribution and the target distribution $P^*$ is that of a (hypothesized) randomized experiment on the same population. Thus, we can write, $\begin{split} \mathop{\mathrm{\operatorname{P*}}}(\mathbf{w}, \textrm{t}, \textrm{y}) &= \mathop{\mathrm{\operatorname{P*}}}(\textrm{y}|\textrm{t}, \mathbf{w}) \mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})\mathop{\mathrm{\operatorname{P*}}}(\mathbf{w}) \\ &= \operatorname{P}(\textrm{y}|\textrm{t}, \mathbf{w}) \mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})\operatorname{P}(\mathbf{w}) \\ &= \operatorname{P}(\textrm{y}|\textrm{t}, \mathbf{w}) \operatorname{P}(\mathbf{w}) \mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w}) \frac{\operatorname{P}(\textrm{t}|\mathbf{w})}{\operatorname{P}(\textrm{t}|\mathbf{w})}\\ &= \operatorname{P}(\mathbf{w}, \textrm{t}, \textrm{y})\frac{\mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})}{\operatorname{P}(\textrm{t}|\mathbf{w})} \end{split}$ Then the expected value of $Y$ in the target distribution where $\mathbf{w}$ is independent of $\textrm{t}$ is: $\mathbb{E}^*(\textrm{y}) = \sum_{\textrm{y}, \textrm{t}, \mathbf{w}} y\mathop{\mathrm{\operatorname{P*}}}(\mathbf{w}, \textrm{t}, \textrm{y}) = \sum_{\textrm{y}, \textrm{t}, \mathbf{w}} y\operatorname{P}(\mathbf{w}, \textrm{t}, \textrm{y}) \frac{\mathop{\mathrm{\operatorname{P*}}}(\textrm{t}|\mathbf{w})}{\operatorname{P}(\textrm{t}|\mathbf{w})}$ To find the causal effect, we can rewrite the backdoor estimand as, $\label{eq:ch04-ipw-estimator} \begin{split} \mathbb{E}(\textrm{y}|\operatorname{do}(\textrm{t}=t_0)) &\rightarrow\sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}| \mathbf{w},\textrm{t}=t_0) \operatorname{P}(\mathbf{w}) \\ &= \sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}| \mathbf{w},\textrm{t}=t_0) \operatorname{P}(\mathbf{w}) \frac{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\ &= \sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}| \mathbf{w},\textrm{t}=t_0) \operatorname{P}(\textrm{t}=t_0, \mathbf{w}) \frac{1}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\ &= \sum_{\textrm{y}, \mathbf{w}} y\operatorname{P}( \textrm{y}, \mathbf{w},\textrm{t}=t_0) \frac{1}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\ \end{split}\qquad(5)$

### 4.4.1 Inverse propensity weighting

Eq. 5 can be interpreted as a weighted average of $\textrm{y}$ over the observed distribution, with the weights being $weight = 1/\operatorname{P}(\textrm{t}=t_0|\mathbf{w})=\frac{1}{\operatorname{ps}(t_0, \mathbf{w})}$. These weights give the estimator its name, “inverse propensity weighting” or IPW for short.

For a dataset with $N$ units, the IPW estimator is: $\hat{\mathbb{E}}[\textrm{y}| \operatorname{do}(\textrm{t}=t_0)]= \sum_{i=1}^N \frac{\mathbb{1}_{\textrm{t}=t_0}y_i}{\hat{\operatorname{ps}}(t_0, \mathbf{w}_i)}$ The method can be interpreted as creating a new dataset where $\textrm{t}\unicode{x2AEB}\mathbf{w}$ and thus the data are balanced. From Eq. 5, we saw that this estimator can be derived from the backdoor estimand. Thus, with infinite data, the IPW estimator is equivalent to the simple stratification estimator. $\begin{split} IPW &= \sum_{\textrm{y}, \textrm{t}=t_0, \mathbf{w}} y\frac{\operatorname{P}(\mathbf{w}, \textrm{t}=t_0, \textrm{y})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\ &= \sum_{\textrm{y}, \textrm{t}=t_0, \mathbf{w}} y\frac{\operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w})\operatorname{P}(\textrm{t}=t_0|\mathbf{w})\operatorname{P}(\mathbf{w})}{\operatorname{P}(\textrm{t}=t_0|\mathbf{w})} \\ &= \sum_{\textrm{y}, \textrm{t}=t_0, \mathbf{w}} y\operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w})\operatorname{P}(\mathbf{w}) \\ &= \sum_{\mathbf{w}} \sum_{\textrm{y}, \textrm{t}=t_0} y\operatorname{P}(\textrm{y}|\textrm{t}=t_0, \mathbf{w})\operatorname{P}(\mathbf{w}) \\ &= \sum_{\mathbf{w}} \mathbb{E}[\textrm{y}|\textrm{t}=t_0, \mathbf{w}]\operatorname{P}(\mathbf{w}) \end{split}$

In finite samples however, the two methods can give different estimates due to the different tradeoffs they make. The propensity score can be estimated as in the balancing-based methods. However, IPW does not divide the data into discrete strata or find the closest matches for each data point based on the propensity score, both of which are statistical operations and can introduce bias. Instead it uses the propensity score directly to weight the input data points. This results in low bias but high variance since the estimate for each data point includes division by its propensity score. Even if a single point has low propensity score ($\approx 0$) it can make the estimate diverge and go to infinity.

To reduce variance, it is advisable to clip extreme propensity scores. Typically one chooses a range $[\alpha, 1-\alpha]$ where $\alpha$ is a parameter between 0 and 1. Choosing $\alpha$ has a similar bias-variance tradeoff as with statification—smaller $\alpha$ leads to lower bias but higher variance, and larger $\alpha$ reduces variance but has higher bias.

Overall the benefit is that it does not need arbitrary choices of strata size or distance metrics for matching. When true propensity scores are not too high or too low, IPW can be a suitable method that is straightforward to implement.

### 4.4.2 Using supervised learning on the weighted outcomes

Unlike balance-based methods, weighting-based methods also provide a straightforward way to incorporate supervised learning estimators. Consider $y|\operatorname{do}(t), \mathbf{w}= \mathbf{w}$ as a function of $t$ and $\mathbf{w}$ that needs to be estimated. $\label{eq:ch04-ipw-ml-eqn} \mathbb{E}[y|\operatorname{do}(\textrm{t}=t_0),\mathbf{w}=\mathbf{w}]= f(t_0, \mathbf{w})\qquad(6)$ From Eq. 5 we know that LHS is equivalent to values of $y$ weighted by the inverse propensity score. Thus, training data $(t_i,\mathbf{w}_i,y_i)$ can be generated where each point is weighted by $\frac{1}{\hat{\operatorname{ps}}(t_i, \mathbf{w}_i)}$ and any supervised learning algorithm can be used to predict $y^{IPW}$. Denoting $L$ as a suitable P-admissible loss function such as mean squared loss or cross-entropy, the estimator can be implemented using a weighted regression: $\begin{split} % \hat{f}(t,\vw) &= \arg \min_h \sum_{i=1}^N L(y^{IPW}_i, h(t_i,\vw_i))\\ \hat{f}(t,\mathbf{w}) &= \arg \min_{h \in \mathcal{H}} \sum_{i=1}^N \frac{1}{\hat{\operatorname{ps}}(t_i, \mathbf{w}_i)}L(y_i, h(t_i,\mathbf{w}_i)) \end{split}$

Compared to the IPW estimator that can only estimate mean interventional outcomes for different sub-groups, this method can be used to obtain estimates for any point in the observed data or any new data point. Of course, the catch is that the estimates include a new parametric assumption on the form of $f$. Model choice is an important consideration that determines bias and variance of the estimate. Like in supervised learning, if the family of functions $\mathcal{H}$ is too simple (e.g., linear regression), there will be bias due to insufficient modeling of the per unit differences. On the other hand, if we consider an expressive $\mathcal{H}$ (e.g., multi-layer neural networks) then we run the risk of overfitting to the observed data.

The susceptibility to very low propensity scores remains the same as the IPW estimator. A near-zero propensity score translates to a near-infinity weight on the data point, thus making the rest of the dataset irrelevant for fitting $f$. Similar clipping of weights can be applied.

## 4.5 Outcome model-based methods

Note that Eq. 6 equation looks deceptively similar to the regression, $\mathbb{E}[y|t,\mathbf{w}=\mathbf{w}]=f(t,\mathbf{w})$. The key difference, however is in conditioning on $\operatorname{do}(t)$ instead of $t$, that in turn required us to weight the data appropriately. What if we train a supervised model directly on the input data? This section dives deeper into the use of such outcome-based predictor methods.

From the identification chapter, $P(y|t,\mathbf{w})$ is the estimand for $P(y|do(t),\mathbf{w})$. We used this as a component of average causal outcome methods like stratification and inverse propensity weighting. But we can also use it directly to create an estimator that can predict the interventional outcome for each value of $\mathbf{w}$. Let us assume a function $f$ that describes the relationship between $y$ and $t,\mathbf{w}$. $y = \operatorname{f}(t, \mathbf{w}) + \epsilon$

Since $y$, $t$ and $\mathbf{w}$ are all observed, the above equation suggests a straightforward estimator using supervised learning. If we can estimate $f$, then we can use it to compute the value of the outcome at different values of the treatment to compute the causal effect. For estimating the causal estimate between $t=0$ and $t=1$, for example, we can write: $\mathbb{E}_{\mathbf{w}}[\operatorname{f}(1, \mathbf{w}) - \operatorname{f}(0, \mathbf{w})]$

The problem, however, is that estimating the true $f$ is not trivial. When the goal is prediction over the same distribution of data, it often suffices to learn a $\hat{f}$ that has the same error with respect to the outcome. But low prediction error does not necessarily translate to low error in estimating causal effect of $t$ since $t$ and $\mathbf{w}$ are correlated. As a simple example, suppose that $t$ has zero causal effect on $y$, given by the following structural equation: $y \leftarrow\gamma \mathbf{w}$ However, since t and $\mathbf{w}$ are correlated, a model that minimizes the prediction loss may assign some of the effect to t, leading to a non-zero estimate of $t$’s causal effect. Fundamentally, the standard supervised learning loss on minimizing predictive error has no constraints on learning the right effect for $t$ or any other variable.

As an example, Fig. 4 shows simulations of different outcome-based predictors on a dataset generated by a linear structural equations for treatment $\textrm{t}$ and outcome $\textrm{y}$. The true causal effect of treatment is 10. We train four predictive models based on linear regression, linear regression with lasso regularization, gradient boosting trees and random forest. The first two models are simpler models that have the advantage of having exactly the correct model specification for the underlying structural equations. The last two are more complex tree models that utilize an ensemble of tree models to minimize their prediction error. Since predictive models are typically evaluated on their prediction accuracy on an out-of-sample test dataset, let us first analyze the test prediction accuracy. The mean average percentage error on predicting the outcome $y$ is the higher for linear regression and lasso models, and lower for the tree-based ensemble models. From a prediction accuracy standpoint, tree-based ensemble models are better than the regression-based models.

However, the right panel of Fig. 4 shows the opposite trend for their bias in effect estimation. Linear regression is the most accurate: its average estimate is exactly 10 and over 50% of the estimates lie between 9 and 11. Lasso model is not accurate at all, even though both linear regression and lasso share the same prediction error. Most of the effect estimates from Lasso lie between 0 and 5. Morever, the models with higher prediction accuracy yield worse effect estimates. Both Gradient Boosting and Random Forest models yield a biased estimate, with Random Forest model estimating a near-zero effect for a majority of the simulations. The example shows the lack of a stable relationship between prediction accuracy and effect estimation that makes it difficult to select the right model for an outcome-based effect estimator. The best predictor of the outcome may be the worst estimator of causal effect and models with the same predictive accuracy may still have very different effect estimates.

Figure 4: Supervised learning methods applied to estimate causal effect in a simulated dataset based on linear structural equations. Left panel shows the prediction error measured by the mean absolute percentage error (MAPE) metric. Right panel shows the distribution of the effect estimates for a true effect of 10 (shown as a dotted horizontal line). Lower prediction error for the Gradient Boosting and Random Forest models does not translate to lower bias on estimating the effect. Compare these methods yourself using an online DoWhy notebook: https://github.com/causalreasoning/cr-book/blob/master/examples/ch04-estimation/bias-in-supervised-learning.ipynb.. a — Prediction Error, b — Effect Estimates

### 4.5.1 Limitations of directly applying predictive models for effect estimation

The fundamental problem is that predictive methods and causal inference methods are optimized for two different objectives. Consider our setting above where the outcome $y$ is generated according to a function $f(t, x)$ and some independent error: $y = f(t, \mathbf{w}) + {\epsilon}$. The goal of a predictive model is to construct $\hat{f}$ such that $y-\hat{f}(t, \mathbf{w})$ is minimized, whereas the goal of a causal inference model is construct $\hat{f}$ such that its partial derivative w.r.t the treatment is accurate, $\frac{\partial \hat{f}}{\partial t} - \frac{\partial y}{\partial \operatorname{do}(t)}$, where $\partial \operatorname{do}(t)$ should be interpreted as the infinitesimal interventional change in $t$. \begin{aligned} \text{Predictive model}:& \hat{f} := \mathop{\mathrm{arg\,min}}_{h} \operatorname{Loss}(h, y) \\ \text{Causal effect model}:& \frac{\partial \hat{f}}{\partial t} := \mathop{\mathrm{arg\,min}}_{h} \operatorname{Loss}(\frac{\partial h}{\partial t}, \frac{\partial y}{\partial \operatorname{do}(t)} )\end{aligned}

In the presence of confounding, the two objectives can lead to different causal estimates. Let us use linear regression as the model class to demonstrate this. Suppose that the true generating function $f$ is $y = \beta t + \gamma w + {\epsilon}$ where $w$ also causes $x$ and thus a confounder. The identified causal estimand is $\mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}), {\textrm{w}}] = \mathbb{E}[\textrm{y}|\textrm{t}, {\textrm{w}}]$. Using linear regression, the causal effect $\beta$ can be estimated as: $\hat{f}(t,w) = \hat{\beta} t + \hat{\gamma}w$

If $w$ and $t$ are independent, an optimization that minimizes $(y-\hat{f}(t,w))^2$ will be able to distinguish between the contributions of $t$ and $w$ to $y$, thus yielding the correct causal effect. However, since $w$ causes $t$, $w$ and $t$ will be correlated. Intuitively, in such a situation, it is difficult to isolate a separate effect for $t$: based on a given sample, some of $t$’s effect may be absorbed in the $w$’s coefficient, or conversely some of $w$’s effect may become a part of $t$’s estimated coefficient. This is what we see in Fig. 4 (right panel) where the linear regresion model yields incorrect effect estimates on both sides of the true effect 10, ranging from -18 to over 40. Formally, the variance of $\hat{\beta}$ is given by, $\label{eq:ch04-var-beta-lregression} Var(\hat{\beta}) = \frac{\sum_{i=0}^N (y-\hat{f}(t,w))^2}{Var(t)(1-R_{tw}^2)}\qquad(7)$ where $R_{tw}^2$ is the $R$-squared value of regressing $t$ on $w$. As $w$’s effect on $t$ increases, their correlation increases and thus the denominator decreases. As a result, the variance of $\hat{\beta}$ increases. Thus, when the confounding by $w$ on $t$ is minimal (e.g., in a randomized experiment where $R^2=0$), we obtain the lowest variance estimate. However, as confounding increases, a single estimate can be unreliably away from the true value (though still unbiased). In the extreme case, when $w$ and $t$ are linearly dependent, there can be many equally optimal solutions for $\hat{\beta}$ depending on the value of $\hat{\gamma}$.

In addition to high variance, prediction models often include regularization that can lead to bias in the effect estimate. For example, the Lasso model is a regularized version of the linear regression model that adds a loss term encouraging the model parameters to shrink to zero. The intuition is to avoid overfitting and let the insignificant effect parameters go to zero.

$\text{Lasso}: \hat{f} := \mathop{\mathrm{arg\,min}}_{\beta, \gamma} \operatorname{Loss}({\beta} t + {\gamma}w, y) + \lambda (\beta + \sum|\gamma|)$

However, as we saw in Fig. 4, Lasso regularization leads to shrinking of the treatment’s effect as well, even though it is the variable with the highest effect. In general, regularization is a common component of almost all predictive models that helps to avoid overfitting by encouraging model parameters towards some prior, but it can also lead to non-trivial bias for effect estimation.

### 4.5.2 Treatment-specific prediction models: T-learner

While naive predictive models do not work, there are certain customizations that can work better. We already saw an example of a modified loss function in section 4.4.2 that used a weighted outcome.

We now present a simple outcome-based method that works for a discrete treatment variable. The main problem with the naive predictive estimator is that $\mathbf{w}$ could be attributed some of the $t$’s effect, and vice-versa. What if we divided the dataset by values of $t$ and ran separate regressions on each sub-dataset? Since $t$ is constant in each dataset, the functions learnt cannot capture any variation in $y$ due to treatment. $\forall t_i: \hat{f_{t_i}} = \arg \min_h \sum_{i=1}^{N_{t=t_i}}L(y_i, h(\mathbf{w}_i))$ The causal effect for any two values of $t$ can be calculated as: $\mathbb{E}_{\mathbf{w}}[\hat{f}_{t1}(\mathbf{w})- \hat{f}_{t0}(\mathbf{w})]$ Conditional average treatment effects (CATE) can be estimated by restricting the values of $\mathbf{w}$ over which the expectation is computed. Since the method requires two or more prediction models, it is called the T-learner. This approach can work well when there are sufficient samples for each value of the treatment. As with any supervised learning task, the efficacy of the method will depend on the choice of model class and associated issues of overfitting in case some treatment values have less data.

Note the relationship of this method to stratification. In stratification, we condition on the confounders $\mathbf{w}$ whereas here we condition on the treatment $t$. Thus, when there are a few discrete confounders, simple stratification works well. When there are a few discrete treatments, treatment-specific prediction methods can work well. Unlike propensity-based methods for stratification, there is no obvious extension of the T-learner to continuous treatments.

### 4.5.3 Residual-based prediction models

For continuous treatment, we need a different estimation strategy. This strategy depends on the fact that error term in a correctly specified structural equation is independent of the input variables. If we can model correctly the structural equation from the confounders ${\textrm{w}}$ to the treatment $\textrm{t}$, then the residual of the fitted model should be independent of all confounders. Similarly, the residual of a model predicting the outcome $\textrm{y}$ based on the confounders should be removed of any direct effect of the confounders on the outcome. The key assumption is that the machine learning models $\hat{g}$ and $\hat{f}$ capture the true structural equations for treatment and outcome respectively. $\begin{split} \hat{g} &= \arg \min_g Loss(t, g(w)) \\ \hat{f} &= \arg \min_f Loss(y, f(w)) \\ r_t &= t - \hat{g}(w) \\ r_y &= y - \hat{f}(w) \end{split}$

In other words, the two residuals $r_t$ and $r_y$ represent the part of the treatment and outcome respectively that are not explained by the variation due to ${\textrm{w}}$. In other words, $r_t$ and $r_y$ are the unconfounded parts of the treatment and outcome. A simple predictive model on these residual variables should yield the effect of treatment on the outcome. $\beta = \arg \min_\beta Loss(r_y, \beta r_t)$ The same method can be also used to estimate the conditional effect, by introducing an effect model in the last stage that depends on other variables, e.g., $\beta r_t + \gamma r_t x$ where $x$ is an effect modifier. Compared to the naive predictive model from $\textrm{t}$ and ${\textrm{w}}$ to $\textrm{y}$, the residual-based method has the advantage of explicitly removing the confounding due to ${\textrm{w}}$ in its first step, which the predictive model may or may not do. However, accuracy of the estimate is now dependent on the predictive quality of two separate sub-models, $\hat{g}$ and $\hat{f}$. As the sub-models get closer to the true structural equations, the residual-based effect becomes more accurate. Since the estimator involves estimating two machine learning models, it is also called the double machine learning method.

### 4.5.4 Wald estimator

Outcome-based methods are also used for estimating causal effect in the presence of instrumental variables. Instrumental variable was introduced in 3.4, corresponding to the following structural equations for t and y in the general causal graph in Fig. 3.6. $\begin{split} t &\leftarrow g(\mathbf{z},\mathbf{w}) + {\epsilon}_t \\ y &\leftarrow f(t,\mathbf{w}, \mathbf{u}) + {\epsilon}_y \end{split}$

For identifiability, we make an additional assumption that $\textrm{t}$ has a linear effect on $\textrm{y}$ and that the effects of $t$ and $\mathbf{w}$ compose additively. We obtain, $\label{eq:ch03-iv-parametric-y} y \leftarrow\alpha + \beta t + f_{\mathbf{w}}(\mathbf{w}) + f_{\mathbf{u}}(\mathbf{u}) + {\epsilon}_y\qquad(8)$ where $f_{\mathbf{w}}$ is an arbitrary function that maps the causal effect of $\mathbf{w}$ on y and ${\epsilon}_y$ is a zero mean error. Under this parameterization, $\beta$ is the expected value of the causal effect. $\mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=1)] - \mathbb{E}[\textrm{y}|\operatorname{do}(\textrm{t}=0)]= \beta.1 - \beta.0=\beta$

In general, the unobserved confounders $\mathbf{u}$ affect both $y$ and $t$. Thus, $f_{\mathbf{u}}(\mathbf{u})$ is not independent of $t$. Trying to estimate $y$ as a function of $t$ directly will lead to a biased estimate of effect, since it may also include the effect of unobserved confounders. Instead, we can try to model the outcome as a function of the instruments and the confounders. By definition, the instrumental variable $\textrm{z}$ is d-separated from $\mathbf{w}$ and thus, $\label{eq:ch03-iv-zindepu} z \unicode{x2AEB}\mathbf{u}\text{ for all } z\in \mathbf{z}\qquad(9)$

Taking expectation on both sides, we rewrite Eq. 8 as, $\label{eq:ch03-iv-identify} \begin{split} \mathbb{E}[y|\mathbf{z}, \mathbf{w}] &= \mathbb{E}[\alpha + \beta t + f_{\mathbf{w}}(\mathbf{w}) + f_{\mathbf{u}}(\mathbf{u}) + {\epsilon}_y|\mathbf{z}, \mathbf{w}]\\ &= \alpha + \beta \mathbb{E}[t|\mathbf{z}, \mathbf{w}] + f_{\mathbf{w}}(\mathbf{w}) + \mathbb{E}[f_{\mathbf{u}}(\mathbf{u}) + {\epsilon}_y|\mathbf{z}, \mathbf{w}]\\ &= \alpha + \beta \mathbb{E}[t|\mathbf{z}, \mathbf{w}] + f_{\mathbf{w}}(\mathbf{w}) + \alpha' + \mathbb{E}[{\epsilon}_y|\mathbf{z}, \mathbf{w}]\\ &= (\alpha + \alpha_1)+ \beta \mathbb{E}[t|\mathbf{z}, \mathbf{w}] + f_{\mathbf{w}}(\mathbf{w}) \end{split}\qquad(10)$ where $\mathbb{E}[f_{\mathbf{u}}(\mathbf{u})| \mathbf{z}, \mathbf{w}]$ is a constant due to Eq. 9, $\mathbb{E}[{\epsilon}_y|\mathbf{z}, \mathbf{u}]=0$ since ${\epsilon}_y$ is a zero mean error term independent of every other variable, and $\mathbb{E}[f_{\mathbf{w}}(\mathbf{w})|\mathbf{z}, \mathbf{w}]=f_{\mathbf{w}}(\mathbf{w})$ since $\mathbb{E}[h(A)|A]=h(A)$ for any function $h$ and random variable $A$. Eq. 10 expresses the causal effect parameter $\beta$ in terms of function of observed variables, thus the causal effect can be identified as, $\beta = \arg \min_{\beta', \alpha', h}\ell (\mathbb{E}[y|\mathbf{z}, \mathbf{w}],\alpha' + \beta'\mathbb{E}[t|\mathbf{z}, \mathbf{w}]+ h(\mathbf{w}) )$ where $\ell$ is a suitable loss function such as $\ell_2$ loss. Note that the above identification is valid only under the additive parametric assumptions described above. When there are no observed confounders $\mathbf{w}$ and $z$ and $t$ are binary, Eq. 10 a closed form solution. Substituting $z=1$ and $z=0$ $\begin{split} \mathbb{E}[y|z=1]&= (\alpha + \alpha_1)+ \beta \mathbb{E}[t|z=1]\\ \mathbb{E}[y|z=0]&= (\alpha + \alpha_1)+ \beta \mathbb{E}[t|z=0] \end{split}$ Subtracting the two equations, we obtain the identification estimand for $\beta$, known as the Wald identifier for binary instrumental variables. $\begin{split} \mathbb{E}[y|z=1] - \mathbb{E}[y|z=0]&= \beta (\mathbb{E}[t|z=1]-\mathbb{E}[t|z=0]) \\ \Rightarrow \beta &= \frac{\mathbb{E}[y|z=1] - \mathbb{E}[y|z=0]}{\mathbb{E}[t|z=1]-\mathbb{E}[t|z=0]} \end{split}$

$\label{eq:ch-04-iv-est} \beta = \arg \min_{\alpha, \beta, \gamma} \mathbb{E}[y|\mathbf{z}, \mathbf{w}] -( \alpha+ \beta\mathbb{E}[t|\mathbf{z}, \mathbf{w}] + \gamma \mathbf{w})\qquad(11)$

Thus if we knew $v=\mathbb{E}[t|\mathbf{z}, \mathbf{w}]$, we can estimate $\beta$ using a simple regression of $y$ on $v$ and $\mathbf{w}$. In practice, this expectation can be estimated using a predictive model of its own. Thus, the task of estimating causal effect reduces to the following two predictive models. $\begin{split} \hat{t} &= \hat{g}(z,\mathbf{w}) \\ \hat{y} &= \hat{\alpha} + \hat{\beta} \hat{t} + \hat{\gamma} \mathbf{w} \end{split}$

Since we are estimating the effect using two predictive models and are using least squares loss minimization for fitting expected values, this method is often called the two-staged least squares method. Note that we are not restricted to the linear parametric form. Any parametric form for $f(t,\mathbf{w})$ can work, as long as the expectations can be estimated.

In the special case when $\mathbf{z}$ is one-dimensional and binary, and there are no confounders, the estimator reduces to a simple ratio using Eq. 11. $\begin{split} \mathbb{E}[y|z=1] &= \alpha+ \beta\mathbb{E}[t|z=1] \\ \mathbb{E}[y|z=0] &= \alpha + \beta\mathbb{E}[t|z=0] \\ \Rightarrow \beta &= \frac{\mathbb{E}[y|z=1] -\mathbb{E}[y|z=0]}{\mathbb{E}[t|z=1]- \mathbb{E}[t|z=0]} \end{split}$ This ratio estimator is called the Wald estimator and it can plugged in as below.

$\hat{\beta} = \frac{ \frac{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=1}\textrm{y}}{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=1}} - \frac{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=0}\textrm{y}}{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=0}} } { \frac{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=1}\textrm{t}}{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=1}} - \frac{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=0}\textrm{t}}{\sum_{i=1}^N \mathbb{1}_{\textrm{z}=0}} }$

The variance of this estimator depends on the strength of the instrument. If the instrument has low effect on the treatment, then the denominator will be lower leading to high variability in the estimates. While it is not obvious from Eq. 11, the same holds true for the general estimator too. Note that the estimator suffers from high variance whenever the effect of the instrument on treatment is weak, since the denominator will be small. In such cases, it may be better to focus on the data subset where the compliance of treatment with instrument is high. Restricting to a data subset gives a reliable estimate about a selected subpopulation, thus the resultant estimate is called the local average treatment effect (LATE).

## 4.6 Threshold-based methods

Finally, we discuss methods that do not depend on accurate estimation of propensities or other conditional probabilities, but rather exploit certain thresholds based on treatment is varied. This threshold can be applied on values of other auxiliary variables.

### 4.6.1 Regression discontinuity

Regression discontinuity is a special kind of natural experiment that exploits discontinuity in a variable due to a decision made using a threshold. For example, suppose that we are interested in the effect of a specific feature of a product on its usage. Suppose further that it was decided to show the feature to people who were in the top 95% of user activity. Now comparing the outcome for these users with the new feature to other users will be confounded, since higher activity users are anyways likely to use the product more. However, it may be that there were users just below the 95% cutoff who had exactly the same or very similar user activity numbers. The regression discontinuity idea is that these users just below the threshold can be compared to the users just above the threshold, to yield a comparable sample of individuals with and without treatment. To the extent that these two subsets of individuals are identical except for the arbitrary user activity cutoff, we can estimate causal effect by comparing the outcome on these subsets directly.

It turns out that near the threshold, the variable on which threshold is applied acts as an instrumental variable. In the language of instrumental variable method, whether a user lies above the threshold can be considered the instrument and treatment is experiencing the new feature. In our example, the instrument and the treatment are identical, leading to a sharp regression discontinuity. The discontinuity need not be a deterministic one. It can also be probabilistic (or “fuzzy”), for example, sampling users prproportional to their user activity and showing them the new feature. Now the instrument can be considered as the user activity variable while the treatment remains the same.

While Eq. 11 can be directly applied, estimation using a regression discontinuity natural experiment requires another factor to consider. What should be the length of interval before and after the threshold that forms the sample for the estimand. Intuitively, a larger interval (or “bandwidth”) will lead to lower variance but high bias since data points far away from the threshold may not be comparable. A smaller interval is desirable, but that exposes the estimator to high variance. In practice, it is advisable to use domain knowledge to arrive at a sensible value for the bandwidth parameter, and to evaluate the estimate at different bandwidth parameters.

### 4.6.2 Difference-in-differences

One special case is when the threshold is applied on time and the treatment is applied after that time point. There are many other kinds of natural experiments that happen due to a threshold over time. Often a treatment is implemented at a particular timestep and the goal is to compare the effect of treatment. In such cases, it is useful to compare the outcome values just before and after treatment is administered, with respect to a control group where the treatment was not applied. This technique is called the difference-in-differences estimator. The key idea is that rather than compare absolute values of the outcome between treatment and control, difference-in-differences compares the change in outcome before and after treatment was applied. The key assumption is that change in outcome is not affected by confounders and the treatment can be considered as equivalent to random for this modified outcome. The difference-in-differences estimator is given by: $\mathbb{E}[\Delta\textrm{y}|\operatorname{do}(\textrm{t}=t_0)]= \mathbb{E}[\Delta\textrm{y}|\textrm{t}=t_0]= \frac{1}{N_{t_0}}\sum_{t_i=t_0} \Delta y_i$ In effect, subtracting the outcome by its value before an intervention serves to account for any pre-existing differences between the treatment and control groups. The difference-in-differences method makes an additional assumption about unconfoundedness of the change in outcome.

## 4.7 Practical Considerations

So far, we presented different estimators that can implement back-door or instrumental variable identification strategies. We also described how they make varying choices in reducing bias and variance. Given a dataset and a target effect, we now discuss practical considerations in choosing an estimator and interpreting the resultant estimate.

### 4.7.1 Choosing an estimator

To start with, choice of an estimator depends on the identification strategy. Instrumental variables, regression discontinuity or difference-in-differences estimation methods may be used whenever there is an identified estimand based on their assumptions. For estimation based on adjustment set identification, the type of treatment and dimensionality of a dataset provide useful information to choose an estimator. The easiest estimation task is when the treatment is discrete and the dataset is low-dimensional. Simple stratification is the best method in such cases. It is model-free and provides an unbiased estimate.

When the dataset contains high-dimensional confounders, effect estimation becomes harder, as with any statistical task. If we have any knowledge about the nature of treatment assignment, then balance-based methods are preferable. For instance, if we know that treatment was assigned using some randomized algorithm, then we can use outputs of the algorithm as a reliable propensity score or estimate it. In other cases, we may know the functional form based on domain knowledge. For example, medical treatments assigned based on manual checklists can be approximated by a decision-tree propensity model. Among the balance-based methods, stratification is preferred for discrete confounders and matching for continuous confounders. The important hyper-parameters for these methods are the size of the stratification bins and the maximum distance allowed for a match respectively, that need to be tuned based on domain knowledge and the data available. If more is known about the outcome’s structural equations than the treatment’s, then outcome-based models such as the T-learner can be used. For example, when estimating the effect of a certain event on energy consumption in a computer cluster, we may not know why the event occurs, but know how different processes and events contribute to energy consumption.

Weighting-based methods can also be used for discrete treatments, but the estimate can be unreliable in case there are a few data points with very high weights. For this reason, weighting-based methods work only when none of the propensity scores are too high or too low. They have the benefit of requiring the least hyperparameters.

For continuous treatments, model-based estimators are a necessity since we need to assume some functional form of how the treatment affects the outcome. We recommend the residual-based methods since they allow capturing non-linear relationships from the confounders to the treatment and outcome, while avoiding the bias due to purely prediction models. The residual-based methods can also handle high-dimensional confounders.

### 4.7.2 Confidence intervals and interpreting an estimate

Once an estimator is chosen and the effect is estimated, it is important to interpret it correctly, given the varied challenges that can derail an estimate such as an incorrectly specified model, sample selection bias, poor overlap, low sample size, or outlier data (as discussed in section 1.2). Some of these challenges are common to any statistical analysis. For instance, sample size and outliers are common statistical problems and the resultant variance can be captured through confidence intervals for the causal estimate. To estimate the confidence intervals for any of the estimators presented in this chapter, we can use the bootstrap method. The bootstrap method creates multiple datasets of the same size as the original dataset by sampling with replacement the values from the original dataset. Then the same estimator is applied on each of these datasets and the estimates sorted by their value. The range of the estimates between the $k/2$th-percentile and the $100-k/2$th percentile forms the $100-k/2$-th confidence interval. For example, to compute the 95% confidence intervals, we can look at the range from 2.5% to 97.5% percentile-ranked estimates. As the number of sampled datasets increase, the confidence intervals become more accurate (typical number of datasets is 1000 or 10000). The benefit of this resampling bootstrap method is that it makes no assumptions about the data distribution or the estimator, and can be applied universally, as long as computational capacity is available.

The issues of overlap and model choice, however require special consideration in effect estimation. As we saw in Fig. 3 for a simple one-dimensional confounder $w$, if there are either not enough treatment or control data points for some values of $w$, then we cannot say anything about the causal effect on those values of $w$. Some methods, like outcome-based prediction methods, may manage to provide an estimate for all values of $w$, but that is largely a function of the modeling assumptions in the method, not from any conclusion from the data. For example, suppose that we wanted to estimate the effect of an online service on people’s financial savings. The identified estimand indicates to condition on income of a person, along with other features. Suppose further that the vast majority of the high-income people tried the service, whereas the ratio between those who tried or not is 50-50 for others. While different estimation methods may yield varying estimates of the product’s effect, it becomes critical to realize that no method can estimate the effect on the high-income people reliably from this data. If we choose weighting-based methods, they may be too sensitive to high propensity scores for the high-income people, and therefore one may decide to remove those with high income from the dataset. If we use outcome-based methods, we may obtain an estimate for the high-income people, but that would be simply extrapolated from the estimated effect on the low-income and middle-income people, based on modeling assumptions about the structural equation. Therefore, without a strong prior for the modeling assumption, its estimate for the rich can be misleading. In practice, we recommend measuring treatment overlap through the propensity score for different subsets of the confounders’ values and interpreting the estimate accordingly.

However, the interpretation of the propensity score becomes tricky under high-dimensional confounders. As the number of confounders increases, it becomes difficult to associate high propensity scores with a meaningful set of confounders. Even for simpler balance-based methods, it becomes difficult to evaluate whether any two data points are comparable: as the number of dimensions of confounders increases, it is unlikely that two data points share the same confounder values. It also becomes difficult to construct distance measures that can capture true similarity between units. Thus, rather than perfect matches or stratification, conditioning on confounders becomes a best-effort exercise, based on a statistical model. Interpretation of which kinds of data the estimator is applicable for is a much harder task, and needs to be done carefully based on domain knowledge and the goals of the analysis.

### 4.7.3 Challenges of validating an estimate

Finally, all causal estimation methods suffer from the fundamental limitation that they cannot be evaluated for accuracy without conducting a randomized experiment. This limitation makes it hard to compare causal inference methods on the same dataset, let alone different kinds of data. For a predictive model, it is possible to keep away a subset of observational data that can be used for testing the accuracy of the model. However, for a causal estimate, we ideally need test data that assigns a different treatment to each observation, which is impossible. The next best option is to conduct a randomized experiment but requires intervention in the real world, which comes with its own set of risks.

We therefore have to use more creative ways of evaluating a causal estimate that we describe in the next chapter.

Updated: