# Chapter 3: Identification

Once we have captured our causal assumptions in the form of a model, the second stage of causal analysis is *identification*. In this stage, our goal is to analyze our causal model—including the causal relationships between features and which features are observed—to determine whether we have enough information to answer a specific causal inference question.

We begin by formalizing the concept of causal inference questions using *intervention graphs*. We describe do-calculus rules that relate relationships in intervention graphs to the causal models of our observational data. We show how do-calculus leads us to various identification strategies, and how do-calculus can be combined with parametric assumptions as well. Finally, we discuss the relative advantages and disadvantages of these strategies, and discuss common approaches for analyzing a causal inference question to help choose from among these various approaches.

After completing this chapter, we expect the reader will understand the fundamentals of identification and how they are used to derive common identification strategies.

## 3.1 Causal Inference Questions: Concepts and Notation

A causal question is *any* question about the relationship between causes and effects. In their fullness, the set of causal questions encompasses a very broad variety of questions. For both practical and pedagogical reasons, we focus here on a narrower class of questions called *causal inference* questions where: (1) the causal model is (assumed) known; and (2) we wish to quantify the causal relationship between two specific variables, e.g., its strength and functional form. The causal model, whether expressed in graphical form or as a set of equations, captures our assumptions about the relationships that might exist between nodes. The strength and functional form of the causal relationship between two specific variables, however, is not captured in the causal model. We must derive this information from data. Unfortunately, as we saw in Chapter 1 in the Simpson’s paradox examples, observed data rarely quantifies causal relationships directly. We have to use our knowledge of the causal model to determine whether or not, and how, we can compute the strength of a given causal relationship from data.

### 3.1.1 Formalizing causal inference questions using intervention graphs and \(\operatorname{do}\) notation

Consider the causal model shown in Fig. 1. We can ask the following causal inference question: how does an intervention to change feature \(A\)’s value affect feature \(B\)’s value? We cannot simply refer to this as \(\operatorname{P}(B|A)\) because this notation is already used to represent the observed distribution, which is confounded in this case by the influence of \(C\). This is the statistical relationship: in our observed data, this is what we expect \(B\)’s value to be, given that we have seen a particular value of \(A\).

To correctly represent this question then, we must first introduce a notation that addresses the subtle distinction between statistical and causal relationships. To represent the causal relationship between \(A\) and \(B\), we need a notation that will distinguish it from the merely observed statistical relationships between \(A\) and \(B\). We write this causal relationship as: \[\operatorname{P}(B|\operatorname{do}(A))\]

The operator \(\operatorname{do}(A)\) represents the *intervention* to change the value of \(A\). When we estimate the value of \(B\) conditioned on \(\operatorname{do}(A)\), we are imagining ourselves reaching in and changing the value of feature \(A\) while leaving the rest unchanged—except of course, changes caused directly or indirectly by the manipulation of \(A\). Because our intervention is taken independently of the rest of the system, we are essentially creating a new causal model where we have cut off \(A\) from all of its parents. In other words, we have a situation as shown in Fig. 2. On the left hand side, we see the original causal graph, \(G\), and on the right, we see the same model where the feature \(A\) is now determined independently. We call this second graph the *interventional graph* or the *Do graph* of \(A\), \(G_{\operatorname{do}(A)}\). This is a new system. If we could observe data sampled from the data distribution \(\operatorname{P}^*\) corresponding to this new system, our observed data would perfectly represent the causal relationship between \(A\) and other values. That is, \(\operatorname{P}(B|\operatorname{do}(A))= \operatorname{P}^*(B|A)\) and hence \(\mathbb{E}[B|\operatorname{do}(A)]=\mathbb{E}^*[B|A]\).
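To make the distinction concrete, the following sketch simulates a toy version of such a system. The linear structural equations and their coefficients are our own assumptions, not from the text; sampling with A forced to a value plays the role of the interventional distribution \(\operatorname{P}^*\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def sample(do_a=None):
    """Sample from a toy linear SCM (our assumption): C -> A, C -> B, A -> B.
    Passing do_a simulates the intervention graph G_do(A): A's incoming
    edge from C is cut and A is set externally."""
    c = rng.normal(size=n)
    if do_a is None:
        a = 0.8 * c + rng.normal(size=n)   # observational: A listens to C
    else:
        a = np.full(n, float(do_a))        # interventional: A set by us
    b = 2.0 * a + 1.5 * c + rng.normal(size=n)
    return a, b

# E[B|do(A=1)] - E[B|do(A=0)] recovers the structural coefficient 2.0
_, b1 = sample(do_a=1)
_, b0 = sample(do_a=0)
causal = b1.mean() - b0.mean()             # ~2.0

# A naive regression of B on A in the observational regime is biased by C
a, b = sample()
naive_slope = np.polyfit(a, b, 1)[0]       # ~2.73: 2.0 plus confounding bias
```

The gap between `causal` and `naive_slope` is exactly the gap between \(\operatorname{P}(B|\operatorname{do}(A))\) and \(\operatorname{P}(B|A)\) discussed above.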

Because we often ask causal questions in the context of a decision, we compare two or more outcomes to help us understand the effect of the actions we might take. For example, if we are planning an intervention on a binary variable \(A\), our causal inference question focuses on the effect of setting \(A=1\) vs. \(A=0\). Thus, we represent the causal effect of \(A\rightarrow B\) as a difference between the two interventions: \[\label{eq:binaryinterventioneffect} \operatorname{P}(B|\operatorname{do}(A=1)) - \operatorname{P}(B|\operatorname{do}(A=0))\qquad(1)\]

Of course, if the focus of our decision-making is more complex, involving multiple options, we will make many comparisons among the options.

If the focus of our decision-making is a continuous variable, we can represent the effect of an intervention as a derivative: \[\label{eq:interventionasderivative} \frac{dB}{d\operatorname{do}(A)} = \lim_{\Delta A \to 0} \left[ \frac{\operatorname{P}(B|\operatorname{do}(A+\Delta A)) - \operatorname{P}(B|\operatorname{do}(A))}{\Delta A} \right]\qquad(2)\]

Or, if there are multiple independent variables, we can write the effect as a partial derivative, holding fixed \(X\), the set of independent variables excluding the treatment.

\[\label{eq:interventionaspartialderivative} \frac{\partial B}{\partial \operatorname{do}(A)} = \lim_{\Delta A \to 0} \left[ \frac{\operatorname{P}(B|\operatorname{do}(A+\Delta A), X) - \operatorname{P}(B|\operatorname{do}(A), X)}{\Delta A} \right]\qquad(3)\]
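A minimal sketch of this finite-difference view, stated in terms of the interventional mean \(\mathbb{E}[B|\operatorname{do}(A)]\). The toy SCM here (B equal to A squared plus noise) is our own assumption, chosen so that the true derivative at \(A=1.5\) is \(3.0\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Assumed toy SCM: B = A^2 + noise, so d E[B|do(A)] / dA at A=a is 2a.
noise = rng.normal(size=n)   # shared noise draw: common random numbers

def mean_b_do_a(a_val):
    """E[B | do(A = a_val)] under the assumed SCM."""
    return (a_val ** 2 + noise).mean()

# Finite-difference version of eq. (2), evaluated at A = 1.5
delta = 1e-2
deriv = (mean_b_do_a(1.5 + delta) - mean_b_do_a(1.5)) / delta   # ~3.0
```

Reusing one noise draw for both interventions is a common-random-numbers trick: the noise cancels in the difference, so the finite-difference estimate is not swamped by sampling error.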

Figure 2: When we ask about the effect of an intervention that changes the value of a feature \(A\) in some system (left), we are creating, effectively, a new, hypothetical system where \(A\) is set independently of other features (right). a — A causal graph, \(G\), b — An intervention graph \(G_{\operatorname{do}(A)}\)

### 3.1.2 Feature Interactions and Heterogeneous Effects

The effect of a treatment on an outcome is rarely simple and homogeneous. Rather, the effect often varies based on context or unit-level features. For example, a medical procedure may work better or worse in younger or older patients; or a pricing discount might increase sales of some products but not others.

We model these kinds of varying effects as feature interactions. In our graphical model of a system, our outcome feature will have incoming edges from the treatment and one or more contextual features that either affect the outcome or modify the treatment’s effect on the outcome. For example, in Fig. 2, the outcome \(B\) has incoming edges from a treatment \(A\), and also from \(C\) and \(E\). These variables might interact with \(A\) to moderate or amplify its effects on \(B\). From the causal graph alone, we do not know whether either \(C\) or \(E\) interacts with the treatment \(A\) to modify \(A\)’s effect on \(B\).

Recall that we can represent the value of a node as a general function of its parent features. Without loss of generality, let us represent the value of a node as a general function of a single parent node, \(v_0\), and a vector of the remaining parent nodes \(\boldsymbol{v}\): \(f(v_0,\boldsymbol{v})\). If \(v_0\) does not interact with the other features, then \(v_0\)’s effect is homogeneous and we can decompose \(f(v_0,\boldsymbol{v})\) as \(f(v_0,\boldsymbol{v}) = \phi \left( g_0(v_0) + g_1(\boldsymbol{v}) \right)\). If \(v_0\) does interact with other features, then \(v_0\)’s effect is heterogeneous, and \(f(v_0,\boldsymbol{v})\) will decompose into \(f(v_0,\boldsymbol{v}) = \phi \left( g_0(v_0,\boldsymbol{v}') + g_1(\boldsymbol{v}) \right)\), where \(\boldsymbol{v}'\) is a subset of the elements of \(\boldsymbol{v}\). We sometimes refer to the elements of this subset \(\boldsymbol{v}'\) as *effect modifiers*.

We can also express the concepts of heterogeneous and homogeneous interactions as follows. If the effect of \(v_0\) on \(Y\) is heterogeneous then \(\operatorname{P}(Y|\operatorname{do}(v_0)) \neq \operatorname{P}(Y|\operatorname{do}(v_0),\boldsymbol{v})\). If \(v_0\)’s effect is homogeneous, then \(\operatorname{P}(Y|\operatorname{do}(v_0)) = \operatorname{P}(Y|\operatorname{do}(v_0),\boldsymbol{v})\).

**How do we take effect modifiers into account in identification of causal effects?** In some cases, we may only be interested in the average causal effect of a treatment given a known population distribution. For example, we may be able to decide whether to apply a global treatment based on the average causal effect ^{1}. Interestingly, estimating the average causal effect does not necessarily require measuring the effect modifiers. That is, they may remain unobserved, as long as they are not also confounders. However, we must remember that if we measure the average effect with respect to one population distribution, it will not remain valid if the population changes.

In most cases, our overall task or purpose will require that we capture the full distribution of the causal effect; i.e., we must learn how the treatment effect varies with effect modifiers. The *individual treatment effect* (ITE) is an estimate of the effect of treatment for a specific individual unit and context. Note that the ITE makes a strong assumption that all effect modifiers are known and captured in a model, and also observed. If we believe there may be unknown or unobserved effect modifiers, then it is more correct to say we are identifying the *conditional average treatment effect* (CATE). This is the average treatment effect conditional on a set of observed effect modifiers. Note the relationship between CATE and ITE: if we calculate a CATE conditioned on all effect modifiers then CATE is equivalent to ITE.

In addition, we sometimes calculate a *local average treatment effect* (local-ATE). Local-ATE is an estimate of the treatment effect, but only for a specific subpopulation or subset of effect modifier values.
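To make these distinctions concrete, here is a minimal simulation sketch. The modifier, the randomized treatment assignment, and the effect sizes are all our own assumptions: the effect of the treatment depends on a binary modifier M, so the ATE, the two CATEs, and a local-ATE all differ:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

m = rng.integers(0, 2, size=n)        # effect modifier, e.g., an age group
a = rng.integers(0, 2, size=n)        # randomized binary treatment
# Heterogeneous effect (our assumed SCM): treatment raises B by 1 + 2*M
b = (1 + 2 * m) * a + rng.normal(size=n)

def diff_in_means(mask):
    return b[mask & (a == 1)].mean() - b[mask & (a == 0)].mean()

ate = diff_in_means(np.ones(n, dtype=bool))   # population average: ~2.0
cate_m0 = diff_in_means(m == 0)               # CATE given M=0: ~1.0
cate_m1 = diff_in_means(m == 1)               # CATE given M=1: ~3.0
# cate_m1 is also the local-ATE for the M=1 subpopulation
```

Note that the ATE (about 2.0) is simply the average of the two CATEs under the 50/50 population split, illustrating why an ATE measured on one population does not transfer to a population with a different distribution of M.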

### 3.1.3 Direct and Mediated Effects

In a causal graph, there can be multiple paths by which changing some feature can influence some outcome we care about. For example, in \(G\) shown in Fig. 2 (a), changing the variable \(E\) can influence \(B\) directly, through the edge \(E \rightarrow B\), and indirectly, mediated by \(A\) in the path \(E \rightarrow A \rightarrow B\). Usually, when we wish to measure and understand the effect of some feature on an outcome, we want to know the feature’s *total effect* on the outcome, regardless of whether that effect is direct or mediated.

There are times when it is useful to distinguish between direct and mediated effects. For example, when we are analyzing a long-term outcome that we will not be able to observe for some time, it might be useful to measure a mediating short-term outcome to understand what impact our change is having. In other situations, understanding how effects travel through mediating paths might provide us insight into ways to assert greater control over the effects. For example, we might be able to find ways to block some paths to prevent negative effects.

Formally, the notation \(\operatorname{P}(Y|\operatorname{do}(T))\) represents the total effect of intervening on \(T\) on an outcome \(Y\). Given a set of \(k\) mediated paths from \(T\) to \(Y\), where each path is mediated by a single feature \(m_{1...k}\), we can calculate the mediated effect (\(\text{ME}_i\)) of \(T\) on \(Y\) through \(m_i\) as \(\text{ME}_i = \sum_{m_i} \operatorname{P}(Y|\operatorname{do}(m_i))\operatorname{P}(m_i|\operatorname{do}(T))\). This chained calculation can be extended for longer mediating paths consisting of multiple mediators.

The direct effect of \(T\) on \(Y\) is given by the difference between total effect and the sum of all the mediated effects: \(\text{direct effect} = P(Y|\operatorname{do}(T)) - \sum_{i=1...k} \text{ME}_i\).
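As a sketch of this decomposition, consider a toy linear SCM of our own design (the coefficients 0.7, 1.0, and 0.5 are assumptions, not from the text). Because the model is linear, the chained mediated-effect calculation reduces to a product of two interventional contrasts, and the direct effect falls out as the residual:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

def sample(do_t=None, do_m=None):
    """Toy linear SCM (our assumption): T -> M -> Y plus a direct edge T -> Y."""
    t = rng.integers(0, 2, n).astype(float) if do_t is None else np.full(n, float(do_t))
    m = 0.7 * t + rng.normal(size=n) if do_m is None else np.full(n, float(do_m))
    y = 0.5 * t + 1.0 * m + rng.normal(size=n)
    return t, m, y

total  = sample(do_t=1)[2].mean() - sample(do_t=0)[2].mean()   # ~1.2
t_on_m = sample(do_t=1)[1].mean() - sample(do_t=0)[1].mean()   # ~0.7
m_on_y = sample(do_m=1)[2].mean() - sample(do_m=0)[2].mean()   # ~1.0
mediated = t_on_m * m_on_y     # ~0.7: effect flowing through the mediator M
direct   = total - mediated    # ~0.5: the structural T -> Y coefficient
```

In a nonlinear model the product shortcut no longer applies and the full sum over mediator values in the text's formula is required.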

## 3.2 Do-calculus

The task of causal identification is to determine an expression, the causal estimand, that expresses our target value as a function of the observable correlational relationships in our system. That is, how do we express \(\operatorname{P}(B|\operatorname{do}(A))\), the correlation of \(B\) and \(A\) in the intervention graph, as a function of correlations observable in the initial graph?

### 3.2.1 Randomized Experiments

As a starting point to understand the connections between the original and the interventional graphs, it is convenient to begin by considering the causal graph for a randomized experiment. When the intervention or treatment, \(A\), is randomized in an experiment, it has no ancestors in the graph. If we draw the intervention graph, \(G_{\operatorname{do}(A)}\), we see that it is the same as the original. Thus, in an experiment that randomizes \(A\), \(\operatorname{P}(B|\operatorname{do}(A))\) is the same as \(\operatorname{P}(B|A)\). Furthermore, this holds without analysis of the remainder of the causal graph, meaning that we can identify the causal effect of \(A\) on other features without having knowledge of the causal relationships in the system beyond knowing that \(A\) is randomly assigned. This robustness to our knowledge of the causal system is why randomized experiments are considered the gold standard for identifying causal effects.
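The point can be illustrated with a small simulation (the structural equations and coefficients are assumptions of ours). Because A is randomized, a plain difference in conditional means recovers the causal effect even though the background feature C is never measured or used:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Toy system (our assumption): C influences B, but A is randomized,
# so A has no ancestors and G_do(A) is identical to G.
c = rng.normal(size=n)                 # background feature, possibly unknown
a = rng.integers(0, 2, size=n)         # randomized treatment
b = 2.0 * a + 1.5 * c + rng.normal(size=n)

# Because A is randomized, the plain conditional contrast is already causal,
# with no knowledge of C required:
effect = b[a == 1].mean() - b[a == 0].mean()   # ~2.0
```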

Randomization can come from many sources. Sometimes randomization is an inherent part of the system logic, such as in load-balancing algorithms that randomly assign incoming requests to one of the available servers.

What do we do when the system we are observing is not a randomized experiment? How can we identify a causal estimand that represents \(\operatorname{P}(B|\operatorname{do}(A))\) based on the confounded correlations observed in a non-randomized experiment? In the next section, we describe a calculus of rules that can help us with this task.

### 3.2.2 Causal Distributions from Observational Data

Our challenge now, in the causal identification stage of our analysis, becomes clearer. We wish to calculate this value, \(\operatorname{P}[B|\operatorname{do}(A)]\), but we do not observe the system represented by \(G_{\operatorname{do}(A)}\), the Do graph of \(A\). In other words, no data from the probability distribution \(P_{G_{\operatorname{do}(A)}}(.)\) implied by the interventional do-graph is available, yet we would like to estimate \(\operatorname{P}_{G_{\operatorname{do}(A)}}[B|A]= \operatorname{P}[B|\operatorname{do}(A)]\). Therefore, we must identify a strategy for calculating this value given only observations of the system represented by \(G\) and sampled from the probability distribution \(P(.)\).

Since we have no data from \(\operatorname{P}_{G_{\operatorname{do}(A)}}\), a natural strategy is to write the desired quantity over \(\operatorname{P}_{G_{\operatorname{do}(A)}}\) in terms of probability expressions over \(\operatorname{P}\). To do so, we can utilize the fact that an intervention corresponds to a specific structural change in the causal graph, and find the conditional distributions that should stay invariant under this change. Specifically, since the intervention only affects incoming arrows to the intervened variable, the structural equations for its descendant nodes remain the same, and thus the conditional distributions of its descendant variables given their parents stay the same. That is, if \(B\) is caused by the set of variables \(Pa(B)\), then \(\operatorname{P}_{G_{\operatorname{do}(A)}}(B|Pa(B))=\operatorname{P}(B|Pa(B))\). Similarly, we can claim that any conditional independence between variables in the observed data distribution should also hold in the interventional distribution, since an intervention only removes edges from the graph, never adds them. If a set of nodes \(B\) is independent of \(C\) conditional on \(D\), then \(\operatorname{P}(B|D, C)=\operatorname{P}(B|D)\) and \(\operatorname{P}_{G_{\operatorname{do}(A)}}(B|D, C)=\operatorname{P}_{G_{\operatorname{do}(A)}}(B|D)\). As an example of how these simple properties can be used for identification, consider the target quantity \(\operatorname{P}(B|\operatorname{do}(A))\), where \(A \supset Pa(B)\) refers to a set of variables including all parents of \(B\) and some additional variables.
Then, using the above two equivalence properties, we can write: \[\begin{aligned} \label{eq:simple-do-calculus-derive} \operatorname{P}(B|\operatorname{do}(A)) &= \operatorname{P}_{G_{\operatorname{do}(A)}}(B|A) && \text{Using the definition of the do operator} \\ &= \operatorname{P}_{G_{\operatorname{do}(A)}}(B| Pa(B) , A \setminus Pa(B)) && \\ &= \operatorname{P}_{G_{\operatorname{do}(A)}}(B| Pa(B)) && \text{Using the second property above} \\ &= \operatorname{P}(B| Pa(B)) && \text{Using the first property above}\end{aligned}\]

Thus, starting from a target probability expression involving the do-operator, we are able to construct an expression based only on the observed probability distribution \(P\). The final expression is called the *identified estimand* or the *target estimand*, and can be estimated from available data. The process of transforming a target do-expression into an expression involving only the observed probabilities is called *identification*.
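We can check this derivation numerically with a toy SCM of our own design: when we condition on all of \(Pa(B)\) in observational data, we reproduce the mean of \(B\) under an intervention on a superset of \(Pa(B)\), even though one of the parents confounds the other:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

def sample(do_x=None, do_u=None):
    """Toy SCM (our assumption): U -> X, U -> B, X -> B, so Pa(B) = {X, U}.
    Intervening on all of Pa(B) leaves P(B | Pa(B)) unchanged."""
    u = rng.normal(size=n) if do_u is None else np.full(n, float(do_u))
    x = 0.9 * u + rng.normal(size=n) if do_x is None else np.full(n, float(do_x))
    b = 2.0 * x - 1.0 * u + rng.normal(size=n)
    return u, x, b

# Interventional mean: E[B | do(X=1, U=0)]
b_int = sample(do_x=1.0, do_u=0.0)[2].mean()                    # ~2.0

# Observational mean, conditioning on (a small window around) Pa(B) = (1, 0)
u, x, b = sample()
window = (np.abs(x - 1.0) < 0.1) & (np.abs(u) < 0.1)
b_obs = b[window].mean()                                        # also ~2.0
```

The windowed conditioning is a crude stand-in for exact conditioning on continuous parents; the agreement between `b_int` and `b_obs` is the invariance \(\operatorname{P}_{G_{\operatorname{do}(A)}}(B|Pa(B))=\operatorname{P}(B|Pa(B))\) in action.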

Rather than coming up with such properties for every new do-expression, *do-calculus* provides a set of rules that generalizes the above procedure for any causal graph. The key advantage of do-calculus is that it formalizes such custom derivations into a general framework that can be applied mechanistically to any graph and any causal inference question for that graph. Given a causal graph, it allows us to relate probabilities in the interventional graph (which we do not observe) to statistical relationships that we can observe in the observational graph. That is, do-calculus gives us the tools to convert our causal target, \(\operatorname{P}(B|\operatorname{do}(A))\), to a causal estimand that is computable from observational quantities.

### 3.2.3 Graph Rewriting

We have seen above that given a graph \({G}\) that represents our causal assumptions about a system, it is useful to be able to refer to modified or edited versions of \({G}\), such as the interventional graph.

**Interventional Graph:** We refer to the interventional graph, where we have intervened on a feature \(A\), as \({G}_{\operatorname{do}(A)}\). This graph \({G}_{\operatorname{do}(A)}\) is identical to \({G}\) except that all edges leading to \(A\) from its parents have been removed (e.g., Fig. 3 (b)).

**Nullified Graph:** It is also useful to refer to the nullified graph, where we have artificially removed or nullified all effects of a feature \(A\). This graph, which we reference as \({G}_{null(A)}\), is identical to \(G\), except that all edges from \(A\) to its children have been removed (e.g., Fig. 3 (c)).

Note that we are not limited to a single intervention or nullification on a graph. For example, \({G}_{\operatorname{do}(A),\operatorname{do}(C)}\) would represent a graph where we have intervened on both \(A\) and \(C\). \({G}_{\operatorname{do}(A),null(C)}\) represents a graph where we have intervened on \(A\) and nullified the effects of \(C\) (e.g., Fig. 3 (d)).

For brevity, we will often see the literature use an overbar and underbar to represent interventions and nullifications. That is, \({G}_{\operatorname{do}(A),null(C)}\) can be equivalently written as \({G}_{\bar{A},\underline{C}}\).

Figure 3: Graph rewriting examples. a — A causal graph, \(G\), b — An intervention graph \(G_{\operatorname{do}(A)}\), c — A nullified graph \(G_{\operatorname{null}(C)}\), d — A rewritten graph \(G_{\operatorname{do}(A)\operatorname{null}(C)}\)
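These rewrites are mechanical enough to sketch in a few lines of code. The dict-of-parent-sets representation below is our own convention (not an API from the text), and the example graph is loosely modeled on Fig. 3's \(G\):

```python
# A causal graph as a dict mapping each node to the set of its parents.

def do(graph, node):
    """G_do(node): remove all incoming edges to `node`."""
    g = {k: set(v) for k, v in graph.items()}
    g[node] = set()
    return g

def null(graph, node):
    """G_null(node): remove all outgoing edges from `node`."""
    return {k: set(v) - {node} for k, v in graph.items()}

# Example graph: E -> A, C -> A, A -> B, C -> B
G = {"A": {"E", "C"}, "B": {"A", "C"}, "C": set(), "E": set()}
G_doA_nullC = null(do(G, "A"), "C")
# Result: A has lost its parents, and C no longer feeds any node
```

Both helpers copy the graph rather than mutating it, so the original \(G\) remains available for comparing observational and rewritten structures.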

### 3.2.4 Rules of Do-calculus

Do-calculus consists of three simple rules:

*Insertion or deletion of observations*\[\begin{aligned} \operatorname{P}(y|\operatorname{do}(x),z,w) = \operatorname{P}(y|\operatorname{do}(x),w) && \text{if } (y \unicode{x2AEB} z | x, w)_{G_{\operatorname{do}(x)}} \end{aligned}\] Rule 1 states that we can remove variables \(z\) from the conditioning set if the remaining conditioning variables \(w\) and the intervention variables \(x\) d-separate \(y\) from \(z\) in the intervention graph \(G_{\operatorname{do}(x)}\). Of course, we can also add variables to the conditioning set under the same criteria. The intuition for this rule is that \(y|\operatorname{do}(x)\) reduces to simply \(y|x\) under a graph \(G_{\operatorname{do}(x)}\) where all incoming edges to \(x\) are removed. From probability calculus, we know that \(\operatorname{P}(y|x,z,w)=\operatorname{P}(y|x,w)\) if \(y \unicode{x2AEB} z | x, w\), and therefore the above rule follows whenever the graph is \(G_{\operatorname{do}(x)}\).

*Action/observation exchange*\[\begin{aligned} \operatorname{P}(y|\operatorname{do}(x), \operatorname{do}(z), w) = \operatorname{P}(y|\operatorname{do}(x),z,w) && \text{if } (y\unicode{x2AEB} z| x,w)_{G_{\operatorname{do}(x),null(z)}} \end{aligned}\] Rule 2 states that we can replace a conditional on an intervention \(\operatorname{do}(z)\) with a conditional on the observational value \(z\) if \(y\) is d-separated from \(z\) by \(x,w\) in the graph \(G_{\operatorname{do}(x),null(z)}\), where we have removed all input edges to \(x\) and all output edges from \(z\). To understand this rule, let us focus on the simpler rule without the additional intervention on \(x\) (i.e., substitute \(x=\emptyset\)): \(\operatorname{P}(y| \operatorname{do}(z), w) = \operatorname{P}(y|z,w)\) if \((y\unicode{x2AEB} z| w)_{G_{null(z)}}\).

In this simpler case, if \(y\) and \(z\) are independent of each other given \(w\) under a graph where there are no outgoing edges from \(z\), then the only connections between \(z\) and \(y\) that are not blocked by \(w\) are through edges that start from \(z\). In such a case, there are no confounders outside of \(w\), and the effect of \(z\) can be estimated simply by conditioning; therefore \(\operatorname{P}(y|\operatorname{do}(z), w)=\operatorname{P}(y|z, w)\).

The role of the additional intervention on \(x\) follows the same intuition as in Rule 1: we test independence in a graph that additionally has removed incoming edges to \(x\), thereby allowing \(y|\operatorname{do}(x)\) to be equivalent to \(y|x\), so that \(x\) can be treated as just another conditioning variable, like \(w\), in \(\operatorname{P}(y|\operatorname{do}(x),z, w)\).

Consistent with our principle of stable and independent causal mechanisms (Chapter 2), note that Rule 2 implies that \(\operatorname{P}(y|\operatorname{do}(z)) = \operatorname{P}(y)\) if \(y\) is d-separated from \(z\) in \(G\). Also, \(\operatorname{P}(y|\operatorname{do}(z),w) = \operatorname{P}(y|z,w)\) if \(y\) is d-separated from \(z\) by \(w\) in \(G_{null(z)}\).

Note that if \(z\) does not cause \(y\) at all, then there will be no outgoing edges from \(z\) to \(y\). In such a case, Rule 1 and Rule 2 can be combined to yield, \[\begin{aligned} \operatorname{P}(y|\operatorname{do}(x), \operatorname{do}(z), w) = \operatorname{P}(y|\operatorname{do}(x),w) && \text{ if } (y\unicode{x2AEB} z| x,w)_{G_{\operatorname{do}(x)}} \end{aligned}\]

However, this condition is too strict; we can obtain the same result using a milder condition, as shown by Rule 3.

*Insertion/deletion of actions*\[\begin{aligned} \operatorname{P}(y|\operatorname{do}(x), \operatorname{do}(z), w) = \operatorname{P}(y|\operatorname{do}(x),w) && \text{if } (y \unicode{x2AEB} z | x, w)_{G_{\operatorname{do}(x),\operatorname{do}(z(w))}} \end{aligned}\] where \(z(w)\) is the set of all \(z\) that are not ancestors of \(w\) in \(G_{\operatorname{do}(x)}\). That is, Rule 3 states that we can remove an action \(\operatorname{do}(z)\) from the conditioning set if the remaining conditioning variables \(w\) and the remaining intervention variables \(x\) d-separate \(y\) from \(z\) in the graph where we have removed incoming edges to \(x\) and \(z(w)\). As before, let us consider the case without \(x\) to capture the intuition. We know that \(y|\operatorname{do}(z)\) refers only to the effect of \(z\) as it flows through directed edges starting from \(z\). So if all incoming edges to \(z\) are removed and \(y\) is independent of \(z\), then there are no directed paths from \(z\) that end up at \(y\). Therefore \(z\) can be safely removed even though \(z \not \unicode{x2AEB} y|w\), since the causal effect only involves the forward direction from \(z\). The special subset \(z(w)\) of \(z\) avoids any situation where conditioning on \(w\) leads to collider bias, due to which \(z\) may end up having a causal effect on \(y\) (conditional on \(w\)) even though there is no directed path from \(z\) to \(y\).

By iteratively applying the rules of do-calculus, we attempt to convert our causal target—initially expressed as a function of the interventional graph—into a function of the observational graph. The rules of do-calculus have been proved to be complete. That is, if repeated application of the rules of do-calculus cannot remove all references to the \(\operatorname{do}()\) operator and interventional graph, then the causal target is not identifiable without additional assumptions.

In the next section, we apply do-calculus to derive simple identification strategies for some commonly encountered graphical constraints.

## 3.3 Identification under Graphical Constraints

Using do-calculus, we can derive simple methods for causal identification in many situations. In this section, we present two simple methods, the adjustment formula and the front-door criterion, and show how each is derivable using do-calculus.

### 3.3.1 Adjustment Formula

As an example of how we can apply do-calculus to the problem of identification, consider a simple causal target \(\operatorname{P}(B|\operatorname{do}(A))\) in some graph \(G\). Here we will show how two simple manipulations of this causal target provide us with a useful identification approach called the adjustment formula:

\[\begin{split} \operatorname{P}(B | \operatorname{do}(A)) & = \sum_Z \operatorname{P}(B, Z|\operatorname{do}(A)) \\ &= \sum_Z \operatorname{P}(B|\operatorname{do}(A), Z)\operatorname{P}(Z|\operatorname{do}(A)) \\ &= \sum_Z \operatorname{P}(B|\operatorname{do}(A), Z)\operatorname{P}(Z) \text{ if } (Z \unicode{x2AEB} A)_{G_{\operatorname{do}(A)}} \\ & = \sum_Z \operatorname{P}(B|A,Z)\operatorname{P}(Z) \text{ if } (B \unicode{x2AEB} A|Z)_{G_{\underline{A}}} \end{split}\]

The first two steps follow from the law of total probability and the chain rule. The third step applies Rule 3 of do-calculus and holds as long as \(Z\) and \(A\) are d-separated in a graph without incoming edges to \(A\). The last step applies Rule 2 of do-calculus, and holds as long as \(Z\) d-separates \(A\) and \(B\) in the graph where all outgoing edges of \(A\) have been removed. Any such set \(Z\) is called a *valid adjustment set*.

Note that \(\operatorname{P}(A|Z)\) should be strictly greater than \(0\) for successful identification. If the observed data does not have any points where \(A=a\), then it is impossible to identify \(\operatorname{P}(B|\operatorname{do}(A))\) since \(\operatorname{P}(B|A,Z)\) will be undefined. This requirement is often known as the *overlap* assumption for causal identification, and we will discuss its implications further in the context of estimation in Chapter 4.
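As a sketch, we can verify the adjustment formula on a small binary simulation (the structural probabilities below are our own assumptions). Summing \(\operatorname{P}(B|A,Z)\operatorname{P}(Z)\) recovers the causal risk difference, while naive conditioning does not:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000

# Toy binary SCM (our assumption): Z -> A, Z -> B, A -> B
z = rng.random(n) < 0.5
a = rng.random(n) < (0.2 + 0.6 * z)             # Z confounds treatment choice
b = rng.random(n) < (0.1 + 0.4 * a + 0.3 * z)   # structural P(B=1 | A, Z)

def p_b_do_a(a_val):
    """Adjustment formula: sum_z P(B=1 | A=a, Z=z) * P(Z=z)."""
    return sum(b[(a == a_val) & (z == zv)].mean() * (z == zv).mean()
               for zv in (0, 1))

effect = p_b_do_a(1) - p_b_do_a(0)   # ~0.40, the true causal risk difference
naive = b[a].mean() - b[~a].mean()   # ~0.58, biased upward by Z
```

Here Z is a valid adjustment set (it is the sole parent of A), and the overlap assumption holds because every (A, Z) cell has positive probability.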

### 3.3.2 Valid Adjustment Sets

Intervening on \(A\) will have an effect on \(B\) that is calculable from observational data using the adjustment formula *if* we can find a valid adjustment set \(Z\).

Here, we present multiple kinds of adjustment sets that satisfy the requirements of the adjustment formula. We start with the simplest such set and then identify broader sets.

**Parent Adjustment Set** The simplest such adjustment set is \(Pa(A)\), the set of parent nodes of \(A\). We can easily determine that \(Pa(A)\) d-separates \(A\) and \(B\) in \(G_{null(A)}\). As all outgoing edges of \(A\) have, by definition, been removed in \(G_{null(A)}\), all of its paths to any node, including to \(B\), must pass through its parents. Necessarily then, all paths are blocked if we condition on \(Pa(A)\), and thus \(A \unicode{x2AEB} B | Pa(A)\), and \(Pa(A)\) meets the criteria for being a valid adjustment set.

Similarly, \(Pa(B) \setminus \{A\}\), the set of parent nodes of \(B\) other than \(A\) itself, is also a valid adjustment set.

**Backdoor Criterion** The *backdoor criterion* allows us to identify a broader class of valid adjustment sets. This criterion states that a set of nodes \(Z\) is a valid adjustment set if: (1) \(Z\) blocks all paths between \(A\) and \(B\) that begin with an edge directed into \(A\) (the *backdoor paths*); and (2) \(Z\) contains no descendants of \(A\).

Note that the parent adjustment set trivially meets the backdoor criterion.

**“Towards necessity” criterion** While the backdoor criterion significantly broadens our ability to identify valid adjustment sets beyond the set of parents, it is not yet complete. That is, there are valid adjustment sets that do not meet the backdoor criterion. Shpitser et al. (2010) successfully generalized the backdoor criterion to describe all valid adjustment sets.

The “towards necessity” criterion states that a set of nodes \(Z\) is a valid adjustment set if: (1) \(Z\) blocks all non-causal paths between \(A\) and \(B\); and (2) letting \(D\) be the set of all directed paths from \(A\) to \(B\), and \(D_{\text{nodes}}\) be all the nodes on these directed paths except for \(A\) itself, \(Z\) does not include any descendants of \(D_{\text{nodes}}\).

### 3.3.3 Invalid adjustment sets

In this section, we explain the intuition behind valid adjustment sets by describing the consequences of using an *invalid* adjustment set. Through multiple simple examples, we show what happens when we condition on features that fail to meet the criteria for valid adjustment sets.

Figure 4: Examples of invalid adjustment sets. a — Conditioning on a collider \(Z\) will introduce a backdoor path or correlation between \(A\) and \(B\) and bias the estimate of \(\operatorname{P}(B|\operatorname{do}(A))\), b — Conditioning on a mediator \(Z\) will block the effect of \(A\) on \(B\), c — Conditioning on this post-treatment variable \(Z\) will bias the estimate of \(\operatorname{P}(B|\operatorname{do}(A))\)

**Conditioning on a collider:** Consider Fig. 4 (a). Unconditionally, \(A\) and \(B\) are statistically independent of each other: they are d-separated in the shown graph. However, if we add the collider \(Z\) to the adjustment set, then as described in Chapter 2, we introduce a dependence, such that \(A \not\unicode{x2AEB} B |Z\). Adjusting for a collider \(Z\) will thus introduce a false correlation between \(A\) and \(B\) and bias our estimate of \(\operatorname{P}(B|\operatorname{do}(A))\).

**Conditioning on a mediator:** Fig. 4 (b) shows another causal graph where \(Z\) is an invalid adjustment. In this case, \(Z\) mediates the effect of \(A\) on \(B\). Conditioning our analysis on the mediator \(Z\) blocks the path between \(A\) and \(B\), breaking the dependence of \(B\) on \(A\). In other words, \(A \unicode{x2AEB} B|Z\). Thus, adjusting for \(Z\) will block the effect of \(A\) on \(B\) that we are attempting to estimate, invalidating our causal estimate.

**Conditioning on a post-treatment variable:** Fig. 4 (c) shows a post-treatment variable \(Z\) that is confounded with the outcome by \(X\). If we add this \(Z\) to the adjustment set, it will open a confounding path between \(A\) and \(B\) and bias our estimate of \(\operatorname{P}(B|\operatorname{do}(A))\). Note that whether conditioning on a post-treatment variable is invalid, or simply neutral and not useful, depends on the rest of the structure of the graph. For example, in the figure shown, the feature \(X\) is necessary for the opening of the confounding path.

Not all variables fall neatly into valid or invalid adjustments. There are also adjustments that are neutral. These variables neither help nor harm our goal of causal identification. However, as we will discuss in the next chapter, they can have implications (good or bad) for statistical estimation.

### 3.3.4 The Front-door Criterion

The adjustment formula is one of the identification methods that can be derived solely from graphical assumptions and do-calculus. Sometimes, one or more of the variables required as part of a valid adjustment set are unobserved, in which case we cannot use that particular adjustment set in the adjustment formula. If there is no valid adjustment set whose variables are all observed then we cannot use the adjustment formula for identification.

Applying the rules of do-calculus, however, can lead to other strategies. The front-door criterion is one such method for identifying a causal effect when the confounding variables necessary for applying the adjustment formula are unobserved. Consider Fig. 5, which shows a graph where an unobserved confounder \(U\) makes it impossible to apply the adjustment formula. In this simple example, the only valid adjustment set is \(\{U\}\) and thus, without observing \(U\), we cannot calculate the adjustment formula to identify \(\operatorname{P}(C|\operatorname{do}(A))\) directly.

Consider the node \(B\) in the graph, however. \(A\) has no direct effect on \(C\); the only effect \(A\) has on \(C\) is its indirect effect through \(B\), and this indirect effect is not confounded by the unobserved \(U\). This particular structure allows us to apply the front-door criterion, a method for identifying the effect of \(A\) on \(C\) in (some) cases where we cannot apply the adjustment formula.

The key insight of the front-door criterion is that we can factor the causal effect of \(A\) on \(C\) and, if the factors are themselves identifiable, use them to identify our target. For the graph in Fig. 5, the factored causal effect is:

\[\operatorname{P}(C|\operatorname{do}(A)) = \sum_{B}{\operatorname{P}(C|\operatorname{do}(B))\operatorname{P}(B|\operatorname{do}(A))}\]

Now, we can ask whether any method would allow us to identify these factors \(\operatorname{P}(C|\operatorname{do}(B))\) and \(\operatorname{P}(B|\operatorname{do}(A))\) using our observed data. In this case, both factors are easily identified using the adjustment formula and Rule 2 of do-calculus:

\[\begin{aligned} \operatorname{P}(C|\operatorname{do}(B)) &= \sum_{A}{\operatorname{P}(C|B,A)\operatorname{P}(A)} && \text{ by the adjustment formula}\end{aligned}\]

and

\[\begin{aligned} \operatorname{P}(B|\operatorname{do}(A)) &= \operatorname{P}(B|A) && \text{ by Rule 2 since } (B \unicode{x2AEB} A)_{G_{null(A)}}\end{aligned}\]

Combining these together, we find that:

\[\operatorname{P}(C|\operatorname{do}(A)) = \sum_{B}{\sum_{A}{\operatorname{P}(C|B,A)\operatorname{P}(A)}\operatorname{P}(B|A)}\]
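As a sanity check, we can verify this formula numerically. The sketch below uses a hypothetical binary parameterization of the graph in Fig. 5 (all coefficients are our own choices, not from the text) and compares the front-door estimate, computed from observational data alone, against ground truth obtained by simulating the intervention \(\operatorname{do}(A=1)\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def simulate(do_A=None):
    # U -> A, U -> C (unobserved confounder); A -> B -> C (front-door path)
    U = rng.binomial(1, 0.5, n)
    A = do_A if do_A is not None else rng.binomial(1, 0.2 + 0.6 * U)
    B = rng.binomial(1, 0.1 + 0.7 * A)          # mediator: depends only on A
    C = rng.binomial(1, 0.1 + 0.4 * B + 0.4 * U)
    return A, B, C

# Ground truth by actually intervening on A.
_, _, C1 = simulate(do_A=np.ones(n, dtype=int))
truth = C1.mean()

# Front-door estimate from observational data only (U is never used).
A, B, C = simulate()
est = 0.0
for b in (0, 1):
    p_b_given_a1 = ((B == b) & (A == 1)).sum() / (A == 1).sum()
    inner = 0.0
    for a in (0, 1):
        mask = (A == a) & (B == b)
        inner += C[mask].mean() * (A == a).mean()  # adjustment formula for P(C|do(B=b))
    est += p_b_given_a1 * inner

print(f"P(C=1|do(A=1)): truth ~ {truth:.3f}, front-door ~ {est:.3f}")
```

The two numbers agree (up to sampling noise), even though the confounder \(U\) is hidden from the front-door computation.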

The generalized version of this front-door criterion enables us to apply factorization along the causal path(s) of interest, and apply any valid method for identifying the component causal factors. Of course, in a more complex graph, if any of the causal factors are unidentifiable because of other unobserved, confounding variables, then we will not be able to apply the front-door criterion.

## 3.4 Identification with Additional Assumptions

Our goal in causal identification is to find a way to express the causal relationship between two features in terms of observable statistical relationships. In many situations, we can use graphical assumptions and do-calculus to disentangle our observations of statistical relationships to identify causal relationships. In situations when graphical assumptions are insufficient, parametric assumptions can sometimes help. In this section, we show how a simple parametric assumption—specifically, assumptions of non-interaction, as previously introduced in Section 3.1.2—can help.

### 3.4.1 Parametric Assumptions and Instrumental Variables

Consider the causal graph shown in Fig. 6. If our goal is to identify \(\operatorname{P}(B|\operatorname{do}(A))\), we can easily see that the adjustment formula is not applicable as the confounding feature \(U\) is unobserved. Since there is no mediating variable between \(A\) and \(B\), we also cannot apply the front-door criterion. In fact, the effect of \(A\) on \(B\) is not identifiable based on graphical assumptions alone.

This kind of causal graph is quite common. For example, we are often in situations where we have an ability to run partially randomized experiments, where we can randomize \(Z\), but do not have direct control over the factor \(A\) that is our primary focus. This can occur in experiments with people, for example, where we might influence individuals’ decisions through recommendations, encouragements or rewards, but otherwise not have full control. This can also occur in many natural settings, where some observable independent factor, such as the weather, plays a partial role in determining \(A\).

There is an intriguing opportunity, however, in the influence of \(Z\) on \(A\). Because \(Z\) has no parents in the graph, there are no backdoor paths from \(Z\) to \(A\), and we can trivially identify that \(\operatorname{P}(A|\operatorname{do}(Z)) = \operatorname{P}(A|Z)\). Similarly, we can show that \(\operatorname{P}(B|\operatorname{do}(Z)) = \operatorname{P}(B|Z)\).

The instrumental variable method is an identification method that exploits the auxiliary variable \(Z\) to isolate the causal effect. A variable that follows the graph structure of Fig. 6 is called an *instrumental* variable. Instrumental variable settings satisfy several criteria ^{2}:

- *\(Z\) and \(B\) are independent, if not for \(A\)*. More formally, \(Z\) and \(B\) are d-separated in the graph \(G_{null(A)}\). This implies that \(Z\) affects \(B\) only via paths that pass through \(A\), and that \(Z\) and \(B\) are not correlated due to common causes.
- *\(Z\) affects \(A\)*. \(Z\) and \(A\) are not d-separated and \(\operatorname{P}(A|\operatorname{do}(Z))\) is identifiable.
- *The effects of \(Z\) on \(A\) and of \(A\) on \(B\) are homogeneous with respect to the unobserved variables \(U\)*.

The first two conditions can be read from the causal graph, while the third is an additional parametric constraint. The first condition ensures that whatever effect \(Z\) has on \(B\) can only flow through \(A\); there can be no direct effect of \(Z\) on \(B\) that does not go through \(A\). In addition, the d-separation of \(Z\) and \(B\) in \(G_{null(A)}\) implies that \(Z\) is independent of the unobserved confounds \(U\) of \(A\rightarrow B\).

The second condition states that \(Z\) has a non-zero effect on \(A\), and that this effect is identifiable. Intuitively, the effect of \(Z\) on \(B\) can be thought of as the combination of the effect of \(Z\) on \(A\) and the effect of \(A\) on \(B\); thus, if \(Z\) had no effect on \(A\), it would give us no useful information about the effect of \(A\) on \(B\). In the specific graph shown, the identification \(\operatorname{P}(A|\operatorname{do}(Z))=\operatorname{P}(A|Z)\) holds trivially, since there are no common causes of \(Z\) and \(A\) (in our example, \(Z\) is randomized).

The final condition is that it is legitimate to assume that the effect of \(Z\) on \(A\) is homogeneous (i.e., that \(U\) does not modify the effect of \(Z\) on \(A\)), and that the effect of \(A\) on \(B\) is also homogeneous (\(U\) does not modify the effect of \(A\) on \(B\)). This will allow us to ensure our observations of the effect of \(Z\) on \(A\) and \(Z\)’s indirect effect on \(B\) are not entangled with any interactions with the unobserved factors \(U\).

Next, we show how we can use these two identified components, \(\operatorname{P}(B|Z)\) and \(\operatorname{P}(A|Z)\), together with the assumptions above, to identify the effect of intervening on \(A\) on \(B\).

### 3.4.2 Derivation for continuous variables

Here is a simple derivation, in the setting of Fig. 6 and continuous variables \(Z\), \(A\), and \(B\), showing how to calculate \(\frac{dB}{d{\operatorname{do}(A)}}\) based on \(\frac{dB}{dZ}\) and \(\frac{dA}{dZ}\).

First, let us note that, given the causal graph in Fig. 6, we can trivially identify that \(\operatorname{P}(B|\operatorname{do}(Z))=\operatorname{P}(B|Z)\) and that \(\operatorname{P}(A|\operatorname{do}(Z)) = \operatorname{P}(A|Z)\). From the definition of the effect of an intervention on continuous variables (Eq. 2), this identification also provides us with the estimands \(\frac{dA}{d{\operatorname{do}(Z)}} = \frac{dA}{dZ}\) and \(\frac{dB}{d{\operatorname{do}(Z)}} = \frac{dB}{dZ}\).

Second, from the chain rule for derivatives with multiple independent variables, we state:

\[\begin{aligned} \frac{\partial B}{\partial \operatorname{do}(Z)} &= \frac{\partial B}{\partial \operatorname{do}(A)}\frac{\partial A}{\partial \operatorname{do}(Z)} + \frac{\partial B}{\partial \operatorname{do}(U)}\frac{\partial U}{\partial \operatorname{do}(Z)} && \\ &= \frac{\partial B}{\partial \operatorname{do}(A)}\frac{\partial A}{\partial \operatorname{do}(Z)} && U \unicode{x2AEB} Z \text{ implies } \frac{\partial U}{\partial \operatorname{do}(Z)} = 0 \\ \frac{\partial B}{\partial \operatorname{do}(A)} &= \frac{\frac{\partial B}{\partial \operatorname{do}(Z)}}{\frac{\partial A}{\partial \operatorname{do}(Z)}} && \text{ Rearranging terms } \\ \frac{\partial B}{\partial \operatorname{do}(A)} &= \frac{\frac{dB}{d\operatorname{do}(Z)}}{\frac{dA}{d\operatorname{do}(Z)}} && \text{ By non-interaction of $U$} \\ \frac{\partial B}{\partial \operatorname{do}(A)} &= \frac{\frac{dB}{dZ}}{\frac{dA}{dZ}} && \text{ By earlier identification}\end{aligned}\]

In this derivation, we take advantage of our causal assumption that \(U \unicode{x2AEB} Z\) as we go from line 1 to line 2. Because \(U\) is d-separated from \(Z\) in our instrumental variables setting, we know that \(\frac{\partial U}{\partial \operatorname{do}(Z)}\) must be \(0\). We also apply our assumption of homogeneity of the effects \(Z \rightarrow A\) and \(A \rightarrow B\) to convert our partial derivatives \(\frac{\partial B}{\partial \operatorname{do}(Z)}\) and \(\frac{\partial A}{\partial \operatorname{do}(Z)}\) to total derivatives as we go from line 3 to line 4. This is crucial because, otherwise, we would have to observe \(U\) to evaluate the partial derivatives at \((U,Z)\). Knowing that they are independent of \(U\), we can convert them to total derivatives and evaluate them at \(Z\) only.

Thus, we see that under the assumptions of the instrumental variables setting, we can identify the effect of \(A\) on \(B\) using our observations of the effect of \(Z\) on \(B\) and \(Z\) on \(A\). This is a powerful result, and enables us to identify the effect of features on outcomes in a variety of scenarios, even when we do not have full control over them.
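In the linear special case, the ratio of derivatives above reduces to a ratio of regression slopes, \(\operatorname{cov}(Z,B)/\operatorname{cov}(Z,A)\). The following sketch (a hypothetical linear parameterization; all coefficients are our own choices) shows that this ratio recovers the true effect of \(A\) on \(B\), while a naive regression of \(B\) on \(A\) is biased by the unobserved \(U\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Z = rng.normal(size=n)                              # randomized instrument
U = rng.normal(size=n)                              # unobserved confounder
A = 0.8 * Z + U + 0.3 * rng.normal(size=n)          # dA/dZ = 0.8
B = 1.5 * A + 2.0 * U + 0.3 * rng.normal(size=n)    # true effect of A on B is 1.5

# Naive regression slope of B on A is confounded by U.
ols = np.cov(A, B)[0, 1] / np.var(A)

# IV estimate: the instrument's effect on B divided by its effect on A,
# i.e. (dB/dZ) / (dA/dZ).
iv = (np.cov(Z, B)[0, 1] / np.var(Z)) / (np.cov(Z, A)[0, 1] / np.var(Z))

print(f"naive OLS ~ {ols:.2f}, IV ~ {iv:.2f}")  # OLS biased upward; IV ~ 1.5
```

The naive slope mixes the true effect with the \(U\)-driven association, while the instrument isolates only the variation in \(A\) that is independent of \(U\).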

### 3.4.3 Derivation for binary variables

Here, we repeat our simple derivation in the setting of Fig. 6 with binary, rather than continuous, variables \(Z\), \(A\), and \(B\). As a reminder, we wish to identify \(\operatorname{P}(B|\operatorname{do}(A))\), the effect that intervening on \(A\) will have on \(B\). Here, \(A\) and \(B\) are confounded by an unobserved variable \(U\), rendering earlier methods, such as the adjustment formula, ineffective.

To derive the identification, let us first write the expression for \(\operatorname{P}(B|Z,U)\). \[\begin{split} \operatorname{P}(B|Z,U) &= \sum_A \operatorname{P}(B|Z, U, A) \operatorname{P}(A|Z, U) \\ &= \sum_A \operatorname{P}(B|A, U) \operatorname{P}(A|Z, U) \end{split}\] where the last equality is due to the Markov property of the causal graph. For a binary instrument \(Z\), intervention \(A\), and outcome \(B\), we obtain the following equations, \[\begin{split} \operatorname{P}(B=1|Z=1,U) &= \operatorname{P}(B=1|A=1, U) \operatorname{P}(A=1|Z=1, U) \\ &\quad + \operatorname{P}(B=1|A=0, U) (1 - \operatorname{P}(A=1|Z=1, U)) \\ \operatorname{P}(B=1|Z=0,U) &= \operatorname{P}(B=1|A=1, U) \operatorname{P}(A=1|Z=0, U) \\ &\quad + \operatorname{P}(B=1|A=0, U) (1 - \operatorname{P}(A=1|Z=0, U)) \end{split}\] Subtracting the two equations and rearranging the terms, \[\begin{split} \operatorname{P}(B=1|A=1, U) - \operatorname{P}(B=1|A=0, U) &= \frac{\operatorname{P}(B=1|Z=1, U)- \operatorname{P}(B=1|Z=0, U)}{\operatorname{P}(A=1|Z=1, U) - \operatorname{P}(A=1|Z=0, U)} \end{split}\] The average effect of the intervention is given by, \[\begin{split} \mathbb{E}[\operatorname{P}(B=1|\operatorname{do}(A=1))] - \mathbb{E}[\operatorname{P}(B=1|\operatorname{do}(A=0))] &= \mathbb{E}[\operatorname{P}(B=1|A=1, U)] - \mathbb{E}[\operatorname{P}(B=1|A=0, U)] \\ &= \mathbb{E}\left[\frac{\operatorname{P}(B=1|Z=1, U)- \operatorname{P}(B=1|Z=0, U)}{\operatorname{P}(A=1|Z=1, U) - \operatorname{P}(A=1|Z=0, U)}\right] \end{split}\] where the first equality holds because \(U\) is a valid adjustment set (the backdoor adjustment formula), and the expectation is taken over \(U\).

In general, we cannot isolate the average effect of the intervention using the above equation, since the denominator also depends on the unobserved \(U\). In the instrumental variables scenario, however, the effect of \(Z\) on \(A\) does not vary with \(U\) (that is, the effect of \(Z\) on \(A\) is homogeneous with respect to \(U\)). Recall from Section 3.1.2 that in this case, \(\operatorname{P}(A|Z,U)=\operatorname{P}(A|Z)\), and thus we can simplify the above equation. In such cases, we can write the average causal effect as, \[\begin{split} \mathbb{E}[\operatorname{P}(B|\operatorname{do}(A=1))] - \mathbb{E}[\operatorname{P}(B|\operatorname{do}(A=0))] &= \mathbb{E}\left[\frac{\operatorname{P}(B=1|Z=1, U)- \operatorname{P}(B=1|Z=0, U)}{\operatorname{P}(A=1|Z=1) - \operatorname{P}(A=1|Z=0)}\right] \\ &= \frac{\operatorname{P}(B=1|Z=1)- \operatorname{P}(B=1|Z=0)}{\operatorname{P}(A=1|Z=1) - \operatorname{P}(A=1|Z=0)} \end{split}\]

Notice that we also remove the \(U\) from the numerator. This is valid for purposes of calculating an average causal effect over a specific population, with a fixed, though unobserved, distribution of \(U\). Of course, this average causal effect will not be valid for different populations.

The resultant equation, known as the Wald formula, provides an identification formula for the causal effect. Consider the similarity of this formula to our finding in the continuous variables scenario earlier. The identified estimand intuitively captures how adding an instrument helps identify a causal effect in the presence of unobserved confounding. While we cannot estimate the effect of treatment on outcome directly, we can estimate the effect of the instrument on both treatment and outcome. The ratio of these two effects then quantifies to what degree \(Z\)’s effect on \(A\) also causes a change in \(B\). In causal graphs where \(Z\) affects \(B\) only through \(A\), then, this identifies the effect of \(A\) on \(B\) as well.
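The Wald formula is easy to check by simulation. In the sketch below (a hypothetical parameterization with a homogeneous \(Z \rightarrow A\) effect; all probabilities are our own choices), \(U\) confounds \(A\) and \(B\), so the naive difference in outcomes is biased, but the Wald ratio recovers the true effect of \(+0.3\):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
Z = rng.binomial(1, 0.5, n)                 # randomized binary instrument
U = rng.binomial(1, 0.5, n)                 # unobserved confounder
# Z shifts P(A=1) by 0.4 regardless of U: the effect of Z on A is homogeneous.
A = rng.binomial(1, 0.2 + 0.4 * Z + 0.2 * U)
# True effect of A on B is +0.3; U confounds A and B.
B = rng.binomial(1, 0.1 + 0.3 * A + 0.3 * U)

naive = B[A == 1].mean() - B[A == 0].mean()   # biased by U
wald = (B[Z == 1].mean() - B[Z == 0].mean()) / (A[Z == 1].mean() - A[Z == 0].mean())

print(f"naive ~ {naive:.3f}, Wald ~ {wald:.3f}")  # Wald ~ 0.3
```

The naive contrast overstates the effect because \(U\) raises the probability of both \(A\) and \(B\), while the Wald ratio uses only the \(Z\)-induced variation.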

### 3.4.4 Generalizing the instrumental variables method

Figure 7: More examples of IV graphs. (a), (b), and (c) show scenarios where \(Z\) is a valid generalized instrumental variable, whereas (d) and (e) show invalid instrumental variables. (b) shows a common IV setting where \(A\) and \(B\) have observed confounders \(W\) in addition to unobserved ones.

While the canonical instrumental variables scenario, depicted in Fig. 6, presents a simple graph with only a few variables, we can extend these ideas to much more complex scenarios.

The simplest extensions include scenarios such as Fig. 7 (a). Here, we see that additional observed variables \(W_1\) and \(W_2\) confound the effects \(Z\rightarrow A\) and \(A\rightarrow B\). Neither of these additional variables, however, breaks our initial requirements: that \(Z\) and \(B\) are d-separated in \(G_{\operatorname{null}(A)}\); that \(Z\) and \(A\) are not d-separated in \(G\) and \(\operatorname{P}(A|\operatorname{do}(Z))\) is identifiable; and our assumptions regarding homogeneous effects.

More interesting are scenarios such as Fig. 7 (b). Here we see that the (observed) variable \(W\) does break our initial assumption of the independence of \(Z\) and \(B\). However, conditioning all of our analyses on \(W\) re-establishes the necessary d-separation requirements. Such an instrumental variable (IV) is called a conditional IV, and the modified requirements are:

- \(Z\) and \(B\) are d-separated in \(G_{\operatorname{null}(A)}\), conditional on \(W\).
- \(Z\) and \(A\) are not d-separated conditional on \(W\), and \(\operatorname{P}(A|\operatorname{do}(Z),W)\) is identifiable.
- The effects of \(Z\) on \(A\) and of \(A\) on \(B\) are homogeneous with respect to the unobserved variables \(U\).

where \(W\) is a conditioning set that does not include any descendants of \(B\).

Considering more carefully the role of the instrument \(Z\)—its purpose is to provide information about variation in \(A\) that is independent of the unobserved confound \(U\)—we recognize that \(Z\) need not actually be a cause of \(A\). Other relationships might also capture the variation in \(A\) necessary for our analysis. In Fig. 7 (c), we see one such example. Here, \(C\) is an unobserved cause of both \(Z\) and \(A\). Even though \(Z\) is not a cause of \(A\) in this graph, \(Z\) is not d-separated from \(A\), and will generally be correlated with \(A\). While this relaxation complicates our earlier proof of instrumental variables, it is generally valid to relax our second assumption:

- \(Z\) and \(A\) are not d-separated conditional on \(W\).

Many other extensions of instrumental variables scenarios have been explored, such as instrumental sets where a set of instrumental variables jointly enable identification, under assumptions of linearity, of the effects of multiple treatments on an outcome.

## 3.5 Identification Strategies in Practice

In this section, we describe some common situations that indicate that one or another identification strategy is likely to be useful. We look for: sources of randomness in the system and their relationship to the treatment or outcome; simplifying structure in causal relationships—perhaps from different levels of abstraction or temporal assumptions; and, when other approaches do not work, subproblems that may be easier to identify. Through these, we highlight several common, classic identification strategies, including encouragement designs, difference-in-differences, and regression discontinuities.

Recall that in Chapter 2, we noted that there is not necessarily a single correct graphical model representation of a system to be studied, but that we might use different models to answer different questions. Now that we have introduced a variety of identification strategies, we can begin to revisit this question of how best to model a system—what features to consider exogenous or endogenous to a model, and at what level of abstraction to model variables—to correctly answer a given causal inference question.

Throughout this section, we will assume our goal is to identify the causal effect \(\operatorname{P}(B|\operatorname{do}(A))\).

*Remainder of section to be released*

## 3.6 Summary

In this chapter, we introduced the basics of identification for causal inference questions, including complexities introduced by *homogeneous and heterogeneous effects*. Key to formally approaching identification of causal effects from observational data is the *\(\operatorname{do}()\) operator and do-calculus*. The three rules of do-calculus allow us to analyze a causal graph and find strategies—or prove that there are none—for calculating the effects we wish to know from the observations we are able to make.

Using do-calculus, we described the *adjustment formula*, a useful approach for calculating causal effects in many situations by adjusting for the confounds or other appropriate factors that make up a *valid adjustment set*. When graphical assumptions alone are insufficient, combining do-calculus with additional parametric assumptions, as in the case of *instrumental variables* scenarios, can result in successful identification.

## Chapter Notes

For further reading on causal reasoning and do-calculus, see Judea Pearl’s book, Causality (2009). Pearl’s blog, Causal Analysis in Theory and Practice, also has additional interesting materials, including a lovely “Crash Course in Good and Bad Controls” ^{3}.

Even in the case of a global treatment decision, we should strive to understand possible heterogeneous impacts well enough to ensure that we will not have unacceptable disparate impacts on subpopulations.↩︎

http://causality.cs.ucla.edu/blog/index.php/category/bad-control/↩︎
