# Chapter 2: Models and Assumptions

Conventional statistical and machine learning problems are data focused. While data is a critical part of causal reasoning, it is not the only part. Just as important is the external knowledge we bring: our prior knowledge of the data generation mechanism and assumptions about plausible causal mechanisms. In fact, it is this external knowledge that distinguishes causal reasoning from associational methods.

Capturing our external knowledge about mechanisms and assumptions is the first stage of any causal analysis. Formally representing external domain knowledge as models allows us to systematically reason about strategies for answering causal questions. We already saw one example of such external knowledge captured in our structural causal model of the influence of temperature on ice cream and swimming in the previous chapter (Fig. 1.5). This chapter focuses on the detailed mechanics and intuitions of these structural causal models and the assumptions they represent.

## 2.1 Causal Graphs

The primary language for modeling causal mechanisms and expressing our assumptions is the language of causal graphs. Causal graphs encode our domain knowledge about the causal mechanisms underlying a system or phenomenon under study. We begin by introducing the mechanics of causal graphs and demonstrating how they represent causal relationships. We first assume that we have complete knowledge of the causal graph. Later in this chapter, we relax this assumption and refine our use of causal graphs to represent more complex and ambiguous situations.

A causal graph is made up of two kinds of elements:

• Nodes represent variables or features in the world or system we are modeling. Without loss of generality, let us think of each node as representing something that is potentially observable, measurable, or otherwise knowable about a system.

• Edges connect nodes to one another. Each edge represents a mechanism or causal relationship that relates the values of the connected nodes. Edges are directed to indicate the flow of causal influence. For example, in Fig. 1 (a), a change in the value of node A will cause a change in B, but if we were to manipulate B, it would not cause a change in A. In cases where the direction of the influence is unknown, we will draw an undirected edge.

Causal graphs are assumed to be acyclic. That is, we cannot have a situation where A causes B and B causes A.1
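The acyclicity assumption can be checked mechanically. The following is a minimal sketch, assuming a graph is stored as a list of (cause, effect) pairs; the function name `is_acyclic` and this representation are illustrative choices, not part of the chapter. It uses Kahn's algorithm, which repeatedly removes nodes that have no remaining parents.

```python
from collections import deque

def is_acyclic(edges):
    """Return True if the directed graph given as (cause, effect) pairs has no cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for cause, effect in edges:
        children[cause].append(effect)
        indegree[effect] += 1
    # Kahn's algorithm: repeatedly remove nodes that have no remaining parents.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Every node gets removed exactly when the graph has no directed cycle.
    return removed == len(nodes)

print(is_acyclic([("A", "B")]))              # True: A causes B
print(is_acyclic([("A", "B"), ("B", "A")]))  # False: A causes B and B causes A
```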

Figure 1: A simple causal graph over two nodes. (a) A causes B. (b) Either A causes B or B causes A, but not both.

### 2.1.1 Reading a causal graph

Intuitively, an edge transfers information from one node to another in the direction of its arrow. Thus, causal effects flow through the graph along nodes connected by edges. This interpretation allows us to read a causal graph and answer many interesting and important questions about the underlying system.

To start with, we can ask whether a change in one node’s value might affect another. This is important when we want to, for example, check for potential side effects of some action or decision. We can ask whether a node’s value will stay the same as others change, which is important for identifying whether a metric we care about is stable. If we want to optimize for some target outcome, we can ask what nodes we can manipulate to cause the targeted node’s value to change. For example, in Fig. 2, we can see that changes in A will affect B, which will, in turn, affect D. We can also see that, since there are no directed paths leading from A to C, changes in A will not affect C. Similarly, since causal effect flows in only one direction, we can see that changes in C will affect B and D but not A, and changes in B will affect D, but not A or C. Changes in D will not affect any of the other nodes.

Causal relationships also flow transitively from edge to edge, through a directed path. In referring to such causal relationships, we often use familial notations. If A causes B, then A is a parent of B, and B is a child of A. All child nodes of B and, recursively, all of their children are known as descendants of B. Similarly, all parents of B and, recursively, all of their parents are known as ancestors of B. A node causally affects all its descendants and is affected by all its ancestors.
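The familial relationships can be computed by a simple graph traversal. The sketch below assumes the four-node example has the edges A → B, C → B, and B → D, as the description of Fig. 2 in the text suggests; the edge-list representation and the helper names are illustrative.

```python
# Edges of the four-node example as described in the text: A -> B, C -> B, B -> D.
EDGES = [("A", "B"), ("C", "B"), ("B", "D")]

def children_of(node, edges):
    return {effect for cause, effect in edges if cause == node}

def parents_of(node, edges):
    return {cause for cause, effect in edges if effect == node}

def descendants(node, edges):
    """All nodes reachable from `node` by following edges in their direction."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in children_of(current, edges):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def ancestors(node, edges):
    """Ancestors are the descendants in the graph with every edge reversed."""
    reversed_edges = [(effect, cause) for cause, effect in edges]
    return descendants(node, reversed_edges)

for n in "ABCD":
    print(n, "affects", sorted(descendants(n, EDGES)),
          "and is affected by", sorted(ancestors(n, EDGES)))
```

Reading the output for A, for instance, shows that A affects B and D but not C, matching the reading of the graph given above.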

### 2.1.2 Causal graphs and statistical independence

Causal graphs are not only intuitively easy to read; they also provide formal definitions that enable systematic reasoning about their properties. Fundamentally, a causal graph describes a non-parametric data-generating process over its nodes. By specifying independence and dependence between the nodes, the graph constrains the relationships between the generated variables corresponding to those nodes.

In particular, causal graphs provide information about statistical independence. Two nodes x and y are statistically independent if knowing the value of one node does not give information about the value of the other node. That is:
$$x \unicode{x2AEB} y \text{ iff } P(x) = P(x|y)$$
where $\unicode{x2AEB}$ is the symbol for statistical independence. We also often work with conditional independences, where two nodes x and y might only be statistically independent conditional on some other node z:
$$(x \unicode{x2AEB} y) |z \text{ iff } P(x|z) = P(x|y,z)$$
Statistical relationships correspond to a particular data distribution and should not be confused with causal relationships. Whereas causal relationships focus on whether manipulating one node’s value will cause a change in another node’s value, statistical relationships focus on whether knowing the value of a node provides information about the value of another node.

Figure 3: Causal graph structures over three nodes. (a) B is a collider for A and C. (b) B creates a fork to A and C. (c) B forms a chain from A to C.

To illustrate how a causal graph implies certain statistical independences, let us consider different graph structures over three variables. Fig. 3 shows the three important structures: collider, fork, and chain. The left subfigure shows a node B that is caused by both A and C. Such a node is called a collider, because the effects from two parents collide at the node. Importantly, no causal information transfers from A to C through B and thus A and C are statistically independent. The middle subfigure shows a fork structure where B causes both A and C. Here, the value of B determines the values of both A and C, thus A and C are not independent. However, the only source of statistical dependence between them is B. What if B is fixed at a certain value? In that case, A and C will be statistically independent. In other words, A and C are independent conditional on B. Finally, the right subfigure shows a chain. A chain implies a single direction of flow of causality from A to B, and from B to C. Any causal information from A to C must flow through B. Thus, similar to the fork structure, A and C are conditionally independent given B.
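These three patterns can be observed in simulated data. The sketch below is only an illustration under strong assumptions that the text does not make: every mechanism is linear with unit coefficients and independent standard Gaussian noise, so that (partial) correlation is a valid stand-in for (conditional) dependence. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def noise():
    return rng.normal(size=n)

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return corr(rx, ry)

# Collider: A -> B <- C
a, c = noise(), noise()
b = a + c + noise()
collider = (corr(a, c), partial_corr(a, c, b))

# Fork: A <- B -> C
b = noise()
a, c = b + noise(), b + noise()
fork = (corr(a, c), partial_corr(a, c, b))

# Chain: A -> B -> C
a = noise()
b = a + noise()
c = b + noise()
chain = (corr(a, c), partial_corr(a, c, b))

for name, (marginal, given_b) in [("collider", collider), ("fork", fork), ("chain", chain)]:
    print(f"{name:8s} corr(A, C) = {marginal:+.2f}   corr(A, C | B) = {given_b:+.2f}")
```

In this setup the collider should show a marginal correlation near zero and a clearly nonzero correlation once B is regressed out, while the fork and the chain should show the opposite pattern.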

The three basic structures can be extended to determine statistical independence in any graph. Here we use the concept of an undirected path between any two nodes, defined as a set of contiguous edges connecting the two nodes. To determine statistical independence between two nodes, we consider all undirected paths between two nodes and test whether the paths have any of these structures. Note that a collider is the only structure that leads to statistical independence without conditioning. Thus, two nodes are independent if all undirected paths between them contain a collider. Of course, they are also independent if there is no undirected path connecting them.

Nodes that lead to statistically independent variables are said to be d-separated from each other.

d-separation: Two nodes in a causal graph are d-separated if either there exists no undirected path that connects them, or all paths connecting them contain a collider.

Conditional independence, however, is not as straightforward. From the other two base structures of fork and chain, we saw that conditioning on B makes A and C independent. However, the collider structure shows the reverse property. Conditioned on the collider B, A and C become dependent. This is because knowing the value of a collider and one of its parents tells us something about the other parent. Consider the example of a spam system that classifies an email as spam only if two conditions are satisfied: it contains the phrase “please send money” (A), and the email is too long (C). In general, these two variables are independently generated: longer emails can ask for money, but so can shorter emails. Knowing that the email is long tells us nothing about whether it contains an ask for money. However, once we know that an email was classified as spam, we can determine that both A and C were true. If it was not classified as spam, then knowing A=True reveals that C=False, and vice-versa. Thus, by fixing the value of a collider or conditioning on it, we render its parents dependent with respect to each other. Based on the above discussion, conditional independence or d-separation requires the conditioning variable to form a fork or chain along all paths between the two nodes, but also requires that it does not form a collider on any path.

Conditional d-separation: Two nodes in a causal graph are conditionally d-separated on another node B if either they are d-separated, or all undirected paths connecting them contain B as a fork or a chain, but not as a collider.

Using the above definitions, we can now read statistical independence and conditional statistical independence relationships from a causal graph. For example, in Fig. 2, we can see that $A \unicode{x2AEB} C$ as they have no common causes, whereas all other pairs of nodes are statistically dependent. It is also possible to read conditional independences from a graph. In Fig. 2 again, $(D \unicode{x2AEB} A) | B$ since B is the central node of the chain connecting A and D. That is, once we know the value of B, we know everything we can know about D and, in particular, knowing the value of A does not change our belief in what D might be.
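Reading independences off a graph can also be automated. The function below is a sketch, not the chapter's own procedure: it enumerates all undirected paths between two nodes and applies the usual blocking rules. It follows the standard, slightly more general form of d-separation, in which we may condition on a set of nodes and a collider also stops blocking a path when one of its descendants is conditioned on. The edge list for Fig. 2 (A → B, C → B, B → D) is inferred from the text.

```python
def d_separated(edges, x, y, given=()):
    """Check whether x and y are d-separated given a set of nodes.

    edges: iterable of (cause, effect) pairs describing a DAG.
    Enumerates simple paths, so it is only meant for small graphs.
    """
    edges = set(edges)
    given = set(given)
    neighbors = {}
    for cause, effect in edges:
        neighbors.setdefault(cause, set()).add(effect)
        neighbors.setdefault(effect, set()).add(cause)

    def descendants(node):
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            for cause, effect in edges:
                if cause == current and effect not in found:
                    found.add(effect)
                    stack.append(effect)
        return found

    def blocked(path):
        # Inspect every interior node of the path together with its two neighbors.
        for prev, node, nxt in zip(path, path[1:], path[2:]):
            is_collider = (prev, node) in edges and (nxt, node) in edges
            if is_collider:
                # A collider blocks the path unless it, or a descendant, is conditioned on.
                if node not in given and not (descendants(node) & given):
                    return True
            elif node in given:
                # A fork or chain node blocks the path once we condition on it.
                return True
        return False

    def simple_paths(current, path):
        if current == y:
            yield path
            return
        for nxt in neighbors.get(current, ()):
            if nxt not in path:
                yield from simple_paths(nxt, path + [nxt])

    return all(blocked(p) for p in simple_paths(x, [x]))

FIG2 = [("A", "B"), ("C", "B"), ("B", "D")]
print(d_separated(FIG2, "A", "C"))           # True: B is a collider on the only path
print(d_separated(FIG2, "A", "C", {"B"}))    # False: conditioning on the collider
print(d_separated(FIG2, "A", "D"))           # False: open chain A -> B -> D
print(d_separated(FIG2, "A", "D", {"B"}))    # True: B blocks the chain
```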

The above definitions also apply to any set of nodes rather than individual nodes. The concept of conditional independence is useful in choosing which nodes to intervene on in order to affect another node. It implies that, conditioned on its parents, a node is independent of all its other ancestors. Thus, knowing the values of all ancestor nodes conveys no more information about a node than knowing the values of its parents. This knowledge can also be useful to design predictive models that generalize beyond the training distribution, as discussed in Chapter 13.

### 2.1.3 Causal graph and resulting data distributions

Since the graph only specifies the direction of effect and not its magnitude, shape or interactions, multiple data-generating processes and thus multiple data probability distributions are compatible with the same causal graph.

Formally, a causal graph specifies a factorization of the joint probability distribution of data. Any probability distribution consistent with the graph needs to follow the specific factorization. For instance, for the causal graph in Fig. 2, we can write,
$$\begin{split} P(A, B, C, D) &= P(D|A,B,C)P(B|C,A)P(C|A)P(A) \\ &= P(D|B)P(B|C,A)P(C)P(A) \end{split}$$
where the first line follows from the chain rule of probability and the second line follows from the structure of the causal graph. As discussed above, A and C are independent based on the graph. In addition, B blocks the directed paths from A and C to D, so D is independent of A and C given B. More generally, for any causal graph 𝒢 over variables V1, V2, ..., Vm, the probability distribution of data is given by,
$$\label{eq:graph-factorization} \begin{split} P(V_1, V_2, \ldots, V_m) = \prod_{i=1}^{m} P(V_i|Pa_{\mathcal{G}}(V_i)) \end{split}\qquad(1)$$
where Pa𝒢(Vi) refers to parents of Vi in the causal graph 𝒢. Note that the above factorization and resultant independence constraints have to be satisfied by every probability distribution generated from the graph. Therefore, independence in the graph implies statistical independence constraints across all probability distributions.
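To make the factorization concrete, the sketch below builds one joint distribution for the graph of Fig. 2 out of its four factors and then checks the implied independences numerically. It assumes binary nodes, and all probability values are hypothetical numbers chosen for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables for binary nodes (values 0/1).
p_a = {1: 0.3, 0: 0.7}                                                # P(A)
p_c = {1: 0.6, 0: 0.4}                                                # P(C)
p_b1_given_ac = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(B=1 | A, C)
p_d1_given_b = {0: 0.2, 1: 0.8}                                       # P(D=1 | B)

def bernoulli(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

# Joint distribution as the product P(D|B) P(B|A,C) P(C) P(A).
joint = {}
for a, b, c, d in product([0, 1], repeat=4):
    joint[(a, b, c, d)] = (
        bernoulli(p_d1_given_b[b], d)
        * bernoulli(p_b1_given_ac[(a, c)], b)
        * p_c[c]
        * p_a[a]
    )

def prob(**fixed):
    """Marginal probability of an assignment such as prob(A=1, C=0)."""
    names = ("A", "B", "C", "D")
    return sum(
        p for values, p in joint.items()
        if all(values[names.index(k)] == v for k, v in fixed.items())
    )

# A and C are independent: P(A=1, C=1) equals P(A=1) P(C=1).
print(prob(A=1, C=1), prob(A=1) * prob(C=1))
# D is independent of A given B: P(D=1 | A=1, B=1) equals P(D=1 | B=1).
print(prob(A=1, B=1, D=1) / prob(A=1, B=1), prob(B=1, D=1) / prob(B=1))
```

Any other choice of numbers in the four tables gives a different joint distribution that satisfies the same two independences, which is the sense in which they hold for every distribution generated from the graph.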

Additionally, it is possible that some data distributions factorize further and include more independences. An edge between two nodes in a causal graph conveys the assumption that there may exist a relationship between them, but not all data distributions may necessarily follow it. Since there are multiple distributions possible, in some distributions the effect between the nodes goes to zero. For instance, while A and B are connected via an edge in Fig. 2, there can be a dataset where A’s effect on B is zero. As another example, in the ice cream and swimming causal graph from Fig. 1.5, we will find that the effect of ice-cream on swimming is zero, even though the causal graph includes an edge. Thus, including an edge reflects the possibility of a causal relationship, but does not confirm it. The effect of one node on another in a causal graph can also cancel through multiple interacting effects. For example, based on the graph from Fig. 2, there can be a distribution where A and D are independent if the effect of B on D directly cancels A’s effect on B. Exact cancellations of this kind are often assumed to be implausible.

Another implication of a causal graph is the specific factorization of the joint probability of data. While other factorizations are valid too for a given dataset, a factorization consistent with Eq. 1 is more likely to generalize to changes in data distribution. That is, we can consider individual conditional probability factors as independent mechanisms. In any dataset generated from the graph in Fig. 2, if P(A) changes, we expect P(B) to change too, but do not expect the causal relationship between them P(B|A) to change. However, if we consider any other factorization, e.g., P(A|B)P(B), then changing P(A) will change P(B), but also necessarily change P(A|B). This is because P(A|B) = P(B|A)P(A)/P(B). Knowing that P(B|A) is invariant across distributions, P(A|B) depends directly on P(A). The invariance of causal relationships across different distributions underscores the generalization benefit of a causal graph: P(B|A), once estimated from a single data distribution, is expected to stay the same for all distributions consistent with the graph.
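A small numeric sketch can illustrate this asymmetry. For simplicity it assumes only two binary nodes with A → B (ignoring B's other parent in Fig. 2), and the probabilities are hypothetical. The causal factor P(B|A) is held fixed while P(A) is changed between two environments.

```python
# Causal mechanism P(B=1 | A), held fixed across environments (hypothetical values).
P_B1_GIVEN_A = {0: 0.2, 1: 0.9}

def describe(p_a1):
    """Return P(B=1) and P(A=1 | B=1) for a given P(A=1)."""
    p_b1 = p_a1 * P_B1_GIVEN_A[1] + (1 - p_a1) * P_B1_GIVEN_A[0]
    p_a1_given_b1 = P_B1_GIVEN_A[1] * p_a1 / p_b1   # Bayes' rule
    return p_b1, p_a1_given_b1

for p_a1 in (0.3, 0.7):
    p_b1, p_a1_given_b1 = describe(p_a1)
    print(f"P(A=1)={p_a1:.1f}  P(B=1)={p_b1:.2f}  P(A=1|B=1)={p_a1_given_b1:.3f}")
```

Both P(B) and the anti-causal factor P(A|B) differ between the two environments, even though the mechanism P(B|A) was never touched.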

At this point, it is useful to compare causal graphs to probabilistic graphical models. While a probabilistic graphical model also offers a graphical representation of conditional independences, such a graph corresponds only to a particular data distribution. A causal graph, in contrast, represents invariant relationships that are stable across data distributions. These relationships are expected to hold for all configurations of an underlying system. Causal graphs thus provide a concise way to describe key, invariant properties of a system.

### 2.1.4 Key Properties

It is satisfying to note that once we have formulated our domain knowledge about possible causal relationships in the form of a graph, we can reason about causal relationships between any pair of nodes in our graph without appealing to additional domain knowledge. That is, the graph itself captures all the required information for determining which nodes, if manipulated, might affect which others. Below we emphasize key properties of the causal graph.

#### The assumptions asserted by a causal graph are encoded by the missing edges in a graph, and the direction of edges

It would be easy to believe that we are making an assumption about the existence of a causal mechanism when we draw an edge between two nodes. However, the edge itself does not represent an assumption! That an edge exists says nothing about the shape or size of the causal influence of one node on another; that causal influence could be vanishingly small or even 0! Thus, the existence of an edge—especially an undirected edge—does not represent a constraint on the underlying mechanisms. In contrast, the lack of an edge between two nodes is a much stronger assumption, as it is asserting that the direct causal influence is truly 0.

Fig. 4 illustrates 3 causal graphs that encode increasingly more assumptions. Of these illustrations, Fig. 4 (a) encodes the fewest assumptions. The single assumption is that the left nodes cause the right nodes (or more precisely, that the right nodes do not cause the left nodes). Note however, that nothing is assumed about the relationship among the two left nodes; and nothing is assumed about the relationship among the two right nodes. By removing edges and adding directionality to another edge, Fig. 4 (b) adds several additional assumptions: that the top-left node causes the bottom-left node and that only the bottom-left node influences the right nodes. Fig. 4 (c) makes the strongest assumptions on the underlying graph.

When is it preferable to use models that make more assumptions or fewer assumptions about underlying causal mechanisms? Generally speaking, when creating a causal graph, we should strive to encode as much of our domain knowledge as possible within the graph. If we know for certain through external knowledge that there is no direct causal relationship between two nodes, then we have no reason to add such an edge and, in fact, many of our computations and analyses will become only more difficult or the results more ambiguous if we do.

Figure 4: Causal graphs with more edges encode fewer assumptions on the underlying causal mechanisms. Here Fig. 4 (c) encodes the most assumptions. (a) Few assumptions. (b) More assumptions. (c) Many assumptions.

#### Causal relationships correspond to stable and independent mechanisms

A system—whether a computer system, a mechanical system, a social system, or other—consists of many mechanisms, interacting with each other to create the integrated behavior of the system as a whole. These mechanisms are often independent and stable. That is, we can replace or change one of these mechanisms without replacing others. The other mechanisms remain the same, though the system as a whole will change with the new integrated behavior.

Appealing to this intuition of stable and independent mechanisms, we often assume that the components of the causal graph (in particular, the unrelated edges in the graph) represent distinct stable and independent mechanisms of the underlying system being modeled. That is, if we make some change to how the world works (perhaps we upgrade a software component, or we change the mechanics of a physical system), we can change how A influences B without changing how B influences D, or vice-versa.

#### Causal graphs cannot be learned from data alone

We declared earlier that inferring causality requires external knowledge (some information about the underlying system mechanics or the data generating process) beyond the raw data itself. Why is this? Every causal graph implies a set of testable implications: the conditional statistical independences, introduced above in Section 2.1.2. However, not every causal graph implies a unique set of independence tests. Every causal graph has an equivalence class of graphs that imply the same independence tests.

Consider the two graphs in Fig. 5. In Fig. 5 (a), we can read only a single independence from the graph: $(C \unicode{x2AEB} A) | B$. That is, once we know the value of B, knowing the value of A will not give us any additional knowledge about the value of C. This graph implies many other causal assumptions, but only the conditional statistical independence tests are testable given data.

In Fig. 5 (b), we see a very different causal graph. From a practical standpoint, making a decision using this causal graph rather than the first would be very different. In Fig. 5 (a), if we manipulate B, we do not expect that A will change. In contrast, in Fig. 5 (b), if we manipulate B, we do expect that A will change.

Despite such differences in the causal implications of these graphs, when we look for testable statistical independences, we can only find one, that $(C \unicode{x2AEB} A) | B$, the same test as for the other graph. The implication is that any dataset that satisfies the testable assumptions of Fig. 5 (a) will also satisfy the testable assumptions of Fig. 5 (b). As a result, if we want to know which causal graph is correct, we cannot rely solely on the raw data, but must bring some of our own domain knowledge to bear.
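The point can be checked exactly for linear-Gaussian models. The sketch below assumes, consistently with the description above but without access to the figure, that Fig. 5 (a) is the chain A → B → C and that Fig. 5 (b) is a graph in which B causes A, here the fork A ← B → C. It also assumes linear mechanisms with independent unit-variance Gaussian noise and hypothetical coefficients. Under these assumptions, A and C are marginally independent exactly when their covariance entry is zero, and independent given B exactly when their entry in the precision (inverse covariance) matrix is zero.

```python
import numpy as np

def precision_and_covariance(weights):
    """Linear SEM x = W x + noise with independent unit-variance noise.

    weights[i, j] is the direct effect of node j on node i.
    """
    identity = np.eye(weights.shape[0])
    precision = (identity - weights).T @ (identity - weights)
    covariance = np.linalg.inv(precision)
    return precision, covariance

A, B, C = 0, 1, 2
w1, w2 = 0.8, -0.5   # hypothetical edge coefficients

chain = np.zeros((3, 3)); chain[B, A] = w1; chain[C, B] = w2            # A -> B -> C
fork = np.zeros((3, 3)); fork[A, B] = w1; fork[C, B] = w2               # A <- B -> C
collider = np.zeros((3, 3)); collider[B, A] = w1; collider[B, C] = w2   # A -> B <- C

results = {}
for name, weights in [("chain", chain), ("fork", fork), ("collider", collider)]:
    precision, covariance = precision_and_covariance(weights)
    results[name] = {
        "A indep C": bool(np.isclose(covariance[A, C], 0.0, atol=1e-12)),
        "A indep C given B": bool(np.isclose(precision[A, C], 0.0, atol=1e-12)),
    }
    print(name, results[name])
```

The chain and the fork produce the same pattern of independences, so observational data alone cannot tell them apart. The collider is included only as a contrast, since it implies a different, testable pattern.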

Figure 5: These two graphs are in the same equivalence class. That is, any data set that can be described with one of these models can also be described with the other.

We will discuss statistical independence tests and these equivalence classes in more detail in Chapter 5.

#### Causal graphs are a tool to help us reason about a specific problem

Finally, we want to briefly emphasize the intuition that there is not necessarily a single, “correct” causal graph representation of any given system. A causal graph should, of course, correspond with the true causal mechanisms that drive the system being analyzed. However, questions of abstractions and fidelity, exogeneity, measurement practicalities, and, of course, the overarching purpose of an analysis, mean that many different models of a system can be considered valid under different purposes and circumstances.

## 2.2 Structural Equations, Noise, and Unobserved Nodes

The causal graph is a simplified representation that captures much of the key information about a system but, like all abstractions, also leaves many details out. In this section, we present how structural equations can represent details of the functional relationships represented by edges of a graph; and discuss the importance of noise and unobserved nodes. Causal graphs and structural equations together form the Structural Causal Model (SCM) framework of causal reasoning.

### 2.2.1 Structural Equations

One key piece of information that is not included in the representation of the graph is the functional relationship between nodes. While the existence of an edge between two nodes in Fig. 1 (a) tells us that there is some functional relationship between A and B, it does not tell us the shape or magnitude of the effect. The fact that the graph does not represent this functional relationship means that we cannot, in Fig. 1 (a), tell whether an increase in A will cause an increase or decrease in B. In more complicated scenarios where multiple nodes influence others, we cannot tell how the values of these nodes interact.2 For example in Fig. 2, where edges from both A and C lead to B, the causal graph alone does not tell us how these nodes might interact, or if they interact at all. It is possible that the effect of C on B is the same regardless of the value of A (no interaction). It is also possible that C affects B differently depending on A’s value.

To augment causal graphs with a stronger characterization of the functional relationships between nodes, we often use structural equation models (SEM). Each equation represents a causal assignment from the right-hand-side of the equation to the left. Eqns. 2, 3 show general structural equations for Figs. 1 (a), 2 respectively.3
$$\begin{array}{rcl} B & \leftarrow & f(A) \end{array}\qquad(2)$$

$$\begin{array}{rcl} D & \leftarrow & f_1(B) \\ B & \leftarrow & f_2(A,C) \end{array}\qquad(3)$$
While these equations allow any form of function f(), we can easily indicate specific functional forms. For example, Eq. 4 shows an alternative SEM for the graph of Fig. 2 where the causal relationships are linear, and the effects of A and C on B do not interact with each other.
$$\begin{array}{rcl} D & \leftarrow & \alpha_1*B \\ B & \leftarrow & \alpha_2*A + \beta_2*C \end{array}\qquad(4)$$
Note that the above equations are different from purely statistical models such as linear regressions, even though the equations are deceptively similar. Structural equations imply a causal relationship, whereas conventional equations provide no such implication. In a regression model, it is equally valid to regress y on x, as it is to regress x on y. In contrast, a structural equation can only be written in one direction, the direction of causal relationship as specified by a causal graph. Further, a structural equation only includes causes of y in the RHS whereas a linear regression may include all known variables, including children of y that can be useful for prediction. In some cases, it is possible that parameters of a structural equation are estimated using linear regression, but the two models still retain their independent characteristics.
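The difference between a structural equation and a regression can be illustrated by simulating Eq. 4 and then intervening on B. This is a sketch under assumptions the text does not make: the coefficient values are hypothetical, A and C are drawn as independent standard Gaussians, and the `do_b` argument is an illustrative way of replacing B's equation with a fixed value.

```python
import numpy as np

ALPHA_1, ALPHA_2, BETA_2 = 2.0, 1.5, -1.0   # hypothetical coefficients for Eq. 4

def simulate(n, rng, do_b=None):
    """Draw n samples from the linear SEM of Eq. 4; do_b overrides B's equation."""
    a = rng.normal(size=n)                  # exogenous
    c = rng.normal(size=n)                  # exogenous
    b = ALPHA_2 * a + BETA_2 * c if do_b is None else np.full(n, float(do_b))
    d = ALPHA_1 * b
    return a, b, c, d

def slope(x, y):
    """Least-squares slope from regressing y on x."""
    return float(np.polyfit(x, y, 1)[0])

rng = np.random.default_rng(1)
a, b, c, d = simulate(50_000, rng)
slope_b_on_a = slope(a, b)   # close to ALPHA_2, matching the structural equation
slope_a_on_b = slope(b, a)   # also nonzero, although B does not cause A

# Intervening on B: A keeps its values, while D follows the new B.
a0, _, _, d0 = simulate(50_000, np.random.default_rng(2), do_b=0.0)
a5, _, _, d5 = simulate(50_000, np.random.default_rng(2), do_b=5.0)

print("regress B on A:", round(slope_b_on_a, 2))
print("regress A on B:", round(slope_a_on_b, 2))
print("A unchanged by do(B):", bool(np.allclose(a0, a5)))
print("mean of D under do(B=0), do(B=5):", d0.mean(), d5.mean())
```

Both regressions return a nonzero slope, but only the direction written in the structural equation predicts what happens under manipulation: setting B changes D and leaves A untouched.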

### 2.2.2 Noisy models

Any model, whether a causal graph or a set of structural equations, will necessarily represent (at best) a simplified version of the most important causal factors and relationships controlling a system’s behavior. To account for the many minor factors influencing system behavior, it is common practice to introduce an ϵ noise term into our structural equations:
$$\label{ch02-noisy-sem} \begin{array}{rcl} D & \leftarrow& \alpha_1*B + \epsilon_D \\ B & \leftarrow& \alpha_2*A + \beta_2*C + \epsilon_B \end{array}$$

Fig. 6 shows the causal graph representation of these ϵ noise terms. For simplicity of representation, these noise terms are not usually drawn in the causal graph, but are simply assumed to exist. Note that to avoid compromising the causal relationships implied by our original (non-noisy) causal model, the noise factors that influence each node must be independent of one another. If, for some reason, we believe that two nodes in a causal graph are subject to correlated noise, we must instead explicitly represent that in the graph.

Structural equations can be considered as an alternative representation of the factorization of probability distributions mentioned earlier. Assuming the noise terms have zero mean, the noisy structural equations above can be written in terms of expectations as:
$$\begin{split} \mathbb{E}[D|B] &= \alpha_1*B \\ \mathbb{E}[B|A, C]&= \alpha_2 *A + \beta_2 *C \end{split}$$
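As a quick check of this correspondence, the sketch below simulates the noisy linear structural equations and recovers the coefficients by least squares. The true coefficient values, the Gaussian distributions of A and C, and the noise scale are all hypothetical choices; the only property relied on is that the noise terms are independent and have zero mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
alpha_1, alpha_2, beta_2 = 2.0, 1.5, -1.0       # hypothetical true coefficients

a = rng.normal(size=n)
c = rng.normal(size=n)
eps_b = rng.normal(scale=0.5, size=n)           # independent, zero-mean noise terms
eps_d = rng.normal(scale=0.5, size=n)
b = alpha_2 * a + beta_2 * c + eps_b
d = alpha_1 * b + eps_d

# E[B | A, C] is linear in A and C, so least squares recovers alpha_2 and beta_2.
coef_b, *_ = np.linalg.lstsq(np.column_stack([a, c]), b, rcond=None)
# E[D | B] is linear in B, so least squares recovers alpha_1.
coef_d, *_ = np.linalg.lstsq(b[:, None], d, rcond=None)

print("estimated alpha_2, beta_2:", np.round(coef_b, 2))
print("estimated alpha_1:", np.round(coef_d, 2))
```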

### 2.2.3 Unobserved Nodes

Often, when we create a causal model of a system, we will not be able to observe all of the nodes in it. The values of some nodes may be completely unobserved or latent. This can happen if we do not know how to measure the value of a node, the node is too expensive to measure, or if a piece of data is too private or otherwise confidential. Depending on the causal structures of the system, we can still often address our specific task or question through computations over the nodes whose values are observed. To indicate an unobserved node in a causal graph, we mark its outline as a dashed line (Fig. 7).

In many situations, nodes may be partially observed or missing. For example, if a node value is expensive to collect, we may only sample it for a small number of our data entries. Or, our data collection might be systematically biased in some way. In such cases, we can model the missingness mechanism in the causal graph itself. By modeling the causes of partial observation, we will be able to directly analyze why data might be missing and understand the biases present in our observed data.

We can model the missingness mechanism in the causal graph itself by the following simple manipulation of the graph, as shown in Fig. 8.

1. We split the partially observed node, Z, into two nodes, one of which is the true value Z but is completely unobserved.

2. The second node, Z* is observed, and caused by Z and a new missingness indicator RZ.

3. The missingness indicator controls the value of Z*. If RZ = 1, the value of Z is revealed as Z*. Otherwise, if RZ = 0, the value of Z is not revealed and Z* takes a null or empty value.

4. If data is observed or sampled at random, then the missingness indicator RZ is an independent node, unconnected to the rest of the causal graph. If RZ is systematically influenced by other factors in the system, then we draw the appropriate causal connections. In Fig. 8, RZ is caused by C.
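The steps above can be sketched in a small simulation that contrasts a missingness indicator that is an independent node with one that is caused by C. Several details are assumptions added for illustration and are not taken from Fig. 8: that C also influences Z, the linear and logistic functional forms, and the use of NaN as the null value of Z*.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

c = rng.normal(size=n)
z = 2.0 * c + rng.normal(size=n)          # true Z; the C -> Z edge is an illustrative assumption

def observe(z, r_z):
    """Z* reveals Z where R_Z = 1 and is NaN (null) where R_Z = 0."""
    return np.where(r_z == 1, z, np.nan)

# Case 1: R_Z is an independent node (values are missing at random).
r_random = rng.binomial(1, 0.5, size=n)
z_star_random = observe(z, r_random)

# Case 2: R_Z is caused by C (larger C makes Z more likely to be recorded).
p_observe = 1.0 / (1.0 + np.exp(-2.0 * c))
r_biased = rng.binomial(1, p_observe)
z_star_biased = observe(z, r_biased)

true_mean = z.mean()
mean_random = np.nanmean(z_star_random)
mean_biased = np.nanmean(z_star_biased)
print(f"true mean of Z:            {true_mean:+.2f}")
print(f"mean of Z*, R_Z random:    {mean_random:+.2f}")
print(f"mean of Z*, R_Z caused by C: {mean_biased:+.2f}")
```

When R_Z is independent, the observed values give an approximately unbiased picture of Z. When R_Z depends on C, and C also affects Z, the observed mean is shifted, which is the kind of bias that the explicit missingness node lets us reason about.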

This manipulation can also be generalized to represent other systematic biases in values, beyond missing values. In such cases, the control node, RZ in Fig. 8, would no longer be a missingness indicator, but rather a general bias indicator, and Z* then a systematically biased version of Z.

## 2.3 Where does a Causal Graph Come From?

### 2.3.1 Creating a Causal Graph

When we are using a causal graph to reason about causal mechanisms, we assume that the causal graph captures everything that is important and relevant to the problem we are studying.4 When we are trying to decide what is important and relevant, we can think about it in several stages:

Core factors related to the question: First, we consider the question we are trying to answer through our analysis of the graph. For example, if we are using our analysis to guide decision-making (i.e., whether to take a specific action), we will want our causal graph to include the action and its possible effects or outcomes, as well as other factors that influence the likelihood of those outcomes. When analyzing a particular dataset of actions and outcomes, we also must include the factors that influenced the likelihood of action.

Adding additional relevant factors: Secondly, we should look at the factors that we have decided are relevant to our task, and consider whether any of them have shared causes. If so, we might include them as well. The decision to include them can be based on how important it is to capture the fact that the given factors are correlated with one another. When we decide that it is not important to model the causes of some node in a causal graph, we will call these exogenous nodes. If a node is determined entirely by nodes within the causal graph, we will call these endogenous nodes.

Removing static factors: Finally, we consider what is static and what is dynamic across the scenarios and environments we intend to represent with our causal graph. If some factor that is otherwise relevant to our causal question never changes, we will leave it out. For example, when analyzing the effect of a new recommendation policy in an online store, we may know that the effect of the policy depends on some basic societal and economic factors, such as the availability of Internet, electricity, and a stable monetary system. If we believe that the availability of these factors will not vary within the scope of our task, we can safely leave these out.

Decisions about whether a factor can be left out can be iterative. Often we will choose to begin with a simpler model and add in additional factors to improve the precision and accuracy of our model. For example, after building and experimenting with a simple model of the effect of recommendations, we might add in additional factors, such as whether users are viewing recommendations on mobile devices, tablets, or PCs, to help us better capture subtler effects.

Careful readers will note that, in this section, we have been referring to the “factors” that are relevant to a given question or system. Often, when we are first designing a causal graph, we will focus our thoughts on more abstract concepts and factors and then, only later, determine what specific measures in an experiment or features in a dataset can be used to represent those abstract factors.

### 2.3.2 Examples

In this section, we discuss three example scenarios, including toy causal models, the assumptions and modeling choices they represent, and how they might be extended and refined.

#### Example 1: Online product purchases and recommendations

Consider an online store that sells many products. Interest in each product may be driven by a number of product-specific factors, such as the quality of the product, product reviews, marketing campaigns, as well as cross product factors such as seasonality or brand reputation. There may be inherent complementarity or substitution among some products. For example, a person buying cookies may also become interested in buying milk. So, if a marketing campaign increases interest in cookies, it may also indirectly drive an increased interest in milk. Beyond these inherent relationships, the store, on each product’s web page, recommends several related products to people, thus potentially increasing interest for the recommended products. Recommendations are made based on some policy that the store might change. What if we want to better understand product browsing behavior under various recommendation policies? I.e., is one recommendation policy more effective than another? Fig. 9 shows one causal graph that models the influences on aggregate product browsing behavior (see footnote 5).

The basic modeling choices we make in constructing our causal model simplify our analysis tasks by limiting the factors we will consider. In making these choices, understanding the domain is critical to designing a model that is tailored to addressing a specific task while capturing all relevant factors.

Some of the choices we made when designing this model are explicit in the graph itself. For example, we chose to declare that the factors that influence product interest can be abstractly represented by a single cross-product demand factor and by many product-specific demand factors (exactly one per product). We are assuming that the various product-specific demand factors, such as price, and the shared demand factors, such as brand reputation, are not affected by product interest. And, we are assuming that the first product P0 is not affected by recommendations from other products (in fact, we are not modeling any recommendations from other products Pi at all).

Another modeling choice clearly stated in the graph is the set of exogenous vs. endogenous nodes. As our question was focused on the recommendation policy, these demand factors are represented as exogenous nodes. If our questions were focused on manipulating these demand factors (e.g., experimenting with new marketing campaigns), then we would want to add the causes of these demand factors into our causal graph.
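The exogenous/endogenous split can be read mechanically off a graph once it is written down. The sketch below is only an illustration: Fig. 9 is not reproduced here, so the edge set, the node names (`policy`, `D_shared`, `D_i`, `P_i`), and the helper `expand_plate` are assumptions inferred from the prose, not the book's exact figure. It expands the plate into N copies and then lists the nodes that have no modeled causes.

```python
def expand_plate(n_products):
    """Map each node to its parents (hypothetical reading of Fig. 9)."""
    parents = {
        "policy": [],     # recommendation policy; its causes are not modeled
        "D_shared": [],   # cross-product demand factor
        "D_0": [],        # product-specific demand factor for P_0
        "P_0": ["D_shared", "D_0"],  # P_0 receives no recommendation edges
    }
    # The plate: D_i and P_i are repeated once per recommended product.
    for i in range(1, n_products + 1):
        parents[f"D_{i}"] = []
        parents[f"P_{i}"] = ["D_shared", f"D_{i}", "P_0", "policy"]
    return parents


def exogenous(parents):
    """Nodes whose causes we chose not to model (no parents in the graph)."""
    return sorted(node for node, ps in parents.items() if not ps)


def endogenous(parents):
    """Nodes that have at least one cause inside the graph."""
    return sorted(node for node, ps in parents.items() if ps)


graph = expand_plate(n_products=2)
print("exogenous: ", exogenous(graph))
print("endogenous:", endogenous(graph))
```

If the question shifted to marketing campaigns, we would give the relevant demand node a non-empty parent list, and it would move from the first group to the second.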

Other choices are not represented in the graph, but are more subtly included within the definition of the nodes. For example, our product interest nodes are aggregated over all people browsing at an online store. In turn, our demand factors also represent factors that influence demand at an aggregate or population level. Alternatively, we could also have chosen to model product interest at an individual level. In that case, we probably would have also included more attributes about an individual, allowing us to study interactions between demand factors and individuals.

Our model also does not allow for the possibility of change in influence over time. Would the novelty of a recommendation system initially drive more interest in recommended products, but then fade over time? This particular model would not allow us to capture or analyze such changes in effect. In addition, we do not consider the relationship to other pages that may also show recommendations. For example, P0 itself may be a recommended product on the page of some other product P2, in which case P0’s browsing events are partially caused by the recommendation system too.

As we work with a model and refine it over time, we can revisit these design choices. We might split out demand factors in more detail, include individual-level information in our graph, or explicitly model time. How we refine a model will depend on how our understanding of the domain evolves as we gain experience, how our core questions and goals change, and the practical limitations of our experimental setting or data gathering framework.

#### Example 2: Energy conservation in a data center

Consider a data center containing thousands of servers. One key challenge in data centers is to minimize energy consumption while meeting stated performance objectives. One way to do so is to put idle servers into low power mode or turn them off. However, if demand exceeds availability, then the idle servers need to be restarted, which introduces a delay in the system. We thus want to maximize the energy savings while minimizing the delay due to unavailability.

Fig. 10 shows a causal graph for this system. Let us assume that there is a predictive system that decides the number of servers to keep on based on a prediction of the load due to customers’ requests in the next time-step. This prediction uses the load at the previous time-step, $L_{t-1}$, and the number of idle servers at the previous time-step, $I_{t-1}$, to decide the number of servers that are turned on at time $t$, $M_t$. Then we observe the new load $L_t$, which along with $M_t$ determines the number of servers that need to be restarted, $R_t$. The number of restarted servers $R_t$ and the number of running servers $M_t$ together determine the cost of the predictive system, $C_t$. Note that the loads at consecutive time-steps share a common cause, customers’ request patterns, so we connect them with a dashed double-sided arrow.

Using this graph, we can answer a number of interesting questions. We can estimate the effect of using the predictive system’s output $M_t$ versus not turning off any servers. We can also compare the current predictive model to another model to evaluate which one leads to a better overall cost.
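To make the first of these questions concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption made for illustration: the load process, the cost weights `energy_cost` and `restart_cost`, and the two policies `always_on` and `predictive` are hypothetical and do not come from Fig. 10. Only the dependency structure follows the text: $M_t$ is chosen from $L_{t-1}$ and $I_{t-1}$, $R_t$ depends on $L_t$ and $M_t$, and $C_t$ depends on $M_t$ and $R_t$.

```python
import random


def simulate_cost(policy, steps=5000, capacity=100, seed=0,
                  energy_cost=1.0, restart_cost=5.0):
    """Average per-step cost C_t when `policy` chooses M_t (toy mechanisms)."""
    rng = random.Random(seed)
    pattern = 50.0                  # request pattern: shared cause of the loads
    prev_load, prev_idle = 50, 50   # L_{t-1} and I_{t-1} at the start
    total = 0.0
    for _ in range(steps):
        m_t = policy(prev_load, prev_idle, capacity)   # M_t <- L_{t-1}, I_{t-1}
        pattern = min(80.0, max(20.0, pattern + rng.gauss(0, 1)))
        load = max(0, min(capacity, round(pattern + rng.gauss(0, 3))))  # L_t
        restarts = max(0, load - m_t)                  # R_t <- L_t, M_t
        total += energy_cost * m_t + restart_cost * restarts  # C_t <- M_t, R_t
        prev_load, prev_idle = load, max(0, m_t - load)
    return total / steps


def always_on(prev_load, prev_idle, capacity):
    """Baseline: never turn any server off."""
    return capacity


def predictive(prev_load, prev_idle, capacity):
    """Hypothetical predictor: last load plus a safety margin."""
    margin = 5 if prev_idle > 0 else 10
    return min(capacity, prev_load + margin)


cost_on = simulate_cost(always_on)
cost_pred = simulate_cost(predictive)
print(f"always on: {cost_on:.2f}  predictive: {cost_pred:.2f}  "
      f"difference: {cost_pred - cost_on:.2f}")
```

Both runs use the same seed, so they face the same sequence of loads; the difference in average cost is therefore attributable to the choice of policy alone, which is the comparison the causal question asks for.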

Note how this graph was constructed based on a high-level knowledge of the system architecture. Nodes like $M_t$ may themselves be computed using complex machine learning models, but we choose to abstract them out into single nodes. We also chose to construct a causal graph over a single time-step though the system runs continuously. That is, we ignored the fact that $I_{t-1}$ itself is a function of the model’s prediction at time $t-1$, $M_{t-1}$. In the specific case where the model $M$ utilizes data from only the previous time-step, this is a valid simplification since the model treats $I_{t-1}$ as a new, independent value. In other cases, such a simplification may lead to errors due to ignoring the feedback loop between the model and idle servers at different points in time. Additionally, there are a number of intermediate, recorded variables that lie between shutting off a server and energy consumption, but we chose to not include them since they are not the focus of the question. For another question, such as the behavior of hardware components while saving energy, including those measurements will be critical.

#### Example 3: Rotated MNIST images

As our third and final example, we consider the well-known handwriting recognition dataset called MNIST. This dataset contains images of digits and the task is to detect the digit shown in each image. We consider a subset of digit classes from 0 to 4 and include a twist: each image is rotated by an angle by some unknown data-generating process. Fig. 11 shows a random sample of the images from this rotated MNIST dataset.

To start with, it is hard to think of a causal graph for this system. All we are provided are a set of static images without any flow of information or causality. To proceed, let us try to reconstruct what process may have generated this data. Thinking of how people write down numeric digits, it is possible that people may have decided the digit they wanted to write and then written that digit. Thus, the digit class causes the specific shape we see on an image. Alternatively, the images may have been sampled from a random collection of shapes and someone might have selected them manually and labelled them as one of the ten digits. In this case, it is the shape of the region in an image that causes its digit classification. In either case, we can assume that there is a causal relationship between a specific shape and the digit class. We represent it using an undirected causal edge in Fig. 12.

In addition to shape, the angle of rotation seems to be associated with the digit. In Fig. 11 the digits 0 and 2 are never rotated much whereas other digits are rotated up to 90 degrees. However, from our understanding of digit recognition, we can safely assume that the angle of rotation cannot determine the digit. It may be that some images were rotated before they were captured or that these images were rotated based on their digit class after they were captured. In the first case, we can assume that some unobserved process causes both the digit class and its rotation. In the second, the digit class causes an unknown variable that decides the angle of rotation. Causal graphs for these mechanisms are shown in Figs. 12 (a) and 12 (b), respectively.

While we do not know the exact mechanism, this set of causal graphs provides important information about building a classifier that can generalize to different data distributions beyond the current one. Since shape is causally related to the digit class in all graphs, it should be included in a predictive model. However, in both graphs, digit and angle of rotation share no direct relationship. Specifically, their relationship depends completely on an unobserved node connecting them that acts as a fork (Fig. 12 (a)) or as the central node in a chain (Fig. 12 (b)). If the value of the unobserved node changes, their relationship also changes. In other words, given the unobserved node U, angle of rotation and digit class are d-separated, because U is either a fork or the centre of a chain on the only path between them. Therefore, these graphs imply that angle of rotation is not a causal feature and should not be included in any predictive model for digit recognition.
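The d-separation claim can be checked mechanically. The following sketch uses the standard moralized-ancestral-graph test for d-separation, written in plain Python; the function names and the dictionary encoding of the graphs are our own illustrative choices. Because the shape-digit edge is undirected in the text, we arbitrarily orient it as digit → shape here; this assumption does not affect the angle-digit result.

```python
from collections import deque


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `parents`."""
    keep = ancestors_of(parents, {x, y, *given})
    # Moralize the ancestral subgraph: connect every child to its parents,
    # connect parents that share a child, and ignore edge directions.
    nbrs = {node: set() for node in keep}
    for child in keep:
        ps = parents.get(child, [])
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for a in ps:
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    # Delete the conditioning nodes and test whether x can still reach y.
    blocked = set(given)
    queue, seen = deque([x]), {x}
    while queue:
        node = queue.popleft()
        if node == y:
            return False
        for nxt in nbrs[node] - blocked - seen:
            seen.add(nxt)
            queue.append(nxt)
    return True


# node -> list of parents; U is the unobserved node in both graphs
fork = {"U": [], "digit": ["U"], "angle": ["U"], "shape": ["digit"]}
chain = {"digit": [], "U": ["digit"], "angle": ["U"], "shape": ["digit"]}

for name, g in [("fork", fork), ("chain", chain)]:
    print(name,
          "| marginal:", d_separated(g, "angle", "digit"),
          "| given U:", d_separated(g, "angle", "digit", given=["U"]))
```

In both graphs the check reports that angle and digit are connected marginally but d-separated once U is given, whereas shape and digit stay connected regardless of U.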

This example underscores the point that a causal graph need not be complete or uniquely determined to be useful. As we noted before, we are looking for graphs that capture the major assumptions and constraints that can be known from domain knowledge, not the full causal graph of a system, which may be infeasible to obtain. Thus, it is more helpful to have an incomplete graph than no graph at all.

Figure 12: Possible causal graphs for the rotated MNIST images dataset, shown in panels (a) and (b). In both graphs, U is unobserved.

## 2.4 Potential Outcomes Framework

The potential outcomes (PO) framework is an alternative to causal graphs for reasoning about causal assumptions and setting up analyses. While causal graphs focus on the structure of the causal relationships themselves as the primary language for declaring assumptions, the potential outcomes framework places its focus on causal inference as a missing data problem. Recall from Chapter 1 that the causal effect is defined as the difference in outcomes Y between a world where treatment is given, World(T = 1), and a world where treatment is not given, World(T = 0).
$$\begin{split} \text{Causal Effect} = Y_{\text{World}(T=1)} - Y_{\text{World}(T=0)} \end{split}$$
The potential outcomes (PO) framework formalizes the notion of Y’s value in different worlds as a new statistical variable. Specifically, for every outcome $Y$, it defines a set of potential outcomes $Y_T$, one for each value of the treatment. The key point is that only one of the potential outcomes $Y_{T=t}$ is observed and the remaining potential outcomes are unobserved. In other words, there is a single observed outcome and the goal is to estimate all other unobserved potential outcome values that $Y$ could have taken under a different $T$. In the PO framework, these different values of the outcome are denoted by a subscript, $Y_{T=t}$.

Note that a potential outcome is not the same as probabilistic conditioning. Critically, $Y_{T=1}$ does not correspond to conditioning on $T=1$ in the observed data, but rather conveys the causal relationship between $Y$ and $T$. That is, $Y_{T=1}$ represents an intervention on $T$ by setting it to 1 without changing the rest of the world (i.e., all other relevant variables, both observed and unobserved, are constant). Thus, in general, $P(Y_{T=1}) \neq P(Y \mid T=1)$. For a binary treatment, potential outcomes provide a succinct way to formalize Eq. 5 for causal effect.
$$\begin{split} \text{Causal Effect} = \mathbb{E}[Y_{T=1}- Y_{T=0}] = \mathbb{E}[Y_{T=1}]- \mathbb{E}[Y_{T=0}] \end{split}$$
Since, for any particular unit of treatment (e.g., a person), we can only observe one of the potential outcomes, the PO framework translates the problem of causal inference to that of estimating the missing potential outcome. For example, if we observe $Y_{T=1}$ for a particular unit, then we can calculate the causal effect of $T$ if we can correctly estimate the unobserved potential outcome $Y_{T=0}$. Note that randomized experiments conveniently allow our observations of potential outcomes in one randomized group to provide unbiased estimates of the unobserved potential outcomes of other randomized groups.
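A small simulation can illustrate both points: that only one potential outcome per unit is ever observed, and that randomization lets the observed group means stand in for the missing ones. The data-generating process below is entirely hypothetical (a binary common cause `w`, a constant unit-level effect of 1, and the treatment probabilities 0.8 and 0.2 are all assumptions chosen for illustration).

```python
import random


def simulate(n, randomized, seed=0):
    """Return (true average effect, naive difference in observed means)."""
    rng = random.Random(seed)
    effects, treated, control = [], [], []
    for _ in range(n):
        w = rng.random() < 0.5               # hypothetical common cause W
        y0 = 2.0 * w + rng.gauss(0, 1)       # potential outcome Y_{T=0}
        y1 = y0 + 1.0                        # potential outcome Y_{T=1}
        # Treatment assignment: a coin flip, or driven by W (confounded).
        p_treat = 0.5 if randomized else (0.8 if w else 0.2)
        t = rng.random() < p_treat
        effects.append(y1 - y0)              # known only inside a simulation
        # Only the potential outcome matching the received treatment is seen.
        (treated if t else control).append(y1 if t else y0)
    true_effect = sum(effects) / n
    naive = sum(treated) / len(treated) - sum(control) / len(control)
    return true_effect, naive


for label, randomized in [("confounded", False), ("randomized", True)]:
    true_effect, naive = simulate(20000, randomized)
    print(f"{label}: true effect = {true_effect:.2f}, "
          f"observed difference in means = {naive:.2f}")
```

Under these assumptions the observed difference in means is about 2.2 in expectation when W drives treatment, well above the true effect of 1, and close to 1 when treatment is randomized.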

While the potential outcome framework is most commonly used for estimating the effect of treatment, the framework itself is general and can be used to denote the potential value of any variable. For example, $A_{B=1}$ represents the potential value of $A$ when $B$ is set to 1.

### 2.4.1 Comparing potential outcomes with causal graphs

Since the potential outcome variables also convey causal relationships, it is important to compare them to structural causal models. In many ways, causal graphs and potential outcomes are compatible. They both emphasize the difference between statistical conditioning and causal effect. While causal graphs do so by providing a representation of flow of causality without using statistical variables, potential outcomes do so by creating entirely new statistical variables. For instance, a two-node graph like that in Fig. 1 (a) can be represented by $B_{A=a}$, where the choice of subscript variable denotes an assumption about the direction of causal effect.

This difference becomes more pronounced when we want to represent assumptions about more complex systems. The PO framework does not have a good representation for relationships between variables other than the treatment and outcome. Instead, the focus is on reducing those relationships to questions about their effects on T and Y. For example, an analyst in the PO framework may ask whether the treatment assignment mechanism is known, whether the treatment is randomly assigned, or whether there are other variables that cause the treatment and, if so, whether they also cause the outcome Y. Based on the answers to these questions, an analyst will then decide how to identify and estimate the missing potential outcome. The advantage of the PO framework is the (small) number of well-tested and trusted identification and estimation strategies developed within it for finding the causal effect of a treatment. However, each of these strategies requires specific assumptions on the treatment assignment mechanism, often including the shape of the functions governing the underlying mechanisms.

In contrast, the SCM framework focuses on making all the assumptions as transparent as possible. When confronted with a non-randomized treatment assignment, an analyst constructs a graph expressing their assumptions about the causal mechanisms in the system. For instance, they may ask: What factors are causing the treatment? Are there specific structures among them (e.g., colliders) that can be exploited? If there is confounding, is it because there is missing data or fully unobserved confounders? Such an analysis brings out all the causal assumptions that go into a future identification and estimation exercise, which unfortunately are opaque in the PO framework. Moreover, a graph is a more general construct for causal reasoning that can be useful for many other questions about a system, beyond a specific effect. Once a causal graph is constructed, it allows questions about the effect between any pair of nodes, the effect of groups of nodes, the causal features for a particular outcome, the cascading nature of certain causal effects, and so on.

Put another way, the PO framework focuses directly on estimation of effect whereas the SCM framework emphasizes specifying the causal mechanisms first. Given that an effect estimate depends heavily on the causal assumptions that go in, the importance of transparency in causal assumptions cannot be overstated. Unlike predictive machine learning estimates that can be validated objectively using cross-validation metrics, no such validation procedure exists for causal estimates. Thus, a qualitative benefit of causal graphs is that they are essentially a simple-to-interpret diagram that can be shared with different stakeholders, promoting a transparent and informed discussion about the causal assumptions that went into an analysis.

### 2.4.2 Mixing causal and statistical assumptions

More fundamentally, however, the PO framework mixes causal and statistical assumptions within the same representation. To illustrate this point, we provide a simple modelling exercise using the PO framework. Assume a three-variable system where W is a common cause of T and Y. Under the PO framework, we may write a regression equation, with x denoting the value of the treatment T,
$$\label{eq:po-regression} \begin{split} y = f(x, w) + \epsilon ; \ \ \ \mathbb{E}[\epsilon|x,w]=0 \end{split}\qquad(6)$$
This equation represents several assumptions about the underlying system. First, it conveys the direction of the causal relationship. Implicitly, the LHS, y, is the effect and the terms on the RHS, f(x, w) + ϵ, are its causes. Second, by assuming that the conditional expected value of the error term is 0, it conveys that the error ϵ does not depend, in expectation, on the RHS variables, X and W. Third, this also serves as an estimating equation. The same equation is used for estimating the effect by making assumptions on the family of functions (e.g., all linear functions) and fitting a particular function to the available data.
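As a worked instance, suppose (purely for illustration) that f is restricted to the linear family mentioned above and that x is a binary treatment. The coefficients $\beta_0, \beta_1, \beta_2$ are hypothetical symbols introduced here. Reading the equation causally, so that setting x leaves w and ϵ unchanged, the effect of the treatment reduces to a single coefficient:

$$\begin{split} y &= \beta_0 + \beta_1 x + \beta_2 w + \epsilon \\ \mathbb{E}[Y_{x=1} - Y_{x=0}] &= \mathbb{E}[(\beta_0 + \beta_1 + \beta_2 w + \epsilon) - (\beta_0 + \beta_2 w + \epsilon)] = \beta_1 \end{split}$$

The same symbol $\beta_1$ therefore plays two roles at once: it is the causal effect under the assumed mechanism, and it is the regression coefficient that would be fitted from data under the linearity assumption.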

More generally, a single PO equation simultaneously covers three of our four stages of causal reasoning: modelling, identification, and estimation. It often includes both causal assumptions (such as direction of effect) and statistical assumptions (such as the family of estimating functions). To some degree, this brevity can be seen as a strength. Yet, it can also be a weakness, when a concise representation leads to causal assumptions being made implicitly, or sometimes asserted separately in a less rigorous notation (i.e., natural language). While we can see that both graphs and the PO representation convey similar ideas, in this book we prefer using causal graphs and structural equations for modeling causal assumptions to more clearly distinguish causal from statistical assumptions.

### 2.4.3 The best of both frameworks

We will see in the following chapters that, while the PO framework has some weaknesses in the modeling stage of causal analysis, it provides useful, common recipes for the identification stage of causal analysis, and shines in causal effect estimation. The PO framework provides a suite of well-tested and broadly used methods for estimation, based on constraints of function families, size of data or its dimensionality. Because it directly deals with statistical equations, the PO framework is also better equipped to handle constraints in a data-generating process like monotonicity of effect.

In this book, therefore, we mix and match elements of causal graphs and potential outcomes across the four stages of causal analysis—modeling, identification, estimation, and refutation. While we use primarily causal graphs and structural equations for capturing models and assumptions in the first stage, we will use both causal-graph based and potential outcomes identification, estimation, and refutation strategies.

1. We will discuss methods for handling cycles in causal graphs in Chapter 10.↩︎

2. There are more complicated causal graph notations that include specific annotations (different kinds of arrows and nodes) to indicate specific classes of interactions, such as mediation and interaction, though other kinds of interactions remain ambiguous. While we believe such notation can be useful, in this book, we will keep to the simpler graph notation both for simplicity of presentation and to avoid over-emphasizing some kinds of interactions over others.↩︎

3. Readers already familiar with structural equations might miss the ϵ noise factor. Do not worry, we will add them in soon, in Section 2.2.2.↩︎

4. This does not necessarily mean we will expect to be able to measure everything in the graph. There might be factors that we have added to our graph that will remain unobserved in our datasets. We will expand on unobserved nodes and their implications in Section 2.2.3.↩︎

5. Fig. 9 uses plate notation to represent the influence of P0 on a total of N products Pi. In plate notation, the rectangle with marker N is a summary of a repeated graph structure. I.e., the nodes Di and Pi inside the rectangle are repeated N times.↩︎
