Getting Started with Causal Inference
Code, tutorials, and resources for causal inference
https://causalinference.gitlab.io/
Fri, 27 Mar 2020 12:34:07 +0000
Chapter 2: Models and Assumptions
<section id="ch_causalmodel" data-number="1">
<p>Conventional statistical and machine learning problems are data focused. While data is a critical part of causal reasoning, it is not the only part. Just as important is the external knowledge we bring: our prior knowledge of the data generation mechanism and assumptions about plausible causal mechanisms. In fact, it is this external knowledge that distinguishes causal reasoning from associational methods.</p>
<p>Capturing our external knowledge about mechanisms and assumptions is the first stage of any causal analysis. Formally representing external domain knowledge as models allows us to systematically reason about strategies for answering causal questions. We already saw one example of such external knowledge captured in our structural causal model of the influence of temperature on ice cream and swimming in the previous chapter (<a href="/causal-reasoning-book-chapter1">Fig. 1.5</a>). This chapter focuses on the detailed mechanics and intuitions of these structural causal models and the assumptions they represent.</p>
<section id="sec:causalgraphs" data-number="1.1">
<h2 data-number="2.1"><span class="header-section-number">2.1</span> Causal Graphs</h2>
<p>The primary language for modeling causal mechanisms and expressing our assumptions is the language of <em>causal graphs</em>. Causal graphs encode our domain knowledge about the causal mechanisms underlying a system or phenomenon under study. We begin by introducing the mechanics of causal graphs and demonstrating how they represent causal relationships. We first assume that we have complete knowledge of the causal graph. Later in this chapter, we relax this assumption and refine our use of causal graphs to represent more complex and ambiguous situations.</p>
<p>A causal graph is made up of two kinds of elements:</p>
<ul>
<li><p><strong>Nodes</strong> represent variables or features in the world or system we are modeling. Without loss of generality, let us think of each node as representing something that is potentially observable, measurable, or otherwise knowable about a system.</p></li>
<li><p><strong>Edges</strong> connect nodes to one another. Each edge represents a mechanism or causal relationship that relates the values of the connected nodes. Edges are directed to indicate the flow of causal influence. For example, in Fig. 1 (a), a change in the value of node <span class="math inline"><em>A</em></span> will cause a change in <span class="math inline"><em>B</em></span>, but if we were to manipulate <span class="math inline"><em>B</em></span>, it would not cause a change in <span class="math inline"><em>A</em></span>. In cases where the direction of the influence is unknown, we will draw an undirected edge.</p></li>
</ul>
<p>Causal graphs are assumed to be acyclic. That is, we cannot have a situation where <span class="math inline"><em>A</em></span> causes <span class="math inline"><em>B</em></span> and <span class="math inline"><em>B</em></span> causes <span class="math inline"><em>A</em></span>.<a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a></p>
<div id="fig:sample-2nodegraph-main" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_2nodegraph.png" id="fig:sample-2nodegraph" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_2nodegraph_undirected.png" id="fig:sample-2nodegraph_undirected" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 1: A simple causal graph over two nodes. a — A causes B, b — Either A causes B or B causes A, but not both.</p>
</div>
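<p>As a minimal sketch of these two elements, a causal graph can be stored as a mapping from each node to its children, together with a check that the directed edges form no cycle. The dictionary layout and the helper name <code>is_acyclic</code> are illustrative assumptions rather than anything defined in this chapter; the first example graph is the one in Fig. 1 (a).</p>

```python
# Illustrative representation (an assumption, not the book's code):
# each key is a node, each value is the list of its children.
fig1a = {"A": ["B"], "B": []}        # A causes B
cyclic = {"A": ["B"], "B": ["A"]}    # A causes B and B causes A (not allowed)


def is_acyclic(graph):
    """Return True if the directed graph has no cycle (depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2     # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for child in graph[node]:
            if color[child] == GRAY:             # child is on the current path
                return False
            if color[child] == WHITE and not visit(child):
                return False
        color[node] = BLACK
        return True

    # Start a search from every node that has not been reached yet.
    return all(color[n] != WHITE or visit(n) for n in graph)


print(is_acyclic(fig1a))    # True
print(is_acyclic(cyclic))   # False
```

<p>A check of this kind is a reasonable first validation step whenever a graph is assembled by hand from domain knowledge.</p>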
<section id="reading-a-causal-graph" data-number="1.1.1">
<h3 data-number="2.1.1"><span class="header-section-number">2.1.1</span> Reading a causal graph</h3>
<p>Intuitively, an edge transfers information from one node to another in the direction of its arrow. Thus, causal effects flow through the graph along nodes connected by edges. This interpretation allows us to read a causal graph and answer many interesting and important questions about the underlying system.</p>
<p>To start with, we can ask whether a change in one node’s value might affect another. This is important when we want to, for example, check for potential side effects of some action or decision. We can also ask whether a node’s value will stay the same as others change, which is important for identifying whether a metric we care about is stable. If we want to optimize for some target outcome, we can ask which nodes we can manipulate to cause the targeted node’s value to change. For example, in Fig. 2, we can see that changes in <span class="math inline"><em>A</em></span> will affect <span class="math inline"><em>B</em></span>, which will, in turn, affect <span class="math inline"><em>D</em></span>. We can also see that, since there are no directed paths leading from <span class="math inline"><em>A</em></span> to <span class="math inline"><em>C</em></span>, changes in <span class="math inline"><em>A</em></span> will not affect <span class="math inline"><em>C</em></span>. Similarly, since causal effect flows in only one direction, we can see that changes in <span class="math inline"><em>C</em></span> will affect <span class="math inline"><em>B</em></span> and <span class="math inline"><em>D</em></span> but not <span class="math inline"><em>A</em></span>, and changes in <span class="math inline"><em>B</em></span> will affect <span class="math inline"><em>D</em></span>, but not <span class="math inline"><em>A</em></span> or <span class="math inline"><em>C</em></span>. Changes in <span class="math inline"><em>D</em></span> will not affect any of the other nodes.</p>
<p>Causal relationships also flow transitively from edge to edge, through a directed path. In referring to such causal relationships, we often use familial terms. If <span class="math inline"><em>A</em></span> causes <span class="math inline"><em>B</em></span>, then <span class="math inline"><em>A</em></span> is a <em>parent</em> of <span class="math inline"><em>B</em></span>, and <span class="math inline"><em>B</em></span> is a <em>child</em> of <span class="math inline"><em>A</em></span>. All child nodes of <span class="math inline"><em>B</em></span> and, recursively, all of their children are known as <em>descendants</em> of <span class="math inline"><em>B</em></span>. Similarly, all parents of <span class="math inline"><em>B</em></span> and, recursively, all of their parents are known as <em>ancestors</em> of <span class="math inline"><em>B</em></span>. A node causally affects all its descendants and is affected by all its ancestors.</p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_4nodegraph.png" id="fig:sample-4nodegraph" alt="" /><figcaption>Figure 2: A 4-node causal graph</figcaption>
</figure>
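<p>The reading rules above can be sketched in a few lines of code. The edge list below (A to B, C to B, B to D) is taken from the description of Fig. 2 in the text; the function names <code>descendants</code> and <code>ancestors</code> are illustrative choices.</p>

```python
# Edges of the 4-node graph as described in the text: A -> B, C -> B, B -> D.
children = {"A": ["B"], "B": ["D"], "C": ["B"], "D": []}


def descendants(graph, node):
    """All nodes reachable from `node` along directed edges."""
    found = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph[current])
    return found


def ancestors(graph, node):
    """All nodes from which `node` is reachable along directed edges."""
    return {other for other in graph if node in descendants(graph, other)}


for n in sorted(children):
    print(n, "affects", sorted(descendants(children, n)),
          "and is affected by", sorted(ancestors(children, n)))
```

<p>For this graph the loop reports that A and C each affect B and D, that B affects only D, and that D affects nothing, matching the reading given above.</p>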
</section>
<section id="sec:statisticalindependence_intro" data-number="1.1.2">
<h3 data-number="2.1.2"><span class="header-section-number">2.1.2</span> Causal graphs and statistical independence</h3>
<p>Causal graphs are not only intuitively easy to read; they also provide formal definitions that enable systematic reasoning about their properties. Fundamentally, a causal graph describes a non-parametric data-generating process over its nodes. By specifying independence and dependence between the nodes, the graph constrains the relationships between the generated variables corresponding to those nodes.</p>
<p>In particular, causal graphs provide information about statistical independence. Two nodes <span class="math inline"><em>x</em></span> and <span class="math inline"><em>y</em></span> are statistically independent if knowing the value of one node does not give information about the value of the other node. That is: <br /><span class="math display">$$x \unicode{x2AEB} y \text{ iff } P(x) = P(x|y)$$</span><br /> where <span class="math inline">$\unicode{x2AEB}$</span> is the symbol for statistical independence. We also often work with conditional independences, where two nodes <span class="math inline"><em>x</em></span> and <span class="math inline"><em>y</em></span> might only be statistically independent conditional on some other node <span class="math inline"><em>z</em></span>: <br /><span class="math display">$$(x \unicode{x2AEB} y) |z \text{ iff } P(x|z) = P(x|y,z)$$</span><br /> Statistical relationships correspond to a particular data distribution and should not be confused with causal relationships. Whereas causal relationships focus on whether manipulating one node’s value will cause a change in another node’s value, statistical relationships focus on whether knowing the value of a node provides information about the value of another node.</p>
<div id="fig:chain-fork-collider" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Collider-crop.png" id="fig:collider" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Fork-crop.png" id="fig:fork" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Chain-crop.png" id="fig:chain" style="width:100.0%" alt="" /><figcaption>c</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 3: Causal graph structures over three nodes. a — B is a collider for A and C., b — B creates a fork to A and C., c — B forms a chain from A to C.</p>
</div>
<p>To illustrate how a causal graph implies certain statistical independences, let us consider different graph structures over three variables. Fig. 3 shows the three important structures: <em>collider</em>, <em>fork</em> and <em>chain</em>. The left subfigure shows a node B that is caused by both A and C. Such a node is called a collider, because the effects from its two parents collide at the node. Importantly, no causal information transfers from A to C through B and thus A and C are statistically independent. The middle subfigure shows a fork structure where B causes both A and C. Here, the value of B influences the values of both A and C, thus A and C are not independent. However, the only source of statistical dependence between them is B. What if B is fixed at a certain value? In that case, A and C will be statistically independent. In other words, A and C are independent conditional on B. Finally, the right subfigure shows a chain. A chain implies a single direction of flow of causality from A to B, and from B to C. Any causal information from A to C must flow through B. Thus, similar to the fork structure, A and C are conditionally independent given B.</p>
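<p>A small simulation can make these three cases concrete. The sketch below assumes linear mechanisms with unit-variance Gaussian noise, which is only one of many data-generating processes compatible with each graph, and it uses partial correlation as a stand-in for conditional dependence (a substitution that is appropriate for linear Gaussian data but not in general). The function names and coefficients are illustrative.</p>

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_corr(xs, ys, zs):
    """Correlation of x and y after linearly adjusting both for z."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


def simulate(structure, n=20000, seed=0):
    """Draw n samples of (A, B, C) from a linear model with Gaussian noise."""
    rng = random.Random(seed)
    a, b, c = [], [], []
    for _ in range(n):
        if structure == "collider":      # A -> B <- C
            ai, ci = rng.gauss(0, 1), rng.gauss(0, 1)
            bi = ai + ci + rng.gauss(0, 1)
        elif structure == "fork":        # A <- B -> C
            bi = rng.gauss(0, 1)
            ai, ci = bi + rng.gauss(0, 1), bi + rng.gauss(0, 1)
        else:                            # chain: A -> B -> C
            ai = rng.gauss(0, 1)
            bi = ai + rng.gauss(0, 1)
            ci = bi + rng.gauss(0, 1)
        a.append(ai)
        b.append(bi)
        c.append(ci)
    return a, b, c


for structure in ("collider", "fork", "chain"):
    a, b, c = simulate(structure)
    print(f"{structure:9s} corr(A, C) = {pearson(a, c):+.2f}   "
          f"corr(A, C | B) = {partial_corr(a, c, b):+.2f}")
```

<p>Under these assumptions, the unconditional correlation between A and C should be close to zero only for the collider, while the correlation adjusted for B should be close to zero for the fork and the chain and clearly nonzero for the collider.</p>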
<p>The three basic structures can be extended to determine statistical independence in any graph. Here we use the concept of an <em>undirected path</em> between any two nodes, defined as a set of contiguous edges connecting the two nodes. To determine statistical independence between two nodes, we consider all undirected paths between two nodes and test whether the paths have any of these structures. Note that a collider is the only structure that leads to statistical independence without conditioning. Thus, two nodes are independent if all undirected paths between them contain a collider. Of course, they are also independent if there is no undirected path connecting them.</p>
<p>Nodes that lead to statistically independent variables are said to be <em>d-separated</em> from each other.</p>
<p><strong>d-separation:</strong> Two nodes in a causal graph are d-separated if either there exists no undirected path that connects them, or all paths connecting them contain a collider.</p>
<p>Conditional independence, however, is not as straightforward. From the other two base structures of fork and chain, we saw that conditioning on B makes A and C independent. However, the collider structure shows the reverse property. Conditioned on the collider B, A and C become dependent. This is because knowing the value of a collider and one of its parents tells us something about the other parent. Consider the example of a spam system that classifies an email as spam only if two conditions are satisfied: it contains the phrase “please send money” (A), and the email is too long (C). In general, these two variables are independently generated: longer emails can ask for money, but so can shorter emails. Knowing that the email is long tells us nothing about whether it contains an ask for money. However, once we know that an email was classified as spam, we can determine that both A and C were true. If it was not classified as spam, then knowing A=True reveals that C=False, and vice-versa. Thus, by fixing the value of a collider or conditioning on it, we render its parents dependent with respect to each other. Based on the above discussion, conditional independence or d-separation requires the conditioning variable to form a fork or chain along all paths between the two nodes, but also requires that it does not form a collider on any path.</p>
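<p>The spam example can be checked by exact enumeration. In the sketch below, the probabilities 0.3 and 0.4 are hypothetical values chosen for illustration, and <code>prob</code> is an illustrative helper that sums entries of the joint table.</p>

```python
from itertools import product

p_a = 0.3   # hypothetical P(A=True): the email asks for money
p_c = 0.4   # hypothetical P(C=True): the email is too long

# Joint distribution over (A, C, Spam). Spam is the collider: true only if A and C.
joint = {}
for a, c in product([True, False], repeat=2):
    weight = (p_a if a else 1 - p_a) * (p_c if c else 1 - p_c)
    joint[(a, c, a and c)] = weight


def prob(event, given=lambda a, c, s: True):
    """P(event | given), computed by summing entries of the joint table."""
    numerator = sum(p for k, p in joint.items() if event(*k) and given(*k))
    denominator = sum(p for k, p in joint.items() if given(*k))
    return numerator / denominator


is_long = lambda a, c, s: c

p_c_marginal = prob(is_long)
p_c_given_a = prob(is_long, lambda a, c, s: a)
p_c_given_not_spam = prob(is_long, lambda a, c, s: not s)
p_c_given_a_not_spam = prob(is_long, lambda a, c, s: a and not s)

print(f"P(C)              = {p_c_marginal:.3f}")
print(f"P(C | A)          = {p_c_given_a:.3f}")
print(f"P(C | not spam)   = {p_c_given_not_spam:.3f}")
print(f"P(C | A, not spam) = {p_c_given_a_not_spam:.3f}")
```

<p>The first two values coincide, so A carries no information about C on its own. Once we condition on the email not being spam, learning that A is true drives the probability of C to zero, which is the dependence induced by conditioning on the collider.</p>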
<p><strong>Conditional d-separation:</strong> Two nodes in a causal graph are conditionally d-separated on another node <span class="math inline"><em>B</em></span> if either they are d-separated, or all undirected paths connecting them contain <span class="math inline"><em>B</em></span> as a fork or a chain, but not as a collider.</p>
<p>Using the above definitions, we can now read statistical independence and conditional statistical independence relationships from a causal graph. For example, in Fig. 2, we can see that <span class="math inline">$A \unicode{x2AEB} C$</span> as they have no common causes, whereas all other pairs of nodes are statistically <em>dependent</em>. It is also possible to read conditional independences from a graph. In Fig. 2 again, <span class="math inline">$(D \unicode{x2AEB} A) | B$</span> since B is the central node of the chain connecting A and D. That is, once we know the value of <span class="math inline"><em>B</em></span>, we know everything we can know about <span class="math inline"><em>D</em></span> and, in particular, knowing the value of <span class="math inline"><em>A</em></span> does not change our belief in what <span class="math inline"><em>D</em></span> might be.</p>
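<p>These readings can be automated by enumerating undirected paths and checking each one for a blocking node. The sketch below is written for small graphs and is not an efficient algorithm. It also goes slightly beyond the definitions above in one respect, which is an addition here: a collider is treated as unblocked when any of its descendants is in the conditioning set, as in the standard definition of d-separation. All function names are illustrative.</p>

```python
def undirected_paths(edges, start, end):
    """All simple paths between start and end, ignoring edge direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def descendants(edges, node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def d_separated(edges, x, y, given=()):
    """True if every undirected path between x and y is blocked given `given`."""
    given, edge_set = set(given), set(edges)
    for path in undirected_paths(edges, x, y):
        blocked = False
        for prev, mid, nxt in zip(path, path[1:], path[2:]):
            is_collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
            if is_collider:
                # Blocks unless the collider or one of its descendants is given.
                if not (({mid} | descendants(edges, mid)) & given):
                    blocked = True
            elif mid in given:
                # A fork or chain node blocks the path when conditioned on.
                blocked = True
        if not blocked:
            return False
    return True


fig2 = [("A", "B"), ("C", "B"), ("B", "D")]   # edges as described in the text
print(d_separated(fig2, "A", "C"))            # True
print(d_separated(fig2, "A", "D"))            # False
print(d_separated(fig2, "A", "D", {"B"}))     # True
print(d_separated(fig2, "A", "C", {"B"}))     # False
```

<p>The last call shows the collider effect: A and C are d-separated unconditionally but not once B is conditioned on.</p>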
<p>The above definitions also apply to sets of nodes rather than individual nodes. The concept of conditional independence is useful in choosing nodes for intervening on other nodes. It implies that, conditioned on its parents, a node is independent of all its other ancestors. Thus, once we know the values of a node’s parents, knowing the values of its remaining ancestors conveys no additional information about the node. This knowledge can also be useful to design predictive models that generalize beyond the training distribution, as discussed in chapter <a href="#ch_classification" data-reference-type="ref" data-reference="ch_classification">13</a>.</p>
</section>
<section id="causal-graph-and-resulting-data-distributions" data-number="1.1.3">
<h3 data-number="2.1.3"><span class="header-section-number">2.1.3</span> Causal graph and resulting data distributions</h3>
<p>Since the graph only specifies the direction of effect and not its magnitude, shape or interactions, multiple data-generating processes and thus multiple data probability distributions are compatible with the same causal graph.</p>
<p>Formally, a causal graph specifies a factorization of the joint probability distribution of data. Any probability distribution consistent with the graph needs to follow the specific factorization. For instance, for the causal graph in Fig. 2, we can write, <br /><span class="math display">$$\begin{split}
P(A, B, C, D) &= P(D|A,B,C)P(B|C,A)P(C|A)P(A) \\
&= P(D|B)P(B|C,A)P(C)P(A)
\end{split}$$</span><br /> where the first equality follows from the chain rule of probability and the second from the structure of the causal graph. As discussed above, <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> are independent based on the graph, so <span class="math inline"><em>P</em>(<em>C</em>|<em>A</em>) = <em>P</em>(<em>C</em>)</span>. In addition, <span class="math inline"><em>B</em></span> blocks the directed paths from <span class="math inline"><em>A</em>, <em>C</em></span> to <span class="math inline"><em>D</em></span>, so <span class="math inline"><em>D</em></span> is independent of <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> given <span class="math inline"><em>B</em></span>, and <span class="math inline"><em>P</em>(<em>D</em>|<em>A</em>, <em>B</em>, <em>C</em>) = <em>P</em>(<em>D</em>|<em>B</em>)</span>. More generally, for any causal graph <span class="math inline">𝒢</span> over variables <span class="math inline"><em>V</em><sub>1</sub>, <em>V</em><sub>2</sub>, ...<em>V</em><sub><em>m</em></sub></span>, the probability distribution of data is given by, <span id="eq:graph-factorization"><br /><span class="math display">$$\label{eq:graph-factorization}
\begin{split}
P(V_1, V_2, \ldots, V_m) = \prod_{i=1}^{m} P(V_i|Pa_{\mathcal{G}}(V_i))
\end{split}\qquad(1)$$</span><br /></span> where <span class="math inline"><em>P</em><em>a</em><sub>𝒢</sub>(<em>V</em><sub><em>i</em></sub>)</span> refers to the parents of <span class="math inline"><em>V</em><sub><em>i</em></sub></span> in the causal graph <span class="math inline">𝒢</span>. Note that the above factorization and resultant independence constraints have to be satisfied by every probability distribution generated from the graph. Therefore, independence in the graph implies statistical independence constraints across all probability distributions.</p>
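<p>To see the factorization at work, the sketch below builds a joint distribution for the graph in Fig. 2 from one conditional table per node and then confirms numerically that the implied independences hold. The variables are assumed to be binary and every probability value is hypothetical; any other choice of tables would satisfy the same independences.</p>

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables (values 0/1).
p_a1 = 0.6                                                     # P(A=1)
p_c1 = 0.3                                                     # P(C=1)
p_b1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}    # P(B=1 | A, C)
p_d1 = {0: 0.2, 1: 0.8}                                        # P(D=1 | B)


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


# Joint distribution from the graph factorization P(D|B) P(B|A,C) P(C) P(A).
joint = {
    (a, b, c, d): bern(p_d1[b], d) * bern(p_b1[(a, c)], b)
    * bern(p_c1, c) * bern(p_a1, a)
    for a, b, c, d in product([0, 1], repeat=4)
}


def marginal(**fixed):
    """Sum the joint over entries matching the fixed values, e.g. marginal(a=1)."""
    names = ("a", "b", "c", "d")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))


print("total probability:", round(marginal(), 10))
# A independent of C: P(A=1, C=1) equals P(A=1) P(C=1).
print(marginal(a=1, c=1), marginal(a=1) * marginal(c=1))
# D independent of A given B: P(D=1 | A=1, B=1) equals P(D=1 | B=1).
print(marginal(a=1, b=1, d=1) / marginal(a=1, b=1),
      marginal(b=1, d=1) / marginal(b=1))
# A and D are dependent without conditioning on B.
print(marginal(a=1, d=1), marginal(a=1) * marginal(d=1))
```

<p>The first two comparisons agree up to floating-point rounding, while the last pair differs, reflecting that A affects D through B.</p>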
<p>Additionally, it is possible that some data distributions factorize further and include more independences. An edge between two nodes in a causal graph conveys the assumption that there may exist a relationship between them, but not all data distributions necessarily exhibit it. Since multiple distributions are possible, in some distributions the effect between the nodes goes to zero. For instance, while <span class="math inline"><em>A</em></span> and <span class="math inline"><em>B</em></span> are connected via an edge in Fig. 2, there can be a dataset where <span class="math inline"><em>A</em></span>’s effect on <span class="math inline"><em>B</em></span> is zero. As another example, in the ice cream and swimming causal graph from <a href="/causal-reasoning-book-chapter1">Fig. 1.5</a>, we will find that the effect of ice-cream on swimming is zero, even though the causal graph includes an edge. Thus, including an edge reflects the possibility of a causal relationship, but does not confirm it. The effect of one node on another in a causal graph can also cancel through multiple interacting effects. For example, based on the graph from Fig. 2, there can be a distribution where <span class="math inline"><em>A</em></span> and <span class="math inline"><em>D</em></span> are independent if the effect of <span class="math inline"><em>B</em></span> on <span class="math inline"><em>D</em></span> directly cancels <span class="math inline"><em>A</em></span>’s effect on <span class="math inline"><em>B</em></span>. Exact cancellations of this kind are often assumed to be implausible.</p>
<p>Another implication of a causal graph is the specific factorization of the joint probability of data. While other factorizations are also valid for a given dataset, a factorization consistent with Eq. 1 is more likely to generalize to changes in data distribution. That is, we can consider individual conditional probability factors as independent mechanisms. In any dataset generated from the graph in Fig. 2, if P(A) changes, we expect P(B) to change too, but do not expect the causal relationship between them, P(B|A), to change. However, if we consider any other factorization, e.g., <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>)<em>P</em>(<em>B</em>)</span>, then changing P(A) will change P(B), but also necessarily change <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>)</span>. This is because <span class="math inline"><em>P</em>(<em>A</em>|<em>B</em>) = <em>P</em>(<em>B</em>|<em>A</em>)<em>P</em>(<em>A</em>)/<em>P</em>(<em>B</em>)</span>. Since <span class="math inline"><em>P</em>(<em>B</em>|<em>A</em>)</span> is invariant across distributions, P(A|B) depends directly on P(A). The invariance of causal relationships across different distributions underscores the generalization benefit of a causal graph: P(B|A), once estimated from a single data distribution, is expected to stay the same for all distributions consistent with the graph.</p>
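<p>A short calculation illustrates this asymmetry. The sketch below keeps a hypothetical mechanism P(B|A) fixed for binary A and B, varies P(A), and reports how P(B) and P(A|B) respond. The numbers are made up for illustration, and the example ignores the other parent of B in Fig. 2 for simplicity.</p>

```python
p_b1_given_a = {0: 0.2, 1: 0.9}   # hypothetical invariant mechanism P(B=1 | A)


def describe(p_a1):
    """Return P(B=1) and P(A=1 | B=1) for a given P(A=1), holding P(B | A) fixed."""
    p_b1 = p_b1_given_a[1] * p_a1 + p_b1_given_a[0] * (1 - p_a1)
    p_a1_given_b1 = p_b1_given_a[1] * p_a1 / p_b1     # Bayes' rule
    return p_b1, p_a1_given_b1


for p_a1 in (0.1, 0.5, 0.9):
    p_b1, p_a1_given_b1 = describe(p_a1)
    print(f"P(A=1)={p_a1:.1f}  P(B=1)={p_b1:.2f}  "
          f"P(A=1|B=1)={p_a1_given_b1:.2f}  P(B=1|A=1)={p_b1_given_a[1]:.1f}")
```

<p>Across the three rows, P(B=1) and P(A=1|B=1) both move with P(A=1), while P(B=1|A=1) stays at its fixed value. A model that stores the causal factor can therefore be reused when the distribution of A shifts, whereas one that stores P(A|B) must be re-estimated.</p>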
<p>At this point, it is useful to compare causal graphs to probabilistic graphical models. While a probabilistic graphical model also offers a graphical representation of conditional independences, such a graph corresponds only to a particular data distribution. A causal graph, in contrast, represents invariant relationships that are stable across data distributions. These relationships are expected to hold for all configurations of an underlying system. Causal graphs thus provide a concise way to describe key, invariant properties of a system.</p>
</section>
<section id="key-properties" data-number="1.1.4">
<h3 data-number="2.1.4"><span class="header-section-number">2.1.4</span> Key Properties</h3>
<p>It is satisfying to note that once we have formulated our domain knowledge about possible causal relationships in the form of a graph, we can reason about causal relationships between any pair of nodes in our graph without appealing to additional domain knowledge. That is, the graph itself captures all the required information for determining which nodes, if manipulated, might affect which others. Below we emphasize key properties of the causal graph.</p>
<section id="the-assumptions-asserted-by-a-causal-graph-are-encoded-by-the-missing-edges-in-a-graph-and-the-direction-of-edges" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">The assumptions asserted by a causal graph are encoded by the missing edges in a graph, and the direction of edges</h4>
<p>It would be easy to believe that we are making an assumption about the existence of a causal mechanism when we draw an edge between two nodes. However, the edge itself does not represent an assumption! That an edge exists says nothing about the shape or size of the causal influence of one node on another; that causal influence could be vanishingly small or even <span class="math inline">0</span>! Thus, the existence of an edge—especially an undirected edge—does not represent a constraint on the underlying mechanisms. In contrast, the lack of an edge between two nodes is a much stronger assumption, as it is asserting that the direct causal influence is truly <span class="math inline">0</span>.</p>
<p>Fig. 4 illustrates three causal graphs that encode increasingly more assumptions. Of these illustrations, Fig. 4 (a) encodes the fewest assumptions. The single assumption is that the left nodes cause the right nodes (or more precisely, that the right nodes do not cause the left nodes). Note, however, that nothing is assumed about the relationship between the two left nodes, and nothing is assumed about the relationship between the two right nodes. By removing edges and adding directionality to another edge, Fig. 4 (b) adds several additional assumptions: that the top-left node causes the bottom-left node and that only the bottom-left node influences the right nodes. Fig. 4 (c) makes the strongest assumptions on the underlying graph.</p>
<p>When is it preferable to use models that make more assumptions or fewer assumptions about underlying causal mechanisms? Generally speaking, when creating a causal graph, we should strive to encode as much of our domain knowledge as possible within the graph. If we know for certain through external knowledge that there is no direct causal relationship between two nodes, then we have no reason to add such an edge and, in fact, many of our computations and analyses will become only more difficult or the results more ambiguous if we do.</p>
<div id="fig:assumptionsdag" class="subfigures">
<table style="width:90%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_FewAssumptionsDAG.png" id="fig:fewassumptionsdag" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_SomeAssumptionsDAG.png" id="fig:someassumptionsdag" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_ManyAssumptionsDAG.png" id="fig:manyassumptionsdag" style="width:100.0%" alt="" /><figcaption>c</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 4: Causal graphs with more edges encode fewer assumptions on the underlying causal mechanisms. Here Fig. 4 (c) encodes the most assumptions. a — Few assumptions, b — More assumptions., c — Many assumptions.</p>
</div>
</section>
<section id="causal-relationships-correspond-to-stable-and-independent-mechanisms" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Causal relationships correspond to stable and independent mechanisms</h4>
<p>A system—whether a computer system, a mechanical system, a social system, or other—consists of many mechanisms, interacting with each other to create the integrated behavior of the system as a whole. These mechanisms are often independent and stable. That is, we can replace or change one of these mechanisms without replacing others. The other mechanisms remain the same, though the system as a whole will change with the new integrated behavior.</p>
<p>Appealing to this intuition of stable and independent mechanisms, we often assume that the components of the causal graph (in particular, the unrelated edges in the graph) represent distinct stable and independent mechanisms of the underlying system being modeled. That is, if we make some change to how the world works (perhaps we upgrade a software component, or we change the mechanics of a physical system), we can change how <span class="math inline"><em>A</em></span> influences <span class="math inline"><em>B</em></span> without changing how <span class="math inline"><em>B</em></span> influences <span class="math inline"><em>D</em></span>, or vice-versa.</p>
</section>
<section id="sec:causalgraphscannotbelearnedfromdataalone" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Causal graphs cannot be learned from data alone</h4>
<p>We declared earlier that inferring causality requires external knowledge (some information about the underlying system mechanics or the data generating process) beyond the raw data itself. Why is this? Every causal graph implies a set of testable implications: the conditional statistical independences, introduced above in Section <a href="#sec:statisticalindependence_intro" data-reference-type="ref" data-reference="sec:statisticalindependence_intro">2.1.2</a>. However, not every causal graph implies a unique set of independence tests. Every causal graph has an <em>equivalence class</em> of graphs that generate the same independence tests.</p>
<p>Consider the two graphs in Fig. 5. In Fig. 5 (a), we can read only a single independence from the graph: <span class="math inline">$(C \unicode{x2AEB} A) | B$</span>. That is, once we know the value of <span class="math inline"><em>B</em></span>, knowing the value of <span class="math inline"><em>A</em></span> will not give us any additional knowledge about the value of <span class="math inline"><em>C</em></span>. This graph implies many other causal assumptions, but only the conditional statistical independence tests are testable given data.</p>
<p>In Fig. 5 (b), we see a very different causal graph. From a practical standpoint, making a decision using this causal graph rather than the first would be very different. In Fig. 5 (a), if we manipulate <span class="math inline"><em>B</em></span>, we do not expect that <span class="math inline"><em>A</em></span> will change. In contrast, in Fig. 5 (b), if we manipulate <span class="math inline"><em>B</em></span>, we <em>do</em> expect that <span class="math inline"><em>A</em></span> will change.</p>
<p>Despite such differences in the causal implications of these graphs, when we look for testable statistical independences, we can only find one, that <span class="math inline">$(C \unicode{x2AEB} A) | B$</span>, the same test as for the other graph. The implication is that any dataset that satisfies the testable assumptions of Fig. 5 (a) will also satisfy the testable assumptions of Fig. 5 (b). As a result, if we want to know which causal graph is correct, we cannot rely solely on the raw data, but must bring some of our own domain knowledge to bear.</p>
<div id="fig:equivalence-dag" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_EquivalenceClass_DAG.png" id="fig:equivalenceclass-lhs" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_EquivalenceClass2_DAG.png" id="fig:equivalenceclass-rhs" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 5: These two graphs are in the same equivalence class. That is, any data set that can be described with one of these models can also be described with the other. </p>
</div>
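<p>The following sketch illustrates why data alone cannot separate two graphs in the same equivalence class. It assumes, for illustration only, that one graph is the chain A to B to C and the other is the fork in which B causes both A and C; the figure itself may show a different pair with the same single independence. A joint distribution is generated from hypothetical chain mechanisms and then rewritten exactly in the fork factorization.</p>

```python
from itertools import product

# Chain A -> B -> C with hypothetical binary mechanisms.
p_a1 = 0.3
p_b1_given_a = {0: 0.2, 1: 0.8}
p_c1_given_b = {0: 0.1, 1: 0.6}


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


chain_joint = {
    (a, b, c): bern(p_a1, a) * bern(p_b1_given_a[a], b) * bern(p_c1_given_b[b], c)
    for a, b, c in product([0, 1], repeat=3)
}

# Marginals needed to express the same data as a fork: P(B), P(A, B), P(B, C).
p_b = {b: sum(p for k, p in chain_joint.items() if k[1] == b) for b in (0, 1)}
p_ab = {(a, b): sum(p for k, p in chain_joint.items() if (k[0], k[1]) == (a, b))
        for a, b in product([0, 1], repeat=2)}
p_bc = {(b, c): sum(p for k, p in chain_joint.items() if (k[1], k[2]) == (b, c))
        for b, c in product([0, 1], repeat=2)}

# Fork factorization P(B) P(A|B) P(C|B), built only from the observed joint.
fork_joint = {
    (a, b, c): p_b[b] * (p_ab[(a, b)] / p_b[b]) * (p_bc[(b, c)] / p_b[b])
    for a, b, c in product([0, 1], repeat=3)
}

max_gap = max(abs(chain_joint[k] - fork_joint[k]) for k in chain_joint)
print("largest difference between the two joints:", max_gap)
```

<p>The largest difference is zero up to floating-point rounding, so no statistical test on this data could favor one of the two structures over the other.</p>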
<p>We will discuss statistical independence tests and these equivalence classes in more detail in Chapter <a href="#ch_refutations" data-reference-type="ref" data-reference="ch_refutations">5</a>.</p>
</section>
<section id="causal-graphs-are-a-tool-to-help-us-reason-about-a-specific-problem" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Causal graphs are a tool to help us reason about a specific problem</h4>
<p>Finally, we want to briefly emphasize the intuition that there is not necessarily a single, “correct” causal graph representation of any given system. A causal graph should, of course, correspond with the true causal mechanisms that drive the system being analyzed. However, questions of abstractions and fidelity, exogeneity, measurement practicalities, and, of course, the overarching purpose of an analysis, mean that many different models of a system can be considered valid under different purposes and circumstances.</p>
</section>
</section>
</section>
<section id="sec:struc-equations" data-number="1.2">
<h2 data-number="2.2"><span class="header-section-number">2.2</span> Structural Equations, Noise, and Unobserved Nodes</h2>
<p>The causal graph is a simplified representation that captures much of the key information about a system but, like all abstractions, also leaves many details out. In this section, we present how structural equations can represent details of the functional relationships represented by edges of a graph; and discuss the importance of noise and unobserved nodes. Causal graphs and structural equations together form the <em>Structural Causal Model (SCM)</em> framework of causal reasoning.</p>
<section id="structural-equations" data-number="1.2.1">
<h3 data-number="2.2.1"><span class="header-section-number">2.2.1</span> Structural Equations</h3>
<p>One key piece of information that is not included in the representation of the graph is the functional relationship between nodes. While the existence of an edge between two nodes in Fig. 1 (a) tells us that there is some functional relationship between <span class="math inline"><em>A</em></span> and <span class="math inline"><em>B</em></span>, it does not tell us the shape or magnitude of the effect. Because the graph does not represent this functional relationship, we cannot tell from Fig. 1 (a) alone whether an increase in <span class="math inline"><em>A</em></span> will cause an increase or a decrease in <span class="math inline"><em>B</em></span>. In more complicated scenarios where multiple nodes influence others, we cannot tell how the values of these nodes interact.<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a> For example, in Fig. 2, where edges from both <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> lead to <span class="math inline"><em>B</em></span>, the causal graph alone does not tell us how these nodes might interact, or if they interact at all. It is possible that the effect of <span class="math inline"><em>C</em></span> on <span class="math inline"><em>B</em></span> is the same regardless of the value of <span class="math inline"><em>A</em></span> (no interaction). It is also possible that <span class="math inline"><em>C</em></span> affects <span class="math inline"><em>B</em></span> differently depending on <span class="math inline"><em>A</em></span>’s value.</p>
<p>To augment causal graphs with a stronger characterization of the functional relationships between nodes, we often use structural equation models (SEM). Each equation represents a causal assignment from the right-hand-side of the equation to the left. Eqns. 2, 3 show general structural equations for Figs. 1 (a), 2 respectively.<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a> <span id="eq:sample-2nodegraph"><br /><span class="math display">$$\begin{array}{rcl}
B & \leftarrow& f(A)
\end{array}
\label{eq:sample-2nodegraph}\qquad(2)$$</span><br /></span> <span id="eq:sample-4nodegraph"><br /><span class="math display">$$\begin{array}{rcl}
D & \leftarrow& f_1(B) \\
B & \leftarrow& f_2(A,C)
\end{array}
\label{eq:sample-4nodegraph}\qquad(3)$$</span><br /></span> While these equations allow any form of function <span class="math inline"><em>f</em>()</span>, we can easily indicate specific functional forms. For example, Eq. 4 shows an alternative SEM for the graph of Fig. 2 where the causal relationships are linear, and the effects of <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> on <span class="math inline"><em>B</em></span> do not interact with each other. <span id="eq:sample-4nodegraph-linear"><br /><span class="math display">$$\begin{array}{rcl}
D & \leftarrow& \alpha_1*B \\
B & \leftarrow& \alpha_2*A + \beta_2*C
\end{array}
\label{eq:sample-4nodegraph-linear}\qquad(4)$$</span><br /></span> Note that the above equations are different from purely statistical models such as linear regressions, even though the equations are deceptively similar. Structural equations imply a causal relationship, whereas conventional equations provide no such implication. In a regression model, it is equally valid to regress y on x, as it is to regress x on y. In contrast, a structural equation can only be written in one direction, the direction of causal relationship as specified by a causal graph. Further, a structural equation only includes causes of y in the RHS whereas a linear regression may include all known variables, including children of y that can be useful for prediction. In some cases, it is possible that parameters of a structural equation are estimated using linear regression, but the two models still retain their independent characteristics.</p>
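<p>The directional nature of structural equations can be made concrete with a small sketch. The code below writes the linear SEM of Eq. 4 as Python assignments and evaluates them in causal order. The coefficient values and the <code>do_b</code> argument are illustrative assumptions, not values from the text. The point is that setting <span class="math inline"><em>B</em></span> by hand changes its effect <span class="math inline"><em>D</em></span> but leaves its causes <span class="math inline"><em>A</em></span> and <span class="math inline"><em>C</em></span> untouched, which a symmetric regression equation would not express.</p>

```python
# Hypothetical coefficients for the linear SEM of Eq. 4 (not from the text).
ALPHA_1 = 2.0   # effect of B on D
ALPHA_2 = 0.5   # effect of A on B
BETA_2 = -1.0   # effect of C on B


def f_B(a, c):
    """Structural assignment B <- alpha_2 * A + beta_2 * C."""
    return ALPHA_2 * a + BETA_2 * c


def f_D(b):
    """Structural assignment D <- alpha_1 * B."""
    return ALPHA_1 * b


def evaluate(a, c, do_b=None):
    """Evaluate the SEM in causal order.

    If do_b is given, B is set by intervention and its own structural
    equation is ignored; the causes A and C are left as they were.
    """
    b = f_B(a, c) if do_b is None else do_b
    d = f_D(b)
    return {"A": a, "C": c, "B": b, "D": d}


observed = evaluate(a=4.0, c=1.0)
intervened = evaluate(a=4.0, c=1.0, do_b=10.0)
print(observed)
print(intervened)
```

<p>In this sketch, forcing <span class="math inline"><em>B</em></span> to a new value propagates forward to <span class="math inline"><em>D</em></span> only. There is no way to "run the equation backwards" to change <span class="math inline"><em>A</em></span> by manipulating <span class="math inline"><em>B</em></span>.</p>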
</section>
<section id="sec:refininggraphs-noise" data-number="1.2.2">
<h3 data-number="2.2.2"><span class="header-section-number">2.2.2</span> Noisy models</h3>
<p>Any model, whether a causal graph or a set of structural equations, will necessarily represent (at best) a simplified version of the most important causal factors and relationships controlling a system’s behavior. To account for the many minor factors influencing system behavior, it is common practice to introduce an <span class="math inline"><em>ϵ</em></span> noise term into our structural equations: <br /><span class="math display">$$\label{ch02-noisy-sem}
\begin{array}{rcl}
D & \leftarrow& \alpha_1*B + \epsilon_D \\
B & \leftarrow& \alpha_2*A + \beta_2*C + \epsilon_B
\end{array}$$</span><br /></p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Fig1_4nodegraph_noise.png" id="fig:noise" alt="" /><figcaption>Figure 6: Noise terms, such as <span class="math inline"><em>ϵ</em><sub><em>B</em></sub></span> and <span class="math inline"><em>ϵ</em><sub><em>D</em></sub></span>, are often omitted from causal graphs, but can be added for completeness as shown here.</figcaption>
</figure>
<p>Fig. 6 shows the causal graph representation of these <span class="math inline"><em>ϵ</em></span> noise terms. For simplicity of representation, these noise terms are not usually drawn in the causal graph, but are simply assumed to exist. Note that to avoid compromising the causal relationships implied by our original (non-noisy) causal model, the noise factors that influence each node must be independent of one another. If, for some reason, we believe that two nodes in a causal graph are subject to correlated noise, we must instead explicitly represent that in the graph.</p>
<p>Structural equations can be considered an alternative representation of the factorization of probability distributions mentioned earlier. Assuming the noise terms have zero mean and are independent of each node's causes, the noisy structural equations above can be written in terms of expectations as: <br /><span class="math display">$$\begin{split}
\mathbb{E}[D|B] &= \alpha_1*B \\
\mathbb{E}[B|A, C]&= \alpha_2 *A + \beta_2 *C
\end{split}$$</span><br /></p>
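<p>As a rough check of this correspondence, the sketch below simulates the noisy linear SEM and recovers <span class="math inline"><em>α</em><sub>1</sub></span> from the simulated data. The coefficients, the standard normal distributions, and the sample size are all assumptions chosen for illustration. Because <span class="math inline"><em>ϵ</em><sub><em>D</em></sub></span> is drawn independently with zero mean, the least-squares slope of <span class="math inline"><em>D</em></span> on <span class="math inline"><em>B</em></span> should land close to <span class="math inline"><em>α</em><sub>1</sub></span>.</p>

```python
import random

ALPHA_1, ALPHA_2, BETA_2 = 2.0, 0.5, -1.0  # hypothetical coefficients

rng = random.Random(0)


def sample_unit():
    """Draw one unit from the noisy linear SEM, in causal order."""
    a = rng.gauss(0.0, 1.0)          # exogenous
    c = rng.gauss(0.0, 1.0)          # exogenous
    eps_b = rng.gauss(0.0, 1.0)      # independent, zero-mean noise on B
    eps_d = rng.gauss(0.0, 1.0)      # independent, zero-mean noise on D
    b = ALPHA_2 * a + BETA_2 * c + eps_b
    d = ALPHA_1 * b + eps_d
    return a, c, b, d


def slope(xs, ys):
    """Least-squares slope of ys on xs (covariance over variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


units = [sample_unit() for _ in range(20000)]
b_values = [u[2] for u in units]
d_values = [u[3] for u in units]
estimated_alpha_1 = slope(b_values, d_values)
print(round(estimated_alpha_1, 2))
```

<p>If the two noise terms were instead correlated, this slope would no longer be guaranteed to match <span class="math inline"><em>α</em><sub>1</sub></span>, which is one way to see why correlated noise must be drawn explicitly in the graph.</p>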
</section>
<section id="sec:refininggraphs-unobserved" data-number="1.2.3">
<h3 data-number="2.2.3"><span class="header-section-number">2.2.3</span> Unobserved Nodes</h3>
<p>Often, when we create a causal model of a system, we will not be able to observe all of its nodes. The values of some nodes may be completely unobserved, or latent. This can happen if we do not know how to measure the value of a node, if the node is too expensive to measure, or if a piece of data is too private or otherwise confidential. Depending on the causal structure of the system, we can often still address our specific task or question through computations over the nodes whose values are observed. To indicate an unobserved node in a causal graph, we draw its outline as a dashed line (Fig. 7).</p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Unobserved1.png" id="fig:refininggraphs-unobserved" alt="" /><figcaption>Figure 7: By convention, nodes that are completely unobserved are marked with a dashed outline.</figcaption>
</figure>
<p>In many situations, nodes may be only partially observed, with some of their values missing. For example, if a node value is expensive to collect, we may only sample it for a small number of our data entries. Or, our data collection might be systematically biased in some way. In such cases, we can model the <em>missingness mechanism</em> in the causal graph itself. By modeling the causes of partial observation, we will be able to directly analyze why data might be missing and understand the biases present in our observed data.</p>
<p>We can model the missingness mechanism in the causal graph itself by the following simple manipulation of the graph, as shown in Fig. 8.</p>
<ol type="1">
<li><p>We split the partially observed node, <span class="math inline"><em>Z</em></span>, into two nodes, one of which is the true value <span class="math inline"><em>Z</em></span> but is completely unobserved.</p></li>
<li><p>The second node, <span class="math inline"><em>Z</em><sup>*</sup></span>, is observed and is caused by <span class="math inline"><em>Z</em></span> and a new missingness indicator <span class="math inline"><em>R</em><sub><em>Z</em></sub></span>.</p></li>
<li><p>The missingness indicator controls the value of <span class="math inline"><em>Z</em><sup>*</sup></span>. If <span class="math inline"><em>R</em><sub><em>Z</em></sub> = 1</span>, the value of <span class="math inline"><em>Z</em></span> is revealed as <span class="math inline"><em>Z</em><sup>*</sup></span>. Otherwise, if <span class="math inline"><em>R</em><sub><em>Z</em></sub> = 0</span>, the value of <span class="math inline"><em>Z</em></span> is not revealed and <span class="math inline"><em>Z</em><sup>*</sup></span> takes a null or empty value.</p></li>
<li><p>If data is observed or sampled at random, then the missingness indicator <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is an independent node, unconnected to the rest of the causal graph. If <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is systematically influenced by other factors in the system, then we draw the appropriate causal connections. In Fig. 8, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is caused by <span class="math inline"><em>C</em></span>.</p></li>
</ol>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Unobserved2.png" id="fig:missingness" alt="" /><figcaption>Figure 8: The node <span class="math inline"><em>Z</em></span> is partially observed. Whether we see its value in <span class="math inline"><em>Z</em><sup>*</sup></span> is controlled by a missingness indicator, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span>.</figcaption>
</figure>
<p>This manipulation can also be generalized to represent other systematic biases in values, beyond missing values. In such cases, the control node, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> in Fig. 8, would no longer be a missingness indicator, but rather a general bias indicator, and <span class="math inline"><em>Z</em><sup>*</sup></span> then a systematically biased version of <span class="math inline"><em>Z</em></span>.</p>
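<p>The graph manipulation described above can be sketched as a small simulation. Here <span class="math inline"><em>Z</em></span> is the true value, <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is the missingness indicator, and <span class="math inline"><em>Z</em><sup>*</sup></span> is the observed proxy. The specific distributions, the reveal probabilities, and the extra edge from <span class="math inline"><em>C</em></span> to <span class="math inline"><em>Z</em></span> are assumptions made for illustration only; the text states only that <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> is caused by <span class="math inline"><em>C</em></span>.</p>

```python
import random

rng = random.Random(1)


def sample_row():
    """Draw one row from the hypothetical graph C -> Z, C -> R_Z, (Z, R_Z) -> Z*."""
    c = rng.choice([0, 1])                       # C: a binary cause
    z = 10.0 + 5.0 * c + rng.gauss(0.0, 1.0)     # Z: true value, never seen directly
    p_reveal = 0.9 if c == 1 else 0.2            # R_Z is influenced by C
    r_z = rng.choices([1, 0], weights=[p_reveal, 1 - p_reveal])[0]
    z_star = z if r_z == 1 else None             # Z*: None plays the role of "missing"
    return {"C": c, "Z": z, "R_Z": r_z, "Z_star": z_star}


rows = [sample_row() for _ in range(10000)]
true_mean = sum(row["Z"] for row in rows) / len(rows)
revealed = [row["Z_star"] for row in rows if row["R_Z"] == 1]
observed_mean = sum(revealed) / len(revealed)
print(round(true_mean, 2), round(observed_mean, 2))
```

<p>Under these assumptions, rows with <span class="math inline"><em>C</em> = 1</span> are revealed far more often, so the mean of the revealed <span class="math inline"><em>Z</em><sup>*</sup></span> values overstates the mean of <span class="math inline"><em>Z</em></span>. If <span class="math inline"><em>R</em><sub><em>Z</em></sub></span> were instead an independent node, the two means would agree up to sampling noise.</p>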
</section>
</section>
<section id="sec:building-causal-graph" data-number="1.3">
<h2 data-number="2.3"><span class="header-section-number">2.3</span> Where does a Causal Graph Come From?</h2>
<section id="creating-a-causal-graph" data-number="1.3.1">
<h3 data-number="2.3.1"><span class="header-section-number">2.3.1</span> Creating a Causal Graph</h3>
<p>When we are using a causal graph to reason about causal mechanisms, we assume that the causal graph captures everything that is important and relevant to the problem we are studying.<a href="#fn4" class="footnote-ref" id="fnref4" role="doc-noteref"><sup>4</sup></a> When we are trying to decide what is important and relevant, we can think about it in several stages:</p>
<p><em>Core factors related to the question:</em> First, we consider the question we are trying to answer through our analysis of the graph. For example, if we are using our analysis to guide decision-making (i.e., whether to take a specific action), we will want our causal graph to include the action and its possible effects or outcomes, as well as other factors that influence the likelihood of those outcomes. When analyzing a particular dataset of actions and outcomes, we must also include the factors that influenced the likelihood of action.</p>
<p><em>Adding additional relevant factors:</em> Second, we should look at the factors that we have decided are relevant to our task and consider whether any of them have shared causes. If so, we might include those shared causes as well. The decision to include them can be based on how important it is to capture the fact that the given factors are correlated with one another. When we decide that it is not important to model the causes of some node in a causal graph, we call it an <em>exogenous</em> node. If a node is determined entirely by nodes within the causal graph, we call it an <em>endogenous</em> node.</p>
<p><em>Removing static factors:</em> Finally, we consider what is static and what is dynamic across the scenarios and environments we intend to represent with our causal graph. If some factor that is otherwise relevant to our causal question never changes, we will leave it out. For example, when analyzing the effect of a new recommendation policy in an online store, we may know that the effect of the policy depends on some basic societal and economic factors, such as the availability of Internet, electricity, and a stable monetary system. If we believe that the availability of these factors will not vary within the scope of our task, we can safely leave these out.</p>
<p>Decisions about whether a factor can be left out can be iterative. Often we will choose to begin with a simpler model and add in additional factors to improve the precision and accuracy of our model. For example, after building and experimenting with a simple model of the effect of recommendations, we might add in additional factors, such as whether users are viewing recommendations on mobile devices, tablets, or PCs, to help us better capture subtler effects.</p>
<p>Careful readers will note that, in this section, we have been referring to the “factors” that are relevant to a given question or system. Often, when we are first designing a causal graph, we will focus our thoughts on more abstract concepts and factors and then, only later, determine what specific measures in an experiment or features in a dataset can be used to represent those abstract factors.</p>
</section>
<section id="examples" data-number="1.3.2">
<h3 data-number="2.3.2"><span class="header-section-number">2.3.2</span> Examples</h3>
<p>In this section, we discuss three example scenarios, including toy causal models, the assumptions and modeling choices they represent, and how they might be extended and refined.</p>
<section id="example-1-online-product-purchases-and-recommendations" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Example 1: Online product purchases and recommendations</h4>
<p>Consider an online store that sells many products. Interest in each product may be driven by a number of product-specific factors, such as the quality of the product, product reviews, and marketing campaigns, as well as cross-product factors such as seasonality or brand reputation. There may be inherent complementarity or substitution among some products. For example, a person buying cookies may also become interested in buying milk. So, if a marketing campaign increases interest in cookies, it may also indirectly drive an increased interest in milk. Beyond these inherent relationships, the store, on each product’s web page, recommends several related products to people, thus potentially increasing interest in the recommended products. Recommendations are made based on some policy that the store might change. Suppose we want to better understand product browsing behavior under various recommendation policies; for example, is one recommendation policy more effective than another? Fig. 9 shows one causal graph that models the influences on aggregate product browsing behavior.<a href="#fn5" class="footnote-ref" id="fnref5" role="doc-noteref"><sup>5</sup></a></p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Example1_DAG.png" id="fig:onlinestore-dag" alt="" /><figcaption>Figure 9: Causal graph representation of product browsing behavior (<span class="math inline"><em>P</em><sub>0</sub></span> and <span class="math inline"><em>P</em><sub><em>i</em></sub></span>) at an online store; external product-specific demand (<span class="math inline"><em>D</em><sub>0</sub></span> and <span class="math inline"><em>D</em><sub><em>i</em></sub></span>); cross-product demand (<span class="math inline"><em>S</em></span>); and the influence of a recommendation policy wherein browsing one product (<span class="math inline"><em>P</em><sub>0</sub></span>) influences the likelihood of browsing another (<span class="math inline"><em>P</em><sub><em>i</em></sub></span>).</figcaption>
</figure>
<p>The basic modeling choices we make in constructing our causal model simplify our analysis tasks by limiting the factors we will consider. In making these choices, understanding the domain is critical to designing a model that is tailored to addressing a specific task while capturing all relevant factors.</p>
<p>Some of the choices we made when designing this model are explicit in the graph itself. For example, we chose to declare that the factors that influence product interest can be abstractly represented by a single cross-product demand factor and by many product-specific demand factors (exactly one per product). We are assuming that the various product-specific demand factors, such as price, and the shared demand factors, such as brand reputation, are not affected by product interest. And, we are assuming that the first product <span class="math inline"><em>P</em><sub>0</sub></span> is not affected by recommendations from other products (in fact, we are not modeling any recommendations from other products <span class="math inline"><em>P</em><sub><em>i</em></sub></span> at all).</p>
<p>Another modeling choice clearly stated in the graph is the set of exogenous vs. endogenous nodes. As our question was focused on the recommendation policy, these demand factors are represented as exogenous nodes. If our questions were focused on manipulating these demand factors (e.g., experimenting with new marketing campaigns), then we would want to add the causes of these demand factors into our causal graph.</p>
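<p>The exogenous and endogenous split can be read off a graph mechanically. The sketch below encodes the online store graph as an edge list and classifies each node by whether it has parents inside the graph. The edge list is our reading of the Fig. 9 caption and the surrounding text, so treat it as an assumption; the figure itself remains the reference. Classifying a node as endogenous whenever it has at least one parent is also a simplification of the definition given earlier.</p>

```python
# Edges of the online-store graph as read from the Fig. 9 caption and the
# surrounding text (an assumption; the figure itself is the reference).
edges = [
    ("D_0", "P_0"),  # product-specific demand drives browsing of product 0
    ("D_i", "P_i"),  # product-specific demand drives browsing of product i
    ("S", "P_0"),    # cross-product demand affects both products
    ("S", "P_i"),
    ("P_0", "P_i"),  # recommendation shown on product 0's page
]

nodes = sorted({node for edge in edges for node in edge})
parents = {n: sorted(src for src, dst in edges if dst == n) for n in nodes}

# A node with no modeled causes is exogenous; the rest are endogenous.
exogenous = [n for n in nodes if not parents[n]]
endogenous = [n for n in nodes if parents[n]]
print(exogenous)
print(endogenous)
```

<p>Adding a cause of a demand factor, such as a hypothetical marketing-campaign node pointing at <code>D_0</code>, would move <code>D_0</code> from the exogenous list to the endogenous one without any other change to the code.</p>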
<p>Other choices are not represented in the graph, but are more subtly included within the definition of the nodes. For example, our product interest nodes are aggregated over all people browsing at an online store. In turn, our demand factors also represent factors that influence demand at an aggregate or population level. Alternatively, we could also have chosen to model product interest at an individual level. In that case, we probably would have also included more attributes about an individual, allowing us to study interactions between demand factors and individuals.</p>
<p>Our model also does not allow for the possibility that influence changes over time. Would the novelty of a recommendation system initially drive more interest in recommended products, but then fade over time? This particular model would not allow us to capture or analyze such changes in effect. In addition, we do not consider the relationship to other pages that may also show recommendations. For example, <span class="math inline"><em>P</em><sub>0</sub></span> itself may be a recommended product on the page of some other product <span class="math inline"><em>P</em><sub>2</sub></span>, in which case browsing events for <span class="math inline"><em>P</em><sub>0</sub></span> are partially caused by the recommendation system too.</p>
<p>As we work with a model and refine it over time, we can revisit these design choices. We might split out demand factors in more detail, include individual-level information in our graph, or explicitly model time. How we refine a model will depend on how our understanding of the domain evolves as we gain experience, how our core questions and goals change, and the practical limitations of our experimental setting or data gathering framework.</p>
<figure>
<img src="/assets/Figures/Chapter2/Ch2_Example2_DAG-crop.png" id="fig:datacenter-dag" alt="" /><figcaption>Figure 10: Causal graph representation of energy conservation in a data center. The load (<span class="math inline"><em>L</em><sub><em>t</em> − 1</sub></span>) and the number of idle servers (<span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span>) at the previous time-step determine the number of servers turned on (<span class="math inline"><em>M</em><sub><em>t</em></sub></span>); <span class="math inline"><em>M</em><sub><em>t</em></sub></span> and the new load (<span class="math inline"><em>L</em><sub><em>t</em></sub></span>) determine the number of restarted servers (<span class="math inline"><em>R</em><sub><em>t</em></sub></span>); and <span class="math inline"><em>R</em><sub><em>t</em></sub></span> and <span class="math inline"><em>M</em><sub><em>t</em></sub></span> together determine the cost (<span class="math inline"><em>C</em><sub><em>t</em></sub></span>).</figcaption>
</figure>
</section>
<section id="example-2-energy-conservation-in-a-data-center" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Example 2: Energy conservation in a data center</h4>
<p>Consider a data center containing thousands of servers. One key challenge in data centers is to minimize energy consumption while meeting stated performance objectives. One way to do so is to put idle servers into a low-power mode or turn them off. However, if demand exceeds availability, the idle servers need to be restarted, which introduces a delay in the system. We thus want to maximize the energy savings while minimizing the delay due to unavailability.</p>
<p>Fig. 10 shows a causal graph for this system. Let us assume that there is a predictive system that decides the number of servers to keep on based on a prediction of the load due to customers’ requests in the next time-step. This prediction uses the load at the previous time-step (<span class="math inline"><em>L</em><sub><em>t</em> − 1</sub></span>) and the number of idle servers at the previous time-step (<span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span>) to decide the number of servers that are turned on at time <span class="math inline"><em>t</em></span>, <span class="math inline"><em>M</em><sub><em>t</em></sub></span>. We then observe the new load <span class="math inline"><em>L</em><sub><em>t</em></sub></span>, which, along with <span class="math inline"><em>M</em><sub><em>t</em></sub></span>, determines the number of servers that need to be restarted, <span class="math inline"><em>R</em><sub><em>t</em></sub></span>. The number of restarted servers <span class="math inline"><em>R</em><sub><em>t</em></sub></span> and the number of running servers <span class="math inline"><em>M</em><sub><em>t</em></sub></span> together determine the cost of the predictive system, <span class="math inline"><em>C</em><sub><em>t</em></sub></span>. Note that the loads at consecutive time-steps share a common cause, namely customers’ request patterns; we show this through a dashed double-sided arrow.</p>
<p>Using this graph, we can answer a number of interesting questions. We can estimate the effect of using the predictive system’s output <span class="math inline"><em>M</em><sub><em>t</em></sub></span> versus not turning off any servers. We can also compare the current predictive model to another model to evaluate which one leads to a better overall cost.</p>
<p>Note how this graph was constructed based on a high-level knowledge of the system architecture. Nodes like <span class="math inline"><em>M</em><sub><em>t</em></sub></span> may themselves be computed using complex machine learning models, but we choose to abstract them out into single nodes. We also chose to construct a causal graph over a single time-step though the system runs continuously. That is, we ignored the fact that <span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span> itself is a function of the model’s prediction at time <span class="math inline"><em>t</em> − 1</span>, <span class="math inline"><em>M</em><sub><em>t</em> − 1</sub></span>. In the specific case where the model <span class="math inline"><em>M</em></span> utilizes data from only the previous time-step, this is a valid simplification since the model treats <span class="math inline"><em>I</em><sub><em>t</em> − 1</sub></span> as a new, independent value. In other cases, such a simplification may lead to errors due to ignoring the feedback loop between the model and idle servers at different points in time. Additionally, there are a number of intermediate, recorded nodes that go from shutting off a server to energy consumption, but we chose to not include them since they are not the focus of the question. For another question—for example, the behavior of hardware components while saving energy—including those measurements will be critical.</p>
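<p>The single time-step graph described above can be written down as a parent map and checked mechanically. The sketch below follows the textual description of Fig. 10. The node names (<code>L_prev</code>, <code>I_prev</code>, and so on) are our own labels, and the code assumes Python 3.9 or later for the standard-library <code>graphlib</code> module. It confirms that the directed part of the graph is acyclic and lists every node that can influence the cost.</p>

```python
from graphlib import TopologicalSorter

# Parents of each node, following the description of Fig. 10 in the text.
parents = {
    "L_prev": [],                 # load at time t-1
    "I_prev": [],                 # idle servers at time t-1
    "M_t": ["L_prev", "I_prev"],  # servers kept on at time t
    "L_t": [],                    # load at time t
    "R_t": ["L_t", "M_t"],        # servers that must be restarted
    "C_t": ["R_t", "M_t"],        # cost of the predictive system
}
# The dashed double-sided arrow: an unobserved common cause of both loads.
bidirected = [("L_prev", "L_t")]

# static_order() raises graphlib.CycleError if the directed edges form a cycle.
order = list(TopologicalSorter(parents).static_order())


def ancestors(node):
    """All nodes with a directed path into `node`."""
    found = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


print(order)
print(sorted(ancestors("C_t")))
```

<p>Modeling the feedback loop mentioned above would mean unrolling the graph over several time-steps, with one copy of each node per step, so that the graph stays acyclic.</p>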
<figure>
<img src="/assets/Figures/Chapter2/MnistExamples.png" id="fig:mnist-data" alt="" /><figcaption>Figure 11: MNIST: A dataset of hand-written digits. Some images are rotated by an angle up to 90 degrees.</figcaption>
</figure>
</section>
<section id="example-3-rotated-mnist-images" class="unnumbered" data-number="">
<h4 class="unnumbered" data-number="1">Example 3: Rotated MNIST images</h4>
<p>As our third and final example, we consider the well-known handwriting recognition dataset called MNIST. This dataset contains images of digits and the task is to detect the digit shown in each image. We consider a subset of digit classes from <span class="math inline">0</span> to <span class="math inline">4</span> and include a twist: each image is rotated by an angle determined by some unknown data-generating process. Fig. 11 shows a random sample of the images from this rotated MNIST dataset.</p>
<p>To start with, it is hard to think of a causal graph for this system. All we are given is a set of static images without any flow of information or causality. To proceed, let us try to reconstruct what process may have generated this data. Thinking of how people write down numeric digits, it is possible that people decided on the digit they wanted to write and then wrote that digit. Thus, the digit class causes the specific shape we see in an image. Alternatively, the images may have been sampled from a random collection of shapes and someone might have selected them manually and labelled them as one of the ten digits. In this case, it is the shape of the region in an image that causes its digit classification. In either case, we can assume that there is a causal relationship between a specific shape and the digit class. We represent it using an undirected causal edge in Fig. 12.</p>
<p>In addition to shape, the angle of rotation seems to be associated with the digit. In Fig. 11 the digits <span class="math inline">0</span> and <span class="math inline">2</span> are never rotated much whereas other digits are rotated up to 90 degrees. However, from our understanding of digit recognition, we can safely assume that the angle of rotation cannot determine the digit. It may be that some images were rotated before they were captured or that these images were rotated based on their digit class after they were captured. In the first case, we can assume that some unobserved process causes both the digit class and its rotation. In the second, the digit class causes an unknown variable that decides the angle of rotation. Causal graphs for these mechanisms are shown in Figs. 12 (a), 12 (b).</p>
<p>While we do not know the exact mechanism, this set of causal graphs provides important information for building a classifier that can generalize to data distributions beyond the current one. Since shape is causally related to the digit class in all graphs, it should be included in a predictive model. However, in both graphs, digit and angle of rotation share no direct relationship. Specifically, their relationship depends completely on an unobserved node connecting them that acts as a fork (Fig. 12 (a)) or as the central node in a chain (Fig. 12 (b)). If the value of the unobserved node changes, their relationship also changes. In other words, given the unobserved node <span class="math inline"><em>U</em></span>, angle of rotation and digit class are d-separated, because <span class="math inline"><em>U</em></span> is either a fork or the centre of a chain on the path between them. Therefore, these graphs imply that angle of rotation is not a causal feature and should not be included in any predictive model for digit recognition.</p>
<p>This example underscores the point that a causal graph need not be complete or uniquely determined to be useful. As we noted before, we are looking for graphs that capture the major assumptions and constraints that can be known from domain knowledge, not the full causal graph of a system, which may be infeasible to obtain. Thus, it is more helpful to have an incomplete graph than no graph at all.</p>
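<p>The fork case can be checked with exact arithmetic. The sketch below builds the joint distribution implied by a fork in which <span class="math inline"><em>U</em></span> causes both the digit and the angle, using made-up conditional probabilities and binary variables as a simplification. It then measures the association between digit and angle overall, under a shifted distribution of <span class="math inline"><em>U</em></span>, and within a fixed value of <span class="math inline"><em>U</em></span>. It covers only the fork graph; the chain graph would need its own factorization.</p>

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical conditional probabilities for the fork U -> digit, U -> angle.
P_DIGIT0_GIVEN_U = {0: F(8, 10), 1: F(2, 10)}   # P(digit = 0 | U = u)
P_ANGLE0_GIVEN_U = {0: F(9, 10), 1: F(3, 10)}   # P(angle = 0 | U = u)


def joint(p_u1):
    """Exact joint P(U, digit, angle) implied by the fork factorization."""
    table = {}
    for u, digit, angle in product((0, 1), repeat=3):
        p_u = p_u1 if u == 1 else 1 - p_u1
        p_d = P_DIGIT0_GIVEN_U[u] if digit == 0 else 1 - P_DIGIT0_GIVEN_U[u]
        p_a = P_ANGLE0_GIVEN_U[u] if angle == 0 else 1 - P_ANGLE0_GIVEN_U[u]
        table[(u, digit, angle)] = p_u * p_d * p_a
    return table


def dependence(table, given_u=None):
    """P(digit=0, angle=0) - P(digit=0) * P(angle=0), optionally within U = given_u."""
    rows = {k: v for k, v in table.items() if given_u is None or k[0] == given_u}
    total = sum(rows.values())
    p_both = sum(v for (u, d, a), v in rows.items() if d == 0 and a == 0) / total
    p_digit = sum(v for (u, d, a), v in rows.items() if d == 0) / total
    p_angle = sum(v for (u, d, a), v in rows.items() if a == 0) / total
    return p_both - p_digit * p_angle


balanced = joint(F(1, 2))
shifted = joint(F(1, 10))
print(dependence(balanced))             # marginal association between digit and angle
print(dependence(shifted))              # a different association when P(U) shifts
print(dependence(balanced, given_u=0))  # no association once U is held fixed
```

<p>Under these assumptions, the association between digit and angle depends on how <span class="math inline"><em>U</em></span> is distributed and disappears once <span class="math inline"><em>U</em></span> is fixed. A classifier that relies on angle would therefore be relying on a relationship that can change from one dataset to the next.</p>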
<div id="fig:mnist-causal-graph" class="subfigures">
<table style="width:60%;">
<colgroup>
<col style="width: 30%" />
<col style="width: 30%" />
</colgroup>
<tr class="odd">
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Mnist_graph2.png" id="fig:mnist-causal-graph1" style="width:100.0%" alt="" /><figcaption>a</figcaption>
</figure></td>
<td style="text-align: center;"><figure>
<img src="/assets/Figures/Chapter2/Ch2_Mnist_graph1.png" id="fig:mnist-causal-graph2" style="width:100.0%" alt="" /><figcaption>b</figcaption>
</figure></td>
</tr>
</table>
<p>Figure 12: Possible causal graphs for the rotated MNIST images dataset. In both graphs, <span class="math inline"><em>U</em></span> is unobserved. </p>
</div>
</section>
</section>
</section>
<section id="sec:po-framework" data-number="1.4">
<h2 data-number="2.4"><span class="header-section-number">2.4</span> Potential Outcomes Framework</h2>
<p>The <em>potential outcomes (PO) framework</em> is an alternative to causal graphs for reasoning about causal assumptions and setting up analyses. While causal graphs focus on the structure of the causal relationships themselves as the primary language for declaring assumptions, the potential outcomes framework places its focus on causal inference as a missing data problem. Recall from Chapter <a href="#ch_patternsandpredictionsarenotenough" data-reference-type="ref" data-reference="ch_patternsandpredictionsarenotenough">3</a> that the causal effect is defined as the difference in outcomes <span class="math inline"><em>Y</em></span> between a world where treatment is given, <span class="math inline">World(<em>T</em> = 1)</span>, and a world where treatment is not given, <span class="math inline">World(<em>T</em> = 0)</span>. <span id="eq:causal-effect-worlds"><br /><span class="math display">Causal Effect = <em>Y</em><sub>World(<em>T</em> = 1)</sub> − <em>Y</em><sub>World(<em>T</em> = 0)</sub></span><br /></span> The potential outcomes (PO) framework formalizes the notion of <span class="math inline"><em>Y</em></span>’s value in different worlds as a new statistical variable. Specifically, for every outcome <span class="math inline"><em>Y</em></span>, it defines a set of potential outcomes based on different values of the treatment, <span class="math inline"><em>Y</em><sub><em>T</em></sub></span>. The key point is that only one of the potential outcomes <span class="math inline"><em>Y</em><sub><em>T</em> = <em>t</em></sub></span> is observed and the remaining potential outcomes are unobserved. In other words, there is a single observed outcome and the goal is to estimate all other unobserved, potential outcome values that <span class="math inline"><em>Y</em></span> could have taken under a different <span class="math inline"><em>T</em></span>. In the PO framework, these different values of the outcome are denoted by a subscript, <span class="math inline"><em>Y</em><sub><em>T</em> = <em>t</em>′</sub></span>.</p>
<p>Note that a potential outcome is not the same as probabilistic conditioning. Critically, <span class="math inline"><em>Y</em><sub><em>T</em> = 1</sub></span> does not correspond to conditioning on <span class="math inline"><em>T</em> = 1</span> in the observed data, but rather conveys the causal relationship between <span class="math inline"><em>Y</em></span> and <span class="math inline"><em>T</em></span>. That is, <span class="math inline"><em>Y</em><sub><em>T</em> = 1</sub></span> represents an intervention on <span class="math inline"><em>T</em></span> by setting it to <span class="math inline">1</span> without changing the rest of the world (i.e., all other relevant variables, both observed and unobserved, are held constant). Thus, in general, <span class="math inline"><em>P</em>(<em>Y</em><sub><em>T</em> = 1</sub>) ≠ <em>P</em>(<em>Y</em>|<em>T</em> = 1)</span>. For a binary treatment, potential outcomes provide a succinct way to formalize Eq. 5 for the causal effect. <br /><span class="math display">$$\begin{split}
\text{Causal Effect} = \mathbb{E}[Y_{T=1}- Y_{T=0}] = \mathbb{E}[Y_{T=1}]- \mathbb{E}[Y_{T=0}]
\end{split}$$</span><br /> Since, for any particular unit of treatment (e.g., a person) we can only observe one of the potential outcomes, the PO framework translates the problem of causal inference to that of estimating the missing potential outcome. For example, if we observe <span class="math inline"><em>Y</em><sub><em>T</em> = 1</sub></span> for a particular unit, then we can calculate the causal effect of <span class="math inline"><em>T</em></span> if we can correctly estimate the unobserved potential outcome <span class="math inline"><em>Y</em><sub><em>T</em> = 0</sub></span>. Note that randomized experiments conveniently allow our observations of potential outcomes in one randomized group to provide unbiased estimates of the unobserved potential outcomes of other randomized groups.</p>
<p>While the potential outcome framework is most commonly used for estimating the effect of treatment, the framework itself is general and can be used to denote the potential value of any variable. For example, <span class="math inline"><em>A</em><sub><em>B</em> = 1</sub></span> represents the potential value of <span class="math inline"><em>A</em></span> when <span class="math inline"><em>B</em></span> is set to 1.</p>
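<p>The gap between a potential outcome and a conditional distribution can be made concrete with a small simulation. The sketch below is illustrative only: the confounder <code>w</code>, the assignment probabilities, and the true effect of 1 are all assumptions chosen for the example. Because the data are simulated, both potential outcomes are available for every unit, which is never the case with real data.</p>

```python
import random
from statistics import fmean

def simulate_units(n=20000, seed=0):
    """Simulate units that carry both potential outcomes; only one is observed."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        w = rng.random() < 0.5                      # hypothetical confounder
        y0 = (2.0 if w else 0.0) + rng.gauss(0, 1)  # potential outcome under T = 0
        y1 = y0 + 1.0                               # potential outcome under T = 1
        # Confounded assignment: units with w = True are treated more often.
        t = 1 if rng.random() < (0.8 if w else 0.2) else 0
        y_obs = y1 if t == 1 else y0                # the other outcome stays missing
        units.append({"t": t, "y0": y0, "y1": y1, "y_obs": y_obs})
    return units

units = simulate_units()

# E[Y_{T=1}]: average over ALL units, knowable only because this is a simulation.
mean_y1 = fmean(u["y1"] for u in units)
# E[Y | T = 1]: average of observed outcomes among units that happened to be treated.
mean_y_given_t1 = fmean(u["y_obs"] for u in units if u["t"] == 1)

true_effect = fmean(u["y1"] - u["y0"] for u in units)
naive_difference = mean_y_given_t1 - fmean(u["y_obs"] for u in units if u["t"] == 0)
```

<p>Under these assumed settings, <code>mean_y_given_t1</code> should come out noticeably larger than <code>mean_y1</code>, and <code>naive_difference</code> should overstate <code>true_effect</code>, because treated units are disproportionately those with <code>w = True</code>.</p>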
<section id="comparing-potential-outcomes-with-causal-graphs" data-number="1.4.1">
<h3 data-number="2.4.1"><span class="header-section-number">2.4.1</span> Comparing potential outcomes with causal graphs</h3>
<p>Since the potential outcome variables also convey causal relationships, it is important to compare them to structural causal models. In many ways, causal graphs and potential outcomes are compatible. They both emphasize the difference between statistical conditioning and causal effect. While causal graphs do so by providing a representation of flow of causality without using statistical variables, potential outcomes do so by creating entirely new statistical variables. For instance, a two-node graph like that in Fig. 1 (a) can be represented by <span class="math inline"><em>B</em><sub><em>A</em> = <em>a</em></sub></span> where the choice of subscript variable denotes an assumption about the direction of causal effect.</p>
<p>This difference becomes more pronounced when we want to represent assumptions about more complex systems. The PO framework does not have a good representation for relationships between variables other than the treatment and outcome. Instead, the focus is on reducing those relationships to questions about their effects on <span class="math inline"><em>T</em></span> and <span class="math inline"><em>Y</em></span>. For example, an analyst in the PO framework may ask whether the treatment assignment mechanism is known, whether the treatment is randomly assigned, or whether there are other variables that cause the treatment. If there are such variables, do they also cause the outcome <span class="math inline"><em>Y</em></span>? Based on the answers to these questions, an analyst will then decide how to identify and estimate the missing potential outcome. The advantage of the PO framework is the (small) number of well-tested and trusted identification and estimation strategies developed within it for finding the causal effect of a treatment. However, each of these strategies requires specific assumptions about the treatment assignment mechanism, often including the shape of the functions governing the underlying mechanisms.</p>
<p>In contrast, the SCM framework focuses on making all the assumptions as transparent as possible. When confronted with a non-randomized treatment assignment, an analyst constructs a graph expressing their assumptions about the causal mechanisms in the system. For instance, they may ask: What factors are causing the treatment? Are there specific structures among them (e.g., colliders) that can be exploited? If there is confounding, is it because there is missing data or fully unobserved confounders? Such an analysis brings out all the causal assumptions that go into a future identification and estimation exercise, which unfortunately are opaque in the PO framework. Moreover, a graph is a more general construct for causal reasoning that can be useful for many other questions about a system, beyond a specific effect. Once a causal graph is constructed, it allows questions about the effect between any pair of nodes, the effect of groups of nodes, the causal features for a particular outcome, the cascading nature of certain causal effects, and so on.</p>
<p>Put another way, the PO framework focuses directly on estimation of the effect, whereas the SCM framework emphasizes specifying the causal mechanisms first. Given that an effect estimate depends heavily on the causal assumptions that go in, the importance of transparency in causal assumptions cannot be overstated. Unlike predictive machine learning estimates that can be validated objectively using cross-validation metrics, no such validation procedure exists for causal estimates. Thus, a qualitative benefit of causal graphs is that they are essentially a simple-to-interpret diagram that can be shared with different stakeholders, promoting a transparent and informed discussion about the causal assumptions that went into an analysis.</p>
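<p>To illustrate how a single graph can answer questions about many pairs of nodes, here is a minimal sketch that stores a causal graph as a plain dictionary. The graph itself (a confounder <code>W</code>, a treatment <code>T</code>, a mediator <code>M</code>, and an outcome <code>Y</code>) and the helper names are hypothetical, chosen only for illustration.</p>

```python
# Hypothetical causal graph: each key maps to the nodes it directly affects.
graph = {
    "W": ["T", "Y"],  # W is a common cause of T and Y
    "T": ["M"],       # T affects Y only through the mediator M
    "M": ["Y"],
    "Y": [],
}

def descendants(graph, node):
    """Return every node reachable from `node` along directed edges."""
    seen, stack = set(), list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen

def can_affect(graph, cause, effect):
    """True if the graph contains a directed path from `cause` to `effect`."""
    return effect in descendants(graph, cause)

def common_causes(graph, a, b):
    """Nodes with directed paths to both `a` and `b` (candidate confounders only)."""
    return {
        node for node in graph
        if node not in (a, b)
        and can_affect(graph, node, a)
        and can_affect(graph, node, b)
    }
```

<p>For example, <code>can_affect(graph, "T", "Y")</code> asks whether the assumed graph permits any effect of <code>T</code> on <code>Y</code>, and <code>common_causes(graph, "T", "Y")</code> lists shared ancestors. The latter is only a rough screen: a fuller check would require that the path into the second node does not pass through the first.</p>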
<figure>
<img src="/assets/Figures/Chapter2/Ch2_po-versus-graphs-crop.png" id="fig:po-causal-graph" alt="" /><figcaption>Figure 13: A single-confounder system represented in the structural causal model and potential outcome frameworks.</figcaption>
</figure>
</section>
<section id="mixing-causal-and-statistical-assumptions" data-number="1.4.2">
<h3 data-number="2.4.2"><span class="header-section-number">2.4.2</span> Mixing causal and statistical assumptions</h3>
<p>More fundamentally, however, the PO framework mixes causal and statistical assumptions within the same representation. To illustrate this point, we provide a simple modelling exercise using the PO framework. Assume a three-variable system where <span class="math inline"><em>W</em></span> is a common cause of <span class="math inline"><em>T</em></span> and <span class="math inline"><em>Y</em></span>. Under the PO framework, we may write a regression equation, <span id="eq:po-regression"><br /><span class="math display">$$\label{eq:po-regression}
\begin{split}
y = f(x, w) + \epsilon ; \ \ \ \mathbb{E}[\epsilon|x,w]=0
\end{split}\qquad(6)$$</span><br /></span> This equation represents several assumptions about the underlying system. First, it conveys the direction of the causal relationship. Implicitly, the LHS, <span class="math inline"><em>y</em></span>, is the effect and the terms on the RHS, <span class="math inline"><em>f</em>(<em>x</em>, <em>w</em>) + <em>ϵ</em></span>, are its causes. Second, by assuming that the conditional expected value of the error term is <span class="math inline">0</span>, it conveys that the error <span class="math inline"><em>ϵ</em></span> is mean-independent of the RHS variables, <span class="math inline"><em>X</em></span> and <span class="math inline"><em>W</em></span>. Third, this also serves as an estimating equation. The same equation is used for estimating the effect by making assumptions on the family of functions (e.g., all linear functions) and fitting a particular <span class="math inline"><em>f̂</em></span> to available data.</p>
<p>More generally, a single PO equation simultaneously conveys three of our four stages of causal reasoning: modelling, identification, and estimation. It often includes both causal assumptions (such as direction of effect) and statistical assumptions (such as the family of estimating functions). To some degree, this brevity can be seen as a strength. Yet it can also be a weakness when a concise representation leads to causal assumptions being made implicitly, or sometimes asserted separately in a less rigorous notation (i.e., natural language). While we can see that both graphs and the PO representation convey similar ideas, in this book we prefer using causal graphs and structural equations for modeling causal assumptions to more clearly distinguish causal from statistical assumptions.</p>
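<p>As a rough sketch of how the regression equation above doubles as an estimating equation, the following simulation assumes a linear <code>f</code>, so that <code>y = b_x * x + b_w * w + noise</code>. The coefficients (1.5, 2.0, 0.8) are invented for the example. To stay within the standard library, the coefficient on <code>x</code> is obtained by residualizing on <code>w</code> (the Frisch-Waugh-Lovell result) rather than by calling a regression package; this yields the same value as the least-squares coefficient on <code>x</code> in a regression of <code>y</code> on both <code>x</code> and <code>w</code>.</p>

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Least-squares slope of ys on xs (with an intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def residuals(xs, ys):
    """Residuals of ys after a least-squares fit on xs (with an intercept)."""
    b, mx, my = slope(xs, ys), mean(xs), mean(ys)
    return [(y - my) - b * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(1)
n = 20000
w = [rng.gauss(0, 1) for _ in range(n)]                   # common cause
x = [0.8 * wi + rng.gauss(0, 1) for wi in w]              # treatment depends on w
y = [1.5 * xi + 2.0 * wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]

unadjusted = slope(x, y)                                  # ignores w
adjusted = slope(residuals(w, x), residuals(w, y))        # coefficient on x given w
```

<p>With these assumed mechanisms, <code>adjusted</code> should land close to the simulated coefficient of 1.5, while <code>unadjusted</code> absorbs part of the influence of <code>w</code>. Note that the causal reading of either number rests on the assumptions built into the simulation, not on the regression itself.</p>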
</section>
<section id="the-best-of-both-frameworks" data-number="1.4.3">
<h3 data-number="2.4.3"><span class="header-section-number">2.4.3</span> The best of both frameworks</h3>
<p>We will see in the following chapters that, while the PO framework has some weaknesses in the modeling stage of causal analysis, it provides useful, common recipes for the identification stage of causal analysis, and shines in causal effect estimation. The PO framework provides a suite of well-tested and broadly used methods for estimation, based on constraints of function families, size of data or its dimensionality. Because it directly deals with statistical equations, the PO framework is also better equipped to handle constraints in a data-generating process like monotonicity of effect.</p>
<p>In this book, therefore, we mix and match elements of causal graphs and potential outcomes across the four stages of causal analysis—modeling, identification, estimation, and refutation. While we use primarily causal graphs and structural equations for capturing models and assumptions in the first stage, we will use both causal-graph based and potential outcomes identification, estimation, and refutation strategies.</p>
</section>
</section>
</section>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>We will discuss methods for handling cycles in causal graphs in Chapter <a href="#ch_practicalconsiderations" data-reference-type="ref" data-reference="ch_practicalconsiderations">10</a>.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2" role="doc-endnote"><p>There are more complicated causal graph notations that include specific annotations (different kinds of arrows and nodes) to indicate specific classes of interactions, such as mediation and interaction, though other kinds of interactions remain ambiguous. While we believe such notation can be useful, in this book, we will keep to the simpler graph notation both for simplicity of presentation and to avoid over-emphasizing some kinds of interactions over others.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3" role="doc-endnote"><p>Readers already familiar with structural equations might miss the <span class="math inline"><em>ϵ</em></span> noise factor. Do not worry, we will add them in soon, in Section <a href="#sec:refininggraphs-noise" data-reference-type="ref" data-reference="sec:refininggraphs-noise">2.2.2</a>.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn4" role="doc-endnote"><p>This does not necessarily mean we will expect to be able to measure everything in the graph. There might be factors that we have added to our graph that will remain unobserved in our datasets. We will expand on unobserved nodes and their implications in Section <a href="#sec:refininggraphs-unobserved" data-reference-type="ref" data-reference="sec:refininggraphs-unobserved">2.2.3</a>.<a href="#fnref4" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn5" role="doc-endnote"><p>Fig. 9 uses <em>plate notation</em> to represent the influence of <span class="math inline"><em>P</em><sub>0</sub></span> on a total of <span class="math inline"><em>N</em></span> products <span class="math inline"><em>P</em><sub><em>i</em></sub></span>. In plate notation, the rectangle with marker <span class="math inline"><em>N</em></span> is a summary of a repeated graph structure. I.e., the nodes <span class="math inline"><em>D</em><sub><em>i</em></sub></span> and <span class="math inline"><em>P</em><sub><em>i</em></sub></span> inside the rectangle are repeated <span class="math inline"><em>N</em></span> times.<a href="#fnref5" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>
Fri, 20 Mar 2020 00:00:00 +0000
https://causalinference.gitlab.io/causal-reasoning-book-chapter2/
https://causalinference.gitlab.io/causal-reasoning-book-chapter2/Chapter 1: Causal Reasoning Book<h1 id="ch_patternsandpredictionsarenotenough">Introduction</h1>
<p>Machine learning algorithms are increasingly embedded in applications
and systems that touch upon almost every aspect of our work and daily
lives, including in societally critical domains such as healthcare,
education, governance, finance and agriculture. Because decisions in all
these domains can have wide ramifications, it is important that we not
only understand why a system makes a decision, but also understand the
effects (and side-effects) of that decision, and how to improve
decision-making to achieve more desirable outcomes. As we shall see in
this chapter, all of these questions are what we call <em>causal</em> questions
and, unlike many conventional machine learning tasks, cannot be answered
using only passively observed data. Moreover, we will show how even
questions that do not seem causal, such as pure prediction questions,
can also benefit from causal reasoning.</p>
<p>First, let us briefly and informally define causal reasoning as the
study of cause-and-effect questions, such as: Does A cause B? Does
recommending a product to a user make them more likely to buy it? If so,
how much more likely? What makes a person more likely to repay a loan or
be a good employee (or employer)? If the weather gets hot, will crops
wilt? Causal reasoning is the study of these questions and how to answer
them.</p>
<p>In this book, we focus on causal reasoning in the context of machine
learning applications and computing systems more broadly. While most of
what we know about causal reasoning from other domains remains useful in
the context of computing applications, computing systems offer a unique
set of challenges and opportunities that can enrich causal reasoning. On
the one hand, the scale of data, its networked nature, and
high-dimensionality challenge standard methods used for causal
reasoning. On the other hand, these systems allow control over data
gathering and measurement and allow easy combination of passive
observations and active experimentation, thereby providing opportunities
for rethinking typical methods for causal reasoning.</p>
<p>This introductory chapter motivates the use of causal reasoning. Given
how machine learning systems are being used in almost all parts of our
society, we discuss a wide range of use-cases ranging from recommender
systems in online shopping and algorithmic decision support in medicine
to hiring and criminal justice; and data-driven management for
agriculture and industrial applications. We discuss how these are all
fundamentally interventions that require causal analysis for
understanding their effects. Further, we frame the connections between
causal reasoning and critical machine learning challenges, such as
domain adaptation, transfer learning, and interpretability.</p>
<h2 id="what-causal-reasoning">What is Causal Reasoning?</h2>
<h3 id="brief-philosophy">Brief Philosophy</h3>
<p>Causal reasoning is an integral part of scientific inquiry, with a long
history starting from ancient Greek philosophy. Fields ranging from
biomedical to social sciences rely on causal reasoning to evaluate
theories and answer substantive questions about the physical and social
world that we inhabit. Given its importance, it is remarkable that many
of the key statistical methods have been developed only in the last few
decades. As Gary King, a professor at Harvard University, puts it,</p>
<blockquote>
<p><em>“More has been learned about causal inference in the last few decades
than the sum total of everything that had been learned about it in all
prior recorded history”.</em></p>
</blockquote>
<p>This might seem puzzling. If causal reasoning is so critical, then why
hasn’t it become as common a form of reasoning as logical or
probabilistic reasoning? The issue is that “causality” itself is a
nebulous concept. From Aristotle to Hume and Kant, many philosophers and
scholars have attempted to define causality but have not reached a
consensus so far.</p>
<p>To understand the difficulty, let us first ask you, the reader, to let
go of this book and drop it on the floor—and then pick it up again and
continue reading! Now, let us ask, what was the cause of the book
dropping? Did the book fall because you let go of the book? Or did the
book fall because we, the authors, asked you, the reader, to drop it?
Perhaps you would have let go of the book even if we had not asked you
to. Maybe it was gravity. Perhaps the book fell because the reader is
not an astronaut reading the book in space.</p>
<p>This simple example of the falling book illustrates many of the
important, philosophical challenges that have vexed philosophers’
efforts to conceptualize causality. These include basic concepts of
abstractions, as well as sufficient and necessary causes. E.g., gravity
is of course necessary, but not sufficient, to cause the book to
fall—gravity together with the reader letting go of the book is both
necessary and sufficient for the book to fall. This example also
illustrates both proximate and ultimate causes. E.g., the reader
dropping the book is a proximate cause, the authors asking the reader to
drop the book may be a more distant, ultimate cause. Finally, this
example raises the question of whether causes must be deterministic. In
other words, does the likelihood that not all (or even most) readers are
suggestible enough to drop this book when asked imply that the authors’
request is not a cause at all? Or is it possible for our request to be
considered a probabilistic cause?</p>
<p>Hume asks how we know—how we learn—what causes an event? Consider the
simple act of striking a match and observing that it lights up. Would we
say that striking causes the match to light up? Believing in data, say
we repeat this action 1000 times and observe the same outcome each time.
Hume argues that, while this may seem to provide strong evidence that
striking the match causes it to light, this specific experiment only
demonstrates predictability, and argues that these results are
indistinguishable from the case where the two events just happen to be
perfectly correlated with each other. Hume proposed this quandary in his
book, “A Treatise of Human Nature”, first published in 1739, and concluded
that causality must be a mental construct that we assign to the world, and
thus does not exist outside the mind. Other scholars challenge this
provocation and argue for the existence of causality.</p>
<p>These philosophical challenges are essentially questions of abstraction.
Modern advances in causal reasoning have <em>not</em> come through answering
most of these provocations directly but, rather, by creating flexible
methods for reasoning about the relationships between causes and effects
regardless of the abstractions one chooses. In this book, therefore, we
attempt to steer clear of the above philosophical ambiguities and adopt
one of the simpler and more practical approaches to causal reasoning,
known as the interventionist definition of causality.</p>
<h3 id="defining-causation">Defining Causation</h3>
<p><strong>Definition</strong>: In the interventionist definition of causality, we say
that an event <em>A</em> causes another event <em>B</em> if we observe a difference in
<em>B</em>’s value after <em>changing</em> <em>A</em>, <em>keeping everything else constant</em>.</p>
<p>Due to causal reasoning’s early applications in medicine (which we will
discuss in chapter 3), it is customary to call
<em>A</em> the “treatment” (also sometimes called “exposure”), or simply the
cause. <em>B</em> is referred to as the “outcome”. Readers familiar with
reinforcement learning may analogize <em>A</em> as the “action” and <em>B</em> as the
“reward”. In general, these events are associated with <em>measurement
variables</em> that describe them quantitatively, e.g., the dosage of a
treatment drug and its outcome in terms of blood pressure, which we
refer to as the treatment variable and the outcome variable
respectively. For convenience, we use events and their measurement
variables interchangeably, but it is important to remember that
causality is defined over events, and that the same events can
correspond to different variables when measured differently.</p>
<h3 id="interventions-and-counterfactuals">Interventions and Counterfactuals</h3>
<p>There are two phrases in the above definition that need further
unpacking: “changing A”, and “keeping everything else constant”. These
correspond to the two key concepts in causal reasoning: an
<em>intervention</em> and a <em>counterfactual</em> respectively. An intervention
refers to any action that actively changes the value of a treatment
variable. Examples of an intervention are giving a medicine to a
patient, changing the user interface of a website, awarding someone a
loan, and so on. It is important to distinguish it from simply observing
different values of the treatment. That is, assigning specific people to
try out a new feature of a system is an intervention, but not if people
found out and tried the feature themselves. While this might seem a
small difference, its importance cannot be overstated: these two are
fundamentally different and can lead to different, and even opposite,
conclusions when analyzing the resultant data. In particular, in the
observational case, it is hard to know whether any observed effect
(e.g., increased usage) is due to the feature or due to characteristics
of the people (e.g., high-activity users) who were able to discover the
feature. The history of causal reasoning is replete with examples where
observations were used in place of interventional data, sometimes with
disastrous results. We will discuss some of them in the book.</p>
<p><strong>Intervention:</strong> An action that actively changes the distribution
of a variable <em>T</em>.</p>
<p>To gain a valid interpretation of its effect, however, an intervention
must be performed “keeping everything else constant”. That is, it is not
enough to take an action; we must also ensure that none of the other
relevant factors change, so that we can isolate the effect of the
intervention. Continuing our example on estimating the effect of a new
feature, it is not enough to merely assign people to try it; we must
also ensure that none of the other system components changed at the
same time. From early
experiments in the natural sciences, such an intervention came to be
known as a “controlled” experiment, where we clamp down values of
certain variables to isolate the effect of the intervention.</p>
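<p>The difference between observing a treatment and intervening on it can be sketched with the new-feature example. Everything in this simulation is assumed for illustration: the share of high-activity users, the discovery probabilities, and a feature effect of 1.0 on usage. The <code>do_feature</code> argument is a hypothetical switch that replaces the natural discovery mechanism with an assignment we control, leaving the rest of the mechanism untouched.</p>

```python
import random
from statistics import fmean

def sample_user(rng, do_feature=None):
    """Draw one user from a hypothetical model of feature usage."""
    high_activity = rng.random() < 0.3
    if do_feature is None:
        # Observation: high-activity users are more likely to find the feature.
        feature = rng.random() < (0.7 if high_activity else 0.1)
    else:
        # Intervention: we set the feature ourselves, regardless of activity.
        feature = do_feature
    usage = 5.0 * high_activity + 1.0 * feature + rng.gauss(0, 1)
    return feature, usage

rng = random.Random(42)

# Passive observation: compare users who did and did not discover the feature.
observed = [sample_user(rng) for _ in range(20000)]
observational_diff = (
    fmean(u for f, u in observed if f) - fmean(u for f, u in observed if not f)
)

# Intervention: assign the feature on or off, keeping the mechanism otherwise fixed.
with_feature = [sample_user(rng, do_feature=True)[1] for _ in range(20000)]
without_feature = [sample_user(rng, do_feature=False)[1] for _ in range(20000)]
interventional_diff = fmean(with_feature) - fmean(without_feature)
```

<p>In this assumed setup, <code>interventional_diff</code> should be close to the simulated effect of 1.0, whereas <code>observational_diff</code> is much larger because it also reflects the higher baseline usage of the users who discover the feature on their own.</p>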
<p>While “controlling” or keeping other variables constant is intuitive, it
is unclear which variables to include. We can obtain a more
precise definition by utilizing the second key concept of causal
reasoning, counterfactuals. The idea is to compare what happened after
an intervention to what would have happened without it. That is, for any
intervention, we can imagine two worlds, identical in every way up until
the point where some “treatment” occurs in one world but not the
other. Any subsequent difference between the two worlds is then, logically, a
consequence of this treatment. The first one is the observed, <em>factual</em>
world, while the second one is the unobserved, <em>counterfactual</em> world
(the word counterfactual means “contrary to fact”). The counterfactual
world, identical to the factual world except for the intervention,
provides a precise formulation to the “keeping everything else constant”
maxim. The value a variable takes in this world is called a
<em>counterfactual value</em>.</p>
<p><strong>Counterfactual Value:</strong> The (hypothetical) value of a variable under
an event that did not happen.</p>
<p>Putting together counterfactuals and interventions, the causal effect of an
intervention can be defined as the difference between the observed
outcome after an intervention and its counterfactual outcome without the
intervention. We express the outcome under the factual world as
<em>Y</em><sub>World1(<em>T</em> = 1)</sub>, and that under the counterfactual world as
<em>Y</em><sub>World2(<em>T</em> = 0)</sub>. For a binary treatment, its causal effect can be
written as,
Causal Effect := <em>Y</em><sub>World1(<em>T</em> = 1)</sub> − <em>Y</em><sub>World2(<em>T</em> = 0)</sub>.</p>
<p>The above equation shows that inferring the effect of an intervention
can be considered as the problem of estimating the outcome under the
counterfactual world, since the factual outcome is usually known. Thus,
counterfactual reasoning is key to inferring causal effect. Coming back
to the match stick example, we can define our intervention as striking a
match. The factual world is the world where we strike the match and see
it light up, and the counterfactual world is the world where we do not
strike the match but keep everything else the same. Under our
interpretation of causality, one expects that the match would not light
up in the counterfactual world, and hence we can claim that striking the
match causes light. Happily, our conclusion coincides with common
intuition, and as we shall see, counterfactual reasoning applies well to
many practical problems. That said, we must emphasize that this
definition of causality is not absolute; it depends on the initial world
in which one starts. For instance, in the match-stick example, if we
started in an oxygen-free environment (or in outer space), and applied
the same counterfactual reasoning, we would conclude that striking does
not cause lighting up, illustrating Hume’s dilemma.</p>
<h2 id="sec:ch1-randomized-exp">The Gold Standard: Randomized Experiment</h2>
<p>Let us now apply the above two concepts to describe one of the most
popular ways of causal reasoning, a randomized experiment. We consider a
simple example of deciding whether to recommend a medication to a
patient Alice. As we discussed above, we can evaluate this decision by
considering the <em>causal effect</em> of the medication on Alice. Here the
<em>treatment</em> is administering the medication and the <em>outcome</em> is Alice’s
health afterwards. From the equation above, we
can define causal effect as the difference between the value of <em>Y</em> in a
world where we gave Alice the treatment <em>T</em> (<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 1)</sub>) and
where we did not (<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 0)</sub>).
Causal Effect = <em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 1)</sub>]−<em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 0)</sub>]</p>
<p>This may seem straightforward, but the fundamental challenge is that
this calculation requires taking the difference between an observed
outcome and a counterfactual that we cannot observe. If we want to
calculate this difference, we can either 1) observe the outcome of
giving Alice the treatment <em>T</em> and compare it to the unobserved
counterfactual outcome of not giving her the treatment; or we can 2)
observe the outcome of not giving Alice the treatment <em>T</em> and compare it
to the unobserved counterfactual outcome of giving her the treatment.</p>
<p>No matter what we do, we cannot in any single experiment, both <em>do</em> <em>T</em>
and <em>not do</em> <em>T</em>! Whatever we actually do, the counterfactual will
remain unobserved. This is called the “missing data” problem of causal
inference. If
Alice takes a medication and we observe that she then gets better, we
cannot also observe what would have happened if she hadn’t taken the
medication. Would Alice have gotten better on her own, without the
medication? Or would she have stayed ill?</p>
<p><img src="/assets/Figures/missing-data-problem.png" alt="Missing data" height="350px" width="276px" /></p>
<h4 id="figure-11-the-missing-data-problem-of-causal-inference">Figure 1.1: The missing data problem of causal inference.</h4>
<p>To solve this problem, we need to make further assumptions on the
intervention or the counterfactual. For instance, if Alice has an
identical twin, Beth, who is also sick, we might give the medication to
Alice, but not give the medication to Beth. Then we can make an
assumption about the counterfactual: identical twins have the same
counterfactual when it comes to health outcomes. That is, we argue that
Beth is so similar to Alice—in terms of general health, genetics,
specifics of the illness, etc.—that they are likely to experience the
same outcomes, but for the medication. In this case, we could use our
observation of Beth’s outcomes as an estimate of Alice’s
counterfactual—what would have happened to Alice had she not taken the
medication.</p>
<p>Causal Effect = <em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Alice</sub> = 1)</sub>]−<em>E</em>[<em>Y</em><sub>World(<em>T</em><sub>Beth</sub> = 0)</sub>]</p>
<p>But of course, not everyone has an identical twin, much less an
identical twin with identical general health, habits, and illness. And
if two individuals are not identical, then there will always be a
question of whether differences in outcomes are due to the underlying
dissimilarities between them, instead of due to the medication. When the
two twins do not share the same illness, or more generally, when
comparing two different people, we cannot expect that their
counterfactuals will match. These differences that confuse our
attribution of differences in outcome to differences in the treatment
are called <em>confounders</em>.</p>
<p>Another approach is to make assumptions on the intervention. For example,
assume that there is a dataset of patient outcomes where medications
were given irrespective of the actual health condition. That is, the
outcome of any person still depends on their health condition, but
whether they took the drug does not. Therefore, if we now compare any
two individuals with or without the drug, we can argue that there is no
systematic difference between them. Over a large enough sample, the
average outcomes of the treated group can approximate the average
counterfactual outcomes of the untreated, and vice-versa. This allows us
to estimate the effect of an intervention defined over a group of
people, rather than just Alice.
Causal Effect = ∑ [<em>r</em> <em>Y</em><sub>World(<em>T</em> = <em>r</em>)</sub> − (1 − <em>r</em>)<em>Y</em><sub>World(<em>T</em> = <em>r</em>)</sub>]</p>
<p>where <em>r</em> is a random variable ∈{0, 1}. More generally, the core idea is
that instead of trying to find two individuals that are identical to one
another, we find two populations that are essentially identical to one
another. This may seem like it should be harder than identifying two
individuals that are the same—after all, now we have to find many people
not just two—but in practice, it is actually easier. It turns out that,
if we want to find the average effect of a treatment, we just need to
ensure that there are no systematic differences between the groups as a
whole. The advantage is that we no longer need to find identical
individuals, but how do we account for all the differences between any
group of people? Is ignoring health condition of people enough?</p>
<p>Causal reasoning took a major advance in the early twentieth century
when Fisher discovered a conceptually straightforward way to conduct an
intervention such that there is no systematic difference between the
treated and untreated groups. We simply gather one large population of
people and randomly split them into two groups (<em>G</em> = 0 or <em>G</em> = 1), one
of which will receive the treatment and the other will not. By randomly
assigning individuals to receive or not receive treatment, we ensure
that, on average, there is no difference between the two groups. The
implication is that the expected outcomes of the two groups are the
same, and when we observe the average outcome
<em>Ȳ</em><sub><em>do</em>(<em>A</em> = 0)</sub><sup><em>G</em> = 0</sup> for the first group, we can use it
as an estimate of what the counterfactual outcome would have been for
the second group. Similarly, when we observe the average outcome
<em>Ȳ</em><sub><em>do</em>(<em>A</em> = 1)</sub><sup><em>G</em> = 1</sup> for the second group, we can use
that as an estimate of the counterfactual outcome for the first group.
This methodology is called the <em>randomized experiment</em>—also sometimes
called a randomized controlled trial, A/B experiment, and other names.</p>
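<p>To make the idea concrete, here is a minimal simulation sketch of a randomized experiment. The population, the hidden <code>health</code> trait, and the numeric coefficients are all assumptions chosen for illustration; they do not come from any real study. Because a coin flip decides who is treated, the hidden trait is balanced across the two groups, and the difference in average outcomes should land close to the true effect.</p>

```python
import random
import statistics

def simulate_randomized_experiment(n=20000, true_effect=2.0, seed=0):
    """Randomly assign treatment in a simulated population and compare group means."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        # Hypothetical unobserved trait (e.g., baseline health) that affects the outcome.
        health = rng.gauss(0.0, 1.0)
        # Random assignment: a fair coin flip that ignores health entirely.
        g = rng.randint(0, 1)
        # Assumed outcome model: depends on health, on treatment, and on noise.
        y = 5.0 + 3.0 * health + true_effect * g + rng.gauss(0.0, 1.0)
        (treated if g == 1 else untreated).append(y)
    # The difference in average outcomes estimates the average causal effect.
    return statistics.mean(treated) - statistics.mean(untreated)

estimate = simulate_randomized_experiment()
print(f"Estimated average effect: {estimate:.2f} (true effect is 2.0)")
```

<p>Note that the analysis never looks at <code>health</code>: randomization handles it without our measuring it.</p>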
<p>Coming back to our question on whether to give Alice the medication, we
can use the average causal estimate above to make a decision: if the
effect over a general population is positive, then assign the drug,
otherwise not. Note that the decision will be the same for every person
since we are deciding based on the average effect of the medication.
Assumptions on the counterfactual (e.g., comparing to Beth or a similar
person) can provide individual causal effects but the estimates suffer
from error whenever their counterfactuals are not identical. In
contrast, assumptions on the intervention lead to group-wise causal
effects, but are accurate whenever the treatment is randomized. An
interesting and important thing to realize is that randomized
experiments don’t only address the confounders that we are aware of, but
also ensure that our analysis is sound even when there are confounders
that we are not measuring or maybe haven’t even thought of. Because of
this, randomized experiments are considered more robust than other
approaches. In fact, they are often referred to as the “gold standard”
for causal inference. Randomized experiments are used to test whether a
new medicine really cures an illness or has significant side-effects,
whether a marketing campaign works, whether one search algorithm is
better than another, and even whether one color or another on a website
is better for user engagement.</p>
<p>Despite their general robustness, randomized experiments are not
foolproof and there are practical problems that can occur. First and
foremost, ensuring that random assignment is actually random is not
always easy. Sometimes we might be tempted to use an assignment
mechanism that seems close to random, but actually isn’t. For example,
for convenience, we might find it easier to assign all units that arrive
on a Monday to treatment A, and all units that arrive on a Tuesday to a
placebo. Not only might this be more convenient—it might be logistically
easier to give the same treatment to everyone on a day—but, we can also
imagine why this regime is close to random: We do not have prior
knowledge of which units will arrive on which day; that is outside of
our control and might be random. We might even double-check how similar
the units are to one another and find that Monday units and Tuesday
units are very similar. However, if there are any unobserved reasons why
units arrive at different times, there will be systematic differences
between our two groups that will bias our results. As another example of
how random assignment might not be random, historically, when patients
were being assigned to drug trials, sympathetic patients were more
likely to be assigned to treatment. This led to the development of
blinding methodologies to prevent people who interact directly with
patients from assigning or even knowing their treatment status.</p>
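<p>The day-of-arrival example above can be sketched in the same style. Here we assume, purely for illustration, that a hidden <code>severity</code> trait makes units more likely to arrive on Monday; the probabilities and coefficients are invented. Assignment by arrival day then inherits a systematic difference between the groups.</p>

```python
import random
import statistics

def day_based_assignment(n=20000, true_effect=2.0, seed=1):
    """Assign treatment by arrival day when arrival day depends on a hidden trait."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        # Hypothetical hidden trait: how severe the unit's condition is.
        severity = rng.gauss(0.0, 1.0)
        # Assumption for illustration: severe cases tend to arrive on Monday.
        arrives_monday = rng.random() < (0.8 if severity > 0 else 0.2)
        # Monday arrivals get the treatment, Tuesday arrivals get the placebo.
        treatment = 1 if arrives_monday else 0
        # Severity lowers the outcome; the treatment truly raises it by true_effect.
        y = 5.0 - 3.0 * severity + true_effect * treatment + rng.gauss(0.0, 1.0)
        (treated if treatment == 1 else untreated).append(y)
    return statistics.mean(treated) - statistics.mean(untreated)

biased_estimate = day_based_assignment()
print(f"Day-based estimate: {biased_estimate:.2f} (true effect is 2.0)")
```

<p>Under these assumptions the estimate falls well below the true effect, even though nobody deliberately chose which units to treat.</p>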
<p>Despite their many advantages, randomized experiments are sometimes too
costly, unethical, or otherwise infeasible in certain situations. We
are limited by the number of experiments we can run at a given time
(given the statistical power each one needs), the cost of designing and
implementing an experiment, the length of time we can run it, and so on.
Even if running experiments is relatively easy, there are often orders
of magnitude more experiments we would like to run than we possibly can.
In addition, sometimes there are ethical issues involved in
experiments—is it ethical to expose people to potentially harmful
treatments?</p>
<p>So, what do we do when we cannot run a randomized experiment, but still
need to answer a cause-and-effect question? We turn to methods and
frameworks for causal reasoning that make different assumptions about
the counterfactual and intervention. Such methods are the focus of most
of this book. That said, much of what we will talk about, such as accounting
for survival bias, interference, heterogeneity, and other validity
threats, is applicable to randomized experiments as well.</p>
<h2 id="why-causal-reasoning-the-gap-between-prediction-and-decision-making">Why causal reasoning? The gap between prediction and decision-making</h2>
<p>Causal reasoning, thus, makes sense whenever we have a treatment (as in
medicine) or an economic policy (as in the social sciences) to
evaluate. But what use could it be for computing systems, especially at a
time when machine learning-based predictive systems are promising
success in a variety of applications? To answer this question, let us
take a closer look at the success of machine learning and how it may
change the role of computing systems in society.</p>
<h3 id="the-promise-of-prediction">The promise of prediction</h3>
<p>Today, computers are increasingly making decisions that have a
significant impact on our lives. Sometimes computers make a choice and
take action independently such as deciding on loan applications. Other
times computers simply aid people who make the final choice and drive
action such as helping doctors or judges with recommended actions.
Sometimes these computers are hidden far away inside our vital
infrastructure, making decisions that seem to only indirectly affect
people, as in optimizing configuration and availability of data centers.
Other times, these computers are visibly integrated into the fabric of
our daily lives, for example through fitness devices.</p>
<p>But, regardless of how directly or indirectly computers are involved, it
is true that computers are helping us make critical decisions across
many domains. For example, machine learning algorithms recommend product
purchases to customers in online retail sites. Similar algorithms power
movie recommendations, placement of advertisements, and many other
decisions. Other algorithms are responsible for match-making, pairing up
drivers and passengers in ride-sharing platforms, and connecting people
in online dating services. Behind the scenes, computers help with
logistics, resource allocation, and product pricing. They run algorithms
to decide who is eligible for a loan and identify the top candidates who
have applied for a job. Each of these decisions, made by a computer
algorithm, has significant consequences for all individuals and parties
involved.</p>
<p>And computer-aided decision-making is only growing in scope. In the
health domain, machine learning is enabling the advent of precision
medicine. Computers will soon analyze genetic information, symptoms,
test results and medical history to decide how best to heal a particular
patient of a malady. In education, computers promise to improve
personalized education in the context of both traditional classrooms and
the newer massive open online courses. Based on a personalized model of
a learner’s conceptual understanding and learning preferences, a
computer will coach and support a student in their exploration and
mastery of a subject. Data-driven, precision decision-making is
improving the productivity of farming while reducing water usage and
pollution. Artificial Intelligence (AI) is also bringing or poised to
bring similar impact to manufacturing, transportation, and other
industries.</p>
<p>This revolution of computer-aided decision-making is driven by three
concurrent trends. First, there is a proliferation of data from cheaper
and more ubiquitous sensors, devices, applications, and services.
Second, cheap and well connected computational power is available in the
cloud. Third, significant advances in machine learning and
artificial intelligence methods make it possible to rigorously
process and analyze a much broader set of data than was possible even
just 10 years ago. For example, if we want to predict upcoming wheat
yields in a field, we can now use automated drones to take pictures of
the wheat plants, and use deep neural network based image analysis to
recognize and count the grains and extrapolate the likely yield. These
pictures, field sensors and other information from the farm can be
uploaded to the cloud, joined with weather data and historical data from
other farms to learn better management policies and make decisions about
crop management.</p>
<p>Thus, increasing amounts of data and advanced machine learning
algorithms help make highly accurate predictions. What could go wrong?
The simple answer is that going from prediction to a decision is not
straightforward. A typical machine learning algorithm minimizes the
difference between the true and predicted values in a given dataset, but a
decision based on such a prediction is not always the decision that
maximizes the intended outcome. In other words, the causal effect of a
decision based on purely data-based predictive modelling can be
arbitrarily bad.</p>
<h3 id="importance-of-the-underlying-mechanism">Importance of the underlying mechanism</h3>
<p>Consider a simple social news feed application, where users can see
messages posted by their friends. Let’s ask the question of whether we
can predict something about a user’s future behavior based on what they
see in their social feed. That is, if a person sees a link to a news
article, a product recommendation, or a review of a real-world
destination, can we predict that the person will then read the news
article, buy some product, or visit the destination? It turns out, the
answer to that question is yes, we can make successful predictions about
a user’s future behavior based on what they see in their social feed.</p>
<p>If we can predict future behavior based on the social feed, does this
mean that if we decide to change the contents of the social feed, we
will change the user’s future behavior? Not necessarily. This is the
gap between prediction and decision-making. We can predict a user’s
future behavior using the current feed, but that does not tell us much
about how they will behave if we change their social feed. Here the
decision is whether to change the social feed, and the answer depends on
the relationship between the social feed and the user’s future behavior: what
affects whom?</p>
<p>Figure 1.2 shows two possible explanations for
the predictive accuracy. On the left side, we see that the social feed
itself <em>causes</em> a person’s future behavior. That is, perhaps social feed
posts on this system are very persuasive and do a good job of enticing a
person to do new things in the future. Or perhaps, as shown on the right
side, people and their friends tend to do similar things anyway. For
example, if a group of friends likes going to Italian restaurants, and a
new one opens, they are all likely to visit the restaurant sometime
soon, but one of them happens to go and post about it first. If the
friend hadn’t posted, all the friends would have gone to the restaurant
anyway, but the post itself helps us predict the behavior of the
individuals. On the left-hand side, if we change the social feed, then
we will affect the user’s behavior. However, on the right-hand side,
changing the social feed will not affect the user’s behavior. But note
that in both cases, the social feed helps us predict what the user might
do in the future!</p>
<p>Without knowing the direction of effect, we can reach exactly opposite
conclusions using the same data. The problem is that prediction models
are often used in the service of making a decision, and a model that
cannot tell these explanations apart cannot tell us which decision is right.</p>
<p><img src="/assets/Figures/Fig-1-1.png" alt="Social feed affects user behavior" height="200px" width="150px" /></p>
<!--<span id="fig:socialfeedbehavior_causal" label="fig:socialfeedbehavior_causal">Social feed affects user behavior</span>-->
<p><img src="/assets/Figures/Fig-1-2.png" alt="Restaurant preference affects both the social feed and user behavior" height="200px" width="150px" /></p>
<!-- <span id="fig:socialfeedbehavior_correlated" label="fig:socialfeedbehavior_correlated">Restaurant preference affects both the social feed and user behavior</span>-->
<h4 id="figure-12--two-models-of-social-feed-effect-on-user-behavior">Figure 1.2: Two models of social feed effect on user behavior.</h4>
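<p>The two explanations can be contrasted with a small simulation sketch. Both worlds below are hypothetical, with made-up probabilities: in <code>feed_causes_behavior</code> the post drives the visit, while in <code>shared_preference</code> a common taste for Italian food drives both the post and the visit. For each world we compare the observed gap in visit rates (people who saw the post versus those who did not) with the gap we get when we intervene and force the post to be shown or hidden.</p>

```python
import random

def simulate(world, n=50000, seed=0, force_post=None):
    """Return (saw_post, visited) pairs for one of two hypothetical worlds."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        likes_italian = rng.random() < 0.5  # preference shared by the friend group
        if force_post is not None:
            saw_post = force_post  # intervention: we decide what the feed shows
        elif world == "feed_causes_behavior":
            saw_post = rng.random() < 0.5
        else:
            # Friends with the same taste are more likely to have posted.
            saw_post = rng.random() < (0.8 if likes_italian else 0.1)
        if world == "feed_causes_behavior":
            p_visit = 0.7 if saw_post else 0.1
        else:
            p_visit = 0.7 if likes_italian else 0.1
        rows.append((saw_post, rng.random() < p_visit))
    return rows

def visit_rate(rows, saw_post):
    visits = [visited for seen, visited in rows if seen == saw_post]
    return sum(visits) / len(visits)

def observed_gap(world):
    rows = simulate(world)
    return visit_rate(rows, True) - visit_rate(rows, False)

def intervened_gap(world):
    shown = simulate(world, force_post=True)
    hidden = simulate(world, force_post=False)
    return visit_rate(shown, True) - visit_rate(hidden, False)

for world in ("feed_causes_behavior", "shared_preference"):
    print(world, round(observed_gap(world), 2), round(intervened_gap(world), 2))
```

<p>In both worlds the observed gap is large, so the feed is predictive. Only in the first world does the gap survive the intervention.</p>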
<h3 id="the-trouble-with-changing-environments">The trouble with changing environments</h3>
<p>Complicating matters, predictive models can lead us astray even if the
underlying (direction of) effects is known. Let’s consider a second
example where machine learning may be applied to data from a farm for
making irrigation decisions. In particular, let’s make our job a little
easier by assuming that we know what the effect of irrigation is (unlike
in our social feed example, where we do not know the effect of changing
the feed). We know that irrigation will increase the soil moisture
levels by some known amount.</p>
<p>In a predictive model, we may collect data about past soil moisture and
other variables and then make predictions of future soil moisture levels
based on past data. This prediction can be converted to a simple
decision: if the soil moisture is low, irrigate, else do not irrigate.
Now, given a history of past soil moisture data on the farm and past
weather, let us assume that we can train an accurate model to predict
future soil moisture levels based on current soil moisture and future
weather forecasts. Can this machine learning prediction model guide our
irrigation decisions on a farm?</p>
<p>Again, the answer is no, we cannot make irrigation decisions based on
our learned model of soil moisture levels. Imagine that one day the
weather forecast says it will be very hot. Our soil moisture model is
likely to predict that the soil will be very moist and, based on this
prediction, we are likely to decide <em>not</em> to irrigate.</p>
<p>But why is our soil moisture model predicting that there’s no need to
irrigate on a very hot day? Our model is trained on past soil moisture
data, but in the past the soil was being irrigated under some
predetermined policy (e.g., a rule-based decision or a farmer’s
intuition). If this policy always irrigated the fields on very hot days,
then our prediction model will learn that on very hot days, the soil
moisture is high. The prediction model will be very accurate, because in
the past this correlation always held. However, if we decide not to
water the field on very hot days based on our model predictions, we will
be making exactly the wrong decision!</p>
<p>The prediction model is correctly capturing the correlation between hot
weather and a farmer’s past irrigation decisions. The prediction model
does not care about the underlying mechanism. It simply recognizes the
pattern that hot weather means the soil will be moist. But once we start
using this prediction model to drive our irrigation decisions, we break
the pattern that the model has learned. More technically, we say that
once we begin active intervention, the correlations that the soil
moisture model depends on will change.</p>
<p><img src="/assets/Figures/Chapter1/irrigation_historical.png" alt="Both daily temperature and irrigation decisions influence soil moisture levels. Historically, daily temperature also influences irrigation decisions." height="200px" width="200px" /></p>
<p><span id="fig:irrigation_historical_dag" label="fig:irrigation_historical_dag" style="font-style:italic">Both daily temperature and irrigation decisions influence soil moisture levels. Historically, daily temperature also influences irrigation decisions.</span></p>
<p><img src="/assets/Figures/Chapter1/Historical_Temp_Moisture_Chart.png" alt="Under a historical policy, the correlation between temperature and soil moisture is stable." height="200px" width="200px" /></p>
<p><span id="fig:irrigation_historical" label="fig:irrigation_historical" style="font-style:italic">Under a historical policy, the correlation between temperature and soil moisture is stable.</span></p>
<p><img src="/assets/Figures/Chapter1/Broken_Temp_Moisture_Chart.png" alt="Changing the irrigation policy will change the relationship between temperature and soil moisture." height="200px" width="200px" /></p>
<p><span id="fig:irrigation_intervention" label="fig:irrigation_intervention" style="font-style:italic">Changing the irrigation policy will change the relationship between temperature and soil moisture.</span></p>
<h4 id="figure-13-using-a-correlational-model-trained-on-historical-data-to-drive-future-irrigation-decisions-will-break-the-historical-temperature-soil-moisture-correlation-and-thus-the-machine-learned-model">Figure 1.3: Using a correlational model trained on historical data to drive future irrigation decisions will break the historical temperature-soil moisture correlation and thus the machine-learned model.</h4>
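<p>The irrigation story can be reproduced with a toy simulation. The moisture mechanism, its coefficients, and the "always irrigate on hot days" historical policy are assumptions made for this sketch. The "prediction model" is simply the average historical moisture on hot and on cool days, which is enough to show the failure.</p>

```python
import random
import statistics

def soil_moisture(hot, irrigated, rng):
    # Hypothetical mechanism: heat dries the soil, irrigation adds water.
    return 50.0 - 20.0 * hot + 30.0 * irrigated + rng.gauss(0.0, 2.0)

rng = random.Random(0)

# Historical data: the farmer always irrigated on hot days and never on cool days.
history = []
for _ in range(5000):
    hot = rng.random() < 0.5
    irrigated = hot
    history.append((hot, soil_moisture(hot, irrigated, rng)))

# "Prediction model": average historical moisture for hot and for cool days.
predicted = {
    hot: statistics.mean(m for h, m in history if h == hot)
    for hot in (True, False)
}

# New policy driven by the model: hot days look moist, so we skip irrigation.
actual_hot_no_irrigation = statistics.mean(
    soil_moisture(True, False, rng) for _ in range(5000)
)

print(f"Predicted moisture on hot days: {predicted[True]:.1f}")
print(f"Actual moisture on hot days without irrigation: {actual_hot_no_irrigation:.1f}")
```

<p>With these numbers the model expects about 60 on hot days, but once we act on that prediction and stop irrigating, the soil ends up at about 30.</p>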
<p>This is another example of why prediction models are not
appropriate for decision making. Prediction models are not robust to
changing conditions. Machine learning practices—e.g., ensuring that we
train and test machine learning models using data drawn from the
environment we plan to deploy in—are important, but provide no
guarantees. In this irrigation example, those changing conditions are
due to our own change in policy based on machine learning model
predictions—we cannot train our model based on observations of our new
policy, as the new policy doesn’t exist yet! More generally, such
changing conditions can occur due to exogenous conditions as well.
Moreover, these conditions may change quickly when we apply our models
in new environments, or change over time within a deployed environment.</p>
<p>Changing environments are particularly important issues in some of the
critical domains we care most about: healthcare, agriculture, etc.,
where we expect machine learning models to help us make better
decisions; online services such as ecommerce sites, etc., that have to
adapt to seasonality, social influences, and growing and changing user
populations; and deployments of machine learning in adversarial
settings, e.g., from spam classification and intrusion detection to
safety critical systems.</p>
<h3 id="from-prediction-to-decision-making">From prediction to decision-making</h3>
<p>To recap, we have seen that prediction models are not appropriate for
helping us reason about what might happen if we change a system or take
a specific action. In our social news feed example, where we asked
whether predictive models can help us understand whether changing a
social news feed will change future user behavior, we saw that there are
multiple plausible explanations of why social feed data can help us
predict future user behaviors. While one explanation implies that
changing the social feed will affect user behavior, another explanation
implies that it won’t affect user behavior. Crucially, the machine
learning prediction model does not help us identify which explanation is
correct!</p>
<p>Moreover, even with an <em>a priori</em> understanding of the causal effect of
an intervention, when we examine the use of a simple prediction model
for making decisions, we see that the act of making a decision based on
the model changes the environment and puts us into untested territory
that threatens the predictive power of our model!</p>
<p>Finally, let us emphasize that these two issues are fundamental. Even
when a prediction model has an otherwise extremely high accuracy, we
cannot expect that accuracy alone to give us insight into underlying
causal mechanisms or help us choose among interventions that change the
environment.</p>
<h2 id="applications-of-causal-reasoning">Applications of Causal Reasoning</h2>
<p>The above section illustrates the importance of causal reasoning
whenever we have to make decisions based on data. Below we present
sample scenarios from computing systems that involve decision-making,
and thus require causal reasoning. In addition, it turns out that causal
reasoning is useful even when decision-making may not be the primary
focus. For instance, causal reasoning is useful in improving systems
that may appear to be purely predictive at first, such as search and
recommendation systems.</p>
<h3 id="making-better-decisions">Making better decisions</h3>
<p>There are numerous examples of decision-making in computing systems
where causal reasoning can help us make better decisions. We broadly
categorize them into three themes: improving utility for users,
optimizing underlying systems, and enhancing viability of the system,
commercial or otherwise.</p>
<p>We already saw an example of decision-making for improving users’
utility through changing a social feed. Other examples include choosing
incentives for encouraging better use of a system, and more broadly,
deciding which functionality to include in a product to maximize utility
for users. In general, any decision that involves changing a product or
service’s features requires causal reasoning to anticipate its future
effect. This is because for all these problems, we need to isolate the
effect of these decisions from the underlying correlations.</p>
<p>Similar reasoning can also be applied to optimize underlying systems,
such as optimizing configuration parameters of a database, deciding
network parameters for best throughput, allocating load in a distributed
data center for energy efficiency, and so on.</p>
<p>Lastly, the viability and sustainability of any computing system are
important too. This involves areas that have historically been outside
computing, such as marketing and business management, where data-driven
decisions are increasingly being made. Examples include decisions involving
the interaction of a system with the outside world, such as deciding the
right messaging for a targeting campaign. As another example, consider a subscription-based
service such as Netflix or Office365. It is relatively easy to build a
predictive model that identifies the customers who will be leaving in
the next few months, but deciding what to do to prevent them from
leaving is a non-trivial problem. We will consider such decision-making
applications throughout the book.</p>
<h3 id="building-robust-machine-learning-models">Building robust machine learning models</h3>
<p>Causal reasoning is also useful in the absence of explicit decisions.
Many systems that commonly employ predictive models, such as those for
recommendation or search, can be improved with causal reasoning.
Predictive models aim to minimize the average error on past data, which
may not correspond to the expected error on new data, especially in a
system that interacts with people. Consider a ratings-based
recommendation system that aims to predict a user’s rating of a new
item. If there are systematic biases in the items rated by the user
(such as rating movies from a single genre more often), then the system
may over-optimize for movies from the same genre, but make errors for
all other genres. The fundamental problem is that past data is collected
under certain conditions, and predictions based on it may not be accurate
for the future. We shall see in Chapter 12 that the problem
of recommendation can be considered as a problem of <em>intervening on</em>
users with a recommended item, thus defining each recommendation as an
intervention. A similar problem arises in optimizing the most relevant
pages for a query in a search engine based on log data, and in any
system where a user interacts with information. Besides improving
accuracy, causal reasoning can also be useful to understand the effect
of algorithms on metrics that they were not optimized for. For instance,
it helps us in estimating the different effects of recommendation
systems, from impacting diversity to amplifying misinformation and
“filter bubbles”.</p>
<p>Relatedly, questions on broad societal impact of computing systems are
fundamentally causal questions about the effect of an algorithm: is a
loan decision algorithm unfair to certain groups of people? What may be
the outcomes of delegating certain medical decisions to an algorithm? As
we use machine learning for societally critical domains such as health,
education, finance, and governance, questions on the causal effect of
algorithmic interventions gain critical importance. Causal reasoning can
also be used to understand the effects of these algorithms, and also to
explain their output: why did the model provide a particular decision?</p>
<p>More generally, causal reasoning helps predictive models make the jump
from fitting retrospective data to making predictions on new data. Predictive
models based on supervised learning work well when we expect them to be
tested on the same data distribution on which they were trained. For
instance, predictive models can achieve impressive results on
distinguishing between different species of birds because we expect to
use them to classify similar pictures in the future. If, however, we
predict in an unseen environment (e.g., moving from outdoor to indoor pictures), the model may
not work well and may even fail to identify the same species. These
environment changes, commonly called concept drift, occur because the
association between input features and output changes as we change the
environment. Rather than looking for patterns in an image, reasoning
about the causal factors that make an image about a specific species can
lead to a more generalizable model. In fact, causal inference can be
considered as a special case of the domain adaptation problem in machine
learning, which we will explore in Chapter 13.</p>
<p>Beyond supervised learning, causal reasoning shares a special connection
with reinforcement learning (RL), in that both aim to optimize the
outcome for a particular decision. It is no surprise, then, that simpler
forms of RL, such as bandits, are used for optimizing recommendation
systems. And causal inference methods find use in training RL policies,
especially when using off-policy data. This synergy between machine
learning and causal reasoning is one of the underlying themes of this
book: causal reasoning can make machine learning more robust, and
machine learning can help with better estimates of causal effects.</p>
<h2 id="four-steps-of-causal-reasoning">Four steps of Causal Reasoning</h2>
<p>The focus of this book is on methods and challenges for learning causal
effects from observational data. Briefly, observational studies are
those where we wish to learn causal effects from gathered data and,
while we may have some understanding of the data (in particular the
mechanism that generated the data), we have limited or no control over
that mechanism. So, how are we going to learn causal effects when we
cannot run an experiment like a randomized control study? How will we
deal with confounding variables that might confuse our analysis if we
cannot manipulate an experiment to ensure that confounds are independent
of the treatment status?</p>
<p>At a high-level, we’ll need to find a valid intervention and then
construct a counterfactual world to estimate its effect. Unlike the
randomized experiment, the biggest change is that we will need to make
some assumptions on how the data was generated. This is critical: causal
reasoning depends on a model of the world, and that model amounts to a set
of modeling assumptions. As we saw in the social feed example, the same
data can lead to different conclusions depending on the underlying
mechanism.</p>
<p>Given data and a model of the world, we decide whether the available
data can answer the causal question uniquely. This step is called
identification. Note that identification comes from the modeling
assumptions themselves, not from data. When the causal question is
uniquely identified, we can estimate it using statistical methods. Note
that identification and estimation are separate, modular steps.
Identification is the causal step, while estimation is a statistical
step. Identification depends on the modeling assumptions, estimation on
the data. A better estimate does not convey anything about causality,
just as a better identification does not convey anything about the
quality of an estimate.</p>
<p>Finally, given the dependence on assumptions, verifying these
assumptions is critical. Even with infinite data, incorrect assumptions
can lead to wrong answers. Worse, the statistical confidence in them will be
higher. Therefore, the final critical part of causal reasoning is to
validate the modeling assumptions. We call this step the “refute” step,
because like scientific theories, modeling assumptions can never be
proven using data but may be refuted. That said, it is important to note
that not all assumptions can be refuted. Causal reasoning is an
iterative process where we refine our modeling assumptions based on
evidence and try to obtain identification with the fewest untestable, or
the most plausible, assumptions.</p>
<p>To summarize, we rely on a four step analysis process to carefully
address these challenges:</p>
<p><strong>Model and assumptions</strong>. The first important step in causal reasoning
is to create a clear model of the causal assumptions being made. This
involves writing down what is known about the data generating process
and mechanisms. In general, there are many mechanisms that can
potentially explain a set of data, and each of these self-consistent
mechanisms will give us a different solution for the causal effect we
care about. So, if we want to get a correct answer to our
cause-and-effect questions, we have to be clear about what we already
know. Given this model, we will be able to formally specify the effect
<em>A</em> → <em>B</em> that we want to calculate.</p>
<p><strong>Identify</strong>. Use the model to decide whether the causal question can be
answered and provide the required expression to be computed.
Identification is a process of analyzing our model.</p>
<p><strong>Estimate</strong>. Once we have a general strategy for identifying the causal
effect, we can choose from several different statistical estimation
methods to answer our causal question. Estimation is a process of
analyzing our data.</p>
<p><strong>Refute</strong>. Once we have an answer, we want to do everything we can to
test our underlying assumptions. Is our model consistent with the data?
How sensitive is the answer to the assumptions made? If the model is a
little wrong, will that change our answer a little or a lot?</p>
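<p>The four steps can be sketched end to end on simulated data. Everything below is an illustrative assumption: a single binary confounder <code>W</code>, the graph <code>W → T</code>, <code>W → Y</code>, <code>T → Y</code>, a true effect of 1.0, and a placebo-treatment check as the refutation. It is a minimal sketch of the workflow, not a general-purpose implementation.</p>

```python
import random
import statistics

rng = random.Random(0)

# 1. Model: assume W -> T, W -> Y, and T -> Y, with W a binary confounder.
data = []
for _ in range(40000):
    w = rng.random() < 0.5
    t = rng.random() < (0.8 if w else 0.2)
    y = 1.0 * t + 3.0 * w + rng.gauss(0.0, 1.0)  # assumed true effect of T is 1.0
    data.append((w, t, y))

# 2. Identify: under this model, adjusting for W is sufficient, so
#    effect = sum_w P(W=w) * (E[Y | T=1, W=w] - E[Y | T=0, W=w]).

# 3. Estimate: plug sample averages into the identified expression.
def mean_y(rows, t, w=None):
    return statistics.mean(
        y for (wi, ti, y) in rows if ti == t and (w is None or wi == w)
    )

def naive_estimate(rows):
    return mean_y(rows, True) - mean_y(rows, False)

def adjusted_estimate(rows):
    total = 0.0
    for w in (True, False):
        p_w = sum(1 for wi, _, _ in rows if wi == w) / len(rows)
        total += p_w * (mean_y(rows, True, w) - mean_y(rows, False, w))
    return total

# 4. Refute: replace T with a shuffled placebo; the estimate should vanish.
shuffled_t = [t for _, t, _ in data]
rng.shuffle(shuffled_t)
placebo = [(w, t, y) for (w, _, y), t in zip(data, shuffled_t)]

print(f"Naive difference in means: {naive_estimate(data):.2f}")
print(f"Adjusted estimate:         {adjusted_estimate(data):.2f}")
print(f"Placebo estimate:          {adjusted_estimate(placebo):.2f}")
```

<p>The naive comparison mixes the effect of <code>T</code> with that of <code>W</code>, the adjusted estimate should sit near the assumed effect, and the placebo estimate should be near zero. Passing the placebo check does not prove the model; it only fails to refute it.</p>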
<h3 id="modeling-and-assumptions">Modeling and assumptions</h3>
<p>In Section 1.2, we discussed a randomized
experiment and applied counterfactual reasoning to estimate the causal
effect. Counterfactual reasoning provides a sound basis for causality,
but in most empirical problems, we may not obtain perfectly randomized
data. Therefore, to estimate the causal effect, we need a precise way of
expressing our assumptions about the intervention and the counterfactual
we wish to estimate. What has happened in the last few decades is that
the concepts of interventions and counterfactuals have been formalized
in a general modeling framework, taking causality from the realm of
philosophy to empirical science.</p>
<p><img id="fig:icecream" src="/assets/Figures/Chapter1/ice-cream-graph_cropped.png" alt="Structural causal model for ice-cream’s effect on swimming." /></p>
<h4 id="figure-14-structural-causal-model-for-ice-creams-effect-on-swimming">Figure 1.4: Structural causal model for ice-cream’s effect on swimming.</h4>
<p>The main insight is to replace the factual and counterfactual worlds
with a mathematical model that defines the relationships between
treatment, outcome, and other variables. This can be done in the form of
a graph or functional equations. Crucially, this model does not
prescribe the exact functional forms that connect variables, but rather
conveys the structure of causal relationships—who affects whom. This
structural model embodies all the domain knowledge or causal assumptions
that we make about the world, thus it is also called the <em>structural
causal model</em>. For instance, consider question of whether ice cream
causes people to swim more. Figure <a href="#fig:icecream">1.4</a> shows the
correlation of ice-cream and swimming over time. We can represent the
scenario with a graphical model and associated set of non-parametric
equations shown in Figure <a href="#fig:icecream">1.4</a>. Each arrow represents a
direct causal relationship. We assume that the Weather causes changes in
ice-cream consumption and swimming. Our goal is to estimate the causal
effect of ice-cream consumption on swimming. Intuitively, the intervention is changing someone’s
ice-cream consumption and the counterfactual world is one where the
consumption is changed but every other node in the graph (Temperature,
in this case) remains constant. Assuming that our structural model is
correct, the structural causal model offers a precise recipe to estimate
the effect of having more ice-cream. Amazingly, the recipe generalizes
to arbitrary graph structures and functional forms, as we shall see in
the next few chapters.</p>
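<p>To make the idea of functional equations concrete, here is a minimal simulation sketch of the ice-cream model. The coefficients, noise levels, and the function name <code>sample_world</code> are illustrative assumptions, not values from the text; the only structural assumption carried over is that Temperature drives both ice-cream consumption and swimming.</p>

```python
import random
import statistics

def sample_world(rng, do_ice_cream=None):
    """Draw one day from a hypothetical ice-cream structural causal model."""
    temperature = rng.uniform(0, 40)                      # exogenous cause
    if do_ice_cream is None:
        ice_cream = 0.5 * temperature + rng.gauss(0, 2)   # Temperature -> Ice-cream
    else:
        ice_cream = do_ice_cream                          # intervention overrides the equation
    swimming = 2.0 * temperature + rng.gauss(0, 5)        # Temperature -> Swimming only
    return temperature, ice_cream, swimming

def correlation(xs, ys):
    """Pearson correlation computed from first principles."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)

# Observational world: ice-cream and swimming move together.
observed = [sample_world(rng) for _ in range(20000)]
obs_corr = correlation([d[1] for d in observed], [d[2] for d in observed])

# Interventional worlds: set ice-cream by hand and compare average swimming.
low = statistics.fmean([sample_world(rng, do_ice_cream=0.0)[2] for _ in range(20000)])
high = statistics.fmean([sample_world(rng, do_ice_cream=20.0)[2] for _ in range(20000)])
effect = high - low

print(f"observed correlation: {obs_corr:.2f}, effect of intervention: {effect:.2f}")
```

<p>Under these assumed equations the observed correlation is strong, while forcing ice-cream consumption to different values leaves average swimming essentially unchanged.</p>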
<p>Structural causal models derive their power from being able to precisely
define interventions and counterfactuals. However, it is hard to express
these concepts using conventional probabilities. As we saw above, it is
important to distinguish an intervention from an observation, but
unfortunately probability calculus lacks a language to distinguish
between observing people using a feature versus assigning them to it
(both would be expressed as <em>P</em>(Outcome|Feature = True)). This
difficulty gets more complicated when we try to express counterfactuals.
How would you express the counterfactual probability of outcome if a
user was assigned the feature, given that she discovered it herself (was
not assigned) in the factual world? The obvious expression,
<em>P</em>(Outcome|Assigned = True, Assigned = False), is nonsensical. Given
these shortcomings, we need a new class of variables and a calculus to
operate on these variables. Intervention is defined by a special “do”
operator, which implies removing all inbound edges to a variable. This
corresponds to the interventional graph shown in Figure []. Thus,
assigning people to a feature is represented as <em>P</em>(<em>Y</em>|do(Feature)). Any
counterfactual value can be generated by changing the variable in the
interventional graph. The removal of inbound edges means that changing
the variable is not associated with changing other variables except the
outcome, thus <em>keeping everything else constant</em>. The causal effect of
an intervention can then be defined precisely as the difference of two
interventional expectations:</p>
<p>Causal Effect := <em>E</em>(<em>Y</em>|do(<em>T</em> = 1)) − <em>E</em>(<em>Y</em>|do(<em>T</em> = 0))</p>
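<p>The graph surgery behind the “do” operator can be sketched in a few lines. The representation below (a dictionary mapping each node to its list of parents) and the node names are assumptions made for illustration; the edge from <code>IceCream</code> to <code>Swimming</code> stands for the hypothesized effect under study.</p>

```python
def do(graph, node):
    """Return the interventional graph: drop every edge pointing into `node`."""
    return {
        child: ([] if child == node else list(parents))
        for child, parents in graph.items()
    }

# Each key is a node; each value lists that node's direct causes (parents).
observational = {
    "Temperature": [],
    "IceCream": ["Temperature"],
    "Swimming": ["Temperature", "IceCream"],
}

interventional = do(observational, "IceCream")
print(interventional["IceCream"])   # the intervened node has no parents left
print(interventional["Swimming"])   # outgoing edges are untouched
```

<p>Only the edges into the intervened variable are removed; every other mechanism in the model is left as it was, which is what “keeping everything else constant” means graphically.</p>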
<p><img src="/assets/Figures/Chapter1/rct-confounders_cropped.png" alt="Before randomization" /></p>
<p><span id="fig:rct-confounders" label="fig:rct-confounders">Before randomization</span></p>
<p><img src="/assets/Figures/Chapter1/rct-noconfounders_cropped.png" alt="After randomization" /></p>
<p><span id="fig:rct-noconfounders" label="fig:rct-noconfounders">After randomization</span></p>
<h4 id="figure-15--randomization-leads-to-the-interventional-structural-model-where-treatment-is-not-affected-by-confounders">Figure 1.5: Randomization leads to the interventional structural model where treatment is not affected by confounders.</h4>
<p>Thus, the <em>do</em> notation is a concise expression for evaluating
interventions while keeping everything else constant. Along with the
structural causal model, it also leads to a formal definition of a
counterfactual. To illustrate, we now express why randomized experiments
work using structural causal models. Figure <a href="#fig:rct-confounders">1.5a</a>
depicts a structural model with confounding between the treatment
and outcome. Under a randomized experiment, the structural causal model
becomes the one shown in Figure <a href="#fig:rct-noconfounders">1.5b</a>,
graphically showing that randomization removes any effect of a
treatment’s parents.</p>
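<p>A small simulation can illustrate why cutting the edge from confounders to treatment matters. Everything numeric here (the true effect of 1.0, the confounder strength, the sample size) is an assumed toy setup, not data from the text.</p>

```python
import random
import statistics

TRUE_EFFECT = 1.0  # assumed effect of treatment on outcome in this toy model

def simulate(rng, n, randomized):
    """Return the naive difference in mean outcomes between treated and control."""
    treated, control = [], []
    for _ in range(n):
        confounder = rng.gauss(0, 1)
        if randomized:
            treatment = rng.random() < 0.5                   # coin flip: treatment has no parents
        else:
            treatment = confounder + rng.gauss(0, 1) > 0     # Confounder -> Treatment
        outcome = TRUE_EFFECT * treatment + 2.0 * confounder + rng.gauss(0, 1)
        (treated if treatment else control).append(outcome)
    return statistics.fmean(treated) - statistics.fmean(control)

rng = random.Random(1)
naive_observational = simulate(rng, 50000, randomized=False)
naive_randomized = simulate(rng, 50000, randomized=True)

print(f"confounded data: {naive_observational:.2f}")
print(f"randomized data: {naive_randomized:.2f}")
```

<p>With confounded assignment the naive comparison overstates the effect, while under randomization the same comparison lands close to the assumed true value.</p>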
<h3 id="identification">Identification</h3>
<p>Given the modeling assumptions and available data, the next step is to
ascertain whether the causal effect can be estimated from the data. This
means ascertaining whether the causal effect expression defined above
with the <em>do</em> operator can be written as a function of only
observed data. As we will see, given a causal structural graph, it is
possible to check whether the causal effect is estimable from data, and
when it is, provide the formula for estimating it. For instance,
returning to our ice-cream graph, the causal effect is identified by
conditioning on Temperature and then estimating the association between
ice-cream consumption and swimming separately for each temperature range. When we do
that, we see that the treatment and outcome are no longer associated,
showing that the observed association is due to Temperature,
and not due to any causal effect of eating ice-cream. In general,
variables like Temperature are called confounders: they can lead to a
correlation between treatment and outcome even when there is no causal
relationship.</p>
<p>More generally, identification is the process of transforming a causal
quantity to an estimable quantity that uses only available data. For the
randomized experiment, we argued that random assignment of treatment
ensures that there are no confounders that affect the treatment. Thus,
the identification step is trivial:</p>
<p>Average Causal Effect = <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 1)] − <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 0)]</p>
<p>=<strong>E</strong>[<em>Y</em>|<em>A</em> = 1]−<strong>E</strong>[<em>Y</em>|<em>A</em> = 0] … <em>(Identification Step)</em></p>
<p>While we obtained a clear answer for the ice cream problem and
randomized experiments, real-world problems of causal inference do not
always lend themselves to simple solutions. We illustrate this through a
common problem encountered in conditioning on data.</p>
<table>
<thead>
<tr>
<th style="text-align: left"> </th>
<th style="text-align: center">Current Algorithm</th>
<th style="text-align: center">New Algorithm</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">CTR for low-activity users</td>
<td style="text-align: center">10/400 (2.5%)</td>
<td style="text-align: center">4/200 (2%)</td>
</tr>
<tr>
<td style="text-align: left">CTR for high-activity users</td>
<td style="text-align: center">40/600 (6.7%)</td>
<td style="text-align: center">50/800 (6.25%)</td>
</tr>
<tr>
<td style="text-align: left">Overall CTR</td>
<td style="text-align: center">50/1000 (5%)</td>
<td style="text-align: center">54/1000 (5.4%)</td>
</tr>
</tbody>
</table>
<h4 id="table-11-click-through-rates-for-two-algorithms-which-one-is-better">Table 1.1: Click-through rates for two algorithms. Which one is better?</h4>
<p>Suppose that you are trying to improve a current algorithm that returns
a list of search results based on a given query. You consider a metric
like click-through rate (CTR) on the generated results, and wish to deploy the
algorithm that leads to the maximum click-through rate per query. You
develop a new algorithm for this task, and use it to replace the old
algorithm for a few days to gather data for comparison. A natural way to
compare the two algorithms is to collect a random sample of queries
for each algorithm and compare the click-through rates obtained by the
two algorithms. That is, let us collect a random sample of 1000 search
queries for each algorithm and evaluate the algorithms on the
fraction of search queries that had a relevant search result (as
measured by a user click). The last row of Table <a href="#tab:ch1-simpson-paradox">1.1</a> shows the
overall performance of the two algorithms: the new algorithm appears to
perform better. However, you might be curious whether the new algorithm is
doing well for all users, or only a subset. To check, you divide the
users into low-activity and high-activity users. The first two rows of the
table show the comparison. Oddly, after conditioning on users’
activity, the new algorithm is now worse than the old algorithm for both
types of users.</p>
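<p>The reversal can be checked directly from the counts in Table 1.1. The sketch below only re-does the table’s arithmetic; the dictionary layout and the helper name <code>ctr</code> are illustrative choices.</p>

```python
# (clicks, queries) for each algorithm and user-activity group, from Table 1.1
clicks = {
    ("current", "low"): (10, 400),
    ("current", "high"): (40, 600),
    ("new", "low"): (4, 200),
    ("new", "high"): (50, 800),
}

def ctr(algorithm, activity=None):
    """Click-through rate for an algorithm, optionally within one activity group."""
    rows = [v for (a, g), v in clicks.items() if a == algorithm and activity in (None, g)]
    return sum(c for c, _ in rows) / sum(n for _, n in rows)

for group in (None, "low", "high"):
    label = group or "overall"
    print(f"{label:>7}: current={ctr('current', group):.4f}  new={ctr('new', group):.4f}")
```

<p>The new algorithm wins on the pooled data but loses within each activity group, because its queries come disproportionately from high-activity users (800 of 1000, versus 600 of 1000 for the current algorithm).</p>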
<p><span id="fig:ch1-simpson-model1" label="fig:ch1-simpson-model1"></span><img src="/assets/Figures/Chapter1/simpson-ctr-model1_cropped.png" alt="Model 1" /></p>
<p><span id="fig:ch1-simpson-model2" label="fig:ch1-simpson-model2"></span><img src="/assets/Figures/Chapter1/simpson-ctr-model2_cropped.png" alt="Model 2" /></p>
<p><span id="fig:ch1-simpson-model3" label="fig:ch1-simpson-model3"></span><img src="/assets/Figures/Chapter1/simpson-ctr-model3_cropped.png" alt="Model 3" /></p>
<h4 id="figure-16-different-causal-models-for-the-same-data">Figure 1.6: Different causal models for the same data.</h4>
<p>How is this possible? How can an algorithm be better overall, but
then be worse for each individual sub-population? The statistical
explanation is that the new algorithm somehow attracted a higher
fraction of high-activity users, and those users also tended to click
more. In comparison, the old algorithm was better for both types of
users, but the sheer number of high-activity users that used the new
algorithm pushed up its click ratio overall. This dilemma is sometimes
referred to as <em>Simpson’s paradox</em>, named after the statistician
Edward Simpson, who described it in 1951. The causal explanation is that it is not a paradox; the
interpretation of a causal effect depends on the specific causal model
that we assume, as we discussed above. In the first case, we assume the
structural model shown in Figure <a href="#fig:ch1-simpson-model1">1.6 (top)</a>, which
assumes that the algorithm has a direct causal effect on the CTR metric,
and no other variable confounds this effect. In the second case, after
conditioning on user activity, we assume the structural model shown in
Figure <a href="#fig:ch1-simpson-model2">1.6 (middle)</a>, which treats user activity as a
confounder for the effect of the algorithm on CTR. Note that both causal
conclusions are valid: given the first model, the new algorithm causes
the CTR to rise, whereas given the second model, the new algorithm
reduces the CTR. The correct answer depends on which structural model
reflects reality, illustrating the dependence of any causal effect on
its underlying model. In this case, we know from past experience (and
past work) that high-activity users behave differently than
low-activity users, and thus we choose the interpretation of Model 2.</p>
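<p>If we accept Model 2, one standard way to summarize the stratum-level comparisons as a single number is to weight each activity group by its share of all queries (often called adjustment or standardization). The sketch below applies that weighting to the counts in Table 1.1; the function names are hypothetical, and the choice of weights (the pooled activity distribution across both algorithms) is an assumption of this example.</p>

```python
# (clicks, queries) per algorithm and activity group, copied from Table 1.1
counts = {
    "current": {"low": (10, 400), "high": (40, 600)},
    "new": {"low": (4, 200), "high": (50, 800)},
}

def activity_shares(counts):
    """Share of all queries (both algorithms pooled) in each activity group."""
    totals = {}
    for strata in counts.values():
        for activity, (_, queries) in strata.items():
            totals[activity] = totals.get(activity, 0) + queries
    grand_total = sum(totals.values())
    return {activity: n / grand_total for activity, n in totals.items()}

def adjusted_ctr(algorithm, counts):
    """Weighted average of stratum CTRs using the pooled activity distribution."""
    shares = activity_shares(counts)
    return sum(
        shares[activity] * clicks / queries
        for activity, (clicks, queries) in counts[algorithm].items()
    )

adjusted_effect = adjusted_ctr("new", counts) - adjusted_ctr("current", counts)
print(f"adjusted CTR difference (new - current): {adjusted_effect:+.4f}")
```

<p>With both algorithms evaluated on the same mix of users, the difference turns negative, matching the conclusion drawn from Model 2.</p>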
<p>Does that resolve our dilemma? What if there were another variable that
we forgot to condition on? Figure <a href="#fig:ch1-simpson-model3">1.6 (bottom)</a> shows this
scenario, where in addition to activity, difficulty of the queries also
played a role. When we condition on both activity and query difficulty,
we find that the result flips again: the new algorithm turns out to have
a higher click-through rate. This example illustrates the difficulty of
drawing causal conclusions from data alone. Given the same data, the
causal conclusion is highly sensitive to the underlying structural
model. While we discuss ways to eliminate models that are inconsistent
with the data in Chapter 4, it is not possible to
infer the right causal model purely from data. Thus, causal reasoning
from data must necessarily include domain knowledge that informs the
creation of the structural model. Note that this is not a contrived
scenario: dilemmas like these are pervasive and come under different
names, such as selection bias, Berkson’s paradox, and others that we
will discuss later in the book. As a trivial example, this is the same
reason that a naive analysis of hospital visits and death might conclude
that going to a hospital leads to death, but of course that is not the
correct causal interpretation.</p>
<p>In the rest of the book, we will describe different identification
methods that can be used to <em>deconfound</em> an observed association. We
will describe how to choose the right formulas for deconfounding and how to
estimate the effect. In some cases, however, we may not be able to
identify an effect given the model and available data. In that case, we
may reconsider the modeling assumptions, collect new kinds of data, or
declare that it is impossible to find the causal effect.</p>
<h3 id="estimation">Estimation</h3>
<p>Once a causal effect has been identified, we can estimate it using
statistical methods. One way is to directly plug in an estimate based on
the identified estimand. For instance, in the ice-cream example, we may
stratify the data based on the different temperatures and then use the
plug-in estimator for the conditional mean. As a concrete example, consider
the randomized experiment to determine the effect of medication from
Section <a href="#sec:ch1-randomized-exp">1.2</a>. Given the identified estimand from
above, we can estimate the effect using a simple plug-in estimator.</p>
<p>Average Causal Effect = <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 1)] − <strong>E</strong>[<em>Y</em>|do(<em>A</em> = 0)]</p>
<p>=<strong>E</strong>[<em>Y</em>|<em>A</em> = 1]−<strong>E</strong>[<em>Y</em>|<em>A</em> = 0] … <em>(Identification Step)</em></p>
<p><script type="math/tex">=\frac{1}{N_{G1}} \sum_{i \in G1} Y_i - \frac{1}{N_{G0}} \sum_{j \in G0} Y_j \ \ ...\ \ \textit{(Estimation Step)}</script>
where <em>G</em>1 and <em>G</em>0 refer to the two groups of people (those with
<em>A</em> = 1 and those with <em>A</em> = 0, respectively). With infinite data,
this is the best estimator for the causal effect, since it directly
estimates the estimand. However, in practice, we have finite data, which
introduces variance challenges. If there are many variables to condition
on, we may not have enough data in each stratum and hence the
conditional means will no longer be trustworthy.</p>
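<p>The plug-in estimator in the estimation step is simply a difference of group means. A minimal sketch follows; the function name <code>plug_in_effect</code> and the eight records are made up for illustration and do not come from the medication experiment.</p>

```python
from statistics import fmean

def plug_in_effect(records):
    """Difference in mean outcome between the A = 1 and A = 0 groups.

    `records` is a list of (a, y) pairs with a in {0, 1}.
    """
    g1 = [y for a, y in records if a == 1]
    g0 = [y for a, y in records if a == 0]
    if not g1 or not g0:
        raise ValueError("both groups need at least one unit")
    return fmean(g1) - fmean(g0)

# Hypothetical data: (treatment A, outcome Y)
records = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 0)]
effect = plug_in_effect(records)
print(effect)  # mean of G1 (0.75) minus mean of G0 (0.25)
```

<p>This estimate carries a causal interpretation only because the identification step (here, randomization) justified replacing the <em>do</em> expressions with conditional means.</p>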
<p>In general, high-dimensionality is one of the major problems for
estimation that we will tackle in this book. Many methods exist for
handling such data. One approach is to coarsen the strata so that the
strata become approximate (e.g., Temperature ranges in multiples of 10)
but the conditional means have low variance. Another approach is to
stratify on the probability of treatment rather than on all the
confounding variables. This makes for better stratification, but the
bias within the strata now depends on the method used for
estimating the probability of treatment.</p>
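<p>As a rough sketch of the first approach, the code below bins Temperature into ranges of width 10, computes the treated-minus-control difference within each bin, and averages the differences weighted by bin size. The records are invented, the weighting scheme is one reasonable choice among several, and bins that lack either group are skipped, which is itself an assumption worth noting.</p>

```python
from collections import defaultdict
from statistics import fmean

def stratified_effect(records, width=10):
    """Size-weighted average of within-stratum differences in mean outcome.

    `records` is a list of (temperature, treated, outcome) with treated in {0, 1}.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for temperature, treated, outcome in records:
        strata[int(temperature // width)][treated].append(outcome)

    total, weighted = 0, 0.0
    for groups in strata.values():
        if groups[0] and groups[1]:          # need both groups to compare
            n = len(groups[0]) + len(groups[1])
            weighted += n * (fmean(groups[1]) - fmean(groups[0]))
            total += n
    if total == 0:
        raise ValueError("no stratum contains both treated and control units")
    return weighted / total

# Hypothetical data where the outcome depends on temperature only.
records = [
    (11, 0, 20), (14, 0, 22), (16, 1, 21),
    (22, 1, 40), (25, 1, 42), (24, 1, 41), (28, 0, 41),
]
naive = fmean([y for _, t, y in records if t == 1]) - fmean([y for _, t, y in records if t == 0])
stratified = stratified_effect(records)
print(f"naive: {naive:.2f}, stratified: {stratified:.2f}")
```

<p>Wider bins give more units per stratum and therefore steadier means, at the cost of comparing units whose temperatures are only approximately equal.</p>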
<p>As we will see, many of these methods can also utilize machine learning
whenever the estimand can be written as a function of available data.
Note that estimation of the causal effect is a purely statistical
exercise that estimates the identified causal estimand. Keeping
identification and estimation separate has a nice modular advantage:
identification and estimation can be performed independently, using
different methods. It can also tell us when improving the estimation
algorithm is not likely to yield benefits, such as when the causal
effect is not identified. Throughout, we will emphasize the
separation between identification on the causal model and estimation
on the data. The causal interpretation of any calculated effect comes
from the structural model, and can be derived without access to any data
(assuming that the structural model is correct).</p>
<h3 id="refute">Refute</h3>
<p>The above three steps will yield an answer to our causal question, but
how much should we trust this estimate? As remarked above, causal
interpretation comes from identification, which in turn derives its
validity from the modeling assumptions. Therefore, the last and perhaps
the most important step is to check the assumptions that led to
identification. In addition, the estimation step also makes assumptions
regarding the statistical properties of the data to derive the estimate,
which also need to be verified. While the structural model cannot be
validated from data, we will discuss how in some cases, the observed
data can help us eliminate causal models that are inconsistent with data
and check the robustness of our estimate to causal assumptions. As we
discussed in the search engine example, a common faulty assumption is
that all confounders are known and observed. In
Chapter 4, we show how we can simulate datasets with
unknown confounders and assess sensitivity of the estimate to such
assumption violation. We will also discuss other tests of the
identifying assumptions. These sensitivity tests cannot prove the
validity of an assumption, but rather help us refute some kinds of
assumptions.</p>
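<p>One simple refutation-style check, shown here only as an illustration and not as the sensitivity analysis of Chapter 4, is a placebo test: shuffle the treatment labels so that they cannot have caused the outcome, re-run the estimator, and confirm that the estimate collapses toward zero. The data-generating process and all names below are assumptions of this sketch.</p>

```python
import random
from statistics import fmean

def difference_in_means(treatments, outcomes):
    g1 = [y for t, y in zip(treatments, outcomes) if t == 1]
    g0 = [y for t, y in zip(treatments, outcomes) if t == 0]
    return fmean(g1) - fmean(g0)

def placebo_estimates(treatments, outcomes, rng, rounds=200):
    """Re-estimate the effect after repeatedly shuffling the treatment labels."""
    shuffled = list(treatments)
    estimates = []
    for _ in range(rounds):
        rng.shuffle(shuffled)   # placebo treatment: unrelated to the outcome
        estimates.append(difference_in_means(shuffled, outcomes))
    return estimates

rng = random.Random(2)
# Hypothetical randomized data with an assumed true effect of 1.5.
treatments = [rng.randint(0, 1) for _ in range(2000)]
outcomes = [1.5 * t + rng.gauss(0, 1) for t in treatments]

estimate = difference_in_means(treatments, outcomes)
mean_placebo = fmean(placebo_estimates(treatments, outcomes, rng))
print(f"estimate: {estimate:.2f}, mean placebo estimate: {mean_placebo:.3f}")
```

<p>A placebo estimate far from zero would signal a problem with the analysis; an estimate near zero does not prove the assumptions correct, it only fails to refute them.</p>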
<p>While the sensitivity to causal assumptions may seem like a big
disadvantage, this is actually a fundamental limitation of learning from
data. Multiple causal models can explain the same data with exactly the
same likelihood; thus, without any additional knowledge, it is
impossible to disambiguate between them. The benefit of expressing causal
assumptions in the form of a separate structural model is that it allows
us to emulate the scientific method in doing our analysis: hypothesize a
theory, design an experiment to test it, improve the theory.
Analogously, we can imagine a workflow where we start with a causal
model, test its assumptions with data, and then change the assumptions
based on any inconsistencies. However, if any causal effect depends on
the underlying structural model, how is it possible to test the
assumptions of a causal model? Fortunately, there is one method whose
causal conclusions do not depend solely on assumptions from a structural
model. We present this next.</p>
<h2 id="the-rest-of-this-book">The rest of this book</h2>
<p>Part I of our book focuses on a conceptual introduction to these four
steps. Chapter 2 covers modeling and identification (Steps 1 and 2).
Chapter 3 focuses on estimation. Chapter 4 discusses refutations.</p>
<p>Part II of this book goes into more of the practical nuts and bolts of
these four steps, including details of analytical methods for
identification (Chapter 5), a variety of statistical estimation methods
for conditioning-based methods (Chapter 6) and natural experiments
(Chapter 7), and details of methods for validating and refuting
assumptions in practice (Chapter 8). Chapter 9 introduces a number of
concerns that complicate real-world analyses, and discusses basic
approaches and extensions to mitigate their consequences.</p>
<p>Part III of this book focuses on the connections between causal
reasoning and its application in the context of core machine learning
tasks (Chapter 10). We provide a deeper discussion of causal reasoning
for experimentation and reinforcement learning (Chapter 11),
considerations when learning from observational data (Chapter 12), how
causal reasoning relates to robustness and generalization of machine
learning models (Chapter 13), and connections between causal reasoning
and challenges of explainability and bias in machine learning (Chapter
14).</p>
Wed, 11 Dec 2019 00:00:00 +0000
https://causalinference.gitlab.io/causal-reasoning-book-chapter1/
Book outline - Causal Reasoning: Fundamentals and Machine Learning Applications
<p>This book is aimed at students and practitioners familiar with machine learning
(ML) and data science. Our goal is to provide an accessible introduction to
causal reasoning and its intersections with machine learning, with a particular
focus on the challenges and opportunities brought about by large-scale
computing systems acting as interventions in the world, ranging from online
recommendation systems to healthcare decision support systems. We hope to
provide a practical perspective to working on causal inference problems and
a unified interpretation of methods from varied fields such as statistics,
econometrics and computer science, drawn from our experience in applying these
methods to online computing systems.</p>
<p>Throughout, methods and complicated statistical ideas are motivated and
explained through practical examples in computing systems and their applications.
In addition, we devote a third of the book to discussing machine learning
applications of causal inference in detail, in different settings such as
recommendation systems, system experimentation, learning from log data,
generalizing predictive models, and fairness in computing systems.</p>
<p>Beyond our focus on machine learning applications, we expect that three aspects of
our perspective on causal reasoning will be woven throughout our treatment (pun
not intended) of this material, to help organize our materials and provide what
may be a distinct viewpoint on causal reasoning. While this book is targeted
primarily at computer scientists, these aspects may also make the book useful
for learners more broadly:</p>
<ol>
<li>
<p>We present a unified view of causality frameworks, including the two major
frameworks from statistics (Potential outcomes framework) and computer science
(Bayesian graphical models) which are often not presented together. We present
how these are compatible frameworks that have different strengths, are
appropriate for different stages of a causal reasoning pipeline, and provide
practical advice on how to combine them in a causal inference analysis. In
addition, by introducing causal inference through the core concepts of
interventions and counterfactuals, we introduce causal inference
methods from a “first-principles” approach, creating a clear taxonomy of
back-door and natural experiment methods and highlighting similarities between
various methodologies.</p>
</li>
<li>
<p>We make an explicit distinction between identification (causal) and
estimation (statistical) methods. While this distinction is fundamental to
causal reasoning, it is overlooked in many texts on causal inference, preventing
readers from understanding the distinction from statistical methods and sources
of error in their causal estimate. Through this distinction, we make a natural
connection to machine learning: ML can be useful for all statistical parts of
causal reasoning, but it is not useful for identification, which follows from
causal assumptions, whether implicit or explicit. Throughout the book, we discuss
how machine learning can enrich estimation methods by allowing non-parametric
estimation, and how causal reasoning can be useful to make ML methods more
robust to environmental changes.</p>
</li>
<li>
<p>Finally, we discuss the criticality of assumptions in any causal analysis
and present practical ways to test the robustness of a causal estimate to
violation of its assumptions. We refer to this exercise as “refuting” the
estimate, in a similar way to how refutation of scientific theories is a common
way to test their relevance. Based on our experience, we present different ways
to test assumptions, check robustness and conduct sensitivity analysis for any
obtained estimate.</p>
</li>
</ol>
<p>Throughout, we will include code examples using <a href="https://github.com/microsoft/dowhy">DoWhy</a>, a Python library for causal inference.</p>
<hr />
<p>The current outline of our book is as follows:</p>
<h3 id="choutline-part1">PART I. Introduction to Causal Reasoning</h3>
<p>We introduce the key concepts behind causal reasoning, organized by the 4 steps
of modeling and assumptions; identification; estimation; and refutation. We
focus this part on fundamental ideas and abstractions, using simplified examples
to provide readers with useful intuitions.</p>
<p>Chapter 1. Introduction</p>
<p>Chapter 2. Causal Models, Assumptions and Identification</p>
<p>Chapter 3. Causal Estimation</p>
<p>Chapter 4. Refutations, Validations, and Sensitivity Analyses</p>
<h3 id="choutline-part2">Part II. Methods and Practice</h3>
<p>We discuss deeper details of the methods and concepts introduced in Part I. We focus this part on mathematical details, common pitfalls and heuristics that are used in practice, using more detailed examples to provide readers with deeper experience and intuition.</p>
<p>Chapter 5. Identification</p>
<p>Chapter 6. Conditioning-based Methods</p>
<p>Chapter 7. Natural Experiments</p>
<p>Chapter 8. Validating and Refuting Assumptions in Practice</p>
<p>Chapter 9. Practical Considerations</p>
<h3 id="choutline-part3">Part III. Applications</h3>
<p>We discuss connections between causal inference and core applications in computer science, including experimentation and reinforcement learning; offline learning from logged data; challenges of robustness and generalizability of machine learned classifiers; and the important task of building interpretable and fair machine learning models. In each chapter, we provide a single in-depth example.</p>
<p>Chapter 10. Connections between Causality and Machine Learning</p>
<p>Chapter 11. Experimentation and Reinforcement Learning</p>
<p>Chapter 12. Learning from Logged Data</p>
<p>Chapter 13. Generalization in Classification and Prediction</p>
<p>Chapter 14. Machine Learning Explanations and Bias</p>
<hr />
<p>We are posting Chapter 1 of our book now, and will be releasing
new chapters regularly. Our texts will often be rough, especially on
their initial posting, and we expect they will see substantial change throughout
the writing process. We appreciate in advance your patience with
our errors and mistakes, as well as your comments and feedback throughout.</p>
Wed, 11 Dec 2019 00:00:00 +0000
https://causalinference.gitlab.io/Causal-Reasoning-Fundamentals-and-Machine-Learning-Applications/