Chapter 1: Causal Reasoning Book


Machine learning algorithms are increasingly embedded in applications and systems that touch upon almost every aspect of our work and daily lives, including in societally critical domains such as healthcare, education, governance, finance and agriculture. Because decisions in all these domains can have wide ramifications, it is important that we not only understand why a system makes a decision, but also understand the effects (and side-effects) of that decision, and how to improve decision-making to achieve more desirable outcomes. As we shall see in this chapter, all of these questions are what we call causal questions and, unlike many conventional machine learning tasks, cannot be answered using only passively observed data. Moreover, we will show how even questions that do not seem causal, such as pure prediction questions, can also benefit from causal reasoning.

First, let us briefly and informally define causal reasoning as the study of cause-and-effect questions, such as: Does A cause B? Does recommending a product to a user make them more likely to buy it? If so, how much more likely? What makes a person more likely to repay a loan or be a good employee (or employer)? If the weather gets hot, will crops wilt? Causal reasoning is the study of these questions and how to answer them.

In this book, we focus on causal reasoning in the context of machine learning applications and computing systems more broadly. While most of what we know about causal reasoning from other domains remains useful in the context of computing applications, computing systems offer a unique set of challenges and opportunities that can enrich causal reasoning. On the one hand, the scale of data, its networked nature, and high-dimensionality challenge standard methods used for causal reasoning. On the other hand, these systems allow control over data gathering and measurement and allow easy combination of passive observations and active experimentation, thereby providing opportunities for rethinking typical methods for causal reasoning.

This introductory chapter motivates the use of causal reasoning. Given how machine learning systems are being used in almost all parts of our society, we discuss a wide range of use-cases, ranging from recommender systems in online shopping and algorithmic decision support in medicine, hiring, and criminal justice, to data-driven management for agriculture and industrial applications. We discuss how these are all fundamentally interventions that require causal analysis to understand their effects. Further, we frame the connections between causal reasoning and critical machine learning challenges, such as domain adaptation, transfer learning, and interpretability.

What is Causal Reasoning?

Brief Philosophy

Causal reasoning is an integral part of scientific inquiry, with a long history starting from ancient Greek philosophy. Fields ranging from biomedical to social sciences rely on causal reasoning to evaluate theories and answer substantive questions about the physical and social world that we inhabit. Given its importance, it is remarkable that many of the key statistical methods have been developed only in the last few decades. As Gary King, a professor at Harvard University puts it,

“More has been learned about causal inference in the last few decades than the sum total of everything that had been learned about it in all prior recorded history”.

This might seem puzzling. If causal reasoning is so critical, then why hasn’t it become a common form of reasoning such as logical or probabilistic reasoning? The issue is that “causality” itself is a nebulous concept. From Aristotle to Hume and Kant, many philosophers and scholars have attempted to define causality but have not reached a consensus so far.

To understand the difficulty, let us first ask you, the reader, to let go of this book and drop it on the floor—and then pick it up again and continue reading! Now, let us ask, what was the cause of the book dropping? Did the book fall because you let go of the book? Or did the book fall because we, the authors, asked you, the reader, to drop it? Perhaps you would have let go of the book even if we had not asked you to. Maybe it was gravity. Perhaps the book fell because the reader is not an astronaut reading the book in space.

This simple example of the falling book illustrates many of the important philosophical challenges that have vexed efforts to conceptualize causality. These include basic questions of abstraction, as well as the notions of sufficient and necessary causes. For example, gravity is of course necessary, but not sufficient, to cause the book to fall—gravity together with the reader letting go of the book is both necessary and sufficient for the book to fall. The example also illustrates proximate and ultimate causes: the reader dropping the book is a proximate cause, while the authors asking the reader to drop the book may be a more distant, ultimate cause. Finally, the example raises the question of whether causes must be deterministic. In other words, does the likelihood that not all (or even most) readers are suggestible enough to drop this book when asked imply that the authors’ request is not a cause at all? Or is it possible for our request to be considered a probabilistic cause?

Hume asks how we know—how we learn—what causes an event. Consider the simple act of striking a match and observing that it lights up. Would we say that striking causes the match to light up? Trusting in data, suppose we repeat this action 1000 times and observe the same outcome each time. Hume argues that, while this may seem to provide strong evidence that striking the match causes it to light, the experiment only demonstrates predictability: its results are indistinguishable from the case where the two events just happen to be perfectly correlated with each other. Hume proposed this quandary in his book, “A Treatise of Human Nature” (1739), and concluded that causality must be a mental construct that we assign to the world, and thus does not exist outside it. Other scholars have challenged this provocation and argued for the existence of causality.

These philosophical challenges are essentially questions of abstraction. Modern advances in causal reasoning have not come through answering most of these provocations directly but, rather, by creating flexible methods for reasoning about the relationships between causes and effects regardless of the abstractions one chooses. In this book, therefore, we attempt to steer clear of the above philosophical ambiguities and adopt one of the simpler and more practical approaches to causal reasoning, known as the interventionist definition of causality.

Defining Causation

Definition: In the interventionist definition of causality, we say that an event A causes another event B if we observe a difference in B’s value after changing A, keeping everything else constant.

Due to causal reasoning’s early applications in medicine (which we will discuss in chapter 3), it is customary to call A the “treatment” (also sometimes called “exposure”), or simply the cause. B is referred to as the “outcome”. Readers familiar with reinforcement learning may analogize A as the “action” and B as the “reward”. In general, these events are associated with measurement variables that describe them quantitatively, e.g., the dosage of a treatment drug and its outcome in terms of blood pressure, which we refer to as the treatment variable and the outcome variable respectively. For convenience, we use events and their measurement variables interchangeably, but it is important to remember that causality is defined over events, and that the same events can correspond to different variables when measured differently.

Interventions and Counterfactuals

There are two phrases in the above definition that need further unpacking: “changing A” and “keeping everything else constant”. These correspond to the two key concepts in causal reasoning: an intervention and a counterfactual, respectively. An intervention refers to any action that actively changes the value of a treatment variable. Examples of an intervention include giving a medicine to a patient, changing the user interface of a website, awarding someone a loan, and so on. It is important to distinguish an intervention from simply observing different values of the treatment. That is, assigning specific people to try out a new feature of a system is an intervention, but people finding and trying the feature on their own is not. While this might seem a small difference, its importance cannot be overstated: the two are fundamentally different and can lead to varying, even opposite, conclusions when analyzing the resultant data. In particular, in the observational case, it is hard to know whether any observed effect (e.g., increased usage) is due to the feature or due to characteristics of the people (e.g., high-activity users) who were able to discover the feature. The history of causal reasoning is replete with examples where observations were used in place of interventional data, sometimes with disastrous results. We will discuss some of them in this book.

Intervention: An action that actively changes the distribution of a treatment variable T.

To gain a valid interpretation of its effect, however, an intervention must be performed “keeping everything else constant”. That is, it is not enough to take an action; we must also ensure that none of the other relevant factors change, so that we can isolate the effect of the intervention. Continuing our example of estimating the effect of a new feature, it is not enough to merely assign people to try it; we must also ensure that none of the other system components changed at the same time. From early experiments in the natural sciences, such an intervention came to be known as a “controlled” experiment, where we clamp down the values of certain variables to isolate the effect of the intervention.

While “controlling” or keeping other variables constant is intuitive, it is unclear which variables must be included. We can obtain a more precise definition by utilizing the second key concept of causal reasoning: counterfactuals. The idea is to compare what happened after an intervention to what would have happened without it. That is, for any intervention, we can imagine two worlds, identical in every way up until the point where some “treatment” occurs in one world but not the other. Any subsequent difference between the two worlds is then, logically, a consequence of this treatment. The first world is the observed, factual world, while the second is the unobserved, counterfactual world (the word counterfactual means “contrary to fact”). The counterfactual world, identical to the factual world except for the intervention, provides a precise formulation of the “keeping everything else constant” maxim. The value a variable takes in this world is called a counterfactual value.

Counterfactual Value: The (hypothetical) value of a variable under an event that did not happen.

Putting together counterfactuals and interventions, the causal effect of an intervention can be defined as the difference between the observed outcome after an intervention and its counterfactual outcome without the intervention. We express the outcome under the factual world as Y_World1(T = 1), and that under the counterfactual world as Y_World2(T = 0). For a binary treatment, the causal effect can be written as:

Causal Effect := Y_World1(T = 1) − Y_World2(T = 0)
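To make the definition concrete, here is a minimal sketch in Python, with made-up outcome values, of how the causal effect would be computed if, hypothetically, we could observe both worlds at once:

```python
# Hypothetical, made-up example: the outcome is 1 if the patient
# recovers, and 0 otherwise. In reality we can never observe both
# worlds for the same unit.
y_world1 = 1  # factual outcome after the intervention (T = 1)
y_world2 = 0  # counterfactual outcome without the intervention (T = 0)

causal_effect = y_world1 - y_world2
print(causal_effect)  # 1: in this example, the treatment caused recovery
```

The entire difficulty of causal inference lies in the fact that `y_world2` is never directly available.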

The above equation shows that inferring the effect of an intervention can be viewed as the problem of estimating the outcome under the counterfactual world, since the factual outcome is usually known. Thus, counterfactual reasoning is key to inferring causal effects. Coming back to the matchstick example, we can define our intervention as striking a match. The factual world is the world where we strike the match and see it light up, and the counterfactual world is the world where we do not strike the match but keep everything else the same. Under our interpretation of causality, one expects that the match would not light up in the counterfactual world, and hence we can claim that striking the match causes it to light. Happily, our conclusion coincides with common intuition, and as we shall see, counterfactual reasoning applies well to many practical problems. That said, we must emphasize that this definition of causality is not absolute; it depends on the initial world in which one starts. For instance, in the matchstick example, if we started in an oxygen-free environment (or in outer space) and applied the same counterfactual reasoning, we would conclude that striking the match does not cause it to light up, illustrating Hume’s dilemma.

The Gold Standard: Randomized Experiment

Let us now apply the above two concepts to describe one of the most popular methods of causal reasoning: the randomized experiment. We consider a simple example of deciding whether to recommend a medication to a patient, Alice. As we discussed above, we can evaluate this decision by considering the causal effect of the medication on Alice. Here the treatment is administering the medication and the outcome is Alice’s health afterwards. From the equation above, we can define the causal effect as the difference between the value of Y in a world where we gave Alice the treatment T, Y_World(T_Alice = 1), and in a world where we did not, Y_World(T_Alice = 0):

Causal Effect = E[Y_World(T_Alice = 1)] − E[Y_World(T_Alice = 0)]

This may seem straightforward, but the fundamental challenge is that this calculation requires taking the difference between an observed outcome and a counterfactual that we cannot observe. If we want to calculate this difference, we can either 1) observe the outcome of giving Alice the treatment T and compare it to the unobserved counterfactual outcome of not giving her the treatment; or we can 2) observe the outcome of not giving Alice the treatment T and compare it to the unobserved counterfactual outcome of giving her the treatment.

No matter what we do, we cannot in any single experiment, both do T and not do T! Whatever we actually do, the counterfactual will remain unobserved. This is called the “missing data” problem of causal inference. If Alice takes a medication and we observe that she then gets better, we cannot also observe what would have happened if she hadn’t taken the medication. Would Alice have gotten better on her own, without the medication? Or would she have stayed ill?


Figure 1.1: The missing data problem of causal inference.
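The missing data problem can be made concrete with a small, made-up potential-outcomes table: each person has two potential outcomes, but exactly one of them is observable, depending on which treatment they actually received.

```python
import math

nan = float("nan")  # marks the unobservable counterfactual outcome

# Made-up data: (person, outcome if treated, outcome if untreated).
potential_outcomes = [
    ("Alice", 1, nan),  # Alice took the medication; only Y(T=1) is observed
    ("Bob", nan, 0),    # Bob did not; only Y(T=0) is observed
    ("Carol", 1, nan),
]

# For every person, exactly one of the two potential outcomes is missing.
for person, y_treated, y_untreated in potential_outcomes:
    assert math.isnan(y_treated) != math.isnan(y_untreated)
print("every counterfactual cell is missing")
```

Every method discussed in this book can be seen as a strategy for filling in, or averaging over, these missing cells under some explicit assumption.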

To solve this problem, we need to make further assumptions on the intervention or the counterfactual. For instance, if Alice has an identical twin, Beth, who is also sick, we might give the medication to Alice, but not give the medication to Beth. Then we can make an assumption about the counterfactual: identical twins have the same counterfactual when it comes to health outcomes. That is, we argue that Beth is so similar to Alice—in terms of general health, genetics, specifics of the illness, etc.—that they are likely to experience the same outcomes, but for the medication. In this case, we could use our observation of Beth’s outcomes as an estimate of Alice’s counterfactual—what would have happened to Alice had she not taken the medication.

Causal Effect = E[Y_World(T_Alice = 1)] − E[Y_World(T_Beth = 0)]

But of course, not everyone has an identical twin, much less an identical twin with identical general health, habits, and illness. And if two individuals are not identical, then there will always be a question of whether differences in outcomes are due to the underlying dissimilarities between them, instead of due to the medication. When the two twins do not share the same illness, or more generally, when comparing two different people, we cannot expect that their counterfactuals will match. These differences that confuse our attribution of differences in outcome to differences in the treatment are called confounders.

Another approach is to make assumptions about the intervention. For example, assume that there is a dataset of patient outcomes where medications were given irrespective of patients’ actual health condition. That is, the outcome of any person still depends on their health condition, but whether they took the drug does not. Therefore, if we now compare any two individuals with or without the drug, we can argue that there is no systematic difference between them. Over a large enough sample, the average outcome of the treated group approximates the average counterfactual outcome of the untreated group, and vice versa. This allows us to estimate the effect of an intervention defined over a group of people, rather than just Alice:

Causal Effect = E[Y_World(T = 1)] − E[Y_World(T = 0)]

where the treatment assignment T ∈ {0, 1} is independent of people’s health condition. More generally, the core idea is that instead of trying to find two individuals that are identical to one another, we find two populations that are essentially identical to one another. This may seem like it should be harder than identifying two identical individuals—after all, now we have to find many people, not just two—but in practice, it is actually easier. It turns out that, if we want to find the average effect of a treatment, we just need to ensure that there are no systematic differences between the groups as a whole. The advantage is that we no longer need to find identical individuals; but how do we account for all the differences between any two groups of people? Is it enough that the treatment assignment ignores people’s health condition?

Causal reasoning took a major step forward in the early twentieth century when Fisher discovered a conceptually straightforward way to conduct an intervention such that there is no systematic difference between the treated and untreated groups. We simply gather one large population of people and randomly split them into two groups (G = 0 or G = 1), one of which will receive the treatment and the other will not. By randomly assigning individuals to receive or not receive treatment, we ensure that, on average, there is no difference between the two groups. The implication is that the expected outcomes of the two groups are the same: when we observe the average outcome of the untreated group (G = 0), we can use it as an estimate of what the counterfactual outcome would have been for the treated group (G = 1); similarly, when we observe the average outcome of the treated group, we can use that as an estimate of the counterfactual outcome for the untreated group. This methodology is called the randomized experiment—also sometimes called a randomized controlled trial, A/B experiment, and other names.
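A small simulation (all numbers made up) illustrates why randomization works: even though each simulated person’s counterfactual stays hidden, a random coin flip assigning treatment makes the difference of group averages a good estimate of the true average effect.

```python
import random

random.seed(0)

# Simulated population: each person has a baseline outcome y0 and an
# individual treatment effect; by construction, the true average effect
# across the population is 0.5.
n = 100_000
population = [(random.gauss(0, 1), random.gauss(0.5, 0.1)) for _ in range(n)]

treated_outcomes, control_outcomes = [], []
for y0, effect in population:
    if random.random() < 0.5:  # the coin flip ignores everything about the person
        treated_outcomes.append(y0 + effect)  # we observe only Y(T=1)
    else:
        control_outcomes.append(y0)           # we observe only Y(T=0)

estimate = (sum(treated_outcomes) / len(treated_outcomes)
            - sum(control_outcomes) / len(control_outcomes))
print(round(estimate, 1))  # close to the true average effect of 0.5
```

Note that the counterfactual column is never used after assignment; the difference of group means recovers the average effect anyway, precisely because assignment is independent of the individuals’ characteristics.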

Coming back to our question of whether to give Alice the medication, we can use the average causal estimate above to make a decision: if the effect over a general population is positive, then assign the drug; otherwise, do not. Note that the decision will be the same for every person, since we are deciding based on the average effect of the medication. Assumptions on the counterfactual (e.g., comparing to Beth or a similar person) can provide individual causal effects, but the estimates suffer from error whenever the counterfactuals are not identical. In contrast, assumptions on the intervention lead to group-wise causal effects, which are accurate as long as the treatment is randomized. An interesting and important thing to realize is that randomized experiments not only address the confounders that we are aware of, but also ensure that our analysis is sound even when there are confounders that we are not measuring or maybe haven’t even thought of. Because of this, randomized experiments are considered more robust than other approaches; in fact, they are often referred to as the “gold standard” for causal inference. Randomized experiments are used to test whether a new medicine really cures an illness or has significant side-effects, whether a marketing campaign works, whether one search algorithm is better than another, and even whether one color or another on a website is better for user engagement.

Despite their general robustness, randomized experiments are not foolproof, and practical problems can occur. First and foremost, ensuring that random assignment is actually random is not always easy. Sometimes we might be tempted to use an assignment mechanism that seems close to random, but actually isn’t. For example, for convenience, we might find it easier to assign all units that arrive on a Monday to treatment A, and all units that arrive on a Tuesday to a placebo. Not only might this be more convenient—it might be logistically easier to give the same treatment to everyone on a given day—but we can also imagine why this regime seems close to random: we do not have prior knowledge of which units will arrive on which day; that is outside of our control and might be random. We might even double-check how similar the units are to one another and find that Monday units and Tuesday units are very similar. However, if there are any unobserved reasons why units arrive at different times, there will be systematic differences between our two groups that will bias our results. As another example of how random assignment might not be random: historically, when patients were being assigned to drug trials, sympathetic patients were more likely to be assigned to treatment. This led to the development of blinding methodologies that prevent people who interact directly with patients from assigning, or even knowing, patients’ treatment status.

Despite their many advantages, randomized experiments are sometimes too costly, unethical, or otherwise infeasible. We are limited by the number of experiments we can run at a given time (each requires sufficient statistical power); by the cost of designing and implementing the experiment; by the length of time we can run the experiment; and so on. Even if running experiments is relatively easy, there are often orders of magnitude more experiments we would like to run than we possibly can. In addition, sometimes there are ethical issues involved in experiments: is it ethical to expose people to potentially harmful treatments?

So, what do we do when we cannot run a randomized experiment, but still need to answer a cause-and-effect question? We turn to methods and frameworks for causal reasoning that make different assumptions about the counterfactual and intervention. Such methods are the focus of most of this book. That said, much of what we will talk about—from accounting for survival bias, interference, heterogeneity, and other validity threats—are applicable to randomized experiments as well.

Why causal reasoning? The gap between prediction and decision-making

Causal reasoning, thus, makes sense whenever we have a treatment (as in medicine) or an economic policy (as in the social sciences) to evaluate. But what use could it be for computing systems, especially at a time when machine learning-based predictive systems are promising success in a variety of applications? To answer this question, let us take a closer look at the success of machine learning and how it may change the role of computing systems in society.

The promise of prediction

Today, computers are increasingly making decisions that have a significant impact on our lives. Sometimes computers make a choice and take action independently, such as deciding on loan applications. Other times, computers simply aid people who make the final choice and drive action, such as helping doctors or judges with recommended actions. Sometimes these computers are hidden far away inside our vital infrastructure, making decisions that seem to only indirectly affect people, as in optimizing the configuration and availability of data centers. Other times, these computers are visibly integrated into the fabric of our daily lives, for example through fitness devices.

But, regardless of how directly or indirectly computers are involved, it is true that computers are helping us make critical decisions across many domains. For example, machine learning algorithms recommend product purchases to customers in online retail sites. Similar algorithms power movie recommendations, placement of advertisements, and many other decisions. Other algorithms are responsible for match-making, pairing up drivers and passengers in ride-sharing platforms, and connecting people in online dating services. Behind the scenes, computers help with logistics, resource allocation, and product pricing. They run algorithms to decide who is eligible for a loan and identify the top candidates who have applied for a job. Each of these decisions, made by a computer algorithm, has significant consequences for all individuals and parties involved.

And computer-aided decision-making is only growing in scope. In the health domain, machine learning is enabling the advent of precision medicine. Computers will soon analyze genetic information, symptoms, test results and medical history to decide how best to heal a particular patient of a malady. In education, computers promise to improve personalized education in the context of both traditional classrooms and the newer massive open online courses. Based on a personalized model of a learner’s conceptual understanding and learning preferences, a computer will coach and support a student in their exploration and mastery of a subject. Data-driven, precision decision-making is improving the productivity of farming while reducing water usage and pollution. Artificial Intelligence (AI) is also bringing or poised to bring similar impact to manufacturing, transportation, and other industries.

This revolution of computer-aided decision-making is aided by three concurrent trends. First, there is a proliferation of data from cheaper and more ubiquitous sensors, devices, applications, and services. Second, cheap and well-connected computational power is available in the cloud. Third, significant advances in machine learning and artificial intelligence methods make it possible to rigorously process and analyze a much broader set of data than was possible even just 10 years ago. For example, if we want to predict upcoming wheat yields in a field, we can now use automated drones to take pictures of the wheat plants, and use deep neural network-based image analysis to recognize and count the grains and extrapolate the likely yield. These pictures, field sensor readings, and other information from the farm can be uploaded to the cloud and joined with weather data and historical data from other farms to learn better management policies and make decisions about crop management.

Thus, increasing amounts of data and advanced machine learning algorithms help make highly accurate predictions. What could go wrong? The simple answer is that going from a prediction to a decision is not straightforward. A typical machine learning algorithm optimizes for the difference between true and predicted values in a given dataset, but a decision based on such a prediction is not always the decision that maximizes the intended outcome. In other words, the causal effect of a decision based on purely data-driven predictive modeling can be arbitrarily bad.

Importance of the underlying mechanism

Consider a simple social news feed application, where users can see messages posted by their friends. Let’s ask whether we can predict something about a user’s future behavior based on what they see in their social feed. That is, if a person sees a link to a news article, a product recommendation, or a review of a real-world destination, can we predict that the person will then read the news article, buy the product, or visit the destination? It turns out the answer is yes: we can make successful predictions about a user’s future behavior based on what they see in their social feed.

If we can predict future behavior based on the social feed, does this mean that, if we decide to change the contents of the social feed, we will change the user’s future behavior? Not necessarily. This is the gap between prediction and decision-making. We can predict a user’s future behavior using the current feed, but that does not tell us much about how they will behave if we change their social feed. Here the decision is whether to change the social feed, and the answer depends on the relationship between the social feed and the user’s future behavior: what affects what?

Figure 1.2 shows two possible explanations for the predictive accuracy. On the left side, we see that the social feed itself causes a person’s future behavior. That is, perhaps social feed posts on this system are very persuasive and do a good job of enticing a person to try new things. Or perhaps, as shown on the right side, people and their friends tend to do similar things anyway. For example, if a group of friends likes going to Italian restaurants and a new one opens, they are all likely to visit the restaurant sometime soon, but one of them happens to go and post about it first. If the friend hadn’t posted, all the friends would have gone to the restaurant anyway, but the post itself helps us predict the behavior of the individuals. On the left-hand side, if we change the social feed, then we will affect the user’s behavior. On the right-hand side, however, changing the social feed will not affect the user’s behavior. But note that in both cases, the social feed helps us predict what the user might do in the future!

Without knowing the direction of effect, we can reach exactly opposite conclusions from the same data. The problem is that, in many scenarios, prediction models are used in service of making a decision, and this gap between prediction and intervention creates problems.
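A toy simulation (entirely made up) of the common-cause explanation makes the gap vivid: a shared preference drives both the feed and the behavior, so the feed predicts behavior perfectly, yet changing the feed changes nothing.

```python
import random

random.seed(1)

# Common-cause world: liking Italian food causes both a friend's post
# appearing in the feed and the later restaurant visit; the post itself
# has no causal effect.
n = 50_000
feed, visits = [], []
for _ in range(n):
    likes_italian = random.random() < 0.5
    feed.append(likes_italian)    # a friend posts about the new restaurant
    visits.append(likes_italian)  # the person would visit regardless of the post

# Observationally, the feed is a perfect predictor of visiting in this toy world.
accuracy = sum(f == v for f, v in zip(feed, visits)) / n
print(accuracy)  # 1.0

# Intervening on the feed: visits still depend only on the hidden preference,
# so forcing the post into (or out of) the feed has roughly zero effect.
visits_if_shown = [random.random() < 0.5 for _ in range(n)]
visits_if_hidden = [random.random() < 0.5 for _ in range(n)]
effect = sum(visits_if_shown) / n - sum(visits_if_hidden) / n
print(abs(effect) < 0.02)  # True: the intervention changes nothing
```

A model trained on the observational half of this simulation would score perfectly on held-out data and still be useless, even harmful, for deciding what to put in the feed.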

Figure 1.2: Two models of social feed effect on user behavior. Left: the social feed affects user behavior. Right: restaurant preference affects both the social feed and user behavior.

The trouble with changing environments

Complicating matters, predictive models can lead us astray even when the underlying direction of effects is known. Let’s consider a second example, where machine learning may be applied to data from a farm to make irrigation decisions. In particular, let’s make our job a little easier by assuming that we know what the effect of irrigation is (unlike in our social feed example, where we do not know the effect of changing the feed). We know that irrigation will increase the soil moisture level by some known amount.

In a predictive model, we may collect data about past soil moisture and other variables and then make predictions of future soil moisture levels based on past data. This prediction can be converted to a simple decision: if the soil moisture is low, irrigate, else do not irrigate. Now, given a history of past soil moisture data on the farm and past weather, let us assume that we can train an accurate model to predict future soil moisture levels based on current soil moisture and future weather forecasts. Can this machine learning prediction model guide our irrigation decisions on a farm?

Again, the answer is no, we cannot make irrigation decisions based on our learned model of soil moisture levels. Imagine that one day the weather forecast says it will be very hot. Our soil moisture model is likely to predict that the soil will be very moist and, based on this prediction, we are likely to decide not to irrigate.

But why is our soil moisture model predicting that there’s no need to irrigate on a very hot day? Our model is trained on past soil moisture data, but in the past the soil was being irrigated under some predetermined policy (e.g., a rule-based decision or the farmer’s intuition). If this policy always irrigated the fields on very hot days, then our prediction model will learn that on very hot days, the soil moisture is high. The prediction model will be very accurate, because in the past this correlation always held. However, if we decide not to water the field on very hot days based on our model’s predictions, we will be making exactly the wrong decision!

The prediction model is correctly capturing the correlation between hot weather and a farmer’s past irrigation decisions. The prediction model does not care about the underlying mechanism. It simply recognizes the pattern that hot weather means the soil will be moist. But once we start using this prediction model to drive our irrigation decisions, we break the pattern that the model has learned. More technically, we say that once we begin active intervention, the correlations that the soil moisture model depends on will change.
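To make this concrete, here is a small simulation of the failure mode. The mechanism and all numbers below are invented for illustration; they are not from the book:

```python
import random

random.seed(0)

def soil_moisture(hot, irrigated):
    # True (hypothetical) mechanism: heat dries the soil, irrigation wets it.
    base = 0.6 if irrigated else 0.2
    return base - (0.15 if hot else 0.0)

# Historical policy: the farmer always irrigates on hot days.
history = []
for _ in range(1000):
    hot = random.random() < 0.3
    irrigated = hot  # the predetermined policy
    history.append((hot, soil_moisture(hot, irrigated)))

hot_days = [m for h, m in history if h]
cool_days = [m for h, m in history if not h]
avg_hot = sum(hot_days) / len(hot_days)
avg_cool = sum(cool_days) / len(cool_days)

# A purely predictive model learns the pattern: hot days -> moist soil.
print(f"avg moisture on hot days (historical):  {avg_hot:.2f}")
print(f"avg moisture on cool days (historical): {avg_cool:.2f}")

# Model-driven policy: "soil will be moist on hot days, so skip irrigation."
# Intervening breaks the learned correlation: the soil is now driest on hot days.
moisture_new_policy = soil_moisture(hot=True, irrigated=False)
print(f"moisture on a hot day under the new policy: {moisture_new_policy:.2f}")
```

Under the historical policy, the model’s “hot implies moist” pattern is perfectly accurate; it fails only once the policy it implicitly relied on changes.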

Both daily temperature and irrigation decisions influence soil moisture levels. Historically, daily temperature also influences irrigation decisions.

Under a historical policy, the correlation between temperature and soil moisture is stable.

Changing the irrigation policy will change the relationship between temperature and soil moisture.

Figure 1.3: Using a correlational model trained on historical data to drive future irrigation decisions will break the historical temperature-soil moisture correlation and thus the machine-learned model.

This illustrates another reason why prediction models are not appropriate for decision-making: prediction models are not robust to changing conditions. Standard machine learning practices—e.g., ensuring that we train and test machine learning models using data drawn from the environment we plan to deploy in—are important, but provide no guarantees. In this irrigation example, the changing conditions are due to our own change in policy based on the model’s predictions—we cannot train our model on observations of the new policy, as the new policy doesn’t exist yet! More generally, such changing conditions can also occur due to exogenous factors. Moreover, these conditions may change quickly when we apply our models in new environments, or change over time within a deployed environment.

Changing environments are particularly important issues in some of the critical domains we care most about: healthcare, agriculture, etc., where we expect machine learning models to help us make better decisions; online services such as ecommerce sites, etc., that have to adapt to seasonality, social influences, and growing and changing user populations; and deployments of machine learning in adversarial settings, e.g., from spam classification and intrusion detection to safety critical systems.

From prediction to decision-making

To recap, we have seen that prediction models are not appropriate for helping us reason about what might happen if we change a system or take a specific action. In our social news feed example, where we asked whether predictive models can help us understand whether changing a social news feed will change future user behavior, we saw that there are multiple plausible explanations of why social feed data can help us predict future user behaviors. While one explanation implies that changing the social feed will affect user behavior, another explanation implies that it won’t affect user behavior. Crucially, the machine learning prediction model does not help us identify which explanation is correct!

Moreover, even with an a priori understanding of the causal effect of an intervention, when we examine the use of a simple prediction model for making decisions, we see that the act of making a decision based on the model changes the environment and puts us into untested territory that threatens the predictive power of our model!

Finally, let us emphasize that these two issues are fundamental. Even when a prediction model has an otherwise extremely high accuracy, we cannot expect that accuracy alone to give us insight into underlying causal mechanisms or help us choose among interventions that change the environment.

Applications of Causal Reasoning

The above section illustrates the importance of causal reasoning whenever we have to make decisions based on data. Below we present sample scenarios from computing systems that involve decision-making, and thus require causal reasoning. In addition, it turns out that causal reasoning is useful even when decision-making may not be the primary focus. For instance, causal reasoning is useful in improving systems that may appear to be purely predictive at first, such as search and recommendation systems.

Making better decisions

There are numerous examples of decision-making in computing systems where causal reasoning can help us make better decisions. We broadly categorize them into three themes: improving utility for users, optimizing underlying systems, and enhancing viability of the system, commercial or otherwise.

We already saw an example of decision-making for improving users’ utility through changing a social feed. Other examples include choosing incentives for encouraging better use of a system, and more broadly, deciding which functionality to include in a product to maximize utility for users. In general, any decision that involves changing a product or service’s features requires causal reasoning to anticipate its future effect. This is because for all these problems, we need to isolate the effect of these decisions from the underlying correlations.

Similar reasoning can also be applied to optimize underlying systems, such as optimizing configuration parameters of a database, deciding network parameters for best throughput, allocating load in a distributed data center for energy efficiency, and so on.

Lastly, the viability and sustainability of any computing system is important too. This involves historically non-computing areas such as marketing and business management, where data-driven decisions are increasingly being made. Consider decisions involving the interaction of a system with the outside world, such as choosing the right messaging for a targeted campaign. As another example, consider a subscription-based service such as Netflix or Office365. It is relatively easy to build a predictive model that identifies the customers who will leave in the next few months, but deciding what to do to prevent them from leaving is a non-trivial problem. We will consider such decision-making applications throughout the book.

Building robust machine learning models

Causal reasoning is also useful in the absence of explicit decisions. Many systems, such as those for recommendation or search that commonly employ predictive models, can be improved with causal reasoning. Predictive models aim to minimize the average error on past data, which may not correspond to the expected error on new data, especially in a system that interacts with people. Consider a ratings-based recommendation system that aims to predict a user’s rating of a new item. If there are systematic biases in the items rated by the user (such as rating movies from a single genre more often), then the system may over-optimize for movies from that genre but make errors for all other genres. The fundamental problem is that past data is collected under certain conditions, and predictions based on it may not be accurate in the future. We shall see in Chapter 12 that the problem of recommendation can be considered as intervening on users with a recommended item, thus defining each recommendation as an intervention. A similar problem arises in optimizing the most relevant pages for a query in a search engine based on log data, and in any system where a user interacts with information. Besides improving accuracy, causal reasoning can also help us understand the effect of algorithms on metrics that they were not optimized for. For instance, it helps us estimate the varied effects of recommendation systems, from impacting diversity to amplifying misinformation and “filter bubbles”.

Relatedly, questions on broad societal impact of computing systems are fundamentally causal questions about the effect of an algorithm: is a loan decision algorithm unfair to certain groups of people? What may be the outcomes of delegating certain medical decisions to an algorithm? As we use machine learning for societally critical domains such as health, education, finance, and governance, questions on the causal effect of algorithmic interventions gain critical importance. Causal reasoning can also be used to understand the effects of these algorithms, and also to explain their output: why did the model provide a particular decision?

More generally, causal reasoning helps predictive models make the jump from fitting retrospective data to making predictions in new settings. Predictive models based on supervised learning work well when we expect them to be tested on the same data distribution on which they were trained. For instance, predictive models can achieve impressive results in distinguishing between different species of birds because we expect to use them to classify similar pictures in the future. If, however, we predict in an unseen environment (e.g., outdoor to indoor), the model may not work well and may even fail to identify the same species. Such environment changes, commonly called concept drift, occur because the association between input features and output changes as the environment changes. Rather than looking for patterns in an image, reasoning about the causal factors that make an image depict a specific species can lead to a more generalizable model. In fact, causal inference can be considered as a special case of the domain adaptation problem in machine learning, which we will explore in Chapter 13.

Beyond supervised learning, causal reasoning shares a special connection with reinforcement learning (RL), in that both aim to optimize the outcome of a particular decision. It is no surprise, then, that simpler forms of RL, such as bandits, are used for optimizing recommendation systems. Causal inference methods also find use in training RL policies, especially when using off-policy data. This synergy between machine learning and causal reasoning is one of the underlying themes of this book: causal reasoning can make machine learning more robust, and machine learning can enable better estimates of causal effects.

Four steps of Causal Reasoning

The focus of this book is on methods and challenges for learning causal effects from observational data. Briefly, observational studies are those where we wish to learn causal effects from already-gathered data and, while we may have some understanding of the data (in particular, the mechanism that generated it), we have limited or no control over that mechanism. So, how are we going to learn causal effects when we cannot run an experiment like a randomized controlled study? How will we deal with confounding variables that might confuse our analysis if we cannot manipulate the experiment to ensure that confounders are independent of treatment status?

At a high level, we’ll need to find a valid intervention and then construct a counterfactual world to estimate its effect. Compared to a randomized experiment, the biggest change is that we will need to make assumptions about how the data was generated. This is critical: causal reasoning depends on a model of the world, which can be considered a set of modeling assumptions. As we saw in the social feed example, the same data can lead to different conclusions depending on the underlying mechanism.

Given data and a model of the world, we first decide whether the available data can answer the causal question uniquely. This step is called identification. Note that identification follows from the modeling assumptions themselves, not from data. When the causal question is uniquely identified, we can estimate it using statistical methods. Note that identification and estimation are separate, modular steps: identification is the causal step, while estimation is a statistical step. Identification depends on the modeling assumptions, estimation on the data. A better estimate does not convey anything about causality, just as identification does not convey anything about the quality of an estimate.

Finally, given the dependence on assumptions, verifying these assumptions is critical. Even with infinite data, incorrect assumptions can lead to wrong answers. Worse, the statistical confidence in those wrong answers will be high. Therefore, the final critical part of causal reasoning is to validate the modeling assumptions. We call this step the “refute” step because, like scientific theories, modeling assumptions can never be proven from data but may be refuted. That said, it is important to note that not all assumptions can be refuted. Causal reasoning is an iterative process where we refine our modeling assumptions based on evidence and try to obtain identification with the fewest untestable, or most plausible, assumptions.

To summarize, we rely on a four step analysis process to carefully address these challenges:

Model and assumptions. The first important step in causal reasoning is to create a clear model of the causal assumptions being made. This involves writing down what is known about the data-generating process and mechanisms. In general, there are many mechanisms that can potentially explain a set of data, and each of these self-consistent mechanisms will give us a different answer for the causal effect we care about. So, if we want a correct answer to our cause-and-effect questions, we have to be clear about what we already know. Given this model, we will be able to formally specify the effect A → B that we want to calculate.

Identify. Use the model to decide whether the causal question can be answered and provide the required expression to be computed. Identification is a process of analyzing our model.

Estimate. Once we have a general strategy for identifying the causal effect, we can choose from several different statistical estimation methods to answer our causal question. Estimation is a process of analyzing our data.

Refute. Once we have an answer, we want to do everything we can to test our underlying assumptions. Is our model consistent with the data? How sensitive is the answer to the assumptions made? If the model is a little wrong, will that change our answer a little or a lot?

Modeling and assumptions

In Section 1.2, we discussed a randomized experiment and applied counterfactual reasoning to estimate the causal effect. Counterfactual reasoning provides a sound basis for causality, but in most empirical problems, we may not obtain perfectly randomized data. Therefore, to estimate the causal effect, we need a precise way of expressing our assumptions about the intervention and the counterfactual we wish to estimate. What has happened in the last few decades is that the concepts of interventions and counterfactuals have been formalized in a general modeling framework, taking causality from the realm of philosophy to empirical science.


Figure 1.4: Structural causal model for ice-cream’s effect on swimming.

The main insight is to replace the factual and counterfactual worlds with a mathematical model that defines the relationship between treatment, outcome, and other variables. This can be done in the form of a graph or functional equations. Crucially, this model does not prescribe the exact functional forms that connect variables, but rather conveys the structure of causal relationships—who affects whom. This structural model embodies all the domain knowledge or causal assumptions that we make about the world; it is therefore called the structural causal model. For instance, consider the question of whether eating ice cream causes people to swim more, motivated by an observed correlation between ice-cream consumption and swimming over time. We can represent the scenario with the graphical model and associated set of non-parametric equations shown in Figure 1.4. Each arrow represents a direct causal relationship. We assume that Temperature causes changes in both ice-cream consumption and swimming. Our goal is to estimate the causal effect of ice-cream consumption on swimming. Intuitively, the intervention is changing someone’s ice-cream consumption, and the counterfactual world is one where the consumption is changed but every other node in the graph (Temperature, in this case) remains constant. Assuming that our structural model is correct, it offers a precise recipe to estimate the effect of having more ice-cream. Remarkably, the recipe generalizes to arbitrary graph structures and functional forms, as we shall see in the next few chapters.
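To make the idea of a structural causal model concrete, the sketch below writes each variable in Figure 1.4 as a function of its parents plus independent noise. The functional forms and coefficients are our own illustrative choices; the model in the text is non-parametric:

```python
import random

random.seed(1)

def sample_day():
    # Each variable is generated as a function of its parents in the graph.
    temperature = random.gauss(25, 5)                   # exogenous cause
    ice_cream = 0.3 * temperature + random.gauss(0, 1)  # Temperature -> IceCream
    swimming = 0.5 * temperature + random.gauss(0, 1)   # Temperature -> Swimming
    # ice_cream does NOT appear in the swimming equation: in this world,
    # any observed ice-cream/swimming correlation is due to Temperature alone.
    return temperature, ice_cream, swimming

days = [sample_day() for _ in range(10_000)]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

ice = [d[1] for d in days]
swim = [d[2] for d in days]
print(f"correlation(ice-cream, swimming) = {corr(ice, swim):.2f}")
```

The two variables come out strongly correlated even though neither causes the other, which is exactly the situation the graph encodes.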

Structural causal models derive their power from being able to precisely define interventions and counterfactuals. However, it is hard to express these concepts using conventional probabilities. As we saw above, it is important to distinguish an intervention from an observation, but probability calculus lacks a language to distinguish between observing people using a feature and assigning them to it (both would be expressed as P(Outcome|Feature = True)). This difficulty gets worse when we try to express counterfactuals. How would you express the counterfactual probability of the outcome if a user was assigned the feature, given that she discovered it herself (was not assigned) in the factual world? The obvious expression, P(Outcome|Assigned = True, Assigned = False), is nonsensical. Given these shortcomings, we need a new class of variables and a calculus to operate on them. Intervention is defined by a special “do” operator, which removes all inbound edges to a variable, yielding the corresponding interventional graph. Thus, assigning people to the feature is represented as P(Y|do(Feature = True)). Any counterfactual value can be generated by changing the variable in the interventional graph; because the inbound edges are removed, changing the variable is not associated with changes in any other variables except the outcome, thus keeping everything else constant. The causal effect of an intervention can then be defined precisely as the difference of two interventional distributions:

Causal Effect := E[Y|do(T = 1)] − E[Y|do(T = 0)]
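The gap between conditioning and intervening can be checked by simulation. Using invented functional forms for the ice-cream example, the sketch below compares the observational quantity E[Swim | IceCream high] with the interventional quantity E[Swim | do(IceCream high)]; the do-operation severs the Temperature → IceCream edge:

```python
import random

random.seed(2)

def simulate(n, do_ice=None):
    """Sample from the SCM. If do_ice is given, apply the do-operator:
    sever the Temperature -> IceCream edge and force ice-cream to that value."""
    rows = []
    for _ in range(n):
        temp = random.gauss(25, 5)
        ice = do_ice if do_ice is not None else 0.3 * temp + random.gauss(0, 1)
        swim = 0.5 * temp + random.gauss(0, 1)  # no ice term: no true effect
        rows.append((temp, ice, swim))
    return rows

obs = simulate(100_000)
# Observational: condition on seeing high ice-cream consumption.
high = [s for _, i, s in obs if i > 10]
obs_mean = sum(high) / len(high)

# Interventional: force everyone to the same high consumption.
interv = simulate(100_000, do_ice=12.0)
do_mean = sum(s for _, _, s in interv) / len(interv)

# Conditioning selects hot days, so the observational mean is inflated;
# intervening changes nothing, since ice-cream has no effect on swimming here.
print(f"E[Swim | IceCream high]     = {obs_mean:.1f}")
print(f"E[Swim | do(IceCream high)] = {do_mean:.1f}")
```

The observational average is much larger than the interventional one even though, by construction, ice-cream has no causal effect on swimming in this world.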

Before randomization

After randomization

Figure 1.5: Randomization leads to the interventional structural model where treatment is not affected by confounders.

Thus, the do notation is a concise expression for evaluating interventions while keeping everything else constant. Along with the structural causal model, it also leads to a formal definition of a counterfactual. To illustrate, we now use structural causal models to express why randomized experiments work. Figure 1.5a shows a structural model with confounding between the treatment and the outcome. Under a randomized experiment, the model becomes the one shown in Figure 1.5b, graphically showing that randomization removes any effect of the treatment’s parents.

Identification

Given the modeling assumptions and available data, the next step is to ascertain whether the causal effect can be estimated from the data. This means ascertaining whether the interventional expression defined above can be written as a function of only observed data. As we will see, given a causal structural graph, it is possible to check whether the causal effect is estimable from data and, when it is, to provide the formula for estimating it. For instance, returning to our ice-cream graph, the causal effect is identified by conditioning on Temperature and then estimating the association between ice-cream and swimming separately for each temperature range. When we do so, we see that the treatment and outcome are no longer associated, showing that the observed association is due to Temperature, and not to any causal effect of eating ice-cream. In general, variables like Temperature are called confounders: variables that can induce a correlation between treatment and outcome even when there is no causal relationship.

More generally, identification is the process of transforming a causal quantity into an estimable quantity that uses only available data. For the randomized experiment, we argued that random assignment of treatment ensures that there are no confounders that affect the treatment. Thus, the identification step is trivial:

Average Causal Effect = E[Y|do(A = 1)] − E[Y|do(A = 0)]

= E[Y|A = 1] − E[Y|A = 0]  …  (Identification Step)

While we obtained a clear answer for the ice-cream problem and randomized experiments, real-world problems of causal inference do not always lend themselves to simple solutions. We illustrate this through a common problem encountered when conditioning on data.

                              Current Algorithm   New Algorithm
CTR for low-activity users    10/400 (2.5%)       4/200 (2.0%)
CTR for high-activity users   40/600 (6.7%)       50/800 (6.25%)
Overall CTR                   50/1000 (5.0%)      54/1000 (5.4%)

Table 1.1: Click-through rates for two algorithms. Which one is better?

Suppose that you are trying to improve a current algorithm that returns a list of search results for a given query. You consider a metric like the click-through rate (CTR) on the generated results, and wish to deploy the algorithm that leads to the maximum click-through rate per query. You develop a new algorithm for this task, and use it to replace the old algorithm for a few days to gather data for comparison. A natural way to compare the two algorithms is to collect a random sample of queries served by each algorithm and compare their click-through rates. That is, let us collect a random sample of 1,000 search queries for each algorithm and evaluate the algorithms on the fraction of search queries that had a relevant search result (as measured by a user click). Table 1.1 shows the performance of the two algorithms: it is clear that the new algorithm performs better overall. However, you might be curious whether the new algorithm is doing well for all users, or only a subset. To check, you divide the users into low-activity and high-activity users; the first two rows of Table 1.1 show the comparison. Oddly, after conditioning on users’ activity, the new algorithm is worse than the old algorithm for both types of users.
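The reversal in Table 1.1 can be verified with a few lines of arithmetic over the table’s counts:

```python
# (clicks, impressions) per algorithm and user-activity level, from Table 1.1
data = {
    "current": {"low": (10, 400), "high": (40, 600)},
    "new":     {"low": (4, 200),  "high": (50, 800)},
}

for algo, groups in data.items():
    for level, (clicks, n) in groups.items():
        print(f"{algo:7s} {level:4s}-activity CTR = {clicks / n:.3f}")
    clicks_total = sum(c for c, _ in groups.values())
    n_total = sum(n for _, n in groups.values())
    print(f"{algo:7s} overall CTR      = {clicks_total / n_total:.3f}")

# The current algorithm wins within each activity level...
assert 10 / 400 > 4 / 200 and 40 / 600 > 50 / 800
# ...yet the new algorithm wins overall: Simpson's paradox.
assert (10 + 40) / 1000 < (4 + 50) / 1000
```

The flip is purely compositional: the new algorithm served far more high-activity (and higher-clicking) queries, which lifts its overall rate.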

Model 1

Model 2

Model 3

Figure 1.6: Different causal models for the same data.

How is this possible? How can an algorithm be better overall, but worse for each individual sub-population? The statistical explanation is that the new algorithm attracted a higher fraction of high-activity users, and those users also tend to click more. The old algorithm was better for both types of users, but the sheer number of high-activity users on the new algorithm pushed up its overall click-through rate. This dilemma is known as Simpson’s paradox, named after the statistician who first described it. The causal explanation is that it is not a paradox at all: the interpretation of a causal effect depends on the specific causal model that we assume, as we discussed above. In the first case, we assume the structural model shown in Figure 1.6 (Model 1), which posits that the algorithm has a direct causal effect on the CTR metric and that no other variable confounds this effect. In the second case, after conditioning on user activity, we assume the structural model shown in Figure 1.6 (Model 2), which treats user activity as a confounder of the algorithm’s effect on CTR. Note that both causal conclusions are valid: given the first model, the new algorithm causes the CTR to rise, whereas given the second model, the new algorithm reduces the CTR. The correct answer depends on which structural model reflects reality, illustrating the dependence of any causal effect on its underlying model. In this case, we know from past experience (and past work) that high-activity users behave differently from low-activity users, and thus we choose the interpretation of Model 2.

Does that resolve our dilemma? What if there was another variable that we forgot to condition on? Figure 1.6 (Model 3) shows this scenario, where, in addition to user activity, the difficulty of the queries also plays a role. When we condition on both activity and query difficulty, we find that the result flips again: the new algorithm turns out to have a higher click-through rate. This example illustrates the difficulty of drawing causal conclusions from data alone. Given the same data, the causal conclusion is highly sensitive to the underlying structural model. While we discuss ways to eliminate models that are inconsistent with the data in Chapter 4, it is not possible to infer the right causal model purely from data. Thus, causal reasoning from data must necessarily include domain knowledge that informs the creation of the structural model. Note that this is not a contrived scenario: dilemmas like these are pervasive and come under different names, such as selection bias and Berkson’s paradox, which we will discuss later in the book. As a trivial example, this is the same reason that a naive analysis of hospital visits and deaths might conclude that going to a hospital leads to death, but of course that is not the correct causal interpretation.

In the rest of the book, we will describe different identification methods that can be used to deconfound an observed association, how to choose the right formulas for deconfounding, and how to estimate the effect. In some cases, however, we may not be able to identify an effect given the model and available data. In that case, we may reconsider the modeling assumptions, collect new kinds of data, or declare that it is impossible to find the causal effect.

Estimation

Once a causal effect has been identified, we can estimate it using statistical methods. One way is to directly plug in an estimate based on the identified estimand. For instance, in the ice-cream example, we may stratify the data based on the different temperatures and then use the plug-in estimator for the conditional mean. As a concrete example, consider the randomized experiment to determine the effect of medication from Section 1.2. Given the identified estimand from above, we can estimate the effect using a simple plug-in estimator.

Average Causal Effect = E[Y|do(A = 1)] − E[Y|do(A = 0)]

= E[Y|A = 1] − E[Y|A = 0]  …  (Identification Step)

= (1/|G1|) Σ_{i∈G1} Yi − (1/|G0|) Σ_{i∈G0} Yi  …  (Estimation Step)

where G1 and G0 refer to the groups of people who did and did not receive treatment. With infinite data, this is the best estimator for the causal effect, since it directly estimates the estimand. In practice, however, we have finite data, which introduces variance challenges. If there are many variables to condition on, we may not have enough data in each stratum and hence the conditional means will no longer be trustworthy.

In general, high dimensionality is one of the major estimation problems that we will tackle in this book, and many methods exist for handling such data. One approach is to coarsen the strata so that they become approximate (e.g., temperature ranges in multiples of 10) but the conditional means have low variance. Another approach is to stratify on the probability of treatment rather than on all the confounding variables. This makes for better stratification, but the bias in the strata now depends on the method used for estimating the probability of treatment.
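To illustrate the plug-in stratified estimator on synthetic data (the world, coefficients, and true effect below are our own invention), we can compare a naive difference of means with a stratified estimate that conditions on a single binary confounder:

```python
import random

random.seed(3)

# Synthetic world: a binary confounder Z raises both treatment uptake and
# the outcome. The true causal effect of T on Y is exactly 2.0.
def sample():
    z = random.random() < 0.5
    t = random.random() < (0.8 if z else 0.2)  # Z confounds treatment assignment
    y = 2.0 * t + 3.0 * z + random.gauss(0, 1)
    return z, t, y

data = [sample() for _ in range(50_000)]

def mean(xs):
    return sum(xs) / len(xs)

# Naive estimator: difference of means, ignoring Z (biased upward here).
naive = mean([y for _, t, y in data if t]) - mean([y for _, t, y in data if not t])

# Plug-in stratified estimator: sum_z P(Z=z) * (E[Y|T=1,Z=z] - E[Y|T=0,Z=z])
stratified = 0.0
for z in (True, False):
    stratum = [(t, y) for zz, t, y in data if zz == z]
    diff = mean([y for t, y in stratum if t]) - mean([y for t, y in stratum if not t])
    stratified += diff * len(stratum) / len(data)

print(f"naive difference of means: {naive:.2f}")
print(f"stratified estimate:       {stratified:.2f}")
```

The naive estimate absorbs the confounder’s contribution, while the stratified plug-in estimate recovers a value close to the true effect.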

As we will see, many of these methods can also utilize machine learning whenever the estimand can be written as a function of available data. Note that estimation of the causal effect is a purely statistical exercise that estimates the identified causal estimand. Keeping identification and estimation separate has a nice modular advantage: they can be performed independently, using different methods. It can also tell us when improving the estimation algorithm is not likely to yield benefits, such as when the causal effect is not identified. Throughout, we will emphasize the separation between identification, based on the causal model, and estimation, based on the data. The causal interpretation of any calculated effect comes from the structural model, and can be derived without access to any data (assuming that the structural model is correct).

Refutation

The above three steps will yield an answer to our causal question, but how much should we trust this estimate? As remarked above, the causal interpretation comes from identification, which in turn derives its validity from the modeling assumptions. Therefore, the last and perhaps most important step is to check the assumptions that led to identification. In addition, the estimation step makes assumptions about the statistical properties of the data, which also need to be verified. While the structural model cannot be validated from data, we will discuss how, in some cases, observed data can help us eliminate causal models that are inconsistent with it and check the robustness of our estimate to causal assumptions. As we discussed in the search engine example, a common faulty assumption is that all confounders are known and observed. In Chapter 4, we show how we can simulate datasets with unknown confounders and assess the sensitivity of the estimate to such assumption violations. We will also discuss other tests of the identifying assumptions. These sensitivity tests cannot prove the validity of an assumption, but they can help us refute some kinds of assumptions.
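As a sketch of one such refutation (a placebo-treatment test on synthetic data; the world and estimator below are our own illustrative choices), we can replace the treatment with a randomly permuted copy and re-run the analysis. A sound estimator should then report an effect near zero:

```python
import random

random.seed(4)

# Synthetic data with a confounder Z; the true effect of T on Y is 1.5.
data = []
for _ in range(50_000):
    z = random.random() < 0.5
    t = random.random() < (0.7 if z else 0.3)
    y = 1.5 * t + 2.0 * z + random.gauss(0, 1)
    data.append((z, t, y))

def stratified_effect(rows):
    """Back-door-adjusted effect: weighted within-stratum mean difference."""
    total = 0.0
    for z in (True, False):
        stratum = [(t, y) for zz, t, y in rows if zz == z]
        treated = [y for t, y in stratum if t]
        control = [y for t, y in stratum if not t]
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        total += diff * len(stratum) / len(rows)
    return total

estimate = stratified_effect(data)

# Placebo refutation: permute the treatment column, destroying any real effect
# while keeping the marginal distribution of treatment intact.
placebo_t = [t for _, t, _ in data]
random.shuffle(placebo_t)
placebo_data = [(z, pt, y) for (z, _, y), pt in zip(data, placebo_t)]
placebo_estimate = stratified_effect(placebo_data)

print(f"estimated effect on real data:    {estimate:.2f}")
print(f"estimated effect on placebo data: {placebo_estimate:.2f}")
```

If the placebo run had reported a large effect, we would suspect a bug or a violated assumption in the analysis rather than trust the original estimate.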

While the sensitivity to causal assumptions may seem like a big disadvantage, it is actually a fundamental limitation of learning from data. Multiple causal models can explain the same data with exactly the same likelihood, so without additional knowledge it is impossible to disambiguate between them. The benefit of expressing causal assumptions in the form of a separate structural model is that it allows us to emulate the scientific method in our analysis: hypothesize a theory, design an experiment to test it, improve the theory. Analogously, we can imagine a workflow where we start with a causal model, test its assumptions with data, and then revise the assumptions based on any inconsistencies. However, if any causal effect depends on the underlying structural model, how is it possible to test the assumptions of a causal model? Fortunately, there is one method whose causal conclusions do not depend solely on assumptions from a structural model. We present this next.

The rest of this book

Part I. of our book focuses on a conceptual introduction to these four steps. Chapter 2 covers modeling and identification (Steps 1 and 2). Chapter 3 focuses on estimation. Chapter 4 discusses refutations.

Part II. of this book goes into more of the practical nuts and bolts of these four steps, including details of analytical methods for identification (Chapter 5), a variety of statistical estimation methods for conditioning-based methods (Chapter 6) and natural experiments (Chapter 7), and details of methods for validating and refuting assumptions in practice (Chapter 8). Chapter 9 introduces a number of concerns that complicate real-world analyses, and discusses basic approaches and extensions to mitigate their consequences.

Part III. of this book focuses on the connections between causal reasoning and its application in the context of core machine learning tasks (Chapter 10). We provide a deeper discussion of causal reasoning for experimentation and reinforcement learning (Chapter 11), considerations when learning from observational data (Chapter 12), how causal reasoning relates to robustness and generalization of machine learning models (Chapter 13), and connections between causal reasoning and challenges of explainability and bias in machine learning (Chapter 14).

