The Book of Why - Judea Pearl, Dana Mackenzie

Link: https://www.goodreads.com/book/show/36393702-the-book-of-why

Notes

Causal inference requires a causal model. The model needs to be developed independently of, and ideally prior to, data collection, and it guides which data should be collected to verify the model.

If the data is collected first, causation cannot be derived from it unless information about cause and effect is encoded explicitly (via controlled experiments, for example); only correlation can.

Ch2. Genesis of Causal Inference

History of causation from Galton and Pearson to Sewall Wright to Herbert Simon, etc. The opponents of causality considered data the holy grail and treated causation as a special case of correlation, something that can be derived from data. Wright's and Pearl's point is that it cannot: causality requires a model of causation to accompany the interpretation of the data. Two individuals can hold different causal models and so reach different interpretations from the same data.

Pearl says, wrt Bayesian statistics, that in the limit as the size of the data increases, the influence of prior knowledge diminishes to zero. Why is this?

—-

Ch3. From evidence to causes

Bayesian networks helped with a class of problems where there was uncertainty in knowledge, especially in the context of systems like expert systems, which rely on facts. What if the facts are not binary, but involve probabilities? Other approaches that were tried include Lotfi Zadeh's fuzzy logic, Glenn Shafer's belief functions, and Edward Feigenbaum's certainty factors.

Bayesian networks are essentially networks of conditional probability tables - they provide the numbers needed to apply Bayes' rule forward and backward along the chain. These networks have been used in many places, including error-correcting codes.
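A minimal sketch of the forward/backward idea, using an invented two-node Disease -> Test network with made-up numbers (not an example taken from the book):

```python
# Minimal two-node Bayesian network: Disease -> Test.
# The network is just a prior P(Disease) plus a conditional
# probability table P(Test | Disease); illustrative numbers only.

p_disease = 0.01                      # P(D = 1), the prior
p_test_given = {1: 0.90, 0: 0.05}     # P(T = 1 | D = d): sensitivity and false-positive rate

# Forward (predictive) direction: P(T = 1)
p_test = p_test_given[1] * p_disease + p_test_given[0] * (1 - p_disease)

# Backward (diagnostic) direction via Bayes' rule: P(D = 1 | T = 1)
p_disease_given_test = p_test_given[1] * p_disease / p_test

print(f"P(T=1)       = {p_test:.4f}")
print(f"P(D=1 | T=1) = {p_disease_given_test:.4f}")   # ~0.154: a positive test is far from conclusive
```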

Three basic junctions in a three-node, two-edge graph:

Chain: a —> b —> c. b mediates the impact of a on c; conditioning on b screens a off from c.
Fork: a <— b —> c. b is a confounder; the observed association between a and c is mere correlation.
Collider: a —> b <— c. a and c are independent, but conditioning on (or selecting by) the value of b induces a spurious correlation between them.
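A quick simulation of the collider case (variable names and coefficients invented): a and c are generated independently, yet once we condition on b they become correlated.

```python
# Collider a -> b <- c: a and c are independent overall, but
# conditioning on b induces a spurious (here negative) correlation.
# Purely illustrative simulation.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.normal(size=n)
c = rng.normal(size=n)
b = a + c + 0.1 * rng.normal(size=n)   # b "listens to" both a and c

print(f"corr(a, c) overall:     {np.corrcoef(a, c)[0, 1]:+.3f}")          # ~0
high_b = b > 1.0                        # condition on (select for) a value of b
print(f"corr(a, c) given b > 1: {np.corrcoef(a[high_b], c[high_b])[0, 1]:+.3f}")  # clearly negative
```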

—-

Ch 4. RCT and other mechanisms to identify confounding factors.

A key challenge in causal inference is to identify confounding factors - factors that affect both the presumed cause and the presumed effect. An example: the claim that listening to electronic music makes someone good at programming. One confounder could be having studied computer science, which both improves programming skill and puts someone in peer groups that listen to electronic music. Randomized Controlled Trials (RCTs) help eliminate the impact of confounding factors, but they're not the only mechanism available. Having a causal model is one such alternative.

Confounding, then, should simply be defined as anything that leads to a discrepancy between the two: P(Y | X) ≠ P(Y | do(X)).

In 1986 Greenland and Robins provided a clearer definition of a confounder using "exchangeability" as the criterion: the control and treatment groups should be exchangeable, in the sense that the outcome would be the same if the two groups swapped treatments.

—- Monty Hall paradox

Simpson’s paradox - Data looks bad/good when partitioned, but good/bad overall. Which reading is right depends on the underlying causal structure. If the lurking variable is a confounder, stratifying the data gives the correct answer and the aggregate misleads. If it is a mediator (or a collider), the aggregated data is the one to trust and you should not stratify. - Also check the sample sizes across the strata. Outliers could skew data when aggregated.
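A toy illustration of the reversal, with made-up numbers in which severity acts as a confounder (severe cases are both more likely to be treated and less likely to recover):

```python
# Toy Simpson's reversal (all numbers invented). The treatment does better
# within each stratum but worse in the aggregate, because severity confounds
# treatment and recovery.
data = {
    # (group, stratum): (recovered, total)
    ("treated",   "mild"):   (9, 10),
    ("treated",   "severe"): (27, 90),
    ("untreated", "mild"):   (72, 90),
    ("untreated", "severe"): (2, 10),
}

def rate(pairs):
    recovered = sum(r for r, _ in pairs)
    total = sum(t for _, t in pairs)
    return recovered / total

for severity in ("mild", "severe"):
    print(severity,
          "treated:",   round(rate([data[("treated", severity)]]), 2),
          "untreated:", round(rate([data[("untreated", severity)]]), 2))

print("overall treated:  ", rate([data[("treated", "mild")], data[("treated", "severe")]]))
print("overall untreated:", rate([data[("untreated", "mild")], data[("untreated", "severe")]]))
```

Treated wins 0.9 vs 0.8 and 0.3 vs 0.2 within strata, yet loses 0.36 vs 0.74 overall.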

—- Front door adjustment:

P(Y | do(X)) = ∑z P(Z = z | X) ∑x P(Y | X = x, Z = z) P(X = x)         (7.1)

Readers with an appetite for mathematics might find it interesting to compare this to the formula for the back-door adjustment, which looks like Equation 7.2.

Backdoor Adjustment:

P(Y | do(X)) = ∑z P(Y | X, Z = z) P(Z = z) (7.2)
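A sketch checking both adjustment formulas (7.1 and 7.2) on a small hand-built discrete model; all probabilities are invented, and the helper names (joint, backdoor, frontdoor) are mine, not the book's. The model has an unobserved confounder U of X and Y, and a mediator Z on the path from X to Y, so the front-door estimate uses only observable variables while the back-door estimate pretends U could be measured.

```python
# Back-door (7.2) and front-door (7.1) adjustment on an invented discrete model.
# Structure: U -> X, U -> Y (U is an unobserved confounder), X -> Z -> Y.
from itertools import product

p_u = {1: 0.5, 0: 0.5}
p_x_given_u = {(1, 1): 0.8, (1, 0): 0.2}     # P(X=1 | U=u)
p_z_given_x = {(1, 1): 0.9, (1, 0): 0.1}     # P(Z=1 | X=x)

def p_y_given_zu(z, u):                      # P(Y=1 | Z=z, U=u)
    return 0.1 + 0.6 * z + 0.2 * u

def bern(p1, value):                         # P(V=value) given P(V=1)=p1
    return p1 if value == 1 else 1 - p1

# Full joint P(u, x, z, y) implied by the model
joint = {}
for u, x, z, y in product([0, 1], repeat=4):
    joint[(u, x, z, y)] = (p_u[u]
                           * bern(p_x_given_u[(1, u)], x)
                           * bern(p_z_given_x[(1, x)], z)
                           * bern(p_y_given_zu(z, u), y))

def p(**fixed):                              # probability of a partial assignment
    return sum(pr for (u, x, z, y), pr in joint.items()
               if all({"u": u, "x": x, "z": z, "y": y}[k] == v for k, v in fixed.items()))

def truth(do_x):                             # P(Y=1 | do(X=do_x)) read off the model
    return sum(p_u[u] * bern(p_z_given_x[(1, do_x)], z) * p_y_given_zu(z, u)
               for u, z in product([0, 1], repeat=2))

def backdoor(do_x):                          # eq. 7.2 with U as the adjustment set (needs U observed)
    return sum(p(y=1, x=do_x, u=u) / p(x=do_x, u=u) * p_u[u] for u in [0, 1])

def frontdoor(do_x):                         # eq. 7.1, using only X, Z, Y
    total = 0.0
    for z in [0, 1]:
        p_z_given_do_x = p(z=z, x=do_x) / p(x=do_x)
        inner = sum(p(y=1, x=x2, z=z) / p(x=x2, z=z) * p(x=x2) for x2 in [0, 1])
        total += p_z_given_do_x * inner
    return total

for x in [0, 1]:
    print(f"do(X={x}): truth={truth(x):.4f}  back-door={backdoor(x):.4f}  front-door={frontdoor(x):.4f}")
```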

Instrumental variables

—-

Ch. 8 Counterfactuals

One reason observational data may not be sufficient is that we can’t determine whether the proposed cause is necessary. Counterfactual statements take the form “if not for X, Y would not have occurred.” If we have the relevant data then perhaps it’s fine.

—-

Timelines

1750s: Reverend Thomas Bayes discovers Bayes’ theorem.

1866: Mendel’s theory of inheritance of traits

1877: Galton’s paper on heights that led to the theory of correlation. Karl Pearson later coined the term correlation coefficient for the slope of the regression line, a measure of the degree of association between two variables. Pearson, however, regarded causation as merely the limiting case of correlation (a correlation of 1).

1921: Sewall Wright publishes “Correlation and Causation,” where he uses path diagrams for modeling causation.

1920s: RA Fisher invents Randomized Controlled Trials (RCT)

1960s: Resurgence of path diagrams in sociology and economics due to work of Herbert Simon, etc.

1986: Greenland and Robins provide a clearer definition of a confounder using “exchangeability” as the criterion.


Highlights

Mathematically, we write the observed frequency of Lifespan L among patients who voluntarily take the drug as P(L | D), which is the standard conditional probability used in statistical textbooks. This expression stands for the probability (P) of Lifespan L conditional on seeing the patient take Drug D. Note that P(L | D) may be totally different from P(L | do(D)). This difference between seeing and doing is fundamental and explains why we do not regard the falling barometer to be a cause of the coming storm. Seeing the barometer fall increases the probability of the storm, while forcing it to fall does not affect this probability.

Notice that the whole notion of estimands and in fact the whole top part of Figure I does not exist in traditional methods of statistical analysis. There, the estimand and the query coincide.

I especially want to highlight the role of data in the above process. First, notice that we collect data only after we posit the causal model, after we state the scientific query we wish to answer, and after we derive the estimand. This contrasts with the traditional statistical approach, mentioned above, which does not even have a causal model.

I am an outspoken skeptic of this trend because I know how profoundly dumb data are about causes and effects. For example, information about the effects of actions or interventions is simply not available in raw data, unless it is collected by controlled experimental manipulation.

Another advantage causal models have that data mining and deep learning lack is adaptability. Note that in Figure I.1, the estimand is computed on the basis of the causal model alone, prior to an examination of the specifics of the data. This makes the causal inference engine supremely adaptable, because the estimand computed is good for any data that are compatible with the qualitative model, regardless of the numerical relationships among the variables.

On the other hand, if she possessed a model of how the drug operated and its causal structure remained intact in the new location, then the estimand she obtained in training would remain valid.

Note: For the model to be constant it’d have to encode all the free variables and their dependencies (including diet, etc), right? If you had that already and encoded in the data, wouldn’t you be able to use just the data (and not estimand) for answering queries about causality?

How do you even verify the causal model is correct and is not missing some key causal element? Design and type of experiments can be flawed too, right?

I assume the relevance of the causal model is the unambiguous specification of the hypothesis.

Put another way, is it also fair to say the causal model is an encoding of the design of controlled experiments?

While philosophers and scientists had mostly paid attention to the regularity definition, Lewis argued that the counterfactual definition aligns more closely with human intuition: “We think of a cause as something that makes a difference, and the difference it makes must be a difference from what would have happened without it.”

If I could sum up the message of this book in one pithy phrase, it would be that you are smarter than your data.

It is useless to ask for the causes of things unless you can imagine their consequences. Conversely, you cannot claim that Eve caused you to eat from the tree unless you can imagine a world in which, counter to facts, she did not hand you the apple.

Note: Implying causation introduces a branch between the factual and the counterfactual.

We say that one event is associated with another if observing one changes the likelihood of observing the other.

Note: Is this also correlation?

Some associations might have obvious causal interpretations; others may not. But statistics alone cannot tell which is the cause and which is the effect, toothpaste or floss.

If, for example, the programmers of a driverless car want it to react differently to new situations, they have to add those new reactions explicitly. The machine will not figure out for itself that a pedestrian with a bottle of whiskey in hand is likely to respond differently to a honking horn. This lack of flexibility and adaptability is inevitable in any system that works at the first level of the Ladder of Causation.

We cannot answer questions about interventions with passively collected data, no matter how big the data set or how deep the neural network.

Note: Is this still true even if the passively collected data has all permutations between free variables and final value?

It appears to me the second level of causation (“intervention”) is more about filling the gaps and completing more of the picture. But if the picture the existing data already paints is fairly complete or at least complete in a way sufficient for the queries asked, then we can answer questions about what happens if any of the free variables change.

But now you are considering a deliberate intervention that will set a new price regardless of market conditions.

Note: But if the experiment is not randomized correctly one could still miss some key states.

Even if she doesn’t have data on every factor, she might have data on enough key surrogates to make the prediction. A sufficiently strong and accurate causal model can allow us to use rung-one (observational) data to answer rung-two (interventional) queries.

Note: Okay, this now aligns with what I was thinking. It feels like a key to the second level is validating that the data contains sufficient associations between all pertinent cause and effect variables.

As these examples illustrate, the defining query of the second rung of the Ladder of Causation is “What if we do…?” What will happen if we change the environment? We can write this kind of query as P(floss | do(toothpaste)), which asks about the probability that we will sell floss at a certain price, given that we set the price of toothpaste at another price.

Note: These are what RCTs do, although just one RCT is not often sufficient as it may not capture all pertinent endemic states.

Counterfactuals have a particularly problematic relationship with data because data are, by definition, facts. They cannot tell us what will happen in a counterfactual or imaginary world where some observed facts are bluntly negated.

Note also that merely collecting Big Data would not have helped us ascend the ladder and answer the above questions. Assume that you are a reporter collecting records of execution scenes day after day. Your data will consist of two kinds of events: either all five variables are true, or all of them are false. There is no way that this kind of data, in the absence of an understanding of who listens to whom, will enable you (or any machine learning algorithm) to predict the results of persuading marksman A not to shoot.

Note: What if the data did contain rows where only A shot in real life?

(Research has shown that three-year-olds already understand the entire Ladder of Causation.)

Note: Citation?

In particular, beginning with Reichenbach and Suppes, philosophers have tried to define causation in terms of probability, using the notion of “probability raising”: X causes Y if X raises the probability of Y.

The proper way to rescue the probability-raising idea is with the do-operator: we can say that X causes Y if P(Y | do(X)) > P(Y). Since intervention is a rung-two concept, this definition can capture the causal interpretation of probability raising, and it can also be made operational through causal diagrams.

Bayesian networks inhabit a world where all questions are reducible to probabilities, or (in the terminology of this chapter) degrees of association between variables; they could not ascend to the second or third rungs of the Ladder of Causation.

According to the central limit theorem, proven in 1810 by Pierre-Simon Laplace, any such random process—one that amounts to a sum of a large number of coin flips—will lead to the same probability distribution, called the normal distribution (or bell-shaped curve). The Galton board is simply a visual demonstration of Laplace’s theorem.
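A quick Galton-board-style simulation of this point (sizes arbitrary): each ball's final bin is the sum of many independent left/right nudges, and the histogram of those sums piles up into a bell curve.

```python
# Galton-board sketch: each ball takes many independent left/right bounces;
# the sum of bounces is approximately normally distributed (central limit theorem).
import numpy as np

rng = np.random.default_rng(1)
n_balls, n_rows = 50_000, 12
bounces = rng.choice([-1, 1], size=(n_balls, n_rows))   # each row of pegs nudges left or right
positions = bounces.sum(axis=1)                         # final horizontal position of each ball

values, counts = np.unique(positions, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v:+3d} {'#' * (c * 60 // counts.max())}")   # crude text histogram: bell-shaped
```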

Notice that the scatter plot has a roughly elliptical shape—a fact that was crucial to Galton’s analysis and characteristic of bell-shaped distributions with two variables.

Note: Why?

For the first time, Galton’s idea of correlation gave an objective measure, independent of human judgment or interpretation, of how two variables are related to one another.

Reading Galton’s Natural Inheritance was one of the defining moments of Pearson’s life: “I felt like a buccaneer of Drake’s days—one of the order of men ‘not quite pirates, but with decidedly piratical tendencies,’ as the dictionary has it!” he wrote in 1934. “I interpreted… Galton to mean that there was a category broader than causation, namely correlation, of which causation was only the limit, and that this new conception of correlation brought psychology, anthropology, medicine and sociology in large part into the field of mathematical treatment. It was Galton who first freed me from the prejudice that sound mathematics could only be applied to natural phenomena under the category of causation.”

Pearson, arguably one of England’s first feminists, started the Men’s and Women’s Club in London for discussions of “the woman question.” He was concerned about women’s subordinate position in society and advocated for them to be paid for their work.

This interpretation of path coefficients, in terms of the amount of variation explained by a variable, was reasonable at the time. The modern causal interpretation is different: the path coefficients represent the results of a hypothetical intervention on the source variable.

The prototype of Bayesian analysis goes like this: Prior Belief + New Evidence → Revised Belief.

In addition, in many cases it can be proven that the influence of prior beliefs vanishes as the size of the data increases, leaving a single objective conclusion in the end.

Note: What does this mean exactly? Is this related again to the completeness issue? Why is it just more data vs better data?

If you own a cell phone, the codes that your phone uses to pick your call out of thousands of others are decoded by belief propagation, an algorithm devised for Bayesian networks.

a causal diagram is a Bayesian network in which every arrow signifies a direct causal relation, or at least the possibility of one, in the direction of that arrow. Not all Bayesian networks are causal, and in many applications it does not matter. However, if you ever want to ask a rung-two or rung-three query about your Bayesian network, you must draw it with scrupulous attention to causality.

His paper is remembered and argued about 250 years later, not for its theology but because it shows that you can deduce the probability of a cause from an effect. If we know the cause, it is easy to estimate the probability of the effect, which is a forward probability. Going the other direction—a problem known in Bayes’s time as “inverse probability”—is harder. Bayes did not explain why it is harder; he took that as self-evident, proved that it is doable, and showed us how.

This innocent-looking equation came to be known as “Bayes’s rule.” If we look carefully at what it says, we find that it offers a general solution to the inverse-probability problem. It tells us that if we know the probability of S given T, P(S | T), we ought to be able to figure out the probability of T given S, P(T | S), assuming of course that we know P(T) and P(S).

Second, Bayes assumed that L is determined mechanically by shooting a billiard ball from a greater distance, say L*. In this way he bestowed objectivity onto P(L) and transformed the problem into one where prior probabilities are estimable from data, as we see in the teahouse and cancer test examples.

Note: Didn’t follow this. Is L* an asymptotic limit?

The late 1970s, then, were a time of ferment in the AI community over the question of how to deal with uncertainty. There was no shortage of ideas. Lotfi Zadeh of Berkeley offered “fuzzy logic,” in which statements are neither true nor false but instead take a range of possible truth values.

Note: Manindra, my classmate at Nationals, who helped me get into programming, used to talk about Lotfi Zadeh.

Unfortunately, although ingenious, these approaches suffered a common flaw: they modeled the expert, not the world, and therefore tended to produce unintended results. For example, they could not operate in both diagnostic and predictive modes, the uncontested specialty of Bayes’s rule.

I entered the arena rather late, in 1982, with an obvious yet radical proposal: instead of reinventing a new uncertainty theory from scratch, let’s keep probability as a guardian of common sense and merely repair its computational deficiencies. More specifically, instead of representing probability in huge tables, as was previously done, let’s represent it with a network of loosely coupled variables. If we only allow each variable to interact with a few neighboring variables, then we might overcome the computational hurdles that had caused other probabilists to stumble.

Although Bayes didn’t know it, his rule for inverse probability represents the simplest Bayesian network. We have seen this network in several guises now: Tea → Scones, Disease → Test, or, more generally, Hypothesis → Evidence. Unlike the causal diagrams we will deal with throughout the book, a Bayesian network carries no assumption that the arrow has any causal meaning. The arrow merely signifies that we know the “forward” probability, P(scones | tea) or P(test | disease). Bayes’s rule tells us how to reverse the procedure, specifically by multiplying the prior probability by a likelihood ratio.

This observation leads to an important conceptual point about chains: the mediator B “screens off” information about A from C, and vice versa. (This was first pointed out by Hans Reichenbach, a German-American philosopher of science.) For example, once we know the value of Smoke, learning about Fire does not give us any reason to raise or lower our belief in Alarm.

These three junctions—chains, forks, and colliders—are like keyholes through the door that separates the first and second levels of the Ladder of Causation. If we peek through them, we can see the secrets of the causal process that generated the data we observe; each stands for a distinct pattern of causal flow and leaves its mark in the form of conditional dependences and independences in the data.

Oddly, statisticians both over- and underrate the importance of adjusting for possible confounders. They overrate it in the sense that they often control for many more variables than they need to and even for variables that they should not control for. I recently came across a quote from a political blogger named Ezra Klein who expresses this phenomenon of “overcontrolling” very clearly: “You see it all the time in studies. ‘We controlled for…’ And then the list starts. The longer the better. Income. Age. Race. Religion. Height. Hair color. Sexual preference. Crossfit attendance. Love of parents. Coke or Pepsi. The more things you can control for, the stronger your study is—or, at least, the stronger your study seems. Controls give the feeling of specificity, of precision.… But sometimes, you can control for too much. Sometimes you end up controlling for the thing you’re trying to measure.” Klein raises a valid concern. Statisticians have been immensely confused about what variables should and should not be controlled for, so the default practice has been to control for everything one can measure. The vast majority of studies conducted in this day and age subscribe to this practice. It is a convenient, simple procedure to follow, but it is both wasteful and ridden with errors. A key achievement of the Causal Revolution has been to bring an end to this confusion.

Note: I wonder how much of this is due to laziness vs fear of failure?

Just came across this series of posts about possible reasons for the “replication crisis” and one of them being experimenters looking for statistically significant results. Controlling for lots of variables could get you there assuming you’ve a sufficient sample size. But it also means you’re potentially biasing the experiment to begin with as opposed to finding the true causes.

https://jaydaigle.net/blog/hypothesis-testing-part-1/

One of my goals in this chapter is to explain, from the point of view of causal diagrams, precisely why RCTs allow us to estimate the causal effect X → Y without falling prey to confounder bias. Once we have understood why RCTs work, there is no need to put them on a pedestal and treat them as the gold standard of causal analysis, which all other methods should emulate. Quite the opposite: we will see that the so-called gold standard in fact derives its legitimacy from more basic principles.

I will add to this a second punch line: there are other ways of simulating Model 2. One way, if you know what all the possible confounders are, is to measure and adjust for them. However, randomization does have one great advantage: it severs every incoming link to the randomized variable, including the ones we don’t know about or cannot measure (e.g., “Other” factors in Figures 4.4 to 4.6).
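A sketch of this punch line with invented coefficients: a hidden confounder biases the naive observational contrast, while coin-flip assignment severs the confounder's link into the treatment and recovers the true effect without the confounder ever being measured.

```python
# Why randomization works (illustrative numbers): a hidden confounder U raises
# both treatment uptake and the outcome. The naive observational contrast is
# biased; randomizing X severs U -> X and recovers the true effect (0.5 here)
# without ever measuring U.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
true_effect = 0.5

u = rng.normal(size=n)                               # unobserved confounder

# Observational world: U influences who takes the treatment
x_obs = (u + rng.normal(size=n) > 0).astype(float)
y_obs = true_effect * x_obs + u + rng.normal(size=n)
naive = y_obs[x_obs == 1].mean() - y_obs[x_obs == 0].mean()

# RCT world: treatment assigned by coin flip, so U no longer points into X
x_rct = rng.integers(0, 2, size=n).astype(float)
y_rct = true_effect * x_rct + u + rng.normal(size=n)
rct = y_rct[x_rct == 1].mean() - y_rct[x_rct == 0].mean()

print(f"naive observational estimate: {naive:.3f}   (biased upward)")
print(f"randomized estimate:          {rct:.3f}   (close to {true_effect})")
```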

Fortunately, the do-operator gives us scientifically sound ways of determining causal effects from nonexperimental studies, which challenge the traditional supremacy of RCTs.

how should it be defined? Armed with what we now know about the logic of causality, the answer to the second question is easier. The quantity we observe is the conditional probability of the outcome given the treatment, P(Y | X). The question we want to ask of Nature has to do with the causal relationship between X and Y, which is captured by the interventional probability P(Y | do(X)). Confounding, then, should simply be defined as anything that leads to a discrepancy between the two: P(Y | X) ≠ P(Y | do(X)).

Considering that the drugs in your medicine cabinet may have been developed on the basis of a dubious definition of “confounders,” you should be somewhat concerned.

Note: This might be unfair given that these used RCTs, which, if implemented correctly, help eliminate the impact of confounding factors.

In a sense RCTs are like the highway, or the hammer, of causation: a safe and easy-to-adopt path given all the toolsets and knowledge built around them, but also an onerous one to use. Other mechanisms can solve the problem in simpler ways, but they need more thought and caution.

Exchangeability simply means that the percentage of people with each kind of sticker (d percent, c percent, p percent, and i percent, respectively) should be the same in both the treatment and control groups. Equality among these proportions guarantees that the outcome would be just the same if we switched the treatments and controls. Otherwise, the treatment and control groups are not alike, and our estimate of the effect of the vaccine will be confounded.

Using this commonsense definition of confounding, Greenland and Robins showed that the “statistical” definitions, both declarative and procedural, give incorrect answers. A variable can satisfy the three-part test of epidemiologists and still increase bias, if adjusted for.

Finally, to deconfound two variables X and Y, we need only block every noncausal path between them without blocking or perturbing any causal paths. More precisely, a back-door path is any path from X to Y that starts with an arrow pointing into X. X and Y will be deconfounded if we block every back-door path (because such paths allow spurious correlation between X and Y). If we do this by controlling for some set of variables Z, we also need to make sure that no member of Z is a descendant of X on a causal path; otherwise we might partly or completely close off that path.

In fact, Cornfield’s method planted the seeds of a very powerful technique called “sensitivity analysis,” which today supplements the conclusions drawn from the inference engine described in the Introduction. Instead of drawing inferences by assuming the absence of certain causal relationships in the model, the analyst challenges such assumptions and evaluates how strong alternative relationships must be in order to explain the observed data. The quantitative result is then submitted to a judgment of plausibility, not unlike the crude judgments invoked in positing the absence of those causal relationships.

Because Simpson’s paradox has been so poorly understood, some statisticians take precautions to avoid it. All too often, these methods avoid the symptom, Simpson’s reversal, without doing anything about the disease, confounding. Instead of suppressing the symptoms, we should pay attention to them. Simpson’s paradox alerts us to cases where at least one of the statistical trends (either in the aggregated data, the partitioned data, or both) cannot represent the causal effects.

To sum up, the back-door adjustment formula and the back-door criterion are like the front and back of a coin. The back-door criterion tells us which sets of variables we can use to deconfound our data. The adjustment formula actually does the deconfounding. In the simplest case of linear regression, partial regression coefficients perform the back-door adjustment implicitly. In the nonparametric case, we must do the adjustment explicitly, either using the back-door adjustment formula directly on the data or on some extrapolated version of it.

The process I have just described, expressing P(cancer | do (smoking)) in terms of do-free probabilities, is called the front-door adjustment. It differs from the back-door adjustment in that we adjust for two variables (Smoking and Tar) instead of one, and these variables lie on the front-door path from Smoking to Cancer rather than the back-door path.

P(Y | do(X)) = ∑z P(Z = z | X) ∑x P(Y | X = x, Z = z) P(X = x)         (7.1)
Readers with an appetite for mathematics might find it interesting to compare this to the formula for the back-door adjustment, which looks like Equation 7.2.
P(Y | do(X)) = ∑z P(Y | X, Z = z) P(Z = z)         (7.2)

Glynn and Kashin’s results show why the front-door adjustment is such a powerful tool: it allows us to control for confounders that we cannot observe (like Motivation), including those that we can’t even name. RCTs are considered the “gold standard” of causal effect estimation for exactly the same reason. Because front-door estimates do the same thing, with the additional virtue of observing people’s behavior in their own natural habitat instead of a laboratory, I would not be surprised if this method eventually becomes a useful alternative to randomized controlled trials.

In both the front- and back-door adjustment formulas, the ultimate goal is to calculate the effect of an intervention, P(Y | do(X)), in terms of data such as P(Y | X, A, B, Z,…) that do not involve a do-operator. If we are completely successful at eliminating the do’s, then we can use observational data to estimate the causal effect, allowing us to leap from rung one to rung two of the Ladder of Causation.

The fact that we were successful in these two cases (front- and back-door) immediately raises the question of whether there are other doors through which we can eliminate all the do’s. Thinking more generally, we can ask whether there is some way to decide in advance if a given causal model lends itself to such an elimination procedure. If so, we can apply the procedure and find ourselves in possession of the causal effect, without having to lift a finger to intervene. Otherwise, we would at least know that the assumptions imbedded in the model are not sufficient to uncover the causal effect from observational data, and no matter how clever we are, there is no escape from running an interventional experiment of some kind.

In this case the rules of do-calculus provide a systematic method to determine whether causal effects found in the study environment can help us estimate effects in the intended target environment.

No economist had ever before insisted on the distinction between causal coefficients and regression coefficients; they were all in the Karl Pearson–Henry Niles camp that causation is nothing more than a limiting case of correlation. Also, no one before Sewall Wright had ever given a recipe for computing regression coefficients in terms of path coefficients, then reversing the process to get the causal coefficients from the regression. This was Sewall’s exclusive invention.
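A small linear sketch of Wright's recipe, with invented coefficients: the regression of Y on X alone mixes the direct path with the path through a mediator Z, and the diagram tells us which regressions recover the individual path coefficients.

```python
# Wright-style sketch (coefficients invented): in the linear model
#   Z = a*X + noise,  Y = c*X + b*Z + noise,
# the regression of Y on X alone returns a*b + c (direct path plus the path
# through Z), while the diagram tells us which regressions recover a, b, c.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
a, b, c = 0.8, 1.5, -0.4

x = rng.normal(size=n)
z = a * x + rng.normal(size=n)
y = c * x + b * z + rng.normal(size=n)

slope_yx = np.polyfit(x, y, 1)[0]                       # Y ~ X alone
a_hat = np.polyfit(x, z, 1)[0]                          # Z ~ X
c_hat, b_hat = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0]  # Y ~ X + Z

print(f"Y~X slope: {slope_yx:.3f}  vs  a*b + c = {a*b + c:.3f}")
print(f"recovered paths: a={a_hat:.3f}, b={b_hat:.3f}, c={c_hat:.3f}")
```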

Like Hume, Lewis was evidently impressed by the fact that humans make counterfactual judgments without much ado, swiftly, comfortably, and consistently. We can assign them truth values and probabilities with no less confidence than we do for factual statements. In his view, we do this by envisioning “possible worlds” in which the counterfactual statements are true.

Structural models also offer a resolution of a puzzle Lewis kept silent about: How do humans represent “possible worlds” in their minds and compute the closest one, when the number of possibilities is far beyond the capacity of the human brain? Computer scientists call this the “representation problem.” We must have some extremely economical code to manage that many worlds.

appreciate how audacious this notation is, you have to step back from the symbols and think about the assumptions they embody. By writing down the symbol Yx, Rubin asserted that Y definitely would have taken some value if X had been x, and this has just as much objective reality as the value Y actually did take. If you don’t buy this assumption (and I’m pretty sure Heisenberg wouldn’t), you can’t use potential outcomes. Also, note that the potential outcome, or counterfactual, is defined at the level of an individual, not a population.

Ironically, equal Experience, which started out as an invitation for matching, has now turned into a loud warning against it. Table 8.1 will, of course, continue its silence about such dangers. For this reason I cannot share Holland’s enthusiasm for casting causal inference as a missing-data problem. Quite the contrary. Recent work of Karthika Mohan, a former student of mine, reveals that even standard problems of missing data require causal modeling for their solution.

Note: It feels like what Pearl is saying is that you need to state your assumptions explicitly if you’re using any of the statistical approaches to fill-in/impute missing data. Causal models are one way to make the assumptions clear. Each variation of a causal model for a given table would be a parallel universe?

The diagram encodes the causal story behind the data, according to which Experience listens to Education and Salary listens to both. In fact, we can already tell something very important just by looking at the diagram. If our model were wrong and EX were a cause of ED, rather than vice versa, then Experience would be a confounder, and matching employees with similar experience would be completely appropriate. With ED as the cause of EX, Experience is a mediator. As you surely know by now, mistaking a mediator for a confounder is one of the deadliest sins in causal inference and may lead to the most outrageous errors. The latter invites adjustment; the former forbids it.

So far in this book, I have used a very informal word—“listening”—to express what I mean by the arrows in a causal diagram. But now it’s time to put a little bit of mathematical meat on this concept, and this is in fact where structural causal models differ from Bayesian networks or regression models. When I say that Salary listens to Education and Experience, I mean that it is a mathematical function of those variables: S = fS(EX, ED). But we need to allow for individual variations, so we extend this function to read S = fS(EX, ED, US), where US stands for “unobserved variables that affect salary.” We know these variables exist (e.g., Alice is a friend of the company’s president), but they are too diverse and too numerous to incorporate explicitly into our model.

Note: On the Sean Carroll podcast, Pearl mentioned that the equal sign in an equation hides information: it only states left = right, providing no information about which side is the cause and which the effect. An arrow (assignment) operator would provide that directionality.
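A minimal sketch of the quoted idea, with invented functional forms and numbers: each variable is assigned (":=") as a function of its parents plus an unobserved term, and intervening means overriding one variable's own mechanism.

```python
# Minimal structural-causal-model sketch of the Salary example in the quote.
# The functional forms and numbers are invented; the point is only that each
# variable is a function of its parents plus an unobserved term, and that the
# assignment is directional (":="), not an algebraic equality.
import random

def scm_sample(do_ed=None):
    u_ed, u_ex, u_s = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    ed = u_ed if do_ed is None else do_ed        # do() overrides ED's own mechanism
    ex = 10 - 2 * ed + u_ex                      # EX := f_EX(ED, U_EX)
    s  = 30 + 5 * ed + 1.5 * ex + u_s            # S  := f_S(ED, EX, U_S)
    return ed, ex, s

random.seed(0)
baseline = [scm_sample() for _ in range(10_000)]
intervened = [scm_sample(do_ed=2) for _ in range(10_000)]
print("mean salary, observational:", sum(s for _, _, s in baseline) / len(baseline))
print("mean salary, do(ED = 2):   ", sum(s for _, _, s in intervened) / len(intervened))
```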

Economists and sociologists had been using such models since the 1950s and 1960s and calling them structural equation models (SEMs).

Given that we know the fire escape was blocked (X = 1) and Judy died (Y = 1), what is the probability that Judy would have lived (Y = 0) if X had been 0? Symbolically, the probability we want to evaluate is P(YX = 0 = 0 | X = 1, Y = 1). Because this expression is rather cumbersome, I will later abbreviate it as “PN,” the probability of necessity (i.e., the probability that X = 1 is a necessary or but-for cause of Y = 1).
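For intuition, a toy sketch (an invented linear model, not the actual equations of the Judy example) of how such a counterfactual is evaluated in a structural model, using Pearl's three steps of abduction, action, and prediction:

```python
# Toy counterfactual query "would Y have been different had X been 0?" in a
# simple linear SCM (invented numbers), using the three-step recipe:
# abduction (recover the unobserved U from the observed facts), action
# (replace X's mechanism with X := 0), prediction (recompute Y).

def f_y(x, u):
    return 2.0 * x + u          # structural equation Y := f_Y(X, U)

# Observed facts for one individual
x_obs, y_obs = 1.0, 5.0

# 1. Abduction: with Y = 2X + U, the facts imply U = Y - 2X
u = y_obs - 2.0 * x_obs         # -> 3.0

# 2. Action: set X to its counterfactual value
x_cf = 0.0

# 3. Prediction: recompute Y with the same U
y_cf = f_y(x_cf, u)
print(f"Y would have been {y_cf} instead of {y_obs} had X been {x_cf}")
```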

One reader of my book Causality described this lost feeling beautifully in a letter to me. Melanie Wall, now at Columbia University, used to teach a modeling course to biostatistics and public health students. One time, she explained to her students as usual how to compute the indirect effect by taking the product of direct path coefficients. A student asked her what the indirect effect meant. “I gave the answer that I always give, that the indirect effect is the effect that a change in X has on Y through its relationship with the mediator, Z,” Wall told me. But the student was persistent. He remembered how the teacher had explained the direct effect as the effect remaining after holding the mediator fixed, and he asked, “Then what is being held constant when we interpret an indirect effect?” Wall didn’t know what to say. “I’m not sure I have a good answer for you,” she said. “How about I get back to you?” This was in October 2001, just four months after I had presented a paper on causal mediation at the Uncertainty in Artificial Intelligence conference in Seattle. Needless to say, I was eager to impress Melanie with my newly acquired solution to her puzzle, and I wrote to her the same answer I have given you here: “The indirect effect of X on Y is the increase we would see in Y while holding X constant and increasing M to whatever value M would attain under a unit increase in X.”
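A tiny linear sketch (coefficients invented) of exactly this definition of the indirect effect, alongside the direct and total effects:

```python
# Linear mediation sketch (coefficients invented): M := a*X, Y := c*X + b*M.
# Following the quoted definition, the indirect effect of a unit increase in X
# is the change in Y when X is held fixed but M is moved to the value it would
# have taken had X increased by one unit, i.e. b*a. The direct effect is c,
# and total effect = c + b*a.
a, b, c = 2.0, 0.5, 1.0

def m(x):          return a * x
def y(x, m_val):   return c * x + b * m_val

x0 = 3.0
direct   = y(x0 + 1, m(x0)) - y(x0, m(x0))       # move X, hold M fixed      -> c
indirect = y(x0, m(x0 + 1)) - y(x0, m(x0))       # hold X, move M as if X+1  -> b*a
total    = y(x0 + 1, m(x0 + 1)) - y(x0, m(x0))   # move both                 -> c + b*a
print(direct, indirect, total)                    # 1.0 1.0 2.0
```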

A formula reveals everything: it leaves nothing to doubt or ambiguity. When reading a scientific article, I often catch myself jumping from formula to formula, skipping the words altogether. To me, a formula is a baked idea. Words are ideas in the oven.