**The scientific problem.**

Consider an analogy. Suppose we wanted to figure out whether some new pharmaceutical is effective at treating some health condition. What we would do is take a large group of people, randomize some to get the new drug, and compare the outcomes of the people who randomly get the drug to those of people who randomly don’t get the drug. In this randomized controlled trial, we would like to use as large a sample as possible so as to average out the unfathomably complex causes of any given person’s health, and we randomize who gets the drug to rule out the possibility that the underlying determinants of who gets the drug also determine the health outcomes. For example, if we let people choose whether to take the drug and then compare the outcomes of those who choose to take it to those who don’t, it would misleadingly look like the drug caused health to fall if only desperate people in very low health chose to take it.

Now apply this reasoning to a policy change such as hiking the minimum wage. We’d like to evaluate such a policy by taking a very large group of cities, randomizing some cities to get higher minimum wages and others not, and then comparing the outcomes of interest, such as wages and employment, across those two groups. We’d get an estimate of the average effect of the minimum wage hike which is, on average, correct.

Unfortunately, no such large-scale controlled experiment is available. And unfortunately, if we ask a very specific question like, “what was the effect of the 2015/16 increases in minimum wages on employment in Seattle?”, we’re not asking about how minimum wages affect the average city, we’re asking about this particular case. This is analogous to asking not “does this drug benefit the average patient?” (acknowledging that some patients may be helped more than others and some may actually be harmed) but rather “how did this drug affect Fred Jones, specifically, Fred Jones?” The sample size is, in a sense, n=1. Further, we face the problem that Seattle wasn’t randomly chosen as a city to increase minimum wages, so it’s as if Fred Jones was standing in a large group of people, was the only person to stick his hand up when the research team asked for volunteers to test some new drug, and we want to know the effect of that drug on Fred Jones.

There is no statistical magic which can fully overcome these fundamental problems. We will never be able to “prove” what the effect of the minimum wage was: that’s not how statistics works in general, and in a case study like “what was the effect of the 2015 increase in minimum wages on employment in Seattle?” the best we can hope for is to bring some suggestive evidence to the table.

How did the researchers attempt to estimate the effect of Seattle’s minimum wage, given these issues? It may at first glance seem easy: we just look to see whether employment in Seattle went up or down after the minimum wage increased. This intuition is what many skeptics have in mind when they critique the UW study’s finding of negative effects on employment: employment in Seattle rose during 2015 and 2016, so the UW study must be wrong, they claim.

The problem with this argument is that employment rises or falls for many reasons other than the level of the minimum wage. Suppose for the sake of argument that Seattle actually created, say, 10,000 net low-wage jobs following the increase in the minimum wage. The causal effect of the minimum wage on employment is not, then, +10,000, it’s 10,000 *minus the number of jobs which would have been created if the minimum wage were not in place*. If 15,000 would have been created without the minimum wage hike and 10,000 were created with the hike, then the minimum wage hike actually destroyed 5,000 jobs, even though 10,000 jobs were created after the policy went into place.

To illustrate this point with the Seattle data, consider this graph:

In the absence of the minimum wage hike, the graph above might have looked more or less the same but with a slightly steeper increase in employment starting in 2015. This possibility is illustrated on the graph by “counterfactual Seattle.” The red line shows actual employment. The blue line shows what might have happened if Seattle hadn’t increased the minimum wage (entirely made up, assuming a causal effect of the minimum wage of zero in the month of implementation, rising linearly in magnitude to negative two percent two years later). If we knew, somehow, that the real world was just like the graph, that is, if we somehow knew that employment would have followed the blue path if counterfactually Seattle hadn’t hiked its minimum wage, we would conclude that the minimum wage hike decreased employment even though employment in our world actually rose following the hike.

Note this same argument applies when a minimum wage is imposed on an economy going into recession rather than booming, such as the 2007 Federal minimum wage hike in the U.S.: decreases in employment following a minimum wage hike tell us exactly nothing about the causal effect of the hike.

**How did the research teams estimate the effects of minimum wage changes?**

To attempt to overcome this problem, both teams of researchers try to find a “control group” of cities which did not hike their minimum wages. The jargon “control group” highlights the methodological ties to the pharmaceutical analogy above: this is not a randomized controlled experiment, but we want to make it look as close as possible to a randomized controlled experiment using statistical methods. Consider this graph of made-up data:

In the graph there are just two times: “before” and “after” the minimum wage hike. The solid blue line shows what happens in the Seattle we see in the data, the real Seattle which actually hiked minimum wages. Employment in real Seattle rises from 4 to 5 units. We shouldn’t conclude that the minimum wage hike increased employment by one unit, however, because we see that in “some other city” (actually in the real studies: an average across many other cities in which there was no change in the minimum wage) which didn’t hike minimum wages, employment rose from 2 to 4 units. We might reason: if Seattle had not hiked its minimum wage, its employment would have followed the same pattern as in cities which didn’t implement the hike. If that were so, Seattle’s employment over time would have looked like the dashed blue line: it would have gone from 4 to 6 units rather than from 4 to 5. In this case, one estimate of the effect of the minimum wage hike in Seattle is -1: employment rose in Seattle by 1 unit, but we guess it would have risen by 2 units if the minimum wage had not been hiked, so we estimate the minimum wage actually reduced employment by 1 unit despite the fact that employment went up 1 unit after the hike.
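The arithmetic in this two-period example can be written out as a short sketch. The numbers are the made-up ones from the paragraph above; the variable and function names are just illustrative choices.

```python
# Difference-in-differences on the made-up two-period numbers from the text.
employment = {
    "seattle": {"before": 4.0, "after": 5.0},  # hiked its minimum wage
    "control": {"before": 2.0, "after": 4.0},  # did not hike
}


def change(city):
    """Change in employment from 'before' to 'after' for one city."""
    return employment[city]["after"] - employment[city]["before"]


# Assumption being made: absent the hike, Seattle's employment would have
# changed by as much as the control city's did.
counterfactual_after = employment["seattle"]["before"] + change("control")

# Estimated effect = Seattle's actual change minus the control city's change.
did_estimate = change("seattle") - change("control")

print(counterfactual_after)  # 6.0
print(did_estimate)          # -1.0
```

The estimate is only as good as the assumption in the comment: if Seattle's employment would have moved differently from the control city's even without the hike, the subtraction attributes that difference to the minimum wage.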

A serious issue with this reasoning is that it assumes that the trends in employment in Seattle and the control city would have been the same but for the increase in minimum wages (this is called the “parallel trends” assumption). But that’s not generally true. Suppose that employment in Seattle would have flatlined in 2015 and 2016 if not for the minimum wage hike, even as employment increased in comparable other cities. Then if the actual data looked like either of the graphs above, we’d conclude that the increase in minimum wages actually increased employment in Seattle.

**Synthetic controls.**

Both studies use a variant of the method described above which partially addresses this issue. Instead of assuming that Seattle would have experienced a trend in employment equal to the trend in control cities if Seattle hadn’t increased its minimum wage, they assume that every city experiences common influences on employment over time, but each city may respond differently to those influences. Suppose for example that every city’s employment goes up when the national economy booms, but that Seattle’s employment is much more sensitive to changes in the national economy than, say, Yakima’s. Maybe Seattle’s employment rises 2% when the national employment rate rises 2%, but Yakima’s only goes up 1%. This difference in sensitivity means these two cities would generally experience different trends in employment over time even if neither changed its minimum wage, and the statistical method described above would yield misleading results. Both papers use a method called “synthetic controls” which addresses this concern, and somewhat weakens the assumption that all cities have the same trend in employment over time, replacing it with the assumption that the sensitivity of each city to common influences doesn’t change over time (the UW team also uses a closely related method called “interactive effects,” and finds very similar results). The two papers use different sets of control cities: the UW paper uses cities in Washington state whereas the Berkeley paper uses cities across the U.S. It’s not obvious which strategy is better, and it is unfortunate that both teams didn’t report estimates from both strategies in the interest of robustness and comparability.
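To make the idea concrete, here is a toy sketch, not either team's actual procedure. It assumes each city's employment index is a fixed "loading" times a common national factor, uses two hypothetical donor cities, and picks the weight on them by a simple grid search so that the weighted average tracks Seattle before the policy. All numbers and names are invented, and no treatment effect is built in, so a good method should estimate zero.

```python
# Toy factor model (all numbers invented): employment = loading * common factor.
common_factor = [100, 102, 104, 106, 108, 110]
PRE = 4  # periods 0-3 are before the policy, periods 4-5 are after


def path(loading):
    """Employment path of a city with the given sensitivity to the common factor."""
    return [loading * f for f in common_factor]


seattle = path(1.0)  # true effect of the policy is zero by construction
donor_a = path(0.5)  # hypothetical city, less sensitive than Seattle
donor_b = path(1.5)  # hypothetical city, more sensitive than Seattle


def pre_period_error(w):
    """Squared pre-period gap between Seattle and the weighted donor average."""
    return sum(
        (seattle[t] - (w * donor_a[t] + (1 - w) * donor_b[t])) ** 2
        for t in range(PRE)
    )


# Weights are restricted to be non-negative and to sum to one.
grid = [k / 100 for k in range(101)]
best_w = min(grid, key=pre_period_error)

synthetic = [best_w * a + (1 - best_w) * b for a, b in zip(donor_a, donor_b)]
post_gap = [seattle[t] - synthetic[t] for t in range(PRE, len(seattle))]

# Simple difference-in-differences against donor A alone, for contrast.
naive_did = (seattle[-1] - seattle[0]) - (donor_a[-1] - donor_a[0])

print(best_w, post_gap, naive_did)
```

In this constructed case the weighted average reproduces Seattle's sensitivity, so the post-period gap is zero, while the simple comparison with the less sensitive city reports a spurious positive effect. Real applications use many donor cities and a proper constrained optimizer rather than a grid, and nothing guarantees that a good pre-period match exists.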

At the end of the day, in terms of fundamental assumptions, the synthetic control method is more or less the same as, and suffers from similar drawbacks to, the simpler method (called differences-in-differences) first described. Basically, both papers contrast changes in employment in Seattle with changes in employment in other cities which didn’t change their minimum wages, and conclude the minimum wage destroyed jobs if Seattle’s employment fell relative to employment in other cities. Neither paper can differentiate an effect of the minimum wage from Seattle-specific changes in the labor market in 2015 and 2016, and no paper ever will. Analogously, even if we see the single patient who took the new drug get better much faster than other patients, we can’t differentiate between two competing explanations: the drug is really effective, or something else happened to improve that patient’s health at about the same time he took the drug.

**Do the results conflict?**

The papers differ in the manner in which they construct their estimates. The Berkeley team has quarterly data on total number of employees and total payroll and constructs the average weekly wage paid by dividing total payroll by total employees. This data limitation makes it impossible to differentiate changes in average hours worked per worker from changes in wages per hour, and it means that the Berkeley team cannot observe how the distribution of wages changes (for example, how the proportion of workers making between $15 and $17 per hour changes); they can only work with the average wage. The Berkeley team, following much of the literature, focuses on restaurant workers since many restaurant workers are paid at or near the minimum wage, so any effects of minimum wages should be easiest to detect in that sector.

The UW team has more detailed data on the wages of individual workers, although for reasons not clear to me they didn’t do much to exploit the fact that (I think) they can observe what happens to an individual worker over time. They also restrict the sample to firms which do not have multiple locations because they cannot determine from their data whether workers at multi-site firms are subject to the minimum wage hike. It’s not clear to me how the Berkeley team deals with this issue as they face the same problem. From the discussion on page 8, I think the Berkeley team treats all workers for a given firm as being in Seattle and subject to the minimum wage hike if the firm either reports separate employment data for each of its locations, so that the Seattle locations can be picked out, or reports that its head office is in Seattle. This is no less problematic than the UW team’s omission of multi-site firms: both teams get biased results due to this issue, for somewhat different reasons.

The UW team also attempted to isolate workers likely to be affected by the minimum wage hike, but with their more detailed data they were able to go beyond limiting attention to restaurant workers. They instead limit attention to workers who earn less than $19 per hour. This creates a possible problem: if the minimum wage hikes turn $13 an hour workers into $20 an hour workers, that change will be coded in the UW team’s data as a job destroyed rather than a good job created! However, the team shows that there are essentially no effects of the minimum wage change on employment even at wages substantially lower than $19 an hour, and concludes that this problem is unlikely to generate much bias. This result can be seen in their Figure 1.

The figure shows the estimated effects on employment of the minimum wage increase to $13 for each wage between $9 and $39. These are not raw changes, but rather changes in employment in Seattle relative to changes in employment in the control cities, estimated using the synthetic control method described above. We see big negative effects on the number of jobs below $13 per hour, which just shows the minimum wage is actually effective in the sense that such jobs have been legislated out of existence, and the legislation works. What we hope to see is that the number of jobs above $13 per hour increases in Seattle by more than it increases in control cities, for example, that the policy turned $11 an hour jobs into $13 an hour jobs, but that’s not what we see. In particular, it isn’t the case that the UW team has just mistaken the creation of much better (greater than $19 per hour) jobs for destroyed jobs: high-wage jobs increased in Seattle in 2015-2016, but high-wage jobs also increased elsewhere, and by about the same amount, in cities in which the minimum wage was not increased. But more low-wage jobs disappeared in Seattle than in cities in which the minimum wage didn’t increase. This is the essence of the UW team’s findings.

**Comparing the estimates.**

The most directly comparable estimates in the two studies are of the effects on restaurant workers, presented in Table 9 of the UW study and Table 2 of the Berkeley study. Both teams estimate that a 10% increase in the minimum wage caused about a 1% increase in the average wage of restaurant workers (those under $19 per hour in the UW case, and about 2% rather than 1% for the Berkeley team’s estimate on fast-food restaurant workers, although this is a noisy estimate). This effect is perhaps surprisingly small, and occurs at least in part because most workers earn enough that they are not directly affected by changes in the minimum wage.

The contentious estimates are those on employment. These estimates have been widely described as in conflict, but they’re actually quite similar, statistically. The results are not easy to compare directly because the Berkeley team frames their results as answers to the question, “for every 1% the minimum wage increases, by what percent does employment rise or fall?” (that is, they report elasticities of employment to the minimum wage). The UW team mostly frames results as answers to the question, “by what percent does employment rise or fall in response to the actual minimum wage hikes?” Further, the UW team addresses restaurant workers in isolation only in Table 9 and focuses on results for all workers, whereas the Berkeley team only estimates effects for restaurant workers.

The UW team’s estimates of the effects of the minimum wage hikes on all restaurant employment are reported in the second-to-last column of Table 9, labeled “All wage levels, Jobs.” Across time, these estimates range from +3.6% to -1.3%, but in all cases the estimated effect is small relative to the uncertainty of the estimate, that is, none of these estimates are even close to being statistically different from zero. We can express these estimates as responses to each 10% increase in the minimum wage by dividing by 1.6 for the first three quarters after enforcement of the first minimum wage hike (since it was a 16% hike) and by 3.7 for the next three quarters (since it was a 37% hike, see the note for Table 8). After these adjustments, the UW team’s estimates appear even smaller in magnitude: ballpark them at about zero.
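The rescaling is a single division, sketched below. The function name is an illustrative choice. Because the text above does not say which quarters the two endpoints of the range belong to, the sketch applies both divisors to each endpoint rather than guessing.

```python
def effect_per_10_percent(effect_pct, hike_pct):
    """Rescale an estimated percent change in employment to a 10% minimum wage hike."""
    return effect_pct / (hike_pct / 10)


# Endpoints of the reported range, each rescaled under both hike sizes.
rescaled = {
    (estimate, hike): effect_per_10_percent(estimate, hike)
    for estimate in (3.6, -1.3)
    for hike in (16, 37)
}

for (estimate, hike), value in rescaled.items():
    print(f"{estimate:+.1f}% after a {hike}% hike -> {value:+.2f}% per 10% hike")
```

Whichever divisor applies, each rescaled value is smaller in magnitude than the raw estimate it came from.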

The Berkeley team’s analogous estimates are reported in the bottom panel of their Table 2. They find similarly small effects: each 10% increase in the minimum wage changes employment by between about -0.6% and +0.6%. These estimates are also quite noisy and consistent with moderate positive or negative effects. Without the data and a great deal of work, we can’t formally conduct statistical tests on whether these estimates are consistent with each other, but given how small, similar, and noisy the estimates are, it is very implausible that we would find them to be statistically distinguishable.

What about the estimates that are somewhat less comparable across the two studies? Are they in conflict, to the extent that we can compare them? The UW team detects a statistically significant effect on low-wage restaurant workers, who may be a similar group to the Berkeley team’s fast food restaurant workers. The UW team estimates that the minimum wage hikes reduced low-wage restaurant workers’ employment by up to 13% (Table 9, column 5, 2016 quarter 2). That is equivalent to estimating that each 10% increase in the minimum wage decreased employment by 3.5%. These estimates are statistically significantly different from zero. The Berkeley team, conversely, estimates that each 10% increase in the minimum wage decreases fast food employment by only 0.6% (six-tenths of one percentage point), and this estimate is not statistically significantly different from zero. Aren’t these estimates in conflict?

No. To see why, consider this graph showing the two teams’ estimates and the associated confidence intervals.

The points in the middle of the lines show the teams’ best guesses as to what happens to the employment of fast food or low-wage restaurant workers for each 10% increase in the minimum wage. The lines represent 95% confidence intervals around these guesses. One interpretation of these intervals is that they contain all of the effect sizes which are statistically indistinguishable from the best guess, so for example, the UW team guesses that employment falls by 3.5% when the minimum wage rises 10%, but they would not say that decreases between about 1% and 6% are statistically different from that guess. Notice the Berkeley team’s confidence interval heavily overlaps the UW team’s. That does not mean that we can conclude the estimates are statistically indistinguishable (technically, we might still reject the null that the two estimates are the same if the confidence intervals overlap, particularly if the estimates are positively correlated). But, even though we can’t actually conduct the test, it seems very unlikely we’d be able to distinguish between the two estimates. We can also say that neither team would be surprised should a deity reveal that the true effect is actually negative and moderately small, say around -2%.

By analogy, suppose the two teams were trying to determine if a coin is fair. The Berkeley team flips the coin 100 times and gets 54 heads. Their best guess is that the coin is slightly more likely to come up heads than tails, but if the coin were actually fair, they’d frequently get at least 54 heads or at least 54 tails (about 42% of the time, by the usual normal approximation), so they conclude that 54 heads is not statistically significantly different from 50 heads: there is “no effect.” The UW team flips the coin 200 times, so they have somewhat more information, and gets 114 heads. 114 heads, it turns out, is statistically different from 100 heads, so the UW team reports that they’ve “found an effect.” But neither team would report that its estimate is statistically different from a probability of heads of, say, 55%, and the two teams’ results are not actually in conflict.
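The coin numbers can be checked with a few lines. This sketch uses the normal approximation to the binomial, which is an assumption of the sketch; an exact binomial calculation gives somewhat different p-values. The function name is illustrative.

```python
import math


def two_sided_p(heads, flips, p_null=0.5):
    """Two-sided p-value for the observed number of heads, using the
    normal approximation to the binomial under the null p_null."""
    expected = flips * p_null
    sd = math.sqrt(flips * p_null * (1 - p_null))
    z = (heads - expected) / sd
    return math.erfc(abs(z) / math.sqrt(2))


# Each team's test of "the coin is fair".
berkeley_vs_fair = two_sided_p(54, 100)  # about 0.42: not significant
uw_vs_fair = two_sided_p(114, 200)       # just under 0.05: significant

# Each team's test of "the coin comes up heads 55% of the time".
berkeley_vs_55 = two_sided_p(54, 100, p_null=0.55)
uw_vs_55 = two_sided_p(114, 200, p_null=0.55)

print(berkeley_vs_fair, uw_vs_fair, berkeley_vs_55, uw_vs_55)
```

One team rejects a fair coin and the other does not, yet neither rejects a 55% coin, which is the sense in which the two sets of results are compatible.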

In other words, what the Berkeley team means when they report “no effect” on employment is not that there is no effect on employment (yes, that is confusing). What they mean, again, is that there is no *statistically significant* effect on employment, whereas the UW team, using different data and somewhat different statistical methods, finds a statistically significant effect. But the difference between statistically significant and statistically insignificant is often itself not statistically significant. These estimates are both consistent with small negative effects on employment in the restaurant sector. The UW team also reports estimates on other sectors and finds larger negative effects on employment, but there are no analogous estimates in the Berkeley study.

**What does all this mean?**

Two teams of researchers have presented estimates of the effects of Seattle’s recent minimum wage increases on restaurant workers in Seattle, using similar methods and similar data. Both teams find that there were small but detectable increases in average wages paid in the restaurant sector. One team found there were no statistically significant effects on employment, but that result should not be misunderstood as a claim that the study “proves” the effect was actually zero, and the estimates in the two studies are not statistically in conflict in the sense that they are both consistent with small to moderate negative employment effects in the restaurant sector. We can’t compare results in other sectors because the Berkeley study limits attention to restaurant workers.


The basic idea underlying LATE is to acknowledge that different people (or different units more generally) generally have different causal effects for any given “treatment,” broadly defined. It is common to talk about “the” causal effect of, say, education on earnings, or interest rates on growth, or pharmaceuticals on health, but if different people respond differently to education or to medical treatments and different countries respond differently to macroeconomic interventions, it’s not clear what we mean by “the” causal effect. We can still talk coherently about distributions of causal effects, though, and we may be interested in estimating various averages of those causal effects. Local average treatment effects (LATEs) are one such average.

For concreteness, let’s suppose the government decides to lend a hand to empirical researchers by implementing the following goofy policy: a randomly selected group of high school kids are randomized to get an offer of either $0 or $5,000 to acquire a college degree. We wish to use this natural experiment to estimate “the” effect of getting a college degree on, say, wages. We collect data on all these folks comprised of: a dummy variable $Z_i$ which equals one if person $i$ was offered $5,000 and zero if they were offered zero, a dummy variable $D_i$ which indicates the student actually received a college degree, and wages, $y_i$.

Whether someone actually goes to college depends, for some people, on whether they are offered nothing or $5,000. So let $D_i(Z_i)$ denote person $i$’s college choice as a function of $Z_i$. In this simple case, this divides the population into four mutually exclusive groups:

| $D_i(0)$ | $D_i(1)$ | type |
|---|---|---|
| 0 | 0 | never-taker (N) |
| 0 | 1 | complier (C) |
| 1 | 0 | hipster (H) |
| 1 | 1 | always-taker (A) |

That is, some people will not go to college regardless of their offer (the never-takers), some will always go to college regardless of their offer (the always-takers), some will go to college if they are offered $5,000 but not if they are offered nothing (the compliers), and some may do the opposite of what they’re “assigned” to do and go to college if offered nothing and not go if they’re offered $5,000 (conventionally the “defiers,” but I prefer the “hipsters,” as one of my students helpfully suggested). Let N, A, C, and H denote membership in these groups, and let $\pi_N$ denote the proportion of the population who are never-takers, and similarly define $\pi_A$, $\pi_C$, and $\pi_H$.

We can always write the observed outcome as

$$ y_i = \beta_0 + \beta_i D_i + u_i $$

where $\beta_0$ is a constant which can be interpreted as the average wage in the population if no one goes to college, $\beta_i$ is the causal effect of college for person $i$, and $u_i$ is a mean-zero variable representing all causes of wages other than college. How much do we expect the wages of people who were offered nothing to differ from those offered $5,000? The expected wage of the folks offered nothing is

$$ E[ y_i | Z_i=0 ] = \beta_0 + E [ \beta_i D_i | Z_i=0 ] $$

because $E[ u_i | Z_i = 0 ] = 0$, as $Z_i$ is randomized and hence independent of $u_i$. We can decompose the second term by remembering that every person is one of the four types defined above. Conditional on being offered nothing, $D_i = 0$ for the people in the complier and never-taker groups, so $\beta_i D_i$ is zero for these people. The mean of $\beta_i D_i$ conditional on $Z_i = 0$ and on being a hipster is $E(\beta_i | H)$, the average causal effect among the hipster subpopulation, since $D_i = 1$ for these folks when they’re offered nothing, and we do not need to condition on $Z_i$ because $Z_i$ is randomized and hence independent of $\beta_i$. Following the same reasoning for the always-takers and substituting into the equation above, we find,

$$ E[ y_i | Z_i = 0 ] = \beta_0 + \pi_H [ E(\beta_i | H) ] + \pi_A [ E(\beta_i | A) ], $$

which says that the mean wage among people offered nothing depends on the mean wage of all people absent college ($\beta_0$), the average change in wages induced by college among people who were offered nothing but went to college anyway, $E(\beta_i | H)$, the average change in wages induced by college among the folks who go to college whether offered nothing or $5,000, $E(\beta_i | A)$, and the proportions of the population in these groups. Similarly, the average wage of people offered $5,000 is

$$ E[ y_i | Z_i = 1 ] = \beta_0 + \pi_C [ E(\beta_i | C) ] + \pi_A [ E(\beta_i | A) ], $$

and the difference across the two groups is then

$$ E[ y_i | Z_i = 1 ] - E[ y_i | Z_i = 0 ] = \pi_C [ E(\beta_i | C) ] - \pi_H [ E(\beta_i | H) ], $$

which depends only on average causal effects among the compliers and the hipsters and the proportions of people in these two groups. The never-takers and always-takers don’t change their behavior in response to the offer in their letters, so their outcomes don’t affect the change in average wages across those offered nothing or $5,000.

Now consider the difference in the proportion attending college across the people offered nothing or $5,000. The probability that someone offered nothing goes to college is $\pi_A + \pi_H$, the proportion who always go plus the proportion who go only if they’re offered nothing. Similarly, the probability that someone offered $5,000 goes is $\pi_A + \pi_C$, the proportion who always go plus the proportion who go only if they’re offered $5,000. The difference in proportions across the two groups is then $\pi_C - \pi_H$.

The ratio of changes in wages to changes in participation across those offered nothing or $5,000 is

$$ \beta_{IV} = \frac{ E[ y_i | Z_i = 1 ] - E[ y_i | Z_i = 0 ] }{ Pr(D_i=1|Z_i =1 ) - Pr(D_i=1|Z_i=0)} $$

which, when we plug in sample averages to estimate population moments, is called the Wald estimator, and is what we get by regressing $y_i$ on a constant and $D_i$ using $Z_i$ as an instrument. Substituting our calculations above, we find

$$ \beta_{IV} = \frac{\pi_C E(\beta_i | C) - \pi_H E(\beta_i | H)}{\pi_C - \pi_H}.$$

In this case, the IV estimator doesn’t converge to any economically meaningful quantity. Suppose for example that every person in the population experiences a positive causal effect of treatment ($\beta_i > 0$ for all $i$). Then we would hope that our estimator at least converges to something positive, but we could instead get a negative estimate, even asymptotically, if either there are more compliers than hipsters in the population ($\pi_C > \pi_H$) but hipsters have big enough average causal effects such that the numerator is negative, or if the numerator is positive but there are more hipsters than compliers in the population such that the denominator is negative. Notice this is so despite the fact that by assumption the instrument $Z_i$ is uncorrelated with the error term $u_i$ and affects the treatment $D_i$, that is, conventional textbook assumptions are satisfied.

But, if there are no hipsters ($\pi_H = 0$), then the estimator converges to

$$ \beta_{IV} = \frac{ \pi_C E( \beta_i | C ) }{ \pi_C } = E(\beta_i | C), $$

the average causal effect of treatment among compliers. This is the “local average treatment effect” *for this particular instrument* for this population: it’s not the average effect among the entire population and it’s not the average effect of college among those who actually go to college; it’s an average “local” to people whose behavior changes when the instrument changes, here, people who only go to college if they’re offered five grand to do so. People whose college decisions aren’t affected by whether they are offered nothing or $5,000 don’t appear at all in our estimate, because this experiment reveals no information about the causal effect of college for those people, and we need to assume away the irritating hipsters, whose presence would render our estimate essentially meaningless.
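A small numerical sketch can make both cases concrete. It takes a hypothetical population in which every person's causal effect is positive, computes the ratio of the wage difference to the participation difference directly from each person's choices under the two offers, and compares the result with and without hipsters. The group sizes and effect sizes are invented for illustration.

```python
# Each hypothetical person is (D(0), D(1), beta): college choice if offered
# nothing, college choice if offered $5,000, and causal effect of college.
def wald(population):
    """Population analogue of the Wald ratio. The constant beta_0 and the
    mean-zero error cancel in the numerator, so they are omitted."""
    n = len(population)
    mean_y_offer = sum(beta * d1 for d0, d1, beta in population) / n
    mean_y_nothing = sum(beta * d0 for d0, d1, beta in population) / n
    share_college_offer = sum(d1 for d0, d1, beta in population) / n
    share_college_nothing = sum(d0 for d0, d1, beta in population) / n
    return (mean_y_offer - mean_y_nothing) / (
        share_college_offer - share_college_nothing
    )


never_takers = [(0, 0, 1.0)] * 3
always_takers = [(1, 1, 1.0)] * 2
compliers = [(0, 1, 1.0)] * 3  # average effect among compliers is 1.0
hipsters = [(1, 0, 2.0)] * 2   # fewer than compliers, but larger effects

with_hipsters = wald(never_takers + always_takers + compliers + hipsters)
without_hipsters = wald(never_takers + always_takers + compliers)

print(with_hipsters)     # negative, although every individual effect is positive
print(without_hipsters)  # equals the average effect among compliers
```

With hipsters present the ratio comes out at about -1 even though nobody is harmed by college; once they are removed it equals the compliers' average effect of 1, and the effects assigned to never-takers and always-takers play no role in either number.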

A different experiment that also satisfies all the textbook assumptions for instrumental variables will generally recover a different LATE. Suppose the government had randomly offered nothing or $10,000 rather than $5,000 as an inducement to go to college. Even assuming we are not plagued by the presence of hipsters, the set of people who are compliers now differs from that in the original experiment, because people who will go to college only for an offer between $5,000 and $10,000 are now compliers but were formerly never-takers. Since those folks will generally have different effects of going to college than the former set of compliers, we get, even asymptotically, different estimates of “the” effect of going to college on wages, even though both instruments are by assumption valid.


I make three points. First, that the article is much more consistent with the literature if one reads “poverty” every time the article uses the word “inequality.” Second, that the fact that income and health are correlated across people or across regions does not tell us that income causes health. And finally, that the research does suggest that decreasing poverty will increase health, but that we should not expect substantial reductions in health care expenditures as a result. I close with some notes on policy implications.

The article conflates two issues which are conceptually distinct and have different policy implications: the effect of poverty on health, and the effect of income inequality on health. The evidence suggests that poverty reduces health, but there is little solid evidence, and much skepticism in the literature, over the idea that inequality *per se* leads to lower health, and even more skepticism over the notion that any effect of inequality on health occurs through “stress.” In a previous, fairly technical, blog post I presented an overview of some of the empirical literature on inequality and health. The very short version is: there is pretty good evidence that an individual’s income causes her health, but this effect is much stronger at low levels of income (as in the figure below). There is no good evidence that, holding an individual’s income constant, increases in income inequality decrease that individual’s health. And finally, a variety of factors, notably education and health in childhood, cause both adult income and adult health.

Pointing out that correlation does not imply causation is tiresome, but here this issue is important as much of the article consists of listing correlations between income and health and interpreting those correlations as if they are evidence of causation from income to health:

Virtually every measure of population health – from child mortality to rates of cancer, cardiovascular disease and traumatic injury – is worse in poor areas than in wealthier ones.

Canadians with an income of $15,000 or less have three times the risk of developing diabetes than those who earn more than $80,000.

Similarly, the risk of dying of cancer within five years of diagnosis is 47 per cent higher in the low-income group than the high-income one.

People living in poor neighbourhoods have a 37 per cent greater risk of suffering a heart attack than those in wealthier areas. But those in middle-income neighbourhoods have a 21 per cent greater risk than residents of the richest areas. Most health problems follow a similar gradient.

Consider that, compared with the highest income group, in the lowest income group the rate of stillbirths is 24 per cent higher, the infant mortality rate is 58 per cent higher and cases of sudden infant death syndrome are 83 per cent higher.

Children born to low-income parents are twice as likely to end up in special education classes and three times as likely to suffer mental health problems than those in the highest income group.

Statistics Canada found, for example, that at age 25, life expectancy varies between the highest and lowest income groups by 7.1 years for men and 4.9 years for women.

Some of the best-known research on inequality was done by Sir Michael Marmot, a professor at University College in London. He tracked civil servants in the U.K. and found that mortality rates correlated perfectly with social status and income – in other words, the lowest paid died much younger.

These correlations don’t tell us anything about the effect of income on health, as we would observe such correlations even if there is no causal effect of income on health.

To see why, suppose that income does not cause health for anyone, but also, along with the literature, that education causes both health and income. If we were to arrange people in a line from lowest income to highest income, as we move up the line we will tend to find healthier people, but we’d also find that people farther along in the line tend to have more education. We might find all the correlations listed in the article, including correlations between mother’s income and infant’s health, even though we started by assuming that income does not cause health. In the real world, it is likely that there is some effect of adult income on adult health, but we also know that education and a wide variety of other characteristics affect both health and income. That fact prevents us from interpreting the finding that higher income and better health are linked as evidence that income causes health.
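The thought experiment can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: the function names, the standard normal distributions, and the unit coefficients are all hypothetical and chosen for convenience. Education raises both income and health, income has no effect on health, and yet the two end up clearly correlated.

```python
import random

def simulate_confounding(n=20000, seed=0):
    """Simulate a world where education raises both income and health,
    but income has NO causal effect on health (illustrative numbers)."""
    rng = random.Random(seed)
    income, health = [], []
    for _ in range(n):
        education = rng.gauss(0, 1)
        # Income depends on education plus unrelated noise.
        income.append(education + rng.gauss(0, 1))
        # Health depends on education plus unrelated noise, and NOT on income.
        health.append(education + rng.gauss(0, 1))
    return income, health

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

income, health = simulate_confounding()
r = correlation(income, health)
print(f"correlation between income and health: {r:.2f}")
```

With these particular assumptions the population correlation is 0.5, even though raising anyone's income in this simulated world would leave their health unchanged.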

Some research attempts to statistically control for as many of these influences as possible, but researchers face an additional difficulty: health causes income. Again suppose for the sake of argument that there is no effect of income on health, but also that healthier people tend to fare better in the labor market. If we look across people or across regions, we’ll again find that people with higher incomes tend to be healthier, but we again started with the assumption that income does not cause health.

Some earlier findings purporting to find evidence of effects of income on health are now thought to reflect either “reverse” causation from health to income, or difficult to measure “third variables” which cause both health and income. In addition to education, another particularly important such variable is childhood health, which has been shown to largely explain the correlation between income and adult health in the work by Michael Marmot cited by Mr. Picard (Case and Paxson 2011).

Much of the literature suggests that changes in income have little effect on health, at least for employed adults. Longitudinal studies tracking people over time show that good health predicts which workers will receive promotions, but getting a promotion does not seem to have any effect on a worker’s health, implying that health causes income but not the other way around. Further, comparing two workers with the same initial health, the worker whose income goes up faster is not less likely to die than his lower-paid counterpart, and there is little evidence that “stress” due to relatively low income is the primary mechanism through which any effect of adult income on health operates (Gardner and Oswald 2004, Adams, Hurd, McFadden, Merrill, and Ribeiro 2003, Jones and Wildman 2008).

Finally, putting aside the complex relationship between income and health, consider the claim that income inequality causes higher health care expenditures:

One study estimates that if those in the bottom 20 per cent of income earned as much as those one step higher on the income ladder, the savings to the health system would be $7.6-billion a year.

The source for this figure appears to be this report from the Ontario Association of Food Banks. The author takes cross-sectional data on health expenditures by income quintile and estimates that these expenditures are $7.6B higher in the lowest quintile than in the second-lowest quintile. But there are two serious problems with the interpretation in the Globe article. First, the article interprets this correlation as entirely causal, and as we have seen, that interpretation is consistent with neither theory nor evidence. Second, making people healthier does not necessarily decrease long-run demand on the health care system—in fact, increasing a person’s health may wind up increasing that person’s demand on the system, as they are more likely to live longer and consume health care resources through old age. This isn’t necessarily the case–possibly policies which decrease poverty would have the net effect of decreasing long-run demand on the health care system–but we’d need much more evidence to draw that conclusion.

The point that better health may have little or a counter-intuitively positive effect on health care spending should not be interpreted as an argument against policies which reduce poverty: we value health in and of itself, not solely or even largely because of its impact on health care spending. But we ought not offer the enticement of fiscal savings through changes in demand on the publicly-funded health care system as a major reason to enact such anti-poverty measures.

We have seen that the issues surrounding income, income inequality, and health are tricky and the subject of much ongoing research. If we should not conclude that income inequality *per se* is the main culprit nor that policies which affect incomes are likely to have large effects on health care expenditures, what policy implications should we draw from this research?

We know that public policies affecting income can improve the long-run health of children. Simply increasing the incomes of families in poverty can improve children’s health: for example, Hoynes, Miller, and Simon (2012) show that increased generosity of the U.S. EITC (essentially a negative income tax) improved infant health outcomes, which are likely to play out as better outcomes in many dimensions over those infants’ entire lives.

More generally, there is evidence that poverty causes low health. There is evidence that childhood deprivation harms children, and also that harm done to children’s well-being continues to affect those children throughout their lives. And there is evidence that sound public policy interventions can be effective in reducing such problems, including well-targeted income redistribution but also, and perhaps more importantly, improved quality and quantity of education.

One explanation for this puzzle is that Americans who vote are less likely to support legalization than those who do not vote. Voters tend to be older, and possibly have other characteristics which are associated with opposition to drug policy reform.

The Gallup poll result is consistent with, but shows somewhat more support than, recent results from the General Social Survey (GSS), which suggest that roughly half of Americans supported legalization in 2011 and 2012. Unlike the Gallup polls, the GSS data are available to researchers, and include a wide variety of information on respondents, including whether or not they voted in the most recent Presidential election.

I set out to show using all GSS waves between 1975 and 2012 which include the legalization support question (n=26,870) that people who vote are less likely to support legalization than people who do not vote, which would help explain why politicians seem out of sync with public sentiment.

But I found the opposite.

The graph shows support for legalization for all respondents, for respondents who voted in the last Presidential election, and for respondents who did not vote in the last election. Up until about a decade ago, voters tended to be moderately less likely to support legalization than non-voters. But since 2004 voters have been modestly more likely to support legalization than non-voters. Restricting attention to post-2004 sample waves, voters are 3.8 percentage points more likely to support legalization than non-voters (z=2.61).

So the puzzle is not resolved by differences in attitudes towards legalization across voters and non-voters.

I also ran a few regressions to check if what we see in the graph goes away with a few basic controls (notably age), and how support varies with demographic characteristics. The table below shows results from regression models in which the dependent variable is a dummy indicating support for legalization. Each cell contains the estimated parameter with the associated t-ratio below. The covariate of interest is “voted last elec.”, a dummy indicating the respondent voted in the preceding Presidential election. (All estimates are marginal effects from probit regressions with robust standard errors.)

The first column shows that, over the entire sample from 1975 through 2012, voting in the last election is associated with about three percentage points lower support for legalization (z=4.59). The second column adds complete sets of age and year dummies, which are enough to flip the sign on the voter dummy. Holding age constant and removing any common trend over time in voting propensities and support for legalization, U.S. voters are slightly more likely to support legalization than non-voters (by 1.4 percentage points, z=2.49). The results on the age effects (unreported) suggest that all else equal older cohorts are less likely to support than younger cohorts, providing some sketchy evidence in favor of the notion that support will increase over time as older cohorts, er, shuffle off this mortal coil.

The model presented in the last column adds controls for religion, education, political leanings, and region, which does essentially nothing to the estimated association between voting and support for legalization.

The results also show that religious belief and behavior are strong predictors of support: non-religious people are substantially more likely to support legalization than otherwise identical religious people (by about 17 percentage points, z=13.71), and observant religious people are much less likely to support than religious but non-observant people (by about 14 percentage points, z=14.15).

The omitted education category is high school dropouts. More education strongly predicts higher support for legalization, other things equal.

Finally, holding all else equal, people with moderate political leanings (the omitted category) are more likely to support than conservatives (by about 6 percentage points, z=7.85) and less likely to support than liberals (by about 11 percentage points, z=13.68).

| | Model 1 | Model 2 | Model 3 |
| --- | --- | --- | --- |
| voted last elec. | -0.027 | 0.014 | 0.014 |
| | -4.59 | 2.49 | 2.36 |
| Catholic | | | 0.016 |
| | | | 2.45 |
| No religion | | | 0.168 |
| | | | 13.71 |
| Other religion | | | 0.096 |
| | | | 7.19 |
| Observant | | | -0.140 |
| | | | -14.15 |
| high school | | | 0.042 |
| | | | 5.33 |
| junior college | | | 0.070 |
| | | | 4.84 |
| bachelor | | | 0.075 |
| | | | 6.57 |
| graduate | | | 0.109 |
| | | | 7.33 |
| conservative | | | -0.058 |
| | | | -7.85 |
| liberal | | | 0.114 |
| | | | 13.68 |
| middle atlantic | | | -0.035 |
| | | | -2.85 |
| e. nor. central | | | -0.033 |
| | | | -2.75 |
| w. nor. central | | | -0.058 |
| | | | -4.48 |
| south atlantic | | | -0.046 |
| | | | -3.86 |
| e. sou. central | | | -0.046 |
| | | | -3.32 |
| w. sou. central | | | -0.055 |
| | | | -4.38 |
| mountain | | | -0.018 |
| | | | -1.25 |
| pacific | | | 0.017 |
| | | | 1.25 |
| age dummies | no | yes | yes |
| year dummies | no | yes | yes |
| n | 26,870 | 26,870 | 26,870 |

There are some good discussions of Hansen’s most influential contribution, the Generalized Method of Moments (GMM), in the economics blogosphere; examples include Guan Yang, John Cochrane, and Jeff Leek. This post presents another, which differs mostly in that the discussion does not focus on applications in asset pricing. The basic ideas in Hansen (1982) are elaborations and generalizations of ideas presented in Sargan (1958), which develops overidentified instrumental variables estimators in a modern context, a method mostly used to infer causal effects from observational data.

Suppose we have a sample of size $n$ on some variable $y$ and we would like to estimate the mean of $y$, denoted $\mu$. In this simple case, the method of moments tells us to estimate $\mu$ by replacing the population condition

$$ E[ y - \mu ] = 0 $$

with its sample analog,

$$ \frac{1}{n}\sum_i [ y_i - \hat\mu ] = 0, $$

where our estimator $\hat\mu$ is the value of the parameter which makes the equation above true. The method of moments estimator of $\mu$ is simply the sample mean of $y$, denoted $\bar y$.

Now suppose we draw another $n$ observations on a different random variable $w$. Suppose theory tells us that the mean of $w$ is the same as the mean of $y$. Following the same reasoning as above, we could estimate $\mu$ using the sample mean of $w$, $\bar w$. But using either of these estimates alone cannot be efficient, as we are wasting the information in the sample we don’t use. Theory tells us that both of these conditions are true

\begin{align}
E(y) - \mu &= 0 \\
E(w) - \mu &= 0
\end{align}

but we cannot generally choose $\hat\mu$ to make both of the sample analogs of these conditions true,

\begin{align}
m_1 &= \bar y - \hat\mu = 0 \\
m_2 &= \bar w - \hat\mu = 0,
\end{align}

so the method of moments can’t be directly applied. But we can generalize (hence, GMM) and make these two moment conditions $m_1$ and $m_2$ as close to being true as we can in the sense that we can make the squared deviations as small as possible. We could choose $\mu$ to

$$ \textrm{min}_{\mu} (\bar y - \mu)^2 + (\bar w - \mu)^2,$$

minimizing this objective yields consistent (since $\bar y$ and $\bar w$ are each consistent) but usually inefficient estimates, since we should take into account that $\bar y$ and $\bar w$ might have different variances and might be correlated. Intuitively, if $w$ is much noisier than $y$, we should place less weight on observations on $w$ because they contain less information about $\mu$ than observations on $y$. Suppose for simplicity that these are independent samples and thus uncorrelated, but that the variance of $w$ is higher than the variance of $y$. We can get rid of these unequal variances by dividing by the standard deviations (here and throughout the post we’ll assume for simplicity that we know all the variance parameters, abstracting from much of the complication of GMM estimation) and choose $\mu$ to

$$ \textrm{min}_{\mu} \label{eq:gmm}
\left( \frac{\bar y - \mu}{\sigma_{\bar y}} \right)^2
+ \left( \frac{\bar w - \mu}{\sigma_{\bar w}} \right)^2. $$

Note this is equivalent to weighting each moment condition by the reciprocal of its standard deviation, so that we place more weight on the more precise condition. In general the moments will not be uncorrelated, and we should take that into account too.
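To make the weighting concrete, here is a minimal sketch. It assumes, as the post does, that the variances are known and the two samples are independent. The function names `gmm_mean` and `monte_carlo` and all numerical settings are hypothetical choices for illustration. Setting the derivative of the weighted objective to zero gives a precision-weighted average of the two sample means, and the small Monte Carlo compares its sampling variance with that of the equally weighted estimate.

```python
import random

def gmm_mean(ybar, wbar, var_ybar, var_wbar):
    """Minimize ((ybar - mu)/sd_ybar)^2 + ((wbar - mu)/sd_wbar)^2 over mu.

    The first-order condition yields a precision-weighted average.
    """
    weight_y = 1.0 / var_ybar
    weight_w = 1.0 / var_wbar
    return (weight_y * ybar + weight_w * wbar) / (weight_y + weight_w)

def monte_carlo(reps=2000, n=50, mu=1.0, sd_y=1.0, sd_w=3.0, seed=0):
    """Compare the sampling variance of equal and optimal weighting."""
    rng = random.Random(seed)
    equal, optimal = [], []
    for _ in range(reps):
        ybar = sum(rng.gauss(mu, sd_y) for _ in range(n)) / n
        wbar = sum(rng.gauss(mu, sd_w) for _ in range(n)) / n
        equal.append((ybar + wbar) / 2)
        # Variances of the sample means are treated as known here.
        optimal.append(gmm_mean(ybar, wbar, sd_y**2 / n, sd_w**2 / n))

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    return variance(equal), variance(optimal)

var_equal, var_optimal = monte_carlo()
print(f"variance with equal weights:   {var_equal:.4f}")
print(f"variance with optimal weights: {var_optimal:.4f}")
```

Because the moments enter the objective squared, dividing each by its standard deviation amounts to weighting the squared moments by the reciprocals of their variances, which is what `gmm_mean` uses.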

We can form a test statistic against our theory that the means in these two samples are identical. Suppose we use just the observations on $y$, calculate $\bar y$, and then check to see how well that estimate explains the observations on the other variable $w$,

$$ \sum_i ( w_i - \bar y)^2. $$

If the two samples really have the same mean, then as the sample grows $\bar y$ converges to $\mu$, and this expression is the sum of squared zero-mean normals, and we could base a test statistic on that result. Intuitively, if our theory is false, then the $(w_i - \bar y)$ do not tend to zero-mean random variables, and when we square them the results tend to be larger than if they did have zero mean.

That’s not the best way to test our theory, however. If our theory is true then the objective tends to the sum of two squared zero-mean variables, but if our theory is incorrect the objective function tends to the sum of squared non-zero mean variables. A test can be based on this idea: if the realized value of the objective function when we minimize it would have to be very far out in the tail of the distribution obtained when the theory is correct (here, the $\chi^2_1$ distribution), we have evidence against our theory that $y$ and $w$ have the same mean. This is a simple example of Hansen’s test, or the J test. Note that if we only observe one of $y$ or $w$ but not both, we have zero degrees of freedom left over to test the assumptions of the model and we cannot conduct this test.
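A minimal sketch of this test for the two-means example follows. It assumes the variances of the two sample means are known and the samples are independent. The function name `j_test_two_means` and the input numbers are made up for illustration. With two moment conditions and one parameter there is one overidentifying restriction, so the minimized objective is compared with a chi-squared distribution with one degree of freedom.

```python
import math

def j_test_two_means(ybar, wbar, var_ybar, var_wbar):
    """Return (mu_hat, J, p_value) for the two-sample mean example.

    J is the minimized value of the variance-weighted objective.
    """
    weight_y, weight_w = 1.0 / var_ybar, 1.0 / var_wbar
    # The minimizer is the precision-weighted average of the two means.
    mu_hat = (weight_y * ybar + weight_w * wbar) / (weight_y + weight_w)
    j_stat = weight_y * (ybar - mu_hat) ** 2 + weight_w * (wbar - mu_hat) ** 2
    # For one degree of freedom, P(chi2 > J) = erfc(sqrt(J / 2)).
    p_value = math.erfc(math.sqrt(j_stat / 2.0))
    return mu_hat, j_stat, p_value

# Hypothetical inputs: sample means 1.0 and 3.0 with variances 1.0 and 3.0.
mu_hat, j_stat, p_value = j_test_two_means(1.0, 3.0, 1.0, 3.0)
print(f"mu_hat = {mu_hat:.3f}, J = {j_stat:.3f}, p = {p_value:.3f}")
```

In this special case the statistic simplifies to the squared difference of the two sample means divided by the sum of their variances, which is the familiar two-sample test with known variances.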

The reasoning above can be applied to a very wide variety of problems, yielding various GMM estimators depending on which variables have zero mean under some theory. Consider a univariate linear regression model of the form

$$ y = \beta x + u, $$

where we interpret this equation as causal: $\beta$ is the causal effect of a one-unit change in $x$ on $y$ holding other causes of $y$, $u$, constant (and we assume that all variables have zero mean for simplicity). Suppose the data came from a randomized experiment on $x$ and that we have a random sample. Then $x$ and $u$ are uncorrelated, in other words, the random variables $x_i u_i$ have mean zero,

$$ E [ x_i u_i ] = E[ x_i(y_i - \beta x_i) ] = 0, $$

the sample analog of which is

$$ \label{eq:sum} \frac{1}{n} \sum_i x_i(y_i - \beta x_i)=0.$$

Since we have one parameter and one equation, we can always make this condition true, and the solution is easily seen to be the OLS estimator, even if the errors are heteroskedastic or correlated. We cannot test our theory that $x$ and $u$ are uncorrelated, since we have one parameter and one equation to solve, so we can always make the sample analog of the moment condition true.
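Solving the sample moment condition for the parameter gives the ratio of the sum of $x_i y_i$ to the sum of $x_i^2$, the OLS slope for a regression through the origin. The sketch below uses hypothetical function names and a tiny made-up data set. It computes that solution and confirms that the sample moment is zero at the estimate.

```python
def ols_through_origin(x, y):
    """Solve (1/n) * sum_i x_i * (y_i - b * x_i) = 0 for b."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def sample_moment(x, y, b):
    """Sample analog of E[x * (y - b * x)] evaluated at b."""
    return sum(xi * (yi - b * xi) for xi, yi in zip(x, y)) / len(x)

# Made-up illustrative data, not from any real study.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

b_hat = ols_through_origin(x, y)
print(f"b_hat = {b_hat:.4f}")
print(f"sample moment at b_hat = {sample_moment(x, y, b_hat):.2e}")
```

The moment is zero at the estimate by construction, whatever the data, which is why it carries no information about whether the underlying theory is right.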

Now suppose that $x$ did not come from a perfect randomized experiment; instead we have observational data and no reason to suppose that causes of $y$ we do observe ($x$) are uncorrelated with causes we don’t observe ($u$). The condition $E[x_i u_i] = 0$ no longer holds and an estimator based on that condition will have undesirable properties. But suppose we observe a variable $z_1$ which has the property that

$$E( z_{1i} u_i ) = E [ z_{1i} (y_i - \beta x_i) ] = 0,$$

that is, that $z_1$ should not covary with $y$ if $x$ is held fixed. The sample analog of this condition gives us the method of moments estimator of $\beta$, which turns out to be the simple linear instrumental variables estimator, the ratio of the covariance between $z_1$ and $y$ to the covariance between $z_1$ and $x$. We cannot test our theory that $z_1$ only affects $y$ because $z_1$ affects $x$ because, with one equation and one parameter, we can always find a value of $\hat\beta$ to make the sample analog of this condition true.
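The contrast between OLS and the simple IV estimator can be seen in a short simulation. Everything in this sketch is an illustrative assumption: the data-generating process, the coefficient values, and the function names are hypothetical. An unobserved confounder enters both the regressor and the error, so the OLS moment condition fails, while the instrument is independent of the error by construction.

```python
import random

def simulate(n=20000, beta=1.0, seed=0):
    """Generate zero-mean data with a confounded regressor and one instrument."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)               # instrument
        ci = rng.gauss(0, 1)               # unobserved confounder
        xi = zi + ci + rng.gauss(0, 1)     # regressor depends on both
        ui = 2.0 * ci + rng.gauss(0, 1)    # error contains the confounder
        z.append(zi)
        x.append(xi)
        y.append(beta * xi + ui)
    return z, x, y

def cross_moment(a, b):
    """Sample analog of E[a * b]; variables are zero-mean by construction."""
    return sum(ai * bi for ai, bi in zip(a, b)) / len(a)

z, x, y = simulate()
beta_ols = cross_moment(x, y) / cross_moment(x, x)
beta_iv = cross_moment(z, y) / cross_moment(z, x)
print(f"true beta = 1.00, OLS = {beta_ols:.2f}, IV = {beta_iv:.2f}")
```

Under these assumptions OLS converges to about 1.67 rather than the true value of 1, while the IV ratio stays close to 1.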

Now suppose we have available a second instrument, $z_2$, which theory tells us also affects $y$ only because $z_2$ affects $x$.

The diagram illustrates the model. The variable $u$, colored red to denote that we can’t observe this variable, confounds the relationship between $x$ and $y$, implying that covariance between $x$ and $y$ does not reveal $\beta$, the causal effect of $x$ on $y$.

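But we can estimate the covariances between $z_1$ and $x$ and between $z_1$ and $y$, and inspection of the diagram tells us these should be equal to $\alpha_1$ and $\alpha_1 \beta$ (up to scaling by the variance of $z_1$), writing $\alpha_1$ for the effect of $z_1$ on $x$. A one-unit increase in $z_1$ causes an $\alpha_1$-unit increase in $x$, and since a one-unit increase in $x$ causes a $\beta$-unit increase in $y$, a one-unit increase in $z_1$ causes a change of $\alpha_1 \beta$ in $y$. And likewise for $z_2$: the diagram tells us that covariance between $z_2$ and $y$ can only occur if $\beta \neq 0$, and we can infer a value for $\beta$ from that covariance divided by the covariance between $z_2$ and $x$.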

So we have two distinct causal paths, either of which allows us to estimate the causal effect of $x$ on $y$, just like in the introductory example we had two different ways of estimating the mean $\mu$. GMM tells us how to optimally combine these two insights to produce the most precise single estimate of $\beta$ under the theory that $z_1$ and $z_2$ only affect $y$ because they affect $x$, just like above GMM told us how to optimally combine two samples which have the same mean under some theory.

Theory tells us that both of these conditions are true

\begin{align}
E(z_{1i}u_i) &= E[ z_{1i}(y_i - \beta x_i) ] = 0 \\
E(z_{2i}u_i) &= E[ z_{2i}(y_i - \beta x_i) ] = 0
\end{align}

but we cannot in general choose the single parameter $\hat\beta$ to make both of the sample analogs of these conditions true. Suppose for simplicity that $z_1$ and $z_2$ are uncorrelated and that the $u_i$ are heteroskedastic but uncorrelated. Then the theoretical moments above have sample counterparts

\begin{align}
m_1 &= \frac{1}{n} \sum_i [z_{1i}(y_i - \beta x_i)] \\
m_2 &= \frac{1}{n} \sum_i [z_{2i}(y_i - \beta x_i)]
\end{align}

and variances $V(m_j)$, for $j = 1, 2$. Then selecting $\hat\beta$ to

$$
\textrm{min}_{\beta} \left( \frac{m_1}{\sqrt{V(m_1)}}\right)^2 + \left( \frac{m_2}{\sqrt{V(m_2)}}\right)^2
$$

yields the GMM estimator of the causal effect of $x$ on $y$. As in the introductory example, the moment conditions are weighted by the reciprocal of their standard deviations, since we want to put more weight on the more precise condition. More generally, we should also take into account that $m_1$ and $m_2$ (and sometimes the $u_i$) will generally be correlated.

Just like in the simple case above of estimating a mean from two samples, if our theory is true then the minimized value of the objective function is asymptotically distributed $\chi^2_1$, so estimation by GMM produces a test of the theory as a by-product of estimation (in this context, this test statistic is also called the Sargan test). Intuitively, if $z_1$ and $z_2$ are both uncorrelated with $u$, then we could form the simple IV estimator using just $z_1$, and the residuals from that exercise should be uncorrelated (up to sampling noise) with $z_2$. If they are not, then we’re not sure what’s wrong (either $z_1$ or $z_2$ could be correlated with $u$), but we conclude that something is wrong with our theory. This is not the most efficient test, though. As in the introductory example, the value of the minimized objective function forms a test statistic against the null that our theory is correct.
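The two-instrument estimator and its by-product test can be sketched as follows. Several assumptions here are mine rather than the post's. The post treats the variances of the moments as known, whereas this sketch estimates them from first-step residuals (a two-step approach). The simulated data-generating process, the `direct_effect` parameter used to break the theory, and all function names are hypothetical. The weighting ignores correlation between the moments, which is harmless here because the simulated instruments are independent.

```python
import math
import random

def moment(a, b):
    """Sample analog of E[a * b] for zero-mean variables."""
    return sum(ai * bi for ai, bi in zip(a, b)) / len(a)

def gmm_two_instruments(z1, z2, x, y):
    """Return (beta_hat, J, p_value) using diagonal weighting."""
    n = len(y)
    a = [moment(z1, y), moment(z2, y)]   # m_j(beta) = a_j - beta * b_j
    b = [moment(z1, x), moment(z2, x)]

    def solve(v):
        # Closed-form minimizer of sum_j (a_j - beta * b_j)^2 / v_j.
        num = sum(aj * bj / vj for aj, bj, vj in zip(a, b, v))
        den = sum(bj * bj / vj for bj, vj in zip(b, v))
        return num / den

    # Step 1: equal weights give a consistent first estimate.
    beta_1 = solve([1.0, 1.0])
    resid = [yi - beta_1 * xi for xi, yi in zip(x, y)]
    # Step 2: estimate V(m_j), allowing for heteroskedastic errors.
    v = [sum((zi * ui) ** 2 for zi, ui in zip(zj, resid)) / n**2
         for zj in (z1, z2)]
    beta_hat = solve(v)
    j_stat = sum((aj - beta_hat * bj) ** 2 / vj for aj, bj, vj in zip(a, b, v))
    # One overidentifying restriction: compare J with chi-squared(1).
    p_value = math.erfc(math.sqrt(j_stat / 2.0))
    return beta_hat, j_stat, p_value

def simulate(n=20000, beta=1.0, direct_effect=0.0, seed=0):
    """direct_effect != 0 lets z2 affect y other than through x."""
    rng = random.Random(seed)
    z1, z2, x, y = [], [], [], []
    for _ in range(n):
        z1i, z2i, ci = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        xi = z1i + z2i + ci + rng.gauss(0, 1)
        ui = 2.0 * ci + rng.gauss(0, 1)
        z1.append(z1i)
        z2.append(z2i)
        x.append(xi)
        y.append(beta * xi + direct_effect * z2i + ui)
    return z1, z2, x, y

beta_valid, j_valid, p_valid = gmm_two_instruments(*simulate())
beta_bad, j_bad, p_bad = gmm_two_instruments(*simulate(direct_effect=1.0))
print(f"valid instruments:  beta = {beta_valid:.3f}, J = {j_valid:.2f}")
print(f"z2 invalid:         beta = {beta_bad:.3f}, J = {j_bad:.2f}")
```

When the second instrument is allowed to affect the outcome directly, the two moment conditions cannot both be brought close to zero, and the minimized objective becomes very large relative to a chi-squared variable with one degree of freedom.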

In the preceding example, GMM allows us to estimate the causal effect of $x$ on $y$ using a theory that says that two variables $z_1$ and $z_2$ only affect $y$ because they affect $x$. GMM tells us how to make use of our theory to make our estimates as precise as possible, and as a by-product of estimation provides a test statistic against the null hypothesis that our theory is correct (warning: our theory could also be incorrect not because the $z$’s affect $y$ for some other reason than through $x$, but rather because the causal effect $\beta$ varies across units in the population, see e.g. Heckman, Urzua, and Vytlacil 2006).

GMM can be applied to much more complicated problems to estimate causal effects in a wide variety of nonlinear regression models (e.g., Windmeijer 2006), and to estimate the deep parameters in estimable choice models which can be used to produce out-of-sample predictions which sidestep the Lucas critique, e.g., Hotz and Miller (1993) or Ferrall (2012).

The Chen and Pearl paper has been around for a while in working paper form and recently came out in the Real World Economics Review, also available here from the authors with much clearer typesetting.

The additional textbooks I discuss below are: Amemiya (1985), Kmenta (1986), Davidson and MacKinnon (1993), Gujarati (1999), Hayashi (2000), Wooldridge (2002), Davidson and MacKinnon (2004), Deilman (2005), and Cameron and Trivedi (2005).

**The Issue: Causality in regression models.**

A scientist is attempting to understand the relationship between, say, health and smoking. Let y denote some measure of health and let x denote a measure of smoking intensity, say, number of cigarettes smoked per day. A simple model for health supposes the two outcomes are related by

$$ y = \beta x + u. $$

In short, Chen and Pearl consider these issues: do econometrics textbooks clearly explain what the parameter $\beta$ means in this model, are they consistent in that interpretation, and generally how well are issues of causality addressed?

That simple-looking equation is much trickier than it appears, as first formally discussed in the econometrics literature by Trygve Haavelmo during the Second World War. For recent discussions, see for example Heckman (2005, 2008), Heckman and Pinto (2013), or blog discussions such as on Pearl’s blog or Andrew Gelman’s blog (note comments from Pearl and from Guido Imbens). First suppose we *define* the random variable u as the difference between y and its conditional expectation:

$$ u = y - E[y \mid x], \qquad (1) $$

then it is easy to show that the error term $u$ must be mean-independent of $x$. In econometric jargon, we obtain exogeneity by definition. In this interpretation, the parameter $\beta$ is implicitly defined through

$$ E[y \mid x] = \beta x, $$

that is, $\beta$ is by definition the gradient of $E[y \mid x]$. In the smoking and health example, $\beta$ is by definition how much health changes on average as we consider a person who smokes one more cigarette per day (specifically *without* the caveat, “other things being equal”).

This interpretation of this model is merely “agnostic” or “predictive.” An insurance agency, for example, might be interested in estimating $\beta$ under this interpretation: the answer might help them understand how their payouts will vary if they accept customers who smoke more. But econometricians and other scientists are only rarely interested in such a predictive relationship. Instead, we want to know the causal effect of smoking on health, and the predictive regression generally does not recover that causal effect. Suppose for example we lived in a universe in which a given person’s health is unaffected by their smoking, but also that behaviors and characteristics which lead to low health also tend to lead to more smoking. Then we would tend to estimate negative values for $\beta$ even though by assumption (in whatever universe we’re discussing) smoking does not affect any person’s health.

For this reason econometricians rarely interpret the error term as simply the deviation between the outcome and its conditional expectation. Rather, in a structural interpretation of the equation, $\beta$ takes a causal interpretation and u is interpreted as summarizing all causes of y other than x. It is well-known that any of: (1) “reverse” causation, (2) omitted variables correlated with the regressors, or (3) measurement error in the regressors, lead to correlation between u and x, which in turn means that the parameter $\beta$ is not defined as the derivative of $E[y \mid x]$ with respect to $x$. We would like to know how a randomly selected person’s health changes if we could intervene and exogenously flip smoking status; the problem is that the correlation between smoking and health calculated from observational data does not generally give us any answer to that question.
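The gap between the predictive and the structural parameter can be written out in one line. As a sketch, assume the structural model is $y = \beta x + u$ with zero-mean variables (the notation is my assumption, chosen to match the discussion above). The slope that a predictive regression recovers in large samples is then

$$ \textrm{plim}\, \hat\beta_{OLS} = \frac{\textrm{Cov}(x, y)}{\textrm{Var}(x)} = \frac{\textrm{Cov}(x, \beta x + u)}{\textrm{Var}(x)} = \beta + \frac{\textrm{Cov}(x, u)}{\textrm{Var}(x)}. $$

The two coincide only when $\textrm{Cov}(x, u) = 0$. In the hypothetical universe where smoking has no effect on health, $\beta = 0$ but $\textrm{Cov}(x, u) < 0$, so the regression slope is negative even though the causal effect is zero.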

**Textbook discussion of the issue.**

The seemingly straightforward issue is not straightforward at all, and exactly what we mean by “causal,” even in the context of simple regression models such as above, is a subject of ongoing multidisciplinary research. Nonetheless, since inferring causal relations from observational data is the defining characteristic of econometric analysis, it seems very reasonable to require that econometrics textbooks should contain lucid discussions of causal relationships and, in so doing, define parameters clearly and unambiguously. Disturbingly, Chen and Pearl find that six popular econometrics textbooks fail, to a greater or lesser extent, to do so.

Chen and Pearl evaluate texts on 10 criteria, which amount to: does the textbook provide at least as much information about causal interpretation as this post does very briefly above, is the text consistent on those interpretations, and does the text provide the equivalent of Pearl’s “do(x)” operator to define causal effects? Other than the “do(x)” criterion, which I don’t think is fair because Pearl’s concept has not caught on in the econometrics literature and (even if it ought to catch on) should therefore not (yet?) appear in current econometrics textbooks, the criteria seem very fair to me. Pity the poor student who attempts to understand how to interpret a structural econometric model after reading this startling passage in Kennedy, for example:

Using the dictionary meaning of causality, it is impossible to test for causality. Granger developed a special definition of causality which econometricians use in place of the dictionary definition: strictly speaking, econometricians should say “Granger-cause” in place of “cause,” but usually they do not. A variable x is said to Granger-cause y if prediction of the current value of y is enhanced by using past values of x.

This is the only passage in the book in which the word “causality” is used, and the claims in that passage are not correct, in no small part because so-called Granger causality is not a causal concept. Although in my view that passage is by far the worst discussion in the six texts discussed, Chen and Pearl show persuasively that each of the discussed textbooks is at times at least vague in its discussion of causal relations. On the other hand, Chen and Pearl are perhaps somewhat uncharitable in some of their discussion. For example, they make much of this passage from Greene,

[ In the model $\textrm{earnings}_i = x_i'\beta + \delta C_i + \varepsilon_i$ ] does $\delta$ measure the value of a college education (assuming the rest of the regression model is correctly specified)? The answer is no if the typical individual who chooses to go to college would have relatively high earnings whether or not he or she went to college…

but in context this appears to be a typo: the passage is rescued if “the OLS estimate of” is inserted in front of $\delta$, and the passage makes no sense if that or an equivalent edit is not made, and Greene in many, many other places clearly differentiates between mere correlations and causal parameters. Chen and Pearl, however, are not satisfied with an answer Greene gave them in a personal communication as to the meaning of a structural parameter:

In a personal correspondence (2012), Greene wrote, “The precise definition of effect of what on what is subject to interpretation and some ambiguity depending on the setting. I find that model coefficients are usually not the answer I seek, but instead are part of the correct answer. I’m not sure how to answer your query about exactly, precisely carved in stone, what should be.”

I tentatively side with Greene here, although Chen and Pearl do not specify exactly what question Greene was asked. In structural models, the structural parameters are not necessarily causal effects in and of themselves, they are rather assumed to be invariant with respect to some well-specified class of disturbance. For example, the deep parameters characterizing Harold Zurcher’s replacement of bus engines are not themselves causal effects, but given estimates of those parameters, the model can answer meaningful causal questions. Exactly what a structural coefficient means is model-dependent.

**Some results from other textbooks.**

Without going into nearly as much detail as Chen and Pearl, I took a look through some other econometrics textbooks to check to see how they discuss, or do not discuss, causality. Specifically, I looked to see whether the regression parameters are anywhere incorrectly defined as gradients of the conditional expectation of the dependent variable, and I tried to find explicit discussions of causal interpretation of estimated models. The texts surveyed below vary widely in level and vintage, including everything from introductory undergraduate to advanced graduate texts, from 1985 through 2005.

**Amemiya (1985), Advanced Econometrics.**

This textbook is now old, well, ancient, by academic standards, and is relatively technically demanding. It opens, on page 1, by dubiously asserting that the goal of econometrics is to estimate parameters which define the joint distribution of a set of random variables. As far as I can tell, the word “causal” does not appear anywhere, nor are there examples of predictive vs causal interpretation of parameters. Any notions of causality are implicit and framed in purely statistical terms. However, it does not incorrectly define $\beta$ as the gradient of $E[y \mid x]$.

**Kmenta (1986), Elements of Econometrics**

Does not incorrectly define the regression parameters as the gradient of the conditional expectation of the dependent variable.

There is a fairly long, yet confusing discussion of causality at the start of the chapter on simultaneous systems.

Although the concepts of causality and exogeneity are not identical, it is nevertheless possible to conclude that if a variable Y is–in some sense–caused by a variable X, Y cannot be considered exogenous in a system in which X also occurs. A widely discussed definition of causality has been proposed by Granger.

This is the textbook that I learned undergraduate econometrics from. I don’t remember how I thought of causality in econometric models at the time (possibly because I really didn’t like econometrics as an undergraduate). But it’s hard to see how a student could make much headway in understanding causality from that passage. Causality is first introduced “in some sense,” deliberately avoiding a definition. Then follows an incorrect claim that if one variable causes another they cannot both be treated as exogenous in a system. That is simply not true: nothing in regression models precludes causal relationships between exogenous variables (as a trivial example, the square of an exogenous covariate is routinely used to capture nonlinear relationships between variables, and the covariate and its square stand in a deterministic and monocausal relationship). And then the notion of Granger-causality is introduced as the only formally defined causal concept in econometrics.

**Davidson and MacKinnon (1993), Estimation and Inference in Econometrics**

The parameters in the linear regression model are defined in Chapter 1 very abstractly as the set of real numbers defining the subspace spanned by the column vectors of the regressors. The parameter vector is never incorrectly defined as the gradient of the conditional expectation of the dependent variable. Simultaneity and omitted variable bias are discussed in purely statistical, as opposed to causal, terms in Chapter 7.

Discusses causality explicitly in section 18.2, “Exogeneity and causality.” The clearest passage is,

But we have not yet discussed the conditions under which one can validly treat a variable as explanatory. This includes the use of such variables as regressors in least squares estimation and as instruments in instrumental variables or GMM estimation. For conditional inference to be valid, the explanatory variables must be predetermined or exogenous in one or other of a variety of senses to be defined below.

which is not very clear at all: the authors intend, I think, the first sentence to mean, “But we have not yet discussed the conditions under which one can treat the coefficient on a variable as reflecting a causal effect.” The matter is then further muddied as later in this subsection the concept of Granger causality is introduced, without clearly differentiating between so-called Granger-causality and causality.

There is an implicit discussion of causality when estimation of supply and demand functions is introduced as an issue to motivate instrumental variable estimation: if we remember from theory that the slopes of these functions are indeed causal effects, then the discussion amounts to asserting that OLS does not recover causal effects in this context.
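To make the supply-and-demand point concrete, here is a minimal simulation sketch of my own (it is not taken from any of the textbooks). The linear demand and supply curves, their slopes (-1.0 and +0.5), the cost shifter `z`, and all function names are illustrative assumptions. Under these assumptions, the population OLS slope of quantity on price works out to -0.5, which is neither causal slope, while a simple instrumental variable estimate using the cost shifter recovers the demand slope of -1.0.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate_market(n=50_000, seed=0):
    """Draw equilibrium (price, quantity, cost shifter) observations.

    Hypothetical structural model (all numbers are assumptions):
        demand: q = 10 - 1.0 * p + u_d
        supply: q =  2 + 0.5 * p + 1.0 * z + u_s
    where z is an observed cost shifter that enters supply only.
    """
    rng = random.Random(seed)
    price, quantity, cost_shifter = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        u_d = rng.gauss(0, 1)
        u_s = rng.gauss(0, 1)
        # Equate demand and supply and solve for the equilibrium price:
        # 10 - p + u_d = 2 + 0.5 p + z + u_s  =>  1.5 p = 8 + u_d - z - u_s
        p = (8 + u_d - z - u_s) / 1.5
        q = 10 - p + u_d  # read quantity off the demand curve
        price.append(p)
        quantity.append(q)
        cost_shifter.append(z)
    return price, quantity, cost_shifter


def ols_slope(y, x):
    """Slope from a bivariate least squares regression of y on x."""
    return covariance(y, x) / covariance(x, x)


def iv_slope(y, x, z):
    """Simple instrumental variable slope of y on x, using z as the instrument."""
    return covariance(y, z) / covariance(x, z)


if __name__ == "__main__":
    price, quantity, cost_shifter = simulate_market()
    print("OLS slope of q on p:", round(ols_slope(quantity, price), 3))
    print("IV slope using z:   ", round(iv_slope(quantity, price, cost_shifter), 3))
```

The instrument works here only because, by construction, the cost shifter moves supply and is unrelated to the demand shock; that is an assumption of the simulation, not something the data could verify.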

**Gujarati (1999), Essentials of Econometrics, second edition.**

Does not incorrectly define the regression parameters as the gradient of the conditional expectation of the dependent variable.

Implicitly defines regression parameters as causal effects (without using the word “causal”) on page 7. On page 8, correctly defines the error term as unobserved causes of the dependent variable, and notes,

Before proceeding further, a warning regarding causation is in order…. Does regression imply causation? Not necessarily. As Kendall and Stuart note, “A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics, ultimately from some theory or other.”

A variant of this warning is repeated on page 124, although, somewhat oddly, the text then proceeds to give uses for regression analysis which do not include the estimation of causal effects.

Gives examples of omitted variables bias and simultaneity bias which implicitly define the structural parameters as causal effects, and refers again to these parameters when introducing instrumental variables, a topic not pursued in this introductory-level text.

**Hayashi (2000), Econometrics.**

Defines regression parameters as causal effects (without using the word “causal”) on page 4, but also claims on the same page that an econometric model is a “set of joint distributions satisfying a set of assumptions,” which leaves it unclear whether the author intends regression parameters to reflect causal effects or parameters defining statistical distributions.

Introduces the issue of endogeneity noting that, “The most important assumption made for the OLS [sic] is the orthogonality between the error term and the regressors. Without it, the OLS estimator is not even consistent.” Much like Davidson and MacKinnon (1993), differentiates between causation and mere correlation using estimation of the slopes of supply and demand curves as an example, albeit without using any variant of the word, “cause.”

**Wooldridge (2002), Econometric Analysis of Cross-Section and Panel Data.**

Chen and Pearl discuss “baby” Wooldridge, the undergrad text. Does Papa Wooldridge fare better?

The opening passage of the text, Section 1.1 of the Introduction, begins,

The goal of most empirical studies in economics and other social sciences is to determine whether a change in one variable, say w, causes a change in another variable, say y…. Because economic variables are properly interpreted as random variables, we should use ideas from probability to formalize the sense in which change in w causes a change in y. The notion of ceteris paribus… is the crux of establishing a causal relationship. Simply finding that two variables are correlated is rarely enough….

Goes on to define regression parameters as partial derivatives of conditional expectations, although not of but of (in our notation) of .

Includes the first, to the best of my knowledge, lengthy discussion of the counterfactuals/treatment effects literature (Chapter 18), and links the preceding discussion of regression models to the treatment effects literature.

**Davidson and MacKinnon (2004), Econometric Theory and Methods.**

We can make a fixed-effects type observation here, as we have another text from James and Russell, about a decade later than the 1993 text discussed above. How do the 1993 and 2004 books differ? The introductory passage on page 1 introduces regression parameters and implies their definition depends on how the error term is defined, although at this point exactly what  means is deliberately left vague; its interpretation is “quite arbitrary,” the authors correctly note. After introducing the equivalent of the model , the text states (in our notation),

At this stage we should note that, as long as we say nothing about the unobserved quantity , [the equation] does not tell us anything. In fact, we can allow  to be quite arbitrary, since for any given [value] the model… can always be made to be true by defining  suitably.

A similar passage on page 313 notes that, when a regressor is measured with error, OLS estimation gives the desired result if the error term is defined as simply the difference between the observed outcome and its expectation with respect to the observed regressor, but “in most cases” in econometrics that definition does not allow us to estimate the parameters we wish to estimate.
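The measurement error point can be illustrated with a short simulation sketch. This is my own illustration rather than anything in the text: the structural coefficient of 2.0, the unit variances, the normal errors, and the function names are all assumptions. With these choices, regressing the outcome on the mismeasured regressor converges to a slope of 1.0, which is the slope of the conditional expectation given the observed regressor, while the structural coefficient we presumably want is 2.0.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def ols_slope(y, x):
    """Slope from a bivariate least squares regression of y on x."""
    return covariance(y, x) / covariance(x, x)


def simulate_measurement_error(n=50_000, beta=2.0, noise_sd=1.0, seed=0):
    """Simulate y = beta * x_true + e, where only x_obs = x_true + v is observed.

    beta, noise_sd and the standard normal draws are illustrative assumptions.
    """
    rng = random.Random(seed)
    x_true, x_obs, y = [], [], []
    for _ in range(n):
        xt = rng.gauss(0, 1)                    # regressor we would like to observe
        x_true.append(xt)
        x_obs.append(xt + rng.gauss(0, noise_sd))  # regressor measured with error
        y.append(beta * xt + rng.gauss(0, 1))   # outcome depends on the true regressor
    return x_true, x_obs, y


if __name__ == "__main__":
    x_true, x_obs, y = simulate_measurement_error()
    # Slope using the (normally unavailable) true regressor: close to beta.
    print("slope on true regressor:    ", round(ols_slope(y, x_true), 3))
    # Slope using the observed regressor: attenuated towards
    # beta * var(x_true) / (var(x_true) + var(v)).
    print("slope on observed regressor:", round(ols_slope(y, x_obs), 3))
```

Both regressions are "correct" in the purely statistical sense of describing a conditional expectation; only the first answers the question about the structural parameter.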

More or less the same discussion of supply and demand as in the 1993 text can again be interpreted as an implicit discussion of causality.

**Dielman (2005), Applied Regression Analysis, 4th ed.**

Incorrectly defines the regression coefficient as the slope of the conditional expectation of the dependent variable on page 75, although in the context of a model explicitly described as a “descriptive regression.” Does not immediately clarify, however, when a regression model should be interpreted as merely descriptive.

Discusses “causal” versus “extrapolative” regression models in the narrow context of time series modeling on page 112, but does not make it clear what the intended difference between these concepts is, nor is it clear why this discussion is limited to time series models. Claims that the issue with causal models is, “causal models require the identification of variables that are related to the dependent variable in a causal manner. Then data must be gathered on these explanatory variables to use the model.” This makes it seem that simple correlations can be used to infer causal relations so long as we can observe both the variables. However, also notes on page 118 that “A common mistake made when using regression analysis is to assume that a strong fit (a high R²) of a regression of y on x automatically means `x causes y.'” There is then a brief discussion of endogeneity through simultaneity and through omitted variables, which is quite clear, particularly for an introductory text.

**Cameron and Trivedi (2005), Microeconometrics: Methods and Applications.**

A few sentences into the introduction on page 1, notes that,

A key distinction in econometrics is between essentially descriptive models and data summaries at various levels of statistical sophistication and models that go beyond mere associations and attempt to estimate causal parameters. The classic definitions of causality in econometrics derive from the Cowles Commission simultaneous equations model that draw sharp distinctions between exogenous and endogenous variables, and between structural and reduced form parameters. Although reduced form models are useful for some purposes, knowledge of structural or causal parameters is essential for policy analysis.

This focus on causal parameters is maintained throughout. Chapter 2 is titled “Causal and noncausal models,” and provides a quite high-level formal discussion of causality in the context of classical simultaneous equations models, and introduces topics in causal modeling which will be covered through the remainder of the book, including the Rubin Causal Model and a variety of methods researchers use to identify causal parameters. Given this emphasis, it is unsurprising that regression parameters are not incorrectly defined as the gradient of the conditional expectation of the dependent variable. Discusses counterfactual modeling in Chapter 25, “Treatment Evaluation,” at length, linking the methods in this literature to previous discussions of single-equation regression, matching, instrumental variables, and regression discontinuity designs.

**Remarks.**

The additional textbooks briefly surveyed suffer, to a greater or lesser extent, from the same weak discussions of causality as the texts surveyed by Chen and Pearl, with the exceptions of Wooldridge (2002) and particularly Cameron and Trivedi (2005), which I think would only fail Chen and Pearl’s criterion that the equivalent of the “do(x)” concept should be included (and arguably, an equivalent is included).

There is something of a puzzle here in that the oral tradition in applied econometrics heavily emphasizes causation, but it would seem that relatively few textbooks explicitly discuss the matter. In journal articles, seminars, and economics classrooms, there is consensus that the goal of econometric analysis is almost always to estimate a model which can answer causal questions. Overcoming the various serious challenges that arise in making such attempts is the core of most papers in applied econometrics, and how successful a paper is in achieving that goal is the target of sharp-eyed readers and referees. What explains the discrepancy between how economists think about causation and what appears in most econometrics textbooks?

First, econometrics textbooks tend to be authored by theoretical econometricians, who tend to be situated much closer to the interface between statistics and econometrics than applied researchers. Since statisticians do not tend to think in terms of causality, perhaps some of that statistical tradition makes its way over to econometrics textbooks.

Second, statistical concepts which *in the context of applied econometrics* refer to causal concepts are nonetheless presented as statistical concepts in econometrics textbooks, but it is understood that the underlying objects of inference are still causal. A “biased estimate of β” is a purely statistical concept, but if a referee or seminar attendee were to use that phrase they almost certainly mean, “the estimate you present is not a good estimate of the causal effect in which we are interested.” Similarly, a remark like, “your data doesn’t credibly identify β” appears to be a claim about a purely statistical matter, but the person making that claim almost certainly means, “the causal parameter we would like to estimate is hopelessly confounded, given the data we have and the model you’ve developed.” Further to this point, I note that way back in the old-timey days of the 1990s, I took a sequence of econometrics courses from MacKinnon and Davidson based on their 1993 textbook. Even though this text does not include a good discussion of causality using that term, and it is notably lacking in applied examples, it was always very clear to me (and, I think, my classmates) that we are ultimately interested in estimating models which allow us to make causal inferences, as opposed to merely characterizing the joint distribution of some set of variables.

Third, the language of counterfactuals in which the literature on causation is currently being developed is a relatively recent development. As noted above, Wooldridge (2002) is, to the best of my knowledge, the first econometrics textbook to include an extended discussion written in this language. What amounts to the same concepts were previously, as in the examples in the previous point, discussed using language borrowed from statistics. The slightly more recent text by Cameron and Trivedi (2005) is substantially more oriented towards causal modeling than any of the other texts, and also includes lengthy discussion of the recent literature on modeling heterogeneous causal effects. My impression from reading Chen and Pearl and flipping through the texts above is that textbooks are getting better over time in terms of discussing causation, presumably in part because these ideas are permeating the applied econometrics literature. Notably, the oldest textbooks discussed above (Amemiya 1985 and Kmenta 1986) present the vaguest discussions of causal concepts.

The oral tradition in economics is not well-reflected in current, or particularly in outdated, textbooks. Chen and Pearl do those of us who teach or study econometrics a service in highlighting this problem, and hopefully discussion in future textbooks will continue to improve.
