GMM and its application outside finance

[Figure: causal diagram of the instrumental variables model discussed below: instruments \(z_1\) and \(z_2\) affect \(x\) with coefficients \(\alpha_1\) and \(\alpha_2\), \(x\) affects \(y\) with coefficient \(\beta\), and the unobservable \(u\), shown in red, affects both \(x\) and \(y\)]

The 2013 Nobel prize in economics was won by Fama, Shiller, and some other dude, according to most media accounts. Fama and Shiller were pretty easy to explain: one of them is at Chicago and is associated with a theory called “efficient markets,” so he’s the free market guy. Shiller criticized the Chicago guy, so we know where to put him on the political spectrum. But this third guy, Hansen, well, he’s at Chicago, but he does some sort of theoretical econometrics, so if we’re the Guardian we’ll just assume he’s “ultra-conservative” and then ignore him, or if we’re anyone else we’ll skip straight to ignoring him (even the Economist gives up, complaining they can’t explain his work without “writing all sorts of equations in our newspaper”). This post attempts to provide a relatively gentle introduction, albeit one with all sorts of equations, to part of the third guy’s research, focusing on applications to causal modeling in microeconomics rather than the examples from finance or macroeconomics.

There are some good discussions of Hansen’s most influential contribution, the Generalized Method of Moments (GMM), in the economics blogosphere; examples include Guan Yang, John Cochrane, and Jeff Leek. This post presents another, which differs mostly in that the discussion does not focus on applications in asset pricing. The basic ideas in Hansen (1982) are elaborations and generalizations of ideas presented in Sargan (1958), which developed overidentified instrumental variables estimators in a modern context, a method mostly used to infer causal effects from observational data.

GMM in a very simple problem.

Suppose we have a sample of size \(n\) on some variable \(y\), and we would like to estimate the mean of \(y\), denoted \(\mu = E(y)\). In this simple case, the method of moments tells us to estimate \(\mu\) by replacing the population condition

$$ E[ y - \mu ] = 0 $$

with its sample analog,

$$ \frac{1}{n}\sum_i [ y_i - \hat\mu ] = 0, $$

where our estimator \(\hat\mu\) is the value of the parameter \(\mu\) which makes the equation above true. The method of moments estimator of \(\mu\) is simply the sample mean of \(y\), denoted \(\bar y\).
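
As a concrete illustration, here is a minimal Python sketch of this calculation (the simulated data, seed, and true mean of 2 are arbitrary choices of mine, not anything from the discussion above):

```python
# Method of moments for a mean: solve (1/n) * sum(y_i - mu) = 0 for mu.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=1000)  # sample with true mu = 2

# The solution of the sample moment condition is just the sample mean.
mu_hat = y.sum() / y.size
print(mu_hat, np.isclose(mu_hat, y.mean()))  # ~2, True
```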

Now suppose we draw another \(n\) observations on a different random variable \(w\), and theory tells us that the mean of \(w\) is the same as the mean of \(y\). Following the same reasoning as above, we could estimate \(\mu\) using the sample mean of \(w\), \(\bar w\). But using either of these estimates alone cannot be efficient, as we are wasting the information in the sample we don’t use. Theory tells us that both of these conditions are true

\begin{align}
E(y) - \mu &= 0 \\
E(w) - \mu &= 0
\end{align}

but we cannot generally choose \(\hat\mu\) to make both of the sample analogs of these conditions true,

\begin{align}
m_1 &= \bar y - \hat\mu = 0 \\
m_2 &= \bar w - \hat\mu = 0,
\end{align}

so the method of moments can’t be directly applied. But we can generalize (hence, GMM) by making these two moment conditions \(m_1\) and \(m_2\) as close to holding as possible, in the sense of making the squared deviations as small as possible. We could choose \(\hat\mu\) to
$$ \textrm{min}_{\mu} (\bar y - \mu)^2 + (\bar w - \mu)^2. $$
Minimizing this objective yields consistent estimates (since \(\bar y\) and \(\bar w\) are each consistent), but usually inefficient ones, since we should take into account that \(y\) and \(w\) might have different variances and might be correlated. Intuitively, if \(w\) is much noisier than \(y\), we should place less weight on observations on \(w\) because they contain less information about \(\mu\) than observations on \(y\) do. Suppose for simplicity that these are independent samples, and thus uncorrelated, but that the variance of \(y\) is higher than the variance of \(w\). We can correct for the unequal variances by dividing by the standard deviations (here and throughout the post we’ll assume for simplicity that we know all the variance parameters, abstracting from much of the complication of GMM estimation) and choose \(\mu\) to
$$ \textrm{min}_{\mu}
\left( \frac{\bar y - \mu}{\sigma_{\bar y}}\right)^2
+ \left( \frac{\bar w - \mu}{\sigma_{\bar w}}\right)^2. $$
Note this is equivalent to weighting each moment condition by the reciprocal of its standard deviation, so that we place more weight on the more precise condition. In general the moments will not be uncorrelated, and we should take that into account too.
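
To illustrate, here is a rough Python sketch of this two-sample problem (the data-generating process, with a common mean of 2 and unequal variances, is my own invented example): numerically minimizing the weighted objective recovers the familiar inverse-variance weighted average of the two sample means.

```python
# GMM for a common mean from two independent samples with unequal variances.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n = 500
y = rng.normal(2.0, 3.0, n)  # noisy sample
w = rng.normal(2.0, 1.0, n)  # more precise sample

var_ybar = y.var(ddof=1) / n  # estimated variance of the sample mean of y
var_wbar = w.var(ddof=1) / n

def objective(mu):
    # Each squared moment condition is weighted by the reciprocal of its variance.
    return (y.mean() - mu) ** 2 / var_ybar + (w.mean() - mu) ** 2 / var_wbar

mu_gmm = minimize_scalar(objective).x

# Closed form: the inverse-variance (precision) weighted average of the means.
mu_closed = ((y.mean() / var_ybar + w.mean() / var_wbar)
             / (1 / var_ybar + 1 / var_wbar))
print(mu_gmm, mu_closed)  # the two agree
```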

We can form a test statistic against our theory that the means in these two samples are identical. Suppose we use just the observations on \(y\) to calculate \(\bar y\), and then check to see how well that estimate explains the observations on the other variable \(w\),

$$ \sum_i ( w_i - \bar y)^2. $$

If the two samples really have the same mean, then as the sample grows \(\bar y\) converges to \(\mu\), this expression behaves like a sum of squared zero-mean normals, and we could base a test statistic on that result. Intuitively, if our theory is false, then the \((w_i - \bar y)\) do not tend to zero-mean random variables, and when we square them the results tend to be larger than if they did have zero mean.

That’s not the best way to test our theory, however. If our theory is true then the objective tends to the sum of two squared zero-mean variables, but if our theory is incorrect the objective function tends to the sum of squared non-zero-mean variables. A test can be based on this idea: if the minimized value of the objective function falls far out in the tail of the distribution obtained when the theory is correct (here, the \(\chi^2_1\) distribution), we have evidence against our theory that \(y\) and \(w\) have the same mean. This is a simple example of Hansen’s test, or the J test. Note that if we only observe one of \(y\) or \(w\) but not both, we have zero degrees of freedom left over to test the assumptions of the model, and we cannot conduct this test.
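
Here is a sketch of this test in Python, again with a made-up data-generating process; I deliberately give the two samples different means (2 and 2.5) so that the test should tend to reject:

```python
# J test for the two-sample mean example: under the null (equal means) the
# minimized objective is asymptotically chi-squared with one degree of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n = 500
y = rng.normal(2.0, 3.0, n)
w = rng.normal(2.5, 1.0, n)  # means actually differ, so the theory is false

var_ybar = y.var(ddof=1) / n
var_wbar = w.var(ddof=1) / n

# Closed-form minimizer (precision-weighted mean) and the minimized objective.
mu_hat = ((y.mean() / var_ybar + w.mean() / var_wbar)
          / (1 / var_ybar + 1 / var_wbar))
J = (y.mean() - mu_hat) ** 2 / var_ybar + (w.mean() - mu_hat) ** 2 / var_wbar

print(J, chi2.sf(J, df=1))  # a small p-value is evidence against equal means
```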

Linear regression.

The reasoning above can be applied to a very wide variety of problems, yielding various GMM estimators depending on which variables have zero mean under some theory. Consider a univariate linear regression model of the form
$$ y = \beta x + u, $$
where we interpret this equation as causal: \(\beta\) is the causal effect of a one-unit change in \(x\) on \(y\) holding other causes of \(y\), \(u\), constant (and we assume that all variables have zero mean for simplicity). Suppose the data came from a randomized experiment on \(x\) and that we have a random sample. Then \(u\) and \(x\) are uncorrelated, in other words, the random variables \(x_i u_i\) have mean zero,
$$ E[ x_i u_i ] = E[ x_i(y_i - \beta x_i) ] = 0, $$
the sample analog of which is
$$ \frac{1}{n} \sum_i x_i(y_i - \beta x_i) = 0. $$
Since we have one parameter and one equation, we can always make this condition hold, and the solution is easily seen to be the OLS estimator, even if the errors are heteroskedastic or correlated. We cannot test our theory that \(u\) and \(x\) are uncorrelated: with one parameter \(\beta\) and one equation to solve, we can always make the sample analog of the moment condition true.
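
A quick Python sketch of this case (the simulated experiment, with true \(\beta = 1.5\), is my own illustration):

```python
# The sample moment condition sum(x_i * (y_i - beta * x_i)) = 0 has the
# (no-intercept) OLS estimator as its solution.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)  # x as if from a randomized experiment
u = rng.normal(size=n)  # other causes of y, uncorrelated with x
y = 1.5 * x + u         # true beta = 1.5

beta_mm = (x * y).sum() / (x * x).sum()  # solves the moment condition exactly
print(beta_mm)  # ~1.5
```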

Instrumental variables.

Now suppose that \(x\) did not come from a perfect randomized experiment; instead, we have observational data and no reason to suppose that the causes of \(y\) we do observe (\(x\)) are uncorrelated with the causes we don’t observe (\(u\)). The condition \(E( x_i u_i ) = 0\) no longer holds, and an estimator based on that condition will have undesirable properties. But suppose we observe a variable \(z_1\) which has the property that \(z_1\) only affects \(y\) because \(z_1\) affects \(x\) (for example, assignment to the treatment group in an imperfect RCT only affects health because it affects whether patients take treatment). This assumption implies
$$ E( z_{1i} u_i ) = E[ z_{1i} (y_i - \beta x_i) ] = 0, $$
that is, \(z_1\) should not covary with \(y\) if \(x\) is held fixed. The sample analog of this condition gives us the method of moments estimator of \(\beta\), which turns out to be the simple linear instrumental variables estimator: the ratio of the covariance between \(y\) and \(z_1\) to the covariance between \(x\) and \(z_1\). We cannot test the assumption that \(z_1\) only affects \(y\) through \(x\): with one equation and one parameter, we can always find a value of \(\beta\) that makes the sample analog of this condition true.
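
Here is a Python sketch of this estimator under an invented data-generating process in which \(u\) confounds \(x\) and \(y\) (true \(\beta = 1.5\)); the ratio-of-covariances estimator recovers \(\beta\) while OLS does not:

```python
# Simple IV: beta_hat = cov(y, z1) / cov(x, z1).
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
z1 = rng.normal(size=n)                # instrument
u = rng.normal(size=n)                 # unobserved confounder
x = 0.8 * z1 + u + rng.normal(size=n)  # x depends on both z1 and u
y = 1.5 * x + 2.0 * u                  # true beta = 1.5; u also shifts y

beta_iv = np.cov(y, z1)[0, 1] / np.cov(x, z1)[0, 1]
beta_ols = np.cov(y, x)[0, 1] / x.var(ddof=1)  # biased by the confounder
print(beta_iv, beta_ols)  # ~1.5 versus something well above 1.5
```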

Now suppose we have available a second instrument, \(z_2\), which theory tells us also affects \(y\) only because \(z_2\) affects \(x\).

The diagram illustrates the model. The variable \(u\), colored red to denote that we can’t observe this variable, confounds the relationship between \(x\) and \(y\), implying that covariance between \(x\) and \(y\) does not reveal \(\beta\), the causal effect of \(x\) on \(y\).

But we can estimate the covariances between \(y\) and \(z_1\) and between \(y\) and \(z_2\); inspection of the diagram tells us these should equal \(\alpha_1\beta\) and \(\alpha_2\beta\). A one-unit increase in \(z_1\) causes an \(\alpha_1\)-unit increase in \(x\), and since a one-unit increase in \(x\) causes a \(\beta\)-unit increase in \(y\), a one-unit increase in \(z_1\) causes a \(\beta\alpha_1\) change in \(y\). Likewise for \(z_2\): the diagram tells us that covariance between \(y\) and \(z_2\) can only occur if \(\beta\ne 0\), and we can infer a value for \(\beta\) from that covariance divided by the covariance between \(z_2\) and \(x\).

So we have two distinct causal paths, either of which allows us to estimate the causal effect of \(x\) on \(y\), just like in the introductory example we had two different ways of estimating the sample mean \(\mu\). GMM tells us how to optimally combine these two insights to produce the most precise single estimate of \(\beta\) under the theory that \(z_1\) and \(z_2\) only affect \(y\) because they affect \(x\), just like above GMM told us how to optimally combine two samples which have the same mean under some theory.
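
Continuing the sketch above (same invented data-generating process, now with \(\alpha_1 = 0.8\) and \(\alpha_2 = 0.3\)), each instrument yields its own ratio estimate of \(\beta\), and the weaker instrument \(z_2\) yields a noisier one:

```python
# Two instruments give two distinct ratio estimates of the same beta.
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z1 + 0.3 * z2 + u + rng.normal(size=n)  # alpha1 = 0.8, alpha2 = 0.3
y = 1.5 * x + 2.0 * u                             # true beta = 1.5

beta_from_z1 = np.cov(y, z1)[0, 1] / np.cov(x, z1)[0, 1]
beta_from_z2 = np.cov(y, z2)[0, 1] / np.cov(x, z2)[0, 1]
print(beta_from_z1, beta_from_z2)  # two noisy estimates of the same parameter
```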

Theory tells us that both of these conditions are true

\begin{align}
E(z_{1i}u_i) &= E[ z_{1i}(y_i - \beta x_i) ] = 0 \\
E(z_{2i}u_i) &= E[ z_{2i}(y_i - \beta x_i) ] = 0
\end{align}

but we cannot in general choose the single parameter \(\beta\) to make both of the sample analogs of these conditions true. Suppose for simplicity that \(z_1\) and \(z_2\) are uncorrelated and that the \(u_i\) are heteroskedastic but uncorrelated. Then the theoretical moments above have sample counterparts
\begin{align}
m_1 &= \frac{1}{n} \sum_i [z_{1i}(y_i - \beta x_i)] \\
m_2 &= \frac{1}{n} \sum_i [z_{2i}(y_i - \beta x_i)]
\end{align}
and variances \(V(m_j) = \frac{1}{n^2}\sum_i z_{ji}^2 \sigma^2_i\) for \(j=1,2\). Then selecting \(\beta\) to
$$
\textrm{min}_{\beta} \left( \frac{m_1}{\sqrt{V(m_1)}}\right)^2 + \left( \frac{m_2}{\sqrt{V(m_2)}}\right)^2
$$
yields the GMM estimator of the causal effect of \(x\) on \(y\). As in the introductory example, the moment conditions are weighted by the reciprocals of their standard deviations, since we want to put more weight on the more precise condition. More generally, we should also take into account that \(z_1\) and \(z_2\) (and sometimes the \(u\)) may be correlated.
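
Here is a feasible two-step Python sketch of this estimator; unlike the discussion above, which treats the \(\sigma^2_i\) as known, the sketch plugs in squared residuals from a preliminary equal-weights estimate (the data-generating process is, again, my own invention):

```python
# Two-instrument GMM: weight each moment by the reciprocal of its variance.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
n = 10_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n) * (1 + 0.5 * np.abs(z1))  # heteroskedastic errors
x = 0.8 * z1 + 0.3 * z2 + u + rng.normal(size=n)
y = 1.5 * x + 2.0 * u                            # true beta = 1.5

def moments(beta):
    resid = y - beta * x
    return (z1 * resid).mean(), (z2 * resid).mean()

# Step 1: consistent but inefficient estimate using equal weights.
beta0 = minimize_scalar(lambda b: sum(m ** 2 for m in moments(b))).x

# Step 2: estimate V(m_j) = (1/n^2) * sum(z_ji^2 * sigma_i^2) with residuals.
r2 = (y - beta0 * x) ** 2
v1 = (z1 ** 2 * r2).sum() / n ** 2
v2 = (z2 ** 2 * r2).sum() / n ** 2

beta_gmm = minimize_scalar(
    lambda b: moments(b)[0] ** 2 / v1 + moments(b)[1] ** 2 / v2).x
print(beta_gmm)  # ~1.5
```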

Just like in the simple case above of estimating a mean from two samples, if our theory is true then the minimized value of the objective function is asymptotically distributed \(\chi^2_1\), so estimation by GMM produces a test of the theory as a by-product of estimation (in this context, this test statistic is also called the Sargan test). Intuitively, if \(z_1\) and \(z_2\) are both uncorrelated with \(u\), then we could form the simple IV estimator using just \(z_1\), and the residuals from that exercise should be uncorrelated (up to sampling noise) with \(z_2\). If they are not, then we’re not sure what’s wrong, since either \(z_1\) or \(z_2\) could be correlated with \(u\), but we conclude that something is wrong with our theory. That two-step check is not the most efficient test, though; as in the introductory example, the minimized value of the objective function itself forms the test statistic against the null that our theory is correct.
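
And a sketch of the test itself, building on the previous block: here I deliberately make \(z_2\) invalid by giving it a direct effect on \(y\) (the magnitude 0.5 of the violation is an arbitrary choice of mine), so the J statistic should land far out in the \(\chi^2_1\) tail:

```python
# Sargan/Hansen J test: the minimized weighted objective, compared to chi^2_1.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 10_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z1 + 0.3 * z2 + u + rng.normal(size=n)
y = 1.5 * x + 2.0 * u + 0.5 * z2  # z2 affects y directly: the theory is false

def moments(beta):
    resid = y - beta * x
    return (z1 * resid).mean(), (z2 * resid).mean()

beta0 = minimize_scalar(lambda b: sum(m ** 2 for m in moments(b))).x
r2 = (y - beta0 * x) ** 2
v1 = (z1 ** 2 * r2).sum() / n ** 2
v2 = (z2 ** 2 * r2).sum() / n ** 2

res = minimize_scalar(lambda b: moments(b)[0] ** 2 / v1 + moments(b)[1] ** 2 / v2)
J = res.fun  # the minimized objective is the J statistic
print(J, chi2.sf(J, df=1))  # small p-value: at least one instrument is invalid
```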

Micro applications of GMM to infer causality.

In the preceding example, GMM allows us to estimate the causal effect of \(x\) on \(y\) using a theory that says that two variables \(z_1\) and \(z_2\) only affect \(y\) because they affect \(x\). GMM tells us how to make use of our theory to make our estimates as precise as possible, and as a by-product of estimation provides a test statistic against the null hypothesis that our theory is correct (warning: our theory could also be incorrect not because the \(z\)’s affect \(y\) for some reason other than through \(x\), but rather because the causal effect \(\beta\) varies across units in the population; see, e.g., Heckman, Urzua, and Vytlacil 2006).

GMM can be applied to much more complicated problems to estimate causal effects in a wide variety of nonlinear regression models (e.g., Windmeijer 2006), and to estimate the deep parameters in estimable choice models which can be used to produce out-of-sample predictions which sidestep the Lucas critique, e.g., Hotz and Miller (1993) or Ferrall (2012).


  • Kevin Denny

    In your penultimate paragraph, where you say it produces, as a by-product, a test of your theory, I think you need to qualify this. It is a test of over-identification, not identification. So given that z1 is a valid IV you can test whether z2 is, or vice versa. But you cannot test both; by “theory” I think of an identified model.

    • Chris Auld

      I’d interpret it as a test of “the theory” that both of the instruments are exogenous. If the test rejects, then we can conclude that at least one of the instruments is invalid. The serious problem that heterogeneity essentially destroys this interpretation, well, I leave that for another day.

Copyright © 2014 M. Christopher Auld