Say you wish to estimate a model with a binary dependent variable. You recall that you ought not use OLS primarily because OLS will not bound your predicted values between zero and one. So you use a nonlinear variant, say, probit. But you also recall that it doesn’t matter much if you just use OLS and ignore the binary nature of your dependent variable so long as you are interested in estimating the effects of your covariates, not generating predicted values.
Let y denote your outcome, x a continuous covariate, and d a dummy covariate. You generate OLS estimates and marginal effects from probit estimates and compare them:


             (1)          (2)
             OLS          Probit
x            0.218***     0.209***
             (4.99)       (4.88)
d            0.237***     0.247***
             (8.81)       (11.67)
N            1000         1000

t statistics in parentheses

Sure enough, it doesn’t matter in this case whether you report OLS estimates or marginal effects from a nonlinear model. Your findings are essentially the same.
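The comparison above can be reproduced with a minimal simulation sketch. The post itself does not show code (it presumably used Stata), so everything below is my own illustration: the data-generating process, the coefficient values, and the hand-rolled probit fit are assumptions, not the post's actual setup.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                       # continuous covariate
d = rng.integers(0, 2, size=n).astype(float)  # dummy covariate
X = np.column_stack([np.ones(n), x, d])

# Probit DGP with no interaction; these coefficient values are arbitrary
y = (X @ np.array([0.2, 0.6, 0.7]) + rng.normal(size=n) > 0).astype(float)

# OLS (linear probability model) via least squares
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Probit by maximum likelihood
def negll(b):
    p = np.clip(norm.cdf(X @ b), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

b_probit = minimize(negll, np.zeros(3), method="BFGS").x

# Average marginal effect of x from the probit: mean of beta_1 * phi(Xb)
ame_x = b_probit[1] * np.mean(norm.pdf(X @ b_probit))
print(round(b_ols[1], 3), round(ame_x, 3))  # typically close to each other
```

Without an interaction term, the OLS slope and the probit average marginal effect track each other closely, which is the point of the first table.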
Now suppose you want to see if the effect of x on y varies across groups defined by the dummy variable d. You create the interaction of x and d, denoted xd, and perform the exercise above again:


             (1)          (2)
             OLS          Probit
x            0.290***     0.216***
             (5.53)       (4.57)
d            0.365***     0.262***
             (6.24)       (5.44)
xd          -0.231*       0.044
            (-2.46)       (0.35)
N            1000         1000


Now you have a problem: the OLS estimates tell you that the effect of x on y is smaller when d=1 than when d=0 (b=-0.23, t=-2.46), but the probit estimates suggest that the effect of x on y is essentially stable as we vary d (b=0.04, t=0.35). What's going on?
The issue is that the coefficients on the interaction term demand different interpretations in linear and nonlinear models. The OLS model is
\(E[y \mid x, d] = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 (xd)\)
so
\(\frac{\partial E[y \mid x, d]}{\partial (xd)} = \frac{\partial^2 E[y \mid x, d]}{\partial x \, \partial d} = \beta_3\)
(where for simplicity I’ve ignored d’s binary nature). However in the probit model
\(E[y \mid x, d] = \Phi(\beta_0 + \beta_1 x + \beta_2 d + \beta_3 (xd)),\)
where \(\Phi(\cdot)\) denotes the standard normal CDF. Therefore,
\(\frac{\partial E[y \mid x, d]}{\partial (xd)} = \beta_3 \phi(\cdot).\)
But this is not what we want to evaluate! This term is the marginal effect of the interaction term (xd), but what we want is the interaction effect:
\(\frac{\partial^2 E[y \mid x, d]}{\partial x \, \partial d},\)
which is a more complicated expression not generally equal to \(\beta_3\phi(\cdot)\). If you run a nonlinear model including an interaction term, the marginal effect of the interaction term is not the cross-effect of the variables you've interacted. The two expressions coincide only in linear models, which is why the distinction is irrelevant for the OLS model.
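To see the gap concretely: writing the index as \(u = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 xd\) and treating d as continuous (as the derivation above does for simplicity), the chain rule gives \(\partial E/\partial x = (\beta_1 + \beta_3 d)\phi(u)\), and differentiating again with respect to d, using \(\phi'(u) = -u\phi(u)\), yields \(\frac{\partial^2 E}{\partial x \, \partial d} = \beta_3\phi(u) - (\beta_1 + \beta_3 d)(\beta_2 + \beta_3 x)\, u\, \phi(u)\). A small numerical sketch (coefficient values are arbitrary illustrations, with \(\beta_3 = 0\) on purpose) checks this against a finite-difference cross-partial:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary index coefficients; beta_3 = 0, so the "naive" interaction
# marginal effect beta_3 * phi(u) is exactly zero
b0, b1, b2, b3 = -0.5, 0.6, 0.7, 0.0
u = lambda x, d: b0 + b1 * x + b2 * d + b3 * x * d
E = lambda x, d: norm.cdf(u(x, d))

x0, d0, h = 1.0, 0.5, 1e-4

# Finite-difference cross-partial d^2 E / dx dd
cross = (E(x0 + h, d0 + h) - E(x0 + h, d0 - h)
         - E(x0 - h, d0 + h) + E(x0 - h, d0 - h)) / (4 * h * h)

# Analytic cross-partial: beta_3*phi(u) - (b1+b3*d)(b2+b3*x)*u*phi(u)
u0 = u(x0, d0)
analytic = b3 * norm.pdf(u0) - (b1 + b3 * d0) * (b2 + b3 * x0) * u0 * norm.pdf(u0)

naive = b3 * norm.pdf(u0)  # the "marginal effect of xd": zero here
print(cross, analytic, naive)
```

Even with \(\beta_3 = 0\), the cross-partial is nonzero: the curvature of \(\Phi\) alone generates an observed interaction, which is exactly the point the next section makes.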
In the example above, the data are artificial, and the true value of the interaction term in the index function (\(\beta_3\)) is zero. We can see what the OLS estimate recovers as a nonzero interaction by graphing predicted probabilities in the d=0 and d=1 regimes:
The slope is shallower for the d=1 group than for the d=0 group, even though there is no structural interaction effect, because the d=1 group has a high probability of y=1 everywhere. Consider a more extreme case in which the probability that y=1 in the d=1 group increases with x, but only from 0.99 to 0.999 as x varies from its lowest to highest value: then the slope must be barely above zero everywhere, no matter how large an effect x has on the index function.
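This ceiling effect falls directly out of the probit functional form: with \(P(y=1) = \Phi(z)\), the slope of the probability in x is \(\beta\phi(z)\), which shrinks as the baseline probability approaches one even when the index coefficient \(\beta\) is held fixed. A short sketch (the baseline probabilities chosen are illustrative):

```python
from scipy.stats import norm

beta = 1.0  # same index-function effect at every baseline
# Slope of P(y=1) in x at points where the baseline probability is p:
# dP/dx = beta * phi(Phi^{-1}(p))
slopes = {p: beta * norm.pdf(norm.ppf(p)) for p in (0.5, 0.9, 0.99)}
for p, s in slopes.items():
    print(p, round(s, 4))
```

A group sitting near p=0.99 necessarily has a far flatter probability slope than a group near p=0.5, which is why OLS on the probabilities reports a negative "interaction" here.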
Chunrong Ai and Edward Norton (2003) discuss this issue, and they also wrote Stata commands to numerically calculate the correct interaction effects. However, Bill Greene points out in a recent paper that perhaps we don't want to evaluate the interaction effect on the conditional probability either. Greene argues that nonlinearity implies the observed interaction effect will generally be nonzero even when the true coefficient on the interaction term in the index function is zero (as illustrated above), so observed interactions may be artifacts of the functional form. A given structural model can generate all sorts of observed interaction effects depending on the properties of the data: in the graph above, for example, the slopes of the two lines would become more similar if we sampled x's much larger than one, the highest x in the artificial data. For these reasons Greene suggests that the sort of tests econometricians have been reporting following Ai and Norton (2003) are not meaningful. Instead, he argues, we should conduct hypothesis tests only on the structural parameters of the model, such as the difficult-to-interpret probit coefficients, and not on implications of the model such as marginal effects.
See also Puhani (2008), who points out that some of these issues are irrelevant in difference-in-differences models in which the interaction term of interest is also a treatment dummy.
Tags: Edward Norton, interaction terms, marginal effects, probit, regression, William Greene