3.2.1 Misspecifying the Set of Regressors
If one is (implicitly) assuming that the conditioning set of the model contains more variables than the ones that are included, it is possible that the set of explanatory variables is ‘misspecified’. This means that one or more of the omitted variables are relevant, that is, they have nonzero coefficients. This raises two questions: what happens when a relevant variable is excluded from the model, and what happens when an irrelevant variable is included in the model? To illustrate this, consider the following two models:
y_i = x_i′β + z_i′γ + ε_i,  (3.12)

and

y_i = x_i′β + v_i,  (3.13)

both interpreted as describing the conditional expectation of y_i given x_i, z_i (and maybe some additional variables). The model in (3.13) is nested in (3.12) and implicitly assumes that z_i is irrelevant (γ = 0). What happens if we estimate model (3.13) whereas model (3.12) is the correct model? That is, what happens when we omit z_i from the set of regressors?
The OLS estimator for β based on (3.13), denoted as b₂, is given by

b₂ = (∑_{i=1}^N x_i x_i′)⁻¹ ∑_{i=1}^N x_i y_i.  (3.14)
The properties of this estimator under model (3.12) can be determined by substituting (3.12) into (3.14) to obtain

b₂ = β + (∑_{i=1}^N x_i x_i′)⁻¹ ∑_{i=1}^N x_i z_i′γ + (∑_{i=1}^N x_i x_i′)⁻¹ ∑_{i=1}^N x_i ε_i.  (3.15)

Depending upon the assumptions made for model (3.12), the last term in this expression will have an expectation or probability limit of zero.³ The second term on the right-hand side, however, corresponds to a bias (or asymptotic bias) in the OLS estimator owing to estimating the incorrect model (3.13). This is referred to as an omitted variable bias.

²We abstract from trivial exceptions, like x_i = −z_i and β = −γ.
³Compare the derivations of the properties of the OLS estimator in Section 2.6.
As expected, there will be no bias if γ = 0 (implying that the two models are identical), but there is one more case in which the estimator for β will not be biased, and that is when ∑_{i=1}^N x_i z_i′ = 0 or, asymptotically, when E{x_i z_i′} = 0. If this happens, we say that x_i and z_i are orthogonal. This does not happen very often in economic applications. Note, for example, that the presence of an intercept in x_i implies that E{z_i} should be zero for orthogonality to hold.
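A small simulation can make the bias term in (3.15) concrete. The sketch below uses hypothetical numbers and a single regressor without intercept for simplicity: data are generated from the long model (3.12) with correlated x_i and z_i, the short regression (3.13) is estimated, and the result is compared with β plus the bias term from (3.15).

```python
import numpy as np

# Illustrative simulation of the omitted variable bias in (3.15); all
# numbers are hypothetical, single regressor and no intercept for clarity.
rng = np.random.default_rng(0)
N, beta, gamma = 100_000, 1.0, 0.5

z = rng.normal(size=N)
x = 0.8 * z + rng.normal(size=N)              # x_i and z_i are correlated
y = beta * x + gamma * z + rng.normal(size=N) # true model is (3.12)

b2 = (x @ y) / (x @ x)              # OLS for beta from (3.13), as in (3.14)
bias = gamma * (x @ z) / (x @ x)    # second term on the right of (3.15)

print(b2)            # close to beta + bias (about 1.24), not to beta = 1
print(beta + bias)
```

With this design the omitted variable picks up roughly a quarter of a unit of spurious effect, even though the sample is very large: the bias does not vanish asymptotically.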
The converse is less of a problem. If we were to estimate model (3.12) while in fact model (3.13) was appropriate, that is, we needlessly included the irrelevant variables z_i, we would simply be estimating the γ coefficients, which in reality are zero. In this case, however, it would be preferable to estimate β from the restricted model (3.13) rather than from (3.12), because the latter estimator for β will usually have a higher variance and thus be less reliable. While the derivation of this result requires some tedious matrix manipulations, it is intuitively obvious: model (3.13) imposes more information, so we can expect that the estimator that exploits this information is, on average, more accurate than one that does not. Thus, including irrelevant variables in your model, even though they have a zero coefficient, will typically increase the variance of the estimators for the other model parameters. Including as many variables as possible in a model is thus not a good strategy, while including too few variables carries the danger of biased estimates. This means we need some guidance on how to select the set of regressors.
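The variance cost of needlessly including z_i can likewise be illustrated with a small Monte Carlo sketch (a hypothetical setup, not from the text): both estimators for β are unbiased, but the one from the long regression (3.12) is noticeably less precise when x_i and z_i are correlated.

```python
import numpy as np

# Hypothetical Monte Carlo: including an irrelevant (gamma = 0) but
# correlated regressor z inflates the variance of the estimator for beta.
rng = np.random.default_rng(1)
N, R, beta = 200, 2000, 1.0

b_short, b_long = [], []
for _ in range(R):
    z = rng.normal(size=N)
    x = 0.8 * z + rng.normal(size=N)    # x correlated with z
    y = beta * x + rng.normal(size=N)   # true model excludes z (gamma = 0)

    b_short.append((x @ y) / (x @ x))   # restricted model (3.13)

    X = np.column_stack([x, z])         # unrestricted model (3.12)
    b_long.append(np.linalg.lstsq(X, y, rcond=None)[0][0])

# Both estimators centre on beta = 1, but the long regression is noisier
print(np.mean(b_short), np.mean(b_long))
print(np.var(b_short), np.var(b_long))
```

The variance ratio here is roughly 1/(1 − ρ²), where ρ is the correlation between x_i and z_i, which is the usual multicollinearity inflation factor.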
3.2.2 Selecting Regressors
Again, it should be stressed that, if we interpret the regression model as describing the conditional expectation of y_i given the included variables x_i, there is no issue of a misspecified set of regressors, although there might be a problem of functional form (see the next section). This implies that statistically there is nothing to test here. The set of x_i variables will be chosen on the basis of what we find interesting, and often economic theory or common sense guides us in our choice. Interpreting the model in a broader sense implies that there may be relevant regressors that are excluded or irrelevant ones that are included. To find potentially relevant variables, we can use economic theory again. For example, when specifying an individual wage equation, we may use the human capital theory, which essentially says that everything that affects a person’s productivity will affect his or her wage. In addition, we may use job characteristics (blue or white collar, shift work, public or private sector, etc.) and general labour market conditions (e.g. sectorial unemployment).
It is good practice to select the set of potentially relevant variables on the basis of economic arguments rather than statistical ones. Although it is sometimes suggested otherwise, statistical arguments are never certainty arguments. That is, there is always a small (but not ignorable) probability of drawing the wrong conclusion. For example, there is always a probability (corresponding to the size of the test) of rejecting the null hypothesis that a coefficient is zero, while the null is actually true. Such type I errors are rather likely to happen if we use a sequence of many tests to select the regressors to include in the model. This process is referred to as data snooping, data mining or p-hacking (see Leamer, 1978; Lovell, 1983; or Charemza and Deadman, 1999, Chapter 2), and in economics it is not a compliment if someone accuses you of
doing it.⁴ In general, data snooping refers to the fact that a given set of data is used more than once to choose a model specification and to test hypotheses. You can imagine, for example, that, if you have a set of 20 potential regressors and you try each one of them, it is quite likely that you will conclude that one of them is ‘significant’, even though there is no true relationship between any of these regressors and the variable you are explaining.
Although statistical software packages sometimes provide mechanical routines to select regressors, these are typically not recommended in economic work. The probability of making incorrect choices is high, and it is not unlikely that your ‘model’ captures some peculiarities in the data that have no real meaning.⁵ In practice, however, it is hard to prevent some amount of data snooping from entering your work. Even if you do not perform your own specification search and happen to ‘know’ which model to estimate, this ‘knowledge’ may be based upon the successes and failures of past investigations.
Nevertheless, it is important to be aware of the problem. In recent years, the possibility of data snooping biases has played an important role in empirical studies modelling stock returns. Lo and MacKinlay (1990), for example, analyse such biases in tests of financial asset pricing models, while Sullivan, Timmermann and White (2001) analyse the extent to which the presence of calendar effects in stock returns, like the January effect discussed in Section 2.7, can be attributed to data snooping.
To illustrate the data snooping problem, let us consider the following example (Lovell, 1983). Suppose that an investigator wants to specify a linear regression model for next month’s stock returns from a number of equally plausible candidate explanatory variables. The model is restricted to have at most two explanatory variables. What are the implications of searching for the best two candidate regressors when the null hypothesis is true that stock prices follow a random walk and all explanatory variables are actually irrelevant? Because statistical tests are always subject to type I errors (rejecting the null hypothesis while it is actually true), the probability of such errors accumulates rapidly if a large sequence of tests is performed. When the claimed confidence level is 95%, the probability of incorrectly rejecting the null in the above example increases to approximately 1 − 0.95^(k/2), where k is the number of candidate regressors. For example, if all candidate regressors are uncorrelated, the probability of finding t-values larger than 1.96 when the best two out of 20 regressors are selected is as large as 40%, while in fact all true coefficients are zero. This probability increases to more than 92% if the best two out of 100 candidates have been selected.
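The accumulation of type I errors in this example is easy to reproduce from Lovell’s approximation quoted above, 1 − 0.95^(k/2):

```python
# Approximate probability of at least one spurious rejection at the 95%
# confidence level when the best two out of k uncorrelated candidate
# regressors are selected (Lovell, 1983): roughly 1 - 0.95**(k/2).
for k in (2, 20, 100):
    p = 1 - 0.95 ** (k / 2)
    print(f"best 2 out of {k:3d} candidates: P = {p:.2f}")
# prints 0.05, 0.40 and 0.92, matching the figures in the text
```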
The danger of data mining is particularly high if the specification search is from simple to general. In this approach, you start with a simple model, and you include additional variables or lags of variables until the specification appears adequate. That is, until the restrictions imposed by the model are no longer rejected and you are happy with the signs of the coefficient estimates and their significance. Clearly, such a procedure may involve a very large number of tests. Stepwise regression, an automated version of such a specific-to-general approach, is bad practice and can easily lead to inappropriate model
4In computer science and big data analytics, the term data mining is used to describe the (useful) process of summarizing and finding interesting patterns in huge data sets (Varian, 2014).
5For example, when searching long enough, one can document “relationships” between the number of people who died by falling into a swimming pool and the number of films that Nicolas Cage appeared in, or between mozzarella cheese consumption and the number of civil engineering doctorates; see Vigen (2015) for a humorous account of such spurious correlations.
specifications, particularly if the candidate explanatory variables are not orthogonal; see
Doornik (2008) for a recent example. An alternative is the general-to-specific modelling approach, advocated by Professor David Hendry and others, typically referred to as the LSE methodology.⁶ This approach starts by estimating a general unrestricted model (GUM), which is subsequently reduced in size and complexity by testing restrictions that can be imposed; see Charemza and Deadman (1999) for an extensive treatment. The idea behind this approach is appealing. Assuming that a sufficiently general and complicated model can describe reality, any more parsimonious model is an improvement if it conveys all of the same information in a simpler, more compact form. The art of model specification in the LSE approach is to find models that are valid restrictions of the GUM, and that cannot be reduced to even more parsimonious models that are also valid restrictions.
Although the LSE methodology involves a large number of (mis)specification tests, it can be argued to be relatively insensitive to data-mining problems. The basic argument, formalized by White (1990), is that, as the sample size grows to infinity, only the true specification will survive all specification tests. This assumes that the ‘true specification’
is a special case of the GUM with which a researcher starts. Rather than ending up with a specification that is most likely incorrect, owing to an accumulation of type I and type II errors, the general-to-specific approach in the long run would result in the correct specification. While this asymptotic result is insufficient to assure that the LSE approach works well with sample sizes typical for empirical work, Hoover and Perez (1999) show that it may work pretty well in practice in the sense that the methodology recovers the correct specification (or a closely related specification) most of the time. An automated version of the general-to-specific approach is developed by Krolzig and Hendry (2001) and is available in PcGets (Owen, 2003) and, with some refinements, in Autometrics (Doornik, 2009). Hendry (2009) discusses the role of model selection in applied econometrics and provides an illustration. Castle, Qin and Reed (2013) review and compare a large number of model selection algorithms. The use of automatic model selection procedures in empirical work is not widespread, although the recent emergence of ‘big data’ generates new interest in this issue, particularly for large-dimensional problems (see Varian, 2014).⁷

In practice, most applied researchers will start somewhere ‘in the middle’ with a specification that could be appropriate and, ideally, then test (1) whether restrictions imposed by the model are correct and (2) whether restrictions not imposed by the model could be imposed. In the first category are misspecification tests for omitted variables, but also for autocorrelation and heteroskedasticity (see Chapter 4). In the second category are tests of parametric restrictions, for example that one or more explanatory variables have zero coefficients.
While the current chapter provides useful tests and procedures for specifying and estimating an econometric model, there is no golden rule to find an acceptable specification in a given application. Important reasons for this are that specification is simply
⁶The adjective LSE derives from the fact that there is a strong tradition of time series econometrics at the London School of Economics (LSE), starting in the 1960s (see Mizon, 1995). Currently, the practitioners of LSE econometrics are widely dispersed among institutions throughout the world.
⁷A reasonably popular approach in economic applications is the LASSO (‘least absolute shrinkage and selection operator’), developed by Tibshirani (1996). This combines estimation and variable selection in large-dimensional problems (e.g. when there are more regressors than observations) by minimizing the usual sum of squared residuals, but imposing a bound on the sum of the absolute values of the coefficients.
Several variants and extensions have been developed. Ng (2013) reviews recent advances in variable selection methods in predictive regressions.
not easy, that only a limited amount of reliable data is available and that theories are often highly abstract or controversial (see Hendry and Richard, 1983). This makes specification of a model partly an imaginative process for which it is hard to write down rules. Or, as formulated somewhat bluntly in Chapter 1, econometrics is much easier without data. Kennedy (2008, Chapters 5 and 22) provides a very useful discussion of specification searches in practice, combined with the ‘ten commandments of applied econometrics’.
In presenting your estimation results, it is not a ‘sin’ to have insignificant variables included in your specification. The fact that your results do not show a significant effect on y_i of some variable x_{ik} is informative to the reader, and there is no reason to hide it by re-estimating the model while excluding x_{ik}. It is also recommended that an intercept term be kept in the model, even if it appears insignificant. Of course, you should be careful about including many variables in your model that are multicollinear, so that, in the end, almost none of the variables appears individually significant.
Besides formal statistical tests, there are other criteria that are sometimes used to select a set of regressors. First of all, the R², discussed in Section 2.4, measures the proportion of the sample variation in y_i that is explained by variation in x_i. It is clear that, if we were to extend the model by including z_i in the set of regressors, the explained variation would never decrease, so the R² would also never decrease if we included additional variables in the model. Using the R² as the criterion would thus favour models with as many explanatory variables as possible. This is certainly not optimal, because with too many variables we will not be able to say very much about the model’s coefficients, as they may be estimated rather inaccurately. Because the R² does not ‘punish’ the inclusion of many variables, it would be better to use a measure that incorporates a trade-off between goodness-of-fit and the number of regressors employed in the model. One way to do this is to use the adjusted R² (or R̄²), as discussed in Chapter 2. Writing it as
R̄² = 1 − [1/(N−K) ∑_{i=1}^N e_i²] / [1/(N−1) ∑_{i=1}^N (y_i − ȳ)²],  (3.16)

and noting that the denominator in this expression is unaffected by the model under consideration, shows that the adjusted R² provides a trade-off between goodness-of-fit, as measured by ∑_{i=1}^N e_i², and the simplicity or parsimony of the model, as measured by the number of parameters K. There exist a number of alternative criteria that provide such a trade-off, the most common ones being Akaike’s Information Criterion (AIC), proposed by Akaike (1973), given by
AIC = log(1/N ∑_{i=1}^N e_i²) + 2K/N,  (3.17)
and the Schwarz Bayesian Information Criterion (BIC), proposed by Schwarz (1978), which is given by
BIC = log(1/N ∑_{i=1}^N e_i²) + (K/N) log N.  (3.18)
Models with a lower AIC or BIC are typically preferred. Note that both criteria add a penalty that increases with the number of regressors. Because the penalty is larger for BIC, the latter criterion tends to favour more parsimonious models than AIC. The BIC can be shown to be consistent in the sense that, asymptotically, it will select the true model provided the true model is among the set being considered. In small samples, however, Monte Carlo evidence shows that AIC can work better. The use of either of these criteria is usually restricted to cases where alternative models are not nested (see Subsection 3.2.3) and economic theory provides no guidance on selecting the appropriate model. A typical situation is the search for a parsimonious model that describes the dynamic process of a particular variable (see Chapter 8); Section 3.5 provides an illustration.
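As a sketch of how (3.17) and (3.18) are used, the snippet below (hypothetical simulated data) computes both criteria for a model with and without an extra regressor. The deterministic part is the penalty comparison: BIC charges (K/N) log N per parameter against AIC’s 2K/N, so for N ≥ 8 the BIC penalizes the larger model more heavily.

```python
import numpy as np

# Hypothetical example: compute AIC (3.17) and BIC (3.18) for a short and
# a long regression on simulated data; lower values are preferred.
rng = np.random.default_rng(2)
N = 200
x = rng.normal(size=N)
z = rng.normal(size=N)                    # candidate extra regressor
y = 1.0 + 0.5 * x + rng.normal(size=N)    # true model does not involve z

def aic_bic(X, y):
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # OLS residuals
    K = X.shape[1]
    logs2 = np.log(np.mean(e ** 2))
    return logs2 + 2 * K / N, logs2 + (K / N) * np.log(N)

const = np.ones(N)
aic0, bic0 = aic_bic(np.column_stack([const, x]), y)
aic1, bic1 = aic_bic(np.column_stack([const, x, z]), y)

# BIC penalizes the extra parameter by log(N)/N, AIC only by 2/N
print(aic1 - aic0, bic1 - bic0)
```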
Alternatively, it is possible to test whether the increase in R² is statistically significant. Testing this is exactly the same as testing whether the coefficients for the newly added variables z_i are all equal to zero, and we have seen a test for that in Chapter 2. Recall from (2.59) that the appropriate F-statistic can be written as
F = [(R₁² − R₀²)/J] / [(1 − R₁²)/(N − K)],  (3.19)
where R₁² and R₀² denote the R² in the model with and without z_i, respectively, and J is the number of variables in z_i. Under the null hypothesis that z_i has zero coefficients, the F-statistic has an F distribution with J and N − K degrees of freedom, provided we can impose conditions (A1)–(A5) from Chapter 2. The F-test thus provides a statistical answer to the question as to whether the increase in R² as a result of including z_i in the model was significant or not. It is also possible to rewrite F in terms of adjusted R²s. This would show that R̄₁² > R̄₀² if and only if F exceeds a certain threshold. In general, these thresholds do not correspond to 5% or 10% critical values of the F distribution, but are substantially smaller. In particular, it can be shown that R̄₁² > R̄₀² if and only if the F-statistic is larger than one. For a single variable (J = 1) this implies that the adjusted R² will increase if the additional variable has a t-ratio with an absolute value larger than unity. (Recall that, for a single restriction, t² = F.) This reveals that the use of the adjusted R² as a tool to select regressors leads to the inclusion of more variables than standard t- or F-tests.
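To make the threshold argument concrete, consider some hypothetical numbers (not from the text): N = 100, a model with K = 5 parameters after adding J = 2 regressors, and an R² rising from 0.30 to 0.32. The F-statistic in (3.19) is about 1.4: well below the 5% critical value of roughly 3.1, yet larger than one, so the adjusted R² increases even though the F-test does not reject.

```python
# Hypothetical illustration of (3.19) and the adjusted-R^2 threshold F > 1.
N, K, J = 100, 5, 2          # K parameters in the extended model
R2_0, R2_1 = 0.30, 0.32      # R^2 without and with the J extra regressors

F = ((R2_1 - R2_0) / J) / ((1 - R2_1) / (N - K))
print(F)                     # about 1.40: insignificant at 5%, but above 1

def adj_r2(r2, k):           # adjusted R^2, for a model with k parameters
    return 1 - (1 - r2) * (N - 1) / (N - k)

print(adj_r2(R2_1, K) > adj_r2(R2_0, K - J))   # True, because F > 1
```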
Direct tests of the hypothesis that the coefficients γ for z_i are zero can be obtained from the t- and F-tests discussed in Chapter 2. Compared with F above, a test statistic can be derived that is more generally appropriate. Let γ̂ denote the OLS estimator for γ and let V̂{γ̂} denote an estimated covariance matrix for γ̂. Then, it can be shown that, under the null hypothesis that γ = 0, the test statistic
ξ = γ̂′ V̂{γ̂}⁻¹ γ̂  (3.20)
has an asymptotic χ² distribution with J degrees of freedom. This is similar to the Wald test described in Chapter 2 (compare (2.63)). The form of the covariance matrix of γ̂ depends upon the assumptions we are willing to make. Under the Gauss–Markov assumptions, we would obtain a statistic that satisfies ξ = JF.
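The statistic in (3.20) and its link with F can be sketched on hypothetical simulated data. When the covariance matrix is estimated as s²(X′X)⁻¹ with s² from the unrestricted model, the identity ξ = JF holds exactly, which the snippet verifies numerically.

```python
import numpy as np

# Hypothetical sketch of the Wald statistic (3.20) for H0: gamma = 0,
# where gamma holds the J = 2 coefficients of the z-variables.
rng = np.random.default_rng(3)
N, J = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])   # const, x, z1, z2
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=N)  # z1, z2 irrelevant
K = X.shape[1]

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = (e @ e) / (N - K)
V = s2 * np.linalg.inv(X.T @ X)          # estimated covariance matrix of b

g = b[2:]                                # gamma-hat
xi = g @ np.linalg.solve(V[2:, 2:], g)   # Wald statistic (3.20)

# The F-test of (3.19), computed from restricted and unrestricted residuals
e0 = y - X[:, :2] @ np.linalg.lstsq(X[:, :2], y, rcond=None)[0]
F = ((e0 @ e0 - e @ e) / J) / ((e @ e) / (N - K))

print(np.isclose(xi, J * F))   # True: under Gauss-Markov, xi = J * F
```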
It is important to recall that two single tests are not equivalent to one joint test. For example, if we are considering the exclusion of two single variables with coefficients γ₁ and γ₂, the individual t-tests may reject neither γ₁ = 0 nor γ₂ = 0, whereas the joint F-test (or Wald test) rejects the joint restriction γ₁ = γ₂ = 0. The message here is that, if we want to drop two variables from the model at the same time, we should be looking at a joint test rather than at two separate tests. Once the first variable is omitted from the model,