In calculating the OLS estimate $b$, some observations may have a much bigger impact than others. If one or a few observations are extremely influential, it is advisable to check them to make sure they are not due to erroneous data (e.g. misplacement of a decimal point) or relate to some atypical cases (e.g. including the CEO of Apple in your sample of wages). More generally, it makes sense to check the sensitivity of your estimation results with respect to (seemingly) small changes in your sample or sample period. In some cases, it is advisable to use more robust estimation methods rather than OLS. Another problem that arises in many situations is that of missing observations. For example, years of experience may not be observed for a subset of individuals. The easy solution is to drop individuals with incomplete information from the sample and estimate the wage equation using complete cases only, but this is only innocuous when the observations are missing in a random way. In this section, we discuss these two problems in a bit more detail, including some pragmatic ways of dealing with them.
2.9.1 Outliers and Influential Observations
Loosely speaking, an outlier is an observation that deviates markedly from the rest of the sample. In the context of a linear regression, an outlier is an observation that is far away from the (true) regression line. Outliers may be due to measurement errors in the data, but can also occur by chance in any distribution, particularly if it has fat tails. If outliers correspond to measurement errors, the preferred solution is to discard the corresponding unit from the sample (or correct the measurement error if the problem is obvious).
If outliers are correct data points, it is less obvious what to do. Recall from the discussion in Subsection 2.3.2 that variation in the explanatory variables is a key factor in determining the precision of the OLS estimator, so that outlying observations may be very valuable (and throwing them away is not a good idea).
The problem with outliers is not so much that they deviate from the rest of the sample, but rather that the outcomes of estimation methods, like ordinary least squares, can be very sensitive to one or more outliers. In such cases, an outlier becomes an ‘influential observation’. There is, however, no simple mathematical definition of what exactly constitutes an outlier. Nevertheless, it is highly advisable to compute summary statistics of all relevant variables in your sample before performing any estimation. This also provides a quick
Figure 2.3 The impact of estimating with and without an outlying observation.
way to identify potential mistakes or problems in your data. For example, for some units in the sample the value of some variable could be several orders of magnitude too large to be plausibly correct. Data items that by definition cannot be negative are sometimes coded as negative. In addition, statistical agencies may code missing values as −99 or −999.
To illustrate the potential impact of outliers, consider the example in Figure 2.3. The basic sample contains 40 simulated observations based on $y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$, where $\beta_1 = 3$ and $\beta_2 = 1$, and $x_i$ is drawn from a normal distribution with mean 3 and unit variance. However, we have manually added an outlying observation corresponding to $x = 6$ and $y = 0.5$. The two lines in Figure 2.3 depict the fitted regression lines (estimated by OLS) with and without the outlier included. Clearly, the inclusion of the outlier pulls down the regression line. The estimated slope coefficient when the outlier is included is 0.52 (with a standard error of 0.18), and the $R^2$ is only 0.18. When the outlier is dropped, the estimated slope coefficient increases to 0.94 (with a standard error of 0.06), and the $R^2$ increases to 0.86. It is clear in this case that one extreme observation has a severe impact on the estimation results. In reality we cannot always be sure which regression line is closer to the true relationship, but even if the influential observation is correct, the interpretation of the regression results may change if it is known that only a few observations are primarily responsible for them.
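For readers who wish to replicate an exercise of this kind, the following minimal sketch simulates data along the lines just described and compares the OLS fits with and without the added outlier. The use of Python with numpy and statsmodels is an assumption of this illustration (the text itself mentions only EViews and Stata), and the seed is arbitrary, so the numbers will not exactly match those reported above.

```python
# Sketch: impact of a single outlier on OLS (values are illustrative only)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 40
x = rng.normal(loc=3.0, scale=1.0, size=n)   # x_i drawn from N(3, 1)
y = 3.0 + 1.0 * x + rng.normal(size=n)       # beta_1 = 3, beta_2 = 1

# manually add one outlying observation at (x, y) = (6, 0.5)
x_out = np.append(x, 6.0)
y_out = np.append(y, 0.5)

with_outlier = sm.OLS(y_out, sm.add_constant(x_out)).fit()
without_outlier = sm.OLS(y, sm.add_constant(x)).fit()

print("slope with outlier:   ", with_outlier.params[1],
      "(s.e.", with_outlier.bse[1], ") R2 =", with_outlier.rsquared)
print("slope without outlier:", without_outlier.params[1],
      "(s.e.", without_outlier.bse[1], ") R2 =", without_outlier.rsquared)
```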
A first tool to obtain some idea about the possible presence of outliers in a regression context is provided by inspecting the OLS residuals, where all of the observations are used. This, however, is not necessarily helpful. Recall that OLS is based on minimizing the residual sum of squares, given in (2.4),
$$S(\tilde{\beta}) = \sum_{i=1}^{N} \bigl(y_i - x_i'\tilde{\beta}\bigr)^2, \qquad (2.82)$$
which implies that large residuals are penalized more than proportionally. Accordingly,
OLS tries to prevent very large residuals. This is illustrated by the fact that an outlier, as in Figure 2.3, can substantially affect the estimated regression line. It is therefore a better option to investigate the residual of an observation when the model coefficients are estimated using only the rest of the sample. With $b$ denoting the full-sample OLS estimate for $\beta$, as before, we denote the OLS estimate after excluding observation $j$ from the sample by $b^{(j)}$. An easy way to calculate $b^{(j)}$ is to augment the original model with a dummy variable that is equal to one for observation $j$ and 0 otherwise. This effectively discards observation $j$. The resulting model is given by
$$y_i = x_i'\beta + \gamma d_{ij} + \varepsilon_i, \qquad (2.83)$$
where $d_{ij} = 1$ if $i = j$ and 0 otherwise. The OLS estimate for $\beta$ from this regression corresponds to the OLS estimate in the original model when observation $j$ is dropped.
The estimated value of $\gamma$ corresponds to the residual $y_j - x_j'b^{(j)}$ when the model is estimated excluding observation $j$, and the routinely calculated $t$-ratio of $\gamma$ is referred to as the studentized residual. The studentized residuals are approximately standard normally distributed (under the null hypothesis that $\gamma = 0$) and can be used to judge whether an observation is an outlier. Rather than using conventional significance levels (and a critical value of 1.96), one should pay attention to large outliers ($t$-ratios much larger than 2) and try to understand their cause. Are the outliers correctly reported and, if so, can they be explained by one or more additional explanatory variables? Davidson and MacKinnon (1993, Section 1.6) provide more discussion and background. A classic reference is Belsley, Kuh and Welsch (1980).
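A minimal sketch of this dummy-variable approach is given below; the data are simulated purely for illustration, and the loop simply re-estimates (2.83) for each candidate observation $j$ and records the $t$-ratio on the dummy.

```python
# Sketch: studentized residuals via the dummy-variable regression in (2.83)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=40)
y = 3.0 + x + rng.normal(size=40)
x = np.append(x, 6.0)                     # plant one outlier at (6, 0.5)
y = np.append(y, 0.5)

X = sm.add_constant(x)
t_stud = []
for j in range(len(y)):
    d = (np.arange(len(y)) == j).astype(float)        # dummy d_ij
    fit = sm.OLS(y, np.column_stack([X, d])).fit()
    t_stud.append(fit.tvalues[-1])                    # t-ratio of gamma

t_stud = np.array(t_stud)
worst = int(np.argmax(np.abs(t_stud)))
print("largest studentized residual:", t_stud[worst], "at observation", worst)
```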
2.9.2 Robust Estimation Methods
As mentioned above, OLS can be very sensitive to the presence of one or more extreme observations. This is due to the fact that it is based on minimizing the sum of squared residuals in (2.82), where each observation is weighted equally. Alternative estimation methods are available that are less sensitive to outliers, and a relatively popular approach is called least absolute deviations or LAD. Its objective function is given by
$$S_{\mathrm{LAD}}(\tilde{\beta}) = \sum_{i=1}^{N} |y_i - x_i'\tilde{\beta}|, \qquad (2.84)$$
which replaces the squared terms by their absolute values. There is no closed-form solution to minimizing (2.84), and the LAD estimator for $\beta$ would have to be determined using numerical optimization. This is a special case of a so-called quantile regression, and procedures are readily available in recent software packages, such as EViews and Stata.
In fact, LAD is designed to estimate the conditional median (of $y_i$ given $x_i$) rather than the conditional mean, and we know medians are less sensitive to outliers than are averages. The statistical properties of the LAD estimator are only available for large samples (see Koenker, 2005, for a comprehensive treatment). Under assumptions (A1)–(A4), the LAD estimator is consistent for the conditional mean parameters $\beta$ in (2.25) under weak regularity conditions.
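The sketch below illustrates LAD as median (q = 0.5) quantile regression, here via the QuantReg class in statsmodels; comparable routines exist in Stata and EViews. The fat-tailed error distribution is an arbitrary choice made only to show the contrast with OLS.

```python
# Sketch: LAD (median regression) versus OLS with fat-tailed errors
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=200)
y = 3.0 + x + rng.standard_t(df=2, size=200)   # heavy-tailed disturbances
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
lad = sm.QuantReg(y, X).fit(q=0.5)             # minimizes the sum of |residuals|

print("OLS slope:", ols.params[1])
print("LAD slope:", lad.params[1])
```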
Sometimes applied researchers opt for a more pragmatic approach. For example, in corporate finance studies it has become relatively common to ‘winsorize’ the data
before performing a regression. Winsorizing means that the tails of the distribution of each variable are adjusted. For example, a 99% winsorization would set all data below the 1st percentile equal to the 1st percentile, and all data above the 99th percentile to the 99th percentile. In essence this amounts to saying ‘I do not believe the data are correct, but I know that the data exist. So instead of completely ignoring the data item, I will replace it with something a bit more reasonable’ (Frank and Goyal, 2008). Estimation is done by standard methods, like ordinary least squares, treating the winsorized observations as if they are genuine observations. Note that winsorizing is different from dropping the extreme observations.
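The sketch below shows one way a 99% winsorization of a single variable might be coded; the percentile cut-offs and the simulated, right-skewed ‘wealth’ variable are illustrative choices rather than part of any particular study's procedure.

```python
# Sketch: 99% winsorization of a single variable
import numpy as np

def winsorize(v, lower=1, upper=99):
    """Clip values below the 1st and above the 99th percentile."""
    lo, hi = np.percentile(v, [lower, upper])
    return np.clip(v, lo, hi)

rng = np.random.default_rng(3)
wealth = np.exp(rng.normal(10.0, 1.0, size=5000))   # heavily right-skewed
wealth_w = winsorize(wealth)
print("max before:", wealth.max(), " max after:", wealth_w.max())
```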
Another alternative is the use of trimmed least squares (or least trimmed squares).
This corresponds to minimizing the residual sum of squares, but with the most extreme (e.g. 5%) observations – in terms of their residuals – omitted. Because the values of the residuals depend upon the estimated coefficients, the objective function is no longer a quadratic function of $\tilde{\beta}$ and the estimator would have to be determined numerically; see Rousseeuw and Leroy (2003, Chapter 3).
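As a rough illustration only, the sketch below uses a simple iterative heuristic: refit OLS on the observations with the smallest squared residuals until the trimmed set no longer changes. This is not the exact least trimmed squares algorithm analysed by Rousseeuw and Leroy, merely a way to convey the idea.

```python
# Sketch: a crude iterative approximation to trimmed least squares
import numpy as np
import statsmodels.api as sm

def trimmed_ols(y, X, trim=0.05, max_iter=50):
    keep = np.arange(len(y))
    n_keep = int(np.ceil((1 - trim) * len(y)))
    for _ in range(max_iter):
        fit = sm.OLS(y[keep], X[keep]).fit()
        resid2 = (y - X @ fit.params) ** 2          # residuals for all observations
        new_keep = np.argsort(resid2)[:n_keep]      # keep the least extreme ones
        if set(new_keep) == set(keep):
            break
        keep = new_keep
    return fit

rng = np.random.default_rng(4)
x = rng.normal(3.0, 1.0, size=100)
y = 3.0 + x + rng.normal(size=100)
y[:5] += 10.0                                       # contaminate a few observations
print(trimmed_ols(y, sm.add_constant(x)).params)
```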
Frequently, modelling logs rather than levels also helps to reduce the sensitivity of the estimation results to extreme values. For example, variables like wages, total expenditures or wealth are typically included in natural logarithms in individual-level models (see Section 3.1). With country-level data, using per capita values can also be helpful in this respect.
2.9.3 Missing Observations
A frequently encountered problem in empirical work, particularly with micro-economic data, is that of missing observations. For example, when estimating a wage equation it is possible that years of schooling are not available for a subset of the individuals.
Or, when estimating a model explaining firm performance, expenditures on research and development may be unobserved for some firms. Abrevaya and Donald (2011) report that nearly 40% of all papers recently published in four top empirical economics journals have data missingness. In such cases, a first requirement is to make sure that the missing data are properly indicated in the data set. It is not uncommon to have missing values coded as a large (negative) number, for example −999, or simply as zero. Obviously, it is incorrect to treat these ‘numbers’ as if they are actual observations. When missing data are properly indicated, regression software will automatically calculate the OLS estimator using the complete cases only. Although this involves a loss of efficiency compared to the hypothetical case when there are no missing observations, it is often the best one can do.
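The sketch below makes the point with simulated wage data: treating a −999 code as a genuine value of schooling produces a nonsensical estimate, whereas marking it as missing lets the software use the complete cases only. The variable names and the 20% missingness rate are invented for illustration.

```python
# Sketch: miscoded missing values versus proper complete-case estimation
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 500
school = rng.integers(8, 20, size=n).astype(float)
logwage = 1.0 + 0.08 * school + rng.normal(scale=0.3, size=n)
school[rng.random(n) < 0.2] = -999.0               # missings miscoded as -999

df = pd.DataFrame({"logwage": logwage, "school": school})

naive = smf.ols("logwage ~ school", data=df).fit()       # treats -999 as real data
df["school"] = df["school"].replace(-999.0, np.nan)
complete = smf.ols("logwage ~ school", data=df).fit()    # drops incomplete cases

print("schooling coefficient, -999 as data:", naive.params["school"])
print("schooling coefficient, complete cases:", complete.params["school"])
```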
However, missing observations are more problematic if they are not missing at random.
In this case the sample available for estimation may not be a random sample of the population of interest and the OLS estimator may be subject to sample selection bias. Let $r_i$ be a dummy variable indicating whether unit $i$ is in the estimation sample and thus has no missing data. Then the key condition for not having a bias in estimating the regression model explaining $y_i$ from $x_i$ is that the conditional expectation of $y_i$ given $x_i$ is not affected by conditioning upon the requirement that unit $i$ is in the sample. Mathematically, this means that the following equality holds:
$$E\{y_i \mid x_i, r_i = 1\} = E\{y_i \mid x_i\}. \qquad (2.85)$$
What we can estimate from the available sample is the left-hand side of (2.85), whereas
we are interested in the right-hand side, corresponding to (2.27), and therefore we want the two terms to coincide. The condition in (2.85) is satisfied if the probability distribution of $r_i$ given $x_i$ does not depend upon $y_i$. This means that selection into the sample is allowed to depend upon the explanatory variables $x_i$, but not upon the unobservables $\varepsilon_i$ in the regression model. For example, if we only observe wages above a certain threshold and have missing values otherwise, the OLS estimator in the wage equation will suffer from selection bias. On the other hand, when some levels of schooling are overrepresented in the sample, this does not bias the results as long as years of schooling is a regressor in the model. We will defer a full treatment of the sample selection problem and some approaches to dealing with it to Sections 7.5 and 7.6.
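A small simulation makes condition (2.85) concrete: selecting observations on the basis of $x_i$ leaves the OLS slope essentially unaffected, whereas selecting on the basis of $y_i$ (observing only wages above a threshold, say) does not. The thresholds and sample size below are arbitrary.

```python
# Sketch: selection on x is innocuous, selection on y biases OLS
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 10000
x = rng.normal(3.0, 1.0, size=n)
y = 3.0 + x + rng.normal(size=n)                  # true slope is 1
X = sm.add_constant(x)

sel_x = x > 2.5                                   # selection depends on x only
sel_y = y > 6.0                                   # selection depends on y

print("full sample slope:", sm.OLS(y, X).fit().params[1])
print("selected on x:    ", sm.OLS(y[sel_x], X[sel_x]).fit().params[1])
print("selected on y:    ", sm.OLS(y[sel_y], X[sel_y]).fit().params[1])
```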
Suppose we have a sample of 1000 individuals, observing their wages, schooling, experience and some other background characteristics. We also observe their place of residence, but this information is missing for half of the sample. This means that we can estimate a wage equation using 1000 observations, but if we wish to control for place of residence the effective sample reduces to 500. In this case we have to make a trade-off between the ability to control for place of residence in the model and the efficiency gain of using twice as many observations. In such cases, it is not uncommon to report estimation results for both model specifications using the largest possible sample. The estimation results for the two specifications will be different not only because they are based on a different set of regressor variables, but also because the samples used in estimating them are different. In the ideal case, the difference in estimation samples has no systematic impact. To check this, it makes sense to also estimate the different specifications using the same data sample. This sample will contain the cases that are common across the different subsamples (in this case 500 observations). If the results for the same model are significantly different between the samples of 500 and 1000 individuals, this suggests that condition (2.85) is violated, and further investigation into the missing data problem is warranted. The above arguments are even more important when there are missing data for several of the explanatory variables for different subsets of the original sample.
A pragmatic, but inappropriate, solution for dealing with missing data is to replace the missing values by some number, for example zero or the sample average, and to augment the regression model with a missing-data indicator, equal to one if the original observation was missing and zero otherwise. This way the complete sample can be used again. While this approach is simple and intuitively appealing, it can be shown to produce biased estimates, even if the data are missing at random (see Jones, 1996).
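The sketch below illustrates this in a simulated example with two correlated regressors, one of which has values missing completely at random: the zero-plus-indicator approach distorts the coefficient on the fully observed regressor, while complete-case OLS does not. All numbers are invented for illustration.

```python
# Sketch: zero imputation plus a missing-data dummy versus complete-case OLS
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 20000
x2 = rng.normal(size=n)                           # always observed
x1 = 0.8 * x2 + rng.normal(size=n)                # correlated with x2, partly missing
y = 1.0 + x1 + x2 + rng.normal(size=n)            # true coefficients are 1 and 1

miss = rng.random(n) < 0.4                        # missing completely at random
x1_imp = np.where(miss, 0.0, x1)                  # replace missing x1 by zero
m = miss.astype(float)                            # missing-data indicator

X_dummy = sm.add_constant(np.column_stack([x1_imp, x2, m]))
X_cc = sm.add_constant(np.column_stack([x1[~miss], x2[~miss]]))

print("dummy method (x1, x2):   ", sm.OLS(y, X_dummy).fit().params[1:3])
print("complete cases (x1, x2): ", sm.OLS(y[~miss], X_cc).fit().params[1:3])
```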
Imputation means that missing values are replaced by one or more imputed values.
Simple ad hoc imputation methods are typically not recommended. For example, replacing missing values by the sample average of the available cases will clearly distort the marginal distribution of the variable of interest as well as its covariances with other variables (illustrated in the sketch at the end of this section). Hot deck imputation, which means that missing values are replaced by random draws from the available observed values, also destroys the relationships with other variables. Little and Rubin (2002) provide an extensive treatment of missing data problems and solutions, including imputation methods. Cameron and Trivedi (2005, Chapter 27) provide more discussion of missing data and imputation in a regression context. In general, any statistical analysis that follows after missing data are imputed should take into account the approximation errors made in the imputation process. That is, imputed
data cannot be treated simply as if they are genuinely observed data (although this is commonly what happens, particularly if the proportion of imputed values is small).
Dardanoni, Modica and Peracchi (2011) provide an insightful analysis of this problem.
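As a final illustration, the sketch below shows why naive mean imputation is not innocuous: filling in the sample average shrinks the variance of the imputed variable and its covariance with the dependent variable. The data are simulated, and the 40% missingness rate is arbitrary.

```python
# Sketch: mean imputation distorts variances and covariances
import numpy as np

rng = np.random.default_rng(8)
n = 10000
x = rng.normal(3.0, 1.0, size=n)
y = 3.0 + x + rng.normal(size=n)

miss = rng.random(n) < 0.4                        # missing completely at random
x_imp = np.where(miss, x[~miss].mean(), x)        # replace missings by the mean

print("var(x):   true", x.var(), " after imputation", x_imp.var())
print("cov(x,y): true", np.cov(x, y)[0, 1], " after imputation", np.cov(x_imp, y)[0, 1])
```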