Applied Econometrics
Lecture 7: Multicollinearity
Doubt whom you will, but never yourself
1) Introduction
Multiple regression can be written as follows:
Yi = b0 + b1X1 + b2X2 + … + bkXk
Collinearity refers to a linear relationship between two X variables; multicollinearity refers to linear relationships among more than two X variables. Multiple regression is impossible in the presence of perfect collinearity or multicollinearity: if X1 and X2 have no independent variation, we cannot estimate the effect of X1 adjusting for X2, or vice versa, and one of the variables must be dropped. This is no loss, since a perfect relationship implies perfect redundancy. Perfect multicollinearity is, however, rarely a practical problem. Strong (but not perfect) multicollinearity, which permits estimation but makes it less precise, is far more common. When multicollinearity is present, the interpretation of the coefficients becomes quite difficult.
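As a small illustration of why estimation breaks down under perfect collinearity, the following numpy sketch (with made-up data, where X2 is an exact multiple of X1) shows that the cross-product matrix X'X is not of full rank, so the least squares normal equations have no unique solution:

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1                            # perfect collinearity: X2 is an exact multiple of X1
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))      # 2, not 3: one column is redundant
# X'X is (numerically) singular, so the normal equations (X'X)b = X'y have no
# unique solution and the separate effects of x1 and x2 cannot be estimated.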
2) Practical consequences of multicollinearity
Standard errors of coefficients
The easiest way to tell whether multicollinearity is causing problems is to examine the standard errors of the coefficients. If several coefficients have high standard errors, and dropping one or more variables from the equation lowers the standard errors of the remaining variables, multicollinearity is likely to be the source of the problem.
A more sophisticated analysis would take into account the fact that the covariance between estimated parameters may be sensitive to multicollinearity (a high degree of multicollinearity is associated with a relatively high covariance between estimated parameters). This suggests that if one estimate bi overestimates the true parameter βi, a second estimate bj is likely to underestimate βj, and vice versa.
Because of the large standard errors, the confidence intervals for the relevant population parameters tend to be wider.
Sensitive coefficients
One consequence of high correlation between explanatory variables is that the parameter estimates become very sensitive to the addition or deletion of observations.
A high R2 but few significant t-ratios
Few coefficients are statistically significantly different from zero, even though the coefficient of determination is high.
3) Detection of multicollinearity
3.1) There is a high R2 but few significant t-ratios. The F-test rejects the hypothesis that the partial slope coefficients are simultaneously equal to zero, but the individual t-tests show that none, or very few, of the partial slope coefficients are statistically different from zero.
3.2) Multicollinearity can be considered a serious problem only if R2y < R2i (Klein, 1962), where R2y is the squared multiple correlation coefficient between Y and the explanatory variables and R2i is the squared multiple correlation coefficient between Xi and the other explanatory variables. Note, however, that even if R2y < R2i, the t-values of the regression coefficients may still be statistically significant; and even if R2i is very high, the simple correlations among the regressors may be comparatively low.
3.3) In the regression of Y (labelled variable 1) on X2, X3, and X4, if one finds that R2(1.234), the squared multiple correlation of Y on X2, X3 and X4, is very high, but the squared partial correlations r2(12.34), r2(13.24) and r2(14.23) are comparatively low, this may suggest that the variables X2, X3, and X4 are highly intercorrelated and that at least one of them is superfluous.
3.4) We may use the overall F test to check whether there is a relationship between any one explanatory variable and the remaining explanatory variables, i.e., regress each Xi on the other X's and test the joint significance of that auxiliary regression.
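As an illustration, the auxiliary regressions can be run as in the following Python sketch, assuming a hypothetical pandas DataFrame df whose columns are the explanatory variables (statsmodels is used for the OLS fits):

import pandas as pd
import statsmodels.api as sm

def auxiliary_regressions(df):
    """Regress each explanatory variable on the remaining ones and report
    the R2_i and the overall F test of that auxiliary regression."""
    results = {}
    for col in df.columns:
        y = df[col]
        X = sm.add_constant(df.drop(columns=col))
        fit = sm.OLS(y, X).fit()
        results[col] = {"R2_i": fit.rsquared,
                        "F": fit.fvalue,
                        "p-value": fit.f_pvalue}
    return pd.DataFrame(results).T

# Example (hypothetical DataFrame of regressors):
# print(auxiliary_regressions(df[["X1", "X2", "X3"]]))

A large, significant F statistic in an auxiliary regression indicates that the corresponding Xi is well explained by the other regressors.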
3.5) In the regression of Y on X1 and X2, we may calculate λ from the following equation:
(S11 − λ)(S22 − λ) − S12² = 0

where

S11 = Σ i=1..n (X1i − X̄1)²
S22 = Σ i=1..n (X2i − X̄2)²
S12 = Σ i=1..n (X1i − X̄1)(X2i − X̄2)

and X̄1 and X̄2 denote the sample means of X1 and X2.
The condition number (Raduchel, 1971; Belsley, Kuh and Welsch, 1980) is defined as:
CN = λ1 / λ2, where λ1 > λ2.
If CN is between 10 and 30, there is moderate to strong multicollinearity.
If CN is greater than 30, there is severe multicollinearity
Since λ1 > λ2, CN exceeds one; the closer it is to one, the better conditioned the regressors are.
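A minimal numpy sketch of this calculation for two regressors (made-up data; the eigenvalues of the matrix of centred cross-products are exactly the roots λ of the equation above):

import numpy as np

def condition_number(x1, x2):
    """Eigenvalue-based condition number for two regressors: CN = lambda1 / lambda2."""
    d1 = x1 - x1.mean()
    d2 = x2 - x2.mean()
    S = np.array([[d1 @ d1, d1 @ d2],
                  [d1 @ d2, d2 @ d2]])          # matrix of centred cross-products
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]  # eigenvalues, largest first
    return lam[0] / lam[1]

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)    # a highly collinear pair
print(condition_number(x1, x2))                 # a large CN signals strong multicollinearity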
3.6) Theil’s test (Theil, 1971, p. 179)
Calculate m, which is defined as:

m = R2 − Σ i=1..k (R2 − R2−i)

where
R2 is the squared multiple correlation coefficient between Y and the explanatory variables (X1,
X2, …, Xi, …, Xk)
R2-i is the squared multiple correlation coefficient between Y and the explanatory variables (X1,
X2, …, Xi-1, Xi+1, …, Xk) with Xi omitted
If (X1, X2, …, Xi, …, Xk) are mutually uncorrelated, then m will be zero
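A Python sketch of Theil's measure, assuming y is a pandas Series and X a pandas DataFrame of the regressors, with statsmodels used for the OLS fits:

import statsmodels.api as sm

def theil_m(y, X):
    """Theil's multicollinearity measure: m = R2 - sum_i (R2 - R2_{-i}),
    where R2_{-i} is the R2 of the regression of y on X with column i dropped.
    m is close to zero when the regressors are mutually uncorrelated."""
    r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
    m = r2
    for col in X.columns:
        r2_minus_i = sm.OLS(y, sm.add_constant(X.drop(columns=col))).fit().rsquared
        m -= (r2 - r2_minus_i)
    return m

# Example with a hypothetical DataFrame:
# print(theil_m(df["TFR"], df[["FP", "lnGNP", "FL", "CM"]]))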
3.7) Variance-Inflation Factor (VIF). The VIF is defined as:

VIFi = 1 / (1 − R2i)

where R2i is the squared multiple correlation coefficient between Xi and the other explanatory variables. We may calculate it for each explanatory variable separately. The VIFi measures the degree of multicollinearity among the regressors with reference to the ideal situation in which all explanatory variables are uncorrelated (R2i = 0 implies VIFi = 1). We can interpret VIFi as the ratio of the actual variance of β̂i to what the variance of β̂i would have been if Xi were uncorrelated with the remaining X's. VIFs are useful for deciding whether to drop some variables or impose parameter constraints only in very extreme cases of multicollinearity, where R2i is close to one and VIFi is correspondingly very large.
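A Python sketch of the calculation, assuming a hypothetical pandas DataFrame of regressors and using the variance_inflation_factor helper from statsmodels (which runs the auxiliary regressions internally):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """VIF_i = 1 / (1 - R2_i) for each column of the regressor DataFrame X.
    A constant is added so the auxiliary regressions include an intercept."""
    exog = sm.add_constant(X)
    return {col: variance_inflation_factor(exog.values, i)
            for i, col in enumerate(exog.columns) if col != "const"}

# Example (VIFs near 1 indicate little multicollinearity; values of 10 or
# more are often read as a warning sign):
# print(vif_table(df[["FP", "lnGNP", "FL", "CM"]]))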
4) Remedial measures
4.1) Getting more data: Increasing the size of the sample may reduce the multicollinearity problem. The variance of the coefficient is defined as follows:
V(β̂i) = σ² / [Sii (1 − R2i)]

where σ² is the variance of the error term, Sii = Σ(Xi − X̄i)² is the sum of squared deviations of Xi from its mean over the n sample observations, and
R2i is the squared multiple correlation coefficient between Xi and the other explanatory variables.
As the sample size increases, Sii will increase. Therefore, for any given R2i, the variance of the coefficient, V(β̂i), will decrease, thus decreasing the standard error, which enables us to estimate βi more precisely.
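A small numerical illustration with made-up values: holding σ² = 1 and R2i = 0.9 fixed, doubling Sii (which grows roughly in proportion to the sample size) halves the variance of β̂i.

def var_beta(sigma2, S_ii, R2_i):
    """Var(beta_i) = sigma^2 / (S_ii * (1 - R2_i))."""
    return sigma2 / (S_ii * (1 - R2_i))

print(var_beta(1.0, 100, 0.9))   # 0.10
print(var_beta(1.0, 200, 0.9))   # 0.05 -- larger sample, smaller variance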
4.2) Transforming variables (using ratios or first differences): A regression in ratios or first differences often reduces the severity of multicollinearity. However, the first-difference regression may generate additional problems: (i) the error terms may be serially correlated; (ii) one observation is lost; (iii) first differencing may not be appropriate for cross-sectional data, where there is no logical ordering of the observations.
4.3) Dropping variables: As discussed in previous lectures, dropping a variable to alleviate multicollinearity may lead to specification bias. Hence, the remedy may be worse than the disease in some situations: while multicollinearity may prevent precise estimation of the parameters of the model, omitting a variable may seriously mislead us as to the true values of the parameters.
4.4) Using extraneous estimates (Tobin, 1950): The equation to be estimated is:
lnQ = α + β1 lnP + β2 lnI, where Q, P and I represent the quantity of the product, price and income, respectively.
The time-series data of income and price were both highly collinear
First, we estimate the income elasticity β̂2 from cross-section data; because such data refer to a single point in time, prices do not vary much. This β̂2 is known as the extraneous estimate. Second, we regress (lnQ − β̂2 lnI) on lnP to obtain the estimates α̂ and β̂1.
One may question the implicit assumption that the income elasticity estimated from the cross-section also holds for the time series, i.e., that it does not change over time. However, the technique may be worth considering in situations where the cross-sectional estimates do not vary substantially from one cross section to another.
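A sketch of the two-step procedure in Python, assuming hypothetical DataFrames cs (cross-section, with columns lnQ and lnI) and ts (time series, with columns lnQ, lnP and lnI); statsmodels OLS is used:

import statsmodels.api as sm

def extraneous_estimation(cs, ts):
    """Two-step estimation using an extraneous estimate:
    1) estimate the income elasticity beta2 from cross-section data,
       where prices vary little;
    2) regress (lnQ - beta2*lnI) on lnP in the time series to recover
       the intercept and the price elasticity."""
    beta2 = sm.OLS(cs["lnQ"], sm.add_constant(cs["lnI"])).fit().params["lnI"]
    y_adj = ts["lnQ"] - beta2 * ts["lnI"]
    step2 = sm.OLS(y_adj, sm.add_constant(ts["lnP"])).fit()
    return {"beta2 (income)": beta2,
            "alpha": step2.params["const"],
            "beta1 (price)": step2.params["lnP"]}

# Usage (hypothetical DataFrames cs and ts):
# print(extraneous_estimation(cs, ts))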
4.5) Using a priori information: Consider the following equation:

Y1 = β1X1 + β2X2

We cannot get good estimates of β1 and β2 because of the high correlation between X1 and X2. Suppose we can obtain an estimate β̂1 of β1 from another data set and another equation:

Y2 = β1X1 + α2Z

Since X1 and Z are not highly correlated, this gives a good estimate β̂1. We then regress (Y1 − β̂1X1) on X2 to obtain an estimate of β2.
5) Fragility analysis: Making sense of slope coefficients
It is a useful exercise to investigate the sensitivity of the regression coefficients across plausible neighboring specifications, in order to check the fragility of the inferences we draw from any one specification, given the uncertainty about which variables to include.
1 If the different regressors are highly correlated with one another, there is a problem of collinearity or multicollinearity. This means that the parameters we estimate are very sensitive to the model specification we use and that we may get a high R2 but insignificant coefficients (another indication of multicollinearity is that the R2s from the simple regressions do not sum to near the R2 of the multiple regression).
2 We would much prefer to have robust coefficients, which are not sensitive to small changes in the model specification. Consider the following model:
Yi = β0 + β1X1 + β2X2 + β3X3
As there are three regressors, we have seven possible equations to estimate (the number of equations is 2^k − 1, where k is the number of regressors). In some cases, there may be one or more variables we wish to include in all specifications, because we are particularly interested in that variable or are sure it cannot be omitted. The seven equations are as follows:
1) Y on X1
2) Y on X2
3) Y on X3
4) Y on X1 and X2
5) Y on X1 and X3
6) Y on X2 and X3
7) Y on X1, X2 and X3
To carry out a fragility analysis, we perform the following steps (a code sketch follows the list):
1 Estimate all seven regressions
2 Construct a table of coefficients (excluding the intercepts and including the R2)
3 If the coefficients vary widely, there is evidence of multicollinearity (look also at simple versus multiple R2)
4 To avoid problems of scale, we normalize each coefficient by dividing through by the mean of the absolute value of that coefficient across all specifications, and then calculate the maximum, minimum and range of each coefficient
5 We then can identify which regressors are robust
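A Python sketch of these steps, assuming a hypothetical pandas DataFrame: it estimates all 2^k − 1 subset regressions, normalizes each coefficient by its mean absolute value across specifications, and reports the range of each:

from itertools import combinations
import pandas as pd
import statsmodels.api as sm

def fragility_table(y, X):
    """Estimate y on every non-empty subset of the columns of X (2^k - 1 regressions),
    collect the slope coefficients and R2 of each, and summarize how much each
    coefficient moves across specifications after normalizing by its mean |value|."""
    rows = []
    for r in range(1, len(X.columns) + 1):
        for subset in combinations(X.columns, r):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            row = fit.params.drop("const").to_dict()   # slope coefficients only
            row["R2"] = fit.rsquared
            rows.append(row)
    table = pd.DataFrame(rows)                         # one row per specification
    coef = table.drop(columns="R2")
    norm = coef / coef.abs().mean()                    # scale-free comparison
    summary = pd.DataFrame({"min": norm.min(), "max": norm.max(),
                            "range": norm.max() - norm.min()})
    return table, norm, summary

# Usage (hypothetical DataFrame df from the FERTILIT example):
# table, norm, summary = fragility_table(df["TFR"], df[["FP", "lnGNP", "FL", "CM"]])

Coefficients with a small normalized range are robust; those with a large range are fragile.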
Example 5.1: An examination of fertility in developing countries
The data for this example are in the data file FERTILIT, which contains comparative cross-section data for 64 countries on fertility and its determinants, as given by the following variables:
TFR: The total fertility rate, 1980-85 The average number of children born to a woman, using age-specific fertility rates for a given year
FP: An index of family planning effort
GNP: Gross National Product per capita, 1980
FL: Female literacy rate, expressed as a percentage
CM: Child mortality. The number of deaths of children under age five in a year per 1,000 live births
Table 5.1: Summary of Coefficients from the 15 Possible Regressions
The dependent variable is the total fertility rate (TFR). With four explanatory variables, there are 15 possible regressions (2^4 − 1 = 15).
Some coefficients seem to vary a great deal and others rather less so, but it is difficult to compare them precisely because the scales of the coefficients differ. Hence, we normalize them, as shown in Table 5.2.
Table 5.2: Normalized Coefficients for TFR Regressions
From these results, we see that:
1 The income variable is the least robust (range = 2.88). The coefficient from the simple regression is about twice that from the other specifications and, in some cases, the coefficient even becomes positive.
2 The family planning coefficient is the most robust (range = 0.31). It always retains the same negative sign and varies over a comparatively small range.
3 Collinearity seems particularly severe between lnGNP and CM, so it is likely to be necessary to estimate an equation containing only one of these two variables. If both are included, neither is statistically significantly different from zero.
4 Hence a regression of TFR on FP and FL seems sensible
References
Bao, Nguyen Hoang (1995), ‘Applied Econometrics’, Lecture Notes and Readings, Vietnam-Netherlands Project for MA Program in Economics of Development.
Belsley, D. A., E. Kuh and R. Welsch (1980), Regression Diagnostics, New York: Wiley.
Gujarati, Damodar N. (1988), Basic Econometrics, Second Edition, New York: McGraw-Hill.
Klein, L. R. (1962), An Introduction to Econometrics, Englewood Cliffs, N.J.: Prentice-Hall, p. 101.
Maddala, G. S. (1992), Introduction to Econometrics, New York: Macmillan.
Mukherjee, Chandan, Howard White and Marc Wuyts (1998), Econometrics and Data Analysis for Developing Countries, London: Routledge.
Raduchel, W. J. (1971), ‘Multicollinearity Once Again’, Harvard Institute of Economic Research, Paper 205, Cambridge, Mass.
Theil, H. (1971), Principles of Econometrics, New York: Wiley, p. 179.
Workshop 7: Multicollinearity
1) In the regression of Y on X1 and X2, match up the equivalent statements:
a) There is multicollinearity in the regressors
b) Y has a nearly perfect linear relation to X1 and X2
c) The multiple correlation of Y on X1 and X2 is nearly one
d) The residual variance after regression is very small compared to the variance of Y without regression
e) X1 and X2 have high correlation
2) Using the data file KRISNAIJ, we estimate the following model:
lnM = β1 + β2lnY + β3lnPf + β4lnPm
The above model specification is, in fact, a restricted version of a more elaborate model which includes, apart from the income variable Y and the price variables Pf (the price of cereals) and Pm, two more price variables: namely, Pof, a price index of other food products, and Ps, a price index of consumer services. Including these last two variables in the double-log specification yields a six-variable regression.
2.1) Construct a table with the results of all possible regressions, which at least include the income variable (why?)
2.2) Construct comparative box plots of the variation in the slope coefficient of each regressor
in the model
2.3) Judging from your table, check whether there is much evidence of multicollinearity
2.4) Check whether any variables in any of the specifications appear superfluous
2.5) How robust is the income elasticity across alternative specifications?
2.6) In your opinion, which price variables appear to be most relevant in the model?