
DOCUMENT INFORMATION

Title: Econometrics
Author: Bruce E. Hansen
Institution: University of Wisconsin
Document type: Article
Year: 2007
City: Wisconsin
Format: PDF
Number of pages: 167
File size: 1.18 MB


Contents


1 Introduction
  1.1 Economic Data
  1.2 Observational Data
  1.3 Economic Data
2 Regression and Projection
  2.1 Variables
  2.2 Conditional Density and Mean
  2.3 Regression Equation
  2.4 Conditional Variance
  2.5 Linear Regression
  2.6 Best Linear Predictor
  2.7 Technical Proofs
  2.8 Exercises
3 Least Squares Estimation
  3.1 Random Sample
  3.2 Estimation
  3.3 Least Squares
  3.4 Normal Regression Model
  3.5 Model in Matrix Notation
  3.6 Projection Matrices
  3.7 Residual Regression
  3.8 Bias and Variance
  3.9 Gauss-Markov Theorem
  3.10 Semiparametric Efficiency
  3.11 Multicollinearity
  3.12 Influential Observations
  3.13 Technical Proofs
  3.14 Exercises
4 Inference
  4.1 Sampling Distribution
  4.2 Consistency
  4.3 Asymptotic Normality
  4.4 Covariance Matrix Estimation
  4.5 Alternative Covariance Matrix Estimators
  4.6 Functions of Parameters
  4.7 t tests
  4.8 Confidence Intervals
  4.9 Wald Tests
  4.10 F Tests
  4.11 Normal Regression Model
  4.12 Problems with Tests of Nonlinear Hypotheses
  4.13 Monte Carlo Simulation
  4.14 Estimating a Wage Equation
  4.15 Technical Proofs
  4.16 Exercises
5 Additional Regression Topics
  5.1 Generalized Least Squares
  5.2 Testing for Heteroskedasticity
  5.3 Forecast Intervals
  5.4 Nonlinear Least Squares
  5.5 Least Absolute Deviations
  5.6 Quantile Regression
  5.7 Testing for Omitted Nonlinearity
  5.8 Omitted Variables
  5.9 Irrelevant Variables
  5.10 Model Selection
  5.11 Technical Proofs
  5.12 Exercises
6 The Bootstrap
  6.1 Definition of the Bootstrap
  6.2 The Empirical Distribution Function
  6.3 Nonparametric Bootstrap
  6.4 Bootstrap Estimation of Bias and Variance
  6.5 Percentile Intervals
  6.6 Percentile-t Equal-Tailed Interval
  6.7 Symmetric Percentile-t Intervals
  6.8 Asymptotic Expansions
  6.9 One-Sided Tests
  6.10 Symmetric Two-Sided Tests
  6.11 Percentile Confidence Intervals
  6.12 Bootstrap Methods for Regression Models
  6.13 Exercises
7 Generalized Method of Moments
  7.1 Overidentified Linear Model
  7.2 GMM Estimator
  7.3 Distribution of GMM Estimator
  7.4 Estimation of the Efficient Weight Matrix
  7.5 GMM: The General Case
  7.6 Over-Identification Test
  7.7 Hypothesis Testing: The Distance Statistic
  7.8 Conditional Moment Restrictions
  7.9 Bootstrap GMM Inference
  7.10 Exercises
8 Empirical Likelihood
  8.1 Non-Parametric Likelihood
  8.2 Asymptotic Distribution of EL Estimator
  8.3 Overidentifying Restrictions
  8.4 Testing
  8.5 Numerical Computation
  8.6 Technical Proofs
9 Endogeneity
  9.1 Instrumental Variables
  9.2 Reduced Form
  9.3 Identification
  9.4 Estimation
  9.5 Special Cases: IV and 2SLS
  9.6 Bekker Asymptotics
  9.7 Identification Failure
  9.8 Exercises
10 Univariate Time Series
  10.1 Stationarity and Ergodicity
  10.2 Autoregressions
  10.3 Stationarity of AR(1) Process
  10.4 Lag Operator
  10.5 Stationarity of AR(k)
  10.6 Estimation
  10.7 Asymptotic Distribution
  10.8 Bootstrap for Autoregressions
  10.9 Trend Stationarity
  10.10 Testing for Omitted Serial Correlation
  10.11 Model Selection
  10.12 Autoregressive Unit Roots
  10.13 Technical Proofs
11 Multivariate Time Series
  11.1 Vector Autoregressions (VARs)
  11.2 Estimation
  11.3 Restricted VARs
  11.4 Single Equation from a VAR
  11.5 Testing for Omitted Serial Correlation
  11.6 Selection of Lag Length in a VAR
  11.7 Granger Causality
  11.8 Cointegration
  11.9 Cointegrated VARs
12 Limited Dependent Variables
  12.1 Binary Choice
  12.2 Count Data
  12.3 Censored Data
  12.4 Sample Selection
13 Panel Data
  13.1 Individual-Effects Model
  13.2 Fixed Effects
  13.3 Dynamic Panel Regression
14 Nonparametrics
  14.1 Kernel Density Estimation
  14.2 Asymptotic MSE for Kernel Estimates
A Matrix Algebra
  A.1 Notation
  A.2 Matrix Addition
  A.3 Matrix Multiplication
  A.4 Trace
  A.5 Rank and Inverse
  A.6 Determinant
  A.7 Eigenvalues
  A.8 Positive Definiteness
  A.9 Matrix Calculus
  A.10 Kronecker Products and the Vec Operator
  A.11 Vector and Matrix Norms
B Probability
  B.1 Foundations
  B.2 Random Variables
  B.3 Expectation
  B.4 Common Distributions
  B.5 Multivariate Random Variables
  B.6 Conditional Distributions and Expectation
  B.7 Transformations
  B.8 Normal and Related Distributions
C Asymptotic Theory
  C.1 Inequalities
  C.2 Weak Law of Large Numbers
  C.3 Convergence in Distribution
  C.4 Asymptotic Transformations
D Maximum Likelihood
E Numerical Optimization
  E.1 Grid Search
  E.2 Gradient Methods
  E.3 Derivative-Free Methods

Chapter 1

Introduction

Econometrics is the study of estimation and inference for economic models using economic data. Econometric theory concerns the study and development of tools and methods for applied econometric applications. Applied econometrics concerns the application of these tools to economic data.

1.1 Economic Data

An econometric study requires data for analysis. The quality of the study will be largely determined by the data available. There are three major types of economic data sets: cross-sectional, time-series, and panel. They are distinguished by the dependence structure across observations.

Cross-sectional data sets are characterized by mutually independent observations. Surveys are a typical source for cross-sectional data. The individuals surveyed may be persons, households, or corporations. Time-series data is indexed by time. Typical examples include macroeconomic aggregates, prices and interest rates. This type of data is characterized by serial dependence.

Panel data combines elements of cross-section and time-series. These data sets consist of surveys of a set of individuals, repeated over time. Each individual (person, household or corporation) is surveyed on multiple occasions.

1.2 Observational Data

A common econometric question is to quantify the impact of one set of variables on another variable. For example, a concern in labor economics is the returns to schooling (the change in earnings induced by increasing a worker's education, holding other variables constant). Another issue of interest is the earnings gap between men and women.

Ideally, we would use experimental data to answer these questions. To measure the returns to schooling, an experiment might randomly divide children into groups, mandate different levels of education to the different groups, and then follow the children's wage path as they mature and enter the labor force. The differences between the groups could be attributed to the different levels of education. However, experiments such as this are infeasible, even immoral!

Instead, most economic data is observational. To continue the above example, what we observe (through data collection) is the level of a person's education and their wage. We can measure the joint distribution of these variables, and assess the joint dependence. But we cannot infer causality, as we are not able to manipulate one variable to see the direct effect on the other. For example, a person's level of education is (at least partially) determined by that person's choices and their achievement in education. These factors are likely to be affected by their personal abilities and attitudes towards work. The fact that a person is highly educated suggests a high level of ability. This is an alternative explanation for an observed positive correlation between educational levels and wages. High ability individuals do better in school, and therefore choose to attain higher levels of education, and their high ability is the fundamental reason for their high wages. The point is that multiple explanations are consistent with a positive correlation between schooling levels and wages. Knowledge of the joint distribution cannot distinguish between these explanations. This discussion means that causality cannot be inferred from observational data alone. Causal inference requires identification, and this is based on strong assumptions. We will return to a discussion of some of these issues in Chapter 9.


1.3 Economic Data

Fortunately for economists, the development of the internet has provided a convenient forum for dissemination of economic data. Many large-scale economic datasets are available without charge from governmental agencies. An excellent starting point is the Resources for Economists Data Links, available at http://rfe.wustl.edu/Data/index.html

Some other excellent data sources are listed below.

Bureau of Labor Statistics: http://www.bls.gov/

Federal Reserve Bank of St. Louis: http://research.stlouisfed.org/fred2/

Board of Governors of the Federal Reserve System: http://www.federalreserve.gov/releases/

National Bureau of Economic Research: http://www.nber.org/

US Census: http://www.census.gov/econ/www/

Current Population Survey (CPS): http://www.bls.census.gov/cps/cpsmain.htm

Survey of Income and Program Participation (SIPP): http://www.sipp.census.gov/sipp/

Panel Study of Income Dynamics (PSID): http://psidonline.isr.umich.edu/

U.S. Bureau of Economic Analysis: http://www.bea.doc.gov/

CompuStat: http://www.compustat.com/www/

International Financial Statistics (IFS): http://ifs.apdi.net/imf/


k regressors. It is convenient to write the set of regressors as a vector in $\mathbb{R}^k$:
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix}. \tag{2.1}$$

Following mathematical convention, real numbers (elements of the real line $\mathbb{R}$) are written using lower case italics such as $y$, and vectors (elements of $\mathbb{R}^k$) by lower case bold italics such as $x$. Upper case bold italics such as $X$ will be used for matrices.

The random variables $(y, x)$ have a distribution $F$ which we call the population. This "population" is infinitely large. This abstraction can be a source of confusion as it does not correspond to a physical population in the real world. The distribution $F$ is unknown, and the goal of statistical inference is to learn about features of $F$ from the sample.

At this point in our analysis it is unimportant whether the observations $y$ and $x$ come from continuous or discrete distributions. For example, many regressors in econometric practice are binary, taking on only the values 0 and 1, and are typically called dummy variables.

2.2 Conditional Density and Mean

To study how the distribution of $y$ varies with the variables $x$ in the population, we start with $f(y \mid x)$, the conditional density of $y$ given $x$.

To illustrate, Figure 2.1 displays the density¹ of hourly wages for men and women, from the population of white non-military wage earners with a college degree and 10-15 years of potential work experience. These are conditional density functions: the density of hourly wages conditional on race, gender, education and experience. The two density curves show the effect of gender on the distribution of wages, holding the other variables constant.

While it is easy to observe that the two densities are unequal, it is useful to have numerical measures of the difference. An important summary measure is the conditional mean
$$m(x) = E\left(y \mid x\right). \tag{2.2}$$

¹ These are nonparametric density estimates using a Gaussian kernel with the bandwidth selected by cross-validation. See Chapter 14. The data are from the 2004 Current Population Survey.


Figure 2.1: Wage Densities for White College Grads with 10-15 Years Work Experience

Take a closer look at the density functions displayed in Figure 2.1. You can see that the right tail of the density is much thicker than the left tail. These are asymmetric (skewed) densities, which is a common feature of wage distributions. When a distribution is skewed, the mean is not necessarily a good summary of the central tendency. In this context it is often convenient to transform the data by taking the (natural) logarithm. Figure 2.2 shows the density of log hourly wages for the same population, with mean log hourly wages drawn in with the arrows. The difference in the log mean wage between men and women is 0.30, which implies a 30% average wage difference for this population. This is a more robust measure of the typical wage gap between men and women than the difference in the untransformed wage means. For this reason, wage regressions typically use log wages as a dependent variable rather than the level of wages.

The comparison in Figure 2.1 is facilitated by the fact that the control variable (gender) is discrete. When the distribution of the control variable is continuous, then comparisons become more complicated. To illustrate, Figure 2.3 displays a scatter plot² of log wages against education levels. Assuming for simplicity that this is the true joint distribution, the solid line displays the conditional expectation of log wages varying with education. The conditional expectation function is close to linear; the dashed line is a linear projection approximation which will be discussed in Section 2.6. The main point to be learned from Figure 2.3 is that the conditional expectation describes the central tendency of the conditional distribution. Of particular interest to graduate students may be the observation that the difference between a B.A. and a Ph.D. degree in mean log hourly wages is 0.36, implying an average 36% difference in wage levels.

2.3 Regression Equation

The regression error $e$ is defined to be the difference between $y$ and its conditional mean (2.2) evaluated at the observed value of $x$:
$$e = y - m(x).$$


Figure 2.2: Log Wage Densities

3. $E\left(h(x)e\right) = 0$ for any function $h(\cdot)$.

The conditional mean also has the property of being the best predictor of $y$, in the sense of achieving the lowest mean squared error. To see this, let $g(x)$ be an arbitrary predictor of $y$ given $x$. The expected squared error using this prediction function is
$$E\left(y - g(x)\right)^2 = E\left(e + m(x) - g(x)\right)^2$$
$$= Ee^2 + 2E\left(e\left(m(x) - g(x)\right)\right) + E\left(m(x) - g(x)\right)^2$$
$$= Ee^2 + E\left(m(x) - g(x)\right)^2$$
$$\geq Ee^2,$$
where the second equality uses Theorem 2.3.1.3. The right-hand side is minimized by setting $g(x) = m(x)$. Thus the mean squared error is minimized by the conditional mean.


Figure 2.3: Conditional Mean of Wages Given Education

2.4 Conditional Variance

While the conditional mean is a good measure of the location of a conditional distribution, it does not provide information about the spread of the distribution. A common measure of the dispersion is the conditional variance
$$\sigma^2(x) = \mathrm{var}\left(y \mid x\right) = E\left(e^2 \mid x\right).$$
Generally, $\sigma^2(x)$ is a non-trivial function of $x$, and can take any form, subject to the restriction that it is non-negative.

As an example, take the conditional wage densities displayed in Figure 2.1. The conditional standard deviation for men is 12.1 and that for women is 10.5. So while men have higher average wages, they are also somewhat more dispersed.

2.5 Linear Regression

An important special case of (2.3) is when the conditional mean function $m(x)$ is linear in $x$ (or linear in functions of $x$). Notationally, it is convenient to augment the regressor vector $x$ by listing the number "1" as an element. We call this the "constant" or "intercept". Equivalently, we assume that $x_1 = 1$, where $x_1$ is the first element of the vector $x$ defined in (2.1). Thus (2.1) has been redefined as the $k \times 1$ vector
$$x = \begin{pmatrix} 1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix}.$$

When $m(x)$ is linear in $x$, we can write it as


$$m(x) = x'\beta, \tag{2.6}$$
where $\beta$ is a $k \times 1$ coefficient vector.

Combining (2.6) with the regression equation $y = m(x) + e$ yields the pair of equations
$$y = x'\beta + e \tag{2.8}$$
$$E\left(e \mid x\right) = 0. \tag{2.9}$$
Equations (2.8)-(2.9) are called the linear regression model. An important special case is the homoskedastic linear regression model, which adds the assumption that the conditional variance $E\left(e^2 \mid x\right) = \sigma^2$ does not depend on $x$.

2.6 Best Linear Predictor

While the conditional mean $m(x) = E(y \mid x)$ is the best predictor of $y$ among all functions of $x$, its functional form is typically unknown, and the linear assumption of the previous section is empirically unlikely to be accurate. Instead, it is more realistic to view the linear specification (2.6) as an approximation. We derive an appropriate approximation in this section.

In the linear projection model the coefficient $\beta$ is defined so that the function $x'\beta$ is the best linear predictor of $y$. As before, by "best" we mean the predictor function with lowest mean squared error. For any $\beta \in \mathbb{R}^k$ a linear predictor for $y$ is $x'\beta$ with expected squared prediction error
$$E\left(y - x'\beta\right)^2.$$
The first-order condition for minimization (from Appendix A.9) is
$$-2E(xy) + 2E\left(xx'\right)\beta = 0,$$
whose solution is
$$\beta = \left(E\left(xx'\right)\right)^{-1} E(xy). \tag{2.10}$$

The matrix $Q = E\left(xx'\right)$ plays an important role in least-squares theory so we will discuss some of its properties in detail. Observe that for any non-zero $\alpha \in \mathbb{R}^k$,
$$\alpha'Q\alpha = E\left(\alpha'xx'\alpha\right) = E\left(\alpha'x\right)^2 \geq 0,$$
so $Q$ is by construction positive semi-definite. It is invertible if and only if it is positive definite, which requires that for all non-zero $\alpha$, $E\left(\alpha'x\right)^2 > 0$. Equivalently, there cannot exist a non-zero vector $\alpha$ such that $\alpha'x = 0$ identically. This occurs when redundant variables are included in $x$. In order for $\beta$ to be uniquely defined, this situation must be excluded.

Given the definition of $\beta$ in (2.10), $x'\beta$ is the best linear predictor for $y$. The error is
$$e = y - x'\beta. \tag{2.11}$$
Rearranging, we obtain the linear projection model
$$y = x'\beta + e \tag{2.12}$$
whose error satisfies
$$E(xe) = 0. \tag{2.14}$$


This completes the derivation of the model. We call $x'\beta$ alternatively the best linear predictor of $y$ given $x$, or the linear projection of $y$ onto $x$. In general we will call equation (2.12) the linear projection model.

We now summarize the assumptions necessary for its derivation and list the implications in Theorem 2.6.1. A complete proof of Theorem 2.6.1 is presented in Section 2.7.

The two equations (2.12) and (2.14) summarize the linear projection model. Let's compare it with the linear regression model (2.8)-(2.9). Since from Theorem 2.3.1.4 we know that the regression error has the property $E(xe) = 0$, it follows that linear regression is a special case of the projection model. However, the converse is not true as the projection error does not necessarily satisfy $E(e \mid x) = 0$.

We have shown that under mild regularity conditions, for any pair $(y, x)$ we can define a linear equation (2.12) with the properties listed in Theorem 2.6.1. No additional assumptions are required. However, it is important to not misinterpret the generality of this statement. The linear equation (2.12) is defined by the definition of the best linear predictor and the associated coefficient definition (2.10). In contrast, in many economic models the parameter $\beta$ may be defined within the model. In this case (2.10) may not hold and the implications of Theorem 2.6.1 may be false. These structural models require alternative estimation methods, and are discussed in Chapter 9.

Returning to the joint distribution displayed in Figure 2.3, the dashed line is the projection of log wages onto education. In this example the linear predictor is a close approximation to the conditional mean. In other cases the two may be quite different. Figure 2.4 displays the relationship³ between mean log hourly wages and labor market experience. The solid line is the conditional mean, and the straight dashed line is the linear projection. In this case the linear projection is a poor approximation to the conditional mean. It over-predicts wages for young and old workers, and under-predicts for the rest. Most importantly, it misses the strong downturn in expected wages for those above 35 years work experience (equivalently, for those over 53 in age).

This defect in the best linear predictor can be partially corrected through a careful selection of regressors. In the example just presented, we can augment the regressor vector $x$ to include both experience and experience². The best linear predictor of log wages given these two variables can be called a quadratic projection, since the resulting function is quadratic in experience. Other than the redefinition of the regressor vector, there are no changes in our methods or analysis. In Figure 2.4 we display as well the quadratic projection. In this example it is a much better approximation to the conditional mean than the linear projection.
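To make the linear-versus-quadratic comparison concrete, here is a minimal simulation sketch. The data-generating design (a hump-shaped conditional mean in an invented "experience" variable) is an assumption chosen for illustration only; it is not the CPS data behind Figure 2.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Hypothetical design (not from the text): hump-shaped conditional mean in "experience".
exper = rng.uniform(0, 45, size=n)
m = 1.0 + 0.06 * exper - 0.0012 * exper**2        # true conditional mean m(x)
y = m + rng.normal(0.0, 0.5, size=n)              # add a regression error

def ols(X, y):
    # least-squares coefficients (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

X1 = np.column_stack([np.ones(n), exper])             # linear projection regressors
X2 = np.column_stack([np.ones(n), exper, exper**2])   # quadratic projection regressors
b1, b2 = ols(X1, y), ols(X2, y)

grid = np.linspace(0, 45, 46)
G1 = np.column_stack([np.ones_like(grid), grid])
G2 = np.column_stack([np.ones_like(grid), grid, grid**2])
m_true = 1.0 + 0.06 * grid - 0.0012 * grid**2
print("max error, linear projection   :", np.abs(G1 @ b1 - m_true).max())
print("max error, quadratic projection:", np.abs(G2 @ b2 - m_true).max())
```

The quadratic projection tracks the nonlinear mean closely, while the linear projection misses the downturn, mirroring the pattern described for Figure 2.4.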

3 In the population of Caucasian non-military male wage earners with 12 years of education.


Figure 2.4: Hourly Wage as a Function of Experience

Another defect of linear projection is that it is sensitive to the marginal distribution of the regressors when the conditional mean is non-linear. We illustrate the issue in Figure 2.5 for a constructed⁴ joint distribution of $y$ and $x$. The solid line is the non-linear conditional mean of $y$ given $x$. The data are divided in two groups, Group 1 and Group 2, which have different marginal distributions for the regressor $x$; Group 1 has a lower mean value of $x$ than Group 2. The separate linear projections of $y$ on $x$ for these two groups are displayed in the figure by the dashed lines. These two projections are distinct approximations to the conditional mean. A defect with linear projection is that it leads to the incorrect conclusion that the effect of $x$ on $y$ is different for individuals in the two groups. This conclusion is incorrect because in fact there is no difference in the conditional mean function. The apparent difference is a by-product of a linear approximation to a non-linear mean, combined with different marginal distributions for the conditioning variables.
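Footnote 4 in Section 2.7 spells out the constructed design behind Figure 2.5, so the artifact can be checked directly. The sketch below simulates the two groups under that design and shows that the fitted projection slopes differ even though both groups share the conditional mean $m(x) = 2x - x^2/6$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def m(x):
    # common conditional mean from footnote 4
    return 2 * x - x**2 / 6

def proj_slope(x, y):
    # slope of the linear projection of y on (1, x)
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

x1 = rng.normal(2.0, 1.0, n)              # Group 1 regressor: N(2, 1)
x2 = rng.normal(4.0, 1.0, n)              # Group 2 regressor: N(4, 1)
y1 = rng.normal(m(x1), 1.0)               # y | x ~ N(m(x), 1) in both groups
y2 = rng.normal(m(x2), 1.0)

print("Group 1 projection slope:", proj_slope(x1, y1))   # near 2 - 2/3 = 1.33
print("Group 2 projection slope:", proj_slope(x2, y2))   # near 2 - 4/3 = 0.67
```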

2.7 Technical Proofs

First, it is useful to note that Assumption 2.6.1.3 implies that

Equation (2.10) states that $\beta = \left(E\left(xx'\right)\right)^{-1}E(xy)$, which is well defined since $\left(E\left(xx'\right)\right)^{-1}$ exists under Assumption 2.6.1.4. It follows that $e = y - x'\beta$ as defined in (2.11) is also well defined.

⁴ The $x_i$ in Group 1 are $N(2,1)$ and those in Group 2 are $N(4,1)$, and the conditional distribution of $y$ given $x$ is $N(m(x), 1)$ where $m(x) = 2x - x^2/6$.


Figure 2.5: Conditional Mean and Two Linear Projections

Note that the Schwarz Inequality (A.7) implies $\left(x'\beta\right)^2 \leq \|x\|^2\|\beta\|^2$ and therefore, combined with (2.16), we see that $E\left(x'\beta\right)^2 < \infty$.

An application of the Cauchy-Schwarz Inequality (C.3) shows that for any $j$,
$$E\left|x_j e\right| \leq \left(E x_j^2\right)^{1/2}\left(E e^2\right)^{1/2} < \infty,$$
and therefore the elements in the vector $E(xe)$ are well defined and finite.

Using the definitions (2.11) and (2.10), and the matrix properties that $AA^{-1} = I$ and $Ia = a$,
$$E(xe) = E\left(x\left(y - x'\beta\right)\right) = E(xy) - E\left(xx'\right)\left(E\left(xx'\right)\right)^{-1}E(xy) = 0.$$


2.8 Exercises

1. Prove parts 2, 3 and 4 of Theorem 2.3.1.

2. Suppose that the random variables $y$ and $x$ only take the values 0 and 1, and have the following joint probability distribution. Find $E(y \mid x)$, $E\left(y^2 \mid x\right)$ and $\mathrm{var}(y \mid x)$ for $x = 0$ and $x = 1$.

3. Suppose that $y$ is discrete-valued, taking values only on the non-negative integers, and the conditional distribution of $y$ given $x$ is Poisson:

$E(y \mid x)$. Are they different?

5. Take the bivariate linear projection model.

6. True or False. If $y = x\beta + e$, $x \in \mathbb{R}$, and $E(e \mid x) = 0$, then $E\left(x^2 e\right) = 0$.

7. True or False. If $y = x'\beta + e$ and $E(e \mid x) = 0$, then $e$ is independent of $x$.

8. True or False. If $y = x'\beta + e$, $E(e \mid x) = 0$, and $E\left(e^2 \mid x\right) = \sigma^2$, a constant, then $e$ is independent of $x$.

9. True or False. If $y = x\beta + e$, $x \in \mathbb{R}$, and $E(xe) = 0$, then $E\left(x^2 e\right) = 0$.

10. True or False. If $y = x'\beta + e$ and $E(xe) = 0$, then $E(e \mid x) = 0$.

Show that $E g(x \mid m, s) = 0$ if and only if $m = \mu$ and $s = \sigma^2$.


In a typical application, an econometrician's data is a set of observed measurements on the variables $(y, x)$ for a group of individuals. These individuals may be persons, households, firms or other economic agents. We call this information the data, dataset, or sample, and denote the number of individuals in the dataset by the natural number $n$.

We will use the index $i$ to indicate the $i$'th individual in the dataset. The observation for the $i$'th individual will be written as $(y_i, x_i)$: $y_i$ is the observed value of $y$ for individual $i$ and $x_i$ is the observed value of $x$ for the same individual.

If the data is cross-sectional (each observation is a different individual) it is often reasonable to assume the observations are mutually independent. This means that the pair $(y_i, x_i)$ is independent of $(y_j, x_j)$ for $i \neq j$. (Sometimes the label independent is misconstrued. It is not a statement about the relationship between $y_i$ and $x_i$.) Furthermore, if the data is randomly gathered, it is reasonable to model each observation as a random draw from the same probability distribution. In this case we say that the data are independent and identically distributed, or iid. We call this a random sample.

Assumption 3.1.1 The observations $(y_i, x_i)$, $i = 1, \ldots, n$, are mutually independent across observations $i$ and identically distributed.

This chapter explores estimation and inference in the linear projection model for a random sample:


It follows that the moment estimator of $\beta$ replaces the population moments in (3.3) with the sample moments:
$$\hat{\beta} = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i\right). \tag{3.4}$$

We often write the estimated equation using the format
$$\widehat{\log(Wage)} = 1.313 + 0.128\; Education.$$
An interpretation of the estimated equation is that each year of education is associated with a 12.8% increase in mean wages.
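As a rough sketch of this computation (the CPS extract used in the text is not included here, so simulated data stand in for it, and the coefficients in the data-generating line are invented), the moment estimator can be coded directly from the sample moments:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 988                                    # sample size used in the text's example
educ = rng.integers(8, 21, size=n).astype(float)
# Simulated stand-in for the CPS data: log wages roughly linear in education plus noise.
logwage = 1.3 + 0.13 * educ + rng.normal(0.0, 0.5, size=n)

X = np.column_stack([np.ones(n), educ])    # regressors: constant and education
Sxx = X.T @ X / n                          # (1/n) sum x_i x_i'
Sxy = X.T @ logwage / n                    # (1/n) sum x_i y_i
beta_hat = np.linalg.solve(Sxx, Sxy)       # moment estimator (3.4)
print("intercept, education slope:", beta_hat)
```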


Figure 3.1: Sum-of-Squared Errors Function

calculus (see Appendix A.9) gives the first-order conditions for minimization:
$$0 = -2\sum_{i=1}^{n} x_i\left(y_i - x_i'\hat{\beta}\right),$$
whose solution is (3.4). Following convention we will call $\hat{\beta}$ the OLS estimator of $\beta$.

As a by-product of OLS estimation, we define the predicted value
$$\hat{y}_i = x_i'\hat{\beta}$$
and the residual
$$\hat{e}_i = y_i - \hat{y}_i = y_i - x_i'\hat{\beta}.$$
Note that $y_i = \hat{y}_i + \hat{e}_i$. It is important to understand the distinction between the error $e_i$ and the residual $\hat{e}_i$. The error is unobservable, while the residual is a by-product of estimation. These two variables are frequently mislabeled, which can cause confusion.


Equation (3.5) implies that
$$\frac{1}{n}\sum_{i=1}^{n} x_i \hat{e}_i = 0.$$

The error variance $\sigma^2 = Ee_i^2$ is also a parameter of interest. It measures the variation in the "unexplained" part of the regression. Its method of moments estimator is the sample average of the squared residuals,
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{e}_i^2.$$
An alternative estimator uses the divisor $n - k$ rather than $n$:
$$s^2 = \frac{1}{n-k}\sum_{i=1}^{n}\hat{e}_i^2.$$
A justification for the latter choice will be provided in Section 3.8.

A measure of the explained variation relative to the total variation is the coefficient of determination,
$$R^2 = 1 - \frac{\hat{\sigma}^2}{\hat{\sigma}_y^2},$$
where $\hat{\sigma}_y^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$ is the sample variance of $y_i$. Its population counterpart is
$$\rho^2 = \frac{\mathrm{var}\left(x_i'\beta\right)}{\mathrm{var}\left(y_i\right)} = 1 - \frac{\sigma^2}{\sigma_y^2},$$
where $\sigma_y^2 = \mathrm{var}\left(y_i\right)$. An alternative estimator of $\rho^2$ proposed by Theil, called "R-bar-squared," is
$$\bar{R}^2 = 1 - \frac{s^2}{\tilde{\sigma}_y^2},$$
where $\tilde{\sigma}_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2.$
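A minimal sketch of these variance and fit measures as reconstructed above; the helper accepts any design matrix with a constant column and outcome vector, and the simulated inputs in the usage example are assumptions.

```python
import numpy as np

def fit_stats(X, y):
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e_hat = y - X @ beta                       # residuals
    sig2_hat = e_hat @ e_hat / n               # method of moments estimator of sigma^2
    s2 = e_hat @ e_hat / (n - k)               # alternative estimator with divisor n - k
    R2 = 1.0 - sig2_hat / np.var(y)            # coefficient of determination
    R2_bar = 1.0 - s2 / np.var(y, ddof=1)      # Theil's "R-bar-squared"
    return beta, sig2_hat, s2, R2, R2_bar

# Usage with simulated data (design and coefficients are arbitrary):
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.8, size=500)
print(fit_stats(X, y))
```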

3.4 Normal Regression Model

Another motivation for the least-squares estimator can be obtained from the normal regression model. This is the linear regression model with the additional assumption that the error $e_i$ is independent of $x_i$ and has the distribution $N\left(0, \sigma^2\right)$. This is a parametric model, where likelihood methods can be used for estimation, testing, and distribution theory.

The log-likelihood function for the normal regression model is
$$\log L\left(\beta, \sigma^2\right) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2.$$
Maximizing over $\beta$ is equivalent to minimizing the sum of squared errors, so the maximum likelihood estimator of $\beta$ equals the OLS estimator $\hat{\beta}$. Plugging $\hat{\beta}$ into the log-likelihood and maximizing over $\sigma^2$, we obtain the maximum likelihood estimator $\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^{n}\hat{e}_i^2$, which is the method of moments estimator $\hat{\sigma}^2$.
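A quick numerical check of this equivalence, assuming the Gaussian log-likelihood written above: evaluated at the OLS coefficients and the implied variance estimate, the log-likelihood is no smaller than at perturbed coefficients.

```python
import numpy as np

def gaussian_loglik(beta, sig2, X, y):
    # log L = -(n/2) log(2 pi sig2) - (1/(2 sig2)) sum (y_i - x_i'beta)^2
    e = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sig2) - 0.5 * (e @ e) / sig2

rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sig2_mle = np.mean((y - X @ beta_ols) ** 2)        # (1/n) sum of squared residuals

print(gaussian_loglik(beta_ols, sig2_mle, X, y))          # maximized value
print(gaussian_loglik(beta_ols + 0.1, sig2_mle, X, y))    # strictly smaller
```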

3.5 Model in Matrix Notation

For many purposes, including computation, it is convenient to write the model and statistics in matrix notation. The linear equation (2.12) is a system of $n$ equations, one for each observation. We can stack these equations by defining
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}, \qquad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix},$$
so that the system of $n$ equations can be compactly written in the single equation
$$y = X\beta + e.$$


A useful result is obtained by inserting $y = X\beta + e$ into the formula for $\hat{\beta}$ to obtain
$$\hat{\beta} = \left(X'X\right)^{-1}X'\left(X\beta + e\right) = \beta + \left(X'X\right)^{-1}X'e.$$

Define
$$P = X\left(X'X\right)^{-1}X', \qquad M = I_n - P,$$
where $I_n$ is the $n \times n$ identity matrix. $P$ and $M$ are called projection matrices due to the property that they are idempotent: $PP = P$ and $MM = M$.
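A quick numerical check of these projection-matrix properties on a small simulated design (the design itself is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)    # P = X (X'X)^{-1} X'
M = np.eye(n) - P                        # M = I_n - P

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P @ y, X @ beta_hat))               # P y equals the fitted values
print(np.allclose(M @ X, 0))                          # M annihilates the columns of X
print(np.isclose(np.trace(M), n - k))                 # tr(M) = n - k
```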


Define
$$M_1 = I_n - X_1\left(X_1'X_1\right)^{-1}X_1'.$$
Recalling the definition $M = I_n - X\left(X'X\right)^{-1}X'$, observe that $X_1'M_1 = 0$ and thus
$$M_1 M = M.$$


residuals $\hat{e}$ may be equivalently computed by either the OLS regression (3.14) or via the following algorithm:

1. Regress $y$ on $X_1$; obtain residuals $\tilde{y}$.

2. Regress $X_2$ on $X_1$; obtain residuals $\tilde{X}_2$.

3. Regress $\tilde{y}$ on $\tilde{X}_2$; obtain OLS estimates $\hat{\beta}_2$ and residuals $\hat{e}$.

In some contexts, the FWL theorem can be used to speed computation, but in most cases there is little computational advantage to using the two-step algorithm (see the numerical sketch below). Rather, the primary use is theoretical.
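A sketch verifying the three-step algorithm numerically: the coefficients on $X_2$ from the residual regression match the corresponding block of the one-step OLS fit. The simulated design is an assumption chosen only so that $X_1$ and $X_2$ are correlated.

```python
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(6)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # included controls
X2 = rng.normal(size=(n, 2)) + X1[:, 1:]                  # regressors of interest, correlated with X1
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([2.0, -1.0]) + rng.normal(size=n)

beta_full = ols(np.column_stack([X1, X2]), y)             # one-step OLS

M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
y_tilde = M1 @ y                                          # step 1: residuals from y on X1
X2_tilde = M1 @ X2                                        # step 2: residuals from X2 on X1
beta2_fwl = ols(X2_tilde, y_tilde)                        # step 3

print(np.allclose(beta_full[2:], beta2_fwl))              # True
```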

A common application of the FWL theorem, which you may have seen in an introductory econometrics course, is the demeaning formula for regression. Partition $X = [X_1 \; X_2]$ where $X_1 = \iota$ is a vector of ones, and $X_2$ is the vector of observed regressors. In this case,


3.8 Bias and Variance

In this and the following section we consider the special case of the linear regression model (2.8)-(2.9). In this section we derive the small sample conditional mean and variance of the OLS estimator.

By the independence of the observations and (2.9), observe that
$$E\left(e \mid X\right) = \begin{pmatrix} \vdots \\ E\left(e_i \mid X\right) \\ \vdots \end{pmatrix} = \begin{pmatrix} \vdots \\ E\left(e_i \mid x_i\right) \\ \vdots \end{pmatrix} = 0.$$
Combined with $\hat{\beta} = \beta + \left(X'X\right)^{-1}X'e$, this yields
$$E\left(\hat{\beta} \mid X\right) = \beta + \left(X'X\right)^{-1}X'E\left(e \mid X\right) = \beta,$$
and thus the OLS estimator $\hat{\beta}$ is unbiased for $\beta$.

Next, for any random vector $Z$ define the covariance matrix
$$\mathrm{var}(Z) = E\left(Z - EZ\right)\left(Z - EZ\right)'.$$
Then given (3.19) we see that
$$\mathrm{var}\left(\hat{\beta} \mid X\right) = \left(X'X\right)^{-1}X'DX\left(X'X\right)^{-1},$$
where
$$D = E\left(ee' \mid X\right).$$
The $i$'th diagonal element of $D$ is
$$E\left(e_i^2 \mid X\right) = E\left(e_i^2 \mid x_i\right) = \sigma_i^2,$$
while the $ij$'th off-diagonal element of $D$ is
$$E\left(e_i e_j \mid X\right) = E\left(e_i \mid x_i\right)E\left(e_j \mid x_j\right) = 0.$$


Thus $D$ is a diagonal matrix with $i$'th diagonal element $\sigma_i^2$:
$$D = \mathrm{diag}\left(\sigma_1^2, \ldots, \sigma_n^2\right).$$
In the special case of the linear homoskedastic regression model, $\sigma_i^2 = \sigma^2$, so $D = I_n\sigma^2$, $X'DX = X'X\sigma^2$, and
$$\mathrm{var}\left(\hat{\beta} \mid X\right) = \left(X'X\right)^{-1}\sigma^2.$$

We now calculate the finite sample bias of the method of moments estimator $\hat{\sigma}^2$ for $\sigma^2$, under the additional assumption of conditional homoskedasticity $E\left(e_i^2 \mid x_i\right) = \sigma^2$. From (3.12), the properties of projection matrices and the trace operator, observe that $\hat{e} = Me$, so that
$$E\left(\hat{\sigma}^2 \mid X\right) = \frac{1}{n}E\left(e'Me \mid X\right) = \frac{1}{n}\mathrm{tr}(M)\sigma^2 = \frac{n-k}{n}\sigma^2.$$
Thus $\hat{\sigma}^2$ is biased towards zero, while $s^2 = (n-k)^{-1}\sum_{i=1}^{n}\hat{e}_i^2$ is unbiased for $\sigma^2$; this argument relies on the linear regression assumptions and does not carry over to the projection model.

3.9 Gauss-Markov Theorem

In this section we restrict attention to the homoskedastic linear regression model, which is (2.8)-(2.9) plus $E\left(e_i^2 \mid x_i\right) = \sigma^2$. Now consider the class of estimators of $\beta$ which are linear functions of the vector $y$, and thus can be written as
$$\tilde{\beta} = A'y,$$
where $A$ is an $n \times k$ function of $X$. The Gauss-Markov theorem, presented below, states that the least-squares estimator is the best choice, as it yields the smallest variance among all unbiased linear estimators.

By a calculation similar to those of the previous section,

of the solution


Theorem 3.9.1 (Gauss-Markov). In the homoskedastic linear regression model, the best (minimum-variance) unbiased linear estimator is OLS.

The Gauss-Markov theorem is an efficiency justification for the least-squares estimator, but it is quite limited in scope. Not only has the class of models been restricted to homoskedastic linear regressions, the class of potential estimators has been restricted to linear unbiased estimators. This latter restriction is particularly unsatisfactory, as the theorem leaves open the possibility that a non-linear or biased estimator could have lower mean squared error than the least-squares estimator.

3.10 Semiparametric Efficiency

In the previous section we presented the Gauss-Markov theorem as a limited efficiency justification for the least-squares estimator. A broader justification is provided in Chamberlain (1987), who established that in the projection model the OLS estimator has the smallest asymptotic mean-squared error among feasible estimators. This property is called semiparametric efficiency, and is a strong justification for the least-squares estimator. We discuss the intuition behind his result in this section.

Suppose that the joint distribution of $(y_i, x_i)$ is discrete. That is, for finite $r$, the pair takes one of $r$ possible values. (We know the values $y_i$ and $x_i$ can take, but we don't know the probabilities.)

In this discrete setting, the definition (3.3) can be rewritten in terms of the support points and the unknown probabilities $\pi_j$. The sample frequency estimator of $\pi_j$ is
$$\hat{\pi}_j = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left(y_i = \tau_j\right)\mathbf{1}\left(x_i = \xi_j\right)$$
for $j = 1, \ldots, r$, where $\mathbf{1}(\cdot)$ is the indicator function and $(\tau_j, \xi_j)$ denote the support points. That is, $\hat{\pi}_j$ is the percentage of the observations which fall in each category. The MLE $\hat{\beta}_{mle}$ for $\beta$ is then the analog of (3.21) with the parameters $\pi_j$ replaced by the estimates $\hat{\pi}_j$.


Chamberlain (1987) extends this argument to the case of continuously-distributed data. He observes that the above argument holds for all multinomial distributions, and any continuous distribution can be arbitrarily well approximated by a multinomial distribution. He proves that generically the OLS estimator (3.4) is an asymptotically efficient estimator for the parameter $\beta$ defined in (2.10) for the class of models satisfying Assumption 2.6.1.

3.11 Multicollinearity

If $\mathrm{rank}\left(X'X\right) < k$, then $\hat{\beta}$ is not defined.² This is called strict multicollinearity. This happens when the columns of $X$ are linearly dependent, i.e., there is some $\alpha \neq 0$ such that $X\alpha = 0$. Most commonly, this arises when sets of regressors are included which are identically related. For example, if $X$ includes both the logs of two prices and the log of the relative prices, $\log(p_1)$, $\log(p_2)$ and $\log(p_1/p_2)$. When this happens, the applied researcher quickly discovers the error as the statistical software will be unable to construct $\left(X'X\right)^{-1}$. Since the error is discovered quickly, this is rarely a problem for applied econometric practice.

The more relevant issue is near multicollinearity, which is often called "multicollinearity" for brevity. This is the situation when the $X'X$ matrix is near singular, when the columns of $X$ are close to linearly dependent. This definition is not precise, because we have not said what it means for a matrix to be "near singular". This is one difficulty with the definition and interpretation of multicollinearity.

One implication of near singularity of matrices is that the numerical reliability of the calculations is reduced. In extreme cases it is possible that the reported calculations will be in error due to floating-point calculation difficulties.

A more relevant implication of near multicollinearity is that individual coefficient estimates will be imprecise. We can see this most simply in a homoskedastic linear regression model with two regressors
$$y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + e_i,$$
and
$$E\left(x_i x_i'\right) = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
In this case we can see the effect of collinearity on precision by observing that the asymptotic variance of a coefficient estimate, $\sigma^2\left(1 - \rho^2\right)^{-1}$, approaches infinity as $\rho$ approaches 1. Thus the more "collinear" are the regressors, the worse the precision of the individual coefficient estimates.

What is happening is that when the regressors are highly dependent, it is statistically difficult to disentangle the impact of $\beta_1$ from that of $\beta_2$. As a consequence, the precision of individual estimates is reduced.
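A small simulation sketch of this variance inflation, assuming the two-regressor design just described (unit variances, correlation $\rho$, and homoskedastic unit-variance errors):

```python
import numpy as np

def slope_sd(rho, n=100, reps=2000, seed=7):
    # Monte Carlo standard deviation of the first slope estimate when E(x x') = [[1, rho], [rho, 1]].
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    est = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)      # homoskedastic errors
        est[r] = np.linalg.solve(X.T @ X, X.T @ y)[0]
    return est.std()

for rho in (0.0, 0.5, 0.9, 0.99):
    # theory: variance is approximately sigma^2 / (n (1 - rho^2))
    print(rho, slope_sd(rho), (1.0 / (100 * (1 - rho**2))) ** 0.5)
```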

3.12 Influential Observations

The $i$'th observation is influential on the least-squares estimate if the deletion of the observation from the sample results in a meaningful change in $\hat{\beta}$. To investigate the possibility of influential observations, define the leave-one-out least-squares estimator of $\beta$, that is, the OLS estimator based on the sample excluding the $i$'th observation. This equals
$$\hat{\beta}_{(-i)} = \left(X_{(-i)}'X_{(-i)}\right)^{-1}X_{(-i)}'y_{(-i)},$$

² See Appendix A.5 for the definition of the rank of a matrix.


where $X_{(-i)}$ and $y_{(-i)}$ are the data matrices omitting the $i$'th row. A convenient alternative expression (derived in Section 3.13) is
$$\hat{\beta}_{(-i)} = \hat{\beta} - \left(1 - h_i\right)^{-1}\left(X'X\right)^{-1}x_i\hat{e}_i,$$
where
$$h_i = x_i'\left(X'X\right)^{-1}x_i$$
is the $i$'th diagonal element of the projection matrix $X\left(X'X\right)^{-1}X'$.

We can also define the leave-one-out residual
$$\hat{e}_{i,(-i)} = y_i - x_i'\hat{\beta}_{(-i)} = \left(1 - h_i\right)^{-1}\hat{e}_i. \tag{3.25}$$
A simple comparison yields that
$$\hat{\beta} - \hat{\beta}_{(-i)} = \left(1 - h_i\right)^{-1}\left(X'X\right)^{-1}x_i\hat{e}_i.$$
As we can see, the change in the coefficient estimate by deletion of the $i$'th observation depends critically on the magnitude of $h_i$. The $h_i$ take values in $[0, 1]$ and sum to $k$. If the $i$'th observation has a large value of $h_i$, then this observation is a leverage point and has the potential to be an influential observation. Investigations into the presence of influential observations can plot the values of (3.25), which is considerably more informative than plots of the uncorrected residuals $\hat{e}_i$.
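A sketch computing the leverage values $h_i$ and the leave-one-out residual, verifying the shortcut $\hat{e}_i/(1 - h_i)$ against brute-force deletion of one observation; the simulated design is arbitrary.

```python
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(8)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.ones(k) + rng.normal(size=n)

beta = ols(X, y)
e_hat = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)    # h_i = x_i'(X'X)^{-1} x_i
print(np.isclose(h.sum(), k))                                  # leverages sum to k

i = int(np.argmax(h))                                          # most leveraged observation
e_loo_shortcut = e_hat[i] / (1.0 - h[i])                       # leave-one-out residual via the shortcut
beta_wo_i = ols(np.delete(X, i, axis=0), np.delete(y, i))      # brute-force deletion of observation i
e_loo_direct = y[i] - X[i] @ beta_wo_i
print(np.isclose(e_loo_shortcut, e_loo_direct))                # True
```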

3.13 Technical Proofs

of the least-squares estimator is $\left(X'X\right)^{-1}\sigma^2$ and that of $A'y$ is $A'A\sigma^2$. It is sufficient to show that the difference $A'A - \left(X'X\right)^{-1}$ is positive semi-definite. The matrix $C'C$ is positive semi-definite (see Appendix A.7) as required.

Proof of equation (3.23). Equation (A.2) in Appendix A.5 states that for nonsingular $A$ and vector $b$,
$$\left(A - bb'\right)^{-1} = A^{-1} + \left(1 - b'A^{-1}b\right)^{-1}A^{-1}bb'A^{-1}.$$
This implies
$$\left(X'X - x_ix_i'\right)^{-1} = \left(X'X\right)^{-1} + \left(1 - h_i\right)^{-1}\left(X'X\right)^{-1}x_ix_i'\left(X'X\right)^{-1},$$
and thus the claimed expression follows, the third equality making the substitutions $\hat{\beta} = \left(X'X\right)^{-1}X'y$ and $h_i = x_i'\left(X'X\right)^{-1}x_i$, and the remainder collecting terms.


3.14 Exercises

1. Let $(\hat{\mu}, \hat{\sigma}^2)$ be the values such that $g_n(\hat{\mu}, \hat{\sigma}^2) = 0$ where $g_n(m, s) = n^{-1}\sum_{i=1}^{n} g\left(y_i, m, s\right)$. Show that $\hat{\mu}$ and $\hat{\sigma}^2$ are the sample mean and variance.

2. Consider the OLS regression of the $n \times 1$ vector $y$ on the $n \times k$ matrix $X$. Consider an alternative set of regressors $Z = XC$, where $C$ is a $k \times k$ non-singular matrix. Thus, each column of $Z$ is a mixture of some of the columns of $X$. Compare the OLS estimates and residuals from the regression of $y$ on $X$ to the OLS estimates from the regression of $y$ on $Z$.

3. Let $\hat{e}$ be the OLS residual from a regression of $y$ on $X = [X_1 \; X_2]$. Find $X_2'\hat{e}$.

4. Let $\hat{e}$ be the OLS residual from a regression of $y$ on $X$. Find the OLS coefficient estimate from a regression of $\hat{e}$ on $X$.

5. Let $\hat{y} = X\left(X'X\right)^{-1}X'y$. Find the OLS coefficient estimate from a regression of $\hat{y}$ on $X$.

6. Prove that $R^2$ is the square of the simple correlation between $y$ and $\hat{y}$.

7. Explain the difference between $\frac{1}{n}\sum_{i=1}^{n} x_i x_i'$ and $E\left(x_i x_i'\right)$.

8. Let $\hat{\beta}_n = \left(X_n'X_n\right)^{-1}X_n'y_n$ denote the OLS estimate when $y_n$ is $n \times 1$ and $X_n$ is $n \times k$. A new observation $(y_{n+1}, x_{n+1})$ becomes available. Prove that the OLS estimate computed using this additional observation is
$$\hat{\beta}_{n+1} = \hat{\beta}_n + \frac{1}{1 + x_{n+1}'\left(X_n'X_n\right)^{-1}x_{n+1}}\left(X_n'X_n\right)^{-1}x_{n+1}\left(y_{n+1} - x_{n+1}'\hat{\beta}_n\right).$$

individual's gender. Let $d_1$ and $d_2$ be vectors of 1's and 0's, with the $i$'th element of $d_1$ equaling 1 and that of $d_2$ equaling 0 if the person is a man, and the reverse if the person is a woman. Suppose that there are $n_1$ men and $n_2$ women in the sample. Consider the three regressions (3.26), (3.27), and (3.28).

(a) Can all three regressions (3.26), (3.27), and (3.28) be estimated by OLS? Explain if not.

(b) Compare regressions (3.27) and (3.28). Is one more general than the other? Explain the relationship between the parameters in (3.27) and (3.28).

(c) Compute $\iota'd_1$ and $\iota'd_2$, where $\iota$ is an $n \times 1$ vector of ones.

(d) Letting $\beta = (\beta_1 \; \beta_2)'$, write equation (3.27) as $y = X\beta + e$. Consider the assumption $E(x_i e_i) = 0$. Is there any content to this assumption in this setting?

11. Let $d_1$ and $d_2$ be defined as in the previous exercise.

(a) In the OLS regression
$$y = d_1\hat{\alpha}_1 + d_2\hat{\alpha}_2 + \hat{u},$$
show that $\hat{\alpha}_1$ is the sample mean of the dependent variable among the men of the sample $(\bar{y}_1)$, and that $\hat{\alpha}_2$ is the sample mean among the women $(\bar{y}_2)$.


(b) Describe in words the transformations

V1 = education (in years)
V2 = region of residence (coded 1 if South, 0 otherwise)
V3 = (coded 1 if nonwhite and non-Hispanic, 0 otherwise)
V4 = (coded 1 if Hispanic, 0 otherwise)
V5 = gender (coded 1 if female, 0 otherwise)
V6 = marital status (coded 1 if married, 0 otherwise)
V7 = potential labor market experience (in years)
V8 = union status (coded 1 if in union job, 0 otherwise)
V9 = hourly wage (in dollars)

Estimate a regression of wage $y_i$ on education $x_{1i}$, experience $x_{2i}$, and experience-squared $x_{3i} = x_{2i}^2$ (and a constant). Report the OLS estimates.

Let $\hat{e}_i$ be the OLS residual and $\hat{y}_i$ the predicted value from the regression. Numerically calculate the following:

Are these calculations consistent with the theoretical properties of OLS? Explain.

13. Use the data from the previous problem, and re-estimate the slope on education using the residual regression approach. Regress $y_i$ on $(1, x_{2i}, x_{2i}^2)$, regress $x_{1i}$ on $(1, x_{2i}, x_{2i}^2)$, and regress the residuals on the residuals. Does the estimate equal the value from the initial OLS regression? Explain.

In the second-stage residual regression (the regression of the residuals on the residuals), calculate the equation $R^2$ and sum of squared errors. Do they equal the values from the initial OLS regression? Explain.


Figure 4.1: Sampling Density of $\hat{\beta}_2$

To illustrate the possibilities in one example, let $y_i$ and $x_i$ be drawn from a particular joint density, and consider the sampling density of the slope estimate $\hat{\beta}_2$ displayed in Figure 4.1 for sample sizes $n = 100$ and $n = 800$. The vertical line marks the true value of the projection coefficient.

From the figure we can see that the density functions are dispersed and highly non-normal. As the sample size increases the density becomes more concentrated about the population coefficient. To characterize the sampling distribution more fully, we will use the methods of asymptotic approximation. A review of the most important tools in asymptotic theory is contained in Appendix C.
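The exact joint density behind Figure 4.1 is not reproduced in this extract, so the sketch below substitutes an invented skewed design purely to illustrate how the sampling density of the slope estimate concentrates as $n$ grows.

```python
import numpy as np

def sample_slope(n, reps=2000, seed=9):
    # Monte Carlo draws of the OLS slope estimate for sample size n.
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for r in range(reps):
        x = rng.lognormal(mean=0.0, sigma=1.0, size=n)                # skewed regressor (assumption)
        e = rng.lognormal(mean=0.0, sigma=1.0, size=n) - np.exp(0.5)  # skewed, mean-zero error (assumption)
        y = 1.0 + 2.0 * x + e
        X = np.column_stack([np.ones(n), x])
        out[r] = np.linalg.solve(X.T @ X, X.T @ y)[1]
    return out

for n in (25, 100, 800):
    draws = sample_slope(n)
    print(n, draws.mean(), draws.std())    # spread shrinks roughly like 1/sqrt(n)
```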


4.2 Consistency

As discussed in Section 4.1, the OLS estimator $\hat{\beta}$ has a statistical distribution which is unknown. Asymptotic (large sample) methods approximate sampling distributions based on the limiting experiment that the sample size $n$ tends to infinity. A preliminary step in this approach is the demonstration that estimators converge in probability to the true parameters as the sample size gets large. This is illustrated in Figure 4.1 by the fact that the sampling densities become more concentrated as $n$ gets larger.

This derivation is based on three key components. First, the OLS estimator can be written as a continuous function of a set of sample moments. Second, the weak law of large numbers (WLLN, Theorem C.2.1) shows that sample moments converge in probability to population moments. And third, the continuous mapping theorem (Theorem C.4.1) states that continuous functions preserve convergence in probability. We now explain each step.

First, the OLS estimator is a continuous function of the sample moments
$$\frac{1}{n}\sum_{i=1}^{n} x_i x_i' \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^{n} x_i y_i.$$
Second, by the WLLN, each element converges in probability to its population counterpart:
$$\frac{1}{n}\sum_{i=1}^{n} x_{ji}x_{li} \xrightarrow{p} E\left(x_{ji}x_{li}\right) \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^{n} x_{ji}y_i \xrightarrow{p} E\left(x_{ji}y_i\right).$$
Since this holds for all elements in the matrix $\frac{1}{n}\sum_{i=1}^{n} x_i x_i'$ and the vector $\frac{1}{n}\sum_{i=1}^{n} x_i y_i$, it follows that
$$\frac{1}{n}\sum_{i=1}^{n} x_i x_i' \xrightarrow{p} E\left(x_i x_i'\right) = Q \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^{n} x_i y_i \xrightarrow{p} E\left(x_i y_i\right).$$
Third, by the continuous mapping theorem, $\hat{\beta} \xrightarrow{p} Q^{-1}E\left(x_i y_i\right) = \beta$.


For further details on the proof see Section 4.15.

Theorem 4.2.1 states that the OLS estimator $\hat{\beta}$ converges in probability to $\beta$ as $n$ diverges to positive infinity. When an estimator converges in probability to the true value as the sample size diverges, we say that the estimator is consistent. This is a good property for an estimator to possess. It means that for any given joint distribution of $(y_i, x_i)$, there is a sample size $n$ sufficiently large such that the estimator $\hat{\beta}$ will be arbitrarily close to the true value with high probability. Consistency is also an important preliminary step in establishing other important asymptotic approximations.

We can similarly show that the estimators $\hat{\sigma}^2$ and $s^2$ are consistent for $\sigma^2$.

The proof is given in Section 4.15.

One implication of this theorem is that multiple estimators can be consistent for the same population parameter. While $\hat{\sigma}^2$ and $s^2$ are unequal in any given application, they are close in value when $n$ is very large.

4.3 Asymptotic Normality

We started this chapter discussing the need for an approximation to the distribution of the OLS estimator $\hat{\beta}$. In the previous section we showed that $\hat{\beta}$ converges in probability to $\beta$. This is a useful first step, but in itself does not provide a useful approximation to the distribution of the estimator. In this section we derive an approximation typically called the asymptotic distribution of the estimator.

The derivation starts by writing the estimator as a function of sample moments. One of the moments must be written as a sum of zero-mean random vectors and normalized so that the central limit theory can be applied. The steps are as follows.

Take equation (4.4) and multiply it by $\sqrt{n}$. This yields the expression
$$\sqrt{n}\left(\hat{\beta} - \beta\right) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i e_i\right),$$
which writes the estimator as a function of the sample moment $\frac{1}{n}\sum_{i=1}^{n} x_i x_i'$ and the normalized sample average $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i e_i$. Furthermore, the latter has mean zero so the central limit theorem (CLT) applies. Recall, the CLT (Theorem C.3.1) states that if $u_i \in \mathbb{R}^k$ is iid, $Eu_i = 0$ and $Eu_{ji}^2 < \infty$ for $j = 1, \ldots, k$, then as $n \to \infty$
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} u_i \xrightarrow{d} N\left(0, E\left(u_i u_i'\right)\right).$$
Applying this to $u_i = x_i e_i$, we find
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i e_i \xrightarrow{d} N\left(0, \Omega\right),$$
where


$$\Omega = E\left(x_i x_i' e_i^2\right).$$
Then using (4.6), (4.1), and (4.7),
$$\sqrt{n}\left(\hat{\beta} - \beta\right) \xrightarrow{d} Q^{-1}N\left(0, \Omega\right) = N\left(0, Q^{-1}\Omega Q^{-1}\right),$$
where the final equality follows from a property of the normal distribution.

A formal statement of this result requires the following strengthening of the moment conditions.

Further details of the proof are given in Section 4.15.

As $V$ is the variance of the asymptotic distribution of $\sqrt{n}\left(\hat{\beta} - \beta\right)$, $V$ is often referred to as the asymptotic covariance matrix of $\hat{\beta}$.

Theorem 4.3.1 states that the sampling distribution of the least-squares estimator, after rescaling, is approximately normal when the sample size $n$ is sufficiently large. This holds true for all joint distributions of $(y_i, x_i)$ which satisfy the conditions of Assumption 4.3.1. However, for any fixed $n$ the sampling distribution of $\hat{\beta}$ can be arbitrarily far from the normal distribution. In Figure 4.1 we have already seen a simple example where the least-squares estimate is quite asymmetric and non-normal even for reasonably large sample sizes.

There is a special case where $\Omega$ and $V$ simplify. We say that $e_i$ is a Homoskedastic Projection Error when (4.8) holds. In (4.10) we define $V_0 = Q^{-1}\sigma^2$ whether (4.8) is true or false. When (4.8) is true then $V = V_0$; otherwise $V \neq V_0$.

The asymptotic distribution of Theorem 4.3.1 is commonly used to approximate the finite sample distribution of $\sqrt{n}\left(\hat{\beta} - \beta\right)$. How large does $n$ need to be in order for the approximation to be useful? Unfortunately, there is no simple answer to this reasonable question. The trouble is that no matter how large is the sample size, the normal approximation is arbitrarily poor for some data distribution satisfying the assumptions. We illustrate this problem using a simulation. Let $y_i = \beta_0 + \beta_1 x_i + e_i$ where $x_i$ is $N(0,1)$, and $e_i$ is independent of $x_i$ with a Double Pareto density with tail parameter $\alpha$. If $\alpha > 2$ the error $e_i$ has zero mean and finite variance; as $\alpha$ approaches 2, however, its variance diverges to infinity. In this context Figure 4.2 displays the finite sample densities of the normalized least-squares slope estimator for several values of $\alpha$. For $\alpha = 3.0$ the density is very close to the $N(0,1)$ density. As $\alpha$ diminishes the density changes significantly, concentrating most of the probability mass around zero.

Another example is shown in Figure 4.3. Here the model is $y_i = \beta_1 + e_i$ where


Figure 4.2: Density of Normalized OLS estimator

and $u_i \sim N(0,1)$. We show the sampling distribution of $\sqrt{n}\left(\hat{\beta}_1 - \beta_1\right)$ setting $n = 100$, for $k = 1$, 4, 6 and 8. As $k$ increases, the sampling distribution becomes highly skewed and non-normal. The lesson from Figures 4.2 and 4.3 is that the $N(0,1)$ asymptotic approximation is never guaranteed to be accurate.

Figure 4.3: Sampling distribution

4.4 Covariance Matrix Estimation


Let
$$\hat{Q} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'$$
be the method of moments estimator for $Q$. The homoskedastic covariance matrix $V_0 = Q^{-1}\sigma^2$ is typically estimated by
$$\hat{V}_0 = \hat{Q}^{-1}s^2,$$
while the White estimator replaces $\sigma^2$ with the sample analog of $\Omega$:
$$\hat{V} = \hat{Q}^{-1}\hat{\Omega}\hat{Q}^{-1}, \qquad \hat{\Omega} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'\hat{e}_i^2.$$

The estimator $\hat{V}_0$ was the dominant covariance estimator used before 1980, and was still the standard choice for much empirical work done in the early 1980s. The methods switched during the late 1980s and early 1990s, so that by the late 1990s the White estimate $\hat{V}$ emerged as the standard covariance matrix estimator. When reading and reporting applied work, it is important to pay attention to the distinction between $\hat{V}_0$ and $\hat{V}$, as it is not always clear which has been computed. When $\hat{V}$ is used rather than the traditional choice $\hat{V}_0$, many authors will state that their "standard errors have been corrected for heteroskedasticity", or that they use a "heteroskedasticity-robust covariance matrix estimator", or that they use the "White formula", the "Eicker-White formula", the "Huber formula", the "Huber-White formula" or the "GMM covariance matrix". In most cases, these all mean the same thing.

The variance estimator $\hat{V}$ is an estimate of the variance of the asymptotic distribution of $\hat{\beta}$. A more easily interpretable measure of spread is its square root, the standard deviation. This motivates the definition of a standard error as an estimate of the standard deviation of the distribution of an estimator.

To illustrate, we return to the log wage regression of Section 3.2. We calculate that $s^2 = 0.20$ and compute the two covariance matrix estimates $\hat{V}_0$ and $\hat{V}$.


In this case the two estimates are quite similar. The (White) standard error for $\hat{\beta}_0$ is $\sqrt{7.2/988} = 0.085$ and that for $\hat{\beta}_1$ is $\sqrt{0.035/988} = 0.006$. We can write the estimated equation with standard errors using the format
$$\widehat{\log(Wage)} = \underset{(0.085)}{1.313} + \underset{(0.006)}{0.128}\; Education.$$
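A sketch of the homoskedastic and White covariance matrix estimators and the associated standard errors, following the scaling convention above (a standard error is the square root of $n^{-1}$ times a diagonal element of the variance estimate); the simulated heteroskedastic data stand in for the CPS extract, which is not included here.

```python
import numpy as np

def ols_with_se(X, y):
    n, k = X.shape
    Qxx_inv = np.linalg.inv(X.T @ X / n)
    beta = Qxx_inv @ (X.T @ y / n)
    e = y - X @ beta
    s2 = e @ e / (n - k)
    V0 = Qxx_inv * s2                                    # homoskedastic estimator V0-hat
    Omega = (X * e[:, None] ** 2).T @ X / n              # (1/n) sum x_i x_i' e_i^2
    V_white = Qxx_inv @ Omega @ Qxx_inv                  # White (heteroskedasticity-robust) estimator
    se0 = np.sqrt(np.diag(V0) / n)
    se_white = np.sqrt(np.diag(V_white) / n)
    return beta, se0, se_white

rng = np.random.default_rng(10)
n = 988
educ = rng.integers(8, 21, size=n).astype(float)
e = rng.normal(scale=0.1 + 0.03 * educ)                  # heteroskedastic errors (assumption)
y = 1.3 + 0.13 * educ + e
print(ols_with_se(np.column_stack([np.ones(n), educ]), y))
```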

4.5 Alternative Covariance Matrix Estimators

MacKinnon and White (1985) suggested a small-sample corrected version of $\hat{V}$ based on the jackknife principle. Recall from Section 3.12 the definition of $\hat{\beta}_{(-i)}$ as the least-squares estimator with the $i$'th observation deleted. From equation (3.13) of Efron (1982), the jackknife estimator of the variance matrix for $\hat{\beta}$ is

let $\theta = h(\beta)$ denote the parameter of interest. The estimate of $\theta$ is
$$\hat{\theta} = h(\hat{\beta}).$$
What is an appropriate standard error for $\hat{\theta}$? Assume that $h(\beta)$ is differentiable at the true value of $\beta$. By a first-order Taylor series approximation,
$$\hat{\theta} = h(\hat{\beta}) \approx h(\beta) + H_{\beta}'\left(\hat{\beta} - \beta\right), \qquad H_{\beta} = \frac{\partial}{\partial\beta}h(\beta),$$
so that $\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{d} N\left(0, V_{\theta}\right)$ with $V_{\theta} = H_{\beta}'VH_{\beta}$.


For example, if $R$ is a "selector matrix" so that $h(\beta) = R'\beta$ picks out a subset of the coefficients, then $H_{\beta} = R$ and $V_{\theta} = R'VR$.

When $q = 1$ (so $h(\beta)$ is real-valued), the standard error for $\hat{\theta}$ is the square root of $n^{-1}\hat{V}_{\theta}$, that is, $s(\hat{\theta}) = n^{-1/2}\sqrt{\hat{V}_{\theta}}$.
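A sketch of a delta-method standard error for a smooth scalar transformation $\theta = h(\beta)$, using a numerical gradient so that no analytic derivative is needed; the function $h$ and the covariance matrix in the example are hypothetical.

```python
import numpy as np

def delta_method_se(h, beta_hat, V_hat, n, eps=1e-6):
    # s(theta_hat) = sqrt( n^{-1} H' V_hat H ), with H the gradient of h at beta_hat
    k = len(beta_hat)
    H = np.empty(k)
    for j in range(k):
        step = np.zeros(k)
        step[j] = eps
        H[j] = (h(beta_hat + step) - h(beta_hat - step)) / (2 * eps)   # central-difference gradient
    return np.sqrt(H @ V_hat @ H / n)

# Hypothetical example: theta = beta_1 / beta_2, a ratio of coefficients.
beta_hat = np.array([0.128, 0.020])
V_hat = np.array([[0.035, 0.001], [0.001, 0.010]])   # made-up estimate of V for illustration
print(delta_method_se(lambda b: b[0] / b[1], beta_hat, V_hat, n=988))
```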

is therefore exactly free of unknowns. In this case, we say that $t_n$ is an exactly pivotal statistic. In general, however, pivotal statistics are unavailable and so we must rely on asymptotically pivotal statistics.

A simple null and composite hypothesis takes the form
$$H_0: \theta = \theta_0 \qquad H_1: \theta \neq \theta_0,$$


where $\theta_0$ is some pre-specified value, and $\theta = h(\beta)$ is some function of the parameter vector. (For example, $\theta$ could be a single element of $\beta$.)

The standard test for $H_0$ against $H_1$ is the t-statistic (or studentized statistic)
$$t_n = t_n(\theta_0) = \frac{\hat{\theta} - \theta_0}{s(\hat{\theta})}.$$
Under $H_0$, $t_n \xrightarrow{d} N(0,1)$. Let $z_{\alpha/2}$ be the upper $\alpha/2$ quantile of the standard normal distribution. That is, if $Z \sim N(0,1)$, then $P\left(Z > z_{\alpha/2}\right) = \alpha/2$ and $P\left(|Z| > z_{\alpha/2}\right) = \alpha$. For example, $z_{.025} = 1.96$ and $z_{.05} = 1.645$. A test of asymptotic significance $\alpha$ rejects $H_0$ if $|t_n| > z_{\alpha/2}$. Otherwise the test does not reject, or "accepts" $H_0$. This is because
$$P\left(\text{reject } H_0 \mid H_0 \text{ true}\right) = P\left(|t_n| > z_{\alpha/2} \mid \theta = \theta_0\right) \to P\left(|Z| > z_{\alpha/2}\right) = \alpha.$$

The rejection/acceptance dichotomy is associated with the Neyman-Pearson approach to hypothesis testing.

An alternative approach, associated with Fisher, is to report an asymptotic p-value. The asymptotic p-value for the above statistic is constructed as follows. Define the tail probability, or asymptotic p-value function,
$$p(t) = P\left(|Z| > |t|\right) = 2\left(1 - \Phi(|t|)\right).$$
Then the asymptotic p-value of the statistic $t_n$ is
$$p_n = p(t_n).$$
Another helpful observation is that the p-value function has simply made a unit-free transformation of the test statistic. That is, under $H_0$, $p_n \xrightarrow{d} U[0,1]$, so the "unusualness" of the test statistic can be compared to the easy-to-understand uniform distribution, regardless of the complication of the distribution of the original test statistic. To see this fact, note that the asymptotic distribution of $|t_n|$ is $F(x) = 1 - p(x)$. Thus
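A sketch of the t-statistic and its asymptotic p-value, assuming an estimate, its standard error, and a hypothesized value are already in hand; the numbers in the example are the education coefficient and standard error reported earlier.

```python
from math import erf, sqrt

def t_stat_and_pvalue(theta_hat, se, theta0=0.0):
    # t = (theta_hat - theta0) / s(theta_hat); p = P(|Z| > |t|) = 2 (1 - Phi(|t|))
    t = (theta_hat - theta0) / se
    phi = 0.5 * (1.0 + erf(abs(t) / sqrt(2.0)))   # standard normal CDF at |t|
    return t, 2.0 * (1.0 - phi)

# Example: the education coefficient reported earlier, 0.128 with standard error 0.006.
print(t_stat_and_pvalue(0.128, 0.006))
```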

4.8 Confidence Intervals

A confidence interval $C_n$ is an interval estimate of $\theta \in \mathbb{R}$, and is a function of the data and hence is random. It is designed to cover $\theta$ with high probability. Either $\theta \in C_n$ or $\theta \notin C_n$. The coverage probability is $P\left(\theta \in C_n\right)$.
