Generalized Method of Moments Estimators
This follows mainly [DM93, Chapter 17]. A good and accessible treatment is [Mát99]. The textbook [Hay00] uses GMM as the organizing principle for all estimation methods except maximum likelihood.
A moment µ of a random variable y is the expected value of some function of y. Such a moment is therefore defined by the equation

(59.0.8)  E[g(y) − µ] = 0.

The same parameter-defining function g(y) − µ defines the method of moments estimator µ̂ of µ if one replaces the expected value in (59.0.8) with the sample mean of the elements of an observation vector y consisting of independent observations of y.
In other words, µ̂(y) is that value which satisfies (1/n) Σ_{i=1}^n (g(y_i) − µ̂) = 0.
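(The following tiny numerical sketch is an addition to these notes, not part of the original text; it assumes NumPy is available, and the function name mm_estimate and the choice g(y) = y² are purely illustrative.) Solving the defining equation for µ̂ amounts to taking the sample mean of g(y_i):

    import numpy as np

    def mm_estimate(y, g):
        """Method of moments estimate of mu = E[g(y)]: the value solving
        (1/n) * sum_i (g(y_i) - mu_hat) = 0, i.e. the sample mean of g(y_i)."""
        return np.mean(g(y))

    rng = np.random.default_rng(0)
    y = rng.gamma(shape=3.0, scale=0.5, size=1000)
    mu_hat = mm_estimate(y, lambda v: v**2)   # estimates the second moment E[y^2]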
The generalized method of moments estimator extends this rule in several respects: the y_i no longer have to be i.i.d., the parameter-defining equations may be a system of equations defining more than one parameter at a time, there may be more parameter-defining functions than parameters (overidentification), and not only unconditional but also conditional moments are considered.
Under this definition, the OLS estimator is a GMM estimator. To show this, we will write the linear model y = Xβ + ε row by row as y_i = x_i^⊤β + ε_i, where x_i is, as in various earlier cases, the ith row of X written as a column vector. The basic property which makes least squares consistent is that the following conditional expectation is zero:
(59.0.9)  E[y_i − x_i^⊤β | x_i] = 0
This is more information than just knowing that the unconditional expectation is zero. How can this additional information be used to define an estimator? From (59.0.9) it follows that the unconditional expectation of the following product is zero as well:
(59.0.10)  E[x_i(y_i − x_i^⊤β)] = o
Replacing the expected value by the sample mean gives

(1/n) Σ_{i=1}^n x_i(y_i − x_i^⊤β̂) = o,
which can also be written as

(1/n) [x_1 ⋯ x_n] [y_1 − x_1^⊤β̂, …, y_n − x_n^⊤β̂]^⊤ ≡ (1/n) X^⊤(y − Xβ̂) = o.

These are exactly the OLS Normal Equations. This shows that OLS in the linear model is a GMM estimator.
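As an illustration (added here, not part of the original notes; it assumes NumPy, and all variable names are arbitrary), solving the empirical moment condition (1/n) X^⊤(y − Xβ̂) = o numerically reproduces OLS exactly:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(size=n)

    # GMM with parameter-defining function f_i = x_i (y_i - x_i' beta):
    # setting the sample mean of the f_i to zero is the same as solving
    # the OLS normal equations X'(y - X beta) = 0.
    beta_gmm = np.linalg.solve(X.T @ X, X.T @ y)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    assert np.allclose(beta_gmm, beta_ols)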
Note that the rows of the X matrix play two different roles in this derivation: they appear in the equation y_i = x_i^⊤β + ε_i, and they are also the information set based on which the conditional expectation in (59.0.9) is formed. If this latter role is assumed by the rows of a different matrix of observations W, then the GMM estimator becomes the Instrumental Variables Estimator.
Most maximum likelihood estimators are also GMM estimators. As long as the maxima are at the interior of the parameter region, the ML estimators solve the first-order conditions, i.e., the Jacobian of the log likelihood function evaluated at these estimators is zero. And it follows from the theory of maximum likelihood estimation that the expected value of the Jacobian of the log likelihood function is zero, so the rows of this Jacobian can serve as parameter-defining functions.
Here are the general definitions and theorems, and, as examples, their applications to the textbook example of the Gamma distribution in [Gre97, p. 518] and to the Instrumental Variables estimator.
y is a vector of n observations created by a Data Generating Process (DGP) µ ∈ M. θ is a k-vector of nonrandom parameters. A parameter-defining function F(y, θ) is an n × ℓ matrix function with the following properties (a), (b), and (c): (a) the ith row only depends on the ith observation y_i, i.e., the rows of F(y, θ) are f_1^⊤(y_1, θ), …, f_n^⊤(y_n, θ).
Sometimes the f_i have identical functional form and only differ by the values of some exogenous variables, i.e., f_i(y_i, θ) = g(y_i, x_i, θ), but sometimes they have genuinely different functional forms.
In the Gamma-function example, M is the set of all Gamma distributions, θ = [r  λ]^⊤ consists of the two parameters of the Gamma distribution, ℓ = k = 2, and the parameter-defining function has the rows

(59.0.14)  f_i(y_i, θ) = ( y_i − r/λ ,  1/y_i − λ/(r−1) )^⊤,

so that F(y, θ) is the n × 2 matrix whose ith row is ( y_i − r/λ ,  1/y_i − λ/(r−1) ), i = 1, …, n.
In the IV case, θ = β and ℓ is the number of instruments. If we split X and W into their rows, X = [x_1 ⋯ x_n]^⊤ and W = [w_1 ⋯ w_n]^⊤, then f_i(y_i, β) = w_i(y_i − x_i^⊤β). This gives

(59.0.16)  F(y, β) = [(y_1 − x_1^⊤β)w_1 ⋯ (y_n − x_n^⊤β)w_n]^⊤ = diag(y − Xβ) W.
(b) The vector functions f_i(y_i, θ) must be such that the true value of the parameter vector θ_µ satisfies

(59.0.17)  E[f_i(y_i, θ_µ)] = o

for all i, while any other parameter vector θ ≠ θ_µ gives E[f_i(y_i, θ)] ≠ o.
In the Gamma example, (59.0.17) follows from the fact that the moments of the Gamma distribution are E[y] = r/λ and E[1/y_i] = λ/(r−1). It is also easy to see that r and λ are characterized by these two relations; given E[y] = µ and E[1/y_i] = ν, one can solve for r = µν/(µν − 1) and λ = ν/(µν − 1).
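To make this concrete, here is a small simulation sketch (an addition to the notes; it assumes NumPy and uses the shape/rate parametrization with E[y] = r/λ, as above):

    import numpy as np

    rng = np.random.default_rng(2)
    r_true, lam_true = 4.0, 2.0
    # NumPy's gamma() takes shape and scale = 1/rate, so E[y] = r/lambda.
    y = rng.gamma(shape=r_true, scale=1.0 / lam_true, size=100_000)

    mu = y.mean()          # estimates E[y]   = r / lambda
    nu = (1.0 / y).mean()  # estimates E[1/y] = lambda / (r - 1)

    r_hat = mu * nu / (mu * nu - 1.0)
    lam_hat = nu / (mu * nu - 1.0)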
In the IV model, (59.0.17) is satisfied if the ε_i have zero expectation conditionally on w_i, and uniqueness is condition (52.0.3), requiring that plim (1/n) W_n^⊤ X_n exists, is nonrandom, and has full column rank. (In the 781 handout Winter 1998, (52.0.3) was equation (246) on p. 154.)
Next we need a recipe for how to construct an estimator from this parameter-defining function. Let us first discuss the case k = ℓ (exact identification). The GMM estimator θ̂ defined by F satisfies

(59.0.18)  (1/n) F^⊤(y, θ̂) ι = o,

where ι is the vector of ones, which can also be written in the form

(1/n) Σ_{i=1}^n f_i(y_i, θ̂) = o.

Assumption (c) for a parameter-defining function is that there is only one θ̂ satisfying (59.0.18).
For IV,

(59.0.20)  F^⊤(y, β̃) ι = W^⊤ diag(y − Xβ̃) ι = W^⊤(y − Xβ̃).

If there are as many instruments as explanatory variables, setting this equal to zero gives the normal equation for the simple IV estimator, W^⊤(y − Xβ̃) = o.
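A minimal sketch of this exactly identified case (an addition to the notes; it assumes NumPy, and the data-generating choices are arbitrary), solving W^⊤(y − Xβ̃) = o directly:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 500
    w = rng.normal(size=n)                       # instrument
    u = rng.normal(size=n)                       # disturbance
    x = 0.8 * w + 0.5 * u + rng.normal(size=n)   # regressor correlated with u
    X = np.column_stack([np.ones(n), x])
    W = np.column_stack([np.ones(n), w])
    y = X @ np.array([1.0, 2.0]) + u

    # Exactly identified: W'X is square, so W'(y - X beta) = 0 can be solved exactly.
    beta_iv = np.linalg.solve(W.T @ X, W.T @ y)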
In the case ℓ > k, (59.0.17) still holds, but the system of equations (59.0.18) no longer has a solution: there are ℓ > k relationships for the k parameters. In order to handle this situation, we need to specify what qualifies as a weighting matrix. The symmetric positive definite ℓ × ℓ matrix A(y) is a weighting matrix if it has a nonrandom positive definite plim A_0 = plim_{n→∞} A(y). Instead of (59.0.18), now the following equation serves to define θ̂:

(59.0.21)  θ̂ = argmin_θ  ι^⊤ F(y, θ) A(y) F^⊤(y, θ) ι.

In this case, condition (c) for a parameter-defining function reads that there is only one θ̂ which minimizes this criterion function.
For IV, A(y) does not depend on y but is ((1/n) W^⊤W)^{-1}. Therefore A_0 = plim ((1/n) W^⊤W)^{-1}, and (59.0.21) becomes β̃ = argmin (y − Xβ)^⊤ W (W^⊤W)^{-1} W^⊤ (y − Xβ), which is indeed the quadratic form minimized by the generalized instrumental variables estimator.
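The closed-form solution of this minimization is β̃ = (X^⊤P_W X)^{-1} X^⊤P_W y with P_W = W(W^⊤W)^{-1}W^⊤. A short sketch (an addition to the notes; it assumes NumPy, and the function and variable names are illustrative):

    import numpy as np

    def give(y, X, W):
        """Generalized IV estimator: minimizes (y - Xb)' W (W'W)^{-1} W' (y - Xb)."""
        Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)   # projection of X onto the span of W
        return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

    rng = np.random.default_rng(4)
    n = 500
    w1, w2 = rng.normal(size=n), rng.normal(size=n)
    u = rng.normal(size=n)
    x = 0.6 * w1 + 0.3 * w2 + 0.5 * u + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    W = np.column_stack([np.ones(n), w1, w2])          # more instruments than regressors
    y = X @ np.array([1.0, 2.0]) + u
    beta_give = give(y, X, W)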
In order to convert the Gamma-function example into an overidentified system, we add a third relation:

(59.0.22)  F(y, θ) is the n × 3 matrix whose ith row is ( y_i − r/λ ,  1/y_i − λ/(r−1) ,  y_i² − r(r+1)/λ² ).
In this case it is possible to compute the asymptotic covariance matrix; but in real-life situations this covariance matrix is estimated using a preliminary consistent estimator of the parameters, as [Gre97] does it. Most GMM estimators depend on such a consistent pre-estimator.
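To illustrate the two-step idea on the overidentified Gamma example (59.0.22), here is a rough sketch (an addition to the notes; it assumes NumPy and SciPy, and all function names and tuning choices are illustrative only):

    import numpy as np
    from scipy.optimize import minimize

    def moments(theta, y):
        """Rows of F(y, theta) for the overidentified Gamma example (59.0.22)."""
        r, lam = theta
        return np.column_stack([
            y - r / lam,
            1.0 / y - lam / (r - 1.0),
            y**2 - r * (r + 1.0) / lam**2,
        ])

    def gmm_objective(theta, y, A):
        gbar = moments(theta, y).mean(axis=0)   # (1/n) F' iota
        return gbar @ A @ gbar                  # proportional to the criterion (59.0.21)

    rng = np.random.default_rng(5)
    y = rng.gamma(shape=4.0, scale=0.5, size=5000)      # r = 4, lambda = 2

    # Step 1: preliminary consistent estimate (the exactly identified
    # method-of-moments solution from the first two moment conditions).
    mu, nu = y.mean(), (1.0 / y).mean()
    theta0 = np.array([mu * nu / (mu * nu - 1.0), nu / (mu * nu - 1.0)])

    # Step 2: weighting matrix A = Psi^{-1} with Psi = (1/n) F'F at theta0,
    # then minimize the GMM criterion.
    F0 = moments(theta0, y)
    A = np.linalg.inv(F0.T @ F0 / len(y))
    result = minimize(gmm_objective, theta0, args=(y, A), method="Nelder-Mead")
    r_hat, lam_hat = result.x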
The GMM estimator θ̂ defined in this way is a particular kind of M-estimator, and many of its properties follow from the general theory of M-estimators. We need some more definitions. Define the plim of the Jacobian of the parameter-defining mapping, D = plim (1/n) ∂F^⊤ι/∂θ^⊤, and the plim of the covariance matrix of (1/√n) F^⊤ι, which is Ψ = plim (1/n) F^⊤F.
For IV, D = plim (1/n) ∂(W^⊤(y − Xβ))/∂β^⊤ = − plim_{n→∞} (1/n) W^⊤X, and

Ψ = plim (1/n) W^⊤ diag(y − Xβ) diag(y − Xβ) W = plim (1/n) W^⊤ Ω W,

where Ω is the diagonal matrix with typical element E[(y_i − x_i^⊤β)²], i.e., Ω = V[ε]. With this notation the theory of M-estimators gives us the following result: the asymptotic MSE matrix of the GMM estimator is
(59.0.23)  (D^⊤A_0 D)^{-1} D^⊤A_0 Ψ A_0 D (D^⊤A_0 D)^{-1}
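A small helper (an addition to the notes; it assumes NumPy) that evaluates this sandwich expression for given D, A_0, and Ψ, together with a numerical check of the simplification to (D^⊤Ψ^{-1}D)^{-1} under the optimal choice A_0 = Ψ^{-1} noted further below:

    import numpy as np

    def gmm_asymptotic_mse(D, A0, Psi):
        """Sandwich expression (59.0.23): (D'A0 D)^{-1} D'A0 Psi A0 D (D'A0 D)^{-1}."""
        bread = np.linalg.inv(D.T @ A0 @ D)
        return bread @ D.T @ A0 @ Psi @ A0 @ D @ bread

    rng = np.random.default_rng(6)
    D = rng.normal(size=(3, 2))            # ell = 3 moment conditions, k = 2 parameters
    M = rng.normal(size=(3, 3))
    Psi = M @ M.T + np.eye(3)              # positive definite
    V_opt = gmm_asymptotic_mse(D, np.linalg.inv(Psi), Psi)
    assert np.allclose(V_opt, np.linalg.inv(D.T @ np.linalg.inv(Psi) @ D))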
Plugging the IV expressions for D, A_0, and Ψ into (59.0.23) gives the following expression for the asymptotic MSE matrix of √n times the sampling error of the IV estimator:

(59.0.24)  plim ( (1/n) X^⊤W ((1/n) W^⊤W)^{-1} (1/n) W^⊤X )^{-1} (1/n) X^⊤W ((1/n) W^⊤W)^{-1} (1/n) W^⊤ΩW ((1/n) W^⊤W)^{-1} (1/n) W^⊤X ( (1/n) X^⊤W ((1/n) W^⊤W)^{-1} (1/n) W^⊤X )^{-1} =

(59.0.25)  = plim n (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1} X^⊤W(W^⊤W)^{-1}W^⊤ΩW(W^⊤W)^{-1}W^⊤X (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1}.

The asymptotic MSE matrix of the IV estimator itself can be obtained from this by dividing by n. An estimate of the asymptotic covariance matrix is therefore

(59.0.26)  (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1} X^⊤W(W^⊤W)^{-1}W^⊤ΩW(W^⊤W)^{-1}W^⊤X (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1},

with Ω replaced by an estimate. This is [DM93, (17.36) on p. 596].
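A sketch of how (59.0.26) might be computed (an addition to the notes; it assumes NumPy, and here Ω is replaced by the diagonal matrix of squared IV residuals, a common feasible choice):

    import numpy as np

    def iv_hc_cov(y, X, W, beta_iv):
        """Estimate (59.0.26) of the covariance matrix of the generalized IV
        estimator, with Omega replaced by the diagonal matrix of squared IV residuals."""
        u2 = (y - X @ beta_iv) ** 2                  # squared residuals
        WtW_inv = np.linalg.inv(W.T @ W)
        XtW = X.T @ W
        bread = np.linalg.inv(XtW @ WtW_inv @ W.T @ X)
        meat = XtW @ WtW_inv @ (W.T * u2) @ W @ WtW_inv @ W.T @ X
        return bread @ meat @ bread

Here beta_iv would be, for instance, the generalized IV estimate from the sketch given earlier.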
The best choice of such a weighting matrix is A_0 = Ψ^{-1}, in which case (59.0.23) simplifies to (D^⊤Ψ^{-1}D)^{-1} = (D^⊤A_0 D)^{-1}.
The criterion function which the optimal IV estimator must minimize, in the presence of unknown heteroskedasticity, is therefore

(59.0.27)  (y − Xβ)^⊤ W (W^⊤ΩW)^{-1} W^⊤ (y − Xβ).
The first-order conditions are

(59.0.28)  X^⊤W (W^⊤ΩW)^{-1} W^⊤(y − Xβ) = o,

and the optimally weighted IV estimator is

(59.0.29)  β̃ = (X^⊤W(W^⊤ΩW)^{-1}W^⊤X)^{-1} X^⊤W(W^⊤ΩW)^{-1} W^⊤y.
In this, Ω can be replaced by an inconsistent estimate, for instance the diagonal matrix with the squared 2SLS residuals on the diagonal; this is what [DM93] refer to as H2SLS. In the simple IV case, this estimator is the simple IV estimator again. In other words, we need more than the minimum number of instruments to be able to take advantage of the estimated heteroskedasticity. [Cra83] proposes, in the OLS case, i.e., W = X, to use the squares of the regressors etc. as additional instruments.
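A rough sketch of the H2SLS recipe just described (an addition to the notes; it assumes NumPy, and the function name h2sls is only a label):

    import numpy as np

    def h2sls(y, X, W):
        """Two-step feasible version of (59.0.29): first 2SLS, then re-estimate with
        W'Omega W replaced by sum_i u_i^2 w_i w_i' using the squared 2SLS residuals."""
        # Step 1: ordinary generalized IV (2SLS).
        Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)
        beta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
        u2 = (y - X @ beta_2sls) ** 2
        # Step 2: optimally weighted IV with the estimated weighting matrix.
        WOW = (W.T * u2) @ W                       # estimate of W'Omega W
        XtW = X.T @ W
        middle = np.linalg.solve(WOW, W.T)         # (W'Omega W)^{-1} W'
        return np.linalg.solve(XtW @ middle @ X, XtW @ middle @ y)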
To show this optimality, take some square nonsingular Q with Ψ^{-1} = Q^⊤Q and define P = Q^{-1}, so that Ψ = PP^⊤. Then
(59.0.30)  (D^⊤A_0 D)^{-1} D^⊤A_0 Ψ A_0 D (D^⊤A_0 D)^{-1} − (D^⊤Ψ^{-1}D)^{-1} =

(59.0.31)  = (D^⊤A_0 D)^{-1} D^⊤A_0 [ Ψ − D(D^⊤Ψ^{-1}D)^{-1}D^⊤ ] A_0 D (D^⊤A_0 D)^{-1}.
Now the middle matrix can be written as P ( I − QD(D^⊤Q^⊤QD)^{-1}D^⊤Q^⊤ ) P^⊤, which is nonnegative definite because the matrix in parentheses is symmetric and idempotent.
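For completeness, the idempotency claim can be verified directly (an added check, written in the notation above):

\begin{aligned}
M &:= I - QD\,(D^{\top}Q^{\top}QD)^{-1}D^{\top}Q^{\top},\\
M^{2} &= I - 2\,QD\,(D^{\top}Q^{\top}QD)^{-1}D^{\top}Q^{\top}
        + QD\,(D^{\top}Q^{\top}QD)^{-1}\,(D^{\top}Q^{\top}QD)\,(D^{\top}Q^{\top}QD)^{-1}D^{\top}Q^{\top} = M,\\
z^{\top}Mz &= z^{\top}M^{\top}Mz = \lVert Mz\rVert^{2} \ge 0 \quad\text{for every } z,
\end{aligned}

so that M, and with it PMP^⊤, is nonnegative definite.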
The advantage of the GMM is that it is valid for many different DGPs. In this respect it is the opposite of the maximum likelihood estimator, which needs a very specific DGP. The more broadly the DGP can be defined, the better the chances are that the GMM estimator is efficient, i.e., in large samples as good as maximum likelihood.
Bootstrap Estimators
The bootstrap method is an important general estimation principle, which can serve as an alternative to reliance on the asymptotic properties of an estimator. Assume you have an n × k data matrix X each row of which is an independent observation from the same unknown probability distribution, characterized by the cumulative distribution function F. Using this data set you want to draw conclusions about the distribution of some statistic θ(x) where x ∼ F.
The “bootstrap” estimation principle is very simple: as your estimate of the distribution of x you use F_n, the empirical distribution of the given sample X, i.e., that probability distribution which assigns probability mass 1/n to each of the k-dimensional observation points x_t (or, if the observation x_t occurred more than once, say j times, then you assign the probability mass j/n to this point). This empirical distribution function has been called the nonparametric maximum likelihood estimate of F. And your estimate of the distribution of θ(x) is that distribution which derives from this empirical distribution function. Just like the maximum likelihood principle, this principle is deceptively simple but has some deep probability-theoretic foundations.
In simple cases, this is a widely used principle; the sample mean, for instance, is the expected value of the empirical distribution, and the same is true about the sample variance (with divisor n) or the sample median, etc. But as soon as θ becomes a little more complicated, and one wants more complex measures of its distribution, such as the standard deviation of a complicated function of x, or some confidence intervals, an analytical expression for this bootstrap estimate is prohibitively complex.
But with the availability of modern computing power, an alternative to the analytical evaluation is feasible: draw a large random sample from the empirical distribution, evaluate θ(x) for each x in this artificially generated random sample, and use these data points to construct the distribution function of θ(x). A random sample from the empirical distribution is merely a random drawing from the given values with replacement. This requires computing power, since usually one has to resample between 1,000 and 10,000 times to get accurate results, but one does not need to do complicated math, and these so-called nonparametric bootstrap results are very close to the theoretical results wherever those are available.
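A minimal sketch of this resampling recipe (an addition to the notes; it assumes NumPy, and the 2,000 replications and the choice of the median as the statistic are arbitrary illustrative choices):

    import numpy as np

    def bootstrap_distribution(X, theta, n_boot=2000, seed=0):
        """Draw n_boot samples with replacement from the rows of X (i.e. from the
        empirical distribution F_n) and return theta evaluated on each resample."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        stats = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)     # resample row indices with replacement
            stats[b] = theta(X[idx])
        return stats

    rng = np.random.default_rng(7)
    X = rng.lognormal(size=(100, 1))             # a small one-variable sample
    boot = bootstrap_distribution(X, np.median)
    se_median = boot.std(ddof=1)                 # bootstrap standard error of the median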
So far we have been discussing the situation where all observations come from the same population. In the regression context this is not the case. In the OLS model with i.i.d. disturbances, the observations of the dependent variable y_t have different expected values, i.e., they do not come from the same population. On the other hand, the disturbances do come from the same population. Unfortunately, they are not observed, but it turns out that one can successfully apply bootstrap methods here by first computing the OLS residuals and then drawing from these residuals to get pseudo-datapoints and to run the regression on those. This is a surprising and strong result; but one has to be careful here that the OLS model is correctly specified. For instance, if there is heteroskedasticity which is not corrected for, then uniform resampling of the residuals is no longer appropriate, and the bootstrap estimates for least squares are inconsistent.
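A corresponding sketch of the residual bootstrap for OLS just described (an addition to the notes; it assumes NumPy, a correctly specified model, and homoskedastic disturbances):

    import numpy as np

    def residual_bootstrap_ols(y, X, n_boot=2000, seed=0):
        """Bootstrap the OLS estimator by resampling residuals, as described above;
        this presumes a correctly specified model with homoskedastic disturbances."""
        rng = np.random.default_rng(seed)
        n, k = X.shape
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta_hat
        betas = np.empty((n_boot, k))
        for b in range(n_boot):
            e_star = resid[rng.integers(0, n, size=n)]   # resampled residuals
            y_star = X @ beta_hat + e_star               # pseudo-observations
            betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
        return betas                                     # bootstrap distribution of beta_hat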
The jackknife is a much more complicated concept; it was originally invented, and is often still introduced, as a device to reduce bias, but [Efr82, p. 10] claims that this motivation is mistaken. It is an alternative to the bootstrap, in which random sampling is replaced by a symmetric, systematic “sampling” of datasets which are one observation smaller than the original one: namely, n drawings with one observation left out in each. In certain situations this is as good as bootstrapping, but much cheaper. A third concept is cross-validation.
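And a small sketch of the delete-one jackknife (an addition to the notes; it assumes NumPy), here used to compute a jackknife standard error of a statistic θ:

    import numpy as np

    def jackknife_se(X, theta):
        """Delete-one jackknife standard error of theta(X): n recomputations,
        each leaving out one observation (row) of X."""
        n = X.shape[0]
        leave_one_out = np.array([theta(np.delete(X, i, axis=0)) for i in range(n)])
        return np.sqrt((n - 1) / n * np.sum((leave_one_out - leave_one_out.mean()) ** 2))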
Trang 16There is a new book out, [ET93], for which the authors also have written
boot-strap and jackknife functions for Splus, to be found if one does attach("/home/econ/ehrbar/splus/boot/.Data")