Generalized Method of Moments Estimators
This follows mainly [DM93, Chapter 17]. A good and accessible treatment is [Mát99]. The textbook [Hay00] uses GMM as the organizing principle for all estimation methods except maximum likelihood.
A moment µ of a random variable y is the expected value of some function of y. Such a moment is therefore defined by the equation

(59.0.8)  E[g(y) − µ] = 0.

The same parameter-defining function g(y) − µ defines the method of moments estimator µ̂ of µ if one replaces the expected value in (59.0.8) with the sample mean of the elements of an observation vector y consisting of independent observations of y.
In other words, µ̂(y) is that value which satisfies (1/n) Σ_{i=1}^n (g(y_i) − µ̂) = 0.
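(The following tiny numerical sketch is an addition to these notes, not part of the original text; it assumes NumPy is available, and the function name mm_estimate and the choice g(y) = y² are purely illustrative.) Solving the defining equation for µ̂ amounts to taking the sample mean of g(y_i):

    import numpy as np

    def mm_estimate(y, g):
        """Method of moments estimate of mu = E[g(y)]: the value solving
        (1/n) * sum_i (g(y_i) - mu_hat) = 0, i.e. the sample mean of g(y_i)."""
        return np.mean(g(y))

    rng = np.random.default_rng(0)
    y = rng.gamma(shape=3.0, scale=0.5, size=1000)
    mu_hat = mm_estimate(y, lambda v: v**2)   # estimates the second moment E[y^2]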
The generalized method of moments estimator extends this rule in several respects: the y_i no longer have to be i.i.d., the parameter-defining equations may be a system of equations defining more than one parameter at a time, there may be more parameter-defining functions than parameters (overidentification), and not only unconditional but also conditional moments are considered.
Under this definition, the OLS estimator is a GMM estimator. To show this, we will write the linear model y = Xβ + ε row by row as y_i = x_i^⊤β + ε_i, where x_i is, as in various earlier cases, the ith row of X written as a column vector. The basic property which makes least squares consistent is that the following conditional expectation is zero:
(59.0.9)  E[y_i − x_i^⊤β | x_i] = 0
This is more information than just knowing that the unconditional expectation is zero. How can this additional information be used to define an estimator? From (59.0.9) it follows that the unconditional expectation of the following product is zero as well:
(59.0.10)  E[x_i(y_i − x_i^⊤β)] = o
Replacing the expected value by the sample mean gives

(1/n) Σ_{i=1}^n x_i(y_i − x_i^⊤β̂) = o,
which can also be written as

(1/n) [x_1 ⋯ x_n] [y_1 − x_1^⊤β̂, …, y_n − x_n^⊤β̂]^⊤ ≡ (1/n) X^⊤(y − Xβ̂) = o.

These are exactly the OLS Normal Equations. This shows that OLS in the linear model is a GMM estimator.
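As an illustration (added here, not part of the original notes; it assumes NumPy, and all variable names are arbitrary), solving the empirical moment condition (1/n) X^⊤(y − Xβ̂) = o numerically reproduces OLS exactly:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(size=n)

    # GMM with parameter-defining function f_i = x_i (y_i - x_i' beta):
    # setting the sample mean of the f_i to zero is the same as solving
    # the OLS normal equations X'(y - X beta) = 0.
    beta_gmm = np.linalg.solve(X.T @ X, X.T @ y)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    assert np.allclose(beta_gmm, beta_ols)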
Note that the rows of the X matrix play two different roles in this derivation: they appear in the equation y_i = x_i^⊤β + ε_i, and they are also the information set based on which the conditional expectation in (59.0.9) is formed. If this latter role is assumed by the rows of a different matrix of observations W, then the GMM estimator becomes the Instrumental Variables Estimator.
Most maximum likelihood estimators are also GMM estimators. As long as the maxima are at the interior of the parameter region, the ML estimators solve the first-order conditions, i.e., the Jacobian of the log likelihood function evaluated at these estimators is zero. And it follows from the theory of maximum likelihood estimation that the expected value of the Jacobian of the log likelihood function is zero, so the rows of this Jacobian can serve as parameter-defining functions.
Here are the general definitions and theorems, and, as examples, their applications to the textbook example of the Gamma distribution in [Gre97, p. 518] and to the Instrumental Variables estimator.
y is a vector of n observations created by a Data Generating Process (DGP) µ ∈ M. θ is a k-vector of nonrandom parameters. A parameter-defining function F(y, θ) is an n × ℓ matrix function with the following properties (a), (b), and (c): (a) the ith row only depends on the ith observation y_i, i.e., the rows of F(y, θ) are f_1^⊤(y_1, θ), …, f_n^⊤(y_n, θ).
Sometimes the f_i have identical functional form and only differ by the values of some exogenous variables, i.e., f_i(y_i, θ) = g(y_i, x_i, θ), but sometimes they have genuinely different functional forms.
In the Gamma-function example, M is the set of all Gamma distributions, θ = [r  λ]^⊤ consists of the two parameters of the Gamma distribution, ℓ = k = 2, and the parameter-defining function has the rows

(59.0.14)  f_i(y_i, θ) = ( y_i − r/λ ,  1/y_i − λ/(r−1) )^⊤,

so that F(y, θ) is the n × 2 matrix whose ith row is ( y_i − r/λ ,  1/y_i − λ/(r−1) ), i = 1, …, n.
In the IV case, θ = β and ℓ is the number of instruments. If we split X and W into their rows, X = [x_1 ⋯ x_n]^⊤ and W = [w_1 ⋯ w_n]^⊤, then f_i(y_i, β) = w_i(y_i − x_i^⊤β). This gives

(59.0.16)  F(y, β) = [(y_1 − x_1^⊤β)w_1 ⋯ (y_n − x_n^⊤β)w_n]^⊤ = diag(y − Xβ) W.
(b) The vector functions f_i(y_i, θ) must be such that the true value of the parameter vector θ_µ satisfies

(59.0.17)  E[f_i(y_i, θ_µ)] = o

for all i, while any other parameter vector θ ≠ θ_µ gives E[f_i(y_i, θ)] ≠ o.
In the Gamma example, (59.0.17) follows from the fact that the moments of the Gamma distribution are E[y] = r/λ and E[1/y_i] = λ/(r−1). It is also easy to see that r and λ are characterized by these two relations; given E[y] = µ and E[1/y_i] = ν, one can solve for r = µν/(µν − 1) and λ = ν/(µν − 1).
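To make this concrete, here is a small simulation sketch (an addition to the notes; it assumes NumPy and uses the shape/rate parametrization with E[y] = r/λ, as above):

    import numpy as np

    rng = np.random.default_rng(2)
    r_true, lam_true = 4.0, 2.0
    # NumPy's gamma() takes shape and scale = 1/rate, so E[y] = r/lambda.
    y = rng.gamma(shape=r_true, scale=1.0 / lam_true, size=100_000)

    mu = y.mean()          # estimates E[y]   = r / lambda
    nu = (1.0 / y).mean()  # estimates E[1/y] = lambda / (r - 1)

    r_hat = mu * nu / (mu * nu - 1.0)
    lam_hat = nu / (mu * nu - 1.0)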
In the IV model, (59.0.17) is satisfied if the ε_i have zero expectation conditionally on w_i, and uniqueness is condition (52.0.3), requiring that plim (1/n) W_n^⊤ X_n exists, is nonrandom, and has full column rank. (In the 781 handout Winter 1998, (52.0.3) was equation (246) on p. 154.)
Next we need a recipe for how to construct an estimator from this parameter-defining function. Let us first discuss the case k = ℓ (exact identification). The GMM estimator θ̂ defined by F satisfies

(59.0.18)  (1/n) F^⊤(y, θ̂) ι = o,

where ι is the vector of ones, which can also be written in the form

(1/n) Σ_{i=1}^n f_i(y_i, θ̂) = o.

Assumption (c) for a parameter-defining function is that there is only one θ̂ satisfying (59.0.18).
For IV,

(59.0.20)  F^⊤(y, β̃) ι = W^⊤ diag(y − Xβ̃) ι = W^⊤(y − Xβ̃).

If there are as many instruments as explanatory variables, setting this equal to zero gives the normal equation for the simple IV estimator, W^⊤(y − Xβ̃) = o.
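A minimal sketch of this exactly identified case (an addition to the notes; it assumes NumPy, and the data-generating choices are arbitrary), solving W^⊤(y − Xβ̃) = o directly:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 500
    w = rng.normal(size=n)                       # instrument
    u = rng.normal(size=n)                       # disturbance
    x = 0.8 * w + 0.5 * u + rng.normal(size=n)   # regressor correlated with u
    X = np.column_stack([np.ones(n), x])
    W = np.column_stack([np.ones(n), w])
    y = X @ np.array([1.0, 2.0]) + u

    # Exactly identified: W'X is square, so W'(y - X beta) = 0 can be solved exactly.
    beta_iv = np.linalg.solve(W.T @ X, W.T @ y)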
In the case ℓ > k, (59.0.17) still holds, but the system of equations (59.0.18) no longer has a solution: there are ℓ > k relationships for the k parameters. In order to handle this situation, we need to specify what qualifies as a weighting matrix. The symmetric positive definite ℓ × ℓ matrix A(y) is a weighting matrix if it has a nonrandom positive definite plim A_0 = plim_{n→∞} A(y). Instead of (59.0.18), now the following equation serves to define θ̂:

(59.0.21)  θ̂ = argmin_θ  ι^⊤ F(y, θ) A(y) F^⊤(y, θ) ι.

In this case, condition (c) for a parameter-defining function reads that there is only one θ̂ which minimizes this criterion function.
For IV, A(y) does not depend on y but is ((1/n) W^⊤W)^{-1}. Therefore A_0 = plim ((1/n) W^⊤W)^{-1}, and (59.0.21) becomes β̃ = argmin (y − Xβ)^⊤ W (W^⊤W)^{-1} W^⊤ (y − Xβ), which is indeed the quadratic form minimized by the generalized instrumental variables estimator.
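The closed-form solution of this minimization is β̃ = (X^⊤P_W X)^{-1} X^⊤P_W y with P_W = W(W^⊤W)^{-1}W^⊤. A short sketch (an addition to the notes; it assumes NumPy, and the function and variable names are illustrative):

    import numpy as np

    def give(y, X, W):
        """Generalized IV estimator: minimizes (y - Xb)' W (W'W)^{-1} W' (y - Xb)."""
        Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)   # projection of X onto the span of W
        return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

    rng = np.random.default_rng(4)
    n = 500
    w1, w2 = rng.normal(size=n), rng.normal(size=n)
    u = rng.normal(size=n)
    x = 0.6 * w1 + 0.3 * w2 + 0.5 * u + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    W = np.column_stack([np.ones(n), w1, w2])          # more instruments than regressors
    y = X @ np.array([1.0, 2.0]) + u
    beta_give = give(y, X, W)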
In order to convert the Gamma-function example into an overidentified system, we add a third relation:

(59.0.22)  F(y, θ) is the n × 3 matrix whose ith row is ( y_i − r/λ ,  1/y_i − λ/(r−1) ,  y_i² − r(r+1)/λ² ).
In this case it is possible to compute the asymptotic covariance matrix; but in real-life situations this covariance matrix is estimated using a preliminary consistent estimator of the parameters, as [Gre97] does it. Most GMM estimators depend on such a consistent pre-estimator.
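To illustrate the two-step idea on the overidentified Gamma example (59.0.22), here is a rough sketch (an addition to the notes; it assumes NumPy and SciPy, and all function names and tuning choices are illustrative only):

    import numpy as np
    from scipy.optimize import minimize

    def moments(theta, y):
        """Rows of F(y, theta) for the overidentified Gamma example (59.0.22)."""
        r, lam = theta
        return np.column_stack([
            y - r / lam,
            1.0 / y - lam / (r - 1.0),
            y**2 - r * (r + 1.0) / lam**2,
        ])

    def gmm_objective(theta, y, A):
        gbar = moments(theta, y).mean(axis=0)   # (1/n) F' iota
        return gbar @ A @ gbar                  # proportional to the criterion (59.0.21)

    rng = np.random.default_rng(5)
    y = rng.gamma(shape=4.0, scale=0.5, size=5000)      # r = 4, lambda = 2

    # Step 1: preliminary consistent estimate (the exactly identified
    # method-of-moments solution from the first two moment conditions).
    mu, nu = y.mean(), (1.0 / y).mean()
    theta0 = np.array([mu * nu / (mu * nu - 1.0), nu / (mu * nu - 1.0)])

    # Step 2: weighting matrix A = Psi^{-1} with Psi = (1/n) F'F at theta0,
    # then minimize the GMM criterion.
    F0 = moments(theta0, y)
    A = np.linalg.inv(F0.T @ F0 / len(y))
    result = minimize(gmm_objective, theta0, args=(y, A), method="Nelder-Mead")
    r_hat, lam_hat = result.x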
The GMM estimator θ̂ defined in this way is a particular kind of M-estimator, and many of its properties follow from the general theory of M-estimators. We need some more definitions. Define the plim of the Jacobian of the parameter-defining mapping, D = plim (1/n) ∂F^⊤ι/∂θ^⊤, and the plim of the covariance matrix of (1/√n) F^⊤ι, which is Ψ = plim (1/n) F^⊤F.
For IV, D = plim (1/n) ∂(W^⊤(y − Xβ))/∂β^⊤ = − plim_{n→∞} (1/n) W^⊤X, and

Ψ = plim (1/n) W^⊤ diag(y − Xβ) diag(y − Xβ) W = plim (1/n) W^⊤ Ω W,

where Ω is the diagonal matrix with typical element E[(y_i − x_i^⊤β)²], i.e., Ω = V[ε]. With this notation the theory of M-estimators gives us the following result: the asymptotic MSE matrix of the GMM estimator is
(59.0.23)  (D^⊤A_0 D)^{-1} D^⊤A_0 Ψ A_0 D (D^⊤A_0 D)^{-1}
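A small helper (an addition to the notes; it assumes NumPy) that evaluates this sandwich expression for given D, A_0, and Ψ, together with a numerical check of the simplification to (D^⊤Ψ^{-1}D)^{-1} under the optimal choice A_0 = Ψ^{-1} noted further below:

    import numpy as np

    def gmm_asymptotic_mse(D, A0, Psi):
        """Sandwich expression (59.0.23): (D'A0 D)^{-1} D'A0 Psi A0 D (D'A0 D)^{-1}."""
        bread = np.linalg.inv(D.T @ A0 @ D)
        return bread @ D.T @ A0 @ Psi @ A0 @ D @ bread

    rng = np.random.default_rng(6)
    D = rng.normal(size=(3, 2))            # ell = 3 moment conditions, k = 2 parameters
    M = rng.normal(size=(3, 3))
    Psi = M @ M.T + np.eye(3)              # positive definite
    V_opt = gmm_asymptotic_mse(D, np.linalg.inv(Psi), Psi)
    assert np.allclose(V_opt, np.linalg.inv(D.T @ np.linalg.inv(Psi) @ D))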
Plugging the IV expressions for D, A_0, and Ψ into (59.0.23) gives the following expression for the asymptotic MSE matrix of √n times the sampling error of the IV estimator:

(59.0.24)  plim ( (1/n) X^⊤W ((1/n) W^⊤W)^{-1} (1/n) W^⊤X )^{-1} (1/n) X^⊤W ((1/n) W^⊤W)^{-1} (1/n) W^⊤ΩW ((1/n) W^⊤W)^{-1} (1/n) W^⊤X ( (1/n) X^⊤W ((1/n) W^⊤W)^{-1} (1/n) W^⊤X )^{-1} =

(59.0.25)  = plim n (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1} X^⊤W(W^⊤W)^{-1}W^⊤ΩW(W^⊤W)^{-1}W^⊤X (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1}.

The asymptotic MSE matrix of the IV estimator itself can be obtained from this by dividing by n. An estimate of the asymptotic covariance matrix is therefore

(59.0.26)  (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1} X^⊤W(W^⊤W)^{-1}W^⊤ΩW(W^⊤W)^{-1}W^⊤X (X^⊤W(W^⊤W)^{-1}W^⊤X)^{-1},

with Ω replaced by an estimate. This is [DM93, (17.36) on p. 596].
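A sketch of how (59.0.26) might be computed (an addition to the notes; it assumes NumPy, and here Ω is replaced by the diagonal matrix of squared IV residuals, a common feasible choice):

    import numpy as np

    def iv_hc_cov(y, X, W, beta_iv):
        """Estimate (59.0.26) of the covariance matrix of the generalized IV
        estimator, with Omega replaced by the diagonal matrix of squared IV residuals."""
        u2 = (y - X @ beta_iv) ** 2                  # squared residuals
        WtW_inv = np.linalg.inv(W.T @ W)
        XtW = X.T @ W
        bread = np.linalg.inv(XtW @ WtW_inv @ W.T @ X)
        meat = XtW @ WtW_inv @ (W.T * u2) @ W @ WtW_inv @ W.T @ X
        return bread @ meat @ bread

Here beta_iv would be, for instance, the generalized IV estimate from the sketch given earlier.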
The best choice of such a weighting matrix is A_0 = Ψ^{-1}, in which case (59.0.23) simplifies to (D^⊤Ψ^{-1}D)^{-1} = (D^⊤A_0 D)^{-1}.
The criterion function which the optimal IV estimator must minimize, in the presence of unknown heteroskedasticity, is therefore

(59.0.27)  (y − Xβ)^⊤ W (W^⊤ΩW)^{-1} W^⊤ (y − Xβ).
The first-order conditions are

(59.0.28)  X^⊤W (W^⊤ΩW)^{-1} W^⊤(y − Xβ) = o,

and the optimally weighted IV estimator is

(59.0.29)  β̃ = (X^⊤W(W^⊤ΩW)^{-1}W^⊤X)^{-1} X^⊤W(W^⊤ΩW)^{-1} W^⊤y.
In this, Ω can be replaced by an inconsistent estimate, for instance the diagonal matrix with the squared 2SLS residuals on the diagonal; this is what [DM93] refer to as H2SLS. In the simple IV case, this estimator is the simple IV estimator again. In other words, we need more than the minimum number of instruments to be able to take advantage of the estimated heteroskedasticity. [Cra83] proposes, in the OLS case, i.e., W = X, to use the squares of the regressors etc. as additional instruments.
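A rough sketch of the H2SLS recipe just described (an addition to the notes; it assumes NumPy, and the function name h2sls is only a label):

    import numpy as np

    def h2sls(y, X, W):
        """Two-step feasible version of (59.0.29): first 2SLS, then re-estimate with
        W'Omega W replaced by sum_i u_i^2 w_i w_i' using the squared 2SLS residuals."""
        # Step 1: ordinary generalized IV (2SLS).
        Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)
        beta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
        u2 = (y - X @ beta_2sls) ** 2
        # Step 2: optimally weighted IV with the estimated weighting matrix.
        WOW = (W.T * u2) @ W                       # estimate of W'Omega W
        XtW = X.T @ W
        middle = np.linalg.solve(WOW, W.T)         # (W'Omega W)^{-1} W'
        return np.linalg.solve(XtW @ middle @ X, XtW @ middle @ y)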
To show this optimality, take some square nonsingular Q with Ψ^{-1} = Q^⊤Q and define P = Q^{-1}, so that Ψ = PP^⊤. Then
(59.0.30)  (D^⊤A_0 D)^{-1} D^⊤A_0 Ψ A_0 D (D^⊤A_0 D)^{-1} − (D^⊤Ψ^{-1}D)^{-1} =

(59.0.31)  = (D^⊤A_0 D)^{-1} D^⊤A_0 [ Ψ − D(D^⊤Ψ^{-1}D)^{-1}D^⊤ ] A_0 D (D^⊤A_0 D)^{-1}.
Now the middle matrix can be written as P ( I − QD(D^⊤Q^⊤QD)^{-1}D^⊤Q^⊤ ) P^⊤, which is nonnegative definite because the matrix in parentheses is symmetric and idempotent.
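For completeness, the idempotency claim can be verified directly (an added check, written in the notation above):

\begin{aligned}
M &:= I - QD\,(D^{\top}Q^{\top}QD)^{-1}D^{\top}Q^{\top},\\
M^{2} &= I - 2\,QD\,(D^{\top}Q^{\top}QD)^{-1}D^{\top}Q^{\top}
        + QD\,(D^{\top}Q^{\top}QD)^{-1}\,(D^{\top}Q^{\top}QD)\,(D^{\top}Q^{\top}QD)^{-1}D^{\top}Q^{\top} = M,\\
z^{\top}Mz &= z^{\top}M^{\top}Mz = \lVert Mz\rVert^{2} \ge 0 \quad\text{for every } z,
\end{aligned}

so that M, and with it PMP^⊤, is nonnegative definite.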
The advantage of the GMM is that it is valid for many different DGPs. In this respect it is the opposite of the maximum likelihood estimator, which needs a very specific DGP. The more broadly the DGP can be defined, the better the chances are that the GMM estimator is efficient, i.e., in large samples as good as maximum likelihood.
Bootstrap Estimators
The bootstrap method is an important general estimation principle, which can serve as an alternative to reliance on the asymptotic properties of an estimator. Assume you have an n × k data matrix X each row of which is an independent observation from the same unknown probability distribution, characterized by the cumulative distribution function F. Using this data set you want to draw conclusions about the distribution of some statistic θ(x) where x ∼ F.
The “bootstrap” estimation principle is very simple: as your estimate of the distribution of x you use F_n, the empirical distribution of the given sample X, i.e., that probability distribution which assigns probability mass 1/n to each of the k-dimensional observation points x_t (or, if the observation x_t occurred more than once, say j times, then you assign the probability mass j/n to this point). This empirical distribution function has been called the nonparametric maximum likelihood estimate of F. And your estimate of the distribution of θ(x) is that distribution which derives from this empirical distribution function. Just like the maximum likelihood principle, this principle is deceptively simple but has some deep probability-theoretic foundations.
In simple cases, this is a widely used principle; the sample mean, for instance, is the expected value of the empirical distribution, and the same is true about the sample variance (with divisor n) or the sample median, etc. But as soon as θ becomes a little more complicated, and one wants more complex measures of its distribution, such as the standard deviation of a complicated function of x, or some confidence intervals, an analytical expression for this bootstrap estimate is prohibitively complex.
But with the availability of modern computing power, an alternative to the analytical evaluation is feasible: draw a large random sample from the empirical distribution, evaluate θ(x) for each x in this artificially generated random sample, and use these data points to construct the distribution function of θ(x). A random sample from the empirical distribution is merely a random drawing from the given values with replacement. This requires computing power, since usually one has to resample between 1,000 and 10,000 times to get accurate results, but one does not need to do complicated math, and these so-called nonparametric bootstrap results are very close to the theoretical results wherever those are available.
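A minimal sketch of this resampling recipe (an addition to the notes; it assumes NumPy, and the 2,000 replications and the choice of the median as the statistic are arbitrary illustrative choices):

    import numpy as np

    def bootstrap_distribution(X, theta, n_boot=2000, seed=0):
        """Draw n_boot samples with replacement from the rows of X (i.e. from the
        empirical distribution F_n) and return theta evaluated on each resample."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        stats = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)     # resample row indices with replacement
            stats[b] = theta(X[idx])
        return stats

    rng = np.random.default_rng(7)
    X = rng.lognormal(size=(100, 1))             # a small one-variable sample
    boot = bootstrap_distribution(X, np.median)
    se_median = boot.std(ddof=1)                 # bootstrap standard error of the median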
So far we have been discussing the situation where all observations come from the same population. In the regression context this is not the case. In the OLS model with i.i.d. disturbances, the observations of the dependent variable y_t have different expected values, i.e., they do not come from the same population. On the other hand, the disturbances do come from the same population. Unfortunately, they are not observed, but it turns out that one can successfully apply bootstrap methods here by first computing the OLS residuals and then drawing from these residuals to get pseudo-datapoints and to run the regression on those. This is a surprising and strong result; but one has to be careful here that the OLS model is correctly specified. For instance, if there is heteroskedasticity which is not corrected for, then uniform resampling of the residuals is no longer appropriate, and the bootstrap estimates for least squares are inconsistent.
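A corresponding sketch of the residual bootstrap for OLS just described (an addition to the notes; it assumes NumPy, a correctly specified model, and homoskedastic disturbances):

    import numpy as np

    def residual_bootstrap_ols(y, X, n_boot=2000, seed=0):
        """Bootstrap the OLS estimator by resampling residuals, as described above;
        this presumes a correctly specified model with homoskedastic disturbances."""
        rng = np.random.default_rng(seed)
        n, k = X.shape
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ beta_hat
        betas = np.empty((n_boot, k))
        for b in range(n_boot):
            e_star = resid[rng.integers(0, n, size=n)]   # resampled residuals
            y_star = X @ beta_hat + e_star               # pseudo-observations
            betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
        return betas                                     # bootstrap distribution of beta_hat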
The jackknife is a much more complicated concept; it was originally invented, and is often still introduced, as a device to reduce bias, but [Efr82, p. 10] claims that this motivation is mistaken. It is an alternative to the bootstrap, in which random sampling is replaced by a symmetric, systematic “sampling” of datasets which are one observation smaller than the original one: namely, n drawings with one observation left out in each. In certain situations this is as good as bootstrapping, but much cheaper. A third concept is cross-validation.
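And a small sketch of the delete-one jackknife (an addition to the notes; it assumes NumPy), here used to compute a jackknife standard error of a statistic θ:

    import numpy as np

    def jackknife_se(X, theta):
        """Delete-one jackknife standard error of theta(X): n recomputations,
        each leaving out one observation (row) of X."""
        n = X.shape[0]
        leave_one_out = np.array([theta(np.delete(X, i, axis=0)) for i in range(n)])
        return np.sqrt((n - 1) / n * np.sum((leave_one_out - leave_one_out.mean()) ** 2))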
Trang 16There is a new book out, [ET93], for which the authors also have written
boot-strap and jackknife functions for Splus, to be found if one does attach("/home/econ/ehrbar/splus/boot/.Data")