We call φ̂ an "estimator" of φ if φ is nonrandom, and a "predictor" if it is random. For scalar random variables we will use the mean squared error as a criterion for closeness. Its definition is MSE[φ̂; φ] (read it: mean squared error of φ̂ as an estimator or predictor, whatever the case may be, of φ):
(23.0.1) MSE[φ̂; φ] = E[(φ̂ − φ)²]
For our purposes, therefore, the estimator (or predictor) φ̂ of the unknown parameter (or unobserved random variable) φ is no worse than the alternative φ̃ if MSE[φ̂; φ] ≤ MSE[φ̃; φ]. This is a criterion which can be applied before any observations are collected and actual estimations are made; it is an "initial" criterion regarding the expected average performance in a series of future trials (even though, in economics, usually only one trial is made).
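As an illustration of how this criterion is applied, the following minimal simulation sketch (Python with NumPy is assumed here; the parameter value, sample size, and estimators are arbitrary illustrative choices) approximates MSE[φ̂; φ] by averaging squared errors over many repeated samples and compares two estimators of a normal location parameter.

    import numpy as np

    # Approximate MSE[phi_hat; phi] by the average squared error over repeated samples.
    rng = np.random.default_rng(0)
    phi = 2.0                 # true (nonrandom) parameter
    n, reps = 25, 100_000

    samples = rng.normal(loc=phi, scale=1.0, size=(reps, n))
    phi_hat = samples.mean(axis=1)               # estimator 1: sample mean
    phi_tilde = np.median(samples, axis=1)       # estimator 2: sample median

    mse_hat = np.mean((phi_hat - phi) ** 2)      # close to 1/n = 0.04
    mse_tilde = np.mean((phi_tilde - phi) ** 2)  # close to pi/(2n), i.e. larger
    print(mse_hat, mse_tilde)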
23.1 Comparison of Two Vector Estimators
If one wants to compare two vector estimators, say φ̂ and φ̃, it is often impossible to say which of the two is better. It may be the case that φ̂1 is better than φ̃1 (in terms of MSE or some other criterion), but φ̂2 is worse than φ̃2. And even if every component φi is estimated better by φ̂i than by φ̃i, certain linear combinations t>φ of the components of φ may be estimated better by t>φ̃ than by t>φ̂.
Problem 294. 2 points. Construct an example of two vector estimators φ̂ and φ̃ of the same random vector φ = [φ1 φ2]>, so that MSE[φ̂i; φi] < MSE[φ̃i; φi] for i = 1, 2 but MSE[φ̂1 + φ̂2; φ1 + φ2] > MSE[φ̃1 + φ̃2; φ1 + φ2]. Hint: it is easiest to use an example in which all random variables are constants. Another hint: the geometric analog would be to find two vectors in a plane, φ̂ and φ̃. In each component (i.e., projection on the axes), φ̂ is closer to the origin than φ̃, but in the projection on the diagonal, φ̃ is closer to the origin than φ̂.
Answer. In the simplest counterexample, all variables involved are constants: take φ = [0 0]>, φ̂ = [1 1]>, and φ̃ = [−2 2]>. Then MSE[φ̂i; φi] = 1 < 4 = MSE[φ̃i; φi] for i = 1, 2, but MSE[φ̂1 + φ̂2; φ1 + φ2] = 4 > 0 = MSE[φ̃1 + φ̃2; φ1 + φ2].
The MSE-matrix of a vector estimator φ̂ as an estimator of φ is defined as
(23.1.1) MSE[φ̂; φ] = E[(φ̂ − φ)(φ̂ − φ)>]
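The following sketch (Python with NumPy assumed; it uses the constants from the answer to Problem 294 above) computes the componentwise MSEs and the MSE of the sum, and evaluates the MSE-matrix (23.1.1), which for constant estimators requires no expectation.

    import numpy as np

    phi       = np.array([0.0, 0.0])
    phi_hat   = np.array([1.0, 1.0])
    phi_tilde = np.array([-2.0, 2.0])

    def mse_matrix(est, true):
        e = (est - true).reshape(-1, 1)
        return e @ e.T            # (23.1.1) for a constant estimator

    # componentwise, phi_hat wins (1 < 4 in each component) ...
    print(np.diag(mse_matrix(phi_hat, phi)), np.diag(mse_matrix(phi_tilde, phi)))
    # ... but for the sum of the components, phi_tilde wins (4 > 0):
    print((phi_hat.sum() - phi.sum()) ** 2, (phi_tilde.sum() - phi.sum()) ** 2)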
Problem 295. 2 points. Let θ be a vector of possibly random parameters, and θ̂ an estimator of θ. Show that
(23.1.2) MSE[θ̂; θ] = V[θ̂ − θ] + (E[θ̂ − θ])(E[θ̂ − θ])>
Do not assume the scalar result, but give a proof that is valid for vectors and scalars alike.
Answer. For any random vector x,
E[xx>] = E[(x − E[x] + E[x])(x − E[x] + E[x])>]
= E[(x − E[x])(x − E[x])>] + E[x − E[x]] E[x]> + E[x] E[x − E[x]]> + E[x] E[x]>
= V[x] + O + O + E[x] E[x]>.
Setting x = θ̂ − θ gives (23.1.2).
If θ is nonrandom, formula (23.1.2) simplifies slightly, since in this case V[θ̂ − θ] = V[θ̂]. In this case, the MSE-matrix is the covariance matrix plus the squared bias matrix. If θ is nonrandom and in addition θ̂ is unbiased, then the MSE-matrix coincides with the covariance matrix.
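A minimal Monte Carlo sketch of the decomposition (23.1.2) (Python with NumPy assumed; the deliberately biased estimator and all numbers are illustrative): the simulated MSE-matrix should match the simulated covariance matrix plus the outer product of the bias with itself.

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.array([1.0, -2.0])      # nonrandom parameter vector
    reps, n = 200_000, 20

    x = rng.normal(theta, 1.0, size=(reps, n, 2))
    theta_hat = x.mean(axis=1) + np.array([0.3, 0.0])   # sample mean, biased by (0.3, 0)

    err = theta_hat - theta
    mse = (err[:, :, None] * err[:, None, :]).mean(axis=0)   # E[(err)(err)']
    cov = np.cov(err, rowvar=False)
    bias = err.mean(axis=0)
    print(np.round(mse, 3))
    print(np.round(cov + np.outer(bias, bias), 3))           # should agree with mse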
Theorem 23.1.1. Assume φ̂ and φ̃ are two estimators of the parameter φ (which is allowed to be random itself). Then conditions (23.1.3), (23.1.4), and (23.1.5) are equivalent:
(23.1.3) For every constant vector t, MSE[t>φ̂; t>φ] ≤ MSE[t>φ̃; t>φ];
(23.1.4) MSE[φ̃; φ] − MSE[φ̂; φ] is a nonnegative definite matrix;
(23.1.5) For every nonnegative definite matrix Θ, tr(Θ MSE[φ̂; φ]) ≤ tr(Θ MSE[φ̃; φ]).
To complete the proof, note that (23.1.5) has (23.1.3) as a special case if one sets Θ = tt>.
Problem 296. Show that if Θ and Σ are symmetric and nonnegative definite, then tr(ΘΣ) ≥ 0. You are allowed to use that tr(AB) = tr(BA), that the trace of a nonnegative definite matrix is ≥ 0, and Problem 129 (which is trivial).
Answer. Write Θ = RR>; then tr(ΘΣ) = tr(RR>Σ) = tr(R>ΣR) ≥ 0.
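A quick numerical check of this fact (Python with NumPy assumed; the dimension and random draws are arbitrary): build nonnegative definite matrices as RR> and SS>, and confirm that the trace of their product is nonnegative and equals tr(R>ΣR).

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(5):
        R, S = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
        Theta, Sigma = R @ R.T, S @ S.T      # nonnegative definite by construction
        t1 = np.trace(Theta @ Sigma)
        t2 = np.trace(R.T @ Sigma @ R)       # trace of a nonnegative definite matrix
        print(t1 >= 0, np.isclose(t1, t2))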
Problem 297. Consider two very simple-minded estimators of the unknown nonrandom parameter vector φ = [φ1 φ2]>. Neither of these estimators depends on any observations; they are constants. The first estimator is φ̂ = [1 1]>, and the second is φ̃ = [2 −2]>.
• a. Compute the MSE-matrices of these two estimators if the true value of the parameter vector is φ = [0 0]>. Which estimator has the smaller trace of its MSE-matrix?
Answer. φ̂ has the smaller trace of the MSE-matrix. Since both estimators are constants, the sampling errors are themselves constants, φ̂ − φ = [1 1]> and φ̃ − φ = [2 −2]>, so that
MSE[φ̂; φ] = [1 1; 1 1] and MSE[φ̃; φ] = [4 −4; −4 4],
with traces 2 and 8 respectively. Note that both MSE-matrices are singular, i.e., both estimators allow an error-free look at certain linear combinations of the parameter vector.
• b. 1 point. Give two vectors g = [g1 g2]> and h = [h1 h2]> satisfying MSE[g>φ̂; g>φ] < MSE[g>φ̃; g>φ] and MSE[h>φ̂; h>φ] > MSE[h>φ̃; h>φ]. (g and h are not unique; there are many possibilities.)
Answer. With g = [1 −1]> and h = [1 1]>, for instance, we get g>φ̂ − g>φ = 0, g>φ̃ − g>φ = 4, h>φ̂ − h>φ = 2, and h>φ̃ − h>φ = 0; therefore MSE[g>φ̂; g>φ] = 0, MSE[g>φ̃; g>φ] = 16, MSE[h>φ̂; h>φ] = 4, and MSE[h>φ̃; h>φ] = 0. An alternative way to compute these is via MSE[t>φ̂; t>φ] = t> MSE[φ̂; φ] t.
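A sketch of these computations (Python with NumPy assumed), using the sampling errors φ̂ − φ = [1 1]> and φ̃ − φ = [2 −2]> from above; it evaluates MSE[t>φ̂; t>φ] = t> MSE[φ̂; φ] t for t = g and t = h.

    import numpy as np

    e_hat, e_tilde = np.array([1.0, 1.0]), np.array([2.0, -2.0])
    M_hat, M_tilde = np.outer(e_hat, e_hat), np.outer(e_tilde, e_tilde)  # MSE-matrices

    g, h = np.array([1.0, -1.0]), np.array([1.0, 1.0])
    print(g @ M_hat @ g, g @ M_tilde @ g)   # 0.0 vs 16.0
    print(h @ M_hat @ h, h @ M_tilde @ h)   # 4.0 vs 0.0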
CHAPTER 24
Sampling Properties of the Least Squares Estimator
The estimator β̂ was derived from a geometric argument, and everything which we showed so far is what [DM93, p. 3] calls its numerical as opposed to its statistical properties. But β̂ also has nice statistical or sampling properties. We are assuming right now the specification given in (18.1.3), in which X is an arbitrary matrix of full column rank, and we are not assuming that the errors must be normally distributed. The assumption that X is nonrandom means that repeated samples are taken with the same X-matrix. This is often true for experimental data, but not in econometrics. The sampling properties which we are really interested in are those where also the X-matrix is random; we will derive those later. For this later derivation, the properties with fixed X-matrix, which we are going to discuss presently, will be needed as an intermediate step. The assumption of fixed X is therefore a preliminary technical assumption, to be dropped later.
In order to know how good the estimator β̂ is, one needs the statistical properties of its "sampling error" β̂ − β. This sampling error has the following formula:
(24.0.7) β̂ − β = (X>X)⁻¹X>y − β = (X>X)⁻¹X>(Xβ + ε) − β = (X>X)⁻¹X>ε
We will use the MSE-matrix as a criterion for how good an estimator of a vector of unobserved parameters is. Chapter 23 gave some reasons why this is a sensible criterion (compare [DM93, Chapter 5.5]).
24.1 The Gauss Markov Theorem
Returning to the least squares estimator β̂, one obtains, using (24.0.7), that
(24.1.1) MSE[β̂; β] = E[(β̂ − β)(β̂ − β)>] = (X>X)⁻¹X>E[εε>]X(X>X)⁻¹ = σ²(X>X)⁻¹.
This is a very simple formula. Its most interesting aspect is that this MSE-matrix does not depend on the value of the true β. In particular this means that it is bounded with respect to β, which is important for someone who wants to be assured of a certain accuracy even in the worst possible situation.
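A simulation sketch of (24.0.7) and (24.1.1) (Python with NumPy assumed; the design matrix, coefficients, and sample sizes are arbitrary): repeated samples are drawn with the same fixed X, and the Monte Carlo MSE-matrix of β̂ is compared with σ²(X>X)⁻¹.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k, sigma2, reps = 50, 3, 4.0, 50_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # fixed design
    beta = np.array([1.0, 2.0, -0.5])
    XtX_inv = np.linalg.inv(X.T @ X)

    eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
    y = X @ beta + eps                    # reps repeated samples, same X
    beta_hat = y @ X @ XtX_inv            # each row is (X'X)^{-1} X' y
    err = beta_hat - beta                 # equals (X'X)^{-1} X' eps, cf. (24.0.7)

    mse = err.T @ err / reps              # Monte Carlo E[(err)(err)']
    print(np.round(mse, 3))
    print(np.round(sigma2 * XtX_inv, 3))  # should agree up to simulation noise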
Problem 298. 2 points. Compute the MSE-matrix MSE[ε̂; ε] = E[(ε̂ − ε)(ε̂ − ε)>] of the residuals as predictors of the disturbances.
Answer. Write ε̂ − ε = Mε − ε = (M − I)ε = −X(X>X)⁻¹X>ε; therefore MSE[ε̂; ε] = E[X(X>X)⁻¹X>εε>X(X>X)⁻¹X>] = σ²X(X>X)⁻¹X>. Alternatively, start with ε̂ − ε = y − Xβ̂ − ε = Xβ + ε − Xβ̂ − ε = X(β − β̂). This allows one to use MSE[ε̂; ε] = X MSE[β̂; β]X> = σ²X(X>X)⁻¹X>.
Problem 299. 2 points. Let v be a random vector that is a linear transformation of y, i.e., v = Ty for some constant matrix T. Furthermore v satisfies E[v] = o. Show that from this follows v = Tε̂. (In other words, no other transformation of y with zero expected value is more "comprehensive" than ε̂. However there are many other transformations of y with zero expected value which are as "comprehensive" as ε̂.)
Answer. E[v] = TXβ must be o whatever the value of β. Therefore TX = O, from which follows TM = T. Since ε̂ = My, this gives immediately v = Ty = TMy = Tε̂. (This is the statistical implication of the mathematical fact that M is a deficiency matrix of X.)
Problem 300. 2 points. Show that β̂ and ε̂ are uncorrelated, i.e., cov[β̂i, ε̂j] = 0 for all i, j. Defining the covariance matrix C[β̂, ε̂] as that matrix whose (i, j) element is cov[β̂i, ε̂j], this can also be written as C[β̂, ε̂] = O. Hint: The covariance matrix satisfies the rules C[Ay, Bz] = A C[y, z]B> and C[y, y] = V[y]. (Other rules for the covariance matrix, which will not be needed here, are C[z, y] = (C[y, z])>, C[x + y, z] = C[x, z] + C[y, z], C[x, y + z] = C[x, y] + C[x, z], and C[y, c] = O if c is a constant vector.)
Problem 301. Assume the regression has an intercept term. Show that the covariance matrix (here consisting of one row only) that contains all the covariances
(24.1.2) C[ȳ, β̂] ≡ [cov[ȳ, β̂1] cov[ȳ, β̂2] · · · cov[ȳ, β̂k]]
has the following form: C[ȳ, β̂] = (σ²/n)[1 0 · · · 0], where n is the number of observations. Hint: That the regression has an intercept term as first column of the X-matrix means that Xe(1) = ι, where e(1) is the unit vector having 1 in the first place and zeros elsewhere, and ι is the vector which has ones everywhere.
Answer. Write both ȳ and β̂ in terms of y, i.e., ȳ = (1/n)ι>y and β̂ = (X>X)⁻¹X>y. Therefore
(24.1.3) C[ȳ, β̂] = (1/n)ι> V[y] X(X>X)⁻¹ = (σ²/n)ι>X(X>X)⁻¹ = (σ²/n)(Xe(1))>X(X>X)⁻¹ = (σ²/n)e(1)>X>X(X>X)⁻¹ = (σ²/n)e(1)>.
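A simulation check of this answer (Python with NumPy assumed; all numbers are illustrative): with an intercept as the first column of X, the simulated covariances of ȳ with the components of β̂ should be approximately σ²/n for the first component and 0 for the others.

    import numpy as np

    rng = np.random.default_rng(4)
    n, sigma2, reps = 40, 2.0, 200_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept first
    beta = np.array([1.0, 0.5, -1.0])
    XtX_inv = np.linalg.inv(X.T @ X)

    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
    ybar = y.mean(axis=1)
    beta_hat = y @ X @ XtX_inv

    covs = np.cov(np.column_stack([ybar, beta_hat]), rowvar=False)[0, 1:]
    print(np.round(covs, 4), sigma2 / n)   # approximately (sigma2/n, 0, 0)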
Theorem 24.1.1 (Gauss Markov Theorem). β̂ is the Best Linear Unbiased Estimator (BLUE) of β in the following sense: for every nonrandom coefficient vector t, t>β̂ is the best linear unbiased estimator of the scalar φ = t>β, i.e., every other linear unbiased estimator φ̃ = a>y of φ = t>β has a bigger MSE than t>β̂.
Proof. Write the alternative linear estimator φ̃ = a>y in the form
(24.1.4) φ̃ = (t>(X>X)⁻¹X> + c>)y;
unbiasedness requires c>X = o>, and then the sampling error is φ̃ − φ = (t>(X>X)⁻¹X> + c>)ε, so that
MSE[φ̃; φ] = E[(φ̃ − φ)²] = E[(t>(X>X)⁻¹X> + c>)εε>(X(X>X)⁻¹t + c)]
= σ²(t>(X>X)⁻¹X> + c>)(X(X>X)⁻¹t + c) = σ²t>(X>X)⁻¹t + σ²c>c.
Here we needed again c>X = o>. Clearly, this is minimized if c = o, in which case MSE[φ̃; φ] = σ²t>(X>X)⁻¹t = MSE[t>β̂; φ].
Problem 302. Show that β̂ is also the BLUE of the whole vector β in the following sense: the MSE-matrix of any other linear unbiased estimator β̃ of β exceeds MSE[β̂; β] by a nonnegative definite matrix.
Answer. (Compare [DM93, p. 159].) Any other linear estimator β̃ of β can be written as β̃ = ((X>X)⁻¹X> + C)y. Its expected value is E[β̃] = (X>X)⁻¹X>Xβ + CXβ. For β̃ to be unbiased, regardless of the value of β, C must satisfy CX = O. But then it follows
MSE[β̃; β] = V[β̃] = σ²((X>X)⁻¹X> + C)(X(X>X)⁻¹ + C>) = σ²(X>X)⁻¹ + σ²CC>,
i.e., it exceeds the MSE-matrix of β̂ by a nonnegative definite matrix.
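A sketch of this comparison (Python with NumPy assumed; X, σ², and the particular C are arbitrary): any C with CX = O can be obtained by postmultiplying an arbitrary matrix by the annihilator M of X, and the MSE-matrix of the resulting unbiased estimator exceeds σ²(X>X)⁻¹ by the nonnegative definite matrix σ²CC>.

    import numpy as np

    rng = np.random.default_rng(5)
    n, k, sigma2 = 30, 3, 1.5
    X = rng.normal(size=(n, k))
    XtX_inv = np.linalg.inv(X.T @ X)
    M = np.eye(n) - X @ XtX_inv @ X.T
    C = rng.normal(size=(k, n)) @ M          # satisfies C X = O, hence unbiasedness
    A = XtX_inv @ X.T + C                    # beta_tilde = A y

    mse_tilde = sigma2 * A @ A.T             # V[A y] = sigma^2 A A'
    diff = mse_tilde - sigma2 * XtX_inv
    print(np.allclose(C @ X, 0))
    print(np.allclose(diff, sigma2 * C @ C.T), np.all(np.linalg.eigvalsh(diff) >= -1e-9))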
24.2 Digression about Minimax Estimators
Theorem 24.1.1 is a somewhat puzzling property of the least squares estimator, since there is no reason in the world to restrict one's search for good estimators to unbiased estimators. An alternative and more enlightening characterization of β̂ does not use unbiasedness but the minimax property:
Theorem 24.2.2. β̂ is a linear minimax estimator of the parameter vector β in the following sense: for every nonrandom coefficient vector t, t>β̂ is the linear minimax estimator of the scalar φ = t>β with respect to the MSE. I.e., for every other linear estimator φ̃ = a>y of φ one can find a value β = β0 for which φ̃ has a larger MSE than the largest possible MSE of t>β̂.
Proof: as in the proof of Theorem 24.1.1, write the alternative linear estimator in the form φ̃ = (t>(X>X)⁻¹X> + c>)y. Now there are two cases: if c>X = o>, then MSE[φ̃; φ] = σ²t>(X>X)⁻¹t + σ²c>c. This does not depend on β, and if c ≠ o then this MSE is larger than that for c = o. If c>X ≠ o>, then MSE[φ̃; φ] is unbounded, i.e., for any finite number ω one can always find a β0 for which MSE[φ̃; φ] > ω. Since MSE[φ̂; φ] is bounded, a β0 can be found that satisfies (24.2.1).
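The two cases of the proof can be seen numerically in the following sketch (Python with NumPy assumed; all values are illustrative): when c>X ≠ o>, the MSE of φ̃ = a>y contains the bias term (c>Xβ)² and can be made arbitrarily large by choosing β, while the MSE of t>β̂ stays at σ²t>(X>X)⁻¹t.

    import numpy as np

    rng = np.random.default_rng(6)
    n, k, sigma2 = 30, 2, 1.0
    X = rng.normal(size=(n, k))
    XtX_inv = np.linalg.inv(X.T @ X)
    t = np.array([1.0, -1.0])
    c = rng.normal(size=n)                # generic c, so c'X != o'
    a = X @ XtX_inv @ t + c               # phi_tilde = a'y

    mse_ols = sigma2 * t @ XtX_inv @ t    # does not depend on beta
    for scale in [1.0, 10.0, 100.0]:
        beta0 = scale * np.ones(k)
        bias = c @ X @ beta0              # E[phi_tilde] - t'beta0
        mse_alt = bias ** 2 + sigma2 * (a @ a)
        print(scale, mse_alt, mse_ols)    # mse_alt grows without bound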
If we characterize the BLUE as a minimax estimator, we are using a consistent and unified principle. It is based on the concept of the MSE alone, not on a mixture between the concepts of unbiasedness and the MSE. This explains why the mathematical theory of the least squares estimator is so rich.
On the other hand, a minimax strategy is not a good estimation strategy. Nature is not the adversary of the researcher; it does not maliciously choose β in such a way that the researcher will be misled. This explains why the least squares principle, despite the beauty of its mathematical theory, does not give terribly good estimators (in fact, they are inadmissible, see the Section about the Stein rule below).
β̂ is therefore simultaneously the solution to two very different minimization problems. We will refer to it as the OLS estimate if we refer to its property of minimizing the sum of squared errors, and as the BLUE estimator if we think of it as the best linear unbiased estimator.
Note that even if σ² were known, one could not get a better linear unbiased estimator of β.
24.3 Miscellaneous Properties of the BLUE
Problem 303.
• a. 1 point. Instead of (18.2.22) one sometimes sees the formula
(24.3.1) β̂ = Σ(xt − x̄)yt / Σ(xt − x̄)²
for the slope parameter in the simple regression. Show that these formulas are mathematically equivalent.
Answer. Equivalence of (24.3.1) and (18.2.22) follows from Σ(xt − x̄) = 0 and therefore also ȳΣ(xt − x̄) = 0. Alternative proof, using matrix notation and the matrix D defined in Problem 189: (18.2.22) is x>D>Dy/(x>D>Dx) and (24.3.1) is x>Dy/(x>Dx); these coincide because D is symmetric and idempotent. (A numerical check appears in the sketch after this problem.)
• c. 2 points. Show that cov[β̂, ȳ] = 0.
Answer. This is a special case of Problem 301, but it can be easily shown here separately:
cov[β̂, ȳ] = cov[ Σs(xs − x̄)ys / Σt(xt − x̄)² , (1/n)Σj yj ]
= (1/(n Σt(xt − x̄)²)) Σs (xs − x̄) Σj cov[ys, yj]
= (σ²/(n Σt(xt − x̄)²)) Σs (xs − x̄) = 0.
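The numerical check referred to in the answer to part a (Python with NumPy assumed; the data-generating values are arbitrary): the two slope formulas coincide on any sample, and over repeated samples with fixed x the simulated covariance of β̂ and ȳ is approximately zero.

    import numpy as np

    rng = np.random.default_rng(12)
    n, alpha, beta, sigma, reps = 25, 1.0, 2.0, 1.0, 200_000
    x = rng.uniform(0.0, 5.0, size=n)
    dx = x - x.mean()

    y1 = alpha + beta * x + rng.normal(scale=sigma, size=n)
    print(np.isclose(dx @ (y1 - y1.mean()) / (dx @ dx), dx @ y1 / (dx @ dx)))  # part a

    y = alpha + beta * x + rng.normal(scale=sigma, size=(reps, n))
    beta_hat = y @ dx / (dx @ dx)
    ybar = y.mean(axis=1)
    print(np.cov(beta_hat, ybar)[0, 1])   # part c: approximately 0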
Problem 304. We assume that (24.3.5), the simple regression without an intercept term, yi = xiβ + εi, is the true model, and we consider the estimator β̃ = Σxiyi / Σxi².
• a. 1 point. Is β̃ an unbiased estimator of β? (Proof is required.)
Answer. First derive a nice expression for β̃ − β:
β̃ − β = Σxiyi / Σxi² − β = Σxi(xiβ + εi) / Σxi² − β = (βΣxi² + Σxiεi) / Σxi² − β = Σxiεi / Σxi²,
therefore E[β̃ − β] = Σxi E[εi] / Σxi² = 0 since E[εi] = 0, i.e., β̃ is unbiased.
• b. 2 points. Derive the variance of β̃. (Show your work; compare the simulation sketch below.)
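A simulation sketch for both parts of this problem (Python with NumPy assumed; β, σ, and the fixed regressor are arbitrary): under (24.3.5), the simulated mean of β̃ is close to β and its simulated variance is close to σ²/Σxi².

    import numpy as np

    rng = np.random.default_rng(13)
    n, beta, sigma, reps = 30, 1.7, 1.0, 200_000
    x = rng.uniform(1.0, 3.0, size=n)                     # fixed regressor, no intercept
    y = x * beta + rng.normal(scale=sigma, size=(reps, n))
    beta_tilde = y @ x / (x @ x)                          # sum(x_i y_i) / sum(x_i^2)
    print(beta_tilde.mean(), beta)                        # approximately equal: unbiased
    print(beta_tilde.var(), sigma ** 2 / (x @ x))         # approximately equal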
Problem 305. We still assume (24.3.5) is the true model. Consider an alternative estimator: the slope estimator of the regression with an intercept,
β̂ = Σ(xi − x̄)yi / Σ(xi − x̄)².
• a. Is β̂ an unbiased estimator of β if (24.3.5) is the true model?
Answer. One can argue it: β̂ is unbiased for model (18.2.15) whatever the value of α or β, therefore also when α = 0, i.e., when the model is (24.3.5). But here is the pedestrian way:
β̂ = Σ(xi − x̄)yi / Σ(xi − x̄)² = Σ(xi − x̄)(xiβ + εi) / Σ(xi − x̄)² = β + Σ(xi − x̄)εi / Σ(xi − x̄)², since Σ(xi − x̄)xi = Σ(xi − x̄)².
Therefore E[β̂] = β + Σ(xi − x̄) E[εi] / Σ(xi − x̄)² = β since E[εi] = 0 for all i, i.e., β̂ is unbiased.
• b. 2 points. Derive the variance of β̂ if (24.3.5) is the true model.
Answer. One can again argue it: since the formula for var[β̂] does not depend on what the true value of α is, it is the same formula:
(24.3.11) var[β̂] = σ² / Σ(xi − x̄)².
• c. 1 point. Still assuming (24.3.5) is the true model, would you prefer β̂ or the β̃ from Problem 304 as an estimator of β?
Answer. Since β̃ and β̂ are both unbiased estimators if (24.3.5) is the true model, the preferred estimator is the one with the smaller variance. As I will show, var[β̃] ≤ var[β̂] and, therefore, β̃ is preferred to β̂. To show
(24.3.12) var[β̂] = σ² / Σ(xi − x̄)² ≥ σ² / Σxi² = var[β̃]
one must show
(24.3.13) Σ(xi − x̄)² ≤ Σxi²,
which is a simple consequence of (12.1.1). Thus var[β̂] ≥ var[β̃]; the variances are equal only if x̄ = 0, i.e., only if β̂ = β̃.
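A one-sample numerical comparison of the two variance formulas (Python with NumPy assumed; the regressor values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(9)
    sigma2 = 1.0
    x = rng.uniform(1.0, 3.0, size=30)
    dx = x - x.mean()
    var_tilde = sigma2 / (x @ x)             # variance of beta_tilde under (24.3.5)
    var_hat = sigma2 / (dx @ dx)             # variance of beta_hat
    print(var_tilde, var_hat, var_tilde <= var_hat)   # equal only if xbar = 0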
Problem 306. Now assume that the true model is (18.2.15), i.e., the model with an intercept term α. Then β̃ = Σxiyi / Σxi² is generally a biased estimator of β. Show that its bias is
E[β̃ − β] = αnx̄ / Σxi².
Answer. In situations like this it is always worthwhile to get a nice simple expression for the sampling error:
(24.3.15) β̃ − β = Σxiyi / Σxi² − β
(24.3.16) = Σxi(α + xiβ + εi) / Σxi² − β
(24.3.17) = αΣxi / Σxi² + βΣxi² / Σxi² + Σxiεi / Σxi² − β
(24.3.18) = αΣxi / Σxi² + Σxiεi / Σxi².
Therefore
(24.3.19) E[β̃ − β] = E[αΣxi / Σxi²] + E[Σxiεi / Σxi²]
(24.3.20) = αΣxi / Σxi² + Σxi E[εi] / Σxi²
(24.3.21) = αΣxi / Σxi² + 0 = αnx̄ / Σxi².
• b. 2 points. Compute var[β̃]. Is it greater or smaller than the variance of the OLS estimator β̂, which is σ² / Σ(xi − x̄)²?
Answer.
(24.3.24) var[β̃] = (1 / Σxi²)² var[Σxiyi]
(24.3.25) = (1 / Σxi²)² Σxi² var[yi]
(24.3.26) = σ² (1 / Σxi²)² Σxi² since all yi are uncorrelated and have equal variance σ²
(24.3.27) = σ² / Σxi².
This variance is smaller or equal because Σxi² ≥ Σ(xi − x̄)².
• c. 5 points. Show that the MSE of β̃ is smaller than that of the OLS estimator β̂ if and only if the unknown true parameters α and σ² satisfy the inequality
σ² / Σxi² + (αnx̄ / Σxi²)² ≤ σ² / Σ(xi − x̄)².
Answer. Rearranging, the MSE of β̃ is smaller if and only if
(αnx̄ / Σxi²)² ≤ σ² / Σ(xi − x̄)² − σ² / Σxi² = σ² (Σxi² − Σ(xi − x̄)²) / (Σ(xi − x̄)² Σxi²) = σ² nx̄² / (Σ(xi − x̄)² Σxi²),
where the last step uses Σxi² − Σ(xi − x̄)² = nx̄². Cancelling nx̄² / Σxi² on both sides and multiplying by Σxi² / n, this is equivalent to
α² ≤ σ² Σxi² / (nΣ(xi − x̄)²) = σ² (1/n + x̄² / Σ(xi − x̄)²).
The sample analogue of this condition is the usual F-statistic for testing α = 0, namely α̂² divided by its estimated variance s²(1/n + x̄² / Σ(xi − x̄)²). If α = 0 it has an F-distribution with 1 and n − 2 degrees of freedom. If α ≠ 0 it has what is called a noncentral distribution, and the only thing we needed to know so far was that it is likely to assume larger values than with α = 0. This is why a small value of that statistic supports the hypothesis that α = 0. But in the present case we are not testing whether α = 0 but whether the constrained MSE is better than the unconstrained. This is the case if the above inequality holds, the limiting case being that it holds with equality. If it is an equality, then the above statistic has an F-distribution with noncentrality parameter 1/2. (Here all we need to know is: if z ∼ N(µ, 1) then z² ∼ χ² with noncentrality parameter µ²/2. A noncentral F has a noncentral χ² in the numerator and a central one in the denominator.) The testing principle is therefore: compare the observed value with the upper α point of an F-distribution with noncentrality parameter 1/2. This gives higher critical values than testing for α = 0; i.e., one may reject that α = 0 but not reject that the MSE of the constrained estimator is larger. This is as it should be. Compare [Gre97, 8.5.1, pp. 405–408] on this point.
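A sketch checking the condition derived in part c (Python with NumPy assumed; σ², the regressor values, and the grid of α values are arbitrary): over a grid of α, MSE[β̃] ≤ MSE[β̂] holds exactly when α² ≤ σ²(1/n + x̄²/Σ(xi − x̄)²).

    import numpy as np

    rng = np.random.default_rng(11)
    sigma2 = 1.0
    x = rng.uniform(1.0, 3.0, size=30)
    n, dx = x.size, x - x.mean()
    threshold = sigma2 * (1.0 / n + x.mean() ** 2 / (dx @ dx))

    for alpha in np.linspace(0.0, 1.2, 7):
        mse_tilde = sigma2 / (x @ x) + (alpha * n * x.mean() / (x @ x)) ** 2
        mse_hat = sigma2 / (dx @ dx)
        print(round(alpha, 2), mse_tilde <= mse_hat, alpha ** 2 <= threshold)  # columns agree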
From the Gauss-Markov theorem it follows that for every nonrandom matrix R, the BLUE of φ = Rβ is φ̂ = Rβ̂. Furthermore, the best linear unbiased predictor (BLUP) of ε = y − Xβ is the vector of residuals ε̂ = y − Xβ̂.
Problem 307. Let ε̃ = Ay be a linear predictor of the disturbance vector ε in the model y = Xβ + ε with ε ∼ (o, σ²I).
• a. 2 points. Show that ε̃ is unbiased, i.e., E[ε̃ − ε] = o, regardless of the value of β, if and only if A satisfies AX = O.