CHAPTER 39
Random Regressors
Until now we always assumed that X was nonrandom, i.e., the hypothetical repetitions of the experiment used the same X matrix. In the nonexperimental sciences, such as economics, this assumption is clearly inappropriate. It is only justified because most results valid for nonrandom regressors can be generalized to the case of random regressors. To indicate that the regressors are random, we will write them as X.
39.1 Strongest Assumption: Error Term Well Behaved Conditionally on Explanatory Variables

The assumption which we will discuss first is that X is random, but the classical assumptions hold conditionally on X, i.e., the conditional expectation E[ε|X] = o, and the conditional variance-covariance matrix V[ε|X] = σ²I. In this situation, the least squares estimator has all the classical properties conditionally on X, for instance E[β̂|X] = β, V[β̂|X] = σ²(X>X)⁻¹, E[s²|X] = σ², etc.
Moreover, certain properties of the Least Squares estimator remain valid unconditionally. An application of the law of iterated expectations shows that the least squares estimator β̂ is still unbiased. Start with (24.0.7):

(39.1.1)    β̂ − β = (X>X)⁻¹X>ε

(39.1.2)    E[β̂ − β|X] = E[(X>X)⁻¹X>ε|X] = (X>X)⁻¹X>E[ε|X] = o

(39.1.3)    E[β̂ − β] = E[E[β̂ − β|X]] = o.
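To make these unconditional statements concrete, here is a minimal simulation sketch (not part of the text; the design, sample size, and parameter values are invented for illustration). It redraws X in every replication and checks that the averages of β̂ and of s² stay close to β and σ², in line with (39.1.3) and Problem 409 below.

```python
# Monte Carlo sketch: with random regressors X and E[eps|X] = 0, the averages of
# beta_hat and s^2 over repeated samples should be close to beta and sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma2 = 50, 3, 4.0
beta = np.array([1.0, -2.0, 0.5])

beta_hats, s2s = [], []
for _ in range(20_000):
    X = rng.normal(size=(n, k))              # regressors drawn anew each replication
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimate
    e = y - X @ b                            # residuals
    beta_hats.append(b)
    s2s.append(e @ e / (n - k))              # unbiased variance estimate

print(np.mean(beta_hats, axis=0))            # close to [1, -2, 0.5]
print(np.mean(s2s))                          # close to 4.0
```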
Problem 408. 1 point. In the model with random explanatory variables X you are considering an estimator β̃ of β. Which statement is stronger: E[β̃] = β, or E[β̃|X] = β? Justify your answer.
Answer. The second statement is stronger. The first statement follows from the second by the law of iterated expectations.
Problem 409. 2 points. Assume the regressors X are random, and the classical assumptions hold conditionally on X, i.e., E[ε|X] = o and V[ε|X] = σ²I. Show that s² is an unbiased estimate of σ².
Answer. From the theory with nonrandom explanatory variables it follows that E[s²|X] = σ². Therefore E[s²] = E[E[s²|X]] = E[σ²] = σ². In words: if the expectation conditional on X does not depend on X, then it is also the unconditional expectation.
The law of iterated expectations can also be used to compute the unconditional MSE matrix of β̂:

(39.1.4)    MSE[β̂; β] = E[(β̂ − β)(β̂ − β)>]
(39.1.5)               = E[E[(β̂ − β)(β̂ − β)>|X]]
(39.1.6)               = E[σ²(X>X)⁻¹]
(39.1.7)               = σ²E[(X>X)⁻¹].
Problem 410. 2 points. Show that s²(X>X)⁻¹ is an unbiased estimator of MSE[β̂; β].
Answer.

(39.1.8)     E[s²(X>X)⁻¹] = E[E[s²(X>X)⁻¹|X]]
(39.1.9)                  = E[σ²(X>X)⁻¹]
(39.1.10)                 = σ²E[(X>X)⁻¹]
(39.1.11)                 = MSE[β̂; β]   by (39.1.7).
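Continuing the simulation sketch above (same invented design), one can also check (39.1.11) numerically: the average of s²(X>X)⁻¹ across replications should be close to σ² times the average of (X>X)⁻¹.

```python
# Sketch: compare the average of s^2 (X'X)^{-1} with sigma^2 times the average
# of (X'X)^{-1}, i.e. the two sides of (39.1.11).
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 50, 3, 4.0
beta = np.array([1.0, -2.0, 0.5])

est_sum = np.zeros((k, k))     # running sum of s^2 (X'X)^{-1}
mse_sum = np.zeros((k, k))     # running sum of sigma^2 (X'X)^{-1}
reps = 10_000
for _ in range(reps):
    X = rng.normal(size=(n, k))
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (n - k)
    est_sum += s2 * XtX_inv
    mse_sum += sigma2 * XtX_inv

print(est_sum / reps)          # the two averaged matrices are nearly identical,
print(mse_sum / reps)          # both estimating MSE[beta_hat; beta] = sigma^2 E[(X'X)^{-1}]
```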
The Gauss-Markov theorem generalizes in the following way: Say β̃ is an estimator, linear in y, but not necessarily in X, satisfying E[β̃|X] = β (which is stronger than unbiasedness); then MSE[β̃; β] ≥ MSE[β̂; β]. Proof is immediate: we know by the usual Gauss-Markov theorem that MSE[β̃; β|X] ≥ MSE[β̂; β|X], and taking expected values will preserve this inequality: E[MSE[β̃; β|X]] ≥ E[MSE[β̂; β|X]], but this expected value is exactly the unconditional MSE.
The assumption E[ε|X] = o can also be written E[y|X] = Xβ, and V[ε|X] = σ²I can also be written as V[y|X] = σ²I. Both of these are assumptions about the conditional distribution y|X = X for all X. This suggests the following broadening of the regression paradigm: y and X are jointly distributed random variables, and one is interested in how y|X = X depends on X. If the expected value of this distribution depends linearly on X, and the variance of this distribution is constant, then this is the linear regression model discussed above. But the expected value might also depend on X in a nonlinear fashion (nonlinear least squares), and the variance may not be constant, in which case the intuition that y is some function of X plus some error term may no longer be appropriate; y may for instance be the outcome of a binary choice, the probability of which depends on X (see chapter 69.2; the generalized linear model).
39.2 Contemporaneously Uncorrelated Disturbances
In many situations with random regressors, the condition E[ε|X] = o is not satisfied. Instead, the columns of X are contemporaneously uncorrelated with ε, but they may be correlated with past values of ε. The main example here is regression with a lagged dependent variable. In this case, OLS is no longer unbiased, but asymptotically it still has all the good properties: it is asymptotically normal with the covariance matrix which one would expect. Asymptotically, the computer printout is still valid. This is a very important result, which is often used in econometrics, but most econometrics textbooks do not even start to prove it. There is a proof in [Kme86, pp. 749–757], and one in [Mal80, pp. 535–539].
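A small illustrative sketch of the lagged-dependent-variable case (model and numbers invented): OLS in y_t = βy_{t−1} + ε_t is biased in small samples, but the bias shrinks as the sample size grows, in line with the asymptotic claims above.

```python
# OLS with a lagged dependent variable: biased for small n, consistent as n grows.
import numpy as np

rng = np.random.default_rng(2)
beta = 0.7

def ols_ar1(n):
    y = np.zeros(n + 1)
    for t in range(1, n + 1):          # lagged dependent variable as regressor
        y[t] = beta * y[t - 1] + rng.normal()
    x, z = y[:-1], y[1:]
    return (x @ z) / (x @ x)           # OLS slope without intercept

for n in (20, 100, 1000):
    avg = np.mean([ols_ar1(n) for _ in range(2000)])
    print(n, avg)                      # the average estimate moves toward 0.7 as n grows
```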
Problem 411. Since least squares with random regressors is appropriate whenever the disturbances are contemporaneously uncorrelated with the explanatory variables, a friend of yours proposes to test for random explanatory variables by checking whether the sample correlation coefficients between the residuals and the explanatory variables are significantly different from zero or not. Is this an appropriate statistic?
Answer. No. The sample correlation coefficients are always zero!

39.3 Disturbances Correlated with Regressors in Same Observation

But if ε is contemporaneously correlated with X, then OLS is inconsistent. This can be the case in some dynamic processes (lagged dependent variable as regressor, and autocorrelated errors, see question 506), when there are, in addition to the relation which one wants to test with the regression, other relations making the righthand side variables dependent on the lefthand side variable, or when the righthand side variables are measured with errors. This is usually the case in economics, and econometrics has developed the technique of simultaneous equations estimation to deal with it.
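A hedged sketch of the errors-in-variables case (setup invented for illustration): a regressor measured with error is correlated with the disturbance, so OLS stays off target even in very large samples; the residuals, however, are orthogonal to the regressors by construction, which is also why the “test” proposed in Problem 411 can never reject.

```python
# Measurement error makes the regressor correlated with the disturbance: OLS is
# inconsistent, yet the residuals are exactly orthogonal to the regressor.
import numpy as np

rng = np.random.default_rng(3)
n, beta = 100_000, 2.0
x_true = rng.normal(size=n)
y = beta * x_true + rng.normal(size=n)
x_obs = x_true + rng.normal(size=n)          # regressor observed with error

b = (x_obs @ y) / (x_obs @ x_obs)            # OLS on the mismeasured regressor
e = y - b * x_obs                            # residuals
print(b)                                     # around 1.0, not 2.0: inconsistent
print(x_obs @ e)                             # numerically zero: residuals orthogonal to regressor
```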
Problem 412. 3 points. What does one have to watch out for if some of the regressors are random?
CHAPTER 40
The Mahalanobis Distance
Everything in this chapter is unpublished work, presently still in draft form. The aim is to give a motivation for the least squares objective function in terms of an initial measure of precision. The case of prediction is mathematically simpler than that of estimation, therefore this chapter will only discuss prediction. We assume a joint distribution of y and z in which σ² > 0 is otherwise unknown and β is unknown as well. y is observed but z is not and has to be predicted. But assume we are not interested in the MSE since we do the experiment only once. We want to predict z in such a way that, whatever the true value of β, the predicted value z∗ “blends in” best with the given data y.
There is an important conceptual difference between this criterion and the one based on the MSE. The present criterion cannot be applied until after the data are known, therefore it is called a “final” criterion as opposed to the “initial” criterion of the MSE. See Barnett [Bar82, pp. 157–159] for a good discussion of these issues.

How do we measure the degree to which a given data set “blends in,” i.e., is not an outlier for a given distribution? Hypothesis testing uses this criterion. The most often-used testing principle is: reject the null hypothesis if the observed value of a certain statistic is too much of an outlier for the distribution which this statistic would have under the null hypothesis. If the statistic is a scalar, and if under the null hypothesis this statistic has expected value µ and standard deviation σ, then one often uses an estimate of |x − µ|/σ, the number of standard deviations the observed value is away from the mean, to measure the “distance” of the observed value x from the distribution (µ, σ²). The Mahalanobis distance generalizes this concept to the case that the test statistic is a vector random variable.
40.1 Definition of the Mahalanobis Distance

Since it is mathematically more convenient to work with the squared distance than with the distance itself, we will make the following thought experiment to motivate the Mahalanobis distance. How could one generalize the squared scalar distance (y − µ)²/σ² for the distance of a vector value y from the distribution of the vector random variable y ∼ (µ, σ²Ω)? If all yᵢ have the same variance σ², i.e., if Ω = I, one might measure the squared distance of y from the distribution (µ, σ²Ω) by (1/σ²) maxᵢ (yᵢ − µᵢ)², but since the maximum from two trials is bigger than the value from one trial only, one should perhaps divide this by the expected value of such a maximum. If the variances are different, say σᵢ², one might want to look at the number of standard deviations which the “worst” component of y is away from what would be its mean if y were an observation of y, i.e., the squared distance of the observed vector from the distribution would be maxᵢ (yᵢ − µᵢ)²/σᵢ², again normalized by its expected value.
The principle actually used by the Mahalanobis distance goes only a small step further than the examples just cited. It is coordinate-free, i.e., any linear combinations of the elements of y are considered on equal footing with these elements themselves. In other words, it does not distinguish between variates and variables. The distance of a given vector value from a certain multivariate distribution is defined to be the distance of the “worst” linear combination of the elements of this vector from the univariate distribution of this linear combination, normalized in such a way that the expected value of this distance is 1.
Definition 40.1.1. Given a random n-vector y which has an expected value and a nonsingular covariance matrix. The squared “Mahalanobis distance” or “statistical distance” of the observed value y from the distribution of y is defined to be

(40.1.1)    MHD[y; y] = (1/n) max_g (g>y − E[g>y])² / var[g>y].

If the denominator var[g>y] is zero, then g = o, therefore the numerator is zero as well. In this case the fraction is defined to be zero.
Theorem 40.1.2. Let y be a vector random variable with E[y] = µ and V[y] = σ²Ω, σ² > 0 and Ω positive definite. The squared Mahalanobis distance of the value y from the distribution of y is equal to

(40.1.2)    MHD[y; y] = (1/(nσ²)) (y − µ)>Ω⁻¹(y − µ).

Proof: (40.1.2) is a simple consequence of (32.4.4). It is also somewhat intuitive, since the righthand side of (40.1.2) can be considered a division of the square of y − µ by the covariance matrix of y.
The Mahalanobis distance is an asymmetric measure; a large value indicates a bad fit of the hypothetical population to the observation, while a value of, say, 0.1 does not necessarily indicate a better fit than a value of 1.
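Theorem 40.1.2 is easy to check numerically. A small sketch (matrix and values invented for illustration), which also uses the fact that the maximizing g in (40.1.1) is proportional to Ω⁻¹(y − µ):

```python
# Check that the "worst linear combination" definition (40.1.1) and the closed
# form (40.1.2) agree for an invented distribution and observation.
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 3, 2.0
mu = np.array([1.0, 0.0, -1.0])
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])          # positive definite
y_obs = np.array([2.0, -1.0, 0.5])
d = y_obs - mu

# closed form (40.1.2): (y - mu)' Omega^{-1} (y - mu) / (n sigma^2)
closed = d @ np.linalg.solve(Omega, d) / (n * sigma2)

# definition (40.1.1): search over many directions g, including the optimal
# direction Omega^{-1} d, where the maximum is attained
gs = np.vstack([rng.normal(size=(100_000, n)), np.linalg.solve(Omega, d)])
ratios = (gs @ d) ** 2 / (sigma2 * np.einsum('ij,jk,ik->i', gs, Omega, gs))  # var[g'y] = sigma^2 g'Omega g
print(closed, ratios.max() / n)              # the two numbers coincide
```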
Problem 413. Let y be a random n-vector with expected value µ and nonsingular covariance matrix σ²Ω. Show that the expected value of the Mahalanobis distance of the observations of y from the distribution of y is 1, i.e., E[MHD[y; y]] = 1.
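A sketch of one way to see this, using the trace of the quadratic form: since E[(y − µ)(y − µ)>] = σ²Ω,

E[(y − µ)>Ω⁻¹(y − µ)] = E[tr(Ω⁻¹(y − µ)(y − µ)>)] = tr(Ω⁻¹ σ²Ω) = nσ²,

and dividing by nσ² as in (40.1.2) gives 1.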
The Mahalanobis distance is also defined if the covariance matrix of y is singular. In this case, certain nonzero linear combinations of the elements of y are known with certainty. Certain vectors can therefore not possibly be realizations of y, i.e., the set of realizations of y does not fill the whole Rⁿ.
Problem 414. 2 points. The random vector y = [y₁, y₂, y₃]> has mean [1, 2, −3]> and covariance matrix

(1/3) [  2  −1  −1
        −1   2  −1
        −1  −1   2 ].

Is this covariance matrix singular? If so, give a linear combination of the elements of y which is known with certainty. And give a value which can never be a realization of y. Prove everything you state.
Answer. Yes, it is singular;

[1  1  1] · (1/3) [  2  −1  −1
                    −1   2  −1
                    −1  −1   2 ] = [0  0  0].

I.e., y₁ + y₂ + y₃ = 0, because its variance is 0 and its mean is zero as well, since [1 1 1] · [1, 2, −3]> = 0. Any vector whose elements do not sum to 0, for instance [1, 0, 0]>, can therefore never be a realization of y.
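A quick numerical check of this answer (using the numbers from Problem 414):

```python
# The covariance matrix has rank 2, and [1, 1, 1] spans its null space, so
# y1 + y2 + y3 has variance 0 and mean 0: it is known with certainty.
import numpy as np

Omega = np.array([[ 2, -1, -1],
                  [-1,  2, -1],
                  [-1, -1,  2]]) / 3
mu = np.array([1.0, 2.0, -3.0])
g = np.array([1.0, 1.0, 1.0])

print(np.linalg.matrix_rank(Omega))   # 2, so the matrix is singular
print(g @ Omega @ g)                  # 0: var[y1 + y2 + y3] = 0
print(g @ mu)                         # 0: its mean, hence y1 + y2 + y3 = 0 with certainty
```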
Definition 40.1.3. Given a vector random variable y which has a mean and a covariance matrix. A value y has infinite statistical distance from this random variable, i.e., it cannot possibly be a realization of this random variable, if a vector of coefficients g exists such that var[g>y] = 0 but g>y ≠ g>E[y]. If such a g does not exist, then the squared Mahalanobis distance of y from y is defined as in (40.1.1), with n replaced by rank[Ω]. If the denominator in (40.1.1) is zero, then it no longer necessarily follows that g = o, but it nevertheless follows that the numerator is zero, and the fraction should in this case again be considered zero.
If Ω is singular, then the inverse Ω⁻¹ in formula (40.1.2) must be replaced by a “g-inverse.” A g-inverse of a matrix A is any matrix A⁻ which satisfies AA⁻A = A. G-inverses always exist, but they are usually not unique.
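A hedged sketch: numpy’s Moore-Penrose pseudoinverse is one particular g-inverse, so it satisfies the defining property and can stand in for Ω⁻ in the singular case.

```python
# pinv as one choice of g-inverse: check A A^- A = A, then use it in the
# Mahalanobis-type quadratic form for a vector in the column space of A.
import numpy as np

A = np.array([[ 2, -1, -1],
              [-1,  2, -1],
              [-1, -1,  2]]) / 3           # the singular covariance matrix from above
A_g = np.linalg.pinv(A)                    # one choice of g-inverse

print(np.allclose(A @ A_g @ A, A))         # True: the defining property A A^- A = A

d = np.array([1.0, -1.0, 0.0])             # components sum to zero: in the column space of A
print(d @ A_g @ d)                         # finite, well-defined quadratic form
```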
Problem 415. a is a scalar. What is its g-inverse a⁻?
Theorem 40.1.4. Let y be a random variable with E[y] = µ and V[y] = σ²Ω, σ² > 0. If it is not possible to express the vector y in the form y = µ + Ωa for some a, then the squared Mahalanobis distance of y from the distribution of y is infinite, i.e., MHD[y; y] = ∞; otherwise

MHD[y; y] = (1/(σ² rank[Ω])) (y − µ)>Ω⁻(y − µ).

Now we will discuss how a given observation vector can be extended by additional observations in such a way that the Mahalanobis distance of the whole vector from its distribution is minimized.
40.2 The Conditional Mahalanobis Distance

Now let us assume that after the observation of y additional observations become available. I.e., the scenario now is

(40.2.1)    [y; z] ∼ ( [µ; ν], σ² [Ωyy Ωyz; Ωzy Ωzz] ),   σ² > 0.
In this case we define the conditional Mahalanobis distance of an observation z given the prior observation y to be

(1/(σ²(r − p))) ( [y − µ; z − ν]> [Ωyy Ωyz; Ωzy Ωzz]⁻ [y − µ; z − ν] − (y − µ)>Ωyy⁻(y − µ) ),

where p is the rank of Ωyy and r the rank of the joint covariance matrix.
40.3 First Scenario: Minimizing relative increase in Mahalanobis distance if distribution is known

We start with a situation where the expected values of the random vectors y and z are known, and their joint covariance matrix is known up to an unknown scalar factor σ² > 0. We will write this as

(40.3.1)    [y; z] ∼ ( [µ; ν], σ² [Ωyy Ωyz; Ωzy Ωzz] ),   σ² > 0.

Ωyy has rank p and [Ωyy Ωyz; Ωzy Ωzz] has rank r. Since σ² is not known, one cannot compute the Mahalanobis distance of the observed and/or conjectured values y and z from their distribution. But if one works with the relative increase in the Mahalanobis distance if z is added to y, then σ² cancels out. In order to measure how well the conjectured value z fits together with the observed y we will therefore divide the Mahalanobis distance of the vector composed of y and z from its distribution by the Mahalanobis distance of y alone from its distribution:

(40.3.2)    [ (1/(rσ²)) [y − µ; z − ν]> [Ωyy Ωyz; Ωzy Ωzz]⁻ [y − µ; z − ν] ] / [ (1/(pσ²)) (y − µ)>Ωyy⁻(y − µ) ].
An equivalent criterion which leads to mathematically simpler formulas is to divide the conditional Mahalanobis distance of z given y by the Mahalanobis distance of y from y:

(40.3.3)    [ (1/((r − p)σ²)) ( [y − µ; z − ν]> [Ωyy Ωyz; Ωzy Ωzz]⁻ [y − µ; z − ν] − (y − µ)>Ωyy⁻(y − µ) ) ] / [ (1/(pσ²)) (y − µ)>Ωyy⁻(y − µ) ].

We already solved this minimization problem in chapter ??. By (??), the minimum value of this relative contribution is zero, and the value of z which minimizes this relative contribution is the same as the value of the best linear predictor of z, i.e., the value assumed by the linear predictor which minimizes the MSE among all linear predictors.
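This claim can be checked numerically. A small sketch with invented numbers (y of length two, z a scalar), comparing a grid search over candidate values of z with the best linear predictor z∗ = ν + ΩzyΩyy⁻(y − µ):

```python
# The z minimizing the relative increase in the Mahalanobis distance coincides
# with the best linear predictor, and the minimum value is zero.
import numpy as np

mu, nu = np.array([0.0, 0.0]), np.array([1.0])
O_yy = np.array([[2.0, 0.5], [0.5, 1.0]])
O_yz = np.array([[0.8], [0.3]])
O_zz = np.array([[1.5]])
O_joint = np.block([[O_yy, O_yz], [O_yz.T, O_zz]])
y = np.array([1.0, -0.5])

def rel_increase(z, sigma2=1.0):                 # sigma^2 cancels in the ratio anyway
    d = np.concatenate([y - mu, np.atleast_1d(z) - nu])
    joint = d @ np.linalg.pinv(O_joint) @ d
    marg = (y - mu) @ np.linalg.pinv(O_yy) @ (y - mu)
    r, p = np.linalg.matrix_rank(O_joint), np.linalg.matrix_rank(O_yy)
    return ((joint - marg) / ((r - p) * sigma2)) / (marg / (p * sigma2))

z_star = nu + O_yz.T @ np.linalg.pinv(O_yy) @ (y - mu)   # best linear predictor
grid = np.linspace(-3, 3, 2001)
z_min = grid[np.argmin([rel_increase(z) for z in grid])]
print(z_star, z_min, rel_increase(z_star))       # z_min ~ z_star, minimum ~ 0
```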
40.4 Second Scenario: One Additional IID Observation

In the above situation, we could minimize the relative increase in the Mahalanobis distance (instead of selecting its minimax value) because all parameters of the underlying distribution were known. The simplest situation in which they are not known, and therefore we must resort to minimizing the relative increase in the Mahalanobis distance for the most unfavorable value of this unknown parameter, is the following: A vector y of n i.i.d. observations is given with unknown mean µ and variance σ² > 0. The squared Mahalanobis distance of these data from their population is (1/(nσ²))(y − ιµ)>(y − ιµ); it depends on the unknown µ and σ². How can we predict an (n+1)st observation in such a way as to minimize the worst possible relative increase in this Mahalanobis distance?

Minimizing the maximum possible relative increase in the Mahalanobis distance due to y_{n+1} is the same as minimizing

(y_{n+1} − µ)² / ( (ιµ − y)>(ιµ − y)/n ).
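As a rough illustration (the data are invented, and the supremum over µ is only approximated on a grid), the sketch below evaluates this ratio for a few candidate predictions y_{n+1} and reports the worst case over the grid:

```python
# Evaluate the worst-case (over a grid of mu values) relative increase
# (y_pred - mu)^2 / ((y - iota*mu)'(y - iota*mu)/n) for candidate predictions.
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(loc=3.0, scale=1.5, size=10)      # n i.i.d. observations
n = y.size

def worst_relative_increase(y_pred, mus=np.linspace(-10, 10, 4001)):
    return max((y_pred - mu) ** 2 / (np.sum((y - mu) ** 2) / n) for mu in mus)

for cand in (y.mean(), np.median(y), 0.0):
    print(round(float(cand), 3), round(worst_relative_increase(cand), 3))
```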