CHAPTER 69
Binary Choice Models
69.1 Fisher's Scoring and Iteratively Reweighted Least Squares

This section draws on chapter 55 about Numerical Minimization. Another important "natural" choice for the positive definite matrix R_i in the gradient method is available if one maximizes a likelihood function: then R_i can be the inverse of the information matrix evaluated at the parameter values β_i. This is called Fisher's Scoring method. It is closely related to the Newton-Raphson method: the Newton-Raphson method uses the Hessian matrix, and the information matrix is minus the expected value of the Hessian. Apparently Fisher first used the information matrix as a computational simplification in the Newton-Raphson method. Today IRLS is used in the GLIM program for generalized linear models.
As in chapter 56 discussing nonlinear least squares, β is the vector of parameters of interest, and we will work with an intermediate vector η(β) of predictors whose dimension is comparable to that of the observations. Therefore the likelihood function has the form L = L(y, η(β)). By the chain rule (C.1.23) one can write the Jacobian of the likelihood function as ∂L/∂β>(β) = u>X, where u> = ∂L/∂η>(η(β)) is the Jacobian of L as a function of η, evaluated at η(β), and X = ∂η/∂β>(β) is the Jacobian of η. This is the same notation as in the discussion of the Gauss-Newton regression.

Define A = E[uu>]. Since X does not depend on the random variables, the information matrix of y with respect to β is then E[X>uu>X] = X>AX. If one uses the inverse of this information matrix as the R-matrix in the gradient algorithm, one gets the updating rule β_{i+1} = β_i + (X>AX)^{−1}X>u, with X, A, and u evaluated at β_i. Each step has the form of a weighted least squares computation, which is why the method is also called Iteratively Reweighted Least Squares (IRLS).
Justifications of IRLS are: the information matrix is usually analytically simpler than the Hessian of the likelihood function, therefore it is a convenient approximation; and one needs the information matrix anyway at the end for the covariance matrix of the M.L. estimators.
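A minimal sketch of one such scoring step in Python (my own illustration, not code from the text; the function and variable names are mine):

    import numpy as np

    def scoring_step(beta, X, u, A):
        # One Fisher scoring / IRLS update: beta_new = beta + (X'AX)^{-1} X'u,
        # where X is the Jacobian of eta, u is the score dL/deta, and A = E[uu'].
        information = X.T @ A @ X
        return beta + np.linalg.solve(information, X.T @ u)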
69.2 Binary Dependent Variable

Assume each individual in the sample makes an independent random choice between two alternatives, which can conveniently be coded as y_i = 0 or 1. The probability distribution of y_i is fully determined by the probability π_i = Pr[y_i = 1] of the event which has y_i as its indicator function. Then E[y_i] = π_i and var[y_i] = E[y_i²] − E[y_i]² = E[y_i] − E[y_i]² = π_i(1 − π_i).
It is usually assumed that the individual choices are stochastically independent of each other, i.e., the distribution of the data is fully characterized by the π_i. Each π_i is assumed to depend on a vector of explanatory variables x_i. There are different approaches to modelling this dependence.
The regression model y_i = x_i>β + ε_i with E[ε_i] = 0 is inappropriate because x_i>β can take any value, whereas 0 ≤ E[y_i] ≤ 1. Nevertheless, people have been tinkering with it. The obvious first tinker is based on the observation that the ε_i are no longer homoskedastic, but their variance, which is a function of π_i, can be estimated, therefore one can correct for this heteroskedasticity. But things get complicated very quickly and then the main appeal of OLS, its simplicity, is lost. This is a wrong-headed approach, and any smart ideas which one may get when going down this road are simply wasted.

The right way to do this is to set π_i = E[y_i] = Pr[y_i = 1] = h(x_i>β), where h is some (necessarily nonlinear) function with values between 0 and 1.
69.2.1 Logit Specification (Logistic Regression). The logit or logistic specification is π_i = exp(x_i>β)/(1 + exp(x_i>β)). Invert to get log(π_i/(1 − π_i)) = x_i>β, i.e., the logarithm of the odds depends linearly on the predictors. The log odds are a natural re-scaling of probabilities to a scale which goes from −∞ to +∞, and which is symmetric in that the log odds of the complement of an event is just the negative of the log odds of the event itself. (See my remarks about the odds ratio in Question 222.)
Problem 560. 1 point. If y = log(p/(1 − p)) (logit function), show that p = exp y/(1 + exp y) (logistic function).
Problem 561. Sometimes one finds the following alternative specification of the logit model: π_i = 1/(1 + exp(x_i>β)). What is the difference between it and our formulation of the logit model? Are these two formulations equivalent?
Another important class of specifications is formulated in terms of an unobserved "index variable."

The index variable model specifies: there is a variable z_i with the property that y_i = 1 if and only if z_i > 0. For instance, the decision y_i whether or not individual i moves to a different location can be modeled by the calculation whether the net benefit of moving, i.e., the wage differential minus the cost of relocation and finding a new job, is positive or not. This moving example is worked out, with references, in [Gre93, pp. 642/3].

The value of the variable z_i is not observed, one only observes y_i, i.e., the only thing one knows about the value of z_i is whether it is positive or not. But it is assumed that z_i is the sum of a deterministic part which is specific to the individual and a random part which has the same distribution for all individuals and is stochastically independent between different individuals. The deterministic part specific to the individual is assumed to depend linearly on individual i's values of the covariates, with coefficients which are common to all individuals. In other words, z_i = x_i>β + ε_i, where the ε_i are i.i.d. with cumulative distribution function F_ε. Then it follows π_i = Pr[y_i = 1] = Pr[z_i > 0] = Pr[ε_i > −x_i>β] = 1 − Pr[ε_i ≤ −x_i>β] = 1 − F_ε(−x_i>β). I.e., in this case, h(η) = 1 − F_ε(−η). If the distribution of ε_i is symmetric and has a density, then one gets the simpler formula h(η) = F_ε(η).
Which cumulative distribution function should be chosen?
• In practice, the probit model, in which z_i is normal, is the only one used.
• The linear model, in which h is the line segment from (a, 0) to (b, 1), can also be considered generated by an index function z_i which is here uniformly distributed.
• An alternative possible specification with the Cauchy distribution is proposed in [DM93, p. 516]. They say that curiously only logit and probit are being used.

In practice, the probit model is very similar to the logit model, once one has rescaled the variables to make the variances equal, but the logit model is easier to handle mathematically.
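As a quick numerical illustration of this similarity (my own sketch, not part of the text), one can compare the logistic cdf with a normal cdf whose standard deviation is matched to that of the standard logistic distribution, which is π/√3:

    import numpy as np
    from scipy.special import expit      # logistic cdf
    from scipy.stats import norm

    z = np.linspace(-6, 6, 1001)
    logit_cdf = expit(z)
    probit_cdf = norm.cdf(z * np.sqrt(3) / np.pi)   # normal cdf with matching standard deviation
    print(np.max(np.abs(logit_cdf - probit_cdf)))   # maximum discrepancy, roughly 0.02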
69.2.3 Replicated Data. Before discussing estimation methods I want to briefly address the issue whether or not to write the data in replicated form [MN89, pp. 99–101]. If there are several observations for every individual, or if there are several individuals for the same values of the covariates (which can happen if all covariates are categorical), then one can write the data more compactly if one groups the data into so-called "covariate classes," i.e., groups of observations which share the same values of x_i, and defines y_i to be the number of times the decision came out positive in this group. Then one needs a second variable, m_i, which is assumed nonrandom, indicating how many individual decisions are combined in the respective group. This is an equivalent formulation of the data, the only thing one loses is the order in which the observations were made (which may be relevant if there are training or warm-up effects). The original representation of the data is a special case of the grouped form: in the non-grouped form, all m_i = 1. We will from now on write our formulas for the grouped form.
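A small sketch of this grouping step (my own illustration; the data and column names are made up), collapsing individual 0/1 observations into covariate classes with counts y_i and m_i using pandas:

    import pandas as pd

    # individual-level data: one row per decision, covariates x1, x2, binary outcome y
    df = pd.DataFrame({"x1": [0, 0, 0, 1, 1, 1, 1],
                       "x2": [1, 1, 0, 0, 0, 0, 1],
                       "y":  [1, 0, 1, 1, 1, 0, 0]})

    # grouped form: y = number of positive decisions, m = group size per covariate class
    grouped = (df.groupby(["x1", "x2"])["y"]
                 .agg(["sum", "count"])
                 .rename(columns={"sum": "y", "count": "m"})
                 .reset_index())
    print(grouped)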
69.2.4 Estimation. Maximum likelihood is the preferred estimation method. The likelihood function has the form L = ∏_i π_i^{y_i} (1 − π_i)^{m_i − y_i}. This likelihood function is not derived from a density, but from a probability mass function. For instance, in the case with non-replicated data, all m_i = 1, if you have n binary measurements, then you can have only 2^n different outcomes, and the probability of the sequence y_1, ..., y_n = 0, 1, 0, 0, ..., 1 is as given above.
This is a highly nonlinear maximization and must be done numerically. Let us go through the method of scoring in the example of a logit distribution. The log likelihood is

(69.2.1)  L = Σ_i [ y_i log π_i + (m_i − y_i) log(1 − π_i) ]

with derivatives

(69.2.2)  ∂L/∂π_i = y_i/π_i − (m_i − y_i)/(1 − π_i)

(69.2.3)  ∂²L/∂π_i² = −y_i/π_i² − (m_i − y_i)/(1 − π_i)².

Defining η = Xβ, the logit specification can be written as π_i = e^{η_i}/(1 + e^{η_i}). Differentiation gives ∂π_i/∂η_i = π_i(1 − π_i). Combine this with (69.2.2) to get

(69.2.4)  u_i = ∂L/∂η_i = (∂L/∂π_i)(∂π_i/∂η_i) = y_i − m_i π_i.

A = E[uu>] is a diagonal matrix with m_i π_i(1 − π_i) in the diagonal.
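Putting these pieces together, the scoring iteration for the grouped logit model can be sketched as follows (my own illustration, not code from the text; the function name and data layout are hypothetical):

    import numpy as np
    from scipy.special import expit

    def logit_irls(X, y, m, iterations=25):
        # Fisher scoring / IRLS for the grouped logit model.
        # X: n x k design matrix, y: successes per covariate class, m: class sizes.
        beta = np.zeros(X.shape[1])
        for _ in range(iterations):
            eta = X @ beta
            pi = expit(eta)                    # pi_i = e^eta_i / (1 + e^eta_i)
            u = y - m * pi                     # score dL/deta, equation (69.2.4)
            A = np.diag(m * pi * (1 - pi))     # E[uu'], diagonal
            beta = beta + np.linalg.solve(X.T @ A @ X, X.T @ u)
        return beta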
Problem 562. 6 points. Show that for the maximization of the likelihood function of the logit model, Fisher's scoring method is equivalent to the Newton-Raphson algorithm.

Problem 563. Show that in the logistic model, Σ m_i π̂_i = Σ y_i.
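A sketch of the reasoning behind Problem 563 (my own, and it assumes that the design matrix X contains a column of ones ι, i.e., an intercept): at the maximum the score vanishes, u>X = o> with u_i = y_i − m_iπ̂_i by (69.2.4), and the component belonging to ι gives

    \[
      0 \;=\; \iota^{\top}u \;=\; \sum_i \bigl(y_i - m_i\hat{\pi}_i\bigr)
      \quad\Longrightarrow\quad
      \sum_i m_i\hat{\pi}_i \;=\; \sum_i y_i .
    \]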
69.3 The Generalized Linear Model

The binary choice models show how the linear model can be generalized. [MN89, pp. 27–32] develop a unified theory of many different interesting models, called the "generalized linear model." The following few paragraphs are indebted to the elaborate and useful web site about Generalized Linear Models maintained by Gordon K. Smyth at www.maths.uq.oz.au/~gks/research/glm

In which cases is it necessary to go beyond linear models? The most important and common situation is one in which y_i and µ_i = E[y_i] are bounded:
• If y represents the amount of some physical substance then we may have y ≥ 0 and µ ≥ 0.
• If y is binary, i.e., y = 1 if an animal survives and y = 0 if it does not, then 0 ≤ µ ≤ 1.
The linear model is inadequate here because complicated and unnatural constraints on β would be required to make sure that µ stays in the feasible range. Generalized linear models instead assume a link linear relationship

g(µ) = Xβ,

where g() is some known monotonic function which acts pointwise on µ. Typically g() is used to transform the µ_i to a scale on which they are unconstrained. For example we might use g(µ) = log(µ) if µ_i > 0, or g(µ) = log(µ/(1 − µ)) if 0 < µ_i < 1.

The same reasons which force us to abandon the linear model also force us to abandon the assumption of normality. If y is bounded then the variance of y must depend on its mean. Specifically, if µ is close to a boundary for y then var(y) must be small. For example, if y > 0, then we must have var(y) → 0 as µ → 0. For this reason strictly positive data almost always shows increasing variability with increased size. If 0 < y < 1, then var(y) → 0 as µ → 0 or µ → 1. For this reason, generalized linear models assume that

var(y_i) = φ V(µ_i),

where V is a known variance function of the mean and φ is a dispersion parameter.
Problem 564. Describe estimation situations in which a linear model and normal distribution are not appropriate.

The generalized linear model has the following components:
• Random component: Instead of being normally distributed, the components of y have a distribution in the exponential family.
• Introduce a new symbol η = Xβ.
• A monotonic univariate link function g so that η_i = g(µ_i), where µ = E[y].

The generalized linear model allows for a nonlinear link function g specifying that transformation of the expected value of the response variable which depends linearly on the predictors:

g(E[y]) = Xβ.

Such models do not require us to specify the whole distribution but can be derived on the basis of the mean and variance functions alone.
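As a sketch of how such a model is fitted in practice (my own example, using the statsmodels package, which is not mentioned in the text; the data are made up), a grouped binomial GLM with logit link can be estimated like this:

    import numpy as np
    import statsmodels.api as sm

    # hypothetical grouped data: y successes out of m trials per covariate class
    X = sm.add_constant(np.array([[0.], [1.], [2.], [3.]]))   # design matrix with intercept
    y = np.array([2, 5, 8, 9])
    m = np.array([10, 10, 10, 10])

    # Binomial family with the default logit link; response given as (successes, failures)
    model = sm.GLM(np.column_stack([y, m - y]), X, family=sm.families.Binomial())
    result = model.fit()      # the fit is computed by iteratively reweighted least squares
    print(result.params)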
CHAPTER 70
Multiple Choice Models
Discrete choice between three or more alternatives; this came from the choice of transportation.

The outcomes of these choices should no longer be represented by a vector y, but one needs a matrix Y with y_ij = 1 if the ith individual chooses the jth alternative, and 0 otherwise. Consider only three alternatives j = 1, 2, 3, and define Pr[y_ij = 1] = π_ij.
The Conditional Logit model is a model which makes all π_ij dependent on x_i. It is a very simple extension of binary choice. In binary choice we had log(π_i/(1 − π_i)) = x_i>β, the log of the odds ratio. Here this is generalized to log(π_i2/π_i1) = x_i>β_2 and log(π_i3/π_i1) = x_i>β_3. From this one can write π_ij = exp(α_j + β_j x_i) / Σ_k exp(α_k + β_k x_i) if one defines α_1 = β_1 = 0. The only estimation method used is MLE.

p. 47] for the best explanation of this which I found till now.
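A small sketch of this probability formula (my own illustration; the numbers are made up), computing the choice probabilities of three alternatives with the first one as base category (α_1 = β_1 = 0):

    import numpy as np

    def conditional_logit_probs(x, alpha, beta):
        # pi_j = exp(alpha_j + beta_j * x) / sum_k exp(alpha_k + beta_k * x),
        # with alpha[0] = beta[0] = 0 as the base category.
        v = alpha + beta * x
        ev = np.exp(v - v.max())          # subtract the maximum for numerical stability
        return ev / ev.sum()

    alpha = np.array([0.0, 0.5, -0.2])
    beta = np.array([0.0, 1.0, 0.3])
    print(conditional_logit_probs(x=0.7, alpha=alpha, beta=beta))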
APPENDIX A
Matrix Formulas
In this Appendix, efforts are made to give some of the familiar matrix lemmas in their most general form. The reader should be warned: the concept of a deficiency matrix and the notation which uses a thick fraction line for multiplication with a scalar g-inverse are my own.
A.1 A Fundamental Matrix Decomposition

Theorem A.1.1. Every matrix B which is not the null matrix can be written as a product of two matrices B = CD, where C has a left inverse L and D a right inverse R, i.e., LC = DR = I. This identity matrix is r × r, where r is the rank of B.
A proof is in [Rao73, p. 19]. This reflects the general fact of algebra that every homomorphism can be written as a product of an epimorphism and a monomorphism, together with the fact that all epimorphisms and monomorphisms of vector spaces split, i.e., have one-sided inverses.

One such factorization is given by the singular value theorem: If B = P>ΛQ is the SVD as in Theorem A.9.2, then one might set e.g. C = P>Λ and D = Q, consequently L = Λ^{−1}P and R = Q>. In this decomposition, the first row/column carries the largest weight and gives the best approximation in a least squares sense, etc.
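A quick numerical sketch of this factorization (my own illustration), using numpy's SVD and truncating to the rank r of B:

    import numpy as np

    B = np.array([[1., 2., 3.],
                  [2., 4., 6.],
                  [1., 0., 1.]])          # a rank-2 example matrix
    U, s, Vt = np.linalg.svd(B)
    r = int(np.sum(s > 1e-12))            # numerical rank

    C = U[:, :r] * s[:r]                  # C = P'Lambda: columns scaled by singular values
    D = Vt[:r, :]                         # D = Q
    L = U[:, :r].T / s[:r, None]          # left inverse of C, i.e. Lambda^{-1} P
    R = Vt[:r, :].T                       # right inverse of D, i.e. Q'

    print(np.allclose(C @ D, B),
          np.allclose(L @ C, np.eye(r)),
          np.allclose(D @ R, np.eye(r)))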
The trace of a square matrix is defined as the sum of its diagonal elements. The rank of a matrix is defined as the number of its linearly independent rows, which is equal to the number of its linearly independent columns (row rank = column rank).

Theorem A.1.2. tr BC = tr CB.

Problem 565. Prove theorem A.1.2.

Problem 566. Use theorem A.1.1 to prove that if BB = B, then rank B = tr B.

Answer. Write B = CD as in theorem A.1.1, with LC = DR = I_r. From BB = B follows CDCD = CD; premultiplying by L and postmultiplying by R gives DC = I_r. This is useful for the trace: tr B = tr CD = tr DC = tr I_r = r. I have this proof from
Theorem A.1.3. B = O if and only if B>B = O.
A.2 The Spectral Norm of a Matrix

The spectral norm of a matrix extends the Euclidean norm ‖z‖ from vectors to matrices. Its definition is ‖A‖ = max_{‖z‖=1} ‖Az‖. This spectral norm is the maximum singular value µ_max, and if A is square and nonsingular, then ‖A^{−1}‖ = 1/µ_min. It is a true norm, i.e., ‖A‖ = 0 if and only if A = O; furthermore ‖λA‖ = |λ| · ‖A‖, and the triangle inequality ‖A + B‖ ≤ ‖A‖ + ‖B‖ holds. In addition, it obeys ‖AB‖ ≤ ‖A‖ · ‖B‖.
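These properties are easy to check numerically; the following sketch (mine, not from the text) verifies the relation between the spectral norm and the singular values:

    import numpy as np

    A = np.array([[3., 1.],
                  [0., 2.]])
    s = np.linalg.svd(A, compute_uv=False)                  # singular values, descending
    print(np.isclose(np.linalg.norm(A, 2), s[0]))            # ||A|| equals mu_max
    print(np.isclose(np.linalg.norm(np.linalg.inv(A), 2), 1 / s[-1]))   # ||A^{-1}|| equals 1/mu_min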
Problem 567. Show that the spectral norm is the maximum singular value.

Answer. The squared spectral norm is the maximum of

(A.2.1)  z>A>Az / z>z.

Write A = P>ΛQ as in (A.9.1). Then z>A>Az = z>Q>Λ²Qz. Therefore we can first show: there is a z of the form z = Q>x which attains this maximum. Proof: for every z which has a nonzero value in the numerator of (A.2.1), set x = Qz. Then x ≠ o, and Q>x attains the same value as z in the numerator of (A.2.1), and a smaller or equal value in the denominator. Therefore one can restrict the search for the maximum argument to vectors of the form Q>x. But for them the objective function becomes x>Λ²x / x>x, which is maximized by x = i_1, the first unit vector (or column vector of the unit matrix). Therefore the squared spectral norm is λ²_{11}, and therefore the spectral norm is λ_{11}, the largest singular value.
A.3 Inverses and g-Inverses of Matrices

A g-inverse of a matrix A is any matrix A− satisfying A = AA−A. It always exists but is not always unique. If A is square and nonsingular, then A−1 is its only g-inverse.
Problem 568. Show that a symmetric matrix Ω has a g-inverse which is also symmetric.
The definition of a g-inverse is apparently due to [Rao62]. It is sometimes called the "conditional inverse" [Gra83, p. 129]. This g-inverse, and not the Moore-Penrose generalized inverse or pseudoinverse A+, is needed for the linear model. The Moore-Penrose generalized inverse is a g-inverse that in addition satisfies A+AA+ = A+, and AA+ as well as A+A symmetric. It always exists and is also unique, but the additional requirements are burdensome ballast. [Gre97, pp. 44–5] also advocates the Moore-Penrose inverse, but he does not really use it. If he were to try to use it, he would probably soon discover that it is not appropriate. The book [Alb72] does the linear model with the Moore-Penrose inverse. It is a good demonstration of how complicated everything gets if one uses an inappropriate mathematical tool.
Problem 569. Use theorem A.1.1 to prove that every matrix has a g-inverse.
Theorem A.3.1. If B = AA−B holds for one g-inverse A− of A, then it holds for all g-inverses. If A is symmetric and B = AA−B, then also B> = B>A−A. If B = BA−A and C = AA−C, then BA−C is independent of the choice of g-inverses.

Proof. Assume the identity B = AA+B holds for some fixed g-inverse A+ (which may be, as the notation suggests, the Moore-Penrose g-inverse, but this is not necessary), and let A− be a different g-inverse. Then AA−B = AA−AA+B = AA+B = B. For the second statement one merely has to take transposes and note that a matrix is a g-inverse of a symmetric A if and only if its transpose is. For the third statement: BA+C = BA−AA+AA−C = BA−AA−C = BA−C. Here + signifies a different g-inverse; again, it is not necessarily the Moore-Penrose one.
Problem 570. Show that x satisfies x = Ba for some a if and only if x = BB−x.
Theorem A.3.2. Both A>(AA>)− and (A>A)−A> are g-inverses of A.

Proof. We have to show

(A.3.3)  A = AA>(AA>)−A,

which is [Rao73, (1b.5.5) on p. 26]. Define D = A − AA>(AA>)−A and show, by multiplying out DD> and using the defining property of the g-inverse (AA>)−, that DD> = O; theorem A.1.3 then gives D = O. The proof for (A>A)−A> is analogous.
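A short numerical check of theorem A.3.2 (my own sketch; it uses the Moore-Penrose inverse of AA> merely as one convenient g-inverse):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [2., 4., 6.]])                 # rank-deficient, no ordinary inverse exists
    G = A.T @ np.linalg.pinv(A @ A.T)            # candidate g-inverse A'(AA')^-
    print(np.allclose(A @ G @ A, A))             # the g-inverse property A A^- A = A holds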
A.4 Deficiency Matrices

Here is again some idiosyncratic terminology and notation. It gives an explicit algebraic formulation for something that is often done implicitly or in a geometric paradigm. A matrix G will be called a "left deficiency matrix" of S, in symbols, G ⊥ S, if GS = O, and for all Q with QS = O there is an X with Q = XG. This factorization property is an algebraic formulation of the geometric concept of a null space. It is symmetric in the sense that G ⊥ S is also equivalent with: GS = O, and for all R with GR = O there is a Y with R = SY. In other words, G ⊥ S and S> ⊥ G> are equivalent.
This symmetry follows from the following characterization of a deficiency matrix which is symmetric:

Theorem A.4.1. T ⊥ U iff T U = O and T>T + U U> is nonsingular.

Proof. This proof here seems terribly complicated; there must be a simpler way. Proof of "⇒": Assume T ⊥ U. Take any γ with γ>T>T γ + γ>U U>γ = 0, i.e., T γ = o and γ>U = o>. From this one can show that γ = o: since T γ = o, there is a ξ with γ = U ξ, therefore γ>γ = γ>U ξ = 0. To prove "⇐" assume T U = O and T>T + U U> is nonsingular. To show that T ⊥ U take any B with BU = O. Then B = B(T>T + U U>)(T>T + U U>)^{−1} = BT>T (T>T + U U>)^{−1}. In the same way one gets T = T T>T (T>T + U U>)^{−1}. Premultiply this last equation by T>T (T>T T>T )−T> and use theorem A.3.2 to get T>T (T>T T>T )−T>T = T>T (T>T + U U>)^{−1}. Inserting this into the equation for B gives B = BT>T (T>T T>T )−T>T, i.e., B factors over T.
The R/Splus function Null gives the transpose of a deficiency matrix.
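A Python analogue of this (my own sketch; scipy's null_space plays the role of the Null function just mentioned):

    import numpy as np
    from scipy.linalg import null_space

    S = np.array([[1., 0.],
                  [2., 0.],
                  [3., 1.]])
    G = null_space(S.T).T                 # rows of G span the left null space of S
    print(np.allclose(G @ S, 0))          # GS = O
    print(G.shape[0] == 3 - np.linalg.matrix_rank(S))   # G has maximal rank among such matrices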
Theorem A.4.2. If for all Y, BY = O implies AY = O, then an X exists with A = XB.

Problem 571. Prove theorem A.4.2.

Problem 572. Show that I − SS− ⊥ S.

Answer. Clearly (I − SS−)S = S − SS−S = O. And if QS = O, then Q(I − SS−) = Q − QSS− = Q, i.e., the X whose existence is postulated in the definition of a deficiency matrix is Q itself.

Problem 573. Show that S ⊥ U if and only if S is a matrix with maximal rank which satisfies SU = O. In other words, one cannot add linearly independent rows to S in such a way that the new matrix T still satisfies T U = O.

such a row to S and get the result which was just ruled out), therefore P = AS where A is the matrix of coefficients of these linear combinations.
The deficiency matrix is not unique, but we will use the concept of a deficiency matrix in a formula only when this formula remains correct for every deficiency matrix. One can make deficiency matrices unique if one requires them to be projection matrices.
Problem 574. Given X and a symmetric nonnegative definite Ω such that X = ΩW for some W. Show that X ⊥ U if and only if X>Ω−X ⊥ U.

Answer. ⇐: note that X>Ω−X = W>ΩΩ−ΩW = W>ΩW; therefore, whenever X>Ω−XY = O, also XY = ΩW Y = ΩW (W>ΩW )−W>ΩW Y = ΩW (W>ΩW )−X>Ω−XY = O.
A matrix is said to have full column rank if all its columns are linearly independent, and full row rank if its rows are linearly independent. The deficiency matrix provides a "holistic" definition for which it is not necessary to look at single rows and columns: X has full column rank if and only if X ⊥ O, and full row rank if and only if O ⊥ X.

Problem 575. Show that the following three statements are equivalent: (1) X has full column rank, (2) X>X is nonsingular, and (3) X has a left inverse.

Answer. Here use X ⊥ O as the definition of "full column rank." Then (1) ⇔ (2) is theorem A.4.1. Now (1) ⇒ (3): Since IO = O, a P exists with I = P X. And (3) ⇒ (1): if a P exists with I = P X, then any Q with QO = O can be factored over X, simply say Q = QP X.
Note that the usual solution of linear matrix equations with g-inverses involves a deficiency matrix:

Theorem A.4.3. The solution of the consistent matrix equation T X = A is

(A.4.1)  X = T−A + U W,

where T ⊥ U and W is arbitrary.

Proof. Given consistency, i.e., the existence of at least one Z with T Z = A, (A.4.1) defines indeed a solution, since T X = T T−T Z = T Z = A. Conversely, if Y satisfies T Y = A, then T (Y − T−A) = O, therefore Y − T−A = U W for some W.

Theorem A.4.4. Let L ⊥ T ⊥ U and J ⊥ HU ⊥ R; then

    [ L        O ]      [ T ]
    [ −J HT−   J ]  ⊥   [ H ]  ⊥  U R.
Proof. First deficiency relation: Since I − T−T = U W for some W, −J HT−T + J H = O, therefore the matrix product is zero. Now assume

    [ A  B ] [ T ]  =  O.
             [ H ]

Then BHU = O, i.e., B = DJ for some D. Then AT = −DJ H, which has as general solution A = −DJ HT− + CL for some C. This together gives

    [ A  B ]  =  [ C  D ] [ L        O ]
                          [ −J HT−   J ].

Now the second deficiency relation: clearly, the product of the matrices is zero. If M satisfies T M = O, then M = U N for some N. If M furthermore satisfies HM = O, then HU N = O, therefore N = RP for some P and M = U RP, i.e., M factors over U R.
Theorem A.4.5. Assume Ω is nonnegative definite symmetric and K is such that KΩ is defined. Then the matrix

(A.4.2)  Ξ = Ω − ΩK>(KΩK>)−KΩ

has the following properties:

(1) Ξ does not depend on the choice of g-inverse of KΩK> used in (A.4.2).
(2) Any g-inverse of Ω is also a g-inverse of Ξ, i.e., ΞΩ−Ξ = Ξ.
(3) Ξ is nonnegative definite and symmetric.
(4) For every P ⊥ Ω follows [K; P ] ⊥ Ξ, where [K; P ] denotes the matrix with K stacked on top of P.
(5) If T is any other right deficiency matrix of [K; P ], i.e., if [K; P ] ⊥ T, then

(A.4.3)  Ξ = T (T>Ω−T )−T>.

Hint: show that any D satisfying Ξ = T DT> is a g-inverse of T>Ω−T. In order to apply (A.4.3), show that a matrix of the form T = SV, where K ⊥ S and P S ⊥ V, is a right deficiency matrix of [K; P ].
Proof of theoremA.4.5: Independence of choice of g-inverse follows from theorem
A.5.10 That ΩΩ− is a g-inverse is also an immediate consequence of theoremA.5.10.From the factorization Ξ = ΞΩΩ−Ξ follows also that Ξ is nnd symmetric (since everynnd symmetric ΩΩΩ also has a symmetric nnd g-inverse) (4) Deficiency property:From K
Ξ = O it follows Ξ = T A for some A, and therefore
Ξ = ΞΩΩ−Ξ = T AΩΩ−A>T>= T DT> where D = AΩΩ−A>
Trang 27Before going on we need a lemma Since (I − ΩΩΩΩΩ−)ΩΩΩ = O, there exists a Nwith I − ΩΩΩΩΩ−= N P , therefore T − ΩΩΩΩΩ−T = N P T = O or
T (T>Ω−T )−T>= ΩΩΩΩΩ−T (T>Ω−T )−T>Ω−ΩΩ and theoremA.5.10
Theorem A.4.6. Given two matrices T and U. Then T ⊥ U if and only if for any D the following two statements are equivalent:

(A.4.6)  T D = O

and

(A.4.7)  for all C which satisfy CU = O follows CD = O.
A.5 Nonnegative Definite Symmetric Matrices
By definition, a symmetric matrix Ω is nonnegative definite if a>Ωa ≥ 0 for all vectors a. It is positive definite if a>Ωa > 0 for all vectors a ≠ o.

Theorem A.5.1. Ω is nonnegative definite symmetric if and only if it can be written in the form Ω = A>A for some A.

Theorem A.5.2. If Ω is nonnegative definite, and a>Ωa = 0, then already Ωa = o.

Theorem A.5.6. If Ω and Σ are nonnegative definite, then tr(ΩΣ) ≥ 0.
Problem 576. Prove theorem A.5.6.

Answer. Find any factorization Σ = P P>. Then tr(ΩΣ) = tr(P>ΩP ) ≥ 0.
Theorem A.5.7. If Ω is nonnegative definite symmetric, then

(A.5.1)  (g>Ωa)² ≤ (g>Ωg)(a>Ωa)

for arbitrary vectors a and g. Equality holds if and only if Ωg and Ωa are linearly dependent, i.e., α and β exist, not both zero, such that Ωgα + Ωaβ = o.

Proof: First we will show that the condition for equality is sufficient. Therefore assume Ωgα + Ωaβ = o for certain α and β which are not both zero. Without loss of generality we can assume α ≠ 0. Then we can solve a>Ωgα + a>Ωaβ = 0 to get a>Ωg = −(β/α)a>Ωa, therefore the lefthand side of (A.5.1) is (β/α)²(a>Ωa)². Furthermore we can solve g>Ωgα + g>Ωaβ = 0 to get g>Ωg = −(β/α)g>Ωa = (β/α)²a>Ωa, therefore the righthand side of (A.5.1) is (β/α)²(a>Ωa)² as well; i.e., (A.5.1) holds with equality.

Secondly we will show that (A.5.1) holds in the general case and that, if it holds with equality, Ωg and Ωa are linearly dependent. We will split this second half of the proof into two substeps. First verify that (A.5.1) holds if g>Ωg = 0. If this is the case, then already Ωg = o, therefore Ωg and Ωa are linearly dependent and, by the first part of the proof, (A.5.1) holds with equality.

The second substep is the main part of the proof. Assume g>Ωg ≠ 0. Since Ω is nonnegative definite, it follows

0 ≤ (a − g(g>Ωa)/(g>Ωg))>Ω(a − g(g>Ωa)/(g>Ωg)) = a>Ωa − (g>Ωa)²/(g>Ωg),

which gives (A.5.1). If (A.5.1) holds with equality, this expression is zero, hence by theorem A.5.2 already Ωa − Ωg(g>Ωa)/(g>Ωg) = o, which means that Ωg and Ωa are linearly dependent.
Theorem A.5.8. In the situation of theorem A.5.7, one can take g-inverses as follows without disturbing the inequality:

(g>Ωa)² (g>Ωg)− ≤ a>Ωa.

Equality holds if and only if a γ ≠ 0 exists with Ωg = Ωaγ.
Problem 577. Show that if Ω is nonnegative definite, then its elements satisfy ω_ij² ≤ ω_ii ω_jj.

Problem 578. Assume Ω nonnegative definite symmetric. If x satisfies x = Ωa for some a, show that

(g>x)² (g>Ωg)− ≤ x>Ω−x.

Furthermore show that equality holds if and only if Ωg = xγ for some γ ≠ 0.

Answer. From x = Ωa follows g>x = g>Ωa and x>Ω−x = a>Ωa; therefore it follows from theorem A.5.8.
Problem 579. Assume Ω nonnegative definite symmetric, x satisfies x = Ωa for some a, and R is such that Rx is defined. Show that

Problem 580. Assume Ω nonnegative definite symmetric. Show that

g : g = Ωa for some a

Problem 581. Prove theorem A.5.9.
Answer. Assume x = Ωa and x>Ω−x = a>Ωa ≤ 1; then for any g, g>(Ω − xx>)g = g>Ωg − g>Ωa a>Ωg ≥ a>Ωa g>Ωg − g>Ωa a>Ωg ≥ 0 by theorem A.5.7.

Conversely, assume x cannot be written in the form x = Ωa for some a; then a g exists with g>Ω = o> but g>x ≠ 0. Then g>(Ω − xx>)g < 0, therefore Ω − xx> is not nnd. Finally assume x>Ω−x = a>Ωa > 1; then a>(Ω − xx>)a = a>Ωa − (a>Ωa)² < 0, therefore Ω − xx> is again not nonnegative definite.
Theorem A.5.10. If Ω and Σ are nonnegative definite symmetric, and K is a matrix such that ΣKΩ is defined, then

(A.5.11)  KΩ = (KΩK> + Σ)(KΩK> + Σ)−KΩ.

Furthermore, ΩK>(KΩK> + Σ)−KΩ is independent of the choice of g-inverses.

Problem 582. Prove theorem A.5.10.

Answer. Write Ω = QQ> and Σ = P P>, and define A = (KQ  P ), so that AA> = KΩK> + Σ. The independence of the choice of g-inverses follows from theorem A.3.1.
The following was apparently first shown in [Alb69] for the special case of the Moore-Penrose pseudoinverse:

Theorem A.5.11. The symmetric partitioned matrix

    Ω = [ Ω_yy    Ω_yz ]
        [ Ω>_yz   Ω_zz ]

is nonnegative definite if and only if the following conditions hold:

(A.5.12)  Ω_yy and Ω_zz.y := Ω_zz − Ω>_yz Ω−_yy Ω_yz are both nonnegative definite, and

(A.5.13)  Ω_yz = Ω_yy Ω−_yy Ω_yz.
Reminder: It follows from theorem A.3.1 that (A.5.13) holds for some g-inverse if and only if it holds for all, and that, if it holds, Ω_zz.y is independent of the choice of the g-inverse.
Proof of theorem A.5.11: First we prove the necessity of the three conditions in the theorem. If the symmetric partitioned matrix Ω is nonnegative definite, there exists an R with Ω = R>R. Write R = [R_y  R_z] to get Ω_yy = R_y>R_y, Ω_yz = R_y>R_z, and Ω_zz = R_z>R_z; in particular Ω_yy is nonnegative definite. To show that Ω_zz.y is nonnegative definite, define S = (I − R_y(R_y>R_y)−R_y>)R_z. Then S>S = R_z>(I − R_y(R_y>R_y)−R_y>)R_z = Ω_zz − Ω>_yz Ω−_yy Ω_yz = Ω_zz.y, which is therefore nonnegative definite. Necessity of (A.5.13) follows from theorem A.3.2: Ω_yy Ω−_yy Ω_yz = R_y>R_y(R_y>R_y)−R_y>R_z = R_y>R_z = Ω_yz.
To show sufficiency of the three conditions of theorem A.5.11, assume the symmetric partitioned matrix

    Ω = [ Ω_yy    Ω_yz ]
        [ Ω>_yz   Ω_zz ]

satisfies them. Pick two matrices Q and S so that Ω_yy = Q>Q and Ω_zz.y = S>S. Then, with T := [ Q  Q(Q>Q)−Ω_yz ; O  S ] written in block form, a short computation using (A.5.13) gives T>T = Ω, which is therefore nonnegative definite.
Problem 583. [SM86, A 3.2/11] Given a positive definite matrix Q and a positive definite Q̃ with Q∗ = Q − Q̃ nonnegative definite.

• a. Show that Q̃ − Q̃Q^{−1}Q̃ is nonnegative definite.
• b. This part is more difficult: Show that also Q∗ − Q∗Q^{−1}Q∗ is nonnegative definite.

Answer. We will write it in a symmetric form from which it is obvious that it is nonnegative definite:

(A.5.14)  Q∗ − Q∗Q^{−1}Q∗ = Q∗ − Q∗(Q̃ + Q∗)^{−1}Q∗
(A.5.15)  = Q∗(Q̃ + Q∗)^{−1}(Q̃ + Q∗ − Q∗) = Q∗(Q̃ + Q∗)^{−1}Q̃
(A.5.16)  = Q̃(Q̃ + Q∗)^{−1}(Q̃ + Q∗)Q̃^{−1}Q∗(Q̃ + Q∗)^{−1}Q̃
(A.5.17)  = Q̃Q^{−1}(Q∗ + Q∗Q̃^{−1}Q∗)Q^{−1}Q̃.
Problem 584. Given the vector h ≠ o. For which values of the scalar γ is the matrix I − hh>/γ singular, nonsingular, nonnegative definite, a projection matrix, orthogonal?
Answer. It is orthogonal iff γ = h>h/2, and it is a projection matrix iff γ = h>h. Now let us prove that it is singular iff γ = h>h: if this condition holds, then the matrix annuls h; now assume the condition does not hold, i.e., γ ≠ h>h, and take any x with (I − hh>/γ)x = o. It follows x = hα where α = h>x/γ, therefore o = (I − hh>/γ)x = hα(1 − h>h/γ). Since h ≠ o and 1 − h>h/γ ≠ 0, this can only hold if α = 0, i.e., x = o; hence the matrix is nonsingular.
A.6 Projection Matrices
Problem 585. Show that X(X>X)−X> is the projection matrix on the range space R[X] of X, i.e., on the space spanned by the columns of X. This is true whether or not X has full column rank.

Answer. One has to show that X(X>X)−X> is symmetric and idempotent and does not depend on the choice of g-inverse. Furthermore one has to show X(X>X)−X>a = a holds if and only if a = Xb for some b. ⇒ is clear, and ⇐ follows from theorem A.3.2.
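A small numerical sketch of this (mine, not from the text), with a rank-deficient X to emphasize that full column rank is not needed:

    import numpy as np

    X = np.array([[1., 2.],
                  [1., 2.],
                  [0., 0.],
                  [3., 6.]])                    # second column = 2 * first column
    P = X @ np.linalg.pinv(X.T @ X) @ X.T       # projector on R[X], built with one particular g-inverse
    print(np.allclose(P, P.T), np.allclose(P @ P, P))    # symmetric and idempotent
    print(np.allclose(P @ X, X))                          # P leaves the columns of X unchanged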
Theorem A.6.1. Let P and Q be projection matrices, i.e., both are symmetric and idempotent. Then the following five conditions are equivalent, each meaning that the space on which P projects is a subspace of the space on which Q projects:

(A.6.1)  R[P ] ⊂ R[Q]
(A.6.2)  QP = P
(A.6.3)  P Q = P
(A.6.4)  Q − P is a projection matrix
(A.6.5)  Q − P is nonnegative definite.
(A.6.2) is geometrically trivial. It means: if one first projects on a certain space, and then on a larger space which contains the first space as a subspace, then nothing happens under this second projection because one is already in the larger space. (A.6.3) is geometrically not trivial and worth remembering: if one first projects on a certain space, and then on a smaller space which is a subspace of the first space, then the result is the same as if one had projected directly on the smaller space. (A.6.4) means: the difference Q − P is the projection on the orthogonal complement of R[P ] in R[Q]. And (A.6.5) means: the projection of a vector on the smaller space cannot be longer than that on the larger space.
Problem 586. Prove theorem A.6.1.
Answer. One can first show (A.6.1) ⇐⇒ (A.6.2) and (A.6.3) ⇐⇒ (A.6.2), and then go in a circle for the remaining conditions: (A.6.2), (A.6.3) ⇒ (A.6.4) ⇒ (A.6.5) ⇒ (A.6.3).

(A.6.1) ⇒ (A.6.2): R[P ] ⊂ R[Q] means that for every c there exists a d with P c = Qd. Therefore for all c follows QP c = QQd = Qd = P c, i.e., QP = P.

(A.6.2) ⇒ (A.6.1): if P c = QP c for all c, then clearly R[P ] ⊂ R[Q].

(A.6.2) ⇒ (A.6.3) by symmetry of P and Q: If QP = P then P Q = P>Q> = (QP )> = P> = P.

(A.6.3) ⇒ (A.6.2) follows in exactly the same way: If P Q = P then QP = Q>P> = (P Q)> = P> = P.

(A.6.2), (A.6.3) ⇒ (A.6.4): Symmetry of Q − P is clear, and (Q − P )(Q − P ) = Q − P − P + P = Q − P.

(A.6.4) ⇒ (A.6.5): c>(Q − P )c = c>(Q − P )>(Q − P )c ≥ 0.

(A.6.5) ⇒ (A.6.3): First show that, if Q − P is nnd, then Qc = o implies P c = o. Proof: from Q − P nnd and Qc = o follows 0 ≤ c>(Q − P )c = −c>P c ≤ 0, therefore equality throughout, i.e., 0 = c>P c = c>P>P c = ‖P c‖², and therefore P c = o. Secondly: this is also true for matrices: QC = O implies P C = O, since it is valid for every column of C. Thirdly: Since Q(I − Q) = O, it follows P (I − Q) = O, i.e., P = P Q, which is (A.6.3).
Problem 587. If Y = XA for some A, show that Y (Y>Y )−Y>X(X>X)−X> = Y (Y>Y )−Y>.

Problem 588. (Not eligible for in-class exams.) Let Q be a projection matrix (i.e., a symmetric and idempotent matrix) with the property that Q = XAX> for some A. Define X̃ = (I − Q)X. Then

(A.6.7)  X(X>X)−X> = X̃(X̃>X̃)−X̃> + Q.

Hint: this can be done through a geometric argument. If you want to do it algebraically, you might want to use the fact that (X>X)− is also a g-inverse of X̃>X̃.
Answer. Geometrically: the columns of X̃ are projections of the columns of X on the orthogonal complement of the space on which Q projects. The equation which we have to prove shows therefore that the projection on the column space of X is the sum of the projection on the space Q projects on and the projection on the orthogonal complement of that space in X.
Now an algebraic proof: First let us show that (X>X)− is a g-inverse of X̃>X̃, i.e., let us evaluate

(A.6.8)  X>(I − Q)X(X>X)−X>(I − Q)X = X>X(X>X)−X>X − X>X(X>X)−X>QX − X>QX(X>X)−X>X + X>QX(X>X)−X>QX

(A.6.9)  = X>X − X>QX − X>QX + X>QX = X>(I − Q)X.

Only for the fourth term did we need the condition Q = XAX>:

(A.6.10)  X>XAX>X(X>X)−X>XAX>X = X>XAX>XAX>X = X>QQX = X>QX.