This section picks up and extends a theme from Section I.8. There we connected the eigenvalues and eigenvectors of a symmetric matrix S to the Rayleigh quotient R(x):

$$R(x)=\frac{x^{T}Sx}{x^{T}x} \qquad (1)$$
The maximum value of R(x) is the largest eigenvalue λ₁ of S. That maximum is achieved at the eigenvector x = q₁ where Sq₁ = λ₁q₁:

$$\textbf{Maximum} \qquad R(q_1)=\frac{q_1^{T}Sq_1}{q_1^{T}q_1}=\frac{q_1^{T}\lambda_1 q_1}{q_1^{T}q_1}=\lambda_1 \qquad (2)$$
Similarly the minimum value of R(x) equals the smallest eigenvalue λₙ of S. That minimum is attained at the "bottom eigenvector" x = qₙ. More than that, all the eigenvectors x = qₖ of S for eigenvalues between λₙ and λ₁ are saddle points of R(x). Saddles have first derivatives = zero but they are not maxima or minima.

$$\textbf{Saddle point} \qquad R(q_k)=\lambda_k \qquad (3)$$
These facts connected to the Singular Value Decomposition of A. The connection was through S = AᵀA. For that positive definite (or semidefinite) matrix S, the Rayleigh quotient led to the norm (squared) of A. The largest eigenvalue of S is σ₁²(A):

$$\lambda_1(S)=\max_{x\neq 0}\frac{x^{T}A^{T}Ax}{x^{T}x}=\max_{x\neq 0}\frac{\|Ax\|^{2}}{\|x\|^{2}}=\sigma_1^{2}(A)=\|A\|^{2} \qquad (4)$$

In this way a symmetric eigenvalue problem is also an optimization: Maximize R(x).
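A short numerical sketch (in Python with NumPy, one of the systems mentioned later in this section; the matrix below is chosen only for illustration) confirms that R(x) never exceeds λ₁ and that R(q₁) = λ₁:

```python
# A sketch, not part of the text: the Rayleigh quotient R(x) = x'Sx / x'x
# reaches its maximum lambda_1 at the top eigenvector q1.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = A + A.T                               # an arbitrary symmetric matrix S

lam, Q = np.linalg.eigh(S)                # eigenvalues in increasing order
q1, lam1 = Q[:, -1], lam[-1]              # top eigenvector and eigenvalue

R = lambda x: (x @ S @ x) / (x @ x)       # the Rayleigh quotient
print(R(q1), lam1)                        # equal: R(q1) = lambda_1

samples = rng.standard_normal((1000, 5))  # random x never beat lambda_1
print(max(R(x) for x in samples) <= lam1)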
Generalized Eigenvalues and Eigenvectors

Applications in statistics and data science lead us to the next step. Applications in engineering and mechanics point the same way. A second symmetric matrix M enters the denominator of R(x):
$$\textbf{Generalized Rayleigh quotient} \qquad R(x)=\frac{x^{T}Sx}{x^{T}Mx} \qquad (5)$$
In dynamical problems M is often the "mass matrix" or the "inertia matrix". In statistics M is generally the covariance matrix. The construction of covariance matrices and their application to classifying data will come in the chapter on probability and statistics.
Here our goal is to see how the eigenvalue problem Sx = λx changes to Sx = λMx, when R(x) becomes xᵀSx / xᵀMx. This is the generalized symmetric eigenvalue problem.

If M is positive definite, the maximum of R(x) is the largest eigenvalue of M⁻¹S.

We will reduce this generalized problem Sx = λMx to an ordinary eigenvalue problem Hy = λy. But you have to see that the choice H = M⁻¹S is not really perfect. The reason is simple: M⁻¹S is not usually symmetric! Even a diagonal matrix M will make this point clear. The square root M^{1/2} of that same diagonal matrix will suggest the right way to hold on to symmetry.
$$M^{-1}S=\begin{bmatrix} m_1 & 0\\ 0 & m_2 \end{bmatrix}^{-1}\begin{bmatrix} a & b\\ b & c \end{bmatrix}=\begin{bmatrix} a/m_1 & b/m_1\\ b/m_2 & c/m_2 \end{bmatrix} \quad \text{is not symmetric}$$
Those matrices M⁻¹S and H = M^{-1/2}SM^{-1/2} have the same eigenvalues. This H looks awkward, but symmetry is saved when we choose the symmetric square root of M and M⁻¹. Every positive definite M has a positive definite square root. The diagonal example above had M^{1/2} = diag(√m₁, √m₂). Its inverse is M^{-1/2}.
In all cases, we just diagonalize M and take the square root of each eigenvalue:
$$\text{If } M=Q\Lambda Q^{T} \text{ has } \Lambda>0 \text{ then } M^{1/2}=Q\Lambda^{1/2}Q^{T} \text{ has } \Lambda^{1/2}>0. \qquad (6)$$

Squaring M^{1/2} recovers QΛ^{1/2}QᵀQΛ^{1/2}Qᵀ = QΛQᵀ which is M. We will not use M^{1/2} or M^{-1/2} numerically! The generalized eigenvalue problem Sx = λMx is solved in MATLAB by the command eig(S, M). Julia and Python and R and all full linear algebra systems include this extension to Sx = λMx.
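The Python analogue (a sketch assuming NumPy and SciPy) is scipy.linalg.eigh(S, M). The symmetric square root of equation (6) is built below only to check that H = M^{-1/2}SM^{-1/2} has the same eigenvalues:

```python
# A sketch: eig(S, M) in Python, plus the square root M^(1/2) of equation (6).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); S = A @ A.T + np.eye(4)   # positive definite S
B = rng.standard_normal((4, 4)); M = B @ B.T + np.eye(4)   # positive definite M

lam, X = eigh(S, M)                              # solves S x = lambda M x

d, Q = np.linalg.eigh(M)                         # M = Q Lambda Q^T
M_half_inv = Q @ np.diag(1 / np.sqrt(d)) @ Q.T   # M^(-1/2), symmetric
H = M_half_inv @ S @ M_half_inv                  # H = M^(-1/2) S M^(-1/2)

print(np.allclose(np.linalg.eigvalsh(H), lam))   # True: same eigenvalues
```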
A Rayleigh quotient with xᵀMx is easily converted to a quotient with yᵀy:

$$\text{Set } x=M^{-1/2}y. \quad \text{Then} \quad \frac{x^{T}Sx}{x^{T}Mx}=\frac{y^{T}(M^{-1/2})^{T}S\,M^{-1/2}\,y}{y^{T}y} \qquad (7)$$

This changes the generalized problem Sx = λMx to an ordinary symmetric problem Hy = λy. If S and M are positive definite, so is H = M^{-1/2}SM^{-1/2}.
The largest Rayleigh quotient still gives the largest eigenvalue λ₁. And we see the top eigenvector y₁ of H and the top eigenvector x₁ of M⁻¹S:

$$Hy_1=\lambda_1 y_1 \ \text{ at } \ y_1=M^{1/2}x_1 \quad \text{and} \quad M^{-1}Sx_1=\lambda_1 x_1 \qquad (8)$$
Example 1  Solve Sx = λMx when
$$S=\begin{bmatrix} 4 & -2\\ -2 & 4 \end{bmatrix} \quad \text{and} \quad M=\begin{bmatrix} 1 & 0\\ 0 & 2 \end{bmatrix}.$$
Solution  Our eigenvalue problems are (S − λM)x = 0 and (H − λI)y = 0. We will find the same λ's from both determinants: det(S − λM) = 0 and det(H − λI) = 0.

$$\det(S-\lambda M)=\det\begin{bmatrix} 4-\lambda & -2\\ -2 & 4-2\lambda \end{bmatrix}=2\lambda^{2}-12\lambda+12=0 \quad \text{gives} \quad \lambda=3\pm\sqrt{3}.$$
If you prefer to work with one matrix H = M^{-1/2}SM^{-1/2}, we must first compute it:

$$H=\begin{bmatrix} 1 & 0\\ 0 & 1/\sqrt{2} \end{bmatrix}\begin{bmatrix} 4 & -2\\ -2 & 4 \end{bmatrix}\begin{bmatrix} 1 & 0\\ 0 & 1/\sqrt{2} \end{bmatrix}=\begin{bmatrix} 4 & -\sqrt{2}\\ -\sqrt{2} & 2 \end{bmatrix}.$$

Then its eigenvalues come from the determinant of H − λI:

$$\det(H-\lambda I)=\det\begin{bmatrix} 4-\lambda & -\sqrt{2}\\ -\sqrt{2} & 2-\lambda \end{bmatrix}=\lambda^{2}-6\lambda+6=0 \quad \text{also gives} \quad \lambda=3\pm\sqrt{3}.$$

This equation is just half of the previous 2λ² − 12λ + 12 = 0. Same λ's for H and M⁻¹S.
In mechanical engineering those λ's would tell us the frequencies ω = √λ for two oscillating masses m₁ = 1 and m₂ = 2 in a line of springs. S tells us the stiffness in the three springs that connect these two masses to fixed endpoints. The differential equation is Newton's Law M d²u/dt² = −Su.
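Example 1 is easy to check numerically (a sketch with NumPy/SciPy; eigh(S, M) is the Python counterpart of MATLAB's eig(S, M)):

```python
# Checking Example 1: the generalized eigenvalues are 3 - sqrt(3) and 3 + sqrt(3).
import numpy as np
from scipy.linalg import eigh

S = np.array([[4., -2.], [-2., 4.]])
M = np.array([[1., 0.], [0., 2.]])

lam, X = eigh(S, M)                       # S x = lambda M x
print(lam)                                # [1.2679, 4.7321]

M_half_inv = np.diag([1.0, 1/np.sqrt(2)])
H = M_half_inv @ S @ M_half_inv           # [[4, -sqrt(2)], [-sqrt(2), 2]]
print(np.linalg.eigvalsh(H))              # the same two eigenvalues

print(np.sqrt(lam))                       # frequencies omega = sqrt(lambda)
```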
Generalized Eigenvectors are M-orthogonal

A crucial fact about a symmetric matrix S is that any two eigenvectors are orthogonal (when the eigenvalues are different). Does this extend to Sx₁ = λ₁Mx₁ with two symmetric matrices? The immediate answer is no, but the right answer is yes. For that answer, we have to assume that M is positive definite, and we have to change from x₁ᵀx₂ = 0 to "M-orthogonality" of x₁ and x₂. Two vectors are M-orthogonal if x₁ᵀMx₂ = 0.

$$\text{If } Sx_1=\lambda_1 Mx_1 \text{ and } Sx_2=\lambda_2 Mx_2 \text{ with } \lambda_1\neq\lambda_2, \text{ then } x_1^{T}Mx_2=0. \qquad (9)$$

Proof  Multiply one equation by x₂ᵀ and multiply the other equation by x₁ᵀ:

$$x_2^{T}Sx_1=\lambda_1 x_2^{T}Mx_1 \quad \text{and} \quad x_1^{T}Sx_2=\lambda_2 x_1^{T}Mx_2$$

Because S and M are symmetric, transposing the first equation gives x₁ᵀSx₂ = λ₁x₁ᵀMx₂. Subtract the second equation:

$$(\lambda_1-\lambda_2)\,x_1^{T}Mx_2=0 \quad \text{so that} \quad x_1^{T}Mx_2=0 \ \text{ when } \lambda_1\neq\lambda_2.$$

Then also x₁ᵀSx₂ = 0. We can test this conclusion on the matrices S and M in Example 1.
The eigenvectors x and y are in the nullspaces where (S − λ₁M)x = 0 and (S − λ₂M)y = 0.

$$(S-\lambda_1 M)\,x=\begin{bmatrix} 4-(3+\sqrt{3}) & -2\\ -2 & 4-2(3+\sqrt{3}) \end{bmatrix}\begin{bmatrix} x_1\\ x_2 \end{bmatrix}=\begin{bmatrix}0\\0\end{bmatrix} \quad \text{gives} \quad x=c\begin{bmatrix} 2\\ 1-\sqrt{3} \end{bmatrix}$$

$$(S-\lambda_2 M)\,y=\begin{bmatrix} 4-(3-\sqrt{3}) & -2\\ -2 & 4-2(3-\sqrt{3}) \end{bmatrix}\begin{bmatrix} y_1\\ y_2 \end{bmatrix}=\begin{bmatrix}0\\0\end{bmatrix} \quad \text{gives} \quad y=c\begin{bmatrix} 2\\ 1+\sqrt{3} \end{bmatrix}$$

Those eigenvectors x and y are not orthogonal. But they are M-orthogonal because

$$x^{T}My=\begin{bmatrix} 2 & 1-\sqrt{3} \end{bmatrix}\begin{bmatrix} 1 & 0\\ 0 & 2 \end{bmatrix}\begin{bmatrix} 2\\ 1+\sqrt{3} \end{bmatrix}=4+2(1-3)=0.$$
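A quick numerical confirmation (NumPy/SciPy assumed). Note that scipy.linalg.eigh(S, M) returns eigenvectors that are already M-orthonormal, so XᵀMX = I:

```python
# The M-orthogonality check of Example 1, done numerically (a sketch).
import numpy as np
from scipy.linalg import eigh

S = np.array([[4., -2.], [-2., 4.]])
M = np.array([[1., 0.], [0., 2.]])

x = np.array([2., 1 - np.sqrt(3)])   # eigenvector for lambda_1 = 3 + sqrt(3)
y = np.array([2., 1 + np.sqrt(3)])   # eigenvector for lambda_2 = 3 - sqrt(3)
print(x @ y)                         # 2.0 : x and y are not orthogonal
print(x @ M @ y)                     # 0.0 : x and y are M-orthogonal

lam, X = eigh(S, M)                  # eigenvectors come out M-orthonormal
print(np.round(X.T @ M @ X, 10))     # the identity matrix
```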
Positive Semidefinite M: Not Invertible

There are important applications in which the matrix M is only positive semidefinite. Then xᵀMx can be zero! The matrix M will not be invertible. The quotient xᵀSx / xᵀMx can be infinite. The matrices M^{-1/2} and H do not even exist. The eigenvalue problem Sx = λMx is still to be solved, but an infinite eigenvalue λ = ∞ is now very possible.
In statistics M is often a covariance matrix. Its diagonal entries tell us the separate variances of two or more measurements. Its off-diagonal entries tell us the "covariances" between the measurements. If we are foolishly repeating exactly the same observations, or if one experiment is completely determined by another, then the covariance matrix M is singular. Its determinant is zero and it is not invertible. The Rayleigh quotient (which divides by xᵀMx) may become infinite.
One way to look at this mathematically is to write Sx = λMx in a form with α and β:

$$\alpha\,Sx=\beta\,Mx \quad \text{with } \alpha\geq 0 \text{ and } \beta\geq 0 \text{ and eigenvalues } \lambda=\frac{\beta}{\alpha}. \qquad (10)$$

λ will be an ordinary positive eigenvalue if α > 0 and β > 0. We can even normalize those two numbers by α² + β² = 1. But now we see three other possibilities in equation (10):

α > 0 and β = 0   Then λ = 0 and Sx = 0x: a normal zero eigenvalue of S
α = 0 and β > 0   Then λ = ∞ and Mx = 0: M is not invertible
α = 0 and β = 0   Then λ = 0/0 is undetermined: Mx = 0 and also Sx = 0.
α = 0 can occur when we have clusters of data, if the number of samples in a cluster is smaller than the number of features we measure. This is the problem of small sample size.
It happens.
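Here is a tiny illustration (the singular M below is chosen only as an example, not taken from the text): one generalized eigenvalue is finite and the other is infinite, yet the two eigenvectors are still M-orthogonal.

```python
# A sketch of equation (10) with a semidefinite M (matrices chosen for illustration).
import numpy as np

S = np.array([[4., -2.], [-2., 4.]])
M = np.array([[1., 0.], [0., 0.]])    # positive semidefinite, not invertible

# Finite eigenvalue: det(S - lambda M) = 12 - 4 lambda = 0 gives lambda = 3
x1 = np.array([2., 1.])
print(S @ x1 - 3 * (M @ x1))          # [0, 0] : S x1 = 3 M x1

# Infinite eigenvalue (alpha = 0 in equation (10)): M x2 = 0 while S x2 is not 0
x2 = np.array([0., 1.])
print(M @ x2, S @ x2)                 # [0, 0] and [-2, 4]

print(x1 @ M @ x2)                    # 0 : the eigenvectors are still M-orthogonal
```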
You will understand that the mathematics becomes more delicate. The SVD approach (when you factor a data matrix into A = UΣVᵀ with singular vectors v coming from eigenvectors of S = AᵀA) is not sufficient. We need to generalize the SVD. We need to allow for a second matrix M. This led to the GSVD.
The Generalized SVD (Simplified)
In its full generality, this factorization is complicated. It allows for two matrices S and M and it allows them to be singular. In this book it makes sense to stay with the usual and best case, when these symmetric matrices are positive definite. Then we can see the primary purpose of the GSVD, to factor two matrices at the same time.
Remember that the classical SVD factors a rectangular matrix A into UΣVᵀ. It begins with A and not with S = AᵀA. Similarly here, we begin with two matrices A and B. Our simplification is to assume that both are tall thin matrices of rank n. Their sizes are m_A by n and m_B by n. Then S = AᵀA and M = BᵀB are n by n and positive definite.
Generalized Singular Value Decomposition

A and B can be factored into A = U_A Σ_A Z and B = U_B Σ_B Z (same Z)
U_A and U_B are orthogonal matrices (sizes m_A and m_B)
Σ_A and Σ_B are positive diagonal matrices (with Σ_AᵀΣ_A + Σ_BᵀΣ_B = I_{n×n})
Z is an invertible matrix (size n)
Notice that Z is probably not an orthogonal matrix. That would be asking too much.
The remarkable property of Z is to simultaneously diagonalize S = AᵀA and M = BᵀB:

$$A^{T}A=Z^{T}\Sigma_A^{T}U_A^{T}U_A\Sigma_A Z=Z^{T}(\Sigma_A^{T}\Sigma_A)Z \quad \text{and} \quad B^{T}B=Z^{T}(\Sigma_B^{T}\Sigma_B)Z. \qquad (11)$$

So this is a fact of linear algebra: Any two positive definite matrices can be diagonalized by the same matrix Z. By equation (9), its columns can be x₁, ..., xₙ! That was known before the GSVD was invented. And because orthogonality is not required, we can scale Z so that Σ_AᵀΣ_A + Σ_BᵀΣ_B = I. We can also order its columns xₖ to put the n positive numbers σ_A in decreasing order (in Σ_A).
Please also notice the meaning of "diagonalize". Equation (11) does not contain Z⁻¹ and Z, it contains Zᵀ and Z. With Z⁻¹ we have a similarity transformation, preserving eigenvalues. With Zᵀ we have a congruence transformation ZᵀSZ, preserving symmetry. (Then the eigenvalues of S and ZᵀSZ have the same signs. This is Sylvester's Law of Inertia in PSet III.2. Here the signs are all positive.) The symmetry of S = AᵀA and the positive definiteness of M = BᵀB allow one Z to diagonalize both matrices.
The Problem Set (Problem 5) will guide you to a proof of this simplified GSVD.
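Here is a sketch of that Problem 5 construction (NumPy assumed, with random positive definite matrices chosen for illustration): one matrix Z whose congruences ZᵀSZ and ZᵀMZ are both diagonal.

```python
# A sketch of simultaneous diagonalization by congruence (the Problem 5 recipe).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)); S = A @ A.T + np.eye(4)   # positive definite S
B = rng.standard_normal((4, 4)); M = B @ B.T + np.eye(4)   # positive definite M

lam, Q = np.linalg.eigh(S)               # Q^T S Q = Lambda
D = np.diag(1 / np.sqrt(lam))            # D^T Lambda D = I
K = D @ Q.T @ M @ Q @ D                  # the transformed M (still symmetric)
lam2, Q2 = np.linalg.eigh(K)             # Q2^T K Q2 = Lambda_2

Z = Q @ D @ Q2
print(np.round(Z.T @ S @ Z, 8))          # the identity matrix I
print(np.round(Z.T @ M @ Z, 8))          # the diagonal matrix Lambda_2
```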
Fisher's Linear Discriminant Analysis (LDA)
Here is a nice application in statistics and machine learning. We are given samples from two different populations and they are mixed together. We know the basic facts about each population: its average value m and its average spread σ around that mean m. So we have a mean m₁ and variance σ₁ for the first population and m₂, σ₂ for the second population. If all the samples are mixed together and we pick one, how do we tell if it probably came from population 1 or population 2?
Fisher's "linear discriminant" answers that question.
Actually the problem is one step more complicated. Each sample has several features, like a child's age and height and weight. This is normal for machine learning, to have a
"feature vector" like f = (age, height, weight) for each sample. If a sample has feature vector f, which population did it probably come from? We start with vectors not scalars.
We have an average age m_a and average height m_h and average weight m_w for each population. The mean (average) of population 1 is a vector m₁ = (m_{a1}, m_{h1}, m_{w1}).
Population 2 also has a vector m2 of average age, average height, average weight.
And the variance σ for each population, to measure its spread around its mean, becomes a 3 × 3 matrix Σ. This "covariance matrix" will be a key to Chapter V on statistics. For now, we have m₁, m₂, Σ₁, Σ₂ and we want a rule to discriminate between the two populations.
Fisher's test has a simple form: He finds a separation vector v. If the sample has vᵀf > c then our best guess is population 1. If vᵀf < c then the sample probably came from population 2. The vector v is trying to separate the two populations (as much as possible). It maximizes the separation ratio R:

$$\textbf{Separation ratio} \qquad R(x)=\frac{(x^{T}m_1-x^{T}m_2)^{2}}{x^{T}\Sigma_1 x+x^{T}\Sigma_2 x} \qquad (12)$$

That ratio R has the form xᵀSx / xᵀMx. The matrix S is (m₁ − m₂)(m₁ − m₂)ᵀ. The matrix M is Σ₁ + Σ₂. We know the rule Sv = λMv for the vector x = v that maximizes the separation ratio R.
Fisher could actually find that eigenvector v of M⁻¹S. So can we, because the matrix S = (m₁ − m₂)(m₁ − m₂)ᵀ has rank one. So Sv is always in the direction m₁ − m₂. Then Mv must be in that direction, to have Sv = λMv. So v = M⁻¹(m₁ − m₂).

This was a nice problem because we found the eigenvector v. It makes sense that when the unknown sample has feature vector f = (age, height, weight), we would look at the numbers m₁ᵀf and m₂ᵀf. If we were ready for a full statistical discussion (which we are not), then we could see how the weighting matrix M = Σ₁ + Σ₂ enters into the final test on vᵀf. Here it is enough to say: The feature vectors f from the two populations are separated as well as possible by a plane that is perpendicular to v.
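A sketch of Fisher's rule on made-up data (NumPy assumed; the two populations, their features, and the midpoint threshold c are all illustrative choices, not from the text):

```python
# Fisher's direction v = M^{-1}(m1 - m2) with M = Sigma_1 + Sigma_2 (a sketch).
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([5., 110., 20.], np.diag([1., 36., 9.]), size=300)
X2 = rng.multivariate_normal([9., 135., 32.], np.diag([1., 49., 16.]), size=300)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)          # sample means
Sigma1, Sigma2 = np.cov(X1.T), np.cov(X2.T)        # sample covariance matrices

M = Sigma1 + Sigma2
v = np.linalg.solve(M, m1 - m2)                    # v = M^{-1}(m1 - m2)
c = v @ (m1 + m2) / 2                              # an illustrative midpoint threshold

print((X1 @ v > c).mean())   # fraction of population 1 with v'f > c (guessed as 1)
print((X2 @ v < c).mean())   # fraction of population 2 with v'f < c (guessed as 2)
```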
Summary. We have two clouds of points in 3-dimensional feature space. We try to separate them by a plane (not always possible). Fisher proposed one reasonable plane.
Neural networks will succeed by allowing separating surfaces that are not just planes.
Problem Set I.10
1  Solve (S − λM)x = 0 and (H − λI)y = 0 after computing the matrix H = M^{-1/2}SM^{-1/2}:
$$S=\begin{bmatrix} 5 & 4\\ 4 & 5 \end{bmatrix} \qquad M=\begin{bmatrix} 1 & 0\\ 0 & 1/4 \end{bmatrix}$$
Step 1 is to find λ₁ and λ₂ from det(S − λM) = 0. The equation det(H − λI) = 0 should produce the same λ₁ and λ₂. Those eigenvalues will produce two eigenvectors x₁ and x₂ of S − λM and two eigenvectors y₁ and y₂ of H − λI.
Verify that x₁ᵀx₂ is not zero but x₁ᵀMx₂ = 0. H is symmetric so y₁ᵀy₂ = 0.
2  (a) For x = (a, b) and y = (c, d) write the Rayleigh quotients in Problem 1 as
$$R^{*}(x)=\frac{x^{T}Sx}{x^{T}Mx}=\frac{5a^{2}+8ab+5b^{2}}{(\cdots+\cdots+\cdots)} \quad \text{and} \quad R(y)=\frac{y^{T}Hy}{y^{T}y}=\frac{5c^{2}+16cd+20d^{2}}{(\cdots+\cdots)}$$
   (b) Take the c and d derivatives of R(y) to find its maximum and minimum.
   (c) Take the a and b derivatives of R*(x) to find its maximum and minimum.
   (d) Verify that those maxima occur at eigenvectors from (S − λM)x = 0 and (H − λI)y = 0.
3  How are the eigenvectors x₁ and x₂ related to the eigenvectors y₁ and y₂?

4  Change M to $\begin{bmatrix} \cdot & \cdot\\ \cdot & \cdot \end{bmatrix}$ and solve Sx = λMx. Now M is singular and one of the eigenvalues λ is infinite. But its eigenvector x₂ is still M-orthogonal to the other eigenvector x₁.
5  Start with symmetric positive definite matrices S and M. Eigenvectors of S fill an orthogonal matrix Q so that QᵀSQ = Λ is diagonal. What is the diagonal matrix D so that DᵀΛD = I? Now we have DᵀQᵀSQD = I and we look at DᵀQᵀMQD. Its eigenvector matrix Q₂ gives Q₂ᵀIQ₂ = I and Q₂ᵀDᵀQᵀMQDQ₂ = Λ₂. Show that Z = QDQ₂ diagonalizes both congruences ZᵀSZ and ZᵀMZ in the GSVD.
6  (a) Why does every congruence ZᵀSZ preserve the symmetry of S?
   (b) Why is ZᵀSZ positive definite when S is positive definite and Z is square and invertible? Apply the energy test to ZᵀSZ. Be sure to explain why Zx is not the zero vector.
7  Which matrices ZᵀIZ are congruent to the identity matrix for invertible Z?

8  Solve this matrix problem basic to Fisher's Linear Discriminant Analysis:
   If R(x) = xᵀMx / xᵀSx and S = uuᵀ, what vector x minimizes R(x)?
I.11  Norms of Vectors and Functions and Matrices

The norm of a nonzero vector v is a positive number ‖v‖. That number measures the "length" of the vector. There are many useful measures of length (many different norms). Every norm for vectors or functions or matrices must share these two properties of the absolute value |c| of a number:

Multiply v by c (Rescaling)           ‖cv‖ = |c| ‖v‖   All norms     (1)
Add v to w (Triangle inequality)      ‖v + w‖ ≤ ‖v‖ + ‖w‖            (2)
We start with three special norms, by far the most important. They are the ℓ² norm and ℓ¹ norm and ℓ∞ norm of the vector v = (v₁, ..., vₙ). The vector v is in Rⁿ (real vᵢ) or in Cⁿ (complex vᵢ):

ℓ² norm = Euclidean norm     ‖v‖₂ = √(|v₁|² + ··· + |vₙ|²)
ℓ¹ norm = 1-norm             ‖v‖₁ = |v₁| + |v₂| + ··· + |vₙ|
ℓ∞ norm = max norm           ‖v‖∞ = maximum of |v₁|, ..., |vₙ|

The all-ones vector v = (1, 1, ..., 1) has norms ‖v‖₂ = √n and ‖v‖₁ = n and ‖v‖∞ = 1. These three norms are the particular cases p = 2 and p = 1 and p = ∞ of the ℓᵖ norm ‖v‖ₚ = (|v₁|ᵖ + ··· + |vₙ|ᵖ)^{1/p}. This figure shows vectors with norm 1: p = 1/2 is illegal.
Figure I.15: The important vector norms ‖v‖₁, ‖v‖₂, ‖v‖∞ and a failure (p = 0 fails too). The panels show the unit balls: the ℓ¹ diamond |v₁| + |v₂| ≤ 1, the ℓ² circle v₁² + v₂² ≤ 1, the ℓ∞ square |v₁| ≤ 1 and |v₂| ≤ 1, and the ℓ^{1/2} set √|v₁| + √|v₂| ≤ 1 which is not convex.
The failure for p = 1/2 is in the triangle inequality: (1, 0) and (0, 1) have norm 1, but their sum (1, 1) has norm 2^{1/p} = 4. Only 1 ≤ p ≤ ∞ produces an acceptable norm ‖v‖ₚ.
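These norms are one line in NumPy (a sketch; numpy.linalg.norm is assumed), and the p = 1/2 failure above is easy to reproduce:

```python
# The l1, l2, linf norms of the all-ones vector, and the p = 1/2 failure.
import numpy as np

v = np.ones(9)
print(np.linalg.norm(v, 1), np.linalg.norm(v, 2), np.linalg.norm(v, np.inf))
# 9.0 (= n), 3.0 (= sqrt(n)), 1.0

def lp(v, p):                                       # the l^p formula for any p > 0
    return (np.abs(v) ** p).sum() ** (1 / p)

e1, e2 = np.array([1., 0.]), np.array([0., 1.])
print(lp(e1, 0.5), lp(e2, 0.5), lp(e1 + e2, 0.5))   # 1.0, 1.0, 4.0 : triangle fails
```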
The Minimum of ‖v‖ₚ on the Line a₁v₁ + a₂v₂ = 1
Which point on a diagonal line like 3v₁ + 4v₂ = 1 is closest to (0, 0)? The answer (and the meaning of "closest") will depend on the norm. This is another way to see important differences between ℓ¹ and ℓ² and ℓ∞. We will see a first example of a very special feature: Minimization in ℓ¹ produces sparse solutions.

To see the closest point to (0, 0), expand the ℓ¹ diamond and ℓ² circle and ℓ∞ square until they touch the diagonal line. For each p, that touching point v* will solve our optimization problem:

Minimize ‖v‖ₚ among vectors (v₁, v₂) on the line 3v₁ + 4v₂ = 1

The ℓ¹ winner (0, 1/4) has ‖v*‖₁ = 1/4. The ℓ² winner (3/25, 4/25) has ‖v*‖₂ = 1/5.

Figure I.16: The solutions v* to the ℓ¹ and ℓ² and ℓ∞ minimizations. The first is sparse.
The first figure displays a highly important property of the minimizing solution to the ℓ¹ problem: That solution v* has zero components. The vector v* is "sparse". This is because a diamond touches a line at a sharp point. The line (or hyperplane in high dimensions) contains the vectors that solve the m constraints Av = b. The surface of the diamond contains vectors with the same ℓ¹ norm. The diamond expands to meet the line at a corner of the diamond! The Problem Set and also Section III.4 will return to this "basis pursuit" problem and closely related ℓ¹ problems.

The essential point is that the solutions to those problems are sparse. They have few nonzero components, and those components have meaning. By contrast the least squares solution (using ℓ²) has many small and non-interesting components. By squaring, those components become very small and hardly affect the ℓ² distance.
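The ℓ¹ minimization on the line 3v₁ + 4v₂ = 1 can be checked as a small linear program (a sketch assuming SciPy; the split v = p − q with p, q ≥ 0 is a standard way to make |v₁| + |v₂| linear):

```python
# The sparse l1 minimizer (0, 1/4) versus the l2 minimizer (3/25, 4/25): a sketch.
import numpy as np
from scipy.optimize import linprog

# Variables are [p1, p2, q1, q2] with v = p - q, p >= 0, q >= 0.
c = np.ones(4)                          # minimize p1 + p2 + q1 + q2 = |v1| + |v2|
A_eq = np.array([[3., 4., -3., -4.]])   # the constraint 3 v1 + 4 v2 = 1
b_eq = np.array([1.])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4)
v_l1 = res.x[:2] - res.x[2:]
print(v_l1)                             # [0.  0.25] : sparse

v_l2 = np.linalg.lstsq(np.array([[3., 4.]]), np.array([1.]), rcond=None)[0]
print(v_l2)                             # [0.12 0.16] = (3/25, 4/25) : not sparse
```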
One final observation: The "ℓ⁰ norm" of a vector v counts the number of nonzero components. But this is not a true norm. The points with ‖v‖₀ = 1 lie on the x axis or y axis, with one nonzero component only. The figure for p = 1/2 on the previous page becomes even more extreme: just a cross or a skeleton along the two axes.

Of course this skeleton is not at all convex. The "zero norm" violates the fundamental requirement that ‖2v‖ = 2‖v‖. In fact ‖2v‖₀ = ‖v‖₀ = number of nonzeros in v. The wonderful observation is that we can find the sparsest solution to Av = b by using the ℓ¹ norm. We have "convexified" that ℓ⁰ skeleton along the two axes. We filled in the skeleton, and the result is the ℓ¹ diamond.
Inner Products and Angles

The ℓ² norm has a special place. When we write ‖v‖ with no subscript, this is the norm we mean. It connects to the ordinary geometry of inner products (v, w) = vᵀw and angles θ between vectors:

Inner product = length squared        vᵀv = ‖v‖²                      (3)
Angle θ between vectors v and w       cos θ = vᵀw / (‖v‖ ‖w‖)         (4)

Then v is orthogonal to w when θ = 90° and cos θ = 0 and vᵀw = 0. Those connections (3) and (4) lead to the most important inequalities in mathematics:

Cauchy–Schwarz  |vᵀw| ≤ ‖v‖ ‖w‖        Triangle Inequality  ‖v + w‖ ≤ ‖v‖ + ‖w‖
The Problem Set includes a direct proof of Cauchy–Schwarz. Here in the text we connect it to equation (4) for the cosine: |cos θ| ≤ 1 means that |vᵀw| ≤ ‖v‖ ‖w‖. And this in turn leads to the triangle inequality in equation (2), connecting the sides v, w, and v + w of an ordinary triangle in n dimensions:

Equality       ‖v + w‖² = vᵀv + 2vᵀw + wᵀw
Inequality     ‖v + w‖² ≤ ‖v‖² + 2‖v‖ ‖w‖ + ‖w‖² = (‖v‖ + ‖w‖)²        (5)

This confirms our intuition: Any side length in a triangle is less than the sum of the other two side lengths: ‖v + w‖ ≤ ‖v‖ + ‖w‖. Equality in the ℓ² norm is only possible when the triangle is totally flat and all angles have |cos θ| = 1.
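Both inequalities are easy to test numerically (a sketch with NumPy and random vectors):

```python
# Cauchy-Schwarz and the triangle inequality for random vectors in R^10 (a sketch).
import numpy as np

rng = np.random.default_rng(4)
v, w = rng.standard_normal(10), rng.standard_normal(10)

print(abs(v @ w) <= np.linalg.norm(v) * np.linalg.norm(w))              # True
print(np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w))   # True

cos_theta = (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(abs(cos_theta) <= 1)                                              # True
```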
Inner Products and S-Norms

A final question about vector norms. Is ℓ² the only norm connected to inner products (dot products) and to angles? There are no dot products for ℓ¹ and ℓ∞. But we can find other inner products that match other norms:

Choose any symmetric positive definite matrix S.
‖v‖²_S = vᵀSv gives a norm for v in Rⁿ (called the S-norm)              (6)
(v, w)_S = vᵀSw gives the S-inner product for v, w in Rⁿ                (7)

The inner product (v, v)_S agrees with ‖v‖²_S. We have angles from (4). We have inequalities from (5). The proof is in (5) when every norm includes the matrix S. We know that every positive definite matrix S can be factored into AᵀA. Then the S-norm and S-inner product for v and w are exactly the standard ℓ² norm and the standard inner product for Av and Aw.
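A sketch of that last statement (NumPy assumed): factor S = AᵀA by Cholesky, and the S-norm and S-inner product become ordinary ℓ² quantities for Av and Aw.

```python
# The S-norm ||v||_S = ||Av||_2 and (v, w)_S = (Av)'(Aw) when S = A'A (a sketch).
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((3, 3))
S = B @ B.T + np.eye(3)                 # a symmetric positive definite S

A = np.linalg.cholesky(S).T             # S = A^T A (A is upper triangular here)
v, w = rng.standard_normal(3), rng.standard_normal(3)

print(v @ S @ v, np.linalg.norm(A @ v) ** 2)   # ||v||_S^2 computed two ways
print(v @ S @ w, (A @ v) @ (A @ w))            # (v, w)_S computed two ways
```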