I.9 Principal Components and the Best Low Rank Matrix

The principal components of A are its singular vectors, the columns u_j and v_j of the orthogonal matrices U and V. Principal Component Analysis (PCA) uses the largest σ's connected to the first u's and v's to understand the information in a matrix of data.

We are given a matrix A, and we extract its most important part A_k (largest σ's):

A_k = σ_1 u_1 v_1^T + ··· + σ_k u_k v_k^T   with rank(A_k) = k.

A_k solves a matrix optimization problem, and we start there. The closest rank k matrix to A is A_k. In statistics we are identifying the pieces of A with largest variance. This puts the SVD at the center of data science.

In that world, PCA is "unsupervised" learning. Our only instructor is linear algebra: the SVD tells us to choose A_k. When the learning is supervised, we have a big set of training data. Deep Learning (Section VII.1) constructs a (nonlinear!) function F that correctly classifies most of that training data. Then we apply F to new data, as you will see.

Principal Component Analysis is based on matrix approximation by A_k. The proof that A_k is the best choice was begun by Schmidt (1907). His theorem was written for operators A in function space, and it extends directly to matrices in vector spaces. Eckart and Young gave a new proof in 1936 (using the Frobenius norm for matrices). Then Mirsky found a more general proof in 1955 that allows any norm ||A|| that depends only on the singular values, as in the definitions (2), (3), and (4) below.

Here is that key property of the special rank k matrix A_k = σ_1 u_1 v_1^T + ··· + σ_k u_k v_k^T:

Eckart-Young   If B has rank k then ||A − B|| ≥ ||A − A_k||.   (1)

Three choices for the matrix norm ||A|| have special importance and their own names:

Spectral norm     ||A||_2 = max ||Ax|| / ||x|| = σ_1    (often called the ℓ² norm)   (2)

Frobenius norm    ||A||_F = √(σ_1² + ··· + σ_r²)    ((12) and (13) also define ||A||_F)   (3)

Nuclear norm      ||A||_N = σ_1 + σ_2 + ··· + σ_r    (the trace norm)   (4)

These norms have different values already for the n by n identity matrix:

||I||_2 = 1     ||I||_F = √n     ||I||_N = n.   (5)

Replace I by any orthogonal matrix Q and the norms stay the same (because all σ_i = 1):

||Q||_2 = 1     ||Q||_F = √n     ||Q||_N = n.   (6)

More than this, the spectral and Frobenius and nuclear norms of any matrix stay the same when A is multiplied (on either side) by an orthogonal matrix.

The singular values don't change when U and V change to Q_1 U and Q_2 V. For complex matrices the word unitary replaces orthogonal. Then Q̄^T Q = I. These three norms are unitarily invariant: ||Q_1 A Q_2^T|| = ||A||. Mirsky's proof of the Eckart-Young theorem in equation (1) applies to all unitarily invariant norms: ||A|| is computable from Σ.
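As a quick numerical check (a minimal NumPy sketch, not from the book; the test matrix and the orthogonal factors are random and arbitrary), all three norms can be read off the singular values, and those values do not change under orthogonal multiplications:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)      # decreasing singular values

spectral = sigma[0]                             # ||A||_2 = sigma_1
frobenius = np.sqrt(np.sum(sigma**2))           # ||A||_F
nuclear = np.sum(sigma)                         # ||A||_N

# Multiply by random orthogonal Q1 (5x5) and Q2 (3x3): singular values are unchanged
Q1, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sigma_rotated = np.linalg.svd(Q1 @ A @ Q2.T, compute_uv=False)

print(np.allclose(sigma, sigma_rotated))                 # True: (2), (3), (4) are invariant
print(np.isclose(frobenius, np.linalg.norm(A, 'fro')))   # True: agrees with the entrywise formula
```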

All three norms have ||Q_1 A Q_2^T|| = ||A|| for orthogonal Q_1 and Q_2.   (7)

We now give simpler proofs of (1) for the ℓ² norm and the Frobenius norm.

Eckart-Young Theorem: Best Approximation by A_k

It helps to see what the theorem tells us, before tackling its proof. In this example, A is diagonal and k = 2 :

The rank two matrix closest to A = diag(4, 3, 2, 1) is A_2 = diag(4, 3, 0, 0).

This must be true! You might say that this diagonal matrix is too simple, and not typical.

But the ℓ² norm and Frobenius norm are not changed when the matrix A becomes Q_1 A Q_2 (for any orthogonal Q_1 and Q_2). So this example includes any 4 by 4 matrix with singular values 4, 3, 2, 1. The Eckart-Young Theorem tells us to keep 4 and 3 because they are largest. The error in ℓ² is ||A − A_2|| = 2. The Frobenius norm has ||A − A_2||_F = √5.
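A short NumPy sketch of this example (my own illustration, not from the text): build A = diag(4, 3, 2, 1), keep the two largest singular values, and confirm the errors 2 and √5, before and after an arbitrary orthogonal rotation.

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])
U, s, Vt = np.linalg.svd(A)
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-2 approximation A_2

print(np.linalg.norm(A - A2, 2))                 # 2.0 = sigma_3
print(np.linalg.norm(A - A2, 'fro'))             # 2.236... = sqrt(5)

# The same errors appear for any Q1 A Q2 with orthogonal Q1, Q2
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = Q1 @ A @ Q2
Ub, sb, Vbt = np.linalg.svd(B)
B2 = Ub[:, :k] @ np.diag(sb[:k]) @ Vbt[:k, :]
print(np.linalg.norm(B - B2, 2), np.linalg.norm(B - B2, 'fro'))   # again 2 and sqrt(5)
```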

The awkward part of the problem is "rank two matrices". That set is not convex.

The average of A_2 and B_2 (both rank 2) can easily have rank 4. Is it possible that this B_2 could be closer to A than A_2?

Could this B_2 be a better rank 2 approximation to A?

B_2 = [ 3.5  3.5   0    0  ]
      [ 3.5  3.5   0    0  ]
      [  0    0   1.5  1.5 ]
      [  0    0   1.5  1.5 ]

The errors in A − B_2 are only 0.5 on the main diagonal, where A − A_2 has errors 2 and 1. Of course the errors 3.5 and 1.5 off the diagonal will be too big. But maybe there is another choice that is better than A_2?

No, A_2 is best with rank k = 2. We prove this for the ℓ² norm and then for Frobenius.

Eckart-Young   If rank(B) ≤ k then ||A − B|| = max_{x ≠ 0} ||(A − B)x|| / ||x|| ≥ σ_{k+1}.   (8)

We know that ||A − A_k|| = σ_{k+1}. The whole proof of ||A − B|| ≥ σ_{k+1} depends on a good choice of the vector x in computing the norm ||A − B||:

Choose x ≠ 0 so that Bx = 0 and x = Σ_{i=1}^{k+1} c_i v_i.   (9)

First, the nullspace of B has dimension ≥ n − k, because B has rank ≤ k. Second, the combinations of v_1 to v_{k+1} produce a subspace of dimension k + 1. Those two subspaces must intersect! When dimensions add to (n − k) + (k + 1) = n + 1, the subspaces must share a line (at least). Think of two planes through (0, 0, 0) in R³: they share a line since 2 + 2 > 3. Choose a nonzero vector x on this line.

Use that x to estimate the norm of A − B in (8). Remember Bx = 0 and Av_i = σ_i u_i:

||(A − B)x||² = ||Ax||² = ||Σ_{i=1}^{k+1} c_i σ_i u_i||² = Σ_{i=1}^{k+1} c_i² σ_i².   (10)

That sum is at least as large as (Σ c_i²) σ_{k+1}², which is exactly ||x||² σ_{k+1}². Equation (10) proves that ||(A − B)x|| ≥ σ_{k+1} ||x||. This x gives the lower bound we want for ||A − B||:

||(A − B)x|| / ||x|| ≥ σ_{k+1}   means that   ||A − B|| ≥ σ_{k+1} = ||A − A_k||.   Proved!   (11)
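The bound (8) can also be probed numerically. This sketch (my own, with arbitrary random rank-2 matrices B) checks that none of them beats σ_3 = 2 for the diagonal example above:

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])
sigma3 = 2.0
rng = np.random.default_rng(2)

worst = np.inf
for _ in range(10000):
    # A random rank-2 matrix B = C R with C (4x2) and R (2x4)
    B = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 4))
    worst = min(worst, np.linalg.norm(A - B, 2))

print(worst >= sigma3 - 1e-12)   # True: no rank-2 B gets below sigma_3
print(worst)                     # stays above 2 (A_2 itself achieves exactly 2)
```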

The Frobenius Norm

Eckart-Young (Frobenius norm)   Now we go to the Frobenius norm, to show that A_k is the best approximation there too.

It is useful to see three different formulas for this norm. The first formula treats A as a long vector, and takes the usual ℓ² norm of that vector. The second formula notices that the main diagonal of A^T A contains the ℓ² norm (squared) of each column of A.

For example, the 1,1 entry of A^T A is |a_11|² + ··· + |a_m1|² from column 1.

So (12) is the same as (13); we are just taking the numbers |a_ij|² a column at a time.

Then formula (14) for the Frobenius norm uses the eigenvalues σ_i² of A^T A. (The trace is always the sum of the eigenvalues.) Formula (14) also comes directly from the SVD: the Frobenius norm of A = UΣV^T is not affected by U and V, so ||A||_F² = ||Σ||_F².

This is σ_1² + ··· + σ_r².

||A||_F² = |a_11|² + |a_12|² + ··· + |a_mn|²   (every |a_ij|²)   (12)

||A||_F² = trace of A^T A = (A^T A)_11 + ··· + (A^T A)_nn   (13)

||A||_F² = σ_1² + σ_2² + ··· + σ_r²   (14)
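A quick check that (12), (13), and (14) agree, using an arbitrary random matrix (a sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

f12 = np.sum(np.abs(A)**2)                            # (12): sum of |a_ij|^2
f13 = np.trace(A.T @ A)                               # (13): trace of A^T A
f14 = np.sum(np.linalg.svd(A, compute_uv=False)**2)   # (14): sum of sigma_i^2

print(np.allclose([f12, f13], f14))                   # True: all three give ||A||_F^2
```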

Eckart-Young in the Frobenius Norm   For the norm ||A − B||_F, Pete Stewart has found and generously shared this neat proof.

Suppose the matrix B of rank ≤ k is closest to A. We want to prove that B = A_k.

The surprise is that we start with the singular value decomposition of B (not A) :

B = U [ D  0 ; 0  0 ] V^T   where the diagonal matrix D is k by k.   (15)

Those orthogonal matrices U and V from B will not necessarily diagonalize A:

A = U [ L + E + R   F ; G   H ] V^T   (16)

Here L is strictly lower triangular in the first k rows, E is diagonal, and R is strictly upper triangular. Step 1 will show that L, R, and F are all zero by comparing A and B with this matrix C that clearly has rank ≤ k:

C = U [ L + D + R   F ; 0   0 ] V^T   (17)

This is Stewart's key idea, to construct C with zero rows to show its rank. Those orthogonal matrices U and V^T leave the Frobenius norm unchanged. Square all matrix entries and add, noticing that A − C has zeros where A − B has the matrices L, R, F:

||A − B||_F² = ||A − C||_F² + ||L||_F² + ||R||_F² + ||F||_F².   (18)

Since ||A − B||_F² was as small as possible, we learn that L, R, F are zero! Similarly we find G = 0. At this point we know that U^T A V has two blocks and E is diagonal (like D):

U^T A V = [ E  0 ; 0  H ]   and   U^T B V = [ D  0 ; 0  0 ].

If B is closest to A then U^T B V is closest to U^T A V. And now we see the truth.

The matrix D must be the same as E = diag(σ_1, ..., σ_k).

The singular values of H must be the smallest n - k singular values of A.

The smallest error ||A − B||_F must be ||H||_F = √(σ_{k+1}² + ··· + σ_r²) = Eckart-Young.

In the 4 by 4 example starting this section, A_2 is best possible: ||A − A_2||_F = √5.

It is exceptional to have this explicit solution A_k for a non-convex optimization.


Minimizing the Frobenius Distance ||A − B||_F²

Here is a different and more direct approach to prove Eckart-Young: set the derivatives of ||A − B||_F² to zero. Every rank k matrix factors into B = CR = (m × k)(k × n).

By the SVD, we can require k orthogonal columns in C (so C^T C = diagonal matrix D) and k orthonormal rows in R (so R R^T = I). We are aiming for C = U_k Σ_k and R = V_k^T.

Take derivatives of E = ||A − CR||_F² to find the matrices C and R that minimize E:

∂E/∂C = 2(CR − A)R^T = 0       ∂E/∂R^T = 2(R^T C^T − A^T)C = 0   (19)

The first gives A R^T = C R R^T = C. Then the second gives R^T D = A^T C = A^T A R^T.

Since D is diagonal, this means :

The columns of R^T are eigenvectors of A^T A. They are right singular vectors v_i of A.

Similarly the columns of C are eigenvectors of A A^T: A A^T C = A R^T D = C D. Then C contains left singular vectors u_i. Which singular vectors actually minimize the error E?

E is a sum of all the σ² that were not involved in C and R. To minimize E, those should be the smallest singular values of A. That leaves the largest singular values to produce the best B = CR = A_k, with ||A − CR||_F² = σ_{k+1}² + ··· + σ_r². This neat proof is in Nathan Srebro's MIT doctoral thesis: ttic.uchicago.edu/~nati/Publications/thesis.pdf
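As a numerical sanity check of (19) (my own sketch, not taken from the thesis), choose C = U_k Σ_k and R = V_k^T from the SVD and confirm that both gradients vanish and the error equals σ_{k+1}² + ··· + σ_r²:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((7, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

C = U[:, :k] * s[:k]        # C = U_k Sigma_k (columns scaled by sigma_i), so C^T C is diagonal
R = Vt[:k, :]               # R = V_k^T, so R R^T = I

grad_C = 2 * (C @ R - A) @ R.T         # dE/dC from (19)
grad_Rt = 2 * (R.T @ C.T - A.T) @ C    # dE/dR^T from (19)
print(np.allclose(grad_C, 0), np.allclose(grad_Rt, 0))    # True True: stationary point

error = np.linalg.norm(A - C @ R, 'fro')**2
print(np.isclose(error, np.sum(s[k:]**2)))                # True: sigma_{k+1}^2 + ... + sigma_r^2
```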

Principal Component Analysis

Now we start using the SVD. The matrix A is full of data. We have n samples. For each sample we measure m variables (like height and weight). The data matrix A_0 has n columns and m rows. In many applications it is a very large matrix.

The first step is to find the average (the sample mean) along each row of A_0. Subtract that mean from all n entries in the row. Now each row of the centered matrix A has mean zero. The columns of A are n points in R^m. Because of centering, the sum of the n column vectors is zero. So the average column is the zero vector.
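Centering is one line in NumPy. A minimal sketch (the data here is random and only for illustration; the row offsets are arbitrary nonzero means):

```python
import numpy as np

rng = np.random.default_rng(5)
A0 = rng.standard_normal((2, 6)) + np.array([[140.0], [10.0]])   # m = 2 variables, n = 6 samples

row_means = A0.mean(axis=1, keepdims=True)   # sample mean of each row (each variable)
A = A0 - row_means                           # centered data matrix

print(np.allclose(A.sum(axis=1), 0))         # True: every row of A now adds to zero
```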

Often those n points are clustered near a line or a plane or another low-dimensional subspace of R^m. Figure I.11 shows a typical set of data points clustered along a line in R² (after centering A_0 to shift the points left-right and up-down for mean (0, 0) in A).

How will linear algebra find that closest line through (0, 0)? It is in the direction of the first singular vector u_1 of A. This is the key point of PCA!

A is 2 × n (large nullspace)    A A^T is 2 × 2 (small matrix)    A^T A is n × n (large matrix)    Two singular values σ_1 > σ_2 > 0

Figure I.11: Data points (columns of A) are often close to a line in R² or a subspace in R^m.

Let me express the problem (which the SVD solves) first in terms of statistics and then in terms of geometry. After that come the linear algebra and the examples.

The Statistics Behind PCA

The key numbers in probability and statistics are the mean and variance. The "mean" is an average of the data (in each row of A_0). Subtracting those means from each row of A_0 produced the centered A. The crucial quantities are the "variances" and "covariances".

The variances are sums of squares of distances from the mean, along each row of A.

The variances are the diagonal entries of the matrix A A^T.

Suppose the columns of A correspond to a child's age on the x-axis and its height on the y-axis. (Those ages and heights are measured from the average age and height.) We are looking for the straight line that stays closest to the data points in the figure.

And we have to account for the joint age-height distribution of the data.

The covariances are the off-diagonal entries of the matrix A A^T.

Those are dot products (row i of A) · (row j of A). High covariance means that increased height goes with increased age. (Negative covariance means that one variable increases when the other decreases.) Our example has only two rows from age and height: the symmetric matrix A A^T is 2 by 2. As the number n of sample children increases, we divide by n − 1 to give A A^T its statistically correct scale.

The sample covariance matrix is defined by   S = A A^T / (n − 1).

The factor is n − 1 because one degree of freedom has already been used for mean = 0. Here is an example with six ages and heights, already centered to make each row add to zero:

A = [ 3  -4   7   1  -4  -3 ]
    [ 7  -6   8  -1  -1  -7 ]

For this data, the sample covariance matrix S is easily computed. It is positive definite.

Variances and covariances    S = A A^T / (6 − 1) = [ 20  25 ]
                                                   [ 25  40 ]

The two orthogonal eigenvectors of S are u_1 and u_2. Those are the left singular vectors (principal components) of A. The Eckart-Young theorem says that the vector u_1 points along the closest line in Figure I.11. Eigenvectors of S are singular vectors of A.

The second singular vector u_2 will be perpendicular to that closest line.

Important note   PCA can be described using the symmetric S = A A^T/(n − 1) or the rectangular A. No doubt S is the nicer matrix. But given the data in A, computing S would be a computational mistake. For large matrices, a direct SVD of A is faster and more accurate.
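Using the 2 × 6 age-height example above, this sketch (my own) computes S both ways: from A A^T/(n − 1) and from a direct SVD of A, as the note recommends. The eigenvalues of S are σ_i²/(n − 1), and u_1 is the first left singular vector (up to sign):

```python
import numpy as np

A = np.array([[3., -4., 7., 1., -4., -3.],
              [7., -6., 8., -1., -1., -7.]])
n = A.shape[1]

S = A @ A.T / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in increasing order

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(S)                                  # [[20. 25.] [25. 40.]]
print(eigvals[::-1])                      # roughly [57, 3]
print(s**2 / (n - 1))                     # the same numbers, straight from the SVD of A
print(U[:, 0])                            # u_1, roughly (0.6, 0.8) up to sign: the closest line
```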


In the example, S has eigenvalues near 57 and 3. Their sum is 20 + 40 = 60, the trace of S. The first rank one piece √57 u_1 v_1^T is much larger than the second piece √3 u_2 v_2^T.

The leading eigenvector u_1 ≈ (0.6, 0.8) tells us that the closest line in the scatter plot has slope near 8/6. The direction in the graph nearly produces a 6-8-10 right triangle.

I will move now from the algebra of PCA to the geometry. In what sense will the line in the direction of u_1 be the closest line to the centered data?

The Geometry Behind PCA

The best line in Figure I.11 solves a problem in perpendicular least squares. This is also called orthogonal regression. It is different from the standard least squares fit to n data points, or the least squares solution to a linear system Ax = b. That classical problem in Section II.2 minimizes ||Ax − b||². It measures distances up and down to the best line.

Our problem minimizes perpendicular distances. The older problem leads to a linear system A^T A x = A^T b. Our problem leads to eigenvalues σ² and singular vectors u_i (eigenvectors of S). Those are the two sides of linear algebra: not the same side.

The sum of squared distances from the data points to the u_1 line is a minimum.

To see this, separate each column a_j of A into its components along u_1 and u_2:

Σ_{j=1}^{n} ||a_j||² = Σ_{j=1}^{n} |a_j^T u_1|² + Σ_{j=1}^{n} |a_j^T u_2|².   (20)

The sum on the left is fixed by the data. The first sum on the right has terms u_1^T a_j a_j^T u_1. It adds to u_1^T (A A^T) u_1. So when we maximize that sum in PCA by choosing the top eigenvector u_1 of A A^T, we minimize the second sum. That second sum of squared distances from data points to the best line (or best subspace) is the smallest possible.
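A small numerical check of (20) and of the perpendicular-least-squares claim, using the same example data (my own sketch; the random comparison directions are arbitrary):

```python
import numpy as np

A = np.array([[3., -4., 7., 1., -4., -3.],
              [7., -6., 8., -1., -1., -7.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
u1, u2 = U[:, 0], U[:, 1]

left = np.sum(np.linalg.norm(A, axis=0)**2)            # sum of ||a_j||^2
right = np.sum((u1 @ A)**2) + np.sum((u2 @ A)**2)      # the two sums in equation (20)
print(np.isclose(left, right))                         # True

def perp_error(direction):
    """Sum of squared perpendicular distances from the columns of A to the line through 0."""
    proj = np.outer(direction, direction @ A)          # projections of the columns onto the line
    return np.sum((A - proj)**2)

rng = np.random.default_rng(6)
others = [perp_error(q / np.linalg.norm(q)) for q in rng.standard_normal((1000, 2))]
print(perp_error(u1) <= min(others))                   # True: u_1 gives the smallest error
print(np.isclose(perp_error(u1), s[1]**2))             # that minimum equals sigma_2^2
```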

The Linear Algebra Behind PCA

Principal Component Analysis is a way to understand n sample points a_1, ..., a_n in m-dimensional space, the data. That data plot is centered: all rows of A add to zero (A1 = 0). The crucial connection to linear algebra is in the singular values σ_i and the singular vectors u_i of A. Those come from the eigenvalues λ_i = σ_i² and the eigenvectors of the sample covariance matrix S = A A^T/(n − 1).

The total variance in the data comes from the Frobenius norm (squared) of A:

Total variance   T = ||A||_F² / (n − 1) = (||a_1||² + ··· + ||a_n||²) / (n − 1).   (21)

This is the trace of S, the sum down the diagonal. Linear algebra tells us that the trace equals the sum of the eigenvalues σ_i²/(n − 1) of the sample covariance matrix S.

The trace of S connects the total variance to the sum of variances of the principal components u_1, ..., u_r:

Total variance   T = (σ_1² + ··· + σ_r²) / (n − 1).   (22)

Exactly as in equation (20), the first principal component u_1 accounts for (or "explains") a fraction σ_1²/(σ_1² + ··· + σ_r²) of the total variance. The next singular vector u_2 of A explains the next largest fraction σ_2²/(σ_1² + ··· + σ_r²). Each singular vector is doing its best to capture the meaning in a matrix, and together they succeed.

The point of the Eckart-Young Theorem is that k singular vectors (acting together) explain more of the data than any other set of k vectors. So we are justified in choosing u_1 to u_k as a basis for the k-dimensional subspace closest to the n data points.

The reader understands that our Figure I.11 showed a cluster of data points around a straight line (k = 1) in dimension m = 2. Real problems often have k > 1 and m > 2.

The "effective rank" of A and S is the number of singular values above the point where noise drowns the true signal in the data. Often this point is visible on a "scree plot" showing the dropoff in the singular values σ_i (or their squares σ_i²). Figure I.12 shows the "elbow" in the scree plot where signal ends and noise takes over.

In this example the noise comes from roundoff error in computing singular values of the badly conditioned Hilbert matrix. The dropoff in the true singular values remains very steep. In practice the noise is in the data matrix itself: errors in the measurements of A_0.

Section III.3 of this book studies matrices like H with rapidly decaying σ's.

[Figure: semilog scree plot of the singular values of hilb(40), dropping steeply before leveling off at roundoff noise]

Hilbert matrix   H_ij = 1/(i + j − 1)

Figure I.12: Scree plot of σ_1, ..., σ_39 (σ_40 = 0) for the evil Hilbert matrix, with elbow at the effective rank: r ≈ 17 and σ_r ≈ 10^{-16}.
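The scree plot is easy to reproduce (a sketch; the roundoff-level cutoff used for the effective rank is my own choice):

```python
import numpy as np

# Hilbert matrix H_ij = 1/(i + j - 1), built directly so no extra library is needed
N = 40
idx = np.arange(1, N + 1)
H = 1.0 / (idx[:, None] + idx[None, :] - 1)

sigma = np.linalg.svd(H, compute_uv=False)          # decreasing singular values
# Count the sigma's above a roundoff-level cutoff (my choice of threshold)
effective_rank = int(np.sum(sigma > 1e-16 * sigma[0]))
print(effective_rank)                               # about 17: the elbow in Figure I.12
print(sigma[0], sigma[-1])                          # steep drop from O(1) to roundoff-noise level
```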

One-Zero Matrices and Their Properties

Alex Townsend and the author began a study of matrices with 1's inside a circle and 0's outside. As the matrices get larger, their rank goes up. The graph of singular values approaches a limit, which we can't yet predict. But we understand the rank.


Three shapes are drawn in Figure I.13: square, triangle, quarter circle. Any square of 1's will have rank 1. The triangle has all eigenvalues λ = 1, and its singular values are more interesting. The rank of the quarter circle matrix was our first puzzle, solved below.

[Figure: three 6 × 6 examples: a square of 1's (rank 1), a triangle of 1's (rank N), and a quarter circle of 1's (rank CN?), with a square of side N/√2 inscribed in the quarter circle]

Figure I.13: Square and triangle and quarter circle of 1's in matrices with N = 6.

Reflection of these figures in the x-axis will produce a rectangle and larger triangle and semicircle with side 2N. The ranks will not change because the new rows are copies of the old rows. Then reflection in the y-axis will produce a square and a diamond and a full circle. This time the new columns are copies of the old columns : again the same rank.

From the square and triangle we learn that low rank goes with horizontal-vertical alignment. Diagonals bring high rank, and 45° diagonals bring the highest.

What is the "asymptotic rank" of the quarter circle as the radius N = 6 increases? We are looking for the leading term CN in the rank.

The fourth figure shows a way to compute C. Draw a square of maximum size in the quarter circle. That square submatrix (all 1's) has rank 1. The shape above the square has N − N/√2 rows (about 0.3N) and the shape beside it has N − N/√2 columns. Those rows and those columns are independent. Adding those two numbers produces the leading term in the rank, and it is numerically confirmed:

Rank of quarter circle matrix ≈ (2 − √2) N as N → ∞.
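A numerical check of this count (my own sketch; the exact 0-1 pattern chosen for the quarter circle is an assumption, but the leading term is insensitive to it):

```python
import numpy as np

def quarter_circle(N):
    """0-1 matrix with 1's at grid points (i, j), 1 <= i, j <= N, where i^2 + j^2 <= N^2.
    (This indexing convention is my assumption about the construction.)"""
    idx = np.arange(1, N + 1)
    return (idx[:, None]**2 + idx[None, :]**2 <= N**2).astype(float)

for N in (50, 100, 200, 400):
    A = quarter_circle(N)
    # the computed rank tracks the leading term (2 - sqrt(2)) N
    print(N, np.linalg.matrix_rank(A), round((2 - np.sqrt(2)) * N, 1))
```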

We turn to the (nonzero) singular values of these matrices: trivial for the square, known for the triangle, computable for the quarter circle. For these shapes and others, we have always seen a "singular gap". The singular values don't approach zero. All the σ's stay above some limit L, and we don't know why.

The graphs show σ's for the quarter circle (computed values) and the triangle (exact values). For the triangle of 1's, the inverse matrix just has a diagonal of 1's above a diagonal of −1's. Then σ_i = 1/(2 sin θ_i) for N equally spaced angles θ_i = (2i − 1)π/(4N + 2).

Therefore the gap with no singular values reaches up to σ_min ≈ 1/(2 sin(π/2)) = 1/2. The quarter circle also has σ_min ≈ 1/2. See the student project on math.mit.edu/learningfromdata
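The formula for the triangle is easy to confirm numerically (a sketch for one value of N):

```python
import numpy as np

N = 100
T = np.tril(np.ones((N, N)))                      # triangle of 1's
sigma = np.linalg.svd(T, compute_uv=False)

i = np.arange(1, N + 1)
theta = (2 * i - 1) * np.pi / (4 * N + 2)         # N equally spaced angles
exact = 1.0 / (2.0 * np.sin(theta))               # sigma_i = 1 / (2 sin theta_i)

print(np.allclose(np.sort(sigma), np.sort(exact)))   # True
print(sigma.min())                                   # just above 1/2: the singular gap
```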
