3.7 Power Method for Singular Value Decomposition


Computing the singular value decomposition is an important branch of numerical analysis in which there have been many sophisticated developments over a long period of time. The reader is referred to numerical analysis texts for more details. Here we present an "in-principle" method to establish that the approximate SVD of a matrix A can be computed in polynomial time. The method we present, called the power method, is simple and is in fact the conceptual starting point for many algorithms. Let A be a matrix whose SVD is $\sum_i \sigma_i u_i v_i^T$. We wish to work with a matrix that is square and symmetric. Let $B = A^T A$. By direct multiplication, using the orthogonality of the $u_i$'s that was proved in Theorem 3.7,

$$B = A^T A = \Bigl(\sum_i \sigma_i v_i u_i^T\Bigr)\Bigl(\sum_j \sigma_j u_j v_j^T\Bigr) = \sum_{i,j} \sigma_i \sigma_j v_i (u_i^T u_j) v_j^T = \sum_i \sigma_i^2 v_i v_i^T.$$

The matrix B is square and symmetric, and has the same left and right singular vectors.

In particular, $Bv_j = \bigl(\sum_i \sigma_i^2 v_i v_i^T\bigr) v_j = \sigma_j^2 v_j$, so $v_j$ is an eigenvector of B with eigenvalue $\sigma_j^2$. If A is itself square and symmetric, it will have the same right and left singular vectors, namely $A = \sum_i \sigma_i v_i v_i^T$, and computing B is unnecessary.
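To make the eigenvector relation concrete, here is a minimal numerical check, not part of the text, that the eigenvalues of $B = A^T A$ are the squared singular values of A (a sketch assuming NumPy; the matrix is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # arbitrary small example matrix

B = A.T @ A
eigvals = np.linalg.eigvalsh(B)                # ascending eigenvalues of B
sigma = np.linalg.svd(A, compute_uv=False)     # descending singular values of A

# The eigenvalues of B match the squared singular values of A,
# up to ordering and floating-point round-off.
assert np.allclose(np.sort(sigma**2), eigvals)
```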

Now consider computing $B^2$:

$$B^2 = \Bigl(\sum_i \sigma_i^2 v_i v_i^T\Bigr)\Bigl(\sum_j \sigma_j^2 v_j v_j^T\Bigr) = \sum_{i,j} \sigma_i^2 \sigma_j^2 v_i (v_i^T v_j) v_j^T.$$

When $i \neq j$, the dot product $v_i^T v_j$ is zero by orthogonality.⁹ Thus, $B^2 = \sum_{i=1}^r \sigma_i^4 v_i v_i^T$. In computing the $k$th power of B, all the cross product terms are zero and

$$B^k = \sum_{i=1}^r \sigma_i^{2k} v_i v_i^T.$$

⁹The "outer product" $v_i v_j^T$ is a matrix and is not zero even for $i \neq j$.

If $\sigma_1 > \sigma_2$, then the first term in the summation dominates, so $B^k \to \sigma_1^{2k} v_1 v_1^T$. This means a close estimate to $v_1$ can be computed by simply taking the first column of $B^k$ and normalizing it to a unit vector.
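The sketch below illustrates this, assuming NumPy; the repeated-squaring schedule and the rescaling step are our implementation choices, not from the text. It forms $B^{64}$ by squaring and reads an estimate of $v_1$ off the first column:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
B = A.T @ A

# Square B repeatedly: after t squarings we hold B^(2^t), which is
# dominated by sigma_1^(2k) v_1 v_1^T once sigma_1 > sigma_2.
Bk = B.copy()
for _ in range(6):                       # Bk = B^64
    Bk = Bk @ Bk
    Bk /= np.linalg.norm(Bk)             # rescale to avoid overflow

v1_est = Bk[:, 0] / np.linalg.norm(Bk[:, 0])   # normalize the first column
v1 = np.linalg.svd(A)[2][0]                    # true first right singular vector
print(abs(v1_est @ v1))                        # close to 1.0 (sign ambiguity)
```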

3.7.1 A Faster Method

A problem with the above method is that A may be a very large, sparse matrix, say a $10^8 \times 10^8$ matrix with $10^9$ nonzero entries. Sparse matrices are often represented by just a list of nonzero entries, say a list of triples of the form $(i, j, a_{ij})$. Though A is sparse, B need not be and in the worst case may have all $10^{16}$ entries nonzero,¹⁰ and it is then impossible to even write down B, let alone compute the product $B^2$. Even if A is moderate in size, computing matrix products is costly in time. Thus, a more efficient method is needed.

¹⁰E.g., suppose each entry in the first row of A is nonzero and the rest of A is zero.

Instead of computing $B^k$, select a random vector x and compute the product $B^k x$.

The vector x can be expressed in terms of the singular vectors of B augmented to a full orthonormal basis as $x = \sum_{i=1}^d c_i v_i$. Then

$$B^k x \approx \bigl(\sigma_1^{2k} v_1 v_1^T\bigr) \sum_{i=1}^d c_i v_i = \sigma_1^{2k} c_1 v_1.$$

Normalizing the resulting vector yields $v_1$, the first singular vector of A. The way $B^k x$ is computed is by a series of matrix-vector products, instead of matrix products: $B^k x = A^T A \cdots A^T A x$, which can be computed right-to-left. This consists of $2k$ vector times sparse matrix multiplications.
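A minimal sketch of this idea, assuming SciPy's sparse matrices (the sizes, density, and iteration count below are illustrative choices, not from the text): $B^k x$ is computed as k alternating products with A and $A^T$, never forming B explicitly.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(2)
# Illustrative sparse matrix; only the nonzero entries are stored.
A = sp.random(10_000, 5_000, density=1e-4, format="csr", random_state=rng)

x = rng.standard_normal(A.shape[1])
k = 30
for _ in range(k):
    x = A.T @ (A @ x)          # right-to-left: two sparse matvecs per step
    x /= np.linalg.norm(x)     # renormalize so the iterate stays finite

v1_est = x                     # estimate of the top right singular vector of A
```

The renormalization each step does not change the direction of $B^k x$, so the final unit vector is the same estimate of $v_1$ while avoiding overflow from the factor $\sigma_1^{2k}$.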

To compute k singular vectors, one selects a random vector r and finds an orthonormal basis for the space spanned by $r, Ar, \ldots, A^{k-1}r$. Then compute A times each of the basis vectors, and find an orthonormal basis for the space spanned by the resulting vectors.

Intuitively, one has applied A to a subspace rather than a single vector. One repeatedly applies A to the subspace, calculating an orthonormal basis after each application to prevent the subspace collapsing to the one-dimensional subspace spanned by the first singular vector. The process quickly converges to the first k singular vectors.
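The sketch below shows the closely related subspace-iteration variant rather than the text's exact Krylov-basis procedure: it repeatedly applies $B = A^T A$ to a random k-dimensional subspace and re-orthonormalizes with QR each round, as the paragraph above describes. The helper name and parameters are ours.

```python
import numpy as np

def top_k_right_singular_vectors(A, k, iters=100, seed=0):
    """Subspace iteration sketch: repeatedly apply B = A^T A to a random
    k-dimensional subspace, re-orthonormalizing with QR after each step so
    the subspace does not collapse onto v_1."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[1], k)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(A.T @ (A @ Q))
    return Q    # columns approximately span v_1, ..., v_k

A = np.random.default_rng(3).standard_normal((200, 80))
Q = top_k_right_singular_vectors(A, k=5)
V5 = np.linalg.svd(A)[2][:5].T    # true top-5 right singular vectors
# Cosines of the principal angles between the two subspaces,
# all of which should be close to 1:
print(np.linalg.svd(Q.T @ V5, compute_uv=False))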

An issue occurs if there is no significant gap between the first and second singular values of a matrix. Take, for example, the case when there is a tie for the first singular vector and $\sigma_1 = \sigma_2$. Then the above argument fails. We will overcome this hurdle.

Theorem 3.11 below states that even with ties, the power method converges to some vector in the span of those singular vectors corresponding to the "nearly highest" singular values. The theorem assumes it is given a vector x which has a component of magnitude at least $\delta$ along the first right singular vector $v_1$ of A. We will see in Lemma 3.12 that a random vector satisfies this condition with fairly high probability.

Theorem 3.11 Let A be an $n \times d$ matrix and x a unit length vector in $\mathbf{R}^d$ with $|x^T v_1| \geq \delta$, where $\delta > 0$. Let V be the space spanned by the right singular vectors of A corresponding to singular values greater than $(1-\varepsilon)\sigma_1$. Let w be the unit vector after $k = \frac{\ln(1/\varepsilon\delta)}{2\varepsilon}$ iterations of the power method, namely,

$$w = \frac{(A^T A)^k x}{\bigl|(A^T A)^k x\bigr|}.$$


Then w has a component of at most $\varepsilon$ perpendicular to V.

Proof: Let

$$A = \sum_{i=1}^r \sigma_i u_i v_i^T$$

be the SVD of A. If the rank of A is less than d, then for convenience complete $\{v_1, v_2, \ldots, v_r\}$ into an orthonormal basis $\{v_1, v_2, \ldots, v_d\}$ of d-space. Write x in the basis of the $v_i$'s as

$$x = \sum_{i=1}^d c_i v_i.$$

Since $(A^T A)^k = \sum_{i=1}^d \sigma_i^{2k} v_i v_i^T$, it follows that $(A^T A)^k x = \sum_{i=1}^d \sigma_i^{2k} c_i v_i$. By hypothesis, $|c_1| \geq \delta$.

Suppose that $\sigma_1, \sigma_2, \ldots, \sigma_m$ are the singular values of A that are greater than or equal to $(1-\varepsilon)\sigma_1$ and that $\sigma_{m+1}, \ldots, \sigma_d$ are the singular values that are less than $(1-\varepsilon)\sigma_1$. Now

$$\bigl|(A^T A)^k x\bigr|^2 = \Bigl|\sum_{i=1}^d \sigma_i^{2k} c_i v_i\Bigr|^2 = \sum_{i=1}^d \sigma_i^{4k} c_i^2 \geq \sigma_1^{4k} c_1^2 \geq \sigma_1^{4k} \delta^2.$$

The component of $|(A^T A)^k x|^2$ perpendicular to the space V is

$$\sum_{i=m+1}^d \sigma_i^{4k} c_i^2 \leq (1-\varepsilon)^{4k} \sigma_1^{4k} \sum_{i=m+1}^d c_i^2 \leq (1-\varepsilon)^{4k} \sigma_1^{4k},$$

since $\sum_{i=1}^d c_i^2 = |x|^2 = 1$. Thus, the component of w perpendicular to V has squared length at most $\frac{(1-\varepsilon)^{4k}\sigma_1^{4k}}{\sigma_1^{4k}\delta^2}$ and so its length is at most

$$\frac{(1-\varepsilon)^{2k}\sigma_1^{2k}}{\delta\,\sigma_1^{2k}} = \frac{(1-\varepsilon)^{2k}}{\delta} \leq \frac{e^{-2k\varepsilon}}{\delta} = \varepsilon,$$

since $k = \frac{\ln(1/\varepsilon\delta)}{2\varepsilon}$.
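A numerical illustration of the guarantee (a sketch with our own construction of A and parameter choices, not from the text): with an exact tie $\sigma_1 = \sigma_2$, the iterate need not converge to $v_1$, but its component perpendicular to $V = \mathrm{span}\{v_1, v_2\}$ falls below $\varepsilon$ after $k = \ln(1/\varepsilon\delta)/(2\varepsilon)$ iterations.

```python
import numpy as np

rng = np.random.default_rng(4)
d, eps = 40, 0.05

# Build A = U diag(sigma) V^T with a tied top pair sigma_1 = sigma_2 = 1
# and all remaining singular values 0.3 < (1 - eps) * sigma_1, so that
# V in the theorem is exactly span{v_1, v_2}.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = np.full(d, 0.3)
sigma[:2] = 1.0
A = U @ np.diag(sigma) @ V.T

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
delta = abs(x @ V[:, 0])                     # component along v_1
k = int(np.ceil(np.log(1 / (eps * delta)) / (2 * eps)))

w = x
for _ in range(k):                           # power method on B = A^T A
    w = A.T @ (A @ w)
    w /= np.linalg.norm(w)

perp = w - V[:, :2] @ (V[:, :2].T @ w)       # part outside span{v_1, v_2}
print(np.linalg.norm(perp) <= eps)           # True
```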

Lemma 3.12 Let $y \in \mathbf{R}^d$ be a random vector with the unit variance spherical Gaussian as its probability density. Normalize y to be a unit length vector by setting $x = y/|y|$. Let v be any unit length vector. Then

$$\operatorname{Prob}\Bigl(|x^T v| \leq \frac{1}{20\sqrt{d}}\Bigr) \leq \frac{1}{10} + 3e^{-d/96}.$$

Proof: Proving for the unit length vector x that $\operatorname{Prob}\bigl(|x^T v| \leq \frac{1}{20\sqrt{d}}\bigr) \leq \frac{1}{10} + 3e^{-d/96}$ is equivalent to proving for the unnormalized vector y that $\operatorname{Prob}(|y| \geq 2\sqrt{d}) \leq 3e^{-d/96}$ and $\operatorname{Prob}\bigl(|y^T v| \leq \frac{1}{10}\bigr) \leq 1/10$. That $\operatorname{Prob}(|y| \geq 2\sqrt{d})$ is at most $3e^{-d/96}$ follows from Theorem 2.9 with $\sqrt{d}$ substituted for $\beta$. That $\operatorname{Prob}\bigl(|y^T v| \leq \frac{1}{10}\bigr)$ is at most $1/10$ follows from the fact that $y^T v$ is a random, zero mean, unit variance Gaussian with density at most $1/\sqrt{2\pi} \leq 1/2$ in the interval $[-1/10, 1/10]$, so the integral of the Gaussian over the interval is at most $1/10$.
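A quick Monte Carlo sanity check of the lemma (illustrative, not part of the text). It exploits the fact that for a spherical Gaussian y and unit v, $y^T v$ is a standard Gaussian while the remainder of $|y|^2$ is an independent chi-squared with $d-1$ degrees of freedom, so no full vectors need to be stored:

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials = 1000, 1_000_000

# y^T v ~ N(0, 1) and |y|^2 - (y^T v)^2 ~ chi-squared with d - 1 dof.
y1 = rng.standard_normal(trials)
rest = rng.chisquare(d - 1, trials)
x_dot_v = y1 / np.sqrt(y1**2 + rest)         # x^T v for x = y / |y|

p_hat = np.mean(np.abs(x_dot_v) <= 1 / (20 * np.sqrt(d)))
bound = 1 / 10 + 3 * np.exp(-d / 96)
print(p_hat, "<=", bound)                    # empirically ~0.04, well below
```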
