This part of the book looks at three types of low rank matrices:
1. Matrices that truly have small rank (uv^T is an extreme case, with rank = 1).
2. Matrices that have exponentially decreasing singular values (low effective rank).
3. Incomplete matrices (missing entries) that are completed to low rank matrices.
The first type is not invertible (because rank < n). The second type is invertible in theory but not in practice. The Hilbert matrix with entries 1/(i + j − 1) is a famous example.
How can you recognize that this matrix or another matrix has very low effective rank? The third question (matrix completion) is approached in Section III.5. We create a minimization problem that applies to recommender matrices:
Minimize ||A||_N over all possible choices of the missing entries.
That "nuclear norm" gives a well-posed problem to replace a nonconvex problem: minimizing rank. Nuclear norms are conjectured to be important in gradient descent.
The rank of a matrix corresponds in some deep way to the number of nonzeros in a vector. In that analogy, a low rank matrix is like a sparse vector. Again, the number of nonzeros in x is not a norm! That number is sometimes written as ||x||_0, but this "ℓ⁰ norm" violates the rule ||2x|| = 2||x||. We don't double the number of nonzeros.
It is highly important to find sparse solutions to Ax = b. By a seeming miracle, sparse solutions come by minimizing the ℓ¹ norm ||x||_1 = |x_1| + ··· + |x_n|. This fact has led to a new world of compressed sensing, with applications throughout engineering and medicine (including changes in the machines for Magnetic Resonance Imaging).
Algorithms for ℓ¹ minimization are described and compared in III.4.
Section III.1 opens this chapter with a famous formula for (I − uv^T)^{-1} and (A − uv^T)^{-1}. This is the Sherman-Morrison-Woodbury formula. It shows that the change in the inverse matrix also has rank 1 (if the matrix remains invertible). This formula with its extension to higher rank perturbations (A − UV^T)^{-1} is fundamental.
We also compute the derivatives of A(t)^{-1} and λ(t) and σ(t) when A varies with t.
III.1 Changes in A^{-1} from Changes in A
Suppose we subtract a low rank matrix from A. The next section estimates the change in eigenvalues and the change in singular values. This section finds an exact formula for the change in A^{-1}. The formula is called the matrix inversion lemma by some authors. To others it is better known as the Sherman-Morrison-Woodbury formula.
Those names from engineering and statistics correspond to updating and downdating formulas in numerical analysis.
This formula is the key to updating the solution to a linear system Ax = b. The change could be in an existing row or column of A, or in adding/removing a row or column.
Those would be rank one changes. We start with this simple example, when A = I.
The inverse of M = I − uv^T is   M^{-1} = I + uv^T / (1 − v^T u)   (1)
There are two striking features of this formula. The first is that the correction to M^{-1} is also rank one. That is the final term uv^T / (1 − v^T u) in the formula. The second feature is that this correction term can become infinite. Then M is not invertible: no M^{-1}.
This occurs if the number v^T u happens to be 1. Equation (1) ends with a division by zero. In this case the formula fails. M = I − uv^T is not invertible because Mu = 0:
Mu = (I − uv^T) u = u − u (v^T u) = 0   if v^T u = 1.   (2)
The simplest proof of formula (1) is a direct multiplication of M times M^{-1}:
(I − uv^T)(I + uv^T / (1 − v^T u)) = I − uv^T + (uv^T − u (v^T u) v^T) / (1 − v^T u) = I − uv^T + uv^T = I.   (3)
You see how the number v^T u moves outside the matrix uv^T in that key final step.
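A quick numerical check of formula (1) is easy. This is a minimal sketch in Python with NumPy; the size n and the random vectors u, v are arbitrary choices of mine, not from the text.

import numpy as np

np.random.seed(0)
n = 5
u = np.random.randn(n, 1)
v = np.random.randn(n, 1)
vtu = float(v.T @ u)                  # the number v^T u (must not equal 1)

M = np.eye(n) - u @ v.T
M_inv = np.eye(n) + (u @ v.T) / (1 - vtu)      # formula (1)
print(np.allclose(M @ M_inv, np.eye(n)))       # True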
Now we have shown that formula (1) is correct. But we haven't shown where it came from. One good way is to introduce an "extension" of I to a matrix E with a new row and a new column:
Extended matrix   E = [ I  u ; v^T  1 ]   has determinant D = 1 − v^T u
Elimination gives two ways to find E^{-1}. First, subtract v^T times row 1 of E from row 2:
[ I  0 ; −v^T  1 ] E = [ I  u ; 0  D ].   Then   E^{-1} = [ I  u ; 0  D ]^{-1} [ I  0 ; −v^T  1 ].   (4)
The second way subtracts u times the second row of E from its first row:
[ I  −u ; 0  1 ] E = [ M  0 ; v^T  1 ].   Then   E^{-1} = [ M  0 ; v^T  1 ]^{-1} [ I  −u ; 0  1 ].   (5)
Now compare those two formulas for the same E^{-1}. Problem 2 does the algebra:
Two forms of E^{-1}   [ I + uD^{-1}v^T   −uD^{-1} ; −D^{-1}v^T   D^{-1} ] = [ M^{-1}   −M^{-1}u ; −v^T M^{-1}   1 + v^T M^{-1}u ]   (6)
The 1,1 blocks say that M^{-1} = I + uD^{-1}v^T. This is formula (1), with D = 1 − v^T u.
The Inverse of M = I − UV^T
We can take a big step with no effort. Instead of a perturbation uv^T of rank 1, suppose we have a perturbation UV^T of rank k. The matrix U is n by k and the matrix V^T is k by n.
So we have k columns and k rows exactly as we had one column u and one row v^T.
The formula for M^{-1} stays exactly the same! But there are two identity sizes I_n and I_k:
M^{-1} = (I_n − UV^T)^{-1} = I_n + U (I_k − V^T U)^{-1} V^T   (7)
This brings out an important point about these inverse formulas. We are exchanging an inverse of size n for an inverse of size k. Since k = 1 at the beginning of this section, we had an inverse of size 1 which was just an ordinary division by the number 1 − v^T u.
Now V^T U is (k × n)(n × k). We have a k by k matrix I_k − V^T U to invert, not n by n.
The fast proof of formula (7) is again a direct check that M M^{-1} = I:
(I_n − UV^T)(I_n + U (I_k − V^T U)^{-1} V^T) = I_n − UV^T + (I_n − UV^T) U (I_k − V^T U)^{-1} V^T.
Replace (I_n − UV^T) U in that equation by U (I_k − V^T U). This is a neat identity!
The right side reduces to I_n − UV^T + UV^T which is I_n. This proves formula (7).
Again there is an extended matrix E of size n + k that holds the key:
E = [ I_n  U ; V^T  I_k ]   has determinant det(I_n − UV^T) = det(I_k − V^T U).   (8)
If k << n, the right hand side of (7) is probably easier and faster than a direct attack on the left hand side. The matrix V^T U of size k is smaller than UV^T of size n.
Example 1  What is the inverse of M = I − [ 1 1 1 ; 1 1 1 ; 1 1 1 ]?  In this case u = v = [ 1 ; 1 ; 1 ].
Solution  Here v^T u = 3 and M^{-1} = I + uv^T / (1 − 3). So M^{-1} equals I − (1/2) [ 1 1 1 ; 1 1 1 ; 1 1 1 ].
Example 2  If M = I − [ 0 1 1 ; 0 0 1 ; 0 0 0 ] = I − UV^T then M^{-1} = [ 1 1 2 ; 0 1 1 ; 0 0 1 ].
That came from writing the first displayed matrix as UV^T and reversing to V^T U:
UV^T = [ 1 0 ; 0 1 ; 0 0 ] [ 0 1 1 ; 0 0 1 ]   and   V^T U = [ 0 1 1 ; 0 0 1 ] [ 1 0 ; 0 1 ; 0 0 ] = [ 0 1 ; 0 0 ].
Then M^{-1} above is I_3 + U [ I_2 − V^T U ]^{-1} V^T = I_3 + U [ 1 −1 ; 0 1 ]^{-1} V^T.
The whole point is that the 3 by 3 matrix M^{-1} came from inverting that 2 by 2 matrix.
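Example 2 can be checked in a few lines of NumPy. This sketch uses formula (7): only the 2 by 2 matrix I_2 − V^T U is inverted.

import numpy as np

U  = np.array([[1., 0.], [0., 1.], [0., 0.]])
Vt = np.array([[0., 1., 1.], [0., 0., 1.]])

M = np.eye(3) - U @ Vt
small = np.eye(2) - Vt @ U                         # the 2 by 2 matrix I_2 - V^T U
M_inv = np.eye(3) + U @ np.linalg.inv(small) @ Vt  # formula (7)

print(M_inv)                                       # [[1,1,2],[0,1,1],[0,0,1]]
print(np.allclose(M @ M_inv, np.eye(3)))           # True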
Perturbing any Invertible Matrix A
Up to now we have started with the identity matrix I = I_n. We modified it to I − uv^T and then to I − UV^T. Change by rank 1 and then by rank k. To get the benefit of the full Sherman-Morrison-Woodbury idea, we now go further: Start with A instead of I.
Perturb any invertible A by a rank k matrix UV^T. Now M = A − UV^T.
Sherman-Morrison-Woodbury formula
M^{-1} = (A − UV^T)^{-1} = A^{-1} + A^{-1} U (I − V^T A^{-1} U)^{-1} V^T A^{-1}   (9)
Up to now A was I_n. The final formula (9) still connects to an extension matrix E.
Suppose A is invertible. Then E = [ A  U ; V^T  I ] is invertible when M = A − UV^T is invertible.
To find that inverse of E, we can do row operations to replace V^T by zeros:
Multiply row 1 by V^T A^{-1} and subtract from row 2 to get [ A  U ; 0  I − V^T A^{-1} U ]
Or we can do column operations to replace U by zeros:
Multiply column 1 by A^{-1} U and subtract from column 2 to get [ A  0 ; V^T  I − V^T A^{-1} U ]
As in equation (6), we have two ways to invert E. These two forms of E^{-1} must be equal:
E^{-1} = [ A^{-1} + A^{-1}U C^{-1} V^T A^{-1}   −A^{-1}U C^{-1} ; −C^{-1} V^T A^{-1}   C^{-1} ] = [ M^{-1}   −M^{-1}U ; −V^T M^{-1}   I + V^T M^{-1}U ]   (10)
Here C is I − V^T A^{-1}U and M is A − UV^T. The desired matrix is M^{-1} (the inverse when A is perturbed). Comparing the (1,1) blocks in equation (10) produces equation (9).
Summary  The n by n inverse of M = A − UV^T comes from the n by n inverse of A and the k by k inverse of C = I − V^T A^{-1}U. For a fast proof, multiply (9) by A − UV^T.
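A small NumPy check of formula (9). The matrices here are random choices of mine (with A shifted to be safely invertible); only the k by k matrix C is inverted alongside A.

import numpy as np

np.random.seed(1)
n, k = 6, 2
A = np.random.randn(n, n) + n * np.eye(n)      # safely invertible A
U = np.random.randn(n, k)
V = np.random.randn(n, k)

A_inv = np.linalg.inv(A)
C = np.eye(k) - V.T @ A_inv @ U                # the k by k matrix C
M_inv = A_inv + A_inv @ U @ np.linalg.inv(C) @ V.T @ A_inv    # formula (9)

print(np.allclose(M_inv, np.linalg.inv(A - U @ V.T)))         # True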
This is a good place to collect four closely related matrix identities. In every case, a matrix B or A^T or U on the left reappears on the right, even if it doesn't commute with A or V. As in many proofs, the associative law is hiding in plain sight: B(AB) = (BA)B.
B(I_m + AB) = (I_n + BA)B
B(I_m + AB)^{-1} = (I_n + BA)^{-1}B
A^T(AA^T + λI_m)^{-1} = (A^TA + λI_n)^{-1}A^T
U(I_k − V^TU) = (I_n − UV^T)U
A is m by n and B is n by m. The second identity includes the fact that I + AB is invertible exactly when I + BA is invertible. In other words, −1 is not an eigenvalue of AB exactly when −1 is not an eigenvalue of BA. AB and BA have the same nonzero eigenvalues.
The key as in Section I.6 is that (I + AB)x = 0 leads to (I + BA)Bx = 0.
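That last fact is easy to see numerically. A minimal sketch (the sizes m = 3, n = 5 and the random matrices are my own choices):

import numpy as np

np.random.seed(2)
m, n = 3, 5
A = np.random.randn(m, n)
B = np.random.randn(n, m)

eig_AB = np.sort(np.abs(np.linalg.eigvals(A @ B)))   # m eigenvalues
eig_BA = np.sort(np.abs(np.linalg.eigvals(B @ A)))   # n eigenvalues

print(np.round(eig_AB, 6))   # the nonzero eigenvalues of AB
print(np.round(eig_BA, 6))   # the same values, plus n - m (near) zeros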
The Derivative of A^{-1}
In a moment this section will turn to applications of the inverse formulas. First I turn to matrix calculus! The whole point of derivatives is to find the change in a function f(x) when x is moved very slightly. That change Δx produces a change Δf. Then the ratio of Δf to Δx approaches the derivative df/dx.
Here x is a matrix A. The function is f(A) = A^{-1}. How does A^{-1} change when A changes? Up to now the change uv^T or UV^T was small in rank. Now the desired change in A will be infinitesimally small, of any rank.
I start with the letter B = A + ΔA, and write down this very useful matrix formula:
B^{-1} − A^{-1} = B^{-1} (A − B) A^{-1}   (11)
You see that this equation is true. On the right side A A^{-1} is I and B^{-1} B is I. In fact (11) could lead to the earlier formulas for (A − UV^T)^{-1}. It shows instantly that if A − B has rank 1 (or k), then B^{-1} − A^{-1} has rank 1 (or k). The matrices A and B are assumed invertible, so multiplication by B^{-1} or A^{-1} has no effect on the rank.
Now think of A = A(t) as a matrix that changes with the time t. Its derivative at each time t is dA/dt. Of course A^{-1} is also changing with the time t. We want to find its derivative dA^{-1}/dt. So we divide those changes ΔA = B − A and ΔA^{-1} = B^{-1} − A^{-1} by Δt. Now insert A + ΔA for B in equation (11) and let Δt → 0:
dA^{-1}/dt = −A^{-1} (dA/dt) A^{-1}   (12)
For a 1 by 1 matrix A = t, with dA/dt = 1, we recover the derivative of 1/t as −1/t².
Problem 8 points out that the derivative of A² is not 2A dA/dt!
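Both facts can be tested with a finite difference. A minimal sketch; the particular matrix A(t) below is my own example, not from the text.

import numpy as np

def A(t):
    return np.array([[1.0, t], [t**2, 1.0]])

def dA(t):
    return np.array([[0.0, 1.0], [2*t, 0.0]])

t, h = 0.3, 1e-6
Ainv = np.linalg.inv(A(t))

# Formula (12): d(A^{-1})/dt = -A^{-1} (dA/dt) A^{-1}
formula = -Ainv @ dA(t) @ Ainv
finite  = (np.linalg.inv(A(t + h)) - Ainv) / h
print(np.allclose(formula, finite, atol=1e-4))     # True

# The derivative of A(t)^2 is A dA/dt + (dA/dt) A, not 2A dA/dt
dA2 = A(t) @ dA(t) + dA(t) @ A(t)
dA2_fd = (A(t + h) @ A(t + h) - A(t) @ A(t)) / h
print(np.allclose(dA2, dA2_fd, atol=1e-4))         # True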
Updating Least Squares
Section II.2 discussed the least squares equation A^T A x̂ = A^T b (the "normal equations") to minimize ||b − Ax||². Suppose that a new equation arrives. Then A has a new row r (1 by n) and there is a new measurement b_{m+1} and a new x̂:
[ A ; r ] x̂_new ≈ [ b ; b_{m+1} ]
The matrix in the new normal equations is A^T A + r^T r. This is a rank one correction to the original A^T A. To update x̂, we do not want to create and solve a whole new set of normal equations. Instead we use the update formula:
[ A^T A + r^T r ]^{-1} = (A^T A)^{-1} − c (A^T A)^{-1} r^T r (A^T A)^{-1}   with   c = 1/(1 + r (A^T A)^{-1} r^T)   (14)
To find c quickly we only need to solve the old equation (A^T A) y = r^T.
Problem 4 will produce the least squares solution x̂_new as an update of x̂. The same idea applies when A has M new rows instead of one. This is recursive least squares.
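Here is a sketch of that rank one update in NumPy. The data is random and the variable names are mine; the point is that x̂_new comes from the old (A^T A)^{-1} and one new row r, with no re-solve of the full normal equations.

import numpy as np

np.random.seed(3)
m, n = 20, 4
A = np.random.randn(m, n)
b = np.random.randn(m)

P = np.linalg.inv(A.T @ A)            # old (A^T A)^{-1}
x = P @ (A.T @ b)                     # old least squares solution

r = np.random.randn(1, n)             # new row of A
b_new = np.random.randn()             # new measurement b_{m+1}

c = 1.0 / (1.0 + r @ P @ r.T)         # the number c in formula (14)
P_new = P - c * (P @ r.T) @ (r @ P)   # updated (A^T A + r^T r)^{-1}
x_new = P_new @ (A.T @ b + r.flatten() * b_new)

A_big = np.vstack([A, r])             # check against the enlarged problem
b_big = np.append(b, b_new)
print(np.allclose(x_new, np.linalg.lstsq(A_big, b_big, rcond=None)[0]))   # True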
The Kalman Filter
Kalman noticed that this update idea also applies to dynamic least squares. That word dynamic means that even without new data, the state vector x is changing with time.
If x gives the position of a GPS satellite, that position will move by about Δx = vΔt (v = velocity). This approximation or a better one will be the state equation for x_new at the new time. Then a new measurement b_{m+1} at time t + Δt will further update that approximate position to x̂_new. I hope you see that we are now adding two new equations (state equation and measurement equation) to the original system Ax ≈ b:
The three block rows are the original equations, the state update, and the measurement update:
A_new [ x_old ; x_new ] = [ A  0 ; −I  I ; 0  r ] [ x_old ; x_new ] ≈ [ b ; vΔt ; b_{m+1} ]   (15)
We want the least squares solution of (15). And there is one more twist that makes the Kalman filter formulas truly impressive (or truly complicated). The state equation and the measurement equation have their own covariance matrices. Those equations are inexact (of course). The variance or covariance V measures their different reliabilities.
The normal equations A^T A x̂ = A^T b should properly be weighted by V^{-1} to become A^T V^{-1} A x̂ = A^T V^{-1} b. And in truth V itself has to be updated at each step.
Through all this, Kalman pursued the goal of using update formulas. Instead of solving the full normal equations to learn x̂_new, he updated x̂_old in two steps.
The prediction x̂_state comes from the state equation. Then comes the correction to x̂_new, using the new measurement b_{m+1}:
K = Kalman gain matrix   x̂_new = x̂_state + K (b_{m+1} − r x̂_state)   (16)
The gain matrix K is created from A and r and the covariance matrices V_state and V_b.
You see that if the new b_{m+1} agrees perfectly with the prediction r x̂_state, then there is a zero correction in (16) from x̂_state to x̂_new.
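The formula for K is not derived here, so the following is only a one dimensional sketch of the predict and correct steps in (16). The numbers and noise variances are my own inventions, and the scalar gain below is the standard textbook choice, not a formula taken from this section.

# One scalar Kalman step: the state drifts by v*dt, then one noisy measurement arrives.
x_old, P_old = 2.0, 0.5      # previous estimate and its variance
v, dt = 1.0, 0.1             # state model: x_new is about x_old + v*dt
q, sigma2 = 0.01, 0.2        # state noise variance, measurement noise variance
r = 1.0                      # measurement "matrix" (a scalar here)
b_new = 2.3                  # new measurement b_{m+1}

# Predict from the state equation
x_state = x_old + v * dt
P_state = P_old + q

# Correct with the new measurement, K is the Kalman gain
K = P_state * r / (r * P_state * r + sigma2)
x_new = x_state + K * (b_new - r * x_state)      # equation (16)
P_new = (1 - K * r) * P_state                    # updated reliability (variance)

print(x_new, P_new)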
We also need to update the covariance of the whole system, measuring the reliability of x̂_new. In fact this V is often the most important output. It measures the accuracy of the whole system of sensors that produced x̂_final.
For the GPS application, our text with Kai Borre provides much more detail: Algorithms for Global Positioning (Wellesley-Cambridge Press). The goal is to estimate the accuracy of GPS measurements: very high accuracy for the measurement of tectonic plates, lower accuracy for satellites, and much lower accuracy for the position of your car.
Quasi-Newton Update Methods
A completely different update occurs in approximate Newton methods to solve f(x) = 0.
Those are n equations for n unknowns x_1, ..., x_n. The classical Newton's method uses the Jacobian matrix J(x) containing the first derivatives of each component of f:
Newton   J_{jk} = ∂f_j/∂x_k   and   x_new = x_old − J(x_old)^{-1} f(x_old).   (17)
That is based on the fundamental approximation J Δx = Δf of calculus. Here Δf = f(x_new) − f(x_old) is −f(x_old) because our whole plan is to achieve f(x_new) ≈ 0.
The difficulty is the Jacobian matrix J. For large n, even automatic differentiation (the key to backpropagation in Chapter VII) will be slow. Instead of recomputing J at each iteration, quasi-Newton methods use an update formula J(x_new) = J(x_old) + ΔJ.
In principle ΔJ involves the derivatives of J and therefore second derivatives of f.
The reward is second order accuracy and fast convergence of Newton's method. But the price of computing all second derivatives when n is large (as in deep learning) may be impossibly high.
Quasi-Newton methods create a low rank update to J(x_old) instead of computing an entirely new Jacobian at x_new. The update reflects the new information that comes with computing x_new in (17). Because it is J^{-1} that appears in Newton's method, the update formula accounts for its rank one change to J^{-1}_new without recomputing J^{-1}.
Here is the key, and derivatives of f_1, ..., f_n are not needed:
Quasi-Newton condition   J_new (x_new − x_old) = f_new − f_old   (18)
This is the information J Δx = Δf in the direction we moved. Since equation (17) uses J^{-1} instead of J, we update J^{-1} to satisfy (18). The Sherman-Morrison formula will do this. Or the "BFGS correction" is a rank-2 matrix discovered by four authors at the same time. Another approach is to update the LU or the LDL^T factors of J_old.
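Here is a sketch of one such rank one update (the "good Broyden" update, written directly for the inverse Jacobian via Sherman-Morrison). The test function f and the starting point are my own; the update is chosen so that J^{-1}_new (f_new − f_old) = x_new − x_old, which is condition (18) read from the inverse side.

import numpy as np

def f(x):                          # my example: roots at (1, 1) and (-1, -1)
    return np.array([x[0]**2 - 1.0, x[0]*x[1] - 1.0])

x = np.array([2.0, 0.5])
J_inv = np.eye(2)                  # crude starting guess for J^{-1}

for _ in range(25):
    fx = f(x)
    if np.linalg.norm(fx) < 1e-10:
        break
    x_new = x - J_inv @ fx                      # quasi-Newton step, as in (17)
    dx, df = x_new - x, f(x_new) - fx
    # Rank one (Broyden) update of J^{-1} so that J_inv @ df = dx
    J_inv = J_inv + np.outer(dx - J_inv @ df, dx) @ J_inv / (dx @ J_inv @ df)
    x = x_new

print(x, f(x))                     # lands on the root (-1, -1) with f(x) near 0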
Frequently the original n equations f(x) = 0 come from minimizing a function F(x_1, ..., x_n). Then f = (∂F/∂x_1, ..., ∂F/∂x_n) is the gradient of this function F, and f = 0 at the minimum point. Now the Jacobian matrix J (first derivatives of f) becomes a Hessian matrix H (second derivatives of F). Its entries are H_{jk} = ∂²F/∂x_j ∂x_k.
If all goes well, Newton's method quickly finds the point x* where F is minimized and its derivatives are f(x*) = 0. The quasi-Newton method that updates J approximately instead of recomputing J is far more affordable for large n. For extremely large n (as in many problems of machine learning) the cost may still be excessive.
Problem Set III.1
1  Another approach to (I − uv^T)^{-1} starts with the formula for a geometric series:
(1 − x)^{-1} = 1 + x + x² + x³ + ···. Apply that formula when x = uv^T = matrix:
(I − uv^T)^{-1} = I + uv^T + uv^T uv^T + uv^T uv^T uv^T + ···
               = I + u [ 1 + v^T u + v^T u v^T u + ··· ] v^T.
Take x = v^T u to see I + uv^T / (1 − v^T u). This is exactly equation (1) for (I − uv^T)^{-1}.
2  Find E^{-1} from equation (4) with D = 1 − v^T u and also from equation (5):
From (4)   E^{-1} = [ I  −uD^{-1} ; 0  D^{-1} ] [ I  0 ; −v^T  1 ] = [        ]
From (5)   E^{-1} = [ M^{-1}  0 ; −v^T M^{-1}  1 ] [ I  −u ; 0  1 ] = [        ]
Compare the 1,1 blocks to find M^{-1} = (I − uv^T)^{-1} in formula (1).
3  The final Sherman-Morrison-Woodbury formula (9) perturbs A by UV^T (rank k).
Write down that formula in the important case when k = 1:
Test the formula on this small example:
A=[~~] u=[~] v=[~] A- uv T = [ 2 0 ] O 2
4  Problem 3 found the inverse matrix M^{-1} = (A − uv^T)^{-1}. In solving the equation My = b, we compute only the solution y and not the whole inverse matrix M^{-1}.
You can find y in two easy steps:
Step 1  Solve Ax = b and Az = u. Compute D = 1 − v^T z.
Step 2  Then y = x + (v^T x / D) z is the solution to My = (A − uv^T) y = b.
Verify (A − uv^T) y = b. We solved two equations using A, no equations using M.
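Those two steps are easy to try out. A NumPy sketch (random A, u, v, b of my own choosing; A is shifted to be safely invertible):

import numpy as np

np.random.seed(4)
n = 5
A = np.random.randn(n, n) + n * np.eye(n)
u, v, b = np.random.randn(n), np.random.randn(n), np.random.randn(n)

# Step 1: two solves with A (never with M = A - u v^T)
x = np.linalg.solve(A, b)
z = np.linalg.solve(A, u)
D = 1 - v @ z

# Step 2: combine
y = x + (v @ x / D) * z
print(np.allclose((A - np.outer(u, v)) @ y, b))    # True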
5  Prove that the final formula (9) is correct! Multiply equation (9) by A − UV^T.
Watch for the moment when (A − UV^T) A^{-1} U becomes U (I − V^T A^{-1} U).
6  In the foolish case U = V = I_n, equation (9) gives what formula for (A − I)^{-1}? Can you prove it directly?
7  Problem 4 extends to a rank k change M^{-1} = (A − UV^T)^{-1}. To solve the equation My = b, we compute only the solution y and not the whole inverse matrix M^{-1}.
Step 1  Solve Ax = b and the k equations AZ = U (U and Z are n by k).
Step 2  Form the matrix C = I − V^T Z and solve Cw = V^T x. The desired y = M^{-1} b is y = x + Zw.
Use (9) to verify that (A − UV^T) y = b. We solved k + 1 equations using A and we multiplied V^T Z, but we never used M = A − UV^T.
8  What is the derivative of (A(t))²? The correct derivative is not 2A(t) dA/dt. You must compute (A + ΔA)² and subtract A². Divide by Δt and send Δt to 0.
9  Test formula (12) for the derivative of A^{-1}(t) when
A(t) = [ 1  t² ; 0  1 ]   and   A^{-1}(t) = [ 1  −t² ; 0  1 ].
10  Suppose you know the average x̂_old of b_1, b_2, ..., b_999. When b_1000 arrives, check that the new average is a combination of x̂_old and the mismatch b_1000 − x̂_old:
x̂_new = (b_1 + ··· + b_1000)/1000 = (b_1 + ··· + b_999)/999 + (1/1000) (b_1000 − (b_1 + ··· + b_999)/999).
This is a "Kalman filter" Xnew = Xotd + uioo (b10oo - Xotd) with gain matrix 10100 . • 11 The Kalman filter includes also a state equation Xk+ 1 = Fxk with its own error
variance s2 . The dynamic least squares problem allows x to "drift" as k increases :
With F = 1, divide both sides of those three equations by CT, s, and CT. Find xo and
Xi by least squares, which gives more weight to the recent b1 .
Bill Hager's paper on Updating the Inverse of a Matrix was extremely useful in writing this section of the book: SIAM Review 31 (1989).
III.2 Interlacing Eigenvalues and Low Rank Signals
The previous section found the change in A^{-1} produced by a change in A. We could allow infinitesimal changes dA and also finite changes ΔA = −UV^T. The results were an infinitesimal change or a finite change in the inverse matrix:
dA^{-1} = −A^{-1} (dA) A^{-1}   and   (A − UV^T)^{-1} = A^{-1} + A^{-1} U (I − V^T A^{-1} U)^{-1} V^T A^{-1}.
This section asks the same questions about the eigenvalues and singular values of A.
How do each λ and each σ change as the matrix A changes?
You will see nice formulas for dλ/dt and dσ/dt. But not much is linear about eigenvalues or singular values. Calculus succeeds for infinitesimal changes dλ and dσ, because the derivative is a linear operator. But we can't expect to know exact values in the jumps to λ(A + ΔA) or σ(A + ΔA). Eigenvalues are more complicated than inverses.
Still there is good news. What can be achieved is remarkable. Here is a taste for a symmetric matrix S. Suppose S changes to S + uu^T (a "positive" change of rank 1).
Its eigenvalues change from λ_1 ≥ λ_2 ≥ ··· to z_1 ≥ z_2 ≥ ··· We expect increases in eigenvalues since uu^T was positive semidefinite. But how large are the increases?
Each eigenvalue z_i of S + uu^T is not smaller than λ_i or greater than λ_{i−1}.
So the λ's and z's are "interlaced". Each z_2, ..., z_n is between two λ's:
z_1 ≥ λ_1 ≥ z_2 ≥ λ_2 ≥ ··· ≥ z_n ≥ λ_n.   (2)
We have upper bounds on the eigenvalue changes even if we don't have formulas for Δλ.
There is one point to notice because it could be misunderstood. Suppose the change uu^T in the matrix is C q_2 q_2^T (where q_2 is the second unit eigenvector of S). Then Sq_2 = λ_2 q_2 will see a jump in that eigenvalue to λ_2 + C, because (S + C q_2 q_2^T) q_2 = (λ_2 + C) q_2.
That jump is large if C is large. So how could the second eigenvalue of S + uu^T possibly have z_2 = λ_2 + C ≤ λ_1?
Answer: If C is a big number, then λ_2 + C is not the second eigenvalue of S + uu^T! It becomes z_1, the largest eigenvalue of the new matrix S + C q_2 q_2^T (and its eigenvector is q_2). The original top eigenvalue λ_1 of S is now the second eigenvalue z_2 of the new matrix. So the statement (2) that z_2 ≤ λ_1 ≤ z_1 is the completely true statement (in this example) that z_2 = λ_1 is below z_1 = λ_2 + C.
We will connect this interlacing to the fact that the eigenvectors between λ_1 = λ_max and λ_n = λ_min are all saddle points of the ratio R(x) = x^T S x / x^T x.
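A quick numerical illustration of the interlacing in (2). The symmetric S and the vector u below are random choices of mine; this is a spot check, not a proof.

import numpy as np

np.random.seed(5)
n = 5
X = np.random.randn(n, n)
S = (X + X.T) / 2                                   # a symmetric matrix S
u = np.random.randn(n)

lam = np.sort(np.linalg.eigvalsh(S))[::-1]                     # λ_1 ≥ ... ≥ λ_n
z   = np.sort(np.linalg.eigvalsh(S + np.outer(u, u)))[::-1]    # z_1 ≥ ... ≥ z_n

print(np.round(lam, 3))
print(np.round(z, 3))
# Interlacing (2): z_1 ≥ λ_1 ≥ z_2 ≥ λ_2 ≥ ... ≥ z_n ≥ λ_n
print(np.all(z >= lam), np.all(z[1:] <= lam[:-1]))             # True True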