8.5 Back to Netflix: Matrix Completion


In many practical applications, one would like to recover a matrix from a sample of its entries. For recommendation engines, the best-known example is the Netflix prize. Users are given the opportunity to rate movies, but each user typically rates only very few of them, so only a few scattered entries of the data matrix are observed. Yet one would like to complete the matrix so that Netflix can recommend titles that a particular user is likely to be willing to order. In the Netflix competition, for each of the users under consideration, a part of her/his ratings was provided in the training set. For evaluation, the remaining movies the user had rated were withheld, and the task was to guess her/his actual ratings. The Netflix prize was awarded to the recommendation solution with the highest prediction quality on the test set. The Netflix competition thus constitutes a classical matrix completion problem.

In mathematical terms, the matrix completion problem may be formulated as follows: we again consider a data matrix A ∈ ℝ^{m×n} which we would like to know as precisely as possible. Unfortunately, the only information about A is a sampled set of entries A_ij, (i,j) ∈ Ω, where Ω is a subset of the complete set of m·n index pairs.

Clearly, without making any assumptions about the matrix A, the problem of guessing the missing entries is ill posed.

Now we suppose that the unknown matrix A has low rank. In [CR08], Emmanuel Candès and Benjamin Recht showed that this assumption radically changes the problem, making the search for solutions meaningful. We follow the presentation of [CR08, CT10].

For simplicity, assume that the rank-r matrix A is n × n. Next, we define the orthogonal projection P_Ω : ℝ^{n×n} → ℝ^{n×n} onto the subspace of matrices vanishing outside of Ω as

$$
(P_\Omega(X))_{ij} = \begin{cases} X_{ij}, & (i,j) \in \Omega, \\ 0, & \text{otherwise}, \end{cases} \qquad (8.29)
$$

so that the information about A is given by P_Ω(A). We want to recover the data matrix by solving the optimization problem

$$
\text{minimize } \operatorname{rank}(X) \quad \text{subject to } P_\Omega(X) = P_\Omega(A), \qquad (8.30)
$$

which is, in principle, possible if there is only one low-rank matrix with the given entries. Unfortunately, (8.30) is difficult to solve since rank minimization is in general an NP-hard problem for which no known algorithms are capable of solving instances in practical time for (roughly) n ≥ 10.

Candès and Recht proved in [CR08] that, first, the matrix completion problem is not as ill posed as previously thought and, second, that exact matrix completion is possible by convex programming. To this end, they proposed to replace (8.30) by the nuclear norm problem

$$
\text{minimize } \|X\|_* \quad \text{subject to } P_\Omega(X) = P_\Omega(A), \qquad (8.31)
$$

where the nuclear norm ‖X‖_* of a matrix X is defined as the sum of its singular values:

$$
\|X\|_* := \sum_i s_i(X). \qquad (8.32)
$$
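As a quick numerical illustration of (8.29) and (8.32), the following NumPy sketch (our own, not from the text; names such as `project_omega` are chosen only for illustration) applies the projection P_Ω as an entry-wise mask and evaluates the nuclear norm from the singular values:

```python
import numpy as np

def project_omega(X, mask):
    """P_Omega from (8.29): keep the observed entries, zero out the rest.
    `mask` is a boolean array with True exactly on Omega."""
    return np.where(mask, X, 0.0)

def nuclear_norm(X):
    """Nuclear norm (8.32): the sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

# Tiny example: a rank-1 matrix observed on a random subset Omega.
rng = np.random.default_rng(0)
n, r = 6, 1
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5          # Omega, sampled uniformly at random
A_observed = project_omega(A, mask)      # all that a completion algorithm sees
print(nuclear_norm(A), nuclear_norm(A_observed))
```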

Candès and Recht proved that if Ω is sampled uniformly at random among all subsets of cardinality p and A obeys a low-coherence condition, then with high probability the unique solution to (8.31) is exactly A, provided that the number of samples satisfies

$$
p \geq C\, n^{6/5}\, r \log n. \qquad (8.33)
$$

In [CT10], the estimate (8.33) is further improved toward the limit n r log n.

Why is the transition to formulation (8.31) so important? Whereas the rank function in (8.30) counts the number of nonvanishing singular values, the nuclear norm sums their amplitudes and, in some sense, is to the rank functional what the convex ℓ_1 norm is to the counting ℓ_0 norm in the area of sparse signal recovery. The main point here is that the nuclear norm is a convex function and can be optimized efficiently via semidefinite programming.
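To make the convexity claim concrete, here is a hedged sketch of how (8.31) could be handed to an off-the-shelf convex solver. CVXPY (with an SDP-capable solver such as SCS) is our choice of tool here and is not mentioned in the text; the synthetic data and sampling rate are arbitrary:

```python
import numpy as np
import cvxpy as cp

# Synthetic low-rank ground truth and a uniformly sampled observation mask Omega.
rng = np.random.default_rng(1)
n, r = 20, 2
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = (rng.random((n, n)) < 0.6).astype(float)   # 1 on Omega, 0 elsewhere

# Nuclear-norm heuristic (8.31): minimize ||X||_* s.t. P_Omega(X) = P_Omega(A).
X = cp.Variable((n, n))
problem = cp.Problem(
    cp.Minimize(cp.normNuc(X)),
    [cp.multiply(mask, X) == cp.multiply(mask, A)],
)
problem.solve()

print(np.linalg.norm(X.value - A) / np.linalg.norm(A))  # relative recovery error
```

Under sampling conditions of the kind stated in (8.33), the recovered `X.value` should, with high probability, agree with A up to solver tolerance.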

When the matrix variable X is symmetric and positive semidefinite, the nuclear norm of X is the sum of the (nonnegative) eigenvalues and thus equal to the trace of X. Hence, for a positive semidefinite unknown, (8.31) would simply minimize the trace over the constraint set

$$
\text{minimize } \operatorname{trace}(X) \quad \text{subject to } P_\Omega(X) = P_\Omega(A), \quad X \succeq 0,
$$

which is a semidefinite program. Recall that an n × n matrix A is called positive semidefinite, denoted by A ⪰ 0, if

$$
x^T A x \geq 0
$$

for all vectors x of length n. For an introduction to semidefinite programming, see, e.g., [VB96].

For a general matrix A, which may be neither positive semidefinite nor even symmetric, the nuclear norm heuristic (8.31) can be formulated in terms of semidefinite programming as being equivalent to

$$
\text{minimize } \tfrac{1}{2}\bigl(\operatorname{trace}(W_1) + \operatorname{trace}(W_2)\bigr) \quad \text{subject to } P_\Omega(X) = P_\Omega(A), \quad \begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0 \qquad (8.34)
$$

with additional optimization variables W_1 and W_2. To outline the analogy (strongly simplified; for details, see [RFP10]), we consider the singular value decomposition of X,

$$
X = U S V^T,
$$

and of the block matrix,

$$
\begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} = \begin{bmatrix} U \\ V \end{bmatrix} S \begin{bmatrix} U^T & V^T \end{bmatrix},
$$

leading to W_1 = U S U^T and W_2 = V S V^T. Since the left and right singular vector matrices are unitary, the cyclic property of the trace gives trace(W_1) = trace(S U^T U) = trace(S), and likewise for W_2, so the traces of W_1 and W_2 are both equal to the nuclear norm of X.


By defining the two factor matrices L = U S^{1/2} and R = V S^{1/2}, we easily observe that trace(W_1) + trace(W_2) = ‖L‖_F^2 + ‖R‖_F^2, and we finally arrive at the optimization problem

$$
\text{minimize } \tfrac{1}{2}\bigl(\|L\|_F^2 + \|R\|_F^2\bigr) \quad \text{subject to } P_\Omega(L R^T) = P_\Omega(A). \qquad (8.35)
$$

(Please consult [CR08] for a proof of the equivalence of (8.34) and (8.35).)

Since in most practical applications the data is noisy, we will allow some approximation error on the observed entries, replacing (8.35) by the less rigid formulation

$$
\text{minimize } \tfrac{1}{2}\bigl(\|L\|_F^2 + \|R\|_F^2\bigr) \quad \text{subject to } \bigl\|P_\Omega(L R^T) - P_\Omega(A)\bigr\|_F^2 \leq \sigma \qquad (8.36)
$$

for a small positive σ. Thus, in Lagrangian form, we arrive at the formulation

$$
\text{minimize } \frac{\lambda}{2}\bigl(\|L\|_F^2 + \|R\|_F^2\bigr) + \bigl\|P_\Omega(L R^T) - P_\Omega(A)\bigr\|_F^2, \qquad (8.37)
$$

where λ is the regularization parameter that controls how strongly the noise in the data is suppressed.

Formulation (8.37) is also called maximum-margin matrix factorization (MMMF) and goes back to Nathan Srebro in 2005 [RS05]. Related work was also done by the group of Trevor Hastie [MHT10].

Problem (8.37) can be solved, e.g., by using a simple gradient descent method.
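A minimal sketch of such a gradient descent solver for (8.37) might look as follows. This is our own illustration, not the authors' code; step size, iteration count, and initialization scale are arbitrary choices, and the gradients simply follow from differentiating the objective with respect to L and R:

```python
import numpy as np

def mmmf_gradient_descent(A, mask, rank, lam=0.01, step=0.001, iters=2000, seed=0):
    """Plain gradient descent on (8.37):
    (lambda/2)(||L||_F^2 + ||R||_F^2) + ||P_Omega(L R^T - A)||_F^2.
    `mask` is a 0/1 array marking the observed entries Omega."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = 0.1 * rng.standard_normal((m, rank))
    R = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        E = mask * (L @ R.T - A)          # residual on the observed entries only
        grad_L = lam * L + 2.0 * E @ R    # d/dL of the objective
        grad_R = lam * R + 2.0 * E.T @ L  # d/dR of the objective
        L -= step * grad_L
        R -= step * grad_R
    return L, R
```

In practice one would tune the step size or switch to stochastic updates over the individual observed entries.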

Interestingly, the first SVD solution submitted to Netflix (by Simon Funk [Fun06], in 2006) used exactly this approach and achieved considerable progress.

The final prize-winning solution was a mixture of hundreds of models, with the SVD playing a crucial role.

After all, one may ask: what is the difference between the factorization (8.37) and the truncated SVD (8.14) that we have considered before? This brings us back to the introductory discussion in Sect. 8.1 about unknown values. The answer is that in (8.37), we assume only the entries A_ij, (i,j) ∈ Ω, to be known. In contrast, in our previous formulation, we consider all entries A_ij, (i,j) ∉ Ω, to be zero. The Netflix competition, where all entries of A on a test set (of actual ratings of the users) had to be predicted, is obviously a classic matrix completion problem, and hence (8.37) is certainly the right approach.

In the case of matrix factorization for an actual recommendation task, like that of a user or session matrix in Example 8.1, or even the probability matrix P, the discussion is more complex. On the one hand, we may view all non-visited entries of the matrix as zero, since the user did not show any interest in them; this justifies our previous approach. On the other hand, we may argue that there is too little statistical volume and that the user cannot view all products that he/she is potentially interested in, simply because there are too many of them; this would suggest the matrix completion approach. So there are pros and cons for both assumptions.

Example 8.8 We next repeat the test of Example 8.6 with the factorization according to formulation (8.37). Instead of a gradient descent algorithm, we used an ALS (alternating least squares) algorithm as described in [ZWSP08], which is more robust.
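A strongly simplified sketch of one ALS sweep for the regularized objective (8.37) is given below; it is our own illustration, not the implementation used in the example. Note that the weighted-λ variant of [ZWSP08] additionally scales the regularizer by the number of observed entries per row, which we omit here. `mask` is assumed to be a boolean array marking Ω.

```python
import numpy as np

def als_sweep(A, mask, L, R, lam):
    """One ALS sweep: update L row by row with R fixed, then R with L fixed.
    Each update is a small regularized least-squares (ridge) solve over the
    observed entries of the corresponding row or column only."""
    k = L.shape[1]
    for i in range(A.shape[0]):
        omega_i = mask[i, :]                      # observed columns of row i
        if not omega_i.any():
            continue
        R_i = R[omega_i, :]                       # factors of those columns
        a_i = A[i, omega_i]
        L[i, :] = np.linalg.solve(R_i.T @ R_i + lam * np.eye(k), R_i.T @ a_i)
    for j in range(A.shape[1]):
        omega_j = mask[:, j]                      # observed rows of column j
        if not omega_j.any():
            continue
        L_j = L[omega_j, :]
        a_j = A[omega_j, j]
        R[j, :] = np.linalg.solve(L_j.T @ L_j + lam * np.eye(k), L_j.T @ a_j)
    return L, R
```

Repeating such sweeps until the objective stagnates yields the factors L and R used to predict the unobserved entries via L R^T.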

The results are contained in Table 8.5, whose structure basically corresponds to that of Table 8.3. Instead of the time, we have included the error norm e_Ω, which corresponds to the Frobenius norm e_F but is calculated only on the given entries (i,j) ∈ Ω. Thus, e_Ω is equal to the root-mean-square error (RMSE) multiplied by the square root of the number of entries, √|Ω|. Additionally, we compare different values of the regularization parameter λ.

From Table 8.5 we see that the RMSE declines strongly with increasing rank, and we capture the given probabilities on Ω almost perfectly. A rank of 50–100 is already sufficient to bring the RMSE close to zero, and higher ranks do not improve the prediction rate significantly. Unlike in Table 8.3, where we needed almost full rank to drive the approximation error to zero, here we only have to approximate the given entries on Ω.

In contrast to Table 8.3, the overall error e_F decreases only slowly and remains very high. This is again because we do not approximate the zero values outside Ω. The prediction rate is comparable to that of Table 8.3. This indicates that for the probability matrix P, the approach of considering all non-visited entries to be zero is just as reasonable as assuming them to be unknown. ■

The result of Example 8.8 does not mean that the matrix completion approach is outright useless for the recommendation engine task. In fact, it could be used, e.g., to complete the matrix of transactions or transitions before it is further processed.

Table 8.5 Comparison of prediction qualities and error norms for different regularization parameter values and with variable rank

        λ = 0.1                      λ = 0.01                      λ = 0.001
  k     p1    p3    eF     eΩ        p1    p3     eF     eΩ        p1    p3    eF     eΩ
  2     1.18  2.11  13.22  3.62      0.20  1.22   34.52  3.29      0.02  0.32  55.98  3.28
  5     2.66  4.83  11.78  3.33      0.96  3.66   29.77  2.93      2.05  3.66  41.52  2.89
 50     5.74  9.16   8.13  1.94      5.77  8.21   20.47  0.63      5.41  7.84  27.02  0.55
100     6.13  9.75   7.32  1.80      6.29  9.84   17.62  0.28      6.15  9.12  21.82  0.12
200     6.09  9.86   7.11  1.79      6.32  10.02  16.64  0.28      6.32  9.93  18.06  0.12
