Matrix Factorizations in Data Mining and Beyond


In classical data mining, approaches based on matrix factorization are ubiquitous.

Typically, they arise as mathematical core problems in collaborative filtering (CF). A classical application of CF to recommendation engineering is the prediction of product ratings by users. Unlike classical CF, we shall use sessions instead of users in the following, for reasons of consistency (from a mathematical point of view, this makes no difference). Like RL, CF is behavioristic in the sense that no background information on either users or products is involved.

Instead, we associate with each session a list assigning to each product the rating given by the user. These ratings may be explicit, e.g., users may be prompted to rate each visited product on a scale from 1 to 5, or, more commonly, implicit (as before in this book). As for the latter, one may, for instance, endow each type of customer transaction with a score value, say, 1 for a click, 5 for an "add to cart" or "add to wish list," and 10 for actually buying the product. We consider this list as a signal or, simply, a vector. Inspired by noise reduction and deconvolution techniques in signal processing, most CF approaches are based on the assumption that the data arising in this way are noise-afflicted observations of intrinsically low-dimensional signals generated by some unknown source. How shall we proceed to formalize the situation statistically?
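To make the implicit scoring concrete, here is a minimal Python sketch that maps a session's transactions to such a rating vector. The score values 1, 5, and 10 follow the text; the event representation and function name are hypothetical, and unvisited products default to 0, anticipating the convention adopted below:

```python
import numpy as np

# Implicit scores per transaction type, as suggested in the text.
SCORES = {"click": 1, "add_to_cart": 5, "add_to_wishlist": 5, "buy": 10}

def session_vector(events, n_products):
    """Turn a session's (product_index, transaction_type) pairs into a
    rating vector; for each product the highest observed score is kept,
    and products that were never visited remain at 0."""
    v = np.zeros(n_products)
    for product, kind in events:
        v[product] = max(v[product], SCORES[kind])
    return v

# A session in which product 0 was bought and product 1 was clicked:
print(session_vector([(0, "buy"), (1, "click")], n_products=3))  # [10.  1.  0.]
```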

The decisive obstacle is the mathematical treatment of the unknown values.

Basically, this may be surmounted in two different manners: the ostensibly more sophisticated approach consists in modeling the unknown ratings as hidden variables, which need to be estimated along with the underlying source. Put in terms of signal processing, this gives rise to a problem related to the reconstruction of a partially observed signal. Dealing with hidden variables in statistical models, however, entails some major computational impediments, including intractable integrals and non-convex optimization, making a realtime implementation very difficult. The alternative approach consists in treating all variables as observed by assigning the value 0 to the unknown ratings. Although this may appear somewhat helter-skelter at first glance, it may be rationalized by considering not visiting a product as a transaction corresponding to the lowest possible rating. Assuming, furthermore, that many of the zero entries are due to noise rather than intrinsic, we may put the approach on a sound footing. We will return to this discussion in Sect. 8.5.

Now we consider a matrix of rewards $A \in \mathbb{R}^{n_p \times n_s}$, with $n_s$ being the current number of sessions and $n_p$ the number of different products over all sessions observed so far. Neither the order of sessions nor the order of products within the sessions is taken into account.

Example 8.1 As an example, we consider a web shop with 3 products and 4 sessions, i.e., $n_p = 3$ and $n_s = 4$. The session values are displayed in Table 8.1.

In terms of the reward assignment described above, this means, e.g., that in session 2, product 1 has been clicked, whereas products 2 and 3 have moreover been added to the basket. In session 3, product 1 has been purchased, product 2 has been clicked, and product 3 has been skipped. ■

Mathematically, the matrix factorization problems arising in CF are of the form

$$
\min_{X \in C_1 \subseteq \mathbb{R}^{n_p \times r},\; Y \in C_2 \subseteq \mathbb{R}^{r \times n_s}} f(A, XY). \quad (8.1)
$$

The rank r is usually chosen to be considerably smaller than $n_p$. The function f is referred to as the cost function of the factorization and, more often than not, is chosen to be a metric. It stipulates a notion of the quality of a factorization.

The sets $C_1$, $C_2$ determine the parameter space. In terms of our signal processing metaphor, the factor X characterizes the source, which is restricted to a low-dimensional subspace, and the columns of Y are the intrinsic low-dimensional parameter vectors determining the signals given by the corresponding columns of A.

To put it even more simply, we approximate the matrix A by the product of two smaller matrices X and Y. The cost function stipulates a notion of "closeness," i.e., distance, of two matrices. Since the rank r is typically much smaller than $n_p$ and $n_s$, the representation in terms of X and Y is much more compact than an explicit representation of the entries of A.
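As a concrete instance of (8.1), the following sketch evaluates the cost of a candidate factorization, assuming the Frobenius distance as f (one common metric choice; the text deliberately leaves f general) and unconstrained factors, i.e., $C_1$, $C_2$ being the full matrix spaces:

```python
import numpy as np

def factorization_cost(A, X, Y):
    """The cost f(A, XY) from (8.1), here instantiated with the
    Frobenius distance between A and the rank-r product XY."""
    return np.linalg.norm(A - X @ Y, ord="fro")

rng = np.random.default_rng(0)
n_p, n_s, r = 3, 4, 1
A = rng.random((n_p, n_s))   # reward matrix
X = rng.random((n_p, r))     # source factor
Y = rng.random((r, n_s))     # per-session parameter vectors
print(factorization_cost(A, X, Y))
```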

Table 8.1 Example of a session matrix of a web shop

|           | Session 1 | Session 2 | Session 3 | Session 4 |
|-----------|-----------|-----------|-----------|-----------|
| Product 1 | 0         | 1         | 10        | 5         |
| Product 2 | 1         | 5         | 1         | 1         |
| Product 3 | 0         | 5         | 0         | 1         |
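For later reference, Table 8.1 can be transcribed directly into a reward matrix (a minimal sketch; rows are products, columns are sessions, matching $A \in \mathbb{R}^{n_p \times n_s}$):

```python
import numpy as np

# Session matrix from Table 8.1: rows = products, columns = sessions.
A = np.array([
    [0, 1, 10, 5],   # Product 1
    [1, 5,  1, 1],   # Product 2
    [0, 5,  0, 1],   # Product 3
], dtype=float)

print(A.shape)  # (3, 4), i.e., (n_p, n_s) = (3, 4)
```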


Example 8.2 Let us consider the following factorization for Example 8.1 with $r = 1$:

$$
\underbrace{\begin{pmatrix} 1 \\ 0.2 \\ 0.1 \end{pmatrix}}_{X}\;
\underbrace{\begin{pmatrix} 0.23 & 2.76 & 9.65 & 5.17 \end{pmatrix}}_{Y}
=
\underbrace{\begin{pmatrix} 0.23 & 2.76 & 9.65 & 5.17 \\ 0.05 & 0.55 & 1.93 & 1.03 \\ 0.02 & 0.28 & 0.97 & 0.52 \end{pmatrix}}_{\tilde{A}}
\approx
\underbrace{\begin{pmatrix} 0 & 1 & 10 & 5 \\ 1 & 5 & 1 & 1 \\ 0 & 5 & 0 & 1 \end{pmatrix}}_{A}
$$

While the initial matrix A consists of 12 elements, the factors X and Y taken together contain only 7 elements. Now we have to assess whether our rank-1 approximation XY = Ã is a sufficiently good approximation to A. If not, we may increase the rank r, which, of course, entails an increase in the complexity of the factors.
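To make the assessment concrete, here is a minimal sketch that evaluates the rank-1 factors of Example 8.2 against A, assuming the relative Frobenius error as the quality measure (the text does not fix a particular one). Whether an error of this size is acceptable depends on the application:

```python
import numpy as np

A = np.array([[0, 1, 10, 5],
              [1, 5,  1, 1],
              [0, 5,  0, 1]], dtype=float)
X = np.array([[1.0], [0.2], [0.1]])        # n_p x r with r = 1
Y = np.array([[0.23, 2.76, 9.65, 5.17]])   # r x n_s

A_tilde = X @ Y
rel_err = np.linalg.norm(A - A_tilde, "fro") / np.linalg.norm(A, "fro")
print(f"relative Frobenius error: {rel_err:.2f}")  # about 0.52 here
```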

It is obvious, however, that for large $n_p$ and $n_s$ and a moderate rank, the factorized representation is more compact than the explicit one by orders of magnitude. ■

In terms of CF, a commonly encountered intuitive interpretation is as follows: the matrix Y maps the sessions to their virtual profiles, the number of which is given by the rank r, and the matrix X maps profiles to their products of reference.

It is also noteworthy that optimal factors are almost never unique, even if their product is. This is only a minor difficulty since we are eventually interested in the latter and may choose among the corresponding optimal factors arbitrarily.

Please note that the framework stipulated by (8.1) is of profound generality.

It encompasses a vast majority of commonly deployed factorization models.

In particular, we stress that the computational complexity of a factorization (8.1) depends crucially on the choice of f and $C_1$, $C_2$. For example, the factorization model related to PCA, which we shall focus on in what follows, may be reduced to a rather "simple" algebraic problem capable of being solved optimally by algorithms of polynomially bounded complexity. On the other hand, it is possible to state the well-known clustering or vector quantization problem in terms of the above framework. This problem, however, is NP-hard.
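To illustrate why the PCA-related case is computationally benign: assuming the Frobenius distance as cost function and unconstrained factors (an assumption here; the authors' precise model is introduced later), an optimal rank-r factorization is obtained in polynomial time from a truncated singular value decomposition, by the Eckart–Young theorem:

```python
import numpy as np

def best_rank_r_factors(A, r):
    """Optimal rank-r factors of A under the Frobenius norm via
    truncated SVD (Eckart-Young); polynomial-time, unlike clustering."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :r] * s[:r]   # n_p x r: left singular vectors scaled by singular values
    Y = Vt[:r, :]          # r x n_s
    return X, Y

A = np.array([[0, 1, 10, 5],
              [1, 5,  1, 1],
              [0, 5,  0, 1]], dtype=float)
X, Y = best_rank_r_factors(A, r=1)
print(np.linalg.norm(A - X @ Y, "fro"))  # minimal over all rank-1 products
```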

As opposed to the control theoretic varieties discussed in the foregoing chapters, REs based on CF are "naïve" and old-fashioned. Why, you may ask yourself, after so keenly campaigning for the control theoretic approach, do we suddenly address such an unsophisticated and outdated one? The reasons for doing so are as follows:

• In recent research, we have found a way to perform PCA-based CF in a realtime adaptive fashion. Since this book is intended to cover adaptive recommendation systems in general, rather than only the smaller class of control theoretic ones, this fits well into the framework.

• We are currently working on a higher-order (i.e., tensor) generalization of PCA-based CF. This makes it possible to deploy CF in a less behavioristic fashion. It turns out that the adaptive algorithms for the matrix case carry over rather smoothly to the higher-order setting.

• Most importantly, our research in Chap. 10 will focus on a combination of adaptive CF and RL. Specifically, we aim to apply (tensor) CF to the transition probability matrices (or tensors) of Markov decision processes as a means of approximation and regularization, so as to render model-based methods tractable for large state spaces.

In what follows, the above points will be discussed in more detail.
