
I.12 Factoring Matrices and Tensors : Positive and Sparse

This section opens a wider view of factorizations for data matrices (and extends to tensors).

Up to now we have given full attention to the SVD. A = UΣVᵀ gave the perfect factors for Principal Component Analysis, perfect until issues of sparseness or nonnegativity or tensor data enter the problem. For many applications these issues are important.

Then we must see the SVD as a first step but not the last step.

Here are factorizations of A and T with new and important properties:

Nonnegative Matrices        min ‖A − UV‖_F²   with U ≥ 0 and V ≥ 0

Sparse and Nonnegative      min ‖A − UV‖_F² + λ‖UV‖_N   with U ≥ 0 and V ≥ 0

CP Tensor Decomposition     min ‖T − Σ_{i=1}^{R} a_i ∘ b_i ∘ c_i‖

We will work with matrices A and then tensors T. A matrix is just a two-way tensor.

To compute a factorization A = UV, we introduce a simple alternating iteration.

Update U with V fixed, then update V with U fixed. Each half step is quick because it is effectively linear (the other factor being fixed). This idea applies to the ordinary SVD, if we include the diagonal matrix Σ with U. The algorithm is simple and often effective.

Section III.4 will do even better.
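Here is a minimal numpy sketch of that alternating iteration for A ≈ UV (no sign constraints yet). The function name, the random initialization, and the fixed iteration count are illustrative choices, not prescribed by the text.

```python
import numpy as np

def alternating_uv(A, r, n_iter=100, seed=0):
    """Alternate least-squares half steps: best V for fixed U, then best U for fixed V."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.standard_normal((m, r))
    V = np.zeros((r, n))
    for _ in range(n_iter):
        V = np.linalg.lstsq(U, A, rcond=None)[0]        # min ||UV - A||_F over V
        U = np.linalg.lstsq(V.T, A.T, rcond=None)[0].T  # min ||UV - A||_F over U
    return U, V

# quick check: an exactly rank-3 matrix is recovered almost perfectly
A = np.random.default_rng(1).random((8, 3)) @ np.random.default_rng(2).random((3, 6))
U, V = alternating_uv(A, r=3)
print(np.linalg.norm(A - U @ V))   # close to zero
```

Each half step can only decrease the Frobenius error, which is why this simple loop is usually effective.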

This UV idea also fits the famous k-means algorithm in Section IV.7 on graphs. The problem is to put n vectors a_1, ..., a_n into r clusters. If a_k is in the cluster around u_j, this fact a_k ≈ u_j is expressed by column k of A ≈ UV. Then column k of V is column j of the r by r identity matrix.
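A small sketch of k-means written in exactly this A ≈ UV form; the helper name and the simple initialization are hypothetical choices for illustration.

```python
import numpy as np

def kmeans_uv(A, r, n_iter=50, seed=0):
    """k-means in the A ~ U V picture: columns of U are the r centroids,
    column k of V is column j of the identity when a_k is assigned to cluster j."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = A[:, rng.choice(n, size=r, replace=False)].astype(float)   # initial centroids
    for _ in range(n_iter):
        dist = ((A[:, None, :] - U[:, :, None]) ** 2).sum(axis=0)  # r x n squared distances
        labels = dist.argmin(axis=0)            # nearest centroid for each column a_k
        V = np.eye(r)[:, labels]                # one-hot assignment matrix (r x n)
        for j in range(r):
            if np.any(labels == j):
                U[:, j] = A[:, labels == j].mean(axis=1)   # recompute centroid j
    return U, V
```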

Nonnegative Matrix Factorization (NMF)

The goal of NMF is to approximate a nonnegative matrix A ≥ 0 by a lower rank product UV of two nonnegative matrices U ≥ 0 and V ≥ 0. The purpose of lower rank is simplicity. The purpose of nonnegativity (no negative entries) is to produce numbers that have a meaning. Features are recognizable with no plus-minus cancellation. A negative weight or volume or count or probability is wrong from the start.

But nonnegativity can be difficult. When A ≥ 0 is symmetric positive definite, we hope for a matrix B ≥ 0 that satisfies BᵀB = A. Very often no such matrix exists.

(The matrix A = Aᵀ with constant diagonals 1 + √5, 2, 0, 0, 2 is a 5 × 5 example.) We are forced to accept the matrix BᵀB (with B ≥ 0) that is closest to A (when A ≥ 0). The question is how to find B. The unsymmetric case, probably not square, looks for U and V.

Lee and Seung focused attention on NMF in a letter to Nature 401 (1999) 788-791.

The basic problem is clear. Sparsity and nonnegativity are very valuable properties. For a sparse vector or matrix, the few nonzeros will have meaning, when 1000 or 100,000 individual numbers cannot be separately understood. And it often happens that numbers are naturally nonnegative. But singular vectors in the SVD almost always have many small components of mixed signs. In practical problems, we must be willing to give up the orthogonality of U and V. Those are beautiful properties, but the Lee-Seung essay urged the value of sparse PCA and no negative numbers.

These new objectives require new factorizations of A.

NMF    Find nonnegative matrices U and V so that A ≈ UV.   (1)
SPCA   Find sparse low rank matrices B and C so that A ≈ BC.   (2)

First we recall the meaning and purpose of a factorization. A = BC expresses every column of A as a combination of columns of B. The coefficients in that combination are in a column of C. So each column a_j of A is the approximation c_{1j}b_1 + ··· + c_{nj}b_n.

A good choice of BC means that this sum is nearly exact.

If B has fewer columns than A, this is linear dimensionality reduction. It is fundamental to compression and feature selection and visualization. In many of those problems it can be assumed that the noise is Gaussian. Then the Frobenius norm ‖A − BC‖_F is a natural measure of the approximation error. Here is an excellent essay describing two important applications, and a recent paper with algorithms and references.

N. Gillis, The Why and How of Nonnegative Matrix Factorization, arXiv: 1401.5226.

L. Xu, B. Yu, and Y. Zhang, An alternating direction and projection algorithm for structure-enforced matrix factorization, Computational Optimization and Applications 68 (2017) 333-362.

Facial Feature Extraction

Each column vector of the data matrix A will represent a face. Its components are the intensities of the pixels in that image, so A ≥ 0. The goal is to find a few "basic faces" in B, so that their combinations come close to the many faces in A. We may hope that a few variations in the geometry of the eyes and nose and mouth will allow a close reconstruction of most faces. The development of eigenfaces by Turk and Pentland finds a set of basic faces, and matrix factorization A ≈ BC is another good way.

Text Mining and Document Classification

Now each column of A represents a document. Each row of A represents a word. A simple construction (not in general the best: it ignores the ordering of words) is a sparse nonnegative matrix. To classify the documents in A, we look for sparse nonnegative factors:

Document a_j ≈ Σ_i (importance c_ij) (topic b_i)   (3)

Since B ≥ 0, each topic vector b_i can be seen as a document. Since C ≥ 0, we are combining but not subtracting those topic documents. Thus NMF identifies topics and classifies the whole set of documents with respect to those topics. Related methods are "latent semantic analysis" and indexing.

Note that NMF is an NP-hard problem, unlike the SVD. Even exact solutions A = BC are not always unique. More than that, the number of topics (columns of B) is unknown.

Optimality Conditions for Nonnegative U and V

Given A ≥ 0, here are the conditions for U ≥ 0 and V ≥ 0 to minimize ‖A − UV‖_F²:

Y = UVVᵀ − AVᵀ ≥ 0   with Y_ij or U_ij = 0 for all i, j
Z = UᵀUV − UᵀA ≥ 0   with Z_ij or V_ij = 0 for all i, j     (4)

Those last conditions already suggest that U and V may turn out to be sparse.

Computing the Factors : Basic Methods

Many algorithms have been suggested to compute U and V and B and C. A central idea is alternating factorization: Hold one factor fixed, and optimize the other factor. Hold that one fixed, optimize the first factor, and repeat. Using the Frobenius norm, each step is a form of least squares. This is a natural approach and it generally gives a good result. But convergence to the best factors is not sure. We may expect further developments in the theory of optimization. And there is a well-established improvement of this method to be presented in Section III.4: Alternating Direction Method of Multipliers.

This ADMM algorithm uses a penalty term and duality to promote convergence.
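As one concrete instance of this alternating idea with the nonnegativity constraints kept, here is a sketch of the widely used Lee-Seung multiplicative updates; the iteration count and the small eps guarding against division by zero are arbitrary choices.

```python
import numpy as np

def nmf_multiplicative(A, r, n_iter=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for A ~ U V with U >= 0, V >= 0.
    Each half step keeps the factors nonnegative and never increases ||A - U V||_F."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.random((m, r)) + eps
    V = rng.random((r, n)) + eps
    for _ in range(n_iter):
        V *= (U.T @ A) / (U.T @ U @ V + eps)    # update V with U fixed
        U *= (A @ V.T) / (U @ V @ V.T + eps)    # update U with V fixed
    return U, V
```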

Sparse Principal Components

Many applications do allow both negative and positive numbers. We are not counting or building actual objects. In finance, we may buy or sell. In other applications the zero point has no intrinsic meaning. Zero temperature is a matter of opinion, between Centigrade and Fahrenheit. Maybe water votes for Centigrade, and super-cold physics resets 0°.

The number of nonzero components is often important. That is the difficulty with the singular vectors u and v in the SVD. They are full of nonzeros, as in least squares. We cannot buy miniature amounts of a giant asset, because of transaction costs. If we learn 500 genes that affect a patient's outcome, we cannot deal with them individually. To be understood and acted on, the number of nonzero decision variables must be under control.

One possibility is to remove the very small components of the u's and v's. But if we want real control, we are better with a direct construction of sparse vectors. A good number of algorithms have been proposed.

H. Zou, T. Hastie, R. Tibshirani, Sparse principal component analysis, J. Computational and Graphical Statistics 15 (2006) 265-286. See https://en.wikipedia.org/wiki/Sparse_PCA

Sparse PCA starts with a data matrix A or a positive (semi)definite sample covariance matrix S. Given S, a natural idea is to include Card(x), the number of nonzero components of x, in a penalty term or a constraint on x:

Maximize xᵀSx − ρ Card(x) over ‖x‖ = 1,   or   Maximize xᵀSx over ‖x‖ = 1 subject to Card(x) ≤ k.   (5)

But the cardinality of x is not the best quantity for optimization algorithms.
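One simple heuristic for the cardinality-constrained form of (5) is truncated power iteration: multiply by S, keep only the k largest entries, renormalize. The sketch below is an illustration only (one of many proposed algorithms); it is not the semidefinite approach described next.

```python
import numpy as np

def sparse_pc(S, k, n_iter=200, seed=0):
    """Truncated power iteration: a sparse unit vector x with at most k nonzeros
    that tries to make x^T S x large (a heuristic, with no optimality guarantee)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(S.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = S @ x
        keep = np.argsort(np.abs(y))[-k:]     # indices of the k largest |entries|
        x = np.zeros_like(y)
        x[keep] = y[keep]                     # zero out everything else
        x /= np.linalg.norm(x)
    return x
```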

Another direction is semidefinite programming, discussed briefly in Section VI.3. The unknown vector x becomes an unknown symmetric matrix X. Inequalities like x ≥ 0 (meaning that every x_i ≥ 0) are replaced by X ≥ 0 (X must be positive semidefinite).

Sparsity is achieved by including an ℓ¹ penalty on the unknown matrix X. Looking ahead to IV.5, that penalty uses the nuclear norm ‖X‖_N: the sum of singular values σ_i.

The connection between ℓ¹ and sparsity was in the figures at the start of Section I.11.

The ℓ¹ minimization had a sparse solution x with one component equal to zero. That zero may have looked accidental or insignificant, for this short vector in R². On the contrary, that zero is the important fact. For matrices, replace the ℓ¹ norm by the nuclear norm ‖X‖_N.

A penalty on ‖x‖₁ or ‖X‖_N produces sparse vectors x and sparse matrices X.

In the end, for sparse vectors x, our algorithm must select the important variables. This is the great property of ℓ¹ optimization. It is the key to the LASSO:

LASSO   Minimize ‖Ax − b‖² + λ Σ_{k=1}^{n} |x_k|   (6)

Finding that minimum efficiently is a triumph of nonlinear optimization. The ADMM and Bregman algorithms are presented and discussed in Section III.4.
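A much simpler relative of those methods is iterative soft-thresholding (ISTA), sketched below for (6); the step size rule and the iteration count are standard but illustrative choices.

```python
import numpy as np

def soft_threshold(z, tau):
    """The shrinkage operator: the proximal step for the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, b, lam, n_iter=2000):
    """Iterative soft-thresholding for min ||Ax - b||^2 + lam * sum |x_k|."""
    L = 2 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    t = 1.0 / L                                 # safe step size
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)            # gradient of the smooth part
        x = soft_threshold(x - t * grad, t * lam)   # shrink toward zero: sparsity
    return x
```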

One note about LASSO: The optimal x* will not have more nonzero components than the number of samples. Adding an ℓ² penalty produces an "elastic net" without this disadvantage. This ℓ¹ + ridge regression can be solved as quickly as least squares.

Elastic net   Minimize ‖Ax − b‖² + λ‖x‖₁ + β‖x‖²   (7)

Section III.4 will present the ADMM algorithm that splits ℓ¹ from ℓ². And it adds a penalty using Lagrange multipliers and duality. That combination is powerful.
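For a quick experiment, scikit-learn's Lasso and ElasticNet solve problems of the form (6) and (7) up to a different scaling of the penalties (its alpha and l1_ratio parameters correspond to λ and β only after rescaling), so treat this as a sketch rather than an exact match.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))            # 50 samples, 200 variables
x_true = np.zeros(200)
x_true[:5] = 1.0                              # only five true nonzeros
b = A @ x_true + 0.01 * rng.standard_normal(50)

lasso = Lasso(alpha=0.05).fit(A, b)           # never more than 50 nonzeros here
enet = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(A, b)
print(np.count_nonzero(lasso.coef_), np.count_nonzero(enet.coef_))
```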

1. R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society, Series B 58 (1996) 267-288.

2. H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B 67 (2005) 301-320.

Tensors

A column vector is a 1-way tensor. A matrix is a 2-way tensor. Then a 3-way tensor T has elements T_ijk with three indices: row number, column number, and "tube number".

Slices of T are two-dimensional sections, so the 3-way tensor below has three horizontal slices and four lateral slices and two frontal slices. The rows and columns and tubes are the fibers of T, with only one varying index.

We can stack m by n matrices (p of them) into a 3-way tensor. And we can stack m by n by p tensors into a 4-way tensor = 4-way array.

[Figure: a vector x in R³, a matrix A in R^{3×4}, and a 3-way tensor T in R^{3×4×2}.]

Example 1 : A Color Image is a Tensor with 3 Slices

A black and white image is just a matrix of pixels. The numbers in the matrix are the grayscales of each pixel. Normally those numbers are between zero (black) and 255 (white). Every entry in A has 2⁸ = 256 possible grayscales.

A color image is a tensor. It has 3 slices corresponding to red-green-blue. Each slice shows the density of one of the colors RGB. Processing this tensor T (for example in deep learning: Section VII.2) is not more difficult than a black-white image.

Example 2 : The Derivative ∂w/∂A of w = Av

This is a tensor that we didn't see coming. The m × n matrix A contains "weights" to be optimized in deep learning. That matrix multiplies a vector v to produce w = Av.

Then the algorithm to optimize A (Sections VI.4 to VII.3) involves the derivative of each output w_i with respect to each weight A_jk. So we have three indices i, j, k.

In matrix multiplication, we know that row j of A has no effect on the component w_i of w = Av when i ≠ j.

So the derivative formula includes the symbol δ_ij, which is 1 if i = j and 0 otherwise. In proper tensor notation that symbol becomes δ^i_j (our authority on tensors is Pavel Grinfeld).

The derivatives of the linear function w = Av with respect to the weights A_jk are in T:

T_ijk = ∂w_i/∂A_jk = δ_ij v_k,   as in T_111 = v_1 and T_122 = 0   (8)
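A short numpy check of formula (8); the sizes and the sample vector v are made up for illustration.

```python
import numpy as np

m, n = 3, 2
v = np.array([5.0, 7.0])
A = np.random.default_rng(0).standard_normal((m, n))

# T[i, j, k] = d(Av)_i / dA_jk = delta_ij * v_k, as in (8)
T = np.einsum('ij,k->ijk', np.eye(m), v)       # shape (m, m, n)
print(T[0, 0, 0], T[0, 1, 1])                  # T_111 = v_1 = 5.0 and T_122 = 0

# check one column of T against a finite difference of w = Av
eps = 1e-6
E = np.zeros((m, n))
E[0, 1] = eps                                  # perturb the single weight A_12
print(((A + E) @ v - A @ v) / eps, T[:, 0, 1]) # both are (v_2, 0, 0) = (7, 0, 0)
```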

Section VII.3 has a 2 × 2 × 2 example. This tensor T_ijk = v_k δ_ij is of particular interest:

1. The slices k = constant are multiples v_k of the identity matrix.

2. The key function of deep learning connects each layer of a neural net to the next layer.

If one layer contains a vector v, the next layer contains the vector w = (Av + b)₊.

A is a matrix of "weights". We optimize those weights to match the training data.

So the derivatives of the loss function L will be zero for the optimal weights.

Using the chain rule from calculus, the derivatives of L are found by multiplying the derivatives ∂w/∂A from each layer to the next layer. That is a linear step to Av + b, followed by the nonlinear ReLU function that sets all negative components to zero.

The derivative of the linear step is our 3-way tensor v_k δ_ij.

All this will appear again in Section VII.3. But we won't explicitly use tensor calculus.

The idea of backpropagation is to compute all the derivatives of L "automatically".

For every step in computing L, the derivative of that step enters the derivative of L.

Our simple formula (8) for an interesting tensor will get buried in backpropagation.

Example 3 : The Joint Probability Tensor

Suppose we measure age a in years and height h in inches and weight w in pounds.

We put N children into I age groups and J height groups and K weight groups.

So a typical child is in an age group i and a height group j and a weight group k, where the numbers i, j, k are between 1, 1, 1 and I, J, K.

Pick a random child. Suppose the I age groups contain a_1, a_2, ..., a_I children (adding to N children). Then a random child is in age group i with probability a_i/N.

Similarly the J height groups contain h_1, h_2, ..., h_J children and the K weight groups contain w_1, w_2, ..., w_K children. For that random child,

The probability of height group j is h_j/N.   The probability of weight group k is w_k/N.

Now comes our real goal: joint probabilities p_ijk. For each combination i, j, k we count only the children who are in age group i and also height group j and also weight group k. Each child has I times J times K possibilities. (Possibly p_I11 is zero: no oldest children with the lowest height and weight.) Suppose N_ijk children are found in the intersection of age group i and height group j and weight group k:

The joint probability of this age-height-weight combination is p_ijk = N_ijk / N.   (9)

We have I times J times K numbers p_ijk. All those numbers are between 0 and 1.

They fit into a 3D tensor T of joint probabilities. This tensor T has I rows and J columns and K "tubes". The sum of all the entries N_ijk/N is 1.

To appreciate this I by J by K tensor, suppose you add all the numbers p_2jk. You are accounting for all children in age group 2:

Σ_{j=1}^{J} Σ_{k=1}^{K} p_2jk = p_2·· = probability that a child is in age group 2.   (10)

(A dot marks an index that has been summed out.)


We are seeing a 2D slice of the tensor T. Certainly the sum p_1·· + p_2·· + ··· + p_I·· equals 1.

Similarly you could add all the numbers p_2j5. Now you are accounting for all children in age group 2 and weight group 5:

Σ_{j=1}^{J} p_2j5 = p_2·5 = probability of age group 2 and weight group 5.   (11)

These numbers are in a column of T. We could combine columns to make a slice of T.

We could combine slices to produce the whole tensor T :

Σ_{i=1}^{I} Σ_{k=1}^{K} p_i·k = Σ_{i=1}^{I} p_i·· = 1

By measuring three properties we were led to this 3-way tensor T with entries T_ijk.
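A small numpy sketch that builds this joint probability tensor from made-up group labels and checks the total and the marginal sums in (9) and (10):

```python
import numpy as np

rng = np.random.default_rng(0)
N, I, J, K = 1000, 4, 5, 3                      # hypothetical numbers of children and groups
age    = rng.integers(0, I, N)                  # group label of each child (illustrative data)
height = rng.integers(0, J, N)
weight = rng.integers(0, K, N)

p = np.zeros((I, J, K))
np.add.at(p, (age, height, weight), 1.0 / N)    # p[i,j,k] = N_ijk / N, as in (9)

print(p.sum())                                  # 1.0: all joint probabilities add to one
print(p[1].sum())                               # probability of age group 2, as in (10)
print(np.allclose(p.sum(axis=(1, 2)), np.bincount(age, minlength=I) / N))  # equals a_i / N
```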

The Norm and Rank of a Tensor

In general a tensor is a d-way array. In the same way that a matrix entry needs two numbers i, j to identify its position, a d-way tensor needs d numbers. We will concentrate here on d = 3 and 3-way tensors (also called tensors of order 3). After vectors and matrices, d = 3 is the most common and the easiest to understand. The norm of T is like the Frobenius norm of a matrix: Add all T_ijk² to find ‖T‖².

The theory of tensors is still part of linear algebra (or perhaps multilinear algebra).

Just like a matrix, a tensor can have two different roles in science and engineering: ..

1 A tensor can multiply vectors, matrices, or tensors. Then it is a linear operator.' 2 A tensor can contain data. Its entries could give the brightness of pixels in an image.

A color image is 3-way, stacking RGB. A color video will be a 4-way tensor.

The operator tensor could multiply the data tensor-in the same way that a permutation matrix or a reflection matrix or any orthogonal matrix operates on a data matrix.

The analogies are clear but tensors need more indices and they look more complicated.

They are. We could succeed with tensor multiplication (as for matrices, the operations can come in different orders). We will not succeed so well for tensor factorization.

This has been and still is an intense research direction: to capture as much as possible of the matrix factorizations that are so central to linear algebra: LU, QR, QΛQᵀ, UΣVᵀ.

Even the definition and computation of the "rank of a tensor" is not so simple or successful as the rank of a matrix. But rank one tensors = outer products are still the simplest and clearest: They are created from three vectors a, b, c.

3-way tensor T = a ∘ b ∘ c of rank one:   T_ijk = a_i b_j c_k   (12)

This outer product a ∘ b ∘ c is defined by the m + n + p numbers in those three vectors.
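In numpy the outer product (12) is one einsum; the vectors here are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])                   # m = 3
b = np.array([1.0, 0.0, 2.0, 1.0])              # n = 4
c = np.array([2.0, 5.0])                        # p = 2

T = np.einsum('i,j,k->ijk', a, b, c)            # T[i,j,k] = a_i * b_j * c_k
print(T.shape)                                  # (3, 4, 2): built from m + n + p numbers
# for a rank-one tensor the Frobenius-style norm factors into three vector norms
print(np.linalg.norm(T), np.linalg.norm(a) * np.linalg.norm(b) * np.linalg.norm(c))
```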

The rank of a tensor is the smallest number of rank-1 tensors that add to T.

If we add several of these outer products, we have a convenient low rank tensor, even if we don't always know its exact rank. Here is an example to show why.

T = u ∘ u ∘ v + u ∘ v ∘ u + v ∘ u ∘ u   (three rank-1 tensors with ‖u‖ = ‖v‖ = 1)

T seems to have rank 3. But it is the limit of these rank-2 tensors T_n when n → ∞:

T_n = n (u + (1/n) v) ∘ (u + (1/n) v) ∘ (u + (1/n) v) − n u ∘ u ∘ u   (13)

Why could this never happen for matrices ? Because the closest approximation to A by a matrix of rank k is fixed by the Eckart-Young theorem. That best approximation is Ak from the k leading singular vectors in the SVD. The distance from rank 3 to rank 2 is fixed.
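A numerical sketch of the limit in (13), using random unit vectors u and v (any independent pair works):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(4); u /= np.linalg.norm(u)
v = rng.standard_normal(4); v /= np.linalg.norm(v)

outer = lambda x, y, z: np.einsum('i,j,k->ijk', x, y, z)
T = outer(u, u, v) + outer(u, v, u) + outer(v, u, u)     # the rank-3 tensor above

for n in [1, 10, 100, 1000]:
    w = u + v / n
    Tn = n * outer(w, w, w) - n * outer(u, u, u)         # a sum of two rank-one tensors
    print(n, np.linalg.norm(Tn - T))                      # error decreases like 1/n
```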

Unfortunately there seems to be no SVD for general 3-way tensors. But the next paragraphs show that we still try, because for computations we want a good low rank approximation to T. Two options are CP and Tucker.

The CP Decomposition of a Tensor

A fundamental problem in tensor analysis and fast tensor computations is to approximate a given tensor T by a sum of rank one tensors : an approximate factorization.

CP Decomposition   T ≈ a_1 ∘ b_1 ∘ c_1 + a_2 ∘ b_2 ∘ c_2 + ··· + a_R ∘ b_R ∘ c_R   (14)

This decomposition has several discoverers : Hitchcock, Carroll, Chang, and Harshman.

It also has unfortunate names like CANDECOMP and PARAFAC. Eventually it became a CP Decomposition of T.

This looks like an extension to tensors of the SVD. But there are important differences.

The vectors a_1, ..., a_R are not orthogonal (the same for b's and c's). We don't have orthogonal invariance (which gave Q_1AQ_2ᵀ the same singular values as A). And the Eckart-Young theorem is not true: we often don't know the rank R tensor closest to T. There are other approaches to tensor decomposition, but so far CP has been the most useful. Kruskal proved that closest rank-one tensors are unique (if they exist).

If we change R, the best a, b, c will change.

So we are faced with an entirely new problem. From the viewpoint of computability, the problem is NP-hard (unsolvable in polynomial time unless it turns out that P = NP, which would surprise almost everyone). Lim and Hillar have proved that many simpler-sounding problems for tensors are also NP-hard. The route of exact computations is closed.

C. Hillar and L.-H. Lim, Most tensor problems are NP-hard, J. ACM 60 (2013) Article 45.

We look for an algorithm that computes the a, b, c vectors in a reasonably efficient way.

A major step in tensor computations is to come close to the best CP decomposition.

A simple idea (alternating least squares) works reasonably well for now. The overall problem is not convex, but the subproblems cycle in improving A then B then C, and each subproblem (for A and B and C) is convex least squares.
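Here is a compact sketch of that alternating least squares (CP-ALS) loop in numpy, written with einsum instead of explicit unfoldings; the rank R is assumed known, and the pseudoinverse guards against singular Gram matrices.

```python
import numpy as np

def cp_als(T, R, n_iter=200, seed=0):
    """Alternating least squares for T[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    for _ in range(n_iter):
        # each line solves one convex least-squares subproblem (its normal equations)
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# build an exact rank-2 tensor and check that the fit is (nearly) recovered
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((5, 2)), rng.random((4, 2)), rng.random((3, 2))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(T, R=2)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))   # small fitting error
```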
