Rayleigh Quotient-Type Problems in Machine Learning


2.2.1 Principal Component Analysis

Principal Component Analysis (PCA) considers a given set of zero-mean data $X \in \mathbb{R}^{N \times d}$, where $X = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^d$. The objective is to find a direction vector $w$ on which the variance of the projection $w^T x_i$ is maximized. Since the variance is invariant to the magnitude of $w$, the objective of PCA is equivalently formulated as

\[
\begin{aligned}
\underset{w}{\text{maximize}} \quad & w^T C_{xx} w \qquad (2.18)\\
\text{subject to} \quad & w^T w = 1,
\end{aligned}
\]

where $C_{xx}$ is the sample covariance matrix of $X$. Obviously, (2.18) is a Rayleigh quotient optimization and the optimal $w$ is given by the eigenvector corresponding to the largest eigenvalue of $C_{xx}$.
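As a concrete illustration, the following is a minimal numerical sketch of (2.18), assuming NumPy and synthetic data (the variable names are illustrative, not from the book): the leading eigenvector of the sample covariance matrix gives the direction of maximal projected variance.

```python
import numpy as np

# Toy zero-mean data matrix X of shape (N, d): N samples, d features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                      # center the data

C_xx = X.T @ X / X.shape[0]              # sample covariance matrix

# Rayleigh quotient maximizer: eigenvector of the largest eigenvalue of C_xx.
eigvals, eigvecs = np.linalg.eigh(C_xx)  # eigh since C_xx is symmetric
w = eigvecs[:, -1]                       # eigenvalues are returned in ascending order

projection = X @ w                       # scores along the first principal direction
print(projection.var(), eigvals[-1])     # both equal the maximal variance
```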

2.2.2 Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) finds linear relations between two sets of variables [5]. For two zero-mean data sets $X \in \mathbb{R}^{N \times d_1}$ and $Y \in \mathbb{R}^{N \times d_2}$, the objective is to identify vectors $w_1$ and $w_2$ such that the correlation between the projected variables $w_1^T X$ and $w_2^T Y$ is maximized, given by

\[
\underset{w_1, w_2}{\text{maximize}} \quad \frac{w_1^T C_{xy} w_2}{\sqrt{w_1^T C_{xx} w_1}\,\sqrt{w_2^T C_{yy} w_2}}, \qquad (2.19)
\]

where $C_{xx} = E[XX^T]$, $C_{yy} = E[YY^T]$, $C_{xy} = E[XY^T]$. The problem in (2.19) is usually formulated as the optimization problem, given by

\[
\begin{aligned}
\underset{w_1, w_2}{\text{maximize}} \quad & w_1^T C_{xy} w_2 \qquad (2.20)\\
\text{subject to} \quad & w_1^T C_{xx} w_1 = 1, \quad w_2^T C_{yy} w_2 = 1.
\end{aligned}
\]

Taking the conditions for optimality from the Lagrangian

\[
\mathcal{L}(w_1, w_2; \lambda_1, \lambda_2) = w_1^T C_{xy} w_2 - \lambda_1 (w_1^T C_{xx} w_1 - 1) - \lambda_2 (w_2^T C_{yy} w_2 - 1), \qquad (2.21)
\]

one has

\[
\begin{cases}
C_{xy} w_2 = \lambda_1 C_{xx} w_1\\
C_{yx} w_1 = \lambda_2 C_{yy} w_2
\end{cases}. \qquad (2.22)
\]

Since we have

\[
\begin{cases}
w_1^T C_{xy} w_2 = \lambda_1 w_1^T C_{xx} w_1\\
w_2^T C_{yx} w_1 = \lambda_2 w_2^T C_{yy} w_2\\
w_1^T C_{xy} w_2 = w_2^T C_{yx} w_1\\
w_1^T C_{xx} w_1 = w_2^T C_{yy} w_2
\end{cases}, \qquad (2.23)
\]

we find that $\lambda_1 = \lambda_2 = \lambda$, thus we obtain a generalized eigenvalue problem, given by

\[
\begin{bmatrix}
0 & C_{xy}\\
C_{yx} & 0
\end{bmatrix}
\begin{bmatrix}
w_1\\
w_2
\end{bmatrix}
= \lambda
\begin{bmatrix}
C_{xx} & 0\\
0 & C_{yy}
\end{bmatrix}
\begin{bmatrix}
w_1\\
w_2
\end{bmatrix}. \qquad (2.24)
\]

Analogously, the objective function of CCA can also be rewritten in a generalized Rayleigh quotient form as

\[
\underset{w_1, w_2}{\text{maximize}} \quad
\frac{\begin{bmatrix} w_1\\ w_2 \end{bmatrix}^T \begin{bmatrix} 0 & C_{xy}\\ C_{yx} & 0 \end{bmatrix} \begin{bmatrix} w_1\\ w_2 \end{bmatrix}}
{\begin{bmatrix} w_1\\ w_2 \end{bmatrix}^T \begin{bmatrix} C_{xx} & 0\\ 0 & C_{yy} \end{bmatrix} \begin{bmatrix} w_1\\ w_2 \end{bmatrix}}. \qquad (2.25)
\]
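Along the same lines, here is a small sketch of solving the generalized eigenvalue problem (2.24) numerically, assuming NumPy/SciPy and synthetic zero-mean views (all names are illustrative): the generalized eigenvector with the largest eigenvalue yields the first pair of canonical directions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
N, d1, d2 = 500, 4, 3
X = rng.normal(size=(N, d1)); X -= X.mean(axis=0)   # zero-mean view 1
Y = rng.normal(size=(N, d2)); Y -= Y.mean(axis=0)   # zero-mean view 2

C_xx = X.T @ X / N
C_yy = Y.T @ Y / N
C_xy = X.T @ Y / N

# Block matrices of the generalized eigenvalue problem (2.24).
A = np.block([[np.zeros((d1, d1)), C_xy],
              [C_xy.T,             np.zeros((d2, d2))]])
B = np.block([[C_xx,               np.zeros((d1, d2))],
              [np.zeros((d2, d1)), C_yy]])

# Generalized symmetric eigenproblem A v = lambda B v.
eigvals, eigvecs = eigh(A, B)
v = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
w1, w2 = v[:d1], v[d1:]

corr = np.corrcoef(X @ w1, Y @ w2)[0, 1]
print("first canonical correlation:", corr)
```

In practice, a small ridge term is often added to $C_{xx}$ and $C_{yy}$ to keep the right-hand side matrix well conditioned.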

2.2.3 Fisher Discriminant Analysis

Fisher Discriminant Analysis (FDA) optimizes the discriminating direction for classification. Its objective can also be expressed in a form similar to the generalized Rayleigh quotient, where $S_B$ measures the separability between classes (between-class scatter) and $S_W$ measures the within-class scatter, given by

\[
\underset{w}{\text{maximize}} \quad \frac{w^T S_B w}{w^T S_W w}, \qquad (2.26)
\]

where

\[
S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T, \qquad
S_W = \sum_{i=1,2} \; \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T,
\]

and $\mu_i$ denotes the sample mean for class $i$ [9].
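As a brief two-class sketch of (2.26), assuming NumPy and synthetic Gaussian classes: since $S_B$ has rank one, the Rayleigh quotient maximizer is proportional to $S_W^{-1}(\mu_2 - \mu_1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=2.0, size=(100, 2))   # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class and within-class scatter matrices of (2.26).
S_B = np.outer(mu2 - mu1, mu2 - mu1)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# S_B has rank one, so the Rayleigh quotient maximizer is
# proportional to S_W^{-1} (mu2 - mu1).
w = np.linalg.solve(S_W, mu2 - mu1)
w /= np.linalg.norm(w)

print("Fisher direction:", w)
```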

2.2.4 k-means Clustering

It has been shown (e.g., [3]) that the principal components in PCA are equivalent to the continuous solutions of the cluster membership indicators in the k-means clustering method. k-means uses k prototypes to characterize the data, and the partitions are determined by minimizing the variance

\[
J_{k\text{-means}} = \sum_{a=1}^{k} \; \sum_{x_i \in X_a} (x_i - \mu_a)(x_i - \mu_a)^T. \qquad (2.27)
\]

For a given data set X and a cluster number k, the summation of all the pairwise distances is a constant value; hence minimizing the distortion is equivalent to maximizing the between-cluster variance, given by

\[
J_{k\text{-means}} = \sum_{a=1}^{k} (\mu_a - \hat{\mu})(\mu_a - \hat{\mu})^T, \qquad (2.28)
\]

where $\hat{\mu}$ is the global sample mean of X.

Denote A as the weighted cluster indicator matrix for k classes, given by

\[
A = F (F^T F)^{-\frac{1}{2}}, \qquad (2.29)
\]

where F is the $N \times k$ binary cluster indicator matrix, $F = \{f_{i,j}\}_{N \times k}$, with

\[
f_{i,j} =
\begin{cases}
1 & \text{if } x_i \in l_j\\
0 & \text{if } x_i \notin l_j
\end{cases}. \qquad (2.30)
\]

Assume, without loss of generality, that X has zero mean; then (2.28) can be rewritten in matrix form as

\[
J_{k\text{-means}} = \underset{A}{\text{maximize}} \;\; \mathrm{trace}\left(A^T X^T X A\right). \qquad (2.31)
\]

Because the construction of A in (2.29) and (2.30) ensures that $A^T A = I$, the objective of k-means is exactly the maximization of a Rayleigh quotient. When k = 2, A reduces to a vector a and leads to a PCA problem. When k > 2, the cluster indicators F can be recovered by exploiting the k − 1 principal components of $X^T X$, for instance by the QR decomposition proposed in [15]. This PCA-based approach to k-means clustering is also known as the spectral relaxation of k-means.
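A rough numerical sketch of the spectral relaxation for k = 2, assuming NumPy and samples stored as rows (so the Gram matrix is $XX^T$): the continuous cluster indicator is the leading eigenvector of the Gram matrix, and the discrete partition is recovered here by a simple sign split rather than the QR scheme of [15].

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated blobs, centered so the data has zero mean.
X = np.vstack([rng.normal(-3, 1, size=(50, 2)),
               rng.normal(+3, 1, size=(50, 2))])
X -= X.mean(axis=0)
k = 2

# Relaxed indicator matrix: top (k - 1) eigenvectors of the Gram matrix X X^T
# (the continuous solution of the cluster membership indicators).
G = X @ X.T
eigvals, eigvecs = np.linalg.eigh(G)
A_relaxed = eigvecs[:, -(k - 1):]          # N x (k - 1)

# Recover discrete clusters from the relaxed solution (simple sign split here;
# [15] uses a QR-based rotation instead).
labels = (A_relaxed[:, 0] > 0).astype(int)
print(labels)
```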


2.2.5 Spectral Clustering

Spectral clustering models the data as a graph whose vertices represent the data samples, connected by non-negative weighted undirected edges. The clustering problem is then restated as finding a partition of the graph such that the edges between different groups have very low weight [7]. Different criteria have been applied to model the objective of the cut, for example the RatioCut [2], the normalized cut [11], Markov random walks [8], the min-cut [13], and so on. In this book, the discussions about spectral clustering are all based on the normalized cut objective.

Let us denote $G = (V, E)$ as an undirected graph with vertex set $V = \{v_1, \ldots, v_n\}$, $W = \{w_{ij}\}_{i,j=1,\ldots,n}$ as the weighted adjacency matrix of the graph, $d_i = \sum_{j=1}^{n} w_{ij}$ as the degree of a vertex $v_i \in V$, and $D$ as the diagonal matrix with the degrees $d_1, \ldots, d_n$ on the diagonal. Given a subset of vertices $\mathcal{X} \subset V$, we denote its complement $V \setminus \mathcal{X}$ as $\bar{\mathcal{X}}$. For two disjoint subsets $M, N \subset V$, the cut is defined as

\[
\mathrm{cut}(M, N) = \sum_{i \in M, \, j \in N} w_{ij}. \qquad (2.32)
\]

The size of a subset is defined as

\[
\mathrm{vol}(M) = \sum_{i \in M} d_i. \qquad (2.33)
\]

The normalized cut criterion optimizes the partition $\mathcal{X}_1, \ldots, \mathcal{X}_k$ to minimize the objective

\[
\mathrm{Ncut}(\mathcal{X}_1, \ldots, \mathcal{X}_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(\mathcal{X}_i, \bar{\mathcal{X}}_i)}{\mathrm{vol}(\mathcal{X}_i)}. \qquad (2.34)
\]
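For concreteness, here is a small sketch that evaluates (2.32)-(2.34) on a toy graph, assuming NumPy (the weights and partitions are made up for illustration):

```python
import numpy as np

def ncut(W, labels, k):
    """Evaluate the normalized cut (2.34) of the partition given by `labels`."""
    d = W.sum(axis=1)                       # vertex degrees
    total = 0.0
    for i in range(k):
        inside = (labels == i)
        cut = W[inside][:, ~inside].sum()   # cut(X_i, complement of X_i), eq. (2.32)
        vol = d[inside].sum()               # vol(X_i), eq. (2.33)
        total += cut / vol
    return total

# Toy 4-vertex graph: two tightly connected pairs, weakly linked to each other.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 5],
              [0, 1, 5, 0]], dtype=float)
print(ncut(W, np.array([0, 0, 1, 1]), k=2))   # good partition: small Ncut
print(ncut(W, np.array([0, 1, 0, 1]), k=2))   # bad partition: larger Ncut
```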

Unfortunately, obtaining the exact solution of (2.34) is NP-hard [13]. To solve it, the discrete constraint on the clustering indicators is usually relaxed to real values, so an approximate solution of spectral clustering can be obtained from the eigenspectrum of the graph Laplacian matrix. For k-way clustering (k > 2), the weighted cluster indicator matrix A is defined in the same way as (2.29) and (2.30), and the problem of minimizing the normalized cut is equivalently expressed as

\[
\begin{aligned}
\underset{A}{\text{minimize}} \quad & \mathrm{trace}\left(A^T D^{-\frac{1}{2}} L D^{-\frac{1}{2}} A\right), \qquad (2.35)\\
\text{subject to} \quad & A^T A = I.
\end{aligned}
\]

This is again the optimization of a Rayleigh quotient problem, which can be solved by eigenvalue decomposition. The optimal $A^*$ corresponds to the first k eigenvectors (those with the smallest eigenvalues) of the normalized Laplacian $\tilde{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$.
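A compact sketch of the relaxed normalized cut (2.35), assuming NumPy, a Gaussian similarity graph, and a simple sign-based discretization (all illustrative choices, not prescribed by the book):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters of points; the affinity graph is a Gaussian similarity matrix.
X = np.vstack([rng.normal(-2, 0.5, size=(30, 2)),
               rng.normal(+2, 0.5, size=(30, 2))])
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = np.exp(-dists**2)                      # weighted adjacency matrix
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)                          # vertex degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.diag(d) - W                         # unnormalized graph Laplacian
L_norm = D_inv_sqrt @ L @ D_inv_sqrt       # normalized Laplacian L~

# Relaxed indicators: eigenvectors of L~ with the smallest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(L_norm)
k = 2
A = eigvecs[:, :k]

# Discretize (for k = 2, the sign of the second eigenvector splits the graph).
labels = (A[:, 1] > 0).astype(int)
print(labels)
```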

2.2.6 Kernel-Laplacian Clustering

Let us assume that the attribute-based data X and the graph affinity matrix W are representations of the same set of samples. The objective function of Kernel-Laplacian (KL) clustering can then be defined as


\[
J_{KL} = \kappa J_{Ncut} + (1 - \kappa) J_{k\text{-means}}, \qquad (2.36)
\]

where $\kappa$ is the weight adjusting the effect of the k-means and spectral clustering objectives, and A is the weighted cluster indicator matrix as defined before. Substituting (2.31) and (2.35) into (2.36), the objective of KL clustering becomes

\[
\begin{aligned}
J_{KL} = \; & \kappa \min_{A} \mathrm{trace}\left(A^T \tilde{L} A\right) + (1 - \kappa) \max_{A} \mathrm{trace}\left(A^T X^T X A\right) \qquad (2.37)\\
& \text{s.t.} \quad A^T A = I.
\end{aligned}
\]

To solve the optimization problem without tuning the hyperparameter $\kappa$, Wang et al. propose a solution to optimize the trace quotient of the two sub-objectives [14]. The trace quotient formulation is then further relaxed as a maximization of the quotient trace, given by

\[
\begin{aligned}
J_{KL} = \underset{A}{\text{maximize}} \quad & \mathrm{trace}\left[(A^T \tilde{L} A)^{-1} (A^T X^T X A)\right] \qquad (2.38)\\
\text{subject to} \quad & A^T A = I.
\end{aligned}
\]

The objective in (2.38) is again a generalized Rayleigh quotient problem and the optimal solution $A^*$ is obtained by solving the corresponding generalized eigenvalue problem. To maximize the objective with k clusters, $A^*$ is approximated by the largest k eigenvectors of $\tilde{L}^{+} X^T X$, where $\tilde{L}^{+}$ is the pseudo-inverse of $\tilde{L}$ [14].
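A minimal sketch of this relaxed KL clustering solution, assuming NumPy and a linear kernel; with samples stored as rows, the kernel (Gram) matrix is written $XX^T$ here, playing the role of the book's $X^T X$ notation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Attribute data X (rows are samples) and a graph affinity matrix W
# over the same samples.
X = np.vstack([rng.normal(-2, 0.5, size=(30, 2)),
               rng.normal(+2, 0.5, size=(30, 2))])
X -= X.mean(axis=0)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = np.exp(-dists**2)
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_norm = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt   # normalized Laplacian L~

K = X @ X.T                                           # linear kernel (Gram) matrix
M = np.linalg.pinv(L_norm) @ K                        # L~^+ times the kernel matrix

# A* is approximated by the k eigenvectors of M with the largest eigenvalues.
k = 2
eigvals, eigvecs = np.linalg.eig(M)                   # M is not symmetric in general
order = np.argsort(eigvals.real)[::-1]
A_star = eigvecs[:, order[:k]].real

print("relaxed indicators shape:", A_star.shape)
```

Since $\tilde{L}^{+}$ times the kernel matrix is generally not symmetric, the resulting eigenvectors are not orthogonal, and a further discretization step (not shown here) is needed to turn the relaxed indicators into hard cluster labels.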

2.2.7 One Class Support Vector Machine

The one class support vector machine (1-SVM) method transforms the binary SVM classification task into a one class learning problem. The method transforms the training data of one class into a high dimensional Hilbert space by the feature map, and iteratively finds the maximal margin hyperplane that best separates the training data from the origin. The solution for the hyperplane is found by solving the following objective:

\[
\begin{aligned}
\underset{w, \xi, \rho}{\text{minimize}} \quad & \frac{1}{2} w^T w + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i - \rho \qquad (2.39)\\
\text{subject to} \quad & w^T \phi(x_i) \geq \rho - \xi_i, \quad i = 1, \ldots, N,\\
& \xi_i \geq 0,
\end{aligned}
\]

where $w$ is the vector perpendicular to the separating hyperplane (the norm vector), N is the number of training data points, $\rho$ is the bias value that parameterizes the hyperplane, $\nu$ is a regularization variable penalizing the outliers in the training data, and $\xi_i$ are the slack variables. Taking the conditions for optimality from the Lagrangian as
