Principal Component Analysis (PCA) considers a given set of zero-mean data $X \in \mathbb{R}^{N \times d}$, where $X = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^d$. The objective is to find a direction vector $w$ onto which the variance of the projection $w^T x_i$ is maximized. Since the variance is invariant to the magnitude of $w$, the objective of PCA is equivalently formulated as
$$\underset{w}{\text{maximize}} \quad w^T C_{xx} w \qquad (2.18)$$
$$\text{subject to} \quad w^T w = 1,$$
where $C_{xx}$ is the sample covariance matrix of $X$. Obviously, (2.18) is a Rayleigh quotient optimization, and the optimal $w$ is given by the eigenvector corresponding to the largest eigenvalue of $C_{xx}$.
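As a minimal illustration (not from the text), the leading principal direction can be computed directly from the eigendecomposition of the sample covariance matrix. The sketch below assumes zero-mean data stored row-wise in a NumPy array `X` of shape `(N, d)`; the function name is illustrative.

```python
import numpy as np

def pca_leading_direction(X):
    """Return the unit direction w maximizing the projected variance
    w^T Cxx w, for zero-mean data X of shape (N, d)."""
    N = X.shape[0]
    Cxx = X.T @ X / N                       # sample covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(Cxx)  # eigh: Cxx is symmetric
    return eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

# usage sketch
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X -= X.mean(axis=0)                         # enforce the zero-mean assumption
w = pca_leading_direction(X)
print(w, w @ w)                             # unit-norm direction
```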
2.2.2 Canonical Correlation Analysis
Canonical Correlation Analysis (CCA) finds linear relations between two sets of variables [5]. For two zero-mean data sets $X \in \mathbb{R}^{N \times d_1}$ and $Y \in \mathbb{R}^{N \times d_2}$, the objective is to identify vectors $w_1$ and $w_2$ such that the correlation between the projected variables $w_1^T X$ and $w_2^T Y$ is maximized, given by
$$\underset{w_1, w_2}{\text{maximize}} \quad \frac{w_1^T C_{xy} w_2}{\sqrt{w_1^T C_{xx} w_1}\,\sqrt{w_2^T C_{yy} w_2}}, \qquad (2.19)$$
where $C_{xx} = E[XX^T]$, $C_{yy} = E[YY^T]$, and $C_{xy} = E[XY^T]$. The problem in (2.19) is usually formulated as the optimization problem given by
$$\underset{w_1, w_2}{\text{maximize}} \quad w_1^T C_{xy} w_2 \qquad (2.20)$$
$$\text{subject to} \quad w_1^T C_{xx} w_1 = 1, \quad w_2^T C_{yy} w_2 = 1.$$
Taking the conditions for optimality from the Lagrangian
$$\mathcal{L}(w_1, w_2; \lambda_1, \lambda_2) = w_1^T C_{xy} w_2 - \lambda_1 (w_1^T C_{xx} w_1 - 1) - \lambda_2 (w_2^T C_{yy} w_2 - 1), \qquad (2.21)$$
one has
$$\begin{cases} C_{xy} w_2 = \lambda_1 C_{xx} w_1 \\ C_{yx} w_1 = \lambda_2 C_{yy} w_2 \end{cases}. \qquad (2.22)$$
Since we have
$$\begin{cases}
w_1^T C_{xy} w_2 = \lambda_1 w_1^T C_{xx} w_1 \\
w_2^T C_{yx} w_1 = \lambda_2 w_2^T C_{yy} w_2 \\
w_1^T C_{xy} w_2 = w_2^T C_{yx} w_1 \\
w_1^T C_{xx} w_1 = w_2^T C_{yy} w_2
\end{cases} \qquad (2.23)$$
we find that $\lambda_1 = \lambda_2 = \lambda$; thus we obtain a generalized eigenvalue problem, given by
$$\begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \lambda \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}. \qquad (2.24)$$
Analogously, the objective function of CCA can also be rewritten in a generalized Rayleigh quotient form as
$$\underset{w_1, w_2}{\text{maximize}} \quad \frac{\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}^T \begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}}{\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}^T \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}}. \qquad (2.25)$$
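The first canonical pair can be obtained by solving the generalized eigenvalue problem (2.24) numerically. The sketch below is one possible implementation, assuming zero-mean data matrices with samples stored as rows; the small ridge term `reg` is an added assumption to keep the right-hand matrix positive definite and is not part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def cca_first_pair(X, Y, reg=1e-8):
    """Solve the generalized eigenvalue problem (2.24) for the first
    canonical pair (w1, w2). X: (N, d1), Y: (N, d2), both zero mean."""
    N, d1 = X.shape
    d2 = Y.shape[1]
    Cxx = X.T @ X / N + reg * np.eye(d1)   # small ridge keeps B positive definite
    Cyy = Y.T @ Y / N + reg * np.eye(d2)
    Cxy = X.T @ Y / N

    A = np.block([[np.zeros((d1, d1)), Cxy],
                  [Cxy.T,              np.zeros((d2, d2))]])
    B = np.block([[Cxx,                np.zeros((d1, d2))],
                  [np.zeros((d2, d1)), Cyy]])

    eigvals, eigvecs = eigh(A, B)          # generalized symmetric eigenproblem
    v = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
    return v[:d1], v[d1:]                  # w1, w2
```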
2.2.3 Fisher Discriminant Analysis
Fisher Discriminant Analysis (FDA) optimizes the discriminating direction for classification. It is also expressed in a form similar to the generalized Rayleigh quotient, where $S_B$ measures the separability of the classes (between-class scatter) and $S_W$ measures the within-class scatter, given by
$$\underset{w}{\text{maximize}} \quad \frac{w^T S_B w}{w^T S_W w}, \qquad (2.26)$$
where
$$S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T, \qquad S_W = \sum_{i=1,2}\; \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T,$$
and $\mu_i$ denotes the sample mean for class $i$ [9].
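For the two-class case, the maximizer of (2.26) is known to be proportional to $S_W^{-1}(\mu_2 - \mu_1)$, i.e., the leading generalized eigenvector of the pair $(S_B, S_W)$. A minimal sketch, assuming samples stored row-wise per class (the function name is illustrative):

```python
import numpy as np

def fda_direction(X1, X2):
    """Fisher discriminant direction for two classes (rows are samples).
    Maximizes (w^T S_B w)/(w^T S_W w); for two classes the optimum is
    proportional to S_W^{-1} (mu2 - mu1)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)  # within-class scatter
    w = np.linalg.solve(Sw, mu2 - mu1)                          # S_W^{-1} (mu2 - mu1)
    return w / np.linalg.norm(w)
```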
2.2.4 k-means Clustering
It has been shown (e.g., in [3]) that the principal components in PCA are equivalent to the continuous solutions of the cluster membership indicators in the k-means clustering method. k-means uses $k$ prototypes to characterize the data, and the partitions are determined by minimizing the variance
$$J_{k\text{-}means} = \sum_{a=1}^{k} \sum_{x_i \in X_a} (x_i - \mu_a)^T (x_i - \mu_a). \qquad (2.27)$$
For a given data set $X$ and a cluster number $k$, the summation of all the pairwise distances is a constant value; hence minimizing the distortion is equivalent to maximizing the between-cluster variance, given by
$$J_{k\text{-}means} = \sum_{a=1}^{k} (\mu_a - \hat{\mu})^T (\mu_a - \hat{\mu}), \qquad (2.28)$$
where $\hat{\mu}$ is the global sample mean of $X$.
Denote $A$ as the weighted cluster indicator matrix for $k$ classes, given by
$$A = F(F^T F)^{-\frac{1}{2}}, \qquad (2.29)$$
where $F$ is the $N \times k$ binary cluster indicator matrix, $F = [f_{i,j}]_{N \times k}$, with
$$f_{i,j} = \begin{cases} 1 & \text{if } x_i \in l_j \\ 0 & \text{if } x_i \notin l_j \end{cases}. \qquad (2.30)$$
Assuming, without loss of generality, that $X$ has zero mean, (2.28) can be rewritten in matrix form as
$$J_{k\text{-}means} = \underset{A}{\text{maximize}}\;\; \mathrm{trace}\left(A^T X^T X A\right). \qquad (2.31)$$
Because the construction of $A$ in (2.29) and (2.30) ensures that $A^T A = I$, the objective of k-means is exactly the maximization of a Rayleigh quotient. When $k = 2$, $A$ reduces to a vector $a$, and the problem reduces to a PCA problem. When $k > 2$, the cluster indicators $F$ can be recovered from the $k - 1$ principal components of $X^T X$, for instance by the QR decomposition proposed in [15]. This PCA-based approach to k-means clustering is also known as the spectral relaxation of k-means.
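A sketch of this spectral relaxation is given below. It assumes samples stored as rows, so the book's $X^T X$ (samples-as-columns convention) corresponds to the $N \times N$ Gram matrix `X @ X.T` here; the final k-means step is one common way to discretize the relaxed indicators and is used in place of the QR-based recovery of [15].

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_relaxed_kmeans(X, k, seed=0):
    """Spectral relaxation sketch: relax the cluster indicators to the top
    k-1 eigenvectors of the Gram matrix of the zero-mean data, then recover
    discrete labels by clustering the relaxed coordinates."""
    X = X - X.mean(axis=0)                      # zero-mean assumption
    G = X @ X.T                                 # N x N Gram matrix of the samples
    eigvals, eigvecs = np.linalg.eigh(G)
    V = eigvecs[:, -(k - 1):]                   # top k-1 relaxed indicator vectors
    _, labels = kmeans2(V, k, seed=seed, minit='++')
    return labels
```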
2.2.5 Spectral Clustering
Spectral clustering models the data as a graph in which the data samples are represented as vertices connected by non-negatively weighted undirected edges. The clustering problem is then restated as finding a partition of the graph such that the edges between different groups have very low weight [7]. Different criteria have been applied to model the cut objective, for example the RatioCut [2], the normalized cut [11], Markov random walks [8], the min-cut [13], and so on. In this book, the discussions about spectral clustering are all based on the normalized cut objective.
Let us denote $G = (V, E)$ as an undirected graph with vertex set $V = \{v_1, \dots, v_n\}$, $W = \{w_{ij}\}_{i,j=1,\dots,n}$ as the weighted adjacency matrix of the graph, $d_i = \sum_{j=1}^{n} w_{ij}$ as the degree of a vertex $v_i \in V$, and $D$ as the diagonal matrix with the degrees $d_1, \dots, d_n$ on the diagonal. Given a subset of vertices $\mathcal{X} \subset V$, we denote its complement $V \setminus \mathcal{X}$ as $\bar{\mathcal{X}}$. For two disjoint subsets $M, N \subset V$, the cut is defined as
$$\mathrm{cut}(M, N) = \sum_{i \in M,\, j \in N} w_{ij}. \qquad (2.32)$$
The size of a subset is defined as
$$\mathrm{vol}(M) = \sum_{i \in M} d_i. \qquad (2.33)$$
The normalized cut criterion optimizes the partition $\mathcal{X}_1, \dots, \mathcal{X}_k$ to minimize the objective
$$\mathrm{Ncut}(\mathcal{X}_1, \dots, \mathcal{X}_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(\mathcal{X}_i, \bar{\mathcal{X}}_i)}{\mathrm{vol}(\mathcal{X}_i)}. \qquad (2.34)$$
Unfortunately, obtaining the exact solution of (2.34) is NP-hard [13]. To solve it, the discrete constraint on the clustering indicators is usually relaxed to real values, so that an approximate solution of spectral clustering can be obtained from the eigenspectrum of the graph Laplacian matrix. For k-way clustering ($k > 2$), the weighted cluster indicator matrix $A$ is defined in the same way as in (2.29) and (2.30), and the problem of minimizing the normalized cut is equivalently expressed as
$$\underset{A}{\text{minimize}} \quad \mathrm{trace}\left(A^T D^{-\frac{1}{2}} L D^{-\frac{1}{2}} A\right), \qquad (2.35)$$
$$\text{subject to} \quad A^T A = I.$$
This is again the optimization of a Rayleigh quotient problem, which can be solved by eigenvalue decomposition. The optimal $A^*$ corresponds to the first $k$ eigenvectors (those with the smallest eigenvalues) of the normalized Laplacian $\tilde{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$.
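A minimal sketch of the relaxed normalized cut is shown below, assuming a dense, symmetric, non-negative affinity matrix `W` with positive degrees; the k-means step used to discretize the relaxed indicator matrix is a common post-processing choice, not prescribed by the text.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def normalized_cut_clustering(W, k, seed=0):
    """Relaxed normalized cut: take the first k eigenvectors of the
    normalized Laplacian L~ = D^{-1/2} L D^{-1/2}, then discretize the
    relaxed indicators with k-means."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W                                # unnormalized Laplacian
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt              # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_norm)
    A = eigvecs[:, :k]                                # k smallest eigenvalues
    _, labels = kmeans2(A, k, seed=seed, minit='++')
    return labels
```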
2.2.6 Kernel-Laplacian Clustering
Let us assume that the attribute-based data $X$ and the graph affinity matrix $W$ are representations of the same set of samples. The objective function of Kernel-Laplacian (KL) clustering can then be defined as
$$J_{KL} = \kappa\, J_{Ncut} + (1 - \kappa)\, J_{k\text{-}means}, \qquad (2.36)$$
where $\kappa$ is a weight adjusting the relative effect of the k-means and spectral clustering objectives, and $A$ is the weighted cluster indicator matrix as defined before. Substituting (2.31) and (2.35) into (2.36), the objective of KL clustering becomes
$$J_{KL} = \kappa\, \underset{A}{\min}\; \mathrm{trace}\left(A^T \tilde{L} A\right) + (1 - \kappa)\, \underset{A}{\max}\; \mathrm{trace}\left(A^T X^T X A\right), \qquad (2.37)$$
$$\text{s.t.} \quad A^T A = I.$$
To solve the optimization problem without tuning the hyperparameter $\kappa$, Wang et al. propose optimizing the trace quotient of the two sub-objectives [14]. The trace quotient formulation is then further relaxed to a maximization of the quotient trace, given by
$$J_{KL} = \underset{A}{\text{maximize}}\;\; \mathrm{trace}\left[(A^T \tilde{L} A)^{-1} (A^T X^T X A)\right], \qquad (2.38)$$
$$\text{subject to} \quad A^T A = I.$$
The objective in (2.38) is again a generalized Rayleigh quotient problem, and the optimal solution $A^*$ is obtained by solving the generalized eigenvalue problem. To maximize the objective with $k$ clusters, $A^*$ is approximated by the largest $k$ eigenvectors of $\tilde{L}^{+} X^T X$, where $\tilde{L}^{+}$ is the pseudo-inverse of $\tilde{L}$ [14].
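A hedged sketch of this approximation is given below. It assumes samples stored as rows, so the book's $X^T X$ is taken here as the $N \times N$ Gram matrix `X @ X.T`; since $\tilde{L}^{+}$ times the Gram matrix is generally non-symmetric, a general eigensolver is used and only the real parts of the leading eigenvectors are kept.

```python
import numpy as np

def kl_clustering_indicators(X, W, k):
    """Approximate KL-clustering indicators: the largest k eigenvectors of
    pinv(L~) @ K, where L~ is the normalized Laplacian of W and K is the
    Gram matrix of the zero-mean attribute data."""
    X = X - X.mean(axis=0)
    K = X @ X.T                                        # N x N Gram matrix
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    M = np.linalg.pinv(L_norm) @ K                     # generally non-symmetric
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1]             # sort by decreasing eigenvalue
    return eigvecs[:, order[:k]].real                  # relaxed indicator matrix A*
```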
2.2.7 One Class Support Vector Machine
The one-class support vector machine (1-SVM) method transforms the binary SVM classification task into a one-class learning problem. The method maps the training data of the single class into a high-dimensional Hilbert space via the feature map and finds the maximal-margin hyperplane that best separates the training data from the origin. The solution for the hyperplane is found by solving the following objective:
$$\underset{w, \xi, \rho}{\text{minimize}} \quad \frac{1}{2} w^T w + \frac{1}{\nu N} \sum_{i=1}^{N} \xi_i - \rho \qquad (2.39)$$
$$\text{subject to} \quad w^T \phi(x_i) \ge \rho - \xi_i, \quad i = 1, \dots, N,$$
$$\qquad\qquad\quad\; \xi_i \ge 0,$$
where $w$ is the vector perpendicular to the separating hyperplane (the normal vector), $N$ is the number of training samples, $\rho$ is the bias value that parameterizes the hyperplane, $\nu$ is a regularization variable penalizing the outliers in the training data, and $\xi_i$ are the slack variables. Taking the conditions for optimality from the Lagrangian as