4.4 Bi-level Optimization of k-means on Multiple Kernels


To solve the objective in problem Q1 (4.13), in the first phase we maximize J_{Q1} with respect to A, keeping θ fixed. In the second phase we maximize J_{Q1} with respect to θ, keeping A fixed. The two-stage optimization is then repeated until convergence (the proof will be shown later). When θ is fixed, the problem is exactly the relaxed k-means clustering problem as discussed before. When A is fixed, the problem of maximizing J_{Q1} reduces to the optimization of the coefficients θ_r assigned to the kernels with the given cluster memberships. We will show that when the memberships are given, the problem in Q1 can be formulated as KFD in F.

4.4.1 The Role of Cluster Assignment

In problem Q1, when we maximize J_{Q1} with respect to A using the fixed θ, the obtained N×k weighted cluster indicator matrix A can also be regarded as the one-vs-others (1vsA) coding of the cluster assignments, because each column of A distinguishes one cluster from the other clusters. When A is given, the between-cluster scatter matrix S_b^Φ is fixed, thus the problem of optimizing the coefficients of the multiple kernel matrices is equivalent to optimizing the KFD [21] using multiple kernel matrices. The scatter matrix of KFD is determined by the cluster assignments of the data points, which can be obtained via an affinity function using A as the input, given by

$$
F_{ij} =
\begin{cases}
+1 & \text{if } A_{ij} > 0 \\
-1 & \text{if } A_{ij} = 0
\end{cases},
\qquad i = 1,\ldots,N,\; j = 1,\ldots,k, \qquad (4.17)
$$

where F is an affinity matrix using {+1, −1} to discriminate the cluster assignments. In the second phase of our algorithm, to maximize J_{Q1} with respect to θ using the fixed A, we formulate it as the optimization of KFD on multiple kernel matrices using the affinity matrix F as input.
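For concreteness, the mapping in (4.17) can be written in one line of NumPy; this is a minimal sketch and the function name is ours, not from the text:

```python
import numpy as np

def affinity_from_indicator(A):
    """Eq. (4.17): map the N x k weighted cluster indicator A to the
    {+1, -1} affinity matrix F used as 1vsA labels for KFD."""
    return np.where(A > 0, 1.0, -1.0)
```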

4.4.2 Optimizing the Kernel Coefficients as KFD

As is known, for a single data set X, given the labels of two classes, to find the linear discriminant in F we need to maximize

$$
\underset{w}{\text{maximize}} \quad \frac{w^{T} S_{b}^{\Phi} w}{w^{T}\left(S_{w}^{\Phi} + \rho I\right) w}, \qquad (4.18)
$$

where w is the norm vector of the separating hyperplane in F, S_b^Φ and S_w^Φ are respectively the between-class and the within-class scatter matrices in F, and ρ is the regularization term ensuring the positive definiteness of the denominator. For k multiple classes, denote W = [w_1, ..., w_k] as the matrix whose columns correspond to the discriminative directions of the 1vsA classes. An important property of the KFD objective is that it is invariant w.r.t. rescalings of w [9]. Hence, we can always choose w (W) such that the denominator is simply w^T S_w w = 1 (W^T S_w W = I_k) [12]. In other words, if the within-class scatter is isotropic, the norm vectors of the discriminant projections are merely the eigenvectors of the between-class scatter [9]. Thus the objective of KFD can be simplified to a Rayleigh quotient formulation. Moreover, the vectors in W are orthogonal to each other; therefore, for convenience we can further rescale W to have W^T W = I_k (it can be proved that rescaling W does not change the clustering result). Thus, the KFD objective can be expressed as

$$
\begin{aligned}
\underset{W}{\text{maximize}} \quad & \operatorname{trace}\left(W^{T} S_{b}^{\Phi} W\right), & (4.19)\\
\text{subject to} \quad & W^{T} W = I_{k}.
\end{aligned}
$$
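Since (4.19) is a trace maximization under an orthonormality constraint, its optimum is spanned by the eigenvectors of S_b^Φ with the k largest eigenvalues. A minimal NumPy sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def maximize_trace(S_b, k):
    """Maximize trace(W^T S_b W) subject to W^T W = I_k: for a symmetric
    scatter matrix S_b the maximizer collects the eigenvectors of S_b
    associated with its k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(S_b)            # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors as columns
```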

Based on the observations above, we formulate the objective of k-means using multiple kernels as

$$
\begin{aligned}
\text{Q2:}\quad \underset{A,\, W,\, \theta}{\text{maximize}} \quad & \operatorname{trace}\left(W^{T} A^{T} \Omega A W\right), & (4.20)\\
\text{subject to} \quad & A^{T} A = I_{k}, \quad W^{T} W = I_{k},\\
& \Omega = \sum_{r=1}^{p} \theta_{r} G_{r}, \quad \theta_{r} \geq 0,\; r = 1, \ldots, p, \quad \sum_{r=1}^{p} \theta_{r} = 1.
\end{aligned}
$$

In the k-means step, we set W = I_k and optimize A (it can be easily proved that fixing W as I_k does not change the clustering result). In the KFD step, we fix A and optimize W and θ. Therefore, the two components optimize towards the same objective as a Rayleigh quotient in F, which also guarantees that the iterative optimization of these two steps converges to a local optimum. Moreover, in the KFD step of the present problem, we are not interested in W, which determines the separating hyperplane; instead, we only need the optimal coefficients θ_r assigned to the multiple kernels. In particular, it is known that Fisher discriminant analysis is related to the least squares approach [9], and that KFD is related to the least squares support vector machine (LSSVM) proposed by Suykens et al. [26, 27]. Therefore, we can solve the KFD problem as an LSSVM using multiple kernels.
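To make the k-means step concrete: with W = I_k, Q2 reduces to maximizing trace(A^T Ω A) under A^T A = I_k, so the relaxed indicator A consists of the k dominant eigenvectors of the combined kernel Ω. A hedged NumPy sketch (names are ours):

```python
import numpy as np

def relaxed_kmeans_step(G_list, theta, k):
    """k-means phase of Q2: build Omega = sum_r theta_r * G_r and return the
    k dominant eigenvectors of Omega as the relaxed indicator A (A^T A = I_k)."""
    Omega = sum(t * G for t, G in zip(theta, G_list))
    eigvals, eigvecs = np.linalg.eigh(Omega)
    A = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return A, Omega
```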


4.4.3 Solving KFD as LSSVM Using Multiple Kernels

In LSSVM, the cost function of the classification error is replaced by a least squares term [27] and the inequality constraints are replaced by equalities, given by

$$
\begin{aligned}
\underset{w,\, b,\, e}{\text{minimize}} \quad & \frac{1}{2} w^{T} w + \frac{\lambda}{2} e^{T} e & (4.21)\\
\text{subject to} \quad & y_{i}\left[w^{T} \phi(x_{i}) + b\right] = 1 - e_{i}, \quad i = 1, \ldots, N,
\end{aligned}
$$

where w is the norm vector of the separating hyperplane, x_i are the data samples, φ(·) is the feature map, y_i are the cluster assignments represented in the affinity matrix F, λ > 0 is a positive regularization parameter, and e are the least squares error terms.

Taking the conditions for optimality from the Lagrangian, eliminating w and e, and defining y = [y_1, ..., y_N]^T and Y = diag(y_1, ..., y_N), one obtains the following linear system [26, 27]:

$$
\begin{bmatrix}
0 & y^{T}\\
y & Y K Y + I/\lambda
\end{bmatrix}
\begin{bmatrix}
b\\
\alpha
\end{bmatrix}
=
\begin{bmatrix}
0\\
\mathbf{1}
\end{bmatrix}, \qquad (4.22)
$$

where α are the unconstrained dual variables and K is the kernel matrix obtained by the kernel trick as K(x_i, x_j) = φ(x_i)^T φ(x_j). Without loss of generality, we denote β = Yα such that (4.22) becomes

$$
\begin{bmatrix}
0 & \mathbf{1}^{T}\\
\mathbf{1} & K + Y^{-2}/\lambda
\end{bmatrix}
\begin{bmatrix}
b\\
\beta
\end{bmatrix}
=
\begin{bmatrix}
0\\
Y^{-1}\mathbf{1}
\end{bmatrix}. \qquad (4.23)
$$
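Because the entries of F are ±1 (so that each Y_q² = I_N, as noted below), the coefficient matrix in (4.23) is identical for every class: it can be factorized once and reused for the k right-hand sides. A hedged NumPy/SciPy sketch (function and variable names are ours):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def lssvm_multiclass(K, F, lam):
    """Solve the k linear systems of Eq. (4.23) with a single factorization.
    K   : N x N combined kernel matrix
    F   : N x k affinity matrix with entries in {+1, -1} (Eq. 4.17)
    lam : positive regularization parameter (lambda)"""
    N, k = F.shape
    # Coefficient matrix [[0, 1^T], [1, K + I/lam]] is the same for all classes.
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = K + np.eye(N) / lam
    lu, piv = lu_factor(M)                 # factorize once
    B = np.zeros((N, k))                   # dual variables beta, one column per class
    b = np.zeros(k)                        # bias terms
    for q in range(k):
        rhs = np.concatenate(([0.0], F[:, q]))   # Y_q^{-1} 1 equals F[:, q] for ±1 labels
        sol = lu_solve((lu, piv), rhs)
        b[q], B[:, q] = sol[0], sol[1:]
    return B, b
```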

By (4.17), the elements of F are in {−1, +1}, so Y² = I_N and the coefficient matrix of the linear system in (4.23) is the same for all classes in the multi-class case. In 1vsA coding, solving (4.22) requires solving k separate linear systems, whereas in (4.23) the coefficient matrix needs to be factorized only once, so the solutions β_q w.r.t. the multi-class assignments y_q are very efficient to obtain. To incorporate multiple kernels, we follow the multiple kernel learning (MKL) approach proposed by Lanckriet et al. [16] and formulate the LSSVM MKL as a QCQP problem, given by

$$
\begin{aligned}
\underset{\beta,\, t}{\text{minimize}} \quad & \frac{1}{2} t + \frac{1}{2\lambda} \beta^{T} \beta - \beta^{T} Y^{-1}\mathbf{1} & (4.24)\\
\text{subject to} \quad & \sum_{i=1}^{N} \beta_{i} = 0,\\
& t \geq \beta^{T} K_{r} \beta, \quad r = 1, \ldots, p.
\end{aligned}
$$
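For illustration only, the QCQP (4.24) for a single ±1 label vector can be written directly in CVXPY; as noted in the text, the kernel weights then correspond to the dual variables of the quadratic constraints. This is a sketch under those assumptions, not the authors' implementation, and the recovered duals may need to be normalized to sum to one:

```python
import cvxpy as cp
import numpy as np

def lssvm_mkl_qcqp(G_list, y, lam, ridge=1e-8):
    """Sketch of the QCQP (4.24) for a single ±1 label vector y.
    G_list : list of p centered N x N kernel matrices
    lam    : regularization parameter lambda
    The constraints t >= beta^T G_r beta are expressed through Cholesky
    factors so the problem is DCP; a tiny ridge keeps the factorizations stable."""
    y = np.asarray(y, dtype=float)
    N = y.size
    L_list = [np.linalg.cholesky(0.5 * (G + G.T) + ridge * np.eye(N)) for G in G_list]
    beta = cp.Variable(N)
    t = cp.Variable()
    objective = cp.Minimize(0.5 * t + (1.0 / (2.0 * lam)) * cp.sum_squares(beta)
                            - y @ beta)              # Y^{-1} 1 equals y for ±1 labels
    quad_cons = [cp.sum_squares(L.T @ beta) <= t for L in L_list]
    prob = cp.Problem(objective, [cp.sum(beta) == 0] + quad_cons)
    prob.solve()
    # Kernel weights correspond to the duals of the quadratic constraints.
    theta = np.array([float(c.dual_value) for c in quad_cons])
    return beta.value, theta
```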

In our problem we use the normalized and centered kernel matrices of all samples, thus K_r is equal to G_r, and the kernel coefficients θ_r correspond to the dual variables bounded by the quadratic constraints in (4.24). The column vectors of F, denoted F_j, j = 1, ..., k, correspond to the k matrices Y_1, ..., Y_k in (4.23), where Y_j = diag(F_j), j = 1, ..., k. The bias term b can be solved independently using the optimal β and the optimal θ, and thus can be dropped from (4.24). Concluding the previous discussion, the SIP formulation of the LSSVM MKL is given by

$$
\begin{aligned}
\underset{\theta,\, u}{\text{maximize}} \quad & u & (4.25)\\
\text{subject to} \quad & \theta_{r} \geq 0, \quad r = 1, \ldots, p+1,\\
& \sum_{r=1}^{p+1} \theta_{r} = 1,\\
& \sum_{r=1}^{p+1} \theta_{r} f_{r}(\beta) \geq u, \quad \forall \beta,\\
& f_{r}(\beta) = \sum_{q=1}^{k} \left[ \frac{1}{2} \beta_{q}^{T} G_{r} \beta_{q} - \beta_{q}^{T} Y_{q}^{-1}\mathbf{1} \right], \quad r = 1, \ldots, p+1.
\end{aligned}
$$
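A small helper makes the score functions f_r(β) in (4.25) explicit; a minimal NumPy sketch assuming the reconstruction of f_r given above (names are ours):

```python
import numpy as np

def f_scores(betas, G_list, F):
    """Evaluate f_r(beta) = sum_q [ 1/2 beta_q^T G_r beta_q - beta_q^T Y_q^{-1} 1 ]
    for every kernel G_r in G_list (Eq. 4.25).
    betas : N x k matrix whose column q holds beta_q
    F     : N x k affinity matrix; Y_q^{-1} 1 equals its q-th column for ±1 labels."""
    return np.array([
        sum(0.5 * betas[:, q] @ G @ betas[:, q] - betas[:, q] @ F[:, q]
            for q in range(betas.shape[1]))
        for G in G_list
    ])
```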

The pseudocode to solve the LSSVM MKL in (4.25) is presented as follows:

Algorithm 4.4.1. SIP-LS-SVM-MKL(G_1, ..., G_p, F)

    Obtain the initial guess β^(0) = [β_1^(0), ..., β_k^(0)]
    τ = 0
    while (Δu > ε)
        do  step 1: Fix β, solve θ^(τ), then obtain u^(τ)
            step 2: Compute the kernel combination Ω^(τ)
            step 3: Solve the single-kernel LS-SVM for the optimal β^(τ)
            step 4: Compute f_1(β^(τ)), ..., f_{p+1}(β^(τ))
            step 5: Δu = |1 − ∑_{j=1}^{p+1} θ_j^(τ) f_j(β^(τ)) / u^(τ)|
            step 6: τ := τ + 1
    comment: τ is the index of the current loop
    return (θ^(τ), β^(τ))

In Algorithm 4.4.1, G_1, ..., G_p are the centered kernel matrices of the multiple sources, an identity matrix is set as G_{p+1} to estimate the regularization parameter, and Y_1, ..., Y_k are the N×N diagonal matrices constructed from F. The ε is a fixed constant serving as the stopping rule of the SIP iterations and is set empirically to 0.0001 in our implementation. Normally the SIP takes about ten iterations to converge. In Algorithm 4.4.1, Step 1 optimizes θ as a linear programming problem and Step 3 is simply a linear system of the form

$$
\begin{bmatrix}
0 & \mathbf{1}^{T}\\
\mathbf{1} & \Omega^{(\tau)}
\end{bmatrix}
\begin{bmatrix}
b^{(\tau)}\\
\beta^{(\tau)}
\end{bmatrix}
=
\begin{bmatrix}
0\\
Y^{-1}\mathbf{1}
\end{bmatrix}, \qquad (4.26)
$$

where Ω^(τ) = ∑_{r=1}^{p+1} θ_r^(τ) G_r.
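Step 1 of Algorithm 4.4.1 is a small linear program over θ and u, with one linear constraint per β generated so far. A hedged SciPy sketch of that restricted master problem (names and the constraint bookkeeping are ours):

```python
import numpy as np
from scipy.optimize import linprog

def solve_theta_lp(f_matrix):
    """Step 1 of Algorithm 4.4.1 as a linear program.
    f_matrix : (m x (p+1)) array; row i holds f_1(beta^(i)), ..., f_{p+1}(beta^(i))
               for every beta generated so far.
    Solves  max_{theta, u} u  s.t.  sum_r theta_r f_r(beta^(i)) >= u for all i,
            theta_r >= 0, sum_r theta_r = 1."""
    m, p1 = f_matrix.shape
    c = np.zeros(p1 + 1); c[-1] = -1.0               # minimize -u  (i.e. maximize u)
    A_ub = np.hstack([-f_matrix, np.ones((m, 1))])   # u - theta^T f(beta^(i)) <= 0
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, p1)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                           # simplex constraint on theta
    bounds = [(0, None)] * p1 + [(None, None)]       # theta >= 0, u free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    theta, u = res.x[:p1], res.x[-1]
    return theta, u
```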


4.4.4 Optimized Data Fusion for Kernel k-means Clustering (OKKC)

Now we have clarified the two algorithmic components to optimize the objective Q2 as defined in (4.20). The main characteristic is that the cluster assignments and the coefficients of kernels are optimized iteratively and adaptively until convergence.

The coefficients assigned to the multiple kernel matrices leverage the effect of different kernels in data integration to optimize the clustering objective. Compared to the average combination of kernel matrices, the optimized combination approach may be more robust to noisy and irrelevant data sources. We name the proposed algorithm optimized kernel k-means clustering (OKKC); its pseudocode is presented in Algorithm 4.4.2 as follows:

Algorithm 4.4.2. OKKC(G_1, G_2, ..., G_p, k)

    comment: Obtain Ω^(0) from the initial guess of θ_1^(0), ..., θ_p^(0)
    A^(0) ← KERNEL K-MEANS(Ω^(0), k)
    γ = 0
    while (ΔA > ε)
        do  step 1: F^(γ) ← A^(γ)
            step 2: Ω^(γ+1) ← SIP-LS-SVM-MKL(G_1, G_2, ..., G_p, F^(γ))
            step 3: A^(γ+1) ← KERNEL K-MEANS(Ω^(γ+1), k)
            step 4: ΔA = ||A^(γ+1) − A^(γ)||_2 / ||A^(γ+1)||_2
            step 5: γ := γ + 1
    return (A^(γ), θ_1^(γ), ..., θ_p^(γ))

The iteration in Algorithm 4.4.2 terminates when the relaxed cluster membership matrix A stops changing. The tolerance value ε is a constant serving as the stopping rule of OKKC; in our implementation it is set to 0.05. The final cluster assignment is obtained by applying the kernel k-means algorithm [11] to the optimally combined kernel matrix Ω obtained at convergence.
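Putting the pieces together, the outer OKKC loop can be outlined as below. The two helpers kernel_kmeans (the relaxed eigenvector step) and sip_ls_svm_mkl (Algorithm 4.4.1, returning the kernel weights θ) are passed in as callables and are assumed to exist; this is an outline under those assumptions, not the authors' implementation:

```python
import numpy as np

def okkc(G_list, k, kernel_kmeans, sip_ls_svm_mkl, eps=0.05, max_iter=50):
    """Outline of Algorithm 4.4.2 (OKKC).
    G_list         : list of p centered N x N kernel matrices
    k              : number of clusters
    kernel_kmeans  : callable (Omega, k) -> relaxed N x k indicator A
    sip_ls_svm_mkl : callable (G_list, F) -> kernel weights theta (Algorithm 4.4.1)"""
    p = len(G_list)
    theta = np.full(p, 1.0 / p)                      # initial guess: uniform weights
    Omega = sum(t * G for t, G in zip(theta, G_list))
    A = kernel_kmeans(Omega, k)
    for _ in range(max_iter):
        F = np.where(A > 0, 1.0, -1.0)               # Eq. (4.17)
        theta = sip_ls_svm_mkl(G_list, F)            # KFD / LSSVM MKL step
        Omega = sum(t * G for t, G in zip(theta, G_list))
        A_new = kernel_kmeans(Omega, k)              # relaxed k-means step
        delta = np.linalg.norm(A_new - A, 2) / np.linalg.norm(A_new, 2)
        A = A_new
        if delta < eps:                              # stop when A stabilizes
            break
    return A, theta, Omega
```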

4.4.5 Computational Complexity

The proposed OKKC algorithm has several advantages over similar algorithms proposed in the literature. The optimization procedure of OKKC is bi-level, which is simpler than the tri-level architecture of the NAML algorithm.

The kernel coefficients in OKKC are optimized as an LSSVM MKL, which can be solved efficiently as a convex SIP problem. The kernel coefficients are obtained by iterating two linear systems: a single-kernel LSSVM problem and a linear program to optimize the kernel coefficients. The time complexity of OKKC is O{γ[N³ + τ(N² + p³)] + lkN²}, where γ is the number of OKKC iterations and O(N³) is the complexity of the eigenvalue decomposition in the kernel k-means step.
