3.3 The Norms of Multiple Kernel Learning


We consider the problem of minimizing a quadratic cost as a function of a real vector $\alpha$ and a real positive semi-definite (PSD) matrix $Q$, given by

\[
\begin{aligned}
\underset{\alpha}{\text{minimize}} \quad & \alpha^T Q \alpha \\
\text{subject to} \quad & \alpha \in \mathcal{C},
\end{aligned}
\tag{3.1}
\]

where $\mathcal{C}$ denotes a convex set. Since $Q$ is PSD, $\forall \alpha,\ \alpha^T Q \alpha \geq 0$. We will show that many machine learning problems can be cast in the form of (3.1) with additional constraints on $\alpha$. In particular, if we restrict $\alpha^T \alpha = 1$, the problem in (3.1) becomes a Rayleigh quotient and leads to an eigenvalue problem.
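As a small illustration of this special case (not part of the original text), the following Python/NumPy sketch minimizes $\alpha^T Q \alpha$ over the unit sphere $\alpha^T \alpha = 1$: the optimum is the smallest eigenvalue of $Q$, attained at the corresponding eigenvector. The matrix $Q$ is an arbitrary randomly generated PSD example.

```python
import numpy as np

# Arbitrary example: build a PSD matrix Q as A A^T
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T

# With the constraint alpha^T alpha = 1, minimizing alpha^T Q alpha is a
# Rayleigh quotient problem: the optimum is the smallest eigenvalue of Q,
# attained at the corresponding (unit-norm) eigenvector.
eigvals, eigvecs = np.linalg.eigh(Q)           # eigenvalues in ascending order
alpha_opt = eigvecs[:, 0]
print(eigvals[0], alpha_opt @ Q @ alpha_opt)   # the two values coincide
```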

Now we consider a convex parametric linear combination of a set of $p$ PSD matrices $Q_j$, given by

\[
\Omega = \sum_{j=1}^{p} \theta_j Q_j, \qquad \forall j,\ \theta_j \geq 0,\ Q_j \succeq 0.
\tag{3.2}
\]

To bound the coefficients $\theta_j$, we impose a constraint, for example $\|\theta\|_1 = 1$; thus (3.1) can be equivalently rewritten as a min-max problem, given by

\[
\begin{aligned}
\underset{\alpha}{\text{minimize}}\ \underset{\theta}{\text{maximize}} \quad & \alpha^T \left( \sum_{j=1}^{p} \theta_j Q_j \right) \alpha \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1, \ldots, p, \\
& \alpha \in \mathcal{C}, \\
& \theta_j \geq 0, \quad j = 1, \ldots, p, \\
& \sum_{j=1}^{p} \theta_j = 1.
\end{aligned}
\tag{3.3}
\]

To solve (3.3), we denote $t = \alpha^T \left( \sum_{j=1}^{p} \theta_j Q_j \right) \alpha$; the min-max problem can then be formulated as a quadratically constrained linear program (QCLP) [10], given by

\[
\begin{aligned}
\underset{\alpha,\, t}{\text{minimize}} \quad & t \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1, \ldots, p, \\
& \alpha \in \mathcal{C}, \\
& t \geq \alpha^T Q_j \alpha, \quad j = 1, \ldots, p.
\end{aligned}
\tag{3.4}
\]

The optimal solution $\theta$ of (3.3) is obtained from the dual variables corresponding to the quadratic constraints in (3.4). The optimal $t$ is equivalent to the Chebyshev norm ($L_\infty$-norm) of the vector of quadratic terms, given by

\[
t = \left\| \left\{ \alpha^T Q_1 \alpha, \ldots, \alpha^T Q_p \alpha \right\} \right\|_\infty = \max \left\{ \alpha^T Q_1 \alpha, \ldots, \alpha^T Q_p \alpha \right\}.
\tag{3.5}
\]

The $L_\infty$-norm is the upper bound w.r.t. the constraint $\sum_{j=1}^{p} \theta_j = 1$ because

\[
\alpha^T \left( \sum_{j=1}^{p} \theta_j Q_j \right) \alpha \leq t.
\tag{3.6}
\]
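As a hedged sketch (not part of the original text, which relies on general-purpose convex solvers), problem (3.4) can be written almost verbatim with the Python modeling package cvxpy. The matrices $Q_j$ below are random PSD placeholders and, purely for illustration, the convex set $\mathcal{C}$ is taken to be the probability simplex; the coefficients $\theta_j$ are read off as the dual variables of the quadratic constraints, and by stationarity with respect to $t$ they sum to one.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, p = 5, 3
Qs = []
for _ in range(p):
    A = rng.standard_normal((d, d))
    Qs.append(A @ A.T)                      # example PSD matrices Q_j

alpha = cp.Variable(d)
t = cp.Variable()

# Illustrative choice of the convex set C: the probability simplex.
set_C = [cp.sum(alpha) == 1, alpha >= 0]
# Quadratic constraints t >= alpha^T Q_j alpha, one per matrix.
quad_cons = [cp.quad_form(alpha, Q) <= t for Q in Qs]

prob = cp.Problem(cp.Minimize(t), set_C + quad_cons)
prob.solve()

# theta_j is the dual variable of the j-th quadratic constraint.
theta = np.array([c.dual_value for c in quad_cons])
print("optimal t:", t.value)
print("theta    :", theta)                  # sums to one, typically sparse
```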

3.3.2 L2-norm MKL

Suppose the optimal $\alpha$ is given; optimizing the $L_\infty$-norm in (3.5) then picks the single term with the maximal value, so the optimal solution for the coefficients is likely to be sparse. An alternative to (3.3) is to impose a different constraint on the coefficients, for example $\|\theta\|_2 = 1$. We thus propose a new extension of the problem in (3.1), given by

\[
\begin{aligned}
\underset{\alpha}{\text{minimize}}\ \underset{\theta}{\text{maximize}} \quad & \alpha^T \left( \sum_{j=1}^{p} \theta_j Q_j \right) \alpha \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1, \ldots, p, \\
& \alpha \in \mathcal{C}, \\
& \theta_j \geq 0, \quad j = 1, \ldots, p, \\
& \|\theta\|_2 = 1.
\end{aligned}
\tag{3.7}
\]

This new extension can be solved analogously as a QCLP problem with modified constraints, given by

\[
\begin{aligned}
\underset{\alpha,\, \eta}{\text{minimize}} \quad & \eta \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1, \ldots, p, \\
& \alpha \in \mathcal{C}, \\
& \eta \geq \|s\|_2,
\end{aligned}
\tag{3.8}
\]

where $s = \left\{ \alpha^T Q_1 \alpha, \ldots, \alpha^T Q_p \alpha \right\}^T$. The proof that (3.8) solves (3.7) is given in the following theorem.

Theorem 3.1. The QCLP problem in (3.8) is the solution of the problem in (3.7).

Proof. Given two vectors $\{x_1, \ldots, x_p\}$ and $\{y_1, \ldots, y_p\}$ with $x_j, y_j \in \mathbb{R}$, $j = 1, \ldots, p$, the Cauchy-Schwarz inequality states that

\[
0 \leq \left( \sum_{j=1}^{p} x_j y_j \right)^2 \leq \left( \sum_{j=1}^{p} x_j^2 \right) \left( \sum_{j=1}^{p} y_j^2 \right),
\tag{3.9}
\]

with the equivalent form

\[
0 \leq \left[ \left( \sum_{j=1}^{p} x_j y_j \right)^2 \right]^{\frac{1}{2}} \leq \left( \sum_{j=1}^{p} x_j^2 \right)^{\frac{1}{2}} \left( \sum_{j=1}^{p} y_j^2 \right)^{\frac{1}{2}}.
\tag{3.10}
\]

Let us denote $x_j = \theta_j$ and $y_j = \alpha^T Q_j \alpha$; then (3.10) becomes

\[
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \left( \sum_{j=1}^{p} \theta_j^2 \right)^{\frac{1}{2}} \left( \sum_{j=1}^{p} \left( \alpha^T Q_j \alpha \right)^2 \right)^{\frac{1}{2}}.
\tag{3.11}
\]

Since $\|\theta\|_2 = 1$, (3.11) is equivalent to

\[
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \left( \sum_{j=1}^{p} \left( \alpha^T Q_j \alpha \right)^2 \right)^{\frac{1}{2}}.
\tag{3.12}
\]

Therefore, given $s = \left\{ \alpha^T Q_1 \alpha, \ldots, \alpha^T Q_p \alpha \right\}^T$, the additive term $\sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha$ is bounded by the $L_2$-norm $\|s\|_2$.

Moreover, it is easy to verify that when $\theta_j = \alpha^T Q_j \alpha / \|s\|_2$, the parametric combination reaches the upper bound and equality holds. Optimizing this $L_2$-norm yields a non-sparse solution in $\theta_j$. To distinguish it from the solution obtained by (3.3) and (3.4), we denote it as the $L_2$-norm approach. It can also easily be seen (not shown here) that the $L_1$-norm approach simply averages the quadratic terms with uniform coefficients.
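A corresponding sketch for the $L_2$-norm problem (3.8), again with illustrative random $Q_j$ and a simplex constraint standing in for $\mathcal{C}$ (these choices are assumptions, not from the original text). An auxiliary vector $s$ with $s_j \geq \alpha^T Q_j \alpha$ keeps the model within standard disciplined convex programming rules, and the recovered coefficients $\theta_j = \alpha^T Q_j \alpha / \|s\|_2$ are typically non-sparse.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
d, p = 5, 3
Qs = []
for _ in range(p):
    A = rng.standard_normal((d, d))
    Qs.append(A @ A.T)                      # example PSD matrices Q_j

alpha = cp.Variable(d)
s = cp.Variable(p)                          # auxiliary: s_j >= alpha^T Q_j alpha
eta = cp.Variable()

constraints = [cp.sum(alpha) == 1, alpha >= 0]                 # illustrative C
constraints += [cp.quad_form(alpha, Qs[j]) <= s[j] for j in range(p)]
constraints += [cp.norm(s, 2) <= eta]                          # eta >= ||s||_2

cp.Problem(cp.Minimize(eta), constraints).solve()

# Recover the non-sparse coefficients that attain the Cauchy-Schwarz bound.
quad = np.array([alpha.value @ Q @ alpha.value for Q in Qs])
theta = quad / np.linalg.norm(quad)
print("optimal eta:", eta.value)
print("theta      :", theta)                # ||theta||_2 = 1, non-sparse
```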

3.3.3 Ln-norm MKL

The $L_2$-norm bound can be generalized to any real number $n \geq 1$, which defines $L_n$-norm MKL. Recently, a similar topic has also been investigated by Kloft et al. [27], where a solution is proposed to solve the primal MKL problem. We will show that our primal-dual interpretation of MKL also extends to the $L_n$-norm. Let us assume that $\theta$ is regularized by the $L_m$-norm as $\|\theta\|_m = 1$; then the $L_m$-norm extension of (3.7) is given by

\[
\begin{aligned}
\underset{\alpha}{\text{minimize}}\ \underset{\theta}{\text{maximize}} \quad & \alpha^T \left( \sum_{j=1}^{p} \theta_j Q_j \right) \alpha \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1, \ldots, p, \\
& \alpha \in \mathcal{C}, \\
& \theta_j \geq 0, \quad j = 1, \ldots, p, \\
& \|\theta\|_m = 1.
\end{aligned}
\tag{3.13}
\]

In Theorem 3.2 below, we prove that (3.13) can be equivalently solved as a QCLP problem, given by

\[
\begin{aligned}
\underset{\alpha,\, \eta}{\text{minimize}} \quad & \eta \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1, \ldots, p, \\
& \alpha \in \mathcal{C}, \\
& \eta \geq \|s\|_n,
\end{aligned}
\tag{3.14}
\]

where $s = \left\{ \alpha^T Q_1 \alpha, \ldots, \alpha^T Q_p \alpha \right\}^T$ and the constraint is in the $L_n$-norm with $n = \frac{m}{m-1}$. The problem in (3.14) is convex and can be solved by the cvx toolbox [19, 20].
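The text above solves (3.14) with the MATLAB cvx toolbox; as an assumed alternative, the same model can be sketched in Python with cvxpy, where the only change with respect to the $L_2$ case is the norm order $n = m/(m-1)$. The matrices $Q_j$ and the simplex constraint used for $\mathcal{C}$ are illustrative placeholders.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
d, p = 5, 3
Qs = []
for _ in range(p):
    A = rng.standard_normal((d, d))
    Qs.append(A @ A.T)                      # example PSD matrices Q_j

m = 3.0
n = m / (m - 1.0)                           # conjugate norm order, here 1.5

alpha = cp.Variable(d)
s = cp.Variable(p)                          # auxiliary: s_j >= alpha^T Q_j alpha
eta = cp.Variable()

constraints = [cp.sum(alpha) == 1, alpha >= 0]                 # illustrative C
constraints += [cp.quad_form(alpha, Qs[j]) <= s[j] for j in range(p)]
constraints += [cp.norm(s, n) <= eta]                          # eta >= ||s||_n

cp.Problem(cp.Minimize(eta), constraints).solve()
print("optimal eta:", eta.value)
```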

Theorem 3.2. If the coefficient vector $\theta$ is regularized by an $L_m$-norm in (3.13), the problem can be solved as the convex programming problem in (3.14) with an $L_n$-norm constraint, where $n = \frac{m}{m-1}$.

Proof. We generalize the Cauchy-Schwarz inequality to Hölder's inequality. Let $m, n > 1$ be two numbers that satisfy $\frac{1}{m} + \frac{1}{n} = 1$. Then, for non-negative $x_j$ and $y_j$,

\[
0 \leq \sum_{j=1}^{p} x_j y_j \leq \left( \sum_{j=1}^{p} x_j^m \right)^{\frac{1}{m}} \left( \sum_{j=1}^{p} y_j^n \right)^{\frac{1}{n}}.
\tag{3.15}
\]

Let us denote $x_j = \theta_j$ and $y_j = \alpha^T Q_j \alpha$; then (3.15) becomes

\[
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \left( \sum_{j=1}^{p} \theta_j^m \right)^{\frac{1}{m}} \left( \sum_{j=1}^{p} \left( \alpha^T Q_j \alpha \right)^n \right)^{\frac{1}{n}}.
\tag{3.16}
\]

Since $\|\theta\|_m = 1$, the term $\left( \sum_{j=1}^{p} \theta_j^m \right)^{\frac{1}{m}}$ can be omitted, so (3.16) is equivalent to

\[
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \left( \sum_{j=1}^{p} \left( \alpha^T Q_j \alpha \right)^n \right)^{\frac{1}{n}}.
\tag{3.17}
\]

Since the condition $\frac{1}{m} + \frac{1}{n} = 1$ implies $n = \frac{m}{m-1}$, we have proven that with the $L_m$-norm constraint posed on $\theta$, the additive multiple kernel term $\sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha$ is bounded by the $L_n$-norm of the vector $\left\{ \alpha^T Q_1 \alpha, \ldots, \alpha^T Q_p \alpha \right\}^T$, with $n = \frac{m}{m-1}$.

In this section, we have explained the $L_\infty$, $L_1$, $L_2$, and $L_n$-norm approaches to extending the basic problem in (3.1) to multiple matrices $Q_j$. These approaches differ mainly in the constraints applied to the coefficients. To clarify the difference between the notations used in this paper and the common interpretations of $L_1$ and $L_2$ regularization on $\theta$, we illustrate the mapping between our $L_\infty$, $L_1$, $L_2$, and $L_n$ notations and the common interpretations of coefficient regularization. As shown in Table 3.2, the notations used in this section are interpreted in the dual space and are equivalent to regularization of the kernel coefficients in the primal space. The advantage of the dual-space interpretation is that we can easily extend the analogous solution to various machine learning algorithms, which were shown in the previous chapter to lead to similar Rayleigh quotient problems.

Table 3.2 The relationship between the norm of the regularization constraint in the primal problem and the norm of the kernel combination optimized in the dual problem

norm         | primal problem ($\theta_j$) | dual problem ($\alpha^T K_j \alpha$)
$L_\infty$     | $\|\theta\|_1 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_\infty$
$L_1$        | $\theta_j = \bar{\theta}$   | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_1$
$L_2$        | $\|\theta\|_2 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_2$
$L_{1.5}$    | $\|\theta\|_3 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_{1.5}$
$L_{1.3333}$ | $\|\theta\|_4 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_{1.3333}$
$L_{1.25}$   | $\|\theta\|_5 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_{1.25}$
$L_{1.2}$    | $\|\theta\|_6 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_{1.2}$
$L_{1.1667}$ | $\|\theta\|_7 = 1$          | $\max \|\{\alpha^T K_1 \alpha, \ldots, \alpha^T K_p \alpha\}\|_{1.1667}$
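The fractional dual norms listed in Table 3.2 follow directly from the conjugate relation $n = m/(m-1)$; the following short snippet (illustrative) reproduces them.

```python
# Dual-space norm n corresponding to each primal L_m constraint on theta
for m in range(2, 8):
    print(f"||theta||_{m} = 1  ->  L_{m / (m - 1):.4f}-norm in the dual")
```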

Next, we will investigate several concrete MKL algorithms and will propose the corresponding L2-norm and Ln-norm solutions.
