We consider the problem of minimizing a quadratic cost that is a function of a real vector $\alpha$ and a real positive semi-definite (PSD) matrix $Q$, given by

$$
\begin{aligned}
\underset{\alpha}{\text{minimize}} \quad & \alpha^T Q \alpha \\
\text{subject to} \quad & \alpha \in \mathcal{C},
\end{aligned}
\tag{3.1}
$$

where $\mathcal{C}$ denotes a convex set. Since $Q$ is PSD, $\alpha^T Q \alpha \geq 0$ for all $\alpha$. We will show that many machine learning problems can be cast in the form of (3.1) with additional constraints on $\alpha$. In particular, if we restrict $\alpha^T\alpha = 1$, the problem in (3.1) becomes a Rayleigh quotient and leads to an eigenvalue problem.
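As a quick illustration of this connection (a minimal sketch with a hypothetical random PSD matrix, not part of the original formulation), minimizing $\alpha^T Q \alpha$ under $\alpha^T\alpha = 1$ returns the smallest eigenvalue of $Q$, attained at the corresponding eigenvector:

```python
import numpy as np

# Hypothetical 4x4 example: build a random PSD matrix Q.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A.T @ A

# With alpha^T alpha = 1, the minimizer of alpha^T Q alpha is the eigenvector
# associated with the smallest eigenvalue of Q (the Rayleigh quotient argument).
eigvals, eigvecs = np.linalg.eigh(Q)
alpha_star = eigvecs[:, 0]

print(alpha_star @ Q @ alpha_star)  # equals eigvals[0], the smallest eigenvalue
print(eigvals[0])
```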
Now we consider a convex parametric linear combination of a set of $p$ PSD matrices $Q_j$, given by

$$
\Omega = \sum_{j=1}^{p} \theta_j Q_j, \qquad \theta_j \geq 0, \; Q_j \succeq 0, \; j = 1,\dots,p.
\tag{3.2}
$$
To bound the coefficients $\theta_j$, we impose a constraint, for example $\|\theta\|_1 = 1$; thus (3.1) can be equivalently rewritten as a min-max problem, given by
$$
\begin{aligned}
\underset{\alpha}{\text{minimize}} \;\; \underset{\theta}{\text{maximize}} \quad & \alpha^T \Bigl( \sum_{j=1}^{p} \theta_j Q_j \Bigr) \alpha \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1,\dots,p, \\
& \alpha \in \mathcal{C}, \\
& \theta_j \geq 0, \quad j = 1,\dots,p, \\
& \sum_{j=1}^{p} \theta_j = 1.
\end{aligned}
\tag{3.3}
$$
To solve (3.3), we denote $t = \alpha^T \bigl(\sum_{j=1}^{p}\theta_j Q_j\bigr)\alpha$; then the min-max problem can be formulated as a quadratically constrained linear program (QCLP) [10], given by
$$
\begin{aligned}
\underset{\alpha, t}{\text{minimize}} \quad & t \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1,\dots,p, \\
& \alpha \in \mathcal{C}, \\
& t \geq \alpha^T Q_j \alpha, \quad j = 1,\dots,p.
\end{aligned}
\tag{3.4}
$$
The optimal solution $\theta^*$ of (3.3) is obtained from the dual variables corresponding to the quadratic constraints in (3.4). The optimal $t^*$ is equivalent to the Chebyshev or $L_\infty$-norm of the vector of quadratic terms, given by

$$
t^* = \bigl\|\{\alpha^T Q_1\alpha, \dots, \alpha^T Q_p\alpha\}\bigr\|_\infty = \max\{\alpha^T Q_1\alpha, \dots, \alpha^T Q_p\alpha\}.
\tag{3.5}
$$

The $L_\infty$-norm is an upper bound with respect to the constraint $\sum_{j=1}^{p}\theta_j = 1$ because
$$
\alpha^T \Bigl(\sum_{j=1}^{p}\theta_j Q_j\Bigr)\alpha \leq t^*.
\tag{3.6}
$$
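As an illustration, the following is a minimal sketch of the QCLP in (3.4) using the cvxpy Python package (a stand-in for the cvx toolbox cited later for (3.14)). The convex set $\mathcal{C}$ is unspecified in the text, so a hypothetical simplex-style constraint and random PSD matrices are used here purely for illustration:

```python
import cvxpy as cp
import numpy as np

# Sketch of (3.4). Assumptions (not from the original text):
# C = {alpha : sum(alpha) = 1} and two random PSD matrices Q_j.
rng = np.random.default_rng(0)
d, p = 5, 2
Qs = []
for _ in range(p):
    A = rng.standard_normal((d, d))
    Qs.append(A.T @ A)  # random PSD matrix

alpha = cp.Variable(d)
t = cp.Variable()

constraints = [cp.sum(alpha) == 1]                          # alpha in C (illustrative choice)
constraints += [t >= cp.quad_form(alpha, Q) for Q in Qs]    # t >= alpha^T Q_j alpha

prob = cp.Problem(cp.Minimize(t), constraints)
prob.solve()

# At the optimum, t equals the L_inf-norm of the quadratic terms, cf. (3.5);
# the optimal theta can be read from the duals of the quadratic constraints.
s = [float(alpha.value @ Q @ alpha.value) for Q in Qs]
print(t.value, max(s))
print([c.dual_value for c in constraints[1:]])
```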
3.3.2 L2-norm MKL
Suppose the optimal $\alpha^*$ is given; optimizing the $L_\infty$-norm in (3.5) picks the single term with the maximal value, so the optimal solution of the coefficients is likely to be sparse. An alternative solution to (3.3) is to introduce a different constraint on the coefficients, for example $\|\theta\|_2 = 1$. We thus propose a new extension of the problem in (3.1), given by
$$
\begin{aligned}
\underset{\alpha}{\text{minimize}} \;\; \underset{\theta}{\text{maximize}} \quad & \alpha^T \Bigl( \sum_{j=1}^{p} \theta_j Q_j \Bigr) \alpha \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1,\dots,p, \\
& \alpha \in \mathcal{C}, \\
& \theta_j \geq 0, \quad j = 1,\dots,p, \\
& \|\theta\|_2 = 1.
\end{aligned}
\tag{3.7}
$$
This new extension can be solved analogously as a QCLP problem with modified constraints, given by

$$
\begin{aligned}
\underset{\alpha, \eta}{\text{minimize}} \quad & \eta \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1,\dots,p, \\
& \alpha \in \mathcal{C}, \\
& \eta \geq \|s\|_2,
\end{aligned}
\tag{3.8}
$$

where $s = \{\alpha^T Q_1\alpha, \dots, \alpha^T Q_p\alpha\}^T$. The proof that (3.8) solves (3.7) is given in the following theorem.
Theorem 3.1. The QCLP problem in (3.8) solves the problem in (3.7).
Proof. Given two vectors $\{x_1,\dots,x_p\}$ and $\{y_1,\dots,y_p\}$, with $x_j, y_j \in \mathbb{R}$, $j = 1,\dots,p$, the Cauchy-Schwarz inequality states that
$$
0 \leq \Bigl(\sum_{j=1}^{p} x_j y_j\Bigr)^2 \leq \sum_{j=1}^{p} x_j^2 \sum_{j=1}^{p} y_j^2,
\tag{3.9}
$$
with the equivalent form:
$$
0 \leq \Bigl[\Bigl(\sum_{j=1}^{p} x_j y_j\Bigr)^2\Bigr]^{\frac{1}{2}} \leq \Bigl(\sum_{j=1}^{p} x_j^2 \sum_{j=1}^{p} y_j^2\Bigr)^{\frac{1}{2}}.
\tag{3.10}
$$
Letting $x_j = \theta_j$ and $y_j = \alpha^T Q_j \alpha$, (3.10) becomes
$$
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \Bigl(\sum_{j=1}^{p} \theta_j^2 \sum_{j=1}^{p} \bigl(\alpha^T Q_j \alpha\bigr)^2\Bigr)^{\frac{1}{2}}.
\tag{3.11}
$$
Since $\|\theta\|_2 = 1$, (3.11) is equivalent to
$$
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \Bigl(\sum_{j=1}^{p} \bigl(\alpha^T Q_j \alpha\bigr)^2\Bigr)^{\frac{1}{2}}.
\tag{3.12}
$$
Therefore, given $s = \{\alpha^T Q_1\alpha, \dots, \alpha^T Q_p\alpha\}^T$, the additive term $\sum_{j=1}^{p}\theta_j \alpha^T Q_j \alpha$ is bounded by the $L_2$-norm $\|s\|_2$.
Moreover, it is easy to show that when $\theta_j^* = \alpha^T Q_j\alpha / \|s\|_2$, the parametric combination reaches the upper bound and equality holds. Optimizing this $L_2$-norm yields a non-sparse solution in $\theta_j$. To distinguish this from the solution obtained by (3.3) and (3.4), we denote it as the $L_2$-norm approach. It can also easily be seen (not shown here) that the $L_1$-norm approach simply averages the quadratic terms with uniform coefficients.
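A quick numerical check of this equality condition (a minimal sketch with hypothetical random data, not taken from the original text) is shown below: setting $\theta_j = \alpha^T Q_j\alpha / \|s\|_2$ attains the $L_2$ bound in (3.12), while any other feasible $\theta$ stays below it.

```python
import numpy as np

# Hypothetical random data: a fixed alpha and p random PSD matrices Q_j.
rng = np.random.default_rng(1)
d, p = 5, 4
alpha = rng.standard_normal(d)
Qs = []
for _ in range(p):
    A = rng.standard_normal((d, d))
    Qs.append(A.T @ A)

s = np.array([alpha @ Q @ alpha for Q in Qs])   # s_j = alpha^T Q_j alpha

theta_star = s / np.linalg.norm(s)              # theta*_j = alpha^T Q_j alpha / ||s||_2
print(theta_star @ s, np.linalg.norm(s))        # equal: the bound (3.12) is attained

theta = np.abs(rng.standard_normal(p))
theta /= np.linalg.norm(theta)                  # any other theta with ||theta||_2 = 1
print(theta @ s <= np.linalg.norm(s) + 1e-12)   # True: stays below the bound
```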
3.3.3 Ln-norm MKL
The $L_2$-norm bound also generalizes to any real number $n \geq 1$, which we define as $L_n$-norm MKL. Recently, a similar topic was investigated by Kloft et al. [27], who proposed a solution to the primal MKL problem. We will show that our primal-dual interpretation of MKL also extends to the $L_n$-norm. Assume that $\theta$ is regularized by the $L_m$-norm as $\|\theta\|_m = 1$; then the $L_m$-norm extension of (3.7) is given by
$$
\begin{aligned}
\underset{\alpha}{\text{minimize}} \;\; \underset{\theta}{\text{maximize}} \quad & \alpha^T \Bigl( \sum_{j=1}^{p} \theta_j Q_j \Bigr) \alpha \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1,\dots,p, \\
& \alpha \in \mathcal{C}, \\
& \theta_j \geq 0, \quad j = 1,\dots,p, \\
& \|\theta\|_m = 1.
\end{aligned}
\tag{3.13}
$$
In the following theorem, we prove that (3.13) can be equivalently solved as a QCLP problem, given by
$$
\begin{aligned}
\underset{\alpha, \eta}{\text{minimize}} \quad & \eta \\
\text{subject to} \quad & Q_j \succeq 0, \quad j = 1,\dots,p, \\
& \alpha \in \mathcal{C}, \\
& \eta \geq \|s\|_n,
\end{aligned}
\tag{3.14}
$$
where $s = \{\alpha^T Q_1\alpha, \dots, \alpha^T Q_p\alpha\}^T$ and the constraint is in the $L_n$-norm with $n = \frac{m}{m-1}$. The problem in (3.14) is convex and can be solved with the cvx toolbox [19, 20].
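As an illustration, here is a minimal Python sketch of (3.14) using cvxpy (as a stand-in for the MATLAB cvx toolbox cited above). The convex set $\mathcal{C}$, the random matrices $Q_j$, and the choice $m = 3$ (hence $n = m/(m-1) = 1.5$) are illustrative assumptions, not part of the original formulation.

```python
import cvxpy as cp
import numpy as np

# Sketch of (3.14). Assumptions: C = {alpha : sum(alpha) = 1}, random PSD Q_j, m = 3.
rng = np.random.default_rng(0)
d, p, m = 5, 3, 3.0
n = m / (m - 1.0)                                       # dual exponent, here 1.5
Qs = []
for _ in range(p):
    A = rng.standard_normal((d, d))
    Qs.append(A.T @ A)

alpha = cp.Variable(d)
eta = cp.Variable()

s = cp.hstack([cp.quad_form(alpha, Q) for Q in Qs])     # s_j = alpha^T Q_j alpha
constraints = [cp.sum(alpha) == 1,                      # alpha in C (illustrative choice)
               eta >= cp.norm(s, n)]                    # eta >= ||s||_n

prob = cp.Problem(cp.Minimize(eta), constraints)
prob.solve()
print(eta.value)
```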
Theorem 3.2. If the coefficient vector $\theta$ is regularized by an $L_m$-norm in (3.13), the problem can be solved as the convex programming problem in (3.14) with an $L_n$-norm constraint, where $n = \frac{m}{m-1}$.
Proof. We generalize the Cauchy-Schwarz inequality to Hölder's inequality. Let $m, n > 1$ be two numbers satisfying $\frac{1}{m} + \frac{1}{n} = 1$. Then
$$
0 \leq \sum_{j=1}^{p} x_j y_j \leq \Bigl(\sum_{j=1}^{p} x_j^m\Bigr)^{\frac{1}{m}} \Bigl(\sum_{j=1}^{p} y_j^n\Bigr)^{\frac{1}{n}}.
\tag{3.15}
$$
Letting $x_j = \theta_j$ and $y_j = \alpha^T Q_j \alpha$, (3.15) becomes
$$
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \Bigl(\sum_{j=1}^{p} \theta_j^m\Bigr)^{\frac{1}{m}} \Bigl(\sum_{j=1}^{p} \bigl(\alpha^T Q_j \alpha\bigr)^n\Bigr)^{\frac{1}{n}}.
\tag{3.16}
$$
Since $\|\theta\|_m = 1$, the term $\bigl(\sum_{j=1}^{p}\theta_j^m\bigr)^{\frac{1}{m}}$ equals one and can be omitted, so (3.16) is equivalent to
$$
0 \leq \sum_{j=1}^{p} \theta_j \alpha^T Q_j \alpha \leq \Bigl(\sum_{j=1}^{p} \bigl(\alpha^T Q_j \alpha\bigr)^n\Bigr)^{\frac{1}{n}}.
\tag{3.17}
$$
Since the condition $\frac{1}{m} + \frac{1}{n} = 1$ gives $n = \frac{m}{m-1}$, we have shown that with the $L_m$-norm constraint posed on $\theta$, the additive multiple kernel term $\sum_{j=1}^{p}\theta_j \alpha^T Q_j \alpha$ is bounded by the $L_n$-norm of the vector $\{\alpha^T Q_1\alpha, \dots, \alpha^T Q_p\alpha\}^T$, with $n = \frac{m}{m-1}$.
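As a sanity check of this Hölder bound (a minimal sketch with hypothetical random nonnegative data, not from the original text), one can sample $\theta$ on the $L_m$ sphere and confirm that $\sum_j \theta_j y_j$ never exceeds $\|y\|_n$ with $n = m/(m-1)$:

```python
import numpy as np

# Numerical sanity check of (3.17): y_j stands in for alpha^T Q_j alpha.
rng = np.random.default_rng(2)
p, m = 6, 3.0
n = m / (m - 1.0)                              # dual exponent, here 1.5
y = rng.random(p)

bound = np.sum(y ** n) ** (1.0 / n)            # ||y||_n, right-hand side of (3.17)
for _ in range(1000):
    theta = rng.random(p)
    theta /= np.sum(theta ** m) ** (1.0 / m)   # scale so that ||theta||_m = 1
    assert theta @ y <= bound + 1e-12
print("Hoelder bound holds for all sampled theta:", bound)
```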
In this section, we have explained the $L_\infty$, $L_1$, $L_2$, and $L_n$-norm approaches that extend the basic problem in (3.1) to multiple matrices $Q_j$. These approaches differ mainly in the constraints applied to the coefficients. To clarify how the notation used here relates to the common interpretation of $L_1$ and $L_2$ regularization on $\theta$, we map our $L_\infty$, $L_1$, $L_2$, and $L_n$ notation to the common interpretation of coefficient regularization. As shown in Table 3.2, the notation used in this section is interpreted in the dual space and is equivalent to regularizing the kernel coefficients in the primal space. The advantage of the dual-space interpretation is that we can easily extend the analogous solution to various machine learning algorithms, which were shown in the previous chapter to be similar Rayleigh quotient problems.
Table 3.2 The relationship between the norm of the regularization constraint in the primal problem and the norm of the kernel terms optimized in the dual problem
norm          primal problem ($\theta$)      dual problem ($\alpha^T K_j \alpha$)
$L_\infty$    $\|\theta\|_1 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_\infty$
$L_1$         $\theta_j = \bar{\theta}$      $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_1$
$L_2$         $\|\theta\|_2 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_2$
$L_{1.5}$     $\|\theta\|_3 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_{1.5}$
$L_{1.3333}$  $\|\theta\|_4 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_{1.3333}$
$L_{1.25}$    $\|\theta\|_5 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_{1.25}$
$L_{1.2}$     $\|\theta\|_6 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_{1.2}$
$L_{1.1667}$  $\|\theta\|_7 = 1$             $\max \|\{\alpha^T K_1\alpha, \dots, \alpha^T K_p\alpha\}\|_{1.1667}$
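The dual-norm values listed in Table 3.2 follow directly from $n = \frac{m}{m-1}$; a two-line check (rounded to four decimals, as in the table):

```python
# Dual exponents in Table 3.2 follow from n = m / (m - 1).
for m in [2, 3, 4, 5, 6, 7]:
    print(m, round(m / (m - 1), 4))   # 2 -> 2.0, 3 -> 1.5, 4 -> 1.3333, ..., 7 -> 1.1667
```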
Next, we will investigate several concrete MKL algorithms and will propose the corresponding L2-norm and Ln-norm solutions.