Support Vector and Kernel Methods for Pattern Recognition
A Little History
! Support Vector Machines (SVMs) were introduced at COLT-92 (Conference on Learning Theory) and have been greatly developed since then
! Result: a class of algorithms for Pattern Recognition (Kernel Machines)
! Now: a large and diverse community, from machine learning, optimization, statistics, neural networks, functional analysis, etc.
! Centralized website: www.kernel-machines.org
! Textbook (2000): see www.support-vector.net
Basic Idea
! Kernel Methods work by embedding the data into a vector space, and by detecting linear relations in that space
Kernel Function
! A learning algorithm performs the learning in the embedding space
! A kernel function takes care of the embedding
Overview of the Tutorial
! An extended example of the Kernel Perceptron
! Then more general algorithms and kernels (PCA; regression; clustering; …)
! Kernel methods exploit information about the inner products between data items
! Many standard algorithms can be rewritten so that they only require inner products between the data (inputs)
! Kernel functions = inner products in some feature space (potentially very complex)
! If the kernel is given, there is no need to specify what features of the data are being used
We will introduce this approach by using an example: the simplest algorithm with the simplest kernel. Later we will move to more complex algorithms and general kernels.
Perceptron
! Simplest case: classification. The decision function is a hyperplane in input space
! The Perceptron Algorithm (Rosenblatt, 1957)
! Useful to analyze the Perceptron algorithm before looking at SVMs and Kernel Methods in general
Dual Representation
! The weight vector can be written as a linear combination of the training points, w = \sum_i \alpha_i y_i x_i ; the coefficient of each point in this combination reflects its 'difficulty'
! It is possible to rewrite the perceptron algorithm using this alternative representation
Dual Representation
! The update rule can also be rewritten: when the point (x_i, y_i) is misclassified, set \alpha_i \leftarrow \alpha_i + 1
Duality: First Property of SVMs
! DUALITY is the first feature of Support Vector Machines (and Kernel Methods in general): the hypothesis can be represented in a dual fashion (in the decision function and in the training algorithm)
f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b
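To make the dual representation concrete, here is a minimal sketch of a dual (kernel) perceptron in Python/NumPy. It is not code from the tutorial: the function names are illustrative, and any kernel function K(x, z) can be plugged in.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual (kernel) perceptron: one coefficient alpha_i per training point."""
    m = len(X)
    alpha, b = np.zeros(m), 0.0
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(epochs):
        for i in range(m):
            # f(x_i) = sum_j alpha_j y_j K(x_j, x_i) + b
            f_i = np.sum(alpha * y * K[:, i]) + b
            if y[i] * f_i <= 0:      # mistake: update the dual coefficients
                alpha[i] += 1.0
                b += y[i]
    return alpha, b

def predict(x, X, y, alpha, b, kernel):
    return np.sign(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b)
```

With kernel = lambda x, z: float(np.dot(x, z)) this reduces to the ordinary perceptron; with any other kernel the same algorithm runs implicitly in feature space.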
Limitations of the Perceptron
! It can only deal with linearly separable data
Learning in the Feature Space
! Map the data into a feature space where they are linearly separable: x \rightarrow \phi(x)
The Kernel Trick
! Only the inner products between (mapped) data points are needed: there is no need for explicitly mapping the data to feature space, but just for working out the inner product in that space
! (the learning algorithm only needs this information to work)
Kernel-Induced Feature Spaces
! In the dual representation, the data points only appear inside dot products:
f(x) = \sum_i \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b
! The dimensionality of the feature space is not necessarily important; we may not even know the map \phi explicitly
Kernels
! A kernel is a function that returns the dot product between the images of its two arguments:
K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle
! Given a function K, it is possible to verify that it is a kernel
IMPORTANT CONCEPT
Kernels
! Any algorithm in dual form can be used in feature space by simply rewriting it in dual representation and replacing dot products with kernels:
\langle x_1, x_2 \rangle \leftarrow K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle
Example: a polynomial kernel in 2 dimensions. For x = (x_1, x_2) and z = (z_1, z_2):
\langle x, z \rangle^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \langle \phi(x), \phi(z) \rangle
where the feature map is \phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).
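As a sanity check on this example, the short NumPy snippet below (my own illustration, not part of the slides) compares the kernel value \langle x, z \rangle^2 with the explicit inner product \langle \phi(x), \phi(z) \rangle.

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel in 2 dimensions
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(x, z):
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, z))       # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))   # same value, via the explicit feature map
```

Both lines print the same number, which is exactly the point: the feature map never has to be computed explicitly.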
The Kernel Matrix
IMPORTANT CONCEPT

      | K(1,1)  K(1,2)  K(1,3)  …  K(1,m) |
      | K(2,1)  K(2,2)  K(2,3)  …  K(2,m) |
K =   |   …       …       …     …     …   |
      | K(m,1)  K(m,2)  K(m,3)  …  K(m,m) |
The Kernel Matrix
! It is the central structure of kernel machines
! It contains all the necessary information for the learning algorithm
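Computing a kernel matrix is straightforward once a kernel function is chosen; the helper below is a generic sketch (the name gram_matrix and the RBF kernel are my choices, not the tutorial's).

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel matrix K[i, j] = kernel(x_i, x_j) over a dataset X."""
    m = len(X)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = kernel(X[i], X[j])  # the matrix is symmetric
    return K

rbf = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
X = np.random.randn(5, 3)
K = gram_matrix(X, rbf)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # non-negative eigenvalues
```

The eigenvalue check anticipates the next slide: a valid kernel matrix is symmetric and positive semi-definite.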
Mercer's Theorem
! The kernel matrix is Symmetric and Positive Semi-Definite (it has non-negative eigenvalues)
! Conversely, any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is as an inner product matrix in some space
! In the continuous case, K is a valid kernel if for every (square-integrable) function f
\int K(x, x') f(x) f(x') \, dx \, dx' \ge 0
Making kernels
! We can obtain complex kernels by combining simpler ones according to a set of closure rules
! Let K_1, K_2 be kernels over X \times X, let c > 0, and let f be any real-valued function on X. Then the following are also kernels:
K(x, z) = K_1(x, z) + K_2(x, z)
K(x, z) = c \, K_1(x, z)
K(x, z) = K_1(x, z) \, K_2(x, z)
K(x, z) = f(x) f(z)
Making kernels (continued)
! Further constructions that preserve the kernel property:
K(x, z) = (K_1(x, z) + c)^d
K(x, z) = \exp(K_1(x, z))
K(x, z) = \frac{K_1(x, z)}{\sqrt{K_1(x, x) \, K_1(z, z)}}  (normalization)
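The closure rules above can be exercised directly in code. This is a sketch with my own helper names (k_sum, k_scale, k_prod, k_normalize); the exp-plus-normalization line is one way, under these rules, to end up with a Gaussian-type kernel.

```python
import numpy as np

def k_sum(k1, k2):    return lambda x, z: k1(x, z) + k2(x, z)
def k_scale(c, k1):   return lambda x, z: c * k1(x, z)        # requires c > 0
def k_prod(k1, k2):   return lambda x, z: k1(x, z) * k2(x, z)
def k_normalize(k1):  return lambda x, z: k1(x, z) / np.sqrt(k1(x, x) * k1(z, z))

linear = lambda x, z: float(np.dot(x, z))
poly3 = lambda x, z: (np.dot(x, z) + 1.0) ** 3                # polynomial construction
# exp(<x,z>) normalized equals exp(-||x - z||^2 / 2), a Gaussian kernel
gaussian = k_normalize(lambda x, z: np.exp(np.dot(x, z)))

combined = k_sum(k_scale(0.5, linear), k_prod(poly3, gaussian))
x, z = np.random.randn(4), np.random.randn(4)
print(combined(x, z))
```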
Making kernels from features
! Alternatively, one can define a set of features and then extract the kernel as described before, as the inner product of the feature vectors
Kernels over General Structures
! Haussler, Watkins, etc.: kernels over sets, over sequences, over trees, etc.
! Applied in text categorization, bioinformatics, etc.
A bad kernel …
! … would be a kernel whose matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure …

      | 1  0  0  …  0 |
      | 0  1  0  …  0 |
K =   | 0  0  1  …  0 |
      | …  …  …  …  … |
      | 0  0  0  …  1 |
No Free Kernel
! In the presence of too many irrelevant features, the kernel matrix becomes diagonal
! Some prior knowledge of the target is needed, so choose a good kernel
IMPORTANT CONCEPT
Other Kernel-based algorithms
! Note: other algorithms can use kernels, not just LLMs (e.g. clustering, PCA, etc.). A dual representation is often possible (in optimization problems, by the Representer theorem).
The Generalization Problem
! The curse of dimensionality: it is easy to overfit in high dimensional spaces (= regularities could be found in the training set that are accidental, i.e. that would not be found again in a test set)
! The SVM problem is ill posed (finding one hyperplane that separates the data: many such hyperplanes exist)
! We need a principled way to choose the best possible hyperplane
NEW TOPIC
The Generalization Problem
! There are many principled ways to choose the best hyperplane (inductive principles): pac, MDL, …
! Here we use a simple case motivated by statistical learning theory (it will give the basic SVM)
Statistical (Computational) Learning Theory
! Generalization bounds on the risk of overfitting (in a p.a.c. setting: assumption of i.i.d. data; etc.)
! Standard bounds from VC theory give upper and lower bounds proportional to the VC dimension
! The VC dimension of LLMs is proportional to the dimension of the space (which can be huge)
Assumptions and Definitions
! A distribution D over the input space X
! Train and test points drawn randomly (i.i.d.) from D
! VC dimension: the size of the largest subset of X shattered by H (every dichotomy of the subset can be realized by some hypothesis in H)
! Typically VC >> m, so the bound is not useful
! And it does not tell us which hyperplane to choose
Margin Based Bounds
\epsilon = O\!\left( \frac{(R/\gamma)^2}{m} \right), \qquad \gamma = \min_i \frac{y_i f(x_i)}{\lVert f \rVert}
Note: compression bounds and online bounds also exist.
Margin Based Bounds
! The worst case bound still holds, but if we are lucky (the margin is large) the margin bound can be applied and better generalization can be achieved:
\epsilon = O\!\left( \frac{(R/\gamma)^2}{m} \right)
! Best hyperplane: the maximal margin one
! The margin is large if the kernel is chosen well
IMPORTANT CONCEPT
Maximal Margin Classifier
! Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space
! SVMs control capacity by increasing the margin, not by reducing the number of degrees of freedom (dimension free capacity control)
Two kinds of margin
! Functional margin of a point: y_i f(x_i)
! Geometric margin: y_i f(x_i) / \lVert w \rVert (the distance of the point from the hyperplane)
Two kinds of margin
Max Margin = Minimal Norm
! If we fix the functional margin to 1 (canonical hyperplane), the geometric margin equals 1 / \lVert w \rVert, so maximizing the margin is the same as minimizing the norm:
\langle w, x^+ \rangle + b = +1
\langle w, x^- \rangle + b = -1
\Rightarrow \langle w, (x^+ - x^-) \rangle = 2
\Rightarrow \left\langle \frac{w}{\lVert w \rVert}, (x^+ - x^-) \right\rangle = \frac{2}{\lVert w \rVert}
Optimization Theory
! The problem of finding the maximal margin hyperplane is a constrained optimization problem:
minimize  \frac{1}{2} \langle w, w \rangle
subject to  y_i (\langle w, x_i \rangle + b) \ge 1 \quad \text{for all } i
! It is solved with Lagrange multipliers \alpha_i \ge 0 (Kuhn-Tucker theory):
L(w, b, \alpha) = \frac{1}{2} \langle w, w \rangle - \sum_i \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right]
From Primal to Dual
! Differentiate the Lagrangian and substitute:
\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i
\frac{\partial L}{\partial b} = \sum_i \alpha_i y_i = 0
! Substituting back into the Lagrangian gives the dual problem:
maximize  W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to  \sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0
! The solution is  w = \sum_i \alpha_i y_i x_i
IMPORTANT STEP
Properties of the Solution
! Sparseness: only the points nearest to the hyperplane (functional margin = 1) have positive weight \alpha_i
! They are called support vectors
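For illustration, here is a small sketch that solves this dual with a general-purpose constrained optimizer (SciPy's SLSQP). It assumes separable data (hard margin) and uses my own function names; real SVM implementations use specialized solvers such as SMO.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_fit(K, y):
    """Hard-margin SVM dual:
    maximize  sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K_ij
    s.t.      sum_i a_i y_i = 0,  a_i >= 0
    """
    m = len(y)
    Q = (y[:, None] * y[None, :]) * K                  # Q_ij = y_i y_j K_ij

    def neg_dual(a):                                   # minimize -W(alpha)
        return 0.5 * a @ Q @ a - a.sum()

    cons = {"type": "eq", "fun": lambda a: a @ y}
    res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                   bounds=[(0, None)] * m, constraints=[cons])
    alpha = res.x
    # recover b from a support vector (alpha_i > 0): y_i (sum_j a_j y_j K_ij + b) = 1
    sv = int(np.argmax(alpha))
    b = y[sv] - (alpha * y) @ K[:, sv]
    return alpha, b
```

The entries of alpha that come out (numerically) positive correspond to the support vectors.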
Soft Margin Classifier
! The data may still not be separable (even in feature space)
! One could always separate them with a 'finer' kernel, but that is not a good idea (risk of overfitting)
! Better: use a soft margin, which tolerates mislabeled points
New Topic
In the case of non-separable data in feature space, the margin distribution can be optimized with slack variables \xi_i:
Minimize:  \frac{1}{2} \langle w, w \rangle + C \sum_i \xi_i
Or:  \frac{1}{2} \langle w, w \rangle + C \sum_i \xi_i^2
Subject to:  y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
! The corresponding dual problems have the same form as before: the 1-norm (linear) slack term gives the box constraints 0 \le \alpha_i \le C with \sum_i \alpha_i y_i = 0, while the 2-norm (quadratic) slack term amounts to adding 1/C to the diagonal of the kernel matrix
! The parameter C controls the effect of outliers, preventing (or discouraging) individual points from having too much influence
Kernel Ridge Regression
! Ridge regression as a constrained optimization problem:
minimize  \lambda \langle w, w \rangle + \sum_i \xi_i^2
subject to  y_i - \langle w, x_i \rangle = \xi_i
! The corresponding dual problem:
maximize  W(\alpha) = \sum_i y_i \alpha_i - \frac{1}{4\lambda} \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle - \frac{1}{4} \sum_i \alpha_i^2
In matrix notation:
W(\alpha) = y' \alpha - \frac{1}{4\lambda} \alpha' K \alpha - \frac{1}{4} \alpha' \alpha
The Solution
! Setting the gradient of W(\alpha) to zero gives
\alpha = 2\lambda (K + \lambda I)^{-1} y
! and the resulting regression function is
f(x) = \langle w, x \rangle = y' (K + \lambda I)^{-1} k, \qquad k_i = \langle x_i, x \rangle
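A compact NumPy sketch of this solution (my own illustration): instead of forming the intermediate \alpha above, it solves directly for the coefficient vector c = (K + \lambda I)^{-1} y that appears in the prediction formula.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Kernel ridge regression: coefficients c such that f(x) = sum_i c_i K(x_i, x)."""
    m = len(y)
    # solve (K + lambda I) c = y rather than inverting the matrix explicitly
    return np.linalg.solve(K + lam * np.eye(m), y)

def krr_predict(c, k_new):
    """k_new[i] = K(x_i, x) for a new point x."""
    return c @ k_new
```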
Ridge Regression
! Other loss functions can be used instead of the square loss
! With the square loss there is no sparseness
! One can use the epsilon-insensitive loss to obtain sparse solutions (SV regression)
[Figure: the ε-insensitive loss L plotted against the residual y_i - \langle w, x_i \rangle - b, in its linear and quadratic versions; the loss is zero inside the tube of width ε.]
! If the points are close enough to the function, they 'pay no loss'
! If they are outside the insensitive region, they pay in proportion to their distance from it (linearly or quadratically)
! This gives sparsity back: points inside the insensitive region will have zero alpha …
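The loss itself is one line of code; this small sketch (names mine) covers both the linear and the quadratic variant.

```python
import numpy as np

def eps_insensitive(residual, eps, quadratic=False):
    """Zero inside the tube of width eps, linear (or squared) outside it."""
    excess = np.maximum(np.abs(residual) - eps, 0.0)
    return excess ** 2 if quadratic else excess

r = np.array([-0.3, 0.05, 0.8])
print(eps_insensitive(r, eps=0.1))   # [0.2, 0.0, 0.7]
```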
! The resulting support vector regression problem, in primal form:
minimize  \frac{1}{2} \langle w, w \rangle + C \sum_i (\xi_i + \hat{\xi}_i)
subject to  (\langle w, x_i \rangle + b) - y_i \le \epsilon + \xi_i, \qquad y_i - (\langle w, x_i \rangle + b) \le \epsilon + \hat{\xi}_i, \qquad \xi_i, \hat{\xi}_i \ge 0
! and its dual:
maximize  \sum_i y_i (\hat{\alpha}_i - \alpha_i) - \epsilon \sum_i (\hat{\alpha}_i + \alpha_i) - \frac{1}{2} \sum_{i,j} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j) K(x_i, x_j)
subject to  \sum_i (\hat{\alpha}_i - \alpha_i) = 0, \qquad 0 \le \alpha_i, \hat{\alpha}_i \le C
Given two classes of points, find their centers of mass, and label new points according to the nearest center of mass.
New Topic
! The decision function compares the distances to the two centers of mass c^+ and c^-:
h(x) = \text{sign}\big( \lVert x - c^- \rVert^2 - \lVert x - c^+ \rVert^2 \big) = \text{sign}\big( 2\langle x, c^+ \rangle - 2\langle x, c^- \rangle + \lVert c^- \rVert^2 - \lVert c^+ \rVert^2 \big)
! Since the centers of mass are averages of the training points, everything can be expressed with inner products, and hence with kernels. In particular, distances in feature space require only the kernel:
\lVert \phi(x) - \phi(z) \rVert^2 = \langle \phi(x), \phi(x) \rangle - 2\langle \phi(x), \phi(z) \rangle + \langle \phi(z), \phi(z) \rangle = K(x, x) - 2K(x, z) + K(z, z)
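A kernel-only sketch of this nearest-center-of-mass classifier (the function and variable names are mine): it works from the train/train and test/train kernel matrices, using the distance expansion above.

```python
import numpy as np

def nearest_centroid_kernel(K_test, K_train, y):
    """Label test points by the nearest class mean in feature space.

    K_train[i, j] = K(x_i, x_j) over the training set,
    K_test[t, i]  = K(x_t, x_i) between test and training points,
    y: training labels in {-1, +1}.
    """
    pos, neg = (y == 1), (y == -1)
    # ||phi(x) - c||^2 = K(x, x) - (2/m) sum_i K(x, x_i) + (1/m^2) sum_ij K(x_i, x_j)
    d_pos = -2 * K_test[:, pos].mean(axis=1) + K_train[np.ix_(pos, pos)].mean()
    d_neg = -2 * K_test[:, neg].mean(axis=1) + K_train[np.ix_(neg, neg)].mean()
    # the K(x, x) term is common to both distances and cancels in the comparison
    return np.where(d_pos < d_neg, 1, -1)
```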
New Topic
PCA
! PCA extracts the principal components of a data vector by projecting it onto the eigenvectors of the dataset's covariance …
! Assume the data are centered
! Define the covariance matrix:  C = \frac{1}{m} \sum_i x_i x_i^T
! Define the eigenvectors of the covariance:  C v = \lambda v
Kernel PCA
! Since  C v = \frac{1}{m} \sum_j \langle x_j, v \rangle x_j , all solutions v with nonzero \lambda lie in the span of x_1, …, x_m
! Combining this with the eigenvalue equation, we obtain the equivalent conditions
\lambda \langle x_i, v \rangle = \langle x_i, C v \rangle \quad \text{for all } i = 1, \dots, m
Kernel PCA
! In feature space the covariance is  C = \frac{1}{m} \sum_i \phi(x_i) \phi(x_i)^T , and we solve  \lambda v = C v
! We know that the eigenvectors can be expressed as linear combinations of the images of the training vectors:
v = \sum_i \alpha_i \phi(x_i)
! We will characterize them by the corresponding coefficients \alpha. Substituting, with  K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle :
\lambda \langle \phi(x_n), v \rangle = \langle \phi(x_n), C v \rangle \quad \text{for all } n
\;\Rightarrow\; m \lambda K \alpha = K^2 \alpha
! This reduces to an eigenvalue problem for the kernel matrix:
m \lambda \alpha = K \alpha
! Normalization: requiring  \langle v^n, v^n \rangle = 1  for the n-th eigenvector gives
1 = \sum_{i,j} \alpha_i^n \alpha_j^n \langle \phi(x_i), \phi(x_j) \rangle = \langle \alpha^n, K \alpha^n \rangle = \lambda_n \langle \alpha^n, \alpha^n \rangle
where \lambda_n is the corresponding eigenvalue of K
! Normalize the alpha coefficients as above
! Extract the principal components of a new point x by:
\langle v^n, \phi(x) \rangle = \sum_i \alpha_i^n \langle \phi(x_i), \phi(x) \rangle = \sum_i \alpha_i^n K(x_i, x)
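Putting the pieces together, here is a minimal kernel PCA sketch in NumPy (it assumes, as the slides do, that the data are centered in feature space and that the leading eigenvalues are positive; the function name and defaults are mine).

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project the training points onto the leading kernel principal components."""
    eigvals, eigvecs = np.linalg.eigh(K)               # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]     # pick the largest ones
    lambdas, alphas = eigvals[idx], eigvecs[:, idx]
    alphas = alphas / np.sqrt(lambdas)                 # enforce <v_n, v_n> = 1
    # projection of x_j onto v_n: sum_i alpha_i^n K(x_i, x_j)
    return K @ alphas
```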
Discussion …
Like normal PCA, kernel PCA also has the property that most of the information (variance) is contained in the first principal components (projections onto the leading eigenvectors).
Etc., etc.
Spectral Methods
! Given a partially labeled set, complete the labeling (TRANSDUCTION)
! Idea: use the labels to learn a kernel, then use the kernel to label the data
New Topic
Kernel Alignment
! Alignment (= similarity between two kernel matrices):
A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}, \qquad \langle K_1, K_2 \rangle = \sum_{i,j} K_1(i, j) \, K_2(i, j)
It is a similarity measure between kernel matrices; that is, it depends on the sample. A more general version can naturally be defined using the input distribution. We could call the general one 'alignment', and the one defined here 'empirical alignment'.
! The empirical alignment is concentrated around its expectation: a bound of the form
P\big( \lvert A(S) - E[A(S)] \rvert > \epsilon \big) \le e^{-\epsilon^2 m f}
holds, where f is some function of the sample (details omitted)
! McDiarmid's theorem is used to prove this concentration
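Measured on a sample, the empirical alignment is a one-liner. The snippet below (my own, with an arbitrary example matrix) computes the alignment of a kernel matrix with the 'ideal' kernel yy' discussed a few slides further on.

```python
import numpy as np

def alignment(K1, K2):
    """Empirical alignment: Frobenius inner product, normalized."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

y = np.array([1, 1, -1, -1])
K_ideal = np.outer(y, y)                      # ideal kernel yy' for this labeling
K = 0.5 * np.eye(4) + 0.5 * K_ideal           # an example kernel matrix
print(alignment(K, K_ideal))
```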
Kernel Selection or Combination
! Choose K_1 from a set of kernels so as to optimize the alignment
! If the set is convex, this leads to a convex optimization problem
! Consider the spectral decomposition  K = \sum_i \lambda_i v_i v_i'  of the kernel matrix K: thresholding the eigenvectors of K we can obtain many different labelings of the sample, and then we can consider the set of their convex combinations
Fixed K, choose best Y
! Finding the labeling Y that best aligns with a fixed kernel K is a clustering problem
! This is a complex (combinatorial) task; we approximate it with a convex problem
The ideal kernel
! For a labeling Y \in \{-1, +1\}^m, the ideal kernel matrix is YY', with entry (i, j) equal to +1 when the two points share a label and -1 otherwise:

         |  1   1  …  -1  -1 |
         |  1   1  …  -1  -1 |
YY' =    |  …   …  …   …   … |
         | -1  -1  …   1   1 |
         | -1  -1  …   1   1 |
Spectral Machines
! Idea: maximize the alignment of a set of labels to a given kernel
! Use the Courant-Fischer theorem
! For A symmetric and positive definite, the largest eigenvalue is characterized by:
\lambda_{\max}(A) = \max_{v \neq 0} \frac{v' A v}{v' v}
Optimizing Kernel Alignment
! One can obtain (approximate) labels by thresholding the first eigenvector of the kernel matrix
! Refinements exist (see website): using the Laplacian; or using SDP …
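A sketch of this thresholding step (my own illustration): take the eigenvector of K associated with the largest eigenvalue and read off the signs.

```python
import numpy as np

def spectral_labels(K):
    """Label the sample by thresholding the first eigenvector of the kernel matrix."""
    eigvals, eigvecs = np.linalg.eigh(K)
    v1 = eigvecs[:, np.argmax(eigvals)]       # eigenvector of the largest eigenvalue
    return np.where(v1 >= 0, 1, -1)

y = np.array([1, 1, 1, -1, -1])
K = np.outer(y, y) + 0.1 * np.eye(5)          # a kernel well aligned with the labels
print(spectral_labels(K))                     # recovers y (up to a global sign flip)
```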
Using the Alignment for Kernel Adaptation
Kernel Methods Recap
$ Mapping the data into a (feature) space
$ Using algebra, optimization, and statistics to detect linear relations in that space
Modularity
! Any kernel-based learning algorithm is composed of two modules:
$ A general purpose learning machine
$ A problem specific kernel function
! Any kernel-based algorithm can be fitted with any kernel
BIOINFO APPLICATIONS
! Some applications of Kernel Methods to bioinformatics problems
! Also transduction methods, and others
! Examples: spectroscopy data; QSAR data; protein fold prediction; …
NEW TOPIC !
Diversity of Bioinformatics Data
About bioinformatics problems
! Types of data: sequences (DNA or proteins); gene expression data; SNPs; proteomics; etc.
! Types of tasks: diagnosis; gene function prediction; protein fold prediction; drug design; …
! Types of problems: high dimensional; noisy; very small or very large datasets; heterogeneous data; …
Gene Expression Data
! Microarrays measure the expression level of thousands of genes simultaneously, in a cell or tissue sample
! (genes make proteins by producing RNA; a gene is expressed when its RNA is present…)
! One can classify tissue samples, or classify the genes themselves (by transposing the data matrix)
Gene Function Prediction
! Predict functional roles for yeast genes based on their expression profiles
! Given a set of 2467 genes, their expression was observed under 79 conditions (from Eisen et al.)
! Genes were assigned to 5 functional classes (from the MIPS yeast genome database): TCA cycle; respiration; cytoplasmic ribosomes; proteasome; histones
! SVM: learn to predict the class based on the expression profile.
Gene Function Prediction
! SVMs were compared with 5 other algorithms and performed best (Parzen windows; Fisher discriminant; decision trees; etc.).
! Also used to assign 'new' genes to their functional class
! Often the mistakes have a biological interpretation … See the paper (and website).
! Brown, Grundy, Lin, Cristianini, Sugnet, Furey, Ares, Haussler, "Knowledge Based Analysis of Microarray Gene Expression Data Using Support Vector Machines", PNAS
! www.cse.ucsc.edu/research/compbio
Gene Function Prediction
! Not every functional class can be expected to be predicted on the basis of expression profiles
! The classes used were chosen based on biological knowledge: they were expected to show correlation with expression
! Other types of data are also expected to have correlation:
! Phylogenetic Data: obtained from comparison of a given gene with other genomes
! Simplest Phylogenetic Profile: a bit string in which each bit indicates whether the gene of interest has a close homolog in the corresponding genome
! More detailed: the negative log of the lowest E value reported by BLAST in a search against a complete genome
! Merged with Expression data to improve performance in Function Identification
Heterogeneous Data
! A similar pattern of occurrence across species could indicate 1) a functional link (they might need each other to function, so they occur together). It could also simply indicate 2) sequence similarity
! Used 24 genomes from the Sanger Centre website
! Again: only some functional classes can benefit from this type of data.
! Generalization improves, but mostly through effect 2): a way to summarize sequence similarity information
! Pavlidis, Weston, Cai, Grundy, "Gene Functional Classification from Heterogeneous Data", International Conference on Computational Molecular Biology, 2001
Cancer Detection
! Task: automatic classification of tissue samples
! Case study: ovarian cancer
! Dataset of 97808 cDNAs for each tissue (each of which may or may not correspond to a gene)
! Just 31 tissues of 3 types: ovarian cancer; normal ovarian tissue; other normal tissues (15 positives and 16 negatives)
! Furey, Cristianini, Duffy, Bednarski, Schummer, Haussler, "Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data", Bioinformatics
Ovarian Cancer
! Task: decide whether a given tissue sample is cancerous or not
! Also of interest: the genes potentially responsible for the classification
! Cross validation experiments (leave-one-out).
! Located a consistently misclassified point. The sample was considered cancerous by the SVM (and dubious by the humans that originally labelled it as OK). It was re-labelled.
! The only non-ovarian tissue is also misclassified consistently. It was removed.
! After its removal: perfect generalization
! An attempt to locate the most correlated genes gives less interesting results (Fisher score used for ranking, with an independence assumption).
! Only 5 of the top 10 are actually genes, and only 3 are cancer related