Support Vector Machines
& Kernels
Doing really well with linear decision surfaces
Outline
Prediction
Why might predictions be wrong?
Support vector machines
Doing really well with linear models
Making the non-linear linear
Why Might Predictions be Wrong?
• True non-determinism
– Flip a biased coin
– p(heads) = θ
– Estimate θ
– If θ > 0.5 predict ‘heads’, else ‘tails’
Lots of ML research on problems like this:
– Learn a model
– Do the best you can in expectation
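As a minimal sketch of this setup (the coin's true bias and the sample size below are illustrative assumptions), here is the estimate-then-threshold rule in Python:

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                        # true (unknown) p(heads), assumed for illustration
flips = rng.random(1000) < theta_true   # observed training flips

theta_hat = flips.mean()                # estimate theta from the data
prediction = 'heads' if theta_hat > 0.5 else 'tails'

print(f"estimated theta = {theta_hat:.3f}, predict '{prediction}'")
# Even the best possible predictor is wrong with probability min(theta, 1 - theta):
print(f"irreducible error rate ≈ {min(theta_true, 1 - theta_true):.2f}")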
Why Might Predictions be Wrong?
• Partial observability
– Something needed to predict y is missing from observation x
– N-bit parity problem (a sketch follows this list)
• x contains N−1 bits (hard PO)
• x contains N bits but the learner ignores some of them (soft PO)
• Noise in the observation x
– Measurement error
– Instrument limitations
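To make the hard partial-observability case concrete, here is a small sketch of the N-bit parity problem (N and the sample size are illustrative assumptions): when one bit is hidden, no rule computable from the visible bits does better than chance.

import numpy as np

rng = np.random.default_rng(0)
N = 8
X_full = rng.integers(0, 2, size=(10000, N))
y = X_full.sum(axis=1) % 2         # the label is the parity of all N bits

X_obs = X_full[:, :N - 1]          # the learner only sees N-1 bits (hard PO)
y_hat = X_obs.sum(axis=1) % 2      # one candidate rule: parity of the visible bits
print("accuracy with one bit hidden:", (y_hat == y).mean())   # ≈ 0.5; no rule can do better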
Why Might Predictions be Wrong?
• True non-determinism
Representational Bias
• Having the right features (x) is crucial
Support Vector Machines
Doing Really Well with Linear Decision Surfaces
Strengths of SVMs
• Good generalization
– in theory
– in practice
• Works well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick
Minor Notation Change
To better match the notation used for SVMs, and to make matrix formulas simpler, we will drop the superscript for the i-th instance and write xi instead.
Intuitions
A “Good” Separator
Noise in the Observations
Ruling Out Some Separators
Lots of Noise
Only One Separator Remains
Maximizing the Margin
“Fat” Separators
[Figure: a fat separator, with the margin labeled]
Why Maximize Margin
Increasing margin reduces capacity
• i.e., fewer possible models
Lesson from Learning Theory:
• If the following holds:
– H is sufficiently constrained in size
then low training error is likely to be evidence of low generalization error
Alternative View of Logistic Regression
Support Vector Machine
Maximum Margin Hyperplane
[Figure: the maximum margin hyperplane defined by θ; margin = 2 / ‖θ‖2]
Support Vectors
[Figure: the support vectors are the points lying on the margin boundaries]
Large Margin Classifier in Presence of Outliers
Maximizing the Margin
min_θ ½ ‖θ‖2²  subject to θ·xi ≥ 1 if yi = 1 and θ·xi ≤ −1 if yi = −1
Let p be the projection of xi onto the vector θ, so that θ·xi = p ‖θ‖2
[Figure: projections of points onto θ and −θ]
Since p is small, ‖θ‖2 must be large to have p ‖θ‖2 ≥ 1 (or ≤ −1)
Since p is larger, ‖θ‖2 can be smaller in order to have p ‖θ‖2 ≥ 1 (or ≤ −1)
Size of the Margin
[Figure: the width of the margin, expressed in terms of ‖θ‖2]
The SVM Dual Problem
The primal SVM problem was given as
min_θ ½ ‖θ‖2²  subject to yi(θ·xi + θ0) ≥ 1 for all i
Can solve it more efficiently by taking the Lagrangian dual
• Duality is a common idea in optimization
• It transforms a difficult optimization problem into a simpler one
– αi indicates how important a particular constraint is to the solution
The SVM Dual Problem
• The Lagrangian is given by
L(θ, α) = ½ ‖θ‖2² − Σi αi [yi(θ·xi + θ0) − 1]
• We must minimize over θ and maximize over α
• At the optimal solution, the partial derivatives with respect to θ are 0
• Solve by a bunch of algebra and calculus and we obtain the dual:
max_α Σi αi − ½ Σi Σj αi αj yi yj (xi·xj)  subject to αi ≥ 0 and Σi αi yi = 0
Understanding the Dual
• The constraint Σi αi yi = 0 balances the weight of constraints between the different classes
• Constraint weights (the αi's) cannot be negative
Understanding the Dual
• Intuitively, we should be more careful around points near the margin
• In the quadratic term Σi Σj αi αj yi yj (xi·xj):
– Points with different labels increase the sum
– Points with the same label decrease the sum
– The dot product xi·xj measures the similarity between points
Understanding the Dual
In the solution, either:
• αi > 0 and the constraint is tight, i.e., yi(θ·xi + θ0) = 1 (a support vector), or
• αi = 0 and the constraint is inactive
Employing the Solution
• Given the optimal solution α*, the optimal weights are
θ* = Σ_{i ∈ SVs} αi* yi xi
– In this formulation, we have not added x0 = 1
• Therefore, we can solve one of the support-vector constraints
yi(θ* · xi + θ0) = 1
to obtain θ0
– Or, more commonly, take the average solution over all support vectors
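A minimal sketch of these steps, assuming a tiny linearly separable toy dataset and using scipy's general-purpose SLSQP solver as a stand-in for a real SVM package: solve the dual for α, then recover θ and θ0 from the support vectors.

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):                           # negate the dual so we can minimize it
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

n = len(y)
res = minimize(neg_dual, np.zeros(n), method='SLSQP',
               bounds=[(0.0, None)] * n,                               # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])   # sum_i alpha_i y_i = 0
alpha = res.x

sv = alpha > 1e-6                              # support vectors have alpha_i > 0
theta = ((alpha * y)[:, None] * X).sum(axis=0) # theta* = sum_i alpha_i* y_i x_i
theta_0 = np.mean(y[sv] - X[sv] @ theta)       # average over the tight SV constraints
print("support vectors:", np.where(sv)[0], " theta:", theta, " theta_0:", theta_0)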
What if Data Are Not Linearly Separable?
• Cannot find a θ that satisfies yi(θ·xi + θ0) ≥ 1 for every i
• Introduce slack variables ξi and solve the soft-margin problem
min_θ ½ ‖θ‖2² + C Σi ξi  subject to yi(θ·xi + θ0) ≥ 1 − ξi and ξi ≥ 0
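A hedged sketch of the effect of the slack penalty in practice, using scikit-learn's SVC (a libsvm wrapper); the overlapping toy data and the specific C values are illustrative assumptions. C trades off a large margin against the total slack Σ ξi.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:<6}  support vectors={clf.n_support_.sum():3d}  "
          f"train accuracy={clf.score(X, y):.2f}")
# Small C tolerates slack (wider margin, more support vectors); large C penalizes slack heavily.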
Strengths of SVMs
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick …
What if Surface is Non-Linear?
[Figure: X points surrounded by O points – no linear separator exists]
Image from http://www.atrandomresearch.com/iclass/
Kernel Methods
Making the Non-Linear Linear
When Linear Separators Fail
Mapping into a New Feature Space
Define a mapping Φ that takes each instance xi = (xi1, xi2, …) into a new feature space
• Rather than run the SVM on xi, run it on Φ(xi)
– Find a non-linear separator in the input space
• What if Φ(xi) is really big?
• Use kernels to compute it implicitly!
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
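A small sketch of this idea under assumed toy data (points labeled by whether they fall inside the unit circle) and an explicit degree-2 monomial map Φ: the data are not linearly separable in the original inputs, but a linear SVM on Φ(x) separates them well.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)    # class 1 = inside the unit circle

def phi(X):
    # Phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2): original features plus degree-2 monomials
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2, X[:, 1]**2, X[:, 0] * X[:, 1]])

print("linear SVM on x:     ", LinearSVC(max_iter=10000).fit(X, y).score(X, y))
print("linear SVM on Phi(x):", LinearSVC(max_iter=10000).fit(phi(X), y).score(phi(X), y))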
Kernels
• Find a kernel K such that
K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩
• Computing K(xi, xj) should be efficient, much more so than computing Φ(xi) and Φ(xj)
• Use K(xi, xj) in the SVM algorithm rather than ⟨xi, xj⟩
• Remarkably, this is possible!
The Polynomial Kernel
• Given by K(xi, xj) = ⟨xi, xj⟩^d
– Φ(x) contains all monomials of degree d
• Useful in visual pattern recognition
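As a quick sanity check of the polynomial kernel with d = 2 (the homogeneous form above), the implicit value ⟨x, z⟩² matches the explicit dot product of the degree-2 monomial features; the particular feature map below, with its √2 cross-term, is one standard choice.

import numpy as np

def phi2(v):
    # Degree-2 monomials of a 2-D vector: (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = (x @ z) ** 2        # kernel value: never forms Phi explicitly
k_explicit = phi2(x) @ phi2(z)   # same value via the explicit feature map
print(k_implicit, k_explicit)    # both equal 1.0 for these points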
The Kernel Trick
“Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2”
SVMs can use the kernel trick
Incorporating Kernels into SVM
• In the dual problem and in the prediction rule, replace every inner product xi·xj with K(xi, xj)
The Gaussian Kernel
• Also called the Radial Basis Function (RBF) kernel:
K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
– Has value 1 when xi = xj
– Value falls off to 0 with increasing distance
– Note: need to do feature scaling before using the Gaussian kernel
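A minimal sketch of the kernel itself (σ = 1 is an arbitrary illustrative choice), showing the two properties above: the value is 1 when the points coincide and decays toward 0 with distance.

import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

x = np.zeros(2)
for d in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"distance {d}:  K = {gaussian_kernel(x, np.array([d, 0.0])):.4f}")
# Because K depends on squared distances, unscaled features with large ranges dominate it,
# which is why features should be standardized (e.g., with StandardScaler) first.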
Gaussian Kernel Examples
[Example figures omitted]
Other Kernels
• Sigmoid Kernel
– Neural networks use the sigmoid as an activation function
– An SVM with a sigmoid kernel is equivalent to a 2-layer perceptron
• Cosine Similarity Kernel
K(xi, xj) = (xi · xj) / (‖xi‖2 ‖xj‖2)
– Popular choice for measuring the similarity of text documents
– The L2 norm projects the vectors onto the unit sphere; their dot product is then the cosine of the angle between them
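A small sketch of the cosine similarity kernel on made-up term-count vectors (the tiny "documents" are illustrative assumptions); L2-normalizing the vectors first and taking a plain dot product gives the same value.

import numpy as np

def cosine_kernel(xi, xj):
    return (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

doc_a = np.array([3.0, 0.0, 1.0])   # term counts over a 3-word vocabulary
doc_b = np.array([1.0, 0.0, 2.0])

print(cosine_kernel(doc_a, doc_b))
print((doc_a / np.linalg.norm(doc_a)) @ (doc_b / np.linalg.norm(doc_b)))   # same value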
Other Kernels
• Chi-squared Kernel
– Widely used in computer vision applications
– The chi-squared statistic measures the distance between probability distributions
– Data are assumed to be non-negative, often with an L1 norm of 1
An Aside: The Math Behind Kernels
What does it mean to be a kernel?
• K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ for some Φ
What does it take to be a kernel?
• The Gram matrix G, with Gij = K(xi, xj), must be positive semi-definite:
zᵀGz ≥ 0 for every non-zero vector z ∈ ℝⁿ
Establishing “kernel-hood” from first principles is non-trivial
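One rough numerical check (not a proof of the condition above) is to build the Gram matrix for a candidate kernel on some sample points and inspect its eigenvalues; the sample data and the degree-2 polynomial kernel here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def poly_kernel(xi, xj, d=2):
    return (xi @ xj) ** d

G = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix G_ij = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(G)                                 # G is symmetric
print("smallest eigenvalue:", eigvals.min())   # >= 0 (up to round-off) for a valid kernel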
A Few Good Kernels
• Linear kernel
• Polynomial kernel
• Gaussian (RBF) kernel
• Sigmoid kernel
• Cosine similarity kernel
• Chi-squared kernel
Application: Automatic Photo Retouching (Leyvand et al., 2008)
Practical Advice for Applying SVMs
• Use an SVM software package to solve for the parameters
– e.g., SVMlight, libsvm, cvx (fast!), etc.
• Need to specify:
– Choice of parameter C
– Choice of kernel function
• Associated kernel parameters
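A hedged sketch of that workflow using scikit-learn's SVC (which wraps libsvm): scale the features, then pick C, the kernel, and its parameters (e.g., γ for the RBF kernel) by cross-validation. The dataset and search grid are only illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe,
                    {'svc__C': [0.1, 1, 10, 100],
                     'svc__kernel': ['linear', 'rbf'],
                     'svc__gamma': ['scale', 0.01, 0.1, 1.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, f"cv accuracy = {grid.best_score_:.3f}")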
Multi-Class Classification with SVMs
Given labels y ∈ {1, …, K}:
• Many SVM packages already have multi-class classification built in
• Otherwise, use one-vs-rest (see the sketch below)
– Train K SVMs, each of which picks out one class from the rest, yielding θ(1), …, θ(K)
– Predict the class i with the largest (θ(i))ᵀx
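A minimal one-vs-rest sketch, assuming scikit-learn's LinearSVC and the Iris data purely for illustration: train one binary SVM per class and predict the class whose score (θ(i))ᵀx is largest.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary SVM per class: class k vs. the rest
models = [LinearSVC(max_iter=10000).fit(X, (y == k).astype(int)) for k in classes]

# Predict the class with the largest decision value
scores = np.column_stack([m.decision_function(X) for m in models])
y_hat = classes[scores.argmax(axis=1)]
print("training accuracy:", (y_hat == y).mean())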
SVMs vs Logistic Regression
(Advice from Andrew Ng)
If d is large (relative to n) (e.g., d > n, with d = 10,000 and n = 10 to 1,000)
• Use logistic regression or SVM with a linear kernel
If d is small (up to 1,000), n is intermediate (up to 10,000)
• Use SVM with Gaussian kernel
If d is small (up to 1,000), n is large (50,000+)
• Create/add more features, then use logistic regression or SVM without a kernel
Neural networks are likely to work well for most of these settings, but may be slower to train
Conclusion
• SVMs find the optimal linear separator
• The kernel trick makes SVMs learn non-linear decision surfaces
• Strength of SVMs:
– Good theoretical and empirical performance
– Supports many types of kernels
• Disadvantages of SVMs:
– “Slow” to train/predict for huge data sets (but relatively fast!)
– Need to choose the kernel (and tune its parameters)