Trịnh Tấn Đạt
Khoa CNTT – Đại Học Sài Gòn
Email: trinhtandat@sgu.edu.vn
Introduction
Review of Linear Algebra
Classifiers & Classifier Margin
Linear SVMs: Optimization Problem
Hard vs. Soft Margin Classification
Non-linear SVMs
Competitive with other classification methods
Relatively easy to learn
Kernel methods give an opportunity to extend the idea to
Regression
Density estimation
Kernel PCA
Etc.
Advantages of SVMs - 1
A principled approach to classification, regression and novelty detection
Good generalization capabilities
Hypothesis has an explicit dependence on the data, via the support vectors – hence, one can readily interpret the model
Advantages of SVMs - 2
Learning involves optimization of a convex function (no local minima as in neural nets)
Only a few parameters are required to tune the learning machine (unlike the many weights and learning parameters, hidden layers, hidden units, etc. in neural nets)
Vectors, matrices, dot products
Equation of a straight line in vector notation
Familiarity with:
the Perceptron is useful
mathematical programming will be useful
vector spaces will be an added benefit
The more comfortable you are with linear algebra, the easier this material will be.
What is a Vector?
Think of a vector as a directed line segment in N dimensions: it has a “length” and a “direction”.
Basic idea: convert geometry in higher dimensions into algebra!
Once you define a “nice” basis along each dimension (x-, y-, z-axis, ...), a vector becomes a 1 x N matrix!
Vector Addition: A + B
A + B = C (use the head-to-tail method to combine vectors)
In coordinates: (x1, x2) + (y1, y2) = (x1 + y1, x2 + y2)
Scalar Product: a·v
a (x1, x2) = (a x1, a x2)
Vectors: Magnitude (Length) and Phase (Direction)
v = (x1, x2, ..., xn)
||v|| = sqrt(x1² + x2² + ... + xn²)
Inner (Dot) Product: v.w or wTv
v.w = (x1, x2).(y1, y2) = x1 y1 + x2 y2
Projections with an Orthogonal Basis
Get the component of the vector on each axis: dot product with the unit vector on each axis!
Aside: this is what the Fourier transform does! It projects a function onto an infinite number of orthonormal basis functions (e^(jω) or e^(j2πn)) and adds the results up (to get an equivalent “representation” in the “frequency” domain).
Projection: Using Inner Products - 1
Projection: Using Inner Products - 2
Note: the “error vector” e = b - p is orthogonal (perpendicular) to p, i.e., the inner product (b - p)T p = 0.
p = a (aTb) / (aTa)
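As a quick illustration of the projection formula, here is a minimal NumPy sketch (the vectors a and b are arbitrary example values, not taken from the slide):

```python
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([2.0, 2.0])

p = a * (a @ b) / (a @ a)   # projection of b onto a: p = a (aTb)/(aTa)
e = b - p                   # the "error vector"

print(p)        # component of b along a
print(e @ p)    # ~0: the error vector is orthogonal to the projection
```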
Review of Linear Algebra - 1
Review of Linear Algebra - 2
1. w.x = 0 is the equation of a straight line through the origin.
2. w.x + b = 0 is the equation of an arbitrary straight line.
3. w.x + b = +1 is the equation of a straight line parallel to (2), on its positive side, at a distance 1/||w|| from it.
4. w.x + b = -1 is the equation of a straight line parallel to (2), on its negative side, at a distance 1/||w|| from it.
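A small sketch, with made-up w, b, and points, of checking which side of the line w.x + b = 0 a point falls on; this is exactly the quantity the classifiers on the next slides threshold:

```python
import numpy as np

w = np.array([1.0, -1.0])   # made-up weight vector
b = -0.5                    # made-up offset

points = np.array([[2.0, 0.0],    # w.x + b = +1.5 -> positive side
                   [0.0, 2.0],    # w.x + b = -2.5 -> negative side
                   [1.0, 0.5]])   # w.x + b =  0.0 -> on the line

for x in points:
    print(x, w @ x + b, np.sign(w @ x + b))
```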
Define a Binary Classifier
f(x, w, b) = sign(w.x + b)
How would you classify this data?
f(x, w, b) = sign(w.x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
f(x, w, b) = sign(w.x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a Linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable (see the sketch below).
3. Empirically it works very, very well.
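The support-vector picture can be reproduced with an off-the-shelf solver; the sketch below assumes scikit-learn and uses a toy dataset (not the data in the slides), with a very large C to approximate a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)         # very large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)               # the points the margin pushes up against
print(clf.coef_, clf.intercept_)          # w and b of the separating hyperplane
```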
Significance of Maximum Margin - 1
From the perspective of statistical learning theory, the motivation for considering the binary SVM classifier comes from theoretical bounds on the generalization error.
These bounds have two important features:
Significance of Maximum Margin - 2
The upper bound on the generalization error does not depend on the dimensionality of the space.
The bound is minimized by maximizing the margin.
SVMs: Three Main Ideas
1. Define an optimal hyperplane for the linearly separable case:
   1. One that maximizes the margin
   2. Solve the optimization problem
2. Extend the definition to non-linearly separable cases:
   1. Add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces
Setting Up the Optimization Problem
(Figure: two classes in the (Var1, Var2) plane, separated by the parallel lines w.x + b = k, w.x + b = 0, and w.x + b = -k.)
An Observation
The vector w is perpendicular to the plus plane. Why?
Why choose w.x + b = +1 and w.x + b = -1 as the planes defining the margin?
Width of the Margin
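As a reminder of the standard result this slide refers to (a textbook derivation, not reproduced from the slide itself): pick a point on the minus plane and move along w until you reach the plus plane.

```latex
% Take x_- on the minus plane (w.x + b = -1); the closest point of the plus
% plane (w.x + b = +1) is x_- + \lambda w for some \lambda > 0.
\[
  w\cdot(x_- + \lambda w) + b = +1
  \;\Rightarrow\; \lambda = \frac{2}{\|w\|^{2}},
  \qquad
  \text{margin width} = \|\lambda w\| = \frac{2}{\|w\|}.
\]
```

Maximizing this width is therefore the same as minimizing ½ wTw, which is exactly the objective used in the optimization problems below.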
Learning the Maximum Margin Classifier
Given a guess of w and b, we can compute:
whether all data points are in the correct half-planes
the width of the margin
Now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?
Gradient descent? Matrix inversion? EM? Newton's method?
Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints.
The constraints: yi (w.xi + b) ≥ 1 for every training pair (xi, yi).
Solving the Constrained Minimization
The classical method is to minimize the associated unconstrained problem using Lagrange multipliers. That is, minimize
L(w, b, α) = ½ wTw − Σi αi [ yi (w.xi + b) − 1 ]
This is done by finding the saddle points:
∂L/∂b = 0 gives Σi αi yi = 0
∂L/∂w = 0 gives w = Σi αi yi xi
Decision Surface
The decision surface is then defined by f(z) = sign(Σi αi yi xiT z + b), where z is a test vector.
Solving the Optimization Problem
◼ Need to optimize a quadratic function subject to linear constraints.
◼ Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
◼ The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find w and b such that
Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}:
yi (wT xi + b) ≥ 1
Find α1 … αN such that
Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiT xj is maximized and
(1) Σαi yi = 0
(2) αi ≥ 0 for all αi
The Optimization Problem Solution
◼ The solution has the form:
w = Σαi yi xi    b = yk − wT xk for any xk such that αk ≠ 0
◼ Each non-zero αi indicates that the corresponding xi is a support vector.
◼ Then the classifying function has the form:
f(x) = Σαi yi xiT x + b
◼ Notice that it relies on an inner product between the test point x and the support vectors xi – we will return to this later.
◼ Also keep in mind that solving the optimization problem involved computing w and b as above.
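A hedged sketch of these formulas using scikit-learn's dual coefficients (dual_coef_ stores αi·yi for the support vectors); the toy data is illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # ~hard margin on separable toy data

alpha_y = clf.dual_coef_[0]          # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_

w = alpha_y @ sv                     # w = sum_i alpha_i y_i x_i
k = 0                                # any support vector works for b
b = y[clf.support_][k] - w @ sv[k]   # b = y_k - w^T x_k

print(np.allclose(w, clf.coef_[0]), np.isclose(b, clf.intercept_[0]))
```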
Dataset with Noise
◼ Hard margin: so far we required
◼ all data points to be classified correctly
◼ no training errors were allowed
◼ What if the training set is noisy?
- Solution 1: use very powerful kernels
Trang 42Slack variables ξi can be added to allow misclassification of
difficult or noisy examples.
Soft Margin Classification
What should our quadratic optimization criterion be?
1
w w
Hard Margin vs. Soft Margin
◼ The old formulation:
◼ The new formulation incorporating slack variables (a short code illustration of the C trade-off follows the boxes):
Find w and b such that
Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}:
yi (wT xi + b) ≥ 1
Find w and b such that
Φ(w) = ½ wTw + C Σξi is minimized, and for all {(xi, yi)}:
yi (wT xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
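A short illustration of the C trade-off, assuming scikit-learn and randomly generated blobs (none of this is data from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])   # two noisy blobs
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # small C -> slack is cheap -> wider margin, more support vectors
    print(C, len(clf.support_))
```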
Linear SVMs: Summary
◼ The classifier is a separating hyperplane.
◼ The most “important” training points are the support vectors; they define the hyperplane.
◼ Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrange multipliers αi.
◼ Both in the dual formulation of the problem and in the solution, training points appear only inside dot products (a direct numerical solution of this dual is sketched after the box):
Find α1 … αN such that
Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiT xj is maximized and
(1) Σαi yi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = Σαi yi xiT x + b
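The boxed dual can also be handed directly to a generic QP solver. The sketch below assumes the cvxopt package and rewrites "maximize Q(α)" in the minimization form cvxopt expects; the tiny ridge added to P is only for numerical stability:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    """Solve: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b."""
    n = X.shape[0]
    K = X @ X.T                                          # Gram matrix x_i^T x_j
    P = matrix(np.outer(y, y) * K + 1e-10 * np.eye(n))   # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(n))                              # corresponds to -sum_i alpha_i
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))       # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))           # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    return np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(svm_dual(X, y, C=10.0))                            # the alphas
```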
Why Go to the Dual Formulation?
The vector w could be infinite-dimensional (in feature space), which poses problems; the dual works only with dot products between training points.
Non-linear SVM
Disadvantages of Linear Decision Surfaces
Advantages of Non-Linear Surfaces
Linear Classifiers in High-Dimensional Spaces
(Figure: data plotted against constructed features, e.g. Constructed Feature 2, where a linear classifier can separate it.)
Non-linear SVMs
◼ Datasets that are linearly separable with some noise work out great:
◼ But what are we going to do if the dataset is just too hard?
◼ How about… mapping data to a higher-dimensional space:
The last figure can also be thought of as a non-linear basis function in two dimensions.
That is, we used the basis z = (x, x²).
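A minimal sketch of that basis, assuming NumPy and scikit-learn: 1-D points whose label depends on |x| are not separable on the line, but become linearly separable once mapped to (x, x²):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])     # label depends on |x|: not separable in 1-D

Z = np.column_stack([x, x ** 2])           # the mapped 2-D features z = (x, x^2)
clf = SVC(kernel='linear').fit(Z, y)
print(clf.predict(Z))                      # separable by a straight line in (x, x^2)
```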
Non-linear SVMs: Feature Spaces
◼ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
What is the Mapping Function?
The idea is to achieve linear separation by mapping the data into a higher-dimensional space.
Let us call Φ the function that achieves this mapping
What is this Φ?
Let us recall the formula we used earlier in the linear SVM part: the dual formulation.
Dual Formula – 1 (Linear SVM)
Maximize Σk αk − ½ Σk Σl αk αl Qkl, where Qkl = yk yl (xk . xl) and k, l run from 1 to R (the number of training points).
Dual Formula – 2 (Linear SVM, soft margin)
For the Non-linear Case
Let us replace the inner product (xi . xj) by Φ(xi) . Φ(xj) to define the operations in the new, higher-dimensional space.
If there is a “kernel function” K such that
K(xi, xj) = Φ(xi) . Φ(xj) = Φ(xi)T Φ(xj)
then we do not need to know Φ explicitly.
This strategy is preferred because the alternative, working with Φ explicitly in the high-dimensional space, can be intractable (see the computational burden below).
Dual in the New (Feature) Space
w and b are computed as before, now using Φ(xk); the index K = argmaxk (αk) selects a support vector used for computing b.
Classify with f(x, w, b) = sign(w . Φ(x) + b)
Computational Burden
Because we're working in a higher-dimensional space (and potentially even an infinite-dimensional space), calculating φ(xi)T φ(xj) may be intractable.
We must do R²/2 dot products to evaluate Q.
Each dot product requires m²/2 additions and multiplications.
The whole thing costs R²m²/4 operations.
Too high!!! Or does it? Really?
How to Avoid the Burden?
Instead, we have the kernel trick, which tells us that the dot product in the feature space can be computed directly from the original inputs as K(xi, xj) = Φ(xi).Φ(xj), without ever forming Φ.
A Note
Note that αi is only non-zero for instances φ(xi) on or near the boundary – those are called the support vectors, since they alone specify the decision boundary.
We can toss out the other data points once training is complete. Thus, we only sum over the xi which constitute the support vectors.
Consider a Φ.Φ as shown below
Collecting terms in the dot product
Both Are the Same
Comparing term by term, we see Φ.Φ = (1 + a.b)²
But computing the right-hand side is a lot more efficient: O(m) (m additions and multiplications).
Let us call (1 + a.b)² = K(a, b) = the kernel.
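A numerical check of this identity, using the usual explicit feature map for the quadratic kernel in two dimensions (the exact √2 scaling of the components is an assumption; the slides only state the dot-product identity):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D input (assumed scaling)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lhs = phi(a) @ phi(b)          # dot product in the O(m^2)-dimensional feature space
rhs = (1 + a @ b) ** 2         # K(a, b), only O(m) work
print(lhs, rhs, np.isclose(lhs, rhs))
```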
Φ in the “Kernel Trick” Example
Other Kernels
Beyond polynomials, there are other high-dimensional basis functions that can be made practical by finding the right kernel function.
Examples of Kernel Functions
◼ Linear: K(xi, xj) = xiT xj
◼ Polynomial of power p: K(xi, xj) = (1 + xiT xj)p
◼ Gaussian (radial-basis function network): K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) )
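These kernels are easy to write down directly; a hedged NumPy sketch (σ and p are free parameters chosen by the user, and the test vectors are arbitrary):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), gaussian_kernel(xi, xj))
```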
The function we end up optimizing is Q(α) = Σαi − ½ ΣΣ αi αj yi yj K(xi, xj), maximized subject to Σαi yi = 0 and 0 ≤ αi ≤ C.
Multi-class Classification
One-versus-all classification
Multi-class SVM
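A one-versus-all sketch with scikit-learn: one binary SVM per class, and the class whose decision value is largest wins (toy blobs, not data from the slides):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # 3-class toy data
clf = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)      # one SVM per class
print(clf.predict(X[:5]), y[:5])
```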
One-class SVM (unsupervised learning): outlier detection
Weibull-calibrated SVM (W-SVM) / PI-SVM: open set recognition
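For the one-class case, a minimal OneClassSVM sketch (W-SVM / PI-SVM have no standard scikit-learn implementation, so only outlier detection is illustrated; the data and ν are arbitrary):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)              # "normal" points around the origin
X_test = np.vstack([0.3 * rng.randn(5, 2),     # more normal points
                    [[4.0, 4.0]]])             # one obvious outlier

clf = OneClassSVM(kernel='rbf', nu=0.1, gamma=0.5).fit(X_train)
print(clf.predict(X_test))                     # +1 = inlier, -1 = outlier
```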
CIFAR-10 Image Recognition Using SVM
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
These are the classes in the dataset: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
Hint: https://github.com/wikiabhi/Cifar-10
https://github.com/mok232/CIFAR-10-Image-Classification
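A hedged baseline for the exercise: raw-pixel CIFAR-10 features fed to a linear SVM. Loading via tensorflow.keras and the subsample sizes are assumptions made for speed; the linked repositories use engineered features (e.g. HOG) and will do better than raw pixels:

```python
import numpy as np
from tensorflow.keras.datasets import cifar10     # one convenient way to load the data
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Flatten 32x32x3 images into 3072-dim vectors; subsample to keep training quick.
X_tr = x_train[:5000].reshape(5000, -1).astype(np.float32) / 255.0
y_tr = y_train[:5000].ravel()
X_te = x_test[:1000].reshape(1000, -1).astype(np.float32) / 255.0
y_te = y_test[:1000].ravel()

scaler = StandardScaler().fit(X_tr)
clf = LinearSVC(C=0.01, max_iter=2000).fit(scaler.transform(X_tr), y_tr)
print(clf.score(scaler.transform(X_te), y_te))    # modest accuracy on raw pixels
```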