Trịnh Tấn Đạt
Khoa CNTT – Đại Học Sài Gòn
Email: trinhtandat@sgu.edu.vn
Introduction
Review of Linear Algebra
Classifiers & Classifier Margin
Linear SVMs: Optimization Problem
Hard vs. Soft Margin Classification
Non-linear SVMs
Competitive with other classification methods
Relatively easy to learn
Kernel methods give an opportunity to extend the idea to
Regression
Density estimation
Kernel PCA
Etc.
Advantages of SVMs - 1
A principled approach to classification, regression and novelty detection
Good generalization capabilities
Hypothesis has an explicit dependence on the data, via the support vectors – hence, one can readily interpret the model
Advantages of SVMs - 2
Learning involves optimization of a convex function (no local minima as in neural nets)
Only a few parameters are required to tune the learning machine (unlike the many weights and learning parameters, hidden layers, hidden units, etc. in neural nets)
Vectors, matrices, dot products
Equation of a straight line in vector notation
Familiarity with:
the Perceptron is useful
mathematical programming will be useful
vector spaces will be an added benefit
The more comfortable you are with linear algebra, the easier this material will be.
What is a Vector?
Think of a vector as a directed line segment in N dimensions: it has a “length” and a “direction”.
Basic idea: convert geometry in higher dimensions into algebra!
Once you define a “nice” basis along each dimension (x-, y-, z-axis, ...), a vector becomes a 1 x N matrix!
Vector Addition: A + B
A + B = C (use the head-to-tail method to combine vectors)
In coordinates: (x1, x2) + (y1, y2) = (x1 + y1, x2 + y2)
Scalar Product: a·v
a (x1, x2) = (a x1, a x2)
Vectors: Magnitude (Length) and Phase (Direction)
v = (x1, x2, ..., xn)
||v|| = sqrt(x1² + x2² + ... + xn²)
Inner (Dot) Product: v.w or wTv
v.w = (x1, x2).(y1, y2) = x1 y1 + x2 y2
Projections with an Orthogonal Basis
Get the component of the vector on each axis: dot product with the unit vector on each axis!
Aside: this is what the Fourier transform does! It projects a function onto an infinite number of orthonormal basis functions (e^(jω) or e^(j2πn)) and adds the results up (to get an equivalent “representation” in the “frequency” domain).
Projection: Using Inner Products - 1
Projection: Using Inner Products - 2
Note: the “error vector” e = b - p is orthogonal (perpendicular) to p, i.e., the inner product (b - p)T p = 0.
p = a (aTb) / (aTa)
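As a quick illustration of the projection formula, here is a minimal NumPy sketch (the vectors a and b are arbitrary example values, not taken from the slide):

```python
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([2.0, 2.0])

p = a * (a @ b) / (a @ a)   # projection of b onto a: p = a (aTb)/(aTa)
e = b - p                   # the "error vector"

print(p)        # component of b along a
print(e @ p)    # ~0: the error vector is orthogonal to the projection
```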
Review of Linear Algebra - 1
Review of Linear Algebra - 2
1. w.x = 0 is the equation of a straight line through the origin.
2. w.x + b = 0 is the equation of an arbitrary straight line.
3. w.x + b = +1 is the equation of a straight line parallel to (2), on its positive side, at a distance 1/||w|| from it.
4. w.x + b = -1 is the equation of a straight line parallel to (2), on its negative side, at a distance 1/||w|| from it.
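A small sketch, with made-up w, b, and points, of checking which side of the line w.x + b = 0 a point falls on; this is exactly the quantity the classifiers on the next slides threshold:

```python
import numpy as np

w = np.array([1.0, -1.0])   # made-up weight vector
b = -0.5                    # made-up offset

points = np.array([[2.0, 0.0],    # w.x + b = +1.5 -> positive side
                   [0.0, 2.0],    # w.x + b = -2.5 -> negative side
                   [1.0, 0.5]])   # w.x + b =  0.0 -> on the line

for x in points:
    print(x, w @ x + b, np.sign(w @ x + b))
```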
Define a Binary Classifier
f(x, w, b) = sign(w.x + b)
How would you classify this data?
f(x, w, b) = sign(w.x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
f(x, w, b) = sign(w.x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a Linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable (see the sketch below).
3. Empirically it works very, very well.
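The support-vector picture can be reproduced with an off-the-shelf solver; the sketch below assumes scikit-learn and uses a toy dataset (not the data in the slides), with a very large C to approximate a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)         # very large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)               # the points the margin pushes up against
print(clf.coef_, clf.intercept_)          # w and b of the separating hyperplane
```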
Significance of Maximum Margin - 1
From the perspective of statistical learning theory, the motivation for considering the binary SVM classifier comes from theoretical bounds on the generalization error.
These bounds have two important features:
Significance of Maximum Margin - 2
The upper bound on the generalization error does not depend on the dimensionality of the space.
The bound is minimized by maximizing the margin.
SVMs: Three Main Ideas
1. Define an optimal hyperplane for the linearly separable case:
   1. One that maximizes the margin
   2. Solve the optimization problem
2. Extend the definition to non-linearly separable cases:
   1. Add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces
Setting Up the Optimization Problem
(Figure: two classes in the (Var1, Var2) plane, separated by the parallel lines w.x + b = k, w.x + b = 0, and w.x + b = -k.)
An Observation
The vector w is perpendicular to the plus plane. Why?
Why choose w.x + b = +1 and w.x + b = -1 as the planes defining the margin?
Width of the Margin
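As a reminder of the standard result this slide refers to (a textbook derivation, not reproduced from the slide itself): pick a point on the minus plane and move along w until you reach the plus plane.

```latex
% Take x_- on the minus plane (w.x + b = -1); the closest point of the plus
% plane (w.x + b = +1) is x_- + \lambda w for some \lambda > 0.
\[
  w\cdot(x_- + \lambda w) + b = +1
  \;\Rightarrow\; \lambda = \frac{2}{\|w\|^{2}},
  \qquad
  \text{margin width} = \|\lambda w\| = \frac{2}{\|w\|}.
\]
```

Maximizing this width is therefore the same as minimizing ½ wTw, which is exactly the objective used in the optimization problems below.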
Learning the Maximum Margin Classifier
Given a guess of w and b, we can compute:
whether all data points are in the correct half-planes
the width of the margin
Now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?
Gradient descent? Matrix inversion? EM? Newton's method?
Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints.
The constraints: yi (w.xi + b) ≥ 1 for every training pair (xi, yi).
Solving the Constrained Minimization
The classical method is to minimize the associated unconstrained problem using Lagrange multipliers. That is, minimize
L(w, b, α) = ½ wTw − Σi αi [ yi (w.xi + b) − 1 ]
This is done by finding the saddle points:
∂L/∂b = 0 gives Σi αi yi = 0
∂L/∂w = 0 gives w = Σi αi yi xi
Decision Surface
The decision surface is then defined by f(z) = sign(Σi αi yi xiT z + b), where z is a test vector.
Solving the Optimization Problem
◼ Need to optimize a quadratic function subject to linear constraints.
◼ Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
◼ The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:
Find w and b such that
Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}:
yi (wT xi + b) ≥ 1
Find α1 … αN such that
Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiT xj is maximized and
(1) Σαi yi = 0
(2) αi ≥ 0 for all αi
The Optimization Problem Solution
◼ The solution has the form:
w = Σαi yi xi    b = yk − wT xk for any xk such that αk ≠ 0
◼ Each non-zero αi indicates that the corresponding xi is a support vector.
◼ Then the classifying function has the form:
f(x) = Σαi yi xiT x + b
◼ Notice that it relies on an inner product between the test point x and the support vectors xi – we will return to this later.
◼ Also keep in mind that solving the optimization problem involved computing w and b as above.
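A hedged sketch of these formulas using scikit-learn's dual coefficients (dual_coef_ stores αi·yi for the support vectors); the toy data is illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # ~hard margin on separable toy data

alpha_y = clf.dual_coef_[0]          # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_

w = alpha_y @ sv                     # w = sum_i alpha_i y_i x_i
k = 0                                # any support vector works for b
b = y[clf.support_][k] - w @ sv[k]   # b = y_k - w^T x_k

print(np.allclose(w, clf.coef_[0]), np.isclose(b, clf.intercept_[0]))
```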
Dataset with Noise
◼ Hard margin: so far we required
◼ all data points to be classified correctly
◼ no training errors were allowed
◼ What if the training set is noisy?
- Solution 1: use very powerful kernels
Trang 42Slack variables ξi can be added to allow misclassification of
difficult or noisy examples.
Soft Margin Classification
What should our quadratic optimization criterion be?
1
w w
Hard Margin vs. Soft Margin
◼ The old formulation:
◼ The new formulation incorporating slack variables (a short code illustration of the C trade-off follows the boxes):
Find w and b such that
Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}:
yi (wT xi + b) ≥ 1
Find w and b such that
Φ(w) = ½ wTw + C Σξi is minimized, and for all {(xi, yi)}:
yi (wT xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i
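A short illustration of the C trade-off, assuming scikit-learn and randomly generated blobs (none of this is data from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])   # two noisy blobs
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # small C -> slack is cheap -> wider margin, more support vectors
    print(C, len(clf.support_))
```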
Linear SVMs: Summary
◼ The classifier is a separating hyperplane.
◼ The most “important” training points are the support vectors; they define the hyperplane.
◼ Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrange multipliers αi.
◼ Both in the dual formulation of the problem and in the solution, training points appear only inside dot products (a direct numerical solution of this dual is sketched after the box):
Find α1 … αN such that
Q(α) = Σαi − ½ ΣΣ αi αj yi yj xiT xj is maximized and
(1) Σαi yi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = Σαi yi xiT x + b
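The boxed dual can also be handed directly to a generic QP solver. The sketch below assumes the cvxopt package and rewrites "maximize Q(α)" in the minimization form cvxopt expects; the tiny ridge added to P is only for numerical stability:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    """Solve: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b."""
    n = X.shape[0]
    K = X @ X.T                                          # Gram matrix x_i^T x_j
    P = matrix(np.outer(y, y) * K + 1e-10 * np.eye(n))   # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(n))                              # corresponds to -sum_i alpha_i
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))       # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))           # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    return np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(svm_dual(X, y, C=10.0))                            # the alphas
```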
Why Go to the Dual Formulation?
The vector w could be infinite-dimensional (in feature space), which poses problems; the dual works only with dot products between training points.
Non-linear SVM
Disadvantages of Linear Decision Surfaces
Advantages of Non-Linear Surfaces
Linear Classifiers in High-Dimensional Spaces
(Figure: data plotted against constructed features, e.g. Constructed Feature 2, where a linear classifier can separate it.)
Non-linear SVMs
◼ Datasets that are linearly separable with some noise work out great:
◼ But what are we going to do if the dataset is just too hard?
◼ How about… mapping data to a higher-dimensional space:
The last figure can also be thought of as a non-linear basis function in two dimensions.
That is, we used the basis z = (x, x²).
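A minimal sketch of that basis, assuming NumPy and scikit-learn: 1-D points whose label depends on |x| are not separable on the line, but become linearly separable once mapped to (x, x²):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])     # label depends on |x|: not separable in 1-D

Z = np.column_stack([x, x ** 2])           # the mapped 2-D features z = (x, x^2)
clf = SVC(kernel='linear').fit(Z, y)
print(clf.predict(Z))                      # separable by a straight line in (x, x^2)
```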
Non-linear SVMs: Feature Spaces
◼ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
What is the Mapping Function?
The idea is to achieve linear separation by mapping the data into a higher-dimensional space.
Let us call Φ the function that achieves this mapping
What is this Φ?
Let us recall the formula we used earlier in the linear SVM part: the dual formulation.
Dual Formula – 1 (Linear SVM)
Maximize Σk αk − ½ Σk Σl αk αl Qkl, where Qkl = yk yl (xk . xl) and k, l run from 1 to R (the number of training points).
Dual Formula – 2 (Linear SVM, soft margin)
For the Non-linear Case
Let us replace the inner product (xi . xj) by Φ(xi) . Φ(xj) to define the operations in the new, higher-dimensional space.
If there is a “kernel function” K such that
K(xi, xj) = Φ(xi) . Φ(xj) = Φ(xi)T Φ(xj)
then we do not need to know Φ explicitly.
This strategy is preferred because the alternative, working with Φ explicitly in the high-dimensional space, can be intractable (see the computational burden below).
Dual in the New (Feature) Space
w and b are computed as before, now using Φ(xk); the index K = argmaxk (αk) selects a support vector used for computing b.
Classify with f(x, w, b) = sign(w . Φ(x) + b)
Computational Burden
Because we're working in a higher-dimensional space (and potentially even an infinite-dimensional space), calculating φ(xi)T φ(xj) may be intractable.
We must do R²/2 dot products to evaluate Q.
Each dot product requires m²/2 additions and multiplications.
The whole thing costs R²m²/4 operations.
Too high!!! Or does it? Really?
How to Avoid the Burden?
Instead, we have the kernel trick, which tells us that the dot product in the feature space can be computed directly from the original inputs as K(xi, xj) = Φ(xi).Φ(xj), without ever forming Φ.
A Note
Note that αi is only non-zero for instances φ(xi) on or near the boundary – those are called the support vectors, since they alone specify the decision boundary.
We can toss out the other data points once training is complete. Thus, we only sum over the xi which constitute the support vectors.
Consider a Φ.Φ as shown below
Collecting terms in the dot product
Both Are the Same
Comparing term by term, we see Φ.Φ = (1 + a.b)²
But computing the right-hand side is a lot more efficient: O(m) (m additions and multiplications).
Let us call (1 + a.b)² = K(a, b) = the kernel.
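A numerical check of this identity, using the usual explicit feature map for the quadratic kernel in two dimensions (the exact √2 scaling of the components is an assumption; the slides only state the dot-product identity):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D input (assumed scaling)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lhs = phi(a) @ phi(b)          # dot product in the O(m^2)-dimensional feature space
rhs = (1 + a @ b) ** 2         # K(a, b), only O(m) work
print(lhs, rhs, np.isclose(lhs, rhs))
```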
Φ in the “Kernel Trick” Example
Other Kernels
Beyond polynomials, there are other high-dimensional basis functions that can be made practical by finding the right kernel function.
Examples of Kernel Functions
◼ Linear: K(xi, xj) = xiT xj
◼ Polynomial of power p: K(xi, xj) = (1 + xiT xj)p
◼ Gaussian (radial-basis function network): K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) )
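These kernels are easy to write down directly; a hedged NumPy sketch (σ and p are free parameters chosen by the user, and the test vectors are arbitrary):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), gaussian_kernel(xi, xj))
```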
The function we end up optimizing is Q(α) = Σαi − ½ ΣΣ αi αj yi yj K(xi, xj), maximized subject to Σαi yi = 0 and 0 ≤ αi ≤ C.
Multi-class Classification
One-versus-all classification
Multi-class SVM
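A one-versus-all sketch with scikit-learn: one binary SVM per class, and the class whose decision value is largest wins (toy blobs, not data from the slides):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # 3-class toy data
clf = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)      # one SVM per class
print(clf.predict(X[:5]), y[:5])
```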
One-class SVM (unsupervised learning): outlier detection
Weibull-calibrated SVM (W-SVM) / PI-SVM: open set recognition
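For the one-class case, a minimal OneClassSVM sketch (W-SVM / PI-SVM have no standard scikit-learn implementation, so only outlier detection is illustrated; the data and ν are arbitrary):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)              # "normal" points around the origin
X_test = np.vstack([0.3 * rng.randn(5, 2),     # more normal points
                    [[4.0, 4.0]]])             # one obvious outlier

clf = OneClassSVM(kernel='rbf', nu=0.1, gamma=0.5).fit(X_train)
print(clf.predict(X_test))                     # +1 = inlier, -1 = outlier
```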
CIFAR-10 Image Recognition Using SVM
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
These are the classes in the dataset: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
Hint: https://github.com/wikiabhi/Cifar-10
https://github.com/mok232/CIFAR-10-Image-Classification
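A hedged baseline for the exercise: raw-pixel CIFAR-10 features fed to a linear SVM. Loading via tensorflow.keras and the subsample sizes are assumptions made for speed; the linked repositories use engineered features (e.g. HOG) and will do better than raw pixels:

```python
import numpy as np
from tensorflow.keras.datasets import cifar10     # one convenient way to load the data
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Flatten 32x32x3 images into 3072-dim vectors; subsample to keep training quick.
X_tr = x_train[:5000].reshape(5000, -1).astype(np.float32) / 255.0
y_tr = y_train[:5000].ravel()
X_te = x_test[:1000].reshape(1000, -1).astype(np.float32) / 255.0
y_te = y_test[:1000].ravel()

scaler = StandardScaler().fit(X_tr)
clf = LinearSVC(C=0.01, max_iter=2000).fit(scaler.transform(X_tr), y_tr)
print(clf.score(scaler.transform(X_te), y_te))    # modest accuracy on raw pixels
```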