Pattern Recognition
> Introduction to Pattern Recognition System
> Feature Extraction: Haar-like Feature, Integral Image
> Dimension Reduction: PCA
> Bayesian Decision Theory
> Bayesian Discriminant Function for Normal Density
> Linear Discriminant Analysis
> Linear Discriminant Functions
> Support Vector Machine
> K Nearest Neighbor
> Statistical Clustering
Introduction
Example applications:
— OCR (Optical Character Recognition)
— DNA sequence identification
Components of Pattern Classification System [6]
[Figure: processing pipeline with sensing devices, preprocessing, dimensionality reduction, prediction, and selection stages; associated elements include sensors, feature selection, and cross-validation]
Types of Prediction Problems [6]
■ Classification
  • The PR problem of assigning an object to a class
  • The output of the PR system is an integer label
    ◦ e.g., classifying a product as "good" or "bad" in a quality control test
■ Regression
  • A generalization of a classification task
  • The output of the PR system is a real-valued number
    ◦ e.g., predicting the share value of a firm based on past performance and stock market indicators
■ Clustering
  • The problem of organizing objects into meaningful groups
  • The system returns a (sometimes hierarchical) grouping of objects
    ◦ e.g., organizing life forms into a taxonomy of species
■ Description
  • The problem of representing an object in terms of a series of primitives
  • The PR system produces a structural or linguistic description
    ◦ e.g., labeling an ECG signal in terms of P, QRS and T complexes
Feature and Pattern [6]
■ Feature
  • A feature is any distinctive aspect, quality or characteristic
    ◦ Features may be symbolic (i.e., color) or numeric (i.e., height)
  • Definitions
    ◦ The combination of d features is represented as a d-dimensional column vector called a feature vector
    ◦ The d-dimensional space defined by the feature vector is called the feature space
    ◦ Objects are represented as points in feature space; this representation is called a scatter plot (see the sketch after this list)
■ Pattern
  • A pattern is a composite of traits or features characteristic of an individual
  • In classification tasks, a pattern is a pair of variables {x, ω}, where
    ◦ x is a collection of observations or features (feature vector)
    ◦ ω is the concept behind the observation (label)
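To make the feature-vector and scatter-plot definitions above concrete, here is a minimal Python sketch (not from the slides): it draws two hypothetical classes as points in a 2-D feature space. The class centers, spreads, and sample sizes are made-up illustrative values.

```python
# Minimal sketch: objects from two classes represented as 2-D feature vectors
# and visualized as a scatter plot in feature space. All values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Each row is one object; each column is one feature (d = 2 here).
class_1 = rng.normal(loc=[2.0, 3.0], scale=0.5, size=(50, 2))
class_2 = rng.normal(loc=[4.0, 5.0], scale=0.5, size=(50, 2))

plt.scatter(class_1[:, 0], class_1[:, 1], label="class 1")
plt.scatter(class_2[:, 0], class_2[:, 1], label="class 2")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.legend()
plt.title("Objects as points in a 2-D feature space (scatter plot)")
plt.show()
```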
Feature and Pattern [6]
■ What makes a "good" feature vector?
  • The quality of a feature vector is related to its ability to discriminate examples from different classes
    ◦ Examples from the same class should have similar feature values
    ◦ Examples from different classes have different feature values
  [Figure: examples of "good" features vs. "bad" features]
■ More feature properties
  [Figure: linear separability, non-linear separability, highly correlated features, multi-modal]
Classifier [6]
■ The task of a classifier is to partition feature space into class-labeled decision regions
  • Borders between decision regions are called decision boundaries
  • The classification of a feature vector x consists of determining which decision region it belongs to, and assigning x to this class
  [Figure: feature space partitioned into class-labeled decision regions]
■ A classifier can be represented as a set of discriminant functions
  • The classifier assigns a feature vector x to class ω_i if g_i(x) > g_j(x) ∀ j ≠ i (a minimal sketch follows)
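A minimal sketch of the discriminant-function view: evaluate one discriminant per class and assign x to the class with the largest value. The nearest-centroid discriminants and the class means below are illustrative assumptions, not the slides' specific classifier.

```python
# Minimal sketch: classification by maximizing a set of discriminant functions.
# Here g_i(x) is the negative squared distance to an (assumed) class mean.
import numpy as np

class_means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
discriminants = [lambda x, m=m: -np.sum((x - m) ** 2) for m in class_means]

x = np.array([2.5, 2.0])
scores = [g(x) for g in discriminants]
i = int(np.argmax(scores))        # class i such that g_i(x) > g_j(x) for all j != i
print(f"x = {x} is assigned to class {i + 1}; discriminant values: {np.round(scores, 2)}")
```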
Pattern Recognition Approaches [6]
■ Statistical (StatPR)
  • Patterns are classified based on an underlying statistical model of the features
    ◦ The statistical model is defined by a family of class-conditional probability density functions P(x|c_i) (the probability of feature vector x given class c_i)
■ Neural (NeurPR)
  • Classification is based on the response of a network of processing units (neurons) to an input stimulus (pattern)
    ◦ "Knowledge" is stored in the connectivity and strength of the synaptic weights
  • NeurPR is a trainable, non-algorithmic, black-box strategy
  • NeurPR is very attractive since
    ◦ it requires minimum a priori knowledge
    ◦ with enough layers and neurons, an ANN can create any complex decision region
■ Syntactic (SyntPR)
  • Patterns are classified based on measures of structural similarity
    ◦ "Knowledge" is represented by means of formal grammars or relational descriptions (graphs)
  • SyntPR is used not only for classification, but also for description
    ◦ Typically, SyntPR approaches formulate hierarchical descriptions of complex patterns built up from simpler subpatterns
Pattern Recognition Approaches [6]
[Figure: comparison of the neural, statistical, and structural approaches; feature extraction counts the number of intersections and the number of right oblique lines. *Neural approaches may also employ feature extraction.]
Machine Perception [2]
"salmon"
FIGURE 1.1 The objects to be classified are first sensed by a transducer (camera),
whose signals are preprocessed Next the features are extracted and finally the clas-
sification is emitted, here either “salmon” or “sea bass.” Although the information flow
is often chosen to be from the source to the classifier, some systems employ information
flow in which earlier levels of processing can be altered based on the tentative or pre-
liminary response in later levels (gray arrows) Yet others combine two or more stages
into a unified step, such as simultaneous segmentation and feature extraction From:
Richard O Duda, Peter E Hart, and David G Stork, Pattern Classification Copyright
© 2001 by John Wiley & Sons, Inc
Preprocessing: use a segmentation operation to isolate fishes from one another and from the background.
Feature extraction: the information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features or properties.
Feature Selection [2]
The length of the fish as a possible feature for discrimination
FIGURE 1.2 Histograms for the length feature for the two categories (salmon and sea bass). No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
■ The length is a poor feature alone!
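As a rough illustration of picking the threshold l* that gives the smallest number of errors, the sketch below scans candidate thresholds on synthetic length data; the Gaussian length distributions and sample sizes are assumptions, not the textbook's measurements.

```python
# Minimal sketch: choose the single length threshold that minimizes training errors
# for the rule "decide sea bass if length > threshold" on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
salmon_len = rng.normal(loc=10.0, scale=2.0, size=200)    # hypothetical lengths
seabass_len = rng.normal(loc=14.0, scale=2.5, size=200)

lengths = np.concatenate([salmon_len, seabass_len])
labels = np.concatenate([np.zeros(200), np.ones(200)])    # 0 = salmon, 1 = sea bass

candidates = np.sort(lengths)
errors = [np.sum((lengths > t) != labels) for t in candidates]
best = candidates[int(np.argmin(errors))]
print(f"best threshold l* = {best:.2f}, training errors = {min(errors)} / {len(labels)}")
```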
FIGURE 1.3 Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
FIGURE 1.4 The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
[...] shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Generalization: Model Selection [3]
Polynomial Curve Fitting
[Figure: 0th-order and 1st-order polynomial fits]
Generalization: Model Selection [3]
Polynomial Curve Fitting, Over-fitting
Generalization: Sample Size [3]
Generalization: Regularization [3]
Polynomial Curve Fitting
Regularization: Penalize large coefficient values
E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2 + (λ/2) ‖w‖^2
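A minimal sketch of fitting a polynomial under the regularized error function above, using the closed-form ridge solution w = (Φ^T Φ + λI)^{-1} Φ^T t. The sinusoidal data, the polynomial order M = 9, and the value of λ are illustrative assumptions.

```python
# Minimal sketch: regularized (ridge) polynomial curve fitting.
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 10, 9, 1e-3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # noisy targets

Phi = np.vander(x, M + 1, increasing=True)                  # polynomial design matrix
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

print("fitted coefficients w:", np.round(w, 3))
print("regularized error E(w):", 0.5 * np.sum((Phi @ w - t) ** 2) + 0.5 * lam * np.sum(w ** 2))
```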
Learning and Adaptation [2]
Linear Discriminant Functions [6]
■ A two-class linear discriminant function has the form g(x) = w^T x + w_0
  • Decide x ∈ ω_1 if g(x) > 0, and x ∈ ω_2 if g(x) < 0 (a minimal sketch follows)
  • where w is the weight vector and w_0 is the threshold weight or bias (not to be confused with that of the bias-variance dilemma)
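A minimal sketch of the two-class rule above; the weight vector and bias values are arbitrary illustrative choices.

```python
# Minimal sketch: two-class linear discriminant g(x) = w^T x + w0.
import numpy as np

w = np.array([1.0, -2.0])   # weight vector (illustrative)
w0 = 0.5                    # threshold weight / bias (illustrative)

def classify(x):
    g = w @ x + w0
    return "omega_1" if g > 0 else "omega_2"

print(classify(np.array([3.0, 1.0])))   # g = 3 - 2 + 0.5 = 1.5 > 0  -> omega_1
print(classify(np.array([0.0, 1.0])))   # g = -2 + 0.5 = -1.5 < 0    -> omega_2
```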
Feature Extraction: Haar-like Feature, Integral Image
Haar-like Feature [7]
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. (1998).
Three kinds of features: two-rectangle feature, three-rectangle feature, and four-rectangle feature.
Given that the base resolution of the detector is 24×24, the exhaustive set of rectangle features is quite large: 160,000.
Haar-like Feature: Integral Image [7]
Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:
ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)
where ii(x, y) is the integral image and i(x, y) is the original image (see Fig. 2). Using the following pair of recurrences:
s(x, y) = s(x, y − 1) + i(x, y)    (1)
ii(x, y) = ii(x − 1, y) + s(x, y)    (2)
(where s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0), the integral image can be computed in one pass over the original image (a minimal sketch follows below).
[Figure 2: the value of the integral image at point (x, y) is the sum of all the pixels above and to the left.]
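A minimal Python sketch of recurrences (1) and (2): one pass over the image, maintaining the cumulative row sum s and reusing the previous column of ii. The small test image and the NumPy cumulative-sum check are illustrative.

```python
# Minimal sketch: compute the integral image in one pass using the recurrences
#   s(x, y)  = s(x, y-1)  + i(x, y)
#   ii(x, y) = ii(x-1, y) + s(x, y)
import numpy as np

def integral_image(img):
    h, w = img.shape
    s = np.zeros((h, w))    # cumulative row sums s(x, y)
    ii = np.zeros((h, w))   # integral image ii(x, y)
    for x in range(w):
        for y in range(h):
            s[y, x] = (s[y - 1, x] if y > 0 else 0.0) + img[y, x]
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0.0) + s[y, x]
    return ii

img = np.arange(12, dtype=float).reshape(3, 4)
assert np.allclose(integral_image(img), img.cumsum(axis=0).cumsum(axis=1))  # sanity check
```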
Haar-like Feature: Integral Image [7]
Using the integral image, any rectangular sum can be computed in four array references (see Fig. 3).
The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D.
The sum within D can be computed as 4 + 1 − (2 + 3).
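A minimal sketch of the four-reference rectangle sum. The helper rect_sum and its row/column indexing convention are assumptions made for illustration; it evaluates exactly the combination D = 4 + 1 − (2 + 3) from Fig. 3.

```python
# Minimal sketch: sum over rows r0..r1 and columns c0..c1 with four references
# into an inclusive integral image ii.
import numpy as np

def rect_sum(ii, r0, c0, r1, c1):
    total = ii[r1, c1]                       # "4": everything up to the bottom-right corner
    if r0 > 0:
        total -= ii[r0 - 1, c1]              # "2": strip above the rectangle
    if c0 > 0:
        total -= ii[r1, c0 - 1]              # "3": strip to the left of the rectangle
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]          # "1": corner subtracted twice, add it back
    return total

img = np.arange(36, dtype=float).reshape(6, 6)
ii = img.cumsum(axis=0).cumsum(axis=1)       # inclusive integral image
assert rect_sum(ii, 2, 1, 4, 3) == img[2:5, 1:4].sum()
```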
Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.
Dimension Reduction: PCA
Abstract [1]
Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample's information.
By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Geometric Picture of Principal Components [1]
A sample of n observations in the 2-D space x = (x_1, x_2)
Goal: to account for the variation in a sample in as few variables as possible, to some accuracy
Geometric Picture of Principal Components [1]
• The 1st PC is a minimum distance fit to a line in X space
• The 2nd PC is a minimum distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous.
Usage of PCA: Data Compression [1]
Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs.
Usage of PCA: Data Compression [1]
Reduce the dimensionality of the data from p to m < p by approximating
X ≈ X^(m) = Z^(m) (A^(m))^T
where Z^(m) is the n × m portion of Z, and A^(m) is the p × m portion of A.
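A minimal sketch of PCA-based compression as described above: center the data, take the eigenvectors of the covariance matrix as the columns of A, keep the first m PC scores, and reconstruct. The data dimensions and the random correlated sample are illustrative assumptions.

```python
# Minimal sketch: approximate X by Z^(m) A^(m)^T using the first m principal components.
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated sample (illustrative)
Xc = X - X.mean(axis=0)                                  # center the data

eigvals, A = np.linalg.eigh(np.cov(Xc, rowvar=False))    # columns of A: PC directions
order = np.argsort(eigvals)[::-1]                        # sort by decreasing variance
eigvals, A = eigvals[order], A[:, order]

Z = Xc @ A                                               # PC scores (n x p)
X_m = Z[:, :m] @ A[:, :m].T                              # rank-m approximation of Xc
print("fraction of variation retained:", eigvals[:m].sum() / eigvals.sum())
print("relative reconstruction error:", np.linalg.norm(Xc - X_m) / np.linalg.norm(Xc))
```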
Derivation of PCA using the Covariance Method [8]
Let X be a d-dimensional random vector expressed as a column vector. Without loss of generality, assume X has zero mean. We want to find a d × d orthonormal transformation matrix P such that
Y = P^T X
with the constraint that cov(Y) is a diagonal matrix and P^{−1} = P^T
⇒ Y = P^T X is a random vector with all its distinct components pairwise uncorrelated.
By substitution and matrix algebra, we obtain:
cov(Y) = P^T cov(X) P
Derivation of PCA using the Covariance Method [8]
Writing P as a matrix of column vectors, P = [P_1, P_2, ..., P_d], and cov(Y) as the diagonal matrix diag(λ_1, λ_2, ..., λ_d), and substituting into the equation above, we obtain:
[λ_1 P_1, λ_2 P_2, ..., λ_d P_d] = [cov(X) P_1, cov(X) P_2, ..., cov(X) P_d]
Notice that in λ_i P_i = cov(X) P_i, P_i is an eigenvector of the covariance matrix of X. Therefore, by finding the eigenvectors of the covariance matrix of X, we find a projection matrix P that satisfies the original constraints.
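A minimal numerical check of this derivation: take the columns of P to be the eigenvectors of cov(X); then cov(Y) = P^T cov(X) P comes out diagonal, i.e., the components of Y = P^T X are pairwise uncorrelated. The mixing matrix and sample size are illustrative assumptions.

```python
# Minimal sketch: verify that projecting onto the eigenvectors of cov(X)
# decorrelates the components (cov(Y) is diagonal).
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 10_000
X = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 1.0]]) @ rng.normal(size=(d, n))   # correlated data, d x n
X = X - X.mean(axis=1, keepdims=True)                       # zero-mean column vectors

eigvals, P = np.linalg.eigh(np.cov(X))   # columns of P: eigenvectors of cov(X)
Y = P.T @ X                              # Y = P^T X
cov_Y = np.cov(Y)
off_diag = cov_Y - np.diag(np.diag(cov_Y))
print("largest off-diagonal entry of cov(Y):", np.abs(off_diag).max())   # ~ 0
```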
Bayesian Decision Theory
State of Nature [2]
We let ω denote the state of nature, with ω = ω_1 for sea bass and ω = ω_2 for salmon.
Because the state of nature is so unpredictable, we consider ω to be a variable that must be described probabilistically.
P(ω_1) = P(ω_2) (uniform priors); P(ω_1) + P(ω_2) = 1 (exclusivity and exhaustivity)
More generally, we assume that there is some a priori probability (or simply prior) P(ω_1) that the next fish is sea bass, and some prior probability P(ω_2) that it is salmon.
P(ω_1) + P(ω_2) = 1 (exclusivity and exhaustivity)
Decision rule with only the prior information (a minimal sketch follows):
Decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2
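A minimal sketch of the prior-only rule: with no measurement, every fish receives the label of the class with the larger prior, and the probability of error equals the smaller prior. The prior values are illustrative assumptions.

```python
# Minimal sketch: decision using only prior probabilities.
P_w1, P_w2 = 0.7, 0.3            # P(omega_1) = sea bass, P(omega_2) = salmon (illustrative)
decision = "omega_1 (sea bass)" if P_w1 > P_w2 else "omega_2 (salmon)"
print("always decide:", decision)
print("probability of error:", min(P_w1, P_w2))
```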
Class-Conditional Probability Density [2]
In most circumstances we are not asked to make decisions with so little information. In our example, we might for instance use a lightness measurement x to improve our classifier.
We consider x to be a continuous random variable whose distribution depends on the state of nature and is expressed as p(x|ω). This is the class-conditional probability density function: the probability density function for x given that the state of nature is ω.
* p(x|ω): the probability density for value x given that the pattern is in category ω
Posterior, likelihood, evidence [2]
Suppose that we know both the prior probabilities P(ω_j) and the conditional densities p(x|ω_j) for j = 1, 2.
Suppose further that we measure the lightness of a fish and discover that its value is x.
How does this measurement influence our attitude concerning the true state of nature, that is, the category of the fish?
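A minimal sketch of how the lightness measurement enters through Bayes' formula, P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x). The Gaussian class-conditional densities, their parameters, and the priors below are illustrative assumptions.

```python
# Minimal sketch: posterior probabilities from priors and class-conditional densities.
import numpy as np

def gauss(x, mu, sigma):
    """Gaussian probability density (used here as an assumed p(x | omega_j))."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

priors = np.array([2 / 3, 1 / 3])                 # P(omega_1), P(omega_2)
x = 12.0                                          # measured lightness
likelihoods = np.array([gauss(x, 11.0, 1.5),      # p(x | omega_1)
                        gauss(x, 13.0, 1.0)])     # p(x | omega_2)

evidence = np.sum(likelihoods * priors)           # p(x)
posteriors = likelihoods * priors / evidence      # P(omega_j | x)
print("posteriors:", np.round(posteriors, 3),
      "-> decide omega_%d" % (1 + int(np.argmax(posteriors))))
```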