Sergios Theodoridis
Konstantinos Koutroumbas

PATTERN RECOGNITION
Typical application areas
Machine vision
Character recognition (OCR)
Computer-aided diagnosis
Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

Feature vectors: A number of features x_1, ..., x_ℓ constitute the feature vector x = [x_1, ..., x_ℓ]^T ∈ R^ℓ.
Feature vectors are treated as random vectors.
An example: (figure)
The classifier consists of a set of functions whose values, computed at x, determine the class to which the corresponding pattern belongs.

Classification system overview:
patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised – unsupervised – semi-supervised pattern recognition. The major directions of learning are:

Supervised: Patterns whose class is known a priori are used for training.
Unsupervised: The number of classes/groups is (in general) unknown and no training patterns are available.
Semi-supervised: A mixed set of patterns is available; for some of them the corresponding class is known, for the rest it is not.
CLASSIFIERS BASED ON BAYES DECISION THEORY

Statistical nature of feature vectors: x = [x_1, x_2, ..., x_ℓ]^T
Assign the pattern represented by the feature vector x to the most probable of the available classes ω_1, ω_2, ..., ω_M. That is,
x → ω_i : P(ω_i|x) is maximum.
Computation of a-posteriori probabilities. Assume known:
• the a-priori probabilities P(ω_1), P(ω_2), ..., P(ω_M)
• the class-conditional pdfs p(x|ω_i), i = 1, 2, ..., M (the likelihoods of ω_i with respect to x).
From the Bayes rule:
P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x)
where
p(x) = Σ_{i=1}^{M} p(x|ω_i) P(ω_i)
The Bayes classification rule (for two classes, M = 2):
Given x, classify it according to the rule:
If P(ω_1|x) > P(ω_2|x)  →  x ∈ ω_1
If P(ω_2|x) > P(ω_1|x)  →  x ∈ ω_2
Equivalently: classify x according to the rule
p(x|ω_1) P(ω_1)  ≷  p(x|ω_2) P(ω_2)
For equiprobable classes the test becomes
p(x|ω_1)  ≷  p(x|ω_2)
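As an illustration of the rule, here is a minimal sketch with hypothetical one-dimensional Gaussian class-conditional pdfs; the means, variances and priors are illustrative values, not taken from the text.

```python
# Minimal sketch of the two-class Bayes rule with (hypothetical) 1-D Gaussian
# class-conditional pdfs; all parameter values below are illustrative.
import numpy as np
from scipy.stats import norm

P1, P2 = 0.6, 0.4                      # a-priori probabilities P(w1), P(w2)
pdf1 = norm(loc=0.0, scale=1.0)        # p(x|w1)
pdf2 = norm(loc=2.0, scale=1.0)        # p(x|w2)

def classify(x):
    # Compare p(x|w1)P(w1) with p(x|w2)P(w2); p(x) cancels in the comparison.
    return 1 if pdf1.pdf(x) * P1 > pdf2.pdf(x) * P2 else 2

def posterior(x):
    # Full a-posteriori probabilities via the Bayes rule.
    num1, num2 = pdf1.pdf(x) * P1, pdf2.pdf(x) * P2
    evidence = num1 + num2             # p(x) = sum_i p(x|wi) P(wi)
    return num1 / evidence, num2 / evidence

print(classify(0.8), posterior(0.8))
```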
Equivalently, in words: divide the space into two regions R_1, R_2:
If x ∈ R_1  →  classify x to ω_1
If x ∈ R_2  →  classify x to ω_2
Probability of error (the total shaded area), for equiprobable classes:
P_e = (1/2) ∫_{-∞}^{x_0} p(x|ω_2) dx + (1/2) ∫_{x_0}^{+∞} p(x|ω_1) dx
Indeed: moving the threshold away from x_0, the total shaded area (the error probability) INCREASES by the extra “grey” area.
The Bayes classification rule for many (M > 2) classes:
Given x, classify it to ω_i if
P(ω_i|x) > P(ω_j|x)  ∀ j ≠ i
Such a choice also minimizes the classification error probability.

Minimizing the average risk:
For each wrong decision a penalty term is assigned, since some decisions are more sensitive (costly) than others.
For M = 2:
• Define the loss matrix
L = ( λ_11  λ_12
      λ_21  λ_22 )
• λ_12 is the penalty term for deciding class ω_2 although the pattern belongs to ω_1, etc.
Risk with respect to ω_1:
r_1 = λ_11 ∫_{R_1} p(x|ω_1) dx + λ_12 ∫_{R_2} p(x|ω_1) dx
Risk with respect to ω_2:
r_2 = λ_21 ∫_{R_1} p(x|ω_2) dx + λ_22 ∫_{R_2} p(x|ω_2) dx
Average risk:
r = r_1 P(ω_1) + r_2 P(ω_2)
i.e., the probabilities of wrong decisions, weighted by the penalty terms.
Choose R_1 and R_2 so that r is minimized. Then assign x to R_1 if
ℓ_1 ≡ λ_11 p(x|ω_1) P(ω_1) + λ_21 p(x|ω_2) P(ω_2)
   < ℓ_2 ≡ λ_12 p(x|ω_1) P(ω_1) + λ_22 p(x|ω_2) P(ω_2)
Equivalently: assign x to ω_1 (ω_2) if the likelihood ratio
ℓ_12 ≡ p(x|ω_1) / p(x|ω_2)  >  (<)  P(ω_2)(λ_21 − λ_22) / ( P(ω_1)(λ_12 − λ_11) )
If P(ω_1) = P(ω_2) = 1/2 and λ_11 = λ_22 = 0:
x → ω_1 if p(x|ω_1) λ_12 > p(x|ω_2) λ_21
x → ω_2 if p(x|ω_2) λ_21 > p(x|ω_1) λ_12
If, in addition, λ_12 = λ_21  →  minimum classification error probability.
Example: Let
p(x|ω_1) = (1/√π) exp(−x²),   p(x|ω_2) = (1/√π) exp(−(x − 1)²)
P(ω_1) = P(ω_2) = 1/2,   L = ( 0    0.5
                                1.0  0   )
Then the threshold values are:
• Threshold x_0 for minimum P_e:
x_0 : exp(−x_0²) = exp(−(x_0 − 1)²)  ⇒  x_0 = 1/2
• Threshold x̂_0 for minimum r:
x̂_0 : exp(−x̂_0²) = 2 exp(−(x̂_0 − 1)²)  ⇒  x̂_0 = (1 − ln 2)/2 < 1/2
Thus x̂_0 moves to the left of x_0 = 1/2: region R_2 is enlarged at the expense of R_1, because misclassifying a pattern from ω_2 carries the heavier penalty (λ_21 = 1.0 > λ_12 = 0.5).
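The two thresholds of this example can also be checked numerically; the sketch below applies scipy's root finder to the example's densities and loss values.

```python
# Numerical check of the two thresholds in the example above.
import numpy as np
from scipy.optimize import brentq

p1 = lambda x: np.exp(-x**2) / np.sqrt(np.pi)        # p(x|w1)
p2 = lambda x: np.exp(-(x - 1)**2) / np.sqrt(np.pi)  # p(x|w2)
P1 = P2 = 0.5
l12, l21 = 0.5, 1.0                                  # off-diagonal loss terms

# Minimum-error threshold: p(x|w1)P(w1) = p(x|w2)P(w2)
x0 = brentq(lambda x: p1(x) * P1 - p2(x) * P2, -5, 5)

# Minimum-risk threshold: l12 p(x|w1)P(w1) = l21 p(x|w2)P(w2)
x0_hat = brentq(lambda x: l12 * p1(x) * P1 - l21 * p2(x) * P2, -5, 5)

print(x0, x0_hat, (1 - np.log(2)) / 2)   # 0.5 and (1 - ln 2)/2 ~= 0.153
```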
DISCRIMINANT FUNCTIONS – DECISION SURFACES

If the regions R_i, R_j are contiguous:
g(x) ≡ P(ω_i|x) − P(ω_j|x) = 0
is the surface separating the regions:
R_i : P(ω_i|x) > P(ω_j|x)
R_j : P(ω_j|x) > P(ω_i|x)
On the one side it is positive (+), on the other negative (−). It is known as the Decision Surface.
If f(·) is monotonically increasing, the rule remains the same if we use:
x → ω_i if f(P(ω_i|x)) > f(P(ω_j|x))  ∀ j ≠ i
g_i(x) ≡ f(P(ω_i|x)) is a discriminant function.
In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable. Moreover, in practice, they may also lead to better solutions; this, for example, may be the case if the nature of the underlying pdfs is unknown.
THE GAUSSIAN DISTRIBUTION

The one-dimensional case:
p(x) = (1/(√(2π) σ)) exp( −(x − μ)² / (2σ²) )
The multivariate (multidimensional) case:
p(x) = 1/( (2π)^{ℓ/2} |Σ|^{1/2} ) · exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) )
where μ = E[x] is the mean value and Σ is known as the covariance matrix, defined as
Σ = E[ (x − μ)(x − μ)^T ]
An example, the two-dimensional case:
p(x_1, x_2) = 1/( 2π |Σ|^{1/2} ) · exp( −(1/2) (x − μ)^T Σ^{-1} (x − μ) ),  where
μ = [ E[x_1], E[x_2] ]^T = [μ_1, μ_2]^T
Σ = ( σ_1²   σ_12
      σ_12   σ_2² ),   σ_12 = E[ (x_1 − μ_1)(x_2 − μ_2) ]
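A short sketch evaluating this density directly from the formula and cross-checking against scipy; the 2-D mean and covariance below are illustrative.

```python
# Evaluate the multivariate Gaussian density from the formula above.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                    # mean vector (illustrative)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])               # covariance matrix (illustrative)

def gaussian_pdf(x, mu, Sigma):
    l = mu.size
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # same value
```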
BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Assume the class-conditional pdfs are multivariate Gaussians:
p(x|ω_i) = 1/( (2π)^{ℓ/2} |Σ_i|^{1/2} ) · exp( −(1/2) (x − μ_i)^T Σ_i^{-1} (x − μ_i) )
where μ_i = E[x|ω_i] is the mean of class ω_i and Σ_i = E[ (x − μ_i)(x − μ_i)^T | ω_i ] is its covariance matrix.
ln(·) is monotonic, so define the discriminant functions:
g_i(x) = ln( p(x|ω_i) P(ω_i) ) = ln p(x|ω_i) + ln P(ω_i)
       = −(1/2) (x − μ_i)^T Σ_i^{-1} (x − μ_i) + ln P(ω_i) + C_i
where C_i = −(ℓ/2) ln(2π) − (1/2) ln|Σ_i|.
That is, g_i(x) is a quadratic function of x, and the decision surfaces g_i(x) − g_j(x) = 0 are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.
For example, for ℓ = 2 and Σ_i = σ_i² I:
g_i(x) = −(1/(2σ_i²)) (x_1² + x_2²) + (1/σ_i²) (μ_i1 x_1 + μ_i2 x_2) − (1/(2σ_i²)) (μ_i1² + μ_i2²) + ln P(ω_i) + C_i
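A minimal sketch of these Gaussian discriminants; the class parameters below are hypothetical, chosen only so that the covariances differ and the resulting decision surface is a quadric.

```python
# Gaussian (quadratic) discriminant g_i(x); classification picks the largest g_i.
import numpy as np

def g(x, mu, Sigma, prior):
    l = mu.size
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            + np.log(prior)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * l * np.log(2 * np.pi))     # the constant C_i

# Two hypothetical classes with different covariances -> quadric decision surface.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]

x = np.array([1.0, 1.2])
scores = [g(x, mu, S, P) for mu, S, P in params]
print(scores, "-> class", int(np.argmax(scores)) + 1)
```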
Example 1, Example 2: (figures of the resulting quadratic decision curves)
Decision Hyperplanes

The quadratic terms are x^T Σ_i^{-1} x. If ALL Σ_i = Σ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently,
g_i(x) = w_i^T x + w_{i0}
w_i = Σ^{-1} μ_i
w_{i0} = ln P(ω_i) − (1/2) μ_i^T Σ^{-1} μ_i
i.e., the discriminant functions are LINEAR.
Let in addition Σ = σ² I. Then the decision surface between ω_i and ω_j is
g_ij(x) ≡ g_i(x) − g_j(x) = w^T (x − x_0) = 0,  with
w = μ_i − μ_j
x_0 = (1/2)(μ_i + μ_j) − σ² ln( P(ω_i)/P(ω_j) ) · (μ_i − μ_j) / ||μ_i − μ_j||²
• If P(ω_i) ≠ P(ω_j), the linear classifier (the hyperplane) moves towards the class with the smaller a-priori probability, so that more space is allotted to the more probable class.
If Σ is nondiagonal:
g_ij(x) = w^T (x − x_0) = 0,  where now
w = Σ^{-1} (μ_i − μ_j)
x_0 = (1/2)(μ_i + μ_j) − ln( P(ω_i)/P(ω_j) ) · (μ_i − μ_j) / ||μ_i − μ_j||²_{Σ^{-1}}
with ||x||_{Σ^{-1}} ≡ (x^T Σ^{-1} x)^{1/2} the Σ^{-1}-norm. The decision hyperplane is, in general, not orthogonal to μ_i − μ_j.
Minimum Distance Classifiers

For equiprobable classes with common covariance matrix Σ, the rule reduces (dropping class-independent constants) to maximizing
g_i(x) = −(1/2) (x − μ_i)^T Σ^{-1} (x − μ_i)
• If Σ = σ² I: assign x to the class whose mean is closest in Euclidean distance, d_E = ||x − μ_i||.
• If Σ is nondiagonal: assign x to the class whose mean is closest in Mahalanobis distance, d_m = ( (x − μ_i)^T Σ^{-1} (x − μ_i) )^{1/2}.
Example: Given
p(x|ω_1) = N(μ_1, Σ),  p(x|ω_2) = N(μ_2, Σ),  P(ω_1) = P(ω_2)
μ_1 = [0, 0]^T,  μ_2 = [3, 3]^T,  Σ = ( 1.1  0.3
                                         0.3  1.9 )
classify the vector [1.0, 2.2]^T using Bayesian classification.
Σ^{-1} = ( 0.95  −0.15
           −0.15  0.55 )
Compute the Mahalanobis distances from μ_1, μ_2:
d²_{m,1} = [1.0, 2.2] Σ^{-1} [1.0, 2.2]^T = 2.952
d²_{m,2} = [−2.0, −0.8] Σ^{-1} [−2.0, −0.8]^T = 3.672
Classify x → ω_1. Observe that d_{E,2} < d_{E,1}, i.e., a plain Euclidean distance rule would have assigned x to ω_2.
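A few lines of NumPy reproduce the numbers of this example.

```python
# Reproducing the Mahalanobis-distance example above.
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)           # [[0.95, -0.15], [-0.15, 0.55]]

x = np.array([1.0, 2.2])
d1_sq = (x - mu1) @ Sigma_inv @ (x - mu1)  # 2.952
d2_sq = (x - mu2) @ Sigma_inv @ (x - mu2)  # 3.672
print(d1_sq, d2_sq)                        # Mahalanobis: assign x to w1

# Euclidean distances point the other way (x is closer to mu2):
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))
```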
ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood:
Let x_1, x_2, ..., x_N be known and independent. Let p(x) be known within an unknown vector parameter θ:
p(x) ≡ p(x; θ)
With X = {x_1, x_2, ..., x_N}:
p(X; θ) ≡ p(x_1, x_2, ..., x_N; θ) = Π_{k=1}^{N} p(x_k; θ)
The ML estimate:
θ̂_ML = arg max_θ Π_{k=1}^{N} p(x_k; θ)
Define the log-likelihood L(θ) ≡ ln p(X; θ) = Σ_{k=1}^{N} ln p(x_k; θ). Then
θ̂_ML : ∂L(θ)/∂θ = Σ_{k=1}^{N} (1/p(x_k; θ)) ∂p(x_k; θ)/∂θ = 0
Asymptotically unbiased and consistent: if θ_0 is the true value of the parameter, i.e., p(x) = p(x; θ_0), then
lim_{N→∞} E[θ̂_ML] = θ_0
lim_{N→∞} E[ ||θ̂_ML − θ_0||² ] = 0
Example: p(x): N(μ, Σ), Gaussian with known Σ and unknown mean μ. For X = {x_1, ..., x_N}:
p(x_k; μ) = 1/( (2π)^{ℓ/2} |Σ|^{1/2} ) exp( −(1/2)(x_k − μ)^T Σ^{-1} (x_k − μ) )
L(μ) ≡ ln Π_{k=1}^{N} p(x_k; μ) = C − (1/2) Σ_{k=1}^{N} (x_k − μ)^T Σ^{-1} (x_k − μ)
∂L(μ)/∂μ = Σ_{k=1}^{N} Σ^{-1} (x_k − μ) = 0  ⇒  μ̂_ML = (1/N) Σ_{k=1}^{N} x_k
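A small sketch of this result on synthetic data: the closed-form sample mean is compared with a direct numerical maximization of the log-likelihood (known covariance assumed).

```python
# ML estimate of a Gaussian mean: closed form vs. numerical maximization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=500)

mu_ml_closed = X.mean(axis=0)                       # (1/N) sum_k x_k

def neg_log_likelihood(mu):
    return -multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

mu_ml_numeric = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(mu_ml_closed, mu_ml_numeric)                  # essentially identical
```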
Maximum A-Posteriori (MAP) Probability Estimation

In the ML method, θ was considered to be an unknown, fixed parameter. Here we shall look at θ as a random vector described by a pdf p(θ), assumed to be known.
Given X = {x_1, ..., x_N}, compute the maximum of p(θ|X).
From the Bayes theorem:
p(θ) p(X|θ) = p(X) p(θ|X),  or  p(θ|X) = p(θ) p(X|θ) / p(X)
The method:
θ̂_MAP = arg max_θ p(θ|X),  or equivalently
θ̂_MAP : ∂( p(θ) p(X|θ) ) / ∂θ = 0
If p(θ) is uniform or broad enough, θ̂_MAP ≈ θ̂_ML.
Example: p(x|μ): N(μ, σ² I), with μ unknown, X = {x_1, ..., x_N}, and Gaussian prior
p(μ) = 1/( (2π)^{ℓ/2} σ_μ^ℓ ) exp( −||μ − μ_0||² / (2σ_μ²) )
μ̂_MAP : ∂ ln( p(μ) Π_{k=1}^{N} p(x_k|μ) ) / ∂μ = 0,  or
Σ_{k=1}^{N} (1/σ²)(x_k − μ̂_MAP) − (1/σ_μ²)(μ̂_MAP − μ_0) = 0
⇒  μ̂_MAP = ( μ_0 + (σ_μ²/σ²) Σ_{k=1}^{N} x_k ) / ( 1 + (σ_μ²/σ²) N )
For σ_μ² ≫ σ², or for N → ∞:
μ̂_MAP ≈ μ̂_ML = (1/N) Σ_{k=1}^{N} x_k
Bayesian Inference

ML and MAP compute a single estimate for θ. Here a different route is followed.
Given: X = {x_1, ..., x_N}, p(x|θ) and the prior p(θ).
The goal: estimate p(x|X). How?
p(x|X) = ∫ p(x|θ) p(θ|X) dθ
p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ) p(θ) dθ
p(X|θ) = Π_{k=1}^{N} p(x_k|θ)
For more insight, an example: let p(x|μ) be N(μ, σ²) with σ² known and μ unknown, and let the prior be p(μ): N(μ_0, σ_0²). It turns out that:
p(μ|X) is N(μ_N, σ_N²), with
μ_N = ( N σ_0² x̄ + σ² μ_0 ) / ( N σ_0² + σ² ),   x̄ = (1/N) Σ_{k=1}^{N} x_k
σ_N² = ( σ_0² σ² ) / ( N σ_0² + σ² )
and p(x|X) is N(μ_N, σ² + σ_N²).
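A sketch of these update formulas for the scalar Gaussian case, with illustrative data and prior values.

```python
# Bayesian inference for a Gaussian mean: posterior p(mu|X) and predictive p(x|X).
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                       # known likelihood std
mu0, sigma0 = 0.0, 2.0            # prior p(mu) = N(mu0, sigma0^2)
X = rng.normal(1.5, sigma, size=20)

N, xbar = len(X), X.mean()
mu_N = (N * sigma0**2 * xbar + sigma**2 * mu0) / (N * sigma0**2 + sigma**2)
sigma_N_sq = (sigma0**2 * sigma**2) / (N * sigma0**2 + sigma**2)

print("posterior  p(mu|X): N(%.3f, %.3f)" % (mu_N, sigma_N_sq))
print("predictive p(x|X):  N(%.3f, %.3f)" % (mu_N, sigma**2 + sigma_N_sq))
```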
The previous formulae correspond to a sequence of Gaussians for different values of N: as N grows, p(μ|X) becomes increasingly peaked around the sample mean.
Example: (figure of the posterior p(μ|X) for the given prior information and increasing N)
Maximum Entropy Method

Compute the pdf so as to be maximally non-committal with respect to the unavailable information, while being constrained to respect the available information.
This is equivalent to maximizing the uncertainty, i.e., the entropy, subject to the available constraints.
Entropy:
H = −∫ p(x) ln p(x) dx
p̂(x) : maximize H subject to the available constraints.
Example: x is nonzero in the interval x_1 ≤ x ≤ x_2 and zero otherwise. Compute the ME pdf.
The constraint is ∫_{x_1}^{x_2} p(x) dx = 1.
Using a Lagrange multiplier, maximize H_L = −∫_{x_1}^{x_2} p(x) ln p(x) dx + λ( ∫_{x_1}^{x_2} p(x) dx − 1 ), which gives
p̂(x) = exp(λ − 1)
and, after imposing the constraint,
p̂(x) = 1/(x_2 − x_1) for x_1 ≤ x ≤ x_2, and 0 otherwise.
• This is most natural: the “most random” pdf is the uniform one, which complies with the Maximum Entropy rationale.
• It turns out that, if the constraints are the mean value and the variance, then the Maximum Entropy estimate is the Gaussian pdf. That is, the Gaussian pdf is the “most random” one among all pdfs with the same mean and variance.
Mixture Models

p(x) = Σ_{j=1}^{J} p(x|j) P_j,   with Σ_{j=1}^{J} P_j = 1 and ∫ p(x|j) dx = 1
Assume parametric modeling of the components, i.e., p(x|j) ≡ p(x|j; θ).
The goal is to estimate θ and P_1, P_2, ..., P_J, given the set X = {x_1, ..., x_N}.
Why not maximum likelihood, as before?
max_{θ, P_1, ..., P_J} Π_{k=1}^{N} p(x_k; θ, P_1, ..., P_J)
This is a nonlinear problem, due to the missing label information; it is a typical problem with an incomplete data set.

The Expectation-Maximisation (EM) algorithm
• General formulation: let y ∈ Y ⊆ R^m be the complete data set, with pdf p(y; θ), where the y's are not observed directly.
• We observe x = g(y) ∈ X ⊆ R^ℓ, ℓ < m, with pdf p(x; θ).
• Let Y(x) ⊆ R^m be the set of all y's corresponding to a specific x. Then
p(x; θ) = ∫_{Y(x)} p(y; θ) dy
• What we would need is the ML estimate computed from the complete data:
θ̂ : Σ_k ∂ ln p(y_k; θ) / ∂θ = 0
• But the y's are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of θ.
The algorithm:
• E-step:  Q(θ; θ(t)) = E[ Σ_k ln p(y_k; θ) | X; θ(t) ]
• M-step:  θ(t+1) : ∂Q(θ; θ(t)) / ∂θ = 0

Application to the mixture modeling problem:
• Complete data: (x_k, j_k), k = 1, 2, ..., N
• Observed data: x_k, k = 1, 2, ..., N
• p(x_k, j_k; θ) = p(x_k|j_k; θ) P_{j_k}
• Unknown parameters: Θ^T = [θ^T, P^T], with P^T = [P_1, ..., P_J]
• E-step:
Q(Θ; Θ(t)) = E[ Σ_{k=1}^{N} ln( p(x_k|j_k; θ) P_{j_k} ) ]
           = Σ_{k=1}^{N} Σ_{j=1}^{J} P(j | x_k; Θ(t)) ln( p(x_k|j; θ) P_j )
where
P(j | x_k; Θ(t)) = p(x_k|j; θ(t)) P_j(t) / p(x_k; Θ(t))
• M-step:
∂Q/∂θ = 0,   ∂Q/∂P_j = 0,  j = 1, 2, ..., J
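A compact EM sketch for a mixture of two one-dimensional Gaussians, following the E-step/M-step above; the data are synthetic and the closed-form Gaussian M-step updates are assumed.

```python
# EM for a 1-D mixture of two Gaussians (J = 2), following the steps above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.7, 200)])

# Initial guesses for P_j, mu_j, sigma_j
P = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for t in range(100):
    # E-step: responsibilities P(j | x_k; Theta(t))
    dens = np.vstack([P[j] * norm.pdf(X, mu[j], sigma[j]) for j in range(2)])
    gamma = dens / dens.sum(axis=0)
    # M-step: update the mixture weights and the Gaussian parameters
    Nj = gamma.sum(axis=1)
    P = Nj / len(X)
    mu = (gamma @ X) / Nj
    sigma = np.sqrt((gamma * (X - mu[:, None]) ** 2).sum(axis=1) / Nj)

print(P, mu, sigma)   # close to the generating parameters
```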
Nonparametric Estimation

In words: place a segment (bin) of length h at x̂ and count the points inside it. With k_N of the N total points falling inside the bin,
p̂(x) ≈ p̂(x̂) = (1/h)(k_N / N),   x ∈ [x̂ − h/2, x̂ + h/2]
If p(x) is continuous, then p̂(x) → p(x) as N → ∞, provided
h_N → 0,  k_N → ∞,  k_N / N → 0
Parzen Windows

Place at x a hypercube of side length h and count the points inside it. Define the window (kernel) function
φ(x_i) = 1 if |x_{ij}| ≤ 1/2 for every coordinate j = 1, ..., ℓ, and 0 otherwise.
Then
p̂(x) = (1/h^ℓ)(1/N) Σ_{i=1}^{N} φ( (x_i − x)/h )
that is, (1/volume) × (number of points inside an h-side hypercube centered at x)/N.
• The φ(·) above is discontinuous; smooth kernels can be used instead, provided they satisfy φ(x) ≥ 0 and ∫ φ(x) dx = 1, so that p̂(x) is a smooth, legitimate pdf.
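A one-dimensional Parzen sketch with both the box kernel defined above and a Gaussian kernel as the smooth alternative; the data and the value of h are illustrative.

```python
# Parzen-window density estimate in one dimension.
import numpy as np

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=1000)

def parzen(x, data, h, kernel="box"):
    u = (data - x) / h
    if kernel == "box":
        phi = (np.abs(u) <= 0.5).astype(float)       # hypercube kernel
    else:                                            # smooth (Gaussian) kernel
        phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.sum() / (len(data) * h)               # (1/h^l)(1/N) sum phi(.)

for x in (-1.0, 0.0, 1.0):
    print(x, parzen(x, samples, h=0.3), parzen(x, samples, h=0.3, kernel="gauss"))
```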
Mean value of the estimate:
E[ p̂(x) ] = (1/h^ℓ)(1/N) Σ_{i=1}^{N} E[ φ( (x_i − x)/h ) ] = ∫ (1/h^ℓ) φ( (x' − x)/h ) p(x') dx'
As h → 0, the width of (1/h^ℓ) φ(x'/h) tends to 0 and it approaches a delta function, so E[ p̂(x) ] → p(x): the estimate is asymptotically unbiased.
Trang 63h=0.1, N=10000
The higher the N the better the accuracy
Trang 6422 21
1
2 2
1 12
) (
)
( )
( ) (
) (
P
P x
p
x
p l
) (
) (
h
x x
h N
Trang 65 CURSE OF DIMENSIONALITY
In all the methods, so far, we saw that the highest
the number of points, N, the better the resulting
estimate
If in the one-dimensional space an interval, filled
with N points, is adequate (for good estimation), in
the two-dimensional space the corresponding square
will require N 2 and in the dimensional space the dimensional cube will require N ℓ points
ℓ-The exponential increase in the number of necessary points in known as the curse of dimensionality This
is a major problem one is confronted with in high
dimensional spaces
Trang 66An Example :
Trang 67 NAIVE – BAYES CLASSIFIER
Let and the goal is to estimate
i = 1, 2, …, M For a “good” estimate of the pdf
one would need, say, N ℓ points
Assume x 1 , x 2 ,…, x ℓ mutually independent Then:
In this case, one would require, roughly, N points
for each pdf Thus, a number of points of the
order N·ℓ would suffice.
It turns out that the Nạve – Bayes classifier
works reasonably well even in cases that violate the independence assumption
x
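A minimal Gaussian naive Bayes sketch: each class-conditional pdf is modeled as a product of one-dimensional Gaussians, so only per-feature means and variances are estimated (synthetic data, illustrative parameters).

```python
# Gaussian naive Bayes: p(x|wi) = prod_j p(x_j|wi), each factor a 1-D Gaussian.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
X1 = rng.normal([0, 0, 0], 1.0, size=(200, 3))       # class w1 samples
X2 = rng.normal([2, 1, -1], 1.0, size=(200, 3))      # class w2 samples
priors = np.array([0.5, 0.5])

stats = [(X.mean(axis=0), X.std(axis=0)) for X in (X1, X2)]

def classify(x):
    # log p(x|wi) + log P(wi) = sum_j log p(x_j|wi) + log P(wi)
    scores = [np.sum(norm.logpdf(x, m, s)) + np.log(p)
              for (m, s), p in zip(stats, priors)]
    return int(np.argmax(scores)) + 1

print(classify(np.array([0.1, -0.2, 0.3])))   # -> 1
print(classify(np.array([1.8, 1.1, -0.9])))   # -> 2
```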
k Nearest Neighbor Density Estimation

In Parzen: the volume is constant, the number of points falling inside it varies. Here, instead, keep the number of points k constant and let the volume around x vary until it contains exactly k points:
p̂(x) = k / ( N V(x) )
For classification, the likelihood-ratio test then becomes
ℓ_12 = p̂(x|ω_1) / p̂(x|ω_2) = ( k/(N_1 V_1(x)) ) / ( k/(N_2 V_2(x)) ) = ( N_2 V_2(x) ) / ( N_1 V_1(x) )
Choose k out of the N training vectors, identify the k nearest ones to x
Out of these k identify k i that belong to class ω i
The simplest version
k=1 !!!
For large N this is not bad It can be shown that:
if P B is the optimal Bayesian error probability, then:
j i
k k
:
x i i j
M
Trang 71PB kNN B NN
B kNN P P
NN
B NN
P P
P
P P
Trang 72 Voronoi tesselation
Trang 73 Bayes Probability Chain Rule
Assume now that the conditional dependence for
each x i is limited to a subset of the features
appearing in each of the product terms That is:
where
BAYESIAN NETWORKS
) ( )
| (
) , ,
| (
) , ,
| ( )
, , ,
(
1 1
2
1 2
1 1
1 2
1
x p x
x p
x x
x p x
x x
p x
x x
p x
p x
x x p
x 1 , x 2, , x1
Ai i i
Trang 74For example, if ℓ=6, then we could assume:
| (
) , ,
Trang 75A graphical way to portray conditional dependencies
For this case:
) ( )
( )
| ( )
| ( )
,
| ( )
, , ,
( x1 x2 x6 p x6 x5 x4 p x5 x4 p x3 x2 p x2 p x1
Trang 76 Bayesian Networks
Definition: A Bayesian Network is a directed acyclicgraph (DAG) where the nodes correspond to random variables Each node is associated with a set of
conditional probabilities (densities), p(xi |A i ), where x i
is the variable associated with the node and A i is the set of its parents in the graph
A Bayesian Network is specified by:
• The marginal probabilities of its root nodes.
• The conditional probabilities of the non-root nodes,
given their parents , for ALL possible values of the involved variables.
Trang 77The figure below is an example of a Bayesian
Network corresponding to a paradigm from the
medical applications field
models conditional dependencies for an example concerning smokers (S),
tendencies to develop cancer (C) and heart disease (H), together with variables
corresponding to heart (H1, H2) and cancer (C1, C2) medical tests
Trang 78Once a DAG has been constructed, the joint
probability can be obtained by multiplying the
marginal (root nodes) and the conditional (non-root nodes) probabilities
Training: Once a topology is given, probabilities are estimated via the training data set There are also methods that learn the topology
Probability Inference: This is the most common task that Bayesian networks help us to solve efficiently Given the values of some of the variables in the
graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence
Trang 79 Example: Consider the Bayesian network of the
Trang 80For a), a set of calculations are required that
propagate from node x to node w It turns out that
P(w0|x1) = 0.63.
For b), the propagation is reversed in direction It
turns out that P(x0|w1) = 0.4.
In general, the required inference information is
computed via a combined process of “message
passing” among the nodes of the DAG
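The kind of computation involved can be sketched by inference-by-enumeration on a small hypothetical chain x → y → z → w of binary variables; the conditional probability values below are made up for illustration and do not reproduce the book's network or its numbers 0.63 and 0.4.

```python
# Inference by enumeration on a hypothetical chain network x -> y -> z -> w.
import itertools

P_x = {1: 0.6, 0: 0.4}                                   # root marginal P(x)
P_y_given_x = {(1, 1): 0.8, (1, 0): 0.3}                 # P(y=1|x)
P_z_given_y = {(1, 1): 0.7, (1, 0): 0.2}                 # P(z=1|y)
P_w_given_z = {(1, 1): 0.9, (1, 0): 0.4}                 # P(w=1|z)

def cond(table, child, parent):
    p1 = table[(1, parent)]
    return p1 if child == 1 else 1.0 - p1

def joint(x, y, z, w):
    # Joint as the product of root marginal and conditional probabilities.
    return (P_x[x] * cond(P_y_given_x, y, x)
            * cond(P_z_given_y, z, y) * cond(P_w_given_z, w, z))

def query(w_val, x_val):
    # P(w = w_val | x = x_val) by summing the joint over the hidden nodes y, z.
    num = sum(joint(x_val, y, z, w_val)
              for y, z in itertools.product((0, 1), repeat=2))
    den = sum(joint(x_val, y, z, w)
              for y, z, w in itertools.product((0, 1), repeat=3))
    return num / den

print(query(w_val=0, x_val=1))
```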
Complexity:
For singly connected graphs, message passing operations scale linearly with the number of nodes; for graphs containing loops, exact inference is in general computationally hard, and approximate schemes are then employed.