
DOCUMENT INFORMATION

Basic information

Title: Pattern Recognition
Authors: Sergios Theodoridis, Konstantinos Koutroumbas
Field: Pattern Recognition
Type: Lecture Notes
Pages: 80
Size: 1.66 MB


Contents


Page 1

Sergios Theodoridis

Konstantinos Koutroumbas

Page 2

PATTERN RECOGNITION

 Typical application areas

 Machine vision

 Character recognition (OCR)

 Computer aided diagnosis

Page 3

 Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

 Feature vectors: A number of features $x_1, \ldots, x_\ell$ constitute the feature vector $x = [x_1, \ldots, x_\ell]^T \in R^\ell$.

 Feature vectors are treated as random vectors.

Page 4

An example:

Page 5

 The classifier consists of a set of functions $g_i(x)$, whose values, computed at $x$, determine the class to which the corresponding pattern belongs.

 Classification system overview:

Patterns → sensor → feature generation → feature selection → classifier design → system evaluation

Page 6

 Supervised – unsupervised – semisupervised pattern recognition:

The major directions of learning are:

Supervised: Patterns whose class is known a priori are used for training.

Unsupervised: The number of classes/groups is (in general) unknown and no training patterns are available.

Semisupervised: A mixed set of patterns is available. For some of them the corresponding class is known, and for the rest it is not.

Page 7

CLASSIFIERS BASED ON BAYES DECISION THEORY

 Statistical nature of the feature vectors $x = [x_1, x_2, \ldots, x_\ell]^T$

 Assign the pattern represented by feature vector $x$ to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$.

That is, $x \rightarrow \omega_i$ for which $P(\omega_i \mid x)$ is maximum.

Page 8

 Computation of a-posteriori probabilities

 Assume known

• the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$

• the class-conditional pdfs $p(x \mid \omega_i)$, $i = 1, 2, \ldots, M$ (sometimes called the likelihood of $\omega_i$ with respect to $x$).

Page 9

 The Bayes rule ($M = 2$):

$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i)$

i.e.

$P(\omega_i \mid x) = \dfrac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$, where $p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$

Page 10

The Bayes classification rule (for two classes, $M = 2$):

 Given $x$, classify it according to the rule:

If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \rightarrow \omega_1$

If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \rightarrow \omega_2$

 Equivalently: classify $x$ according to the rule

$p(x \mid \omega_1)\,P(\omega_1) \;(\gtrless)\; p(x \mid \omega_2)\,P(\omega_2)$

 For equiprobable classes the test becomes

$p(x \mid \omega_1) \;(\gtrless)\; p(x \mid \omega_2)$
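As a quick illustration of the rule above, here is a minimal Python sketch (not part of the original slides; the Gaussian class-conditional densities and all numeric values are assumptions) that classifies a scalar sample by comparing $p(x \mid \omega_1)P(\omega_1)$ with $p(x \mid \omega_2)P(\omega_2)$:

import math

def gaussian_pdf(x, mean, var):
    # 1-D Gaussian density N(mean, var)
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bayes_classify(x, priors, means, variances):
    # Return the index (1 or 2) of the class with the larger p(x|w_i) P(w_i)
    scores = [gaussian_pdf(x, m, v) * P for P, m, v in zip(priors, means, variances)]
    return 1 + scores.index(max(scores))

# Hypothetical two-class example: equiprobable classes, unit variances
print(bayes_classify(0.3, priors=[0.5, 0.5], means=[0.0, 1.0], variances=[1.0, 1.0]))  # -> 1
print(bayes_classify(0.8, priors=[0.5, 0.5], means=[0.0, 1.0], variances=[1.0, 1.0]))  # -> 2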

Page 11

(Figure: the two class-conditional pdfs $p(x \mid \omega_1)$, $p(x \mid \omega_2)$ and the threshold $x_0$ partitioning the feature space into the regions $R_1$ and $R_2$.)

Page 12

 Equivalently, in words: divide the space into two regions:

If $x \in R_1 \Rightarrow x \in \omega_1$

If $x \in R_2 \Rightarrow x \in \omega_2$

 Probability of error (for equiprobable classes):

$P_e = \dfrac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \dfrac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$

 Total shaded area

Page 13

Indeed: moving the threshold away from $x_0$, the total shaded area (the error probability) INCREASES by the extra "grey" area.

Page 14

 The Bayes classification rule for many ($M > 2$) classes:

 Given $x$, classify it to $\omega_i$ if:

$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$

 Such a choice also minimizes the classification error probability.

 Minimizing the average risk:

 For each wrong decision a penalty term is assigned, since some decisions are more sensitive than others.

Page 15

For $M = 2$:

• Define the loss matrix

$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$

• $\lambda_{12}$ is the penalty term for deciding class $\omega_2$ although the pattern belongs to $\omega_1$, etc.

 Risk with respect to $\omega_1$:

$r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\,dx$

Page 16

 Risk with respect to $\omega_2$:

$r_2 = \lambda_{21}\int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22}\int_{R_2} p(x \mid \omega_2)\,dx$

 Average risk:

$r = r_1 P(\omega_1) + r_2 P(\omega_2)$

 Probabilities of wrong decisions, weighted by the penalty terms.

Page 17

 Choose $R_1$ and $R_2$ so that $r$ is minimized.

 Then assign $x$ to $\omega_1$ if

$\ell_1 \equiv \lambda_{11}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{21}\,p(x \mid \omega_2)P(\omega_2) \;<\; \ell_2 \equiv \lambda_{12}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{22}\,p(x \mid \omega_2)P(\omega_2)$

 Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if

$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\;(<)\; \dfrac{P(\omega_2)}{P(\omega_1)}\cdot\dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$

$\ell_{12}$ is known as the likelihood ratio.

Page 18

 If $P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:

$x \rightarrow \omega_1$ if $p(x \mid \omega_1)\,\lambda_{12} > p(x \mid \omega_2)\,\lambda_{21}$

$x \rightarrow \omega_2$ if $p(x \mid \omega_2)\,\lambda_{21} > p(x \mid \omega_1)\,\lambda_{12}$

 If, in addition, $\lambda_{12} = \lambda_{21}$, the rule reduces to the minimum classification error probability rule.

Page 19

 An example:

$p(x \mid \omega_1) = \dfrac{1}{\sqrt{\pi}}\exp(-x^2)$

$p(x \mid \omega_2) = \dfrac{1}{\sqrt{\pi}}\exp(-(x-1)^2)$

$P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$

$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$

Page 20

Then the threshold value $x_0$ for minimum classification error probability is:

$x_0: \;\exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \dfrac{1}{2}$

The threshold $\hat{x}_0$ for minimum average risk $r$ is:

$\hat{x}_0: \;\exp(-x^2) = 2\exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \dfrac{1 - \ln 2}{2} < \dfrac{1}{2}$

Page 21

Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$: misclassifying an $\omega_2$ pattern carries the larger penalty ($\lambda_{21} = 1.0 > \lambda_{12} = 0.5$), so the region assigned to $\omega_2$ is enlarged.
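A short numeric check of the two thresholds (a sketch added for illustration; it only uses the pdfs and loss matrix defined in the example above):

import math

# Class-conditional pdfs from the example above
p1 = lambda x: math.exp(-x ** 2) / math.sqrt(math.pi)
p2 = lambda x: math.exp(-(x - 1) ** 2) / math.sqrt(math.pi)

x0 = 0.5                         # minimum-error threshold: p1(x0) == p2(x0)
x0_hat = (1 - math.log(2)) / 2   # minimum-risk threshold: p1 == 2 * p2

print(abs(p1(x0) - p2(x0)))              # ~0: the likelihoods are equal at x0
print(abs(p1(x0_hat) - 2 * p2(x0_hat)))  # ~0: likelihood ratio equals lambda_21/lambda_12 = 2
print(x0_hat)                            # ~0.1534, i.e. to the left of 0.5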

Page 22

DISCRIMINANT FUNCTIONS AND DECISION SURFACES

 If the regions $R_i$, $R_j$ are contiguous:

$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$

is the surface separating the regions:

$R_i: P(\omega_i \mid x) > P(\omega_j \mid x)$

$R_j: P(\omega_j \mid x) > P(\omega_i \mid x)$

On the one side it is positive (+), on the other negative (−). It is known as the Decision Surface.

Page 23

 If $f(\cdot)$ is monotonically increasing, the rule remains the same if we use:

$x \rightarrow \omega_i \;\text{ if }\; f(P(\omega_i \mid x)) > f(P(\omega_j \mid x)) \quad \forall j \neq i$

 $g_i(x) \equiv f(P(\omega_i \mid x))$ is a discriminant function.

 In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable. Moreover, in practice, they may also lead to better solutions. This, for example, may be the case if the nature of the underlying pdfs is unknown.

Page 24

THE GAUSSIAN DISTRIBUTION

 The one-dimensional case:

$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$

where $\mu$ is the mean value and $\sigma^2$ the variance.

Page 25

 The multivariate (multidimensional) case:

$p(x) = \dfrac{1}{(2\pi)^{\ell/2}\,|\Sigma|^{1/2}}\exp\!\left(-\dfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$

where $\mu = E[x]$ is the mean vector and $\Sigma$ is known as the covariance matrix, defined as

$\Sigma = E\left[(x-\mu)(x-\mu)^T\right]$

 An example, the two-dimensional case:

$p(x_1, x_2) = \dfrac{1}{2\pi\,|\Sigma|^{1/2}}\exp\!\left(-\dfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$, where

$\mu = \begin{pmatrix} E[x_1] \\ E[x_2] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$, $\quad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$

Page 26

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

 Assume $p(x \mid \omega_i)$ is the multivariate Gaussian, i.e.

$p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{\ell/2}\,|\Sigma_i|^{1/2}}\exp\!\left(-\dfrac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)\right), \quad i = 1, \ldots, M$

where $\mu_i = E[x \mid \omega_i]$ and $\Sigma_i = E\left[(x-\mu_i)(x-\mu_i)^T \mid \omega_i\right]$.

Page 27

 Since $\ln(\cdot)$ is monotonic, define:

$g_i(x) = \ln\left(p(x \mid \omega_i)\,P(\omega_i)\right) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$

$g_i(x) = -\dfrac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i) + C_i$

where $C_i = -\dfrac{\ell}{2}\ln(2\pi) - \dfrac{1}{2}\ln|\Sigma_i|$.

Page 28

That is, $g_i(x)$ is quadratic and the surfaces $g_i(x) - g_j(x) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

 For example, for $\ell = 2$ and $\Sigma_i = \sigma_i^2 I$:

$g_i(x) = -\dfrac{1}{2\sigma_i^2}\left(x_1^2 + x_2^2\right) + \dfrac{1}{\sigma_i^2}\left(\mu_{i1}x_1 + \mu_{i2}x_2\right) - \dfrac{1}{2\sigma_i^2}\left(\mu_{i1}^2 + \mu_{i2}^2\right) + \ln P(\omega_i) + C_i$
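The quadratic discriminant $g_i(x)$ above translates directly into code. The following sketch is illustrative only; the two-class means, covariances and priors are hypothetical values, not taken from the slides:

import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    # g_i(x) = -0.5 (x-mu)^T Sigma^-1 (x-mu) + ln P(w_i) - 0.5 ln|Sigma| - (l/2) ln(2*pi)
    d = x - mean
    l = x.shape[0]
    return (-0.5 * d @ np.linalg.inv(cov) @ d
            + np.log(prior)
            - 0.5 * np.log(np.linalg.det(cov))
            - 0.5 * l * np.log(2 * np.pi))

# Hypothetical two-class problem in R^2
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]

x = np.array([1.0, 1.0])
scores = [quadratic_discriminant(x, m, S, P) for m, S, P in zip(means, covs, priors)]
print(1 + int(np.argmax(scores)))  # class with the largest discriminant value -> 1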

Page 29

 Example 1:

 Example 2:

Page 30

 Decision Hyperplanes

 Quadratic terms: $x^T\Sigma_i^{-1}x$.

If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently, we can write:

$g_i(x) = w_i^T x + w_{i0}$

$w_i = \Sigma^{-1}\mu_i$

$w_{i0} = \ln P(\omega_i) - \dfrac{1}{2}\mu_i^T\Sigma^{-1}\mu_i$

Page 31

 Let, in addition, $\Sigma = \sigma^2 I$. Then:

$g_{ij}(x) \equiv g_i(x) - g_j(x) = w^T(x - x_0) = 0$

where

$w = \mu_i - \mu_j$

$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \sigma^2\ln\!\left(\dfrac{P(\omega_i)}{P(\omega_j)}\right)\dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$

Page 33

• If $P(\omega_i) \neq P(\omega_j)$, the linear classifier moves towards the class with the smaller a-priori probability.

Page 34

 For the more general case of a common, nondiagonal covariance matrix $\Sigma$, the decision hyperplane is

$g_{ij}(x) = w^T(x - x_0) = 0$

where

$w = \Sigma^{-1}(\mu_i - \mu_j)$

$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \ln\!\left(\dfrac{P(\omega_i)}{P(\omega_j)}\right)\dfrac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$

with $\|x\|_{\Sigma^{-1}} \equiv \left(x^T\Sigma^{-1}x\right)^{1/2}$.

 The hyperplane is, in general, not orthogonal to $\mu_i - \mu_j$.
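A small sketch of this common-covariance (linear) case: it builds $w$ and $x_0$ from the formulas above and classifies a point by the sign of $w^T(x - x_0)$. For concreteness it reuses the means and covariance of the worked example a few pages below; treat it as an illustration, not as part of the original slides:

import numpy as np

def linear_boundary(mu_i, mu_j, Sigma, P_i, P_j):
    # Hyperplane w^T (x - x0) = 0 for two Gaussian classes sharing covariance Sigma
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu_i - mu_j
    w = Sigma_inv @ diff
    norm_sq = diff @ Sigma_inv @ diff          # squared Sigma^-1 norm of mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) * diff / norm_sq
    return w, x0

def classify(x, w, x0):
    # Positive side of the hyperplane -> class i, negative side -> class j
    return "omega_i" if w @ (x - x0) > 0 else "omega_j"

mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
w, x0 = linear_boundary(mu_i, mu_j, Sigma, 0.5, 0.5)
print(classify(np.array([1.0, 2.2]), w, x0))   # -> omega_i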

Page 35

 Minimum Distance Classifiers

 For equiprobable classes with the same covariance matrix $\Sigma$, the discriminant reduces to

$g_i(x) = -\dfrac{1}{2}(x-\mu_i)^T\Sigma^{-1}(x-\mu_i)$

 If $\Sigma = \sigma^2 I$: assign $x \rightarrow \omega_i$ with minimum Euclidean distance

$d_E = \|x - \mu_i\|$

 For nondiagonal $\Sigma$: assign $x \rightarrow \omega_i$ with minimum Mahalanobis distance

$d_m = \left((x-\mu_i)^T\Sigma^{-1}(x-\mu_i)\right)^{1/2}$

Page 37

 Example: Given $\omega_1$, $\omega_2$ with $P(\omega_1) = P(\omega_2)$ and

$p(x \mid \omega_1) = N(\mu_1, \Sigma)$, $p(x \mid \omega_2) = N(\mu_2, \Sigma)$,

$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$,

classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.

$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$

 Compute the Mahalanobis distances from $\mu_1$ and $\mu_2$:

$d_{m,1}^2 = (x-\mu_1)^T\Sigma^{-1}(x-\mu_1) = [1.0,\ 2.2]\,\Sigma^{-1}\,[1.0,\ 2.2]^T = 2.952$

$d_{m,2}^2 = (x-\mu_2)^T\Sigma^{-1}(x-\mu_2) = [-2.0,\ -0.8]\,\Sigma^{-1}\,[-2.0,\ -0.8]^T = 3.672$

 Classify $x \rightarrow \omega_1$. Observe that, in terms of the Euclidean distance, $x$ is closer to $\mu_2$.
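The numbers in this example can be verified with a few lines of Python (added here as a check, not part of the original slides):

import numpy as np

# Data from the example above
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
x = np.array([1.0, 2.2])

Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu, Sigma_inv):
    # Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu)
    d = x - mu
    return d @ Sigma_inv @ d

print(mahalanobis_sq(x, mu1, Sigma_inv))   # ~2.952 -> classify x to omega_1
print(mahalanobis_sq(x, mu2, Sigma_inv))   # ~3.672
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))  # Euclidean: x is closer to mu2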

Page 38

 Maximum Likelihood estimation

 Let $x_1, x_2, \ldots, x_N$ be known and independent. Let $p(x)$ be known within an unknown vector parameter $\theta$:

$p(x) \equiv p(x; \theta)$

 With $X = \{x_1, x_2, \ldots, x_N\}$:

$p(X; \theta) \equiv p(x_1, x_2, \ldots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)$

Page 39

 The ML estimate:

$\hat{\theta}_{ML} = \arg\max_{\theta}\prod_{k=1}^{N} p(x_k; \theta)$

$L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N}\ln p(x_k; \theta)$

$\hat{\theta}_{ML}: \;\dfrac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N}\dfrac{1}{p(x_k; \theta)}\dfrac{\partial p(x_k; \theta)}{\partial \theta} = 0$

Page 41

 If $\theta_0$ is the true value of the parameter in $p(x)$, i.e. $p(x) = p(x; \theta_0)$, then

$\lim_{N\to\infty} E[\hat{\theta}_{ML}] = \theta_0, \qquad \lim_{N\to\infty} E\left[\|\hat{\theta}_{ML} - \theta_0\|^2\right] = 0$

i.e. the ML estimator is asymptotically unbiased and consistent.

Page 42

 Example: $p(x): N(\mu, \Sigma)$, with $\mu$ unknown, $\Sigma$ known, and $X = \{x_1, x_2, \ldots, x_N\}$:

$p(x_k; \mu) = \dfrac{1}{(2\pi)^{\ell/2}\,|\Sigma|^{1/2}}\exp\!\left(-\dfrac{1}{2}(x_k-\mu)^T\Sigma^{-1}(x_k-\mu)\right)$

$L(\mu) \equiv \ln\prod_{k=1}^{N} p(x_k; \mu) = C - \dfrac{1}{2}\sum_{k=1}^{N}(x_k-\mu)^T\Sigma^{-1}(x_k-\mu)$

$\dfrac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N}\Sigma^{-1}(x_k-\mu) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$

i.e. the ML estimate of the mean is the sample mean.

Page 43

 Maximum A Posteriori (MAP) Probability Estimation

 In the ML method, $\theta$ was considered as a parameter. Here we shall look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known.

 Given $X = \{x_1, x_2, \ldots, x_N\}$, compute the maximum of $p(\theta \mid X)$.

 From the Bayes theorem:

$p(\theta)\,p(X \mid \theta) = p(X)\,p(\theta \mid X)$, or $\;p(\theta \mid X) = \dfrac{p(\theta)\,p(X \mid \theta)}{p(X)}$

Page 44

 The method:

$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid X)$, or

$\hat{\theta}_{MAP}: \;\dfrac{\partial}{\partial\theta}\left(p(\theta)\,p(X \mid \theta)\right) = 0$

 If $p(\theta)$ is uniform or broad enough, then $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$.

Page 46

 Example: $p(x): N(\mu, \sigma^2 I)$, with $\mu$ unknown, $X = \{x_1, x_2, \ldots, x_N\}$, and prior

$p(\mu) = \dfrac{1}{(2\pi)^{\ell/2}\sigma_\mu^{\ell}}\exp\!\left(-\dfrac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\right)$

 The MAP estimate is given by

$\hat{\mu}_{MAP}: \;\dfrac{\partial}{\partial\mu}\ln\left(p(X \mid \mu)\,p(\mu)\right) = 0$, or

$\dfrac{1}{\sigma^2}\sum_{k=1}^{N}(x_k - \hat{\mu}_{MAP}) - \dfrac{1}{\sigma_\mu^2}(\hat{\mu}_{MAP} - \mu_0) = 0 \;\Rightarrow\; \hat{\mu}_{MAP} = \dfrac{\mu_0 + \dfrac{\sigma_\mu^2}{\sigma^2}\sum_{k=1}^{N} x_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2}N}$

 For $\dfrac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$: $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$.
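A sketch of the closed-form estimates above on synthetic one-dimensional data (all numeric settings are assumptions made for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D setting: true mean 1.0, known sigma, Gaussian prior on mu
sigma, sigma_mu, mu0 = 1.0, 2.0, 0.0
x = rng.normal(1.0, sigma, size=50)

ratio = sigma_mu ** 2 / sigma ** 2
mu_ml = x.mean()
mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * len(x))

print(mu_ml, mu_map)  # MAP is pulled slightly towards mu0; the two agree as N grows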

Page 47

 Bayesian Inference

 ML, MAP: a single estimate for $\theta$. Here a different route is followed.

 Given: $X = \{x_1, \ldots, x_N\}$, $p(x \mid \theta)$ and the prior $p(\theta)$.

 The goal: estimate $p(x \mid X)$.

How??

Page 48

$p(x \mid X) = \int p(x \mid \theta)\,p(\theta \mid X)\,d\theta$

$p(\theta \mid X) = \dfrac{p(X \mid \theta)\,p(\theta)}{p(X)} = \dfrac{p(X \mid \theta)\,p(\theta)}{\int p(X \mid \theta)\,p(\theta)\,d\theta}$

$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$

 A bit more insight via an example. Let

$p(x \mid \mu): N(\mu, \sigma^2)$, $\quad p(\mu): N(\mu_0, \sigma_0^2)$

 It turns out that

$p(\mu \mid X): N(\mu_N, \sigma_N^2)$, where

$\mu_N = \dfrac{N\sigma_0^2\,\bar{x} + \sigma^2\mu_0}{N\sigma_0^2 + \sigma^2}$, with $\bar{x} = \dfrac{1}{N}\sum_{k=1}^{N} x_k$, and $\sigma_N^2 = \dfrac{\sigma_0^2\sigma^2}{N\sigma_0^2 + \sigma^2}$
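The posterior parameters $\mu_N$ and $\sigma_N^2$ can be computed directly; the sketch below uses hypothetical data and prior values, purely to illustrate how the posterior concentrates as $N$ grows:

import numpy as np

def gaussian_posterior(x, mu0, sigma0_sq, sigma_sq):
    # Posterior p(mu|X) = N(mu_N, sigma_N^2) for a Gaussian likelihood with known sigma^2
    N = len(x)
    mu_N = (N * sigma0_sq * np.mean(x) + sigma_sq * mu0) / (N * sigma0_sq + sigma_sq)
    sigma_N_sq = (sigma0_sq * sigma_sq) / (N * sigma0_sq + sigma_sq)
    return mu_N, sigma_N_sq

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=100)     # hypothetical samples, true mean 2.0
print(gaussian_posterior(x, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0))
# mu_N approaches the sample mean and sigma_N^2 shrinks as N grows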

Page 49

The previous formulae correspond to a sequence of Gaussian posteriors $p(\mu \mid X)$ for different values of $N$.

 Example: Prior information: …

Page 50

 Maximum Entropy Method

 Compute the pdf so as to be maximally non-committal with respect to the unavailable information, while being constrained to respect the available information.

 The above is equivalent to maximizing the uncertainty, i.e. the entropy, subject to the available constraints.

 Entropy:

$H = -\int p(x)\ln p(x)\,dx$

$\hat{p}(x): \;\max H$ subject to the available constraints.

Page 51

 Example: $x$ is nonzero in the interval $x_1 \leq x \leq x_2$ and zero otherwise. Compute the ME pdf.

 The constraint: $\int_{x_1}^{x_2} p(x)\,dx = 1$

 Lagrange multipliers:

$H_L = -\int_{x_1}^{x_2} p(x)\left(\ln p(x) - \lambda\right)dx \;\Rightarrow\; \hat{p}(x) = \exp(\lambda - 1)$

 Hence:

$\hat{p}(x) = \begin{cases}\dfrac{1}{x_2 - x_1}, & x_1 \leq x \leq x_2\\ 0, & \text{otherwise}\end{cases}$

Page 52

• This is most natural: the "most random" pdf is the uniform one, which complies with the Maximum Entropy rationale.

 It turns out that, if the constraints are the mean value and the variance, then the Maximum Entropy estimate is the Gaussian pdf.

 That is, the Gaussian pdf is the "most random" one among all pdfs with the same mean and variance.

Page 53

 Mixture Models

$p(x) = \sum_{j=1}^{J} p(x \mid j)\,P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int p(x \mid j)\,dx = 1$

 Assume parametric modeling, i.e. $p(x \mid j) \equiv p(x \mid j; \theta)$.

 The goal is to estimate $\theta$ and $P_1, P_2, \ldots, P_J$, given the set $X = \{x_1, x_2, \ldots, x_N\}$ of observations.

Page 54

 This is a nonlinear problem, due to the missing label information; it is a typical problem with an incomplete data set.

 The Expectation-Maximisation (EM) algorithm.

• General formulation: let $y \in Y \subseteq R^m$ denote the complete data set, with pdf $p_y(y; \theta)$, whose samples are not observed directly.

• We observe

$x = g(y) \in X \subseteq R^{\ell}$, $\;\ell < m$, with pdf $p_x(x; \theta)$.

Page 55

• Let $Y(x) \subseteq Y$ be the subset of all $y$'s corresponding to a specific $x$. Then

$p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\,dy$

• What we need is to compute

$\hat{\theta}_{ML}: \;\sum_{k}\dfrac{\partial\ln p_y(y_k; \theta)}{\partial\theta} = 0$

• But the $y$'s are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of $\theta$.

Page 56

 The algorithm:

• E-step: $\;Q(\theta; \theta(t)) = E\left[\sum_{k}\ln p_y(y_k; \theta) \,\Big|\, X; \theta(t)\right]$

• M-step: $\;\theta(t+1): \dfrac{\partial Q(\theta; \theta(t))}{\partial\theta} = 0$

 Application to the mixture modeling problem:

• Complete data: $(x_k, j_k)$, $k = 1, 2, \ldots, N$

• Observed data: $x_k$, $k = 1, 2, \ldots, N$

• $p(x_k, j_k; \theta) = p(x_k \mid j_k; \theta)\,P_{j_k}$

Page 57

• Unknown parameters: $\Theta = [\theta^T, P^T]^T$, $\;P = [P_1, P_2, \ldots, P_J]^T$

• E-step:

$Q(\Theta; \Theta(t)) = E\left[\sum_{k=1}^{N}\ln\left(p(x_k \mid j_k; \theta)\,P_{j_k}\right)\right] = \sum_{k=1}^{N}\sum_{j=1}^{J} P(j \mid x_k; \Theta(t))\ln\left(p(x_k \mid j; \theta)\,P_j\right)$

where, from the Bayes rule,

$P(j \mid x_k; \Theta(t)) = \dfrac{p(x_k \mid j; \theta(t))\,P_j(t)}{p(x_k; \Theta(t))}$

• M-step:

$\dfrac{\partial Q}{\partial\theta} = 0, \qquad \dfrac{\partial Q}{\partial P_j} = 0, \quad j = 1, 2, \ldots, J$
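For the Gaussian mixture case, the E- and M-steps above reduce to the familiar responsibility/re-estimation loop. The following one-dimensional sketch is an illustration under assumed data and initialization, not the slides' own implementation:

import numpy as np

def em_gmm_1d(x, J=2, iters=100):
    # EM for a 1-D Gaussian mixture: theta = (means, variances), mixing probabilities P_j
    rng = np.random.default_rng(0)
    means = rng.choice(x, J)
    variances = np.full(J, x.var())
    priors = np.full(J, 1.0 / J)
    for _ in range(iters):
        # E-step: responsibilities P(j | x_k; Theta(t))
        lik = np.array([priors[j] * np.exp(-(x - means[j]) ** 2 / (2 * variances[j]))
                        / np.sqrt(2 * np.pi * variances[j]) for j in range(J)])  # J x N
        resp = lik / lik.sum(axis=0)
        # M-step: re-estimate means, variances and mixing probabilities
        Nj = resp.sum(axis=1)
        means = (resp @ x) / Nj
        variances = np.array([(resp[j] * (x - means[j]) ** 2).sum() / Nj[j] for j in range(J)])
        priors = Nj / len(x)
    return means, variances, priors

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 0.5, 200)])  # hypothetical data
print(em_gmm_1d(x))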

Page 58

 Nonparametric Estimation

$\hat{p}(x) \approx \hat{p}(\hat{x}) = \dfrac{1}{h}\dfrac{k_N}{N}, \qquad \hat{x} - \dfrac{h}{2} \leq x \leq \hat{x} + \dfrac{h}{2}$

where $k_N$ is the number of points inside the interval, out of $N$ points in total.

 In words: place a segment of length $h$ at $\hat{x}$ and count the points inside it.

 If $p(x)$ is continuous, then $\hat{p}(x) \to p(x)$ as $N \to \infty$, provided that

$h_N \to 0, \qquad k_N \to \infty, \qquad \dfrac{k_N}{N} \to 0$

Page 59

 Parzen Windows

 Place at $x$ a hypercube of side length $h$ and count the points falling inside it.

Page 60

 Define the window function

$\phi(x_i) = \begin{cases}1, & |x_{ij}| \leq 1/2,\; j = 1, \ldots, \ell\\ 0, & \text{otherwise}\end{cases}$

that is, $\phi$ equals 1 inside a unit-side hypercube centered at the origin, and the estimate becomes

$\hat{p}(x) = \dfrac{1}{h^{\ell}}\dfrac{1}{N}\sum_{i=1}^{N}\phi\!\left(\dfrac{x_i - x}{h}\right)$

or, in words,

$\hat{p}(x) = \dfrac{1}{\text{volume}}\cdot\dfrac{\text{number of points inside an }h\text{-side hypercube centered at }x}{N}$

 The $\phi(\cdot)$ above is discontinuous. Smooth (continuous) functions can also be used, provided that $\phi(x) \geq 0$ and $\int \phi(x)\,dx = 1$.
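A one-dimensional Parzen sketch of the estimator above (the data, window width and test point are assumptions chosen for illustration):

import numpy as np

def parzen_estimate(x, samples, h):
    # Hypercube (here: 1-D box) Parzen window estimate of p(x)
    phi = (np.abs((samples - x) / h) <= 0.5).astype(float)  # 1 inside the window, 0 outside
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 10000)   # hypothetical training points from N(0,1)
true = 1 / np.sqrt(2 * np.pi)           # p(0) of the standard Gaussian, ~0.3989
print(parzen_estimate(0.0, samples, h=0.1), true)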

Page 61

 Mean value of the estimate:

$E[\hat{p}(x)] = \dfrac{1}{h^{\ell}}\dfrac{1}{N}\sum_{i=1}^{N}E\left[\phi\!\left(\dfrac{x_i - x}{h}\right)\right] = \dfrac{1}{h^{\ell}}\int\phi\!\left(\dfrac{x' - x}{h}\right)p(x')\,dx'$

 As $h \to 0$, the width of $\dfrac{1}{h^{\ell}}\phi\!\left(\dfrac{x' - x}{h}\right)$ tends to zero and it approaches a delta function, so $E[\hat{p}(x)] \to p(x)$: the estimate is asymptotically unbiased.

Page 62

• The smaller the h the higher the variance

Page 63

h = 0.1, N = 10000

• The larger the N, the better the accuracy.

Page 64

 Application to classification: approximate the likelihood ratio test using the Parzen estimates of the two class-conditional pdfs,

$\ell_{12} \equiv \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \approx \dfrac{\dfrac{1}{N_1 h^{\ell}}\sum_{i=1}^{N_1}\phi\!\left(\dfrac{x_i - x}{h}\right)}{\dfrac{1}{N_2 h^{\ell}}\sum_{i=1}^{N_2}\phi\!\left(\dfrac{x_i - x}{h}\right)} \;\gtrless\; \dfrac{P(\omega_2)}{P(\omega_1)}\cdot\dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$

where the two sums run over the training points of $\omega_1$ and $\omega_2$, respectively.

Page 65

 CURSE OF DIMENSIONALITY

 In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate.

 If in the one-dimensional space an interval filled with N points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require N² points, and in the ℓ-dimensional space the ℓ-dimensional cube will require N^ℓ points.

 The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

Page 66

An example:

Page 67

 NAIVE – BAYES CLASSIFIER

 Let $x \in R^{\ell}$ and the goal is to estimate $p(x \mid \omega_i)$, $i = 1, 2, \ldots, M$. For a "good" estimate of the pdf one would need, say, $N^{\ell}$ points.

 Assume $x_1, x_2, \ldots, x_{\ell}$ mutually independent. Then:

$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$

 In this case, one would require, roughly, N points for each one-dimensional pdf. Thus, a number of points of the order N·ℓ would suffice.

 It turns out that the Naïve-Bayes classifier works reasonably well even in cases that violate the independence assumption.
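A minimal Gaussian naive-Bayes sketch of the factorization above, with each $p(x_j \mid \omega_i)$ modeled as a one-dimensional Gaussian; the data and the Gaussian choice are assumptions for illustration only:

import numpy as np

def fit_gaussian_nb(X, y):
    # Per-class, per-feature 1-D Gaussian fits: the naive (independence) assumption
    classes = np.unique(y)
    return {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0), np.mean(y == c))
            for c in classes}

def predict(model, x):
    def log_score(mean, var, prior):
        # log P(w_i) + sum_j log p(x_j | w_i), each p(x_j | w_i) a 1-D Gaussian
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return max(model, key=lambda c: log_score(*model[c]))

# Hypothetical 2-D, two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_nb(X, y)
print(predict(model, np.array([2.5, 3.2])))  # -> 1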

Page 68

 K Nearest Neighbor Density Estimation

 In Parzen: the volume is constant, and the number of points falling inside it is varying.

 Here: keep the number of points $k$ constant, and let the volume $V(x)$ (centered at $x$ and containing exactly $k$ points) vary. Then:

$\hat{p}(x) = \dfrac{k}{N\,V(x)}$
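A one-dimensional sketch of the k-NN density estimate above (data and $k$ are assumed values for illustration):

import numpy as np

def knn_density_1d(x, samples, k):
    # The volume (here: interval length) grows until it contains exactly k points
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]               # interval [x - r_k, x + r_k], r_k the k-th distance
    return k / (len(samples) * V)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 5000)    # hypothetical training points from N(0,1)
print(knn_density_1d(0.0, samples, k=50), 1 / np.sqrt(2 * np.pi))  # estimate vs true p(0)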

Page 69

 The kNN density estimates can be plugged into the likelihood ratio test:

$\ell_{12} = \dfrac{\hat{p}(x \mid \omega_1)}{\hat{p}(x \mid \omega_2)} = \dfrac{k/(N_1 V_1(x))}{k/(N_2 V_2(x))} = \dfrac{N_2\,V_2(x)}{N_1\,V_1(x)}$

Page 70

 The Nearest Neighbor Rule

 Choose k out of the N training vectors: identify the k nearest ones to x.

 Out of these k, identify k_i that belong to class ω_i.

 Assign $x \rightarrow \omega_i$ if $k_i > k_j \;\;\forall j \neq i$.

 The simplest version: k = 1 !!!

 For large N this is not bad. It can be shown that, if P_B is the optimal Bayesian error probability, then:

Page 71

$P_B \leq P_{NN} \leq P_B\left(2 - \dfrac{M}{M-1}P_B\right) \leq 2P_B$

$P_B \leq P_{kNN} \leq P_B + \sqrt{\dfrac{2P_{NN}}{k}}$

where $P_{NN}$ is the asymptotic error of the nearest-neighbor (k = 1) rule and $P_{kNN}$ that of the k-NN rule.
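The k-NN rule itself is only a few lines of code. The sketch below is an illustration on hypothetical data, not part of the slides:

import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    # Find the k nearest training vectors and take a majority vote over their labels
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical two-class training set in R^2
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
print(knn_classify(np.array([2.8, 2.9]), X_train, y_train, k=5))  # -> 1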

Page 72

 Voronoi tessellation

Page 73

BAYESIAN NETWORKS

 Bayes Probability Chain Rule:

$p(x_1, x_2, \ldots, x_{\ell}) = p(x_{\ell} \mid x_{\ell-1}, \ldots, x_1)\,p(x_{\ell-1} \mid x_{\ell-2}, \ldots, x_1)\cdots p(x_2 \mid x_1)\,p(x_1)$

 Assume now that the conditional dependence for each $x_i$ is limited to a subset $A_i$ of the features appearing in each of the product terms. That is:

$p(x_1, x_2, \ldots, x_{\ell}) = p(x_1)\prod_{i=2}^{\ell} p(x_i \mid A_i)$

where

$A_i \subseteq \{x_{i-1}, x_{i-2}, \ldots, x_1\}$

Page 74

For example, if ℓ = 6, then we could assume:

$p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$

Then: $A_6 = \{x_5, x_4\} \subset \{x_5, x_4, \ldots, x_1\}$

Page 75

 A graphical way to portray conditional dependencies. For this case:

$p(x_1, x_2, \ldots, x_6) = p(x_6 \mid x_5, x_4)\,p(x_5 \mid x_4)\,p(x_4 \mid A_4)\,p(x_3 \mid x_2)\,p(x_2)\,p(x_1)$

Page 76

 Bayesian Networks

 Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), $p(x_i \mid A_i)$, where $x_i$ is the variable associated with the node and $A_i$ is the set of its parents in the graph.

 A Bayesian Network is specified by:

• The marginal probabilities of its root nodes.

• The conditional probabilities of the non-root nodes, given their parents, for ALL possible values of the involved variables.
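As an illustration of how such a specification is used, the sketch below evaluates the joint probability of a small hypothetical three-node chain; the structure and all probability tables are assumptions, not the network of the following example:

# Hypothetical three-node network x1 -> x2 -> x3 with made-up (assumed) probability tables
P_x1 = {0: 0.6, 1: 0.4}                                        # marginal of the root node
P_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}     # P(x2 | x1)
P_x3_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}   # P(x3 | x2)

def joint(x1, x2, x3):
    # p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2)
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x2[x2][x3]

# The sum over all configurations must equal 1 for a valid network
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 1, 0), total)  # P(x1=1, x2=1, x3=0) = 0.4*0.7*0.25 = 0.07; total = 1.0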

Page 77

The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field. It models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

Page 78

 Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.

 Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.

 Probability Inference: This is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.

Page 79

 Example: Consider the Bayesian network of the figure:

Page 80

For a), a set of calculations is required that propagate from node x to node w. It turns out that P(w0|x1) = 0.63.

For b), the propagation is reversed in direction. It turns out that P(x0|w1) = 0.4.

In general, the required inference information is computed via a combined process of "message passing" among the nodes of the DAG.

 Complexity:

For singly connected graphs, message passing
