5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
First published 2005
Reprinted 2006
HANDBOOK OF PATTERN RECOGNITION & COMPUTER VISION (3rd Edition)
Copyright © 2005 by World Scientific Publishing Co Pte Ltd
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-105-6
Printed in Singapore by Mainland Press
Preface to the Third Edition

Dedicated to the memory of the late Professor King Sun Fu (1930-1985), the handbook series, with its first edition (1993), second edition (1999) and third edition (2005), provides a comprehensive, concise and balanced coverage of the progress and achievements in the field of pattern recognition and computer vision in the last twenty years. This is a highly dynamic field which has been expanding greatly over the last thirty years. No handbook can cover the essence of all aspects of the field, and we have not attempted to do that. The carefully selected 33 chapters in the current edition were written by leaders in the field, and we believe that the book and its sister volumes, the first and second editions, will provide the growing pattern recognition and computer vision community with a set of valuable resource books that can last for a long time. Each chapter speaks for itself on the importance of the subject area covered.
The book continues to contain five parts. Part 1 is on the basic methods of pattern recognition. Though there are only five chapters, readers may find further coverage of basic methods in the first and second editions. Part 2 is on basic methods in computer vision. Again, readers may find that Part 2 complements well what was offered in the first and second editions. Part 3, on recognition applications, continues to emphasize character recognition and document processing. It also presents new applications in digital mammograms, remote sensing images and functional magnetic resonance imaging data. Currently one intensively explored area of pattern recognition applications is the personal identification problem, also called biometrics, though the problem has been around for a number of years. Part 4 is especially devoted to this topic area. Indeed, the chapters in both Part 3 and Part 4 represent the growing importance of applications in pattern recognition. In fact, Prof. Fu had envisioned the growth of pattern recognition applications in the early 60's. He and his group at Purdue worked on character recognition, speech recognition, fingerprint recognition, seismic pattern recognition, biomedical and remote sensing recognition problems, etc. Part 5, on system and technology, presents other important aspects of pattern recognition and computer vision.
Our sincere thanks go to all contributors of this volume for their outstanding technical contributions. We would like to mention specially Dr. Quang-Tuan Luong, Dr. Giovanni Garibotto and Prof. Ching Y. Suen for their original contributions to all three volumes. Other authors who have contributed to all three volumes are: Prof. Thomas S. Huang, Prof. J.K. Aggarwal, Prof. Yuan Y. Tang, Prof. C.C. Li, Prof. R. Chellappa and Prof. P.S.P. Wang. We are pleased to mention that Prof. Thomas Huang and Prof. Jake Aggarwal are the recipients, in 2002 and 2004 respectively, of the prestigious K.S. Fu Prize sponsored by the International Association for Pattern Recognition (IAPR). Among Prof. Fu's Ph.D. graduates at Purdue who have contributed to the handbook series are: C.H. Chen (1965), M.H. Loew (1972), S.M. Hsu (1975), S.Y. Lu (1977), K.Y. Huang (1983) and H.D. Cheng (1985). Finally, we would like to pay tribute to the late Prof. Azriel Rosenfeld (1931-2004) who, as one IAPR member put it, was a true scientist and a great giant in the field. He was awarded the K.S. Fu Prize by IAPR in 1988. Readers are reminded to read Prof. Rosenfeld's inspirational article "Vision - Some Speculations", which appeared as the Foreword of the second edition of the handbook series. Prof. Rosenfeld's profound influence on the field will be felt for many years to come.
The camera-ready manuscript production requires a certain amount of additional effort, as compared to typeset printing, on the part of the editors and authors. We would like to thank all contributors for their patience in making the necessary revisions to comply with the format requirements during this long process of manuscript preparation. Our special thanks go to Steven Patt, in-house editor of World Scientific Publishing, for his efficient effort in making a timely publication of the book possible.
Preface to the Third Edition v
Contents vii
Part 1 Basic Methods in Pattern Recognition 1
Chapter 1.1 Statistical Pattern Recognition 3
R.P.W Duin and D.M.J Tax
Chapter 1.2 Hidden Markov Models for Spatio-Temporal Pattern
Recognition 25
Brian C Lovell and Terry Caelli
Chapter 1.3 A New Kernel-Based Formalization of Minimum Error Pattern
Recognition 41
Erik McDermott and Shigeru Katagiri
Chapter 1.4 Parallel Contextual Array Grammars with Trajectories 55
P Helen Chandra, C Martin-Vide, K.G Subramanian, D.L Van and P S P Wang
Chapter 1.5 Pattern Recognition with Local Invariant Features 71
C Schmid, G Dorko, S Lazebnik, K Mikolajczyk and J Ponce
Part 2 Basic Methods in Computer Vision 93
Chapter 2.1 Case-Based Reasoning for Image Analysis and Interpretation 95
Petra Perner
Chapter 2.2 Multiple Image Geometry - A Projective Viewpoint 115
Quang-Tuan Luong
Chapter 2.3 Skeletonization in 3D Discrete Binary Images 137
Gabriella Sanniti di Baja and Ingela Nyström
Chapter 2.4 Digital Distance Transforms in 2D, 3D, and 4D 157
Gunilla Borgefors
Chapter 2.5 Computing Global Shape Measures 177
Paul L Rosin
Chapter 2.6 Texture Analysis with Local Binary Patterns 197
Topi Mäenpää and Matti Pietikäinen
Part 3 Recognition Applications 217
Chapter 3.1 Document Analysis and Understanding 219
Yuan Yan Tang
Chapter 3.2 Chinese Character Recognition 241
Xiaoqing Ding
Chapter 3.3 Extraction of Words from Handwritten Legal Amounts on
Bank Cheques 259
In Cheol Kim and Ching Y Suen
Chapter 3.4 OCR Assessment of Printed-Fonts for Enhancing Human
Vision 273
Ching Y Suen, Qizhi Xu and Cedric Devoghelaere
Chapter 3.5 Clustering and Classification of Web Documents Using a
Graph Model 287
Adam Schenker, Horst Bunke, Mark Last and Abraham Kandel
Chapter 3.6 Automated Detection of Masses in Mammograms 303
H.D Cheng, X.J Shi, R Min, X.P Cai and H.N Du
Chapter 3.7 Wavelet-Based Kalman Filtering in Scale Space for Image
Fusion 325
Hsi-Chin Hsin and Ching-Chung Li
Chapter 3.8 Multisensor Fusion with Hyperspectral Imaging Data:
Detection and Classification 347
Su May Hsu and Hsiao-hua Burke
Chapter 3.9 Independent Component Analysis of Functional Magnetic
Resonance Imaging Data 365
V.D. Calhoun and B. Hong
Part 4 Human Identification 385
Chapter 4.1 Multimodal Emotion Recognition 387
Nicu Sebe, Ira Cohen and Thomas S Huang
Chapter 4.2 Gait-Based Human Identification from a Monocular Video
Sequence 411
Amit Kale, Aravind Sundaresan, Amit K. Roy-Chowdhury and Rama Chellappa
Chapter 4.3 Palmprint Authentication System 431
David Zhang
Chapter 4.4 Reconstruction of High-Resolution Facial Images for Visual
Surveillance 445
Jeong-Seon Park and Seong Whan Lee
Chapter 4.5 Object Recognition with Deformable Feature Graphs: Faces,
Hands, and Cluttered Scenes 461
Jochen Triesch and Christian Eckes
Chapter 4.6 Hierarchical Classification and Feature Reduction for Fast Face
Detection 481
Bernd Heisele, Thomas Serre, Sam Prentice and Tomaso Poggio
Part 5 System and Technology 497
Chapter 5.1 Tracking and Classifying Moving Objects Using Single or
Multiple Cameras 499
Quming Zhou and J.K. Aggarwal
Chapter 5.2 Performance Evaluation of Image Segmentation Algorithms 525
Xiaoyi Jiang
Chapter 5.3 Contents-Based Video Analysis for Knowledge Discovery 543
Chia-Hung Yeh, Shih-Hung Lee and C.-C. Jay Kuo
Chapter 5.4 Object-Process Methodology and Its Applications to Image
Processing and Pattern Recognition 559
Dov Dori
Chapter 5.5 Musical Style Recognition — A Quantitative Approach 583
Peter van Kranenburg and Eric Backer
Chapter 5.6 Auto-Detector: Mobile Automatic Number Plate Recognition 601
Giovanni B. Garibotto
Chapter 5.7 Omnidirectional Vision 619
Hiroshi Ishiguro
Index 629
Chapter 1.1 Statistical Pattern Recognition
R.P.W. Duin and D.M.J. Tax

A review is given of the area of statistical pattern recognition: the representation of objects and the design and evaluation of trainable systems for generalization. Traditional as well as more recently studied procedures are reviewed, like the classical Bayes classifiers, neural networks, support vector machines, one-class classifiers and combining classifiers. Further, we introduce methods for feature reduction and error evaluation. New developments in statistical pattern recognition are briefly discussed.
1 Introduction
Statistical pattern recognition is the research area that studies statistical tools for the generalization of sets of real world objects or phenomena. It thereby aims to find procedures that answer questions like: does this new object fit into the pattern of a given set of objects, or: to which of the patterns defined in a given set does it fit best? The first question is related to cluster analysis, but is also discussed from some perspective in this chapter. The second question is on pattern classification, and that is what will be the main concern here.

The overall structure of a pattern recognition system may be summarized as in Figure 1. Objects have first to be appropriately represented before a generalization can be derived. Depending on the demands of the procedures used for this, the representation has to be adapted, e.g. transformed, scaled or simplified.

The procedures discussed in this chapter are partially also studied in areas like statistical learning theory32, machine learning25 and neural networks14. As the emphasis in pattern recognition is close to application areas, questions related to the representation of the objects are important here: how are objects described (e.g. features, distances to prototypes), how extensive may this description be, what are the ways to incorporate knowledge from the application domain? Representations have to be adapted to fit the tools that are used later. Simplifications of representations like feature reduction and prototype selection should thereby be considered.
In order to derive, from a training set, a classifier that is valid for new objects (i.e. that is able to generalize), the representation should fulfill an important condition: representations of similar real world objects have to be similar as well. The representations should be close. This is the so-called compactness hypothesis2 on which the generalization from examples to new, unseen objects is built. It enables the estimation of their class labels on the basis of distances to examples or on class densities derived from examples.

Objects are traditionally represented by vectors in a feature space. An important recent development to incorporate domain knowledge is the representation of objects by their relation to other objects. This may be done by a so-called kernel method29, derived from features, or directly on dissimilarities computed from the raw data26.

We will assume that, after processing the raw measurements, objects are given in a p-dimensional vector space Ω. Traditionally this space is spanned by p features, but also the dissimilarities with p prototype objects may be used. To simplify the discussion we will use the term feature space for both. If K is the number of classes to be distinguished, a pattern classification system, or shortly a classifier C(x), is a function or procedure that assigns to each object x in Ω a class ω_c, with c = 1, ..., K. Such a classifier has to be derived from a set of examples X_tr = {x_i, i = 1, ..., N} of known classes y_i; X_tr will be called the training set and y_i ∈ {ω_1, ..., ω_K} a label. Unless otherwise stated it is assumed that y_i is unique (objects belong to just a single class) and is known for all objects in X_tr.
In section 2 training procedures will be discussed to derive classifiers C(x) from training sets. The performance of these classifiers is usually not just related to the quality of the features (their ability to show class differences) but also to their number, i.e. the dimensionality of the feature space.
Fig. 1. The pattern recognition system: from sensors and measurement conditions, via the representation (features, dissimilarities) and its adaptation (feature extraction, prototype selection), to generalization (classifiers, class models, object models), producing class labels and confidences.
A growing number of features may increase the class separability, but may also decrease the statistical accuracy of the training procedure. It is thereby important to have a small number of good features. In section 3 a review is given of ways to reduce the number of features by selection or by combination (so-called feature extraction). The evaluation of classifiers, discussed in section 4, is an important topic. As the characteristics of new applications are often unknown beforehand, the best algorithms for feature reduction and classification have to be found iteratively on the basis of unbiased and accurate testing procedures.

This chapter builds further on earlier reviews of the area of statistical pattern recognition by Fukunaga12 and by Jain et al.16 It is inevitable to repeat and summarize them partly. We will, however, also discuss some new directions like one-class classifiers, combining classifiers, dissimilarity representations and techniques for building good classifiers and reducing the feature space simultaneously. In the last section of this chapter, the discussion, we will return to these new developments.
2 Classifiers
For the development of classifiers, we have to consider two main aspects: the basic assumptions that the classifier makes about the data (which result in a functional form of the classifier), and the optimization procedure to fit the model to the training data. It is possible to consider very complex classifiers, but without efficient methods to fit these classifiers to the data, they are not useful. Therefore, in many cases the functional form of the classifier is restricted by the available optimization routines.

We will start by discussing the two-class classification problem. In the first three sections, 2.1, 2.2 and 2.3, the three basic approaches with their assumptions are given: first, modeling the class posteriors, second, modeling the class conditional probabilities and finally modeling the classification boundary. In section 2.4 we discuss how these approaches can be extended to work for more than two classes. In the next section, the special case is considered where just one of the classes is reliably sampled. The last section, 2.6, discusses the possibilities to combine several (non-optimal) classifiers.
2.1 Bayes classifiers and approximations
A classifier should assign a new object x to the most likely class. In a probabilistic setting this means that the label of the class with the highest posterior probability should be chosen. This class can be found when p(ω_1|x) and p(ω_2|x) (for a two-class classification problem) are known. The classifier becomes:

if p(ω_1|x) > p(ω_2|x) assign object x to ω_1, otherwise to ω_2.   (1)

When we assume that p(ω_1|x) and p(ω_2|x) are known, and further assume that misclassifying an object originating from ω_1 as ω_2 is as costly as vice versa, classifier (1) is the theoretically optimal classifier and will make the minimum error. This classifier is called the Bayes optimal classifier.
In practice p(ω_1|x) and p(ω_2|x) are not known; only samples x_i are available, and the misclassification costs might be known only approximately. Therefore approximations to the Bayes optimal classifier have to be made. This classifier can be approximated in several different ways, depending on knowledge of the classification problem.

The first way is to approximate the class posterior probabilities p(ω_c|x). The logistic classifier assumes a particular model for the class posterior probabilities:

p(\omega_1|x) = \frac{1}{1+\exp(-w^T x)}, \qquad p(\omega_2|x) = 1 - p(\omega_1|x),   (2)

where w is a p-dimensional weight vector. This basically implements a linear classifier in the feature space.
An approach to fit this logistic classifier (2) to training data X_tr is to maximize the data likelihood L:

L = \prod_{i=1}^{N} p(\omega_1|x_i)^{n_1(x_i)}\, p(\omega_2|x_i)^{n_2(x_i)},   (3)

where n_c(x) is 1 if object x belongs to class ω_c, and 0 otherwise. This can be done by, for instance, an iterative gradient ascent method. Weights are iteratively updated using:

w_{new} = w_{old} + \eta \frac{\partial L}{\partial w},   (4)

where η is a suitably chosen learning rate parameter. In Ref. 1 the first (and second) derivative of L with respect to w are derived and can be plugged into (4).
2.2 Class densities and Bayes rule
Assumptions on p(ω|x) are often difficult to make. Sometimes it is more convenient to make assumptions on the class conditional probability densities p(x|ω): they indicate the distribution of the objects which are drawn from one of the classes. When assumptions on these distributions can be made, classifier (1) can be derived using Bayes' decision rule:

p(\omega|x) = \frac{p(x|\omega)\, p(\omega)}{p(x)}.   (5)

This rule basically rewrites the class posterior probabilities in terms of the class conditional probabilities and the class priors p(ω). This result can be substituted into (1), resulting in the following form:

if p(x|\omega_1)\, p(\omega_1) > p(x|\omega_2)\, p(\omega_2) assign x to ω_1, otherwise to ω_2.   (6)

The term p(x) is ignored because it is constant for a given x. Any monotonically increasing function can be applied to both sides without changing the final decision. In some cases, a suitable choice will simplify the notation significantly. In particular, using a logarithmic transformation can simplify the classifier when functions from the exponential family are used.
For the special case of a two-class problem the classifiers can be rewritten in terms of a single discriminant function f(x), which is the difference between the left hand side and the right hand side. A few possibilities are:

f(x) = p(\omega_1|x) - p(\omega_2|x),   (7)

f(x) = p(x|\omega_1)\, p(\omega_1) - p(x|\omega_2)\, p(\omega_2),   (8)

f(x) = \ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{p(\omega_1)}{p(\omega_2)}.   (9)

The classifier becomes:

if f(x) > 0 assign x to ω_1, otherwise to ω_2.   (10)
In many cases fitting p(x|ω) on training data is relatively straightforward. It is the standard density estimation problem: fit a density on a data sample. To estimate each p(x|ω), only the objects from the single class ω are used.

Depending on the functional form of the class densities, different classifiers are constructed. One of the most common approaches is to assume a Gaussian density for each of the classes:
p(x|\omega) = N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right),   (11)

where μ is the (p-dimensional) mean of the class ω, and Σ is the covariance matrix. Further, |Σ| indicates the determinant of Σ and Σ^{-1} its inverse. For the explicit values of the parameters μ and Σ usually the maximum likelihood estimates are plugged in; therefore this classifier is called the plug-in Bayes classifier. Extra complications occur when the sample size N is insufficient to (in particular) compute Σ^{-1}. In these cases a standard solution is to regularize the covariance matrix such that the inverse can be computed:

\Sigma_\lambda = (1-\lambda)\,\Sigma + \lambda I,   (12)

where I is the p x p identity matrix, and λ is the regularization parameter that sets the trade-off between the estimated covariance matrix and the regularizer I.
Substituting (11) for each of the classes ω_1 and ω_2 (with their estimated μ_1, μ_2 and Σ_1, Σ_2) into (9) results in:

f(x) = \tfrac{1}{2} x^T (\Sigma_2^{-1} - \Sigma_1^{-1}) x + (\mu_1^T \Sigma_1^{-1} - \mu_2^T \Sigma_2^{-1})\, x - \tfrac{1}{2}\mu_1^T \Sigma_1^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma_2^{-1} \mu_2 - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}\ln|\Sigma_2| + \ln\frac{p(\omega_1)}{p(\omega_2)}.   (13)

This classifier rule is quadratic in terms of x, and it is therefore called the normal-based quadratic classifier.
For the quadratic classifier a full covariance matrix has to be estimated for each of the classes. In high dimensional feature spaces it can happen that insufficient data is available to estimate these covariance matrices reliably. By restricting the covariance matrices to have fewer free variables, estimations can become more reliable. One approach to reduce the number of parameters is to assume that both classes have an identical covariance structure: Σ_1 = Σ_2 = Σ. The classifier simplifies to:

f(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{p(\omega_1)}{p(\omega_2)}.   (14)

Because this classifier is linear in terms of x, it is called the normal-based linear classifier.
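As an illustration, the sketch below (again Python/NumPy, an assumed setting) builds the plug-in quadratic classifier of Eq. (13) from maximum likelihood estimates of the class means and covariance matrices, with the optional regularization of Eq. (12); all function names are hypothetical.

```python
import numpy as np

def fit_quadratic(X1, X2, lam=0.0):
    """Plug-in estimates for the normal-based quadratic classifier, Eq. (13).
    X1, X2: objects of class omega_1 and omega_2; lam: regularization of Eq. (12)."""
    params = []
    for X in (X1, X2):
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        S = (1 - lam) * S + lam * np.eye(X.shape[1])      # Eq. (12)
        params.append((mu, np.linalg.inv(S), np.linalg.slogdet(S)[1]))
    prior1 = len(X1) / (len(X1) + len(X2))
    return params, np.log(prior1 / (1 - prior1))

def decide_quadratic(params, log_prior_ratio, x):
    """Evaluate f(x) of Eq. (13); positive means omega_1, negative omega_2, cf. Eq. (10)."""
    (m1, P1, ld1), (m2, P2, ld2) = params                  # P = inverse covariance
    f = 0.5 * x @ (P2 - P1) @ x + (m1 @ P1 - m2 @ P2) @ x \
        - 0.5 * m1 @ P1 @ m1 + 0.5 * m2 @ P2 @ m2 \
        - 0.5 * ld1 + 0.5 * ld2 + log_prior_ratio
    return 1 if f > 0 else 2
```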
For the linear and the quadratic classifier, strong class distributional assumptions are made: each class has a Gaussian distribution. In many applications this cannot be assumed, and more flexible class models have to be used. One possibility is to use a 'non-parametric' model. An example is the Parzen density model. Here the density is estimated by summing local kernels with a fixed size h which are centered on each of the training objects:

p(x|\omega) = \frac{1}{N} \sum_{i=1}^{N} N(x;\, x_i,\, hI),   (15)

where I is the identity matrix and h is the width parameter which has to be optimized. By substituting (15) into (6), the Parzen classifier is defined. The only free parameter in this classifier is the size (or width) h of the kernel. Optimizing this parameter by maximizing the likelihood on the training data will result in the solution h = 0. To avoid this, a leave-one-out procedure can be used9.
2.3 Boundary methods
Density estimation in high dimensional spaces is difficult. In order to have a reliable estimate, large amounts of training data should be available. Unfortunately, in many cases the number of training objects is limited. Therefore it is not always wise to estimate the class distributions completely. Looking at (1), (6) and (10), it is only of interest which class is to be preferred over the other. This problem is simpler than estimating p(x|ω). For a two-class problem, just a function f(x) is needed which is positive for objects of ω_1 and negative otherwise. In this section we will list some classifiers which avoid estimating p(x|ω) but try to obtain a suitable f(x).
The Fisher classifier searches for a direction w in the feature space such that the two classes are separated as well as possible. The degree to which the two classes are separated is measured by the so-called Fisher ratio, or Fisher criterion:

J = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}.   (16)

Here m_1 and m_2 are the means of the two classes, projected onto the direction w: m_1 = w^T μ_1 and m_2 = w^T μ_2. The s_1^2 and s_2^2 are the variances of the two classes projected onto w. The criterion therefore favors directions in which the means are far apart and the variances are small.
This Fisher ratio can be explicitly rewritten in terms of w. First we rewrite s_c^2 = \sum_{x\in\omega_c} (w^T x - w^T \mu_c)^2 = w^T \sum_{x\in\omega_c} (x-\mu_c)(x-\mu_c)^T\, w = w^T S_c w. Second, we write (m_1 - m_2)^2 = (w^T\mu_1 - w^T\mu_2)^2 = w^T(\mu_1-\mu_2)(\mu_1-\mu_2)^T\, w = w^T S_B w. The term S_B is also called the between scatter matrix. J becomes:

J = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_1 w + w^T S_2 w} = \frac{w^T S_B w}{w^T S_W w},   (17)

where S_W = S_1 + S_2 is also called the within scatter matrix.

In order to optimize (17), we set the derivative of (17) to zero and obtain:

(w^T S_W w)\, S_B w = (w^T S_B w)\, S_W w.   (18)

We are interested in the direction of w and not in its length, so we drop the scalar terms between brackets. Further, from the definition of S_B it follows that S_B w is always in the direction μ_1 - μ_2. Multiplying both sides of (18) by S_W^{-1} gives:

w \sim S_W^{-1} (\mu_1 - \mu_2).   (19)

This classifier is known as the Fisher classifier. Note that the threshold b is not defined for this classifier. It is also linear and requires the inversion of the within scatter matrix S_W. This formulation yields an identical shape of w as the expression in (14), although the classifiers use very different starting assumptions!
Most classifiers which have been discussed so far have a very restricted form of their decision boundary. In many cases these boundaries are not flexible enough to follow the true decision boundaries. A flexible method is the k-nearest neighbor rule. This classifier looks locally at which labels are most dominant in the training set. First it finds the k nearest objects in the training set, N_k(x), and then counts how many of these neighbors, n_1 and n_2, are from class ω_1 and ω_2 respectively:

if n_1 > n_2 assign x to ω_1, otherwise to ω_2.   (20)

Although the training of the k-nearest neighbor classifier is trivial (it only has to store all training objects, and k can simply be optimized by a leave-one-out estimation), it may become expensive to classify a new object x. For this the distances to all training objects have to be computed, which may be prohibitive for large training sets and high dimensional feature spaces.
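A direct NumPy sketch of the rule (20) is given below; a brute-force distance computation is assumed, which makes explicit why the classification cost grows with the training set size. The names are hypothetical.

```python
import numpy as np

def knn_classify(X_tr, y_tr, x, k=3):
    """k-nearest neighbor rule, Eq. (20): majority vote among the k closest
    training objects. X_tr: (N, p) training set, y_tr: labels, x: (p,) object."""
    dist = np.linalg.norm(X_tr - x, axis=1)       # distances to all training objects
    neighbors = np.argsort(dist)[:k]              # indices of the k nearest objects
    labels, counts = np.unique(y_tr[neighbors], return_counts=True)
    return labels[np.argmax(counts)]              # the locally most dominant label
```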
Another classifier which is flexible but does not require the storage of the full training set is the multi-layered feed-forward neural network4. A neural network is a collection of small processing units, called neurons, which are interconnected by weights w and v to form a network. A schematic picture is shown in Figure 2. An input object x is processed through different layers of neurons, through the hidden layer to the output layer. The output of the j-th output neuron becomes:

o_j(x) = \frac{1}{1+\exp(-v_j^T h(x))}, \quad \text{with hidden unit outputs } h_k(x) = \frac{1}{1+\exp(-w_k^T x)},   (21)

(see Figure 2 for the meaning of the variables). The object x is then assigned to the class j for which the corresponding output neuron has the highest output o_j.
Fig. 2. Schematic picture of a neural network: the inputs x_1, ..., x_p are connected by weights w to the hidden neurons, whose outputs are connected by weights v to the output neurons.
To optimize this neural network, the squared error between the network output and the desired class label is defined:

E = \sum_{i=1}^{N} \sum_{j=1}^{K} \big(n_j(x_i) - o_j(x_i)\big)^2,   (22)

where n_j(x) is 1 if object x belongs to class ω_j, and 0 otherwise. To simplify the notation, we will combine all the weights w and v into one weight vector w.

This error E is a continuous function of the weights w, and the derivative of E with respect to these weights can easily be calculated. The weights of the neural network can therefore be optimized to minimize the error by gradient descent, analogous to (4):

w_{new} = w_{old} - \eta \frac{\partial E}{\partial w},   (23)

where η is the learning parameter. After expanding this learning rule (23), it appears that the weight updates for each layer of neurons can be computed by back-propagating the error which is computed at the output of the network, (n_j(x_i) - o_j(x_i)). This is therefore called the back-propagation update rule.
The advantage of this type of neural network is that it is flexible and that it can be trained using these update rules. The disadvantages are that there are many important parameters to be chosen beforehand (the number of layers, the number of neurons per layer, the learning rate, the number of training updates, etc.), and that the optimization can be extremely slow. To increase the training speed, several additions and extensions have been proposed, for instance the inclusion of momentum terms in (23), or the use of second order methods.
Neural networks can easily be overtrained. Many heuristic techniques have been developed to decrease the chance of overtraining. One of the methods is to use weight decay, in which an extra regularization term is added to equation (22). This regularization term, often something of the form

\lambda \sum_k w_k^2,   (24)

tries to reduce the size of the individual weights in the network. By restricting the size of the weights, the network will adjust less to the noise in the data sample and become less complex. The regularization parameter λ regulates the trade-off between the classification error E and the classifier complexity. When the size of the network (in terms of the number of neurons) is also chosen carefully, good performances can be achieved by the neural network.
A similar approach is chosen for the support vector classifier32. The most basic version is just a linear classifier as in Eq. (10) with

f(x) = w^T x + b.   (25)

The minimum distance from the training objects to the classifier is thereby maximized. This gives the classifier some robustness against noise in the data, such that it will generalize well for new data. It appears that this maximum margin ρ is inversely related to ||w||^2, such that maximizing this margin means minimizing ||w||^2 (taking into account the constraints that all the objects are correctly classified).

Fig. 3. Schematic picture of a support vector classifier.
Given linearly separable data, the linear classifier is found which has the largest margin ρ to each of the classes. To allow for some errors in the classification, slack variables ξ_i are introduced to weaken the hard constraints. The error to minimize for the support vector classifier therefore consists of two parts: the complexity of the classifier in terms of w^T w, and the number of classification errors, measured by \sum_i \xi_i. The optimization can be stated by the following mathematical formulation:

\min_{w,b,\xi}\ \tfrac{1}{2}\, w^T w + C \sum_i \xi_i,   (26)

\text{s.t. } w^T x_i + b \ge 1 - \xi_i \text{ for } x_i \in \omega_1, \quad w^T x_i + b \le -1 + \xi_i \text{ for } x_i \in \omega_2, \quad \xi_i \ge 0.   (27)

Parameter C determines the trade-off between the complexity of the classifier, as measured by w^T w, and the number of classification errors.
Although the basic version of the support vector classifier is a linear classifier, it can be made much more powerful by the introduction of kernels. When the constraints (27) are incorporated into (26) by the use of Lagrange multipliers α, this error can be rewritten in the so-called dual form. For this, we define the labels y, where y_i = 1 when x_i ∈ ω_1 and y_i = -1 otherwise. The optimization becomes:

\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j, \quad \text{s.t. } \sum_i y_i\alpha_i = 0,\ 0 \le \alpha_i \le C,\ \forall i,   (28)

with w = \sum_i \alpha_i y_i x_i. Due to the constraints in (28) the optimization is not trivial, but standard software packages exist which can solve this quadratic programming problem. It appears that in the optimal solution of (28) many of the α_i become 0. Therefore only a few α_i ≠ 0 determine w. The corresponding objects x_i are called the support vectors. All other objects in the training set can be ignored.
The special feature of this formulation is that both the classifier f(x) and the error (28) are completely stated in terms of inner products between objects, x_i^T x_j. This means that the classifier does not explicitly depend on the features of the objects. It depends on the similarity between the object x and the support vectors x_i, measured by the inner product x^T x_i. By replacing the inner product by another similarity, defined by the kernel function K(x, x_i), other non-linear classifiers are obtained. One of the most popular kernel functions is the Gaussian kernel:

K(x, x_i) = \exp\!\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right),   (29)

where σ is still a free parameter.
The drawback of the support vector classifier is that it requires the solution of a large quadratic programming problem (28), and that suitable settings for the parameters C and σ have to be found. On the other hand, when C and σ are optimized, the performance of this classifier is often very competitive. Another advantage of this classifier is that it offers the possibility to encode problem specific knowledge in the kernel function K. In particular for problems where a good feature representation is hard to derive (for instance in the classification of shapes or text documents) this can be important.
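One practical way to train such a kernelized support vector classifier is to hand the quadratic programming problem (28) to an off-the-shelf solver. The sketch below assumes scikit-learn is available (the chapter does not prescribe any particular package), with C and the kernel width as the free parameters discussed above; the data are a hypothetical example.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2.5, 1, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])     # labels y_i in {+1, -1}

# Gaussian (RBF) kernel SVM; gamma corresponds to 1/sigma^2 in Eq. (29)
clf = SVC(kernel="rbf", C=10.0, gamma=1.0)
clf.fit(X, y)

# Only the support vectors (alpha_i != 0 in Eq. (28)) are retained by the model
print("number of support vectors:", len(clf.support_))
print("training error:", np.mean(clf.predict(X) != y))
```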
2.4 Multi-class classifiers
In the previous sections we focused on the two-class classification problem. This simplifies the formulation and notation of the classifiers. Many classifiers can trivially be extended to multi-class problems. For instance the Bayes classifier (1) becomes:

assign x to \omega_{\hat c} \text{ with } \hat c = \arg\max_c p(\omega_c|x).   (30)

Most of the classifiers directly follow from this. Only the boundary methods which were constructed to explicitly distinguish between two classes, for instance the Fisher classifier or the support vector classifier, cannot be trivially extended. For these classifiers several combining techniques are available. The two main approaches to decompose a multi-class problem into a set of two-class problems are:
(1) one-against-all: train K classifiers between one of the classes and all others;
(2) one-against-one: train K(K − 1)/2 classifiers to distinguish all pairs of classes.
Afterwards the classifiers have to be combined, using classification confidences (posterior probabilities) or majority voting. A more advanced approach is to use Error-Correcting Output Codes (ECOC), where classifiers are trained to distinguish specific combinations of classes, but are allowed to ignore others7. The class combinations are chosen such that a redundant output labeling appears, and possible classification errors can be fixed.
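A minimal sketch of the one-against-all decomposition is given below (Python/NumPy, an assumed setting); it wraps any two-class trainer that returns a real-valued discriminant. The base trainer shown here is an illustrative least-squares linear discriminant, not one of the chapter's classifiers in particular, and all names are hypothetical.

```python
import numpy as np

def one_against_all_fit(X, y, fit_two_class):
    """Train K classifiers, each separating one class from all others.
    fit_two_class(X, t) must return a function X_new -> real-valued confidences."""
    classes = np.unique(y)
    return classes, [fit_two_class(X, (y == c).astype(float)) for c in classes]

def one_against_all_predict(classes, discriminants, X):
    # Assign each object to the class whose classifier is the most confident
    scores = np.column_stack([f(X) for f in discriminants])
    return classes[np.argmax(scores, axis=1)]

# Example base trainer: a least-squares linear discriminant
def fit_two_class(X, t):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.lstsq(Xb, 2 * t - 1, rcond=None)[0]
    return lambda Xnew: np.hstack([Xnew, np.ones((len(Xnew), 1))]) @ w
```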
2.5 One-class classifiers
A fundamental assumption in all previous discussions is that a representative training set X_tr is available. That means that examples from both classes are present, sampled according to their class priors. In some applications one of the classes might contain very diverse objects, or its objects are difficult or expensive to measure. This happens for instance in machine diagnostics or in medical applications. A sufficient number of representative examples from the class of ill patients or the class of faulty machines is sometimes hard to collect. In these cases one cannot rely on a representative dataset to train a classifier, and a so-called one-class classifier30 may be used.
Fig. 4. One-class classifier example.
In one-class classifiers it is assumed that we have examples from just one of the classes, called the target class. From all other possible objects, per definition the outlier objects, no examples are available during training. When it is assumed that the outliers are uniformly distributed around the target class, the classifier should circumscribe the target class as tightly as possible in order to minimize the chance of accepting outlier objects.

In general, the problem of one-class classification is harder than the problem of conventional two-class classification. In conventional classification problems the decision boundary is supported from both sides by examples of both classes. Because in the case of one-class classification only one set of data is available, only one side of the boundary is supported. It is therefore hard to decide, on the basis of just one class, how strictly the boundary should fit around the data in each of the feature directions. In order to have a good distinction between the target objects and the outliers, a good representation of the data is essential.

Approaches similar to standard two-class classification can be used here. Using the uniform outlier distribution assumption, the class posteriors can be estimated, and the class conditional distributions or direct boundary methods can be constructed. For high dimensional spaces the density estimators suffer and often boundary methods are to be preferred.
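As a simple illustration of a density-based one-class classifier, the sketch below (Python/NumPy, an assumed setting) fits a single Gaussian to the target class, as in Eq. (11), and accepts a new object only if its estimated log-density exceeds a threshold chosen so that a small fraction of the target training objects would be rejected; the threshold strategy and the names are hypothetical choices, not prescribed by the chapter.

```python
import numpy as np

def fit_gaussian_target(X_target, reject_fraction=0.05):
    """One-class classifier: Gaussian target model with a density threshold."""
    mu = X_target.mean(axis=0)
    S = np.cov(X_target, rowvar=False) + 1e-6 * np.eye(X_target.shape[1])
    P = np.linalg.inv(S)
    logdet = np.linalg.slogdet(S)[1]

    def log_density(X):
        d = X - mu
        return -0.5 * np.sum(d @ P * d, axis=1) - 0.5 * logdet \
               - 0.5 * X.shape[1] * np.log(2 * np.pi)

    # Threshold such that `reject_fraction` of the target data falls outside
    threshold = np.quantile(log_density(X_target), reject_fraction)
    return lambda X: log_density(X) >= threshold   # True = accept as target
```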
2.6 Combining classifiers
In practice it is hard to find (and train) a classifier which fits the data distribution sufficiently well. The model can be difficult to construct (by the user), too hard to optimize, or insufficient training data is available to train it. In these cases it can be very beneficial to combine several "weak" classifiers in order to boost the classification performance21. It is hoped that each individual classifier will focus on different aspects of the data and err on different objects. Combining the set of so-called base classifiers will then compensate for their weak areas.
Fig. 5. A combining classifier: the outputs of the base classifiers (e.g. confidences) are combined into a single decision.
The most basic combining approach is to train several different types of classifiers on the same dataset and combine their outputs. One has to realize that classifiers can only correct each other when their outputs vary, i.e. when the set of classifiers is diverse22. It appears therefore to be more advantageous to combine classifiers which were trained on objects represented by different features. Another approach to force classifiers to become diverse is to artificially change the training set by resampling (resulting in a bagging6 or a boosting8 approach).

The outputs of the classifiers can be combined using several combining rules18, depending on the type of classifier outputs. If the classifiers provide crisp output labels, a voting combining rule has to be used. When real valued outputs are available, they can be averaged, weighted averaged or multiplied, the maximum or minimum output can be taken, or even an output classifier can be trained. If fixed (i.e. not trained) rules are used, it is important that the output of each classifier is properly scaled. Using a trainable combining rule, this constraint can be alleviated, but clearly training data is required to optimize this combining rule10.
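The fixed combining rules mentioned above are easy to state in code; the sketch below (Python/NumPy, assumed) combines a stack of per-classifier confidence estimates by the mean, product, maximum or majority-vote rule. The array layout is a hypothetical convention.

```python
import numpy as np

def combine(posteriors, rule="mean"):
    """Combine base classifier outputs.
    posteriors: array of shape (n_classifiers, n_objects, n_classes) with
    (properly scaled) confidences per base classifier. Returns class indices."""
    if rule == "mean":
        scores = posteriors.mean(axis=0)
    elif rule == "product":
        scores = posteriors.prod(axis=0)
    elif rule == "max":
        scores = posteriors.max(axis=0)
    elif rule == "vote":
        votes = posteriors.argmax(axis=2)                    # crisp label per classifier
        scores = np.stack([(votes == c).sum(axis=0)          # count votes per class
                           for c in range(posteriors.shape[2])], axis=1)
    else:
        raise ValueError("unknown combining rule")
    return scores.argmax(axis=1)
```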
3 Feature reduction
In many classification problems it is unclear which features have to be taken into account. Often a large set of k potentially useful features is collected, and by feature reduction the k' most suitable features are chosen. Often the distinction between feature selection and feature extraction is made. In selection, only a subset of the original features is chosen. The advantage is that in the final application just a few features have to be measured. The disadvantage is that the selection of the appropriate subset is an expensive search. In extraction, new features are derived from the original features. Often all original features are used, and no reduction is obtained in the number of measurements, but in many cases the optimization is easier. In Section 3.1 we will discuss several evaluation criteria, then in Section 3.2 feature selection and finally in Section 3.3 feature extraction.
3.1 Feature set evaluation criteria
In order to evaluate a feature set, a criterion J has to be defined. Because feature reduction is often applied in classification, the most obvious criterion is the performance of the classifier. Unfortunately, the optimization of a classifier is often hard, and other evaluation criteria might be a cheaper approximation to this classification performance. Therefore approximate criteria are used, measuring the distance or dissimilarity between distributions, or even ignoring the class labels and just focusing on unsupervised characteristics.

Some typical evaluation criteria are listed in Table 1. The most simple ones use the scatter matrices characterizing the scatter within the classes (showing how samples scatter around their class mean vector, called S_W, the within scatter matrix) and the scatter between the classes (showing how the means of the classes scatter, S_B, the between scatter matrix; see also the discussion of the Fisher ratio in section 2.3). These scatter matrices can be combined using several functions, listed in the first part of Table 1. Often S_1 = S_B is used, and S_2 = S_W or S_2 = S_W + S_B.

The measures between distributions involve the class distributions p(x|ω_i); in practice often single Gaussian distributions for each of the classes are chosen. The reconstruction errors still contain free parameters in the form of a matrix of basis vectors W or a set of prototypes μ_k. These are optimized in their respective procedures, like Principal Component Analysis or Self-Organizing Maps. The scatter criteria and the supervised measures between distributions are mainly used in feature selection, Section 3.2. The unsupervised reconstruction errors are used in feature extraction, Section 3.3.
Table 1. Feature selection criteria for measuring the difference between two distributions or for measuring a reconstruction error: measures using scatter matrices (e.g. J = tr(S_2^{-1} S_1)), measures between distributions, and reconstruction errors.
3.2 Feature selection

In feature selection a subset of the original features is chosen. A feature selection procedure consists of two ingredients: the first is the evaluation criterion to evaluate a given set of features, the second is a search strategy to search over all possible feature subsets16. Exhaustive search is in many applications not feasible. When we start with k = 250 features and we want to select k' = 10, we have to consider in principle \binom{250}{10} \approx 2 \cdot 10^{17} different subsets, which is clearly too much.

Instead of exhaustive search, a forward selection can be applied. It starts with the single best feature (according to the evaluation criterion) and adds the feature which gives the biggest improvement in performance. This is repeated till the requested number of features k' is reached. Instead of forward selection, the opposite approach can be used: backward selection. This starts with the complete set of features and removes the feature for which the performance increase is the largest. These approaches have the significant drawback that they might miss the optimal subset: the subsets for which the individual features have poor discriminability but which, combined, give a very good performance. In order to find these subsets, a more advanced search strategy is required. It can be a floating search, where adding and removing features is alternated. Another approach is the branch-and-bound algorithm12, where all the subsets of features are arranged in a search tree. This tree is traversed in such an order that large sub-branches can be disregarded as soon as possible, and the search process is shortened significantly. This strategy will yield the optimal subset when the evaluation criterion J is monotone, that means that when for a certain feature set a value of J_k is obtained, a subset of the features cannot have a higher value than J_k. Criteria like the Bayes error, the Chernoff distance or the functions of the scatter matrices fulfill this.
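A forward selection loop is straightforward to sketch; the version below (Python/NumPy, assumed) greedily adds the feature that most improves a user-supplied criterion J evaluated on the candidate subset, for example a scatter-based measure as in Table 1 or a cross-validated classifier performance. Names are hypothetical.

```python
import numpy as np

def forward_selection(X, y, criterion, n_select):
    """Greedy forward feature selection.
    criterion(X_sub, y) must return a value to be maximized for the subset X_sub."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        scores = [criterion(X[:, selected + [f]], y) for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Example criterion: trace(S_W^{-1} S_B), a scatter-based measure as in Table 1
def scatter_criterion(X_sub, y):
    mean_all = X_sub.mean(axis=0)
    SW = np.zeros((X_sub.shape[1],) * 2)
    SB = np.zeros_like(SW)
    for c in np.unique(y):
        Xc = X_sub[y == c]
        d = (Xc.mean(axis=0) - mean_all)[:, None]
        SW += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
        SB += len(Xc) * (d @ d.T)
    return np.trace(np.linalg.solve(SW + 1e-9 * np.eye(len(SW)), SB))
```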
Currently, other approaches appear which combine the traditional feature selection and the subsequent training of a classifier. One example is a linear classifier (with the functional form of (25)) called LASSO, the Least Absolute Shrinkage and Selection Operator31. The classification problem is approached as a regression problem with an additional regularization. A linear function is fitted to the data by minimizing the following error:

\sum_{i=1}^{N} (w^T x_i + b - y_i)^2 + C \sum_j |w_j|.   (31)

The first part measures the deviation of the linear function w^T x_i + b from the expected label y_i. The second part shrinks the weights w, such that many of them become zero. By choosing a suitable value for C, the number of retained features can be changed. This kind of regularization appears to be very effective when the number of features is huge (in the thousands) and the training size is small (in the tens). A similar solution can be obtained when the term w^T w in (26) is replaced by |w|3.
3.3 Feature extraction
Instead of using a subset of the given features, a smaller set of new features may be derived from the old ones. This can be done by linear or nonlinear feature extraction. For the computation of the new features usually all original features are used. Feature extraction will therefore almost never reduce the number of measurements. The optimization criteria are often based on reconstruction errors, as in Table 1.

The most well-known linear extraction method is Principal Component Analysis (PCA)17. Each new feature i is a linear combination of the original features: x'_i = w_i^T x. The new features are optimized to minimize the PCA mean squared reconstruction error, Table 1. It basically extracts the directions w_i in which the data set shows the highest variance. These directions appear to be equivalent to the eigenvectors of the (estimated) covariance matrix Σ with the largest eigenvalues. For the i-th principal component w_i therefore holds:

\Sigma w_i = \lambda_i w_i, \qquad \lambda_i \ge \lambda_j \text{ if } i < j.   (32)
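The eigendecomposition behind Eq. (32) takes only a few lines; the sketch below (Python/NumPy, assumed) centers the data, estimates the covariance matrix and keeps the eigenvectors with the largest eigenvalues as the extraction matrix. Names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Principal Component Analysis via the eigenvectors of the covariance matrix, Eq. (32)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]                  # columns are the directions w_i

def pca_transform(mean, W, X):
    return (X - mean) @ W                           # new features x'_i = w_i^T (x - mean)
```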
An extension of the (linear) PCA is the kernelized version, kernel-PCA24. Here the standard covariance matrix Σ is replaced by a covariance matrix in a feature space. After rewriting, the eigenvalue problem in the feature space reduces to the following eigenvalue problem: K\alpha_i = \lambda_i \alpha_i. Here K is an N x N kernel matrix (like, for instance, (29)). An object x is mapped onto the i-th principal component by:

x'_i = \sum_j \alpha_{ij}\, K(x, x_j).   (33)

Although this feature extraction is linear in the kernel space, in the feature space it will obtain non-linear combinations of features.
There are many other methods for extracting nonlinear features, for instance the Self-Organizing Map (SOM)20. The SOM is an unsupervised clustering and feature extraction method in which the cluster centers are constrained in their placing. The construction of the SOM is such that all objects in the input space retain as much as possible their distance and neighborhood relations in the mapped space. In other words, the topology is preserved in the mapped space.

The mapping is performed by a specific type of neural network, equipped with a special learning rule. Assume that we want to map a k-dimensional measurement space to a k'-dimensional feature space, where k' < k. In fact, often k' = 1 or k' = 2. In the feature space, we define a finite orthogonal grid with grid points. At each grid point we place a neuron or prototype. Each neuron stores a k-dimensional vector μ_k that serves as a cluster center. By defining a grid for the neurons, each neuron does not only have a neighboring neuron in the measurement space, it also has a neighboring neuron in the grid. During the learning phase, neighboring neurons in the grid are enforced to also be neighbors in the measurement space. By doing so, the local topology will be preserved. Unfortunately, training a SOM involves the setting of many unintuitive parameters and heuristics (similar to many neural network approaches).
A more principled approach than the SOM is the Generative Topographic Mapping, GTM5. The idea is to find a representation of the original p-dimensional data x in terms of L-dimensional latent variables z. For this a mapping function y(z|W) has to be defined. In the GTM it is assumed that the distribution of z in the latent variable space is a grid of delta functions z_m:

p(z) = \frac{1}{M}\sum_{m=1}^{M} \delta(z - z_m),   (34)

and that the mapping is linear in a set of fixed basis functions:

y(z|W) = W\Phi(z),   (35)

where Φ(z) consists of M fixed basis functions (in many cases Gaussian functions) and W is a p x M weight matrix. Because in reality the data will never fit the low-dimensional manifold perfectly, a noise model is introduced: a Gaussian distribution with variance σ^2:

p(x|z, W, \sigma) = N(x;\, y(z|W),\, \sigma).   (36)

The distribution p(x) can then be obtained by integration over the z distribution:

p(x|W, \sigma) = \int p(x|z, W, \sigma)\, p(z)\, dz.   (37)

The advantage is that the model is a full probability model. This model can be fitted by optimizing the log likelihood of the training data, \ln \prod_i p(x_i|W, \sigma), using an Expectation-Maximization algorithm. When the user supplies the dimensionality of the latent variable space L, the number of grid points M in this space and the basis functions Φ(z), then the parameters W and σ can be optimized.
An even simpler model to optimize is the Locally Linear Embedding, LLE28. Here also the goal is to find a low dimensional representation of the training set X_tr. But unlike the GTM, where an explicit manifold is fitted, here the low dimensional representation is optimized such that the objects can be reconstructed from their neighbors in the training set in the same manner in the low dimensional representation as in the high dimensional one. First, the weights w_{ij} for reconstructing each object x_i from its neighbors x_j are optimized (minimizing the LLE reconstruction error, Table 1, under the constraint that \sum_j w_{ij} = 1). Given the weights, the locations of the low-dimensional feature vectors z_i, i = 1, ..., N are optimized, using the same LLE reconstruction error, but where x_i is replaced by z_i. This can be minimized by solving an eigenvalue problem (similar to finding the principal components).
The feature extraction methods presented above are all unsupervised, i.e. other information like class labels is not used. This can be a significant drawback when the feature reduction is applied as a preprocessing step for solving a classification problem: it might actually happen that all informative features are removed. To avoid this, supervised feature extraction has to be used. Very well known is Linear Discriminant Analysis (LDA)27, which uses the weight vector w from the Fisher classifier (see section 2.3) as feature direction. A multi-class extension is presented in Ref. 27, but it assumes equal covariance matrices for all classes and the number of extracted features is restricted to K − 1. The LDA can be extended to include the differences in covariance matrices by using the Chernoff criterion instead of the Fisher criterion23.
4 Error estimation
At various stages in the design of a pattern classification system an estimate of the performance of a procedure, or of the separability of a set of classes, is needed. Examples are the selection of the 'best' feature during feature selection, the feature subspace to be used when several feature extraction schemes are investigated, the performance of the base classifiers in order to find a good set of classifiers to be combined, the optimization of various parameters in classification schemes, like the smoothing parameter in the Parzen classifier and the number of hidden units used in a neural network classifier, and the final selection of the overall classification procedure if various competing schemes are followed consecutively. Moreover, at the end an estimate of the performance of the selected classifier is desired.

In order to find an unbiased error estimate, a set of test objects with known labels is desirable. This set should be representative for the circumstances expected during the practical use of the procedures under study. Usually this implies that the test set has to be randomly drawn from the future objects to be classified. As their labels should be known for proper testing, these objects are suitable for training as well. Once an object is used for training, however, the resulting classifier is expected to perform well on this object. Consequently, if it is also used for testing, it generates an optimistic bias in the error estimate. Below, two techniques are discussed to solve this problem. The first is cross-validation, which aims at circumventing the bias. The second is a bootstrap technique by which the bias is estimated and corrected.
4.1 Cross-validation
Assume that a design set X_d is available for the development of a pattern recognition system, or one of its subsystems, and that in addition to the classifier itself an unbiased estimate of its performance is needed. If X_d is split (e.g. at random) into a training set X_tr and a test set X_te, then we want X_tr to be as large as possible to train a good classifier, but simultaneously X_te has to be sufficiently large for an accurate error estimate. The standard deviation of this estimate is sqrt(ε(1 − ε)/N_te) (e.g. 0.003 for ε = 0.01 and N_te = 1000, and 0.03 for ε = 0.1 and N_te = 100). When the design set is not sufficiently large to split it into a test set and a training set of appropriate sizes, a cross-validation procedure might be used, in which the design set is split into B (B ≥ 2) subsets of about the same size. In total B different classifiers are trained, each by a different group of B − 1 of these subsets. Each classifier is tested by the single subset not used for its training. Finally the B test results are averaged. Consequently, all objects are used for testing once. The classifiers they are testing are all based on a (B − 1)/B part of the design set. For larger B these classifiers are expected to be similar, and they will be just slightly worse than the classifier based on all objects. A good choice seems to be a 10-fold stratified cross-validation19, i.e. B = 10, and objects are selected evenly from the classes, i.e. in agreement with their prior probabilities.
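The B-fold procedure can be sketched in a few lines of Python/NumPy (an assumed setting); `train` and `error` stand for any classifier training routine and error measure, and stratification is omitted for brevity. Names are hypothetical.

```python
import numpy as np

def cross_validation_error(X, y, train, error, B=10, seed=0):
    """B-fold cross-validation: train on B-1 folds, test on the remaining fold,
    and average the B test errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, B)
    errors = []
    for b in range(B):
        test = folds[b]
        tr = np.hstack([folds[j] for j in range(B) if j != b])
        clf = train(X[tr], y[tr])
        errors.append(error(clf, X[test], y[test]))
    return np.mean(errors)
```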
4.2 Bootstrap procedures
Instead of following a procedure that tries to minimize the bias in the error estimate, one may try to estimate the bias13,15. A general procedure (independent of the classifier used) can be based on a comparison of the expected apparent error E^app_b of a classifier trained on a bootstrap sample of the design set with its error E^d_b estimated on the entire design set. The difference can be used as an estimate for the bias in the apparent error, E_bias = E^d_b − E^app_b, which can be used as a correction for the apparent error E^app_d of the classifier based on the design set: E_boot = E^app_d + E_bias = E^app_d + E^d_b − E^app_b.

A second estimator based on bootstrapping is the so-called E_632 error11,13,15. It is based on a weighted average of the apparent error E^app_d of the classifier based on the design set and an error estimate E^0_b for the bootstrap classifier based on the out-of-bootstrap part of the design set. The first is optimistically biased (an apparent error) and the second is an unbiased error estimate (tested by independent samples) of a classifier that is somewhat worse (based on just a bootstrap sample) than the target classifier based on the design set. The weights are given by the asymptotic probability that a sample will be included in a bootstrap sample: 0.632. The E_632 error estimate thereby is given by: E_632 = 0.368 E^app_d + 0.632 E^0_b.
4.3 Error curves
The graphical representation of the classification error is an important tool to study, compare and understand the behavior of classification systems. Some examples of such error curves are:

Learning curve: the error as a function of the number of training samples. Simple classifiers decrease faster, but often have a higher asymptotic value than more complex ones.

Complexity curve: the error as a function of the complexity of the classifier, e.g. the feature size or the number of hidden units. Such a curve often shows an increasing error after an optimal feature size or complexity.

Parameter curve: the error as a function of a parameter in the training procedure, e.g. the smoothing parameter in the Parzen classifier. The optimum that may be observed in such curves is related to the best fit of the underlying model in the classification system w.r.t. the data.

Error-reject trade-off: the error as a function of the reject probability. If a classifier output (e.g. a confidence estimate) is thresholded to reject unreliably classified objects, then this curve shows the gain in error reduction.

ROC curves: the trade-off between two types of errors, e.g. the two types of error in a two-class problem. These Receiver Operating Characteristic curves were first studied in communication theory and are useful to select a classifier if the point of operation may vary, e.g. due to unknown classification costs or prior probabilities.
5 Discussion
In the previous sections an overview was given of well established techniques for statistical pattern recognition, with a few excursions to more recent developments. Modern scientific and industrial developments, the use of computers and the internet in daily life and the fast growing sensor technology raise new problems as well as enable new solutions. We will summarize some new developments in statistical pattern recognition, partially introduced above, partially not yet discussed.
Other types of representation than the traditional features enable other ways to incorporate expert knowledge. The dissimilarity representation is an example of this, as it offers the possibility to express knowledge in the definition of the dissimilarity measure, but it also opens other options. Instead of being based on the raw data like spectra, images, or time signals, it may be defined on models of objects, like graphs. In such cases structural knowledge is used for the object descriptions. In addition to the nearest neighbor rule, dissimilarity based classifiers offer a richer set of tools with more possibilities to learn from examples, thereby bridging the gap between structural and statistical pattern recognition. Several problems, however, still have to be solved, like the selection of a representation set, optimal modifications of a given dissimilarity measure and the construction of dedicated classifiers.

More complicated pattern recognition problems may not be solved by a single off-the-shelf classifier. By the combining classifier technique a number of partial solutions can be combined. Several questions are still open here, like the selection or generation of the base classifiers, the choice of the combiner, and the use of a finite training set. Moreover, an overall mathematical foundation is still not available.

One-class classifiers are a good way to handle ill sampled problems, or to build classifiers when some of the classes are undersampled. This is important for applications like man or machine monitoring when one of the classes, e.g. normal behavior, is very well defined. Such classifiers may also be used when it is not possible to select a representative training set by an appropriate sampling of the domain of objects. In such cases a domain based class description may be found, locating the class boundary in the representation, without building a probability density function.

The widespread availability of computers and sensors, and the costs of labeling objects by human experts, may sometimes result in large databases in which just a small fraction of the objects is labeled. Techniques for training classifiers by partially labeled datasets are still in their early years. This may also be considered as combining clustering and classification.
For such problems, in which the costs of expert labeling are high, one may also try to optimize the set of objects to be labeled. This technique is called active learning. Several competing strategies exist, e.g. sampling close to an initial decision boundary, or retrieving objects in the modes of the class density distributions. Another variant is online learning, in which the order of the objects to be presented to a decision function is determined by the application, e.g. by a production line in a factory. It then has to be decided whether objects can be safely classified or whether a human expert has to be consulted, not only to reduce the risk of misclassification, but also to optimally improve the available classification function.
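For illustration only, the following minimal sketch implements the boundary-sampling strategy mentioned above: given the (hypothetical) decision values of an initial classifier on a pool of unlabeled objects, it selects the objects closest to the decision boundary for expert labeling.

```python
import numpy as np

def select_queries(decision_values, n_queries=5):
    """Pick the unlabeled objects closest to the current decision boundary
    (smallest |f(x)|) as the next objects to be labeled by the expert."""
    decision_values = np.asarray(decision_values, dtype=float)
    return np.argsort(np.abs(decision_values))[:n_queries]

# Toy usage: a linear decision function f(x) = w.x + b on random unlabeled objects.
rng = np.random.default_rng(2)
X_unlabeled = rng.normal(size=(100, 2))
w, b = np.array([1.0, -0.5]), 0.1
f = X_unlabeled @ w + b
print("objects to label next:", select_queries(f, n_queries=3))
```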
An often recurring question in dynamic environments is whether a trained classification function is still valid, or whether it should be retrained due to new circumstances. In such problems 'learning' and 'forgetting' are directly related. If a new situation demands retraining, old objects may not be representative anymore and should be forgotten. (They may still be stored in case the old situation appears to return after some time.)
Many techniques have been proposed, and many more are to come, for solving problems such as the above. A difficulty that cannot be easily handled is that they are often ill defined. Consequently, generally valid benchmarks are not available, so it is not straightforward to detect the good procedures that may work well over a series of applications. As good and bad procedures cannot easily be distinguished, it is to be expected that the set of tools used in statistical pattern recognition will grow significantly in the near future.
HIDDEN MARKOV MODELS FOR SPATIO-TEMPORAL PATTERN RECOGNITION
Brian C. Lovell (a) and Terry Caelli (b)

(a) The School of Information Technology and Electrical Engineering
The University of Queensland, Australia QLD 4072
E-mail: lovell@itee.uq.edu.au

(b) National Information and Communications Technology Australia (NICTA)
Research School of Information Sciences and Engineering
Australian National University, Australia
E-mail: tcaelli@ualberta.ca
The success of many real-world applications demonstrates that hidden Markov models (HMMs) are highly effective in one-dimensional pattern recognition problems such as speech recognition. Research is now focussed on extending HMMs to 2-D and possibly 3-D applications which arise in gesture, face, and handwriting recognition. Although the HMM has become a major workhorse of the pattern recognition community, there are few analytical results which can explain its remarkably good pattern recognition performance. There are also only a few theoretical principles for guiding researchers in selecting topologies or understanding how the model parameters contribute to performance. In this chapter, we deal with these issues and use simulated data to evaluate the performance of a number of alternatives to the traditional Baum-Welch algorithm for learning HMM parameters. We then compare the best of these strategies to Baum-Welch on a real hand gesture recognition system in an attempt to develop insights into these fundamental aspects of learning.
1 Introduction
There is an enormous volume of literature on the application of hidden Markov models (HMMs) to a broad range of pattern recognition tasks. In the case of speech recognition, the patterns we wish to recognise are spoken words, which are audio signals against time. Indeed, the value of Markov models for modelling speech was recognised by Shannon26 as early as 1948. In the case of hand gesture recognition, the patterns are hand movements in both space and time; we call this a spatio-temporal pattern recognition problem. The suitability and efficacy of HMMs for such problems is undeniable and they are now established as one of the major tools of the pattern recognition community. Yet, when one looks for research which addresses fundamental problems such as efficient learning strategies for HMMs, or perhaps analytically determining the most suitable architectures for a given problem, the number of papers is greatly diminished. So despite the enormous uptake of HMMs since their introduction in the 1960's, we believe that there is still a great deal of unexplored territory.
Much of the application of HMMs in the literature is based firmly on the methodology popularised by Rabiner et al. (1983)25,16,24 for speech recognition, and these studies are the primary reference for many HMM researchers, resulting in two common practices: one, to use the forward algorithm to determine the MAP (maximum posterior probability) of the model, given an observation sequence, as a classification metric; two, to use Baum-Welch as a model estimation/update procedure. We will see that these are not ideal strategies: in the former case, classification is reduced to a single number without directly using the model (data summary) parameters, or attributes, per se. As for the latter, the Baum-Welch4 algorithm (a version of the famous Expectation-Maximisation algorithm14,1,21) is, in the words of Stolcke and Omohundro28, "far from foolproof since it uses what amounts to a hill-climbing procedure that is only guaranteed to find a local likelihood maximum." Moreover, as observed by Rabiner24, results can be very dependent on the initial values chosen for the HMM parameters.
The problem of finding local rather than global maxima is encountered in many other areas of learning theory and optimisation. These problems are familiar territory to researchers in the artificial neural network community, and many techniques have been proposed to counter them. Moreover, genetic and evolutionary algorithmic techniques specialise in solving such problems, albeit often very slowly, especially in the case of biological evolution11. With this in mind, we use simulated data to investigate other approaches to learning HMMs from observation sequences in an attempt to find superior alternatives to the traditional Baum-Welch algorithm. Then we compare and test the best of the alternate strategies on real data from a hand gesture recognition system to see if the real data trials corroborate the conclusions drawn from the simulated trials.
1.1 Background and Notation
In this study, we focus on the discrete HMM as popularised by Rabiner24. Using the familiar notation from his tutorial paper, a hidden Markov model consists of a set of $N$ nodes, each of which is associated with a set of $M$ possible observations. The parameters of the model include an initial state vector

$$\pi = [p_1, p_2, p_3, \ldots, p_N]^T$$

with elements $p_n$, $n \in [1, N]$, which describes the distribution over the initial node set, a transition matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}$$

with elements $a_{ij}$, $i, j \in [1, N]$, giving the transition probability from node $i$ to node $j$, and an $N \times M$ observation matrix $B$ whose rows are the probability distributions over the $M$ symbols emitted in each state.
The pattern of allowed transitions and emissions defines the topology or structure of the model (see Figure 1 for an illustration of two different transition structures). One commonly used topology is called Fully-Connected (FC) or Ergodic. In an FC HMM there is not necessarily a defined starting state and all state transitions are possible, such that $a_{ij} \neq 0$ for all $i, j \in [1, N]$. Another topology, especially popular in speech recognition applications, is called Left-Right (LR). In an LR HMM there is a defined starting state (usually state 1) and only transitions to same- or higher-index states are allowed, such that $a_{ij} = 0$ for all $i > j$, where $i, j \in [1, N]$.
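To make the two topologies concrete, the following NumPy sketch (ours, not from the chapter) builds randomly initialized, row-stochastic transition matrices obeying the FC and LR constraints just stated.

```python
import numpy as np

def random_fc_transitions(n_states, rng=None):
    """Fully-Connected (ergodic) topology: every a_ij is non-zero."""
    rng = np.random.default_rng(rng)
    A = rng.uniform(0.1, 1.0, size=(n_states, n_states))
    return A / A.sum(axis=1, keepdims=True)   # each row is a probability distribution

def random_lr_transitions(n_states, rng=None):
    """Left-Right topology: a_ij = 0 for i > j, so only same- or higher-index
    states are reachable from state i."""
    rng = np.random.default_rng(rng)
    A = np.triu(rng.uniform(0.1, 1.0, size=(n_states, n_states)))
    return A / A.sum(axis=1, keepdims=True)

print(np.round(random_lr_transitions(4, rng=0), 2))   # upper-triangular, rows sum to 1
```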
Rabiner24 defines the three basic problems of HMMs as follows:
Problem 1: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model?

Problem 2: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and the model $\lambda$, how do we choose a corresponding state sequence $Q = q_1 q_2 \cdots q_T$ which is optimal in some meaningful sense (i.e., best "explains" the observations)?

Problem 3: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?
Problems 1 and 2 are elegantly and efficiently solved by the forward and Viterbi29,12 algorithms respectively, as described by Rabiner in his tutorial. The forward algorithm is used to recognise matching HMMs (i.e., the highest probability models, MAP) from the observation sequences. Note, again, that this is not a typical approach to pattern classification as it does not involve matching model with observation attributes. That would involve comparing the model parameters and estimated observation model parameters; MAP does not perform this and so it cannot be as sensitive a measure as exact parameter comparisons. Indeed, a number of reports have already shown that quite different HMMs can have identical emissions (observation sequences)18,3. The Viterbi algorithm is used less frequently as we are normally more interested in finding the matching model than in finding the state sequence. However, this algorithm is critical in evaluating the precision of the HMM; in other words, how well the model can reconstruct (predict) the observations.
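For reference, a minimal scaled implementation of the forward algorithm is sketched below in NumPy. The variable names follow the notation above, the model values in the usage example are arbitrary, and this is an illustration rather than the implementation used in this chapter.

```python
import numpy as np

def forward_log_likelihood(obs, A, B, pi):
    """Scaled forward algorithm: returns log P(O | lambda) for a discrete HMM with
    transition matrix A (N x N), emission matrix B (N x M) and initial vector pi."""
    alpha = pi * B[:, obs[0]]             # alpha_1(i) = pi_i * b_i(O_1)
    scale = alpha.sum()
    alpha = alpha / scale
    log_likelihood = np.log(scale)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction step
        scale = alpha.sum()
        alpha = alpha / scale             # rescale to avoid numerical underflow
        log_likelihood += np.log(scale)
    return log_likelihood

# Usage on a small 2-state, 3-symbol model (numbers are arbitrary).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_log_likelihood([0, 1, 2, 2, 1], A, B, pi))
```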
Rabiner proposes solving Problem 3 via the Baum-Welch algorithm, which is, in essence, a gradient ascent algorithm, a method which is guaranteed to find local maxima only. Solving Problem 3 is effectively the problem of learning to recognise new patterns, so it is really the fundamental problem of HMM learning theory; a significant improvement here could boost the performance of all HMM-based pattern recognition systems. Therefore it is somewhat surprising that there appear to be relatively few papers devoted to this topic; the vast majority are devoted to applications of the HMM. In the next section we compare a number of alternatives to, and variations of, Baum-Welch in an attempt to find superior learning strategies.
2 Comparison of Methods for Robust HMM Parameter Estimation
We focus on the problem of reliably learning HMMs from a small set of short observation sequences. The need to learn rapidly from small sets arises quite often in practice. In our case, we are interested in learning hand gestures which are limited to just 25 observations. The limitation arises because we record each video at 25 frames per second and each of our gestures takes less than one second to complete. Moreover, we wish to obtain good recognition performance from small training sets to ensure that new gestures can be rapidly recognised by the system.
Four HMM parameter estimation methods are evaluated and compared using a train-and-test classification methodology. For these binary classification tests we create two random HMMs and then use each of these to generate test and training data sequences. For normalization, we ensure that each test sequence can be correctly recognized by its true model; thus the true models obtain 100% classification accuracy on the test data by construction. The various learning methods are then used to estimate the two HMMs from their respective training sets, and the recognition performance of the pair of estimated HMMs is evaluated on the unseen test data sets. This random model generation and evaluation process is repeated 16 times for each data sample to provide meaningful statistical results.
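A minimal sketch of the data-generation side of this protocol is given below (our own illustration, not the authors' code): it draws a random discrete HMM and samples observation sequences of length 25 from it.

```python
import numpy as np

def random_hmm(n_states, n_symbols, rng):
    """Draw a random discrete HMM (A, B, pi) with row-stochastic matrices."""
    A  = rng.dirichlet(np.ones(n_states),  size=n_states)
    B  = rng.dirichlet(np.ones(n_symbols), size=n_states)
    pi = rng.dirichlet(np.ones(n_states))
    return A, B, pi

def sample_sequence(A, B, pi, length, rng):
    """Sample one observation sequence of the given length from the model."""
    n_states, n_symbols = B.shape
    obs, state = [], rng.choice(n_states, p=pi)
    for _ in range(length):
        obs.append(rng.choice(n_symbols, p=B[state]))   # emit a symbol in the current state
        state = rng.choice(n_states, p=A[state])        # move to the next state
    return obs

rng = np.random.default_rng(3)
A, B, pi = random_hmm(n_states=4, n_symbols=6, rng=rng)
training_set = [sample_sequence(A, B, pi, length=25, rng=rng) for _ in range(10)]
print(training_set[0])
```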
Before parameter re-estimation, we initialize with two random HMMs, which should yield 50% recognition performance on average. So an average recognition performance above 50% after re-estimation shows that some degree of learning must have taken place. Clearly, if the learning strategy could perfectly determine both of the HMMs which generated the training data sets, we would obtain 100% recognition performance on the test sets.
We compare four learning methods: 1) traditional Baum-Welch; 2) ensemble averaging introduced by Davis and Lovell9, based on ideas presented by MacKay19; 3) Entropic MAP introduced by Brand6; and 4) Viterbi Path Counting10, which is a special case of Stolcke and Omohundro's Best-First algorithm28. The results in Figure 2 indicate that these alternate HMM learning methods all classify significantly better than the well-known Baum-Welch algorithm and also require less training data. The Entropic MAP estimator performs well, but surprisingly its performance is much the same as simple ensemble averaging. Ensemble averaging involves training multiple models using the Baum-Welch algorithm and then simply averaging the model parameters without regard to structure. Note that for a single sequence, ensemble averaging is identical to the traditional usage of the Baum-Welch algorithm. Overall, the stand-out performer was the VPC algorithm. In these and other trials, this method converges to good models very rapidly and has performed better than the other methods in virtually all of our simulated HMM studies.

Fig. 2. Relative performance of the HMM parameter estimation methods as a function of the number of training sequences. Viterbi Path Counting produces the best quality models with a much smaller number of training iterations. (Classification performance averaged over 16 experiments; the true models score 100% by construction.)
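The ensemble-averaging step itself is straightforward. The sketch below is an illustration under the assumption that each trained model is available as an (A, B, pi) triple; it averages the parameters element-wise and renormalizes, with the Baum-Welch training of the individual models omitted.

```python
import numpy as np

def average_models(models):
    """Element-wise average of a list of (A, B, pi) parameter sets, renormalized
    (a safeguard against numerical drift) so the result is again a valid HMM."""
    A  = np.mean([m[0] for m in models], axis=0)
    B  = np.mean([m[1] for m in models], axis=0)
    pi = np.mean([m[2] for m in models], axis=0)
    A  /= A.sum(axis=1, keepdims=True)
    B  /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi

# Usage with two arbitrary 2-state, 2-symbol models.
m1 = (np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.7, 0.3], [0.4, 0.6]]), np.array([0.5, 0.5]))
m2 = (np.array([[0.6, 0.4], [0.3, 0.7]]), np.array([[0.8, 0.2], [0.1, 0.9]]), np.array([0.7, 0.3]))
print(average_models([m1, m2]))
```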
3 Video Gesture Recognition
In an attempt to corroborate the strong performance of VPC compared to Baum-Welch on a real-world application, we test various learning techniques on a system for real-time video gesture recognition, as shown in Figure 3.
Trang 38In earlier related work, Starner and Pentland 27 developed a HMM-based system to recognise gesture phrases in American Sign Language Later, Lee and Kim 1 5 used HMM- based hand gesture recognition to control viewgraph presentation in data projected semi- nars Our system recognizes gestures based on the letters of the alphabet traced in space in front of a video camera The motivation for this application is to produce a way of typing messages into a camera-equipped mobile phone or PDA using video gestures instead of the keypad or pen interface We use single stroke letter gestures similar to those already widely used for pen data entry in PDAs For example, figure 3 shows the hand gestures for the letters "Z" and "W." The complete gesture set is shown in figure 6
Fig 3 "Fingerwriting:" Single stroke video gesture for letters "W" and "Z."
Each video sequence comprises 25 frames, corresponding to one second of video. Skin colour segmentation in YUV colour space is applied to locate the hand. Pre-processing (morphological) operations smooth the image and remove noise before tracking the hand with a modified Camshift algorithm5. After segmenting the hand, we calculate image moments to find the centroid in each frame. Along the trajectory, the direction (angle) of motion of each of the 25 hand movements is calculated and quantized to one of 18 discrete symbols. The resultant discrete angular observation sequence is input to the HMM classification module for training and recognition.
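The final quantization step can be summarized by the following sketch (ours, with illustrative function names; segmentation and tracking are omitted): frame-to-frame centroid displacements are converted to angles and binned into 18 direction symbols.

```python
import numpy as np

def directions_to_symbols(centroids, n_symbols=18):
    """Convert a (T x 2) array of per-frame hand centroids into T-1 discrete
    direction symbols by quantizing the angle of each frame-to-frame movement."""
    centroids = np.asarray(centroids, dtype=float)
    d = np.diff(centroids, axis=0)            # movement vectors between frames
    angles = np.arctan2(d[:, 1], d[:, 0])     # angles in (-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_symbols).astype(int) % n_symbols
    return bins

# Usage on a synthetic circular trajectory sampled over 25 frames.
t = np.linspace(0, 2 * np.pi, 25)
trajectory = np.stack([np.cos(t), np.sin(t)], axis=1)
print(directions_to_symbols(trajectory))
```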
We compare traditional Baum-Welch with the most promising alternative from the simulated study, VPC. We evaluate recognition performance over all 26 character gestures using fully connected (FC), left-right (LR), and left-right banded (LRB) model topologies, with the number of states ranging from 1 to 14. An LRB model is an LR model which has a transition structure containing self-transitions and next-state transitions only (i.e., states cannot be skipped), as shown in Figure 5. More formally, $a_{ij} \neq 0$ for $j = i$ or $j = i + 1$, and $a_{ij} = 0$ otherwise, with $i, j \in [1, N]$.
Our video gesture database contains 780 video gestures, with 30 examples of each gesture. Recognition accuracy is evaluated using threefold cross-validation, where 20 gestures are used for training and 10 for testing in each partition. These HMMs are initialized with random HMM parameters before using either Baum-Welch or VPC for learning.

From Figure 4, the best average recognition accuracy achieved is 97.31%, obtained when VPC is used for training, the topology is LRB, and the number of states is 13.
Trang 39Max
Baum-Welch
FC 80.00 72.69 66.54 80.00 75.20 75.60 77.60 76.80 77.60 76.00 65.20 74.80 84.80 72.80 75.40 84.80
LR 80.00 94.23 92.31 84.80 81.20 84.80 86.40 86.00 85.60 81.60 86.80 86.80 84.00 81.60 85.44 92.31
LRB 80.00 93.85 96.15 85.38 90.77 85.77 89.62 89.62 90.00 88.46 89.23 88.08 90.00 88.46 88.96 96.15
VPC
FC 80.38 71.15 63.85 53.20 59.60 55.20 45.60 44.40 49.20 43.20 42.80 40.80 39.60 38.80 51.98 63.85
LR 80.38 91.92 91.15 91.20 91.20 90.40 91.20 90.40 90.40 90.00 90.00 90.00 90.00 90.40 89.90 91.20
LRB 80.38 90.77 93.08 90.38 95.00 93.85 94.23 94.23 94.62 95.00 95.00 95.77 97.31 93.46 93.08 97.31
Fig 4 Average percent correct recognition for all 26 video letter gestures against topology and training method
0808
Fig 5 Left-Right banded topology
Although this corroborates the stronger VPC performance exhibited in our simulated data trials, a closer investigation of Figure 4 raises some doubts about this conjecture through the following observations:
• The Baum-Welch algorithm did almost as well as VPC, with a best performance of 96.15% correct recognition with only 3 states. Moreover, we achieve a very surprising 80% correct recognition with just a single state.
• Topology (i.e., constraints on the initial value of the A matrix) has more impact on performance than the choice of learning algorithm.
• Good recognition performance can be obtained over a very broad range of N, the number of states.
3.1 Comments on Learning Algorithm Performance
We do not suggest that the above observations can be generalized to other real-world application domains, but anecdotal evidence from other researchers suggests that similar behaviour is often encountered. When we designed this gesture system, we thought that this pattern recognition problem was quite challenging and would significantly differentiate the learning strategies. Yet the surprisingly good performance over a number of learning algorithms, topologies, and a broad range of N suggests that the problem is significantly easier than we suspected.

Fig. 6. The alphabet of single-stroke letter hand gestures.
Our intuition suggests that 3 states is far too small a number to adequately model all of these complex letter gestures, but the results show that it is indeed possible to find a three-state HMM which yields very good recognition performance. We conjecture that the observation matrix B provides most of the recognition performance, and that recognition may be only weakly affected by good estimation of the transition matrix A.
In support of this idea, we may consider the following interpretation of the HMM. Consider each row of the B matrix as the probability mass function of the observation symbols emitted in a given state. In the limiting case of a single-state HMM, the B matrix becomes a vector of source symbol probabilities, and application of the forward algorithm for recognition is thus equivalent to the well-known and powerful MAP classifier. Indeed, from Figure 4 we see that this single-state degenerate HMM can achieve 80% recognition performance. So sometimes, even if the state transitions are poorly modelled, it is quite possible to find good classifiers based on source statistics.
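This interpretation is easy to verify numerically. In the sketch below (an illustration with made-up symbol statistics), the forward computation for a single-state HMM reduces to summing the log-probabilities of the observed symbols, i.e. a classifier based purely on source statistics.

```python
import numpy as np

def log_likelihood_single_state(obs, b):
    """For a 1-state HMM the forward algorithm collapses to
    log P(O | lambda) = sum_t log b(O_t): a pure source-statistics score."""
    return np.sum(np.log(b[np.asarray(obs)]))

# Two "classes" with different symbol statistics over 18 direction symbols.
rng = np.random.default_rng(4)
b_class1 = rng.dirichlet(np.ones(18))
b_class2 = rng.dirichlet(np.ones(18))
obs = rng.choice(18, size=25, p=b_class1)     # a sequence actually drawn from class 1
scores = [log_likelihood_single_state(obs, b) for b in (b_class1, b_class2)]
print("classified as class", int(np.argmax(scores)) + 1)
```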
Now clearly, if three states can yield strong performance, good HMMs with more than three states must also exist: a simple way to prove this is to note that we can always add additional states which are unreachable (i.e., have a transition probability of zero) without affecting recognition performance. This may help explain why performance stays much the same over a broad range of N as we increase N beyond three.
The question that arises is: "Why does the Baum-Welch algorithm perform so well on real-world data despite its theoretical flaws and rather poor performance on the simulated HMM data?" Once again, a possible explanation is that this particular spatio-temporal recognition task is relatively easy, so all methods can do quite well. This conjecture may be