5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
First published 2005
Reprinted 2006
HANDBOOK OF PATTERN RECOGNITION & COMPUTER VISION (3rd Edition)
Copyright © 2005 by World Scientific Publishing Co Pte Ltd
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-105-6
Printed in Singapore by Mainland Press
Preface to the Third Edition

Dedicated to the memory of the late Professor King Sun Fu (1930-1985), the handbook series, with its first edition (1993), second edition (1999) and third edition (2005), provides a comprehensive, concise and balanced coverage of the progress and achievements in the field of pattern recognition and computer vision in the last twenty years. This is a highly dynamic field which has been expanding greatly over the last thirty years. No handbook can cover the essence of all aspects of the field, and we have not attempted to do that. The carefully selected 33 chapters in the current edition were written by leaders in the field, and we believe that the book and its sister volumes, the first and second editions, will provide the growing pattern recognition and computer vision community with a set of valuable resource books that can last for a long time. Each chapter speaks for itself on the importance of the subject area covered.
The book continues to contain five parts. Part 1 is on the basic methods of pattern recognition. Though there are only five chapters, readers may find further coverage of basic methods in the first and second editions. Part 2 is on basic methods in computer vision. Again, readers may find that Part 2 complements well what was offered in the first and second editions. Part 3, on recognition applications, continues to emphasize character recognition and document processing. It also presents new applications in digital mammograms, remote sensing images and functional magnetic resonance imaging data. Currently one intensively explored area of pattern recognition applications is the personal identification problem, also called biometrics, though the problem has been around for a number of years. Part 4 is especially devoted to this topic area. Indeed, the chapters in both Part 3 and Part 4 represent the growing importance of applications in pattern recognition. In fact, Prof. Fu had envisioned the growth of pattern recognition applications in the early 60's. He and his group at Purdue worked on character recognition, speech recognition, fingerprint recognition, seismic pattern recognition, biomedical and remote sensing recognition problems, etc. Part 5, on system and technology, presents other important aspects of pattern recognition and computer vision.
Our sincere thanks go to all contributors of this volume for their outstanding technical contributions. We would like to mention specially Dr. Quang-Tuan Luong, Dr. Giovanni Garibotto and Prof. Ching Y. Suen for their original contributions to all three volumes. Other authors who have contributed to all three volumes are: Prof. Thomas S. Huang, Prof. J.K. Aggarwal, Prof. Yuan Y. Tang, Prof. C.C. Li, Prof. R. Chellappa and Prof. P.S.P. Wang. We are pleased to mention that Prof. Thomas Huang and Prof. Jake Aggarwal are the recipients, in 2002 and 2004 respectively, of the prestigious K.S. Fu Prize sponsored by the International Association for Pattern Recognition (IAPR). Among Prof. Fu's Ph.D. graduates at Purdue who have contributed to the handbook series are: C.H. Chen (1965), M.H. Loew (1972), S.M. Hsu (1975), S.Y. Lu (1977), K.Y. Huang (1983) and H.D. Cheng (1985). Finally, we would like to pay tribute to the late Prof. Azriel Rosenfeld (1931-2004) who, as one IAPR member put it, was a true scientist and a great giant in the field. He was awarded the K.S. Fu Prize by IAPR in 1988. Readers are reminded to read Prof. Rosenfeld's inspirational article "Vision - Some Speculations", which appeared as the Foreword of the second edition of the handbook series. Prof. Rosenfeld's profound influence on the field will be felt for many years to come.
The camera-ready manuscript production requires a certain amount of additional effort, as compared to typeset printing, on the part of the editors and authors. We would like to thank all contributors for their patience in making the necessary revisions to comply with the format requirements during this long process of manuscript preparation. Our special thanks go to Steven Patt, in-house editor of World Scientific Publishing, for his efficient effort in making a timely publication of the book possible.
Preface to the Third Edition v
Contents vii
Part 1 Basic Methods in Pattern Recognition 1
Chapter 1.1 Statistical Pattern Recognition 3
R.P.W Duin and D.M.J Tax
Chapter 1.2 Hidden Markov Models for Spatio-Temporal Pattern
Recognition 25
Brian C Lovell and Terry Caelli
Chapter 1.3 A New Kernel-Based Formalization of Minimum Error Pattern
Recognition 41
Erik McDermott and Shigeru Katagiri
Chapter 1.4 Parallel Contextual Array Grammars with Trajectories 55
P Helen Chandra, C Martin-Vide, K.G Subramanian, D.L Van and P S P Wang
Chapter 1.5 Pattern Recognition with Local Invariant Features 71
C Schmid, G Dorko, S Lazebnik, K Mikolajczyk and J Ponce
Part 2 Basic Methods in Computer Vision 93
Chapter 2.1 Case-Based Reasoning for Image Analysis and Interpretation 95
Petra Perner
Chapter 2.2 Multiple Image Geometry - A Projective Viewpoint 115
Quang-Tuan Luong
Chapter 2.3 Skeletonization in 3D Discrete Binary Images 137
Gabriella Sanniti di Baja and Ingela Nyström
Chapter 2.4 Digital Distance Transforms in 2D, 3D, and 4D 157
Gunilla Borgefors
Chapter 2.5 Computing Global Shape Measures 177
Paul L Rosin
Chapter 2.6 Texture Analysis with Local Binary Patterns 197
Topi Mäenpää and Matti Pietikäinen
Part 3 Recognition Applications 217
Chapter 3.1 Document Analysis and Understanding 219
Yuan Yan Tang
Chapter 3.2 Chinese Character Recognition 241
Xiaoqing Ding
Chapter 3.3 Extraction of Words from Handwritten Legal Amounts on
Bank Cheques 259
In Cheol Kim and Ching Y Suen
Chapter 3.4 OCR Assessment of Printed-Fonts for Enhancing Human
Vision 273
Ching Y Suen, Qizhi Xu and Cedric Devoghelaere
Chapter 3.5 Clustering and Classification of Web Documents Using a
Graph Model 287
Adam Schenker, Horst Bunke, Mark Last and Abraham Kandel
Chapter 3.6 Automated Detection of Masses in Mammograms 303
H.D Cheng, X.J Shi, R Min, X.P Cai and H.N Du
Chapter 3.7 Wavelet-Based Kalman Filtering in Scale Space for Image
Fusion 325
Hsi-Chin Hsin and Ching-Chung Li
Chapter 3.8 Multisensor Fusion with Hyperspectral Imaging Data:
Detection and Classification 347
Su May Hsu and Hsiao-hua Burke
Chapter 3.9 Independent Component Analysis of Functional Magnetic
Resonance Imaging Data 365
V.D. Calhoun and B. Hong
Part 4 Human Identification 385
Chapter 4.1 Multimodal Emotion Recognition 387
Nicu Sebe, Ira Cohen and Thomas S Huang
Chapter 4.2 Gait-Based Human Identification from a Monocular Video
Sequence 411
Amit Kale, Aravind Sundaresan, Amit K. Roy-Chowdhury and Rama Chellappa
Chapter 4.3 Palmprint Authentication System 431
David Zhang
Chapter 4.4 Reconstruction of High-Resolution Facial Images for Visual
Surveillance 445
Jeong-Seon Park and Seong Whan Lee
Chapter 4.5 Object Recognition with Deformable Feature Graphs: Faces,
Hands, and Cluttered Scenes 461
Jochen Triesch and Christian Eckes
Chapter 4.6 Hierarchical Classification and Feature Reduction for Fast Face
Detection 481
Bernd Heisele, Thomas Serre, Sam Prentice and Tomaso Poggio
Part 5 System and Technology 497
Chapter 5.1 Tracking and Classifying Moving Objects Using Single or
Multiple Cameras 499
Quming Zhou and J.K. Aggarwal
Chapter 5.2 Performance Evaluation of Image Segmentation Algorithms 525
Xiaoyi Jiang
Chapter 5.3 Contents-Based Video Analysis for Knowledge Discovery 543
Chia-Hung Yeh, Shih-Hung Lee and C.-C. Jay Kuo
Chapter 5.4 Object-Process Methodology and Its Applications to Image
Processing and Pattern Recognition 559
Dov Dori
Chapter 5.5 Musical Style Recognition — A Quantitative Approach 583
Peter van Kranenburg and Eric Backer
Chapter 5.6 Auto-Detector: Mobile Automatic Number Plate Recognition 601
Giovanni B. Garibotto
Chapter 5.7 Omnidirectional Vision 619
Hiroshi Ishiguro
Index 629
Chapter 1.1 Statistical Pattern Recognition
R.P.W. Duin and D.M.J. Tax

A review is given of the area of statistical pattern recognition: the representation of objects and the design and evaluation of trainable systems for generalization. Traditional as well as more recently studied procedures are reviewed, like the classical Bayes classifiers, neural networks, support vector machines, one-class classifiers and combining classifiers. Further, we introduce methods for feature reduction and error evaluation. New developments in statistical pattern recognition are briefly discussed.
1 Introduction
Statistical pattern recognition is the research area that studies statistical tools for the generalization of sets of real world objects or phenomena. It thereby aims to find procedures that answer questions like: does this new object fit into the pattern of a given set of objects, or: to which of the patterns defined in a given set does it fit best? The first question is related to cluster analysis, but is also discussed from some perspective in this chapter. The second question is on pattern classification, and that is what will be the main concern here.

The overall structure of a pattern recognition system may be summarized as in Figure 1. Objects have first to be appropriately represented before a generalization can be derived. Depending on the demands of the procedures used for this, the representation has to be adapted, e.g. transformed, scaled or simplified.

The procedures discussed in this chapter are partially also studied in areas like statistical learning theory32, machine learning25 and neural networks14. As the emphasis in pattern recognition is close to application areas, questions related to the representation of the objects are important here: how are objects described (e.g. features, distances to prototypes), how extensive may this description be, what are the ways to incorporate knowledge from the application domain? Representations have to be adapted to fit the tools that are used later. Simplifications of representations like feature reduction and prototype selection should thereby be considered.
In order to derive, from a training set, a classifier that is valid for new objects (i.e. that is able to generalize), the representation should fulfill an important condition: representations of similar real world objects have to be similar as well. The representations should be close. This is the so-called compactness hypothesis2 on which the generalization from examples to new, unseen objects is built. It enables the estimation of their class labels on the basis of distances to examples or on class densities derived from examples.

Objects are traditionally represented by vectors in a feature space. An important recent development to incorporate domain knowledge is the representation of objects by their relation to other objects. This may be done by a so-called kernel method29, derived from features, or directly on dissimilarities computed from the raw data26.

We will assume that, after processing the raw measurements, objects are given in a p-dimensional vector space Ω. Traditionally this space is spanned by p features, but also the dissimilarities with p prototype objects may be used. To simplify the discussion we will use the term feature space for both. If K is the number of classes to be distinguished, a pattern classification system, or shortly a classifier C(x), is a function or procedure that assigns to each object x in Ω a class ω_c, with c = 1, ..., K. Such a classifier has to be derived from a set of examples X_tr = {x_i, i = 1, ..., N} of known classes y_i; X_tr will be called the training set and y_i ∈ {ω_1, ..., ω_K} a label. Unless otherwise stated it is assumed that y_i is unique (objects belong to just a single class) and is known for all objects in X_tr.
In section 2 training procedures will be discussed to derive classifiers C(x) from training sets. The performance of these classifiers is usually not just related to the quality of the features (their ability to show class differences) but also to their number, i.e. the dimensionality of the feature space.
Fig. 1. The pattern recognition system: from sensors and measurement conditions, via the representation (features, dissimilarities) and its adaptation (feature extraction, prototype selection), to generalization (classifiers, class models, object models), producing class labels and confidences.
A growing number of features may increase the class separability, but may also decrease the statistical accuracy of the training procedure. It is thereby important to have a small number of good features. In section 3 a review is given of ways to reduce the number of features by selection or by combination (so-called feature extraction). The evaluation of classifiers, discussed in section 4, is an important topic. As the characteristics of new applications are often unknown beforehand, the best algorithms for feature reduction and classification have to be found iteratively on the basis of unbiased and accurate testing procedures.

This chapter builds further on earlier reviews of the area of statistical pattern recognition by Fukunaga12 and by Jain et al.16 It is inevitable to repeat and summarize them partly. We will, however, also discuss some new directions like one-class classifiers, combining classifiers, dissimilarity representations and techniques for building good classifiers and reducing the feature space simultaneously. In the last section of this chapter, the discussion, we will return to these new developments.
2 Classifiers
For the development of classifiers, we have to consider two main aspects: the basic assumptions that the classifier makes about the data (which result in a functional form of the classifier), and the optimization procedure to fit the model to the training data. It is possible to consider very complex classifiers, but without efficient methods to fit these classifiers to the data, they are not useful. Therefore, in many cases the functional form of the classifier is restricted by the available optimization routines.

We will start by discussing the two-class classification problem. In the first three sections, 2.1, 2.2 and 2.3, the three basic approaches with their assumptions are given: first, modeling the class posteriors, second, modeling the class conditional probabilities and finally modeling the classification boundary. In section 2.4 we discuss how these approaches can be extended to work for more than two classes. In the next section, the special case is considered where just one of the classes is reliably sampled. The last section, 2.6, discusses the possibilities to combine several (non-optimal) classifiers.
2.1 Bayes classifiers and approximations
A classifier should assign a new object x to the most likely class. In a probabilistic setting this means that the label of the class with the highest posterior probability should be chosen. This class can be found when p(ω_1|x) and p(ω_2|x) (for a two-class classification problem) are known. The classifier becomes:

if p(ω_1|x) > p(ω_2|x) assign object x to ω_1, otherwise to ω_2.   (1)

When we assume that p(ω_1|x) and p(ω_2|x) are known, and further assume that misclassifying an object originating from ω_1 as ω_2 is as costly as vice versa, classifier (1) is the theoretically optimal classifier and will make the minimum error. This classifier is called the Bayes optimal classifier.
In practice p(ω_1|x) and p(ω_2|x) are not known; only samples x_i are available, and the misclassification costs might be known only approximately. Therefore approximations to the Bayes optimal classifier have to be made. This classifier can be approximated in several different ways, depending on knowledge of the classification problem.

The first way is to approximate the class posterior probabilities p(ω_c|x). The logistic classifier assumes a particular model for the class posterior probabilities:

p(\omega_1|x) = \frac{1}{1+\exp(-w^T x)}, \qquad p(\omega_2|x) = 1 - p(\omega_1|x),   (2)

where w is a p-dimensional weight vector. This basically implements a linear classifier in the feature space.
An approach to fit this logistic classifier (2) to training data X_tr is to maximize the data likelihood L:

L = \prod_{i=1}^{N} p(\omega_1|x_i)^{n_1(x_i)}\, p(\omega_2|x_i)^{n_2(x_i)},   (3)

where n_c(x) is 1 if object x belongs to class ω_c, and 0 otherwise. This can be done by, for instance, an iterative gradient ascent method. Weights are iteratively updated using:

w_{new} = w_{old} + \eta \frac{\partial L}{\partial w},   (4)

where η is a suitably chosen learning rate parameter. In Ref. 1 the first (and second) derivative of L with respect to w are derived and can be plugged into (4).
2.2 Class densities and Bayes rule
Assumptions on p(ω|x) are often difficult to make. Sometimes it is more convenient to make assumptions on the class conditional probability densities p(x|ω): they indicate the distribution of the objects which are drawn from one of the classes. When assumptions on these distributions can be made, classifier (1) can be derived using Bayes' decision rule:

p(\omega|x) = \frac{p(x|\omega)\, p(\omega)}{p(x)}.   (5)

This rule basically rewrites the class posterior probabilities in terms of the class conditional probabilities and the class priors p(ω). This result can be substituted into (1), resulting in the following form:

if p(x|\omega_1)\, p(\omega_1) > p(x|\omega_2)\, p(\omega_2) assign x to ω_1, otherwise to ω_2.   (6)

The term p(x) is ignored because it is constant for a given x. Any monotonically increasing function can be applied to both sides without changing the final decision. In some cases, a suitable choice will simplify the notation significantly. In particular, using a logarithmic transformation can simplify the classifier when functions from the exponential family are used.
For the special case of a two-class problem the classifiers can be rewritten in terms of a single discriminant function f(x), which is the difference between the left hand side and the right hand side. A few possibilities are:

f(x) = p(\omega_1|x) - p(\omega_2|x),   (7)

f(x) = p(x|\omega_1)\, p(\omega_1) - p(x|\omega_2)\, p(\omega_2),   (8)

f(x) = \ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{p(\omega_1)}{p(\omega_2)}.   (9)

The classifier becomes:

if f(x) > 0 assign x to ω_1, otherwise to ω_2.   (10)
In many cases fitting p(x|ω) on training data is relatively straightforward. It is the standard density estimation problem: fit a density on a data sample. To estimate each p(x|ω), only the objects from the single class ω are used.

Depending on the functional form of the class densities, different classifiers are constructed. One of the most common approaches is to assume a Gaussian density for each of the classes:
p(x|\omega) = N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right),   (11)

where μ is the (p-dimensional) mean of the class ω, and Σ is the covariance matrix. Further, |Σ| indicates the determinant of Σ and Σ^{-1} its inverse. For the explicit values of the parameters μ and Σ usually the maximum likelihood estimates are plugged in; therefore this classifier is called the plug-in Bayes classifier. Extra complications occur when the sample size N is insufficient to (in particular) compute Σ^{-1}. In these cases a standard solution is to regularize the covariance matrix such that the inverse can be computed:

\Sigma_\lambda = (1-\lambda)\,\Sigma + \lambda I,   (12)

where I is the p x p identity matrix, and λ is the regularization parameter that sets the trade-off between the estimated covariance matrix and the regularizer I.
Substituting (11) for each of the classes ω_1 and ω_2 (with their estimated μ_1, μ_2 and Σ_1, Σ_2) into (9) results in:

f(x) = \tfrac{1}{2} x^T (\Sigma_2^{-1} - \Sigma_1^{-1}) x + (\mu_1^T \Sigma_1^{-1} - \mu_2^T \Sigma_2^{-1})\, x - \tfrac{1}{2}\mu_1^T \Sigma_1^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma_2^{-1} \mu_2 - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}\ln|\Sigma_2| + \ln\frac{p(\omega_1)}{p(\omega_2)}.   (13)

This classifier rule is quadratic in terms of x, and it is therefore called the normal-based quadratic classifier.
For the quadratic classifier a full covariance matrix has to be estimated for each of the classes. In high dimensional feature spaces it can happen that insufficient data is available to estimate these covariance matrices reliably. By restricting the covariance matrices to have fewer free variables, estimations can become more reliable. One approach to reduce the number of parameters is to assume that both classes have an identical covariance structure: Σ_1 = Σ_2 = Σ. The classifier simplifies to:

f(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{p(\omega_1)}{p(\omega_2)}.   (14)

Because this classifier is linear in terms of x, it is called the normal-based linear classifier.
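As an illustration, the sketch below (again Python/NumPy, an assumed setting) builds the plug-in quadratic classifier of Eq. (13) from maximum likelihood estimates of the class means and covariance matrices, with the optional regularization of Eq. (12); all function names are hypothetical.

```python
import numpy as np

def fit_quadratic(X1, X2, lam=0.0):
    """Plug-in estimates for the normal-based quadratic classifier, Eq. (13).
    X1, X2: objects of class omega_1 and omega_2; lam: regularization of Eq. (12)."""
    params = []
    for X in (X1, X2):
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        S = (1 - lam) * S + lam * np.eye(X.shape[1])      # Eq. (12)
        params.append((mu, np.linalg.inv(S), np.linalg.slogdet(S)[1]))
    prior1 = len(X1) / (len(X1) + len(X2))
    return params, np.log(prior1 / (1 - prior1))

def decide_quadratic(params, log_prior_ratio, x):
    """Evaluate f(x) of Eq. (13); positive means omega_1, negative omega_2, cf. Eq. (10)."""
    (m1, P1, ld1), (m2, P2, ld2) = params                  # P = inverse covariance
    f = 0.5 * x @ (P2 - P1) @ x + (m1 @ P1 - m2 @ P2) @ x \
        - 0.5 * m1 @ P1 @ m1 + 0.5 * m2 @ P2 @ m2 \
        - 0.5 * ld1 + 0.5 * ld2 + log_prior_ratio
    return 1 if f > 0 else 2
```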
For the linear and the quadratic classifier, strong class distributional assumptions are made: each class has a Gaussian distribution. In many applications this cannot be assumed, and more flexible class models have to be used. One possibility is to use a 'non-parametric' model. An example is the Parzen density model. Here the density is estimated by summing local kernels with a fixed size h which are centered on each of the training objects:

p(x|\omega) = \frac{1}{N} \sum_{i=1}^{N} N(x;\, x_i,\, hI),   (15)

where I is the identity matrix and h is the width parameter which has to be optimized. By substituting (15) into (6), the Parzen classifier is defined. The only free parameter in this classifier is the size (or width) h of the kernel. Optimizing this parameter by maximizing the likelihood on the training data will result in the solution h = 0. To avoid this, a leave-one-out procedure can be used9.
2.3 Boundary methods
Density estimation in high dimensional spaces is difficult. In order to have a reliable estimate, large amounts of training data should be available. Unfortunately, in many cases the number of training objects is limited. Therefore it is not always wise to estimate the class distributions completely. Looking at (1), (6) and (10), it is only of interest which class is to be preferred over the other. This problem is simpler than estimating p(x|ω). For a two-class problem, just a function f(x) is needed which is positive for objects of ω_1 and negative otherwise. In this section we will list some classifiers which avoid estimating p(x|ω) but try to obtain a suitable f(x).
The Fisher classifier searches for a direction w in the feature space such that the two classes are separated as well as possible. The degree to which the two classes are separated is measured by the so-called Fisher ratio, or Fisher criterion:

J = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}.   (16)

Here m_1 and m_2 are the means of the two classes, projected onto the direction w: m_1 = w^T μ_1 and m_2 = w^T μ_2. The s_1^2 and s_2^2 are the variances of the two classes projected onto w. The criterion therefore favors directions in which the means are far apart and the variances are small.
This Fisher ratio can be explicitly rewritten in terms of w. First we rewrite s_c^2 = \sum_{x\in\omega_c} (w^T x - w^T \mu_c)^2 = w^T \sum_{x\in\omega_c} (x-\mu_c)(x-\mu_c)^T\, w = w^T S_c w. Second, we write (m_1 - m_2)^2 = (w^T\mu_1 - w^T\mu_2)^2 = w^T(\mu_1-\mu_2)(\mu_1-\mu_2)^T\, w = w^T S_B w. The term S_B is also called the between scatter matrix. J becomes:

J = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_1 w + w^T S_2 w} = \frac{w^T S_B w}{w^T S_W w},   (17)

where S_W = S_1 + S_2 is also called the within scatter matrix.

In order to optimize (17), we set the derivative of (17) to zero and obtain:

(w^T S_W w)\, S_B w = (w^T S_B w)\, S_W w.   (18)

We are interested in the direction of w and not in its length, so we drop the scalar terms between brackets. Further, from the definition of S_B it follows that S_B w is always in the direction μ_1 - μ_2. Multiplying both sides of (18) by S_W^{-1} gives:

w \sim S_W^{-1} (\mu_1 - \mu_2).   (19)

This classifier is known as the Fisher classifier. Note that the threshold b is not defined for this classifier. It is also linear and requires the inversion of the within scatter matrix S_W. This formulation yields an identical shape of w as the expression in (14), although the classifiers use very different starting assumptions!
Most classifiers which have been discussed so far have a very restricted form of their decision boundary. In many cases these boundaries are not flexible enough to follow the true decision boundaries. A flexible method is the k-nearest neighbor rule. This classifier looks locally at which labels are most dominant in the training set. First it finds the k nearest objects in the training set, N_k(x), and then counts how many of these neighbors, n_1 and n_2, are from class ω_1 and ω_2 respectively:

if n_1 > n_2 assign x to ω_1, otherwise to ω_2.   (20)

Although the training of the k-nearest neighbor classifier is trivial (it only has to store all training objects, and k can simply be optimized by a leave-one-out estimation), it may become expensive to classify a new object x. For this the distances to all training objects have to be computed, which may be prohibitive for large training sets and high dimensional feature spaces.
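A direct NumPy sketch of the rule (20) is given below; a brute-force distance computation is assumed, which makes explicit why the classification cost grows with the training set size. The names are hypothetical.

```python
import numpy as np

def knn_classify(X_tr, y_tr, x, k=3):
    """k-nearest neighbor rule, Eq. (20): majority vote among the k closest
    training objects. X_tr: (N, p) training set, y_tr: labels, x: (p,) object."""
    dist = np.linalg.norm(X_tr - x, axis=1)       # distances to all training objects
    neighbors = np.argsort(dist)[:k]              # indices of the k nearest objects
    labels, counts = np.unique(y_tr[neighbors], return_counts=True)
    return labels[np.argmax(counts)]              # the locally most dominant label
```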
Another classifier which is flexible but does not require the storage of the full training set is the multi-layered feed-forward neural network4. A neural network is a collection of small processing units, called neurons, which are interconnected by weights w and v to form a network. A schematic picture is shown in Figure 2. An input object x is processed through different layers of neurons, through the hidden layer to the output layer. The output of the j-th output neuron becomes:

o_j(x) = \frac{1}{1+\exp(-v_j^T h(x))}, \quad \text{with hidden unit outputs } h_k(x) = \frac{1}{1+\exp(-w_k^T x)},   (21)

(see Figure 2 for the meaning of the variables). The object x is then assigned to the class j for which the corresponding output neuron has the highest output o_j.
Fig. 2. Schematic picture of a neural network: the inputs x_1, ..., x_p are connected by weights w to the hidden neurons, whose outputs are connected by weights v to the output neurons.
To optimize this neural network, the squared error between the network output and the desired class label is defined:

E = \sum_{i=1}^{N} \sum_{j=1}^{K} \big(n_j(x_i) - o_j(x_i)\big)^2,   (22)

where n_j(x) is 1 if object x belongs to class ω_j, and 0 otherwise. To simplify the notation, we will combine all the weights w and v into one weight vector w.

This error E is a continuous function of the weights w, and the derivative of E with respect to these weights can easily be calculated. The weights of the neural network can therefore be optimized to minimize the error by gradient descent, analogous to (4):

w_{new} = w_{old} - \eta \frac{\partial E}{\partial w},   (23)

where η is the learning parameter. After expanding this learning rule (23), it appears that the weight updates for each layer of neurons can be computed by back-propagating the error which is computed at the output of the network, (n_j(x_i) - o_j(x_i)). This is therefore called the back-propagation update rule.
The advantage of this type of neural network is that it is flexible and that it can be trained using these update rules. The disadvantages are that there are many important parameters to be chosen beforehand (the number of layers, the number of neurons per layer, the learning rate, the number of training updates, etc.), and that the optimization can be extremely slow. To increase the training speed, several additions and extensions have been proposed, for instance the inclusion of momentum terms in (23), or the use of second order methods.
Neural networks can easily be overtrained. Many heuristic techniques have been developed to decrease the chance of overtraining. One of the methods is to use weight decay, in which an extra regularization term is added to equation (22). This regularization term, often something of the form

\lambda \sum_k w_k^2,   (24)

tries to reduce the size of the individual weights in the network. By restricting the size of the weights, the network will adjust less to the noise in the data sample and become less complex. The regularization parameter λ regulates the trade-off between the classification error E and the classifier complexity. When the size of the network (in terms of the number of neurons) is also chosen carefully, good performances can be achieved by the neural network.
A similar approach is chosen for the support vector classifier32. The most basic version is just a linear classifier as in Eq. (10) with

f(x) = w^T x + b.   (25)

The minimum distance from the training objects to the classifier is thereby maximized. This gives the classifier some robustness against noise in the data, such that it will generalize well for new data. It appears that this maximum margin ρ is inversely related to ||w||^2, such that maximizing this margin means minimizing ||w||^2 (taking into account the constraints that all the objects are correctly classified).

Fig. 3. Schematic picture of a support vector classifier.
Given linearly separable data, the linear classifier is found which has the largest margin ρ to each of the classes. To allow for some errors in the classification, slack variables ξ_i are introduced to weaken the hard constraints. The error to minimize for the support vector classifier therefore consists of two parts: the complexity of the classifier in terms of w^T w, and the number of classification errors, measured by \sum_i \xi_i. The optimization can be stated by the following mathematical formulation:

\min_{w,b,\xi}\ \tfrac{1}{2}\, w^T w + C \sum_i \xi_i,   (26)

\text{s.t. } w^T x_i + b \ge 1 - \xi_i \text{ for } x_i \in \omega_1, \quad w^T x_i + b \le -1 + \xi_i \text{ for } x_i \in \omega_2, \quad \xi_i \ge 0.   (27)

Parameter C determines the trade-off between the complexity of the classifier, as measured by w^T w, and the number of classification errors.
Although the basic version of the support vector classifier is a linear classifier, it can be made much more powerful by the introduction of kernels. When the constraints (27) are incorporated into (26) by the use of Lagrange multipliers α, this error can be rewritten in the so-called dual form. For this, we define the labels y, where y_i = 1 when x_i ∈ ω_1 and y_i = -1 otherwise. The optimization becomes:

\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j, \quad \text{s.t. } \sum_i y_i\alpha_i = 0,\ 0 \le \alpha_i \le C,\ \forall i,   (28)

with w = \sum_i \alpha_i y_i x_i. Due to the constraints in (28) the optimization is not trivial, but standard software packages exist which can solve this quadratic programming problem. It appears that in the optimal solution of (28) many of the α_i become 0. Therefore only a few α_i ≠ 0 determine w. The corresponding objects x_i are called the support vectors. All other objects in the training set can be ignored.
The special feature of this formulation is that both the classifier f(x) and the error (28) are completely stated in terms of inner products between objects, x_i^T x_j. This means that the classifier does not explicitly depend on the features of the objects. It depends on the similarity between the object x and the support vectors x_i, measured by the inner product x^T x_i. By replacing the inner product by another similarity, defined by the kernel function K(x, x_i), other non-linear classifiers are obtained. One of the most popular kernel functions is the Gaussian kernel:

K(x, x_i) = \exp\!\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right),   (29)

where σ is still a free parameter.
The drawback of the support vector classifier is that it requires the solution of a large quadratic programming problem (28), and that suitable settings for the parameters C and σ have to be found. On the other hand, when C and σ are optimized, the performance of this classifier is often very competitive. Another advantage of this classifier is that it offers the possibility to encode problem specific knowledge in the kernel function K. In particular for problems where a good feature representation is hard to derive (for instance in the classification of shapes or text documents) this can be important.
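One practical way to train such a kernelized support vector classifier is to hand the quadratic programming problem (28) to an off-the-shelf solver. The sketch below assumes scikit-learn is available (the chapter does not prescribe any particular package), with C and the kernel width as the free parameters discussed above; the data are a hypothetical example.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2.5, 1, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])     # labels y_i in {+1, -1}

# Gaussian (RBF) kernel SVM; gamma corresponds to 1/sigma^2 in Eq. (29)
clf = SVC(kernel="rbf", C=10.0, gamma=1.0)
clf.fit(X, y)

# Only the support vectors (alpha_i != 0 in Eq. (28)) are retained by the model
print("number of support vectors:", len(clf.support_))
print("training error:", np.mean(clf.predict(X) != y))
```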
2.4 Multi-class classifiers
In the previous sections we focused on the two-class classification problem. This simplifies the formulation and notation of the classifiers. Many classifiers can trivially be extended to multi-class problems. For instance the Bayes classifier (1) becomes:

assign x to \omega_{\hat c} \text{ with } \hat c = \arg\max_c p(\omega_c|x).   (30)

Most of the classifiers directly follow from this. Only the boundary methods which were constructed to explicitly distinguish between two classes, for instance the Fisher classifier or the support vector classifier, cannot be trivially extended. For these classifiers several combining techniques are available. The two main approaches to decompose a multi-class problem into a set of two-class problems are:
(1) one-against-all: train K classifiers between one of the classes and all others;
(2) one-against-one: train K(K − 1)/2 classifiers to distinguish all pairs of classes.
Afterwards the classifiers have to be combined, using classification confidences (posterior probabilities) or majority voting. A more advanced approach is to use Error-Correcting Output Codes (ECOC), where classifiers are trained to distinguish specific combinations of classes, but are allowed to ignore others7. The class combinations are chosen such that a redundant output labeling appears, and possible classification errors can be fixed.
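A minimal sketch of the one-against-all decomposition is given below (Python/NumPy, an assumed setting); it wraps any two-class trainer that returns a real-valued discriminant. The base trainer shown here is an illustrative least-squares linear discriminant, not one of the chapter's classifiers in particular, and all names are hypothetical.

```python
import numpy as np

def one_against_all_fit(X, y, fit_two_class):
    """Train K classifiers, each separating one class from all others.
    fit_two_class(X, t) must return a function X_new -> real-valued confidences."""
    classes = np.unique(y)
    return classes, [fit_two_class(X, (y == c).astype(float)) for c in classes]

def one_against_all_predict(classes, discriminants, X):
    # Assign each object to the class whose classifier is the most confident
    scores = np.column_stack([f(X) for f in discriminants])
    return classes[np.argmax(scores, axis=1)]

# Example base trainer: a least-squares linear discriminant
def fit_two_class(X, t):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.lstsq(Xb, 2 * t - 1, rcond=None)[0]
    return lambda Xnew: np.hstack([Xnew, np.ones((len(Xnew), 1))]) @ w
```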
2.5 One-class classifiers
A fundamental assumption in all previous discussions is that a representative training set X_tr is available. That means that examples from both classes are present, sampled according to their class priors. In some applications one of the classes might contain very diverse objects, or its objects are difficult or expensive to measure. This happens for instance in machine diagnostics or in medical applications. A sufficient number of representative examples from the class of ill patients or the class of faulty machines is sometimes hard to collect. In these cases one cannot rely on a representative dataset to train a classifier, and a so-called one-class classifier30 may be used.
Fig. 4. One-class classifier example.
In one-class classifiers it is assumed that we have examples from just one of the classes, called the target class. From all other possible objects, per definition the outlier objects, no examples are available during training. When it is assumed that the outliers are uniformly distributed around the target class, the classifier should circumscribe the target class as tightly as possible in order to minimize the chance of accepting outlier objects.

In general, the problem of one-class classification is harder than the problem of conventional two-class classification. In conventional classification problems the decision boundary is supported from both sides by examples of both classes. Because in the case of one-class classification only one set of data is available, only one side of the boundary is supported. It is therefore hard to decide, on the basis of just one class, how strictly the boundary should fit around the data in each of the feature directions. In order to have a good distinction between the target objects and the outliers, a good representation of the data is essential.

Approaches similar to standard two-class classification can be used here. Using the uniform outlier distribution assumption, the class posteriors can be estimated, and the class conditional distributions or direct boundary methods can be constructed. For high dimensional spaces the density estimators suffer and often boundary methods are to be preferred.
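As a simple illustration of a density-based one-class classifier, the sketch below (Python/NumPy, an assumed setting) fits a single Gaussian to the target class, as in Eq. (11), and accepts a new object only if its estimated log-density exceeds a threshold chosen so that a small fraction of the target training objects would be rejected; the threshold strategy and the names are hypothetical choices, not prescribed by the chapter.

```python
import numpy as np

def fit_gaussian_target(X_target, reject_fraction=0.05):
    """One-class classifier: Gaussian target model with a density threshold."""
    mu = X_target.mean(axis=0)
    S = np.cov(X_target, rowvar=False) + 1e-6 * np.eye(X_target.shape[1])
    P = np.linalg.inv(S)
    logdet = np.linalg.slogdet(S)[1]

    def log_density(X):
        d = X - mu
        return -0.5 * np.sum(d @ P * d, axis=1) - 0.5 * logdet \
               - 0.5 * X.shape[1] * np.log(2 * np.pi)

    # Threshold such that `reject_fraction` of the target data falls outside
    threshold = np.quantile(log_density(X_target), reject_fraction)
    return lambda X: log_density(X) >= threshold   # True = accept as target
```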
2.6 Combining classifiers
In practice it is hard to find (and train) a classifier which fits the data distribution sufficiently well. The model can be difficult to construct (by the user), too hard to optimize, or insufficient training data is available to train it. In these cases it can be very beneficial to combine several "weak" classifiers in order to boost the classification performance21. It is hoped that each individual classifier will focus on different aspects of the data and err on different objects. Combining the set of so-called base classifiers will then compensate for their weak areas.
Fig. 5. A combining classifier: the outputs of the base classifiers (e.g. confidences) are combined into a single decision.
The most basic combining approach is to train several different types of classifiers on the same dataset and combine their outputs. One has to realize that classifiers can only correct each other when their outputs vary, i.e. when the set of classifiers is diverse22. It appears therefore to be more advantageous to combine classifiers which were trained on objects represented by different features. Another approach to force classifiers to become diverse is to artificially change the training set by resampling (resulting in a bagging6 or a boosting8 approach).

The outputs of the classifiers can be combined using several combining rules18, depending on the type of classifier outputs. If the classifiers provide crisp output labels, a voting combining rule has to be used. When real valued outputs are available, they can be averaged, weighted averaged or multiplied, the maximum or minimum output can be taken, or even an output classifier can be trained. If fixed (i.e. not trained) rules are used, it is important that the output of each classifier is properly scaled. Using a trainable combining rule, this constraint can be alleviated, but clearly training data is required to optimize this combining rule10.
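The fixed combining rules mentioned above are easy to state in code; the sketch below (Python/NumPy, assumed) combines a stack of per-classifier confidence estimates by the mean, product, maximum or majority-vote rule. The array layout is a hypothetical convention.

```python
import numpy as np

def combine(posteriors, rule="mean"):
    """Combine base classifier outputs.
    posteriors: array of shape (n_classifiers, n_objects, n_classes) with
    (properly scaled) confidences per base classifier. Returns class indices."""
    if rule == "mean":
        scores = posteriors.mean(axis=0)
    elif rule == "product":
        scores = posteriors.prod(axis=0)
    elif rule == "max":
        scores = posteriors.max(axis=0)
    elif rule == "vote":
        votes = posteriors.argmax(axis=2)                    # crisp label per classifier
        scores = np.stack([(votes == c).sum(axis=0)          # count votes per class
                           for c in range(posteriors.shape[2])], axis=1)
    else:
        raise ValueError("unknown combining rule")
    return scores.argmax(axis=1)
```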
3 Feature reduction
In many classification problems it is unclear which features have to be taken into account. Often a large set of k potentially useful features is collected, and by feature reduction the k' most suitable features are chosen. Often the distinction between feature selection and feature extraction is made. In selection, only a subset of the original features is chosen. The advantage is that in the final application just a few features have to be measured. The disadvantage is that the selection of the appropriate subset is an expensive search. In extraction, new features are derived from the original features. Often all original features are used, and no reduction is obtained in the number of measurements, but in many cases the optimization is easier. In Section 3.1 we will discuss several evaluation criteria, then in Section 3.2 feature selection and finally in Section 3.3 feature extraction.
3.1 Feature set evaluation criteria
In order to evaluate a feature set, a criterion J has to be defined. Because feature reduction is often applied in classification, the most obvious criterion is the performance of the classifier. Unfortunately, the optimization of a classifier is often hard, and other evaluation criteria might be a cheaper approximation to this classification performance. Therefore approximate criteria are used, measuring the distance or dissimilarity between distributions, or even ignoring the class labels and just focusing on unsupervised characteristics.

Some typical evaluation criteria are listed in Table 1. The most simple ones use the scatter matrices characterizing the scatter within the classes (showing how samples scatter around their class mean vector, called S_W, the within scatter matrix) and the scatter between the classes (showing how the means of the classes scatter, S_B, the between scatter matrix; see also the discussion of the Fisher ratio in section 2.3). These scatter matrices can be combined using several functions, listed in the first part of Table 1. Often S_1 = S_B is used, and S_2 = S_W or S_2 = S_W + S_B.

The measures between distributions involve the class distributions p(x|ω_i); in practice often single Gaussian distributions for each of the classes are chosen. The reconstruction errors still contain free parameters in the form of a matrix of basis vectors W or a set of prototypes μ_k. These are optimized in their respective procedures, like Principal Component Analysis or Self-Organizing Maps. The scatter criteria and the supervised measures between distributions are mainly used in feature selection, Section 3.2. The unsupervised reconstruction errors are used in feature extraction, Section 3.3.
Table 1. Feature selection criteria for measuring the difference between two distributions or for measuring a reconstruction error: measures using scatter matrices (e.g. J = tr(S_2^{-1} S_1)), measures between distributions, and reconstruction errors.
3.2 Feature selection

In feature selection a subset of the original features is chosen. A feature selection procedure consists of two ingredients: the first is the evaluation criterion to evaluate a given set of features, the second is a search strategy to search over all possible feature subsets16. Exhaustive search is in many applications not feasible. When we start with k = 250 features and we want to select k' = 10, we have to consider in principle \binom{250}{10} \approx 2 \cdot 10^{17} different subsets, which is clearly too much.

Instead of exhaustive search, a forward selection can be applied. It starts with the single best feature (according to the evaluation criterion) and adds the feature which gives the biggest improvement in performance. This is repeated till the requested number of features k' is reached. Instead of forward selection, the opposite approach can be used: backward selection. This starts with the complete set of features and removes the feature for which the performance increase is the largest. These approaches have the significant drawback that they might miss the optimal subset: the subsets for which the individual features have poor discriminability but which, combined, give a very good performance. In order to find these subsets, a more advanced search strategy is required. It can be a floating search, where adding and removing features is alternated. Another approach is the branch-and-bound algorithm12, where all the subsets of features are arranged in a search tree. This tree is traversed in such an order that large sub-branches can be disregarded as soon as possible, and the search process is shortened significantly. This strategy will yield the optimal subset when the evaluation criterion J is monotone, that means that when for a certain feature set a value of J_k is obtained, a subset of the features cannot have a higher value than J_k. Criteria like the Bayes error, the Chernoff distance or the functions of the scatter matrices fulfill this.
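A forward selection loop is straightforward to sketch; the version below (Python/NumPy, assumed) greedily adds the feature that most improves a user-supplied criterion J evaluated on the candidate subset, for example a scatter-based measure as in Table 1 or a cross-validated classifier performance. Names are hypothetical.

```python
import numpy as np

def forward_selection(X, y, criterion, n_select):
    """Greedy forward feature selection.
    criterion(X_sub, y) must return a value to be maximized for the subset X_sub."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        scores = [criterion(X[:, selected + [f]], y) for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Example criterion: trace(S_W^{-1} S_B), a scatter-based measure as in Table 1
def scatter_criterion(X_sub, y):
    mean_all = X_sub.mean(axis=0)
    SW = np.zeros((X_sub.shape[1],) * 2)
    SB = np.zeros_like(SW)
    for c in np.unique(y):
        Xc = X_sub[y == c]
        d = (Xc.mean(axis=0) - mean_all)[:, None]
        SW += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
        SB += len(Xc) * (d @ d.T)
    return np.trace(np.linalg.solve(SW + 1e-9 * np.eye(len(SW)), SB))
```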
Currently, other approaches appear which combine the traditional feature selection and the subsequent training of a classifier. One example is a linear classifier (with the functional form of (25)) called LASSO, the Least Absolute Shrinkage and Selection Operator31. The classification problem is approached as a regression problem with an additional regularization. A linear function is fitted to the data by minimizing the following error:

\sum_{i=1}^{N} (w^T x_i + b - y_i)^2 + C \sum_j |w_j|.   (31)

The first part measures the deviation of the linear function w^T x_i + b from the expected label y_i. The second part shrinks the weights w, such that many of them become zero. By choosing a suitable value for C, the number of retained features can be changed. This kind of regularization appears to be very effective when the number of features is huge (in the thousands) and the training size is small (in the tens). A similar solution can be obtained when the term w^T w in (26) is replaced by |w|3.
3.3 Feature extraction
Instead of using a subset of the given features, a smaller set of new features may be derived from the old ones. This can be done by linear or nonlinear feature extraction. For the computation of the new features usually all original features are used. Feature extraction will therefore almost never reduce the number of measurements. The optimization criteria are often based on reconstruction errors, as in Table 1.

The most well-known linear extraction method is Principal Component Analysis (PCA)17. Each new feature i is a linear combination of the original features: x'_i = w_i^T x. The new features are optimized to minimize the PCA mean squared reconstruction error, Table 1. It basically extracts the directions w_i in which the data set shows the highest variance. These directions appear to be equivalent to the eigenvectors of the (estimated) covariance matrix Σ with the largest eigenvalues. For the i-th principal component w_i therefore holds:

\Sigma w_i = \lambda_i w_i, \qquad \lambda_i \ge \lambda_j \text{ if } i < j.   (32)
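The eigendecomposition behind Eq. (32) takes only a few lines; the sketch below (Python/NumPy, assumed) centers the data, estimates the covariance matrix and keeps the eigenvectors with the largest eigenvalues as the extraction matrix. Names are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Principal Component Analysis via the eigenvectors of the covariance matrix, Eq. (32)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]                  # columns are the directions w_i

def pca_transform(mean, W, X):
    return (X - mean) @ W                           # new features x'_i = w_i^T (x - mean)
```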
An extension of the (linear) PCA is the kernelized version, kernel-PCA24. Here the standard covariance matrix Σ is replaced by a covariance matrix in a feature space. After rewriting, the eigenvalue problem in the feature space reduces to the following eigenvalue problem: K\alpha_i = \lambda_i \alpha_i. Here K is an N x N kernel matrix (like, for instance, (29)). An object x is mapped onto the i-th principal component by:

x'_i = \sum_j \alpha_{ij}\, K(x, x_j).   (33)

Although this feature extraction is linear in the kernel space, in the feature space it will obtain non-linear combinations of features.
There are many other methods for extracting nonlinear features, for instance the Self-Organizing Map (SOM)20. The SOM is an unsupervised clustering and feature extraction method in which the cluster centers are constrained in their placing. The construction of the SOM is such that all objects in the input space retain as much as possible their distance and neighborhood relations in the mapped space. In other words, the topology is preserved in the mapped space.

The mapping is performed by a specific type of neural network, equipped with a special learning rule. Assume that we want to map a k-dimensional measurement space to a k'-dimensional feature space, where k' < k. In fact, often k' = 1 or k' = 2. In the feature space, we define a finite orthogonal grid with grid points. At each grid point we place a neuron or prototype. Each neuron stores a k-dimensional vector μ_k that serves as a cluster center. By defining a grid for the neurons, each neuron does not only have a neighboring neuron in the measurement space, it also has a neighboring neuron in the grid. During the learning phase, neighboring neurons in the grid are enforced to also be neighbors in the measurement space. By doing so, the local topology will be preserved. Unfortunately, training a SOM involves the setting of many unintuitive parameters and heuristics (similar to many neural network approaches).
A more principled approach than the SOM is the Generative Topographic Mapping, GTM5. The idea is to find a representation of the original p-dimensional data x in terms of L-dimensional latent variables z. For this a mapping function y(z|W) has to be defined. In the GTM it is assumed that the distribution of z in the latent variable space is a grid of delta functions z_m:

p(z) = \frac{1}{M}\sum_{m=1}^{M} \delta(z - z_m),   (34)

and that the mapping is linear in a set of fixed basis functions:

y(z|W) = W\Phi(z),   (35)

where Φ(z) consists of M fixed basis functions (in many cases Gaussian functions) and W is a p x M weight matrix. Because in reality the data will never fit the low-dimensional manifold perfectly, a noise model is introduced: a Gaussian distribution with variance σ^2:

p(x|z, W, \sigma) = N(x;\, y(z|W),\, \sigma).   (36)

The distribution p(x) can then be obtained by integration over the z distribution:

p(x|W, \sigma) = \int p(x|z, W, \sigma)\, p(z)\, dz.   (37)

The advantage is that the model is a full probability model. This model can be fitted by optimizing the log likelihood of the training data, \ln \prod_i p(x_i|W, \sigma), using an Expectation-Maximization algorithm. When the user supplies the dimensionality of the latent variable space L, the number of grid points M in this space and the basis functions Φ(z), then the parameters W and σ can be optimized.
An even simpler model to optimize is the Locally Linear Embedding, LLE28. Here also the goal is to find a low dimensional representation of the training set X_tr. But unlike the GTM, where an explicit manifold is fitted, here the low dimensional representation is optimized such that the objects can be reconstructed from their neighbors in the training set in the same manner in the low dimensional representation as in the high dimensional one. First, the weights w_{ij} for reconstructing each object x_i from its neighbors x_j are optimized (minimizing the LLE reconstruction error, Table 1, under the constraint that \sum_j w_{ij} = 1). Given the weights, the locations of the low-dimensional feature vectors z_i, i = 1, ..., N are optimized, using the same LLE reconstruction error, but where x_i is replaced by z_i. This can be minimized by solving an eigenvalue problem (similar to finding the principal components).
The feature extraction methods presented above are all unsupervised, i.e. other information like class labels is not used. This can be a significant drawback when the feature reduction is applied as a preprocessing step for solving a classification problem: it might actually happen that all informative features are removed. To avoid this, supervised feature extraction has to be used. Very well known is Linear Discriminant Analysis (LDA)27, which uses the weight vector w from the Fisher classifier (see section 2.3) as feature direction. A multi-class extension is presented in Ref. 27, but it assumes equal covariance matrices for all classes and the number of extracted features is restricted to K − 1. The LDA can be extended to include the differences in covariance matrices by using the Chernoff criterion instead of the Fisher criterion23.
4 Error estimation
At various stages in the design of a pattern classification system an estimate of the performance of a procedure, or of the separability of a set of classes, is needed. Examples are the selection of the 'best' feature during feature selection, the feature subspace to be used when several feature extraction schemes are investigated, the performance of the base classifiers in order to find a good set of classifiers to be combined, the optimization of various parameters in classification schemes, like the smoothing parameter in the Parzen classifier and the number of hidden units used in a neural network classifier, and the final selection of the overall classification procedure if various competing schemes are followed consecutively. Moreover, at the end an estimate of the performance of the selected classifier is desired.

In order to find an unbiased error estimate, a set of test objects with known labels is desirable. This set should be representative for the circumstances expected during the practical use of the procedures under study. Usually this implies that the test set has to be randomly drawn from the future objects to be classified. As their labels should be known for proper testing, these objects are suitable for training as well. Once an object is used for training, however, the resulting classifier is expected to perform well on this object. Consequently, if it is also used for testing, it generates an optimistic bias in the error estimate. Below, two techniques are discussed to solve this problem. The first is cross-validation, which aims at circumventing the bias. The second is a bootstrap technique by which the bias is estimated and corrected.
4.1 Cross-validation
Assume that a design set X_d is available for the development of a pattern recognition system, or one of its subsystems, and that in addition to the classifier itself an unbiased estimate of its performance is needed. If X_d is split (e.g. at random) into a training set X_tr and a test set X_te, then we want X_tr to be as large as possible to train a good classifier, but simultaneously X_te has to be sufficiently large for an accurate error estimate. The standard deviation of this estimate is sqrt(ε(1 − ε)/N_te) (e.g. 0.003 for ε = 0.01 and N_te = 1000, and 0.03 for ε = 0.1 and N_te = 100). When the design set is not sufficiently large to split it into a test set and a training set of appropriate sizes, a cross-validation procedure might be used, in which the design set is split into B (B ≥ 2) subsets of about the same size. In total B different classifiers are trained, each by a different group of B − 1 of these subsets. Each classifier is tested by the single subset not used for its training. Finally the B test results are averaged. Consequently, all objects are used for testing once. The classifiers they are testing are all based on a (B − 1)/B part of the design set. For larger B these classifiers are expected to be similar, and they will be just slightly worse than the classifier based on all objects. A good choice seems to be a 10-fold stratified cross-validation19, i.e. B = 10, and objects are selected evenly from the classes, i.e. in agreement with their prior probabilities.
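The B-fold procedure can be sketched in a few lines of Python/NumPy (an assumed setting); `train` and `error` stand for any classifier training routine and error measure, and stratification is omitted for brevity. Names are hypothetical.

```python
import numpy as np

def cross_validation_error(X, y, train, error, B=10, seed=0):
    """B-fold cross-validation: train on B-1 folds, test on the remaining fold,
    and average the B test errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, B)
    errors = []
    for b in range(B):
        test = folds[b]
        tr = np.hstack([folds[j] for j in range(B) if j != b])
        clf = train(X[tr], y[tr])
        errors.append(error(clf, X[test], y[test]))
    return np.mean(errors)
```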
4.2 Bootstrap procedures
Instead of following a procedure that tries to minimize the bias in the error estimate, one may try to estimate the bias13,15. A general procedure (independent of the classifier used) can be based on a comparison of the expected apparent error E^app_b of a classifier trained on a bootstrap sample of the design set with its error E^d_b estimated on the entire design set. The difference can be used as an estimate for the bias in the apparent error, E_bias = E^d_b − E^app_b, which can be used as a correction for the apparent error E^app_d of the classifier based on the design set: E_boot = E^app_d + E_bias = E^app_d + E^d_b − E^app_b.

A second estimator based on bootstrapping is the so-called E_632 error11,13,15. It is based on a weighted average of the apparent error E^app_d of the classifier based on the design set and an error estimate E^0_b for the bootstrap classifier based on the out-of-bootstrap part of the design set. The first is optimistically biased (an apparent error) and the second is an unbiased error estimate (tested by independent samples) of a classifier that is somewhat worse (based on just a bootstrap sample) than the target classifier based on the design set. The weights are given by the asymptotic probability that a sample will be included in a bootstrap sample: 0.632. The E_632 error estimate thereby is given by: E_632 = 0.368 E^app_d + 0.632 E^0_b.
4.3 Error curves
The graphical representation of the classification error is an important tool to study, compare and understand the behavior of classification systems. Some examples of such error curves are:

Learning curve: the error as a function of the number of training samples. Simple classifiers decrease faster, but often have a higher asymptotic value than more complex ones.

Complexity curve: the error as a function of the complexity of the classifier, e.g. the feature size or the number of hidden units. Such a curve often shows an increasing error after an optimal feature size or complexity.

Parameter curve: the error as a function of a parameter in the training procedure, e.g. the smoothing parameter in the Parzen classifier. The optimum that may be observed in such curves is related to the best fit of the underlying model in the classification system w.r.t. the data.

Error-reject trade-off: the error as a function of the reject probability. If a classifier output (e.g. a confidence estimate) is thresholded to reject unreliably classified objects, then this curve shows the gain in error reduction.

ROC curves: the trade-off between two types of errors, e.g. the two types of error in a two-class problem. These Receiver Operating Characteristic curves were first studied in communication theory and are useful to select a classifier if the point of operation may vary, e.g. due to unknown classification costs or prior probabilities.
5 Discussion
In the previous sections an overview was given of well established techniques for statistical pattern recognition, with a few excursions to more recent developments. Modern scientific and industrial developments, the use of computers and the internet in daily life and the fast growing sensor technology raise new problems as well as enable new solutions. We will summarize some new developments in statistical pattern recognition, partially introduced above, partially not yet discussed.
Other types of representation than the traditional features enable other ways to incorporate expert knowledge. The dissimilarity representation is an example of this, as it offers the possibility to express knowledge in the definition of the dissimilarity measure, but it also opens other options. Instead of being based on the raw data like spectra, images, or time signals, it may be defined on models of objects, like graphs. In such cases structural knowledge is used for the object descriptions. In addition to the nearest neighbor rule, dissimilarity based classifiers offer a richer set of tools with more possibilities to learn from examples, thereby bridging the gap between structural and statistical pattern recognition. Several problems, however, still have to be solved, like the selection of a representation set, optimal modifications of a given dissimilarity measure and the construction of dedicated classifiers.

More complicated pattern recognition problems may not be solved by a single off-the-shelf classifier. By the combining classifier technique a number of partial solutions can be combined. Several questions are still open here, like the selection or generation of the base classifiers, the choice of the combiner, and the use of a finite training set. Moreover, an overall mathematical foundation is still not available.

One-class classifiers are a good way to handle ill sampled problems, or to build classifiers when some of the classes are undersampled. This is important for applications like man or machine monitoring when one of the classes, e.g. normal behavior, is very well defined. Such classifiers may also be used when it is not possible to select a representative training set by an appropriate sampling of the domain of objects. In such cases a domain based class description may be found, locating the class boundary in the representation, without building a probability density function.

The widespread availability of computers and sensors, and the costs of labeling objects by human experts, may sometimes result in large databases in which just a small fraction of the objects is labeled. Techniques for training classifiers by partially labeled datasets are still in their early years. This may also be considered as combining clustering and classification.
For such problems, in which the costs of expert labeling are high, one may also try to optimize the set of objects to be labeled. This technique is called active learning. Several competing strategies exist, e.g. sampling close to an initial decision boundary, or retrieving objects in the modes of the class density distributions. Another variant is online learning, in which the order of the objects to be presented to a decision function is determined by the application, e.g. by a production line in a factory. It then has to be decided whether objects can be safely classified or whether a human expert has to be consulted, not only to reduce the risk of misclassification, but also to optimally improve the available classification function.
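For illustration only, the following minimal sketch implements the boundary-sampling strategy mentioned above: given the (hypothetical) decision values of an initial classifier on a pool of unlabeled objects, it selects the objects closest to the decision boundary for expert labeling.

```python
import numpy as np

def select_queries(decision_values, n_queries=5):
    """Pick the unlabeled objects closest to the current decision boundary
    (smallest |f(x)|) as the next objects to be labeled by the expert."""
    decision_values = np.asarray(decision_values, dtype=float)
    return np.argsort(np.abs(decision_values))[:n_queries]

# Toy usage: a linear decision function f(x) = w.x + b on random unlabeled objects.
rng = np.random.default_rng(2)
X_unlabeled = rng.normal(size=(100, 2))
w, b = np.array([1.0, -0.5]), 0.1
f = X_unlabeled @ w + b
print("objects to label next:", select_queries(f, n_queries=3))
```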
An often recurring question in dynamic environments is whether a trained classification function is still valid, or whether it should be retrained due to new circumstances. In such problems 'learning' and 'forgetting' are directly related. If a new situation demands retraining, old objects may not be representative anymore and should be forgotten. (They may still be stored in case the old situation appears to return after some time.)
Many techniques have been proposed, and many more are to come, for solving problems such as the above. A difficulty that cannot be easily handled is that they are often ill defined. Consequently, generally valid benchmarks are not available, so it is not straightforward to detect the good procedures that may work well over a series of applications. As good and bad procedures cannot easily be distinguished, it is to be expected that the set of tools used in statistical pattern recognition will grow significantly in the near future.
HIDDEN MARKOV MODELS FOR SPATIO-TEMPORAL PATTERN RECOGNITION
Brian C. Lovell (a) and Terry Caelli (b)

(a) The School of Information Technology and Electrical Engineering
The University of Queensland, Australia QLD 4072
E-mail: lovell@itee.uq.edu.au

(b) National Information and Communications Technology Australia (NICTA)
Research School of Information Sciences and Engineering
Australian National University, Australia
E-mail: tcaelli@ualberta.ca
The success of many real-world applications demonstrates that hidden Markov models (HMMs) are highly effective in one-dimensional pattern recognition problems such as speech recognition. Research is now focussed on extending HMMs to 2-D and possibly 3-D applications which arise in gesture, face, and handwriting recognition. Although the HMM has become a major workhorse of the pattern recognition community, there are few analytical results which can explain its remarkably good pattern recognition performance. There are also only a few theoretical principles for guiding researchers in selecting topologies or understanding how the model parameters contribute to performance. In this chapter, we deal with these issues and use simulated data to evaluate the performance of a number of alternatives to the traditional Baum-Welch algorithm for learning HMM parameters. We then compare the best of these strategies to Baum-Welch on a real hand gesture recognition system in an attempt to develop insights into these fundamental aspects of learning.
1 Introduction
There is an enormous volume of literature on the application of hidden Markov models (HMMs) to a broad range of pattern recognition tasks. In the case of speech recognition, the patterns we wish to recognise are spoken words, which are audio signals against time. Indeed, the value of Markov models for modelling speech was recognised by Shannon26 as early as 1948. In the case of hand gesture recognition, the patterns are hand movements in both space and time; we call this a spatio-temporal pattern recognition problem. The suitability and efficacy of HMMs for such problems is undeniable and they are now established as one of the major tools of the pattern recognition community. Yet, when one looks for research which addresses fundamental problems such as efficient learning strategies for HMMs, or perhaps analytically determining the most suitable architectures for a given problem, the number of papers is greatly diminished. So despite the enormous uptake of HMMs since their introduction in the 1960's, we believe that there is still a great deal of unexplored territory.
Much of the application of HMMs in the literature is based firmly on the methodology popularised by Rabiner et al. (1983)25,16,24 for speech recognition, and these studies are the primary reference for many HMM researchers, resulting in two common practices: one, to use the forward algorithm to determine the MAP (maximum posterior probability) of the model, given an observation sequence, as a classification metric; two, to use Baum-Welch as a model estimation/update procedure. We will see that these are not ideal strategies: in the former case, classification is reduced to a single number without directly using the model (data summary) parameters, or attributes, per se. As for the latter, the Baum-Welch4 algorithm (a version of the famous Expectation-Maximisation algorithm14,1,21) is, in the words of Stolcke and Omohundro28, "far from foolproof since it uses what amounts to a hill-climbing procedure that is only guaranteed to find a local likelihood maximum." Moreover, as observed by Rabiner24, results can be very dependent on the initial values chosen for the HMM parameters.
The problem of finding local rather than global maxima is encountered in many other areas of learning theory and optimisation. These problems are familiar territory to researchers in the artificial neural network community, and many techniques have been proposed to counter them. Moreover, genetic and evolutionary algorithmic techniques specialise in solving such problems, albeit often very slowly, especially in the case of biological evolution11. With this in mind, we use simulated data to investigate other approaches to learning HMMs from observation sequences in an attempt to find superior alternatives to the traditional Baum-Welch algorithm. Then we compare and test the best of the alternate strategies on real data from a hand gesture recognition system to see if the real data trials corroborate the conclusions drawn from the simulated trials.
1.1 Background and Notation
In this study, we focus on the discrete HMM as popularised by Rabiner24. Using the familiar notation from his tutorial paper, a hidden Markov model consists of a set of $N$ nodes, each of which is associated with a set of $M$ possible observations. The parameters of the model include an initial state vector

$$\pi = [p_1, p_2, p_3, \ldots, p_N]^T$$

with elements $p_n$, $n \in [1, N]$, which describes the distribution over the initial node set, a transition matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}$$

with elements $a_{ij}$, $i, j \in [1, N]$, giving the transition probability from node $i$ to node $j$, and an $N \times M$ observation matrix $B$ whose rows are the probability distributions over the $M$ symbols emitted in each state.
The pattern of allowed transitions and emissions defines the topology or structure of the model (see Figure 1 for an illustration of two different transition structures). One commonly used topology is called Fully-Connected (FC) or Ergodic. In an FC HMM there is not necessarily a defined starting state and all state transitions are possible, such that $a_{ij} \neq 0$ for all $i, j \in [1, N]$. Another topology, especially popular in speech recognition applications, is called Left-Right (LR). In an LR HMM there is a defined starting state (usually state 1) and only transitions to same- or higher-index states are allowed, such that $a_{ij} = 0$ for all $i > j$, where $i, j \in [1, N]$.
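To make the two topologies concrete, the following NumPy sketch (ours, not from the chapter) builds randomly initialized, row-stochastic transition matrices obeying the FC and LR constraints just stated.

```python
import numpy as np

def random_fc_transitions(n_states, rng=None):
    """Fully-Connected (ergodic) topology: every a_ij is non-zero."""
    rng = np.random.default_rng(rng)
    A = rng.uniform(0.1, 1.0, size=(n_states, n_states))
    return A / A.sum(axis=1, keepdims=True)   # each row is a probability distribution

def random_lr_transitions(n_states, rng=None):
    """Left-Right topology: a_ij = 0 for i > j, so only same- or higher-index
    states are reachable from state i."""
    rng = np.random.default_rng(rng)
    A = np.triu(rng.uniform(0.1, 1.0, size=(n_states, n_states)))
    return A / A.sum(axis=1, keepdims=True)

print(np.round(random_lr_transitions(4, rng=0), 2))   # upper-triangular, rows sum to 1
```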
Rabiner24 defines the three basic problems of HMMs as follows:
Problem 1: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model?

Problem 2: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and the model $\lambda$, how do we choose a corresponding state sequence $Q = q_1 q_2 \cdots q_T$ which is optimal in some meaningful sense (i.e., best "explains" the observations)?

Problem 3: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?
Problems 1 and 2 are elegantly and efficiently solved by the forward and Viterbi29,12 algorithms respectively, as described by Rabiner in his tutorial. The forward algorithm is used to recognise matching HMMs (i.e., the highest probability models, MAP) from the observation sequences. Note, again, that this is not a typical approach to pattern classification as it does not involve matching model with observation attributes. That would involve comparing the model parameters and estimated observation model parameters; MAP does not perform this and so it cannot be as sensitive a measure as exact parameter comparisons. Indeed, a number of reports have already shown that quite different HMMs can have identical emissions (observation sequences)18,3. The Viterbi algorithm is used less frequently as we are normally more interested in finding the matching model than in finding the state sequence. However, this algorithm is critical in evaluating the precision of the HMM; in other words, how well the model can reconstruct (predict) the observations.
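For reference, a minimal scaled implementation of the forward algorithm is sketched below in NumPy. The variable names follow the notation above, the model values in the usage example are arbitrary, and this is an illustration rather than the implementation used in this chapter.

```python
import numpy as np

def forward_log_likelihood(obs, A, B, pi):
    """Scaled forward algorithm: returns log P(O | lambda) for a discrete HMM with
    transition matrix A (N x N), emission matrix B (N x M) and initial vector pi."""
    alpha = pi * B[:, obs[0]]             # alpha_1(i) = pi_i * b_i(O_1)
    scale = alpha.sum()
    alpha = alpha / scale
    log_likelihood = np.log(scale)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction step
        scale = alpha.sum()
        alpha = alpha / scale             # rescale to avoid numerical underflow
        log_likelihood += np.log(scale)
    return log_likelihood

# Usage on a small 2-state, 3-symbol model (numbers are arbitrary).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_log_likelihood([0, 1, 2, 2, 1], A, B, pi))
```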
Rabiner proposes solving Problem 3 via the Baum-Welch algorithm, which is, in essence, a gradient ascent algorithm, a method which is guaranteed to find local maxima only. Solving Problem 3 is effectively the problem of learning to recognise new patterns, so it is really the fundamental problem of HMM learning theory; a significant improvement here could boost the performance of all HMM-based pattern recognition systems. Therefore it is somewhat surprising that there appear to be relatively few papers devoted to this topic; the vast majority are devoted to applications of the HMM. In the next section we compare a number of alternatives to, and variations of, Baum-Welch in an attempt to find superior learning strategies.
2 Comparison of Methods for Robust HMM Parameter Estimation
We focus on the problem of reliably learning HMMs from a small set of short observation sequences. The need to learn rapidly from small sets arises quite often in practice. In our case, we are interested in learning hand gestures which are limited to just 25 observations. The limitation arises because we record each video at 25 frames per second and each of our gestures takes less than one second to complete. Moreover, we wish to obtain good recognition performance from small training sets to ensure that new gestures can be rapidly recognised by the system.
Four HMM parameter estimation methods are evaluated and compared using a train-and-test classification methodology. For these binary classification tests we create two random HMMs and then use each of these to generate test and training data sequences. For normalization, we ensure that each test sequence can be correctly recognized by its true model; thus the true models obtain 100% classification accuracy on the test data by construction. The various learning methods are then used to estimate the two HMMs from their respective training sets, and the recognition performance of the pair of estimated HMMs is evaluated on the unseen test data sets. This random model generation and evaluation process is repeated 16 times for each data sample to provide meaningful statistical results.
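A minimal sketch of the data-generation side of this protocol is given below (our own illustration, not the authors' code): it draws a random discrete HMM and samples observation sequences of length 25 from it.

```python
import numpy as np

def random_hmm(n_states, n_symbols, rng):
    """Draw a random discrete HMM (A, B, pi) with row-stochastic matrices."""
    A  = rng.dirichlet(np.ones(n_states),  size=n_states)
    B  = rng.dirichlet(np.ones(n_symbols), size=n_states)
    pi = rng.dirichlet(np.ones(n_states))
    return A, B, pi

def sample_sequence(A, B, pi, length, rng):
    """Sample one observation sequence of the given length from the model."""
    n_states, n_symbols = B.shape
    obs, state = [], rng.choice(n_states, p=pi)
    for _ in range(length):
        obs.append(rng.choice(n_symbols, p=B[state]))   # emit a symbol in the current state
        state = rng.choice(n_states, p=A[state])        # move to the next state
    return obs

rng = np.random.default_rng(3)
A, B, pi = random_hmm(n_states=4, n_symbols=6, rng=rng)
training_set = [sample_sequence(A, B, pi, length=25, rng=rng) for _ in range(10)]
print(training_set[0])
```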
Before parameter re-estimation, we initialize with two random HMMs, which should yield 50% recognition performance on average. So an average recognition performance above 50% after re-estimation shows that some degree of learning must have taken place. Clearly, if the learning strategy could perfectly determine both of the HMMs which generated the training data sets, we would obtain 100% recognition performance on the test sets.
We compare four learning methods: 1) traditional Baum-Welch; 2) ensemble averaging introduced by Davis and Lovell9, based on ideas presented by MacKay19; 3) Entropic MAP introduced by Brand6; and 4) Viterbi Path Counting10, which is a special case of Stolcke and Omohundro's Best-First algorithm28. The results in Figure 2 indicate that these alternate HMM learning methods all classify significantly better than the well-known Baum-Welch algorithm and also require less training data. The Entropic MAP estimator performs well, but surprisingly its performance is much the same as simple ensemble averaging. Ensemble averaging involves training multiple models using the Baum-Welch algorithm and then simply averaging the model parameters without regard to structure. Note that for a single sequence, ensemble averaging is identical to the traditional usage of the Baum-Welch algorithm. Overall, the stand-out performer was the VPC algorithm. In these and other trials, this method converges to good models very rapidly and has performed better than the other methods in virtually all of our simulated HMM studies.

Fig. 2. Relative performance of the HMM parameter estimation methods as a function of the number of training sequences. Viterbi Path Counting produces the best quality models with a much smaller number of training iterations. (Classification performance averaged over 16 experiments; the true models score 100% by construction.)
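The ensemble-averaging step itself is straightforward. The sketch below is an illustration under the assumption that each trained model is available as an (A, B, pi) triple; it averages the parameters element-wise and renormalizes, with the Baum-Welch training of the individual models omitted.

```python
import numpy as np

def average_models(models):
    """Element-wise average of a list of (A, B, pi) parameter sets, renormalized
    (a safeguard against numerical drift) so the result is again a valid HMM."""
    A  = np.mean([m[0] for m in models], axis=0)
    B  = np.mean([m[1] for m in models], axis=0)
    pi = np.mean([m[2] for m in models], axis=0)
    A  /= A.sum(axis=1, keepdims=True)
    B  /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi

# Usage with two arbitrary 2-state, 2-symbol models.
m1 = (np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.7, 0.3], [0.4, 0.6]]), np.array([0.5, 0.5]))
m2 = (np.array([[0.6, 0.4], [0.3, 0.7]]), np.array([[0.8, 0.2], [0.1, 0.9]]), np.array([0.7, 0.3]))
print(average_models([m1, m2]))
```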
3 Video Gesture Recognition
In an attempt to corroborate the strong performance of VPC compared to Baum-Welch on a real-world application, we test various learning techniques on a system for real-time video gesture recognition, as shown in Figure 3.
Trang 38In earlier related work, Starner and Pentland 27 developed a HMM-based system to recognise gesture phrases in American Sign Language Later, Lee and Kim 1 5 used HMM- based hand gesture recognition to control viewgraph presentation in data projected semi- nars Our system recognizes gestures based on the letters of the alphabet traced in space in front of a video camera The motivation for this application is to produce a way of typing messages into a camera-equipped mobile phone or PDA using video gestures instead of the keypad or pen interface We use single stroke letter gestures similar to those already widely used for pen data entry in PDAs For example, figure 3 shows the hand gestures for the letters "Z" and "W." The complete gesture set is shown in figure 6
Fig 3 "Fingerwriting:" Single stroke video gesture for letters "W" and "Z."
Each video sequence comprises 25 frames, corresponding to one second of video. Skin colour segmentation in YUV colour space is applied to locate the hand. Pre-processing (morphological) operations smooth the image and remove noise before tracking the hand with a modified Camshift algorithm5. After segmenting the hand, we calculate image moments to find the centroid in each frame. Along the trajectory, the direction (angle) of motion of each of the 25 hand movements is calculated and quantized to one of 18 discrete symbols. The resultant discrete angular observation sequence is input to the HMM classification module for training and recognition.
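The final quantization step can be summarized by the following sketch (ours, with illustrative function names; segmentation and tracking are omitted): frame-to-frame centroid displacements are converted to angles and binned into 18 direction symbols.

```python
import numpy as np

def directions_to_symbols(centroids, n_symbols=18):
    """Convert a (T x 2) array of per-frame hand centroids into T-1 discrete
    direction symbols by quantizing the angle of each frame-to-frame movement."""
    centroids = np.asarray(centroids, dtype=float)
    d = np.diff(centroids, axis=0)            # movement vectors between frames
    angles = np.arctan2(d[:, 1], d[:, 0])     # angles in (-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_symbols).astype(int) % n_symbols
    return bins

# Usage on a synthetic circular trajectory sampled over 25 frames.
t = np.linspace(0, 2 * np.pi, 25)
trajectory = np.stack([np.cos(t), np.sin(t)], axis=1)
print(directions_to_symbols(trajectory))
```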
We compare traditional Baum-Welch with the most promising alternative from the simulated study, VPC. We evaluate recognition performance over all 26 character gestures using fully connected (FC), left-right (LR), and left-right banded (LRB) model topologies, with the number of states ranging from 1 to 14. An LRB model is an LR model which has a transition structure containing self-transitions and next-state transitions only (i.e., states cannot be skipped), as shown in Figure 5. More formally, $a_{ij} \neq 0$ for $j = i$ or $j = i + 1$, and $a_{ij} = 0$ otherwise, with $i, j \in [1, N]$.
Our video gesture database contains 780 video gestures, with 30 examples of each gesture. Recognition accuracy is evaluated using threefold cross-validation, where 20 gestures are used for training and 10 for testing in each partition. These HMMs are initialized with random HMM parameters before using either Baum-Welch or VPC for learning.

From Figure 4, the best average recognition accuracy achieved is 97.31%, obtained when VPC is used for training, the topology is LRB, and the number of states is 13.
Trang 39Max
Baum-Welch
FC 80.00 72.69 66.54 80.00 75.20 75.60 77.60 76.80 77.60 76.00 65.20 74.80 84.80 72.80 75.40 84.80
LR 80.00 94.23 92.31 84.80 81.20 84.80 86.40 86.00 85.60 81.60 86.80 86.80 84.00 81.60 85.44 92.31
LRB 80.00 93.85 96.15 85.38 90.77 85.77 89.62 89.62 90.00 88.46 89.23 88.08 90.00 88.46 88.96 96.15
VPC
FC 80.38 71.15 63.85 53.20 59.60 55.20 45.60 44.40 49.20 43.20 42.80 40.80 39.60 38.80 51.98 63.85
LR 80.38 91.92 91.15 91.20 91.20 90.40 91.20 90.40 90.40 90.00 90.00 90.00 90.00 90.40 89.90 91.20
LRB 80.38 90.77 93.08 90.38 95.00 93.85 94.23 94.23 94.62 95.00 95.00 95.77 97.31 93.46 93.08 97.31
Fig 4 Average percent correct recognition for all 26 video letter gestures against topology and training method
0808
Fig 5 Left-Right banded topology
Although this corroborates the stronger VPC performance exhibited in our simulated data trials, a closer investigation of Figure 4 raises some doubts about this conjecture through the following observations:
• The Baum-Welch algorithm did almost as well as VPC, with a best performance of 96.15% correct recognition with only 3 states. Moreover, we achieve a very surprising 80% correct recognition with just a single state.
• Topology (i.e., constraints on the initial value of the A matrix) has more impact on performance than the choice of learning algorithm.
• Good recognition performance can be obtained over a very broad range of N, the number of states.
3.1 Comments on Learning Algorithm Performance
We do not suggest that the above observations can be generalized to other real-world application domains, but anecdotal evidence from other researchers suggests that similar behaviour is often encountered. When we designed this gesture system, we thought that this pattern recognition problem was quite challenging and would significantly differentiate the learning strategies. Yet the surprisingly good performance over a number of learning algorithms, topologies, and a broad range of N suggests that the problem is significantly easier than we suspected.

Fig. 6. The alphabet of single-stroke letter hand gestures.
Our intuition suggests that 3 states is far too small a number to adequately model all of these complex letter gestures, but the results show that it is indeed possible to find a three-state HMM which yields very good recognition performance. We conjecture that the observation matrix B provides most of the recognition performance, and that recognition may be only weakly affected by good estimation of the transition matrix A.
In support of this idea, we may consider the following interpretation of the HMM. Consider each row of the B matrix as the probability mass function of the observation symbols emitted in a given state. In the limiting case of a single-state HMM, the B matrix becomes a vector of source symbol probabilities, and application of the forward algorithm for recognition is thus equivalent to the well-known and powerful MAP classifier. Indeed, from Figure 4 we see that this single-state degenerate HMM can achieve 80% recognition performance. So sometimes, even if the state transitions are poorly modelled, it is quite possible to find good classifiers based on source statistics.
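This interpretation is easy to verify numerically. In the sketch below (an illustration with made-up symbol statistics), the forward computation for a single-state HMM reduces to summing the log-probabilities of the observed symbols, i.e. a classifier based purely on source statistics.

```python
import numpy as np

def log_likelihood_single_state(obs, b):
    """For a 1-state HMM the forward algorithm collapses to
    log P(O | lambda) = sum_t log b(O_t): a pure source-statistics score."""
    return np.sum(np.log(b[np.asarray(obs)]))

# Two "classes" with different symbol statistics over 18 direction symbols.
rng = np.random.default_rng(4)
b_class1 = rng.dirichlet(np.ones(18))
b_class2 = rng.dirichlet(np.ones(18))
obs = rng.choice(18, size=25, p=b_class1)     # a sequence actually drawn from class 1
scores = [log_likelihood_single_state(obs, b) for b in (b_class1, b_class2)]
print("classified as class", int(np.argmax(scores)) + 1)
```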
Now clearly, if three states can yield strong performance, good HMMs with more than three states must also exist: a simple way to prove this is to note that we can always add additional states which are unreachable (i.e., have a transition probability of zero) without affecting recognition performance. This may help explain why performance stays much the same over a broad range of N as we increase N beyond three.
The question that arises is: "Why does the Baum-Welch algorithm perform so well on real-world data despite its theoretical flaws and rather poor performance on the simulated HMM data?" Once again, a possible explanation is that this particular spatio-temporal recognition task is relatively easy, so all methods can do quite well. This conjecture may be