Academic Press is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
525 B Street, Suite 1900, San Diego, California 92101-4495, USA
84 Theobald’s Road, London WC1X 8RR, UK
This book is printed on acid-free paper.
Copyright © 2009, Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: permissions@elsevier.com. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact,” then “Copyright and Permission,” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Application submitted
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-1-59749-272-0
For information on all Academic Press publications
visit our Web site at www.books.elsevier.com
Printed in the United States of America
09 10 11 12 13 14 15 16 5 4 3 2 1
This book is the outgrowth of our teaching advanced undergraduate and graduate courses over the past 20 years. These courses have been taught to different audiences, including students in electrical and electronics engineering, computer engineering, computer science, and informatics, as well as to an interdisciplinary audience of a graduate course on automation. This experience led us to make the book as self-contained as possible and to address students with different backgrounds. As prerequisite knowledge, the reader requires only basic calculus, elementary linear algebra, and some probability theory basics. A number of mathematical tools, such as probability and statistics as well as constrained optimization, needed by various chapters, are treated in four Appendices. The book is designed to serve as a text for advanced undergraduate and graduate students, and it can be used for either a one- or a two-semester course. Furthermore, it is intended to be used as a self-study and reference book for research and for the practicing scientist/engineer. This latter audience was also our second incentive for writing this book, due to the involvement of our group in a number of projects related to pattern recognition.
SCOPE AND APPROACH
The goal of the book is to present in a unified way the most widely used techniques and methodologies for pattern recognition tasks. Pattern recognition is in the center of a number of application areas, including image analysis, speech and audio recognition, biometrics, bioinformatics, data mining, and information retrieval. Despite their differences, these areas share, to a large extent, a corpus of techniques that can be used in extracting, from the available data, information related to data categories, important “hidden” patterns, and trends. The emphasis in this book is on the most generic of the methods that are currently available. Having acquired the basic knowledge and understanding, the reader can subsequently move on to more specialized application-dependent techniques, which have been developed and reported in a vast number of research papers.
Each chapter of the book starts with the basics and moves, progressively, to more advanced topics and reviews up-to-date techniques. We have made an effort to keep a balance between mathematical and descriptive presentation. This is not always an easy task. However, we strongly believe that in a topic such as pattern recognition, trying to bypass mathematics deprives the reader of understanding the essentials behind the methods and also the potential of developing new techniques that fit the needs of the problem at hand that he or she has to tackle. In pattern recognition, the final adoption of an appropriate technique and algorithm is very much a problem-dependent task. Moreover, according to our experience, teaching pattern recognition is also a good “excuse” for the students to refresh and solidify some of the mathematical basics they have been taught in earlier years. “Repetitio est mater studiorum.”
NEW TO THIS EDITION
The new features of the fourth edition include the following:
■ MATLAB codes and computer experiments are given at the end of most chapters.
■ More examples and a number of new figures have been included to enhance the readability and pedagogic aspects of the book.
■ New sections on some important topics of high current interest have been added, including:
• Nonlinear dimensionality reduction
• Nonnegative matrix factorization
SUPPLEMENTS TO THE TEXT
Demonstrations based on MATLAB are available for download from the book Web site, www.elsevierdirect.com/9781597492720. Also available are electronic figures from the text and (for instructors only) a solutions manual for the end-of-chapter problems and exercises. The interested reader can also download detailed proofs, which in the book are, of necessity, sometimes slightly condensed. PowerPoint presentations are also available covering all chapters of the book.
Our intention is to update the site regularly with more and/or improved versions of the MATLAB demonstrations. Suggestions are always welcome. Also at this Web site a page will be available for typos, which are unavoidable despite frequent careful reading. The authors would appreciate readers notifying them about any typos found.
ACKNOWLEDGMENTS
This book would not have been written without the constant support and help from a number of colleagues and students throughout the years. We are especially indebted to Kostas Berberidis, Velissaris Gezerlis, Xaris Georgiou, Kristina Georgoulakis, Lefteris Kofidis, Thanassis Liavas, Michalis Mavroforakis, Aggelos Pikrakis, Thanassis Rontogiannis, Margaritis Sdralis, Kostas Slavakis, and Theodoros Yiannakopoulos. The constant support provided by Yannis Kopsinis and Kostas Themelis from the early stages up to the final stage, with those long nights, has been invaluable. The book improved a great deal after the careful reading and the serious comments and suggestions of Alexandros Bölnn, Dionissis Cavouras, Vassilis Digalakis, Vassilis Drakopoulos, Nikos Galatsanos, George Glentis, Spiros Hatzispyros, Evagelos Karkaletsis, Elias Koutsoupias, Aristides Likas, Gerassimos Mileounis, George Moustakides, George Paliouras, Stavros Perantonis, Takis Stamatopoulos, Nikos Vassilas, Manolis Zervakis, and Vassilis Zissimopoulos.
The book has greatly gained and improved thanks to the comments of a number
of people who provided feedback on the revision plan and/or comments on revised
chapters:
Tulay Adali, University of Maryland; Mehmet Celenk, Ohio University; Rama Chellappa, University of Maryland; Mark Clements, Georgia Institute of Technology; Robert Duin, Delft University of Technology; Miguel Figueroa-Villanueva, University of Puerto Rico; Dimitris Gunopoulos, University of Athens; Mathias Kolsch, Naval Postgraduate School; Adam Krzyzak, Concordia University; Baoxin Li, Arizona State University; David Miller, Pennsylvania State University; Bernhard Schölkopf, Max Planck Institute; Hari Sundaram, Arizona State University; Harry Wechsler, George Mason University; and Alexander Zien, Max Planck Institute.
We are greatly indebted to these colleagues for their time and their constructive criticisms. Our collaboration and friendship with Nikos Kalouptsidis have been a source of constant inspiration for all these years. We are both deeply indebted to him.
Last but not least, K. Koutroumbas would like to thank Sophia, Dimitris-Marios, and Valentini-Theodora for their tolerance and support, and S. Theodoridis would like to thank Despina, Eva, and Eleni, his joyful and supportive “harem.”
1 Introduction
1.1 IS PATTERN RECOGNITION IMPORTANT?
Pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes. Depending on the application, these objects can be images or signal waveforms or any type of measurements that need to be classified. We will refer to these objects using the generic term patterns.
Pattern recognition has a long history, but before the 1960s it was mostly the output of theoretical research in the area of statistics. As with everything else, the advent of computers increased the demand for practical applications of pattern recognition, which in turn set new demands for further theoretical developments. As our society evolves from the industrial to its postindustrial phase, automation in industrial production and the need for information handling and retrieval are becoming increasingly important. This trend has pushed pattern recognition to the high edge of today’s engineering applications and research. Pattern recognition is an integral part of most machine intelligence systems built for decision making.
Machine vision is an area in which pattern recognition is of importance. A machine vision system captures images via a camera and analyzes them to produce descriptions of what is imaged. A typical application of a machine vision system is in the manufacturing industry, either for automated visual inspection or for automation in the assembly line. For example, in inspection, manufactured objects on a moving conveyor may pass the inspection station, where the camera stands, and it has to be ascertained whether there is a defect. Thus, images have to be analyzed online, and a pattern recognition system has to classify the objects into the “defect” or “nondefect” class. After that, an action has to be taken, such as to reject the offending parts. In an assembly line, different objects must be located and “recognized,” that is, classified in one of a number of classes known a priori. Examples are the “screwdriver class,” the “German key class,” and so forth in a tools’ manufacturing unit. Then a robot arm can move the objects in the right place.
Character (letter or number) recognition is another important area of pattern recognition, with major implications in automation and information handling. Optical character recognition (OCR) systems are already commercially available and more or less familiar to all of us. An OCR system has a “front-end” device consisting of a light source, a scan lens, a document transport, and a detector. At the output of
the light-sensitive detector, light-intensity variation is translated into “numbers” and an image array is formed. In the sequel, a series of image processing techniques are applied, leading to line and character segmentation. The pattern recognition software then takes over to recognize the characters—that is, to classify each character in the correct “letter, number, punctuation” class. Storing the recognized document has a twofold advantage over storing its scanned image. First, further electronic processing, if needed, is easy via a word processor, and second, it is much more efficient to store ASCII characters than a document image. Besides the printed character recognition systems, there is a great deal of interest invested in systems that recognize handwriting. A typical commercial application of such a system is in the machine reading of bank checks. The machine must be able to recognize the amounts in figures and digits and match them. Furthermore, it could check whether the payee corresponds to the account to be credited. Even if only half of the checks are manipulated correctly by such a machine, much labor can be saved from a tedious job. Another application is in automatic mail-sorting machines for postal code identification in post offices. Online handwriting recognition systems are another area of great commercial interest. Such systems will accompany pen computers, with which the entry of data will be done not via the keyboard but by writing. This complies with today’s tendency to develop machines and computers with interfaces acquiring human-like skills.
Computer-aided diagnosis is another important application of pattern recognition, aiming at assisting doctors in making diagnostic decisions. The final diagnosis is, of course, made by the doctor. Computer-assisted diagnosis has been applied to and is of interest for a variety of medical data, such as X-rays, computed tomographic images, ultrasound images, electrocardiograms (ECGs), and electroencephalograms (EEGs). The need for a computer-aided diagnosis stems from the fact that medical data are often not easily interpretable, and the interpretation can depend very much on the skill of the doctor. Let us take for example X-ray mammography for the detection of breast cancer. Although mammography is currently the best method for detecting breast cancer, 10 to 30% of women who have the disease and undergo mammography have negative mammograms. In approximately two thirds of these cases with false results the radiologist failed to detect the cancer, which was evident retrospectively. This may be due to poor image quality, eye fatigue of the radiologist, or the subtle nature of the findings. The percentage of correct classifications improves at a second reading by another radiologist. Thus, one can aim to develop a pattern recognition system in order to assist radiologists with a “second” opinion. Increasing confidence in the diagnosis based on mammograms would, in turn, decrease the number of patients with suspected breast cancer who have to undergo surgical breast biopsy, with its associated complications.
Speech recognition is another area in which a great deal of research and development effort has been invested. Speech is the most natural means by which humans communicate and exchange information. Thus, the goal of building intelligent machines that recognize spoken information has been a long-standing one for scientists and engineers as well as science fiction writers. Potential applications of such machines are numerous. They can be used, for example, to improve efficiency in a manufacturing environment, to control machines in hazardous environments remotely, and to help handicapped people to control machines by talking to them. A major effort, which has already had considerable success, is to enter data into a computer via a microphone. Software, built around a pattern (spoken sounds in this case) recognition system, recognizes the spoken text and translates it into ASCII characters, which are shown on the screen and can be stored in the memory. Entering information by “talking” to a computer is twice as fast as entry by a skilled typist. Furthermore, this can enhance our ability to communicate with deaf and dumb people.
Data mining and knowledge discovery in databases is another key application area of pattern recognition. Data mining is of intense interest in a wide range of applications such as medicine and biology, market and financial analysis, business management, science exploration, and image and music retrieval. Its popularity stems from the fact that, in the age of the information and knowledge society, there is an ever-increasing demand for retrieving information and turning it into knowledge. Moreover, this information exists in huge amounts of data in various forms, including text, images, audio, and video, stored in different places distributed all over the world. The traditional way of searching information in databases was the description-based model, where object retrieval was based on keyword description and subsequent word matching. However, this type of searching presupposes that a manual annotation of the stored information has previously been performed by a human. This is a very time-consuming job and, although feasible when the size of the stored information is limited, it is not possible when the amount of the available information becomes large. Moreover, the task of manual annotation becomes problematic when the stored information is widely distributed and shared by a heterogeneous “mixture” of sites and users. Content-based retrieval systems are becoming more and more popular, where information is sought based on “similarity” between an object, which is presented to the system, and objects stored in sites all over the world. In a content-based image retrieval (CBIR) system, an image is presented to an input device (e.g., a scanner). The system returns “similar” images based on a measured “signature,” which can encode, for example, information related to color, texture, and shape. In a music content-based retrieval system, an example (i.e., an extract from a music piece) is presented to a microphone input device and the system returns “similar” music pieces. In this case, similarity is based on certain (automatically) measured cues that characterize a music piece, such as the music meter, the music tempo, and the location of certain repeated patterns.
Mining for biomedical and DNA data analysis has enjoyed an explosive growth since the mid-1990s. All DNA sequences comprise four basic building elements, the nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Like the letters in our alphabets and the seven notes in music, these four nucleotides are combined to form long sequences in a twisted-ladder form. Genes consist of, usually, hundreds of nucleotides arranged in a particular order. Specific gene-sequence patterns are related to particular diseases and play an important role in medicine. To this end, pattern recognition is a key area that offers a wealth of developed tools for similarity search and comparison between DNA sequences. Such comparisons
a provision for content-based video information retrieval from digital libraries of the type: search and find all video scenes in a digital library showing person “X” laughing. Of course, to achieve the final goals in all of these applications, pattern recognition is closely linked with other scientific disciplines, such as linguistics, computer graphics, machine vision, and database design.
Having aroused the reader’s curiosity about pattern recognition, we will next sketch the basic philosophy and methodological directions in which the various pattern recognition approaches have evolved and developed.
1.2 FEATURES, FEATURE VECTORS, AND CLASSIFIERS
Let us first simulate a simplified case “mimicking” a medical image classification task. Figure 1.1 shows two images, each having a distinct region inside it. The two regions are also themselves visually different. We could say that the region of Figure 1.1a results from a benign lesion, class A, and that of Figure 1.1b from a malignant one (cancer), class B. We will further assume that these are not the only patterns (images) that are available to us, but we have access to an image database with patterns from both classes whose true class is known.
FIGURE 1.1
Examples of image regions corresponding to (a) class A and (b) class B.
The first step is to identify the measurable quantities that make these two regions distinct from each other. Figure 1.2 shows a plot of the mean value of the intensity in each region of interest versus the corresponding standard deviation around this mean. Each point corresponds to a different image from the available database. It turns out that class A patterns tend to spread in a different area from class B patterns. The straight line seems to be a good candidate for separating the two classes. Let us now assume that we are given a new image with a region in it and that we do not know to which class it belongs. It is reasonable to say that we measure the mean intensity and standard deviation in the region of interest and we plot the corresponding point. This is shown by the asterisk (∗) in Figure 1.2. Then it is sensible to assume that the unknown pattern is more likely to belong to class A than to class B.
The preceding artificial classification task has outlined the rationale behind a large class of pattern recognition problems. The measurements used for the
classification, the mean value and the standard deviation in this case, are known as features. In the more general case, l features x_i, i = 1, 2, ..., l, are used, and they form the feature vector
x = [x1, x2, ..., xl]^T
where T denotes transposition. Each of the feature vectors identifies uniquely a single pattern (object). Throughout this book features and feature vectors will be treated as random variables and vectors, respectively. This is natural, as the measurements resulting from different patterns exhibit a random variation. This is due partly to the measurement noise of the measuring devices and partly to
the distinct characteristics of each pattern. For example, in X-ray imaging large variations are expected because of the differences in physiology among individuals. This is the reason for the scattering of the points in each class shown in Figure 1.1.
The straight line in Figure 1.2 is known as the decision line, and it constitutes the classifier whose role is to divide the feature space into regions that correspond to either class A or class B. If a feature vector x, corresponding to an unknown pattern, falls in the class A region, it is classified as class A, otherwise as class B. This does not necessarily mean that the decision is correct. If it is not correct, a misclassification has occurred. In order to draw the straight line in Figure 1.2 we exploited the fact that we knew the labels (class A or B) for each point of the figure. The patterns (feature vectors) whose true class is known and which are used for the design of the classifier are known as training patterns (training feature vectors).
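Since MATLAB experiments accompany most chapters, a minimal sketch of this toy task may help fix ideas. It generates synthetic (mean, standard deviation) training features for the two classes, draws an empirically chosen straight decision line, and classifies an unknown pattern. All numerical values (class centers, spreads, line coefficients) are illustrative assumptions and do not come from Figure 1.2.

```matlab
% Toy "medical image" task: (mean, standard deviation) features for two classes.
% All numerical values below are illustrative assumptions.
rng(0);                                               % for reproducibility
N = 100;                                              % training patterns per class
classA = repmat([2.0 3.0], N, 1) + 0.5*randn(N, 2);   % class A cloud
classB = repmat([4.0 5.0], N, 1) + 0.6*randn(N, 2);   % class B cloud

figure; hold on; grid on;
plot(classA(:,1), classA(:,2), 'o');
plot(classB(:,1), classB(:,2), '+');

% An empirically chosen straight decision line: x2 = a*x1 + b
a = -1.0;  b = 7.0;
x1 = linspace(0, 6, 100);
plot(x1, a*x1 + b, 'k-');

% Classify an unknown pattern (the asterisk of Figure 1.2)
xnew = [2.8 3.5];
plot(xnew(1), xnew(2), 'k*', 'MarkerSize', 12);
if xnew(2) < a*xnew(1) + b
    disp('The unknown pattern is assigned to class A');
else
    disp('The unknown pattern is assigned to class B');
end
xlabel('mean value'); ylabel('standard deviation');
legend('class A', 'class B', 'decision line', 'unknown pattern');
```

Running the script reproduces, qualitatively, the situation of Figure 1.2: the two classes occupy different regions of the (mean, standard deviation) plane, and the line separates most of the training points.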
Having outlined the definitions and the rationale, let us point out the basic questions arising in a classification task.
■ How are the features generated? In the preceding example, we used the mean and the standard deviation, because we knew how the images had been generated. In practice, this is far from obvious. It is problem dependent, and it concerns the feature generation stage of the design of a classification system that performs a given pattern recognition task.
■ What is the best number l of features to use? This is also a very important task and it concerns the feature selection stage of the classification system. In practice, a larger than necessary number of feature candidates is generated, and then the “best” of them is adopted.
■ Having adopted the appropriate features for the specific task, how does one design the classifier? In the preceding example the straight line was drawn empirically, just to please the eye. In practice, this cannot be the case, and the line should be drawn optimally, with respect to an optimality criterion. Furthermore, problems for which a linear classifier (straight line or hyperplane in the l-dimensional space) can result in acceptable performance are not the rule. In general, the surfaces dividing the space in the various class regions are nonlinear. What type of nonlinearity must one adopt, and what type of optimizing criterion must be used in order to locate a surface in the right place in the l-dimensional feature space? These questions concern the classifier design stage.
■ Finally, once the classifier has been designed, how can one assess the performance of the designed classifier? That is, what is the classification error rate? This is the task of the system evaluation stage.
Figure 1.3 shows the various stages followed for the design of a classification system. As is apparent from the feedback arrows, these stages are not independent. On the contrary, they are interrelated and, depending on the results, one may go back
FIGURE 1.3
The basic stages involved in the design of a classification system: patterns, sensor, feature generation, feature selection, classifier design, and system evaluation.
to redesign earlier stages in order to improve the overall performance. Furthermore, there are some methods that combine stages, for example, the feature selection and the classifier design stage, in a common optimization task.
Although the reader has already been exposed to a number of basic problems
at the heart of the design of a classification system, there are still a few things to
be said.
1.3 SUPERVISED, UNSUPERVISED, AND SEMI-SUPERVISED
LEARNING
In the example of Figure 1.1, we assumed that a set of training data were available, and the classifier was designed by exploiting this a priori known information. This is known as supervised pattern recognition or, in the more general context of machine learning, as supervised learning. However, this is not always the case, and there is another type of pattern recognition tasks for which training data, of known class labels, are not available. In this type of problem, we are given a set of feature vectors x and the goal is to unravel the underlying similarities and cluster (group) “similar” vectors together. This is known as unsupervised pattern recognition or unsupervised learning or clustering. Such tasks arise in many applications in social sciences and engineering, such as remote sensing, image segmentation, and image and speech coding. Let us pick two such problems.
In multispectral remote sensing, the electromagnetic energy emanating from the earth’s surface is measured by sensitive scanners located aboard a satellite, an aircraft, or a space station. This energy may be reflected solar energy (passive) or the reflected part of the energy transmitted from the vehicle (active) in order to “interrogate” the earth’s surface. The scanners are sensitive to a number of wavelength bands of the electromagnetic radiation. Different properties of the earth’s surface contribute to the reflection of the energy in the different bands. For example, in the visible–infrared range properties such as the mineral and moisture contents of soils, the sedimentation of water, and the moisture content of vegetation are the main contributors to the reflected energy. In contrast, at the thermal end of the infrared, it is the thermal capacity and thermal properties of the surface and near subsurface that contribute to the reflection. Thus, each band measures different properties of the “sensed” earth’s surface, and in this way a feature vector is formed for each pixel. The elements x_i, i = 1, 2, ..., l, of the vector are the corresponding image pixel intensities in the various spectral bands. In practice, the number of spectral bands varies.
A clustering algorithm can be employed to reveal the groups in which feature vectors are clustered in the l-dimensional feature space. Points that correspond to the same ground cover type, such as water, are expected to cluster together and form groups. Once this is done, the analyst can identify the type of each cluster by associating a sample of points in each group with available reference ground data, that is, maps or visits. Figure 1.4 demonstrates the procedure.
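In the spirit of the MATLAB exercises of later chapters, the following sketch imitates this procedure on synthetic data: two-band “pixel” feature vectors are generated for three hypothetical ground-cover types and grouped with the standard k-means algorithm (the kmeans function of the Statistics and Machine Learning Toolbox). The band values and the number of clusters are assumptions chosen only for illustration; the analyst would still have to label the resulting clusters using reference ground data.

```matlab
% Unsupervised grouping of synthetic two-band "pixel" feature vectors.
% Band intensity values and the number of clusters are illustrative assumptions.
rng(1);
water      = repmat([0.20 0.10], 200, 1) + 0.05*randn(200, 2);
vegetation = repmat([0.40 0.70], 200, 1) + 0.05*randn(200, 2);
soil       = repmat([0.70 0.50], 200, 1) + 0.05*randn(200, 2);
X = [water; vegetation; soil];        % one feature vector per pixel

k = 3;                                % number of clusters sought
[idx, centers] = kmeans(X, k);        % Statistics and Machine Learning Toolbox

figure; gscatter(X(:,1), X(:,2), idx); hold on;
plot(centers(:,1), centers(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
xlabel('band 1 intensity'); ylabel('band 2 intensity');
title('k-means clusters; the analyst labels each cluster from ground data');
```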
Clustering is also widely used in the social sciences in order to study and correlate survey and statistical data and draw useful conclusions, which will then lead to the right actions. Let us again resort to a simplified example and assume that we are interested in studying whether there is any relation between a country’s gross national product (GNP) and the level of people’s illiteracy, on the one hand, and children’s mortality rate on the other. In this case, each country is represented by a three-dimensional feature vector whose coordinates are indices measuring the quantities of interest. A clustering algorithm will then reveal a rather compact cluster corresponding to countries that exhibit low GNPs, high illiteracy levels, and high children’s mortality expressed as a population percentage.
A major issue in unsupervised pattern recognition is that of defining the “similarity” between two feature vectors and choosing an appropriate measure for it. Another issue of importance is choosing an algorithmic scheme that will cluster (group) the vectors on the basis of the adopted similarity measure. In general, different algorithmic schemes may lead to different results, which the expert has to interpret.
Semi-supervised learning/pattern recognition for designing a classification system shares the same goals as the supervised case; however, now the designer has at his or her disposal a set of patterns of unknown class origin, in addition to the training patterns, whose true class is known. We usually refer to the former as unlabeled and the latter as labeled data. Semi-supervised pattern recognition can be of importance when the system designer has access to a rather limited number of labeled data. In such cases, recovering additional information from the unlabeled samples, related to the general structure of the data at hand, can be useful in improving the system design. Semi-supervised learning finds its way also to clustering tasks. In this case, labeled data are used as constraints in the form of must-links and cannot-links. In other words, the clustering task is constrained to assign certain points to the same cluster or to exclude certain points from being assigned to the same cluster. From this perspective, semi-supervised learning provides a priori knowledge that the clustering algorithm has to respect.
of each chapter and that the student has to understand in a first reading. Whenever the required MATLAB code was available (at the time this book was prepared) in a MATLAB toolbox, we chose to use the associated MATLAB function and explain how to use its arguments. No doubt, each instructor has his or her own preferences, experiences, and unique way of viewing teaching. The provided routines are written in a way that allows them to run on other data sets as well. In a separate accompanying book we provide a more complete list of MATLAB codes embedded in a user-friendly Graphical User Interface (GUI) and also involving more realistic examples using real images and audio signals.
1.5 OUTLINE OF THE BOOK
Chapters 2–10 deal with supervised pattern recognition and Chapters 11–16 deal with the unsupervised case. Semi-supervised learning is introduced in Chapter 10. The goal of each chapter is to start with the basics, definitions, and approaches, and move progressively to more advanced issues and recent techniques. To what extent the various topics covered in the book will be presented in a first course on pattern recognition depends very much on the course’s focus, on the students’ background, and, of course, on the lecturer. In the following outline of the chapters, we give our view and the topics that we cover in a first course on pattern recognition. No doubt, other views do exist and may be better suited to different audiences. At the end of each chapter, a number of problems and computer exercises are provided.
Chapter 2 is focused on Bayesian classification and techniques for estimating unknown probability density functions. In a first course on pattern recognition, the sections related to Bayesian inference, the maximum entropy, and the expectation maximization (EM) algorithm are omitted. Special focus is put on the Bayesian classification, the minimum distance (Euclidean and Mahalanobis) and nearest neighbor classifiers, and the naive Bayes classifier. Bayesian networks are briefly introduced.
Chapter 3 deals with the design of linear classifiers. The sections dealing with the probability estimation property of the mean square solution as well as the bias–variance dilemma are only briefly mentioned in our first course. The basic philosophy underlying the support vector machines can also be explained, although a deeper treatment requires mathematical tools (summarized in Appendix C) that most of the students are not familiar with during a first course class. On the contrary, emphasis is put on the linear separability issue, the perceptron algorithm, and the mean square and least squares solutions. After all, these topics have a much broader horizon and applicability. Support vector machines are briefly introduced. The geometric interpretation offers students a better understanding of the SVM theory.
Chapter 4 deals with the design of nonlinear classifiers. The section dealing with exact classification is bypassed in a first course. The proof of the backpropagation algorithm is usually very boring for most of the students and we bypass its details. A description of its rationale is given, and the students experiment with it using MATLAB. The issues related to cost functions are bypassed. Pruning is discussed with an emphasis on generalization issues. Emphasis is also given to Cover’s theorem and radial basis function (RBF) networks. The nonlinear support vector machines, decision trees, and combining classifiers are only briefly touched via a discussion on the basic philosophy behind their rationale.
Chapter 5 deals with the feature selection stage, and we have made an effort to present most of the well-known techniques. In a first course we put emphasis on the t-test. This is because hypothesis testing also has a broad horizon, and at the same time it is easy for the students to apply it in computer exercises. Then, depending on time constraints, divergence, the Bhattacharyya distance, and scatter matrices are presented and commented on, although their more detailed treatment is for a more advanced course. Emphasis is given to Fisher’s linear discriminant method (LDA) for the two-class case.
Chapter 6 deals with the feature generation stage using transformations. The Karhunen–Loève transform and the singular value decomposition are first introduced as dimensionality reduction techniques. Both methods are briefly covered in the second semester. In the sequel the independent component analysis (ICA), nonnegative matrix factorization, and nonlinear dimensionality reduction techniques are presented. Then the discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), Hadamard, and Haar transforms are defined. The rest of the chapter focuses on the discrete time wavelet transform. The incentive is to give all the necessary information so that a newcomer in the wavelet field can grasp the basics and be able to develop software, based on filter banks, in order to generate features. All these techniques are bypassed in a first course.
Chapter 7 deals with feature generation focused on image and audio classification. The sections concerning local linear transforms, moments, parametric models, and fractals are not covered in a first course. Emphasis is placed on first- and second-order statistics features as well as the run-length method. The chain code for shape description is also taught. Computer exercises are then offered to generate these features and use them for classification for some case studies. In a one-semester course there is no time to cover more topics.
Chapter 8 deals with template matching. Dynamic programming (DP) and the Viterbi algorithm are presented and then applied to speech recognition. In a two-semester course, emphasis is given to the DP and the Viterbi algorithm. The edit distance seems to be a good case for the students to grasp the basics. Correlation matching is taught and the basic philosophy behind deformable template matching can also be presented.
Chapter 9 deals with context-dependent classification. Hidden Markov models are introduced and applied to communications and speech recognition. This chapter is bypassed in a first course.
Chapter 10 deals with system evaluation and semi-supervised learning. The various error rate estimation techniques are discussed, and a case study with real data is treated. The leave-one-out method and the resubstitution methods are emphasized in the second semester, and students practice with computer exercises. Semi-supervised learning is bypassed in a first course.
Chapter 11 deals with the basic concepts of clustering. It focuses on definitions as well as on the major stages involved in a clustering task. The various types of data encountered in clustering applications are reviewed, and the most commonly used proximity measures are provided. In a first course, only the most widely used proximity measures are covered (e.g., l_p norms, inner product, Hamming distance).
Chapter 12 deals with sequential clustering algorithms. These include some of the simplest clustering schemes, and they are well suited for a first course to introduce students to the basics of clustering and allow them to experiment with
In a first course, most of these algorithms are bypassed, and emphasis is given to the isodata algorithm.
Chapter 15 features a high degree of modularity. It deals with clustering algorithms based on different ideas, which cannot be grouped under a single philosophy. Spectral clustering, competitive learning, branch and bound, simulated annealing, and genetic algorithms are some of the schemes treated in this chapter. These are bypassed in a first course.
Chapter 16 deals with the clustering validity stage of a clustering procedure. It contains rather advanced concepts and is omitted in a first course. Emphasis is given to the definitions of internal, external, and relative criteria and the random hypotheses used in each case. Indices, adopted in the framework of external and internal criteria, are presented, and examples are provided showing the use of these indices.
Syntactic pattern recognition methods are not treated in this book. Syntactic pattern recognition methods differ in philosophy from the methods discussed in this book and, in general, are applicable to different types of problems. In syntactic pattern recognition, the structure of the patterns is of paramount importance, and pattern recognition is performed on the basis of a set of pattern primitives, a set of rules in the form of a grammar, and a recognizer called an automaton. Thus, we were faced with a dilemma: either to increase the size of the book substantially, or to provide a short overview (which, however, exists in a number of other books), or to omit it. The last option seemed to be the most sensible choice.
Given a classification task of M classes, ω1, ω2, ..., ωM, and an unknown pattern, which is represented by a feature vector x, we form the M conditional probabilities P(ωi|x), i = 1, 2, ..., M. Sometimes, these are also referred to as a posteriori probabilities. In words, each of them represents the probability that the unknown pattern belongs to the respective class ωi, given that the corresponding feature vector takes the value x. Who could then argue that these conditional probabilities are not sensible choices to quantify the term most probable? Indeed, the classifiers to be considered in this chapter compute either the maximum of these M values or, equivalently, the maximum of an appropriately defined function of them. The unknown pattern is then assigned to the class corresponding to this maximum.
The first task we are faced with is the computation of the conditional probabilities. The Bayes rule will once more prove its usefulness! A major effort in this chapter will be devoted to techniques for estimating probability density functions (pdf), based on the available experimental evidence, that is, the feature vectors corresponding to the patterns of the training set.
proba-2.2 BAYES DECISION THEORY
We will initially focus on the two-class case. Let ω1, ω2 be the two classes in which our patterns belong. In the sequel, we assume that the a priori probabilities P(ω1), P(ω2) are known. This is a very reasonable assumption, because even if they are not known, they can easily be estimated from the available training feature vectors. Indeed, if N is the total number of available training patterns, and N1, N2 of them belong to ω1 and ω2, respectively, then P(ω1) ≈ N1/N and P(ω2) ≈ N2/N.
The other statistical quantities assumed to be known are the class-conditional probability density functions p(x|ωi), i = 1, 2, describing the distribution of the feature vectors in each of the classes. If these are not known, they can also be estimated from the available training data, as we will discuss later on in this chapter. The pdf p(x|ωi) is sometimes referred to as the likelihood function of ωi with respect to x. Here we should stress the fact that an implicit assumption has been made. That is, the feature vectors can take any value in the l-dimensional feature space. In the case that feature vectors can take only discrete values, the density functions p(x|ωi) become probabilities and will be denoted by P(x|ωi).
We now have all the ingredients to compute our conditional probabilities, as stated in the introduction. To this end, let us recall from our probability course basics the Bayes rule (Appendix A)
P(ωi|x) = p(x|ωi)P(ωi) / p(x)     (2.1)
where p(x) is the pdf of x, for which we have (Appendix A)
p(x) = Σ_{i=1}^{2} p(x|ωi)P(ωi)     (2.2)
The Bayes classification rule can now be stated as: if P(ω1|x) > P(ω2|x), x is classified to ω1; if P(ω1|x) < P(ω2|x), x is classified to ω2. Using Eq. (2.1), the decision can equivalently be based on the inequalities
p(x|ω1)P(ω1) ≷ p(x|ω2)P(ω2)     (2.4)
p(x) is not taken into account, because it is the same for all classes and it does not affect the decision. Furthermore, if the a priori probabilities are equal, that is, P(ω1) = P(ω2) = 1/2, Eq. (2.4) becomes
p(x|ω1) ≷ p(x|ω2)     (2.5)
Figure 2.1 shows an example of two equiprobable classes and the variations of p(x|ωi), i = 1, 2, as functions of x for the simple case of a single feature (l = 1). The dotted line at x0 is a threshold that partitions the feature space into two regions, R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides ω1 and for all values in R2 it decides ω2. However, it is obvious from the figure that decision errors are unavoidable. Indeed, there is a finite probability for an x to lie in the R2 region and at the same time to belong in class ω1. Then our decision is in error. The same is true for points originating from class ω2. It does not take much thought to see that the total probability, P_e, of committing a decision error for the case of two equiprobable classes is given by
P_e = (1/2) ∫_{−∞}^{x0} p(x|ω2) dx + (1/2) ∫_{x0}^{+∞} p(x|ω1) dx     (2.6)
which is equal to the total shaded area under the curves in Figure 2.1. Our starting point in arriving at the Bayes classification rule was rather empirical, via our interpretation of the term most probable. We will now see that this classification test, though simple in its formulation, has a sounder mathematical interpretation.
Minimizing the Classification Error Probability
We will show that the Bayesian classifier is optimal with respect to minimizing the classification error probability. Indeed, the reader can easily verify, as an exercise, that moving the threshold away from x0, in Figure 2.1, always increases the corresponding shaded area under the curves. Let us now proceed with a more formal proof.
Proof. Let R1 be the region of the feature space in which we decide in favor of ω1 and R2 be the corresponding region for ω2. Then an error is made if x ∈ R1, although it belongs to ω2, or if x ∈ R2, although it belongs to ω1. That is,
P_e = P(x ∈ R1, ω2) + P(x ∈ R2, ω1)     (2.7)
where P(·, ·) is the joint probability of two events. Recalling, once more, our probability basics (Appendix A), this becomes
P_e = P(x ∈ R1|ω2)P(ω2) + P(x ∈ R2|ω1)P(ω1) = ∫_{R1} P(ω2|x)p(x) dx + ∫_{R2} P(ω1|x)p(x) dx     (2.8)
It is now easy to see that the error is minimized if the partitioning regions R1 and R2 of the feature space are chosen so that
R1: P(ω1|x) > P(ω2|x)
R2: P(ω2|x) > P(ω1|x)
Indeed, since the union of the regions R1, R2 covers all the space, from the definition of a probability density function we have that
∫_{R1} P(ω1|x)p(x) dx + ∫_{R2} P(ω1|x)p(x) dx = P(ω1)
Combining this with Eq. (2.8), we get
P_e = P(ω1) − ∫_{R1} (P(ω1|x) − P(ω2|x)) p(x) dx
This suggests that the probability of error is minimized if R1 is the region of space in which P(ω1|x) > P(ω2|x). Then, R2 becomes the region where the reverse is true.
So far, we have dealt with the simple case of two classes. Generalizations to the multiclass case are straightforward. In a classification task with M classes, ω1, ω2, ..., ωM, an unknown pattern, represented by the feature vector x, is assigned to class ωi if
P(ωi|x) > P(ωj|x)   ∀j ≠ i     (2.13)
It turns out that such a choice also minimizes the classification error probability (Problem 2.1).
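As a simple illustration of this rule, the following MATLAB sketch classifies a scalar feature value for two equiprobable classes with known Gaussian class-conditional densities. The means, variances, and priors are assumed values used only for demonstration.

```matlab
% Bayes decision rule (Eq. 2.13) for two classes with Gaussian likelihoods.
% Means, variances, and priors are illustrative assumptions.
m  = [0 2];            % class-conditional mean values
s2 = [1 1];            % class-conditional variances
P  = [0.5 0.5];        % a priori probabilities P(w1), P(w2)

x = 0.8;               % value of the (scalar) feature of the unknown pattern

px_w = exp(-(x - m).^2 ./ (2*s2)) ./ sqrt(2*pi*s2);   % p(x|w_i)
post = px_w .* P;                  % proportional to P(w_i|x); p(x) omitted
[~, class] = max(post);            % assign to the class of maximum posterior
fprintf('x = %.2f is assigned to class %d\n', x, class);
```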
Minimizing the Average Risk
The classification error probability is not always the best criterion to be adopted for minimization. This is because it assigns the same importance to all errors. However, there are cases in which some wrong decisions may have more serious implications than others. For example, it is much more serious for a doctor to make a wrong decision and a malignant tumor to be diagnosed as a benign one, than the other way round. If a benign tumor is diagnosed as a malignant one, the wrong decision will be cleared out during subsequent clinical examinations. However, the results from the wrong decision concerning a malignant tumor may be fatal. Thus, in such cases it is more appropriate to assign a penalty term to weigh each error. For our example, let us denote by ω1 the class of malignant tumors and as ω2 the class of the benign ones. Let, also, R1, R2 be the regions in the feature space where we decide in favor of ω1 and ω2, respectively. The error probability P_e is given by Eq. (2.8). Instead of selecting R1 and R2 so that P_e is minimized, we will now try to minimize a modified version of it, in which each of the two error terms is weighted by a penalty factor reflecting its significance.
Let us now consider an M-class problem and let R_j, j = 1, 2, ..., M, be the regions of the feature space assigned to classes ω_j, respectively. Assume now that a feature vector x that belongs to class ω_k lies in R_i, i ≠ k. Then this vector is misclassified in ω_i and an error is committed. A penalty term λ_ki, known as loss, is associated with this wrong decision. The matrix L, which has at its (k, i) location the corresponding penalty term, is known as the loss matrix. Observe that in contrast to the philosophy behind Eq. (2.14), we have now allowed weights across the diagonal of the loss matrix (λ_kk), which correspond to correct decisions. In practice, these are usually set equal to zero, although we have considered them here for the sake of generality.
The risk or loss associated with ω_k is defined as
r_k = Σ_{i=1}^{M} λ_ki ∫_{R_i} p(x|ω_k) dx     (2.15)
Observe that the integral is the overall probability of a feature vector from class ω_k being classified in ω_i. This probability is weighted by λ_ki. Our goal now is to choose the partitioning regions R_j so that the average risk
r = Σ_{k=1}^{M} r_k P(ω_k)     (2.16)
is minimized.
It is obvious that if λ_ki = 1 − δ_ki, where δ_ki is Kronecker’s delta (0 if k ≠ i and 1 if k = i), then minimizing the average risk becomes equivalent to minimizing the classification error probability.
The two-class case. For this specific case, if misclassification of patterns originating from class ω2 is considered to have more serious consequences, we choose λ21 > λ12; the resulting rule then assigns a pattern x to class ω2 if
p(x|ω2) > p(x|ω1) (λ12 / λ21)
where P(ω1) = P(ω2) = 1/2 has been assumed. That is, p(x|ω1) is multiplied by a factor less than 1 and the effect of this is to move the threshold in Figure 2.1 to the left of x0. In other words, region R2 is increased while R1 is decreased. The opposite would be true if λ21 < λ12.
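The following MATLAB sketch contrasts the minimum-error-probability decision with a minimum-risk decision computed directly from a loss matrix: the conditional risk of each decision is formed by weighting the posteriors with the corresponding λ terms, and the decision with the smallest risk is selected. The Gaussian likelihood parameters, priors, and loss values are illustrative assumptions; note how choosing λ21 > λ12 enlarges the region assigned to ω2, exactly as discussed above.

```matlab
% Minimum-risk classification with a loss matrix versus the minimum-error decision.
% All numerical values (likelihood parameters, priors, losses) are illustrative assumptions.
P  = [0.5 0.5];                % priors P(w1), P(w2)
m  = [1 3];   s2 = [1 1];      % Gaussian likelihood parameters
% L(k,i) is the loss of deciding w_i when the true class is w_k.
% Here lambda_21 > lambda_12, so region R2 is enlarged, as discussed above.
L  = [0    0.2;                % lambda_11  lambda_12
      1.0  0  ];               % lambda_21  lambda_22

x = 1.8;                                               % feature value of the unknown pattern
px_w = exp(-(x - m).^2 ./ (2*s2)) ./ sqrt(2*pi*s2);    % p(x|w_k)
post = px_w .* P / sum(px_w .* P);                     % posteriors P(w_k|x)

risk = post * L;               % risk(i) = sum_k lambda_ki * P(w_k|x)
[~, riskClass]  = min(risk);   % minimum-risk decision
[~, errorClass] = max(post);   % minimum classification error decision
fprintf('Minimum-error decision: class %d; minimum-risk decision: class %d\n', ...
        errorClass, riskClass);
```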
An alternative cost that is sometimes used for two-class problems is the Neyman-Pearson criterion. The error for one of the classes is now constrained to be fixed and equal to a chosen value (Problem 2.6). Such a decision rule has been used, for example, in radar detection problems. The task there is to detect a target in the presence of noise. One type of error is the so-called false alarm—that is, to mistake the noise for a signal (target) present. Of course, the other type of error is to miss the signal and to decide in favor of the noise (missed detection). In many cases the error probability of false alarm is set equal to a predetermined threshold.
or x0 = (1 − ln 2)/2 < 1/2; that is, the threshold moves to the left of 1/2. If the two classes are not equiprobable, then it is easily verified that if P(ω1) > (<) P(ω2) the threshold moves to the right (left). That is, we expand the region in which we decide in favor of the most probable class, since it is better to make fewer errors for the most probable class.
2.3 DISCRIMINANT FUNCTIONS AND DECISION SURFACES
It is by now clear that minimizing either the risk or the error probability or the Neyman-Pearson criterion is equivalent to partitioning the feature space into M regions, for a task with M classes. If regions R_i, R_j happen to be contiguous, then they are separated by a decision surface in the multidimensional feature space. For the minimum error probability case, this is described by the equation
P(ωi|x) − P(ωj|x) = 0     (2.21)
From the one side of the surface this difference is positive, and from the other
it is negative. Sometimes, instead of working directly with probabilities (or risk functions), it may be more convenient, from a mathematical point of view, to work with an equivalent function of them, for example, g_i(x) ≡ f(P(ωi|x)), where f(·) is a monotonically increasing function. g_i(x) is known as a discriminant function. The decision test (2.13) is now stated as
classify x in ωi if g_i(x) > g_j(x)   ∀j ≠ i     (2.22)
The decision surfaces, separating contiguous regions, are described by
g_ij(x) ≡ g_i(x) − g_j(x) = 0,   i, j = 1, 2, ..., M,   i ≠ j     (2.23)
So far, we have approached the classification problem via Bayesian probabilistic arguments, and the goal was to minimize the classification error probability or the risk. However, as we will soon see, not all problems are well suited to such approaches. For example, in many cases the involved pdfs are complicated and their estimation is not an easy task. In such cases, it may be preferable to compute decision surfaces directly by means of alternative costs, and this will be our focus in Chapters 3 and 4. Such approaches give rise to discriminant functions and decision surfaces, which are entities with no (necessary) relation to Bayesian classification, and they are, in general, suboptimal with respect to Bayesian classifiers.
In the following we will focus on a particular family of decision surfaces associated with the Bayesian classification for the specific case of Gaussian density functions.
asso-2.4 BAYESIAN CLASSIFICATION FOR NORMAL DISTRIBUTIONS 2.4.1 The Gaussian Probability Density Function
One of the most commonly encountered probability density functions in practice
is the Gaussian or normal probability density function The major reasons for itspopularity are its computational tractability and the fact that it models adequately
a large number of cases One of the most celebrated theorems in statistics is the
central limit theorem The theorem states that if a random variable is the outcome of
a summation of a number of independent random variables, its pdf approaches the
Gaussian function as the number of summands tends to infinity (see Appendix A) Inpractice,it is most common to assume that the sum of random variables is distributedaccording to a Gaussian pdf, for a sufficiently large number of summing terms.The one-dimensional or the univariate Gaussian, as it is sometimes called, isdefined by
The parameters and 2turn out to have a specific meaning The mean value of
the random variable x is equal to , that is,
FIGURE 2.2
Graphs for the one-dimensional Gaussian pdf: (a) mean value μ = 0, σ² = 1; (b) μ = 1 and σ² = 0.2. The larger the variance, the broader the graph. The graphs are symmetric and centered at the respective mean value.
Figure 2.2a shows the graph of the Gaussian function for μ = 0 and σ² = 1, and Figure 2.2b the case for μ = 1 and σ² = 0.2. The larger the variance, the broader the graph, which is symmetric and always centered at μ (see Appendix A for some related properties). The multivariate generalization of the Gaussian pdf in the l-dimensional space is defined as
p(x) = (1 / ((2π)^{l/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ))
where μ = E[x] is the mean value, Σ is the l×l covariance matrix Σ = E[(x − μ)(x − μ)^T], and |Σ| denotes the determinant of Σ. It is readily seen that for l = 1 the multivariate Gaussian coincides with the univariate one. Sometimes, the symbol N(μ, Σ) is used to denote a Gaussian pdf with mean value μ and covariance Σ.
To get a better feeling on what the multivariate Gaussian looks like, let us focus on some cases in the two-dimensional space, where nature allows us the luxury of visualization. For this case we have
x = [x1, x2]^T,   μ = [μ1, μ2]^T,   Σ = [σ1²  σ12; σ12  σ2²]
where E[x_i] = μ_i, i = 1, 2, and by definition σ12 = E[(x1 − μ1)(x2 − μ2)], which is known as the covariance between the random variables x1 and x2 and is a measure of their mutual statistical correlation. If the variables are statistically independent, their covariance is zero (Appendix A). Obviously, the diagonal elements of Σ are the variances of the respective elements of the random vector.
Figures 2.3–2.6 show the graphs for four instances of a two-dimensional Gaussian probability density function. Figure 2.3a corresponds to a Gaussian with a diagonal covariance matrix
Σ = [3  0; 0  3]
FIGURE 2.3
(a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a diagonal Σ with σ1² = σ2². The graph has a spherical symmetry showing no preference in any direction.
FIGURE 2.4
(a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a diagonal Σ with σ1² >> σ2². The graph is elongated along the x1 direction.
FIGURE 2.5
(a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a diagonal Σ with σ1² << σ2². The graph is elongated along the x2 direction.
FIGURE 2.6
(a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a case of a nondiagonal Σ. Playing with the values of the elements of Σ one can achieve different shapes and orientations.
that is, both features, x1 and x2, have variance equal to 3 and their covariance is zero. The graph of the Gaussian is symmetric. For this case the isovalue curves (i.e., curves of equal probability density values) are circles (hyperspheres in the general l-dimensional space) and are shown in Figure 2.3b. The case shown in Figure 2.4a corresponds to a diagonal covariance matrix with σ1² much larger than σ2² = 3. The graph of the Gaussian is now elongated along the x1-axis, which is the direction of the larger variance. The isovalue curves, shown
in Figure 2.4b, are ellipses. Figures 2.5a and 2.5b correspond to the case with σ1² << σ2², and Figures 2.6a and 2.6b to a case with a nondiagonal covariance matrix. The isovalue curves are then ellipses of different orientations and with different ratios of major to minor axis lengths. Let us consider, as an example, the case of a zero-mean random vector with a diagonal covariance matrix. To compute the isovalue curves is equivalent to computing the curves of constant values for the exponent, that is,
x1²/σ1² + x2²/σ2² = C
for some constant C. This is the equation of an ellipse whose axes are determined by the variances of the involved features. As we will soon see, the principal axes of the ellipses are controlled by the eigenvectors/eigenvalues of the covariance matrix. As we know from linear algebra (and it is easily checked), the eigenvalues of a diagonal matrix, which was the case for our example, are equal to the respective elements across its diagonal.
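A short MATLAB sketch can reproduce plots of this type: it evaluates the two-dimensional Gaussian pdf on a grid and draws both the surface and its isovalue curves. The mean and covariance values are assumptions; substituting other matrices (diagonal with unequal variances, or nondiagonal) yields the shapes of Figures 2.3–2.6.

```matlab
% Surface and isovalue curves of a two-dimensional Gaussian pdf.
% The mean and covariance values below are illustrative assumptions.
mu    = [0; 0];
Sigma = [3.0 0.0;          % try, e.g., diag([15 3]) or a nondiagonal matrix
         0.0 3.0];         % to reproduce the shapes of Figures 2.3-2.6

[x1, x2] = meshgrid(-6:0.1:6, -6:0.1:6);
X  = [x1(:) x2(:)];                              % grid points as row vectors
d  = X - repmat(mu', size(X, 1), 1);             % x - mu for every grid point
expo = sum((d / Sigma) .* d, 2);                 % (x-mu)' * inv(Sigma) * (x-mu)
p  = exp(-0.5*expo) / (2*pi*sqrt(det(Sigma)));   % two-dimensional Gaussian pdf
p  = reshape(p, size(x1));

figure;
subplot(1,2,1); mesh(x1, x2, p);    title('pdf surface');
subplot(1,2,2); contour(x1, x2, p); axis equal; title('isovalue curves');
```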
2.4.2 The Bayesian Classifier for Normally Distributed Classes
Our goal in this section is to study the optimal Bayesian classifier when the involved pdfs, p(x|ωi), i = 1, 2, ..., M (likelihood functions of ωi with respect to x), describing the data distribution in each one of the classes, are multivariate normal distributions, that is, N(μi, Σi), i = 1, 2, ..., M. Because of the exponential form of the involved densities, it is preferable to work with the following discriminant functions, which involve the (monotonic) logarithmic function ln(·):
g_i(x) = ln(p(x|ωi)P(ωi)) = ln p(x|ωi) + ln P(ωi)     (2.33)
or
g_i(x) = −(1/2)(x − μi)^T Σi^{−1} (x − μi) + ln P(ωi) + c_i
where c_i is a constant equal to −(l/2) ln 2π − (1/2) ln|Σi|.
In general, this is a nonlinear quadratic form. Take, for example, the case of l = 2 and assume a diagonal covariance matrix Σi = diag(σi², σi²). Expanding g_i(x) then gives a quadratic function of x1 and x2, and obviously the associated decision curves g_i(x) − g_j(x) = 0 are quadrics (i.e., ellipsoids, parabolas, hyperbolas, pairs of lines). That is, in such cases, the Bayesian classifier is a quadratic classifier, in the sense that the partition of the feature space is performed via quadric decision surfaces. For l > 2 the decision surfaces are hyperquadrics. Figure 2.7a shows the decision curve corresponding to
P(ω1) = P(ω2), μ1 = [0, 0]^T and μ2 = [4, 0]^T. The covariance matrices for the two classes are
Σ1 = [0.3  0.0; 0.0  0.35],   Σ2 = [1.2  0.0; 0.0  1.85]
For the case of Figure 2.7b the classes are also equiprobable with μ1 = [0, 0]^T, μ2 = [3.2, 0]^T and covariance matrices
Σ1 = [0.1  0.0; 0.0  0.75],   Σ2 = [0.75  0.0; 0.0  0.1]
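Using the parameter values just quoted for Figure 2.7a, the following MATLAB sketch evaluates the discriminant functions of Eq. (2.33) on a grid (dropping the common constant −(l/2) ln 2π) and traces the zero level set of g1(x) − g2(x), that is, the quadric decision curve. Only the grid limits are arbitrary choices.

```matlab
% Decision curve g1(x) - g2(x) = 0 for the two equiprobable Gaussian classes
% of Figure 2.7a; only the grid limits below are arbitrary choices.
m1 = [0; 0];    S1 = [0.3 0.0; 0.0 0.35];
m2 = [4; 0];    S2 = [1.2 0.0; 0.0 1.85];
P1 = 0.5;       P2 = 0.5;

% Discriminant of Eq. (2.33); the common term -(l/2)*ln(2*pi) is omitted.
g = @(x, m, S, P) -0.5*(x - m)'*(S\(x - m)) - 0.5*log(det(S)) + log(P);

[x1, x2] = meshgrid(-3:0.05:7, -5:0.05:5);
dg = zeros(size(x1));
for n = 1:numel(x1)
    x = [x1(n); x2(n)];
    dg(n) = g(x, m1, S1, P1) - g(x, m2, S2, P2);    % g1(x) - g2(x)
end

figure; contour(x1, x2, dg, [0 0], 'k'); hold on;   % the quadric decision curve
plot(m1(1), m1(2), 'b+', m2(1), m2(2), 'r+');
xlabel('x_1'); ylabel('x_2'); title('Quadratic Bayesian decision curve');
```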
decision surface equations. The same is true for the constants c_i. Thus, they can be omitted and we may redefine g_i(x) as
g_i(x) = w_i^T x + w_{i0}