STATISTICAL PATTERN RECOGNITION
Third Edition
Andrew R. Webb, Keith D. Copsey
Mathematics and Data Analysis Consultancy, Malvern, UK
Statistical pattern recognition relates to the use of statistical techniques for analysing data measurements in order to extract information and make justified decisions. It is a very active area of study and research, which has seen many advances in recent years. Applications such as data mining, web searching, multimedia data retrieval, face recognition and cursive handwriting recognition all require robust and efficient pattern recognition techniques.
This third edition provides an introduction to statistical pattern recognition theory and techniques, with material drawn from a wide range of fields, including engineering, statistics, computer science and the social sciences. The book has been updated to cover new methods and applications, and includes a wide range of techniques such as Bayesian methods, neural networks, support vector machines, feature selection and feature reduction techniques. Technical descriptions and motivations are provided, and the techniques are illustrated using real examples.
Statistical Pattern Recognition, Third Edition:
• Provides a self-contained introduction to statistical pattern recognition
• Includes new material presenting the analysis of complex networks
• Introduces readers to methods for Bayesian density estimation
• Presents descriptions of new applications in biometrics, security, finance and
condition monitoring
• Provides descriptions and guidance for implementing techniques, which will be
invaluable to software engineers and developers seeking to develop real applications
• Describes mathematically the range of statistical pattern recognition techniques
• Presents a variety of exercises including more extensive computer projects
The in-depth technical descriptions make this book suitable for senior undergraduate and graduate students in statistics, computer science and engineering. Statistical Pattern Recognition is also an excellent reference source for technical professionals. Chapters have been arranged to facilitate implementation of the techniques by software engineers and developers in non-statistical engineering fields.
www.wiley.com/go/statistical_pattern_recognition
Statistical Pattern Recognition
Third Edition
Andrew R. Webb, Keith D. Copsey
Mathematics and Data Analysis Consultancy, Malvern, UK
A John Wiley & Sons, Ltd., Publication
This edition first published 2011
© 2011 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Webb, A. R. (Andrew R.)
Statistical pattern recognition / Andrew R. Webb, Keith D. Copsey. – 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-68227-2 (hardback) – ISBN 978-0-470-68228-9 (paper)
1. Pattern perception–Statistical methods. I. Copsey, Keith D. II. Title.
To Rosemary, Samuel, Miriam, Jacob and Ethan
Preface

This book provides an introduction to statistical pattern recognition theory and techniques. Most of the material presented in this book is concerned with discrimination and classification and has been drawn from a wide range of literature including that of engineering, statistics, computer science and the social sciences. The aim of the book is to provide descriptions of many of the most useful of today's pattern processing techniques, including many of the recent advances in nonparametric approaches to discrimination and Bayesian computational methods developed in the statistics literature and elsewhere. Discussions provided on the motivations and theory behind these techniques will enable the practitioner to gain maximum benefit from their implementations within many of the popular software packages. The techniques are illustrated with examples of real-world application studies. Pointers are also provided to the diverse literature base where further details on applications, comparative studies and theoretical developments may be obtained.

The book grew out of our research on the development of statistical pattern recognition methodology and its application to practical sensor data analysis problems. The book is aimed at advanced undergraduate and graduate courses. Some of the material has been presented as part of a graduate course on pattern recognition and at pattern recognition summer schools. It is also designed for practitioners in the field of pattern recognition as well as researchers in the area. A prerequisite is a knowledge of basic probability theory and linear algebra, together with basic knowledge of mathematical methods (for example, Lagrange multipliers are used to solve problems with equality and inequality constraints in some derivations). Some basic material (which was provided as appendices in the second edition) is available on the book's website.
Scope
The book presents most of the popular methods of statistical pattern recognition. However, many of the important developments in pattern recognition are not confined to the statistics literature and have occurred where the area overlaps with research in machine learning. Therefore, where we have felt that straying beyond the traditional boundaries of statistical pattern recognition would be beneficial, we have done so. An example is the inclusion of some rule induction methods as a complementary approach to rule discovery by decision tree induction.

Most of the methodology is generic – it is not specific to a particular type of data or application. Thus, we exclude preprocessing methods and filtering methods commonly used in signal and image processing.
Approach
The approach in each chapter has been to introduce some of the basic concepts and algorithms and to conclude each section on a technique or a class of techniques with a practical application of the approach from the literature. The main aim has been to introduce the basic concept of an approach. Sometimes this has required some detailed mathematical description and clearly we have had to draw a line on how much depth we discuss a particular topic. Most of the topics have whole books devoted to them and so we have had to be selective in our choice of material. Therefore, the chapters conclude with a section on the key references. The exercises at the ends of the chapters vary from 'open book' questions to more lengthy computer projects.
New to the third edition
Many sections have been rewritten and new material added. The new features of this edition include the following:

• expanded material on Bayesian sampling schemes and Markov chain Monte Carlo methods, and new sections on sequential Monte Carlo samplers and variational Bayes approaches;
• rule induction;
• a new chapter on ensemble methods of classification;
• revision of the feature selection material, with a new section on stability;
• spectral clustering;
• a new chapter on complex networks, with relevance to social and computer network analysis.
Book outline
Chapter 1 provides an introduction to statistical pattern recognition, defining some terminology and introducing supervised and unsupervised classification. Two related approaches to supervised classification are presented: one based on the use of probability density functions and a second based on the construction of discriminant functions. The chapter concludes with an outline of the pattern recognition cycle, putting the remaining chapters of the book into context. Chapters 2, 3 and 4 pursue the density function approach to discrimination. Chapter 2 addresses parametric approaches to density estimation, which are developed further in Chapter 3 on Bayesian methods. Chapter 4 develops classifiers based on nonparametric schemes, including the popular k-nearest-neighbour method, with associated efficient search algorithms.
Chapters 5–7 develop discriminant function approaches to supervised classification. Chapter 5 focuses on linear discriminant functions; much of the methodology of this chapter (including optimisation, regularisation and support vector machines) is used in some of the nonlinear methods described in Chapter 6, which explores kernel-based methods (in particular, the radial basis function network and the support vector machine) and projection-based methods (the multilayer perceptron). These are commonly referred to as neural network methods. Chapter 7 considers approaches to discrimination that enable the classification function to be cast in the form of an interpretable rule, important for some applications.

Chapter 8 considers ensemble methods – combining classifiers for improved robustness. Chapter 9 considers methods of measuring the performance of a classifier.

The techniques of Chapters 10 and 11 may be described as methods of exploratory data analysis or preprocessing (and as such would usually be carried out prior to the supervised classification techniques of Chapters 5–7, although they could, on occasion, be post-processors of supervised techniques). Chapter 10 addresses feature selection and feature extraction – the procedures for obtaining a reduced set of variables characterising the original data. Such procedures are often an integral part of classifier design and it is somewhat artificial to partition the pattern recognition problem into separate processes of feature extraction and classification. However, feature extraction may provide insights into the data structure and the type of classifier to employ; thus, it is of interest in its own right. Chapter 11 considers unsupervised classification or clustering – the process of grouping individuals in a population to discover the presence of structure; its engineering application is to vector quantisation for image and speech coding. Chapter 12 on complex networks introduces methods for analysing data that may be represented using the mathematical concept of a graph. This has great relevance to social and computer networks.

Finally, Chapter 13 addresses some important diverse topics, including model selection.
Book website
The book's website (www.wiley.com/go/statistical_pattern_recognition) contains supplementary material on topics including measures of dissimilarity, estimation, linear algebra, data analysis and basic probability.
Acknowledgements
In preparing the third edition of this book we have been helped by many people. We are especially grateful to Dr Gavin Cawley, University of East Anglia, for help and advice. We are grateful to friends and colleagues (past and present, from RSRE, DERA and QinetiQ) who have provided encouragement and made comments on various parts of the manuscript. In particular, we would like to thank Anna Skeoch for providing figures for Chapter 12, and Richard Davies and colleagues at John Wiley for help in the final production of the manuscript. Andrew Webb is especially thankful to Rosemary for her love, support and patience.

Andrew R. Webb
Keith D. Copsey
Notation

Some of the more commonly used notation is given below. We have used some notational conveniences. For example, we have tended to use the same symbol for a variable as well as a measurement on that variable. The meaning should be obvious from context. Also, we denote the density function of x as p(x) and of y as p(y), even though the functions differ. A vector is denoted by a lower case quantity in bold face, and a matrix by upper case. Since pattern recognition is very much a multidisciplinary subject, it is impossible to be both consistent across all chapters and consistent with the commonly used notation in the different literatures. We have adopted the policy of maintaining consistency as far as possible within a given chapter.
p(x) = ∂P/∂x    probability density function
p(x|ω_j)    probability density function of class ω_j
z_ji    z_ji = 1 if x_i ∈ ω_j, 0 otherwise; n_j = number of patterns in ω_j, n_j = Σ_{i=1}^{n} z_ji
N(x; m, Σ)    probability density function for the normal distribution with mean m and covariance matrix Σ, evaluated at x
1 Introduction to statistical pattern recognition

1.1 Statistical pattern recognition
We live in a world where massive amounts of data are collected and recorded on nearly every aspect of human endeavour: for example, banking, purchasing (credit-card usage, point-of-sale data analysis), Internet transactions, performance monitoring (of schools, hospitals, equipment) and communications. The data come in a wide variety of diverse forms – numeric, textual (structured or unstructured), audio and video signals. Understanding and making sense of this vast and diverse collection of data (identifying patterns, trends, anomalies, providing summaries) requires some automated procedure to assist the analyst with this 'data deluge'. A practical example of pattern recognition that is familiar to many people is classifying email messages (as spam/not spam) based upon message header, content and sender.

Approaches for analysing such data include those for signal processing, filtering, data summarisation, dimension reduction, variable selection, regression and classification, and have been developed in several literatures (physics, mathematics, statistics, engineering, artificial intelligence, computer science and the social sciences, among others). The main focus of this book is on pattern recognition procedures, providing a description of basic techniques together with case studies of practical applications of the techniques on real-world problems.
A strong emphasis is placed on the statistical theory of discrimination, but clustering also receives some attention. Thus, the main subject matter of this book can be summed up in a single word: 'classification', both supervised (using class information to design a classifier – i.e. discrimination) and unsupervised (allocating to groups without class information – i.e. clustering). However, in recent years many complex datasets have been gathered (for example, 'transactions' between individuals – email traffic, purchases). Understanding these datasets requires additional tools in the pattern recognition toolbox. Therefore, we also examine developments such as methods for analysing data that may be represented as a graph.

Pattern recognition as a field of study developed significantly in the 1960s. It was very much an interdisciplinary subject. Some people entered the field with a real problem to solve. The large number of applications, ranging from the classical ones such as automatic character recognition and medical diagnosis to the more recent ones in data mining (such as credit scoring, consumer sales analysis and credit card transaction analysis), have attracted considerable research effort, with many methods developed and advances made. Other researchers were motivated by the development of machines with 'brain-like' performance, that in some way could operate giving human performance.
Within these areas significant progress has been made, particularly where the domain overlaps with probability and statistics, and in recent years there have been many exciting new developments, both in methodology and applications. These build on the solid foundations of earlier research and take advantage of increased computational resources readily available nowadays. These developments include, for example, kernel-based methods (including support vector machines) and Bayesian computational methods.

The topics in this book could easily have been described under the term machine learning, which describes the study of machines that can adapt to their environment and learn from example. The machine learning emphasis is perhaps more on computationally intensive methods and less on a statistical approach, but there is strong overlap between the research areas of statistical pattern recognition and machine learning.
Since many of the techniques we shall describe have been developed over a range of diverse disciplines, there is naturally a variety of sometimes contradictory terminology. We shall use the term 'pattern' to denote the p-dimensional data vector x = (x_1, ..., x_p)^T of measurements (T denotes vector transpose), whose components x_i are measurements of the features of an object. Thus the features are the variables specified by the investigator and thought to be important for classification. In discrimination, we assume that there exist C groups or classes, denoted ω_1, ..., ω_C, and associated with each pattern x is a categorical variable z that denotes the class or group membership; that is, if z = i, then the pattern belongs to ω_i, i ∈ {1, ..., C}.

Examples of patterns are measurements of an acoustic waveform in a speech recognition problem; measurements on a patient made in order to identify a disease (diagnosis); measurements on patients (perhaps subjective assessments) in order to predict the likely outcome (prognosis); measurements on weather variables (for forecasting or prediction); sets of financial measurements recorded over time; and a digitised image for character recognition. Therefore, we see that the term 'pattern', in its technical meaning, does not necessarily refer to structure within images.
Figure 1.1 Pattern classifier.
The main topic in this book may be described by a number of terms, including pattern classifier design or discrimination or allocation rule design. Designing the rule requires specification of the parameters of a pattern classifier, represented schematically in Figure 1.1, so that it yields the optimal (in some sense) response for a given input pattern. This response is usually an estimate of the class to which the pattern belongs. We assume that we have a set of patterns of known class {(x_i, z_i), i = 1, ..., n} (the training or design set) that we use to design the classifier (to set up its internal parameters). Once this has been done, we may estimate class membership for a pattern x for which the class label is unknown. Learning the model from a training set is the process of induction; applying the trained model to patterns of unknown class is the process of deduction.
Thus, the uses of a pattern classifier are to provide:

• a descriptive model that explains the difference between patterns of different classes in terms of features and their measurements;
• a predictive model that predicts the class of an unlabelled pattern.
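The induction and deduction steps described above can be made concrete with a short sketch. The nearest-class-mean classifier used here is chosen purely for illustration (it is not a method prescribed at this point in the text), and the data are invented:

```python
# A minimal sketch of the induction/deduction cycle, using a toy
# nearest-class-mean classifier on invented two-dimensional data.
import numpy as np

rng = np.random.default_rng(0)
# Design (training) set: patterns x_i with known class labels z_i.
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # class 0 patterns
                     rng.normal(2.0, 1.0, (50, 2))])  # class 1 patterns
z_train = np.array([0] * 50 + [1] * 50)

# Induction: set the classifier's internal parameters (here, class means).
means = np.array([X_train[z_train == c].mean(axis=0) for c in (0, 1)])

# Deduction: estimate class membership for a pattern of unknown class.
x_new = np.array([1.8, 2.1])
z_hat = int(np.argmin(((means - x_new) ** 2).sum(axis=1)))
print("predicted class:", z_hat)
```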
However, we might ask why do we need a predictive model? Cannot the procedure that was used to assign labels to the training set measurements also be used for the test set in classifier operation? There may be several reasons for developing an automated process:

• to remove humans from the recognition process – to make the process more reliable;
• in banking, to identify good risk applicants before making a loan;
• to make a medical diagnosis without a post mortem (or to assess the state of a piece of equipment without dismantling it) – sometimes a pattern may only be labelled through intensive examination of a subject, whether person or piece of equipment;
• to label patterns where manual labelling would be a time consuming process;
• to operate in conditions that would be harmful to humans, when the training data have been gathered under controlled conditions;
• to operate remotely – to classify crops and land use remotely without labour-intensive, time consuming surveys.
There are many classifiers that can be constructed from a given dataset. Examples include decision trees, neural networks, support vector machines and linear discriminant functions. For a classifier of a given type, we employ a learning algorithm to search through the parameter space to find the model that best describes the relationship between the measurements and class labels for the training set. The form derived for the pattern classifier depends on a number of different factors. It depends on the distribution of the training data, and the assumptions made concerning its distribution. Another important factor is the misclassification cost – the cost of making an incorrect decision. In many applications misclassification costs are hard to quantify, being combinations of several contributions such as monetary costs, time and other more subjective costs. For example, in a medical diagnosis problem, each treatment has different costs associated with it. These relate to the expense of different types of drugs, the suffering the patient is subjected to by each course of action and the risk of further complications.
Figure 1.1 grossly oversimplifies the pattern classification procedure. Data may undergo several separate transformation stages before a final outcome is reached. These transformations (sometimes termed preprocessing, feature selection or feature extraction) operate on the data in a way that, usually, reduces its dimension (reduces the number of features), removing redundant or irrelevant information, and transforms it to a form more appropriate for subsequent classification. The term intrinsic dimensionality refers to the minimum number of variables required to capture the structure within the data. In speech recognition, a preprocessing stage may be to transform the waveform to a frequency representation. This may be processed further to find formants (peaks in the spectrum). This is a feature extraction process (taking a possibly nonlinear combination of the original variables to form new variables). Feature selection is the process of selecting a subset of a given set of variables (see Chapter 10). In some problems, there is no automatic feature selection stage, with the feature selection being performed by the investigator who 'knows' (through experience, knowledge of previous studies and the problem domain) those variables that are important for classification. In many cases, however, it will be necessary to perform one or more transformations of the measured data.

In some pattern classifiers, each of the above stages may be present and identifiable as separate operations, while in others they may not be. Also, in some classifiers, the preliminary stages will tend to be problem specific, as in the speech example. In this book, we consider feature selection and extraction transformations that are not application specific. That is not to say the methods of feature transformation described will be suitable for any given application, however; application-specific preprocessing must be left to the investigator who understands the application domain and method of data collection.
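As an illustration of the distinction drawn above, the following sketch contrasts feature selection (retaining a subset of the measured variables) with a simple linear feature extraction (principal components, one of the methods treated in Chapter 10); the data and the choice of two dimensions are arbitrary assumptions:

```python
# Feature selection versus feature extraction on invented data:
# selection keeps original variables; extraction forms new ones.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))       # 100 patterns, p = 5 features

# Feature selection: retain a subset of the measured variables.
X_sel = X[:, [0, 2]]                # keep features 0 and 2 only

# Feature extraction: project onto the two leading principal axes,
# i.e. new variables that are linear combinations of the originals.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_ext = Xc @ Vt[:2].T

print(X_sel.shape, X_ext.shape)     # both reduced to two dimensions
```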
1.2 Stages in a pattern recognition problem
A pattern recognition investigation may consist of several stages, enumerated below. Not all stages may be present; some may be merged together so that the distinction between two operations may not be clear, even if both are carried out; there may be some application-specific data processing that may not be regarded as one of the stages listed below. However, the points below are fairly typical.

1. Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.

2. Data collection: making measurements on appropriate variables and recording details of the data collection procedure (ground truth).
3. Initial examination of the data: checking the data, calculating summary statistics and producing plots in order to get a feel for the structure.

4. Feature selection or feature extraction: selecting variables from the measured set that are appropriate for the task. These new variables may be obtained by a linear or nonlinear transformation of the original set (feature extraction). To some extent, the partitioning of the data processing into separate feature extraction and classification processes is artificial, since a classifier often includes the optimisation of a feature extraction stage as part of its design.

5. Unsupervised pattern classification or clustering. This may be viewed as exploratory data analysis and it may provide a successful conclusion to a study. On the other hand, it may be a means of preprocessing the data for a supervised classification procedure.

6. Apply discrimination or regression procedures as appropriate. The classifier is designed using a training set of exemplar patterns.

7. Assessment of results. This may involve applying the trained classifier to an independent test set of labelled patterns. Classification performance is often summarised in the form of a confusion matrix: a C × C table whose (i, j)th entry, e_ij, is the number of patterns of class ω_j that are predicted to be class ω_i. The accuracy, a, is calculated from the confusion matrix as

a = Σ_{i=1}^{C} e_ii / Σ_{i=1}^{C} Σ_{j=1}^{C} e_ij

that is, the proportion of test patterns that are correctly classified.
The emphasis of this book is on techniques for performing steps 4, 5, 6 and 7.
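As a concrete illustration of step 7, the sketch below builds the confusion matrix with entries e_ij (patterns of class ω_j predicted as class ω_i) and computes the accuracy a from it; the labels used are invented:

```python
# Confusion matrix and accuracy for a small invented test set.
import numpy as np

def confusion_matrix(z_true, z_pred, C):
    """e[i, j] = number of class-j patterns predicted to be class i."""
    e = np.zeros((C, C), dtype=int)
    for true, pred in zip(z_true, z_pred):
        e[pred, true] += 1
    return e

z_true = np.array([0, 0, 1, 1, 2, 2, 2])   # ground-truth labels
z_pred = np.array([0, 1, 1, 1, 2, 0, 2])   # classifier output
e = confusion_matrix(z_true, z_pred, C=3)

a = np.trace(e) / e.sum()   # diagonal entries are correct classifications
print(e)
print(f"accuracy a = {a:.3f}")
```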
1.3 Issues
The main topic that we address in this book concerns classifier design: given a training set of patterns of known class, we seek to use those examples to design a classifier that is optimal for the expected operating conditions (the test conditions).

There are a number of very important points to make about this design process.
Finite design set
We are given a finite design set. If the classifier is too complex (there are too many free parameters) it may model noise in the design set. This is an example of overfitting. If the classifier is not complex enough, then it may fail to capture structure in the data. An illustration of this is the fitting of a set of data points by a polynomial curve (Figure 1.2). If the degree of the polynomial is too high then, although the curve may pass through or close to the data points, thus achieving a low fitting error, the fitting curve is very variable and models every fluctuation in the data (due to noise). If the degree of the polynomial is too low, the fitting error is large and the underlying variability of the curve is not modelled (the model underfits the data). Thus, achieving optimal performance on the design set (in terms of minimising some error criterion perhaps) is not required: it may be possible, in a classification problem, to achieve 100% classification accuracy on the design set, but the generalisation performance – the expected performance on data representative of the true operating conditions (equivalently, the performance on an infinite test set of which the design set is a sample) – is poorer than could be achieved by careful design. Choosing the 'right' model is an exercise in model selection.

In practice we usually do not know what is structure and what is noise in the data. Also, training a classifier (the procedure of determining its parameters) should not be considered as a separate issue from model selection, but it often is.
Figure 1.2 Fitting a curve to a noisy set of samples: the data samples are from a quadratic function with added noise; the fitting curves are a linear fit, a quadratic fit and a high-degree polynomial.
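The experiment of Figure 1.2 is easy to reproduce. The sketch below fits polynomials of degree 1, 2 and 10 to noisy samples of a quadratic; the particular quadratic, noise level and sample size are arbitrary assumptions. The fitting error falls as the degree rises, but the high-degree fit is modelling noise, not structure:

```python
# Under- and overfitting: polynomial fits to noisy quadratic data.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
y = 2.0 * x**2 - x + rng.normal(0.0, 0.05, x.size)   # quadratic plus noise

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)
    residual = y - np.polyval(coeffs, x)
    # A low fitting error on the design data does not imply good
    # generalisation performance on new data.
    print(f"degree {degree:2d}: design-set fitting error "
          f"= {np.sum(residual**2):.5f}")
```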
Optimality
A second point about the design of optimal classifiers concerns the word 'optimal'. There are several ways of measuring classifier performance, the most common being error rate, although this has severe limitations (see Chapter 9). Other measures, based on the closeness of the estimates of the probabilities of class membership to the true probabilities, may be more appropriate in many cases. However, many classifier design methods usually optimise alternative criteria since the desired ones are difficult to optimise directly. For example, a classifier may be trained by optimising a square-error measure and assessed using error rate.
Representative data
Finally, we assume that the training data are representative of the test conditions. If this is not so, perhaps because the test conditions may be subject to noise not present in the training data, or there are changes in the population from which the data are drawn (population drift), then these differences must be taken into account in the classifier design.
1.4 Approaches to statistical pattern recognition
There are two main divisions of classification: supervised classification (or discrimination) and unsupervised classification (sometimes in the statistics literature simply referred to as classification, or clustering).
The problem we are addressing in this book is primarily one of supervised pattern classification. Given a set of measurements obtained through observation and represented as a pattern vector x, we wish to assign the pattern to one of C possible classes, ω_i, i = 1, ..., C. A decision rule partitions the measurement space into C regions, Ω_i, i = 1, ..., C. If an observation vector is in Ω_i then it is assumed to belong to class ω_i. Each class region Ω_i may be multiply connected – that is, it may be made up of several disjoint regions. The boundaries between the regions Ω_i are the decision boundaries or decision surfaces. Generally, it is in regions close to these boundaries where the highest proportion of misclassifications occurs. In such situations, we may reject the pattern or withhold a decision until further information is available so that a classification may be made later. This option is known as the reject option and therefore we have C + 1 outcomes of a decision rule (the reject option being denoted by ω_0) in a C class problem: x belongs to ω_1 or ω_2 or ... or ω_C, or withhold a decision.
In unsupervised classification, the data are not labelled and we seek to find groups in the data and the features that distinguish one group from another. Clustering techniques, described further in Chapter 11, can also be used as part of a supervised classification scheme by defining prototypes. A clustering scheme may be applied to the data for each class separately, and representative samples for each group within the class (the group means, for example) used as the prototypes for that class.
In the following section we introduce two approaches to discrimination that will be explored further in later chapters. The first assumes a knowledge of the underlying class-conditional probability density functions (the probability density function of the feature vectors for a given class). Of course, in many applications these will usually be unknown and must be estimated from a set of correctly classified samples termed the design or training set. Chapters 2, 3 and 4 describe techniques for estimating the probability density functions explicitly.

The second approach introduced in the next section develops decision rules that use the data to estimate the decision boundaries directly, without explicit calculation of the probability density functions. This approach is developed in Chapters 5 and 6, where specific techniques are described.
1.5 Elementary decision theory
Here we introduce an approach to discrimination based on knowledge of the probability density functions of each class. Familiarity with basic probability theory is assumed.

1.5.1 Bayes' decision rule for minimum error

Consider C classes, ω_1, ..., ω_C, with a priori probabilities (the probabilities of each class occurring) p(ω_1), ..., p(ω_C), assumed known. If we wish to minimise the probability of making an error and we have no information regarding an object other than the class probability distribution, then we would assign an object to class ω_j if

p(ω_j) > p(ω_k),  k = 1, ..., C; k ≠ j

This classifies all objects as belonging to one class: the class with the largest prior probability. For classes with equal prior probabilities, patterns are assigned arbitrarily between those classes.
prob-However, we do have an observation vector or measurement vector x and we wish to assign an object to one of the C classes based on the measurements x A decision rule based
on probabilities is to assign x (here we refer to an object in terms of its measurement vector)
to classω j if the probability of classω j given the observation x, that is p (ω j |x), is greatest
over all classesω1, , ω C That is, assign x to class ω jif
p (ω j |x) > p(ω k |x) k = 1, ,C; k = j (1.1)
This decision rule partitions the measurement space into C regions 1, , Csuch that if
x ∈ j then x belongs to class ω j The regions jmay be disconnected
The a posteriori probabilities p(ω_j|x) may be expressed in terms of the a priori probabilities and the class-conditional density functions p(x|ω_i) using Bayes' theorem as

p(ω_i|x) = p(x|ω_i)p(ω_i) / p(x)

and so the decision rule (1.1) may be written: assign x to ω_j if

p(x|ω_j)p(ω_j) > p(x|ω_k)p(ω_k),  k = 1, ..., C; k ≠ j    (1.2)

This is known as Bayes' rule for minimum error.
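A short sketch of rule (1.2) follows; the Gaussian class-conditional densities and the priors are invented purely to make the rule executable:

```python
# Bayes' rule for minimum error: assign x to the class maximising
# p(x|omega_j) p(omega_j), here with invented Gaussian densities.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])
params = [(0.0, 1.0), (1.0, 1.0)]          # (mean, std) for each class

def bayes_classify(x):
    scores = [p * norm.pdf(x, loc=m, scale=s)
              for p, (m, s) in zip(priors, params)]
    return int(np.argmax(scores))          # class j maximising p(x|w_j)p(w_j)

print(bayes_classify(0.2), bayes_classify(0.9))   # -> 0 1
```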
For two classes, the decision rule (1.2) may be written

l_r(x) = p(x|ω_1) / p(x|ω_2) > p(ω_2) / p(ω_1)  implies  x ∈ class ω_1
Figure 1.3 p(x|ω_i)p(ω_i) for classes ω_1 and ω_2: for x in region A, x is assigned to class ω_1.
The function l_r(x) is the likelihood ratio. Figures 1.3 and 1.4 give a simple illustration for a two-class discrimination problem. Class ω_1 is normally distributed with zero mean and unit variance, p(x|ω_1) = N(x; 0, 1). Class ω_2 is a normal mixture (a weighted sum of normal densities), p(x|ω_2) = 0.6N(x; 1, 1) + 0.4N(x; −1, 2). Figure 1.3 plots p(x|ω_i)p(ω_i), i = 1, 2, where the priors are taken to be p(ω_1) = 0.5 and p(ω_2) = 0.5. Figure 1.4 plots the likelihood ratio l_r(x) and the threshold p(ω_2)/p(ω_1). We see from this figure that the decision rule (1.2) leads to a disconnected region for class ω_2.
Figure 1.4 Likelihood ratio l_r(x) and the threshold p(ω_2)/p(ω_1): for x in region A, x is assigned to class ω_1.
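The example of Figures 1.3 and 1.4 can be evaluated numerically. In the sketch below, the second parameter of N(·; m, Σ) is taken to be a variance (so N(x; −1, 2) has standard deviation √2), consistent with the notation list at the front of the book; the test points are arbitrary:

```python
# Likelihood ratio for the two-class example of Figures 1.3 and 1.4:
# p(x|w1) = N(x; 0, 1), p(x|w2) = 0.6 N(x; 1, 1) + 0.4 N(x; -1, 2),
# with priors p(w1) = p(w2) = 0.5.
import numpy as np
from scipy.stats import norm

def likelihood_ratio(x):
    p1 = norm.pdf(x, 0.0, 1.0)
    p2 = 0.6 * norm.pdf(x, 1.0, 1.0) + 0.4 * norm.pdf(x, -1.0, np.sqrt(2.0))
    return p1 / p2

threshold = 0.5 / 0.5   # p(w2) / p(w1)
for x in (-3.0, 0.0, 2.0):
    lr = likelihood_ratio(x)
    print(f"x = {x:+.1f}: l_r(x) = {lr:.3f} -> "
          f"{'w1' if lr > threshold else 'w2'}")
```

Note that both x = −3 and x = 2 are assigned to ω_2, illustrating the disconnected class-ω_2 region.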
The fact that the decision rule (1.2) minimises the error may be seen as follows. The probability of making an error, p(error), may be expressed as

p(error) = Σ_{i=1}^{C} p(error|ω_i)p(ω_i)    (1.3)

where p(error|ω_i) is the probability of misclassifying a pattern from class ω_i, namely the probability that such a pattern falls outside the region Ω_i,

p(error|ω_i) = ∫_{x∉Ω_i} p(x|ω_i) dx    (1.4)

Therefore,

p(error) = Σ_{i=1}^{C} p(ω_i) ∫_{x∉Ω_i} p(x|ω_i) dx = 1 − Σ_{i=1}^{C} p(ω_i) ∫_{Ω_i} p(x|ω_i) dx    (1.5)

so minimising the probability of error is equivalent to maximising

Σ_{i=1}^{C} p(ω_i) ∫_{Ω_i} p(x|ω_i) dx    (1.6)

which is the probability of correct classification. Therefore, we wish to choose the regions Ω_i so that the integral given in (1.6) is a maximum. This is achieved by selecting Ω_i to be the region for which p(ω_i)p(x|ω_i) is the largest over all classes, giving the probability of correct classification

c = ∫ max_i p(ω_i)p(x|ω_i) dx    (1.7)

and the minimum error (the Bayes error)

e_B = 1 − ∫ max_i p(ω_i)p(x|ω_i) dx    (1.8)

This is illustrated in Figures 1.5 and 1.6.
Figure 1.5 Class-conditional densities for two normal distributions.
Figure 1.5 plots the two distributions p(x|ω_i), i = 1, 2 (both normal with unit variance and means ±0.5), and Figure 1.6 plots the functions p(x|ω_i)p(ω_i), where p(ω_1) = 0.3 and p(ω_2) = 0.7. The Bayes' decision boundary, defined by the point where p(x|ω_1)p(ω_1) = p(x|ω_2)p(ω_2) (Figure 1.6), is marked with a vertical line at x_B. The areas of the hatched regions in Figure 1.5 represent the probability of error: by Equation (1.4), the area of the horizontal hatching is the probability of classifying a pattern from class 1 as a pattern from class 2, and the area of the vertical hatching the probability of classifying a pattern from class 2 as class 1. The sum of these two areas, weighted by the priors [Equation (1.5)], is the probability of making an error.
Figure 1.6 The products p(x|ω_i)p(ω_i), with the Bayes' decision boundary marked at x_B.
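The numbers behind Figures 1.5 and 1.6 can be checked directly. With equal variances, equating p(x|ω_1)p(ω_1) and p(x|ω_2)p(ω_2) and taking logarithms gives a linear equation for x_B; the sketch below solves it and evaluates the two hatched-area error contributions:

```python
# Bayes' boundary and error for N(-0.5, 1) vs N(0.5, 1),
# with priors p(w1) = 0.3 and p(w2) = 0.7 (Figures 1.5 and 1.6).
import numpy as np
from scipy.stats import norm

p1, p2 = 0.3, 0.7
m1, m2 = -0.5, 0.5

# Equal unit variances: the boundary has a closed form.
x_B = 0.5 * (m1 + m2) + np.log(p1 / p2) / (m2 - m1)

# Hatched areas of Figure 1.5, weighted by the priors [Equation (1.5)]:
# class-1 patterns falling above x_B plus class-2 patterns falling below.
err = p1 * (1.0 - norm.cdf(x_B, m1, 1.0)) + p2 * norm.cdf(x_B, m2, 1.0)
print(f"x_B = {x_B:.3f}, Bayes error = {err:.3f}")
```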
1.5.2 Bayes' decision rule for minimum error – reject option
As we have stated above, an error or misrecognition occurs when the classifier assigns a pattern to one class when it actually belongs to another. In this section we consider the reject option. Usually it is the uncertain classifications (often close to the decision boundaries) that contribute mainly to the error rate. Therefore, rejecting a pattern (withholding a decision) may lead to a reduction in the error rate. This rejected pattern may be discarded, or set aside until further information allows a decision to be made. Although the option to reject may alleviate or remove the problem of a high misrecognition rate, some otherwise correct classifications are also converted into rejects. Here we consider the trade-offs between error rate and reject rate.
First, we partition the sample space into two complementary regions: R, a reject region, and A, an acceptance or classification region. These are defined by

R = { x : 1 − max_i p(ω_i|x) > t }
A = { x : 1 − max_i p(ω_i|x) ≤ t }

where t is a threshold. The smaller the value of the threshold t, the larger is the reject region R. However, if t is chosen such that

t ≥ (C − 1)/C
where C is the number of classes, then the reject region is empty. This is because the minimum value which max_i p(ω_i|x) can attain is 1/C [since 1 = Σ_{j=1}^{C} p(ω_j|x) ≤ C max_i p(ω_i|x)], so that 1 − max_i p(ω_i|x) cannot exceed (C − 1)/C.
The probability of correct classification, c(t), is a function of the threshold, t, and is given by Equation (1.7), where now the integral is over the acceptance region, A, only:

c(t) = ∫_A max_i p(ω_i)p(x|ω_i) dx

and the unconditional probability of rejecting a measurement, r(t), also a function of the threshold t, is the probability that it lies in R:

r(t) = ∫_R p(x) dx
Thus, the error rate (the probability of accepting a pattern for classification and classifying it incorrectly) is e(t) = 1 − c(t) − r(t), and the error rate and reject rate are inversely related. Chow (1970) derives a simple functional relationship between e(t) and r(t), which we quote here without proof. Knowing r(t) over the complete range of t allows e(t) to be calculated using the relationship

e(t) = − ∫_0^t s dr(s)

The above result allows the error rate to be evaluated from the reject function for the Bayes' optimum classifier. The reject function can be calculated using unlabelled data, and a practical application of the above result is to problems where labelling of gathered data is costly.
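The error–reject trade-off can be illustrated by simulation. The sketch below reuses the two-class example of Figure 1.5, rejects a sample when 1 − max_i p(ω_i|x) exceeds t, and reports the empirical reject and (unconditional) error rates as t decreases; all of the sampling details are assumptions made for illustration:

```python
# Empirical error-reject trade-off for the example of Figure 1.5:
# classes N(-0.5, 1) with prior 0.3 and N(0.5, 1) with prior 0.7.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 100_000
z = (rng.random(n) < 0.7).astype(int)                # true class (1 = w2)
x = rng.normal(np.where(z == 1, 0.5, -0.5), 1.0)     # sampled patterns

# Posterior probability of class w2, then the maximum posterior.
num = 0.7 * norm.pdf(x, 0.5, 1.0)
post2 = num / (num + 0.3 * norm.pdf(x, -0.5, 1.0))
post_max = np.maximum(post2, 1.0 - post2)
decision = (post2 >= 0.5).astype(int)                # Bayes decision

for t in (0.5, 0.3, 0.1):
    accept = (1.0 - post_max) <= t                   # acceptance region A
    r = 1.0 - accept.mean()                          # reject rate r(t)
    e = ((decision != z) & accept).mean()            # unconditional error e(t)
    print(f"t = {t:.1f}: r(t) = {r:.3f}, e(t) = {e:.3f}")
```

At t = 0.5 nothing is rejected and e(t) approximates the Bayes error; as t falls, the reject rate rises and the error rate falls, illustrating the inverse relationship.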
1.5.3 Bayes' decision rule for minimum risk

In the previous section, the decision rule selected the class for which the a posteriori probability, p(ω_j|x), was the greatest. This minimised the probability of making an error. We now consider a somewhat different rule that minimises an expected loss or risk. This is a very important concept since in many applications the costs associated with misclassification depend upon the true class of the pattern and the class to which it is assigned. For example, in a medical diagnosis problem in which a patient has back pain, it is far worse to classify a patient with severe spinal abnormality as healthy (or as having mild back ache) than the other way round.
We make this concept more formal by introducing a loss that is a measure of the cost of making the decision that a pattern belongs to class ω_i when the true class is ω_j. We define a loss matrix Λ with components

λ_ji = cost of assigning a pattern x to ω_i when x ∈ ω_j

In practice, it may be very difficult to assign costs. In some situations, λ may be measured in monetary units that are quantifiable. However, in many situations, costs are a combination of several different factors measured in different units – money, time, quality of life. As a consequence, they are often a subjective opinion of an expert. The conditional risk of assigning a pattern x to class ω_i is defined as

l_i(x) = Σ_{j=1}^{C} λ_ji p(ω_j|x)
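A short sketch of the conditional risk computation follows; the loss matrix values and the posterior probabilities are invented (a made-up two-class medical example in which a missed diagnosis costs ten times a false alarm):

```python
# Minimum-risk decision: choose the class minimising the conditional risk
# l_i(x) = sum_j lambda_ji p(omega_j | x).
import numpy as np

# lam[j, i] = cost of deciding class i when the true class is j.
lam = np.array([[0.0, 1.0],     # true class 0 (healthy):  correct, false alarm
                [10.0, 0.0]])   # true class 1 (diseased): missed case, correct

posterior = np.array([0.8, 0.2])    # p(omega_j | x) for some pattern x

risk = posterior @ lam               # risk[i] = sum_j lam[j, i] p(omega_j | x)
print("conditional risks:", risk)    # -> [2.0, 0.8]
print("minimum-risk decision: class", int(np.argmin(risk)))
# Although class 0 has the higher posterior, the asymmetric losses make
# deciding class 1 the lower-risk action.
```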