STATISTICAL PATTERN RECOGNITION
Third Edition
Andrew R. Webb, Keith D. Copsey
Mathematics and Data Analysis Consultancy, Malvern, UK
Statistical pattern recognition relates to the use of statistical techniques for analysing data measurements in order to extract information and make justified decisions. It is a very active area of study and research, which has seen many advances in recent years. Applications such as data mining, web searching, multimedia data retrieval, face recognition and cursive handwriting recognition all require robust and efficient pattern recognition techniques.
This third edition provides an introduction to statistical pattern recognition theory and techniques, with material drawn from a wide range of fields, including engineering, statistics, computer science and the social sciences. The book has been updated to cover new methods and applications, and includes a wide range of techniques such as Bayesian methods, neural networks, support vector machines, feature selection and feature reduction techniques. Technical descriptions and motivations are provided, and the techniques are illustrated using real examples.
Statistical Pattern Recognition, Third Edition:
• Provides a self-contained introduction to statistical pattern recognition
• Includes new material presenting the analysis of complex networks
• Introduces readers to methods for Bayesian density estimation
• Presents descriptions of new applications in biometrics, security, finance and
condition monitoring
• Provides descriptions and guidance for implementing techniques, which will be
invaluable to software engineers and developers seeking to develop real applications
• Describes mathematically the range of statistical pattern recognition techniques
• Presents a variety of exercises including more extensive computer projects
The in-depth technical descriptions make this book suitable for senior undergraduate and graduate students in statistics, computer science and engineering. Statistical Pattern Recognition is also an excellent reference source for technical professionals. Chapters have been arranged to facilitate implementation of the techniques by software engineers and developers in non-statistical engineering fields.
www.wiley.com/go/statistical_pattern_recognition
Statistical Pattern Recognition
Third Edition
Andrew R. Webb, Keith D. Copsey
Mathematics and Data Analysis Consultancy, Malvern, UK
A John Wiley & Sons, Ltd., Publication
This edition first published 2011
© 2011 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Webb, A. R. (Andrew R.)
Statistical pattern recognition / Andrew R. Webb, Keith D. Copsey. – 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-68227-2 (hardback) – ISBN 978-0-470-68228-9 (paper)
1. Pattern perception–Statistical methods. I. Copsey, Keith D. II. Title.
To Rosemary, Samuel, Miriam, Jacob and Ethan
Preface

This book provides an introduction to statistical pattern recognition theory and techniques. Most of the material presented in this book is concerned with discrimination and classification and has been drawn from a wide range of literature including that of engineering, statistics, computer science and the social sciences. The aim of the book is to provide descriptions of many of the most useful of today's pattern processing techniques, including many of the recent advances in nonparametric approaches to discrimination and Bayesian computational methods developed in the statistics literature and elsewhere. Discussions provided on the motivations and theory behind these techniques will enable the practitioner to gain maximum benefit from their implementations within many of the popular software packages. The techniques are illustrated with examples of real-world application studies. Pointers are also provided to the diverse literature base where further details on applications, comparative studies and theoretical developments may be obtained.

The book grew out of our research on the development of statistical pattern recognition methodology and its application to practical sensor data analysis problems. The book is aimed at advanced undergraduate and graduate courses. Some of the material has been presented as part of a graduate course on pattern recognition and at pattern recognition summer schools. It is also designed for practitioners in the field of pattern recognition as well as researchers in the area. A prerequisite is a knowledge of basic probability theory and linear algebra, together with basic knowledge of mathematical methods (for example, Lagrange multipliers are used to solve problems with equality and inequality constraints in some derivations). Some basic material (which was provided as appendices in the second edition) is available on the book's website.
Scope
The book presents most of the popular methods of statistical pattern recognition. However, many of the important developments in pattern recognition are not confined to the statistics literature and have occurred where the area overlaps with research in machine learning. Therefore, where we have felt that straying beyond the traditional boundaries of statistical pattern recognition would be beneficial, we have done so. An example is the inclusion of some rule induction methods as a complementary approach to rule discovery by decision tree induction.

Most of the methodology is generic – it is not specific to a particular type of data or application. Thus, we exclude preprocessing methods and filtering methods commonly used in signal and image processing.
Approach
The approach in each chapter has been to introduce some of the basic concepts and algorithms and to conclude each section on a technique or a class of techniques with a practical application of the approach from the literature. The main aim has been to introduce the basic concept of an approach. Sometimes this has required some detailed mathematical description and clearly we have had to draw a line on how much depth we discuss a particular topic. Most of the topics have whole books devoted to them and so we have had to be selective in our choice of material. Therefore, the chapters conclude with a section on the key references. The exercises at the ends of the chapters vary from 'open book' questions to more lengthy computer projects.
New to the third edition
Many sections have been rewritten and new material added. The new features of this edition include the following:

• expanded material on Bayesian sampling schemes and Markov chain Monte Carlo methods, and new sections on sequential Monte Carlo samplers and variational Bayes approaches;
• rule induction;
• a new chapter on ensemble methods of classification;
• revision of the feature selection material, with a new section on stability;
• spectral clustering;
• a new chapter on complex networks, with relevance to social and computer network analysis.
Book outline
Chapter 1 provides an introduction to statistical pattern recognition, defining some terminology and introducing supervised and unsupervised classification. Two related approaches to supervised classification are presented: one based on the use of probability density functions and a second based on the construction of discriminant functions. The chapter concludes with an outline of the pattern recognition cycle, putting the remaining chapters of the book into context. Chapters 2, 3 and 4 pursue the density function approach to discrimination. Chapter 2 addresses parametric approaches to density estimation, which are developed further in Chapter 3 on Bayesian methods. Chapter 4 develops classifiers based on nonparametric schemes, including the popular k-nearest-neighbour method, with associated efficient search algorithms.
Chapters 5–7 develop discriminant function approaches to supervised classification. Chapter 5 focuses on linear discriminant functions; much of the methodology of this chapter (including optimisation, regularisation and support vector machines) is used in some of the nonlinear methods described in Chapter 6, which explores kernel-based methods (in particular, the radial basis function network and the support vector machine) and projection-based methods (the multilayer perceptron). These are commonly referred to as neural network methods. Chapter 7 considers approaches to discrimination that enable the classification function to be cast in the form of an interpretable rule, important for some applications.

Chapter 8 considers ensemble methods – combining classifiers for improved robustness. Chapter 9 considers methods of measuring the performance of a classifier.

The techniques of Chapters 10 and 11 may be described as methods of exploratory data analysis or preprocessing (and as such would usually be carried out prior to the supervised classification techniques of Chapters 5–7, although they could, on occasion, be post-processors of supervised techniques). Chapter 10 addresses feature selection and feature extraction – the procedures for obtaining a reduced set of variables characterising the original data. Such procedures are often an integral part of classifier design and it is somewhat artificial to partition the pattern recognition problem into separate processes of feature extraction and classification. However, feature extraction may provide insights into the data structure and the type of classifier to employ; thus, it is of interest in its own right. Chapter 11 considers unsupervised classification or clustering – the process of grouping individuals in a population to discover the presence of structure; its engineering application is to vector quantisation for image and speech coding. Chapter 12 on complex networks introduces methods for analysing data that may be represented using the mathematical concept of a graph. This has great relevance to social and computer networks.

Finally, Chapter 13 addresses some important diverse topics, including model selection.
Book website
The book's website (www.wiley.com/go/statistical_pattern_recognition) contains supplementary material on topics including measures of dissimilarity, estimation, linear algebra, data analysis and basic probability.
Acknowledgements
In preparing the third edition of this book we have been helped by many people. We are especially grateful to Dr Gavin Cawley, University of East Anglia, for help and advice. We are grateful to friends and colleagues (past and present, from RSRE, DERA and QinetiQ) who have provided encouragement and made comments on various parts of the manuscript. In particular, we would like to thank Anna Skeoch for providing figures for Chapter 12, and Richard Davies and colleagues at John Wiley for help in the final production of the manuscript. Andrew Webb is especially thankful to Rosemary for her love, support and patience.

Andrew R. Webb
Keith D. Copsey
Notation

Some of the more commonly used notation is given below. We have used some notational conveniences. For example, we have tended to use the same symbol for a variable as well as a measurement on that variable. The meaning should be obvious from context. Also, we denote the density function of x as p(x) and of y as p(y), even though the functions differ. A vector is denoted by a lower case quantity in bold face, and a matrix by upper case. Since pattern recognition is very much a multidisciplinary subject, it is impossible to be both consistent across all chapters and consistent with the commonly used notation in the different literatures. We have adopted the policy of maintaining consistency as far as possible within a given chapter.
p(x) = ∂P/∂x    probability density function
p(x|ω_j)    probability density function of class ω_j
z_ji    z_ji = 1 if x_i ∈ ω_j, 0 otherwise; n_j = number of patterns in ω_j, n_j = Σ_{i=1}^{n} z_ji
N(x; m, Σ)    probability density function for the normal distribution with mean m and covariance matrix Σ, evaluated at x
1 Introduction to statistical pattern recognition

1.1 Statistical pattern recognition
We live in a world where massive amounts of data are collected and recorded on nearly every aspect of human endeavour: for example, banking, purchasing (credit-card usage, point-of-sale data analysis), Internet transactions, performance monitoring (of schools, hospitals, equipment) and communications. The data come in a wide variety of diverse forms – numeric, textual (structured or unstructured), audio and video signals. Understanding and making sense of this vast and diverse collection of data (identifying patterns, trends, anomalies, providing summaries) requires some automated procedure to assist the analyst with this 'data deluge'. A practical example of pattern recognition that is familiar to many people is classifying email messages (as spam/not spam) based upon message header, content and sender.

Approaches for analysing such data include those for signal processing, filtering, data summarisation, dimension reduction, variable selection, regression and classification, and have been developed in several literatures (physics, mathematics, statistics, engineering, artificial intelligence, computer science and the social sciences, among others). The main focus of this book is on pattern recognition procedures, providing a description of basic techniques together with case studies of practical applications of the techniques on real-world problems.
A strong emphasis is placed on the statistical theory of discrimination, but clustering also receives some attention. Thus, the main subject matter of this book can be summed up in a single word: 'classification', both supervised (using class information to design a classifier – i.e. discrimination) and unsupervised (allocating to groups without class information – i.e. clustering). However, in recent years many complex datasets have been gathered (for example, 'transactions' between individuals – email traffic, purchases). Understanding these datasets requires additional tools in the pattern recognition toolbox. Therefore, we also examine developments such as methods for analysing data that may be represented as a graph.

Pattern recognition as a field of study developed significantly in the 1960s. It was very much an interdisciplinary subject. Some people entered the field with a real problem to solve. The large number of applications, ranging from the classical ones such as automatic character recognition and medical diagnosis to the more recent ones in data mining (such as credit scoring, consumer sales analysis and credit card transaction analysis), have attracted considerable research effort, with many methods developed and advances made. Other researchers were motivated by the development of machines with 'brain-like' performance, that in some way could operate giving human performance.
Within these areas significant progress has been made, particularly where the domain overlaps with probability and statistics, and in recent years there have been many exciting new developments, both in methodology and applications. These build on the solid foundations of earlier research and take advantage of increased computational resources readily available nowadays. These developments include, for example, kernel-based methods (including support vector machines) and Bayesian computational methods.

The topics in this book could easily have been described under the term machine learning, which describes the study of machines that can adapt to their environment and learn from example. The machine learning emphasis is perhaps more on computationally intensive methods and less on a statistical approach, but there is strong overlap between the research areas of statistical pattern recognition and machine learning.
Since many of the techniques we shall describe have been developed over a range of diverse disciplines, there is naturally a variety of sometimes contradictory terminology. We shall use the term 'pattern' to denote the p-dimensional data vector x = (x_1, ..., x_p)^T of measurements (T denotes vector transpose), whose components x_i are measurements of the features of an object. Thus the features are the variables specified by the investigator and thought to be important for classification. In discrimination, we assume that there exist C groups or classes, denoted ω_1, ..., ω_C, and associated with each pattern x is a categorical variable z that denotes the class or group membership; that is, if z = i, then the pattern belongs to ω_i, i ∈ {1, ..., C}.

Examples of patterns are measurements of an acoustic waveform in a speech recognition problem; measurements on a patient made in order to identify a disease (diagnosis); measurements on patients (perhaps subjective assessments) in order to predict the likely outcome (prognosis); measurements on weather variables (for forecasting or prediction); sets of financial measurements recorded over time; and a digitised image for character recognition. Therefore, we see that the term 'pattern', in its technical meaning, does not necessarily refer to structure within images.
Figure 1.1 Pattern classifier.
The main topic in this book may be described by a number of terms, including pattern classifier design or discrimination or allocation rule design. Designing the rule requires specification of the parameters of a pattern classifier, represented schematically in Figure 1.1, so that it yields the optimal (in some sense) response for a given input pattern. This response is usually an estimate of the class to which the pattern belongs. We assume that we have a set of patterns of known class {(x_i, z_i), i = 1, ..., n} (the training or design set) that we use to design the classifier (to set up its internal parameters). Once this has been done, we may estimate class membership for a pattern x for which the class label is unknown. Learning the model from a training set is the process of induction; applying the trained model to patterns of unknown class is the process of deduction.
Thus, the uses of a pattern classifier are to provide:

• a descriptive model that explains the difference between patterns of different classes in terms of features and their measurements;
• a predictive model that predicts the class of an unlabelled pattern.
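The induction and deduction steps described above can be made concrete with a short sketch. The nearest-class-mean classifier used here is chosen purely for illustration (it is not a method prescribed at this point in the text), and the data are invented:

```python
# A minimal sketch of the induction/deduction cycle, using a toy
# nearest-class-mean classifier on invented two-dimensional data.
import numpy as np

rng = np.random.default_rng(0)
# Design (training) set: patterns x_i with known class labels z_i.
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # class 0 patterns
                     rng.normal(2.0, 1.0, (50, 2))])  # class 1 patterns
z_train = np.array([0] * 50 + [1] * 50)

# Induction: set the classifier's internal parameters (here, class means).
means = np.array([X_train[z_train == c].mean(axis=0) for c in (0, 1)])

# Deduction: estimate class membership for a pattern of unknown class.
x_new = np.array([1.8, 2.1])
z_hat = int(np.argmin(((means - x_new) ** 2).sum(axis=1)))
print("predicted class:", z_hat)
```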
However, we might ask why do we need a predictive model? Cannot the procedure that was used to assign labels to the training set measurements also be used for the test set in classifier operation? There may be several reasons for developing an automated process:

• to remove humans from the recognition process – to make the process more reliable;
• in banking, to identify good risk applicants before making a loan;
• to make a medical diagnosis without a post mortem (or to assess the state of a piece of equipment without dismantling it) – sometimes a pattern may only be labelled through intensive examination of a subject, whether person or piece of equipment;
• to label patterns where manual labelling would be a time consuming process;
• to operate in conditions that would be harmful to humans, when the training data have been gathered under controlled conditions;
• to operate remotely – to classify crops and land use remotely without labour-intensive, time consuming surveys.
There are many classifiers that can be constructed from a given dataset. Examples include decision trees, neural networks, support vector machines and linear discriminant functions. For a classifier of a given type, we employ a learning algorithm to search through the parameter space to find the model that best describes the relationship between the measurements and class labels for the training set. The form derived for the pattern classifier depends on a number of different factors. It depends on the distribution of the training data, and the assumptions made concerning its distribution. Another important factor is the misclassification cost – the cost of making an incorrect decision. In many applications misclassification costs are hard to quantify, being combinations of several contributions such as monetary costs, time and other more subjective costs. For example, in a medical diagnosis problem, each treatment has different costs associated with it. These relate to the expense of different types of drugs, the suffering the patient is subjected to by each course of action and the risk of further complications.
Figure 1.1 grossly oversimplifies the pattern classification procedure. Data may undergo several separate transformation stages before a final outcome is reached. These transformations (sometimes termed preprocessing, feature selection or feature extraction) operate on the data in a way that, usually, reduces its dimension (reduces the number of features), removing redundant or irrelevant information, and transforms it to a form more appropriate for subsequent classification. The term intrinsic dimensionality refers to the minimum number of variables required to capture the structure within the data. In speech recognition, a preprocessing stage may be to transform the waveform to a frequency representation. This may be processed further to find formants (peaks in the spectrum). This is a feature extraction process (taking a possibly nonlinear combination of the original variables to form new variables). Feature selection is the process of selecting a subset of a given set of variables (see Chapter 10). In some problems, there is no automatic feature selection stage, with the feature selection being performed by the investigator who 'knows' (through experience, knowledge of previous studies and the problem domain) those variables that are important for classification. In many cases, however, it will be necessary to perform one or more transformations of the measured data.

In some pattern classifiers, each of the above stages may be present and identifiable as separate operations, while in others they may not be. Also, in some classifiers, the preliminary stages will tend to be problem specific, as in the speech example. In this book, we consider feature selection and extraction transformations that are not application specific. That is not to say the methods of feature transformation described will be suitable for any given application, however; application-specific preprocessing must be left to the investigator who understands the application domain and method of data collection.
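As an illustration of the distinction drawn above, the following sketch contrasts feature selection (retaining a subset of the measured variables) with a simple linear feature extraction (principal components, one of the methods treated in Chapter 10); the data and the choice of two dimensions are arbitrary assumptions:

```python
# Feature selection versus feature extraction on invented data:
# selection keeps original variables; extraction forms new ones.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))       # 100 patterns, p = 5 features

# Feature selection: retain a subset of the measured variables.
X_sel = X[:, [0, 2]]                # keep features 0 and 2 only

# Feature extraction: project onto the two leading principal axes,
# i.e. new variables that are linear combinations of the originals.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_ext = Xc @ Vt[:2].T

print(X_sel.shape, X_ext.shape)     # both reduced to two dimensions
```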
1.2 Stages in a pattern recognition problem
A pattern recognition investigation may consist of several stages, enumerated below. Not all stages may be present; some may be merged together so that the distinction between two operations may not be clear, even if both are carried out; there may be some application-specific data processing that may not be regarded as one of the stages listed below. However, the points below are fairly typical.

1. Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.

2. Data collection: making measurements on appropriate variables and recording details of the data collection procedure (ground truth).
3. Initial examination of the data: checking the data, calculating summary statistics and producing plots in order to get a feel for the structure.

4. Feature selection or feature extraction: selecting variables from the measured set that are appropriate for the task. These new variables may be obtained by a linear or nonlinear transformation of the original set (feature extraction). To some extent, the partitioning of the data processing into separate feature extraction and classification processes is artificial, since a classifier often includes the optimisation of a feature extraction stage as part of its design.

5. Unsupervised pattern classification or clustering. This may be viewed as exploratory data analysis and it may provide a successful conclusion to a study. On the other hand, it may be a means of preprocessing the data for a supervised classification procedure.

6. Apply discrimination or regression procedures as appropriate. The classifier is designed using a training set of exemplar patterns.

7. Assessment of results. This may involve applying the trained classifier to an independent test set of labelled patterns. Classification performance is often summarised in the form of a confusion matrix: a C × C table whose (i, j)th entry, e_ij, is the number of patterns of class ω_j that are predicted to be class ω_i. The accuracy, a, is calculated from the confusion matrix as

a = Σ_{i=1}^{C} e_ii / Σ_{i=1}^{C} Σ_{j=1}^{C} e_ij

that is, the proportion of test patterns that are correctly classified.
The emphasis of this book is on techniques for performing steps 4, 5, 6 and 7.
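As a concrete illustration of step 7, the sketch below builds the confusion matrix with entries e_ij (patterns of class ω_j predicted as class ω_i) and computes the accuracy a from it; the labels used are invented:

```python
# Confusion matrix and accuracy for a small invented test set.
import numpy as np

def confusion_matrix(z_true, z_pred, C):
    """e[i, j] = number of class-j patterns predicted to be class i."""
    e = np.zeros((C, C), dtype=int)
    for true, pred in zip(z_true, z_pred):
        e[pred, true] += 1
    return e

z_true = np.array([0, 0, 1, 1, 2, 2, 2])   # ground-truth labels
z_pred = np.array([0, 1, 1, 1, 2, 0, 2])   # classifier output
e = confusion_matrix(z_true, z_pred, C=3)

a = np.trace(e) / e.sum()   # diagonal entries are correct classifications
print(e)
print(f"accuracy a = {a:.3f}")
```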
1.3 Issues
The main topic that we address in this book concerns classifier design: given a training set of patterns of known class, we seek to use those examples to design a classifier that is optimal for the expected operating conditions (the test conditions).

There are a number of very important points to make about this design process.
Finite design set
We are given a finite design set. If the classifier is too complex (there are too many free parameters) it may model noise in the design set. This is an example of overfitting. If the classifier is not complex enough, then it may fail to capture structure in the data. An illustration of this is the fitting of a set of data points by a polynomial curve (Figure 1.2). If the degree of the polynomial is too high then, although the curve may pass through or close to the data points, thus achieving a low fitting error, the fitting curve is very variable and models every fluctuation in the data (due to noise). If the degree of the polynomial is too low, the fitting error is large and the underlying variability of the curve is not modelled (the model underfits the data). Thus, achieving optimal performance on the design set (in terms of minimising some error criterion perhaps) is not required: it may be possible, in a classification problem, to achieve 100% classification accuracy on the design set, but the generalisation performance – the expected performance on data representative of the true operating conditions (equivalently, the performance on an infinite test set of which the design set is a sample) – is poorer than could be achieved by careful design. Choosing the 'right' model is an exercise in model selection.

In practice we usually do not know what is structure and what is noise in the data. Also, training a classifier (the procedure of determining its parameters) should not be considered as a separate issue from model selection, but it often is.
Figure 1.2 Fitting a curve to a noisy set of samples: the data samples are from a quadratic function with added noise; the fitting curves are a linear fit, a quadratic fit and a high-degree polynomial.
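The experiment of Figure 1.2 is easy to reproduce. The sketch below fits polynomials of degree 1, 2 and 10 to noisy samples of a quadratic; the particular quadratic, noise level and sample size are arbitrary assumptions. The fitting error falls as the degree rises, but the high-degree fit is modelling noise, not structure:

```python
# Under- and overfitting: polynomial fits to noisy quadratic data.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
y = 2.0 * x**2 - x + rng.normal(0.0, 0.05, x.size)   # quadratic plus noise

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)
    residual = y - np.polyval(coeffs, x)
    # A low fitting error on the design data does not imply good
    # generalisation performance on new data.
    print(f"degree {degree:2d}: design-set fitting error "
          f"= {np.sum(residual**2):.5f}")
```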
Optimality
A second point about the design of optimal classifiers concerns the word 'optimal'. There are several ways of measuring classifier performance, the most common being error rate, although this has severe limitations (see Chapter 9). Other measures, based on the closeness of the estimates of the probabilities of class membership to the true probabilities, may be more appropriate in many cases. However, many classifier design methods usually optimise alternative criteria since the desired ones are difficult to optimise directly. For example, a classifier may be trained by optimising a square-error measure and assessed using error rate.
Representative data
Finally, we assume that the training data are representative of the test conditions. If this is not so, perhaps because the test conditions may be subject to noise not present in the training data, or there are changes in the population from which the data are drawn (population drift), then these differences must be taken into account in the classifier design.
1.4 Approaches to statistical pattern recognition
There are two main divisions of classification: supervised classification (or discrimination) and unsupervised classification (sometimes in the statistics literature simply referred to as classification, or clustering).
The problem we are addressing in this book is primarily one of supervised pattern classification. Given a set of measurements obtained through observation and represented as a pattern vector x, we wish to assign the pattern to one of C possible classes, ω_i, i = 1, ..., C. A decision rule partitions the measurement space into C regions, Ω_i, i = 1, ..., C. If an observation vector is in Ω_i then it is assumed to belong to class ω_i. Each class region Ω_i may be multiply connected – that is, it may be made up of several disjoint regions. The boundaries between the regions Ω_i are the decision boundaries or decision surfaces. Generally, it is in regions close to these boundaries where the highest proportion of misclassifications occurs. In such situations, we may reject the pattern or withhold a decision until further information is available so that a classification may be made later. This option is known as the reject option and therefore we have C + 1 outcomes of a decision rule (the reject option being denoted by ω_0) in a C class problem: x belongs to ω_1 or ω_2 or ... or ω_C, or withhold a decision.
In unsupervised classification, the data are not labelled and we seek to find groups in the data and the features that distinguish one group from another. Clustering techniques, described further in Chapter 11, can also be used as part of a supervised classification scheme by defining prototypes. A clustering scheme may be applied to the data for each class separately, and representative samples for each group within the class (the group means, for example) used as the prototypes for that class.
In the following section we introduce two approaches to discrimination that will be explored further in later chapters. The first assumes a knowledge of the underlying class-conditional probability density functions (the probability density function of the feature vectors for a given class). Of course, in many applications these will usually be unknown and must be estimated from a set of correctly classified samples termed the design or training set. Chapters 2, 3 and 4 describe techniques for estimating the probability density functions explicitly.

The second approach introduced in the next section develops decision rules that use the data to estimate the decision boundaries directly, without explicit calculation of the probability density functions. This approach is developed in Chapters 5 and 6, where specific techniques are described.
1.5 Elementary decision theory
Here we introduce an approach to discrimination based on knowledge of the probability density functions of each class. Familiarity with basic probability theory is assumed.

1.5.1 Bayes' decision rule for minimum error

Consider C classes, ω_1, ..., ω_C, with a priori probabilities (the probabilities of each class occurring) p(ω_1), ..., p(ω_C), assumed known. If we wish to minimise the probability of making an error and we have no information regarding an object other than the class probability distribution, then we would assign an object to class ω_j if

p(ω_j) > p(ω_k),  k = 1, ..., C; k ≠ j

This classifies all objects as belonging to one class: the class with the largest prior probability. For classes with equal prior probabilities, patterns are assigned arbitrarily between those classes.
prob-However, we do have an observation vector or measurement vector x and we wish to assign an object to one of the C classes based on the measurements x A decision rule based
on probabilities is to assign x (here we refer to an object in terms of its measurement vector)
to classω j if the probability of classω j given the observation x, that is p (ω j |x), is greatest
over all classesω1, , ω C That is, assign x to class ω jif
p (ω j |x) > p(ω k |x) k = 1, ,C; k = j (1.1)
This decision rule partitions the measurement space into C regions 1, , Csuch that if
x ∈ j then x belongs to class ω j The regions jmay be disconnected
The a posteriori probabilities p(ω_j|x) may be expressed in terms of the a priori probabilities and the class-conditional density functions p(x|ω_i) using Bayes' theorem as

p(ω_i|x) = p(x|ω_i)p(ω_i) / p(x)

and so the decision rule (1.1) may be written: assign x to ω_j if

p(x|ω_j)p(ω_j) > p(x|ω_k)p(ω_k),  k = 1, ..., C; k ≠ j    (1.2)

This is known as Bayes' rule for minimum error.
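A short sketch of rule (1.2) follows; the Gaussian class-conditional densities and the priors are invented purely to make the rule executable:

```python
# Bayes' rule for minimum error: assign x to the class maximising
# p(x|omega_j) p(omega_j), here with invented Gaussian densities.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])
params = [(0.0, 1.0), (1.0, 1.0)]          # (mean, std) for each class

def bayes_classify(x):
    scores = [p * norm.pdf(x, loc=m, scale=s)
              for p, (m, s) in zip(priors, params)]
    return int(np.argmax(scores))          # class j maximising p(x|w_j)p(w_j)

print(bayes_classify(0.2), bayes_classify(0.9))   # -> 0 1
```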
For two classes, the decision rule (1.2) may be written

l_r(x) = p(x|ω_1) / p(x|ω_2) > p(ω_2) / p(ω_1)  implies  x ∈ class ω_1
Figure 1.3 p(x|ω_i)p(ω_i) for classes ω_1 and ω_2: for x in region A, x is assigned to class ω_1.
The function l_r(x) is the likelihood ratio. Figures 1.3 and 1.4 give a simple illustration for a two-class discrimination problem. Class ω_1 is normally distributed with zero mean and unit variance, p(x|ω_1) = N(x; 0, 1). Class ω_2 is a normal mixture (a weighted sum of normal densities), p(x|ω_2) = 0.6N(x; 1, 1) + 0.4N(x; −1, 2). Figure 1.3 plots p(x|ω_i)p(ω_i), i = 1, 2, where the priors are taken to be p(ω_1) = 0.5 and p(ω_2) = 0.5. Figure 1.4 plots the likelihood ratio l_r(x) and the threshold p(ω_2)/p(ω_1). We see from this figure that the decision rule (1.2) leads to a disconnected region for class ω_2.
Figure 1.4 Likelihood ratio l_r(x) and the threshold p(ω_2)/p(ω_1): for x in region A, x is assigned to class ω_1.
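The example of Figures 1.3 and 1.4 can be evaluated numerically. In the sketch below, the second parameter of N(·; m, Σ) is taken to be a variance (so N(x; −1, 2) has standard deviation √2), consistent with the notation list at the front of the book; the test points are arbitrary:

```python
# Likelihood ratio for the two-class example of Figures 1.3 and 1.4:
# p(x|w1) = N(x; 0, 1), p(x|w2) = 0.6 N(x; 1, 1) + 0.4 N(x; -1, 2),
# with priors p(w1) = p(w2) = 0.5.
import numpy as np
from scipy.stats import norm

def likelihood_ratio(x):
    p1 = norm.pdf(x, 0.0, 1.0)
    p2 = 0.6 * norm.pdf(x, 1.0, 1.0) + 0.4 * norm.pdf(x, -1.0, np.sqrt(2.0))
    return p1 / p2

threshold = 0.5 / 0.5   # p(w2) / p(w1)
for x in (-3.0, 0.0, 2.0):
    lr = likelihood_ratio(x)
    print(f"x = {x:+.1f}: l_r(x) = {lr:.3f} -> "
          f"{'w1' if lr > threshold else 'w2'}")
```

Note that both x = −3 and x = 2 are assigned to ω_2, illustrating the disconnected class-ω_2 region.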
The fact that the decision rule (1.2) minimises the error may be seen as follows. The probability of making an error, p(error), may be expressed as

p(error) = Σ_{i=1}^{C} p(error|ω_i)p(ω_i)    (1.3)

where p(error|ω_i) is the probability of misclassifying a pattern from class ω_i, namely the probability that such a pattern falls outside the region Ω_i,

p(error|ω_i) = ∫_{x∉Ω_i} p(x|ω_i) dx    (1.4)

Therefore,

p(error) = Σ_{i=1}^{C} p(ω_i) ∫_{x∉Ω_i} p(x|ω_i) dx = 1 − Σ_{i=1}^{C} p(ω_i) ∫_{Ω_i} p(x|ω_i) dx    (1.5)

so minimising the probability of error is equivalent to maximising

Σ_{i=1}^{C} p(ω_i) ∫_{Ω_i} p(x|ω_i) dx    (1.6)

which is the probability of correct classification. Therefore, we wish to choose the regions Ω_i so that the integral given in (1.6) is a maximum. This is achieved by selecting Ω_i to be the region for which p(ω_i)p(x|ω_i) is the largest over all classes, giving the probability of correct classification

c = ∫ max_i p(ω_i)p(x|ω_i) dx    (1.7)

and the minimum error (the Bayes error)

e_B = 1 − ∫ max_i p(ω_i)p(x|ω_i) dx    (1.8)

This is illustrated in Figures 1.5 and 1.6.
Figure 1.5 Class-conditional densities for two normal distributions.
Figure 1.5 plots the two distributions p(x|ω_i), i = 1, 2 (both normal with unit variance and means ±0.5), and Figure 1.6 plots the functions p(x|ω_i)p(ω_i), where p(ω_1) = 0.3 and p(ω_2) = 0.7. The Bayes' decision boundary, defined by the point where p(x|ω_1)p(ω_1) = p(x|ω_2)p(ω_2) (Figure 1.6), is marked with a vertical line at x_B. The areas of the hatched regions in Figure 1.5 represent the probability of error: by Equation (1.4), the area of the horizontal hatching is the probability of classifying a pattern from class 1 as a pattern from class 2, and the area of the vertical hatching the probability of classifying a pattern from class 2 as class 1. The sum of these two areas, weighted by the priors [Equation (1.5)], is the probability of making an error.
Figure 1.6 The products p(x|ω_i)p(ω_i), with the Bayes' decision boundary marked at x_B.
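The numbers behind Figures 1.5 and 1.6 can be checked directly. With equal variances, equating p(x|ω_1)p(ω_1) and p(x|ω_2)p(ω_2) and taking logarithms gives a linear equation for x_B; the sketch below solves it and evaluates the two hatched-area error contributions:

```python
# Bayes' boundary and error for N(-0.5, 1) vs N(0.5, 1),
# with priors p(w1) = 0.3 and p(w2) = 0.7 (Figures 1.5 and 1.6).
import numpy as np
from scipy.stats import norm

p1, p2 = 0.3, 0.7
m1, m2 = -0.5, 0.5

# Equal unit variances: the boundary has a closed form.
x_B = 0.5 * (m1 + m2) + np.log(p1 / p2) / (m2 - m1)

# Hatched areas of Figure 1.5, weighted by the priors [Equation (1.5)]:
# class-1 patterns falling above x_B plus class-2 patterns falling below.
err = p1 * (1.0 - norm.cdf(x_B, m1, 1.0)) + p2 * norm.cdf(x_B, m2, 1.0)
print(f"x_B = {x_B:.3f}, Bayes error = {err:.3f}")
```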
1.5.2 Bayes' decision rule for minimum error – reject option
As we have stated above, an error or misrecognition occurs when the classifier assigns a pattern to one class when it actually belongs to another. In this section we consider the reject option. Usually it is the uncertain classifications (often close to the decision boundaries) that contribute mainly to the error rate. Therefore, rejecting a pattern (withholding a decision) may lead to a reduction in the error rate. This rejected pattern may be discarded, or set aside until further information allows a decision to be made. Although the option to reject may alleviate or remove the problem of a high misrecognition rate, some otherwise correct classifications are also converted into rejects. Here we consider the trade-offs between error rate and reject rate.
First, we partition the sample space into two complementary regions: R, a reject region, and A, an acceptance or classification region. These are defined by

R = { x : 1 − max_i p(ω_i|x) > t }
A = { x : 1 − max_i p(ω_i|x) ≤ t }

where t is a threshold. The smaller the value of the threshold t, the larger is the reject region R. However, if t is chosen such that

t ≥ (C − 1)/C
where C is the number of classes, then the reject region is empty. This is because the minimum value which max_i p(ω_i|x) can attain is 1/C [since 1 = Σ_{j=1}^{C} p(ω_j|x) ≤ C max_i p(ω_i|x)], so that 1 − max_i p(ω_i|x) cannot exceed (C − 1)/C.
The probability of correct classification, c(t), is a function of the threshold, t, and is given by Equation (1.7), where now the integral is over the acceptance region, A, only:

c(t) = ∫_A max_i p(ω_i)p(x|ω_i) dx

and the unconditional probability of rejecting a measurement, r(t), also a function of the threshold t, is the probability that it lies in R:

r(t) = ∫_R p(x) dx
Thus, the error rate (the probability of accepting a pattern for classification and classifying it incorrectly) is e(t) = 1 − c(t) − r(t), and the error rate and reject rate are inversely related. Chow (1970) derives a simple functional relationship between e(t) and r(t), which we quote here without proof. Knowing r(t) over the complete range of t allows e(t) to be calculated using the relationship

e(t) = − ∫_0^t s dr(s)

The above result allows the error rate to be evaluated from the reject function for the Bayes' optimum classifier. The reject function can be calculated using unlabelled data, and a practical application of the above result is to problems where labelling of gathered data is costly.
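The error–reject trade-off can be illustrated by simulation. The sketch below reuses the two-class example of Figure 1.5, rejects a sample when 1 − max_i p(ω_i|x) exceeds t, and reports the empirical reject and (unconditional) error rates as t decreases; all of the sampling details are assumptions made for illustration:

```python
# Empirical error-reject trade-off for the example of Figure 1.5:
# classes N(-0.5, 1) with prior 0.3 and N(0.5, 1) with prior 0.7.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 100_000
z = (rng.random(n) < 0.7).astype(int)                # true class (1 = w2)
x = rng.normal(np.where(z == 1, 0.5, -0.5), 1.0)     # sampled patterns

# Posterior probability of class w2, then the maximum posterior.
num = 0.7 * norm.pdf(x, 0.5, 1.0)
post2 = num / (num + 0.3 * norm.pdf(x, -0.5, 1.0))
post_max = np.maximum(post2, 1.0 - post2)
decision = (post2 >= 0.5).astype(int)                # Bayes decision

for t in (0.5, 0.3, 0.1):
    accept = (1.0 - post_max) <= t                   # acceptance region A
    r = 1.0 - accept.mean()                          # reject rate r(t)
    e = ((decision != z) & accept).mean()            # unconditional error e(t)
    print(f"t = {t:.1f}: r(t) = {r:.3f}, e(t) = {e:.3f}")
```

At t = 0.5 nothing is rejected and e(t) approximates the Bayes error; as t falls, the reject rate rises and the error rate falls, illustrating the inverse relationship.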
1.5.3 Bayes' decision rule for minimum risk

In the previous section, the decision rule selected the class for which the a posteriori probability, p(ω_j|x), was the greatest. This minimised the probability of making an error. We now consider a somewhat different rule that minimises an expected loss or risk. This is a very important concept since in many applications the costs associated with misclassification depend upon the true class of the pattern and the class to which it is assigned. For example, in a medical diagnosis problem in which a patient has back pain, it is far worse to classify a patient with severe spinal abnormality as healthy (or as having mild back ache) than the other way round.
We make this concept more formal by introducing a loss that is a measure of the cost of making the decision that a pattern belongs to class ω_i when the true class is ω_j. We define a loss matrix Λ with components

λ_ji = cost of assigning a pattern x to ω_i when x ∈ ω_j

In practice, it may be very difficult to assign costs. In some situations, λ may be measured in monetary units that are quantifiable. However, in many situations, costs are a combination of several different factors measured in different units – money, time, quality of life. As a consequence, they are often a subjective opinion of an expert. The conditional risk of assigning a pattern x to class ω_i is defined as

l_i(x) = Σ_{j=1}^{C} λ_ji p(ω_j|x)
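A short sketch of the conditional risk computation follows; the loss matrix values and the posterior probabilities are invented (a made-up two-class medical example in which a missed diagnosis costs ten times a false alarm):

```python
# Minimum-risk decision: choose the class minimising the conditional risk
# l_i(x) = sum_j lambda_ji p(omega_j | x).
import numpy as np

# lam[j, i] = cost of deciding class i when the true class is j.
lam = np.array([[0.0, 1.0],     # true class 0 (healthy):  correct, false alarm
                [10.0, 0.0]])   # true class 1 (diseased): missed case, correct

posterior = np.array([0.8, 0.2])    # p(omega_j | x) for some pattern x

risk = posterior @ lam               # risk[i] = sum_j lam[j, i] p(omega_j | x)
print("conditional risks:", risk)    # -> [2.0, 0.8]
print("minimum-risk decision: class", int(np.argmin(risk)))
# Although class 0 has the higher posterior, the asymmetric losses make
# deciding class 1 the lower-risk action.
```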