COMPUTATIONAL INTELLIGENCE AND FEATURE SELECTION
Rough and Fuzzy Approaches
Richard Jensen and Qiang Shen
445 Hoes Lane, Piscataway, NJ 08854
IEEE Press Editorial Board
Lajos Hanzo, Editor in Chief
J. Anderson    T. G. Croda    S. Nahavandi
A. Chatterjee    B. M. Hammerli    W. Reeve
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
Steve Welch, IEEE Press Manager
Jeanne Audino, Project Editor
IEEE Computational Intelligence Society, Sponsor
IEEE-CIS Liaison to IEEE Press, Gary B. Fogel
Technical Reviewers
Chris Hinde, Loughborough University, UK
Hisao Ishibuchi, Osaka Prefecture University, Japan
Books in the IEEE Press Series on Computational Intelligence
Introduction to Evolvable Hardware: A Practical Guide for Designing Self-Adaptive Systems
Emergent Information Technologies and Enabling Policies for Counter-Terrorism
Edited by Robert L Popp and John Yen
2006 978-0471-77615-4
Computationally Intelligent Hybrid Systems
Edited by Seppo J Ovaska
2005 0-471-47668-4
Handbook of Learning and Approximate Dynamic Programming
Edited by Jennie Si, Andrew G. Barto, Warren B. Powell, and Donald Wunsch II
2004 0-471-66054-X
Computational Intelligence: The Experts Speak
Edited by David B. Fogel and Charles J. Robinson
2003 0-471-27454-2
Computational Intelligence in Bioinformatics
Edited by Gary B. Fogel, David W. Corne, and Yi Pan
2008 978-0470-10526-9
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN: 978-0-470-22975-0
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
2.2.6 Linguistic Hedges / 24
2.2.7 Fuzzy Sets and Probability / 25
2.3.1 Information and Decision Systems / 26
2.3.2 Indiscernibility / 27
2.3.4 Positive, Negative, and Boundary Regions / 28
2.3.5 Feature Dependency and Significance / 29
2.3.7 Discernibility Matrix / 31
2.4.1 Fuzzy Equivalence Classes / 33
5.1 Rough Set Attribute Reduction / 86
5.1.1 Additional Search Strategies / 89
5.9 Alternative Approaches / 106
5.10 Comparison of Crisp Approaches / 106
5.10.2 Discernibility Matrix Based Approaches / 108
6.1 Medical Image Classification / 113
6.2.4 Dimensionality Reduction / 119
6.2.5 Information Content of Rough Set Reducts / 120
6.2.7 Efficiency Considerations of RSAR / 124
8.1 Feature Selection with Fuzzy-Rough Sets / 144
9.2.3 Fuzzy-Rough Reduction with Fuzzy Entropy / 171
9.2.4 Fuzzy-Rough Reduction with Fuzzy Gain Ratio / 173
9.2.5 Fuzzy Discernibility Matrix Based FS / 174
10.2 Ant Colony Optimization-Based Selection / 195
10.2.1 Ant Colony Optimization / 196
10.2.2 Traveling Salesman Problem / 197
10.2.3 Ant-Based Feature Selection / 197
12.2.1 Comparison with Unreduced Features / 223
12.2.2 Comparison with Entropy-Based Feature Selection / 226
12.2.4 Alternative Fuzzy Rule Inducer / 230
12.2.5 Results with Feature Grouping / 231
12.2.6 Results with Ant-Based FRFS / 233
13.2.1 Impact of Feature Selection / 241
13.2.2 Comparison with Relief / 244
13.2.3 Comparison with Existing Work / 248
14 APPLICATIONS V: FORENSIC GLASS ANALYSIS 259
14.2 Estimation of Likelihood Ratio / 261
14.2.1 Exponential Model / 262
14.2.2 Biweight Kernel Estimation / 263
14.2.3 Likelihood Ratio with Biweight and Boundary Kernels / 264
14.2.4 Adaptive Kernel / 266
15.3 Fuzzy-Rough Rule Induction / 286
15.4 Hybrid Rule Induction / 287
PREFACE

... as machine learning, pattern recognition, systems control, and signal processing.
FS intends to preserve the meaning of selected attributes; this forms a sharp contrast with those approaches that reduce problem complexity by transforming the representational forms of the attributes.
Feature selection techniques have been applied to small- and medium-sized datasets in order to locate the most informative features for later use. Many FS methods have been developed, and this book provides a critical review of these methods, with particular emphasis on their current limitations. To help the understanding of the readership, the book systematically presents the leading methods reviewed in a consistent algorithmic framework. The book also details those computational intelligence based methods (e.g., fuzzy rule induction and swarm optimization) that either benefit from joint use with feature selection or help improve the selection mechanism.
From this background the book introduces the original approach to feature selection using conventional rough set theory, exploiting the rough set ideology in that only the supplied data and no other information is used. Based on demonstrated applications, the book reviews the main limitation of this approach in the sense that all data must be discrete. The book then proposes and develops a fundamental approach based on fuzzy-rough sets. It also presents optimizations, extensions, and further new developments of this approach whose underlying ideas are generally applicable to other FS mechanisms.
Real-world applications, with worked examples, are provided that illustrate the power and efficacy of the feature selection approaches covered in the book. In particular, the algorithms discussed have proved to be successful in handling tasks that involve datasets containing huge numbers of features (on the order of tens of thousands), which would be extremely difficult to process further. Such applications include Web content classification, complex systems monitoring, and algae population estimation. The book shows the success of these applications by evaluating the algorithms statistically with respect to the existing leading approaches to the reduction of problem complexity.
Finally, this book concludes with initial supplementary investigations into the associated areas of feature selection, including rule induction and clustering methods using hybridizations of fuzzy and rough set theories. This research opens up many new frontiers for the continued development of the core technologies introduced in the field of computational intelligence.
This book is primarily intended for senior undergraduates, postgraduates, researchers, and professional engineers. However, it offers a straightforward presentation of the underlying concepts that anyone with a nonspecialist background should be able to understand and apply.
Acknowledgments
Thanks to those who helped at various stages in the development of the ideas presented in this book, particularly: Colin Aitken, Stuart Aitken, Malcolm Beynon, Chris Cornelis, Alexios Chouchoulas, Michelle Galea, Knox Haggie, Joe Halliwell, Zhiheng Huang, Jeroen Keppens, Pawan Lingras, Javier Marin-Blazquez, Neil Mac Parthalain, Khairul Rasmani, Dave Robertson, Changjing Shang, Andrew Tuson, Xiangyang Wang, and Greg Zadora. Many thanks to the University of Edinburgh and Aberystwyth University where this research was undertaken and compiled.
Thanks must also go to those friends and family who have contributed in some part to this work, particularly Elaine Jensen, Changjing Shang, Yuan Shen, Sarah Sholl, Mike Gordon, Andrew Herrick, Iain Langlands, Tossapon Boongoen, Xin Fu, and Ruiqing Zhao.
The editors and staff at IEEE Press were extremely helpful. We particularly thank David Fogel and Steve Welch for their support, enthusiasm, and encouragement. Thanks also to the anonymous referees for their comments and suggestions that have enhanced the work presented here, and to Elsevier, Springer, and World Scientific for allowing the reuse of materials previously published in their journals. Additional thanks go to those authors whose research is included in this book, for their contributions to this interesting and ever-developing area.
Richard Jensen and Qiang Shen
Aberystwyth University
17th June 2008
1 THE IMPORTANCE OF FEATURE SELECTION
1.1 KNOWLEDGE DISCOVERY
It is estimated that every 20 months or so the amount of information in the world doubles. In the same way, tools for use in the various knowledge fields (acquisition, storage, retrieval, maintenance, etc.) must develop to combat this growth. Knowledge is only valuable when it can be used efficiently and effectively; therefore knowledge management is increasingly being recognized as a key element in extracting its value. This is true both within the research, development, and application of computational intelligence and beyond.
Central to this issue is the knowledge discovery process, particularly knowledge discovery in databases (KDD) [10,90,97,314]. KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Traditionally data was turned into knowledge by means of manual analysis and interpretation. For many applications manual probing of data is slow, costly, and highly subjective. Indeed, as data volumes grow dramatically, manual data analysis is becoming completely impractical in many domains. This motivates the need for efficient, automated knowledge discovery. The KDD process can be decomposed into the following steps, as illustrated in Figure 1.1:
• Data selection. A target dataset is selected or created. Several existing datasets may be joined together to obtain an appropriate example set.
Figure 1.1 Knowledge discovery process (adapted from [97]).
• Data cleaning/preprocessing. This phase includes, among other tasks, noise removal/reduction, missing value imputation, and attribute discretization. The goal is to improve the overall quality of any information that may be discovered.
• Data reduction. Most datasets will contain a certain amount of redundancy that will not aid knowledge discovery and may in fact mislead the process. The aim of this step is to find useful features to represent the data and remove nonrelevant features. Time is also saved during the data-mining step as a result.
• Data mining. A data-mining method (the extraction of hidden predictive information from large databases) is selected depending on the goals of the knowledge discovery task. The choice of algorithm used may be dependent on many factors, including the source of the dataset and the values it contains.
• Interpretation/evaluation. Once knowledge has been discovered, it is evaluated with respect to validity, usefulness, novelty, and simplicity. This may require repeating some of the previous steps.
The third step in the knowledge discovery process, namely data reduction, is often a source of significant data loss. It is this step that forms the focus of attention of this book. The high dimensionality of databases can be reduced using suitable techniques, depending on the requirements of the future KDD processes. These techniques fall into one of two categories: those that transform the underlying meaning of the data features and those that preserve the semantics. Feature selection (FS) methods belong to the latter category, where a smaller set of the original features is chosen based on a subset evaluation function. In knowledge discovery, feature selection methods are particularly desirable as these facilitate the interpretability of the resulting knowledge.
1.2 FEATURE SELECTION
There are often many features in KDD, and combinatorially large numbers of feature combinations, to select from. Note that the number of feature subset combinations with m features from a collection of N total features can be extremely large, this number being N!/[m!(N − m)!]. It might be expected that the inclusion of an increasing number of features would increase the likelihood of including enough information to distinguish between classes. Unfortunately, this is not true if the size of the training dataset does not also increase rapidly with each additional feature included. This is the so-called curse of dimensionality [26]. A high-dimensional dataset increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Most techniques employ some degree of reduction in order to cope with large amounts of data, so an efficient and effective reduction method is required.
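To get a feel for how quickly this search space grows, the short sketch below (added here for illustration; the choice of dataset sizes is arbitrary) evaluates N!/[m!(N − m)!] directly:

```python
from math import comb

# comb(N, m) computes N!/[m!(N - m)!], the number of m-feature subsets.
for N in (10, 20, 50):
    size_10 = comb(N, 10)                                   # subsets of exactly 10 features
    all_subsets = sum(comb(N, m) for m in range(1, N + 1))  # every nonempty subset
    print(f"N={N:2d}: {size_10:,} subsets of size 10; {all_subsets:,} subsets in total")
```

Even at N = 50 an exhaustive search over all subsets is already infeasible, which is why the heuristic search strategies discussed later in the book matter.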
1.2.1 The Task
The task of feature selection is to select a subset of the original features present in a given dataset that provides most of the useful information. Hence, after selection has taken place, the dataset should still have most of the important information present. In fact, good FS techniques should be able to detect and ignore noisy and misleading features. The result of this is that the dataset quality might even increase after selection.
There are two feature qualities that must be considered by FS methods: relevancy and redundancy. A feature is said to be relevant if it is predictive of the decision feature(s); otherwise, it is irrelevant. A feature is considered to be redundant if it is highly correlated with other features. An informative feature is one that is highly correlated with the decision concept(s) but is highly uncorrelated with other features (although low correlation does not mean absence of relationship). Similarly subsets of features should exhibit these properties of relevancy and nonredundancy if they are to be useful.
In [171] two notions of feature relevance, strong and weak relevance, were defined. If a feature is strongly relevant, this implies that it cannot be removed from the dataset without resulting in a loss of predictive accuracy. If it is weakly relevant, then the feature may sometimes contribute to accuracy, though this depends on which other features are considered. These definitions are independent of the specific learning algorithm used. However, there is no guarantee that a relevant feature will be useful to such an algorithm.
It is quite possible for two features to be useless individually, and yet highly predictive if taken together. In FS terminology, they may be both redundant and irrelevant on their own, but their combination provides invaluable information. For example, in the exclusive-or problem, where the classes are not linearly separable, the two features on their own provide no information concerning this separability. It is also the case that they are uncorrelated with each other. However, when taken together, the two features are highly informative and can provide good class separation. Hence in FS the search is typically for high-quality feature subsets, and not merely a ranking of features.
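The exclusive-or effect can be verified numerically. The sketch below (illustrative only; the mutual_information helper is ours) measures, in bits, how much a feature taken alone and the two features taken together reveal about the XOR class:

```python
from collections import Counter
from math import log2

# XOR truth table: class = x XOR y. Each feature alone carries no
# information about the class; together they determine it completely.
data = [((x, y), x ^ y) for x in (0, 1) for y in (0, 1)]

def mutual_information(pairs):
    """I(F; C) for (feature_value, class) pairs, all rows equally likely."""
    n = len(pairs)
    p_fc = Counter(pairs)                    # joint distribution of (F, C)
    p_f = Counter(f for f, _ in pairs)       # marginal of F
    p_c = Counter(c for _, c in pairs)       # marginal of C
    return sum((cnt / n) * log2((cnt / n) / ((p_f[f] / n) * (p_c[c] / n)))
               for (f, c), cnt in p_fc.items())

print(mutual_information([(x, c) for (x, _), c in data]))       # 0.0 bits: x alone
print(mutual_information([(y, c) for (_, y), c in data]))       # 0.0 bits: y alone
print(mutual_information([((x, y), c) for (x, y), c in data]))  # 1.0 bit: jointly
```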
1.2.2 The Benefits
There are several potential benefits of feature selection:
1. Facilitating data visualization. By reducing data to fewer dimensions, trends within the data can be more readily recognized. This can be very important where only a few features have an influence on data outcomes. Learning algorithms by themselves may not be able to distinguish these factors from the rest of the feature set, leading to the generation of overly complicated models. The interpretation of such models then becomes an unnecessarily tedious task.
2. Reducing the measurement and storage requirements. In domains where features correspond to particular measurements (e.g., in a water treatment plant [322]), fewer features are highly desirable due to the expense and time-costliness of taking these measurements. For domains where large datasets are encountered and manipulated (e.g., text categorization [162]), a reduction in data size is required to enable storage where space is an issue.
3. Reducing training and utilization times. With smaller datasets, the runtimes of learning algorithms can be significantly improved, both for training and classification phases. It can sometimes be the case that the computational complexity of learning algorithms even prohibits their application to large problems. This is remedied through FS, which can reduce the problem to a more manageable size.
4. Improving prediction performance. Classifier accuracy can be increased as a result of feature selection, through the removal of noisy or misleading features. Algorithms trained on a full set of features must be able to discern and ignore these attributes if they are to produce useful, accurate predictions for unseen data.
For those methods that extract knowledge from data (e.g., rule induction) the benefits of FS also include improving the readability of the discovered knowledge. When induction algorithms are applied to reduced data, the resulting rules are more compact. A good feature selection step will remove unnecessary attributes which can affect both rule comprehension and rule prediction performance.
1.3 ROUGH SETS

Rough set theory (RST) has been applied with much success across many domains (including classification [54,84,164], systems monitoring [322], clustering [131], and expert systems [354]; see LNCS Transactions on Rough Sets for more examples). This success is due in part to the following aspects of the theory:
• Only the facts hidden in data are analyzed.
• No additional information about the data is required, such as thresholds or expert knowledge on a particular domain.
• It finds a minimal knowledge representation.
The work on RST offers an alternative, and formal, methodology that can be employed to reduce the dimensionality of datasets, as a preprocessing step to assist any chosen modeling method for learning from data. It helps select the most information-rich features in a dataset, without transforming the data, all the while attempting to minimize information loss during the selection process. Computationally, the approach is highly efficient, relying on simple set operations, which makes it suitable as a preprocessor for techniques that are much more complex. Unlike statistical correlation-reducing approaches [77], it requires no human input or intervention. Most importantly, it also retains the semantics of the data, which makes the resulting models more transparent to human scrutiny.
Combined with an automated intelligent modeler, say a fuzzy system or a neural network, the feature selection approach based on RST not only can retain the descriptive power of the learned models but also allow simpler system structures to reach the knowledge engineer and field operator. This helps enhance the interoperability and understandability of the resultant models and their reasoning.
As RST handles only one type of imperfection found in data, it is complementary to other concepts for the purpose, such as fuzzy set theory. The two fields may be considered analogous in the sense that both can tolerate inconsistency and uncertainty, the difference being the type of uncertainty and their approach to it. Fuzzy sets are concerned with vagueness; rough sets are concerned with indiscernibility. Many deep relationships have been established, and more so, most recent studies have concluded at this complementary nature of the two methodologies, especially in the context of granular computing. Therefore it is desirable to extend and hybridize the underlying concepts to deal with additional aspects of data imperfection. Such developments offer a high degree of flexibility and provide robust solutions and advanced tools for data analysis.
1.4 APPLICATIONS
As many systems in a variety of fields deal with datasets of large dimensionality, feature selection has found wide applicability. Some of the main areas of application are shown in Figure 1.2.

Figure 1.2 Typical feature selection application areas.

Feature selection algorithms are often applied to optimize the classification performance of image recognition systems [158,332]. This is motivated by a peaking phenomenon commonly observed when classifiers are trained with a limited set of training samples. If the number of features is increased, the classification rate of the classifier decreases after a peak. In melanoma diagnosis, for instance, the clinical accuracy of dermatologists in identifying malignant melanomas is only between 65% and 85% [124]. With the application of FS algorithms, automated skin tumor recognition systems can produce classification accuracies above 95%.
Structural and functional data from analysis of the human genome have increased manyfold in recent years, presenting enormous opportunities and challenges for AI tasks. In particular, gene expression microarrays are a rapidly maturing technology that provides the opportunity to analyze the expression levels of thousands or tens of thousands of genes in a single experiment. A typical classification task is to distinguish between healthy and cancer patients based on their gene expression profile. Feature selectors are used to drastically reduce the size of these datasets, which would otherwise have been unsuitable for further processing [318,390,391]. Other applications within bioinformatics include QSAR [46], where the goal is to form hypotheses relating chemical features of molecules to their molecular activity, and splice site prediction [299], where junctions between coding and noncoding regions of DNA are detected.
representations of knowledge is the use of if-then production rules Yet real-life
problem domains usually lack generic and systematic expert rules for mappingfeature patterns onto their underlying classes In order to speed up the rule
Trang 25induction process and reduce rule complexity, a selection step is required Thisreduces the dimensionality of potentially very large feature sets while minimizingthe loss of information needed for rule induction It has an advantageous sideeffect in that it removes redundancy from the historical data This also helpssimplify the design and implementation of the actual pattern classifier itself, bydetermining what features should be made available to the system In additionthe reduced input dimensionality increases the processing speed of the classifier,leading to better response times [12,51].
Many inferential measurement systems are developed using data-based methodologies; the models used to infer the value of target features are developed with real-time plant data. This implies that inferential systems are heavily influenced by the quality of the data used to develop their internal models. Complex application problems, such as reliable monitoring and diagnosis of industrial plants, are likely to present large numbers of features, many of which will be redundant for the task at hand. Additionally there is an associated cost with the measurement of these features. In these situations it is very useful to have an intelligent system capable of selecting the most relevant features needed to build an accurate and reliable model for the process [170,284,322].
The task of text clustering is to group similar documents together, represented as a bag of words. This representation raises one severe problem: the high dimensionality of the feature space and the inherent data sparsity. This can significantly affect the performance of clustering algorithms, so it is highly desirable to reduce this feature space size. Dimensionality reduction techniques have been successfully applied to this area, both those that destroy data semantics and those that preserve them (feature selectors) [68,197].
Similar to clustering, text categorization views documents as a collection of words. Documents are examined, with their constituent keywords extracted and rated according to criteria such as their frequency of occurrence. As the number of keywords extracted is usually in the order of tens of thousands, dimensionality reduction must be performed. This can take the form of simplistic filtering methods such as word stemming or the use of stop-word lists. However, filtering methods do not provide enough reduction for use in automated categorizers, so a further feature selection process must take place. Recent applications of FS in this area include Web page and bookmark categorization [102,162].
1.5 STRUCTURE
The rest of this book is structured as follows (see Figure 1.3):
• Chapter 2: Set Theory. A brief introduction to the various set theories is presented in this chapter. Essential concepts from classical set theory, fuzzy set theory, rough set theory, and hybrid fuzzy-rough set theory are presented and illustrated where necessary.
Figure 1.3 How to read this book.
• Chapter 3: Classification Methods. This chapter discusses both crisp and fuzzy methods for the task of classification. Many of the methods presented here are used in systems later in the book.
• Chapter 4: Dimensionality Reduction. A systematic overview of current techniques for dimensionality reduction, with a particular emphasis on feature selection, is given in this chapter. It begins with a discussion of those reduction methods that irreversibly transform data semantics. This is followed by a more detailed description and evaluation of the leading feature selectors, presented in a unified algorithmic framework. A simple example illustrates their operation.
• Chapter 5: Rough Set-based Approaches to Feature Selection. This chapter presents an overview of the existing research regarding the application of rough set theory to feature selection. Rough set attribute reduction (RSAR), the precursor to the developments in this book, is described in detail. However, these methods are unsuited to the problems discussed in Section 5.11. In particular, they are unable to handle noisy or real-valued data effectively, a significant problem if they are to be employed within real-world applications.
• Chapter 6: Applications I: Use of RSAR. This chapter looks at the applications of RSAR in several challenging domains: medical image classification, text categorization, and algae population estimation. Details of each classification system are given with several comparative studies carried out that investigate RSAR's utility. Additionally a brief introduction to other applications that use a crisp rough set approach is provided for the interested reader.
• Chapter 7: Rough and Fuzzy Hybridization. There has been great interest in developing methodologies that are capable of dealing with imprecision and uncertainty. The large amount of research currently being carried out in fuzzy and rough sets is representative of this. Many deep relationships have been established, and recent studies have concluded at the complementary nature of the two methodologies. Therefore it is desirable to extend and hybridize the underlying concepts to deal with additional aspects of data imperfection. Such developments offer a high degree of flexibility and provide robust solutions and advanced tools for data analysis. A general survey of this research is presented in the chapter, with a focus on applications of the theory to disparate domains.
• Chapter 8: Fuzzy-Rough Feature Selection. In this chapter the theoretical developments behind this new feature selection method are presented together with a proof of generalization. This novel approach uses fuzzy-rough sets to handle many of the problems facing feature selectors outlined previously. A complexity analysis of the main selection algorithm is given. The operation of the approach and its benefits are shown through the use of two simple examples. To evaluate this new fuzzy-rough measure of feature significance, comparative investigations are carried out with the current leading significance measures.
• Chapter 9: New Developments of FRFS. Feature selection has been shown to be highly useful at reducing data dimensionality, but possesses several problems that render it ineffective for datasets possessing tens of thousands of features. This chapter presents three new approaches to fuzzy-rough feature selection (FRFS) based on fuzzy similarity relations. The first employs the new similarity-based fuzzy lower approximation to locate subsets. The second uses boundary region information to guide search. Finally, a fuzzy extension to crisp discernibility matrices is given in order to discover fuzzy-rough subsets. The methods are evaluated and compared using benchmark data.
• Chapter 10: Further Advanced FS Methods. This chapter presents two promising areas in feature selection. The first, feature grouping, is developed from recent work in the literature where groups of features are selected simultaneously. By reasoning with fuzzy labels, the search process can be made more intelligent, allowing various search strategies to be employed. The second, ant-based feature selection, seeks to address the nontrivial issue of finding the smallest optimal feature subsets. This approach to feature selection uses artificial ants and pheromone trails in the search for the best subsets. Both of these developments can be applied within feature selection, in general, but are applied to the specific problem of subset search within FRFS in this book.
• Chapter 11: Applications II: Web Content Categorization. With the explosive growth of information on the Web, there is an abundance of information that must be dealt with effectively and efficiently. This area, in particular, deserves the attention of feature selection due to the increasing demand for high-performance intelligent Internet applications. This motivates the application of FRFS to the automatic categorization of user bookmarks/favorites and Web pages. The results show that FRFS significantly reduces data dimensionality by several orders of magnitude with little resulting loss in classification accuracy.
• Chapter 12: Applications III: Complex Systems Monitoring. Complex application problems, such as reliable monitoring and diagnosis of industrial plants, are likely to present large numbers of features, many of which will be redundant for the task at hand. With the use of FRFS, these extraneous features can be removed. This not only makes resultant rulesets generated from such data much more concise and readable but can reduce the expense due to the monitoring of redundant features. The monitoring system is applied to water treatment plant data, producing better classification accuracies than those resulting from the full feature set and several other reduction methods.
• Chapter 13: Applications IV: Algae Population Estimation. Biologists need to identify and isolate the chemical parameters of rapid algae population fluctuations in order to limit their detrimental effect on the environment. This chapter describes an estimator of algae populations, a hybrid system involving FRFS that approximates, given certain water characteristics, the size of algae populations. The system significantly reduces computer time and space requirements through the use of feature selection. The results show that estimators using a fuzzy-rough feature selection step produce more accurate predictions of algae populations in general.
• Chapter 14: Applications V: Forensic Glass Analysis. The evaluation of glass evidence in forensic science is an important issue. Traditionally this has depended on the comparison of the physical and chemical attributes of an unknown fragment with a control fragment. A high degree of discrimination between glass fragments is now achievable due to advances in analytical capabilities. A random effects model using two levels of hierarchical nesting is applied to the calculation of a likelihood ratio (LR) as a solution to the problem of comparison between two sets of replicated continuous observations where it is unknown whether the sets of measurements shared a common origin. This chapter presents the investigation into the use of feature evaluation for the purpose of selecting a single variable to model without the need for expert knowledge. Results are recorded for several selectors using normal, exponential, adaptive, and biweight kernel estimation techniques. Misclassification rates for the LR estimators are used to measure performance.
• Chapter 15: Supplementary Developments and Investigations. This chapter offers initial investigations and ideas for further work, which were developed concurrently with the ideas presented in the previous chapters. First, the utility of using the problem formulation and solution techniques from propositional satisfiability for finding rough set reducts is considered. This is presented with an initial experimental evaluation of such an approach, comparing the results with a standard rough set-based algorithm, RSAR. Second, the possibility of universal reducts is proposed as a way of generating more useful feature subsets. Third, fuzzy decision tree induction based on the fuzzy-rough metric developed in this book is proposed. Other proposed areas of interest include fuzzy-rough clustering and fuzzy-rough fuzzification optimization.
2 SET THEORY
The problem of capturing the vagueness present in the real world is difficult to overcome using classical set theory alone. As a result several extensions to set theory have been proposed that deal with different aspects of uncertainty. Some of the most popular methods for this are fuzzy set theory, rough set theory, and their hybridization, fuzzy-rough set theory. This chapter starts with a quick introduction to classical set theory, using a simple example to illustrate the concept. Then an introduction to fuzzy sets is given, covering the essentials required for a basic understanding of their use. There are many useful introductory resources regarding fuzzy set theory, for example [66,267]. Following this, rough set theory is introduced with an example dataset to illustrate the fundamental concepts. Finally, fuzzy-rough set theory is detailed.
2.1 CLASSICAL SET THEORY
2.1.1 Definition
In classical set theory, elements can belong to a set or not at all. For example, in the set of old people, defined here as {Rod, Jane, Freddy}, the element Rod belongs to this set whereas the element George does not. No distinction is made within a set between elements; all set members belong fully. This may be considered to be a source of information loss for certain applications. Returning to the example, Rod may be older than Freddy, but by this formulation both
are considered to be equally old. The order of elements within a set is not considered.
More formally, let U be a space of objects, referred to as the universe of discourse, and x an element of U. A classical (crisp) set A, A ⊆ U, is defined as a collection of elements x ∈ U, such that each element x can either belong to the set or not. A classical set A can be represented by a set of ordered pairs (x, 0) or (x, 1) for each element, indicating x ∉ A or x ∈ A, respectively.
Two sets A and B are said to be equal if they contain exactly the same elements; every element of A is an element of B and every element of B is an element of A. The cardinality of a set A is a measure of the number of elements of the set, and is often denoted |A|. For example, the set {Rod, Jane, Freddy} has a cardinality of 3 (|{Rod, Jane, Freddy}| = 3). A set with no members is called the empty set, usually denoted ∅, and has a cardinality of zero.
2.1.2 Subsets
If every element of a set A is a member of a set B, then A is said to be a subset of B, denoted A ⊆ B (or equivalently B ⊇ A). If A is a subset of B but is not equal to B, then it is a proper subset of B, denoted A ⊂ B (or equivalently B ⊃ A). For example, if A = {Jane, Rod}, B = {Rod, Jane, Geoffrey}, and C = {Rod, Jane, Freddy}:
A ⊆ B ({Jane, Rod} ⊆ {Rod, Jane, Geoffrey})
A ⊆ A ({Jane, Rod} ⊆ {Jane, Rod})
A ⊂ B ({Jane, Rod} ⊂ {Rod, Jane, Geoffrey})
A ⊆ C ({Jane, Rod} ⊆ {Rod, Jane, Freddy})
B ⊈ C ({Rod, Jane, Geoffrey} ⊈ {Rod, Jane, Freddy})
2.1.3 Operators
Several operations exist for the manipulation of sets. Only the fundamental operations are considered here.

2.1.3.1 Union The union of two sets A and B is a set that contains all the members of both sets and is denoted A ∪ B. More formally, A ∪ B = {x | (x ∈ A) or (x ∈ B)}. Properties of the union operator include:

A ∪ B = B ∪ A
A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∪ A = A
A ∪ ∅ = A
A ∪ U = U
2.1.3.2 Intersection The intersection of two sets A and B is a set that contains only those elements that A and B have in common. More formally, A ∩ B = {x | (x ∈ A) and (x ∈ B)}. If the intersection of A and B is the empty set, then sets A and B are said to be disjoint. Properties of the intersection operator include:

A ∩ B = B ∩ A
A ∩ (B ∩ C) = (A ∩ B) ∩ C
A ∩ A = A
A ∩ ∅ = ∅
A ∩ U = A

2.1.3.3 Difference The difference of two sets A and B is the set of all elements of B that do not belong to A. This is denoted B − A (or B \ A). More formally, B − A = {x | (x ∈ B) and not (x ∈ A)}.
In some situations, all sets may be considered to be subsets of a given universal set U. In this case, U − A is the absolute complement (or simply complement) of A, and is denoted Ā or A^c. More formally, A^c = {x | (x ∈ U) and not (x ∈ A)}. Properties of complements include:

A ∪ A^c = U
A ∩ A^c = ∅
(A^c)^c = A
∅^c = U
U^c = ∅
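All of the crisp operations above map directly onto Python's built-in set type, as the following sketch of the running example shows (the universe chosen here is an assumption for illustration):

```python
# Crisp set operations from Section 2.1, using Python's built-in sets.
U = {"Rod", "Jane", "Freddy", "Geoffrey", "George"}   # assumed universe of discourse
A = {"Jane", "Rod"}
B = {"Rod", "Jane", "Geoffrey"}

print(A | B)        # union: {'Rod', 'Jane', 'Geoffrey'}
print(A & B)        # intersection: {'Jane', 'Rod'}
print(B - A)        # difference B - A: {'Geoffrey'}
print(U - A)        # absolute complement of A within U
print(A <= B, A < B, len(A))  # subset, proper subset, cardinality |A| = 2
```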
2.2 FUZZY SET THEORY
A distinct approach to coping with reasoning under uncertain circumstances is to use the theory of fuzzy sets [408]. The main objective of this theory is to develop a methodology for the formulation and solution of problems that are too complex or ill-defined to be suitable for analysis by conventional Boolean techniques. It deals with subsets of a universe of discourse, where the transition between full membership of a subset and no membership is gradual rather than abrupt as in classical set theory. Such subsets are called fuzzy sets.
Fuzzy sets arise, for instance, when mathematical descriptions of ambiguity and ambivalence are needed. In the real world the attributes of, say, a physical system often emerge from an elusive vagueness or fuzziness, a readjustment to context, or an effect of human imprecision. The use of the "soft" boundaries of fuzzy sets, namely the graded memberships, allows subjective knowledge to be utilized in defining these attributes. With the accumulation of knowledge the subjectively assigned memberships can, of course, be modified. Even in some cases where precise knowledge is available, fuzziness may be a concomitant of complexity involved in the reasoning process.
The adoption of fuzzy sets helps to ease the requirement for encoding uncertain domain knowledge. For example, labels like small, medium, and large have an intuitive appeal to represent values of some physical attributes. However, if they are defined with a semantic interpretation in terms of crisp intervals, such as

small = {x | x > 0, x ≪ 1}
medium = {x | x > 0, x ≈ 1}
large = {x | x > 0, x ≫ 1}

the representation may lead to rather nonintuitive results. This is because, in practice, it is not realistic to draw an exact boundary between these intervals. When encoding a particular real number, it may be difficult to decide which of these the number should, or should not, definitely belong to. It may well be the case that what can be said is only that a given number belongs to the small set with a possibility of A and to the medium with a possibility of B. The avoidance of this problem requires gradual membership and hence the break of the laws of excluded-middle and contradiction in Boolean logic. This forms the fundamental motivation for the development of fuzzy logic.
2.2.1 Definition
A fuzzy set can be defined as a set of ordered pairs A = {(x, μ_A(x)) | x ∈ U}. The function μ_A(x) is called the membership function for A, mapping each element of the universe U to a membership degree in the range [0, 1]. The universe may be discrete or continuous. Any fuzzy set containing at least one element with a membership degree of 1 is called normal.
Returning to the example in Section 2.1.1, it may be better to represent the set of old people as a fuzzy set, Old. The membership function for this set is given in Figure 2.1, defined over a range of ages (the universe). Given that the age of Rod is 74, it can be determined that this element belongs to the set Old with a membership degree of 0.95. Similarly, if the age of Freddy is 38, the resulting degree of membership is 0.26. Here, both Rod and Freddy belong to the (fuzzy) set of old people, but Rod has a higher degree of membership to this set.

Figure 2.1 Fuzzy set representing the concept Old.
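As an illustration, such a membership function can be sketched as a simple piecewise-linear ramp. Note that the ramp endpoints below are assumptions; the exact curve of Figure 2.1 is not reproduced here, so the computed degrees only approximate the quoted values of 0.95 and 0.26:

```python
def mu_old(age: float) -> float:
    """Piecewise-linear membership for 'Old' (an illustrative shape only)."""
    if age <= 30:
        return 0.0
    if age >= 80:
        return 1.0
    return (age - 30) / 50.0   # linear ramp between ages 30 and 80

print(mu_old(74))  # 0.88 -- of the same order as the 0.95 quoted for Rod
print(mu_old(38))  # 0.16 -- comparable to Freddy's 0.26
```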
The specification of membership functions is typically subjective, as can be seen in this example. There are many justifiable definitions of the concept Old. Indeed people of different ages may define this concept quite differently. One way of constructing the membership function is by using a voting model, and this way the generated fuzzy sets can be rooted in reality with clear semantics. An alternative to this is the integration of the probability distribution of variables. For example, integrating the probability distribution of height gives a possible version of the tall fuzzy set. This does not help with hedges and sets such as very tall (see Section 2.2.6), and so although the integration of the distribution gives a voting model, it needs more to arrive at the hedges.
In fuzzy logic the truth value of a statement is linguistic (and no longer Boolean), of the form very true, true, more or less true, not very false, false. These logic values are themselves fuzzy sets; some may be compounded fuzzy sets from other atomic ones, by the use of certain operators. As with ordinary crisp sets, different operations can be defined over fuzzy sets.
2.2.2 Operators
The most basic operators on fuzzy sets are the union, intersection, and complement. These are fuzzy extensions of their crisp counterparts, ensuring that if they are applied to crisp sets, the results of their application will be identical to crisp union, intersection, and complement.

2.2.2.1 Intersection The intersection of two fuzzy sets A and B is specified by a binary operation on the unit interval; that is, a function of the form

t : [0, 1] × [0, 1] → [0, 1]

For each element x in the universe, this function takes as its arguments the memberships of x in the fuzzy sets A and B, and yields the membership grade of the element in the set constituting the intersection of A and B:

μ_{A∩B}(x) = t(μ_A(x), μ_B(x))

The following axioms must hold for the operator t to be considered a t-norm, for all x, y, and z in the range [0,1]:

• t(x, 1) = x (boundary condition)
• y ≤ z → t(x, y) ≤ t(x, z) (monotonicity)
• t(x, y) = t(y, x) (commutativity)
• t(x, t(y, z)) = t(t(x, y), z) (associativity)

The following are examples of t-norms that are often used as fuzzy intersections:

• t(x, y) = min(x, y) (standard intersection)
• t(x, y) = xy (algebraic product)
• t(x, y) = max(0, x + y − 1) (bounded difference)
2.2.2.2 Union The union of two fuzzy sets A and B is specified by a function

s : [0, 1] × [0, 1] → [0, 1]
μ_{A∪B}(x) = s(μ_A(x), μ_B(x))

A fuzzy union s is a binary operation that satisfies at least the following axioms for all x, y, and z in [0,1]:

• s(x, 0) = x (boundary condition)
• y ≤ z → s(x, y) ≤ s(x, z) (monotonicity)
• s(x, y) = s(y, x) (commutativity)
• s(x, s(y, z)) = s(s(x, y), z) (associativity)

The following are examples of t-conorms that are often used as fuzzy unions:

• s(x, y) = max(x, y) (standard union)
• s(x, y) = x + y − xy (algebraic sum)
• s(x, y) = min(1, x + y) (bounded sum)
The most popular interpretation of fuzzy union and intersection is the max/min interpretation, primarily due to its ease of computation. This particular interpretation is used in the book.
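These operator families are easy to tabulate in code. The sketch below (illustrative only) applies three common t-norm/t-conorm pairs, plus the standard complement c(a) = 1 − a, to the membership values later used for Andy in Section 2.2.3:

```python
# Common t-norms (fuzzy intersections) and t-conorms (fuzzy unions).
# The min/max pair is the interpretation adopted throughout the book.
t_norms = {
    "min":         lambda x, y: min(x, y),
    "product":     lambda x, y: x * y,
    "Lukasiewicz": lambda x, y: max(0.0, x + y - 1.0),
}
t_conorms = {
    "max":         lambda x, y: max(x, y),
    "prob. sum":   lambda x, y: x + y - x * y,
    "bounded sum": lambda x, y: min(1.0, x + y),
}

mu_tall, mu_young = 0.8, 0.7   # Andy's memberships from Section 2.2.3
for name, t in t_norms.items():
    print(f"tall AND young via {name}: {t(mu_tall, mu_young):.2f}")
for name, s in t_conorms.items():
    print(f"tall OR young via {name}: {s(mu_tall, mu_young):.2f}")
print("NOT tall:", 1.0 - mu_tall)   # standard complement c(a) = 1 - a
```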
2.2.2.3 Complement The complement of a fuzzy set A is specified by a function

c : [0, 1] → [0, 1]

subject to the following:

• c(0) = 1 and c(1) = 0 (boundary condition)
• ∀a, b ∈ [0, 1], if a ≤ b then c(a) ≥ c(b) (monotonicity)
• c is a continuous function (continuity)
• c is an involution (i.e., c(c(a)) = a for each a ∈ [0, 1])

The complement of a fuzzy set can be denoted in a number of ways; ¬A and Ā are also in common use. It may also be represented as A^c, the same way as in crisp set theory. A standard definition of fuzzy complement is

μ_{¬A}(x) = 1 − μ_A(x)
It should not be difficult to see that these definitions cover the conventional operations on crisp sets as their specific cases. Taking the set intersection as an example: in crisp set theory an element x belongs to A ∩ B if and only if (abbreviated to iff hereafter) it belongs to both A and B, and vice versa. This is covered by the definition of fuzzy set intersection because μ_{A∩B}(x) = 1 (or min(μ_A(x), μ_B(x)) = 1) iff μ_A(x) = 1 and μ_B(x) = 1. Otherwise, μ_{A∩B}(x) = 0, given that μ_A(x) and μ_B(x) take values only from {0, 1} in this case.
2.2.3 Simple Example
An example should help with the understanding of the basic concepts and operators introduced above. Suppose that the universe of discourse, X, is a class of students and that a group, A, of students within this class are said to be tall in height. Thus A is a fuzzy subset of X (but X itself is not fuzzy), since the boundary between tall and not tall cannot be naturally defined with a fixed real number. Rather, describing this vague concept using a gradual membership function as characterized in Figure 2.2 is much more appealing. Similarly the fuzzy term very tall can be represented by another fuzzy (sub-)set as also shown in this figure. Given such a definition of the fuzzy set A = tall, a proposition like "student x is tall" can be denoted by μ_A(x).
Figure 2.2 Representation of fuzzy sets "tall" and "very tall".

Assume that Andy is a student in the class and that he has an 80% possibility of being considered as a tall student. This means that μ_A(Andy) = 0.8. Also suppose that another fuzzy set B is defined on the same universe X, whose members are said to be young in age, and that μ_B(Andy) = 0.7, meaning that Andy is thought to be young with a possibility of 70%. From this, using the operations defined above, the following can be derived that is justifiable with respect to common-sense intuitions:
• μ_{¬A}(Andy) = 1 − 0.8 = 0.2, indicating the possibility of Andy being not tall
• μ_{A∪B}(Andy) = max(μ_A(Andy), μ_B(Andy)) = max(0.8, 0.7) = 0.8, indicating the possibility of Andy being tall or young
• μ_{A∩B}(Andy) = min(μ_A(Andy), μ_B(Andy)) = min(0.8, 0.7) = 0.7, indicating the possibility of Andy being both tall and young
Not only having an intuitive justification, these operations are also computationally very simple. However, care should be taken not to explain the results from the conventional Boolean logic point of view. The laws of excluded-middle and contradiction do not hold in fuzzy logic anymore. To make this point clearer, let us attempt to compare the results of A ∪ ¬A and A ∩ ¬A obtained by fuzzy logic with those by Boolean probabilistic logic (although such a comparison is itself a common source for debate in the literature). Applying fuzzy logic to the case above of Andy gives that
• μ_{A∪¬A}(Andy) = max(μ_A(Andy), μ_{¬A}(Andy)) = max(0.8, 0.2) = 0.8
• μ_{A∩¬A}(Andy) = min(μ_A(Andy), μ_{¬A}(Andy)) = min(0.8, 0.2) = 0.2
However, using the theory of probability, different results would be expected such that
• p("Andy is tall" or "Andy is not tall") = 1
• p("Andy is tall" and "Andy is not tall") = 0
This important difference is caused by the deliberate avoidance of the excluded-middle and contradiction laws in fuzzy logic. This avoidance enables fuzzy logic to represent vague concepts that are difficult to capture otherwise. If the memberships of x belonging to A and ¬A are neither 0 nor 1, then it should not be surprising that x is also of a nonzero membership belonging to A and ¬A at the same time.
2.2.4 Fuzzy Relations and Composition
In addition to the three operators defined above, many conventional mathematical functions can be extended to be applied to fuzzy values. This is possible by the use of an extension principle, so called because it provides a general extension of classical mathematical concepts to fuzzy environments. This principle is stated as follows: If an n-ary function f maps the Cartesian product X1 × X2 × · · · × Xn onto a universe Y such that y = f(x1, x2, . . . , xn), and A1, A2, . . . , An are n fuzzy sets in X1, X2, . . . , Xn, respectively, characterized by membership distributions μ_{Ai}(xi), i = 1, 2, . . . , n, a fuzzy set on Y can then be induced as given below, where ∅ is the empty set:

μ_B(y) = max_{x1, . . . , xn : y = f(x1, . . . , xn)} min(μ_{A1}(x1), . . . , μ_{An}(xn))   if f⁻¹(y) ≠ ∅
μ_B(y) = 0   otherwise
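For a unary function the principle reduces to taking, for each image value y, the maximum membership over the preimages of y. A minimal sketch, with an assumed discrete universe and f(x) = x², follows:

```python
# Extension principle for f(x) = x**2 over a small discrete universe:
# mu_B(y) = max{ mu_A(x) : f(x) = y }, and 0 where y has no preimage.
mu_A = {-2: 0.3, -1: 0.7, 0: 1.0, 1: 0.6, 2: 0.2}   # assumed memberships

f = lambda x: x * x
mu_B = {}
for x, m in mu_A.items():
    y = f(x)
    mu_B[y] = max(mu_B.get(y, 0.0), m)   # both -1 and 1 map to y = 1

print(mu_B)  # {4: 0.3, 1: 0.7, 0: 1.0}
```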
A crucial concept in fuzzy set theory is that of fuzzy relations, which is a generalization of the conventional crisp relations. An n-ary fuzzy relation in X1 × X2 × · · · × Xn is, in fact, a fuzzy set on X1 × X2 × · · · × Xn. Fuzzy relations can be composed (and this composition is closely related to the extension principle shown above). For instance, if U is a relation from X1 to X2 (or, equivalently, a relation in X1 × X2), and V is a relation from X2 to X3, then the composition of U and V is a fuzzy relation from X1 to X3, which is denoted by U∘V and defined by

μ_{U∘V}(x1, x3) = max_{x2 ∈ X2} min(μ_U(x1, x2), μ_V(x2, x3)),   x1 ∈ X1, x3 ∈ X3
A convenient way of representing a binary fuzzy relation is to use a matrix. For example, the following matrix, P, can be used to indicate that a computer-game addict is much more fond of multimedia games than conventional ones, be they workstation or PC-based:

P = | 0.9  0.8 |
    | 0.3  0.2 |

Composing P with a second relation Q, which relates the machine types to the manner of interaction (its first column corresponding to keyboard-based games),

Q = | 0.7  0.5 |
    | 0.8  0.3 |

gives

P∘Q = | max(min(0.9, 0.7), min(0.8, 0.8))  max(min(0.9, 0.5), min(0.8, 0.3)) | = | 0.8  0.5 |
      | max(min(0.3, 0.7), min(0.2, 0.8))  max(min(0.3, 0.5), min(0.2, 0.3)) |   | 0.3  0.3 |

This composition indicates that the addict enjoys multimedia games, especially keyboard-based multimedia games.
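Max-min composition is straightforward to implement. The following sketch (the compose helper is ours) reproduces the arithmetic above:

```python
# Max-min composition of the binary fuzzy relations from the text.
P = [[0.9, 0.8],
     [0.3, 0.2]]
Q = [[0.7, 0.5],
     [0.8, 0.3]]

def compose(U, V):
    """mu_{U o V}(i, k) = max over j of min(U[i][j], V[j][k])."""
    return [[max(min(U[i][j], V[j][k]) for j in range(len(V)))
             for k in range(len(V[0]))]
            for i in range(len(U))]

print(compose(P, Q))  # [[0.8, 0.5], [0.3, 0.3]]
```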
2.2.5 Approximate Reasoning
Fuzzy relations and fuzzy relation composition form the basis for approximate reasoning, sometimes termed fuzzy reasoning. Informally, approximate reasoning means a process by which a possibly inexact conclusion is inferred from a collection of inexact premises. Systems performing such reasoning are built upon a collection of fuzzy production (if-then) rules, which provide a formal means of representing domain knowledge acquired from empirical associations or experience. Such systems run upon a given set of facts that allows a (usually partial) instantiation of the premise attributes in some of the rules.
For example, a rule like

if x is A_i and y is B_i, then z is C_i

which governs a particular relationship between the premise attributes x and y and the conclusion attribute z, can be translated into a fuzzy relation R_i:

μ_{R_i}(x, y, z) = min(μ_{A_i}(x), μ_{B_i}(y), μ_{C_i}(z))
Here, A_i, B_i, and C_i are fuzzy sets defined on the universes X, Y, and Z of the attributes x, y, and z, respectively. Provided with this relation, if the premise attributes x and y actually take the fuzzy values A and B, a new fuzzy value of the conclusion attribute z can then be obtained by applying the compositional rule of inference:

C = (A × B)∘R_i

Or,

μ_C(z) = max_{x ∈ X, y ∈ Y} min(μ_A(x), μ_B(y), μ_{R_i}(x, y, z))
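The compositional rule of inference for a single rule can be coded directly from the two equations above. In the sketch below the universes and all membership values are invented purely for illustration:

```python
# Compositional rule of inference for one rule "if x is A_i and y is B_i
# then z is C_i", over small discrete universes (all values illustrative).
X, Y, Z = range(2), range(2), range(3)
A_i = [1.0, 0.4]; B_i = [0.8, 0.6]; C_i = [0.2, 0.9, 1.0]   # rule fuzzy sets
A   = [0.7, 1.0]; B   = [1.0, 0.3]                          # observed inputs

# R_i(x, y, z) = min(A_i(x), B_i(y), C_i(z))
R = [[[min(A_i[x], B_i[y], C_i[z]) for z in Z] for y in Y] for x in X]

# C(z) = max over x, y of min(A(x), B(y), R(x, y, z))
C = [max(min(A[x], B[y], R[x][y][z]) for x in X for y in Y) for z in Z]
print([round(c, 2) for c in C])   # [0.2, 0.7, 0.7]
```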
Of course, an approximate reasoning system would not normally function using only one specific rule, but a set. Given a set of production rules, two different approaches may be adopted to implement the reasoning in such a system; both rely on the use of the compositional rule of inference given above. The first is to apply the compositional rule after an overall relation associated with the entire rule set is found. The second is to utilize the compositional rule locally to each individual production rule first and then to aggregate the results of such applications to form an overall result for the consequent attribute.
Given a set of K if-then rules, basic algorithms for these two approaches are summarized below. For simplicity, it is assumed that each rule contains a fixed
• p(“Andy is tall” or “Andy is not tall ”)=
• p(“Andy is tall” and “Andy is not tall ”)=
This important difference... Applying fuzzy logic to thecase above of Andy gives that
ã A ơA (Andy) = max(μ A (Andy), μ ¬A (Andy)) = max(0.8, 0.2) = 0.8
ã