COMPUTATIONAL INTELLIGENCE AND FEATURE SELECTION
Rough and Fuzzy Approaches
Richard Jensen and Qiang Shen
445 Hoes Lane, Piscataway, NJ 08854
IEEE Press Editorial Board
Lajos Hanzo, Editor in Chief
J. Anderson    T. G. Croda    S. Nahavandi
A. Chatterjee    B. M. Hammerli    W. Reeve
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
Steve Welch, IEEE Press Manager
Jeanne Audino, Project Editor
IEEE Computational Intelligence Society, Sponsor
IEEE-CIS Liaison to IEEE Press, Gary B. Fogel
Technical Reviewers
Chris Hinde, Loughborough University, UK
Hisao Ishibuchi, Osaka Prefecture University, Japan
Books in the IEEE Press Series on Computational Intelligence
Introduction to Evolvable Hardware: A Practical Guide for Designing Self-Adaptive Systems
Emergent Information Technologies and Enabling Policies for Counter-Terrorism
Edited by Robert L Popp and John Yen
2006 978-0471-77615-4
Computationally Intelligent Hybrid Systems
Edited by Seppo J Ovaska
2005 0-471-47668-4
Handbook of Learning and Approximate Dynamic Programming
Edited by Jennie Si, Andrew G. Barto, Warren B. Powell, and Donald Wunsch II
2004 0-471-66054-X
Computational Intelligence: The Experts Speak
Edited by David B. Fogel and Charles J. Robinson
2003 0-471-27454-2
Computational Intelligence in Bioinformatics
Edited by Gary B. Fogel, David W. Corne, and Yi Pan
2008 978-0470-10526-9
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN: 978-0-470-22975-0
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
2.2.6 Linguistic Hedges / 24
2.2.7 Fuzzy Sets and Probability / 25
2.3.1 Information and Decision Systems / 26
2.3.2 Indiscernibility / 27
2.3.4 Positive, Negative, and Boundary Regions / 28
2.3.5 Feature Dependency and Significance / 29
2.3.7 Discernibility Matrix / 31
2.4.1 Fuzzy Equivalence Classes / 33
5.1 Rough Set Attribute Reduction / 86
5.1.1 Additional Search Strategies / 89
5.9 Alternative Approaches / 106
5.10 Comparison of Crisp Approaches / 106
5.10.2 Discernibility Matrix Based Approaches / 108
6.1 Medical Image Classification / 113
6.2.4 Dimensionality Reduction / 119
6.2.5 Information Content of Rough Set Reducts / 120
6.2.7 Efficiency Considerations of RSAR / 124
8.1 Feature Selection with Fuzzy-Rough Sets / 144
9.2.3 Fuzzy-Rough Reduction with Fuzzy Entropy / 171
9.2.4 Fuzzy-Rough Reduction with Fuzzy Gain Ratio / 173
9.2.5 Fuzzy Discernibility Matrix Based FS / 174
10.2 Ant Colony Optimization-Based Selection / 195
10.2.1 Ant Colony Optimization / 196
10.2.2 Traveling Salesman Problem / 197
10.2.3 Ant-Based Feature Selection / 197
12.2.1 Comparison with Unreduced Features / 223
12.2.2 Comparison with Entropy-Based Feature Selection / 226
12.2.4 Alternative Fuzzy Rule Inducer / 230
12.2.5 Results with Feature Grouping / 231
12.2.6 Results with Ant-Based FRFS / 233
13.2.1 Impact of Feature Selection / 241
13.2.2 Comparison with Relief / 244
13.2.3 Comparison with Existing Work / 248
14 APPLICATIONS V: FORENSIC GLASS ANALYSIS 259
14.2 Estimation of Likelihood Ratio / 261
14.2.1 Exponential Model / 262
14.2.2 Biweight Kernel Estimation / 263
14.2.3 Likelihood Ratio with Biweight and Boundary Kernels / 264
14.2.4 Adaptive Kernel / 266
15.3 Fuzzy-Rough Rule Induction / 286
15.4 Hybrid Rule Induction / 287
PREFACE

... as machine learning, pattern recognition, systems control, and signal processing.
FS intends to preserve the meaning of selected attributes; this forms a sharp contrast with those approaches that reduce problem complexity by transforming the representational forms of the attributes.
Feature selection techniques have been applied to small- and medium-sized datasets in order to locate the most informative features for later use. Many FS methods have been developed, and this book provides a critical review of these methods, with particular emphasis on their current limitations. To help the understanding of the readership, the book systematically presents the leading methods reviewed in a consistent algorithmic framework. The book also details those computational intelligence based methods (e.g., fuzzy rule induction and swarm optimization) that either benefit from joint use with feature selection or help improve the selection mechanism.
From this background the book introduces the original approach to feature selection using conventional rough set theory, exploiting the rough set ideology in that only the supplied data and no other information is used. Based on demonstrated applications, the book reviews the main limitation of this approach in the sense that all data must be discrete. The book then proposes and develops a fundamental approach based on fuzzy-rough sets. It also presents optimizations, extensions, and further new developments of this approach whose underlying ideas are generally applicable to other FS mechanisms.
Real-world applications, with worked examples, are provided that illustrate the power and efficacy of the feature selection approaches covered in the book. In particular, the algorithms discussed have proved to be successful in handling tasks that involve datasets containing huge numbers of features (on the order of tens of thousands), which would be extremely difficult to process further. Such applications include Web content classification, complex systems monitoring, and algae population estimation. The book shows the success of these applications by evaluating the algorithms statistically with respect to the existing leading approaches to the reduction of problem complexity.
Finally, this book concludes with initial supplementary investigations into the associated areas of feature selection, including rule induction and clustering methods using hybridizations of fuzzy and rough set theories. This research opens up many new frontiers for the continued development of the core technologies introduced in the field of computational intelligence.
This book is primarily intended for senior undergraduates, postgraduates, researchers, and professional engineers. However, it offers a straightforward presentation of the underlying concepts that anyone with a nonspecialist background should be able to understand and apply.
Acknowledgments
Thanks to those who helped at various stages in the development of the ideas presented in this book, particularly: Colin Aitken, Stuart Aitken, Malcolm Beynon, Chris Cornelis, Alexios Chouchoulas, Michelle Galea, Knox Haggie, Joe Halliwell, Zhiheng Huang, Jeroen Keppens, Pawan Lingras, Javier Marin-Blazquez, Neil Mac Parthalain, Khairul Rasmani, Dave Robertson, Changjing Shang, Andrew Tuson, Xiangyang Wang, and Greg Zadora. Many thanks to the University of Edinburgh and Aberystwyth University where this research was undertaken and compiled.
Thanks must also go to those friends and family who have contributed in some part to this work, particularly Elaine Jensen, Changjing Shang, Yuan Shen, Sarah Sholl, Mike Gordon, Andrew Herrick, Iain Langlands, Tossapon Boongoen, Xin Fu, and Ruiqing Zhao.
The editors and staff at IEEE Press were extremely helpful. We particularly thank David Fogel and Steve Welch for their support, enthusiasm, and encouragement. Thanks also to the anonymous referees for their comments and suggestions that have enhanced the work presented here, and to Elsevier, Springer, and World Scientific for allowing the reuse of materials previously published in their journals. Additional thanks go to those authors whose research is included in this book, for their contributions to this interesting and ever-developing area.
Richard Jensen and Qiang Shen
Aberystwyth University
17th June 2008
1 THE IMPORTANCE OF FEATURE SELECTION
1.1 KNOWLEDGE DISCOVERY
It is estimated that every 20 months or so the amount of information in the world doubles. In the same way, tools for use in the various knowledge fields (acquisition, storage, retrieval, maintenance, etc.) must develop to combat this growth. Knowledge is only valuable when it can be used efficiently and effectively; therefore knowledge management is increasingly being recognized as a key element in extracting its value. This is true both within the research, development, and application of computational intelligence and beyond.
Central to this issue is the knowledge discovery process, particularly knowledge discovery in databases (KDD) [10,90,97,314]. KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Traditionally data was turned into knowledge by means of manual analysis and interpretation. For many applications manual probing of data is slow, costly, and highly subjective. Indeed, as data volumes grow dramatically, manual data analysis is becoming completely impractical in many domains. This motivates the need for efficient, automated knowledge discovery. The KDD process can be decomposed into the following steps, as illustrated in Figure 1.1:
• Data selection. A target dataset is selected or created. Several existing datasets may be joined together to obtain an appropriate example set.
Figure 1.1 Knowledge discovery process (adapted from [97]).
• Data cleaning/preprocessing. This phase includes, among other tasks, noise removal/reduction, missing value imputation, and attribute discretization. The goal is to improve the overall quality of any information that may be discovered.
• Data reduction. Most datasets will contain a certain amount of redundancy that will not aid knowledge discovery and may in fact mislead the process. The aim of this step is to find useful features to represent the data and remove nonrelevant features. Time is also saved during the data-mining step as a result.
• Data mining. A data-mining method (the extraction of hidden predictive information from large databases) is selected depending on the goals of the knowledge discovery task. The choice of algorithm used may be dependent on many factors, including the source of the dataset and the values it contains.
• Interpretation/evaluation. Once knowledge has been discovered, it is evaluated with respect to validity, usefulness, novelty, and simplicity. This may require repeating some of the previous steps.
The third step in the knowledge discovery process, namely data reduction, is often a source of significant data loss. It is this step that forms the focus of attention of this book. The high dimensionality of databases can be reduced using suitable techniques, depending on the requirements of the future KDD processes. These techniques fall into one of two categories: those that transform the underlying meaning of the data features and those that preserve the semantics. Feature selection (FS) methods belong to the latter category, where a smaller set of the original features is chosen based on a subset evaluation function. In knowledge discovery, feature selection methods are particularly desirable as these facilitate the interpretability of the resulting knowledge.
1.2 FEATURE SELECTION
There are often many features in KDD, and combinatorially large numbers of feature combinations, to select from. Note that the number of feature subset combinations with m features from a collection of N total features can be extremely large, this number being N!/[m!(N − m)!]. It might be expected that the inclusion of an increasing number of features would increase the likelihood of including enough information to distinguish between classes. Unfortunately, this is not true if the size of the training dataset does not also increase rapidly with each additional feature included. This is the so-called curse of dimensionality [26]. A high-dimensional dataset increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Most techniques employ some degree of reduction in order to cope with large amounts of data, so an efficient and effective reduction method is required.
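To get a feel for how quickly this search space grows, the short sketch below (added here for illustration; the choice of dataset sizes is arbitrary) evaluates N!/[m!(N − m)!] directly:

```python
from math import comb

# comb(N, m) computes N!/[m!(N - m)!], the number of m-feature subsets.
for N in (10, 20, 50):
    size_10 = comb(N, 10)                                   # subsets of exactly 10 features
    all_subsets = sum(comb(N, m) for m in range(1, N + 1))  # every nonempty subset
    print(f"N={N:2d}: {size_10:,} subsets of size 10; {all_subsets:,} subsets in total")
```

Even at N = 50 an exhaustive search over all subsets is already infeasible, which is why the heuristic search strategies discussed later in the book matter.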
1.2.1 The Task
The task of feature selection is to select a subset of the original features present in a given dataset that provides most of the useful information. Hence, after selection has taken place, the dataset should still have most of the important information present. In fact, good FS techniques should be able to detect and ignore noisy and misleading features. The result of this is that the dataset quality might even increase after selection.
There are two feature qualities that must be considered by FS methods: relevancy and redundancy. A feature is said to be relevant if it is predictive of the decision feature(s); otherwise, it is irrelevant. A feature is considered to be redundant if it is highly correlated with other features. An informative feature is one that is highly correlated with the decision concept(s) but is highly uncorrelated with other features (although low correlation does not mean absence of relationship). Similarly subsets of features should exhibit these properties of relevancy and nonredundancy if they are to be useful.
In [171] two notions of feature relevance, strong and weak relevance, were defined. If a feature is strongly relevant, this implies that it cannot be removed from the dataset without resulting in a loss of predictive accuracy. If it is weakly relevant, then the feature may sometimes contribute to accuracy, though this depends on which other features are considered. These definitions are independent of the specific learning algorithm used. However, there is no guarantee that a relevant feature will be useful to such an algorithm.
It is quite possible for two features to be useless individually, and yet highly predictive if taken together. In FS terminology, they may be both redundant and irrelevant on their own, but their combination provides invaluable information. For example, in the exclusive-or problem, where the classes are not linearly separable, the two features on their own provide no information concerning this separability. It is also the case that they are uncorrelated with each other. However, when taken together, the two features are highly informative and can provide good class separation. Hence in FS the search is typically for high-quality feature subsets, and not merely a ranking of features.
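The exclusive-or effect can be verified numerically. The sketch below (illustrative only; the mutual_information helper is ours) measures, in bits, how much a feature taken alone and the two features taken together reveal about the XOR class:

```python
from collections import Counter
from math import log2

# XOR truth table: class = x XOR y. Each feature alone carries no
# information about the class; together they determine it completely.
data = [((x, y), x ^ y) for x in (0, 1) for y in (0, 1)]

def mutual_information(pairs):
    """I(F; C) for (feature_value, class) pairs, all rows equally likely."""
    n = len(pairs)
    p_fc = Counter(pairs)                    # joint distribution of (F, C)
    p_f = Counter(f for f, _ in pairs)       # marginal of F
    p_c = Counter(c for _, c in pairs)       # marginal of C
    return sum((cnt / n) * log2((cnt / n) / ((p_f[f] / n) * (p_c[c] / n)))
               for (f, c), cnt in p_fc.items())

print(mutual_information([(x, c) for (x, _), c in data]))       # 0.0 bits: x alone
print(mutual_information([(y, c) for (_, y), c in data]))       # 0.0 bits: y alone
print(mutual_information([((x, y), c) for (x, y), c in data]))  # 1.0 bit: jointly
```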
1.2.2 The Benefits
There are several potential benefits of feature selection:
1. Facilitating data visualization. By reducing data to fewer dimensions, trends within the data can be more readily recognized. This can be very important where only a few features have an influence on data outcomes. Learning algorithms by themselves may not be able to distinguish these factors from the rest of the feature set, leading to the generation of overly complicated models. The interpretation of such models then becomes an unnecessarily tedious task.
2. Reducing the measurement and storage requirements. In domains where features correspond to particular measurements (e.g., in a water treatment plant [322]), fewer features are highly desirable due to the expense and time-costliness of taking these measurements. For domains where large datasets are encountered and manipulated (e.g., text categorization [162]), a reduction in data size is required to enable storage where space is an issue.
3. Reducing training and utilization times. With smaller datasets, the runtimes of learning algorithms can be significantly improved, both for training and classification phases. It can sometimes be the case that the computational complexity of learning algorithms even prohibits their application to large problems. This is remedied through FS, which can reduce the problem to a more manageable size.
4. Improving prediction performance. Classifier accuracy can be increased as a result of feature selection, through the removal of noisy or misleading features. Algorithms trained on a full set of features must be able to discern and ignore these attributes if they are to produce useful, accurate predictions for unseen data.
For those methods that extract knowledge from data (e.g., rule induction) the benefits of FS also include improving the readability of the discovered knowledge. When induction algorithms are applied to reduced data, the resulting rules are more compact. A good feature selection step will remove unnecessary attributes which can affect both rule comprehension and rule prediction performance.
1.3 ROUGH SETS

Rough set theory (RST) has been applied with much success across many domains (including classification [54,84,164], systems monitoring [322], clustering [131], and expert systems [354]; see LNCS Transactions on Rough Sets for more examples). This success is due in part to the following aspects of the theory:
• Only the facts hidden in data are analyzed.
• No additional information about the data is required, such as thresholds or expert knowledge on a particular domain.
• It finds a minimal knowledge representation.
The work on RST offers an alternative, and formal, methodology that can be employed to reduce the dimensionality of datasets, as a preprocessing step to assist any chosen modeling method for learning from data. It helps select the most information-rich features in a dataset, without transforming the data, all the while attempting to minimize information loss during the selection process. Computationally, the approach is highly efficient, relying on simple set operations, which makes it suitable as a preprocessor for techniques that are much more complex. Unlike statistical correlation-reducing approaches [77], it requires no human input or intervention. Most importantly, it also retains the semantics of the data, which makes the resulting models more transparent to human scrutiny.
Combined with an automated intelligent modeler, say a fuzzy system or a neural network, the feature selection approach based on RST not only can retain the descriptive power of the learned models but also allow simpler system structures to reach the knowledge engineer and field operator. This helps enhance the interoperability and understandability of the resultant models and their reasoning.
As RST handles only one type of imperfection found in data, it is complementary to other concepts for the purpose, such as fuzzy set theory. The two fields may be considered analogous in the sense that both can tolerate inconsistency and uncertainty, the difference being the type of uncertainty and their approach to it. Fuzzy sets are concerned with vagueness; rough sets are concerned with indiscernibility. Many deep relationships have been established, and more so, most recent studies have concluded at this complementary nature of the two methodologies, especially in the context of granular computing. Therefore it is desirable to extend and hybridize the underlying concepts to deal with additional aspects of data imperfection. Such developments offer a high degree of flexibility and provide robust solutions and advanced tools for data analysis.
1.4 APPLICATIONS
As many systems in a variety of fields deal with datasets of large dimensionality, feature selection has found wide applicability. Some of the main areas of application are shown in Figure 1.2.

Figure 1.2 Typical feature selection application areas.

Feature selection algorithms are often applied to optimize the classification performance of image recognition systems [158,332]. This is motivated by a peaking phenomenon commonly observed when classifiers are trained with a limited set of training samples. If the number of features is increased, the classification rate of the classifier decreases after a peak. In melanoma diagnosis, for instance, the clinical accuracy of dermatologists in identifying malignant melanomas is only between 65% and 85% [124]. With the application of FS algorithms, automated skin tumor recognition systems can produce classification accuracies above 95%.
Structural and functional data from analysis of the human genome have increased manyfold in recent years, presenting enormous opportunities and challenges for AI tasks. In particular, gene expression microarrays are a rapidly maturing technology that provides the opportunity to analyze the expression levels of thousands or tens of thousands of genes in a single experiment. A typical classification task is to distinguish between healthy and cancer patients based on their gene expression profile. Feature selectors are used to drastically reduce the size of these datasets, which would otherwise have been unsuitable for further processing [318,390,391]. Other applications within bioinformatics include QSAR [46], where the goal is to form hypotheses relating chemical features of molecules to their molecular activity, and splice site prediction [299], where junctions between coding and noncoding regions of DNA are detected.
representations of knowledge is the use of if-then production rules Yet real-life
problem domains usually lack generic and systematic expert rules for mappingfeature patterns onto their underlying classes In order to speed up the rule
Trang 25induction process and reduce rule complexity, a selection step is required Thisreduces the dimensionality of potentially very large feature sets while minimizingthe loss of information needed for rule induction It has an advantageous sideeffect in that it removes redundancy from the historical data This also helpssimplify the design and implementation of the actual pattern classifier itself, bydetermining what features should be made available to the system In additionthe reduced input dimensionality increases the processing speed of the classifier,leading to better response times [12,51].
Many inferential measurement systems are developed using data-based methodologies; the models used to infer the value of target features are developed with real-time plant data. This implies that inferential systems are heavily influenced by the quality of the data used to develop their internal models. Complex application problems, such as reliable monitoring and diagnosis of industrial plants, are likely to present large numbers of features, many of which will be redundant for the task at hand. Additionally there is an associated cost with the measurement of these features. In these situations it is very useful to have an intelligent system capable of selecting the most relevant features needed to build an accurate and reliable model for the process [170,284,322].
The task of text clustering is to group similar documents together, represented as a bag of words. This representation raises one severe problem: the high dimensionality of the feature space and the inherent data sparsity. This can significantly affect the performance of clustering algorithms, so it is highly desirable to reduce this feature space size. Dimensionality reduction techniques have been successfully applied to this area, both those that destroy data semantics and those that preserve them (feature selectors) [68,197].
Similar to clustering, text categorization views documents as a collection of words. Documents are examined, with their constituent keywords extracted and rated according to criteria such as their frequency of occurrence. As the number of keywords extracted is usually in the order of tens of thousands, dimensionality reduction must be performed. This can take the form of simplistic filtering methods such as word stemming or the use of stop-word lists. However, filtering methods do not provide enough reduction for use in automated categorizers, so a further feature selection process must take place. Recent applications of FS in this area include Web page and bookmark categorization [102,162].
1.5 STRUCTURE
The rest of this book is structured as follows (see Figure 1.3):
• Chapter 2: Set Theory. A brief introduction to the various set theories is presented in this chapter. Essential concepts from classical set theory, fuzzy set theory, rough set theory, and hybrid fuzzy-rough set theory are presented and illustrated where necessary.
Figure 1.3 How to read this book.
• Chapter 3: Classification Methods. This chapter discusses both crisp and fuzzy methods for the task of classification. Many of the methods presented here are used in systems later in the book.
• Chapter 4: Dimensionality Reduction. A systematic overview of current techniques for dimensionality reduction, with a particular emphasis on feature selection, is given in this chapter. It begins with a discussion of those reduction methods that irreversibly transform data semantics. This is followed by a more detailed description and evaluation of the leading feature selectors, presented in a unified algorithmic framework. A simple example illustrates their operation.
• Chapter 5: Rough Set-based Approaches to Feature Selection. This chapter presents an overview of the existing research regarding the application of rough set theory to feature selection. Rough set attribute reduction (RSAR), the precursor to the developments in this book, is described in detail. However, these methods are unsuited to the problems discussed in Section 5.11. In particular, they are unable to handle noisy or real-valued data effectively, a significant problem if they are to be employed within real-world applications.
• Chapter 6: Applications I: Use of RSAR. This chapter looks at the applications of RSAR in several challenging domains: medical image classification, text categorization, and algae population estimation. Details of each classification system are given with several comparative studies carried out that investigate RSAR's utility. Additionally a brief introduction to other applications that use a crisp rough set approach is provided for the interested reader.
• Chapter 7: Rough and Fuzzy Hybridization. There has been great interest in developing methodologies that are capable of dealing with imprecision and uncertainty. The large amount of research currently being carried out in fuzzy and rough sets is representative of this. Many deep relationships have been established, and recent studies have concluded at the complementary nature of the two methodologies. Therefore it is desirable to extend and hybridize the underlying concepts to deal with additional aspects of data imperfection. Such developments offer a high degree of flexibility and provide robust solutions and advanced tools for data analysis. A general survey of this research is presented in the chapter, with a focus on applications of the theory to disparate domains.
• Chapter 8: Fuzzy-Rough Feature Selection. In this chapter the theoretical developments behind this new feature selection method are presented together with a proof of generalization. This novel approach uses fuzzy-rough sets to handle many of the problems facing feature selectors outlined previously. A complexity analysis of the main selection algorithm is given. The operation of the approach and its benefits are shown through the use of two simple examples. To evaluate this new fuzzy-rough measure of feature significance, comparative investigations are carried out with the current leading significance measures.
• Chapter 9: New Developments of FRFS. Feature selection has been shown to be highly useful at reducing data dimensionality, but possesses several problems that render it ineffective for datasets possessing tens of thousands of features. This chapter presents three new approaches to fuzzy-rough feature selection (FRFS) based on fuzzy similarity relations. The first employs the new similarity-based fuzzy lower approximation to locate subsets. The second uses boundary region information to guide search. Finally, a fuzzy extension to crisp discernibility matrices is given in order to discover fuzzy-rough subsets. The methods are evaluated and compared using benchmark data.
• Chapter 10: Further Advanced FS Methods. This chapter presents two promising areas in feature selection. The first, feature grouping, is developed from recent work in the literature where groups of features are selected simultaneously. By reasoning with fuzzy labels, the search process can be made more intelligent, allowing various search strategies to be employed. The second, ant-based feature selection, seeks to address the nontrivial issue of finding the smallest optimal feature subsets. This approach to feature selection uses artificial ants and pheromone trails in the search for the best subsets. Both of these developments can be applied within feature selection, in general, but are applied to the specific problem of subset search within FRFS in this book.
• Chapter 11: Applications II: Web Content Categorization. With the explosive growth of information on the Web, there is an abundance of information that must be dealt with effectively and efficiently. This area, in particular, deserves the attention of feature selection due to the increasing demand for high-performance intelligent Internet applications. This motivates the application of FRFS to the automatic categorization of user bookmarks/favorites and Web pages. The results show that FRFS significantly reduces data dimensionality by several orders of magnitude with little resulting loss in classification accuracy.
• Chapter 12: Applications III: Complex Systems Monitoring. Complex application problems, such as reliable monitoring and diagnosis of industrial plants, are likely to present large numbers of features, many of which will be redundant for the task at hand. With the use of FRFS, these extraneous features can be removed. This not only makes resultant rulesets generated from such data much more concise and readable but can reduce the expense due to the monitoring of redundant features. The monitoring system is applied to water treatment plant data, producing better classification accuracies than those resulting from the full feature set and several other reduction methods.
• Chapter 13: Applications IV: Algae Population Estimation. Biologists need to identify and isolate the chemical parameters of rapid algae population fluctuations in order to limit their detrimental effect on the environment. This chapter describes an estimator of algae populations, a hybrid system involving FRFS that approximates, given certain water characteristics, the size of algae populations. The system significantly reduces computer time and space requirements through the use of feature selection. The results show that estimators using a fuzzy-rough feature selection step produce more accurate predictions of algae populations in general.
• Chapter 14: Applications V: Forensic Glass Analysis. The evaluation of glass evidence in forensic science is an important issue. Traditionally this has depended on the comparison of the physical and chemical attributes of an unknown fragment with a control fragment. A high degree of discrimination between glass fragments is now achievable due to advances in analytical capabilities. A random effects model using two levels of hierarchical nesting is applied to the calculation of a likelihood ratio (LR) as a solution to the problem of comparison between two sets of replicated continuous observations where it is unknown whether the sets of measurements shared a common origin. This chapter presents the investigation into the use of feature evaluation for the purpose of selecting a single variable to model without the need for expert knowledge. Results are recorded for several selectors using normal, exponential, adaptive, and biweight kernel estimation techniques. Misclassification rates for the LR estimators are used to measure performance.
• Chapter 15: Supplementary Developments and Investigations. This chapter offers initial investigations and ideas for further work, which were developed concurrently with the ideas presented in the previous chapters. First, the utility of using the problem formulation and solution techniques from propositional satisfiability for finding rough set reducts is considered. This is presented with an initial experimental evaluation of such an approach, comparing the results with a standard rough set-based algorithm, RSAR. Second, the possibility of universal reducts is proposed as a way of generating more useful feature subsets. Third, fuzzy decision tree induction based on the fuzzy-rough metric developed in this book is proposed. Other proposed areas of interest include fuzzy-rough clustering and fuzzy-rough fuzzification optimization.
2 SET THEORY
The problem of capturing the vagueness present in the real world is difficult to overcome using classical set theory alone. As a result several extensions to set theory have been proposed that deal with different aspects of uncertainty. Some of the most popular methods for this are fuzzy set theory, rough set theory, and their hybridization, fuzzy-rough set theory. This chapter starts with a quick introduction to classical set theory, using a simple example to illustrate the concept. Then an introduction to fuzzy sets is given, covering the essentials required for a basic understanding of their use. There are many useful introductory resources regarding fuzzy set theory, for example [66,267]. Following this, rough set theory is introduced with an example dataset to illustrate the fundamental concepts. Finally, fuzzy-rough set theory is detailed.
2.1 CLASSICAL SET THEORY
2.1.1 Definition
In classical set theory, elements can belong to a set or not at all. For example, in the set of old people, defined here as {Rod, Jane, Freddy}, the element Rod belongs to this set whereas the element George does not. No distinction is made within a set between elements; all set members belong fully. This may be considered to be a source of information loss for certain applications. Returning to the example, Rod may be older than Freddy, but by this formulation both
are considered to be equally old. The order of elements within a set is not considered.
More formally, let U be a space of objects, referred to as the universe of discourse, and x an element of U. A classical (crisp) set A, A ⊆ U, is defined as a collection of elements x ∈ U, such that each element x can either belong to the set or not. A classical set A can be represented by a set of ordered pairs (x, 0) or (x, 1) for each element, indicating x ∉ A or x ∈ A, respectively.
Two sets A and B are said to be equal if they contain exactly the same elements; every element of A is an element of B and every element of B is an element of A. The cardinality of a set A is a measure of the number of elements of the set, and is often denoted |A|. For example, the set {Rod, Jane, Freddy} has a cardinality of 3 (|{Rod, Jane, Freddy}| = 3). A set with no members is called the empty set, usually denoted ∅, and has a cardinality of zero.
2.1.2 Subsets
If every element of a set A is a member of a set B, then A is said to be a subset of B, denoted A ⊆ B (or equivalently B ⊇ A). If A is a subset of B but is not equal to B, then it is a proper subset of B, denoted A ⊂ B (or equivalently B ⊃ A). For example, if A = {Jane, Rod}, B = {Rod, Jane, Geoffrey}, and C = {Rod, Jane, Freddy}:
A ⊆ B ({Jane, Rod} ⊆ {Rod, Jane, Geoffrey})
A ⊆ A ({Jane, Rod} ⊆ {Jane, Rod})
A ⊂ B ({Jane, Rod} ⊂ {Rod, Jane, Geoffrey})
A ⊆ C ({Jane, Rod} ⊆ {Rod, Jane, Freddy})
B ⊈ C ({Rod, Jane, Geoffrey} ⊈ {Rod, Jane, Freddy})
2.1.3 Operators
Several operations exist for the manipulation of sets. Only the fundamental operations are considered here.

2.1.3.1 Union The union of two sets A and B is a set that contains all the members of both sets and is denoted A ∪ B. More formally, A ∪ B = {x | (x ∈ A) or (x ∈ B)}. Properties of the union operator include:

A ∪ B = B ∪ A
A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∪ A = A
A ∪ ∅ = A
A ∪ U = U
2.1.3.2 Intersection The intersection of two sets A and B is a set that contains only those elements that A and B have in common. More formally, A ∩ B = {x | (x ∈ A) and (x ∈ B)}. If the intersection of A and B is the empty set, then sets A and B are said to be disjoint. Properties of the intersection operator include:

A ∩ B = B ∩ A
A ∩ (B ∩ C) = (A ∩ B) ∩ C
A ∩ A = A
A ∩ ∅ = ∅
A ∩ U = A

2.1.3.3 Difference The difference of two sets A and B is the set of all elements of B that do not belong to A. This is denoted B − A (or B \ A). More formally, B − A = {x | (x ∈ B) and not (x ∈ A)}.
In some situations, all sets may be considered to be subsets of a given universal set U. In this case, U − A is the absolute complement (or simply complement) of A, and is denoted Ā or A^c. More formally, A^c = {x | (x ∈ U) and not (x ∈ A)}. Properties of complements include:

A ∪ A^c = U
A ∩ A^c = ∅
(A^c)^c = A
∅^c = U
U^c = ∅
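All of the crisp operations above map directly onto Python's built-in set type, as the following sketch of the running example shows (the universe chosen here is an assumption for illustration):

```python
# Crisp set operations from Section 2.1, using Python's built-in sets.
U = {"Rod", "Jane", "Freddy", "Geoffrey", "George"}   # assumed universe of discourse
A = {"Jane", "Rod"}
B = {"Rod", "Jane", "Geoffrey"}

print(A | B)        # union: {'Rod', 'Jane', 'Geoffrey'}
print(A & B)        # intersection: {'Jane', 'Rod'}
print(B - A)        # difference B - A: {'Geoffrey'}
print(U - A)        # absolute complement of A within U
print(A <= B, A < B, len(A))  # subset, proper subset, cardinality |A| = 2
```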
2.2 FUZZY SET THEORY
A distinct approach to coping with reasoning under uncertain circumstances is to use the theory of fuzzy sets [408]. The main objective of this theory is to develop a methodology for the formulation and solution of problems that are too complex or ill-defined to be suitable for analysis by conventional Boolean techniques. It deals with subsets of a universe of discourse, where the transition between full membership of a subset and no membership is gradual rather than abrupt as in classical set theory. Such subsets are called fuzzy sets.
Fuzzy sets arise, for instance, when mathematical descriptions of ambiguity and ambivalence are needed. In the real world the attributes of, say, a physical system often emerge from an elusive vagueness or fuzziness, a readjustment to context, or an effect of human imprecision. The use of the "soft" boundaries of fuzzy sets, namely the graded memberships, allows subjective knowledge to be utilized in defining these attributes. With the accumulation of knowledge the subjectively assigned memberships can, of course, be modified. Even in some cases where precise knowledge is available, fuzziness may be a concomitant of complexity involved in the reasoning process.
The adoption of fuzzy sets helps to ease the requirement for encoding uncertain domain knowledge. For example, labels like small, medium, and large have an intuitive appeal to represent values of some physical attributes. However, if they are defined with a semantic interpretation in terms of crisp intervals, such as

small = {x | x > 0, x ≪ 1}
medium = {x | x > 0, x ≈ 1}
large = {x | x > 0, x ≫ 1}

the representation may lead to rather nonintuitive results. This is because, in practice, it is not realistic to draw an exact boundary between these intervals. When encoding a particular real number, it may be difficult to decide which of these the number should, or should not, definitely belong to. It may well be the case that what can be said is only that a given number belongs to the small set with a possibility of A and to the medium with a possibility of B. The avoidance of this problem requires gradual membership and hence the break of the laws of excluded-middle and contradiction in Boolean logic. This forms the fundamental motivation for the development of fuzzy logic.
2.2.1 Definition
A fuzzy set can be defined as a set of ordered pairs A = {(x, μ_A(x)) | x ∈ U}. The function μ_A(x) is called the membership function for A, mapping each element of the universe U to a membership degree in the range [0, 1]. The universe may be discrete or continuous. Any fuzzy set containing at least one element with a membership degree of 1 is called normal.
Returning to the example in Section 2.1.1, it may be better to represent the set of old people as a fuzzy set, Old. The membership function for this set is given in Figure 2.1, defined over a range of ages (the universe). Given that the age of Rod is 74, it can be determined that this element belongs to the set Old with a membership degree of 0.95. Similarly, if the age of Freddy is 38, the resulting degree of membership is 0.26. Here, both Rod and Freddy belong to the (fuzzy) set of old people, but Rod has a higher degree of membership to this set.

Figure 2.1 Fuzzy set representing the concept Old.
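As an illustration, such a membership function can be sketched as a simple piecewise-linear ramp. Note that the ramp endpoints below are assumptions; the exact curve of Figure 2.1 is not reproduced here, so the computed degrees only approximate the quoted values of 0.95 and 0.26:

```python
def mu_old(age: float) -> float:
    """Piecewise-linear membership for 'Old' (an illustrative shape only)."""
    if age <= 30:
        return 0.0
    if age >= 80:
        return 1.0
    return (age - 30) / 50.0   # linear ramp between ages 30 and 80

print(mu_old(74))  # 0.88 -- of the same order as the 0.95 quoted for Rod
print(mu_old(38))  # 0.16 -- comparable to Freddy's 0.26
```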
The specification of membership functions is typically subjective, as can be seen in this example. There are many justifiable definitions of the concept Old. Indeed people of different ages may define this concept quite differently. One way of constructing the membership function is by using a voting model, and this way the generated fuzzy sets can be rooted in reality with clear semantics. An alternative to this is the integration of the probability distribution of variables. For example, integrating the probability distribution of height gives a possible version of the tall fuzzy set. This does not help with hedges and sets such as very tall (see Section 2.2.6), and so although the integration of the distribution gives a voting model, it needs more to arrive at the hedges.
In fuzzy logic the truth value of a statement is linguistic (and no longer Boolean), of the form very true, true, more or less true, not very false, false. These logic values are themselves fuzzy sets; some may be compounded fuzzy sets from other atomic ones, by the use of certain operators. As with ordinary crisp sets, different operations can be defined over fuzzy sets.
2.2.2 Operators
The most basic operators on fuzzy sets are the union, intersection, and complement. These are fuzzy extensions of their crisp counterparts, ensuring that if they are applied to crisp sets, the results of their application will be identical to crisp union, intersection, and complement.

2.2.2.1 Intersection The intersection of two fuzzy sets A and B is specified by a binary operation on the unit interval; that is, a function of the form

t : [0, 1] × [0, 1] → [0, 1]

For each element x in the universe, this function takes as its arguments the memberships of x in the fuzzy sets A and B, and yields the membership grade of the element in the set constituting the intersection of A and B:

μ_{A∩B}(x) = t(μ_A(x), μ_B(x))

The following axioms must hold for the operator t to be considered a t-norm, for all x, y, and z in the range [0,1]:

• t(x, 1) = x (boundary condition)
• y ≤ z → t(x, y) ≤ t(x, z) (monotonicity)
• t(x, y) = t(y, x) (commutativity)
• t(x, t(y, z)) = t(t(x, y), z) (associativity)

The following are examples of t-norms that are often used as fuzzy intersections:

• t(x, y) = min(x, y) (standard intersection)
• t(x, y) = xy (algebraic product)
• t(x, y) = max(0, x + y − 1) (bounded difference)
2.2.2.2 Union The union of two fuzzy sets A and B is specified by a function

s : [0, 1] × [0, 1] → [0, 1]
μ_{A∪B}(x) = s(μ_A(x), μ_B(x))

A fuzzy union s is a binary operation that satisfies at least the following axioms for all x, y, and z in [0,1]:

• s(x, 0) = x (boundary condition)
• y ≤ z → s(x, y) ≤ s(x, z) (monotonicity)
• s(x, y) = s(y, x) (commutativity)
• s(x, s(y, z)) = s(s(x, y), z) (associativity)

The following are examples of t-conorms that are often used as fuzzy unions:

• s(x, y) = max(x, y) (standard union)
• s(x, y) = x + y − xy (algebraic sum)
• s(x, y) = min(1, x + y) (bounded sum)
The most popular interpretation of fuzzy union and intersection is the max/min interpretation, primarily due to its ease of computation. This particular interpretation is used in the book.
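These operator families are easy to tabulate in code. The sketch below (illustrative only) applies three common t-norm/t-conorm pairs, plus the standard complement c(a) = 1 − a, to the membership values later used for Andy in Section 2.2.3:

```python
# Common t-norms (fuzzy intersections) and t-conorms (fuzzy unions).
# The min/max pair is the interpretation adopted throughout the book.
t_norms = {
    "min":         lambda x, y: min(x, y),
    "product":     lambda x, y: x * y,
    "Lukasiewicz": lambda x, y: max(0.0, x + y - 1.0),
}
t_conorms = {
    "max":         lambda x, y: max(x, y),
    "prob. sum":   lambda x, y: x + y - x * y,
    "bounded sum": lambda x, y: min(1.0, x + y),
}

mu_tall, mu_young = 0.8, 0.7   # Andy's memberships from Section 2.2.3
for name, t in t_norms.items():
    print(f"tall AND young via {name}: {t(mu_tall, mu_young):.2f}")
for name, s in t_conorms.items():
    print(f"tall OR young via {name}: {s(mu_tall, mu_young):.2f}")
print("NOT tall:", 1.0 - mu_tall)   # standard complement c(a) = 1 - a
```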
2.2.2.3 Complement The complement of a fuzzy set A is specified by a function

c : [0, 1] → [0, 1]

subject to the following:

• c(0) = 1 and c(1) = 0 (boundary condition)
• ∀a, b ∈ [0, 1], if a ≤ b then c(a) ≥ c(b) (monotonicity)
• c is a continuous function (continuity)
• c is an involution (i.e., c(c(a)) = a for each a ∈ [0, 1])

The complement of a fuzzy set can be denoted in a number of ways; ¬A and Ā are also in common use. It may also be represented as A^c, the same way as in crisp set theory. A standard definition of fuzzy complement is

μ_{¬A}(x) = 1 − μ_A(x)
It should not be difficult to see that these definitions cover the conventional operations on crisp sets as their specific cases. Taking the set intersection as an example: in crisp set theory an element x belongs to A ∩ B if and only if (abbreviated to iff hereafter) it belongs to both A and B, and vice versa. This is covered by the definition of fuzzy set intersection because μ_{A∩B}(x) = 1 (or min(μ_A(x), μ_B(x)) = 1) iff μ_A(x) = 1 and μ_B(x) = 1. Otherwise, μ_{A∩B}(x) = 0, given that μ_A(x) and μ_B(x) take values only from {0, 1} in this case.
2.2.3 Simple Example
An example should help with the understanding of the basic concepts and operators introduced above. Suppose that the universe of discourse, X, is a class of students and that a group, A, of students within this class are said to be tall in height. Thus A is a fuzzy subset of X (but X itself is not fuzzy), since the boundary between tall and not tall cannot be naturally defined with a fixed real number. Rather, describing this vague concept using a gradual membership function as characterized in Figure 2.2 is much more appealing. Similarly the fuzzy term very tall can be represented by another fuzzy (sub-)set as also shown in this figure. Given such a definition of the fuzzy set A = tall, a proposition like "student x is tall" can be denoted by μ_A(x).
Figure 2.2 Representation of fuzzy sets "tall" and "very tall".

Assume that Andy is a student in the class and that he has an 80% possibility of being considered as a tall student. This means that μ_A(Andy) = 0.8. Also suppose that another fuzzy set B is defined on the same universe X, whose members are said to be young in age, and that μ_B(Andy) = 0.7, meaning that Andy is thought to be young with a possibility of 70%. From this, using the operations defined above, the following can be derived that is justifiable with respect to common-sense intuitions:
• μ_{¬A}(Andy) = 1 − 0.8 = 0.2, indicating the possibility of Andy being not tall
• μ_{A∪B}(Andy) = max(μ_A(Andy), μ_B(Andy)) = max(0.8, 0.7) = 0.8, indicating the possibility of Andy being tall or young
• μ_{A∩B}(Andy) = min(μ_A(Andy), μ_B(Andy)) = min(0.8, 0.7) = 0.7, indicating the possibility of Andy being both tall and young
Not only having an intuitive justification, these operations are also computationally very simple. However, care should be taken not to explain the results from the conventional Boolean logic point of view. The laws of excluded-middle and contradiction do not hold in fuzzy logic anymore. To make this point clearer, let us attempt to compare the results of A ∪ ¬A and A ∩ ¬A obtained by fuzzy logic with those by Boolean probabilistic logic (although such a comparison is itself a common source for debate in the literature). Applying fuzzy logic to the case above of Andy gives that
• μ_{A∪¬A}(Andy) = max(μ_A(Andy), μ_{¬A}(Andy)) = max(0.8, 0.2) = 0.8
• μ_{A∩¬A}(Andy) = min(μ_A(Andy), μ_{¬A}(Andy)) = min(0.8, 0.2) = 0.2
However, using the theory of probability, different results would be expected such that
• p("Andy is tall" or "Andy is not tall") = 1
• p("Andy is tall" and "Andy is not tall") = 0
This important difference is caused by the deliberate avoidance of the excluded-middle and contradiction laws in fuzzy logic. This avoidance enables fuzzy logic to represent vague concepts that are difficult to capture otherwise. If the memberships of x belonging to A and ¬A are neither 0 nor 1, then it should not be surprising that x is also of a nonzero membership belonging to A and ¬A at the same time.
2.2.4 Fuzzy Relations and Composition
In addition to the three operators defined above, many conventional mathematical functions can be extended to be applied to fuzzy values. This is possible by the use of an extension principle, so called because it provides a general extension of classical mathematical concepts to fuzzy environments. This principle is stated as follows: If an n-ary function f maps the Cartesian product X1 × X2 × · · · × Xn onto a universe Y such that y = f(x1, x2, . . . , xn), and A1, A2, . . . , An are n fuzzy sets in X1, X2, . . . , Xn, respectively, characterized by membership distributions μ_{Ai}(xi), i = 1, 2, . . . , n, a fuzzy set on Y can then be induced as given below, where ∅ is the empty set:

μ_B(y) = max_{x1, . . . , xn : y = f(x1, . . . , xn)} min(μ_{A1}(x1), . . . , μ_{An}(xn))   if f⁻¹(y) ≠ ∅
μ_B(y) = 0   otherwise
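For a unary function the principle reduces to taking, for each image value y, the maximum membership over the preimages of y. A minimal sketch, with an assumed discrete universe and f(x) = x², follows:

```python
# Extension principle for f(x) = x**2 over a small discrete universe:
# mu_B(y) = max{ mu_A(x) : f(x) = y }, and 0 where y has no preimage.
mu_A = {-2: 0.3, -1: 0.7, 0: 1.0, 1: 0.6, 2: 0.2}   # assumed memberships

f = lambda x: x * x
mu_B = {}
for x, m in mu_A.items():
    y = f(x)
    mu_B[y] = max(mu_B.get(y, 0.0), m)   # both -1 and 1 map to y = 1

print(mu_B)  # {4: 0.3, 1: 0.7, 0: 1.0}
```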
A crucial concept in fuzzy set theory is that of fuzzy relations, which is a generalization of the conventional crisp relations. An n-ary fuzzy relation in X1 × X2 × · · · × Xn is, in fact, a fuzzy set on X1 × X2 × · · · × Xn. Fuzzy relations can be composed (and this composition is closely related to the extension principle shown above). For instance, if U is a relation from X1 to X2 (or, equivalently, a relation in X1 × X2), and V is a relation from X2 to X3, then the composition of U and V is a fuzzy relation from X1 to X3, which is denoted by U∘V and defined by

μ_{U∘V}(x1, x3) = max_{x2 ∈ X2} min(μ_U(x1, x2), μ_V(x2, x3)),   x1 ∈ X1, x3 ∈ X3
A convenient way of representing a binary fuzzy relation is to use a matrix. For example, the following matrix, P, can be used to indicate that a computer-game addict is much more fond of multimedia games than conventional ones, be they workstation or PC-based:

P = | 0.9  0.8 |
    | 0.3  0.2 |

Composing P with a second relation Q, which relates the machine types to the manner of interaction (its first column corresponding to keyboard-based games),

Q = | 0.7  0.5 |
    | 0.8  0.3 |

gives

P∘Q = | max(min(0.9, 0.7), min(0.8, 0.8))  max(min(0.9, 0.5), min(0.8, 0.3)) | = | 0.8  0.5 |
      | max(min(0.3, 0.7), min(0.2, 0.8))  max(min(0.3, 0.5), min(0.2, 0.3)) |   | 0.3  0.3 |

This composition indicates that the addict enjoys multimedia games, especially keyboard-based multimedia games.
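Max-min composition is straightforward to implement. The following sketch (the compose helper is ours) reproduces the arithmetic above:

```python
# Max-min composition of the binary fuzzy relations from the text.
P = [[0.9, 0.8],
     [0.3, 0.2]]
Q = [[0.7, 0.5],
     [0.8, 0.3]]

def compose(U, V):
    """mu_{U o V}(i, k) = max over j of min(U[i][j], V[j][k])."""
    return [[max(min(U[i][j], V[j][k]) for j in range(len(V)))
             for k in range(len(V[0]))]
            for i in range(len(U))]

print(compose(P, Q))  # [[0.8, 0.5], [0.3, 0.3]]
```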
2.2.5 Approximate Reasoning
Fuzzy relations and fuzzy relation composition form the basis for approximate reasoning, sometimes termed fuzzy reasoning. Informally, approximate reasoning means a process by which a possibly inexact conclusion is inferred from a collection of inexact premises. Systems performing such reasoning are built upon a collection of fuzzy production (if-then) rules, which provide a formal means of representing domain knowledge acquired from empirical associations or experience. Such systems run upon a given set of facts that allows a (usually partial) instantiation of the premise attributes in some of the rules.
For example, a rule like

if x is A_i and y is B_i, then z is C_i

which governs a particular relationship between the premise attributes x and y and the conclusion attribute z, can be translated into a fuzzy relation R_i:

μ_{R_i}(x, y, z) = min(μ_{A_i}(x), μ_{B_i}(y), μ_{C_i}(z))
Here, A_i, B_i, and C_i are fuzzy sets defined on the universes X, Y, and Z of the attributes x, y, and z, respectively. Provided with this relation, if the premise attributes x and y actually take the fuzzy values A and B, a new fuzzy value of the conclusion attribute z can then be obtained by applying the compositional rule of inference:

C = (A × B)∘R_i

Or,

μ_C(z) = max_{x ∈ X, y ∈ Y} min(μ_A(x), μ_B(y), μ_{R_i}(x, y, z))
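The compositional rule of inference for a single rule can be coded directly from the two equations above. In the sketch below the universes and all membership values are invented purely for illustration:

```python
# Compositional rule of inference for one rule "if x is A_i and y is B_i
# then z is C_i", over small discrete universes (all values illustrative).
X, Y, Z = range(2), range(2), range(3)
A_i = [1.0, 0.4]; B_i = [0.8, 0.6]; C_i = [0.2, 0.9, 1.0]   # rule fuzzy sets
A   = [0.7, 1.0]; B   = [1.0, 0.3]                          # observed inputs

# R_i(x, y, z) = min(A_i(x), B_i(y), C_i(z))
R = [[[min(A_i[x], B_i[y], C_i[z]) for z in Z] for y in Y] for x in X]

# C(z) = max over x, y of min(A(x), B(y), R(x, y, z))
C = [max(min(A[x], B[y], R[x][y][z]) for x in X for y in Y) for z in Z]
print([round(c, 2) for c in C])   # [0.2, 0.7, 0.7]
```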
Of course, an approximate reasoning system would not normally function using only one specific rule, but a set. Given a set of production rules, two different approaches may be adopted to implement the reasoning in such a system; both rely on the use of the compositional rule of inference given above. The first is to apply the compositional rule after an overall relation associated with the entire rule set is found. The second is to utilize the compositional rule locally to each individual production rule first and then to aggregate the results of such applications to form an overall result for the consequent attribute.
Given a set of K if-then rules, basic algorithms for these two approaches are summarized below. For simplicity, it is assumed that each rule contains a fixed
• p(“Andy is tall” or “Andy is not tall ”)=
• p(“Andy is tall” and “Andy is not tall ”)=
This important difference... Applying fuzzy logic to thecase above of Andy gives that
ã A ơA (Andy) = max(μ A (Andy), μ ¬A (Andy)) = max(0.8, 0.2) = 0.8
ã