D(\Pr(C \mid V_i = v_i, V_j = v_j), \Pr(C \mid V_j = v_j)) = \sum_{c \in C} p(c \mid V_i = v_i, V_j = v_j) \log_2 \frac{p(c \mid V_i = v_i, V_j = v_j)}{p(c \mid V_j = v_j)}    (5.2)
For each feature i, the algorithm finds a set M_i, containing K attributes from those that remain, that is likely to include the information feature i has about the class values. M_i contains the K features, out of those remaining, for which the value of Equation (5.2) is smallest. The expected cross entropy between the distribution of the class values given M_i and V_i, and the distribution of the class values given just M_i, is calculated for each feature i. The feature for which this quantity is minimal is removed from the set. This process iterates until the user-specified number of features has been removed from the original set.
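To make the measure concrete, the following is a minimal Python sketch (not the original authors' code) of how the quantity in Equation (5.2), averaged over the observed value pairs of V_i and V_j, might be estimated from nominal data; the function and variable names are illustrative assumptions.

import math
from collections import Counter, defaultdict

def expected_cross_entropy(instances, classes, i, j):
    """Average of Equation (5.2) over the observed joint values of
    features i and j: how much information about the class is lost
    when feature i is dropped and only feature j is kept."""
    n = len(instances)
    joint = defaultdict(Counter)   # (v_i, v_j) -> class counts
    cond_j = defaultdict(Counter)  # v_j -> class counts
    for x, c in zip(instances, classes):
        joint[(x[i], x[j])][c] += 1
        cond_j[x[j]][c] += 1
    total = 0.0
    for (vi, vj), class_counts in joint.items():
        n_ij = sum(class_counts.values())
        n_j = sum(cond_j[vj].values())
        d = 0.0
        for c, count in class_counts.items():
            p_ij = count / n_ij           # p(c | V_i = v_i, V_j = v_j)
            p_j = cond_j[vj][c] / n_j     # p(c | V_j = v_j)
            d += p_ij * math.log2(p_ij / p_j)
        total += (n_ij / n) * d           # weight by the empirical Pr(v_i, v_j)
    return total

In the filter described above, the K features j with the smallest values of this quantity would form the conditioning set M_i.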
Experiments on natural domains and two artificial domains using C4.5 and naive Bayes as the final induction algorithm showed that the feature selector gives the best results when the size K of the conditioning set M is set to 2. In two domains containing over 1000 features the algorithm is able to reduce the number of features by more than half, while improving accuracy by one or two percent.
One problem with the algorithm is that it requires features with more than two values to be encoded as binary in order to avoid the bias that entropic measures have toward features with many values. This can greatly increase the number of features in the original data, as well as introducing further dependencies. Furthermore, the meaning of the original attributes is obscured, making the output of algorithms such as C4.5 hard to interpret.
An Instance Based Approach to Feature Selection – RELIEF
Kira and Rendell (1992) describe an algorithm called RELIEF that uses instance based learning to assign a relevance weight to each feature. Each feature's weight reflects its ability to distinguish among the class values. Features are ranked by weight and those that exceed a user-specified threshold are selected to form the final subset. The algorithm works by randomly sampling instances from the training data. For each instance sampled, the nearest instance of the same class (nearest hit) and of the opposite class (nearest miss) is found. An attribute's weight is updated according to how well its values distinguish the sampled instance from its nearest hit and nearest miss. An attribute will receive a high weight if it differentiates between instances from different classes and has the same value for instances of the same class. Equation (5.3) shows the weight updating formula used by RELIEF:
W_X = W_X - \frac{\mathrm{diff}(X, R, H)^2}{m} + \frac{\mathrm{diff}(X, R, M)^2}{m}    (5.3)
where W_X is the weight for attribute X, R is a randomly sampled instance, H is the nearest hit, M is the nearest miss, and m is the number of randomly sampled instances.
The function diff calculates the difference between two instances for a given attribute. For nominal attributes it is defined as either 1 (the values are different) or 0 (the values are the same), while for continuous attributes the difference is the actual difference normalized to the interval [0, 1]. Dividing by m guarantees that all weights are in the interval [-1, 1].
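As an illustration only (not Kira and Rendell's implementation), a compact two-class RELIEF weight update might be sketched in Python as follows; the names relief, X, y and m are assumptions, features are assumed to be numeric and pre-scaled to [0, 1], and ties and missing values are ignored.

import numpy as np

def relief(X, y, m, rng=None):
    """Two-class RELIEF sketch: X is an (n, p) array of features scaled
    to [0, 1], y a binary label vector, m the number of sampled instances."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(m):
        r = rng.integers(n)                          # randomly sampled instance R
        dist = np.abs(X - X[r]).sum(axis=1)          # Manhattan distance to R
        dist[r] = np.inf                             # exclude R itself
        same = np.flatnonzero(y == y[r])
        other = np.flatnonzero(y != y[r])
        hit = same[np.argmin(dist[same])]            # nearest hit H
        miss = other[np.argmin(dist[other])]         # nearest miss M
        # Equation (5.3): penalize differing from H, reward differing from M
        w -= np.abs(X[r] - X[hit]) ** 2 / m
        w += np.abs(X[r] - X[miss]) ** 2 / m
    return w

Weights accumulate evidence over the m samples: attributes whose values differ mainly across class boundaries (the miss term) end up with large positive weights.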
RELIEF operates on two-class domains. Kononenko (1994) describes enhancements to RELIEF that enable it to cope with multi-class, noisy and incomplete domains. Kira and Rendell provide experimental evidence that shows RELIEF to be effective at identifying relevant features even when they interact (for example, in parity problems). Interacting features are those whose values are dependent on the values of other features and the class, and as such provide further information about the class. Redundant features, on the other hand, are those whose values are dependent on the values of other features irrespective of the class, and as such provide no further information about the class. However, RELIEF does not handle redundant features. The authors state: "If most of the given features are relevant to the concept, it (RELIEF) would select most of the given features even though only a small number of them are necessary for concept description."
Scherf and Brauer (1997) describe a similar instance based approach (EUBAFES) to assigning feature weights, developed independently of RELIEF. Like RELIEF, EUBAFES strives to reinforce similarities between instances of the same class while simultaneously decreasing similarities between instances of different classes. A gradient descent approach is employed to optimize feature weights with respect to this goal.

5.2.2 Feature Wrappers
Wrapper strategies for feature selection use an induction algorithm to estimate the merit of feature subsets. The rationale for wrapper approaches is that the induction method that will ultimately use the feature subset should provide a better estimate of accuracy than a separate measure that has an entirely different inductive bias (Langley and Sage, 1994).
Feature wrappers often achieve better results than filters because they are tuned to the specific interaction between an induction algorithm and its training data. However, they tend to be much slower than feature filters because they must repeatedly call the induction algorithm and must be re-run when a different induction algorithm is used.
Since the wrapper is a well-defined process, most of the variation in its application is due to the method used to estimate the off-sample accuracy of a target induction algorithm, the target induction algorithm itself, and the organization of the search. This section reviews work that has focused on the wrapper approach and methods to reduce its computational expense.
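The generic wrapper loop is easy to sketch. The following illustrative Python function assumes scikit-learn and NumPy arrays; the name forward_wrapper and the simple stopping rule are assumptions, not taken from any of the systems reviewed here. It performs a greedy forward search scored by k-fold cross validation.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_wrapper(X, y, estimator=None, cv=5):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion gives the best cross-validated accuracy estimate."""
    estimator = estimator or DecisionTreeClassifier(random_state=0)
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:          # stop when no added feature helps
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

Swapping the estimator, the search direction, or the accuracy-estimation method yields most of the wrapper variants discussed below.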
Wrappers for Decision Tree Learners
John, Kohavi, and Pfleger (1994) were the first to advocate the wrapper (Allen, 1974)
as a general framework for feature selection in machine learning. They present formal definitions for two degrees of feature relevance, and claim that the wrapper is able to discover relevant features. A feature X_i is said to be strongly relevant to the target concept(s) if the probability distribution of the class values, given the full feature set, changes when X_i is removed. A feature X_i is said to be weakly relevant if it is not strongly relevant and the probability distribution of the class values, given some subset S (containing X_i) of the full feature set, changes when X_i is removed. All features that are not strongly or weakly relevant are irrelevant.
Experiments were conducted on three artificial and three natural domains using ID3 and C4.5 (Quinlan, 1993; Quinlan, 1986) as the induction algorithms. Accuracy was estimated by using 25-fold cross validation on the training data; a disjoint test set was used for reporting final accuracies. Both forward selection and backward elimination search were used. With the exception of one artificial domain, results showed that feature selection did not significantly change ID3 or C4.5's generalization performance. The main effect of feature selection was to reduce the size of the trees. Like John et al., Caruana and Freitag (1994) test a number of greedy search methods with ID3 on two calendar scheduling domains. As well as backward elimination and forward selection they also test two variants of stepwise bi-directional search, one starting with all features, the other with none.
Results showed that although the bi-directional searches slightly outperformed the forward and backward searches, on the whole there was very little difference between the various search strategies except with respect to computation time. Feature selection was able to improve the performance of ID3 on both calendar scheduling domains.
Vafaie and De Jong (1995) and Cherkauer and Shavlik (1996) have both applied genetic search strategies in a wrapper framework for improving the performance of decision tree learners. Vafaie and De Jong (1995) describe a system that has two genetic algorithm driven modules: the first performs feature selection, and the second performs constructive induction (constructive induction is the process of creating new attributes by applying logical and mathematical operators to the original features (Michalski, 1983)). Both modules were able to significantly improve the performance of ID3 on a texture classification problem.
Cherkauer and Shavlik (1996) present an algorithm called SET-Gen which strives to improve the comprehensibility of decision trees as well as their accuracy. To achieve this, SET-Gen's genetic search uses a fitness function that is a linear combination of an accuracy term and a simplicity term:
Fitness(X) = \frac{3}{4} A + \frac{1}{4}\left(1 - \frac{S + F}{2}\right)    (5.4)
where X is a feature subset, A is the average cross-validation accuracy of C4.5, S is the average size of the trees produced by C4.5 (normalized by the number of training examples), and F is the number of features in the subset X (normalized by the total number of available features). Equation (5.4) ensures that the fittest population members are those feature subsets that lead C4.5 to induce small but accurate decision trees.
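As a worked example of Equation (5.4), whose exact form is reconstructed above, a hypothetical fitness computation might be written as follows; the function and parameter names are illustrative only.

def set_gen_fitness(accuracy, tree_size, n_selected, n_train, n_total):
    """Fitness in the spirit of Equation (5.4); inputs are assumed to be
    averages over the cross-validation folds used by the genetic search."""
    A = accuracy                   # average cross-validation accuracy of C4.5
    S = tree_size / n_train        # average tree size, normalized by training examples
    F = n_selected / n_total       # subset size, normalized by available features
    return 0.75 * A + 0.25 * (1.0 - (S + F) / 2.0)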
Wrappers for Instance-based Learning
The wrapper approach was proposed at approximately the same time as, and independently of, John et al. (1994) by Langley and Sage (1994) during their investigation of the simple nearest neighbor algorithm's sensitivity to irrelevant attributes. Scaling experiments showed that the nearest neighbor's sample complexity (the number of training examples needed to reach a given accuracy) increases exponentially with the number of irrelevant attributes present in the data (Aha et al., 1991; Langley and Sage, 1994). An algorithm called OBLIVION is presented which performs backward elimination of features using an oblivious decision tree as the induction algorithm (when all the original features are included in the tree, and given a number of assumptions at classification time, Langley and Sage note that the structure is functionally equivalent to the simple nearest neighbor; in fact, this is how it is implemented in OBLIVION). Experiments with OBLIVION using k-fold cross validation on several artificial domains showed that it was able to remove redundant features and learn faster than C4.5 on domains where features interact.
Moore and Lee (1994) take a similar approach to augmenting the nearest neighbor algorithm, but their system uses leave-one-out instead of k-fold cross-validation and concentrates on improving the prediction of numeric rather than discrete classes. Aha and Blankert (1994) also use leave-one-out cross validation, but pair it with a beam search (beam search is a limited version of best first search that only remembers a portion of the search path for use in backtracking) instead of hill climbing. Their results show that feature selection can improve the performance of IB1 (a nearest neighbor classifier) on a sparse (very few instances) cloud pattern domain with many features. Moore, Hill, and Johnson (1992) encompass not only feature selection in the wrapper process, but also the number of nearest neighbors used in prediction and the space of combination functions. Using leave-one-out cross validation they achieve significant improvement on several control problems involving the prediction of continuous classes.
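The leave-one-out estimate these nearest neighbor wrappers rely on is inexpensive to sketch. The function below is illustrative only; the names, the Euclidean metric and the 1-nearest-neighbor choice are assumptions.

import numpy as np

def loo_accuracy_1nn(X, y, feature_subset):
    """Leave-one-out accuracy of 1-nearest-neighbor restricted to feature_subset."""
    Xs = X[:, list(feature_subset)]
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                     # leave instance i out
        correct += int(y[int(np.argmin(d))] == y[i])
    return correct / len(y)

A wrapper then simply searches over candidate subsets, calling this estimate at each step.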
In a similar vein, Skalak (1994) combines feature selection and prototype selection into a single wrapper process using random mutation hill climbing as the search strategy. Experimental results showed significant improvement in accuracy for nearest neighbor on two natural domains and a drastic reduction in the algorithm's storage requirement (the number of instances retained during training).
Domingos (1997) describes a context sensitive wrapper approach to feature selection for instance based learners. The motivation for the approach is that there may be features that are relevant in only a restricted area of the instance space and irrelevant elsewhere, or relevant given only certain values of other features (weak interaction) and otherwise irrelevant. In either case, when features are evaluated globally (over the whole instance space), the irrelevant aspects of these sorts of features may overwhelm their useful aspects for instance based learners. This is true even when using backward search strategies with the wrapper. In the wrapper approach, backward search strategies are generally more effective than forward search strategies in domains with feature interactions. Because backward search typically begins with all the features, the removal of a strongly interacting feature is usually detected by decreased accuracy during cross validation.
Domingos presents an algorithm called RC which can detect and make use of context sensitive features. RC works by selecting a (potentially) different set of features for each instance in the training set. It does this by using a backward search strategy and cross validation to estimate accuracy. For each instance in the training set, RC finds its nearest neighbor of the same class and removes those features in which the two differ. The accuracy on the entire training dataset is then estimated by cross validation. If the accuracy has not degraded, the modified instance in question is accepted; otherwise the instance is restored to its original state and deactivated (no further feature selection is attempted for it). The feature selection process continues until all instances are inactive.
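A very rough sketch of this idea (not Domingos' implementation) is given below. The function estimate_accuracy is assumed to be supplied by the caller and to run cross validation with an instance based learner that respects per-instance feature masks; features are assumed to be numerically encoded, and the handling of continuous values and ties is deliberately ignored.

import numpy as np

def rc_sketch(X, y, estimate_accuracy):
    """Each training instance keeps its own boolean feature mask; masks are
    pruned one instance at a time and kept only if the cross-validated
    accuracy estimate does not degrade."""
    n, p = X.shape
    masks = np.ones((n, p), dtype=bool)       # active features per instance
    active = np.ones(n, dtype=bool)           # instances still being simplified
    baseline = estimate_accuracy(masks)
    while active.any():
        for i in np.flatnonzero(active):
            same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
            if same.size == 0:
                active[i] = False
                continue
            # nearest neighbor of the same class, under instance i's current mask
            d = np.abs(X[same][:, masks[i]] - X[i, masks[i]]).sum(axis=1)
            j = same[int(np.argmin(d))]
            trial = masks.copy()
            trial[i] &= (X[i] == X[j])        # drop features where i and j differ
            if (trial[i] == masks[i]).all():
                active[i] = False             # nothing left to remove for this instance
                continue
            score = estimate_accuracy(trial)
            if score >= baseline:             # accept the simplified instance
                masks, baseline = trial, score
            else:
                active[i] = False             # restore old mask and deactivate
    return masks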
Experiments on a selection of machine learning datasets showed that RC outperformed standard wrapper feature selectors using forward and backward search strategies with instance based learners. The effectiveness of the context sensitive approach was also shown on artificial domains engineered to exhibit restricted feature dependency. When features are globally relevant or irrelevant, RC has no advantage over standard wrapper feature selection. Furthermore, when few examples are available, or the data is noisy, standard wrapper approaches can detect globally irrelevant features more easily than RC.
Domingos also noted that wrappers that employ instance based learners (including RC) are unsuitable for use on databases containing many instances because they are quadratic in N (the number of instances).
Kohavi (1995) uses wrapper feature selection to explore the potential of decision table majority (DTM) classifiers. Appropriate data structures allow the use of fast incremental cross-validation with DTM classifiers. Experiments showed that DTM classifiers using appropriate feature subsets compared very favorably with sophisticated algorithms such as C4.5.
Wrappers for Bayes Classifiers
Due to the naive Bayes classifier’s assumption that, within each class, probability distributions for attributes are independent of each other Langley and Sage (1994) note that the classifier performance on domains with redundant features can be im-proved by removing such features A forward search strategy is employed to select features for use with na¨ıve Bayes, as opposed to the backward strategies that are used most often with decision tree algorithms and instance based learners The ra-tionale for a forward search is that it should immediately detect dependencies when harmful redundant attributes are added Experiments showed overall improvement and increased learning rate on three out of six natural domains, with no change on the remaining three
Pazzani (1995) combines feature selection and simple constructive induction in a wrapper framework for improving the performance of naive Bayes. Forward and backward hill climbing search strategies are compared. In the former case, the algorithm considers not only the addition of single features to the current subset, but also creating a new attribute by joining one of the as yet unselected features with each of the selected features in the subset. In the latter case, the algorithm considers both deleting individual features and replacing pairs of features with a joined feature. Results on a selection of machine learning datasets show that both approaches improve the performance of naive Bayes.
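The basic constructive step considered in this search, joining two nominal attributes into a single attribute whose values are their Cartesian product, can be illustrated as follows; the function name and example values are hypothetical.

def join_features(values_a, values_b):
    """Join two nominal attributes into one whose values are the
    Cartesian product of the originals (e.g. 'red' + 'small' -> 'red&small')."""
    return [f"{a}&{b}" for a, b in zip(values_a, values_b)]

# usage: the joined column replaces the original pair, so naive Bayes can model
# their interaction while still treating attributes as conditionally independent
color = ["red", "red", "blue", "blue"]
size = ["small", "large", "small", "large"]
print(join_features(color, size))   # ['red&small', 'red&large', 'blue&small', 'blue&large']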
The forward strategy does a better job at removing redundant attributes than the backward strategy. Because it starts with the full set of features and considers all possible pairwise joined features, the backward strategy is more effective at identifying attribute interactions than the forward strategy. Improvement for naive Bayes using wrapper-based feature selection is also reported in (Kohavi and Sommerfield, 1995; Kohavi and John, 1996).
Provan and Singh (1996) have applied the wrapper to select features from which to construct Bayesian networks. Their results showed that while feature selection did not improve accuracy over networks constructed from the full set of features, the networks created after feature selection were considerably smaller and faster to learn.
5.3 Variable Selection
This section aims to provide a survey of variable selection. Suppose Y is a variable of interest, and X_1, ..., X_p is a set of potential explanatory variables or predictors, all of which are vectors of n observations. The problem of variable selection, or subset selection as it is often called, arises when one wants to model the relationship between Y and a subset of X_1, ..., X_p, but there is uncertainty about which subset to use. Such a situation is particularly of interest when p is large and X_1, ..., X_p is thought to contain many redundant or irrelevant variables. The variable selection problem is most familiar in the linear regression context, where attention is restricted to normal linear models. Letting γ index the subsets of X_1, ..., X_p and letting q_γ be the size of the γ-th subset, the problem is to select and fit a model of the form:

Y = X_\gamma \beta_\gamma + \varepsilon
where X_γ is an n × q_γ matrix whose columns correspond to the γ-th subset, β_γ is a q_γ × 1 vector of regression coefficients and ε ~ N(0, σ²I). More generally, the variable selection problem is a special case of the model selection problem, where each model under consideration corresponds to a distinct subset of X_1, ..., X_p. Typically, a single model class is simply applied to all possible subsets.
5.3.1 Mallows Cp (Mallows, 1973)
This method minimizes the mean square error of prediction:
C_p = \frac{RSS_\gamma}{\hat{\sigma}^2_{FULL}} - n + 2 q_\gamma
where RSS_γ is the residual sum of squares for the γ-th model and \hat{\sigma}^2_{FULL} is the usual unbiased estimate of σ² based on the full model.
The goal is to get a model with minimum C_p. By using C_p one can reduce dimension by finding the minimal subset which has minimum C_p.
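Under the form of C_p given above (itself a reconstruction), a small NumPy sketch of subset scoring might look like the following; the predictor matrix X is assumed to already contain any intercept column, and the names are illustrative.

import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on the given columns."""
    beta, *_ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
    resid = y - X[:, list(cols)] @ beta
    return float(resid @ resid)

def mallows_cp(X, y, subset):
    n, p = X.shape
    sigma2_full = rss(X, y, range(p)) / (n - p)   # unbiased estimate from the full model
    return rss(X, y, subset) / sigma2_full - n + 2 * len(subset)

# usage sketch: choose the subset with minimum C_p among all non-empty subsets
# best = min((s for k in range(1, X.shape[1] + 1)
#             for s in combinations(range(X.shape[1]), k)),
#            key=lambda s: mallows_cp(X, y, s))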
5.3.2 AIC, BIC and F ratio
Two of the other most popular criteria, motivated from very different points of view, are AIC (the Akaike Information Criterion) and BIC (the Bayesian Information Criterion). Letting \hat{l}_\gamma denote the maximum log likelihood of the γ-th model, AIC selects the model which maximizes (\hat{l}_\gamma - q_\gamma), whereas BIC selects the model which maximizes (\hat{l}_\gamma - (\log n) q_\gamma / 2).
For the linear model, many of the popular selection criteria are special cases of a penalized sum-of-squares criterion, providing a unified framework for comparisons. Assuming σ² known to avoid complications, this general criterion selects the subset model that minimizes:

\frac{RSS_\gamma}{\sigma^2} + F q_\gamma

where F is a preset "dimensionality penalty". Intuitively, the above penalizes RSS_γ/σ² by F times q_γ, the dimension of the γ-th model. AIC and minimum C_p are essentially equivalent, corresponding to F = 2, and BIC is obtained by setting F = log n. By imposing a smaller penalty, AIC and minimum C_p will select larger models than BIC (unless n is very small).
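For the normal linear model, the two criteria as defined above can be computed directly. In this sketch (function and variable names are assumptions) the maximized log likelihood plugs in the maximum likelihood estimate RSS/n for σ², and q_γ counts only the regression coefficients, as in the text.

import numpy as np

def max_log_likelihood(X, y, cols):
    """Maximized Gaussian log likelihood of the linear model on the given columns."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
    rss = float(np.sum((y - X[:, list(cols)] @ beta) ** 2))
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

def aic_score(X, y, cols):
    return max_log_likelihood(X, y, cols) - len(cols)                       # maximize over subsets

def bic_score(X, y, cols):
    return max_log_likelihood(X, y, cols) - np.log(len(y)) * len(cols) / 2  # maximize over subsets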
5.3.3 Principal Component Analysis (PCA)
Principal component analysis (PCA) is the best, in the mean-square error sense, linear dimension reduction technique (Jackson, 1991; Jolliffe, 1986). Being based on the covariance matrix of the variables, it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loeve transform, the Hotelling transform, and the empirical orthogonal function (EOF) method. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs) of the original variables with the largest variance. The first PC, s_1, is the linear combination with the largest variance. We have s_1 = x^T w_1, where the p-dimensional coefficient vector w_1 = (w_{1,1}, ..., w_{1,p})^T solves:

w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}(x^T w)    (5.8)

The second PC is the linear combination with the second largest variance and orthogonal to the first PC, and so on. There are as many PCs as the number of original variables. For many datasets, the first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of information.
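In practice the PCs are usually obtained from the singular value decomposition of the mean-centered data matrix; a short NumPy sketch is given below (the function name and return values are assumptions).

import numpy as np

def principal_components(X, k):
    """First k principal component directions, scores and explained-variance
    ratios, from the SVD of the mean-centered (n x p) data matrix X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                        # p x k loading vectors w_1, ..., w_k
    scores = Xc @ W                     # the PCs s_1, ..., s_k for each observation
    explained = (s ** 2) / np.sum(s ** 2)
    return W, scores, explained[:k]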
5.3.4 Factor Analysis (FA)
Like PCA, factor analysis (FA) is also a linear method, based on second-order data summaries. First suggested by psychologists, FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. Typical examples include variables defined as various test scores of individuals, as such scores are thought to be related to a common "intelligence" factor. The goal of FA is to uncover such relations, and thus it can be used to reduce the dimension of datasets following the factor model.
5.3.5 Projection Pursuit
Projection pursuit (PP) is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods. Given a projection index that defines the "interestingness" of a direction, PP looks for the directions that optimize that index. As the Gaussian distribution is the least interesting distribution (having the least structure), projection indices usually measure some aspect of non-Gaussianity. If, however, one uses the second-order maximum variance, subject to the constraint that the projections be orthogonal, as the projection index, PP yields the familiar PCA.

5.3.6 Advanced Methods for Variable Selection
Chizi and Maimon (2002) describe in their work some new methods for variable selection. These methods are based on simple algorithms and use known evaluators such as information gain, logistic regression coefficients and random selection. All the methods are presented with empirical results on benchmark datasets and with theoretical bounds on each method. A wider survey of variable selection, together with a decomposition of the problem of dimension reduction, can be found there.
In summary, feature selection is useful for many application domains, such as manufacturing (lr18, lr14), security (lr7, lr10) and medicine (lr2, lr9), and for many data mining techniques, such as decision trees (lr6, lr12, lr15), clustering (lr13, lr8), ensemble methods (lr1, lr4, lr5, lr16) and genetic algorithms (lr17, lr11).
References
Aha, D. W. and Blankert, R. L. Feature selection for case-based classification of cloud types. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pages 106–112, 1994.
Aha, D. W., Kibler, D., and Albert, M. K. Instance based learning algorithms. Machine Learning, 6: 37–66, 1991.
Allen, D. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16: 125–127, 1974.
Almuallim, H. and Dietterich, T. G. Efficient algorithms for identifying relevant features. In Proceedings of the Ninth Canadian Conference on Artificial Intelligence, pages 38–45. Morgan Kaufmann, 1992.
Almuallim, H. and Dietterich, T. G. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 547–552. MIT Press, 1991.
Arbel, R. and Rokach, L. Classifier evaluation under limited resources. Pattern Recognition Letters, 27(14): 1619–1631, 2006.
Averbuch, M., Karson, T., Ben-Ami, B., Maimon, O., and Rokach, L. Context-sensitive medical information retrieval. In The 11th World Congress on Medical Informatics (MEDINFO 2004), San Francisco, CA, September 2004, IOS Press, pages 282–286.
Blum, A. and Langley, P. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97: 245–271, 1997.
Cardie, C. Using decision trees to improve case-based learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.
Caruana, R. and Freitag, D. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Cherkauer, K. J. and Shavlik, J. W. Growing simpler decision trees to facilitate knowledge discovery. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996.
Chizi, B. and Maimon, O. On dimensionality reduction of high dimensional data sets. In Frontiers in Artificial Intelligence and Applications, IOS Press, pages 230–236, 2002.
Cohen, S., Rokach, L., and Maimon, O. Decision tree instance space decomposition with grouped gain-ratio. Information Sciences, 177(17): 3592–3612, 2007.
Domingos, P. Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, 11: 227–253, 1997.
Elder, J. F. and Pregibon, D. A statistical perspective on knowledge discovery in databases. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
George, E. and Foster, D. Empirical Bayes variable selection. Biometrika, 2000.
Hall, M. Correlation-based feature selection for machine learning. Ph.D. Thesis, Department of Computer Science, University of Waikato, 1999.
Holmes, G. and Nevill-Manning, C. G. Feature selection via the discovery of simple classification rules. In Proceedings of the Symposium on Intelligent Data Analysis, Baden-Baden, Germany, 1995.
Holte, R. C. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11: 63–91, 1993.
Jackson, J. A User's Guide to Principal Components. New York: John Wiley and Sons, 1991.
John, G. H., Kohavi, R., and Pfleger, K. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Jolliffe, I. Principal Component Analysis. Springer-Verlag, 1986.
Kira, K. and Rendell, L. A. A practical approach to feature selection. In Machine Learning: Proceedings of the Ninth International Conference, 1992.
Kohavi, R. and John, G. Wrappers for feature subset selection. Artificial Intelligence, special issue on relevance, 97(1–2): 273–324, 1996.
Kohavi, R. and Sommerfield, D. Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.
Kohavi, R. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford University, 1995.
Koller, D. and Sahami, M. Towards optimal feature selection. In Machine Learning: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, 1996.
Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, 1994.
Langley, P. Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance. AAAI Press, 1994.
Langley, P. and Sage, S. Scaling to domains with irrelevant features. In R. Greiner, editor, Computational Learning Theory and Natural Learning Systems, volume 4. MIT Press, 1994.
Liu, H. and Setiono, R. A probabilistic approach to feature selection: A filter solution. In Machine Learning: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, 1996.
Maimon, O. and Rokach, L. Data mining by attribute decomposition with semiconductors manufacturing case study. In Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pages 311–336, 2001.
Maimon, O. and Rokach, L. Improving supervised learning by feature decomposition. In Proceedings of the Second International Symposium on Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, Springer, pages 178–196, 2002.
Maimon, O. and Rokach, L. Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications. Series in Machine Perception and Artificial Intelligence, Vol. 61, World Scientific Publishing, ISBN 981-256-079-3, 2005.
Mallows, C. L. Some comments on Cp. Technometrics, 15: 661–676, 1973.
Michalski, R. S. A theory and methodology of inductive learning. Artificial Intelligence, 20(2): 111–161, 1983.
Moore, A. W. and Lee, M. S. Efficient algorithms for minimizing cross validation error. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Moore, A. W., Hill, D. J., and Johnson, M. P. An empirical investigation of brute force to choose features, smoothers and function approximations. In S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and Natural Learning Systems, volume 3. MIT Press, 1992.
Moskovitch, R., Elovici, Y., and Rokach, L. Detection of unknown computer worms based on behavioral classification of the host. Computational Statistics and Data Analysis, 52(9): 4544–4566, 2008.
Pazzani, M. Searching for dependencies in Bayesian classifiers. In Proceedings of the Fifth International Workshop on AI and Statistics, 1995.
Pfahringer, B. Compression-based feature subset selection. In Proceedings of the IJCAI-95 Workshop on Data Engineering for Inductive Learning, pages 109–119, 1995.
Provan, G. M. and Singh, M. Learning Bayesian networks using feature selection. In D. Fisher and H. Lenz, editors, Learning from Data, Lecture Notes in Statistics, pages 291–300. Springer-Verlag, New York, 1996.
Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California, 1993.
Quinlan, J. R. Induction of decision trees. Machine Learning, 1: 81–106, 1986.
Rissanen, J. Modeling by shortest data description. Automatica, 14: 465–471, 1978.