as radial basis function kernels), this centering is equivalent to centering a distance matrix in feature space. Williams (2001) further points out that for these kernels, classical MDS in feature space is equivalent to a form of metric MDS in input space. Although ostensibly kernel PCA gives a function that can be applied to test points, while MDS does not, kernel PCA does so by using the Nyström approximation (see Section 4.2.1), and exactly the same can be done with MDS.
The subject of feature extraction and dimensional reduction is vast. In this review I've limited the discussion to mostly geometric methods, and even with that restriction it's far from complete, so I'd like to alert the reader to three other interesting leads. The first is the method of principal curves, where the idea is to find that smooth curve that passes through the data in such a way that the sum of shortest distances from each point to the curve is minimized, thus providing a nonlinear, one-dimensional summary of the data (Hastie and Stuetzle, 1989); the idea has since been extended by applying various regularization schemes (including kernel-based), and to manifolds of higher dimension (Schölkopf and Smola, 2002). Second, competitions have been held at recent NIPS workshops on feature extraction, and the reader can find a wealth of information there (Guyon, 2003). Finally, recent work on object detection has shown that boosting, where each weak learner uses a single feature, can be a very effective method for finding a small set of good (and mutually complementary) features from a large pool of possible features (Viola and Jones, 2001).
Acknowledgments
I thank John Platt for valuable discussions. Thanks also to Lawrence Saul, Bernhard Schölkopf, Jay Stokes and Mike Tipping for commenting on the manuscript.
References
M.A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
P.F. Baldi and K. Hornik. Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6(4):837–858, July 1995.
A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, New York, 1994.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Y. Bengio, J. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps and spectral clustering. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
C. Berg, J.P.R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, 1984.
C.M. Bishop. Bayesian PCA. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 382–388, Cambridge, MA, 1999. The MIT Press.
I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, 1992. ACM.
C.J.C. Burges. Some notes on applied mathematics for machine learning. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, pages 21–40. Springer Lecture Notes in Artificial Intelligence, 2004.
C.J.C. Burges, J.C. Platt, and S. Jana. Extracting noise-robust features from audio. In Proc. IEEE Conference on Acoustics, Speech and Signal Processing, pages 1021–1024. IEEE Signal Processing Society, 2002.
C.J.C. Burges, J.C. Platt, and S. Jana. Distortion discriminant analysis for audio fingerprinting. IEEE Transactions on Speech and Audio Processing, 11(3):165–174, 2003.
F.R.K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman and Hall, 2001.
R.B. Darlington. Factor analysis. Technical report, Cornell University, http://comp9.psych.cornell.edu/Darlington/factor.htm.
V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 705–712. MIT Press, 2002.
P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit. Annals of Statistics, 12:793–815, 1984.
K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. John Wiley, 1996.
R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley, 1973.
C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2), 2004.
J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
J.H. Friedman, W. Stuetzle, and A. Schroeder. Projection pursuit density estimation. J. Amer. Statistical Assoc., 79:599–608, 1984.
J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, c-23(9):881–890, 1974.
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins, third edition, 1996.
M. Gondran and M. Minoux. Graphs and Algorithms. John Wiley and Sons, 1984.
I. Guyon. NIPS 2003 workshop on feature extraction, 2003. http://clopinet.com/isabelle/Projects/NIPS2003/.
J. Ham, D.D. Lee, S. Mika, and B. Schölkopf. A kernel view of dimensionality reduction of manifolds. In Proceedings of the International Conference on Machine Learning, 2004.
T.J. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
P.J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.
Y. LeCun and Y. Bengio. Convolutional networks for images, speech and time-series. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
M. Meila and J. Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems, pages 873–879, 2000.
S. Mika, B. Schölkopf, A.J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.
A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
J. Platt. Private communication.
J. Platt. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In Z. Ghahramani and R. Cowell, editors, Proc. 10th International Conference on Artificial Intelligence and Statistics, 2005.
W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(22):2323–2326, 2000.
I.J. Schoenberg. Remarks to Maurice Fréchet's article "Sur la définition axiomatique d'une classe d'espace distanciés vectoriellement applicable sur espace de Hilbert". Annals of Mathematics, 36:724–732, 1935.
B. Schölkopf. The kernel trick for distances. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 301–307. MIT Press, 2001.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
C.E. Spearman. 'General intelligence' objectively determined and measured. American Journal of Psychology, 5:201–293, 1904.
C.J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10(4):1040–1053, 1982.
J.B. Tenenbaum. Mapping a manifold of perceptual observations. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61(3):611, 1999A.
M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999B.
P. Viola and M. Jones. Robust real-time object detection. In Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, 2001.
S. Wilks. Mathematical Statistics. John Wiley, 1962.
C.K.I. Williams. On a connection between kernel PCA and metric multidimensional scaling. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 675–681. MIT Press, 2001.
C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Leen, Dietterich, and Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
Dimension Reduction and Feature Selection
Barak Chizi and Oded Maimon
Tel-Aviv University
Summary. Data Mining algorithms search for meaningful patterns in raw data sets. The Data Mining process requires high computational cost when dealing with large data sets. Reducing dimensionality (the number of attributes or the number of records) can effectively cut this cost. This chapter focuses on a pre-processing step which removes dimensions from a given data set before it is fed to a Data Mining algorithm. This work explains how it is often possible to reduce dimensionality with minimum loss of information. A clear dimension reduction taxonomy is described, and techniques for dimension reduction are presented theoretically.
Key words: Dimension Reduction, Preprocessing
5.1 Introduction
Data Mining algorithms are used for searching meaningful patterns in raw data sets. Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most Data Mining algorithms. This obstacle is sometimes known as the “curse of dimensionality” (Elder and Pregibon, 1996). Techniques that are quite efficient in low dimensions (e.g., nearest neighbors) cannot provide any meaningful results once the number of attributes grows beyond a ‘modest’ size of about 10.
Data Mining algorithms are computationally intensive. Figure 5.1 describes the typical trade-off between the error rate of a Data Mining model and the cost of obtaining the model (in particular, the model may be a classification model). The cost is a function of the theoretical complexity of the Data Mining algorithm that derives the model, and is correlated with the time required for the algorithm to run and with the size of the data set. When discussing dimension reduction, given a set of records, the size of the data set is defined as the number of attributes, and is often used as an estimator of the mining cost.
Fig. 5.1. Typical cost-error relation in a classification model.

Theoretically, knowing the exact functional relation between the cost and the error may point out the ideal classifier (i.e., a classifier that produces the minimal error rate ε* and costs h* to be derived). On some occasions, one might prefer using an inferior classifier that uses only a part of the data (h ≤ h*) and produces an increased error rate.
In practice, the exact tradeoff curve of Figure 5.1 is seldom known, and generating it might be computationally prohibitive. The objective of dimension reduction in Data Mining domains is to identify the smallest cost at which a Data Mining algorithm can keep the error rate below ε_f (this error rate is sometimes referred to as the efficiency frontier).
Feature selection is a problem closely related to dimension reduction. The objective of feature selection is to identify features in the data set as important, and discard any other feature as irrelevant and redundant information. Since feature selection reduces the dimensionality of the data, it holds out the possibility of more effective and rapid operation of Data Mining algorithms (i.e., Data Mining algorithms can be operated faster and more effectively by using feature selection). In some cases, as a result of feature selection, accuracy on future classification can be improved; in other instances, the result is a more compact, easily interpreted representation of the target concept (Hall, 1999).
On the other hand, feature selection is a costly process, and it also contradicts the initial assumption that all information (i.e., all attributes) is required in order to achieve maximum accuracy, namely that while some attributes are less important than others, no attribute is irrelevant or redundant. As described later in this work, the feature selection problem is a sub-problem of dimension reduction. Figure 5.2 is a taxonomy of the reasons for dimension reduction.
Fig. 5.2. Taxonomy of the dimension reduction problem.

It can be seen in Figure 5.2 that there are four major reasons for performing dimension reduction. Each reason can be referred to as a distinctive sub-problem:
1. Decreasing the learning (model) cost;
2. Increasing the learning (model) performance;
3. Reducing irrelevant dimensions;
4. Reducing redundant dimensions.
Reduction of redundant dimensions and reduction of irrelevant dimensions can be further divided into two sub-problems:
Feature selection. The objective of feature selection is to identify some features in the data set as important, and discard any other feature as irrelevant and redundant information. The process of feature selection reduces the dimensionality of the data and enables learning algorithms to operate faster and more effectively. In some cases, the accuracy of future classifications can be improved; in others, the result is a more compact, easily interpreted model (Hall, 1999).
Record selection. Just as some attributes are more useful than others, some records (examples) may better aid the learning process than others (Blum and Langley, 1997).
The other two sub-problems of dimension reduction, as described in Figure 5.2, are increasing learning performance and decreasing learning cost. Each of these two sub-problems can also be divided into two further sub-problems: records reduction and attribute reduction. Record reduction is sometimes referred to as sample (or tuple) decomposition. Attribute reduction can be further divided into two sub-problems: attribute decomposition and function decomposition. These decomposition problems embody an extensive methodology called decomposition methodology, discussed in Chapter 50.7 of this volume.
A sub-problem of attribute decomposition, as seen in Figure 5.2, is variable selection. The solution to this problem is a pre-processing step which removes attributes from a given data set before feeding it to a Data Mining algorithm. The rationale for this step is the reduction of the time required for running the Data Mining algorithm, since the running time depends both on the number of records and on the number of attributes in each record (the dimension). Variable selection may sacrifice some accuracy but saves time in the learning process.
This chapter provides a survey of feature selection techniques and variable selection techniques.
5.2 Feature Selection Techniques
5.2.1 Feature Filters
The earliest approaches to feature selection within machine learning were filter methods. All filter methods use heuristics based on general characteristics of the data, rather than a learning algorithm, to evaluate the merit of feature subsets. As a consequence, filter methods are generally much faster than wrapper methods, and, as such, are more practical for use on data of high dimensionality.
FOCUS
Almuallim and Dietterich (1992) describe an algorithm originally designed for Boolean domains called FOCUS. FOCUS exhaustively searches the space of feature subsets until it finds the minimum combination of features that divides the training data into pure classes (that is, where every combination of feature values is associated with a single class). This is referred to as the "min-features bias". Following feature selection, the final feature subset is passed to ID3 (Quinlan, 1986), which constructs a decision tree.
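To make the min-features bias concrete, the following sketch (an illustration, not the original implementation) enumerates feature subsets in order of increasing size and returns the first one under which no two training instances agree on the selected features yet differ in class; the function name and data layout are assumptions.

```python
from itertools import combinations

def focus(X, y):
    """Return the smallest feature subset consistent with the labels.

    X: list of Boolean/nominal feature tuples, y: list of class labels.
    A subset is consistent if no two instances agree on all selected
    features but carry different class labels (the "min-features bias").
    """
    n_features = len(X[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            seen = {}  # projected feature values -> class label
            consistent = True
            for row, label in zip(X, y):
                key = tuple(row[i] for i in subset)
                if key in seen and seen[key] != label:
                    consistent = False
                    break
                seen[key] = label
            if consistent:
                return list(subset)  # first, hence minimal, consistent subset
    return list(range(n_features))

# Tiny usage example: class is x0 AND x1, while x2 is irrelevant.
X = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 0, 0, 1, 1]
print(focus(X, y))  # -> [0, 1]
```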
There are two main difficulties with FOCUS, as pointed out by Caruana and Freitag (1994). Firstly, since FOCUS is driven to attain consistency on the training data, an exhaustive search may be difficult if many features are needed to attain consistency. Secondly, a strong bias towards consistency can be statistically unwarranted and may lead to over-fitting the training data: the algorithm will continue to add features to repair a single inconsistency.
The authors address the first of these problems in their paper (Almuallim and Dietterich, 1992). Three algorithms, each consisting of a forward selection search coupled with a heuristic to approximate the min-features bias, are presented as methods to make FOCUS computationally feasible on domains with many features. The first algorithm evaluates features using the following information theoretic formula:
Entropy(Q) = -\sum_{i=0}^{2^{|Q|}-1} \frac{p_i + n_i}{|Sample|} \left( \frac{p_i}{p_i + n_i} \log_2 \frac{p_i}{p_i + n_i} + \frac{n_i}{p_i + n_i} \log_2 \frac{n_i}{p_i + n_i} \right)    (5.1)
For a given feature subset Q, there are 2^{|Q|} possible truth value assignments to the features. A given feature set divides the training data into groups of instances with the same truth value assignments to the features in Q. Equation 5.1 measures the overall entropy of the class values in these groups; p_i and n_i denote the number of positive and negative examples in the i-th group, respectively. At each stage, the feature which minimizes Equation 5.1 is added to the current feature subset.
The second algorithm chooses the most discriminating feature to add to the current subset at each stage of the search. For a given pair of positive and negative examples, a feature is discriminating if its value differs between the two. At each stage, the feature is chosen which discriminates the greatest number of positive-negative pairs of examples that have not yet been discriminated by any existing feature in the subset.
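As a concrete illustration of the first heuristic, the sketch below computes Equation 5.1 for a candidate subset and greedily adds the feature that minimizes it; binary class labels in {0, 1} and the helper names are assumptions, not part of the original algorithms.

```python
import math
from collections import defaultdict

def entropy_q(X, y, subset):
    """Equation 5.1: entropy of the class labels within the groups of
    instances that share the same values on the features in `subset`."""
    groups = defaultdict(lambda: [0, 0])  # key -> [p_i, n_i]
    for row, label in zip(X, y):
        key = tuple(row[i] for i in subset)
        groups[key][0 if label == 1 else 1] += 1
    total = len(X)
    ent = 0.0
    for p, n in groups.values():
        for c in (p, n):
            if c:
                ent -= ((p + n) / total) * (c / (p + n)) * math.log2(c / (p + n))
    return ent

def greedy_focus(X, y):
    """Forward selection: repeatedly add the feature that minimizes
    Equation 5.1 until the groups are class-pure (entropy 0)."""
    subset = []
    remaining = set(range(len(X[0])))
    while remaining and entropy_q(X, y, subset) > 0.0:
        best = min(remaining, key=lambda f: entropy_q(X, y, subset + [f]))
        subset.append(best)
        remaining.discard(best)
    return subset
```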
The third algorithm is like the second, except that each positive-negative example pair contributes a weighted increment to the score of each feature that discriminates it. The increment depends on the total number of features that discriminate the pair.

LVF

Liu and Setiono (1996) describe an algorithm similar to FOCUS called LVF. Like FOCUS, LVF is consistency driven and, unlike FOCUS, can handle noisy domains if the approximate noise level is known a priori.
LVF generates a random subset S from the feature subset space during each round of execution. If S contains fewer features than the current best subset, the inconsistency rate of the dimensionally reduced data described by S is compared with the inconsistency rate of the best subset. If S is at least as consistent as the best subset, S replaces the best subset.
The inconsistency rate of the training data prescribed by a given feature subset is defined over all groups of matching instances. Within a group of matching instances, the inconsistency count is the number of instances in the group minus the number of instances in the group with the most frequent class value. The overall inconsistency rate is the sum of the inconsistency counts of all groups of matching instances divided by the total number of instances. Liu and Setiono report good results for LVF when applied to some artificial domains, and mixed results when applied to commonly used natural domains. They also applied LVF to two "large" data sets: the first having 65,000 instances described by 59 attributes; the second having 5909 instances described by 81 attributes. They report that LVF was able to reduce the number of attributes on both data sets by more than half. They also note that due to the random nature of LVF, the longer it is allowed to execute, the better the results (as measured by the inconsistency criterion).
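A compact sketch of the inconsistency rate and the random search loop as described above, assuming nominal features; the function names and the fixed round budget are illustrative, and the published LVF additionally compares the rate against the a-priori noise level.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """For each group of instances matching on `subset`, count instances
    minus the most frequent class count; sum over groups and divide by
    the total number of instances."""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        groups[tuple(row[i] for i in subset)][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(X)

def lvf(X, y, rounds=1000, rng=random):
    """Las Vegas Filter: sample random subsets, keep the smallest subset
    that is at least as consistent as the current best."""
    n_features = len(X[0])
    best = list(range(n_features))
    best_rate = inconsistency_rate(X, y, best)
    for _ in range(rounds):
        subset = rng.sample(range(n_features), rng.randint(1, n_features))
        if len(subset) < len(best):
            rate = inconsistency_rate(X, y, subset)
            if rate <= best_rate:  # S is at least as consistent as the best subset
                best, best_rate = subset, rate
    return best
```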
Filtering Features Through Discretization
Setiono and Liu (1996) note that discretization has the potential to perform feature selection among numeric features. If a numeric feature can justifiably be discretized to a single value, then it can safely be removed from the data.
The combined discretization and feature selection algorithm Chi2 uses a chi-square (χ2) statistic to perform discretization. Numeric attributes are initially sorted by placing each observed value into its own interval. Each numeric attribute is then repeatedly discretized by using the χ2 test to determine when adjacent intervals should be merged.
The extent of the merging process is controlled by the use of an automatically set χ2 threshold. The threshold is determined by attempting to maintain the original fidelity of the data: inconsistency (measured the same way as in the LVF algorithm described above) controls the process. The authors report results on three natural domains containing a mixture of numeric and nominal features, using C4.5 (Quinlan, 1993, Quinlan, 1986) before and after discretization. They conclude that Chi2 is effective at improving C4.5's performance and eliminating some features. However, it is not clear whether C4.5's improvement is due entirely to some features having been removed or whether discretization plays a role as well.
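The sketch below shows the core χ2-driven merging step on a single numeric attribute; a fixed threshold stands in for Chi2's automatically set, inconsistency-controlled threshold, and all names are illustrative.

```python
from collections import Counter

def chi2_stat(count_a, count_b, classes):
    """Pearson chi-square statistic for two adjacent intervals,
    given per-class counts in each interval."""
    total = sum(count_a.values()) + sum(count_b.values())
    stat = 0.0
    for interval in (count_a, count_b):
        n_interval = sum(interval.values())
        for c in classes:
            expected = n_interval * (count_a[c] + count_b[c]) / total
            if expected > 0:
                stat += (interval[c] - expected) ** 2 / expected
    return stat

def chi2_merge(values, labels, threshold):
    """Repeatedly merge the adjacent pair of intervals with the smallest
    chi-square value while that value stays below the threshold."""
    classes = set(labels)
    intervals = []  # one (lower bound, class counts) entry per observed value
    for v, lab in sorted(zip(values, labels)):
        if not intervals or intervals[-1][0] != v:
            intervals.append((v, Counter()))
        intervals[-1][1][lab] += 1
    while len(intervals) > 1:
        stats = [chi2_stat(intervals[i][1], intervals[i + 1][1], classes)
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break
        lo, merged = intervals[i][0], intervals[i][1] + intervals[i + 1][1]
        intervals[i:i + 2] = [(lo, merged)]
    # a single remaining interval means the feature can be dropped
    return [lo for lo, _ in intervals]
```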
Using One Learning Algorithm as a Filter for Another
Several researchers have explored the possibility of using a particular learning algorithm as a pre-processor to discover useful feature subsets for a primary learning algorithm. Cardie (1995) describes the application of decision tree algorithms to the task of selecting feature subsets for use by instance based learners. C4.5 was applied to three natural language data sets; only the features that appeared in the final decision trees were used with a k-nearest neighbor classifier. The use of this hybrid system resulted in significantly better performance than either C4.5 or the k-nearest neighbor algorithm when used alone.
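A sketch of this hybrid filter using scikit-learn's CART-style tree in place of C4.5 and assuming NumPy arrays; the function name and parameter choices are illustrative, not the setup used by Cardie.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def tree_filtered_knn(X_train, y_train, X_test, n_neighbors=5):
    """Train a decision tree, keep only the features it actually tests,
    then train a k-nearest neighbor classifier on that reduced feature set."""
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # tree_.feature holds the feature index tested at each node (negative
    # values mark leaves), so the unique non-negative entries are exactly
    # the features appearing in the final tree.
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train[:, used], y_train)
    return knn.predict(X_test[:, used]), used
```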
In a similar approach, Singh and Provan (1996) use a greedy oblivious decision tree algorithm to select features from which to construct a Bayesian network. Oblivious decision trees differ from those constructed by algorithms such as C4.5 in that all nodes at the same level of an oblivious decision tree test the same attribute. Feature subsets selected by three oblivious decision tree algorithms, each employing a different information splitting criterion, were evaluated with a Bayesian network classifier on several machine learning datasets. Results showed that Bayesian networks using features selected by the oblivious decision tree algorithms outperformed Bayesian networks without feature selection.
Holmes and Nevill-Manning (1995) use Holte's 1R system (Holte, 1993) to estimate the predictive accuracy of individual features. 1R builds rules based on single features (called predictive 1-rules; 1-rules can be thought of as single-level decision trees). If the data is split into training and test sets, it is possible to calculate a classification accuracy for each rule and hence for each feature. From the classification scores, a ranked list of features is obtained. Experiments with choosing a select number of the highest ranked features and using them with common machine learning algorithms showed that, on average, the top three or more features are as accurate as using the original set. This approach is unusual due to the fact that no search is conducted. Instead, it relies on the user to decide how many features to include from the ranked list in the final subset.
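A rough sketch of 1R-style feature ranking on a held-out split, assuming nominal features; the helper names and the tie-breaking and default rules are simplifying assumptions.

```python
from collections import Counter, defaultdict

def one_r_rank(X_train, y_train, X_test, y_test):
    """Rank features by the held-out accuracy of a one-feature rule that
    maps each observed value to its most frequent training class."""
    scores = []
    default = Counter(y_train).most_common(1)[0][0]  # fallback for unseen values
    for f in range(len(X_train[0])):
        rule = defaultdict(Counter)
        for row, label in zip(X_train, y_train):
            rule[row[f]][label] += 1
        predict = {v: c.most_common(1)[0][0] for v, c in rule.items()}
        correct = sum(predict.get(row[f], default) == label
                      for row, label in zip(X_test, y_test))
        scores.append((correct / len(X_test), f))
    return [f for _, f in sorted(scores, reverse=True)]  # best feature first
```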
Pfahringer (1995) uses a program for inducing decision table majority classifiers to select features. DTM (Decision Table Majority) classifiers are a simple type of nearest neighbor classifier where the similarity function is restricted to returning stored instances that are exact matches with the instance to be classified. If no instances are returned, the most prevalent class in the training data is used as the predicted class; otherwise, the majority class of all matching instances is used. DTM works best when all features are nominal. Induction of a DTM is achieved by greedily searching the space of possible decision tables. Since a decision table is defined by the features it includes, induction is simply feature selection.
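A minimal sketch of the DTM prediction rule itself, assuming nominal features; the greedy search over feature subsets (and the MDL or cross-validation evaluation discussed next) is omitted, and the class name is illustrative.

```python
from collections import Counter, defaultdict

class DecisionTableMajority:
    """Predict the majority class of training instances that exactly match
    the query on the selected features; fall back to the global majority."""

    def __init__(self, features):
        self.features = features  # indices of the features in the table

    def fit(self, X, y):
        self.table = defaultdict(Counter)
        for row, label in zip(X, y):
            self.table[tuple(row[f] for f in self.features)][label] += 1
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict_one(self, row):
        matches = self.table.get(tuple(row[f] for f in self.features))
        return matches.most_common(1)[0][0] if matches else self.default
```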
In Pfahringer’s approach, the minimum description length principle (MDL) (Ris-sanen, 1978) guides the search by estimating the cost of encoding a decision table and the training examples it misclassifies with respect to a given feature subset The features appearing in the final decision table are then used with other learning algo-rithms Experiments on a small selection of machine learning datasets showed that feature selection by DTM induction can improve the accuracy of C4.5 in some cases DTM classifiers induced using MDL were also compared with those induced using cross- validation (a wrapper approach) to estimate the accuracy of tables (and hence feature sets) The MDL approach was shown to be more efficient than, and perform
as well as, as cross- validation
An Information Theoretic Feature Filter
Koller and Sahami (1996) introduced a feature selection algorithm based on ideas from information theory and probabilistic reasoning. The rationale behind their approach is that, since the goal of an induction algorithm is to estimate the probability distributions over the class values, given the original feature set, feature subset selection should attempt to remain as close to these original distributions as possible.
More formally, let C be a set of classes, V a set of features, X a subset of V, v an assignment of values (v_1, ..., v_n) to the features in V, and v_X the projection of the values in v onto the variables in X. The goal of the feature selector is to choose X so that P(C|X = v_X) is as close as possible to P(C|V = v).
To achieve this goal, the algorithm begins with all the original features and employs a backward elimination search to remove, at each stage, the feature that causes the least change between the two distributions. Because it is not reliable to estimate high order probability distributions from limited data, an approximate algorithm is given that uses pair-wise combinations of features. Cross entropy is used to measure the difference between two distributions, and the user must specify how many features are to be removed by the algorithm. The cross entropy of the class distribution given a pair of features is: