as radial basis function kernels), this centering is equivalent to centering a distance matrix in feature space. Williams (2001) further points out that for these kernels, classical MDS in feature space is equivalent to a form of metric MDS in input space. Although ostensibly kernel PCA gives a function that can be applied to test points, while MDS does not, kernel PCA does so by using the Nyström approximation (see Section 4.2.1), and exactly the same can be done with MDS.
The subject of feature extraction and dimensional reduction is vast. In this review I've limited the discussion to mostly geometric methods, and even with that restriction it's far from complete, so I'd like to alert the reader to three other interesting leads. The first is the method of principal curves, where the idea is to find that smooth curve that passes through the data in such a way that the sum of shortest distances from each point to the curve is minimized, thus providing a nonlinear, one-dimensional summary of the data (Hastie and Stuetzle, 1989); the idea has since been extended by applying various regularization schemes (including kernel-based), and to manifolds of higher dimension (Schölkopf and Smola, 2002). Second, competitions have been held at recent NIPS workshops on feature extraction, and the reader can find a wealth of information there (Guyon, 2003). Finally, recent work on object detection has shown that boosting, where each weak learner uses a single feature, can be a very effective method for finding a small set of good (and mutually complementary) features from a large pool of possible features (Viola and Jones, 2001).
Acknowledgments
I thank John Platt for valuable discussions. Thanks also to Lawrence Saul, Bernhard Schölkopf, Jay Stokes and Mike Tipping for commenting on the manuscript.
References
M.A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
P.F. Baldi and K. Hornik. Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6(4):837–858, July 1995.
A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, New York, 1994.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Y. Bengio, J. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps and spectral clustering. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
C. Berg, J.P.R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, 1984.
C.M. Bishop. Bayesian PCA. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 382–388, Cambridge, MA, 1999. The MIT Press.
I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
B.E. Boser, I.M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, 1992. ACM.
C.J.C. Burges. Some notes on applied mathematics for machine learning. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, pages 21–40. Springer Lecture Notes in Artificial Intelligence, 2004.
C.J.C. Burges, J.C. Platt, and S. Jana. Extracting noise-robust features from audio. In Proc. IEEE Conference on Acoustics, Speech and Signal Processing, pages 1021–1024. IEEE Signal Processing Society, 2002.
C.J.C. Burges, J.C. Platt, and S. Jana. Distortion discriminant analysis for audio fingerprinting. IEEE Transactions on Speech and Audio Processing, 11(3):165–174, 2003.
F.R.K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman and Hall, 2001.
R.B. Darlington. Factor analysis. Technical report, Cornell University, http://comp9.psych.cornell.edu/Darlington/factor.htm.
V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 705–712. MIT Press, 2002.
P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit. Annals of Statistics, 12:793–815, 1984.
K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. John Wiley, 1996.
R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley, 1973.
C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2), 2004.
J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
J.H. Friedman, W. Stuetzle, and A. Schroeder. Projection pursuit density estimation. J. Amer. Statistical Assoc., 79:599–608, 1984.
J.H. Friedman and J.W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, c-23(9):881–890, 1974.
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins, third edition, 1996.
M. Gondran and M. Minoux. Graphs and Algorithms. John Wiley and Sons, 1984.
I. Guyon. NIPS 2003 workshop on feature extraction, 2003. http://clopinet.com/isabelle/Projects/NIPS2003/.
J. Ham, D.D. Lee, S. Mika, and B. Schölkopf. A kernel view of dimensionality reduction of manifolds. In Proceedings of the International Conference on Machine Learning, 2004.
T.J. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
P.J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.
Y. LeCun and Y. Bengio. Convolutional networks for images, speech and time-series. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
M. Meila and J. Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems, pages 873–879, 2000.
S. Mika, B. Schölkopf, A.J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.
A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
J. Platt. Private communication.
J. Platt. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In Z. Ghahramani and R. Cowell, editors, Proc. 10th International Conference on Artificial Intelligence and Statistics, 2005.
W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(22):2323–2326, 2000.
I.J. Schoenberg. Remarks to Maurice Fréchet's article "Sur la définition axiomatique d'une classe d'espace distanciés vectoriellement applicable sur espace de Hilbert". Annals of Mathematics, 36:724–732, 1935.
B. Schölkopf. The kernel trick for distances. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 301–307. MIT Press, 2001.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
C.E. Spearman. 'General intelligence' objectively determined and measured. American Journal of Psychology, 5:201–293, 1904.
C.J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10(4):1040–1053, 1982.
J.B. Tenenbaum. Mapping a manifold of perceptual observations. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61(3):611, 1999A.
M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999B.
P. Viola and M. Jones. Robust real-time object detection. In Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, 2001.
S. Wilks. Mathematical Statistics. John Wiley, 1962.
C.K.I. Williams. On a connection between kernel PCA and metric multidimensional scaling. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 675–681. MIT Press, 2001.
C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Leen, Dietterich, and Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
Dimension Reduction and Feature Selection
Barak Chizi and Oded Maimon
Tel-Aviv University
Summary. Data Mining algorithms search for meaningful patterns in raw data sets. The Data Mining process requires high computational cost when dealing with large data sets. Reducing dimensionality (the number of attributes or the number of records) can effectively cut this cost. This chapter focuses on a pre-processing step which removes dimensions from a given data set before it is fed to a Data Mining algorithm. This work explains how it is often possible to reduce dimensionality with minimum loss of information. A clear dimension reduction taxonomy is described, and techniques for dimension reduction are presented theoretically.
Key words: Dimension Reduction, Preprocessing
5.1 Introduction
Data Mining algorithms are used for searching meaningful patterns in raw data sets. Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most Data Mining algorithms. This obstacle is sometimes known as the “curse of dimensionality” (Elder and Pregibon, 1996). Techniques that are quite efficient in low dimensions (e.g., nearest neighbors) cannot provide any meaningful results once the number of attributes grows beyond a ‘modest’ size of about 10.
Data Mining algorithms are computationally intensive. Figure 5.1 describes the typical trade-off between the error rate of a Data Mining model and the cost of obtaining the model (in particular, the model may be a classification model). The cost is a function of the theoretical complexity of the Data Mining algorithm that derives the model, and is correlated with the time required for the algorithm to run and with the size of the data set. When discussing dimension reduction, given a set of records, the size of the data set is defined as the number of attributes, and is often used as an estimator of the mining cost.
Fig. 5.1. Typical cost-error relation in a classification model.

Theoretically, knowing the exact functional relation between the cost and the error may point out the ideal classifier (i.e., a classifier that produces the minimal error rate ε* and costs h* to be derived). On some occasions, one might prefer using an inferior classifier that uses only a part of the data (h ≤ h*) and produces an increased error rate.
In practice, the exact tradeoff curve of Figure 5.1 is seldom known, and generating it might be computationally prohibitive. The objective of dimension reduction in Data Mining domains is to identify the smallest cost at which a Data Mining algorithm can keep the error rate below ε_f (this error rate is sometimes referred to as the efficiency frontier).
Feature selection is a problem closely related to dimension reduction. The objective of feature selection is to identify features in the data set as important, and discard any other feature as irrelevant and redundant information. Since feature selection reduces the dimensionality of the data, it holds out the possibility of more effective and rapid operation of Data Mining algorithms (i.e., Data Mining algorithms can be operated faster and more effectively by using feature selection). In some cases, as a result of feature selection, accuracy on future classification can be improved; in other instances, the result is a more compact, easily interpreted representation of the target concept (Hall, 1999).
On the other hand, feature selection is a costly process, and it also contradicts the initial assumption that all information (i.e., all attributes) is required in order to achieve maximum accuracy, namely that while some attributes are less important than others, no attribute is irrelevant or redundant. As described later in this work, the feature selection problem is a sub-problem of dimension reduction. Figure 5.2 is a taxonomy of the reasons for dimension reduction.
Fig. 5.2. Taxonomy of the dimension reduction problem.

It can be seen in Figure 5.2 that there are four major reasons for performing dimension reduction. Each reason can be referred to as a distinctive sub-problem:
1. Decreasing the learning (model) cost;
2. Increasing the learning (model) performance;
3. Reducing irrelevant dimensions;
4. Reducing redundant dimensions.
Reduction of redundant dimensions and reduction of irrelevant dimensions can be further divided into two sub-problems:
Feature selection. The objective of feature selection is to identify some features in the data set as important, and discard any other feature as irrelevant and redundant information. The process of feature selection reduces the dimensionality of the data and enables learning algorithms to operate faster and more effectively. In some cases, the accuracy of future classifications can be improved; in others, the result is a more compact, easily interpreted model (Hall, 1999).
Record selection. Just as some attributes are more useful than others, some records (examples) may better aid the learning process than others (Blum and Langley, 1997).
The other two sub-problems of dimension reduction, as described in Figure 5.2, are increasing learning performance and decreasing learning cost. Each of these two sub-problems can also be divided into two further sub-problems: records reduction and attribute reduction. Record reduction is sometimes referred to as sample (or tuple) decomposition. Attribute reduction can be further divided into two sub-problems: attribute decomposition and function decomposition. These decomposition problems embody an extensive methodology called decomposition methodology, discussed in Chapter 50.7 of this volume.
A sub-problem of attribute decomposition, as seen in Figure 5.2, is variable selection. The solution to this problem is a pre-processing step which removes attributes from a given data set before feeding it to a Data Mining algorithm. The rationale for this step is the reduction of the time required for running the Data Mining algorithm, since the running time depends both on the number of records and on the number of attributes in each record (the dimension). Variable selection may sacrifice some accuracy but saves time in the learning process.
This chapter provides a survey of feature selection techniques and variable selection techniques.
5.2 Feature Selection Techniques
5.2.1 Feature Filters
The earliest approaches to feature selection within machine learning were filter methods. All filter methods use heuristics based on general characteristics of the data, rather than a learning algorithm, to evaluate the merit of feature subsets. As a consequence, filter methods are generally much faster than wrapper methods, and, as such, are more practical for use on data of high dimensionality.
FOCUS
Almuallim and Dietterich (1992) describe an algorithm originally designed for Boolean domains called FOCUS. FOCUS exhaustively searches the space of feature subsets until it finds the minimum combination of features that divides the training data into pure classes (that is, where every combination of feature values is associated with a single class). This is referred to as the "min-features bias". Following feature selection, the final feature subset is passed to ID3 (Quinlan, 1986), which constructs a decision tree.
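To make the min-features bias concrete, the following sketch (an illustration, not the original implementation) enumerates feature subsets in order of increasing size and returns the first one under which no two training instances agree on the selected features yet differ in class; the function name and data layout are assumptions.

```python
from itertools import combinations

def focus(X, y):
    """Return the smallest feature subset consistent with the labels.

    X: list of Boolean/nominal feature tuples, y: list of class labels.
    A subset is consistent if no two instances agree on all selected
    features but carry different class labels (the "min-features bias").
    """
    n_features = len(X[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            seen = {}  # projected feature values -> class label
            consistent = True
            for row, label in zip(X, y):
                key = tuple(row[i] for i in subset)
                if key in seen and seen[key] != label:
                    consistent = False
                    break
                seen[key] = label
            if consistent:
                return list(subset)  # first, hence minimal, consistent subset
    return list(range(n_features))

# Tiny usage example: class is x0 AND x1, while x2 is irrelevant.
X = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 0, 0, 1, 1]
print(focus(X, y))  # -> [0, 1]
```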
There are two main difficulties with FOCUS, as pointed out by Caruana and Freitag (1994). Firstly, since FOCUS is driven to attain consistency on the training data, an exhaustive search may be difficult if many features are needed to attain consistency. Secondly, a strong bias towards consistency can be statistically unwarranted and may lead to over-fitting the training data: the algorithm will continue to add features to repair a single inconsistency.
The authors address the first of these problems in their paper (Almuallim and Dietterich, 1992). Three algorithms, each consisting of a forward selection search coupled with a heuristic to approximate the min-features bias, are presented as methods to make FOCUS computationally feasible on domains with many features. The first algorithm evaluates features using the following information theoretic formula:
Entropy(Q) = -\sum_{i=0}^{2^{|Q|}-1} \frac{p_i + n_i}{|Sample|} \left( \frac{p_i}{p_i + n_i} \log_2 \frac{p_i}{p_i + n_i} + \frac{n_i}{p_i + n_i} \log_2 \frac{n_i}{p_i + n_i} \right)    (5.1)
For a given feature subset Q, there are 2^{|Q|} possible truth value assignments to the features. A given feature set divides the training data into groups of instances with the same truth value assignments to the features in Q. Equation 5.1 measures the overall entropy of the class values in these groups; p_i and n_i denote the number of positive and negative examples in the i-th group, respectively. At each stage, the feature which minimizes Equation 5.1 is added to the current feature subset.
The second algorithm chooses the most discriminating feature to add to the current subset at each stage of the search. For a given pair of positive and negative examples, a feature is discriminating if its value differs between the two. At each stage, the feature is chosen which discriminates the greatest number of positive-negative pairs of examples that have not yet been discriminated by any existing feature in the subset.
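As a concrete illustration of the first heuristic, the sketch below computes Equation 5.1 for a candidate subset and greedily adds the feature that minimizes it; binary class labels in {0, 1} and the helper names are assumptions, not part of the original algorithms.

```python
import math
from collections import defaultdict

def entropy_q(X, y, subset):
    """Equation 5.1: entropy of the class labels within the groups of
    instances that share the same values on the features in `subset`."""
    groups = defaultdict(lambda: [0, 0])  # key -> [p_i, n_i]
    for row, label in zip(X, y):
        key = tuple(row[i] for i in subset)
        groups[key][0 if label == 1 else 1] += 1
    total = len(X)
    ent = 0.0
    for p, n in groups.values():
        for c in (p, n):
            if c:
                ent -= ((p + n) / total) * (c / (p + n)) * math.log2(c / (p + n))
    return ent

def greedy_focus(X, y):
    """Forward selection: repeatedly add the feature that minimizes
    Equation 5.1 until the groups are class-pure (entropy 0)."""
    subset = []
    remaining = set(range(len(X[0])))
    while remaining and entropy_q(X, y, subset) > 0.0:
        best = min(remaining, key=lambda f: entropy_q(X, y, subset + [f]))
        subset.append(best)
        remaining.discard(best)
    return subset
```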
The third algorithm is like the second, except that each positive-negative example pair contributes a weighted increment to the score of each feature that discriminates it. The increment depends on the total number of features that discriminate the pair.

LVF

Liu and Setiono (1996) describe an algorithm similar to FOCUS called LVF. Like FOCUS, LVF is consistency driven and, unlike FOCUS, can handle noisy domains if the approximate noise level is known a priori.
LVF generates a random subset S from the feature subset space during each round of execution. If S contains fewer features than the current best subset, the inconsistency rate of the dimensionally reduced data described by S is compared with the inconsistency rate of the best subset. If S is at least as consistent as the best subset, S replaces the best subset.
The inconsistency rate of the training data prescribed by a given feature subset is defined over all groups of matching instances. Within a group of matching instances, the inconsistency count is the number of instances in the group minus the number of instances in the group with the most frequent class value. The overall inconsistency rate is the sum of the inconsistency counts of all groups of matching instances divided by the total number of instances. Liu and Setiono report good results for LVF when applied to some artificial domains, and mixed results when applied to commonly used natural domains. They also applied LVF to two "large" data sets: the first having 65,000 instances described by 59 attributes; the second having 5909 instances described by 81 attributes. They report that LVF was able to reduce the number of attributes on both data sets by more than half. They also note that due to the random nature of LVF, the longer it is allowed to execute, the better the results (as measured by the inconsistency criterion).
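A compact sketch of the inconsistency rate and the random search loop as described above, assuming nominal features; the function names and the fixed round budget are illustrative, and the published LVF additionally compares the rate against the a-priori noise level.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """For each group of instances matching on `subset`, count instances
    minus the most frequent class count; sum over groups and divide by
    the total number of instances."""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        groups[tuple(row[i] for i in subset)][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(X)

def lvf(X, y, rounds=1000, rng=random):
    """Las Vegas Filter: sample random subsets, keep the smallest subset
    that is at least as consistent as the current best."""
    n_features = len(X[0])
    best = list(range(n_features))
    best_rate = inconsistency_rate(X, y, best)
    for _ in range(rounds):
        subset = rng.sample(range(n_features), rng.randint(1, n_features))
        if len(subset) < len(best):
            rate = inconsistency_rate(X, y, subset)
            if rate <= best_rate:  # S is at least as consistent as the best subset
                best, best_rate = subset, rate
    return best
```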
Filtering Features Through Discretization
Setiono and Liu (1996) note that discretization has the potential to perform feature selection among numeric features. If a numeric feature can justifiably be discretized to a single value, then it can safely be removed from the data.
The combined discretization and feature selection algorithm Chi2 uses a chi-square (χ2) statistic to perform discretization. Numeric attributes are initially sorted by placing each observed value into its own interval. Each numeric attribute is then repeatedly discretized by using the χ2 test to determine when adjacent intervals should be merged.
The extent of the merging process is controlled by the use of an automatically set χ2 threshold. The threshold is determined by attempting to maintain the original fidelity of the data: inconsistency (measured the same way as in the LVF algorithm described above) controls the process. The authors report results on three natural domains containing a mixture of numeric and nominal features, using C4.5 (Quinlan, 1993, Quinlan, 1986) before and after discretization. They conclude that Chi2 is effective at improving C4.5's performance and eliminating some features. However, it is not clear whether C4.5's improvement is due entirely to some features having been removed or whether discretization plays a role as well.
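The sketch below shows the core χ2-driven merging step on a single numeric attribute; a fixed threshold stands in for Chi2's automatically set, inconsistency-controlled threshold, and all names are illustrative.

```python
from collections import Counter

def chi2_stat(count_a, count_b, classes):
    """Pearson chi-square statistic for two adjacent intervals,
    given per-class counts in each interval."""
    total = sum(count_a.values()) + sum(count_b.values())
    stat = 0.0
    for interval in (count_a, count_b):
        n_interval = sum(interval.values())
        for c in classes:
            expected = n_interval * (count_a[c] + count_b[c]) / total
            if expected > 0:
                stat += (interval[c] - expected) ** 2 / expected
    return stat

def chi2_merge(values, labels, threshold):
    """Repeatedly merge the adjacent pair of intervals with the smallest
    chi-square value while that value stays below the threshold."""
    classes = set(labels)
    intervals = []  # one (lower bound, class counts) entry per observed value
    for v, lab in sorted(zip(values, labels)):
        if not intervals or intervals[-1][0] != v:
            intervals.append((v, Counter()))
        intervals[-1][1][lab] += 1
    while len(intervals) > 1:
        stats = [chi2_stat(intervals[i][1], intervals[i + 1][1], classes)
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break
        lo, merged = intervals[i][0], intervals[i][1] + intervals[i + 1][1]
        intervals[i:i + 2] = [(lo, merged)]
    # a single remaining interval means the feature can be dropped
    return [lo for lo, _ in intervals]
```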
Using One Learning Algorithm as a Filter for Another
Several researchers have explored the possibility of using a particular learning algorithm as a pre-processor to discover useful feature subsets for a primary learning algorithm. Cardie (1995) describes the application of decision tree algorithms to the task of selecting feature subsets for use by instance based learners. C4.5 was applied to three natural language data sets; only the features that appeared in the final decision trees were used with a k-nearest neighbor classifier. The use of this hybrid system resulted in significantly better performance than either C4.5 or the k-nearest neighbor algorithm when used alone.
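A sketch of this hybrid filter using scikit-learn's CART-style tree in place of C4.5 and assuming NumPy arrays; the function name and parameter choices are illustrative, not the setup used by Cardie.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def tree_filtered_knn(X_train, y_train, X_test, n_neighbors=5):
    """Train a decision tree, keep only the features it actually tests,
    then train a k-nearest neighbor classifier on that reduced feature set."""
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # tree_.feature holds the feature index tested at each node (negative
    # values mark leaves), so the unique non-negative entries are exactly
    # the features appearing in the final tree.
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train[:, used], y_train)
    return knn.predict(X_test[:, used]), used
```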
In a similar approach, Singh and Provan (1996) use a greedy oblivious decision tree algorithm to select features from which to construct a Bayesian network. Oblivious decision trees differ from those constructed by algorithms such as C4.5 in that all nodes at the same level of an oblivious decision tree test the same attribute. Feature subsets selected by three oblivious decision tree algorithms, each employing a different information splitting criterion, were evaluated with a Bayesian network classifier on several machine learning datasets. Results showed that Bayesian networks using features selected by the oblivious decision tree algorithms outperformed Bayesian networks without feature selection.
Holmes and Nevill-Manning (1995) use Holte's 1R system (Holte, 1993) to estimate the predictive accuracy of individual features. 1R builds rules based on single features (called predictive 1-rules; 1-rules can be thought of as single-level decision trees). If the data is split into training and test sets, it is possible to calculate a classification accuracy for each rule and hence for each feature. From the classification scores, a ranked list of features is obtained. Experiments with choosing a select number of the highest ranked features and using them with common machine learning algorithms showed that, on average, the top three or more features are as accurate as using the original set. This approach is unusual due to the fact that no search is conducted. Instead, it relies on the user to decide how many features to include from the ranked list in the final subset.
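A rough sketch of 1R-style feature ranking on a held-out split, assuming nominal features; the helper names and the tie-breaking and default rules are simplifying assumptions.

```python
from collections import Counter, defaultdict

def one_r_rank(X_train, y_train, X_test, y_test):
    """Rank features by the held-out accuracy of a one-feature rule that
    maps each observed value to its most frequent training class."""
    scores = []
    default = Counter(y_train).most_common(1)[0][0]  # fallback for unseen values
    for f in range(len(X_train[0])):
        rule = defaultdict(Counter)
        for row, label in zip(X_train, y_train):
            rule[row[f]][label] += 1
        predict = {v: c.most_common(1)[0][0] for v, c in rule.items()}
        correct = sum(predict.get(row[f], default) == label
                      for row, label in zip(X_test, y_test))
        scores.append((correct / len(X_test), f))
    return [f for _, f in sorted(scores, reverse=True)]  # best feature first
```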
Pfahringer (1995) uses a program for inducing decision table majority classifiers to select features. DTM (Decision Table Majority) classifiers are a simple type of nearest neighbor classifier where the similarity function is restricted to returning stored instances that are exact matches with the instance to be classified. If no instances are returned, the most prevalent class in the training data is used as the predicted class; otherwise, the majority class of all matching instances is used. DTM works best when all features are nominal. Induction of a DTM is achieved by greedily searching the space of possible decision tables. Since a decision table is defined by the features it includes, induction is simply feature selection.
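A minimal sketch of the DTM prediction rule itself, assuming nominal features; the greedy search over feature subsets (and the MDL or cross-validation evaluation discussed next) is omitted, and the class name is illustrative.

```python
from collections import Counter, defaultdict

class DecisionTableMajority:
    """Predict the majority class of training instances that exactly match
    the query on the selected features; fall back to the global majority."""

    def __init__(self, features):
        self.features = features  # indices of the features in the table

    def fit(self, X, y):
        self.table = defaultdict(Counter)
        for row, label in zip(X, y):
            self.table[tuple(row[f] for f in self.features)][label] += 1
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict_one(self, row):
        matches = self.table.get(tuple(row[f] for f in self.features))
        return matches.most_common(1)[0][0] if matches else self.default
```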
In Pfahringer’s approach, the minimum description length principle (MDL) (Ris-sanen, 1978) guides the search by estimating the cost of encoding a decision table and the training examples it misclassifies with respect to a given feature subset The features appearing in the final decision table are then used with other learning algo-rithms Experiments on a small selection of machine learning datasets showed that feature selection by DTM induction can improve the accuracy of C4.5 in some cases DTM classifiers induced using MDL were also compared with those induced using cross- validation (a wrapper approach) to estimate the accuracy of tables (and hence feature sets) The MDL approach was shown to be more efficient than, and perform
as well as, as cross- validation
An Information Theoretic Feature Filter
Koller and Sahami (1996) introduced a feature selection algorithm based on ideas from information theory and probabilistic reasoning. The rationale behind their approach is that, since the goal of an induction algorithm is to estimate the probability distributions over the class values, given the original feature set, feature subset selection should attempt to remain as close to these original distributions as possible.
More formally, let C be a set of classes, V a set of features, X a subset of V, v an assignment of values (v_1, ..., v_n) to the features in V, and v_X the projection of the values in v onto the variables in X. The goal of the feature selector is to choose X so that P(C|X = v_X) is as close as possible to P(C|V = v).
To achieve this goal, the algorithm begins with all the original features and employs a backward elimination search to remove, at each stage, the feature that causes the least change between the two distributions. Because it is not reliable to estimate high order probability distributions from limited data, an approximate algorithm is given that uses pair-wise combinations of features. Cross entropy is used to measure the difference between two distributions, and the user must specify how many features are to be removed by the algorithm. The cross entropy of the class distribution given a pair of features is: