D(\Pr(C \mid V_i = v_i, V_j = v_j), \Pr(C \mid V_j = v_j)) = \sum_{c \in C} p(c \mid V_i = v_i, V_j = v_j) \log_2 \frac{p(c \mid V_i = v_i, V_j = v_j)}{p(c \mid V_j = v_j)}    (5.2)
For each feature i, the algorithm finds a set M_i, containing K attributes from those that remain, that is likely to include the information feature i has about the class values. M_i contains the K features, out of those remaining, for which the value of Equation (5.2) is smallest. The expected cross entropy between the distribution of the class values given M_i and V_i, and the distribution of the class values given just M_i, is calculated for each feature i. The feature for which this quantity is minimal is removed from the set. This process iterates until the user-specified number of features has been removed from the original set.
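To make the measure concrete, the following is a minimal Python sketch (not the original authors' code) of how the quantity in Equation (5.2), averaged over the observed value pairs of V_i and V_j, might be estimated from nominal data; the function and variable names are illustrative assumptions.

import math
from collections import Counter, defaultdict

def expected_cross_entropy(instances, classes, i, j):
    """Average of Equation (5.2) over the observed joint values of
    features i and j: how much information about the class is lost
    when feature i is dropped and only feature j is kept."""
    n = len(instances)
    joint = defaultdict(Counter)   # (v_i, v_j) -> class counts
    cond_j = defaultdict(Counter)  # v_j -> class counts
    for x, c in zip(instances, classes):
        joint[(x[i], x[j])][c] += 1
        cond_j[x[j]][c] += 1
    total = 0.0
    for (vi, vj), class_counts in joint.items():
        n_ij = sum(class_counts.values())
        n_j = sum(cond_j[vj].values())
        d = 0.0
        for c, count in class_counts.items():
            p_ij = count / n_ij           # p(c | V_i = v_i, V_j = v_j)
            p_j = cond_j[vj][c] / n_j     # p(c | V_j = v_j)
            d += p_ij * math.log2(p_ij / p_j)
        total += (n_ij / n) * d           # weight by the empirical Pr(v_i, v_j)
    return total

In the filter described above, the K features j with the smallest values of this quantity would form the conditioning set M_i.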
Experiments on natural domains and two artificial domains using C4.5 and naive Bayes as the final induction algorithm showed that the feature selector gives the best results when the size K of the conditioning set M is set to 2. In two domains containing over 1000 features the algorithm is able to reduce the number of features by more than half, while improving accuracy by one or two percent.
One problem with the algorithm is that it requires features with more than two values to be encoded as binary in order to avoid the bias that entropic measures have toward features with many values. This can greatly increase the number of features in the original data, as well as introducing further dependencies. Furthermore, the meaning of the original attributes is obscured, making the output of algorithms such as C4.5 hard to interpret.
An Instance Based Approach to Feature Selection – RELIEF
Kira and Rendell (1992) describe an algorithm called RELIEF that uses instance based learning to assign a relevance weight to each feature. Each feature's weight reflects its ability to distinguish among the class values. Features are ranked by weight and those that exceed a user-specified threshold are selected to form the final subset. The algorithm works by randomly sampling instances from the training data. For each instance sampled, the nearest instance of the same class (nearest hit) and of the opposite class (nearest miss) is found. An attribute's weight is updated according to how well its values distinguish the sampled instance from its nearest hit and nearest miss. An attribute will receive a high weight if it differentiates between instances from different classes and has the same value for instances of the same class. Equation (5.3) shows the weight updating formula used by RELIEF:
W_X = W_X - \frac{\mathrm{diff}(X, R, H)^2}{m} + \frac{\mathrm{diff}(X, R, M)^2}{m}    (5.3)
where W_X is the weight for attribute X, R is a randomly sampled instance, H is the nearest hit, M is the nearest miss, and m is the number of randomly sampled instances.
The function diff calculates the difference between two instances for a given attribute. For nominal attributes it is defined as either 1 (the values are different) or 0 (the values are the same), while for continuous attributes the difference is the actual difference normalized to the interval [0, 1]. Dividing by m guarantees that all weights are in the interval [-1, 1].
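As an illustration only (not Kira and Rendell's implementation), a compact two-class RELIEF weight update might be sketched in Python as follows; the names relief, X, y and m are assumptions, features are assumed to be numeric and pre-scaled to [0, 1], and ties and missing values are ignored.

import numpy as np

def relief(X, y, m, rng=None):
    """Two-class RELIEF sketch: X is an (n, p) array of features scaled
    to [0, 1], y a binary label vector, m the number of sampled instances."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(m):
        r = rng.integers(n)                          # randomly sampled instance R
        dist = np.abs(X - X[r]).sum(axis=1)          # Manhattan distance to R
        dist[r] = np.inf                             # exclude R itself
        same = np.flatnonzero(y == y[r])
        other = np.flatnonzero(y != y[r])
        hit = same[np.argmin(dist[same])]            # nearest hit H
        miss = other[np.argmin(dist[other])]         # nearest miss M
        # Equation (5.3): penalize differing from H, reward differing from M
        w -= np.abs(X[r] - X[hit]) ** 2 / m
        w += np.abs(X[r] - X[miss]) ** 2 / m
    return w

Weights accumulate evidence over the m samples: attributes whose values differ mainly across class boundaries (the miss term) end up with large positive weights.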
RELIEF operates on two-class domains. Kononenko (1994) describes enhancements to RELIEF that enable it to cope with multi-class, noisy and incomplete domains. Kira and Rendell provide experimental evidence that shows RELIEF to be effective at identifying relevant features even when they interact (for example, in parity problems). Interacting features are those whose values are dependent on the values of other features and the class, and as such provide further information about the class. Redundant features, on the other hand, are those whose values are dependent on the values of other features irrespective of the class, and as such provide no further information about the class. However, RELIEF does not handle redundant features. The authors state: "If most of the given features are relevant to the concept, it (RELIEF) would select most of the given features even though only a small number of them are necessary for concept description."
Scherf and Brauer (1997) describe a similar instance based approach (EUBAFES) to assigning feature weights, developed independently of RELIEF. Like RELIEF, EUBAFES strives to reinforce similarities between instances of the same class while simultaneously decreasing similarities between instances of different classes. A gradient descent approach is employed to optimize feature weights with respect to this goal.

5.2.2 Feature Wrappers
Wrapper strategies for feature selection use an induction algorithm to estimate the merit of feature subsets. The rationale for wrapper approaches is that the induction method that will ultimately use the feature subset should provide a better estimate of accuracy than a separate measure that has an entirely different inductive bias (Langley and Sage, 1994).
Feature wrappers often achieve better results than filters because they are tuned to the specific interaction between an induction algorithm and its training data. However, they tend to be much slower than feature filters because they must repeatedly call the induction algorithm and must be re-run when a different induction algorithm is used.
Since the wrapper is a well-defined process, most of the variation in its application is due to the method used to estimate the off-sample accuracy of a target induction algorithm, the target induction algorithm itself, and the organization of the search. This section reviews work that has focused on the wrapper approach and methods to reduce its computational expense.
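The generic wrapper loop is easy to sketch. The following illustrative Python function assumes scikit-learn and NumPy arrays; the name forward_wrapper and the simple stopping rule are assumptions, not taken from any of the systems reviewed here. It performs a greedy forward search scored by k-fold cross validation.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_wrapper(X, y, estimator=None, cv=5):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion gives the best cross-validated accuracy estimate."""
    estimator = estimator or DecisionTreeClassifier(random_state=0)
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:          # stop when no added feature helps
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

Swapping the estimator, the search direction, or the accuracy-estimation method yields most of the wrapper variants discussed below.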
Wrappers for Decision Tree Learners
John, Kohavi, and Pfleger (1994) were the first to advocate the wrapper (Allen, 1974)
as a general framework for feature selection in machine learning. They present formal definitions for two degrees of feature relevance, and claim that the wrapper is able to discover relevant features. A feature X_i is said to be strongly relevant to the target concept(s) if the probability distribution of the class values, given the full feature set, changes when X_i is removed. A feature X_i is said to be weakly relevant if it is not strongly relevant and the probability distribution of the class values, given some subset S (containing X_i) of the full feature set, changes when X_i is removed. All features that are not strongly or weakly relevant are irrelevant.
Experiments were conducted on three artificial and three natural domains using ID3 and C4.5 (Quinlan, 1993; Quinlan, 1986) as the induction algorithms. Accuracy was estimated by using 25-fold cross validation on the training data; a disjoint test set was used for reporting final accuracies. Both forward selection and backward elimination search were used. With the exception of one artificial domain, results showed that feature selection did not significantly change ID3 or C4.5's generalization performance. The main effect of feature selection was to reduce the size of the trees. Like John et al., Caruana and Freitag (1994) test a number of greedy search methods with ID3 on two calendar scheduling domains. As well as backward elimination and forward selection they also test two variants of stepwise bi-directional search, one starting with all features, the other with none.
Results showed that although the bi-directional searches slightly outperformed the forward and backward searches, on the whole there was very little difference between the various search strategies except with respect to computation time. Feature selection was able to improve the performance of ID3 on both calendar scheduling domains.
Vafaie and De Jong (1995) and Cherkauer and Shavlik (1996) have both applied genetic search strategies in a wrapper framework for improving the performance of decision tree learners. Vafaie and De Jong (1995) describe a system that has two genetic algorithm driven modules: the first performs feature selection, and the second performs constructive induction (constructive induction is the process of creating new attributes by applying logical and mathematical operators to the original features (Michalski, 1983)). Both modules were able to significantly improve the performance of ID3 on a texture classification problem.
Cherkauer and Shavlik (1996) present an algorithm called SET-Gen which strives to improve the comprehensibility of decision trees as well as their accuracy. To achieve this, SET-Gen's genetic search uses a fitness function that is a linear combination of an accuracy term and a simplicity term:
Fitness(X) = \frac{3}{4} A + \frac{1}{4}\left(1 - \frac{S + F}{2}\right)    (5.4)
where X is a feature subset, A is the average cross-validation accuracy of C4.5, S is the average size of the trees produced by C4.5 (normalized by the number of training examples), and F is the number of features in the subset X (normalized by the total number of available features). Equation (5.4) ensures that the fittest population members are those feature subsets that lead C4.5 to induce small but accurate decision trees.
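As a worked example of Equation (5.4), whose exact form is reconstructed above, a hypothetical fitness computation might be written as follows; the function and parameter names are illustrative only.

def set_gen_fitness(accuracy, tree_size, n_selected, n_train, n_total):
    """Fitness in the spirit of Equation (5.4); inputs are assumed to be
    averages over the cross-validation folds used by the genetic search."""
    A = accuracy                   # average cross-validation accuracy of C4.5
    S = tree_size / n_train        # average tree size, normalized by training examples
    F = n_selected / n_total       # subset size, normalized by available features
    return 0.75 * A + 0.25 * (1.0 - (S + F) / 2.0)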
Wrappers for Instance-based Learning
The wrapper approach was proposed at approximately the same time as, and independently of, John et al. (1994) by Langley and Sage (1994) during their investigation of the simple nearest neighbor algorithm's sensitivity to irrelevant attributes. Scaling experiments showed that the nearest neighbor's sample complexity (the number of training examples needed to reach a given accuracy) increases exponentially with the number of irrelevant attributes present in the data (Aha et al., 1991; Langley and Sage, 1994). An algorithm called OBLIVION is presented which performs backward elimination of features using an oblivious decision tree as the induction algorithm (when all the original features are included in the tree, and given a number of assumptions at classification time, Langley and Sage note that the structure is functionally equivalent to the simple nearest neighbor; in fact, this is how it is implemented in OBLIVION). Experiments with OBLIVION using k-fold cross validation on several artificial domains showed that it was able to remove redundant features and learn faster than C4.5 on domains where features interact.
Moore and Lee (1994) take a similar approach to augmenting the nearest neighbor algorithm, but their system uses leave-one-out instead of k-fold cross-validation and concentrates on improving the prediction of numeric rather than discrete classes. Aha and Blankert (1994) also use leave-one-out cross validation, but pair it with a beam search (beam search is a limited version of best first search that only remembers a portion of the search path for use in backtracking) instead of hill climbing. Their results show that feature selection can improve the performance of IB1 (a nearest neighbor classifier) on a sparse (very few instances) cloud pattern domain with many features. Moore, Hill, and Johnson (1992) encompass not only feature selection in the wrapper process, but also the number of nearest neighbors used in prediction and the space of combination functions. Using leave-one-out cross validation they achieve significant improvement on several control problems involving the prediction of continuous classes.
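The leave-one-out estimate these nearest neighbor wrappers rely on is inexpensive to sketch. The function below is illustrative only; the names, the Euclidean metric and the 1-nearest-neighbor choice are assumptions.

import numpy as np

def loo_accuracy_1nn(X, y, feature_subset):
    """Leave-one-out accuracy of 1-nearest-neighbor restricted to feature_subset."""
    Xs = X[:, list(feature_subset)]
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                     # leave instance i out
        correct += int(y[int(np.argmin(d))] == y[i])
    return correct / len(y)

A wrapper then simply searches over candidate subsets, calling this estimate at each step.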
In a similar vein, Skalak (1994) combines feature selection and prototype selection into a single wrapper process using random mutation hill climbing as the search strategy. Experimental results showed significant improvement in accuracy for nearest neighbor on two natural domains and a drastic reduction in the algorithm's storage requirement (the number of instances retained during training).
Domingos (1997) describes a context sensitive wrapper approach to feature selection for instance based learners. The motivation for the approach is that there may be features that are relevant in only a restricted area of the instance space and irrelevant elsewhere, or relevant given only certain values of other features (weak interaction) and otherwise irrelevant. In either case, when features are evaluated globally (over the whole instance space), the irrelevant aspects of these sorts of features may overwhelm their useful aspects for instance based learners. This is true even when using backward search strategies with the wrapper. In the wrapper approach, backward search strategies are generally more effective than forward search strategies in domains with feature interactions. Because backward search typically begins with all the features, the removal of a strongly interacting feature is usually detected by decreased accuracy during cross validation.
Domingos presents an algorithm called RC which can detect and make use of context sensitive features. RC works by selecting a (potentially) different set of features for each instance in the training set. It does this by using a backward search strategy and cross validation to estimate accuracy. For each instance in the training set, RC finds its nearest neighbor of the same class and removes those features in which the two differ. The accuracy on the entire training dataset is then estimated by cross validation. If the accuracy has not degraded, the modified instance in question is accepted; otherwise the instance is restored to its original state and deactivated (no further feature selection is attempted for it). The feature selection process continues until all instances are inactive.
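A very rough sketch of this idea (not Domingos' implementation) is given below. The function estimate_accuracy is assumed to be supplied by the caller and to run cross validation with an instance based learner that respects per-instance feature masks; features are assumed to be numerically encoded, and the handling of continuous values and ties is deliberately ignored.

import numpy as np

def rc_sketch(X, y, estimate_accuracy):
    """Each training instance keeps its own boolean feature mask; masks are
    pruned one instance at a time and kept only if the cross-validated
    accuracy estimate does not degrade."""
    n, p = X.shape
    masks = np.ones((n, p), dtype=bool)       # active features per instance
    active = np.ones(n, dtype=bool)           # instances still being simplified
    baseline = estimate_accuracy(masks)
    while active.any():
        for i in np.flatnonzero(active):
            same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
            if same.size == 0:
                active[i] = False
                continue
            # nearest neighbor of the same class, under instance i's current mask
            d = np.abs(X[same][:, masks[i]] - X[i, masks[i]]).sum(axis=1)
            j = same[int(np.argmin(d))]
            trial = masks.copy()
            trial[i] &= (X[i] == X[j])        # drop features where i and j differ
            if (trial[i] == masks[i]).all():
                active[i] = False             # nothing left to remove for this instance
                continue
            score = estimate_accuracy(trial)
            if score >= baseline:             # accept the simplified instance
                masks, baseline = trial, score
            else:
                active[i] = False             # restore old mask and deactivate
    return masks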
Experiments on a selection of machine learning datasets showed that RC outperformed standard wrapper feature selectors using forward and backward search strategies with instance based learners. The effectiveness of the context sensitive approach was also shown on artificial domains engineered to exhibit restricted feature dependency. When features are globally relevant or irrelevant, RC has no advantage over standard wrapper feature selection. Furthermore, when few examples are available, or the data is noisy, standard wrapper approaches can detect globally irrelevant features more easily than RC.
Domingos also noted that wrappers that employ instance based learners (including RC) are unsuitable for use on databases containing many instances because they are quadratic in N (the number of instances).
Kohavi (1995) uses wrapper feature selection to explore the potential of decision table majority (DTM) classifiers. Appropriate data structures allow the use of fast incremental cross-validation with DTM classifiers. Experiments showed that DTM classifiers using appropriate feature subsets compared very favorably with sophisticated algorithms such as C4.5.
Wrappers for Bayes Classifiers
Due to the naive Bayes classifier’s assumption that, within each class, probability distributions for attributes are independent of each other Langley and Sage (1994) note that the classifier performance on domains with redundant features can be im-proved by removing such features A forward search strategy is employed to select features for use with na¨ıve Bayes, as opposed to the backward strategies that are used most often with decision tree algorithms and instance based learners The ra-tionale for a forward search is that it should immediately detect dependencies when harmful redundant attributes are added Experiments showed overall improvement and increased learning rate on three out of six natural domains, with no change on the remaining three
Pazzani (1995) combines feature selection and simple constructive induction in a wrapper framework for improving the performance of naive Bayes. Forward and backward hill climbing search strategies are compared. In the former case, the algorithm considers not only the addition of single features to the current subset, but also creating a new attribute by joining one of the as yet unselected features with each of the selected features in the subset. In the latter case, the algorithm considers both deleting individual features and replacing pairs of features with a joined feature. Results on a selection of machine learning datasets show that both approaches improve the performance of naive Bayes.
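The basic constructive step considered in this search, joining two nominal attributes into a single attribute whose values are their Cartesian product, can be illustrated as follows; the function name and example values are hypothetical.

def join_features(values_a, values_b):
    """Join two nominal attributes into one whose values are the
    Cartesian product of the originals (e.g. 'red' + 'small' -> 'red&small')."""
    return [f"{a}&{b}" for a, b in zip(values_a, values_b)]

# usage: the joined column replaces the original pair, so naive Bayes can model
# their interaction while still treating attributes as conditionally independent
color = ["red", "red", "blue", "blue"]
size = ["small", "large", "small", "large"]
print(join_features(color, size))   # ['red&small', 'red&large', 'blue&small', 'blue&large']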
The forward strategy does a better job at removing redundant attributes than the backward strategy. Because it starts with the full set of features and considers all possible pairwise joined features, the backward strategy is more effective at identifying attribute interactions than the forward strategy. Improvement for naive Bayes using wrapper-based feature selection is also reported in (Kohavi and Sommerfield, 1995; Kohavi and John, 1996).
Provan and Singh (1996) have applied the wrapper to select features from which to construct Bayesian networks. Their results showed that while feature selection did not improve accuracy over networks constructed from the full set of features, the networks created after feature selection were considerably smaller and faster to learn.
5.3 Variable Selection
This section aims to provide a survey of variable selection. Suppose Y is a variable of interest, and X_1, ..., X_p is a set of potential explanatory variables or predictors, all of which are vectors of n observations. The problem of variable selection, or subset selection as it is often called, arises when one wants to model the relationship between Y and a subset of X_1, ..., X_p, but there is uncertainty about which subset to use. Such a situation is particularly of interest when p is large and X_1, ..., X_p is thought to contain many redundant or irrelevant variables. The variable selection problem is most familiar in the linear regression context, where attention is restricted to normal linear models. Letting γ index the subsets of X_1, ..., X_p and letting q_γ be the size of the γ-th subset, the problem is to select and fit a model of the form:

Y = X_\gamma \beta_\gamma + \varepsilon
where X_γ is an n × q_γ matrix whose columns correspond to the γ-th subset, β_γ is a q_γ × 1 vector of regression coefficients and ε ~ N(0, σ²I). More generally, the variable selection problem is a special case of the model selection problem, where each model under consideration corresponds to a distinct subset of X_1, ..., X_p. Typically, a single model class is simply applied to all possible subsets.
5.3.1 Mallows Cp (Mallows, 1973)
This method minimizes the mean square error of prediction:
C_p = \frac{RSS_\gamma}{\hat{\sigma}^2_{FULL}} - n + 2 q_\gamma
where RSS_γ is the residual sum of squares for the γ-th model and \hat{\sigma}^2_{FULL} is the usual unbiased estimate of σ² based on the full model.
The goal is to get a model with minimum C_p. By using C_p one can reduce dimension by finding the minimal subset which has minimum C_p.
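Under the form of C_p given above (itself a reconstruction), a small NumPy sketch of subset scoring might look like the following; the predictor matrix X is assumed to already contain any intercept column, and the names are illustrative.

import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on the given columns."""
    beta, *_ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
    resid = y - X[:, list(cols)] @ beta
    return float(resid @ resid)

def mallows_cp(X, y, subset):
    n, p = X.shape
    sigma2_full = rss(X, y, range(p)) / (n - p)   # unbiased estimate from the full model
    return rss(X, y, subset) / sigma2_full - n + 2 * len(subset)

# usage sketch: choose the subset with minimum C_p among all non-empty subsets
# best = min((s for k in range(1, X.shape[1] + 1)
#             for s in combinations(range(X.shape[1]), k)),
#            key=lambda s: mallows_cp(X, y, s))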
5.3.2 AIC, BIC and F ratio
Two of the other most popular criteria, motivated from very different points of view, are AIC (the Akaike Information Criterion) and BIC (the Bayesian Information Criterion). Letting \hat{l}_\gamma denote the maximum log likelihood of the γ-th model, AIC selects the model which maximizes (\hat{l}_\gamma - q_\gamma), whereas BIC selects the model which maximizes (\hat{l}_\gamma - (\log n) q_\gamma / 2).
For the linear model, many of the popular selection criteria are special cases of a penalized sum-of-squares criterion, providing a unified framework for comparisons. Assuming σ² known to avoid complications, this general criterion selects the subset model that minimizes:

\frac{RSS_\gamma}{\sigma^2} + F q_\gamma

where F is a preset "dimensionality penalty". Intuitively, the above penalizes RSS_γ/σ² by F times q_γ, the dimension of the γ-th model. AIC and minimum C_p are essentially equivalent, corresponding to F = 2, and BIC is obtained by setting F = log n. By imposing a smaller penalty, AIC and minimum C_p will select larger models than BIC (unless n is very small).
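For the normal linear model, the two criteria as defined above can be computed directly. In this sketch (function and variable names are assumptions) the maximized log likelihood plugs in the maximum likelihood estimate RSS/n for σ², and q_γ counts only the regression coefficients, as in the text.

import numpy as np

def max_log_likelihood(X, y, cols):
    """Maximized Gaussian log likelihood of the linear model on the given columns."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
    rss = float(np.sum((y - X[:, list(cols)] @ beta) ** 2))
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

def aic_score(X, y, cols):
    return max_log_likelihood(X, y, cols) - len(cols)                       # maximize over subsets

def bic_score(X, y, cols):
    return max_log_likelihood(X, y, cols) - np.log(len(y)) * len(cols) / 2  # maximize over subsets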
5.3.3 Principal Component Analysis (PCA)
Principal component analysis (PCA) is the best, in the mean-square error sense, linear dimension reduction technique (Jackson, 1991; Jolliffe, 1986). Being based on the covariance matrix of the variables, it is a second-order method. In various fields, it is also known as the singular value decomposition (SVD), the Karhunen-Loeve transform, the Hotelling transform, and the empirical orthogonal function (EOF) method. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs) of the original variables with the largest variance. The first PC, s_1, is the linear combination with the largest variance. We have s_1 = x^T w_1, where the p-dimensional coefficient vector w_1 = (w_{1,1}, ..., w_{1,p})^T solves:

w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}(x^T w)    (5.8)

The second PC is the linear combination with the second largest variance and orthogonal to the first PC, and so on. There are as many PCs as the number of original variables. For many datasets, the first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of information.
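In practice the PCs are usually obtained from the singular value decomposition of the mean-centered data matrix; a short NumPy sketch is given below (the function name and return values are assumptions).

import numpy as np

def principal_components(X, k):
    """First k principal component directions, scores and explained-variance
    ratios, from the SVD of the mean-centered (n x p) data matrix X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                        # p x k loading vectors w_1, ..., w_k
    scores = Xc @ W                     # the PCs s_1, ..., s_k for each observation
    explained = (s ** 2) / np.sum(s ** 2)
    return W, scores, explained[:k]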
5.3.4 Factor Analysis (FA)
Like PCA, factor analysis (FA) is also a linear method, based on second-order data summaries. First suggested by psychologists, FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. Typical examples include variables defined as various test scores of individuals, as such scores are thought to be related to a common "intelligence" factor. The goal of FA is to uncover such relations, and thus it can be used to reduce the dimension of datasets following the factor model.
5.3.5 Projection Pursuit
Projection pursuit (PP) is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods. Given a projection index that defines the "interestingness" of a direction, PP looks for the directions that optimize that index. As the Gaussian distribution is the least interesting distribution (having the least structure), projection indices usually measure some aspect of non-Gaussianity. If, however, one uses the second-order maximum variance, subject to the constraint that the projections be orthogonal, as the projection index, PP yields the familiar PCA.

5.3.6 Advanced Methods for Variable Selection
Chizi and Maimon (2002) describe in their work some new methods for variable selection. These methods are based on simple algorithms and use known evaluators such as information gain, logistic regression coefficients and random selection. All the methods are presented with empirical results on benchmark datasets and with theoretical bounds on each method. A wider survey of variable selection, together with a decomposition of the problem of dimension reduction, can be found there.
In summary, feature selection is useful for many application domains, such as manufacturing (lr18, lr14), security (lr7, lr10) and medicine (lr2, lr9), and for many data mining techniques, such as decision trees (lr6, lr12, lr15), clustering (lr13, lr8), ensemble methods (lr1, lr4, lr5, lr16) and genetic algorithms (lr17, lr11).
References
Aha, D. W. and Blankert, R. L. Feature selection for case-based classification of cloud types. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pages 106–112, 1994.
Aha, D. W., Kibler, D., and Albert, M. K. Instance based learning algorithms. Machine Learning, 6: 37–66, 1991.
Allen, D. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16: 125–127, 1974.
Almuallim, H. and Dietterich, T. G. Efficient algorithms for identifying relevant features. In Proceedings of the Ninth Canadian Conference on Artificial Intelligence, pages 38–45. Morgan Kaufmann, 1992.
Almuallim, H. and Dietterich, T. G. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 547–552. MIT Press, 1991.
Arbel, R. and Rokach, L. Classifier evaluation under limited resources. Pattern Recognition Letters, 27(14): 1619–1631, 2006.
Averbuch, M., Karson, T., Ben-Ami, B., Maimon, O., and Rokach, L. Context-sensitive medical information retrieval. In The 11th World Congress on Medical Informatics (MEDINFO 2004), San Francisco, CA, September 2004, IOS Press, pages 282–286.
Blum, A. and Langley, P. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97: 245–271, 1997.
Cardie, C. Using decision trees to improve case-based learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.
Caruana, R. and Freitag, D. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Cherkauer, K. J. and Shavlik, J. W. Growing simpler decision trees to facilitate knowledge discovery. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996.
Chizi, B. and Maimon, O. On dimensionality reduction of high dimensional data sets. In Frontiers in Artificial Intelligence and Applications, IOS Press, pages 230–236, 2002.
Cohen, S., Rokach, L., and Maimon, O. Decision tree instance space decomposition with grouped gain-ratio. Information Sciences, 177(17): 3592–3612, 2007.
Domingos, P. Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, 11: 227–253, 1997.
Elder, J. F. and Pregibon, D. A statistical perspective on knowledge discovery in databases. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
George, E. and Foster, D. Empirical Bayes variable selection. Biometrika, 2000.
Hall, M. Correlation-based feature selection for machine learning. Ph.D. Thesis, Department of Computer Science, University of Waikato, 1999.
Holmes, G. and Nevill-Manning, C. G. Feature selection via the discovery of simple classification rules. In Proceedings of the Symposium on Intelligent Data Analysis, Baden-Baden, Germany, 1995.
Holte, R. C. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11: 63–91, 1993.
Jackson, J. A User's Guide to Principal Components. New York: John Wiley and Sons, 1991.
John, G. H., Kohavi, R., and Pfleger, K. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Jolliffe, I. Principal Component Analysis. Springer-Verlag, 1986.
Kira, K. and Rendell, L. A. A practical approach to feature selection. In Machine Learning: Proceedings of the Ninth International Conference, 1992.
Kohavi, R. and John, G. Wrappers for feature subset selection. Artificial Intelligence, special issue on relevance, 97(1–2): 273–324, 1996.
Kohavi, R. and Sommerfield, D. Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1995.
Kohavi, R. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford University, 1995.
Koller, D. and Sahami, M. Towards optimal feature selection. In Machine Learning: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, 1996.
Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, 1994.
Langley, P. Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance. AAAI Press, 1994.
Langley, P. and Sage, S. Scaling to domains with irrelevant features. In R. Greiner, editor, Computational Learning Theory and Natural Learning Systems, volume 4. MIT Press, 1994.
Liu, H. and Setiono, R. A probabilistic approach to feature selection: A filter solution. In Machine Learning: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, 1996.
Maimon, O. and Rokach, L. Data mining by attribute decomposition with semiconductors manufacturing case study. In Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pages 311–336, 2001.
Maimon, O. and Rokach, L. Improving supervised learning by feature decomposition. In Proceedings of the Second International Symposium on Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, Springer, pages 178–196, 2002.
Maimon, O. and Rokach, L. Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications. Series in Machine Perception and Artificial Intelligence, Vol. 61, World Scientific Publishing, ISBN 981-256-079-3, 2005.
Mallows, C. L. Some comments on Cp. Technometrics, 15: 661–676, 1973.
Michalski, R. S. A theory and methodology of inductive learning. Artificial Intelligence, 20(2): 111–161, 1983.
Moore, A. W. and Lee, M. S. Efficient algorithms for minimizing cross validation error. In Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
Moore, A. W., Hill, D. J., and Johnson, M. P. An empirical investigation of brute force to choose features, smoothers and function approximations. In S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and Natural Learning Systems, volume 3. MIT Press, 1992.
Moskovitch, R., Elovici, Y., and Rokach, L. Detection of unknown computer worms based on behavioral classification of the host. Computational Statistics and Data Analysis, 52(9): 4544–4566, 2008.
Pazzani, M. Searching for dependencies in Bayesian classifiers. In Proceedings of the Fifth International Workshop on AI and Statistics, 1995.
Pfahringer, B. Compression-based feature subset selection. In Proceedings of the IJCAI-95 Workshop on Data Engineering for Inductive Learning, pages 109–119, 1995.
Provan, G. M. and Singh, M. Learning Bayesian networks using feature selection. In D. Fisher and H. Lenz, editors, Learning from Data, Lecture Notes in Statistics, pages 291–300. Springer-Verlag, New York, 1996.
Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California, 1993.
Quinlan, J. R. Induction of decision trees. Machine Learning, 1: 81–106, 1986.
Rissanen, J. Modeling by shortest data description. Automatica, 14: 465–471, 1978.