Table 40.6. Benefits (US $) using Single Classifiers and Classifier Ensembles (Original Stream)

Chunk size   G0       G1 = E1   G2       E2       G4       E4       G8       E8
12000        201717   203211    197946   253473   211768   269290   215692   289129
6000         103763   98777     101176   121057   102447   138565   106576   143620
4000         69447    65024     68081    80996    69346    90815    70325    96153
3000         43312    41212     42917    59293    44977    67222    46139    71660
Cost-sensitive Learning
For cost-sensitive applications, we aim at maximizing benefits. In Figure 40.7(a), we compare the single classifier approach with the ensemble approach using the credit card transaction stream. The benefits are averaged from multiple runs with different chunk sizes (ranging from 3,000 to 12,000 transactions per chunk). Starting from K = 2, the advantage of the ensemble approach becomes obvious.
In Figure 40.7(b), we average the benefits of E_k and G_k (k = 2, ..., 8) for each fixed chunk size. The benefits increase as the chunk size does, since more fraudulent transactions are discovered in a larger chunk. Again, the ensemble approach outperforms the single classifier approach.
To study the impact of concept drifts of different magnitude, we derive data streams from the credit card transactions. The simulated stream is obtained by sorting the original 5 million transactions by their transaction amount. We perform the same test on the simulated stream, and the results are shown in Figures 40.7(c) and 40.7(d). Detailed results of the above tests are given in Tables 40.5 and 40.6.
40.5 Discussion and Related Work
Data stream processing has recently become a very important research domain. Much work has been done on modeling (Babcock et al., 2002), querying (Babu and Widom, 2001; Gao and Wang, 2002; Greenwald and Khanna, 2001), and mining data streams; for instance, several papers have been published on classification (Domingos and Hulten, 2000; Hulten et al., 2001; Street and Kim, 2001), regression analysis (Chen et al., 2002), and clustering (Guha et al., 2000).
Traditional Data Mining algorithms are challenged by two characteristic features of data streams: the infinite data flow and the drifting concepts. As methods that require multiple scans of the dataset (Shafer et al., 1996) cannot handle infinite data flows, several incremental algorithms (Gehrke et al., 1999; Domingos and Hulten, 2000) that refine models by continuously incorporating new data from the stream have been proposed. To handle drifting concepts, these methods are revised again so that the effects of old examples are eliminated at a certain rate. For an incremental decision tree classifier, this means we have to discard or re-grow subtrees, or build alternative subtrees under a node (Hulten et al., 2001). The resulting algorithm is often complicated, which indicates that substantial effort is required to adapt state-of-the-art learning methods to the infinite, concept-drifting streaming environment. Aside from this undesirable aspect, incremental methods are also hindered by their prediction accuracy. Since old examples are discarded at a fixed rate (whether or not they represent the changed concept), the learned model is supported only by the current snapshot, a relatively small amount of data. This usually results in larger prediction variances.
Classifier ensembles are increasingly gaining acceptance in the Data Mining community. Popular approaches to creating ensembles include changing the instances used for training through techniques such as Bagging (Bauer and Kohavi, 1999) and Boosting (Freund and Schapire, 1996). Classifier ensembles have several advantages over single model classifiers. First, classifier ensembles offer a significant improvement in prediction accuracy (Freund and Schapire, 1996; Tumer and Ghosh, 1996). Second, building a classifier ensemble is more efficient than building a single model, since most model construction algorithms have super-linear complexity. Third, the nature of classifier ensembles lends itself to scalable parallelization (Hall et al., 2000) and on-line classification of large databases. Previously, we used an averaging ensemble for scalable learning over very large datasets (Fan, Wang, Yu, and Stolfo, 2003). We showed that a model's performance can be estimated before it is completely learned (Fan, Wang, Yu, and Lo, 2002; Fan, Wang, Yu, and Lo, 2003). In this work, we use weighted ensemble classifiers on concept-drifting data streams. The approach combines multiple classifiers weighted by their expected prediction accuracy on the current test data. Compared with incremental models trained by data in the most recent window, our approach combines the talents of a set of experts based on their credibility and adjusts more gracefully to the underlying concept drifts. Also, we introduced the dynamic classification technique (Fan, Chu, Wang, and Yu, 2002) to the concept-drifting streaming environment, and our results show that it enables us to dynamically select a subset of classifiers in the ensemble for prediction without loss in accuracy.
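To make the weighting idea concrete, the sketch below shows one way a weighted ensemble prediction could be implemented. Each classifier is weighted by its advantage over random guessing on the most recent chunk, in the spirit of the accuracy-based weighting described above; the function names are illustrative, the classifiers are assumed to expose a scikit-learn-style predict_proba, and this is a hedged sketch rather than the exact formula used in the chapter's experiments.

```python
import numpy as np

def chunk_weights(classifiers, X_chunk, y_chunk):
    """Weight each member by its benefit over random guessing on the newest
    chunk: w_i = MSE_random - MSE_i, with negative weights clipped to zero."""
    p = np.mean(y_chunk)                           # class prior on the chunk
    mse_random = p * (1 - p) ** 2 + (1 - p) * p ** 2
    weights = []
    for clf in classifiers:
        prob = clf.predict_proba(X_chunk)[:, 1]    # assumed sklearn-style API
        mse = np.mean((y_chunk - prob) ** 2)
        weights.append(max(mse_random - mse, 0.0))
    return np.array(weights)

def ensemble_predict(classifiers, weights, X):
    """Weighted vote of the K ensemble members."""
    votes = np.zeros(len(X))
    for clf, w in zip(classifiers, weights):
        votes += w * clf.predict_proba(X)[:, 1]
    total = weights.sum()
    if total == 0:
        return np.zeros(len(X), dtype=int)
    return (votes / total >= 0.5).astype(int)
```

In use, the weights would be recomputed on every new chunk, so classifiers trained on outdated concepts automatically lose influence.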
Acknowledgements
We thank Wei Fan of IBM T. J. Watson Research Center for providing us with a revised version of the C4.5 decision tree classifier and for running some experiments.
References
Babcock B., Babu S., Datar M., Motwani R., and Widom J., Models and issues in data stream systems. In ACM Symposium on Principles of Database Systems (PODS), 2002.
Babu S. and Widom J., Continuous queries over data streams. SIGMOD Record, 30:109–120, 2001.
Bauer E. and Kohavi R., An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
Chen Y., Dong G., Han J., Wah B. W., and Wang B. W., Multi-dimensional regression analysis of time-series data streams. In Proc. of Very Large Data Bases (VLDB), Hong Kong, China, 2002.
Cohen W., Fast effective rule induction. In Int'l Conf. on Machine Learning (ICML), pages 115–123, 1995.
Domingos P. and Hulten G., Mining high-speed data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 71–80, Boston, MA, 2000. ACM Press.
Fan W., Wang H., Yu P., and Lo S., Progressive modeling. In Int'l Conf. on Data Mining (ICDM), 2002.
Fan W., Wang H., Yu P., and Lo S., Inductive learning in less than one sequential scan. In Int'l Joint Conf. on Artificial Intelligence, 2003.
Fan W., Wang H., Yu P., and Stolfo S., A framework for scalable cost-sensitive learning based on combining probabilities and benefits. In SIAM Int'l Conf. on Data Mining (SDM), 2002.
Fan W., Chu F., Wang H., and Yu P. S., Pruning and dynamic scheduling of cost-sensitive ensembles. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), 2002.
Freund Y. and Schapire R. E., Experiments with a new boosting algorithm. In Int'l Conf. on Machine Learning (ICML), pages 148–156, 1996.
Gao L. and Wang X., Continually evaluating similarity-based pattern queries on a streaming time series. In Int'l Conf. on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
Gehrke J., Ganti V., Ramakrishnan R., and Loh W., BOAT: Optimistic decision tree construction. In Int'l Conf. on Management of Data (SIGMOD), 1999.
Greenwald M. and Khanna S., Space-efficient online computation of quantile summaries. In Int'l Conf. on Management of Data (SIGMOD), pages 58–66, Santa Barbara, CA, May 2001.
Guha S., Mishra N., Motwani R., and O'Callaghan L., Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 359–366, 2000.
Hall L., Bowyer K., Kegelmeyer W., Moore T., and Chao C., Distributed learning on very large data sets. In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
Hulten G., Spencer L., and Domingos P., Mining time-changing data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 97–106, San Francisco, CA, 2001. ACM Press.
Quinlan J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition. Int'l Journal of Intelligent Systems Technologies and Applications, 4(1):57–78, 2008.
Shafer J. C., Agrawal R., and Mehta M., SPRINT: A scalable parallel classifier for Data Mining. In Proc. of Very Large Data Bases (VLDB), 1996.
Stolfo S., Fan W., Lee W., Prodromidis A., and Chan P., Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
Street W. N. and Kim Y. S., A streaming ensemble algorithm (SEA) for large-scale classification. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Tumer K. and Ghosh J., Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385–403, 1996.
Utgoff P. E., Incremental induction of decision trees. Machine Learning, 4:161–186, 1989.
Wang H., Fan W., Yu P. S., and Han J., Mining concept-drifting data streams using ensemble classifiers. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2003.
Mining High-Dimensional Data

Wei Wang¹ and Jiong Yang²

¹ Department of Computer Science, University of North Carolina at Chapel Hill
² Department of Electronic Engineering and Computer Science, Case Western Reserve University
Summary. With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges in mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in the high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We discuss how these methods deal with the challenges of high dimensionality.
Key words: High-dimensional Data Mining, frequent pattern, clustering high-dimensional data, classifying high-dimensional data
41.1 Introduction
The emergence of various new application domains, such as bioinformatics and e-commerce, underscores the need for analyzing high-dimensional data. In a gene expression microarray data set, there could be tens or hundreds of dimensions, each of which corresponds to an experimental condition. In a customer purchase behavior data set, there may be up to hundreds of thousands of merchandise items, each of which is mapped to a dimension. Researchers and practitioners are very eager to analyze these data sets.
Various Data Mining models have proven to be very successful in analyzing very large data sets. Among them, frequent patterns, clusters, and classifiers are three widely studied models used to represent, analyze, and summarize large data sets. In this chapter, we focus on state-of-the-art techniques for constructing these three Data Mining models on massive high-dimensional data sets.
41.2 Challenges
Before presenting algorithms for building individual Data Mining models, we first discuss two common challenges in analyzing high-dimensional data. The first one is the curse of dimensionality. The complexity of many existing Data Mining algorithms is exponential with respect to the number of dimensions. With increasing dimensionality, these algorithms soon become computationally intractable and therefore inapplicable in many real applications.
Secondly, the specificity of similarities between points in a high-dimensional space diminishes. It was proven in (Beyer et al., 1999) that, for any point in a high-dimensional space, the expected gap between the Euclidean distance to its closest neighbor and that to the farthest point shrinks as the dimensionality grows. This phenomenon may render many Data Mining tasks (e.g., clustering) ineffective and fragile because the model becomes vulnerable to the presence of noise. In the remainder of this chapter, we present several state-of-the-art algorithms for mining high-dimensional data sets.
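A quick simulation illustrates this effect. The snippet below is an illustrative sketch (not part of the original chapter) that draws random points from a unit hypercube and reports how the relative gap between the nearest and farthest distances shrinks as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    data = rng.random((1000, dim))      # 1000 uniform points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    d_min, d_max = dists.min(), dists.max()
    # Relative contrast (d_max - d_min) / d_min tends toward 0 as dim grows.
    print(f"dim={dim:5d}  relative contrast={(d_max - d_min) / d_min:.3f}")
```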
41.3 Frequent Pattern
The frequent pattern is a useful model for extracting salient features of the data. It was originally proposed for analyzing market basket data (Agrawal, 1994). A market basket data set is typically represented as a set of transactions, where each transaction contains a set of items from a finite vocabulary. In principle, we can represent the data as a matrix in which each row represents a transaction and each column represents an item. The goal is to find the collection of itemsets appearing in a large number of transactions, as defined by a support threshold t. Most algorithms for mining frequent patterns utilize the Apriori property, stated as follows: if an itemset A is frequent (i.e., present in at least t transactions), then every subset of A must be frequent; conversely, if an itemset A is infrequent (i.e., present in fewer than t transactions), then any superset of A is also infrequent. This property is the basis of all level-wise search algorithms.
The general procedure consists of a series of iterations, beginning with counting item occurrences and identifying the set of frequent items (or equivalently, frequent 1-itemsets). During each subsequent iteration, candidates for frequent k-itemsets are proposed from the frequent (k-1)-itemsets using the Apriori property. These candidates are then validated by explicitly counting their actual occurrences. The value of k is incremented before the next iteration starts. The process terminates when no more frequent itemsets can be generated. We often refer to this level-wise approach as the breadth-first approach because it evaluates the itemsets residing at the same depth in the lattice formed by imposing the partial order of the subset-superset relationship between itemsets.
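The following sketch runs this level-wise procedure on a toy transaction list. It is a simplified illustration of the Apriori idea, not an optimized implementation, and the threshold t counts absolute support as in the description above.

```python
from itertools import combinations

def apriori(transactions, t):
    """Level-wise frequent itemset mining: frequent (k-1)-itemsets generate
    candidate k-itemsets, which are validated by explicit counting."""
    transactions = [frozenset(tx) for tx in transactions]
    items = {i for tx in transactions for i in tx}
    # Frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in tx for tx in transactions) >= t}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Validation by counting actual occurrences.
        current = {c for c in candidates
                   if sum(c <= tx for tx in transactions) >= t}
        frequent |= current
        k += 1
    return frequent

# Example: itemsets appearing in at least 2 of the 4 transactions.
txs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}]
print(sorted(tuple(sorted(s)) for s in apriori(txs, t=2)))
```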
It is a well-known problem that the full set of frequent patterns contains significant redundant information, and consequently the number of frequent patterns is often too large. To address this issue, Pasquier et al. (1999) proposed mining a selective subset of frequent patterns, called closed frequent patterns. A pattern is considered closed if none of its immediate superpatterns has the same number of occurrences. The CLOSET algorithm (Pei et al., 2000) was proposed to expedite the mining of closed frequent patterns. CLOSET uses a novel frequent pattern tree (FP-tree) structure as a compact representation to organize the data set. It performs a depth-first search; that is, after discovering a frequent itemset A, it searches for superpatterns of A before checking A's siblings.
A more recent algorithm for mining frequent closed patterns is CHARM (Zaki and Hsiao, 2002). Similar to CLOSET, CHARM searches for patterns in a depth-first manner. The difference between CHARM and CLOSET is that CHARM stores the data set in a vertical format, where a list of row IDs is maintained for each dimension. These row ID lists are then merged during a "column enumeration" procedure that generates row ID lists for other nodes in the enumeration tree. In addition, a technique called diffset is used to reduce the length of the row ID lists as well as the computational complexity of merging them.
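The vertical layout can be illustrated in a few lines. This is a hedged sketch of the row-ID-list (tid-list) representation that CHARM-style algorithms build on, not CHARM itself, and the diffset idea is shown only in its simplest form: storing the difference between a parent's list and a child's list instead of the child's full list.

```python
from collections import defaultdict

transactions = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}, 4: {"b"}}

# Vertical format: one row-ID list per item.
tidlists = defaultdict(set)
for rid, items in transactions.items():
    for item in items:
        tidlists[item].add(rid)

# Support of an itemset is the size of the intersection of its row-ID lists.
support_ab = len(tidlists["a"] & tidlists["b"])              # -> 2

# Diffset idea: store d(PX) = tids(P) - tids(PX) instead of tids(PX);
# then support(PX) = support(P) - |d(PX)|, and the stored sets stay small.
diff_ab = tidlists["a"] - tidlists["b"]
support_ab_from_diffset = len(tidlists["a"]) - len(diff_ab)  # -> 2
```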
All of the above algorithms can find frequent closed patterns when the dimensionality is low to moderate. When the number of dimensions is very high, e.g., greater than 100, the efficiency of these algorithms can be significantly impacted. CARPENTER (Pan et al., 2003) was therefore proposed to solve this problem. It first transposes the matrix representing the data set. Next, CARPENTER performs a depth-first, row-wise enumeration on the transposed matrix. It has been shown that this algorithm can greatly reduce the computation time, especially when the dimensionality is high.
41.4 Clustering
Clustering is a widely adopted Data Mining model that partitions data points into a set of groups, each of which is called a cluster. A data point has a shorter distance to points within its cluster than to points outside the cluster. In a high-dimensional space, for any point, its distance to the closest point and its distance to the farthest point tend to be similar. This phenomenon may render the clustering result sensitive to any small perturbation of the data due to noise and make the exercise of clustering useless. To solve this problem, Agrawal et al. proposed the subspace clustering model (Agrawal et al., 1998). A subspace cluster consists of a subset of objects and a subset of dimensions such that the distance among these objects is small within the given set of dimensions. The CLIQUE algorithm (Agrawal et al., 1998) was proposed to find subspace clusters.
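As a rough illustration of the grid-based idea behind CLIQUE, the sketch below partitions each dimension into equal-width intervals and reports the one-dimensional dense units; in the full algorithm such dense units are combined bottom-up, dimension by dimension, using the same Apriori-style candidate generation shown earlier. The function name, interval count, and density threshold here are illustrative assumptions, not CLIQUE's actual parameters.

```python
import numpy as np

def dense_units_1d(data, n_intervals=10, density_threshold=0.05):
    """Return, per dimension, the grid intervals whose fraction of points
    exceeds the density threshold -- the 1-D seeds of subspace clusters."""
    n, d = data.shape
    dense = {}
    for dim in range(d):
        # Equal-width partition of this dimension into n_intervals bins.
        counts, edges = np.histogram(data[:, dim], bins=n_intervals)
        dense[dim] = [(edges[i], edges[i + 1])
                      for i, c in enumerate(counts) if c / n >= density_threshold]
    return dense
```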
In many applications, users are more interested in objects that exhibit a consistent trend (rather than points having similar values) within a subset of dimensions. One such example is the bicluster model (Cheng and Church, 2000) proposed for analyzing gene expression profiles. A bicluster is a subset of objects (U) and a subset of dimensions (D) such that the objects in U have the same trend (i.e., fluctuate simultaneously) across the dimensions in D. This is particularly useful in analyzing gene expression levels in a microarray experiment, since the expression levels of some genes may be inflated or deflated systematically in some experiments. Thus, the absolute value is not as important as the trend. If two genes have similar trends across a large set of experiments, they are likely to be co-regulated. In the bicluster model, the mean squared error residue is used to qualify a bicluster. Cheng and Church (2000) used a heuristic randomized algorithm to find biclusters. It consists of a series of iterations, each of which locates one bicluster. To prevent the same bicluster from being reported again in subsequent iterations, each time a bicluster is found, its values are replaced by uniform noise before the next iteration starts. This procedure continues until a desired number of biclusters is discovered.
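For reference, the mean squared (error) residue of a submatrix can be written as H(U, D) = (1 / (|U||D|)) Σ_{u∈U, d∈D} (a_{ud} − a_{uD} − a_{Ud} + a_{UD})², where a_{uD}, a_{Ud}, and a_{UD} are the row, column, and overall means of the submatrix. The short function below computes this score; it is an illustrative sketch of the measure itself, not the Cheng and Church search procedure.

```python
import numpy as np

def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(U, D) of the bicluster defined by rows x cols."""
    sub = matrix[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)   # a_uD
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ud
    overall = sub.mean()                          # a_UD
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())
```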
Although the bicluster model and algorithm have been used in several applications in bioinformatics, they have two major drawbacks: (1) the mean squared error residue may not be the best measure to qualify a bicluster, since a large cluster may have a small mean squared error residue even if it includes a small number of objects whose trends are vastly different in the selected dimensions; and (2) the heuristic algorithm may be misled by the noise artificially injected after each iteration and hence may not discover overlapping clusters properly. To solve these two problems, the authors of (Wang et al., 2002) proposed the p-cluster model. A p-cluster consists of a subset of objects U and a subset of dimensions D where, for each pair of objects u1 and u2 in U and each pair of dimensions d1 and d2 in D, the change of u1 from d1 to d2 should be similar to that of u2 from d1 to d2. A threshold is used to evaluate the dissimilarity between two objects on two dimensions. Given a subset of objects and a subset of dimensions, if the dissimilarity between every pair of objects on every pair of dimensions is less than the threshold, then these objects constitute a p-cluster in the given dimensions. A novel deterministic algorithm is developed in (Wang et al., 2002) to find all maximal p-clusters; it utilizes the Apriori property that holds for p-clusters.
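The dissimilarity used in the p-cluster model can be stated concretely: for objects u1, u2 and dimensions d1, d2, the score is |(v(u1, d1) − v(u1, d2)) − (v(u2, d1) − v(u2, d2))|, and a set of objects and dimensions forms a p-cluster if this score is within the threshold for every such pair of pairs. The brute-force check below is an illustrative sketch of this definition only, not the deterministic mining algorithm of Wang et al. (2002).

```python
from itertools import combinations

def is_pcluster(matrix, objects, dims, delta):
    """True if every pair of objects changes consistently (within delta)
    across every pair of the selected dimensions."""
    for u1, u2 in combinations(objects, 2):
        for d1, d2 in combinations(dims, 2):
            change_u1 = matrix[u1, d1] - matrix[u1, d2]
            change_u2 = matrix[u2, d1] - matrix[u2, d2]
            if abs(change_u1 - change_u2) > delta:
                return False
    return True
```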
41.5 Classification
Classification is also a very powerful data analysis tool. In a classification problem, the dimensions of an object can be divided into two types: one dimension records the class type of the object, and the remaining dimensions are attributes. The goal of classification is to build a model that captures the intrinsic associations between the class type and the attributes, so that an (unknown) class type can be accurately predicted from the attribute values. For this purpose, the data is usually divided into a training set and a test set, where the training set is used to build the classifier, which is then validated on the test set. Several models have been developed for classifying high-dimensional data, e.g., naïve Bayesian, neural networks, decision trees (Mitchell, 1997), SVMs, rule-based classifiers, and so on.
The support vector machine (SVM) (Vapnik, 1998) is one of the more recently developed classification models. The success of SVMs in practice stems from their solid mathematical foundation, which conveys the following two salient properties: (1) the classification boundary functions of SVMs maximize the margin, which equivalently optimizes the generalization performance given a training data set; (2) SVMs handle nonlinear classification efficiently using the kernel trick, which implicitly transforms the input space into a higher-dimensional feature space. However, SVMs suffer from two problems. First, the complexity of training an SVM is at least O(N²), where N is the number of objects in the training data set. This can be too costly when the training data set is large. Second, since an SVM essentially draws a hyperplane in a transformed high-dimensional space, it is very difficult to identify the principal (original) dimensions that are most responsible for the classification.
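As a usage illustration (assuming the scikit-learn library, which is not mentioned in the chapter), a kernel SVM can be trained in a few lines; the RBF kernel implicitly maps the input into a higher-dimensional feature space, which is the kernel trick referred to above.

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic high-dimensional data: 500 objects, 200 attributes (illustrative).
X, y = make_classification(n_samples=500, n_features=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)     # nonlinear boundary via the kernel trick
clf.fit(X_train, y_train)          # training cost grows superlinearly with N
print("test accuracy:", clf.score(X_test, y_test))
```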
Rule-based classifiers (Liu et al., 2000) offer some potential to address the above two problems. A rule-based classifier consists of a set of rules of the following form: A_1[l_1, u_1] ∩ A_2[l_2, u_2] ∩ ··· ∩ A_m[l_m, u_m] → C, where A_i[l_i, u_i] denotes that the value of attribute A_i falls in the range [l_i, u_i], and C is the class type. The rule can be interpreted as follows: if an object's attribute values fall in the ranges on the left-hand side, then its class type is likely to be C (with some high probability). Each rule is also associated with a confidence level that depicts the probability that the rule holds. When an object satisfies several rules, either the rule with the highest confidence (e.g., CBA (Liu et al., 2000)) or a weighted vote over all valid rules (e.g., CPAR (Yin and Han, 2003)) may be used for class prediction. However, neither CBA nor CPAR is targeted at high-dimensional data. An algorithm called FARMER (Cong et al., 2004) was proposed to generate rule-based classifiers for high-dimensional data sets. It first quantizes the attributes into a set of bins; each bin is subsequently treated as an item. FARMER then generates the closed frequent itemsets using a method similar to CARPENTER. These closed frequent itemsets are the basis for generating rules. Since the dimensionality is high, the number of possible rules in the classifier can be very large. FARMER finally organizes all rules into compact rule groups.
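To make the prediction step concrete, the sketch below applies a small set of interval rules to an object and combines them either by picking the highest-confidence match (CBA-style) or by averaging confidences per class (in the spirit of CPAR's weighted voting). The rule representation, thresholds, and helper names are illustrative assumptions, not taken from the cited systems.

```python
from collections import defaultdict

# Each rule: ({attribute_index: (low, high), ...}, class_label, confidence)
rules = [
    ({0: (0.0, 0.5), 2: (1.0, 3.0)}, "A", 0.90),
    ({0: (0.4, 1.0)},                "B", 0.75),
    ({1: (2.0, 5.0)},                "A", 0.60),
]

def matches(conditions, x):
    """An object satisfies a rule if every conditioned attribute is in range."""
    return all(lo <= x[i] <= hi for i, (lo, hi) in conditions.items())

def predict(x, mode="best"):
    fired = [(cls, conf) for cond, cls, conf in rules if matches(cond, x)]
    if not fired:
        return None                      # fall back to a default class in practice
    if mode == "best":                   # CBA-style: highest-confidence rule wins
        return max(fired, key=lambda r: r[1])[0]
    scores = defaultdict(list)           # CPAR-style: average confidence per class
    for cls, conf in fired:
        scores[cls].append(conf)
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))

print(predict([0.45, 3.0, 2.0], mode="best"))   # -> "A"
```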
References
Agrawal R., Gehrke J., Gunopulos D., and Raghavan P., Automatic subspace clustering of high dimensional data for Data Mining applications. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Seattle, WA, pp. 94-105, 1998.
Agrawal R. and Srikant R., Fast algorithms for mining association rules in large databases. In Proc. of the 20th VLDB Conf., pages 487-499, 1994.
Beyer K. S., Goldstein J., Ramakrishnan R., and Shaft U., When is "nearest neighbor" meaningful? In Proc. of the 7th Int'l Conf. on Database Theory (ICDT'99), pp. 217-235, Jerusalem, Israel, 1999.
Cheng Y. and Church G., Biclustering of expression data. In Proc. of the Eighth Int'l Conf. on Intelligent Systems for Molecular Biology, pp. 93-103, San Diego, CA, August 2000.
Cong G., Tung A. K. H., Xu X., Pan F., and Yang J., FARMER: Finding interesting rule groups in microarray datasets. In Proc. of the 23rd ACM SIGMOD Int'l Conf. on Management of Data, 2004.
Liu B., Ma Y., and Wong C. K., Improving an association rule based classifier. In Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery, pp. 504-509, September 13-16, 2000.
Mitchell T., Machine Learning. WCB McGraw Hill, 1997.
Pan F., Cong G., Tung A. K. H., Yang J., and Zaki M. J., CARPENTER: Finding closed patterns in long biological data sets. In Proc. of the ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2003.
Pasquier N., Bastide Y., Taouil R., and Lakhal L., Discovering frequent closed itemsets for association rules. In Beeri C. and Buneman P., eds., Proc. of the 7th Int'l Conf. on Database Theory (ICDT'99), Jerusalem, Israel, Volume 1540 of Lecture Notes in Computer Science, pp. 398-416, Springer-Verlag, January 1999.
Pei J., Han J., and Mao R., CLOSET: An efficient algorithm for mining frequent closed itemsets. In Gunopulos D. and Rastogi R., eds., ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
Vapnik V. N., Statistical Learning Theory. John Wiley and Sons, 1998.
Wang H., Wang W., Yang J., and Yu P., Clustering by pattern similarity in large data sets. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 394-405, 2002.
Yin X. and Han J., CPAR: Classification based on predictive association rules. In Proc. of the SIAM Int'l Conf. on Data Mining, San Francisco, CA, pp. 331-335, 2003.
Zaki M. J. and Hsiao C., CHARM: An efficient algorithm for closed itemset mining. In Proc. of the Second SIAM Int'l Conf. on Data Mining, Arlington, VA, 2002. SIAM.
Text Mining and Information Extraction

Moty Ben-Dov¹ and Ronen Feldman²
1 MDX University, London
2 Hebrew University, Israel
Summary. Text mining is the automatic discovery of new, previously unknown information by the automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we define text mining and describe the three main approaches for performing information extraction. In addition, we describe how we can visually display and analyze the outcome of the information extraction process.
Key words: text mining, content mining, structure mining, text classification, information extraction, rule-based systems
42.1 Introduction
The information age has made it easy for us to store large amounts of text. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of information available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes; so-called "push" technology makes the problem even worse by constantly reminding us that we are failing to track news, events, and trends everywhere. We experience information overload, and miss important patterns and relationships even as they unfold before us. As the old adage goes, "we can't see the forest for the trees."
Text-mining (TM), also known as knowledge discovery from text (KDT), refers to the process of extracting interesting patterns from very large text databases for the purpose of discovering knowledge. Text-mining applies the same analytical functions as data-mining but also applies analytic functions from natural language (NL) and information retrieval (IR) techniques (Dorre et al., 1999).
Text-mining tools are used for: