Table 40.6. Benefits (US $) using Single Classifiers and Classifier Ensembles (Original Stream)

Chunk size   G0       G1 = E1   G2       E2       G4       E4       G8       E8
12000        201717   203211    197946   253473   211768   269290   215692   289129
6000         103763   98777     101176   121057   102447   138565   106576   143620
4000         69447    65024     68081    80996    69346    90815    70325    96153
3000         43312    41212     42917    59293    44977    67222    46139    71660
Cost-sensitive Learning
For cost-sensitive applications, we aim at maximizing benefits. In Figure 40.7(a), we compare the single classifier approach with the ensemble approach using the credit card transaction stream. The benefits are averaged from multiple runs with different chunk sizes (ranging from 3,000 to 12,000 transactions per chunk). Starting from K = 2, the advantage of the ensemble approach becomes obvious.
In Figure 40.7(b), we average the benefits of E_k and G_k (k = 2, ..., 8) for each fixed chunk size. The benefits increase as the chunk size does, since more fraudulent transactions are discovered in a larger chunk. Again, the ensemble approach outperforms the single classifier approach.
To study the impact of concept drifts of different magnitude, we derive data streams from the credit card transactions. The simulated stream is obtained by sorting the original 5 million transactions by their transaction amount. We perform the same test on the simulated stream, and the results are shown in Figures 40.7(c) and 40.7(d). Detailed results of the above tests are given in Tables 40.5 and 40.6.
40.5 Discussion and Related Work
Data stream processing has recently become a very important research domain. Much work has been done on modeling (Babcock et al., 2002), querying (Babu and Widom, 2001; Gao and Wang, 2002; Greenwald and Khanna, 2001), and mining data streams; for instance, several papers have been published on classification (Domingos and Hulten, 2000; Hulten et al., 2001; Street and Kim, 2001), regression analysis (Chen et al., 2002), and clustering (Guha et al., 2000).
Traditional Data Mining algorithms are challenged by two characteristic features of data streams: the infinite data flow and the drifting concepts. As methods that require multiple scans of the dataset (Shafer et al., 1996) cannot handle infinite data flows, several incremental algorithms (Gehrke et al., 1999; Domingos and Hulten, 2000) that refine models by continuously incorporating new data from the stream have been proposed. To handle drifting concepts, these methods are revised again so that the effects of old examples are eliminated at a certain rate. For an incremental decision tree classifier, this means we have to discard or re-grow subtrees, or build alternative subtrees under a node (Hulten et al., 2001). The resulting algorithm is often complicated, which indicates that substantial effort is required to adapt state-of-the-art learning methods to the infinite, concept-drifting streaming environment. Aside from this undesirable aspect, incremental methods are also hindered by their prediction accuracy. Since old examples are discarded at a fixed rate (whether or not they represent the changed concept), the learned model is supported only by the current snapshot, a relatively small amount of data. This usually results in larger prediction variances.
Classifier ensembles are increasingly gaining acceptance in the Data Mining community. Popular approaches to creating ensembles include changing the instances used for training through techniques such as Bagging (Bauer and Kohavi, 1999) and Boosting (Freund and Schapire, 1996). Classifier ensembles have several advantages over single model classifiers. First, classifier ensembles offer a significant improvement in prediction accuracy (Freund and Schapire, 1996; Tumer and Ghosh, 1996). Second, building a classifier ensemble is more efficient than building a single model, since most model construction algorithms have super-linear complexity. Third, the nature of classifier ensembles lends itself to scalable parallelization (Hall et al., 2000) and on-line classification of large databases. Previously, we used an averaging ensemble for scalable learning over very large datasets (Fan, Wang, Yu, and Stolfo, 2003). We showed that a model's performance can be estimated before it is completely learned (Fan, Wang, Yu, and Lo, 2002; Fan, Wang, Yu, and Lo, 2003). In this work, we use weighted ensemble classifiers on concept-drifting data streams. The approach combines multiple classifiers weighted by their expected prediction accuracy on the current test data. Compared with incremental models trained by data in the most recent window, our approach combines the talents of a set of experts based on their credibility and adjusts more gracefully to the underlying concept drifts. Also, we introduced the dynamic classification technique (Fan, Chu, Wang, and Yu, 2002) to the concept-drifting streaming environment, and our results show that it enables us to dynamically select a subset of classifiers in the ensemble for prediction without loss in accuracy.
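To make the weighting idea concrete, the sketch below shows one way a weighted ensemble prediction could be implemented. Each classifier is weighted by its advantage over random guessing on the most recent chunk, in the spirit of the accuracy-based weighting described above; the function names are illustrative, the classifiers are assumed to expose a scikit-learn-style predict_proba, and this is a hedged sketch rather than the exact formula used in the chapter's experiments.

```python
import numpy as np

def chunk_weights(classifiers, X_chunk, y_chunk):
    """Weight each member by its benefit over random guessing on the newest
    chunk: w_i = MSE_random - MSE_i, with negative weights clipped to zero."""
    p = np.mean(y_chunk)                           # class prior on the chunk
    mse_random = p * (1 - p) ** 2 + (1 - p) * p ** 2
    weights = []
    for clf in classifiers:
        prob = clf.predict_proba(X_chunk)[:, 1]    # assumed sklearn-style API
        mse = np.mean((y_chunk - prob) ** 2)
        weights.append(max(mse_random - mse, 0.0))
    return np.array(weights)

def ensemble_predict(classifiers, weights, X):
    """Weighted vote of the K ensemble members."""
    votes = np.zeros(len(X))
    for clf, w in zip(classifiers, weights):
        votes += w * clf.predict_proba(X)[:, 1]
    total = weights.sum()
    if total == 0:
        return np.zeros(len(X), dtype=int)
    return (votes / total >= 0.5).astype(int)
```

In use, the weights would be recomputed on every new chunk, so classifiers trained on outdated concepts automatically lose influence.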
Acknowledgements
We thank Wei Fan of IBM T. J. Watson Research Center for providing us with a revised version of the C4.5 decision tree classifier and for running some experiments.
References
Babcock B., Babu S., Datar M., Motwani R., and Widom J., Models and issues in data stream systems. In ACM Symposium on Principles of Database Systems (PODS), 2002.
Babu S. and Widom J., Continuous queries over data streams. SIGMOD Record, 30:109–120, 2001.
Bauer E. and Kohavi R., An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
Chen Y., Dong G., Han J., Wah B. W., and Wang B. W., Multi-dimensional regression analysis of time-series data streams. In Proc. of Very Large Data Bases (VLDB), Hong Kong, China, 2002.
Cohen W., Fast effective rule induction. In Int'l Conf. on Machine Learning (ICML), pages 115–123, 1995.
Domingos P. and Hulten G., Mining high-speed data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 71–80, Boston, MA, 2000. ACM Press.
Fan W., Wang H., Yu P., and Lo S., Progressive modeling. In Int'l Conf. on Data Mining (ICDM), 2002.
Fan W., Wang H., Yu P., and Lo S., Inductive learning in less than one sequential scan. In Int'l Joint Conf. on Artificial Intelligence, 2003.
Fan W., Wang H., Yu P., and Stolfo S., A framework for scalable cost-sensitive learning based on combining probabilities and benefits. In SIAM Int'l Conf. on Data Mining (SDM), 2002.
Fan W., Chu F., Wang H., and Yu P. S., Pruning and dynamic scheduling of cost-sensitive ensembles. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), 2002.
Freund Y. and Schapire R. E., Experiments with a new boosting algorithm. In Int'l Conf. on Machine Learning (ICML), pages 148–156, 1996.
Gao L. and Wang X., Continually evaluating similarity-based pattern queries on a streaming time series. In Int'l Conf. on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
Gehrke J., Ganti V., Ramakrishnan R., and Loh W., BOAT: Optimistic decision tree construction. In Int'l Conf. on Management of Data (SIGMOD), 1999.
Greenwald M. and Khanna S., Space-efficient online computation of quantile summaries. In Int'l Conf. on Management of Data (SIGMOD), pages 58–66, Santa Barbara, CA, May 2001.
Guha S., Mishra N., Motwani R., and O'Callaghan L., Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 359–366, 2000.
Hall L., Bowyer K., Kegelmeyer W., Moore T., and Chao C., Distributed learning on very large data sets. In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
Hulten G., Spencer L., and Domingos P., Mining time-changing data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 97–106, San Francisco, CA, 2001. ACM Press.
Quinlan J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition. Int'l Journal of Intelligent Systems Technologies and Applications, 4(1):57–78, 2008.
Shafer J. C., Agrawal R., and Mehta M., SPRINT: A scalable parallel classifier for Data Mining. In Proc. of Very Large Data Bases (VLDB), 1996.
Stolfo S., Fan W., Lee W., Prodromidis A., and Chan P., Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
Street W. N. and Kim Y. S., A streaming ensemble algorithm (SEA) for large-scale classification. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Tumer K. and Ghosh J., Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385–403, 1996.
Utgoff P. E., Incremental induction of decision trees. Machine Learning, 4:161–186, 1989.
Wang H., Fan W., Yu P. S., and Han J., Mining concept-drifting data streams using ensemble classifiers. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2003.
Mining High-Dimensional Data

Wei Wang¹ and Jiong Yang²

¹ Department of Computer Science, University of North Carolina at Chapel Hill
² Department of Electronic Engineering and Computer Science, Case Western Reserve University
Summary. With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges in mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in the high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We discuss how these methods deal with the challenges of high dimensionality.
Key words: High-dimensional Data Mining, frequent pattern, clustering high-dimensional data, classifying high-dimensional data
41.1 Introduction
The emergence of various new application domains, such as bioinformatics and e-commerce, underscores the need for analyzing high-dimensional data. In a gene expression microarray data set, there could be tens or hundreds of dimensions, each of which corresponds to an experimental condition. In a customer purchase behavior data set, there may be up to hundreds of thousands of merchandise items, each of which is mapped to a dimension. Researchers and practitioners are very eager to analyze these data sets.
Various Data Mining models have proven to be very successful in analyzing very large data sets. Among them, frequent patterns, clusters, and classifiers are three widely studied models used to represent, analyze, and summarize large data sets. In this chapter, we focus on state-of-the-art techniques for constructing these three Data Mining models on massive high-dimensional data sets.
41.2 Challenges
Before presenting algorithms for building individual Data Mining models, we first discuss two common challenges in analyzing high-dimensional data. The first one is the curse of dimensionality. The complexity of many existing Data Mining algorithms is exponential with respect to the number of dimensions. With increasing dimensionality, these algorithms soon become computationally intractable and therefore inapplicable in many real applications.
Secondly, the specificity of similarities between points in a high-dimensional space diminishes. It was proven in (Beyer et al., 1999) that, for any point in a high-dimensional space, the expected gap between the Euclidean distance to its closest neighbor and that to the farthest point shrinks as the dimensionality grows. This phenomenon may render many Data Mining tasks (e.g., clustering) ineffective and fragile because the model becomes vulnerable to the presence of noise. In the remainder of this chapter, we present several state-of-the-art algorithms for mining high-dimensional data sets.
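A quick simulation illustrates this effect. The snippet below is an illustrative sketch (not part of the original chapter) that draws random points from a unit hypercube and reports how the relative gap between the nearest and farthest distances shrinks as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    data = rng.random((1000, dim))      # 1000 uniform points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    d_min, d_max = dists.min(), dists.max()
    # Relative contrast (d_max - d_min) / d_min tends toward 0 as dim grows.
    print(f"dim={dim:5d}  relative contrast={(d_max - d_min) / d_min:.3f}")
```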
41.3 Frequent Pattern
The frequent pattern is a useful model for extracting salient features of the data. It was originally proposed for analyzing market basket data (Agrawal, 1994). A market basket data set is typically represented as a set of transactions, where each transaction contains a set of items from a finite vocabulary. In principle, we can represent the data as a matrix in which each row represents a transaction and each column represents an item. The goal is to find the collection of itemsets appearing in a large number of transactions, as defined by a support threshold t. Most algorithms for mining frequent patterns utilize the Apriori property, stated as follows: if an itemset A is frequent (i.e., present in at least t transactions), then every subset of A must be frequent; conversely, if an itemset A is infrequent (i.e., present in fewer than t transactions), then any superset of A is also infrequent. This property is the basis of all level-wise search algorithms.
The general procedure consists of a series of iterations, beginning with counting item occurrences and identifying the set of frequent items (or equivalently, frequent 1-itemsets). During each subsequent iteration, candidates for frequent k-itemsets are proposed from the frequent (k-1)-itemsets using the Apriori property. These candidates are then validated by explicitly counting their actual occurrences. The value of k is incremented before the next iteration starts. The process terminates when no more frequent itemsets can be generated. We often refer to this level-wise approach as the breadth-first approach because it evaluates the itemsets residing at the same depth in the lattice formed by imposing the partial order of the subset-superset relationship between itemsets.
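The following sketch runs this level-wise procedure on a toy transaction list. It is a simplified illustration of the Apriori idea, not an optimized implementation, and the threshold t counts absolute support as in the description above.

```python
from itertools import combinations

def apriori(transactions, t):
    """Level-wise frequent itemset mining: frequent (k-1)-itemsets generate
    candidate k-itemsets, which are validated by explicit counting."""
    transactions = [frozenset(tx) for tx in transactions]
    items = {i for tx in transactions for i in tx}
    # Frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(i in tx for tx in transactions) >= t}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Validation by counting actual occurrences.
        current = {c for c in candidates
                   if sum(c <= tx for tx in transactions) >= t}
        frequent |= current
        k += 1
    return frequent

# Example: itemsets appearing in at least 2 of the 4 transactions.
txs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}]
print(sorted(tuple(sorted(s)) for s in apriori(txs, t=2)))
```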
It is a well-known problem that the full set of frequent patterns contains significant redundant information, and consequently the number of frequent patterns is often too large. To address this issue, Pasquier et al. (1999) proposed mining a selective subset of frequent patterns, called closed frequent patterns. A pattern is considered closed if none of its immediate superpatterns has the same number of occurrences. The CLOSET algorithm (Pei et al., 2000) was proposed to expedite the mining of closed frequent patterns. CLOSET uses a novel frequent pattern tree (FP-tree) structure as a compact representation to organize the data set. It performs a depth-first search; that is, after discovering a frequent itemset A, it searches for superpatterns of A before checking A's siblings.
A more recent algorithm for mining frequent closed patterns is CHARM (Zaki and Hsiao, 2002). Similar to CLOSET, CHARM searches for patterns in a depth-first manner. The difference between CHARM and CLOSET is that CHARM stores the data set in a vertical format, where a list of row IDs is maintained for each dimension. These row ID lists are then merged during a "column enumeration" procedure that generates row ID lists for other nodes in the enumeration tree. In addition, a technique called diffset is used to reduce the length of the row ID lists as well as the computational complexity of merging them.
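The vertical layout can be illustrated in a few lines. This is a hedged sketch of the row-ID-list (tid-list) representation that CHARM-style algorithms build on, not CHARM itself, and the diffset idea is shown only in its simplest form: storing the difference between a parent's list and a child's list instead of the child's full list.

```python
from collections import defaultdict

transactions = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}, 4: {"b"}}

# Vertical format: one row-ID list per item.
tidlists = defaultdict(set)
for rid, items in transactions.items():
    for item in items:
        tidlists[item].add(rid)

# Support of an itemset is the size of the intersection of its row-ID lists.
support_ab = len(tidlists["a"] & tidlists["b"])              # -> 2

# Diffset idea: store d(PX) = tids(P) - tids(PX) instead of tids(PX);
# then support(PX) = support(P) - |d(PX)|, and the stored sets stay small.
diff_ab = tidlists["a"] - tidlists["b"]
support_ab_from_diffset = len(tidlists["a"]) - len(diff_ab)  # -> 2
```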
All of the above algorithms can find frequent closed patterns when the dimensionality is low to moderate. When the number of dimensions is very high, e.g., greater than 100, the efficiency of these algorithms can be significantly impacted. CARPENTER (Pan et al., 2003) was therefore proposed to solve this problem. It first transposes the matrix representing the data set. Next, CARPENTER performs a depth-first, row-wise enumeration on the transposed matrix. It has been shown that this algorithm can greatly reduce the computation time, especially when the dimensionality is high.
41.4 Clustering
Clustering is a widely adopted Data Mining model that partitions data points into a set of groups, each of which is called a cluster. A data point has a shorter distance to points within its cluster than to points outside the cluster. In a high-dimensional space, for any point, its distance to the closest point and its distance to the farthest point tend to be similar. This phenomenon may render the clustering result sensitive to any small perturbation of the data due to noise and make the exercise of clustering useless. To solve this problem, Agrawal et al. proposed the subspace clustering model (Agrawal et al., 1998). A subspace cluster consists of a subset of objects and a subset of dimensions such that the distance among these objects is small within the given set of dimensions. The CLIQUE algorithm (Agrawal et al., 1998) was proposed to find subspace clusters.
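As a rough illustration of the grid-based idea behind CLIQUE, the sketch below partitions each dimension into equal-width intervals and reports the one-dimensional dense units; in the full algorithm such dense units are combined bottom-up, dimension by dimension, using the same Apriori-style candidate generation shown earlier. The function name, interval count, and density threshold here are illustrative assumptions, not CLIQUE's actual parameters.

```python
import numpy as np

def dense_units_1d(data, n_intervals=10, density_threshold=0.05):
    """Return, per dimension, the grid intervals whose fraction of points
    exceeds the density threshold -- the 1-D seeds of subspace clusters."""
    n, d = data.shape
    dense = {}
    for dim in range(d):
        # Equal-width partition of this dimension into n_intervals bins.
        counts, edges = np.histogram(data[:, dim], bins=n_intervals)
        dense[dim] = [(edges[i], edges[i + 1])
                      for i, c in enumerate(counts) if c / n >= density_threshold]
    return dense
```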
In many applications, users are more interested in objects that exhibit a consistent trend (rather than points having similar values) within a subset of dimensions. One such example is the bicluster model (Cheng and Church, 2000) proposed for analyzing gene expression profiles. A bicluster is a subset of objects (U) and a subset of dimensions (D) such that the objects in U have the same trend (i.e., fluctuate simultaneously) across the dimensions in D. This is particularly useful in analyzing gene expression levels in a microarray experiment, since the expression levels of some genes may be inflated or deflated systematically in some experiments. Thus, the absolute value is not as important as the trend. If two genes have similar trends across a large set of experiments, they are likely to be co-regulated. In the bicluster model, the mean squared error residue is used to qualify a bicluster. Cheng and Church (2000) used a heuristic randomized algorithm to find biclusters. It consists of a series of iterations, each of which locates one bicluster. To prevent the same bicluster from being reported again in subsequent iterations, each time a bicluster is found, its values are replaced by uniform noise before the next iteration starts. This procedure continues until a desired number of biclusters is discovered.
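For reference, the mean squared (error) residue of a submatrix can be written as H(U, D) = (1 / (|U||D|)) Σ_{u∈U, d∈D} (a_{ud} − a_{uD} − a_{Ud} + a_{UD})², where a_{uD}, a_{Ud}, and a_{UD} are the row, column, and overall means of the submatrix. The short function below computes this score; it is an illustrative sketch of the measure itself, not the Cheng and Church search procedure.

```python
import numpy as np

def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(U, D) of the bicluster defined by rows x cols."""
    sub = matrix[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)   # a_uD
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ud
    overall = sub.mean()                          # a_UD
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())
```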
Although the bicluster model and algorithm have been used in several applications in bioinformatics, they have two major drawbacks: (1) the mean squared error residue may not be the best measure to qualify a bicluster, since a large cluster may have a small mean squared error residue even if it includes a small number of objects whose trends are vastly different in the selected dimensions; and (2) the heuristic algorithm may be misled by the noise artificially injected after each iteration and hence may not discover overlapping clusters properly. To solve these two problems, the authors of (Wang et al., 2002) proposed the p-cluster model. A p-cluster consists of a subset of objects U and a subset of dimensions D where, for each pair of objects u1 and u2 in U and each pair of dimensions d1 and d2 in D, the change of u1 from d1 to d2 should be similar to that of u2 from d1 to d2. A threshold is used to evaluate the dissimilarity between two objects on two dimensions. Given a subset of objects and a subset of dimensions, if the dissimilarity between every pair of objects on every pair of dimensions is less than the threshold, then these objects constitute a p-cluster in the given dimensions. A novel deterministic algorithm is developed in (Wang et al., 2002) to find all maximal p-clusters; it utilizes the Apriori property that holds for p-clusters.
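The dissimilarity used in the p-cluster model can be stated concretely: for objects u1, u2 and dimensions d1, d2, the score is |(v(u1, d1) − v(u1, d2)) − (v(u2, d1) − v(u2, d2))|, and a set of objects and dimensions forms a p-cluster if this score is within the threshold for every such pair of pairs. The brute-force check below is an illustrative sketch of this definition only, not the deterministic mining algorithm of Wang et al. (2002).

```python
from itertools import combinations

def is_pcluster(matrix, objects, dims, delta):
    """True if every pair of objects changes consistently (within delta)
    across every pair of the selected dimensions."""
    for u1, u2 in combinations(objects, 2):
        for d1, d2 in combinations(dims, 2):
            change_u1 = matrix[u1, d1] - matrix[u1, d2]
            change_u2 = matrix[u2, d1] - matrix[u2, d2]
            if abs(change_u1 - change_u2) > delta:
                return False
    return True
```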
41.5 Classification
Classification is also a very powerful data analysis tool. In a classification problem, the dimensions of an object can be divided into two types: one dimension records the class type of the object, and the remaining dimensions are attributes. The goal of classification is to build a model that captures the intrinsic associations between the class type and the attributes, so that an (unknown) class type can be accurately predicted from the attribute values. For this purpose, the data is usually divided into a training set and a test set, where the training set is used to build the classifier, which is then validated on the test set. Several models have been developed for classifying high-dimensional data, e.g., naïve Bayesian, neural networks, decision trees (Mitchell, 1997), SVMs, rule-based classifiers, and so on.
The support vector machine (SVM) (Vapnik, 1998) is one of the more recently developed classification models. The success of SVMs in practice stems from their solid mathematical foundation, which conveys the following two salient properties: (1) the classification boundary functions of SVMs maximize the margin, which equivalently optimizes the generalization performance given a training data set; (2) SVMs handle nonlinear classification efficiently using the kernel trick, which implicitly transforms the input space into a higher-dimensional feature space. However, SVMs suffer from two problems. First, the complexity of training an SVM is at least O(N²), where N is the number of objects in the training data set. This can be too costly when the training data set is large. Second, since an SVM essentially draws a hyperplane in a transformed high-dimensional space, it is very difficult to identify the principal (original) dimensions that are most responsible for the classification.
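As a usage illustration (assuming the scikit-learn library, which is not mentioned in the chapter), a kernel SVM can be trained in a few lines; the RBF kernel implicitly maps the input into a higher-dimensional feature space, which is the kernel trick referred to above.

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic high-dimensional data: 500 objects, 200 attributes (illustrative).
X, y = make_classification(n_samples=500, n_features=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)     # nonlinear boundary via the kernel trick
clf.fit(X_train, y_train)          # training cost grows superlinearly with N
print("test accuracy:", clf.score(X_test, y_test))
```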
Rule-based classifiers (Liu et al., 2000) offer some potential to address the above two problems. A rule-based classifier consists of a set of rules of the following form: A_1[l_1, u_1] ∩ A_2[l_2, u_2] ∩ ··· ∩ A_m[l_m, u_m] → C, where A_i[l_i, u_i] denotes that the value of attribute A_i falls in the range [l_i, u_i], and C is the class type. The rule can be interpreted as follows: if an object's attribute values fall in the ranges on the left-hand side, then its class type is likely to be C (with some high probability). Each rule is also associated with a confidence level that depicts the probability that the rule holds. When an object satisfies several rules, either the rule with the highest confidence (e.g., CBA (Liu et al., 2000)) or a weighted vote over all valid rules (e.g., CPAR (Yin and Han, 2003)) may be used for class prediction. However, neither CBA nor CPAR is targeted at high-dimensional data. An algorithm called FARMER (Cong et al., 2004) was proposed to generate rule-based classifiers for high-dimensional data sets. It first quantizes the attributes into a set of bins; each bin is subsequently treated as an item. FARMER then generates the closed frequent itemsets using a method similar to CARPENTER. These closed frequent itemsets are the basis for generating rules. Since the dimensionality is high, the number of possible rules in the classifier can be very large. FARMER finally organizes all rules into compact rule groups.
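To make the prediction step concrete, the sketch below applies a small set of interval rules to an object and combines them either by picking the highest-confidence match (CBA-style) or by averaging confidences per class (in the spirit of CPAR's weighted voting). The rule representation, thresholds, and helper names are illustrative assumptions, not taken from the cited systems.

```python
from collections import defaultdict

# Each rule: ({attribute_index: (low, high), ...}, class_label, confidence)
rules = [
    ({0: (0.0, 0.5), 2: (1.0, 3.0)}, "A", 0.90),
    ({0: (0.4, 1.0)},                "B", 0.75),
    ({1: (2.0, 5.0)},                "A", 0.60),
]

def matches(conditions, x):
    """An object satisfies a rule if every conditioned attribute is in range."""
    return all(lo <= x[i] <= hi for i, (lo, hi) in conditions.items())

def predict(x, mode="best"):
    fired = [(cls, conf) for cond, cls, conf in rules if matches(cond, x)]
    if not fired:
        return None                      # fall back to a default class in practice
    if mode == "best":                   # CBA-style: highest-confidence rule wins
        return max(fired, key=lambda r: r[1])[0]
    scores = defaultdict(list)           # CPAR-style: average confidence per class
    for cls, conf in fired:
        scores[cls].append(conf)
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))

print(predict([0.45, 3.0, 2.0], mode="best"))   # -> "A"
```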
References
Agrawal R., Gehrke J., Gunopulos D., and Raghavan P., Automatic subspace clustering of high dimensional data for Data Mining applications. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Seattle, WA, pp. 94-105, 1998.
Agrawal R. and Srikant R., Fast algorithms for mining association rules in large databases. In Proc. of the 20th VLDB Conf., pages 487-499, 1994.
Beyer K. S., Goldstein J., Ramakrishnan R., and Shaft U., When is "nearest neighbor" meaningful? In Proc. of the 7th Int'l Conf. on Database Theory (ICDT'99), pp. 217-235, Jerusalem, Israel, 1999.
Cheng Y. and Church G., Biclustering of expression data. In Proc. of the Eighth Int'l Conf. on Intelligent Systems for Molecular Biology, pp. 93-103, San Diego, CA, August 2000.
Cong G., Tung A. K. H., Xu X., Pan F., and Yang J., FARMER: Finding interesting rule groups in microarray datasets. In Proc. of the 23rd ACM SIGMOD Int'l Conf. on Management of Data, 2004.
Liu B., Ma Y., and Wong C. K., Improving an association rule based classifier. In Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery, pp. 504-509, September 13-16, 2000.
Mitchell T., Machine Learning. WCB McGraw Hill, 1997.
Pan F., Cong G., Tung A. K. H., Yang J., and Zaki M. J., CARPENTER: Finding closed patterns in long biological data sets. In Proc. of the ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2003.
Pasquier N., Bastide Y., Taouil R., and Lakhal L., Discovering frequent closed itemsets for association rules. In Beeri C. and Buneman P., eds., Proc. of the 7th Int'l Conf. on Database Theory (ICDT'99), Jerusalem, Israel, Volume 1540 of Lecture Notes in Computer Science, pp. 398-416, Springer-Verlag, January 1999.
Pei J., Han J., and Mao R., CLOSET: An efficient algorithm for mining frequent closed itemsets. In Gunopulos D. and Rastogi R., eds., ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
Vapnik V. N., Statistical Learning Theory. John Wiley and Sons, 1998.
Wang H., Wang W., Yang J., and Yu P., Clustering by pattern similarity in large data sets. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 394-405, 2002.
Yin X. and Han J., CPAR: Classification based on predictive association rules. In Proc. of the SIAM Int'l Conf. on Data Mining, San Francisco, CA, pp. 331-335, 2003.
Zaki M. J. and Hsiao C., CHARM: An efficient algorithm for closed itemset mining. In Proc. of the Second SIAM Int'l Conf. on Data Mining, Arlington, VA, 2002. SIAM.
Text Mining and Information Extraction

Moty Ben-Dov¹ and Ronen Feldman²
1 MDX University, London
2 Hebrew University, Israel
Summary. Text mining is the automatic discovery of new, previously unknown information by the automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we define text mining and describe the three main approaches for performing information extraction. In addition, we describe how we can visually display and analyze the outcome of the information extraction process.
Key words: text mining, content mining, structure mining, text classification, information extraction, rule-based systems
42.1 Introduction
The information age has made it easy for us to store large amounts of text. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of information available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes; so-called "push" technology makes the problem even worse by constantly reminding us that we are failing to track news, events, and trends everywhere. We experience information overload, and miss important patterns and relationships even as they unfold before us. As the old adage goes, "we can't see the forest for the trees."
Text-mining (TM), also known as knowledge discovery from text (KDT), refers to the process of extracting interesting patterns from very large text databases for the purpose of discovering knowledge. Text-mining applies the same analytical functions as data-mining but also applies analytic functions from natural language (NL) and information retrieval (IR) techniques (Dorre et al., 1999).
Text-mining tools are used for: