Data Mining and Knowledge Discovery Handbook, 2nd Edition


constrained device that generates or receives streams of information. AOG has three main stages. Mining followed by adaptation to resources and data stream rates represent the first two stages. Merging the generated knowledge structures when running out of memory represents the last stage. AOG has been used in clustering, classification and frequency counting (Gaber et al., 2005).

Figure 39.8 shows a flowchart of the AOG mining process. It shows the sequence of the three stages of AOG.

Fig. 39.8 AOG Approach
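
The three AOG stages described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `AOGClusterer` class, the use of a distance threshold as the output-granularity knob, the doubling rule, and a memory budget expressed as a maximum number of cluster centers. It is not the algorithm of Gaber et al., only one way a mine / adapt / merge loop could look for one-dimensional clustering.

```python
class AOGClusterer:
    """Toy one-dimensional clusterer with an adaptive distance threshold (hypothetical)."""

    def __init__(self, threshold=1.0, max_centers=4):
        self.threshold = threshold      # output-granularity knob (assumed)
        self.max_centers = max_centers  # stand-in for the memory budget (assumed)
        self.centers = []               # list of [mean, count]

    def mine(self, x):
        """Stage 1: absorb x into the nearest center or open a new one."""
        if self.centers:
            nearest = min(self.centers, key=lambda c: abs(c[0] - x))
            if abs(nearest[0] - x) <= self.threshold:
                nearest[1] += 1
                nearest[0] += (x - nearest[0]) / nearest[1]  # running mean
                return
        self.centers.append([float(x), 1])

    def adapt(self):
        """Stage 2: coarsen the output when the budget is exceeded."""
        if len(self.centers) > self.max_centers:
            self.threshold *= 2.0

    def merge(self):
        """Stage 3: merge neighbouring centers that lie within the threshold."""
        merged = []
        for mean, count in sorted(self.centers):
            if merged and mean - merged[-1][0] <= self.threshold:
                prev_mean, prev_count = merged[-1]
                total = prev_count + count
                merged[-1] = [(prev_mean * prev_count + mean * count) / total, total]
            else:
                merged.append([mean, count])
        self.centers = merged

    def process(self, stream):
        for x in stream:
            self.mine(x)
            while len(self.centers) > self.max_centers:
                self.adapt()
                self.merge()
        return self.centers


if __name__ == "__main__":
    clusterer = AOGClusterer(threshold=1.0, max_centers=3)
    print(clusterer.process([0.0, 0.5, 10.0, 10.4, 20.0, 30.0, 30.2, 40.0]))
```

In this sketch the threshold only grows, so the output becomes coarser as memory pressure rises; a fuller treatment would also refine the output again when resources allow.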

Definitions, advantages and disadvantages of all of the above task-based approaches are given in Table 39.3.

39.8 Related Work

The last few years have witnessed the emergence of data management strategies focusing on data stream issues (Babcock et al., 2002). Querying and summarizing data that could be stored for further analysis are the main processing tasks studied in data stream management systems. Extension of query languages, query planning, scheduling, and optimization are the major research activities conducted in this area.

Aurora (Abadi et al., 2003), COUGAR (Yao and Gehrke, 2002), Gigascope (Cranor et al., 2003), STREAM (Arasu et al., 2003), and TelegraphCQ (Krishnamurthy et al., 2003) represent the first generation of data stream management systems. In this section, a brief description of each one is given as follows:

• STREAM: The STanford stREam datA Manager (STREAM) (Arasu et al., 2003) is a data stream management system that handles multiple continuous data streams and supports long-running continuous queries. The intermediate results of a continuous query are stored in a data structure termed the Scratch Store. The result of a query can be a data stream transferred to the user, or a relation that can be stored for re-processing. To support continuous queries over data streams, a continuous query language termed CQL has been developed as part of the system. The language supports relation-to-relation, stream-to-relation, and relation-to-stream operators.

• Gigascope: a specialized data stream management system (Cranor et al., 2003) for the application of network monitoring. It has its own SQL-like query language termed GSQL. Unlike CQL, the input and output of this language are only data streams. GSQL supports merge, selection, join and aggregation operations on data streams. Query optimization and performance considerations have been addressed in developing the language. The system serves a number of network-related applications including intrusion detection and traffic analysis.

• TelegraphCQ: a continuous query processing system (Krishnamurthy et al., 2003) built on the basis of the PostgreSQL open source database system. The system supports creating data streams, sources, wrappers and queries.

• COUGAR: a data stream management system (Yao and Gehrke, 2002) designed for sensor networks. Motivated by the fact that local computation in sensor networks is cheaper than transferring data generated from sensors over wireless connections, a loosely coupled distributed architecture has been proposed to answer in-network queries.

• Aurora: a data stream management system (Abadi et al., 2003) that has optimization features for load shedding, real-time query scheduling and QoS assessment. It is mainly designed to deal with very large numbers of data streams.

Table 39.3 Task-based Techniques

| Technique | Definition | Advantages | Disadvantages |
|---|---|---|---|
| Approximation Algorithms | Design algorithms that approximate mining results with error bounds | Efficiency in running time | The problem of data rates with regard to the available resources could not be solved using approximation algorithms |
| Sliding Window | Analyzing the most recent data streams | Applicable to most data stream applications | Does not provide a model for the whole data stream |
| Algorithm Output Granularity | Adapting the algorithm parameters according to data stream rate and memory consumption | Generic approach that can be applied to any mining technique with no or minor modifications | It has an overhead when running for a long period of time |
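
The three operator classes that CQL is said to support in the STREAM description above can be pictured with a small sketch. This is not CQL syntax and not the STREAM implementation: `rows_window`, `select_where`, `istream` and `continuous_query` are hypothetical helper names, and a row-based sliding window is assumed as the stream-to-relation operator.

```python
from collections import deque


def rows_window(stream, n):
    """Stream-to-relation: at each arrival, yield the relation of the last n tuples."""
    window = deque(maxlen=n)
    for item in stream:
        window.append(item)
        yield list(window)


def select_where(relation, predicate):
    """Relation-to-relation: an ordinary selection over a finite relation."""
    return [row for row in relation if predicate(row)]


def istream(relations):
    """Relation-to-stream: emit tuples that entered the relation since the previous instant."""
    previous = []
    for current in relations:
        remaining = list(previous)
        for row in current:
            if row in remaining:
                remaining.remove(row)   # already present at the previous instant
            else:
                yield row               # newly inserted tuple
        previous = current


def continuous_query(stream, n, predicate):
    """Compose the three operator classes into one continuous query."""
    filtered = (select_where(rel, predicate) for rel in rows_window(stream, n))
    return istream(filtered)


if __name__ == "__main__":
    readings = [3, 12, 7, 15, 15, 2]
    print(list(continuous_query(readings, n=3, predicate=lambda v: v > 10)))
```

The row-based window also corresponds to the sliding window technique of Table 39.3: only the most recent tuples are visible to the query at any instant.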

Queries over data streams have some similarities with data stream mining in terms of research issues and challenges. The two main constraints for querying data streams are the unbounded memory requirement and the high data rate. Thus, the computation time per data element/record should be less than the inter-arrival time implied by the data rate or the sampling rate. Furthermore, the unbounded memory requirement compounds the challenge by necessitating approximate rather than exact results. Significant research efforts have been conducted to approximate the query results (Babcock et al., 2002, Garofalakis et al., 2002b).

The data stream mining algorithms have used some of the techniques introduced in the data stream management research. Sampling and load shedding (Muthukrishnan, 2003) are among the basic techniques that have been introduced in querying data streams and extended to the data mining process.
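
Sampling and load shedding can each be sketched in a few lines. The sketch below uses reservoir sampling for the former and random tuple dropping for the latter. Both are common textbook choices and are assumptions here, not the specific methods of the cited works; the function names and the `arrival_rate` and `capacity` parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)         # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = item        # replace with probability k / (i + 1)
    return sample


def shed_load(stream, arrival_rate, capacity, rng=None):
    """Drop tuples at random so the expected kept rate does not exceed capacity."""
    rng = rng or random.Random()
    keep_probability = min(1.0, capacity / arrival_rate)
    for item in stream:
        if rng.random() < keep_probability:
            yield item


if __name__ == "__main__":
    print(reservoir_sample(range(1000), k=5))
    print(len(list(shed_load(range(1000), arrival_rate=100, capacity=25))))
```

Reservoir sampling needs memory proportional to `k` only, which is why it suits the unbounded-memory constraint discussed above; random shedding trades accuracy for throughput when the arrival rate exceeds what can be processed.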

39.9 Future Directions

The field of data stream mining is in a nascent stage of evolution. The last few years have witnessed increased attention to this area of research due to the proliferation of data stream sources. Based on the state of the art in the area and the demands of data streaming applications, we can identify the future directions of research as follows:

• Developing data mining algorithms for wireless sensor networks to serve a number of real-time critical applications.

• Online medical, scientific and biological data stream mining using data generated from medical and biological instruments and the various tools employed in scientific laboratories.

• Hardware solutions for small devices that emit or receive data streams, in order to enable high-performance computation on such devices.

• Developing software architectures that serve data streaming applications.

39.10 Summary

In this chapter, a review of the state of the art in mining data streams has been presented. Clustering, classification, frequency counting, and time series analysis techniques have been discussed. Different systems that use data stream mining techniques have also been presented. A generalization of the approaches used in developing data stream mining techniques is given. The approaches have been broadly classified into data-based and task-based strategies. Sampling, load shedding, sketching, synopsis data structure creation and aggregation represent the data-based approaches. Approximation algorithms, sliding window and algorithm output granularity are the three approaches that form the task-based strategies. The chapter is concluded with pointers to future research directions in the area.

References

A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein, and J. Widom. STREAM: The Stanford Stream Data Manager. Demonstration description - short overview of system status and plans, in Proc. of the ACM Intl. Conf. on Management of Data (SIGMOD 2003), June 2003, pp. 665 - 665.

D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, S. Zdonik. Aurora: A Data Stream Management System (Demonstration). Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’03), San Diego, CA, June 2003.

C. Aggarwal, J. Han, J. Wang, P. S. Yu, A Framework for Clustering Evolving Data Streams, Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB’03), Berlin, Germany, Sept. 2003, pp. 81-92.

C. Aggarwal, J. Han, J. Wang, and P. S. Yu, A Framework for Projected Clustering of High Dimensional Data Streams, Proc. 2004 Int. Conf. on Very Large Data Bases (VLDB’04), Toronto, Canada, Aug. 2004, pp. 852-863.

C. Aggarwal, J. Han, J. Wang, and P. S. Yu, On Demand Classification of Data Streams, Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD’04), Seattle, WA, Aug. 2004, pp. 503-508.

I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A Survey on Sensor Networks, IEEE Communication Magazine, August 2002, pp. 102-114.

B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems, Proceedings of PODS, 2002, pp. 1-16.

B. Babcock, M. Datar, and R. Motwani. Load Shedding Techniques for Data Stream Systems (short paper), Proc. of the 2003 Workshop on Management and Processing of Data Streams (MPDS 2003), June 2003.

B. Babcock, M. Datar, R. Motwani, L. O’Callaghan, Maintaining Variance and k-Medians over Data Stream Windows, Proceedings of the 22nd Symposium on Principles of Database Systems (PODS 2003), pp. 234 - 243.

M. Burl, Ch. Fowlkes, J. Roden, A. Stechert, and S. Mukhtar, Diamond Eye: A distributed architecture for image data mining, in SPIE DMKD, Orlando, April 1999, pp. 197-206.

M. Charikar, L. O’Callaghan, and R. Panigrahy, Better streaming algorithms for clustering problems, Proc. of 35th ACM Symposium on Theory of Computing (STOC), 2003, pp. 30-39.

Y.D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, and L. Auvil, MAIDS: Mining Alarming Incidents from Data Streams (system demonstration), Proc. 2004 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’04), Paris, France, June 2004, pp. 919 - 920.

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of Time-Series Data Streams, Proceedings of VLDB Conference, 2002, pp. 323-334.

B. Castano, M. Judd, R. C. Anderson, and T. Estlin, Machine Learning Challenges in Mars Rover Traverse Science, Proc. of the ICML 2003 workshop on Machine Learning Technologies for Autonomous Space Applications.

C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk, Gigascope: a stream database for network applications, In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, California, June 09 - 12, 2003), SIGMOD ’03, ACM, New York, NY, pp. 647-651.

L. O’Callaghan, Nina Mishra, Adam Meyerson, Sudipto Guha, and Rajeev Motwani, Streaming-data algorithms for high-quality clustering, Proceedings of IEEE International Conference on Data Engineering, March 2002, pp. 685-697.

G. Cormode, S. Muthukrishnan, What’s hot and what’s not: tracking most frequent items dynamically, PODS 2003, pp. 296-306.

J. Coughlan, Accelerating Scientific Discovery at NASA, SIAM SDM 2004, Florida, USA.

G. Cormode and S. Muthukrishnan, What is new: Finding significant differences in network data streams, INFOCOM 2004.

Y. Chi, Philip S. Yu, Haixun Wang, Richard R. Muntz, Loadstar: A Load Shedding Scheme for Classifying Data Streams, The 2005 SIAM International Conference on Data Mining (SIAM SDM’05), 2005.

G. Dong, J. Han, L.V.S. Lakshmanan, J. Pei, H. Wang and P.S. Yu. Online mining of changes from data streams: Research problems and preliminary results, Proceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of Data Streams. In cooperation with the 2003 ACM-SIGMOD International Conference on Management of Data (SIGMOD’03), San Diego, CA, June 8, 2003.

P. Domingos and G. Hulten, Mining High-Speed Data Streams, In Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, 2000, pp. 71-80.

P. Domingos and G. Hulten. Catching Up with the Data: Research Issues in Mining Data Streams, Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001, Santa Barbara, CA.

P. Domingos and G. Hulten, A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering, Proceedings of the Eighteenth International Conference on Machine Learning, 2001, Williamstown, MA, Morgan Kaufmann, pp. 106-113.

M. Dunham. Data Mining: Introductory and Advanced Topics. Pearson Education, 2003.

F.J. Ferrer-Troyano, J.S. Aguilar-Ruiz and J.C. Riquelme, Discovering Decision Rules from Numerical Data Streams, ACM Symposium on Applied Computing - SAC04, 2004, ACM Press, pp. 649-653.

U.M. Fayyad: Knowledge Discovery in Databases: An Overview. ILP 1997, pp. 3-16.

U.M. Fayyad: Mining Databases: Towards Algorithms for Knowledge Discovery. IEEE Data Eng. Bull. 21(1), 1998, pp. 39-48.

U.M. Fayyad, Georges G. Grinstein, Andreas Wierse: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.

M.M. Gaber, P. S. Yu, A Holistic Approach for Resource-aware Adaptive Data Stream Mining, Journal of New Generation Computing, Special Issue on Knowledge Discovery from Data Streams, 2006.

V. Ganti, Johannes Gehrke, Raghu Ramakrishnan: Mining Data Streams under Block Evolution. SIGKDD Explorations 3(2), 2002, pp. 1-10.

M. Garofalakis, Johannes Gehrke, Rajeev Rastogi: Querying and mining data streams: you only get one look, a tutorial. SIGMOD Conference 2002: 635.

C. Giannella, J. Han, J. Pei, X. Yan, and P.S. Yu, Mining Frequent Patterns in Data Streams at Multiple Time Granularities, in H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), Next Generation Data Mining, AAAI/MIT, 2003.

A.C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin Strauss: One-Pass Wavelet Decompositions of Data Streams. TKDE 15(3), 2003, pp. 541-554.

M.M. Gaber, S. Krishnaswamy, and A. Zaslavsky, On-board Mining of Data Streams in Sensor Networks, a book chapter in Advanced Methods of Knowledge Discovery from Complex Data, (Eds.) Sanghamitra Badhyopadhyay, Ujjwal Maulik, Lawrence Holder and Diane Cook, Springer Verlag, 2005.

R. Grossman, Supporting the Data Mining Process with Next Generation Data Mining Systems, Enterprise Systems, August 1998.

M.M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Towards an Adaptive Approach for Mining Data Streams in Resource Constrained Environments, Proceedings of Sixth International Conference on Data Warehousing and Knowledge Discovery - Industry Track (DaWaK 2004), Zaragoza, Spain, 30 August - 3 September, Lecture Notes in Computer Science (LNCS), Springer Verlag.

S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, Clustering data streams, Proceedings of the Annual Symposium on Foundations of Computer Science, IEEE, November 2000, pp. 359-366.

S. Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan, Clustering Data Streams: Theory and Practice. TKDE special issue on clustering, vol. 15, 2003, pp. 515-528.

D.J. Hand, Statistics and Data Mining: Intersecting Disciplines, ACM SIGKDD Explorations, 1, 1, June 1999, pp. 16-19.

D.J. Hand, H. Mannila, and P. Smyth. Principles of data mining, MIT Press, 2001.

W. Hoeffding. Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association (58), 1963, pp. 13-30.

J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pp. 1-12.

G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM SIGKDD 2001, pp. 97-106.

M. Henzinger, P. Raghavan and S. Rajagopalan, Computing on data streams, Technical Note 1998-011, Digital Systems Research Center, Palo Alto, CA, May 1998.

T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning: data mining, inference, and prediction, New York: Springer, 2001.

P. Indyk, N. Koudas, and S. Muthukrishnan, Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. In Proc. of the 26th Int. Conf. on Very Large Data Bases, Cairo, Egypt, September 2000, pp. 363 - 372.

C. Jin, Weining Qian, Chaofeng Sha, Jeffrey X. Yu, and Aoying Zhou, Dynamically Maintaining Frequent Items over a Data Stream, In Proceedings of the 12th ACM Conference on Information and Knowledge Management (CIKM’2003), pp. 287-294.

M. Kantardzic, Data mining: concepts, models, methods and algorithms, Piscataway, NJ: IEEE Pr., Wiley Interscience, 2003.

H. Kargupta, Ruchita Bhargava, Kun Liu, Michael Powers, Patrick Blair, Samuel Bushra, James Dull, Kakali Sarkar, Martin Klein, Mitesh Vasa, and David Handy, VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring, Proceedings of SIAM International Conference on Data Mining 2004.

S. Krishnamurthy, S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: An Architectural Status Report. IEEE Data Engineering Bulletin, Vol. 26(1), March 2003.

E. Keogh, J. Lin, and W. Truppel. Clustering of Time Series Subsequences is Meaningless: Implications for Past and Future Research. In proceedings of the 3rd IEEE International Conference on Data Mining, Melbourne, FL, Nov. 19-22, 2003, pp. 115-122.

H. Kargupta, B. Park, S. Pittie, L. Liu, D. Kushraj and K. Sarkar (2002). MobiMine: Monitoring the Stock Market from a PDA. ACM SIGKDD Explorations, January 2002, Volume 3, Issue 2, ACM Press, pp. 37-46.

B. Krishnamachari and S.S. Iyengar. Efficient and Fault-tolerant Feature Extraction in Sensor Networks. In Proceedings of the 2nd International Workshop on Information Processing in Sensor Networks (IPSN ’03), Palo Alto, California, April 2003.

B. Krishnamachari and S. Iyengar. Distributed Bayesian Algorithms for Fault-tolerant Event Region Detection in Wireless Sensor Networks. IEEE Transactions on Computers, vol. 53, No. 3, March 2004.

M. Last, Online Classification of Nonstationary Data Streams, Intelligent Data Analysis, Vol. 6, No. 2, 2002, pp. 129-147.

Y. Law, C. Zaniolo, An Adaptive Nearest Neighbor Classification Algorithm for Data Streams, Proceedings of the 9th European Conference on the Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), Springer Verlag, Porto, Portugal, October 3-7, 2005, pp. 108-120.

J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A Symbolic Representation of Time Series, with Implications for Streaming Algorithms, In proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, June 13, 2003, pp. 2-11.

G.S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 2002, pp. 346-357.

R. Moskovitch, Y. Elovici, L. Rokach, Detection of unknown computer worms based on behavioral classification of the host, Computational Statistics and Data Analysis, 52(9):4544–4566, 2008.

S. Muthukrishnan, Data streams: algorithms and applications. Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms, 2003.

O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez, Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, in Proc. of WebKDD 2003 - KDD Workshop on Web mining as a Premise to Effective and Intelligent Web Applications, Washington DC, August 2003, p. 71.

C. Ordonez. Clustering Binary Data Streams with K-means. ACM DMKD 2003.

B. Park and H. Kargupta. Distributed Data Mining: Algorithms, Systems, and Applications, Data Mining Handbook. Editor: Nong Ye. 2002.

E. Perlman and A. Java, Predictive Mining of Time Series Data in Astronomy. In ASP Conf. Ser. 295: Astronomical Data Analysis Software and Systems XII, 2003.

S. Papadimitriou, C. Faloutsos, and A. Brockwell, Adaptive, Hands-Off Stream Mining, 29th International Conference on Very Large Data Bases (VLDB), 2003.

S. Pirttikangas, J. Riekki, J. Kaartinen, J. Miettinen, S. Nissila, J. Roning. Genie Of The Net: A New Approach For A Context-Aware Health Club. In Proceedings of Joint 12th ECML’01 and 5th European Conference on PKDD’01, September 3-7, 2001, Freiburg, Germany.

L. Rokach, Decomposition methodology for classification tasks: a meta decomposer framework, Pattern Analysis and Applications, 9(2006):257–271.

L. Rokach, O. Maimon and R. Arbel, Selective voting-getting more for less in sensor fusion, International Journal of Pattern Recognition and Artificial Intelligence 20 (3) (2006), pp. 329–350.

A. Srivastava and J. Stroeve, Onboard Detection of Snow, Ice, Clouds and Other Geophysical Processes Using Kernel Methods, Proceedings of the ICML’03 workshop on Machine Learning Technologies for Autonomous Space Applications.

S. Tanner, M. Alshayeb, E. Criswell, M. Iyer, A. McDowell, M. McEniry, K. Regner, EVE: On-Board Process Planning and Execution, Earth Science Technology Conference, Pasadena, CA, Jun. 11 - 14, 2002.

N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack and M. Stonebraker, Load Shedding in a Data Stream Manager. Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), September 2003.

N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, M. Stonebraker. Load Shedding on Data Streams, In Proceedings of the Workshop on Management and Processing of Data Streams (MPDS 03), San Diego, CA, USA, June 8, 2003.

H. Toivonen, Sampling large databases for association rules, Proceeding of VLDB Conference, 1996.

Y. Yao, J. E. Gehrke, The Cougar Approach to In-Network Query Processing in Sensor Networks, SIGMOD Record, Volume 31, Number 3, September 2002, pp. 9-18.

H. Wang, W. Fan, P. Yu and J. Han, Mining Concept-Drifting Data Streams using Ensemble Classifiers, in the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Aug. 2003, Washington DC, USA.

Y. Zhu and D. Shasha, Efficient Elastic Burst Detection in Data Streams, The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), 24 August - 27 August 2003, pp. 336 - 345.

Mining Concept-Drifting Data Streams

Haixun Wang1, Philip S. Yu2, and Jiawei Han3

1 IBM T. J. Watson Research Center
haixun@us.ibm.com

2 IBM T. J. Watson Research Center
psyu@us.ibm.com

3 University of Illinois, Urbana Champaign
hanj@cs.uiuc.edu

Summary. Knowledge discovery from infinite data streams is an important and difficult task. We are facing two challenges, the overwhelming volume and the concept drifts of the streaming data. In this chapter, we introduce a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have substantial advantage over single-classifier approaches in prediction accuracy, and the ensemble framework is effective for a variety of classification models.

Key words: Data Mining, concept learning, classifier design and evaluation
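
The chunk-based weighted ensemble described in the summary can be illustrated as follows. This is a simplified sketch, not the authors' method: the base learner is a hypothetical one-feature threshold stump (`train_stump`) rather than C4.5 or RIPPER, the `WeightedEnsemble` class and its `max_models` parameter are invented for illustration, and each member's weight is assumed to be its accuracy on the most recent chunk, which stands in for the expected classification accuracy mentioned above.

```python
def train_stump(chunk):
    """Fit a threshold classifier on (x, label) pairs with labels 0/1 (toy base learner)."""
    best = None
    for threshold in sorted({x for x, _ in chunk}):
        for high_label in (0, 1):
            correct = sum((high_label if x >= threshold else 1 - high_label) == y
                          for x, y in chunk)
            if best is None or correct > best[0]:
                best = (correct, threshold, high_label)
    _, threshold, high_label = best
    return lambda x: high_label if x >= threshold else 1 - high_label


def accuracy(model, chunk):
    """Fraction of examples in the chunk that the model labels correctly."""
    return sum(model(x) == y for x, y in chunk) / len(chunk)


class WeightedEnsemble:
    def __init__(self, max_models=3):
        self.max_models = max_models
        self.members = []                      # list of (weight, model)

    def update(self, chunk):
        """Train on the newest chunk, then re-weight every member on that chunk.

        Scoring the new model on its own training chunk is optimistic; a
        held-out estimate would be a more careful choice (assumption noted).
        """
        models = [m for _, m in self.members] + [train_stump(chunk)]
        scored = sorted(((accuracy(m, chunk), m) for m in models),
                        key=lambda pair: pair[0], reverse=True)
        self.members = scored[:self.max_models]

    def predict(self, x):
        """Weighted vote over the current members."""
        votes = {0: 0.0, 1: 0.0}
        for weight, model in self.members:
            votes[model(x)] += weight
        return max(votes, key=votes.get)


if __name__ == "__main__":
    before_drift = [(x, int(x >= 5)) for x in range(10)]
    after_drift = [(x, int(x < 5)) for x in range(10)]
    ensemble = WeightedEnsemble()
    ensemble.update(before_drift)
    print(ensemble.predict(7))
    ensemble.update(after_drift)
    print(ensemble.predict(7))
```

Because weights are recomputed on the latest chunk, a member trained before a concept drift loses influence as soon as its predictions stop matching recent data, without having to be retrained.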

40.1 Introduction

Knowledge discovery on streaming data is a research topic of growing interest (Babcock et al., 2002, Chen et al., 2002, Domingos and Hulten, 2000, Hulten et al., 2001). The fundamental problem we need to solve is the following: given an infinite amount of continuous measurements, how do we model them in order to capture time-evolving trends and patterns in the stream, and make time-critical predictions?

Huge data volume and drifting concepts are not unfamiliar to the Data Mining community. One of the goals of traditional Data Mining algorithms is to learn models from large databases with bounded memory. It has been achieved by several classification methods, including Sprint (Shafer et al., 1996), BOAT (Gehrke et al., 1999), etc. Nevertheless, the fact that these algorithms require multiple scans of the training data makes them inappropriate in the streaming environment, where examples are coming in at a higher rate than they can be repeatedly analyzed.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
