Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Olufemi A. Omitaomu, João Gama, Nitesh V. Chawla, Auroop R. Ganguly (Eds.)
Knowledge Discovery from Sensor Data
Second International Workshop, Sensor-KDD 2008, Las Vegas, NV, USA, August 24-27, 2008
Revised Selected Papers
Mohamed Medhat Gaber
Monash University, Centre for Distributed Systems and Software Engineering
900 Dandenong Road, Caulfield East, Melbourne, VIC 3145, Australia

João Gama
University of Porto, Faculty of Economics, LIAAD-INESC Porto L.A.
Rua de Ceuta, 118, 6, 4050-190 Porto, Portugal
E-mail: jgama@liaad.up.pt
Nitesh V Chawla
University of Notre Dame, Computer Science and Engineering Department
353 Fitzpatrick Hall, Notre Dame, IN 46556, USA
E-mail: nchawla@cse.nd.edu
Library of Congress Control Number: 2010924293
CR Subject Classification (1998): H.3, H.4, C.2, H.5, H.2.8, I.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-12518-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-12518-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Preface

This volume contains extended papers from Sensor-KDD 2008, the Second International Workshop on Knowledge Discovery from Sensor Data. The second Sensor-KDD workshop was held in Las Vegas on August 24, 2008, in conjunction with the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Wide-area sensor infrastructures, remote sensors, wireless sensor networks, and RFIDs yield massive volumes of disparate, dynamic, and geographically distributed data. As such sensors are becoming ubiquitous, a set of broad requirements is beginning to emerge across high-priority applications including disaster preparedness and management, adaptability to climate change, national or homeland security, and the management of critical infrastructures. The raw data from sensors need to be efficiently managed and transformed to usable information through data fusion, which in turn must be converted to predictive insights via knowledge discovery, ultimately facilitating automated or human-induced tactical decisions or strategic policy based on decision sciences and decision support systems.

The expected ubiquity of sensors in the near future, combined with the critical roles they are expected to play in high-priority application solutions, points to an era of unprecedented growth and opportunities. The main motivation for the Sensor-KDD series of workshops stems from the increasing need for a forum to exchange ideas and recent research results, and to facilitate collaboration and dialog between academia, government, and industrial stakeholders. This is clearly reflected in the successful organization of the first workshop (http://www.ornl.gov/sci/knowledgediscovery/SensorKDD-2007/) along with the ACM KDD-2007 conference, which was attended by more than seventy registered participants and resulted in an edited book (CRC Press, ISBN 9781420082326, 2008) and a special issue of the Intelligent Data Analysis journal (Volume 13, Number 3, 2009).
Based on the positive feedback from the previous workshop attendees, our own experiences and interactions with government agencies such as DHS and DOD, and our involvement with numerous projects on knowledge discovery from sensor data, we organized the second Sensor-KDD workshop along with the KDD-2008 conference. As expected, we received very high-quality paper submissions, which were thoroughly reviewed by a panel of international Program Committee members. Based on a minimum of two reviews per paper, we selected seven full papers and six short papers. In addition to the oral presentations of accepted papers, the workshop featured two invited speakers: Kendra E. Moore, Program Manager, DARPA/IPTO, and Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign.
The contents of this volume include the following papers. Data mining techniques for diagnostic debugging in sensor networks were presented in an invited paper by Abdelzaher et al. Sebastiao et al. addressed the important problem of detecting changes in constructing histograms from time-changing high-speed data streams. Delir Haghighi et al. introduced an integrated architecture for situation-aware adaptive data mining and mobile visualization in ubiquitous computing environments. Davis et al. described synchronous and asynchronous expectation maximization algorithms for unsupervised learning in factor graphs. Rahman et al. dealt with intrusion detection in wireless networks; their system, WiFi Miner, is capable of finding frequent and infrequent patterns in preprocessed wireless connection records using an infrequent-pattern-finding Apriori algorithm. A solution to the problem of detecting underlying patterns in large volumes of spatiotemporal data, which allows one, for example, to model human behavior and plan traffic, was given by Hutchins et al. Wu et al. presented a spatiotemporal outlier detection algorithm called Outstretch, which discovers the outlier movement patterns of the top-k spatial outliers over several time periods. A joint use of large-scale sensory measurements from the Internet and a small number of human inputs for effective network inference through a clustering and semi-supervised learning algorithm was given by Erjongmanee et al. Rashidi and Cook presented an adaptive data mining framework for detecting patterns in sensor data. A dense pixel visualization technique for visualizing sensor data, as well as absolute errors resulting from predictive models, was presented by Rodrigues et al. Fang et al. presented a two-stage knowledge discovery process, where offline approaches are utilized to design online solutions that can support real-time decisions. Finally, a framework for the discovery of spatiotemporal neighborhoods in sensor datasets, where a time series of data is collected at many spatial locations, was presented by McGuire et al.
The workshop witnessed lively participation from all quarters and generated interesting discussions immediately after each presentation as well as at the end of the workshop. We hope that the Sensor-KDD workshop will continue to be an attractive forum for researchers from academia, industry, and government to exchange ideas, initiate collaborations, and lay the foundation for the future of this important and growing area.
Olufemi A. Omitaomu
João Gama
Nitesh V. Chawla
Mohamed Medhat Gaber
Auroop R. Ganguly
Organization

The Second International Workshop on Knowledge Discovery from Sensor Data (Sensor-KDD 2008) was made possible by the following organizers and international Program Committee members.
Workshop Chair
Ranga Raju Vatsavai Oak Ridge National Laboratory, USA
Olufemi Omitaomu Oak Ridge National Laboratory, USA
Joao Gama University of Porto, Portugal
Nitesh V Chawla University of Notre Dame, USA
Mohamed Medhat Gaber Monash University, Australia
Auroop Ganguly Oak Ridge National Laboratory, USA
Program Committee (in alphabetical order)
Michaela Black University of Ulster, Coleraine, Northern Ireland, UK
Andre Carvalho University of Sao Paulo, Brazil
Sanjay Chawla University of Sydney, Australia
Francisco Ferrer University of Seville, Spain
Ray Hickey University of Ulster, Coleraine, Northern Ireland, UK
Ralf Klinkenberg University of Dortmund, Germany
Miroslav Kubat University of Miami, USA
Mark Last Ben-Gurion University, Israel
Chang-Tien Lu Virginia Tech, USA
Elaine Parros Machado de Sousa University of Sao Paulo, Brazil
Laurent Mignet IBM Research, USA
S. Muthu Muthukrishnan Rutgers University and AT&T Research, USA
Pedro Rodrigues University of Porto, Portugal
Josep Roure Carnegie Mellon University, Pittsburgh, USA
Bernhard Seeger University of Marburg, Germany
Cyrus Shahabi University of Southern California, USA
Mallikarjun Shankar Oak Ridge National Laboratory, Oak Ridge, USA
Alexandre Sorokine Oak Ridge National Laboratory, Oak Ridge, USA
Eiko Yoneki University of Cambridge, UK
Nithya Vijayakumar Cisco Systems, Inc., USA
Guangzhi Qu Oakland University, Rochester, USA
Table of Contents

Data Mining for Diagnostic Debugging in Sensor Networks: Preliminary
Evidence and Lessons Learned 1
Tarek Abdelzaher, Mohammad Khan, Hieu Le,
Hossein Ahmadi, and Jiawei Han
Monitoring Incremental Histogram Distribution for Change Detection
Pari Delir Haghighi, Brett Gillick, Shonali Krishnaswamy,
Mohamed Medhat Gaber, and Arkady Zaslavsky
Unsupervised Plan Detection with Factor Graphs 59
George B Davis, Jamie Olson, and Kathleen M Carley
WiFi Miner: An Online Apriori-Infrequent Based Wireless Intrusion
System 76
Ahmedur Rahman, C.I Ezeife, and A.K Aggarwal
Probabilistic Analysis of a Large-Scale Urban Traffic Sensor
Data Set 94
Jon Hutchins, Alexander Ihler, and Padhraic Smyth
Spatio-temporal Outlier Detection in Precipitation Data 115
Elizabeth Wu, Wei Liu, and Sanjay Chawla
Large-Scale Inference of Network-Service Disruption upon Natural
Parisa Rashidi and Diane J Cook
A Simple Dense Pixel Visualization for Mobile Sensor Data Mining 175
Pedro Pereira Rodrigues and João Gama
Incremental Anomaly Detection Approach for Characterizing Unusual
Profiles 190
Yi Fang, Olufemi A Omitaomu, and Auroop R Ganguly
Spatiotemporal Neighborhood Discovery for Sensor Data 203
Michael P McGuire, Vandana P Janeja, and Aryya Gangopadhyay
Author Index 227
Data Mining for Diagnostic Debugging in Sensor Networks: Preliminary Evidence and Lessons Learned

Tarek Abdelzaher, Mohammad Khan, Hieu Le, Hossein Ahmadi, and Jiawei Han
University of Illinois at Urbana-Champaign
Abstract. Sensor networks and pervasive computing systems intimately combine computation, communication, and interactions with the physical world, thus increasing the complexity of the development effort, violating communication protocol layering, and making traditional network diagnostics and debugging less effective at catching problems. Tighter coupling between communication, computation, and interaction with the physical world is likely to be an increasing trend in emerging edge networks and pervasive systems. This paper reviews recent tools developed by the authors to understand the root causes of complex interaction bugs in edge network systems that combine computation, communication, and sensing. We concern ourselves with automated failure diagnosis in the face of non-reproducible behavior, high interactive complexity, and resource constraints. Several examples are given of finding bugs in real sensor network code using the tools developed, demonstrating the efficacy of the approach.
1 Introduction

Sensor networks and pervasive computing systems, henceforth referred to as edge network systems, feature heterogeneity and tight interactions between computation, communication, sensing, and control. Tight interactions breed interactive complexity, the primary cause of failures and vulnerabilities in complex systems. While individual devices and subsystems may operate well in isolation, their composition might result in incompatibilities, anomalies, or failures that are typically very difficult to troubleshoot. On the other hand, software re-use is impaired by the customized nature of application code and deployment environments, making it harder to amortize debugging, troubleshooting, and tuning cost.
The work was supported in part by the U.S. National Science Foundation grants IIS-08-42769, CNS 06-26342, CNS 05-54759, and BDI-05-15813, and by NASA grant NNX08AC35A. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.
Moreover, users of edge network systems, such as residents of a smart home, may not be experts in networking and system administration. Automated techniques are needed for troubleshooting such systems both at development time and after deployment in order to reduce production as well as ownership costs.
Tighter coupling between communication, computation, and interaction with the physical world is likely to be an increasing trend. Internet pioneers, such as David Clark, the network's former chief architect, express the view that by the end of the next decade, the edge of the Internet will primarily constitute sensors and embedded devices [3].¹ This motivates analysis tools for systems of high interactive complexity. The data mining literature is rich [25,5,26,11,8,10,4,22] with examples of identification, classification, and understanding of complex patterns in large, highly coupled systems ranging from biological processes [20] to commercial databases [23]. The key advantage of using data mining is the automation of discovery of hidden patterns that may take significant amounts of time to detect manually.
While the use of data mining in network troubleshooting is promising, it is by no means a straightforward application of existing techniques to a new problem. Networked software execution patterns are not governed by "laws of nature", DNA, business transactions, or social norms. They are limited only by programmers' imagination. The increased diversity and richness of such patterns makes it harder to zoom in on potential causes of problems without embedding some knowledge of networking, programming, and debugging into the data mining engine. This paper describes a cross-cutting solution that leverages the power of data mining to uncover hard-to-find bugs in distributed systems.
2 Diagnostic Debugging
Consider multiple development teams building an edge network system, such as one designed to instrument an assisted living facility with sensors that monitor the occupants, ensure their well-being, and alert care-givers to emergencies when they occur. The system typically consists of a large number of components, further multiplied by the need to support a variety of different hardware platforms, operating systems, and sensor products. Often parts of the system are developed by different vendors. These parts are tested and debugged independently by their respective developers; then, at a later stage, the system is put together at some integration testbed for evaluation. The integrated system usually does not work well. When a host of problem manifestations are reported, which party is responsible for these problems, and who needs to fix what? Different developers must now come together to understand where the malfunction is coming from. This type of bug is the hardest to fix and is a source of significant additional costs and delays in projects. Due to the rising tendency to build networked systems of an increasing number of embedded interacting components, interaction problems get worse, which motivates our work.

¹ This view was expressed in his motivational keynote on the need for a Future Internet Design (FIND) initiative.
The network troubleshooting solution advocated in this paper stems from collaborative work and experiences of the authors in the area of diagnostic debugging. In this section, we overview the goals and challenges, outline the design principles that stem from these goals and challenges, and present initial evidence from a pilot study that suggests the viability of using data mining to uncover a wide range of network-related interaction bugs.

2.1 Goals and Challenges
The coupled nature of edge network systems motivates an extended definition of a network that includes not only the communication infrastructure but also the communicating entities themselves, such as sensors, application-level protocols, and user inputs. A growing challenge is to develop analysis techniques and software tools that automate diagnostic troubleshooting of such networks to reduce their development and ownership cost. A significant part of software development cost is spent on testing and debugging. Similarly, an increasing part of ownership cost is spent on maintenance. Of particular difficulty is to troubleshoot the important and expensive class of problems that arise from (improper or unexpected) interactions of large numbers of components across a network. We aim to answer developer or user questions such as "why does this network suffer unusually high service delays?", "why is throughput low despite availability of resources and service requests?", "why does my time synchronization protocol fail to synchronize clocks when the localization protocol is run concurrently?", or "why does this vehicle tracking system suffer increased false alarms when it is windy?"² Building efficient troubleshooting support to address the above questions is complicated by the characteristics of interaction bugs, namely:
– Non-reproducible behavior: Interactions in edge network systems feature an increased level of concurrency and thus an increased level of non-determinism. In turn, non-determinism generates non-reproducible bugs that are hard to find using traditional debugging tools.
– Non-local emergent behavior: By definition, interaction bugs do not manifest themselves when components are tested in isolation. Current debugging tools are very good at finding bugs that can be traced to individual components. Interaction bugs manifest only at scale as a result of component composition. They result in emergent behavior that arises when a number of seemingly individually sound components are combined into a network, which makes them hard to find.
– Resource constraints: Embedded networked devices often operate under significant resource constraints. Hence, solutions to the debugging problem must not use large amounts of run-time resources, making bugs harder to find.
² In a previous deployment of a magnetometer-based wireless tracking system at UVa, wind resulted in antennae vibration, which was caught by the magnetometers and interpreted as the passage of nearby ferrous objects (vehicles).
2.2 Design Principles
The data mining approach described in this paper is based on three main design principles aimed at exploiting concurrency, interactions, and non-determinism to improve the ability to diagnose problems in resource-constrained systems. These principles are as follows:
– Exploiting non-reproducible behavior: Exploitation of non-determinism to improve understanding of system behavior is not new to the computing literature. For example, many techniques in estimation theory, concerned with estimation of system models, rely on introducing noise to explore a wider range of system states and hence arrive at more accurate models. Machine learning and data mining approaches have the same desirable property. They require examples of both good and bad system behavior to be able to classify the conditions correlated with each. In particular, note that conditions that cause a problem to occur are correlated (by causality) with the resulting bad behavior. Root causes of non-reproducible bugs are thus inherently suited for discovery using data mining and machine learning approaches, as the lack of reproducibility itself and the inherent system non-determinism improve the odds of occurrence of sufficiently diverse behavior examples to train the troubleshooting system to understand the relevant correlations and identify causes of problems. Our choice of data mining techniques exploits this insight.
– Exploiting interactive complexity: Interactive complexity describes a system where scale and complexity cause components to interact in unexpected ways. A failure that occurs due to such unexpected interactions is therefore not localized and is hard to "blame" on any single component. This fundamentally changes the objective of a troubleshooting tool from aiding in stepping through code (which is more suitable for finding a localized error in some line, such as an incorrect pointer reference) to aiding with diagnosing a sequence of events (component interactions) that leads to a failure state. This leads to the choice of sequence mining algorithms as the core analytic engine behind diagnostic debugging.
– Addressing resource economy: We develop a progressive diagnosis capability where inexpensive high-level diagnosis offers initial clues regarding possible root causes, resulting in further, more detailed inspections of smaller subsets of the system until the problem is found.
The above design principles lead to sequence mining techniques to diagnose interaction bugs. Conceptually, we log a large number of events, correlate sequences of such events with manifestations of undesirable behavior, then search such correlated sequences for one or more that may be causally responsible for the failure. Finding such culprit sequences is the core research question in diagnostic debugging.
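To make the workflow concrete, the following sketch (in Python, illustrative only and not the authors' implementation; the function names, toy event names, and labeling predicate are invented for the example) labels each run's log as "good" or "bad", counts short event subsequences in each pile, and ranks patterns by how differently often they occur in the two piles.

from collections import Counter
from itertools import combinations

def subsequences(log, length):
    # All (not necessarily contiguous) event subsequences of the given length.
    return [tuple(log[i] for i in idx) for idx in combinations(range(len(log)), length)]

def discriminative_patterns(logs, is_good, length=2, top=5):
    good, bad = Counter(), Counter()
    for log in logs:
        pile = good if is_good(log) else bad
        pile.update(set(subsequences(log, length)))     # count each pattern once per run
    ranked = sorted(set(good) | set(bad), key=lambda p: abs(good[p] - bad[p]), reverse=True)
    return [(p, good[p], bad[p]) for p in ranked[:top]]

# Toy usage: runs that keep retrying on new channels never deliver ("bad").
runs = [["send", "ack"], ["send", "ack"],
        ["send", "retry_ch1", "retry_ch2"], ["send", "retry_ch1", "retry_ch2"]]
print(discriminative_patterns(runs, is_good=lambda log: "ack" in log))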
2.3 Preliminary Evidence
To experiment with the usefulness of data mining techniques for purposes of debugging networked applications, the authors conducted a pilot investigation and developed a prototype of a diagnostic debugging tool. This prototype was experimented with over the course of two years to understand the strengths and limitations of the approach. Some results were published in SenSys 2008 [16], DCOSS 2008 [14], and DCOSS 2007 [15]. Below we report proof-of-concept examples of successful use of the diagnostic debugging prototype, followed by an elaboration of lessons learned from this experience.
1. A "design" bug: As an example of catching a design bug, we summarize a case study, published by the authors [16], involving a multi-channel sensor network MAC-layer protocol from prior literature [17] that attempts to utilize channel diversity to improve throughput. The protocol assigned a home channel to every node, grouping nodes that communicated much into a cluster on the same channel. It allowed occasional communication between nodes in different clusters by letting senders change their channel temporarily to the home channel of a receiver to send a message. If communication failed (e.g., because home channel information became stale), senders would scan all channels looking for the receiver on a new channel and update home channel information accordingly. Testing revealed that total network throughput was sometimes worse than that of a single-channel MAC. Initially, the designer attributed it to the heavy cost of communication across clusters. To verify this hypothesis, the original protocol, written for MicaZ motes, was instrumented by the authors to log radio channel change events and message communication events (send, receive, acknowledge) as well as related timeouts. It was tested on a motes network. Event logs from runs where it outperformed a single-channel MAC were marked "good". Event logs from runs where it did worse were marked "bad". Discriminative sequence mining applied to the two sets of logs revealed a common pattern associated prominently with bad logs. The pattern included the events No Ack Received, Retry Transmission on Channel (1), Retry Transmission on Channel (2), Retry Transmission on Channel (3), etc., executed on a large number of nodes. This quickly led the designer to understand a much deeper problem. When a sender failed to communicate with a receiver in another cluster, it would leave its home channel and start scanning other channels, causing communication addressed to it from upstream nodes in its cluster to fail as well. Those nodes would start scanning too, resulting in a cascading effect that propagated up the network until everyone was scanning and communication was entirely disrupted everywhere (both within and across clusters). The protocol had to be redesigned.
2. An "accumulative effect" bug: Often failures or performance problems arise because of accumulative effects such as gradual memory leakage or clock overflow. While such effects evolve over a long period of time, the example summarized below [14] shows how it may be possible to use diagnostic debugging to understand the "tipping point" that causes the problem to manifest. In this case, the authors observed a sudden onset of successive message loss in an implementation of directed diffusion [12], a well-known sensor network routing protocol. As before, communication was logged together with timeout and message drop events. Parts of logs coinciding with or closely preceding instances of successive message losses were labeled "bad". The rest were labeled "good". Discriminative sequence mining revealed that patterns correlated with successive message loss included the following events: Message Send (timestamp = 255), Message Send (timestamp = 0), Message Dropped (Reason = "SameDataOlderTimeStamp"). The problem became apparent. A timestamp counter overflow caused subsequent messages received to be erroneously interpreted as "old" duplicates (i.e., having previously seen sequence numbers) and discarded.
3. A "race condition" bug: Race conditions are a significant cause of failures in systems with a high degree of concurrency. In a previous case study [16], the authors demonstrated how diagnostic debugging helped catch a bug caused by a race condition in their embedded operating system, called LiteOS [2]. When the communication subsystem of an early version of LiteOS was stress-tested, some nodes would crash occasionally and non-deterministically. LiteOS allows logging system call events. Such logs were collected from (the flash memory of) nodes that crashed and nodes that did not, giving rise to "bad" and "good" data sets, respectively. Discriminative sequence mining revealed that the following sequence of events occurred in the good logs but not in the bad logs: Packet Received, Get Current Radio Handle, whereas the following occurred in the bad logs but not the good logs: Packet Received, Get Serial Send Function. From these observations, it is clear that failure occurs when Packet Received is followed by Get Serial Send Function before Get Current Radio Handle is called. Indeed, the latter prepares the application for receiving a new packet. At high data rates, another communication event may occur before the application is prepared, causing a node crash. The race condition was eliminated by proper synchronization.
4. An "assumptions mismatch" bug: In systems that interact with the physical world, problems may occur when the assumptions made in protocol design regarding the physical environment do not match physical reality. In this case study [15], a distributed target tracking protocol, implemented on motes, occasionally generated spurious targets. The protocol required that nodes sensing a target form a group and elect a group leader who assigned a new ID to the target. Subsequently, the leader passed the target ID on in a leader handoff operation as the target moved. Communication logs were taken from runs where spurious targets were observed ("bad" logs) and runs where they were not ("good" logs). Discriminative pattern mining revealed the absence of member-to-leader messages in the bad logs. This suggested that a singleton group was present (i.e., the leader was the only group member). Indeed, it turned out that the leader hand-off protocol was not designed to deal with a singleton group because the developer assumed that a target would always be sensed by multiple sensors (an assumption on physical sensing range) and hence a group would always have more than one member. The protocol had to be realigned with physical reality.

While these preliminary results are encouraging, significant challenges were met as well that required adapting data mining techniques to the needs of debugging systems. The following section describes the approach in more detail and outlines the challenges and lessons learned.
Our debugging tool uses a data collection front-end to collect runtime events from the system being debugged for analysis. The front end may range from a wireless eavesdropping network to full instrumentation of the code under development (with debug statements). Later in the paper, we describe specific front ends we developed, as well as some associated development challenges. Conceptually, once the log of runtime events is available from the front end, the tool separates the collected sequence of events into two piles: a "good" pile, which contains the parts of the log where the system performs as expected, and a "bad" pile, which contains the parts of the log where the system fails a specification or exhibits poor performance. This data separation phase may be done based on predicates that define "good" versus "bad" behavior, provided by the application testers or derived from specifications, as illustrated in the examples in Section 2.3. The core of the debugging framework is a discriminative frequent pattern mining algorithm that looks for patterns (sequences of events) that exist with very different frequencies in the "good" and "bad" piles.
These patterns are called discriminative since they are correlated (positively or negatively) with the occurrence of undesirable behavior. Fundamentally, such an algorithm constitutes a search through the exponentially large space of patterns for those that are most correlated with failure. The search space is exponential because, given a system that logs M event types, there are M^N possible sequences of length N involving these events. Moreover, M itself is exponentially large, since events may have parameters (e.g., parameters of a function call, or parameters of a logged packet header), and the space of all parameter values is exponentially large. Any combination of values might be part of the problem.
The basic challenge of identifying problematic event sequences can be thought of as a search through a double-exponential tree, where the root is the empty pattern and each level adds one event to the length of the pattern. Hence, a path through this tree to a leaf node represents one event sequence. Leaves are colored depending on whether the corresponding sequences were observed in the "good" pile, the "bad" pile, or both (yielding blends of the two colors in different proportions). The goal is to search the double-exponential tree for those bad leaves (sequences in the bad pile) that are likely to have caused the observed problem being debugged.
Frequent and sequential pattern mining have been studied extensively in data mining research [7], with many efficient algorithms developed. For example, the popularly cited Apriori algorithm [1] initially counts the number of occurrences (called support) of each distinct event in the data set (i.e., in the "good" or "bad" pile) and discards all events that are infrequent (their support is less than some parameter minSup). The remaining events are frequent patterns of length 1, called the frequent 1-itemset S1. It then takes combinations of S1 (i.e., single events) to generate frequent 2-itemset candidates. One important heuristic used in Apriori is that any subset of a frequent k-itemset must be frequent, which can be used to effectively prune a candidate k-itemset if any of its (k − 1)-itemsets is infrequent. Later frequent pattern mining algorithms, such as FPGrowth [9], FPClose [6], CHARM [29], and LCM [24], as well as sequential pattern mining algorithms, such as GSP [21], PrefixSpan [19], SPADE [28], and CloSpan [27], all exploit this Apriori heuristic. However, for mining sensor network bugs, we found that these algorithms could still be ineffective. In many cases, the root cause of a network bug could be a rare event (with very low support), but setting the minimal support too low could lead to combinatorial explosion in any of the above frequent or sequential pattern mining algorithms. Therefore, more advanced data mining methods should be explored for debugging such network systems. Below, we present the challenges and techniques needed to comprehensively and efficiently address automated self-diagnosis of interaction bugs in practical edge network systems.
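The level-wise candidate generation and subset-based pruning described above can be summarized in a few lines. The sketch below is a minimal, generic Apriori-style itemset miner given for illustration only; it is not the authors' code, and the transactions and minSup value are invented.

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]

    def support(s):
        return sum(1 for t in transactions if s <= t)

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_sup}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], min_sup=2))

The same pruning idea carries over to the sequential miners cited above; the difficulty noted in the text is that a very low minSup makes the candidate sets at each level explode combinatorially.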
3.1 Preliminaries
For the purposes of the discussion below, let us define an event to be the basic element in the log (collected by one of our front ends) that is analyzed for purposes of diagnosing failures. An event generally has a type, a timestamp, a set of attributes, and an associated node or log ID where it was recorded. The set of distinct event types is called an alphabet, in an analogy with strings. In other words, if events were letters in an alphabet, we are looking for strings that cause errors to occur. These strings represent event sequences (ordered lists of events). Logs are ordered chronologically and can be thought of as sequences of logged events.³ For example, S1 = (a, b, b, c, d, b, b, a, c) is an event sequence. Elements a, b, ..., are events. A discriminative pattern between two data sets is a subsequence of (not necessarily contiguous) events that occurs with a very different count in the two sets. The larger the difference, the better the discrimination. With the above terminology in mind, we present the challenges addressed in the proposed work.
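The notion of a discriminative pattern can be pinned down with a small helper that counts how many times a candidate pattern occurs as a (not necessarily contiguous) subsequence of a log; comparing such counts across the "good" and "bad" sets is what the mining algorithm automates. The sketch below is illustrative only, and the function name is invented.

def count_occurrences(log, pattern):
    # counts[j] = number of ways to match the first j pattern events seen so far.
    counts = [1] + [0] * len(pattern)
    for event in log:
        for j in range(len(pattern), 0, -1):    # iterate backwards so each event is used once per pass
            if event == pattern[j - 1]:
                counts[j] += counts[j - 1]
    return counts[len(pattern)]

S1 = ["a", "b", "b", "c", "d", "b", "b", "a", "c"]   # the example sequence above
print(count_occurrences(S1, ["a", "b", "c"]))        # number of embeddings of (a, b, c) in S1
print(count_occurrences(["a", "c", "b"], ["a", "b", "c"]))   # 0: order matters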
3.2 Challenge: Scalability of Search
Frequent event sequences are easy to identify since they stand out in the number of times they occur. A significant challenge, however, is to identify infrequent sequences that cause problems. In debugging, less frequent patterns can sometimes be more indicative of the cause of failure than the most frequent patterns. A single mistake can cause a damaging sequence of events. For example, a single node reboot event can cause a large number of message losses. In such cases, if frequent patterns are generated that are commonly found in failure cases, the most frequent patterns may not include the real cause of the problem (e.g., the reboot event).
Fortunately, in the case of embedded network debugging, a solution may be inspired by the nature of the problem domain.

³ Lack of clock synchronization makes it impossible to produce an exact global chronological order, a challenge we address in this work.
The fundamental issue to observe is that much computation in edge network systems is recurrent. Code repeatedly visits the same states (perhaps not strictly periodically), repeating the same actions over time. Hence, a single problem, such as a node reboot or a race condition that pollutes a data structure, often results in multiple manifestations of the same unusual symptom (like multiple subsequent message losses or multiple subsequent false alarms). Catching these recurrent symptoms with an algorithm such as Apriori or PrefixSpan is much easier due to their larger frequency. With such symptoms identified, the search space can be narrowed, and it becomes easier to correlate them with other, less frequent, preceding event occurrences. This suggests a two-stage approach. In the first stage, a simple algorithm such as Apriori or PrefixSpan generates the usual frequent discriminative patterns that have support larger than minSup. It is expected that the patterns involving manifestations of bugs will survive at the end of this stage, but infrequent events like a node reboot will be dropped due to their low support.
In the second stage, the algorithm first splits the log into segments. It counts the number of discriminative frequent patterns found in each segment and ranks each segment of the log based on the count (the higher the number of discriminative patterns in a segment, the higher the rank). Next, the algorithm searches for discriminative patterns (those that occurred in one pile but not the other) with minSup reduced to 1 on the K highest-ranked segments. This scheme was shown by the authors to have a significant impact on the performance of the frequent pattern mining algorithm [13]. Essentially, it relies on successive refinement of the search space. First, it performs a coarse (and fast) search through the event logs to find the most obvious leads; then it does a finer-grained but more localized search around regions in the log highlighted by the first search.
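A compact way to express this two-stage refinement is sketched below (illustrative only; mine(log, min_sup) stands for any frequent or sequential pattern miner such as Apriori or PrefixSpan, and the segment length, K, and good_patterns filtering are simplifying assumptions rather than the authors' exact interface).

def contains(log, pattern):
    # True if pattern occurs as a (possibly non-contiguous) subsequence of log.
    it = iter(log)
    return all(e in it for e in pattern)

def two_stage_search(bad_log, good_patterns, mine, min_sup=5, seg_len=100, k=3):
    # Stage 1: ordinary mining; frequent bug manifestations survive,
    # rare root-cause events (e.g., a single reboot) are dropped.
    stage1 = {p for p in mine(bad_log, min_sup) if p not in good_patterns}

    # Stage 2: rank segments by how many stage-1 patterns they contain,
    # then re-mine only the top-K segments with the support threshold at 1.
    segments = [bad_log[i:i + seg_len] for i in range(0, len(bad_log), seg_len)]
    ranked = sorted(segments,
                    key=lambda seg: sum(1 for p in stage1 if contains(seg, p)),
                    reverse=True)
    return {p for seg in ranked[:k] for p in mine(seg, 1) if p not in good_patterns}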
3.3 Preventing False Frequent Patterns
The Apriori algorithm and its extensions generate all possible combinations of frequent subsequences of the original sequence. As a result, they generate subsequences combining events that are "too far" apart to be causally correlated with high probability, and thus reduce the chance of finding the "culprit sequence" that actually caused the failure. This strategy could negatively impact the ability to identify discriminative patterns in two ways: (i) it could lead to the generation of discriminative patterns that are not causally related, and (ii) it could eliminate discriminative patterns by generating false patterns. Consider the following example.
Suppose we have two eight-event sequences, S1 and S2. If we apply the Apriori technique, it will generate (a, c, b) as an equally likely pattern for both S1 and S2: in both S1 and S2, it will combine the first occurrence of a and the first occurrence of c with the second occurrence of b, so the pattern will get canceled out at the differential analysis phase. To address this issue, the key observation is that the first occurrence of a should not be allowed to combine with the second occurrence of b, as there is another event a after the first occurrence of a but before the second occurrence of b, and the second occurrence of b is correlated with the second occurrence of a with higher probability.
To prevent such erroneous combinations, we use a dynamic search window scheme where the first item of any candidate sequence is used to determine the search window. In this case, for any pattern starting with a, the search window is [1, 4] and [4, 8] in S1 and S2. With these search windows, the algorithm will search for pattern (a, c, b) in windows [1, 4] and [4, 8], will fail to find it in S1, and will find it only in sequence S2. As a result, the algorithm will be able to report pattern (a, c, b) as a discriminative pattern.

This dynamic search window scheme also speeds up the search significantly. In this scheme, the original sequence (of size 8 events) was reduced to windows of size 4, making the search for patterns in those windows more efficient.
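The window construction and the window-restricted search can be sketched as follows (illustrative only; the exact window boundary convention and the toy sequences are invented here, since the paper's original S1 and S2 listings are not reproduced above).

def windows_for(log, first_event):
    starts = [i for i, e in enumerate(log) if e == first_event]
    return list(zip(starts, starts[1:] + [len(log)]))

def occurs_in_window(log, pattern):
    for start, end in windows_for(log, pattern[0]):
        it = iter(log[start:end])
        if all(e in it for e in pattern):    # subsequence test restricted to one window
            return True
    return False

# Toy sequences: (a, c, b) occurs in both as an unrestricted subsequence,
# but only S2 contains it inside a single search window.
S1 = ["a", "c", "x", "a", "b", "x", "x", "x"]
S2 = ["a", "x", "x", "a", "c", "b", "x", "x"]
print(occurs_in_window(S1, ["a", "c", "b"]), occurs_in_window(S2, ["a", "c", "b"]))   # False True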
3.4 Suppressing Redundant Subsequences
At the frequent pattern generation stage, if two patterns, Si and Sj, have support ≥ minSup, the Apriori algorithm keeps both sequences as frequent patterns even if one is a subsequence of the other and both have equal support. This makes perfect sense in data mining but not in debugging. For example, when mining the "good" data set, the above strategy assumes that any subset of a "good" pattern is also a good pattern. In real life, this is not true. Forgetting a step in a multi-step procedure may well cause failure. Hence, subsequences of good sequences are not necessarily good. Keeping these subsequences as examples of "good" behavior leads to a major problem at the differential analysis stage when discriminative patterns are generated, since they may incorrectly cancel out similar subsequences found frequent in the other (i.e., "bad" behavior) data pile. For example, consider the two sequences below:
S1 = (a, b, c, d, a, b, c, d)
S2 = (a, b, c, d, a, b, d, c)
Suppose, for correct operation of the protocol, event a has to be followed by event c before event d can happen. In sequence S2 this condition is violated. Ideally, we would like our algorithm to report sequence S3 = (a, b, d) as the "culprit" sequence. However, if we apply the Apriori algorithm, it will fail to catch this sequence. This is because it will generate S3 as a frequent pattern both for S1 and S2 with support 2, and it will get canceled out at the differential analysis phase. As a result, S3 will never show up as a "discriminative pattern". Note that the dynamic search window scheme alone cannot prevent this.
To illustrate, suppose a successful message transmission involves the following sequence of events:

(enableRadio, messageSent, ackReceived, disableRadio)

Now, although the sequence

(enableRadio, messageSent, disableRadio)

is a subsequence of the original "good" sequence, it does not represent a successful scenario, as it disables the radio before receiving the "ACK" message.

To solve this problem, we need an extra step (which we call sequenceCompression) before we perform differential analysis to identify discriminative patterns. At this step, we remove the sequence Si if it is a subsequence of Sj with the same support. This will remove all the redundant subsequences from the frequent pattern list. Subsequences with a (sufficiently) different support will be retained and will show up after discriminative pattern mining.

In the above example, pattern (a, b, c, d) has support 2 in S1 and support 1 in S2. Pattern (a, b, d) has support 2 in both S1 and S2. Fortunately, at the sequenceCompression step, pattern (a, b, d) will be removed from the frequent pattern list generated for S1 because it is a subsequence of a larger frequent pattern of the same support. It will therefore remain only on the frequent pattern list generated for S2 and will show up as a discriminative pattern.
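The sequenceCompression step amounts to dropping any frequent pattern that is a subsequence of another frequent pattern with the same support, as in the sketch below (illustrative only; the dictionary-based representation of patterns and supports is an assumption).

def is_subsequence(shorter, longer):
    it = iter(longer)
    return all(e in it for e in shorter)

def compress(patterns):
    # patterns: dict mapping a tuple of events to its support.
    kept = {}
    for p, sup in patterns.items():
        redundant = any(p != q and sup == patterns[q] and is_subsequence(p, q)
                        for q in patterns)
        if not redundant:
            kept[p] = sup
    return kept

# From the example above: (a, b, d) has the same support as the longer (a, b, c, d)
# in S1, so it is removed from S1's list but survives on S2's list.
s1_patterns = {("a", "b", "c", "d"): 2, ("a", "b", "d"): 2}
s2_patterns = {("a", "b", "c", "d"): 1, ("a", "b", "d"): 2}
print(compress(s1_patterns))
print(compress(s2_patterns))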
3.5 Handling Multi Attribute Events
As different event types can have a different number of attributes in the tuple, mining frequent patterns becomes much more challenging, as the mining algorithm has no prior knowledge of which attributes in a specific event are correlated with failure and which are not. For example, consider the following sequence of events:
<msg_sent,nodeid=1,msgtype=2,nodetype=l>
<msg_sent,nodeid=2,msgtype=2,nodetype=m>
In the above pattern, we do not know which of the attributes are correlated with failure (if any are related at all). It could be nodeid, or msgtype, or a combination of msgtype and nodetype, and so on. One trivial solution is to try all possible permutations. However, this is exponential in the number of attributes and becomes unmanageable very quickly. Rather, we split such multi-attribute events into a sequence of single-attribute events, each with only one attribute of the original multi-attribute event. The converted sequence for the above example is:

<msg_sent,nodeid=1>, <msg_sent,msgtype=2>, <msg_sent,nodetype=l>, <msg_sent,nodeid=2>, <msg_sent,msgtype=2>, <msg_sent,nodetype=m>

We can now apply simple (uni-dimensional) sequence mining techniques to the above sequence. As before, the user will be given the resulting sequences (those most correlated with failures). In such sequences, only the relevant attributes of the original multidimensional events will likely survive. Attributes irrelevant to the occurrence of failure will likely have a larger spread of values (since these values are orthogonal to the failure) and hence a lower support. Sequences containing them will consequently have a lower support as well. The top-ranking sequences are therefore more likely to focus only on the attributes of interest, which is what we want to achieve.
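The attribute-splitting transformation itself is mechanical, as the sketch below shows (illustrative only; the event representation is an assumption).

def split_attributes(events):
    converted = []
    for etype, attrs in events:
        for name, value in attrs.items():
            converted.append((etype, name, value))    # one single-attribute event per attribute
    return converted

log = [("msg_sent", {"nodeid": 1, "msgtype": 2, "nodetype": "l"}),
       ("msg_sent", {"nodeid": 2, "msgtype": 2, "nodetype": "m"})]
print(split_attributes(log))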
3.6 Handling Continuous Data Types
When logged parameters are continuous, it is very hard to identify frequent patterns, as there are potentially an infinite number of possible values for them. To map continuous data to a finite set, we simply discretize it into a number of categories (bins).
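A minimal equal-width binning sketch (the bin count and value range are illustrative):

def discretize(value, low, high, bins=8):
    # Map a continuous reading to a bin index in [0, bins - 1].
    if value <= low:
        return 0
    if value >= high:
        return bins - 1
    return int((value - low) / (high - low) * bins)

print(discretize(23.7, low=0.0, high=100.0))    # a reading of 23.7 falls into bin 1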
3.7 Optimal Partitioning of Data
It is challenging to separate the log into a part that contains only good behavior and a part that includes the cause of failure. In particular, how far back from the failure point should we look for the "bad" sequence? If we go too far back, we will mix the sequences that are responsible for the failure with sequences that are unrelated to it. If we do not go back far enough, we may miss the root cause of the failure. As a heuristic, we start with a small window size (the default is set to 200 events) and iteratively increase it if necessary. After applying the sequence mining algorithm with that window size, the tool performs differential analysis between good patterns and bad patterns. If the bad pattern list is empty after the analysis, the tool tries a larger window.
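The iterative enlargement of the "bad" window can be sketched as follows (illustrative only; mine_discriminative(good, bad) stands for the differential analysis step, and the growth factor is an assumption apart from the 200-event default mentioned above).

def find_bad_window(log, failure_index, mine_discriminative, start=200, factor=2):
    window = start
    while True:
        cut = max(0, failure_index - window)
        good_part, bad_part = log[:cut], log[cut:failure_index]
        patterns = mine_discriminative(good_part, bad_part)
        if patterns or cut == 0:    # found bad patterns, or the whole prefix is already used
            return window, patterns
        window *= factor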
3.8 Occurrence of Multiple Bugs
It is possible that there may be traces of multiple different bugs in the log. Fortunately, this does not impact the operation of our algorithm. As the algorithm does not assume anything about the number or types of bugs, it will report all plausible causes of failure as usual. In the presence of multiple bugs manifested through different bad event sequences, all such sequences will be found and reported.
We realize that the types of debugging algorithms needed are different for different applications and are going to evolve over time with the evolution of hardware and software platforms. Hence, we aim to develop a modular tool architecture that facilitates evolution and reuse. Keeping that in mind, we developed a software architecture that provides the necessary functionality and flexibility for future development. The goal of our architecture is to facilitate easy use of, and experimentation with, different debugging techniques and to foster future development. As there are numerous different types of hardware, programming abstractions, and operating systems in use for wireless sensor networks, the architecture must be able to accommodate different combinations of hardware and software. Different ways of collecting data should not affect the way the data analysis layer works. Similarly, we realize that different types of bugs may require different techniques to identify them, and we want to provide a flexible framework to experiment with different data analysis algorithms. Based on the above requirements, we separate the whole system into three subsystems: (i) a data collection front-end, (ii) data preprocessing middleware, and (iii) a data analysis back-end. This architecture is shown in Figure 1.
4.1 Data Collection Front-End
The role of the data collection front-end is to provide the debug information (i.e., log files) that can be analyzed for diagnosing failures. The source of this debug log is irrelevant to the data analysis subsystem. As shown in the figure, the developer may choose to analyze recorded radio communication messages obtained using a passive listening tool, execution traces obtained from simulation runs, run-time sequences of events obtained by logging on actual application motes, and so on. With this separation of concerns, the front-end developer can design and implement the data collection subsystem more efficiently and independently. The data collection front-end developer merely needs to provide the format of the recorded data. These data are used by the data preprocessing middleware to parse the raw recorded byte streams.
4.2 Data Preprocessing Middleware
This middleware, which sits between the data collection front-end and the data analysis back-end, provides the necessary functionality to change or modify one subsystem without affecting the other. The interface between the data collection front-end and the data analysis back-end is further divided into the following layers:
– Data cleaning layer: This layer is front-end specific; each supported front-end has one instance of it. The layer is the interface between the particular data collection front-end and the data preprocessing middleware. It ensures that the recorded events are compliant with format requirements.
– Data parsing layer: This layer is provided by our framework and is responsible for extracting meaningful records from the recorded raw byte stream. To parse the recorded byte stream, this layer requires a header file describing the recorded message format. This information is provided by the application developer (i.e., the user of the data collection front-end).
– Data labeling layer: To be able to identify the probable causes of failure, the data analysis subsystem needs samples of logged events representing both "good" and "bad" behavior. As "good" or "bad" behavior semantics are an application-specific criterion, the application developer needs to implement a predicate (a small module) whose interface is provided by the framework. The predicate, presented with an ordered event log, decides whether the behavior is good or bad (a minimal sketch of such a predicate is given after this list).
– Data conversion layer: This layer provides the interface between the data preprocessing middleware and the data analysis subsystem. One instance of this layer exists for each different analysis back-end. This layer is responsible for converting the labeled data into the appropriate format for the data analysis algorithm. The interface of this data conversion layer is provided by the framework. As different data analysis algorithms and techniques can be used for analysis, each may have different input format requirements. This layer provides the necessary functionality to accommodate supported data analysis techniques.

Fig. 1. System architecture: application-specific data collection front-ends (runtime logging), the data preprocessing middleware (data cleaning, parsing, labeling, and conversion), and data analysis back-ends (e.g., WEKA, the discriminative frequent pattern miner, and a graphical visualizer)
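A minimal sketch of the labeling predicate mentioned above is given here; the function name, event field names, and the success criterion are invented for illustration and do not reflect the framework's actual interface.

def label_run(events):
    # Example predicate for a simple collection application:
    # a run is "good" if every transmitted packet was eventually acknowledged.
    sent = sum(1 for e in events if e["type"] == "Packet_Sent")
    acked = sum(1 for e in events if e["type"] == "Ack_Received")
    return "good" if acked >= sent else "bad"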
4.3 Data Analysis Back-End

At present, we implement the data analysis algorithm and its modifications presented earlier. It is responsible for identifying the causes of failures. The approach is extensible. As newer analysis algorithms are developed that catch more or different types of bugs, they can easily be incorporated into the tool as alternative back-ends. Such algorithms can be applied in parallel to analyze the same set of logs to find different problems with them.
We describe two case studies, first published by the authors in SenSys 2008 [16], to demonstrate the use of the new diagnostic debugging tool. The first case study presents a kernel-level bug in the LiteOS operating system. The second presents an example of debugging a multichannel Media Access Control (MAC) protocol [18] implemented in TinyOS 2.0 for the MicaZ platform with one half-duplex radio interface.
5.1 Case Study - I: LiteOS Bug
In this case study, we troubleshoot a simple data collection application where several sensors monitor light and report it to a sink node. The communication is performed in a single-hop environment. In this scenario, sensors transmit packets to the receiver, and the receiver records received packets and sends an "ACK" back. The sending rate that sensors use is variable and depends on the variations in their readings. After receiving each message, depending on its sequence number, the receiver decides whether to record the value or not. If the sequence number is older than the last sequence number it has received, the packet is dropped. This application is implemented using MicaZ motes on the LiteOS operating system and is tested on an experimental testbed. Each of the nodes is connected to a desktop computer via an MIB520 programming board and a serial cable. The PC acts as the base station. In this experiment, there was one receiver (the base node) and a set of 5 senders (monitoring sensors). This experiment illustrates a typical experimental debugging setup. Prior to deployment, programmers would typically test the protocol on target hardware in the lab. This is how such a test might proceed.
Failure Scenario. When this simple application was stress-tested, some of the nodes would crash occasionally and non-deterministically. Each time, different nodes would crash, and at different times. Perplexed by the situation, the developer (a first-year graduate student with no prior experience with sensor networks) decided to log different types of events using LiteOS support and our debugging tool. These were mostly kernel-level events along with a few application-level events. The built-in logging functionality provided by LiteOS was used to log the events. A subset of the different types of events that were logged is listed in Figure 2.
Recorded events (attribute list in parentheses):
Context_Switch_To_User_Thread (Null)
Get_Current_Thread_Index (Null)
Get_Current_Radio_Info_Address (Null)
Get_Current_Radio_Handle_Address (Null)
Get_Current_Serial_Info_Address (Null)
Get_Serial_Send_Function (Null)
Disable_Radio_State (Null)
Yield_To_System_Thread (Null)
Get_Current_Thread_Address (Null)
Get_Radio_Send_Function (Null)
Mutex_Unlock_Function (Null)

Fig. 2. A subset of the recorded event types and their attribute lists

Fig. 4. Discriminative frequent patterns found only in the "bad" log for the LiteOS bug
Failure Diagnosis. After running the experiment, "good" logs were collected from the nodes that did not crash during the experiment, and "bad" logs were collected from nodes that crashed at some point in time. After applying our discriminative frequent pattern mining algorithm to the logs, we provided two sets of patterns to the developer: one set includes the highest-ranked discriminative patterns that are found only in the "good" logs, as shown in Figure 3, and the other set includes the highest-ranked discriminative patterns that are found only in the "bad" logs, as shown in Figure 4.
Based on the discriminative frequent patterns, it is clear that in the "good" pile, the Packet_Received event is highly correlated with the Get_Current_Radio_Handle event. On the other hand, in the "bad" pile, though the Packet_Received event is present, the other event is missing. In the "bad" pile, Packet_Received is highly correlated with the Get_Serial_Send_Function event. From these observations, it is clear that proceeding with Get_Serial_Send_Function when Get_Current_Radio_Handle is missing is the most likely cause of failure.
To explain the error, we briefly describe the way a received packet is handled in LiteOS. In the application, the receiver always registers for receiving packets, then waits until a packet arrives. At that time, the kernel switches back to the user thread with the appropriate packet information. The packet is then processed in the application. However, at very high data rates, another packet can arrive before the processing of the previous packet has finished. In that case, the LiteOS kernel overwrites the radio receive buffer with new information even if the user is still using the old packet data to process the previous packet. Indeed, for correct operation, the Packet_Received event always has to be followed by the Get_Current_Radio_Handle event before the Get_Serial_Send_Function event; otherwise the node crashes. Overwriting a receive buffer for some reason is a very typical bug in sensor networks. This example is presented to illustrate the use of the tool. In Section 5.2 we present a more complex example that explores more of the interactive complexity this tool was truly designed to uncover.
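The ordering rule uncovered by the tool can be checked directly on a log, as in the sketch below (illustrative only; this check is not part of the tool itself).

def violates_ordering(events):
    awaiting_handle = False
    for e in events:
        if e == "Packet_Received":
            awaiting_handle = True
        elif e == "Get_Current_Radio_Handle":
            awaiting_handle = False
        elif e == "Get_Serial_Send_Function" and awaiting_handle:
            return True    # serial send reached before the radio handle was fetched
    return False

print(violates_ordering(["Packet_Received", "Get_Current_Radio_Handle",
                         "Get_Serial_Send_Function"]))                      # False
print(violates_ordering(["Packet_Received", "Get_Serial_Send_Function"]))   # True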
5.2 Case Study - II: Multichannel MAC Protocol
In this case study, we debug a multichannel MAC protocol. The objective of the protocol used in our study is to assign a home channel to each node in the network dynamically in such a way that throughput is maximized. The design of the protocol exploits the fact that in most wireless sensor networks, the communication rate among different nodes is not uniform (e.g., in a data aggregation network). Hence, the problem was formulated in such a way that nodes communicating frequently are clustered together and assigned the same home channel, whereas nodes that communicate less frequently are clustered into different channels. This minimizes the overhead of channel switching when nodes need to communicate. This protocol was recently published in [18].
During experimentation with the protocol, it was noticed that when the data rate between different internally closely-communicating clusters is low, the multichannel protocol outperforms a single-channel MAC protocol comfortably, as it should. However, when the data rate between clusters was increased, while the throughput near the base station still outperformed a single-channel MAC significantly, nodes farther from the base station performed worse than with the single-channel MAC. This should not have happened in a well-designed protocol, as the multichannel MAC protocol should utilize the communication spectrum better than a single-channel MAC. The author of the protocol initially concluded that the performance degradation was due to the overhead associated with communication across clusters assigned to different channels. Such communication entails frequent channel switching, as the sender node, according to the protocol, must switch to the frequency of the receiver before transmission and then return to its home channel. This incurs overhead that increases with the transmission rate across clusters. We decided to verify this conjecture.
As a stress test of our tool, we instrumented the protocol to log events related to the MAC layer (such as message transmission and reception as well as channel switching) and used our tool to determine the discriminative patterns generated from different runs with different message rates, some of which performed better than others. For a better understanding of the failure scenario detected, we briefly describe the operation of the multichannel MAC protocol below.
Multichannel MAC Protocol Overview. In the multichannel MAC protocol, each node initially starts at channel 0 as its home channel. To communicate with others, every node maintains a data structure called the "neighbor table" that stores the home channel of each of its neighboring nodes. Channels are organized as a ladder, numbered from lowest (0) to highest (12). When a node decides to change its home channel, it sends out a "Bye" message on its current home channel which includes its new home channel number. Upon receiving a "Bye" message, each other node updates its neighbor table to reflect the new home channel number for the sender of the "Bye" message. After changing its home channel, a node sends out a "Hello" message on the new home channel which includes its nodeID. All neighboring nodes on that channel add this node as a new neighbor and update their neighbor tables accordingly.
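A minimal sketch of the neighbor-table bookkeeping described above is given below, assuming a simple dictionary per node; the class and method names are our own and are not taken from the protocol's implementation.

```python
# Illustrative sketch of neighbor-table updates on "Bye"/"Hello" messages
# (names and structure are assumptions, not the published implementation).
class NodeState:
    def __init__(self, node_id, home_channel=0):
        self.node_id = node_id
        self.home_channel = home_channel      # channels 0..12, organized as a ladder
        self.neighbor_table = {}              # neighbor_id -> last known home channel

    def on_bye(self, sender_id, new_channel):
        # "Bye" is heard on the sender's old home channel and carries its new one.
        self.neighbor_table[sender_id] = new_channel

    def on_hello(self, sender_id, channel_heard_on):
        # "Hello" is heard on the sender's new home channel and carries its nodeID.
        self.neighbor_table[sender_id] = channel_heard_on

node = NodeState(node_id=7)
node.on_bye(sender_id=3, new_channel=5)       # neighbor 3 announces it is moving to channel 5
node.on_hello(sender_id=9, channel_heard_on=node.home_channel)
print(node.neighbor_table)                    # {3: 5, 9: 0}
```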
To increase robustness to message loss, the protocol also includes a mechanism for discovering the home channel of a neighbor when its current entry in the neighbor table becomes stale. When a node sends a message to a receiver on that receiver's home channel (as listed in the neighbor table) but does not receive an "ACK" after n tries (n is set to 5), it assumes that the destination node is not on its home channel. The reason may be that the destination node has changed its home channel permanently but the notification was lost. Instead of wasting more time on retransmissions on the same channel, the sender starts scanning all channels, asking if the receiver is there. The purpose is to find the receiver's new home channel and update the neighbor table accordingly. The destination node will eventually hear this data message and reply when it is on its home channel. Since the above mechanism is expensive, as an optimization, overhearing is used to reduce staleness of the neighbor table. Namely, a node updates the home channel of a neighbor in its neighbor table when the node overhears an acknowledgement ("ACK") from that neighbor sent on that channel. Since the "ACK"s are used as a mechanism to infer home channel information, whenever a node switches channels temporarily (e.g., to send to a different node on the home channel of the latter), it delays sending out "ACK" messages until it comes back to its home channel, in order to prevent incorrect updates of neighbor tables by recipients of such ACKs.
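The retry-then-scan recovery and the ACK-suppression rule can be summarized in the hedged sketch below; the retry limit of 5 is taken from the text, while the helper names, the FakeRadio stand-in, and the scanning loop structure are illustrative assumptions.

```python
# Illustrative sender-side recovery for stale neighbor-table entries
# (all names are assumptions; FakeRadio only simulates ACK behaviour).
MAX_RETRIES = 5          # 'n' from the protocol description
NUM_CHANNELS = 13        # channels 0..12

class FakeRadio:
    def __init__(self, dest_home_channel):
        self.dest_home_channel = dest_home_channel
    def send_and_wait_ack(self, channel):
        return channel == self.dest_home_channel    # ACK only on the real home channel

def send_with_recovery(radio, neighbor_table, dest):
    channel = neighbor_table.get(dest, 0)
    for _ in range(MAX_RETRIES):                    # retry on the recorded channel
        if radio.send_and_wait_ack(channel):
            return channel
    for ch in range(NUM_CHANNELS):                  # stale entry: scan every channel
        if radio.send_and_wait_ack(ch):
            neighbor_table[dest] = ch               # learn the new home channel
            return ch
    return None

def should_send_ack(home_channel, current_channel):
    # The protocol suppresses ACKs while a node is off its home channel, so that
    # overhearing nodes do not record a transient channel as the home channel.
    return current_channel == home_channel

table = {4: 2}                                      # we believe node 4 lives on channel 2
radio = FakeRadio(dest_home_channel=7)              # but it actually moved to channel 7
print(send_with_recovery(radio, table, dest=4), table)   # 7 {4: 7}
```

As the diagnosis later in this section shows, it is the ACK suppression, rather than the channel switching itself, that triggers frequent scanning under heavy inter-cluster traffic.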
Finally, to estimate channel conditions, each node periodically broadcasts a "channelUpdate" message which contains information about successfully received and sent messages during the last measurement period (where the period is set at compile time). Based on that information, each node calculates the channel quality (i.e., the probability of successfully accessing the medium), and uses that measure to probabilistically decide whether or not to change its home channel. Nodes that sink a lot of traffic (e.g., aggregation hubs or cluster heads) switch first. Others that communicate heavily with them follow. This typically results in a natural separation of node clusters into different frequencies so they do not interfere.
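The paper does not give the exact quality formula, so the following sketch only illustrates the general shape of such a periodic, probabilistic switching decision; the quality estimate and the switching probability used here are assumptions for illustration.

```python
# Illustrative channelUpdate bookkeeping and probabilistic switch decision.
# The quality formula and switch probability are assumptions, not the
# published protocol's exact rules.
import random

class ChannelStats:
    """Per-period counters carried in the periodic 'channelUpdate' broadcast."""
    def __init__(self):
        self.attempts = 0
        self.successes = 0       # successfully sent or received messages
    def quality(self):
        # Assumed estimate of the probability of successfully accessing the medium.
        return 1.0 if self.attempts == 0 else self.successes / self.attempts

def maybe_switch_channel(stats, current_channel, num_channels=13):
    # Assumption: switch with probability proportional to how bad the channel is.
    if random.random() < (1.0 - stats.quality()):
        return (current_channel + 1) % num_channels    # move up the channel ladder
    return current_channel

stats = ChannelStats()
stats.attempts, stats.successes = 100, 40              # a fairly congested channel
print(maybe_switch_channel(stats, current_channel=0))  # switches with probability 0.6
```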
Performance Problem. This protocol was executed on 16 MicaZ motes implementing an aggregation tree where several aggregation cluster heads filter data received from their children, significantly reducing the amount forwarded, then send that reduced data to a base station. When the data rate across clusters was low, the protocol outperformed the single channel MAC. However, when the data rate among clusters was increased, the performance of the protocol deteriorated significantly, performing worse than a single channel MAC in some cases. The developer of the protocol assumed that this was due to the overhead associated with the channel change mechanism, which is incurred when communication happens heavily among different clusters. Much debugging effort was spent in that direction with no result.
Failure Diagnosis. To diagnose the cause of the performance problem, we logged the MAC events relating to radio frequency changes and message communication. The question posed to our tool was "Why is the performance bad at higher data rate?" To answer this question, we first executed the protocol at low data rates (when the performance is better than the single channel MAC) to collect logs representing "good" behavior. We then executed the protocol again with a high data rate (when the performance is worse than the single channel MAC) to collect logs representing "bad" behavior.
After performing discriminative pattern analysis, the list of the top 5 discriminative patterns produced by our tool is shown in Figure 5.
The sequences indicate that, in all cases, there seems to be a problem with not receiving acknowledgements. The lack of acknowledgements causes a channel scanning pattern to unfold. This is shown as the Retry Transmission event on different channels, as a result of not receiving acknowledgements. Hence, the problem does not lie in the frequent overhead of senders changing their channel to that of their receiver in order to send a message across clusters. The problem lay in the frequent lack of response (an ACK) from a receiver. At the first stage of frequent pattern mining, No Ack Received is identified as the most frequent event. At the second stage, the algorithm searched for frequent patterns in the top K (e.g., top 5) segments of the logs where the No Ack Received event occurred with the highest frequency. The second stage of the log analysis (correlating frequent events to preceding ones) then uncovered that the lack of an ACK from the receiver is preceded by a temporary channel change. This gave away the bug. As we described earlier, whenever a node changes its channel temporarily, it disables "ACK"s until it comes back to its home channel. In a high intercluster communication scenario, disabling the "ACK" is a bad decision for a node that spends a significant amount of time communicating with other clusters on channels other than its own home channel. As a side effect, nodes which are trying to communicate with it fail to receive an "ACK" for a long time and start scanning channels frequently, looking for the missing receiver. Another interesting aspect of the problem that was discovered is its cascading effect. When we look at the generated discriminative patterns across multiple nodes (not shown due to space limitations), we see that the scanning patterns revealed in the logs in fact cascade. Channel scanning at the destination node often triggers channel scanning at the sender node, and this interesting cascaded effect was also captured by our tool.
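The two-stage analysis described above can be approximated by the following hedged sketch: the first stage counts single events that are frequent in the "bad" logs but not in the "good" logs, and the second stage mines short patterns only inside the log segments surrounding the top event. The segmentation and counting details here are simplifications, not the tool's actual algorithm.

```python
# Simplified two-stage discriminative mining over event logs (a sketch, not the
# tool's actual implementation).
from collections import Counter

def stage1_top_event(good_log, bad_log):
    """Find the event whose frequency differs most between bad and good logs."""
    good, bad = Counter(good_log), Counter(bad_log)
    return max(set(bad), key=lambda e: bad[e] - good.get(e, 0))

def stage2_patterns(bad_log, anchor_event, window=5, top_k=5, max_len=3):
    """Mine frequent event subsequences inside the top-K segments around the anchor."""
    positions = [i for i, e in enumerate(bad_log) if e == anchor_event][:top_k]
    segments = [bad_log[max(0, p - window):p + 1] for p in positions]
    patterns = Counter()
    for seg in segments:
        for length in range(2, max_len + 1):
            for start in range(len(seg) - length + 1):
                patterns[tuple(seg[start:start + length])] += 1
    return patterns.most_common(5)

good = ["Send", "Ack_Received", "Send", "Ack_Received"]
bad = ["Temp_Channel_Change", "Send", "No_Ack_Received", "Retry_Transmission",
       "Temp_Channel_Change", "Send", "No_Ack_Received", "Retry_Transmission"]
anchor = stage1_top_event(good, bad)     # -> "No_Ack_Received" (or another bad-only event)
print(anchor, stage2_patterns(bad, anchor))
```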
As a quick fix, we stopped disabling the "ACK" when a node is outside its home channel. This may appear to violate some correctness semantics, because a node may now send an ACK while temporarily being on a channel other than its home. This, one would think, will pollute the neighbor tables of nodes that overhear the ACK, because they will update their tables to indicate an incorrect home channel. In reality, the performance of the MAC layer improved significantly (by up to 50%), as shown in Figure 6. In retrospect, this is not unexpected. As intercluster communication increases, the distinction between one's home channel and the home channel of another node with which one communicates a lot becomes fuzzy, as one spends more and more time on that other node's home channel (to send messages to it). When ACKs are never disabled, the neighbor tables of nodes will tend to record, with a higher probability, the channel on which each neighbor spends most of its time. This could be the neighbor's home channel or the channel of a node downstream with which the neighbor communicates a lot. The distinction becomes immaterial as long as the neighbor can be found on that channel with a high probability. Indeed, complex interaction problems often seem simple when explained but are sometimes hard to think of at design time. Dustminer was successful at uncovering the aforementioned interaction and significantly improving the performance of the MAC protocol in question.
Fig. 6. Performance improvement after the bug fix (successful send and successful receive counts)

A Note on Scalability. We compare our results with using the Apriori algorithm for sequence mining. Due to the huge number of events logged for this case study (about 40000 for the "good" logs and 40000 for the "bad" logs), we could not generate frequent patterns of length more than 2 using Apriori. To generate frequent patterns of length 2 for the 40000 events in the "good" log, it took 1683.02 seconds (28 minutes), and to finish the whole computation including differential analysis it took 4323 seconds (72 minutes). With our two-stage mining scheme, it took 5.547 seconds to finish the first stage, and finishing the whole computation including differential analysis took 332.924 seconds (6 minutes). In terms of the quality of the generated sequences (which is often correlated with the length of the sequence), our algorithm returned discriminative sequences of length up to 8, which was enough to understand the chain of events causing the problem as illustrated above. We tried to generate frequent patterns of length 3 with Apriori, but terminated the process after one day of computation that remained in progress. We used a machine with a 2.53 GHz CPU and 512 MB of RAM. The generated patterns of length 2 were insufficient to give insight into the problem.
Debugging Overhead. To test the impact of logging on application behavior, we ran the multichannel MAC protocol with logging enabled and without logging enabled, at both a moderate data rate and a high data rate. The network was set up as a data aggregation network.
For the moderate data rate experiments, the source nodes (nodes that only send messages) were set to transmit data at a rate of 10 messages/sec, the intermediate nodes were set to transmit data at a rate of 2 messages/sec, and one node was acting as the base station (which only receives messages). We tested this on an 8-node network with 5 source nodes, 2 intermediate nodes, and one base station. Over multiple runs, after taking the average to get a reliable estimate, the average number of successfully transmitted messages increased by 9.57% and the average number of successfully received messages increased by 2.32%. The most likely reason is that writing to flash was creating a randomization effect, which probably helped to reduce interference at the MAC layer.
At the high data rate, source nodes were set to transmit data at a rate of 100 messages/sec and intermediate nodes were set to transmit data at a rate of 20 messages/sec. Over multiple runs, after taking the average to get a reliable estimate, the average number of successfully transmitted messages was reduced by 1.09% and the average number of successfully received messages dropped by 1.62%. The most likely reason is that the overhead of writing to flash kicked in at such a high data rate and eventually reduced the advantage experienced at the low data rate.
The performance improvement of the multichannel MAC protocol reported in this paper was obtained by running the protocol at the high data rate, to prevent overestimation.
We realize that this effect of logging may change the behavior of the original application slightly, but that effect seems to be negligible in our experience and did not affect the diagnostic capability of the discriminative pattern mining algorithm, which is inherently robust against minor statistical variance.
As the multichannel MAC protocol did not use flash memory to store any data, we were able to use the whole flash for logging events. To test the relation between the quality of the generated discriminative patterns and the logging space used, we used 100KB, 200KB and 400KB of flash space in three different experiments. The generated discriminative patterns were similar. We realize that different applications have different flash space requirements, and the amount of logging space may affect the diagnostic capability. To help under severe space constraints, we provide the radio interface so users can choose to log at different times instead of logging continuously. Users can also choose to log events at different resolutions (e.g., instead of logging every message transmitted, log only every 50th message transmitted).
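A hedged sketch of such resolution-based (sampled) logging is shown below; the class and parameter names are our own and only illustrate the idea of recording every Nth occurrence of an event type, not the tool's logging API.

```python
# Illustrative sampled logging: record only every Nth occurrence of an event
# (names and structure are assumptions, not the tool's logging interface).
class SampledLogger:
    def __init__(self, every_nth=50):
        self.every_nth = every_nth
        self.counts = {}
        self.records = []

    def log(self, event_type, detail=None):
        n = self.counts.get(event_type, 0) + 1
        self.counts[event_type] = n
        if n % self.every_nth == 0:          # keep 1 out of every_nth events
            self.records.append((event_type, n, detail))

logger = SampledLogger(every_nth=50)
for seq in range(1, 201):
    logger.log("Message_Transmitted", detail=seq)
print(len(logger.records))                   # 4 records instead of 200
```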
For the LiteOS case study, we did not use flash space at all, as the events were transmitted to the base station (PC) directly over a serial connection, eliminating the flash space overhead completely; this makes our tool easily usable for testbeds, which often provide serial connections.
In this paper, we presented a sensor network troubleshooting tool that helps the developer diagnose the root causes of errors. The tool is geared towards finding interaction bugs. Very successful examples of debugging tools that hunt for localized errors in code have been produced in the previous literature. The point of departure in this approach lies in focusing on errors that are not localized (such as a bad pointer or an incorrect assignment statement) but rather arise because of adverse interactions among multiple components, each of which appears to be correctly designed. With increased distribution and resource constraints, the interactive complexity of sensor network applications will remain high, motivating tools such as the one we described. Future development of the tool will focus on scalability and the user interface, to reduce the time and effort needed to understand and use it.
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 1994 Int. Conf. Very Large Data Bases (VLDB 1994), Santiago, Chile, September 1994, pp. 487–499 (1994)
2. Cao, Q., Abdelzaher, T., Stankovic, J., He, T.: LiteOS, a UNIX-like operating system and programming platform for wireless sensor networks. In: IPSN/SPOTS, St. Louis, MO (April 2008)
3. Clark, D.: Internet meets sensors: Should we try for architecture convergence? In: Networking of Sensor Systems (NOSS) Principal Investigator and Informational Meeting (October 2005), http://www.eecs.harvard.edu/noss/slides/info/keynote1/clark.ppt
4. Dasu, T., Johnson, T.: Exploratory Data Mining and Data Cleaning. John Wiley & Sons, Chichester (2003)
5. Ganti, V., Gehrke, J., Ramakrishnan, R.: Mining very large databases. Computer 32, 38–45 (1999)
6. Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. In: Proc. ICDM 2003 Int. Workshop on Frequent Itemset Mining Implementations (FIMI 2003), Melbourne, FL (November 2003)
7. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery 15, 55–86 (2007)
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
9. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD 2000), Dallas, TX, May 2000, pp. 1–12 (2000)
10. Hand, D.J., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press, Cambridge (2001)
11. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2001)
12. Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., Silva, F.: Directed diffusion for wireless sensor networking. IEEE/ACM Trans. Netw. 11(1), 2–16 (2003)
13. Khan, M., Le, H., Ahmadi, H., Abdelzaher, T., Han, J.: DustMiner: Troubleshooting interactive complexity bugs in sensor networks. In: Proc. 2008 ACM Int. Conf. on Embedded Networked Sensor Systems (Sensys 2008), Raleigh, NC (November 2008)
14. Khan, M., Abdelzaher, T., Gupta, K.: Towards diagnostic simulation in sensor networks. In: Nikoletseas, S.E., Chlebus, B.S., Johnson, D.B., Krishnamachari, B. (eds.) DCOSS 2008. LNCS, vol. 5067, pp. 252–265. Springer, Heidelberg (2008)
15. Khan, M., Abdelzaher, T., Luo, L.: SNTS: Sensor network troubleshooting suite. In: Aspnes, J., Scheideler, C., Arora, A., Madden, S. (eds.) DCOSS 2007. LNCS, vol. 4549, pp. 142–157. Springer, Heidelberg (2007)
16. Khan, M., Le, H.K., Ahmadi, H., Abdelzaher, T., Han, J.: Dustminer: Troubleshooting interactive complexity bugs in sensor networks. In: ACM Sensys, Raleigh, NC (November 2008)
17. Le, H.K., Henriksson, D., Abdelzaher, T.F.: A practical multi-channel media access control protocol for wireless sensor networks. In: IPSN, pp. 70–81 (2008)
18. Lee, H.K., Henriksson, D., Abdelzaher, T.: A practical multi-channel medium access control protocol for wireless sensor networks. In: International Conference on Information Processing in Sensor Networks (IPSN 2008), St. Louis, Missouri (April 2008)
19. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.-C.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans. Knowledge and Data Engineering 16, 1424–1440 (2004)
20. Pevzner, P.A.: Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge (2000)
21. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and performance improvements. In: Proc. 5th Int. Conf. Extending Database Technology (EDBT 1996), Avignon, France, March 1996, pp. 3–17 (1996)
22. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison Wesley, Reading (2005)
23. Tang, Z., MacLennan, J.: Data Mining with SQL Server 2005. John Wiley & Sons, Chichester (2005)
24. Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In: Proc. ICDM 2004 Int. Workshop on Frequent Itemset Mining Implementations (FIMI 2004) (November 2004)
25. Weiss, S.M., Indurkhya, N.: Predictive Data Mining. Morgan Kaufmann, San Francisco (1998)
26. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
27. Yan, X., Han, J., Afshar, R.: CloSpan: Mining closed sequential patterns in large datasets. In: Proc. 2003 SIAM Int. Conf. Data Mining (SDM 2003), San Francisco,
Monitoring Incremental Histogram Distribution for Change Detection in Data Streams
Raquel Sebastião1,2, João Gama1,3, Pedro Pereira Rodrigues1,2,4, and João Bernardes4,5
1 LIAAD - INESC Porto, L.A Rua de Ceuta, 118, 6
4050-190 Porto, Portugal
2 Faculty of Science, University of Porto
3 Faculty of Economics, University of Porto
4 Faculty of Medicine, University of Porto
5 INEB, Porto
{raquel,jgama}@liaad.up.pt,{pprodrigues,joaobernardes}@med.up.pt
Abstract. Histograms are a common technique for density estimation and they have been widely used as a tool in exploratory data analysis. Learning histograms from static and stationary data is a well known topic. Nevertheless, very few works discuss this problem when we have a continuous flow of data generated from dynamic environments.
The scope of this paper is to detect changes from high-speed time-changing data streams. To address this problem, we construct histograms able to process examples once, at the rate they arrive. The main goal of this work is to continuously maintain a histogram consistent with the current state of nature. We study strategies to detect changes in the distribution generating examples, and adapt the histogram to the most recent data by forgetting outdated data. We use the Partition Incremental Discretization algorithm, which was designed to learn histograms from high-speed data streams.
We present a method to detect whenever a change in the distribution generating examples occurs. The base idea consists of monitoring distributions from two different time windows: the reference window, reflecting the distribution observed in the past, and the current window, which receives the most recent data. The current window is cumulative and can have a fixed or an adaptive step depending on the distance between distributions. We compare both distributions using the Kullback-Leibler divergence, defining a threshold for the change detection decision based on the asymmetry of this measure.
We evaluated our algorithm with controlled artificial data sets and compare the proposed approach with nonparametric tests. We also present results with real-world data sets from industrial and medical domains. Those results suggest that an adaptive window step exhibits a high probability of change detection and faster detection rates, with few false positive alarms.

Keywords: Change detection, Data streams, Machine learning, Learning histograms, Monitoring data distribution, Adaptive cumulative windows.
1 Introduction
Nowadays, the scenario of finite stored data sets is no longer appropriate, because information is gathered in the form of transient and infinite data streams. As a massive amount of information is produced at a high-speed rate, it is no longer possible to use algorithms that require storing the full historic data in main memory. In data streams the data elements are continuously received, treated, and discarded. In this context, processing time, memory, and sample size are the crucial constraints in knowledge discovery systems [3]. Due to the exploratory nature of the data and to time restrictions, a user may prefer a fast but approximate answer to an exact but slow answer. Methods to deal with these issues consist of applying synopsis techniques, such as histograms [12,16,19,26], sketches [8] and wavelets [7,15]. Histograms are one of the techniques used in data stream management systems to speed up range queries and selectivity estimation (the proportion of tuples that satisfy a query), two illustrative examples where fast but approximate answers are more useful than slow and exact ones.
In the context of open-ended data streams, as we never observe all values of the random variable, it is not appropriate to use traditional histograms to construct a graphical representation of continuous data, because they require knowledge of all the data. Thus, algorithms that conveniently address this issue are still missing. The Partition Incremental Discretization [12,26] and the V-Optimal histograms [14,16,18] are two examples. A key characteristic of a data stream is its dynamic nature. The process generating data is not strictly stationary and evolves over time. The target concept may gradually change over time. Moreover, when data is collected over time, at least for large periods of time, it is not acceptable to assume that the observations are generated at random according to a stationary probability distribution. Several methods in machine learning have been proposed to deal with concept drift [11,17,21,23,26,28,29]. Drifting concepts are often handled by time windows or by weighting examples according to their age or utility. Another approach to detect drifting concepts is to monitor distributions on two different time windows, following the evolution of a statistical function between two distributions: from past data in a reference window and in a current window with the most recent data points [20,27].
1.1 Previous Work
In a previous work [27], we presented a method to detect changes in data streams. In that work, we constructed histograms using the two-layer structure of the Partition Incremental Discretization (PiD) algorithm and addressed the detection problem by monitoring distributions using a fixed window model. In this work, we propose a new definition of the number of histogram bins and the use of an adaptive-cumulative window model to detect changes. We also perform studies on the distance measures and advance a discrepancy measure based on the asymmetry of the Kullback-Leibler Divergence (KLD). We support this decision with previous results. The results of [27] suggest that the KLD achieves faster detection rates than the other tested distance measures (a measure based on entropy and the cosine distance).
1.2 Motivation, Challenges and Paper Outline
The motivation for studying time-changing high-speed data streams comes from the emergence of temporal applications such as communications networks, web searches, financial applications, and sensor data, which produce massive streams of data. Since it is impractical to store all data completely in memory, new algorithms are needed to process data online at the rate it becomes available. Another challenge is to create compact summaries of data streams. Histograms are in fact compact representations of continuous data. They can be used as a component in more sophisticated data mining algorithms, like decision trees [17].
As the distribution underlying the data elements may change over time, the development of methods to detect when and how the process generating the stream is evolving is the main challenge of this study. The main contribution of this paper is a new method to detect changes when learning histograms using adaptive windows. We are able to detect a time window where a change has occurred. Another contribution is an improved technique to initialize histograms satisfying a user constraint on the admissible relative error.
The proposed method has potential use in the medical and industrial domains, namely in monitoring biomedical signals and production processes, respectively. The paper is organized as follows. The next section presents an algorithm to continuously maintain histograms over a data stream. In Section 3 we extend the algorithm for change detection. Section 4 presents a preliminary evaluation of the algorithm on benchmark datasets and real-world problems. The last section concludes the paper and presents some future research lines.
Histograms are one of the most used tools in exploratory data analysis. They present a graphical representation of data, providing useful information about the distribution of a random variable. A histogram is visualized as a bar graph that shows frequency data. The basic algorithm to construct histograms consists of sorting the values of the random variable and placing them into bins. Next, it counts the number of data samples in each bin. The height of the bar drawn on top of each bin is proportional to the number of observed values in that bin.
A histogram is defined by a set of $k$ non-overlapping intervals, and each interval is defined by its boundaries and a frequency count. The most used histograms are either equal width, where the range of observed values is divided into $k$ intervals of equal length ($\forall i,j: (b_i - b_{i-1}) = (b_j - b_{j-1})$), or equal frequency, where the range of observed values is divided into $k$ bins such that the counts in all bins are equal ($\forall i,j: f_i = f_j$).
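For concreteness, the sketch below builds both kinds of histogram from a batch of values; it is a plain illustration of the definitions above, not the incremental algorithm proposed in this paper.

```python
# Equal-width vs. equal-frequency binning for a static batch of values
# (a plain illustration of the definitions, not the incremental PiD algorithm).
def equal_width(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    breaks = [lo + i * width for i in range(1, k)]
    counts = [0] * k
    for x in values:
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return breaks, counts

def equal_frequency(values, k):
    xs = sorted(values)                      # requires a sort over all the data
    per_bin = len(xs) // k
    breaks = [xs[i * per_bin] for i in range(1, k)]
    counts = [per_bin] * (k - 1) + [len(xs) - per_bin * (k - 1)]
    return breaks, counts

data = [0.1, 0.4, 0.5, 0.9, 1.2, 1.4, 2.0, 2.2, 3.1, 3.3, 3.4, 4.0]
print(equal_width(data, 4))
print(equal_frequency(data, 4))
```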
When all the data is available, there are exact algorithms to construct histograms [25]. All these algorithms require a user-defined parameter $k$, the number of bins. Suppose we know the range of the random variable (domain information) and the desired number of intervals $k$. The algorithm to construct equal-width histograms traverses the data once, whereas in the case of equal-frequency histograms a sort operation is required.
One of the main problems of using histograms is the definition of the number of intervals. A rule that has been used is Sturges' rule: $k = 1 + \log_2 n$, where $k$ is the number of intervals and $n$ is the number of observed data points. This rule has been criticized because it implicitly uses a binomial distribution to approximate an underlying normal distribution.¹ Sturges' rule has probably survived because, for moderate values of $n$ (less than 200), it produces reasonable histograms. However, it does not work for large $n$. Scott gave a formula for the optimal histogram bin width which asymptotically minimizes the integrated mean square error. Since the underlying density is usually unknown, he suggested using the Gaussian density as a reference standard, which leads to the data-based choice for the bin width of $a \times s \times n^{-1/3}$, where $a = 3.49$ and $s$ is the estimate of the standard deviation.
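As a quick worked comparison of these rules (our own numerical example, not from the paper): for $n = 10000$ observations, Sturges' rule gives $k = 1 + \log_2 10000 \approx 1 + 13.3 \approx 14$ bins, while Scott's rule with a sample standard deviation of $s = 1$ gives a bin width of $3.49 \times 1 \times 10000^{-1/3} \approx 0.16$; over a data range of, say, 8 units that corresponds to roughly 50 bins, illustrating how Sturges' rule under-bins large samples.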
In exploratory data analysis, histograms are used iteratively. The user tries several histograms using different values of $k$ (the number of intervals), and chooses the one that best fits his purposes.
2.1 The Partition Incremental Discretization (PID)
The Partition Incremental Discretization algorithm (PiD for short) was designed to provide a histogram representation of high-speed data streams. It learns histograms using an architecture composed of two layers. The first simplifies and summarizes the data: the algorithm traverses the data once and incrementally maintains an equal-width discretization. The second layer constructs the final histogram using only the discretization of the first phase. The first layer is initialized without seeing any data. As described in [12], the input for the initialization phase is the number of intervals (which should be much larger than the desired final number of intervals) and the range of the variable.
Consider a sample $x_1, x_2, \ldots$ of an open-ended random variable with range $R$. In this context, and allowing for extreme values and outliers, the histogram is defined as a set of break points $b_1, \ldots, b_{k-1}$ and a set of frequency counts $f_1, \ldots, f_{k-1}, f_k$ that define $k$ intervals in the range of the random variable:

$$]-\infty, b_1],\; ]b_1, b_2],\; \ldots,\; ]b_{k-2}, b_{k-1}],\; ]b_{k-1}, \infty[. \qquad (1)$$

In a histogram, every $x_i$ in a bin is represented by the corresponding middle point, which means that this approximation error is bounded by half of the length ($L$) of the bin. As the first layer is composed of an equal-width histogram, we obtain:

$$|x_i - m_j| \le \frac{L}{2} = \frac{R}{2k}, \qquad b_j \le x_i < b_{j+1},\; \forall j = 1, \ldots, k. \qquad (2)$$
¹ Alternative rules for constructing histograms include Scott's (1979) rule for the class width, $k = 3.5 s n^{-1/3}$, and Freedman and Diaconis's (1981) rule for the class width, $k = 2(IQ)n^{-1/3}$, where $s$ is the sample standard deviation and $IQ$ is the sample interquartile range.
Considering the set of middle-break points $m_1, \ldots, m_k$ of the histogram, we define the mean square error in each bin as the sum of the square differences between each point in that bin and its corresponding middle-break point:

$$\sum_{i} (x_i - m_j)^2 \le n_j \frac{R^2}{4k^2}, \qquad b_j \le x_i < b_{j+1},\; \forall j = 1, \ldots, k,$$

where $n_j$ is the number of data points in each bin. The quadratic error (QE) is defined as the sum of this error over all bins.
worst case, by: nR2/4k2, where n denotes the number of observed variables.
The definition of the number of intervals is one of the main problems of usinghistograms The number of bins is directly related with the quadratic error Howdifferent would the quadratic error be if we consider just one more bin? To studythe evaluation of the quadratic error of a histogram with the number of bins, we
compute the following ratio, which we refer to as the relative error: = QE(k)
QE(k+1).
In order to bound the decrease of the quadratic error, we define the number
of bins of the first layer as dependent on the upper bound of the relative error
() and on the fail probability (δ):
N1= O(1
ln
1
Establishing a bound for relative error, this definition of the number of bins
ensures that the fail probability will converge to zero when N1 increases So,
setting and δ and using this definition we control the decrease of the quadratic
error Figure 1 shows that the number of bins increases when the error decreasesand the confidence increases Figure 1 (top) represents the number of bins of
layer1in function of and δ The bottom figures give a projection of the number
of bins according with the variables and δ (respectively).
So, differing from [12] the input for the initialization phase is a pair of rameters (that will be used to express accuracy guarantees) and the range of thevariable:
pa-– The upper bound on relative error .
– The desirable confidence level 1− δ.
– The range of the variable.
The range of the variable is only indicative It is used to initialize the set
of breaks using an equal-width strategy Each time we observe a value of the
random variable, we update layer1 The update process determines the intervalcorresponding to the observed value, and increments the counter of this interval
The process of updating layer1 works online, performing a single scan over thedata stream It can process infinite sequences of data, processing each example
in constant time and space The second layer merges the set of intervals defined
by the first layer The input for the second layer is the breaks and counters of
layer1, the type of histogram (equal-width or equal-frequency) and the
desir-able final number of intervals The algorithm for the layer2 is very simple For
Trang 38Fig 1 Representation of the number of bins of layer1 The top figure shows thedependency from and δ and bottom figures show it according to only one variable.
equal-width histograms, it first computes the breaks of the final histogram, from
the actual range of the variable (estimated in layer1) The algorithm traversesthe vector of breaks once, adding the counters corresponding to two consecutive
breaks For equal-frequency histograms, we first compute the exact number F of
points that should be in each final interval (from the total number of points andthe number of desired intervals) The algorithm traverses the vector of counters
of layer1 adding the counts of consecutive intervals till F The computational
costs of this phase can be ignored: it traverses once the discretization obtained inthe first phase We can construct several histograms using different number of in-tervals and different strategies: equal-width or equal-frequency This is the mainadvantage of PiD in exploratory data analysis We use PiD algorithm to createcompact summaries of data, and along with the improvement of the number ofbins definition, we also accomplished it with a change detection technique
The algorithm described in the previous section assumes that the observationscome from a stationary distribution When data flows over time, and at least forlarge periods of time, it is not acceptable to assume that the observations aregenerated at random according to a stationary probability distribution At least
in complex systems and for large time periods, we should expect changes in thedistribution of the data
Trang 393.1 Related Work
When monitoring a stream is fundamental to know if the received data comesfrom the distribution observed so far It is necessary to perform tests in order todetermine if there is a change in the underlying distribution The null hypothesis
is that the previously seen values and the current observed values come from thesame distribution The alternative hypothesis is that they are generated fromdifferent continuous distributions
There are several methods in machine learning to deal with changing cepts [21,22,23,29] In general, approaches to cope with concept drift can be clas-
con-sified into two categories: i) approaches that adapt a learner at regular intervals without considering whether changes have really occurred; ii) approaches that
first detect concept changes, and next, the learner is adapted to these changes
Examples of the former approaches are weighted examples and time windows of
fixed size Weighted examples are based on the simple idea that the importance
of an example should decrease with time (references about this approach can befound in [22,24,29]) When a time window is used, at each time step the learner
is induced only from the examples that are included in the window Here, thekey difficulty is how to select the appropriate window’s size: a small window canassure a fast adaptability in phases with concept changes but in more stablephases it can affect the learner performance, while a large window would pro-duce good and stable learning results in stable phases but can not react quickly
to concept changes
In the latter approaches, with the aim of detecting concept changes, someindicators (e.g performance measures, properties of the data, etc.) are monitoredover time (see [21] for a good classification of these indicators) If during themonitoring process a concept drift is detected, some actions to adapt the learner
to these changes can be taken When a time window of adaptive size is usedthese actions usually lead to adjusting the window’s size according to the extent
of concept drift [21] As a general rule, if a concept drift is detected the window’ssize decreases; otherwise the window’s size increases
Windows Models Most of the methods in this approach monitor the
evo-lution of a distance function between two distributions: from past data in a
reference window and in a current window of the most recent data points An
example of this approach, in the context of learning from Data Streams, hasbeen present by [20] The author proposes algorithms (statistical tests based onChernoff bound) that examine samples drawn from two probability distributionsand decide whether these distributions are different
In this work, we monitor the distance between the distributions in two timewindows: a reference window that has a fixed size and refers to past observa-tions and an adaptive-cumulative window that receives the actual observationsand could have a fixed or an adaptive step depending on the distance betweendistributions For both windows, we compute the relative frequencies: a set of
empirical probabilities p(i) for the reference window and q(i) for the
adaptive-cumulative window
Trang 40Adaptive-Cumulative Window Model In a previous work [27] we defined
the windows sizes as dependent on the number of intervals of the layer1, beinghalf of these ones:N1
2 In this work, in order to evaluate the influence of the ber of examples required to detect a change, we defined the cumulative window(the current one) using an adaptive increasing step that depends on the distancebetween data distributions Starting with a size of N1
num-2 , the step is incremented ifthe distance between data distributions increases and is decremented otherwise,according to the following relation:
Figure 2 shows the dependency of the window’s step on distributions’ distance
Fig 2 Representation of the windows’ step with respect to the absolute difference
betweenKLD(p||q) and KLD(q||p) An illustrative example showing that the window’s
step decreases when the absolute difference between distances increases
3.2 Distance between Distributions – Kullback-Leibler Divergence
Assuming that sample in the reference window has distribution p and that data
in the current window has distribution q, we use as a measure to detect whether
has occurred a change in the distribution the Kullback-Leibler Divergence (KLD)From information theory [4], the Relative Entropy is one of the most gen-eral ways of representing the distance between two distributions [10] Contrary
to the Mutual Information this measure assesses the dissimilarity between twovariables Also known as the Kullback-Leibler divergence, it measures the dis-tance between two probability distributions and so it can be used to test forchange
² KLD stands for Kullback-Leibler Divergence. This measure is introduced in the next subsection.