Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Olufemi A. Omitaomu, João Gama, Nitesh V. Chawla, Auroop R. Ganguly (Eds.)
Knowledge Discovery from Sensor Data
Second International Workshop, Sensor-KDD 2008, Las Vegas, NV, USA, August 24-27, 2008
Revised Selected Papers
Mohamed Medhat Gaber
Monash University, Centre for Distributed Systems and Software Engineering
900 Dandenong Road, Caulfield East, Melbourne, VIC 3145, Australia

João Gama
University of Porto, Faculty of Economics, LIAAD-INESC Porto L.A.
Rua de Ceuta, 118, 6, 4050-190 Porto, Portugal
E-mail: jgama@liaad.up.pt
Nitesh V Chawla
University of Notre Dame, Computer Science and Engineering Department
353 Fitzpatrick Hall, Notre Dame, IN 46556, USA
E-mail: nchawla@cse.nd.edu
Library of Congress Control Number: 2010924293
CR Subject Classification (1998): H.3, H.4, C.2, H.5, H.2.8, I.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-12518-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-12518-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Preface

This volume contains extended papers from Sensor-KDD 2008, the Second International Workshop on Knowledge Discovery from Sensor Data. The second Sensor-KDD workshop was held in Las Vegas on August 24, 2008, in conjunction with the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Wide-area sensor infrastructures, remote sensors, wireless sensor networks, and RFIDs yield massive volumes of disparate, dynamic, and geographically distributed data. As such sensors are becoming ubiquitous, a set of broad requirements is beginning to emerge across high-priority applications including disaster preparedness and management, adaptability to climate change, national or homeland security, and the management of critical infrastructures. The raw data from sensors need to be efficiently managed and transformed to usable information through data fusion, which in turn must be converted to predictive insights via knowledge discovery, ultimately facilitating automated or human-induced tactical decisions or strategic policy based on decision sciences and decision support systems.

The expected ubiquity of sensors in the near future, combined with the critical roles they are expected to play in high-priority application solutions, points to an era of unprecedented growth and opportunities. The main motivation for the Sensor-KDD series of workshops stems from the increasing need for a forum to exchange ideas and recent research results, and to facilitate collaboration and dialog between academia, government, and industrial stakeholders. This is clearly reflected in the successful organization of the first workshop (http://www.ornl.gov/sci/knowledgediscovery/SensorKDD-2007/) along with the ACM KDD-2007 conference, which was attended by more than seventy registered participants and resulted in an edited book (CRC Press, ISBN 9781420082326, 2008) and a special issue of the Intelligent Data Analysis journal (Volume 13, Number 3, 2009).
Based on the positive feedback from the previous workshop attendees, our own experiences and interactions with government agencies such as DHS and DOD, and our involvement with numerous projects on knowledge discovery from sensor data, we organized the second Sensor-KDD workshop along with the KDD-2008 conference. As expected, we received very high-quality paper submissions, which were thoroughly reviewed by a panel of international Program Committee members. Based on a minimum of two reviews per paper, we selected seven full papers and six short papers. In addition to the oral presentations of accepted papers, the workshop featured two invited speakers: Kendra E. Moore, Program Manager, DARPA/IPTO, and Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign.
The contents of this volume include the following papers. Data mining techniques for diagnostic debugging in sensor networks were presented in an invited paper by Abdelzaher et al. Sebastiao et al. addressed the important problem of detecting changes in constructing histograms from time-changing high-speed data streams. Delir Haghighi et al. introduced an integrated architecture for situation-aware adaptive data mining and mobile visualization in ubiquitous computing environments. Davis et al. described synchronous and asynchronous expectation maximization algorithms for unsupervised learning in factor graphs. Rahman et al. dealt with intrusion detection in wireless networks; their system, WiFi Miner, is capable of finding frequent and infrequent patterns in preprocessed wireless connection records using an infrequent-pattern-finding Apriori algorithm. A solution to the problem of detecting underlying patterns in large volumes of spatiotemporal data, which allows one, for example, to model human behavior and plan traffic, was given by Hutchins et al. Wu et al. presented a spatiotemporal outlier detection algorithm called Outstretch, which discovers the outlier movement patterns of the top-k spatial outliers over several time periods. A joint use of large-scale sensory measurements from the Internet and a small number of human inputs for effective network inference through a clustering and semi-supervised learning algorithm was given by Erjongmanee et al. Rashidi and Cook presented an adaptive data mining framework for detecting patterns in sensor data. A dense pixel visualization technique for visualizing sensor data, as well as absolute errors resulting from predictive models, was presented by Rodrigues et al. Fang et al. presented a two-stage knowledge discovery process, where offline approaches are utilized to design online solutions that can support real-time decisions. Finally, a framework for the discovery of spatiotemporal neighborhoods in sensor datasets, where a time series of data is collected at many spatial locations, was presented by McGuire et al.
The workshop witnessed lively participation from all quarters and generated interesting discussions immediately after each presentation as well as at the end of the workshop. We hope that the Sensor-KDD workshop will continue to be an attractive forum for researchers from academia, industry, and government to exchange ideas, initiate collaborations, and lay the foundation for the future of this important and growing area.
Olufemi A. Omitaomu
João Gama
Nitesh V. Chawla
Mohamed Medhat Gaber
Auroop R. Ganguly
Organization

The Second International Workshop on Knowledge Discovery from Sensor Data (Sensor-KDD 2008) was made possible by the following organizers and international Program Committee members.
Workshop Chair
Ranga Raju Vatsavai Oak Ridge National Laboratory, USA
Olufemi Omitaomu Oak Ridge National Laboratory, USA
Joao Gama University of Porto, Portugal
Nitesh V Chawla University of Notre Dame, USA
Mohamed Medhat Gaber Monash University, Australia
Auroop Ganguly Oak Ridge National Laboratory, USA
Program Committee (in alphabetical order)
Michaela Black University of Ulster, Coleraine, Northern Ireland, UK
Andre Carvalho University of Sao Paulo, Brazil
Sanjay Chawla University of Sydney, Australia
Francisco Ferrer University of Seville, Spain
Ray Hickey University of Ulster, Coleraine, Northern Ireland, UK
Ralf Klinkenberg University of Dortmund, Germany
Miroslav Kubat University of Miami, USA
Mark Last Ben-Gurion University, Israel
Chang-Tien Lu Virginia Tech, USA
Elaine Parros Machado de Sousa University of Sao Paulo, Brazil
Laurent Mignet IBM Research, USA
S. Muthu Muthukrishnan Rutgers University and AT&T Research, USA
Pedro Rodrigues University of Porto, Portugal
Josep Roure Carnegie Mellon University, Pittsburgh, USA
Bernhard Seeger University of Marburg, Germany
Cyrus Shahabi University of Southern California, USA
Mallikarjun Shankar Oak Ridge National Laboratory, Oak Ridge, USA
Alexandre Sorokine Oak Ridge National Laboratory, Oak Ridge, USA
Eiko Yoneki University of Cambridge, UK
Nithya Vijayakumar Cisco Systems, Inc., USA
Guangzhi Qu Oakland University, Rochester, USA
Table of Contents

Data Mining for Diagnostic Debugging in Sensor Networks: Preliminary
Evidence and Lessons Learned 1
Tarek Abdelzaher, Mohammad Khan, Hieu Le,
Hossein Ahmadi, and Jiawei Han
Monitoring Incremental Histogram Distribution for Change Detection
Pari Delir Haghighi, Brett Gillick, Shonali Krishnaswamy,
Mohamed Medhat Gaber, and Arkady Zaslavsky
Unsupervised Plan Detection with Factor Graphs 59
George B Davis, Jamie Olson, and Kathleen M Carley
WiFi Miner: An Online Apriori-Infrequent Based Wireless Intrusion
System 76
Ahmedur Rahman, C.I Ezeife, and A.K Aggarwal
Probabilistic Analysis of a Large-Scale Urban Traffic Sensor
Data Set 94
Jon Hutchins, Alexander Ihler, and Padhraic Smyth
Spatio-temporal Outlier Detection in Precipitation Data 115
Elizabeth Wu, Wei Liu, and Sanjay Chawla
Large-Scale Inference of Network-Service Disruption upon Natural
Parisa Rashidi and Diane J Cook
A Simple Dense Pixel Visualization for Mobile Sensor Data Mining 175
Pedro Pereira Rodrigues and João Gama
Incremental Anomaly Detection Approach for Characterizing Unusual
Profiles 190
Yi Fang, Olufemi A Omitaomu, and Auroop R Ganguly
Spatiotemporal Neighborhood Discovery for Sensor Data 203
Michael P McGuire, Vandana P Janeja, and Aryya Gangopadhyay
Author Index 227
Data Mining for Diagnostic Debugging in Sensor Networks: Preliminary Evidence and Lessons Learned

Tarek Abdelzaher, Mohammad Khan, Hieu Le, Hossein Ahmadi, and Jiawei Han
University of Illinois at Urbana-Champaign
Abstract. Sensor networks and pervasive computing systems intimately combine computation, communication, and interactions with the physical world, thus increasing the complexity of the development effort, violating communication protocol layering, and making traditional network diagnostics and debugging less effective at catching problems. Tighter coupling between communication, computation, and interaction with the physical world is likely to be an increasing trend in emerging edge networks and pervasive systems. This paper reviews recent tools developed by the authors to understand the root causes of complex interaction bugs in edge network systems that combine computation, communication, and sensing. We concern ourselves with automated failure diagnosis in the face of non-reproducible behavior, high interactive complexity, and resource constraints. Several examples are given of finding bugs in real sensor network code using the tools developed, demonstrating the efficacy of the approach.
1 Introduction

Sensor networks and pervasive computing systems, henceforth referred to as edge network systems, feature heterogeneity and tight interactions between computation, communication, sensing, and control. Tight interactions breed interactive complexity, the primary cause of failures and vulnerabilities in complex systems. While individual devices and subsystems may operate well in isolation, their composition might result in incompatibilities, anomalies, or failures that are typically very difficult to troubleshoot. On the other hand, software re-use is impaired by the customized nature of application code and deployment environments, making it harder to amortize debugging, troubleshooting, and tuning cost.
The work was supported in part by the U.S. National Science Foundation grants IIS-08-42769, CNS 06-26342, CNS 05-54759, and BDI-05-15813, and by NASA grant NNX08AC35A. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.
Moreover, users of edge network systems, such as residents of a smart home, may not be experts in networking and system administration. Automated techniques are needed for troubleshooting such systems both at development time and after deployment in order to reduce production as well as ownership costs.
Tighter coupling between communication, computation, and interaction with the physical world is likely to be an increasing trend. Internet pioneers, such as David Clark, the network's former chief architect, express the view that by the end of the next decade, the edge of the Internet will primarily constitute sensors and embedded devices [3].¹ This motivates analysis tools for systems of high interactive complexity. The data mining literature is rich [25,5,26,11,8,10,4,22] with examples of identification, classification, and understanding of complex patterns in large, highly coupled systems ranging from biological processes [20] to commercial databases [23]. The key advantage of using data mining is the automation of discovery of hidden patterns that may take significant amounts of time to detect manually.
While the use of data mining in network troubleshooting is promising, it is by no means a straightforward application of existing techniques to a new problem. Networked software execution patterns are not governed by "laws of nature", DNA, business transactions, or social norms. They are limited only by programmers' imagination. The increased diversity and richness of such patterns makes it harder to zoom in on potential causes of problems without embedding some knowledge of networking, programming, and debugging into the data mining engine. This paper describes a cross-cutting solution that leverages the power of data mining to uncover hard-to-find bugs in distributed systems.
2 Diagnostic Debugging
Consider multiple development teams building an edge network system, such as one designed to instrument an assisted living facility with sensors that monitor the occupants, ensure their well-being, and alert care-givers to emergencies when they occur. The system typically consists of a large number of components, further multiplied by the need to support a variety of different hardware platforms, operating systems, and sensor products. Often parts of the system are developed by different vendors. These parts are tested and debugged independently by their respective developers; then, at a later stage, the system is put together at some integration testbed for evaluation. The integrated system usually does not work well. When a host of problem manifestations are reported, which party is responsible for these problems, and who needs to fix what? Different developers must now come together to understand where the malfunction is coming from. This type of bug is the hardest to fix and is a source of significant additional costs and delays in projects. Due to the rising tendency to build networked systems of an increasing number of embedded interacting components, interaction problems get worse, which motivates our work.

¹ This view was expressed in his motivational keynote on the need for a Future Internet Design (FIND) initiative.
The network troubleshooting solution advocated in this paper stems from collaborative work and experiences of the authors in the area of diagnostic debugging. In this section, we overview the goals and challenges, outline the design principles that stem from these goals and challenges, and present initial evidence from a pilot study that suggests the viability of using data mining to uncover a wide range of network-related interaction bugs.

2.1 Goals and Challenges
The coupled nature of edge network systems motivates an extended definition of a network that includes not only the communication infrastructure but also the communicating entities themselves, such as sensors, application-level protocols, and user inputs. A growing challenge is to develop analysis techniques and software tools that automate diagnostic troubleshooting of such networks to reduce their development and ownership cost. A significant part of software development cost is spent on testing and debugging. Similarly, an increasing part of ownership cost is spent on maintenance. Of particular difficulty is to troubleshoot the important and expensive class of problems that arise from (improper or unexpected) interactions of large numbers of components across a network. We aim to answer developer or user questions such as "why does this network suffer unusually high service delays?", "why is throughput low despite availability of resources and service requests?", "why does my time synchronization protocol fail to synchronize clocks when the localization protocol is run concurrently?", or "why does this vehicle tracking system suffer increased false alarms when it is windy?"² Building efficient troubleshooting support to address the above questions is complicated by the characteristics of interaction bugs, namely:
– Non-reproducible behavior: Interactions in edge network systems feature an increased level of concurrency and thus an increased level of non-determinism. In turn, non-determinism generates non-reproducible bugs that are hard to find using traditional debugging tools.
– Non-local emergent behavior: By definition, interaction bugs do not manifest themselves when components are tested in isolation. Current debugging tools are very good at finding bugs that can be traced to individual components. Interaction bugs manifest only at scale as a result of component composition. They result in emergent behavior that arises when a number of seemingly individually sound components are combined into a network, which makes them hard to find.
– Resource constraints: Embedded networked devices often operate under significant resource constraints. Hence, solutions to the debugging problem must not use large amounts of run-time resources, making bugs harder to find.
² In a previous deployment of a magnetometer-based wireless tracking system at UVa, wind resulted in antennae vibration, which was caught by the magnetometers and interpreted as the passage of nearby ferrous objects (vehicles).
2.2 Design Principles
The data mining approach described in this paper is based on three main design principles aimed at exploiting concurrency, interactions, and non-determinism to improve the ability to diagnose problems in resource-constrained systems. These principles are as follows:
– Exploiting non-reproducible behavior: Exploitation of non-determinism to improve understanding of system behavior is not new to the computing literature. For example, many techniques in estimation theory, concerned with estimation of system models, rely on introducing noise to explore a wider range of system states and hence arrive at more accurate models. Machine learning and data mining approaches have the same desirable property. They require examples of both good and bad system behavior to be able to classify the conditions correlated with each. In particular, note that conditions that cause a problem to occur are correlated (by causality) with the resulting bad behavior. Root causes of non-reproducible bugs are thus inherently suited for discovery using data mining and machine learning approaches, as the lack of reproducibility itself and the inherent system non-determinism improve the odds of occurrence of sufficiently diverse behavior examples to train the troubleshooting system to understand the relevant correlations and identify causes of problems. Our choice of data mining techniques exploits this insight.
– Exploiting interactive complexity: Interactive complexity describes a system where scale and complexity cause components to interact in unexpected ways. A failure that occurs due to such unexpected interactions is therefore not localized and is hard to "blame" on any single component. This fundamentally changes the objective of a troubleshooting tool from aiding in stepping through code (which is more suitable for finding a localized error in some line, such as an incorrect pointer reference) to aiding with diagnosing a sequence of events (component interactions) that leads to a failure state. This leads to the choice of sequence mining algorithms as the core analytic engine behind diagnostic debugging.
– Addressing resource economy: We develop a progressive diagnosis capability where inexpensive high-level diagnosis offers initial clues regarding possible root causes, resulting in further, more detailed inspections of smaller subsets of the system until the problem is found.
The above design principles lead to sequence mining techniques to diagnose interaction bugs. Conceptually, we log a large number of events, correlate sequences of such events with manifestations of undesirable behavior, then search such correlated sequences for one or more that may be causally responsible for the failure. Finding such culprit sequences is the core research question in diagnostic debugging.
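To make the workflow concrete, the following sketch (in Python, illustrative only and not the authors' implementation; the function names, toy event names, and labeling predicate are invented for the example) labels each run's log as "good" or "bad", counts short event subsequences in each pile, and ranks patterns by how differently often they occur in the two piles.

from collections import Counter
from itertools import combinations

def subsequences(log, length):
    # All (not necessarily contiguous) event subsequences of the given length.
    return [tuple(log[i] for i in idx) for idx in combinations(range(len(log)), length)]

def discriminative_patterns(logs, is_good, length=2, top=5):
    good, bad = Counter(), Counter()
    for log in logs:
        pile = good if is_good(log) else bad
        pile.update(set(subsequences(log, length)))     # count each pattern once per run
    ranked = sorted(set(good) | set(bad), key=lambda p: abs(good[p] - bad[p]), reverse=True)
    return [(p, good[p], bad[p]) for p in ranked[:top]]

# Toy usage: runs that keep retrying on new channels never deliver ("bad").
runs = [["send", "ack"], ["send", "ack"],
        ["send", "retry_ch1", "retry_ch2"], ["send", "retry_ch1", "retry_ch2"]]
print(discriminative_patterns(runs, is_good=lambda log: "ack" in log))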
2.3 Preliminary Evidence
To experiment with the usefulness of data mining techniques for purposes of debugging networked applications, the authors conducted a pilot investigation and developed a prototype of a diagnostic debugging tool. This prototype was experimented with over the course of two years to understand the strengths and limitations of the approach. Some results were published in SenSys 2008 [16], DCOSS 2008 [14], and DCOSS 2007 [15]. Below we report proof-of-concept examples of successful use of the diagnostic debugging prototype, followed by an elaboration of lessons learned from this experience.
1. A "design" bug: As an example of catching a design bug, we summarize a case study, published by the authors [16], involving a multi-channel sensor network MAC-layer protocol from prior literature [17] that attempts to utilize channel diversity to improve throughput. The protocol assigned a home channel to every node, grouping nodes that communicated much into a cluster on the same channel. It allowed occasional communication between nodes in different clusters by letting senders change their channel temporarily to the home channel of a receiver to send a message. If communication failed (e.g., because home channel information became stale), senders would scan all channels looking for the receiver on a new channel and update home channel information accordingly. Testing revealed that total network throughput was sometimes worse than that of a single-channel MAC. Initially, the designer attributed it to the heavy cost of communication across clusters. To verify this hypothesis, the original protocol, written for MicaZ motes, was instrumented by the authors to log radio channel change events and message communication events (send, receive, acknowledge) as well as related timeouts. It was tested on a motes network. Event logs from runs where it outperformed a single-channel MAC were marked "good". Event logs from runs where it did worse were marked "bad". Discriminative sequence mining applied to the two sets of logs revealed a common pattern associated prominently with bad logs. The pattern included the events No Ack Received, Retry Transmission on Channel (1), Retry Transmission on Channel (2), Retry Transmission on Channel (3), etc., executed on a large number of nodes. This quickly led the designer to understand a much deeper problem. When a sender failed to communicate with a receiver in another cluster, it would leave its home channel and start scanning other channels, causing communication addressed to it from upstream nodes in its cluster to fail as well. Those nodes would start scanning too, resulting in a cascading effect that propagated up the network until everyone was scanning and communication was entirely disrupted everywhere (both within and across clusters). The protocol had to be redesigned.
2. An "accumulative effect" bug: Often failures or performance problems arise because of accumulative effects such as gradual memory leakage or clock overflow. While such effects evolve over a long period of time, the example summarized below [14] shows how it may be possible to use diagnostic debugging to understand the "tipping point" that causes the problem to manifest. In this case, the authors observed a sudden onset of successive message loss in an implementation of directed diffusion [12], a well-known sensor network routing protocol. As before, communication was logged together with timeout and message drop events. Parts of logs coinciding with or closely preceding instances of successive message losses were labeled "bad". The rest were labeled "good". Discriminative sequence mining revealed that patterns correlated with successive message loss included the following events: Message Send (timestamp = 255), Message Send (timestamp = 0), Message Dropped (Reason = "SameDataOlderTimeStamp"). The problem became apparent. A timestamp counter overflow caused subsequent messages received to be erroneously interpreted as "old" duplicates (i.e., having previously seen sequence numbers) and discarded.
3. A "race condition" bug: Race conditions are a significant cause of failures in systems with a high degree of concurrency. In a previous case study [16], the authors demonstrated how diagnostic debugging helped catch a bug caused by a race condition in their embedded operating system, called LiteOS [2]. When the communication subsystem of an early version of LiteOS was stress-tested, some nodes would crash occasionally and non-deterministically. LiteOS allows logging system call events. Such logs were collected from (the flash memory of) nodes that crashed and nodes that did not, giving rise to "bad" and "good" data sets, respectively. Discriminative sequence mining revealed that the following sequence of events occurred in the good logs but not in the bad logs: Packet Received, Get Current Radio Handle, whereas the following occurred in the bad logs but not the good logs: Packet Received, Get Serial Send Function. From these observations, it is clear that failure occurs when Packet Received is followed by Get Serial Send Function before Get Current Radio Handle is called. Indeed, the latter prepares the application for receiving a new packet. At high data rates, another communication event may occur before the application is prepared, causing a node crash. The race condition was eliminated by proper synchronization.
4. An "assumptions mismatch" bug: In systems that interact with the physical world, problems may occur when the assumptions made in protocol design regarding the physical environment do not match physical reality. In this case study [15], a distributed target tracking protocol, implemented on motes, occasionally generated spurious targets. The protocol required that nodes sensing a target form a group and elect a group leader who assigned a new ID to the target. Subsequently, the leader passed the target ID on in a leader handoff operation as the target moved. Communication logs were taken from runs where spurious targets were observed ("bad" logs) and runs where they were not ("good" logs). Discriminative pattern mining revealed the absence of member-to-leader messages in the bad logs. This suggested that a singleton group was present (i.e., the leader was the only group member). Indeed, it turned out that the leader hand-off protocol was not designed to deal with a singleton group because the developer assumed that a target would always be sensed by multiple sensors (an assumption on physical sensing range) and hence a group would always have more than one member. The protocol had to be realigned with physical reality.

While these preliminary results are encouraging, significant challenges were met as well that required adapting data mining techniques to the needs of debugging systems. The following section describes the approach in more detail and outlines the challenges and lessons learned.
Our debugging tool uses a data collection front-end to collect runtime events from the system being debugged for analysis. The front end may range from a wireless eavesdropping network to full instrumentation of the code under development (with debug statements). Later in the paper, we describe specific front ends we developed, as well as some associated development challenges. Conceptually, once the log of runtime events is available from the front end, the tool separates the collected sequence of events into two piles: a "good" pile, which contains the parts of the log where the system performs as expected, and a "bad" pile, which contains the parts of the log where the system fails a specification or exhibits poor performance. This data separation phase may be done based on predicates that define "good" versus "bad" behavior, provided by the application testers or derived from specifications, as illustrated in the examples in Section 2.3. The core of the debugging framework is a discriminative frequent pattern mining algorithm that looks for patterns (sequences of events) that exist with very different frequencies in the "good" and "bad" piles.
These patterns are called discriminative since they are correlated (positively or negatively) with the occurrence of undesirable behavior. Fundamentally, such an algorithm constitutes a search through the exponentially large space of patterns for those that are most correlated with failure. The search space is exponential because, given a system that logs M event types, there are M^N possible sequences of length N involving these events. Moreover, M itself is exponentially large, since events may have parameters (e.g., parameters of a function call, or parameters of a logged packet header), and the space of all parameter values is exponentially large. Any combination of values might be part of the problem.
The basic challenge of identifying problematic event sequences can be thought of as a search through a double-exponential tree, where the root is the empty pattern and each level adds one event to the length of the pattern. Hence, a path through this tree to a leaf node represents one event sequence. Leaves are colored depending on whether the corresponding sequences were observed in the "good" pile, the "bad" pile, or both (yielding blends of the two colors in different proportions). The goal is to search the double-exponential tree for those bad leaves (sequences in the bad pile) that are likely to have caused the observed problem being debugged.
Frequent and sequential pattern mining have been studied extensively in data mining research [7], with many efficient algorithms developed. For example, the popularly cited Apriori algorithm [1] initially counts the number of occurrences (called support) of each distinct event in the data set (i.e., in the "good" or "bad" pile) and discards all events that are infrequent (their support is less than some parameter minSup). The remaining events are frequent patterns of length 1, called the frequent 1-itemset S1. It then takes combinations of S1 (i.e., single events) to generate frequent 2-itemset candidates. One important heuristic used in Apriori is that any subset of a frequent k-itemset must be frequent, which can be used to effectively prune a candidate k-itemset if any of its (k − 1)-itemsets is infrequent. Later frequent pattern mining algorithms, such as FPGrowth [9], FPClose [6], CHARM [29], and LCM [24], as well as sequential pattern mining algorithms, such as GSP [21], PrefixSpan [19], SPADE [28], and CloSpan [27], all exploit this Apriori heuristic. However, for mining sensor network bugs, we found that these algorithms could still be ineffective. In many cases, the root cause of a network bug could be a rare event (with very low support), but setting the minimal support too low could lead to combinatorial explosion in any of the above frequent or sequential pattern mining algorithms. Therefore, more advanced data mining methods should be explored for debugging such network systems. Below, we present the challenges and techniques needed to comprehensively and efficiently address automated self-diagnosis of interaction bugs in practical edge network systems.
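The level-wise candidate generation and subset-based pruning described above can be summarized in a few lines. The sketch below is a minimal, generic Apriori-style itemset miner given for illustration only; it is not the authors' code, and the transactions and minSup value are invented.

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]

    def support(s):
        return sum(1 for t in transactions if s <= t)

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_sup}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], min_sup=2))

The same pruning idea carries over to the sequential miners cited above; the difficulty noted in the text is that a very low minSup makes the candidate sets at each level explode combinatorially.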
3.1 Preliminaries
For the purposes of the discussion below, let us define an event to be the basic element in the log (collected by one of our front ends) that is analyzed for purposes of diagnosing failures. An event generally has a type, a timestamp, a set of attributes, and an associated node or log ID where it was recorded. The set of distinct event types is called an alphabet, in an analogy with strings. In other words, if events were letters in an alphabet, we are looking for strings that cause errors to occur. These strings represent event sequences (ordered lists of events). Logs are ordered chronologically and can be thought of as sequences of logged events.³ For example, S1 = (a, b, b, c, d, b, b, a, c) is an event sequence. Elements a, b, ..., are events. A discriminative pattern between two data sets is a subsequence of (not necessarily contiguous) events that occurs with a very different count in the two sets. The larger the difference, the better the discrimination. With the above terminology in mind, we present the challenges addressed in the proposed work.
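The notion of a discriminative pattern can be pinned down with a small helper that counts how many times a candidate pattern occurs as a (not necessarily contiguous) subsequence of a log; comparing such counts across the "good" and "bad" sets is what the mining algorithm automates. The sketch below is illustrative only, and the function name is invented.

def count_occurrences(log, pattern):
    # counts[j] = number of ways to match the first j pattern events seen so far.
    counts = [1] + [0] * len(pattern)
    for event in log:
        for j in range(len(pattern), 0, -1):    # iterate backwards so each event is used once per pass
            if event == pattern[j - 1]:
                counts[j] += counts[j - 1]
    return counts[len(pattern)]

S1 = ["a", "b", "b", "c", "d", "b", "b", "a", "c"]   # the example sequence above
print(count_occurrences(S1, ["a", "b", "c"]))        # number of embeddings of (a, b, c) in S1
print(count_occurrences(["a", "c", "b"], ["a", "b", "c"]))   # 0: order matters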
3.2 Challenge: Scalability of Search
Frequent event sequences are easy to identify since they stand out in the number of times they occur. A significant challenge, however, is to identify infrequent sequences that cause problems. In debugging, less frequent patterns can sometimes be more indicative of the cause of failure than the most frequent patterns. A single mistake can cause a damaging sequence of events. For example, a single node reboot event can cause a large number of message losses. In such cases, if frequent patterns are generated that are commonly found in failure cases, the most frequent patterns may not include the real cause of the problem (e.g., the reboot event).
Fortunately, in the case of embedded network debugging, a solution may be inspired by the nature of the problem domain.

³ Lack of clock synchronization makes it impossible to produce an exact global chronological order, a challenge we address in this work.
The fundamental issue to observe is that much computation in edge network systems is recurrent. Code repeatedly visits the same states (perhaps not strictly periodically), repeating the same actions over time. Hence, a single problem, such as a node reboot or a race condition that pollutes a data structure, often results in multiple manifestations of the same unusual symptom (like multiple subsequent message losses or multiple subsequent false alarms). Catching these recurrent symptoms with an algorithm such as Apriori or PrefixSpan is much easier due to their larger frequency. With such symptoms identified, the search space can be narrowed, and it becomes easier to correlate them with other, less frequent, preceding event occurrences. This suggests a two-stage approach. In the first stage, a simple algorithm such as Apriori or PrefixSpan generates the usual frequent discriminative patterns that have support larger than minSup. It is expected that the patterns involving manifestations of bugs will survive at the end of this stage, but infrequent events like a node reboot will be dropped due to their low support.
In the second stage, the algorithm first splits the log into segments. It counts the number of discriminative frequent patterns found in each segment and ranks each segment of the log based on the count (the higher the number of discriminative patterns in a segment, the higher the rank). Next, the algorithm searches for discriminative patterns (those that occurred in one pile but not the other) with minSup reduced to 1 on the K highest-ranked segments. This scheme was shown by the authors to have a significant impact on the performance of the frequent pattern mining algorithm [13]. Essentially, it relies on successive refinement of the search space. First, it performs a coarse (and fast) search through the event logs to find the most obvious leads; then it does a finer-grained but more localized search around regions in the log highlighted by the first search.
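A compact way to express this two-stage refinement is sketched below (illustrative only; mine(log, min_sup) stands for any frequent or sequential pattern miner such as Apriori or PrefixSpan, and the segment length, K, and good_patterns filtering are simplifying assumptions rather than the authors' exact interface).

def contains(log, pattern):
    # True if pattern occurs as a (possibly non-contiguous) subsequence of log.
    it = iter(log)
    return all(e in it for e in pattern)

def two_stage_search(bad_log, good_patterns, mine, min_sup=5, seg_len=100, k=3):
    # Stage 1: ordinary mining; frequent bug manifestations survive,
    # rare root-cause events (e.g., a single reboot) are dropped.
    stage1 = {p for p in mine(bad_log, min_sup) if p not in good_patterns}

    # Stage 2: rank segments by how many stage-1 patterns they contain,
    # then re-mine only the top-K segments with the support threshold at 1.
    segments = [bad_log[i:i + seg_len] for i in range(0, len(bad_log), seg_len)]
    ranked = sorted(segments,
                    key=lambda seg: sum(1 for p in stage1 if contains(seg, p)),
                    reverse=True)
    return {p for seg in ranked[:k] for p in mine(seg, 1) if p not in good_patterns}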
3.3 Preventing False Frequent Patterns
The Apriori algorithm and its extensions generate all possible combinations of frequent subsequences of the original sequence. As a result, they generate subsequences combining events that are "too far" apart to be causally correlated with high probability, and thus reduce the chance of finding the "culprit sequence" that actually caused the failure. This strategy could negatively impact the ability to identify discriminative patterns in two ways: (i) it could lead to the generation of discriminative patterns that are not causally related, and (ii) it could eliminate discriminative patterns by generating false patterns. Consider the following example.
Suppose we have two eight-event sequences, S1 and S2. If we apply the Apriori technique, it will generate (a, c, b) as an equally likely pattern for both S1 and S2: in both S1 and S2, it will combine the first occurrence of a and the first occurrence of c with the second occurrence of b, so the pattern will get canceled out at the differential analysis phase. To address this issue, the key observation is that the first occurrence of a should not be allowed to combine with the second occurrence of b, as there is another event a after the first occurrence of a but before the second occurrence of b, and the second occurrence of b is correlated with the second occurrence of a with higher probability.
To prevent such erroneous combinations, we use a dynamic search window scheme where the first item of any candidate sequence is used to determine the search window. In this case, for any pattern starting with a, the search window is [1, 4] and [4, 8] in S1 and S2. With these search windows, the algorithm will search for pattern (a, c, b) in windows [1, 4] and [4, 8], will fail to find it in S1, and will find it only in sequence S2. As a result, the algorithm will be able to report pattern (a, c, b) as a discriminative pattern.

This dynamic search window scheme also speeds up the search significantly. In this scheme, the original sequence (of size 8 events) was reduced to windows of size 4, making the search for patterns in those windows more efficient.
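The window construction and the window-restricted search can be sketched as follows (illustrative only; the exact window boundary convention and the toy sequences are invented here, since the paper's original S1 and S2 listings are not reproduced above).

def windows_for(log, first_event):
    starts = [i for i, e in enumerate(log) if e == first_event]
    return list(zip(starts, starts[1:] + [len(log)]))

def occurs_in_window(log, pattern):
    for start, end in windows_for(log, pattern[0]):
        it = iter(log[start:end])
        if all(e in it for e in pattern):    # subsequence test restricted to one window
            return True
    return False

# Toy sequences: (a, c, b) occurs in both as an unrestricted subsequence,
# but only S2 contains it inside a single search window.
S1 = ["a", "c", "x", "a", "b", "x", "x", "x"]
S2 = ["a", "x", "x", "a", "c", "b", "x", "x"]
print(occurs_in_window(S1, ["a", "c", "b"]), occurs_in_window(S2, ["a", "c", "b"]))   # False True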
3.4 Suppressing Redundant Subsequences
At the frequent pattern generation stage, if two patterns, Si and Sj, have support ≥ minSup, the Apriori algorithm keeps both sequences as frequent patterns even if one is a subsequence of the other and both have equal support. This makes perfect sense in data mining but not in debugging. For example, when mining the "good" data set, the above strategy assumes that any subset of a "good" pattern is also a good pattern. In real life, this is not true. Forgetting a step in a multi-step procedure may well cause failure. Hence, subsequences of good sequences are not necessarily good. Keeping these subsequences as examples of "good" behavior leads to a major problem at the differential analysis stage when discriminative patterns are generated, since they may incorrectly cancel out similar subsequences found frequent in the other (i.e., "bad" behavior) data pile. For example, consider the two sequences below:
S1 = (a, b, c, d, a, b, c, d)
S2 = (a, b, c, d, a, b, d, c)
Suppose, for correct operation of the protocol, event a has to be followed by event c before event d can happen. In sequence S2 this condition is violated. Ideally, we would like our algorithm to report sequence S3 = (a, b, d) as the "culprit" sequence. However, if we apply the Apriori algorithm, it will fail to catch this sequence. This is because it will generate S3 as a frequent pattern both for S1 and S2 with support 2, and it will get canceled out at the differential analysis phase. As a result, S3 will never show up as a "discriminative pattern". Note that the dynamic search window scheme alone cannot prevent this.
To illustrate, suppose a successful message transmission involves the following sequence of events:

(enableRadio, messageSent, ackReceived, disableRadio)

Now, although the sequence

(enableRadio, messageSent, disableRadio)

is a subsequence of the original "good" sequence, it does not represent a successful scenario, as it disables the radio before receiving the "ACK" message.

To solve this problem, we need an extra step (which we call sequenceCompression) before we perform differential analysis to identify discriminative patterns. At this step, we remove the sequence Si if it is a subsequence of Sj with the same support. This will remove all the redundant subsequences from the frequent pattern list. Subsequences with a (sufficiently) different support will be retained and will show up after discriminative pattern mining.

In the above example, pattern (a, b, c, d) has support 2 in S1 and support 1 in S2. Pattern (a, b, d) has support 2 in both S1 and S2. Fortunately, at the sequenceCompression step, pattern (a, b, d) will be removed from the frequent pattern list generated for S1 because it is a subsequence of a larger frequent pattern of the same support. It will therefore remain only on the frequent pattern list generated for S2 and will show up as a discriminative pattern.
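The sequenceCompression step amounts to dropping any frequent pattern that is a subsequence of another frequent pattern with the same support, as in the sketch below (illustrative only; the dictionary-based representation of patterns and supports is an assumption).

def is_subsequence(shorter, longer):
    it = iter(longer)
    return all(e in it for e in shorter)

def compress(patterns):
    # patterns: dict mapping a tuple of events to its support.
    kept = {}
    for p, sup in patterns.items():
        redundant = any(p != q and sup == patterns[q] and is_subsequence(p, q)
                        for q in patterns)
        if not redundant:
            kept[p] = sup
    return kept

# From the example above: (a, b, d) has the same support as the longer (a, b, c, d)
# in S1, so it is removed from S1's list but survives on S2's list.
s1_patterns = {("a", "b", "c", "d"): 2, ("a", "b", "d"): 2}
s2_patterns = {("a", "b", "c", "d"): 1, ("a", "b", "d"): 2}
print(compress(s1_patterns))
print(compress(s2_patterns))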
3.5 Handling Multi Attribute Events
As different event types can have a different number of attributes in the tuple, mining frequent patterns becomes much more challenging, as the mining algorithm has no prior knowledge of which attributes in a specific event are correlated with failure and which are not. For example, consider the following sequence of events:
<msg_sent,nodeid=1,msgtype=2,nodetype=l>
<msg_sent,nodeid=2,msgtype=2,nodetype=m>
In the above pattern, we do not know which of the attributes are correlated with failure (if any are related at all). It could be nodeid, or msgtype, or a combination of msgtype and nodetype, and so on. One trivial solution is to try all possible permutations. However, this is exponential in the number of attributes and becomes unmanageable very quickly. Rather, we split such multi-attribute events into a sequence of single-attribute events, each with only one attribute of the original multi-attribute event. The converted sequence for the above example is:

<msg_sent,nodeid=1>, <msg_sent,msgtype=2>, <msg_sent,nodetype=l>, <msg_sent,nodeid=2>, <msg_sent,msgtype=2>, <msg_sent,nodetype=m>

We can now apply simple (uni-dimensional) sequence mining techniques to the above sequence. As before, the user will be given the resulting sequences (those most correlated with failures). In such sequences, only the relevant attributes of the original multidimensional events will likely survive. Attributes irrelevant to the occurrence of failure will likely have a larger spread of values (since these values are orthogonal to the failure) and hence a lower support. Sequences containing them will consequently have a lower support as well. The top-ranking sequences are therefore more likely to focus only on the attributes of interest, which is what we want to achieve.
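The attribute-splitting transformation itself is mechanical, as the sketch below shows (illustrative only; the event representation is an assumption).

def split_attributes(events):
    converted = []
    for etype, attrs in events:
        for name, value in attrs.items():
            converted.append((etype, name, value))    # one single-attribute event per attribute
    return converted

log = [("msg_sent", {"nodeid": 1, "msgtype": 2, "nodetype": "l"}),
       ("msg_sent", {"nodeid": 2, "msgtype": 2, "nodetype": "m"})]
print(split_attributes(log))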
3.6 Handling Continuous Data Types
When logged parameters are continuous, it is very hard to identify frequent patterns, as there are potentially an infinite number of possible values for them. To map continuous data to a finite set, we simply discretize it into a number of categories (bins).
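A minimal equal-width binning sketch (the bin count and value range are illustrative):

def discretize(value, low, high, bins=8):
    # Map a continuous reading to a bin index in [0, bins - 1].
    if value <= low:
        return 0
    if value >= high:
        return bins - 1
    return int((value - low) / (high - low) * bins)

print(discretize(23.7, low=0.0, high=100.0))    # a reading of 23.7 falls into bin 1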
3.7 Optimal Partitioning of Data
It is challenging to separate the log into a part that contains only good behavior and a part that includes the cause of failure. In particular, how far back from the failure point should we look for the "bad" sequence? If we go too far back, we will mix the sequences that are responsible for the failure with sequences that are unrelated to it. If we do not go back far enough, we may miss the root cause of the failure. As a heuristic, we start with a small window size (the default is set to 200 events) and iteratively increase it if necessary. After applying the sequence mining algorithm with that window size, the tool performs differential analysis between good patterns and bad patterns. If the bad pattern list is empty after the analysis, the tool tries a larger window.
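The iterative enlargement of the "bad" window can be sketched as follows (illustrative only; mine_discriminative(good, bad) stands for the differential analysis step, and the growth factor is an assumption apart from the 200-event default mentioned above).

def find_bad_window(log, failure_index, mine_discriminative, start=200, factor=2):
    window = start
    while True:
        cut = max(0, failure_index - window)
        good_part, bad_part = log[:cut], log[cut:failure_index]
        patterns = mine_discriminative(good_part, bad_part)
        if patterns or cut == 0:    # found bad patterns, or the whole prefix is already used
            return window, patterns
        window *= factor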
3.8 Occurrence of Multiple Bugs
It is possible that there may be traces of multiple different bugs in the log. Fortunately, this does not impact the operation of our algorithm. As the algorithm does not assume anything about the number or types of bugs, it will report all plausible causes of failure as usual. In the presence of multiple bugs manifested through different bad event sequences, all such sequences will be found and reported.
We realize that the types of debugging algorithms needed are different for different applications and are going to evolve over time with the evolution of hardware and software platforms. Hence, we aim to develop a modular tool architecture that facilitates evolution and reuse. Keeping that in mind, we developed a software architecture that provides the necessary functionality and flexibility for future development. The goal of our architecture is to facilitate easy use of, and experimentation with, different debugging techniques and to foster future development. As there are numerous different types of hardware, programming abstractions, and operating systems in use for wireless sensor networks, the architecture must be able to accommodate different combinations of hardware and software. Different ways of collecting data should not affect the way the data analysis layer works. Similarly, we realize that different types of bugs may require different techniques to identify them, and we want to provide a flexible framework to experiment with different data analysis algorithms. Based on the above requirements, we separate the whole system into three subsystems: (i) a data collection front-end, (ii) data preprocessing middleware, and (iii) a data analysis back-end. This architecture is shown in Figure 1.
4.1 Data Collection Front-End
The role of the data collection front-end is to provide the debug information (i.e., log files) that can be analyzed for diagnosing failures. The source of this debug log is irrelevant to the data analysis subsystem. As shown in the figure, the developer may choose to analyze recorded radio communication messages obtained using a passive listening tool, execution traces obtained from simulation runs, run-time sequences of events obtained by logging on actual application motes, and so on. With this separation of concerns, the front-end developer can design and implement the data collection subsystem more efficiently and independently. The data collection front-end developer merely needs to provide the format of the recorded data. These data are used by the data preprocessing middleware to parse the raw recorded byte streams.
4.2 Data Preprocessing Middleware
This middleware, which sits between the data collection front-end and the data analysis back-end, provides the necessary functionality to change or modify one subsystem without affecting the other. The interface between the data collection front-end and the data analysis back-end is further divided into the following layers:
– Data cleaning layer: This layer is front-end specific; each supported front-end has one instance of it. The layer is the interface between the particular data collection front-end and the data preprocessing middleware. It ensures that the recorded events are compliant with format requirements.
– Data parsing layer: This layer is provided by our framework and is responsible for extracting meaningful records from the recorded raw byte stream. To parse the recorded byte stream, this layer requires a header file describing the recorded message format. This information is provided by the application developer (i.e., the user of the data collection front-end).
– Data labeling layer: To be able to identify the probable causes of failure, the data analysis subsystem needs samples of logged events representing both "good" and "bad" behavior. As "good" or "bad" behavior semantics are an application-specific criterion, the application developer needs to implement a predicate (a small module) whose interface is provided by the framework. The predicate, presented with an ordered event log, decides whether the behavior is good or bad (a minimal sketch of such a predicate is given after this list).
– Data conversion layer: This layer provides the interface between the data preprocessing middleware and the data analysis subsystem. One instance of this layer exists for each different analysis back-end. This layer is responsible for converting the labeled data into the appropriate format for the data analysis algorithm. The interface of this data conversion layer is provided by the framework. As different data analysis algorithms and techniques can be used for analysis, each may have different input format requirements. This layer provides the necessary functionality to accommodate supported data analysis techniques.

Fig. 1. System architecture: application-specific data collection front-ends (runtime logging), the data preprocessing middleware (data cleaning, parsing, labeling, and conversion), and data analysis back-ends (e.g., WEKA, the discriminative frequent pattern miner, and a graphical visualizer)
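A minimal sketch of the labeling predicate mentioned above is given here; the function name, event field names, and the success criterion are invented for illustration and do not reflect the framework's actual interface.

def label_run(events):
    # Example predicate for a simple collection application:
    # a run is "good" if every transmitted packet was eventually acknowledged.
    sent = sum(1 for e in events if e["type"] == "Packet_Sent")
    acked = sum(1 for e in events if e["type"] == "Ack_Received")
    return "good" if acked >= sent else "bad"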
4.3 Data Analysis Back-End

At present, we implement the data analysis algorithm and its modifications presented earlier. It is responsible for identifying the causes of failures. The approach is extensible. As newer analysis algorithms are developed that catch more or different types of bugs, they can easily be incorporated into the tool as alternative back-ends. Such algorithms can be applied in parallel to analyze the same set of logs to find different problems with them.
We describe two case studies, first published by the authors in SenSys 2008 [16], to demonstrate the use of the new diagnostic debugging tool. The first case study presents a kernel-level bug in the LiteOS operating system. The second presents an example of debugging a multichannel Media Access Control (MAC) protocol [18] implemented in TinyOS 2.0 for the MicaZ platform with one half-duplex radio interface.
5.1 Case Study - I: LiteOS Bug
In this case study, we troubleshoot a simple data collection application where several sensors monitor light and report it to a sink node. The communication is performed in a single-hop environment. In this scenario, sensors transmit packets to the receiver, and the receiver records received packets and sends an "ACK" back. The sending rate that sensors use is variable and depends on the variations in their readings. After receiving each message, depending on its sequence number, the receiver decides whether to record the value or not. If the sequence number is older than the last sequence number it has received, the packet is dropped. This application is implemented using MicaZ motes on the LiteOS operating system and is tested on an experimental testbed. Each of the nodes is connected to a desktop computer via an MIB520 programming board and a serial cable. The PC acts as the base station. In this experiment, there was one receiver (the base node) and a set of 5 senders (monitoring sensors). This experiment illustrates a typical experimental debugging setup. Prior to deployment, programmers would typically test the protocol on target hardware in the lab. This is how such a test might proceed.
Failure Scenario. When this simple application was stress-tested, some of the nodes would crash occasionally and non-deterministically. Each time, different nodes would crash, and at different times. Perplexed by the situation, the developer (a first-year graduate student with no prior experience with sensor networks) decided to log different types of events using LiteOS support and our debugging tool. These were mostly kernel-level events along with a few application-level events. The built-in logging functionality provided by LiteOS was used to log the events. A subset of the different types of events that were logged is listed in Figure 2.
Recorded events (attribute list in parentheses):
Context_Switch_To_User_Thread (Null)
Get_Current_Thread_Index (Null)
Get_Current_Radio_Info_Address (Null)
Get_Current_Radio_Handle_Address (Null)
Get_Current_Serial_Info_Address (Null)
Get_Serial_Send_Function (Null)
Disable_Radio_State (Null)
Yield_To_System_Thread (Null)
Get_Current_Thread_Address (Null)
Get_Radio_Send_Function (Null)
Mutex_Unlock_Function (Null)

Fig. 2. A subset of the recorded event types and their attribute lists

Fig. 4. Discriminative frequent patterns found only in the "bad" log for the LiteOS bug
Failure Diagnosis. After running the experiment, "good" logs were collected from the nodes that did not crash during the experiment, and "bad" logs were collected from nodes that crashed at some point in time. After applying our discriminative frequent pattern mining algorithm to the logs, we provided two sets of patterns to the developer: one set includes the highest-ranked discriminative patterns that are found only in the "good" logs, as shown in Figure 3, and the other set includes the highest-ranked discriminative patterns that are found only in the "bad" logs, as shown in Figure 4.
Based on the discriminative frequent patterns, it is clear that in the "good" pile, the Packet_Received event is highly correlated with the Get_Current_Radio_Handle event. On the other hand, in the "bad" pile, though the Packet_Received event is present, the other event is missing. In the "bad" pile, Packet_Received is highly correlated with the Get_Serial_Send_Function event. From these observations, it is clear that proceeding with Get_Serial_Send_Function when Get_Current_Radio_Handle is missing is the most likely cause of failure.
To explain the error, we briefly describe the way a received packet is handled in LiteOS. In the application, the receiver always registers for receiving packets, then waits until a packet arrives. At that time, the kernel switches back to the user thread with the appropriate packet information. The packet is then processed in the application. However, at very high data rates, another packet can arrive before the processing of the previous packet has finished. In that case, the LiteOS kernel overwrites the radio receive buffer with new information even if the user is still using the old packet data to process the previous packet. Indeed, for correct operation, the Packet_Received event always has to be followed by the Get_Current_Radio_Handle event before the Get_Serial_Send_Function event; otherwise the node crashes. Overwriting a receive buffer for some reason is a very typical bug in sensor networks. This example is presented to illustrate the use of the tool. In Section 5.2 we present a more complex example that explores more of the interactive complexity this tool was truly designed to uncover.
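The ordering rule uncovered by the tool can be checked directly on a log, as in the sketch below (illustrative only; this check is not part of the tool itself).

def violates_ordering(events):
    awaiting_handle = False
    for e in events:
        if e == "Packet_Received":
            awaiting_handle = True
        elif e == "Get_Current_Radio_Handle":
            awaiting_handle = False
        elif e == "Get_Serial_Send_Function" and awaiting_handle:
            return True    # serial send reached before the radio handle was fetched
    return False

print(violates_ordering(["Packet_Received", "Get_Current_Radio_Handle",
                         "Get_Serial_Send_Function"]))                      # False
print(violates_ordering(["Packet_Received", "Get_Serial_Send_Function"]))   # True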
5.2 Case Study - II: Multichannel MAC Protocol
In this case study, we debug a multichannel MAC protocol. The objective of the protocol used in our study is to assign a home channel to each node in the network dynamically in such a way that throughput is maximized. The design of the protocol exploits the fact that in most wireless sensor networks, the communication rate among different nodes is not uniform (e.g., in a data aggregation network). Hence, the problem was formulated in such a way that nodes communicating frequently are clustered together and assigned the same home channel, whereas nodes that communicate less frequently are clustered into different channels. This minimizes the overhead of channel switching when nodes need to communicate. This protocol was recently published in [18].
During experimentation with the protocol, it was noticed that when the data rate between different internally closely-communicating clusters is low, the multichannel protocol outperforms a single-channel MAC protocol comfortably, as it should. However, when the data rate between clusters was increased, while the throughput near the base station still outperformed a single-channel MAC significantly, nodes farther from the base station performed worse than with the single-channel MAC. This should not have happened in a well-designed protocol, as the multichannel MAC protocol should utilize the communication spectrum better than a single-channel MAC. The author of the protocol initially concluded that the performance degradation was due to the overhead associated with communication across clusters assigned to different channels. Such communication entails frequent channel switching, as the sender node, according to the protocol, must switch to the frequency of the receiver before transmission and then return to its home channel. This incurs overhead that increases with the transmission rate across clusters. We decided to verify this conjecture.
As a stress test of our tool, we instrumented the protocol to log events related to the MAC layer (such as message transmission and reception as well as channel switching) and used our tool to determine the discriminative patterns generated from different runs with different message rates, some of which performed better than others. For a better understanding of the failure scenario detected, we briefly describe the operation of the multichannel MAC protocol below.
Multichannel MAC Protocol Overview. In the multichannel MAC protocol, each node initially starts at channel 0 as its home channel. To communicate with others, every node maintains a data structure called the "neighbor table" that stores the home channel of each of its neighboring nodes. Channels are organized as a ladder, numbered from lowest (0) to highest (12). When a node decides to change its home channel, it sends out a "Bye" message on its current home channel which includes its new home channel number. Upon receiving a "Bye" message, each other node updates its neighbor table to reflect the new home channel number for the sender of the "Bye" message. After changing its home channel, a node sends out a "Hello" message on the new home channel which includes its nodeID. All neighboring nodes on that channel add this node as a new neighbor and update their neighbor tables accordingly.
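A minimal sketch of the neighbor-table bookkeeping described above is given below, assuming a simple dictionary per node; the class and method names are our own and are not taken from the protocol's implementation.

```python
# Illustrative sketch of neighbor-table updates on "Bye"/"Hello" messages
# (names and structure are assumptions, not the published implementation).
class NodeState:
    def __init__(self, node_id, home_channel=0):
        self.node_id = node_id
        self.home_channel = home_channel      # channels 0..12, organized as a ladder
        self.neighbor_table = {}              # neighbor_id -> last known home channel

    def on_bye(self, sender_id, new_channel):
        # "Bye" is heard on the sender's old home channel and carries its new one.
        self.neighbor_table[sender_id] = new_channel

    def on_hello(self, sender_id, channel_heard_on):
        # "Hello" is heard on the sender's new home channel and carries its nodeID.
        self.neighbor_table[sender_id] = channel_heard_on

node = NodeState(node_id=7)
node.on_bye(sender_id=3, new_channel=5)       # neighbor 3 announces it is moving to channel 5
node.on_hello(sender_id=9, channel_heard_on=node.home_channel)
print(node.neighbor_table)                    # {3: 5, 9: 0}
```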
To increase robustness to message loss, the protocol also includes a mechanism for discovering the home channel of a neighbor when its current entry in the neighbor table becomes stale. When a node sends a message to a receiver on that receiver's home channel (as listed in the neighbor table) but does not receive an "ACK" after n tries (n is set to 5), it assumes that the destination node is not on its home channel. The reason may be that the destination node has changed its home channel permanently but the notification was lost. Instead of wasting more time on retransmissions on the same channel, the sender starts scanning all channels, asking if the receiver is there. The purpose is to find the receiver's new home channel and update the neighbor table accordingly. The destination node will eventually hear this data message and reply when it is on its home channel. Since the above mechanism is expensive, as an optimization, overhearing is used to reduce staleness of the neighbor table. Namely, a node updates the home channel of a neighbor in its neighbor table when the node overhears an acknowledgement ("ACK") from that neighbor sent on that channel. Since the "ACK"s are used as a mechanism to infer home channel information, whenever a node switches channels temporarily (e.g., to send to a different node on the home channel of the latter), it delays sending out "ACK" messages until it comes back to its home channel, in order to prevent incorrect updates of neighbor tables by recipients of such ACKs.
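The retry-then-scan recovery and the ACK-suppression rule can be summarized in the hedged sketch below; the retry limit of 5 is taken from the text, while the helper names, the FakeRadio stand-in, and the scanning loop structure are illustrative assumptions.

```python
# Illustrative sender-side recovery for stale neighbor-table entries
# (all names are assumptions; FakeRadio only simulates ACK behaviour).
MAX_RETRIES = 5          # 'n' from the protocol description
NUM_CHANNELS = 13        # channels 0..12

class FakeRadio:
    def __init__(self, dest_home_channel):
        self.dest_home_channel = dest_home_channel
    def send_and_wait_ack(self, channel):
        return channel == self.dest_home_channel    # ACK only on the real home channel

def send_with_recovery(radio, neighbor_table, dest):
    channel = neighbor_table.get(dest, 0)
    for _ in range(MAX_RETRIES):                    # retry on the recorded channel
        if radio.send_and_wait_ack(channel):
            return channel
    for ch in range(NUM_CHANNELS):                  # stale entry: scan every channel
        if radio.send_and_wait_ack(ch):
            neighbor_table[dest] = ch               # learn the new home channel
            return ch
    return None

def should_send_ack(home_channel, current_channel):
    # The protocol suppresses ACKs while a node is off its home channel, so that
    # overhearing nodes do not record a transient channel as the home channel.
    return current_channel == home_channel

table = {4: 2}                                      # we believe node 4 lives on channel 2
radio = FakeRadio(dest_home_channel=7)              # but it actually moved to channel 7
print(send_with_recovery(radio, table, dest=4), table)   # 7 {4: 7}
```

As the diagnosis later in this section shows, it is the ACK suppression, rather than the channel switching itself, that triggers frequent scanning under heavy inter-cluster traffic.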
Finally, to estimate channel conditions, each node periodically broadcasts a "channelUpdate" message which contains information about successfully received and sent messages during the last measurement period (where the period is set at compile time). Based on that information, each node calculates the channel quality (i.e., the probability of successfully accessing the medium), and uses that measure to probabilistically decide whether or not to change its home channel. Nodes that sink a lot of traffic (e.g., aggregation hubs or cluster heads) switch first. Others that communicate heavily with them follow. This typically results in a natural separation of node clusters into different frequencies so they do not interfere.
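The paper does not give the exact quality formula, so the following sketch only illustrates the general shape of such a periodic, probabilistic switching decision; the quality estimate and the switching probability used here are assumptions for illustration.

```python
# Illustrative channelUpdate bookkeeping and probabilistic switch decision.
# The quality formula and switch probability are assumptions, not the
# published protocol's exact rules.
import random

class ChannelStats:
    """Per-period counters carried in the periodic 'channelUpdate' broadcast."""
    def __init__(self):
        self.attempts = 0
        self.successes = 0       # successfully sent or received messages
    def quality(self):
        # Assumed estimate of the probability of successfully accessing the medium.
        return 1.0 if self.attempts == 0 else self.successes / self.attempts

def maybe_switch_channel(stats, current_channel, num_channels=13):
    # Assumption: switch with probability proportional to how bad the channel is.
    if random.random() < (1.0 - stats.quality()):
        return (current_channel + 1) % num_channels    # move up the channel ladder
    return current_channel

stats = ChannelStats()
stats.attempts, stats.successes = 100, 40              # a fairly congested channel
print(maybe_switch_channel(stats, current_channel=0))  # switches with probability 0.6
```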
Performance Problem. This protocol was executed on 16 MicaZ motes implementing an aggregation tree where several aggregation cluster heads filter data received from their children, significantly reducing the amount forwarded, then send that reduced data to a base station. When the data rate across clusters was low, the protocol outperformed the single channel MAC. However, when the data rate among clusters was increased, the performance of the protocol deteriorated significantly, performing worse than a single channel MAC in some cases. The developer of the protocol assumed that this was due to the overhead associated with the channel change mechanism, which is incurred when communication happens heavily among different clusters. Much debugging effort was spent in that direction with no result.
Failure Diagnosis. To diagnose the cause of the performance problem, we logged the MAC events relating to radio frequency changes and message communication. The question posed to our tool was "Why is the performance bad at higher data rate?" To answer this question, we first executed the protocol at low data rates (when the performance is better than the single channel MAC) to collect logs representing "good" behavior. We then executed the protocol again with a high data rate (when the performance is worse than the single channel MAC) to collect logs representing "bad" behavior.
After performing discriminative pattern analysis, the list of the top 5 discriminative patterns produced by our tool is shown in Figure 5.
The sequences indicate that, in all cases, there seems to be a problem with not receiving acknowledgements. The lack of acknowledgements causes a channel scanning pattern to unfold. This is shown as the Retry Transmission event on different channels, as a result of not receiving acknowledgements. Hence, the problem does not lie in the frequent overhead of senders changing their channel to that of their receiver in order to send a message across clusters. The problem lay in the frequent lack of response (an ACK) from a receiver. At the first stage of frequent pattern mining, No Ack Received is identified as the most frequent event. At the second stage, the algorithm searched for frequent patterns in the top K (e.g., top 5) segments of the logs where the No Ack Received event occurred with the highest frequency. The second stage of the log analysis (correlating frequent events to preceding ones) then uncovered that the lack of an ACK from the receiver is preceded by a temporary channel change. This gave away the bug. As we described earlier, whenever a node changes its channel temporarily, it disables "ACK"s until it comes back to its home channel. In a high intercluster communication scenario, disabling the "ACK" is a bad decision for a node that spends a significant amount of time communicating with other clusters on channels other than its own home channel. As a side effect, nodes which are trying to communicate with it fail to receive an "ACK" for a long time and start scanning channels frequently, looking for the missing receiver. Another interesting aspect of the problem that was discovered is its cascading effect. When we look at the generated discriminative patterns across multiple nodes (not shown due to space limitations), we see that the scanning patterns revealed in the logs in fact cascade. Channel scanning at the destination node often triggers channel scanning at the sender node, and this interesting cascaded effect was also captured by our tool.
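The two-stage analysis described above can be approximated by the following hedged sketch: the first stage counts single events that are frequent in the "bad" logs but not in the "good" logs, and the second stage mines short patterns only inside the log segments surrounding the top event. The segmentation and counting details here are simplifications, not the tool's actual algorithm.

```python
# Simplified two-stage discriminative mining over event logs (a sketch, not the
# tool's actual implementation).
from collections import Counter

def stage1_top_event(good_log, bad_log):
    """Find the event whose frequency differs most between bad and good logs."""
    good, bad = Counter(good_log), Counter(bad_log)
    return max(set(bad), key=lambda e: bad[e] - good.get(e, 0))

def stage2_patterns(bad_log, anchor_event, window=5, top_k=5, max_len=3):
    """Mine frequent event subsequences inside the top-K segments around the anchor."""
    positions = [i for i, e in enumerate(bad_log) if e == anchor_event][:top_k]
    segments = [bad_log[max(0, p - window):p + 1] for p in positions]
    patterns = Counter()
    for seg in segments:
        for length in range(2, max_len + 1):
            for start in range(len(seg) - length + 1):
                patterns[tuple(seg[start:start + length])] += 1
    return patterns.most_common(5)

good = ["Send", "Ack_Received", "Send", "Ack_Received"]
bad = ["Temp_Channel_Change", "Send", "No_Ack_Received", "Retry_Transmission",
       "Temp_Channel_Change", "Send", "No_Ack_Received", "Retry_Transmission"]
anchor = stage1_top_event(good, bad)     # -> "No_Ack_Received" (or another bad-only event)
print(anchor, stage2_patterns(bad, anchor))
```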
As a quick fix, we stopped disabling the "ACK" when a node is outside its home channel. This may appear to violate some correctness semantics, because a node may now send an ACK while temporarily being on a channel other than its home. This, one would think, will pollute the neighbor tables of nodes that overhear the ACK, because they will update their tables to indicate an incorrect home channel. In reality, the performance of the MAC layer improved significantly (by up to 50%), as shown in Figure 6. In retrospect, this is not unexpected. As intercluster communication increases, the distinction between one's home channel and the home channel of another node with which one communicates a lot becomes fuzzy, as one spends more and more time on that other node's home channel (to send messages to it). When ACKs are never disabled, the neighbor tables of nodes will tend to record, with a higher probability, the channel on which each neighbor spends most of its time. This could be the neighbor's home channel or the channel of a node downstream with which the neighbor communicates a lot. The distinction becomes immaterial as long as the neighbor can be found on that channel with a high probability. Indeed, complex interaction problems often seem simple when explained but are sometimes hard to think of at design time. Dustminer was successful at uncovering the aforementioned interaction and significantly improving the performance of the MAC protocol in question.
Fig. 6. Performance improvement after the bug fix (successful send and successful receive counts)

A Note on Scalability. We compare our results with using the Apriori algorithm for sequence mining. Due to the huge number of events logged for this case study (about 40000 for the "good" logs and 40000 for the "bad" logs), we could not generate frequent patterns of length more than 2 using Apriori. To generate frequent patterns of length 2 for the 40000 events in the "good" log, it took 1683.02 seconds (28 minutes), and to finish the whole computation including differential analysis it took 4323 seconds (72 minutes). With our two-stage mining scheme, it took 5.547 seconds to finish the first stage, and finishing the whole computation including differential analysis took 332.924 seconds (6 minutes). In terms of the quality of the generated sequences (which is often correlated with the length of the sequence), our algorithm returned discriminative sequences of length up to 8, which was enough to understand the chain of events causing the problem as illustrated above. We tried to generate frequent patterns of length 3 with Apriori, but terminated the process after one day of computation that remained in progress. We used a machine with a 2.53 GHz CPU and 512 MB of RAM. The generated patterns of length 2 were insufficient to give insight into the problem.
Debugging Overhead. To test the impact of logging on application behavior, we ran the multichannel MAC protocol with logging enabled and without logging enabled, at both a moderate data rate and a high data rate. The network was set up as a data aggregation network.
For the moderate data rate experiments, the source nodes (nodes that only send messages) were set to transmit data at a rate of 10 messages/sec, the intermediate nodes were set to transmit data at a rate of 2 messages/sec, and one node was acting as the base station (which only receives messages). We tested this on an 8-node network with 5 source nodes, 2 intermediate nodes, and one base station. Over multiple runs, after taking the average to get a reliable estimate, the average number of successfully transmitted messages increased by 9.57% and the average number of successfully received messages increased by 2.32%. The most likely reason is that writing to flash was creating a randomization effect, which probably helped to reduce interference at the MAC layer.
At the high data rate, source nodes were set to transmit data at a rate of 100 messages/sec and intermediate nodes were set to transmit data at a rate of 20 messages/sec. Over multiple runs, after taking the average to get a reliable estimate, the average number of successfully transmitted messages was reduced by 1.09% and the average number of successfully received messages dropped by 1.62%. The most likely reason is that the overhead of writing to flash kicked in at such a high data rate and eventually reduced the advantage experienced at the low data rate.
The performance improvement of the multichannel MAC protocol reported in this paper was obtained by running the protocol at the high data rate, to prevent overestimation.
We realize that this effect of logging may change the behavior of the original application slightly, but that effect seems to be negligible in our experience and did not affect the diagnostic capability of the discriminative pattern mining algorithm, which is inherently robust against minor statistical variance.
As the multichannel MAC protocol did not use flash memory to store any data, we were able to use the whole flash for logging events. To test the relation between the quality of the generated discriminative patterns and the logging space used, we used 100KB, 200KB and 400KB of flash space in three different experiments. The generated discriminative patterns were similar. We realize that different applications have different flash space requirements, and the amount of logging space may affect the diagnostic capability. To help under severe space constraints, we provide the radio interface so users can choose to log at different times instead of logging continuously. Users can also choose to log events at different resolutions (e.g., instead of logging every message transmitted, log only every 50th message transmitted).
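A hedged sketch of such resolution-based (sampled) logging is shown below; the class and parameter names are our own and only illustrate the idea of recording every Nth occurrence of an event type, not the tool's logging API.

```python
# Illustrative sampled logging: record only every Nth occurrence of an event
# (names and structure are assumptions, not the tool's logging interface).
class SampledLogger:
    def __init__(self, every_nth=50):
        self.every_nth = every_nth
        self.counts = {}
        self.records = []

    def log(self, event_type, detail=None):
        n = self.counts.get(event_type, 0) + 1
        self.counts[event_type] = n
        if n % self.every_nth == 0:          # keep 1 out of every_nth events
            self.records.append((event_type, n, detail))

logger = SampledLogger(every_nth=50)
for seq in range(1, 201):
    logger.log("Message_Transmitted", detail=seq)
print(len(logger.records))                   # 4 records instead of 200
```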
For the LiteOS case study, we did not use flash space at all, as the events were transmitted to the base station (PC) directly over a serial connection, eliminating the flash space overhead completely; this makes our tool easily usable for testbeds, which often provide serial connections.
In this paper, we presented a sensor network troubleshooting tool that helps the developer diagnose the root causes of errors. The tool is geared towards finding interaction bugs. Very successful examples of debugging tools that hunt for localized errors in code have been produced in the previous literature. The point of departure in this approach lies in focusing on errors that are not localized (such as a bad pointer or an incorrect assignment statement) but rather arise because of adverse interactions among multiple components, each of which appears to be correctly designed. With increased distribution and resource constraints, the interactive complexity of sensor network applications will remain high, motivating tools such as the one we described. Future development of the tool will focus on scalability and the user interface, to reduce the time and effort needed to understand and use it.
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 1994 Int. Conf. Very Large Data Bases (VLDB 1994), Santiago, Chile, September 1994, pp. 487–499 (1994)
2. Cao, Q., Abdelzaher, T., Stankovic, J., He, T.: LiteOS, a UNIX-like operating system and programming platform for wireless sensor networks. In: IPSN/SPOTS, St. Louis, MO (April 2008)
3. Clark, D.: Internet meets sensors: Should we try for architecture convergence? In: Networking of Sensor Systems (NOSS) Principal Investigator and Informational Meeting (October 2005), http://www.eecs.harvard.edu/noss/slides/info/keynote1/clark.ppt
4. Dasu, T., Johnson, T.: Exploratory Data Mining and Data Cleaning. John Wiley & Sons, Chichester (2003)
5. Ganti, V., Gehrke, J., Ramakrishnan, R.: Mining very large databases. Computer 32, 38–45 (1999)
6. Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. In: Proc. ICDM 2003 Int. Workshop on Frequent Itemset Mining Implementations (FIMI 2003), Melbourne, FL (November 2003)
7. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery 15, 55–86 (2007)
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
9. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD 2000), Dallas, TX, May 2000, pp. 1–12 (2000)
10. Hand, D.J., Mannila, H., Smyth, P.: Principles of Data Mining. MIT Press, Cambridge (2001)
11. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2001)
12. Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., Silva, F.: Directed diffusion for wireless sensor networking. IEEE/ACM Trans. Netw. 11(1), 2–16 (2003)
13. Khan, M., Le, H., Ahmadi, H., Abdelzaher, T., Han, J.: DustMiner: Troubleshooting interactive complexity bugs in sensor networks. In: Proc. 2008 ACM Int. Conf. on Embedded Networked Sensor Systems (Sensys 2008), Raleigh, NC (November 2008)
14. Khan, M., Abdelzaher, T., Gupta, K.: Towards diagnostic simulation in sensor networks. In: Nikoletseas, S.E., Chlebus, B.S., Johnson, D.B., Krishnamachari, B. (eds.) DCOSS 2008. LNCS, vol. 5067, pp. 252–265. Springer, Heidelberg (2008)
15. Khan, M., Abdelzaher, T., Luo, L.: SNTS: Sensor network troubleshooting suite. In: Aspnes, J., Scheideler, C., Arora, A., Madden, S. (eds.) DCOSS 2007. LNCS, vol. 4549, pp. 142–157. Springer, Heidelberg (2007)
16. Khan, M., Le, H.K., Ahmadi, H., Abdelzaher, T., Han, J.: Dustminer: Troubleshooting interactive complexity bugs in sensor networks. In: ACM Sensys, Raleigh, NC (November 2008)
17. Le, H.K., Henriksson, D., Abdelzaher, T.F.: A practical multi-channel media access control protocol for wireless sensor networks. In: IPSN, pp. 70–81 (2008)
18. Lee, H.K., Henriksson, D., Abdelzaher, T.: A practical multi-channel medium access control protocol for wireless sensor networks. In: International Conference on Information Processing in Sensor Networks (IPSN 2008), St. Louis, Missouri (April 2008)
19. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.-C.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans. Knowledge and Data Engineering 16, 1424–1440 (2004)
20. Pevzner, P.A.: Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge (2000)
21. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and performance improvements. In: Proc. 5th Int. Conf. Extending Database Technology (EDBT 1996), Avignon, France, March 1996, pp. 3–17 (1996)
22. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison Wesley, Reading (2005)
23. Tang, Z., MacLennan, J.: Data Mining with SQL Server 2005. John Wiley & Sons, Chichester (2005)
24. Uno, T., Asai, T., Uchida, Y., Arimura, H.: LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In: Proc. ICDM 2004 Int. Workshop on Frequent Itemset Mining Implementations (FIMI 2004) (November 2004)
25. Weiss, S.M., Indurkhya, N.: Predictive Data Mining. Morgan Kaufmann, San Francisco (1998)
26. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
27. Yan, X., Han, J., Afshar, R.: CloSpan: Mining closed sequential patterns in large datasets. In: Proc. 2003 SIAM Int. Conf. Data Mining (SDM 2003), San Francisco,
Monitoring Incremental Histogram Distribution for Change Detection in Data Streams
Raquel Sebastião1,2, João Gama1,3, Pedro Pereira Rodrigues1,2,4, and João Bernardes4,5
1 LIAAD - INESC Porto, L.A Rua de Ceuta, 118, 6
4050-190 Porto, Portugal
2 Faculty of Science, University of Porto
3 Faculty of Economics, University of Porto
4 Faculty of Medicine, University of Porto
5 INEB, Porto
{raquel,jgama}@liaad.up.pt,{pprodrigues,joaobernardes}@med.up.pt
Abstract. Histograms are a common technique for density estimation and they have been widely used as a tool in exploratory data analysis. Learning histograms from static and stationary data is a well known topic. Nevertheless, very few works discuss this problem when we have a continuous flow of data generated from dynamic environments.
The scope of this paper is to detect changes from high-speed time-changing data streams. To address this problem, we construct histograms able to process examples once, at the rate they arrive. The main goal of this work is to continuously maintain a histogram consistent with the current state of nature. We study strategies to detect changes in the distribution generating examples, and adapt the histogram to the most recent data by forgetting outdated data. We use the Partition Incremental Discretization algorithm, which was designed to learn histograms from high-speed data streams.
We present a method to detect whenever a change in the distribution generating examples occurs. The base idea consists of monitoring distributions from two different time windows: the reference window, reflecting the distribution observed in the past, and the current window, which receives the most recent data. The current window is cumulative and can have a fixed or an adaptive step depending on the distance between distributions. We compare both distributions using the Kullback-Leibler divergence, defining a threshold for the change detection decision based on the asymmetry of this measure.
We evaluated our algorithm with controlled artificial data sets and compare the proposed approach with nonparametric tests. We also present results with real-world data sets from industrial and medical domains. Those results suggest that an adaptive window step exhibits a high probability of change detection and faster detection rates, with few false positive alarms.

Keywords: Change detection, Data streams, Machine learning, Learning histograms, Monitoring data distribution, Adaptive cumulative windows.
1 Introduction
Nowadays, the scenario of finite stored data sets is no longer appropriate, because information is gathered in the form of transient and infinite data streams. As a massive amount of information is produced at a high-speed rate, it is no longer possible to use algorithms that require storing the full historic data in main memory. In data streams the data elements are continuously received, treated, and discarded. In this context, processing time, memory, and sample size are the crucial constraints in knowledge discovery systems [3]. Due to the exploratory nature of the data and to time restrictions, a user may prefer a fast but approximate answer to an exact but slow answer. Methods to deal with these issues consist of applying synopsis techniques, such as histograms [12,16,19,26], sketches [8] and wavelets [7,15]. Histograms are one of the techniques used in data stream management systems to speed up range queries and selectivity estimation (the proportion of tuples that satisfy a query), two illustrative examples where fast but approximate answers are more useful than slow and exact ones.
In the context of open-ended data streams, as we never observe all values of the random variable, it is not appropriate to use traditional histograms to construct a graphical representation of continuous data, because they require knowledge of all the data. Thus, algorithms that conveniently address this issue are still missing. The Partition Incremental Discretization [12,26] and the V-Optimal histograms [14,16,18] are two examples. A key characteristic of a data stream is its dynamic nature. The process generating data is not strictly stationary and evolves over time. The target concept may gradually change over time. Moreover, when data is collected over time, at least for large periods of time, it is not acceptable to assume that the observations are generated at random according to a stationary probability distribution. Several methods in machine learning have been proposed to deal with concept drift [11,17,21,23,26,28,29]. Drifting concepts are often handled by time windows or by weighting examples according to their age or utility. Another approach to detect drifting concepts is to monitor distributions on two different time windows, following the evolution of a statistical function between two distributions: from past data in a reference window and in a current window with the most recent data points [20,27].
1.1 Previous Work
In a previous work [27], we presented a method to detect changes in data streams. In that work, we constructed histograms using the two-layer structure of the Partition Incremental Discretization (PiD) algorithm and addressed the detection problem by monitoring distributions using a fixed window model. In this work, we propose a new definition of the number of histogram bins and the use of an adaptive-cumulative window model to detect changes. We also perform studies on the distance measures and advance a discrepancy measure based on the asymmetry of the Kullback-Leibler Divergence (KLD). We support this decision with previous results. The results of [27] suggest that the KLD achieves faster detection rates than the other tested distance measures (a measure based on entropy and the cosine distance).
1.2 Motivation, Challenges and Paper Outline
The motivation for studying time-changing high-speed data streams comes from the emergence of temporal applications such as communications networks, web searches, financial applications, and sensor data, which produce massive streams of data. Since it is impractical to store all data completely in memory, new algorithms are needed to process data online at the rate it becomes available. Another challenge is to create compact summaries of data streams. Histograms are in fact compact representations of continuous data. They can be used as a component in more sophisticated data mining algorithms, like decision trees [17].
As the distribution underlying the data elements may change over time, the development of methods to detect when and how the process generating the stream is evolving is the main challenge of this study. The main contribution of this paper is a new method to detect changes when learning histograms using adaptive windows. We are able to detect a time window where a change has occurred. Another contribution is an improved technique to initialize histograms satisfying a user constraint on the admissible relative error.
The proposed method has potential use in the medical and industrial domains, namely in monitoring biomedical signals and production processes, respectively. The paper is organized as follows. The next section presents an algorithm to continuously maintain histograms over a data stream. In Section 3 we extend the algorithm for change detection. Section 4 presents a preliminary evaluation of the algorithm on benchmark datasets and real-world problems. The last section concludes the paper and presents some future research lines.
Histograms are one of the most used tools in exploratory data analysis. They present a graphical representation of data, providing useful information about the distribution of a random variable. A histogram is visualized as a bar graph that shows frequency data. The basic algorithm to construct histograms consists of sorting the values of the random variable and placing them into bins. Next, it counts the number of data samples in each bin. The height of the bar drawn on top of each bin is proportional to the number of observed values in that bin.
A histogram is defined by a set of $k$ non-overlapping intervals, and each interval is defined by its boundaries and a frequency count. The most used histograms are either equal width, where the range of observed values is divided into $k$ intervals of equal length ($\forall i,j: (b_i - b_{i-1}) = (b_j - b_{j-1})$), or equal frequency, where the range of observed values is divided into $k$ bins such that the counts in all bins are equal ($\forall i,j: f_i = f_j$).
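For concreteness, the sketch below builds both kinds of histogram from a batch of values; it is a plain illustration of the definitions above, not the incremental algorithm proposed in this paper.

```python
# Equal-width vs. equal-frequency binning for a static batch of values
# (a plain illustration of the definitions, not the incremental PiD algorithm).
def equal_width(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    breaks = [lo + i * width for i in range(1, k)]
    counts = [0] * k
    for x in values:
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return breaks, counts

def equal_frequency(values, k):
    xs = sorted(values)                      # requires a sort over all the data
    per_bin = len(xs) // k
    breaks = [xs[i * per_bin] for i in range(1, k)]
    counts = [per_bin] * (k - 1) + [len(xs) - per_bin * (k - 1)]
    return breaks, counts

data = [0.1, 0.4, 0.5, 0.9, 1.2, 1.4, 2.0, 2.2, 3.1, 3.3, 3.4, 4.0]
print(equal_width(data, 4))
print(equal_frequency(data, 4))
```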
When all the data is available, there are exact algorithms to construct histograms [25]. All these algorithms require a user-defined parameter $k$, the number of bins. Suppose we know the range of the random variable (domain information) and the desired number of intervals $k$. The algorithm to construct equal-width histograms traverses the data once, whereas in the case of equal-frequency histograms a sort operation is required.
One of the main problems of using histograms is the definition of the number of intervals. A rule that has been used is Sturges' rule: $k = 1 + \log_2 n$, where $k$ is the number of intervals and $n$ is the number of observed data points. This rule has been criticized because it implicitly uses a binomial distribution to approximate an underlying normal distribution.¹ Sturges' rule has probably survived because, for moderate values of $n$ (less than 200), it produces reasonable histograms. However, it does not work for large $n$. Scott gave a formula for the optimal histogram bin width which asymptotically minimizes the integrated mean square error. Since the underlying density is usually unknown, he suggested using the Gaussian density as a reference standard, which leads to the data-based choice for the bin width of $a \times s \times n^{-1/3}$, where $a = 3.49$ and $s$ is the estimate of the standard deviation.
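As a quick worked comparison of these rules (our own numerical example, not from the paper): for $n = 10000$ observations, Sturges' rule gives $k = 1 + \log_2 10000 \approx 1 + 13.3 \approx 14$ bins, while Scott's rule with a sample standard deviation of $s = 1$ gives a bin width of $3.49 \times 1 \times 10000^{-1/3} \approx 0.16$; over a data range of, say, 8 units that corresponds to roughly 50 bins, illustrating how Sturges' rule under-bins large samples.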
In exploratory data analysis, histograms are used iteratively. The user tries several histograms using different values of $k$ (the number of intervals), and chooses the one that best fits his purposes.
2.1 The Partition Incremental Discretization (PID)
The Partition Incremental Discretization algorithm (PiD for short) was designed to provide a histogram representation of high-speed data streams. It learns histograms using an architecture composed of two layers. The first simplifies and summarizes the data: the algorithm traverses the data once and incrementally maintains an equal-width discretization. The second layer constructs the final histogram using only the discretization of the first phase. The first layer is initialized without seeing any data. As described in [12], the input for the initialization phase is the number of intervals (which should be much larger than the desired final number of intervals) and the range of the variable.
Consider a sample $x_1, x_2, \ldots$ of an open-ended random variable with range $R$. In this context, and allowing for extreme values and outliers, the histogram is defined as a set of break points $b_1, \ldots, b_{k-1}$ and a set of frequency counts $f_1, \ldots, f_{k-1}, f_k$ that define $k$ intervals in the range of the random variable:

$$]-\infty, b_1],\; ]b_1, b_2],\; \ldots,\; ]b_{k-2}, b_{k-1}],\; ]b_{k-1}, \infty[. \qquad (1)$$

In a histogram, every $x_i$ in a bin is represented by the corresponding middle point, which means that this approximation error is bounded by half of the length ($L$) of the bin. As the first layer is composed of an equal-width histogram, we obtain:

$$|x_i - m_j| \le \frac{L}{2} = \frac{R}{2k}, \qquad b_j \le x_i < b_{j+1},\; \forall j = 1, \ldots, k. \qquad (2)$$
¹ Alternative rules for constructing histograms include Scott's (1979) rule for the class width, $k = 3.5 s n^{-1/3}$, and Freedman and Diaconis's (1981) rule for the class width, $k = 2(IQ)n^{-1/3}$, where $s$ is the sample standard deviation and $IQ$ is the sample interquartile range.
Considering the set of middle-break points $m_1, \ldots, m_k$ of the histogram, we define the mean square error in each bin as the sum of the square differences between each point in that bin and its corresponding middle-break point:

$$\sum_{i} (x_i - m_j)^2 \le n_j \frac{R^2}{4k^2}, \qquad b_j \le x_i < b_{j+1},\; \forall j = 1, \ldots, k,$$

where $n_j$ is the number of data points in each bin. The quadratic error (QE) is defined as the sum of this error over all bins.
worst case, by: nR2/4k2, where n denotes the number of observed variables.
The definition of the number of intervals is one of the main problems of usinghistograms The number of bins is directly related with the quadratic error Howdifferent would the quadratic error be if we consider just one more bin? To studythe evaluation of the quadratic error of a histogram with the number of bins, we
compute the following ratio, which we refer to as the relative error: = QE(k)
QE(k+1).
In order to bound the decrease of the quadratic error, we define the number
of bins of the first layer as dependent on the upper bound of the relative error
() and on the fail probability (δ):
N1= O(1
ln
1
Establishing a bound for relative error, this definition of the number of bins
ensures that the fail probability will converge to zero when N1 increases So,
setting and δ and using this definition we control the decrease of the quadratic
error Figure 1 shows that the number of bins increases when the error decreasesand the confidence increases Figure 1 (top) represents the number of bins of
layer1in function of and δ The bottom figures give a projection of the number
of bins according with the variables and δ (respectively).
So, differing from [12] the input for the initialization phase is a pair of rameters (that will be used to express accuracy guarantees) and the range of thevariable:
pa-– The upper bound on relative error .
– The desirable confidence level 1− δ.
– The range of the variable.
The range of the variable is only indicative It is used to initialize the set
of breaks using an equal-width strategy Each time we observe a value of the
random variable, we update layer1 The update process determines the intervalcorresponding to the observed value, and increments the counter of this interval
The process of updating layer1 works online, performing a single scan over thedata stream It can process infinite sequences of data, processing each example
in constant time and space The second layer merges the set of intervals defined
by the first layer The input for the second layer is the breaks and counters of
layer1, the type of histogram (equal-width or equal-frequency) and the
desir-able final number of intervals The algorithm for the layer2 is very simple For
Trang 38Fig 1 Representation of the number of bins of layer1 The top figure shows thedependency from and δ and bottom figures show it according to only one variable.
equal-width histograms, it first computes the breaks of the final histogram, from
the actual range of the variable (estimated in layer1) The algorithm traversesthe vector of breaks once, adding the counters corresponding to two consecutive
breaks For equal-frequency histograms, we first compute the exact number F of
points that should be in each final interval (from the total number of points andthe number of desired intervals) The algorithm traverses the vector of counters
of layer1 adding the counts of consecutive intervals till F The computational
costs of this phase can be ignored: it traverses once the discretization obtained inthe first phase We can construct several histograms using different number of in-tervals and different strategies: equal-width or equal-frequency This is the mainadvantage of PiD in exploratory data analysis We use PiD algorithm to createcompact summaries of data, and along with the improvement of the number ofbins definition, we also accomplished it with a change detection technique
The algorithm described in the previous section assumes that the observationscome from a stationary distribution When data flows over time, and at least forlarge periods of time, it is not acceptable to assume that the observations aregenerated at random according to a stationary probability distribution At least
in complex systems and for large time periods, we should expect changes in thedistribution of the data
Trang 393.1 Related Work
When monitoring a stream is fundamental to know if the received data comesfrom the distribution observed so far It is necessary to perform tests in order todetermine if there is a change in the underlying distribution The null hypothesis
is that the previously seen values and the current observed values come from thesame distribution The alternative hypothesis is that they are generated fromdifferent continuous distributions
There are several methods in machine learning to deal with changing cepts [21,22,23,29] In general, approaches to cope with concept drift can be clas-
con-sified into two categories: i) approaches that adapt a learner at regular intervals without considering whether changes have really occurred; ii) approaches that
first detect concept changes, and next, the learner is adapted to these changes
Examples of the former approaches are weighted examples and time windows of
fixed size Weighted examples are based on the simple idea that the importance
of an example should decrease with time (references about this approach can befound in [22,24,29]) When a time window is used, at each time step the learner
is induced only from the examples that are included in the window Here, thekey difficulty is how to select the appropriate window’s size: a small window canassure a fast adaptability in phases with concept changes but in more stablephases it can affect the learner performance, while a large window would pro-duce good and stable learning results in stable phases but can not react quickly
to concept changes
In the latter approaches, with the aim of detecting concept changes, someindicators (e.g performance measures, properties of the data, etc.) are monitoredover time (see [21] for a good classification of these indicators) If during themonitoring process a concept drift is detected, some actions to adapt the learner
to these changes can be taken When a time window of adaptive size is usedthese actions usually lead to adjusting the window’s size according to the extent
of concept drift [21] As a general rule, if a concept drift is detected the window’ssize decreases; otherwise the window’s size increases
Windows Models Most of the methods in this approach monitor the
evo-lution of a distance function between two distributions: from past data in a
reference window and in a current window of the most recent data points An
example of this approach, in the context of learning from Data Streams, hasbeen present by [20] The author proposes algorithms (statistical tests based onChernoff bound) that examine samples drawn from two probability distributionsand decide whether these distributions are different
In this work, we monitor the distance between the distributions in two timewindows: a reference window that has a fixed size and refers to past observa-tions and an adaptive-cumulative window that receives the actual observationsand could have a fixed or an adaptive step depending on the distance betweendistributions For both windows, we compute the relative frequencies: a set of
empirical probabilities p(i) for the reference window and q(i) for the
adaptive-cumulative window
Trang 40Adaptive-Cumulative Window Model In a previous work [27] we defined
the windows sizes as dependent on the number of intervals of the layer1, beinghalf of these ones:N1
2 In this work, in order to evaluate the influence of the ber of examples required to detect a change, we defined the cumulative window(the current one) using an adaptive increasing step that depends on the distancebetween data distributions Starting with a size of N1
num-2 , the step is incremented ifthe distance between data distributions increases and is decremented otherwise,according to the following relation:
Figure 2 shows the dependency of the window’s step on distributions’ distance
Fig 2 Representation of the windows’ step with respect to the absolute difference
betweenKLD(p||q) and KLD(q||p) An illustrative example showing that the window’s
step decreases when the absolute difference between distances increases
3.2 Distance between Distributions – Kullback-Leibler Divergence
Assuming that sample in the reference window has distribution p and that data
in the current window has distribution q, we use as a measure to detect whether
has occurred a change in the distribution the Kullback-Leibler Divergence (KLD)From information theory [4], the Relative Entropy is one of the most gen-eral ways of representing the distance between two distributions [10] Contrary
to the Mutual Information this measure assesses the dissimilarity between twovariables Also known as the Kullback-Leibler divergence, it measures the dis-tance between two probability distributions and so it can be used to test forchange
² KLD stands for Kullback-Leibler Divergence. This measure is introduced in the next subsection.