DSpace at VNU: DisCaRia-Distributed Case-Based Reasoning System for Fault Management


DisCaRia—Distributed Case-Based Reasoning

System for Fault Management

Ha Manh Tran and Jürgen Schönwälder, Senior Member, IEEE

Abstract—Fault resolution in communication networks and distributed systems is a challenge that demands the expertise of system administrators and the support of multiple systems, such as monitoring and event correlation systems. Trouble ticket systems are frequently used to organize the workflow of the fault resolution process. In this context, we introduce DisCaRia, a distributed case-based reasoning system that assists system administrators and network operators in resolving faults. DisCaRia integrates various fault knowledge resources that are already available on the Internet, and it exploits them by applying a distributed case-based reasoning methodology, which is based on scalable peer-to-peer technology. We present the architecture of DisCaRia and the key algorithms it uses, and we provide an evaluation of a prototype implementation of the system.

Index Terms—Fault resolution, fault management, case-based reasoning, peer-to-peer, bug tracking system, software bug search.

I. INTRODUCTION

THE RESOLUTION of faults in communication networks and distributed systems is to a large extent a human-driven process. Automated monitoring and event correlation systems [1]–[4] usually produce fault reports that are forwarded to operators for resolution. Support systems [5]–[7] such as trouble ticket systems are frequently used to organize the workflows. Case-based Reasoning (CBR) [8] was proposed in the early 1990s to assist operators in the resolution of faults by providing mechanisms to correlate an observed fault with previously solved similar cases (faults). CBR systems are typically linked to trouble ticket systems since the data maintained in trouble ticket systems can be used to populate a case database. Existing CBR systems for fault resolution usually operate only on a local case database and cannot easily exploit knowledge about faults and their resolutions present at other sites. This restriction to local knowledge resources, however, becomes an issue in environments where software components and offered services change very dynamically and the case database is thus frequently outdated.

Manuscript received March 2, 2015; revised October 20, 2015; accepted October 20, 2015. Date of publication October 30, 2015; date of current version December 17, 2015. This work was supported in part by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant 102.02-2011.01, in part by Flamingo, a Network of Excellence project (ICT-318488) supported by the European Commission under its Seventh Framework Programme, and in part by the EC IST-EMANICS Network of Excellence under Grant 26854. The associate editor coordinating the review of this paper and approving it for publication was J. Lobo.

H. M. Tran is with the School of Computer Science and Engineering, International University-Vietnam National University, Ho Chi Minh City, Vietnam (e-mail: tmha@hcmiu.edu.vn).

J. Schönwälder is with the Department of Computer Science and Electrical Engineering, Jacobs University Bremen, Bremen, Germany (e-mail: j.schoenwaelder@jacobs-university.de).

Digital Object Identifier 10.1109/TNSM.2015.2496224

With the recent growth of virtual communities, social networks, and cloud systems, domain specific search engines often become a suitable alternative to general purpose search engines. The key advantage of these systems is to focus the search on specific domains. Domain specific search engines have the potential of connecting a large number of experts with similar interests. A virtual community of networking experts can provide the best solutions for networking problems. Cloud computing systems, fostering the centralization of various services, in particular require a large number of experts and tools to manage faults and failures. This is especially true for inter-cloud environments [9] that support applications and services running on multiple cloud systems. It is thus necessary to develop support systems that can exploit the knowledge of several virtual communities and that can connect groups of experts for resolving problems.

Our distributed case-based reasoning system DisCaRia takes advantage of Peer-to-Peer (P2P) technology to extend the capability of conventional CBR systems by exploring problem solving knowledge resources available in a distributed environment. The DisCaRia peers operate in parallel and comprise independent CBR components that work concurrently to scrutinize the knowledge resources. Our distributed CBR approach applies previous research activities to improve the performance of managing a large case database and the quality of the proposed solutions, including case retrieval and reasoning approaches [10]–[12] and case learning and retention [13], [14]. This article therefore uses the DisCaRia system to present the novel integration and adaptation of CBR and P2P technologies to address a network and service management problem. The contribution of the article is thus threefold:

1) We propose the distributed CBR approach based on multiple CBR engines organized in a P2P network for exploring and exploiting federated fault knowledge databases.
2) We present the DisCaRia system with the main methods and algorithms, and how they work together to achieve the integration and adaptation of CBR and P2P technologies.
3) We perform an evaluation of the DisCaRia system on the EmanicsLab distributed computing testbed [15] with federated bug databases obtained from several popular bug tracking systems.

The rest of the article is structured as follows: Section II discusses related work. Section III introduces the distributed CBR approach and the research activities applied to the main components of the DisCaRia system. The creation and maintenance of the case database are detailed in Section IV. Section V presents several experiments performed on the EmanicsLab distributed computing testbed in order to evaluate the performance of the DisCaRia system. The paper concludes with some remarks on future work in Section VI.

1932-4537 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. BACKGROUND AND RELATED WORK

This section reviews related work in the areas of trouble ticket and bug tracking systems for fault resolution, case-based reasoning, peer-to-peer technology, and network search. For simplicity, we use the terms super peer and DisCaRia peer, as well as case and bug, interchangeably in this article.

A. Trouble Ticket Systems and Bug Tracking Systems

Trouble ticket systems have been widely used by network operators in order to assure the quality of communication services. The ITU-T recommendation X.790 [16] defines trouble as “any cause that may lead to or contribute to a manager perceiving a degradation in the quality of service of one or more network services or one or more network resources being managed.” X.790 introduces an interface for the interaction among parties, e.g., a trouble report data model, and a process for the resolution of troubles, e.g., determining the status of a trouble, escalating its severity, and notifying involved parties of its resolution. The informational IETF document RFC 1297 [17] defines trouble as “a single malfunctioning piece of hardware or software that breaks at some time, has various efforts to fix it, and eventually is fixed at some given time.” Trouble ticket systems can be used for communication among network operation centers (NOCs), and they can be associated with a network alert system for generating trouble tickets automatically and for monitoring the progress of the trouble tickets. The TMF documents NMF501 [18] and NMF601 [19] model a trouble administration system that provides the interface between a service provider and users for managing trouble information. NMF501 focuses on the business and technical requirements and the trouble administration process, whereas NMF601 describes the functionality of exchanging management information to meet the requirements. Moreover, the system can create and track telecommunication troubles reported following the X.790 recommendation.

A Bug Tracking System (BTS) is a trouble ticket system used to keep track of software bugs. A bug tracking system uses an information model for a problem (also sometimes called a ticket, bug, defect, etc.) that is very similar to the information model used by trouble ticket systems. Pre-defined fields are commonly used to keep track of the status of the problem, while textual descriptions are used to describe the problem and to track the problem resolution process. BTSs in general aim at improving the quality of software products. They do so by keeping track of reported problems and by maintaining historical records of previously experienced problems. They also establish the basis of a knowledge base for an expert system that allows users to search for similar past problems and that provides reports and statistics for performance evaluation of the services [20]. While most expert systems proposed for fault diagnosis and resolution, such as ACE [21], COMPASS [22], NEMESYS [23], Troubleshooter [24], DAD [25] and MANDOLIN [26], explore a knowledge database supplied by a single BTS, DisCaRia is able to explore a federated knowledge database supplied by several BTSs.

Fig. 1. Cyclic case-based reasoning approach using four processes (retrieval, reuse, revision, and retention) and a common case database, adopted from [8].

B. Case-Based Reasoning

Case-based Reasoning (CBR) [8] seeks to find solutions for problems by exploiting experience. A case essentially consists of a description of a specific problem and the corresponding solution. When a new problem appears, the reasoning process first uses a similarity function to retrieve cases matching the current problem, and then it adapts the retrieved cases to the circumstances of the current problem in order to obtain a possible solution. Depending on the characteristics of a certain problem domain, this reasoning process can either classify the problem into a group of previously resolved problems or propose an adapted solution for the problem. Problem classification can be suitable for a problem domain with a relatively large case database. Various reasoning techniques, including rule-based reasoning, fuzzy logic, neural networks and belief networks, can be used for this process.

Following the discussion in [8], a CBR system consists of four processes as shown in Figure 1:

1) case retrieval to obtain similar cases,
2) case reuse to propose adapted solutions,
3) case revision to verify the adapted solution,
4) case retention to learn the solution.

The CBR system uses a case database to store and provide cases for the operation of the CBR processes.

Several CBR systems [5], [7], [27] have been proposed for fault diagnosis and resolution. These systems usually collaborate with trouble ticket systems in order to take advantage of trouble tickets as the case database. While these systems can learn from previous problems to propose solutions for novel problems, they usually only operate on a local case database and hence they cannot exploit problem-solving knowledge resources present at remote sites. Using shared knowledge resources not only provides better opportunities to find solutions but also improves the case databases that otherwise frequently become obsolete in environments where software components and offered services change very dynamically. DisCaRia uses several CBR engines that interact through a self-organizing P2P network in order to exploit various problem-solving knowledge resources.

C. Peer-to-Peer Technology

Peer-to-Peer (P2P) technology [28] has been introduced to establish application specific network overlays operating over the Internet [29]. A P2P network consists of peers that act both as client and server simultaneously. P2P systems exhibit a number of interesting properties such as self-organization, scalability, flexibility, and fault tolerance. Peers join and leave the networks with loose control, enabling fully distributed systems with a very large number of peers. Acting in both client and server roles, peers share resources, such as bandwidth, storage space, and computing power, and typically provide lookup functions. Since P2P networks do not have a hierarchical organization or centralized control, they are designed such that the failure of individual peers can affect the availability of certain resources but cannot cause the failure of the overall P2P network.

P2P networks can be classified into two categories: structured and unstructured networks. Structured P2P networks maintain a controlled and stable overlay network topology. A structured P2P network aims at distributing content at deterministic locations using Distributed Hash Tables (DHTs), thus facilitating efficient content search and lookup. Examples of structured P2P networks are CAN [30], Chord [31], Tapestry [32] and Kademlia [33]. Unstructured P2P networks spend less effort on controlling the overlay network topology and the location of content. Instead, they tend to grow randomly without maintaining a certain network topology according to some tight rules. The content is arbitrarily distributed on the peers, thus fostering different search methods. Examples of unstructured P2P networks include Gnutella [34], Freenet [35], and BitTorrent [36].

Hybrid P2P networks combine the characteristics of structured and unstructured P2P networks, and they often integrate well with the client-server paradigm [37]. The main idea is to distinguish so-called super peers from peers that possess varying storage, bandwidth and processing capabilities. Hybrid P2P networks organize peers into clusters using a clustering technique. A cluster contains at least one capable peer, or super peer; the other peers connect to the super peer in the cluster. The connections between the super peers form the super peer P2P network. With sufficient storage, bandwidth and processing power, the super peers act both as a server to handle queries from other peers and as a client to route queries to the other super peers. The content advertisement and search mechanisms only take place on the super peer network. Hybrid P2P networks facilitate advanced search mechanisms due to the processing, storage and bandwidth resources available at the super peers. Examples of super peer networks include Piazza [38], Edutella [39] and Bibster [40]. DisCaRia uses a hybrid P2P network to take advantage of the inherent scaling properties that hybrid P2P networks offer.

D. Network Search

More recently, distributed network search algorithms have been suggested as a primitive for building future network management systems [41], [42]. Network search systems are organized as an overlay over a physical network topology. The overlay provides a distributed query processing facility that can be used to retrieve operational state and configuration data from network elements. The approach shares some similarity with the ideas behind the DisCaRia system. However, the work on network search primarily aims to provide a generic search mechanism for management and monitoring data, while the DisCaRia system assumes a certain CBR functionality to be implemented in the DisCaRia peers. The idea of the DisCaRia system has also been applied to building a fault management system in the inter-cloud environment that supports applications and services running across multiple cloud systems [43]. This system recruits a P2P network of fault managers that allows system administrators to monitor faults and search for similar faults with solutions on cloud systems.

III. DisCaRia SYSTEM

The distributed CBR approach focuses on CBR methodology and P2P technology. While CBR methodology has been widely known as a problem-solving method for several problem domains, P2P technology has been widely used for resource sharing in distributed environments. P2P systems provide remarkable features, including self-organization in management, scalability in architecture, flexibility in content distribution and fault tolerance, that allow peers to join the networks and exchange problems and solutions easily. CBR systems require expressive case representation methods, which allow retrieval and reasoning techniques to work efficiently. A distributed CBR system contains CBR engines communicating through a P2P network. It offers the capability of exploring and exploiting knowledge resources for retrieval and reasoning that can be applied to fault resolution in communication networks and distributed systems.

DisCaRia uses a P2P network to achieve a certain level of self-organization and to benefit from a scalable architecture. The system extends the underlying basic P2P network with a CBR approach for obtaining more relevant information for fault resolution [13]. There are two kinds of peers in the system: super peers and peers. Each super peer bears several components to perform CBR operations, e.g., retrieving, adapting, verifying and learning cases. These components require sufficient storage, bandwidth and processing power. Figure 2 shows the DisCaRia system architecture based on a P2P network of super peers. The super peers deal with complicated operations, thus alleviating the problem of peer heterogeneity, i.e., peers with limited capability do not undertake complicated operations. Each super peer is responsible for multiple functions including communication, computation, reasoning and maintenance, while each regular peer only communicates with super peers for finding relevant resources.

A super peer contains four main components: a P2P component, a computation component, a reasoning component, and a storage component. These components are realized as independent processes, thus the failure of one component affects only this component. Each component consists of several functional modules that handle different tasks, e.g., the P2P component possesses modules that manage incoming and outgoing connections, and the computation component possesses modules that handle similarity evaluation and case indexing. Figure 3 presents the main components and how they interact within a super peer. The reasoning component communicates with the computation component to obtain cases from the case database, and it communicates with the P2P component to obtain cases from other super peers. The storage component communicates with external BTSs to maintain the case database, and the P2P component communicates with other super peers to exchange cases. The P2P component communicates with peers for requests and responses. In addition, DisCaRia supports a web interface for querying super peers directly without joining the P2P network.

Fig. 2. DisCaRia system architecture with peers and super peers.

Fig. 3. Components of a DisCaRia super peer and their interactions.

In the following sections, we first discuss the communication protocol used by DisCaRia and then we explain each DisCaRia component in more detail.

A. Communication Protocol

DisCaRia uses the Gnutella P2P protocol [34] to build a fully distributed and unstructured P2P network of super peers. This P2P network offers several advantages that facilitate data sharing and searching functions. First, the network uses super peers to solve the problem of peer heterogeneity. Peers with insufficient bandwidth and processing power cannot participate in complicated tasks, such as routing and processing queries. Second, the network allows super peers to maintain a local database and to perform keyword and semantic search methods on the database. Third, the network also provides flexible data replication mechanisms for data sharing. The disadvantage of the network is the flooding-based routing mechanism that can cause a large amount of traffic in the network. DisCaRia has improved this routing mechanism by using the feedback scheme presented in the following section.

The Gnutella protocol supports five types of messages: ping and pong are used to probe the network, query and queryhit are used to exchange data, and push is used to deal with peers behind a firewall. Downloading data is handled separately from this protocol. A Gnutella message consists of a header and a payload. The header contains the following fields, in order: ID, Payload Descriptor, TTL, Hops, and Payload Length.

The ID field is a 16-octet string uniquely identifying the message on the network. The Payload Descriptor field is 1 octet in size and it identifies the message type. The TTL field contains the number of times this message will be forwarded to peers before it is discarded. The Hops field contains the number of times the message has been forwarded. These two fields are both 1 octet long. The Payload Length field is 4 octets long and indicates the length of the payload. The payload immediately follows the header. The fields of the payload depend on the message type, as defined in the protocol specification [34].

The payload of a queryhit message contains the following fields. The Number of Results field is 1 octet long and contains the number of results in the result set. The Set of Results field stores a set of super peers and peers with their corresponding results, i.e., a set of solutions for a certain problem.

The ping, pong, query and queryhit message types remain unchanged, while DisCaRia extends the Gnutella protocol by adding a feedback message type. This message type allows peers receiving queryhits to evaluate the results contained in the queryhits and to send feedback to the related peers. The payload of a feedback message contains the following fields. The Number of Evaluations field is 1 octet long and contains the number of evaluations in the evaluation set. The Set of Evaluations field stores a set of super peers and peers with their corresponding results and the grades of the results.

DisCaRia not only allows peers to connect to super peers, it also provides a web interface to access super peers. This feature enables users to use the DisCaRia search function without downloading and installing peers.
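The header layout described in this subsection can be illustrated with a minimal Python sketch. It assumes the little-endian byte order and the descriptor codes of the Gnutella 0.4 specification; the `FEEDBACK` code and the helper names `pack_header`, `unpack_header`, and `forward` are hypothetical, since the text does not give the value DisCaRia uses for its feedback message or any implementation detail.

```python
import struct

# 16-octet ID, 1-octet payload descriptor, 1-octet TTL, 1-octet hops,
# 4-octet payload length; "<" selects little-endian with no padding.
HEADER = struct.Struct("<16sBBBI")

# Descriptor codes as defined by Gnutella 0.4.
PING, PONG, PUSH, QUERY, QUERYHIT = 0x00, 0x01, 0x40, 0x80, 0x81
FEEDBACK = 0x90  # assumption for illustration only, not from the paper


def pack_header(msg_id, descriptor, ttl, hops, payload_length):
    """Serialize the five header fields into 23 octets."""
    return HEADER.pack(msg_id, descriptor, ttl, hops, payload_length)


def unpack_header(data):
    """Parse the first 23 octets of a message into a dictionary."""
    msg_id, descriptor, ttl, hops, length = HEADER.unpack(data[:HEADER.size])
    return {"id": msg_id, "descriptor": descriptor, "ttl": ttl,
            "hops": hops, "payload_length": length}


def forward(header_bytes):
    """Return the header to forward, or None once the TTL is used up."""
    h = unpack_header(header_bytes)
    if h["ttl"] <= 1:
        return None  # the message is discarded instead of forwarded
    return pack_header(h["id"], h["descriptor"], h["ttl"] - 1,
                       h["hops"] + 1, h["payload_length"])
```

Each forwarding step decrements TTL and increments Hops, so their sum stays constant along the path of a message.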

B. Peer-to-Peer Component

The P2P component organizes the communication with other super peers and between a super peer and its directly attached regular peers. It also organizes the communication with the reasoning component. This is achieved by using the same P2P protocol to exchange information with the super peer’s reasoning component, i.e., the reasoning component also acts as a peer.

Fig. 4. Exchange of query, queryhit, and feedback messages.

The P2P component uses a feedback mechanism [13] for evaluating the quality of queryhits, thus fostering peer learning. The mechanism extends the Gnutella protocol to include the feedback message described above. Figure 4 illustrates super peer communication and the usage of feedback messages. During bootstrapping, a peer discovers and connects to its super peer. Subsequently, the peer can send a query message to its super peer 1. The super peer will forward the message to its neighboring super peers 2 and 4, and they will in turn forward the query to the super peers 5, 6, 3 and 7. Let us assume that the super peers 3 and 5 are able to find an answer to the query. Super peer 3 sends a queryhit message via 4 and 1 back to the peer, while super peer 5 sends a queryhit message via 2 and 1 back to the peer. After receiving the queryhit messages, the peer will evaluate the received answers and generate feedback messages to the super peers 3 and 5. The feedback messages are forwarded by the super peers 1, 2, and 4. The forwarding super peers will also learn about the peer’s evaluation of the queryhit results.

The feedback mechanism not only facilitates the learning process of super peers (corresponding to the retention process of the CBR approach), it also improves the flooding mechanism by sending the queries to a set of the super peers that have proven their competence in previous queryhits. In order to achieve this improvement, super peers need to keep track of queries and the super peers with correct queryhits by listening to feedback messages, i.e., QryLst is a list of elements (q, PrLst), where q is a query and PrLst is a list of super peers. Super peers forward queries to the selected sets of expert super peers and random super peers, i.e., ExpLst is a list of elements (p, e), where p is an expert super peer and e is its corresponding expert score, and NbrLst is a list of neighbor super peers. Note that the expert super peers and their scores in ExpLst are regularly updated by learning from PrLst. The random super peers add some randomness to the scheme in order to allow new super peers to join the P2P network and to become experts on their own.

Algorithm 1 shows the super peer ranking and selection algorithm. It iterates over a set of recently similar queries in QryLst (line 3) to obtain a set of the expert super peers RnkLst (lines 4-5). Since each super peer in RnkLst has its expert score e stored in ExpLst, the sort function ranks RnkLst by the expert score (line 7). It then chooses up to γ − 1 super peers from the ranked set RnkLst (lines 8-10). If the number of the ranked super peers is insufficient, it considers the top elements of the query set QryLst (lines 11-15), which includes the most recently used super peers. It finally fills the set of selected super peers SelLst with super peers from the neighbor set NbrLst.

Algorithm 1 Super Peer Ranking and Selection

Input: QryLst: a list of elements (q, PrLst)
       NbrLst: a list of neighbor super peers
       ExpLst: a list of elements (p, e)
       RnkLst: a list of ranked super peers
       ε, γ: similarity threshold and number of super peers
       q_c: an input query
Output: a list of selected super peers SelLst

 1  RnkLst ← ∅
 2  SelLst ← ∅
 3  for each (q_i, PrLst_i) ∈ QryLst do
 4      if sim(q_c, q_i) > ε then
 5          RnkLst ← RnkLst ∪ PrLst_i
 6      end if
 7  sort(RnkLst, ExpLst)
 8  for each p_j ∈ RnkLst do
 9      SelLst ← SelLst ∪ {p_j}
10      if |SelLst| = γ − 1 then break end if
11  for each (q_i, PrLst_i) ∈ QryLst and p_j ∈ PrLst_i do
12      if |SelLst| = γ − 1 then break end if
13      if p_j ∉ SelLst then
14          SelLst ← SelLst ∪ {p_j}
15      end if
16  for each p_j ∈ rand(NbrLst) do
17      if p_j ∉ SelLst then
18          SelLst ← SelLst ∪ {p_j}
19      end if
20      if |SelLst| = γ then break end if
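As a rough illustration, the selection steps of Algorithm 1 could be written in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function name `select_super_peers`, the parameter name `eps` for the similarity threshold, the use of a dictionary for ExpLst, and the ordering of `qry_lst` from most to least recent are all choices made here. The similarity function `sim` is supplied by the caller.

```python
import random


def select_super_peers(qry_lst, nbr_lst, exp_lst, q_c, sim, eps, gamma,
                       rng=random):
    """Pick up to `gamma` super peers to which query q_c is forwarded.

    qry_lst: list of (query, [super peers]) pairs, most recent first
    nbr_lst: list of neighbor super peers
    exp_lst: dict mapping a super peer to its expert score
    """
    # Lines 1-6: gather super peers that answered sufficiently similar queries.
    rnk_lst = []
    for q_i, pr_lst in qry_lst:
        if sim(q_c, q_i) > eps:
            for p in pr_lst:
                if p not in rnk_lst:
                    rnk_lst.append(p)

    # Line 7: rank the candidates by expert score, best first.
    rnk_lst.sort(key=lambda p: exp_lst.get(p, 0.0), reverse=True)

    # Lines 8-10: keep at most gamma - 1 ranked experts.
    sel_lst = rnk_lst[:max(gamma - 1, 0)]

    # Lines 11-15: top up from the recently used super peers in qry_lst.
    for _, pr_lst in qry_lst:
        for p in pr_lst:
            if len(sel_lst) >= gamma - 1:
                break
            if p not in sel_lst:
                sel_lst.append(p)

    # Lines 16-20: fill the remaining slots with random neighbors.
    candidates = list(nbr_lst)
    rng.shuffle(candidates)
    for p in candidates:
        if len(sel_lst) >= gamma:
            break
        if p not in sel_lst:
            sel_lst.append(p)
    return sel_lst
```

The sketch uses `>=` where the pseudocode tests for equality, so that it also terminates cleanly for small values of `gamma`. Reserving the last slot for a random neighbor is what gives newly joined super peers a chance to receive queries.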

C. Reasoning Component

This component is the heart of the DisCaRia peer; it comprises case retrieval and reasoning operations corresponding to the retrieval and reuse operations of the CBR approach. To process a case efficiently, the case is represented by multiple vectors using the multi-vector representation method [10]:

• a field-value vector to express pre-defined fault features, e.g., fault categories, system components, and product releases. To represent n pairs, we employ the field-value vector $v_f = \langle f_1{:}v_1, \ldots, f_k{:}v_k, \ldots, f_n{:}v_n \rangle$, where k is the fixed number of pre-defined pairs.

• a field-value vector to express comparable fault features (user-defined features), e.g., error messages, symptoms, and debug snippets. These pairs are represented by $v_p = \langle p_1{:}v_1, \ldots, p_m{:}v_m \rangle$, where m is the number of symptoms and parameters. The values are either binary, numeric or symbolic.

• a real-value vector to represent fault details in textual form, e.g., fault descriptions, discussions, and related fault features. This semantic vector $v_s$ is generated by the LSI technique [44].

The similarity of field-value vectors is measured by the sum of weight values of matched field-value pairs, while the similarity of real-value vectors is measured by the cosine function. The following example is a fault case extracted from a networking forum:

Problem: Hub connectivity not functioning

Description: The WinXP network contains many machines obtaining an address from a DHCP server. The network is extended by 3 more machines by unplugging the LAN cable from one of the machines and plugging it into a hub with the intention to add the 3 new machines. From the hub, none of the machines successfully obtains an IP address from the DHCP server; an error message shows “low or no network connectivity”.

To make this fault case understandable and comparable for CBR engines, vector $v_f$ contains <problem_type: connectivity, problem_area: hardware configuration, hardware: PC, platform: WinXP>. Vector $v_p$ comprises <network: LAN, error-message: low or no network connectivity, ip-address: false, DHCP: true>. Using LSI, several terms are considered to build the vector $v_s$.
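The two similarity measures mentioned above can be sketched in Python. The case vector below is the $v_f$ of the forum example; the query vector and all weight values are hypothetical, and the function names are chosen here for illustration. How DisCaRia assigns weights is described later and is not reproduced in this sketch.

```python
import math


def field_value_similarity(query, case, weights):
    """Sum the weights of the field-value pairs that both vectors share."""
    return sum(weights.get(field, 1.0)
               for field, value in query.items()
               if case.get(field) == value)


def cosine_similarity(u, v):
    """Cosine of the angle between two real-value vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# v_f of the forum case above.
case_vf = {"problem_type": "connectivity",
           "problem_area": "hardware configuration",
           "hardware": "PC",
           "platform": "WinXP"}

# Hypothetical query and weights, for illustration only.
query_vf = {"problem_type": "connectivity", "hardware": "PC",
            "platform": "Win7"}
weights = {"problem_type": 0.5, "problem_area": 0.2,
           "hardware": 0.1, "platform": 0.2}

score = field_value_similarity(query_vf, case_vf, weights)
```

Here the query matches the case on `problem_type` and `hardware` but not on `platform`, so only the first two weights contribute to `score`.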

Upon receiving a query from the P2P component, the retrieval operation first checks whether the query already exists in the cache, and processes the query to generate multiple vectors. It then compares these vectors with the vectors of the cases to select relevant cases from the database. Evaluating vectors requires the computation component to classify cases based on field-value vectors and index cases based on real-value vectors. These computing operations normally process large matrices, thus consuming a large amount of time. Super peers index and select the cases based on their database, then send the selected cases to the querying peer using the P2P component. Aggregating cases from super peers results in a set of the retrieved cases.

Upon receiving the retrieved cases, the reasoning operation uses the two-process probability reasoning mechanism [11] to process the cases. The ranking process of the mechanism aims to narrow down the scope of the query by weighting the common symptoms between the retrieved cases and the query. This process applies the k-Nearest-Neighbor (kNN) algorithm [45] to rank the cases that share many of the same field-value pairs with the query. The computation component is responsible for assigning weight values to field-value pairs. The selection process of the mechanism aims to predict some promising cases for the query by correlating the ranked cases and the query through the common symptoms. This process employs the Bayesian approach to compute probability values for the ranked cases, and selects the cases with high probability values. The probability value indicates the strength of belief in the case based on the previous knowledge of the ranked cases and the observed symptoms.

Given a set of cases $C$, where a case $C_r$ contains a set of symptoms $\{S_1, S_2, \ldots\}$ and a solution, we treat the solutions in $C$ as a set of exhaustive and mutually exclusive hypotheses $\{H_1, \ldots, H_n\}$, and we assume that any symptom is the result of a diagnostic probe, e.g., a ping probe provides either a high probability of success or a low probability of success (i.e., failure). The problem contains a set of symptoms $\{S_1, \ldots, S_h\}$ without a solution (note that cases and the problem can share the same symptoms). Thus, the puzzle is to find the hypothesis with the highest conditional probability $P(H_i \mid S_1, \ldots, S_h)$ with $i = 1, \ldots, n$.

Consider a set of exhaustive and mutually exclusive hypotheses $H_1, \ldots, H_n$ and a set of evidence pieces (or symptoms) $S_1, \ldots, S_h$ obtained from the problem, with the assumption that $S_1, \ldots, S_h$ are independent of each other. Applying the conditional probability formula, we obtain:

$$
P(H_i \mid S_1, \ldots, S_h) = \frac{P(S_1, \ldots, S_h \mid H_i)\, P(H_i)}{P(S_1, \ldots, S_h)} = \alpha\, P(H_i) \prod_{j=1}^{h} P(S_j \mid H_i)
$$

where $P(S_1, \ldots, S_h \mid H_i) = \prod_{j=1}^{h} P(S_j \mid H_i)$ since the $S_j$ are independent of each other, $P(H_i)$ are the prior probabilities of the hypotheses, and $\alpha = [P(S_1, \ldots, S_h)]^{-1}$ is determined via the requirement $\sum_{i=1}^{n} P(H_i \mid S_1, \ldots, S_h) = 1$.

If the evidence set $S$ is extended by a new evidence piece $S_{new}$ (e.g., the problem has a new symptom), the update first computes $P(H_i \mid S)$ and then uses it to obtain $P(H_i \mid S, S_{new})$ as follows:

$$P(H_i \mid S, S_{new}) = \frac{P(H_i \mid S)\, P(S_{new} \mid S, H_i)}{P(S_{new} \mid S)} = \beta\, P(H_i \mid S)\, P(S_{new} \mid H_i)$$

where $\beta$ is determined by the same method as $\alpha$, $P(H_i \mid S)$ is computed as before, and $P(S_{new} \mid H_i)$ is determined by the experts. The following example computes the probability of solutions for a connection failure problem:

• H1 = Checking firewall software for blocking connections
– S1 = Desktop keeps disconnecting from the Internet
– S2 = Desktop and laptop keep connecting from the router
– S3 = Connection usually goes really slow
– S4 = Connection was fine before updating the firewall software
– S5 = Router is WHR-HP-G54 and wireless adapter is Linksys WMP54G

• H2 = Reinstalling networking components (TCP/IP)
– S1 = Desktop completely stops connecting to the Internet
– S2 = Laptop can connect to desktop and the Internet
– S3 = Desktop disconnects from laptop and D-Link router with limited connectivity
– S6 = Desktop uses an Etherlink 10/100 PCI card and laptop uses a wireless adapter
– S7 = Registry was damaged on desktop a few days ago

• H3 = Checking router configuration for the IP address range
– S1 = Desktop cannot connect to a router and the Internet
– S2 = Laptop connects to the router and the Internet
– S4 = The firewall software is often updated on those machines
– S8 = Desktop gets an "address already used" error message when renewing

The following table presents the weight values of symptoms with respect to the solutions (note that updated weight values are not bold). The table is for demonstration; an implementation only needs the weight values related to the problem's symptoms:

      S1     S2     S3     S4     S5     S6     S7     S8
H1    0.1    0.1    0.25   0.448  0.001  0      0.1    0.001
H2    0.1    0.1    0.3    0.001  0      0.048  0.45   0.001
H3    0.25   0.245  0      0.005  0      0.05   0      0.45

To fulfill the requirement of exhaustive and mutually exclusive hypotheses, we examine the set of hypotheses $H_1$, $H_2$ and $H_3$ and ignore other hypotheses; i.e., the conditional probability of other hypotheses is 0. This, however, can be a problem in practice if the examined set of hypotheses does not contain the desired hypothesis. We also treat the evidence pieces $S_1, \ldots, S_8$, obtained by distinct probes, as independent because the effect of a probe on other evidence pieces is minor, and evidence pieces can only be correlated if probes are not distinct; e.g., if a connection failure occurs, symptoms collected by the ping and ftp probes can be correlated. The hypotheses possess nearly equal prior probabilities $P(H_i) = (0.33, 0.33, 0.34)$, and the problem contains the following symptoms:

• H = ?
– S1 = Desktop gets connection failure
– S2 = Other machines still connect to routers and to the Internet
– S4 = Desktop updated the firewall software two days ago

By applying the above equations, we obtain $P(H_i \mid S_1, S_2, S_4) = (0.8719, 0.0019, 0.1261)$ with $i = 1, 2, 3$. The result indicates that the chance of firewall software blocking connections is 87.19% given the symptoms of the problem. Intuitively, a solution is deduced from an incomplete set of symptoms, and solutions likely share a subset of symptoms. Bayesian computation distinguishes those solutions by the significance of symptoms within a case and the significance of symptoms among cases.
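The mechanics of this computation can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the function names `posterior` and `update` are hypothetical, the weights are copied from the table as transcribed here and used directly as the likelihoods P(S_j | H_i), and the symptoms are assumed to be independent. Because the weights are used exactly as printed, the sketch is not claimed to reproduce the reported posterior digit for digit; it only shows the product rule, the normalization and the incremental update.

```python
# Likelihoods P(S_j | H_i) for the observed symptoms S1, S2 and S4,
# copied from the weight table (assumption: weights act as likelihoods).
WEIGHTS = {
    "H1": {"S1": 0.1, "S2": 0.1, "S4": 0.448},
    "H2": {"S1": 0.1, "S2": 0.1, "S4": 0.001},
    "H3": {"S1": 0.25, "S2": 0.245, "S4": 0.005},
}
PRIORS = {"H1": 0.33, "H2": 0.33, "H3": 0.34}


def normalize(scores):
    """Scale the scores so that they sum to 1 (the role of alpha and beta)."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all hypotheses have zero probability")
    return {h: v / total for h, v in scores.items()}


def posterior(priors, weights, symptoms):
    """Prior times the product of the symptom likelihoods, then normalized."""
    scores = {}
    for h, prior in priors.items():
        score = prior
        for s in symptoms:
            score *= weights[h][s]
        scores[h] = score
    return normalize(scores)


def update(current, weights, new_symptom):
    """Fold one new symptom into an existing posterior."""
    return normalize({h: p * weights[h][new_symptom] for h, p in current.items()})


batch = posterior(PRIORS, WEIGHTS, ["S1", "S2", "S4"])
stepwise = update(posterior(PRIORS, WEIGHTS, ["S1", "S2"]), WEIGHTS, "S4")
best = max(batch, key=batch.get)
```

Both `batch` and `stepwise` rank H1 (checking the firewall software) first, and they agree with each other because the normalizing constants cancel.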

In addition, this component cooperates with the P2P component to learn the resulting cases and to send feedback messages to the peers, thereby performing the retention operation of the CBR approach.

Algorithm 2 reflects the two processes illustrated above. It iterates over the set of cases $C$, where each case contains a set of symptoms $C_r$ and a solution $H_r$ (line 3), finds the set of common symptoms $S$ (line 4), and accumulates weight values (lines 5-8) before creating a set of ranked cases $R$ (line 11). Note that $V$ is a list of probability values of solutions. It then iterates over the hypotheses in $R$ (line 12), where each hypothesis is a solution, initializes the prior probability $V_r$ with the value $1/n$, where $n$ is the number of solutions (line 13), and computes and normalizes the posterior probabilities (lines 14-17) before generating a set of final cases $F$ (line 19).

Algorithm 2 Case Ranking and Selection
Input:  C: a set of cases (C_r, H_r), H_r: solution
        n: the number of solutions
        V: a list of probability values V_r
        C_r, w_r: sets of symptoms & weight values (case)
        C_p, w_p: sets of symptoms & weight values (problem)
        S: a set of symptoms
        R: a set of ranked cases
Output: a set of final cases F
 3  for each C_r ∈ C do
 4      S = C_r ∩ C_p
 5      if S ≠ ∅ then
 7          for each S_i ∈ S do
 8              T_r = T_r + w_ri · w_pi
 9          R ← R ∪ {(C_r, H_r)}
11  sort(R, T_r)
12  for each H_r ∈ R do
13      V_r = 1/n
14      for each S_i ∈ C_p do
15          V_r = V_r · w_pi
16  for each V_r ∈ V & C_r ∈ R do
17      V_r = V_r · V^{-1}
18      F ← F ∪ {(C_r, H_r)}
19  sort(F, V_r)
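The two processes of Algorithm 2 can be sketched in Python as follows. The sketch makes several assumptions that the listing does not state: cases are given as pairs of a symptom-to-weight dictionary and a solution label, the accumulated score T_r starts at zero, the likelihood of a problem symptom under a case is that case's own weight for the symptom (with a small hypothetical floor `EPS` when the case lacks it), and the prior is uniform over the ranked cases. The name `rank_and_select` is illustrative.

```python
EPS = 1e-6  # hypothetical floor so that a missing symptom does not zero out a case


def rank_and_select(cases, problem):
    """cases: list of (symptom -> weight, solution); problem: symptom -> weight."""
    # Ranking process: score each case by the weights of the symptoms
    # it shares with the problem, and drop cases without common symptoms.
    ranked = []
    for weights, solution in cases:
        common = weights.keys() & problem.keys()
        if common:
            score = sum(weights[s] * problem[s] for s in common)
            ranked.append((score, weights, solution))
    ranked.sort(key=lambda item: item[0], reverse=True)
    if not ranked:
        return []

    # Selection process: uniform prior times the product of the
    # likelihoods of the problem symptoms, followed by normalization.
    prior = 1.0 / len(ranked)
    values = []
    for _, weights, solution in ranked:
        value = prior
        for symptom in problem:
            value *= max(weights.get(symptom, 0.0), EPS)
        values.append((solution, value))
    total = sum(v for _, v in values)
    final = [(solution, v / total) for solution, v in values]
    final.sort(key=lambda item: item[1], reverse=True)
    return final
```

Called with the three example cases and a problem that reports S1, S2 and S4, the function returns the solutions ordered by normalized probability; a case that shares no symptom with the problem never enters the ranked set.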

Fig. 5. Unified bug data model represented as a UML class diagram.

D. Computation Component

This component supports several assessment operations for retrieving cases. The first operation uses the weighted average function to measure the field-value vectors and to obtain cases. Field-value pairs represent both pre-defined and user-defined features. The component directly assigns weight values to the pre-defined features, while the users indirectly assign weight values to the user-defined features. The component can also assign average weight values to the user-defined features. This operation finally selects a few thousand cases from the hundreds of thousands of cases in the database based on the measurement results. The second operation uses the Latent Semantic Indexing (LSI) method [44] to index the selected cases based on their textual descriptions, i.e., using the term vectors to generate the real-value vectors. The indexing operation is time consuming because it computes the singular value decomposition (SVD) of large matrices, and it therefore runs in a separate process. Various SVD algorithms are implemented to achieve both precision and performance. This operation finally uses the cosine function to compare the real-value vectors of the selected cases and the query, and selects a set of a few hundred cases. The third operation again uses the weighted average function to aggregate values from the above two operations and provides the resulting set of a few dozen cases with high relevance. The component and algorithms are implemented in C with the support of the svdlibc [46] library for matrix computation to achieve high performance. This library contains an implementation of the single-vector Lanczos algorithm.
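The three assessment operations can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the C implementation: cases are dictionaries with hypothetical `fields` and `vector` keys, the real-value vectors are taken as given (the SVD projection performed by LSI is omitted), field matching is exact, and the cut-off sizes and mixing weights are free parameters rather than values taken from the system.

```python
import math


def weighted_average(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def field_score(case, query, field_weights):
    # 1.0 for a matching field-value pair, 0.0 otherwise, averaged by field weight.
    names = list(query["fields"])
    matches = [1.0 if case["fields"].get(n) == query["fields"][n] else 0.0 for n in names]
    return weighted_average(matches, [field_weights.get(n, 1.0) for n in names])


def retrieve(cases, query, field_weights, keep1, keep2, keep3, mix=(0.5, 0.5)):
    def by_fields(c):
        return field_score(c, query, field_weights)

    def by_text(c):
        return cosine(c["vector"], query["vector"])

    def combined(c):
        return weighted_average([by_fields(c), by_text(c)], mix)

    # Operation 1: filter by weighted field-value similarity.
    stage1 = sorted(cases, key=by_fields, reverse=True)[:keep1]
    # Operation 2: filter by cosine similarity of the real-value vectors.
    stage2 = sorted(stage1, key=by_text, reverse=True)[:keep2]
    # Operation 3: aggregate both values and keep the most relevant cases.
    return sorted(stage2, key=combined, reverse=True)[:keep3]
```

The funnel shape (many candidates after the cheap field comparison, fewer after the textual comparison) mirrors the description above; in practice the second stage would compare vectors in the reduced LSI space.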

E. Storage Component

This component exploits various fault resources to update the databases of super peers. Fault resources are available at bug tracking systems, online support forums and archives (also known as communities or knowledge bases). These systems support web-based crawling, and some systems also support RPC-based crawling. This component contains crawlers to obtain bug reports from fault resources. Since bug reports contain various kinds of information, the crawlers use HTML parsers to extract selected information based on the unified bug data model shown in Figure 5 in order to generate cases. The crawlers regularly check the modified time attribute of the bug reports to update the cases. In addition, the cases can also be updated by the learning operation of super peers, i.e., super peers learn the solutions of the problem through the feedback messages. The component is implemented in Python, and cases are stored in a MySQL database [47]. The database schema conforms to the unified data model. The Porter stemming algorithm [48] is used to stem terms in bug contents.

F. Web Interface

Peers only contain the P2P component that communicates with the P2P network for sharing and searching fault resources. They connect to one of the super peers to join DisCaRia. Note that super peers possess not only the P2P component but also the reasoning, computation and storage components. Since peers have limited capability, they are used to send queries to and receive results from super peers. They are also used to advertise fault resources on super peers. Implementing these peers is simple, but users are usually reluctant to download and install them. DisCaRia thus provides a web support component on super peers to enable a web interface that allows users to connect and forward queries to super peers. This web interface can accept several kinds of symptoms, error messages and textual descriptions. The web interface is built on Django [49], an open source web application framework written in Python. Django provides several facilities for developing web applications and integrates well with web servers, such as the Apache HTTP Server [50].

TABLE I
Some popular bug tracking sites (as of April 2013). A plus indicates that we were unable to get precise numbers and our numbers present a lower bound.

IV. FAULT RESOURCES

Many fault resources are available on the Internet. DisCaRia aims at crawling bug reports from BTSs, archives and forums. These resources share the same purpose of reporting bugs for software and hardware components, but differ in data inclusion and presentation.

The investigation of several BTSs focuses on properties that are important for obtaining data from them. It is necessary to check which functionality a BTS supports in specific installations. To understand the structure of the information stored in a BTS, the underlying data model is inspected and documented. Some systems provide this information in a textual format, while others provide graphical representations, usually in ad-hoc notations. Some systems do not provide a clear description of the data model underlying the BTS, and it is necessary to reverse engineer the data model by looking at concrete bug reports.

Bug reports depend on other bugs if they cannot be resolved or acted upon until the dependency itself is resolved or acted upon. Tracking dependency relations between bug reports is thus useful because it helps to correlate bugs. This activity is usually challenging without expert and system support. Some systems allow full keyword search over their reports, while others only support searching via a set of pre-defined filters applied to the entire bug database. The former is more useful for an automated system that aims to provide keyword search capabilities itself.

Table I reports several popular BTSs, sites and numbers of bugs. With the exception of the Debian BTS, which is only used for the Debian operating system, the other BTSs publish lists of known public sites that use them for bug tracking. While some BTSs provide a machine-readable web service interface to their bug data, most do not. In all systems where such an interface is supported, it is an optional feature, and because optional features require additional effort from an administrator to be set up, they are rarely available. In addition, a web service interface often provides much less data than the human-readable web interface that is most commonly used. Clearly, relying on the availability of a web service API is unrealistic. To solve this problem, crawlers [51] were created that directly use the presentational HTML-based web interface in order to get as much access to information as ordinary users. Other crawlers were built to exploit several methods provided by the Bugzilla web service interface, e.g., the XML-RPC API. These crawlers only submit a bug identifier to obtain the details of the bug from the Bugzilla server, i.e., a list of field-value pairs.
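A crawler of the second kind can be sketched with Python's standard `xmlrpc.client` module. The sketch assumes a Bugzilla installation that exposes its XML-RPC endpoint at `xmlrpc.cgi` and offers the `Bug.get` method; the helper names `make_proxy` and `fetch_bug_fields` are hypothetical, and this is not the crawler code used by DisCaRia.

```python
import xmlrpc.client


def make_proxy(base_url):
    # Assumption: the installation serves XML-RPC at <base_url>/xmlrpc.cgi.
    return xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/xmlrpc.cgi")


def fetch_bug_fields(proxy, bug_id):
    """Submit one bug identifier and return the bug's field-value pairs."""
    response = proxy.Bug.get({"ids": [bug_id]})
    bugs = response.get("bugs", [])
    # An empty dictionary signals that the server returned no such bug.
    return dict(bugs[0]) if bugs else {}
```

Passing the proxy as a parameter keeps the function testable without network access. Which fields appear in the returned dictionary depends on the installation and its version.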

The database schemas of several BTSs and archives share several similar fields that can be classified into two main groups: (i) the administrative information associated with a bug is represented as field-value pairs, such as identity, severity, status, product and summary, among others; (ii) the description information detailing the bug and any follow-up discussion or actions is typically represented as textual attachments.

Figure 5 shows the unified bug data model in the form of a UML class diagram [52]. The central class is the Bug class. The id attribute, which serves as the identifier of a bug, is the URL where the bug can be retrieved. Most of the attributes can be easily extracted from the retrieved data. The severity attribute is probably the most interesting to fill correctly because BTSs have very different severity classifications for bugs. This unified model defines four severity values: critical, normal, minor and feature, to which the severity values of the BTSs can be mapped. The status attribute only has two values: open represents what BTSs refer to as unconfirmed, new, assigned or reopened bugs, while fixed represents what BTSs refer to as resolved, verified or closed bugs.
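The mapping onto the unified status and severity values can be sketched as a pair of lookup functions. The status groups follow the text above; the severity table is a hypothetical example that assumes Bugzilla-style source labels, since the actual mapping used for each BTS is not given here.

```python
OPEN_STATES = {"unconfirmed", "new", "assigned", "reopened"}
FIXED_STATES = {"resolved", "verified", "closed"}

# Hypothetical mapping from Bugzilla-style severities to the four unified values.
SEVERITY_MAP = {
    "blocker": "critical",
    "critical": "critical",
    "major": "critical",
    "normal": "normal",
    "minor": "minor",
    "trivial": "minor",
    "enhancement": "feature",
}


def unify_status(status):
    """Collapse a BTS-specific status into 'open' or 'fixed'."""
    key = status.strip().lower()
    if key in OPEN_STATES:
        return "open"
    if key in FIXED_STATES:
        return "fixed"
    raise ValueError(f"unknown status: {status!r}")


def unify_severity(severity, default="normal"):
    # Unknown labels fall back to a default; this fallback is an assumption.
    return SEVERITY_MAP.get(severity.strip().lower(), default)
```

Raising an error for an unknown status, instead of guessing, makes it visible when a new BTS uses a label the crawler has not seen before.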

The textual description is modeled as the Attachment class. Every attachment belongs to exactly one bug. Some BTSs provide information about the platforms affected by a bug. The Platform class represents platforms, e.g., Windows or MacOS. The Symptom class represents keywords used to describe and classify bugs. Simple classifications such as severity, status and platform can be derived from the information of bugs. Symptoms provide more complicated classifications such as problem scope and problem type; e.g., a bug related to network failure can be related to connectivity, authentication service, or hardware and software configuration. These classifications help to narrow down the scope of a bug and to figure out the bugs related to it. Symptoms contain sets of distinct keywords, typical debugging messages or diagnostic patterns.

The left part of Figure 5 models what piece of software a bug is concerned with. While some BTSs are only concerned with bugs in a specific piece of software, software in larger projects is split into components, and bugs can be related to specific components. The Software and Component classes model this structure. The Debian BTS is somewhat different from the other BTSs as it is primarily used to track issues related to software “packages”, that is, software components packaged for end user deployment. Since there is a large amount of meta information available for Debian software packages (dependency, maintainer and version information), we have introduced a separate Package class to represent packaged software.
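The classes named above can be sketched as Python dataclasses. Only the class names and the id, severity and status attributes come from the description; the remaining attribute names, the types and the way associations are stored are assumptions, since the diagram itself is not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Software:
    name: str


@dataclass
class Component:
    name: str
    software: Software  # a component belongs to a piece of software


@dataclass
class Package:
    name: str
    version: str = ""
    maintainer: str = ""
    dependencies: List[str] = field(default_factory=list)


@dataclass
class Platform:
    name: str  # e.g., "Windows"


@dataclass
class Symptom:
    keyword: str  # keyword, debugging message or diagnostic pattern


@dataclass
class Attachment:
    text: str  # textual description or follow-up discussion


@dataclass
class Bug:
    id: str        # URL where the bug can be retrieved
    severity: str  # critical, normal, minor or feature
    status: str    # open or fixed
    summary: str = ""
    attachments: List[Attachment] = field(default_factory=list)
    platforms: List[Platform] = field(default_factory=list)
    symptoms: List[Symptom] = field(default_factory=list)
    component: Optional[Component] = None
    package: Optional[Package] = None
```

Storing attachments inside the bug reflects the rule that every attachment belongs to exactly one bug; a relational schema would express the same rule with a foreign key.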

Fig. 6. Average memory usage for a super peer with various datasets.

V. SYSTEM EVALUATION

The DisCaRia system contains several super peers with sufficient processing, storage, memory and bandwidth capabilities, and each super peer contains several components with various functionalities. Our previous studies have evaluated these components and algorithms individually, such as the feedback scheme [13] for the P2P component, the multiple vector representation method [10] for the computation component, the two-process probabilistic reasoning method [11] for the reasoning component, and the bug report crawlers [14] for the storage component. However, it is also necessary to evaluate the performance and efficiency of the complete DisCaRia system. The responsiveness metric assesses how fast the system responds to queries. This metric depends on various factors including processing power, bandwidth and the retrieval algorithm. The quality metric measures how precisely the system answers problem queries. This metric depends on various factors including the size of the database and the reasoning algorithm.

The database has been populated by crawling various archives, forums and BTSs such as Mozilla Bugzilla, Red Hat Bugzilla, Launchpad Ubuntu, etc. These software bug datasets contain different numbers of bug reports. A super peer with a normal hardware configuration (Intel Pentium 4 3.6 GHz, 1 GB RAM) can generally accommodate more than 100,000 bug reports. Figure 6 shows that the average memory usage of a super peer increases slowly with the number of bug reports. The reasoning component handles the classifying and selecting processes that work with the features of cases, and the computation component maintains the indexing and filtering processes that work with the textual descriptions of cases. These two components allocate almost all the memory used by the super peer. The super peer uses approximately 350 MB to manage 100,000 bug reports. Super peers can therefore accommodate a large number of bug reports without the need for powerful computers. DisCaRia with a large number of super peers can exploit idle and shared resources to form a large knowledge database for problem resolution. Note that the size of bug reports fluctuates considerably depending on the BTS.

Evaluating the responsiveness of DisCaRia requires a distributed environment exhibiting real network latency, peer heterogeneity and churn rate. EmanicsLab [15], supported by the European Network of Excellence for the Management of Internet Technologies and Complex Services (EMANICS), is a flexible and re-usable distributed computing and storage testbed that fosters joint research activities of EMANICS partners. EmanicsLab is based on a complete installation of PlanetLab Central, the backend management infrastructure of PlanetLab [53]. The testbed contains 20 nodes located at 10 universities and research institutes in Europe. DisCaRia has been installed on 11 EmanicsLab nodes, as shown in Table II, acting as super peers with a web interface running on web servers at Jacobs University.

TABLE II
EmanicsLab super peers configuration

Fig. 7. DisCaRia deployment on EmanicsLab.

Figure 7 plots the DisCaRia peers deployed on EmanicsLab. These peers randomly connect to each other and directly update their databases from BTSs. EmanicsLab contains some degree of node heterogeneity since nodes possess varying processing, memory and bandwidth capabilities. Nodes are dedicated to running some distributed applications from the partners; thus network latency is sufficient to evaluate this system. All queries sent to DisCaRia are performed through the web interface. The queries extracted from bug reports comprise textual descriptions, symptoms or debugging messages. The queries are continuously fed to the web interface using a specific tool that connects directly to super peers. The tool also receives the queryhits from super peers for analysis. Table III describes the parameter configuration of super peers. The churn rate is set to 1% of the total number of super peers, and the other parameters are the maximum numbers of connections or queryhits.

TABLE III
Parameter configuration

Fig. 8. Average time for retrieving the different numbers of queryhits.

Fig. 9. Average time consumption per query over various sizes of the datasets.

Figure 8 plots the average time for retrieving different numbers of queryhits. The basic or advanced result contains queryhits returned from super peers running the retrieval algorithm only or running both the retrieval and reasoning algorithms, respectively. The limited or unlimited result contains queryhits returned from super peers connecting to a limited or unlimited number of super peers, respectively. Local super peers send queryhits within 1 second, while remote super peers send queryhits within 4 to 5 seconds. Running the retrieval and reasoning algorithms consumes more time than running the retrieval algorithm only. Increasing the number of connections causes super peers to forward queries to further super peers, thus increasing communication time.

Figure 9 plots the average time consumption of the computation component, the reasoning component and the super peer over the various sizes of the datasets. The computation time increases logarithmically as the number of bugs increases. The
