DisCaRia—Distributed Case-Based Reasoning
System for Fault Management
Ha Manh Tran and Jürgen Schönwälder, Senior Member, IEEE
Abstract—Fault resolution in communication networks and
distributed systems is a challenge that demands the expertise of
system administrators and the support of multiple systems, such as
monitoring and event correlation systems. Trouble ticket systems
are frequently used to organize the workflow of the fault
resolution process. In this context, we introduce DisCaRia, a distributed
case-based reasoning system that assists system administrators
and network operators in resolving faults. DisCaRia integrates
various fault knowledge resources that are already available on the
Internet, and it exploits them by applying a distributed case-based
reasoning methodology, which is based on scalable peer-to-peer
technology. We present the architecture of DisCaRia and the key
algorithms it uses, and we provide an evaluation of a prototype
implementation of the system.
Index Terms—Fault resolution, fault management, case-based
reasoning, peer-to-peer, bug tracking system, software bug search.
I. INTRODUCTION
THE RESOLUTION of faults in communication networks
and distributed systems is to a large extent a human-driven
process. Automated monitoring and event correlation systems
[1]–[4] usually produce fault reports that are forwarded to
operators for resolution. Support systems [5]–[7] such as trouble
ticket systems are frequently used to organize the workflows.
Case-based Reasoning (CBR) [8] was proposed in the
early 1990s to assist operators in the resolution of faults by
providing mechanisms to correlate an observed fault with
previously solved similar cases (faults). CBR systems are typically
linked to trouble ticket systems since the data maintained in
trouble ticket systems can be used to populate a case database.
Existing CBR systems for fault resolution usually operate only
on a local case database and cannot easily exploit knowledge
about faults and their resolutions present at other sites. This
restriction to local knowledge resources, however, becomes an
issue in environments where software components and offered
services change very dynamically and the case database is thus
frequently outdated.
Manuscript received March 2, 2015; revised October 20, 2015; accepted
October 20, 2015. Date of publication October 30, 2015; date of current version
December 17, 2015. This work was supported in part by the Vietnam National
Foundation for Science and Technology Development (NAFOSTED) under
Grant 102.02-2011.01, in part by Flamingo, a Network of Excellence project
(ICT-318488) supported by the European Commission under its Seventh
Framework Programme, and in part by the EC IST-EMANICS Network of
Excellence under Grant 26854. The associate editor coordinating the review
of this paper and approving it for publication was J. Lobo.
H. M. Tran is with the School of Computer Science and Engineering,
International University-Vietnam National University, Ho Chi Minh City,
Vietnam (e-mail: tmha@hcmiu.edu.vn).
J. Schönwälder is with the Department of Computer Science and
Electrical Engineering, Jacobs University Bremen, Bremen, Germany (e-mail:
j.schoenwaelder@jacobs-university.de).
Digital Object Identifier 10.1109/TNSM.2015.2496224
With the recent growth of virtual communities, social networks, and cloud systems, domain-specific search engines often become a suitable alternative to general-purpose search engines. The key advantage of these systems is that they focus the search on specific domains. Domain-specific search engines have the potential to connect a large number of experts with similar interests. A virtual community of networking experts can provide the best solutions for networking problems. Cloud computing systems, fostering the centralization of various services, in particular require a large number of experts and tools to manage faults and failures. This is especially true for inter-cloud environments [9] that support applications and services running on multiple cloud systems. It is thus necessary to develop support systems that can exploit the knowledge of several virtual communities and that can connect groups of experts for resolving problems.
Our distributed case-based reasoning system DisCaRia takes advantage of Peer-to-Peer (P2P) technology to extend the capability of conventional CBR systems by exploring problem-solving knowledge resources available in a distributed environment. The DisCaRia peers operate in parallel and comprise independent CBR components that work concurrently to scrutinize the knowledge resources. Our distributed CBR approach applies previous research activities to improve the performance of managing a large case database and the quality of the proposed solutions, including case retrieval and reasoning approaches [10]–[12] and case learning and retention [13], [14]. This article therefore uses the DisCaRia system to present the novel integration and adaptation of CBR and P2P technologies to address a network and service management problem. The contribution of the article is thus threefold:
1) We propose the distributed CBR approach based on multiple CBR engines organized in a P2P network for exploring and exploiting federated fault knowledge databases.
2) We present the DisCaRia system with its main methods and algorithms, and how they work together to achieve the integration and adaptation of CBR and P2P technologies.
3) We perform an evaluation of the DisCaRia system on the EmanicsLab distributed computing testbed [15] with federated bug databases obtained from several popular bug tracking systems.
The rest of the article is structured as follows: Section II discusses related work. Section III introduces the distributed CBR approach and the research activities applied to the main components of the DisCaRia system. The creation and maintenance of the case database are detailed in Section IV. Section V presents several experiments performed on the EmanicsLab distributed computing testbed in order to evaluate the performance of the DisCaRia system. The paper concludes with some remarks on future work in Section VI.
1932-4537 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. BACKGROUND AND RELATED WORK
This section reviews related work in the areas of trouble
ticket and bug tracking systems for fault resolution, case-based
reasoning, peer-to-peer technology, and network search. For
simplicity, we use the terms super peer and DisCaRia peer,
as well as case and bug, interchangeably in this article.
A. Trouble Ticket Systems and Bug Tracking Systems
Trouble ticket systems have been widely used by network
operators in order to assure the quality of communication
services. The ITU-T recommendation X.790 [16] defines trouble
as “any cause that may lead to or contribute to a manager
perceiving a degradation in the quality of service of one or more
network services or one or more network resources being
managed.” X.790 introduces an interface for the interaction among
parties, e.g., a trouble report data model, and a process for the
resolution of troubles, e.g., determining the status of a
trouble, escalating its severity, and notifying involved parties of its
resolution. The informational IETF document RFC 1297 [17]
defines trouble as “a single malfunctioning piece of hardware
or software that breaks at some time, has various efforts to fix
it, and eventually is fixed at some given time.” Trouble ticket
systems can be used for communication among network
operation centers (NOCs), and they can be associated with a network
alert system for generating trouble tickets automatically and for
monitoring the progress of the trouble tickets. The TMF
documents NMF501 [18] and NMF601 [19] model a trouble
administration system that provides the interface between a service
provider and users for managing trouble information. NMF501
focuses on the business and technical requirements and the
trouble administration process, whereas NMF601 describes
the functionality of exchanging management information to
meet the requirements. Moreover, the system can create and
track telecommunication troubles reported following the X.790
recommendation.
A Bug Tracking System (BTS) is a trouble ticket system used to keep track of software bugs. A bug tracking system uses an information model for a problem (also sometimes called a ticket, bug, defect, etc.) that is very similar to the information model used by trouble ticket systems. Pre-defined fields are commonly used to keep track of the status of the problem, while textual descriptions are used to describe the problem and to track the problem resolution process. BTSs in general aim at improving the quality of software products. They do so by keeping track of reported problems and by maintaining historical records of previously experienced problems. They also establish the basis of a knowledge base for an expert system that allows searching for similar past problems and that provides reports and statistics for performance evaluation of the services [20]. While most expert systems proposed for fault diagnosis and resolution, such as ACE [21], COMPASS [22], NEMESYS [23], Troubleshooter [24], DAD [25] and MANDOLIN [26], explore a knowledge database supplied by a single BTS, DisCaRia is able to explore a federated knowledge database supplied by several BTSs.
Fig. 1. Cyclic case-based reasoning approach using four processes (retrieval, reuse, revision, and retention) and a common case database, adopted from [8].
B. Case-Based Reasoning
Case-based Reasoning (CBR) [8] seeks to find solutions for problems by exploiting experience. A case essentially consists of a description of a specific problem and the corresponding solution. When a new problem appears, the reasoning process first uses a similarity function to retrieve cases matching the current problem, and then it adapts the retrieved cases to the circumstances of the current problem in order to obtain a possible solution. Depending on the characteristics of a certain problem domain, this reasoning process can either classify the problem into a group of previously resolved problems or propose an adapted solution for the problem. Problem classification can be suitable for a problem domain with a relatively large case database. Various reasoning techniques including rule-based reasoning, fuzzy logic, neural networks and belief networks can be used for this process.
Following the discussion in [8], a CBR system consists of four processes as shown in Figure 1:
1) case retrieval to obtain similar cases,
2) case reuse to propose adapted solutions,
3) case revision to verify the adapted solution,
4) case retention to learn the solution.
The CBR system uses a case database to store and provide cases for the operation of the CBR processes.
Several CBR systems [5], [7], [27] have been proposed for fault diagnosis and resolution. These systems usually collaborate with trouble ticket systems in order to take advantage of trouble tickets as the case database. While these systems can learn from previous problems to propose solutions for novel problems, they usually only operate on a local case database and hence cannot exploit problem-solving knowledge resources present at remote sites. Using shared knowledge resources not only provides better opportunities to find solutions but also improves the case databases, which otherwise frequently become obsolete in environments where software components and offered services change very dynamically. DisCaRia uses several CBR engines that interact through a self-organizing P2P network in order to exploit various problem-solving knowledge resources.
C. Peer-to-Peer Technology
Peer-to-Peer (P2P) technology [28] was introduced to
establish application-specific network overlays operating over
the Internet [29]. A P2P network consists of peers that act
as both client and server simultaneously. P2P systems exhibit a
number of interesting properties such as self-organization,
scalability, flexibility, and fault tolerance. Peers join and leave the
networks with loose control, enabling fully distributed systems
with a very large number of peers. Acting in both client and
server roles, peers share resources, such as bandwidth,
storage space, and computing power, and typically provide lookup
functions. Since P2P networks do not have a hierarchical
organization or centralized control, they are designed such that the
failure of individual peers can affect the availability of
certain resources but cannot cause the failure of the overall P2P
network.
P2P networks can be classified into two categories:
structured and unstructured networks. Structured P2P networks
maintain a controlled and stable overlay network topology. A
structured P2P network aims at distributing content at
deterministic locations using Distributed Hash Tables (DHTs), thus
facilitating efficient content search and lookup. Examples of
structured P2P networks are CAN [30], Chord [31], Tapestry
[32] and Kademlia [33]. Unstructured P2P networks spend less
effort on controlling the overlay network topology and the
location of content. Instead, they tend to grow randomly
without maintaining a certain network topology according to
tight rules. The content is arbitrarily distributed on the peers,
thus fostering different search methods. Examples of
unstructured P2P networks include Gnutella [34], Freenet [35], and
BitTorrent [36].
Hybrid P2P networks combine the characteristics of
structured and unstructured P2P networks, and they often integrate
well with the client-server paradigm [37]. The main idea is to
distinguish so-called super peers from peers that possess
varying storage, bandwidth and processing capabilities. Hybrid P2P
networks organize peers into clusters using a clustering
technique. A cluster contains at least one capable peer or super
peer; other peers connect to the super peer in the cluster. The
connections between the super peers form the super peer P2P
network. With sufficient storage, bandwidth and processing
power, the super peers act both as a server to handle queries
from other peers and as a client to route queries to the other
super peers. The content advertisement and search mechanisms
only take place on the super peer network. Hybrid P2P networks
facilitate advanced search mechanisms due to the processing,
storage and bandwidth resources available at the super peers.
Examples of super peer networks include Piazza [38], Edutella
[39] and Bibster [40]. DisCaRia uses a hybrid P2P network to
take advantage of the inherent scaling properties that hybrid P2P
networks offer.
D. Network Search
More recently, distributed network search algorithms have been suggested as a primitive for building future network management systems [41], [42]. Network search systems are organized as an overlay over a physical network topology. The overlay provides a distributed query processing facility that can be used to retrieve operational state and configuration data from network elements. The approach shares some similarity with the ideas behind the DisCaRia system. However, the work on network search primarily aims to provide a generic search mechanism for management and monitoring data, while the DisCaRia system assumes that a certain CBR functionality is implemented in the DisCaRia peers. The idea of the DisCaRia system has also been applied to building a fault management system in the inter-cloud environment that supports applications and services running across multiple cloud systems [43]. This system recruits a P2P network of fault managers that allows system administrators to monitor faults and to search for similar faults with solutions on cloud systems.
III. DisCaRia SYSTEM
The distributed CBR approach focuses on CBR methodology and P2P technology. While CBR methodology has been widely known as a problem-solving method for several problem domains, P2P technology has been widely used for resource sharing in distributed environments. P2P systems provide remarkable features including self-organization in management, scalability in architecture, flexibility in content distribution and fault tolerance that allow peers to join the networks and exchange problems and solutions easily. CBR systems require expressive case representation methods, which allow retrieval and reasoning techniques to work efficiently. A distributed CBR system contains CBR engines communicating through a P2P network. It offers the capability of exploring and exploiting knowledge resources for retrieval and reasoning that can be applied to fault resolution in communication networks and distributed systems.
DisCaRia uses a P2P network to achieve a certain level of self-organization and to benefit from a scalable architecture. The system extends the underlying basic P2P network with a CBR approach for obtaining more relevant information for fault resolution [13]. There are two kinds of peers in the system: super peers and peers. Each super peer bears several components to perform CBR operations, e.g., retrieving, adapting, verifying and learning cases. These components require sufficient storage, bandwidth and processing power. Figure 2 shows the DisCaRia system architecture based on a P2P network of super peers. The super peers deal with complicated operations, thus alleviating the problem of peer heterogeneity, i.e., peers with limited capability do not undertake complicated operations. Each super peer is responsible for multiple functions including communication, computation, reasoning and maintenance, while each regular peer only communicates with super peers for finding relevant resources.
A super peer contains four main components: a P2P component, a computation component, a reasoning component, and a storage component. These components are realized as independent processes; thus the failure of one component affects only this component. Each component consists of several functional modules that handle different tasks, e.g., the P2P component possesses modules that manage incoming and outgoing connections, and the computation component possesses modules that handle similarity evaluation and case indexing. Figure 3 presents the main components and how they interact within a super peer. The reasoning component communicates with the computation component to obtain cases from the case database, and it communicates with the P2P component to obtain cases from other super peers. The storage component communicates with external BTSs to maintain the case database, and the P2P component communicates with other super peers to exchange cases. The P2P component communicates with peers for requests and responses. In addition, DisCaRia supports a web interface for querying super peers directly without joining the P2P network.
Fig. 2. DisCaRia system architecture with peers and super peers.
Fig. 3. Components of a DisCaRia super peer and their interactions.
In the following sections, we first discuss the communication protocols used by DisCaRia and then we explain each DisCaRia component in more detail.
A. Communication Protocol
DisCaRia uses the Gnutella P2P protocol [34] to build a fully distributed and unstructured P2P network of super peers. This P2P network has several advantages that facilitate data sharing and searching functions. First, the network uses super peers to solve the problem of peer heterogeneity: peers with insufficient bandwidth and processing power cannot participate in complicated tasks, such as routing and processing queries. Second, the network allows super peers to maintain a local database and to perform keyword and semantic search methods on the database. Third, the network also provides flexible data replication mechanisms for data sharing. The disadvantage of the network is the flooding-based routing mechanism, which can cause a large amount of traffic in the network. DisCaRia improves this routing mechanism by using the feedback scheme presented in the following section. The Gnutella protocol supports five types of messages: ping and pong are used to probe the network, query and queryhit are used to exchange data, and push is used to deal with peers behind a firewall. Downloading data is handled separately from this protocol. A Gnutella message consists of a header and a payload. The fields of the header are shown below:
ID | Payload Descriptor | TTL | Hops | Payload Length
The ID field is a 16-octet string uniquely identifying the message on the network. The Payload Descriptor field is 1 octet in size and identifies the message type. The TTL field contains the number of times this message will be forwarded to peers before it is discarded. The Hops field contains the number of times the message has been forwarded. These two fields are both 1 octet long. The Payload Length field is 4 octets long and indicates the length of the payload. The payload immediately follows the header. The fields of the payload depend on the message type, as defined in the protocol specification [34]. The fields of the payload of a queryhit message are shown below:
Number of Results | Set of Results
The Number of Results field is 1 octet long and contains the number of results in the result set. The Set of Results field stores a set of super peers and peers with their corresponding results, i.e., a set of solutions for a certain problem.
The ping, pong, query and queryhit messages remain unchanged, while DisCaRia extends the Gnutella protocol by adding a feedback message type. This message type allows peers receiving queryhits to evaluate the results contained in the queryhits and to send feedback to the related peers. The fields of the payload of a feedback message are shown as follows:
Number of Evaluations | Set of Evaluations
The Number of Evaluations field is 1 octet long and contains the number of evaluations in the evaluation set. The Set of Evaluations field stores a set of super peers and peers with their corresponding results, and the grade of the results.
DisCaRia not only allows peers to connect to super peers, it also provides a web interface to access super peers. This feature enables users to use the DisCaRia search function without downloading and installing a peer.
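The header layout described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field sizes come from the text, the descriptor codes for ping, pong, push, query and queryhit are those of Gnutella 0.4, the little-endian byte order is assumed, and the code `FEEDBACK = 0x90` is hypothetical because the text does not give a value for the new message type.

```python
import struct
import uuid

# Header layout from the text: ID (16 octets), Payload Descriptor (1),
# TTL (1), Hops (1), Payload Length (4), i.e. 23 octets in total.
# Assumption: little-endian byte order for the Payload Length field.
HEADER = struct.Struct("<16sBBBI")

# Descriptor codes of the five Gnutella 0.4 message types.
PING, PONG, PUSH, QUERY, QUERYHIT = 0x00, 0x01, 0x40, 0x80, 0x81
# Hypothetical code for the DisCaRia feedback message (not given in the text).
FEEDBACK = 0x90


def pack_message(msg_type, ttl, hops, payload, msg_id=None):
    """Return header followed by payload as bytes."""
    msg_id = msg_id or uuid.uuid4().bytes  # 16 random octets as message ID
    return HEADER.pack(msg_id, msg_type, ttl, hops, len(payload)) + payload


def unpack_message(data):
    """Split raw bytes into the header fields and the payload."""
    msg_id, msg_type, ttl, hops, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    return msg_id, msg_type, ttl, hops, payload


def forward(data):
    """Decrement TTL and increment Hops; return None once the TTL is used up."""
    msg_id, msg_type, ttl, hops, payload = unpack_message(data)
    if ttl <= 1:
        return None
    return pack_message(msg_type, ttl - 1, hops + 1, payload, msg_id)
```

Keeping the message ID unchanged in `forward` is what lets a receiving super peer recognize duplicates of a flooded query and route the matching queryhit back along the reverse path.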
B. Peer-to-Peer Component
The P2P component organizes the communication with other super peers and between a super peer and its directly attached regular peers. It also organizes the communication with the reasoning component. This is achieved by using the same P2P protocol to exchange information with the super peer's reasoning component, i.e., the reasoning component also acts as a peer.
Fig. 4. Exchange of query, queryhit, and feedback messages.
The P2P component uses a feedback mechanism [13] for
evaluating the quality of queryhits, thus fostering peer
learning. The mechanism extends the Gnutella protocol to
include the feedback message described above. Figure 4
illustrates super peer communication and the usage of feedback
messages. During bootstrapping, a peer discovers and connects
to its super peer. Subsequently, the peer can send a query
message to its super peer 1. The super peer forwards the message
to its neighboring super peers 2 and 4, and they in turn
forward the query to the super peers 5, 6, 3 and 7. Let us assume that
the super peers 3 and 5 are able to find an answer to the query.
Super peer 3 sends a queryhit message via 4 and 1 back to the
peer, while super peer 5 sends a queryhit message via 2 and
1 back to the peer. After receiving the queryhit messages, the
peer evaluates the received answers and generates feedback
messages to the super peers 3 and 5. The feedback messages
are forwarded by the super peers 1, 2, and 4. The forwarding
super peers also learn about the peer's evaluation of the
queryhit results.
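The message exchange in this example can be replayed with a small flooding simulation. The sketch below is illustrative only: the links 1-2, 1-4, 2-5 and 4-3 follow from the paths named in the text, whereas attaching super peer 6 to 2 and super peer 7 to 4 is an assumption, and the names `OVERLAY` and `flood` are hypothetical.

```python
from collections import deque

# Overlay of super peers. Links 2-6 and 4-7 are assumed; the text only
# says that the query reaches super peers 5, 6, 3 and 7 in the second step.
OVERLAY = {1: [2, 4], 2: [1, 5, 6], 4: [1, 3, 7],
           3: [4], 5: [2], 6: [2], 7: [4]}


def flood(start, ttl, can_answer):
    """Flood a query from `start`; return {answering peer: reverse path}."""
    parent = {start: None}          # remembers where each peer got the query
    queue = deque([(start, ttl)])
    hits = {}
    while queue:
        peer, remaining = queue.popleft()
        if peer in can_answer:
            # The queryhit travels back along the path the query came from.
            path, node = [], peer
            while node is not None:
                path.append(node)
                node = parent[node]
            hits[peer] = path
        if remaining == 0:
            continue                # TTL exhausted, do not forward further
        for nbr in OVERLAY[peer]:
            if nbr not in parent:   # ignore duplicates of a query already seen
                parent[nbr] = peer
                queue.append((nbr, remaining - 1))
    return hits


hits = flood(start=1, ttl=2, can_answer={3, 5})
# Super peers that forward the feedback and therefore learn from it.
forwarders = {p for path in hits.values() for p in path[1:]}
```

With these assumptions, `hits` maps super peer 3 to the path 3, 4, 1 and super peer 5 to the path 5, 2, 1, and `forwarders` contains the super peers 1, 2 and 4, which matches the description above.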
The feedback mechanism not only facilitates the learning process of super peers (corresponding to the retention process of the CBR approach), it also improves the flooding mechanism by sending the queries to a set of super peers that have proven their competence in previous queryhits. To achieve this improvement, super peers need to keep track of queries and the super peers with correct queryhits by listening to feedback messages, i.e., QryLst is a list of elements (q, PrLst), where q is a query and PrLst is a list of super peers. Super peers forward queries to selected sets of expert super peers and random super peers, i.e., ExpLst is a list of elements (p, e), where p is an expert super peer and e is its corresponding expert score, and NbrLst is a list of neighbor super peers. Note that the expert super peers and their scores in ExpLst are regularly updated by learning from PrLst. The random super peers add some randomness to the scheme in order to allow new super peers to join the P2P network and to become experts on their own. Algorithm 1 shows the super peer ranking and selection algorithm. It iterates over a set of recent similar queries in QryLst (line 3) to obtain a set of expert super peers RnkLst (lines 4-5). Since each super peer in RnkLst has its expert score e stored in ExpLst, the sort function ranks RnkLst by the expert score (line 7). It then chooses up to γ − 1 super peers from the ranked set RnkLst (lines 8-10). If the number of ranked super peers is insufficient, it considers the top elements of the query set QryLst (lines 11-15), which includes the most recently used super peers. It finally fills the set of selected super peers SelLst with super peers from the neighbor set NbrLst.
Algorithm 1 Super Peer Ranking and Selection
Input: QryLst: a list of elements (q, PrLst)
       NbrLst: a list of neighbor super peers
       ExpLst: a list of elements (p, e)
       RnkLst: a list of ranked super peers
       threshold, γ: similarity threshold and number of super peers
       q_c: an input query
Output: a list of selected super peers SelLst
1  RnkLst ← ∅
2  SelLst ← ∅
3  for each (q_i, PrLst_i) ∈ QryLst do
4    if sim(q_c, q_i) > threshold then
5      RnkLst ← RnkLst ∪ PrLst_i
7  sort(RnkLst, ExpLst)
8  for each p_j ∈ RnkLst do
9    SelLst ← SelLst ∪ {p_j}
10   if |SelLst| = γ − 1 then break end if
11 for each (q_i, PrLst_i) ∈ QryLst and p_j ∈ PrLst_i do
12   if |SelLst| = γ − 1 then break end if
13   if p_j ∉ SelLst then
14     SelLst ← SelLst ∪ {p_j}
16 for each p_j ∈ rand(NbrLst) do
17   if p_j ∉ SelLst then
18     SelLst ← SelLst ∪ {p_j}
20   if |SelLst| = γ then break end if
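As a reading aid, the three stages of Algorithm 1 can be sketched in Python. This is one possible interpretation rather than the authors' implementation: ExpLst is modelled as a dictionary, the similarity function is passed in by the caller, and the `jaccard` helper is a hypothetical stand-in for sim, which the text does not specify here.

```python
import random


def jaccard(a, b):
    """Stand-in similarity: word overlap between two query strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def select_super_peers(qry_lst, nbr_lst, exp_lst, q_c, sim, threshold, gamma,
                       rng=random):
    """Pick up to `gamma` super peers to which query `q_c` is forwarded.

    qry_lst: list of (query, [super peers with correct queryhits]),
             assumed to be ordered from most to least recent
    nbr_lst: list of neighbor super peers
    exp_lst: dict mapping super peer -> expert score
    """
    # Lines 1-5: collect super peers that answered similar queries.
    rnk_lst = []
    for q_i, pr_lst in qry_lst:
        if sim(q_c, q_i) > threshold:
            for p in pr_lst:
                if p not in rnk_lst:
                    rnk_lst.append(p)
    # Line 7: rank the candidates by expert score, best first.
    rnk_lst.sort(key=lambda p: exp_lst.get(p, 0.0), reverse=True)

    # Lines 8-10: take at most gamma - 1 experts.
    sel_lst = rnk_lst[:max(gamma - 1, 0)]

    # Lines 11-14: top up with recently used super peers.
    for _, pr_lst in qry_lst:
        for p in pr_lst:
            if len(sel_lst) >= gamma - 1:
                break
            if p not in sel_lst:
                sel_lst.append(p)

    # Lines 16-20: fill the remaining slots with random neighbors.
    for p in rng.sample(nbr_lst, len(nbr_lst)):
        if len(sel_lst) >= gamma:
            break
        if p not in sel_lst:
            sel_lst.append(p)
    return sel_lst
```

Reserving the last slot for a random neighbor is what gives newly joined super peers a chance to receive queries and build up an expert score.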
C. Reasoning Component
This component is the heart of the DisCaRia peer. It comprises the case retrieval and reasoning operations, which correspond to the retrieval and reuse operations of the CBR approach. To process a case efficiently, the case is represented by multiple vectors using the multi-vector representation method [10]:
• a field-value vector to express pre-defined fault features, e.g., fault categories, system components, and product releases. To represent n pairs, we employ the field-value vector $v_f = \langle f_1{:}v_1, \ldots, f_k{:}v_k, \ldots, f_n{:}v_n \rangle$, where k is the fixed number of pre-defined pairs.
• a field-value vector to express comparable fault features (user-defined features), e.g., error messages, symptoms, and debug snippets. These pairs are represented by $v_p = \langle p_1{:}v_1, \ldots, p_m{:}v_m \rangle$, where m is the number of symptoms and parameters. The values are either binary, numeric or symbolic.
• a real-value vector to represent fault details in textual form, e.g., fault descriptions, discussions, and related fault features. This semantic vector $v_s$ is generated by the LSI technique [44].
The similarity of field-value vectors is measured by the sum of the weight values of matched field-value pairs, while the similarity of real-value vectors is measured by the cosine function. The following example is a fault case extracted from a networking forum:
Problem: Hub connectivity not functioning
Description: The WinXP network contains many
machines obtaining an address from a DHCP server.
The network is extended by 3 more machines by
unplugging the LAN cable from one of the machines
and plugging it into a hub with the intention to add the
3 new machines. From the hub, none of the machines
successfully obtains an IP address from the DHCP
server; an error message shows “low or no network
connectivity”.
To make this fault case understandable and comparable for
CBR engines, vector $v_f$ contains <problem_type: connectivity,
problem_area: hardware configuration, hardware: PC,
platform: WinXP>. Vector $v_p$ comprises <network: LAN,
error-message: low or no network connectivity, ip-address:
false, DHCP: true>. Using LSI, several terms are considered
to build the vector $v_s$.
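The two similarity measures named above can be sketched as follows. The case vectors are taken from the hub example; the query vector and the weight values are hypothetical and serve only to show the mechanics, and the function names are not taken from the DisCaRia implementation.

```python
import math


def field_value_similarity(query, case, weights):
    """Sum of the weights of the field-value pairs that match exactly."""
    return sum(weights.get(field, 0.0)
               for field, value in query.items()
               if case.get(field) == value)


def cosine_similarity(u, v):
    """Cosine of the angle between two real-valued vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# v_f and v_p of the hub connectivity case from the text.
case_vf = {"problem_type": "connectivity",
           "problem_area": "hardware configuration",
           "hardware": "PC", "platform": "WinXP"}
case_vp = {"network": "LAN",
           "error-message": "low or no network connectivity",
           "ip-address": False, "DHCP": True}

# Hypothetical query and weights, for illustration only.
query_vf = {"problem_type": "connectivity", "platform": "WinXP",
            "hardware": "laptop"}
weights = {"problem_type": 0.5, "platform": 0.3, "hardware": 0.2}

score = field_value_similarity(query_vf, case_vf, weights)
```

Here two of the three query pairs match the case, so `score` is the sum of their weights. A semantic vector $v_s$ produced by LSI would be compared with `cosine_similarity` in the same way.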
Upon receiving a query from the P2P component, the
retrieval operation first checks whether the query already exists
in the cache, and processes the query to generate multiple
vectors. It then compares these vectors with the vectors of the
cases to select relevant cases from the database. Evaluating
vectors requires the computation component to classify cases
based on field-value vectors and to index cases based on
real-value vectors. These computing operations normally process
large matrices, thus consuming a large amount of time. Super
peers index and select the cases based on their database, then
send the selected cases to the querying peer using the P2P
component. Aggregating cases from super peers results in a set of
retrieved cases.
Upon receiving the retrieved cases, the reasoning operation
uses the two-process probability reasoning mechanism [11] to
process the cases. The ranking process of the mechanism aims
to narrow down the scope of the query by weighting the
common symptoms between the retrieved cases and the query. This
process applies the k-Nearest-Neighbor (kNN) algorithm [45]
to rank the cases that share many of the same field-value pairs
with the query. The computation component is responsible for
assigning weight values to field-value pairs. The selection
process of the mechanism aims to predict some promising cases
for the query by correlating the ranked cases and the
query through the common symptoms. This process employs
the Bayesian approach to compute probability values for the
ranked cases, and selects the cases with high probability
values. The probability value indicates the strength of belief in the
case based on the previous knowledge of the ranked cases and
the observed symptoms.
Given a set of cases $C$, where a case $C_r$ contains a set of symptoms $\{S_1, S_2, \ldots\}$ and a solution, we regard the solutions in $C$ as a set of exhaustive and mutually exclusive hypotheses $\{H_1, \ldots, H_n\}$, and we assume that any symptom is the result of a diagnosing probe, e.g., a ping probe provides either a high probability of success or a low probability of success (i.e., failure). The problem contains a set of symptoms $\{S_1, \ldots, S_h\}$ without a solution (note that cases and the problem can share the same symptoms). Thus, the puzzle is to find the highest conditional probability of the hypotheses $P(H_i \mid S_1, \ldots, S_h)$ with $i = 1, \ldots, n$.
Consider a set of exhaustive and mutually exclusive hypotheses $H_1, \ldots, H_n$ and a set of evidence pieces (or symptoms) $S_1, \ldots, S_h$ obtained from the problem, with the assumption that $S_1, \ldots, S_h$ are independent of each other. Applying the conditional probability formula, we obtain:
$$P(H_i \mid S_1, \ldots, S_h) = \frac{P(S_1, \ldots, S_h \mid H_i)\, P(H_i)}{P(S_1, \ldots, S_h)} = \alpha\, P(H_i) \prod_{j=1}^{h} P(S_j \mid H_i)$$
where $P(S_1, \ldots, S_h \mid H_i) = \prod_{j=1}^{h} P(S_j \mid H_i)$ since the $S_j$ are independent of each other, $P(H_i)$ are the prior probabilities of the hypotheses, and $\alpha = [P(S_1, \ldots, S_h)]^{-1}$ is determined via the requirement $\sum_{i=1}^{n} P(H_i \mid S_1, \ldots, S_h) = 1$.
In case the evidence set $S$ contains a new evidence piece $S_{new}$ (e.g., the problem has a new symptom), updating with the new evidence piece first computes $P(H_i \mid S)$ and then $P(H_i \mid S, S_{new})$, as follows:
$$P(H_i \mid S, S_{new}) = \frac{P(H_i \mid S)\, P(S_{new} \mid S, H_i)}{P(S_{new} \mid S)} = \beta\, P(H_i \mid S)\, P(S_{new} \mid H_i)$$
where $\beta$ is determined by the same method as $\alpha$, $P(H_i \mid S)$ is computed as previously, and $P(S_{new} \mid H_i)$ is determined by the experts. The following example computes the probability of solutions for the connection failure problem:
• H1 = Checking firewall software for blocking connections
  – S1 = Desktop keeps disconnecting from the Internet
  – S2 = Desktop and laptop keep connecting from the router
  – S3 = Connection usually goes really slow
  – S4 = Connection is fine before updating the firewall software
  – S5 = Router is WHR-HP-G54 and wireless adapter is Linksys WMP54G
• H2 = Reinstalling networking components (TCP/IP)
  – S1 = Desktop completely stops connecting to the Internet
  – S2 = Laptop can connect to desktop and the Internet
  – S3 = Desktop disconnects from laptop and D-Link router with a limited connectivity
  – S6 = Desktop uses an Etherlink 10/100 PCI card and laptop uses a wireless adapter
  – S7 = Registry was damaged on desktop a few days ago
• H3 = Checking router configuration for the IP address range
  – S1 = Desktop cannot connect to a router and the Internet
  – S2 = Laptop connects to the router and the Internet
  – S4 = The firewall software is often updated on those machines
  – S8 = Desktop gets error message of address already used when renewing
The following table presents the weight values of symptoms for the solutions (note that updated weight values are not bold). This table is for demonstration; we only need weight values related to the problem's symptoms for implementation:

      S1     S2     S3     S4     S5     S6     S7     S8
H1    0.1    0.1    0.25   0.448  0.001  0      0.1    0.001
H2    0.1    0.1    0.3    0.001  0      0.048  0.45   0.001
H3    0.25   0.245  0      0.005  0      0.05   0      0.45
In order to fulfill the requirement of exhaustive and mutually exclusive hypotheses, we examine a set of hypotheses H1, H2 and H3, and ignore other hypotheses; i.e., the conditional probability of other hypotheses is 0. This, however, can be a problem in practice if the examined set of hypotheses does not contain the desired hypothesis. We also treat the evidence pieces S1, ..., S8, obtained by distinct probes, as independent because the effect of a probe on other evidence pieces is minor, and evidence pieces can only be correlated if probes are not distinct; e.g., if a connection failure occurs, symptoms collected by the ping and ftp probes can be correlated. The hypotheses possess approximately the same prior probabilities $P(H_i) = (0.33, 0.33, 0.34)$, and the problem contains the following symptoms:
• H = ?
  – S1 = Desktop gets connection failure
  – S2 = Other machines still connect to routers and to the Internet
  – S4 = Desktop updated the firewall software two days ago
By applying the above equations, we obtain $P(H_i \mid S_1, S_2, S_4) = (0.8719, 0.0019, 0.1261)$ with $i = 1, 2, 3$. The result indicates that the chance of firewall software blocking connections is 87.19% given the symptoms of the problem. Intuitively, a solution is deduced from an incomplete set of symptoms; solutions likely share a subset of symptoms. Bayesian computation distinguishes those solutions by the significance of symptoms in a case and the significance of symptoms among cases.
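The incremental update rule for a new evidence piece, $P(H_i \mid S, S_{new}) = \beta\, P(H_i \mid S)\, P(S_{new} \mid H_i)$, can be sketched as a repeated renormalisation. The function name `update` and the numeric values below are hypothetical and serve only to show that folding in evidence one piece at a time is equivalent to the batch product under the independence assumption.

```python
def update(belief, likelihood_new):
    """Fold one new evidence piece into the current belief.

    belief:         {hypothesis: P(H_i | S)}
    likelihood_new: {hypothesis: P(S_new | H_i)}
    """
    scores = {h: belief[h] * likelihood_new.get(h, 0.0) for h in belief}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("new evidence rules out every hypothesis")
    # beta = 1 / total renormalises the belief so that it sums to 1
    return {h: value / total for h, value in scores.items()}


# Hypothetical priors and per-symptom likelihoods, for illustration only
belief = {"H1": 0.5, "H2": 0.5}
evidence_stream = (
    {"H1": 0.2, "H2": 0.1},  # P(S1 | H_i)
    {"H1": 0.4, "H2": 0.2},  # P(S2 | H_i)
)
for likelihood in evidence_stream:
    belief = update(belief, likelihood)
```

Because each step only needs the current belief and the likelihood of the newest symptom, a reasoning component built this way would not have to recompute the full product when a user adds a symptom to a query.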
In addition, this component cooperates with the P2P component to learn the resulting cases and send the feedback messages to the peers, performing the retention operation of the CBR approach.
Algorithm 2 simply reflects the two processes as illustrated above. It iterates over a set of cases $C$, where each case contains a set of symptoms $C_r$ and a solution $H_r$ (line 3), finds a set of common symptoms $S$ (line 4), and accumulates weight values (lines 5-8) before creating a set of ranked cases $R$ (line 11). Note that $V$ is a list of probability values of solutions. It then iterates over hypotheses $R$ (line 12), where each hypothesis is a solution, initializes the prior probability $V_r$ with a value $1/n$, where $n$ is the number of solutions (line 13), and computes and normalizes the posterior probabilities (lines 14-17) before generating a set of final cases $F$ (line 19).

Algorithm 2 Case Ranking and Selection
Input:  C: a set of cases (C_r, H_r), H_r: solution
        n: the number of solutions
        V: a list of probability values V_r
        C_r, w_r: sets of symptoms & weight values (case)
        C_p, w_p: sets of symptoms & weight values (problem)
        S: a set of symptoms; R: a set of ranked cases
Output: a set of final cases F
 3  for each C_r ∈ C do
 4      S = C_r ∩ C_p
 5      if S ≠ ∅ then
 7          for each S_i ∈ S do
 8              T_r = T_r + w_ri · w_pi
 9          R ← R ∪ {(C_r, H_r)}
11  sort(R, T_r)
12  for each H_r ∈ R do
13      V_r = 1/n
14      for each S_i ∈ C_p do
15          V_r = V_r · w_pi
16  for each V_r ∈ V & C_r ∈ R do
17      V_r = V_r · V^{-1}
18      F ← F ∪ {(C_r, H_r)}
19  sort(F, V_r)
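The two processes of Algorithm 2 might be sketched in Python as follows. This is an illustrative reading rather than the authors' code: the function name `rank_and_select`, the data layout, and the sample cases are hypothetical. The sketch also assumes that the case's own weight for a symptom stands in for the likelihood $P(S_i \mid H_r)$, as in the worked example, and that $n$ equals the number of ranked cases.

```python
def rank_and_select(cases, problem):
    """Rank cases by shared-symptom weight, then score them with Bayes' rule.

    cases:   list of (weights, solution); weights maps symptom -> w_r
    problem: dict mapping symptom -> w_p
    Returns [(solution, probability)] sorted by decreasing probability.
    """
    # Ranking process (lines 3-11): keep cases sharing symptoms with the problem
    ranked = []
    for weights, solution in cases:
        common = weights.keys() & problem.keys()
        if common:
            t = sum(weights[s] * problem[s] for s in common)
            ranked.append((t, weights, solution))
    ranked.sort(key=lambda item: item[0], reverse=True)
    if not ranked:
        return []

    # Selection process (lines 12-19): uniform prior times symptom likelihoods
    n = len(ranked)  # assumption: one candidate solution per ranked case
    scored = []
    for _, weights, solution in ranked:
        v = 1.0 / n
        for s in problem:
            # assumption: the case weight stands in for P(S_i | H_r)
            v *= weights.get(s, 0.0)
        scored.append((solution, v))
    total = sum(v for _, v in scored) or 1.0  # avoid dividing by zero
    final = [(solution, v / total) for solution, v in scored]
    final.sort(key=lambda item: item[1], reverse=True)
    return final


# Hypothetical cases and problem, for illustration only
cases = [
    ({"S1": 0.5, "S2": 0.5}, "check firewall"),
    ({"S1": 0.25, "S2": 0.25, "S3": 0.5}, "reinstall TCP/IP"),
    ({"S9": 1.0}, "replace cable"),
]
problem = {"S1": 1.0, "S2": 1.0}
final = rank_and_select(cases, problem)
```

In this example the third case shares no symptom with the problem and is dropped during ranking, so only two solutions reach the selection process.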
D Computation Component
This component supports several assessment operations for retrieving cases. The first operation uses the weighted average function to measure the field-value vectors and to obtain cases. Field-value pairs represent both pre-defined and user-defined features. The component directly assigns weight values to the pre-defined features, while the users indirectly assign weight values to the user-defined features. The component can also assign average weight values to the user-defined features. This operation finally selects a few thousand cases from a hundred thousand cases in the database based on the measurement results.

The second operation uses the Latent Semantic Indexing (LSI) method [44] to index the selected cases based on their textual descriptions, i.e., using the term vectors to generate the real-value vectors. The indexing operation is time consuming due to computing the singular value decomposition (SVD) of large matrices, and therefore runs in a separate process. Various SVD algorithms are implemented to achieve both precision and performance. This operation finally uses the cosine function to compare the real-value vectors of the selected cases and the query, and selects a set of a few hundred cases.

The third operation again uses the weighted average function to aggregate values from the above two operations and provides the resulting set of a few dozen cases with high relevance. The component and algorithms are implemented in C with the support of the svdlibc [46] library for matrix computation to achieve high performance. This library contains an implementation of the single-vector Lanczos algorithm.

Fig. 5. Unified bug data model represented as a UML class diagram.
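The three assessment operations of the computation component form a funnel that can be outlined in a few lines of Python. This is a simplified sketch and not the C implementation described in the text: the SVD projection of LSI is omitted, so the second stage applies the cosine function to plain term-frequency vectors as a stand-in. The names `field_score`, `cosine`, `retrieve`, the cut-offs `k1`, `k2`, `k3`, and the mixing factor `mix` are all hypothetical.

```python
import math
from collections import Counter


def field_score(case_fields, query_fields, weights):
    """Weighted average of exact field-value matches (first operation)."""
    total = sum(weights.get(f, 0.0) for f in query_fields)
    if total == 0:
        return 0.0
    hit = sum(weights.get(f, 0.0)
              for f, v in query_fields.items() if case_fields.get(f) == v)
    return hit / total


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


def retrieve(cases, query, weights, k1, k2, k3, mix=0.5):
    """Return up to k3 (score, case id) pairs after three filtering stages."""
    q_terms = Counter(query["text"].lower().split())
    # Stage 1: keep the k1 cases with the best field-value score
    stage1 = sorted(
        ((field_score(c["fields"], query["fields"], weights), c) for c in cases),
        key=lambda pair: pair[0], reverse=True)[:k1]
    # Stage 2: keep the k2 cases whose text is closest to the query text
    stage2 = sorted(
        ((f, cosine(Counter(c["text"].lower().split()), q_terms), c)
         for f, c in stage1),
        key=lambda triple: triple[1], reverse=True)[:k2]
    # Stage 3: aggregate both scores with a weighted average
    final = sorted(((mix * f + (1 - mix) * t, c["id"]) for f, t, c in stage2),
                   reverse=True)
    return final[:k3]


# Hypothetical cases and query, for illustration only
cases = [
    {"id": "a", "fields": {"product": "router", "severity": "critical"},
     "text": "wifi connection drops after firewall update"},
    {"id": "b", "fields": {"product": "router", "severity": "minor"},
     "text": "web page layout broken"},
    {"id": "c", "fields": {"product": "printer", "severity": "critical"},
     "text": "connection drops"},
]
query = {"fields": {"product": "router", "severity": "critical"},
         "text": "connection drops after update"}
weights = {"product": 0.7, "severity": 0.3}
hits = retrieve(cases, query, weights, k1=2, k2=2, k3=1)
```

Running the cheap field-value filter first keeps the more expensive textual comparison away from most of the database, which is the motivation the text gives for the staged design.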
E Storage Component
This component exploits various fault resources to update the databases of super peers. Fault resources can be available at bug tracking systems, online support forums and archives (also known as communities or knowledge bases). These systems support web-based crawling, and some systems also support RPC-based crawling. This component contains crawlers to obtain bug reports from fault resources. Since bug reports contain various information, the crawlers provide HTML parsers to extract selective information based on the unified bug data model shown in Figure 5 in order to generate cases. The crawlers regularly check the modified time attribute of the bug reports to update the cases. In addition, the cases can also be updated by the learning operation of super peers, i.e., super peers learn the solutions of the problem through the feedback messages. The component is implemented in Python and cases are stored in the MySQL database [47]. The database schema conforms to the unified data model. The Porter stemming algorithm [48] helps to stem terms in bug contents.
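The crawling step described above can be illustrated with Python's standard `html.parser` module. The page layout (a table of `<th>`/`<td>` pairs), the class name `FieldValueParser`, and the helper `needs_update` are assumptions made for this sketch; real BTS pages differ per installation and the actual DisCaRia parsers are not shown in the text.

```python
from html.parser import HTMLParser


class FieldValueParser(HTMLParser):
    """Collect <th>field</th><td>value</td> pairs from a bug report page."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._tag = None      # th or td while inside such a cell, else None
        self._field = None    # most recent field name waiting for a value

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "th":
            self._field = text.lower()
        elif self._tag == "td" and self._field:
            self.fields[self._field] = text
            self._field = None


def needs_update(stored_modified, page_modified):
    """Re-crawl when no copy is stored or the report changed since then.

    Timestamps are compared as 'YYYY-MM-DD HH:MM' strings, which sort
    chronologically.
    """
    return stored_modified is None or page_modified > stored_modified


# Hypothetical page fragment, for illustration only
page = """<table>
<tr><th>Status</th><td>NEW</td></tr>
<tr><th>Severity</th><td>critical</td></tr>
<tr><th>Modified</th><td>2013-04-02 10:15</td></tr>
</table>"""
parser = FieldValueParser()
parser.feed(page)
case = parser.fields
stale = needs_update("2013-04-01 09:00", case["modified"])
```

The extracted dictionary would then be mapped onto the unified bug data model before being stored.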
F Web Interface
Peers only contain the P2P component that communicates with the P2P network for sharing and searching fault resources. They connect to one of the super peers to join DisCaRia. Note that super peers possess not only the P2P component but also the reasoning, computation and storage components. Since peers have limited capability, they are used to send queries to and receive results from super peers. They are also used to advertise fault resources on super peers. Implementing these peers is simple, but users are usually reluctant to download and install them. DisCaRia thus provides a web support component on super peers to enable a web interface that allows users to connect and forward queries to super peers. This web interface can accept several kinds of symptoms, error messages and textual descriptions. The web interface is built on Django [49], an open source web application framework written in Python. Django provides several facilities for developing web applications and integrates well into web servers, such as Apache HTTP Servers [50].

TABLE I. Some popular bug tracking sites (as of April 2013). A plus indicates that we were unable to get precise numbers and our numbers present a lower bound.
IV FAULT RESOURCES
Many fault resources are available on the Internet. DisCaRia aims at crawling bug reports from BTSs, archives and forums. These resources share the same purpose of reporting bugs for software and hardware components, but differ in data inclusion and presentation.
Investigating the features of several BTSs focuses on several properties that are important for obtaining data from them. It is necessary to check which functionality a BTS supports in specific installations. In order to understand the structure of the information stored in a BTS, the underlying data model is inspected and documented. Some systems provide this information in a textual format while others provide graphical representations, usually in ad-hoc notations. Some systems do not provide a clear description of the data model underlying the BTS and it is necessary to reverse engineer the data model by looking at concrete bug reports.
Bug reports are dependent on other bugs if they cannot be resolved or acted upon until the dependency itself is resolved or acted upon. Tracking any dependency relations between bug reports is thus useful because it helps to correlate bugs. This activity is usually challenging without expert and system support. Some systems allow full keyword search of their reports, while others only support searching via a set of pre-defined filters applied on the entire bug database. The former is more useful for an automated system that aims to provide keyword search capabilities itself.
Table I reports several popular BTSs, sites and numbers of bugs. With the exception of Debian BTS, which is only used for the Debian operating system, the other BTSs publish lists of known public sites that use them for bug tracking. While some BTSs provide a machine-readable web service interface to their bug data, most do not. In all systems where such an interface is supported, it is an optional feature, and because optional features require additional effort from an administrator to be set up, they are rarely available. In addition, a web service interface often provides much less data than the human-readable web interface that is most commonly used. Clearly, relying on the availability of a web service API is unrealistic. To solve this problem, crawlers [51] were created to directly use the presentational HTML-based web interface in order to get as much access to information as ordinary users. Other crawlers were built to exploit several methods provided by the Bugzilla web service interface, e.g., the XML-RPC API. The crawlers only submit a bug identifier to obtain the details of the bug from the Bugzilla server, i.e., a list of field-value pairs.
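A crawler of the second kind could look roughly like the following sketch, which uses Python's standard `xmlrpc.client` module and Bugzilla's `Bug.get` web service method. The helper names `connect` and `fetch_bug` are hypothetical, and the stub classes exist only so that the example runs without a network connection; whether a given Bugzilla site enables XML-RPC at `xmlrpc.cgi` depends on its version and configuration.

```python
import xmlrpc.client


def connect(base_url):
    """Create a proxy for a Bugzilla site that enables its XML-RPC interface."""
    return xmlrpc.client.ServerProxy(base_url.rstrip("/") + "/xmlrpc.cgi")


def fetch_bug(proxy, bug_id):
    """Submit one bug identifier and return the bug's field-value pairs."""
    reply = proxy.Bug.get({"ids": [bug_id]})
    return reply["bugs"][0]


class _StubBugService:
    """Offline stand-in that mimics the shape of a Bug.get reply."""

    def get(self, params):
        return {"bugs": [{"id": i, "status": "NEW", "severity": "normal"}
                         for i in params["ids"]]}


class _StubProxy:
    Bug = _StubBugService()


# With a reachable server one would pass connect(<site URL>) instead of the stub
details = fetch_bug(_StubProxy(), 12345)
```

Passing the proxy as a parameter keeps the fetch logic testable without contacting a live server.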
The database schemas of several BTSs and archives share several similar fields that can be classified in two main groups: (i) the administrative information associated with a bug is represented as field-value pairs, such as identity, severity, status, product, summary, among others; (ii) the description information detailing the bug and any follow-up discussion or actions is typically represented as textual attachments.
Figure 5 shows the unified bug data model in the form of a UML class diagram [52]. The central class is the Bug class. The id attribute is the URL where a bug can be retrieved, which serves as its identifier. Most of the attributes can be easily extracted from the retrieved data. The severity attribute is probably the most interesting to fill correctly, because BTSs have very different severity classifications for bugs. This unified model defines four severity values: critical, normal, minor and feature, which can be used to map the severity values of the BTSs. The status attribute only has two values: open represents what BTSs refer to as unconfirmed, new, assigned, or reopened bugs, while fixed represents what BTSs refer to as resolved, verified, and closed bugs.
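The status mapping stated above translates directly into a lookup, sketched here in Python. The severity table is a hypothetical example for Bugzilla-style severity names; the text does not specify how each BTS's severities are mapped onto the four unified values, so that part is an assumption.

```python
# Status groups as listed in the text
OPEN_STATES = {"unconfirmed", "new", "assigned", "reopened"}
FIXED_STATES = {"resolved", "verified", "closed"}

# Hypothetical mapping of Bugzilla-style severities to the four unified values
SEVERITY_MAP = {
    "blocker": "critical",
    "critical": "critical",
    "major": "critical",
    "normal": "normal",
    "minor": "minor",
    "trivial": "minor",
    "enhancement": "feature",
}


def unify_status(bts_status):
    """Map a BTS-specific status onto the unified values open or fixed."""
    status = bts_status.strip().lower()
    if status in OPEN_STATES:
        return "open"
    if status in FIXED_STATES:
        return "fixed"
    raise ValueError("unknown status: " + bts_status)


def unify_severity(bts_severity, default="normal"):
    """Map a BTS-specific severity onto critical, normal, minor or feature."""
    return SEVERITY_MAP.get(bts_severity.strip().lower(), default)
```

Falling back to a default severity, rather than raising an error, is a choice of this sketch that lets a crawler keep reports whose severity label is not recognised.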
The textual description is modeled as the Attachment class. Every attachment belongs to exactly one bug. Some BTSs provide information about the platforms affected by a bug. The Platform class represents platforms, e.g., Windows or MacOS. The Symptom class represents keywords used to describe and classify bugs. Simple classifications such as severity, status, platform, etc. can be provided by the information of bugs. Symptoms provide complicated classifications such as problem scope and problem type; e.g., a bug related to network failure can be related to connectivity, authentication service, hardware and software configuration. These classifications help to narrow down the scope of a bug and to figure out the related bugs of a bug. Symptoms contain sets of distinct keywords, typical debugging messages or diagnosing patterns.
The left part of Figure 5 models what piece of software a bug is concerned with. While some BTSs are only concerned with bugs in a specific piece of software, software in larger projects is split into components and bugs can be related to specific components. The Software and Component classes model this structure. The Debian BTS is somewhat different from the other BTSs as it is primarily used to track issues related to software "packages", that is, software components packaged for end user deployment. Since there is a large amount of meta information available for Debian software packages (dependency, maintainer and version information), we have introduced a separate Package class to represent packaged software.
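The classes named in the description of the unified bug data model can be outlined with Python dataclasses. Only the class names and the id, severity and status attributes come from the text; every other attribute name, the representation of platforms and symptoms as plain strings, and the validation in `__post_init__` are assumptions of this sketch and need not match Figure 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEVERITIES = {"critical", "normal", "minor", "feature"}
STATUSES = {"open", "fixed"}


@dataclass
class Software:
    name: str


@dataclass
class Component:
    name: str
    software: Software


@dataclass
class Package:
    # Meta information available for Debian packages
    name: str
    version: str = ""
    maintainer: str = ""
    depends: List[str] = field(default_factory=list)


@dataclass
class Attachment:
    text: str


@dataclass
class Bug:
    id: str        # URL where the bug can be retrieved
    severity: str
    status: str
    attachments: List[Attachment] = field(default_factory=list)
    platforms: List[str] = field(default_factory=list)
    symptoms: List[str] = field(default_factory=list)
    component: Optional[Component] = None
    package: Optional[Package] = None

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError("invalid severity: " + self.severity)
        if self.status not in STATUSES:
            raise ValueError("invalid status: " + self.status)


# Hypothetical bug record, for illustration only
bug = Bug(
    id="https://bugs.example.org/show_bug.cgi?id=1",
    severity="critical",
    status="open",
    attachments=[Attachment("Desktop keeps disconnecting from the Internet")],
    platforms=["Windows"],
    symptoms=["connectivity"],
    component=Component("networking", Software("example-os")),
)
```

Storing attachments inside the bug reflects the statement that every attachment belongs to exactly one bug.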
Fig. 6. Average memory usage for a super peer with various datasets.
V SYSTEM EVALUATION
The DisCaRia system contains several super peers with sufficient processing, storage, memory and bandwidth capabilities, and each super peer contains several components with various functionalities. Our previous studies have evaluated these components and algorithms individually, such as the feedback scheme [13] for the P2P component, the multiple vector representation method [10] for the computation component, the two-process probabilistic reasoning method [11] for the reasoning component, and the bug report crawlers [14] for the storage component. However, it is necessary to evaluate the performance and efficiency of the complete DisCaRia system. The responsiveness metric assesses how fast the system responds to queries. This metric also depends on various factors including processing power, bandwidth and the retrieval algorithm. The quality metric measures how precisely the system answers problem queries. This metric also depends on various factors including the size of the database and the reasoning algorithm.
The database has been populated by crawling various archives, forums and BTSs such as Mozilla Bugzilla, Red Hat Bugzilla, Launchpad Ubuntu, etc. These software bug datasets contain different numbers of bug reports. A super peer with a normal hardware configuration (Intel Pentium 4 3.6 GHz, 1 GB RAM) can generally accommodate more than 100,000 bug reports. Figure 6 shows that the average memory usage of a super peer increases slowly with the number of bug reports. The reasoning component handles the classifying and selecting processes that work with the features of cases, and the computation component maintains the indexing and filtering processes that work with the textual descriptions of cases. These two components allocate almost all the memory used by the super peer. The super peer uses approximately 350 MB to manage 100,000 bug reports. Super peers therefore can accommodate a large number of bug reports without the need for powerful computers. DisCaRia with a large number of super peers can exploit idle and shared resources to form a large knowledge database for problem resolution. Note that the size of bug reports fluctuates considerably depending on the BTS.
Evaluating the responsiveness of DisCaRia requires a distributed environment exhibiting real network latency, peer heterogeneity and churn rate. EmanicsLab [15], supported by the European Network of Excellence for the Management of Internet Technologies and Complex Services (EMANICS), is a flexible and re-usable distributed computing and storage testbed that fosters joint research activities of EMANICS partners. EmanicsLab is based on a complete installation of PlanetLab Central, the backend management infrastructure of PlanetLab [53]. The testbed contains 20 nodes located at 10 universities and research institutes in Europe. DisCaRia has been installed on 11 EmanicsLab nodes as shown in Table II, acting as super peers with a web interface running on web servers at Jacobs University.

TABLE II. EmanicsLab super peers configuration.

Fig. 7. DisCaRia deployment on EmanicsLab.
Figure 7 plots the DisCaRia peers deployed on EmanicsLab. These peers randomly connect to each other and directly update their databases from BTSs. EmanicsLab contains some degree of node heterogeneity since nodes possess varying processing, memory and bandwidth capabilities. Nodes are dedicated to run some distributed applications from the partners, thus network latency is sufficient to evaluate this system. All queries sent to DisCaRia are performed through the web interface. The queries extracted from bug reports comprise textual descriptions, symptoms or debugging messages. The queries are continuously fed to the web interface using a specific tool that connects directly to super peers. The tool also receives the queryhits from super peers for analysis. Table III describes the parameter configuration of super peers. The churn rate is set to 1% of the total number of super peers, and other parameters are the maximum numbers of connections or queryhits.

TABLE III. Parameter configuration.

Fig. 8. Average time for retrieving the different numbers of queryhits.

Fig. 9. Average time consumption per query over various sizes of the datasets.
Figure 8 plots the average time for retrieving different numbers of queryhits. The basic or advanced result contains queryhits returned from super peers running the retrieval algorithm or running the retrieval and reasoning algorithms, respectively. The limited or unlimited result contains queryhits returned from super peers connecting to a limited or unlimited number of super peers, respectively. Local super peers send queryhits within 1 second, while remote super peers send queryhits within 4 to 5 seconds. Running the retrieval and reasoning algorithms consumes more time than running the retrieval algorithm only. Increasing the number of connections causes super peers to forward queries to further super peers, thus increasing communication time.
Figure 9 plots the average time consumption of the computation component, the reasoning component and the super peer over the various sizes of the datasets. The computation time increases logarithmically as the number of bugs increases. The