

SOFTWARE-BUG LOCALIZATION WITH GRAPH MINING

Frank Eichinger

Institute for Program Structures and Data Organization (IPD)

Universität Karlsruhe (TH), Germany

eichinger@ipd.uka.de

Klemens Böhm

Institute for Program Structures and Data Organization (IPD)

Universität Karlsruhe (TH), Germany

boehm@ipd.uka.de

Abstract: In the recent past, a number of frequent subgraph mining algorithms have been proposed. They allow for analyses in domains where data is naturally graph-structured. However, owing to scalability problems when dealing with large graphs, the application of graph mining has been limited to only a few domains.

In software engineering, debugging is an important issue. Localizing bugs automatically is a challenging but valuable goal, because localizing them manually is expensive. Several approaches have been investigated, some of which analyze traces of repeated program executions. These traces can be represented as call graphs. Such graphs describe the invocations of methods during an execution. This chapter is a survey of graph mining approaches for bug localization based on the analysis of dynamic call graphs. In particular, this chapter first introduces the subproblem of reducing the size of call graphs, before the different approaches to localize bugs based on such reduced graphs are discussed. Finally, we compare selected techniques experimentally and provide an outlook on future issues.

Keywords: Software Bug Localization, Program Call Graphs

C. C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_17


1 Introduction

Software quality is a huge concern in industry. Almost any software contains at least some minor bugs after being released. In order to avoid bugs, which incur significant costs, it is important to find and fix them before the release. In general, this results in devoting more resources to quality assurance. Software developers usually try to find and fix bugs by means of in-depth code reviews, along with testing and classical debugging. Locating bugs is considered to be the most time-consuming and challenging activity in this context [6, 20, 24, 26], where the resources available are limited. Therefore, there is a need for semi-automated techniques guiding the debugging process [34]. If a developer obtains some hints about where bugs might be located, debugging becomes more efficient.

Research in the field of software reliability has been extensive, and various techniques have been developed addressing the identification of defect-prone parts of software. This interest is not limited to software-engineering research. In the machine-learning community, automated debugging is considered to be one of the ten most challenging problems for the next years [11]. So far, no bug-localization technique is perfect in the sense that it is capable of discovering any kind of bug. In this chapter, we look at a relatively new class of bug-localization techniques, the analysis of call graphs with graph-mining techniques. It can be seen as an approach orthogonal to and complementing existing techniques.

Graph mining, or more specifically frequent subgraph mining, is a relatively young discipline in data mining. As described in the other chapters of this book, there are many different techniques as well as numerous applications for graph mining. Probably the most prominent application is the analysis of chemical molecules. As the NP-complete problem of subgraph isomorphism [16] is an inherent part of frequent subgraph mining algorithms, the analysis of molecules benefits from the relatively small size of most of them. Compared to the analysis of molecular data, software-engineering artifacts are typically mapped to graphs that are much larger. Consequently, common graph-mining algorithms do not scale for these graphs. In order to make use of call graphs, which reflect the invocation structure of specific program executions, it is key to deploy a suitable call-graph-reduction technique. Such techniques help to alleviate the scalability problems to some extent and make it possible to use graph-mining algorithms in a number of cases. As we will demonstrate, such approaches work well in certain cases, but some challenges remain. Besides scalability issues that are still unsolved, some call-graph-reduction techniques lead to another challenge: they introduce edge weights representing call frequencies. As graph-mining research has concentrated on structural and categorical domains, rather than on quantitative weights, we are not aware of any algorithm specialized in mining weighted graphs. Though this chapter presents a technique to analyze graphs with weighted edges, the technique is a composition of established algorithms rather than a universal weighted graph mining algorithm. Thus, besides mining large graphs, weighted graph mining is a further challenge for graph-mining research driven by the field of software engineering.

The remainder of this chapter is structured as follows: Section 2 introduces some basic principles of call graphs, bugs, graph mining and bug localization with such graphs. Section 3 gives an overview of related work in software engineering employing data-analysis techniques. Section 4 discusses different call-graph-reduction techniques. The different bug-localization approaches are presented and compared in Section 5, and Section 6 concludes.

2 Basics of Call Graph Based Bug Localization

This section introduces the concept of dynamic call graphs in Subsection 2.1. It presents some classes of bugs in Subsection 2.2, and Subsection 2.3 explains how bug localization with call graphs works in principle. A brief overview of key aspects of graph and tree mining in the context of this chapter is given in Subsection 2.4.

2.1 Dynamic Call Graphs

Call graphs are either static or dynamic [17]. A static call graph [1] can be obtained from the source code. It represents all methods¹ of a program as nodes and all possible method invocations as edges. Dynamic call graphs are of importance in this chapter. They represent an execution of a particular program and reflect the actual invocation structure of the execution. Without any further treatment, a call graph is a rooted ordered tree. The main-method of a program usually is the root, and the methods invoked directly are its children. Figure 17.1a is an abstract example of such a call graph, where the root Node 𝑎 represents the main-method.
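The rooted ordered tree described above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the trace format (a flat list of `"enter"`/`"exit"` events) and the names `CallNode`, `build_call_tree` and `count_edges` are hypothetical, and the example trace is not the exact graph of Figure 17.1a.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CallNode:
    """One method invocation; children are kept in invocation order."""
    method: str
    children: List["CallNode"] = field(default_factory=list)


def build_call_tree(events: List[Tuple[str, str]]) -> Optional[CallNode]:
    """Build a rooted ordered call tree from ("enter"|"exit", method) events."""
    root = None
    stack: List[CallNode] = []  # currently active invocations
    for kind, method in events:
        if kind == "enter":
            node = CallNode(method)
            if stack:
                stack[-1].children.append(node)  # callee of the active method
            else:
                root = node  # first invocation, e.g. the main-method
            stack.append(node)
        elif kind == "exit":
            stack.pop()
    return root


def count_edges(node: CallNode) -> int:
    """Every invocation below the root contributes exactly one edge."""
    return sum(1 + count_edges(child) for child in node.children)


# Hypothetical trace: a calls b, then c; c calls b twice.
trace = [
    ("enter", "a"),
    ("enter", "b"), ("exit", "b"),
    ("enter", "c"),
    ("enter", "b"), ("exit", "b"),
    ("enter", "b"), ("exit", "b"),
    ("exit", "c"),
    ("exit", "a"),
]
tree = build_call_tree(trace)
print(tree.method, [child.method for child in tree.children], count_edges(tree))
```

Because every single invocation becomes its own node, a call inside a loop adds one edge per iteration, which is why such trees grow so quickly.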

Unreduced call graphs typically become very large. The reason is that, in modern software development, dedicated methods typically encapsulate every single functionality. These methods call each other frequently. Furthermore, iterative programming is very common, and methods calling other methods occur within loops, executed thousands of times. Therefore, the execution of even a small program lasting some seconds often results in call graphs consisting of millions of edges.

The size of call graphs prohibits a straightforward mining with state-of-the-art graph-mining algorithms. Hence, a reduction of the graphs which compresses the graphs significantly but keeps the essential properties of an individual execution is necessary. Section 4 describes different reduction techniques.

¹ In this chapter, we use method interchangeably with function.
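To give a first impression of what a reduction can look like, the following minimal sketch collapses all invocations between the same pair of methods into a single edge whose weight is the call frequency. This is an illustrative assumption and not necessarily one of the techniques surveyed in Section 4; the input format (a list of caller/callee pairs) and the name `reduce_calls` are hypothetical.

```python
from collections import Counter


def reduce_calls(calls):
    """Collapse repeated (caller, callee) invocations into weighted edges."""
    return Counter(calls)


# Hypothetical execution: a calls c once; a loop in c calls b 1000 times.
calls = [("a", "c")] + [("c", "b")] * 1000
reduced = reduce_calls(calls)

print(len(calls), "invocations ->", len(reduced), "weighted edges")
print("weight of c -> b:", reduced[("c", "b")])
```

The trade-off of this particular sketch is that the order of calls and the tree structure are lost; only the frequencies survive as edge weights.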

2.2 Bugs in Software

In the software-engineering literature, there are a number of different definitions of bugs, defects, errors, failures, faults and the like. For the purpose of this chapter, we do not differentiate between them. It is enough to know that a bug in a program execution manifests itself by producing results other than those specified or by leading to some unexpected runtime behavior such as crashes or non-terminating runs. In the following, we introduce some types of bugs which are particularly interesting in the context of call graph based bug localization.

Figure 17.1. (a) An unreduced call graph, (b) a call graph with a structure affecting bug, and (c) a call graph with a frequency affecting bug.

Crashing and non-crashing bugs: Crashing bugs lead to an unexpected termination of the program. Prominent examples include null pointer exceptions and divisions by zero. In many cases, e.g., depending on the programming language, such bugs are not hard to find: a stack trace is usually shown which gives hints where the bug occurred. Harder to cope with are non-crashing bugs, i.e., failures which lead to faulty results without any hint that something went wrong during the execution. As non-crashing bugs are hard to find, all approaches to discover bugs with call-graph mining focus on them and leave aside crashing bugs.

Occasional and non-occasional bugs: Occasional bugs are bugs which occur with some but not with all input data. Finding occasional bugs is particularly difficult, as they are harder to reproduce, and more test cases are necessary for debugging. Furthermore, they occur more frequently, as non-occasional bugs are usually detected early, and occasional bugs might only be found by means of extensive testing. As all bug-localization techniques presented in this chapter rely on comparing call graphs of failing and correct program executions, they deal with occasional bugs only. In other words, besides examples of failing program executions, there needs to be a certain number of correct executions.

Structure and call frequency affecting bugs: This distinction is particularly useful when designing call graph based bug-localization techniques. Structure affecting bugs are bugs resulting in different shapes of the call graph where some parts are missing or occur additionally in faulty executions. An example is presented in Figure 17.1b, where Node 𝑏 called from 𝑎 is missing, compared to the original graph in Figure 17.1a. In this example, a faulty if-condition in Node 𝑎 could have caused the bug. In contrast, call frequency affecting bugs are bugs which lead to a change in the number of calls of a certain subtree in faulty executions, rather than to completely missing or new substructures. In the example in Figure 17.1c, a faulty loop condition or a faulty if-condition inside a loop in Method 𝑐 are typical causes for the increased number of calls of Method 𝑏.
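The difference between structure affecting and call frequency affecting bugs can be made concrete with a small sketch. It assumes that each execution has already been summarized as a mapping from (caller, callee) edges to call counts; the function name `diff_call_edges` and all counts are hypothetical and only loosely mirror Figure 17.1. Comparing a single correct run with a single failing run is a simplification, since the approaches in this chapter work on many executions.

```python
def diff_call_edges(correct, failing):
    """Split differing edges into structural and frequency differences.

    Both arguments map (caller, callee) -> number of calls.
    Returned dicts map each differing edge to (count_correct, count_failing).
    """
    structural, frequency = {}, {}
    for edge in set(correct) | set(failing):
        c, f = correct.get(edge, 0), failing.get(edge, 0)
        if c == f:
            continue  # identical behavior on this edge
        if c == 0 or f == 0:
            structural[edge] = (c, f)  # edge missing or additional
        else:
            frequency[edge] = (c, f)  # edge present in both, count changed
    return structural, frequency


correct_run = {("a", "b"): 1, ("a", "c"): 1, ("c", "b"): 2}

# In the spirit of Figure 17.1b: the call from a to b is missing.
structure_bug = {("a", "c"): 1, ("c", "b"): 2}
# In the spirit of Figure 17.1c: c calls b more often than in the correct run.
frequency_bug = {("a", "b"): 1, ("a", "c"): 1, ("c", "b"): 5}

print(diff_call_edges(correct_run, structure_bug))
print(diff_call_edges(correct_run, frequency_bug))
```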

As is probably true of any bug-localization technique, call graph based bug localization is certainly not able to find all kinds of software bugs. For example, it is possible that bugs do not affect the call graph at all. For instance, if some mathematical expression calculates faulty results, this does not necessarily affect subsequent method calls, and call graph mining cannot detect this. Therefore, call graph based bug localization should be seen as a technique which complements other techniques, such as the ones we will describe in Section 3. In this chapter we concentrate on deterministic bugs of single-threaded programs and leave aside bugs which are specific to non-deterministic or multi-threaded programs. However, the techniques described in the following might locate such bugs as well.

2.3 Bug Localization with Call Graphs

So far, several approaches have been proposed to localize bugs by means of call-graph mining [9, 13, 14, 25]. We will present them in detail in the following sections. In a nutshell, the approaches consist of three steps:

1. Deduction of call graphs from program executions, and assignment of the labels correct or failing.

2. Reduction of call graphs.

3. Mining of call graphs, and analysis of the resulting frequent subgraphs.
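The three steps above can be sketched end to end under strong simplifications. In this sketch, single edges stand in for frequent subgraphs, the reduction merely keeps the set of distinct edges, and edges are ranked by the difference of their support in failing and correct executions. None of this is the actual method of the cited approaches; the names `reduce_execution`, `support` and `rank_edges` and the example data are hypothetical.

```python
from collections import Counter


def reduce_execution(calls):
    """Step 2 (simplified): keep only the distinct caller/callee edges."""
    return frozenset(calls)


def support(graphs):
    """Fraction of graphs that contain each edge."""
    counts = Counter(edge for graph in graphs for edge in graph)
    return {edge: n / len(graphs) for edge, n in counts.items()}


def rank_edges(executions):
    """Step 3 (simplified): rank edges by support difference between classes."""
    failing = [reduce_execution(c) for c, label in executions if label == "failing"]
    correct = [reduce_execution(c) for c, label in executions if label == "correct"]
    if not failing or not correct:
        raise ValueError("need both failing and correct executions")
    sup_f, sup_c = support(failing), support(correct)
    scores = {
        edge: sup_f.get(edge, 0.0) - sup_c.get(edge, 0.0)
        for edge in set(sup_f) | set(sup_c)
    }
    # Largest absolute difference first; ties are broken by edge name.
    return sorted(scores.items(), key=lambda item: (-abs(item[1]), item[0]))


# Step 1 (assumed to be done already): traced calls with labels.
executions = [
    ([("a", "b"), ("a", "c"), ("c", "b"), ("c", "b")], "correct"),
    ([("a", "b"), ("a", "c"), ("c", "b")], "correct"),
    ([("a", "c"), ("c", "b"), ("c", "b")], "failing"),
    ([("a", "c"), ("c", "b")], "failing"),
]
ranking = rank_edges(executions)
for edge, score in ranking:
    print(edge, score)
```

A real frequent subgraph miner would enumerate larger substructures than single edges, and this set-based reduction would miss call frequency affecting bugs entirely, which is one reason why the choice of reduction technique matters.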

Step 1: Deriving call graphs is relatively simple. They can be obtained by tracing program executions while testing, which is assumed to be done anyway. Furthermore, a classification of program executions as correct or failing is
