SOFTWARE-BUG LOCALIZATION WITH GRAPH MINING
Frank Eichinger
Institute for Program Structures and Data Organization (IPD)
Universität Karlsruhe (TH), Germany
eichinger@ipd.uka.de
Klemens Böhm
Institute for Program Structures and Data Organization (IPD)
Universität Karlsruhe (TH), Germany
boehm@ipd.uka.de
Abstract: In the recent past, a number of frequent subgraph mining algorithms have been proposed. They allow for analyses in domains where data is naturally graph-structured. However, due to scalability problems when dealing with large graphs, the application of graph mining has been limited to only a few domains. In software engineering, debugging is an important issue. Localizing bugs automatically is a major challenge, as doing this manually is expensive. Several approaches have been investigated, some of which analyze traces of repeated program executions. These traces can be represented as call graphs. Such graphs describe the invocations of methods during an execution. This chapter is a survey of graph mining approaches for bug localization based on the analysis of dynamic call graphs. In particular, this chapter first introduces the subproblem of reducing the size of call graphs, before the different approaches to localize bugs based on such reduced graphs are discussed. Finally, we compare selected techniques experimentally and provide an outlook on future issues.
Keywords: Software Bug Localization, Program Call Graphs
© Springer Science+Business Media, LLC 2010
C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data,
Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_17
1 Introduction
Software quality is a huge concern in industry. Almost any software contains at least some minor bugs after being released. In order to avoid bugs, which incur significant costs, it is important to find and fix them before the release. In general, this results in devoting more resources to quality assurance. Software developers usually try to find and fix bugs by means of in-depth code reviews, along with testing and classical debugging. Locating bugs is considered to be the most time-consuming and challenging activity in this context [6, 20, 24, 26], where the resources available are limited. Therefore, there is a need for semi-automated techniques guiding the debugging process [34]. If a developer obtains some hints where bugs might be located, debugging becomes more efficient.
Research in the field of software reliability has been extensive, and various techniques have been developed addressing the identification of defect-prone parts of software. This interest is not limited to software-engineering research. In the machine-learning community, automated debugging is considered to be one of the ten most challenging problems for the next years [11]. So far, no bug-localization technique is perfect in the sense that it is capable of discovering any kind of bug. In this chapter, we look at a relatively new class of bug-localization techniques, the analysis of call graphs with graph-mining techniques. It can be seen as an approach orthogonal to and complementing existing techniques.
Graph mining, or more specifically frequent subgraph mining, is a relatively young discipline in data mining. As described in the other chapters of this book, there are many different techniques as well as numerous applications for graph mining. Probably the most prominent application is the analysis of chemical molecules. As the NP-complete problem of subgraph isomorphism [16] is an inherent part of frequent subgraph mining algorithms, the analysis of molecules benefits from the relatively small size of most of them. In contrast to molecular data, software-engineering artifacts are typically mapped to graphs that are much larger. Consequently, common graph-mining algorithms do not scale for these graphs. In order to make use of call graphs, which reflect the invocation structure of specific program executions, it is key to deploy a suitable call-graph-reduction technique. Such techniques help to alleviate the scalability problems to some extent and make it possible to use graph-mining algorithms in a number of cases. As we will demonstrate, such approaches work well in certain cases, but some challenges remain. Besides scalability issues that are still unsolved, some call-graph-reduction techniques lead to another challenge: they introduce edge weights representing call frequencies. As graph-mining research has concentrated on structural and categorical domains, rather than on quantitative weights, we are not aware of any algorithm specialized in mining weighted graphs. Though this chapter presents a technique to analyze graphs with weighted edges, the technique is a composition of established algorithms rather than a universal weighted graph mining algorithm. Thus, besides mining large graphs, weighted graph mining is a further challenge for graph-mining research driven by the field of software engineering.
The remainder of this chapter is structured as follows: Section 2 introduces some basic principles of call graphs, bugs, graph mining and bug localization with such graphs. Section 3 gives an overview of related work in software engineering employing data-analysis techniques. Section 4 discusses different call-graph-reduction techniques. The different bug-localization approaches are presented and compared in Section 5, and Section 6 concludes.
2 Basics of Call Graph Based Bug Localization
This section introduces the concept of dynamic call graphs in Subsection 2.1. It presents some classes of bugs in Subsection 2.2, and Subsection 2.3 explains how bug localization with call graphs works in principle. A brief overview of key aspects of graph and tree mining in the context of this chapter is given in Subsection 2.4.
2.1 Dynamic Call Graphs
Call graphs are either static or dynamic [17]. A static call graph [1] can be obtained from the source code. It represents all methods¹ of a program as nodes and all possible method invocations as edges. Dynamic call graphs are of importance in this chapter. They represent an execution of a particular program and reflect the actual invocation structure of the execution. Without any further treatment, a call graph is a rooted ordered tree. The main-method of a program usually is the root, and the methods invoked directly are its children. Figure 17.1a is an abstract example of such a call graph where the root Node 𝑎 represents the main-method.
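The following is a minimal sketch of how such a rooted ordered tree could be assembled from an execution trace. The event format (pairs of "enter"/"exit" and a method name), the `CallNode` class, and the example trace are assumptions made for illustration; they are not taken from the chapter, and the resulting tree is not claimed to match Figure 17.1a.

```python
from dataclasses import dataclass, field


@dataclass
class CallNode:
    """One method invocation in a dynamic call tree (hypothetical helper)."""
    method: str
    children: list = field(default_factory=list)


def build_call_tree(trace):
    """Build a rooted ordered tree from ("enter" | "exit", method) events."""
    root = None
    stack = []  # invocations that have been entered but not yet exited
    for event, method in trace:
        if event == "enter":
            node = CallNode(method)
            if stack:
                # Appending preserves the order in which calls happened.
                stack[-1].children.append(node)
            else:
                root = node
            stack.append(node)
        elif event == "exit":
            stack.pop()
    return root


def count_edges(node):
    """Number of invocation edges in the subtree below node."""
    return sum(1 + count_edges(child) for child in node.children)


# Assumed trace: a calls b, then a calls c, and c calls b twice.
trace = [
    ("enter", "a"),
    ("enter", "b"), ("exit", "b"),
    ("enter", "c"),
    ("enter", "b"), ("exit", "b"),
    ("enter", "b"), ("exit", "b"),
    ("exit", "c"),
    ("exit", "a"),
]
tree = build_call_tree(trace)
```

Every invocation becomes its own node here, so a method that is called repeatedly appears repeatedly in the tree.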
Unreduced call graphs typically become very large. The reason is that, in modern software development, dedicated methods typically encapsulate every single piece of functionality. These methods call each other frequently. Furthermore, iterative programming is very common, and methods calling other methods occur within loops, executed thousands of times. Therefore, the execution of even a small program lasting some seconds often results in call graphs consisting of millions of edges.
The size of call graphs prohibits straightforward mining with state-of-the-art graph-mining algorithms. Hence, a reduction of the graphs is necessary that compresses them significantly but keeps the essential properties of an individual execution. Section 4 describes different reduction techniques.
¹ In this chapter, we use method interchangeably with function.
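To get a feeling for the size problem, the sketch below collapses repeated invocations into frequency-weighted edges. This is only one simple possibility, assumed here for illustration; the reduction techniques actually surveyed are the subject of Section 4. The call list and the function name `reduce_to_weighted_edges` are hypothetical.

```python
from collections import Counter


def reduce_to_weighted_edges(calls):
    """Collapse repeated (caller, callee) invocations into weighted edges.

    The weight of an edge is the number of times the call was observed.
    """
    return Counter(calls)


# Assumed execution: a calls b and c once; a loop in c invokes b 1000 times.
calls = [("a", "b"), ("a", "c")] + [("c", "b")] * 1000

reduced = reduce_to_weighted_edges(calls)
print(len(calls), "invocations reduced to", len(reduced), "weighted edges")
```

Note that this crude reduction merges all invocations of a method into one node and therefore loses the tree structure of the unreduced graph; it keeps only the call frequencies.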
2.2 Bugs in Software
In the software-engineering literature, there are a number of different definitions of bugs, defects, errors, failures, faults and the like. For the purpose of this chapter, we do not differentiate between them. It is enough to know that a bug in a program execution manifests itself by producing results other than those specified or by leading to some unexpected runtime behavior such as crashes or non-terminating runs. In the following, we introduce some types of bugs which are particularly interesting in the context of call graph based bug localization.
Figure 17.1. An unreduced call graph (a), a call graph with a structure affecting bug (b), and a call graph with a frequency affecting bug (c).
Crashing and non-crashing bugs: Crashing bugs lead to an unexpected termination of the program. Prominent examples include null pointer exceptions and divisions by zero. In many cases, e.g., depending on the programming language, such bugs are not hard to find: a stack trace is usually shown which gives hints where the bug occurred. Harder to cope with are non-crashing bugs, i.e., failures which lead to faulty results without any hint that something went wrong during the execution. As non-crashing bugs are hard to find, all approaches to discover bugs with call-graph mining focus on them and leave aside crashing bugs.
Occasional and non-occasional bugs: Occasional bugs are bugs which occur with some but not with all input data. Finding occasional bugs is particularly difficult, as they are harder to reproduce, and more test cases are necessary for debugging. Furthermore, they occur more frequently, as non-occasional bugs are usually detected early, and occasional bugs might only be found by means of extensive testing. As all bug-localization techniques presented in this chapter rely on comparing call graphs of failing and correct program executions, they deal with occasional bugs only. In other words, besides examples of failing program executions, there needs to be a certain number of correct executions.
Structure and call frequency affecting bugs: This distinction is particularly useful when designing call graph based bug-localization techniques. Structure affecting bugs are bugs resulting in different shapes of the call graph where some parts are missing or occur additionally in faulty executions. An example is presented in Figure 17.1b, where Node 𝑏 called from 𝑎 is missing, compared to the original graph in Figure 17.1a. In this example, a faulty if-condition in Node 𝑎 could have caused the bug. In contrast, call frequency affecting bugs are bugs which lead to a change in the number of calls of a certain subtree in faulty executions, rather than to completely missing or new substructures. In the example in Figure 17.1c, a faulty loop condition or a faulty if-condition inside a loop in Method 𝑐 are typical causes for the increased number of calls of Method 𝑏.
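The difference between the last two kinds of bugs can be made concrete with a small sketch. It assumes call graphs that have already been reduced to a mapping from (caller, callee) edges to call counts; this representation, the function `compare_call_graphs`, and the three example runs are hypothetical. Comparing just two runs is a simplification: the approaches surveyed in this chapter analyze many executions.

```python
def compare_call_graphs(correct, failing):
    """Compare two call graphs given as {(caller, callee): call_count}."""
    correct_edges, failing_edges = set(correct), set(failing)
    return {
        # Edges present in only one of the runs: structural differences.
        "missing": correct_edges - failing_edges,
        "additional": failing_edges - correct_edges,
        # Edges present in both runs but called a different number of times.
        "changed_frequency": {
            edge: (correct[edge], failing[edge])
            for edge in correct_edges & failing_edges
            if correct[edge] != failing[edge]
        },
    }


# Assumed runs, loosely following the description of Figure 17.1.
correct_run = {("a", "b"): 1, ("a", "c"): 1, ("c", "b"): 2}
structure_bug_run = {("a", "c"): 1, ("c", "b"): 2}               # a no longer calls b
frequency_bug_run = {("a", "b"): 1, ("a", "c"): 1, ("c", "b"): 7}  # c calls b too often

structure_diff = compare_call_graphs(correct_run, structure_bug_run)
frequency_diff = compare_call_graphs(correct_run, frequency_bug_run)
```

In this sketch a structure affecting bug shows up as a missing or additional edge, whereas a call frequency affecting bug leaves the edge set unchanged and only alters a count.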
Like probably any bug-localization technique, call graph based bug localization is certainly not able to find all kinds of software bugs. For example, it is possible that a bug does not affect the call graph at all: if some mathematical expression calculates faulty results, this does not necessarily affect subsequent method calls, and call graph mining cannot detect this. Therefore, call graph based bug localization should be seen as a technique which complements other techniques, such as the ones we will describe in Section 3. In this chapter, we concentrate on deterministic bugs of single-threaded programs and leave aside bugs which are specific to other situations, such as non-deterministic or multi-threaded executions. However, the techniques described in the following might locate such bugs as well.
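A small sketch of such an invisible bug, under the assumption that calls are recorded by a simple decorator (a stand-in for real tracing). All names are hypothetical. The faulty variant uses the wrong operator, yet it invokes exactly the same methods in the same order as the correct one.

```python
calls = []  # names of traced methods, in invocation order


def traced(func):
    """Record each invocation of func by name."""
    def wrapper(*args):
        calls.append(func.__name__)
        return func(*args)
    return wrapper


@traced
def scale(x):
    return 2 * x


def total_correct(values):
    return sum(scale(v) + 1 for v in values)


def total_faulty(values):
    return sum(scale(v) - 1 for v in values)  # bug: "-" instead of "+"


def run(program, values):
    """Execute program and return its result together with the recorded calls."""
    calls.clear()
    result = program(values)
    return result, list(calls)


ok_result, ok_calls = run(total_correct, [1, 2, 3])
bad_result, bad_calls = run(total_faulty, [1, 2, 3])
```

The two results differ, but the recorded calls are identical, so no analysis that looks only at the call graph can tell the two executions apart.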
2.3 Bug Localization with Call Graphs
So far, several approaches have been proposed to localize bugs by means of call-graph mining [9, 13, 14, 25]. We will present them in detail in the following sections. In a nutshell, the approaches consist of three steps:
1. Deduction of call graphs from program executions, assignment of labels correct or failing.
2. Reduction of call graphs.
3. Mining of call graphs, analysis of the resulting frequent subgraphs.
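The three steps can be sketched as a small pipeline. This is a rough illustration under strong assumptions, not the method of any of the cited approaches: the reduction simply keeps each distinct edge once, and counting per-edge support in failing versus correct runs stands in for frequent subgraph mining. All function names and the example executions are hypothetical.

```python
from collections import Counter


def derive_call_graph(trace):
    """Step 1: take the (caller, callee) events of one run as its call graph."""
    return list(trace)


def reduce_call_graph(call_graph):
    """Step 2: keep each distinct edge once (a deliberately crude reduction)."""
    return frozenset(call_graph)


def edge_support(graphs):
    """Step 3 stand-in: fraction of the given graphs that contain each edge."""
    counts = Counter(edge for graph in graphs for edge in graph)
    return {edge: n / len(graphs) for edge, n in counts.items()}


def rank_suspicious_edges(executions):
    """Rank edges by how much their support differs between the two classes.

    executions: list of (trace, label) with label "correct" or "failing".
    """
    reduced = {"correct": [], "failing": []}
    for trace, label in executions:
        reduced[label].append(reduce_call_graph(derive_call_graph(trace)))
    ok = edge_support(reduced["correct"])
    bad = edge_support(reduced["failing"])
    scores = {
        edge: abs(bad.get(edge, 0.0) - ok.get(edge, 0.0))
        for edge in set(ok) | set(bad)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Assumed labelled executions: in failing runs, a never calls b.
executions = [
    ([("a", "b"), ("a", "c"), ("c", "b")], "correct"),
    ([("a", "b"), ("a", "c"), ("c", "b")], "correct"),
    ([("a", "c"), ("c", "b")], "failing"),
    ([("a", "c"), ("c", "b")], "failing"),
]
ranking = rank_suspicious_edges(executions)
```

The edge whose presence best separates failing from correct executions ends up at the top of the ranking, which is the kind of hint a developer would then follow up manually.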
Step 1: Deriving call graphs is relatively simple. They can be obtained by tracing program executions while testing, which is assumed to be done anyway. Furthermore, a classification of program executions as correct or failing is