Managing and Mining Graph Data


In concrete terms, we compare the following five alternatives:

E01m: The structural PSN-scoring approach, similar to [9] (cf. Subsection 5.1), but based on the unordered R01m unord reduction.

Esubtree: The frequency-based Pfreq-scoring approach as in [13, 14] (cf. Subsection 5.2), based on the Rsubtree reduction.

Ecomb[13]: The combined approach from [13] (cf. Subsection 5.3), based on the R01m unord and Rsubtree reductions.

Ecomb[14]: The combined approach from [14] (cf. Subsection 5.3), based on the Rsubtree reduction.

Etotal: The combined approach as in [14] (cf. Subsection 5.3), but with the Rtotal w reduction as in [25] (but with weights and without temporal edges, cf. Subsection 5.1).

We present the results of the five experiments for all fourteen bugs in Table 17.3; each result is the first position in the ranking at which the bug is found. A bug that is not discovered with the respective approach is represented by ‘25’, the total number of methods of the program. Note that with the frequency-based and the combined method rankings, there is usually additional information available about where a bug is located within a method and in the context of which subgraph it appears. The following comparisons leave this additional information aside.
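The evaluation measure described above can be sketched in a few lines of Python. This is only an illustration of the convention, not the authors' evaluation code: the function name `first_bug_position` and the method names in the example ranking are hypothetical.

```python
from typing import Sequence, Set


def first_bug_position(ranking: Sequence[str],
                       buggy_methods: Set[str],
                       total_methods: int) -> int:
    """Return the 1-based position of the first buggy method in a ranking.

    If no buggy method appears in the ranking, fall back to the total
    number of methods, mirroring the '25' convention used for bugs
    that an approach does not discover.
    """
    for position, method in enumerate(ranking, start=1):
        if method in buggy_methods:
            return position
    return total_methods


# Hypothetical ranking of suspicious methods, most suspicious first.
ranking = ["Parser.parse", "Lexer.next", "Util.trim"]

found = first_bug_position(ranking, {"Lexer.next"}, total_methods=25)
missed = first_bug_position(ranking, {"Main.run"}, total_methods=25)
```

Smaller values are better: a value of 1 means the developer would inspect the buggy method first.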

Table 17.3 Experimental results.

Structural, Frequency-Based and Combined Approaches. Comparing the results from E01m and Esubtree, the frequency-based approach (Esubtree) almost always performs as well as or better than the structural one (E01m). This demonstrates that analyzing numerical call frequencies is adequate to locate bugs. Bugs 1, 9 and 13 illustrate that neither approach alone can find certain bugs. Bug 9 cannot be found by comparing call frequencies (Esubtree). This is because Bug 9 is a modified condition which always leads to the invocation of a certain method. In consequence, the call frequency is always the same. Bugs 1 and 13 are not found with the purely structural approach (E01m). Both are typical bugs that affect call frequencies: Bug 1 is in an if-condition inside a loop and leads to more invocations of a certain method. In Bug 13, a modified for-condition slightly changes the call frequency of a method inside the loop. With the R01m unord reduction technique used in E01m, Bugs 2 and 13 have the same graph structure in both correct and failing executions. Thus, it is difficult or even impossible to identify structural differences.
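The following Python sketch illustrates why a bug that only changes a call frequency can be invisible to a purely structural comparison. It rests on a simplifying assumption: the zero-one-many reduction is modelled here as capping the number of repeated calls per caller/callee pair at two, and an edge-weighted reduction as keeping the exact counts. The method names and counts are hypothetical.

```python
def zero_one_many(call_counts, cap=2):
    """Assumed simplification: keep at most `cap` repetitions of each call edge."""
    return {edge: min(count, cap) for edge, count in call_counts.items()}


def edge_weighted(call_counts):
    """Keep each call edge once, annotated with its exact call frequency."""
    return dict(call_counts)


# Hypothetical traces: a modified loop condition changes how often
# `helper` is called from inside the loop (3 times vs. 5 times).
correct_run = {("main", "loop"): 1, ("loop", "helper"): 3}
failing_run = {("main", "loop"): 1, ("loop", "helper"): 5}

# After capping, both executions yield the same reduced graph ...
same_structure = zero_one_many(correct_run) == zero_one_many(failing_run)

# ... while the edge weights still distinguish them.
weights_differ = edge_weighted(correct_run) != edge_weighted(failing_run)
```

Under these assumptions, `same_structure` and `weights_differ` are both true, which matches the observation that such bugs need frequency information to be located.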

The combined approaches in Ecomb[13] and Ecomb[14] are intended to take structural information into account as well, in order to improve the results from Esubtree. We do achieve this goal: when comparing Esubtree and Ecomb[14], we retain the already good results from Esubtree in nine cases and improve them in five. When looking at the two combination strategies, it is hard to say which one is better. Ecomb[13] turns out to be better in four cases, while Ecomb[14] is better in six. Thus, the technique in Ecomb[14] is slightly better, but not for every bug. Furthermore, the technique in Ecomb[13] is less efficient, as it requires two graph-mining runs.

Reduction Techniques. Looking at the call-graph-reduction techniques, the results from the experiments discussed so far reveal that the subtree-reduction technique with edge weights (Rsubtree), used in Esubtree as well as in both combined approaches, is superior to the zero-one-many reduction (R01m unord). Besides the increased precision of the localization techniques based on the reduction, Rsubtree also produces smaller graphs than R01m unord (cf. Subsection 4.5).

Etotal evaluates the total reduction technique. We use Rtotal w as an instance of the total reduction family. The rationale is that this one can be used with Ecomb[14]. In most cases, the total reduction (Etotal) performs worse than the subtree reduction (Ecomb[14]). This confirms that the subtree-reduction technique is reasonable, and that it is worthwhile to keep more structural information than the total reduction does. However, in cases where the subtree reduction produces graphs which are too large for efficient mining, and the total reduction produces sufficiently small graphs, Rtotal w can be an alternative to Rsubtree.

Temporal Order. The experimental results listed in Table 17.3 do not shed any light on the influence of the temporal order. When applied to the buggy programs used in our comparisons, the total reduction with temporal edges (Rtotal tmp) produces graphs of a size which cannot be mined in a reasonable time. This already shows that representing the temporal order with additional edges might lead to graphs whose size is no longer manageable.

In preliminary experiments of ours, we have repeated E01m with the R01m ord reduction and the FREQT [2] rooted ordered tree miner in order to evaluate the usefulness of the temporal order. Although we systematically varied the different mining parameters, the results of these experiments are in general not better than those in E01m. For only two of the 14 bugs has the temporal-aware approach performed better than E01m; in the other cases it has performed worse. In a comparison with the Rsubtree reduction and the gSpan algorithm [32], the R01m ord reduction with the ordered tree miner displayed a significantly increased runtime, by a factor of 4.8 on average (see footnote 4). Therefore, our preliminary result is that the incorporation of the temporal order does not increase the precision of bug localizations. This is based on the bugs considered so far, and more comprehensive experiments would be needed for a more reliable statement.

Threats to Validity. The experiments carried out in this subsection, as well as in the respective publications [9, 13, 14, 25], illustrate the ability to locate bugs based on dynamic call graphs using graph mining techniques. From a software engineering point of view, three issues remain for further evaluation: (1) All experiments are based on artificially seeded bugs. Although these bugs mimic typical bugs as they occur in reality, a further investigation with real bugs, e.g., from a real software project, would prove the validity of the proposed techniques. (2) All experiments feature rather small programs containing the bugs. The programs rarely consist of more than one class and represent situations where bugs could also be found relatively easily by manual investigation. When solutions for the current scalability issues are found, localization techniques should be validated with larger software projects. (3) None of the techniques considered has been directly compared to other techniques such as those discussed in Section 3. Such a comparison, based on a large number of bugs, would reveal the advantages and disadvantages of the different techniques. The iBUGS project [7] provides real bug datasets from large software projects such as AspectJ. It might serve as a basis to tackle the issues mentioned.

6 Conclusions and Future Directions

This chapter has dealt with the problem of localizing software bugs, as a use case of graph mining. This localization is important as bugs are hard to detect manually. Graph mining based techniques identify structural patterns in trace data which are typical for failing executions but rare in correct ones. They serve as hints for bug localization. Respective techniques based on call graph mining first need to solve the subproblem of call graph reduction. In this chapter we have discussed both reduction techniques for dynamic call graphs and approaches analyzing such graphs. Experiments have demonstrated the usefulness of our techniques and have compared different approaches.

4 In this comparison, FREQT was restricted as in [9] to find subtrees of a maximum size of four nodes. Such a restriction was not set in gSpan. Furthermore, we expect a further significant speedup when CloseGraph [33] is used instead of gSpan.

All techniques surveyed in this chapter work well when applied to relatively small software projects. Due to the NP-hard problem of subgraph isomorphism inherent to frequent subgraph mining, none of the techniques presented is directly applicable to large projects. One future challenge is to overcome this problem, be it with more sophisticated graph-mining algorithms, e.g., scalable approximate mining or discriminative techniques, or with smarter bug-localization frameworks, e.g., different graph representations or constraint-based mining. One starting point could be the granularity of call graphs. So far, call graphs represent method invocations. One can think of smaller graphs representing interactions at a coarser level, i.e., classes or packages. [12] presents encouraging results regarding the localization of bugs based on class-level call graphs. As future research, we will investigate how to turn these results into a scalable framework for locating bugs. Such a framework would first do bug localization on a coarse level before ‘zooming in’ and investigating more detailed call graphs.
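As a rough illustration of the coarser granularity mentioned above, the following Python sketch collapses a method-level call graph into a class-level one. It is not the technique from [12]; it assumes, purely for illustration, that methods are named `Class.method` and that call counts are summed when edges are merged. All names and counts are hypothetical.

```python
from collections import defaultdict


def coarsen_to_classes(method_edges):
    """Merge method-level call edges into class-level edges.

    `method_edges` maps (caller, callee) pairs of 'Class.method' names
    to call counts; counts of merged edges are summed.
    """
    class_edges = defaultdict(int)
    for (caller, callee), count in method_edges.items():
        caller_class = caller.rsplit(".", 1)[0]
        callee_class = callee.rsplit(".", 1)[0]
        class_edges[(caller_class, callee_class)] += count
    return dict(class_edges)


# Hypothetical method-level call graph with call frequencies.
method_edges = {
    ("A.init", "A.run"): 1,
    ("A.run", "B.load"): 2,
    ("A.run", "B.parse"): 3,
    ("B.parse", "C.log"): 4,
}

class_edges = coarsen_to_classes(method_edges)
```

In this toy example four method-level edges shrink to three class-level edges; on real programs the reduction would typically be much larger, which is what makes a coarse first pass attractive before zooming in.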

Call graph reduction techniques introducing edge weights trigger another challenge for graph mining: weighted graphs. We have shown that the analysis of such weights is crucial to detect certain bugs. Graph-mining research has focused on structural issues so far, and we are not aware of any algorithm for explicit mining of weighted graphs. Besides reduced call graphs, such algorithms could mine other real-world graphs as well [3], e.g., in logistics [19] and image analysis [27].

Acknowledgments

We are indebted to Matthias Huber for his contributions. We further thank Andreas Zeller for fruitful discussions and Valentin Dallmeier for his comments on early versions of this chapter.

References

[1] F. E. Allen. Interprocedural Data Flow Analysis. In Proc. of the IFIP Congress, 1974.

[2] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient Substructure Discovery from Large Semi-structured Data. In Proc. of the 2nd SIAM Int. Conf. on Data Mining (SDM), 2002.

[3] D. Chakrabarti and C. Faloutsos. Graph Mining: Laws, Generators, and Algorithms. ACM Computing Surveys (CSUR), 38(1):2, 2006.

[4] R.-Y. Chang, A. Podgurski, and J. Yang. Discovering Neglected Conditions in Software by Mining Dependence Graphs. IEEE Transactions on Software Engineering, 34(5):579–596, 2008.

[5] Y. Chi, R. Muntz, S. Nijssen, and J. Kok. Frequent Subtree Mining – An Overview. Fundamenta Informaticae, 66(1–2):161–198, 2005.

[6] V. Dallmeier, C. Lindig, and A. Zeller. Lightweight Defect Localization for Java. In Proc. of the 19th European Conf. on Object-Oriented Programming (ECOOP), 2005.

[7] V. Dallmeier and T. Zimmermann. Extraction of Bug Localization Benchmarks from History. In Proc. of the 22nd IEEE/ACM Int. Conf. on Automated Software Engineering (ASE), 2007.

[8] I. F. Darwin. Java Cookbook. O’Reilly, 2004.

[9] G. Di Fatta, S. Leue, and E. Stegantova. Discriminative Pattern Mining in Software Fault Detection. In Proc. of the 3rd Int. Workshop on Software Quality Assurance (SOQUA), 2006.

[10] R. Diestel. Graph Theory. Springer, 2006.

[11] T. G. Dietterich, P. Domingos, L. Getoor, S. Muggleton, and P. Tadepalli. Structured Machine Learning: The Next Ten Years. Machine Learning, 73(1):3–23, 2008.

[12] F. Eichinger and K. Böhm. Towards Scalability of Graph-Mining Based Bug Localisation. In Proc. of the 7th Int. Workshop on Mining and Learning with Graphs (MLG), 2009.

[13] F. Eichinger, K. Böhm, and M. Huber. Improved Software Fault Detection with Graph Mining. In Proc. of the 6th Int. Workshop on Mining and Learning with Graphs (MLG), 2008.

[14] F. Eichinger, K. Böhm, and M. Huber. Mining Edge-Weighted Call Graphs to Localise Software Bugs. In Proc. of the European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2008.

[15] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically Discovering Likely Program Invariants to Support Program Evolution. IEEE Transactions on Software Engineering, 27(2):99–123, 2001.

[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[17] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: A Call Graph Execution Profiler. In Proc. of the ACM SIGPLAN Symposium on Compiler Construction, 1982.

[18] M. J. Harrold, R. Gupta, and M. L. Soffa. A Methodology for Controlling the Size of a Test Suite. ACM Transactions on Software Engineering and Methodology (TOSEM), 2(3):270–285, 1993.

[19] W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich. Knowledge Discovery from Transportation Network Data. In Proc. of the 21st Int. Conf. on Data Engineering (ICDE), 2005.

[20] J. A. Jones, M. J. Harrold, and J. Stasko. Visualization of Test Information to Assist Fault Localization. In Proc. of the 24th Int. Conf. on Software Engineering (ICSE), 2002.

[21] P. Knab, M. Pinzger, and A. Bernstein. Predicting Defect Densities in Source Code Files with Decision Tree Learners. In Proc. of the Int. Workshop on Mining Software Repositories (MSR), 2006.

[22] B. Korel and J. Laski. Dynamic Program Slicing. Information Processing Letters, 29(3):155–163, 1988.

[23] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug Isolation via Remote Program Sampling. ACM SIGPLAN Notices, 38(5):141–154, 2003.

[24] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff. SOBER: Statistical Model-Based Bug Localization. SIGSOFT Software Engineering Notes, 30(5):286–295, 2005.

[25] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining Behavior Graphs for “Backtrace” of Noncrashing Bugs. In Proc. of the 5th SIAM Int. Conf. on Data Mining (SDM), 2005.

[26] N. Nagappan, T. Ball, and A. Zeller. Mining Metrics to Predict Component Failures. In Proc. of the 28th Int. Conf. on Software Engineering (ICSE), 2006.

[27] S. Nowozin, K. Tsuda, T. Uno, T. Kudo, and G. Bakir. Weighted Substructure Mining for Image Analysis. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.

[28] K. J. Ottenstein and L. M. Ottenstein. The Program Dependence Graph in a Software Development Environment. SIGSOFT Software Engineering Notes, 9(3):177–184, 1984.

[29] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[30] A. Schröter, T. Zimmermann, and A. Zeller. Predicting Component Failures at Design Time. In Proc. of the 5th Int. Symposium on Empirical Software Engineering, 2006.

[31] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2005.

[32] X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proc. of the 2nd IEEE Int. Conf. on Data Mining (ICDM), 2002.

[33] X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. In Proc. of the 9th ACM Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2003.

[34] T. Zimmermann, N. Nagappan, and A. Zeller. Predicting Bugs from History. In T. Mens and S. Demeyer, editors, Software Evolution, pages 69–88. Springer, 2008.

A SURVEY OF GRAPH MINING TECHNIQUES FOR BIOLOGICAL DATASETS

S. Parthasarathy

The Ohio State University

2015 Neil Ave, DL395, Columbus, OH

srini@cse.ohio-state.edu

S. Tatikonda

tatikond@cse.ohio-state.edu

D. Ucar

ucar@cse.ohio-state.edu

Abstract

Mining structured information has been the source of much research in the data mining community over the last decade. The field of bioinformatics has emerged as an important application area in this context. Examples abound, ranging from the analysis of protein interaction networks to the analysis of phylogenetic data. In this article we survey the principal results in the field, examining them in terms of both their algorithmic contributions and their applicability in the domain in question. We conclude this article with a discussion of the key results and identify some interesting directions for future research.

Keywords: Graph Mining, Tree Mining, Biological Networks, Community Discovery

C. C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_18

1 Introduction

Advances in data collection and storage technology have led to a proliferation of structured information available to organizations and individuals. This information is often also available to the user in a myriad of formats and across multiple media. This is especially true in the vibrant field of bioinformatics, where an increasingly large number of problems are represented in structured or semi-structured format. Examples abound, ranging from protein interaction networks (graphs) to phylogenetic datasets (trees), and from XML repositories of proteomic data (trees) to regulatory networks (graphs). The size and number of such data stores are growing rapidly.

Such data may arise directly out of experimental observations (e.g., PPI network complexes from mass spectrometry) or may be a convenient abstraction for housing relational information (e.g., Protein Data Bank). As another example, mRNA measurements from microarray studies can be used to infer pairwise gene relations that imply co-expression of two genes. Regulatory relations between DNA binding proteins and genes can also be identified via various experimental technologies such as ChIP-chip, ChIP-seq, or DamID. Learning a biological network structure from experimental data that reflects the real-world relations is a challenge in itself. Where data mining, in particular graph mining, can help is in the analysis of such structured data for the discovery of useful information, such as identifying common or useful substructures and detecting anomalous or unusual structures.

In this article we survey the use of graph mining for bioinformatics problems. This topic has been heavily researched over the last decade, and we review the relevant material. We take a broad view of the term graph mining here. Since trees are simply connected acyclic graphs, we include approaches that leverage tree mining algorithms as well. Additionally, within the domain of graph mining there are approaches that focus on harvesting patterns from a single large graph or network and those that focus on extracting patterns from multiple graphs. We also cover other variants of graphs in our discussion, including different tree variants, directed graphs, and bipartite graphs.
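The remark that trees are connected acyclic graphs can be made concrete with a small check. The Python sketch below is only an illustration; the function name `is_tree` and the example graphs are hypothetical. It relies on the standard fact that an undirected graph on n nodes is a tree exactly when it is connected and has n - 1 edges.

```python
from collections import defaultdict, deque


def is_tree(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is a tree."""
    nodes = set(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False

    # Build an adjacency structure for the undirected graph.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    # Breadth-first search to test connectivity.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


# A small tree, and a triangle (which contains a cycle).
tree_like = is_tree("abcd", [("a", "b"), ("a", "c"), ("c", "d")])
triangle = is_tree("abc", [("a", "b"), ("b", "c"), ("c", "a")])
```

Any input that passes this check can be handed to a tree mining algorithm; everything else requires a general graph miner.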

The rest of this article is broadly divided into four sections. Section 2 discusses the use of tree mining algorithms for bioinformatics problems. For example, RNA secondary structures can be represented in the form of a tree. A forest of such RNA structure trees can be employed to characterize a newly sequenced novel RNA structure by identifying common topological patterns [93]. In particular, we survey the role played by frequent tree mining algorithms, tree alignment, and statistical methods in this context.

In Section 3 we discuss algorithms that target the identification of frequent sub-patterns across multiple networks. For example, a recent study [53] showed how 39 co-expression networks of Budding Yeast can be analyzed for coherent dense subgraphs across many of these networks. The discovered subgraphs were then used to predict the functionality of unknown genes. In particular, we survey the role played by frequent graph mining algorithms and motif discovery algorithms in this context.

In Section 4 we discuss approaches that mine single large biological networks for important subnetwork structures, such as densely interacting communities in PPI networks or gene co-expression networks. In particular, we discuss the role played by community discovery and graph clustering algorithms in the presence of uncertainty and noise in this context.

Finally, in Section 5 we conclude this survey with a discussion of some open problems in the field.

2 Mining Trees

Trees are widely used to represent various biological structures like glycans, RNAs, and phylogenies.

Glycans are carbohydrate sugar chains attached to some lipids or proteins, and they are considered the third class of information-encoding biological macromolecules, after DNA and proteins. The field of characterizing and studying glycans is known as glycomics, akin to genomics and proteomics. Glycans play a critical role in many biological processes, including embryonic development, cell-to-cell communication, coordination of immune functions, tumor progression, and protein regulation and interaction. Glycans are composed of monosaccharides (sugars) that are linked by glycosidic bonds. Unlike DNA and proteins, which are simple strings of nucleotides and amino acids, monosaccharides may be linked to one or more other sugars, thereby forming a branched tree structure; glycans are therefore often represented as rooted ordered labeled trees. In some cases, though rare, glycans may contain cycles due to cyclization of carbohydrate structures (e.g., cyclodextrins) [48]. There exist a number of representation schemes (KCF [5], LINUCS [13], GLYDE [87], GlycoCT [48], and GLYDE-II [83]) and database systems (CarbBank (footnote 1), SWEET-DB [75], KEGG/GLYCAN [45], EuroCarbDB (footnote 2), GlycoSuiteDB [26]) to store glycan data.

Ribonucleic acid (RNA) is a type of molecule that consists of a long chain of nucleotide units. RNA molecules play an important role in several key functionalities, which include translation, splicing, gene regulation, and synthesis of proteins. As with all biomolecules, the function of RNAs is intimately related to their structure. The secondary structure of RNAs is a list of base

1 http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm

2
