is $\Phi = \sum_{i=1}^{k} \Delta(\mathcal{M}_i, \mathcal{S}_i)/k$. Similarly, let the average sub-structural self-similarity at the end of the previous iteration be $\Phi'$. At the beginning of the next iteration, the algorithm computes the increase in the average sub-structural self-similarity, $\Phi - \Phi'$, and checks whether it is smaller than a user-specified threshold $\epsilon$. If not, the algorithm proceeds with another iteration. Otherwise, the algorithm terminates. In addition, an upper bound on the number of iterations is imposed, in order to handle situations in which the threshold $\epsilon$ is chosen to be too small. Two further issues need to be addressed in order to use the underlying algorithm effectively:
We need effective methods for determining the similarity between a given document and a group of other documents. Techniques for computing this similarity are discussed in [2].
We need to determine frequent structural patterns in the underlying documents. This can be a huge challenge in many applications, especially since structural data is far more challenging to mine than transactional data. It has been shown in [2] how sequential pattern mining algorithms can be adapted to the case of structural data. The broad idea is to flatten the tree structure into a sequential pattern by using a pre-order traversal. The clustering is then performed on the resulting sequential patterns. It has been shown in [2] that such an approach retains most of the structural information in the data, while introducing some spurious relations. The overall approach has been shown in [2] to be experimentally quite effective.
It has been shown in [2] that this method is far more effective than competing techniques such as those discussed in [10, 29].
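
The flattening step described above can be illustrated with a minimal sketch. It is not the implementation from [2]; the `Node` class, the function name `preorder_labels`, and the toy document are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a labeled, ordered tree (for example, an XML element)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder_labels(root: Node) -> List[str]:
    """Return the labels in pre-order: a node first, then its children left to right."""
    sequence = []
    stack = [root]
    while stack:
        node = stack.pop()
        sequence.append(node.label)
        # Push children in reverse so that the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return sequence


# Hypothetical toy document: book(title, author(name, email), year)
doc = Node("book", [
    Node("title"),
    Node("author", [Node("name"), Node("email")]),
    Node("year"),
])
print(preorder_labels(doc))
```

In the resulting sequence, `email` is immediately followed by `year`, although the two nodes are neither parent and child nor siblings. This is one way in which a flattened representation can introduce the kind of spurious relation mentioned above.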
4 Applications of Graph Clustering Algorithms
Graph clustering algorithms find numerous applications in the literature. As discussed in this chapter, graph mining algorithms fall into the categories of node clustering and, more generally, object-based clustering algorithms. Object-based clustering algorithms are similar to general clustering algorithms in the literature, except that they use the underlying graphs as records rather than standard multi-dimensional attributes. Such algorithms are useful in a number of data domains such as molecular biology, chemical graphs, and XML data. In general, any data domain which can represent the underlying records in terms of compact graphs can benefit from such algorithms.
Node clustering algorithms can be used for a variety of real applications such as facility location. These algorithms can also be used for clustering with arbitrary distance functions between groups of objects. They are more general than the algorithms used for clustering records with multi-dimensional distance functions.
Node clustering algorithms are closely related to the problem of graph partitioning. These methods are particularly useful for applications which need to determine dense regions of the graphs. The determination of dense regions of the graph is closely related to the problem of graph summarization and dimensionality reduction. The process of dimensionality reduction on graphs can be used in order to represent them in a small space, so that they can be used effectively for indexing and retrieval. Furthermore, compressed graphs can be used in a variety of applications in which it is desirable to use the summary behavior in order to estimate the approximate structural properties of the network. These estimates can then be refined for more exact results at a later stage. Some specific applications for which clustering algorithms may be leveraged are as follows:
4.1 Community Detection in Web Applications and Social Networks
Many web applications and social networks can typically be represented as massive graphs. For example, the structure of the web is itself a graph [22, 30, 34], in which nodes represent web pages, and hyperlinks represent the edges of this graph. Similarly, social networks are graphs in which nodes represent the members of the social network, and the friendship relationships between members represent the corresponding links. Node clustering algorithms are a natural fit for community detection in massive graphs. The communities have natural interpretations in the context of a variety of web applications:
For the case of web applications such as web sites, communities typically refer to groups of closely linked pages. Such pages are typically linked because of common material in terms of topic, or similar interests in terms of readership.
For the case of social networks, communities refer to groups of members who may know each other very well, and may therefore be closely linked with one another. This is useful in determining important associations in the underlying social network.
Blogging communities often behave like social networks, and contain links between related blogs. The techniques discussed in this chapter are also useful for determining closely related blogs with the use of community detection methods.
Many of the node clustering applications discussed in this chapter are used in the context of social networks [22, 30, 34]. The min-hash approach [5, 22] is commonly used when the underlying graph is massive in nature, such as in the case of the web. This is because the min-hash approach is able to summarize the graph in a very small amount of space. This is very useful for practical applications in which it may not be possible to represent the entire graph on disk. For example, the web graph is so large that it may not even be possible to store it on disk without the use of add-ons to standard desktop hardware. Such situations lead to further constraints during the mining process, which are handled quite well by min-hash style approaches. This is because the min-hash summary is extremely small compared to the size of the graph itself. This compressed representation can even be maintained in main memory and used to determine the underlying communities in the network directly. It has been shown in [5, 22] that such an approach is able to determine communities of very high quality.
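
To give a rough sense of why a min-hash summary is so compact, the following sketch replaces each node's out-link set with a fixed-length signature and compares signatures instead of sets. It is a generic min-hash illustration, not the specific algorithms of [5] or [22]; the hash family, the function names, and the toy out-link sets are all assumptions.

```python
import random
from typing import List, Set, Tuple

PRIME = 2_147_483_647  # 2**31 - 1; node identifiers are assumed to be smaller than this


def make_hash_params(num_hashes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (a, b) pairs for hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(neighbors: Set[int], params: List[Tuple[int, int]]) -> List[int]:
    """Keep only the smallest hash value per hash function (neighbors must be non-empty)."""
    return [min((a * x + b) % PRIME for x in neighbors) for a, b in params]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions on which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical out-link sets; the integers stand for identifiers of linked pages.
out_links = {
    "page_a": {1, 2, 3, 4, 5},
    "page_b": {1, 2, 3, 4, 6},  # shares 4 of 6 distinct targets with page_a
    "page_c": {10, 11, 12},     # shares nothing with page_a
}
params = make_hash_params(num_hashes=64)
signatures = {page: minhash_signature(links, params) for page, links in out_links.items()}

print(estimated_jaccard(signatures["page_a"], signatures["page_b"]))  # estimates 4/6
print(estimated_jaccard(signatures["page_a"], signatures["page_c"]))  # disjoint sets
```

Each signature has the same length (64 integers here) no matter how many out-links a node has, which is what allows the summary to be held in main memory. Nodes whose signatures agree on many positions are candidates for belonging to the same dense region.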
4.2 Telecommunication Networks
Large telecommunication companies may have millions of customers who may make billions of phone calls to one another over a period of time. In this case, the individual phone numbers may be represented as nodes, and phone calls may be represented as edges. In such cases, it may be desirable to determine groups of customers who call each other frequently. This information can be very useful for target marketing purposes. Furthermore, we note that the graphs in a telecommunication network are represented in the form of edge streams, since the edges may be received continuously over time. This results in even greater challenges from the point of view of analysis, since the edges cannot be explicitly stored on disk. The methods discussed in [22] are particularly useful in such scenarios.
4.3 Email Analysis
An interesting application in the context of the Enron crisis was to determine important email interactions between groups of Enron employees. In this case, the individuals are represented as nodes, and the emails sent between them are represented as edges. Node clustering algorithms are very useful for isolating dense email interactions between different groups of individuals. This approach can be used for a variety of intelligence applications, such as determining suspicious communities in groups of interactions.
5 Conclusions and Future Research
In this chapter, we presented a review of the commonly known algorithms for clustering graph data. The problem of clustering graphs has been widely studied in the literature, because of its application to a variety of data mining and data management problems. Graph clustering algorithms are of two types:
Node Clustering Algorithms: In this case, we attempt to partition the graph into clusters, so that each cluster contains a group of nodes which are densely connected. These densely connected groups of nodes may often provide significant information about how the entities in the underlying graph are inter-connected with one another.
Graph Clustering Algorithms: In this case, we have complete graphs available, and we wish to determine the clusters with the use of the structural information in the underlying graphs. Such cases are often encountered with XML data, which is common in many real domains.
We provided an overview of the different clustering algorithms available, and the tradeoffs with the use of different methods. The major challenges that remain in the area of graph clustering are as follows:
Clustering Massive Data Sets: In some cases, the data sets containing the graphs may be so large that they may be held only on disk. For example, if we have a dense graph containing $10^7$ nodes, then the number of edges may be as high as $10^{13}$. In such cases, it may not even be possible to store the graph effectively on disk. In cases in which the graph can be stored on disk, it is critical that the algorithm be designed to take the disk-resident behavior of the underlying data into account. This is especially challenging in the case of graph data sets, because the structural behavior of the graph interferes with our ability to process the edges sequentially for many applications. In cases in which the graph is too large to store on disk, it is essential to design summary structures which can effectively store the underlying structural behavior of the graph. This stored summary can then be used effectively for graph clustering algorithms.
Clustering Graph Streams: In this case, we have large graphs which are received as edge streams. Such graphs are more challenging, since a given edge cannot be processed more than once during the computation process. In such cases, summary structures need to be designed in order to facilitate an effective clustering process. These summary structures may be utilized in order to determine effective clusters in the underlying data. This approach is similar to the case discussed above in which the graph is too large to store on disk.
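
The one-pass constraint on edge streams can be made concrete with a small sketch. It uses reservoir sampling, a generic technique that is not taken from this chapter, to maintain a fixed-size uniform sample of the edges while touching each edge exactly once. The function name, the `capacity` parameter, and the synthetic stream are assumptions for illustration.

```python
import random
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]


def reservoir_sample_edges(stream: Iterable[Edge], capacity: int, seed: int = 0) -> List[Edge]:
    """Keep a uniform random sample of at most `capacity` edges in a single pass."""
    rng = random.Random(seed)
    reservoir: List[Edge] = []
    for seen, edge in enumerate(stream, start=1):
        if len(reservoir) < capacity:
            # The first `capacity` edges are stored directly.
            reservoir.append(edge)
        else:
            # The new edge replaces a stored one with probability capacity / seen.
            slot = rng.randrange(seen)
            if slot < capacity:
                reservoir[slot] = edge
    return reservoir


# Hypothetical stream: a generator, so each edge is available only once.
edge_stream = ((u, u + 1) for u in range(1000))
sample = reservoir_sample_edges(edge_stream, capacity=50)
print(len(sample))
```

The memory used depends only on `capacity`, not on the length of the stream. A uniform edge sample is only one possible summary; structure-aware summaries such as the min-hash approach of [22] are designed specifically for finding dense regions.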
In addition, techniques need to be designed for interfacing clustering algorithms with traditional database management techniques. In order to achieve this goal, effective representations and query languages need to be designed for graph data. This is a new and emerging area of research, and can be leveraged in order to further improve the effectiveness of graph algorithms.
References
[1] J. Abello, M. G. Resende, S. Sudarsky. Massive quasi-clique detection. Proceedings of the 5th Latin American Symposium on Theoretical Informatics (LATIN), pp. 598-612, 2002.
[2] C. Aggarwal, N. Ta, J. Feng, J. Wang, M. J. Zaki. XProj: A Framework for Projected Structural Clustering of XML Documents. KDD Conference, 2007.
[3] R. Agrawal, A. Borgida, H. V. Jagadish. Efficient Maintenance of transitive relationships in large data and knowledge bases. ACM SIGMOD Conference, 1989.
[4] R. Ahuja, J. Orlin, T. Magnanti. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ, 1992.
[5] A. Z. Broder, M. Charikar, A. Frieze, M. Mitzenmacher. Syntactic clustering of the web. WWW Conference, Computer Networks, 29(8–13):1157–1166, 1997.
[6] D. Chakrabarti, Y. Zhan, C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. SDM Conference, 2004.
[7] S. S. Chawathe. Comparing Hierarchical data in external memory. Very Large Data Bases Conference, 1999.
[8] J. Cheriyan, T. Hagerup, K. Melhorn. An $O(n^3)$-time maximum-flow algorithm. SIAM Journal on Computing, Volume 25, Issue 6, pp. 1144–1170, 1996.
[9] F. Chung. Spectral graph theory. Washington: Conference Board of the Mathematical Sciences, 1997.
[10] T. Dalamagas, T. Cheng, K. Winkel, T. Sellis. Clustering XML Documents Using Structural Summaries. Information Systems, Elsevier, January 2005.
[11] J. Cheng, J. Xu Yu, X. Lin, H. Wang, P. S. Yu. Fast Computing Reachability Labelings for Large Graphs with High Compression Rate. EDBT Conference, 2008.
[12] J. Cheng, J. Xu Yu, X. Lin, H. Wang, P. S. Yu. Fast Computation of Reachability Labelings in Large Graphs. EDBT Conference, 2006.
[13] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, v. 55, n. 3, pp. 441-453, Dec. 1997.
[14] E. Cohen, E. Halperin, H. Kaplan, U. Zwick. Reachability and distance queries via 2-hop labels. ACM Symposium on Discrete Algorithms, 2002.
[15] D. Cook, L. Holder. Mining Graph Data. John Wiley & Sons Inc., 2007.
[16] E. W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik, 1 (1959), pp. 269-271.
[17] M. Faloutsos, P. Faloutsos, C. Faloutsos. On Power Law Relationships of the Internet Topology. SIGCOMM Conference, 1999.
[18] P.-O. Fjallstrom. Algorithms for Graph Partitioning: A Survey. Linkoping Electronic Articles in Computer and Information Science, Vol. 3, no. 10, 1998.
[19] G. Flake, R. Tarjan, M. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4), 385–408, 2003.
[20] I. Freeman. Centrality in Social Networks. Social Networks, 1, 215–239, 1979.
[21] M. S. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. W. H. Freeman, 1979.
[22] D. Gibson, R. Kumar, A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. VLDB Conference, 2005.
[23] M. Girvan, M. Newman. Community Structure in Social and Biological Networks. Proceedings of the National Academy of Science, 99, 7821–7826, 2002.
[24] A. Jain, R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1998.
[25] H. Kashima, K. Tsuda, A. Inokuchi. Marginalized Kernels between Labeled Graphs. ICML, 2003.
[26] B. W. Kernighan, S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Tech. Journal, vol. 49, Feb. 1970, pp. 291-307.
[27] T. Kudo, E. Maeda, Y. Matsumoto. An Application of Boosting to Graph Classification. NIPS Conf., 2004.
[28] M. Lee, W. Hsu, L. Yang, X. Yang. XClust: Clustering XML Schemas for Effective Integration. ACM Conference on Information and Knowledge Management, 2002.
[29] W. Lian, D. W. Cheung, N. Mamoulis, S. Yiu. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 1, 2004.
[30] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal. The Web as a Graph. ACM PODS Conference, 2000.
[31] M. Matsuda et al. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theoretical Computer Science, 210(2):305-325, 1999.
[32] J. Pei, D. Jiang, A. Zhang. On Mining Cross-Graph Quasi-Cliques. ACM KDD Conference, 2005.
[33] J. Pei, D. Jiang, A. Zhang. Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data. ICDE Conference, 2005.
[34] S. Raghavan, H. Garcia-Molina. Representing web graphs. ICDE Conference, pages 405-416, 2003.
[35] M. Rattigan, M. Maier, D. Jensen. Graph Clustering with Network Structure Indices. ICML, 2007.
[36] M. Rattigan, M. Maier, D. Jensen. Using structure indices for approximation of network properties. ACM KDD Conference, 2006.
[37] A. A. Tsay, W. S. Lovejoy, David R. Karger. Random Sampling in Cut, Flow, and Network Design Problems. Mathematics of Operations Research, 24(2):383-413, 1999.
[38] H. Wang, H. He, J. Yang, J. Xu-Yu, P. Yu. Dual Labeling: Answering Graph Reachability Queries in Constant Time. ICDE Conference, 2006.
[39] X. Yan, J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. ACM KDD Conference, 2003.
[40] X. Yan, H. Cheng, J. Han, P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search. SIGMOD Conference, 2008.
[41] X. Yan, P. S. Yu, J. Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD Conference, 2004.
[42] M. J. Zaki, C. C. Aggarwal. XRules: An Effective Structural Classifier for XML Data. KDD Conference, 2003.
[43] Z. Zeng, J. Wang, L. Zhou, G. Karypis. Out-of-core Coherent Closed Quasi-Clique Mining from Large Dense Graph Databases. ACM Transactions on Database Systems, Vol. 31(2), 2007.
A SURVEY OF ALGORITHMS FOR
DENSE SUBGRAPH DISCOVERY
Victor E. Lee
Department of Computer Science
Kent State University
Kent, OH 44242
vlee@cs.kent.edu
Ning Ruan
Department of Computer Science
Kent State University
Kent, OH 44242
nruan@cs.kent.edu
Ruoming Jin
Department of Computer Science
Kent State University
Kent, OH 44242
jin@cs.kent.edu
Charu Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
charu@us.ibm.com
Abstract In this chapter, we present a survey of algorithms for dense subgraph discovery. The problem of dense subgraph discovery is closely related to clustering, though the two problems also have a number of differences. For example, the problem of clustering is largely concerned with finding a fixed partition in the data, whereas the problem of dense subgraph discovery defines these dense components in a much more flexible way. The problem of dense subgraph discovery may be defined over either single or multiple graphs. We explore both cases. In the latter case, the problem is also closely related to the problem of frequent subgraph discovery. This chapter discusses and organizes the literature on this topic in order to make it much more accessible to the reader.
Keywords: Dense subgraph discovery, graph clustering
C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_10, © Springer Science+Business Media, LLC 2010
1 Introduction
In almost any network, density is an indication of importance. Just as someone reading a road map is interested in knowing the location of the larger cities and towns, investigators who seek information from abstract graphs are often interested in the dense components of the graph. Depending on what properties are being modeled by the graph's vertices and edges, dense regions may indicate high degrees of interaction, mutual similarity and hence collective characteristics, attractive forces, favorable environments, or critical mass.
From a theoretical perspective, dense regions have many interesting properties. Dense components naturally have small diameters (the diameter being the worst-case shortest path between any two members). Routing within these components is rapid. A simple strategy also exists for global routing. If most vertices belong to a dense component, only a few selected inter-hub links are needed to have a short average distance between any two arbitrary vertices in the entire network. Commercial airlines employ this hub-based routing scheme. Dense regions are also robust, in the sense that many connections can be broken without splitting the component. A less well-known but equally important property of dense subgraphs comes from percolation theory. If a graph is sufficiently dense, or equivalently, if messages are forwarded from one node to its neighbors with higher than a certain probability, then there is a very high probability of propagating a message across the diameter of the graph [20]. This fact is useful in everything from epidemiology to marketing.
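
The notion of diameter used above (the worst-case shortest path between any two members) can be checked on small examples. The sketch below is a straightforward breadth-first search on an unweighted, undirected graph stored as a dictionary of neighbor sets; the representation, the function names, and the two toy graphs are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def eccentricity(graph: Graph, source: int) -> int:
    """Length of the longest shortest path starting at `source` (graph must be connected)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    if len(distance) < len(graph):
        raise ValueError("graph is not connected")
    return max(distance.values())


def diameter(graph: Graph) -> int:
    """Worst-case shortest-path length over all pairs of vertices."""
    return max(eccentricity(graph, node) for node in graph)


n = 6
# Densest case: every vertex is adjacent to every other vertex.
clique = {u: {v for v in range(n) if v != u} for u in range(n)}
# Sparse case: the same vertices arranged in a simple path 0 - 1 - ... - 5.
path = {u: {v for v in (u - 1, u + 1) if 0 <= v < n} for u in range(n)}

print(diameter(clique), diameter(path))
```

On six vertices, the clique has diameter 1, whereas the path has diameter 5, which illustrates why routing inside a dense component needs so few hops.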
Not all graphs have dense components, however. A sparse graph may have few or none. In order to understand this issue, we first need to define a formal notion of the words 'dense' and 'sparse'. We will address this issue shortly.
A uniform graph is either entirely dense or not dense at all. Uniform graphs, however, are rare, and are usually limited to either small or artificially created ones. Due to the usefulness of dense components, it is generally accepted that their existence is the rule rather than the exception in nature and in human-planned networks [39].
Dense components have been identified in, and have enhanced understanding of, many types of networks; among the best-known are social networks [53, 44], the World Wide Web [30, 17, 11], financial markets [5], and biological systems [26]. Much of the early motivation, research, and nomenclature regarding dense components was in the field of social network analysis. Even before the advent of computers, sociologists turned to graph theory to formulate models for the concept of social cohesion. Clique, $K$-core, $K$-plex, and $K$-club are metrics originally devised to measure social cohesiveness [53]. It is not surprising that we also see dense components in the World Wide Web. In many ways, the Web is simply a virtual implementation of traditional direct human-human social networks.
Today, the natural sciences, the social sciences, and technological fields are all using network and graph analysis methods to better understand complex systems. Dense component discovery and analysis is one important aspect of network analysis. Therefore, readers from many different backgrounds will benefit from understanding more about the characteristics of dense components and some of the methods used to uncover them.
In the next section, we outline the graph terminology and define the fundamental measures of density to be used in the rest of the chapter. Section 3 categorizes the algorithmic approaches and presents representative implementations in more detail. Section 4 expands the topic to consider frequently-occurring dense components in a set of graphs. Section 5 provides examples of how these techniques have been applied in various scientific fields. Section 6 concludes the chapter with a look to the future.
2 Types of Dense Components
Different applications find different definitions of dense component to be useful. In this section, we outline the many ways to define a dense component, categorizing them by their important features. Understanding these features of the various types of components is valuable for deciding which type of component to pursue.
2.1 Absolute vs Relative Density
We can divide density definitions into two classes, absolute density and relative density. An absolute density measure establishes rules and parameter values for what constitutes a dense component, independent of what is outside the component. For example, we could say that we are only interested in cliques, fully-connected subgraphs of maximum density. Absolute density measures take the form of relaxations of the pure clique measure.
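
As a minimal sketch of an absolute density test, the code below checks whether a vertex set is a clique and computes its edge density, that is, the fraction of possible internal edges that are present. The dictionary-of-sets graph representation, the function names, and the toy graph are assumptions for illustration; edge density is used here only as one simple example of how the clique requirement might be relaxed.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]


def edge_density(graph: Graph, nodes: Iterable[str]) -> float:
    """Fraction of vertex pairs inside `nodes` that are joined by an edge."""
    members = list(nodes)
    possible = len(members) * (len(members) - 1) // 2
    if possible == 0:
        return 0.0
    present = sum(1 for u, v in combinations(members, 2) if v in graph[u])
    return present / possible


def is_clique(graph: Graph, nodes: Iterable[str]) -> bool:
    """True when every pair of vertices inside `nodes` is adjacent."""
    return all(v in graph[u] for u, v in combinations(list(nodes), 2))


# Hypothetical undirected graph: a triangle a-b-c with a pendant vertex d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(is_clique(graph, {"a", "b", "c"}), edge_density(graph, {"a", "b", "c"}))
print(is_clique(graph, graph.keys()), edge_density(graph, graph.keys()))
```

Both tests depend only on the edges inside the chosen vertex set, which is what makes them absolute: nothing outside the component influences the result. The full vertex set is not a clique, but it has 4 of its 6 possible edges, so it would still pass a relaxed test with a density threshold of, say, 0.6.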
On the other hand, a relative density measure has no preset level for what is sufficiently dense. It compares the density of one region to another, with the goal of finding the densest regions. To establish the boundaries of components, a metric typically looks to maximize the difference between intra-component connectedness and inter-component connectedness. Often but not necessarily,