Managing and Mining Graph Data


is $\Phi = \sum_{i=1}^{k} \Delta(\mathcal{M}_i, \mathcal{S}_i)/k$. Similarly, let the average sub-structural self-similarity at the end of the previous iteration be $\Phi'$. At the beginning of the next iteration, the algorithm computes the increase in the average sub-structural self-similarity, $\Phi - \Phi'$, and checks whether it is smaller than a user-specified threshold $\epsilon$. If not, the algorithm proceeds with another iteration. Otherwise, the algorithm terminates. In addition, an upper bound on the number of iterations is imposed, in order to handle situations in which the threshold $\epsilon$ is chosen to be too small. Two further issues need to be addressed in order to use the underlying algorithm effectively:

We need effective methods for determining the similarity between a given document and a group of other documents. Techniques for computing this similarity are discussed in [2].

We need to determine frequent structural patterns in the underlying documents. This can be a huge challenge in many applications, especially since structural data is far more challenging to mine than transactional data. It has been shown in [2] how sequential pattern mining algorithms can be adapted to the case of structural data. The broad idea is to flatten out the tree structure into a sequential pattern by using a pre-order traversal. Then the clustering is performed on the resulting sequential patterns. It has been shown in [2] that such an approach is able to retain most of the structural information in the data, while introducing some spurious relations. The overall approach has been shown in [2] to be experimentally quite effective.
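The flattening step can be illustrated with a minimal Python sketch. The `Node` class, the `flatten_preorder` function, and the toy document are assumptions made for illustration; the actual encoding used in [2] may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labeled tree node, e.g. one element of an XML document (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def flatten_preorder(root: Node) -> List[str]:
    """Return the node labels in pre-order: each parent before its children."""
    sequence = []
    stack = [root]
    while stack:
        node = stack.pop()
        sequence.append(node.label)
        # Push children in reverse so that the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return sequence


doc = Node("book", [Node("title"), Node("author", [Node("name")]), Node("year")])
print(flatten_preorder(doc))  # ['book', 'title', 'author', 'name', 'year']
```

In the output, `name` and `year` are adjacent even though they are not parent and child in the tree. This is one way to see how a flattened sequence can introduce spurious relations while keeping most of the structure.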

It has been shown in [2] that this method is far more effective than competing techniques such as those discussed in [10, 29].
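The termination criterion described earlier in this section (stop when the gain $\Phi - \Phi'$ falls below $\epsilon$, or when an iteration cap is reached) can be sketched as follows. This is a sketch under stated assumptions, not the implementation from [2]: the function name `run_until_converged` and the `step` callback, which is assumed to perform one clustering iteration and return the new value of $\Phi$, are hypothetical.

```python
from typing import Callable, Tuple


def run_until_converged(step: Callable[[], float],
                        epsilon: float,
                        max_iterations: int) -> Tuple[float, int]:
    """Iterate until the gain in average self-similarity drops below epsilon.

    Returns the last value of Phi and the number of iterations performed.
    """
    previous_phi = step()
    iterations = 1
    while iterations < max_iterations:
        phi = step()
        iterations += 1
        if phi - previous_phi < epsilon:
            # The improvement is too small to justify another iteration.
            return phi, iterations
        previous_phi = phi
    # The iteration cap guards against an epsilon that is chosen too small.
    return previous_phi, iterations


# Simulated Phi values, one per iteration (illustrative numbers only).
values = iter([0.50, 0.62, 0.68, 0.685, 0.686])
print(run_until_converged(lambda: next(values), epsilon=0.01, max_iterations=10))
# (0.685, 4)
```

With the simulated values, the loop stops at the fourth iteration because the gain from 0.68 to 0.685 is below the threshold of 0.01.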

4 Applications of Graph Clustering Algorithms

Graph clustering algorithms find numerous applications in the literature. As discussed in this chapter, graph mining algorithms fall into the categories of node clustering and, more generally, object-based clustering algorithms. Object-based clustering algorithms are similar to general clustering algorithms in the literature, except that we use the underlying graphs as records rather than standard multi-dimensional attributes. Such algorithms are useful in a number of data domains such as molecular biology, chemical graphs, and XML data. In general, any data domain which can represent the underlying records in terms of compact graphs can benefit from such algorithms.

Node clustering algorithms can be used for a variety of real applications such as facility location. These algorithms can also be used for clustering with arbitrary distance functions between groups of objects. These algorithms are more general than those used for clustering records with the use of multi-dimensional distance functions.

Node clustering algorithms are closely related to the problem of graph partitioning. These methods are particularly useful for applications which need to determine dense regions of the graphs. The determination of dense regions of the graph is closely related to the problem of graph summarization and dimensionality reduction. The process of dimensionality reduction on graphs can be used in order to represent them in a small space, so that they can be used effectively for indexing and retrieval. Furthermore, compressed graphs can be used in a variety of applications in which it is desirable to use the summary behavior in order to estimate the approximate structural properties of the network. These estimates can then be refined for more exact results at a later stage. Some specific applications for which clustering algorithms may be leveraged are as follows:

4.1 Community Detection in Web Applications and Social Networks

Many web applications and social networks can typically be represented as massive graphs. For example, the structure of the web is itself a graph [22, 30, 34], in which nodes represent web pages and hyperlinks represent the edges of this graph. Similarly, social networks are graphs in which nodes represent the members of the social network, and the friendship relationships between members represent the corresponding links. Node clustering algorithms are a natural fit for community detection in massive graphs. The communities have natural interpretations in the context of a variety of web applications:

For the case of web applications such as web sites, communities typically refer to communities of closely linked pages. Such communities are typically linked because of common material in terms of topic, or similar interests in terms of readership.

For the case of social networks, communities refer to groups of members who may know each other very well, and may therefore be closely linked with one another. This is useful in determining important associations in the underlying social network.

Blogging communities often behave like social networks, and contain links between related blogs. The techniques discussed in this chapter are also useful for determining closely related blogs with the use of community detection methods.

Many of the node clustering applications discussed in this chapter are used in the context of social networks [22, 30, 34]. The min-hash approach [5, 22] is commonly used when the underlying graph is massive in nature, as in the case of the web. This is because the min-hash approach is able to summarize the graph in a very small amount of space. This is very useful for practical applications in which it may not be possible to represent the entire graph on disk. For example, the web graph is so large that it may not even be possible to store it on disk without the use of add-ons onto standard desktop hardware. Such situations lead to further constraints during the mining process, which are handled quite well by min-hash style approaches. This is because the min-hash summary is extremely small compared to the size of the graph itself. This compressed representation can even be maintained in main memory and used to determine the underlying communities in the network directly. It has been shown in [5, 22] that such an approach is able to determine communities of very high quality.
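To make the idea of a compact min-hash summary concrete, the following Python sketch computes a fixed-length signature for each node's neighbor set and estimates the Jaccard similarity of two neighbor sets from the signatures alone. It shows only the basic min-hash principle, not the full procedures of [5, 22]. The function names, the integer node ids, the linear hash family, and the toy graph are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Set, Tuple

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1; node ids are assumed to be smaller


def make_hash_params(num_hashes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (a, b) pairs defining hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(neighbors: Set[int], params: List[Tuple[int, int]]) -> List[int]:
    """Keep, for each hash function, the minimum hash value over a non-empty set."""
    return [min((a * x + b) % PRIME for x in neighbors) for a, b in params]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Toy out-link graph: node id -> set of linked node ids.
graph: Dict[int, Set[int]] = {1: {10, 11, 12, 13}, 2: {10, 11, 12, 13}, 3: {20, 21}}
params = make_hash_params(num_hashes=64)
signatures = {node: minhash_signature(links, params) for node, links in graph.items()}

print(estimate_jaccard(signatures[1], signatures[2]))  # 1.0 (identical link sets)
print(estimate_jaccard(signatures[1], signatures[3]))  # 0.0 (disjoint link sets)
```

Each signature has a fixed length (64 values here) regardless of how many neighbors a node has, which is why such a summary can stay in main memory when the graph itself cannot.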

4.2 Telecommunication Networks

Large telecommunication companies may have millions of customers who make billions of phone calls to one another over a period of time. In this case, the individual phone numbers may be represented as nodes, and phone calls may be represented as edges. In such cases, it may be desirable to determine groups of customers who call each other frequently. This information can be very useful for target marketing purposes. Furthermore, we note that the graphs in a telecommunication network are represented in the form of edge streams, since the edges may be received continuously over time. This results in even greater challenges from the point of view of analysis, since the edges cannot be explicitly stored on disk. The methods discussed in [22] are particularly useful in such scenarios.

4.3 Email Analysis

An interesting application in the context of the Enron crisis was to determine important email interactions between groups of Enron employees. In this case, the individuals are represented as nodes, and the emails sent between them are represented as edges. Node clustering algorithms are very useful for isolating dense email interactions between different groups of individuals. This approach can be used for a variety of intelligence applications, such as determining suspicious communities in groups of interactions.

5 Conclusions and Future Research

In this chapter, we presented a review of the commonly known algorithms for clustering graph data. The problem of clustering graphs has been widely studied in the literature, because of its application to a variety of data mining and data management problems. Graph clustering algorithms are of two types:

Node Clustering Algorithms: In this case, we attempt to partition the graph into clusters, so that each cluster contains a group of nodes which are densely connected. These densely connected groups of nodes may often provide significant information about how the entities in the underlying graph are inter-connected with one another.

Graph Clustering Algorithms: In this case, we have complete graphs available, and we wish to determine the clusters with the use of the structural information in the underlying graphs. Such cases are often encountered in the case of XML data, which are commonly encountered in many real domains.

We provided an overview of the different clustering algorithms available, and the tradeoffs with the use of different methods. The major challenges that remain in the area of graph clustering are as follows:

Clustering Massive Data Sets: In some cases, the data sets containing the graphs may be so large that they may be held only on disk. For example, if we have a dense graph containing $10^7$ nodes, then the number of edges may be on the order of $10^{13}$. In such cases, it may not even be possible to store the graph effectively on disk. In cases in which the graph can be stored on disk, it is critical that the algorithm be designed to take the disk-resident behavior of the underlying data into account. This is especially challenging in the case of graph data sets, because the structural behavior of the graph interferes with our ability to process the edges sequentially for many applications. In cases in which the graph is too large to store on disk, it is essential to design summary structures which can effectively store the underlying structural behavior of the graph. This stored summary can then be used effectively for graph clustering algorithms.

Clustering Graph Streams: In this case, we have large graphs which are received as edge streams. Such graphs are more challenging, since a given edge cannot be processed more than once during the computation process. In such cases, summary structures need to be designed in order to facilitate an effective clustering process. These summary structures may be utilized in order to determine effective clusters in the underlying data. This approach is similar to the case discussed above in which the size of the graph is too large to store on disk.
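As a small illustration of a one-pass summary over an edge stream, the sketch below keeps a fixed-size uniform sample of edges using reservoir sampling. This is a generic stand-in chosen for brevity, not a summary structure proposed in this chapter. The function name, the `Edge` type, and the simulated stream are assumptions.

```python
import random
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]


def reservoir_sample_edges(stream: Iterable[Edge], capacity: int, seed: int = 0) -> List[Edge]:
    """Keep a uniform random sample of at most `capacity` edges in a single pass."""
    rng = random.Random(seed)
    reservoir: List[Edge] = []
    for seen, edge in enumerate(stream, start=1):
        if len(reservoir) < capacity:
            # Fill phase: the first `capacity` edges are always kept.
            reservoir.append(edge)
        else:
            # Replace a stored edge with probability capacity / seen.
            slot = rng.randrange(seen)
            if slot < capacity:
                reservoir[slot] = edge
    return reservoir


# A generator stands in for the stream: each edge is seen exactly once.
edge_stream = ((u, u + 1) for u in range(1000))
sample = reservoir_sample_edges(edge_stream, capacity=50)
print(len(sample))  # 50
```

The memory used depends only on `capacity`, not on the length of the stream, which matches the constraint that each edge is processed once and cannot be stored explicitly.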

In addition, techniques need to be designed for interfacing clustering algorithms with traditional database management techniques. In order to achieve this goal, effective representations and query languages need to be designed for graph data. This is a new and emerging area of research, and can be leveraged in order to further improve the effectiveness of graph algorithms.

References

[1] J. Abello, M. G. Resende, S. Sudarsky. Massive quasi-clique detection. Proceedings of the 5th Latin American Symposium on Theoretical Informatics (LATIN), pp. 598-612, 2002.

[2] C. Aggarwal, N. Ta, J. Feng, J. Wang, M. J. Zaki. XProj: A Framework for Projected Structural Clustering of XML Documents. KDD Conference, 2007.

[3] R. Agrawal, A. Borgida, H. V. Jagadish. Efficient Maintenance of transitive relationships in large data and knowledge bases. ACM SIGMOD Conference, 1989.

[4] R. Ahuja, J. Orlin, T. Magnanti. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ, 1992.

[5] A. Z. Broder, M. Charikar, A. Frieze, M. Mitzenmacher. Syntactic clustering of the web. WWW Conference, Computer Networks, 29(8–13):1157–1166, 1997.

[6] D. Chakrabarti, Y. Zhan, C. Faloutsos. R-MAT: A Recursive Model for Graph Mining. SDM Conference, 2004.

[7] S. S. Chawathe. Comparing Hierarchical data in external memory. Very Large Data Bases Conference, 1999.

[8] J. Cheriyan, T. Hagerup, K. Mehlhorn. An $O(n^3)$-time maximum-flow algorithm. SIAM Journal on Computing, Volume 25, Issue 6, pp. 1144-1170, 1996.

[9] F. Chung. Spectral graph theory. Washington: Conference Board of the Mathematical Sciences, 1997.

[10] T. Dalamagas, T. Cheng, K. Winkel, T. Sellis. Clustering XML Documents Using Structural Summaries. Information Systems, Elsevier, January 2005.

[11] J. Cheng, J. Xu Yu, X. Lin, H. Wang, P. S. Yu. Fast Computing Reachability Labelings for Large Graphs with High Compression Rate. EDBT Conference, 2008.

[12] J. Cheng, J. Xu Yu, X. Lin, H. Wang, P. S. Yu. Fast Computation of Reachability Labelings in Large Graphs. EDBT Conference, 2006.

[13] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, v. 55, n. 3, pp. 441-453, Dec. 1997.

[14] E. Cohen, E. Halperin, H. Kaplan, U. Zwick. Reachability and distance queries via 2-hop labels. ACM Symposium on Discrete Algorithms, 2002.

[15] D. Cook, L. Holder. Mining Graph Data. John Wiley & Sons Inc, 2007.

[16] E. W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik, 1 (1959), pp. 269-271.

[17] M. Faloutsos, P. Faloutsos, C. Faloutsos. On Power Law Relationships of the Internet Topology. SIGCOMM Conference, 1999.

[18] P.-O. Fjallstrom. Algorithms for Graph Partitioning: A Survey. Linkoping Electronic Articles in Computer and Information Science, Vol. 3, no. 10, 1998.

[19] G. Flake, R. Tarjan, M. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4), 385–408, 2003.

[20] L. Freeman. Centrality in Social Networks. Social Networks, 1, 215–239, 1979.

[21] M. R. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. W. H. Freeman, 1979.

[22] D. Gibson, R. Kumar, A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. VLDB Conference, 2005.

[23] M. Girvan, M. Newman. Community Structure in Social and Biological Networks. Proceedings of the National Academy of Sciences, 99, 7821–7826, 2002.

[24] A. Jain, R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1998.

[25] H. Kashima, K. Tsuda, A. Inokuchi. Marginalized Kernels between Labeled Graphs. ICML, 2003.

[26] B. W. Kernighan, S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, vol. 49, Feb. 1970, pp. 291-307.

[27] T. Kudo, E. Maeda, Y. Matsumoto. An Application of Boosting to Graph Classification. NIPS Conference, 2004.

[28] M. Lee, W. Hsu, L. Yang, X. Yang. XClust: Clustering XML Schemas for Effective Integration. ACM Conference on Information and Knowledge Management, 2002.

[29] W. Lian, D. W. Cheung, N. Mamoulis, S. Yiu. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 1, 2004.

[30] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal. The Web as a Graph. ACM PODS Conference, 2000.

[31] M. Matsuda et al. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theoretical Computer Science, 210(2):305-325, 1999.

[32] J. Pei, D. Jiang, A. Zhang. On Mining Cross-Graph Quasi-Cliques. ACM KDD Conference, 2005.

[33] J. Pei, D. Jiang, A. Zhang. Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data. ICDE Conference, 2005.

[34] S. Raghavan, H. Garcia-Molina. Representing web graphs. ICDE Conference, pages 405-416, 2003.

[35] M. Rattigan, M. Maier, D. Jensen. Graph Clustering with Network Structure Indices. ICML, 2007.

[36] M. Rattigan, M. Maier, D. Jensen. Using structure indices for approximation of network properties. ACM KDD Conference, 2006.

[37] A. A. Tsay, W. S. Lovejoy, David R. Karger. Random Sampling in Cut, Flow, and Network Design Problems. Mathematics of Operations Research, 24(2):383-413, 1999.

[38] H. Wang, H. He, J. Yang, J. Xu-Yu, P. Yu. Dual Labeling: Answering Graph Reachability Queries in Constant Time. ICDE Conference, 2006.

[39] X. Yan, J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. ACM KDD Conference, 2003.

[40] X. Yan, H. Cheng, J. Han, P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search. SIGMOD Conference, 2008.

[41] X. Yan, P. S. Yu, J. Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD Conference, 2004.

[42] M. J. Zaki, C. C. Aggarwal. XRules: An Effective Structural Classifier for XML Data. KDD Conference, 2003.

[43] Z. Zeng, J. Wang, L. Zhou, G. Karypis. Out-of-core Coherent Closed Quasi-Clique Mining from Large Dense Graph Databases. ACM Transactions on Database Systems, Vol. 31(2), 2007.

A SURVEY OF ALGORITHMS FOR DENSE SUBGRAPH DISCOVERY

Victor E. Lee

Department of Computer Science

Kent State University

Kent, OH 44242

vlee@cs.kent.edu

Ning Ruan

Kent, OH 44242

nruan@cs.kent.edu

Ruoming Jin

Kent, OH 44242

jin@cs.kent.edu

Charu Aggarwal

IBM T. J. Watson Research Center

Yorktown Heights, NY 10598

charu@us.ibm.com

Abstract: In this chapter, we present a survey of algorithms for dense subgraph discovery. The problem of dense subgraph discovery is closely related to clustering, though the two problems also have a number of differences. For example, the problem of clustering is largely concerned with that of finding a fixed partition in the data, whereas the problem of dense subgraph discovery defines these dense components in a much more flexible way. The problem of dense subgraph discovery may either be defined over single or multiple graphs. We explore both cases. In the latter case, the problem is also closely related to the problem of frequent subgraph discovery. This chapter will discuss and organize the literature on this topic effectively in order to make it much more accessible to the reader.

Keywords: Dense subgraph discovery, graph clustering

C. C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_10

1 Introduction

In almost any network, density is an indication of importance. Just as someone reading a road map is interested in knowing the location of the larger cities and towns, investigators who seek information from abstract graphs are often interested in the dense components of the graph. Depending on what properties are being modeled by the graph’s vertices and edges, dense regions may indicate high degrees of interaction, mutual similarity and hence collective characteristics, attractive forces, favorable environments, or critical mass.

From a theoretical perspective, dense regions have many interesting properties. Dense components naturally have small diameters (worst-case shortest path between any two members). Routing within these components is rapid. A simple strategy also exists for global routing. If most vertices belong to a dense component, only a few selected inter-hub links are needed to have a short average distance between any two arbitrary vertices in the entire network. Commercial airlines employ this hub-based routing scheme. Dense regions are also robust, in the sense that many connections can be broken without splitting the component. A less well-known but equally important property of dense subgraphs comes from percolation theory. If a graph is sufficiently dense, or equivalently, if messages are forwarded from one node to its neighbors with higher than a certain probability, then there is a very high probability of propagating a message across the diameter of the graph [20]. This fact is useful in everything from epidemiology to marketing.

Not all graphs have dense components, however. A sparse graph may have few or none. In order to understand this issue, we first need to define a formal notion of the words ‘dense’ and ‘sparse’. We will address this issue shortly. A uniform graph is either entirely dense or not dense at all. Uniform graphs, however, are rare, usually limited to either small or artificially created ones. Due to the usefulness of dense components, it is generally accepted that their existence is the rule rather than the exception in nature and in human-planned networks [39].

Dense components have been identified in and have enhanced understanding of many types of networks; among the best-known are social networks [53, 44], the World Wide Web [30, 17, 11], financial markets [5], and biological systems [26]. Much of the early motivation, research, and nomenclature regarding dense components was in the field of social network analysis. Even before the advent of computers, sociologists turned to graph theory to formulate models for the concept of social cohesion. Clique, $K$-core, $K$-plex, and $K$-club are metrics originally devised to measure social cohesiveness [53]. It is not surprising that we also see dense components in the World Wide Web. In many ways, the Web is simply a virtual implementation of traditional direct human-human social networks.

Today, the natural sciences, the social sciences, and technological fields are all using network and graph analysis methods to better understand complex systems. Dense component discovery and analysis is one important aspect of network analysis. Therefore, readers from many different backgrounds will benefit from understanding more about the characteristics of dense components and some of the methods used to uncover them.

In the next section, we outline the graph terminology and define the fundamental measures of density to be used in the rest of the chapter. Section 3 categorizes the algorithmic approaches and presents representative implementations in more detail. Section 4 expands the topic to consider frequently-occurring dense components in a set of graphs. Section 5 provides examples of how these techniques have been applied in various scientific fields. Section 6 concludes the chapter with a look to the future.

2 Types of Dense Components

Different applications find different definitions of dense component to be useful. In this section, we outline the many ways to define a dense component, categorizing them by their important features. Understanding these features of the various types of components is valuable for deciding which type of component to pursue.

2.1 Absolute vs. Relative Density

We can divide density definitions into two classes, absolute density and relative density. An absolute density measure establishes rules and parameter values for what constitutes a dense component, independent of what is outside the component. For example, we could say that we are only interested in cliques, fully-connected subgraphs of maximum density. Absolute density measures take the form of relaxations of the pure clique measure.
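An absolute density test can be sketched in Python as follows. The sketch assumes the common definition of edge density (the number of edges present among a vertex set divided by the number of possible pairs) and uses a fixed density threshold as one illustrative relaxation of the clique condition. The function names, the adjacency-set representation, and the toy graph are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

Graph = Dict[int, Set[int]]  # undirected graph stored as adjacency sets


def edge_density(graph: Graph, vertices: Iterable[int]) -> float:
    """Fraction of vertex pairs in `vertices` that are joined by an edge."""
    members = set(vertices)
    n = len(members)
    if n < 2:
        return 0.0
    present = sum(1 for u, v in combinations(members, 2) if v in graph.get(u, set()))
    return present / (n * (n - 1) / 2)


def is_clique(graph: Graph, vertices: Iterable[int]) -> bool:
    """A clique has every pair of its vertices connected."""
    return all(v in graph.get(u, set()) for u, v in combinations(set(vertices), 2))


def is_dense(graph: Graph, vertices: Iterable[int], min_density: float) -> bool:
    """Relaxed test: accept any vertex set whose density reaches the preset threshold."""
    return edge_density(graph, vertices) >= min_density


# Toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
graph: Graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(is_clique(graph, [1, 2, 3]))         # True
print(is_clique(graph, [1, 2, 3, 4]))      # False
print(is_dense(graph, [1, 2, 3, 4], 0.6))  # True (4 of 6 possible edges)
```

The threshold is fixed in advance and the test looks only at edges inside the vertex set, which is what makes the measure absolute rather than relative.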

On the other hand, a relative density measure has no preset level for what is sufficiently dense. It compares the density of one region to another, with the goal of finding the densest regions. To establish the boundaries of components, a metric typically looks to maximize the difference between intra-component connectedness and inter-component connectedness. Often but not necessarily,
