

A SURVEY OF CLUSTERING ALGORITHMS FOR GRAPH DATA

Charu C. Aggarwal

IBM T. J. Watson Research Center

Hawthorne, NY 10532

charu@us.ibm.com

Haixun Wang

Microsoft Research Asia

Beijing, China 100190

haixunw@microsoft.com

Abstract: In this chapter, we will provide a survey of clustering algorithms for graph data. We will discuss the different categories of clustering algorithms and recent efforts to design clustering methods for various kinds of graphical data. Clustering algorithms are typically of two types. The first type consists of node clustering algorithms, in which we attempt to determine dense regions of the graph based on edge behavior. The second type consists of structural clustering algorithms, in which we attempt to cluster the different graphs based on overall structural behavior. We will also discuss the applicability of the approach to other kinds of data such as semi-structured data, and the utility of graph mining algorithms to such representations.

Keywords: Graph Clustering, Dense Subgraph Discovery

1 Introduction

Graph mining has been a popular area of research in recent years because of numerous applications in computational biology, software bug localization, and computer networking. In addition, many new kinds of data such as semi-structured data and XML [2] can typically be represented as graphs. In particular, XML data is a popular representation of different kinds of data sets. Since core graph-mining algorithms can be extended to this scenario, it follows that the extension of mining algorithms to graphs has tremendous applicability to a wide variety of data sets which are represented as semi-structured data. Many traditional algorithms such as clustering, classification, and frequent-pattern mining have been extended to the graph scenario. A detailed discussion of various kinds of graph mining algorithms may be found in [15].

In this chapter, we will study the clustering problem for the graph domain. The problem of clustering is defined as follows: for a given set of objects, we would like to divide it into groups of similar objects. The similarity between objects is typically defined with the use of a mathematical objective function. This problem is useful in a number of practical applications such as marketing, customer segmentation, and data summarization. The problem of clustering is extremely important in a number of important data domains. A detailed description of clustering algorithms may be found in [24].

Clustering algorithms have significant applications in a variety of graph scenarios such as congestion detection, facility location, and XML data integration [28]. Graph clustering problems are typically divided into two categories:

Node Clustering Algorithms: Node-clustering algorithms are generalizations of multi-dimensional clustering algorithms, in which we use functions of the multi-dimensional data points in order to define the distances. In the case of graph clustering algorithms, we associate numerical values with the edges. These numerical values need not satisfy traditional properties of distance functions such as the triangle inequality. We use these distance values in order to create clusters of nodes. We note that the numerical value associated with a given edge may either be a distance value or a similarity value. Correspondingly, the objective function associated with the partitioning may either be minimized or maximized respectively. We note that the problem of minimizing the inter-cluster similarity for a fixed number of clusters essentially reduces to the problem of graph partitioning or the minimum multi-way cut problem. This is also referred to as the problem of mining dense graphs and pseudo-cliques. Recently, the problem has also been studied in the database literature as that of quasi-clique determination. In this problem, we determine groups of nodes which are "almost cliques". In other words, an edge exists between any pair of nodes in the set with high probability. A closely related problem is that of determining shingles [5, 22]. Shingles are defined as those sub-graphs which have a large number of common links. This is particularly useful for massive graphs which contain a large number of nodes. In such cases, a min-hash approach [5] can be used in order to summarize the structural behavior of the underlying graph.

Graph Clustering Algorithms: In this case, we have a (possibly large) number of graphs which need to be clustered based on their underlying structural behavior. This problem is challenging because of the need to match the structures of the underlying graphs, and to use these structures for clustering purposes. Such algorithms are discussed both in the context of classical graph data sets as well as semi-structured data. In the case of semi-structured data, the problem arises in the context of a large number of documents which need to be clustered on the basis of the underlying structure and attributes. It has been shown in [2] that the use of the underlying document structure leads to significantly more effective algorithms.

This chapter is organized as follows. In the next section, we will discuss a variety of node clustering algorithms. Methods for clustering multiple graphs and XML records are discussed in Section 3. Section 4 discusses numerous applications of graph clustering algorithms. Section 5 contains the conclusions and summary.

2 Node Clustering Algorithms

A number of algorithms for graph node clustering are discussed in [19]. In [19], the graph clustering problem is related to the minimum cut and graph partitioning problems. In this case, it is assumed that the underlying graphs have weights on the edges. It is desired to partition the graph in such a way so as to minimize the weights of the edges across the partitions. In general, we would like to partition the graph into 𝑘 groups of nodes. However, since the special case 𝑘 = 2 is efficiently solvable, we would like to first provide a special discussion for this case. This version is polynomially solvable, since it is the mathematical dual of the maximum flow problem. This problem is also referred to as the minimum-cut problem.

2.1 The Minimum Cut Problem

The simplest case is the 2-way minimum cut problem, in which we wish to partition the graph into two clusters, so as to minimize the weight of the edges across the partitions. This version of the problem is efficiently solvable, and can be resolved by use of the maximum flow problem [4].

The minimum-cut problem is defined as follows. Consider a graph 𝐺 = (𝑁, 𝐴) with node set 𝑁 and edge set 𝐴. The node set 𝑁 contains the source 𝑠 and sink 𝑡. Each edge (𝑖, 𝑗) ∈ 𝐴 has a weight associated with it which is denoted by 𝑢𝑖𝑗. We note that the edges may be either undirected or directed, though the undirected case is often much more relevant for connectivity applications. We would like to partition the node set 𝑁 into two groups 𝑆 and 𝑁 − 𝑆. The set of edges such that one end lies in 𝑆 and the other lies in 𝑁 − 𝑆 is denoted by 𝐶(𝑆, 𝑁 − 𝑆). We would like to partition the node set 𝑁 into two sets 𝑆 and 𝑁 − 𝑆, such that the sum of the weights in 𝐶(𝑆, 𝑁 − 𝑆) is minimized. In other words, we would like to minimize

$$\sum_{(i,j) \in C(S,\, N-S)} u_{ij}.$$

This is the unrestricted version of the minimum-cut problem. We will examine two variations of the minimum-cut problem:

We wish to determine the global minimum cut, with no restrictions on the membership of nodes to different partitions.

We wish to determine the minimum 𝑠-𝑡 cut, in which one partition contains the source node 𝑠 and the other partition contains the sink node 𝑡.

It is easy to see that the former problem can be solved by using repeated applications of the latter algorithm. By fixing 𝑠 and choosing different values of the sink 𝑡, it can be shown that the global minimum cut may be effectively determined.

It turns out that the maximum flow problem is the mathematical dual of the minimum cut problem. In the maximum-flow problem, we assume that the weight 𝑢𝑖𝑗 is a capacity of the edge (𝑖, 𝑗). Each edge is allowed to have a flow 𝑥𝑖𝑗 which is at most equal to the capacity 𝑢𝑖𝑗. Each node other than the source 𝑠 and sink 𝑡 is assumed to satisfy the flow conservation property. In other words, for each node 𝑖 ∈ 𝑁 we have:

$$\sum_{j:(i,j) \in A} x_{ij} \;=\; \sum_{j:(j,i) \in A} x_{ji}.$$

We would like to maximize the total flow originating from the source 𝑠 and reaching the sink 𝑡, subject to the above constraints. The maximum flow problem is solved with the use of a variety of augmenting-path and preflow-push algorithms [4]. In augmenting-path methods, we pick a path from 𝑠 to 𝑡 which has current unused capacity, and increase the flow on this path, such that at least one edge on this path is filled to capacity. We repeat this process until no path with unfilled capacity exists from source 𝑠 to sink 𝑡. Many different variations of this technique exist in terms of the choice of path used in order to augment the flow from source 𝑠 to the sink 𝑡; examples include the shortest-path and maximum-capacity augmenting paths. Different choices of augmenting paths will typically lead to different trade-offs in running time. These trade-offs are discussed in [4]. In general, the two-way cut problem can be solved quite efficiently in polynomial time with these different methods. It can be shown that the minimum cut may be determined by finding all nodes 𝑆 which are reachable from 𝑠 by some path of unfilled capacity. We note that 𝑆 will not contain the sink node 𝑡 at maximum flow, since the sink is not reachable from the source with the use of a path of unfilled capacity. The set 𝐶(𝑆, 𝑁 − 𝑆) is the minimum 𝑠-𝑡 cut. Every edge in this set is saturated, and the total flow across the cut is essentially equal to the 𝑠-𝑡 maximum flow. We can then determine the global minimum cut by fixing the source 𝑠 and varying the sink node 𝑡. The minimum cut over all these different possibilities will provide us with the global minimum-cut value.

A particularly important variant of this method is the shortest augmenting-path approach. In this approach we always augment the maximum amount of flow from the source to the sink along the corresponding shortest path. It can be shown that for a network containing 𝑛 nodes and 𝑚 edges, the shortest path is guaranteed to increase by at least one after 𝑂(𝑚) augmentations. Since the shortest path cannot be larger than 𝑛, it follows that the maximum number of augmentations is 𝑂(𝑛 ⋅ 𝑚). It is possible to implement each augmentation in 𝑂(log(𝑛)) time with the use of dynamic data structures. This implies that the overall technique requires at most 𝑂(𝑛 ⋅ 𝑚 ⋅ log(𝑛)) time.
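As a concrete illustration of the augmenting-path idea and of recovering the cut from the residual graph, the following is a minimal Python sketch of a shortest-augmenting-path (BFS-based) maximum-flow computation. The dictionary-based graph representation and the function name are illustrative choices rather than anything prescribed in the text, and the sketch omits the dynamic data structures needed for the 𝑂(𝑛 ⋅ 𝑚 ⋅ log(𝑛)) bound.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Shortest-augmenting-path max flow; returns (flow value, source side S of the cut).

    capacity: dict mapping node -> {neighbor: capacity}, interpreted as a
    directed graph (add both directions to model an undirected edge).
    """
    # Residual capacities, including zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    def shortest_augmenting_path():
        # BFS over edges with unused residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return None  # no augmenting path remains

    flow = 0
    while True:
        parent = shortest_augmenting_path()
        if parent is None:
            break
        # Collect the path edges and their bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:               # fill at least one edge to capacity
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

    # S = nodes reachable from s via paths of unfilled capacity;
    # C(S, N - S) consists of the saturated edges of the minimum s-t cut.
    S, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v, cap in residual[u].items():
            if cap > 0 and v not in S:
                S.add(v)
                stack.append(v)
    return flow, S
```

For instance, on the small network capacity = {'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}, the sketch returns a flow value of 4 with S = {'s', 'a'}, and the saturated edges (a, t) and (s, b) form the minimum cut.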

A second class of algorithms which are often used in order to solve the maximum flow problem are preflow-push algorithms, which do not maintain the flow conservation constraints in their intermediate solutions. Rather, an excess flow is maintained at each node, and we try to push as much of this flow as possible along any edge on the shortest path from the source to the sink. A detailed discussion of preflow-push methods is beyond the scope of this chapter, and may be found in [4]. Most maximum flow methods require at least Ω(𝑛 ⋅ 𝑚) time, where 𝑛 is the number of nodes, and 𝑚 is the number of edges.

A closely related problem to the minimum 𝑠-𝑡 cut problem is that of determining a global minimum cut in an undirected graph. This particular case is more efficient than that of finding the 𝑠-𝑡 minimum cut. One way of determining a minimum cut is by using a contraction-based edge-sampling approach. While the previous technique is applicable to both the directed and undirected versions of the problem, the contraction-based approach is applicable only to the undirected version of the problem. Furthermore, the contraction-based approach is applicable only for the case in which the weight of each edge is 𝑢𝑖𝑗 = 1. While the method can easily be extended to the weighted version by varying the edge-sampling probability, the polynomial running time bounds discussed in [37] do not apply to this case. The contraction approach is a probabilistic technique in which we successively sample edges in order to collapse nodes into larger sets of nodes. By successively sampling different sequences of edges and picking the optimum value [37], it is possible to determine a global minimum cut. The broad idea of the contraction-based approach is as follows. We pick an edge randomly in the graph, and contract its two end points into a single node. We remove all self-loops which are created as a result of the contraction. We may also create some parallel edges, which are allowed to remain, since they influence the sampling probability¹ of contractions. The process of contraction is repeated until we are left with two nodes. We note that each of this pair of "super-nodes" corresponds to a set of nodes in the original data. These two sets of nodes provide us with the final minimum cut. We note that the minimum cut will survive in this approach if none of the edges in the minimum cut are sampled during the contraction. An immediate observation is that cuts with a smaller number of edges are more likely to survive using this approach. This is because the edges in cuts which contain a large number of edges are much more likely to be sampled. One of the key observations in [37] is the following:

Lemma 9.1 When a graph containing 𝑛 nodes is contracted to 𝑡 nodes, the probability that the minimum cut survives during the contraction is given by 𝑂(𝑡²/𝑛²).

Proof: Let the minimum cut have 𝑘 edges. Then, each vertex must have degree at least 𝑘, and therefore the graph must contain at least 𝑛 ⋅ 𝑘/2 edges. Then, the probability that the minimum cut survives the first contraction is given by 1 − 𝑘/(#Edges) ≥ 1 − 2/𝑛. This relationship is derived by substituting the lower bound of 𝑛 ⋅ 𝑘/2 for the number of edges. Similarly, in the second round of contractions, the probability of survival is given by 1 − 2/(𝑛 − 1). Therefore, the overall probability 𝑝𝑠 of survival is given by:

$$p_s \;=\; \prod_{i=0}^{n-t-1}\left(1 - \frac{2}{n-i}\right) \;=\; \frac{t\,(t-1)}{n\,(n-1)}.$$

Thus, if we contract to two nodes, the probability of the survival of the minimum cut is 2/(𝑛 ⋅ (𝑛 − 1)). By repeating the process 𝑛 ⋅ (𝑛 − 1)/2 times, we can show that the probability that the minimum cut survives is at least 1 − 1/𝑒. If we further scale up by a constant factor 𝐶 > 1, we can show that the probability of survival is given by 1 − (1/𝑒)^𝐶. By picking 𝐶 = log(1/𝛿), we can assure that the cut survives with probability at least 1 − 𝛿, where 𝛿 ≪ 1. The logarithmic relationship assures that we can determine minimum cuts with very high probability at a small additional cost. An additional implication of Lemma 9.1 is that the total number of distinct minimum cuts is bounded above by 𝑛 ⋅ (𝑛 − 1)/2. This is because the probability of the survival of any particular minimum cut is at least 2/(𝑛 ⋅ (𝑛 − 1)), and the probability of the survival of any minimum cut cannot be greater than 1.

¹ Alternatively, we may replace parallel edges by a single edge of weight which is equal to the number of parallel edges. We use this weight in order to bias the sampling process.
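The following is a minimal Python sketch of the contraction procedure described above, for the unweighted case 𝑢𝑖𝑗 = 1 assumed in the analysis; the edge-list representation and names are illustrative only. Following the discussion after Lemma 9.1, on the order of 𝐶 ⋅ 𝑛 ⋅ (𝑛 − 1)/2 independent trials, with 𝐶 = log(1/𝛿), would be used to retain the minimum cut with probability at least 1 − 𝛿.

```python
import random

def contract_min_cut(edges, n_trials):
    """Randomized contraction estimate of the global minimum cut size.

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    Runs the contraction n_trials times and returns the smallest cut found.
    """
    best_cut = None
    for _ in range(n_trials):
        label = {u: u for e in edges for u in e}           # super-node of each node
        multigraph = [(a, b) for a, b in edges if a != b]  # drop any self-loops
        n_super = len(set(label.values()))
        while n_super > 2 and multigraph:
            # Sample an edge uniformly; parallel edges are kept, so they
            # bias the sampling towards heavily connected super-nodes.
            u, v = random.choice(multigraph)
            lu, lv = label[u], label[v]
            # Contract: merge v's super-node into u's super-node.
            for w, lw in label.items():
                if lw == lv:
                    label[w] = lu
            # Remove the self-loops created by the contraction.
            multigraph = [(a, b) for a, b in multigraph if label[a] != label[b]]
            n_super -= 1
        # The surviving edges all run between the two final super-nodes.
        cut_size = len(multigraph)
        if best_cut is None or cut_size < best_cut:
            best_cut = cut_size
    return best_cut
```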


Another observation is that the probability of survival of the minimum cut in the first iteration is the largest, and it reduces in successive iterations. For example, in the first iteration, the probability of survival is 1 − (2/𝑛), but the probability of survival in the last iteration is only 1/3. Thus, most of the errors are caused in the last few iterations. This is particularly reflected in the cumulative error across many iterations, since the probability of maintaining the correct cut on contracting down to 𝑡 nodes is 𝑡²/𝑛², whereas the probability of maintaining the correct cut in the remaining contractions is 1/𝑡².

Therefore, a natural solution is to use a two-phase approach. In the first phase, we do not contract down to 2 nodes, but rather contract down to 𝑡 nodes. The probability of maintaining the correct cut by the use of this approach is at least Ω(𝑡²/𝑛²). Therefore, 𝑂(𝑛²/𝑡²) contractions are required in order to reduce the graph to 𝑡 nodes. Since each contraction requires 𝑂(𝑛) time, the running time of the first phase is given by 𝑂(𝑛³/𝑡²). In the second phase, we use a standard maximum-flow based method in order to determine the minimum cut. This maximum flow problem needs to be repeated 𝑡 times for a fixed source and different sinks. However, the base graph on which this is performed is much smaller, and contains only 𝑂(𝑡) nodes. Each maximum flow problem requires 𝑂(𝑡³) time by using the method discussed in [8], and therefore the total time for all 𝑡 problems is given by 𝑂(𝑡⁴). Therefore, the total running time is given by 𝑂(𝑛³/𝑡² + 𝑡⁴). By picking 𝑡 = √𝑛, we can obtain a running time of 𝑂(𝑛²). Thus, by using a two-phase approach, it is possible to obtain a much better running time than by using a single-phase contraction approach. The key idea behind this improvement is that since most of the error probability is concentrated in the last contractions, it is better to stop the contraction process when the underlying graph is "small enough", and then use conventional algorithms in order to determine the minimum cut. This combination approach is theoretically more efficient than any other known algorithm.
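The choice 𝑡 = √𝑛 can be checked by balancing the two terms of the running time bound stated above:

$$\frac{n^3}{t^2} = t^4 \;\Longrightarrow\; t^6 = n^3 \;\Longrightarrow\; t = \sqrt{n}, \qquad\text{giving}\qquad \frac{n^3}{(\sqrt{n})^2} + (\sqrt{n})^4 = n^2 + n^2 = O(n^2).$$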

2.2 Multi-way Graph Partitioning

The multi-way graph partitioning problem is significantly more difficult, and is NP-hard [21]. In this case, we wish to partition a graph into 𝑘 > 2 components, so that the total weight of the edges whose ends lie in different partitions is minimized. A well known technique for graph partitioning is the Kernighan-Lin algorithm [26]. This classical algorithm is based on a hill-climbing (or, more generally, neighborhood-search) technique for determining the optimal graph partitioning. Initially, we start off with a random cut of the graph. In each iteration, we exchange a pair of vertices in two partitions to see if the overall cut value is reduced. In the event that the cut value is reduced, the interchange is performed. Otherwise, we pick another pair of vertices in order to perform the interchange. This process is repeated until we converge to an optimal solution. We note that this optimum may not be a global optimum, but may only be a local optimum of the underlying data. The main variation in different versions of the Kernighan-Lin algorithm is the policy which is used for performing the interchanges on the vertices. Some examples of strategies which may be used in order to perform the interchange are as follows:

We randomly pick a pair of vertices and perform the interchange, if it improves the underlying solution quality.

We test all possible vertex-pair interchanges (or a sample of possible interchanges), and pick the interchange which improves the solution by the greatest amount.

A 𝑘-interchange is one in which a sequence of 𝑘 interchanges is performed at one time. We can test any 𝑘-interchange and perform it, if it improves the underlying solution quality.

We can pick the optimal 𝑘-interchange from a sample of possibilities.

We note that the use of more sophisticated strategies allows a better improvement in the objective function for each interchange, but also requires more time for each interchange. For example, the determination of an optimal 𝑘-interchange requires much more time than a straightforward 𝑘-interchange. This is a natural tradeoff which may work out differently depending upon the nature of the application at hand. Furthermore, the choice of the policy also affects the likelihood of getting stuck at a local optimum. For example, the use of 𝑘-interchange techniques is far less likely to result in a local optimum for larger values of 𝑘. In fact, by choosing the best interchange across all possible values of 𝑘, it is possible to ensure that a global optimum is always reached. On the other hand, it becomes increasingly difficult to implement the algorithm efficiently with increasing value of 𝑘. This is because the time-complexity of the interchange increases exponentially with the value of 𝑘. A detailed survey on different methods for optimal graph partitioning may be found in [18].
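A minimal Python sketch of the hill-climbing interchange idea is given below, using only the simplest of the policies listed above (randomly pick a vertex pair and keep the interchange only if it reduces the cut value). The data structures, names, and stopping rule are illustrative; this is not the full Kernighan-Lin procedure, which uses more sophisticated gain-based interchange policies.

```python
import random

def cut_weight(weights, partition):
    """Total weight of edges whose end points lie in different partitions."""
    return sum(w for (u, v), w in weights.items() if partition[u] != partition[v])

def interchange_partitioning(weights, nodes, k=2, max_failures=1000):
    """Hill-climbing vertex-interchange heuristic for k-way partitioning.

    weights: dict mapping an undirected edge (u, v) -> edge weight.
    Starts from a random balanced cut and repeatedly swaps random vertex
    pairs from different partitions, keeping only improving interchanges.
    May converge to a local rather than a global optimum, as noted above.
    """
    nodes = list(nodes)
    random.shuffle(nodes)
    partition = {u: i % k for i, u in enumerate(nodes)}    # random initial cut
    best = cut_weight(weights, partition)
    failures = 0
    while failures < max_failures:
        u, v = random.sample(nodes, 2)
        if partition[u] == partition[v]:
            continue                       # interchange only across partitions
        partition[u], partition[v] = partition[v], partition[u]
        new_cut = cut_weight(weights, partition)
        if new_cut < best:
            best, failures = new_cut, 0    # keep the improving interchange
        else:
            partition[u], partition[v] = partition[v], partition[u]   # undo it
            failures += 1
    return partition, best
```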

2.3 Conventional Generalizations and Network Structure Indices

Two well known (and related) techniques for clustering in the context of multi-dimensional data [24] are the 𝑘-medoid and 𝑘-means algorithms. In the 𝑘-medoid algorithm (for multi-dimensional data), we sample a small number of points from the original data as seeds and assign every other data point to the closest of these seeds. The closeness may be defined based on a user-defined objective function. The objective function for the clustering is defined as the sum of the corresponding distances of data points to the corresponding seeds. In the next iteration, the algorithm interchanges one of the seeds for another randomly selected seed from the data, and checks if the quality of the objective function improves upon performing the interchange. If this is indeed the case, then the interchange is accepted. Otherwise, we do not accept the interchange and try another sample interchange. This process is repeated until the objective function does not improve over a pre-defined number of interchanges. A closely related method is the 𝑘-means method. The main difference with the 𝑘-medoid method is that we do not use representative points from the original data after the first iteration of picking the original seeds. In subsequent iterations, we use the centroid of each cluster as the seed set for the next iteration. This process is repeated until the cluster membership stabilizes.
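In symbols, the objective just described can be written as follows, where the notation 𝐷 for the data set, 𝑆 for the seed set, and 𝑑 for the distance function is introduced here purely for illustration:

$$O(S) \;=\; \sum_{x \in D} \min_{s \in S} d(x, s),$$

and an interchange of a seed is accepted only if it reduces 𝑂(𝑆).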

A method has been proposed in [35] which uses characteristics of both² the 𝑘-means and 𝑘-medoids algorithms. As in the case of the conventional partitioning algorithms, it picks 𝑘 graph nodes as seeds. The main differences from the conventional algorithms are in terms of the computation of distances (for assignment purposes), and in the determination of subsequent seeds. A natural distance function for graphs is the geodesic distance, or the smallest number of hops between a pair of nodes. In order to determine the seed set for the next iteration, we compute the local closeness centrality [20] for each cluster, and use the corresponding node as the sample seed. Thus, while this algorithm continues to use seeds from the original data set (as in the 𝑘-medoids algorithm), it uses intuitive ideas from the 𝑘-means algorithm in order to determine the identity of these seeds.
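A minimal Python sketch of a graph partitioning algorithm in this style is shown below: the seeds are graph nodes, assignment uses geodesic (hop-count) distance computed by breadth-first search, and the new seed of each cluster is its most central member (smallest total distance to the other members, a simple stand-in for local closeness centrality). The names, the unweighted-graph assumption, and the convergence test are illustrative simplifications rather than the exact procedure of [35].

```python
from collections import deque
import random

def bfs_hops(adj, source):
    """Geodesic (hop-count) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def graph_k_medoids(adj, k, max_iter=20):
    """Cluster the nodes of adj (dict node -> neighbors) around k seed nodes."""
    nodes = list(adj)
    seeds = random.sample(nodes, k)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every node to its geodesically closest seed.
        dists = {s: bfs_hops(adj, s) for s in seeds}
        clusters = {s: [] for s in seeds}
        for u in nodes:
            closest = min(seeds, key=lambda s: dists[s].get(u, float("inf")))
            clusters[closest].append(u)
        # Seed-update step: the member closest to all other members of its
        # cluster (a closeness-centrality style criterion) becomes the seed.
        new_seeds = []
        for s, members in clusters.items():
            member_dists = {u: bfs_hops(adj, u) for u in members}
            centre = min(members, key=lambda u: sum(
                member_dists[u].get(v, float("inf")) for v in members))
            new_seeds.append(centre)
        if set(new_seeds) == set(seeds):   # medoid set has stabilized
            break
        seeds = new_seeds
    return clusters
```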

There are some subtle challenges in the use of the graphical versions of distance-based clustering algorithms. One challenge is that since distances are integers, it is possible for data points to be equidistant to several seeds. While ties can be resolved by randomly selecting one of the best assignments, this may result in clusterings which do not converge. In order to handle this instability, a more relaxed threshold is imposed on the number of medoids which may change from iteration to iteration. Specifically, a clustering is considered stable when the change between iterations is below a certain threshold (say, 1 to 3%).

Another challenge is that the computation of geodesic distances can be very challenging. The computational complexity of the all-pairs shortest paths algorithm can be 𝑂(𝑛³), where 𝑛 is the number of nodes. Even pre-storage of all-pairs shortest paths can require 𝑂(𝑛²) time. This is computationally not feasible in most practical scenarios, especially when the underlying graphs are large. Even the space requirement can be infeasible for very large graphs.

² In [35], the method has been proposed as a generalization of the 𝑘-medoid algorithm. However, it actually uses characteristics of both the 𝑘-means and 𝑘-medoid algorithms, since it uses centrality notions in the determination of subsequent seeds.


In order to handle such cases, the method in [36] uses the concept of network-structure indices, which can summarize the behavior of the network by using randomized division into zones.

In this case, the graph is divided into multiple zones. The set of zones forms a connected, mutually exclusive and exhaustive partitioning of the graph. The partitioning of the graph into zones is accomplished with the use of a competitive flooding algorithm. In this algorithm, we start off with randomly selected seeds which are labeled by zone identification; we then randomly select an unlabeled neighbor of a currently labeled node, and add a label which matches the current value of that labeled node. This approach is repeated until all nodes have been labeled. We note that while this approach is extremely fast, it may sometimes result in zones which do not reflect locality well. In order to deal with this situation, we use multiple sets of randomly selected partitions. Each of these partitions is considered a dimension. Note that when we use multiple such random partitions, each node becomes distinguishable from other nodes by virtue of its membership.

The distance between a node 𝑖 and a zone containing node 𝑗 is denoted as ZoneDistance(𝑖, zone(𝑗)), and is defined as the shortest path between node 𝑖 and any node in zone(𝑗). The distance between 𝑖 and 𝑗 along a particular zone partitioning (or dimension) is approximated as ZoneDistance(𝑖, zone(𝑗)) + ZoneDistance(𝑗, zone(𝑖)). This value is then averaged over all the sets of randomized partitions in order to provide better robustness. It has been shown in [36] that this approach approximates pairwise distances quite well. The key observation is that the value of ZoneDistance(𝑖, zone(𝑗)) can be pre-computed in 𝑛 ⋅ 𝑞 space, where 𝑞 is the number of zones. For a small number of zones, this is quite efficient. Upon using 𝑟 different sets of partitions, the overall space requirement is 𝑛 ⋅ 𝑞 ⋅ 𝑟, which is much smaller than the Ω(𝑛²) space requirement of all-pairs computation, for typical values of 𝑞 and 𝑟 as suggested in [35].
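A minimal Python sketch of this construction is given below: competitive flooding produces one partition into zones, several independent partitions act as the "dimensions", and pairwise distance is approximated from pre-computed node-to-zone distances. The function names, the breadth-first flooding order, and the connected-graph assumption are illustrative; [36] describes the exact construction.

```python
from collections import deque
import random

def competitive_flooding(adj, num_zones):
    """Grow zones from random seeds until every node carries a zone label.

    Assumes a connected graph, so that the flooding reaches all nodes.
    """
    seeds = random.sample(list(adj), num_zones)
    zone = {s: z for z, s in enumerate(seeds)}
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        neighbors = list(adj[u])
        random.shuffle(neighbors)          # randomized competitive growth
        for v in neighbors:
            if v not in zone:
                zone[v] = zone[u]          # label matches the labeled neighbor
                frontier.append(v)
    return zone

def zone_distances(adj, zone, num_zones):
    """ZoneDistance(i, z): shortest path from every node i to any node of zone z.

    One multi-source BFS per zone; the resulting table needs only n * q storage.
    """
    dist = {u: [float("inf")] * num_zones for u in adj}
    for z in range(num_zones):
        frontier = deque(u for u in adj if zone[u] == z)
        for u in frontier:
            dist[u][z] = 0
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if dist[v][z] == float("inf"):
                    dist[v][z] = dist[u][z] + 1
                    frontier.append(v)
    return dist

def approx_distance(i, j, zone_sets, dist_sets):
    """Average of ZoneDistance(i, zone(j)) + ZoneDistance(j, zone(i)) over r partitions."""
    total = sum(dist[i][zone[j]] + dist[j][zone[i]]
                for zone, dist in zip(zone_sets, dist_sets))
    return total / len(zone_sets)
```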

2.4 The Girvan-Newman Algorithm

The Girvan-Newman algorithm [23] is a divisive clustering algorithm, which is based on the concept of edge betweenness centrality. Betweenness centrality attempts to identify edges which form critical bridges between different connected components, and delete them, until a natural set of clusters remains. Formally, betweenness centrality is defined as the proportion of shortest paths between nodes which pass through a certain edge. Therefore, for a given edge 𝑒, we define the betweenness centrality 𝐵(𝑒) as follows:

$$B(e) \;=\; \sum_{i,j} \frac{\mathrm{NumConstrainedPaths}(e, i, j)}{\mathrm{NumShortestPaths}(i, j)},$$

where NumConstrainedPaths(𝑒, 𝑖, 𝑗) denotes the number of shortest paths between 𝑖 and 𝑗 which pass through the edge 𝑒, and NumShortestPaths(𝑖, 𝑗) denotes the total number of shortest paths between 𝑖 and 𝑗.
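A minimal Python sketch of the divisive procedure introduced here is shown below. It assumes the NetworkX library is available for the edge-betweenness and connected-component computations (an assumption made for brevity, not something stated in the text): the edge of highest betweenness centrality is repeatedly deleted until the desired number of connected components remains.

```python
import networkx as nx

def girvan_newman_clusters(graph, target_clusters):
    """Divisive clustering by repeated removal of the highest-betweenness edge."""
    g = graph.copy()
    components = list(nx.connected_components(g))
    while g.number_of_edges() > 0 and len(components) < target_clusters:
        # Edge betweenness: proportion of pairwise shortest paths through each edge.
        betweenness = nx.edge_betweenness_centrality(g)
        bridge = max(betweenness, key=betweenness.get)
        g.remove_edge(*bridge)             # delete the critical "bridge" edge
        components = list(nx.connected_components(g))
    return components

# Example: two triangles joined by a single bridge edge split into two clusters.
g = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)])
print(girvan_newman_clusters(g, 2))        # e.g. [{1, 2, 3}, {4, 5, 6}]
```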
