Here $\text{NumConstrainedPaths}(e, i, j)$ refers to the number of (global) shortest paths between $i$ and $j$ which pass through $e$, and $\text{NumShortPaths}(i, j)$ refers to the number of shortest paths between $i$ and $j$. Note that the value of $\text{NumConstrainedPaths}(e, i, j)$ may be 0 if none of the shortest paths between $i$ and $j$ contain $e$. The algorithm ranks the edges in order of their betweenness and deletes the edge with the highest score. The betweenness coefficients are then recomputed, and the process is repeated. The set of connected components after repeated deletion forms the natural clusters. A variety of termination criteria (e.g., fixing the number of connected components) can be used in conjunction with the algorithm.
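As a concrete illustration of this divisive procedure, the following is a minimal sketch assuming the networkx library; the unweighted betweenness computation and the use of a target number of connected components as the termination criterion are illustrative assumptions rather than part of the original formulation.

import networkx as nx

def betweenness_clusters(graph, num_clusters):
    # Repeatedly delete the edge with the highest betweenness, recomputing the
    # betweenness coefficients after every deletion, until the desired number
    # of connected components (the chosen termination criterion) is reached.
    g = graph.copy()
    while nx.number_connected_components(g) < num_clusters and g.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(g)
        worst_edge = max(betweenness, key=betweenness.get)
        g.remove_edge(*worst_edge)
    return [set(component) for component in nx.connected_components(g)]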
A key issue is the efficient determination of edge-betweenness centrality. The number of paths between any pair of nodes can be exponentially large, and it would seem that the computation of the betweenness measure would be a key bottleneck. It has been shown in [36] that the network structure index can also be used in order to estimate edge-betweenness centrality effectively by pairwise node sampling.
2.5 The Spectral Clustering Method
Eigenvector techniques are often used on multi-dimensional data in order to determine the underlying correlation structure in the data. It is natural to ask whether such techniques can also be used for the more general case of graph data. It turns out that this is indeed possible with the use of a method called spectral clustering.
In the spectral clustering method, we make use of the node-node adjacency matrix of the graph. For a graph containing $n$ nodes, let us assume that we have an $n \times n$ adjacency matrix, in which the entry $(i, j)$ corresponds to the weight of the edge between the nodes $i$ and $j$. This essentially corresponds to the similarity between nodes $i$ and $j$. This entry is denoted by $w_{ij}$, and the corresponding matrix is denoted by $W$. This matrix is assumed to be symmetric, since we are working with undirected graphs. Therefore, we assume that $w_{ij} = w_{ji}$ for any pair $(i, j)$. All diagonal entries of the matrix $W$ are assumed to be 0. As discussed earlier, the aim of any node partitioning algorithm is to minimize (a function of) the weights across the partitions. The spectral clustering method constructs this minimization function in terms of the matrix structure of the adjacency matrix, and another matrix which is referred to as the degree matrix.

The degree matrix $D$ is simply a diagonal matrix, in which all entries are zero except for the diagonal values. The diagonal entry $d_{ii}$ is equal to the sum of the weights of the incident edges. In other words, the entry $d_{ij}$ is defined as follows:
$$d_{ij} = \begin{cases} \sum_{k=1}^{n} w_{ik} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
We formally define the Laplacian matrix as follows:

Definition 9.2 (Laplacian Matrix) The Laplacian matrix $L$ is defined by subtracting the weighted adjacency matrix from the degree matrix. In other words, we have:

$$L = D - W$$
This matrix encodes the structural behavior of the graph effectively, and its eigenvector behavior can be used in order to determine the important clusters in the underlying graph structure. It can be shown that the Laplacian matrix $L$ is positive semi-definite, i.e., for any $n$-dimensional row vector $f = [f_1 \ldots f_n]$, we have $f \cdot L \cdot f^T \geq 0$. This can be easily shown by expressing $L$ in terms of its constituent entries, which are a function of the corresponding weights $w_{ij}$. Upon expansion, it can be shown that:
$$f \cdot L \cdot f^T = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2 \qquad (9.5)$$
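For completeness, the expansion can be sketched in one line, using $L = D - W$, $d_{ii} = \sum_{j=1}^{n} w_{ij}$, and the symmetry $w_{ij} = w_{ji}$:

$$f \cdot L \cdot f^T = \sum_{i=1}^{n} d_{ii} f_i^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} f_i f_j = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\left(f_i^2 - 2 f_i f_j + f_j^2\right) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} (f_i - f_j)^2,$$

where the first sum has been split symmetrically over $i$ and $j$ before the terms are combined.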
We summarize as follows.

Lemma 9.3 The Laplacian matrix $L$ is positive semi-definite. Specifically, for any $n$-dimensional row vector $f = [f_1 \ldots f_n]$, we have:
$$f \cdot L \cdot f^T = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2$$
At this point, let us examine some interpretations of the vector $f$ in terms of the underlying graph partitioning. Let us consider the case in which each $f_i$ is drawn from the set $\{0, 1\}$; this determines a two-way partition by labeling each node either 0 or 1. The particular partition to which the node $i$ belongs is defined by the corresponding label. Note that the expansion of the expression $f \cdot L \cdot f^T$ from Lemma 9.3 simply represents the sum of the weights of the edges across the partition defined by $f$. Thus, the determination of an appropriate value of $f$ for which the function $f \cdot L \cdot f^T$ is minimized also provides us with a good node partitioning. Unfortunately, it is not easy to determine the discrete values of $f$ which define this optimum partitioning. Nevertheless, we will see later in this section that even when we restrict $f$ to real values, this provides us with the intuition necessary to create an effective partitioning.
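As a small numerical illustration of this interpretation (a sketch assuming numpy), consider a four-node graph and a 0/1 labeling; the quadratic form recovers the total weight of the edges crossing the cut:

import numpy as np

# Unweighted 4-node graph with edges (0,1), (0,2), (1,2) and (2,3).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W
f = np.array([0.0, 0.0, 1.0, 1.0])   # partition {0, 1} versus {2, 3}
# The edges crossing the cut are (0,2) and (1,2), with total weight 2.
assert f @ L @ f == 2.0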
An immediate observation is that the indicator vector $f = [1 \ldots 1]$ is an eigenvector with a corresponding eigenvalue of 0. We note that $f = [1 \ldots 1]$ must be an eigenvector, since $L$ is positive semi-definite and $f \cdot L \cdot f^T$ can be 0 only for eigenvectors with 0 eigenvalues. This observation can be generalized further in order to determine the number of connected components in the graph. We make the following observation.
Lemma 9.4 The number of (linearly independent) eigenvectors with zero eigenvalues for the Laplacian matrix $L$ is equal to the number of connected components in the underlying graph.
Proof: Without loss of generality, we can order the vertices according to the particular connected component to which they belong. In this case, the Laplacian matrix takes on the following block form, which is illustrated below for the case of three connected components:
$$L = \begin{bmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{bmatrix}$$
Each of the blocks $L_1$, $L_2$ and $L_3$ is itself the Laplacian of the corresponding component. Therefore, the corresponding indicator vector for that component is an eigenvector with corresponding eigenvalue 0. The result follows. □
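As a quick numerical check of Lemma 9.4 (a sketch assuming numpy and networkx), the multiplicity of the zero eigenvalue of the Laplacian matches the number of connected components:

import numpy as np
import networkx as nx

# A small graph with two connected components: {0, 1, 2} and {3, 4}.
g = nx.Graph([(0, 1), (1, 2), (3, 4)])
L = nx.laplacian_matrix(g).toarray().astype(float)
eigenvalues = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.isclose(eigenvalues, 0.0)))
assert num_zero == nx.number_connected_components(g)   # both equal 2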
We observe that connected components are the most obvious examples of clusters in the graph. Therefore, the determination of eigenvectors corresponding to zero eigenvalues provides us with information about this (relatively rudimentary) set of clusters. Broadly speaking, it may not be possible to glean such clean membership behavior from the other eigenvectors. One of the problems is that, other than this particular rudimentary set of eigenvectors (which correspond to the connected components), the vector components of the other eigenvectors are drawn from the real domain rather than the discrete $\{0, 1\}$ domain. Nevertheless, because $f \cdot L \cdot f^T$ has a natural interpretation in terms of the weights of the edges across nodes with very different values of $f_i$, it is natural to cluster together nodes for which the values of $f_i$ are, on average, as similar as possible across any particular eigenvector. This provides us with the intuition necessary to define an effective spectral clustering algorithm, which partitions the data set into $k$ clusters for any arbitrary value of $k$. The algorithm is as follows:
- Determine the $k$ eigenvectors with the smallest eigenvalues. Note that each eigenvector has as many components as the number of nodes. Let the component of the $j$th eigenvector for the $i$th node be denoted by $p_{ij}$.

- Create a new data set with as many records as the number of nodes. The $i$th record in this data set corresponds to the $i$th node and has $k$ components. The record for this node is simply the set of eigenvector components for that node, which are denoted by $p_{i1} \ldots p_{ik}$.

- Since we would like to cluster nodes with similar eigenvector components, we use any conventional clustering algorithm (e.g., $k$-means) in order to create $k$ clusters from this data set. Note that the main focus of the approach is to transform a structural clustering problem into a more conventional multi-dimensional clustering problem, which is easy to solve. The particular choice of the multi-dimensional clustering algorithm is orthogonal to the broad spectral approach.

The above algorithm provides a broad framework for the spectral clustering algorithm; a minimal code sketch is given at the end of this section. The input parameter for the above algorithm is the number of clusters $k$. In practice, a number of variations are possible in order to tune the quality of the clusters which are found. Some examples are as follows:
- It is not necessary to use the same number of eigenvectors as the input parameter for the number of clusters. In general, one should use at least as many eigenvectors as the number of clusters to be created. However, the exact number of eigenvectors to be used in order to get the optimum results may vary with the particular data set, and can be determined only by experimentation.
- There are other ways of creating normalized Laplacian matrices which can provide more effective results in some situations. Some classic examples of such Laplacian matrices, in terms of the adjacency matrix $W$, the degree matrix $D$ and the identity matrix $I$, are defined as follows:

$$L_A = I - D^{-1/2} \cdot W \cdot D^{-1/2}$$
$$L_B = I - D^{-1} \cdot W$$

More details on the different methods which can be used for effective spectral graph clustering may be found in [9].
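The overall framework described above can be summarized in the following minimal sketch, assuming numpy and scikit-learn are available; it uses the unnormalized Laplacian $L = D - W$ and exactly $k$ eigenvectors, both of which may be varied as discussed in the points above.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    # W: symmetric n x n weighted adjacency matrix with zero diagonal.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # For a symmetric matrix, eigh returns the eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(L)
    # Embed each node using the k eigenvectors with the smallest eigenvalues.
    embedding = eigenvectors[:, :k]
    # Any conventional multi-dimensional clustering method can be used; k-means here.
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)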
2.6 Determining Quasi-Cliques
A different way of determining dense subgraphs in the underlying data is that of determining quasi-cliques. This technique is different from many other partitioning algorithms, in that it focuses on definitions which maximize edge densities within a partition, rather than minimizing edge densities across partitions. A clique is a graph in which every pair of nodes is connected by an edge. A quasi-clique is a relaxation of this concept, and is defined by imposing a lower bound on the degree of each vertex in the given set of nodes. Specifically, we define a $\gamma$-quasiclique as follows:
Definition 9.5 A $k$-graph ($k \geq 1$) $G$ is a $\gamma$-quasiclique if the degree of each node in the corresponding sub-graph of vertices is at least $\gamma \cdot k$.
The value of $\gamma$ always lies in the range $(0, 1]$. We note that by choosing $\gamma = 1$, this definition reverts to that of standard cliques. Choosing lower values of $\gamma$ allows for relaxations which are more realistic for real applications. This is because we rarely encounter complete cliques in real applications, and at least some edges within a dense subgraph will typically be missing. A vertex is said to be critical if its degree in the corresponding subgraph is the smallest integer which is at least equal to $\gamma \cdot k$.
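A direct check of Definition 9.5 can be sketched as follows. The adjacency structure is assumed to be a dictionary mapping each node to the set of its neighbors, and the degree bound follows the definition exactly as stated above; note that some formulations instead require degree at least $\gamma \cdot (k - 1)$, so that $\gamma = 1$ recovers an exact clique.

def is_gamma_quasiclique(adjacency, nodes, gamma):
    # nodes: candidate vertex set; adjacency: node -> set of neighboring nodes.
    k = len(nodes)
    node_set = set(nodes)
    for v in node_set:
        # Degree of v restricted to the subgraph induced on node_set.
        degree_in_subgraph = len(adjacency.get(v, set()) & node_set)
        if degree_in_subgraph < gamma * k:
            return False
    return True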
The earliest piece of work on this problem is from [1]. The work in [1] uses a greedy randomized adaptive search algorithm, GRASP, to find a quasi-clique of maximum size. A closely related problem is that of finding frequently occurring cliques in multiple data sets. In other words, when multiple graphs are obtained from different data sets, some dense subgraphs occur frequently together in the different data sets. Such graphs help in determining important dense patterns of behavior in different data sources. Such techniques find applicability in mining important patterns in graphical representations of customers. The techniques are also helpful in mining cross-graph quasi-cliques in gene expression data. A description of the application of the technique to the problem of gene-expression data may be found in [33]. An efficient algorithm for determining cross-graph quasi-cliques was proposed in [32]. The main restriction of the work proposed in [32] is that the support threshold for the algorithms is assumed to be 100%. This restriction has been relaxed in subsequent work [43]. The work in [43] examines the problem of mining frequent closed quasi-cliques from a graph database with arbitrary support thresholds. In [31], a multi-graph version of the quasi-clique problem was explored. However, instead of finding the complete set of quasi-cliques in the graph, an approximation algorithm was proposed to cover all the vertices in the graph with a minimum number of $p$-quasi-complete subgraphs. Thus, this technique is more suited to summarization of the overall graph with a smaller number of densely connected subgraphs.
2.7 The Case of Massive Graphs
A closely related problem is that of dense subgraph determination in massive graphs. This problem is frequently encountered in large graph data sets. For example, the problem of determining large dense subgraphs of web graphs was studied in [5, 22]. A min-hash approach was first used in [5] in order to determine syntactically related clusters. This paper also introduces the advantages of using a min-hash approach in the context of graph clustering. Subsequently, the approach was generalized to the case of large dense graphs with the use of recursive application of the basic min-hash algorithm.
The broad idea in the min-hash approach is to represent the outlinks of a particular node as sets. Two nodes are considered similar if they share many outlinks. Thus, consider a node $A$ with an outlink set $S_A$ and a node $B$ with outlink set $S_B$. Then the similarity between the two nodes is defined by the Jaccard coefficient, which is defined as $\frac{|S_A \cap S_B|}{|S_A \cup S_B|}$. We note that explicit enumeration of all the edges in order to compute this can be computationally inefficient.
Rather, a min-hash approach is used in order to perform the estimation. This min-hash approach is as follows. We sort the universe of nodes in a random order. For any such random ordering, we determine the first node $\text{First}(A)$ for which an outlink exists from $A$ to $\text{First}(A)$. We also determine the first node $\text{First}(B)$ for which an outlink exists from $B$ to $\text{First}(B)$. It can be shown that the probability that $\text{First}(A)$ and $\text{First}(B)$ are the same node is equal to the Jaccard coefficient, so the fraction of random orderings for which the two coincide is an unbiased estimate of this coefficient. By repeating this process over different permutations of the universe of nodes, it is possible to estimate the Jaccard coefficient accurately. This is done by using a constant number $c$ of permutations of the node order. The actual permutations are implemented by associating $c$ different randomized hash values with each node. This creates $c$ sets of hash values of size $n$. The sort order for any particular set of hash values defines the corresponding permutation order. For each such permutation, we store the minimum node index of the outlink set. Thus, for each node, there are $c$ such minimum indices. This means that, for each node, a fingerprint of size $c$ can be constructed. By comparing the fingerprints of two nodes, the Jaccard coefficient can be estimated. This approach can be further generalized with the use of every $s$-element set contained entirely within $S_A$ and $S_B$. Thus, the above description is the special case when $s$ is set to 1. By using different values of $s$ and $c$, it is possible to design an algorithm which distinguishes between two sets that are above or below a certain threshold of similarity.
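A minimal sketch of the fingerprinting scheme for the special case $s = 1$ is given below; simulating the $c$ random permutations with salted hash values is an illustrative implementation choice.

import random

def minhash_fingerprint(outlinks, c=64, seed=0):
    # outlinks: a non-empty set of node identifiers. Each salt simulates one random
    # permutation; the minimum salted hash plays the role of First(.) for that permutation.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(c)]
    return [min(hash((salt, node)) for node in outlinks) for salt in salts]

def estimate_jaccard(fingerprint_a, fingerprint_b):
    # Fraction of simulated permutations on which the two minima coincide.
    matches = sum(1 for a, b in zip(fingerprint_a, fingerprint_b) if a == b)
    return matches / len(fingerprint_a)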
The overall technique in [22] first generates a set of $c$ shingles of size $s$ for each node. The process of generating the $c$ shingles is extremely straightforward. Each node is processed independently. We use the min-wise hash function approach in order to generate subsets of size $s$ from the outlinks at each node. This results in $c$ subsets for each node. Thus, for each node, we have a set of $c$ shingles. Therefore, if the graph contains a total of $n$ nodes, the total size of this shingle fingerprint is $n \times c \times sp$, where $sp$ is the space required for each shingle. Typically $sp$ will be $O(s)$, since each shingle contains $s$ nodes. For each distinct shingle thus created, we can create a list of nodes which contain it. In general, we would like to determine groups of shingles which contain a large number of common nodes. In order to do so, the method in [22] performs a second-order shingling in which meta-shingles are created from the shingles. Thus, this further compresses the graph into a data structure of size $c \times c$. This is essentially a constant-size data structure. We note that this group of meta-shingles has the property that they contain a large number of common nodes. The dense subgraphs can then be extracted from these meta-shingles. More details on this approach may be found in [22].
The min-hash approach is frequently used for graphs which are extremely large and cannot be easily processed by conventional quasi-clique mining algorithms. Since the min-hash approach summarizes the massive graph in a small amount of space, it is particularly useful in leveraging the small-space representation for a variety of query-processing techniques. Examples of such applications include the web graph and social networks. In the case of web graphs, we wish to determine closely connected clusters of web pages with similar content. The related problem in social networks is that of finding closely related communities. The min-hash approach discussed in [5, 22] precisely helps us achieve this goal, because we can process the summarized min-hash structure in a variety of ways in order to extract the important communities from the summarized structure. More details of this approach may be found in [5, 22].
3 Clustering Graphs as Objects
In this section, we will discuss the problem of clustering entire graphs in a multi-graph database, rather than examining the node clustering problem within a single graph. Such situations are often encountered in the context of XML data, since each XML document can be regarded as a structural record, and it may be necessary to create clusters from a large number of such objects. We note that XML data is quite similar to graph data in terms of how the data is organized structurally. The attribute values can be treated as graph labels and the corresponding semi-structural relationships as the edges. It has been shown in [2, 10, 28, 29] that this structural behavior can be leveraged in order to create effective clusters.
3.1 Extending Classical Algorithms to Structural Data
Since we are examining entire graphs in this version of the clustering problem, the problem simply boils down to that of clustering arbitrary objects, where the objects in this case have structural characteristics. Many of the conventional algorithms discussed in [24] (such as $k$-means type partitional algorithms and hierarchical algorithms) can be extended to the case of graph data. The main changes required in order to extend these algorithms are as follows:
- Most of the underlying classical algorithms typically use some form of distance function in order to measure similarity. Therefore, we need appropriate measures in order to define similarity (or distances) between structural objects.

- Many of the classical algorithms (such as $k$-means) use representative objects such as centroids in critical intermediate steps. While this is straightforward in the case of multi-dimensional objects, it is much more challenging in the case of graph objects. Therefore, appropriate methods need to be designed in order to create representative objects. Furthermore, in some cases it may be difficult to create representatives in terms of single objects. We will see that it is often more robust to use representative summaries of the underlying objects.
There are two main classes of conventional techniques which have been extended to the case of structural objects. These techniques are as follows:
Structural Distance-based Approach: This approach computes structural distances between documents and uses them in order to compute clusters of documents. One of the earliest works on clustering tree-structured data is the XClust algorithm [28], which was designed to cluster XML schemas for efficient integration of large numbers of Document Type Definitions (DTDs) of XML sources. It adopts the agglomerative hierarchical clustering method, which starts with clusters of single DTDs and gradually merges the two most similar clusters into one larger cluster. The similarity between two DTDs is based on their element similarity, which can be computed according to the semantics, structure, and context information of the elements in the corresponding DTDs. One of the shortcomings of the XClust algorithm is that it does not make full use of the structure information of the DTDs, which is quite important in the context of clustering tree-like structures. The method in [7] computes similarity measures based on the structural edit distance between documents. This edit distance is used in order to compute the distances between clusters of documents.

S-GRACE is a hierarchical clustering algorithm [29]. In [29], an XML document is converted to a structure graph (or s-graph), and the distance between two XML documents is defined according to the number of common element-subelement relationships, which can capture structural similarity relationships better than the tree edit distance in some cases [29].
Structural Summary Based Approach: In many cases, it is possible to create summaries from the underlying documents. These summaries are used for creating groups of documents which are similar to these summaries. The first summary-based approach for clustering XML documents was presented in [10]. In [10], the XML documents are modeled as rooted ordered labeled trees, and a framework for clustering XML documents by using structural summaries of trees is presented. The aim is to improve algorithmic efficiency without compromising cluster quality.

A second approach for clustering XML documents is presented in [2]. This technique is a partition-based algorithm. The primary idea in this approach is to use frequent-pattern mining algorithms in order to determine the summaries of frequent structures in the data. The technique uses a $k$-means type approach in which each cluster center comprises a set of frequent patterns which are local to the partition for that cluster. The frequent patterns are mined using the documents assigned to a cluster center in the last iteration. The documents are then further re-assigned to a cluster center based on the average similarity between the document and the newly created cluster centers from the local frequent patterns. In each iteration, the document assignment and the mined frequent patterns are iteratively re-computed, until the cluster centers and document partitions converge to a final state. It has been shown in [2] that such a structural summary based approach is significantly superior to a similarity function based approach, as presented in [7]. The method of [2] is also superior to the structural approach in [10] because of its use of more robust representations of the underlying structural summaries. Since the most recent algorithm is the structural summary method discussed in [2], we will discuss it in more detail in the next section.
3.2 The XProj Approach
In this section, we will present XProj, which is a summary-based approach for clustering XML documents. The pseudo-code for clustering XML documents is illustrated in Figure 9.1. The primary approach is to use a sub-structural modification of a partition-based approach, in which the clusters of documents are built around groups of representative sub-structures. Thus, instead of the single representative of a partition-based algorithm, we use a sub-structural set representative for the structural clustering algorithm. Initially, the document set $\mathcal{D}$ is randomly divided into $k$ partitions of equal size, and the sets of structure representatives are generated by mining frequent sub-structures of size $l$ from these partitions. In each iteration, the sub-structural representatives (of a particular size, and a particular support level) of a given partition are the frequent structures from that partition. These structural representatives are used to partition the document collection, and vice versa. We note that this can be a potentially expensive operation because of the determination of frequent substructures; in the next section, we will illustrate an interesting way to speed it up. In order to actually partition the document collection, we calculate the number of nodes in a document which are covered by each sub-structural set representative. A larger coverage corresponds to a greater level of similarity. The aim of this approach is that the algorithm will determine the most important localized sub-structures over time.
Algorithm XProj(Document Set: 𝒟, Minimum Support: min_sup,
                Structural Size: l, NumClusters: k)
begin
  Initialize representative sets 𝒮_1 ... 𝒮_k;
  while (convergence criterion = false)
  begin
    Assign each document D ∈ 𝒟 to one of the sets in {𝒮_1 ... 𝒮_k}
      using the coverage-based similarity criterion;
    /* Let the corresponding document partitions be denoted by ℳ_1 ... ℳ_k; */
    Compute the frequent substructures of size l from each set ℳ_i
      using the sequential transformation paradigm;
    if (|ℳ_i| × min_sup) ≥ 1
      set 𝒮_i to the frequent substructures of size l from ℳ_i;
    /* If (|ℳ_i| × min_sup) < 1, 𝒮_i remains unchanged; */
  end;
end

Figure 9.1. The Sub-structural Clustering Algorithm (High Level Description)
This is analogous to the projected clustering approach, which determines the most important localized projections over time. Once the partitions have been computed, we use them to re-compute the representative sets. These re-computed representative sets are defined as the frequent sub-structures of size $l$ from each partition. Thus, the representative set $\mathcal{S}_i$ is defined as the sub-structural set from the partition $\mathcal{M}_i$ which has size $l$, and which has absolute support no less than $(|\mathcal{M}_i| \times \text{min\_sup})$. Thus, the newly defined representative set $\mathcal{S}_i$ also corresponds to the local structures which are defined from the partition $\mathcal{M}_i$. Note that if the partition $\mathcal{M}_i$ contains too few documents, such that $(|\mathcal{M}_i| \times \text{min\_sup}) < 1$, the representative set $\mathcal{S}_i$ remains unchanged.
Another interesting observation is that the similarity function between a document and a given representative set is defined by the number of nodes in the document which are covered by that set. This makes the similarity function more sensitive to the underlying projections in the document structures, which leads to more robust similarity calculations in most circumstances.
In order to ensure termination, we need to design a convergence criterion. One useful criterion is based on the increase of the average sub-structural self-similarity over the $k$ partitions of documents. Let the partitions of documents with respect to the current iteration be $\mathcal{M}_1 \ldots \mathcal{M}_k$, and their corresponding frequent sub-structures of size $l$ be $\mathcal{S}_1 \ldots \mathcal{S}_k$, respectively. Then, the average sub-structural self-similarity at the end of the current iteration