Here $\text{NumConstrainedPaths}(e, i, j)$ refers to the number of (global) shortest paths between $i$ and $j$ which pass through $e$, and $\text{NumShortPaths}(i, j)$ refers to the number of shortest paths between $i$ and $j$. Note that the value of $\text{NumConstrainedPaths}(e, i, j)$ may be 0 if none of the shortest paths between $i$ and $j$ contain $e$. The algorithm ranks the edges in order of their betweenness and deletes the edge with the highest score. The betweenness coefficients are then recomputed, and the process is repeated. The set of connected components after repeated deletion forms the natural clusters. A variety of termination criteria (e.g., fixing the number of connected components) can be used in conjunction with the algorithm.
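As a concrete illustration of this divisive procedure, the following is a minimal sketch assuming the networkx library; the unweighted betweenness computation and the use of a target number of connected components as the termination criterion are illustrative assumptions rather than part of the original formulation.

import networkx as nx

def betweenness_clusters(graph, num_clusters):
    # Repeatedly delete the edge with the highest betweenness, recomputing the
    # betweenness coefficients after every deletion, until the desired number
    # of connected components (the chosen termination criterion) is reached.
    g = graph.copy()
    while nx.number_connected_components(g) < num_clusters and g.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(g)
        worst_edge = max(betweenness, key=betweenness.get)
        g.remove_edge(*worst_edge)
    return [set(component) for component in nx.connected_components(g)]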
A key issue is the efficient determination of edge-betweenness centrality. The number of paths between any pair of nodes can be exponentially large, and it would seem that the computation of the betweenness measure would be a key bottleneck. It has been shown in [36] that the network structure index can also be used in order to estimate edge-betweenness centrality effectively by pairwise node sampling.
2.5 The Spectral Clustering Method
Eigenvector techniques are often used on multi-dimensional data in order to determine the underlying correlation structure in the data. It is natural to ask whether such techniques can also be used for the more general case of graph data. It turns out that this is indeed possible with the use of a method called spectral clustering.
In the spectral clustering method, we make use of the node-node adjacency matrix of the graph. For a graph containing $n$ nodes, let us assume that we have an $n \times n$ adjacency matrix, in which the entry $(i, j)$ corresponds to the weight of the edge between the nodes $i$ and $j$. This essentially corresponds to the similarity between nodes $i$ and $j$. This entry is denoted by $w_{ij}$, and the corresponding matrix is denoted by $W$. This matrix is assumed to be symmetric, since we are working with undirected graphs. Therefore, we assume that $w_{ij} = w_{ji}$ for any pair $(i, j)$. All diagonal entries of the matrix $W$ are assumed to be 0. As discussed earlier, the aim of any node partitioning algorithm is to minimize (a function of) the weights across the partitions. The spectral clustering method constructs this minimization function in terms of the matrix structure of the adjacency matrix, and another matrix which is referred to as the degree matrix.

The degree matrix $D$ is simply a diagonal matrix, in which all entries are zero except for the diagonal values. The diagonal entry $d_{ii}$ is equal to the sum of the weights of the incident edges. In other words, the entry $d_{ij}$ is defined as follows:
$$d_{ij} = \begin{cases} \sum_{k=1}^{n} w_{ik} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
We formally define the Laplacian matrix as follows:

Definition 9.2 (Laplacian Matrix) The Laplacian matrix $L$ is defined by subtracting the weighted adjacency matrix from the degree matrix. In other words, we have:

$$L = D - W$$
This matrix encodes the structural behavior of the graph effectively, and its eigenvector behavior can be used in order to determine the important clusters in the underlying graph structure. It can be shown that the Laplacian matrix $L$ is positive semi-definite, i.e., for any $n$-dimensional row vector $f = [f_1 \ldots f_n]$, we have $f \cdot L \cdot f^T \geq 0$. This can be easily shown by expressing $L$ in terms of its constituent entries, which are a function of the corresponding weights $w_{ij}$. Upon expansion, it can be shown that:
$$f \cdot L \cdot f^T = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2 \qquad (9.5)$$
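For completeness, the expansion can be sketched in one line, using $L = D - W$, $d_{ii} = \sum_{j=1}^{n} w_{ij}$, and the symmetry $w_{ij} = w_{ji}$:

$$f \cdot L \cdot f^T = \sum_{i=1}^{n} d_{ii} f_i^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} f_i f_j = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\left(f_i^2 - 2 f_i f_j + f_j^2\right) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} (f_i - f_j)^2,$$

where the first sum has been split symmetrically over $i$ and $j$ before the terms are combined.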
We summarize as follows.

Lemma 9.3 The Laplacian matrix $L$ is positive semi-definite. Specifically, for any $n$-dimensional row vector $f = [f_1 \ldots f_n]$, we have:
$$f \cdot L \cdot f^T = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2$$
At this point, let us examine some interpretations of the vector $f$ in terms of the underlying graph partitioning. Let us consider the case in which each $f_i$ is drawn from the set $\{0, 1\}$; this determines a two-way partition by labeling each node either 0 or 1. The particular partition to which the node $i$ belongs is defined by the corresponding label. Note that the expansion of the expression $f \cdot L \cdot f^T$ from Lemma 9.3 simply represents the sum of the weights of the edges across the partition defined by $f$. Thus, the determination of an appropriate value of $f$ for which the function $f \cdot L \cdot f^T$ is minimized also provides us with a good node partitioning. Unfortunately, it is not easy to determine the discrete values of $f$ which define this optimum partitioning. Nevertheless, we will see later in this section that even when we restrict $f$ to real values, this provides us with the intuition necessary to create an effective partitioning.
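As a small numerical illustration of this interpretation (a sketch assuming numpy), consider a four-node graph and a 0/1 labeling; the quadratic form recovers the total weight of the edges crossing the cut:

import numpy as np

# Unweighted 4-node graph with edges (0,1), (0,2), (1,2) and (2,3).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W
f = np.array([0.0, 0.0, 1.0, 1.0])   # partition {0, 1} versus {2, 3}
# The edges crossing the cut are (0,2) and (1,2), with total weight 2.
assert f @ L @ f == 2.0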
An immediate observation is that the indicator vector $f = [1 \ldots 1]$ is an eigenvector with a corresponding eigenvalue of 0. We note that $f = [1 \ldots 1]$ must be an eigenvector, since $L$ is positive semi-definite and $f \cdot L \cdot f^T$ can be 0 only for eigenvectors with 0 eigenvalues. This observation can be generalized further in order to determine the number of connected components in the graph. We make the following observation.
Lemma 9.4 The number of (linearly independent) eigenvectors with zero eigenvalues for the Laplacian matrix $L$ is equal to the number of connected components in the underlying graph.
Proof: Without loss of generality, we can order the vertices according to the particular connected component to which they belong. In this case, the Laplacian matrix takes on the following block form, which is illustrated below for the case of three connected components:
$$L = \begin{bmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{bmatrix}$$
Each of the blocks $L_1$, $L_2$ and $L_3$ is itself the Laplacian of the corresponding component. Therefore, the corresponding indicator vector for that component is an eigenvector with corresponding eigenvalue 0. The result follows. □
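As a quick numerical check of Lemma 9.4 (a sketch assuming numpy and networkx), the multiplicity of the zero eigenvalue of the Laplacian matches the number of connected components:

import numpy as np
import networkx as nx

# A small graph with two connected components: {0, 1, 2} and {3, 4}.
g = nx.Graph([(0, 1), (1, 2), (3, 4)])
L = nx.laplacian_matrix(g).toarray().astype(float)
eigenvalues = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.isclose(eigenvalues, 0.0)))
assert num_zero == nx.number_connected_components(g)   # both equal 2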
We observe that connected components are the most obvious examples of clusters in the graph. Therefore, the determination of eigenvectors corresponding to zero eigenvalues provides us with information about this (relatively rudimentary) set of clusters. Broadly speaking, it may not be possible to glean such clean membership behavior from the other eigenvectors. One of the problems is that, other than this particular rudimentary set of eigenvectors (which correspond to the connected components), the vector components of the other eigenvectors are drawn from the real domain rather than the discrete $\{0, 1\}$ domain. Nevertheless, because $f \cdot L \cdot f^T$ has a natural interpretation in terms of the weights of the edges across nodes with very different values of $f_i$, it is natural to cluster together nodes for which the values of $f_i$ are, on average, as similar as possible across any particular eigenvector. This provides us with the intuition necessary to define an effective spectral clustering algorithm, which partitions the data set into $k$ clusters for any arbitrary value of $k$. The algorithm is as follows:
- Determine the $k$ eigenvectors with the smallest eigenvalues. Note that each eigenvector has as many components as the number of nodes. Let the component of the $j$th eigenvector for the $i$th node be denoted by $p_{ij}$.

- Create a new data set with as many records as the number of nodes. The $i$th record in this data set corresponds to the $i$th node and has $k$ components. The record for this node is simply the set of eigenvector components for that node, which are denoted by $p_{i1} \ldots p_{ik}$.

- Since we would like to cluster nodes with similar eigenvector components, we use any conventional clustering algorithm (e.g., $k$-means) in order to create $k$ clusters from this data set. Note that the main focus of the approach is to transform a structural clustering problem into a more conventional multi-dimensional clustering problem, which is easy to solve. The particular choice of the multi-dimensional clustering algorithm is orthogonal to the broad spectral approach.

The above algorithm provides a broad framework for the spectral clustering algorithm; a minimal code sketch is given at the end of this section. The input parameter for the above algorithm is the number of clusters $k$. In practice, a number of variations are possible in order to tune the quality of the clusters which are found. Some examples are as follows:
- It is not necessary to use the same number of eigenvectors as the input parameter for the number of clusters. In general, one should use at least as many eigenvectors as the number of clusters to be created. However, the exact number of eigenvectors to be used in order to get the optimum results may vary with the particular data set, and can be determined only by experimentation.
- There are other ways of creating normalized Laplacian matrices which can provide more effective results in some situations. Some classic examples of such Laplacian matrices, in terms of the adjacency matrix $W$, the degree matrix $D$ and the identity matrix $I$, are defined as follows:

$$L_A = I - D^{-1/2} \cdot W \cdot D^{-1/2}$$
$$L_B = I - D^{-1} \cdot W$$

More details on the different methods which can be used for effective spectral graph clustering may be found in [9].
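The overall framework described above can be summarized in the following minimal sketch, assuming numpy and scikit-learn are available; it uses the unnormalized Laplacian $L = D - W$ and exactly $k$ eigenvectors, both of which may be varied as discussed in the points above.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    # W: symmetric n x n weighted adjacency matrix with zero diagonal.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # For a symmetric matrix, eigh returns the eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(L)
    # Embed each node using the k eigenvectors with the smallest eigenvalues.
    embedding = eigenvectors[:, :k]
    # Any conventional multi-dimensional clustering method can be used; k-means here.
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)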
2.6 Determining Quasi-Cliques
A different way of determining dense subgraphs in the underlying data is that of determining quasi-cliques. This technique is different from many other partitioning algorithms, in that it focuses on definitions which maximize edge densities within a partition, rather than minimizing edge densities across partitions. A clique is a graph in which every pair of nodes is connected by an edge. A quasi-clique is a relaxation of this concept, and is defined by imposing a lower bound on the degree of each vertex in the given set of nodes. Specifically, we define a $\gamma$-quasiclique as follows:
Definition 9.5 A $k$-graph ($k \geq 1$) $G$ is a $\gamma$-quasiclique if the degree of each node in the corresponding sub-graph of vertices is at least $\gamma \cdot k$.
The value of $\gamma$ always lies in the range $(0, 1]$. We note that by choosing $\gamma = 1$, this definition reverts to that of standard cliques. Choosing lower values of $\gamma$ allows for relaxations which are more realistic for real applications. This is because we rarely encounter complete cliques in real applications, and at least some edges within a dense subgraph will typically be missing. A vertex is said to be critical if its degree in the corresponding subgraph is the smallest integer which is at least equal to $\gamma \cdot k$.
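A direct check of Definition 9.5 can be sketched as follows. The adjacency structure is assumed to be a dictionary mapping each node to the set of its neighbors, and the degree bound follows the definition exactly as stated above; note that some formulations instead require degree at least $\gamma \cdot (k - 1)$, so that $\gamma = 1$ recovers an exact clique.

def is_gamma_quasiclique(adjacency, nodes, gamma):
    # nodes: candidate vertex set; adjacency: node -> set of neighboring nodes.
    k = len(nodes)
    node_set = set(nodes)
    for v in node_set:
        # Degree of v restricted to the subgraph induced on node_set.
        degree_in_subgraph = len(adjacency.get(v, set()) & node_set)
        if degree_in_subgraph < gamma * k:
            return False
    return True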
The earliest piece of work on this problem is from [1]. The work in [1] uses a greedy randomized adaptive search algorithm, GRASP, to find a quasi-clique of maximum size. A closely related problem is that of finding frequently occurring cliques in multiple data sets. In other words, when multiple graphs are obtained from different data sets, some dense subgraphs occur frequently together in the different data sets. Such graphs help in determining important dense patterns of behavior in different data sources. Such techniques find applicability in mining important patterns in graphical representations of customers. The techniques are also helpful in mining cross-graph quasi-cliques in gene expression data. A description of the application of the technique to the problem of gene-expression data may be found in [33]. An efficient algorithm for determining cross-graph quasi-cliques was proposed in [32]. The main restriction of the work proposed in [32] is that the support threshold for the algorithms is assumed to be 100%. This restriction has been relaxed in subsequent work [43]. The work in [43] examines the problem of mining frequent closed quasi-cliques from a graph database with arbitrary support thresholds. In [31], a multi-graph version of the quasi-clique problem was explored. However, instead of finding the complete set of quasi-cliques in the graph, an approximation algorithm was proposed to cover all the vertices in the graph with a minimum number of $p$-quasi-complete subgraphs. Thus, this technique is more suited to summarization of the overall graph with a smaller number of densely connected subgraphs.
2.7 The Case of Massive Graphs
A closely related problem is that of dense subgraph determination in massive graphs. This problem is frequently encountered in large graph data sets. For example, the problem of determining large dense subgraphs of web graphs was studied in [5, 22]. A min-hash approach was first used in [5] in order to determine syntactically related clusters. This paper also introduces the advantages of using a min-hash approach in the context of graph clustering. Subsequently, the approach was generalized to the case of large dense graphs with the use of recursive application of the basic min-hash algorithm.
The broad idea in the min-hash approach is to represent the outlinks of a particular node as sets. Two nodes are considered similar if they share many outlinks. Thus, consider a node $A$ with an outlink set $S_A$ and a node $B$ with outlink set $S_B$. Then the similarity between the two nodes is defined by the Jaccard coefficient, which is defined as $\frac{|S_A \cap S_B|}{|S_A \cup S_B|}$. We note that explicit enumeration of all the edges in order to compute this can be computationally inefficient.
Rather, a min-hash approach is used in order to perform the estimation. This min-hash approach is as follows. We sort the universe of nodes in a random order. For any such random ordering, we determine the first node $\text{First}(A)$ for which an outlink exists from $A$ to $\text{First}(A)$. We also determine the first node $\text{First}(B)$ for which an outlink exists from $B$ to $\text{First}(B)$. It can be shown that the probability that $\text{First}(A)$ and $\text{First}(B)$ are the same node is equal to the Jaccard coefficient, so the fraction of random orderings for which the two coincide is an unbiased estimate of this coefficient. By repeating this process over different permutations of the universe of nodes, it is possible to estimate the Jaccard coefficient accurately. This is done by using a constant number $c$ of permutations of the node order. The actual permutations are implemented by associating $c$ different randomized hash values with each node. This creates $c$ sets of hash values of size $n$. The sort order for any particular set of hash values defines the corresponding permutation order. For each such permutation, we store the minimum node index of the outlink set. Thus, for each node, there are $c$ such minimum indices. This means that, for each node, a fingerprint of size $c$ can be constructed. By comparing the fingerprints of two nodes, the Jaccard coefficient can be estimated. This approach can be further generalized with the use of every $s$-element set contained entirely within $S_A$ and $S_B$. Thus, the above description is the special case when $s$ is set to 1. By using different values of $s$ and $c$, it is possible to design an algorithm which distinguishes between two sets that are above or below a certain threshold of similarity.
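A minimal sketch of the fingerprinting scheme for the special case $s = 1$ is given below; simulating the $c$ random permutations with salted hash values is an illustrative implementation choice.

import random

def minhash_fingerprint(outlinks, c=64, seed=0):
    # outlinks: a non-empty set of node identifiers. Each salt simulates one random
    # permutation; the minimum salted hash plays the role of First(.) for that permutation.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(c)]
    return [min(hash((salt, node)) for node in outlinks) for salt in salts]

def estimate_jaccard(fingerprint_a, fingerprint_b):
    # Fraction of simulated permutations on which the two minima coincide.
    matches = sum(1 for a, b in zip(fingerprint_a, fingerprint_b) if a == b)
    return matches / len(fingerprint_a)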
The overall technique in [22] first generates a set of $c$ shingles of size $s$ for each node. The process of generating the $c$ shingles is extremely straightforward. Each node is processed independently. We use the min-wise hash function approach in order to generate subsets of size $s$ from the outlinks at each node. This results in $c$ subsets for each node. Thus, for each node, we have a set of $c$ shingles. Therefore, if the graph contains a total of $n$ nodes, the total size of this shingle fingerprint is $n \times c \times sp$, where $sp$ is the space required for each shingle. Typically $sp$ will be $O(s)$, since each shingle contains $s$ nodes. For each distinct shingle thus created, we can create a list of nodes which contain it. In general, we would like to determine groups of shingles which contain a large number of common nodes. In order to do so, the method in [22] performs a second-order shingling in which meta-shingles are created from the shingles. Thus, this further compresses the graph into a data structure of size $c \times c$. This is essentially a constant-size data structure. We note that this group of meta-shingles has the property that they contain a large number of common nodes. The dense subgraphs can then be extracted from these meta-shingles. More details on this approach may be found in [22].
The min-hash approach is frequently used for graphs which are extremely large and cannot be easily processed by conventional quasi-clique mining algorithms. Since the min-hash approach summarizes the massive graph in a small amount of space, it is particularly useful in leveraging the small-space representation for a variety of query-processing techniques. Examples of such applications include the web graph and social networks. In the case of web graphs, we wish to determine closely connected clusters of web pages with similar content. The related problem in social networks is that of finding closely related communities. The min-hash approach discussed in [5, 22] precisely helps us achieve this goal, because we can process the summarized min-hash structure in a variety of ways in order to extract the important communities from the summarized structure. More details of this approach may be found in [5, 22].
3 Clustering Graphs as Objects
In this section, we will discuss the problem of clustering entire graphs in a multi-graph database, rather than examining the node clustering problem within a single graph. Such situations are often encountered in the context of XML data, since each XML document can be regarded as a structural record, and it may be necessary to create clusters from a large number of such objects. We note that XML data is quite similar to graph data in terms of how the data is organized structurally. The attribute values can be treated as graph labels and the corresponding semi-structural relationships as the edges. It has been shown in [2, 10, 28, 29] that this structural behavior can be leveraged in order to create effective clusters.
3.1 Extending Classical Algorithms to Structural Data
Since we are examining entire graphs in this version of the clustering problem, the problem simply boils down to that of clustering arbitrary objects, where the objects in this case have structural characteristics. Many of the conventional algorithms discussed in [24] (such as $k$-means type partitional algorithms and hierarchical algorithms) can be extended to the case of graph data. The main changes required in order to extend these algorithms are as follows:
- Most of the underlying classical algorithms typically use some form of distance function in order to measure similarity. Therefore, we need appropriate measures in order to define similarity (or distances) between structural objects.

- Many of the classical algorithms (such as $k$-means) use representative objects such as centroids in critical intermediate steps. While this is straightforward in the case of multi-dimensional objects, it is much more challenging in the case of graph objects. Therefore, appropriate methods need to be designed in order to create representative objects. Furthermore, in some cases it may be difficult to create representatives in terms of single objects. We will see that it is often more robust to use representative summaries of the underlying objects.
There are two main classes of conventional techniques which have been extended to the case of structural objects. These techniques are as follows:
Structural Distance-based Approach: This approach computes structural distances between documents and uses them in order to compute clusters of documents. One of the earliest works on clustering tree-structured data is the XClust algorithm [28], which was designed to cluster XML schemas for efficient integration of large numbers of Document Type Definitions (DTDs) of XML sources. It adopts the agglomerative hierarchical clustering method, which starts with clusters of single DTDs and gradually merges the two most similar clusters into one larger cluster. The similarity between two DTDs is based on their element similarity, which can be computed according to the semantics, structure, and context information of the elements in the corresponding DTDs. One of the shortcomings of the XClust algorithm is that it does not make full use of the structure information of the DTDs, which is quite important in the context of clustering tree-like structures. The method in [7] computes similarity measures based on the structural edit distance between documents. This edit distance is used in order to compute the distances between clusters of documents.

S-GRACE is a hierarchical clustering algorithm [29]. In [29], an XML document is converted to a structure graph (or s-graph), and the distance between two XML documents is defined according to the number of common element-subelement relationships, which can capture structural similarity relationships better than the tree edit distance in some cases [29].
Structural Summary Based Approach: In many cases, it is possible to create summaries from the underlying documents. These summaries are used for creating groups of documents which are similar to these summaries. The first summary-based approach for clustering XML documents was presented in [10]. In [10], the XML documents are modeled as rooted ordered labeled trees, and a framework for clustering XML documents by using structural summaries of trees is presented. The aim is to improve algorithmic efficiency without compromising cluster quality.

A second approach for clustering XML documents is presented in [2]. This technique is a partition-based algorithm. The primary idea in this approach is to use frequent-pattern mining algorithms in order to determine the summaries of frequent structures in the data. The technique uses a $k$-means type approach in which each cluster center comprises a set of frequent patterns which are local to the partition for that cluster. The frequent patterns are mined using the documents assigned to a cluster center in the last iteration. The documents are then further re-assigned to a cluster center based on the average similarity between the document and the newly created cluster centers from the local frequent patterns. In each iteration, the document assignment and the mined frequent patterns are iteratively re-computed, until the cluster centers and document partitions converge to a final state. It has been shown in [2] that such a structural summary based approach is significantly superior to a similarity function based approach, as presented in [7]. The method of [2] is also superior to the structural approach in [10] because of its use of more robust representations of the underlying structural summaries. Since the most recent algorithm is the structural summary method discussed in [2], we will discuss it in more detail in the next section.
3.2 The XProj Approach
In this section, we will present XProj, which is a summary-based approach for clustering XML documents. The pseudo-code for clustering XML documents is illustrated in Figure 9.1. The primary approach is to use a sub-structural modification of a partition-based approach, in which the clusters of documents are built around groups of representative sub-structures. Thus, instead of the single representative of a partition-based algorithm, we use a sub-structural set representative for the structural clustering algorithm. Initially, the document set $\mathcal{D}$ is randomly divided into $k$ partitions of equal size, and the sets of structure representatives are generated by mining frequent sub-structures of size $l$ from these partitions. In each iteration, the sub-structural representatives (of a particular size, and a particular support level) of a given partition are the frequent structures from that partition. These structural representatives are used to partition the document collection, and vice versa. We note that this can be a potentially expensive operation because of the determination of frequent substructures; in the next section, we will illustrate an interesting way to speed it up. In order to actually partition the document collection, we calculate the number of nodes in a document which are covered by each sub-structural set representative. A larger coverage corresponds to a greater level of similarity. The aim of this approach is that the algorithm will determine the most important localized sub-structures over time.
Algorithm XProj(Document Set: 𝒟, Minimum Support: min_sup,
                Structural Size: l, NumClusters: k)
begin
  Initialize representative sets 𝒮_1 ... 𝒮_k;
  while (convergence criterion = false)
  begin
    Assign each document D ∈ 𝒟 to one of the sets in {𝒮_1 ... 𝒮_k}
      using the coverage-based similarity criterion;
    /* Let the corresponding document partitions be denoted by ℳ_1 ... ℳ_k; */
    Compute the frequent substructures of size l from each set ℳ_i
      using the sequential transformation paradigm;
    if (|ℳ_i| × min_sup) ≥ 1
      set 𝒮_i to the frequent substructures of size l from ℳ_i;
    /* If (|ℳ_i| × min_sup) < 1, 𝒮_i remains unchanged; */
  end;
end

Figure 9.1. The Sub-structural Clustering Algorithm (High Level Description)
This is analogous to the projected clustering approach, which determines the most important localized projections over time. Once the partitions have been computed, we use them to re-compute the representative sets. These re-computed representative sets are defined as the frequent sub-structures of size $l$ from each partition. Thus, the representative set $\mathcal{S}_i$ is defined as the sub-structural set from the partition $\mathcal{M}_i$ which has size $l$, and which has absolute support no less than $(|\mathcal{M}_i| \times \text{min\_sup})$. Thus, the newly defined representative set $\mathcal{S}_i$ also corresponds to the local structures which are defined from the partition $\mathcal{M}_i$. Note that if the partition $\mathcal{M}_i$ contains too few documents, such that $(|\mathcal{M}_i| \times \text{min\_sup}) < 1$, the representative set $\mathcal{S}_i$ remains unchanged.
Another interesting observation is that the similarity function between a document and a given representative set is defined by the number of nodes in the document which are covered by that set. This makes the similarity function more sensitive to the underlying projections in the document structures, which leads to more robust similarity calculations in most circumstances.
In order to ensure termination, we need to design a convergence criterion. One useful criterion is based on the increase of the average sub-structural self-similarity over the $k$ partitions of documents. Let the partitions of documents with respect to the current iteration be $\mathcal{M}_1 \ldots \mathcal{M}_k$, and their corresponding frequent sub-structures of size $l$ be $\mathcal{S}_1 \ldots \mathcal{S}_k$, respectively. Then, the average sub-structural self-similarity at the end of the current iteration