


Algorithm 21: ORIGAMI(𝐷, min_sup, 𝛼, 𝛽)

Input: graph dataset 𝐷, minimum support min_sup, 𝛼, 𝛽
Output: 𝛼-orthogonal, 𝛽-representative set ℛ

1: 𝐸𝑀 = Edge-Map(𝐷);
2: ℱ1 = Find-Frequent-Edges(𝐷, min_sup);
3: ℳ̂ = ∅;
4: while stopping_condition() ≠ true do
5:     𝑀 = Random-Maximal-Graph(𝐷, ℱ1, 𝐸𝑀, min_sup);
6:     ℳ̂ = ℳ̂ ∪ {𝑀};
7: ℛ = Orthogonal-Representative-Sets(ℳ̂, 𝛼, 𝛽);
8: return ℛ;

4.2 Randomized Maximal Subgraph Mining

As the first step, ORIGAMI mines a set of maximal subgraphs, from which the 𝛼-orthogonal, 𝛽-representative graph pattern set is generated. This is based on the observation that the number of maximal frequent subgraphs is much smaller than the number of frequent subgraphs, and that the maximal subgraphs provide, to some extent, a synopsis of the frequent ones. It is therefore reasonable to mine the representative orthogonal pattern set from the maximal subgraphs rather than from all frequent ones. However, even mining all maximal subgraphs can be infeasible in some real-world applications. To avoid this problem, ORIGAMI first finds a sample ℳ̂ of the complete set of maximal frequent subgraphs ℳ. The goal is to find a set of maximal subgraphs, ℳ̂, which is as diverse as possible.

To achieve this goal, ORIGAMI avoids using combinatorial enumeration to mine maximal subgraph patterns. Instead, it adopts a random walk approach to enumerate a diverse set of maximal subgraphs from the positive border of such maximal patterns. The randomized mining algorithm starts with an empty pattern and iteratively adds a random edge during each extension, until a maximal subgraph 𝑀 is generated and no more edges can be added. This process walks a random chain in the partial order of frequent subgraphs. To extend an intermediate pattern 𝑆𝑘 ⊆ 𝑀, it chooses a random vertex 𝑣 from which the extension will be attempted. Then a random edge 𝑒 incident on 𝑣 is selected for extension. If no such edge is found, no extension is possible from that vertex. When no vertex of 𝑆𝑘 admits any further extension, the random walk terminates and 𝑆𝑘 = 𝑀 is the maximal graph. On the other hand, if a random edge 𝑒 is found, the other endpoint 𝑣′ of this edge is randomly selected. By adding the edge 𝑒 and its endpoint 𝑣′, a candidate subgraph pattern 𝑆𝑘+1 is generated and its support is computed. This random walk process repeats until no further extension is possible on any vertex. Then the maximal subgraph 𝑀 is returned.
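The walk can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: it explores a single host graph and hides the support computation, which ORIGAMI performs against the whole dataset 𝐷, behind a hypothetical is_frequent oracle; all function and variable names are our own.

```python
import random

def random_maximal_subgraph(adj, is_frequent):
    """One random chain walk: grow a pattern edge by edge until no
    vertex admits a further frequent extension (then it is maximal).

    adj: dict mapping vertex -> set of neighbor vertices.
    is_frequent: callable(set_of_edges) -> bool; a stand-in for the
        support test against the dataset D with threshold min_sup.
    """
    def norm(u, v):                       # undirected edge as sorted tuple
        return (u, v) if u <= v else (v, u)

    # Seed with a random frequent edge (playing the role of F1).
    seeds = list({norm(u, v) for u in adj for v in adj[u]})
    random.shuffle(seeds)
    edges = next(({e} for e in seeds if is_frequent({e})), None)
    if edges is None:
        return set()                      # no frequent edge exists
    verts = set(next(iter(edges)))

    blocked = set()                       # vertices with no extension left
    while blocked != verts:
        v = random.choice(list(verts - blocked))
        cands = [norm(v, w) for w in adj[v] if norm(v, w) not in edges]
        random.shuffle(cands)             # try random incident edges
        for e in cands:
            if is_frequent(edges | {e}):  # support check for S_{k+1}
                edges.add(e)
                verts.update(e)           # brings in the endpoint v'
                blocked.clear()           # a new vertex may unblock others
                break
        else:
            blocked.add(v)                # no extension possible from v
    return edges                          # the maximal subgraph M
```

Repeated calls to such a routine populate the sample ℳ̂ in line 6 of Algorithm 21.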

Ideally, the random chain walks would cover different regions of the pattern space and thus produce dissimilar maximal patterns. In practice, however, this may not be the case, since duplicate maximal subgraphs can be generated in the following ways: (1) multiple iterations following overlapping chains, or (2) multiple iterations following different chains but leading to the same maximal pattern. Consider a maximal subgraph 𝑀 of size 𝑛, and let 𝑒1𝑒2 ⋯ 𝑒𝑛 be a sequence of random edge extensions, corresponding to a random chain walk leading from an empty graph 𝜙 to the maximal graph 𝑀. The probability of a particular edge sequence leading from 𝜙 to 𝑀 is given as

$$P[(e_1 e_2 \cdots e_n)] = P(e_1) \prod_{i=2}^{n} P(e_i \mid e_1 \cdots e_{i-1}) \qquad (4.1)$$

Let 𝐸𝑆(𝑀) denote the set of all valid edge sequences for a graph 𝑀. The probability that a graph 𝑀 is generated in a random walk is proportional to

$$\sum_{(e_1 e_2 \cdots e_n) \in ES(M)} P[(e_1 e_2 \cdots e_n)] \qquad (4.2)$$
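As a toy illustration of Eq. (4.1) (a numeric example of our own, not from the text): if every extension step offered a uniform choice among 𝑐 admissible edges, the probability of any one particular chain of length 𝑛 would be

$$P[(e_1 e_2 \cdots e_n)] = \prod_{i=1}^{n} \frac{1}{c} = c^{-n},$$

which decays exponentially in the pattern size 𝑛. This is exactly the bias toward smaller maximal subgraphs discussed next.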

The probability of obtaining a specific maximal pattern depends on the number of chains or edge sequences leading to that pattern and on the size of the pattern. According to Eq. (4.1), as a graph grows larger, the probability of any particular edge sequence becomes smaller. This random walk approach therefore in general favors maximal subgraphs of smaller size over larger ones. To avoid generating duplicate maximal subgraphs, a termination condition is designed based on an estimate of the collision rate of the generated patterns. Intuitively, the collision rate keeps track of the number of duplicate patterns seen within the same or across different random walks.

As a random walk chain is traversed, ORIGAMI maintains the signatures of the intermediate patterns in a bounded-size hash table. As an intermediate or maximal subgraph is generated, its signature is added to the hash table and the collision rate is updated. If the collision rate exceeds a threshold 𝜖, the method can (1) abort further extension along the current path and randomly choose another path; or (2) trigger the termination condition across different walks, since a high collision rate implies that the same part of the search space is being revisited.
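A minimal sketch of this bookkeeping follows. The chapter does not specify the signature scheme or the table size, so both (SHA-1 of the sorted edge list, a 2^16-slot table) are our own assumptions.

```python
import hashlib

class CollisionTracker:
    """Bounded-size hash table of pattern signatures plus a collision rate.

    Note: with a bounded table, distinct patterns that land in the same
    slot are also counted, so the duplicate rate is slightly overestimated.
    """
    def __init__(self, capacity=1 << 16):
        self.table = [None] * capacity
        self.capacity = capacity
        self.inserts = 0
        self.collisions = 0

    def add(self, edges):
        sig = hashlib.sha1(repr(sorted(edges)).encode()).hexdigest()
        slot = int(sig, 16) % self.capacity
        self.inserts += 1
        if self.table[slot] is not None:   # something was here before
            self.collisions += 1
        self.table[slot] = sig

    def rate(self):
        return self.collisions / max(self.inserts, 1)

# Inside the walks: call tracker.add(pattern_edges) after each intermediate
# or maximal pattern; once tracker.rate() exceeds epsilon, either abandon
# the current path or stop launching new random walks altogether.
```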

4.3 Orthogonal Representative Set Generation

Given a set of maximal subgraphs ℳ̂, the next step is to extract an 𝛼-orthogonal, 𝛽-representative set from it. We can construct a meta-graph Γ(ℳ̂) to measure similarity between the graph patterns in ℳ̂, in which each node represents a maximal subgraph pattern and an edge exists between two nodes if their similarity is bounded by 𝛼. The problem of finding an 𝛼-orthogonal pattern set can then be modeled as finding a maximal clique in the similarity graph Γ(ℳ̂).

For a given 𝛼, there can be multiple 𝛼-orthogonal pattern sets as feasible solutions. We can use the size of the residue set to measure the goodness of an 𝛼-orthogonal set: an optimal 𝛼-orthogonal, 𝛽-representative set is one which minimizes the size of the residue set. [10] proved that this problem is NP-hard.

Given this hardness result, ORIGAMI resorts to an approximation algorithm which guarantees local optimality. The algorithm starts with a random maximal clique in the similarity graph Γ(ℳ̂) and tries to improve it. At each state transition, another maximal clique which is a local neighbor of the current maximal clique is chosen. If the new state has a better solution, it is accepted as the current state and the process continues. The process terminates when all neighbors of the current state have equal or larger residue sizes. Two maximal cliques of sizes 𝑚 and 𝑛 (assume 𝑚 ≥ 𝑛) are considered neighbors if they share exactly 𝑛 − 1 vertices. The state transition procedure selectively removes one vertex from the maximal clique of the current state and then expands it to obtain another maximal clique which satisfies the neighborhood constraint. A sketch of this local search appears below.
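The following Python sketch is our own rendering under stated assumptions (patterns are hashable objects; the similarity function sim and the residue-set objective residue_size are supplied from outside), not the ORIGAMI code.

```python
import random

def orthogonal_representative_set(patterns, sim, alpha, residue_size):
    """Local search over maximal cliques of the similarity meta-graph.

    patterns: list of (hashable) maximal subgraph patterns.
    sim: callable(p, q) -> similarity score.
    alpha: two patterns are 'orthogonal' if sim(p, q) <= alpha.
    residue_size: callable(set_of_patterns) -> size of the residue set,
        the beta-representativeness objective being minimized.
    """
    # Meta-graph: connect two patterns whose similarity is bounded by alpha.
    nbrs = {p: {q for q in patterns if q != p and sim(p, q) <= alpha}
            for p in patterns}

    def expand_to_maximal_clique(clique):
        # Greedily add random common neighbors until no more fit.
        rest = (set.intersection(*(nbrs[v] for v in clique))
                if clique else set(patterns))
        clique = set(clique)
        while rest:
            v = random.choice(list(rest))
            clique.add(v)
            rest &= nbrs[v]
        return clique

    current = expand_to_maximal_clique({random.choice(patterns)})
    while True:
        # Neighbor states: drop one vertex, then re-expand to a maximal
        # clique (it shares all remaining vertices with the current one).
        for v in list(current):
            candidate = expand_to_maximal_clique(current - {v})
            if residue_size(candidate) < residue_size(current):
                current = candidate     # strictly better: accept, continue
                break
        else:
            return current              # local optimum: no better neighbor
```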

5 Conclusions

Frequent subgraph mining is one of the fundamental tasks in graph data mining. The inherent complexity of graph data causes a combinatorial explosion, so a mining algorithm may take a very long time, or even fail to terminate, on some real graph datasets.

In this chapter, we introduced several state-of-the-art methods that mine a compact set of significant or representative subgraphs without generating the complete set of graph patterns. The proposed mining and pruning techniques were discussed in detail. These methods greatly reduce the computational cost while increasing the applicability of the generated graph patterns. These research results represent significant progress in graph mining research and enable a set of new applications.

References

[1] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In Proc. 2002 SIAM Int. Conf. Data Mining (SDM’02), pages 158–174, 2002.
[2] C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. 2002 Int. Conf. Data Mining (ICDM’02), pages 211–218, 2002.
[3] B. Bringmann and S. Nijssen. What is frequent in a single graph? In Proc. 2008 Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD’08), pages 858–863, 2008.
[4] H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative frequent pattern analysis for effective classification. In Proc. 2007 Int. Conf. Data Engineering (ICDE’07), pages 716–725, 2007.
[5] Y. Chi, Y. Xia, Y. Yang, and R. Muntz. Mining closed and maximal frequent subtrees from databases of labeled rooted trees. IEEE Trans. Knowledge and Data Engineering, 17:190–202, 2005.
[6] L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD’98), pages 30–36, 1998.
[7] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent substructure-based approaches for classifying chemical compounds. IEEE Trans. Knowledge and Data Engineering, 17:1036–1050, 2005.
[8] M. Fiedler and C. Borgelt. Support computation for mining frequent subgraphs in a single graph. In Proc. 5th Int. Workshop on Mining and Learning with Graphs (MLG’07), 2007.
[9] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. 2nd European Conf. Computational Learning Theory, pages 23–37, 1995.
[10] M. Al Hasan, V. Chaoji, S. Salem, J. Besson, and M. J. Zaki. ORIGAMI: Mining representative orthogonal graph patterns. In Proc. 2007 Int. Conf. Data Mining (ICDM’07), pages 153–162, 2007.
[11] H. He and A. K. Singh. Efficient algorithms for mining significant substructures in graphs with quality guarantees. In Proc. 2007 Int. Conf. Data Mining (ICDM’07), pages 163–172, 2007.
[12] L. B. Holder, D. J. Cook, and S. Djoko. Substructure discovery in the Subdue system. In Proc. AAAI’94 Workshop on Knowledge Discovery in Databases (KDD’94), pages 169–180, 1994.
[13] J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. Mining spatial motifs from protein structure graphs. In Proc. 8th Int. Conf. Research in Computational Molecular Biology (RECOMB), pages 308–315, 2004.
[14] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In Proc. 2003 Int. Conf. Data Mining (ICDM’03), pages 549–552, 2003.
[15] J. Huan, W. Wang, J. Prins, and J. Yang. SPIN: Mining maximal frequent subgraphs from graph databases. In Proc. 2004 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’04), pages 581–586, 2004.
[16] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. 2000 European Symp. Principles of Data Mining and Knowledge Discovery (PKDD’00), pages 13–23, 2000.
[17] R. Jin, C. Wang, D. Polshakov, S. Parthasarathy, and G. Agrawal. Discovering frequent topological structures from graph datasets. In Proc. 2005 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’05), pages 606–611, 2005.
[18] M. Koyuturk, A. Grama, and W. Szpankowski. An efficient algorithm for detecting frequent subgraphs in biological networks. Bioinformatics, 20:i200–i207, 2004.
[19] T. Kudo, E. Maeda, and Y. Matsumoto. An application of boosting to graph classification. In Advances in Neural Information Processing Systems 17 (NIPS’04), 2004.
[20] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. 2001 Int. Conf. Data Mining (ICDM’01), pages 313–320, 2001.
[21] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11:243–271, 2005.
[22] S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. In Proc. 2004 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’04), pages 647–652, 2004.
[23] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proc. 2001 Int. Conf. Data Engineering (ICDE’01), pages 215–224, 2001.
[24] S. Ranu and A. K. Singh. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In Proc. 2009 Int. Conf. Data Engineering (ICDE’09), pages 844–855, 2009.
[25] H. Saigo, N. Krämer, and K. Tsuda. Partial least squares regression for graph mining. In Proc. 2008 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’08), pages 578–586, 2008.
[26] L. Thomas, S. Valluri, and K. Karlapalem. MARGIN: Maximal frequent subgraph mining. In Proc. 2006 Int. Conf. Data Mining (ICDM’06), pages 1097–1101, 2006.
[27] K. Tsuda. Entire regularization paths for graph data. In Proc. 2007 Int. Conf. Machine Learning (ICML’07), pages 919–926, 2007.
[28] N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data. In Proc. 2002 Int. Conf. Data Mining (ICDM’02), pages 458–465, 2002.
[29] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. Scalable mining of large disk-based graph databases. In Proc. 2004 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’04), pages 316–325, 2004.
[30] T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5:59–68, 2003.
[31] X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining significant graph patterns by scalable leap search. In Proc. 2008 ACM SIGMOD Int. Conf. Management of Data (SIGMOD’08), pages 433–444, 2008.
[32] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. 2002 Int. Conf. Data Mining (ICDM’02), pages 721–724, 2002.
[33] X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. In Proc. 2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD’03), pages 286–295, 2003.
[34] X. Yan and J. Han. Discovery of frequent substructures. In D. Cook and L. Holder (eds.), Mining Graph Data, pages 99–115, John Wiley & Sons, 2007.
[35] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In Proc. 2004 ACM SIGMOD Int. Conf. Management of Data (SIGMOD’04), pages 335–346, 2004.
[36] X. Yan, X. J. Zhou, and J. Han. Mining closed relational graphs with connectivity constraints. In Proc. 2005 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’05), pages 324–333, 2005.
[37] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proc. 2002 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD’02), pages 71–80, 2002.


A SURVEY ON STREAMING ALGORITHMS FOR MASSIVE GRAPHS

Jian Zhang

Computer Science Department

Louisiana State University

zhang@csc.lsu.edu

Abstract: Streaming is an important paradigm for handling massive graphs that are too large to fit in main memory. In the streaming computational model, algorithms are restricted to use much less space than would be needed to store the input. Furthermore, the input is accessed in a sequential fashion and can therefore be viewed as a stream of data elements. Although this restriction limits the model, algorithms exist for many graph problems in the streaming model. We survey a set of algorithms that compute graph statistics, matching and distance in a graph, and random walks. These are basic graph problems, and the algorithms that compute them may be used as building blocks in graph-data management and mining.

Keywords: streaming algorithms, massive graphs, matching, graph distance, random walks on graphs

1 Introduction

In recent years, graphs of massive size have emerged in many applications. For example, in telecommunication networks, the phone numbers that call each other form a call graph. On the Internet, the web pages and the links between them form the web graph. Also, in applications such as structured data mining, the relationships among the data items in a data set are often modeled as graphs. These graphs are massive and often have a large number of nodes and connections (edges).

New challenges arise when computing with massive graphs. It is possible to store a massive graph on a large-capacity storage device. However, large capacity comes with a price: random accesses to these devices are often quite slow (compared to random accesses in main memory). In some cases, it is not necessary (or even not possible) to store the graph. Algorithms that deal with massive graphs have to take these properties into account.

In traditional computational models, when calculating the complexity of an algorithm, all storage devices are treated alike, and their union (main memory as well as disks) is abstracted as a single memory space. Basic operations in this memory space, such as access to a random location, take an equal (constant) amount of time. However, on a real computer, access to data in main memory takes much less time than access to data on disk. Clearly, computational complexities derived using the traditional model cannot, in these cases, reflect the cost of the real computation.

To account for the differences between memory and storage types, several new computational models have been proposed. In the external memory model [44], memory is divided into two types: internal (main memory) and external (disks). Accesses to external memory are counted as one measure of the algorithm's complexity. Although access to external memory is slow and weighs heavily in this complexity measure, an external-memory algorithm can still store all the input data in external memory and make random accesses to it.

Compared to the external memory model, the streaming model of computation takes the difference between main memory and storage devices to a new level: it completely eliminates random access to the input data. In the streaming model, main memory is viewed as a workspace in which the computation sketches temporary results and performs frequent random accesses. The input to the computation, however, can only be accessed in a sequential fashion, i.e., as a stream of data items. (The input may be stored on some device and accessed sequentially, or the input itself may be a stream of data items. For example, in the Internet routing system, routers forward packets at such a high speed that there may not be enough time to store the packets on slow storage devices.) The size of the workspace is much smaller than the input data. In many cases, it is also expected that the processing of each data item in the stream takes a small amount of time, so that the computation finishes in near-linear time (with respect to the size of the input).

The streaming computation model is different from sampling. In the sampling model, the computation is allowed to perform random accesses to the input; with these accesses, it takes a few samples from the input and then computes on these samples. In some cases, the computation may not see the whole input. In the streaming model, the computation may access the input only in a sequential fashion, but the whole input can be viewed in this way. A non-adaptive sampling algorithm can be trivially transformed into a streaming algorithm, as the sketch below illustrates. However, an adaptive sampling algorithm that takes advantage of the random access allowed by the sampling model cannot be directly adapted to the streaming model.
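For instance, keeping a uniform random sample of 𝑘 stream elements is non-adaptive, and the classic reservoir-sampling technique (a textbook method, used here only as our own illustration) realizes it in one sequential pass with O(𝑘) workspace:

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a one-pass stream."""
    sample = []
    for t, item in enumerate(stream):
        if t < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = random.randrange(t + 1)   # uniform in {0, ..., t}
            if j < k:
                sample[j] = item          # keep item with prob. k/(t+1)
    return sample

# e.g., sample 3 edges from an edge stream that is seen only once, in order
edges = iter([(1, 2), (2, 3), (3, 4), (1, 4), (2, 4), (3, 5)])
print(reservoir_sample(edges, k=3))
```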

The streaming model was formally proposed in [29]. Before [29], however, there were studies [30, 25] that considered similar models without using the term stream or streaming. Since [29], there has been a large body of work on algorithms and complexity in this model ([3, 24, 31, 27, 26, 37, 32], to name a few examples). Some example problems considered in the streaming model are computing statistics, norms, and histogram constructions. Muthu's book [37] gives an excellent elaboration on the topics of general streaming algorithms and applications.

In this survey, we consider streaming algorithms for graph problems. It is no surprise that the computational model is limited; some graph problems therefore cannot be solved in this model. However, there are still many graph problems (e.g., graph connectivity [22, 45] and spanning trees [22]) for which one can obtain solutions or approximation algorithms. We will survey approximation algorithms for computing graph statistics, matching and distance in a graph, and random walks on a graph in the streaming model. We remark that this is not a comprehensive survey covering all the graph problems that have been considered in the streaming model; rather, we focus on the set of aforementioned topics. Many of the papers cited here also give lower bounds for the problems as well as algorithms; we will focus on the algorithms and omit discussion of the lower bounds. Finally, we remark that although these algorithms are not direct graph mining algorithms, they solve basic graph-theoretic problems and may be utilized in massive graph mining and management systems. As a first concrete example, a one-pass connectivity sketch is given below.
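The sketch below is a folklore one-pass connectivity algorithm (our own illustration in the spirit of, but not identical to, the algorithms cited above): a union-find structure over the 𝑛 vertices answers connectivity with O(𝑛 log 𝑛) bits of workspace, however many edges stream by.

```python
def count_components(n, edge_stream):
    """One sequential pass over the edges; workspace is O(n), not O(m)."""
    parent = list(range(n))

    def find(x):                          # find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for u, v in edge_stream:              # each edge is seen exactly once
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv               # merge the two components
            components -= 1
    return components

# The graph is connected iff exactly one component remains.
print(count_components(5, iter([(0, 1), (1, 2), (3, 4)])))   # prints 2
```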

2 Streaming Model for Massive Graphs

We give a detailed description of the streaming model in this section. Streaming is a general computation model and is not just for graphs. However, since this survey concerns only graph problems and algorithms, we will focus on graph streaming computation. We consider mainly undirected graphs. The graphs can be weighted; i.e., there is a weight function 𝑤 : 𝐸 → ℝ⁺ that assigns a non-negative weight to each edge. We denote by 𝐺(𝑉, 𝐸) a graph 𝐺 with vertex (node) set 𝑉 = {𝑣1, 𝑣2, …, 𝑣𝑛} and edge set 𝐸 = {𝑒1, 𝑒2, …, 𝑒𝑚}, where 𝑛 is the number of vertices and 𝑚 the number of edges.

Definition 13.1. A graph stream is a sequence of edges $e_{i_1}, e_{i_2}, \dots, e_{i_m}$, where $e_{i_j} \in E$ and $i_1, i_2, \dots, i_m$ is an arbitrary permutation of $[m] = \{1, 2, \dots, m\}$.

While an algorithm goes through the stream, the graph is revealed one edge at a time. The edges may be presented in any order. There are variants of the graph stream in which the adjacency matrix or the adjacency list of the graph is presented as a stream. In such cases, the edges incident to each vertex are grouped together in the stream (this is sometimes called the incident stream). Definition 13.1 is more general and accounts for graphs whose edges may be generated in an arbitrary order.
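To make the definition concrete, here is a small illustration of our own: a one-pass degree counter computes the same statistics no matter which permutation of the edges the stream presents.

```python
from collections import Counter

def degree_counts(edge_stream):
    """One pass over an edge stream; O(n) counters of workspace."""
    deg = Counter()
    for u, v in edge_stream:
        deg[u] += 1
        deg[v] += 1
    return deg

stream1 = [(1, 2), (2, 3), (1, 3)]
stream2 = [(2, 3), (1, 3), (1, 2)]        # an arbitrary permutation
assert degree_counts(stream1) == degree_counts(stream2)
```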

We now summarize the definitions of streaming computation in [29, 28, 16, 22] as follows. A streaming algorithm for massive graphs is an algorithm that computes some function over the graph and has the following properties:

1. The input to the streaming algorithm is a graph stream.

2. The streaming algorithm accesses the data elements (edges) in the stream in a sequential order. The order of the data elements in the stream is not controlled by the algorithm.

3. The algorithm uses a workspace that is much smaller in size than the input. It can perform unrestricted random access in the workspace. The amount of workspace required is an important complexity measure of the algorithm.

4. As the input data streams by, the algorithm needs to process each data element quickly. The time needed to process each data element in the stream is another important complexity measure of the algorithm.

5. The algorithm is restricted to access the input stream in a sequential fashion. However, it may go through the stream in multiple passes, as long as the number of passes is small. The number of passes required is the third important complexity measure of the algorithm.

These properties characterize the algorithm's behavior while it goes through the input data stream. Before this time, the algorithm may perform certain pre-processing on the workspace (but not on the input stream). After going through the data stream, the algorithm may undertake some post-processing on the workspace. The pre- and post-processing only concern the workspace and are essentially computations in the traditional model. A skeleton of this life cycle is sketched below.
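The skeleton below is our own framing of this life cycle (the class and method names are assumptions, not an API from the chapter): pre- and post-processing touch only the workspace, while the stream is consumed strictly sequentially.

```python
class StreamingAlgorithm:
    """Pre-process workspace, one sequential pass, then post-process."""

    def preprocess(self):
        self.state = {}                  # small workspace; input untouched

    def process_element(self, element):
        raise NotImplementedError        # per-element work, ideally O(1)

    def postprocess(self):
        return self.state                # final answer from workspace only

    def run(self, stream):
        self.preprocess()
        for element in stream:           # strictly sequential, single pass
            self.process_element(element)
        return self.postprocess()

class EdgeCounter(StreamingAlgorithm):
    def process_element(self, edge):
        self.state["m"] = self.state.get("m", 0) + 1

print(EdgeCounter().run(iter([(0, 1), (1, 2)])))   # prints {'m': 2}
```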

Considering the three complexity measures (the workspace size, the per-element processing time, and the number of passes the algorithm needs over the stream), an efficient streaming computation is one in which all three measures are small. For example, many streaming algorithms use polylog(𝑁) space (polylog(⋅) means 𝑂(log^𝑐(⋅)) for a constant 𝑐) when the input size is 𝑁. The streaming-clustering algorithm in [28] uses 𝑂(𝑁^𝜖) space for a small 0 < 𝜖 < 1. Many graph streaming algorithms use 𝑂(𝑛 ⋅ polylog(𝑛)) space. Some streaming algorithms process a data element in 𝑂(1) time; others may need polylog(𝑁) time. Many streaming algorithms access the input stream in one pass; there are also multiple-pass algorithms [36, 22, 40].
