Scalable Data-Parallel Graph Algorithms:
from Generation to Management
Sadegh Nobari (B.Eng. (Hons.), IUST) (Ph.D., NUS)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING NATIONAL UNIVERSITY OF SINGAPORE
I hereby declare that this thesis is my original
work and it has been written by me in its entirety.
I have duly acknowledged all the sources of information
which have been used in the thesis.
This thesis has also not been submitted for any degree
in any university previously.
Sadegh Nobari
23 July 2012
The Ph.D. was a wonderful, extraordinary, once-in-a-lifetime experience.
I would like to say thanks
To my parents (Zeynab and Nader) and my only brother (Ghasem),
through whose sacrifice
my opportunities were possible
To my advisors,
Professor Stéphane Bressan,
Professors Anastasia Ailamaki, Panagiotis Karras, Panos Kalnis, Nikos Mamoulis and Yannis Velegrakis,
for patiently supporting me
To my committee,
Professors Tan Tiow Seng, Tan Kian-Lee, M. Tamer Özsu and Leong Hon Wai
for gladly suffering my impenetrable prose
To my friends,
Xuesong Lu, Song Yi, Tang Ruiming, Antoine Veillard, Quoc Trung Tran, Cao Thanh Tung, Ehsan Kazemi, Siarhei Bykau, Mohammad Oliya, Behzad Nemat Pajouh, Thomas Heinis, Clemens Lay, Reza Sherkat and
J. J. Sylvester, in 1878, in an article on chemistry and algebra in Nature, called a mathematical structure to model connections between objects a "graph". More than a century later, the versatility of graphs as a data model is demonstrated by the long list of applications in mathematics, science, engineering and the humanities.
Cormen, Leiserson, Rivest, and Stein describe the role of graphs and graph algorithms in computer science as follows in their popular textbook: "Graphs are a pervasive data structure in computer science, and algorithms working with them are fundamental to the field."
Graphs are natural data structures for modern applications. Social network data are typically represented as graphs; the semantic web is based on the RDF formalism, which is a graph model; and software models and program dependences in software engineering are represented via graphs. In many cases these are very large and dynamic graphs. The convergence of applications managing large graphs and the availability of cheap parallel processing hardware has caused a renewed interest in managing very large graphs over parallel systems.
In this dissertation, we design scalable and practical graph algorithms for a selected set of problems, from generation to management. We first propose parallel solutions for graph generation with both random and real-world graph models. Afterward, we propose techniques for processing large graphs in parallel, specifically for computing the Minimum Spanning Forest and the Shortest Path between vertices.

Chapter 3 focuses on the generation of very large graphs. The naïve algorithm for generating Erdős–Rényi graphs does not scale to large graphs. In this chapter we take a systematic approach to the development of the PPreZER algorithm by proposing a series of seven algorithms. The results of our study show that our fine-tuned algorithm, PPreZER, for generating random graph data can be executed on a typical GPU on average 19 times faster than its fastest sequential version on the CPU.

Chapter 4 moves beyond random graphs and considers the generation of real-world graphs. This chapter considers spatial datasets and the generation of graphs by taking the spatial join of the elements of two datasets. We propose an algorithm (called HiDOP) to perform this spatial join operation efficiently. Consequently, we design a data-parallel algorithm inspired by the HiDOP algorithm.
Chapters 5 and 6 cover the data management part of the thesis. Two graph algorithms, i.e., graph queries, are studied: Minimum Spanning Forest (Chapter 5) and All-Pairs Shortest Path (Chapter 6). In Chapter 5 we propose PMA, a novel data-parallel algorithm inspired by Borůvka's and Prim's MSF algorithms. PMA is experimentally shown to be superior to the state-of-the-art MSF algorithms. Chapter 6 introduces a threshold L into the problem definition of all-pairs shortest path such that only the paths with weight less than L are found; the resulting problem is called L-APSP. This threshold is advantageous when only close connections are of interest, as in large social networks.
A large number of APSP algorithms are studied; for each, a counterpart L-APSP algorithm is designed and a parallel version that exploits the GPU is proposed. In all, this dissertation proposes four scalable data-parallel algorithms for graph data processing.
Table of Contents
Acknowledgements ii
Abstract i
Table of Contents viii
List of Figures xii
List of Tables xiii
List of Algorithms xv
1 Introduction 1
1.1 Graph 1
1.2 Parallel processing 3
1.3 Contributions 5
1.4 Graph data generation 6
1.4.1 Generating random graphs 6
Application 6
Existing algorithms 7
Proposed algorithm 7
1.4.2 Generating real-world graphs 8
Application 8
Existing algorithms 8
Proposed algorithm 9
1.5 Graph data management 9
1.5.1 Finding Minimum Spanning Forest 9
Proposed algorithm 10
1.5.2 Finding Shortest Path 11
Application 12
Existing algorithms 13
Proposed algorithm 13
1.6 Overview 14
2 Parallel processing on Graphics Processing Unit (GPU) 15
2.1 Many and Multi core architectures 15
2.2 GPU Architecture 16
2.3 The CUDA and BrookGPU programming frameworks 16
2.4 SIMT: Single Instruction, Multiple Threads 17
2.5 Parallel Thread Execution (PTX) 21
2.6 GPU Memory hierarchy 22
2.7 GPU Optimizations 22
2.8 GPU empirical analysis 25
2.9 Programming the GPU 27
2.9.1 Parallel Pseudo-Random Number Generator 27
2.9.2 Parallel Prefix Sum 29
2.9.3 Parallel Stream Compaction 30
2.10 Chapter Summary 30
3 Scalable Random Graph Generation 33
3.1 Introduction 33
3.2 Related Work 36
3.3 Baseline algorithm 38
3.4 Sequential algorithms 40
3.4.1 Skipping Edges 40
3.4.2 ZER 42
3.4.3 PreLogZER 43
3.4.4 PreZER 44
3.5 Parallel algorithms 45
3.5.1 PER 45
3.5.2 PZER 47
3.5.3 PPreZER 50
3.6 Performance Evaluation 51
3.6.1 Setup 51
3.6.2 Results 51
Overall Comparison 51
Speedup Assessment 53
Comparison among Parallel algorithms 54
Parallelism Speedup 55
Size Scalability 56
Performance Tuning 57
3.6.3 Discussion 58
3.7 Chapter Summary 59
4 Scalable Real-World Graph Generation 63
4.1 Introduction 63
4.2 Related Work 66
4.2.1 In-Memory Approaches 66
4.2.2 On-disk Approaches 67
Both Datasets Indexed 67
One Dataset Indexed 67
Unindexed 68
4.3 Motivation 70
4.3.1 Touch Detection 71
4.3.2 Motivation Examples 72
4.3.3 Motivation Experiments 73
4.4 HiDOP: Hierarchical Data Oriented Partitioning 75
4.4.1 Problem Definition 75
4.4.2 HiDOP Ideas 76
4.4.3 Algorithm Overview 76
4.4.4 Tree Building Phase 78
4.4.5 Assignment Phase 80
4.4.6 Probing Phase 82
4.4.7 Proof of Correctness 83
4.5 Implementation 84
4.5.1 Partitioning 84
4.5.2 Design Parameters 85
Tree Parameters 85
Local Join Parameters 86
Join Order 87
4.6 Parallel algorithms 87
4.7 Experimental Evaluation 90
4.7.4 Varying Dataset B 93
Small Datasets 93
Large Datasets 94
4.7.5 Varying Epsilon 96
4.7.6 Neuroscience Datasets 96
4.7.7 Parallel HiDOP experiments 99
Overall Comparison 99
Speedup Assessment 100
4.8 Chapter Summary 101
5 Scalable Parallel Minimum Spanning Forest Computation 103
5.1 Introduction 103
5.2 Related Work 106
5.2.1 Sequential algorithms 106
Borůvka 106
Kruskal 107
Reverse-Delete 108
Prim 108
5.2.2 Parallel algorithms 108
5.3 DPMST: Borůvka-based Data Parallel MST algorithm 110
5.3.1 Implementation on GPU 112
5.4 Motivation for scalability 113
5.5 PMA: Scalable Parallel MSF algorithm 114
5.5.1 Partial Prim 114
5.5.2 Unification step 115
5.5.3 Proof of Correctness 117
5.5.4 Complexity Analysis 118
5.6 PMA implementation 119
5.6.1 Partial Prim implementation 119
MinPMA algorithm 120
SortPMA algorithm 121
HybridPMA algorithm 121
5.6.2 Unification implementation 122
5.6.3 Implementation notes 123
5.7 Experiments 123
5.7.1 DPMST performance evaluation 124
Experimental Setup 124
Experimental Results 124
5.7.2 PMA performance evaluation 127
Datasets 129
Maximum subtree size (γ) 130
Removing parallel edges 132
Reduction rate 132
Performance comparison 133
5.8 Chapter Summary 134
6 Scalable Parallel All-Pairs Shortest Path Computation 136
6.1 Introduction 136
6.2 All-Pairs Shortest Path problem 137
6.3 Sequential algorithms 138
6.3.1 Floyd-Warshall 138
6.3.2 Johnson 139
6.3.3 Repeated Squaring 139
6.3.4 Gaussian elimination 142
6.4 Parallel algorithms 143
6.4.1 Gaussian elimination 143
6.4.2 Johnson 143
6.4.3 Repeated Squaring 144
6.4.4 Floyd-Warshall 145
6.5 Time Complexity Analysis 148
6.5.1 Single Source Shortest Path (SSSP) 148
6.5.2 Floyd-Warshall algorithm 149
6.5.3 Repeated Squaring algorithm 151
6.5.4 Gaussian Elimination 152
6.6 L-Distance Matrix Computation 152
6.6.1 L-Pruned Parallel Floyd-Warshall algorithm 155
6.6.2 L-Pruned Repeated Squaring algorithm 155
6.6.3 L-Pruned Single Source Shortest Path 155
6.6.4 L-Pruned Gaussian Elimination 156
6.7 APSP Performance evaluation 156
6.7.1 Setup 156
6.7.2 Experimental Methodology 157
6.7.3 Distance Matrix Computation 158
6.7.4 L-distance Matrix Computation 158
L-Pruned Floyd-Warshall algorithm 160
6.8 Discussion 166
6.9 Operations for privacy 167
6.9.1 Related Work 171
6.10 L-opacity: Linkage-Aware Graph Anonymization 173
6.10.1 Problem Definition 173
6.10.2 L-Opacification algorithm 179
6.10.3 Basic Operations 180
Opacity Value Computation 180
Edge Removal 183
Edge Removal and Insertion 183
6.10.4 Experimental Evaluation 184
Description of Data 187
Utility metrics 187
Comparison on Distortion 188
Comparison on EMD 189
6.10.5 Comparison on Clustering Coefficients 191
Runtime comparison 192
6.10.6 Pruning Capacity in Distance Matrix Computation 193
6.10.7 Data Properties 195
6.10.8 L-Coherency 195
6.11 Chapter Summary 196
7 Conclusions 197
7.1 Summary 197
7.2 Graph data generation algorithms 198
7.2.1 Random graphs 198
7.2.2 Real-world graphs 199
7.3 Graph data management algorithms 200
7.3.1 Minimum Spanning Forest problem 200
7.3.2 All-Pairs Shortest Path problem 201
7.4 Research Directions 202
7.4.1 Medium term goals 202
Dynamic graphs 202
Large graph processing 202
7.4.2 Long term goals 203
Parallel graph processing 203
Distributed graph processing 203
List of Figures
1.1 The seven Königsberg bridges, courtesy of [78] 2
1.2 Number of research articles until April 2012 using graphs, extracted from the Scopus database [11] 3
1.3 Field of research articles until April 2012 using graphs, extracted from the Scopus database [11] 4
2.1 CUDA thread organization 18
2.2 A set of SIMT multiprocessors 19
2.3 Executing a GPU kernel consisting of one iteration K times 26
2.4 Executing one GPU kernel consisting of K iterations 26
2.5 Phases of Parallel Prefix Sum [154] 30
2.6 Stream Compaction for 10 elements 31
3.1 f(k) for varying probabilities p 45
3.2 Running PER algorithm on 10 elements 48
3.3 Generating edge list via skip list in PZER 49
3.4 Running times for all algorithms 52
3.5 Running times for small probability 53
3.6 Speedup for all algorithms over ER 54
3.7 Speedup for parallel algorithms over their sequential counterparts 55
3.8 Running times for parallel algorithms 56
3.9 The times for pseudo-random number generator with skip for PZER and PPreZER and check for PER 57
3.11 Runtime of the parallel algorithms for varying thread-blocks for v = 10K, p = 0.1 graphs 59
3.12 Runtime for varying graph size, p = 0.001 60
3.13 Runtime for varying graph size, p = 0.01 61
3.14 Runtime for varying graph size, p = 0.1 62
4.1 The PBSM approach partitions the space in equi-width partitions (a) and assigns the objects to the partitions (b) 69
4.2 The S3 algorithm [103] partitions the space increasingly fine-granularly the lower the level of the hierarchy. When joining, cell c_I is compared with the corresponding cell c_O and all cells overlapping it (shaded) 70
4.3 Schema of a neuron’s morphology modeled with cylinders 71
4.4 Two datasets where S3 performs suboptimally 73
4.5 Execution time of the spatial join with different approaches 74
4.6 The three phases of the HiDOP: building the tree, assignment and joining 77
4.7 The tree data structure of the HiDOP algorithm 79
4.8 The smaller the MBRs of the leaf nodes, the more effective filtering is 86
4.9 Uniform, Gaussian and clustered data distributions used for the experiments 91
4.10 All the algorithms on a small uniform dataset with increasing size of dataset B 93
4.11 Synthetic large uniform datasets varying size of the second dataset when ε = 5 94
4.12 Synthetic large Gaussian datasets varying size of the second dataset when ε = 5 94
4.13 Synthetic large clustered datasets varying size of the second dataset when ε = 5 95
4.14 Comparing the approaches for two different ε on all datasets 97
4.15 Comparison of all approaches for ε of 5 and 10 on neuroscience datasets 97
4.16 Execution for increasingly dense spatial neuroscience datasets 98
4.17 Execution of HiDOP and Parallel HiDOP on large synthetic and real datasets 99
4.18 Details of running PHiDOP on varying datasets 100
4.19 Speedup of Parallel HiDOP against its sequential version 101
5.1 The state transition diagram of Kang and Bader’s algorithm [96] 110
5.2 Illustration of the proposed data parallel MST. Given a graph (a), the algorithm finds the minimum outgoing edge for each component in parallel (marked in b, c, d). Then it merges the components according to the resulting edges (e) 111
5.3 The state transition diagram of the PMA algorithm 114
5.4 The graph (a) before and (b) after unifying u and v 123
5.5 Execution of DPMST on DIMACS graphs 125
5.6 Execution of DPMST on Erdős–Rényi graphs, n = 20000, 0.1 ≤ p ≤ 0.5 126
5.7 Execution of DPMST on random dense graphs 126
5.8 Execution time of HybridPMA on Erdős–Rényi graphs, varying average degree 128
5.9 Experiments on varying average degree for four types of graph, |V| = 1M 128
5.10 Experiments on varying the number of vertices 130
5.11 Execution time of PMA with varying γ 131
5.12 Reduction rate of different algorithms 133
6.1 A thread processing column j in the parallel block Floyd-Warshall algorithm 145
6.2 Log plot of running time of APSP algorithms on the Enron graph with varying number of sampled vertices 158
6.3 Log plot of running time of APSP algorithms on a synthetic Erdős–Rényi graph with varying probability of inclusion 159
6.4 Log plot of running time of APSP algorithms on the Wiki graph with varying number of sampled vertices 160
6.5 Log plot of running time of APSP algorithms on a synthetic Watts and Strogatz graph with varying average degree 161
6.6 Log plot of running time of L-APSP algorithm on the Enron graph with 2048 vertices 162
6.7 Log plot of running time of L-APSP algorithm on a complete graph with 1024 vertices 162
6.8 Log plot of running time of L-APSP algorithm on a sparse (inclusion probability 0.1) Erdős–Rényi graph with 1024 vertices 163
6.9 Log plot of running time of L-APSP algorithm on the Wiki graph with 1024 vertices 163
6.10 Log plot of running time of L-APSP algorithm on a dense (average degree 256) Watts and Strogatz graph with 1024 vertices 164
6.11 Log plot of running time of L-APSP algorithm on a sparse (average
6.13 Graph for given 3-SAT problem in Theorem 3 178
6.14 Path length matrices 180
6.15 GD numbers and Opacity Matrix 181
6.16 Graph edit distance ratio (Distortion) vs Confidence(θ) 186
6.17 EMD of degree distributions vs Confidence(θ) 189
6.18 EMD of Geodesic distributions vs Confidence(θ) 190
6.19 Mean of the differences of Clustering Coefficients vs Confidence(θ) 192
6.20 Runtime comparison of Gnutella network when varying number of nodes 192
6.21 Runtime of different heuristics for graphs of different size and density vs Confidence(θ) 193
6.22 Impact of L-based Pruning 194
List of Tables
4.1 Selectivity of the datasets (×10⁻⁶) 92
5.1 Runtime of algorithms on dense small graphs (times are in milliseconds) 125
5.2 Runtime of different CPU and GPU algorithms on real-world networks 129
6.1 Description of the original datasets 187
6.2 Data set properties 194
6.3 L-Coherency of different data sets 195
List of Algorithms
1 PLCG 28
2 Original ER 38
3 Decoding 39
4 ER 39
5 ZER 42
6 PreLogZER 43
7 PreZER 46
8 PER 47
9 PZER 49
10 PPreZER 50
11 HiDOP 78
12 Tree Building Phase 80
13 Assignment Phase 81
14 Probing Phase 82
15 Data Parallel HiDOP 88
16 PJOIN 89
17 Borůvka's algorithm 107
18 Prim’s algorithm 109
19 Data Parallel MST algorithm 112
20 PMA algorithm 115
21 Partial Prim algorithm 116
22 MinPMA algorithm 120
23 SortPMA algorithm 120
24 Unifying algorithm 122
25 Floyd-Warshall algorithm 139
26 Single Edge SP extension algorithm 141
27 Repeated squaring APSP algorithm 141
28 GE-APSP algorithm 142
29 Parallel repeated squaring APSP algorithm 145
30 Parallel Block Floyd-Warshal algorithm 146
31 Optimized Parallel Block Floyd-Warshal algorithm 147
32 L-pruned Floyd-Warshal algorithm 153
33 Pointer-based L-pruned F-W algorithm 154
34 max LO algorithm 181
35 Edge Removal algorithm 182
36 Edge Removal/Insertion algorithm 185
This model was used by Leonhard Euler for the Königsberg Bridge Problem [59]. Euler resolved this question by proving that there is no walk that crosses each of the seven Königsberg bridges, as illustrated in Figure 1.1, exactly once. A graph G is defined as a pair (V, E), where V is a set of vertices (i.e., nodes) and E is a set of edges between the vertices. The adjacency relation for graph G is E ⊆ {(u, v) | u, v ∈ V}. When the graph is undirected, the adjacency relation defined by the edges is symmetric, so E ⊆ {{u, v} | u, v ∈ V}. A weighted (edge-weighted) graph is a graph with a weight w(u, v) assigned to each of its edges. A graph can be further generalized to a hypergraph, in which the elements of E are called hyperedges. Then E is a set of non-empty subsets of V; therefore, E is a subset of P(V) \ {∅}, where P(V) is the power set of V. Cormen, Leiserson, Rivest and Stein, in their popular textbook [51], describe the role of graphs and graph algorithms in computer science as follows:
"Graphs are a pervasive data structure in computer science, and algorithms working with them are fundamental to the field." [51]
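To make the definitions concrete, here is a minimal sketch (the class and method names are illustrative, not from the thesis) of an undirected weighted graph stored exactly as the symmetric adjacency relation defined above:

```python
# A minimal sketch of the definitions above: an undirected weighted graph
# stored as the symmetric adjacency relation E ⊆ {{u, v} | u, v ∈ V},
# with a weight w(u, v) attached to every edge.

class Graph:
    def __init__(self):
        self.V = set()   # vertices
        self.w = {}      # weights: frozenset({u, v}) -> w(u, v)

    def add_edge(self, u, v, weight=1.0):
        # One unordered pair {u, v} represents both directions,
        # which makes the adjacency relation symmetric by construction.
        self.V.update((u, v))
        self.w[frozenset((u, v))] = weight

    def adjacent(self, u, v):
        return frozenset((u, v)) in self.w

g = Graph()
g.add_edge("a", "b", 2.5)
assert g.adjacent("b", "a")  # symmetry of the undirected relation
```

A directed graph would instead key the weights by ordered pairs (u, v), and a hypergraph by frozensets of arbitrary size, matching the generalization above.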
Figure 1.1: The seven K¨onigsberg bridges, courtesy of [78]
Figures 1.2 and 1.3 illustrate the number of research articles using graphs and their fields, respectively¹. The results further justify the versatile representativeness of the graph data structure. For instance, in computer science, the graph is an abstract data type for representing social networks [159, 26], data management [18], and web graphs² [17, 20], as well as flow control and program verification [45]. In mathematics, graphs are used to study knot theory [16] or group theory [35]. Transportation, traffic control and road networks in the field of engineering [177], as well as protein and brain simulation in the field of bioinformatics, are also modeled by graphs [29], as are the dynamics of physical processes and complicated atomic structures in physics [145]. In the social sciences, graphs have been employed to explore diffusion and to extract communities through analyzing social networks.
Figure 1.2: Number of research articles until April 2012 using graphs, extracted from
the Scopus database [11]
Because of this versatile representativeness of graphs, social networks and the web are among the many phenomena and artifacts that can be modeled as large graphs to be analyzed through graph algorithms. However, given the size of the underlying graphs, fundamental operations such as path finding become challenging [132, 20, 27]. In 2011, the number of internet users reached 2,267,233,742 among 6,930,055,154 people worldwide³, all contributing to the production of enormous amounts of data in graph form. Among these large-scale networks, for instance, Facebook as a social network, LinkedIn as a professional network, web graphs and graphs of emails have seen explosive growth rates. In January 2011, LinkedIn contained 101 million users with a growth rate of 3 million users per month. At the end of December 2011, the social graph of Facebook contained more than 750 million active users, with an average friend count of 130. In April 2012, there were 676,919,707 websites and 3.3 billion email accounts worldwide [14].
1.2 Parallel processing
Graphics Processing Units (GPUs) were fundamentally designed for fast rendering of images for display. Nevertheless, the introduction of programmable rendering pipelines
3 http://www.internetworldstats.com/stats.htm
[Pie chart: Computer Science 26%, Mathematics 22%, Engineering 21%, Bioinformatics 16%, Physics and Astronomy 8%, Social Sciences 5%, Business 1%, Miscellaneous 1%]
Figure 1.3: Field of research articles until April 2012 using graphs, extracted from the
Scopus database [11]
let the shader programmers develop non-graphical computations for these GPUs. Employing GPUs for general-purpose data processing requires knowledge of how to use textures as a place for data, and how to ask the shaders not to generate pixels but to process the data on the textures. This nonintuitive process evolved with the introduction of parallel programming architectures. This evolution has resulted in broader use of the GPUs in various fields, especially in the domain of data processing. Nowadays, these readily available GPUs, known as many-core architectures, are ubiquitous and cheap [117]. GPUs are commonly installed on today's home computers, workstations, consoles, and gaming devices. They can afford operating thousands of concurrent threads [4]. Therefore, designing parallel algorithms for GPUs has become one of the most important directions in the field, and a large body of work has exploited the GPU's ubiquity to suggest high-performance, general data processing algorithms [72, 73, 108]. However, in contrast to the multi-core Central Processing Unit (CPU) architecture, GPUs are designed for fine-grained data-parallel algorithms [134]. Therefore, the algorithms designed for the so-called many-core GPUs require a different tuning, i.e., Single Instruction Multiple Threads (SIMT), in comparison to the algorithms designed for multi-core CPUs.
1.3 Contributions

The proliferation of gigantic graph-shaped data and the increasing demand for processing graph data on one hand, and the ubiquity of these parallel processors on the other hand, call for the development of scalable and fast graph processing algorithms. Therefore, data-parallel algorithms [74] come into play to achieve this objective. In this dissertation, we explore the difficulties of adapting fundamental yet practical graph algorithms with database applications to these massively parallel processors, in order to design scalable graph algorithms. We study both graph data generation problems and graph data management problems [18, 182]. Our contribution in this research is the design of scalable algorithms for the above-mentioned problems. We address both random and real-world graph models. The results of our study demonstrate the usability of GPUs in graph data processing. For instance, through experiments, we show that our fine-tuned algorithm for generating random graph data can be executed on a typical GPU on average 19 times faster than its fastest sequential version on the CPU.

After introducing scalable algorithms for generating graphs, we use the proposed techniques to design scalable algorithms for processing the graphs. The path problem is a well-known family of graph processing problems [30]. Furthermore, computing the Minimum Spanning Forest and computing the shortest path between vertices are good examples of greedy and dynamic programming algorithms, respectively [51]. Therefore, in this thesis we address these two problems for processing the graphs. We empirically analyze the strengths and weaknesses of the previous solutions for each problem and explore the trade-offs that can be made. For each problem we devise a novel solution that scales beyond the state-of-the-art solutions. In the following Sections 1.4 and 1.5 we describe these problems in greater detail.
1.4 Graph data generation

This thesis first studies algorithms for generating graphs. Graphs may be generated from a random process, i.e., random graphs, or from modeling real-world data, e.g., a rat's brain. In the following subsections we address each respectively.
1.4.1 Generating random graphs

In random graph generation, two simple, elegant, and general mathematical models are instrumental. The former, denoted G(v, e), chooses a graph uniformly at random from the set of graphs with v vertices and e edges. The latter, denoted G(v, p), chooses a graph uniformly at random from the set of graphs with v vertices where each edge has the same independent probability p to exist. Paul Erdős and Alfréd Rényi proposed the G(v, e) model [58], while E. N. Gilbert proposed, at the same time, the G(v, p) model [68]. Nevertheless, both are commonly referred to as Erdős–Rényi models.
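As a point of reference for what follows, the G(v, p) process can be sketched in its plain textbook form: one independent Bernoulli trial per candidate edge. This is only an illustration of the model (the function name is mine), not one of the thesis's algorithms.

```python
import random

def naive_gvp(v, p, seed=None):
    """Naive G(v, p) generation: one independent Bernoulli trial per
    candidate undirected edge, hence Theta(v^2) coin flips regardless
    of how many edges are actually produced."""
    rng = random.Random(seed)
    edges = []
    for u in range(v):
        for w in range(u + 1, v):
            if rng.random() < p:   # edge (u, w) exists with probability p
                edges.append((u, w))
    return edges

edges = naive_gvp(100, 0.1, seed=42)
```

The quadratic number of coin flips, independent of the number of edges generated, is exactly the scalability problem the skipping algorithms of Chapter 3 address.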
Application
The above-mentioned models have been widely utilized in many fields, e.g., communication engineering [53, 64, 118], biology [119, 126] and social network studies [62, 90, 133]. The so-called Erdős–Rényi models are also used for sampling. Namely, G(v, e) is a uniform random sampling of e elements from a set of v. Therefore, a sampling process can be effectively simulated using the random generation process as a component.
Although random graphs are ubiquitously used as a data representation tool, previous research has not paid due attention to the question of efficiency in random graph generation. A naïve implementation of the basic Erdős–Rényi graph generation process does not scale well to very large graphs. In this thesis we propose PPreZER, a novel, data-parallel algorithm for random graph generation under the G(v, p) model. The proposed algorithm can be tuned to a specific type of graph (directed, undirected, with or without self-loops, multipartite) by an orthogonal decoding function.
The skipping technique relies on the availability of an analytical formula for the expected number of edges that can be skipped, following a geometric approach. Furthermore, we avoid the expensive computation of the logarithms present in the ZER algorithm via pre-computation. Still, the skipping element in the algorithm can be implemented even more efficiently, using an acceptance/rejection [149] or Ziggurat [124] method. Therefore, we avoid the logarithm computation altogether in our proposed algorithm, namely the PreZER algorithm.
Proposed algorithm
We devise a data-parallel algorithm for each sequential algorithm, namely ER, ZER and PreZER. We refer to these data-parallel versions as PER, PZER and PPreZER, respectively. We eschew developing a parallel version of PreLogZER, as the benefits this intermediary step brings are sufficiently understood in the sequential version. These three algorithms are executable on an off-the-shelf graphics card with a graphics processing unit (GPU).

1.4.2 Generating real-world graphs
Graphs can be generated by processing real-world data. In this thesis we propose a scalable method, HiDOP, for creating a real-world graph representing a rat's brain from non-graph data: the data are not in graph form, but we can perform a join operation to generate the underlying graph. We design techniques for performing this join efficiently in order to obtain a graph from the real data. Specifically, we focus on generating graphs from spatial datasets. We want to construct a graph out of two given spatial datasets, in which the nodes are the objects in the datasets and an edge connects two objects iff the distance between them is less than a given ε. The process of finding the pairs of objects within distance ε is known as a spatial join.
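The ε-join that defines the graph can be sketched naïvely with a uniform grid; this is only an illustration of the problem statement, not the HiDOP algorithm (the names and the grid scheme are mine):

```python
from collections import defaultdict
from math import dist  # Python 3.8+

def epsilon_join(A, B, eps):
    """Grid-based sketch of the epsilon-join: bucket the points of B into
    a uniform grid of cell width eps, so each point of A only probes its
    own cell and the 8 neighbouring cells (in 2D) instead of scanning
    all of B. Returns index pairs (i, j) with distance(A[i], B[j]) < eps,
    i.e. the edge list of the graph being constructed."""
    grid = defaultdict(list)
    for j, q in enumerate(B):
        grid[(int(q[0] // eps), int(q[1] // eps))].append((j, q))
    edges = []
    for i, p in enumerate(A):
        cx, cy = int(p[0] // eps), int(p[1] // eps)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j, q in grid[(cx + dx, cy + dy)]:
                    if dist(p, q) < eps:   # edge iff distance below eps
                        edges.append((i, j))
    return edges
```

Equi-width partitioning of this kind degrades on skewed data, which is the motivation given in Chapter 4 for data-oriented partitioning instead.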
Application
Efficiently generating these graphs via a spatial join is pivotal for many applications, but particularly important in geographical information systems and in the simulation sciences, where scientists work with spatial models. In geographical applications these graphs are used to represent collisions or proximity between geographical features [165], i.e., landmarks, houses, roads, etc. In medical imaging the generated graphs indicate which cancerous cells are within a certain distance of each other [60]. In the simulation sciences, scientists build and simulate precise spatial models, for example to monitor the folding process of peptides [70].
Existing algorithms
Past research has primarily focused on generating these graphs on disk, and it is thus no surprise that current in-memory approaches are not efficient. Efficient in-memory approaches, however, are important for two reasons: a) main memory has grown so big that many scientific data sets fit into it directly, and b) the in-memory part of generating
Proposed algorithm
In this work we therefore first develop HiDOP, a new in-memory spatial join approach that uses data-oriented space partitioning and single assignment. Then we design a data-parallel version of HiDOP to efficiently use the parallel processing power of the GPU. Our results show that HiDOP is more efficient than known in-memory spatial joins as well as disk-based joins used in memory. HiDOP scales substantially better for increasing selectivity and outperforms the best known approach by a factor of 10. The best known approach, however, requires undue memory and will not scale. HiDOP outperforms state-of-the-art approaches with a comparable memory footprint by a factor of 100. Moreover, we propose PHiDOP, the data-parallel version of HiDOP, which shows a substantial speedup (up to 400 times) over its sequential version.
1.5 Graph data management

Recently, graph data management has become a popular area of research because of its numerous applications in a wide variety of practical fields, including computational biology, software bug localization and computer networking [18, 182]. In this thesis we study path-finding problems, namely finding the Minimum Spanning Forest and the All-Pairs Shortest Path.
1.5.1 Finding Minimum Spanning Forest

A spanning tree of a connected graph G is an acyclic subgraph of G that connects all vertices of G. The Minimum Spanning Tree (MST) problem asks for a spanning tree of a weighted connected graph G having the minimum total weight [87]. In case the graph is not connected, i.e., consists of several connected components, the problem is generalized to finding the Minimum Spanning Forest (MSF), i.e., a subgraph containing an MST of each component.
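For background, a sequential textbook Borůvka sketch — the ancestor of the parallel strategies discussed later, and one of the two algorithms PMA draws on — which naturally yields a forest on disconnected inputs. This is standard material, not the thesis's algorithm:

```python
def boruvka_msf(num_vertices, edges):
    """Textbook Boruvka sketch: repeatedly pick the minimum-weight
    outgoing edge of every component and merge; on a disconnected
    graph the result is a minimum spanning forest.
    edges: list of (weight, u, v) triples."""
    parent = list(range(num_vertices))

    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        cheapest = {}                   # component root -> best outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                # internal edge, ignore
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        if not cheapest:                # no outgoing edges anywhere: done
            return forest
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:                # re-check to avoid creating cycles
                parent[ru] = rv
                forest.append((w, u, v))
```

Each round at least halves the number of components, and the per-round work over all components is independent, which is what makes Borůvka a natural starting point for parallelization.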
Minimum Spanning Forest computation is a special case of the path-finding problem [120]. The computation of a graph's minimum spanning forest is one of the most studied graph problems with practical applications. MSF computation finds applications in domains such as the optimization of message broadcasting in communication networks [44, 50, 111], biological data analysis [173], and image processing [148], while it forms a basis for clustering algorithms [169, 180]. For instance, assume a social graph with weighted edges for the cost of communication among individuals inside the graph, in which we wish to spread some news at the minimum cost in real time. We then need to efficiently compute an MST of G.
Existing algorithms
The existing approaches for parallelizing MSF computation share a similar intuition: building different trees in parallel and, when conflicts occur (i.e., different trees run into each other), merging the components and starting over. However, this strategy is not equally efficient on all types of graphs. In addition, the question of finding the MSF of a non-connected graph is mostly ignored. Most significantly, these approaches tend to be cautiously conservative when expanding their trees, as they have to be aware of the potential conflicts, invoking too many redundant iterations that deteriorate their performance. In Chapter 5, we present an alternative, elegant solution that eschews the conservatism of these existing methods, takes full advantage of parallelism, and keeps communication cost low.
Proposed algorithm
Although past research has proposed several parallel algorithms for this problem, none of them scales to large, high-density graphs. In this thesis we propose a novel algorithm for computing the MSF. Our proposed algorithm minimizes the communication among different processors without constraining the local growth of a processor's computed subtree. In effect, our solution achieves a scalability that previous approaches lacked. We execute our algorithm on a GPU and study its performance using real and synthetic, sparse as well as dense, structured and unstructured graph data. Our experimental study demonstrates that our algorithm outperforms the previous state-of-the-art GPU-based MSF algorithm, while being several orders of magnitude faster than sequential CPU-based algorithms.
Shortest path is a fundamental path-finding problem in graph data management. Given a directed weighted graph G(V, E), where V and E are the sets of vertices and edges respectively, with a weight function w : E → R+ that maps the edges E to positive real-valued weights, the All-Pairs Shortest Path (APSP) problem finds, for every pair (u, v) of vertices, the path length from u to v. However, solving this problem is known to be costly. There exist applications that are only interested in path lengths below a certain value. Real-world networks are connected; any two individuals in them are bound to be linked by some path. The length of this path is usually rather small, not exceeding six steps. Milgram's small-world experiment [127] suggested that social networks of people in the United States are characterized by short path lengths, of approximately three friendship links on average, without considering global linkages; Watts [171] recreated Milgram's experiment on the internet and found that the average number of intermediaries via which an e-mail message can be delivered to a target was around six; Leskovec and Horvitz [109] found an average path length of 6.6 among users of an instant-messaging system.
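The full APSP computation discussed above can be sketched with the classical Floyd-Warshall algorithm. This cubic-time Python version (illustrative only, not the GPU algorithm of this thesis) makes the cost of solving APSP exactly apparent:

```python
def apsp(n, edges):
    """Floyd-Warshall all-pairs shortest paths.
    edges: iterable of (u, v, w) for a directed graph on vertices 0..n-1."""
    INF = float('inf')
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)        # keep the lightest parallel edge
    for k in range(n):                   # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

The three nested loops over all vertices are exactly the O(|V|^3) cost that motivates restricting the problem in the next paragraph.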
In chapter 6, we introduce the L-All-Pairs Shortest Path (L-APSP) problem. Given a directed weighted graph G(V, E) with a weight function w : E → R+ that maps the edges E to positive real-valued weights, the L-APSP problem finds the path length from u to v only for the pairs (u, v) of vertices whose path length is less than or equal to L. In practice, the introduced parameter, L, substantially reduces the runtime of the proposed algorithms.

Application
Data sets storing information about persons and their relationships are abundant. Online social networks, e-mail exchange records and collaboration networks are some examples. Such graph data, when published, can provide valuable information in domains such as sociology, marketing, and fraud detection. Still, the publication of such graph data entails privacy threats for the individuals involved. Identity disclosure and link disclosure are the two types of privacy threats explained in the following subsections.
A privacy threat involves the leakage of sensitive information. This information may involve the identity of a node (i.e., a person) in the network, in which case we talk of identity disclosure; several works aim to prevent such re-identification of the node representing a certain individual in the network [181, 79, 43, 113, 183, 172, 82]. The common theme in these works is the idea that each node should be rendered, by some notion, indistinguishable from k − 1 other nodes in the network; this idea is inspired by the precept of k-anonymity, a principle suggested for the anonymization of relational data [151].
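As an illustration of this precept, one simple instantiation is k-degree anonymity, which requires every degree value to be shared by at least k vertices, so that no vertex can be singled out by its degree alone. The following check is a hypothetical sketch (the function name is ours, and this is only one of the notions of indistinguishability cited above):

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """True if every degree value in the degree sequence occurs for at
    least k vertices, i.e. each vertex hides among k-1 others by degree."""
    return all(count >= k for count in Counter(degrees).values())
```

For example, the degree sequence [2, 2, 3, 3] is 2-degree anonymous, whereas [1, 2, 2] is not, since the vertex of degree 1 is unique.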
Still, a privacy threat may also involve the information about connections among individuals in a social network, i.e. linkage disclosure. Nevertheless, protection against identity disclosure does not imply protection against linkage disclosure too. A network preventing identity disclosure (by rendering vertices indistinguishable as in [181, 79, 43, 113, 183, 172, 82]) may still allow the disclosure of a linkage between two vertices of interest. Zhang and Zhang [178] were the first to make this cardinal observation, and provided a solution for the ensuing edge anonymity problem. However, the proposed approaches are either based on clustering nodes into super-nodes [179] in the network, or follow a randomization approach without clear privacy guarantees [175].
Existing algorithms
In chapter 6 we revisit the linkage anonymization problem as a restricted shortest path problem. Our approach is positioned between the two extremes found in [178] and [46]. Contrary to [178], we do not consider only one-edge links to be important; an adversary who can confidently infer that two individuals in a social network are connected by a two-edge path still infers valuable sensitive information about them. On the other hand, in contrast to [46], we do not attempt to totally extinguish the potential for the inference of an arbitrarily long linkage path.
In view of the connectedness of real-world networks, no privacy is compromised by revealing the existence of a path between two entities; instead, the focus of a privacy concern should be on averting the disclosure of the exact length of such a path, especially when it is short. Following this reasoning, we define L-opacity, a privacy principle. Our aim is to prevent the confident inference of such linkages by incurring a minimal amount of modification on the network.
Proposed algorithm
Our proposed model modifies the graph to preserve the privacy of the relationships among users in social networks. In order to apply this model we progressively design solutions that lead to an efficient algorithm. Each step of the algorithm computes the shortest path among every pair of nodes up to a distance limit, i.e. a restricted All-Pairs Shortest Path (APSP). Later, we tune this well-known graph problem, APSP, for the mentioned purpose. However, the problem itself is complex, and we prove it is NP-complete, so we study the existing parallel APSP algorithms to make the algorithm perform faster.
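The distance-restricted APSP step can be sketched, assuming non-negative edge weights, by running Dijkstra's algorithm from every source and pruning any path whose length exceeds the limit L. This Python sketch is illustrative only, not the parallel algorithm of chapter 6:

```python
import heapq

def l_apsp(n, adj, L):
    """Distance-limited APSP. adj[u] = [(v, w), ...] with w > 0.
    Returns {(u, v): d} only for pairs whose shortest distance d <= L."""
    result = {}
    for s in range(n):                   # one bounded Dijkstra per source
        dist = {s: 0}
        heap = [(0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                 # stale queue entry
            result[(s, u)] = d
            for v, w in adj[u]:
                nd = d + w
                if nd <= L and nd < dist.get(v, float('inf')):
                    dist[v] = nd         # prune paths longer than L
                    heapq.heappush(heap, (nd, v))
    return result
```

Because vertices beyond distance L are never enqueued, each search touches only the L-neighbourhood of its source, which is why the parameter L cuts the runtime so sharply in practice.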
1.6 Overview

In this thesis we propose scalable graph data generation and management algorithms in terms of time and space. We begin with an overview of the architecture of modern GPUs, followed by an explanation of the primitive algorithms that are tuned for these processors, in chapter 2. Then, we narrow our attention to graph data generators and graph data management. We first study the algorithms for generating random graphs [138] in chapter 3 and for generating real-world graphs [139] in chapter 4. We then use the proposed generation techniques in graph data management, studying the well-known fundamental path-finding problems [120, 30, 182]. Specifically, we study the problem of finding the Minimum Spanning Forest [135, 136] in chapter 5 and finding the All-Pairs Shortest Path [137] in chapter 6. Finally, this thesis is concluded in chapter 7 by providing directions for future research.
CHAPTER 2
Parallel processing on Graphics Processing Unit (GPU)
2.1 Many- and multi-core architectures
Nowadays, the Graphics Processing Unit (GPU), a.k.a. the massively multithreaded processor, supports several thousand concurrent threads. GPUs, as the name indicates, are basically designed for graphics processing. However, programmers [15, 19] soon realized that they can be used for general-purpose processing. Typically, modern GPUs have over one hundred processing units, whereas current multi-core CPUs offer a much smaller number of cores. For instance, an NVIDIA GTX 680 card has 1536 parallel computing cores.
To exploit the power of the many-core GPU it is essential to understand its architectural differences from the multi-core CPU. In contrast to CPU algorithms, an efficient algorithm designed for the GPU usually exhibits a large amount of fine-grained parallelism instead of coarse-grained parallelism. In order to adapt a multi-core algorithm into an efficient many-core algorithm, the structure of the algorithm should change from largely task-parallel (coarse-grained) to more data-parallel (fine-grained) instructions. General-purpose computation on graphics hardware has appeared in various application domains [88] such as databases [72, 81, 73], matrix operations [93, 108], bioinformatics [114] and distributed computing [7, 12].
Modern GPUs are multiprocessors with multi-threading support. Currently, standard GPUs do not offer efficient synchronization mechanisms. However, their single-instruction, multiple-threads architecture can leverage thread-level parallelism. The chief function of a multiprocessor GPU is to execute hundreds of threads running the same function concurrently on different data. Data parallelism is achieved by hardware multi-threading that maximizes the utilization of the available functional units.

As our implementation platform we use NVIDIA graphics cards, which consist of an array of Streaming Multiprocessors (SMs). Each SM can support a limited number of co-resident concurrent threads, which share the SM's limited memory resources. Furthermore, each SM consists of multiple (usually eight) Scalar Processor (SP) cores. The SM performs all thread management (i.e., creation, scheduling, and barrier synchronization) entirely in hardware, with zero overhead, for a group of 32 threads called a warp. The zero overhead of lightweight thread scheduling, combined with fast barrier synchronization, allows the GPU to efficiently support fine-grained parallelism.
In the first years of GPGPU, developers used graphics APIs such as OpenGL [10] and DirectX [33] to map applications onto the graphics rendering unit. Recently, several GPGPU languages, including AMD CTM [1], BrookGPU [3] and NVIDIA CUDA [4], have been proposed by GPU vendors and academic researchers. These high-level languages abstract the underlying graphics architecture and provide a programming environment similar to the multi-threaded C/C++ languages.
Compute Unified Device Architecture (CUDA) is a parallel computing architecture for NVIDIA graphics cards. The CUDA API is a set of library functions which can be viewed as an extension of the C language. From a strict software point of view, a CUDA program is a collection of threads executing in parallel. A batch of threads is called a block and a collection of blocks is called a grid. The threads inside the same block can communicate with each other by either sharing data through the block's shared memory or synchronizing their execution. The BrookGPU programming language has a similar purpose to CUDA, but it was historically developed for ATI cards. In the BrookGPU environment, a kernel function is likewise executed on the GPU by a grid of thread blocks.
In this thesis, we use the CUDA parallel computing architecture designed for NVIDIA graphics cards. The CUDA API is a set of library functions to access the GPU and can be viewed as an extension of the C language. The CUDA programming model consists of sequential host code and parallel kernel code. The former launches the parallel code (the kernel) on parallel devices (typically GPUs, though kernels can also be executed efficiently on multi-core CPUs [160]) and is responsible for communicating data between main memory and GPU memory. The latter contains instructions that are designed for the Single Instruction, Multiple Threads (SIMT) architecture.

2.4 SIMT: Single Instruction, Multiple Threads
A series of algorithms appropriate for fine-grained parallelism is discussed by Hillis and Steele in [84]. This article demonstrates a programming style that is suitable for a machine with thousands or millions of processors, such as the Connection Machine systems [85] or Graphics Processing Units. This programming style is known as data-parallel, as opposed to the control-parallel style. The latter model is appropriate for control-structured algorithms such as searching the chess game tree [63]. The data-parallel model is
Figure 2.1: CUDA thread organization. (The figure shows a grid of thread blocks, Block 0,0 through Block 1,1, each containing threads identified by their thread coordinates.)
known as a successful model for broad classes of scientific and engineering applications [85].
In contrast to instruction-level parallelism within a single thread, the GPU leverages thread-level parallelism. Basically, a multiprocessor GPU is designed for executing hundreds of threads concurrently. To manage such a large number of threads, it utilizes a unique architecture called SIMT (Single Instruction, Multiple Threads). In this architecture, data-parallel processing can be achieved by mapping data elements to parallel processing threads. The GPU handles these threads by leveraging hardware multithreading to maximize the utilization of its functional units.
SIMT specifies the execution and branching behavior of a single thread. The SIMT architecture allows thread-level parallel instructions for independent threads as well as data-parallel instructions for coordinated threads. Therefore, the SIMT architecture must be well utilized in order to exploit the parallel processing power of the GPU and achieve a substantial performance improvement.

In the CUDA programming framework, SIMT is defined through data-parallel threads, and a CUDA kernel essentially is a set of sequential instructions that are executed by different threads on the GPU simultaneously and concurrently. Basically, launching a CUDA kernel creates a grid of threads, each running the same instructions on a different portion of the data. The threads in CUDA are organized in two levels: a batch of threads forms a block and a collection of blocks makes a grid. The cooperation between the threads is on a per-block basis. In other words, the threads inside a block can cooperate through shared memory and via barrier synchronization.
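To make the two-level organization concrete, the following Python sketch (CUDA kernels are written in C, so this is only a model of the indexing scheme) mimics a one-dimensional kernel launch, where every (block, thread) pair computes its own global index into the data:

```python
def launch_kernel_1d(kernel, grid_dim, block_dim, data):
    """Simulate a 1-D CUDA launch: every (block, thread) pair runs the
    same kernel function, each computing its own global data index."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # equivalent of: idx = blockIdx.x * blockDim.x + threadIdx.x
            idx = block_idx * block_dim + thread_idx
            if idx < len(data):          # guard against out-of-range threads
                kernel(idx, data)

def square_kernel(idx, data):
    """Each simulated thread squares its own element of the data."""
    data[idx] = data[idx] * data[idx]
```

On real hardware the two loops run as concurrent threads; the bounds guard is the standard idiom when the data size is not a multiple of the block size.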
Figure 2.2: A set of SIMT multiprocessors
The SIMT architecture is supported on modern NVIDIA GPUs via an array of Streaming Multiprocessors (SMs). Each SM can support a limited number of co-resident concurrent threads (the threads of a block), as these threads must share the limited memory resources of that SM. Furthermore, a single SM consists of multiple Scalar Processor (SP) cores (usually 8). Moreover, each SM issues one instruction at a time, which means all SPs in one SM execute the same instruction. For instance, the GeForce 9800 GT contains 14 SMs and 112 SPs, and each SM can support up to 512 threads (see Figure 2.2). In fact, a block of a CUDA kernel is a virtual SM multiprocessor of the physical GPU.
The SM multiprocessor performs all thread management, i.e. creation, scheduling and barrier synchronization, entirely in hardware, with zero overhead, for a group of 32 threads called a warp (the first parallel thread technology). Individual threads composing a warp are free to branch and execute independently but start together at the same program address. More precisely, the execution context of each warp (registers, program counters, etc.) is maintained on-chip for the entire lifetime of the warp. Therefore, switching from one warp to another has no cost. At every instruction issue time, each SM selects a warp that has threads ready to execute and issues the next instruction to the active threads of that warp. The zero overhead of lightweight thread scheduling on one hand, and fast barrier synchronization on the other, allow the GPU to efficiently support very fine-grained parallelism (such as assigning a single thread to each data element).

The programmer defines the number of blocks and the number of threads per block before launching a kernel, and all thread blocks have the same number of threads. The number of thread blocks is usually dictated by the size of the data being processed. Depending on how many registers and how much shared memory a block of threads requires, multiple blocks may be assigned to a single SM. On the other hand, all the threads inside a block must be processed before the block can be swapped out of the SM. When the number of blocks exceeds the number of SMs, based on the vacant registers and shared memory of the SMs, each SM may run more than one block at a time, while the remaining blocks wait in a queue and execute later. This is a combination of simultaneous and concurrent execution of the thread blocks.
However, in the SIMT architecture we need to specify the appropriate portion of the data to be processed by each thread. As a result, CUDA assigns each thread a coordinate so that it is uniquely identifiable. The thread coordinates follow the same two-level hierarchy (i.e. thread blocks and grids). The CUDA thread organization is illustrated in Figure 2.1. For scheduling the threads inside a block, the GPU first looks at the warp they belong to. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs: threads with IDs from 0 to 31 are assigned to the first warp, threads with IDs from 32 to 63 to the second warp, and so on. To execute a warp, the threads that perform the same instruction are executed concurrently, whereas threads that diverge are executed sequentially (explained in section 2.5).
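The thread-to-warp mapping just described is fixed arithmetic. Assuming a warp size of 32, as on the NVIDIA GPUs discussed here, it amounts to:

```python
WARP_SIZE = 32  # warp size on the NVIDIA GPUs discussed in this chapter

def warp_of(thread_id):
    """Warp index of a thread within its block: IDs 0-31 map to warp 0,
    IDs 32-63 to warp 1, and so on."""
    return thread_id // WARP_SIZE

def lane_of(thread_id):
    """Position (lane) of the thread inside its warp."""
    return thread_id % WARP_SIZE
```

For example, thread 33 belongs to warp 1 and occupies lane 1 of that warp.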
The GPU is a compute device that operates as a coprocessor to the CPU, with the capability of executing a large number of threads in parallel. Therefore, the data-parallel portions of algorithms can be isolated into CUDA kernels and executed on the GPU as many different threads. These kernels are compiled to the PTX instruction set, and then translated at install time to the target GPU instruction set. PTX-ISA, a low-level parallel thread execution virtual machine and instruction set architecture, exposes the GPU as a data-parallel computing device [140]. PTX provides a stable parallel programming model and generates a machine-independent ISA for C/C++ which is scalable from a single GPU unit to many parallel units. A PTX program determines the execution of a thread inside a cooperative thread array (CTA). In fact, CTAs implement CUDA thread blocks. Furthermore, a CTA is a set of threads that execute the same kernel concurrently or in parallel. The threads inside a CTA can communicate through synchronization points, where threads wait until all threads in the CTA have arrived. Each CTA thread uses its identifier to find its assigned work, determine its role, input and output positions, and compute addresses.