Scalable Data-Parallel Graph Algorithms:
from Generation to Management
Sadegh Nobari (B.Eng. (Hons.), IUST) (Ph.D., NUS)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING NATIONAL UNIVERSITY OF SINGAPORE
I hereby declare that this thesis is my original
work and it has been written by me in its entirety.
I have duly acknowledged all the sources of information
which have been used in the thesis.
This thesis has also not been submitted for any degree
in any university previously.
Sadegh Nobari
23 July 2012
The Ph.D. was a wonderful, extraordinary, once-in-a-lifetime experience.
I would like to say thanks
To my parents (Zeynab and Nader) and my only brother (Ghasem),
through whose sacrifice
my opportunities were possible
To my advisors,
Professor Stéphane Bressan,
Professors Anastasia Ailamaki, Panagiotis Karras, Panos Kalnis, Nikos Mamoulis and Yannis Velegrakis,
for patiently supporting me
To my committee,
Professors Tan Tiow Seng, Tan Kian-Lee, M. Tamer Özsu and Leong Hon Wai
for gladly suffering my impenetrable prose
To my friends,
Xuesong Lu, Song Yi, Tang Ruiming, Antoine Veillard, Quoc Trung Tran, Cao Thanh Tung, Ehsan Kazemi, Siarhei Bykau, Mohammad Oliya, Behzad Nemat Pajouh, Thomas Heinis, Clemens Lay, Reza Sherkat and
J. J. Sylvester, in 1878, in an article on chemistry and algebra in Nature, called a mathematical structure to model connections between objects a "graph". More than a century later, the versatility of graphs as a data model is demonstrated by the long list of applications in mathematics, science, engineering and the humanities.
Cormen, Leiserson, Rivest, and Stein describe the role of graphs and graph algorithms in computer science as follows in their popular textbook: "Graphs are a pervasive data structure in computer science, and algorithms working with them are fundamental to the field."
Graphs are natural data structures for modern applications. Social network data are typically represented as graphs; the semantic web is based on the RDF formalism, which is a graph model; and software models and program dependences in software engineering are represented via graphs. In many cases these are very large and dynamic graphs. The convergence of applications managing large graphs and the availability of cheap parallel processing hardware has caused a renewed interest in managing very large graphs over parallel systems.
In this dissertation, we design scalable and practical graph algorithms for a selected set of problems, from generation to management. We first propose parallel solutions for graph generation with both random and real-world graph models. Afterward, we propose techniques for processing large graphs in parallel, specifically for computing the Minimum Spanning Forest and the Shortest Path between vertices.

Chapter 3 focuses on the generation of very large graphs. The naïve algorithm for generating Erdős–Rényi graphs does not scale to large graphs. In this chapter we take a systematic approach to the development of the PPreZER algorithm by proposing a series of seven algorithms. The results of our study show that our fine-tuned algorithm, PPreZER, for generating random graph data can be executed on a typical GPU on average 19 times faster than its fastest sequential version on the CPU.

Chapter 4 moves beyond random graphs and considers the generation of real-world graphs. This chapter considers spatial datasets and the generation of graphs by taking the spatial join of the elements of two datasets. We propose an algorithm (called HiDOP) to perform this spatial join operation efficiently. Consequently, we design a data-parallel algorithm inspired by the HiDOP algorithm.
Chapters 5 and 6 cover the data management part of the thesis. Two graph algorithms, i.e., graph queries, are studied: Minimum Spanning Forest (Chapter 5) and All-Pairs Shortest Path (Chapter 6). In Chapter 5 we propose PMA, a novel data-parallel algorithm inspired by Borůvka's and Prim's MSF algorithms. PMA is experimentally shown to be superior to the state-of-the-art MSF algorithms. Chapter 6 introduces a threshold L into the problem definition of all-pairs shortest path such that only the paths with weight less than L are found; the resulting problem is called L-APSP. This threshold is advantageous when only close connections are of interest, as in large social networks.
A large number of APSP algorithms are studied; for each, a counterpart L-APSP algorithm is designed and a parallel version that exploits the GPU is proposed. In all, this dissertation proposes four scalable data-parallel algorithms for graph data processing.
Table of Contents
Acknowledgements ii
Abstract i
Table of Contents viii
List of Figures xii
List of Tables xiii
List of Algorithms xv
1 Introduction 1
1.1 Graph 1
1.2 Parallel processing 3
1.3 Contributions 5
1.4 Graph data generation 6
1.4.1 Generating random graphs 6
Application 6
Existing algorithms 7
Proposed algorithm 7
1.4.2 Generating real-world graphs 8
Application 8
Existing algorithms 8
Proposed algorithm 9
1.5 Graph data management 9
1.5.1 Finding Minimum Spanning Forest 9
Proposed algorithm 10
1.5.2 Finding Shortest Path 11
Application 12
Existing algorithms 13
Proposed algorithm 13
1.6 Overview 14
2 Parallel processing on Graphics Processing Unit (GPU) 15
2.1 Many and Multi core architectures 15
2.2 GPU Architecture 16
2.3 The CUDA and BrookGPU programming frameworks 16
2.4 SIMT: Single Instruction, Multiple Threads 17
2.5 Parallel Thread Execution (PTX) 21
2.6 GPU Memory hierarchy 22
2.7 GPU Optimizations 22
2.8 GPU empirical analysis 25
2.9 Programming the GPU 27
2.9.1 Parallel Pseudo-Random Number Generator 27
2.9.2 Parallel Prefix Sum 29
2.9.3 Parallel Stream Compaction 30
2.10 Chapter Summary 30
3 Scalable Random Graph Generation 33
3.1 Introduction 33
3.2 Related Work 36
3.3 Baseline algorithm 38
3.4 Sequential algorithms 40
3.4.1 Skipping Edges 40
3.4.2 ZER 42
3.4.3 PreLogZER 43
3.4.4 PreZER 44
3.5 Parallel algorithms 45
3.5.1 PER 45
3.5.2 PZER 47
3.5.3 PPreZER 50
3.6 Performance Evaluation 51
3.6.1 Setup 51
3.6.2 Results 51
Overall Comparison 51
Speedup Assessment 53
Comparison among Parallel algorithms 54
Parallelism Speedup 55
Size Scalability 56
Performance Tuning 57
3.6.3 Discussion 58
3.7 Chapter Summary 59
4 Scalable Real-World Graph Generation 63
4.1 Introduction 63
4.2 Related Work 66
4.2.1 In-Memory Approaches 66
4.2.2 On-disk Approaches 67
Both Datasets Indexed 67
One Dataset Indexed 67
Unindexed 68
4.3 Motivation 70
4.3.1 Touch Detection 71
4.3.2 Motivation Examples 72
4.3.3 Motivation Experiments 73
4.4 HiDOP: Hierarchical Data Oriented Partitioning 75
4.4.1 Problem Definition 75
4.4.2 HiDOP Ideas 76
4.4.3 Algorithm Overview 76
4.4.4 Tree Building Phase 78
4.4.5 Assignment Phase 80
4.4.6 Probing Phase 82
4.4.7 Proof of Correctness 83
4.5 Implementation 84
4.5.1 Partitioning 84
4.5.2 Design Parameters 85
Tree Parameters 85
Local Join Parameters 86
Join Order 87
4.6 Parallel algorithms 87
4.7 Experimental Evaluation 90
4.7.4 Varying Dataset B 93
Small Datasets 93
Large Datasets 94
4.7.5 Varying Epsilon 96
4.7.6 Neuroscience Datasets 96
4.7.7 Parallel HiDOP experiments 99
Overall Comparison 99
Speedup Assessment 100
4.8 Chapter Summary 101
5 Scalable Parallel Minimum Spanning Forest Computation 103
5.1 Introduction 103
5.2 Related Work 106
5.2.1 Sequential algorithms 106
Borůvka 106
Kruskal 107
Reverse-Delete 108
Prim 108
5.2.2 Parallel algorithms 108
5.3 DPMST: Borůvka-based Data Parallel MST algorithm 110
5.3.1 Implementation on GPU 112
5.4 Motivation for scalability 113
5.5 PMA: Scalable Parallel MSF algorithm 114
5.5.1 Partial Prim 114
5.5.2 Unification step 115
5.5.3 Proof of Correctness 117
5.5.4 Complexity Analysis 118
5.6 PMA implementation 119
5.6.1 Partial Prim implementation 119
MinPMA algorithm 120
SortPMA algorithm 121
HybridPMA algorithm 121
5.6.2 Unification implementation 122
5.6.3 Implementation notes 123
5.7 Experiments 123
5.7.1 DPMST performance evaluation 124
Experimental Setup 124
Experimental Results 124
5.7.2 PMA performance evaluation 127
Datasets 129
Maximum subtree size (γ) 130
Removing parallel edges 132
Reduction rate 132
Performance comparison 133
5.8 Chapter Summary 134
6 Scalable Parallel All-Pairs Shortest Path Computation 136
6.1 Introduction 136
6.2 All-Pairs Shortest Path problem 137
6.3 Sequential algorithms 138
6.3.1 Floyd-Warshall 138
6.3.2 Johnson 139
6.3.3 Repeated Squaring 139
6.3.4 Gaussian elimination 142
6.4 Parallel algorithms 143
6.4.1 Gaussian elimination 143
6.4.2 Johnson 143
6.4.3 Repeated Squaring 144
6.4.4 Floyd-Warshall 145
6.5 Time Complexity Analysis 148
6.5.1 Single Source Shortest Path (SSSP) 148
6.5.2 Floyd-Warshall algorithm 149
6.5.3 Repeated Squaring algorithm 151
6.5.4 Gaussian Elimination 152
6.6 L-Distance Matrix Computation 152
6.6.1 L-Pruned Parallel Floyd-Warshall algorithm 155
6.6.2 L-Pruned Repeated Squaring algorithm 155
6.6.3 L-Pruned Single Source Shortest Path 155
6.6.4 L-Pruned Gaussian Elimination 156
6.7 APSP Performance evaluation 156
6.7.1 Setup 156
6.7.2 Experimental Methodology 157
6.7.3 Distance Matrix Computation 158
6.7.4 L-distance Matrix Computation 158
L-Pruned Floyd-Warshall algorithm 160
6.8 Discussion 166
6.9 Operations for privacy 167
6.9.1 Related Work 171
6.10 L-opacity: Linkage-Aware Graph Anonymization 173
6.10.1 Problem Definition 173
6.10.2 L-Opacification algorithm 179
6.10.3 Basic Operations 180
Opacity Value Computation 180
Edge Removal 183
Edge Removal and Insertion 183
6.10.4 Experimental Evaluation 184
Description of Data 187
Utility metrics 187
Comparison on Distortion 188
Comparison on EMD 189
6.10.5 Comparison on Clustering Coefficients 191
Runtime comparison 192
6.10.6 Pruning Capacity in Distance Matrix Computation 193
6.10.7 Data Properties 195
6.10.8 L-Coherency 195
6.11 Chapter Summary 196
7 Conclusions 197
7.1 Summary 197
7.2 Graph data generation algorithms 198
7.2.1 Random graphs 198
7.2.2 Real-world graphs 199
7.3 Graph data management algorithms 200
7.3.1 Minimum Spanning Forest problem 200
7.3.2 All-Pairs Shortest Path problem 201
7.4 Research Directions 202
7.4.1 Medium term goals 202
Dynamic graphs 202
Large graph processing 202
7.4.2 Long term goals 203
Parallel graph processing 203
Distributed graph processing 203
List of Figures
1.1 The seven Königsberg bridges, courtesy of [78] 2
1.2 Number of research articles until April 2012 using graphs, extracted from the Scopus database [11] 3
1.3 Field of research articles until April 2012 using graphs, extracted from the Scopus database [11] 4
2.1 CUDA thread organization 18
2.2 A set of SIMT multiprocessors 19
2.3 Executing a GPU kernel consisting of one iteration K times 26
2.4 Executing one GPU kernel consisting of K iterations 26
2.5 Phases of Parallel Prefix Sum [154] 30
2.6 Stream Compaction for 10 elements 31
3.1 f(k) for varying probabilities p 45
3.2 Running PER algorithm on 10 elements 48
3.3 Generating edge list via skip list in PZER 49
3.4 Running times for all algorithms 52
3.5 Running times for small probability 53
3.6 Speedup for all algorithms over ER 54
3.7 Speedup for parallel algorithms over their sequential counterparts 55
3.8 Running times for parallel algorithms 56
3.9 The times for pseudo-random number generator with skip for PZER and PPreZER and check for PER 57
3.11 Runtime of the parallel algorithms for varying thread-blocks for v = 10K, p = 0.1 graphs 59
3.12 Runtime for varying graph size, p = 0.001 60
3.13 Runtime for varying graph size, p = 0.01 61
3.14 Runtime for varying graph size, p = 0.1 62
4.1 The PBSM approach partitions the space in equi-width partitions (a) and assigns the objects to the partitions (b) 69
4.2 The S3 algorithm [103] partitions the space increasingly fine-granularly the lower the level of the hierarchy. When joining, cell c_I is compared with the corresponding cell c_O and all cells overlapping it (shaded) 70
4.3 Schema of a neuron’s morphology modeled with cylinders 71
4.4 Two datasets where S3 performs suboptimally 73
4.5 Execution time of the spatial join with different approaches 74
4.6 The three phases of the HiDOP: building the tree, assignment and joining 77
4.7 The tree data structure of the HiDOP algorithm 79
4.8 The smaller the MBRs of the leaf nodes, the more effective filtering is 86
4.9 Uniform, Gaussian and clustered data distributions used for the experiments 91
4.10 All the algorithms on a small uniform dataset with increasing size of dataset B 93
4.11 Synthetic large uniform datasets varying size of the second dataset when ε = 5 94
4.12 Synthetic large Gaussian datasets varying size of the second dataset when ε = 5 94
4.13 Synthetic large clustered datasets varying size of the second dataset when ε = 5 95
4.14 Comparing the approaches for two different ε on all datasets 97
4.15 Comparison of all approaches for ε of 5 and 10 on neuroscience datasets 97
4.16 Execution for increasingly dense spatial neuroscience datasets 98
4.17 Execution of HiDOP and Parallel HiDOP on large synthetic and real datasets 99
4.18 Details of running PHiDOP on varying datasets 100
4.19 Speedup of Parallel HiDOP against its sequential version 101
5.1 The state transition diagram of Kang and Bader’s algorithm [96] 110
5.2 Illustration of the proposed data parallel MST. Given a graph (a), the algorithm finds the minimum outgoing edge for each component in parallel (marked in b, c, d). Then it merges the components according to the resulting edges (e) 111
5.3 The state transition diagram of the PMA algorithm 114
5.4 The graph (a) before and (b) after unifying u and v 123
5.5 Execution of DPMST on DIMACS graphs 125
5.6 Execution of DPMST on Erdős–Rényi graphs, n = 20000, 0.1 ≤ p ≤ 0.5 126
5.7 Execution of DPMST on random dense graphs 126
5.8 Execution time of HybridPMA on Erdős–Rényi graphs, varying average degree 128
5.9 Experiments on varying average degree for four types of graph, |V| = 1M 128
5.10 Experiments on varying the number of vertices 130
5.11 Execution time of PMA with varying γ 131
5.12 Reduction rate of different algorithms 133
6.1 A thread processing column j in the parallel block Floyd-Warshall algorithm 145
6.2 Log plot of running time of APSP algorithms on the Enron graph with varying number of sampled vertices 158
6.3 Log plot of running time of APSP algorithms on a synthetic Erdős–Rényi graph with varying probability of inclusion 159
6.4 Log plot of running time of APSP algorithms on the Wiki graph with varying number of sampled vertices 160
6.5 Log plot of running time of APSP algorithms on a synthetic Watts and Strogatz graph with varying average degree 161
6.6 Log plot of running time of L-APSP algorithm on the Enron graph with 2048 vertices 162
6.7 Log plot of running time of L-APSP algorithm on a complete graph with 1024 vertices 162
6.8 Log plot of running time of L-APSP algorithm on a sparse (inclusion probability 0.1) Erdős–Rényi graph with 1024 vertices 163
6.9 Log plot of running time of L-APSP algorithm on the Wiki graph with 1024 vertices 163
6.10 Log plot of running time of L-APSP algorithm on a dense (average degree 256) Watts and Strogatz graph with 1024 vertices 164
6.11 Log plot of running time of L-APSP algorithm on a sparse (average
6.13 Graph for given 3-SAT problem in Theorem 3 178
6.14 Path length matrices 180
6.15 GD numbers and Opacity Matrix 181
6.16 Graph edit distance ratio (Distortion) vs Confidence(θ) 186
6.17 EMD of degree distributions vs Confidence(θ) 189
6.18 EMD of Geodesic distributions vs Confidence(θ) 190
6.19 Mean of the differences of Clustering Coefficients vs Confidence(θ) 192
6.20 Runtime comparison of Gnutella network when varying number of nodes 192
6.21 Runtime of different heuristics for graphs of different size and density vs Confidence(θ) 193
6.22 Impact of L-based Pruning 194
List of Tables
4.1 Selectivity of the datasets (×10⁻⁶) 92
5.1 Runtime of algorithms on dense small graphs (times are in milliseconds) 125
5.2 Runtime of different CPU and GPU algorithms on real-world networks 129
6.1 Description of the original datasets 187
6.2 Data set properties 194
6.3 L-Coherency of different data sets 195
List of Algorithms
1 PLCG 28
2 Original ER 38
3 Decoding 39
4 ER 39
5 ZER 42
6 PreLogZER 43
7 PreZER 46
8 PER 47
9 PZER 49
10 PPreZER 50
11 HiDOP 78
12 Tree Building Phase 80
13 Assignment Phase 81
14 Probing Phase 82
15 Data Parallel HiDOP 88
16 PJOIN 89
17 Borůvka's algorithm 107
18 Prim’s algorithm 109
19 Data Parallel MST algorithm 112
20 PMA algorithm 115
21 Partial Prim algorithm 116
22 MinPMA algorithm 120
23 SortPMA algorithm 120
24 Unifying algorithm 122
25 Floyd-Warshall algorithm 139
26 Single Edge SP extension algorithm 141
27 Repeated squaring APSP algorithm 141
28 GE-APSP algorithm 142
29 Parallel repeated squaring APSP algorithm 145
30 Parallel Block Floyd-Warshal algorithm 146
31 Optimized Parallel Block Floyd-Warshal algorithm 147
32 L-pruned Floyd-Warshal algorithm 153
33 Pointer-based L-pruned F-W algorithm 154
34 max LO algorithm 181
35 Edge Removal algorithm 182
36 Edge Removal/Insertion algorithm 185
This model was used by Leonhard Euler for the Königsberg Bridge Problem [59]. Euler resolved this question by proving that there is no walk that crosses each of the seven Königsberg bridges, as illustrated in Figure 1.1, exactly once. A graph G is defined as a pair (V, E), where V is a set of vertices (i.e., nodes) and E is a set of edges between the vertices. The adjacency relation for graph G is E ⊆ {(u, v) | u, v ∈ V}. When the graph is undirected, the adjacency relation defined by the edges is symmetric, so E ⊆ {{u, v} | u, v ∈ V}. A weighted (edge-weighted) graph is a graph with a weight w(u, v) assigned to each of its edges. A graph can be further generalized to a hypergraph, in which the elements of E are called hyperedges. Then E is a set of non-empty subsets of V; therefore, E is a subset of P(V) \ {∅}, where P(V) is the power set of V. Cormen, Leiserson, Rivest and Stein, in their popular textbook [51], describe the role of graphs and graph algorithms in computer science as follows:
"Graphs are a pervasive data structure in computer science, and algorithms working with them are fundamental to the field." [51]
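To make the definitions concrete, here is a minimal sketch (the class and method names are illustrative, not from the thesis) of an undirected weighted graph stored exactly as the symmetric adjacency relation defined above:

```python
# A minimal sketch of the definitions above: an undirected weighted graph
# stored as the symmetric adjacency relation E ⊆ {{u, v} | u, v ∈ V},
# with a weight w(u, v) attached to every edge.

class Graph:
    def __init__(self):
        self.V = set()   # vertices
        self.w = {}      # weights: frozenset({u, v}) -> w(u, v)

    def add_edge(self, u, v, weight=1.0):
        # One unordered pair {u, v} represents both directions,
        # which makes the adjacency relation symmetric by construction.
        self.V.update((u, v))
        self.w[frozenset((u, v))] = weight

    def adjacent(self, u, v):
        return frozenset((u, v)) in self.w

g = Graph()
g.add_edge("a", "b", 2.5)
assert g.adjacent("b", "a")  # symmetry of the undirected relation
```

A directed graph would instead key the weights by ordered pairs (u, v), and a hypergraph by frozensets of arbitrary size, matching the generalization above.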
Figure 1.1: The seven K¨onigsberg bridges, courtesy of [78]
Figures 1.2 and 1.3 illustrate the number of research articles using graphs and their fields, respectively¹. The results further justify the versatile representativeness of the graph data structure. For instance, in computer science, the graph is an abstract data type for representing social networks [159, 26], data management [18], and web graphs² [17, 20], as well as flow control and program verification [45]. In mathematics, graphs are used to study knot theory [16] or group theory [35]. Transportation, traffic control and road networks in the field of engineering [177], as well as protein and brain simulation in the field of bioinformatics, are also modeled by graphs [29], as are the dynamics of physical processes and complicated atomic structures in physics [145]. In the social sciences, graphs have been employed to explore diffusion and to extract communities through analyzing social networks.
Figure 1.2: Number of research articles until April 2012 using graphs, extracted from
the Scopus database [11]
Because of this versatile representativeness of graphs, social networks and the web are among the many phenomena and artifacts that can be modeled as large graphs to be analyzed through graph algorithms. However, given the size of the underlying graphs, fundamental operations such as path finding become challenging [132, 20, 27]. In 2011, the number of internet users reached 2,267,233,742 among 6,930,055,154 people worldwide³, all contributing to the production of enormous amounts of data in graph form. Among these large-scale networks, for instance, Facebook as a social network, LinkedIn as a professional network, web graphs and graphs of emails have seen explosive growth rates. In January 2011, LinkedIn contained 101 million users with a growth rate of 3 million users per month. At the end of December 2011, the social graph of Facebook contained more than 750 million active users, with an average friend count of 130. In April 2012, there were 676,919,707 websites and 3.3 billion email accounts worldwide [14].
1.2 Parallel processing
Graphics Processing Units (GPUs) were fundamentally designed for fast rendering of images for display. Nevertheless, the introduction of programmable rendering pipelines
3 http://www.internetworldstats.com/stats.htm
[Pie chart: Computer Science 26%, Mathematics 22%, Engineering 21%, Bioinformatics 16%, Physics and Astronomy 8%, Social Sciences 5%, Business 1%, Miscellaneous 1%]
Figure 1.3: Field of research articles until April 2012 using graphs, extracted from the
Scopus database [11]
let the shader programmers develop non-graphical computations for these GPUs. Employing GPUs for general-purpose data processing requires knowledge of how to use textures as a place for data, and how to ask the shaders not to generate pixels but to process the data on the textures. This nonintuitive process evolved with the introduction of parallel programming architectures. This evolution has resulted in broader use of the GPUs in various fields, especially in the domain of data processing. Nowadays, these readily available GPUs, known as many-core architectures, are ubiquitous and cheap [117]. GPUs are commonly installed on today's home computers, workstations, consoles, and gaming devices. They can afford operating thousands of concurrent threads [4]. Therefore, designing parallel algorithms for GPUs has become one of the most important directions in the field, and a large body of work has exploited the GPU's ubiquity to suggest high-performance, general data processing algorithms [72, 73, 108]. However, in contrast to the multi-core Central Processing Unit (CPU) architecture, GPUs are designed for fine-grained data-parallel algorithms [134]. Therefore, the algorithms designed for the so-called many-core GPUs require a different tuning, i.e., Single Instruction Multiple Threads (SIMT), in comparison to the algorithms designed for multi-core CPUs.
1.3 Contributions

The proliferation of gigantic graph-shaped data and the increasing demand for processing graph data on one hand, and the ubiquity of these parallel processors on the other hand, call for the development of scalable and fast graph processing algorithms. Therefore, data-parallel algorithms [74] come into play to achieve this objective. In this dissertation, we explore the difficulties of adapting fundamental yet practical graph algorithms with database applications to these massively parallel processors, in order to design scalable graph algorithms. We study both graph data generation problems and graph data management problems [18, 182]. Our contribution in this research is the design of scalable algorithms for the above-mentioned problems. We address both random and real-world graph models. The results of our study demonstrate the usability of GPUs in graph data processing. For instance, through experiments, we show that our fine-tuned algorithm for generating random graph data can be executed on a typical GPU on average 19 times faster than its fastest sequential version on the CPU.

After introducing scalable algorithms for generating graphs, we use the proposed techniques to design scalable algorithms for processing the graphs. The path problem is a well-known family of graph processing problems [30]. Furthermore, computing the Minimum Spanning Forest and computing the shortest path between vertices are good examples of greedy and dynamic programming algorithms, respectively [51]. Therefore, in this thesis we address these two problems for processing the graphs. We empirically analyze the strengths and weaknesses of the previous solutions for each problem and explore the trade-offs that can be made. For each problem we devise a novel solution that scales beyond the state-of-the-art solutions. In the following Sections 1.4 and 1.5 we describe these problems in greater detail.
1.4 Graph data generation

This thesis first studies algorithms for generating graphs. Graphs may be generated from a random process, i.e., random graphs, or from modeling real-world data, e.g., a rat's brain. In the following subsections we address each respectively.
1.4.1 Generating random graphs

In random graph generation, two simple, elegant, and general mathematical models are instrumental. The former, denoted G(v, e), chooses a graph uniformly at random from the set of graphs with v vertices and e edges. The latter, denoted G(v, p), chooses a graph uniformly at random from the set of graphs with v vertices where each edge has the same independent probability p to exist. Paul Erdős and Alfréd Rényi proposed the G(v, e) model [58], while E. N. Gilbert proposed, at the same time, the G(v, p) model [68]. Nevertheless, both are commonly referred to as Erdős–Rényi models.
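As a point of reference for what follows, the G(v, p) process can be sketched in its plain textbook form: one independent Bernoulli trial per candidate edge. This is only an illustration of the model (the function name is mine), not one of the thesis's algorithms.

```python
import random

def naive_gvp(v, p, seed=None):
    """Naive G(v, p) generation: one independent Bernoulli trial per
    candidate undirected edge, hence Theta(v^2) coin flips regardless
    of how many edges are actually produced."""
    rng = random.Random(seed)
    edges = []
    for u in range(v):
        for w in range(u + 1, v):
            if rng.random() < p:   # edge (u, w) exists with probability p
                edges.append((u, w))
    return edges

edges = naive_gvp(100, 0.1, seed=42)
```

The quadratic number of coin flips, independent of the number of edges generated, is exactly the scalability problem the skipping algorithms of Chapter 3 address.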
Application
The above-mentioned models have been widely utilized in many fields, e.g., communication engineering [53, 64, 118], biology [119, 126] and social network studies [62, 90, 133]. The so-called Erdős–Rényi models are also used for sampling. Namely, G(v, e) is a uniform random sampling of e elements from a set of v. Therefore, a sampling process can be effectively simulated using the random generation process as a component.
Although random graphs are ubiquitously used as a data representation tool, previous research has not paid due attention to the question of efficiency in random graph generation. A naïve implementation of the basic Erdős–Rényi graph generation process does not scale well to very large graphs. In this thesis we propose PPreZER, a novel, data-parallel algorithm for random graph generation under the G(v, p) model. The proposed algorithm can be tuned to a specific type of graph (directed, undirected, with or without self-loops, multipartite) by an orthogonal decoding function.
The skipping technique relies on the availability of an analytical formula for the expected number of edges that can be skipped, following a geometric approach. Furthermore, we avoid the expensive computation of the logarithms present in the ZER algorithm via pre-computation. Still, the skipping element in the algorithm can be implemented even more efficiently, using an acceptance/rejection [149] or Ziggurat [124] method. Therefore, we avoid the logarithm computation altogether in our proposed algorithm, namely the PreZER algorithm.
Proposed algorithm
We devise a data-parallel algorithm for each sequential algorithm, namely ER, ZER and PreZER. We refer to these data-parallel versions as PER, PZER and PPreZER, respectively. We eschew developing a parallel version of PreLogZER, as the benefits this intermediary step brings are sufficiently understood in the sequential version. These three algorithms are executable on an off-the-shelf graphics card with a graphics processing unit (GPU).

1.4.2 Generating real-world graphs
Graphs can be generated by processing real-world data. In this thesis we propose a scalable method, HiDOP, for creating a real-world graph representing a rat's brain from non-graph data: the data are not in graph form, but we can perform a join operation to generate the underlying graph. We design techniques for performing this join efficiently in order to obtain a graph from the real data. Specifically, we focus on generating graphs from spatial datasets. We want to construct a graph out of two given spatial datasets, in which the nodes are the objects in the datasets and an edge connects two objects iff the distance between them is less than a given ε. The process of finding the pairs of objects within distance ε is known as a spatial join.
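The ε-join that defines the graph can be sketched naïvely with a uniform grid; this is only an illustration of the problem statement, not the HiDOP algorithm (the names and the grid scheme are mine):

```python
from collections import defaultdict
from math import dist  # Python 3.8+

def epsilon_join(A, B, eps):
    """Grid-based sketch of the epsilon-join: bucket the points of B into
    a uniform grid of cell width eps, so each point of A only probes its
    own cell and the 8 neighbouring cells (in 2D) instead of scanning
    all of B. Returns index pairs (i, j) with distance(A[i], B[j]) < eps,
    i.e. the edge list of the graph being constructed."""
    grid = defaultdict(list)
    for j, q in enumerate(B):
        grid[(int(q[0] // eps), int(q[1] // eps))].append((j, q))
    edges = []
    for i, p in enumerate(A):
        cx, cy = int(p[0] // eps), int(p[1] // eps)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j, q in grid[(cx + dx, cy + dy)]:
                    if dist(p, q) < eps:   # edge iff distance below eps
                        edges.append((i, j))
    return edges
```

Equi-width partitioning of this kind degrades on skewed data, which is the motivation given in Chapter 4 for data-oriented partitioning instead.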
Application
Efficiently generating these graphs via a spatial join is pivotal for many applications, but particularly important in geographical information systems and in the simulation sciences, where scientists work with spatial models. In geographical applications these graphs are used to represent collisions or proximity between geographical features [165], i.e., landmarks, houses, roads, etc. In medical imaging the generated graphs indicate which cancerous cells are within a certain distance of each other [60]. In the simulation sciences, scientists build and simulate precise spatial models, for example to monitor the folding process of peptides [70].
Existing algorithms
Past research has primarily focused on generating these graphs on disk, and it is thus no surprise that current in-memory approaches are not efficient. Efficient in-memory approaches, however, are important for two reasons: a) main memory has grown so big that many scientific data sets fit into it directly, and b) the in-memory part of generating
Proposed algorithm
In this work we therefore first develop HiDOP, a new in-memory spatial join approach that uses data-oriented space partitioning and single assignment. Then we design a data-parallel version of HiDOP to efficiently use the parallel processing power of the GPU. Our results show that HiDOP is more efficient than known in-memory spatial joins as well as disk-based joins used in memory. HiDOP scales substantially better for increasing selectivity and outperforms the best known approach by a factor of 10. The best known approach, however, requires undue memory and will not scale. HiDOP outperforms state-of-the-art approaches with a comparable memory footprint by a factor of 100. Moreover, we propose PHiDOP, the data-parallel version of HiDOP, which shows a substantial speedup (up to 400 times) over its sequential version.
1.5 Graph data management

Recently, graph data management has become a popular area of research because of its numerous applications in a wide variety of practical fields, including computational biology, software bug localization and computer networking [18, 182]. In this thesis we study path-finding problems, namely finding the Minimum Spanning Forest and the All-Pairs Shortest Path.
1.5.1 Finding Minimum Spanning Forest

A spanning tree of a connected graph G is an acyclic subgraph of G that connects all vertices of G. The Minimum Spanning Tree (MST) problem asks for a spanning tree of a weighted connected graph G having the minimum total weight [87]. In case the graph is not connected, i.e., consists of several connected components, the problem is generalized to finding the Minimum Spanning Forest (MSF), i.e., a subgraph containing an MST of each component.
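For background, a sequential textbook Borůvka sketch — the ancestor of the parallel strategies discussed later, and one of the two algorithms PMA draws on — which naturally yields a forest on disconnected inputs. This is standard material, not the thesis's algorithm:

```python
def boruvka_msf(num_vertices, edges):
    """Textbook Boruvka sketch: repeatedly pick the minimum-weight
    outgoing edge of every component and merge; on a disconnected
    graph the result is a minimum spanning forest.
    edges: list of (weight, u, v) triples."""
    parent = list(range(num_vertices))

    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        cheapest = {}                   # component root -> best outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                # internal edge, ignore
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        if not cheapest:                # no outgoing edges anywhere: done
            return forest
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:                # re-check to avoid creating cycles
                parent[ru] = rv
                forest.append((w, u, v))
```

Each round at least halves the number of components, and the per-round work over all components is independent, which is what makes Borůvka a natural starting point for parallelization.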
Minimum Spanning Forest computation is a special case of the path-finding problem [120]. The computation of a graph's minimum spanning forest is one of the most studied graph problems with practical applications. MSF computation finds applications in domains such as the optimization of message broadcasting in communication networks [44, 50, 111], biological data analysis [173], and image processing [148], while it forms a basis for clustering algorithms [169, 180]. For instance, assume a social graph with weighted edges for the cost of communication among individuals inside the graph, in which we wish to spread some news at the minimum cost in real time. We then need to efficiently compute an MST of G.
Existing algorithms
The existing approaches for parallelizing MSF computation share a similar intuition: building different trees in parallel and, when conflicts occur (i.e., different trees run into each other), merging the components and starting over. However, this strategy is not equally efficient on all types of graphs. In addition, the question of finding the MSF of a non-connected graph is mostly ignored. Most significantly, these approaches tend to be cautiously conservative when expanding their trees, as they have to be aware of the potential conflicts, invoking too many redundant iterations that deteriorate their performance. In Chapter 5, we present an alternative, elegant solution that eschews the conservatism of these existing methods, takes full advantage of parallelism, and keeps communication cost low.
Proposed algorithm
Although past research has proposed several parallel algorithms for this problem, none of them scales to large, high-density graphs. In this thesis we propose a novel algorithm for computing the MSF. Our proposed algorithm minimizes the communication among different processors without constraining the local growth of a processor's computed subtree. In effect, our solution achieves a scalability that previous approaches lacked. We execute our algorithm on a GPU and study its performance using real and synthetic, sparse as well as dense, structured and unstructured graph data. Our experimental study demonstrates that our algorithm outperforms the previous state-of-the-art GPU-based MSF algorithm, while being several orders of magnitude faster than sequential CPU-based algorithms.
Shortest path is a fundamental path-finding problem in graph data management. Given a directed weighted graph G(V, E), where V and E are the sets of vertices and edges respectively, with a weight function w : E → R+ that maps the edges E to positive real-valued weights, the All-Pairs Shortest Path (APSP) problem finds, for every pair (u, v) of vertices, the path length from u to v. However, solving this problem is known to be costly. There exist applications that are only interested in path lengths below a certain value. Real-world networks are connected; any two individuals in them are bound to be linked by some path. The length of this path is usually rather small, not exceeding six steps. Milgram's small-world experiment [127] suggested that social networks of people in the United States are characterized by short path lengths, of approximately three friendship links on average, without considering global linkages; Watts [171] recreated Milgram's experiment on the internet and found that the average number of intermediaries via which an e-mail message can be delivered to a target was around six; Leskovec and Horvitz [109] found an average path length of 6.6 among users of an instant-messaging system.
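The full APSP computation discussed above can be sketched with the classical Floyd-Warshall algorithm. This cubic-time Python version (illustrative only, not the GPU algorithm of this thesis) makes the cost of solving APSP exactly apparent:

```python
def apsp(n, edges):
    """Floyd-Warshall all-pairs shortest paths.
    edges: iterable of (u, v, w) for a directed graph on vertices 0..n-1."""
    INF = float('inf')
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)        # keep the lightest parallel edge
    for k in range(n):                   # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

The three nested loops over all vertices are exactly the O(|V|^3) cost that motivates restricting the problem in the next paragraph.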
In chapter 6, we introduce the L-All-Pairs Shortest Path (L-APSP) problem. Given a directed weighted graph G(V, E) with a weight function w : E → R+ that maps the edges E to positive real-valued weights, the L-APSP problem finds the path length from u to v only for the pairs (u, v) of vertices whose path length is less than or equal to L. In practice, the introduced parameter, L, substantially reduces the runtime of the proposed algorithms.

Application
Data sets storing information about persons and their relationships are abundant. Online social networks, e-mail exchange records and collaboration networks are some examples. Such graph data, when published, can provide valuable information in domains such as sociology, marketing, and fraud detection. Still, the publication of such graph data entails privacy threats for the individuals involved. Identity disclosure and link disclosure are the two types of privacy threats explained in the following subsections.
A privacy threat involves the leakage of sensitive information. This information may involve the identity of a node (i.e., a person) in the network, in which case we talk of identity disclosure; several works aim to prevent such re-identification of the node representing a certain individual in the network [181, 79, 43, 113, 183, 172, 82]. The common theme in these works is the idea that each node should be rendered, by some notion, indistinguishable from k − 1 other nodes in the network; this idea is inspired by the precept of k-anonymity, a principle suggested for the anonymization of relational data [151].
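As an illustration of this precept, one simple instantiation is k-degree anonymity, which requires every degree value to be shared by at least k vertices, so that no vertex can be singled out by its degree alone. The following check is a hypothetical sketch (the function name is ours, and this is only one of the notions of indistinguishability cited above):

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """True if every degree value in the degree sequence occurs for at
    least k vertices, i.e. each vertex hides among k-1 others by degree."""
    return all(count >= k for count in Counter(degrees).values())
```

For example, the degree sequence [2, 2, 3, 3] is 2-degree anonymous, whereas [1, 2, 2] is not, since the vertex of degree 1 is unique.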
Still, a privacy threat may also involve the information about connections among individuals in a social network, i.e. linkage disclosure. Nevertheless, protection against identity disclosure does not imply protection against linkage disclosure too. A network preventing identity disclosure (by rendering vertices indistinguishable as in [181, 79, 43, 113, 183, 172, 82]) may still allow the disclosure of a linkage between two vertices of interest. Zhang and Zhang [178] were the first to make this cardinal observation, and provided a solution for the ensuing edge anonymity problem. However, the proposed approaches are either based on clustering nodes into super-nodes [179] in the network, or follow a randomization approach without clear privacy guarantees [175].
Existing algorithms
In chapter 6 we revisit the linkage anonymization problem as a restricted shortest path problem. Our approach is positioned between the two extremes found in [178] and [46]. Contrary to [178], we do not consider only one-edge links to be important; an adversary who can confidently infer that two individuals in a social network are connected by a two-edge path still infers valuable sensitive information about them. On the other hand, in contrast to [46], we do not attempt to totally extinguish the potential for the inference of an arbitrarily long linkage path.
In view of the connectedness of real-world networks, no privacy is compromised by revealing the existence of a path between two entities; instead, the focus of a privacy concern should be on averting the disclosure of the exact length of such a path, especially when it is short. Following this reasoning, we define L-opacity, a privacy principle. Our aim is to prevent the confident inference of such linkages by incurring a minimal amount of modification on the network.
Proposed algorithm
Our proposed model modifies the graph to preserve the privacy of the relationships among users in social networks. In order to apply this model we progressively design solutions that lead to an efficient algorithm. Each step of the algorithm computes the shortest path among every pair of nodes up to a distance limit, i.e. a restricted All-Pairs Shortest Path (APSP). Later, we tune this well-known graph problem, APSP, for the mentioned purpose. However, the problem itself is complex, and we prove it is NP-complete, so we study the existing parallel APSP algorithms to make the algorithm perform faster.
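The distance-restricted APSP step can be sketched, assuming non-negative edge weights, by running Dijkstra's algorithm from every source and pruning any path whose length exceeds the limit L. This Python sketch is illustrative only, not the parallel algorithm of chapter 6:

```python
import heapq

def l_apsp(n, adj, L):
    """Distance-limited APSP. adj[u] = [(v, w), ...] with w > 0.
    Returns {(u, v): d} only for pairs whose shortest distance d <= L."""
    result = {}
    for s in range(n):                   # one bounded Dijkstra per source
        dist = {s: 0}
        heap = [(0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                 # stale queue entry
            result[(s, u)] = d
            for v, w in adj[u]:
                nd = d + w
                if nd <= L and nd < dist.get(v, float('inf')):
                    dist[v] = nd         # prune paths longer than L
                    heapq.heappush(heap, (nd, v))
    return result
```

Because vertices beyond distance L are never enqueued, each search touches only the L-neighbourhood of its source, which is why the parameter L cuts the runtime so sharply in practice.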
1.6 Overview

In this thesis we propose scalable graph data generation and management algorithms in terms of time and space. We begin with an overview of the architecture of modern GPUs, followed by an explanation of the primitive algorithms that are tuned for these processors, in chapter 2. Then, we narrow our attention to graph data generators and graph data management. We first study the algorithms for generating random graphs [138] in chapter 3 and for generating real-world graphs [139] in chapter 4. We then use the proposed generation techniques in graph data management, studying the well-known fundamental path-finding problems [120, 30, 182]. Specifically, we study the problem of finding the Minimum Spanning Forest [135, 136] in chapter 5 and finding the All-Pairs Shortest Path [137] in chapter 6. Finally, this thesis is concluded in chapter 7 by providing directions for future research.
CHAPTER 2
Parallel processing on Graphics Processing Unit (GPU)
2.1 Many- and multi-core architectures
Nowadays, the Graphics Processing Unit (GPU), a.k.a. the massively multithreaded processor, supports several thousand concurrent threads. GPUs, as the name indicates, are basically designed for graphics processing. However, programmers [15, 19] soon realized that they can be used for general-purpose processing. Typically, modern GPUs have over one hundred processing units, whereas current multi-core CPUs offer a much smaller number of cores. For instance, an NVIDIA GTX 680 card has 1536 parallel computing cores.
To exploit the power of the many-core GPU it is essential to understand its architectural differences from the multi-core CPU. In contrast to CPU algorithms, an efficient algorithm designed for the GPU usually exhibits a large amount of fine-grained parallelism instead of coarse-grained parallelism. In order to adapt a multi-core algorithm into an efficient many-core algorithm, the structure of the algorithm should change from largely task-parallel (coarse-grained) to more data-parallel (fine-grained) instructions. General-purpose computation on graphics hardware has appeared in various application domains [88] such as databases [72, 81, 73], matrix operations [93, 108], bioinformatics [114] and distributed computing [7, 12].
Modern GPUs are multiprocessors with multi-threading support. Currently, standard GPUs do not offer efficient synchronization mechanisms. However, their single-instruction, multiple-threads architecture can leverage thread-level parallelism. The chief function of a multiprocessor GPU is to execute hundreds of threads running the same function concurrently on different data. Data parallelism is achieved by hardware multi-threading that maximizes the utilization of the available functional units.

As our implementation platform we use NVIDIA graphics cards, which consist of an array of Streaming Multiprocessors (SMs). Each SM can support a limited number of co-resident concurrent threads, which share the SM's limited memory resources. Furthermore, each SM consists of multiple (usually eight) Scalar Processor (SP) cores. The SM performs all thread management (i.e., creation, scheduling, and barrier synchronization) entirely in hardware, with zero overhead, for a group of 32 threads called a warp. The zero overhead of lightweight thread scheduling, combined with fast barrier synchronization, allows the GPU to efficiently support fine-grained parallelism.
In the first years of GPGPU, developers used graphics APIs such as OpenGL [10] and DirectX [33] to map applications onto the graphics rendering unit. Recently, several GPGPU languages, including AMD CTM [1], BrookGPU [3] and NVIDIA CUDA [4], have been proposed by GPU vendors and academic researchers. These high-level languages abstract the underlying graphics architecture and provide a programming environment similar to the multi-threaded C/C++ languages.
Compute Unified Device Architecture (CUDA) is a parallel computing architecture for NVIDIA graphics cards. The CUDA API is a set of library functions which can be viewed as an extension of the C language. From a strict software point of view, a CUDA program is a collection of threads executing in parallel. A batch of threads is called a block and a collection of blocks is called a grid. The threads inside the same block can communicate with each other by either sharing data through the block's shared memory or synchronizing their execution. The BrookGPU programming language has a similar purpose to CUDA, but it was historically developed for ATI cards. In the BrookGPU environment, a kernel function is likewise executed on the GPU by a grid of thread blocks.
In this thesis, we use the CUDA parallel computing architecture designed for NVIDIA graphics cards. The CUDA API is a set of library functions to access the GPU and can be viewed as an extension of the C language. The CUDA programming model consists of sequential host code and parallel kernel code. The former launches the parallel code (the kernel) on parallel devices (typically GPUs, though kernels can also be executed efficiently on multi-core CPUs [160]) and is responsible for communicating data between main memory and GPU memory. The latter contains instructions that are designed for the Single Instruction, Multiple Threads (SIMT) architecture.

2.4 SIMT: Single Instruction, Multiple Threads
A series of algorithms appropriate for fine-grained parallelism is discussed by Hillis and Steele in [84]. This article demonstrates a programming style that is suitable for a machine with thousands or millions of processors, such as the Connection Machine systems [85] or Graphics Processing Units. This programming style is known as data-parallel, as opposed to the control-parallel style. The latter model is appropriate for control-structured algorithms such as searching the chess game tree [63]. The data-parallel model is
Figure 2.1: CUDA thread organization. (The figure shows a grid of thread blocks, Block 0,0 through Block 1,1, each containing threads identified by their thread coordinates.)
known as a successful model for broad classes of scientific and engineering applications [85].
In contrast to instruction-level parallelism within a single thread, the GPU leverages thread-level parallelism. Basically, a multiprocessor GPU is designed for executing hundreds of threads concurrently. To manage such a large number of threads, it utilizes a unique architecture called SIMT (Single Instruction, Multiple Threads). In this architecture, data-parallel processing can be achieved by mapping data elements to parallel processing threads. The GPU handles these threads by leveraging hardware multithreading to maximize the utilization of its functional units.
SIMT specifies the execution and branching behavior of a single thread. The SIMT architecture allows thread-level parallel instructions for independent threads as well as data-parallel instructions for coordinated threads. Therefore, the SIMT architecture must be well utilized in order to exploit the parallel processing power of the GPU and achieve a substantial performance improvement.

In the CUDA programming framework, SIMT is defined through data-parallel threads, and a CUDA kernel essentially is a set of sequential instructions that are executed by different threads on the GPU simultaneously and concurrently. Basically, launching a CUDA kernel creates a grid of threads, each running the same instructions on a different portion of the data. The threads in CUDA are organized in two levels: a batch of threads forms a block and a collection of blocks makes a grid. The cooperation between the threads is on a per-block basis. In other words, the threads inside a block can cooperate through shared memory and via barrier synchronization.
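To make the two-level organization concrete, the following Python sketch (CUDA kernels are written in C, so this is only a model of the indexing scheme) mimics a one-dimensional kernel launch, where every (block, thread) pair computes its own global index into the data:

```python
def launch_kernel_1d(kernel, grid_dim, block_dim, data):
    """Simulate a 1-D CUDA launch: every (block, thread) pair runs the
    same kernel function, each computing its own global data index."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # equivalent of: idx = blockIdx.x * blockDim.x + threadIdx.x
            idx = block_idx * block_dim + thread_idx
            if idx < len(data):          # guard against out-of-range threads
                kernel(idx, data)

def square_kernel(idx, data):
    """Each simulated thread squares its own element of the data."""
    data[idx] = data[idx] * data[idx]
```

On real hardware the two loops run as concurrent threads; the bounds guard is the standard idiom when the data size is not a multiple of the block size.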
Figure 2.2: A set of SIMT multiprocessors
The SIMT architecture is supported on modern NVIDIA GPUs via an array of Streaming Multiprocessors (SMs). Each SM can support a limited number of co-resident concurrent threads (the threads of a block), as these threads must share the limited memory resources of that SM. Furthermore, a single SM consists of multiple Scalar Processor (SP) cores (usually 8). Moreover, each SM issues one instruction at a time, which means all SPs in one SM execute the same instruction. For instance, the GeForce 9800 GT contains 14 SMs and 112 SPs, and each SM can support up to 512 threads (see Figure 2.2). In fact, a block of a CUDA kernel is a virtual SM multiprocessor of the physical GPU.
The SM multiprocessor performs all thread management, i.e. creation, scheduling and barrier synchronization, entirely in hardware, with zero overhead, for a group of 32 threads called a warp (the first parallel thread technology). Individual threads composing a warp are free to branch and execute independently but start together at the same program address. More precisely, the execution context of each warp (registers, program counters, etc.) is maintained on-chip for the entire lifetime of the warp. Therefore, switching from one warp to another has no cost. At every instruction issue time, each SM selects a warp that has threads ready to execute and issues the next instruction to the active threads of that warp. The zero overhead of lightweight thread scheduling on one hand, and fast barrier synchronization on the other, allow the GPU to efficiently support very fine-grained parallelism (such as assigning a single thread to each data element).

The programmer defines the number of blocks and the number of threads per block before launching a kernel, and all thread blocks have the same number of threads. The number of thread blocks is usually dictated by the size of the data being processed. Depending on how many registers and how much shared memory a block of threads requires, multiple blocks may be assigned to a single SM. On the other hand, all the threads inside a block must be processed before the block can be swapped out of the SM. When the number of blocks exceeds the number of SMs, based on the vacant registers and shared memory of the SMs, each SM may run more than one block at a time, while the remaining blocks wait in a queue and execute later. This is a combination of simultaneous and concurrent execution of the thread blocks.
However, in the SIMT architecture we need to specify the appropriate portion of the data to be processed by each thread. As a result, CUDA assigns each thread a coordinate so that it is uniquely identifiable. The thread coordinates follow the same two-level hierarchy (i.e. thread blocks and grids). The CUDA thread organization is illustrated in Figure 2.1. For scheduling the threads inside a block, the GPU first looks at the warp they belong to. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs: threads with IDs from 0 to 31 are assigned to the first warp, threads with IDs from 32 to 63 to the second warp, and so on. To execute a warp, the threads that perform the same instruction are executed concurrently, whereas threads that diverge are executed sequentially (explained in section 2.5).
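The thread-to-warp mapping just described is fixed arithmetic. Assuming a warp size of 32, as on the NVIDIA GPUs discussed here, it amounts to:

```python
WARP_SIZE = 32  # warp size on the NVIDIA GPUs discussed in this chapter

def warp_of(thread_id):
    """Warp index of a thread within its block: IDs 0-31 map to warp 0,
    IDs 32-63 to warp 1, and so on."""
    return thread_id // WARP_SIZE

def lane_of(thread_id):
    """Position (lane) of the thread inside its warp."""
    return thread_id % WARP_SIZE
```

For example, thread 33 belongs to warp 1 and occupies lane 1 of that warp.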
The GPU is a compute device that operates as a coprocessor to the CPU, with the capability of executing a large number of threads in parallel. Therefore, the data-parallel portions of algorithms can be isolated into CUDA kernels and executed on the GPU as many different threads. These kernels are compiled to the PTX instruction set, and then translated at install time to the target GPU instruction set. PTX-ISA, a low-level parallel thread execution virtual machine and instruction set architecture, exposes the GPU as a data-parallel computing device [140]. PTX provides a stable parallel programming model and generates a machine-independent ISA for C/C++ which is scalable from a single GPU unit to many parallel units. A PTX program determines the execution of a thread inside a cooperative thread array (CTA). In fact, CTAs implement CUDA thread blocks. Furthermore, a CTA is a set of threads that execute the same kernel concurrently or in parallel. The threads inside a CTA can communicate through synchronization points, where threads wait until all threads in the CTA have arrived. Each CTA thread uses its identifier to find its assigned work, determine its role, input and output positions, and compute addresses.