The specific situation of data dependences between columns can be expressed using the relation parent(j) [124, 118]. For each column j, 0 ≤ j < n, we define

parent(j) = min{i | i ∈ Struct(L∗j)}   if Struct(L∗j) ≠ ∅,

i.e., parent(j) is the row index of the first off-diagonal non-zero of column j. If Struct(L∗j) = ∅, then parent(j) = j. The element parent(j) is the first column i > j which depends on j. A column l, j < l < i, between them does not depend on j, since j ∉ Struct(Ll∗) and no cmod(l, j) is executed. Moreover, we define for 0 ≤ i < n

children(i) = {j < i | parent(j) = i},

i.e., children(i) contains all columns j that have their first off-diagonal non-zero in row i.
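To make these definitions concrete, the following C sketch derives parent(j) and children(i) from the column structures. It assumes a compressed sparse column representation in which the row indices of the off-diagonal non-zeros of column j are stored in increasing order in row_ind[col_ptr[j]], ..., row_ind[col_ptr[j+1]-1]; the array and function names are illustrative assumptions, not part of the original presentation.

/* Sketch: derive parent(j) and children(i) from the sparsity structure of L.
   Assumed format: for column j, the row indices of the non-zeros below the
   diagonal are row_ind[col_ptr[j] .. col_ptr[j+1]-1], in increasing order. */
#include <stdlib.h>

void build_elimination_tree(int n, const int *col_ptr, const int *row_ind,
                            int *parent, int **children, int *num_children) {
    for (int i = 0; i < n; i++)
        num_children[i] = 0;

    for (int j = 0; j < n; j++) {
        if (col_ptr[j] < col_ptr[j + 1])
            parent[j] = row_ind[col_ptr[j]];  /* first off-diagonal non-zero */
        else
            parent[j] = j;                    /* Struct(L*j) is empty */
    }

    /* children(i) = { j < i | parent(j) = i } */
    for (int j = 0; j < n; j++)
        if (parent[j] != j)
            num_children[parent[j]]++;
    for (int i = 0; i < n; i++)
        children[i] = (int *)malloc(num_children[i] * sizeof(int));

    int *fill = (int *)calloc(n, sizeof(int));
    for (int j = 0; j < n; j++)
        if (parent[j] != j)
            children[parent[j]][fill[parent[j]]++] = j;
    free(fill);
}

A column i is a leaf of the elimination tree exactly if num_children[i] is 0; as discussed below, these leaves can be computed in parallel.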
The directed graph G = (V, E) has a set of nodes V = {0, . . . , n − 1} with one node for each column and a set of edges E, where (i, j) ∈ E if i = parent(j) and i ≠ j. It can be shown that G is a tree if matrix A is irreducible. (A matrix A is called reducible if A can be permuted such that it is block-diagonal. For a reducible matrix, the blocks can be factorized independently.) In the following, we assume an irreducible matrix. Figure 7.25 shows a matrix and its corresponding elimination tree.
In the following, we denote the subtree with root j by G[j]. For sparse Cholesky factorization, an important property of the elimination tree G is that it specifies the order in which the columns must be evaluated: The definition of parent implies that column i must be evaluated before column j if j = parent(i). Thus, all the children of column j must be completely evaluated before the computation of j. Moreover, column j does not depend on any column that is not in the subtree G[j]. Hence, columns i and j can be computed in parallel if G[i] and G[j] are disjoint subtrees. In particular, all leaves of the elimination tree can be computed in parallel, and the computation does not need to start with column 0. Thus, the sparsity structure determines the parallelism to be exploited. For a given matrix, elimination trees of smaller height usually represent a larger degree of parallelism than trees of larger height [77].
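As an illustration of the remark on tree height, the following bottom-up sketch computes the height of each subtree G[j]; it relies on the property parent(j) > j for non-root columns and reuses the hypothetical parent array from the previous sketch.

/* Sketch: height of each subtree G[j] of the elimination tree.
   Since parent(j) > j for every non-root column j, a single left-to-right
   pass suffices: when column j is visited, all of its children (which are
   smaller than j) have already contributed to height[j]. */
void subtree_heights(int n, const int *parent, int *height) {
    for (int j = 0; j < n; j++)
        height[j] = 0;
    for (int j = 0; j < n; j++)
        if (parent[j] != j && height[parent[j]] < height[j] + 1)
            height[parent[j]] = height[j] + 1;
    /* height[root] now holds the height of the complete elimination tree */
}

The height found at the root can then be compared for different orderings of the matrix.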
Fig. 7.25 Sparse matrix and the corresponding elimination tree
7.5.3.1 Parallel Left-Looking Algorithms
The parallel implementation of the left-looking algorithm (III) is based on n column tasks Tcol(0), . . . , Tcol(n − 1), where task Tcol(j), 0 ≤ j < n, comprises the execution of cmod(j, k) for all k ∈ Struct(Lj∗) and the execution of cdiv(j); this is the loop body of the for loop in algorithm (III). These tasks are not independent of each other but have dependences due to the non-zero elements. The parallel implementation uses a task pool for managing the execution of the tasks: a central task pool stores the column tasks and can be accessed by every processor. Each processor is responsible for performing a subset of the column tasks. The assignment of tasks to processors for execution is dynamic, i.e., when a processor is idle, it takes a task from the central task pool.
The dynamic implementation has the advantage that the workload is distributed evenly, although the tasks might have different execution times due to the sparsity structure. The concurrent accesses of the processors to the central task pool have to be conflict-free so that the unique assignment of a task to a processor for execution is guaranteed. This can be implemented by a locking mechanism, so that only one processor accesses the task pool at a specific time.
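A minimal sketch of such a locked central task pool is given below, using POSIX threads; the data structure and function names are assumptions for illustration and do not correspond to a specific implementation in the text.

#include <pthread.h>

/* Sketch: central task pool protected by a mutex so that each column task
   is handed out to exactly one processor (thread).  pool->tasks points to
   an array large enough to hold all pending column indices. */
typedef struct {
    int *tasks;            /* column indices that are ready for execution */
    int count;             /* number of tasks currently in the pool */
    pthread_mutex_t lock;
} task_pool_t;

/* Returns a ready column index, or -1 if the pool is currently empty. */
int get_task(task_pool_t *pool) {
    int j = -1;
    pthread_mutex_lock(&pool->lock);
    if (pool->count > 0)
        j = pool->tasks[--pool->count];
    pthread_mutex_unlock(&pool->lock);
    return j;
}

void add_task(task_pool_t *pool, int j) {
    pthread_mutex_lock(&pool->lock);
    pool->tasks[pool->count++] = j;
    pthread_mutex_unlock(&pool->lock);
}

A complete implementation would additionally need a termination condition that distinguishes a temporarily empty pool from the situation in which all n column tasks have been finished.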
There are several parallel implementation variants for the left-looking algorithm, differing in the way the column tasks are inserted into the task pool. We consider three implementation variants:
• Variant L1 inserts column task Tcol(j) into the task pool not before all column tasks Tcol(k) with k ∈ Struct(Lj∗) have been finished. The task pool can be initialized to the leaves of the elimination tree. The degree of parallelism is limited by the number of independent nodes of the tree, since tasks dependent on each other are executed in sequential order. Hence, a processor that has accessed task Tcol(j) can execute the task without waiting for other tasks to be finished.
• Variant L2 allows the execution of Tcol(j) to start without requiring that it can be executed to completion immediately. The task pool is initialized to all column tasks available. The column tasks are accessed by the processors dynamically from left to right, i.e., an idle processor accesses the next column that has not yet been assigned to a processor.
The computation of column task Tcol(j) may thus be started before all tasks Tcol(k) with k ∈ Struct(Lj∗) have been finished. In this case, not all operations cmod(j, k) of Tcol(j) can be executed immediately; the task can perform only those cmod(j, k) operations with k ∈ Struct(Lj∗) for which the corresponding tasks have already been executed. Thus, the task might have to wait during its execution for other tasks to be finished.
To control the execution of a single column task Tcol(j), each column j is assigned a data structure Sj containing all columns k ∈ Struct(Lj∗) for which cmod(j, k) can already be executed. When a processor finishes the execution of the column task Tcol(k) (by executing cdiv(k)), it pushes k onto the data structure Sj for each j ∈ Struct(L∗k). Because different processors might try to access the same stack at the same time, a locking mechanism has to be used to avoid access conflicts.
The processor executing Tcol(j) pops column indices k from Sj and executes the corresponding cmod(j, k) operation. If Sj is empty, the processor waits for another processor to insert new column indices. When |Struct(Lj∗)| column indices have been retrieved from Sj, the task Tcol(j) can execute the final cdiv(j) operation. (A sketch of this bookkeeping is given after this list of variants.)
Figure 7.26 shows the corresponding implementation. The central task pool is realized implicitly as a parallel loop; the operation get unique index() ensures a conflict-free assignment of tasks, so that processors accessing the pool at the same time get different unique loop indices representing column tasks. The loop body of the while loop implements one task Tcol(j). The data structures S1, . . . , Sn are stacks; pop(Sj) retrieves an element and push(j, Si) inserts element j onto stack Si.

Fig. 7.26 Parallel left-looking algorithm according to variant L2. The implicit task pool is implemented in the while loop and the function get unique index(). The stacks S1, . . . , Sn implement the bookkeeping about the dependent columns already finished
• Variant L3 is a variation of L2 that takes the structure of the elimination tree into consideration. The columns are not assigned strictly from left to right to the processors, but according to their height in the elimination tree, i.e., the children of a column j in the elimination tree are assigned to processors before their parent j. This variant tries to complete the column tasks in the order in which the columns are needed for the completion of the other columns, thus exploiting the additional parallelism that is provided by the sparsity structure of the matrix.
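The following sketch outlines the bookkeeping of variant L2 for a single column task Tcol(j), as announced in the description of variant L2 above. The helper routines pop_stack(), push_stack(), cmod(), cdiv() and the arrays and accessors needed[], struct_col_size(), struct_col_elem() are assumed to exist with the signatures shown; they are illustrative names, not functions defined in the text.

/* Sketch of one column task Tcol(j) in variant L2.
   S[j] is a stack of column indices k for which cmod(j,k) can already be
   executed; needed[j] = |Struct(L_{j*})| is the number of modifications
   column j must receive before cdiv(j) may be performed. */
extern int  pop_stack(int j);              /* returns -1 if S[j] is empty     */
extern void push_stack(int j, int k);      /* locked insertion of k into S[j] */
extern void cmod(int j, int k);
extern void cdiv(int j);
extern int  needed[];                      /* needed[j] = |Struct(L_{j*})|    */
extern int  struct_col_size(int j);        /* |Struct(L_{*j})|                */
extern int  struct_col_elem(int j, int m); /* m-th element of Struct(L_{*j})  */

void column_task_L2(int j) {
    int received = 0;
    while (received < needed[j]) {
        int k = pop_stack(j);
        if (k < 0) continue;               /* busy-wait for further columns   */
        cmod(j, k);
        received++;
    }
    cdiv(j);
    /* announce that column j is finished to all columns depending on it */
    for (int m = 0; m < struct_col_size(j); m++)
        push_stack(struct_col_elem(j, m), j);
}

The busy-waiting loop mirrors the description above: the task blocks until |Struct(Lj∗)| column indices have been retrieved from Sj and only then performs the final cdiv(j).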
7.5.3.2 Parallel Right-Looking Algorithm
The parallel implementation of the right-looking algorithm (IV) is also based on a task pool and on column tasks. These column tasks are defined differently from the tasks of the parallel left-looking algorithm: A column task Tcol(j), 0 ≤ j < n, comprises the execution of cdiv(j) and cmod(k, j) for all k ∈ Struct(L∗j), i.e., a column task comprises the final computation for column j and the modifications of all columns k > j right of column j that depend on j. The task pool is initialized to all column tasks corresponding to the leaves of the elimination tree. A task Tcol(j) that is not a leaf is inserted into the task pool as soon as the operations cmod(j, k) for all k ∈ Struct(Lj∗) have been executed and the final cdiv(j) operation is possible.
Figure 7.27 sketches a parallel implementation of the right-looking algorithm. The task assignment is implemented by maintaining a counter cj for each column j. The counter is initialized to 0 and is incremented after the execution of each cmod(j, ∗) operation by the corresponding processor using the conflict-free procedure add counter(). For the execution of a cmod(k, j) operation of a task Tcol(j), column k must be locked to prevent other tasks from modifying the same column at the same time. A task Tcol(j) is inserted into the task pool when the counter cj has reached the value |Struct(Lj∗)|.

Fig. 7.27 Parallel right-looking algorithm. The column tasks are managed by a task pool TP. Column tasks are inserted into the task pool by add column() and retrieved from the task pool by get column(). The function initialize task pool() initializes the task pool TP with the leaves of the elimination tree. The condition of the outer while loop assigns column indices j to processors; a processor retrieves the corresponding column task by the call of get column()
The differences between this right-looking implementation and the left-looking variant L2 lie in the execution order of the cmod() operations and in the executing processor. In the L2 variant, the operation cmod(j, k) is initiated by the processor computing column k, by pushing k onto stack Sj, but the operation is executed by the processor computing column j. This execution need not be performed immediately after the initiation of the operation. In the right-looking variant, the operation cmod(j, k) is not only initiated, but also executed by the processor that computes column k.
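A possible realization of the counter mechanism of the right-looking variant is sketched below; add counter() and add column() are taken from the description of Fig. 7.27 (written with underscores here), while the arrays counter[], needed[], and col_lock[] are assumptions for illustration.

#include <pthread.h>

/* Sketch: conflict-free counter increment for the right-looking variant.
   counter[j] counts the cmod(j,*) operations already applied to column j;
   when it reaches needed[j] = |Struct(L_{j*})|, the task Tcol(j) becomes
   ready and is inserted into the central task pool. */
extern int counter[];
extern int needed[];
extern pthread_mutex_t col_lock[];
extern void add_column(int j);        /* insert Tcol(j) into the task pool */

void add_counter(int j) {
    int ready;
    pthread_mutex_lock(&col_lock[j]);
    counter[j]++;
    ready = (counter[j] == needed[j]);
    pthread_mutex_unlock(&col_lock[j]);
    if (ready)
        add_column(j);
}

The same per-column lock can be used to protect the cmod(k, j) update of column k mentioned above, so that no two tasks modify the same column concurrently.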
7.5.3.3 Parallel Supernodal Algorithm
The parallel implementation of the supernodal algorithm uses a partition into fundamental supernodes. A supernode I(p) = {p, p + 1, . . . , p + q − 1} is a fundamental supernode if, for each i with 0 ≤ i ≤ q − 2, we have children(p + i + 1) = {p + i}, i.e., node p + i is the only child of p + i + 1 in the elimination tree [124]. In Fig. 7.23, supernode I(2) = {2, 3, 4} is a fundamental supernode, whereas supernodes I(6) = {6, 7} and I(8) = {8, 9} are not fundamental. In a partition into fundamental supernodes, all columns of a supernode can be computed as soon as the first column can be computed; waiting for the computation of columns outside the supernode is not needed. In the following, we assume that all supernodes are fundamental, which can be achieved by splitting supernodes into smaller ones. A supernode consisting of a single column is fundamental.
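A small sketch of the corresponding test is given below; it checks whether a supernode I(p) = {p, . . . , p + q − 1} is fundamental, using the children information of the elimination tree (the array names are assumptions carried over from the earlier sketch).

/* Sketch: a supernode I(p) = {p, ..., p+q-1} is fundamental if, for each
   0 <= i <= q-2, node p+i is the only child of p+i+1 in the elimination
   tree. */
int is_fundamental(int p, int q, const int *num_children, int **children) {
    for (int i = 0; i <= q - 2; i++) {
        if (num_children[p + i + 1] != 1 || children[p + i + 1][0] != p + i)
            return 0;
    }
    return 1;  /* also covers q == 1: a single-column supernode is fundamental */
}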
The parallel implementation of the supernodal algorithm (VI) is based on supernode tasks Tsup(J), where task Tsup(J) for 0 ≤ J < N comprises the execution of smod(j, J) and cdiv(j) for each j ∈ J from left to right and the execution of smod(k, J) for all k ∈ Struct(L∗(last(J))); N is the number of supernodes. Tasks Tsup(J) that are ready for execution are held in a central task pool that is accessed by the idle processors. The pool is initialized to the supernodes that are ready for completion; these are the supernodes whose first column is a leaf in the elimination tree. If the execution of an smod(k, J) operation by a task Tsup(J) finishes the modification of another supernode K > J, the corresponding task Tsup(K) is inserted into the task pool.
The assignment of supernode tasks is again implemented by maintaining a counter cj for each column j of a supernode. Each counter is initialized to 0 and is incremented for each modification that is executed for column j. Ignoring the modifications with columns inside a supernode, a supernode task Tsup(J) is ready for execution if the counters of the columns j ∈ J reach the value |Struct(Lj∗)|. The implementation of the counters as well as the manipulation of the columns has to be protected, e.g., by a lock mechanism. For the manipulation of a column k ∉ J by an smod(k, J) operation, column k is locked to avoid concurrent manipulation by different processors. Figure 7.28 shows the corresponding implementation.
Fig. 7.28 Parallel supernodal algorithm
7.6 Exercises for Chap. 7
Exercise 7.1 For an n × m matrix A, a vector a of length n, and a vector b of length m, write a parallel MPI program which computes the rank-1 update A = A − a · b^T, which can be computed sequentially by

for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    A[i][j] = A[i][j] - a[i] * b[j];
For the parallel MPI implementation assume that A is distributed among the p processors in a column-cyclic way. The vectors a and b are available at the process with rank 0 only and must be distributed appropriately before the computation. After the update operation, the matrix A should again be distributed in a column-cyclic way.
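A possible structure for the parallel update step is sketched below; it is only a partial sketch under the stated column-cyclic distribution, with an assumed column-major local storage A_loc, and is not a complete solution of the exercise.

#include <mpi.h>

/* Sketch: rank-1 update A = A - a * b^T for an n x m matrix A distributed
   column-cyclically: process r owns the columns j with j % p == r, stored
   consecutively (column-major) in A_loc.  a and b are replicated by the
   broadcasts below. */
void rank1_update(double *A_loc, double *a, double *b,
                  int n, int m, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    /* a and b are initially available at process 0 only */
    MPI_Bcast(a, n, MPI_DOUBLE, 0, comm);
    MPI_Bcast(b, m, MPI_DOUBLE, 0, comm);

    int jl = 0;                              /* local column index            */
    for (int j = rank; j < m; j += p, jl++)  /* global columns owned locally  */
        for (int i = 0; i < n; i++)
            A_loc[jl * n + i] -= a[i] * b[j];
    /* A remains column-cyclically distributed, as required */
}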
Exercise 7.2 Implement the rank-1 update in OpenMP. Use a parallel for loop to express the parallel execution.
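A minimal sketch of the OpenMP version, assuming the full matrix is available in shared memory (the two-dimensional array layout is an assumption):

#include <omp.h>

/* Sketch: rank-1 update with a parallel for loop; the iterations of the
   outer loop are independent and need no synchronization. */
void rank1_update_omp(int n, int m, double A[n][m],
                      const double a[n], const double b[m]) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            A[i][j] -= a[i] * b[j];
}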
Exercise 7.3 Extend the program piece in Fig. 7.2 for performing Gaussian elimination with a row-cyclic data distribution to a full MPI program. To do so, all helper functions used and described in the text must be implemented. Measure the resulting execution times for different matrix sizes and different numbers of processors.
Exercise 7.4 Similar to the previous exercise, transform the program piece in Fig. 7.6 with a total cyclic data distribution to a full MPI program. Compare the resulting execution times for different matrix sizes and different numbers of processors. For which scenarios does a significant difference occur? Try to explain the observed behavior.
Exercise 7.5 Develop a parallel implementation of Gaussian elimination for shared address spaces using OpenMP. The MPI implementation from Fig. 7.2 can be used as an orientation. Explain how the available parallelism is expressed in your OpenMP implementation. Also explain where synchronization is needed when accessing shared data. Measure the resulting execution times for different matrix sizes and different numbers of processors.
Exercise 7.6 Develop a parallel implementation of Gaussian elimination using Java threads. Define a new class Gaussian which is structured similarly to the Java program in Fig. 6.23 for matrix multiplication. Explain which synchronization is needed in the program. Measure the resulting execution times for different matrix sizes and different numbers of processors.
Exercise 7.7 Develop a parallel MPI program for Gaussian elimination using a column-cyclic data distribution. An implementation with a row-cyclic distribution has been given in Fig. 7.2. Explain which communication is needed for a column-cyclic distribution and include this communication in your program. Compute the resulting speedup values for different matrix sizes and different numbers of processors.
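One possible organization of the communication for a column-cyclic distribution is sketched below for a single elimination step (without pivoting); the local storage layout and function name are assumptions, and only this inner step of the exercise is shown.

#include <mpi.h>

/* Sketch of elimination step k for a column-cyclic distribution of an
   n x n matrix: process k % p owns column k, computes the elimination
   factors l[i] = a[i][k] / a[k][k] for i > k, and broadcasts them; every
   process then updates its own columns j > k.  A_loc stores the local
   columns consecutively in column-major order; l is a buffer of length n. */
void elimination_step(int k, int n, int p, int rank,
                      double *A_loc, double *l, MPI_Comm comm) {
    int owner = k % p;
    if (rank == owner) {
        int kl = k / p;                          /* local index of column k  */
        for (int i = k + 1; i < n; i++)
            l[i] = A_loc[kl * n + i] / A_loc[kl * n + k];
    }
    MPI_Bcast(l + k + 1, n - k - 1, MPI_DOUBLE, owner, comm);

    /* update all local columns j > k */
    for (int j = rank; j < n; j += p) {
        if (j <= k) continue;
        int jl = j / p;
        for (int i = k + 1; i < n; i++)
            A_loc[jl * n + i] -= l[i] * A_loc[jl * n + k];
    }
}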
Exercise 7.8 For n = 8, consider the following tridiagonal equation system:

\[
\begin{pmatrix}
1 & 1 &        &        &   \\
1 & 2 & 1      &        &   \\
  & \ddots & \ddots & \ddots & \\
  &        & 1      & 2      & 1 \\
  &        &        & 1      & 2
\end{pmatrix}
\cdot x =
\begin{pmatrix} 1 \\ 2 \\ 3 \\ \vdots \\ 8 \end{pmatrix} .
\]
Use the recursive doubling technique from Sect. 7.2.2, p. 385, to solve this equation system.
Exercise 7.9 Develop a sequential implementation of the cyclic reduction algorithm for solving tridiagonal equation systems, see Sect. 7.2.2, p. 385. Measure the resulting sequential execution times for different matrix sizes, starting with size n = 100 up to size n = 10^7.
Exercise 7.10 Transform the sequential implementation of the cyclic reduction algorithm from the last exercise into a parallel implementation for a shared address space using OpenMP. Use an appropriate parallel for loop to express the parallel execution. Measure the resulting parallel execution times for different numbers of processors for the same matrix sizes as in the previous exercise. Compute the resulting speedup values and show the speedup values in a diagram.
Exercise 7.11 Develop a parallel MPI implementation of the cyclic reduction algorithm for a distributed address space based on the description in Sect. 7.2.2, p. 385. Measure the resulting parallel execution times for different numbers of processors and compute the resulting speedup values.
Exercise 7.12 Specify the data dependence graph for the cyclic reduction algorithm for n = 12 equations according to Fig. 7.11. For p = 3 processors, illustrate the three phases according to Fig. 7.12 and show which dependences lead to communication.
Exercise 7.13 Implement a parallel Jacobi iteration with a pointer-based storage scheme of the matrix A such that global indices are used in the implementation.
Exercise 7.14 Consider the parallel implementation of the Jacobi iteration in Fig. 7.13 and provide a corresponding shared memory program using OpenMP operations.
Exercise 7.15 Implement a parallel SOR method for a dense linear equation system by modifying the parallel program in Fig. 7.14.
Exercise 7.16 Provide a shared memory implementation of the Gauss–Seidel method for the discretized Poisson equation.
Exercise 7.17 Develop a shared memory implementation of the Cholesky factorization A = L L^T for a dense matrix A using the basic algorithm.
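A possible starting point is sketched below, using the column-oriented formulation in which cdiv(j) scales column j and the independent cmod(k, j) operations of step j are executed by a parallel loop; the in-place storage of L in the lower triangle of A is an assumption, and the sketch is not a tuned solution of the exercise.

#include <math.h>
#include <omp.h>

/* Sketch: dense right-looking Cholesky A = L L^T, computing L in the lower
   triangle of A.  In step j, cdiv(j) scales column j; the subsequent
   cmod(k,j) operations for k > j are independent of each other and are
   executed in parallel. */
void cholesky_omp(int n, double A[n][n]) {
    for (int j = 0; j < n; j++) {
        /* cdiv(j) */
        A[j][j] = sqrt(A[j][j]);
        for (int i = j + 1; i < n; i++)
            A[i][j] /= A[j][j];

        /* cmod(k,j) for all k > j */
        #pragma omp parallel for
        for (int k = j + 1; k < n; k++)
            for (int i = k; i < n; i++)
                A[i][k] -= A[i][j] * A[k][j];
    }
}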
Exercise 7.18 Develop a message-passing implementation for the dense Cholesky factorization A = L L^T.
Exercise 7.19 Consider a matrix with the following non-zero entries:
(10 × 10 sparse matrix with rows and columns indexed 0, . . . , 9; the non-zero entries are marked by ∗ in the original figure.)
(a) Specify all supernodes of this matrix.
(b) Consider a supernode J with at least three entries. Specify the sequence of cmod and cdiv operations that are executed for this supernode in the right-looking supernode Cholesky factorization algorithm.
(c) Determine the elimination tree resulting for this matrix.
(d) Explain the role of the elimination tree for a parallel execution.
Exercise 7.20 Derive the parallel execution time of a message-passing program of the CG method for a distributed memory machine with a linear array as interconnection network.
Exercise 7.21 Consider a parallel implementation of the CG method in which computation step (3) is executed in parallel with computation step (4). Given a row-blockwise distribution of matrix A and a blockwise distribution of the vector, derive the data distributions for this implementation variant and give the corresponding parallel execution time.
Exercise 7.22 Implement the CG algorithm given in Fig. 7.20 with the blockwise distribution as a message-passing program using MPI.