The specific situation of data dependences between columns can be expressed using the relation parent(j) [124, 118]. For each column j, 0 ≤ j < n, we define

parent(j) = min{i | i ∈ Struct(L∗j)}   if Struct(L∗j) ≠ ∅,

i.e., parent(j) is the row index of the first off-diagonal non-zero of column j. If Struct(L∗j) = ∅, then parent(j) = j. The element parent(j) is the first column i > j which depends on j. A column l, j < l < i, between them does not depend on j, since j ∉ Struct(Ll∗) and no cmod(l, j) is executed. Moreover, we define for 0 ≤ i < n

children(i) = {j < i | parent(j) = i},

i.e., children(i) contains all columns j that have their first off-diagonal non-zero in row i.
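To make these definitions concrete, the following C sketch derives parent(j) and children(i) from the column structures. It assumes a compressed sparse column representation in which the row indices of the off-diagonal non-zeros of column j are stored in increasing order in row_ind[col_ptr[j]], ..., row_ind[col_ptr[j+1]-1]; the array and function names are illustrative assumptions, not part of the original presentation.

/* Sketch: derive parent(j) and children(i) from the sparsity structure of L.
   Assumed format: for column j, the row indices of the non-zeros below the
   diagonal are row_ind[col_ptr[j] .. col_ptr[j+1]-1], in increasing order. */
#include <stdlib.h>

void build_elimination_tree(int n, const int *col_ptr, const int *row_ind,
                            int *parent, int **children, int *num_children) {
    for (int i = 0; i < n; i++)
        num_children[i] = 0;

    for (int j = 0; j < n; j++) {
        if (col_ptr[j] < col_ptr[j + 1])
            parent[j] = row_ind[col_ptr[j]];  /* first off-diagonal non-zero */
        else
            parent[j] = j;                    /* Struct(L*j) is empty */
    }

    /* children(i) = { j < i | parent(j) = i } */
    for (int j = 0; j < n; j++)
        if (parent[j] != j)
            num_children[parent[j]]++;
    for (int i = 0; i < n; i++)
        children[i] = (int *)malloc(num_children[i] * sizeof(int));

    int *fill = (int *)calloc(n, sizeof(int));
    for (int j = 0; j < n; j++)
        if (parent[j] != j)
            children[parent[j]][fill[parent[j]]++] = j;
    free(fill);
}

A column i is a leaf of the elimination tree exactly if num_children[i] is 0; as discussed below, these leaves can be computed in parallel.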
The directed graph G = (V, E) has a set of nodes V = {0, . . . , n − 1} with one node for each column and a set of edges E, where (i, j) ∈ E if i = parent(j) and i ≠ j. It can be shown that G is a tree if matrix A is irreducible. (A matrix A is called reducible if A can be permuted such that it is block-diagonal. For a reducible matrix, the blocks can be factorized independently.) In the following, we assume an irreducible matrix. Figure 7.25 shows a matrix and its corresponding elimination tree.
In the following, we denote the subtree with root j by G[j]. For sparse Cholesky factorization, an important property of the elimination tree G is that it specifies the order in which the columns must be evaluated: The definition of parent implies that column i must be evaluated before column j if j = parent(i). Thus, all the children of column j must be completely evaluated before the computation of j. Moreover, column j does not depend on any column that is not in the subtree G[j]. Hence, columns i and j can be computed in parallel if G[i] and G[j] are disjoint subtrees. In particular, all leaves of the elimination tree can be computed in parallel, and the computation does not need to start with column 0. Thus, the sparsity structure determines the parallelism to be exploited. For a given matrix, elimination trees of smaller height usually represent a larger degree of parallelism than trees of larger height [77].
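As an illustration of the remark on tree height, the following bottom-up sketch computes the height of each subtree G[j]; it relies on the property parent(j) > j for non-root columns and reuses the hypothetical parent array from the previous sketch.

/* Sketch: height of each subtree G[j] of the elimination tree.
   Since parent(j) > j for every non-root column j, a single left-to-right
   pass suffices: when column j is visited, all of its children (which are
   smaller than j) have already contributed to height[j]. */
void subtree_heights(int n, const int *parent, int *height) {
    for (int j = 0; j < n; j++)
        height[j] = 0;
    for (int j = 0; j < n; j++)
        if (parent[j] != j && height[parent[j]] < height[j] + 1)
            height[parent[j]] = height[j] + 1;
    /* height[root] now holds the height of the complete elimination tree */
}

The height found at the root can then be compared for different orderings of the matrix.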
Fig. 7.25 Sparse matrix and the corresponding elimination tree
7.5.3.1 Parallel Left-Looking Algorithms
The parallel implementation of the left-looking algorithm (III) is based on n column tasks Tcol(0), . . . , Tcol(n − 1), where task Tcol(j), 0 ≤ j < n, comprises the execution of cmod(j, k) for all k ∈ Struct(Lj∗) and the execution of cdiv(j); this is the loop body of the for loop in algorithm (III). These tasks are not independent of each other but have dependences due to the non-zero elements. The parallel implementation uses a task pool for managing the execution of the tasks: a central task pool stores the column tasks and can be accessed by every processor. Each processor is responsible for performing a subset of the column tasks. The assignment of tasks to processors for execution is dynamic, i.e., when a processor is idle, it takes a task from the central task pool.
The dynamic implementation has the advantage that the workload is distributed evenly, although the tasks might have different execution times due to the sparsity structure. The concurrent accesses of the processors to the central task pool have to be conflict-free so that the unique assignment of a task to a processor for execution is guaranteed. This can be implemented by a locking mechanism, so that only one processor accesses the task pool at a specific time.
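A minimal sketch of such a locked central task pool is given below, using POSIX threads; the data structure and function names are assumptions for illustration and do not correspond to a specific implementation in the text.

#include <pthread.h>

/* Sketch: central task pool protected by a mutex so that each column task
   is handed out to exactly one processor (thread).  pool->tasks points to
   an array large enough to hold all pending column indices. */
typedef struct {
    int *tasks;            /* column indices that are ready for execution */
    int count;             /* number of tasks currently in the pool */
    pthread_mutex_t lock;
} task_pool_t;

/* Returns a ready column index, or -1 if the pool is currently empty. */
int get_task(task_pool_t *pool) {
    int j = -1;
    pthread_mutex_lock(&pool->lock);
    if (pool->count > 0)
        j = pool->tasks[--pool->count];
    pthread_mutex_unlock(&pool->lock);
    return j;
}

void add_task(task_pool_t *pool, int j) {
    pthread_mutex_lock(&pool->lock);
    pool->tasks[pool->count++] = j;
    pthread_mutex_unlock(&pool->lock);
}

A complete implementation would additionally need a termination condition that distinguishes a temporarily empty pool from the situation in which all n column tasks have been finished.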
There are several parallel implementation variants for the left-looking algorithm, differing in the way the column tasks are inserted into the task pool. We consider three implementation variants:
• Variant L1 inserts column task Tcol(j) into the task pool not before all column tasks Tcol(k) with k ∈ Struct(Lj∗) have been finished. The task pool can be initialized to the leaves of the elimination tree. The degree of parallelism is limited by the number of independent nodes of the tree, since tasks dependent on each other are executed in sequential order. Hence, a processor that has accessed task Tcol(j) can execute the task without waiting for other tasks to be finished.
• Variant L2 allows the execution of Tcol(j) to start without requiring that it can be executed to completion immediately. The task pool is initialized to all column tasks available. The column tasks are accessed by the processors dynamically from left to right, i.e., an idle processor accesses the next column that has not yet been assigned to a processor.
The computation of column task Tcol(j) may thus be started before all tasks Tcol(k) with k ∈ Struct(Lj∗) have been finished. In this case, not all operations cmod(j, k) of Tcol(j) can be executed immediately; the task can perform only those cmod(j, k) operations with k ∈ Struct(Lj∗) for which the corresponding tasks have already been executed. Thus, the task might have to wait during its execution for other tasks to be finished.
To control the execution of a single column task Tcol(j), each column j is assigned a data structure Sj containing all columns k ∈ Struct(Lj∗) for which cmod(j, k) can already be executed. When a processor finishes the execution of the column task Tcol(k) (by executing cdiv(k)), it pushes k onto the data structure Sj for each j ∈ Struct(L∗k). Because different processors might try to access the same stack at the same time, a locking mechanism has to be used to avoid access conflicts.
The processor executing Tcol(j) pops column indices k from Sj and executes the corresponding cmod(j, k) operation. If Sj is empty, the processor waits for another processor to insert new column indices. When |Struct(Lj∗)| column indices have been retrieved from Sj, the task Tcol(j) can execute the final cdiv(j) operation. (A sketch of this bookkeeping is given after this list of variants.)
Figure 7.26 shows the corresponding implementation. The central task pool is realized implicitly as a parallel loop; the operation get unique index() ensures a conflict-free assignment of tasks, so that processors accessing the pool at the same time get different unique loop indices representing column tasks. The loop body of the while loop implements one task Tcol(j). The data structures S1, . . . , Sn are stacks; pop(Sj) retrieves an element and push(j, Si) inserts element j onto stack Si.

Fig. 7.26 Parallel left-looking algorithm according to variant L2. The implicit task pool is implemented in the while loop and the function get unique index(). The stacks S1, . . . , Sn implement the bookkeeping about the dependent columns already finished
• Variant L3 is a variation of L2 that takes the structure of the elimination tree into consideration. The columns are not assigned strictly from left to right to the processors, but according to their height in the elimination tree, i.e., the children of a column j in the elimination tree are assigned to processors before their parent j. This variant tries to complete the column tasks in the order in which the columns are needed for the completion of the other columns, thus exploiting the additional parallelism that is provided by the sparsity structure of the matrix.
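The following sketch outlines the bookkeeping of variant L2 for a single column task Tcol(j), as announced in the description of variant L2 above. The helper routines pop_stack(), push_stack(), cmod(), cdiv() and the arrays and accessors needed[], struct_col_size(), struct_col_elem() are assumed to exist with the signatures shown; they are illustrative names, not functions defined in the text.

/* Sketch of one column task Tcol(j) in variant L2.
   S[j] is a stack of column indices k for which cmod(j,k) can already be
   executed; needed[j] = |Struct(L_{j*})| is the number of modifications
   column j must receive before cdiv(j) may be performed. */
extern int  pop_stack(int j);              /* returns -1 if S[j] is empty     */
extern void push_stack(int j, int k);      /* locked insertion of k into S[j] */
extern void cmod(int j, int k);
extern void cdiv(int j);
extern int  needed[];                      /* needed[j] = |Struct(L_{j*})|    */
extern int  struct_col_size(int j);        /* |Struct(L_{*j})|                */
extern int  struct_col_elem(int j, int m); /* m-th element of Struct(L_{*j})  */

void column_task_L2(int j) {
    int received = 0;
    while (received < needed[j]) {
        int k = pop_stack(j);
        if (k < 0) continue;               /* busy-wait for further columns   */
        cmod(j, k);
        received++;
    }
    cdiv(j);
    /* announce that column j is finished to all columns depending on it */
    for (int m = 0; m < struct_col_size(j); m++)
        push_stack(struct_col_elem(j, m), j);
}

The busy-waiting loop mirrors the description above: the task blocks until |Struct(Lj∗)| column indices have been retrieved from Sj and only then performs the final cdiv(j).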
7.5.3.2 Parallel Right-Looking Algorithm
The parallel implementation of the right-looking algorithm (IV) is also based on a task pool and on column tasks. These column tasks are defined differently from the tasks of the parallel left-looking algorithm: A column task Tcol(j), 0 ≤ j < n, comprises the execution of cdiv(j) and cmod(k, j) for all k ∈ Struct(L∗j), i.e., a column task comprises the final computation for column j and the modifications of all columns k > j right of column j that depend on j. The task pool is initialized to all column tasks corresponding to the leaves of the elimination tree. A task Tcol(j) that is not a leaf is inserted into the task pool as soon as the operations cmod(j, k) for all k ∈ Struct(Lj∗) have been executed and the final cdiv(j) operation is possible.
Figure 7.27 sketches a parallel implementation of the right-looking algorithm. The task assignment is implemented by maintaining a counter cj for each column j. The counter is initialized to 0 and is incremented after the execution of each cmod(j, ∗) operation by the corresponding processor using the conflict-free procedure add counter(). For the execution of a cmod(k, j) operation of a task Tcol(j), column k must be locked to prevent other tasks from modifying the same column at the same time. A task Tcol(j) is inserted into the task pool when the counter cj has reached the value |Struct(Lj∗)|.

Fig. 7.27 Parallel right-looking algorithm. The column tasks are managed by a task pool TP. Column tasks are inserted into the task pool by add column() and retrieved from the task pool by get column(). The function initialize task pool() initializes the task pool TP with the leaves of the elimination tree. The condition of the outer while loop assigns column indices j to processors; a processor retrieves the corresponding column task by the call of get column()
The differences between this right-looking implementation and the left-looking variant L2 lie in the execution order of the cmod() operations and in the executing processor. In the L2 variant, the operation cmod(j, k) is initiated by the processor computing column k, by pushing k onto stack Sj, but the operation is executed by the processor computing column j. This execution need not be performed immediately after the initiation of the operation. In the right-looking variant, the operation cmod(j, k) is not only initiated, but also executed by the processor that computes column k.
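A possible realization of the counter mechanism of the right-looking variant is sketched below; add counter() and add column() are taken from the description of Fig. 7.27 (written with underscores here), while the arrays counter[], needed[], and col_lock[] are assumptions for illustration.

#include <pthread.h>

/* Sketch: conflict-free counter increment for the right-looking variant.
   counter[j] counts the cmod(j,*) operations already applied to column j;
   when it reaches needed[j] = |Struct(L_{j*})|, the task Tcol(j) becomes
   ready and is inserted into the central task pool. */
extern int counter[];
extern int needed[];
extern pthread_mutex_t col_lock[];
extern void add_column(int j);        /* insert Tcol(j) into the task pool */

void add_counter(int j) {
    int ready;
    pthread_mutex_lock(&col_lock[j]);
    counter[j]++;
    ready = (counter[j] == needed[j]);
    pthread_mutex_unlock(&col_lock[j]);
    if (ready)
        add_column(j);
}

The same per-column lock can be used to protect the cmod(k, j) update of column k mentioned above, so that no two tasks modify the same column concurrently.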
7.5.3.3 Parallel Supernodal Algorithm
The parallel implementation of the supernodal algorithm uses a partition into fundamental supernodes. A supernode I(p) = {p, p + 1, . . . , p + q − 1} is a fundamental supernode if, for each i with 0 ≤ i ≤ q − 2, we have children(p + i + 1) = {p + i}, i.e., node p + i is the only child of p + i + 1 in the elimination tree [124]. In Fig. 7.23, supernode I(2) = {2, 3, 4} is a fundamental supernode, whereas supernodes I(6) = {6, 7} and I(8) = {8, 9} are not fundamental. In a partition into fundamental supernodes, all columns of a supernode can be computed as soon as the first column can be computed; waiting for the computation of columns outside the supernode is not needed. In the following, we assume that all supernodes are fundamental, which can be achieved by splitting supernodes into smaller ones. A supernode consisting of a single column is fundamental.
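A small sketch of the corresponding test is given below; it checks whether a supernode I(p) = {p, . . . , p + q − 1} is fundamental, using the children information of the elimination tree (the array names are assumptions carried over from the earlier sketch).

/* Sketch: a supernode I(p) = {p, ..., p+q-1} is fundamental if, for each
   0 <= i <= q-2, node p+i is the only child of p+i+1 in the elimination
   tree. */
int is_fundamental(int p, int q, const int *num_children, int **children) {
    for (int i = 0; i <= q - 2; i++) {
        if (num_children[p + i + 1] != 1 || children[p + i + 1][0] != p + i)
            return 0;
    }
    return 1;  /* also covers q == 1: a single-column supernode is fundamental */
}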
The parallel implementation of the supernodal algorithm (VI) is based on supernode tasks Tsup(J), where task Tsup(J) for 0 ≤ J < N comprises the execution of smod(j, J) and cdiv(j) for each j ∈ J from left to right and the execution of smod(k, J) for all k ∈ Struct(L∗(last(J))); N is the number of supernodes. Tasks Tsup(J) that are ready for execution are held in a central task pool that is accessed by the idle processors. The pool is initialized to the supernodes that are ready for completion; these are the supernodes whose first column is a leaf in the elimination tree. If the execution of an smod(k, J) operation by a task Tsup(J) finishes the modification of another supernode K > J, the corresponding task Tsup(K) is inserted into the task pool.
The assignment of supernode tasks is again implemented by maintaining a counter cj for each column j of a supernode. Each counter is initialized to 0 and is incremented for each modification that is executed for column j. Ignoring the modifications with columns inside a supernode, a supernode task Tsup(J) is ready for execution if the counters of the columns j ∈ J reach the value |Struct(Lj∗)|. The implementation of the counters as well as the manipulation of the columns has to be protected, e.g., by a lock mechanism. For the manipulation of a column k ∉ J by an smod(k, J) operation, column k is locked to avoid concurrent manipulation by different processors. Figure 7.28 shows the corresponding implementation.
Fig. 7.28 Parallel supernodal algorithm
7.6 Exercises for Chap. 7
Exercise 7.1 For an n × m matrix A, a vector a of length n, and a vector b of length m, write a parallel MPI program which computes the rank-1 update A = A − a · b^T, which can be computed sequentially by

for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    A[i][j] = A[i][j] - a[i] * b[j];
For the parallel MPI implementation assume that A is distributed among the p processors in a column-cyclic way. The vectors a and b are available at the process with rank 0 only and must be distributed appropriately before the computation. After the update operation, the matrix A should again be distributed in a column-cyclic way.
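A possible structure for the parallel update step is sketched below; it is only a partial sketch under the stated column-cyclic distribution, with an assumed column-major local storage A_loc, and is not a complete solution of the exercise.

#include <mpi.h>

/* Sketch: rank-1 update A = A - a * b^T for an n x m matrix A distributed
   column-cyclically: process r owns the columns j with j % p == r, stored
   consecutively (column-major) in A_loc.  a and b are replicated by the
   broadcasts below. */
void rank1_update(double *A_loc, double *a, double *b,
                  int n, int m, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    /* a and b are initially available at process 0 only */
    MPI_Bcast(a, n, MPI_DOUBLE, 0, comm);
    MPI_Bcast(b, m, MPI_DOUBLE, 0, comm);

    int jl = 0;                              /* local column index            */
    for (int j = rank; j < m; j += p, jl++)  /* global columns owned locally  */
        for (int i = 0; i < n; i++)
            A_loc[jl * n + i] -= a[i] * b[j];
    /* A remains column-cyclically distributed, as required */
}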
Exercise 7.2 Implement the rank-1 update in OpenMP. Use a parallel for loop to express the parallel execution.
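A minimal sketch of the OpenMP version, assuming the full matrix is available in shared memory (the two-dimensional array layout is an assumption):

#include <omp.h>

/* Sketch: rank-1 update with a parallel for loop; the iterations of the
   outer loop are independent and need no synchronization. */
void rank1_update_omp(int n, int m, double A[n][m],
                      const double a[n], const double b[m]) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            A[i][j] -= a[i] * b[j];
}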
Exercise 7.3 Extend the program piece in Fig. 7.2 for performing Gaussian elimination with a row-cyclic data distribution to a full MPI program. To do so, all helper functions used and described in the text must be implemented. Measure the resulting execution times for different matrix sizes and different numbers of processors.
Exercise 7.4 Similar to the previous exercise, transform the program piece in Fig. 7.6 with a total cyclic data distribution to a full MPI program. Compare the resulting execution times for different matrix sizes and different numbers of processors. For which scenarios does a significant difference occur? Try to explain the observed behavior.
Exercise 7.5 Develop a parallel implementation of Gaussian elimination for shared address spaces using OpenMP. The MPI implementation from Fig. 7.2 can be used as an orientation. Explain how the available parallelism is expressed in your OpenMP implementation. Also explain where synchronization is needed when accessing shared data. Measure the resulting execution times for different matrix sizes and different numbers of processors.
Exercise 7.6 Develop a parallel implementation of Gaussian elimination using Java threads. Define a new class Gaussian which is structured similarly to the Java program in Fig. 6.23 for matrix multiplication. Explain which synchronization is needed in the program. Measure the resulting execution times for different matrix sizes and different numbers of processors.
Exercise 7.7 Develop a parallel MPI program for Gaussian elimination using a column-cyclic data distribution. An implementation with a row-cyclic distribution has been given in Fig. 7.2. Explain which communication is needed for a column-cyclic distribution and include this communication in your program. Compute the resulting speedup values for different matrix sizes and different numbers of processors.
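One possible organization of the communication for a column-cyclic distribution is sketched below for a single elimination step (without pivoting); the local storage layout and function name are assumptions, and only this inner step of the exercise is shown.

#include <mpi.h>

/* Sketch of elimination step k for a column-cyclic distribution of an
   n x n matrix: process k % p owns column k, computes the elimination
   factors l[i] = a[i][k] / a[k][k] for i > k, and broadcasts them; every
   process then updates its own columns j > k.  A_loc stores the local
   columns consecutively in column-major order; l is a buffer of length n. */
void elimination_step(int k, int n, int p, int rank,
                      double *A_loc, double *l, MPI_Comm comm) {
    int owner = k % p;
    if (rank == owner) {
        int kl = k / p;                          /* local index of column k  */
        for (int i = k + 1; i < n; i++)
            l[i] = A_loc[kl * n + i] / A_loc[kl * n + k];
    }
    MPI_Bcast(l + k + 1, n - k - 1, MPI_DOUBLE, owner, comm);

    /* update all local columns j > k */
    for (int j = rank; j < n; j += p) {
        if (j <= k) continue;
        int jl = j / p;
        for (int i = k + 1; i < n; i++)
            A_loc[jl * n + i] -= l[i] * A_loc[jl * n + k];
    }
}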
Exercise 7.8 For n = 8, consider the following tridiagonal equation system:

\[
\begin{pmatrix}
1 & 1 &        &        &   \\
1 & 2 & 1      &        &   \\
  & \ddots & \ddots & \ddots & \\
  &        & 1      & 2      & 1 \\
  &        &        & 1      & 2
\end{pmatrix}
\cdot x =
\begin{pmatrix} 1 \\ 2 \\ 3 \\ \vdots \\ 8 \end{pmatrix} .
\]
Use the recursive doubling technique from Sect. 7.2.2, p. 385, to solve this equation system.
Exercise 7.9 Develop a sequential implementation of the cyclic reduction algorithm for solving tridiagonal equation systems, see Sect. 7.2.2, p. 385. Measure the resulting sequential execution times for different matrix sizes, starting with size n = 100 up to size n = 10^7.
Exercise 7.10 Transform the sequential implementation of the cyclic reduction algorithm from the last exercise into a parallel implementation for a shared address space using OpenMP. Use an appropriate parallel for loop to express the parallel execution. Measure the resulting parallel execution times for different numbers of processors for the same matrix sizes as in the previous exercise. Compute the resulting speedup values and show the speedup values in a diagram.
Exercise 7.11 Develop a parallel MPI implementation of the cyclic reduction algorithm for a distributed address space based on the description in Sect. 7.2.2, p. 385. Measure the resulting parallel execution times for different numbers of processors and compute the resulting speedup values.
Exercise 7.12 Specify the data dependence graph for the cyclic reduction algorithm for n = 12 equations according to Fig. 7.11. For p = 3 processors, illustrate the three phases according to Fig. 7.12 and show which dependences lead to communication.
Exercise 7.13 Implement a parallel Jacobi iteration with a pointer-based storage scheme of the matrix A such that global indices are used in the implementation.
Exercise 7.14 Consider the parallel implementation of the Jacobi iteration in Fig. 7.13 and provide a corresponding shared memory program using OpenMP operations.
Exercise 7.15 Implement a parallel SOR method for a dense linear equation system by modifying the parallel program in Fig. 7.14.
Exercise 7.16 Provide a shared memory implementation of the Gauss–Seidel method for the discretized Poisson equation.
Exercise 7.17 Develop a shared memory implementation of the Cholesky factorization A = L L^T for a dense matrix A using the basic algorithm.
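A possible starting point is sketched below, using the column-oriented formulation in which cdiv(j) scales column j and the independent cmod(k, j) operations of step j are executed by a parallel loop; the in-place storage of L in the lower triangle of A is an assumption, and the sketch is not a tuned solution of the exercise.

#include <math.h>
#include <omp.h>

/* Sketch: dense right-looking Cholesky A = L L^T, computing L in the lower
   triangle of A.  In step j, cdiv(j) scales column j; the subsequent
   cmod(k,j) operations for k > j are independent of each other and are
   executed in parallel. */
void cholesky_omp(int n, double A[n][n]) {
    for (int j = 0; j < n; j++) {
        /* cdiv(j) */
        A[j][j] = sqrt(A[j][j]);
        for (int i = j + 1; i < n; i++)
            A[i][j] /= A[j][j];

        /* cmod(k,j) for all k > j */
        #pragma omp parallel for
        for (int k = j + 1; k < n; k++)
            for (int i = k; i < n; i++)
                A[i][k] -= A[i][j] * A[k][j];
    }
}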
Exercise 7.18 Develop a message-passing implementation for the dense Cholesky factorization A = L L^T.
Exercise 7.19 Consider a matrix with the following non-zero entries:
(10 × 10 sparse matrix with rows and columns indexed 0, . . . , 9; the non-zero entries are marked by ∗ in the original figure.)
(a) Specify all supernodes of this matrix.
(b) Consider a supernode J with at least three entries. Specify the sequence of cmod and cdiv operations that are executed for this supernode in the right-looking supernode Cholesky factorization algorithm.
(c) Determine the elimination tree resulting for this matrix.
(d) Explain the role of the elimination tree for a parallel execution.
Exercise 7.20 Derive the parallel execution time of a message-passing program of the CG method for a distributed memory machine with a linear array as interconnection network.
Exercise 7.21 Consider a parallel implementation of the CG method in which computation step (3) is executed in parallel with computation step (4). Given a row-blockwise distribution of matrix A and a blockwise distribution of the vector, derive the data distributions for this implementation variant and give the corresponding parallel execution time.
Exercise 7.22 Implement the CG algorithm given in Fig. 7.20 with the blockwise distribution as a message-passing program using MPI.