can be used to solve several linear systems with the same matrix A and different right-hand side vectors b without repeating the elimination process.
7.1.1.2 Pivoting
Forward elimination and LU decomposition require the division by a_kk^(k), and so these methods can only be applied when a_kk^(k) ≠ 0. That is, even if det A ≠ 0 and the system Ax = y is solvable, there does not need to exist a decomposition A = LU when a_kk^(k) is a zero element. However, for a solvable linear system, there exists a matrix resulting from permutations of rows of A, for which an LU decomposition is possible, i.e., BA = LU with a permutation matrix B describing the permutation of rows of A. The permutation of rows of A, if necessary, is included in the elimination process. In each elimination step, a pivot element is determined to substitute a_kk^(k). A pivot element is needed when a_kk^(k) = 0 and when a_kk^(k) is very small, which would induce a very large elimination factor, leading to imprecise computations. Pivoting strategies are used to find an appropriate pivot element. Typical strategies are column pivoting, row pivoting, and total pivoting.
Column pivoting considers the elements a_kk^(k), ..., a_nk^(k) of column k and determines the element a_rk^(k), k ≤ r ≤ n, with the maximum absolute value. If r ≠ k, the rows r and k of matrix A^(k) and the values b_k^(k) and b_r^(k) of the vector b^(k) are exchanged. Row pivoting determines a pivot element a_kr^(k), k ≤ r ≤ n, within the elements a_kk^(k), ..., a_kn^(k) of row k of matrix A^(k) with the maximum absolute value. If r ≠ k, the columns k and r of A^(k) are exchanged. This corresponds to an exchange of the enumeration of the unknowns x_k and x_r of vector x. Total pivoting determines the element with the maximum absolute value in the matrix Ã^(k) = (a_ij^(k)), k ≤ i, j ≤ n, and exchanges columns and rows of A^(k) depending on i ≠ k and j ≠ k. In practice, row or column pivoting is used instead of total pivoting, since they have a smaller computation time, and total pivoting may also destroy special matrix structures like banded structures.

The implementation of pivoting avoids the actual exchange of rows or columns in memory and uses index vectors pointing to the current rows of the matrix. The indexed access to matrix elements is more expensive, but in total the indexed access is usually less expensive than moving entire rows in each elimination step. When supported by the programming language, a dynamic data storage in the form of separate vectors for rows of the matrix, which can be accessed through a vector pointing to the rows, may lead to more efficient implementations. The advantage is that matrix elements can still be accessed with a two-dimensional index expression, but the exchange of rows corresponds to a simple exchange of pointers.
7.1.2 Parallel Row-Cyclic Implementation
A parallel implementation of the Gaussian elimination is based on a data distribution
of matrix A and of the sequence of matrices A^(k), k = 2, ..., n, which can be a row-oriented, a column-oriented, or a checkerboard distribution, see Sect. 3.4. In this section, we consider a row-oriented distribution.
computation left for this processor and it becomes idle. For a row-cyclic distribution, there is a better load balance, since processor P_q, 1 ≤ q ≤ p, owns the rows q, q+p, q+2p, ..., i.e., it owns all rows i with 1 ≤ i ≤ n and q = ((i−1) mod p) + 1. The processors begin to get idle only after the first n − p stages, which is reasonable for p ≪ n. Thus, we consider a parallel implementation of the Gaussian elimination with a row-cyclic distribution of matrix A and a column-oriented pivoting. One step of the forward elimination computing A^(k+1) and b^(k+1) for given A^(k) and b^(k) performs the following computation and communication phases:
1. Determination of the local pivot element: Each processor considers its local elements of column k in the rows k, ..., n and determines the element (and its position) with the largest absolute value.
2. Determination of the global pivot element: The global pivot element is the local pivot element which has the largest absolute value. A single-accumulation operation with the maximum operation as reduction determines this global pivot element. The root processor of this global communication operation sends the result to all other processors.
3. Exchange of the pivot row: If k ≠ r for a pivot element a_rk^(k), the row k owned by processor P_q and the pivot row r owned by processor P_q' have to be exchanged. When q = q', the exchange can be done locally by processor P_q. When q ≠ q', communication with single-transfer operations is required. The elements b_k and b_r are exchanged accordingly.
4. Distribution of the pivot row: Since the pivot row (now row k) is required by all processors for the local elimination operations, processor P_q sends the elements a_kk^(k), ..., a_kn^(k) of row k and the element b_k^(k) to all other processors.
5. Computation of the elimination factors: Each processor locally computes the elimination factors l_ik for the rows i it owns according to Formula (7.2).
6. Computation of the matrix elements: Each processor locally computes the elements of A^(k+1) and b^(k+1) using its elements of A^(k) and b^(k) according to Formulas (7.3) and (7.4).
The computation of the solution vector x in the backward substitution is inherently sequential, since the values x_k, k = n, ..., 1, depend on each other and are computed one after another. In step k, processor P_q owning row k computes the value x_k according to Formula (7.5) and sends the value to all other processors by a single-broadcast operation.
A program fragment implementing the computation phases 1-6 and the backward substitution is given in Fig. 7.2. The matrix A and the vector b are stored in a two-dimensional array a and a one-dimensional array b, respectively. Some of the local functions are already introduced in the program in Fig. 7.1. The SPMD program uses the variable me to store the individual processor number. This processor number, the ...

Fig. 7.2 Program fragment with C notation and MPI operations for the Gaussian elimination with row-cyclic distribution
1. Determination of the local pivot element: The function max_col_loc(a,k) determines the row index r of the element a[r][k] which has the largest local absolute value in column k for the rows ≥ k. When a processor has no element of column k for rows ≥ k, the function returns −1.
2. Determination of the global pivot element: The global pivoting is performed by an MPI_Allreduce() operation, implementing a single-accumulation with a subsequent single-broadcast. The MPI reduction operation MPI_MAXLOC for the data type MPI_DOUBLE_INT, consisting of one double value and one integer value, is used. The MPI operations have been introduced in Sect. 5.2. The result of the MPI_Allreduce() operation delivers the global pivot element and the processor owning the corresponding row in y.node. Thus, after this step all processors know the global pivot element and the owner for possible communication.
3. Exchange of the pivot row: Two cases are considered:
• If the owner of the pivot row is the processor also owning row k (i.e., k%p == y.node), the rows k and r are exchanged locally by this processor for r ≠ k. Row k is now the pivot row. The function copy_row(a,b,k,buf) copies the pivot row into the buffer buf, which is used for further communication.
• If different processors own the row k and the pivot row r, row k is sent to the processor y.node owning the pivot row with MPI_Send and MPI_Recv operations. Before the send operation, the function copy_row(a,b,k,buf) copies row k of array a and element k of array b into a common buffer buf, so that only one communication operation needs to be applied. After the communication, the processor y.node finalizes its exchange with the pivot row. The function copy_exchange_row(a,b,r,buf,k) exchanges the row r (still the pivot row) and the buffer buf. The appropriate row index r is known from the former local determination of the pivot row. Now the former row k is the row r, and the buffer buf contains the pivot row.
Thus, in both cases the pivot row is stored in buffer buf.
4. Distribution of the pivot row: Processor y.node sends the buffer buf to all other processors by an MPI_Bcast() operation. For the case of the pivot row being owned by a different processor than the owner of row k, the content of buf is copied into row k by this processor using copy_back_row().
5. and 6. Computation of the elimination factors and the matrix elements: The computation of the elimination factors and the new arrays a and b is done in parallel. Processor P_q starts this computation with the first row i > k with i mod p = q.
For a row-cyclic implementation of the Gaussian elimination, an alternative way of storing array a and vector b can be used. The alternative data structure consists
Fig. 7.3 Data structure for the Gaussian elimination with n = 8 and p = 4, showing the rows stored by processor P1. Each row stores n+1 elements consisting of one row of the matrix a and the corresponding element of b
of a one-dimensional array of pointers and n one-dimensional arrays of length n+1, each containing one row of a and the corresponding element of b. The entries in the pointer array point to the row arrays. This storage scheme not only facilitates the exchange of rows but is also convenient for a distributed storage. For a distributed memory, each processor P_q stores the entire array of pointers but only the rows i with i mod p = q; all other pointers are NULL pointers. Figure 7.3 illustrates this storage scheme for n = 8. The advantage of storing an element of b together with a is that the copy operation into a common buffer can be avoided. Also, the computation of the new values for a and b is now only one loop with n+1 iterations. This implementation variant is not shown in Fig. 7.2.
7.1.3 Parallel Implementation with Checkerboard Distribution
A parallel implementation using a block-cyclic checkerboard distribution for matrix A can be described with the parameterized data distribution introduced in Sect. 3.4. The parameterized data distribution is given by a distribution vector

((p1, b1), (p2, b2))   (7.7)

with a p1 × p2 virtual processor mesh with p1 rows, p2 columns, and p1 · p2 = p processors. The numbers b1 and b2 are the sizes of a block of data with b1 rows and b2 columns. The function G : P → N^2 maps each processor to a unique position in the processor mesh. This leads to the definition of p1 row groups

R_q = {Q ∈ P | G(Q) = (q, ·)}

with |R_q| = p2 for 1 ≤ q ≤ p1 and p2 column groups

C_q = {Q ∈ P | G(Q) = (·, q)}

with |C_q| = p1 for 1 ≤ q ≤ p2. The row groups as well as the column groups are a partition of the entire set of processors, i.e.,
the local memories of the processors of only one row group, denoted Ro(i) in the following. This is the row group R_k with k = (⌊(i−1)/b1⌋ mod p1) + 1. Analogously, column j is distributed within one column group, denoted as Co(j), which is the column group C_k with k = (⌊(j−1)/b2⌋ mod p2) + 1.
Example: For a matrix of size 12 × 12 (i.e., n = 12), p = 4 processors {P1, P2, P3, P4}, and distribution vector ((p1, b1), (p2, b2)) = ((2, 2), (2, 3)), the virtual processor mesh has size 2 × 2 and the data blocks have size 2 × 3. There are two row groups and two column groups:

R1 = {Q ∈ P | G(Q) = (1, j), j = 1, 2},
R2 = {Q ∈ P | G(Q) = (2, j), j = 1, 2},
C1 = {Q ∈ P | G(Q) = (j, 1), j = 1, 2},
C2 = {Q ∈ P | G(Q) = (j, 2), j = 1, 2}.

The distribution of matrix A is shown in Fig. 7.4. It can be seen that row 5 is distributed in row group R1 and that column 7 is distributed in column group C1.
Using a checkerboard distribution with distribution vector (7.7), the computation of A^(k) has the following implementation, which has a different communication pattern than the previous implementation. Figure 7.5 illustrates the communication and computation phases of the Gaussian elimination with checkerboard distribution.
Fig. 7.4 Illustration of a checkerboard distribution for a 12 × 12 matrix. The tuples denote the position of the processors in the processor mesh owning the data block
Fig. 7.5 Computation phases of the Gaussian elimination with checkerboard distribution: (1) determination of the local pivot elements, (2) determination of the global pivot element, (3) exchange of the pivot row, (4) broadcast of the pivot row, (5) computation of the elimination factors, (5a) broadcast of the elimination factors, (6) computation of the matrix elements
2. Determination of the global pivot element: The processors in group Co(k) perform a single-accumulation operation within this group, for which each processor in the group provides its local pivot element from phase 1. The reduction operation is the maximum operation, also determining the index of the pivot row (and not the number of the owning processor as before). The root processor of the single-accumulation operation is the processor owning the element a_kk^(k). After the single-accumulation, the root processor knows the pivot element a_rk^(k) and its row index. This information is sent to all other processors.
3. Exchange of the pivot row: The pivot row r containing the pivot element a_rk^(k) is distributed across row group Ro(r). Row k is distributed across the row group Ro(k), which may be different from Ro(r). If Ro(r) = Ro(k), the processors of Ro(k) exchange the elements of the rows k and r locally within the columns they own. If Ro(r) ≠ Ro(k), each processor in Ro(k) sends its part of row k to the corresponding processor in Ro(r); this is the unique processor which belongs to the same column group.
4. Distribution of the pivot row: The pivot row is needed for the recalculation of matrix A, but each processor needs only those elements with column indices for which it owns elements. Therefore, each processor in Ro(r) performs a group-oriented single-broadcast operation within its column group, sending its part of the pivot row to the other processors.
5. Computation of the elimination factors: The processors of column group Co(k) locally compute the elimination factors l_ik for the rows i of column k that they own according to Formula (7.2).
5a. Distribution of the elimination factors: The elimination factors l_ik are needed by all processors in the row group Ro(i). Since the elements of row i are distributed across the row group Ro(i), each processor of column group Co(k) performs a group-oriented single-broadcast operation in its row group Ro(i) to broadcast its elimination factors l_ik within this row group.
6. Computation of the matrix elements: Each processor locally computes the elements of A^(k+1) and b^(k+1) using its elements of A^(k) and b^(k) according to Formulas (7.3) and (7.4).
The backward substitution for computing the n elements of the result vector x is done in n consecutive steps, where each step consists of the following computations:

1. Each processor of the row group Ro(k) computes that part of the sum sum_{j=k+1}^{n} a_kj^(n) x_j which contains its local elements of row k.
2. The entire sum sum_{j=k+1}^{n} a_kj^(n) x_j is determined by the processors of row group Ro(k) by a group-oriented single-accumulation operation with the processor P_q as root which stores the element a_kk^(n). Addition is used as reduction operation.
3. Processor P_q computes the value of x_k according to Formula (7.5).
4. Processor P_q sends the value of x_k to all other processors by a single-broadcast operation.
A pseudocode for an SPMD program in C notation with MPI operations implementing the Gaussian elimination with checkerboard distribution of matrix A is given in Fig. 7.6. The computations correspond to those given in the pseudocode for the row-cyclic distribution in Fig. 7.2, but the pseudocode additionally uses several functions organizing the computations on the groups of processors. The functions Ro(k) and Co(k) determine the row group and the column group owning row k and column k, respectively. The function member(me,G) determines whether processor me belongs to group G. The function grp_leader() determines the first processor in a group. The functions Cop(q) and Rop(q) determine the column or row group, respectively, to which a processor q belongs. The function rank(q,G) returns the local processor number (rank) of a processor in a group G.
1. Determination of the local pivot element: The determination of the local pivot element is performed only by the processors in column group Co(k).
2. Determination of the global pivot element: The global pivot element is again computed by an MPI_MAXLOC reduction operation, but in contrast to Fig. 7.2, the index of the row of the pivot element is calculated and not the processor number owning the pivot element. The reason is that all processors which own a part of the pivot row need to know that some of their data belongs to the current pivot row; this information is used in further communication.
3. Exchange of the pivot row: For the exchange and distribution of the pivot row r, the cases Ro(k)==Ro(r) and Ro(k)!=Ro(r) are distinguished.
• When the pivot row and the row k are stored by the same row group, each processor of this group exchanges its data elements of row k and row r locally using the function exchange_row_loc() and copies the elements of the pivot row (now row k) into the buffer buf using the function copy_row_loc(). Only the elements in column k or higher are considered.
• When the pivot row and the row k are stored by different row groups, communication is required for the exchange of the pivot row. The function ... determines, for the calling processor me, the communication partner: the processor q ∈ Ro(r) belonging to the same column group as me. The function compute_size(n,k,Ro(k)) computes the number of elements of the pivot row which are stored for the calling processor in columns greater than k; this number depends on the size of the row group Ro(k), the block size, and the position k. The same function is used later to determine the number of elimination factors to be communicated.
4. Distribution of the pivot row: For the distribution of the pivot row r, a processor takes part in a single-broadcast operation in its column group. The roots of the broadcast operations performed in parallel are the processors q ∈ Ro(r). The participants of a broadcast are the processors q' ∈ Cop(q), either as root when q' ∈ Ro(r) or as recipient otherwise.

Fig. 7.6 Program of the Gaussian elimination with checkerboard distribution