can be used to solve several linear systems with the same matrix A and different right-hand side vectors b without repeating the elimination process.
7.1.1.2 Pivoting
Forward elimination and LU decomposition require the division by a_kk^(k), and so these methods can only be applied when a_kk^(k) ≠ 0. That is, even if det A ≠ 0 and the system Ax = y is solvable, there does not need to exist a decomposition A = LU when a_kk^(k) is a zero element. However, for a solvable linear system, there exists a matrix resulting from permutations of rows of A, for which an LU decomposition is possible, i.e., BA = LU with a permutation matrix B describing the permutation of rows of A. The permutation of rows of A, if necessary, is included in the elimination process. In each elimination step, a pivot element is determined to substitute a_kk^(k). A pivot element is needed when a_kk^(k) = 0 and when a_kk^(k) is very small, which would induce a very large elimination factor, leading to imprecise computations. Pivoting strategies are used to find an appropriate pivot element. Typical strategies are column pivoting, row pivoting, and total pivoting.
Column pivoting considers the elements a_kk^(k), ..., a_nk^(k) of column k and determines the element a_rk^(k), k ≤ r ≤ n, with the maximum absolute value. If r ≠ k, the rows r and k of matrix A^(k) and the values b_k^(k) and b_r^(k) of the vector b^(k) are exchanged. Row pivoting determines a pivot element a_kr^(k), k ≤ r ≤ n, within the elements a_kk^(k), ..., a_kn^(k) of row k of matrix A^(k) with the maximum absolute value. If r ≠ k, the columns k and r of A^(k) are exchanged. This corresponds to an exchange of the enumeration of the unknowns x_k and x_r of vector x. Total pivoting determines the element with the maximum absolute value in the matrix Ã^(k) = (a_ij^(k)), k ≤ i, j ≤ n, and exchanges columns and rows of A^(k) depending on i ≠ k and j ≠ k. In practice, row or column pivoting is used instead of total pivoting, since they have a smaller computation time, and total pivoting may also destroy special matrix structures like banded structures.

The implementation of pivoting avoids the actual exchange of rows or columns in memory and uses index vectors pointing to the current rows of the matrix. The indexed access to matrix elements is more expensive, but in total the indexed access is usually less expensive than moving entire rows in each elimination step. When supported by the programming language, a dynamic data storage in the form of separate vectors for rows of the matrix, which can be accessed through a vector pointing to the rows, may lead to more efficient implementations. The advantage is that matrix elements can still be accessed with a two-dimensional index expression, but the exchange of rows corresponds to a simple exchange of pointers.
7.1.2 Parallel Row-Cyclic Implementation
A parallel implementation of the Gaussian elimination is based on a data distribution
of matrix A and of the sequence of matrices A^(k), k = 2, ..., n, which can be a row-oriented, a column-oriented, or a checkerboard distribution, see Sect. 3.4. In this section, we consider a row-oriented distribution.
computation left for this processor and it becomes idle. For a row-cyclic distribution, there is a better load balance, since processor P_q, 1 ≤ q ≤ p, owns the rows q, q+p, q+2p, ..., i.e., it owns all rows i with 1 ≤ i ≤ n and q = ((i−1) mod p) + 1. The processors begin to get idle only after the first n − p stages, which is reasonable for p ≪ n. Thus, we consider a parallel implementation of the Gaussian elimination with a row-cyclic distribution of matrix A and a column-oriented pivoting. One step of the forward elimination computing A^(k+1) and b^(k+1) for given A^(k) and b^(k) performs the following computation and communication phases:
1. Determination of the local pivot element: Each processor considers its local elements of column k in the rows k, ..., n and determines the element (and its position) with the largest absolute value.
2. Determination of the global pivot element: The global pivot element is the local pivot element which has the largest absolute value. A single-accumulation operation with the maximum operation as reduction determines this global pivot element. The root processor of this global communication operation sends the result to all other processors.
3. Exchange of the pivot row: If k ≠ r for a pivot element a_rk^(k), the row k owned by processor P_q and the pivot row r owned by processor P_q' have to be exchanged. When q = q', the exchange can be done locally by processor P_q. When q ≠ q', communication with single-transfer operations is required. The elements b_k and b_r are exchanged accordingly.
4. Distribution of the pivot row: Since the pivot row (now row k) is required by all processors for the local elimination operations, processor P_q sends the elements a_kk^(k), ..., a_kn^(k) of row k and the element b_k^(k) to all other processors.
5. Computation of the elimination factors: Each processor locally computes the elimination factors l_ik for the rows i it owns according to Formula (7.2).
6. Computation of the matrix elements: Each processor locally computes the elements of A^(k+1) and b^(k+1) using its elements of A^(k) and b^(k) according to Formulas (7.3) and (7.4).
The computation of the solution vector x in the backward substitution is inherently sequential, since the values x_k, k = n, ..., 1, depend on each other and are computed one after another. In step k, processor P_q owning row k computes the value x_k according to Formula (7.5) and sends the value to all other processors by a single-broadcast operation.
A program fragment implementing the computation phases 1-6 and the backward substitution is given in Fig. 7.2. The matrix A and the vector b are stored in a two-dimensional array a and a one-dimensional array b, respectively. Some of the local functions are already introduced in the program in Fig. 7.1. The SPMD program uses the variable me to store the individual processor number. This processor number, the ...

Fig. 7.2 Program fragment with C notation and MPI operations for the Gaussian elimination with row-cyclic distribution
1. Determination of the local pivot element: The function max_col_loc(a,k) determines the row index r of the element a[r][k] which has the largest local absolute value in column k for the rows ≥ k. When a processor has no element of column k for rows ≥ k, the function returns −1.
2. Determination of the global pivot element: The global pivoting is performed by an MPI_Allreduce() operation, implementing a single-accumulation with a subsequent single-broadcast. The MPI reduction operation MPI_MAXLOC for the data type MPI_DOUBLE_INT, consisting of one double value and one integer value, is used. The MPI operations have been introduced in Sect. 5.2. The result of the MPI_Allreduce() operation delivers the global pivot element and the processor owning the corresponding row in y.node. Thus, after this step all processors know the global pivot element and the owner for possible communication.
3. Exchange of the pivot row: Two cases are considered:
• If the owner of the pivot row is the processor also owning row k (i.e., k%p == y.node), the rows k and r are exchanged locally by this processor for r ≠ k. Row k is now the pivot row. The function copy_row(a,b,k,buf) copies the pivot row into the buffer buf, which is used for further communication.
• If different processors own the row k and the pivot row r, row k is sent to the processor y.node owning the pivot row with MPI_Send and MPI_Recv operations. Before the send operation, the function copy_row(a,b,k,buf) copies row k of array a and element k of array b into a common buffer buf, so that only one communication operation needs to be applied. After the communication, the processor y.node finalizes its exchange with the pivot row. The function copy_exchange_row(a,b,r,buf,k) exchanges the row r (still the pivot row) and the buffer buf. The appropriate row index r is known from the former local determination of the pivot row. Now the former row k is the row r, and the buffer buf contains the pivot row.
Thus, in both cases the pivot row is stored in buffer buf.
4. Distribution of the pivot row: Processor y.node sends the buffer buf to all other processors by an MPI_Bcast() operation. For the case of the pivot row being owned by a different processor than the owner of row k, the content of buf is copied into row k by this processor using copy_back_row().
5. and 6. Computation of the elimination factors and the matrix elements: The computation of the elimination factors and the new arrays a and b is done in parallel. Processor P_q starts this computation with the first row i > k with i mod p = q.
For a row-cyclic implementation of the Gaussian elimination, an alternative way of storing array a and vector b can be used. The alternative data structure consists
Fig. 7.3 Data structure for the Gaussian elimination with n = 8 and p = 4, showing the rows stored by processor P1. Each row stores n+1 elements consisting of one row of the matrix a and the corresponding element of b
of a one-dimensional array of pointers and n one-dimensional arrays of length n+1, each containing one row of a and the corresponding element of b. The entries in the pointer array point to the row arrays. This storage scheme not only facilitates the exchange of rows but is also convenient for a distributed storage. For a distributed memory, each processor P_q stores the entire array of pointers but only the rows i with i mod p = q; all other pointers are NULL pointers. Figure 7.3 illustrates this storage scheme for n = 8. The advantage of storing an element of b together with a is that the copy operation into a common buffer can be avoided. Also, the computation of the new values for a and b is now only one loop with n+1 iterations. This implementation variant is not shown in Fig. 7.2.
7.1.3 Parallel Implementation with Checkerboard Distribution
A parallel implementation using a block-cyclic checkerboard distribution for matrix A can be described with the parameterized data distribution introduced in Sect. 3.4. The parameterized data distribution is given by a distribution vector

((p1, b1), (p2, b2))   (7.7)

with a p1 × p2 virtual processor mesh with p1 rows, p2 columns, and p1 · p2 = p processors. The numbers b1 and b2 are the sizes of a block of data with b1 rows and b2 columns. The function G : P → N^2 maps each processor to a unique position in the processor mesh. This leads to the definition of p1 row groups

R_q = {Q ∈ P | G(Q) = (q, ·)}

with |R_q| = p2 for 1 ≤ q ≤ p1 and p2 column groups

C_q = {Q ∈ P | G(Q) = (·, q)}

with |C_q| = p1 for 1 ≤ q ≤ p2. The row groups as well as the column groups are a partition of the entire set of processors, i.e.,
the local memories of the processors of only one row group, denoted Ro(i) in the following. This is the row group R_k with k = (⌊(i−1)/b1⌋ mod p1) + 1. Analogously, column j is distributed within one column group, denoted as Co(j), which is the column group C_k with k = (⌊(j−1)/b2⌋ mod p2) + 1.
Example: For a matrix of size 12 × 12 (i.e., n = 12), p = 4 processors {P1, P2, P3, P4}, and distribution vector ((p1, b1), (p2, b2)) = ((2, 2), (2, 3)), the virtual processor mesh has size 2 × 2 and the data blocks have size 2 × 3. There are two row groups and two column groups:

R1 = {Q ∈ P | G(Q) = (1, j), j = 1, 2},
R2 = {Q ∈ P | G(Q) = (2, j), j = 1, 2},
C1 = {Q ∈ P | G(Q) = (j, 1), j = 1, 2},
C2 = {Q ∈ P | G(Q) = (j, 2), j = 1, 2}.

The distribution of matrix A is shown in Fig. 7.4. It can be seen that row 5 is distributed in row group R1 and that column 7 is distributed in column group C1.
Using a checkerboard distribution with distribution vector (7.7), the computation of A^(k) has the following implementation, which has a different communication pattern than the previous implementation. Figure 7.5 illustrates the communication and computation phases of the Gaussian elimination with checkerboard distribution.
Fig. 7.4 Illustration of a checkerboard distribution for a 12 × 12 matrix. The tuples denote the position of the processors in the processor mesh owning the data block
Fig. 7.5 Computation phases of the Gaussian elimination with checkerboard distribution: (1) determination of the local pivot elements, (2) determination of the global pivot element, (3) exchange of the pivot row, (4) broadcast of the pivot row, (5) computation of the elimination factors, (5a) broadcast of the elimination factors, (6) computation of the matrix elements
2. Determination of the global pivot element: The processors in group Co(k) perform a single-accumulation operation within this group, for which each processor in the group provides its local pivot element from phase 1. The reduction operation is the maximum operation, also determining the index of the pivot row (and not the number of the owning processor as before). The root processor of the single-accumulation operation is the processor owning the element a_kk^(k). After the single-accumulation, the root processor knows the pivot element a_rk^(k) and its row index. This information is sent to all other processors.
3. Exchange of the pivot row: The pivot row r containing the pivot element a_rk^(k) is distributed across row group Ro(r). Row k is distributed across the row group Ro(k), which may be different from Ro(r). If Ro(r) = Ro(k), the processors of Ro(k) exchange the elements of the rows k and r locally within the columns they own. If Ro(r) ≠ Ro(k), each processor in Ro(k) sends its part of row k to the corresponding processor in Ro(r); this is the unique processor which belongs to the same column group.
4. Distribution of the pivot row: The pivot row is needed for the recalculation of matrix A, but each processor needs only those elements with column indices for which it owns elements. Therefore, each processor in Ro(r) performs a group-oriented single-broadcast operation within its column group, sending its part of the pivot row to the other processors.
5. Computation of the elimination factors: The processors of column group Co(k) locally compute the elimination factors l_ik for the rows i of column k that they own according to Formula (7.2).
5a. Distribution of the elimination factors: The elimination factors l_ik are needed by all processors in the row group Ro(i). Since the elements of row i are distributed across the row group Ro(i), each processor of column group Co(k) performs a group-oriented single-broadcast operation in its row group Ro(i) to broadcast its elimination factors l_ik within this row group.
6. Computation of the matrix elements: Each processor locally computes the elements of A^(k+1) and b^(k+1) using its elements of A^(k) and b^(k) according to Formulas (7.3) and (7.4).
The backward substitution for computing the n elements of the result vector x is done in n consecutive steps, where each step consists of the following computations:

1. Each processor of the row group Ro(k) computes that part of the sum sum_{j=k+1}^{n} a_kj^(n) x_j which contains its local elements of row k.
2. The entire sum sum_{j=k+1}^{n} a_kj^(n) x_j is determined by the processors of row group Ro(k) by a group-oriented single-accumulation operation with the processor P_q as root which stores the element a_kk^(n). Addition is used as reduction operation.
3. Processor P_q computes the value of x_k according to Formula (7.5).
4. Processor P_q sends the value of x_k to all other processors by a single-broadcast operation.
A pseudocode for an SPMD program in C notation with MPI operations implementing the Gaussian elimination with checkerboard distribution of matrix A is given in Fig. 7.6. The computations correspond to those given in the pseudocode for the row-cyclic distribution in Fig. 7.2, but the pseudocode additionally uses several functions organizing the computations on the groups of processors. The functions Ro(k) and Co(k) determine the row group and the column group owning row k and column k, respectively. The function member(me,G) determines whether processor me belongs to group G. The function grp_leader() determines the first processor in a group. The functions Cop(q) and Rop(q) determine the column or row group, respectively, to which a processor q belongs. The function rank(q,G) returns the local processor number (rank) of a processor in a group G.
1. Determination of the local pivot element: The determination of the local pivot element is performed only by the processors in column group Co(k).
2. Determination of the global pivot element: The global pivot element is again computed by an MPI_MAXLOC reduction operation, but in contrast to Fig. 7.2, the index of the row of the pivot element is calculated and not the processor number owning the pivot element. The reason is that all processors which own a part of the pivot row need to know that some of their data belongs to the current pivot row; this information is used in further communication.
3. Exchange of the pivot row: For the exchange and distribution of the pivot row r, the cases Ro(k)==Ro(r) and Ro(k)!=Ro(r) are distinguished.
• When the pivot row and the row k are stored by the same row group, each processor of this group exchanges its data elements of row k and row r locally using the function exchange_row_loc() and copies the elements of the pivot row (now row k) into the buffer buf using the function copy_row_loc(). Only the elements in column k or higher are considered.
• When the pivot row and the row k are stored by different row groups, communication is required for the exchange of the pivot row. The function ... determines, for the calling processor me, the communication partner: the processor q ∈ Ro(r) belonging to the same column group as me. The function compute_size(n,k,Ro(k)) computes the number of elements of the pivot row which are stored for the calling processor in columns greater than k; this number depends on the size of the row group Ro(k), the block size, and the position k. The same function is used later to determine the number of elimination factors to be communicated.
4. Distribution of the pivot row: For the distribution of the pivot row r, a processor takes part in a single-broadcast operation in its column group. The roots of the broadcast operations performed in parallel are the processors q ∈ Ro(r). The participants of a broadcast are the processors q' ∈ Cop(q), either as root when q' ∈ Ro(r) or as recipient otherwise.

Fig. 7.6 Program of the Gaussian elimination with checkerboard distribution