\hat{A} \cdot \hat{x} = \begin{pmatrix} D_R & F \\ E & D_B \end{pmatrix} \cdot \begin{pmatrix} \hat{x}_R \\ \hat{x}_B \end{pmatrix} = \begin{pmatrix} \hat{b}_1 \\ \hat{b}_2 \end{pmatrix} ,   (7.46)
where ˆx_R denotes the subvector of size n_R containing the first (red) unknowns and ˆx_B denotes the subvector of size n_B containing the last (black) unknowns. The right-hand side b of the original equation system is reordered accordingly and has subvector ˆb_1 for the first n_R equations and subvector ˆb_2 for the last n_B equations. The matrix ˆA consists of four blocks D_R ∈ R^{n_R × n_R}, D_B ∈ R^{n_B × n_B}, E ∈ R^{n_B × n_R}, and F ∈ R^{n_R × n_B}. The submatrices D_R and D_B are diagonal matrices, and the submatrices E and F are sparse banded matrices. The structure of the original matrix of the discretized Poisson equation in Fig. 7.9 in Sect. 7.2.1 is thus transformed into a matrix ˆA with the structure shown in Fig. 7.17(c).
The diagonal form of the matrices D_R and D_B shows that a red unknown ˆx_i, i ∈ {1, ..., n_R}, does not depend on the other red unknowns and a black unknown ˆx_j, j ∈ {n_R + 1, ..., n_R + n_B}, does not depend on the other black unknowns. The matrices E and F specify the dependences between red and black unknowns. Row i of matrix F specifies the dependences of the red unknown ˆx_i (i ≤ n_R) on the black unknowns ˆx_j, j = n_R + 1, ..., n_R + n_B. Analogously, a row of matrix E specifies the dependences of the corresponding black unknown on the red unknowns.
The transformation of the original linear equation system Ax = b into the equivalent system ˆAˆx = ˆb can be expressed by a permutation π : {1, ..., n} → {1, ..., n}. The permutation maps a node i ∈ {1, ..., n} of the rowwise numbering onto the number π(i) of the red–black numbering in the following way:

x_i = ˆx_{π(i)} ,  b_i = ˆb_{π(i)} ,  i = 1, ..., n ,  or  x = P ˆx  and  b = P ˆb ,

with a permutation matrix P = (P_{ij})_{i,j=1,...,n}, where P_{ij} = 1 if j = π(i) and P_{ij} = 0 otherwise. For the matrices A and ˆA the equation ˆA = P^T A P holds. Since for a permutation matrix the inverse is equal to the transposed matrix, i.e., P^T = P^{-1}, this leads to

ˆA ˆx = P^T A P P^T x = P^T b = ˆb .

The easiest way to exploit the red–black ordering is to use one of the iterative solution methods discussed earlier in this section.
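For illustration, the following small C function constructs such a permutation for a square mesh. It is not taken from the book; the 0-based indexing, the coloring rule (even coordinate sum = red), and the function name are assumptions made for this sketch.

#include <stdlib.h>

/* Build the permutation pi mapping the rowwise number i of a mesh point to
   its red-black number pi[i], for an m x m mesh with 0-based indices.
   Points with even coordinate sum are assumed to be red. */
int *build_redblack_permutation(int m)
{
  int n = m * m;
  int *pi = malloc(n * sizeof(int));
  int next_red = 0;              /* red points get numbers 0 .. n_R - 1      */
  int next_black = (n + 1) / 2;  /* black points follow, n_R = ceil(n / 2)   */
  for (int r = 0; r < m; r++)
    for (int c = 0; c < m; c++) {
      int i = r * m + c;         /* rowwise number of point (r, c) */
      pi[i] = ((r + c) % 2 == 0) ? next_red++ : next_black++;
    }
  return pi;
}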
7.3.5.1 Gauss–Seidel Iteration for Red–Black Systems
The solution of the linear equation system (7.46) with the Gauss–Seidel iteration is based on a splitting of the matrix ˆA of the form ˆA = ˆD − ˆL − ˆU with ˆD, ˆL, ˆU ∈ R^{n×n},

\hat{D} = \begin{pmatrix} D_R & 0 \\ 0 & D_B \end{pmatrix} , \quad \hat{L} = \begin{pmatrix} 0 & 0 \\ -E & 0 \end{pmatrix} , \quad \hat{U} = \begin{pmatrix} 0 & -F \\ 0 & 0 \end{pmatrix} ,

with a diagonal matrix ˆD, a lower triangular matrix ˆL, and an upper triangular matrix ˆU. The matrix 0 denotes a matrix in which all entries are 0. With this notation, iteration step k of the Gauss–Seidel method is given by
\begin{pmatrix} D_R & 0 \\ E & D_B \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k+1)} \\ x_B^{(k+1)} \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} - \begin{pmatrix} 0 & F \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k)} \\ x_B^{(k)} \end{pmatrix}   (7.47)
for k = 1, 2, .... According to equation system (7.46), the iteration vector is split into two subvectors x_R^{(k+1)} and x_B^{(k+1)} for the red and the black unknowns, respectively. (To simplify the notation, we use x_R instead of ˆx_R in the following discussion of the red–black ordering.)
The linear equation system (7.47) can be written in vector notation for the vectors x_R^{(k+1)} and x_B^{(k+1)} in the form

D_R \cdot x_R^{(k+1)} = b_1 - F \cdot x_B^{(k)}   for k = 1, 2, ...,   (7.48)
D_B \cdot x_B^{(k+1)} = b_2 - E \cdot x_R^{(k+1)}   for k = 1, 2, ...,   (7.49)
in which the decoupling of the red subvector x_R^{(k+1)} and the black subvector x_B^{(k+1)} becomes obvious: In Eq. (7.48) the new red iteration vector x_R^{(k+1)} depends only on the previous black iteration vector x_B^{(k)}, and in Eq. (7.49) the new black iteration vector x_B^{(k+1)} depends only on the red iteration vector x_R^{(k+1)} computed before in the same iteration step. There is no additional dependence. Thus, the potential degree of parallelism in Eqs. (7.48) and (7.49) is similar to the potential parallelism in the Jacobi iteration. In each iteration step k, the components of x_R^{(k+1)} according to Eq. (7.48) can be computed independently, since the vector x_B^{(k)} is known, which leads to a potential parallelism with p = n_R processors. Afterwards, the vector x_R^{(k+1)} is known and the components of the vector x_B^{(k+1)} can be computed independently according to Eq. (7.49), leading to a potential parallelism of p = n_B processors. For a parallel implementation, we consider the Gauss–Seidel iteration for the red–black ordering (7.48) and (7.49) written out in component form:
\left( x_R^{(k+1)} \right)_i = \frac{1}{\hat{a}_{ii}} \Big( \hat{b}_i - \sum_{j \in N(i)} \hat{a}_{ij} \cdot \big( x_B^{(k)} \big)_j \Big) , \quad i = 1, \ldots, n_R ,

\left( x_B^{(k+1)} \right)_i = \frac{1}{\hat{a}_{i+n_R,\, i+n_R}} \Big( \hat{b}_{i+n_R} - \sum_{j \in N(i)} \hat{a}_{i+n_R,\, j} \cdot \big( x_R^{(k+1)} \big)_j \Big) , \quad i = 1, \ldots, n_B .
The set N(i) denotes the set of mesh points adjacent to mesh point i. According to the red–black ordering, the set N(i) contains only black mesh points for a red point i and vice versa. An implementation on a shared memory machine can employ at most p = n_R or p = n_B processors. There are no access conflicts for the parallel computation of x_R^{(k)} or x_B^{(k)}, but a barrier synchronization is needed between the two computation phases. The implementation on a distributed memory machine requires a distribution of computation and data. As discussed before for the parallel SOR method, it is useful to distribute the data according to the mesh structure
such that the processor P_q to which mesh point i is assigned is responsible for the computation or update of the corresponding component of the approximation vector. In a row-oriented distribution of a square mesh with √n × √n = n mesh points to p processors, √n / p rows of the mesh are assigned to each processor P_q, q ∈ {1, ..., p}. In the red–black coloring this means that each processor owns (1/2) · (n/p) red and (1/2) · (n/p) black mesh points. (For simplicity we assume that √n is a multiple of p.) Thus, the mesh points

(q−1) · n_R/p + 1, ..., q · n_R/p   for q = 1, ..., p   and
(q−1) · n_B/p + 1 + n_R, ..., q · n_B/p + n_R   for q = 1, ..., p

are assigned to processor P_q. Figure 7.18 shows an SPMD program
implementing the Gauss–Seidel iteration with red–black ordering. The coefficient matrix A is stored according to the pointer-based scheme introduced earlier in Fig. 7.3. After the computation of the red components xr, a function collect_elements(xr) distributes the red vector to all other processors for the next computation. Analogously, the black vector xb is distributed after its computation. The function collect_elements() can be implemented by a multi-broadcast operation.
Fig. 7.18 Program fragment for the parallel implementation of the Gauss–Seidel method with the red–black ordering. The arrays xr and xb denote the unknowns corresponding to the red or black mesh points. The processor number of the executing processor is stored in me.
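The program fragment of Fig. 7.18 is not reproduced here. The following C sketch shows the structure described above (red phase, multi-broadcast, black phase, multi-broadcast) for one iteration step, using 0-based indices. The CSR-style arrays row_ptr/col_ind/val for the off-diagonal entries of ˆA, the diagonal array diag, and the block bounds are assumptions of this sketch and differ from the pointer-based scheme of Fig. 7.3; collect_elements() is the function described in the text.

/* One Gauss-Seidel iteration step with red-black ordering (SPMD sketch). */
void gs_redblack_step(int me, int p, int n_r, int n_b,
                      double *xr, double *xb, const double *b,
                      const double *diag,
                      const int *row_ptr, const int *col_ind, const double *val)
{
  int lo_r = me * (n_r / p), hi_r = (me + 1) * (n_r / p);
  int lo_b = me * (n_b / p), hi_b = (me + 1) * (n_b / p);

  /* red phase: a red row couples only to black unknowns of the previous iterate */
  for (int i = lo_r; i < hi_r; i++) {
    double s = b[i];
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
      s -= val[k] * xb[col_ind[k] - n_r];     /* columns n_r .. n-1 are black */
    xr[i] = s / diag[i];
  }
  collect_elements(xr);   /* multi-broadcast of the updated red blocks */

  /* black phase: uses the red values just computed in this iteration step */
  for (int i = lo_b; i < hi_b; i++) {
    double s = b[n_r + i];
    for (int k = row_ptr[n_r + i]; k < row_ptr[n_r + i + 1]; k++)
      s -= val[k] * xr[col_ind[k]];           /* columns 0 .. n_r-1 are red */
    xb[i] = s / diag[n_r + i];
  }
  collect_elements(xb);   /* multi-broadcast of the updated black blocks */
}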
7.3.5.2 SOR Method for Red–Black Systems
An SOR method for the linear equation system (7.46) with relaxation parameter ω can be derived from the Gauss–Seidel computation (7.48) and (7.49) by using the combination of the new and the old approximation vectors as introduced in Formula (7.41). One step of the SOR method then has the form
\tilde{x}_R^{(k+1)} = D_R^{-1} \cdot b_1 - D_R^{-1} \cdot F \cdot x_B^{(k)} ,
\tilde{x}_B^{(k+1)} = D_B^{-1} \cdot b_2 - D_B^{-1} \cdot E \cdot x_R^{(k+1)} ,
x_R^{(k+1)} = x_R^{(k)} + \omega \big( \tilde{x}_R^{(k+1)} - x_R^{(k)} \big) ,
x_B^{(k+1)} = x_B^{(k)} + \omega \big( \tilde{x}_B^{(k+1)} - x_B^{(k)} \big) , \qquad k = 1, 2, \ldots
The corresponding splitting of the matrix ˆA is

\hat{A} = \frac{1}{\omega} \hat{D} - \hat{L} - \hat{U} - \frac{1-\omega}{\omega} \hat{D}

with the matrices ˆD, ˆL, ˆU introduced above. This can be written using block matrices:
\begin{pmatrix} D_R & 0 \\ \omega E & D_B \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k+1)} \\ x_B^{(k+1)} \end{pmatrix}
= (1-\omega) \begin{pmatrix} D_R & 0 \\ 0 & D_B \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k)} \\ x_B^{(k)} \end{pmatrix}
- \omega \begin{pmatrix} 0 & F \\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} x_R^{(k)} \\ x_B^{(k)} \end{pmatrix}
+ \omega \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} .   (7.51)
For a parallel implementation, the component form of this system is used. For convergence results, on the other hand, the matrix form and the iteration matrix have to be considered. Since the iteration matrix of the SOR method for a given linear equation system Ax = b with a certain order of the equations and the iteration matrix of the SOR method for the red–black system ˆAˆx = ˆb are different, convergence results cannot be transferred directly. The iteration matrix of the SOR method with red–black ordering is
\hat{S}_\omega = \Big( \frac{1}{\omega} \hat{D} - \hat{L} \Big)^{-1} \Big( \frac{1-\omega}{\omega} \hat{D} + \hat{U} \Big) .
For convergence of the method, it has to be shown that ρ(ˆS_ω) < 1 for the spectral radius of ˆS_ω and ω ∈ R. In general, the convergence cannot be derived from the convergence of the SOR method for the original system, since P^T S_ω P is not identical to ˆS_ω, although P^T A P = ˆA holds. However, for the specific case of the model problem, i.e., the discretized Poisson equation, the convergence can be shown: Using the equality P^T A P = ˆA, it follows that ˆA is symmetric and positive definite and, thus, the method converges for the model problem, see [61].
Figure 7.19 shows a parallel SPMD implementation of the SOR method for the red–black ordered discretized Poisson equation. The elements of the coefficient matrix are coded as constants. The unknowns are stored in a two-dimensional structure corresponding to the two-dimensional mesh and not as a vector.
Fig. 7.19 Program fragment of a parallel SOR method for a red–black ordered discretized Poisson equation.
Accordingly, the unknowns appear as x[i][j] in the program. The mesh points and the corresponding computations are distributed among the processors; the mesh points belonging to a specific processor are stored in myregion. The color red or black of a mesh point (i, j) is an additional attribute which can be retrieved by the functions is_red() and is_black(). The value f[i][j] denotes the discretized right-hand side of the Poisson equation as described earlier, see Eq. (7.15). The functions exchange_red_borders() and exchange_black_borders() exchange the data of the red or black mesh points between neighboring processors.
7.4 Conjugate Gradient Method
The conjugate gradient method or CG method is a solution method for linear equation systems Ax = b with a symmetric and positive definite matrix A ∈ R^{n×n}; it has been introduced in [86]. (A is symmetric if a_{ij} = a_{ji} and positive definite if x^T A x > 0 for all x ∈ R^n with x ≠ 0.) The CG method builds up a solution x* ∈ R^n in at most n steps in the absence of roundoff errors. Considering roundoff errors, more than n steps may be needed to get a good approximation of the exact solution x*. For sparse matrices, a good approximation of the solution can be achieved in fewer than n steps, also in the presence of roundoff errors [150]. In practice, the CG method is often used as a preconditioned CG method, which combines the CG method with a preconditioner [154]. Parallel implementations are discussed in [72, 133, 134, 154]; [155] gives an overview. In this section, we present the basic CG method and parallel implementations according to [23, 71, 166].
7.4.1 Sequential CG Method
The CG method exploits an equivalence between the solution of a linear equation system and the minimization of a function. More precisely, the solution x* of the linear equation system Ax = b, A ∈ R^{n×n}, b ∈ R^n, is the minimum of the function Φ : M ⊂ R^n → R with

\Phi(x) = \frac{1}{2} x^T A x - b^T x   (7.52)
if the matrix A is symmetric and positive definite. A simple method to determine the minimum of the function Φ is the method of the steepest gradient [71], which uses the negative gradient. For a given point x_c ∈ R^n, the function decreases most rapidly in the direction of the negative gradient. The method computes the following two steps:
(a) Computation of the negative gradient d_c ∈ R^n at the point x_c:

d_c = -\operatorname{grad} \Phi(x_c) = -\Big( \frac{\partial}{\partial x_1} \Phi(x_c), \ldots, \frac{\partial}{\partial x_n} \Phi(x_c) \Big) = b - A x_c .
(b) Determination of the minimum of Φ in the set

\{ x_c + t d_c \mid t \geq 0 \} \cap M ,

which forms a line in R^n (line search). This is done by inserting x_c + t d_c into Formula (7.52). Using d_c = b - A x_c and the symmetry of the matrix A, we get

\Phi(x_c + t d_c) = \Phi(x_c) - t\, d_c^T d_c + \frac{1}{2} t^2\, d_c^T A d_c .   (7.53)
The minimum of this function with respect to t ∈ R can be determined using the derivative of this function with respect to t. The minimum is attained at

t_c = \frac{d_c^T d_c}{d_c^T A d_c} .   (7.54)
The steps (a) and (b) of the method of the steepest gradient are used to create a sequence of vectors x_k, k = 0, 1, 2, ..., with x_0 ∈ R^n and x_{k+1} = x_k + t_k d_k. The sequence (Φ(x_k))_{k=0,1,2,...} is monotonically decreasing, which can be seen by inserting Formula (7.54) into Formula (7.53). The sequence converges toward the minimum, but the convergence might be slow [71].
The CG method uses a technique to determine the minimum which exploits orthogonal search directions in the sense of conjugate or A-orthogonal vectors d_k. For a given matrix A, which is symmetric and non-singular, two vectors x, y ∈ R^n are called conjugate or A-orthogonal if x^T A y = 0. If A is positive definite, k
pairwise conjugate vectors d_0, ..., d_{k-1} (with d_i ≠ 0, i = 0, ..., k-1, and k ≤ n) are linearly independent [23]. Thus, the unknown solution vector x* of Ax = b can be represented as a linear combination of the conjugate vectors d_0, ..., d_{n-1}, i.e.,

x^* = \sum_{k=0}^{n-1} t_k d_k .
Since the vectors are A-orthogonal, d_k^T A x^* = \sum_{l=0}^{n-1} d_k^T A\, t_l d_l = t_k\, d_k^T A d_k. This leads to

t_k = \frac{d_k^T A x^*}{d_k^T A d_k} = \frac{d_k^T b}{d_k^T A d_k}

for the coefficients t_k. Thus, when the orthogonal vectors are known, the values t_k, k = 0, ..., n-1, can be computed from the right-hand side b.
The algorithm for the CG method uses a representation

x^* = x_0 + \sum_{i=0}^{n-1} \alpha_i d_i   (7.56)

of the unknown solution vector x* as a sum of a starting vector x_0 and a term \sum_{i=0}^{n-1} \alpha_i d_i to be computed. The second term is computed recursively according to Formulas (7.57) and (7.58) given below.
Fig. 7.20 Algorithm of the CG method. (1) and (2) compute the value α_k according to Eq. (7.58). The vector w_k is used for the intermediate result A d_k. (3) is the computation given in Formula (7.57). (4) computes g_{k+1} for the next iteration step according to Formula (7.58) in a recursive way: g_{k+1} = A x_{k+1} - b = A(x_k + α_k d_k) - b = g_k + α_k A d_k. This vector g_{k+1} represents the error between the approximation x_{k+1} and the exact solution. (5) and (6) compute the next vector d_{k+1} of the set of conjugate gradients.
x_{k+1} = x_k + \alpha_k d_k , \quad k = 1, 2, \ldots, \quad \text{with}   (7.57)

\alpha_k = - \frac{g_k^T d_k}{d_k^T A d_k} .   (7.58)

Formulas (7.57) and (7.58) determine x* according to Eq. (7.56) by computing α_i and adding α_i d_i in each step, i = 1, 2, .... Thus, the solution is computed after at most n steps. If not all directions d_k are needed for x*, fewer than n steps are required.
Algorithms implementing the CG method do not choose the conjugate vectors d_0, ..., d_{n-1} before computing the vectors x_0, ..., x_{n-1}, but compute the next conjugate vector from the given gradient g_k by adding a correction term. The basic algorithm for the CG method is given in Fig. 7.20.
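Since Fig. 7.20 is referenced but not reproduced here, the following sequential C sketch shows one possible realization of the computation steps (1)-(6). It reuses the matvec() and dot() helpers from the steepest-descent sketch above; the stopping criterion based on eps and max_iter is an assumption of this sketch.

/* Basic CG iteration; x holds the starting vector x_0 on entry. */
void cg(int n, const double *A, const double *b, double *x,
        int max_iter, double eps)
{
  double *g = malloc(n * sizeof(double));   /* gradient g_k = A x_k - b */
  double *d = malloc(n * sizeof(double));   /* search direction d_k     */
  double *w = malloc(n * sizeof(double));   /* w_k = A d_k              */

  matvec(n, A, x, g);
  for (int i = 0; i < n; i++) { g[i] -= b[i]; d[i] = -g[i]; }
  double gamma = dot(n, g, g);              /* gamma_k = g_k^T g_k      */

  for (int k = 0; k < max_iter && gamma > eps * eps; k++) {
    matvec(n, A, d, w);                     /* (1) w_k = A d_k                        */
    double alpha = gamma / dot(n, d, w);    /* (2) alpha_k, using g_k^T d_k = -gamma_k */
    for (int i = 0; i < n; i++) x[i] += alpha * d[i];   /* (3) x_{k+1} = x_k + alpha_k d_k */
    for (int i = 0; i < n; i++) g[i] += alpha * w[i];   /* (4) g_{k+1} = g_k + alpha_k w_k */
    double gamma_new = dot(n, g, g);        /* (5) gamma_{k+1}                        */
    double beta = gamma_new / gamma;        /*     beta_k                             */
    for (int i = 0; i < n; i++) d[i] = -g[i] + beta * d[i];  /* (6) d_{k+1}           */
    gamma = gamma_new;
  }
  free(g); free(d); free(w);
}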
7.4.2 Parallel CG Method
The parallel implementation of the CG method is based on the algorithm given in Fig. 7.20. Each iteration step of this algorithm consists of the following basic vector and matrix operations.
7.4.2.1 Basic Operations of the CG Algorithm
The basic operations of the CG algorithm are

(1) a matrix–vector multiplication A d_k,
(2) two scalar products g_k^T g_k and d_k^T w_k,
(3) a so-called axpy-operation x_k + α_k d_k (the name axpy comes from "a times x plus y", describing the computation),
(4) an axpy-operation g_k + α_k w_k,
(5) a scalar product g_{k+1}^T g_{k+1}, and
(6) an axpy-operation −g_{k+1} + β_k d_k.
The result of g_k^T g_k is needed in two consecutive iteration steps, and so the computation of one scalar product can be avoided by storing g_k^T g_k in the scalar value γ_k. Since the computation mainly consists of one matrix–vector product and scalar products, a parallel implementation can be based on parallel versions of these operations.
Like the CG method, many algorithms from linear algebra are built up from basic operations like matrix–vector operations or axpy-operations, and efficient implementations of these basic operations lead to efficient implementations of the entire algorithms. The BLAS (Basic Linear Algebra Subroutines) library offers efficient implementations for a large set of basic operations. This includes many axpy-operations, which denote that a vector x is multiplied by a scalar value a and then added to another vector y. The prefixes s in saxpy and d in daxpy denote axpy-operations for single precision and double precision, respectively. Introductory descriptions of the BLAS library are given in [43] or [60]. A standard way to parallelize algorithms for linear algebra is to provide efficient parallel implementations of the BLAS operations and to build up a parallel algorithm from these basic parallel operations. This technique is ideally suited for the CG method since it consists of such basic operations.
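For illustration, a plain-C version of such an axpy operation might look as follows; tuned BLAS routines such as daxpy provide the same functionality with optimized implementations.

/* y = a * x + y, componentwise ("a times x plus y") */
void axpy(int n, double a, const double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}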
Here, we consider a parallel implementation based on the parallel implementations of the matrix–vector multiplication and the scalar product for distributed memory machines as presented in Sect. 3. These parallel implementations are based on a data distribution of the matrix and the vectors involved. For an efficient implementation of the CG method, it is important that the data distributions of the different basic operations fit together in order to avoid expensive data re-distributions between the operations. Figure 7.21 shows a data dependence graph in which the nodes correspond to the computation steps (1)-(6) of the CG algorithm in Fig. 7.20 and the arrows depict data dependences between these computation steps. The arrows are annotated with the data structures computed in one step (outgoing arrow) and needed by another step (incoming arrow). The data dependence graph for one iteration step k is a directed acyclic graph (DAG). There are also data dependences to the previous iteration step k−1 and the next iteration step k+1, which are shown as dashed arrows.
There are the following dependences in the CG method: Computation (2) needs the result w_k from computation (1), but also the vector d_k and the scalar value γ_k from the previous iteration step k−1; γ_k is used to store the intermediate result γ_k = g_k^T g_k. Computation (3) needs α_k from computation step (2) and the vectors x_k, d_k from the previous iteration step k−1.
Fig. 7.21 Data dependences between the computation steps (1)-(6) of the CG method in Fig. 7.20. Nodes represent the computation steps of one iteration step k. Incoming arrows are annotated with the data required and outgoing arrows with the data produced. Two nodes are connected by an arrow if one of the nodes produces data which are required by the node with the incoming arrow. The data dependences to the previous iteration step k−1 or the next iteration step k+1 are given as dashed arrows. The data are named in the same way as in Fig. 7.20; additionally, the scalar γ_k is used for the intermediate result γ_k = g_k^T g_k computed in step (5) and required for the computations of α and β in computation steps (2) and (5) of the next iteration step.
Computation (4) also needs α_k from computation step (2) and the vector w_k from computation (1). Computation (5) needs the vector g_{k+1} from computation (4) and the scalar value γ_k from the previous iteration step k−1; computation (6) needs the scalar value β_k from computation (5) and the vector d_k from iteration step k−1. This shows that there are many data dependences between the different basic operations. But it can also be observed that computation (3) is independent of the computations (4)-(6). Thus, the computation sequence (1),(2),(3),(4),(5),(6) as well as the sequence (1),(2),(4),(5),(6),(3) can be used. The independence of computation (3) from computations (4)-(6) is also another source of parallelism, which is a coarse-grained parallelism of two linear algebra operations performed in parallel, in contrast to the fine-grained parallelism exploited within a single basic operation. In the following, we concentrate on the fine-grained parallelism of basic linear algebra operations.
When the basic operations are implemented on a distributed memory machine, the data distribution of matrices and vectors and the data dependences between operations might require a data re-distribution for a correct implementation. Thus, the data dependence graph in Fig. 7.21 can also be used to study the communication requirements for re-distribution in a message-passing program. Also the data dependences between two iteration steps may lead to communication for data re-distribution. To demonstrate the communication requirements, we consider an implementation of the CG method in which the matrix A has a row-blockwise distribution and the vectors d_k, w_k, g_k, x_k, and r_k have a blockwise distribution. In one iteration step of a parallel implementation, the following computation and communication operations are performed.
7.4.2.2 Parallel CG Implementation with Blockwise Distribution
The parallel CG implementation has to consider data distributions in the following way:
(0) Before starting the computation of iteration step k, the vector d_k computed in the previous step has to be re-distributed from the blockwise distribution of step k−1 to the replicated distribution required for step k. This can be done with a multi-broadcast operation.
(1) The matrix–vector multiplication w_k = A d_k is implemented with a row-blockwise distribution of A as described in Sect. 3.6. Since d_k is now replicated, no further communication is needed. The result vector w_k is distributed in a blockwise way.
(2) The scalar product d_k^T w_k is computed in parallel with the same blockwise distribution of both vectors. (The scalar product γ_k = g_k^T g_k is computed in the previous iteration step.) Each processor computes a local scalar product for its local vectors. The final scalar product is then computed by the root processor of a single-accumulation operation with addition as reduction operation. This processor owns the final result α_k and sends it to all other processors by a single-broadcast operation.
(3) The scalar value α_k is known by each processor, and thus the axpy-operation x_{k+1} = x_k + α_k d_k can be done in parallel without further communication. Each