Parallel Programming: for Multicore and Cluster Systems – P41

7.2 Direct Methods for Linear Systems with Banded Structure

… for which the values of ã_j^(k−1), b̃_j^(k−1), c̃_j^(k−1), ỹ_j^(k−1) from the previous step, computed by a different processor, are required. Thus, there is a communication in each of the ⌈log p⌉ steps with a message size of four values. After step N = ⌈log p⌉, processor P_i computes

x̃_i = ỹ_i^(N) / b̃_i^(N).

Phase 3: Parallel substitution of cyclic reduction: After the second phase, the values x̃_i = x_{i·q} are already computed. In this phase, each processor P_i, 1 ≤ i ≤ p, computes the components of x assigned to it in several steps according to Eq. (7.27). In step k, k = Q−1, …, 0, the elements x_j, j = 2^k, …, n, with step size 2^{k+1} are computed. Processor P_i computes x_j with j div q + 1 = i, for which the values x̃_{i−1} = x_{(i−1)·q} and x̃_{i+1} = x_{(i+1)·q} computed by processors P_{i−1} and P_{i+1} are needed. Figure 7.12 illustrates the parallel algorithm for p = 2 and n = 8.

[Figure 7.12: diagram omitted; it shows the computation of the components x_1, …, x_8 (rows i = 1, …, 8) distributed over the two processors.]

Fig. 7.12 Illustration of the parallel algorithm for the cyclic reduction for n = 8 equations and p = 2 processors. The first and the third phases of the computation have log q = 2 steps. The second phase has log p = 1 step. As recursive doubling is used in the second phase, there are more components of the solution to be computed in the second phase compared with the computation shown in Fig. 7.11.
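To make the recursion concrete, the following is a minimal sequential sketch of cyclic reduction for a tridiagonal system. It follows the coefficient recursion of Eq. (7.23) and the substitution of Eq. (7.27), with out-of-range rows treated as the virtual equations x_j = 0 (a = c = 0, b = 1, y = 0). Function and variable names are illustrative; the parallel three-phase variant discussed above distributes these loop iterations over the processors, and the sketch assumes no zero pivots occur (e.g., a diagonally dominant system).

```python
import math
import numpy as np


def cyclic_reduction(a, b, c, y):
    """Solve a tridiagonal system by sequential cyclic reduction (sketch).

    a[i], b[i], c[i] are the lower, main, and upper diagonal entries of row i
    (0-based); a[0] and c[n-1] are ignored.  Out-of-range indices use the
    virtual equations x_j = 0 described in the text.
    """
    n = len(b)
    Q = int(math.ceil(math.log2(n)))

    def get(arr, i, default):
        return arr[i] if 0 <= i < n else default

    A = [np.array(a, dtype=float)]
    B = [np.array(b, dtype=float)]
    C = [np.array(c, dtype=float)]
    Y = [np.array(y, dtype=float)]
    A[0][0] = 0.0          # first row has no lower-diagonal entry
    C[0][n - 1] = 0.0      # last row has no upper-diagonal entry

    # elimination phase, Eq. (7.23): step k updates rows i = 2^k, 2*2^k, ...
    for k in range(1, Q + 1):
        Ap, Bp, Cp, Yp = A[-1], B[-1], C[-1], Y[-1]
        Ak, Bk, Ck, Yk = Ap.copy(), Bp.copy(), Cp.copy(), Yp.copy()
        d = 2 ** (k - 1)
        for i in range(2 ** k - 1, n, 2 ** k):      # 0-based row index
            alpha = -Ap[i] / get(Bp, i - d, 1.0)
            beta = -Cp[i] / get(Bp, i + d, 1.0)
            Ak[i] = alpha * get(Ap, i - d, 0.0)
            Ck[i] = beta * get(Cp, i + d, 0.0)
            Bk[i] = Bp[i] + alpha * get(Cp, i - d, 0.0) + beta * get(Ap, i + d, 0.0)
            Yk[i] = Yp[i] + alpha * get(Yp, i - d, 0.0) + beta * get(Yp, i + d, 0.0)
        A.append(Ak); B.append(Bk); C.append(Ck); Y.append(Yk)

    # substitution phase, Eq. (7.27): x_j = (y_j - a_j x_{j-2^k} - c_j x_{j+2^k}) / b_j
    x = np.zeros(n)
    for k in range(Q, -1, -1):
        d = 2 ** k
        for i in range(2 ** k - 1, n, 2 ** (k + 1)):
            xm = x[i - d] if i - d >= 0 else 0.0
            xp = x[i + d] if i + d < n else 0.0
            x[i] = (Y[k][i] - A[k][i] * xm - C[k][i] * xp) / B[k][i]
    return x
```

A quick sanity check is to solve a small diagonally dominant test system and compare the result against numpy.linalg.solve.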


7.2.2.5 Parallel Execution Time

The execution time of the parallel algorithm can be modeled by the following run-time functions. Phase 1 executes Q = ⌈log q⌉ = ⌈log(n/p)⌉ steps, where in step k with 1 ≤ k ≤ Q each processor computes at most q/2^k coefficient blocks of 4 values each. Each coefficient block requires 14 arithmetic operations according to Eq. (7.23). The computation time of phase 1 can therefore be estimated as

T_1(n, p) = 14 · t_op · Σ_{k=1}^{Q} q/2^k ≤ 14 · (n/p) · t_op.

Moreover, in each of the Q steps each processor exchanges two messages of 4 values each with its two neighboring processors by participating in single transfer operations. Since in each step the transfer operations can be performed by all processors in parallel without interference, the resulting communication time is

C_1(n, p) = 2Q · t_s2s(4) = 2 · ⌈log(n/p)⌉ · t_s2s(4),

where t_s2s(m) denotes the time of a single transfer operation with message size m.

Phase 2 executes ⌈log p⌉ steps. In each step, each processor computes 4 coefficients requiring 14 arithmetic operations. Then the value x̃_i = x_{i·q} is computed according to Eq. (7.28) by a single arithmetic operation. The computation time is therefore

T_2(n, p) = 14 · ⌈log p⌉ · t_op + t_op.

In each step, each processor sends and receives 4 data values from other processors, leading to a communication time

C_2(n, p) = 2 · ⌈log p⌉ · t_s2s(4).

In each step k of phase 3, k = 0, …, Q−1, each processor computes 2^k components of the solution vector according to Eq. (7.27). For each component, five operations are needed. Altogether, each processor computes Σ_{k=0}^{Q−1} 2^k = q − 1 components, with one component already computed in phase 2. The resulting computation time is

T_3(n, p) = 5 · (q − 1) · t_op = 5 · (n/p − 1) · t_op.


Moreover, each processor exchanges one data value with each of its neighboring processors; the communication time is therefore

C_3(n, p) = 2 · t_s2s(1).

The resulting total computation time is

T(n, p) = T_1(n, p) + T_2(n, p) + T_3(n, p)
        = (14 · n/p + 14 · ⌈log p⌉ + 5 · (n/p − 1) + 1) · t_op
        ≈ (19 · n/p + 14 · log p − 4) · t_op.

The communication overhead is

C(n, p) = (2 · ⌈log(n/p)⌉ + 2 · ⌈log p⌉) · t_s2s(4) + 2 · t_s2s(1) ≈ 2 · log n · t_s2s(4) + 2 · t_s2s(1).

Compared to the sequential algorithm, the parallel implementation leads to a small computational redundancy of 14 · ⌈log p⌉ operations. The communication overhead increases logarithmically with the number of rows, whereas the computation time increases linearly.
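The run-time functions above can be evaluated directly. The following sketch encodes T_1 + T_2 + T_3 and C_1 + C_2 + C_3 as Python functions; the cost parameters t_op and t_s2s(m) are model inputs rather than measured values, and the linear transfer-cost model in the usage example is only an assumption for illustration.

```python
import math


def cyclic_reduction_runtime(n, p, t_op, t_s2s):
    """Evaluate the run-time model T_1 + T_2 + T_3 and C_1 + C_2 + C_3.

    t_op is the time of one arithmetic operation and t_s2s(m) the time of a
    single transfer operation with message size m (both model assumptions).
    """
    q = n / p
    Q = math.ceil(math.log2(q))
    log_p = math.ceil(math.log2(p))

    T1 = 14 * t_op * sum(q / 2 ** k for k in range(1, Q + 1))
    T2 = (14 * log_p + 1) * t_op
    T3 = 5 * (q - 1) * t_op

    C1 = 2 * Q * t_s2s(4)
    C2 = 2 * log_p * t_s2s(4)
    C3 = 2 * t_s2s(1)
    return T1 + T2 + T3, C1 + C2 + C3


# example with an assumed linear transfer-cost model t_s2s(m) = startup + m * t_word
comp, comm = cyclic_reduction_runtime(
    n=2 ** 20, p=64, t_op=1e-9, t_s2s=lambda m: 1e-6 + 8e-9 * m)
```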

7.2.3 Generalization to Banded Matrices

The cyclic reduction algorithm can be generalized to banded matrices with semi-bandwidth r > 1. For the description we assume n = s · r. The matrix is represented as a block-tridiagonal matrix of the form

as a block-tridiagonal matrix of the form

⎡ B_1^(0)  C_1^(0)                                 ⎤   ⎡ X_1     ⎤   ⎡ Y_1^(0)     ⎤
⎢ A_2^(0)  B_2^(0)  C_2^(0)                        ⎥   ⎢ X_2     ⎥   ⎢ Y_2^(0)     ⎥
⎢             ⋱        ⋱        ⋱                  ⎥ · ⎢   ⋮     ⎥ = ⎢   ⋮         ⎥
⎢          A_{s−1}^(0)  B_{s−1}^(0)  C_{s−1}^(0)   ⎥   ⎢ X_{s−1} ⎥   ⎢ Y_{s−1}^(0) ⎥
⎣                       A_s^(0)      B_s^(0)       ⎦   ⎣ X_s     ⎦   ⎣ Y_s^(0)     ⎦ ,

where

A_i^(0) = (a_lm), l ∈ I_i, m ∈ I_{i−1}, for i = 2, …, s,
B_i^(0) = (a_lm), l ∈ I_i, m ∈ I_i, for i = 1, …, s,
C_i^(0) = (a_lm), l ∈ I_i, m ∈ I_{i+1}, for i = 1, …, s − 1,

are sub-matrices of A. The index sets are, for i = 1, …, s,

I_i = { j ∈ N | (i − 1) · r < j ≤ i · r }.


The vectors X_i, Y_i^(0) ∈ R^r are

X_i = (x_l), l ∈ I_i, and Y_i^(0) = (y_l), l ∈ I_i, for i = 1, …, s.
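As an illustration of the index sets I_i, the following sketch cuts a dense banded matrix A with semi-bandwidth r and n = s·r into the blocks A_i^(0), B_i^(0), C_i^(0) defined above (0-based NumPy slicing of the 1-based index sets); the function name is made up for this example.

```python
import numpy as np


def block_partition(A, r):
    """Cut a banded matrix A (semi-bandwidth r, n = s*r) into the blocks
    A_i^(0), B_i^(0), C_i^(0) of the block-tridiagonal representation.
    Returns dictionaries keyed by the 1-based block row index i."""
    n = A.shape[0]
    s = n // r
    rows = lambda i: slice((i - 1) * r, i * r)     # 0-based slice for index set I_i
    A_blk = {i: A[rows(i), rows(i - 1)] for i in range(2, s + 1)}
    B_blk = {i: A[rows(i), rows(i)] for i in range(1, s + 1)}
    C_blk = {i: A[rows(i), rows(i + 1)] for i in range(1, s)}
    return A_blk, B_blk, C_blk
```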

The algorithm from above is generalized by applying the computation steps described for single elements according to Eq. (7.23) to blocks, using matrix operations instead of operations on single elements. In the first step, three consecutive matrix equations i − 1, i, i + 1 for i = 3, 4, …, s − 2 are considered:

A_{i−1}^(0) X_{i−2} + B_{i−1}^(0) X_{i−1} + C_{i−1}^(0) X_i = Y_{i−1}^(0),
A_i^(0) X_{i−1} + B_i^(0) X_i + C_i^(0) X_{i+1} = Y_i^(0),
A_{i+1}^(0) X_i + B_{i+1}^(0) X_{i+1} + C_{i+1}^(0) X_{i+2} = Y_{i+1}^(0).

Equation (i − 1) is used to eliminate subvector X_{i−1} from equation i, and equation (i + 1) is used to eliminate subvector X_{i+1} from equation i. The elimination is applied recursively with the following initializations:

A_1^(0) := 0 ∈ R^{r×r},  C_s^(0) := 0 ∈ R^{r×r},

and for k = 0, …, ⌈log s⌉ and i ∈ Z \ {1, …, s}:

A_i^(k) = C_i^(k) := 0 ∈ R^{r×r},
B_i^(k) := I ∈ R^{r×r},
Y_i^(k) := 0 ∈ R^r.

In step k = 1, …, ⌈log s⌉ the following sub-matrices

α_i^(k) := −A_i^(k−1) · (B_{i−2^{k−1}}^(k−1))^{−1},
β_i^(k) := −C_i^(k−1) · (B_{i+2^{k−1}}^(k−1))^{−1},
A_i^(k) = α_i^(k) · A_{i−2^{k−1}}^(k−1),
C_i^(k) = β_i^(k) · C_{i+2^{k−1}}^(k−1),
B_i^(k) = α_i^(k) · C_{i−2^{k−1}}^(k−1) + B_i^(k−1) + β_i^(k) · A_{i+2^{k−1}}^(k−1)      (7.29)

and the vector

Y_i^(k) = α_i^(k) · Y_{i−2^{k−1}}^(k−1) + Y_i^(k−1) + β_i^(k) · Y_{i+2^{k−1}}^(k−1)      (7.30)

are computed. The resulting matrix equations are


A_i^(k) X_{i−2^k} + B_i^(k) X_i + C_i^(k) X_{i+2^k} = Y_i^(k)      (7.31)

for i = 1, …, s. In summary, the method of cyclic reduction for banded matrices comprises the following two phases:

1. Elimination phase: For k = 1, …, ⌈log s⌉ compute the matrices A_i^(k), B_i^(k), C_i^(k) and the vector Y_i^(k) for i = 2^k, …, s with step size 2^k according to Eqs. (7.29) and (7.30).

2. Substitution phase: For k = ⌈log s⌉, …, 0 compute the subvectors X_i for i = 2^k, …, s with step size 2^{k+1} by solving the linear equation system (7.31), i.e.,

B_i^(k) X_i = Y_i^(k) − A_i^(k) X_{i−2^k} − C_i^(k) X_{i+2^k}.

The computation of α_i^(k) and β_i^(k) requires a matrix inversion or the solution of a dense linear equation system with a direct method requiring O(r^3) operations, i.e., the computational effort increases cubically with the bandwidth. The first step requires the computation of O(s) = O(n/r) sub-matrices; the asymptotic runtime for this step is therefore O(n · r^2). The second step solves a total number of O(s) = O(n/r) linear equation systems, also resulting in an asymptotic runtime of O(n · r^2). For the parallel implementation of the cyclic reduction for banded matrices, the parallel method described for tridiagonal systems with its three phases can be used. The main difference is that arithmetic operations in the implementation for tridiagonal systems are replaced by matrix operations in the implementation for banded systems, which increases the amount of computation for each processor. The computational effort for the local operations is now O(r^3). Also, the communication between the processors exchanges larger messages: instead of single numbers, entire matrices of size r × r are exchanged, so that the message size is O(r^2). Thus, with growing semi-bandwidth r of the banded matrix the time for the computation increases faster than the communication time. For p ≪ s an efficient parallel implementation can be expected.
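A single elimination step of Eqs. (7.29) and (7.30) can be sketched with NumPy block operations as follows. The dictionaries are indexed by the 1-based block row i and can be produced, for example, by the block_partition sketch above; out-of-range blocks follow the initializations A = C = 0, B = I, Y = 0. Explicit inverses are formed only for readability — in practice one would solve the corresponding linear systems, which is exactly the O(r^3) cost discussed above. This is a sequential sketch, not the parallel implementation.

```python
import numpy as np


def block_elimination_step(A, B, C, Y, k, s, r):
    """One elimination step of block cyclic reduction, per Eqs. (7.29)/(7.30).

    A, B, C map the block row i = 1, ..., s to level-(k-1) r x r blocks,
    Y maps i to the level-(k-1) right-hand side block of length r.
    Returns the level-k blocks as new dictionaries.
    """
    Zm, Id, zv = np.zeros((r, r)), np.eye(r), np.zeros(r)
    gA = lambda i: A.get(i, Zm)                      # A_i = 0 outside / if absent
    gC = lambda i: C.get(i, Zm)                      # C_i = 0 outside / if absent
    gY = lambda i: Y.get(i, zv)                      # Y_i = 0 outside
    gB = lambda i: B[i] if 1 <= i <= s else Id       # B_i = I outside

    d = 2 ** (k - 1)
    An, Bn, Cn, Yn = dict(A), dict(B), dict(C), dict(Y)
    for i in range(2 ** k, s + 1, 2 ** k):           # i = 2^k, ..., s, step 2^k
        alpha = -gA(i) @ np.linalg.inv(gB(i - d))
        beta = -gC(i) @ np.linalg.inv(gB(i + d))
        An[i] = alpha @ gA(i - d)
        Cn[i] = beta @ gC(i + d)
        Bn[i] = gB(i) + alpha @ gC(i - d) + beta @ gA(i + d)
        Yn[i] = gY(i) + alpha @ gY(i - d) + beta @ gY(i + d)
    return An, Bn, Cn, Yn
```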

7.2.4 Solving the Discretized Poisson Equation

The cyclic reduction algorithm for banded matrices presented in Sect. 7.2.3 is suitable for the solution of the discretized two-dimensional Poisson equation. As shown in Sect. 7.2.1, this linear equation system has a banded structure with semi-bandwidth N, where N is the number of discretization points in the x- or y-dimension of the two-dimensional domain, see Fig. 7.9. The special structure has only four non-zero diagonals, and the band has a sparse structure. The use of the Gaussian elimination method would not preserve the sparse banded structure of the matrix, since the forward elimination for eliminating the two lower diagonals leads to fill-ins with non-zero elements between the two upper diagonals. This induces the higher computational effort which is needed for banded matrices with a dense band of semi-bandwidth N. In the following, we consider the method of cyclic reduction for banded matrices, which preserves the sparse banded structure.


The blocks of the discretized Poisson equation Az = d for a representation as a blocked tridiagonal matrix are given by Eqs. (7.18) and (7.19). Using the notation for the banded system, we get

B_i^(0) := (1/h^2) · B for i = 1, …, N,
A_i^(0) := −(1/h^2) · I and C_i^(0) := −(1/h^2) · I for i = 1, …, N.

The vector d ∈ R^n consists of N subvectors D_j ∈ R^N, i.e.,

d = (D_1, …, D_N)^T with D_j = (d_{(j−1)·N+1}, …, d_{j·N})^T.

Analogously, the solution vector z consists of N subvectors Z_j of length N each, i.e.,

z = (Z_1, …, Z_N)^T with Z_j = (z_{(j−1)·N+1}, …, z_{j·N})^T.

The initialization for the cyclic reduction algorithm is given by

B^(0) := B,
D_j^(0) := D_j for j = 1, …, N,
D_j^(k) := 0 for k = 0, …, ⌈log N⌉, j ∈ Z \ {1, …, N},
Z_j := 0 for j ∈ Z \ {1, …, N}.

In step k of the cyclic reduction, k = 1, …, ⌈log N⌉, the matrix B^(k) ∈ R^{N×N} and the vectors D_j^(k) ∈ R^N for j = 1, …, N are computed according to

B^(k) = (B^(k−1))^2 − 2I,
D_j^(k) = D_{j−2^{k−1}}^(k−1) + B^(k−1) · D_j^(k−1) + D_{j+2^{k−1}}^(k−1).      (7.32)

For k = 0, …, ⌈log N⌉ the corresponding equations are

−Z_{j−2^k} + B^(k) Z_j − Z_{j+2^k} = D_j^(k) for j = 1, …, N.      (7.33)

Together, Eqs. (7.32) and (7.33) represent the method of cyclic reduction for the discretized Poisson equation, which can be seen by induction. For k = 0, Eq. (7.33) is the initial equation system Az = d. For 0 < k < ⌈log N⌉ and j ∈ {1, …, N} the three equations


−Z_{j−2^{k+1}} + B^(k) Z_{j−2^k} − Z_j = D_{j−2^k}^(k),
−Z_{j−2^k} + B^(k) Z_j − Z_{j+2^k} = D_j^(k),
−Z_j + B^(k) Z_{j+2^k} − Z_{j+2^{k+1}} = D_{j+2^k}^(k)      (7.34)

are considered. The multiplication of Eq. (7.33) with B^(k) from the left results in

−B^(k) Z_{j−2^k} + B^(k) B^(k) Z_j − B^(k) Z_{j+2^k} = B^(k) D_j^(k).      (7.35)

Adding Eq. (7.35) to the first and the third equation in Eq. (7.34) results in

−Z_{j−2^{k+1}} − Z_j + B^(k) B^(k) Z_j − Z_j − Z_{j+2^{k+1}} = D_{j−2^k}^(k) + B^(k) D_j^(k) + D_{j+2^k}^(k),

which shows that Eq. (7.33) holds for k + 1 with the right-hand side D_j^(k+1) given by Formula (7.32), since B^(k) B^(k) − 2I = B^(k+1). In summary, the cyclic reduction for the discretized two-dimensional Poisson equation consists of the following two steps:

1. Elimination phase: For k = 1, …, ⌈log N⌉ the matrix B^(k) and the vectors D_j^(k) are computed for j = 2^k, …, N with step size 2^k according to Eq. (7.32).

2. Substitution phase: For k = ⌈log N⌉, …, 0 the linear equation system

B^(k) Z_j = D_j^(k) + Z_{j−2^k} + Z_{j+2^k}

for j = 2^k, …, N with step size 2^{k+1} is solved.

In the first phase, ⌈log N⌉ matrices B^(k) and O(N) subvectors D_j^(k) are computed. The computation of each matrix includes a matrix multiplication with time O(N^3). The computation of a subvector includes a matrix–vector multiplication with complexity O(N^2). Thus, the first phase has a computational complexity of O(N^3 · log N). In the second phase, O(N) linear equation systems are solved. This requires time O(N^3) when the special structure of the matrices B^(k) is not exploited. In [61] it is shown how to reduce the time by exploiting this structure. A parallel implementation of the discretized Poisson equation can be done in an analogous way as shown in the previous section.
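The two phases can be sketched directly from Eqs. (7.32) and (7.33). The sketch below assumes N = 2^q − 1 blocks so that the reduction never touches boundary blocks, takes the system in the normalized form −Z_{j−1} + B Z_j − Z_{j+1} = D_j, and uses the plain recursion B^(k) = (B^(k−1))^2 − 2I without the stabilization from [61]; names and the assertion are illustrative only.

```python
import numpy as np


def poisson_cyclic_reduction(B, D):
    """Cyclic reduction for -Z_{j-1} + B Z_j - Z_{j+1} = D_j, j = 1, ..., N.

    Sketch only: assumes N = 2**q - 1 blocks and uses the unstabilized
    recursion B^(k) = (B^(k-1))^2 - 2I from Eq. (7.32).
    """
    B = np.asarray(B, dtype=float)
    D = np.asarray(D, dtype=float)                 # row j-1 holds the block D_j
    N = D.shape[0]
    q = int(round(np.log2(N + 1)))
    assert 2 ** q - 1 == N, "this sketch assumes N = 2^q - 1"

    Bk = [B.copy()]                                # B^(k) for k = 0, ..., q
    Dk = [D.copy()]                                # D^(k)_j stored as rows

    # elimination phase, Eq. (7.32)
    for k in range(1, q + 1):
        Bp, Dp = Bk[-1], Dk[-1]
        Bk.append(Bp @ Bp - 2.0 * np.eye(B.shape[0]))
        Dn = Dp.copy()
        h = 2 ** (k - 1)
        for j in range(2 ** k, N + 1, 2 ** k):     # 1-based block index
            Dn[j - 1] = Dp[j - 1 - h] + Bp @ Dp[j - 1] + Dp[j - 1 + h]
        Dk.append(Dn)

    # substitution phase: solve B^(k) Z_j = D^(k)_j + Z_{j-2^k} + Z_{j+2^k}
    Z = np.zeros_like(D)
    for k in range(q, -1, -1):
        h = 2 ** k
        for j in range(2 ** k, N + 1, 2 ** (k + 1)):
            rhs = Dk[k][j - 1].copy()
            if j - h >= 1:
                rhs += Z[j - 1 - h]
            if j + h <= N:
                rhs += Z[j - 1 + h]
            Z[j - 1] = np.linalg.solve(Bk[k], rhs)
    return Z
```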

7.3 Iterative Methods for Linear Systems

In this section, we introduce classical iteration methods for solving linear equation systems, including the Jacobi iteration, the Gauss–Seidel iteration, and the SOR method (successive over-relaxation), and discuss their parallel implementation. Direct methods as presented in the previous sections involve a factorization of the coefficient matrix. This can be impractical for large and sparse matrices, since fill-ins with non-zero elements increase the computational work. For banded matrices, special methods can be adapted and used as discussed in Sect. 7.2. Another possibility is to use iterative methods as presented in this section.

Iterative methods for solving linear equation systems Ax = b with coefficient matrix A ∈ R^{n×n} and right-hand side b ∈ R^n generate a sequence of approximation vectors {x^(k)}_{k=1,2,…} that converges to the solution x* ∈ R^n. The computation of an approximation vector essentially involves a matrix–vector multiplication with the iteration matrix of the specific problem. The matrix A of the linear equation system is used to build this iteration matrix. For the evaluation of an iteration method it is essential how quickly the iteration sequence converges. Basic iteration methods are the Jacobi and the Gauss–Seidel methods, which are historically also called relaxation methods, since the computation of a new approximation depends on a combination of the previously computed approximation vectors. Depending on the specific problem to be solved, relaxation methods can be faster than direct solution methods, but they are still often not fast enough for practical use. A better convergence behavior can be observed for methods like the SOR method, which has a similar computational structure. The practical importance of relaxation methods lies in their use as preconditioners in combination with solution methods like the conjugate gradient method or the multigrid method. Iterative methods are a good first example to study parallelism of a form that is also typical for more complex iteration methods. In the following, we describe the relaxation methods according to [23]; see also [71, 166]. Parallel implementations are considered in [60, 61, 72, 154].

7.3.1 Standard Iteration Methods

Standard iteration methods for the solution of a linear equation system Ax = b are based on a splitting of the coefficient matrix A ∈ R^{n×n} into

A = M − N with M, N ∈ R^{n×n},

where M is a non-singular matrix for which the inverse M^{−1} can be computed easily, e.g., a diagonal matrix. For the unknown solution x* of the equation Ax = b we get

M x* = N x* + b.

This equation induces an iteration of the form M x^(k+1) = N x^(k) + b, which is usually written as

x^(k+1) = C x^(k) + d      (7.36)

with iteration matrix C := M^{−1} N and vector d := M^{−1} b. The iteration method is called convergent if the sequence {x^(k)}_{k=1,2,…} converges toward x* independently of the choice of the start vector x^(0) ∈ R^n, i.e., lim_{k→∞} x^(k) = x* or lim_{k→∞} ‖x^(k) − x*‖ = 0. Subtracting the fixed-point relation x* = C x* + d from Eq. (7.36) and iterating yields the equality x^(k) − x* = C^k (x^(0) − x*), where C^k denotes the matrix resulting from k multiplications of C. Thus, the convergence of Eq. (7.36) is equivalent to

lim_{k→∞} C^k = 0.

A result from linear algebra shows the relation between this convergence criterion and the spectral radius ρ(C) of the iteration matrix C. (The spectral radius of a matrix is the eigenvalue with the largest absolute value, i.e., ρ(C) = max{|λ| : λ is an eigenvalue of C}.) The following statements are equivalent:

(1) Iteration (7.36) converges for every x^(0) ∈ R^n.
(2) lim_{k→∞} C^k = 0.
(3) ρ(C) < 1.
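Criterion (3) can be checked numerically for small dense matrices; a minimal sketch, assuming the splitting matrices M and N are given explicitly:

```python
import numpy as np


def iteration_converges(M, N):
    """Criterion (3): x^(k+1) = C x^(k) + d with C = M^{-1} N converges for
    every start vector iff the spectral radius of C is below 1.
    Dense sketch intended for small matrices only."""
    C = np.linalg.solve(M, N)            # C = M^{-1} N without forming M^{-1}
    rho = max(abs(np.linalg.eigvals(C)))
    return rho < 1.0


# example: Jacobi splitting M = D, N = L + R = M - A for a small test matrix
A = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
M = np.diag(np.diag(A))
print(iteration_converges(M, M - A))     # True for this diagonally dominant A
```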

Well-known iteration methods are the Jacobi, the Gauss–Seidel, and the SOR method.

7.3.1.1 Jacobi Iteration

The Jacobi iteration is based on the splitting A = D − L − R of the matrix A with D, L, R ∈ R^{n×n}. The matrix D holds the diagonal elements of A, −L holds the elements of the lower triangular part of A without the diagonal elements, and −R holds the elements of the upper triangular part of A without the diagonal elements. All other elements of D, L, R are zero. The splitting is used for an iteration of the form

D x^(k+1) = (L + R) x^(k) + b,

which leads to the iteration matrix C_Ja := D^{−1}(L + R) or

C_Ja = (c_ij)_{i,j=1,…,n} with c_ij = −a_ij / a_ii for j ≠ i and c_ij = 0 otherwise.

The matrix form is used for the convergence proof, which is not shown here. For the practical computation, the equation written out with all its components is more suitable:

x_i^(k+1) = (1/a_ii) · ( b_i − Σ_{j=1, j≠i}^{n} a_ij · x_j^(k) ),  i = 1, …, n.

The computation of one component x_i^(k+1), i ∈ {1, …, n}, of the (k+1)th approximation requires all components of the kth approximation vector x^(k). Considering a sequential computation in the order x_1^(k+1), …, x_n^(k+1), it can be observed that the values x_1^(k+1), …, x_{i−1}^(k+1) are already known when x_i^(k+1) is computed. This information is exploited in the Gauss–Seidel iteration method.
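A minimal sequential sketch of the Jacobi iteration in the component form above; the maximum iteration count and the update-norm stopping test are assumptions added for illustration and are not part of the formula.

```python
import numpy as np


def jacobi(A, b, x0, max_iter=1000, tol=1e-10):
    """Jacobi iteration in component form (sketch)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    n = len(b)
    for _ in range(max_iter):
        x_new = np.empty(n)
        for i in range(n):
            s = sum(A[i, j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i, i]   # x_i^(k+1) uses only x_j^(k)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```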


7.3.1.2 Gauss–Seidel Iteration

The Gauss–Seidel iteration is based on the same splitting of the matrix A as the Jacobi iteration, i.e., A = D − L − R, but uses the splitting in a different way for an iteration

(D − L) x^(k+1) = R x^(k) + b.

Thus, the iteration matrix of the Gauss–Seidel method is C_Ga := (D − L)^{−1} R; this form is used for numerical properties like convergence proofs, which are not shown here. The component form for the practical use is

x_i^(k+1) = (1/a_ii) · ( b_i − Σ_{j=1}^{i−1} a_ij · x_j^(k+1) − Σ_{j=i+1}^{n} a_ij · x_j^(k) ),  i = 1, …, n.

It can be seen that the computation of x_i^(k+1), i ∈ {1, …, n}, uses the new information x_1^(k+1), …, x_{i−1}^(k+1) already determined for that approximation vector. This is useful for a faster convergence in a sequential implementation, but the potential parallelism is now restricted.
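The corresponding sketch of the Gauss–Seidel iteration differs from the Jacobi sketch only in that the vector is updated in place within one sweep, so the already computed components x_1^(k+1), …, x_{i−1}^(k+1) are reused immediately; again the stopping test is an assumption for illustration.

```python
import numpy as np


def gauss_seidel(A, b, x0, max_iter=1000, tol=1e-10):
    """Gauss-Seidel iteration in component form (sketch)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    n = len(b)
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            # x[j] for j < i already holds the new values of this sweep
            s = sum(A[i, j] * x[j] for j in range(n) if j != i)
            x_i = (b[i] - s) / A[i, i]
            change = max(change, abs(x_i - x[i]))
            x[i] = x_i
        if change < tol:
            break
    return x
```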

7.3.1.3 Convergence Criteria

For the Jacobi and the Gauss–Seidel iteration, the following convergence criterion based on the structure of A is often helpful: the Jacobi and the Gauss–Seidel iterations converge if the matrix A is strongly diagonally dominant, i.e.,

|a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|,  i = 1, …, n.

When the absolute values of the diagonal elements are large compared to the sum of the absolute values of the other row elements, this often leads to a better convergence. Also, when the iteration methods converge, the Gauss–Seidel iteration often converges faster than the Jacobi iteration, since the most recently computed vector components are always used. Still, the convergence is usually not fast enough for practical use. Therefore, an additional relaxation parameter is introduced to speed up the convergence.
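The strong diagonal dominance criterion is easy to test; a small sketch:

```python
import numpy as np


def strongly_diagonally_dominant(A):
    """Check |a_ii| > sum_{j != i} |a_ij| for every row i."""
    A = np.abs(np.asarray(A, dtype=float))
    diag = np.diag(A)
    off = A.sum(axis=1) - diag
    return bool(np.all(diag > off))
```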

7.3.1.4 JOR Method

The JOR method or Jacobi over-relaxation is based on the splitting

A = (1/ω) · D − L − R − ((1 − ω)/ω) · D

of the matrix A with a relaxation parameter ω ∈ R. The component form of this modification of the Jacobi method is sketched below.
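A minimal sketch of the JOR update, assuming the standard component form x_i^(k+1) = (ω/a_ii) · ( b_i − Σ_{j≠i} a_ij · x_j^(k) ) + (1 − ω) · x_i^(k); this formula is an assumption here, since it is not stated in the text above, and the stopping test and parameter names are illustrative.

```python
import numpy as np


def jor(A, b, x0, omega, max_iter=1000, tol=1e-10):
    """JOR / Jacobi over-relaxation iteration (sketch)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    n = len(b)
    for _ in range(max_iter):
        x_new = np.empty(n)
        for i in range(n):
            s = sum(A[i, j] * x[j] for j in range(n) if j != i)
            # damped Jacobi update: weight omega on the Jacobi step,
            # weight (1 - omega) on the previous iterate
            x_new[i] = omega * (b[i] - s) / A[i, i] + (1.0 - omega) * x[i]
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

For omega = 1 the update reduces to the plain Jacobi iteration.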
