for which the values of $\tilde{a}_j^{(k-1)}$, $\tilde{b}_j^{(k-1)}$, $\tilde{c}_j^{(k-1)}$, $\tilde{y}_j^{(k-1)}$ from the previous step, computed by a different processor, are required. Thus, there is a communication in each of the $\log p$ steps with a message size of four values.
After step $N = \log p$, processor $P_i$ computes
$$\tilde{x}_i = \tilde{y}_i^{(N)} / \tilde{b}_i^{(N)}.$$
Phase 3: Parallel substitution of cyclic reduction: After the second phase, the values $\tilde{x}_i = x_{i \cdot q}$ are already computed. In this phase, each processor $P_i$ performs several steps according to Eq. (7.27). In step $k$, $k = Q-1, \ldots, 0$, the elements $x_j$, $j = 2^k, \ldots, n$, with step size $2^{k+1}$ are computed. Processor $P_i$ computes $x_j$ with $j \,\mathrm{div}\, q + 1 = i$, for which the values $\tilde{x}_{i-1} = x_{(i-1)q}$ and $\tilde{x}_{i+1} = x_{(i+1)q}$ computed by processors $P_{i-1}$ and $P_{i+1}$ are needed. Figure 7.12 illustrates the parallel algorithm for $p = 2$ and $n = 8$.
Fig. 7.12 Illustration of the parallel algorithm for the cyclic reduction for n = 8 equations and p = 2 processors. The first and the third phases of the computation have log q = 2 steps. The second phase has log p = 1 step. As recursive doubling is used in the second phase, there are more components of the solution to be computed in the second phase compared with the computation shown in Fig. 7.11.
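To make the computation steps of the three phases concrete, the following is a minimal sequential sketch of cyclic reduction for a tridiagonal system, i.e., of the elimination and substitution computations that the parallel algorithm distributes over the processors. It is written in Python with NumPy; the function name, the 0-based array layout, and the assumption that n is a power of two are choices made for this sketch, not prescribed by the text.

```python
import numpy as np

def cyclic_reduction_tridiag(a, b, c, y):
    """Solve a tridiagonal system A x = y by sequential cyclic reduction.

    a, b, c hold the lower, main, and upper diagonals of A (length n each,
    with a[0] = c[n-1] = 0), y is the right-hand side, and n is assumed to
    be a power of two.  Indices are treated 1-based as in the text;
    coefficients outside 1..n behave like a = c = y = 0 and b = 1."""
    n = len(b)
    Q = n.bit_length() - 1                 # Q = log2(n) elimination steps

    def get(v, i, default=0.0):
        return v[i - 1] if 1 <= i <= n else default

    # elimination: step k combines equations i - 2^(k-1), i, i + 2^(k-1)
    levels = [tuple(np.asarray(v, dtype=float).copy() for v in (a, b, c, y))]
    for k in range(1, Q + 1):
        ap, bp, cp, yp = levels[k - 1]
        an, bn, cn, yn = ap.copy(), bp.copy(), cp.copy(), yp.copy()
        d = 2 ** (k - 1)
        for i in range(2 ** k, n + 1, 2 ** k):
            alpha = -get(ap, i) / get(bp, i - d, 1.0)
            beta = -get(cp, i) / get(bp, i + d, 1.0)
            an[i - 1] = alpha * get(ap, i - d)
            cn[i - 1] = beta * get(cp, i + d)
            bn[i - 1] = bp[i - 1] + alpha * get(cp, i - d) + beta * get(ap, i + d)
            yn[i - 1] = yp[i - 1] + alpha * get(yp, i - d) + beta * get(yp, i + d)
        levels.append((an, bn, cn, yn))

    # after Q steps only one equation remains: b_n^(Q) x_n = y_n^(Q)
    x = np.zeros(n)
    _, bQ, _, yQ = levels[Q]
    x[n - 1] = yQ[n - 1] / bQ[n - 1]

    # substitution: level k determines x_j for odd multiples j of 2^k
    for k in range(Q - 1, -1, -1):
        ak, bk, ck, yk = levels[k]
        for j in range(2 ** k, n + 1, 2 ** (k + 1)):
            left = x[j - 2 ** k - 1] if j - 2 ** k >= 1 else 0.0
            right = x[j + 2 ** k - 1] if j + 2 ** k <= n else 0.0
            x[j - 1] = (yk[j - 1] - ak[j - 1] * left - ck[j - 1] * right) / bk[j - 1]
    return x
```

On small systems the result can be checked against a dense solve with numpy.linalg.solve; in the parallel version described above, the loops over the equation indices are distributed blockwise over the p processors.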
7.2.2.5 Parallel Execution Time
The execution time of the parallel algorithm can be modeled by the following runtime functions. Phase 1 executes $Q = \log q$ steps, where in step $k$ with $1 \le k \le Q$ each processor computes at most $q/2^k$ coefficient blocks of 4 values each. Each coefficient block requires 14 arithmetic operations according to Eq. (7.23). The computation time of phase 1 can therefore be estimated as
$$T_1(n, p) = 14\, t_{op} \cdot \sum_{k=1}^{Q} \frac{q}{2^k} \le 14\,\frac{n}{p} \cdot t_{op}.$$
Moreover, in each of the $Q$ steps each processor exchanges two messages of 4 values each with its two neighboring processors by participating in single transfer operations. Since in each step the transfer operations can be performed by all processors in parallel without interference, the resulting communication time is
$$C_1(n, p) = 2Q \cdot t_{s2s}(4) = 2 \cdot \log\frac{n}{p} \cdot t_{s2s}(4),$$
where $t_{s2s}(m)$ denotes the time of a single transfer operation with message size $m$.
Phase 2 executes $\log p$ steps. In each step, each processor computes 4 coefficients requiring 14 arithmetic operations. Then the value $\tilde{x}_i = x_{i \cdot q}$ is computed according to Eq. (7.28) by a single arithmetic operation. The computation time is therefore
$$T_2(n, p) = 14 \log p \cdot t_{op} + t_{op}.$$
In each step, each processor sends 4 data values to and receives 4 data values from other processors, leading to a communication time
$$C_2(n, p) = 2 \log p \cdot t_{s2s}(4).$$
In each step $k$ of phase 3, $k = 0, \ldots, Q-1$, each processor computes $2^k$ components of the solution vector according to Eq. (7.27). For each component, five operations are needed. Altogether, each processor computes $\sum_{k=0}^{Q-1} 2^k = q - 1$ components, with one component already computed in phase 2. The resulting computation time is
$$T_3(n, p) = 5 \cdot (q - 1) \cdot t_{op} = 5 \cdot \left(\frac{n}{p} - 1\right) \cdot t_{op}.$$
Moreover, each processor exchanges one data value with each of its neighboring processors; the communication time is therefore
$$C_3(n, p) = 2 \cdot t_{s2s}(1).$$
The resulting total computation time is
$$T(n, p) = \left(14\,\frac{n}{p} + 14 \log p + 1 + 5\left(\frac{n}{p} - 1\right)\right) \cdot t_{op} = \left(19\,\frac{n}{p} + 14 \log p - 4\right) \cdot t_{op}.$$
The communication overhead is
$$C(n, p) = \left(2 \cdot \log\frac{n}{p} + 2 \cdot \log p\right) \cdot t_{s2s}(4) + 2 \cdot t_{s2s}(1) = 2 \cdot \log n \cdot t_{s2s}(4) + 2 \cdot t_{s2s}(1).$$
Compared to the sequential algorithm, the parallel implementation leads to a small computational redundancy of $14 \cdot \log p$ operations. The communication overhead increases logarithmically with the number of rows, whereas the computation time increases linearly.
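For a rough comparison of computation and communication costs, the runtime formulas above can be evaluated directly. In the following sketch, the operation time t_op and the transfer-time function t_s2s(m) are illustrative placeholders, and the function names T_par and C_par are chosen only for this example.

```python
import math

def T_par(n, p, t_op):
    """Computation time T(n, p) = (19 n/p + 14 log p - 4) * t_op of the
    parallel cyclic reduction according to the model above."""
    return (19 * n / p + 14 * math.log2(p) - 4) * t_op

def C_par(n, p, t_s2s):
    """Communication overhead C(n, p) = 2 log n * t_s2s(4) + 2 * t_s2s(1)."""
    return 2 * math.log2(n) * t_s2s(4) + 2 * t_s2s(1)

# illustrative machine parameters only
t_op = 1.0e-9                                  # time per arithmetic operation
t_s2s = lambda m: 1.0e-6 + 8.0e-9 * m          # startup plus per-value transfer time
for p in (4, 16, 64, 256):
    print(p, T_par(2 ** 20, p, t_op), C_par(2 ** 20, p, t_s2s))
```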
7.2.3 Generalization to Banded Matrices
The cyclic reduction algorithm can be generalized to banded matrices with semi-bandwidth $r > 1$. For the description we assume $n = s \cdot r$. The matrix is represented as a block-tridiagonal matrix of the form
$$\begin{pmatrix}
B_1^{(0)} & C_1^{(0)} & & & \\
A_2^{(0)} & B_2^{(0)} & C_2^{(0)} & & \\
 & \ddots & \ddots & \ddots & \\
 & & A_{s-1}^{(0)} & B_{s-1}^{(0)} & C_{s-1}^{(0)} \\
 & & & A_s^{(0)} & B_s^{(0)}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_{s-1} \\ X_s \end{pmatrix}
=
\begin{pmatrix} Y_1^{(0)} \\ Y_2^{(0)} \\ \vdots \\ Y_{s-1}^{(0)} \\ Y_s^{(0)} \end{pmatrix},$$
where
$$A_i^{(0)} = (a_{lm})_{l \in I_i,\, m \in I_{i-1}} \quad \text{for } i = 2, \ldots, s,$$
$$B_i^{(0)} = (a_{lm})_{l \in I_i,\, m \in I_i} \quad \text{for } i = 1, \ldots, s,$$
$$C_i^{(0)} = (a_{lm})_{l \in I_i,\, m \in I_{i+1}} \quad \text{for } i = 1, \ldots, s-1$$
are sub-matrices of $A$. The index sets are, for $i = 1, \ldots, s$,
$$I_i = \{ j \in \mathbb{N} \mid (i-1)r < j \le ir \}.$$
The vectors $X_i, Y_i^{(0)} \in \mathbb{R}^r$ are
$$X_i = (x_l)_{l \in I_i} \quad \text{and} \quad Y_i^{(0)} = (y_l)_{l \in I_i} \quad \text{for } i = 1, \ldots, s.$$
The algorithm from above is generalized by applying the computation steps described for single elements according to Eq. (7.23) to blocks, using matrix operations instead of operations on single elements. In the first step, three consecutive matrix equations $i-1$, $i$, $i+1$ for $i = 3, 4, \ldots, s-2$ are considered:
$$A_{i-1}^{(0)} X_{i-2} + B_{i-1}^{(0)} X_{i-1} + C_{i-1}^{(0)} X_i = Y_{i-1}^{(0)},$$
$$A_i^{(0)} X_{i-1} + B_i^{(0)} X_i + C_i^{(0)} X_{i+1} = Y_i^{(0)},$$
$$A_{i+1}^{(0)} X_i + B_{i+1}^{(0)} X_{i+1} + C_{i+1}^{(0)} X_{i+2} = Y_{i+1}^{(0)}.$$
Equation $(i-1)$ is used to eliminate subvector $X_{i-1}$ from equation $i$, and equation $(i+1)$ is used to eliminate subvector $X_{i+1}$ from equation $i$. The reduction is then applied recursively with the following initializations:
$$A_1^{(0)} := 0 \in \mathbb{R}^{r \times r}, \quad C_s^{(0)} := 0 \in \mathbb{R}^{r \times r},$$
and for $k = 0, \ldots, \log s$ and $i \in \mathbb{Z} \setminus \{1, \ldots, s\}$:
$$A_i^{(k)} := C_i^{(k)} := 0 \in \mathbb{R}^{r \times r}, \quad B_i^{(k)} := I \in \mathbb{R}^{r \times r}, \quad Y_i^{(k)} := 0 \in \mathbb{R}^r.$$
In step $k = 1, \ldots, \log s$ the following sub-matrices
$$\alpha_i^{(k)} := -A_i^{(k-1)} \left(B_{i-2^{k-1}}^{(k-1)}\right)^{-1},$$
$$\beta_i^{(k)} := -C_i^{(k-1)} \left(B_{i+2^{k-1}}^{(k-1)}\right)^{-1},$$
$$A_i^{(k)} := \alpha_i^{(k)} \cdot A_{i-2^{k-1}}^{(k-1)},$$
$$C_i^{(k)} := \beta_i^{(k)} \cdot C_{i+2^{k-1}}^{(k-1)},$$
$$B_i^{(k)} := \alpha_i^{(k)} C_{i-2^{k-1}}^{(k-1)} + B_i^{(k-1)} + \beta_i^{(k)} A_{i+2^{k-1}}^{(k-1)} \qquad (7.29)$$
and the vector
$$Y_i^{(k)} := \alpha_i^{(k)} Y_{i-2^{k-1}}^{(k-1)} + Y_i^{(k-1)} + \beta_i^{(k)} Y_{i+2^{k-1}}^{(k-1)} \qquad (7.30)$$
are computed. The resulting matrix equations are
$$A_i^{(k)} X_{i-2^k} + B_i^{(k)} X_i + C_i^{(k)} X_{i+2^k} = Y_i^{(k)} \qquad (7.31)$$
for $i = 1, \ldots, s$. In summary, the method of cyclic reduction for banded matrices
comprises the following two phases:
1. Elimination phase: For $k = 1, \ldots, \log s$ compute the matrices $A_i^{(k)}, B_i^{(k)}, C_i^{(k)}$ and the vector $Y_i^{(k)}$ for $i = 2^k, \ldots, s$ with step size $2^k$ according to Eqs. (7.29) and (7.30).
2. Substitution phase: For $k = \log s, \ldots, 0$ compute the subvectors $X_i$ for $i = 2^k, \ldots, s$ with step size $2^{k+1}$ by solving the linear equation system (7.31), i.e.,
$$B_i^{(k)} X_i = Y_i^{(k)} - A_i^{(k)} X_{i-2^k} - C_i^{(k)} X_{i+2^k}.$$
The computation of $\alpha_i^{(k)}$ and $\beta_i^{(k)}$ requires a matrix inversion or the solution of a dense linear equation system with a direct method requiring $O(r^3)$ operations, i.e., the computational cost grows cubically with the bandwidth. The first step requires the computation of $O(s) = O(n/r)$ sub-matrices; the asymptotic runtime for this step is therefore $O(nr^2)$. The second step solves a total of $O(s) = O(n/r)$ linear equation systems, also resulting in an asymptotic runtime of $O(nr^2)$.

For the parallel implementation of the cyclic reduction for banded matrices, the parallel method described for tridiagonal systems with its three phases can be used. The main difference is that arithmetic operations in the implementation for tridiagonal systems are replaced by matrix operations in the implementation for banded systems, which increases the amount of computation per processor. The computational effort for the local operations is now $O(r^3)$. Also, the communication between the processors exchanges larger messages: instead of single numbers, entire matrices of size $r \times r$ are exchanged, so that the message size is $O(r^2)$. Thus, with growing semi-bandwidth $r$ of the banded matrix, the time for the computation increases faster than the communication time. For $p \ll s$ an efficient parallel implementation can be expected.
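To illustrate the block operations involved, the following sketch performs one elimination step of the block cyclic reduction for a single block row i at step k, following Eqs. (7.29) and (7.30). The dictionary-based block storage and the function name are assumptions of this sketch; np.linalg.solve is used instead of an explicit inverse, which has the same O(r^3) cost.

```python
import numpy as np

def eliminate_block_row(A, B, C, Y, i, k, r):
    """One elimination step of block cyclic reduction for block row i at
    step k (cf. Eqs. (7.29) and (7.30)).  A, B, C map block indices to
    r x r matrices, Y maps indices to length-r vectors; missing indices
    default to A = C = 0, B = I, Y = 0 as in the initialization above."""
    d = 2 ** (k - 1)
    getA = lambda j: A.get(j, np.zeros((r, r)))
    getB = lambda j: B.get(j, np.eye(r))
    getC = lambda j: C.get(j, np.zeros((r, r)))
    getY = lambda j: Y.get(j, np.zeros(r))

    # alpha = -A_i (B_{i-d})^{-1}, beta = -C_i (B_{i+d})^{-1},
    # each computed by solving a dense r x r system (O(r^3) operations)
    alpha = -np.linalg.solve(getB(i - d).T, getA(i).T).T
    beta = -np.linalg.solve(getB(i + d).T, getC(i).T).T

    A_new = alpha @ getA(i - d)
    C_new = beta @ getC(i + d)
    B_new = getB(i) + alpha @ getC(i - d) + beta @ getA(i + d)
    Y_new = getY(i) + alpha @ getY(i - d) + beta @ getY(i + d)
    return A_new, B_new, C_new, Y_new
```

A complete solver would call this function for all block rows of a reduction level, store the results per level, and afterwards run the substitution phase by solving the systems $B_i^{(k)} X_i = Y_i^{(k)} - A_i^{(k)} X_{i-2^k} - C_i^{(k)} X_{i+2^k}$ level by level.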
7.2.4 Solving the Discretized Poisson Equation
The cyclic reduction algorithm for banded matrices presented in Sect. 7.2.3 is suitable for the solution of the discretized two-dimensional Poisson equation. As shown in Sect. 7.2.1, this linear equation system has a banded structure with semi-bandwidth $N$, where $N$ is the number of discretization points in the $x$- or $y$-dimension of the two-dimensional domain, see Fig. 7.9. Besides the main diagonal, the matrix has only four further non-zero diagonals, so the band itself is sparsely populated. The use of the Gaussian elimination method would not preserve this sparse banded structure of the matrix, since the forward elimination for eliminating the two lower diagonals leads to fill-in with non-zero elements between the two upper diagonals. This induces the higher computational effort that is needed for banded matrices with a dense band of semi-bandwidth $N$. In the following, we consider the method of cyclic reduction for banded matrices, which preserves the sparse banded structure.
The blocks of the discretized Poisson equation $Az = d$ for a representation as a blocked tridiagonal matrix are given by Eqs. (7.18) and (7.19). Using the notation for the banded system, we get
$$B_i^{(0)} := \frac{1}{h^2} B \quad \text{for } i = 1, \ldots, N,$$
$$A_i^{(0)} := -\frac{1}{h^2} I \quad \text{and} \quad C_i^{(0)} := -\frac{1}{h^2} I \quad \text{for } i = 1, \ldots, N.$$
The vector $d \in \mathbb{R}^n$ consists of $N$ subvectors $D_j \in \mathbb{R}^N$, i.e.,
$$d = \begin{pmatrix} D_1 \\ \vdots \\ D_N \end{pmatrix} \quad \text{with} \quad D_j = \begin{pmatrix} d_{(j-1)N+1} \\ \vdots \\ d_{jN} \end{pmatrix}.$$
Analogously, the solution vector consists of $N$ subvectors $Z_j$ of length $N$ each, i.e.,
$$z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_N \end{pmatrix} \quad \text{with} \quad Z_j = \begin{pmatrix} z_{(j-1)N+1} \\ \vdots \\ z_{j \cdot N} \end{pmatrix}.$$
The initialization for the cyclic reduction algorithm is given by
$$B^{(0)} := B, \qquad D_j^{(0)} := D_j \quad \text{for } j = 1, \ldots, N,$$
$$D_j^{(k)} := 0 \quad \text{for } k = 0, \ldots, \log N,\ j \in \mathbb{Z} \setminus \{1, \ldots, N\},$$
$$Z_j := 0 \quad \text{for } j \in \mathbb{Z} \setminus \{1, \ldots, N\}.$$
In step $k$ of the cyclic reduction, $k = 1, \ldots, \log N$, the matrix $B^{(k)} \in \mathbb{R}^{N \times N}$ and the vectors $D_j^{(k)} \in \mathbb{R}^N$ for $j = 1, \ldots, N$ are computed according to
$$B^{(k)} = \left(B^{(k-1)}\right)^2 - 2I,$$
$$D_j^{(k)} = D_{j-2^{k-1}}^{(k-1)} + B^{(k-1)} D_j^{(k-1)} + D_{j+2^{k-1}}^{(k-1)}. \qquad (7.32)$$
For $k = 0, \ldots, \log N$ the corresponding equations of the linear system have the form
$$-Z_{j-2^k} + B^{(k)} Z_j - Z_{j+2^k} = D_j^{(k)} \quad \text{for } j = 1, \ldots, N. \qquad (7.33)$$
j for j = 1, , n (7.33) Together Eqs (7.32) and (7.33) represent the method of cyclic reduction for the
discretized Poisson equation, which can be seen by induction For k= 0, Eq (7.33)
is the initial equation system Az = d For 0 < k < log N and j ∈ {1, , N} the
three equations
$$-Z_{j-2^{k+1}} + B^{(k)} Z_{j-2^k} - Z_j = D_{j-2^k}^{(k)},$$
$$-Z_{j-2^k} + B^{(k)} Z_j - Z_{j+2^k} = D_j^{(k)},$$
$$-Z_j + B^{(k)} Z_{j+2^k} - Z_{j+2^{k+1}} = D_{j+2^k}^{(k)} \qquad (7.34)$$
are considered. The multiplication of Eq. (7.33) with $B^{(k)}$ from the left results in
$$-B^{(k)} Z_{j-2^k} + B^{(k)} B^{(k)} Z_j - B^{(k)} Z_{j+2^k} = B^{(k)} D_j^{(k)}. \qquad (7.35)$$
Adding Eq. (7.35) to the first and the third equation in Eq. (7.34) results in
$$-Z_{j-2^{k+1}} - Z_j + B^{(k)} B^{(k)} Z_j - Z_j - Z_{j+2^{k+1}} = D_{j-2^k}^{(k)} + B^{(k)} D_j^{(k)} + D_{j+2^k}^{(k)},$$
i.e., $-Z_{j-2^{k+1}} + \left(\left(B^{(k)}\right)^2 - 2I\right) Z_j - Z_{j+2^{k+1}} = D_j^{(k+1)}$, which shows that Formula (7.32) for $k+1$ is derived. In summary, the cyclic reduction for the discretized two-dimensional Poisson equation consists of the following two steps:
1. Elimination phase: For $k = 1, \ldots, \log N$ the matrix $B^{(k)}$ and the vectors $D_j^{(k)}$ are computed for $j = 2^k, \ldots, N$ with step size $2^k$ according to Eq. (7.32).
2. Substitution phase: For $k = \log N, \ldots, 0$ the linear equation system
$$B^{(k)} Z_j = D_j^{(k)} + Z_{j-2^k} + Z_{j+2^k}$$
for $j = 2^k, \ldots, N$ with step size $2^{k+1}$ is solved.
In the first phase, $\log N$ matrices $B^{(k)}$ and $O(N)$ subvectors $D_j^{(k)}$ are computed. The computation of each matrix includes a matrix multiplication with time $O(N^3)$. The computation of a subvector includes a matrix–vector multiplication with complexity $O(N^2)$. Thus, the first phase has a computational complexity of $O(N^3 \log N)$. In the second phase, $O(N)$ linear equation systems are solved. This requires time $O(N^3)$ for each system when the special structure of the matrices $B^{(k)}$ is not exploited. In [61] it is shown how to reduce the time by exploiting this structure. A parallel implementation of the discretized Poisson equation can be done in an analogous way as shown in the previous section.
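A compact sequential sketch of the two phases is given below. It assumes that the block system has already been brought into the form $-Z_{j-1} + B Z_j - Z_{j+1} = D_j$ (e.g., by multiplying each block row by $h^2$) and that $N = 2^K - 1$, so that all neighbor indices needed during elimination stay inside the index range and the ghost vectors $Z_0 = Z_{N+1} = 0$ play the role of the boundary; the text's convention of zero blocks outside $1, \ldots, N$ serves the same purpose. The function name and the array layout are choices of this sketch.

```python
import numpy as np

def poisson_cyclic_reduction(B, D):
    """Cyclic reduction for the block system  -Z_{j-1} + B Z_j - Z_{j+1} = D_j,
    j = 1, ..., N, with Z_0 = Z_{N+1} = 0 (cf. Eqs. (7.32) and (7.33)).
    B is an N x N matrix, D is an N x N array whose column j-1 holds the
    subvector D_j; N = 2**K - 1 is assumed."""
    N = B.shape[0]
    K = (N + 1).bit_length() - 1            # N = 2**K - 1

    def getD(Dmat, j):                      # D_j, zero outside 1..N
        return Dmat[:, j - 1] if 1 <= j <= N else np.zeros(N)

    # elimination phase, Eq. (7.32)
    B_lev, D_lev = [B.astype(float).copy()], [D.astype(float).copy()]
    for k in range(1, K):
        Bp, Dp = B_lev[-1], D_lev[-1]
        Dn = Dp.copy()
        d = 2 ** (k - 1)
        for j in range(2 ** k, N + 1, 2 ** k):
            Dn[:, j - 1] = getD(Dp, j - d) + Bp @ getD(Dp, j) + getD(Dp, j + d)
        B_lev.append(Bp @ Bp - 2.0 * np.eye(N))
        D_lev.append(Dn)

    # substitution phase, Eq. (7.33): B^(k) Z_j = D_j^(k) + Z_{j-2^k} + Z_{j+2^k}
    Z = np.zeros((N, N))                    # column j-1 holds Z_j
    def getZ(j):
        return Z[:, j - 1] if 1 <= j <= N else np.zeros(N)

    for k in range(K - 1, -1, -1):
        Bk, Dk = B_lev[k], D_lev[k]
        for j in range(2 ** k, N + 1, 2 ** (k + 1)):
            rhs = getD(Dk, j) + getZ(j - 2 ** k) + getZ(j + 2 ** k)
            Z[:, j - 1] = np.linalg.solve(Bk, rhs)
    return Z
```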
7.3 Iterative Methods for Linear Systems
In this section, we introduce classical iteration methods for solving linear equation systems, including the Jacobi iteration, the Gauss–Seidel iteration, and the SOR method (successive over-relaxation), and discuss their parallel implementation. Direct methods as presented in the previous sections involve a factorization of the coefficient matrix. This can be impractical for large and sparse matrices, since fill-ins with non-zero elements increase the computational work. For banded matrices, special methods can be adapted and used as discussed in Sect. 7.2. Another possibility is to use iterative methods as presented in this section.
Iterative methods for solving a linear equation system $Ax = b$ with coefficient matrix $A \in \mathbb{R}^{n \times n}$ and right-hand side $b \in \mathbb{R}^n$ generate a sequence of approximation vectors $\{x^{(k)}\}_{k=1,2,\ldots}$ that converges to the solution $x^* \in \mathbb{R}^n$. The computation of an approximation vector essentially involves a matrix–vector multiplication with the iteration matrix of the specific problem. The matrix $A$ of the linear equation system is used to build this iteration matrix. For the evaluation of an iteration method it is essential how quickly the iteration sequence converges. Basic iteration methods are the Jacobi and the Gauss–Seidel methods, which historically are also called relaxation methods, since the computation of a new approximation depends on a combination of the previously computed approximation vectors. Depending on the specific problem to be solved, relaxation methods can be faster than direct solution methods, but they are still not fast enough for practical use on their own. A better convergence behavior can be observed for methods like the SOR method, which has a similar computational structure. The practical importance of relaxation methods lies in their use as preconditioners in combination with solution methods like the conjugate gradient method or the multigrid method. Iterative methods are also a good first example for studying the kind of parallelism that is typical of more complex iteration methods. In the following, we describe the relaxation methods according to [23], see also [71, 166]. Parallel implementations are considered in [60, 61, 72, 154].
7.3.1 Standard Iteration Methods
Standard iteration methods for the solution of a linear equation system $Ax = b$ are based on a splitting of the coefficient matrix $A \in \mathbb{R}^{n \times n}$ into
$$A = M - N \quad \text{with } M, N \in \mathbb{R}^{n \times n},$$
where $M$ is a non-singular matrix for which the inverse $M^{-1}$ can be computed easily, e.g., a diagonal matrix. For the unknown solution $x^*$ of the equation $Ax = b$ we get
$$M x^* = N x^* + b.$$
This equation induces an iteration of the form $M x^{(k+1)} = N x^{(k)} + b$, which is usually written as
$$x^{(k+1)} = C x^{(k)} + d \qquad (7.36)$$
with iteration matrix $C := M^{-1} N$ and vector $d := M^{-1} b$. The iteration method is called convergent if the sequence $\{x^{(k)}\}_{k=1,2,\ldots}$ converges toward $x^*$ independently of the choice of the start vector $x^{(0)} \in \mathbb{R}^n$, i.e., $\lim_{k \to \infty} x^{(k)} = x^*$ or $\lim_{k \to \infty} \|x^{(k)} - x^*\| = 0$. Since $x^* = C x^* + d$, subtracting this from Eq. (7.36) yields the equality $x^{(k)} - x^* = C^k (x^{(0)} - x^*)$, where $C^k$ denotes the matrix resulting from $k$ multiplications of $C$. Thus, the convergence of Eq. (7.36) is equivalent to
$$\lim_{k \to \infty} C^k = 0.$$
A result from linear algebra shows the relation between this convergence criterion and the spectral radius $\rho(C)$ of the iteration matrix $C$. (The spectral radius of a matrix is the largest absolute value of its eigenvalues, i.e., $\rho(C) = \max\{|\lambda| \mid \lambda \text{ is an eigenvalue of } C\}$.) The following three conditions are equivalent:
(1) Iteration (7.36) converges for every $x^{(0)} \in \mathbb{R}^n$.
(2) $\lim_{k \to \infty} C^k = 0$.
(3) $\rho(C) < 1$.
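Criterion (3) is easy to check numerically for small examples. The sketch below builds the iteration matrix of a splitting A = M − N and tests whether its spectral radius is below 1; the function names and the dense NumPy representation are illustrative choices, not part of the text.

```python
import numpy as np

def iteration_matrix(M, N):
    """Iteration matrix C = M^{-1} N of a splitting A = M - N."""
    return np.linalg.solve(M, N)

def splitting_converges(M, N):
    """Criterion (3): the iteration converges if and only if rho(C) < 1."""
    C = iteration_matrix(M, N)
    return bool(max(abs(np.linalg.eigvals(C))) < 1.0)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
M = np.diag(np.diag(A))                 # an easily invertible choice for M
print(splitting_converges(M, M - A))    # True for this matrix
```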
Well-known iteration methods are the Jacobi, the Gauss–Seidel, and the SOR method.
7.3.1.1 Jacobi Iteration
The Jacobi iteration is based on the splitting $A = D - L - R$ of the matrix $A$ with $D, L, R \in \mathbb{R}^{n \times n}$. The matrix $D$ holds the diagonal elements of $A$, $-L$ holds the elements of the lower triangular part of $A$ without the diagonal elements, and $-R$ holds the elements of the upper triangular part of $A$ without the diagonal elements. All other elements of $D, L, R$ are zero. The splitting is used for an iteration of the form
$$D x^{(k+1)} = (L + R) x^{(k)} + b,$$
which leads to the iteration matrix $C_{Ja} := D^{-1}(L + R)$ or
$$C_{Ja} = (c_{ij})_{i,j=1,\ldots,n} \quad \text{with} \quad c_{ij} = \begin{cases} -a_{ij}/a_{ii} & \text{for } j \ne i, \\ 0 & \text{otherwise.} \end{cases}$$
The matrix form is used for the convergence proof, not shown here. For the practical computation, the equation written out with all its components is more suitable:
$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1,\, j \ne i}^{n} a_{ij} x_j^{(k)} \right), \quad i = 1, \ldots, n.$$
The computation of one component $x_i^{(k+1)}$, $i \in \{1, \ldots, n\}$, of the $(k+1)$th approximation requires all components of the $k$th approximation vector $x^{(k)}$. Considering a sequential computation in the order $x_1^{(k+1)}, \ldots, x_n^{(k+1)}$, it can be observed that the values $x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}$ are already known when $x_i^{(k+1)}$ is computed. This information is exploited in the Gauss–Seidel iteration method.
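The component form above translates directly into code. The following is a minimal dense NumPy sketch of the Jacobi iteration; the function name, the stopping test based on the maximum norm of the update, and the iteration limit are choices made here, not taken from the text.

```python
import numpy as np

def jacobi(A, b, x0, max_iter=100, tol=1e-10):
    """Jacobi iteration in component form: every component of x^(k+1) is
    computed from the previous approximation x^(k) only."""
    x = np.asarray(x0, dtype=float).copy()
    D = np.diag(A)
    for _ in range(max_iter):
        # b_i - sum_{j != i} a_ij x_j^(k), divided by a_ii, for all i at once
        x_new = (b - (A @ x - D * x)) / D
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new
        x = x_new
    return x
```

Since every component of the new approximation depends only on the previous one, the n component updates are independent and can be distributed over the processors, which is the basis of the parallel implementation of the Jacobi method.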
Trang 107.3.1.2 Gauss–Seidel Iteration
The Gauss–Seidel iteration is based on the same splitting of the matrix $A$ as the Jacobi iteration, i.e., $A = D - L - R$, but uses the splitting in a different way for an iteration
$$(D - L) x^{(k+1)} = R x^{(k)} + b.$$
Thus, the iteration matrix of the Gauss–Seidel method is $C_{Ga} := (D - L)^{-1} R$; this form is used for numerical properties like convergence proofs, not shown here. The component form for the practical use is
$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right), \quad i = 1, \ldots, n.$$
It can be seen that the computation of $x_i^{(k+1)}$, $i \in \{1, \ldots, n\}$, uses the new information $x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}$ already determined for that approximation vector. This is useful for a faster convergence in a sequential implementation, but the potential parallelism is now restricted.
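For comparison, a corresponding sketch of the Gauss–Seidel iteration in component form follows (with the same hypothetical interface as the Jacobi sketch above); the inner loop over i must run sequentially, reflecting the restricted parallelism just mentioned.

```python
import numpy as np

def gauss_seidel(A, b, x0, max_iter=100, tol=1e-10):
    """Gauss-Seidel iteration in component form: the already updated values
    x_1^(k+1), ..., x_{i-1}^(k+1) are used when x_i^(k+1) is computed."""
    n = len(b)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s_new = A[i, :i] @ x[:i]          # components of x^(k+1)
            s_old = A[i, i + 1:] @ x[i + 1:]  # components of x^(k)
            x[i] = (b[i] - s_new - s_old) / A[i, i]
        if np.linalg.norm(x - x_old, ord=np.inf) < tol:
            break
    return x
```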
7.3.1.3 Convergence Criteria
For the Jacobi and the Gauss–Seidel iteration, the following convergence criterion based on the structure of $A$ is often helpful. The Jacobi and the Gauss–Seidel iterations converge if the matrix $A$ is strongly diagonally dominant, i.e.,
$$|a_{ii}| > \sum_{j=1,\, j \ne i}^{n} |a_{ij}|, \quad i = 1, \ldots, n.$$
When the absolute values of the diagonal elements are large compared to the sum of the absolute values of the other row elements, this often leads to better convergence. Also, when the iteration methods converge, the Gauss–Seidel iteration often converges faster than the Jacobi iteration, since it always uses the most recently computed vector components. Still, the convergence is usually not fast enough for practical use. Therefore, an additional relaxation parameter is introduced to speed up the convergence.
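The strong diagonal dominance criterion above can be checked with a few lines of NumPy; the helper name below is chosen only for this illustration.

```python
import numpy as np

def is_strongly_diagonally_dominant(A):
    """Check |a_ii| > sum_{j != i} |a_ij| for every row i."""
    abs_A = np.abs(A)
    diag = np.diag(abs_A)
    return bool(np.all(diag > abs_A.sum(axis=1) - diag))
```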
7.3.1.4 JOR Method
The JOR method or Jacobi over-relaxation is based on the splitting
$$A = \frac{1}{\omega} D - L - R - \frac{1-\omega}{\omega} D$$
of the matrix $A$ with a relaxation parameter $\omega \in \mathbb{R}$. The component form of this modification of the Jacobi method is