processor performs the arithmetic operations locally and the result vector $x_{k+1}$ is obtained in a blockwise distribution.
(4) The axpy operation $g_{k+1} = g_k + \alpha_k w_k$ is computed analogously to computation step (3), and the result vector $g_{k+1}$ is distributed in a blockwise way.
(5) The scalar product $\gamma_{k+1} = g_{k+1}^T g_{k+1}$ is computed analogously to computation step (2). The resulting scalar value $\beta_k$ is computed by the root processor of a single-accumulation operation and then broadcast to all other processors.
(6) The axpy operation $d_{k+1} = -g_{k+1} + \beta_k d_k$ is computed analogously to computation step (3). The result vector $d_{k+1}$ has a blockwise distribution.
7.4.2.3 Parallel Execution Time
The parallel execution time of one iteration step of the CG method is the sum of the parallel execution times of the basic operations involved. We derive the parallel execution time for p processors; n is the system size. It is assumed that n is a multiple of p. The parallel execution time of one axpy operation is given by

$$T_{\text{axpy}} = 2 \cdot \frac{n}{p} \cdot t_{op},$$

since each processor computes n/p components and the computation of each component needs one multiplication and one addition. As in earlier sections, the time for one arithmetic operation is denoted by $t_{op}$. The parallel execution time of a scalar product is
$$T_{\text{scal\_prod}} = \left(2 \cdot \frac{n}{p} - 1\right) \cdot t_{op} + T_{acc}(+)(p, 1) + T_{sb}(p, 1),$$
where $T_{acc}(op)(p, m)$ denotes the communication time of a single-accumulation operation with reduction operation op on p processors and message size m. The computation of the local scalar products with n/p components requires n/p multiplications and n/p − 1 additions. The distribution of the result of the parallel scalar product, which is a scalar value, i.e., has size 1, needs the time of a single-broadcast operation $T_{sb}(p, 1)$. The matrix–vector multiplication needs time
$$T_{\text{mat\_vec\_mult}} = 2 \cdot \frac{n^2}{p} \cdot t_{op},$$
since each processor computes n/p scalar products. The total parallel execution time of one CG iteration is

$$T_{CG} = T_{mb}\!\left(p, \frac{n}{p}\right) + T_{\text{mat\_vec\_mult}} + 2 \cdot T_{\text{scal\_prod}} + 3 \cdot T_{\text{axpy}},$$

where $T_{mb}(p, m)$ is the time of a multi-broadcast operation with p processors and message size m. This operation is needed for the re-distribution of the direction vector $d_k$ from iteration step k.
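This formula can be turned into a small cost model. The following C sketch evaluates $T_{CG}$ for given n and p. The linear models used for the communication times $T_{acc}$, $T_{sb}$, and $T_{mb}$ (a startup time plus a per-word cost per communication stage) and the constants t_op, t_s, t_w are purely illustrative assumptions, not values prescribed by the derivation.

#include <math.h>
#include <stdio.h>

/* Illustrative machine constants (assumptions). */
static const double t_op = 1.0e-9;               /* time of one arithmetic operation   */
static const double t_s  = 5.0e-6, t_w = 4.0e-9; /* assumed startup and per-word costs */

/* Simple tree/ring models for the collective operations (assumptions). */
static double T_acc(int p, int m) { return ceil(log2(p)) * (t_s + m * t_w); }
static double T_sb (int p, int m) { return ceil(log2(p)) * (t_s + m * t_w); }
static double T_mb (int p, int m) { return (p - 1) * (t_s + m * t_w); }

static double T_axpy    (int n, int p) { return 2.0 * n / p * t_op; }
static double T_scalprod(int n, int p) { return (2.0 * n / p - 1.0) * t_op + T_acc(p, 1) + T_sb(p, 1); }
static double T_matvec  (int n, int p) { return 2.0 * (double)n * n / p * t_op; }

/* Parallel execution time of one CG iteration according to the formula above. */
static double T_cg_iteration(int n, int p) {
    return T_mb(p, n / p) + T_matvec(n, p) + 2.0 * T_scalprod(n, p) + 3.0 * T_axpy(n, p);
}

int main(void) {
    int n = 16384;
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d: T_CG = %g s\n", p, T_cg_iteration(n, p));
    return 0;
}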
7.5 Cholesky Factorization for Sparse Matrices
Linear equation systems arising in practice are often large but have sparse coefficient matrices, i.e., they have many zero entries. For sparse matrices with regular structure, like banded matrices, only the diagonals with non-zero elements are stored and the solution methods introduced in the previous sections can be used. For an unstructured pattern of non-zero elements in sparse matrices, however, a more general storage scheme is needed and other parallel solution methods are applied.

In this section, we consider the Cholesky factorization as an example of such a solution method. The general sequential factorization algorithm and its variants for sparse matrices are introduced in Sect. 7.5.1. A specific storage scheme for sparse unstructured matrices is given in Sect. 7.5.2. In Sect. 7.5.3, we discuss parallel implementations of sparse Cholesky factorization for shared memory machines.
7.5.1 Sequential Algorithm
The Cholesky factorization is a direct solution method for a linear equation system $Ax = b$ whose coefficient matrix $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite, i.e., $a_{ij} = a_{ji}$ and $x^T A x > 0$ for all $x \in \mathbb{R}^n$ with $x \neq 0$. For such a matrix there exists a unique triangular factorization

$$A = L L^T, \qquad (7.59)$$

where $L = (l_{ij})_{i,j=1,\ldots,n}$ is a lower triangular matrix, i.e., $l_{ij} = 0$ for $i < j$ and $i, j \in \{1, \ldots, n\}$, with positive diagonal elements, i.e., $l_{ii} > 0$ for $i = 1, \ldots, n$; $L^T$ denotes the transposed matrix of L, i.e., $L^T = (l^T_{ij})_{i,j=1,\ldots,n}$ with $l^T_{ij} = l_{ji}$ [166].

Using the factorization in Eq. (7.59), the solution x of a system of equations $Ax = b$ with $b \in \mathbb{R}^n$ is determined in two steps by solving the triangular systems $Ly = b$ and $L^T x = y$ one after another. Because of $Ly = L L^T x = Ax = b$, the vector $x \in \mathbb{R}^n$ is the solution of the given linear equation system.
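To make the two-step solution concrete, the following C sketch performs the forward and backward substitutions for a dense lower triangular factor L stored row-major; the function name and the array layout are illustrative choices, not part of the text.

#include <stddef.h>

/* Solve A x = b given the Cholesky factor L (A = L L^T):
   first L y = b by forward substitution, then L^T x = y by back substitution. */
void cholesky_solve(size_t n, const double *L, const double *b, double *y, double *x)
{
    for (size_t i = 0; i < n; i++) {          /* forward substitution */
        double s = b[i];
        for (size_t k = 0; k < i; k++)
            s -= L[i * n + k] * y[k];
        y[i] = s / L[i * n + i];
    }
    for (size_t i = n; i-- > 0; ) {           /* back substitution, L^T[i][k] = L[k][i] */
        double s = y[i];
        for (size_t k = i + 1; k < n; k++)
            s -= L[k * n + i] * x[k];
        x[i] = s / L[i * n + i];
    }
}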
The implementation of the Cholesky factorization can be derived from a column-wise formulation of $A = L L^T$. Comparing the elements of A and $L L^T$, we obtain
$$a_{ij} = \sum_{k=1}^{n} l_{ik} l^T_{kj} = \sum_{k=1}^{n} l_{ik} l_{jk} = \sum_{k=1}^{j} l_{ik} l_{jk} = \sum_{k=1}^{j} l_{jk} l_{ik},$$
since $l_{jk} = 0$ for $k > j$ and by exchanging the two factors in the last summation. Denoting the columns of A as $\tilde{a}_1, \ldots, \tilde{a}_n$ and the columns of L as $\tilde{l}_1, \ldots, \tilde{l}_n$ results in an equality for column $\tilde{a}_j = (a_{1j}, \ldots, a_{nj})$ and columns $\tilde{l}_k = (l_{1k}, \ldots, l_{nk})$ for $k \leq j$:
$$\tilde{a}_j = \sum_{k=1}^{j} l_{jk} \tilde{l}_k,$$
leading to

$$l_{jj} \tilde{l}_j = \tilde{a}_j - \sum_{k=1}^{j-1} l_{jk} \tilde{l}_k \qquad (7.60)$$

for $j = 1, \ldots, n$. If the columns $\tilde{l}_k$, $k = 1, \ldots, j-1$, are already known, the right-hand side of Formula (7.60) is computable and the column $\tilde{l}_j$ can also be computed. Thus, the columns of L are computed one after another. The computation of column $\tilde{l}_j$ has two cases:
For the diagonal element, the computation is

$$l_{jj} \, l_{jj} = a_{jj} - \sum_{k=1}^{j-1} l_{jk} l_{jk} \quad \text{or} \quad l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}\,.$$
For the elements $l_{ij}$, $i > j$, the computation is

$$l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{jk} l_{ik} \right).$$
The elements in the upper triangular part of matrix L are $l_{ij} = 0$ for $i < j$.
The Cholesky factorization yields the factorization $A = L L^T$ for a given matrix A [65] by computing $L = (l_{ij})_{i=0,\ldots,n-1,\, j=0,\ldots,i}$ from $A = (a_{ij})_{i,j=0,\ldots,n-1}$ column by column from left to right according to the following algorithm, in which the numbering starts with 0:
(I)
for (j = 0; j < n; j++) {
    l_{jj} = sqrt(a_{jj} - sum_{k=0}^{j-1} l_{jk}^2);
    for (i = j+1; i < n; i++)
        l_{ij} = (a_{ij} - sum_{k=0}^{j-1} l_{jk} * l_{ik}) / l_{jj};
}
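A direct C translation of scheme (I) for a dense matrix, with A stored row-major and its lower triangle overwritten by L, might look as follows; this dense sketch only illustrates the computation order and ignores the sparsity issues discussed below.

#include <math.h>

/* Cholesky factorization following scheme (I): a[i*n+j] initially holds a_ij;
   on return the lower triangle holds l_ij (the strict upper triangle is not used). */
void cholesky_dense(int n, double *a)
{
    for (int j = 0; j < n; j++) {
        double d = a[j * n + j];
        for (int k = 0; k < j; k++)
            d -= a[j * n + k] * a[j * n + k];
        a[j * n + j] = sqrt(d);                       /* l_jj */
        for (int i = j + 1; i < n; i++) {
            double s = a[i * n + j];
            for (int k = 0; k < j; k++)
                s -= a[j * n + k] * a[i * n + k];
            a[i * n + j] = s / a[j * n + j];          /* l_ij */
        }
    }
}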
Fig. 7.22 Computational structures and data dependences for the computation of L according to the basic algorithm (left), the left-looking algorithm (middle), and the right-looking algorithm (right). The panels mark, for a column j, the data items used for the computation of $l_{ij}$ and the data items updated in the computation.
For each column j, first the new diagonal element $l_{jj}$ is computed using the elements in row j; then, the new elements of column j are computed using row j of A and all columns i of L with $i < j$, see Fig. 7.22 (left).

For dense matrices A, the Cholesky factorization requires $O(n^2)$ storage space and about $n^3/6$ arithmetic operations [166]. For sparse matrices, drastic reductions in storage and execution time can be achieved by exploiting the sparsity of A, i.e., by storing and computing only the non-zero entries of A.
The Cholesky factorization usually causes fill-in for sparse matrices A, which means that the matrix L has non-zeros in positions which are zero in A. The number of fill-in elements can be reduced by reordering the rows and columns of A, resulting in a matrix $P A P^T$ with a corresponding permutation matrix P. For Cholesky factorization, P can be chosen without regard to numerical stability, because no pivoting is required [65]. Since $P A P^T$ is also symmetric and positive definite for any permutation matrix P, the factorization of A can be done with the following steps:

1. Reordering: Find a permutation matrix $P \in \mathbb{R}^{n \times n}$ that minimizes the storage requirement and computing time by reducing fill-in. The reordered linear equation system is $(P A P^T)(P x) = P b$.
2. Storage allocation: Determine the structure of the matrix L and set up the sparse storage scheme. This is done before the actual computation of L and is called symbolic factorization, see [65].
3. Numerical factorization: Perform the factorization $P A P^T = L L^T$.
4. Triangular solution: Solve $L y = P b$ and $L^T z = y$. Then, the solution of the original system is $x = P^T z$.
The problem of finding an ordering that minimizes the amount of fill-in is NP-complete [177], but there exist suitable heuristics for reordering. The most popular sequential fill-in reduction heuristic is the minimum degree algorithm [65]. Symbolic factorization by a graph-theoretic approach is described in detail in [65]. In the following, we concentrate on the numerical factorization, which is considered to require by far the most computation time, and assume that the coefficient matrix is already in reordered form.
7.5.1.1 Left-Looking Algorithms
According to [124], we denote the sparsity structure of column j and row i of L (excluding the diagonal entries) by

$$\text{Struct}(L_{*j}) = \{k > j \mid l_{kj} \neq 0\},$$
$$\text{Struct}(L_{i*}) = \{k < i \mid l_{ik} \neq 0\}.$$

$\text{Struct}(L_{*j})$ contains the row indices of all non-zeros of column j, and $\text{Struct}(L_{i*})$ contains the column indices of all non-zeros of row i. Using these sparsity structures, a slight modification of computation scheme (I) results. The modification uses the following procedures for manipulating columns [124, 152]:
(II)
cmod(j, k) =
    for each i ∈ Struct(L_{*k}) with i ≥ j:
        a_{ij} = a_{ij} - l_{jk} * l_{ik};

cdiv(j) =
    l_{jj} = sqrt(a_{jj});
    for each i ∈ Struct(L_{*j}):
        l_{ij} = a_{ij} / l_{jj};
Procedure cmod(j, k) modifies column j by subtracting a multiple with factor $l_{jk}$ of column k from column j, for a column k that has already been computed. Only the non-zero elements of column k are considered in the computation. The entries $a_{ij}$ of the original matrix A are now used to store the intermediate results of the computation of L. Procedure cdiv(j) computes the square root of the diagonal element and divides all entries of column j by this square root of its diagonal entry $l_{jj}$. Using these two procedures, column j can be computed by applying cmod(j, k) for each $k \in \text{Struct}(L_{j*})$ and then applying cdiv(j); applying cmod(j, k) to columns $k \notin \text{Struct}(L_{j*})$ has no effect because $l_{jk} = 0$. The columns of L are computed from left to right, and the computation of a column $\tilde{l}_j$ needs all columns $\tilde{l}_k$ to the left of column $\tilde{l}_j$. This results in the following left-looking algorithm:
(III)
left_cholesky =
    for j = 0, ..., n-1 {
        for each k ∈ Struct(L_{j*}): cmod(j, k);
        cdiv(j);
    }
The code in scheme (III) computes the columns one after another from left to right. The entries of column j are modified after all columns to the left of j have been completely computed, i.e., the same target column j is used for a number of consecutive cmod(j, k) operations; this is illustrated in Fig. 7.22 (middle).
7.5.1.2 Right-Looking Algorithm
An alternative way is to use the entries of column j, after the complete computation of column j, to modify all columns k to the right of j that depend on column j, i.e., to modify all columns $k \in \text{Struct}(L_{*j})$ by subtracting $l_{kj}$ times column j from column k. Because $l_{kj} = 0$ for $k \notin \text{Struct}(L_{*j})$, only the columns $k \in \text{Struct}(L_{*j})$ are manipulated by column j. Still the columns are computed from left to right. The difference to the left-looking algorithm is that the calls to cmod() for a column j are done earlier. The final computation of a column j then consists only of a call to cdiv(j) after all columns to the left have been computed. This results in the following right-looking algorithm:
(IV)
right_cholesky =
    for j = 0, ..., n-1 {
        cdiv(j);
        for each k ∈ Struct(L_{*j}): cmod(k, j);
    }
The code fragment shows that in the right-looking algorithm, successive cmod() operations manipulate different target columns with the same source column j. An illustration is given in Fig. 7.22 (right).

In both the left-looking and right-looking algorithms, each non-zero $l_{ij}$ leads to the execution of one cmod() operation. In the left-looking algorithm, the cmod(j, k) operation is used to compute column j. In the right-looking algorithm, the cmod(k, j) operation is used to manipulate column $k \in \text{Struct}(L_{*j})$ after the computation of column j. Thus, the left-looking and right-looking algorithms use the same number of cmod() operations. They also use the same number of cdiv() operations, since there is exactly one cdiv() operation for each column.
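The following C sketch contrasts the two loop organizations. For simplicity it works on a dense lower triangle (row-major array a of size n*n), so Struct(L_{*k}) and Struct(L_{j*}) degenerate to all indices below and to the left of the diagonal; a real sparse implementation would iterate only over the stored non-zeros.

#include <math.h>

static void cmod(int n, double *a, int j, int k)  /* column j -= l_jk * column k (rows i >= j) */
{
    for (int i = j; i < n; i++)
        a[i * n + j] -= a[j * n + k] * a[i * n + k];
}

static void cdiv(int n, double *a, int j)         /* finalize column j */
{
    a[j * n + j] = sqrt(a[j * n + j]);
    for (int i = j + 1; i < n; i++)
        a[i * n + j] /= a[j * n + j];
}

void left_cholesky(int n, double *a)              /* scheme (III): gather updates into column j */
{
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < j; k++)
            cmod(n, a, j, k);
        cdiv(n, a, j);
    }
}

void right_cholesky(int n, double *a)             /* scheme (IV): scatter column j to the right */
{
    for (int j = 0; j < n; j++) {
        cdiv(n, a, j);
        for (int k = j + 1; k < n; k++)
            cmod(n, a, k, j);
    }
}

Both variants execute the same set of cmod() and cdiv() calls; they differ only in whether a target column gathers its updates (left-looking) or a finished column scatters its updates (right-looking).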
7.5.1.3 Supernodes
The supernodal algorithm is a computation scheme for sparse Cholesky factorization that exploits similar patterns of non-zero elements in adjacent columns, see [124, 152]. A supernode is a set

$$I(p) = \{p, p+1, \ldots, p+q-1\}$$

of contiguous columns in L for which, for all i with $p \leq i \leq p+q-1$,

$$\text{Struct}(L_{*i}) = \text{Struct}(L_{*(p+q-1)}) \cup \{i+1, \ldots, p+q-1\}.$$

Thus, a supernode has a dense triangular block above (and including) row $p+q-1$, i.e., all entries are non-zero elements, and an identical sparsity structure for each column below row $p+q-1$, i.e., each column has its non-zero elements in the same rows as the other columns of the supernode. Figure 7.23 shows an example. Because of this identical sparsity structure of the columns, a supernode has the property that each member column modifies the same set of target columns outside its supernode [152]. Thus, the factorization can be expressed in terms of supernodes modifying columns, rather than columns modifying columns.
Fig. 7.23 Matrix L with supernodes I(0) = {0}, I(1) = {1}, I(2) = {2, 3, 4}, I(5) = {5}, I(6) = {6, 7}, I(8) = {8, 9}. The elimination tree is shown at the right.
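Supernodes can be identified greedily from the column structures: column j extends the supernode of column j−1 precisely when Struct(L_{*(j-1)}) = {j} ∪ Struct(L_{*j}), which follows directly from the definition above. The following C sketch assumes, purely for illustration, that the row indices of Struct(L_{*j}) are available as sorted arrays rowidx[j] of length cnt[j].

/* Determine the first column of every supernode; returns the number of supernodes.
   rowidx[j] holds the sorted row indices of Struct(L_*j), cnt[j] their number
   (both are assumed inputs for this sketch). first[] must have room for n entries. */
int find_supernodes(int n, const int *const rowidx[], const int cnt[], int first[])
{
    int num = 0;
    for (int j = 0; j < n; j++) {
        int extends = 0;
        if (j > 0 && cnt[j - 1] == cnt[j] + 1 && rowidx[j - 1][0] == j) {
            extends = 1;               /* candidate: Struct(L_*(j-1)) = {j} U Struct(L_*j)? */
            for (int i = 0; i < cnt[j]; i++)
                if (rowidx[j - 1][i + 1] != rowidx[j][i]) { extends = 0; break; }
        }
        if (!extends)
            first[num++] = j;          /* column j starts a new supernode */
    }
    return num;
}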
Using the definitions first(J) = p and last(J) = p + q − 1 for a supernode J = I(p), the following procedure smod(j, J) is defined:

(V)
smod(j, J) =
    for each k ∈ J with k < j:
        cmod(j, k);
which modifies column j with all columns from supernode J. There are two cases for modifying a column with a supernode: when column j belongs to supernode J, then column j is modified only by those columns of J that lie to the left of j within J; when column j does not belong to supernode J, then column j is modified by all columns of J. Using the procedure smod(), the Cholesky factorization can be performed by the following computation scheme, also called the right-looking supernodal algorithm:
(VI)
supernode_cholesky =
    for each supernode J from left to right {
        cdiv(first(J));
        for j = first(J)+1, ..., last(J) {
            smod(j, J); cdiv(j);
        }
        for each k ∈ Struct(L_{*last(J)}) {
            smod(k, J);
        }
    }
This computation scheme still computes the columns of L from left to right. The difference to the algorithms presented before is that the computations associated with a supernode are combined. On the supernode level, a right-looking scheme is used: for the computation of the first column of a supernode J, only one cdiv() operation is necessary because the modification with all columns to the left has already been done. The columns of J are computed in a left-looking way: after the computation of all supernodes to the left of supernode J, and because the columns of J have already been modified with these supernodes due to the supernodal right-looking scheme, column j is computed by first modifying it with all columns of J to the left of j and then performing a cdiv() operation. After the computation of all columns of J, all columns k to the right of J that depend on columns of J are modified with each column in J, i.e., by the procedure smod(k, J).

An alternative way would be a right-looking computation of the columns of J. An advantage of the supernodal algorithm lies in an increased locality of memory accesses, because each column of a supernode J is used for the modification of several columns to the right of J and because all columns of J are used for the modification of the same columns to the right of J.
7.5.2 Storage Scheme for Sparse Matrices
Since most entries in a sparse matrix are zero, specific storage schemes are used to avoid the storage of zero elements. These compressed storage schemes store the non-zero entries together with additional information about the row and column indices that identifies their original positions in the full matrix. Thus, a compressed storage scheme for sparse matrices needs space for the non-zero elements as well as space for this additional information.

A sparse lower triangular matrix L is stored in a compressed storage scheme of size O(n + nz), where n is the number of rows (or columns) of L and nz is the number of non-zeros. We present the storage scheme of the SPLASH implementation which, according to [116], stores a sparse matrix in a compressed manner similar to [64]. This storage scheme exploits the sparsity structure as well as the supernode structure to store the data. We first describe a simpler version using only the sparsity structure without supernodes. Exploiting the supernode structure is then based on this storage scheme.
The storage scheme uses two arrays Nonzero and Row of length nz and three arrays StartColumn, StartRow, and Supernode of length n. The array Nonzero contains the values of all non-zeros of a triangular matrix $L = (l_{kj})_{k \geq j}$ in column-major order, i.e., the non-zeros are ordered columnwise from left to right in a linear array. Information about the corresponding column indices of non-zero elements is implicitly contained in the array StartColumn: position j of array StartColumn stores the index of array Nonzero at which the first non-zero element of column j is stored, i.e., Nonzero[StartColumn[j]] contains $l_{jj}$. Because the non-zero elements are stored columnwise, StartColumn[j+1] − 1 is the index of the last non-zero element of column j. Thus, the non-zeros of the j-th column of L are assigned to the contiguous part of array Nonzero with indices from StartColumn[j] to StartColumn[j+1] − 1. The size of the contiguous part of non-zeros of column j in array Nonzero is N_j := StartColumn[j+1] − StartColumn[j]. The array Row contains the row indices of the corresponding elements in Nonzero. In the simpler version without supernodes, Row[r] contains the row index of the non-zero stored in Nonzero[r], r = 0, ..., nz − 1. Corresponding to the blockwise storage scheme in Nonzero, the indices of the non-zeros of one column are stored in a contiguous block of Row.
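A minimal C rendering of this simple scheme (without supernodes) could look as follows; the struct name and the sentinel entry StartColumn[n] = nz, used here to delimit the last column, are assumptions of this sketch.

#include <stdio.h>

typedef struct {
    int     n, nz;
    double *Nonzero;      /* length nz: non-zero values, stored columnwise        */
    int    *Row;          /* length nz: row index of each stored value            */
    int    *StartColumn;  /* length n+1: first position of each column in
                             Nonzero/Row, with StartColumn[n] = nz as sentinel    */
} SparseL;

/* Print all stored entries l_ij of column j. */
void print_column(const SparseL *L, int j)
{
    for (int r = L->StartColumn[j]; r < L->StartColumn[j + 1]; r++)
        printf("l[%d][%d] = %g\n", L->Row[r], j, L->Nonzero[r]);
}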
When the similar sparsity structure of columns in the same supernode is additionally exploited, row indices of non-zeros are stored in a combination of the arrays Row and StartRow in the following way: StartRow[j] stores the index of Row at which the row index of the first non-zero of column j is stored, i.e., Row[StartRow[j]] = j because $l_{jj}$ is the first non-zero. For each column, the row indices are still stored in a contiguous block of Row. In contrast to the simpler scheme, the blocks for different columns of the same supernode are not disjoint but overlap according to the similar sparsity structure of those columns.

The additional array StartRow can be used for a more compact storage scheme for the supernodal algorithm. When j is the first column of a supernode $I(j) = \{j, j+1, \ldots, j+k-1\}$, then column j + l for $1 \leq l < k$ has the same non-zero pattern as column j for rows greater than or equal to j + l, i.e., Row[StartRow[j] + l] contains the row index of the first element of column j + l. Since this is the diagonal element, Row[StartRow[j] + l] = j + l holds. The next entries are the row indices of the other non-zero elements of column j + l. Thus, the row indices of column j + l are stored in Row[StartRow[j] + l], ..., Row[StartRow[j] + StartColumn[j+1] − StartColumn[j] − 1].
Fig. 7.24 Compressed storage scheme for a sparse lower triangular matrix L. The array Nonzero contains the non-zero elements of matrix L, and the array StartColumn contains the positions of the first elements of the columns in Nonzero. The array Row contains the row indices of the elements in Nonzero; the position of the first row index of a column is given in StartRow. For a supernodal algorithm, Row can additionally use an overlapping storage (not shown here).
This leads to StartRow[j + l] = StartRow[j] + l, and thus only the row indices of the first column of a supernode have to be stored to obtain the full information. A fast access to the sets $\text{Struct}(L_{*j})$ is given by

$$\text{Struct}(L_{*j}) = \{\,\text{Row}[\text{StartRow}[j] + i] \mid 1 \leq i \leq \text{StartColumn}[j+1] - \text{StartColumn}[j] - 1\,\}.$$
The storage scheme is illustrated in Fig. 7.24. The array Supernode is used for the management of supernodes: if a column j is the first column of a supernode J, then the number of columns of J is stored in Supernode[j].
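The fast access formula can be expressed directly in C; the helper names below are illustrative, and StartRow[j] for a non-first column j of a supernode is assumed to have been derived as StartRow[first] + (j − first) rather than stored explicitly.

/* Number of elements of Struct(L_*j), i.e., non-zeros of column j below the diagonal. */
static int struct_size(const int StartColumn[], int j)
{
    return StartColumn[j + 1] - StartColumn[j] - 1;
}

/* i-th element of Struct(L_*j) for 1 <= i <= struct_size(StartColumn, j). */
static int struct_elem(const int Row[], const int StartRow[], int j, int i)
{
    return Row[StartRow[j] + i];
}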
7.5.3 Implementation for Shared Variables
For a parallel implementation of sparse Cholesky factorization, we consider a shared memory machine. There are several sources of parallelism for sparse Cholesky factorization, including fine-grained parallelism within the single operations cmod(j, k) or cdiv(j) as well as column-oriented parallelism in the left-looking, right-looking, and supernodal algorithms.

The sparsity structure of L may lead to an additional source of parallelism which is not available for dense factorization. Data dependences may be avoided when different columns (and the columns having an effect on them) have a disjoint sparsity structure. This kind of parallelism can be described by elimination trees that