processor performs the arithmetic operations locally and the result vector $x_{k+1}$ is obtained in a blockwise distribution.
(4) The axpy operation $g_{k+1} = g_k + \alpha_k w_k$ is computed analogously to computation step (3), and the result vector $g_{k+1}$ is distributed in a blockwise way.
(5) The scalar product $\gamma_{k+1} = g_{k+1}^T g_{k+1}$ is computed analogously to computation step (2). The resulting scalar value $\beta_k$ is computed by the root processor of a single-accumulation operation and then broadcast to all other processors.
(6) The axpy operation $d_{k+1} = -g_{k+1} + \beta_k d_k$ is computed analogously to computation step (3). The result vector $d_{k+1}$ has a blockwise distribution.
7.4.2.3 Parallel Execution Time
The parallel execution time of one iteration step of the CG method is the sum of the parallel execution times of the basic operations involved. We derive the parallel execution time for p processors; n is the system size. It is assumed that n is a multiple of p. The parallel execution time of one axpy operation is given by

$$T_{\text{axpy}} = 2 \cdot \frac{n}{p} \cdot t_{op},$$

since each processor computes n/p components and the computation of each component needs one multiplication and one addition. As in earlier sections, the time for one arithmetic operation is denoted by $t_{op}$. The parallel execution time of a scalar product is
$$T_{\text{scal\_prod}} = \left(2 \cdot \frac{n}{p} - 1\right) \cdot t_{op} + T_{acc}(+)(p, 1) + T_{sb}(p, 1),$$
where $T_{acc}(op)(p, m)$ denotes the communication time of a single-accumulation operation with reduction operation op on p processors and message size m. The computation of the local scalar products with n/p components requires n/p multiplications and n/p − 1 additions. The distribution of the result of the parallel scalar product, which is a scalar value, i.e., has size 1, needs the time of a single-broadcast operation $T_{sb}(p, 1)$. The matrix–vector multiplication needs time
$$T_{\text{mat\_vec\_mult}} = 2 \cdot \frac{n^2}{p} \cdot t_{op},$$
since each processor computes n/p scalar products. The total parallel execution time of one CG iteration is

$$T_{CG} = T_{mb}\!\left(p, \frac{n}{p}\right) + T_{\text{mat\_vec\_mult}} + 2 \cdot T_{\text{scal\_prod}} + 3 \cdot T_{\text{axpy}},$$

where $T_{mb}(p, m)$ is the time of a multi-broadcast operation with p processors and message size m. This operation is needed for the re-distribution of the direction vector $d_k$ from iteration step k.
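This formula can be turned into a small cost model. The following C sketch evaluates $T_{CG}$ for given n and p. The linear models used for the communication times $T_{acc}$, $T_{sb}$, and $T_{mb}$ (a startup time plus a per-word cost per communication stage) and the constants t_op, t_s, t_w are purely illustrative assumptions, not values prescribed by the derivation.

#include <math.h>
#include <stdio.h>

/* Illustrative machine constants (assumptions). */
static const double t_op = 1.0e-9;               /* time of one arithmetic operation   */
static const double t_s  = 5.0e-6, t_w = 4.0e-9; /* assumed startup and per-word costs */

/* Simple tree/ring models for the collective operations (assumptions). */
static double T_acc(int p, int m) { return ceil(log2(p)) * (t_s + m * t_w); }
static double T_sb (int p, int m) { return ceil(log2(p)) * (t_s + m * t_w); }
static double T_mb (int p, int m) { return (p - 1) * (t_s + m * t_w); }

static double T_axpy    (int n, int p) { return 2.0 * n / p * t_op; }
static double T_scalprod(int n, int p) { return (2.0 * n / p - 1.0) * t_op + T_acc(p, 1) + T_sb(p, 1); }
static double T_matvec  (int n, int p) { return 2.0 * (double)n * n / p * t_op; }

/* Parallel execution time of one CG iteration according to the formula above. */
static double T_cg_iteration(int n, int p) {
    return T_mb(p, n / p) + T_matvec(n, p) + 2.0 * T_scalprod(n, p) + 3.0 * T_axpy(n, p);
}

int main(void) {
    int n = 16384;
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d: T_CG = %g s\n", p, T_cg_iteration(n, p));
    return 0;
}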
7.5 Cholesky Factorization for Sparse Matrices
Linear equation systems arising in practice are often large but have sparse coefficient matrices, i.e., they have many zero entries. For sparse matrices with regular structure, like banded matrices, only the diagonals with non-zero elements are stored and the solution methods introduced in the previous sections can be used. For an unstructured pattern of non-zero elements in sparse matrices, however, a more general storage scheme is needed and other parallel solution methods are applied.

In this section, we consider the Cholesky factorization as an example of such a solution method. The general sequential factorization algorithm and its variants for sparse matrices are introduced in Sect. 7.5.1. A specific storage scheme for sparse unstructured matrices is given in Sect. 7.5.2. In Sect. 7.5.3, we discuss parallel implementations of sparse Cholesky factorization for shared memory machines.
7.5.1 Sequential Algorithm
The Cholesky factorization is a direct solution method for a linear equation system $Ax = b$ whose coefficient matrix $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite, i.e., $a_{ij} = a_{ji}$ and $x^T A x > 0$ for all $x \in \mathbb{R}^n$ with $x \neq 0$. For such a matrix there exists a unique triangular factorization

$$A = L L^T, \qquad (7.59)$$

where $L = (l_{ij})_{i,j=1,\ldots,n}$ is a lower triangular matrix, i.e., $l_{ij} = 0$ for $i < j$ and $i, j \in \{1, \ldots, n\}$, with positive diagonal elements, i.e., $l_{ii} > 0$ for $i = 1, \ldots, n$; $L^T$ denotes the transposed matrix of L, i.e., $L^T = (l^T_{ij})_{i,j=1,\ldots,n}$ with $l^T_{ij} = l_{ji}$ [166].

Using the factorization in Eq. (7.59), the solution x of a system of equations $Ax = b$ with $b \in \mathbb{R}^n$ is determined in two steps by solving the triangular systems $Ly = b$ and $L^T x = y$ one after another. Because of $Ly = L L^T x = Ax = b$, the vector $x \in \mathbb{R}^n$ is the solution of the given linear equation system.
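To make the two-step solution concrete, the following C sketch performs the forward and backward substitutions for a dense lower triangular factor L stored row-major; the function name and the array layout are illustrative choices, not part of the text.

#include <stddef.h>

/* Solve A x = b given the Cholesky factor L (A = L L^T):
   first L y = b by forward substitution, then L^T x = y by back substitution. */
void cholesky_solve(size_t n, const double *L, const double *b, double *y, double *x)
{
    for (size_t i = 0; i < n; i++) {          /* forward substitution */
        double s = b[i];
        for (size_t k = 0; k < i; k++)
            s -= L[i * n + k] * y[k];
        y[i] = s / L[i * n + i];
    }
    for (size_t i = n; i-- > 0; ) {           /* back substitution, L^T[i][k] = L[k][i] */
        double s = y[i];
        for (size_t k = i + 1; k < n; k++)
            s -= L[k * n + i] * x[k];
        x[i] = s / L[i * n + i];
    }
}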
The implementation of the Cholesky factorization can be derived from a column-wise formulation of $A = L L^T$. Comparing the elements of A and $L L^T$, we obtain
$$a_{ij} = \sum_{k=1}^{n} l_{ik} l^T_{kj} = \sum_{k=1}^{n} l_{ik} l_{jk} = \sum_{k=1}^{j} l_{ik} l_{jk} = \sum_{k=1}^{j} l_{jk} l_{ik},$$
since $l_{jk} = 0$ for $k > j$ and by exchanging the two factors in the last summation. Denoting the columns of A as $\tilde{a}_1, \ldots, \tilde{a}_n$ and the columns of L as $\tilde{l}_1, \ldots, \tilde{l}_n$ results in an equality for column $\tilde{a}_j = (a_{1j}, \ldots, a_{nj})$ and columns $\tilde{l}_k = (l_{1k}, \ldots, l_{nk})$ for $k \leq j$:
$$\tilde{a}_j = \sum_{k=1}^{j} l_{jk} \tilde{l}_k,$$
leading to

$$l_{jj} \tilde{l}_j = \tilde{a}_j - \sum_{k=1}^{j-1} l_{jk} \tilde{l}_k \qquad (7.60)$$

for $j = 1, \ldots, n$. If the columns $\tilde{l}_k$, $k = 1, \ldots, j-1$, are already known, the right-hand side of Formula (7.60) is computable and the column $\tilde{l}_j$ can also be computed. Thus, the columns of L are computed one after another. The computation of column $\tilde{l}_j$ has two cases:
For the diagonal element, the computation is

$$l_{jj} \, l_{jj} = a_{jj} - \sum_{k=1}^{j-1} l_{jk} l_{jk} \quad \text{or} \quad l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}\,.$$
For the elements $l_{ij}$, $i > j$, the computation is

$$l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{jk} l_{ik} \right).$$
The elements in the upper triangular part of matrix L are $l_{ij} = 0$ for $i < j$.
The Cholesky factorization yields the factorization $A = L L^T$ for a given matrix A [65] by computing $L = (l_{ij})_{i=0,\ldots,n-1,\, j=0,\ldots,i}$ from $A = (a_{ij})_{i,j=0,\ldots,n-1}$ column by column from left to right according to the following algorithm, in which the numbering starts with 0:
(I)
for (j = 0; j < n; j++) {
    l_{jj} = sqrt(a_{jj} - sum_{k=0}^{j-1} l_{jk}^2);
    for (i = j+1; i < n; i++)
        l_{ij} = (a_{ij} - sum_{k=0}^{j-1} l_{jk} * l_{ik}) / l_{jj};
}
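A direct C translation of scheme (I) for a dense matrix, with A stored row-major and its lower triangle overwritten by L, might look as follows; this dense sketch only illustrates the computation order and ignores the sparsity issues discussed below.

#include <math.h>

/* Cholesky factorization following scheme (I): a[i*n+j] initially holds a_ij;
   on return the lower triangle holds l_ij (the strict upper triangle is not used). */
void cholesky_dense(int n, double *a)
{
    for (int j = 0; j < n; j++) {
        double d = a[j * n + j];
        for (int k = 0; k < j; k++)
            d -= a[j * n + k] * a[j * n + k];
        a[j * n + j] = sqrt(d);                       /* l_jj */
        for (int i = j + 1; i < n; i++) {
            double s = a[i * n + j];
            for (int k = 0; k < j; k++)
                s -= a[j * n + k] * a[i * n + k];
            a[i * n + j] = s / a[j * n + j];          /* l_ij */
        }
    }
}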
Fig. 7.22 Computational structures and data dependences for the computation of L according to the basic algorithm (left), the left-looking algorithm (middle), and the right-looking algorithm (right). The panels mark, for a column j, the data items used for the computation of $l_{ij}$ and the data items updated in the computation.
For each column j, first the new diagonal element $l_{jj}$ is computed using the elements in row j; then, the new elements of column j are computed using row j of A and all columns i of L with $i < j$, see Fig. 7.22 (left).

For dense matrices A, the Cholesky factorization requires $O(n^2)$ storage space and about $n^3/6$ arithmetic operations [166]. For sparse matrices, drastic reductions in storage and execution time can be achieved by exploiting the sparsity of A, i.e., by storing and computing only the non-zero entries of A.
The Cholesky factorization usually causes fill-in for sparse matrices A, which means that the matrix L has non-zeros in positions which are zero in A. The number of fill-in elements can be reduced by reordering the rows and columns of A, resulting in a matrix $P A P^T$ with a corresponding permutation matrix P. For Cholesky factorization, P can be chosen without regard to numerical stability, because no pivoting is required [65]. Since $P A P^T$ is also symmetric and positive definite for any permutation matrix P, the factorization of A can be done with the following steps:

1. Reordering: Find a permutation matrix $P \in \mathbb{R}^{n \times n}$ that minimizes the storage requirement and computing time by reducing fill-in. The reordered linear equation system is $(P A P^T)(P x) = P b$.
2. Storage allocation: Determine the structure of the matrix L and set up the sparse storage scheme. This is done before the actual computation of L and is called symbolic factorization, see [65].
3. Numerical factorization: Perform the factorization $P A P^T = L L^T$.
4. Triangular solution: Solve $L y = P b$ and $L^T z = y$. Then, the solution of the original system is $x = P^T z$.
The problem of finding an ordering that minimizes the amount of fill-in is NP-complete [177], but there exist suitable heuristics for reordering. The most popular sequential fill-in reduction heuristic is the minimum degree algorithm [65]. Symbolic factorization by a graph-theoretic approach is described in detail in [65]. In the following, we concentrate on the numerical factorization, which is considered to require by far the most computation time, and assume that the coefficient matrix is already in reordered form.
7.5.1.1 Left-Looking Algorithms
According to [124], we denote the sparsity structure of column j and row i of L (excluding the diagonal entries) by

$$\text{Struct}(L_{*j}) = \{k > j \mid l_{kj} \neq 0\},$$
$$\text{Struct}(L_{i*}) = \{k < i \mid l_{ik} \neq 0\}.$$

$\text{Struct}(L_{*j})$ contains the row indices of all non-zeros of column j, and $\text{Struct}(L_{i*})$ contains the column indices of all non-zeros of row i. Using these sparsity structures, a slight modification of computation scheme (I) results. The modification uses the following procedures for manipulating columns [124, 152]:
(II)
cmod(j, k) =
    for each i ∈ Struct(L_{*k}) with i ≥ j:
        a_{ij} = a_{ij} - l_{jk} * l_{ik};

cdiv(j) =
    l_{jj} = sqrt(a_{jj});
    for each i ∈ Struct(L_{*j}):
        l_{ij} = a_{ij} / l_{jj};
Procedure cmod(j, k) modifies column j by subtracting a multiple with factor $l_{jk}$ of column k from column j, for a column k that has already been computed. Only the non-zero elements of column k are considered in the computation. The entries $a_{ij}$ of the original matrix A are now used to store the intermediate results of the computation of L. Procedure cdiv(j) computes the square root of the diagonal element and divides all entries of column j by this square root of its diagonal entry $l_{jj}$. Using these two procedures, column j can be computed by applying cmod(j, k) for each $k \in \text{Struct}(L_{j*})$ and then applying cdiv(j); applying cmod(j, k) to columns $k \notin \text{Struct}(L_{j*})$ has no effect because $l_{jk} = 0$. The columns of L are computed from left to right, and the computation of a column $\tilde{l}_j$ needs all columns $\tilde{l}_k$ to the left of column $\tilde{l}_j$. This results in the following left-looking algorithm:
(III)
left_cholesky =
    for j = 0, ..., n-1 {
        for each k ∈ Struct(L_{j*}): cmod(j, k);
        cdiv(j);
    }
The code in scheme (III) computes the columns one after another from left to right. The entries of column j are modified after all columns to the left of j have been completely computed, i.e., the same target column j is used for a number of consecutive cmod(j, k) operations; this is illustrated in Fig. 7.22 (middle).
7.5.1.2 Right-Looking Algorithm
An alternative way is to use the entries of column j, after the complete computation of column j, to modify all columns k to the right of j that depend on column j, i.e., to modify all columns $k \in \text{Struct}(L_{*j})$ by subtracting $l_{kj}$ times column j from column k. Because $l_{kj} = 0$ for $k \notin \text{Struct}(L_{*j})$, only the columns $k \in \text{Struct}(L_{*j})$ are manipulated by column j. Still the columns are computed from left to right. The difference to the left-looking algorithm is that the calls to cmod() for a column j are done earlier. The final computation of a column j then consists only of a call to cdiv(j) after all columns to the left have been computed. This results in the following right-looking algorithm:
(IV)
right_cholesky =
    for j = 0, ..., n-1 {
        cdiv(j);
        for each k ∈ Struct(L_{*j}): cmod(k, j);
    }
The code fragment shows that in the right-looking algorithm, successive cmod() operations manipulate different target columns with the same source column j. An illustration is given in Fig. 7.22 (right).

In both the left-looking and right-looking algorithms, each non-zero $l_{ij}$ leads to the execution of one cmod() operation. In the left-looking algorithm, the cmod(j, k) operation is used to compute column j. In the right-looking algorithm, the cmod(k, j) operation is used to manipulate column $k \in \text{Struct}(L_{*j})$ after the computation of column j. Thus, the left-looking and right-looking algorithms use the same number of cmod() operations. They also use the same number of cdiv() operations, since there is exactly one cdiv() operation for each column.
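The following C sketch contrasts the two loop organizations. For simplicity it works on a dense lower triangle (row-major array a of size n*n), so Struct(L_{*k}) and Struct(L_{j*}) degenerate to all indices below and to the left of the diagonal; a real sparse implementation would iterate only over the stored non-zeros.

#include <math.h>

static void cmod(int n, double *a, int j, int k)  /* column j -= l_jk * column k (rows i >= j) */
{
    for (int i = j; i < n; i++)
        a[i * n + j] -= a[j * n + k] * a[i * n + k];
}

static void cdiv(int n, double *a, int j)         /* finalize column j */
{
    a[j * n + j] = sqrt(a[j * n + j]);
    for (int i = j + 1; i < n; i++)
        a[i * n + j] /= a[j * n + j];
}

void left_cholesky(int n, double *a)              /* scheme (III): gather updates into column j */
{
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < j; k++)
            cmod(n, a, j, k);
        cdiv(n, a, j);
    }
}

void right_cholesky(int n, double *a)             /* scheme (IV): scatter column j to the right */
{
    for (int j = 0; j < n; j++) {
        cdiv(n, a, j);
        for (int k = j + 1; k < n; k++)
            cmod(n, a, k, j);
    }
}

Both variants execute the same set of cmod() and cdiv() calls; they differ only in whether a target column gathers its updates (left-looking) or a finished column scatters its updates (right-looking).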
7.5.1.3 Supernodes
The supernodal algorithm is a computation scheme for sparse Cholesky factorization that exploits similar patterns of non-zero elements in adjacent columns, see [124, 152]. A supernode is a set

$$I(p) = \{p, p+1, \ldots, p+q-1\}$$

of contiguous columns in L for which, for all i with $p \leq i \leq p+q-1$,

$$\text{Struct}(L_{*i}) = \text{Struct}(L_{*(p+q-1)}) \cup \{i+1, \ldots, p+q-1\}.$$

Thus, a supernode has a dense triangular block above (and including) row $p+q-1$, i.e., all entries are non-zero elements, and an identical sparsity structure for each column below row $p+q-1$, i.e., each column has its non-zero elements in the same rows as the other columns of the supernode. Figure 7.23 shows an example. Because of this identical sparsity structure of the columns, a supernode has the property that each member column modifies the same set of target columns outside its supernode [152]. Thus, the factorization can be expressed in terms of supernodes modifying columns, rather than columns modifying columns.
Fig. 7.23 Matrix L with supernodes I(0) = {0}, I(1) = {1}, I(2) = {2, 3, 4}, I(5) = {5}, I(6) = {6, 7}, I(8) = {8, 9}. The elimination tree is shown at the right.
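Supernodes can be identified greedily from the column structures: column j extends the supernode of column j−1 precisely when Struct(L_{*(j-1)}) = {j} ∪ Struct(L_{*j}), which follows directly from the definition above. The following C sketch assumes, purely for illustration, that the row indices of Struct(L_{*j}) are available as sorted arrays rowidx[j] of length cnt[j].

/* Determine the first column of every supernode; returns the number of supernodes.
   rowidx[j] holds the sorted row indices of Struct(L_*j), cnt[j] their number
   (both are assumed inputs for this sketch). first[] must have room for n entries. */
int find_supernodes(int n, const int *const rowidx[], const int cnt[], int first[])
{
    int num = 0;
    for (int j = 0; j < n; j++) {
        int extends = 0;
        if (j > 0 && cnt[j - 1] == cnt[j] + 1 && rowidx[j - 1][0] == j) {
            extends = 1;               /* candidate: Struct(L_*(j-1)) = {j} U Struct(L_*j)? */
            for (int i = 0; i < cnt[j]; i++)
                if (rowidx[j - 1][i + 1] != rowidx[j][i]) { extends = 0; break; }
        }
        if (!extends)
            first[num++] = j;          /* column j starts a new supernode */
    }
    return num;
}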
Using the definitions first(J) = p and last(J) = p + q − 1 for a supernode J = I(p), the following procedure smod(j, J) is defined:

(V)
smod(j, J) =
    for each k ∈ J with k < j:
        cmod(j, k);
which modifies column j with all columns from supernode J. There are two cases for modifying a column with a supernode: when column j belongs to supernode J, then column j is modified only by those columns of J that lie to the left of j within J; when column j does not belong to supernode J, then column j is modified by all columns of J. Using the procedure smod(), the Cholesky factorization can be performed by the following computation scheme, also called the right-looking supernodal algorithm:
(VI)
supernode_cholesky =
    for each supernode J from left to right {
        cdiv(first(J));
        for j = first(J)+1, ..., last(J) {
            smod(j, J); cdiv(j);
        }
        for each k ∈ Struct(L_{*last(J)}) {
            smod(k, J);
        }
    }
This computation scheme still computes the columns of L from left to right. The difference to the algorithms presented before is that the computations associated with a supernode are combined. On the supernode level, a right-looking scheme is used: for the computation of the first column of a supernode J, only one cdiv() operation is necessary because the modification with all columns to the left has already been done. The columns of J are computed in a left-looking way: after the computation of all supernodes to the left of supernode J, and because the columns of J have already been modified with these supernodes due to the supernodal right-looking scheme, column j is computed by first modifying it with all columns of J to the left of j and then performing a cdiv() operation. After the computation of all columns of J, all columns k to the right of J that depend on columns of J are modified with each column in J, i.e., by the procedure smod(k, J).

An alternative way would be a right-looking computation of the columns of J. An advantage of the supernodal algorithm lies in an increased locality of memory accesses, because each column of a supernode J is used for the modification of several columns to the right of J and because all columns of J are used for the modification of the same columns to the right of J.
7.5.2 Storage Scheme for Sparse Matrices
Since most entries in a sparse matrix are zero, specific storage schemes are used to avoid the storage of zero elements. These compressed storage schemes store the non-zero entries together with additional information about the row and column indices that identifies their original positions in the full matrix. Thus, a compressed storage scheme for sparse matrices needs space for the non-zero elements as well as space for this additional information.

A sparse lower triangular matrix L is stored in a compressed storage scheme of size O(n + nz), where n is the number of rows (or columns) of L and nz is the number of non-zeros. We present the storage scheme of the SPLASH implementation which, according to [116], stores a sparse matrix in a compressed manner similar to [64]. This storage scheme exploits the sparsity structure as well as the supernode structure to store the data. We first describe a simpler version using only the sparsity structure without supernodes. Exploiting the supernode structure is then based on this storage scheme.
The storage scheme uses two arrays Nonzero and Row of length nz and three arrays StartColumn, StartRow, and Supernode of length n. The array Nonzero contains the values of all non-zeros of a triangular matrix $L = (l_{kj})_{k \geq j}$ in column-major order, i.e., the non-zeros are ordered columnwise from left to right in a linear array. Information about the corresponding column indices of non-zero elements is implicitly contained in the array StartColumn: position j of array StartColumn stores the index of array Nonzero at which the first non-zero element of column j is stored, i.e., Nonzero[StartColumn[j]] contains $l_{jj}$. Because the non-zero elements are stored columnwise, StartColumn[j+1] − 1 is the index of the last non-zero element of column j. Thus, the non-zeros of the j-th column of L are assigned to the contiguous part of array Nonzero with indices from StartColumn[j] to StartColumn[j+1] − 1. The size of the contiguous part of non-zeros of column j in array Nonzero is N_j := StartColumn[j+1] − StartColumn[j]. The array Row contains the row indices of the corresponding elements in Nonzero. In the simpler version without supernodes, Row[r] contains the row index of the non-zero stored in Nonzero[r], r = 0, ..., nz − 1. Corresponding to the blockwise storage scheme in Nonzero, the indices of the non-zeros of one column are stored in a contiguous block of Row.
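A minimal C rendering of this simple scheme (without supernodes) could look as follows; the struct name and the sentinel entry StartColumn[n] = nz, used here to delimit the last column, are assumptions of this sketch.

#include <stdio.h>

typedef struct {
    int     n, nz;
    double *Nonzero;      /* length nz: non-zero values, stored columnwise        */
    int    *Row;          /* length nz: row index of each stored value            */
    int    *StartColumn;  /* length n+1: first position of each column in
                             Nonzero/Row, with StartColumn[n] = nz as sentinel    */
} SparseL;

/* Print all stored entries l_ij of column j. */
void print_column(const SparseL *L, int j)
{
    for (int r = L->StartColumn[j]; r < L->StartColumn[j + 1]; r++)
        printf("l[%d][%d] = %g\n", L->Row[r], j, L->Nonzero[r]);
}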
When the similar sparsity structure of columns in the same supernode is additionally exploited, row indices of non-zeros are stored in a combination of the arrays Row and StartRow in the following way: StartRow[j] stores the index of Row at which the row index of the first non-zero of column j is stored, i.e., Row[StartRow[j]] = j because $l_{jj}$ is the first non-zero. For each column, the row indices are still stored in a contiguous block of Row. In contrast to the simpler scheme, the blocks for different columns of the same supernode are not disjoint but overlap according to the similar sparsity structure of those columns.

The additional array StartRow can be used for a more compact storage scheme for the supernodal algorithm. When j is the first column of a supernode $I(j) = \{j, j+1, \ldots, j+k-1\}$, then column j + l for $1 \leq l < k$ has the same non-zero pattern as column j for rows greater than or equal to j + l, i.e., Row[StartRow[j] + l] contains the row index of the first element of column j + l. Since this is the diagonal element, Row[StartRow[j] + l] = j + l holds. The next entries are the row indices of the other non-zero elements of column j + l. Thus, the row indices of column j + l are stored in Row[StartRow[j] + l], ..., Row[StartRow[j] + StartColumn[j+1] − StartColumn[j] − 1].
Fig. 7.24 Compressed storage scheme for a sparse lower triangular matrix L. The array Nonzero contains the non-zero elements of matrix L, and the array StartColumn contains the positions of the first elements of the columns in Nonzero. The array Row contains the row indices of the elements in Nonzero; the position of the first row index of a column is given in StartRow. For a supernodal algorithm, Row can additionally use an overlapping storage (not shown here).
This leads to StartRow[j + l] = StartRow[j] + l, and thus only the row indices of the first column of a supernode have to be stored to obtain the full information. A fast access to the sets $\text{Struct}(L_{*j})$ is given by

$$\text{Struct}(L_{*j}) = \{\,\text{Row}[\text{StartRow}[j] + i] \mid 1 \leq i \leq \text{StartColumn}[j+1] - \text{StartColumn}[j] - 1\,\}.$$
The storage scheme is illustrated in Fig. 7.24. The array Supernode is used for the management of supernodes: if a column j is the first column of a supernode J, then the number of columns of J is stored in Supernode[j].
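The fast access formula can be expressed directly in C; the helper names below are illustrative, and StartRow[j] for a non-first column j of a supernode is assumed to have been derived as StartRow[first] + (j − first) rather than stored explicitly.

/* Number of elements of Struct(L_*j), i.e., non-zeros of column j below the diagonal. */
static int struct_size(const int StartColumn[], int j)
{
    return StartColumn[j + 1] - StartColumn[j] - 1;
}

/* i-th element of Struct(L_*j) for 1 <= i <= struct_size(StartColumn, j). */
static int struct_elem(const int Row[], const int StartRow[], int j, int i)
{
    return Row[StartRow[j] + i];
}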
7.5.3 Implementation for Shared Variables
For a parallel implementation of sparse Cholesky factorization, we consider a shared memory machine. There are several sources of parallelism for sparse Cholesky factorization, including fine-grained parallelism within the single operations cmod(j, k) or cdiv(j) as well as column-oriented parallelism in the left-looking, right-looking, and supernodal algorithms.

The sparsity structure of L may lead to an additional source of parallelism which is not available for dense factorization. Data dependences may be avoided when different columns (and the columns having an effect on them) have a disjoint sparsity structure. This kind of parallelism can be described by elimination trees that