The International Journal of High Performance Computing Applications,
Volume 22, No 2, Summer 2008, pp 194–205
DOI: 10.1177/1094342008090912
© 2008 SAGE Publications Los Angeles, London, New Delhi and Singapore
1
COLLEGE OF TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY, 144 XUAN THUY, CAU GIAY, HANOI, VIETNAM (CHAUNH@VNU.EDU.VN; NHCHAU@GMAIL.COM)
2
K&F COMPUTING RESEARCH CO., 1-21-6-407, KOJIMA-CHO, CHOFU, TOKYO, JAPAN 182-0026
3
COMPUTATIONAL ASTROPHYSICS LABORATORY, INSTITUTE OF PHYSICAL AND CHEMICAL RESEARCH, (RIKEN), HIROSAWA 2-1, WAKO-SHI, SAITAMA, JAPAN 351-0198
ACCELERATION OF FAST MULTIPOLE
METHOD USING SPECIAL-PURPOSE
COMPUTER GRAPE
Nguyen Hai Chau 1
Atsushi Kawai 2
Toshikazu Ebisuzaki 3
Abstract
We have implemented the fast multipole method (FMM) on a special-purpose computer GRAPE (GRAvity piPE). The FMM is one of the fastest approximate algorithms to calculate forces among particles. Its calculation cost scales as $O(N)$, while the naive algorithm scales as $O(N^2)$. Here, $N$ is the number of particles in the system. GRAPE is hardware dedicated to the calculation of Coulombic or gravitational forces among particles. GRAPE's calculation speed is 100–1000 times faster than that of conventional computers of the same price, though it cannot handle anything but force calculation. We can expect significant speedup by the combination of the fast algorithm and the fast hardware. However, a straightforward implementation of the algorithm actually runs on GRAPE at rather modest speed. This is because of the limited functionality of the hardware. Since GRAPE can handle particle forces only, just a small fraction of the overall calculation procedure can be put on it. The remaining part must be performed on a conventional computer connected to GRAPE. In order to take full advantage of the dedicated hardware, we modified the FMM using the pseudoparticle multipole method and Anderson's method. In the modified algorithm, multipole and local expansions are expressed by distribution of a small number of imaginary particles (pseudoparticles), and thus they can be evaluated by GRAPE. Results of numerical experiments on ordinary GRAPE systems show that, for large-N systems ($N \ge 10^5$), GRAPE accelerates the FMM by a factor ranging from 3 for low accuracy (RMS relative force error ~$10^{-2}$) to 60 for high accuracy (RMS relative force error ~$10^{-5}$). Performance of the FMM on GRAPE exceeds that of Barnes–Hut treecode on GRAPE at high accuracy, in case of close-to-uniform distribution of particles. However, in the same experimental environment the treecode outperforms the FMM for inhomogeneous distribution of particles.
Key words: molecular dynamics, numerical simulation, fast multipole method, tree algorithm, Anderson's method, pseudoparticle multipole method, special-purpose computer
1 Introduction
Molecular dynamics (MD) simulations are highly compute intensive. The most expensive part of MD is calculation of Coulombic forces among particles (i.e., atoms and ions). In a naive direct-summation algorithm, the cost of the force calculation scales as $O(N^2)$, where $N$ is the number of particles. This is because Coulombic force is a long-range interaction.
In order to reduce the cost of force calculation, fast algorithms such as the Barnes–Hut treecode (Barnes and Hut 1986) and the fast multipole method (FMM; Greengard and Rokhlin 1987) have been developed. In the treecode, particles are grouped and forces from them are approximated by multipole expansions of the group. Particles that are more distant are organized into larger groups, and thus the calculation cost scales as $O(N \log N)$. In the FMM, the force is also approximated by a multipole expansion. Then the multipole expansion is converted to a local expansion at each observation point. The force on each particle is obtained by evaluating the local expansion. The calculation cost of this scheme scales as $O(N)$. These fast algorithms are widely used in the field of MD simulation (Lakshminarasimhulu and Madura 2002; Lupo et al. 2002).
There exists another approach to accelerate the force calculation. It is to use hardware dedicated to the calculation of inter-particle forces. GRAPE (GRAvity PipE; Sugimoto et al. 1990; Makino and Taiji 1998) is one of the most widely used pieces of special-purpose hardware of this kind. Figure 1 shows the basic structure of a GRAPE system. It consists of a GRAPE processor board and a general-purpose computer (hereafter the host computer). The host computer sends positions and charges of particles to GRAPE. GRAPE then calculates the forces, and sends results back to the host computer.
Fig. 1. Basic structure of a GRAPE system.
Using hardwired pipelines, a typical GRAPE system performs the force calculation 100–1000 times faster than conventional computers of the same price. For small-N (say $N \lesssim 10^5$) particle systems, therefore, the combination of a simple direct-summation algorithm and GRAPE is the fastest calculation scheme. Fast algorithms are not very effective at such a small N.
For large-N particle systems, however, $O(N^2)$ direct-summation becomes expensive, even with GRAPE. If we successfully combine one of the fast algorithms and the fast hardware, significant speedup for large-N particle systems would be expected. As for the tree algorithm, Makino (1991) has successfully implemented a modified treecode (Barnes 1990) on GRAPE, and achieved a factor of 30–50 speedup.
For the FMM, on the other hand, no implementation on GRAPE so far exists. The FMM's implementation on dedicated hardware of a similar kind is reported, but its performance is rather modest (Amisaki et al. 2003). This is mainly because of the limited functionality of the hardware. Since dedicated hardware can calculate the particle force only, it cannot handle multipole and local expansions. Therefore, only a small fraction of the FMM's calculation can be performed on such hardware, and the speedup gain remains rather modest.
In order to take full advantage of GRAPE, we modified the FMM using the pseudoparticle multipole method (Makino 1999) and Anderson's (1992) method. Using these methods, we can express the multipole and local expansion by a distribution of a small number of imaginary particles (pseudoparticles). With the modification, we can use GRAPE to evaluate the expansions. Therefore, a significant fraction of the modified FMM can be handled on GRAPE.
In this paper we describe the implementation and performance of the modified FMM on GRAPE. The paper is organized as follows. Section 2 gives a summary of the FMM and related algorithms. In Section 3, a brief overview of the GRAPE system is given. In Section 4, we describe the implementation of our FMM code, which is modified so that it runs on GRAPE. Results of numerical tests of the code are shown in Section 6. Section 7 is devoted to discussion and Section 8 summarizes.
2 FMM and Related Algorithms
Here we give a brief description of the FMM (Section 2.1), and two related algorithms, namely, Anderson's method (Section 2.2) and the pseudoparticle multipole method (Section 2.3). As will be seen in Section 4, the latter two algorithms are used to implement the FMM on GRAPE.
2.1 FMM
The FMM is an approximate algorithm to calculate forces among particles. In case of close-to-uniform distribution of particles, the FMM's calculation cost scales as $O(N)$. This scaling is achieved by approximation of the forces using the multipole and local expansion technique.
Figure 2 shows a schematic idea of force approximation in the FMM. The force from a group of distant particles is approximated by a multipole expansion. At an observation point, the multipole expansion is converted to a local expansion. The local expansion is evaluated at each particle around the observation point. A hierarchical tree structure is used for grouping of the particles.
The algorithm is applicable to two-dimensional (Greengard and Rokhlin 1987) and three-dimensional (Greengard and Rokhlin 1997) particle systems. In the following, we review the calculation procedure of the algorithm for the three-dimensional case.
2.1.1 Tree construction. Assume we have an isolated particle system. Initially, we define a large enough cube (root cell) to cover all particles in the system. We construct an oct-tree structure by hierarchical subdivision of the cube into eight smaller cubes (child cells). The subdivision procedure starts from the root cell at refinement level $l = 0$. The subdivision is then repeated recursively for all sub cells, and stopped when $l$ reaches an optimal refinement level $l_{max}$. The optimal level $l_{max}$ is determined so that it optimizes the calculation speed.
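The subdivision described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes a uniform tree refined to a fixed level, and the names `leaf_cell_index`, `build_leaf_table`, and `parent` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def leaf_cell_index(positions, root_origin, root_side, lmax):
    """Integer (ix, iy, iz) coordinates of the leaf cell holding each particle."""
    n_side = 2 ** lmax                         # cells per dimension at level lmax
    cell_side = root_side / n_side
    idx = np.floor((positions - root_origin) / cell_side).astype(int)
    return np.clip(idx, 0, n_side - 1)         # particles on the upper faces go to the last cell

def build_leaf_table(positions, root_origin, root_side, lmax):
    """Map each occupied leaf cell to the list of particle indices inside it."""
    table = {}
    cells = leaf_cell_index(positions, root_origin, root_side, lmax)
    for i, cell in enumerate(cells):
        table.setdefault(tuple(int(c) for c in cell), []).append(i)
    return table

def parent(cell):
    """Coordinates of the parent cell, one refinement level coarser."""
    return tuple(c // 2 for c in cell)

# Three particles in a unit root cell, refined to level 2 (4 x 4 x 4 leaf cells).
positions = np.array([[0.1, 0.1, 0.1], [0.9, 0.5, 0.26], [0.12, 0.2, 0.05]])
leaves = build_leaf_table(positions, root_origin=np.zeros(3), root_side=1.0, lmax=2)
```

Because every cell at level $l$ is addressed by integer coordinates, the parent and child relations used in the later stages reduce to integer division and multiplication by two.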
2.1.2 M2M transition. Next, we form multipole expansions for each leaf cell by calculating contributions from all particles inside the cell.
Then we ascend the tree structure to form multipole expansions of all non-leaf cells in all coarser levels. The procedure starts from parents of the leaf cells. For each cell, the multipole expansions of its children are shifted to the geometric center of the cell (M2M transition) and summed. This procedure is continued until it reaches the root cell.
Fig. 2. Schematic idea of force approximation in FMM.
2.1.3 M2L conversion. Then we evaluate the multipole expansions. In order to describe this part, here we define the terminology “neighbor list” and “interaction list.” The neighbor list of a cell is a set of cells in the same level of refinement which have contact with the cell. The interaction list of a cell is a set of cells which are children of the neighbors of the cell's parent and which are not neighbors of the cell itself. Figure 3 shows the neighbor and interaction list of a cell for the two-dimensional case.
For each cell we evaluate the multipole expansion of all cells in its interaction list. We convert the multipole expansion to the local expansion at the geometric center of the cell in question (M2L conversion), and sum them.
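The two list definitions translate directly into code. The sketch below assumes cells are addressed by integer coordinates at each level and that "contact" means sharing a face, an edge, or a corner; the function names are hypothetical. It works for both the two- and three-dimensional cases.

```python
from itertools import product

def neighbors(cell, level):
    """Cells at the same level that touch `cell` (face, edge, or corner contact)."""
    n_side = 2 ** level
    result = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if all(o == 0 for o in offset):
            continue                            # skip the cell itself
        candidate = tuple(c + o for c, o in zip(cell, offset))
        if all(0 <= c < n_side for c in candidate):
            result.append(candidate)
    return result

def children(cell):
    """The 2^d child cells of `cell` at the next finer level."""
    return [tuple(2 * c + b for c, b in zip(cell, bits))
            for bits in product((0, 1), repeat=len(cell))]

def interaction_list(cell, level):
    """Children of the parent's neighbors that are not neighbors of `cell`."""
    parent = tuple(c // 2 for c in cell)
    near = set(neighbors(cell, level))
    return [child
            for parent_neighbor in neighbors(parent, level - 1)
            for child in children(parent_neighbor)
            if child not in near]

interior_3d = interaction_list((3, 3, 3), level=3)
interior_2d = interaction_list((3, 3), level=3)
```

For a cell away from the boundary, the list holds 189 cells in three dimensions and 27 cells in two dimensions; cells near the boundary of the root cell have shorter lists.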
2.1.4 L2L transition. In the next step, we descend the tree structure. We sum the local expansions at different refinement levels to obtain the total potential field at leaf cells. For each cell in level $l$ we shift the center of the local expansion of its parent at level $l - 1$ (L2L transition), and then add it to the local expansion of the cell. By this procedure, all cells in level $l$ will have the local expansion of the total potential field except for the contribution of the neighbor cells. By repeating this procedure for all levels, we obtain the potential field for all leaf cells.
2.1.5 Force evaluation. Finally, we calculate the force on each particle in all leaf cells by summing the contributions of far field and near field forces. The near field contribution is directly calculated by evaluating the particle–particle force. The far field contribution is calculated by evaluating the local expansion of the leaf cell at the position of the particle.
2.2 Anderson's Method
Anderson (1992) proposed a variant of the FMM using a new formulation of the multipole and local expansions. The advantage of his method is its simplicity. Anderson's method makes the implementation of the FMM significantly simpler. Here we briefly describe his method.
Anderson's method is based on Poisson's formula. This formula gives the solution of the boundary value problem of the Laplace equation. When the potential on the surface of a sphere of radius $a$ is given, the potential $\Phi$ at position $\vec{r} = (r, \phi, \theta)$ is expressed as
$$\Phi(\vec{r}) = \frac{1}{4\pi} \int_S \sum_{n=0}^{\infty} (2n+1) \left(\frac{a}{r}\right)^{n+1} P_n\!\left(\frac{\vec{s}\cdot\vec{r}}{r}\right) \Phi(a\vec{s})\, ds \tag{1}$$
for $r \ge a$, and
$$\Phi(\vec{r}) = \frac{1}{4\pi} \int_S \sum_{n=0}^{\infty} (2n+1) \left(\frac{r}{a}\right)^{n} P_n\!\left(\frac{\vec{s}\cdot\vec{r}}{r}\right) \Phi(a\vec{s})\, ds \tag{2}$$
for $r \le a$. Note that here we use a spherical coordinate system. Here, $\Phi(a\vec{s})$ is the given potential on the sphere surface. The area of the integration $S$ covers the surface of the unit sphere centered at the origin. The function $P_n$ denotes the $n$th Legendre polynomial.
In order to use these formulae as replacements of the multipole and local expansions, Anderson proposed a discrete version of them, i.e., he truncated the right-hand side of equations (1)–(2) at a finite $n$, and replaced the integrations over $S$ with numerical ones using a spherical design. Hardin and Sloane (1996) define the spherical t-design as follows.
A set of $K$ points $\mathcal{P} = \{P_1, \ldots, P_K\}$ on the unit sphere $\Omega_d = S^{d-1} = \{x = (x_1, \ldots, x_d) \in \mathbb{R}^d : x \cdot x = 1\}$ forms a spherical t-design if the identity
$$\int_{\Omega_d} f(x)\, d\mu(x) = \frac{1}{K} \sum_{i=1}^{K} f(P_i) \tag{3}$$
(where $\mu$ is the uniform measure on $\Omega_d$ normalized to have a total measure 1) holds for all polynomials $f$ of degree $\le t$ (Hardin and Sloane 1996).
Note that the optimal set, i.e., the smallest set of the spherical t-design, is not known so far for general $t$. In practice we use spherical t-designs as empirically found by Hardin and Sloane. Examples of such t-designs are available at http://www.research.att.com/~njas/sphdesigns/
Fig. 3. Neighbour and interaction list of the hatched cell.
Using the spherical t-design, Anderson obtained the discrete versions of (1) and (2) as follows:
$$\Phi(\vec{r}) \approx \sum_{i=1}^{K} \sum_{n=0}^{p} (2n+1) \left(\frac{a}{r}\right)^{n+1} P_n\!\left(\frac{\vec{s}_i\cdot\vec{r}}{r}\right) \Phi(a\vec{s}_i)\, w_i \tag{4}$$
for $r \ge a$ (outer expansion) and
$$\Phi(\vec{r}) \approx \sum_{i=1}^{K} \sum_{n=0}^{p} (2n+1) \left(\frac{r}{a}\right)^{n} P_n\!\left(\frac{\vec{s}_i\cdot\vec{r}}{r}\right) \Phi(a\vec{s}_i)\, w_i \tag{5}$$
for $r \le a$ (inner expansion). Here $w_i$ is a constant weight value and $p$ is the number of untruncated terms. Hereafter we refer to $p$ as the expansion order.
Anderson's method uses equations (4) and (5) for M2M and L2L transitions, respectively. The procedures of other stages are the same as those of the original FMM.
2.3 Pseudoparticle Multipole Method
Makino (1999) proposed the pseudoparticle multipole method (P²M²), yet another formulation of the multipole expansion. The advantage of his method is that the expansions can be evaluated using GRAPE.
The basic idea of P²M² is to use a small number of pseudoparticles to express the multipole expansions. In other words, this method approximates the potential field of physical particles by the field generated by a small number of pseudoparticles. This idea is very similar to that of Anderson's method. Both methods use discrete quantities to approximate the potential field of the original distribution of the particles. The difference is that P²M² uses the distribution of point charges, while Anderson's method uses potential values. In the case of P²M², the potential is expressed by point charges, and thus it can be evaluated using GRAPE.
In the following, we describe the formulation procedure of P²M².
The distribution of pseudoparticles is determined so that it correctly describes the coefficients of a multipole expansion. A naive approach to obtain the distribution is to directly invert the multipole expansion formula. For a relatively small expansion order, say $p \le 2$, we can solve the inversion formula, and obtain the optimal distribution with minimum number of pseudoparticles (Kawai and Makino 2001).
However, it is rather difficult to solve the inversion formula for higher $p$, since the formula is nonlinear. For $p > 2$, we adopted Makino's (1999) approach, which is more general. In his approach, pseudoparticles are fixed at the positions given by the spherical t-design (Hardin and Sloane 1996), and only their charges can change. This makes the formula linear, although the necessary number of pseudoparticles increases. This is because we can adjust only the charges of pseudoparticles, since we fixed their positions. The degree of freedom assigned to each pseudoparticle is then reduced from four to one.
Makino's approach systematically gives the solution of the inversion formula as follows:
$$Q_j = \sum_{i=1}^{N} q_i \sum_{l=0}^{p} \frac{2l+1}{K} \left(\frac{r_i}{a}\right)^{l} P_l(\cos\gamma_{ij}), \tag{6}$$
where $Q_j$ is the charge of the pseudoparticle, $\vec{r}_i = (r_i, \phi, \theta)$ is the position of the physical particle, and $\gamma_{ij}$ is the angle between $\vec{r}_i$ and the position vector $\vec{R}_j$ of the $j$th pseudoparticle. For the derivation procedure of equation (6), see Makino (1999).
Equation (6) gives the solution for outer expansion. We found that following a similar approach, we can obtain the solution for inner expansion:
$$Q_j = \sum_{i=1}^{N} q_i \sum_{l=0}^{p} \frac{2l+1}{K} \left(\frac{a}{r_i}\right)^{l+1} P_l(\cos\gamma_{ij}). \tag{7}$$
For the derivation procedure of equation (7), see Appendix A.
3 Function of GRAPE
The primary function of GRAPE is to calculate the force $\vec{f}(\vec{r}_i)$ exerted on particle $i$ at position $\vec{r}_i$, and potential $\phi(\vec{r}_i)$ associated with $\vec{f}(\vec{r}_i)$. Although there are several variants of GRAPE for different applications such as astrophysics and MD, the basic functions of these hardware devices are substantially the same.
The force $\vec{f}(\vec{r}_i)$ and the potential $\phi(\vec{r}_i)$ are expressed as
$$\vec{f}(\vec{r}_i) = \sum_{j=1}^{N} \frac{q_j (\vec{r}_i - \vec{r}_j)}{r_s^3} \tag{8}$$
and
$$\phi(\vec{r}_i) = \sum_{j=1}^{N} \frac{q_j}{r_s}, \tag{9}$$
where $N$ is the number of particles to handle, $\vec{r}_j$ and $q_j$ are the position and the charge of particle $j$, and $r_s$ is the softened distance between particles $i$ and $j$ defined as $r_s^2 \equiv |\vec{r}_i - \vec{r}_j|^2 + \varepsilon^2$, where $\varepsilon$ is the softening parameter.
In order to calculate force $\vec{f}(\vec{r}_i)$, relevant data, $\vec{r}_i$, $\vec{r}_j$, $q_j$, and $\varepsilon$, are sent from the host computer to GRAPE, and the calculated force is sent back to the host. The potential $\phi(\vec{r}_i)$ is calculated in the same manner.
4 Implementation of the FMM on GRAPE
The FMM consists of five stages (see Section 2.1), namely, the tree construction, M2M transition, M2L conversion, L2L transition, and the force evaluation. The force evaluation stage consists of near field and far field evaluation parts.
In the case of the original FMM, only the near field part of the force evaluation stage can be performed on GRAPE. At this stage, GRAPE directly evaluates force from each particle expressed in the form of equation (8). At all other stages, mathematical operations not in the form of equation (8) or equation (9) are required. GRAPE cannot handle these operations.
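To make concrete what "the form of equation (8) or equation (9)" means, the following is a software sketch of the softened pairwise sums that GRAPE evaluates in hardware. It is an illustration in NumPy, not the GRAPE programming interface; the function name and argument layout are assumptions made for this example.

```python
import numpy as np

def pairwise_force_and_potential(targets, sources, charges, eps=0.0):
    """Softened sums of equations (8) and (9) at each target position."""
    d = targets[:, None, :] - sources[None, :, :]        # r_i - r_j, shape (Nt, Ns, 3)
    rs2 = np.sum(d * d, axis=-1) + eps ** 2               # softened squared distance r_s^2
    inv_rs = 1.0 / np.sqrt(rs2)
    force = np.sum(charges[None, :, None] * d * inv_rs[..., None] ** 3, axis=1)  # eq. (8)
    potential = np.sum(charges[None, :] * inv_rs, axis=1)                         # eq. (9)
    return force, potential

# One source of charge 2 at the origin, one target at distance 5.
force, potential = pairwise_force_and_potential(
    targets=np.array([[3.0, 4.0, 0.0]]),
    sources=np.array([[0.0, 0.0, 0.0]]),
    charges=np.array([2.0]),
)
```

Any stage of the FMM that can be phrased as "evaluate the field of a set of point charges at a set of positions" fits this pattern, which is why the modified algorithm expresses its expansions through pseudoparticles.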
In our implementation (hereafter code A), we modified the original FMM so that GRAPE could handle the M2L conversion stage, which is the most time consuming. For this purpose, we used P²M² to express the multipole expansions and Anderson's method to express the local expansions. With this modification GRAPE can handle the M2L stage by evaluating potential values from the pseudoparticles. At the L2L stage, potential values are locally expanded and shifted using Anderson's method. Table 1 summarizes mathematical expressions and operations used at each calculation stage.
In the following, we describe the details of our implementation.
4.1 Tree Construction
The tree construction stage has no change. It is performed in the same way as in the original FMM.
4.2 M2M Transition
At the M2M transition stage, we compute positions and charges of pseudoparticles, instead of forming multipole expansions as in the original FMM.
The procedure starts from the leaf cells. Positions and charges of the pseudoparticles of the leaf cells are calculated from positions and charges of physical particles. Then, those of non-leaf cells are calculated from positions and charges of pseudoparticles of their child cells. This procedure is continued until it reaches the root cell. This process is performed completely on the host computer.
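As an illustration of how pseudoparticle charges can be obtained from physical particles, the sketch below applies the outer P²M² formula, equation (6), to the particles of one cell. Several choices here are assumptions for the example only: the 12 vertices of a regular icosahedron stand in for the Hardin and Sloane tables (they form a spherical 5-design, which suffices for $p = 2$), the sphere radius is set to $a = 1$, and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.legendre import legval

def icosahedron_design():
    """12 icosahedron vertices on the unit sphere (a spherical 5-design)."""
    g = (1.0 + np.sqrt(5.0)) / 2.0
    pts = []
    for s1 in (-1.0, 1.0):
        for s2 in (-1.0, 1.0):
            pts += [(0.0, s1, s2 * g), (s1, s2 * g, 0.0), (s2 * g, 0.0, s1)]
    pts = np.array(pts)
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def p2m2_charges(positions, charges, a, p, design):
    """Pseudoparticle charges Q_j of equation (6); pseudoparticles sit at a * design."""
    K = len(design)
    l = np.arange(p + 1)
    Q = np.zeros(K)
    for r_vec, q in zip(positions, charges):
        r = np.linalg.norm(r_vec)
        cos_gamma = design @ r_vec / r if r > 0 else np.zeros(K)
        coef = (2 * l + 1) * (r / a) ** l          # (2l+1) (r_i/a)^l for l = 0..p
        Q += q / K * legval(cos_gamma, coef)       # sum over l of coef_l * P_l(cos gamma_ij)
    return Q

rng = np.random.default_rng(0)
positions = rng.uniform(-0.5, 0.5, size=(100, 3))   # particles in a cell of side 1
charges = rng.uniform(0.5, 1.5, size=100)
design = icosahedron_design()
a, p = 1.0, 2
Q = p2m2_charges(positions, charges, a, p, design)

# Compare the far potential of the 12 pseudoparticles with that of the 100 particles.
x = np.array([12.0, -9.0, 8.0])
exact = np.sum(charges / np.linalg.norm(x - positions, axis=1))
approx = np.sum(Q / np.linalg.norm(x - a * design, axis=1))
rel_err = abs(approx - exact) / abs(exact)
```

With this design the pseudoparticles reproduce the total charge, dipole, and quadrupole moments of the original particles, so the remaining error at a distant point comes from octupole and higher terms.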
4.3 M2L Conversion
The M2L conversion stage is done on GRAPE. In contrast to the original FMM, we do not use the formula to convert the multipole expansion to a local expansion. We directly calculate potential values due to pseudoparticles in the interaction list of each cell.
4.4 L2L Transition
The L2L transition is done in the same manner as in Anderson's method. We use equation (5) to convert the local expansion of each cell to that of its children.
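The role of equation (5) can be illustrated with a short sketch: potential values sampled on a sphere of radius $a$ (the output of the M2L stage) are turned into a potential at any interior point. The example makes several assumptions that are not taken from the paper: the icosahedron vertices again stand in for a spherical t-design, the weights are $w_i = 1/K$, a single distant unit charge plays the part of the pseudoparticles in the interaction list, and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.legendre import legval

def icosahedron_design():
    """12 icosahedron vertices on the unit sphere (a spherical 5-design)."""
    g = (1.0 + np.sqrt(5.0)) / 2.0
    pts = []
    for s1 in (-1.0, 1.0):
        for s2 in (-1.0, 1.0):
            pts += [(0.0, s1, s2 * g), (s1, s2 * g, 0.0), (s2 * g, 0.0, s1)]
    pts = np.array(pts)
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def inner_expansion(r_vec, sample_phi, a, p, design):
    """Anderson's inner expansion, equation (5), with weights w_i = 1/K."""
    K = len(design)
    n = np.arange(p + 1)
    r = np.linalg.norm(r_vec)
    u = design @ r_vec / r if r > 0 else np.zeros(K)   # s_i . r / r
    coef = (2 * n + 1) * (r / a) ** n                   # (2n+1) (r/a)^n for n = 0..p
    return np.sum(legval(u, coef) * sample_phi) / K

design = icosahedron_design()
a, p = 0.4, 2
source = np.array([6.0, 2.0, 4.0])                      # distant unit charge

# M2L-like step: sample the potential of the distant charge on the sphere of radius a.
sample_phi = 1.0 / np.linalg.norm(a * design - source, axis=1)

# Evaluate the local field at a point inside the sphere and compare with the exact value.
r_vec = np.array([0.2, 0.1, -0.3])
approx = inner_expansion(r_vec, sample_phi, a, p, design)
exact = 1.0 / np.linalg.norm(r_vec - source)
rel_err = abs(approx - exact) / exact
```

An L2L shift follows the same pattern: the parent's expansion is evaluated at the sample points of each child sphere, and the resulting values are added to the child's own samples.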
4.5 Force Evaluation
The near field contribution is directly calculated by evaluating the particle–particle force. GRAPE can handle this part without any modification of the algorithm.
Using equation (5), the far field potential on a particle at position $\vec{r}$ can be calculated from the set of potential values of the leaf cell which contains the particle. Meanwhile the far field force is calculated using a derivative of equation (5):
$$\nabla\Phi(\vec{r}) \approx \sum_{i=1}^{K} \sum_{n=0}^{p} \frac{2n+1}{a^n}\, r^{n-1} \left[ n P_n(u)\, \frac{\vec{r}}{r} + P_n'(u) \left( \vec{s}_i - u\, \frac{\vec{r}}{r} \right) \right] g(a\vec{s}_i)\, w_i, \tag{10}$$
where $u = \vec{s}_i \cdot \vec{r}/r$, $P_n'(u)$ is the derivative of $P_n$ with respect to $u$, and $g(a\vec{s}_i)$ are the potential values on the sphere.
All the calculation at this stage is done on the host computer.
Table 1
Mathematical expressions and operations used in different implementations of the FMM. Entries marked [GRAPE] run on GRAPE.
Stage | Original (Greengard and Rokhlin 1997) | Code A (Section 4) | Code B (Section 5)
M2L | … | Evaluation of pseudoparticle potential [GRAPE] | Evaluation of pseudoparticle potential [GRAPE]
Near field force | Evaluation of physical-particle force [GRAPE] | Evaluation of physical-particle force [GRAPE] | Evaluation of physical-particle force [GRAPE]
Far field force | Evaluation of local expansion | Equation (10) | Evaluation of pseudoparticle force [GRAPE]
5 Further Improved Implementation
With the modification described in Section 4, we have successfully put the bottleneck, namely, the M2L conversion stage, on GRAPE. The overall calculation of the FMM is significantly accelerated.
However, we still have room for improvement. The M2L stage is put on GRAPE and is no longer a bottleneck. Now the most expensive part is the far field force evaluation. Equation (10) is complicated and its evaluation takes a rather large fraction of the overall calculation time (Chau, Kawai, and Ebisuzaki 2002).
If we can convert a set of potential values into a set of pseudoparticles at marginal calculation cost, the force from those pseudoparticles can be evaluated on GRAPE, and the bottleneck would disappear. In order to facilitate this conversion, we have developed a new systematic procedure (hereafter A2P conversion).
Using the A2P conversion, we have implemented yet another version of FMM (hereafter code B). In code B, we use A2P conversion to obtain a distribution of pseudoparticles that reproduces the potential field given by Anderson's inner expansion. Once the distribution of pseudoparticles is obtained, the L2L stage can be […] then the force evaluation stage is done entirely on GRAPE (the final column of Table 1).
In the following, we show the procedure of A2P conversion.
For the first step, we distribute pseudoparticles on the surface of a sphere with radius $b$ using the spherical t-design. Here, $b$ should be larger than the radius of the sphere $a$ on which Anderson's potential values $g(a\vec{s}_i) = \Phi(a\vec{s}_i)$ are defined. According to equation (7), it is guaranteed that we can adjust the charge of the pseudoparticles so that $g(a\vec{s}_i)$ are reproduced. Therefore, the relation
$$\sum_{j=1}^{K} \frac{Q_j}{|\vec{R}_j - a\vec{s}_i|} = \Phi(a\vec{s}_i) \tag{11}$$
should be satisfied for all $i = 1, \ldots, K$. Using a matrix $A = \{1/|\vec{R}_j - a\vec{s}_i|\}$ and vectors $\vec{Q} = {}^T[Q_1, Q_2, \ldots, Q_K]$ and $\vec{P} = {}^T[\Phi(a\vec{s}_1), \Phi(a\vec{s}_2), \ldots, \Phi(a\vec{s}_K)]$, we can rewrite equation (11) as
$$A\vec{Q} = \vec{P}. \tag{12}$$
In the next step, we solve the linear equation (12) to obtain the charges $Q_j$. We found that an appropriate value of the radius $b$ is about 6.0 for particles inside a cell with side length 1.0. Anderson (1992) specified that $a$ should be about 0.4. Because of the large difference between $a$ and $b$, equation (12) becomes nearly singular for high order expansions. In this case, Gaussian elimination and LU decomposition do not give a numerically accurate enough solution. Therefore, we applied singular value decomposition (SVD; Press et al. 1992) to solve the equation, and obtained better accuracy. The additional cost for SVD is negligible.
6 Numerical Tests
We performed numerical tests on accuracy and performance of our hardware-accelerated FMM. Here we show the results.
6.1 Accuracy of Inner-P²M² and A2P Conversion
Here we show the result of a test on accuracy of the A2P conversion (Section 5) and inner-P²M² (equation (7)).
We performed the test in the following steps:
1. Locate a particle $q$ at $(r, \pi, \pi/2)$ (spherical coordinates). Here $r$ runs from 1 to 10.
2. Evaluate potential values due to $q$ at positions defined by the spherical t-design on the surface of a sphere of radius $a = 0.4$ centered at the origin. The number and position of the evaluation points depend on the expansion order $p$.
3. Apply A2P conversion to the local expansion obtained in the previous step, i.e., solve equation (12) to obtain charges of pseudoparticles $Q_j$ on the surface of a sphere of radius $b = 6$ centered at the origin. The number and position of the pseudoparticles depend on $p$.
4. Evaluate the force and potential due to the pseudoparticles at observation point $L$: $(0.5, \pi, \pi/2)$.
5. Compare the result with the exact force and potential. The exact values are obtained by direct evaluation.
Figure 4 depicts the test process. Figures 5 and 6 show the results of the test. The potential error and the force error are shown in Figures 5 and 6, respectively. In both cases, the error for $p = 1$ to 5 behaves as theoretically expected, i.e., the potential error scales as $r^{-(p+2)}$, and the force error scales as $r^{-(p+1)}$. For $p = 6$, the error stops decreasing at $r \ge 6$. This is because of the near-singularity of equation (12): when a large number of pseudoparticles are used, the solution of equation (12) suffers large computational error.
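The A2P conversion and the five test steps above can be sketched as follows. The example is a minimal illustration under stated assumptions, not the authors' code: the 12 icosahedron vertices replace the Hardin and Sloane designs (adequate only for low order, roughly $p = 2$), the same directions are used for the sample points and the pseudoparticles, and equation (12) is solved with `numpy.linalg.lstsq`, which is SVD-based. The helper names are hypothetical.

```python
import numpy as np

def icosahedron_design():
    """12 icosahedron vertices on the unit sphere (a spherical 5-design)."""
    g = (1.0 + np.sqrt(5.0)) / 2.0
    pts = []
    for s1 in (-1.0, 1.0):
        for s2 in (-1.0, 1.0):
            pts += [(0.0, s1, s2 * g), (s1, s2 * g, 0.0), (s2 * g, 0.0, s1)]
    pts = np.array(pts)
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def spherical_to_cartesian(r, phi, theta):
    """Convert (r, phi, theta), with theta the polar angle, to Cartesian coordinates."""
    return r * np.array([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])

def a2p_charges(sample_phi, a, b, design):
    """Solve equation (12), A Q = P, for pseudoparticle charges on the sphere of radius b."""
    samples = a * design                                  # a * s_i
    pseudo = b * design                                   # R_j
    A = 1.0 / np.linalg.norm(samples[:, None, :] - pseudo[None, :, :], axis=-1)
    Q, *_ = np.linalg.lstsq(A, sample_phi, rcond=None)    # SVD-based least squares
    return Q, pseudo

design = icosahedron_design()
a, b = 0.4, 6.0

# Step 1: unit charge at (r, pi, pi/2) with r = 10.
source = spherical_to_cartesian(10.0, np.pi, np.pi / 2)
# Step 2: potential values on the sphere of radius a.
sample_phi = 1.0 / np.linalg.norm(a * design - source, axis=1)
# Step 3: A2P conversion.
Q, pseudo = a2p_charges(sample_phi, a, b, design)
# Step 4: potential of the pseudoparticles at the observation point (0.5, pi, pi/2).
obs = spherical_to_cartesian(0.5, np.pi, np.pi / 2)
approx = np.sum(Q / np.linalg.norm(obs - pseudo, axis=1))
# Step 5: compare with the exact potential.
exact = 1.0 / np.linalg.norm(obs - source)
rel_err = abs(approx - exact) / exact
```

Because the resulting field is again that of point charges, the force at the observation point follows from the same pairwise sum as equation (8), which is what allows the far field force evaluation to move onto GRAPE.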
6.2 Performance on MDGRAPE-2
Here we show the performance of the FMM code B (Section 5) measured on MDGRAPE-2 (Susukita et al. 2003). MDGRAPE-2 is one of the latest devices in the GRAPE series. It is developed for MD simulation and has additional functions compared with the original GRAPEs, so that it can handle forces that do not decay as $1/r^2$, such as the van der Waals force. However, in our test we use MDGRAPE-2 only to calculate Coulombic force and potential. The additional functions are not used in our tests.
For the measurement, we used two GRAPE systems. The first one consists of one MDGRAPE-2 board (64 pipelines, 192 Gflop/s) and a host computer COMPAQ DS20E (Alpha 21264/667 MHz). The second one consists of one MDGRAPE-2 board (16 pipelines, 48 Gflop/s) and a self-assembled host computer (Pentium 4/2.2 GHz, Intel D850 motherboard). We refer to the former system as “system I,” and the latter as “system II.”
In the test, we distributed particles uniformly within a unit cube centered at the origin, and evaluated the force on all particles. The number of particles is from 128K to 4M. Notations K and M are 1024 and 1024 × 1024, respectively. We measured the calculation time at both high ($p = 5$) and low ($p = 1$) accuracy, with and without GRAPE. The finest refinement level $l_{max}$ is set to $l_{max} = 4$ and 5, for runs with and without GRAPE, respectively. These values are experimentally chosen so that the overall calculation time is minimized (see Section 2.1).
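The trade-off behind the choice of $l_{max}$ can be illustrated with a rough operation-count model. This is a back-of-the-envelope sketch, not the measurement procedure of the paper: it assumes a uniform particle distribution, ignores boundary effects (every cell is given 26 neighbors and an interaction list of 189 cells), and assumes that each cell carries the same number of pseudoparticles and sample points, `n_pseudo`, whose value of 12 below is purely hypothetical.

```python
def near_field_pairs(n_particles, lmax):
    """Particle-particle interactions: each particle sees its own cell and 26 neighbors."""
    cells = 8 ** lmax
    return n_particles * 27 * n_particles // cells

def m2l_pairs(lmax, n_pseudo, list_size=189):
    """Pseudoparticle-to-sample-point interactions summed over levels 2..lmax."""
    cells = sum(8 ** level for level in range(2, lmax + 1))
    return cells * list_size * n_pseudo ** 2

n = 2 ** 20                                   # 1M particles
for lmax in (3, 4, 5):
    print(lmax, near_field_pairs(n, lmax), m2l_pairs(lmax, n_pseudo=12))
```

Raising $l_{max}$ by one cuts the near field work by a factor of eight while multiplying the M2L work by roughly the same factor, so the best level depends on how fast each part runs on the available hardware.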
In this paper we do not present in detail our experiments in the case of inhomogeneous distribution of particles, since inhomogeneity is not as important as homogeneity or close-to-uniformity in molecular dynamics simulations. However, our experiments on the two GRAPE systems show that the treecode runs faster than the FMM in the inhomogeneous case.
Results for close-to-uniform distribution cases are shown in Figures 7–10 and Tables 2–3. Figures 7 and 9 are results of system I. Figures 8 and 10 and Tables 2–3 are of system II.
In Figures 7 and 8, calculation time of the code B is plotted against the number of particles $N$. Results shown in Figures 7 and 8 are measured on system I and II, respectively. Results of the direct-summation algorithm are also shown for comparison. Our code scales as $O(N)$. On system I, runs with GRAPE are faster than those without GRAPE by a factor of 5 and 60 for low (RMS relative force error ~$10^{-2}$) and high (RMS relative force error ~$10^{-5}$) accuracy, respectively. On system II, the speedup factors are 3 and 14.5. Since the amount of calculation for the M2L stage becomes more significant at higher $p$ (Table 2), the speedup factor is larger for higher accuracy.
Table 3 shows the breakdown of the calculation time for 1M-particle runs. We can see GRAPE significantly accelerates the M2L part and force evaluation part. The overall performance of our implementation is limited by the speed of the communication bus between the host and GRAPE, rather than the speed of GRAPE itself. For further acceleration, we need to switch from the legacy PCI bus (32 bit/33 MHz) to faster buses, such as PCI-X or PCI Express.
Fig. 4. Description of the test for accuracy of inner-P²M² and the A2P conversion. Numbers on the figure are steps in the test.
Fig. 5. Error of the potential calculated with inner-P²M² and the A2P conversion. From top to bottom, six dashed curves are plotted with expansion order $p$ = 1, 2, 3, 4, 5 and 6, respectively.
Fig. 6. Force error: details as in Figure 5.
Fig. 7. Force calculation time of FMM and direct-summation algorithm on system I. Circles denote performance of FMM on MDGRAPE-2. Pentagons denote that on the host computer. Open and filled symbols are for low ($p = 1$) and high accuracy ($p = 5$), respectively. Solid and dashed curves without symbols are performance of the direct method on MDGRAPE-2 and the host computer, respectively.
Fig. 8. Force calculation time of FMM and direct-summation algorithm on system II. Symbols as in Figure 7.
Fig. 9. Comparison of force calculation time for FMM and treecode on MDGRAPE-2 on system I. Circles are performance of FMM on MDGRAPE-2. Triangles are that of the treecode on MDGRAPE-2. Open and filled symbols are for low and high accuracy, respectively. Parameter pairs ($p$, $\theta$) to obtain low and high accuracy of the treecode are (1, 1.0) and (2, 0.33), respectively.
Fig. 10. Comparison of force calculation time for FMM and treecode on MDGRAPE-2 on system II. Details as in Figure 9.
Figure 9 shows the calculation time of our FMM code and the treecode (Kawai, Makino, and Ebisuzaki 2004), both running on GRAPE. The order of the multipole expansion $p$ and the opening angle $\theta$ for the treecode are set to ($p$, $\theta$) = (1, 1.0) and (2, 0.33) for low and high accuracy, respectively. These values are chosen so that the treecode gives roughly the same RMS force error as that of the FMM. The RMS force errors at low and high accuracy are ~$5 \times 10^{-2}$ and ~$2 \times 10^{-5}$, respectively.
We can see that the performance of our FMM code and the treecode is almost the same. The FMM is better than the treecode at high accuracy, and worse at low accuracy.
For a particular GRAPE system, parameter tuning for optimal performance of the modified FMM can be done by experiment. One should measure the performance of code B on a randomly generated particle system with different values of the finest refinement level $l_{max}$ for each expansion order $p$ from 1 to 5. For example, if the number of particles in the system is from 128K to 4M and the GRAPE's peak performance is either 48 Gflop/s or 192 Gflop/s, the values of $l_{max}$ to be tested are 3, 4 and 5.
Table 2
Pairwise interaction count for 1M-particle run.
Columns: Accuracy; With GRAPE ($l_{max}$ = 4); Without GRAPE ($l_{max}$ = 5). Row labels: Force evaluation.
Table 3
Time breakdown for 1M-particle run on system II.
Columns: Accuracy; With GRAPE ($l_{max}$ = 4); Without GRAPE ($l_{max}$ = 5). Row labels: Building neighbor; M2L; Force evaluation.
7 Discussion
We compared the performance of our FMM implementation (the code B) with Wrankin's distributed parallel multipole tree algorithm (DPMTA; Wrankin and Board 1995).
We measured the performance of Wrankin's code on system II, using the serial version of DPMTA 3.1.3 available at http://www.ee.duke.edu/~wrankin/Dpmta/
For the measurement, particles are distributed in a unit cube. The expansion order and other parameters of each code are chosen so that relatively high accuracy (~$10^{-5}$) is achieved, and the performance is optimized.
Table 4 summarizes the comparison. Using GRAPE, our code outperforms Wrankin's code by a factor of ten. Without GRAPE, our code is slower than Wrankin's code by a factor of 1.1–1.4, mainly because our code requires a larger number of operations so that it can take full advantage of GRAPE.
Parallelization of the FMM on a cluster of GRAPEs requires no special techniques. Algorithms used for parallelization on a cluster of general-purpose computers (Hu and Johnsson 1996) can be applied without modification. In our modified FMM, GRAPE is used for the M2L and force evaluation stages. The presence of GRAPE has no effect on the parallelization of the tree construction or of building the neighbor and interaction lists.
In the case of the treecode, several versions of parallel codes have been developed so far. These codes are used for production runs in the field of astrophysics (Fukushige, Kawai, and Makino 2004; Fukushige, Makino, and Kawai 2005). We can follow a similar approach to parallelize our FMM code.
8 Summary
Using special-purpose hardware GRAPE, we have successfully accelerated the FMM. In order to take full advantage of the hardware, we have modified the original FMM using Anderson's method, the pseudoparticle multipole method, and two conversion techniques we have newly invented. The experimental results show that GRAPE accelerates the FMM by a factor of 3 to 60, and the factor increases as the required accuracy becomes higher. Comparison with the treecode shows that in the case of close-to-uniform distribution of particles, our FMM is faster at high accuracy, while the treecode is faster at low accuracy. In case of inhomogeneous distribution of particles, the treecode is faster than the FMM.
It is suggested that one should use the code B for large scale molecular dynamics simulations where high accuracy is demanded.
Acknowledgments
Thanks are due to Dr. T. Iitaka at the Institute of Physical and Chemical Research (RIKEN) for the suggestion of using the SVD method.
We are grateful to Prof. J. A. Smith from Bridge to Asia and Prof. D. E. Keyes from Columbia University for refining the manuscript.
This work is supported by the Advanced Computing Center, RIKEN and the College of Technology, Vietnam National University, Hanoi. Part of this work was carried out while N. H. Chau was a contract researcher of RIKEN and A. Kawai was a special postdoctoral researcher of RIKEN.
Appendix A
In this appendix, we describe the derivation procedure of equation (7), the inner expansion of P²M².
The local expansion of the potential $\Phi(\vec{r})$ is expressed as
$$\Phi(\vec{r}) = \sum_{l=0}^{p} \sum_{m=-l}^{l} \beta_l^m\, r^l\, Y_l^m(\theta, \phi). \tag{13}$$
Here, $Y_l^m(\theta, \phi)$ is the spherical harmonics and $\beta_l^m$ is the expansion coefficient. In order to approximate the potential field due to the distribution of $N$ particles, the coefficients should satisfy
$$\beta_l^m = \frac{4\pi}{2l+1} \sum_{i=1}^{N} q_i\, \frac{1}{r_i^{l+1}}\, Y_l^{m*}(\theta_i, \phi_i). \tag{14}$$
Table 4
Performance comparison with Wrankin's code.
Row labels: Our code with GRAPE; our code without GRAPE.