DSpace at VNU: Acceleration of fast multipole method using special-purpose computer GRAPE


The International Journal of High Performance Computing Applications,

Volume 22, No 2, Summer 2008, pp 194–205

DOI: 10.1177/1094342008090912

ACCELERATION OF FAST MULTIPOLE METHOD USING SPECIAL-PURPOSE COMPUTER GRAPE

Nguyen Hai Chau (1), Atsushi Kawai (2), Toshikazu Ebisuzaki (3)

(1) COLLEGE OF TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY, 144 XUAN THUY, CAU GIAY, HANOI, VIETNAM (CHAUNH@VNU.EDU.VN; NHCHAU@GMAIL.COM)

(2) K&F COMPUTING RESEARCH CO., 1-21-6-407, KOJIMA-CHO, CHOFU, TOKYO, JAPAN 182-0026

(3) COMPUTATIONAL ASTROPHYSICS LABORATORY, INSTITUTE OF PHYSICAL AND CHEMICAL RESEARCH (RIKEN), HIROSAWA 2-1, WAKO-SHI, SAITAMA, JAPAN 351-0198

Abstract

We have implemented the fast multipole method (FMM) on a special-purpose computer GRAPE (GRAvity piPE). The FMM is one of the fastest approximate algorithms to calculate forces among particles. Its calculation cost scales as O(N), while the naive algorithm scales as O(N^2). Here, N is the number of particles in the system. GRAPE is hardware dedicated to the calculation of Coulombic or gravitational forces among particles. GRAPE's calculation speed is 100–1000 times faster than that of conventional computers of the same price, though it cannot handle anything but force calculation. We can expect significant speedup by the combination of the fast algorithm and the fast hardware. However, a straightforward implementation of the algorithm actually runs on GRAPE at rather modest speed. This is because of the limited functionality of the hardware. Since GRAPE can handle particle forces only, just a small fraction of the overall calculation procedure can be put on it. The remaining part must be performed on a conventional computer connected to GRAPE. In order to take full advantage of the dedicated hardware, we modified the FMM using the pseudoparticle multipole method and Anderson's method. In the modified algorithm, multipole and local expansions are expressed by distributions of a small number of imaginary particles (pseudoparticles), and thus they can be evaluated by GRAPE. Results of numerical experiments on ordinary GRAPE systems show that, for large-N systems (N ≥ 10^5), GRAPE accelerates the FMM by a factor ranging from 3 for low accuracy (RMS relative force error ~10^-2) to 60 for high accuracy (RMS relative force error ~10^-5). Performance of the FMM on GRAPE exceeds that of the Barnes–Hut treecode on GRAPE at high accuracy, in the case of a close-to-uniform distribution of particles. However, in the same experimental environment the treecode outperforms the FMM for an inhomogeneous distribution of particles.

Key words: molecular dynamics, numerical simulation, fast multipole method, tree algorithm, Anderson's method, pseudoparticle multipole method, special-purpose computer

1 Introduction

Molecular dynamics (MD) simulations are highly compute intensive. The most expensive part of MD is the calculation of Coulombic forces among particles (i.e., atoms and ions). In a naive direct-summation algorithm, the cost of the force calculation scales as O(N^2), where N is the number of particles. This is because the Coulombic force is a long-range interaction.

In order to reduce the cost of force calculation, fast algorithms such as the Barnes–Hut treecode (Barnes and Hut 1986) and the fast multipole method (FMM; Greengard and Rokhlin 1987) have been developed. In the treecode, particles are grouped and forces from them are approximated by multipole expansions of the group. Particles that are more distant are organized into larger groups, and thus the calculation cost scales as O(N log N). In the FMM, the force is also approximated by a multipole expansion. Then the multipole expansion is converted to a local expansion at each observation point. The force on each particle is obtained by evaluating the local expansion. The calculation cost of this scheme scales as O(N). These fast algorithms are widely used in the field of MD simulation (Lakshminarasimhulu and Madura 2002; Lupo et al. 2002).

There exists another approach to accelerate the force calculation: using hardware dedicated to the calculation of inter-particle forces. GRAPE (GRAvity PipE; Sugimoto et al. 1990; Makino and Taiji 1998) is one of the most widely used pieces of special-purpose hardware of this kind. Figure 1 shows the basic structure of a GRAPE system. It consists of a GRAPE processor board and a general-purpose computer (hereafter the host computer). The host computer sends positions and charges of particles to GRAPE. GRAPE then calculates the forces, and sends the results back to the host computer.

Fig. 1 Basic structure of a GRAPE system.

Using hardwired pipelines, a typical GRAPE system performs the force calculation 100–1000 times faster than conventional computers of the same price. For small-N (say N ≲ 10^5) particle systems, therefore, the combination of a simple direct-summation algorithm and GRAPE is the fastest calculation scheme. Fast algorithms are not very effective at such a small N.

For large-N particle systems, however, O(N^2) direct summation becomes expensive, even with GRAPE. If we successfully combine one of the fast algorithms and the fast hardware, a significant speedup for large-N particle systems would be expected. As for the tree algorithm, Makino (1991) has successfully implemented a modified treecode (Barnes 1990) on GRAPE, and achieved a factor of 30–50 speedup.

For the FMM, on the other hand, no implementation on GRAPE so far exists. An implementation of the FMM on dedicated hardware of a similar kind has been reported, but its performance is rather modest (Amisaki et al. 2003). This is mainly because of the limited functionality of the hardware. Since dedicated hardware can calculate the particle force only, it cannot handle multipole and local expansions. Therefore, only a small fraction of the FMM's calculation can be performed on such hardware, and the speedup gain remains rather modest.

In order to take full advantage of GRAPE, we modified the FMM using the pseudoparticle multipole method (Makino 1999) and Anderson's (1992) method. Using these methods, we can express the multipole and local expansions by a distribution of a small number of imaginary particles (pseudoparticles). With this modification, we can use GRAPE to evaluate the expansions. Therefore, a significant fraction of the modified FMM can be handled on GRAPE.

In this paper we describe the implementation and performance of the modified FMM on GRAPE. The paper is organized as follows. Section 2 gives a summary of the FMM and related algorithms. In Section 3, a brief overview of the GRAPE system is given. In Section 4, we describe the implementation of our FMM code, which is modified so that it runs on GRAPE. Results of numerical tests of the code are shown in Section 6. Section 7 is devoted to discussion and Section 8 summarizes.

2 FMM and Related Algorithms

Here we give a brief description of the FMM (Section 2.1), and two related algorithms, namely, Anderson's method (Section 2.2) and the pseudoparticle multipole method (Section 2.3). As will be seen in Section 4, the latter two algorithms are used to implement the FMM on GRAPE.

2.1 FMM

The FMM is an approximate algorithm to calculate forces among particles. In the case of a close-to-uniform distribution of particles, the FMM's calculation cost scales as O(N). This scaling is achieved by approximating the forces using the multipole and local expansion technique.

Figure 2 shows a schematic idea of force approximation in the FMM. The force from a group of distant particles is approximated by a multipole expansion. At an observation point, the multipole expansion is converted to a local expansion. The local expansion is evaluated by each particle around the observation point. A hierarchical tree structure is used for grouping the particles.

The algorithm is applicable to two-dimensional (Greengard and Rokhlin 1987) and three-dimensional (Greengard and Rokhlin 1997) particle systems. In the following, we review the calculation procedure of the algorithm for the three-dimensional case.

2.1.1 Tree construction. Assume we have an isolated particle system. Initially, we define a large enough cube (root cell) to cover all particles in the system. We construct an oct-tree structure by hierarchical subdivision of the cube into eight smaller cubes (child cells). The subdivision procedure starts from the root cell at refinement level l = 0. The subdivision is then repeated recursively for all sub cells, and stopped when l reaches an optimal refinement level lmax. The optimal level lmax is determined so that it optimizes the calculation speed.
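As an illustration of the uniform subdivision described above, the following minimal sketch assigns particles to leaf cells at level lmax by integer cell coordinates. The function names `assign_leaf_cells` and `parent_cell`, and the use of NumPy, are assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np

def assign_leaf_cells(pos, lmax, origin, size):
    """Integer (ix, iy, iz) leaf-cell coordinates at level lmax for each particle.

    pos    : (N, 3) particle positions
    origin : lower corner of the root cell
    size   : side length of the root cell
    """
    n_side = 2 ** lmax                          # cells per side at level lmax
    scaled = (pos - origin) / size * n_side
    idx = np.floor(scaled).astype(np.int64)
    # Particles lying exactly on the upper faces are put into the last cell.
    return np.clip(idx, 0, n_side - 1)

def parent_cell(idx):
    """Cell coordinates one refinement level up (level l -> level l - 1)."""
    return idx // 2

# Hypothetical usage: particles uniform in a unit cube centered at the origin.
rng = np.random.default_rng(0)
origin = np.array([-0.5, -0.5, -0.5])
pos = rng.random((1000, 3)) - 0.5
leaf = assign_leaf_cells(pos, lmax=3, origin=origin, size=1.0)
```

With integer coordinates, the parent of a cell is obtained by halving each coordinate, so ascending and descending the tree needs no explicit pointer structure for a uniformly refined tree.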

2.1.2 M2M transition. Next, we form multipole expansions for each leaf cell by calculating contributions from all particles inside the cell.

Then we ascend the tree structure to form multipole expansions of all non-leaf cells in all coarser levels. The procedure starts from the parents of the leaf cells. For each cell, the multipole expansions of its children are shifted to the geometric center of the cell (M2M transition) and summed.

This procedure is continued until it reaches the root cell.

Fig. 2 Schematic idea of force approximation in FMM.

2.1.3 M2L conversion. Then we evaluate the multipole expansions. In order to describe this part, here we define the terms "neighbor list" and "interaction list." The neighbor list of a cell is the set of cells at the same level of refinement which are in contact with the cell. The interaction list of a cell is the set of cells which are children of the neighbors of the cell's parent and which are not neighbors of the cell itself. Figure 3 shows the neighbor and interaction lists of a cell for the two-dimensional case.

For each cell we evaluate the multipole expansions of all cells in its interaction list. We convert each multipole expansion to a local expansion at the geometric center of the cell in question (M2L conversion), and sum them.
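The two definitions above can be sketched directly for a uniformly refined three-dimensional tree. In this sketch, cells are identified by integer coordinates at a given level; the function names and this representation are assumptions for illustration only.

```python
from itertools import product

def neighbors(cell, level):
    """Cells at the same level that touch `cell` (the cell itself excluded)."""
    n_side = 2 ** level
    out = []
    for d in product((-1, 0, 1), repeat=3):
        if d == (0, 0, 0):
            continue
        c = tuple(ci + di for ci, di in zip(cell, d))
        if all(0 <= x < n_side for x in c):      # stay inside the root cell
            out.append(c)
    return out

def interaction_list(cell, level):
    """Children of the parent's neighbors that are not neighbors of `cell`."""
    if level == 0:
        return []
    parent = tuple(x // 2 for x in cell)
    near = set(neighbors(cell, level))
    out = []
    for pn in neighbors(parent, level - 1):
        for d in product((0, 1), repeat=3):      # the eight children of pn
            child = tuple(2 * p + di for p, di in zip(pn, d))
            if child not in near:
                out.append(child)
    return out
```

For a cell far from the boundary, this gives 26 neighbors and 189 interaction-list cells (6^3 - 3^3), which bounds the number of M2L conversions per cell.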

2.1.4 L2L transition. In the next step, we descend the tree structure. We sum the local expansions at different refinement levels to obtain the total potential field at the leaf cells. For each cell at level l, we shift the center of the local expansion of its parent at level l – 1 (L2L transition), and then add it to the local expansion of the cell. By this procedure, all cells at level l will have the local expansion of the total potential field except for the contribution of the neighbor cells. By repeating this procedure for all levels, we obtain the potential field for all leaf cells.

2.1.5 Force evaluation. Finally, we calculate the force on each particle in all leaf cells by summing the contributions of far field and near field forces. The near field contribution is directly calculated by evaluating the particle–particle force. The far field contribution is calculated by evaluating the local expansion of the leaf cell at the position of the particle.

2.2 Anderson's Method

Anderson (1992) proposed a variant of the FMM using a new formulation of the multipole and local expansions. The advantage of his method is its simplicity: it makes the implementation of the FMM significantly simpler. Here we briefly describe his method.

Anderson's method is based on Poisson's formula. This formula gives the solution of the boundary value problem of the Laplace equation. When the potential on the surface of a sphere of radius a is given, the potential Φ at position $\vec{r}$ = (r, φ, θ) is expressed as

$$\Phi(\vec{r}) = \frac{1}{4\pi}\int_S \sum_{n=0}^{\infty}(2n+1)\left(\frac{a}{r}\right)^{n+1} P_n\!\left(\frac{\vec{s}\cdot\vec{r}}{r}\right)\Phi(a\vec{s})\,ds \tag{1}$$

for r ≥ a, and

$$\Phi(\vec{r}) = \frac{1}{4\pi}\int_S \sum_{n=0}^{\infty}(2n+1)\left(\frac{r}{a}\right)^{n} P_n\!\left(\frac{\vec{s}\cdot\vec{r}}{r}\right)\Phi(a\vec{s})\,ds \tag{2}$$

for r ≤ a. Note that here we use a spherical coordinate system. Here, $\Phi(a\vec{s})$ is the given potential on the sphere surface. The area of integration S covers the surface of the unit sphere centered at the origin. The function P_n denotes the nth Legendre polynomial.

In order to use these formulae as replacements for the multipole and local expansions, Anderson proposed a discrete version of them, i.e., he truncated the right-hand sides of equations (1)–(2) at a finite n, and replaced the integrations over S with numerical ones using a spherical design. Hardin and Sloane (1996) define the spherical t-design as follows.

A set of K points $\mathcal{P} = \{P_1, \ldots, P_K\}$ on the unit sphere $\Omega_d = S^{d-1} = \{x = (x_1, \ldots, x_d) \in R^d : x \cdot x = 1\}$ forms a spherical t-design if the identity

$$\int_{\Omega_d} f(x)\,d\mu(x) = \frac{1}{K}\sum_{i=1}^{K} f(P_i) \tag{3}$$

(where μ is the uniform measure on Ω_d normalized to have a total measure 1) holds for all polynomials f of degree ≤ t (Hardin and Sloane 1996).

Note that the optimal set, i.e., the smallest set of the spherical t-design, is not known so far for general t. In practice we use spherical t-designs as empirically found by Hardin and Sloane. Examples of such t-designs are available at http://www.research.att.com/~njas/sphdesigns/

Fig. 3 Neighbour and interaction list of the hatched cell.

Using the spherical t-design, Anderson obtained the discrete versions of (1) and (2) as follows:

$$\Phi(\vec{r}) \approx \sum_{i=1}^{K}\sum_{n=0}^{p}(2n+1)\left(\frac{a}{r}\right)^{n+1} P_n\!\left(\frac{\vec{s}_i\cdot\vec{r}}{r}\right)\Phi(a\vec{s}_i)\,w_i \tag{4}$$

for r ≥ a (outer expansion) and

$$\Phi(\vec{r}) \approx \sum_{i=1}^{K}\sum_{n=0}^{p}(2n+1)\left(\frac{r}{a}\right)^{n} P_n\!\left(\frac{\vec{s}_i\cdot\vec{r}}{r}\right)\Phi(a\vec{s}_i)\,w_i \tag{5}$$

for r ≤ a (inner expansion). Here w_i is a constant weight value and p is the number of untruncated terms. Hereafter we refer to p as the expansion order.

Anderson's method uses equations (4) and (5) for the M2M and L2L transitions, respectively. The procedures of the other stages are the same as those of the original FMM.

2.3 Pseudoparticle Multipole Method

Makino (1999) proposed the pseudoparticle multipole method (P2M2), yet another formulation of the multipole expansion. The advantage of his method is that the expansions can be evaluated using GRAPE.

The basic idea of P2M2 is to use a small number of pseudoparticles to express the multipole expansions. In other words, this method approximates the potential field of physical particles by the field generated by a small number of pseudoparticles. This idea is very similar to that of Anderson's method. Both methods use discrete quantities to approximate the potential field of the original distribution of the particles. The difference is that P2M2 uses a distribution of point charges, while Anderson's method uses potential values. In the case of P2M2, the potential is expressed by point charges, and thus it can be evaluated using GRAPE.

In the following, we describe the formulation procedure of P2M2.

The distribution of pseudoparticles is determined so that it correctly describes the coefficients of a multipole expansion. A naive approach to obtain the distribution is to directly invert the multipole expansion formula. For a relatively small expansion order, say p ≤ 2, we can solve the inversion formula, and obtain the optimal distribution with the minimum number of pseudoparticles (Kawai and Makino 2001).

However, it is rather difficult to solve the inversion formula for higher p, since the formula is nonlinear. For p > 2, we adopted Makino's (1999) approach, which is more general. In his approach, pseudoparticles are fixed at the positions given by the spherical t-design (Hardin and Sloane 1996), and only their charges can change. This makes the formula linear, although the necessary number of pseudoparticles increases. This is because we can adjust only the charges of the pseudoparticles, since we fixed their positions. The degree of freedom assigned to each pseudoparticle is then reduced from four to one.

Makino's approach systematically gives the solution of the inversion formula as follows:

$$Q_j = \sum_{i=1}^{N} q_i \sum_{l=0}^{p}\frac{2l+1}{K}\left(\frac{r_i}{a}\right)^{l} P_l(\cos\gamma_{ij}), \tag{6}$$

where Q_j is the charge of the j-th pseudoparticle, $\vec{r}_i$ = (r_i, φ, θ) is the position of the i-th physical particle, and γ_ij is the angle between $\vec{r}_i$ and the position vector $\vec{R}_j$ of the j-th pseudoparticle. For the derivation of equation (6), see Makino (1999).

Equation (6) gives the solution for the outer expansion. We found that, following a similar approach, we can obtain the solution for the inner expansion:

$$Q_j = \sum_{i=1}^{N} q_i \sum_{l=0}^{p}\frac{2l+1}{K}\left(\frac{a}{r_i}\right)^{l+1} P_l(\cos\gamma_{ij}). \tag{7}$$

For the derivation of equation (7), see Appendix A.

3 Function of GRAPE

The primary function of GRAPE is to calculate the force $\vec{f}(\vec{r}_i)$ exerted on particle i at position $\vec{r}_i$, and the potential $\phi(\vec{r}_i)$ associated with $\vec{f}(\vec{r}_i)$. Although there are several variants of GRAPE for different applications such as astrophysics and MD, the basic functions of these hardware devices are substantially the same.

The force $\vec{f}(\vec{r}_i)$ and the potential $\phi(\vec{r}_i)$ are expressed as

$$\vec{f}(\vec{r}_i) = \sum_{j=1}^{N}\frac{q_j(\vec{r}_i - \vec{r}_j)}{r_s^{3}} \tag{8}$$

and

$$\phi(\vec{r}_i) = \sum_{j=1}^{N}\frac{q_j}{r_s}, \tag{9}$$

where N is the number of particles to handle, $\vec{r}_j$ and q_j are the position and the charge of particle j, and r_s is the softened distance between particles i and j, defined as $r_s^2 \equiv |\vec{r}_i - \vec{r}_j|^2 + \varepsilon^2$, where ε is the softening parameter.

In order to calculate the force $\vec{f}(\vec{r}_i)$, the relevant data, $\vec{r}_i$, $\vec{r}_j$, q_j, and ε, are sent from the host computer to GRAPE, and the calculated result is sent back to the host. The potential $\phi(\vec{r}_i)$ is calculated in the same manner.

4 Implementation of the FMM on GRAPE

The FMM consists of five stages (see Section 2.1), namely, the tree construction, M2M transition, M2L conversion, L2L transition, and the force evaluation. The force evaluation stage consists of near field and far field evaluation parts.

In the case of the original FMM, only the near field part of the force evaluation stage can be performed on GRAPE. At this stage, GRAPE directly evaluates the force from each particle expressed in the form of equation (8). At all other stages, mathematical operations not in the form of equation (8) or equation (9) are required. GRAPE cannot handle these operations.
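To make concrete what "the form of equation (8) or equation (9)" means, the following sketch emulates in software the softened pairwise summation that GRAPE performs in hardware. It is only a reference model under stated assumptions: the function name `grape_kernel` is hypothetical, the real hardware works in reduced precision through its own host library, and skipping zero-distance pairs is a choice made here for convenience.

```python
import numpy as np

def grape_kernel(xi, xj, qj, eps=0.0):
    """Software model of equations (8) and (9).

    xi  : (Ni, 3) positions where force and potential are evaluated
    xj  : (Nj, 3) source positions
    qj  : (Nj,)   source charges
    eps : softening parameter
    Returns (force, potential) with shapes (Ni, 3) and (Ni,).
    """
    dr = xi[:, None, :] - xj[None, :, :]           # r_i - r_j
    rs2 = np.sum(dr * dr, axis=2) + eps * eps      # softened squared distance
    inv_rs = np.zeros_like(rs2)
    mask = rs2 > 0.0                               # skip coincident points
    inv_rs[mask] = 1.0 / np.sqrt(rs2[mask])
    potential = inv_rs @ qj                                   # equation (9)
    force = np.einsum("ij,ijk->ik", qj * inv_rs**3, dr)       # equation (8)
    return force, potential
```

Any stage of the FMM that can be rewritten as a call of this shape, with physical particles or pseudoparticles as sources, is a candidate for running on GRAPE; this is the motivation for the modifications described in the rest of this section.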

In our implementation (hereafter code A), we modified the original FMM so that GRAPE could handle the M2L conversion stage, which is the most time consuming. For this purpose, we used P2M2 to express the multipole expansions. With this modification, GRAPE can handle the M2L stage by evaluating potential values from the pseudoparticles. At the L2L stage, potential values are locally expanded and shifted using Anderson's method. Table 1 summarizes the mathematical expressions and operations used at each calculation stage.

In the following, we describe the details of our implementation.

4.1 Tree Construction

The tree construction stage is unchanged. It is performed in the same way as in the original FMM.

At the M2M transition stage, we compute positions and charges of pseudoparticles, instead of forming multipole expansions as in the original FMM.

The procedure starts from the leaf cells. Positions and charges of the pseudoparticles of the leaf cells are calculated from the positions and charges of the physical particles. Then, those of non-leaf cells are calculated from the positions and charges of the pseudoparticles of their child cells. This procedure is continued until it reaches the root cell. This process is performed completely on the host computer.
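The leaf-cell step of this stage can be sketched with equation (6). The sketch below makes several assumptions that are not from the paper: the 12 vertices of a regular icosahedron (a spherical 5-design) stand in for the Hardin and Sloane point sets, the radius a = 1 and order p = 2 are arbitrary illustrative choices, positions are measured from the cell center, no particle sits exactly at the center, and all function names are hypothetical.

```python
import numpy as np

def icosahedron():
    """12 unit vectors at the vertices of a regular icosahedron (a 5-design)."""
    g = (1.0 + np.sqrt(5.0)) / 2.0
    v = []
    for s1 in (-1.0, 1.0):
        for s2 in (-1.0, 1.0):
            v += [(0.0, s1, s2 * g), (s1, s2 * g, 0.0), (s2 * g, 0.0, s1)]
    v = np.array(v)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def legendre(p, u):
    """P_0(u)..P_p(u) by the three-term recurrence; shape (p+1,) + u.shape."""
    u = np.asarray(u, dtype=float)
    P = [np.ones_like(u), u.copy()]
    for n in range(1, p):
        P.append(((2 * n + 1) * u * P[n] - n * P[n - 1]) / (n + 1))
    return np.array(P[: p + 1])

def p2m2_outer(pos, q, a, p, s):
    """Equation (6): charges Q_j of pseudoparticles placed at R_j = a * s_j."""
    K = len(s)
    r = np.linalg.norm(pos, axis=1)               # r_i (assumed nonzero)
    cosg = (pos @ s.T) / r[:, None]               # cos(gamma_ij), shape (N, K)
    P = legendre(p, cosg)                         # shape (p+1, N, K)
    l = np.arange(p + 1)[:, None, None]
    terms = (2 * l + 1) / K * (r[None, :, None] / a) ** l * P
    return np.einsum("i,lij->j", q, terms)

# Hypothetical usage: 50 positive charges in a unit cube around the cell center.
rng = np.random.default_rng(1)
pos = rng.random((50, 3)) - 0.5
q = rng.random(50) + 0.5
s = icosahedron()
Q = p2m2_outer(pos, q, a=1.0, p=2, s=s)
```

Because the point set integrates low-degree polynomials exactly, the pseudoparticles in this sketch carry the same total charge and dipole moment as the physical particles. The same routine applied to the pseudoparticles of the child cells, with positions taken relative to the parent center, would give the non-leaf step.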

The M2L conversion stage is done on GRAPE. In contrast to the original FMM, we do not use the formula to convert the multipole expansion to a local expansion. We directly calculate potential values due to the pseudoparticles in the interaction list of each cell.

The L2L transition is done in the same manner as in Anderson's method. We use equation (5) to convert the local expansion of each cell to that of its children.

The near field contribution is directly calculated by evaluating the particle–particle force. GRAPE can handle this part without any modification of the algorithm.

Using equation (5), the far field potential on a particle at position $\vec{r}$ can be calculated from the set of potential values of the leaf cell which contains the particle. Meanwhile, the far field force is calculated using a derivative of equation (5):

$$\nabla\Phi(\vec{r}) \approx \sum_{i=1}^{K}\sum_{n=0}^{p}\left[n\,r^{n-2}P_n(u)\,\vec{r} + r^{n-2}P_n'(u)\,(r\,\vec{s}_i - u\,\vec{r})\right]\frac{2n+1}{a^{n}}\,g(a\vec{s}_i)\,w_i, \tag{10}$$

where $u = \vec{s}_i\cdot\vec{r}/r$ and $P_n'(u) = dP_n/du = n\,[P_{n-1}(u) - u\,P_n(u)]/(1-u^2)$.

All the calculation at this stage is done on the host computer.

Table 1
Mathematical expressions and operations used in different implementations of the FMM. Underlined parts run on GRAPE.

Stage | Original (Greengard and Rokhlin 1997) | Code A (Section 4) | Code B (Section 5)
M2L | … | Evaluation of pseudoparticle potential | Evaluation of pseudoparticle potential
Near field force | Evaluation of physical-particle force | Evaluation of physical-particle force | Evaluation of physical-particle force
Far field force | Evaluation of local expansion | Equation (10) | Evaluation of pseudoparticle force
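The host-side far field evaluation can be sketched as follows: the potential from the inner expansion, equation (5), and its gradient, obtained by differentiating each term r^n P_n(u) with respect to the position. Several things here are assumptions for illustration and not the authors' code: the icosahedron vertices with equal weights w_i = 1/K replace the Hardin and Sloane sets, the derivative P_n' is computed by the recurrence P_{n+1}' = (n+1) P_n + u P_n' to avoid dividing by 1 - u^2, and the test configuration (a = 0.4, p = 2, one distant unit charge) is arbitrary.

```python
import numpy as np

def icosahedron():
    """12 unit vectors at the vertices of a regular icosahedron (a 5-design)."""
    g = (1.0 + np.sqrt(5.0)) / 2.0
    v = []
    for s1 in (-1.0, 1.0):
        for s2 in (-1.0, 1.0):
            v += [(0.0, s1, s2 * g), (s1, s2 * g, 0.0), (s2 * g, 0.0, s1)]
    v = np.array(v)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def legendre_and_deriv(p, u):
    """Lists P[n](u) and dP[n]/du for n = 0..p (scalar u)."""
    P = [1.0, u]
    dP = [0.0, 1.0]
    for n in range(1, p):
        P.append(((2 * n + 1) * u * P[n] - n * P[n - 1]) / (n + 1))
        dP.append((n + 1) * P[n] + u * dP[n])
    return P[: p + 1], dP[: p + 1]

def inner_expansion(x, g, a, p, s):
    """Potential (equation (5)) and its gradient at x, with 0 < |x| <= a."""
    w = 1.0 / len(s)                       # equal weights (assumption)
    r = np.linalg.norm(x)
    phi = 0.0
    grad = np.zeros(3)
    for si, gi in zip(s, g):
        u = si @ x / r
        P, dP = legendre_and_deriv(p, u)
        for n in range(p + 1):
            c = (2 * n + 1) / a**n * gi * w
            phi += c * r**n * P[n]
            if n >= 1:                     # the n = 0 term is constant
                grad += c * r ** (n - 2) * (n * P[n] * x + dP[n] * (r * si - u * x))
    return phi, grad

# Hypothetical check against one distant unit charge.
s = icosahedron()
a, p = 0.4, 2
xs = np.array([6.0, -5.0, 3.0])
g = 1.0 / np.linalg.norm(a * s - xs, axis=1)   # potential values g(a s_i)
x = np.array([0.1, 0.2, -0.15])
phi, grad = inner_expansion(x, g, a, p, s)
```

The double loop over K points and p + 1 orders for every particle is what makes this stage costly on the host computer.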

5 Further Improved Implementation

With the modification described in Section 4, we have successfully put the bottleneck, namely, the M2L conversion stage, on GRAPE. The overall calculation of the FMM is significantly accelerated.

However, we still have room for improvement. The M2L stage is put on GRAPE and is no longer a bottleneck. Now the most expensive part is the far field force evaluation. Equation (10) is complicated, and evaluation of it would take rather a large fraction of the overall calculation time (Chau, Kawai, and Ebisuzaki 2002).

If we can convert a set of potential values into a set of pseudoparticles at marginal calculation cost, the force from those pseudoparticles can be evaluated on GRAPE, and the bottleneck would disappear. In order to facilitate this conversion, we have developed a new systematic procedure (hereafter A2P conversion).

Using the A2P conversion, we have implemented yet another version of the FMM (hereafter code B). In code B, we use the A2P conversion to obtain a distribution of pseudoparticles that reproduces the potential field given by Anderson's inner expansion. Once the distribution of pseudoparticles is obtained, the L2L stage can be […] then the force evaluation stage is done entirely on GRAPE (the final column of Table 1).

In the following, we show the procedure of the A2P conversion.

For the first step, we distribute pseudoparticles on the surface of a sphere with radius b using the spherical t-design. Here, b should be larger than the radius a of the sphere on which Anderson's potential values $g(a\vec{s}_i)$ are defined. According to equation (7), it is guaranteed that we can adjust the charges of the pseudoparticles so that the values $g(a\vec{s}_i)$ are reproduced. Therefore, the relation

$$\sum_{j=1}^{K}\frac{Q_j}{|\vec{R}_j - a\vec{s}_i|} = g(a\vec{s}_i) \tag{11}$$

should be satisfied for all i = 1 … K. Using a matrix $\mathcal{A} = \{1/|\vec{R}_j - a\vec{s}_i|\}$ and vectors $\vec{Q} = {}^T[Q_1, Q_2, \ldots, Q_K]$ and $\vec{P} = {}^T[g(a\vec{s}_1), g(a\vec{s}_2), \ldots, g(a\vec{s}_K)]$, we can rewrite equation (11) as

$$\mathcal{A}\vec{Q} = \vec{P}. \tag{12}$$

In the next step, we solve the linear equation (12) to obtain the charges of the pseudoparticles. An appropriate value of the radius b is about 6.0 for particles inside a cell with side length 1.0. Anderson (1992) specified that a should be about 0.4. Because of the large difference between a and b, equation (12) becomes nearly singular for high order expansions. In this case, Gaussian elimination and LU decomposition do not give a numerically accurate enough solution. Therefore, we applied singular value decomposition (SVD; Press et al. 1992) to solve the equation, and obtained better accuracy. The additional cost of the SVD is negligible.

6 Numerical Tests

We performed numerical tests on the accuracy and performance of our hardware-accelerated FMM. Here we show the results.

6.1 Accuracy of Inner-P2M2 and A2P Conversion

Here we show the result of a test on the accuracy of the A2P conversion (Section 5) and inner-P2M2 (equation (7)).

We performed the test in the following steps:

1. Locate a particle q at (r, π, π/2) (spherical coordinates). Here r runs from 1 to 10.

2. Evaluate the potential values due to q at positions defined by the spherical t-design on the surface of a sphere of radius a = 0.4 centered at the origin. The number and positions of the evaluation points depend on the expansion order p.

3. Apply the A2P conversion to the local expansion obtained in the previous step, i.e., solve equation (12) to obtain the charges of the pseudoparticles Q_j on the surface of a sphere of radius b = 6 centered at the origin. The number and positions of the pseudoparticles depend on p.

4. Evaluate the force and potential due to the pseudoparticles at the observation point L: (0.5, π, π/2).

5. Compare the result with the exact force and potential. The exact values are obtained by direct evaluation.

Figure 4 depicts the test process. Figures 5 and 6 show the results of the test. The potential error and the force error are shown in Figures 5 and 6, respectively. In both cases, the error for p = 1 to 5 behaves as theoretically expected, i.e., the potential error scales as r^-(p+2), and the force error scales as r^-(p+1). For p = 6, the error stops decreasing at r ≥ 6. This is because of the near-singularity of equation (12): when a large number of pseudoparticles are used, the solution of equation (12) suffers large computational error.
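The A2P conversion and steps 1 to 5 of the accuracy test can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the icosahedron vertices (K = 12, a spherical 5-design) stand in for the Hardin and Sloane sets, the same directions are used on both spheres, the coordinates (r, π, π/2) are read as (r, φ, θ) with azimuth φ and polar angle θ so that the source lies on the negative x axis, only r = 10 is tested, and the function names and the `rcond` cutoff are hypothetical.

```python
import numpy as np

def icosahedron():
    """12 unit vectors at the vertices of a regular icosahedron (a 5-design)."""
    g = (1.0 + np.sqrt(5.0)) / 2.0
    v = []
    for s1 in (-1.0, 1.0):
        for s2 in (-1.0, 1.0):
            v += [(0.0, s1, s2 * g), (s1, s2 * g, 0.0), (s2 * g, 0.0, s1)]
    v = np.array(v)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def potential(x, Q, R):
    """Potential at point x due to charges Q located at positions R."""
    return np.sum(Q / np.linalg.norm(x - R, axis=1))

def a2p(g, s, a, b, rcond=1e-12):
    """Solve equation (12) by SVD for pseudoparticle charges on radius b."""
    d = b * s[None, :, :] - a * s[:, None, :]      # R_j - a s_i
    A = 1.0 / np.linalg.norm(d, axis=2)            # A_ij = 1 / |R_j - a s_i|
    U, sv, Vt = np.linalg.svd(A)
    inv = np.zeros_like(sv)
    keep = sv > rcond * sv[0]                      # drop negligible singular values
    inv[keep] = 1.0 / sv[keep]
    return Vt.T @ (inv * (U.T @ g))

s = icosahedron()
a, b = 0.4, 6.0
src = np.array([-10.0, 0.0, 0.0])                  # step 1: unit charge at r = 10
g = 1.0 / np.linalg.norm(a * s - src, axis=1)      # step 2: potentials on radius a
Q = a2p(g, s, a, b)                                # step 3: A2P conversion
obs = np.array([-0.5, 0.0, 0.0])
phi = potential(obs, Q, b * s)                     # step 4: evaluate pseudoparticles
exact = 1.0 / np.linalg.norm(obs - src)            # step 5: direct evaluation
rel_err = abs(phi - exact) / exact
```

Truncating small singular values is what distinguishes this from a plain linear solve; with only 12 points the matrix is still well conditioned, so the cutoff has no effect here, but it becomes relevant for the larger point sets needed at high expansion order.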

6.2 Performance on MDGRAPE-2

Here we show the performance of the FMM code B (Section 5) measured on MDGRAPE-2 (Susukita et al. 2003). MDGRAPE-2 is one of the latest devices in the GRAPE series. It was developed for MD simulation and has additional functions compared with the original GRAPEs, so that it can handle forces that do not decay as 1/r^2, such as the Van der Waals force. However, in our tests we use MDGRAPE-2 only to calculate the Coulombic force and potential. The additional functions are not used.

For the measurement, we used two GRAPE systems. The first one consists of one MDGRAPE-2 board (64 pipelines, 192 Gflop/s) and a host computer COMPAQ DS20E (Alpha 21264/667 MHz). The second one consists of one MDGRAPE-2 board (16 pipelines, 48 Gflop/s) and a self-assembled host computer (Pentium 4/2.2 GHz, Intel D850 motherboard). We refer to the former system as "system I," and the latter as "system II."

In the test, we distributed particles uniformly within a unit cube centered at the origin, and evaluated the force on all particles. The number of particles ranges from 128K to 4M. Notations K and M denote 1024 and 1024 × 1024, respectively. We measured the calculation time at both high (p = 5) and low (p = 1) accuracy, with and without GRAPE. The finest refinement level lmax is set to lmax = 4 and 5, for runs with and without GRAPE, respectively. These values are experimentally chosen so that the overall calculation time is minimized (see Section 2.1).

In this paper we do not present in detail our experiments in the case of inhomogeneous distributions of particles, since inhomogeneity is not as important as homogeneity or close-to-uniformity in molecular dynamics simulations. However, our experiments on the two GRAPE systems show that the treecode runs faster than the FMM in the inhomogeneous case.

Results for close-to-uniform distribution cases are shown in Figures 7–10 and Tables 2–3. Figures 7 and 9 are results of system I. Figures 8 and 10 and Tables 2–3 are of system II.

In Figures 7 and 8, the calculation time of code B is plotted against the number of particles N. Results shown in Figures 7 and 8 are measured on systems I and II, respectively. Results of the direct-summation algorithm are also shown for comparison. Our code scales as O(N). On system I, runs with GRAPE are faster than those without GRAPE by factors of 5 and 60 for low (RMS relative force error ~10^-2) and high (RMS relative force error ~10^-5) accuracy, respectively. On system II, the speedup factors are 3 and 14.5. Since the amount of calculation for the M2L stage becomes more significant at higher p (Table 2), the speedup factor is larger for higher accuracy.

Table 3 shows the breakdown of the calculation time for 1M-particle runs. We can see that GRAPE significantly accelerates the M2L part and the force evaluation part. The overall performance of our implementation is limited by the speed of the communication bus between the host and GRAPE, rather than the speed of GRAPE itself. For further acceleration, we need to switch from the legacy PCI bus (32 bit/33 MHz) to faster buses, such as PCI-X or PCI Express.

Fig. 4 Description of the test for accuracy of inner-P2M2 and the A2P conversion. Numbers on the figure are steps in the test.

Fig. 5 Error of the potential calculated with inner-P2M2 and the A2P conversion. From top to bottom, six dashed curves are plotted with expansion order p = 1, 2, 3, 4, 5 and 6, respectively.

Fig. 6 Force error: details as in Figure 5.

Fig. 7 Force calculation time of FMM and direct-summation algorithm on system I. Circles denote performance of FMM on MDGRAPE-2. Pentagons denote that on the host computer. Open and filled symbols are for low (p = 1) and high accuracy (p = 5), respectively. Solid and dashed curves without symbols are performance of the direct method on MDGRAPE-2 and the host computer, respectively.

Fig. 8 Force calculation time of FMM and direct-summation algorithm on system II. Symbols as in Figure 7.

Fig. 9 Comparison of force calculation time for FMM and treecode on MDGRAPE-2 on system I. Circles are performance of FMM on MDGRAPE-2. Triangles are that of the treecode on MDGRAPE-2. Open and filled symbols are for low and high accuracy, respectively. Parameter pairs (p, θ) to obtain low and high accuracy of the treecode are (1, 1.0) and (2, 0.33), respectively.

Fig. 10 Comparison of force calculation time for FMM and treecode on MDGRAPE-2 on system II. Details as in Figure 9.

Figure 9 shows the calculation time of our FMM code and the treecode (Kawai, Makino, and Ebisuzaki 2004), both running on GRAPE. The order of the multipole expansion p and the opening angle θ for the treecode are set to (p, θ) = (1, 1.0) and (2, 0.33) for low and high accuracy, respectively. These values are chosen so that the treecode gives roughly the same RMS force error as that of the FMM. The RMS force errors at low and high accuracy are ~5 × 10^-2 and ~2 × 10^-5, respectively.

We can see that the performance of our FMM code and the treecode is almost the same. The FMM is better than the treecode at high accuracy, and worse at low accuracy.

On a particular GRAPE system, the parameters for optimal performance of the modified FMM can be determined by experiment. One should measure the performance of code B on a randomly generated particle system with different values of the finest refinement level lmax for each expansion order p from 1 to 5. For example, if the number of particles in the system is from 128K to 4M and GRAPE's peak performance is either 48 Gflop/s or 192 Gflop/s, the values of lmax to be tested are 3, 4 and 5.

Table 2
Pairwise interaction count for 1M-particle run. (Columns: accuracy; with GRAPE, lmax = 4; without GRAPE, lmax = 5. Rows include force evaluation.)

Table 3
Time breakdown for 1M-particle run on system II. (Columns: accuracy; with GRAPE, lmax = 4; without GRAPE, lmax = 5. Rows include building neighbor, M2L, and force evaluation.)

7 Discussion

We compared the performance of our FMM implementation (code B) with Wrankin's distributed parallel multipole tree algorithm (DPMTA; Wrankin and Board 1995).

We measured the performance of Wrankin's code on system II, using the serial version of DPMTA 3.1.3 available at http://www.ee.duke.edu/~wrankin/Dpmta/

For the measurement, particles are distributed in a unit cube. The expansion order and other parameters of each code are chosen so that relatively high accuracy (~10^-5) is achieved, and the performance is optimized.

Table 4 summarizes the comparison. Using GRAPE, our code outperforms Wrankin's code tenfold. Without GRAPE, our code is slower than Wrankin's code by a factor of 1.1–1.4, mainly because our code requires a larger operation count so that it can take full advantage of GRAPE.

Parallelization of the FMM on a cluster of GRAPEs requires no special techniques. Algorithms used for parallelization on a cluster of general-purpose computers (Hu and Johnsson 1996) can be applied without modification. In our modified FMM, GRAPE is used for the M2L and force evaluation stages. The presence of GRAPE has no effect on the parallelization of the tree construction and of building the neighbor and interaction lists.

In the case of the treecode, several versions of parallel codes have been developed so far. These codes are used for production runs in the field of astrophysics (Fukushige, Kawai, and Makino 2004; Fukushige, Makino, and Kawai 2005). We can follow a similar approach to parallelize our FMM code.

8 Summary

Using the special-purpose hardware GRAPE, we have successfully accelerated the FMM. In order to take full advantage of the hardware, we have modified the original FMM using Anderson's method, the pseudoparticle multipole method, and two conversion techniques we have newly invented. The experimental results show that GRAPE accelerates the FMM by a factor of 3 to 60, and the factor increases as the required accuracy becomes higher. Comparison with the treecode shows that in the case of a close-to-uniform distribution of particles, our FMM is faster at high accuracy, while the treecode is faster at low accuracy. In the case of an inhomogeneous distribution of particles, the treecode is faster than the FMM.

We suggest using code B for large-scale molecular dynamics simulations where high accuracy is demanded.

Acknowledgments

Thanks are due to Dr T. Iitaka at the Institute of Physical and Chemical Research (RIKEN) for the suggestion of using the SVD method.

We are grateful to Prof. J. A. Smith from Bridge to Asia and Prof. D. E. Keyes from Columbia University for refining the manuscript.

This work is supported by the Advanced Computing Center, RIKEN and the College of Technology, Vietnam National University, Hanoi. Part of this work was carried out while N. H. Chau was a contract researcher of RIKEN and A. Kawai was a special postdoctoral researcher of RIKEN.

Appendix A

In this appendix, we describe the derivation of equation (7), the inner expansion of P2M2.

The local expansion of the potential $\Phi(\vec{r})$ is expressed as

$$\Phi(\vec{r}) = \sum_{l=0}^{p}\sum_{m=-l}^{l}\beta_l^m\,r^{l}\,Y_l^m(\theta,\phi). \tag{13}$$

Here, $Y_l^m(\theta,\phi)$ is the spherical harmonic and $\beta_l^m$ is the expansion coefficient. In order to approximate the potential field due to the distribution of N particles, the coefficients should satisfy

$$\beta_l^m = \sum_{i=1}^{N} q_i\,\frac{4\pi}{2l+1}\,\frac{1}{r_i^{\,l+1}}\,Y_l^{m*}(\theta_i,\phi_i). \tag{14}$$

Table 4
Performance comparison with Wrankin's code. (Rows: Wrankin's code; our code with GRAPE; our code without GRAPE.)