Annual Reports in
COMPUTATIONAL CHEMISTRY
Edited by
Ralph A. Wheeler
Sponsored by the Division of Computers in Chemistry
of the American Chemical Society
Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
Linacre House, Jordan Hill, Oxford OX2 8DP, UK
32 Jamestown Road, London NW1 7BY, UK
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
First edition 2010
Copyright © 2010 Elsevier B.V. All rights reserved
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http://www.elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
Library of Congress Cataloging-in-Publication Data
A catalogue record for this book is available from the Library of Congress
British Library Cataloging in Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-444-53552-8
ISSN: 1574-1400
For information on all Elsevier publications
visit our website at elsevierdirect.com
Printed and bound in USA
Contributors
Sheng-You Huang
Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO, USA
George Khelashvili
Department of Physiology and Biophysics, Weill Medical College of Cornell University, New York, NY, USA
Kah Chun Lau
Department of Chemistry, George Washington University, Washington DC, USA

Yaakov Levy
Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel

Hongzhi Li
Institute of Molecular Biophysics, Florida State University, Tallahassee, FL, USA

Yan Ling
Department of Chemistry and Biochemistry, University of Southern Mississippi, Hattiesburg, MS, USA
San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA; Lehrstuhl für Theoretische Chemie, Universität Erlangen, Erlangen, Germany
Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO, USA
PREFACE
Annual Reports in Computational Chemistry (ARCC) was instituted to provide timely reviews of topics important to researchers in Computational Chemistry. ARCC is published and distributed by Elsevier and sponsored by the American Chemical Society's Division of Computers in Chemistry (COMP). Members in good standing of the COMP Division receive a copy of the ARCC as part of their member benefits. Since previous volumes have received such an enthusiastic response from our readers, the COMP Executive Committee expects to deliver future volumes of ARCC that build on the solid contributions in our first five volumes.
To ensure that you receive future installments of this series, please join the Division as described on the COMP website at http://www.acscomp.org
Volume 6 features 14 outstanding contributions in six sections and includes a new section devoted to Nanotechnology and the reemergence of the Chemical Education section. Topics covered (and Section Editors) include Simulation Methodologies (Carlos Simmerling), Quantum Chemistry (Gregory S. Tschumper), Chemical Education (George C. Shields), Nanotechnology (Luke E. K. Achenie), Biological Modeling (Nathan Baker), and Bioinformatics (Wei Wang). Although individual chapters in ARCC are now indexed by the major abstracting services, we plan to continue the practice of cumulative indexing of both the current and past editions to provide an easy identification of past reports.
As was the case with our previous volumes, the current volume of Annual Reports in Computational Chemistry has been assembled entirely by volunteers to produce a high-quality scientific publication at the lowest possible cost. The Editor and the COMP Executive Committee extend our gratitude to the many people who have given their time to make this edition of Annual Reports in Computational Chemistry possible. The authors of each of this year's contributions and the Section Editors have graciously dedicated significant amounts of their time to make this volume successful. This year's edition could not have been assembled without the help of Clare Caruana of Elsevier. Thank you one and all for your hard work, your time, and your contributions.

We trust that you will find this edition to be interesting and valuable. We are actively planning the seventh volume and anticipate that it will restore one or more previously popular sections, including Materials and/or Emerging Technologies. In addition, we are actively soliciting input from our readers about future topics, so please contact the editor to make suggestions and/or to volunteer as a contributor.
Sincerely,
Ralph A. Wheeler, Editor
Section Editor: Carlos Simmerling
CHAPTER 1

Advancements in Molecular Dynamics Simulations of Biomolecules on Graphical Processing Units
Contents

1 Introduction
2 An Overview of GPU Programming
  2.1 GPU/CPU hardware differences
  2.2 The emergence of GPU programming languages
  2.3 GPU programming considerations
3 GPU-Based Implementations of Classical Molecular Dynamics
  3.1 Early GPU-based MD code development
  3.2 Production GPU-based MD codes
4 Performance and Accuracy
  4.1 Performance and scaling
  4.2 Validation
5 Applications
  5.1 Protein folding
6 Conclusions and Future Directions
Acknowledgments
References
coupled with the emergence of application programming interfaces to support general purpose computation on graphics processing units (GPUs) has led to an explosion in the use of GPUs for acceleration of scientific applications. Here we explore the use of GPUs within the context of condensed-phase molecular dynamics (MD) simulations. We discuss the algorithmic differences that the GPU architecture imposes on MD codes and provide an overview of the challenges involved in using GPUs for MD, followed by a critical survey of contemporary MD simulation packages that are attempting to utilize GPUs. Finally, we discuss the possible outlook for this field.

Keywords: GPU; CUDA; stream; NVIDIA; ATI; molecular dynamics; accelerator; OpenMM; ACEMD; NAMD; AMBER

1 San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA
2 National Biomedical Computation Resource, University of California San Diego, La Jolla, CA, USA

Annual Reports in Computational Chemistry, Volume 6 © 2010 Elsevier B.V. All rights reserved
ISSN: 1574-1400, DOI 10.1016/S1574-1400(10)06001-9
1 INTRODUCTION
Since the first molecular dynamics (MD) simulation of an enzyme was described in 1977, MD has become an important tool in understanding the behavior of biomolecules. From that first 10 ps simulation of merely 500 atoms, the field has grown to the point where state-of-the-art condensed-phase simulations are numerically intensive, requiring access to large-scale supercomputers or well-designed clusters with expensive interconnects that are beyond the reach of many research groups.
Many attempts have been made over the years to accelerate classical MD simulation of condensed-phase biological systems by exploiting alternative hardware technologies. Some notable examples include ATOMS by AT&T Bell Laboratories [7], FASTRUN [8], and the MDGRAPE boards [9,14], which were designed to accelerate the direct space nonbond calculations; Clearspeed Inc., which developed an accelerator board; and, most recently, D. E. Shaw Research LLC, who developed their own specialized architecture for MD simulations.
All of these approaches have, however, failed to make an impact on mainstream MD simulation, in large part because of the original acquisition or development costs of several accelerator technologies (Table 1). These costs have posed a significant barrier to widespread development within the academic research community. Additionally, these technologies do not form part of what would be considered a standard workstation specification.

Table 1 Example cost estimates for a range of hardware MD acceleration projects
a Total development cost: $15 million [14]
This makes it difficult to experiment with such technologies, leading to a lack of sustained development or innovation and, ultimately, their failure to mature into ubiquitous community-maintained research tools.
Graphics processing units (GPUs), on the other hand, have been an integral part of personal computers for decades. Ever since 3DFX first introduced the Voodoo graphics chip in 1996, GPU development has been strongly influenced by the entertainment industry in order to meet the demands for ever-increasing realism in computer games. This has resulted in significant industrial investment in the stable, long-term development of GPU technology. Additionally, the strong demand from the consumer electronics industry has resulted in GPUs becoming cheap and ubiquitous. This, combined with substantial year-over-year increases in the computing power of GPUs, means they have the potential, when utilized efficiently, to substantially outperform CPUs, making them attractive targets for acceleration of many scientific applications, including MD simulations.

The fact that high-end GPUs can be considered standard equipment in scientific workstations means that they already exist in many research labs and can be purchased easily with new equipment. This makes them readily available to researchers and thus tempting instruments for computational experimentation. The nature of GPU hardware, however, has made its use in general purpose computing challenging to all but those with extensive three-dimensional (3D) graphics programming experience. However, as discussed in Section 2, the development of application programming interfaces (APIs) targeted at general purpose scientific computing has reduced this complexity to the point where GPUs are beginning to be accepted as serious tools for the economically efficient acceleration of an extensive range of scientific problems.
In this chapter, we provide a brief overview of GPU hardware and programming techniques and then review the progress that has been made in using GPU hardware to accelerate classical MD simulations of condensed-phase biological systems. We review some of the challenges and limitations that have faced those trying to implement MD algorithms on GPUs, consider performance numbers and validation techniques, and then highlight some recent applications of GPU-accelerated MD. Finally, we comment on the limitations of current GPU MD implementations and what the future may hold for acceleration of MD simulations on GPU hardware.

Figure 1 Peak floating-point operations per second (a) and memory bandwidth (b) for Intel CPUs and NVIDIA GPUs. Reproduced from [15].
2 AN OVERVIEW OF GPU PROGRAMMING
2.1 GPU/CPU hardware differences
In order to comprehend where the performance benefits lie and understand the complexity facing programmers wishing to utilize GPUs, it is necessary to compare the underlying nature, and design philosophies, of the GPU with those of the CPU. Conventional CPUs found in the majority of modern computers, such as those manufactured by Intel and Advanced Micro Devices (AMD), are designed for fast sequential execution of a single instruction stream. When running a program, the CPU fetches instructions and associated data from the computer's random access memory (RAM), decodes them, executes them, and then writes the results back to RAM; in Flynn's taxonomy this would be classified as single instruction, single data (SISD). The control unit receives the instruction/data pair from RAM during the decoding phase and passes the instruction to the arithmetic logic unit (ALU), which is the circuitry that carries out the logical operations on the data. Finally, there are cache units which provide local and fast temporary data storage for the CPU. Historically, performance improvements in sequential execution have been obtained by increasing CPU clock speeds and introducing more complex ALUs that perform increasingly composite operations in fewer clock cycles. Additionally, pipelining, which is executing instructions out of order or in parallel while maintaining the overall appearance of sequential execution, has also improved performance (but not calculation speed) by increasing the number of instructions a CPU can execute in a unit amount of time, and larger on-chip cache memory is often used to hide latency.
GPUs, in contrast, were designed to facilitate the display of 3D graphics by performing large numbers of floating-point operations per video frame: they are essentially specialized numeric computing engines. The dominant strategy adopted by the graphics industry to meet this requirement has been to maximize the throughput of a massive number of parallel threads which can all access the RAM on the GPU board. Herein lies the key difference with CPUs: the same operation can be carried out on different parts of the input data within the GPU's memory by an army of individual threads concurrently. Within Flynn's taxonomy, this falls into the single instruction, multiple data (SIMD) category.

Figure 2 Abstraction contrasting CPU and GPU design. Adapted from [18].
A GPU has a hierarchical structure composed of multiple streaming multiprocessors (SMs), which in turn consist of subunits of streaming processors. Memory is also hierarchical, maintaining an approximately constant size-to-speed ratio; all SMs share the same device global memory, which is large but relatively slow. Smaller, lower latency, on-chip memory which is local to each SM and available to all streaming processors within that SM is provided, and even faster register-like memory is present on each streaming processor. A read-only cache of the device global memory is available to each SM in the form of a texture cache. Physically, GPUs have a much larger number of ALUs than a CPU, but the ALUs are not as complex as the ones found in a CPU. The GPU's clock speed is normally about half that of a contemporary CPU's; however, GPUs typically have an order of magnitude larger memory bandwidth to their onboard device global memory.
2.2 The emergence of GPU programming languages
The spectrum of GPU accessibility for scientific use has two extremes. Prior to the development of current general purpose GPU programming models by the major vendors, one extreme involved pioneers in the field hijacking graphics-specific APIs, such as OpenGL, and using them as vehicles for carrying out general purpose calculations. However, development was time consuming and essentially hardware specific. At the other extreme, a compiler should exist which can compile existing scientific code for execution on GPUs without the scientist having to consider the underlying nature of the hardware being calculated on.

At present, we are somewhere in between these points; the barrier to utilizing GPU hardware for general purpose computation has been reduced by the introduction of general purpose GPU programming models such as NVIDIA's Compute Unified Device Architecture (CUDA) and ATI's Stream, although algorithmic paradigm shifts are often required in existing codes to maximize the performance offered by the massively parallel GPU hardware.
The CUDA programming model from NVIDIA appears to be the most mature and widespread in scientific applications at this moment in time, hence the discussion here will focus on specifics pertaining to it. CUDA, a C-like programming language, enables code to run concurrently on the CPU and GPU, with the assumption that the numerically intensive parts of a program will be executed on the GPU while the remaining sections, which are perhaps not suited to the GPU, remain executing on the CPU. A mechanism is provided for the two parts of the running code to communicate with each other.
CUDA abstracts the hierarchical GPU hardware structure outlined above into a programming framework, requiring the coder to write in an intrinsically parallel fashion. The small numerically intensive subroutines of code that run specifically on the GPU are termed kernels. These are executed in blocks, where each block contains multiple instances of the kernel, termed threads.

This partitioning enables the following (CUDA runtime mediated) physical mapping onto the GPU hardware: each block is run on an individual SM, with the number of threads determined by the number of physical streaming processors within the SM. As a result, only threads within the same block can synchronize with each other. This block-based parallelism and the need to keep all SM units busy in order to achieve efficient performance lead to a number of nontrivial programming considerations.
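To make the kernel, block, and thread terminology concrete, the following minimal CUDA sketch launches a trivial kernel over a one-dimensional array. It is an illustrative example only; the array, kernel name, and launch configuration are arbitrary choices for demonstration and are not taken from any of the MD codes discussed in this chapter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread scales one element of the force array.
// blockIdx.x identifies the block, threadIdx.x the thread within it.
__global__ void scale_forces(float *f, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        f[i] *= s;
}

int main()
{
    const int n = 1 << 20;                 // 2^20 force components (arbitrary)
    size_t bytes = n * sizeof(float);

    float *h_f = (float *)malloc(bytes);   // host copy
    for (int i = 0; i < n; ++i) h_f[i] = 1.0f;

    float *d_f;                            // device copy
    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_f, h_f, bytes, cudaMemcpyHostToDevice);

    // Launch: enough blocks of 256 threads each to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_forces<<<blocks, threadsPerBlock>>>(d_f, 0.5f, n);

    cudaMemcpy(h_f, d_f, bytes, cudaMemcpyDeviceToHost);
    printf("f[0] = %f\n", h_f[0]);         // expect 0.5

    cudaFree(d_f);
    free(h_f);
    return 0;
}
```

Each block of 256 threads is mapped onto an SM by the CUDA runtime, and the index arithmetic in the first line of the kernel is what distributes the data across the many concurrent threads.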
2.3 GPU programming considerations
A key strategy in improving the wall clock time to solution of a scientific problem is recasting an algorithm in a way that makes it computationally palatable for the nature of the hardware that it is being executed on; an algorithm that performs poorly on a CPU may perform many orders of magnitude better on a GPU, and vice versa. However, when dealing with scientific problems, it is essential that alternative approaches to solving the underlying physics yield the same solution, albeit via different paths. It is very tempting, given the architectural differences of GPU hardware, to change the nature of the problem being solved without a thorough understanding of the implications this has on the scientific results.
General strategies when developing efficient algorithms on GPUs include the following:

1 Ensure that host-to-device communication during a calculation is kept to a minimum; for example, one should ensure that as much of the calculation as possible remains on the GPU (see the sketch after this list). Ferrying data back and forth between the GPU and the host machine is costly due to the latency of the PCIe bus; hence, if one is storing atomic coordinates in the host's memory, then the GPU is going to be idle while it is waiting for an updated set to arrive. The same holds within the GPU itself. A corollary to this is that very often it is more efficient to recalculate an existing result on the GPU, rather than fetch it from a nonlocal location.

2 Accuracy issues that arise from hardware single precision (SP) limitations need to be controlled in a way that is acceptable to the scientific algorithm being simulated. Approaches to this include sorting floating-point values by magnitude prior to summation.

3 Recasting the problem in a vector fashion that groups data that will be operated on in the same way allows for maximizing the efficiency of the streaming processors.
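The sketch below illustrates point 1: a hedged, minimal CUDA outline of an MD-style time-stepping loop in which coordinates, velocities, and forces stay resident in device memory and only cross the PCIe bus when output is required. The kernels are trivial placeholders (a toy restoring force and a simple integrator) chosen purely to show the data flow; none of this is taken from the production codes discussed later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder force kernel: in a real MD code this would evaluate the
// nonbonded and bonded interactions; here it just assigns a toy force.
__global__ void compute_forces(const float *x, float *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = -0.1f * x[i];          // toy harmonic restoring force
}

// Placeholder integration kernel (velocity update + position update).
__global__ void integrate(float *x, float *v, const float *f, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v[i] += dt * f[i];                   // unit mass assumed
        x[i] += dt * v[i];
    }
}

int main()
{
    const int n = 3 * 10000;                 // 10,000 atoms x 3 coordinates
    const int steps = 1000, output_interval = 100;
    const float dt = 0.002f;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x, *d_v, *d_f;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_v, bytes);
    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // one-time upload
    cudaMemset(d_v, 0, bytes);

    int tpb = 256, blocks = (n + tpb - 1) / tpb;
    for (int step = 1; step <= steps; ++step) {
        // All per-step work stays on the GPU; no host round trip.
        compute_forces<<<blocks, tpb>>>(d_x, d_f, n);
        integrate<<<blocks, tpb>>>(d_x, d_v, d_f, dt, n);

        if (step % output_interval == 0) {
            // Coordinates cross the PCIe bus only for occasional output.
            cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
            printf("step %d: x[0] = %f\n", step, h_x[0]);
        }
    }

    cudaFree(d_x); cudaFree(d_v); cudaFree(d_f);
    free(h_x);
    return 0;
}
```

The important point is the structure of the loop: the per-step kernels read and write only device arrays, and cudaMemcpy appears only inside the infrequent output branch.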
It should be clear from the above discussion that while GPUs offer an attractive price/performance ratio, there are significant hurdles to utilizing them efficiently. Indeed, in some cases, the development costs of GPU-specific code may negate the cost/performance benefits.
3 GPU-BASED IMPLEMENTATIONS OF CLASSICAL MOLECULAR DYNAMICS
As illustrated in the previous section, GPUs have come a long way in terms of their ease of use for general purpose computing. In the last four years, beginning in 2006, NVIDIA's CUDA and ATI's Stream APIs have made programming GPUs significantly easier, and the addition of double-precision (DP) hardware in NVIDIA's GT200 line and ATI's FireStream series has facilitated effective implementation of MD algorithms. For the reasons discussed above, GPUs are still significantly more complex to program than traditional CPUs. However, the potential cost/performance benefit makes them enticing development platforms. It is only very recently, however, that the use of GPUs for MD simulations has begun to mature to the point where fully featured production MD codes have appeared. The lure of very high performance improvements for minimal cost has influenced early attempts at accelerating MD on GPUs.

As we see below, the race to develop MD codes on this "new" hardware has led many to adopt inappropriate or untested approximations rather than taking the time to address the shortcomings of GPUs. It is also very difficult to compare successes and performance between implementations, since a number of manuscripts show only speedups of small parts of the code or comparisons against very different types of simulations. A detailed look at what appears, at first sight, to be a very crowded and successful field uncovers only a few select codes that could be considered production ready. In this section, we provide an overview of the peer-reviewed literature on GPU-based MD along with a discussion of these production ready codes.
3.1 Early GPU-based MD code development
In what was arguably the first published implementation of GPU-accelerated MD, Yang et al. used GPUs to accelerate simulations of thermal conductivity. This work was prior to the release of the CUDA and Stream APIs, and hence the authors were forced to implement their algorithm directly through graphics APIs, achieving performance improvements of between 10 and 11 times that of a single Intel Pentium 3.0 GHz processor. While an impressive proof of concept, the Yang et al. implementation was very simplistic, containing just Lennard-Jones interactions and a neighbor list that was constructed to remain static over the course of the simulation. It thus lacked many of the important features, such as covalent terms, short- and long-range electrostatics, thermostats, barostats, neighbor list updates, and restraints, needed for MD of biological systems. Nevertheless, this pioneering study demonstrated that implementing an MD code on GPUs was feasible.
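To give a flavor of what such an early, minimal implementation involves, the kernel below computes Lennard-Jones forces from a fixed, precomputed neighbor list. It is a hedged illustration of the general approach written for this chapter; it is not code from Yang et al., and it omits periodic boundary conditions, cutoff handling, and energy accumulation.

```cuda
#include <cuda_runtime.h>

// Simplified Lennard-Jones force kernel over a static neighbor list.
// Each thread handles one atom i and loops over its stored neighbors.
// pos: (x,y,z) packed as float4 (w unused); nbr: neighbor indices with
// max_nbr entries per atom, padded with -1. Reduced LJ units assumed.
__global__ void lj_forces(const float4 *pos, const int *nbr, int max_nbr,
                          float4 *force, float eps, float sigma, int n_atoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    float4 pi = pos[i];
    float fx = 0.0f, fy = 0.0f, fz = 0.0f;
    float sig2 = sigma * sigma;

    for (int k = 0; k < max_nbr; ++k) {
        int j = nbr[i * max_nbr + k];
        if (j < 0) break;                       // padded end of list

        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;

        // LJ pair force divided by r:
        // F/r = 24*eps*(2*(sigma^2/r^2)^6 - (sigma^2/r^2)^3) / r^2
        float sr2 = sig2 / r2;
        float sr6 = sr2 * sr2 * sr2;
        float fscale = 24.0f * eps * (2.0f * sr6 * sr6 - sr6) / r2;

        fx += fscale * dx;
        fy += fscale * dy;
        fz += fscale * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.0f);
}
```

Because the neighbor list never changes, the kernel maps naturally onto the SIMD execution model, which is one reason this class of problem was attacked first.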
The advent of the CUDA and Stream programming APIs made programming GPUs significantly easier and brought with it an explosion of GPU MD implementations. Most early implementations of MD on GPUs are characterized by an exploration of the field, with the development of codes and GPU-specific algorithms focused on simplistic, artificial, or very specific model problems rather than the application of GPUs to "real-world" production MD simulations.
Liu et al., like Yang et al., chose to implement just a simplistic van der Waals potential, allowing them to avoid all of the complexities inherent in production MD simulations of condensed-phase systems. Unlike Yang, however, Liu et al. recomputed their neighbor list periodically, providing the first example of a neighbor list update for MD on GPUs.
Stone et al. described a series of target algorithms for molecular modeling computations, including techniques for direct Coulomb summation for calculating charge-charge interactions within a cutoff. They also discussed possible techniques for evaluation of forces in MD, providing the first mention of a combined treatment of direct space van der Waals and electrostatics in a GPU implementation. Their implementation, however, did not include any actual MD but instead focused on the more simplistic applications of ion placement and the calculation of time-averaged Coulomb potentials in the vicinity of a simulated system. While providing an example of how Coulomb interactions can be accelerated with GPUs and laying the groundwork for an experimental GPU-accelerated version of NAMD, this work remained some distance from production MD simulations.
Following on the heels of Yang et al., a number of groups began implementing their own MD codes on GPUs, although most were still simply proof-of-concept prototypes with limited applicability for production MD calculations. Some provided schemes for neighbor list updates but still applied them only to simple van der Waals systems. Anderson et al. were among the first to include the calculation of covalent terms, adding GPU computation of van der Waals and harmonic bond potentials to their HOOMD code in order to study nonionic liquids. They also included integrators and neighbor lists in their implementation; however, while the HOOMD GPU implementation went a step closer to a full MD implementation, it still neglected most of the complexities, including both short- and long-range electrostatics, angle terms, torsion terms, and constraints, required for simulating condensed-phase systems.

Another early effort demonstrated GPU-accelerated simulations of liquid water. The approach was similar to that of Anderson et al. but also included angle and short-range electrostatic terms. While a demonstration of a condensed-phase simulation, the approach used was still extremely restrictive and of limited use in real-world applications.

These early GPU-based MD implementations are characterized by significantly oversimplifying the mathematics in order to make implementation on a GPU easier, neglecting, for example, electrostatics, covalent terms, and heterogeneous solutions. This has resulted in a large number of GPU implementations being published but none with any applicability to "real-world" production MD simulations. It is only within the last year (2009/2010) that useful GPU implementations of MD have started to appear.
3.2 Production GPU-based MD codes
The features typically necessary for a condensed-phase production MD code for biological simulations are explicit and implicit solvent implementations, correct treatment of long-range electrostatics, support for different statistical ensembles (NVT, NVE, and NPT), thermostats, restraints, constraints, and integration algorithms. At the time of writing, there are only three published MD GPU implementations that could be considered production quality codes, in addition to independent implementations, such as GPU support for generalized Born implicit solvation in AMBER, whose descriptions had not yet been published.
The ACEMD package by Harvey et al. could be considered the first fully featured production GPU MD code; it includes support for periodic boundaries and, more importantly, both short- and long-range electrostatics using a smooth particle mesh Ewald (PME) treatment. The OpenMM library of Friedrichs et al. initially targeted the implicit solvent generalized Born model on small- and medium-sized systems; later work improved the OpenMM library and adapted it to explicit solvent simulation, including treatment of long-range electrostatics. Additionally, a GPU-accelerated version of GROMACS has been developed which works via links to the OpenMM library. GPU acceleration of explicit solvent calculations is also available in NAMD v2.7b2, although the acceleration is limited since only the direct space nonbond interactions are calculated on the GPU at present, necessitating a synchronization with the CPU at every step. A comparison of the key features of production MD codes, at the time of writing, is given in Table 2.
Of these codes, the AMBER implementation includes the broadest set of features, capable of running implicit and explicit solvent simulations in all three ensembles with flexible restraints on any atoms, as well as allowing the use of multiple precision models, although it only supports a single GPU per MD simulation at present. Some of the other codes do not include all of the key features for MD simulation, such as pressure coupling and implicit solvent models, although this will almost certainly change in the future. The NAMD implementation is CPU centric, focusing on running MD in a multiple node, multiple GPU environment, whereas the others implement all MD features on the GPU and strive to optimize MD performance on a single GPU or multiple GPUs within a single node. We note that of all the production MD codes available, OpenMM is the only one to support both NVIDIA and ATI GPUs; the others are developed just for NVIDIA GPUs. We also note that ACEMD and AMBER are commercial products, whereas the others are available under various open-source licensing models.
Table 2 Key feature comparison between the GPU-accelerated MD codes (columns: code, simulation implementation, GPU acceleration, multiple GPU support, GPU type, and licensing model)
a GROMACS has been implemented with OpenMM
4 PERFORMANCE AND ACCURACY
4.1 Performance and scaling
The performance of MD simulations on modern clusters and supercomputers is currently limited by the communication bottlenecks that occur due to the significant imbalances that exist between CPU speeds and hardware interconnects. The use of GPUs does nothing to alleviate this and indeed actually exacerbates it by making an individual node faster and thus increasing the amount of communication per unit of time that is required between nodes. For this reason, GPU-accelerated MD does not offer the ability to run substantially longer MD simulations than are currently feasible on the best supercomputer hardware, nor does it provide a convincing case for the construction of large clusters of GPUs; however, what it does offer is the ability to run substantially more sampling on a workstation or single node for minimal cost. The huge performance gap that exists between cluster interconnects and GPUs has meant that the majority of implementations have focused on utilizing just a single GPU (OpenMM, AMBER) or multiple GPUs within a single node (ACEMD). Only NAMD has attempted to utilize multiple nodes, but with success that is largely due to simulating very large systems and not attempting to optimize single-node performance, thus requiring large numbers of GPUs to achieve only modest speedups and negating many of the cost/performance benefit arguments. Thus the benefit of GPUs to condensed-phase MD should be seen as condensing small (2-8 node) clusters into single workstations for a fraction of the cost, rather than providing a way to run hundreds of microseconds of MD per day on large clusters of GPUs.
A fair comparison of performance across current implementations is very difficult since it is almost impossible to run identical simulations in different programs, and indeed even within the same program it is not always possible to make a fair comparison, since additional approximations are often made in the GPU implementation in the desire to achieve larger speedups, without considering such approaches on the CPU. There are also numerous situations where people compare the performance of individual kernels, such as the Coulomb sum, rather than the complete implementation. Indeed, a careful look at the current literature finds speedups ranging from 7 to 700+. To understand why such numbers might be misleading, consider, as one example, a comparison of simulations of various boxes of water between a GPU implementation and CHARMM on a single CPU that at no point mentions the version of CHARMM used, the compilers used, or even the settings used in the CHARMM code. It should be noted that, largely for historical reasons, the use of default settings in CHARMM tends to give very poor performance. There are then of course multiple optimizations that can be made on the GPU due to the simplicity of the water model. The first is the use of cubic boxes, which can benefit vectorization on the GPU; for codes supporting PME, it also provides more optimal fast Fourier transform (FFT) performance. The second is the use of the SPC/Fw water model which, being fully flexible, avoids the need to implement bond constraints on
the GPU. Finally, the use of a pure water box means that all molecules are essentially identical. This allows one to hard code all of the various parameters, since all bonds are identical, all oxygen charges are identical, etc., and thus avoid the additional costs associated with doing such lookups on the GPU. For these reasons, the performance and speedups quoted for various GPU implementations should typically be considered an upper bound on the performance achievable. Additionally, many factors determine the performance of GPU-accelerated
MD codes. Implicit solvent simulations in general show much greater performance boosts over explicit solvent simulations due to the reduced complexity of the underlying algorithm; specifics include avoiding the need for FFTs and the use of infinite cutoffs, which in turn removes the complexity of maintaining a neighbor list. Substantial speedups have been reported for the single-precision OpenMM code relative to what is presumably AMBER 9's DP sander implementation for systems of around 600 atoms, reaching more than two orders of magnitude for larger systems. Similar speedups have been observed in direct comparisons between AMBER's PMEMD code running on 2.8 GHz Intel E5462 CPUs and NVIDIA C1060 Tesla GPUs, while OpenMM also showed impressive linear performance scaling with system size in its non-PME explicit solvent simulations and a speedup of at least 19-fold. However, it is unclear from the OpenMM manuscript whether the comparisons are like for like, since the AMBER and NAMD numbers appear to be for full PME-based explicit solvent simulations. ACEMD showed that its 3-CPU/3-GPU performance was roughly equivalent to 256-CPU NAMD on the DHFR system, with a comparison for a 16-CPU/16-GPU configuration also reported.
4.2 Validation
While the majority of articles describing new GPU MD implementations have focused considerable attention on performance comparisons to CPU simulations, there has been very little effort to comprehensively test and validate the implementations, both in terms of actual bugs and in the use of various approximations such as single precision or alternative electrostatic treatments. Since DP has only recently become available on GPUs, and because SP still offers a more than 10-fold performance enhancement, all of the GPU-based MD implementations use either single precision or a combination of hybrid single and DP math. Several authors have attempted to provide validation of this and other approximations, but often only in a limited fashion, preferring instead to focus on performance. Some, for example, simply ran simulations on the CPU and GPU and then provided plots of energy and temperature profiles for the two simulations without any form of statistical analysis.
Others simply compare the deviation in atom positions between two runs on different CPU counts and on the GPU. Harvey et al. attempted a more detailed validation of ACEMD, but this was still far from comprehensive. For example, they stated in their manuscript that "Potential energies were checked against NAMD values for the initial configuration of a set of systems, ..., in order to verify the correctness of the force calculations by assuring that energies were identical within 6 significant figures." Since scalar potential energies do not convey information about the vector forces, it is unclear how the authors considered this a validation of their force calculations. They provide a table with energy changes in the NVE ensemble per nanosecond per degree of freedom but do not provide any independent simulations for comparison. The authors also state that "... we validate in this section the conservation properties of energy in a NVT simulation ...", which is of little use in validation since energy is not a conserved quantity in the NVT (canonical) ensemble. Additionally, they carried out calculations of Na-Na pair distribution functions using their ACEMD GPU code and also GROMACS on a CPU; however, the lack of consistency in the simulation parameters between GPU and CPU and the clear lack of convergence in the results mean that the validation is qualitative at best.
Another validation consisted of simply examining energy conservation for simulations of the lambda repressor and stating, although (as with Harvey et al.) not providing the numbers in a table to ease comparison, that this compares favorably with other DP CPU implementations.
The push to highlight performance on GPUs has meant that not one of the currently published papers on GPU implementations of MD actually provides any validation of the approximations made in terms of statistical mechanical properties. For example, one could show that converged simulations run on a GPU and a CPU give identical radial distribution functions, order parameters, and residual dipolar couplings, to name but a few possible tests.
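As a sketch of what one such statistical-mechanical check might look like, the CUDA kernel below accumulates a radial distribution function histogram from stored trajectory frames, so that g(r) curves from converged CPU and GPU runs can be compared. It is an assumed, minimal illustration written for this chapter (periodic boundary conditions and normalization are omitted) and is not part of any of the packages discussed.

```cuda
#include <cuda_runtime.h>

// Accumulate a radial distribution function histogram for one frame.
// Each thread handles one atom i and bins its distances to atoms j > i.
// pos: packed (x,y,z) coordinates; hist: nbins counters; dr: bin width.
// Periodic boundary conditions are omitted for brevity.
__global__ void rdf_histogram(const float4 *pos, unsigned int *hist,
                              int n_atoms, int nbins, float dr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    float4 pi = pos[i];
    float rmax = nbins * dr;

    for (int j = i + 1; j < n_atoms; ++j) {
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r = sqrtf(dx * dx + dy * dy + dz * dz);
        if (r < rmax) {
            int bin = (int)(r / dr);
            atomicAdd(&hist[bin], 1u);   // histogram shared by all threads
        }
    }
}

// After accumulating over many frames from the CPU and GPU trajectories,
// each histogram is normalized by the ideal-gas count for its bin volume
// and the resulting g(r) curves are compared.
```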
5 APPLICATIONS
While a significant number of published papers describe GPU implementations of MD, a review of the literature reveals very few cited uses of these codes in "real-world" simulations. Indeed, only Pande et al. have such papers published at the time of writing. This serves to underscore the nascent nature of this field.
5.1 Protein folding
In the only published examples of the use of GPU-accelerated bio-MD simulations, Pande et al. have used the OpenMM library to study protein folding in implicit solvent, including the Fip35 WW domain, whose folding time determined experimentally is ~13 μs. With an average performance of 80-200 ns/day on a single GPU for this 544-atom protein fragment, and utilizing the Folding@Home distributed computing network, they generated independent trajectories totaling over 2.73 ms of ensemble-averaged results, with an average length of 207 ns per trajectory and with some trajectories of greater than 3 μs in length, allowing a direct exploration of the folding landscape. Similar trajectory lengths were calculated for the NTL9 (922 atom) case. Additionally, Harvey and De Fabritiis performed a 1 μs explicit solvent MD simulation of the villin headpiece to probe its folding kinetics as part of their ACEMD benchmark results and achieved 66 ns/day on a three-GPU-equipped workstation. These examples demonstrate the utility of GPU-accelerated MD implementations in helping researchers use personal workstations to reach simulation timescales that would typically only be possible using large clusters and obtain ensemble-averaged results that provide sampling timeframes comparable to experiment. This potentially opens the door to studying a whole range of relevant biological events without requiring access to large-scale supercomputer facilities.
6 CONCLUSIONS AND FUTURE DIRECTIONS
It should be clear from this chapter that the field of GPU acceleration of condensed-phase biological MD simulations is still in its infancy. Initial work in the field concentrated on artificially simplistic models, and it is only recently that production quality MD codes have been developed that can make effective use of this technology. The pressure to achieve maximum performance has led to a number of shortcuts and approximations being made, many without any real validation or rigorous study. What initially appears to be an established and extremely active field actually, upon scratching the surface, consists of only a few select codes which could be considered to be production ready and even fewer examples of "real-world" use. However, the current cost benefits of GPUs are enticing, and this is driving both code and hardware development.
In a few short years, GPU-based MD codes have evolved from proof-of-concept prototypes to production-level software packages. Despite the substantial progress made in code development, the difficulty in programming GPU devices still persists, forcing approximations to be made to circumvent some of the limitations of GPU hardware. However, NVIDIA's recently released Fermi architecture provides features such as full support for DP and error-correcting memory, along with a more versatile FFT implementation, that many consider vital to effective use of GPUs for MD simulations. Given this, a number of established groups in the biological MD field are in the process of developing GPU-accelerated versions of
their software. This will bring more competition to the field and, hopefully, with it a better focus on extensive validation of the approximations made.

It is anticipated that, with the release of GPU versions of widely used MD codes, the use of GPUs in research involving MD will increase rapidly over the coming years, provided that developers can demonstrate the credibility of these implementations under the same degree of scrutiny to which CPU implementations have been subjected over the years.

REFERENCES
1 McCammon, J.A., Gelin, B.R., Karplus, M Dynamics of folded proteins Nature 1977, 267, 585—90
2 Duan, Y., Kollman, P.A Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution Science 1998, 282, 740—4
3 Yeh, I., Hummer, G Peptide loop-closure kinetics from microsecond molecular dynamics simulations in explicit solvent J Am Chem Soc 2002, 124, 6563—8
4 Klepeis, J.L., Lindorff-Larsen, K., Dror, R.O., Shaw, D.E Long-timescale molecular dynamics simulations of protein structure and function Curr Opin Struct Biol 2009, 19, 120—7
5 Sanbonmatsu, K.Y., Joseph, S., Tung, C Simulating movement of tRNA into the ribosome during decoding Proc Natl Acad Sci USA 2005, 102, 15854—9
6 Freddolino, P.L., Arkhipov, A.S., Larson, S.B., McPherson, A., Schulten, K Molecular dynamics simulations of the complete satellite tobacco mosaic virus Structure 2006, 14, 437—49
7 Bakker, A.F., Gilmer, G.H., Grabow, M.H., Thompson, K A special purpose computer for molecular dynamics calculations J Comput Phys 1990, 90, 313—35
8 Fine, R., Dimmler, G., Levinthal, C FASTRUN: A special purpose, hardwired computer for molecular simulation Protein Struct Funct Genet 1991, 11, 242—53
9 Susukita, R., Ebisuzaki, T., Elmegreen, B.G., Furusawa, H., Kato, K., Kawai, A., Kobayashi, Y., Koishi, T., McNiven, G.D., Narumi, T., Yasuoka, K Hardware accelerator for molecular dynamics: MDGRAPE-2 Comput Phys Commun 2003, 155, 115—31
10 Case, D.A., Darden, T.A., Cheatham, T.E., Simmerling, C.L., Wang, J., Duke, R.E., Luo, R., Crowley, M., Walker, R.C., Zhang, W., Merz, K.M., Wang, B., Hayik, S., Roitberg, A., Seabra, G., Kolossvary, I., Wong, K.F., Paesani, F., Vanicek, J., Wu, X., Brozell, S.R., Steinbrecher, T., Gohlke, H., Yang, L., Tan, C., Mongan, J., Hornak, V., Cui, G., Mathews, D.H., Seetin, M.G., Sagui, C., Babin, V., Koll man, P.A., AMBER 10, University of California, San Francisco, 2008
11 Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Simmerling, C., Wang, B., Woods, R.J The Amber biomolecular simulation programs J Comput Chem 2005, 26, 1668—88
14 Narumi, T., Ohno, Y., Noriyuk, F., Okimoto, N., Suenaga, A., Yanai, R., Taiji, M A high-speed special-purpose computer for molecular dynamics simulations: MDGRAPE-3 In From Computational Biophysics to Systems Biology (eds J Meinke, O Zimmermann, S Mohanty and U.H.E Hansmann), J von Neumann Institute for Computing, Jülich, 2006, pp 29—36
15 NVIDIA: Santa Clara, CA, CUDA Programming Guide, http://developer.download.nvidia.com/ compute/cuda/30/toolkit/docs/NVIDIACUDAProgrammingGuide3.0.pdf (Accessed March 6, 2010)
16 von Neumann, J First draft of a report on the EDVAC IEEE Ann Hist Comput 1993, 15, 27—75
17 Flynn, M.J., Some computer organizations and their effectiveness IEEE Trans Comput 1972, C-21, 948—60
18 Kirk, D.B., Hwu, W.W Programming Massively Parallel Processors, Morgan Kaufmann Publishers, Burlington, 2010
19 Yang, J., Wang, Y., Chen, Y GPU accelerated molecular dynamics simulation of thermal conductivities J Comput Phys 2007, 221, 799—804
20 AMD: Sunnyvale, CA, ATI, www.amd.com/stream (Accessed March 14, 2010)
21 Woo, M., Neider, J., Davis, T., Shreiner, D OpenGL Programming Guide: The Official Guide to Learning OpenGL, version 1.2, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
22 Liu, W., Schmidt, B., Voss, G., Müller-Wittig, W In High Performance Computing–HiPC 2007: Lecture Notes in Computer Science (eds S Aluru, M Parashar, R Badrinath and V.K Prasanna), Vol 4873, Springer, Berlin/Heidelberg, 2007, pp 185—96
23 Stone, J.E., Phillips, J.C., Freddolino, P.L., Hardy, D.J., Trabuco, L.G., Schulten, K Accelerating molecular modeling applications with graphics processors J Comput Chem 2007, 28, 2618—40
24 Phillips, J.C., Stone, J.E., Schulten, K Adapting a message-driven parallel application to GPU-accelerated clusters In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 1—9, IEEE Press, Piscataway, NJ, USA, 2008
25 van Meel, J.A., Arnold, A., Frenkel, D., Portegies Zwart, S.F., Belleman, R.G Harvesting graphics power for MD simulations Mol Simulat 2008, 34, 259—66
26 Rapaport, D.C Enhanced molecular dynamics performance with a programmable graphics processor, arXiv Physics, 2009, arXiv:0911.5631v1
27 Anderson, J.A., Lorenz, C.D., Travesset, A General purpose molecular dynamics simulations fully implemented on graphics processing units J Comput Phys 2008, 227, 5342—59
28 Davis, J., Ozsoy, A., Patel, S., Taufer, M Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors, Springer, Berlin/Heidelberg, 2009
29 Harvey, M.J., Giupponi, G., De Fabritiis, G ACEMD: Accelerating biomolecular dynamics in the microsecond time scale J Chem Theory Comput 2009, 5, 1632—9
30 Friedrichs, M.S., Eastman, P., Vaidyanathan, V., Houston, M., Le Grand, S., Beberg, A.L., Ensign, D L., Bruns, C.M., Pande, V.S Accelerating molecular dynamic simulation on graphics processing units J Comput Chem 2009, 30, 864—72
31 Case, D.A., Darden, T.A., Cheatham, T.E.III, Simmerling, C.L., Wang, J., Duke, R.E., Luo, R., Crowley, M., Walker, R.C., Williamson, M.J., Zhang, W., Merz, K.M., Wang, B., Hayik, S., Roitberg, A., Seabra, G., Kolossv�ary, I., Wong, K.F., Paesani, F., Vanicek, J., Wu, X., Brozell, S.R., Steinbrecher, T., Gohlke, H., Yang, L., Tan, C., Mongan, J., Hornak, V., Cui, G., Mathews, D.H., Seetin, M.G., Sagui, C., Babin, V., Kollman, P.A Amber 11, Technical report, University of Cali fornia, San Francisco, 2010
32 Darden, T., York, D., Pedersen, L Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems J Chem Phys 1993, 98, 10089—92
33 Essmann, U., Perera, L., Berkowitz, M.L., Darden, T., Lee, H., Pedersen, L.G A smooth particle mesh Ewald method J Chem Phys 1995, 103, 8577—93
34 Harvey, M.J., De Fabritiis, G An implementation of the smooth particle mesh Ewald method on GPU hardware J Chem Theory Comput 2009, 5, 2371—7
35 Eastman, P., Pande, V.S Efficient nonbonded interactions for molecular dynamics on a graphics processing unit J Comput Chem 2010, 31, 1268—72
36 Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M CHARMM:
A program for macromolecular energy, minimization, and dynamics calculations J Comput Chem 1983, 4, 187—217
37 Wu, Y., Tepper, H.L., Voth, G.A Flexible simple point-charge water model with improved liquid-state properties J Chem Phys 2006, 124, 24503
38 Grand, S.L., Goetz, A.W., Xu, D., Poole, D., Walker, R.C Accelerating of amber generalized born calculations using nvidia graphics processing units 2010 (in preparation)
39 Grand, S.L., Goetz, A.W., Xu, D., Poole, D., Walker, R.C Achieving high performance in amber PME simulations using graphics processing units without compromising accuracy 2010 (in preparation)
40 Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kale, L., Schulten, K Scalable molecular dynamics with NAMD J Comput Chem 2005, 26, 1781—802
41 Ensign, D.L., Pande, V.S The Fip35 WW domain folds with structural and mechanistic heterogeneity in molecular dynamics simulations Biophys J 2009, 96, L53—5
42 Voelz, V.A., Bowman, G.R., Beauchamp, K., Pande, V.S Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39) J Am Chem Soc 2010, 132, 1526—8
43 Shirts, M., Pande, V.S Computing: Screen savers of the world unite! Science 2000, 290, 1903—4
44 NVIDIA Corporation Next generation CUDA compute architecture: Fermi, 2009
CHAPTER 2

Quantum Chemistry on Graphics Processing Units

3.4 Density functional theory with Daubechies wavelets
4 Ab Initio Electron Correlation Methods
5 Quantum Monte Carlo
6 Concluding Remarks
Acknowledgments
References
We review implementations for acceleration of quantum chemistry and computational condensed matter physics simulations on graphics processing units (GPUs) as documented in the peer-reviewed literature. We give a general overview of programming techniques and concepts that should be considered when porting scientific software to GPUs. This is followed by a discussion of Hartree-Fock and density functional theory, wave function-based electron correlation methods, and quantum Monte Carlo, in which we outline the underlying problems and present the approaches which aim at exploiting the performance of the massively parallel GPU hardware. We conclude with an outlook on the trends to be expected in the foreseeable future.
1 San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA
2 Lehrstuhl für Theoretische Chemie, Universität Erlangen, Erlangen, Germany

Annual Reports in Computational Chemistry, Volume 6 © 2010 Elsevier B.V. All rights reserved
ISSN: 1574-1400, DOI 10.1016/S1574-1400(10)06002-0
1 INTRODUCTION
Commodity graphics processing units (GPUs) are becoming increasingly popular to accelerate molecular and condensed matter simulations due to their low cost and potential for high performance when compared with central processing units (CPUs). In many instances, classical approximations are very successful for such simulations. However, a large number of problems of contemporary nano-, bio-, or materials science require a quantum mechanical description of the electronic structure [1—3]. This chapter provides an overview of recent developments within quantum chemistry and computational condensed matter physics that utilize accelerator hardware for this purpose.
Quantum chemistry and solid-state physics codes implement relatively complex algorithms, and porting them to GPUs requires restructuring these algorithms to take advantage of the specialized hardware. A successful GPU implementation requires, for example, a careful consideration of the memory hierarchy and memory access patterns. On single-precision GPUs, the numerical accuracy is a central issue because six to seven significant figures are frequently insufficient to match the accuracy of the underlying theoretical methods. Finally, care should be taken to allow for a coevolution of the code with the hardware. There are two general strategies for an implementation. The first is a complete reimplementation of existing functionality into a new software package. The most common way, however, is to incrementally include GPU kernels for the computationally intensive parts of existing software packages. The latter approach has the advantage of retaining the full functionality of software packages that in many cases have evolved over several decades.
This chapter begins with a brief introduction to the general concepts that have to be considered in order to successfully port scientific software to GPUs. The rest of this chapter is structured according to the different theoretical models commonly used in quantum chemistry, beginning with density functional theory (DFT) in Section 3, which also covers Hartree-Fock (HF) theory. Section 4 deals with wave function-based electron correlation methods and Section 5 with quantum Monte Carlo (QMC). Each of these sections contains an overview of the critical parts of the underlying theory followed by a presentation and analysis of approaches that have been taken to accelerate the computationally intensive parts on GPUs. Section 6 summarizes the present state of GPU implementations for quantum chemistry and finishes with general conclusions on trends to be expected in the foreseeable future.
2 SOFTWARE DEVELOPMENT FOR GRAPHICS PROCESSING UNITS
An excellent introduction to software development for GPUs, including a discussion of the hardware and its historic development, can be found in the book of Kirk and Hwu. In order to write efficient programs for GPUs, it is necessary to have an understanding of the characteristics of the GPU hardware architecture.
A GPU is an example of a massively parallel stream-processing architecture which uses the single-instruction multiple data (SIMD) vector processing model. Typical GPUs contain many arithmetic units which are arranged in groups that share fast access memory and an instruction unit. The high density of arithmetic units, however, comes at the expense of cache size and control units. The NVIDIA GeForce 8800 GTX GPU, which was released in late 2006, for example, consists of 16 streaming multiprocessors (SMs), each of which is composed of eight scalar processors (ScaPs). Each SM operates independently of the other SMs, and at any given clock cycle each ScaP within an SM executes the same instruction but for different data. Due to this intrinsic parallelization, a GPU can outperform a standard CPU for tasks which exhibit a dense level of data parallelism. Successful approaches in GPU programming therefore require exposing the data parallelism in the underlying problem.

Each SM has access to four different types of on-chip memory with high bandwidth. In the case of the NVIDIA GeForce 8800 GTX, these are 1024 local registers per ScaP, shared memory (cache) of 16 kilobytes (KB), read-only constant cache of 8 KB to speed up reads from the constant memory space, and read-only texture cache of 8 KB to speed up reads from the texture memory space. In addition, a large, noncached off-chip graphics card memory is available; this memory, however, has a high latency of approximately 500 GPU cycles. Applications on a GPU are organized into streams and kernels; the former represent blocks of data while the latter execute operations on the data. Before a GPU kernel is executed, the CPU must copy the required data to the GPU memory. To maximize the speedup of the implemented kernels, the algorithm has to be adapted to hardware-dependent features such as the memory layout. Copy operations between main memory and graphics card memory, for example, should be avoided because access to the main memory has a high latency on the order of hundreds of GPU cycles. One of the main problems when programming GPUs is the limited size of the working memory (registers and caches) available on chip. A large number of parallel threads should therefore be run concurrently to hide the latency of the registers and the shared and global memory and to avoid pipeline stalls.
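As a generic illustration of how the fast on-chip shared memory is used, and of why many concurrent threads are needed to hide global memory latency, the following block-wise sum reduction is a standard CUDA pattern. It is a hedged, self-contained sketch (array names and block size are arbitrary) and is not taken from any quantum chemistry code.

```cuda
#include <cuda_runtime.h>

// Block-level sum reduction: each block loads a tile of the input into
// fast on-chip shared memory, reduces it there, and writes one partial
// sum per block back to the (slow) global device memory.
// Assumes blockDim.x is a power of two.
__global__ void block_sum(const float *in, float *partial, int n)
{
    extern __shared__ float tile[];          // shared memory, sized at launch

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      // coalesced global load
    __syncthreads();                         // all threads in the block wait

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = tile[0];       // one global write per block
}

// Example launch (host side): 256 threads per block, shared memory sized
// to hold one float per thread; the per-block partial sums can then be
// reduced again on the GPU or summed on the CPU.
//   int tpb = 256, blocks = (n + tpb - 1) / tpb;
//   block_sum<<<blocks, tpb, tpb * sizeof(float)>>>(d_in, d_partial, n);
```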
It is important to realize that many of these considerations are not only important for GPU programming. The arrangement of data in a data-parallel fashion, for example, is also important for parallel programming of distributed memory architectures, which are found in most of today's standard CPU clusters. Thus many of the techniques employed to improve the parallel efficiency of quantum chemistry codes are also applicable to GPUs. The same holds for the optimization of memory access patterns. A general
example of a portable algorithm is the fastest Fourier transform in the west (FFTW) Fourier transform library, which reaches optimal performance on the underlying hardware by tuning itself to that hardware at run time.
Early use of GPUs required one to describe the problem to be solved in terms of a graphics pipeline, employing either the OpenGL or DirectX graphics programming languages. This complexity made general purpose computation on GPUs a research topic in itself. However, with the release of NVIDIA's compute unified device architecture (CUDA) and related application programming interfaces (APIs), implementations of algorithms for GPUs using a relatively simple extension of the standard C language have become possible. A detailed overview of these programming models can be found in the literature. In addition, libraries are available that provide algorithms for commonly used problems in quantum chemistry and solid-state physics, such as Fourier transforms (CUFFT).
The first generation of GPUs to support CUDA, such as the NVIDIA GeForce 8800 GTX, only featured 32-bit single-precision (SP) arithmetic and thus was of only limited use for quantum chemistry; major efforts had to be made to deal with roundoff errors resulting from the lack of 64-bit double-precision (DP) data types. The second generation of GPUs introduced the missing 64-bit arithmetic, albeit only at an eighth of the SP performance, and GPU cards dedicated to general purpose computing, such as the NVIDIA Tesla C1060, which also provides up to 4 gigabytes (GB) of onboard memory, were introduced. The low speed of the DP arithmetic and missing features such as error-correcting code (ECC) memory, however, still hamper widespread acceptance of this generation of GPUs for scientific computing as compared to multisocket CPUs. The third generation of GPUs (such as the NVIDIA Fermi) will solve some of the major problems of the earlier models. Most importantly, DP support will be included at only half the speed of SP arithmetic. The availability of a global address space and 64-bit support will help to address the memory requirements of larger problems and support multiple GPUs in an easier and more transparent fashion. Access to CPU main memory will remain slow, however, because the data transfer takes place over the peripheral component interconnect (PCI) bus.
3 KOHN-SHAM DENSITY FUNCTIONAL AND HARTREE-FOCK THEORY
Due to its excellent balance between accuracy and computational cost, Kohn-Sham density functional theory (KS-DFT) is one of the most widely used methods to investigate electronic ground states and their properties in chemistry and materials science, complementing the wave function-based electron correlation methods which are discussed in Section 4.
There are two major computational bottlenecks in KS-DFT and HF calcula
self-consistent field (SCF) equations The latter requires diagonalization of the Fock
Table 1 Summary of the capabilities and performance of GPU-based KS-DFT and HF implementations published to date [21,23,25]
The computational effort for the formation of the KS (or Fock) matrix is dominated by the evaluation of the two-electron repulsion integrals (ERIs), which are required for the Coulomb and exact-exchange contributions, and, in the case of DFT, also by the numerical quadrature of the exchange-correlation (XC) contribution. GPU implementations of these steps are reviewed in the remainder of this section.
3.1 Electron repulsion integrals
The ERIs which are required in quantum chemistry are given as

(\mu\nu|\lambda\sigma) = \int\!\!\int \chi_\mu(\mathbf{r})\,\chi_\nu(\mathbf{r})\,\frac{1}{|\mathbf{r}-\mathbf{r}'|}\,\chi_\lambda(\mathbf{r}')\,\chi_\sigma(\mathbf{r}')\,\mathrm{d}\mathbf{r}\,\mathrm{d}\mathbf{r}',

where the \chi are atom-centered basis functions, usually Cartesian Gaussian functions. In general, these basis functions are contracted, that is, linear combinations of primitive Gaussian functions, so that each contracted ERI is a sum over primitive ERIs,

(\mu\nu|\lambda\sigma) = \sum_{pqrs} c_{p\mu}\, c_{q\nu}\, c_{r\lambda}\, c_{s\sigma}\, (pq|rs).
The number of ERIs formally grows with the fourth power of the number of basis functions and thus rapidly with the size of the molecule under consideration. Although for large systems most of the integrals can be neglected after suitable prescreening, the number of ERIs that need to be calculated represents a major computational bottleneck. Many different algorithms have been devised for the calculation of these ERIs and their efficiency depends on the contraction length and the angular momentum of the basis functions involved. Most
quantum chemistry codes therefore implement several ERI algorithms and make use of the best method for a given type of ERI.
From the ERIs, the Coulomb and exact-exchange contributions to the KS (or Fock) matrix are obtained as

J_{\mu\nu} = \sum_{\lambda\sigma} (\mu\nu|\lambda\sigma)\, P_{\lambda\sigma}, \qquad K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma},

where P_{\lambda\sigma} is the density matrix. In a direct SCF procedure, these contributions to the KS (or Fock) matrix can be evaluated directly such that the contracted ERIs never need to be explicitly calculated and stored in memory.
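Schematically, and ignoring prescreening, permutational symmetry, and all performance considerations, a direct Coulomb matrix build on the GPU has the structure sketched below; the eri() routine is a placeholder for an actual integral algorithm (Rys quadrature, McMurchie-Davidson, etc.) and all names are illustrative.

// Placeholder for a real ERI routine; in practice this is where the
// Rys quadrature or McMurchie-Davidson recursions would be evaluated.
__device__ double eri(int mu, int nu, int lam, int sig)
{
    return 0.0;  // stub
}

// J[mu*nbf+nu] = sum_{lam,sig} (mu nu | lam sig) * P[lam*nbf+sig]
// One thread per (mu,nu) element of the (row-major) Coulomb matrix.
__global__ void coulomb_matrix(double *J, const double *P, int nbf)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nbf * nbf) return;
    int mu = idx / nbf;
    int nu = idx % nbf;

    double jmn = 0.0;
    for (int lam = 0; lam < nbf; ++lam)
        for (int sig = 0; sig < nbf; ++sig)
            jmn += eri(mu, nu, lam, sig) * P[lam * nbf + sig];

    J[idx] = jmn;
}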
Yasuda was the first to realize the potential of GPUs for the acceleration of ERI evaluation [18]. In this work, the fundamental issues of algorithm development on GPUs are addressed and results for the calculation of the Coulomb contribution to the KS matrix with s- and p-type basis functions are presented for a CUDA implementation. Although it is not the most efficient algorithm for ERIs over basis functions with low angular momentum quantum number, the Rys quadrature was chosen because this scheme allows one to maximize the load balance of the GPU's SMs. A new interpolation formula for the roots and weights of the quadrature was proposed which is particularly suitable for SIMD processing, and an error analysis for the quadrature was given. A mixed-precision (MP) CPU/GPU scheme was introduced which calculates the largest ERIs (prescreened via the Schwarz integral bound and an adjustable threshold) in DP on the CPU and the remaining ERIs in SP on the GPU such that the absolute error in the calculated ERIs can be controlled. This, together with data accumulation (Coulomb matrix formation) via 48-bit multiprecision addition (which can be implemented in software for GPUs without DP support), leads to accurate DFT SCF energies with errors that remain well below chemical accuracy. The contributions to the Coulomb matrix are directly computed from the uncontracted ERIs in a SIMD fashion on the GPU, which avoids the problem of having to transfer the large number of ERIs from GPU to CPU memory; instead, only the density and Coulomb matrices have to be transferred. If all ERIs are evaluated on the GPU (NVIDIA GeForce 8800 GTX), speedups of around one order of magnitude have been observed for the formation of the Coulomb matrix for molecules as big as valinomycin (168 atoms) with a 6-31G basis set as compared to a conventional CPU implementation.
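The essence of such a mixed-precision partitioning can be sketched as follows (host-side C with illustrative names; the Schwarz factors Q and the two thresholds are assumed to be precomputed and chosen by the user): quartets with a large Schwarz bound are routed to a DP list for the CPU, the far more numerous small ones to an SP list for the GPU.

// Q[mu*nbf+nu] = sqrt((mu nu|mu nu)) is the Schwarz factor of a pair.
// Quartets with a large bound are evaluated in DP on the CPU,
// the remaining (small) ones in SP on the GPU.
typedef struct { int mu, nu, lam, sig; } Quartet;

void partition_quartets(const double *Q, int nbf, double thresh_dp,
                        double thresh_skip, Quartet *dp_list, int *n_dp,
                        Quartet *sp_list, int *n_sp)
{
    *n_dp = *n_sp = 0;
    for (int mu = 0; mu < nbf; ++mu)
    for (int nu = 0; nu <= mu; ++nu)
    for (int lam = 0; lam < nbf; ++lam)
    for (int sig = 0; sig <= lam; ++sig) {
        double bound = Q[mu * nbf + nu] * Q[lam * nbf + sig];
        Quartet q = { mu, nu, lam, sig };
        if (bound < thresh_skip)    continue;               /* negligible */
        else if (bound > thresh_dp) dp_list[(*n_dp)++] = q; /* CPU, DP  */
        else                        sp_list[(*n_sp)++] = q; /* GPU, SP  */
    }
}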
There is room for improvement in the performance, for example, through pipelining and also potentially by exploiting the DP functionality of current and future GPUs. Ufimtsev and Martinez (UM) have also developed CUDA kernels for the calculation of ERIs and Fock matrix formation involving s- and p-type basis functions on GPUs [21,23].
For the ERI evaluation, the McMurchie-Davidson scheme [22] was chosen because it requires relatively few intermediates per integral, resulting in a low memory requirement similar to that of the Rys quadrature. Three different mappings of the computational work to thread blocks have been tested, which result in different load balancing and data reduction overhead, and the ERI kernels have been carefully tuned. If the Fock matrix contributions are accumulated directly from the primitive ERIs, it becomes most efficient to assign the calculation of each primitive ERI batch (i.e., all ERIs over basis functions with the magnetic quantum numbers for the given angular momentum quantum numbers) to one thread, independent of the contraction length of the basis functions. In order to maximize load balancing, the integral batches are presorted into blocks of equal angular momentum classes. The integral evaluation and Fock matrix formation run on the GPU, but pre- and postprocessing are done on the CPU. This approach has also been parallelized across multiple GPUs.
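The mapping of primitive ERI batches to threads can be pictured roughly as in the following kernel skeleton (all names are hypothetical and the integral work itself is omitted); presorting the batch list by angular momentum class keeps the threads of a warp on the same code path, which suits the SIMD execution model.

// One thread evaluates one primitive ERI batch, i.e., all integrals over
// the magnetic quantum numbers of a fixed quartet of primitive shells.
struct PrimBatch { int shell_a, shell_b, shell_c, shell_d; };

__global__ void eri_batches(const PrimBatch *batches, int nbatches,
                            double *batch_results, int results_per_batch)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one batch per thread
    if (b >= nbatches) return;

    PrimBatch q = batches[b];
    double *out = batch_results + (size_t)b * results_per_batch;

    // ... evaluate all primitive integrals of the batch (for an (sp|sp)
    // quartet this is a small, fixed number of values) and either store
    // them or contract them directly into Fock matrix contributions.
    for (int k = 0; k < results_per_batch; ++k)
        out[k] = 0.0;                                // placeholder work
    (void)q;
}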
HF SCF calculations with 3-21G and 6-31G basis sets using UM's implementation and an NVIDIA GTX280 card can be more than 100 times faster than the GAMESS program running on the CPU, with most of the time spent in the Fock matrix formation on the GPU. However, for large molecules such as olestra (453 atoms, 2131 basis functions), the linear algebra (LA) required for the solution of the SCF equations starts to become a bottleneck, requiring as much as 50% of the Fock matrix computation time (with the LA performed on the GPU using CUBLAS). A parallel efficiency of over 60% was achieved on three NVIDIA GeForce 8800 GTX cards as compared to the use of only one graphics accelerator. Two points should be mentioned here. First, the limitation to s- and p-type functions results in small integral blocks that can be treated entirely in shared memory and registers, which means that the ratio of computation to memory access is high. This situation will change for basis functions with higher angular momentum quantum numbers. Second, the CPU code used for the comparisons is a legacy Fortran implementation that underperforms on modern CPUs. Smaller GPU speedups should be observed for comparisons against implementations of these algorithms which are optimized for performance on modern CPUs.
The error in the SCF energies obtained with UM's code due to the use of SP arithmetics has been analyzed. Evaluating the largest ERI contributions in DP, which can be performed on newer GPUs with negligible additional computational cost, improves the accuracy to acceptable levels in all investigated cases. In addition, error compensation in relative energies was observed, presumably due to cancellation of the contributions of large ERIs. For larger molecules, however, computation of the larger ERIs in DP will be required, as has been discussed above in the context of mixed-precision schemes.
UM have also implemented the calculation of the Coulomb and exchange contributions to the analytical HF energy gradients with s- and p-type basis functions on GPUs [25]. A speedup of between 6 for small molecules and over 100 for larger molecules (olestra) has
been obtained running in parallel on a system equipped with two NVIDIA GTX295 cards (each of which carries two GPUs) and an Intel Core2 quad-core 2.66 GHz CPU. Reference was again made to GAMESS, running in parallel on all four CPU cores. Using the mixed SP/DP approach discussed above, the accuracy of the computed gradients is close to typical convergence thresholds for geometry optimizations. Geometry optimization of a helical hepta-alanine was shown to lead to an optimized structure in good agreement with GAMESS results, with only a small error in the final energy. First-principles molecular dynamics simulations have also been performed with the 6-31G basis set in the microcanonical ensemble using the velocity Verlet integrator, and good conservation of the total energy was observed over a simulation time of 20 ps.
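For reference, a single velocity Verlet step has the familiar form sketched below (plain C, not taken from the UM code; the forces() routine stands for the ab initio gradient evaluation, which is the part that runs on the GPU).

// One velocity Verlet step for N particles in the microcanonical ensemble.
// x, v, f are arrays of length 3*N; m holds the particle masses.
// forces() is assumed to fill f with the current forces (e.g., from an
// HF gradient evaluated on the GPU) and is not shown here.
void forces(const double *x, double *f, int N);

void velocity_verlet_step(double *x, double *v, double *f,
                          const double *m, int N, double dt)
{
    for (int i = 0; i < N; ++i)
        for (int d = 0; d < 3; ++d) {
            v[3*i + d] += 0.5 * dt * f[3*i + d] / m[i];   // first half kick
            x[3*i + d] += dt * v[3*i + d];                // drift
        }

    forces(x, f, N);                                      // new forces

    for (int i = 0; i < N; ++i)
        for (int d = 0; d < 3; ++d)
            v[3*i + d] += 0.5 * dt * f[3*i + d] / m[i];   // second half kick
}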
Recently, Asadchev et al. presented algorithms and a CUDA implementation of the uncontracted Rys quadrature for ERIs over basis functions with angular momenta up to g functions [26]. The Rys quadrature, owing to its small memory footprint, is efficient for integrals of higher order angular momentum. The major problem is that, unlike numerical LA kernels, the quadrature has very complex memory access patterns which span a large data set and depend on the angular momenta of the shells involved. A high angular momentum shell block, for example, requires 15,376 floating-point numbers for intermediate quantities which are needed repeatedly during the integral evaluation. With DP this corresponds to 123,008 bytes, which is much larger than the cache sizes available on GPUs. Therefore, these intermediates must be stored in and loaded from the device memory as required, and it becomes mandatory to arrange the parallel calculation of the ERIs in such a way that these memory loads are minimized. For this purpose, the integrals in a shell block are reordered such that intermediates can be reused as often as possible. Another problem is the large amount of code required to cover all possible cases of integral types in an efficient manner. The authors therefore adopted a template-based approach in which all cases can be generated from a single template in an automated fashion.
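The template idea can be illustrated by the following sketch (names and structure are ours and considerably simpler than the published implementation): the angular momenta of the four shells are compile-time parameters, so that the compiler generates a specialized kernel for every integral class from a single source.

// A single kernel template parameterized by the angular momenta of the
// four shells; instantiating it for all combinations up to the maximum
// angular momentum generates the specialized kernel for every class.
template <int LA, int LB, int LC, int LD>
__global__ void eri_kernel(const double *shell_data, double *out, int nquartets)
{
    // Numbers of Cartesian components are compile-time constants,
    // so the inner loops can be fully unrolled by the compiler.
    const int NA = (LA + 1) * (LA + 2) / 2;
    const int NB = (LB + 1) * (LB + 2) / 2;
    const int NC = (LC + 1) * (LC + 2) / 2;
    const int ND = (LD + 1) * (LD + 2) / 2;

    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nquartets) return;

    // ... Rys quadrature recursions for this (LA LB | LC LD) class ...
    for (int i = 0; i < NA * NB * NC * ND; ++i)
        out[(size_t)q * NA * NB * NC * ND + i] = 0.0;   // placeholder
    (void)shell_data;
}

// Host side: launching, e.g., the (ss|ss) and (gg|gg) specializations.
void launch_examples(const double *d_shell_data, double *d_out, int nquartets)
{
    int threads = 128, blocks = (nquartets + threads - 1) / threads;
    eri_kernel<0, 0, 0, 0><<<blocks, threads>>>(d_shell_data, d_out, nquartets);
    eri_kernel<4, 4, 4, 4><<<blocks, threads>>>(d_shell_data, d_out, nquartets);
}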
The performance of these GPU ERI kernels was tested on NVIDIA GeForce GTX 275 and NVIDIA Tesla T10 cards and compared to the performance of the ERI evaluation with the Rys quadrature as implemented in GAMESS (which, as noted above, is a legacy code that is not tuned for modern CPUs). While the CPU implementation achieves around 1 GFLOPS (giga floating-point operations per second), the GPUs achieve around 25 GFLOPS in DP and 50 GFLOPS in SP; the DP figure corresponds to approximately 30% of the theoretically possible DP peak performance. The difference between the SP and DP performance is only approximately a factor of 2, which shows that the computations are memory bound rather than compute bound. No timings are given for the data transfer between GPU memory and main memory apart from stating that it takes several times longer than the actual execution time of the ERI kernels. It is clear that, in order to retain the speed advantage of the ERI evaluation on the GPU, processing of the ERIs (e.g., Fock matrix formation) must be implemented on the GPU device as well.
3.2 Numerical exchange-correlation quadrature
In the generalized gradient approximation (GGA) to DFT, the XC potential depends on the electron density and its gradient at every point in three-dimensional space. This makes an analytical solution of the XC integrals impossible, and numerical quadrature is used to compute the XC matrix elements.
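In a standard closed-shell GGA implementation these matrix elements take the generic form (reproduced here for orientation; the notation is ours and is not quoted from the implementations discussed below)

V^{\mathrm{XC}}_{\mu\nu} \approx \sum_g w_g \left[ \frac{\partial f_{\mathrm{xc}}}{\partial \rho}\, \chi_\mu(\mathbf{r}_g)\chi_\nu(\mathbf{r}_g) + 2\, \frac{\partial f_{\mathrm{xc}}}{\partial |\nabla\rho|^2}\, \nabla\rho(\mathbf{r}_g)\cdot\nabla\!\left[\chi_\mu(\mathbf{r}_g)\chi_\nu(\mathbf{r}_g)\right] \right],

where f_{\mathrm{xc}}(\rho,|\nabla\rho|^2) is the XC energy density and w_g and \mathbf{r}_g are the quadrature weights and grid points.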
The numerical XC quadrature is perfectly suited for parallelization, and Yasuda has presented a heterogeneous strategy in which the computationally less demanding steps in the quadrature are performed on the CPU while the expensive steps are done on the GPU [27]. These are the evaluation of the electron density (and its gradient) on the grid points and the subsequent accumulation of the XC matrix elements, which can be formulated as matrix-vector multiplications and dot products. Both steps are organized in batches of grid points and nonnegligible basis functions that are small enough to be kept entirely in shared memory. Although in this way some of the basis function values on the grid points must be recalculated, this is more than compensated for by the low latency of the shared memory.
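A rough sketch of one such batched operation, the evaluation of the density on a batch of grid points, is given below (hypothetical names and batch sizes, chosen so that the staged data fit into the shared memory of GPUs of that generation).

#define NBAS 32    // non-negligible basis functions in this batch
#define NGRID 64   // grid points in this batch (one thread per point)

// rho(r_g) = sum_{mu,nu} P[mu][nu] * chi_mu(r_g) * chi_nu(r_g), in SP.
// The batch is sized so that the basis values and the density matrix
// block both fit into shared memory.
__global__ void batch_density(const float *chi,     // [NBAS][NGRID]
                              const float *Pblock,  // [NBAS][NBAS]
                              float *rho)           // [NGRID]
{
    __shared__ float chi_s[NBAS * NGRID];
    __shared__ float P_s[NBAS * NBAS];

    for (int k = threadIdx.x; k < NBAS * NGRID; k += blockDim.x)
        chi_s[k] = chi[k];
    for (int k = threadIdx.x; k < NBAS * NBAS; k += blockDim.x)
        P_s[k] = Pblock[k];
    __syncthreads();

    int g = threadIdx.x;            // grid point handled by this thread
    if (g >= NGRID) return;

    float r = 0.0f;
    for (int mu = 0; mu < NBAS; ++mu) {
        float t = 0.0f;
        for (int nu = 0; nu < NBAS; ++nu)
            t += P_s[mu * NBAS + nu] * chi_s[nu * NGRID + g];
        r += t * chi_s[mu * NGRID + g];
    }
    rho[g] = r;
}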
In order to deal with roundoff errors due to the use of SP floating-point numbers on the GPU, Yasuda introduced a scheme in which the XC potential is split into a smooth model potential and a correction. The model potential is chosen such that its matrix elements can be calculated analytically. This is done in DP on the CPU, while the GPU is used for calculating the correction, that is, for the numerical quadrature of the difference between the full XC potential and the model potential.
With this scheme, only small errors remain in the total energy of valinomycin with a 3-21G or 6-31G basis set and the PW91 XC functional [28]. For the numerical quadrature, a speedup of approximately 40 is observed with an NVIDIA GeForce 8800 GTX graphics card as compared to a conventional implementation running on an Intel Pentium 4 CPU with 2.8 GHz. This translates into a speedup of around five to ten as compared to more modern CPUs.
3.3 Density-fitted Poisson method
Brown et al. have presented a different heterogeneous approach to accelerate DFT, using ClearSpeed accelerator boards [29] instead of GPUs [30,31]. The ClearSpeed accelerator hardware is a compute-oriented stream architecture with raw performance comparable to that of modern GPUs while offering support for DP. Just as for GPUs, an efficient use of this hardware requires fine-grained parallelization with a large number of lightweight threads, and any algorithm developed for these accelerators will map well onto GPUs. By using the Poisson
density fitting method, all bottlenecks of a DFT calculation could be shifted into operations that map well onto the accelerator hardware, while at the same time the computational prefactor is reduced. The auxiliary basis set can be chosen to consist of a few atom-centered Gaussian functions augmented with Poisson functions (obtained by applying the Laplacian to Gaussian functions). Because the Coulomb integrals over Poisson functions reduce to overlap-like integrals, this leads to a further reduction of the prefactor. Furthermore, these overlap integrals can be calculated by numerical quadrature. However, to maintain numerical stability in the SCF procedure, a higher accuracy than provided by default XC quadrature grids is required, thus increasing the number of grid points.
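The role of the Poisson functions can be outlined as follows (a generic sketch of the standard Poisson fitting argument; the notation is ours). Defining a Poisson fitting function as \hat{g}_P(\mathbf{r}) = -\tfrac{1}{4\pi}\nabla^2 \chi_P(\mathbf{r}) for a Gaussian \chi_P and using \nabla^2 |\mathbf{r}-\mathbf{r}'|^{-1} = -4\pi\,\delta(\mathbf{r}-\mathbf{r}'), one finds

\int\!\!\int \frac{\hat{g}_P(\mathbf{r})\, \chi_\mu(\mathbf{r}')\chi_\nu(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, \mathrm{d}\mathbf{r}\, \mathrm{d}\mathbf{r}' = \int \chi_P(\mathbf{r}')\, \chi_\mu(\mathbf{r}')\chi_\nu(\mathbf{r}')\, \mathrm{d}\mathbf{r}',

so the long-range Coulomb integrals over Poisson functions collapse to short-ranged, overlap-like three-center integrals, which is what makes their evaluation by numerical quadrature attractive.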
The implementation, which is not restricted to basis functions with low angular momentum quantum numbers, passes only information about the numerical quadrature grid, the basis functions, the KS matrix, and the density matrix between the accelerator cards and the host system. The numerical quadrature of the XC contribution and of the Coulomb contribution due to the fitted density is organized in such a way that all computations can be done within the cache memory of the accelerator cards. All other parts of the DFT calculation are performed on the host CPU. Compared to an implementation with analytical evaluation of the fitting integrals running on one core of a dual-core AMD Opteron 2218 CPU with 2.6 GHz, a speedup of between 7 and 15 was observed with 12 ClearSpeed xe620 accelerator cards. The benchmarks were run for molecules ranging in size from chorismate (24 atoms) to considerably larger systems, using cc-pVTZ and corresponding density fitting basis sets. There is further room for improvement, for example, by implementing prescreening, which is missing so far. However, the work done on the host is already becoming a bottleneck and needs to be addressed; the diagonalization, for example, takes approximately 30% of the total runtime.
3.4 Density functional theory with Daubechies wavelets
Another effort in the physics community should be mentioned here. The BigDFT code is based on a systematic basis set of Daubechies wavelet functions and offers GPU support within the CUDA programming framework. It was shown to achieve a high parallel efficiency of 90% on parallel computers in which the cross-sectional bandwidth scales well with the number of processors. It uses a parallelized hybrid CPU/GPU programming model, and compared to the full CPU implementation a constant speedup of up to six was achieved when GPUs were used.
4 ELECTRON CORRELATION METHODS

The quantum chemist's traditional way to approximate solutions of the electronic Schrödinger equation beyond the mean-field level is the use of wave function based electron correlation methods. These methods improve upon the HF mean-field approximation by adding an explicit treatment of electron correlation, at substantially increased computational cost. Only a few GPU implementations of such methods have been reported so far. It is expected that this will change in the near future because these methods are of critical importance whenever higher accuracy is required than what can be achieved by DFT, or for types of interactions and properties for which DFT breaks down.
4.1 Resolution-of-identity second-order Møller-Plesset perturbation theory
Second-order Møller-Plesset perturbation theory (MP2) is the computationally least expensive of the correlated wave function methods. Except for transition metal compounds, MP2 equilibrium geometries are of comparable accuracy to DFT. However, MP2 captures long-range correlation effects (like dispersion) which are lacking in present-day density functionals. The computational cost of MP2 calculations is dominated by the integral transformation from the atomic orbital (AO) to the molecular orbital (MO) basis, which formally scales with the fifth power of the system size. The full four-index transformation can be avoided by introduction of the RI integral approximation [36,37], which requires just the transformation of three-index quantities and reduces the prefactor without significant loss of accuracy.
This makes RI-MP2 an attractive alternative for small- to medium-sized molecular systems for which DFT fails. Aspuru-Guzik and coworkers have worked on accelerating RI-MP2 calculations with GPUs [38,39]. The computationally dominant step of an RI-MP2 calculation essentially consists of matrix multiplications to generate the approximate MO-basis ERIs,

(ia|jb) \approx \sum_P B_{ia}^P\, B_{jb}^P,

where the three-index quantities B_{ia}^P are obtained from the AO three-center integrals over the auxiliary basis and the inverse square root of the Coulomb metric of the auxiliary basis.
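For orientation, these quantities enter the standard closed-shell MP2 correlation energy expression (textbook form, included here for completeness),

E^{(2)} = \sum_{ijab} \frac{(ia|jb)\,\bigl[\,2\,(ia|jb) - (ib|ja)\,\bigr]}{\varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b},

where i, j (a, b) label occupied (virtual) orbitals with energies \varepsilon.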
To take full benefit of GPUs for these matrix multiplications, the matrices have to be larger than a given threshold to minimize the impact of the bus latency incurred when transferring the matrices from the CPU to the GPU memory. Depending on the system size (number of atoms, size of the basis sets employed), this is achieved by appropriate blocking of the matrices.
For the multiplication of general matrices whose size is too large to be held in GPU memory, a dedicated matrix multiplication library (SciGPU-GEMM) has been developed [40]. As established for standard parallel matrix multiplications, this library uses a
two-dimensional decomposition into blocks. Partial matrix multiplications of these blocks are performed on the GPU with CUBLAS routines and the results are accumulated on the CPU. To improve the numerical accuracy, a heterogeneous computing model is employed in which numerically large contributions to the final result are computed and accumulated on a DP device (in general the CPU) and the remaining small contributions are efficiently treated by the SP GPU device. It was shown that errors can be reduced by an order of magnitude in exchange for a moderate performance decrease with this MP approach.
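The flavor of this heterogeneous scheme is conveyed by the following simplified sketch (illustrative only; it is not the SciGPU-GEMM code and splits only one of the two factors): elements of A above a magnitude threshold contribute through a DP product on the CPU, the remainder through an SP CUBLAS product on the GPU, and the two partial results are summed in DP.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

// C = A*B for n x n column-major matrices, with A split by magnitude:
// elements |A_ij| > thresh enter a DP product on the CPU, the rest an
// SP product on the GPU (CUBLAS). Error checking omitted.
void mixed_precision_gemm(const double *A, const double *B, double *C,
                          int n, double thresh)
{
    float  *A_s = (float  *)malloc(sizeof(float)  * n * n);  // small part, SP
    double *A_l = (double *)malloc(sizeof(double) * n * n);  // large part, DP
    float  *B_s = (float  *)malloc(sizeof(float)  * n * n);
    float  *C_s = (float  *)malloc(sizeof(float)  * n * n);

    for (int k = 0; k < n * n; ++k) {
        int is_large = fabs(A[k]) > thresh;
        A_l[k] = is_large ? A[k] : 0.0;
        A_s[k] = is_large ? 0.0f : (float)A[k];
        B_s[k] = (float)B[k];
    }

    /* DP part on the CPU (naive triple loop, stands in for a tuned DGEMM). */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double acc = 0.0;
            for (int k = 0; k < n; ++k)
                acc += A_l[i + k * n] * B[k + j * n];
            C[i + j * n] = acc;
        }

    /* SP part on the GPU via CUBLAS. */
    float *dA, *dB, *dC;
    float one = 1.0f, zero = 0.0f;
    cudaMalloc((void **)&dA, sizeof(float) * n * n);
    cudaMalloc((void **)&dB, sizeof(float) * n * n);
    cudaMalloc((void **)&dC, sizeof(float) * n * n);
    cudaMemcpy(dA, A_s, sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B_s, sizeof(float) * n * n, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);
    cublasDestroy(h);

    cudaMemcpy(C_s, dC, sizeof(float) * n * n, cudaMemcpyDeviceToHost);
    for (int k = 0; k < n * n; ++k)
        C[k] += (double)C_s[k];                    /* accumulate in DP */

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A_s); free(A_l); free(B_s); free(C_s);
}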
Compared to the standard CPU implementation, speedups of 13.8, 10.1, and 7.8 were obtained on an NVIDIA Tesla C1060 GPU equipped with 4 GB of memory for the 168-atom molecule valinomycin in SP, MP, and DP, respectively. The corresponding errors in the correlation energy show that, if the matrix multiplications are performed entirely in SP, the resulting error is larger than acceptable for chemical accuracy. It is therefore inevitable to accept some performance penalty for the sake of accuracy. It was shown that the ERI evaluation becomes the computational bottleneck once the matrix multiplications have been accelerated, which suggests a combination with the approaches discussed in Section 3 for the ERI evaluation.
5 QUANTUM MONTE CARLO
Quantum Monte Carlo (QMC) methods provide an alternative route to solving the time-independent Schrödinger equation [41]. As opposed to variational ab initio approaches, QMC is based on a stochastic evaluation of the underlying integrals. QMC methods are intrinsically parallel and scale favorably with system size, but they come with a very large prefactor.
Anderson et al. have accelerated QMC calculations by executing CUDA kernels that are explicitly optimized for cache usage and instruction-level parallelism for the computationally intensive parts on a GPU [42]. These are the basis function evaluation on grid points and, similar to the numerical XC quadrature and RI-MP2, matrix multiplications. The Kahan summation formula was explored to improve the accuracy of the GPU matrix multiplications, which was necessary because of the lack of fully IEEE-compliant floating-point arithmetic on GPUs in 2007. For small molecules with 8-28 atoms (32-152 electrons and 80-516 basis functions), an approximately fivefold speedup was obtained using an NVIDIA GeForce 7800 GTX graphics card as compared to an optimized implementation running on an Intel Pentium 4 CPU with 3 GHz.
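Kahan (compensated) summation carries the rounding error of every addition forward in a correction variable; a minimal, generic version (textbook form, not the code of the original work) reads as follows. In practice, care must be taken that the compiler does not optimize the compensation away.

// Compensated (Kahan) summation of n single-precision values.
// The correction term c recovers the low-order bits lost in each
// addition, greatly reducing the accumulated SP roundoff.
__host__ __device__ float kahan_sum(const float *x, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;                 // running compensation
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;           // corrected next term
        float t = sum + y;            // low-order bits of y may be lost here
        c = (t - sum) - y;            // recover what was lost
        sum = t;
    }
    return sum;
}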
Meredith et al. have used an implementation of the quantum cluster approximation on SP GPUs to study the effect of disorder on the critical temperature for superconductivity in cuprates with a two-dimensional Hubbard model on a regular lattice [43]. The dominant computational kernel was cast into matrix multiplications on the GPU using the CUBLAS library. Attempts to increase the performance by circumventing the data transfer bottleneck and implementing the remaining data manipulations on the GPU instead of the CPU resulted in a performance loss for all but the largest problem size that was investigated. The simple
reason is that smart algorithms that can be implemented efficiently on CPUs do not map well onto GPU architectures or, in other words, the GPU has to do more work to achieve the same result. For the largest problem size studied, a fivefold speedup was observed running on a cluster with 32 AMD Opteron 2.6 GHz CPUs and 32 NVIDIA 8800 GTX graphics cards as compared to using only the CPUs in parallel. Sufficient accuracy for scientifically meaningful results within the employed model was proven by comparison to DP results obtained on a CPU.
6 CONCLUDING REMARKS
Quantum chemistry software that exploits the capabilities of modern GPUs has only recently started to emerge. Significant parts of these initial efforts have been devoted to minimizing errors caused by the lack of DP support on older GPUs. The advent of next-generation GPUs that support DP arithmetics at a peak performance only a factor of 2 lower than that of SP will make these special approaches obsolete. At the same time, future developments will be greatly facilitated by the improved hardware capabilities and the maturing programming tools.
From the literature, one can observe that in order to achieve good results in programming with GPUs it is often necessary to write GPU-only versions of the code. One typically has to abandon many of the smart optimizations that have been developed over the years for CPUs, and expensive copy operations from the CPU to the GPU memory have to be minimized.
With careful work, it is possible to achieve speedups which should allow researchers to perform calculations that otherwise would require large and expensive CPU clusters. However, the nature of GPU programming is such that significant effort is still required to make effective use of GPUs. These complexities are the reason that the quantum chemistry software that is available for GPUs at the time of this writing is still in its infancy and not yet ready for general use. GPU implementations that are capable of full HF and DFT calculations, for example, are still restricted to s- and p-type basis functions. HF calculations are not of much practical use by themselves but only as a starting point for correlated wave function methods, which require basis functions with higher angular momentum quantum numbers. Similarly, meaningful DFT calculations have to use polarization functions, which means that even for simple organic molecules or biomolecules without metal atoms at least d-type functions are required. While GPU-based ERI implementations for high angular momentum basis functions have been developed, these still have to be incorporated into software that is capable of complete HF and DFT calculations.
Up to now, only energies and gradients have been considered, which allows for explorations of potential energy surfaces. However, a variety of other quantum chemistry applications would also benefit from the computational power that GPUs provide. Of high interest to researchers are static and dynamic molecular response properties. Frequently, these require a higher computational effort than energy and gradient evaluations. We therefore expect to see developments in this area soon.
We are looking forward to exciting new developments of quantum chemistry software for GPUs, accompanied by ground-breaking applications, in the near future.
ACKNOWLEDGMENTS
This work was supported in part by grant 09-LR-06-117792-WALR from the University of California Lab Fees program and grant XFT-8-88509-01/DE-AC36-99GO10337 from the Department of Energy to RCW
REFERENCES
1 Clary, D.C Quantum chemistry of complex systems Science 2006, 314(5797), 265—6
2 Carter, E.A Challenges in modeling materials properties without experimental input Science
6 Frigo, M., Johnson, S.G The design and implementation of FFTW3 Proc IEEE 2005, 93(2), 216—31
7 NVIDIA: Santa Clara, CA, CUDA Programming Guide, http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide_3.0.pdf (Accessed March 6, 2010)
8 AMD: Sunnyvale, CA, ATI, www.amd.com/stream (Accessed March 14, 2010)
9 NVIDIA: Santa Clara, CA, CUDA, http://www.nvidia.com/object/cuda_home.html (Accessed March 6, 2010)
10 NVIDIA: Santa Clara, CA, CUFFT Library, http://developer.download.nvidia.com/compute/ cuda/2_3/toolkit/docs/CUFFT_Library_2.3.pdf (Accessed March 6, 2010)
11 NVIDIA: Santa Clara, CA, CUBLAS Library 2.0, http://developer.download.nvidia.com/ compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf (Accessed March 6, 2010)
12 Innovative Computing Laboratory, University of Tennessee, Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma (Accessed March 6, 2010)
13 Kohn, W., Sham, L.J Self-consistent equations including exchange and correlation effects Phys Rev 1965, 140, A1133-8
14 Parr, R.G., Yang, W Density-Functional Theory of Atoms and Molecules, Oxford University Press, Oxford, 1989
15 Jensen, F In Annual Reports in Computational Chemistry (ed D.C Spellmeyer), Vol 1, Elsevier, Amsterdam, 2005, pp 3—17
16 Fiolhais, C., Nogueira, F., Marques, M.A.L A Primer in Density Functional Theory, Lecture Notes
in Physics, Springer Verlag, Berlin, 2003
17 Salek, P., Høst, S., Thøgersen, L., Jørgensen, P., Manninen, P., Olsen, J., Jansík, B Linear-scaling implementation of molecular electronic self-consistent field theory J Chem Phys 2007, 126, 114110
18 Yasuda, K Two-electron integral evaluation on the graphics processor unit J Comput Chem
22 McMurchie, L.E., Davidson, E.R One- and two-electron integrals over Cartesian Gaussian functions J Comput Phys 1978, 26, 218-31
23 Ufimtsev, I.S., Martinez, T.J Quantum chemistry on graphical processing units 2 Direct self-consistent-field implementation J Chem Theory Comput 2009, 5(4), 1004-15
24 Schmidt, M.W., Baldridge, K.K., Boatz, J.A., Elbert, S.T., Gordon, M.S., Jensen, J.H., Koseki, S., Matsunaga, N., Nguyen, K.A., Su, S., Windus, T.L., Dupuis, M., Montgomery, J.A., Jr General atomic and molecular electronic structure system J Comput Chem 1993, 14(11), 1347-63
25 Ufimtsev, I.S., Martinez, T.J Quantum chemistry on graphical processing units 3 Analytical energy gradients, geometry optimization, and first principles molecular dynamics J Chem Theory Comput 2009, 5(10), 2619—28
26 Asadchev, A., Allada, V., Felder, J., Bode, B.M., Gordon, M.S., Windus, T.L Uncontracted Rys quadrature implementation of up to g functions on graphical processing units J Chem Theory Comput 2010, 6(3), 696—704
27 Yasuda, K Accelerating density functional calculations with graphics processing unit J Chem Theory Comput 2008, 4(8), 1230—6
28 Perdew, J.P., Chevary, J., Vosko, S., Jackson, K.A., Pederson, M.R., Singh, D., Fiolhais, C Atoms, molecules, solids, and surfaces: Applications of the generalized gradient approximation for exchange and correlation Phys Rev B 1992, 46, 6671—87
29 ClearSpeed: Bristol, UK, www.clearspeed.com (Accessed March 14, 2010)
30 Brown, P., Woods, C., McIntosh-Smith, S., Manby, F.R Massively multicore parallelization of Kohn-Sham theory J Chem Theory Comput 2008, 4(10), 1620—6
31 Brown, P., Woods, C.J., McIntosh-Smith, S., Manby, F.R., A massively multicore parallelization of the Kohn-Sham energy gradients, J Comput Chem 2010, 31(10), 2008—13
32 Baerends, E.J., Ellis, D., Roos, P Self-consistent molecular Hartree-Fock-Slater calculations I The computational procedure Chem Phys 1973, 2, 41—51
33 Dunlap, B.I., Connoly, J.W.D., Sabin, J.R On some approximations in applications of Xa theory
36 Feyereisen, M.W., Fitzgerald, G., Komornicki, A Use of approximate integrals in ab initio theory: An application in MP2 energy calculations Chem Phys Lett 1993, 208, 359-63
37 Weigend, F., Häser, M., Patzelt, H., Ahlrichs, R RI-MP2: Optimized auxiliary basis sets and demonstration of efficiency Chem Phys Lett 1998, 294, 143-52
38 Vogt, L., Olivares-Amaya, R., Kermes, S., Shao, Y., Amador-Bedolla, C., Aspuru-Guzik, A Accelerating resolution-of-the-identity second-order Møller-Plesset quantum chemistry calculations with graphical processing units J Phys Chem A 2008, 112(10), 2049-57
39 Olivares-Amaya, R., Watson, M.A., Edgar, R.G., Vogt, L., Shao, Y., Aspuru-Guzik, A Accelerating correlated quantum chemistry calculations using graphical processing units and a mixed precision matrix multiplication library J Chem Theory Comput 2010, 6(1), 135-44
40 SciGPU-GEMM v0.8, http://www.chem-quantum.info/scigpu/?p=61 (Accessed March 6, 2010)
41 Ceperley, D., Alder, B Quantum Monte Carlo Science 1986, 231(4738), 555—60
42 Anderson, A.G., Goddard, W.A., III, Schröder, P Quantum Monte Carlo on graphical processing units Comput Phys Commun 2007, 177(3), 298-306
43 Meredith, J.S., Alvarez, G., Maier, T.A., Schulthess, T.C., Vetter, J.S Accuracy and performance of graphics processors: A quantum Monte Carlo application case study Parallel Comput 2009, 35(3), 151-63