Annual Reports in
COMPUTATIONAL CHEMISTRY
Edited by
Ralph A. Wheeler
Sponsored by the Division of Computers in Chemistry
of the American Chemical Society
Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo
Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
Linacre House, Jordan Hill, Oxford OX2 8DP, UK
32 Jamestown Road, London NW1 7BY, UK
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
First edition 2010
Copyright © 2010 Elsevier B.V. All rights reserved
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively you can submit your request online by visiting the Elsevier web site at http://www.elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
Library of Congress Cataloging-in-Publication Data
A catalogue record for this book is available from the Library of Congress
British Library Cataloging in Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-444-53552-8
ISSN: 1574-1400
For information on all Elsevier publications
visit our website at elsevierdirect.com
Printed and bound in USA
Contributors
Sheng-You Huang
Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO, USA
George Khelashvili
Department of Physiology and Biophysics, Weill Medical College of Cornell University, New York, NY, USA
Kah Chun Lau
Department of Chemistry, George Washington University, Washington DC, USA

Yaakov Levy
Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel

Hongzhi Li
Institute of Molecular Biophysics, Florida State University, Tallahassee, FL, USA

Yan Ling
Department of Chemistry and Biochemistry, University of Southern Mississippi, Hattiesburg, MS, USA
San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA; Lehrstuhl für Theoretische Chemie, Universität Erlangen, Erlangen, Germany
Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO, USA
PREFACE
Annual Reports in Computational Chemistry (ARCC) was instituted to provide timely reviews of topics important to researchers in Computational Chemistry. ARCC is published and distributed by Elsevier and sponsored by the American Chemical Society's Division of Computers in Chemistry (COMP). Members in good standing of the COMP Division receive a copy of the ARCC as part of their member benefits. Since previous volumes have received such an enthusiastic response from our readers, the COMP Executive Committee expects to deliver future volumes of ARCC that build on the solid contributions in our first five volumes.
To ensure that you receive future installments of this series, please join the Division as described on the COMP website at http://www.acscomp.org
Volume 6 features 14 outstanding contributions in six sections and includes a new section devoted to Nanotechnology and the reemergence of the Chemical Education section. Topics covered (and Section Editors) include Simulation Methodologies (Carlos Simmerling), Quantum Chemistry (Gregory S. Tschumper), Chemical Education (George C. Shields), Nanotechnology (Luke E. K. Achenie), Biological Modeling (Nathan Baker), and Bioinformatics (Wei Wang). Although individual chapters in ARCC are now indexed by the major abstracting services, we plan to continue the practice of cumulative indexing of both the current and past editions to provide an easy identification of past reports.
As was the case with our previous volumes, the current volume of Annual Reports in Computational Chemistry has been assembled entirely by volunteers to produce a high-quality scientific publication at the lowest possible cost. The Editor and the COMP Executive Committee extend our gratitude to the many people who have given their time to make this edition of Annual Reports in Computational Chemistry possible. The authors of each of this year's contributions and the Section Editors have graciously dedicated significant amounts of their time to make this volume successful. This year's edition could not have been assembled without the help of Clare Caruana of Elsevier. Thank you one and all for your hard work, your time, and your contributions.

We trust that you will find this edition to be interesting and valuable. We are actively planning the seventh volume and anticipate that it will restore one or more previously popular sections, including Materials and/or Emerging Technologies. In addition, we are actively soliciting input from our readers about future topics, so please contact the editor to make suggestions and/or to volunteer as a contributor.
Sincerely,
Ralph A. Wheeler, Editor
Section Editor: Carlos Simmerling
CHAPTER 1

Advancements in Molecular Dynamics Simulations of Biomolecules on Graphical Processing Units
Contents

1 Introduction
2 An Overview of GPU Programming
  2.1 GPU/CPU hardware differences
  2.2 The emergence of GPU programming languages
  2.3 GPU programming considerations
3 GPU-Based Implementations of Classical Molecular Dynamics
  3.1 Early GPU-based MD code development
  3.2 Production GPU-based MD codes
4 Performance and Accuracy
  4.1 Performance and scaling
  4.2 Validation
5 Applications
  5.1 Protein folding
6 Conclusions and Future Directions
Acknowledgments
References
coupled with the emergence of application programming interfaces to support general purpose computation on graphics processing units (GPUs) has led to an explosion in the use of GPUs for acceleration of scientific applications. Here we explore the use of GPUs within the context of condensed-phase molecular dynamics (MD) simulations. We discuss the algorithmic differences that the GPU architecture imposes on MD codes and provide an overview of the challenges involved in using GPUs for MD, followed by a critical survey of contemporary MD simulation packages that are attempting to utilize GPUs. Finally, we discuss the possible outlook for this field.

Keywords: GPU; CUDA; stream; NVIDIA; ATI; molecular dynamics; accelerator; OpenMM; ACEMD; NAMD; AMBER

1 San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA
2 National Biomedical Computation Resource, University of California San Diego, La Jolla, CA, USA

Annual Reports in Computational Chemistry, Volume 6 © 2010 Elsevier B.V. All rights reserved
ISSN: 1574-1400, DOI 10.1016/S1574-1400(10)06001-9
1 INTRODUCTION
Since the first molecular dynamics (MD) simulation of an enzyme was described in 1977, MD has become an important tool in understanding the behavior of biomolecules. From that first 10 ps simulation of merely 500 atoms, the field has grown to the point where state-of-the-art condensed-phase simulations are numerically intensive, requiring access to large-scale supercomputers or well-designed clusters with expensive interconnects that are beyond the reach of many research groups.
Many attempts have been made over the years to accelerate classical MD simulation of condensed-phase biological systems by exploiting alternative hardware technologies. Some notable examples include ATOMS by AT&T Bell Laboratories [7], FASTRUN [8], and the MDGRAPE boards [9,14], which were designed to accelerate the direct space nonbond calculations; Clearspeed Inc., which developed an accelerator board; and, most recently, D. E. Shaw Research LLC, who developed their own specialized architecture for MD simulations.
All of these approaches have, however, failed to make an impact on mainstream MD simulation, in large part because of the original acquisition or development costs of several accelerator technologies (Table 1). These costs have posed a significant barrier to widespread development within the academic research community. Additionally, these technologies do not form part of what would be considered a standard workstation specification.

Table 1 Example cost estimates for a range of hardware MD acceleration projects
a Total development cost: $15 million [14]
This makes it difficult to experiment with such technologies, leading to a lack of sustained development or innovation and, ultimately, their failure to mature into ubiquitous community-maintained research tools.
Graphics processing units (GPUs), on the other hand, have been an integral part of personal computers for decades. Ever since 3DFX first introduced the Voodoo graphics chip in 1996, GPU development has been strongly influenced by the entertainment industry in order to meet the demands for ever-increasing realism in computer games. This has resulted in significant industrial investment in the stable, long-term development of GPU technology. Additionally, the strong demand from the consumer electronics industry has resulted in GPUs becoming cheap and ubiquitous. This, combined with substantial year-over-year increases in the computing power of GPUs, means they have the potential, when utilized efficiently, to substantially outperform CPUs, making them attractive targets for acceleration of many scientific applications, including MD simulations.

The fact that high-end GPUs can be considered standard equipment in scientific workstations means that they already exist in many research labs and can be purchased easily with new equipment. This makes them readily available to researchers and thus tempting instruments for computational experimentation. The nature of GPU hardware, however, has made its use in general purpose computing challenging to all but those with extensive three-dimensional (3D) graphics programming experience. However, as discussed in Section 2, the development of application programming interfaces (APIs) targeted at general purpose scientific computing has reduced this complexity to the point where GPUs are beginning to be accepted as serious tools for the economically efficient acceleration of an extensive range of scientific problems.
In this chapter, we provide a brief overview of GPU hardware and programming techniques and then review the progress that has been made in using GPU hardware to accelerate classical MD simulations of condensed-phase biological systems. We review some of the challenges and limitations that have faced those trying to implement MD algorithms on GPUs, consider performance numbers and validation techniques, and then highlight some recent applications of GPU-accelerated MD. Finally, we comment on the limitations of current GPU MD implementations and what the future may hold for acceleration of MD simulations on GPU hardware.

Figure 1 Peak floating-point operations per second (a) and memory bandwidth (b) for Intel CPUs and NVIDIA GPUs. Reproduced from [15].
2 AN OVERVIEW OF GPU PROGRAMMING
2.1 GPU/CPU hardware differences
In order to comprehend where the performance benefits lie and understand the complexity facing programmers wishing to utilize GPUs, it is necessary to compare the underlying nature, and design philosophies, of the GPU with those of the CPU. Conventional CPUs found in the majority of modern computers, such as those manufactured by Intel and Advanced Micro Devices (AMD), are designed for fast sequential execution of a single instruction stream. When running a program, the CPU fetches instructions and associated data from the computer's random access memory (RAM), decodes them, executes them, and then writes the results back to RAM; in Flynn's taxonomy this would be classified as single instruction, single data (SISD). The control unit receives the instruction/data pair from RAM during the decoding phase and passes the instruction to the arithmetic logic unit (ALU), which is the circuitry that carries out the logical operations on the data. Finally, there are cache units which provide local and fast temporary data storage for the CPU. Historically, performance improvements in sequential execution have been obtained by increasing CPU clock speeds and introducing more complex ALUs that perform increasingly composite operations in fewer clock cycles. Additionally, pipelining, which is executing instructions out of order or in parallel while maintaining the overall appearance of sequential execution, has also improved performance (but not calculation speed) by increasing the number of instructions a CPU can execute in a unit amount of time, and larger on-chip cache memory is often used to hide latency.
GPUs, in contrast, were designed to facilitate the display of 3D graphics by performing large numbers of floating-point operations per video frame: they are essentially specialized numeric computing engines. The dominant strategy adopted by the graphics industry to meet this requirement has been to maximize the throughput of a massive number of parallel threads which can all access the RAM on the GPU board. Herein lies the key difference with CPUs: the same operation can be carried out on different parts of the input data within the GPU's memory by an army of individual threads concurrently. Within Flynn's taxonomy, this falls into the single instruction, multiple data (SIMD) category.

Figure 2 Abstraction contrasting CPU and GPU design. Adapted from [18].
A GPU has a hierarchical structure composed of multiple streaming multiprocessors (SMs), which in turn consist of subunits of streaming processors. Memory is also hierarchical, maintaining an approximately constant size-to-speed ratio; all SMs share the same device global memory, which is large but relatively slow. Smaller, lower latency, on-chip memory which is local to each SM and available to all streaming processors within that SM is provided, and even faster register-like memory is present on each streaming processor. A read-only cache of the device global memory is available to each SM in the form of a texture cache. Physically, GPUs have a much larger number of ALUs than a CPU, but the ALUs are not as complex as the ones found in a CPU. The GPU's clock speed is normally about half that of a contemporary CPU's; however, GPUs typically have an order of magnitude larger memory bandwidth to their onboard device global memory.
2.2 The emergence of GPU programming languages
The spectrum of GPU accessibility for scientific use has two extremes. Prior to the development of current general purpose GPU programming models by the major vendors, one extreme involved pioneers in the field hijacking graphics-specific APIs, such as OpenGL, and using them as vehicles for carrying out general purpose calculations. However, development was time consuming and essentially hardware specific. At the other extreme, a compiler should exist which can compile existing scientific code for execution on GPUs without the scientist having to consider the underlying nature of the hardware being calculated on.

At present, we are somewhere in between these points; the barrier to utilizing GPU hardware for general purpose computation has been reduced by the introduction of general purpose GPU programming models such as NVIDIA's Compute Unified Device Architecture (CUDA) and ATI's Stream, although algorithmic paradigm shifts are often required in existing codes to maximize the performance offered by the massively parallel GPU hardware.
The CUDA programming model from NVIDIA appears to be the most mature and widespread in scientific applications at this moment in time, hence the discussion here will focus on specifics pertaining to it. CUDA, a C-like programming language, enables code to run concurrently on the CPU and GPU, with the assumption that the numerically intensive parts of a program will be executed on the GPU while the remaining sections, which are perhaps not suited to the GPU, remain executing on the CPU. A mechanism is provided for the two parts of the running code to communicate with each other.
CUDA abstracts the hierarchical GPU hardware structure outlined above into a programming framework, requiring the coder to write in an intrinsically parallel fashion. The small numerically intensive subroutines of code that run specifically on the GPU are termed kernels. These are executed in blocks, where each block contains multiple instances of the kernel, termed threads.

This partitioning enables the following (CUDA runtime mediated) physical mapping onto the GPU hardware: each block is run on an individual SM, with the number of threads determined by the number of physical streaming processors within the SM. As a result, only threads within the same block can synchronize with each other. This block-based parallelism and the need to keep all SM units busy in order to achieve efficient performance lead to a number of nontrivial programming considerations.
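To make the kernel, block, and thread terminology concrete, the following minimal CUDA sketch launches a trivial kernel over a one-dimensional array. It is an illustrative example only; the array, kernel name, and launch configuration are arbitrary choices for demonstration and are not taken from any of the MD codes discussed in this chapter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread scales one element of the force array.
// blockIdx.x identifies the block, threadIdx.x the thread within it.
__global__ void scale_forces(float *f, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        f[i] *= s;
}

int main()
{
    const int n = 1 << 20;                 // 2^20 force components (arbitrary)
    size_t bytes = n * sizeof(float);

    float *h_f = (float *)malloc(bytes);   // host copy
    for (int i = 0; i < n; ++i) h_f[i] = 1.0f;

    float *d_f;                            // device copy
    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_f, h_f, bytes, cudaMemcpyHostToDevice);

    // Launch: enough blocks of 256 threads each to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_forces<<<blocks, threadsPerBlock>>>(d_f, 0.5f, n);

    cudaMemcpy(h_f, d_f, bytes, cudaMemcpyDeviceToHost);
    printf("f[0] = %f\n", h_f[0]);         // expect 0.5

    cudaFree(d_f);
    free(h_f);
    return 0;
}
```

Each block of 256 threads is mapped onto an SM by the CUDA runtime, and the index arithmetic in the first line of the kernel is what distributes the data across the many concurrent threads.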
2.3 GPU programming considerations
A key strategy in improving the wall clock time to solution of a scientific problem is recasting an algorithm in a way that makes it computationally palatable for the nature of the hardware that it is being executed on; an algorithm that performs poorly on a CPU may perform many orders of magnitude better on a GPU, and vice versa. However, when dealing with scientific problems, it is essential that alternative approaches to solving the underlying physics yield the same solution, albeit via different paths. It is very tempting, given the architectural differences of GPU hardware, to change the nature of the problem being solved without a thorough understanding of the implications this has on the scientific results.
General strategies when developing efficient algorithms on GPUs include the following:

1 Ensure that host-to-device communication during a calculation is kept to a minimum; for example, one should ensure that as much of the calculation as possible remains on the GPU (see the sketch after this list). Ferrying data back and forth between the GPU and the host machine is costly due to the latency of the PCIe bus; hence, if one is storing atomic coordinates in the host's memory, then the GPU is going to be idle while it is waiting for an updated set to arrive. The same holds within the GPU itself. A corollary to this is that very often it is more efficient to recalculate an existing result on the GPU, rather than fetch it from a nonlocal location.

2 Accuracy issues that arise from hardware single precision (SP) limitations need to be controlled in a way that is acceptable to the scientific algorithm being simulated. Approaches to this include sorting floating-point values by magnitude prior to summation.

3 Recasting the problem in a vector fashion that groups data that will be operated on in the same way allows for maximizing the efficiency of the streaming processors.
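The sketch below illustrates point 1: a hedged, minimal CUDA outline of an MD-style time-stepping loop in which coordinates, velocities, and forces stay resident in device memory and only cross the PCIe bus when output is required. The kernels are trivial placeholders (a toy restoring force and a simple integrator) chosen purely to show the data flow; none of this is taken from the production codes discussed later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder force kernel: in a real MD code this would evaluate the
// nonbonded and bonded interactions; here it just assigns a toy force.
__global__ void compute_forces(const float *x, float *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = -0.1f * x[i];          // toy harmonic restoring force
}

// Placeholder integration kernel (velocity update + position update).
__global__ void integrate(float *x, float *v, const float *f, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v[i] += dt * f[i];                   // unit mass assumed
        x[i] += dt * v[i];
    }
}

int main()
{
    const int n = 3 * 10000;                 // 10,000 atoms x 3 coordinates
    const int steps = 1000, output_interval = 100;
    const float dt = 0.002f;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x, *d_v, *d_f;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_v, bytes);
    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // one-time upload
    cudaMemset(d_v, 0, bytes);

    int tpb = 256, blocks = (n + tpb - 1) / tpb;
    for (int step = 1; step <= steps; ++step) {
        // All per-step work stays on the GPU; no host round trip.
        compute_forces<<<blocks, tpb>>>(d_x, d_f, n);
        integrate<<<blocks, tpb>>>(d_x, d_v, d_f, dt, n);

        if (step % output_interval == 0) {
            // Coordinates cross the PCIe bus only for occasional output.
            cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
            printf("step %d: x[0] = %f\n", step, h_x[0]);
        }
    }

    cudaFree(d_x); cudaFree(d_v); cudaFree(d_f);
    free(h_x);
    return 0;
}
```

The important point is the structure of the loop: the per-step kernels read and write only device arrays, and cudaMemcpy appears only inside the infrequent output branch.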
It should be clear from the above discussion that while GPUs offer an attractive price/performance ratio, there are significant hurdles to utilizing them efficiently. Indeed, in some cases, the development costs of GPU-specific code may negate the cost/performance benefits.
3 GPU-BASED IMPLEMENTATIONS OF CLASSICAL MOLECULAR DYNAMICS
As illustrated in the previous section, GPUs have come a long way in terms of their ease of use for general purpose computing. In the last four years, beginning in 2006, NVIDIA's CUDA and ATI's Stream APIs have made programming GPUs significantly easier, and the addition of double-precision (DP) hardware in NVIDIA's GT200 line and ATI's FireStream series has facilitated effective implementation of MD algorithms. For the reasons discussed above, GPUs are still significantly more complex to program than traditional CPUs. However, the potential cost/performance benefit makes them enticing development platforms. It is only very recently, however, that the use of GPUs for MD simulations has begun to mature to the point where fully featured production MD codes have appeared. The lure of very high performance improvements for minimal cost has influenced early attempts at accelerating MD on GPUs.

As we see below, the race to develop MD codes on this "new" hardware has led many to adopt inappropriate or untested approximations rather than taking the time to address the shortcomings of GPUs. It is also very difficult to compare successes and performance between implementations, since a number of manuscripts show only speedups of small parts of the code or comparisons against very different types of simulations. A detailed look at what appears, at first sight, to be a very crowded and successful field uncovers only a few select codes that could be considered production ready. In this section, we provide an overview of the peer-reviewed literature on GPU-based MD along with a discussion of these production ready codes.
3.1 Early GPU-based MD code development
In what was arguably the first published implementation of GPU-accelerated MD, Yang et al. used GPUs to accelerate simulations of thermal conductivity. This work was prior to the release of the CUDA and Stream APIs, and hence the authors were forced to implement their algorithm directly through graphics APIs, achieving performance improvements of between 10 and 11 times that of a single Intel Pentium 3.0 GHz processor. While an impressive proof of concept, the Yang et al. implementation was very simplistic, containing just Lennard-Jones interactions and a neighbor list that was constructed to remain static over the course of the simulation. It thus lacked many of the important features, such as covalent terms, short- and long-range electrostatics, thermostats, barostats, neighbor list updates, and restraints, needed for MD of biological systems. Nevertheless, this pioneering study demonstrated that implementing an MD code on GPUs was feasible.
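To give a flavor of what such an early, minimal implementation involves, the kernel below computes Lennard-Jones forces from a fixed, precomputed neighbor list. It is a hedged illustration of the general approach written for this chapter; it is not code from Yang et al., and it omits periodic boundary conditions, cutoff handling, and energy accumulation.

```cuda
#include <cuda_runtime.h>

// Simplified Lennard-Jones force kernel over a static neighbor list.
// Each thread handles one atom i and loops over its stored neighbors.
// pos: (x,y,z) packed as float4 (w unused); nbr: neighbor indices with
// max_nbr entries per atom, padded with -1. Reduced LJ units assumed.
__global__ void lj_forces(const float4 *pos, const int *nbr, int max_nbr,
                          float4 *force, float eps, float sigma, int n_atoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    float4 pi = pos[i];
    float fx = 0.0f, fy = 0.0f, fz = 0.0f;
    float sig2 = sigma * sigma;

    for (int k = 0; k < max_nbr; ++k) {
        int j = nbr[i * max_nbr + k];
        if (j < 0) break;                       // padded end of list

        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;

        // LJ pair force divided by r:
        // F/r = 24*eps*(2*(sigma^2/r^2)^6 - (sigma^2/r^2)^3) / r^2
        float sr2 = sig2 / r2;
        float sr6 = sr2 * sr2 * sr2;
        float fscale = 24.0f * eps * (2.0f * sr6 * sr6 - sr6) / r2;

        fx += fscale * dx;
        fy += fscale * dy;
        fz += fscale * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.0f);
}
```

Because the neighbor list never changes, the kernel maps naturally onto the SIMD execution model, which is one reason this class of problem was attacked first.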
The advent of the CUDA and Stream programming APIs made programming GPUs significantly easier and brought with it an explosion of GPU MD implementations. Most early implementations of MD on GPUs are characterized by an exploration of the field, with the development of codes and GPU-specific algorithms focused on simplistic, artificial, or very specific model problems rather than the application of GPUs to "real-world" production MD simulations.
Liu et al., like Yang et al., chose to implement just a simplistic van der Waals potential, allowing them to avoid all of the complexities inherent in production MD simulations of condensed-phase systems. Unlike Yang, however, Liu et al. recomputed their neighbor list periodically, providing the first example of a neighbor list update for MD on GPUs.
Stone et al. described a series of target algorithms for molecular modeling computations, including techniques for direct Coulomb summation for calculating charge-charge interactions within a cutoff. They also discussed possible techniques for evaluation of forces in MD, providing the first mention of a combined treatment of direct space van der Waals and electrostatics in a GPU implementation. Their implementation, however, did not include any actual MD but instead focused on the more simplistic applications of ion placement and the calculation of time-averaged Coulomb potentials in the vicinity of a simulated system. While providing an example of how Coulomb interactions can be accelerated with GPUs and laying the groundwork for an experimental GPU-accelerated version of NAMD, this work remained some distance from production MD simulations.
Following on the heels of Yang et al., a number of groups began implementing their own MD codes on GPUs, although most were still simply proof-of-concept prototypes with limited applicability for production MD calculations. Some provided schemes for neighbor list updates but still applied them only to simple van der Waals systems. Anderson et al. were among the first to include the calculation of covalent terms, adding GPU computation of van der Waals and harmonic bond potentials to their HOOMD code in order to study nonionic liquids. They also included integrators and neighbor lists in their implementation; however, while the HOOMD GPU implementation went a step closer to a full MD implementation, it still neglected most of the complexities, including both short- and long-range electrostatics, angle terms, torsion terms, and constraints, required for simulating condensed-phase systems.

Another early effort demonstrated GPU-accelerated simulations of liquid water. The approach was similar to that of Anderson et al. but also included angle and short-range electrostatic terms. While a demonstration of a condensed-phase simulation, the approach used was still extremely restrictive and of limited use in real-world applications.

These early GPU-based MD implementations are characterized by significantly oversimplifying the mathematics in order to make implementation on a GPU easier, neglecting, for example, electrostatics, covalent terms, and heterogeneous solutions. This has resulted in a large number of GPU implementations being published but none with any applicability to "real-world" production MD simulations. It is only within the last year (2009/2010) that useful GPU implementations of MD have started to appear.
3.2 Production GPU-based MD codes
The features typically necessary for a condensed-phase production MD code for biological simulations are explicit and implicit solvent implementations, correct treatment of long-range electrostatics, support for different statistical ensembles (NVT, NVE, and NPT), thermostats, restraints, constraints, and integration algorithms. At the time of writing, there are only three published MD GPU implementations that could be considered production quality codes, in addition to independent implementations, such as GPU support for generalized Born implicit solvation in AMBER, whose descriptions had not yet been published.
The ACEMD package by Harvey et al. could be considered the first fully featured production GPU MD code; it includes support for periodic boundaries and, more importantly, both short- and long-range electrostatics using a smooth particle mesh Ewald (PME) treatment. The OpenMM library of Friedrichs et al. initially targeted the implicit solvent generalized Born model on small- and medium-sized systems; later work improved the OpenMM library and adapted it to explicit solvent simulation, including treatment of long-range electrostatics. Additionally, a GPU-accelerated version of GROMACS has been developed which works via links to the OpenMM library. GPU acceleration of explicit solvent calculations is also available in NAMD v2.7b2, although the acceleration is limited since only the direct space nonbond interactions are calculated on the GPU at present, necessitating a synchronization with the CPU at every step. A comparison of the key features of production MD codes, at the time of writing, is given in Table 2.
Of these codes, the AMBER implementation includes the broadest set of features, capable of running implicit and explicit solvent simulations in all three ensembles with flexible restraints on any atoms, as well as allowing the use of multiple precision models, although it only supports a single GPU per MD simulation at present. Some of the other codes do not include all of the key features for MD simulation, such as pressure coupling and implicit solvent models, although this will almost certainly change in the future. The NAMD implementation is CPU centric, focusing on running MD in a multiple node, multiple GPU environment, whereas the others implement all MD features on the GPU and strive to optimize MD performance on a single GPU or multiple GPUs within a single node. We note that of all the production MD codes available, OpenMM is the only one to support both NVIDIA and ATI GPUs; the others are developed just for NVIDIA GPUs. We also note that ACEMD and AMBER are commercial products, whereas the others are available under various open-source licensing models.
Table 2 Key feature comparison between the GPU-accelerated MD codes (columns: code, simulation implementation, GPU acceleration, multiple GPU support, GPU type, and licensing model)
a GROMACS has been implemented with OpenMM
4 PERFORMANCE AND ACCURACY
4.1 Performance and scaling
The performance of MD simulations on modern clusters and supercomputers is currently limited by the communication bottlenecks that occur due to the significant imbalances that exist between CPU speeds and hardware interconnects. The use of GPUs does nothing to alleviate this and indeed actually exacerbates it by making an individual node faster and thus increasing the amount of communication per unit of time that is required between nodes. For this reason, GPU-accelerated MD does not offer the ability to run substantially longer MD simulations than are currently feasible on the best supercomputer hardware, nor does it provide a convincing case for the construction of large clusters of GPUs; however, what it does offer is the ability to run substantially more sampling on a workstation or single node for minimal cost. The huge performance gap that exists between cluster interconnects and GPUs has meant that the majority of implementations have focused on utilizing just a single GPU (OpenMM, AMBER) or multiple GPUs within a single node (ACEMD). Only NAMD has attempted to utilize multiple nodes, but with success that is largely due to simulating very large systems and not attempting to optimize single-node performance, thus requiring large numbers of GPUs to achieve only modest speedups and negating many of the cost/performance benefit arguments. Thus the benefit of GPUs to condensed-phase MD should be seen as condensing small (2-8 node) clusters into single workstations for a fraction of the cost, rather than providing a way to run hundreds of microseconds of MD per day on large clusters of GPUs.
A fair comparison of performance across current implementations is very difficult since it is almost impossible to run identical simulations in different programs, and indeed even within the same program it is not always possible to make a fair comparison, since additional approximations are often made in the GPU implementation in the desire to achieve larger speedups, without considering such approaches on the CPU. There are also numerous situations where people compare the performance of individual kernels, such as the Coulomb sum, rather than the complete implementation. Indeed, a careful look at the current literature finds speedups ranging from 7 to 700+. To understand why such numbers might be misleading, consider, as one example, a comparison of simulations of various boxes of water between a GPU implementation and CHARMM on a single CPU that at no point mentions the version of CHARMM used, the compilers used, or even the settings used in the CHARMM code. It should be noted that, largely for historical reasons, the use of default settings in CHARMM tends to give very poor performance. There are then of course multiple optimizations that can be made on the GPU due to the simplicity of the water model. The first is the use of cubic boxes, which can benefit vectorization on the GPU; for codes supporting PME, it also provides more optimal fast Fourier transform (FFT) performance. The second is the use of the SPC/Fw water model which, being fully flexible, avoids the need to implement bond constraints on
the GPU. Finally, the use of a pure water box means that all molecules are essentially identical. This allows one to hard code all of the various parameters, since all bonds are identical, all oxygen charges are identical, etc., and thus avoid the additional costs associated with doing such lookups on the GPU. For these reasons, the performance and speedups quoted for various GPU implementations should typically be considered an upper bound on the performance achievable. Additionally, many factors determine the performance of GPU-accelerated
MD codes. Implicit solvent simulations in general show much greater performance boosts over explicit solvent simulations due to the reduced complexity of the underlying algorithm; specifics include avoiding the need for FFTs and the use of infinite cutoffs, which in turn removes the complexity of maintaining a neighbor list. Substantial speedups have been reported for the single-precision OpenMM code relative to what is presumably AMBER 9's DP sander implementation for systems of around 600 atoms, reaching more than two orders of magnitude for larger systems. Similar speedups have been observed in direct comparisons between AMBER's PMEMD code running on 2.8 GHz Intel E5462 CPUs and NVIDIA C1060 Tesla GPUs, while OpenMM also showed impressive linear performance scaling with system size in its non-PME explicit solvent simulations and a speedup of at least 19-fold. However, it is unclear from the OpenMM manuscript whether the comparisons are like for like, since the AMBER and NAMD numbers appear to be for full PME-based explicit solvent simulations. ACEMD showed that its 3-CPU/3-GPU performance was roughly equivalent to 256-CPU NAMD on the DHFR system, with a comparison for a 16-CPU/16-GPU configuration also reported.
4.2 Validation
While the majority of articles describing new GPU MD implementations have focused considerable attention on performance comparisons to CPU simulations, there has been very little effort to comprehensively test and validate the implementations, both in terms of actual bugs and in the use of various approximations such as single precision or alternative electrostatic treatments. Since DP has only recently become available on GPUs, and because SP still offers a more than 10-fold performance enhancement, all of the GPU-based MD implementations use either single precision or a combination of hybrid single and DP math. Several authors have attempted to provide validation of this and other approximations, but often only in a limited fashion, preferring instead to focus on performance. Some, for example, simply ran simulations on the CPU and GPU and then provided plots of energy and temperature profiles for the two simulations without any form of statistical analysis.
Others simply compare the deviation in atom positions between two runs on different CPU counts and on the GPU. Harvey et al. attempted a more detailed validation of ACEMD, but this was still far from comprehensive. For example, they stated in their manuscript that "Potential energies were checked against NAMD values for the initial configuration of a set of systems, ..., in order to verify the correctness of the force calculations by assuring that energies were identical within 6 significant figures." Since scalar potential energies do not convey information about the vector forces, it is unclear how the authors considered this a validation of their force calculations. They provide a table with energy changes in the NVE ensemble per nanosecond per degree of freedom but do not provide any independent simulations for comparison. The authors also state that "... we validate in this section the conservation properties of energy in a NVT simulation ...", which is of little use in validation since energy is not a conserved quantity in the NVT (canonical) ensemble. Additionally, they carried out calculations of Na-Na pair distribution functions using their ACEMD GPU code and also GROMACS on a CPU; however, the lack of consistency in the simulation parameters between GPU and CPU and the clear lack of convergence in the results mean that the validation is qualitative at best.
Another validation consisted of simply examining energy conservation for simulations of the lambda repressor and stating, although (as with Harvey et al.) not providing the numbers in a table to ease comparison, that this compares favorably with other DP CPU implementations.
The push to highlight performance on GPUs has meant that not one of the currently published papers on GPU implementations of MD actually provides any validation of the approximations made in terms of statistical mechanical properties. For example, one could show that converged simulations run on a GPU and a CPU give identical radial distribution functions, order parameters, and residual dipolar couplings, to name but a few possible tests.
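As a sketch of what one such statistical-mechanical check might look like, the CUDA kernel below accumulates a radial distribution function histogram from stored trajectory frames, so that g(r) curves from converged CPU and GPU runs can be compared. It is an assumed, minimal illustration written for this chapter (periodic boundary conditions and normalization are omitted) and is not part of any of the packages discussed.

```cuda
#include <cuda_runtime.h>

// Accumulate a radial distribution function histogram for one frame.
// Each thread handles one atom i and bins its distances to atoms j > i.
// pos: packed (x,y,z) coordinates; hist: nbins counters; dr: bin width.
// Periodic boundary conditions are omitted for brevity.
__global__ void rdf_histogram(const float4 *pos, unsigned int *hist,
                              int n_atoms, int nbins, float dr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    float4 pi = pos[i];
    float rmax = nbins * dr;

    for (int j = i + 1; j < n_atoms; ++j) {
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r = sqrtf(dx * dx + dy * dy + dz * dz);
        if (r < rmax) {
            int bin = (int)(r / dr);
            atomicAdd(&hist[bin], 1u);   // histogram shared by all threads
        }
    }
}

// After accumulating over many frames from the CPU and GPU trajectories,
// each histogram is normalized by the ideal-gas count for its bin volume
// and the resulting g(r) curves are compared.
```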
5 APPLICATIONS
While a significant number of published papers describe GPU implementations of MD, a review of the literature reveals very few cited uses of these codes in "real-world" simulations. Indeed, only Pande et al. have such papers published at the time of writing. This serves to underscore the nascent nature of this field.
5.1 Protein folding
In the only published examples of the use of GPU-accelerated bio-MD simulations, Pande et al. have used the OpenMM library to study protein folding in implicit solvent, including the Fip35 WW domain, whose folding time determined experimentally is ~13 μs. With an average performance of 80-200 ns/day on a single GPU for this 544-atom protein fragment, and utilizing the Folding@Home distributed computing network, they generated independent trajectories totaling over 2.73 ms of ensemble-averaged results, with an average length of 207 ns per trajectory and with some trajectories of greater than 3 μs in length, allowing a direct exploration of the folding landscape. Similar trajectory lengths were calculated for the NTL9 (922 atom) case. Additionally, Harvey and De Fabritiis performed a 1 μs explicit solvent MD simulation of the villin headpiece to probe its folding kinetics as part of their ACEMD benchmark results and achieved 66 ns/day on a three-GPU-equipped workstation. These examples demonstrate the utility of GPU-accelerated MD implementations in helping researchers use personal workstations to reach simulation timescales that would typically only be possible using large clusters and obtain ensemble-averaged results that provide sampling timeframes comparable to experiment. This potentially opens the door to studying a whole range of relevant biological events without requiring access to large-scale supercomputer facilities.
6 CONCLUSIONS AND FUTURE DIRECTIONS
It should be clear from this chapter that the field of GPU acceleration of condensed-phase biological MD simulations is still in its infancy. Initial work in the field concentrated on artificially simplistic models, and it is only recently that production quality MD codes have been developed that can make effective use of this technology. The pressure to achieve maximum performance has led to a number of shortcuts and approximations being made, many without any real validation or rigorous study. What initially appears to be an established and extremely active field actually, upon scratching the surface, consists of only a few select codes which could be considered to be production ready and even fewer examples of "real-world" use. However, the current cost benefits of GPUs are enticing, and this is driving both code and hardware development.
In a few short years, GPU-based MD codes have evolved from proof-of-concept prototypes to production-level software packages. Despite the substantial progress made in code development, the difficulty in programming GPU devices still persists, forcing approximations to be made to circumvent some of the limitations of GPU hardware. However, NVIDIA's recently released Fermi architecture provides features such as full support for DP and error-correcting memory, along with a more versatile FFT implementation, that many consider vital to effective use of GPUs for MD simulations. Given this, a number of established groups in the biological MD field are in the process of developing GPU-accelerated versions of
their software. This will bring more competition to the field and, hopefully, with it a better focus on extensive validation of the approximations made.

It is anticipated that, with the release of GPU versions of widely used MD codes, the use of GPUs in research involving MD will increase rapidly over the coming years, provided that developers can demonstrate the credibility of these implementations under the same degree of scrutiny to which CPU implementations have been subjected over the years.

REFERENCES
1 McCammon, J.A., Gelin, B.R., Karplus, M Dynamics of folded proteins Nature 1977, 267, 585—90
2 Duan, Y., Kollman, P.A Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution Science 1998, 282, 740—4
3 Yeh, I., Hummer, G Peptide loop-closure kinetics from microsecond molecular dynamics simulations in explicit solvent J Am Chem Soc 2002, 124, 6563—8
4 Klepeis, J.L., Lindorff-Larsen, K., Dror, R.O., Shaw, D.E Long-timescale molecular dynamics simulations of protein structure and function Curr Opin Struct Biol 2009, 19, 120—7
5 Sanbonmatsu, K.Y., Joseph, S., Tung, C Simulating movement of tRNA into the ribosome during decoding Proc Natl Acad Sci USA 2005, 102, 15854—9
6 Freddolino, P.L., Arkhipov, A.S., Larson, S.B., McPherson, A., Schulten, K Molecular dynamics simulations of the complete satellite tobacco mosaic virus Structure 2006, 14, 437—49
7 Bakker, A.F., Gilmer, G.H., Grabow, M.H., Thompson, K A special purpose computer for molecular dynamics calculations J Comput Phys 1990, 90, 313—35
8 Fine, R., Dimmler, G., Levinthal, C FASTRUN: A special purpose, hardwired computer for molecular simulation Protein Struct Funct Genet 1991, 11, 242—53
9 Susukita, R., Ebisuzaki, T., Elmegreen, B.G., Furusawa, H., Kato, K., Kawai, A., Kobayashi, Y., Koishi, T., McNiven, G.D., Narumi, T., Yasuoka, K Hardware accelerator for molecular dynamics: MDGRAPE-2 Comput Phys Commun 2003, 155, 115—31
10 Case, D.A., Darden, T.A., Cheatham, T.E., Simmerling, C.L., Wang, J., Duke, R.E., Luo, R., Crowley, M., Walker, R.C., Zhang, W., Merz, K.M., Wang, B., Hayik, S., Roitberg, A., Seabra, G., Kolossvary, I., Wong, K.F., Paesani, F., Vanicek, J., Wu, X., Brozell, S.R., Steinbrecher, T., Gohlke, H., Yang, L., Tan, C., Mongan, J., Hornak, V., Cui, G., Mathews, D.H., Seetin, M.G., Sagui, C., Babin, V., Koll man, P.A., AMBER 10, University of California, San Francisco, 2008
11 Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Simmerling, C., Wang, B., Woods, R.J The Amber biomolecular simulation programs J Comput Chem 2005, 26, 1668—88
14 Narumi, T., Ohno, Y., Noriyuk, F., Okimoto, N., Suenaga, A., Yanai, R., Taiji, M A high-speed special-purpose computer for molecular dynamics simulations: MDGRAPE-3 In From Computational Biophysics to Systems Biology (eds J Meinke, O Zimmermann, S Mohanty and U.H.E Hansmann), J von Neumann Institute for Computing, Jülich, 2006, pp 29—36
15 NVIDIA: Santa Clara, CA, CUDA Programming Guide, http://developer.download.nvidia.com/ compute/cuda/30/toolkit/docs/NVIDIACUDAProgrammingGuide3.0.pdf (Accessed March 6, 2010)
16 von Neumann, J First draft of a report on the EDVAC IEEE Ann Hist Comput 1993, 15, 27—75
17 Flynn, M.J., Some computer organizations and their effectiveness IEEE Trans Comput 1972, C-21, 948—60
18 Kirk, D.B., Hwu, W.W Programming Massively Parallel Processors, Morgan Kaufmann Publishers, Burlington, 2010
19 Yang, J., Wang, Y., Chen, Y GPU accelerated molecular dynamics simulation of thermal conductivities J Comput Phys 2007, 221, 799—804
20 AMD: Sunnyvale, CA, ATI, www.amd.com/stream (Accessed March 14, 2010)
21 Woo, M., Neider, J., Davis, T., Shreiner, D OpenGL Programming Guide: The Official Guide to Learning OpenGL, version 1.2, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
22 Liu, W., Schmidt, B., Voss, G., Müller-Wittig, W In High Performance Computing–HiPC 2007: Lecture Notes in Computer Science (eds S Aluru, M Parashar, R Badrinath and V.K Prasanna), Vol 4873, Springer, Berlin/Heidelberg, 2007, pp 185—96
23 Stone, J.E., Phillips, J.C., Freddolino, P.L., Hardy, D.J., Trabuco, L.G., Schulten, K Accelerating molecular modeling applications with graphics processors J Comput Chem 2007, 28, 2618—40
24 Phillips, J.C., Stone, J.E., Schulten, K Adapting a message-driven parallel application to GPU-accelerated clusters In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 1—9, IEEE Press, Piscataway, NJ, USA, 2008
25 van Meel, J.A., Arnold, A., Frenkel, D., Portegies Zwart, S.F., Belleman, R.G Harvesting graphics power for MD simulations Mol Simulat 2008, 34, 259—66
26 Rapaport, D.C Enhanced molecular dynamics performance with a programmable graphics processor, arXiv Physics, 2009, arXiv:0911.5631v1
27 Anderson, J.A., Lorenz, C.D., Travesset, A General purpose molecular dynamics simulations fully implemented on graphics processing units J Comput Phys 2008, 227, 5342—59
28 Davis, J., Ozsoy, A., Patel, S., Taufer, M Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors, Springer, Berlin/Heidelberg, 2009
29 Harvey, M.J., Giupponi, G., De Fabritiis, G ACEMD: Accelerating biomolecular dynamics in the microsecond time scale J Chem Theory Comput 2009, 5, 1632—9
30 Friedrichs, M.S., Eastman, P., Vaidyanathan, V., Houston, M., Le Grand, S., Beberg, A.L., Ensign, D L., Bruns, C.M., Pande, V.S Accelerating molecular dynamic simulation on graphics processing units J Comput Chem 2009, 30, 864—72
31 Case, D.A., Darden, T.A., Cheatham, T.E.III, Simmerling, C.L., Wang, J., Duke, R.E., Luo, R., Crowley, M., Walker, R.C., Williamson, M.J., Zhang, W., Merz, K.M., Wang, B., Hayik, S., Roitberg, A., Seabra, G., Kolossv�ary, I., Wong, K.F., Paesani, F., Vanicek, J., Wu, X., Brozell, S.R., Steinbrecher, T., Gohlke, H., Yang, L., Tan, C., Mongan, J., Hornak, V., Cui, G., Mathews, D.H., Seetin, M.G., Sagui, C., Babin, V., Kollman, P.A Amber 11, Technical report, University of Cali fornia, San Francisco, 2010
32 Darden, T., York, D., Pedersen, L Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems J Chem Phys 1993, 98, 10089—92
33 Essmann, U., Perera, L., Berkowitz, M.L., Darden, T., Lee, H., Pedersen, L.G A smooth particle mesh Ewald method J Chem Phys 1995, 103, 8577—93
34 Harvey, M.J., De Fabritiis, G An implementation of the smooth particle mesh Ewald method on GPU hardware J Chem Theory Comput 2009, 5, 2371—7
35 Eastman, P., Pande, V.S Efficient nonbonded interactions for molecular dynamics on a graphics processing unit J Comput Chem 2010, 31, 1268—72
36 Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M CHARMM:
A program for macromolecular energy, minimization, and dynamics calculations J Comput Chem 1983, 4, 187—217
37 Wu, Y., Tepper, H.L., Voth, G.A Flexible simple point-charge water model with improved liquid-state properties J Chem Phys 2006, 124, 24503
38 Grand, S.L., Goetz, A.W., Xu, D., Poole, D., Walker, R.C Accelerating of amber generalized born calculations using nvidia graphics processing units 2010 (in preparation)
39 Grand, S.L., Goetz, A.W., Xu, D., Poole, D., Walker, R.C Achieving high performance in amber PME simulations using graphics processing units without compromising accuracy 2010 (in preparation)
40 Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kale, L., Schulten, K Scalable molecular dynamics with NAMD J Comput Chem 2005, 26, 1781—802
41 Ensign, D.L., Pande, V.S The Fip35 WW domain folds with structural and mechanistic heterogeneity in molecular dynamics simulations Biophys J 2009, 96, L53—5
42 Voelz, V.A., Bowman, G.R., Beauchamp, K., Pande, V.S Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39) J Am Chem Soc 2010, 132, 1526—8
43 Shirts, M., Pande, V.S Computing: Screen savers of the world unite! Science 2000, 290, 1903—4
44 NVIDIA Corporation Next generation CUDA compute architecture: Fermi, 2009
CHAPTER 2

Quantum Chemistry on Graphics Processing Units

3.4 Density functional theory with Daubechies wavelets
4 Ab Initio Electron Correlation Methods
5 Quantum Monte Carlo
6 Concluding Remarks
Acknowledgments
References
We review implementations for acceleration of quantum chemistry and computational condensed matter physics simulations on graphics processing units (GPUs) as documented in the peer-reviewed literature. We give a general overview of programming techniques and concepts that should be considered when porting scientific software to GPUs. This is followed by a discussion of Hartree-Fock and density functional theory, wave function-based electron correlation methods, and quantum Monte Carlo, in which we outline the underlying problems and present the approaches which aim at exploiting the performance of the massively parallel GPU hardware. We conclude with an outlook on the trends to be expected in the foreseeable future.
1 San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA
2 Lehrstuhl für Theoretische Chemie, Universität Erlangen, Erlangen, Germany

Annual Reports in Computational Chemistry, Volume 6 © 2010 Elsevier B.V. All rights reserved
ISSN: 1574-1400, DOI 10.1016/S1574-1400(10)06002-0
1 INTRODUCTION
Commodity graphics processing units (GPUs) are becoming increasingly popular to accelerate molecular and condensed matter simulations due to their low cost and potential for high performance when compared with central processing units (CPUs). In many instances, classical approximations are very successful for such simulations. However, a large number of problems of contemporary nano-, bio-, or materials science require a quantum mechanical description of the electronic structure [1—3]. This chapter provides an overview of recent developments within quantum chemistry and computational condensed matter physics that utilize accelerator hardware for this purpose.
Quantum chemistry and solid-state physics codes implement relatively complex algorithms, and porting them to GPUs requires restructuring these algorithms to take advantage of the specialized hardware. A successful GPU implementation requires, for example, a careful consideration of the memory hierarchy and memory access patterns. On single-precision GPUs, the numerical accuracy is a central issue because six to seven significant figures are frequently insufficient to match the accuracy of the underlying theoretical methods. Finally, care should be taken to allow for a coevolution of the code with the hardware. There are two general strategies for an implementation. The first is a complete reimplementation of existing functionality into a new software package. The most common way, however, is to incrementally include GPU kernels for the computationally intensive parts of existing software packages. The latter approach has the advantage of retaining the full functionality of software packages that in many cases have evolved over several decades.
This chapter begins with a brief introduction to the general concepts that have to be considered in order to successfully port scientific software to GPUs. The rest of this chapter is structured according to the different theoretical models commonly used in quantum chemistry, beginning with density functional theory (DFT) in Section 3, which also covers Hartree-Fock (HF) theory. Section 4 deals with wave function-based electron correlation methods and Section 5 with quantum Monte Carlo (QMC). Each of these sections contains an overview of the critical parts of the underlying theory followed by a presentation and analysis of approaches that have been taken to accelerate the computationally intensive parts on GPUs. Section 6 summarizes the present state of GPU implementations for quantum chemistry and finishes with general conclusions on trends to be expected in the foreseeable future.
2 SOFTWARE DEVELOPMENT FOR GRAPHICS PROCESSING UNITS
An excellent introduction to software development for GPUs, including a discussion of the hardware and its historic development, can be found in the book of Kirk and Hwu. In order to write efficient programs for GPUs, it is necessary to have an understanding of the characteristics of the GPU hardware architecture.
A GPU is an example of a massively parallel stream-processing architecture which uses the single-instruction multiple data (SIMD) vector processing model. Typical GPUs contain many arithmetic units which are arranged in groups that share fast access memory and an instruction unit. The high density of arithmetic units, however, comes at the expense of cache size and control units. The NVIDIA GeForce 8800 GTX GPU, which was released in late 2006, for example, consists of 16 streaming multiprocessors (SMs), each of which is composed of eight scalar processors (ScaPs). Each SM operates independently of the other SMs, and at any given clock cycle each ScaP within an SM executes the same instruction but for different data. Due to this intrinsic parallelization, a GPU can outperform a standard CPU for tasks which exhibit a dense level of data parallelism. Successful approaches in GPU programming therefore require exposing the data parallelism in the underlying problem.

Each SM has access to four different types of on-chip memory with high bandwidth. In the case of the NVIDIA GeForce 8800 GTX, these are 1024 local registers per ScaP, shared memory (cache) of 16 kilobytes (KB), read-only constant cache of 8 KB to speed up reads from the constant memory space, and read-only texture cache of 8 KB to speed up reads from the texture memory space. In addition, a large, noncached off-chip graphics card memory is available; this memory, however, has a high latency of approximately 500 GPU cycles. Applications on a GPU are organized into streams and kernels; the former represent blocks of data while the latter execute operations on the data. Before a GPU kernel is executed, the CPU must copy the required data to the GPU memory. To maximize the speedup of the implemented kernels, the algorithm has to be adapted to hardware-dependent features such as the memory layout. Copy operations between main memory and graphics card memory, for example, should be avoided because access to the main memory has a high latency on the order of hundreds of GPU cycles. One of the main problems when programming GPUs is the limited size of the working memory (registers and caches) available on chip. A large number of parallel threads should therefore be run concurrently to hide the latency of the registers and the shared and global memory and to avoid pipeline stalls.
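As a generic illustration of how the fast on-chip shared memory is used, and of why many concurrent threads are needed to hide global memory latency, the following block-wise sum reduction is a standard CUDA pattern. It is a hedged, self-contained sketch (array names and block size are arbitrary) and is not taken from any quantum chemistry code.

```cuda
#include <cuda_runtime.h>

// Block-level sum reduction: each block loads a tile of the input into
// fast on-chip shared memory, reduces it there, and writes one partial
// sum per block back to the (slow) global device memory.
// Assumes blockDim.x is a power of two.
__global__ void block_sum(const float *in, float *partial, int n)
{
    extern __shared__ float tile[];          // shared memory, sized at launch

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      // coalesced global load
    __syncthreads();                         // all threads in the block wait

    // Tree reduction within the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = tile[0];       // one global write per block
}

// Example launch (host side): 256 threads per block, shared memory sized
// to hold one float per thread; the per-block partial sums can then be
// reduced again on the GPU or summed on the CPU.
//   int tpb = 256, blocks = (n + tpb - 1) / tpb;
//   block_sum<<<blocks, tpb, tpb * sizeof(float)>>>(d_in, d_partial, n);
```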
It is important to realize that many of these considerations are not only important for GPU programming. The arrangement of data in a data-parallel fashion, for example, is also important for parallel programming of distributed memory architectures, which are found in most of today's standard CPU clusters. Thus many of the techniques employed to improve the parallel efficiency of quantum chemistry codes are also applicable to GPUs. The same holds for the optimization of memory access patterns. A general
example of a portable algorithm is the fastest Fourier transform in the west (FFTW) Fourier transform library, which reaches optimal performance on the underlying hardware by tuning itself to that hardware at run time.
Early use of GPUs required one to describe the problem to be solved in terms of a graphics pipeline, employing either the OpenGL or DirectX graphics programming languages. This complexity made general purpose computation on GPUs a research topic in itself. However, with the release of NVIDIA's compute unified device architecture (CUDA) and related application programming interfaces (APIs), implementations of algorithms for GPUs using a relatively simple extension of the standard C language have become possible. A detailed overview of these programming models can be found in the literature. In addition, libraries are available that provide algorithms for commonly used problems in quantum chemistry and solid-state physics, such as Fourier transforms (CUFFT).
The first generation of GPUs to support CUDA, such as the NVIDIA GeForce 8800 GTX, only featured 32-bit single-precision (SP) arithmetic and thus was of only limited use for quantum chemistry; major efforts had to be made to deal with roundoff errors resulting from the lack of 64-bit double-precision (DP) data types. The second generation of GPUs introduced the missing 64-bit arithmetic, albeit only at an eighth of the SP performance, and GPU cards dedicated to general purpose computing, such as the NVIDIA Tesla C1060, which also provides up to 4 gigabytes (GB) of onboard memory, were introduced. The low speed of the DP arithmetic and missing features such as error-correcting code (ECC) memory, however, still hamper widespread acceptance of this generation of GPUs for scientific computing as compared to multisocket CPUs. The third generation of GPUs (such as the NVIDIA Fermi) will solve some of the major problems of the earlier models. Most importantly, DP support will be included at only half the speed of SP arithmetic. The availability of a global address space and 64-bit support will help to address the memory requirements of larger problems and support multiple GPUs in an easier and more transparent fashion. Access to CPU main memory will remain slow, however, because the data transfer takes place over the peripheral component interconnect (PCI) bus.
3 KOHN-SHAM DENSITY FUNCTIONAL AND HARTREE-FOCK THEORY
Due to its excellent balance between accuracy and computational cost, Kohn-Sham density functional theory (KS-DFT) is one of the most widely used methods to investigate electronic ground states and their properties in chemistry and materials science, complementing the wave function-based electron correlation methods which are discussed in Section 4.
There are two major computational bottlenecks in KS-DFT and HF calcula
self-consistent field (SCF) equations The latter requires diagonalization of the Fock
Table 1 Summary of the capabilities and performance of GPU-based KS-DFT and HF implementations published to date [21,23,25]
The computational effort for the formation of the KS (or Fock) matrix is dominated by the evaluation of the two-electron repulsion integrals (ERIs), which are required for the Coulomb and exact-exchange contributions, and, in the case of DFT, also by the numerical quadrature of the exchange-correlation (XC) contribution. GPU implementations of these steps are reviewed in the remainder of this section.
3.1 Electron repulsion integrals
The ERIs which are required in quantum chemistry are given as

(\mu\nu|\lambda\sigma) = \int\!\!\int \chi_\mu(\mathbf{r})\,\chi_\nu(\mathbf{r})\,\frac{1}{|\mathbf{r}-\mathbf{r}'|}\,\chi_\lambda(\mathbf{r}')\,\chi_\sigma(\mathbf{r}')\,\mathrm{d}\mathbf{r}\,\mathrm{d}\mathbf{r}',

where the \chi are atom-centered basis functions, usually Cartesian Gaussian functions. In general, these basis functions are contracted, that is, linear combinations of primitive Gaussian functions, so that each contracted ERI is a sum over primitive ERIs,

(\mu\nu|\lambda\sigma) = \sum_{pqrs} c_{p\mu}\, c_{q\nu}\, c_{r\lambda}\, c_{s\sigma}\, (pq|rs).
The number of ERIs formally grows with the fourth power of the number of basis functions and thus rapidly with the size of the molecule under consideration. Although for large systems most of the integrals can be neglected after suitable prescreening, the number of ERIs that need to be calculated represents a major computational bottleneck. Many different algorithms have been devised for the calculation of these ERIs and their efficiency depends on the contraction length and the angular momentum of the basis functions involved. Most
quantum chemistry codes therefore implement several ERI algorithms and make use of the best method for a given type of ERI.
From the ERIs, the Coulomb and exact-exchange contributions to the KS (or Fock) matrix are obtained as

J_{\mu\nu} = \sum_{\lambda\sigma} (\mu\nu|\lambda\sigma)\, P_{\lambda\sigma}, \qquad K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma},

where P_{\lambda\sigma} is the density matrix. In a direct SCF procedure, these contributions to the KS (or Fock) matrix can be evaluated directly such that the contracted ERIs never need to be explicitly calculated and stored in memory.
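Schematically, and ignoring prescreening, permutational symmetry, and all performance considerations, a direct Coulomb matrix build on the GPU has the structure sketched below; the eri() routine is a placeholder for an actual integral algorithm (Rys quadrature, McMurchie-Davidson, etc.) and all names are illustrative.

// Placeholder for a real ERI routine; in practice this is where the
// Rys quadrature or McMurchie-Davidson recursions would be evaluated.
__device__ double eri(int mu, int nu, int lam, int sig)
{
    return 0.0;  // stub
}

// J[mu*nbf+nu] = sum_{lam,sig} (mu nu | lam sig) * P[lam*nbf+sig]
// One thread per (mu,nu) element of the (row-major) Coulomb matrix.
__global__ void coulomb_matrix(double *J, const double *P, int nbf)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nbf * nbf) return;
    int mu = idx / nbf;
    int nu = idx % nbf;

    double jmn = 0.0;
    for (int lam = 0; lam < nbf; ++lam)
        for (int sig = 0; sig < nbf; ++sig)
            jmn += eri(mu, nu, lam, sig) * P[lam * nbf + sig];

    J[idx] = jmn;
}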
Yasuda was the first to realize the potential of GPUs for the acceleration of ERI evaluation [18]. In this work, the fundamental issues of algorithm development on GPUs are addressed and results for the calculation of the Coulomb contribution to the KS matrix with s- and p-type basis functions are presented for a CUDA implementation. Although it is not the most efficient algorithm for ERIs over basis functions with low angular momentum quantum number, the Rys quadrature was chosen because this scheme allows one to maximize the load balance of the GPU's SMs. A new interpolation formula for the roots and weights of the quadrature was proposed which is particularly suitable for SIMD processing, and an error analysis for the quadrature was given. A mixed-precision (MP) CPU/GPU scheme was introduced which calculates the largest ERIs (prescreened via the Schwarz integral bound and an adjustable threshold) in DP on the CPU and the remaining ERIs in SP on the GPU such that the absolute error in the calculated ERIs can be controlled. This, together with data accumulation (Coulomb matrix formation) via 48-bit multiprecision addition (which can be implemented in software for GPUs without DP support), leads to accurate DFT SCF energies with errors that remain well below chemical accuracy. The contributions to the Coulomb matrix are directly computed from the uncontracted ERIs in a SIMD fashion on the GPU, which avoids the problem of having to transfer the large number of ERIs from GPU to CPU memory; instead, only the density and Coulomb matrices have to be transferred. If all ERIs are evaluated on the GPU (NVIDIA GeForce 8800 GTX), speedups of around one order of magnitude have been observed for the formation of the Coulomb matrix for molecules as big as valinomycin (168 atoms) with a 6-31G basis set as compared to a conventional CPU implementation.
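The essence of such a mixed-precision partitioning can be sketched as follows (host-side C with illustrative names; the Schwarz factors Q and the two thresholds are assumed to be precomputed and chosen by the user): quartets with a large Schwarz bound are routed to a DP list for the CPU, the far more numerous small ones to an SP list for the GPU.

// Q[mu*nbf+nu] = sqrt((mu nu|mu nu)) is the Schwarz factor of a pair.
// Quartets with a large bound are evaluated in DP on the CPU,
// the remaining (small) ones in SP on the GPU.
typedef struct { int mu, nu, lam, sig; } Quartet;

void partition_quartets(const double *Q, int nbf, double thresh_dp,
                        double thresh_skip, Quartet *dp_list, int *n_dp,
                        Quartet *sp_list, int *n_sp)
{
    *n_dp = *n_sp = 0;
    for (int mu = 0; mu < nbf; ++mu)
    for (int nu = 0; nu <= mu; ++nu)
    for (int lam = 0; lam < nbf; ++lam)
    for (int sig = 0; sig <= lam; ++sig) {
        double bound = Q[mu * nbf + nu] * Q[lam * nbf + sig];
        Quartet q = { mu, nu, lam, sig };
        if (bound < thresh_skip)    continue;               /* negligible */
        else if (bound > thresh_dp) dp_list[(*n_dp)++] = q; /* CPU, DP  */
        else                        sp_list[(*n_sp)++] = q; /* GPU, SP  */
    }
}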
There is room for improvement in the performance, for example, through pipelining and also potentially by exploiting the DP functionality of current and future GPUs. Ufimtsev and Martinez (UM) have also developed CUDA kernels for the calculation of ERIs and Fock matrix formation involving s- and p-type basis functions on GPUs [21,23].
For the ERI evaluation, the McMurchie-Davidson scheme [22] was chosen because it requires relatively few intermediates per integral, resulting in a low memory requirement similar to that of the Rys quadrature. Three different mappings of the computational work to thread blocks have been tested, which result in different load balancing and data reduction overhead, and the ERI kernels have been carefully tuned. If the Fock matrix contributions are accumulated directly from the primitive ERIs, it becomes most efficient to assign the calculation of each primitive ERI batch (i.e., all ERIs over basis functions with the magnetic quantum numbers for the given angular momentum quantum numbers) to one thread, independent of the contraction length of the basis functions. In order to maximize load balancing, the integral batches are presorted into blocks of equal angular momentum classes. The integral evaluation and Fock matrix formation run on the GPU, but pre- and postprocessing are done on the CPU. This approach has also been parallelized across multiple GPUs.
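The mapping of primitive ERI batches to threads can be pictured roughly as in the following kernel skeleton (all names are hypothetical and the integral work itself is omitted); presorting the batch list by angular momentum class keeps the threads of a warp on the same code path, which suits the SIMD execution model.

// One thread evaluates one primitive ERI batch, i.e., all integrals over
// the magnetic quantum numbers of a fixed quartet of primitive shells.
struct PrimBatch { int shell_a, shell_b, shell_c, shell_d; };

__global__ void eri_batches(const PrimBatch *batches, int nbatches,
                            double *batch_results, int results_per_batch)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one batch per thread
    if (b >= nbatches) return;

    PrimBatch q = batches[b];
    double *out = batch_results + (size_t)b * results_per_batch;

    // ... evaluate all primitive integrals of the batch (for an (sp|sp)
    // quartet this is a small, fixed number of values) and either store
    // them or contract them directly into Fock matrix contributions.
    for (int k = 0; k < results_per_batch; ++k)
        out[k] = 0.0;                                // placeholder work
    (void)q;
}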
HF SCF calculations with 3-21G and 6-31G basis sets using UM's implementation and an NVIDIA GTX280 card can be more than 100 times faster than the GAMESS program running on the CPU, with most of the time spent in the Fock matrix formation on the GPU. However, for large molecules such as olestra (453 atoms, 2131 basis functions), the linear algebra (LA) required for the solution of the SCF equations starts to become a bottleneck, requiring as much as 50% of the Fock matrix computation time (with the LA performed on the GPU using CUBLAS). A parallel efficiency of over 60% was achieved on three NVIDIA GeForce 8800 GTX cards as compared to the use of only one graphics accelerator. Two points should be mentioned here. First, the limitation to s- and p-type functions results in small integral blocks that can be treated entirely in shared memory and registers, which means that the ratio of computation to memory access is high. This situation will change for basis functions with higher angular momentum quantum numbers. Second, the CPU code used for the comparisons is a legacy Fortran implementation that underperforms on modern CPUs. Smaller GPU speedups should be observed for comparisons against implementations of these algorithms which are optimized for performance on modern CPUs.
The error in the SCF energies obtained with UM's code due to the use of SP arithmetics has been analyzed. Evaluating the largest ERI contributions in DP, which can be performed on newer GPUs with negligible additional computational cost, improves the accuracy to acceptable levels in all investigated cases. In addition, error compensation in relative energies was observed, presumably due to cancellation of the contributions of large ERIs. For larger molecules, however, computation of the larger ERIs in DP will be required, as has been discussed above in the context of mixed-precision schemes.
UM have also implemented the calculation of the Coulomb and exchange contributions to the analytical HF energy gradients with s- and p-type basis functions on GPUs [25]. A speedup of between 6 for small molecules and over 100 for larger molecules (olestra) has
been obtained running in parallel on a system equipped with two NVIDIA GTX295 cards (each of which carries two GPUs) and an Intel Core2 quad-core 2.66 GHz CPU. Reference was again made to GAMESS, running in parallel on all four CPU cores. Using the mixed SP/DP approach discussed above, the accuracy of the computed gradients is close to typical convergence thresholds for geometry optimizations. Geometry optimization of a helical hepta-alanine was shown to lead to an optimized structure in good agreement with GAMESS results, with only a small error in the final energy. First-principles molecular dynamics simulations have also been performed with the 6-31G basis set in the microcanonical ensemble using the velocity Verlet integrator, and good conservation of the total energy was observed over a simulation time of 20 ps.
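For reference, a single velocity Verlet step has the familiar form sketched below (plain C, not taken from the UM code; the forces() routine stands for the ab initio gradient evaluation, which is the part that runs on the GPU).

// One velocity Verlet step for N particles in the microcanonical ensemble.
// x, v, f are arrays of length 3*N; m holds the particle masses.
// forces() is assumed to fill f with the current forces (e.g., from an
// HF gradient evaluated on the GPU) and is not shown here.
void forces(const double *x, double *f, int N);

void velocity_verlet_step(double *x, double *v, double *f,
                          const double *m, int N, double dt)
{
    for (int i = 0; i < N; ++i)
        for (int d = 0; d < 3; ++d) {
            v[3*i + d] += 0.5 * dt * f[3*i + d] / m[i];   // first half kick
            x[3*i + d] += dt * v[3*i + d];                // drift
        }

    forces(x, f, N);                                      // new forces

    for (int i = 0; i < N; ++i)
        for (int d = 0; d < 3; ++d)
            v[3*i + d] += 0.5 * dt * f[3*i + d] / m[i];   // second half kick
}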
Recently, Asadchev et al. presented algorithms and a CUDA implementation of the uncontracted Rys quadrature for ERIs over basis functions with angular momenta up to g functions [26]. The Rys quadrature, owing to its small memory footprint, is efficient for integrals of higher order angular momentum. The major problem is that, unlike numerical LA kernels, the quadrature has very complex memory access patterns which span a large data set and depend on the angular momenta of the shells involved. A high angular momentum shell block, for example, requires 15,376 floating-point numbers for intermediate quantities which are needed repeatedly during the integral evaluation. With DP this corresponds to 123,008 bytes, which is much larger than the cache sizes available on GPUs. Therefore, these intermediates must be stored in and loaded from the device memory as required, and it becomes mandatory to arrange the parallel calculation of the ERIs in such a way that these memory loads are minimized. For this purpose, the integrals in a shell block are reordered such that intermediates can be reused as often as possible. Another problem is the large amount of code required to cover all possible cases of integral types in an efficient manner. The authors therefore adopted a template-based approach in which all cases can be generated from a single template in an automated fashion.
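The template idea can be illustrated by the following sketch (names and structure are ours and considerably simpler than the published implementation): the angular momenta of the four shells are compile-time parameters, so that the compiler generates a specialized kernel for every integral class from a single source.

// A single kernel template parameterized by the angular momenta of the
// four shells; instantiating it for all combinations up to the maximum
// angular momentum generates the specialized kernel for every class.
template <int LA, int LB, int LC, int LD>
__global__ void eri_kernel(const double *shell_data, double *out, int nquartets)
{
    // Numbers of Cartesian components are compile-time constants,
    // so the inner loops can be fully unrolled by the compiler.
    const int NA = (LA + 1) * (LA + 2) / 2;
    const int NB = (LB + 1) * (LB + 2) / 2;
    const int NC = (LC + 1) * (LC + 2) / 2;
    const int ND = (LD + 1) * (LD + 2) / 2;

    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nquartets) return;

    // ... Rys quadrature recursions for this (LA LB | LC LD) class ...
    for (int i = 0; i < NA * NB * NC * ND; ++i)
        out[(size_t)q * NA * NB * NC * ND + i] = 0.0;   // placeholder
    (void)shell_data;
}

// Host side: launching, e.g., the (ss|ss) and (gg|gg) specializations.
void launch_examples(const double *d_shell_data, double *d_out, int nquartets)
{
    int threads = 128, blocks = (nquartets + threads - 1) / threads;
    eri_kernel<0, 0, 0, 0><<<blocks, threads>>>(d_shell_data, d_out, nquartets);
    eri_kernel<4, 4, 4, 4><<<blocks, threads>>>(d_shell_data, d_out, nquartets);
}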
The performance of these GPU ERI kernels was tested on NVIDIA GeForce GTX 275 and NVIDIA Tesla T10 cards and compared to the performance of the ERI evaluation with the Rys quadrature as implemented in GAMESS (which, as noted above, is a legacy code that is not tuned for modern CPUs). While the CPU implementation achieves around 1 GFLOPS (giga floating-point operations per second), the GPUs achieve around 25 GFLOPS in DP and 50 GFLOPS in SP; the DP figure corresponds to approximately 30% of the theoretically possible DP peak performance. The difference between the SP and DP performance is only approximately a factor of 2, which shows that the computations are memory bound rather than compute bound. No timings are given for the data transfer between GPU memory and main memory apart from stating that it takes several times longer than the actual execution time of the ERI kernels. It is clear that, in order to retain the speed advantage of the ERI evaluation on the GPU, processing of the ERIs (e.g., Fock matrix formation) must be implemented on the GPU device as well.
3.2 Numerical exchange-correlation quadrature
In the generalized gradient approximation (GGA) to DFT, the XC potential depends on the electron density and its gradient at every point in three-dimensional space. This makes an analytical solution of the XC integrals impossible, and numerical quadrature is used to compute the XC matrix elements.
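In a standard closed-shell GGA implementation these matrix elements take the generic form (reproduced here for orientation; the notation is ours and is not quoted from the implementations discussed below)

V^{\mathrm{XC}}_{\mu\nu} \approx \sum_g w_g \left[ \frac{\partial f_{\mathrm{xc}}}{\partial \rho}\, \chi_\mu(\mathbf{r}_g)\chi_\nu(\mathbf{r}_g) + 2\, \frac{\partial f_{\mathrm{xc}}}{\partial |\nabla\rho|^2}\, \nabla\rho(\mathbf{r}_g)\cdot\nabla\!\left[\chi_\mu(\mathbf{r}_g)\chi_\nu(\mathbf{r}_g)\right] \right],

where f_{\mathrm{xc}}(\rho,|\nabla\rho|^2) is the XC energy density and w_g and \mathbf{r}_g are the quadrature weights and grid points.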
The numerical XC quadrature is perfectly suited for parallelization, and Yasuda has presented a heterogeneous strategy in which the computationally less demanding steps in the quadrature are performed on the CPU while the expensive steps are done on the GPU [27]. These are the evaluation of the electron density (and its gradient) on the grid points and the subsequent accumulation of the XC matrix elements, which can be formulated as matrix-vector multiplications and dot products. Both steps are organized in batches of grid points and nonnegligible basis functions that are small enough to be kept entirely in shared memory. Although in this way some of the basis function values on the grid points must be recalculated, this is more than compensated for by the low latency of the shared memory.
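A rough sketch of one such batched operation, the evaluation of the density on a batch of grid points, is given below (hypothetical names and batch sizes, chosen so that the staged data fit into the shared memory of GPUs of that generation).

#define NBAS 32    // non-negligible basis functions in this batch
#define NGRID 64   // grid points in this batch (one thread per point)

// rho(r_g) = sum_{mu,nu} P[mu][nu] * chi_mu(r_g) * chi_nu(r_g), in SP.
// The batch is sized so that the basis values and the density matrix
// block both fit into shared memory.
__global__ void batch_density(const float *chi,     // [NBAS][NGRID]
                              const float *Pblock,  // [NBAS][NBAS]
                              float *rho)           // [NGRID]
{
    __shared__ float chi_s[NBAS * NGRID];
    __shared__ float P_s[NBAS * NBAS];

    for (int k = threadIdx.x; k < NBAS * NGRID; k += blockDim.x)
        chi_s[k] = chi[k];
    for (int k = threadIdx.x; k < NBAS * NBAS; k += blockDim.x)
        P_s[k] = Pblock[k];
    __syncthreads();

    int g = threadIdx.x;            // grid point handled by this thread
    if (g >= NGRID) return;

    float r = 0.0f;
    for (int mu = 0; mu < NBAS; ++mu) {
        float t = 0.0f;
        for (int nu = 0; nu < NBAS; ++nu)
            t += P_s[mu * NBAS + nu] * chi_s[nu * NGRID + g];
        r += t * chi_s[mu * NGRID + g];
    }
    rho[g] = r;
}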
In order to deal with roundoff errors due to the use of SP floating-point numbers on the GPU, Yasuda introduced a scheme in which the XC potential is split into a smooth model potential and a correction. The model potential is chosen such that its matrix elements can be calculated analytically. This is done in DP on the CPU, while the GPU is used for calculating the correction, that is, for the numerical quadrature of the difference between the full XC potential and the model potential.
With this scheme, only small errors remain in the total energy of valinomycin with a 3-21G or 6-31G basis set and the PW91 XC functional [28]. For the numerical quadrature, a speedup of approximately 40 is observed with an NVIDIA GeForce 8800 GTX graphics card as compared to a conventional implementation running on an Intel Pentium 4 CPU with 2.8 GHz. This translates into a speedup of around five to ten as compared to more modern CPUs.
3.3 Density-fitted Poisson method
Brown et al. have presented a different heterogeneous approach to accelerate DFT, using ClearSpeed accelerator boards [29] instead of GPUs [30,31]. The ClearSpeed accelerator hardware is a compute-oriented stream architecture with raw performance comparable to that of modern GPUs while offering support for DP. Just as for GPUs, an efficient use of this hardware requires fine-grained parallelization with a large number of lightweight threads, and any algorithm developed for these accelerators will map well onto GPUs. By using the Poisson
density fitting method, all bottlenecks of a DFT calculation could be shifted into operations that map well onto the accelerator hardware, while at the same time the computational prefactor is reduced. The auxiliary basis set can be chosen to consist of a few atom-centered Gaussian functions augmented with Poisson functions (obtained by applying the Laplacian to Gaussian functions). Because the Coulomb integrals over Poisson functions reduce to overlap-like integrals, this leads to a further reduction of the prefactor. Furthermore, these overlap integrals can be calculated by numerical quadrature. However, to maintain numerical stability in the SCF procedure, a higher accuracy than provided by default XC quadrature grids is required, thus increasing the number of grid points.
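The role of the Poisson functions can be outlined as follows (a generic sketch of the standard Poisson fitting argument; the notation is ours). Defining a Poisson fitting function as \hat{g}_P(\mathbf{r}) = -\tfrac{1}{4\pi}\nabla^2 \chi_P(\mathbf{r}) for a Gaussian \chi_P and using \nabla^2 |\mathbf{r}-\mathbf{r}'|^{-1} = -4\pi\,\delta(\mathbf{r}-\mathbf{r}'), one finds

\int\!\!\int \frac{\hat{g}_P(\mathbf{r})\, \chi_\mu(\mathbf{r}')\chi_\nu(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, \mathrm{d}\mathbf{r}\, \mathrm{d}\mathbf{r}' = \int \chi_P(\mathbf{r}')\, \chi_\mu(\mathbf{r}')\chi_\nu(\mathbf{r}')\, \mathrm{d}\mathbf{r}',

so the long-range Coulomb integrals over Poisson functions collapse to short-ranged, overlap-like three-center integrals, which is what makes their evaluation by numerical quadrature attractive.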
The implementation, which is not restricted to basis functions with low angular momentum quantum numbers, passes only information about the numerical quadrature grid, the basis functions, the KS matrix, and the density matrix between the accelerator cards and the host system. The numerical quadrature of the XC contribution and of the Coulomb contribution due to the fitted density is organized in such a way that all computations can be done within the cache memory of the accelerator cards. All other parts of the DFT calculation are performed on the host CPU. Compared to an implementation with analytical evaluation of the fitting integrals running on one core of a dual-core AMD Opteron 2218 CPU with 2.6 GHz, a speedup of between 7 and 15 was observed with 12 ClearSpeed xe620 accelerator cards. The benchmarks were run for molecules ranging in size from chorismate (24 atoms) to considerably larger systems, using cc-pVTZ and corresponding density fitting basis sets. There is further room for improvement, for example, by implementing prescreening, which is missing so far. However, the work done on the host is already becoming a bottleneck and needs to be addressed; the diagonalization, for example, takes approximately 30% of the total runtime.
3.4 Density functional theory with Daubechies wavelets
Another effort in the physics community should be mentioned here. The BigDFT code is based on a systematic basis set of Daubechies wavelet functions and offers GPU support within the CUDA programming framework. It was shown to achieve a high parallel efficiency of 90% on parallel computers in which the cross-sectional bandwidth scales well with the number of processors. It uses a parallelized hybrid CPU/GPU programming model, and compared to the full CPU implementation a constant speedup of up to six was achieved when GPUs were used.
4 ELECTRON CORRELATION METHODS

The quantum chemist's traditional way to approximate solutions of the electronic Schrödinger equation beyond the mean-field level is the use of wave function based electron correlation methods. These methods improve upon the HF mean-field approximation by adding an explicit treatment of electron correlation, at substantially increased computational cost. Only a few GPU implementations of such methods have been reported so far. It is expected that this will change in the near future because these methods are of critical importance whenever higher accuracy is required than what can be achieved by DFT, or for types of interactions and properties for which DFT breaks down.
4.1 Resolution-of-identity second-order Møller-Plesset perturbation theory
Second-order Møller-Plesset perturbation theory (MP2) is the computationally least expensive of the correlated wave function methods. Except for transition metal compounds, MP2 equilibrium geometries are of comparable accuracy to DFT. However, MP2 captures long-range correlation effects (like dispersion) which are lacking in present-day density functionals. The computational cost of MP2 calculations is dominated by the integral transformation from the atomic orbital (AO) to the molecular orbital (MO) basis, which formally scales with the fifth power of the system size. The full four-index transformation can be avoided by introduction of the RI integral approximation [36,37], which requires just the transformation of three-index quantities and reduces the prefactor without significant loss of accuracy.
This makes RI-MP2 an attractive alternative for small- to medium-sized molecular systems for which DFT fails. Aspuru-Guzik and coworkers have worked on accelerating RI-MP2 calculations with GPUs [38,39]. The computationally dominant step of an RI-MP2 calculation essentially consists of matrix multiplications to generate the approximate MO-basis ERIs,

(ia|jb) \approx \sum_P B_{ia}^P\, B_{jb}^P,

where the three-index quantities B_{ia}^P are obtained from the AO three-center integrals over the auxiliary basis and the inverse square root of the Coulomb metric of the auxiliary basis.
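For orientation, these quantities enter the standard closed-shell MP2 correlation energy expression (textbook form, included here for completeness),

E^{(2)} = \sum_{ijab} \frac{(ia|jb)\,\bigl[\,2\,(ia|jb) - (ib|ja)\,\bigr]}{\varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b},

where i, j (a, b) label occupied (virtual) orbitals with energies \varepsilon.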
To take full benefit of GPUs for these matrix multiplications, the matrices have to be larger than a given threshold to minimize the impact of the bus latency incurred when transferring the matrices from the CPU to the GPU memory. Depending on the system size (number of atoms, size of the basis sets employed), this is achieved by appropriate blocking of the matrices.
For the multiplication of general matrices whose size is too large to be held in GPU memory, a dedicated matrix multiplication library (SciGPU-GEMM) has been developed [40]. As established for standard parallel matrix multiplications, this library uses a
two-dimensional decomposition into blocks. Partial matrix multiplications of these blocks are performed on the GPU with CUBLAS routines and the results are accumulated on the CPU. To improve the numerical accuracy, a heterogeneous computing model is employed in which numerically large contributions to the final result are computed and accumulated on a DP device (in general the CPU) and the remaining small contributions are efficiently treated by the SP GPU device. It was shown that errors can be reduced by an order of magnitude in exchange for a moderate performance decrease with this MP approach.
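The flavor of this heterogeneous scheme is conveyed by the following simplified sketch (illustrative only; it is not the SciGPU-GEMM code and splits only one of the two factors): elements of A above a magnitude threshold contribute through a DP product on the CPU, the remainder through an SP CUBLAS product on the GPU, and the two partial results are summed in DP.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

// C = A*B for n x n column-major matrices, with A split by magnitude:
// elements |A_ij| > thresh enter a DP product on the CPU, the rest an
// SP product on the GPU (CUBLAS). Error checking omitted.
void mixed_precision_gemm(const double *A, const double *B, double *C,
                          int n, double thresh)
{
    float  *A_s = (float  *)malloc(sizeof(float)  * n * n);  // small part, SP
    double *A_l = (double *)malloc(sizeof(double) * n * n);  // large part, DP
    float  *B_s = (float  *)malloc(sizeof(float)  * n * n);
    float  *C_s = (float  *)malloc(sizeof(float)  * n * n);

    for (int k = 0; k < n * n; ++k) {
        int is_large = fabs(A[k]) > thresh;
        A_l[k] = is_large ? A[k] : 0.0;
        A_s[k] = is_large ? 0.0f : (float)A[k];
        B_s[k] = (float)B[k];
    }

    /* DP part on the CPU (naive triple loop, stands in for a tuned DGEMM). */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double acc = 0.0;
            for (int k = 0; k < n; ++k)
                acc += A_l[i + k * n] * B[k + j * n];
            C[i + j * n] = acc;
        }

    /* SP part on the GPU via CUBLAS. */
    float *dA, *dB, *dC;
    float one = 1.0f, zero = 0.0f;
    cudaMalloc((void **)&dA, sizeof(float) * n * n);
    cudaMalloc((void **)&dB, sizeof(float) * n * n);
    cudaMalloc((void **)&dC, sizeof(float) * n * n);
    cudaMemcpy(dA, A_s, sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B_s, sizeof(float) * n * n, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);
    cublasDestroy(h);

    cudaMemcpy(C_s, dC, sizeof(float) * n * n, cudaMemcpyDeviceToHost);
    for (int k = 0; k < n * n; ++k)
        C[k] += (double)C_s[k];                    /* accumulate in DP */

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A_s); free(A_l); free(B_s); free(C_s);
}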
Compared to the standard CPU implementation, speedups of 13.8, 10.1, and 7.8 were obtained on an NVIDIA Tesla C1060 GPU equipped with 4 GB of memory for the 168-atom molecule valinomycin in SP, MP, and DP, respectively. The corresponding errors in the correlation energy show that, if the matrix multiplications are performed entirely in SP, the resulting error is larger than acceptable for chemical accuracy. It is therefore inevitable to accept some performance penalty for the sake of accuracy. It was shown that the ERI evaluation becomes the computational bottleneck once the matrix multiplications have been accelerated, which suggests a combination with the approaches discussed in Section 3 for the ERI evaluation.
5 QUANTUM MONTE CARLO
Quantum Monte Carlo (QMC) methods provide an alternative route to solving the time-independent Schrödinger equation [41]. As opposed to variational ab initio approaches, QMC is based on a stochastic evaluation of the underlying integrals. QMC methods are intrinsically parallel and scale favorably with system size, but they come with a very large prefactor.
Anderson et al. have accelerated QMC calculations by executing CUDA kernels that are explicitly optimized for cache usage and instruction-level parallelism for the computationally intensive parts on a GPU [42]. These are the basis function evaluation on grid points and, similar to the numerical XC quadrature and RI-MP2, matrix multiplications. The Kahan summation formula was explored to improve the accuracy of the GPU matrix multiplications, which was necessary because of the lack of fully IEEE-compliant floating-point arithmetic on GPUs in 2007. For small molecules with 8-28 atoms (32-152 electrons and 80-516 basis functions), an approximately fivefold speedup was obtained using an NVIDIA GeForce 7800 GTX graphics card as compared to an optimized implementation running on an Intel Pentium 4 CPU with 3 GHz.
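Kahan (compensated) summation carries the rounding error of every addition forward in a correction variable; a minimal, generic version (textbook form, not the code of the original work) reads as follows. In practice, care must be taken that the compiler does not optimize the compensation away.

// Compensated (Kahan) summation of n single-precision values.
// The correction term c recovers the low-order bits lost in each
// addition, greatly reducing the accumulated SP roundoff.
__host__ __device__ float kahan_sum(const float *x, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;                 // running compensation
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;           // corrected next term
        float t = sum + y;            // low-order bits of y may be lost here
        c = (t - sum) - y;            // recover what was lost
        sum = t;
    }
    return sum;
}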
Meredith et al. have used an implementation of the quantum cluster approximation on SP GPUs to study the effect of disorder on the critical temperature for superconductivity in cuprates with a two-dimensional Hubbard model on a regular lattice [43]. The dominant computational kernel was cast into matrix multiplications on the GPU using the CUBLAS library. Attempts to increase the performance by circumventing the data transfer bottleneck and implementing the remaining data manipulations on the GPU instead of the CPU resulted in a performance loss for all but the largest problem size that was investigated. The simple
reason is that smart algorithms that can be implemented efficiently on CPUs do not map well onto GPU architectures or, in other words, the GPU has to do more work to achieve the same result. For the largest problem size studied, a fivefold speedup was observed running on a cluster with 32 AMD Opteron 2.6 GHz CPUs and 32 NVIDIA 8800 GTX graphics cards as compared to using only the CPUs in parallel. Sufficient accuracy for scientifically meaningful results within the employed model was proven by comparison to DP results obtained on a CPU.
6 CONCLUDING REMARKS
Quantum chemistry software that exploits the capabilities of modern GPUs has only recently started to emerge. Significant parts of these initial efforts have been devoted to minimizing errors caused by the lack of DP support on older GPUs. The advent of next-generation GPUs that support DP arithmetics at a peak performance only a factor of 2 lower than that of SP will make these special approaches obsolete. At the same time, future developments will be greatly facilitated by the improved hardware capabilities and the maturing programming tools.
From the literature, one can observe that in order to achieve good results in programming with GPUs it is often necessary to write GPU-only versions of the code. One typically has to abandon many of the smart optimizations that have been developed over the years for CPUs, and expensive copy operations from the CPU to the GPU memory have to be minimized.
With careful work, it is possible to achieve speedups which should allow researchers to perform calculations that otherwise would require large and expensive CPU clusters. However, the nature of GPU programming is such that significant effort is still required to make effective use of GPUs. These complexities are the reason that the quantum chemistry software that is available for GPUs at the time of this writing is still in its infancy and not yet ready for general use. GPU implementations that are capable of full HF and DFT calculations, for example, are still restricted to s- and p-type basis functions. HF calculations are not of much practical use by themselves but only as a starting point for correlated wave function methods, which require basis functions with higher angular momentum quantum numbers. Similarly, meaningful DFT calculations have to use polarization functions, which means that even for simple organic molecules or biomolecules without metal atoms at least d-type functions are required. While GPU-based ERI implementations for high angular momentum basis functions have been developed, these still have to be incorporated into software that is capable of complete HF and DFT calculations.
Up to now, only energies and gradients have been considered, which allows for explorations of potential energy surfaces. However, a variety of other quantum chemistry applications would also benefit from the computational power that GPUs provide. Of high interest to researchers are static and dynamic molecular response properties. Frequently, these require a higher computational effort than energy and gradient evaluations. We therefore expect to see developments in this area soon.
We are looking forward to exciting new developments of quantum chemistry software for GPUs, accompanied by ground-breaking applications, in the near future.
ACKNOWLEDGMENTS
This work was supported in part by grant 09-LR-06-117792-WALR from the University of California Lab Fees program and grant XFT-8-88509-01/DE-AC36-99GO10337 from the Department of Energy to RCW
REFERENCES
1 Clary, D.C Quantum chemistry of complex systems Science 2006, 314(5797), 265—6
2 Carter, E.A Challenges in modeling materials properties without experimental input Science
6 Frigo, M., Johnson, S.G The design and implementation of FFTW3 Proc IEEE 2005, 93(2), 216—31
7 NVIDIA: Santa Clara, CA, CUDA Programming Guide, http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide_3.0.pdf (Accessed March 6, 2010)
8 AMD: Sunnyvale, CA, ATI, www.amd.com/stream (Accessed March 14, 2010)
9 NVIDIA: Santa Clara, CA, CUDA, http://www.nvidia.com/object/cuda_home.html (Accessed March 6, 2010)
10 NVIDIA: Santa Clara, CA, CUFFT Library, http://developer.download.nvidia.com/compute/ cuda/2_3/toolkit/docs/CUFFT_Library_2.3.pdf (Accessed March 6, 2010)
11 NVIDIA: Santa Clara, CA, CUBLAS Library 2.0, http://developer.download.nvidia.com/ compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf (Accessed March 6, 2010)
12 Innovative Computing Laboratory, University of Tennessee, Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma (Accessed March 6, 2010)
13 Kohn, W., Sham, L.J Self-consistent equations including exchange and correlation effects Phys Rev 1965, 140, A1133-8
14 Parr, R.G., Yang, W Density-Functional Theory of Atoms and Molecules, Oxford University Press, Oxford, 1989
15 Jensen, F In Annual Reports in Computational Chemistry (ed D.C Spellmeyer), Vol 1, Elsevier, Amsterdam, 2005, pp 3—17
16 Fiolhais, C., Nogueira, F., Marques, M.A.L A Primer in Density Functional Theory, Lecture Notes
in Physics, Springer Verlag, Berlin, 2003
17 Salek, P., Høst, S., Thøgersen, L., Jørgensen, P., Manninen, P., Olsen, J., Jansík, B Linear-scaling implementation of molecular electronic self-consistent field theory J Chem Phys 2007, 126, 114110
18 Yasuda, K Two-electron integral evaluation on the graphics processor unit J Comput Chem
22 McMurchie, L.E., Davidson, E.R One- and two-electron integrals over Cartesian Gaussian functions J Comput Phys 1978, 26, 218-31
23 Ufimtsev, I.S., Martinez, T.J Quantum chemistry on graphical processing units 2 Direct self-consistent-field implementation J Chem Theory Comput 2009, 5(4), 1004-15
24 Schmidt, M.W., Baldridge, K.K., Boatz, J.A., Elbert, S.T., Gordon, M.S., Jensen, J.H., Koseki, S., Matsunaga, N., Nguyen, K.A., Su, S., Windus, T.L., Dupuis, M., Montgomery, J.A., Jr General atomic and molecular electronic structure system J Comput Chem 1993, 14(11), 1347-63
25 Ufimtsev, I.S., Martinez, T.J Quantum chemistry on graphical processing units 3 Analytical energy gradients, geometry optimization, and first principles molecular dynamics J Chem Theory Comput 2009, 5(10), 2619—28
26 Asadchev, A., Allada, V., Felder, J., Bode, B.M., Gordon, M.S., Windus, T.L Uncontracted Rys quadrature implementation of up to g functions on graphical processing units J Chem Theory Comput 2010, 6(3), 696—704
27 Yasuda, K Accelerating density functional calculations with graphics processing unit J Chem Theory Comput 2008, 4(8), 1230—6
28 Perdew, J.P., Chevary, J., Vosko, S., Jackson, K.A., Pederson, M.R., Singh, D., Fiolhais, C Atoms, molecules, solids, and surfaces: Applications of the generalized gradient approximation for exchange and correlation Phys Rev B 1992, 46, 6671—87
29 ClearSpeed: Bristol, UK, www.clearspeed.com (Accessed March 14, 2010)
30 Brown, P., Woods, C., McIntosh-Smith, S., Manby, F.R Massively multicore parallelization of Kohn-Sham theory J Chem Theory Comput 2008, 4(10), 1620—6
31 Brown, P., Woods, C.J., McIntosh-Smith, S., Manby, F.R., A massively multicore parallelization of the Kohn-Sham energy gradients, J Comput Chem 2010, 31(10), 2008—13
32 Baerends, E.J., Ellis, D., Roos, P Self-consistent molecular Hartree-Fock-Slater calculations I The computational procedure Chem Phys 1973, 2, 41—51
33 Dunlap, B.I., Connoly, J.W.D., Sabin, J.R On some approximations in applications of Xa theory
36 Feyereisen, M.W., Fitzgerald, G., Komornicki, A Use of approximate integrals in ab initio theory: An application in MP2 energy calculations Chem Phys Lett 1993, 208, 359-63
37 Weigend, F., Häser, M., Patzelt, H., Ahlrichs, R RI-MP2: Optimized auxiliary basis sets and demonstration of efficiency Chem Phys Lett 1998, 294, 143-52
38 Vogt, L., Olivares-Amaya, R., Kermes, S., Shao, Y., Amador-Bedolla, C., Aspuru-Guzik, A Accelerating resolution-of-the-identity second-order Møller-Plesset quantum chemistry calculations with graphical processing units J Phys Chem A 2008, 112(10), 2049-57
39 Olivares-Amaya, R., Watson, M.A., Edgar, R.G., Vogt, L., Shao, Y., Aspuru-Guzik, A Accelerating correlated quantum chemistry calculations using graphical processing units and a mixed precision matrix multiplication library J Chem Theory Comput 2010, 6(1), 135-44
40 SciGPU-GEMM v0.8, http://www.chem-quantum.info/scigpu/?p=61 (Accessed March 6, 2010)
41 Ceperley, D., Alder, B Quantum Monte Carlo Science 1986, 231(4738), 555—60
42 Anderson, A.G., Goddard, W.A., III, Schröder, P Quantum Monte Carlo on graphical processing units Comput Phys Commun 2007, 177(3), 298-306
43 Meredith, J.S., Alvarez, G., Maier, T.A., Schulthess, T.C., Vetter, J.S Accuracy and performance of graphics processors: A quantum Monte Carlo application case study Parallel Comput 2009, 35(3), 151-63