The new biology and the Grid
Kim Baldridge and Philip E Bourne
University of California, San Diego, California, United States
40.1 INTRODUCTION
Computational biology is undergoing a revolution from a traditionally compute-intensive science conducted by individuals and small research groups to a high-throughput, data-driven science conducted by teams working in both academia and industry. It is this new biology as a data-driven science in the era of Grid Computing that is the subject of this chapter. The chapter is written from the perspective of bioinformatics specialists who seek to capitalize fully on the promise of the Grid and who are working with the computer scientists and technologists developing biological applications for the Grid.
To understand what has been developed and what is proposed for utilizing the Grid in the new biology era, it is useful to review the ‘first wave’ of computational biology application models. In the next section, we describe the first wave of computational models used to date in computational biology and computational chemistry.
40.1.1 The first wave: compute-driven biology applications
The first computational models for biology and chemistry were developed for the classical von Neumann machine model, that is, for sequential, scalar processors. With the emergence of parallel computing, biological applications were developed that could take advantage of multiple-processor architectures with distributed or shared memory and locally attached disk space to execute a collection of tasks. Applications that compute the molecular structure or electronic interactions of a protein fragment are examples of programs developed to take advantage of emerging computational technologies.
As distributed-memory parallel architectures became more prevalent, computational biologists became familiar with message-passing library toolkits, first the Parallel Virtual Machine (PVM) and more recently the Message Passing Interface (MPI). These enabled biologists to use distributed computational models as a target for executing applications structured as a pipelined set of stages, each dependent on the completion of a previous stage. In pipelined applications, the computation involved in each stage can be relatively independent of the others. For example, one computer may perform molecular computations and immediately stream results to another computer for visualization and analysis of the data generated. Another application scenario is that of a computer used to collect data from an instrument (say, a tilt series from an electron microscope), which is then transferred to a supercomputer with a large shared memory to perform a volumetric reconstruction, which is in turn rendered on yet another high-performance graphics engine. The distribution of the application pipeline is driven by the number and type of tasks to be performed, the available architectures that can support each task, and the I/O requirements between tasks.
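To make the pipeline idea concrete, the following is a minimal sketch (not from the original chapter) of a two-stage pipeline, assuming Python and its standard multiprocessing module; the stage contents are placeholders standing in for a molecular computation and a downstream analysis or visualization step.

```python
# A minimal pipeline sketch: one process "computes" results and streams them
# to a second process for analysis, mimicking the staged applications
# described above. Names and stage contents are illustrative placeholders.
from multiprocessing import Process, Queue

def compute_stage(out_q, n_tasks=10):
    for i in range(n_tasks):
        result = i * i          # stand-in for a molecular computation
        out_q.put(result)       # stream each result as soon as it is ready
    out_q.put(None)             # sentinel: no more data

def analysis_stage(in_q):
    while True:
        result = in_q.get()
        if result is None:      # upstream stage has finished
            break
        print(f"analyzing result {result}")  # stand-in for visualization/analysis

if __name__ == "__main__":
    q = Queue()
    stages = [Process(target=compute_stage, args=(q,)),
              Process(target=analysis_stage, args=(q,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```

In a Grid setting the two stages would run on different resources chosen for their suitability to each task, with the queue replaced by a network data stream.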
While the need to support these applications continues to be very important in computational biology, an emerging challenge is to support a new generation of applications that analyze and/or process immense amounts of input/output data. In such applications, the computation on each of a large number of data points can be relatively small, and the ‘results’ of an application are provided by the analysis, and often visualization, of the input/output data. For such applications, the challenge to infrastructure developers is to provide a software environment that promotes application performance and can leverage large numbers of computational resources for simultaneous data analysis and processing. In this chapter, we consider these new applications, which are forming the next wave of computational biology.
40.1.2 The next wave: data-driven applications
The next wave of computational biology is characterized by high-throughput, high-technology, data-driven applications. The focus on genomics, exemplified by the human genome project, will engender new science impacting a wide spectrum of areas, from crop production to personalized medicine. And this is just the beginning. The amount of raw DNA sequence being deposited in the public databases doubles every 6 to 8 months. Bioinformatics and computational biology have become a prime focus of academic and industrial research. The core of this research is the analysis and synthesis of immense amounts of data, resulting in a new generation of applications that require information technology as a vehicle for the next generation of advances.
Bioinformatics grew out of the human genome project in the early 1990s. The requests for proposals for the physical and genetic mapping of specific chromosomes called for developments in informatics and computer science, not just for data management but for innovations in algorithms and the application of those algorithms to synergistically improve the rate and accuracy of the genetic mapping. A new generation of scientists was born, for whom demand still significantly outweighs supply, and who have been brought up on commodity hardware architectures and fast turnaround. This is a generation that contributed significantly to the fast adoption of the Web by biologists and that wants instant gratification, a generation that makes a strong distinction between wall-clock and CPU time. It makes no difference if an application runs 10 times as fast on a high-performance architecture (minimizing execution time) if you have to wait 10 times as long for the result by sitting in a long queue (maximizing turnaround time). In data-driven biology, turnaround time is important in part because of sampling: a partial result is generally useful while the full result is being generated. We will see specific examples of this subsequently; for now, let us better grasp the scientific field we wish the Grid to support.

The complexity of new biology applications reflects exponential growth rates at different levels of biological complexity. This is illustrated in Figure 40.1, which highlights representative activities at different levels of biological complexity. While bioinformatics is currently focusing on the molecular level, this is just the beginning. Molecules form complexes that are located in different parts of the cell. Cells differentiate into different types, forming organs like the brain and liver.
[Figure 40.1 From biological data comes knowledge and discovery. The figure plots year against levels of biological complexity (sequence, structure, assembly, sub-cellular, cellular, organ, higher life), with curves for computing power, sequencing technology, data volume, and people per Web site, and milestones ranging from the E. coli, yeast, and C. elegans genomes, ESTs, one small genome per month, gene chips, virus and ribosome structures, and a model metabolic pathway of E. coli to the human genome project, genetic circuits, neuronal and cardiac modeling, and brain mapping. A lower band traces the path from biological experiment through data, information, and knowledge to discovery (collect, characterize, compare, model, infer).]
Increasingly complex biological systems generate increasingly large and complex biological data sets. If we do not solve the problems of processing the data at the level of the molecule, we will not solve the problems of higher-order biological complexity.

Technology has catalyzed the development of the new biology, as shown on the right vertical axis of Figure 40.1. To date, Moore’s Law has at least allowed data processing to keep approximate pace with the rate at which data are produced. Moreover, the cost of disks, as well as the communication access revolution brought about by the Web, has enabled the science to flourish. Today, it costs approximately 1% of what it did 10 to 15 years ago to sequence one DNA base pair. With the current focus on genomics, data rates are anticipated to far outpace Moore’s Law in the near future, making Grid and cluster technologies all the more critical for the new biology to flourish.
Now and in the near future, a critical class of new biology applications will involve large-scale data production, data analysis and synthesis, and access to the processed data through the Web and/or advanced visualization tools, from high-performance databases ideally federated with other types of data. In the next section, we illustrate in more detail two new biology applications that fit this profile.
40.2 BIOINFORMATICS GRID APPLICATIONS TODAY
The two applications in this section require large-scale data analysis and management, wide access through Web portals, and visualization. In the sections below, we describe CEPAR (Combinatorial Extension in PARallel), a computational biology application, and CHEMPORT, a computational chemistry framework.
40.2.1 Example 1: CEPAR and CEPort – 3D protein structure comparison
The human genome, and the less advertised but very important 800 other genomes that have been mapped, encode genes. Those genes are the blueprints for the proteins that are synthesized by reading the genes. It is the proteins that are considered the building blocks of life: proteins control all cellular processes and define us as a species and as individuals. A step on the way to understanding protein function is protein structure – the 3D arrangement that recognizes other proteins, drugs, and so on. The growth in the number and complexity of protein structures has undergone the same revolution as shown in Figure 40.1, and can be observed in the evolution of the Protein Data Bank (PDB; http://www.pdb.org), the international repository for protein structure data.
A key element in understanding the relationship between biological structure and function is to characterize all known protein structures. From such a characterization comes the ability to infer the function of a protein once its structure has been determined, since similar structure implies similar function. High-throughput structure determination is now happening in what is known as structural genomics – a follow-on to the human genome project in which one objective is to determine all protein structures encoded by the genome of an organism. While a typical protein consists of some 300 amino acids, each one of 20 different types – a total of 20^300 possibilities, more than the number of atoms in the universe – nature has performed her own reduction, both in the number of sequences and in the number of protein structures, as defined by discrete folds. The number of unique protein folds is currently estimated at between 1000 and 10 000. These folds need to be characterized, and all new structures tested to see whether they conform to an existing fold or represent a new fold. In short, characterizing how all proteins fold requires that they be compared to each other in 3D in a pairwise fashion.
With approximately 30 000 protein chains currently available in the PDB, and with each pair taking 30 s to compare on a typical desktop processor using any one of several algorithms, computing all pairwise comparisons is a problem of size (30 000 × 30 000/2) × 30 s, that is, a total of 428 CPU years on one processor. Using a combination of data reduction (a pre-filtering step that permits one structure to represent a number of similar structures), data organization optimization, and efficient scheduling, this computation was performed on 1000 processors of the 1.7-Teraflop IBM Blue Horizon in a matter of days, using our Combinatorial Extension (CE) algorithm for pairwise structure comparison [1]. The result is a database of comparisons that is used by a worldwide community of users 5 to 10 000 times per month and has led to a number of interesting discoveries cited in over 80 research papers. The resulting database is maintained by the San Diego Supercomputer Center (SDSC) and is available at http://cl.sdsc.edu/ce.html [2]. The procedure to compute and update this database as new structures become available is equally amenable to Grid and cluster architectures, and a Web portal has been established to permit users to submit their own structures for comparison.
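As a back-of-envelope check of the figure quoted above (illustrative Python, not code from the chapter):

```python
# Rough cost of the all-against-all comparison quoted above.
chains = 30_000            # protein chains in the PDB at the time
secs_per_pair = 30         # seconds per pairwise CE comparison on a desktop CPU
pairs = chains * chains // 2                  # ~4.5e8 comparisons
cpu_seconds = pairs * secs_per_pair           # ~1.35e10 s
print(cpu_seconds / (3600 * 24 * 365))        # ~428 CPU-years on one processor
```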
In the next section, we describe the optimizations used to reduce execution time and increase the applicability of the CE application to distributed and Grid resources. The result is a new version of the CE algorithm that we refer to as CEPAR. CEPAR distributes each 3D comparison of two protein chains to a separate processor for analysis. Since each pairwise comparison represents an independent calculation, this is an embarrassingly parallel problem.
40.2.1.1 Optimizations of CEPAR
The optimization of CEPAR involves structuring CE as an efficient and scalable master/worker algorithm. While initially implemented on 1024 processors of Blue Horizon, the algorithm and the optimizations undertaken can execute equally well on a Grid platform. The addition of resources available on demand through the Grid is an important next step for problems requiring computational and data integration resources of this magnitude.
We have employed algorithmic and optimization strategies, based on numerical studies of CEPAR, that have had a major impact on performance and scalability. To illustrate what can be done in distributed environments, we discuss them here; the intent is to familiarize the reader with one approach to optimizing a bioinformatics application for the Grid. Using a trial version of the algorithm without optimization (Figure 40.2), performance bottlenecks were identified. The algorithm was then redesigned and implemented with the following optimizations:
1. The assignment packets (chunks of data to be worked on) are buffered in advance.
2. The master processor algorithm prioritizes incoming messages from workers, since such messages influence the course of further calculations.
3. Workers processing a data stream that no longer poses any interest (based on a result from another worker) are halted; we call this early stopping.
4. Standard single-processor optimization techniques are applied to the master processor.

Figure 40.2 Scalability of CEPAR running on a sample database of 3422 data points (protein chains). The circles show the performance of the trial version of the code. The triangles show the improved performance after improvements 1, 2, and 4 were added to the trial version. The squares show the performance based on timings obtained with the early stopping criterion (improvement 3). The diamonds illustrate ideal scaling. [The plot shows performance versus the number of processors, from 0 to 1024, for the ideal, early stopping, 100% processed, and trial cases.]

With these optimizations, the scalability of CEPAR was significantly improved, as can be seen from Figure 40.2.
The MPI implementation on the master processor is straightforward, but it was essential to use buffered sends (or another means, such as asynchronous sends) to avoid congestion of the communication channel. In summary, on 1024 processors the CEPAR algorithm outperforms CE (with no parallel optimization) by 30 to 1 and scales well. It is anticipated that this scaling would continue on even larger numbers of processors.
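The following is a heavily simplified sketch of this master/worker pattern, written with mpi4py purely for illustration; CEPAR itself is not Python, and the tags, buffering depth, and placeholder comparison function are assumptions rather than details of the actual code.

```python
# Simplified master/worker sketch in the spirit of CEPAR (illustrative only).
# Shows pre-buffered assignments, result-driven hand-out of new work, and a
# stop tag used here for shutdown (CEPAR uses a similar mechanism to halt
# workers whose data stream is no longer of interest). Run with >= 2 MPI
# ranks; assumes there are more pairs than initial buffer slots.
from mpi4py import MPI

TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3

def compare(pair):
    """Placeholder for a pairwise structure comparison."""
    i, j = pair
    return (i, j, abs(i - j) % 7)   # dummy 'similarity' value

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:                                        # master
    pairs = [(i, j) for i in range(100) for j in range(i + 1, 100)]
    work = iter(pairs)
    sends = []
    # keep two assignment packets buffered per worker (optimization 1)
    for worker in range(1, size):
        for _ in range(2):
            sends.append(comm.isend(next(work), dest=worker, tag=TAG_WORK))
    results = []
    status = MPI.Status()
    while len(results) < len(pairs):
        # incoming worker results are handled before anything else (optimization 2)
        results.append(comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status))
        try:
            sends.append(comm.isend(next(work), dest=status.Get_source(), tag=TAG_WORK))
        except StopIteration:
            pass                                      # no work left to hand out
    for worker in range(1, size):                     # halt the workers
        comm.send(None, dest=worker, tag=TAG_STOP)
    for req in sends:
        req.wait()
else:                                                 # worker
    status = MPI.Status()
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(compare(task), dest=0, tag=TAG_RESULT)
```

The non-blocking sends on the master play the role that buffered sends play in CEPAR: the master never waits on a slow communication channel before servicing the next worker message.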
One final point concerns the load imbalance at the end of the computation: a large number of processors can remain idle while the final few complete their work. We chose to handle this by breaking runs involving a large number of processors into two separate runs. The first run does most of the work and exits when the early stopping criterion is met; the second run then completes the task for the outliers using a small number of processors, thus freeing the other processors for other users. Ease of use of the software is maintained through an automatic two-step job-processing utility.
CEPAR has been developed to support our current research efforts on PDB structure-similarity analysis on the Grid. The CEPAR software uses MPI, a universal standard for interprocessor communication; it is therefore suitable for running in any parallel environment that has an MPI implementation, including PC clusters and Grids. There is no dependence on the particular structural alignment algorithm or on the specific application.
The CEPAR design provides a framework that can be applied to other problems facing computational biologists today in which large numbers of data points need to be processed in an embarrassingly parallel way; pairwise sequence comparison, as described subsequently, is an example. Researchers and programmers working on parallel software for these problems may find useful the information on the bottlenecks and the optimization techniques used to overcome them, as well as the general approach of using numerical studies to aid algorithm design, briefly reported here and given in more detail in Reference [1]. But what of the naive user wishing to take advantage of high-performance Grid computing?
40.2.1.2 CEPAR portals
One feature of CEPAR is the ability to allow users worldwide to submit their own structures for comparison and alignment against the existing database of structures. This service currently runs on a Sun Enterprise server as part of the CE Web site (http://cl.sdsc.edu/ce.html) outlined above. Each computation takes, on average, three hours of CPU time for a single user request. On occasion, this service must be turned off, as the number of requests for structure comparisons far outweighs what can be processed on a Sun Enterprise server. To overcome this shortage of compute resources, a Grid portal has been established to handle the load (https://gridport.npaci.edu/CE/) using SDSC’s GridPort technology [3]. The portal allows this computation to be done using additional resources when they are available. The initial target compute resources for the portal are the IBM Blue Horizon, a 64-node Sun Enterprise server, and a Linux PC cluster of 64 nodes.
The GridPort Toolkit [3] is composed of a collection of modules that provide portal services running on a Web server, together with template Web pages needed to implement a Web portal. The function of GridPort is simply to act as a Web front end to Globus services [4], which provide a virtualization layer for distributed resources. The only requirements for adding a new high-performance computing (HPC) resource to the portal are that the CE program be recompiled on the new architecture and that Globus services be running on it. Together, these technologies allowed the development of a portal with the following capabilities:
• Secure and encrypted access for each user to his/her HPC accounts, allowing submission, monitoring, and deletion of jobs, as well as file management;
• Separation of client application (CE) and Web portal services onto separate servers;
• A single, common point of access to multiple heterogeneous compute resources;
• Availability of real-time status information on each compute machine;
• Easy adaptability (e.g. the addition of newly available compute resources, modification of user interfaces, etc.).
40.2.1.3 Work in progress
While the CE portal is operational, much work remains to be done. A high priority is the implementation of a distributed file system for the databases, user input files, jobs in progress, and results. A single shared, persistent file space is a key component of the distributed abstract machine model on which GridPort was built. At present, files must be explicitly transferred from the server to the compute machine and back again; while this process is invisible to the user, from the point of view of portal development and administration it is not the most elegant solution to the problem. Furthermore, the present system requires that the all-against-all database be stored locally on the file system of each compute machine, which means that database updates must be carried out individually on each machine.
These problems could be solved by placing all user files, along with the databases, in a shared file system that is available to the Web server and all the HPC machines. Adding Storage Resource Broker (SRB) [5] capability to the portal would achieve this. Work is presently ongoing on automatically creating an SRB collection for each registered GridPort user; once this is complete, SRB will be added to the CE portal.
Another feature that could be added to the portal is the automatic selection of the compute machine. Once ‘real-world’ data on CPU allocation and turnaround time become available, it should be possible to write scripts that inspect the queue status on each compute machine and allocate each new CE search to the machine expected to produce results in the shortest time.
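A sketch of what such a selection script might look like is given below; queue_wait_estimate() and submit() are hypothetical placeholders (they are not GridPort or Globus calls), and the machine names and runtime estimates are illustrative.

```python
# Hypothetical machine-selection sketch: pick the compute resource expected
# to return a CE search soonest. queue_wait_estimate() and submit() are
# placeholders, not real GridPort or Globus functions.
MACHINES = ["blue-horizon", "sun-enterprise", "linux-cluster"]

# assumed per-machine runtime of one CE search, in minutes (illustrative)
RUNTIME_ESTIMATE = {"blue-horizon": 30, "sun-enterprise": 180, "linux-cluster": 60}

def queue_wait_estimate(machine: str) -> float:
    """Placeholder: return the estimated queue wait on `machine`, in minutes."""
    raise NotImplementedError

def submit(machine: str, job_description: dict) -> str:
    """Placeholder: submit the CE search to `machine` and return a job id."""
    raise NotImplementedError

def select_and_submit(job_description: dict) -> str:
    # expected turnaround = time waiting in the queue + time actually running
    best = min(MACHINES,
               key=lambda m: queue_wait_estimate(m) + RUNTIME_ESTIMATE[m])
    return submit(best, job_description)
```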
The current job status monitoring system could also be improved. Work is under way to add an event daemon to the GridPort system, such that compute machines could notify the portal directly when, for example, searches are scheduled, start, and finish. This would remove the portal’s reliance on intermittent inspection of the queue of each HPC machine and provide near-instantaneous status updates. Such a system would also allow the portal to be regularly updated with other information, such as warnings when compute machines are about to go down for scheduled maintenance, broadcast messages from HPC system administrators, and so on.
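As a purely illustrative sketch of this push-style alternative (the portal endpoint and payload format below are assumptions, not part of GridPort), a compute machine could post status events to the portal rather than waiting to be polled:

```python
# Illustrative event notification: a compute-side daemon posts job status
# changes to a (hypothetical) portal endpoint instead of relying on the
# portal to poll each batch queue.
import json
import urllib.request

PORTAL_EVENT_URL = "https://portal.example.org/events"  # assumed endpoint

def notify_portal(job_id: str, state: str) -> None:
    """Send a status event ('scheduled', 'started', 'finished', ...) to the portal."""
    payload = json.dumps({"job_id": job_id, "state": state}).encode("utf-8")
    req = urllib.request.Request(PORTAL_EVENT_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# e.g. called from the batch system's job prologue/epilogue scripts:
# notify_portal("ce-search-42", "started")
```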
40.2.2 Example 2: Chemport – a quantum mechanical biomedical framework
The success of highly efficient, composite software for molecular structure and dynamics prediction has driven the proliferation of computational tools and the development of first-generation cheminformatics for data storage, analysis, mining, management, and presentation. However, these first-generation cheminformatics tools do not meet the needs of today’s researchers. Massive volumes of data spanning the molecular scale are now routinely being created, both experimentally and computationally, and are available for access by an expanding scope of research. What is required for continued progress is the integration of the individual ‘pieces’ of the methodologies involved and the facilitation of the computations in the most efficient manner possible.
Towards meeting these goals, applications and technology specialists have made considerable progress in solving some of the problems associated with integrating the algorithms that span the molecular scale, both computationally and through the data, as well as in providing infrastructure that removes the complexity of logging on to an HPC system in order to submit jobs, retrieve results, and supply ‘hooks’ into other codes. In this section, we give an example of a framework that serves as a working environment for researchers and that demonstrates new uses of the Grid for computational chemistry and biochemistry studies.
Figure 40.3 The job submission page from the SDSC GAMESS portal.
Using GridPort technologies [3], as described for CEPAR, our efforts began with the creation of a portal for carrying out chemistry computations to understand various details of structure and properties for molecular systems – the General Atomic and Molecular Electronic Structure System (GAMESS) [6] quantum chemistry portal (http://gridport.npaci.edu/gamess). The GAMESS software has been deployed on a variety of computational platforms, including both distributed- and shared-memory platforms. The job submission page from the GAMESS portal is shown in Figure 40.3. The portal uses Grid technologies such as SDSC’s GridPort toolkit [3], the SDSC SRB [5], and Globus [7] to assemble and monitor jobs, as well as to store the results. One goal in the creation of a new architecture is to improve the user experience by streamlining job creation and management.
Related molecular sequence, structure, and property portals have been created using similar frameworks, including the AMBER [8] classical molecular dynamics portal, the EULER [9] genetic sequencing program, and the Adaptive Poisson-Boltzmann Solver (APBS) [10] program for calculating electrostatic potential surfaces around biomolecules. Each type of molecular computational software provides a level of understanding of molecular structure that can be used for a larger-scale understanding of function. What is needed next are strategies to link the molecular-scale technologies through the data and/or through novel algorithmic strategies; both involve additional Grid technologies.
Development of portal infrastructure has enabled considerable progress towards integration across scales, from molecules to cells, linking the wealth of ligand-based data present in the PDB with detailed molecular-scale quantum chemical structure and property data.
Trang 10Builder_Launcher 3DStructure 3DVibration 3DMolecular_Orbitals 3DElectrostatic_Surface 3DSolvent_Surface 3DReaction_Path 3D-Biopolymer_Properties
QMView
Computational modeling
Protein Data Bank (PDB)
Experimental characterization
Quantum mechanics Highly accurate Small molecule Semi Empirical
Moderate accuracy Moderate size molecule
Empirical Low accuracy Large complex
QM compute engine (e.g GAMESS)
*
Internet
Quantum Mechanical Biomedical Framework
(QM-BF)
Quantum
Mechanical
Data
Base
(QM-DB)
O
O O −Na+
N R
n
n
Figure 40.4 Topology of the QM-PDB framework.
As such, accurate quantum mechanical data that have hitherto been underutilized will be made accessible to the nonexpert for integrated molecule-to-cell studies, including visualization and analysis, to aid in more detailed molecular recognition and interaction studies than are currently available or sufficiently reliable. The resulting QM-PDB framework (Figure 40.4) integrates robust computational quantum chemistry software (e.g. GAMESS) with an associated visualization and analysis toolkit, QMView [11], and an associated prototype quantum mechanical (QM) database facility, together with the PDB. Educational tools and models are also integrated into the framework.
With the creation of Grid-based toolkits and associated environment spaces, researchers can begin to ask more complex questions in a variety of contexts over a broader range of scales, using seamless, transparent computing access. As more realistic molecular computations are enabled, extending well into the nanosecond and even microsecond range at a faster turnaround time, and as problems that simply could not fit within the