The new biology and the Grid
Kim Baldridge and Philip E Bourne
University of California, San Diego, California, United States
40.1 INTRODUCTION
Computational biology is undergoing a revolution from a traditionally compute-intensive science conducted by individuals and small research groups to a high-throughput, data-driven science conducted by teams working in both academia and industry. It is this new biology as a data-driven science in the era of Grid Computing that is the subject of this chapter. The chapter is written from the perspective of bioinformatics specialists who seek to capitalize fully on the promise of the Grid and who are working with the computer scientists and technologists developing biological applications for the Grid.
To understand what has been developed and what is proposed for utilizing the Grid in the new biology era, it is useful to review the ‘first wave’ of computational biology application models. In the next section, we describe the first wave of computational models used to date in computational biology and computational chemistry.
40.1.1 The first wave: compute-driven biology applications
The first computational models for biology and chemistry were developed for the classical von Neumann machine model, that is, for sequential, scalar processors. With the emergence of parallel computing, biological applications were developed that could take advantage of multiple-processor architectures with distributed or shared memory and locally attached disk space to execute a collection of tasks. Applications that compute the molecular structure or electronic interactions of a protein fragment are examples of programs developed to take advantage of emerging computational technologies.
As distributed-memory parallel architectures became more prevalent, computational biologists became familiar with message-passing library toolkits, first the Parallel Virtual Machine (PVM) and more recently the Message Passing Interface (MPI). These enabled biologists to use distributed computational models as a target for executing applications structured as a pipelined set of stages, each dependent on the completion of a previous stage. In pipelined applications, the computation involved in each stage can be relatively independent of the others. For example, one computer may perform molecular computations and immediately stream results to another computer for visualization and analysis of the data generated. Another application scenario is that of a computer used to collect data from an instrument (say, a tilt series from an electron microscope), which is then transferred to a supercomputer with a large shared memory to perform a volumetric reconstruction, which is in turn rendered on yet another high-performance graphics engine. The distribution of the application pipeline is driven by the number and type of tasks to be performed, the available architectures that can support each task, and the I/O requirements between tasks.
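To make the pipeline idea concrete, the following is a minimal sketch (not from the original chapter) of a two-stage pipeline, assuming Python and its standard multiprocessing module; the stage contents are placeholders standing in for a molecular computation and a downstream analysis or visualization step.

```python
# A minimal pipeline sketch: one process "computes" results and streams them
# to a second process for analysis, mimicking the staged applications
# described above. Names and stage contents are illustrative placeholders.
from multiprocessing import Process, Queue

def compute_stage(out_q, n_tasks=10):
    for i in range(n_tasks):
        result = i * i          # stand-in for a molecular computation
        out_q.put(result)       # stream each result as soon as it is ready
    out_q.put(None)             # sentinel: no more data

def analysis_stage(in_q):
    while True:
        result = in_q.get()
        if result is None:      # upstream stage has finished
            break
        print(f"analyzing result {result}")  # stand-in for visualization/analysis

if __name__ == "__main__":
    q = Queue()
    stages = [Process(target=compute_stage, args=(q,)),
              Process(target=analysis_stage, args=(q,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```

In a Grid setting the two stages would run on different resources chosen for their suitability to each task, with the queue replaced by a network data stream.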
While the need to support these applications continues to be very important in computational biology, an emerging challenge is to support a new generation of applications that analyze and/or process immense amounts of input/output data. In such applications, the computation on each of a large number of data points can be relatively small, and the ‘results’ of an application are provided by the analysis, and often visualization, of the input/output data. For such applications, the challenge to infrastructure developers is to provide a software environment that promotes application performance and can leverage large numbers of computational resources for simultaneous data analysis and processing. In this chapter, we consider these new applications, which are forming the next wave of computational biology.
40.1.2 The next wave: data-driven applications
The next wave of computational biology is characterized by high-throughput, high-technology, data-driven applications. The focus on genomics, exemplified by the human genome project, will engender new science impacting a wide spectrum of areas, from crop production to personalized medicine. And this is just the beginning. The amount of raw DNA sequence being deposited in the public databases doubles every 6 to 8 months. Bioinformatics and computational biology have become a prime focus of academic and industrial research. The core of this research is the analysis and synthesis of immense amounts of data, resulting in a new generation of applications that require information technology as a vehicle for the next generation of advances.
Bioinformatics grew out of the human genome project in the early 1990s. The requests for proposals for the physical and genetic mapping of specific chromosomes called for developments in informatics and computer science, not just for data management but for innovations in algorithms and the application of those algorithms to synergistically improve the rate and accuracy of the genetic mapping. A new generation of scientists was born, for whom demand still significantly outweighs supply, and who have been brought up on commodity hardware architectures and fast turnaround. This is a generation that contributed significantly to the fast adoption of the Web by biologists and that wants instant gratification, a generation that makes a strong distinction between wall-clock and CPU time. It makes no difference if an application runs 10 times as fast on a high-performance architecture (minimizing execution time) if you have to wait 10 times as long for the result by sitting in a long queue (maximizing turnaround time). In data-driven biology, turnaround time is important in part because of sampling: a partial result is generally useful while the full result is being generated. We will see specific examples of this subsequently; for now, let us better grasp the scientific field we wish the Grid to support.

The complexity of new biology applications reflects exponential growth rates at different levels of biological complexity. This is illustrated in Figure 40.1, which highlights representative activities at different levels of biological complexity. While bioinformatics is currently focusing on the molecular level, this is just the beginning. Molecules form complexes that are located in different parts of the cell. Cells differentiate into different types, forming organs like the brain and liver.
[Figure 40.1 From biological data comes knowledge and discovery. The figure plots year against levels of biological complexity (sequence, structure, assembly, sub-cellular, cellular, organ, higher life), with curves for computing power, sequencing technology, data volume, and people per Web site, and milestones ranging from the E. coli, yeast, and C. elegans genomes, ESTs, one small genome per month, gene chips, virus and ribosome structures, and a model metabolic pathway of E. coli to the human genome project, genetic circuits, neuronal and cardiac modeling, and brain mapping. A lower band traces the path from biological experiment through data, information, and knowledge to discovery (collect, characterize, compare, model, infer).]
Increasingly complex biological systems generate increasingly large and complex biological data sets. If we do not solve the problems of processing the data at the level of the molecule, we will not solve the problems of higher-order biological complexity.

Technology has catalyzed the development of the new biology, as shown on the right vertical axis of Figure 40.1. To date, Moore’s Law has at least allowed data processing to keep approximate pace with the rate at which data are produced. Moreover, the cost of disks, as well as the communication access revolution brought about by the Web, has enabled the science to flourish. Today, it costs approximately 1% of what it did 10 to 15 years ago to sequence one DNA base pair. With the current focus on genomics, data rates are anticipated to far outpace Moore’s Law in the near future, making Grid and cluster technologies all the more critical for the new biology to flourish.
Now and in the near future, a critical class of new biology applications will involve large-scale data production, data analysis and synthesis, and access to the processed data through the Web and/or advanced visualization tools, from high-performance databases ideally federated with other types of data. In the next section, we illustrate in more detail two new biology applications that fit this profile.
40.2 BIOINFORMATICS GRID APPLICATIONS TODAY
The two applications in this section require large-scale data analysis and management, wide access through Web portals, and visualization. In the sections below, we describe CEPAR (Combinatorial Extension in PARallel), a computational biology application, and CHEMPORT, a computational chemistry framework.
40.2.1 Example 1: CEPAR and CEPort – 3D protein structure comparison
The human genome, and the less advertised but very important 800 other genomes that have been mapped, encode genes. Those genes are the blueprints for the proteins that are synthesized by reading the genes. It is the proteins that are considered the building blocks of life: proteins control all cellular processes and define us as a species and as individuals. A step on the way to understanding protein function is protein structure – the 3D arrangement that recognizes other proteins, drugs, and so on. The growth in the number and complexity of protein structures has undergone the same revolution as shown in Figure 40.1, and can be observed in the evolution of the Protein Data Bank (PDB; http://www.pdb.org), the international repository for protein structure data.
A key element in understanding the relationship between biological structure and function is to characterize all known protein structures. From such a characterization comes the ability to infer the function of a protein once its structure has been determined, since similar structure implies similar function. High-throughput structure determination is now happening in what is known as structural genomics – a follow-on to the human genome project in which one objective is to determine all protein structures encoded by the genome of an organism. While a typical protein consists of some 300 amino acids, each one of 20 different types – a total of 20^300 possibilities, more than the number of atoms in the universe – nature has performed her own reduction, both in the number of sequences and in the number of protein structures, as defined by discrete folds. The number of unique protein folds is currently estimated at between 1000 and 10 000. These folds need to be characterized, and all new structures tested to see whether they conform to an existing fold or represent a new fold. In short, characterizing how all proteins fold requires that they be compared to each other in 3D in a pairwise fashion.
With approximately 30 000 protein chains currently available in the PDB, and with each pair taking 30 s to compare on a typical desktop processor using any one of several algorithms, computing all pairwise comparisons is a problem of size (30 000 × 30 000/2) × 30 s, that is, a total of 428 CPU years on one processor. Using a combination of data reduction (a pre-filtering step that permits one structure to represent a number of similar structures), data organization optimization, and efficient scheduling, this computation was performed on 1000 processors of the 1.7-Teraflop IBM Blue Horizon in a matter of days, using our Combinatorial Extension (CE) algorithm for pairwise structure comparison [1]. The result is a database of comparisons that is used by a worldwide community of users 5 to 10 000 times per month and has led to a number of interesting discoveries cited in over 80 research papers. The resulting database is maintained by the San Diego Supercomputer Center (SDSC) and is available at http://cl.sdsc.edu/ce.html [2]. The procedure to compute and update this database as new structures become available is equally amenable to Grid and cluster architectures, and a Web portal has been established to permit users to submit their own structures for comparison.
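As a back-of-envelope check of the figure quoted above (illustrative Python, not code from the chapter):

```python
# Rough cost of the all-against-all comparison quoted above.
chains = 30_000            # protein chains in the PDB at the time
secs_per_pair = 30         # seconds per pairwise CE comparison on a desktop CPU
pairs = chains * chains // 2                  # ~4.5e8 comparisons
cpu_seconds = pairs * secs_per_pair           # ~1.35e10 s
print(cpu_seconds / (3600 * 24 * 365))        # ~428 CPU-years on one processor
```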
In the next section, we describe the optimizations used to reduce execution time and increase the applicability of the CE application to distributed and Grid resources. The result is a new version of the CE algorithm that we refer to as CEPAR. CEPAR distributes each 3D comparison of two protein chains to a separate processor for analysis. Since each pairwise comparison represents an independent calculation, this is an embarrassingly parallel problem.
40.2.1.1 Optimizations of CEPAR
The optimization of CEPAR involves structuring CE as an efficient and scalable master/worker algorithm. While initially implemented on 1024 processors of Blue Horizon, the algorithm and the optimizations undertaken can execute equally well on a Grid platform. The addition of resources available on demand through the Grid is an important next step for problems requiring computational and data integration resources of this magnitude.
We have employed algorithmic and optimization strategies, based on numerical studies of CEPAR, that have had a major impact on performance and scalability. To illustrate what can be done in distributed environments, we discuss them here; the intent is to familiarize the reader with one approach to optimizing a bioinformatics application for the Grid. Using a trial version of the algorithm without optimization (Figure 40.2), performance bottlenecks were identified. The algorithm was then redesigned and implemented with the following optimizations:
1. The assignment packets (chunks of data to be worked on) are buffered in advance.
2. The master processor algorithm prioritizes incoming messages from workers, since such messages influence the course of further calculations.
3. Workers processing a data stream that no longer poses any interest (based on a result from another worker) are halted; we call this early stopping.
4. Standard single-processor optimization techniques are applied to the master processor.

Figure 40.2 Scalability of CEPAR running on a sample database of 3422 data points (protein chains). The circles show the performance of the trial version of the code. The triangles show the improved performance after improvements 1, 2, and 4 were added to the trial version. The squares show the performance based on timings obtained with the early stopping criterion (improvement 3). The diamonds illustrate ideal scaling. [The plot shows performance versus the number of processors, from 0 to 1024, for the ideal, early stopping, 100% processed, and trial cases.]

With these optimizations, the scalability of CEPAR was significantly improved, as can be seen from Figure 40.2.
The MPI implementation on the master processor is straightforward, but it was essential to use buffered sends (or another means, such as asynchronous sends) to avoid congestion of the communication channel. In summary, on 1024 processors the CEPAR algorithm outperforms CE (with no parallel optimization) by 30 to 1 and scales well. It is anticipated that this scaling would continue on even larger numbers of processors.
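The following is a heavily simplified sketch of this master/worker pattern, written with mpi4py purely for illustration; CEPAR itself is not Python, and the tags, buffering depth, and placeholder comparison function are assumptions rather than details of the actual code.

```python
# Simplified master/worker sketch in the spirit of CEPAR (illustrative only).
# Shows pre-buffered assignments, result-driven hand-out of new work, and a
# stop tag used here for shutdown (CEPAR uses a similar mechanism to halt
# workers whose data stream is no longer of interest). Run with >= 2 MPI
# ranks; assumes there are more pairs than initial buffer slots.
from mpi4py import MPI

TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3

def compare(pair):
    """Placeholder for a pairwise structure comparison."""
    i, j = pair
    return (i, j, abs(i - j) % 7)   # dummy 'similarity' value

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:                                        # master
    pairs = [(i, j) for i in range(100) for j in range(i + 1, 100)]
    work = iter(pairs)
    sends = []
    # keep two assignment packets buffered per worker (optimization 1)
    for worker in range(1, size):
        for _ in range(2):
            sends.append(comm.isend(next(work), dest=worker, tag=TAG_WORK))
    results = []
    status = MPI.Status()
    while len(results) < len(pairs):
        # incoming worker results are handled before anything else (optimization 2)
        results.append(comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status))
        try:
            sends.append(comm.isend(next(work), dest=status.Get_source(), tag=TAG_WORK))
        except StopIteration:
            pass                                      # no work left to hand out
    for worker in range(1, size):                     # halt the workers
        comm.send(None, dest=worker, tag=TAG_STOP)
    for req in sends:
        req.wait()
else:                                                 # worker
    status = MPI.Status()
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(compare(task), dest=0, tag=TAG_RESULT)
```

The non-blocking sends on the master play the role that buffered sends play in CEPAR: the master never waits on a slow communication channel before servicing the next worker message.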
One final point concerns the load imbalance at the end of the computation: a large number of processors can remain idle while the final few complete their work. We chose to handle this by breaking runs involving a large number of processors into two separate runs. The first run does most of the work and exits when the early stopping criterion is met; the second run then completes the task for the outliers using a small number of processors, thus freeing the other processors for other users. Ease of use of the software is maintained through an automatic two-step job-processing utility.
CEPAR has been developed to support our current research efforts on PDB structure-similarity analysis on the Grid. The CEPAR software uses MPI, a universal standard for interprocessor communication; it is therefore suitable for running in any parallel environment that has an MPI implementation, including PC clusters and Grids. There is no dependence on the particular structural alignment algorithm or on the specific application.
The CEPAR design provides a framework that can be applied to other problems facing computational biologists today in which large numbers of data points need to be processed in an embarrassingly parallel way; pairwise sequence comparison, as described subsequently, is an example. Researchers and programmers working on parallel software for these problems may find useful the information on the bottlenecks and the optimization techniques used to overcome them, as well as the general approach of using numerical studies to aid algorithm design, briefly reported here and given in more detail in Reference [1]. But what of the naive user wishing to take advantage of high-performance Grid computing?
40.2.1.2 CEPAR portals
One feature of CEPAR is the ability to allow users worldwide to submit their own structures for comparison and alignment against the existing database of structures. This service currently runs on a Sun Enterprise server as part of the CE Web site (http://cl.sdsc.edu/ce.html) outlined above. Each computation takes, on average, three hours of CPU time for a single user request. On occasion, this service must be turned off, as the number of requests for structure comparisons far outweighs what can be processed on a Sun Enterprise server. To overcome this shortage of compute resources, a Grid portal has been established to handle the load (https://gridport.npaci.edu/CE/) using SDSC’s GridPort technology [3]. The portal allows this computation to be done using additional resources when they are available. The initial target compute resources for the portal are the IBM Blue Horizon, a 64-node Sun Enterprise server, and a Linux PC cluster of 64 nodes.
The GridPort Toolkit [3] is composed of a collection of modules that provide portal services running on a Web server, together with template Web pages needed to implement a Web portal. The function of GridPort is simply to act as a Web front end to Globus services [4], which provide a virtualization layer for distributed resources. The only requirements for adding a new high-performance computing (HPC) resource to the portal are that the CE program be recompiled on the new architecture and that Globus services be running on it. Together, these technologies allowed the development of a portal with the following capabilities:
• Secure and encrypted access for each user to his/her HPC accounts, allowing submission, monitoring, and deletion of jobs, as well as file management;
• Separation of client application (CE) and Web portal services onto separate servers;
• A single, common point of access to multiple heterogeneous compute resources;
• Availability of real-time status information on each compute machine;
• Easy adaptability (e.g. the addition of newly available compute resources, modification of user interfaces, etc.).
40.2.1.3 Work in progress
While the CE portal is operational, much work remains to be done. A high priority is the implementation of a distributed file system for the databases, user input files, jobs in progress, and results. A single shared, persistent file space is a key component of the distributed abstract machine model on which GridPort was built. At present, files must be explicitly transferred from the server to the compute machine and back again; while this process is invisible to the user, from the point of view of portal development and administration it is not the most elegant solution to the problem. Furthermore, the present system requires that the all-against-all database be stored locally on the file system of each compute machine, which means that database updates must be carried out individually on each machine.
These problems could be solved by placing all user files, along with the databases, in a shared file system that is available to the Web server and all the HPC machines. Adding Storage Resource Broker (SRB) [5] capability to the portal would achieve this. Work is presently ongoing on automatically creating an SRB collection for each registered GridPort user; once this is complete, SRB will be added to the CE portal.
Another feature that could be added to the portal is the automatic selection of the compute machine. Once ‘real-world’ data on CPU allocation and turnaround time become available, it should be possible to write scripts that inspect the queue status on each compute machine and allocate each new CE search to the machine expected to produce results in the shortest time.
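A sketch of what such a selection script might look like is given below; queue_wait_estimate() and submit() are hypothetical placeholders (they are not GridPort or Globus calls), and the machine names and runtime estimates are illustrative.

```python
# Hypothetical machine-selection sketch: pick the compute resource expected
# to return a CE search soonest. queue_wait_estimate() and submit() are
# placeholders, not real GridPort or Globus functions.
MACHINES = ["blue-horizon", "sun-enterprise", "linux-cluster"]

# assumed per-machine runtime of one CE search, in minutes (illustrative)
RUNTIME_ESTIMATE = {"blue-horizon": 30, "sun-enterprise": 180, "linux-cluster": 60}

def queue_wait_estimate(machine: str) -> float:
    """Placeholder: return the estimated queue wait on `machine`, in minutes."""
    raise NotImplementedError

def submit(machine: str, job_description: dict) -> str:
    """Placeholder: submit the CE search to `machine` and return a job id."""
    raise NotImplementedError

def select_and_submit(job_description: dict) -> str:
    # expected turnaround = time waiting in the queue + time actually running
    best = min(MACHINES,
               key=lambda m: queue_wait_estimate(m) + RUNTIME_ESTIMATE[m])
    return submit(best, job_description)
```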
The current job status monitoring system could also be improved. Work is under way to add an event daemon to the GridPort system, such that compute machines could notify the portal directly when, for example, searches are scheduled, start, and finish. This would remove the portal’s reliance on intermittent inspection of the queue of each HPC machine and provide near-instantaneous status updates. Such a system would also allow the portal to be regularly updated with other information, such as warnings when compute machines are about to go down for scheduled maintenance, broadcast messages from HPC system administrators, and so on.
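As a purely illustrative sketch of this push-style alternative (the portal endpoint and payload format below are assumptions, not part of GridPort), a compute machine could post status events to the portal rather than waiting to be polled:

```python
# Illustrative event notification: a compute-side daemon posts job status
# changes to a (hypothetical) portal endpoint instead of relying on the
# portal to poll each batch queue.
import json
import urllib.request

PORTAL_EVENT_URL = "https://portal.example.org/events"  # assumed endpoint

def notify_portal(job_id: str, state: str) -> None:
    """Send a status event ('scheduled', 'started', 'finished', ...) to the portal."""
    payload = json.dumps({"job_id": job_id, "state": state}).encode("utf-8")
    req = urllib.request.Request(PORTAL_EVENT_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# e.g. called from the batch system's job prologue/epilogue scripts:
# notify_portal("ce-search-42", "started")
```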
40.2.2 Example 2: Chemport – a quantum mechanical biomedical framework
The success of highly efficient, composite software for molecular structure and dynamics prediction has driven the proliferation of computational tools and the development of first-generation cheminformatics for data storage, analysis, mining, management, and presentation. However, these first-generation cheminformatics tools do not meet the needs of today’s researchers. Massive volumes of data spanning the molecular scale are now routinely being created, both experimentally and computationally, and are available for access by an expanding scope of research. What is required for continued progress is the integration of the individual ‘pieces’ of the methodologies involved and the facilitation of the computations in the most efficient manner possible.
Towards meeting these goals, applications and technology specialists have made considerable progress in solving some of the problems associated with integrating the algorithms that span the molecular scale, both computationally and through the data, as well as in providing infrastructure that removes the complexity of logging on to an HPC system in order to submit jobs, retrieve results, and supply ‘hooks’ into other codes. In this section, we give an example of a framework that serves as a working environment for researchers and that demonstrates new uses of the Grid for computational chemistry and biochemistry studies.
Figure 40.3 The job submission page from the SDSC GAMESS portal.
Using GridPort technologies [3], as described for CEPAR, our efforts began with the creation of a portal for carrying out chemistry computations to understand various details of structure and properties for molecular systems – the General Atomic and Molecular Electronic Structure System (GAMESS) [6] quantum chemistry portal (http://gridport.npaci.edu/gamess). The GAMESS software has been deployed on a variety of computational platforms, including both distributed- and shared-memory platforms. The job submission page from the GAMESS portal is shown in Figure 40.3. The portal uses Grid technologies such as SDSC’s GridPort toolkit [3], the SDSC SRB [5], and Globus [7] to assemble and monitor jobs, as well as to store the results. One goal in the creation of a new architecture is to improve the user experience by streamlining job creation and management.
Related molecular sequence, structure, and property portals have been created using similar frameworks, including the AMBER [8] classical molecular dynamics portal, the EULER [9] genetic sequencing program, and the Adaptive Poisson-Boltzmann Solver (APBS) [10] program for calculating electrostatic potential surfaces around biomolecules. Each type of molecular computational software provides a level of understanding of molecular structure that can be used for a larger-scale understanding of function. What is needed next are strategies to link the molecular-scale technologies through the data and/or through novel algorithmic strategies; both involve additional Grid technologies.
Development of portal infrastructure has enabled considerable progress towards integration across scales, from molecules to cells, linking the wealth of ligand-based data present in the PDB with detailed molecular-scale quantum chemical structure and property data.
Trang 10Builder_Launcher 3DStructure 3DVibration 3DMolecular_Orbitals 3DElectrostatic_Surface 3DSolvent_Surface 3DReaction_Path 3D-Biopolymer_Properties
QMView
Computational modeling
Protein Data Bank (PDB)
Experimental characterization
Quantum mechanics Highly accurate Small molecule Semi Empirical
Moderate accuracy Moderate size molecule
Empirical Low accuracy Large complex
QM compute engine (e.g GAMESS)
*
Internet
Quantum Mechanical Biomedical Framework
(QM-BF)
Quantum
Mechanical
Data
Base
(QM-DB)
O
O O −Na+
N R
n
n
Figure 40.4 Topology of the QM-PDB framework.
As such, accurate quantum mechanical data that have hitherto been underutilized will be made accessible to the nonexpert for integrated molecule-to-cell studies, including visualization and analysis, to aid in more detailed molecular recognition and interaction studies than are currently available or sufficiently reliable. The resulting QM-PDB framework (Figure 40.4) integrates robust computational quantum chemistry software (e.g. GAMESS) with an associated visualization and analysis toolkit, QMView [11], and an associated prototype quantum mechanical (QM) database facility, together with the PDB. Educational tools and models are also integrated into the framework.
With the creation of Grid-based toolkits and associated environment spaces, researchers can begin to ask more complex questions in a variety of contexts over a broader range of scales, using seamless, transparent computing access. As more realistic molecular computations are enabled, extending well into the nanosecond and even microsecond range at a faster turnaround time, and as problems that simply could not fit within the