An integrated computational pipeline and database supporting whole genome sequence annotation
C.J. Mungall (3, 5), S. Misra (1, 4), B.P. Berman (1), J. Carlson (2), E. Frise (2), N. Harris (2, 4), B. Marshall (1), S. Shu (1, 4), E. Smith (1, 4), C. Wiel (1, 4), G. Rubin (1, 2, 3, 4), and S.E. Lewis (1, 4).
1 Department of Molecular and Cellular Biology, University of California, Berkeley, CA
2 Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA.
3 Howard Hughes Medical Institute
4 FlyBase-Berkeley, University of California, Berkeley, CA
Background
Any large-scale genome annotation project requires a computational pipeline that can coordinate a wide range of sequence analyses and a database that can monitor the pipeline and store the results it generates. The compute pipeline must be as sensitive as possible to avoid overlooking information and yet selective enough to avoid introducing extraneous information. The data management infrastructure must be capable of tracking the entire annotation process as well as storing and displaying the results in a way that accurately reflects the underlying biology.
Results
We present a case study of our experiences in annotating the Drosophila genome sequence. The key decisions and choices for construction of a genomic analysis and data management system are presented, as well as a critical evaluation of our current process. We describe several new open source software tools and a database schema to support large-scale genome annotation.
Conclusions
We have developed an integrated and re-usable software system for whole genome annotation. The key contributing factors to overall annotation quality are marshalling high-quality, clean sequences for alignments and achieving flexibility in the design and architecture of the system.
Background
The information held in genomic sequence is encoded, and to understand the import of the sequence we must therefore first assess and describe this primary data. This initial computational assessment generates some measure of the biologically relevant characteristics present in the sequence, for example coding potential or sequence similarity. Because of the amount of sequence to be examined and the volume of data generated, these measures must be automatically computed and carefully filtered.
At the time we launched this effort there were a few other computational pipelines available and we investigated the possibility of reusing one of these for our project. Unfortunately, this was not possible, primarily because of differences in strategic approach, although other factors such as scalability, availability, and customizability also played a role. Many of these pipelines are intended primarily for ad hoc queries from individual users and would not scale up sufficiently; while useful, they were not under consideration for the comprehensive analysis of an entire genome [1, 2, 3].
For whole genome analysis there are essentially three different strategies: a computational synthesis to predict the best gene models; aggregations of community-contributed analyses that the person viewing the data integrates visually; and curation by experts using a full trail of evidence to support an integrated assessment. Groups that are charged with rapidly providing a dispersed community with finished genome annotations have chosen a purely computational route; examples of this strategy are Ensembl [4], NCBI [5], and Celera [6]. Aggregative approaches adapt well to the dynamics of collaborative groups who are focused on sharing results as they develop; examples of this strategy are the University of California Santa Cruz (UCSC) viewer [7] and the Distributed Annotation System (DAS) [8]. For organisms with well-established and cohesive communities the demand is for carefully reviewed and qualified annotations; the representatives of this approach are two of the oldest genome community databases, ACeDB for C. elegans [9] and FlyBase for Drosophila [10].
Our decision was to proceed directly towards the goal of actively examining every gene and feature of the genome to improve the quality of the annotations. The prerequisites for this goal are a computational pipeline, a database, and an editing tool for the experts. This paper discusses our solution to the first two requirements. The editing tool, Apollo, is described in an accompanying paper [11]. Our long-term goal is to provide a set of open source software tools to support large-scale genome annotation.
Our primary design requirement was flexibility, so that the pipeline could easily be attuned to the needs of the curators, for example by using unique data sets for comparisons such as direct submissions from individual researchers, sequences generated by our internal EST and cDNA projects [12], and custom configurations of sequences from the public databases (a detailed description of the data sets used is available in Misra et al. [13]). The aim was to provide the biological experts with every salient piece of information possible and then enable them to efficiently summarize this information manually.
RESULTS
The sequence data sets are the primary input into the pipeline. There are three different categories we will discuss: the Drosophila genomic sequence on which we are trying to detect features; expressed sequences and other sequences that are of Drosophila origin; and informative sequences from other species.
Drosophila genomic sequence
The release 3 genomic sequence was generated using Bacterial Artificial Chromosome (BAC) clones that formed a complete tiling path across the genome [14]. We used the BAC sequences to assemble a single continuous sequence for each chromosome arm to verify order and overlaps. This was accomplished using in-house software that utilized tiling path data from physical mapping work (a combination of in situ hybridization and sequence tagged site [STS] mapping) to chain BAC sequences together. At a certain point, the assembly for each chromosomal arm was frozen, because all possible gaps were filled and, equally, because it is essential for annotation that the underlying sequence is stable. This then became the release 3 sequence.
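To make the chaining step concrete, the sketch below assembles an arm sequence from an ordered tiling path by appending only the non-overlapping suffix of each clone. It is a minimal illustration, not the in-house assembly software itself; the data structure, the example sequences, and the overlap_with_prev field (derived in practice from the physical map and sequence overlap checks) are hypothetical.

```perl
#!/usr/bin/env perl
# Minimal sketch: chain an ordered tiling path of BAC sequences into one
# arm sequence. Assumes each entry carries the length of its overlap with
# the previous clone (hypothetical field derived from the physical map).
use strict;
use warnings;

my @tiling_path = (
    { id => 'BACR01A01', seq => 'ACGTACGTAC', overlap_with_prev => 0 },
    { id => 'BACR01B02', seq => 'GTACGGTTAA', overlap_with_prev => 4 },
    { id => 'BACR01C03', seq => 'TTAACCCGGG', overlap_with_prev => 4 },
);

my $arm_seq = '';
for my $bac (@tiling_path) {
    # Append only the portion that extends past the previous clone.
    $arm_seq .= substr( $bac->{seq}, $bac->{overlap_with_prev} );
}
print length($arm_seq), " bp assembled\n";
```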
Choosing the unit of annotation, both for the pipeline and for the curators, is a 'chicken and egg' problem. There are two contrary and arbitrary breakdowns (BAC sequences and public sequence accessions) and one biological breakdown (the protein-coding gene region). Ideally we would annotate using the biological breakdown, but initially there is no way of knowing this and so the entire process must be bootstrapped up from the arbitrary breakdowns.
We considered using the BAC sequences directly as the pipeline input, which has the advantage of one less processing step. Ultimately, however, we rejected this idea because the BAC sequences are relatively short and contain random portions of the genome; there is thus a high probability of splitting the exons from a single gene onto multiple BAC sequences and, as a consequence, complicating the annotation of these genes. Instead, the main sequence unit we used in our genomic pipeline was the Genbank accession. These are usually of a size manageable by most analysis programs (around 300 kilobases), but we still faced the issue of genes straddling these arbitrary units. As our solution we carried out a two-step analysis. First we fed the BAC sequences into a pre-analysis pipeline, which is a lightweight version of the full annotation pipeline. This gave us a rough idea of where the various genes were located. We then projected these analysis results from BAC clone coordinates into coordinates on the full arm sequence assembly. This step was followed by the use of another in-house software tool to divide up the arm sequence, trying to simultaneously optimize two constraints: one constraint is correspondence to the pre-existing release 2 accessions in Genbank/EMBL/DDBJ [15, 16, 17]; the other constraint is avoiding the creation of gene models that straddle the boundaries between two accessions, as determined by the rough pre-analysis of the BAC sequences. Because this was an approximation, the cuts were later refined by the curators and Genbank. During the annotation process, if a curator discovers that a unit was divided wrongly, and in fact it breaks a gene, they request an extension sufficient to cover the gene. Once extended, the sequence is reanalyzed and exported again. All extensions were made to the right, to avoid complicated coordinate adjustments. Further adjustments were made by Genbank to ensure, to the degree it was possible, that genes remained on the same sequence accession.
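At its core this projection is an offset calculation. The sketch below assumes we know each BAC's start coordinate on the frozen arm assembly and shows a plus-strand projection only; the accession names, offsets, and field names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical offsets of each BAC on the arm assembly (1-based).
my %bac_offset_on_arm = ( 'BACR01A01' => 1, 'BACR01B02' => 150_001 );

# A pre-analysis hit in BAC coordinates.
my %hit = ( bac => 'BACR01B02', start => 2_340, end => 3_125 );

# Project to arm coordinates.
my $offset    = $bac_offset_on_arm{ $hit{bac} };
my $arm_start = $offset + $hit{start} - 1;
my $arm_end   = $offset + $hit{end}   - 1;
print "arm:$arm_start-$arm_end\n";
```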
It is these divisions of the genome sequence that are then fed into the full automated annotation pipeline. This computation can take up to a week to complete for a full chromosome arm.
Drosophila-specific sequences
To re-annotate a genome in sufficient detail, an extensive set of sequences was necessary for sequence alignments and searches for homologous sequences.
First, we collected the nucleic acid sequences of all of the Release 2 Drosophila predicted genes, to align them to the new finished genomic sequence for use as a starting point for the Release 3 annotations.
Second, we built Drosophila nucleic acid sequence datasets of full-length cDNA and EST data from three different sources: the Berkeley Drosophila Genome Project (BDGP), public Genbank submissions, and reviewed error reports sent to FlyBase directly and recorded as personal communications. From what was available at the BDGP full-length cDNA project we took care to include all sequence reads, including those that were not yet assembled, so as to have the most comprehensive and up-to-date information possible. We pulled from Genbank all Drosophila entries held in dbEST and the nucleic acid sequences from the INV division, excluding our own BDGP submissions and sequences held in other divisions, such as genome survey sequences (GSS). The non-BDGP EST sequences were added to the BDGP EST sequences and provided us with a comprehensive EST set. The nucleic acid sequences from the INV dataset contained a redundant mix of complete and partial cDNAs as well as genomic sequences; in the future, we plan to isolate the cDNA sequences using feature table information from the genomic sequence submissions (this set was not combined with the BDGP full-length cDNA sequences). The larger FlyBase research community sent complete and partial cDNA sequences and protein sequences to FlyBase as error reports, and these were manually collected and placed into datasets for pipeline analysis [18]. As a group, these cDNA and EST sequence sets, when aligned to the genomic sequence, were the key to improving the annotations by sensitively revealing the exon-intron structure of genes.
Third, along with the EST and complete cDNA sequences from the BDGP, FlyBase reviewed and collated sequences from the scientific community as Annotated Reference Gene Sequences (ARGS). These manually created sequences integrate information from the literature with every Genbank submission available for a particular gene to offer a gold-standard annotation for a gene, and these were utilized wherever possible.
Fourth, we obtained a curated amino acid set of those Drosophila translations supported by experimental evidence, to find proteins related to paralogs elsewhere in the fly genome. In order to avoid using previous predictions as evidence for the new release, which would be a circular argument for annotation, we limited these to SWISS-PROT and SpTrEMBL proteins supported by independent experimental evidence [19].
Fifth, we retrieved non-protein-coding nucleic acid sequences for Drosophila tRNAs, snRNAs, snoRNAs, and microRNAs from Genbank via FlyBase and used these to manually generate independent datasets for each category [20]. The tRNA set was made comprehensive by utilizing coordinates of previously identified tRNAs [21]. The genomic analysis of transposable elements is described separately [22], but these data provided the sequences that the program RepeatMasker [23] used prior to running BLASTX.
We also have two other types of sequence information available. One is the STS and BAC end sequences used for physical mapping, and the other is the flanking sequences from P element insertion events that are part of the mutagenesis project. These sequences were also aligned to the genome by the pipeline, but were not used directly during annotation.
Other organism sequences
To look for cross-species sequence similarity, we wanted to use the BLASTX program in conjunction with protein datasets that would be current and comprehensive but also as non-redundant and biologically accurate as possible. We decided to use the SPTR dataset [24], which supplements the manually annotated SWISS-PROT protein dataset [25] with SpTrEMBL (computationally annotated proteins) and TrEMBLNew (proteins from the past week that are not yet in SpTrEMBL), but excludes RemTrEMBL (patent data and synthetic protein sequences). In order to ensure we had the best match from a variety of model organisms, we split SPTR and used the following subdivisions for separate BLASTX analyses: rodents, primates, C. elegans, S. cerevisiae, plants, other invertebrates, and other vertebrates.
We also obtained from Genbank the nucleic acid Mus musculus UniGene set and the insect sequences from dbEST, to look for similarities by TBLASTX that might not be identified by BLASTX searching of proteins. We originally used all of dbEST in our pipeline, but later decided to remove most ESTs in order to lower the compute load. As TBLASTX must translate both query and subject sequences, it is highly compute intensive, and the other EST alignments added little new information to the overall analysis.
The task monitoring and scheduling pipeline
Software infrastructure
There are three major infrastructure components of the pipeline: the database, the Perl modules (named Pipeline::*), and sufficient computational power, including a job management system to allocate this resource. The database is crucial because it maintains a persistent record reflecting the current state of all of the tasks that are in progress. Maintaining the system state in a database is a much more robust and resilient approach than simply using a file system because it offers transaction-locking mechanisms to ensure that a series of operations is always fully completed. We used a MySQL [26] database to manage the large number of analyses run against the genome, transcriptome and proteome. The Perl modules provide an application programmer interface (API) that is used to launch and monitor jobs, retrieve results and support other interactions with the database. As an inexpensive solution to satisfy the computational requirements we built a Beowulf cluster.
MySQL is an open source Structured Query Language (SQL) database that has the advantage of being fast, free and simple to maintain. It has several disadvantages compared to other SQL databases, in that it only implements a subset of the SQL standard and lacks many other special features found in other database systems. An SQL database manages data as a collection of tables. Each table has a fixed set of columns (also called fields) and usually corresponds to a particular concept in the domain being modeled. Tables can be cross-referenced by using primary and foreign key fields. The database tables can be queried using the SQL language, which allows the dynamic combination of data from different tables [27]. A collection of these tables is called a database schema, and a particular instantiation of that schema with the tables populated is a database.
There are four basic abstractions that all components of the pipeline system operate upon: a sequence, a job, an analysis, and a batch. A sequence is defined as a string of amino acids or nucleic acids held either in the database or as an entry in a FASTA file (usually both). A job is an instance of a particular program being run to analyze a particular sequence; for example, running BLAST to compare one sequence to a peptide set is considered a single job. Jobs can be chained together: if a job A is dependent on the output of job B, then the pipeline software will not launch job A until job B is complete. This is the situation, for instance, with programs that require masked sequence as input.
An analysis is a collection of jobs that run one program with the same arguments against a set of sequences. Lastly, a batch is a collection of analyses a user launches simultaneously. Jobs, analyses and batches all have a state attribute that is used to track their progress through the pipeline (Figure 1). The state of an analysis is the same as the state of the slowest job in that analysis, and the state of a batch is the same as that of the slowest analysis in that batch.
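As a rough illustration of how the state attribute rolls up, the sketch below ranks a few plausible job states by progress and reports the state of the slowest member; the same rule takes a batch's state from its slowest analysis. The SUBMITTED and RUNNING labels and their ordering are assumptions (only FIN and PROCD are mentioned later in the text), and the real Pipeline::* modules may use a different vocabulary.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical ordering of job states, earliest (slowest) first.
my @order = qw(SUBMITTED RUNNING FIN PROCD);
my %rank  = map { $order[$_] => $_ } 0 .. $#order;

# The state of an analysis is the state of its slowest job;
# the same rule rolls a batch up from its analyses.
sub rollup_state {
    my @states  = @_;
    my $slowest = min map { $rank{$_} } @states;
    return $order[$slowest];
}

print rollup_state(qw(PROCD FIN RUNNING PROCD)), "\n";    # RUNNING
```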
Trang 8The three applications that use the Perl API are the pipe_launcher.pl script, the flyshellinteractive command line interpreter, and the Internet browser front end Bothpipe_launcher.pl and flyshell provide pipeline users with a powerful variety of ways tolaunch and monitor jobs, analyses and batches and are useful to both those with a basicunderstanding of Unix and bioinformatics tools as well as to those with a strongknowledge of object-oriented Perl The web front end is used for monitoring theprogress of the jobs in the pipeline
pipe_launcher.pl—is a command line tool that is useful for both programmers and non-programmers. To launch jobs, users create configuration files that specify input data sources and any number of analyses to be performed on each of these data sources, along with the arguments for each of the analyses. Most of these specifications can be overridden with command line options. This allows each user to create a library of configuration files for sending off large batches of jobs, which they can alter with command line arguments when necessary. pipe_launcher.pl returns the batch identifier generated by the database to the user. To monitor jobs in progress, the batch identifier can be used in a variety of commands, such as monitor, batch, deletebatch and query_batch.
flyshell.pl—provides a more flexible interface to power users who are familiar with object-oriented Perl. flyshell.pl is an interactive command line Perl interpreter that presents the Gadfly and pipeline APIs to the end user.
web front end—allows convenient, browser-based access for end users to follow analysis status. An HTML form allows users to query the pipeline database by job, analysis or batch identifier, as well as by sequence identifier. The user can drill down through batches and analyses to get to individual jobs and obtain the status, raw job output and error files of each job. This window on the pipeline has proven to be a useful tool for quickly viewing results.
Once a job is finished (in the database the job's state is set to FIN), the raw results are recorded in the database and may be retrieved through the web interface or through either Perl interface. Following this, the raw results are parsed, filtered, and stored in the chosen Gadfly database (and the job's state is set to PROCD). At this point a GAME XML representation of the processed data can similarly be retrieved through either the Perl or web interfaces.
Analysis software
The pipeline involves numerous computational analyses that generate data, as might be expected. What is perhaps less obvious is that there is also a need to screen and filter data, and this is equally important to the system. There are two primary reasons for this: one is to increase the efficiency of the pipeline by reducing the amount of data that compute-intensive tasks must process; the other is to increase the signal-to-noise ratio by eliminating results that lack content.
Sim4wrap—Sim4 [28] is a highly useful and largely accurate way of aligning cDNA and EST sequences against the genome. Unfortunately it is highly compute expensive compared with BLASTN. To make the most use of our resources, we split the alignment of Drosophila cDNA and EST sequences into two serial tasks and wrote a utility program (Sim4wrap) for this purpose. Sim4wrap executes a first pass using BLASTN, using our genomic scaffold as the query sequence and the transcript sequences as the subject database. We run BLASTN with the "-B 0" option, as we are only interested in the summary part of the BLAST report, not in the high scoring pairs (HSPs) portion where the alignments are shown. From this BLAST report summary, Sim4wrap parses out the sequence identifiers and filters the original database to produce a temporary FASTA data file that contains only these sequences. Finally, we run sim4 using the genomic sequence as the query and the minimal set of sequences that we have culled as the subject.
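A skeletal version of this two-pass strategy is sketched below. The command lines and the report parsing are placeholders rather than the production Sim4wrap code; the point is the flow: a permissive BLASTN pass selects candidate transcripts, and only those survivors are handed to the much more expensive sim4 alignment.

```perl
#!/usr/bin/env perl
# Sketch of the Sim4wrap idea: BLASTN pre-screen, then sim4 on the survivors.
# Command lines and report parsing are simplified placeholders.
use strict;
use warnings;

die "usage: $0 genomic.fa transcripts.fa\n" unless @ARGV == 2;
my ( $genomic_fa, $transcripts_fa ) = @ARGV;

# Pass 1: permissive BLASTN, summary only (no HSP alignments needed).
my @report = `blastn $transcripts_fa $genomic_fa -B 0`;

# Collect the identifiers of transcripts with any hit at all.
# Placeholder parse: the real Sim4wrap reads the summary section of the report.
my %hit_ids;
for my $line (@report) {
    $hit_ids{$1} = 1 if $line =~ /^>(\S+)/;
}

# Write a temporary FASTA containing only those transcripts.
open my $in,  '<', $transcripts_fa  or die $!;
open my $out, '>', 'sim4_subset.fa' or die $!;
my $keep = 0;
while (<$in>) {
    $keep = /^>(\S+)/ ? exists $hit_ids{$1} : $keep;
    print {$out} $_ if $keep;
}
close $in;
close $out;

# Pass 2: the expensive spliced alignment, restricted to the culled set.
system( 'sim4', $genomic_fa, 'sim4_subset.fa' ) == 0 or die "sim4 failed\n";
```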
Autopromote—The Drosophila genome is not a blank slate, because there are previous annotations from the release 2 genomic sequence. Therefore, before the curation of a chromosome arm begins, we first "auto-promote" the release 2 annotations and certain computational analysis results to the status of full annotations. This speeds the annotation process by providing a starting point for the curators to work from.
The auto-promotion software must be able to synthesize different analysis result tiers, some of which may be conflicting. Our auto-promotion software is a component within the Gadfly software collection. It works by building a graph of exon-level intersections between all relevant result features. Different result features are weighted differently, and the intersection graph forms a voting network. In order to resolve conflicts over whether a set of predictions should be multiple split genes or a single merged gene, we only allow one vote per feature, and voters must mutually support one another. This analysis includes an automated check to see whether any transcripts from the previous release are no longer present after this process.
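The following is a drastically simplified sketch of the weighted exon-overlap voting idea. The feature types, weights, coordinates, and the one-vote rule shown here are illustrative assumptions; the production auto-promoter additionally requires mutual support between voters and resolves split versus merged gene conflicts, which is not reproduced.

```perl
use strict;
use warnings;

# Hypothetical evidence weights; both models and evidence are lists of exon [start, end] pairs.
my %weight = ( est_alignment => 1, cdna_alignment => 3, release2_gene => 2 );

my %models = (
    CG0001 => [ [ 100, 200 ], [ 300, 400 ] ],
    CG0002 => [ [ 600, 750 ] ],
);
my @evidence = (
    { type => 'cdna_alignment', exons => [ [ 110, 200 ], [ 300, 390 ] ] },
    { type => 'est_alignment',  exons => [ [ 610, 700 ] ] },
);

# True if any exon of feature A intersects any exon of feature B.
sub exons_overlap {
    my ( $a, $b ) = @_;
    for my $x (@$a) {
        for my $y (@$b) {
            return 1 if $x->[0] <= $y->[1] && $y->[0] <= $x->[1];
        }
    }
    return 0;
}

# One vote per evidence feature: it votes, with its weight, for a single
# overlapping model (mutual-support and split/merge resolution omitted).
my %votes;
for my $ev (@evidence) {
    my ($target) = grep { exons_overlap( $ev->{exons}, $models{$_} ) } sort keys %models;
    $votes{$target} += $weight{ $ev->{type} } if defined $target;
}
printf "%s: %d\n", $_, $votes{$_} // 0 for sort keys %models;
```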
Berkeley Output Parser (BOP) Filtering—All BLAST jobs were run with very non-restrictive parameters in order to capture as much information as possible, and the results were then filtered. The reason for taking this approach is that the genes found in genomic sequence are not uniformly represented in the public databases. Genes that are richly covered in the public databases are often immediately adjacent to regions for which only a few distant homologies are currently available. Because it is difficult to normalize Genbank across species but we still wanted to allow these faint signals to come through, we set the number of allowed alignments very high. Sim4 was used strictly for alignments to Drosophila sequences, and for this reason we wanted to apply stringent measures before accepting an alignment. The available limits to the filters are controlled by parameters passed into the program. For sim4 these include the following (a simplified sketch of this filtering follows the parameter list):
Score is the minimum percent identity that is required to retain an HSP or alignment. The default value is 95%.
Coverage is the percentage of the total length of the sequence that is aligned to the genomic sequence. Any alignments that are less than this percentage length are eliminated.
Length is an absolute minimum length in base pairs required to accept a span, regardless of percent identity or percent length.
Join 5' and 3' is a Boolean option used for EST data. If it is true, BOP will do two things. First, it will reverse the orientation of any hits where the name of the sequence contains the phrase '3prime'. Second, it will merge all alignments where the prefixes of the name are the same. Originally this was used solely for the 5' and 3' ESTs that were available; however, when we introduced the internal sequencing reads from the cDNA project into the pipeline, this portion of code became an alternate means of effectively assembling the cDNA sequence. Using the intersection of each individual sequence alignment on the genomic sequence, a single virtual cDNA sequence was constructed and this alignment was provided for annotation.
Reverse 3' is another Boolean parameter used solely for EST data. Those sequences whose name ends in the suffix provided as the parameter argument will be reverse complemented.
Discontinuity sets a maximum gap length in the aligned EST or cDNA sequence. The primary aim of this parameter is to eliminate chimeric clones.
Remove polyA tail is a Boolean to indicate that short terminal HSPs consisting primarily of runs of a single base are to be removed.
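The sketch below applies just the Score and Coverage thresholds to a single sim4 alignment, to show how spans are dropped and whole alignments rejected. The record layout is hypothetical and the remaining parameters (Length, joining, polyA removal, and so on) are omitted; it illustrates the filtering logic, not BOP itself.

```perl
use strict;
use warnings;

# Simplified BOP-style filter for a sim4 alignment.
my %opts = ( score => 95, coverage => 80 );

sub filter_alignment {
    my ($aln) = @_;   # { query_length, spans => [ { pct_id, aligned_len }, ... ] }
    # Score: drop individual spans below the minimum percent identity.
    my @kept = grep { $_->{pct_id} >= $opts{score} } @{ $aln->{spans} };
    return undef unless @kept;
    # Coverage: the surviving spans must cover enough of the query sequence.
    my $aligned = 0;
    $aligned += $_->{aligned_len} for @kept;
    return undef if 100 * $aligned / $aln->{query_length} < $opts{coverage};
    return { %$aln, spans => \@kept };
}

my $est = {
    query_length => 500,
    spans        => [ { pct_id => 97.2, aligned_len => 260 },
                      { pct_id => 98.8, aligned_len => 190 } ],
};
my $kept = filter_alignment($est);
print $kept ? "keep\n" : "drop\n";    # keep: 450/500 = 90% coverage
```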
For BLAST these filtering options are available:
Remove low complexity is specified with two values: a repeat word size (the number of consecutive bases or amino acids) and a threshold. The alignment is compressed using Huffman encoding to a bit length, and any hit where all HSP spans have a score lower than this value is discarded. Larger word sizes should have larger thresholds.
Minimum expectation offers a simple cutoff for HSPs. Any HSP with an expectation greater than this value is deleted. The default is a generous 1.0.
Maximum depth specifies the maximal number of matches that are allowed in a given genomic region. The default is 10 overlapping alignments. This parameter applies to both BLAST and sim4. The aim is to avoid excess reporting of matches in regions that are highly represented in the aligned data set (e.g. from a non-normalized EST library).
In addition there is a standard filter for BLAST that eliminates 'shadow' matches. These are weak alignments to the same sequence in the same location on the reverse strand of the genomic sequence. BLAST matches are also re-organized, if necessary, to ensure that the HSPs are in sequential order along the length of the sequence. For example, a duplicated gene may appear in a BLAST report as a single alignment whose HSPs hit two different regions on the genomic sequence but the same single portion of the aligned sequence. In these cases the alignment is split into separate alignments to the genomic sequence, each of which has the aligned sequence present just a single time in its HSPs (a simplified sketch of this splitting follows).
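A minimal sketch of this splitting rule: with HSPs sorted by genomic position, a new alignment is started whenever the aligned-sequence coordinates step backwards, which is the signature of the same query portion matching a second genomic copy. The HSP record layout and coordinates are hypothetical.

```perl
use strict;
use warnings;

# Split one BLAST alignment whose HSPs reuse the same portion of the
# aligned sequence (e.g. a duplicated gene). HSPs are assumed sorted by
# genomic position; 's_*' are coordinates on the aligned (subject) sequence.
sub split_duplicated {
    my (@hsps) = @_;   # each: { g_start, g_end, s_start, s_end }
    my @alignments = ( [] );
    my $prev_s_end = -1;
    for my $hsp (@hsps) {
        # Subject coordinates went backwards: start a new alignment.
        push @alignments, [] if $hsp->{s_start} <= $prev_s_end;
        push @{ $alignments[-1] }, $hsp;
        $prev_s_end = $hsp->{s_end};
    }
    return @alignments;
}

my @hsps = (
    { g_start => 1000, g_end => 1400, s_start => 1,   s_end => 130 },
    { g_start => 1600, g_end => 1900, s_start => 131, s_end => 230 },
    { g_start => 8000, g_end => 8400, s_start => 1,   s_end => 130 },   # duplicate copy
    { g_start => 8600, g_end => 8900, s_start => 131, s_end => 230 },
);
my @split = split_duplicated(@hsps);
print scalar(@split), " alignments\n";    # 2
```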
In our pipeline the processing of Drosophila EST and cDNA sequences required a minimum percent identity of 95% and a percent length of 80% to retain a match. In addition, the 5' and 3' ESTs from a single cDNA clone were joined, polyA tails were removed, and the depth at a single genomic location was limited to 300 matches. We filtered BLASTX alignments using a minimum expect value of 1.0e-4, removed repetitive HSPs, removed 'shadows' and kept the depth to no more than 50 matches in the same genomic location.
BOP EST grouping—Another tactic for condensing primary results, but without removing any information, is to reconstruct all logically possible alternate transcripts from the raw EST alignments. This additional piece of code was added to BOP as well. First, the set of overlapping ESTs is collected; from these, one or more trees are built in which each node comprises the set of spans from these ESTs that share splice junctions. The possible transcripts correspond to the paths through these trees (a simplified sketch of this path enumeration follows). This analysis produced an additional set of alignments augmenting the original EST alignments.
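The path enumeration step can be sketched as follows. Building the nodes from shared splice junctions is assumed to have been done already; the node names and edges below are hypothetical, and each root-to-leaf path stands for one candidate alternate transcript.

```perl
use strict;
use warnings;

# Simplified splice graph: each node is a group of exon spans that share
# splice junctions; edges point to the possible downstream groups.
my %children = (
    exon1  => [ 'exon2a', 'exon2b' ],   # two alternative internal exons
    exon2a => ['exon3'],
    exon2b => ['exon3'],
    exon3  => [],
);

# Recursively enumerate every root-to-leaf path through the graph.
sub enumerate_paths {
    my ( $node, $prefix ) = @_;
    my @path = ( @$prefix, $node );
    my $kids = $children{$node} || [];
    return ( \@path ) unless @$kids;
    return map { enumerate_paths( $_, \@path ) } @$kids;
}

for my $p ( enumerate_paths( 'exon1', [] ) ) {
    print join( '-', @$p ), "\n";
}
# exon1-exon2a-exon3
# exon1-exon2b-exon3
```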
External pipelines
We believe that it is imperative for any annotation to utilize every possible bit of useful information. Thanks to the generosity of three external groups we were able to have results from the Celera, NCBI, and Ensembl pipelines incorporated into our database for 3 of the 5 chromosome arms (2L, 2R, and 3R) and present these to the curators in addition to the results from our internal pipeline.
Hardware
To carry out the genomic analyses we first built a Beowulf cluster. A Beowulf cluster is a collection of compute nodes that are interconnected with a network; the sole purpose of these nodes and the network is to provide compute cycles. The nodes themselves are inexpensive, off-the-shelf processors, connected using standard networking technology and running open source software. When these components are put together, a low-cost but high-performance compute system is available. Our nodes are all identical and use Linux as their base operating system, as is usual for Beowulf clusters. One consequence of building a system out of stock materials is that modifications are inevitable, which makes it mandatory to use an open source operating system and development environment in order to have access to the source code for recompilation. The job control software we use is the Portable Batch System (PBS) developed by NASA [29].
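For illustration only, a job can be handed to PBS by writing a small shell script with PBS directives and submitting it with qsub, roughly as below. The queue name, resource request, file names, and the wrapped command are placeholders; the actual pipeline constructs and tracks these submissions through the Pipeline::* modules and the database.

```perl
#!/usr/bin/env perl
# Sketch of handing one pipeline job to PBS. Queue name, resource request,
# and the wrapped command are placeholders, not the production settings.
use strict;
use warnings;

my $job_name = 'blastx_AE003820';                          # hypothetical job
my $command  = 'run_blastx.sh AE003820.fa sptr_rodent';    # hypothetical wrapper

my $script = <<"END_PBS";
#!/bin/sh
#PBS -N $job_name
#PBS -q pipeline
#PBS -l nodes=1:ppn=2
cd \$PBS_O_WORKDIR
$command
END_PBS

open my $fh, '>', "$job_name.pbs" or die $!;
print {$fh} $script;
close $fh;

system( 'qsub', "$job_name.pbs" ) == 0 or die "qsub failed\n";
```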
The compute jobs were run on a Beowulf-style Linux cluster used as a compute farm. The cluster was built by Linux Networx (http://www.linuxnetworx.com). Linux Networx provided additional hardware (ICE box) and Clusterworx software to install the system software and to control and monitor the hardware of the nodes. The cluster configuration used in this work consisted of 32 standard IA32 architecture nodes, each with dual Pentium III CPUs running at 700MHz/1GHz and 512MB memory. In addition, a single redundant Pentium III based master node was used to control the cluster nodes and distribute the compute jobs. Nodes were interconnected with standard 100BT Ethernet on an isolated subnet, with the master node as the only interface to the outside network. The private cluster 100BT network was connected by Gigabit Ethernet to the NAS-based storage volumes housing the data and user home directories. Each node had a 2GB swap partition used to cache the sequence databases from the network storage volumes. To provide a consistent environment, the nodes had the same directory mount points as all other BDGP Unix computers. The network-wide NIS maps were translated to the internal cluster NIS maps with an automated script. Local hard disks on the nodes were used as temporary storage for the pipeline jobs.
Job distribution to the cluster nodes was done with the queuing system OpenPBS, version 2.3.12 (http://www.openpbs.org). PBS was configured with several queues, with each queue having access to a dynamically resizable, overlapping fraction of nodes. Queues were configured to use one node at a time, either running one job using both CPUs (such as the multithreaded BLAST or InterPro motif analysis) or two jobs using one CPU each, for optimal utilization of the resources. Because of the architecture of the pipeline described above, individual jobs were often small, but tens of thousands of them could be submitted at any given time. However, the default PBS FIFO scheduler, while