spelling error or as drastic as being a new lexical string. If the change does not change the meaning of the term then there is no change to the GO identifier. If the meaning is changed, however, then the old term, its identifier and definition are retired (they are marked as ‘obsolete’; they never disappear from the database) and the new term gets a new identifier and a new definition. Indeed this is true even if the lexical string is identical between old and new terms; thus if we use the same words to describe a different concept then the old term is retired and the new is created with its own definition and identifier. This is the only case where, within any one of the three GO ontologies, two or more concepts may be lexically identical; all except one of them must be flagged as being obsolete. Because the nodes represent semantic concepts (as described by their definitions) it is not strictly necessary that the terms are unique, but this restriction is imposed in order to facilitate searching. This mechanism helps with maintaining and synchronizing other databases that must track changes within GO, which is, by design, being updated frequently. Keeping everything and everyone consistent is a difficult problem that we had to solve in order to permit this dynamic adaptability of GO.
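The identifier policy just described can be illustrated with a small sketch. This is not GO's own software: the class names, the `EX:` identifier prefix and the serial counter are all assumptions made for illustration. It shows only the rule stated above, namely that a change of wording keeps the identifier, whereas a change of meaning retires the old term and mints a new one.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Term:
    identifier: str
    name: str
    definition: str
    obsolete: bool = False


class Vocabulary:
    """Toy term store (hypothetical); terms are retired, never deleted."""

    def __init__(self):
        self.terms = {}          # identifier -> Term
        self._serial = count(1)  # assumed source of fresh identifiers

    def _is_live_name(self, name):
        return any(t.name == name and not t.obsolete for t in self.terms.values())

    def add(self, name, definition):
        # Only one non-obsolete term may carry a given lexical string.
        if self._is_live_name(name):
            raise ValueError(f"a live term named {name!r} already exists")
        identifier = f"EX:{next(self._serial):07d}"
        self.terms[identifier] = Term(identifier, name, definition)
        return identifier

    def rename(self, identifier, new_name):
        # Meaning unchanged (e.g. a spelling fix): the identifier is kept.
        if self._is_live_name(new_name):
            raise ValueError(f"a live term named {new_name!r} already exists")
        self.terms[identifier].name = new_name
        return identifier

    def redefine(self, identifier, new_definition):
        # Meaning changed: retire the old term, create a new one that may
        # reuse the same lexical string under a new identifier.
        old = self.terms[identifier]
        old.obsolete = True
        return self.add(old.name, new_definition)
```

A database that tracks such a vocabulary can then detect stale annotations simply by checking whether the identifiers it stores are flagged as obsolete.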
The edges between the nodes represent the relationships between them. GO uses two very different classes of semantic relationship between nodes: ‘isa’ and ‘partof’. Both the isa and partof relationships within GO should be fully transitive. That is to say, an instance of a concept is also an instance of all of the parents of that concept (to the root); a part concept that is partof a whole concept is a partof all of the parents of that concept (to the root). Both relationships are reflexive (see below).
The isa relationship is one of subsumption, a relationship that permits refinement in concepts and definitions and thus enables annotators to draw coarser or finer distinctions, depending on the present degree of knowledge. This class of relationship is known as hyponymy (and its reflexive relation hypernymy) to the authors of the lexical database WordNet (Fellbaum 1998). Thus the term DNA binding is a hyponym of the term nucleic acid binding; conversely nucleic acid binding is a hypernym of DNA binding. The latter term is more specific than the former, and hence its child. It has been argued that the isa relationship, both generally (see below) and as used by GO (P Karp, personal communication; S Schulze-Kremer, personal communication), is complex and that further information describing the nature of the relationship should be captured. Indeed this is true, because the precise connotation of the isa relationship is dependent upon each unique pairing of terms and the meanings of these terms. Thus the isa relationship is not a relationship between terms, but rather is a relationship between particular concepts. Therefore the isa relationship is not a single type of relationship; its precise meaning is dependent on the parent and child terms it connects. The relationship simply describes the parent as the more general concept and the child as the more precise concept and says nothing about how the child specifically refines the concept.
The partof relationship (meronymy and its reflexive relationship holonymy) (Cruse 1986, cited in Miller 1998) is also semantically complex as used by GO (see Winston et al 1987, Miller 1998, Priss 1998, Rogers & Rector 2000). It may mean that a child node concept ‘is a component of’ its parent concept. (The reflexive relationship [holonymy] would be ‘has a component’.) The mitochondrion ‘is a component of’ the cell; the small ribosomal subunit ‘is a component of’ the ribosome. This is the most common meaning of the partof relationship in the GO cellular_component ontology. In the biological_process ontology, however, the semantic meaning of partof can be quite different: it can mean ‘is a subprocess of’; thus the concept amino acid activation ‘is a subprocess of’ the concept protein biosynthesis. It remains a task for the future for the GO Consortium to clarify these semantic relationships while, at the same time, not making the vocabularies too cumbersome and difficult to maintain and use.
Meronymy and hyponymy cause terms to ‘become intertwined in complex ways’ (Miller 1998:38). This is because one term can be a hyponym with respect to one parent, but a meronym with respect to another. Thus the concept cytosolic small ribosomal subunit is both a meronym of the concept cytosolic ribosome and a hyponym of the concept small ribosomal subunit, since there also exists the concept mitochondrial small ribosomal subunit.
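A minimal sketch of such a graph may make the intertwining concrete. The edges below are only those named in the text; the data structure and the function names (`relate`, `parents`, `ancestors`) are assumptions for illustration and do not describe how GO is actually stored. Because both relationships are treated as transitive, the ancestors of a concept are found by following either kind of edge towards the root.

```python
from collections import defaultdict

# child -> list of (relationship, parent); a DAG, so several parents are allowed
EDGES = defaultdict(list)


def relate(child, relationship, parent):
    """Record one typed edge from a child concept to a parent concept."""
    if relationship not in ("isa", "partof"):
        raise ValueError(f"unknown relationship: {relationship}")
    EDGES[child].append((relationship, parent))


# Edges taken from the examples in the text.
relate("cytosolic small ribosomal subunit", "isa", "small ribosomal subunit")
relate("cytosolic small ribosomal subunit", "partof", "cytosolic ribosome")
relate("mitochondrial small ribosomal subunit", "isa", "small ribosomal subunit")
relate("small ribosomal subunit", "partof", "ribosome")
relate("mitochondrion", "partof", "cell")


def parents(term, relationship):
    """Direct parents of a term reached through one kind of relationship."""
    return [p for r, p in EDGES.get(term, []) if r == relationship]


def ancestors(term):
    """All concepts reachable by following isa or partof edges upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _, parent in EDGES.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Here `parents("cytosolic small ribosomal subunit", "isa")` and `parents("cytosolic small ribosomal subunit", "partof")` return different parents for the same child, which is exactly the situation Miller describes.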
The third semantic relationship represented in GO is the familiar relationship of synonymy. Each concept defined in GO (i.e. each node) has one primary term (used for identification) and may have zero or many synonyms. In the sense of the WordNet noun lexicon a term and its synonyms at each node represent a synset (Miller 1998); in GO, however, the relationship between synonyms is strong, and not as context dependent as in WordNet’s synsets. This means that in GO all members of a synset are completely interchangeable in whatever context the terms are found. That is to say, for example, that ‘lymphocyte receptor of death’ and ‘death receptor 3’ are equivalent labels for the same concept and are conceptually identical. One consequence of this strict usage is that synonyms are not inherited from parent to child concepts in GO.
The final semantic relationship in GO is a cross-reference to some other database resource, representing the relationship ‘is equivalent to’. Thus the cross-reference between the GO concept alcohol dehydrogenase and the Enzyme Commission’s number EC:1.1.1.1 is an equivalence (but not necessarily an identity; these cross-references within GO are for a practical rather than theoretical purpose). As with synonyms, database cross-references are not inherited from parent to child concept in GO.
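The strict treatment of synonyms and cross-references can be sketched as follows. The `Concept` class and the two lookup functions are hypothetical, and the choice of which label is ‘primary’ is an assumption (the text says only that the two labels are equivalent). The point illustrated is that every label of a node resolves to the same concept, and that lookups consult only a node's own synonyms and cross-references, so nothing is inherited from a parent.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # concepts are compared by identity, not by content
class Concept:
    primary: str
    synonyms: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)    # e.g. "EC:1.1.1.1"
    parents: list = field(default_factory=list)


def build_label_index(concepts):
    """Map every label (primary term or synonym) to its single concept."""
    index = {}
    for concept in concepts:
        for label in {concept.primary, *concept.synonyms}:
            index[label.lower()] = concept
    return index


def find_by_xref(concepts, xref):
    """Concepts whose OWN cross-references include xref (no inheritance)."""
    return [c for c in concepts if xref in c.xrefs]


# Assumed choice of primary label; the text treats both as equivalent.
receptor = Concept("death receptor 3", synonyms={"lymphocyte receptor of death"})
adh = Concept("alcohol dehydrogenase", xrefs={"EC:1.1.1.1"})
# Hypothetical child, present only to show that the xref is not inherited.
child = Concept("illustrative child of alcohol dehydrogenase", parents=[adh])
CONCEPTS = [receptor, adh, child]
INDEX = build_label_index(CONCEPTS)
```

With this data, both labels of the receptor resolve to the same object, while `find_by_xref(CONCEPTS, "EC:1.1.1.1")` returns the parent concept only.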
As we have expressed, we are not fully satisfied that the two major classes of relationship within GO, isa and partof, are yet defined as clearly as we would like. There is, moreover, some need for a wider agreement in this field on the classes of relationship that are required to express complex relationships between biological concepts. Others are using relationships that, at first sight, appear to be similar to these. For example, within the aMAZE database (van Helden et al 2001) the relationships ContainedCompartment and SubType appear to be similar to GO’s partof and isa, respectively. Yet ContainedCompartment and partof have, on closer inspection, different meanings (GO’s partof seems to be a much broader concept than aMAZE’s ContainedCompartment).
The three domains now considered by the GO Consortium, molecular_function, biological_process and cellular_component, are orthogonal. They can be applied independently of each other to describe separable characteristics. A curator can describe where some protein is found without knowing what process it is involved in. Likewise, it may be known that a protein is involved in a particular process without knowing its function. There are no edges between the domains, although we realize that there are relationships between them. This constraint was made because of problems in defining the semantic meanings of edges between nodes in different ontologies (see Rogers & Rector 2000, for a discussion of the problems of transitivity met within an ontology that includes different domains of knowledge). This structure is, however, to a degree, artificial. Thus all (or, certainly most) gene products annotated with the GO function term transcription factor will be involved in the process transcription, DNA-dependent and the majority will have the cellular location nucleus. This really becomes important not so much within GO itself, but at the level of the use of GO for annotation. For example, if a curator were annotating genes in FlyBase, the genetic and genomic database for Drosophila (FlyBase 2002), then it would be an obvious convenience for a gene product annotated with the function term transcription factor to inherit both the process transcription, DNA-dependent and the location nucleus. There are plans to build a tool to do this, but one that allows a curator to say to the system ‘in this case do not inherit’ where to do so would be misleading or wrong.
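Such a tool does not yet exist, so the following is only a sketch of the behaviour described: a function term proposes a process and a location, and the curator may switch off either proposal for a particular gene product. The table `IMPLIED`, the aspect keys and the function `propose_annotations` are all hypothetical.

```python
# Hypothetical table of cross-domain implications, keyed by function term.
IMPLIED = {
    "transcription factor": {
        "process": "transcription, DNA-dependent",
        "component": "nucleus",
    },
}


def propose_annotations(function_term, do_not_inherit=()):
    """Return the process/component terms suggested by a function term.

    Aspects listed in do_not_inherit are the curator's way of saying
    'in this case do not inherit'.
    """
    implied = IMPLIED.get(function_term, {})
    return {
        aspect: term
        for aspect, term in implied.items()
        if aspect not in do_not_inherit
    }
```

For a transcription factor known not to be nuclear, a curator would call `propose_annotations("transcription factor", do_not_inherit={"component"})` and receive only the process term. Keeping the implications in a table outside the ontologies preserves the rule that there are no edges between the three domains.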
the basis for the annotation is then summarized, using a small controlled list of phrases (www.geneontology.org/GO.evidence.html); perhaps ‘inferred from direct assay’ if annotating on the evidence of experimental data in a publication, or ‘inferred from sequence comparison with database:object’ (where database:object could be, for example, SWISS-PROT:P12345, where P12345 is a sequence accession in the SWISS-PROT database of protein sequences), if the inference is made from a BLAST or InterProScan computation that has been evaluated by a curator.
The incorrect inference of a protein’s or predicted protein’s function from sequence comparison is well known to be a major problem and one that has often contaminated both databases and the literature (Kyrpides & Ouzounis 1998, for one example among many). The syntax of GO annotation in databases allows curators to annotate a protein as NOT having a particular function despite impressive BLAST data. For example, in the genome of Drosophila melanogaster there are at least 480 proteins or predicted proteins to which any casual or routine curation of BLASTP output would assign the function peptidase (or one of its child concepts) yet, on closer inspection, at least 14 of these lack residues required for the catalytic function of peptidases (D Coates, personal communication). In FlyBase these are curated with the ‘function’ ‘NOT peptidase’. What is needed is a comprehensive set of computational rules to allow curators, who cannot be experts in every protein family, to detect automatically the signatures of these cases, cases where the transitive inference would be incorrect (Kretschmann et al 2001). It is also conceivable that triggers to correct dependent annotations could be constructed, because GO annotations track the identifiers of the sequence upon which annotation is based.
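An annotation carrying an evidence phrase and an optional NOT qualifier might be modelled as below. The record layout, the gene product names and the placeholder phrase on the NOT record are assumptions for illustration; only the two quoted evidence phrases and the example accession come from the text. The rule that a NOT annotation outweighs a positive one based on sequence similarity is likewise a design assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    term: str
    evidence: str           # one phrase from the controlled evidence list
    negated: bool = False   # True encodes 'NOT <term>'
    with_object: str = ""   # database:object, e.g. 'SWISS-PROT:P12345'


def has_function(annotations, gene_product, term):
    """True if the product is annotated to term and no NOT annotation exists."""
    relevant = [
        a for a in annotations
        if a.gene_product == gene_product and a.term == term
    ]
    if any(a.negated for a in relevant):
        return False  # assumed rule: an explicit NOT overrides similarity hits
    return bool(relevant)


ANNOTATIONS = [
    # Hypothetical product with a convincing hit but missing catalytic residues.
    Annotation("protein-A", "peptidase",
               "inferred from sequence comparison with database:object",
               with_object="SWISS-PROT:P12345"),
    Annotation("protein-A", "peptidase",
               "curator judgement (placeholder phrase)", negated=True),
    # Hypothetical product with experimental support.
    Annotation("protein-B", "peptidase", "inferred from direct assay"),
]
```

Storing the NOT as data, rather than simply omitting the annotation, stops a later automatic pass from silently re-adding the function on the strength of the same BLAST hit.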
Curatorial annotation will be at a quality proportional both to the extent of the available evidence for annotation and the human resources available for annotation. Potentially, its quality is high but at the expense of human effort. For this reason several ‘automatic’ methods for the annotation of gene products are being developed. These are especially valuable for a first-pass annotation of a large number of gene products, those, for example, from a complete genome sequencing project. One of the first to be used was M Yandell’s program LOVEATFIRSTSIGHT, developed for the annotation of the gene products predicted from the complete genome of Drosophila melanogaster (Adams et al 2000). Here, the sequences were matched (by BLAST) to a set of sequences from other organisms that had already been curated using GO.
Three other methods, DIAN (Pouliot et al 2001), PANTHER (Kerlavage et al 2002) and GO Editor (Xie et al 2002), also rely on comprehensive databases of sequences or sequence clusters that have been annotated with GO terms by curation, albeit with a large element of automation in the early stages of the process. PANTHER is a method in which proteins are clustered into ‘phylogenetic’ families and subfamilies, which are then annotated with GO terms by expert curators. New proteins can then be matched to a cluster (in fact to a Hidden Markov Model describing the conserved sequence patterns of that cluster) and transitively annotated with appropriate GO terms. In a recent experiment PANTHER performed well in comparison with the curated set of GO annotations of Drosophila genes in FlyBase (Mi et al 2002). DIAN matches proteins to a curated set using two algorithms: one is vocabulary based and is only suitable for sequences that already have some attached annotation; the other is domain based, using Pfam Hidden Markov Models of protein domains.
Even simpler methods have also been used. For example, much of the first-pass GO annotation of mouse proteins was done by parsing the KEYWORDs attached to SWISS-PROT records of mouse proteins, using a file that semantically mapped these KEYWORDs to GO concepts (see www.geneontology.org/external2go/spkw2go) (Hill et al 2001).
Automatic annotations have the advantage of speed, essential if large protein data sets are to be analysed within a short time. Their disadvantage is that the accuracy of annotation may not be high and the risk of errors by incorrect transitive inference is great. For this reason, all annotations made by such methods are tagged in GO gene-association files as being ‘inferred by electronic annotation’. Ideally, all such annotations are reviewed by curators and subsequently replaced by annotations of higher confidence.
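The keyword-based first pass and its evidence tag can be sketched in a few lines. The mapping table below is illustrative only and does not reproduce the contents or format of the spkw2go file; the function names and the tuple layout are also assumptions. Every annotation produced this way carries the ‘inferred by electronic annotation’ tag until a curator replaces it.

```python
ELECTRONIC = "inferred by electronic annotation"

# Illustrative keyword -> GO concept table (not the real spkw2go mapping).
KEYWORD_TO_CONCEPT = {
    "DNA-binding": "DNA binding",
    "Nuclear protein": "nucleus",
}


def first_pass(records):
    """Annotate proteins from their keywords.

    records maps a protein accession to its list of keywords; keywords
    without a mapping are skipped. Returns (accession, concept, evidence).
    """
    annotations = []
    for accession, keywords in records.items():
        for keyword in keywords:
            concept = KEYWORD_TO_CONCEPT.get(keyword)
            if concept is not None:
                annotations.append((accession, concept, ELECTRONIC))
    return annotations


def replace_reviewed(annotations, accession, concept, curated_evidence):
    """Swap the evidence of one annotation after curator review."""
    return [
        (acc, con, curated_evidence if (acc, con) == (accession, concept) else ev)
        for acc, con, ev in annotations
    ]
```

Because the evidence tag travels with each annotation, users can filter out the electronic ones, and curators can find those still awaiting review.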
The problems of complexity and redundancy
There are in the biological_process ontology many words or strings of words that have no business being there. The major examples of offending concepts are chemical names and anatomical parts. There are two reasons why this is problematic, one practical and the other of more theoretical importance. The practical problem is one of maintainability. The number of chemical compounds that are metabolized by living organisms is vast. Each one deserves its own unique set of GO terms: carbohydrate metabolism (and its children carbohydrate biosynthesis, carbohydrate catabolism), carbohydrate transport and so on. In the ideal world there would exist a public domain ontology for natural (and xenobiotic) compounds:
and so on. Then we could make the cross-product between this little DAG (a DAG because a carbohydrate could also be an acid or an alcohol, for example) and this small biological_process DAG:
simple carbohydrate metabolism
simple carbohydrate biosynthesis
simple carbohydrate catabolism
Unfortunately, as no suitable ontology of compounds yet exists in the public domain, there is no alternative to the present method of maintaining this part of the biological_process ontology by hand.
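Were such a compound ontology available, the cross-product could be generated mechanically. The sketch below assumes two tiny fragments, each given as a mapping from a term to its parents; the placement of ‘simple carbohydrate’ beneath ‘carbohydrate’ is an assumption made for illustration, as is the function name `cross_product`. Each generated term inherits parents along both axes, so the result is again a DAG.

```python
from itertools import product

# Hypothetical fragments: term -> list of parent terms.
COMPOUNDS = {
    "carbohydrate": [],
    "simple carbohydrate": ["carbohydrate"],
}
PROCESSES = {
    "metabolism": [],
    "biosynthesis": ["metabolism"],
    "catabolism": ["metabolism"],
}


def cross_product(compounds, processes):
    """Combine every compound with every process.

    Returns generated term -> parents, where parents come from moving one
    step up either the compound axis or the process axis.
    """
    terms = {}
    for compound, process in product(compounds, processes):
        name = f"{compound} {process}"
        parents = [f"{p} {process}" for p in compounds[compound]]
        parents += [f"{compound} {p}" for p in processes[process]]
        terms[name] = parents
    return terms


TERMS = cross_product(COMPOUNDS, PROCESSES)
```

In this sketch ‘simple carbohydrate biosynthesis’ acquires two parents, ‘carbohydrate biosynthesis’ and ‘simple carbohydrate metabolism’, without either having been entered by hand.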
A very similar situation exists for anatomical terms, in effect used as anatomical qualifiers to terms in the biological_process ontology. An example is eye morphogenesis, a term that can be broken up into an anatomical component (eye) and a process component (morphogenesis). This example illustrates a further problem: we clearly need to be able to distinguish the morphogenesis of a fly eye from that of a murine eye, or a Xenopus eye, or an acanthocephalan eye (were they to have eyes). Such is not the way to maintain an ontology. Far better would be to have species- (or clade-) specific anatomical ontologies and then to generate the required terms for biological_process as cross-products. This is indeed the way in which GO will proceed (Hill et al 2002) and anatomical ontologies for Drosophila and Arabidopsis are already available from the GO Consortium (ftp://ftp.geneontology.org/pub/go/anatomy), with those for mouse and C. elegans in preparation (see Bard & Winter 2001, for a discussion). The other advantage of this approach is that these anatomical ontologies can then be used in other contexts, for example for the description of expression patterns or mutant phenotypes (Hamsey 1997).
gobo: global open biological ontologies
Although the three controlled vocabularies built by the GO Consortium are far from complete they are already showing their value (e.g. Venter et al 2001, Jenssen et al 2001, Laegreid et al 2002, Pouliot et al 2001, Raychaudhuri et al 2002). Yet, as discussed in the preceding paragraphs, the present method of building and maintaining some of these vocabularies cannot be sustained. Both for its own use and in the belief that it will be useful for the community at large, the GO Consortium is sponsoring gobo (global open biological ontologies) as an umbrella for structured controlled vocabularies for the biological domain. A small ontology of such ontologies might look like this:
GO is very much a work in progress. Moreover, it is a community rather than individual effort. As such, it tries to be responsive to feedback from its users so that it can improve its utility to both biologists and bioinformaticists, a distinction, we observe, that is growing harder to make every day.
Acknowledgements
The Gene Ontology Consortium is supported by a grant to the GO Consortium from the National Institutes of Health (HG02273), a grant to FlyBase from the Medical Research Council, London (G9827766) and by donations from AstraZeneca Inc and Incyte Genomics. The work described in this review is that of the Gene Ontology Consortium and not of the authors, who are just the raconteurs; they thank all of their colleagues for their great support. They also thank Robert Stevens, a user-friendly artificial intelligencer, for his comments and for providing references that would otherwise have evaded them; MA thanks Donald Michie for introducing him to WordNet, albeit over a rather grotty Chinese meal.
References
AmiGO 2001 url: www.godatabase.org/cgi-bin/go.cgi
Ashburner M, Ball CA, Blake JA et al 2000 Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
Baker PG, Goble CA, Bechhofer S, Paton NW, Stevens R, Brass A 1999 An ontology for bioinformatics applications. Bioinformatics 15:510–520
Bard J, Winter R 2001 Ontologies of developmental anatomy: their current and future roles. Brief Bioinform 2:289–299
Commission of Plant Gene Nomenclature 1994 Nomenclature of sequenced plant genes. Plant Molec Biol Rep 12:S1–S109
Cruse DA 1986 Lexical semantics. New York, Cambridge University Press
DAG Edit 2001 url: sourceforge.net/projects/geneontology/
DiBona C, Ockman S, Stone M (eds) 1999 Open sources: voices from the Open Source revolution. O’Reilly, Sebastopol, CA
Dure L III 1991 On naming plant genes. Plant Molec Biol Rep 9:220–228
Dwight SS, Harris MA, Dolinski K et al 2002 Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 30:69–72
Fellbaum C (ed) 1998 WordNet. An electronic lexical database. MIT Press, Cambridge, MA
Fensel D, van Harmelen F, Horrocks I, McGuinness D, Patel-Schneider PF 2001 OIL: An ontology infrastructure for the semantic web. IEEE (Inst Electr Electron Eng) Intelligent Systems 16:38–45 [url: www.daml.org]
Fleischmann RD, Adams MD, White O et al 1995 Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512
The FlyBase Consortium 2002 The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 30:106–108
The Gene Ontology Consortium 2001 Creating the gene ontology resource: design and implementation. Genome Res 11:1425–1433
GRAMENE 2002 Controlled ontology and vocabulary for plants. url: www.gramene.org/plant_ontology
Hamsey M 1997 A review of phenotypes of Saccharomyces cerevisiae. Yeast 1:1099–1133
Heath P 1974 (ed) The philosopher’s Alice. Carroll L, Alice’s adventures in wonderland & through a looking glass. Academy Editions, London
Hill DP, Davis AP, Richardson JE et al 2001 Program description: strategies for biological annotation of mammalian systems: implementing gene ontologies in mouse genome informatics. Genomics 74:121–128
Hill DP, Richardson JE, Blake JA, Ringwald M 2002 Extension and integration of the Gene Ontology (GO): combining GO vocabularies with external vocabularies. Genome Res, in press
Karp PD 2000 An ontology for biological function based on molecular interactions. Bioinformatics 16:269–285
Karp PD, Riley M, Saier M et al 2002a The EcoCyc database. Nucleic Acids Res 30:56–58
Karp PD, Riley M, Parley SM, Pellegrini-Toole A 2002b The MetaCyc database. Nucleic Acids Res 30:59–61
Kerlavage A, Bonazzi V, di Tommaso M et al 2002 The Celera Discovery system. Nucleic Acids Res 30:129–136
Kretschmann E, Fleischmann W, Apweiler R 2001 Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 17:920–926
Kyrpides NC, Ouzounis CA 1998 Whole-genome sequence annotation ‘going wrong with confidence’. Molec Microbiol 32:886–887
Laegreid A, Hvidsten TR, Midelfart H, Komorowski J, Sandvik AK 2002 Supervised learning used to predict biological functions of 196 human genes. Submitted
Leser U 1998 Semantic mapping for database integration making use of ontologies. url: cis.cs.tu-berlin.de/~leser/pub_n_pres/ws_ontology_final98.ps.gz
MGED 2001 Microarray Gene Expression Database Group. url: www.mged.org
Mewes HW, Heumann K, Kaps A et al 1999 MIPS: a database for genomes and protein sequences. Nucleic Acids Res 27:44–48
Mi H, Vandergriff J, Campbell M et al 2002 Assessment of genome-wide protein function classification for Drosophila melanogaster. Submitted
Miller GA 1998 Nouns in WordNet. In: Fellbaum C (ed) WordNet. An electronic lexical database. MIT Press, Cambridge, MA, p 23–46
OpenSource 2001 url: www.opensource.org/
Overbeek R, Larsen N, Smith W, Maltsev N, Selkov E 1997 Representation of function: the next step. Gene 191:GC1–GC9
Overbeek R, Larsen N, Pusch GD et al 2000 WIT: integrated system for high-level throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 28:123–125
Pouliot Y, Gao J, Su QJ, Liu GG, Ling YB 2001 DIAN: a novel algorithm for genome ontological classification. Genome Res 11:1766–1779
Priss UE 1998 The formalization of WordNet by methods of relational concept analysis. In: Fellbaum C (ed) WordNet. An electronic lexical database. MIT Press, Cambridge, MA,
Riley M 1988 Systems for categorizing functions of gene products. Curr Opin Struct Biol 8:388–392
Riley M 1993 Functions of the gene products of Escherichia coli. Microbiol Rev 57:862–952
Rison SCG, Hodgman TC, Thornton JM 2000 Comparison of functional annotation schemes for genomes. Funct Integr Genomics 1:56–69
Rogers JE, Rector AL 2000 GALEN’s model of parts and wholes: Experience and comparisons. Annual Fall Symposium of American Medical Informatics Association, Los Angeles. Hanley & Belfus Inc, Philadelphia, PA, p 714–718
Schulze-Kremer S 1997 Integrating and exploiting large-scale, heterogeneous and autonomous databases with an ontology for molecular biology. In: Hofestaedt R, Lim H (eds) Molecular bioinformatics. The human genome project. Shaker Verlag, Aachen, p 43–46
Schulze-Kremer S 1998 Ontologies for molecular biology. Proc Pacific Symp Biocomput 3:695–706
Serres MH, Riley M 2000 Multifun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb Comp Genomics 5:205–222
Serres MH, Gopal S, Nahum LA, Liang P, Gaasterland T, Riley M 2001 A functional update of the Escherichia coli K-12 genome. Genome Biol 2:RESEARCH 0035
Sklyar N 2001 Survey of existing Bio-ontologies. url: http://dol.uni-leipzig.de/pub/2001-30/en
Stevens R, Baker P, Bechhofer S et al 2000 TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics 16:184–183
Takai-Igarashi T, Nadaoka Y, Kaminuma T 1998 A database for cell signaling networks. J Comp Biol 5:747–754
Van Helden J, Naim A, Lemer C, Mancuso R, Eldridge M, Wodak SJ 2001 From molecular activities and processes to biological function. Brief Bioinform 2:81–93
Venter JC, Adams MD, Meyers EW et al 2001 The sequence of the human genome. Science 291:1304–1351
Wheeler DL, Chappey C, Lash AE et al 2000 Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 28:10–14
Winston ME, Chaffin R, Herrman D 1987 A taxonomy of part–whole relations. Cognitive Sci 11:417–444
Xie H, Wasserman A, Levine Z et al 2002 Automatic large scale protein annotation through Gene Ontology. Genome Res 12:785–794
Zdobnov EM, Apweiler R 2001 InterProScan, an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17:847–848
DISCUSSION
Subramaniam: Sometimes cellular localization drives the molecular function. The same protein will have a particular function in certain places and then when it is localized somewhere else it will have a different function.
Ashburner: I thought about doing this at the level of annotation, in which you could have a conditionality attached to the annotation. I have been lying during my talk, because I have been talking about annotating gene products. For various reasons, partly historical and partly because of resources, none of the single model organism databases we are collaborating with (at least in their public versions) really instantiate gene products in the proper way. That is, if you had a phosphorylated and a non-phosphorylated form of a particular protein, they should have different identifiers and different names. This is what we should be annotating. What in fact we are annotating is genes as surrogates of gene products. I am very aware of this problem. With FlyBase we do have different identifiers for isoforms of proteins, and in theory for different post-translational modifications, but they are not yet readily usable. The difficult ones are proteins such as NF-κB, which is out there in the cytoplasm when it is bound to IκB, but then the Toll pathway comes and translocates it into the nucleus. I can see theoretically how one can express this, but this is a problem too far at the moment.
Subramaniam: MySQL is not really an object-relational database. If you try to get your ontology into an object-relational database (we have tried to do this) the cardinality doesn’t come out right. What happens is that the definitions get a little bit mixed up between different tables. This is one of the problems in trying to deal with Oracle.
Ashburner: That is worth knowing; we can talk to the database people about that. The choice of MySQL was pragmatic.
Subramaniam: Also, MySQL doesn’t scale.
Ashburner: These are pretty small databases, with a few thousand lines per table and relatively small numbers of tables.
McCulloch: What degree of interpretation do you allow, for example, in compartmentation of the protein? If you go to the original paper it won’t necessarily say that the protein is membrane bound or localized to caveolae: it will probably say that it is found in a particulate fraction, or the detergent-insoluble fraction.
Ashburner: We do have a facility for allowing curators to add biochemical fraction information, because biochemists tend not to understand biology that well. I want to emphasize that GO is very pragmatic, although there are places where we are going to have to draw a line.
Noble: In relation to the question of linking modelling and databases together, is it worth asking the question of what the modellers would ideally like to see in a database? Does the GO Consortium talk to the modellers?
Ashburner: We have a bit. There are some people who are beginning to do this, particularly Fritz Roth at Harvard Medical School. We have a mechanism by which we can talk to the modellers because we have open days. There are other systems out there such as EcoCyc (http://ecocyc.org/) that are designed with modelling in mind, for making inference. GO isn’t; it’s designed for description and querying. I think it will come. GO is being used in ways that we had no concept of initially. For instance, it is being developed for literature mining (see Raychaudhuri et al 2002). This could be very interesting.
Kanehisa: When there is the same GO identifier in two organisms, how reliable is it in terms of the functional orthologue?
Ashburner: That depends very much on how it is done. It is turning out that when a new organism joins the group, what is normally done is a quick-pass electronic annotation using the annotation in SWISS-PROT. This is done completely electronically, and gives a quick and dirty annotation. Then if they have the resources the groups start going through this and cleaning it up, hopefully coming up with direct experimental evidence for each annotation. For example, after Celera we had about 10 000 electronic annotations in FlyBase, but these have all been replaced by literature curations or annotations derived from a much more reliable inspection of sequence similarity.
Subramaniam: Going back to the issue of ontologies and databases, it is important to ask the question about which levels of ontologies can translate into modelling. If you think of modelling in bioinformatics and computational biology, the flow of information in living systems is going from genes to gene products to functional pathways and then physiology. What we have heard from Michael Ashburner is concerned with the gene and gene function level. The next step is what we are really referring to, which is not merely finding an ontology for the gene function, but going beyond this to integrated function, or systems level function of the cell. There is currently no ontology available at this level. This is one of the issues we are trying to address in the cell signalling project; it is critical for the next stage of modelling work. This has to be driven at this point: whether or not you make the reverse ontology, at least you should provide format translators such as XML.
Ashburner: GO, of course, is sent around the world in XML.
Noble: How do we move forward on this? A comment you made surprised me: I think you said that it is forbidden to modify GO.
Ashburner: No, it is forbidden to modify it and then sell it as if it were GO. If you took it, modified it and called it ‘Denis Noble’s ontology’, we would be at least mildly pissed off.
Subramaniam: We could call it ‘extended GO’, so that it becomes ‘EGO’!
Ashburner: The Manchester people (C Goble, R Stevens and colleagues) have something called GONG: GO the Next Generation!
Boissel: Regarding the issue of databases and modelling, we should first be clear about the functions of the database regarding the purpose of modelling. According to the decision we have made at this stage of defining the purpose of the database, there is a series of specifications. For example, a very general specification such as entities, localization of entities, relationship between entities, and where the information comes from (including the variability of the evidence). There are at least four different chapters within the specification. But first we should be clear why we are constructing a database regarding modelling.
Subramaniam: Let’s take specific examples. If you talk about pathway ontology, what are you getting from a pathway database? The network topology. And sometimes kinetic parameters, too. All this will be encompassed in the database and can be translated into modelling. Having said this, we should be careful about discriminating between two things in the database. First, the querying of the database to get information that in turn can be used for modelling. The other is going straight from a database into a computational algorithm, and this is precisely what needs to be done. This is why earlier I said that we currently can’t do this in a distributed computing environment. The point really is that we need to be able to compute, instead of having to write all our programming in SQL, which we won’t be able to do if we have a complex program. We need to design a database so that it will enable us to communicate directly between the database and our computational algorithm. Beyond the pathway level, when we want to model the whole system, I don’t know whether anyone knows how to do this from a database point of view yet.
Berridge: Say we were interested in trying to figure out the pathways in the heart, and I put ‘heart’ into your database, what would I get out?
Ashburner: At the moment, whatever the mouse genome informatics group have put in.
Berridge: Would I get a list of all the proteins that are expressed in the heart?
Ashburner: No, but you should get a list of all the genes whose products have been inferred to be involved in heart development, for example. The physiological processes are not yet as well covered in GO as we wish, but we are working on this actively.
Noble: So even if it is expressed in the liver, but it affects the heart, it turns up.
Ashburner: Yes.
Berridge: What questions will people be asking with your database?
Ashburner: If you want to find all the genes in Drosophila and mouse involved in a signal transduction pathway, for example. It can’t predict them: what you get out is what has been put in. The trick is to add the entries in a rigorous manner.
Berridge: So if I put in Ras I would get out the MAP kinase pathway in these different organisms.
Ashburner: Yes.
Levin: Looking higher than the level of the pathway, you indicated that there were no good disease-based databases in the public domain. Can you give a sense of why this is?
Ashburner: I have no idea. They exist commercially: things like Snomed and ICD-10. Some are now being developed. I suspect this is because so much of the human anatomy and physiology work has been so driven by the art of medicine, rather than the science of biomedicine. Doctors are quite avaricious as a whole, particularly in the USA, and many of these databases are used to ensure correct billing!
Reference
Raychaudhuri S, Chang JT, Sutphin PD, Altman RB 2002 Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res 12:203–214