
The Cell Cycle Ontology: an application ontology for the representation and integrated analysis of the cell cycle process

Erick Antezana*†, Mikel Egaña‡, Ward Blondé§, Aitzol Illarramendi¶, Iñaki Bilbao¶, Bernard De Baets§, Robert Stevens‡, Vladimir Mironov¥ and Martin Kuiper¥

Addresses: *Department of Plant Systems Biology, VIB, Technologiepark 927, B-9052 Gent, Belgium. †Department of Molecular Genetics, Ghent University, Technologiepark 927, B-9052 Gent, Belgium. ‡School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK. §Department of Applied Mathematics, Biometrics and Computer Science, Ghent University, Coupure links 653, B-9000 Gent, Belgium. ¶Noray Bioinformatics, SL Parque Tecnológico 801 A, 2°, 48160 Derio (Bizkaia), Spain. ¥Department of Biology, Norwegian University of Science and Technology, Høgskoleringen 5, NO-7491 Trondheim, Norway.

Correspondence: Martin Kuiper. Email: martin.kuiper@bio.ntnu.no

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cell Cycle Ontology

A software resource for the analysis of cell cycle related molecular networks.

Abstract

The Cell Cycle Ontology (http://www.CellCycleOntology.org) is an application ontology that automatically captures and integrates detailed knowledge on the cell cycle process. Cell Cycle Ontology is enabled by semantic web technologies, and is accessible via the web for browsing, visualizing, advanced querying, and computational reasoning. Cell Cycle Ontology facilitates a detailed analysis of cell cycle-related molecular network components. Through querying and automated reasoning, it may provide new hypotheses to help steer a systems biology approach to biological network building.

Rationale

Molecular biology has spent the past two decades cataloguing genes, expression levels, proteins, molecular interactions and more. The combination of all these catalogues should enable a biologist to start building a comprehensive picture of a biological system rather than only looking at the individual components. Assembling representations of these components into a network that describes a biological system constitutes the first step in allowing a biologist to develop an understanding of the behavior of that system. If adequate kinetic and other parameters can be obtained or estimated, such models can be used for network simulations in a mathematical framework, making them particularly useful to study the emergent properties of such a system [1-5]. These models provide the basis for much of systems biology that is built on integrative data analysis and mathematical modeling [6-9].

In systems biology, dynamic simulations with a model of a biological process serve as a means to validate the model's architecture and parameters, and to provide hypotheses for new experiments.

Complementary to such model-dependent hypothesis generation, the field of computational reasoning promises to provide a powerful additional source of new hypotheses concerning biological network components. The integration of biological knowledge from various sources and the alignment of their representations into one common representation are recognized as critical steps toward hypothesis building [10,11]. Such an integrated information resource is essential for exploration and exploitation by both humans and computers, as in the case of computers via automated reasoning [12].

Published: 29 May 2009

Genome Biology 2009, 10:R58 (doi:10.1186/gb-2009-10-5-r58)

Received: 20 December 2008. Revised: 17 April 2009. Accepted: 29 May 2009.

The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2009/10/5/R58

Bio-ontologies

While it is easy to compare nucleic acid or polypeptide sequences from different bioinformatics resources, the biological knowledge contained in these resources is very difficult to compare as it is represented in a wide variety of lexical forms [13-15], and there are no tools that facilitate an easy comparison and integration of knowledge in this form. This is where ontologies can provide assistance.

Ontologies represent knowledge about a specific scientific domain, and support a consistent and unambiguous representation of entities within that domain. This knowledge can be integrated into a single model that holds these domain entities and their term labels, as well as their connecting relationships [16]. A well-known example of such an ontology is the Gene Ontology (GO) [17]. An ontology therefore links term labels to their interpretations, that is, specifications of their meanings, defined as a set of properties.

Ontologies not only provide the foundation for knowledge integration, but also the basis for advanced computational reasoning to validate hypotheses and make implicit knowledge explicit [18,19]. Integrated knowledge founded on well-defined semantics provides a framework to enable computers to conceptually handle knowledge in a manner comparable to the handling of numerical data: it allows a computer to process expressed facts, look for patterns and make inferences, thereby extending human thinking about complex information. On a more technical level, computational reasoning services can also be used to check the consistency of such integrated knowledge, to re-engineer the design of parts of the entire ontology or to design entirely new extensions that comply with current knowledge [20].

Generally speaking, ontologies that model domain knowledge are developed through an iterative process of refinement, an approach common in the field of software engineering [21]. Ontology development has been pursued for many years, and while several methodologies have been proposed [22-29], none has been widely accepted. The Open Biomedical Ontology (OBO) project [30], however, aims to coordinate the development of bio-ontologies (for example, the GO and the Relation Ontology (RO) [31], among many others). The OBO foundry [32] has provided a set of principles to guide the development of ontologies. These ontologies have gained wide acceptance within the biomedical community [33] as a means for data annotation and integration and as a reference.

Biological information is known to be difficult to integrate and analyze [34]. One of the reasons for this is that biologists are inclined to invent new names and expressions for things that others have already named, such as proteins and their functions. This has led to high incidences of synonymy, homonymy and polysemy that plague biomedicine. Furthermore, biological knowledge is often not crisp, as evidenced by the widespread use of quantifiers such as 'often', 'usually' and 'sometimes'. Finally, the sheer volume and complexity of biological data and the diversity of representational formats provide profound challenges for efficient biomedical knowledge management. Altogether, this calls for a concerted effort of experts from the biomedical and computational sciences to organize and facilitate the integration and exploitation of rapidly accumulating biological information.

Application ontologies in the life sciences and their role in systems biology

Application ontologies define relevant concepts for a particular application or use [35]. They can be built by combining domain ontologies (or parts of domain ontologies), or by using one as 'a reference', and they can be extended according to the needs of a particular application. Application ontologies are intended to be directly embedded into knowledge bases on which different applications can be run, such as data mining and hypothesis generation. Application ontologies can play an important role in exploiting the formalization of domain knowledge, thereby facilitating the integration of different types of information (for example, knowledge about biological processes and subcellular localizations, both parts of GO). Figure 1 shows a sample piece of knowledge composed of such integrated information. This schematic representation gives a minimal but context-linked notion of a specific protein and its environment of functional characteristics (for example, where it is located, in which processes it participates, and by which gene it is encoded).

A successful application ontology may form the core of an efficient and effective management system. Such a system combines data extraction methods, data format conversions and a variety of information sources. To illustrate the potential use of application ontologies for the life sciences, we have designed and built a knowledge management system that facilitates the analysis of cell cycle control.

Why focus on the cell cycle process?

The eukaryotic cell cycle, or cell division cycle, is the series of events that happen between two consecutive cell divisions that underlie cell multiplication. The molecular events that control the cell cycle are ordered and directional; that is, each process occurs in a sequential fashion and it is impossible to reverse the cycle. The cell cycle control network is complex and is thought to include hundreds of proteins [36,37].

Although the basic principles of cell cycle control are now well documented [38], we are far from having a complete understanding of all the intricacies of the underlying system. A deeper knowledge of the cell cycle control system is essential to the understanding of the growth and development of eukaryotic organisms. In turn, this is necessary in order to be able to combat numerous diseases in which cell cycle aberrations are involved, such as cancer.

Part of this knowledge has already been incorporated into dynamic system models that are being exploited to test, refine and generate hypotheses [39]. This holistic and integrative approach in biological research, also called systems biology, is gaining momentum [40,41] and is leading to novel insights into cell machinery [37,42,43]. To further augment cell cycle research with computational approaches, we have built the Cell Cycle Ontology (CCO), which integrates a wide variety of knowledge sources pertinent to the cell cycle.

Results and discussion

The Cell Cycle Ontology application ontology

CCO is built to provide laboratory biologists with a one-stop shop for cell cycle knowledge and with access to an integrated knowledge system that can be used to explore the potential power of automated reasoning. CCO comprises information from a number of resources that contain relevant information about the cell cycle process, such as GO [44], RO, the IntAct database [45], the National Center for Biotechnology Information (NCBI) taxonomy [46], the UniProt knowledge base [47], and putative orthology relationships derived with the OrthoMCL clustering algorithm [48,49]. All the information is integrated into a single framework that is supported by the ontologies. The integrated knowledge system supports queries that are not feasible with the original, individual and separate information sources.

Bio-ontologies and their presentations have been made accessible through existing software tools (such as OBO-Edit [50] and Protégé [51]) or web-based tools such as BioPortal [52], which can be used to create new terms and relationships and to explore and analyze these ontologies. The most frequently used biomedical ontologies are provided in the Open Biomedical Ontology format (OBOF) [53], while some are also natively available in the Web Ontology Language (OWL) [54] (though the OBOF can be transformed into an OWL representation [55-58]). OWL provides a means of creating semantically rich ontologies with ample possibilities for querying and computational reasoning. Therefore, we converted the wealth of information available in the OBOF, and the highly curated information from public data sources, into the more expressive OWL representation in order to exploit richer forms of computational reasoning.

Figure 1

Local neighborhood of the SWI4_YEAST protein. Example of the local neighborhood of the protein SWI4_YEAST: some of the types of relationships used within CCO depict how a given protein (SWI4_YEAST) is connected to the organism it belongs to (S. cerevisiae), its coding gene (SWI4_yeast), biological processes (G1/S transition of mitotic cell cycle), cellular localization (nucleus), interactions (physical interactions), protein transformations (post-translational modifications), and its orthology group.

CCO is extensible, and the CCO integration architecture can accommodate additional ontologies if necessary. In addition, a broad range of export formats from CCO (in particular, OWL and Resource Description Framework (RDF)) enables virtual integration with external sources (controlled vocabularies translated into RDF such as Medical Subject Headings (MeSH) [59]), allowing for queries that address these disparate resources through Semantic Web technologies [60,61].

Knowledge representation in the Cell Cycle Ontology

CCO is a resource that can directly support systems biology. Systems biology is essentially a model-driven approach to biological research, in which a model of a biological process serves to integrate all the available information (network components and their interactions). A model simulation allows for an understanding of network behavior, including changes to the entities, describing these changes in terms of what these entities are, where they are located and when these statements hold. To this end, the knowledge of entities and their interactions needs to be represented in a mathematical framework that facilitates dynamic simulations.

Similarly, to computationally reason about temporal and spatial aspects of a biological process, this knowledge should be represented by a semantically rich and strict language (for example, OWL) to exploit computational reasoning tools. Automated reasoners for OWL do not directly support either temporal or spatial reasoning. It is possible, however, to make representations of temporal and spatial aspects of knowledge and then reason about them in a way that is adequate for many application settings.

Within cell cycle related research, a scientist may be interested in a particular protein (what) for which the localization (where) and specific phase of the cell cycle (when) are important analysis components. To represent the linkage between all these different terms, CCO uses relationships as follows. Let: B be a protein; C be a cellular location in which B might be present; G be the gene that codes for B; P be a biological process in which B participates; I be an interaction in which B takes part; and T be the organism that is the source of B. These relationships provide the basis for the atomic elements of knowledge about the protein B: 'B located in C', 'B coded by G', 'B participates in P', 'B participates in I', and 'B has source T'. Each of these relationships also has an inverse relationship, such as 'P has participant B', 'G codes for B', 'C location of B' and 'T source of B'. An example is shown in Figure 1.
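The atomic statements and their inverses can be pictured as subject-relation-object triples. The following Python sketch is only an illustration of that idea, not the CCO implementation: the relation identifiers and the INVERSE_OF table are assumed spellings chosen for this example, and the four facts are taken from the Figure 1 description.

```python
# Relation names follow the wording in the text; the exact identifiers and the
# inverse names below are assumptions made for this illustration.
INVERSE_OF = {
    "located_in": "location_of",
    "encoded_by": "codes_for",
    "participates_in": "has_participant",
    "has_source": "source_of",
}


def with_inverses(triples):
    """Return the given triples plus the inverse statement of each one."""
    closed = set(triples)
    for subject, relation, obj in triples:
        inverse = INVERSE_OF.get(relation)
        if inverse is not None:
            closed.add((obj, inverse, subject))
    return closed


# Atomic statements about one protein, as described for Figure 1.
facts = {
    ("SWI4_YEAST", "located_in", "nucleus"),
    ("SWI4_YEAST", "encoded_by", "SWI4_yeast"),
    ("SWI4_YEAST", "participates_in", "G1/S transition of mitotic cell cycle"),
    ("SWI4_YEAST", "has_source", "Saccharomyces cerevisiae"),
}

for statement in sorted(with_inverses(facts)):
    print(statement)
```

Storing only one direction and deriving the other keeps the two views consistent, which is one plausible reason to treat inverse relationships as derived rather than hand-entered.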

Cell Cycle Ontology contents

CCO supports four model organisms: Homo sapiens, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Arabidopsis thaliana. There is an individual ontology for each of the supported organisms. There is also an integrated ontology that additionally contains (putative) orthology relationships obtained through OrthoMCL clustering. Currently, the integrated CCO contains 132,263 terms: 90,643 proteins (including their modified forms), 21,039 genes and 20,581 protein-protein interactions, and it further comprises 30 types of relationships (properties) (see Tables 1, 2 and 3 for detailed information). The contents of CCO can be viewed and analyzed through a wide variety of tools (see below).

Main features of the Cell Cycle Ontology

CCO is protein centric, meaning that proteins are used as 'hubs' to integrate and connect knowledge. The semantic integration of knowledge creates synergy by allowing queries that would not otherwise be possible. For example, OBO ontologies can be queried by tools such as OBO-Edit [62], the OBO Explorer [58] and AmiGO [63], but none of these can deal with a query such as 'return the orthologs of a protein X and include all the biological processes and molecular functions in which these orthologs participate'. Due to our integrative approach and selection of information sources, CCO is an information-rich ontology that offers many advantages for cell cycle researchers. The main characteristics and functionalities of CCO, described in more detail below, can best be summarized as follows: integrated turnkey system - CCO evolves toward a one-stop shop for cell cycle researchers; exploratory analysis - CCO provides ample possibilities for browsing, visualizing and searching; querying facilities - CCO offers advanced methods to retrieve data; reasoning exploitation - the integrated knowledge is structured to allow for classification, consistency checking, and more advanced implementations that may provide new hypotheses.

Table 1

Organism-specific ontology figures

The numbers shown are of some important entities presently contained in CCO (for example, cell cycle genes) for each of the organism-specific ontologies (A. thaliana ontology (At), H. sapiens ontology (Hs), S. cerevisiae ontology (Sc) and S. pombe ontology (Sp)).

CCO has been made available in a wide range of formats to accommodate a suite of popular visualization and analysis tools, ensuring maximum flexibility of interaction with the ontology: OBOF, OWL [64], RDF [65], the eXtensible Markup Language (XML) [66], DOT [67] and the Graph Modeling Language (GML) [68]. Those formats can be classified into three groups according to the way the user interacts with CCO: a basic exploration of the structure (OBOF), expressive queries including the possibility of combining CCO with other resources (XML, RDF and OWL), and visual exploration (GML, XML - visANT [69] - and DOT). The representations are described in detail as follows.

OBOF is the de facto standard for knowledge representation in the bio-ontology community. Many tools have been built to accommodate OBOF (for example, OBO-Edit [50] and OBO Explorer [58]), and are widely used by biologists. Much of the biological knowledge already captured in ontologies is represented in OBOF [70]. This is why we chose the OBOF resource as the starting point for the CCO pipeline. The OBOF version of CCO is compliant with version 1.2 of the OBOF specification. OBOF, however, offers little in the way of native reasoning services and lacks the kind of semantic infrastructure for knowledge integration that RDF and OWL provide via Uniform Resource Identifiers (URIs). OBOF queries are limited to simple exploration of the ontology structure.

An RDF model is a collection of triple patterns, also simply named 'triples', comprising a subject, a predicate and an object (Figure 2) connected to each other in a graph (for example, the subject of one triple can be the object of another triple). An RDF graph can be flexibly and efficiently queried with the graph query language SPARQL [71] (Figure 3). We have loaded the RDF version of CCO into Open Virtuoso [72] to enable complex queries via SPARQL. In addition, a SPARQL query form [73] and a SPARQL query service [74] are also available to exploit CCO. The CCO RDF allows for a first step toward exploiting Semantic Web technologies [75] as it offers the possibility to integrate knowledge from external resources [76]. Tools such as RDFScape [77] (a plug-in for Cytoscape [78]) can also be used to explore this CCO representation.

The OWL version of CCO is the most expressive one and exceeds the other versions in information content as new axioms (see Materials and methods) have been added to exploit its language capabilities (the other versions are equivalent in content to the original ontologies in OBOF). OWL also allows integration of other ontologies within CCO by using an importing mechanism based on URIs, meaning that extant encoded knowledge from other resources can be effectively added and exploited. Ontologies expressed in OWL, however, often cause performance limitations to the extent that it is prohibitive for specific tools, such as Protégé, when launching very complex queries. OWL reasoners (Pellet [79], FaCT++ [80], RACERPro [81], and KAON2 [82]) can have problems in dealing with large ontologies (such as CCO) and sometimes fail without explanation [83]. Additionally, the OWLDoc server [84] allows online queries over CCO [85].

Table 2

CCO protein figures

This table shows the number of cell cycle related proteins that were integrated into the four species-specific ontologies for the model organisms: A. thaliana (At), H. sapiens (Hs), S. cerevisiae (Sc) and S. pombe (Sp). See 'Data integration' in Materials and methods for the definition of the term 'core cell cycle protein'.

Table 3

Integrated ontology figures

Figures are shown for the composite ontology (CCO): union of the four organism-specific ontologies (A. thaliana (At), H. sapiens (Hs), S. cerevisiae (Sc) and S. pombe (Sp)) plus their orthology relationships. The OrthoMCL execution adds 5,772 clusters containing at least one core cell cycle protein (see 'Data integration' in Materials and methods for the definition of the term 'core cell cycle protein') together with their proteins to CCO; the total number of proteins in CCO is 90,643. Numbers are given for some of the main entities (for example, cell cycle proteins) in the composite ontology (CCO).

Figure 2

RDF triple sample. Simple RDF triple sample showing the subject (Nucleus), the predicate (part_of) and the object (Cell).

Figure 3

RDF matching model. While querying an RDF model, a matching process is performed against the graph model. In the sample, the triples '?protein is_a CCO_B0000000' and '?protein rdfs:label ?protein_label' are matched against the graph on the left.

XML allows efficient data processing and programmatic access to the ontology. XML has less expressivity than RDF or OWL in terms of semantics. The structured document enabled in XML also supports querying (for example, with technologies such as XQuery [86]).

GML, XML (visANT) and DOT allow visual exploration of CCO by tools such as Cytoscape [78], visANT and Graphviz [87]. In particular, visANT provides a very user-friendly way to examine the CCO network of terms and relationships.

Querying the Cell Cycle Ontology with SPARQL

The SPARQL syntax is based on the triple pattern of RDF and, therefore, allows for a detailed specification of a small graph pattern (that is, a collection of interconnected triples) for which the graph should be queried. When performing a query with SPARQL, a small RDF graph pattern is built in which any of the elements of any triple can be a variable (variable names are prefixed in the query with the sign ? or $). This query pattern is matched against the complete RDF graph and any matching structure (collection of triples) is retrieved (Figure 3).
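The matching step described above can be sketched in a few lines of Python. This is a deliberately naive illustration of basic graph pattern matching, not how a SPARQL engine such as the one behind CCO is implemented. Only the class identifier CCO_B0000000 comes from the text; the toy protein identifiers and labels are invented for the example.

```python
def is_variable(term):
    """SPARQL-style variables start with '?' (or '$'); only '?' is handled here."""
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; return None if impossible."""
    extended = dict(binding)
    for pattern_term, triple_term in zip(pattern, triple):
        if is_variable(pattern_term):
            # setdefault binds a new variable or returns the earlier binding.
            if extended.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return extended


def match_pattern(patterns, graph):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_triple(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Toy graph: the protein identifiers and labels are hypothetical.
graph = {
    ("toy_protein_A", "is_a", "CCO_B0000000"),
    ("toy_protein_A", "rdfs:label", "toy protein A"),
    ("toy_protein_B", "is_a", "CCO_B0000000"),
    ("toy_protein_B", "rdfs:label", "toy protein B"),
    ("CCO_C0000252", "rdfs:label", "nucleus"),
}

query = [
    ("?protein", "is_a", "CCO_B0000000"),
    ("?protein", "rdfs:label", "?protein_label"),
]

labels = sorted(b["?protein_label"] for b in match_pattern(query, graph))
print(labels)
```

Because the variable ?protein appears in both patterns, the second pattern can only match triples whose subject was already bound by the first, which is what ties the two triples into one small graph.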

A query can also specify which variables in the query pattern should be shown in the answer. One of SPARQL's strengths is its ability to specify various target graphs that could be used in the same query, resulting in their subsequent combination and effectively constituting an efficient data integration mechanism. As the pointers to the graphs are URIs, knowledge represented in dispersed RDF resources can be combined in a powerful way.

In order to design SPARQL queries on CCO, it is sometimes necessary to deal with CCO identifiers. The following query shows how to retrieve a term name (called 'label' in RDF) corresponding to a given CCO identifier ('CCO_B0000000' in this example). First, a base URL is defined (BASE), and then the prefixes (PREFIX) are set to avoid the repetition of long parts of URIs in the queries. The variables (columns) to be shown in the solution are specified in the SELECT statement. Finally, the query pattern is defined in the WHERE block. The specification of the graphs that should be used (for example, 'cco') is considered as a part of the query pattern. The results table will display the term label: 'core cell cycle protein' (see 'Data integration' in Materials and methods for the definition of 'core cell cycle protein').

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?term_label WHERE {
  GRAPH <cco> { ssb:CCO_B0000000 rdfs:label ?term_label }
}

A similar query can be employed to retrieve a CCO identifier using a term label. The following query retrieves the CCO identifier ('CCO_B0002337') of the protein with the label 'WEE1_ARATH':

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?unique_id WHERE {
  GRAPH <cco> {
    ?unique_id rdfs:label 'WEE1_ARATH'@en }
}

More sophisticated searches based on regular expressions can also be performed, as illustrated in the following query that retrieves all the terms having the keyword 'p53' anywhere within the label (the flag 'i' enables case-insensitive expression lookups):


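BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?unique_id ?name
WHERE {
  GRAPH <cco> {
    ?unique_id rdfs:label ?name .
    # keep only labels that contain 'p53', ignoring case
    FILTER regex(str(?name), 'p53', 'i')
  }
}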
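For readers more familiar with general-purpose languages, the effect of the regex filter above can be approximated in Python. This is a rough sketch under stated assumptions: the dictionary stands in for the label triples of the graph, 'term_1' and 'term_2' and their labels are invented, and only the pair CCO_B0002337 / WEE1_ARATH comes from the text.

```python
import re


def filter_labels(labels, pattern, flags=""):
    """Keep the entries whose label contains a match for `pattern`.

    Like the SPARQL regex function, this searches anywhere in the label;
    the flag 'i' switches on case-insensitive matching.
    """
    compiled = re.compile(pattern, re.IGNORECASE if "i" in flags else 0)
    return {
        term_id: label
        for term_id, label in labels.items()
        if compiled.search(label)
    }


# Hypothetical identifier-to-label table standing in for the RDF graph.
labels = {
    "term_1": "P53_EXAMPLE",
    "term_2": "example p53 binding",
    "CCO_B0002337": "WEE1_ARATH",
}

hits = filter_labels(labels, "p53", "i")
print(sorted(hits))
```

Without the 'i' flag only the lower-case label would match, which illustrates why the case-insensitive flag matters when protein names mix upper and lower case.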

Consider the simple query 'retrieve the names (labels) of all core cell cycle proteins from S. pombe'. These are the proteins annotated with cell cycle terms by the Gene Ontology Annotation (GOA) [88] group. The query pattern consists of two triples. The first triple will match any triple that relates any subject through the 'is_a' predicate to the 'CCO_B0000000' object (core cell cycle protein) and the second triple will match any triple whose subject is the same as in the first triple, the variable ?protein (defined by ? or $ in front of a string name), and has the predicate 'rdfs:label' pointing to any object. The result is a column (?protein_label) with the labels of 1,359 core cell cycle proteins in S. pombe (for example, CDC24_SCHPO). Figure 3 illustrates the query pattern that corresponds with the following SPARQL query:

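BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?protein_label
WHERE {
  GRAPH <cco_S_pombe> {
    ?protein ssb:is_a ssb:CCO_B0000000 .
    ?protein rdfs:label ?protein_label
  }
}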

The following SPARQL query on the A. thaliana graph allows users to infer a putative location for proteins with no documented cellular locations. The assumption behind such a query is that two proteins that participate in the same interaction are likely to share the same cellular location, for example, the 'nucleus' (CCO_C0000252):

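BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?prot_in_the_nucleus ?prot_to_study ?interaction_label
WHERE {
  GRAPH <cco_A_thaliana> {
    ?interaction a ssb:interaction .
    ?interaction rdfs:label ?interaction_label .
    ?prot_A ssb:participates_in ?interaction .
    ?prot_B ssb:participates_in ?interaction .
    ?prot_A rdfs:label ?prot_in_the_nucleus .
    ?prot_B rdfs:label ?prot_to_study .
    ?prot_A ssb:located_in ssb:CCO_C0000252 .
    # keep ?prot_B only if no location at all is documented for it
    OPTIONAL {
      ?prot_B ssb:located_in ?location_B
    }
    FILTER (!bound(?location_B))
  }
}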
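The combination of OPTIONAL with FILTER (!bound(...)) in the query above expresses 'has no documented location'. The same guilt-by-association logic can be sketched in Python as follows. This is an illustrative approximation, not the CCO code: all protein and interaction names are invented, and the two input lists stand in for the participates_in and located_in triples.

```python
def infer_locations(participates_in, located_in, location):
    """Pair proteins lacking any location with partners documented in `location`.

    Returns (known_protein, candidate_protein, interaction) tuples, mirroring
    the three columns selected by the SPARQL query.
    """
    members_of = {}
    for protein, interaction in participates_in:
        members_of.setdefault(interaction, set()).add(protein)

    has_any_location = {protein for protein, _ in located_in}
    in_location = {protein for protein, loc in located_in if loc == location}

    hypotheses = set()
    for interaction, members in members_of.items():
        for known in members & in_location:
            # Equivalent of OPTIONAL + FILTER(!bound): no location triple exists.
            for candidate in members - has_any_location:
                hypotheses.add((known, candidate, interaction))
    return hypotheses


# Hypothetical toy data.
participates_in = [
    ("prot_A", "int_1"), ("prot_B", "int_1"),
    ("prot_C", "int_2"), ("prot_D", "int_2"),
    ("prot_A", "int_3"), ("prot_E", "int_3"),
]
located_in = [
    ("prot_A", "nucleus"),
    ("prot_D", "cytosol"),
    ("prot_E", "cytosol"),
]

result = infer_locations(participates_in, located_in, "nucleus")
print(result)
```

In this toy data only prot_B is reported: prot_E interacts with a nuclear protein but already has a documented location, and prot_C has no nuclear partner.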


The query returns 48 proteins (for example, DMC1_ARATH, SEM12_ARATH) having an interaction with a documented nuclear protein, meaning their own cellular location is also likely to include 'nucleus' at some point. These results and, more generally, any answers to a query on CCO simply reflect the information in the original sources, but their integration enables the construction of new hypotheses. For some questions, the integrated CCO graph must be used. For instance, to retrieve the orthologs of the protein TIP41_YEAST from S. cerevisiae (CCO_B0001243) and the processes in which these orthologs participate, the following query can be used:

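BASE <http://www.semantic-systems-biology.org/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?prot_label ?biological_process_label
WHERE {
  GRAPH <cco> {
    ssb:CCO_B0001243 ssb:is_a ?ortholog_cluster_protein .
    ?prot ssb:is_a ?ortholog_cluster_protein .
    ?prot rdfs:label ?prot_label .
    ?ortholog_cluster_protein rdf:type ssb:type_protein .
    # processes are optional so that orthologs without any are still listed
    OPTIONAL {
      ?prot ssb:participates_in ?biological_process .
      ?biological_process rdfs:label ?biological_process_label
    }
    FILTER(?prot != ssb:CCO_B0001243)
  }
}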

The query returns 63 distinct putative orthologs, of which 55 are not documented to participate in any known process. Thus, with this result these proteins can be hypothesized to participate in the same process as 'TIP41_SCHPO'. To retrieve the identity of the processes in which 'TIP41_SCHPO' participates, a new query must be built that returns the answer 'G2/M transition of mitotic cell cycle':

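BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?process_label WHERE {
  GRAPH <cco> { ssb:CCO_B0001243 ssb:participates_in ?process .
    ?process rdfs:label ?process_label }
}

More examples of biological queries can be found at [73].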
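The reasoning behind the two ortholog queries above (list the members of the same orthology cluster, then propose the query protein's processes for members that have none) can be sketched in Python. The data structures and all names except the process label are assumptions made for this illustration; CCO itself answers these questions through SPARQL over the integrated graph.

```python
def orthologs_with_processes(query, cluster_of, processes_of):
    """Map each other member of the query's cluster to its documented processes."""
    cluster = cluster_of[query]
    members = sorted(
        protein
        for protein, member_cluster in cluster_of.items()
        if member_cluster == cluster and protein != query
    )
    return {protein: sorted(processes_of.get(protein, [])) for protein in members}


def propose_processes(query, cluster_of, processes_of):
    """Suggest the query's processes for orthologs with no documented process."""
    known = sorted(processes_of.get(query, []))
    table = orthologs_with_processes(query, cluster_of, processes_of)
    return {protein: known for protein, found in table.items() if not found}


# Hypothetical cluster membership and annotations.
cluster_of = {
    "query_protein": "cluster_1",
    "ortholog_1": "cluster_1",
    "ortholog_2": "cluster_1",
    "unrelated": "cluster_2",
}
processes_of = {
    "query_protein": ["G2/M transition of mitotic cell cycle"],
    "ortholog_1": ["toy process"],
}

suggestions = propose_processes("query_protein", cluster_of, processes_of)
print(suggestions)
```

The suggestions are hypotheses only: they inherit whatever uncertainty the putative orthology assignment carries and would need experimental or curated support.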

Finally, we used SPARQL to analyze the subcellular distribution of cell cycle proteins. For that, we used the core cell cycle proteins subset of the CCO. First, we analyzed the distribution among the three major cellular compartments: the cytoplasm, nucleus and cell membrane. We found that the majority of cell cycle proteins are located in the nucleus (755) and the cytoplasm (356), where the majority of cell cycle events are known to take place [38]. Twenty-five cell cycle proteins were found to be located in the cell membrane. These are likely to play a role in signaling to the cell cycle machinery.

We looked in more detail at the distribution of cell cycle proteins in the cytoplasm. As expected, the majority of cell cycle proteins are found in the cytosol (280). We also wanted to see if there were cell cycle proteins in the membrane-bounded organelles other than the nucleus. To our surprise, all of the analyzed organelles contained cell cycle proteins: the endoplasmic reticulum (46), the Golgi apparatus (19) and the mitochondrion (43). One could hypothesize that the cell cycle proteins located in the first two compartments are involved in the build-up of a new cell membrane and cell wall between the two daughter cells. It is much more difficult, however, to envision how mitochondrial proteins could be involved in the cell cycle. Even more strikingly, six mitochondrial proteins were found to play a role in the regulation of the cell cycle. Provided the cellular compartment annotations are correct, and if taken up by cell cycle researchers, these results may possibly lead to the discovery of novel mechanisms of cell cycle regulation.

An alternative hypothesis to explain a cell cycle role for proteins known to be located in membrane-bounded organelles other than the nucleus is that these proteins are also present outside of those organelles. For example, if a protein can be located in both the mitochondrion and the cytosol, then the cell cycle function of the protein can be exerted in the cytosol, but not in the mitochondrion, where it may fulfill a different role. Therefore, we analyzed alternative locations of the proteins in question. We identified 9, 5 and 15 core cell cycle proteins from the endoplasmic reticulum, Golgi apparatus and mitochondrion, respectively, that additionally have a cytosolic or nuclear localization. These proteins have an unusual combination of locations, and merit further investigation with respect to the molecular mechanisms underlying their ability to be localized to apparently incompatible locations. This also highlights the need to indicate when and where functions assigned to a protein are valid.
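The search for proteins with an additional cytosolic or nuclear localization amounts to a set intersection over location annotations. The following Python sketch shows the idea on invented data. The protein names and annotations are hypothetical and do not reproduce the CCO content or the counts reported above.

```python
# Hypothetical annotations: protein name -> set of cellular compartments.
LOCATIONS = {
    "protein_a": {"mitochondrion", "cytosol"},
    "protein_b": {"mitochondrion"},
    "protein_c": {"Golgi apparatus", "nucleus"},
    "protein_d": {"endoplasmic reticulum"},
}

ORGANELLES = ("endoplasmic reticulum", "Golgi apparatus", "mitochondrion")
ALTERNATIVE = {"cytosol", "nucleus"}


def dual_localized(locations, organelle, alternative=ALTERNATIVE):
    """Proteins annotated to `organelle` and to at least one alternative compartment."""
    return sorted(
        name
        for name, places in locations.items()
        if organelle in places and places & alternative
    )


for organelle in ORGANELLES:
    print(organelle, dual_localized(LOCATIONS, organelle))
```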

Automated reasoning over bio-ontologies

Description logics and automated reasoners

Description Logics (DL) [89] and Semantic Web technologies [60,61] provide a foundation for the management and exploitation of knowledge in ontologies. The type of OWL used for CCO is based on DL, a family of logic-based knowledge representation formalisms that describe a domain in terms of concepts (classes), roles (properties or relationships) and individuals (instances). OWL-DL offers an optimal trade-off between expressivity and computational tractability [89]. It is considered sufficiently expressive to represent a wide variety of biomedical knowledge [90], while still supporting automated reasoning. It has become one of the standard languages for representing ontologies in the semantically strict form that supports automated reasoning.

DL reasoners are computational tools to: ensure that an ontology does not contain any contradictory facts (consistency checking); compute the subclass relation between each named class to create the class hierarchy (classification); find the most specific classes to which an individual belongs (realization); and retrieve information from an ontology (querying).
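To give a feel for what classification and realization produce, here is a minimal Python sketch that works only on asserted subclass axioms. It is not a DL reasoner (it performs no inference over restrictions or definitions), and the class names and axioms are hypothetical.

```python
# Hypothetical asserted subclass axioms: class -> set of direct superclasses.
SUBCLASS_OF = {
    "Nucleus": {"Organelle"},
    "Mitochondrion": {"Organelle"},
    "Organelle": {"CellPart"},
    "CellPart": set(),
}


def ancestors(cls, subclass_of=SUBCLASS_OF):
    """All superclasses of cls, following subclass axioms transitively."""
    seen = set()
    stack = list(subclass_of.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, ()))
    return seen


def most_specific(classes, subclass_of=SUBCLASS_OF):
    """Drop every class that is a superclass of another class in the set."""
    classes = set(classes)
    redundant = set()
    for cls in classes:
        redundant |= ancestors(cls, subclass_of) & classes
    return classes - redundant


print(ancestors("Nucleus"))
print(most_specific({"Nucleus", "Organelle", "CellPart"}))
```

Here `ancestors` stands in for the class hierarchy a classifier would compute, and `most_specific` mimics realization for an individual whose known types are passed in as a set.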

Ontology curators can use DL reasoners to minimize term redundancy, while maintaining sufficiently detailed descriptions and consistency of the contents [18,19]. Moreover, reasoning tools can also be used to find new classes (either more specific or more general) [20]. Finally, and in this context most importantly, reasoning tools can also be used in biological research for information retrieval and the generation of new hypotheses that are consistent with the knowledge captured in the ontology.

Representing biological knowledge with OWL

OWL-DL queries can be more fine-grained than RDF queries, since the semantic model of OWL-DL allows more expressivity. The OWL semantics is based on sets (classes) of instances (individuals). A class is a subclass of another class if and only if all the instances of the subclass are also instances of the superclass; the superclass may have other instances that do not belong to the subclass. For example, the well-known 'is a' hierarchy in GO is founded on this concept.
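The set-based reading of the subclass relation can be written down directly as a subset test. The Python sketch below checks it for a single, hypothetical assignment of individuals to classes. In OWL the relation must hold in every interpretation that satisfies the ontology, so this only illustrates the definition and is not a reasoning procedure.

```python
# Hypothetical individuals assigned to classes (one interpretation only).
EXTENSION = {
    "Organelle": {"nucleus_1", "nucleus_2", "mito_1"},
    "Nucleus": {"nucleus_1", "nucleus_2"},
}


def is_subclass(sub, sup, extension=EXTENSION):
    """True if every instance of `sub` is also an instance of `sup`."""
    return extension[sub] <= extension[sup]


print(is_subclass("Nucleus", "Organelle"))
print(is_subclass("Organelle", "Nucleus"))
```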

Relationships in OWL-DL are interpreted as existing between pairs of individuals. Restrictions on classes define which and how many relationships the instances of that class must hold. When a restriction is defined, an anonymous class is defined (Figure 4, dotted shape), and the class to which the restriction is added becomes a subclass or equivalent class of that anonymous class. For instance, the restriction 'subClassOf part_of some Cell' in the class 'Nucleus' states that every instance of the class 'Nucleus' must have at least one relationship along the property 'part_of' to an instance of the class 'Cell'. Other quantifiers can be used in these restrictions, such as 'only', 'min', 'max' and 'value', as well as Boolean operators such as 'and', 'or' and 'not'.

If the restriction is added as a superclass of the class that is being defined (the class being defined is a subclass of the restriction, as in the example above), the restriction is known as a 'necessary condition'. A necessary condition is a condition that all the instances of the class must fulfill, but is not enough in itself to define class membership. Therefore, if an instance is found that has at least one 'part_of' relationship to 'Cell', it does not mean that it is a member of the class

Figure 4. OWL property (part_of) sample: the property 'part_of' links individuals belonging to a class (for example, 'Nucleus') to individuals of the class 'Cell'. A restriction of the type 'part_of some Cell' on the class 'Nucleus' defines an anonymous class (dotted shape), and implies that every individual belonging to the class 'Nucleus' is 'part_of' some individual of the class 'Cell'.
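The difference between a necessary condition and a definition of class membership can be made concrete with a small Python sketch. It performs a closed-world check over invented individuals, whereas OWL reasoners work under open-world semantics, so it should be read as an illustration of the 'part_of some Cell' restriction only. All individual names are hypothetical.

```python
# Hypothetical individuals and part_of assertions.
INSTANCES = {
    "Nucleus": {"nucleus_1"},
    "Cell": {"cell_1"},
}
PART_OF = {("nucleus_1", "cell_1"), ("ribosome_1", "cell_1")}


def part_of_some(individual, cls):
    """True if `individual` has at least one part_of link to an instance of `cls`."""
    return any(s == individual and o in INSTANCES[cls] for s, o in PART_OF)


def satisfies_necessary_condition(cls, filler):
    """True if every instance of `cls` is part_of some instance of `filler`."""
    return all(part_of_some(i, filler) for i in INSTANCES[cls])


# Every Nucleus is part_of some Cell ...
print(satisfies_necessary_condition("Nucleus", "Cell"))
# ... but being part_of some Cell does not make ribosome_1 a Nucleus.
print(part_of_some("ribosome_1", "Cell"), "ribosome_1" in INSTANCES["Nucleus"])
```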
