The Cell Cycle Ontology: an application ontology for the
representation and integrated analysis of the cell cycle process
Erick Antezana *† , Mikel Egaña ‡ , Ward Blondé § , Aitzol Illarramendi ¶ ,
Iñaki Bilbao ¶ , Bernard De Baets § , Robert Stevens ‡ , Vladimir Mironov ¥ and Martin Kuiper ¥
Addresses: * Department of Plant Systems Biology, VIB, Technologiepark 927, B-9052 Gent, Belgium † Department of Molecular Genetics, Ghent University, Technologiepark 927, B-9052 Gent, Belgium ‡ School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK § Department of Applied Mathematics, Biometrics and Computer Science, Ghent University, Coupure links 653,
B-9000 Gent, Belgium ¶ Noray Bioinformatics, SL Parque Tecnológico 801 A, 2°, 48160 Derio (Bizkaia), Spain ¥ Department of Biology, Norwegian University of Science and Technology, Høgskoleringen 5, NO-7491 Trondheim, Norway
Correspondence: Martin Kuiper Email: martin.kuiper@bio.ntnu.no
© 2009 Antezana et al.; licensee BioMed Central Ltd
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cell Cycle Ontology
A software resource for the analysis of cell cycle related molecular networks.
Abstract
The Cell Cycle Ontology (http://www.CellCycleOntology.org) is an application ontology that automatically captures and integrates detailed knowledge on the cell cycle process. Cell Cycle Ontology is enabled by semantic web technologies, and is accessible via the web for browsing, visualizing, advanced querying, and computational reasoning. Cell Cycle Ontology facilitates a detailed analysis of cell cycle-related molecular network components. Through querying and automated reasoning, it may provide new hypotheses to help steer a systems biology approach to biological network building.
Rationale
Molecular biology has spent the past two decades cataloguing genes, expression levels, proteins, molecular interactions and more. The combination of all these catalogues should enable a biologist to start building a comprehensive picture of a biological system rather than only looking at the individual components. The formation of representations of these components into a network that describes a biological system constitutes the first step in allowing a biologist to develop an understanding of the behavior of a system. If adequate kinetic and other parameters can be obtained or estimated, such models can be used for network simulations in a mathematical framework, making them particularly useful to study the emergent properties of such a system [1-5]. These models provide the basis for much of systems biology that is built on integrative data analysis and mathematical modeling [6-9].
In systems biology, dynamic simulations with a model of a biological process serve as a means to validate the model's architecture and parameters, and to provide hypotheses for new experiments.
Complementary to such model-dependent hypothesis generation, the field of computational reasoning promises to provide a powerful additional source of new hypotheses concerning biological network components. The integration of biological knowledge from various sources and the alignment of their representations into one common representation are recognized as critical steps toward hypothesis building [10,11]. Such an integrated information resource is essential for exploration and exploitation by both humans and computers, the latter via automated reasoning [12].
Published: 29 May 2009
Genome Biology 2009, 10:R58 (doi:10.1186/gb-2009-10-5-r58)
Received: 20 December 2008 Revised: 17 April 2009 Accepted: 29 May 2009
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2009/10/5/R58
Bio-ontologies
While it is easy to compare nucleic acid or polypeptide sequences from different bioinformatics resources, the biological knowledge contained in these resources is very difficult to compare as it is represented in a wide variety of lexical forms [13-15], and there are no tools that facilitate an easy comparison and integration of knowledge in this form. This is where ontologies can provide assistance.
Ontologies represent knowledge about a specific scientific domain, and support a consistent and unambiguous representation of entities within that domain. This knowledge can be integrated into a single model that holds these domain entities and their term labels, as well as their connecting relationships [16]. A well-known example of such an ontology is the Gene Ontology (GO) [17]. An ontology therefore links term labels to their interpretations, that is, specifications of their meanings, defined as a set of properties.
Ontologies not only provide the foundation for knowledge integration, but also the basis for advanced computational reasoning to validate hypotheses and make implicit knowledge explicit [18,19]. Integrated knowledge founded on well-defined semantics provides a framework to enable computers to conceptually handle knowledge in a manner comparable to the handling of numerical data: it allows a computer to process expressed facts, look for patterns and make inferences, thereby extending human thinking about complex information. On a more technical level, computational reasoning services can also be used to check the consistency of such integrated knowledge, to re-engineer the design of parts of the entire ontology or to design entirely new extensions that comply with current knowledge [20].
Generally speaking, ontologies that model domain knowledge are developed through an iterative process of refinement, an approach common in the field of software engineering [21]. Ontology development has been pursued for many years, and while several methodologies have been proposed [22-29], none has been widely accepted. The Open Biomedical Ontology (OBO) project [30], however, aims to coordinate the development of bio-ontologies (for example, the GO and the Relation Ontology (RO) [31], among many others). The OBO foundry [32] has provided a set of principles to guide the development of ontologies. These ontologies have gained wide acceptance within the biomedical community [33] as a means for data annotation and integration and as a reference.
Biological information is known to be difficult to integrate and analyze [34]. One of the reasons for this is that biologists are inclined to invent new names and expressions for entities, such as proteins and their functions, that others have already named. This has led to high incidences of synonymy, homonymy and polysemy that plague biomedicine. Furthermore, biological knowledge is often not crisp, as evidenced by the widespread use of quantifiers such as 'often', 'usually' and 'sometimes'. Finally, the sheer volume and complexity of biological data and the diversity of representational formats provide profound challenges for efficient biomedical knowledge management. Altogether, this calls for a concerted effort of experts from the biomedical and computational sciences to organize and facilitate the integration and exploitation of rapidly accumulating biological information.
Application ontologies in the life sciences and their role
in systems biology
Application ontologies define relevant concepts for a particular application or use [35]. They can be built by combining domain ontologies (or parts of domain ontologies), or by using these as 'a reference', and they can be extended according to the needs of a particular application. Application ontologies are intended to be directly embedded into knowledge bases on which different applications can be run, such as data mining and hypothesis generation. Application ontologies can play an important role in exploiting the formalization of domain knowledge, thereby facilitating the integration of different types of information (for example, knowledge about biological processes and subcellular localizations, both parts of GO). Figure 1 shows a sample piece of knowledge composed of such integrated information. This schematic representation gives a minimal but context-linked notion of a specific protein and its environment of functional characteristics (for example, where it is located, in which processes it participates, and by which gene it is encoded).
A successful application ontology may form the core of an efficient and effective management system. Such a system combines data extraction methods, data format conversions and a variety of information sources. To illustrate the potential use of application ontologies for the life sciences, we have designed and built a knowledge management system that facilitates the analysis of cell cycle control.
Why focus on the cell cycle process?
The eukaryotic cell cycle, or cell division cycle, is the series of events that happen between two consecutive cell divisions that underlie cell multiplication. The molecular events that control the cell cycle are ordered and directional; that is, each process occurs in a sequential fashion and it is impossible to reverse the cycle. The cell cycle control network is complex and is thought to include hundreds of proteins [36,37].
Although the basic principles of cell cycle control are now well documented [38], we are far from having a complete understanding of all the intricacies of the underlying system. A deeper knowledge of the cell cycle control system is essential to the understanding of the growth and development of eukaryotic organisms. In turn, this is necessary in order to be able to combat numerous diseases in which cell cycle aberrations are involved, such as cancer.
Part of this knowledge has already been incorporated into dynamic system models that are being exploited to test, refine and generate hypotheses [39]. This holistic and integrative approach in biological research, also called systems biology, is gaining momentum [40,41] and is leading to novel insights into cell machinery [37,42,43]. To further augment cell cycle research with computational approaches, we have built the Cell Cycle Ontology (CCO), which integrates a wide variety of knowledge sources pertinent to the cell cycle.
Results and discussion
The Cell Cycle Ontology application ontology
CCO is built to provide laboratory biologists with a one-stop shop for cell cycle knowledge and with access to an integrated knowledge system that can be used to explore the potential power of automated reasoning. CCO comprises information from a number of resources that contain relevant information about the cell cycle process, such as GO [44], RO, the IntAct database [45], the National Center for Biotechnology Information (NCBI) taxonomy [46], the UniProt knowledge base [47], and putative orthology relationships derived with the OrthoMCL clustering algorithm [48,49]. All the information is integrated into a single framework that is supported by the ontologies. The integrated knowledge system supports queries that are not feasible with the original, individual and separate information sources.
Bio-ontologies and their presentations have been made accessible through existing software tools such as OBO-Edit [50] and Protégé [51], or web-based tools such as BioPortal [52], which can be used to create new terms and relationships and to explore and analyze these ontologies. The most frequently used biomedical ontologies are provided in the Open Biomedical Ontology format (OBOF) [53], while some are also natively available in the Web Ontology Language (OWL) [54] (though the OBOF can be transformed into an OWL representation [55-58]). OWL provides a means of creating semantically rich ontologies with ample possibilities for querying and computational reasoning. Therefore, we converted the wealth of information available in the OBOF, and the highly curated information from public data sources, into the more expressive OWL representation in order to exploit richer forms of computational reasoning.
Figure 1
Local neighborhood of the SWI4_YEAST protein. Example of the local neighborhood of the protein SWI4_YEAST: some of the types of relationships used within CCO depict how a given protein (SWI4_YEAST) is connected to the organism it belongs to (S. cerevisiae), its coding gene (SWI4_yeast), biological processes (G1/S transition of mitotic cell cycle), cellular localization (nucleus), interactions (physical interactions), protein transformations (post-translational modifications), and its orthology group.
CCO is extensible, and the CCO integration architecture can accommodate additional ontologies if necessary. In addition, a broad range of export formats from CCO (in particular, OWL and Resource Description Framework (RDF)) enables virtual integration with external sources (controlled vocabularies translated into RDF such as Medical Subject Headings (MeSH) [59]), allowing for queries that address these disparate resources through Semantic Web technologies [60,61].
Knowledge representation in the Cell Cycle Ontology
CCO is a resource that can directly support systems biology. Systems biology is essentially a model-driven approach to biological research, in which a model of a biological process serves to integrate all the available information (network components and their interactions). A model simulation allows for an understanding of network behavior, including changes to the entities, describing these changes in terms of what these entities are, where they are located and when these statements hold. To this end, the knowledge of entities and their interactions needs to be represented in a mathematical framework that facilitates dynamic simulations.
Similarly, to computationally reason about temporal and spatial aspects of a biological process, this knowledge should be represented by a semantically rich and strict language (for example, OWL) to exploit computational reasoning tools. Automated reasoners for OWL do not directly support either temporal or spatial reasoning. It is possible, however, to make representations of temporal and spatial aspects of knowledge and then reason about them in a way that is adequate for many application settings.
Within cell cycle related research, a scientist may be interested in a particular protein (what) for which the localization (where) and specific phase of the cell cycle (when) are important analysis components. To represent the linkage between all these different terms, CCO uses relationships as follows. Let: B be a protein; C be a cellular location in which B might be present; G be the gene that codes for B; P be a biological process in which B participates; I be an interaction in which B takes part; and T be the organism that is the source of B. These relationships provide the basis for the atomic elements of knowledge about the protein B: 'B located in C', 'B coded by G', 'B participates in P', 'B participates in I', and 'B has source T'. The existing relationships also have an inverse relationship such as 'P has participant B', 'G codes for B', 'C location of B', 'T source of B'. An example is shown in Figure 1.
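The atomic statements about protein B can be pictured as (subject, relationship, object) tuples, with each inverse obtained by swapping subject and object. The following is a minimal Python sketch under that assumption; the underscore-style relationship names and the INVERSE table are illustrative renderings of the phrases used in the text, not necessarily the identifiers CCO uses internally.

```python
# Illustrative only: relationship names mirror the phrases in the text,
# not necessarily the identifiers used inside CCO.
INVERSE = {
    "located_in": "location_of",
    "coded_by": "codes_for",
    "participates_in": "has_participant",
    "has_source": "source_of",
}

def with_inverses(statements):
    """Return the statements plus one inverse statement for each of them."""
    result = set(statements)
    for subject, relation, obj in statements:
        result.add((obj, INVERSE[relation], subject))
    return result

# Atomic knowledge about a protein B, using the placeholder letters from the text.
facts = {
    ("B", "located_in", "C"),
    ("B", "coded_by", "G"),
    ("B", "participates_in", "P"),
    ("B", "participates_in", "I"),
    ("B", "has_source", "T"),
}
knowledge = with_inverses(facts)
```

Storing both directions explicitly is only one possible design; a query engine or reasoner could equally derive the inverse statements on demand.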
Cell Cycle Ontology contents
CCO supports four model organisms: Homo sapiens, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Arabidopsis thaliana. There is an individual ontology for each of the supported organisms. There is also an integrated ontology that additionally contains (putative) orthology relationships obtained through OrthoMCL clustering. Currently, the integrated CCO contains 132,263 terms: 90,643 proteins (including their modified forms), 21,039 genes and 20,581 protein-protein interactions, and it further comprises 30 types of relationships (properties) (see Tables 1, 2 and 3 for detailed information). The contents of CCO can be viewed and analyzed through a wide variety of tools (see below).
Main features of the Cell Cycle Ontology
CCO is protein centric, meaning that proteins are used as 'hubs' to integrate and connect knowledge. The semantic integration of knowledge creates synergy by allowing queries that would not otherwise be possible. For example, OBO ontologies can be queried by tools such as OBO-Edit [62], the OBO Explorer [58] and AmiGO [63], but none of these can deal with a query such as 'return the orthologs of a protein X and include all the biological processes and molecular functions in which these orthologs participate'. Due to our integrative approach and selection of information sources, CCO is an information-rich ontology that offers many advantages for cell cycle researchers. The main characteristics and functionalities of CCO, described in more detail below, can best be summarized as follows: integrated turnkey system - CCO evolves toward a one-stop shop for cell cycle researchers; exploratory analysis - CCO provides ample possibilities for browsing, visualizing and searching; querying facilities - CCO offers advanced methods to retrieve data; reasoning exploitation - the integrated knowledge is structured to allow for classification, consistency checking, and more advanced implementations that may provide new hypotheses.
Table 1
Organism-specific ontology figures
The numbers shown are of some important entities presently contained in CCO (for example, cell cycle genes) for each of the organism-specific ontologies (A. thaliana ontology (At), H. sapiens ontology (Hs), S. cerevisiae ontology (Sc) and S. pombe ontology (Sp)).
CCO has been made available in a wide range of formats to accommodate a suite of popular visualization and analysis tools, ensuring maximum flexibility of interaction with the ontology: OBOF, OWL [64], RDF [65], the eXtensible Markup Language (XML) [66], DOT [67] and the Graph Modeling Language (GML) [68]. Those formats can be classified into three groups according to the way the user interacts with CCO: a basic exploration of the structure (OBOF), expressive queries including the possibility of combining CCO with other resources (XML, RDF and OWL), and visual exploration (GML, XML - visANT [69] - and DOT). The representations are described in detail as follows.
OBOF is the de facto standard for knowledge representation in the bio-ontology community. Many tools have been built to accommodate OBOF (for example, OBO-Edit [50] and OBO Explorer [58]), and are widely used by biologists. Much of the biological knowledge already captured in ontologies is represented in OBOF [70]. This is why we chose the OBOF resource as the starting point for the CCO pipeline. The OBOF version of CCO is compliant with version 1.2 of the OBOF specification. OBOF, however, offers little in the way of native reasoning services and even lacks a semantic infrastructure for knowledge integration of the kind that RDF and OWL provide via Uniform Resource Identifiers (URIs). OBOF queries are limited to simple exploration of the ontology structure.
An RDF model is a collection of statements, simply named 'triples', each comprising a subject, a predicate and an object (Figure 2), connected to each other in a graph (for example, the subject of one triple can be the object of another triple). An RDF graph can be flexibly and efficiently queried with the graph query language SPARQL [71] (Figure 3). We have loaded the RDF version of CCO into Open Virtuoso [72] to enable complex queries via SPARQL. In addition, a SPARQL query form [73] and a SPARQL query service [74] are also available to exploit CCO. The CCO RDF allows for a first step toward exploiting Semantic Web technologies [75] as it offers the possibility to integrate knowledge from external resources [76]. Tools such as RDFScape [77] (a plug-in for Cytoscape [78]) can also be used to explore this CCO representation.
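To make the graph reading of triples concrete, the following Python sketch stores a few triples in a set and follows one predicate from node to node. The 'Nucleus part_of Cell' triple is the sample of Figure 2; the second triple and the helper names objects and reachable are assumptions added purely for illustration. Walking edges in this way is a plain graph traversal, not a description of how SPARQL engines or OWL reasoners work.

```python
# A tiny RDF-like graph as a set of (subject, predicate, object) triples.
# "Nucleus part_of Cell" is the sample from Figure 2; the Nucleolus triple
# is added here purely for illustration.
triples = {
    ("Nucleolus", "part_of", "Nucleus"),
    ("Nucleus", "part_of", "Cell"),
}

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}

def reachable(graph, start, predicate):
    """Follow one predicate repeatedly from start, collecting every node reached."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in objects(graph, node, predicate):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Because "Nucleus" is the object of one triple and the subject of another,
# the traversal can continue from Nucleolus through Nucleus to Cell.
containers = reachable(triples, "Nucleolus", "part_of")
```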
The OWL version of CCO is the most expressive one and exceeds the other versions in information content as new axioms (see Materials and methods) have been added to exploit its language capabilities (the other versions are equivalent in content to the original ontologies in OBOF). OWL also allows integration of other ontologies within CCO by using an importing mechanism based on URIs, meaning that extant encoded knowledge from other resources can be effectively added and exploited. Ontologies expressed in OWL, however, often cause performance problems that can be prohibitive for specific tools, such as Protégé, when launching very complex queries. OWL reasoners (Pellet [79], FaCT++ [80], RACERPro [81], and KAON2 [82]) can have problems in dealing with large ontologies (such as CCO) and sometimes fail without explanation [83]. Additionally, the OWLDoc server [84] allows online queries over CCO [85].
Table 2
CCO protein figures
This table shows the number of cell cycle related proteins that were integrated into the four species-specific ontologies for the model organisms: A. thaliana (At), H. sapiens (Hs), S. cerevisiae (Sc) and S. pombe (Sp). See 'Data integration' in Materials and methods for the definition of the term 'core cell cycle protein'.
Table 3
Integrated ontology figures
Figures are shown for the composite ontology (CCO): union of the four organism-specific ontologies (A. thaliana (At), H. sapiens (Hs), S. cerevisiae (Sc) and S. pombe (Sp)) plus their orthology relationships. The OrthoMCL execution adds 5,772 clusters containing at least one core cell cycle protein (see 'Data integration' in Materials and methods for the definition of the term 'core cell cycle protein') together with their proteins to CCO; the total number of proteins in CCO is 90,643. Numbers are given for some of the main entities (for example, cell cycle proteins) in the composite ontology (CCO).
Figure 2
RDF triple sample. Simple RDF triple sample showing the subject (Nucleus), the predicate (part_of) and the object (Cell).
Figure 3
RDF matching model. While querying an RDF model, a matching process is performed against the graph model. In the sample, the triples '?protein is_a CCO_B0000000' and '?protein rdfs:label ?protein_label' are matched against the graph on the left.
XML allows efficient data processing and programmatic access to the ontology. XML has less expressivity than RDF or OWL in terms of semantics. The structured document enabled in XML also supports querying (for example, with technologies such as XQuery [86]).
GML, XML (visANT) and DOT allow visual exploration of CCO by tools such as Cytoscape [78], visANT and Graphviz [87]. In particular, visANT provides a very user-friendly way to examine the CCO network of terms and relationships.
Querying the Cell Cycle Ontology with SPARQL
The SPARQL syntax is based on the triple pattern of RDF and, therefore, allows for a detailed specification of a small graph pattern, that is, a collection of interconnected triples, for which the graph should be queried. When performing a query with SPARQL, a small RDF graph pattern is built in which any of the elements of any triple can be a variable (variable names are prefixed in the query with the sign ? or $). This query pattern is matched against the complete RDF graph and any matching structure (collection of triples) is retrieved (Figure 3).
A query can also specify which variables in the query pattern should be shown in the answer. One of SPARQL's strengths is its ability to specify various target graphs that could be used in the same query, resulting in their subsequent combination and effectively constituting an efficient data integration mechanism. As the pointers to the graphs are URIs, knowledge represented in dispersed RDF resources can be combined in a powerful way.
In order to design SPARQL queries on CCO, it is sometimes necessary to deal with CCO identifiers. The following query shows how to retrieve a term name (called 'label' in RDF) corresponding to a given CCO identifier ('CCO_B0000000' in this example). First, a base URL is defined (BASE), and then the prefixes (PREFIX) are set to avoid the repetition of long parts of URIs in the queries. The variables (columns) to be shown in the solution are specified in the SELECT statement. Finally, the query pattern is defined in the WHERE block. The specification of the graphs that should be used (for example, 'cco') is considered as a part of the query pattern. The results table will display the term label: 'core cell cycle protein' (see 'Data integration' in Materials and methods for the definition of 'core cell cycle protein').
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?term_label
WHERE {
  GRAPH <cco> {
    ssb:CCO_B0000000 rdfs:label ?term_label
  }
}
A similar query can be employed to retrieve a CCO identifier using a term label. The following query retrieves the CCO identifier ('CCO_B0002337') of the protein with the label 'WEE1_ARATH':
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?unique_id
WHERE {
  GRAPH <cco> {
    ?unique_id rdfs:label 'WEE1_ARATH'@en
  }
}
More sophisticated searches based on regular expressions can also be performed, as illustrated in the following query that retrieves all the terms having the keyword 'p53' anywhere within the label (the flag 'i' enables case-insensitive expression lookups):
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
SELECT ?unique_id ?name
WHERE {
  GRAPH <cco> {
    ?unique_id rdfs:label ?name .
    FILTER regex(str(?name), 'p53', 'i')
  }
}
Consider the simple query 'retrieve the names (labels) of all core cell cycle proteins from S. pombe'. These are the proteins annotated with cell cycle terms by the Gene Ontology Annotation (GOA) [88] group. The query pattern consists of two triples. The first triple will match any triple that relates any subject through the 'is_a' predicate to the 'CCO_B0000000' object (core cell cycle protein) and the second triple will match any triple whose subject is the same as in the first triple, the variable ?protein (defined by ? or $ in front of a string name), and has the predicate 'rdfs:label' pointing to any object. The result is a column (?protein_label) with the labels of 1,359 core cell cycle proteins in S. pombe (for example, CDC24_SCHPO). Figure 3 illustrates the query pattern that corresponds with the following SPARQL query:
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?protein_label
WHERE {
  GRAPH <cco_S_pombe> {
    ?protein ssb:is_a ssb:CCO_B0000000 .
    ?protein rdfs:label ?protein_label
  }
}
The following SPARQL query on the A. thaliana graph allows users to infer a putative location for proteins with no documented cellular locations. The assumption behind such a query is that two proteins that participate in the same interaction are likely to share the same cellular location, for example, the 'nucleus' (CCO_C0000252):
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?prot_in_the_nucleus ?prot_to_study ?interaction_label
WHERE {
  GRAPH <cco_A_thaliana> {
    ?interaction a ssb:interaction .
    ?interaction rdfs:label ?interaction_label .
    ?prot_A ssb:participates_in ?interaction .
    ?prot_B ssb:participates_in ?interaction .
    ?prot_A rdfs:label ?prot_in_the_nucleus .
    ?prot_B rdfs:label ?prot_to_study .
    ?prot_A ssb:located_in ssb:CCO_C0000252 .
    OPTIONAL {
      ?prot_B ssb:located_in ?location_B
    }
    FILTER (!bound(?location_B))
  }
}
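In the A. thaliana query, the OPTIONAL block followed by FILTER (!bound(?location_B)) acts as a test for 'no documented location': the optional pattern binds ?location_B whenever a location exists, and the filter then discards exactly those solutions. The same logic can be sketched in Python over toy dictionaries. All protein and interaction names below are invented, and the dictionaries merely stand in for the participates_in and located_in triples of the graph.

```python
# Hypothetical toy data: protein and interaction names are invented for illustration.
participates_in = {
    "PROT_A": {"interaction_1"},
    "PROT_B": {"interaction_1"},
    "PROT_C": {"interaction_2"},
    "PROT_D": {"interaction_2"},
}
located_in = {
    "PROT_A": {"nucleus"},
    "PROT_C": {"cytosol"},
    "PROT_D": {"nucleus"},
}

def candidates_for(location):
    """Triples (located protein, unlocated partner, shared interaction)."""
    hits = set()
    for prot_a, interactions_a in participates_in.items():
        if location not in located_in.get(prot_a, set()):
            continue
        for prot_b, interactions_b in participates_in.items():
            # Mirrors OPTIONAL + FILTER(!bound(...)): keep only partners
            # that have no documented location at all.
            if located_in.get(prot_b):
                continue
            for interaction in interactions_a & interactions_b:
                hits.add((prot_a, prot_b, interaction))
    return hits

nuclear_candidates = candidates_for("nucleus")
```

Here PROT_B is proposed as a nuclear candidate through its interaction with PROT_A, whereas PROT_C is skipped because it already has a documented location.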
The A. thaliana query returns 48 proteins (for example, DMC1_ARATH, SEM12_ARATH) having an interaction with a documented nuclear protein, meaning their own cellular location is also likely to include 'nucleus' at some point. These results and, more generally, any answer to a query on CCO simply reflect the information in the original sources, but their integration enables the construction of new hypotheses. For some questions, the integrated CCO graph must be used. For instance, to retrieve the orthologs of the protein TIP41_YEAST from S. cerevisiae (CCO_B0001243) and the processes in which these orthologs participate, the following query can be used:
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?prot_label ?biological_process_label
WHERE {
  GRAPH <cco> {
    ssb:CCO_B0001243 ssb:is_a ?ortholog_cluster_protein .
    ?prot ssb:is_a ?ortholog_cluster_protein .
    ?prot rdfs:label ?prot_label .
    ?ortholog_cluster_protein rdf:type ssb:type_protein .
    OPTIONAL {
      ?prot ssb:participates_in ?biological_process .
      ?biological_process rdfs:label ?biological_process_label
    }
    FILTER(?prot != ssb:CCO_B0001243)
  }
}
The query returns 63 distinct putative orthologs, of which 55 are not documented to participate in any known process. Thus, with this result these proteins can be hypothesized to participate in the same process as 'TIP41_SCHPO'. To retrieve the identity of the processes in which 'TIP41_SCHPO' participates, a new query must be built that returns the answer 'G2/M transition of mitotic cell cycle':
BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb:<http://www.semantic-systems-biology.org/SSB#>
SELECT ?process_label
WHERE {
  GRAPH <cco> {
    ssb:CCO_B0001243 ssb:participates_in ?process .
    ?process rdfs:label ?process_label
  }
}
More examples of biological queries can be found at [73].
Finally, we used SPARQL to analyze the subcellular distribution of cell cycle proteins. For that, we used the core cell cycle proteins subset of the CCO. First, we analyzed the distribution among the three major cellular compartments: the cytoplasm, nucleus and cell membrane. We found that the majority of cell cycle proteins are located in the nucleus (755) and the cytoplasm (356), where the majority of cell cycle events are known to take place [38]. Twenty-five cell cycle proteins were found to be located in the cell membrane. These are likely to play a role in signaling to the cell cycle machinery.
We looked in more detail at the distribution of cell cycle proteins in the cytoplasm. As expected, the majority of cell cycle proteins are found in the cytosol (280). We also wanted to see if there were cell cycle proteins in the membrane bounded organelles other than in the nucleus. To our surprise, all of the analyzed organelles contained cell cycle proteins: the endoplasmic reticulum (46), the Golgi apparatus (19) and the mitochondrion (43). One could hypothesize that the cell cycle proteins located in the first two compartments are involved in the build-up of a new cell membrane and cell wall between the two daughter cells. It is much more difficult, however, to envision how mitochondrial proteins could be involved in the cell cycle. Even more strikingly, six mitochondrial proteins were found to play a role in the regulation of the cell cycle. Provided the cellular compartment annotations are correct, and if taken up by cell cycle researchers, these results may possibly lead to the discovery of novel mechanisms of cell cycle regulation.
An alternative hypothesis to explain a cell cycle role for proteins known to be located in membrane-bounded organelles other than the nucleus is that these proteins are also present outside of those organelles. For example, if a protein can be located in both the mitochondrion and the cytosol, then the cell cycle function of the protein can be exerted in the cytosol, but not in the mitochondrion, where it may fulfill a different role. Therefore, we analyzed alternative locations of the proteins in question. We identified 9, 5 and 15 core cell cycle proteins from the endoplasmic reticulum, Golgi apparatus and mitochondrion, respectively, that additionally have a cytosolic or nuclear localization. These proteins have an unusual combination of locations, and merit further investigation with respect to the molecular mechanisms underlying their ability to be localized to apparently incompatible locations. This also highlights the need to indicate when and where functions assigned to a protein are valid.
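The search for proteins with an additional cytosolic or nuclear localization amounts to a set intersection per organelle. The sketch below shows that logic in Python on invented data; it is not the SPARQL analysis used for the paper, and the protein names, their annotations and the helper `dual_localized` are all hypothetical.

```python
# Hypothetical annotations: protein -> set of cellular compartments.
# The protein names are placeholders, not CCO identifiers.
LOCATIONS = {
    "protA": {"mitochondrion", "cytosol"},
    "protB": {"mitochondrion"},
    "protC": {"Golgi apparatus", "nucleus"},
    "protD": {"endoplasmic reticulum"},
    "protE": {"cytosol"},
}

ORGANELLES = ("endoplasmic reticulum", "Golgi apparatus", "mitochondrion")
ALTERNATIVE = {"cytosol", "nucleus"}


def dual_localized(locations, organelle, alternative=ALTERNATIVE):
    """Proteins annotated to `organelle` and to at least one alternative site."""
    return sorted(name for name, sites in locations.items()
                  if organelle in sites and sites & alternative)


# Number of dual-localized proteins per organelle in the toy data.
counts = {org: len(dual_localized(LOCATIONS, org)) for org in ORGANELLES}
print(counts)
```

On real annotations the same intersection would be computed per organelle to obtain counts such as those reported above.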
Automated reasoning over bio-ontologies
Description logics and automated reasoners
Description Logics (DL) [89] and Semantic Web technologies [60,61] provide a foundation for the management and exploitation of knowledge in ontologies. The type of OWL used for CCO is based on DL, which is a family of logic-based knowledge representation formalisms that describe a domain in terms of concepts (classes), roles (properties or relationships) and individuals (instances). OWL-DL offers an optimal trade-off between expressivity and computational tractability [89]. OWL-DL can be considered sufficiently expressive to represent a wide variety of biomedical knowledge [90], while offering support for automated reasoning. It has become one of the standard languages for representing ontologies in the semantically strict form that supports automated reasoning.
DL reasoners are computational tools to: ensure that an ontology does not contain any contradictory facts (consistency checking); compute the subclass relations between all named classes to create the class hierarchy (classification); find the most specific classes to which an individual belongs (realization); and retrieve information from an ontology (querying).
Ontology curators can use DL reasoners to minimize term redundancy, while maintaining sufficiently detailed descriptions and consistency of the contents [18,19]. Moreover, reasoning tools can also be used to find new classes (either more specific or more general) [20]. Finally, and in this context most importantly, reasoning tools can also be used in biological research for information retrieval and the generation of new hypotheses that are consistent with the knowledge captured in the ontology.
Representing biological knowledge with OWL
OWL-DL queries can be more fine-grained than RDF queries since the semantic model of OWL-DL allows more expressivity. The OWL semantics is based on sets (classes) of instances (individuals). A class is a subclass of another class if and only if all the instances of the subclass are also instances of the superclass; the superclass may have other instances that do not belong to the subclass. For example, the well-known 'is a' hierarchy in GO is founded on this concept.
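The set-based reading of the subclass relation can be written down directly. In the following minimal Python sketch, classes are modelled as plain sets of individuals; the class names are used for illustration only and the individuals are placeholders.

```python
# Classes modelled as plain sets of individuals; the individuals are placeholders.
CELL_CYCLE = {"event_1", "event_2", "event_3"}
MITOTIC_CELL_CYCLE = {"event_1", "event_2"}


def is_subclass(sub, sup):
    """`sub` is a subclass of `sup` iff every instance of `sub` is in `sup`."""
    return sub <= sup


# Instances that belong to the superclass only.
extra = CELL_CYCLE - MITOTIC_CELL_CYCLE
print(is_subclass(MITOTIC_CELL_CYCLE, CELL_CYCLE), extra)
```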
Relationships in OWL-DL are interpreted as existing between pairs of individuals. Restrictions on classes define which and how many relationships the instances of that class must hold. When a restriction is defined, an anonymous class is defined (Figure 4, dotted shape), and the class to which the restriction is added becomes a subclass or equivalent class of that anonymous class. For instance, the restriction 'subClassOf part_of some Cell' in the class 'Nucleus' states that every instance of the class 'Nucleus' must have at least one relationship along the property 'part_of' to an instance of the class 'Cell'. Other quantifiers can be used in these restrictions, such as 'only', 'min', 'max' and 'value', as well as Boolean operators such as 'and', 'or' and 'not'.
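The meaning of the existential restriction 'part_of some Cell' can be sketched in Python as a check over explicitly listed individuals. Note that this is a closed-world simplification: an OWL reasoner works under the open-world assumption and, rather than reporting a missing link, would infer that some (possibly unnamed) instance of 'Cell' exists. All individual names and the helper `satisfies_some` are hypothetical.

```python
# Individuals with their asserted classes; all identifiers are hypothetical.
TYPES = {
    "nucleus_1": {"Nucleus"},
    "nucleus_2": {"Nucleus"},
    "cell_1": {"Cell"},
    "tissue_1": {"Tissue"},
}
# Asserted part_of links as (subject, object) pairs.
PART_OF = {
    ("nucleus_1", "cell_1"),
    ("nucleus_2", "tissue_1"),
}


def satisfies_some(individual, relation, filler, types):
    """True if `individual` has at least one link in `relation`
    to an instance of the class `filler`."""
    return any(subj == individual and filler in types.get(obj, ())
               for subj, obj in relation)


# Nucleus individuals lacking an explicit part_of link to a Cell.
missing = sorted(ind for ind, classes in TYPES.items()
                 if "Nucleus" in classes
                 and not satisfies_some(ind, PART_OF, "Cell", TYPES))
print(missing)
```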
If the restriction is added as a superclass of the class that is being defined (the class being defined is a subclass of the restriction, as in the example above), the restriction is known as a 'necessary condition'. A necessary condition is a condition that all the instances of the class must fulfill, but is not enough in itself to define class membership. Therefore, if an instance is found that has at least one 'part_of' relationship to 'Cell', it does not mean that it is a member of the class
Figure 4. OWL property (part_of) sample. The property 'part_of' links individuals belonging to a class (for example, 'Nucleus') to individuals of the class 'Cell'. A restriction of the type 'part_of some Cell' on the class 'Nucleus' defines an anonymous class (dotted shape), and implies that individuals belonging to the class 'Nucleus' are also 'part_of' an individual of the class 'Cell'.