ANALYZING AND MODELING LARGE
BIOLOGICAL NETWORKS: INFERRING SIGNAL
TRANSDUCTION PATHWAYS
by
Submitted in partial fulfillment of the requirements
for the Degree of Doctor of Philosophy
Electrical Engineering And Computer Science Department
CASE WESTERN RESERVE UNIVERSITY
January 2007
UMI Number: 3226720
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the dissertation of
candidate for the Ph.D. degree *
(signed) _
(chair of the committee)
to Gamze
1 Introduction
1.1 Background
1.1.1 Graph Theoretic Definitions
1.1.2 Signal Transduction Pathways
1.1.3 Protein-Protein Interactions
1.1.4 Discovery of Protein-Protein Interactions
1.2 Contributions
2 Evolutionary Models of Proteome Networks
2.1 Biological Networks
2.1.1 The Evolution of Protein-Protein Interactions
2.1.2 Random Network Models
2.1.3 Properties of Networks
2.2 Proteome Growth Model
2.3 Analysis of the Proteome Growth Model
2.3.1 Properties of the pure duplication model
2.3.2 On the degree distribution of the proteome growth model
2.4 Discussion
3 Enhanced Duplication Model
3.1 Sequence Similarity Distribution in the Yeast Proteome
3.2 Enhanced Model Based on Sequence Similarity
3.3 Discussion
4 Discovering Signaling Pathways: PathFinder
4.1 PathFinder
4.1.1 Preliminary
4.2 Methods
4.2.1 Mapping Proteins to Functional Annotations
4.2.2 Mining Association Rules from Known Pathways
4.2.3 Constructing a Weighted Protein-Protein Interaction Network
4.2.4 Searching for Pathway Segments
4.3 Experiments on the Yeast Proteome Network
4.4 Discussion
List of Tables
3.1 The average clustering coefficients of the DIP Protein-Protein Interaction Network, Proteome Growth Model, and the Enhanced Model
4.1 Binary Table Example
4.2 PathFinder Search Results
List of Figures
2-1 ℓ-hop
2-2 Percentage of singletons in the pure duplication model
2-3 Average degree of non-singleton nodes in pure duplication model
3-1 Degree distribution of the Yeast and the proteome growth model interaction networks
3-2 ℓ-hop degree distribution comparison of the Yeast and Proteome Growth Model
3-3 Distribution of pairwise sequence similarity of yeast proteins
3-4 Aggregate distribution of pairwise sequence similarity of yeast proteins
3-5 Enhanced Model Based on Sequence Similarity
3-6 Degree distribution of the proteome sequence similarity networks
3-7 Degree distribution of the interaction networks
3-8 ℓ-hop degree distribution of the yeast, proteome growth model and the sequence similarity enhanced model
4-1 MAP Kinase Pathways
4-2 PathFinder
4-3 Two interacting proteins and their linked annotation terms
4-4 Association Rule Mining Parameters
4-5 PathFinder Ste7-Dig2 Simple Path Results
4-6 PathFinder Ste7-Dig2 Signaling Pathway Segment Results
4-7 The Pheromone Response Signaling Pathway
4-8 The High Osmolarity Signaling Pathway
I will still be a supporter of Dr. Şahinalp after my graduation.
I am very grateful to Dr. Jiong Yang for agreeing to take over my advisory duties and helping me accelerate my studies. I appreciate his financial support during my last years and his guidance throughout my studies since he moved to Case. His guidance on finding interesting problems and accurate approaches should be mentioned here. I would also like to thank him for being my dissertation committee chair.
I would like to express my gratitude to Prof. Meral Özsoyoğlu and Prof. Tekin Özsoyoğlu for their help and guidance during these last five years. It has always been an inspiration to see their academic achievements. I would especially like to mention Prof. Tekin Özsoyoğlu's support and priceless advice during my last year of study.
I would like to thank Prof. Tekin Özsoyoğlu, Dr. Mark Adams, and Dr. Jing Li for being on my dissertation committee. I deeply appreciate their input to this dissertation and my research.
Soon after I met my wife, I was privileged to be introduced to the Wise family, to whom I am eternally indebted, as I have gained so much from them. I always feel welcome among them, and I am happy to make them proud by finishing this degree. Here, I would like to mention Mrs. Marilyn Wise for her support in every aspect of my life and for sharing her spiritual enlightenment with me. I appreciate her being my mother here in the United States. I would also like to acknowledge the moral support of Mr. Jonathon K. Wise and Ms. Cheryl Davis. Mr. Jonathon K. Wise has been a great role model, whom my wife and I respect and always look to for guidance.
I would like to acknowledge my lab friends, Can Alkan, Emre Karakoç, and Eray Tüzün. Although we have been separated by moves and graduations, they were a great support in this accomplishment. Also, I do appreciate Mr. Brendan Eliott for proofreading my dissertation.
Finally, I appreciate more than anything the support and understanding of my beautiful wife, Gamze, throughout my Ph.D. program. I cannot express enough how thankful I am for her encouragement, help and endless patience. Without her I would not have finished this study.
Gürkan Bebek, Ph.D.
August 2006
Analyzing and Modeling Large Biological Networks: Inferring Signal Transduction Pathways
Abstract by Gürkan Bebek
Large-scale two-hybrid screens have generated a wealth of information describing potential protein-protein interactions (PPIs). When interacting proteins are associated with each other to generate networks, a map of the cell, picturing potential signaling pathways and interactive complexes, is formed.
PPI networks satisfy the small-world property and their degree distributions follow a power law. Recently, duplication-based random graph models have been proposed to emulate the evolution of PPI networks and to satisfy these two graph theoretical properties.
In this work, we show that the previously proposed model of Pastor-Satorras et al. (2003) does not generate a power-law degree distribution with exponential cutoff as claimed and the more restrictive model by Chung et al. (2003) cannot be interpreted unconditionally. It is possible to slightly modify these models to ensure that they generate a power-law degree distribution. However, even after this modification, the more general ℓ-hop degree distributions achieved by these models, for ℓ > 1, are very different from that of the yeast proteome network. We address this problem by introducing a new network growth model taking into account the sequence similarity between pairs of proteins as well as their interactions. The new model captures the ℓ-hop degree distribution of the yeast PPI network for all ℓ > 0, as well as the immediate degree distribution of the sequence similarity network.
We further utilize the PPI networks to discover possible pathway segments. Discovering signal transduction pathways has been an arduous problem, even with the use of systematic genomic, proteomic and metabolomic technologies. The enormous amount of data, and how to interpret and process this data, becomes a challenging computational problem.
In this work we present a new framework to identify signaling pathways in PPI networks. Our goal is to find biologically significant pathway segments in a given interaction network. First, we discover association rules based on known signal transduction pathways and their functional annotations. Given a pair of starting and ending proteins, our methodology returns candidate pathway segments between these two proteins. These candidate pathway segments are further filtered by their gene expression levels. In our study, we used the S. cerevisiae interaction network and microarray data to successfully reconstruct signal transduction pathways in yeast.
Chapter 1
Introduction
Aristotle (384-322 B.C.) is known as the originator of the scientific study of life. Aristotle himself wrote around 146 books on the subject. Throughout the past 24 centuries of biological studies, there is no doubt that the greatest advances in this field have been made during the last century.
Sequencing of genomes is one of the key achievements towards understanding the cellular machinery and biological diversity. It is likely that the first organisms were unicellular prokaryotes. The diversity of life is derived through evolution, the process of amplification and mutation of genetic material, followed by varying proportions of chance and natural selection. Detrimental agents such as toxins, radiation, viruses, etc. may alter the genome sequence by point mutations, insertions, or deletions of nucleotides. Sometimes, these mechanisms modify the genetic material in favor of the organism, increasing its likelihood of survival, but most often its chances are decreased, or there is a negligible effect on the organism. These changes to the genomic content may lead to changes in cellular networks, which may have consequences on the tissue or sometimes the organism as a whole.
The significant technological progress and the completion of many genome sequencing projects, including that of the human, have provided us with a reasonably detailed view of the cell. From this new point of view, i.e., our new knowledge of cellular networks, we have the means to understand the principles underlying the dynamic behavior of cells. However, this will require integration of theoretical and experimental approaches at a variety of levels.
One of the biggest challenges awaiting scientists is correlating the genome with the proteome to explain its biological function. Exploiting the genome to understand life in both diseased and healthy states will make it possible to develop new therapeutic approaches. Moreover, these approaches should be able to mimic the behavior of the systems over a wide variety of conditions. To achieve this goal, any model should be based on and fully constrained by experimental data. Hence, the amount and quality of the available experimental data will determine the reliability of the model. Also, computational models should represent the biological systems as accurately as possible.
Today, scientists have access to DNA microarray technology that can simultaneously measure the mRNA expression responses of practically every gene under various conditions, producing hundreds of thousands of individual data points. Similarly, high-throughput yeast two-hybrid and mass spectrometry experiments have identified thousands of pairwise protein-protein interactions. A recent study in systems biology demonstrated that by integrating these diverse data types and assimilating them into biological models, one can predict cellular behaviors (Ideker et al., 2001). However, synthesizing these data into models of pathways and networks remains a significant challenge.
In other words, the use of systematic genomic, proteomic and metabolomic technologies has introduced new paths toward understanding these phenomena. However, these techniques lead to an enormous amount of data, and how to interpret and process this data is now a challenging computational problem. In short, we are asked to improve the quality of the data and predict the missing components, to analyze and model the dynamics of these phenomena, and to integrate this knowledge with other known biological data.
The molecular and genetic mechanisms underlying cell proliferation, differentiation, dynamics, and death, and their involvement in embryonic development and cancer, are still being studied. In general, little is known about which molecules are specific to these cell activities. Therefore, the characterization of molecular interactions and complexity is crucial in reaching a full understanding of these biological dynamics.
Here we are going to focus on the molecular level of activities in the cell. Among many cellular activities that the cell performs, signal transduction is the primary means by which cells coordinate their metabolic, morphological, and genetic responses to environmental cues such as growth factors, hormones, nutrients, osmolarity, and other chemical and physical stimuli. Thus, the analysis and discovery of these pathways can help us to understand the modifications of the cell on its way from a normal, healthy state to a transformed one, and finally to cancer generation or death.
The genetic information in a cell is converted into functional components, proteins, through the transcription and translation processes. Hence, the set of proteins in a cell (also called the proteome) can describe the underlying events and associations at a given state. A molecular-level view of a cell can be created by mapping these proteins with their interactions on a network, which refers to the associations of these protein molecules. These associations are important for many biological phenomena. For instance, signals from the exterior of a cell are mediated to the inside of that cell by interacting signaling molecules.
Traditionally, the discovery of the molecular components of signaling networks in organisms has relied upon the use of gene knockouts and epistasis analysis. Although these methods have been highly effective in generating detailed descriptions of specific linear signaling pathways, our knowledge of complex signaling networks and their interactions remains incomplete.
We are going to approach the identification problem mentioned above. It is desirable to have new computational methods that capture molecular details from high-throughput genomic and proteomic data in an automated fashion. In this work, we will utilize graph theoretic and data mining techniques to accomplish this goal. These techniques are an integral part of this research process, since they lead us to analyzing and linking this data to other information (e.g., functionality, regulation, etc.) by combining theoretical and experimental approaches. This analysis leads us to build efficient models that would predict those under-studied phenomena. For all these reasons, graph theoretic approaches are becoming an important part of computational biology.
In this work, we will first focus on analyzing known datasets and models of protein-protein interaction networks in model organisms. Understanding protein-protein interactions is important in investigating signaling pathways. We are going to present an in-depth analysis of the currently available models of protein-protein interaction networks and have a look at their properties. We will then use the results of our investigation on relationships among the elements of these networks to develop a new model for better understanding of the evolutionary processes.
Next, we will present our results on discovering signaling pathway segments. Using the underlying information in the signaling pathways and the protein-protein interaction networks that we have acquired previously (Bebek et al., 2006b), we will focus on discovering these smaller functional networks. In most one-celled organisms, the variety of signal transduction pathways influences the number of ways the cell can react and respond to its environment. Discovering signal transduction pathways has been a hard problem. Although we now have access to systematic genomic, proteomic and metabolomic technologies, the enormous amount of data we acquire through these technologies creates computationally challenging problems of interpreting and processing this data.
Here, we present a new framework to identify signaling pathways in protein-protein interaction networks. Given a protein-protein interaction network of an organism, we would like to discover biologically significant pathway segments. First, we reveal association rules based on known signaling pathways and their functional annotations. The methodology developed can successfully search for pathway segments on a given protein-protein interaction network. In our study, we have used the S. cerevisiae interaction network with microarray data and known signal pathways to develop and test our models.
In this chapter, we will first give basic graph theoretic definitions and then introduce relevant biological concepts. Then, a short summary of the results and an overview of the thesis will be presented.
1.1.1 Graph Theoretic Definitions
A graph (or network) is a set of objects called nodes or vertices connected by links called arcs or edges. A graph is usually denoted by G, or G(V, E), where V is the set of vertices (nodes) and E ⊆ V × V is the set of edges (arcs) connecting the vertices. The size of a graph is the number of its edges, i.e., |E|.
The most common type of graph is called a simple graph. In simple graphs, at most one edge (i.e., either one edge or no edges) may connect any two vertices. If multiple edges are allowed between vertices, the graph is known as a multigraph. Vertices are usually not allowed to be self-connected, but this restriction is sometimes relaxed to allow loops, where a loop is an edge whose end vertices are the same vertex. A graph that may contain multiple edges and graph loops is called a pseudograph. A subgraph of a graph G is a graph whose vertex and edge sets are subsets of those of G. A supergraph of a graph G is a graph that contains G as a subgraph.
A graph is directed if its edges are directed (pointing toward either one of the ends) and undirected otherwise. A graph is complete (or called a clique) if every node has a connecting edge to every other node. The complete graph on n vertices is often denoted by K_n, where K_n has n(n − 1)/2 edges.
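The edge count of a complete graph can be checked by enumeration. The following is a minimal sketch, assuming vertices labeled 0..n-1 and undirected edges stored as frozensets; the function names `complete_graph_edges` and `is_complete` are illustrative and do not come from this dissertation.

```python
from itertools import combinations


def complete_graph_edges(n):
    """Return the edge set of K_n on vertices 0..n-1 (each edge is a frozenset)."""
    return {frozenset(pair) for pair in combinations(range(n), 2)}


def is_complete(vertices, edges):
    """True if every pair of distinct vertices is joined by an edge."""
    return all(frozenset((u, v)) in edges for u, v in combinations(vertices, 2))


k5 = complete_graph_edges(5)
print(len(k5))                      # 5 * 4 / 2 = 10 edges
print(is_complete(range(5), k5))    # True
```

Storing each undirected edge as a frozenset means {u, v} and {v, u} are the same edge, which matches the count n(n − 1)/2 rather than n(n − 1).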
Nodes that share an edge are called adjacent. The degree of a node is the number of edges incident with the node, i.e., a measure of immediate adjacency. In directed graphs, the in-degree of a node is the number of edges ending at the node, and the out-degree is the number of edges beginning at the node. A vertex of degree zero is an isolated vertex. If E is finite, then the total sum of vertex degrees is equal to twice the number of edges. A degree sequence is a list of the degrees of a graph in non-increasing order. A non-increasing sequence of integers is realizable if it is a degree sequence of some graph.
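The degree definitions above can be illustrated on a small undirected simple graph. This is a sketch under stated assumptions: the example graph and all function names are hypothetical, and the realizability check uses the Havel-Hakimi procedure, a standard test for simple graphs that is not described in this text.

```python
def degrees(vertices, edges):
    """Map each vertex to its degree in an undirected simple graph (no loops)."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_sequence(vertices, edges):
    """Degrees listed in non-increasing order."""
    return sorted(degrees(vertices, edges).values(), reverse=True)


def is_realizable(seq):
    """Havel-Hakimi test: is seq the degree sequence of some simple graph?"""
    seq = sorted(seq, reverse=True)
    if any(d < 0 for d in seq):
        return False
    while seq and seq[0] > 0:
        d = seq.pop(0)              # take the largest remaining degree
        if d > len(seq):            # not enough other vertices to connect to
            return False
        for i in range(d):          # connect it to the next d largest
            seq[i] -= 1
            if seq[i] < 0:
                return False
        seq.sort(reverse=True)
    return True


vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

deg = degrees(vertices, edges)
print(sum(deg.values()) == 2 * len(edges))   # True: degree sum is twice |E|
print(degree_sequence(vertices, edges))      # [3, 2, 2, 1, 0]
print([v for v in vertices if deg[v] == 0])  # ['e'] is an isolated vertex
```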
The set of neighbors, called the (open) neighborhood N_G(v) for a vertex v in a graph G, consists of all vertices adjacent to v but not including v. When v is also included, it is called a closed neighborhood, denoted by N_G[v]. When stated without any qualification, a neighborhood is assumed to be open.
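Open and closed neighborhoods are straightforward to compute from an adjacency structure. The sketch below assumes the graph is stored as a dictionary mapping each vertex to the set of its adjacent vertices; the representation and function names are illustrative choices, not taken from this dissertation.

```python
def open_neighborhood(adj, v):
    """N_G(v): all vertices adjacent to v, excluding v itself."""
    return set(adj[v]) - {v}


def closed_neighborhood(adj, v):
    """N_G[v]: the open neighborhood together with v."""
    return open_neighborhood(adj, v) | {v}


# Hypothetical example graph: a triangle a-b-c, a pendant vertex d, an isolated e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

print(sorted(open_neighborhood(adj, "c")))    # ['a', 'b', 'd']
print(sorted(closed_neighborhood(adj, "c")))  # ['a', 'b', 'c', 'd']
```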
A path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the successor vertex. The length of a path is the number of edges that the path uses, counting multiple edges multiple times. On a path, the first vertex is called the start vertex and the last vertex is called the end vertex. Both of them are called end or terminal vertices of the path. The other vertices in the path are internal vertices.
If it is possible to establish a path from any vertex to any other vertex of a graph,
the graph is said to be connected; otherwise, the graph is disconnected.
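Whether a vertex sequence is a path, and whether an undirected graph is connected, can both be checked directly from an adjacency dictionary. The following sketch uses breadth-first search for the connectivity test; this is one common choice rather than a method prescribed by the text, and the names `is_path`, `path_length`, and `is_connected` are hypothetical.

```python
from collections import deque


def is_path(adj, seq):
    """True if each vertex in seq has an edge to its successor."""
    return all(b in adj[a] for a, b in zip(seq, seq[1:]))


def path_length(seq):
    """Number of edges used by the path."""
    return len(seq) - 1


def is_connected(adj):
    """True if every vertex is reachable from every other (undirected graph)."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(adj)


graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(is_path(graph, ["a", "c", "d"]))   # True, a path of length 2
print(is_path(graph, ["a", "d"]))        # False, a and d are not adjacent
print(is_connected(graph))               # True

graph_with_isolated = dict(graph, e=set())
print(is_connected(graph_with_isolated)) # False, e is unreachable
```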
A cycle is a path such that the start vertex and end vertex are the same. In a directed graph, the same concepts apply with the edges being directed from each vertex to its successor.
A path with no repeated vertices is called a simple path, and a cycle with no repeated vertices aside from the start/end vertex is a simple cycle. A simple cycle that includes every vertex of the graph is known as a Hamiltonian cycle. Two paths are independent (alternatively, internally vertex-disjoint) if they do not have any internal vertex in common. A cycle (trail) is Eulerian if it uses all edges precisely once. A graph that contains an Eulerian trail is traversable. A graph that contains an Eulerian cycle is an Eulerian graph.
A weighted graph is a graph in which each edge is given a numerical weight. A weighted graph is therefore a special type of labeled graph in which the labels are numbers. The weight of a path in a weighted graph is the sum of the weights of the traversed edges.
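The weight of a path follows directly from this definition. The sketch below assumes an undirected weighted graph whose weights are stored in a dictionary keyed by frozenset edges; the example weights and the function name `path_weight` are made up for illustration.

```python
def path_weight(weights, path):
    """Sum of the weights of the edges traversed by the vertex sequence `path`.

    Raises KeyError if two consecutive vertices are not joined by an edge.
    """
    return sum(weights[frozenset((a, b))] for a, b in zip(path, path[1:]))


# Hypothetical edge weights on a small undirected graph.
weights = {
    frozenset(("a", "b")): 2,
    frozenset(("a", "c")): 5,
    frozenset(("b", "c")): 1,
    frozenset(("c", "d")): 4,
}

print(path_weight(weights, ["a", "c", "d"]))       # 5 + 4 = 9
print(path_weight(weights, ["a", "b", "c", "d"]))  # 2 + 1 + 4 = 7
```

Note that the path with more edges has the smaller weight here, so path length and path weight can rank the same pair of paths differently.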
The distance d_G(u, v) between two (not necessarily distinct) vertices u and v in a graph G is the length of a shortest path between them. The subscript G is usually dropped when there is no danger of confusion. When u = v, the distance is 0. When u and v are unreachable from each other, their distance is defined to be infinity.
The eccentricity of a vertex v in a graph G is the maximum distance from v to any other vertex. The diameter of a graph G is the maximum eccentricity over all vertices in a graph, while the radius is the minimum.
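Distance, eccentricity, diameter, and radius can be computed for an unweighted undirected graph with one breadth-first search per vertex. This is a minimal sketch assuming an adjacency-dictionary representation; the function names are illustrative, and unreachable vertices are given distance infinity as in the definition above.

```python
import math
from collections import deque


def distances_from(adj, source):
    """Shortest-path lengths (in edges) from source to every vertex."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == math.inf:      # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def eccentricity(adj, v):
    """Maximum distance from v to any vertex."""
    return max(distances_from(adj, v).values())


def diameter(adj):
    """Maximum eccentricity over all vertices."""
    return max(eccentricity(adj, v) for v in adj)


def radius(adj):
    """Minimum eccentricity over all vertices."""
    return min(eccentricity(adj, v) for v in adj)


# Hypothetical example: the path graph a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(distances_from(chain, "a")["d"])  # 3
print(diameter(chain))                  # 3 (between the two end vertices)
print(radius(chain))                    # 2 (attained at b and c)
```

For a disconnected graph this sketch returns an infinite diameter, since some pair of vertices is at distance infinity.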
1.1.2 Signal Transduction Pathways
The molecular components involved in cellular signaling form signal transduction pathways. A signal transduction pathway (signaling pathway) in a cell is composed of the following events. First, a signaling molecule arrives outside of the cell and interacts with the receptor on the extracellular surface of the cell membrane. Next, the receptor interacts with intracellular pathway components, starting a cascade of protein interactions that propagates the signal within the cell. Finally, the signal arrives at its final destination, or molecular target, and evokes a functional response in the cell.
Important biotechnological advances in recent years have allowed increasingly detailed studies of a variety of signaling pathways. These advances include production of recombinant DNA, the Polymerase Chain Reaction (PCR) (Alberts et al., 2002), gel electrophoresis (Vincens and Tarroux, 1988), microarrays (DeRisi and Iyer, 1999), and the serial analysis of gene expression (SAGE) (Velculescu et al., 1995). The development of such techniques is still continuing, and large-scale assays of peptides and protein-DNA binding activity are becoming more feasible (Abbott, 2002). A signaling molecule may be a protein, small peptide, amino acid, nucleotide, steroid, retinoid, fatty acid derivative or a dissolved gas.
There are different types of signaling systems, and they differ in signal origin. Paracrine signaling is a form of cell signaling in which the target cell is close to the signal-releasing cell, and the signal chemical is broken down too quickly to be carried to other parts of the body. In mature organisms, paracrine signaling functions include responses to allergens, repairs to damaged tissue, formation of scar tissue, and clotting. Examples of paracrine agents are growth factors, somatostatin and histamine. In paracrine signaling, the signal originates from a nearby cell and, thus, the signal causes only localized effects.
Endocrine signaling molecules are called hormones. After hormones are released into the bloodstream, where they are present at very low concentrations, target cells with high-affinity receptors pull them out of the blood. The endocrine system links the brain to the organs that control body metabolism, growth and development, and reproduction. In endocrine signaling, hormones are secreted into the bloodstream and thus may be received by a cell some distance from the origin of the signal.
In synaptic signaling, a signaling molecule is released into a synaptic cleft from one neuron and received by another neuron. Here, synapses allow nerve cells to communicate with one another through axons and dendrites, converting electrical impulses into chemical signals.
Finally, a cell may send a signal to itself, which is known as autocrine signaling (Alberts et al., 2002). There are many types of signaling molecules and also many different receptors that may be present on a given cell at a given time. The set of receptors and the density and location of each receptor on the cell surface depend on cell type and on the current state and environment of the cell. Hence, the same stimulus will often cause different responses in different cells.
The main method of signal transduction occurs through structural changes of pathway components. A given protein will affect the conformation of one or several other proteins, activating or inhibiting those proteins and thus propagating the signal down the pathway (Alberts et al., 2002). The trigger for signal propagation often occurs with the binding of the signaling molecule to the receptor, which causes a conformational change in the receptor. Within the cell, signal propagation depends heavily on the actions of protein kinases and protein phosphatases.
Most of the intracellular portion of a signaling pathway is a cascade of protein phosphorylations and dephosphorylations. Each step leads to the activation or the inhibition of downstream events or feeds back on upstream events. The traditional view of signal transduction has been as a linear sequence of phosphorylation events proceeding from the cell surface to the ultimate intracellular target. However, it has become increasingly clear that the propagation of signals in the cell is not a simple chain of events, but a complex network of interacting pathways and regulatory feedback mechanisms (Neves and Iyengar, 2002).
The responses to signaling can include activation of enzyme activity, changes in cytoskeleton organization, changes in ion permeability, activation of DNA and/or RNA synthesis, as well as many other aspects of cell function (Alberts et al., 2002). Through such changes, signaling pathways can control cellular functions such as growth, maturation, proliferation, and differentiation. These vital functions suggest the importance of studying signal transduction pathways.
1.1.3 Protein-Protein Interactions
Proteins in the cell are polymers made up of a specific chain of amino acids. The cell reads the genetic information and uses it to construct these macromolecules through the transcription and translation processes. Proteins in the cell work together to achieve a particular function, and often physically associate with each other to function or to form a more complex structure.
Protein-protein interactions refer to the associations between protein molecules. These associations are important for many biological functions. For instance, signals from the exterior of a cell are mediated to the inside of that cell by interacting signaling molecules.
Protein-protein interactions might last for a long time to form part of a protein complex, or a protein may be carrying another protein. Moreover, a protein may interact briefly with another protein just to modify it, such as the phosphorylation of a target protein by a protein kinase. Interactions are important to most biological processes. Many proteins need to interact with other proteins to perform their functions properly. Thus, knowledge about the interacting proteins is crucial in the understanding of biological functions.
Model organisms, species that are extensively studied to understand biological phenomena, were the first to have their genomes sequenced. Among eukaryotes, several yeasts, particularly Saccharomyces cerevisiae ("baker's" or "budding" yeast), have been widely studied. Since the sequencing of S. cerevisiae (Goffeau et al., 1996), systematic genome-wide studies of protein interactions have been conducted on S. cerevisiae. After the publication of the S. cerevisiae genome sequence, several computational methods based on genomic context were developed for protein-protein interaction (PPI) prediction (Fields and Song, 1989, Gavin et al., 2002, Ho et al., 2002, Ito et al., 2001, Uetz et al., 2000).
Today, more than 400 genomes have been completely sequenced and more than 1700 projects are still in progress¹ (Liolios et al., 2006). The proteomes of these genomes have been at least partially mapped, but the functions of many proteins are unknown. Identification of the physical interactions in which these proteins participate may reveal their function.
1.1.4 Discovery of Protein-Protein Interactions
The physical interactions between proteins can be detected by the characterization of individual interactions. In the past, metabolic reactions were mostly identified through laborious studies of individual enzymes. Moreover, the function of a newly sequenced gene may be inferred from its homology to a protein with an identified function. However, in recent years, high-throughput studies have been developed in which protein-protein interactions may be identified through genome-wide techniques. As a result, in the past few years, the number of known protein-protein interactions has increased significantly. The two most important methods used in identifying protein-protein interactions are affinity purification followed by mass spectrometry, which is a common technique for identifying protein complexes (Gavin et al., 2002, Ho et al., 2002), and the yeast two-hybrid method used for identifying individual protein-protein interactions (Fields and Song, 1989, Ito et al., 2001, Uetz
¹ Refer to the Genomes Online Database for recent statistics of genome sequencing projects at http://www.genomesonline.org (Liolios et al., 2006).
gene will not be transcribed unless an activating domain is present, and the activating domain is only present if P1 interacts with P2. Therefore, a signal is observed only when P1 and P2 interact with each other.
The two-hybrid method was efficiently adapted for systematic large-scale studies. This technique has been used to study the entire proteome of S. cerevisiae (Ito et al., 2001, Uetz et al., 2000), Caenorhabditis elegans (Li et al., 2004, Walhout et al., 2000b) and Drosophila melanogaster (Giot et al., 2003). Although the yeast two-hybrid method is sensitive enough to detect transient as well as stable interactions, the method is not very accurate in detecting interactions. As many as 50-90% of the initially published interactions are probably erroneous (false positives) (Deane et al., 2002, Sprinzak et al., 2003). Moreover, Deane et al. (2002) showed through an analysis based on the agreement of the interactions and expression data that more than half of these interactions are biologically irrelevant.
In addition to these, there are a large number of known interactions between proteins which are missed in two-hybrid systems (false negatives) (Aloy and Russell, 2002). A two-hybrid false-negative rate of 45% was estimated by Walhout et al. (2000b) in their C. elegans study. The large number of false negatives is likely to be caused by proteins that only interact when certain activation signals have induced conformational changes in one or both of the interacting proteins (Ito et al., 2001). Also, the unnatural mechanism of fused proteins within a compartment, the nucleus, where most of the proteins do not naturally interact, is a likely cause for the absence of known protein-protein interactions in the two-hybrid screens. Finally, membrane protein interactions are unlikely to be detectable by the yeast two-hybrid (Y2H) method.
Another technique for physical interaction discovery is the affinity purification method. Affinity purification methods do not identify individual interactions between proteins, but are used to determine which proteins appear in complexes together. In tandem affinity purification (TAP) (Rigaut et al., 1999), some proteins are selected as baits, which are used to fish for the prey proteins that form a complex with the bait. In TAP, baits are fused with two affinity tags. The tags are used to attach the bait to an affinity chromatography column in two tandem steps. Throughout this methodology, stringent purification steps prevent detection of transient interactions within complexes. Hence, mostly stable complexes are found. Furthermore, the exact interactions between the proteins in the complexes detected by TAP have not been determined. For example, some of the proteins in the complexes interact directly with each other, but others are at the outskirts of the protein complex and are not in direct proximity with each other, although they are likely to be functionally related.
Affinity purification methods have been successfully applied to large scale studies on the proteome of S. cerevisiae (Gavin et al., 2002, Ho et al., 2002) as well. However, the overlap between the protein-protein interactions that are discovered with affinity purification methods and two-hybrid screens is small (Gavin et al., 2002, Ho et al., 2002). This is partially because the two methods complement each other (Aloy and Russell, 2002). This also shows that both methods have shortcomings of their own.
The high-throughput methods mentioned above are comprehensive and are not biased towards the expectations of individual researchers. However, the overlap of independent studies is fairly small (Uetz and Finley, 2005). For instance, the two independent yeast two-hybrid screens of S. cerevisiae had only ∼20% overlap (Ito et al., 2001). The low overlap between similar studies might be caused by limited coverage of the whole yeast interactome, or by false positives. Cornell et al. (2004) showed through their analysis that among these high-throughput methods, TAP is the most reliable method. By definition, the protein-protein interactions detected by two-hybrid screens are different in nature from those detected by low-throughput yeast experiments (a collection can be found at the Munich Information Center for Protein Sequences (MIPS) (Mewes et al., 1999)). Since there is little overlap between the interaction pairs produced by these methods, there is a need for reliable validation measures.
Interactions which have been identified in low-throughput experiments are considered more reliable. Although there are existing problems associated with high-throughput methodologies, studies to improve the outcome are still in progress. Recent technical improvements in pooling strategies indicate that the accuracy of high-throughput yeast two-hybrid screens could be significantly increased, while the number of screens is simultaneously decreased (Jin et al., 2006).
A protein-protein interaction that has been observed independently more than once is considered to be more reliable. However, false positives might be reproducible in some cases (Fields, 2005). Moreover, computational methods are frequently utilized to indicate the reliability of protein-protein interactions. For instance, functional annotations and sub-cellular localizations may provide an indication of the reliability of a particular protein-protein interaction, since proteins which are predicted to be active in the same sub-cellular location and have related functions are more likely to interact (Sprinzak et al., 2003). In addition, expression patterns for proteins in the same complex are expected to be correlated (Jansen et al., 2002). In other words, interacting proteins are often co-expressed (Grigoriev, 2003, Jansen et al., 2002). Furthermore, structural information (Aloy and Russell, 2002, Edwards et al., 2002) and functional annotations (Marcotte et al., 1999) can be used to validate protein-protein interactions.
Almost all of the interactions that are discovered through experiments are collected in public databases. Today, there is a growing number of public databases that present protein-protein interaction data for multiple organisms. The most comprehensive databases include the Munich Information Center for Protein Sequences (MIPS) (Mewes et al., 1999), the Database of Interacting Proteins (Xenarios et al., 2002), the Biomolecular Interaction Network Database (BIND) (Bader et al., 2003), the BioGRID General Repository for Interaction Datasets (Stark et al., 2006), the Molecular Interaction Database (MINT) (Zanzoni et al., 2002), and the Online Predicted Human Interaction Database (Brown and Jurisica, 2005).
The discovery of the protein-protein interaction network topology (Jeong et al., 2001, Wagner, 2001) accelerated efforts to better understand the growth of these networks. These networks drew more attention after the observations of shared topological properties with many other networks (Aiello and Chung, 2001, Aiello et al., 2000, Berger et al., 2003, Bollobás et al., 2003, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999). Since then, developing evolutionary network models that successfully reproduce the growth of proteome networks has been a great challenge (Bhan et al., 2002, Pastor-Satorras et al., 2003, Vazquez et al., 2003). To accomplish this task, known biological theories and empirical studies were taken into account to develop such models. The most promising model developed, which is named in Pastor-Satorras et al. (2003) as the proteome growth model, was described independently in Bhan et al. (2002), Pastor-Satorras et al. (2003), and Vazquez et al. (2003).
The proteome growth model is based on Ohno’s theory of genome growth (Ohno, 1970). In this model, the two underlying mechanisms for genome evolution are gene duplication and point mutations. In terms of gene functionality, after a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. The proteome growth model emulates these processes by growing a network via node duplications and then modifying the connectivity of the nodes by mechanisms that reflect point mutations.
Through analysis of this model, different claims were made on what degree distribution this model would generate. Earlier network generation models used for emulating the growth of similar networks were known to follow a power-law degree distribution. Pastor-Satorras et al. (2003) showed that the proteome growth model would generate a degree distribution that would follow a power-law with exponential cutoff. In another study, using a more restrictive model, Chung et al. (2003) showed that these networks would follow a power-law degree distribution. Here, we further investigate the degree distribution generated with the proteome growth model. First, we analyze the proteome growth model of Pastor-Satorras et al. (2003), and show that this model does not generate the power-law degree distribution with exponential cutoff as claimed and the more restrictive model by Chung et al. (2003) cannot be interpreted unconditionally.
Analyzing the networks, we observed that global features of networks, such as the degree distribution, might actually be misleading. In this work, we also introduce a new measure called ℓ-hop for network comparison.
We address the proteome growth model’s shortcomings through the more general ℓ-hop degree distribution for ℓ > 1. We make more observations over the model, and further study the original basis of the model, i.e. Ohno’s theory of genome growth. We then introduce a new network growth model that takes into account the sequence similarity between pairs of proteins (as a binary relationship) as well as their interactions. The new model captures not only the ℓ-hop degree distribution of the yeast protein interaction network for all ℓ > 0, but also the immediate degree distribution of the sequence similarity network, which again seems to follow a power-law.
We further utilize protein-protein interaction networks to discover possible pathway segments. Protein-protein interactions of an organism may lay out the proteins by their functional relationships. These associations improve our understanding of cellular functions and help identify unknown proteins and their functions. Techniques like the two-hybrid system (Giot et al., 2003, Ito et al., 2001, Li et al., 2004, Reboul et al., 2003, Uetz et al., 2000) or affinity purification followed by mass spectrometry (Gavin et al., 2002, Ho et al., 2002) were developed to uncover physical interactions between proteins. These experiments identify only a small fraction of the total protein-protein interaction network (Bader and Hogue, 2002, Edwards et al., 2002, Grigoriev, 2003, Ito et al., 2002, von Mering et al., 2002, Walhout et al., 2000a,b). There are many studies in which signaling pathways were modeled using various approaches. Previously, signaling pathways were modeled as modular kinetic simulations of biochemical networks (Neves and Iyengar, 2002) and by detailed integration of biochemical properties of the pathways (Choi et al., 2004). In another recent study, Bayesian networks were applied to multi-variable cell data to infer signaling pathways (Sachs et al., 2005).
In this thesis, we present a new framework, called PathFinder, to identify signaling pathways in protein-protein interaction networks. To find biologically significant pathway segments in a given interaction network, we first discover association rules based on known signal transduction pathways and their functional annotations. Given a pair of starting and ending proteins, our methodology returns candidate pathway segments between these two proteins. These candidate pathway segments are further filtered by their gene expression levels. In our study, we use the S. cerevisiae interaction network with microarray data and are able to successfully reconstruct signal transduction pathways of yeast.
The rest of the thesis is organized as follows. In Chapter 2 we analyze the evolutionary models of proteome networks. Using our analysis, in Chapter 3 we discuss our observations about evolutionary models and protein-protein interaction networks and introduce an enhanced duplication model based on protein sequence similarity. In Chapter 4 we focus on discovering signaling pathways utilizing other biological networks and present our experimental results carried out on S. cerevisiae datasets. Final conclusions are drawn in Chapter 5.
graphs (Ferrer i Cancho and Solé, 2001), neural nets (Watts and Strogatz, 1998), etc. These two properties cannot be observed in the classical random graph models studied by Erdős and Rényi (Erdős and Rényi, 1959), in which the edges between pairs of nodes are determined independently. However, it is possible to generate graphs that satisfy these properties by an iterative process that adds one new node to the graph at each step (Aiello and Chung, 2001, Aiello et al., 2000, Berger et al., 2003, Bollobás et al., 2003, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999). The new node is then connected to some b (b can be a constant or an independent random variable) of the existing nodes, each of which is chosen with probability proportional to its degree. Unfortunately such a preferential attachment model does not capture the essence of the genome evolution and hence cannot be used to model proteome networks.
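The iterative process described above can be illustrated with a minimal sketch. It assumes a constant b, a seed clique on b + 1 nodes, and rejection of repeated targets so that each new node receives exactly b distinct neighbors; the function name and these choices are illustrative and are not taken from the cited models.

```python
import random

def preferential_attachment(n_nodes, b, seed=None):
    """Grow an undirected graph one node per step; each new node links to
    b existing nodes drawn with probability proportional to their degree."""
    rng = random.Random(seed)
    # Assumed seed graph: a clique on b + 1 nodes, so every degree is positive.
    adj = {v: set(range(b + 1)) - {v} for v in range(b + 1)}
    # A node appears in `endpoints` once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    endpoints = [v for v in adj for _ in adj[v]]
    for new in range(b + 1, n_nodes):
        targets = set()
        while len(targets) < b:          # redraw until b distinct targets
            targets.add(rng.choice(endpoints))
        adj[new] = set()
        for u in targets:
            adj[new].add(u)
            adj[u].add(new)
            endpoints.extend((new, u))
    return adj

graph = preferential_attachment(200, 2, seed=1)
print(max(len(nbrs) for nbrs in graph.values()))  # largest degree (a hub)
```

Storing one list entry per edge endpoint avoids recomputing degree weights at every step; redrawing duplicates slightly perturbs exact proportionality, which is acceptable for an illustration.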
The structure of the yeast protein-protein interaction network seems to reveal two interesting graph theoretic properties (Jeong et al., 2001, Wagner, 2001): (1) The degree distribution of the nodes (i.e. the proportion of nodes with degree k as a function of degree) approximates a power-law (i.e. is approximately ck^(-b) for some constants c, b). (2) The graph exhibits the small world effect.
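As a small illustration of the first property, the following sketch computes an empirical degree distribution from an adjacency-set representation and evaluates the power-law form for given constants. The representation, the function names, and the toy star graph are assumptions made for this example only.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: fraction of nodes with degree k} for an adjacency-set graph."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    n = len(adj)
    return {k: counts[k] / n for k in sorted(counts)}

def power_law(k, c, b):
    """Value of the power-law form c * k**(-b) for a degree k >= 1."""
    return c * k ** (-b)

# Toy graph: a star with one hub (node 0) and four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(degree_distribution(star))  # {1: 0.8, 4: 0.2}
```

Comparing the empirical fractions with power_law(k, c, b) on a log-log scale is one common, if rough, way to judge how closely a network approximates a power-law.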
According to Ohno’s model (Ohno, 1970), the two underlying mechanisms for genome evolution are gene duplication and point mutations. In terms of gene functionality, after a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. Two possible evolutionary reasons for keeping both copies can be (1) selection for increased levels of expression, or (2) divergence of gene function (Nadeau and Sankoff, 1997, Seoighe and Wolfe, 1999b). In this framework, functional divergence can be produced through complementary degeneration, where each daughter gene retains only a subset of the functions of the parent, or (rarely) if one daughter acquires a new function (Force et al., 1999). Although the duplicated regions of the genomes have been described and listed before (for instance S. cerevisiae (Seoighe and Wolfe, 1999a, Wolfe and Shields, 1997)), there is no known scheme for how duplications formed the current shape of the genomes. Recent work, thus, has focused on random graph models that grow via node duplications and get modified by mechanisms that emulate point mutations.
Among these studies, the most promising one, which is named in Pastor-Satorras et al. (2003) as the proteome growth model, was described independently in Bhan et al. (2002), Pastor-Satorras et al. (2003) and Vazquez et al. (2003). In this model, the network grows in iterations. The model starts with a set of connected vertices of size N0. In each iteration, a gene or an associated protein, represented by a node, is chosen uniformly at random and is duplicated with all of its edges. After the duplication step, to emulate mutations there is the divergence step. Each edge of the new node is deleted with probability q (= 1 − p), followed by inserting edges between the new node and every other node with probability r/t, where t is the total number of nodes and r is a constant. In Pastor-Satorras et al. (2003), by adjusting the parameters q and r and using a small seed graph (N0 = 5), the proteome growth model was used to approximate the degree distribution of the yeast proteome network.
The first serious study to formally analyze the degree distribution of the proteome growth model was by Pastor-Satorras et al. (2003), who claimed that the degree distribution of both the general yeast proteome network and the proteome growth model is a power-law with exponential cut-off. This means that the fraction of nodes with degree k among all nodes is independent of time and is approximated by f_k = ck^(-b) · a^(-k); here a, b, c are constants. However, they make a number of simplifying assumptions in their analysis to get this result. For instance, they approximate the probability for generating a node with degree k by the probability of duplicating a node with degree k + 1 only and subsequently deleting a single edge. This assumption also reduces the number of singletons. They further approximate this probability with a function linear in k.
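The growth process analyzed here (duplication, edge deletion with probability q, and edge insertion with probability r/t) can be sketched as follows. This is an illustrative reading of the description, not the implementation of the cited papers: the seed graph is assumed to be a path on N0 nodes, t is taken as the number of nodes before the new node is added, and the function name is hypothetical.

```python
import random

def proteome_growth(n_nodes, q, r, n0=5, seed=None):
    """Simulate the duplication/divergence process on an undirected graph."""
    rng = random.Random(seed)
    # Assumed seed graph: a connected path on n0 nodes (labels 0 .. n0 - 1).
    adj = {v: set() for v in range(n0)}
    for v in range(n0 - 1):
        adj[v].add(v + 1)
        adj[v + 1].add(v)
    for new in range(n0, n_nodes):
        t = len(adj)                   # node count before this step (assumption)
        parent = rng.randrange(t)      # node to duplicate, chosen uniformly
        # Divergence: each copied edge is deleted with probability q.
        kept = {u for u in adj[parent] if rng.random() >= q}
        # Mutation: an edge to each existing node appears with probability r / t.
        added = {u for u in adj if rng.random() < r / t}
        adj[new] = kept | added
        for u in adj[new]:
            adj[u].add(new)
    return adj

net = proteome_growth(1000, q=0.5, r=0.0, seed=3)   # r = 0: pure duplication
singletons = sum(1 for nbrs in net.values() if not nbrs)
print(singletons / len(net))                        # fraction of singleton nodes
```

Setting r = 0 gives the pure duplication special case, which makes it easy to watch the fraction of singletons as the number of nodes grows.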
A more recent analysis of the degree distribution of the proteome growth model, for the special case that r = 0, is given by Chung et al. (2003). As per Chung et al. (2003), we will refer to this special case as the pure duplication model. In contrast to Pastor-Satorras et al. (2003), Chung et al. (2003) claim that the fraction of nodes with degree k is independent of time and is of the form f_k = ck^(-b); here b is a function of p = 1 − q and values of b ≤ 2 are possible for some p. The pure duplication model creates singleton nodes, i.e. nodes that are not connected to any other node of the graph. Since a node can only get a new edge if one of its neighbors is copied, a singleton will remain a singleton during the whole graph generation process. Note that in this model all non-singleton nodes form one connected component.
In a separate work, van Noort et al. (2004) show that the gene coexpression network in S. cerevisiae has scale-free and small-world network properties. By using the homology relations between the genes in the coexpression network, they present a model which can generate networks with similar scale-free and small-world properties. The model starts with a number of genes which have a number of transcription factor binding sites (TFBSs), and genes sharing a minimum number of TFBSs are considered coexpressed. At every time step, each gene can be duplicated or deleted with certain probabilities. Also, a TFBS of a gene can be deleted, or a new TFBS from another gene can be acquired by a gene, with certain corresponding probabilities. In contrast to other approaches, van Noort et al. (2004) consider deleting or inserting a TFBS of the gene, which deletes a set of connections or adds a set of links to the gene. Hence, in their approach, the connections of genes were considered in groups. van Noort et al. (2004) claim that the model generates a degree distribution with a slope similar to the coexpression network of S. cerevisiae¹. Additionally, the average clustering coefficient² and the shortest path length of the networks were compared. Although these measures help in understanding the topology of a network, they are not sufficient to claim that two networks are similar.
¹ Numerical results were not presented in van Noort et al. (2004). Hence, the simulation results given leave some question about how close the degree distribution, i.e. the power-law exponent, was.
² The clustering coefficient of a node is the ratio between the actual number of edges between neighbors of a node and the maximum possible number of edges between these neighbors. The average clustering coefficient of a network is the average of the clustering coefficients over all nodes in the network (Watts and Strogatz, 1998).
There is also another study presented by Przulj et al. (2004), in which a different approach for modeling these networks has been studied. Przulj et al. (2004) claim that a random geometric model better captures the currently accepted protein-protein interaction networks. A geometric disc graph is formed by connecting two nodes of the graph with an edge if their distance in the metric space is smaller than a certain threshold. Przulj et al. (2004) argue that the scale-free property of the proteomes is a result of the noise in the currently available data and that the degree distribution of such networks should follow the Poisson distribution. By counting the number of different motifs in the networks, they form a measure of local network structure and use this to compare different models with the available proteomes. According to the experiments they carried out, a three dimensional geometric disc graph with the same number of nodes but six times the number of edges has a similar number of motifs to the proteomes they worked on. Although the network motifs considered capture local properties of the networks, in their work, Przulj et al. (2004) (1) do not take into account Ohno’s theory (Ohno, 1970), which states that the proteome network should be generated through a process which attributes the genome sequence growth and evolution to subsequent gene duplications followed by mutations on the gene sequences, and (2) do not consider global properties of the networks, such as the average degree or the degree distribution, before drawing conclusions. Moreover, the work presented has vague descriptions on how scale-free networks are formed. For instance, there are many models available that can generate scale-free networks, but not every scale-free network necessarily is generated by emulating proteome network growth, i.e. duplication and divergence.
The most recent study, presented by Ispolatov et al. (2005), focuses on duplication-divergence models with completely asymmetric divergence. In a completely asymmetric divergence process, links are removed from the duplicated node only. In their study, Ispolatov et al. (2005) examine this model, where the evolution is characterized by a single parameter, the link retention probability. They claim that this single-parameter duplication-divergence network growth model can approximate the degree distribution of real protein-protein interaction networks. Although their model generates similar degree distributions, in reality the network lacks the local structure similarity. For instance, this model would not generate any triangular subgraphs (a clique of three in the network) since the duplication would generate cycles of even length or degree one nodes. However, cycles of any size exist in vast numbers in the real proteome network.
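The absence of triangles can be checked empirically with a short sketch. It assumes a single-edge seed graph, no link between a duplicate and its parent, and a clustering coefficient of 0 for nodes with fewer than two neighbors; the function names are illustrative, and the average clustering coefficient follows the Watts and Strogatz (1998) definition used earlier in this section.

```python
import random
from itertools import combinations

def asymmetric_duplication(n_nodes, retain, seed=None):
    """Duplicate a random node; the copy keeps each parent edge with
    probability `retain` and is not linked to the parent itself."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}             # assumed triangle-free seed: one edge
    for new in range(2, n_nodes):
        parent = rng.randrange(len(adj))
        kept = {u for u in adj[parent] if rng.random() < retain}
        adj[new] = kept
        for u in kept:
            adj[u].add(new)
    return adj

def count_triangles(adj):
    """Number of 3-cliques; each is seen once from each of its three corners."""
    seen = sum(1 for nbrs in adj.values()
               for a, b in combinations(nbrs, 2) if b in adj[a])
    return seen // 3

def average_clustering(adj):
    """Mean over nodes of (edges among neighbors) / (possible such edges)."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)         # convention assumed for degree < 2
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)

net = asymmetric_duplication(500, retain=0.6, seed=7)
print(count_triangles(net), average_clustering(net))  # 0 0.0
```

A copy's neighbors are a subset of its parent's neighbors, and in a triangle-free graph no two neighbors of a node are adjacent, so no step can close a triangle; the average clustering coefficient therefore stays at zero regardless of the retention probability.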
In these studies, the protein-protein interactions identified by high-throughput yeast two-hybrid screens or inferred from mass spectrometry of coimmunoprecipitated protein complexes were considered. However, analysis based on the agreement of the interaction and expression data shows that fewer than half of these interactions are biologically relevant (Deane et al., 2002). In a recent study, Han et al. (2005) showed that low coverage makes determination of the true topology of the network difficult. Han et al. (2005) also showed, by sampling the real network through these experiments (since the experiments only reveal partial networks), that regardless of the topology of the network that we are looking for, the topology of the sub-network that is sampled would have a degree distribution similar to a power-law. In other words, according to these experiments, it is not clear whether the proteome network follows a power-law degree distribution or not. However, in this work, we assume that the proteome network should be generated through a process which attributes the genome sequence growth and evolution to subsequent gene duplications followed by mutations on the gene sequences. Previously, it has been shown that this process would generate a network that follows a power-law degree distribution (Bhan et al., 2002, Pastor-Satorras et al., 2003, Vazquez et al., 2003). Moreover, we show that the degree distribution of the proteome growth model follows a power-law.
In this chapter, first in Section 2.1 we introduce biological networks and then focus on the evolution of protein-protein interaction networks. We briefly describe topological properties of networks in Section 2.1.3. Next, in Section 2.1.2 we introduce random network models that are studied widely for modeling large networks. In Section 2.2 the Proteome Growth Model of Pastor-Satorras et al. (2003) is introduced.
Our specific contributions are presented in the following sections. We first show in Section 2.3 that the (expected) proportion of singletons generated by the pure duplication model (r = 0) grows in time. In fact, the only limiting (time independent) solution is f_0 = 1 and f_k = 0 for all k > 0. Note that for the case p = q = 0.5 the average degree of nodes in the pure duplication model does not change over time (see Lemma 3). Together with the fact that the fraction of singletons increases in time, this implies that (i) the average degree of non-singletons must increase in time and (ii) there is a single connected component of size o(t) with increasing average degree. It is quite possible that this connected component of the network generated by the pure duplication model exhibits a power-law degree distribution with parameter b ≤ 2; however, this is difficult to establish.
In the rest of Section 2.3, we show that the degree distribution of the proteome growth model (in fact, any random model based on duplications) does not follow a power-law with exponential cut-off as claimed in Pastor-Satorras et al. (2003). We achieve this by showing a bound for the maximum degree of the proteome growth model and contrasting it with that of a network which exhibits power-law with exponential cut-off.
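The remark above that the average degree is unchanged for p = q = 0.5 in the pure duplication model can be checked with a short expected-value calculation. This is only a sketch and is not necessarily the argument used for Lemma 3; the symbols G_t (the graph on t nodes), d_v (degree of node v) and \bar{d}_t (average degree of G_t) are notation introduced here.

```latex
% One step of pure duplication (r = 0): a node v is chosen uniformly among the
% t nodes of G_t, and each of its d_v edges is copied with probability p.
\begin{align*}
\mathbb{E}\left[\text{new edges} \mid G_t\right]
  &= p \cdot \frac{1}{t} \sum_{v} d_v = p\,\bar{d}_t, \\
\mathbb{E}\left[\bar{d}_{t+1} \mid G_t\right]
  &= \frac{t\,\bar{d}_t + 2p\,\bar{d}_t}{t+1}
   = \bar{d}_t \cdot \frac{t + 2p}{t + 1}.
\end{align*}
% For p = 1/2 the factor (t + 2p)/(t + 1) equals 1, so the expected average
% degree is unchanged; for p < 1/2 it shrinks and for p > 1/2 it grows.
```

Each new edge adds two to the total degree, which is where the factor 2p comes from.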
A network (graph) is a collection of points, called nodes or vertices, together with the arcs connecting these points, called edges. (Refer to Section 1.1.1 for definitions.)
Biological networks, representations of biological relationships, have been constructed to describe various biological phenomena. These networks vary from networks describing the biochemical wirings of the cell to higher level networks such as neuronal networks or the food web. Recent studies on analysis of genomes increased the number and importance of cellular networks. The most common cellular networks are described below.
A metabolic network is a network of pathways where metabolic substrates and products are connected with directed edges. These arcs indicate that a metabolic reaction acts on a given substrate and produces a given product. Studying metabolic networks allows for in-depth insight into the molecular mechanisms of a particular organism (Francke et al., 2005). Examples of various metabolic pathways include glycolysis, the Krebs cycle, the pentose phosphate pathway, etc. In simplified terms, the construction of a metabolic network involves collecting all of the relevant metabolic information of an organism and then compiling it into a network that makes sense for various types of analysis to be performed. The correlation between the genome and metabolism is made by searching gene databases, such as KEGG (Kanehisa and Goto, 2000), or for particular genes by inputting enzyme or protein names. In short, metabolic networks are powerful tools for studying and modeling metabolism.
A genetic regulatory network (also called a GRN or gene regulatory network) describes gene expression, i.e. the production of proteins from the genomic code by transcription and translation. Expression of a gene can be controlled by the presence of other activating or inhibiting proteins, and thus the genome forms a switching network with nodes representing proteins and directed edges representing dependence of protein production on other proteins. In other words, genetic regulatory networks are the on-off switches and rheostats of a cell operating at the gene level. They dynamically orchestrate the level of expression for each gene in the genome by controlling whether and how vigorously that gene will be transcribed into RNA. Each RNA transcript then functions as the template for synthesis of a specific protein by the process of translation.
Likewise, the transcriptional (regulation) network can be represented as a directed graph. Transcriptional interactions show the relationships between transcription factors and the operons they regulate. In transcriptional (regulation) networks, each node represents an operon, a group of contiguous genes that are transcribed into a single mRNA molecule, and edges represent direct transcriptional interactions. Each edge is directed from an operon that encodes a transcription factor to an operon that is regulated by that transcription factor (Shen-Orr et al., 2002).
Finally, protein-protein interaction networks represent undirected interactions among proteins. In other words, a protein-protein interaction network (interactome) is a graph in which each node represents a protein and each (undirected) edge represents an interaction. A graph including all proteins in an organism and all possible interactions between these proteins can be called the proteome network of that organism. The interactions in these networks are important to most biological processes, since many proteins need to interact with other proteins to perform their functions properly. Hence, knowledge about the interactions between proteins is crucial for understanding biological functions.
In this work, we focus on models developed for generating protein-protein interaction networks. A protein-protein interaction network of an organism lays out the proteins by their functional relationships. This improves our understanding of cellular functions. We would like to further understand the underlying forces that have generated these networks by using network generation models. In the following section, the evolution of protein-protein interaction networks will be explained in detail. We are going to use these evolutionary processes for further development of network models.
2.1.1 The Evolution of Protein-Protein Interactions
The complete genome analysis of model organisms showed how gene and genome duplication events have shaped genomes over time. Remarkably, 30% of the Saccharomyces cerevisiae genome, 40% of that of Drosophila melanogaster, 50% of that of Caenorhabditis elegans, and 38% of the human genome are composed of duplicated genes (Li et al., 2001, Rubin et al., 2000). According to Ohno’s theory (Ohno, 1970), such duplication events should have provided genetic raw material, a source of evolutionary novelties, that could have led to the emergence of new genes and functions through mutations followed by natural selection. Recently, there has been an enormous increase in genomic knowledge. However, the patterns by which gene duplications might give rise to new gene functions over the course of evolution have not been completely understood. This is mainly due to the fact that there are very few ways of experimentally investigating the evolution of function in duplicated genes.
The two underlying mechanisms for genome evolution are gene duplication and point mutations followed by natural selection (Ohno, 1970). After a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. Two possible evolutionary reasons for keeping both copies can be (1) selection for increased levels of expression, or (2) divergence of gene function (Nadeau and Sankoff, 1997, Seoighe and Wolfe, 1999b). In this framework, functional divergence can be produced through complementary degeneration, where each daughter gene retains only a subset of the functions of the parent, or (rarely) if one daughter acquires a new function (Force et al., 1999). Although the duplicated regions of the genomes have been described and listed before (for instance S. cerevisiae (Seoighe and Wolfe, 1999a, Wolfe and Shields, 1997)), there is no certain scheme to explain how duplications formed the current shape of the genomes.
Moreover, closely related organisms share similar proteins. Hence, the interactions among these proteins are also preserved throughout different organisms. It has been observed that many of the interactions present in yeast appear to also be present in C. elegans, although the protein-protein interactions of the eukaryotic intracellular parasite Plasmodium falciparum show little similarity with the other eukaryotes (Suthram et al., 2005). Understanding how protein interactions evolve would improve our understanding of the evolution of new functions.