
University of New Orleans

ScholarWorks@UNO

University of New Orleans Theses and Dissertations

Dissertations and Theses

Summer 8-4-2011

SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways

Thair Judeh

University of New Orleans, tjudeh@uno.edu

Follow this and additional works at: https://scholarworks.uno.edu/td

Part of the Computer Sciences Commons

Recommended Citation

Judeh, Thair, "SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways" (2011). University of New Orleans Theses and Dissertations. 463.

https://scholarworks.uno.edu/td/463

This Thesis-Restricted is protected by copyright and/or related rights. It has been brought to you by ScholarWorks@UNO with permission from the rights-holder(s). You are free to use this Thesis-Restricted in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself.


SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways

A Thesis

Submitted to the Graduate Faculty of the

University of New Orleans

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

Bioinformatics

by

Thair Judeh

B.S. Loyola University New Orleans, 2005

August, 2011


on many interesting research projects. I also thank the Research Institute for Children and Tulane University for the generous funding they have provided in supporting the research that Dr. Zhu and I undertook, and the Department of Computer Science at UNO for providing me with an assistantship to support my graduate studies.

Special thanks are due to my family. I thank my mother, who has always sought to instill in my siblings and me a sense of responsibility. I thank my father, who sacrificed greatly to ensure the quality of the education that I received throughout my life. Finally, I thank my beloved wife Honida, who has constantly pushed me to excel in my research and in life in general.


Table of Contents

List of Figures
Abbreviations
Abstract
Chapter 1: Background and Introduction
Chapter 2: Network Reconstruction
2.1 Bayesian Networks
2.2 Frequency Method
2.3 LPA
2.3.1 Preprocessing
2.3.2 Sorting
2.3.3 Growth
2.3.4 Pruning
2.3.5 Intersection
Chapter 3: Network Partitioning
3.4 Kernighan-Lin Algorithm
3.5 Girvan-Newman Algorithm
3.6 Clique Percolation Method
Chapter 4: SEA
4.7 Related Work
4.7.1 GenMAPP
4.7.2 The Work of Chen et al.
4.7.3 COSINE
4.8 Goals and Original Contributions
4.9 Pathway Extraction
4.10 Retrieving NCBI Gene IDs
4.11 Decomposing the Pathways
4.11.1 Signal Cascades
4.11.2 Nonlinear Regulatory Modules
4.12 User Input
4.13 Scoring the Sub-pathways
4.14 The Graphical User Interface (GUI)
4.14.1 Updating the List of Organisms
4.14.2 Selecting or Updating an Organism
4.14.3 Loading Profile Data
4.14.4 Selecting a Subset of Sub-pathways
4.14.5 Ranking the Sub-pathways
4.14.6 Viewing Results
4.14.7 Saving and Loading Results
4.15 Conclusions
References
Vita


List of Figures

1.1 The Big Picture
2.1 LPA Problem Statement
2.2 LPA Input Generation
2.3 Transpose Problem
2.4 LPA Overview
2.5 LPA Growth Stage
2.6 LPA Pruning Stage
2.7 LPA Intersection Stage
3.1 Two Communities
3.2 Directed Versus Undirected Communities
3.3 Zachary’s Karate Network
3.4 Dendrogram
3.5 A CPM Illustration
3.6 Directed Cliques
4.1 SEA Overview
4.2 GenMAPP Illustration
4.3 Duplicates in KEGG Pathways
4.4 Root to Leaf Linear Path Illustration
4.5 SEA Quick Start Guide
4.6 SEA Interface
4.7 SEA Output

Abbreviations

API Application Programming Interface

BIC Bayesian Information Criterion

BNT Bayes Net Toolbox

COSINE COndition-SpecIfic sub-Network

CPD Conditional Probability Distribution

CPM Clique Percolation Method

CPT Conditional Probability Table

DAG Directed Acyclic Graph

DFS Depth First Search

DNA DeoxyriboNucleic Acid

FTP File Transfer Protocol

GenMAPP Gene Map Annotator and Pathway Profiler

GSGS Gene Set Gibbs Sampler

GUI Graphical User Interface

KEGG Kyoto Encyclopedia of Genes and Genomes

KGML KEGG Markup Language

LPA Linear Path Augmentation

MLE Maximum Likelihood Estimator

mRNA messenger RNA

NCBI National Center for Biotechnology Information

PPI Protein-Protein Interaction

RNA RiboNucleic Acid

SEA Structure Enrichment Analysis


SOAP Simple Object Access Protocol

TPM Transitional Probability Matrix

WSDL Web Service Definition Language

XML Extensible Markup Language

Abstract

With the ever-increasing amount of high-throughput molecular profile data, biologists need versatile tools that enable them to analyze their data quickly and succinctly. Furthermore, pathway databases have grown increasingly robust, with the KEGG database at the forefront. Previous tools have color-coded the genes on different pathways using differential expression analysis. Unfortunately, they do not adequately capture the relationships of the genes amongst one another. Structure Enrichment Analysis (SEA) thus seeks to take biological analysis to the next level. SEA accomplishes this goal by highlighting for users the sub-pathways of a biological pathway that best correspond to their molecular profile data in an easy-to-use GUI.

Keywords: Network Partitioning, Network Reconstruction, Structure Enrichment Analysis, Community Detection Algorithms, Biological Networks, KEGG


Chapter 1: Background and Introduction

The world of biological systems is a vast and complex system of regulation processes and biomolecular interactions. An underlying goal for biologists is to arrive at a theory that sheds light on the complicated interaction patterns in living organisms. These interaction patterns result in various biological phenomena, and recognizing them can provide much-needed insight into biomolecular activities. Capturing these biomolecular activities, however, is a daunting task due to the complexity of the systems at hand as well as the lack of data needed to fully capture them. Thus, two problems have recently received a considerable amount of attention: (1) inferring biological pathway structures from gene expression data and gene sets and (2) decomposing different biological pathway structures into functional units.

A revolution in the understanding of biomolecular interaction mechanisms has occurred in large part due to the rapid and significant advances in high-throughput technologies. Such technologies, including microarrays and second-generation sequencing, now enable a systematic study of biomolecular activities due to the copious amount of genome-wide measurements. These genome-wide measurements continue to be accumulated into numerous databases by research labs across the world. Unfortunately, gaining biological insights from large-scale gene expression data is a daunting task due to the curse of dimensionality. To overcome this difficulty, many computational and experimental models have been developed to group genes into various sets based on either a structural or functional similarity. This led to the birth of gene sets as a new source of data, which in turn led to a burst of novel algorithms that infer biological pathway structures from gene sets. These two types of data, gene expression data and gene sets, will now be examined in more detail.


First, gene expression data is represented as a matrix of numerical values. Each row corresponds to a gene while each column corresponds to an experiment. Each entry of the matrix corresponds to the gene expression level for a given gene under a given experiment. Gene expression profiling has thus allowed the simultaneous measurement of the expression levels of thousands of genes, and a systematic study of biomolecular interaction mechanisms is now possible on a genomic scale. One typical example of gene expression data is microarray data. For microarray data one usually has a glass slide that is coated with oligonucleotides corresponding to specific gene coding regions. The slide is then labeled and hybridized with purified RNA. Finally, the washed microarray slide is scanned with a laser to obtain gene expression data.
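As a small illustration of the matrix layout just described, the following sketch stores a toy expression matrix with one row per gene and one column per experiment. The gene names, experiment names, and values are made up for illustration and do not come from any dataset in this thesis.

```python
# Toy gene expression matrix: rows are genes, columns are experiments.
# All names and values here are hypothetical.
genes = ["geneA", "geneB", "geneC"]
experiments = ["exp1", "exp2"]
expression = [
    [2.1, 0.4],  # geneA
    [1.3, 1.2],  # geneB
    [0.2, 3.5],  # geneC
]


def expression_level(gene, experiment):
    """Return the matrix entry for one gene under one experiment."""
    row = genes.index(gene)
    column = experiments.index(experiment)
    return expression[row][column]


def profile(gene):
    """Return the full row of a gene as an experiment -> level mapping."""
    return dict(zip(experiments, expression[genes.index(gene)]))


level = expression_level("geneC", "exp2")
gene_a_profile = profile("geneA")
```

Real microarray matrices have thousands of rows, which is where the curse of dimensionality mentioned earlier becomes a concern.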

Ways to obtain genome-wide measurements have also grown. There is a wide array of microarray platforms, and genome-wide measurements can be obtained via conventional hybridization-based microarrays [14, 20, 31] or deep sequencing experiments [32, 33]. Some representative microarray platforms include Agilent Microarray, Affymetrix GeneChip, and Illumina BeadArray.

Moving on to gene sets, a gene set is defined as a group of genes that share biological similarities. Gene sets are a rich source of data for reconstructing the structure of biological pathways, as the genes in a set tend to participate in the same biological process. Gene sets are derived from a variety of sources including PubMed text, ChIP-chip, co-localization along the chromosome, and gene expression data. There are a variety of methods to rank gene sets, with GSEA-P [34] being one of the most popular. A major advantage of working with gene sets is their capability to incorporate higher-order interaction patterns with ease. They are also more robust to noise than gene expression data and are capable of integrating data from a variety of sources. Given the ways a gene set may be derived, one must keep in mind the possibility that not all gene sets represent network structures.


An important underlying assumption when trying to reconstruct a biological pathway structure using gene sets or gene expression data is that these sets of data were originally emitted from unobserved signaling pathways. There are various algorithms based on this assumption that attempt to reconstruct the structure of biological pathways using gene sets and/or gene expression data. First, a biological pathway structure is a graph G(V, E) where V is the set of vertices or nodes and E is the set of edges. In the case of biological pathways, a vertex v ∈ V may either be a gene or protein whereas an edge e ∈ E joining two such vertices represents the biological properties connecting them. The final underlying network may either be directed or undirected, and both types of networks occur naturally in biological systems.

For example, signal transduction is a typical example of a directed network in biological systems. According to the Central Dogma of Molecular Biology, DNA encodes the genetic information of living organisms. DNA directs protein synthesis via the formation of messenger RNA (mRNA) [4]. A signal transduction is thus the primary means that decodes DNA into mRNA and then into protein synthesis. For a signal transduction to occur, cytokines or chemokines bind to transmembrane proteins, which in turn triggers a sequential activation of signal molecules leading to a biological end-point. In this case a directed edge represents one event in a signal transduction activating another, and a signaling pathway is thus composed of a web of gene regulatory wiring or different transduction events.

Undirected networks, on the other hand, are typically exemplified by Protein-Protein Interaction (PPI) networks [35]. These networks have no self-loops, and all vertices consist of proteins. An edge exists between two proteins if they can physically interact.

Once a biological pathway structure has been reconstructed, one needs to examine it at a finer level, as usually only part of a biological pathway structure is involved in a biological process of interest. Thus, decomposing different biological pathway structures into sub-pathways is a must. By retrieving the sub-pathways, one is able to accomplish two major goals: predicting gene functionality and identifying relevant sub-pathways for different phenotypes. For example, if gene A is clustered with other genes responsible for apoptosis, one may infer that gene A also plays a role in apoptosis. This leads to predicting a new gene functionality for gene A that may have been previously unknown. As another example, one may possess cancer molecular profile data. By “enriching” the sub-pathways, one may extract new biological insights about the sub-pathways most relevant to cancer. Figure 1.1 succinctly summarizes the relationships amongst the various topics discussed in this introduction.

Figure 1.1: The big picture. Gene expression data and gene sets may be converted from one to another. Furthermore, given gene expression data or gene sets, one can reconstruct different biological pathway structures. Given that only a sub-pathway is usually activated for a particular biological process, decomposing a biological pathway structure into sub-pathways is a must. From these sub-pathways, one may extract useful biological insights. Otherwise, one may use molecular profile data in conjunction with sub-pathways to extract the most relevant sub-pathways for the data at hand.

To outline the remainder of this thesis, three areas will now be examined in more detail. Chapter 2 will examine three network reconstruction algorithms. The first approach is Bayesian networks [24, 12], which is based on gene expression data. The second approach is the Frequency Method [29], which is a gene set based approach. The final approach is Linear Path Augmentation (LPA) [15], which is an original contribution to the field. Chapter 3 will examine three network partitioning algorithms: the Kernighan-Lin algorithm [19], the Girvan-Newman algorithm [13, 26], and the Clique Percolation Method (CPM) [27, 28]. Finally, Chapter 4 will focus on an original and novel software pipeline, SEA (Structure Enrichment Analysis), which closely resembles Figure 1.1.


Chapter 2: Network Reconstruction

Given gene expression data and gene sets, it is often the case that more biological insight needs to be extracted from them. One concise manner to extract insight from gene expression data and gene sets is to reconstruct a biological pathway structure. Reconstructing a biological pathway structure is a key step, as it is often the gateway for further analysis. For example, it may be a difficult task to accurately extract signal cascades if the underlying network is unknown. A biological pathway structure can also illustrate how various sub-pathways cross-talk with one another. Thus, there are a plethora of reasons to reconstruct a biological pathway structure.

There are a variety of methods to reconstruct biological pathways. Some methods, such as Bayesian networks, rely on gene expression data. Other methods, such as the Frequency Method, rely on gene sets. Both of these methods will be examined later on in the chapter. In addition, an original and novel algorithm, Linear Path Augmentation (LPA), will be presented in detail later on in this chapter as well.

2.1 Bayesian Networks

A Bayesian network [24, 12] is a graphical model that associates probabilistic relationships with its vertices. From a structural view, a Bayesian network embodies the conditional dependencies and independencies of its vertices. It also efficiently encodes the joint probability distribution of all the vertices in the graph. A Bayesian network is represented by a DAG (directed acyclic graph), which automatically rules out Bayesian networks from representing feedback loops and other cyclic structures.

A Bayesian network consists of a pair (G, Θ) where G represents a DAG. The |V| = n nodes of G are random variables $X_1, X_2, \ldots, X_n$ that may be discrete or continuous. Θ denotes the set of parameters for each of the random variables and is needed to encode a random variable's CPT (conditional probability table) or CPD (conditional probability distribution) depending on whether it is discrete or continuous. More formally, one can define Θ as

$$\theta_{x_i \mid \mathrm{Pa}(x_i)} = P(x_i \mid \mathrm{Pa}(x_i)) \tag{2.1}$$

$\forall x_i \in X_i$ given the set of parents $\mathrm{Pa}(x_i)$ of $x_i$ in G. Θ is often learned by assuming some underlying distribution and using gene expression data to derive Θ. Using the factorization definition, one can express the joint probability distribution as a product of the conditional probabilities

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}(x_i)) \tag{2.2}$$
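To make the factorization concrete, the following minimal sketch evaluates the joint probability of a three-variable discrete Bayesian network as a product of conditional probability table entries. The structure, variable names, and probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical DAG: X1 -> X2, X1 -> X3, X2 -> X3 (all variables binary).
parents = {"X1": (), "X2": ("X1",), "X3": ("X1", "X2")}

# cpt[node][parent values][node value] = P(node value | parent values)
cpt = {
    "X1": {(): {0: 0.7, 1: 0.3}},
    "X2": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
    "X3": {
        (0, 0): {0: 0.8, 1: 0.2},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.3, 1: 0.7},
        (1, 1): {0: 0.1, 1: 0.9},
    },
}


def joint_probability(assignment):
    """Multiply P(x_i | parents of x_i) over all nodes of the network."""
    probability = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        probability *= cpt[node][parent_values][assignment[node]]
    return probability


p = joint_probability({"X1": 1, "X2": 1, "X3": 0})  # 0.3 * 0.6 * 0.1

# Summing the joint over every assignment should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
```

Only 1 + 2 + 4 = 7 free parameters are needed here, which equals the 2^3 - 1 of an unconstrained joint table because this toy DAG is fully connected; sparser structures need fewer, which is the sense in which the network encodes the joint distribution efficiently.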

As will be seen in Chapter 4, it may be the case that the structures of interest are already available. Thus, one may venture to say that scoring structures may ultimately be more important than searching. Oftentimes an approximation may be used, such as the Bayesian Information Criterion (BIC), defined as $\ln p(D \mid \hat{\theta}_G, G) - \frac{d}{2} \ln N$, where D is the dataset, G is the structure, d is the number of parameters, and N is the size of the dataset. $\hat{\theta}_G$ is an estimate of the model parameters, and for large enough N one may use the MLE.
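The BIC expression can be sketched in a few lines for fully observed discrete data. This is an illustrative sketch rather than the scoring code of any particular toolbox: the function name `bic_score`, the data layout (a list of dictionaries), and the toy dataset are assumptions. The parameters are estimated by maximum likelihood from the counts.

```python
import math
from collections import Counter


def bic_score(data, parents, arity):
    """BIC = ln p(D | theta_hat, G) - (d / 2) * ln N for discrete data.

    data:    list of dicts mapping variable name -> observed value
    parents: dict mapping variable name -> tuple of parent names (the DAG G)
    arity:   dict mapping variable name -> number of possible values
    """
    n_samples = len(data)
    log_likelihood = 0.0
    n_params = 0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        # The MLE of P(var | parents) is a ratio of counts.
        for (pa_values, _), count in joint.items():
            log_likelihood += count * math.log(count / marginal[pa_values])
        parent_configs = 1
        for p in pa:
            parent_configs *= arity[p]
        n_params += (arity[var] - 1) * parent_configs
    return log_likelihood - 0.5 * n_params * math.log(n_samples)


# Hypothetical data in which Y always copies X.
data = [{"X": 0, "Y": 0}] * 4 + [{"X": 1, "Y": 1}] * 4
arity = {"X": 2, "Y": 2}
score_independent = bic_score(data, {"X": (), "Y": ()}, arity)
score_dependent = bic_score(data, {"X": (), "Y": ("X",)}, arity)
```

On this toy dataset the structure with the edge X → Y scores higher even though it pays a larger penalty for its extra parameter, because its likelihood gain outweighs the penalty.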

Thus, a Bayesian network is a good probabilistic modeling approach to learn the structure of a biological pathway from gene expression data. Bayesian networks are also quite robust against noisy data, which in turn prevents over-fitting of the data. Their main disadvantages lie in their computational complexity and their restriction to DAGs. Regardless, Bayesian networks are still quite popular in many fields, and many implementations, such as BNT [23], exist that allow users to harness their power.

2.2 Frequency Method

The Frequency Method [29] is a method to reconstruct directed networks from gene sets. It makes three important assumptions about the gene sets. First, it assumes that tree structures in the paths correspond to gene sets. Another assumption is the availability of the source and destination of each gene set, which may not necessarily be known for all biological systems. Finally, it is assumed that the directed edges used to form a tree in each gene set are already available, but their order is unknown.

Using terminology similar to [2], let S be the set of source nodes, D be the set of destination or target nodes, and E be the collection of all directed edges of the graph. Each member $m \in S \cup D \cup E$ can be associated with a binary vector $x_m$ of length N, the number of gene sets, where $x_m(i) = 1$ indicates that m is involved with the ith gene set. By letting $s_i$ be the fixed beginning of the ith gene set and $d_i$ its destination, the order of genes for the ith gene set is found by satisfying

$\forall e \in E$ with $x_e(i) = 1$. It should be noted that $\lambda_i(e)$ is used to determine whether e is closer to its source $s_i$ than its destination $d_i$. The result of Equation 2.3 is that $e^*$ is placed closest to $s_i$. Thus, the edges are placed in proximity to $s_i$ based on their λ scores.


The Frequency Method leads to a unique solution in reconstructing the biological pathway structure and is computationally efficient. A major drawback is its stringent assumptions, such as knowing the source and destination genes of each gene set. Furthermore, if there exist multiple paths between a pair of genes, the Frequency Method may fail.

2.3 LPA

LPA (Linear Path Augmentation) [15] is an original and novel network reconstruction algorithm. The goal of LPA is to reconstruct an original biological pathway structure using gene sets as the input. The underlying hypothesis of LPA is that gene sets correspond to signal cascades and that the underlying network corresponds to a DAG (Directed Acyclic Graph). With these assumptions LPA has a robust pipeline to reconstruct biological pathways using gene sets as input. Figure 2.1 provides an overview of the problem that LPA attempts to solve.

Before proceeding to the details of LPA, it is prudent to describe how simulations were conducted. To be able to test LPA as well as other algorithms, it is necessary to be able to generate some linear paths from the original network. To accomplish this goal, the algorithm All Linear Paths was developed. It is important to note that for a fully connected DAG, there are $\sum_{j=1}^{n-1} \sum_{i=1}^{j-1} \binom{j}{i}$ linear paths, where n is the number of vertices in the DAG. Thus, this algorithm is only feasible for very sparse pathways. Figure 2.2 presents a flow chart describing the All Linear Paths algorithm.
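The flow chart of the All Linear Paths algorithm is not reproduced in this text, so the following is only a plausible sketch of such an enumeration, not the author's implementation. It assumes that a linear path is any directed path with at least two vertices and that the input is a DAG given as a list of directed edges. The function names are hypothetical.

```python
def all_linear_paths(edges):
    """Enumerate every directed path with at least two vertices in a DAG."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])

    paths = []

    def extend(path):
        # Depth-first extension; this terminates because the graph is acyclic.
        for successor in adjacency[path[-1]]:
            longer = path + (successor,)
            paths.append(longer)
            extend(longer)

    for start in adjacency:
        extend((start,))
    return paths


def to_gene_sets(paths):
    """Drop the ordering of each path to obtain unordered gene sets."""
    return {frozenset(path) for path in paths}


# Hypothetical three-gene network.
network = [(1, 2), (2, 3), (1, 3)]
paths = all_linear_paths(network)
gene_sets = to_gene_sets(paths)
```

Because the number of paths can grow exponentially with the number of vertices, a recursive enumeration like this one is practical only for sparse networks, in line with the remark above.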

A very significant step that can easily be overlooked is permuting the order of the gene sets at the very end. It is natural for algorithms to handle gene sets one at a time. An issue arises, though, if some assumption or calculation is made using the remaining gene sets. One example is GSGS (Gene Set Gibbs Sampler) [1]. In particular, the remaining gene sets in GSGS are used to calculate the TPM (Transitional Probability Matrix). It is hoped that with a good number of gene sets this effect is diminished, as the weight of a single gene set in calculating the TPM is reduced. Similarly, for LPA the remaining gene sets play a significant role in the score function to be discussed later on in subsection 2.3.3. For both cases mentioned, the order of the gene sets may affect the final results, with LPA being affected far more significantly than GSGS.

Figure 2.1: This sample network illustrates the problem that LPA attempts to solve. At step 1, one has the original, unobserved biological pathway. At step 2 the pathway consists of signal cascades. Unordered gene sets corresponding to the signal cascades are represented at step 3. Finally, using a network reconstruction algorithm, the original biological pathway is reconstructed from the gene sets in step 3. This original author contribution first appeared in [15].

Figure 2.2: All Linear Paths

One important note is that any network and its transpose can produce the same set of linear paths. Any algorithm that does network reconstruction must always keep this fact in mind. At least for biological networks, though, this problem is somewhat mitigated, as biologists should usually be able to easily tell the proper orientation. For example, biologists would not label a transcription factor as a leaf node. Thus, from an algorithmic perspective, some prior knowledge is a must.
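The transpose ambiguity can be checked directly on a toy example. The sketch below assumes that a linear path is any directed path with at least two vertices; the network and its node names (`TF`, `A`, `B`, `C`) are hypothetical.

```python
def unordered_linear_paths(edges):
    """Return the vertex sets of all directed paths (two or more vertices) in a DAG."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])

    found = set()

    def extend(path):
        for successor in adjacency[path[-1]]:
            longer = path + (successor,)
            found.add(frozenset(longer))
            extend(longer)

    for start in adjacency:
        extend((start,))
    return found


# Hypothetical network rooted at a transcription factor, and its transpose.
network = [("TF", "A"), ("A", "B"), ("TF", "C")]
transpose = [(target, source) for source, target in network]

same_gene_sets = unordered_linear_paths(network) == unordered_linear_paths(transpose)
```

Every path in the transpose is a reversed path of the original network, so once the ordering is dropped the two collections of gene sets coincide. Only prior knowledge, such as the transcription factor being a root rather than a leaf, can break the tie.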

The final items needed for simulation studies are some gold standard networks. The gold standard networks chosen are from the DREAM3 Network Challenges [22]. Furthermore, the chosen networks are all DAGs and small-scale as well. Table 2.1 lists a set of networks from the DREAM3 Network Challenges as well as some useful statistics per network. Results of the LPA algorithm are also displayed.

The LPA algorithm itself is a novel combination of a variety of techniques. Its name, Linear Path Augmentation, is based on augmenting matrices with linear paths. Based on the available knowledge, no other algorithm functions in a manner similar to it. In addition to its novelty, it is quite modular, consisting of preprocessing, sorting, growth, pruning, and intersection stages. This modularity allows for ease of updating stages individually. Figure 2.4 presents a high-level flow chart of the LPA algorithm.

Figure 2.3: A network and its transpose. By running the All Linear Paths algorithm detailed in Figure 2.2 on both networks, the same set of gene sets is produced. In essence, this states that without any prior information a network and its transpose are both equal in terms of finding the final network.

Table 2.1: Statistics concerning E. coli networks from the DREAM3 Network Challenges. Also displayed are the results of the LPA algorithm, where Sensitivity $= \frac{TP}{TP + FN}$, Specificity $= \frac{TN}{TN + FP}$, and Positive Predictive Value (PPV) $= \frac{TP}{TP + FP}$. TP equals true positives, FP equals false positives, TN equals true negatives, and FN equals false negatives.
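The three measures in the caption of Table 2.1 can be computed from a gold standard edge set and a reconstructed edge set as sketched below. How the negatives were counted in the thesis is not stated, so this sketch assumes that every ordered pair of distinct nodes is a possible directed edge. The function name and example networks are hypothetical.

```python
def edge_metrics(true_edges, predicted_edges, nodes):
    """Sensitivity, specificity, and PPV of a reconstructed directed network.

    Assumes every ordered pair of distinct nodes is a possible edge, and that
    the denominators are non-zero.
    """
    true_edges = set(true_edges)
    predicted_edges = set(predicted_edges)
    possible = {(u, v) for u in nodes for v in nodes if u != v}

    tp = len(true_edges & predicted_edges)
    fp = len(predicted_edges - true_edges)
    fn = len(true_edges - predicted_edges)
    tn = len(possible - true_edges - predicted_edges)

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Hypothetical gold standard and reconstruction on three genes.
metrics = edge_metrics(
    true_edges=[(1, 2), (2, 3)],
    predicted_edges=[(1, 2), (1, 3)],
    nodes=[1, 2, 3],
)
```

In this example one of the two true edges is recovered and one spurious edge is reported, so sensitivity and PPV are both 0.5, while three of the four absent edges are correctly left out.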

Figure 2.4: The LPA algorithm consists of five key stages. The first stage, preprocessing, separates the gene sets into components. The second stage, sorting, places the gene sets in order. The third stage, growth, searches for candidate networks. The fourth stage, pruning, scores the candidate solutions and removes candidate solutions with low scores. The fifth and final stage, intersection, is needed in the absence of prior data to reconcile any candidate solutions still left.

2.3.1 Preprocessing

The idea behind the preprocessing stage is to divide the gene sets into “components.” The process is relatively straightforward. If two gene sets A and B share at least one node, they are placed in the same component. If gene set C shares at least one node with either gene set A or B, it is also placed in the same component. If the original network is a single connected component, then all gene sets will fall into one component. Similarly, if the original network had k disconnected components, then there will be k sets of gene sets. For all scenarios listed, it is assumed that no gene sets are missing; in practice, the number of sets of gene sets may vary. This allows for a divide and conquer approach where the next steps are run k times, once for each set of gene sets.
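The grouping rule described above can be sketched as follows. This is an illustrative reading of the preprocessing stage, not the thesis implementation; the function name and the example gene sets are hypothetical.

```python
def group_into_components(gene_sets):
    """Group gene sets so that sets sharing a gene, directly or through a
    chain of other gene sets, end up in the same component."""
    components = []  # list of (union of genes, list of gene sets)
    for gene_set in gene_sets:
        gene_set = set(gene_set)
        merged_genes = set(gene_set)
        merged_sets = [gene_set]
        untouched = []
        for genes, members in components:
            if genes & gene_set:
                # The new gene set bridges into this component: merge it.
                merged_genes |= genes
                merged_sets = members + merged_sets
            else:
                untouched.append((genes, members))
        untouched.append((merged_genes, merged_sets))
        components = untouched
    return [members for _, members in components]


# Hypothetical gene sets forming three components.
components = group_into_components([{1, 2}, {3, 4}, {2, 5}, {6}, {4, 7}])

# A gene set may also bridge two previously separate components.
bridged = group_into_components([{1, 2}, {3, 4}, {2, 3}])
```

Each returned component can then be passed independently through the sorting, growth, pruning, and intersection stages.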


2.3.2 Sorting

This stage assigns an order to a set of gene sets. The LPA algorithm is very sensitive to the order of the gene sets. The order of the gene sets can actually determine whether the algorithm converges to the correct solution and may have a direct effect on its computational complexity. The current approach places the longest gene sets first. While this increases the computational complexity of the algorithm, it makes it more likely to converge to the correct solution.

2.3.3 Growth

The growth stage is very akin to the “searching” stage of a structure learning algorithm. For the first iteration, assuming no prior knowledge has been provided, $\frac{\mathrm{length}(G_1)!}{2}$ networks are constructed, where $G_1$ is the first gene set. Each network corresponds to one linear path from the $\frac{\mathrm{length}(G_1)!}{2}$ possible permutations. The quantity is divided by two as the reverses of the permutations are automatically discarded (Figure 2.3). These networks are stored in a set of candidate networks $F_i^{1}$. After the pruning stage, one now begins with the pruned $F_i^{1'}$. Each network in $F_i^{1'}$ is expanded using the $\frac{\mathrm{length}(G_2)!}{2}$ permutations for $G_2$. However, to reduce the search space, the topological sort order of each network is taken into account. Thus, only permutations that do not violate its topological sort order are added. For example, if a pathway P consists of the linear path 1 → 2 → 3 and the new gene set is {2, 3, 4}, 3 → 2 → 4 will not be added as it violates the topological sort order. {2 → 3 → 4, 2 → 4 → 3, ...}, on the other hand, are valid permutations, and P will split into new networks accordingly. The new augmented networks are then added to $F_i^{2}$ while the networks in $F_i^{1'}$ are discarded. The process repeats itself until all gene sets are used and is illustrated in Figure 2.5.
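The growth step can be sketched as below. Two assumptions are made that the text does not spell out: a permutation is taken to "violate the topological sort order" exactly when adding its edges to the candidate network would create a cycle, and networks are represented as sets of directed edges. The function names are hypothetical.

```python
from itertools import permutations


def initial_candidates(gene_set):
    """First iteration: one linear path per ordering, keeping a single
    representative of each ordering/reverse pair."""
    seen = set()
    kept = []
    for ordering in permutations(sorted(gene_set)):
        if ordering[::-1] in seen:
            continue
        seen.add(ordering)
        kept.append(ordering)
    return kept


def is_acyclic(edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    successors = {node: [] for node in nodes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
    ready = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited == len(nodes)


def grow(network_edges, gene_set):
    """Augment a candidate network with every ordering of the new gene set
    that keeps the network acyclic."""
    grown = []
    for ordering in permutations(sorted(gene_set)):
        path_edges = set(zip(ordering, ordering[1:]))
        augmented = set(network_edges) | path_edges
        if is_acyclic(augmented):
            grown.append((ordering, augmented))
    return grown


first = initial_candidates({1, 2, 3})            # 3! / 2 = 3 orderings
grown = grow({(1, 2), (2, 3)}, {2, 3, 4})        # the example from the text
valid_orderings = [ordering for ordering, _ in grown]
```

For the example in the text, the orderings 2 → 3 → 4, 2 → 4 → 3, and 4 → 2 → 3 survive under this reading, while every ordering that places 3 before 2 is rejected.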


Figure 2.5: Growth Stage.

2.3.4 Pruning

The pruning stage is very akin to the “scoring” stage of a structure learning algorithm. This stage attempts to reduce even further the set of candidate solutions. An important part of this stage is that it uses all gene sets to compute a score for each network. In essence, this score measures how many gene sets the underlying network can support. In other words, if one were to run the All Linear Paths algorithm on the network, its score is the size of the intersection of its unordered linear paths with the gene sets. Figure 2.6 provides further details on the pruning stage.
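A minimal sketch of this score follows. The details of Figure 2.6 are not available in this text, so two assumptions are made: a linear path is any directed path with at least two vertices, and pruning keeps only the candidates with the highest score. The function names and the example are hypothetical.

```python
def unordered_linear_paths(edges):
    """Vertex sets of all directed paths (two or more vertices) in a DAG."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])
    found = set()

    def extend(path):
        for successor in adjacency[path[-1]]:
            longer = path + (successor,)
            found.add(frozenset(longer))
            extend(longer)

    for start in adjacency:
        extend((start,))
    return found


def support_score(edges, gene_sets):
    """Number of gene sets that match some unordered linear path of the network."""
    supported = unordered_linear_paths(edges)
    return sum(1 for gene_set in gene_sets if frozenset(gene_set) in supported)


def prune(candidates, gene_sets):
    """Keep the candidates with the highest score (an assumed pruning rule)."""
    scores = [support_score(candidate, gene_sets) for candidate in candidates]
    best = max(scores)
    return [c for c, score in zip(candidates, scores) if score == best]


gene_sets = [{1, 2}, {2, 3}, {1, 2, 3}]
candidate_a = {(1, 2), (2, 3)}   # supports all three gene sets
candidate_b = {(1, 3), (3, 2)}   # supports only {2, 3} and {1, 2, 3}
survivors = prune([candidate_a, candidate_b], gene_sets)
```

Because the score relies on enumerating all linear paths of each candidate, it inherits the cost of that enumeration, which is one reason the pruning stage affects the overall feasibility of LPA.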

2.3.5 Intersection

Figure 2.6: Pruning Stage.

The final stage is needed only when there still remain some candidate network solutions. Thus, the final network returned is the intersection of all remaining candidate network solutions. In the absence of prior knowledge, one must choose between a network and its transpose. An ad hoc solution at the moment is to choose the network whose upper triangular matrix is heavier. Naturally, this process may fail when the upper triangular and lower triangular matrices have an equal number of edges. Figure 2.7 provides an example of the intersection stage.

Figure 2.7: Intersection Stage
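The intersection stage and the ad hoc orientation rule can be sketched on edge sets. The sketch assumes that nodes are labeled with integers matching their row and column indices in the adjacency matrix, so that an edge (u, v) with u < v lies in the upper triangle. The function names are hypothetical.

```python
def intersect_networks(candidates):
    """Keep only the edges shared by every remaining candidate network."""
    common = set(candidates[0])
    for edges in candidates[1:]:
        common &= set(edges)
    return common


def pick_orientation(edges):
    """Choose between a network and its transpose by comparing the number of
    edges in the upper and lower triangles of the adjacency matrix.

    Returns None on a tie, where the rule cannot decide.
    """
    upper = sum(1 for u, v in edges if u < v)
    lower = sum(1 for u, v in edges if u > v)
    if upper == lower:
        return None
    transpose = {(v, u) for u, v in edges}
    return set(edges) if upper > lower else transpose


final = intersect_networks([{(1, 2), (2, 3), (1, 3)}, {(1, 2), (2, 3)}])
oriented = pick_orientation({(2, 1), (3, 2)})   # lower triangle is heavier
undecided = pick_orientation({(1, 2), (3, 2)})  # one edge in each triangle
```

The tie case returns no answer, which mirrors the failure mode noted above when both triangles contain the same number of edges.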

A post-processing step is the combination of the separate components, if any, produced by the algorithm. At this stage, the presence of prior knowledge is a must, as a network and its transpose are equally likely in the absence of prior knowledge. After this step is finished, the final network is ready for presentation to the user.

LPA has some novel contributions. At this stage, though, it needs better sorting, growth, and pruning stages for it to be computationally feasible. Given its modular nature, though, it is hoped that finding improvements for these stages will be an achievable task.


Chapter 3: Network Partitioning

It is often the case that a reconstructed network is too broad a representation for a process of interest. Furthermore, there are now readily available high-fidelity biological networks, with the Kyoto Encyclopedia of Genes and Genomes (KEGG) [16, 18, 17] being at the forefront of the databases. Since not all of a biological pathway structure is activated at once, a finer level of detail is needed when examining the structure of biological pathways. As such, decomposing a biological pathway structure into sub-pathways is of utmost importance, as the sub-pathways may provide valuable insight into various biological processes.

It is vital to first define what a sub-pathway is. For biological pathways the concept of a sub-pathway is very similar to the concept of communities in social networks. A community is a subgraph of a given graph such that (1) the connections within the community from node to node are strong and (2) the external connections between other communities are few and weak. Figure 3.1 provides an illustration of the concept of communities.

Figure 3.1: The network displayed consists of two communities shaded white and black, respectively. Both communities exhibit high internal connections. Furthermore, the connections between the two communities consist only of a single edge. This original author contribution is set to also appear in [2].
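The two conditions in the definition of a community can be checked by counting edges inside and between the groups of a partition. The sketch below uses a hypothetical network of two triangles joined by a single edge, in the spirit of Figure 3.1 but not taken from it.

```python
def community_edge_counts(edges, communities):
    """Count the edges inside each community and the edges between communities.

    edges:       iterable of undirected edges (u, v)
    communities: list of disjoint node sets covering all nodes
    """
    membership = {
        node: index
        for index, community in enumerate(communities)
        for node in community
    }
    internal = [0] * len(communities)
    external = 0
    for u, v in edges:
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
        else:
            external += 1
    return internal, external


# Two triangles connected by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
internal, external = community_edge_counts(edges, [{1, 2, 3}, {4, 5, 6}])
```

Here each community has three internal edges and the partition is crossed by only one edge, which matches the intuition of dense internal connections and sparse external ones.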

There are two approaches for finding the sub-pathways of a biological pathway structure or graph, namely graph clustering and community detection algorithms [25]. The former type of algorithms have their origin in computer science and other related fields. The latter type of algorithms were originally used by sociologists. They now encompass algorithms in applied mathematics, physics, and biology.

For graph clustering algorithms, a user must specify the number of clusters or partitions. A graph clustering algorithm will always return the specified number of partitions regardless of whether the underlying graph is partitionable. These algorithms were designed with specific applications in mind. Some applications include improving the paging properties of programs and placing the components of an electronic circuit onto printed circuit cards [19].

One may ask, “Why study graph clustering algorithms for biological pathways?” This is indeed a pertinent question. The major reason is that these algorithms often serve as an inspiration for community detection algorithms. For example, the Laplacian matrix, whose use is popular in graph clustering algorithms, can be modified to perform eigenvector decomposition [25]. Another example can also be found in Newman’s eigenvector method [25]. In this paper Newman used the Kernighan-Lin algorithm [19] as inspiration for a post-processing algorithm, namely Algorithm 2.

Concerning community detection algorithms, the underlying assumption behind these algorithms is that a network or graph can “naturally” be divided into sub-pathways or communities. Thus, the sub-pathways of a graph can be viewed as a topological property of the graph. This design philosophy is a major difference between community detection and graph clustering algorithms.

Before discussing some algorithms in detail, it is prudent to discuss the nature of these algorithms. Most algorithms in this field work for undirected networks and produce mutually exclusive partitions. It is often far from trivial to extend the undirected version of an algorithm to work for directed networks [10]. It is often the case that an algorithm that works only for undirected graphs is simply applied to directed graphs by ignoring the edge direction in the directed graphs. As seen in Figure 3.2, this approach is far from adequate.

Title	SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways
Author	Thair Judeh
Advisor	Dr. Dongxiao Zhu
Institution	University of New Orleans
Major	Computer Science Bioinformatics
Type	thesis
Year of publication	2011
City	New Orleans
