An efficient algorithm for global alignment of protein-protein interaction networks


Vietnam National University, Hanoi, Vietnam.

Thai Nguyen University of Education

Vietnam National University, Hanoi, Vietnam.

Abstract

Global alignment of two protein-protein interaction networks is an important task in bioinformatics and computational biology, and it has been a challenging and widely studied research topic in recent years. Accurately aligned networks allow us to identify functional modules of proteins and/or orthologous proteins, from which unknown functions of a protein can be inferred. We introduce a novel, efficient heuristic algorithm for global network alignment called FASTAn. It consists of two phases: the first constructs an initial alignment, and the second improves that alignment by repeatedly applying a local optimization procedure. Experimental results show that FASTAn outperforms the state-of-the-art global network alignment algorithm SPINAL in terms of both the commonly used objective scores and the running time.

Keywords: FASTAn, Heuristic algorithm, Biological network alignment, Protein-protein interaction networks

1 INTRODUCTION

Prior to the advent of network alignment in bioinformatics and computational biology, the identification of orthologous proteins was based only on evolutionary relationships, which are usually captured by sequence homology [Aladag and Erten (2013); Park et al. (2011)]. Sequence homology alone is, however, not adequate for identifying conserved protein complexes [Kelley et al. (2003); Remm et al. (2001); Zaslavskiy et al. (2009)]. The emergence of advanced high-throughput biotechnologies over the last decade has allowed the characterization of protein-protein interaction (PPI) networks for various organisms. These networks pose a number of interesting network analysis problems [Banks et al. (2008); Dost et al. (2008); Kuchaiev et al. (2010); Kuchaiev and Przulj (2011); Kuhn], such as network topology analysis [Milenkovic et al. (2010)] and module detection [Bader and Hogue (2002)]. Among these problems, network alignment is crucially important because it provides valuable information for predicting protein functions or for verifying known functions of proteins [Dutkowski and Tiuryn (2007); Junker and Schreiber (2008); Singh et al. (2008)].

PPI network alignment methods fall into two approaches: local alignment and global alignment. The objective of local alignment is to identify sub-networks with similar topology and/or conserved sequence homology in the aligned networks [Kelley et al. (2004); Koyuturk et al. (2006); Narayanan and Karp (2007); Remm et al. (2001)]. In general, a local alignment contains many overlapping sub-networks, since a protein can be aligned with multiple proteins in the other network, which makes the result ambiguous. Global alignment avoids this ambiguity by constructing an injection between the proteins of the two networks. Finding an optimal global alignment of two networks was proven to be NP-hard by Aladag and Erten [Aladag and Erten (2013)].

The first notable global network alignment method is IsoRank, proposed by Singh et al. [Singh et al. (2008)], which is based on local alignments. Afterwards, a number of similar algorithms were developed. PATH and GA [Zaslavskiy et al. (2009)] and PISwap [Chindelevitch et al. (2010); Chindelevitch et al. (2013)] introduced appropriate relaxations of the cost function over the set of doubly stochastic matrices, or applied local searches to existing alignments generated by other algorithms. MI-GRAAL [Kuchaiev et al. (2010); Kuchaiev and Przulj (2011)] and its variants [Memisevic and Przulj (2012); Milenkovic et al. (2010)] combine greedy techniques with heuristic information such as graphlets, clustering coefficients, eccentricities and similarity values (E-values from BLAST). These algorithms are faster and produce better results than previously proposed methods. They were, however, optimized for either the objective function or scalability, but not both. Because PPI networks often contain a large number of nodes, accuracy and scalability (in the sense of running time) are equally important. Recently, Aladag and Erten proposed the SPINAL algorithm [Aladag and Erten (2013)], which has been shown to produce the best alignments in the shortest time. SPINAL is a polynomial-time heuristic algorithm comprising two phases: the first computes homology scores for every pair of proteins in the two networks; the second builds an injection by locally improving subsets of the available solutions.

This paper proposes a novel algorithm called FASTAn for global alignment of protein-protein interaction networks. The algorithm consists of two phases: the first builds an initial alignment and the second enhances it by local optimization. Our experimental results show that FASTAn outperforms the state-of-the-art method SPINAL in terms of both running time and alignment quality as measured by the objective function.

The remainder of this paper is structured as follows. Section 2 presents a formal statement of the network alignment problem and some related issues. The proposed algorithm FASTAn is introduced in Section 3. Section 4 then describes our experiments and the performance comparison between FASTAn and SPINAL. Finally, Section 5 presents conclusions and future work.

2 GLOBAL ALIGNMENT PROBLEM OF PPI NETWORKS AND RELATED WORKS

We denote two protein-protein interaction networks by $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, where $V_1$ and $V_2$ are the sets of nodes corresponding to the proteins in $G_1$ and $G_2$, respectively, and $E_1$ and $E_2$ are the sets of edges corresponding to the protein-protein interactions in $G_1$ and $G_2$, respectively. Without loss of generality, we assume that $|V_1| \le |V_2|$, where $|V|$ denotes the number of elements of $V$. Network alignment aims at finding an injection from $V_1$ into $V_2$ that is best according to specific evaluation criteria. There is currently no generally agreed formal definition of these criteria. In the following definitions we use the criteria adopted in previous related studies [Aladag and Erten (2013); Chindelevitch et al. (2010); Chindelevitch et al. (2013); Kuchaiev and Przulj (2011); Singh et al. (2008)].

Definition 1 (Network alignment). The graph $A_{12} = (V_{12}, E_{12})$ is an alignment network of the two networks $G_1$ and $G_2$ if and only if:

1. Each node $\langle u_i, v_j \rangle$ of $V_{12}$ corresponds to a pair of nodes $u_i \in V_1$ and $v_j \in V_2$.

2. Any two distinct nodes $\langle u_i, v_j \rangle$ and $\langle u'_i, v'_j \rangle$ of $V_{12}$ satisfy $u_i \ne u'_i$ and $v_j \ne v'_j$.

3. The edge $(\langle u_i, v_j \rangle, \langle u'_i, v'_j \rangle)$ belongs to $E_{12}$ if and only if $(u_i, u'_i) \in E_1$ and $(v_j, v'_j) \in E_2$.

Definition 2 (Optimal global alignment of PPI networks).

An alignment $A_{12} = (V_{12}, E_{12})$ is a solution to the problem of globally aligning two protein networks $G_1$ and $G_2$ if it maximizes the global network alignment score (GNAS) given in Eq. (1):

$$\mathrm{GNAS}(A_{12}) = \alpha |E_{12}| + (1 - \alpha) \sum_{\langle u_i, v_j \rangle \in V_{12}} \mathrm{similar}(u_i, v_j) \qquad (1)$$

where $\alpha \in [0, 1]$ is a parameter that balances the relative importance of network-topological similarity and sequence similarity. The value $\mathrm{similar}(u_i, v_j)$ is approximated using BLAST bit-scores or E-values.
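As an illustration of Definitions 1 and 2, the score in Eq. (1) can be evaluated for a given injection as in the following minimal sketch. The function names, the dictionary representation of the alignment and the toy data are assumptions made for this example only; the graphs are assumed to be simple and undirected, with each edge listed once.

```python
def conserved_edges(mapping, edges1, edges2):
    """Return |E12|: the number of G1 edges whose endpoints are both aligned
    and whose images under the mapping form an edge of G2."""
    lookup = {frozenset(e) for e in edges2}  # undirected edge lookup for G2
    return sum(
        1
        for u, v in edges1
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in lookup
    )


def gnas(mapping, edges1, edges2, similar, alpha):
    """Evaluate Eq. (1) for an injection given as a dict {u in V1: f(u) in V2}.

    `similar` maps (u, v) pairs to similarity scores; pairs that are
    missing from it are treated as 0 (an assumption of this sketch).
    """
    topology = conserved_edges(mapping, edges1, edges2)
    sequence = sum(similar.get(pair, 0.0) for pair in mapping.items())
    return alpha * topology + (1 - alpha) * sequence


# Toy data (hypothetical): a path a-b-c aligned onto a triangle x-y-z.
edges1 = [("a", "b"), ("b", "c")]
edges2 = [("x", "y"), ("y", "z"), ("x", "z")]
mapping = {"a": "x", "b": "y", "c": "z"}
similar = {("a", "x"): 1.0, ("b", "y"): 0.5}
score = gnas(mapping, edges1, edges2, similar, alpha=0.5)
```

Here both edges of the path are conserved, so the topological term is 2 and the similarity term is 1.5.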

According to [Aladag and Erten (2013)], the problem of finding an optimal global network alignment is NP-hard. They proposed a polynomial-time algorithm called SPINAL, whose complexity is

$$O(k \times |V_1| \times |V_2| \times \Delta_1 \times \Delta_2 \times \log(\Delta_1 \times \Delta_2)) \qquad (2)$$

where $k$ is the number of iterations of the main loop (according to [Aladag and Erten (2013)], the algorithm converges after 10-15 iterations), and $\Delta_1$ and $\Delta_2$ are the largest node degrees of the networks $G_1$ and $G_2$, respectively.

Their experiments on benchmark datasets of the protein networks of Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Homo sapiens showed that SPINAL outperforms IsoRank and MI-GRAAL, which were the two state-of-the-art methods at the time.

3 FASTAN ALGORITHM

3.1 Algorithm description

The FASTAn algorithm consists of two phases: the first builds an initial alignment and the second improves that alignment with a local optimization procedure called Rebuild.

Initial alignment building

Given two graphs $G_1$ and $G_2$, the parameter $\alpha$, and the similarity scores of the node pairs $\langle i, j \rangle$ with $i \in V_1$ and $j \in V_2$, for each subset of node pairs $V_{12} \subseteq V_1 \times V_2$ we denote $V_{12}^1 = \{i \in V_1 : \langle i, j \rangle \in V_{12}\}$ and $V_{12}^2 = \{j \in V_2 : \langle i, j \rangle \in V_{12}\}$. The FASTAn procedure in Algorithm 1 performs the following steps:

Step 1. Initialize $V_{12}$ with the node pair $\langle i, j \rangle$ that has the largest similarity score.

Step 2. Loop from $k = 2$ to $|V_1|$:

2.1 Find a node $i$ in $V_1 - V_{12}^1$ having the maximum number of edges connecting to nodes in $V_{12}^1$;

2.2 Find a node $j$ in $V_2 - V_{12}^2$ such that adding $\langle i, j \rangle$ to $V_{12}$ maximizes $\mathrm{GNAS}(A_{12})$ (see Eq. 1), where $A_{12}$ is the network with the nodes in $V_{12}$ and the edges induced by $G_1$ and $G_2$. Such a node $j$ is called best_matched_node$(i, V_{12})$;

2.3 Add the node $\langle i, j \rangle$ to $V_{12}$;

2.4 Update $E_{12}$ based on $V_{12}$.

Step 3. Repeatedly improve $G_{12} = (V_{12}, E_{12})$ with the procedure Rebuild.

Remark. At steps 2.1 and 2.2, more than one node may be the best. In this case the procedure chooses one of them at random.

After successfully building an initial alignment, FASTAn moves to phase 2, in which the procedure Rebuild is applied to improve the quality of the initial alignment.

input : Graph 1: G1 = (V1, E1); Graph 2: G2 = (V2, E2); Similarities of node pairs: Similar[i][j]; Balancing parameter α
output: Alignment network G12 = (V12, E12)
V12 = {⟨i, j⟩} // the most similar pair ⟨i, j⟩
for k ← 2 to |V1| do
    i = find_next_node(G1);
    j = choose_best_matched_node(i, G1, G2);
    V12 = V12 ∪ {⟨i, j⟩};
    Update(E12);
end
Rebuild(G12);

Algorithm 1: Procedure of FASTAn
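The greedy construction of phase 1 can be sketched in Python as follows. This is only an illustrative, unoptimized reading of Steps 1 and 2, not the authors' implementation: the function names, the adjacency-dictionary representation and the toy data are assumptions, missing similarity scores are treated as 0, and $|V_1| \le |V_2|$ is assumed. The sketch uses the fact that adding $\langle i, j \rangle$ increases the score of Eq. (1) by $\alpha$ times the number of newly conserved edges plus $(1 - \alpha)\,\mathrm{similar}(i, j)$, so maximizing this gain is equivalent to maximizing GNAS in Step 2.2.

```python
import random


def extend_alignment(adj1, adj2, similar, alpha, mapping, rng):
    """Greedily extend `mapping` (V1 node -> V2 node) until all of V1 is aligned.

    adj1, adj2: dict mapping each node to the set of its neighbours.
    similar:    dict mapping (i, j) to a similarity score (missing pairs count as 0).
    """
    used = set(mapping.values())
    while len(mapping) < len(adj1):
        # Step 2.1: unaligned node of G1 with the most edges to aligned nodes.
        free1 = [i for i in adj1 if i not in mapping]
        links = {i: sum(1 for u in adj1[i] if u in mapping) for i in free1}
        most = max(links.values())
        i = rng.choice([x for x in free1 if links[x] == most])  # random tie-break

        # Step 2.2: gain in GNAS obtained by aligning i with each free node j.
        gains = {}
        for j in adj2:
            if j in used:
                continue
            conserved = sum(1 for u in adj1[i]
                            if u in mapping and mapping[u] in adj2[j])
            gains[j] = alpha * conserved + (1 - alpha) * similar.get((i, j), 0.0)
        top = max(gains.values())
        j = rng.choice([y for y in gains if gains[y] == top])  # random tie-break

        # Steps 2.3 and 2.4: E12 is implied by the mapping, so only V12 is stored.
        mapping[i] = j
        used.add(j)
    return mapping


def initial_alignment(adj1, adj2, similar, alpha, seed=0):
    """Phase 1: start from the most similar pair, then extend greedily."""
    rng = random.Random(seed)
    i0, j0 = max(((i, j) for i in adj1 for j in adj2),
                 key=lambda pair: similar.get(pair, 0.0))  # Step 1
    return extend_alignment(adj1, adj2, similar, alpha, {i0: j0}, rng)


# Toy data (hypothetical): path a-b-c versus path x-y-z plus an isolated node w.
adj1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
adj2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}, "w": set()}
alignment = initial_alignment(adj1, adj2, {("a", "x"): 5.0}, alpha=0.5)
```

In this toy case no ties occur, so the result does not depend on the random seed. The sketch rescans all free nodes in every iteration, so it does not attain the complexity stated in Section 3.2.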

Rebuild procedure

Given the alignment $G_{12}$ produced by phase 1 and a predefined value $n_{keep}$ (1%) that specifies the number of nodes in the set $SeedV_{12}$, the procedure Rebuild in Algorithm 2 performs the following steps:

Step 1. Create a set $SeedV_{12}$ comprising the $n_{keep}$ (1%) nodes of $V_1$ with the highest scores, calculated as follows:

$$\mathrm{score}(u) = \alpha \times w(u) + (1 - \alpha) \times \mathrm{similar}(u, f(u)) \qquad (3)$$

where $u \in V_1$, $f(u) \in V_2$ is the node aligned with $u$ in $G_{12}$, and $w(u)$ is the number of nodes $v \in V_1$ such that $(u, v) \in E_1$ and $(f(u), f(v)) \in E_2$.

Step 2. Update $V_{12}$ using $SeedV_{12}$ and $G_{12}$.

Step 3. Perform the loop of Step 2 of phase 1 for $k = n_{keep} + 1$ to $|V_1|$ to identify $A_{12}$.

After every execution of the procedure Rebuild we obtain a new alignment, which is then taken as the input $G_{12}$ for the next Rebuild run. This is repeated until no further improvement of $\mathrm{GNAS}(A_{12})$ is obtained.

input : Graph 1: G1 = (V1, E1); Graph 2: G2 = (V2, E2); Alignment network G12; nkeep
output: Better alignment network A12 = (V12, E12)
Build SeedV12;
Build V12; // based on SeedV12 and G12
for k ← nkeep + 1 to |V1| do
    i = find_next_node(G1);
    j = choose_best_matched_node(i, G1, G2);
    V12 = V12 ∪ {⟨i, j⟩};
    Update(E12);
end

Algorithm 2: Rebuild procedure
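The seed selection of Step 1 (Eq. 3) and the outer loop around Rebuild can be sketched as below. The names `seed_nodes` and `rebuild_until_stable`, the callable parameters and the toy data are hypothetical. Rounding $n_{keep}$ up and keeping at least one node is an assumption of this sketch, since the text only specifies 1%.

```python
import math


def seed_nodes(adj1, adj2, mapping, similar, alpha, keep_fraction=0.01):
    """Step 1 of Rebuild: keep the top-scoring aligned pairs as seeds.

    adj1, adj2: dict node -> set of neighbours; mapping: dict u -> f(u).
    """
    def score(u):
        fu = mapping[u]
        # w(u): neighbours v of u whose images f(v) are neighbours of f(u).
        w = sum(1 for v in adj1[u] if v in mapping and mapping[v] in adj2[fu])
        return alpha * w + (1 - alpha) * similar.get((u, fu), 0.0)

    # Assumption: round up and keep at least one seed.
    n_keep = max(1, math.ceil(keep_fraction * len(mapping)))
    ranked = sorted(mapping, key=score, reverse=True)
    return {u: mapping[u] for u in ranked[:n_keep]}


def rebuild_until_stable(alignment, seed_fn, extend_fn, objective_fn):
    """Repeat Rebuild (seed, then greedy extension) while the objective improves.

    seed_fn, extend_fn and objective_fn are caller-supplied callables standing
    for Step 1, Steps 2-3 and GNAS respectively (hypothetical interface).
    """
    best, best_score = alignment, objective_fn(alignment)
    while True:
        candidate = extend_fn(seed_fn(best))
        candidate_score = objective_fn(candidate)
        if candidate_score <= best_score:  # no improvement: stop
            return best
        best, best_score = candidate, candidate_score


# Toy data (hypothetical): path a-b-c aligned onto path x-y-z.
adj1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
adj2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
seeds = seed_nodes(adj1, adj2, {"a": "x", "b": "y", "c": "z"}, {}, alpha=0.5)
```

In the toy example the middle node b has two conserved edges and is therefore the single seed that is kept.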

3.2 FASTAn complexity

It is easy to see that the complexity of phase 1, and of each iteration of phase 2, of the FASTAn algorithm is

$$O(|V_1| \times (|E_1| + |E_2|)) \qquad (4)$$

In our experiments, phase 2 was never iterated more than 20 times. Since $|V_1| \times \Delta_1 \ge |E_1|$ and, likewise, $|V_2| \times \Delta_2 \ge |E_2|$, and noting the complexity of SPINAL given in Eq. (2), we have

$$|V_1| \times |V_2| \times \Delta_1 \times \Delta_2 \ge |E_1| \times |E_2| \ge |V_1| \times (|E_1| + |E_2|) \qquad (5)$$

The complexity of FASTAn is therefore of lower order than that of SPINAL.
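To get a feeling for the gap between Eq. (2) and Eq. (4), the two expressions can be evaluated for concrete sizes, as in the sketch below. The network sizes are hypothetical and are not taken from Table 1, the function names are illustrative, and the base of the logarithm is chosen arbitrarily because it does not matter inside the O-notation. Both expressions hide constant factors, so the ratio is only a rough indication.

```python
import math


def spinal_bound(k, v1, v2, d1, d2):
    """Evaluate the expression inside the O(...) of Eq. (2)."""
    return k * v1 * v2 * d1 * d2 * math.log2(d1 * d2)


def fastan_bound(v1, e1, e2, passes=1):
    """Evaluate the expression inside the O(...) of Eq. (4), once per pass."""
    return passes * v1 * (e1 + e2)


# Hypothetical sizes, NOT the benchmark networks of Section 4.
v1, e1, d1 = 3000, 5000, 100
v2, e2, d2 = 5000, 20000, 200

spinal_ops = spinal_bound(k=10, v1=v1, v2=v2, d1=d1, d2=d2)
# Phase 1 plus at most 20 Rebuild iterations, as observed in the experiments.
fastan_ops = fastan_bound(v1, e1, e2, passes=21)
ratio = spinal_ops / fastan_ops
```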

4 EXPERIMENTS

Experiments were carried out to compare the proposed algorithm FASTAn with the state-of-the-art method SPINAL on the 4 benchmark datasets used in the SPINAL study [Aladag and Erten (2013)]. The comparison criteria are the GNAS and edge correctness (EC) measures. Although we have already compared the complexity of the two algorithms, we also compared their average running times. The experiments were run on a PC with an Intel Core 2 Duo 2.53 GHz CPU, 4 GB of DDR2 RAM and the 64-bit Ubuntu 13.10 operating system.

4.1 Data

We used the 4 benchmark datasets that the authors of SPINAL used to evaluate its performance [Aladag and Erten (2013)]. They are the protein-protein interaction networks of Saccharomyces cerevisiae (sc), Drosophila melanogaster (dm), Caenorhabditis elegans (ce) and Homo sapiens (hs). These networks were obtained from [?]. A description of these networks, including the numbers of proteins and interactions, is given in Table 1. There are therefore 6 different pairs of networks (ce-dm, ce-hs, ce-sc, dm-hs, dm-sc, hs-sc) to be aligned. The parameter α takes 5 possible values, namely 0.3, 0.4, 0.5, 0.6 and 0.7, as used in [Aladag and Erten (2013)].

Table 1: Data description

Dataset No of proteins No of interactions

4.2 Experimental results

As noted in Section 3.1, FASTAn is a randomized algorithm, so it was executed 100 times for each pair of PPI networks. The GNAS, EC and running time were averaged over the 100 resulting alignments and then compared with those of SPINAL reported in [Aladag and Erten (2013)] (see Table 2). The corresponding 95% confidence intervals (CI) of the FASTAn scores are presented in Table 3. The running times of FASTAn and SPINAL are compared in Table 4. The experimental results show that FASTAn found solutions (i.e. global alignments) with significantly higher GNAS and EC values than SPINAL (p-value < 2.2e-16) for all α values on all 6 network pairs. Interestingly, even the worst of the 100 alignments generated by FASTAn for each network pair was better than the corresponding alignment generated by SPINAL.

Table 2: Comparison of FASTAn and the state-of-the-art global network alignment algorithm SPINAL according to the GNAS and EC criteria for different values of the parameter α. For each network pair, the first row gives the objective function score GNAS and the second row gives the EC number. FASTAn outperforms SPINAL in every cell.

| Pair | Score | α=0.3 FASTAn | α=0.3 SPINAL | α=0.4 FASTAn | α=0.4 SPINAL | α=0.5 FASTAn | α=0.5 SPINAL | α=0.6 FASTAn | α=0.6 SPINAL | α=0.7 FASTAn | α=0.7 SPINAL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ce-dm | GNAS | 778.46 | 717.99 | 1034.20 | 941.19 | 1290.11 | 1159.93 | 1545.86 | 1350.59 | 1801.24 | 1586.87 |
| ce-dm | EC | 2560.7 | 2343.0 | 2564.6 | 2320.0 | 2567.2 | 2300.0 | 2567.7 | 2237.0 | 2567.6 | 2258.0 |
| ce-hs | GNAS | 863.46 | 728.26 | 1144.17 | 993.07 | 1429.89 | 1229.95 | 1708.81 | 1501.61 | 1994.87 | 1764.93 |
| ce-hs | EC | 2842.8 | 2370.0 | 2838.1 | 2446.0 | 2844.9 | 2437.0 | 2838.0 | 2487.0 | 2843.4 | 2512.0 |
| ce-sc | GNAS | 834.79 | 709.12 | 1109.93 | 963.28 | 1389.21 | 1168.95 | 1663.39 | 1422.74 | 1936.83 | 1683.13 |
| ce-sc | EC | 2761.1 | 2326.0 | 2761.2 | 2384.0 | 2769.7 | 2323.0 | 2766.5 | 2361.0 | 2763.1 | 2398.0 |
| dm-hs | GNAS | 2260.31 | 1883.22 | 3007.11 | 2517.23 | 3755.36 | 3160.48 | 4496.45 | 3790.79 | 5242.32 | 4451.6 |
| dm-hs | EC | 7478.3 | 6189.0 | 7481.9 | 6235.0 | 7429.0 | 6282.0 | 7478.2 | 6291.0 | 7478.8 | 6344.0 |
| dm-sc | GNAS | 1977.82 | 1579.06 | 2631.85 | 2075.14 | 3290.03 | 2668.65 | 3950.16 | 3180.27 | 4603.41 | 3759.07 |
| dm-sc | EC | 6569.7 | 5203.0 | 6565.5 | 5150.0 | 6570.7 | 5311.0 | 6577.4 | 5283.0 | 6572.3 | 5360.0 |
| hs-sc | GNAS | 2268.21 | 1731.81 | 3017.96 | 2253.66 | 3772.96 | 2839.00 | 4520.51 | 3434.54 | 5279.88 | 4066.22 |
| hs-sc | EC | 7531.8 | 5703.0 | 7528.5 | 5593.0 | 7535.2 | 5651.0 | 7527.0 | 5706.0 | 7538.1 | 5798.0 |

Table 3: 95% CI of the GNAS score (first row for each pair) and of the EC (second row for each pair) of the proposed method FASTAn, calculated for each pair of studied PPI networks with different values of the parameter α.

| Pair | Score | α=0.3 | α=0.4 | α=0.5 | α=0.6 | α=0.7 |
|---|---|---|---|---|---|---|
| ce-dm | GNAS | 776.71-780.20 | 1031.87-1036.53 | 1287.52-1292.69 | 1542.58-1549.15 | 1797.47-1805.01 |
| ce-dm | EC | 2554.76-2566.71 | 2558.56-2570.55 | 2561.92-2572.38 | 2562.15-2573.19 | 2562.15-2572.97 |
| ce-hs | GNAS | 861.38-865.54 | 1141.54-1146.81 | 1426.24-1433.55 | 1704.59-1713.04 | 1936.13-2014.11 |
| ce-hs | EC | 2835.66-2849.91 | 2831.40-2844.80 | 2837.49-2852.23 | 2830.9-2845.1 | 2836.73-2850.15 |
| ce-sc | GNAS | 832.71-836.88 | 1107.08-1112.78 | 1385.35-1393.07 | 1658.72-1668.07 | 1931.82-1941.84 |
| ce-sc | EC | 2753.99-2768.20 | 2754.07-2768.39 | 2761.98-2777.5 | 2758.7-2774.36 | 2755.95-2770.31 |
| dm-hs | GNAS | 2257.83-2262.8 | 3003.68-3010.53 | 3751.37-3759.36 | 4491.11-4501.78 | 5236.36-5248.29 |
| dm-hs | EC | 7469.99-7486.6 | 7473.26-7490.54 | 7478.89-7494.99 | 7469.29-7487.1 | 7470.22-7487.3 |
| dm-sc | GNAS | 1975.58-1980.05 | 2628.55-2635.16 | 3285.91-3294.15 | 3944.38-3955.95 | 4596.57-4610.25 |
| dm-sc | EC | 6562.24-6577.18 | 6557.19-6573.79 | 6562.41-6578.91 | 6567.72-6586.99 | 6562.5116-6582.07 |
| hs-sc | GNAS | 2265.05-2271.38 | 3013.83-3022.09 | 3767.3-3778.62 | 4514.5-4526.5 | 5272.06-5287.69 |
| hs-sc | EC | 7521.13-7542.37 | 7518.17-7538.89 | 7523.85-7546.57 | 7516.92-7537 | 7526.93-7549.27 |
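The text does not state how the intervals in Table 3 were computed. One common choice, shown here purely as an assumption, is a normal-approximation interval for the mean of the 100 per-run scores. The function name and the sample values below are hypothetical.

```python
import math
import statistics


def mean_ci95(samples):
    """Return (mean, lower, upper) of a 95% normal-approximation CI for the mean.

    Assumption: 1.96 standard errors on each side; with 100 runs this is
    close to, but slightly narrower than, the Student t interval.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical scores from five runs (the paper uses 100 runs per setting).
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, lower, upper = mean_ci95(runs)
```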

Table 4: Average running time (in seconds) of FASTAn and SPINAL when both are run on the same PC to align each pair of studied PPI networks.

| Data sets | ce-dm | dm-sc | dm-hs | ce-hs | hs-sc | ce-sc |
|---|---|---|---|---|---|---|
| SPINAL | 540.2 | 1912.1 | 1736.8 | 664.3 | 2630.6 | 638.2 |
| FASTAn | 221.5 | 1064.5 | 1395.9 | 327.9 | 1507.8 | 142.2 |
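The running times in Table 4 can be turned into speedup factors (SPINAL time divided by FASTAn time) with a few lines of Python. The snippet below only re-expresses the numbers of Table 4; the variable names are illustrative.

```python
# Average running times in seconds, copied from Table 4.
spinal = {"ce-dm": 540.2, "dm-sc": 1912.1, "dm-hs": 1736.8,
          "ce-hs": 664.3, "hs-sc": 2630.6, "ce-sc": 638.2}
fastan = {"ce-dm": 221.5, "dm-sc": 1064.5, "dm-hs": 1395.9,
          "ce-hs": 327.9, "hs-sc": 1507.8, "ce-sc": 142.2}

# Speedup of FASTAn over SPINAL for each network pair.
speedup = {pair: spinal[pair] / fastan[pair] for pair in spinal}

for pair, factor in sorted(speedup.items(), key=lambda item: item[1]):
    print(f"{pair}: {factor:.2f}x")
```

The factors range from about 1.2 for dm-hs to about 4.5 for ce-sc, so FASTAn is faster on every pair, although by a varying margin.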

5 CONCLUSION AND FUTURE WORK

In this article we proposed a novel two-phase algorithm called FASTAn for the global alignment of two protein-protein interaction networks. The first phase builds an initial alignment, and the second applies a local optimization procedure to improve its quality. Experimental results demonstrate the effectiveness of the proposed algorithm for global alignment of protein-protein interaction networks in terms of the GNAS and EC criteria as well as running time. The authors of SPINAL also introduced another version of SPINAL that is optimized for the Gene Ontology Coherence (GOC) measure. In future work we will extend FASTAn in this direction.

Acknowledgments

This work was mainly done during a research stay at the Vietnam Institute for Advanced Study in Mathematics (VIASM).

References

A.E. Aladag and C. Erten. SPINAL: scalable protein interaction network alignment. Bioinformatics, 29:917–924, 2013.

G.D. Bader and C.W. Hogue. Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol, 20:991–997, 2002.

E. Banks et al. NetGrep: fast network schema searches in interactomes. Genome Biology, 9:12474–12486, 2008.

L. Chindelevitch et al. Local optimization for global alignment of protein interaction networks. Volume 15, pages 123–132, 2010.

B. Dost et al. QNet: a tool for querying protein interaction networks. J Comput Biol, 15:913–925, 2008.

J. Dutkowski and J. Tiuryn. Identification of functional modules from conserved ancestral protein-protein interactions. Bioinformatics, 23:i149–i158, 2007.

L. Chindelevitch et al. Optimizing a global alignment of protein interaction networks. Bioinformatics, 29:2765–2773, 2013.

H.W. Kuhn. The Hungarian method for the assignment problem. Naval Res Logistics, 7:83–97.

B.H. Junker and F. Schreiber. Analysis of Biological Networks. 2008.

B.P. Kelley et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci USA, 100:11394–11399, 2003.

M. Koyuturk et al. Virus detection using clonal selection algorithm with genetic algorithm (vdc algorithm). J Comput Biol, 13:182–199, 2006.

O. Kuchaiev and N. Przulj. Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics, 27:1341–1354, 2011.

O. Kuchaiev et al. Topological network alignment uncovers biological function and phylogeny. J R Soc Interface, 7:1341–1354, 2010.

V. Memisevic and N. Przulj. C-GRAAL: common-neighbors-based global graph alignment of biological networks. Integr Biol, 4:734–743, 2012.

T. Milenkovic et al. Integrative network alignment reveals large regions of global network similarity in yeast and human. Optimal network alignment with graphlet degree vectors, 9:121–137, 2010.

M. Narayanan and R.M. Karp. Comparing protein interaction networks via a graph match-and-split algorithm. J Comput Biol, 14:892–907, 2007.

D. Park et al. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res, 39:295–300, 2011.

M. Remm et al. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol, 314:1041–1052, 2001.

R. Singh et al. Global alignment of multiple protein interaction networks. In Pacific Symposium on Biocomputing, pages 303–314, 2008.

B.P. Kelley et al. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res, 32:83–88, 2004.

M. Zaslavskiy et al. Global alignment of protein-protein interaction networks by graph matching methods. Volume 25, pages 259–267, 2009.
