
A Novel Ant Based Algorithm for Multiple Graph Alignment

Tran Ngoc Ha

Thai Nguyen University of Education

hatn84@gmail.com

Do Duc Dong

Vietnam National University-Hanoi

dongdoduc@vnu.edu.vn

Hoang Xuan Huan

Vietnam National University-Hanoi

huanhx@vnu.edu.vn

Abstract— Multiple graph alignment (MGA) is a new approach to analyzing protein structures in order to explore their functional similarity. In this article, we propose a two-stage memetic algorithm for the MGA problem, named ACO-MGA2, based on the ant colony optimization metaheuristic. A local search procedure is applied only in the second stage of the algorithm to save runtime. Experimental results show that ACO-MGA2 outperforms state-of-the-art algorithms, producing alignments of better quality.

Keywords—Multiple Graph Alignment, Ant Colony Optimization, local search, memetic algorithm, SMMAS pheromone update rule

I. INTRODUCTION

Multiple sequence alignment is a useful approach for analyzing evolutionary homology among DNA sequences or proteins. However, this method is not suitable for determining functional similarities among molecules, because functional similarities relate more closely to structural features than to sequential ones [6,12,15,18,19].

Recently, a number of authors [1, 2, 10-12, 22-24] have proposed using graph-based models to represent the three-dimensional structures of proteins and using graph alignment techniques to infer functional similarities based on structural analysis. These methods mainly use exact pair-wise graph matching. They produce meaningful results when studying the functional evolution of non-homologous molecules. However, it is difficult to use these methods to discover biologically meaningful patterns that are only approximately conserved.

Weskamp et al. [21] were the first (2007) to introduce the concept of multiple graph alignment (MGA) and to use it to analyze protein active sites. They proposed a heuristic algorithm based on a greedy strategy. The graphs are used to approximately describe binding pockets in Cavbase [8,14]. In this approach, each binding pocket is modeled as a connected graph $G(V, E)$, and the MGA problem is stated as follows. Given is a set $G = \{G_1(V_1,E_1),\dots,G_n(V_n,E_n)\}$ of connected, node-labeled, edge-weighted graphs. On each graph, three edit operations are allowed: deletion or insertion of a vertex, change of the label of a node, and change of the weight of an edge. The goal of the MGA problem is to find an alignment of the vertices of the graphs in $G$ that optimizes a predefined objective function.

MGA is an NP-hard problem (see [6, 21]). Heuristic algorithms are only suitable for small problems and hence not for real applications. Fober et al. [6] extended the use of this problem to the structural analysis of biomolecules and proposed an evolutionary algorithm called GAVEO. Experiments show that this algorithm is more effective than the greedy algorithm, although it is more time consuming.

In [20] the authors proposed the ACO-MGA algorithm, which uses a simple ant colony optimization scheme to solve the multiple graph alignment problem. Experiments show that this algorithm gives better results than GAVEO; however, its running time is long and it is not efficient on large data sets.

This paper introduces a two-stage memetic algorithm based on ant colony optimization, called ACO-MGA2, as an improvement of ACO-MGA for aligning multiple graphs. We keep the construction graph of ACO-MGA but improve the heuristic information and the local search procedures. To reduce the running time, the algorithm is split into two stages, and the local search is applied only in the second stage of the memetic scheme [13]. It consists of two procedures: 1) rearranging differently labeled vertices in the alignment vectors in order to improve the compatibility of the vertices, and 2) swapping identically labeled vertices within each graph to improve the fit of the edge weights. Improvements in both runtime and solution quality of ACO-MGA2 are demonstrated empirically by comparison with GAVEO and Greedy.

The rest of this paper is organized as follows. Section 2 provides the mathematical statement of the multiple graph alignment problem and summarizes the related work. Section 3 introduces the newly proposed algorithm. The experimental results are presented in Section 4. Conclusions are presented in the last section.

II. MULTIPLE GRAPH ALIGNMENT PROBLEM AND RELATED WORKS

A. Multiple graph alignment problem

The multiple graph alignment problem was introduced by Weskamp et al. [21] for the purpose of studying protein characteristics. Fober et al. [6] extended it to analyze the structure of molecules, including their chemical composition and protein binding sites. The problem statement follows (for more details see [6, 20]).

Definition 1 (Multigraph). A multigraph is a set of graphs $G = \{G_1(V_1,E_1),\dots,G_n(V_n,E_n)\}$, where each $G_i(V_i, E_i)$ is a connected graph, each vertex is labeled from a given set $L$, and the edge weights represent the Euclidean distances between the vertices.

Definition 2 (Edit operations). The following edit operations distinguish a graph $G(V, E)$ from another graph:

i) Insertion or deletion of a node: a node $v \in V$ and the edges associated with it can be deleted or inserted.

ii) Change of the label of a node: the label $l(v)$ of a node $v \in V$ can be replaced by another label in $L$.

iii) Change of the weight of an edge: the weight $w(e)$ of an edge $e \in E$ can be changed depending on the conformation.

Definition 3 (Multiple Graph Alignment). Let $G = \{G_1(V_1,E_1),\dots,G_n(V_n,E_n)\}$ be a multigraph, and add to each vertex set $V_i$ a dummy node (denoted $\perp$) that is not connected to the other nodes. Then $A \subseteq (V_1 \cup \{\perp\}) \times \dots \times (V_n \cup \{\perp\})$ is an alignment of the multigraph $G$ if and only if:

i) For all $i = 1,\dots,n$ and for each $v \in V_i$, there exists exactly one $a = (a_1,\dots,a_n) \in A$ such that $a_i = v$.

ii) For each $a = (a_1,\dots,a_n) \in A$, there exists at least one $1 \le i \le n$ such that $a_i \ne \perp$.

Each $a = (a_1,\dots,a_n) \in A$ is called a column vector of the corresponding alignment, and each $a_i \ne \perp$ is a real node.

For the reader's convenience, we keep the notation $G = \{G_1(V_1,E_1),\dots,G_n(V_n,E_n)\}$ for the multigraph in which a dummy node has been added to each graph $G_i$.
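The two conditions of Definition 3 can be checked mechanically. The following Python sketch is illustrative only: the function name `is_alignment`, the representation of columns as tuples, and the use of `None` for the dummy node ⊥ are assumptions made for this example, not part of the original formulation.

```python
DUMMY = None  # assumed stand-in for the dummy node ⊥


def is_alignment(columns, vertex_sets):
    """Return True if `columns` satisfies conditions (i) and (ii) of Definition 3.

    columns     -- list of tuples, one entry per graph (a real vertex or DUMMY)
    vertex_sets -- list of lists, vertex_sets[i] holds the real vertices of graph i
    """
    n = len(vertex_sets)
    # Every column vector needs exactly one entry per graph.
    if any(len(col) != n for col in columns):
        return False
    # Condition (ii): no column may consist of dummy nodes only.
    if any(all(v is DUMMY for v in col) for col in columns):
        return False
    # Condition (i): each real vertex of graph i occurs in exactly one column.
    for i, vertices in enumerate(vertex_sets):
        used = [col[i] for col in columns if col[i] is not DUMMY]
        if sorted(used) != sorted(vertices):
            return False
    return True
```

For two graphs with vertex sets `["u1", "u2"]` and `["v1"]`, the columns `("u1", "v1")` and `("u2", None)` pass the check, while repeating `"u1"` in two columns does not.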

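Definition 4 (Scoring function). The score $s$ of a given alignment $A = (a^1,\dots,a^{|A|})$, whose column vectors are $a^i = (a^i_1,\dots,a^i_n)$, is defined in Equation 1:

$$ s(A) = \sum_{i=1}^{|A|} ns(a^i) + \sum_{1 \le i < j \le |A|} es(a^i, a^j) \qquad (1) $$

where $ns$ scores the fitness of the corresponding column and is calculated by Equation 2:

$$ ns(a^i) = \sum_{1 \le j < k \le n} \begin{cases} ns_m & l(a^i_j) = l(a^i_k) \\ ns_{mm} & l(a^i_j) \ne l(a^i_k) \\ ns_{dummy} & a^i_j = \perp,\ a^i_k \ne \perp \ \text{or}\ a^i_j \ne \perp,\ a^i_k = \perp \end{cases} \qquad (2) $$

and $es$ evaluates the compatibility of the edge lengths and is calculated by Equation 3:

$$ es(a^i, a^j) = \sum_{1 \le k < l \le n} \begin{cases} es_{mm} & (a^i_k, a^j_k) \in E_k,\ (a^i_l, a^j_l) \notin E_l \\ es_{mm} & (a^i_k, a^j_k) \notin E_k,\ (a^i_l, a^j_l) \in E_l \\ es_m & d^{ij}_{kl} \le \varepsilon \\ es_{mm} & d^{ij}_{kl} > \varepsilon \end{cases} \qquad (3) $$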

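In Equation 3, $d^{ij}_{kl}$ is the difference between the weights of the edges $(a^i_k, a^j_k)$ and $(a^i_l, a^j_l)$, and $\varepsilon$ is a given tolerance. The parameters ($ns_m$, $ns_{mm}$, $ns_{dummy}$, $es_m$, $es_{mm}$) are reused from [21]: $ns_m = 1.0$; $ns_{mm} = -5.0$; $ns_{dummy} = -2.5$; $es_m = 0.2$; $es_{mm} = -0.1$.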
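To make the scoring function concrete, the sketch below evaluates Equations 1 to 3 as we read them, using the parameter values quoted from [21]. It is a hedged illustration: the data layout (`labels[k]` as a per-graph label dictionary, `weights[k]` keyed by `frozenset` vertex pairs), the use of `None` for ⊥, the tolerance argument `eps`, and the choice to score pairs of two dummies or two missing edges as 0 are all assumptions not stated in the text.

```python
from itertools import combinations

DUMMY = None                                 # assumed stand-in for ⊥
NS_M, NS_MM, NS_DUMMY = 1.0, -5.0, -2.5      # node-score parameters from [21]
ES_M, ES_MM = 0.2, -0.1                      # edge-score parameters from [21]


def node_score(col, labels):
    """Equation 2: compare the labels of every pair of entries in one column."""
    total = 0.0
    for j, k in combinations(range(len(col)), 2):
        u, v = col[j], col[k]
        if u is DUMMY and v is DUMMY:
            continue                         # assumption: two dummies score 0
        if u is DUMMY or v is DUMMY:
            total += NS_DUMMY
        elif labels[j][u] == labels[k][v]:
            total += NS_M
        else:
            total += NS_MM
    return total


def edge_score(col_a, col_b, weights, eps):
    """Equation 3: compare the edges induced by two columns, graph by graph."""
    def edge_weight(k):
        u, v = col_a[k], col_b[k]
        if u is DUMMY or v is DUMMY:
            return None                      # no edge can exist at a dummy node
        return weights[k].get(frozenset((u, v)))

    total = 0.0
    for k, l in combinations(range(len(col_a)), 2):
        wk, wl = edge_weight(k), edge_weight(l)
        if wk is None and wl is None:
            continue                         # assumption: no edge on either side scores 0
        if wk is None or wl is None:
            total += ES_MM                   # edge present in only one graph
        elif abs(wk - wl) <= eps:
            total += ES_M
        else:
            total += ES_MM
    return total


def score(columns, labels, weights, eps):
    """Equation 1: sum of column scores plus pairwise edge scores."""
    s = sum(node_score(c, labels) for c in columns)
    s += sum(edge_score(a, b, weights, eps) for a, b in combinations(columns, 2))
    return s
```

With two small graphs that carry the same labels and nearly equal edge lengths, aligning equal labels scores higher than crossing them, which reflects the dominance of the label term noted later in the paper.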

A solution of the MGA problem is an alignment that maximizes the scoring function $s(A)$. This is an NP-hard problem (see [6, 21]). The cost of solving it by exhaustive search depends on $V_{max}$, the number of vertices of the graph with the most vertices, and on $n$, the number of graphs.

B. Related works

Weskamp et al. [21] proposed applying the multiple graph alignment problem to the study of protein characteristics, where graphs are used to approximately describe binding pockets.

Greedy algorithm. Weskamp et al. [21] first studied the MGA problem (2007) and used it in the analysis of protein active sites. The authors proposed a greedy algorithm that reduces the comparison of multiple graphs to pair-wise comparisons in order to find a good enough solution in a short time.

GAVEO algorithm. Fober et al. [6] proposed a genetic algorithm called GAVEO that substantially improves solution quality compared with the greedy algorithm of Weskamp et al., although its runtime is higher.

ACO-MGA algorithm. The authors of [20] proposed an ant colony optimization (ACO) algorithm that uses simple heuristics and a local search technique, which changes the positions of same-label vertices within each component graph to increase the edge fitness of the objective function. This method yields better results than GAVEO, but its running time is longer when the data size is large.

ACO method. Proposed by Dorigo in 1991 (see [5]), ACO is a stochastic metaheuristic for solving difficult combinatorial optimization problems. In these algorithms, the original problem is transformed into the problem of finding a solution on a construction graph $G = (V, E, \Omega, \eta, T)$, where $V$ is the vertex set, $E$ is the edge set, $\Omega$ is the set of constraints for building a solution, and $\eta$ and $T$ are vectors that represent the heuristic and reinforcement learning information used for constructing a solution. This information may be placed on the edges or on the vertices.

In each iteration, each ant in a colony of $m$ ants builds a solution on the construction graph. It starts from a start vertex and extends a random sequence based on the reinforcement learning information, represented by the pheromone trail, and on the heuristic information. The sequence follows a random walk that satisfies the constraints $\Omega$. The solutions are then evaluated (a local search may additionally be applied), and the pheromone trail is updated as reinforcement learning information for the next iteration. The best solution found is returned as the solution of the problem (for more details see [5]).

Memetic algorithm. Memetic algorithms [13] introduce local search techniques into population-based iterative algorithms. After each iteration, solutions are selected for local search in a flexible way, which makes the algorithms effective while taking less runtime.

To apply a memetic scheme based on the ACO method, four factors need to be determined: 1) the construction graph and the procedure for sequentially extending a solution under the given constraints, 2) the heuristic information, 3) the pheromone update rule, and 4) the local search techniques and their usage.

III. THE PROPOSED ALGORITHM

Consider the alignment problem for a set of graphs $G = \{G_1(V_1,E_1),\dots,G_n(V_n,E_n)\}$, where each graph has an added dummy node as in Definitions 3 and 4. Our new algorithm is an ACO-based memetic algorithm named ACO-MGA2. It uses the same construction graph as the ACO-MGA algorithm, but with more effective heuristic information and local search procedures. The general framework of ACO-MGA2 is as follows.

A. General framework

After initializing the parameters and $m$ artificial ants (agents), ACO-MGA2 repeatedly performs iterations in two stages, as in Algorithm 1.

The first stage (the first 70% of iterations). In each iteration, each ant builds a solution on the construction graph based on the heuristic information and the pheromone trail intensity. The algorithm then determines the best solution of the iteration, updates the pheromone trail according to the SMMAS rule, and updates the best solution found so far.

The second stage (the last 30% of iterations). In each iteration, after the ants build their solutions, two local search techniques are applied to find the best solution of the iteration. Because the vertex label fitness has more effect on the objective function (Equation 1) than the edge weight fitness does, the procedure for re-positioning vertices with different labels in the alignment vectors is applied first. These procedures follow the “best” strategy (that is, searching from the first graph to the last graph to get the best possible solution). ACO-MGA2 then updates the pheromone trail according to the SMMAS rule and updates the best solution.

Algorithm 1: ACO-MGA2 algorithm

Input: A set of graphs G = {G_1(V_1,E_1),…,G_n(V_n,E_n)}
Output: The best alignment A ⊆ (V_1 ∪ {⊥}) × … × (V_n ∪ {⊥}) for G
Begin
  Initialize; // initialize the pheromone trail matrix and m ants
  while (stop conditions not satisfied) do
    for each ant do
      The ant builds a multiple graph alignment;
    Local search // run only in the second stage
      Search by changing the positions of different-label vertices;
      Search by changing the positions of same-label vertices;
    Update the pheromone trail following the SMMAS rule;
    Update the best solution;
  End while
  Save the best solution;
End
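The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal skeleton under stated assumptions, not the authors' implementation: `build`, `local_search`, `update_pheromone` and `score` are hypothetical callables supplied by the caller, a fixed iteration count stands in for the stop condition, and applying the local search to every ant's solution (rather than to a selected subset) is our reading of the text.

```python
def run_two_stage(n_iters, n_ants, build, local_search, update_pheromone, score,
                  stage1_fraction=0.7):
    """Two-stage ACO loop: plain construction first, construction plus local search later."""
    stage2_start = round(stage1_fraction * n_iters)   # first iteration of the second stage
    best, best_score = None, float("-inf")
    for it in range(n_iters):
        # Each ant builds one candidate alignment.
        solutions = [build() for _ in range(n_ants)]
        # The local search runs only in the second stage.
        if it >= stage2_start:
            solutions = [local_search(s) for s in solutions]
        # The iteration-best solution drives the pheromone update.
        iter_best = max(solutions, key=score)
        update_pheromone(iter_best)
        iter_score = score(iter_best)
        if iter_score > best_score:
            best, best_score = iter_best, iter_score
    return best, best_score
```

Passing the components as callables keeps the skeleton independent of how alignments, pheromone trails, and scores are represented.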

B. Components of ACO-MGA2

Construction Graph

The construction graph consists of $n$ layers, where layer $i$ is the graph $G_i$ of the set $G$. The vertices of a layer are connected to all vertices of the layer below. The vertices of the bottom layer are connected to all vertices of the top layer, so the top layer is considered the next layer of the bottom layer. Figure 1 illustrates the construction graph when the ants start from the graph $G_1$; the connections between the bottom layer and the top layer are not displayed. Round nodes are real nodes and square nodes are dummy nodes.

Fig. 1. Construction graph for the alignment of n graphs

An alignment of the graphs (by Definition 3) is a set of paths from $G_1$ through every layer to $G_n$ such that each path passes through exactly one vertex of each layer and each real vertex of the construction graph has exactly one path passing through it. Dummy nodes allow more than one path to pass through.

Remark. The paths forming an alignment can be regarded as a single path, in the spirit of common ACO algorithms. This implied path starts from a vertex of the graph $G_1$ and passes through all subsequent graphs to the last graph. It then "walks" to the top-layer vertex of another alignment vector, and so on until it has passed through all real nodes, each exactly once.

Pheromone trails and heuristic information

The pheromone trail intensity $\tau^i_{j,k}$ on the edge connecting vertex $j$ of graph $G_i$ with vertex $k$ of the next graph is initialized to $\tau_{max}$ and is updated after each iteration.

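The heuristic information $\eta^i_{j,k}(a)$ is given by Equation 4:

$$ \eta^i_{j,k}(a) = \begin{cases} \dfrac{count(k,a) + 1}{n} & k \text{ is a real node} \\ \dfrac{1}{n \cdot V_{max}} & k \text{ is a dummy node} \end{cases} \qquad (4) $$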

where $count(k,a)$ is the number of vertices in the vector $\{a_1,\dots,a_i\}$ that have the same label as vertex $k$ when $k$ is a real vertex, and $V_{max}$ is the number of vertices of the graph with the most vertices.

Random walk procedure to construct an alignment

In each iteration, each ant repeats the following process to build the vectors $a = (a_1,\dots,a_n)$ of an alignment $A$.

The ant randomly chooses a real vertex that has not yet been aligned on the construction graph as the starting vertex and, based on the heuristic information and the pheromone trail, walks randomly (with the probability given by Equation 5) to a vertex of the next graph. For ease of presentation, we assume that the starting vertex is the vertex $a_1$ of the graph $G_1$ and that the ant has walked along the path $\langle a_1,\dots,a_i \rangle$ to vertex $j = a_i$ of graph $G_i$, where it chooses vertex $k$ in $G_{i+1}$ with probability:

$$ p^i_{j,k} = \frac{(\tau^i_{j,k})^{\alpha} \cdot [\eta^i_{j,k}(a)]^{\beta}}{\sum_{s \in R\_V_{i+1}} (\tau^i_{j,s})^{\alpha} \cdot [\eta^i_{j,s}(a)]^{\beta}} \qquad (5) $$

where $R\_V_i$ denotes the set of not yet aligned vertices of $V_i$, including the dummy node.
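The selection step of Equation 5 is a standard roulette-wheel choice over the remaining candidates. The sketch below is illustrative: the function name `choose_next` and the dictionary layout of `tau` and `eta` (pheromone and heuristic values already restricted to the current vertex $j$ and keyed by candidate $k$) are assumptions, and `None` again stands for the dummy node.

```python
import random


def choose_next(candidates, tau, eta, alpha=1.0, beta=1.0, rng=random):
    """Pick the next vertex with probability proportional to tau^alpha * eta^beta."""
    weights = [(tau[k] ** alpha) * (eta[k] ** beta) for k in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]          # normalised as in Equation 5
    chosen = rng.choices(candidates, weights=probs, k=1)[0]
    return chosen, probs
```

With equal pheromone on all candidates, the probabilities reduce to the normalised heuristic values, so a candidate with twice the heuristic value is twice as likely to be chosen.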

After a vector $a = (a_1,\dots,a_n)$ is fully built, the real vertices in $a$ are removed from the construction graph, and the ant repeats the procedure until all vertices have been aligned.

Note that if the first real node selected does not belong to $G_1$ but to $G_m$ instead, the above procedure consists of two phases: aligning from $G_m$ to $G_n$ and then aligning from $G_1$ to $G_{m-1}$.
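The construction procedure as a whole can be sketched as a loop that builds one column vector at a time. This is a simplified illustration under assumptions: `pick` is a hypothetical callback that selects one candidate for the next layer (for example by applying Equation 5), vertices are assumed to be sortable identifiers, and `None` stands for the dummy node.

```python
import random


def build_alignment(vertex_sets, pick, rng=random):
    """Build column vectors until every real vertex has been aligned exactly once."""
    n = len(vertex_sets)
    remaining = [set(vs) for vs in vertex_sets]   # not yet aligned real vertices
    columns = []
    while any(remaining):
        # Start from a randomly chosen real vertex that is not yet aligned.
        pool = [(g, v) for g in range(n) for v in sorted(remaining[g])]
        start_graph, start = rng.choice(pool)
        col = [None] * n
        col[start_graph] = start
        # Visit the other layers cyclically: G_{m+1}, ..., G_n, then G_1, ..., G_{m-1}.
        for step in range(1, n):
            g = (start_graph + step) % n
            candidates = sorted(remaining[g]) + [None]   # the dummy is always available
            col[g] = pick(g, candidates, col)
        # Remove the real vertices of the finished column from the construction graph.
        for g, v in enumerate(col):
            if v is not None:
                remaining[g].discard(v)
        columns.append(tuple(col))
    return columns
```

Because each column starts at a real vertex, no column can consist of dummy nodes only, and removing aligned vertices guarantees that each real vertex is used once.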

Pheromone Update Rule

After the ants have found their solutions (in the first stage) or carried out the local search (in the second stage), the pheromone trail intensity is updated according to the SMMAS pheromone update rule of [4, 9], as follows:

$$ \tau^i_{j,k} \leftarrow (1 - \rho)\,\tau^i_{j,k} + \Delta^i_{j,k} \qquad (6) $$

$$ \Delta^i_{j,k} = \begin{cases} \rho\,\tau_{max} & (i,j,k) \in \text{best solution} \\ \rho\,\tau_{min} & (i,j,k) \notin \text{best solution} \end{cases} \qquad (7) $$

where $\tau_{max}$ and $\tau_{min}$ are given parameters, $\rho \in (0,1)$ is a parameter, and the best solution is the best solution found in the current iteration.

Note that in Equation 6, the parameter $\rho$ balances two properties: intensifying the search around the best-found solution and exploring new solutions. A large $\rho$ puts the emphasis on intensification, and a small $\rho$ puts the emphasis on exploration.

Local search

The local search procedure is performed sequentially from the graph $G_1$ to the graph $G_n$ and stops when the best result has been found. It consists of two techniques: changing the positions of different-label vertices and changing the positions of same-label vertices.

1) Swap pairs of different-label vertices: swap a pair of different-label vertices of the considered graph $G_i$ between their alignment vectors if this increases the number of same-label vertices in the alignment vectors.

2) Swap pairs of same-label vertices: swap a pair of same-label vertices of the considered graph $G_i$ between their alignment vectors if this improves the fitness of the weights on the related edges.

If the score function increases after a swap, the resulting alignment replaces the current best solution. This process is repeated until no further improvement is found.

In Equation 1, the fitness of the vertex labels has more effect on the objective function than the fitness of the edge weights does. Hence swapping pairs of different-label vertices takes priority: for each alignment, we swap pairs of same-label vertices only after finishing the swaps of different-label vertices.

Because the local search procedure is time consuming, it is applied only in the second stage, when the best-found solution is already good enough.
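Both swap techniques share the same structure and differ only in which pairs they consider. The following sketch shows one pass over a single graph; it is an assumption-laden illustration rather than the authors' procedure: `score` is a hypothetical callable that evaluates a list of columns, swaps are accepted on strict improvement, and dummy entries are left in place so that no column can become all-dummy.

```python
def swap_pass(columns, i, labels_i, score, same_label):
    """One pass of pairwise swaps among the entries of graph i.

    same_label=False considers pairs with different labels (technique 1),
    same_label=True considers pairs with equal labels (technique 2).
    """
    cols = [list(c) for c in columns]
    best = score(cols)
    for p in range(len(cols)):
        for q in range(p + 1, len(cols)):
            u, v = cols[p][i], cols[q][i]
            if u is None or v is None:
                continue                      # assumption: dummies are not swapped
            if (labels_i[u] == labels_i[v]) != same_label:
                continue                      # pair belongs to the other technique
            cols[p][i], cols[q][i] = v, u     # tentative swap
            new = score(cols)
            if new > best:
                best = new                    # keep the improving swap
            else:
                cols[p][i], cols[q][i] = u, v # undo
    return cols, best
```

Following the priority described above, a caller would run the pass with `same_label=False` for every graph first and only then with `same_label=True`.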

IV. EXPERIMENTAL RESULTS

Because ACO-MGA2 is an improved version of ACO-MGA, the experiments presented here compare ACO-MGA2 only with the Greedy algorithm [21] and the evolutionary algorithm GAVEO [6], with respect to solution quality and runtime. The experiments are performed as follows:

1) Run the algorithms on the same data sets with a predetermined number of iterations to compare alignment quality and runtime.

2) Run the algorithms on the same data sets with a predetermined time budget to compare alignment quality. The time budget is varied to assess the convergence behavior.

Our experiments are performed on a computer with the following configuration: Intel Core 2 Duo 2.5 GHz CPU, 3 GB DDR2 RAM, and the Windows 7 operating system. The parameters are set as follows:

• The number of ants in each iteration is 30.

• $\rho_1 = 0.3$, $\rho_2 = 0.7$, $\alpha = \beta = 1$.

• $\tau_{max} = 1.0$ and $\tau_{min} = \tau_{max}/(n^2 \cdot V_{max}^2)$, where $n$ is the number of graphs and $V_{max}$ is the number of vertices of the graph with the most vertices.

• The local search procedure is applied in the last 30% of iterations.
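As a small worked example of the pheromone bounds listed above, the helper below evaluates $\tau_{min} = \tau_{max}/(n^2 \cdot V_{max}^2)$. The function name `tau_min` is hypothetical, and the sample values ($n = 16$, $V_{max} = 94$) are taken from the data set sizes reported in the experiments.

```python
def tau_min(n_graphs, v_max, tau_max=1.0):
    """Lower pheromone bound: tau_max / (n^2 * V_max^2)."""
    return tau_max / (n_graphs ** 2 * v_max ** 2)


# For 16 graphs with at most 94 vertices the denominator is 256 * 8836 = 2262016,
# so the lower bound is several orders of magnitude below tau_max = 1.0.
lower = tau_min(16, 94)
```

The bound shrinks quadratically in both the number of graphs and the size of the largest graph, so larger instances leave more room between the two pheromone limits.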

A. Effectiveness and runtime comparisons

The experimental data consists of 74 structures generated from the Cavbase database. Each structure represents a protein cavity belonging to the protein family of thermolysin, a bacterial protease commonly used in protein analysis and annotated with the EC number 3.4.24.27 in the ENZYME database [5].

In this data set, each generated graph has 42 to 94 vertices. From the 74 structures, graphs are selected at random to generate data sets consisting of 4, 8, 16 and 32 graphs. To compare the solution quality of the algorithms, we ran each algorithm on each data set 20 times and took the average values for comparison.

The scores and runtimes of the algorithms are shown in Table 1.


Table 1. Comparison of the score and runtime on data sets consisting of 4, 8, 16 and 32 graphs

Remark: The experimental results in Table 1 show that:

• The Greedy algorithm runs much faster than ACO-MGA2 and GAVEO, but its solution quality is too low.

• ACO-MGA2 has the best solution quality in every case. In particular, as the number of graphs increases, the advantage of ACO-MGA2 over GAVEO becomes more prominent. In terms of runtime, ACO-MGA2 also achieves better results than GAVEO.

B. Comparing GAVEO and ACO-MGA2 under a predetermined amount of time

Because the greedy method requires little runtime and its solution quality is very low, in this section we only compare the solution quality of GAVEO and ACO-MGA2 under the same runtime.

We run the GAVEO and ACO-MGA2 algorithms on a data set of 16 graphs, each containing 45 to 94 vertices, with the runtime increasing from 1000 s to 6000 s. The results are shown in the chart in Figure 2.

Fig. 2. Comparison of the results of ACO-MGA2 and GAVEO on a data set of 16 graphs as the runtime increases from 1000 s to 6000 s

Remark: The chart in Figure 2 shows that, as the time budget increases from 1000 s to 6000 s, the solution quality of ACO-MGA2 is always better than that of GAVEO.

V. CONCLUSIONS

The MGA problem is a new approach to the structural analysis of biological molecules; to date, three algorithms have been introduced to solve it. The Greedy algorithm is a heuristic algorithm, so it is exceptionally fast, but its solution quality is not good. The newly proposed algorithm ACO-MGA2 is an improved version of ACO-MGA. Experiments show that it clearly outperforms the GAVEO algorithm with respect to both solution quality and runtime.

Like other ACO-based algorithms, ACO-MGA2 could easily be parallelized to work with a large number of graphs.

ACKNOWLEDGMENT

This work was mainly done during the authors' stay at the Vietnam Institute for Advanced Study in Mathematics (VIASM).

We thank Dr. Thomas Fober for useful email communications and for providing the dataset for testing.

REFERENCES

[1] A. E. Aladag and C. Erten (2013), “SPINAL: scalable protein interaction network alignment,” Bioinformatics, 29, 917–924.

[2] D. Conte, P. Foggia, C. Sansone, and M. Vento (2004), “Thirty Years of Graph Matching in Pattern Recognition,” Int’l J. Pattern Recognition and Artificial Intelligence, Vol. 18, No. 3, 265-298.

[3] O. Dror, H. Benyamini, R. Nussinov, and H. Wolfson (2003), “MASS: Multiple Structural Alignment by Secondary Structures,” Bioinformatics, Vol. 19, No. 1, 95-104.

[4] D. Do Duc, H. Q. Dinh, and H. Hoang Xuan (2008), “On the Pheromone Update Rules of Ant Colony Optimization Approaches for the Job Shop Scheduling Problem,” 11th Pacific Rim International Conference on Multi-Agents, PRIMA 2008, Hanoi, Vietnam (LNCS), December 15-16, 153-160.

[5] M. Dorigo and T. Stutzle (2004), Ant Colony Optimization, The MIT Press, Cambridge, Massachusetts.

[6] T. Fober, M. Mernberger, G. Klebe and E. Hullermeier (2009), “Evolutionary Construction of Multiple Graph Alignments for the Structural Analysis of Biomolecules,” Bioinformatics, Vol. 25, No. 16, 2110-2117.

[7] J. F. Gibrat, T. Madej and S. H. Bryant (1996), “Surprising similarities in structure comparison,” Current Opinion in Structural Biology, Vol. 6, No. 3, 377-385.

[8] M. Hendlich, A. Bergner, J. Günther, and G. Klebe (2003), “Relibase: Design and Development of a Database for Comprehensive Analysis of Protein-Ligand Interactions,” J. Molecular Biology, Vol. 326, 607-620.

[9] H. Hoang Xuan, T. Nguyen Linh, D. Do Duc, and H. Huu Tue (2012), “Solving the Traveling Salesman Problem with Ant Colony Optimization: A Revisit and New Efficient Algorithms,” REV Journal on Electronics and Communications, Vol. 2, No. 3–4, July–December, 121-129.

[10] K. Kinoshita and H. Nakamura (2005), “Identification of the Ligand Binding Sites on the Molecular Surface of Proteins,” Protein Science, Vol. 14, No. 3, 711-718.

[11] O. Kuchaiev and N. Przulj (2011), “Integrative network alignment reveals large regions of global network similarity in yeast and human,” Bioinformatics, 27, 1390–1396.

[12] M. Mernberger, G. Klebe and E. Hullermeier (2009), “SEGA: Semiglobal Graph Alignment for Structure-Based Protein Comparison,” IEEE/ACM Trans. on Computational Biology and Bioinformatics, Vol. 8, No. 5, 1330-1342.

[13] F. Neri, C. Cotta, and P. Moscato (2012), Handbook of Memetic Algorithms, Springer.


[14] S. Schmitt, D. Kuhn, and G. Klebe (2002), “A New Method to Detect Related Function among Proteins Independent of Sequence and Fold Homology,” J. Molecular Biology, Vol. 323, No. 2, 387-406.

[15] D. Shasha, J. Wang, and R. Giugno (2002), “Algorithmics and Applications of Tree and Graph Searching,” Proc. 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press, New York, USA, 39-52.

[16] M. Shatsky, R. Nussinov and H. Wolfson (2004), “A Method for Simultaneous Alignment of Multiple Protein Structures,” Proteins: Structure, Function, and Bioinformatics, Vol. 56, No. 1, 143-156.

[17] M. Shatsky, A. Shulman-Peleg, R. Nussinov, and H. J. Wolfson (2006), “The multiple common point set problem and its application to molecule binding pattern detection,” Journal of Computational Biology, Vol. 13, No. 2, 407-428.

[18] R. Spriggs, P. Artymiuk, and P. Willett (2003), “Searching for Patterns of Amino Acids in 3D Protein Structures,” J. of Chem. Inform. and Comp. Sciences, Vol. 43, No. 2, 412-421.

[19] J. D. Thompson, D. G. Higgins and T. J. Gibson (1994), “Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, Vol. 22, 4673-4680.

[20] Tran Ngoc Ha, Do Duc Dong, and Hoang Xuan Huan (2013), “An Efficient Ant Colony Optimization Algorithm for Multiple Graph Alignment,” Proceedings of the International Conference on Computing, Management and Telecommunications, 386-391.

[21] N. Weskamp, E. Hullermeier, D. Kuhn and G. Klebe (2007), “Multiple Graph Alignment for the Structural Analysis of Protein Active Sites,” IEEE/ACM Trans. Comput. Biol. Bioinform., Vol. 4, No. 2, 310-320.

[22] X. Yan, P. Yu and J. Han (2005), “Substructure Similarity Search in Graph Databases,” Proc. of ACM SIGMOD Int. Conf. on Management of Data, New York, 766-777.

[23] X. Yan, F. Zhu, J. Han, and P. Yu (2006), “Searching Substructures with Superimposed Distance,” Proc. of International Conference on Data Engineering, 88-88.

[24] S. Zhang, M. Hu, and J. Yang (2007), “Treepi: A novel graph indexing method,” Proc. of 23rd International Conference on Data Engineering, 966-975.
