A Novel Ant Based Algorithm for Multiple Graph Alignment
Tran Ngoc Ha
Thai Nguyen University of Education
hatn84@gmail.com
Do Duc Dong Vietnam National University-Hanoi dongdoduc@vnu.edu.vn
Hoang Xuan Huan Vietnam National University-Hanoi huanhx@vnu.edu.vn
Abstract— Multiple graph alignment (MGA) is a new
approach for analyzing protein structures in order to explore their
functional similarity. In this article, we propose a two-stage
memetic algorithm based on the ant colony optimization
metaheuristic, named ACO-MGA2, to solve the MGA problem. A local
search procedure is applied only in the second stage of the
algorithm to save runtime. Experimental results show that
ACO-MGA2 outperforms state-of-the-art algorithms, producing
alignments of better quality.
Keywords—Multiple Graph Alignment, Ant Colony
Optimization, local search, memetic algorithm, SMMAS
pheromone update rule
I INTRODUCTION
Multiple sequence alignment is a useful approach for
analyzing evolutionary homology among DNA sequences or
proteins. However, this method is not suitable for determining
functional similarities among molecules, because
functional similarities relate more closely to structural features
than to sequential ones [6,12,15,18,19].
Recently, a number of authors [1, 2, 10-12, 22-24] have
proposed using graph models to represent the
three-dimensional structures of proteins and using graph
alignment techniques to infer functional similarities based on
structural analysis. These methods mainly use exact pair-wise
graph matching techniques. They produce meaningful results
when studying the functional evolution of non-homologous
molecules. However, it is difficult to use these methods
to discover biologically meaningful patterns that are only
approximately conserved.
Weskamp et al. [21] were the first (2007) to introduce the
concept of multiple graph alignment (MGA) and to use it to
analyze protein active sites. They proposed a heuristic
algorithm based on a greedy strategy. The graphs are used to
approximately describe binding pockets in Cavbase [8,14]. In
this approach, each binding pocket is modeled as a connected
graph $G(V, E)$, and the MGA problem is stated as follows. Given
a set $G = \{G_1(V_1,E_1), \dots, G_n(V_n,E_n)\}$ of connected,
node-labeled, edge-weighted graphs, three edit operations are
allowed on each graph: deletion or insertion of a vertex, change
of the label of a node, and change of the weight of an edge. The
goal of the MGA problem is to find an alignment of the
vertices of the graphs in $G$ that optimizes a predefined
objective function.
MGA is an NP-hard problem (see [6, 21]). The heuristic algorithm is suitable only for small problems and hence not for real applications. Fober et al. [6] extended the use of this problem to the structural analysis of biomolecules and proposed an evolutionary algorithm called GAVEO. Experiments show that this algorithm produces better alignments than the greedy algorithm, although it is more time consuming.
In [20], the authors proposed the ACO-MGA algorithm, which uses a simple ant colony optimization scheme to solve the multiple graph alignment problem. Experiments show that this algorithm gives better results than GAVEO; however, its running time is long and its efficiency degrades on large data sets.
This paper introduces a two-stage memetic algorithm based on ant colony optimization, called ACO-MGA2, as an improvement of ACO-MGA for aligning multiple graphs. We keep the construction graph of ACO-MGA but improve the heuristic information and the local search procedures. To reduce the running time, the algorithm is split into two stages, and the local search is applied only in the second stage of the memetic scheme [13]. The local search consists of two procedures: 1) rearranging differently labeled vertices in the alignment vectors to improve the compatibility of the vertices, and 2) swapping identically labeled vertices within each graph to improve the fit of the edge weights. The improvements of ACO-MGA2 in both runtime and solution quality are demonstrated empirically by comparison with GAVEO and Greedy.
The rest of this paper is organized as follows. Section 2 provides the mathematical statement of the multiple graph alignment problem and summarizes the related work. Section 3 introduces the newly proposed algorithm. The experimental results are presented in Section 4. Conclusions are given in the last section.
II MULTIPLE GRAPH ALIGNMENT PROBLEM AND
RELATED WORKS
A Multiple graph alignment problem
The multiple graph alignment problem was introduced by Weskamp et al. [21] for the purpose of studying protein characteristics. Fober et al. [6] extended it to analyze the structure of molecules, including the chemical composition and the protein binding sites. The problem statement follows (for more details, see [6, 20]).
Definition 1 (Multigraph). A multigraph is a set of graphs
$G = \{G_1(V_1, E_1), \dots, G_n(V_n, E_n)\}$, where each $G_i(V_i, E_i)$ is a connected
graph, each vertex is labeled with a label from a given set $L$, and the edge
weights represent the Euclidean distances between the vertices.
Definition 2 (Edit operations). The following edit
operations are used to measure the difference between a graph
$G(V, E)$ and another graph:
i) Insertion or deletion of a node: a node $v \in V$ and the
edges associated with it can be deleted or inserted.
ii) Change of the label of a node: the label $l(v)$ of a
node $v \in V$ can be replaced by another label in $L$.
iii) Change of the weight of an edge: the weight $w(e)$ of
an edge $e \in E$ can be changed based on the conformation.
Definition 3 (Multiple Graph Alignment). Let
$G = \{G_1(V_1,E_1), \dots, G_n(V_n,E_n)\}$ be a multigraph, and add to each vertex set $V_i$ a
dummy node (denoted $\perp$) that is not connected to the other
nodes. Then $A \subseteq (V_1 \cup \{\perp\}) \times \dots \times (V_n \cup \{\perp\})$ is an alignment of the
multigraph $G$ if and only if:
i) For all $i = 1, \dots, n$ and for each $v \in V_i$, there exists
exactly one $a = (a_1, \dots, a_n) \in A$ such that $v = a_i$.
ii) For each $a = (a_1, \dots, a_n) \in A$, there exists at least one
$1 \le i \le n$ such that $a_i \ne \perp$.
Each $a = (a_1, \dots, a_n) \in A$ is called a column vector of the
corresponding alignment, and each $v \in V_i$ is called a real node.
For the readers' convenience, we keep the notation
$G = \{G_1(V_1, E_1), \dots, G_n(V_n, E_n)\}$ to refer to the multigraph in which
a dummy node has been added to each graph $G_i$.
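The two conditions of Definition 3 can be checked mechanically. The following Python sketch is only an illustration under assumed conventions: a column vector is a tuple with one entry per graph, the dummy node is represented by `None`, and the function name `is_valid_alignment` is hypothetical.

```python
DUMMY = None  # assumption: the dummy node is represented by None


def is_valid_alignment(columns, vertex_sets):
    """Check conditions i) and ii) of Definition 3.

    columns: list of tuples (a_1, ..., a_n), one entry per graph.
    vertex_sets: list of n sets holding the real vertices of each graph.
    """
    n = len(vertex_sets)
    seen = [dict() for _ in range(n)]  # per graph: vertex -> number of columns
    for col in columns:
        if len(col) != n:
            return False
        # Condition ii): every column contains at least one real node.
        if all(v is DUMMY for v in col):
            return False
        for i, v in enumerate(col):
            if v is DUMMY:
                continue
            if v not in vertex_sets[i]:
                return False
            seen[i][v] = seen[i].get(v, 0) + 1
    # Condition i): every real vertex occurs in exactly one column.
    return all(
        set(seen[i]) == set(vertex_sets[i])
        and all(count == 1 for count in seen[i].values())
        for i in range(n)
    )
```

For two graphs with vertex sets {a, b} and {x}, the columns (a, x) and (b, dummy) form a valid alignment, while dropping the second column violates condition i).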
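Definition 4 (Scoring function). The score $s$ of a given
alignment $A = (a^1, \dots, a^{|A|})$ is defined as in Equation 1:

$$ s(A) = \sum_{i=1}^{|A|} ns(a^i) + \sum_{1 \le i < j \le |A|} es(a^i, a^j) \tag{1} $$

where $ns$ scores the fitness of the corresponding
column and is calculated by Equation 2:

$$ ns(a^i) = \sum_{1 \le j < k \le n} \begin{cases} ns_m & l(a^i_j) = l(a^i_k) \\ ns_{mm} & l(a^i_j) \ne l(a^i_k) \\ ns_{dummy} & \text{exactly one of } a^i_j,\ a^i_k \text{ is } \perp \end{cases} \tag{2} $$

and $es$ evaluates the compatibility of the edge lengths and
is calculated by Equation 3:

$$ es(a^i, a^j) = \sum_{1 \le k < l \le n} \begin{cases} es_{mm} & (a^i_k, a^j_k) \in E_k,\ (a^i_l, a^j_l) \notin E_l \\ es_{mm} & (a^i_k, a^j_k) \notin E_k,\ (a^i_l, a^j_l) \in E_l \\ es_m & d^{ij}_{kl} \le \varepsilon \\ es_{mm} & d^{ij}_{kl} > \varepsilon \end{cases} \tag{3} $$

In Equation 3, $d^{ij}_{kl}$ is the absolute difference between the weights of the edges $(a^i_k, a^j_k) \in E_k$ and $(a^i_l, a^j_l) \in E_l$, and $\varepsilon$ is the tolerance within which two edge lengths are considered compatible.
The parameters ($ns_m$, $ns_{mm}$, $ns_{dummy}$, $es_m$, $es_{mm}$) are reused from [21]: $ns_m = 1.0$;
$ns_{mm} = -5.0$; $ns_{dummy} = -2.5$; $es_m = 0.2$; $es_{mm} = -0.1$.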
A solution of an MGA problem is an alignment that maximizes the scoring function $s(A)$. This is an NP-hard problem (see [6, 21]). The cost of solving it by exhaustive search depends on $V_{max}$, the number of vertices of the graph with the highest number of vertices, and on $n$, the number of graphs.
B Related works
Weskamp et al. [21] proposed applying multiple graph alignment to study protein characteristics, where graphs are used to approximately describe the binding pockets.
Greedy algorithm. Weskamp et al. [21] first studied the MGA problem (2007) and used it in the analysis of protein active sites. The authors proposed a greedy algorithm that reduces the comparison of multiple graphs to pair-wise comparisons in order to find a good enough solution within a small amount of time.
GAVEO algorithm. Fober et al. [6] proposed an evolutionary algorithm called GAVEO that substantially improves solution quality compared with the greedy algorithm proposed by Weskamp et al., although its runtime is higher.
ACO-MGA algorithm. The authors of [20] proposed an ant colony optimization (ACO) algorithm that uses a simple heuristic and a local search technique that changes the positions of same-label vertices within each component graph to increase the edge fitness in the objective function. This method yields better results than GAVEO, but its running time is longer when the data size is large.
ACO method. This method, proposed by Dorigo in 1991 (see [5]), is a stochastic metaheuristic for solving difficult combinatorial optimization problems. In these algorithms, the original problem is transformed into the problem of finding a solution on a construction graph $G = (V, E, \Omega, \eta, T)$, where $V$ is the vertex set, $E$ is the edge set, $\Omega$ is the set of constraints for building a solution, and $\eta$ and $T$ are vectors that represent the heuristic information and the reinforcement learning information used for constructing a solution. This information may be placed on the edges or on the vertices.
In each iteration, each ant in a colony of m ants builds a solution on the construction graph. It starts from a start vertex and extends a random sequence based on the reinforcement learning information, represented by the pheromone trail, and on the heuristic information. The sequence follows a random walk that satisfies the constraints $\Omega$. The solutions are then evaluated (a local search may additionally be applied), and the pheromone trail is updated as reinforcement learning information for the next iteration. The best solution found is returned as the solution of the problem (for more details, see [5]).
Memetic algorithm. Memetic algorithms [13] introduce local search techniques into population-based iterative algorithms. The solutions found after each iteration are selected for local search in a flexible way. Thus, the algorithms are efficient and take less runtime.
To apply the memetic scheme based on the ACO method,
four factors need to be resolved: 1) the construction graph and
the procedure for sequentially developing a solution according to the given
constraints, 2) the heuristic information, 3) the pheromone update
rule, and 4) the local search techniques and their usage.
III THE PROPOSED ALGORITHM
Consider the alignment problem for a set of graphs
$G = \{G_1(V_1,E_1), \dots, G_n(V_n,E_n)\}$, where a dummy node has been
added to each graph as in Definitions 3 and 4. Our new algorithm is an
ACO-based memetic algorithm named ACO-MGA2. It uses
the same construction graph as the ACO-MGA algorithm but
with more efficient heuristic information and local search
procedures. The general framework of ACO-MGA2 is as follows.
A General framework
After initializing the parameters and m artificial ants (agents),
ACO-MGA2 iterates through two stages, as in Algorithm 1.
The first stage (applied for the first 70% of iterations). In
each iteration, each ant builds a solution on the construction
graph based on the heuristic information and the pheromone trail
intensity. Then the algorithm determines the best solution of
the iteration, updates the pheromone trail according to the SMMAS
rule, and updates the best solution found so far.
The second stage (applied for the last 30% of iterations). In
each iteration, after the ants build their solutions, two local search
techniques are applied to find the best solution of the iteration.
Because the vertex label fitness has more effect on the
objective function (Equation 1) than the edge weight fitness
does, the procedure that re-positions vertices of different
labels on the alignment vectors is applied first. These
procedures are applied following "the best" strategy (that is,
searching from the first graph to the last graph to get the best
possible solution). Then ACO-MGA2 updates the pheromone trail
according to the SMMAS rule and updates the best solution.
Algorithm 1: ACO-MGA2 algorithm
Input: A set of graphs $G = \{G_1(V_1,E_1), \dots, G_n(V_n,E_n)\}$
Output: The best alignment $A \subseteq (V_1 \cup \{\perp\}) \times \dots \times (V_n \cup \{\perp\})$ for $G$
Begin
  Initialize; // initialize pheromone trail matrix and m ants
  while (stop conditions not satisfied) do
    for each ant do
      The ant builds a multiple graph alignment;
      Local search: // run only at the second stage
        Search by changing the positions of the different label vertices;
        Search by changing the positions of the same label vertices;
    End for
    Update pheromone trail following the SMMAS rule;
    Update the best solution;
  End while
  Save the best solution;
End
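The control flow of Algorithm 1 can be sketched in Python as below. This is a skeleton under stated assumptions, not the authors' code: the four callables (`build_solution`, `local_search`, `update_pheromone`, `score`) are hypothetical placeholders for the components described in this section, the stop condition is taken to be a fixed number of iterations, and local search is assumed to be applied to every ant's solution during the second stage.

```python
def aco_mga2(build_solution, local_search, update_pheromone, score,
             n_ants=30, n_iters=100, stage1_fraction=0.7):
    """Two-stage memetic loop: local search only in the last part of the run."""
    stage1_iters = round(stage1_fraction * n_iters)
    best, best_score = None, float("-inf")
    for it in range(n_iters):
        # Each ant builds one alignment on the construction graph.
        solutions = [build_solution() for _ in range(n_ants)]
        if it >= stage1_iters:
            # Second stage: refine the constructed alignments.
            solutions = [local_search(s) for s in solutions]
        iteration_best = max(solutions, key=score)
        # Pheromone is reinforced along the iteration-best solution.
        update_pheromone(iteration_best)
        iteration_score = score(iteration_best)
        if iteration_score > best_score:
            best, best_score = iteration_best, iteration_score
    return best, best_score
```

Passing the components as functions keeps the skeleton independent of how alignments and pheromone trails are stored.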
B Components of ACO-MGA2
Construction Graph
The construction graph consists of n layers, where layer i
is the graph $G_i$ in the set $G$. The vertices of a layer are connected to all
of the vertices of the layer below. The vertices of the bottom layer are connected to all of the vertices of the top layer, so the top layer is considered the next layer of the bottom layer. Figure 1 illustrates the construction graph where ants start from the graph $G_1$; the connections with the bottom layer are not displayed. Round nodes are real nodes and square nodes are dummy nodes.
Fig.1 Construction graph for n graphs alignment
An alignment of the graphs (by Definition 3) is a set of paths from
$G_1$ through every layer to $G_n$ such that each path passes through exactly one vertex of each layer and each real vertex of the construction graph lies on exactly one path. Dummy nodes may lie on more than one path.
Remark. The paths forming this alignment can be viewed as a single path, in the spirit of the usual ACO algorithm. This implied path starts from a vertex of the graph
$G_1$ and passes through all subsequent graphs to the last graph. It then
"walks" to a vertex of the top layer to start another alignment vector, and continues until it has passed through all real nodes, each node exactly once.
Pheromone trails and heuristic information
The pheromone trail intensity $\tau^i_{j,k}$ on the edge connecting vertex $j$ of graph $G_i$ with vertex $k$ of the next graph is initialized to $\tau_{max}$ and is updated after each iteration.
The heuristic information $\eta^i_{j,k}(a)$ is given by Equation 4:

$$ \eta^i_{j,k}(a) = \begin{cases} \dfrac{count(k,a) + 1}{n} & k \text{ is a real node} \\ \dfrac{1}{n \cdot V_{max}} & k \text{ is a dummy node} \end{cases} \tag{4} $$

where $count(k,a)$ is the number of vertices in the partial vector $(a_1, \dots, a_i)$ that have the same label as vertex $k$ when $k$ is a
real vertex, and $V_{max}$ is the number of vertices of the graph with the most vertices.
Random walk procedure to construct an alignment
In each iteration, each ant repeats the following process to build
the vectors $a = (a_1, \dots, a_n)$ of an alignment $A$.
The ant randomly chooses a real vertex of the construction graph that is not
yet aligned as the starting vertex and, based
on the heuristic information and the pheromone trail, walks
randomly and sequentially (with the probability given by
Equation 5) to a vertex of the next graph. For ease of
presentation, we assume that this starting vertex is the vertex $a_1$ of the
graph $G_1$ and that the ant has walked along the path $\langle a_1, \dots, a_i \rangle$ to vertex
$j = a_i$ of graph $G_i$. It then chooses vertex $k$ in $G_{i+1}$ with
probability:

$$ p^i_{j,k} = \frac{[\tau^i_{j,k}]^{\alpha}\,[\eta^i_{j,k}(a)]^{\beta}}{\sum_{s \in R\_V_{i+1}} [\tau^i_{j,s}]^{\alpha}\,[\eta^i_{j,s}(a)]^{\beta}} \tag{5} $$

where $R\_V_{i+1}$ is the set of not yet aligned vertices of $V_{i+1}$,
including the dummy node.
After a vector has been fully developed into $a = (a_1, \dots, a_n)$, the real
vertices in $a$ are removed from the construction graph, and the ant
repeats the alignment procedure until all
vertices have been aligned.
Note that if the first real node selected does not belong to
$G_1$ but to $G_m$ instead, the above procedure consists
of two processes: aligning from $G_m$ to $G_n$ and then aligning from $G_1$
to $G_{m-1}$.
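One step of the random walk, choosing the next vertex in proportion to pheromone and heuristic values, can be sketched in Python as follows. The sketch assumes the usual ACO form of the transition rule (pheromone raised to a power alpha times heuristic raised to a power beta, normalized over the candidates). The function names and the dictionary-based inputs are hypothetical; `tau` and `eta` hold the values for the edges leaving the current vertex.

```python
import random


def selection_probabilities(candidates, tau, eta, alpha=1.0, beta=1.0):
    """Probability of moving to each not yet aligned candidate vertex.

    tau[k] and eta[k] are the pheromone and heuristic values of the edge
    from the current vertex to candidate k (including the dummy node).
    """
    weights = {k: (tau[k] ** alpha) * (eta[k] ** beta) for k in candidates}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def choose_next(candidates, tau, eta, alpha=1.0, beta=1.0, rng=random):
    """Draw the next vertex by roulette-wheel selection."""
    probs = selection_probabilities(candidates, tau, eta, alpha, beta)
    return rng.choices(candidates, weights=[probs[k] for k in candidates])[0]
```

With equal pheromone on all edges, the choice is driven by the heuristic alone, so a candidate whose label matches the partial vector is preferred over the dummy node.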
Pheromone Update Rule
After the ants have found their solutions (in the first stage) or
carried out the local search (in the second stage), the pheromone
trail intensity is updated according to the SMMAS pheromone
update rule in [4, 9], as follows:

$$ \tau^i_{j,k} \leftarrow (1 - \rho)\,\tau^i_{j,k} + \Delta^i_{j,k} \tag{6} $$

$$ \Delta^i_{j,k} = \begin{cases} \rho\,\tau_{max} & (i,j,k) \in \text{best solution} \\ \rho\,\tau_{min} & (i,j,k) \notin \text{best solution} \end{cases} \tag{7} $$

where $\tau_{max}$ and $\tau_{min}$ are given parameters, $\rho \in (0,1)$ is a
parameter, and best solution is the best solution found in the current
iteration.
Note that in Equation 6, the parameter $\rho$ balances two
behaviors: intensifying the search around the best-found
solution and exploring new solutions. A large $\rho$ puts emphasis
on intensification, and a small $\rho$ puts emphasis on
exploration.
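The SMMAS update can be sketched in Python as below, assuming the standard form of the rule: every trail evaporates by a factor (1 - rho) and then receives rho times tau_max if its edge belongs to the best solution, or rho times tau_min otherwise. Storing the trails in a dictionary keyed by (i, j, k) triples and the function name `smmas_update` are choices made for this illustration only.

```python
def smmas_update(tau, best_edges, rho, tau_max, tau_min):
    """Apply one SMMAS pheromone update in place and return the trail dict.

    tau: dict mapping an edge key (i, j, k) to its trail intensity.
    best_edges: set of edge keys used by the iteration-best solution.
    """
    for edge in tau:
        delta = rho * (tau_max if edge in best_edges else tau_min)
        tau[edge] = (1.0 - rho) * tau[edge] + delta
    return tau
```

Because each new value is a convex combination of the old value and either bound, trails initialized within [tau_min, tau_max] stay within that interval without explicit clamping.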
Local search
The local search procedure is performed sequentially from
the graph $G_1$ to the graph $G_n$ and stops when
the best result has been found. The procedure consists of two techniques:
changing the positions of different-label vertices and changing the
positions of same-label vertices.
1) Swap pairs of different-label vertices: swap a pair
of different-label vertices of the considered graph $G_i$ on the
corresponding alignment vectors if that increases the number
of same-label vertices on the alignment vectors.
2) Swap pairs of same-label vertices: swap a pair of
same-label vertices of the considered graph $G_i$ on the
corresponding alignment vectors if that improves the fitness of the weights on the related edges.
If the score function increases after a swap, the resulting alignment replaces the current best solution. This process is repeated until no further improvement is found.
In Equation 1, the fitness of the vertex labels has more effect
on the objective function than the fitness of the edge weights does. Hence, swapping pairs of different-label vertices takes priority: for each alignment, we swap pairs
of same-label vertices only after finishing swapping pairs of different-label vertices.
Because the local search procedure is time consuming, it is applied only in the second stage, when the best-found solution
is already good enough.
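A swap-based local search of this kind can be sketched in Python as follows. It is a simplified illustration, not the authors' procedure: swaps are accepted one at a time whenever the score strictly increases, dummy entries are never swapped, and the score is recomputed from scratch through a caller-supplied function. The names `swap_local_search`, `same_label` and `want_same` are hypothetical. Calling it first with `want_same=False` and then with `want_same=True` mirrors the priority order described above.

```python
from itertools import combinations


def swap_local_search(columns, score, same_label, want_same):
    """Swap entries of graph i between two columns while the score improves.

    columns: list of tuples, one entry per graph (None marks a dummy node).
    score: function mapping a list of columns to a number.
    same_label: function (i, u, v) -> True if u and v of graph i share a label.
    want_same: True to swap same-label pairs, False for different-label pairs.
    """
    cols = [list(c) for c in columns]
    best = score(cols)
    n_graphs = len(cols[0])
    for i in range(n_graphs):  # graphs are processed from first to last
        improved = True
        while improved:
            improved = False
            for p, q in combinations(range(len(cols)), 2):
                u, v = cols[p][i], cols[q][i]
                if u is None or v is None:
                    continue  # assumption: dummy entries are left in place
                if same_label(i, u, v) != want_same:
                    continue
                cols[p][i], cols[q][i] = v, u
                new_score = score(cols)
                if new_score > best:
                    best, improved = new_score, True
                else:
                    cols[p][i], cols[q][i] = u, v  # undo a non-improving swap
    return [tuple(c) for c in cols], best
```

Every accepted swap strictly increases the score, so the loop terminates after finitely many swaps.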
IV EXPERIMENTAL RESULTS
Because ACO-MGA2 is an improved version of ACO-MGA, the experiments presented here compare ACO-MGA2 only with the Greedy algorithm [21] and the evolutionary algorithm GAVEO [6], with respect to solution quality and runtime. The experiments are performed as follows:
1) Run the algorithms on the same data sets with a predetermined number of iterations to compare the alignment quality and runtime.
2) Run the algorithms on the same data sets with a
predetermined runtime to compare the alignment quality. The runtime is varied to assess the convergence behavior.
Our experiments were performed on a computer with the following configuration: Intel Core 2 Duo 2.5 GHz CPU, 3 GB DDR2 RAM, and the Windows 7 operating system. The parameters are set as follows:
• The number of ants in each iteration is 30.
• $\rho_1 = 0.3$, $\rho_2 = 0.7$, $\alpha = \beta = 1$.
• $\tau_{max} = 1.0$ and $\tau_{min} = \tau_{max} / (n^2 \cdot V_{max}^2)$, where n is the number of graphs and $V_{max}$ is the number of vertices of the graph with the most vertices.
• The local search procedure is applied in the last 30% of iterations.
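These settings can be collected in a small configuration object, sketched below in Python. The class name `AcoMga2Params` and its field names are hypothetical, and the sketch only records the listed values without assuming how the two rho values are used. The derived property follows the stated formula for the lower pheromone bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcoMga2Params:
    n_graphs: int                  # n, the number of graphs to align
    v_max: int                     # vertices in the largest graph
    n_ants: int = 30
    rho1: float = 0.3
    rho2: float = 0.7
    alpha: float = 1.0
    beta: float = 1.0
    tau_max: float = 1.0
    local_search_percent: int = 30  # share of iterations with local search

    @property
    def tau_min(self) -> float:
        """Lower pheromone bound: tau_max / (n^2 * Vmax^2)."""
        return self.tau_max / (self.n_graphs ** 2 * self.v_max ** 2)

    def second_stage_start(self, n_iters: int) -> int:
        """Index of the first iteration in which local search is applied."""
        return n_iters - (n_iters * self.local_search_percent) // 100
```

For 16 graphs with at most 94 vertices, the lower bound is 1 / (256 * 8836), so the two pheromone bounds differ by more than six orders of magnitude.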
A Effect and Runtime comparisons
The empirical data consists of 74 structures generated from the Cavbase database. Each structure represents a protein cavity belonging to the thermolysin protein family, bacterial proteases commonly used in protein analysis and annotated with the EC number 3.4.24.27 in the ENZYME database [5].
In this data set, each generated graph has 42 to 94 vertices. From the 74 structures, graphs are selected at random to generate data sets consisting of 4, 8, 16, and 32 graphs. To compare the solution quality of the algorithms, we ran each algorithm on each data set 20 times and took the average values for comparison.
The scores and runtimes of the algorithms are shown in Table 1.
Table 1 Comparison of the score and runtime with the data sets consisting of
4, 8, 16 and 32 graphs
Remark: The experimental results in Table 1 show that:
• The Greedy algorithm runs much faster than
ACO-MGA2 and GAVEO, but its solution
quality is much lower.
• ACO-MGA2 has the best solution quality in every
case. In particular, as the
number of graphs increases, the advantage of
ACO-MGA2 over GAVEO becomes more prominent. In
terms of runtime, ACO-MGA2
also achieves better results than
GAVEO.
B Comparing GAVEO and ACO-MGA2 under a
predetermined amount of time
Because the greedy method requires little runtime and its
solution quality is very low, in this section we compare only
the solution quality of GAVEO and
ACO-MGA2 under the same runtime.
We ran the GAVEO and ACO-MGA2 algorithms on a data
set of 16 graphs, each containing 45 to 94 vertices, with
the runtime increasing from 1000 s to 6000 s. The results are
shown in the chart in Figure 2.
Fig.2 Comparison of the results of the ACO-MGA2 and GAVEO
algorithms on the data set of 16 graphs when the runtime increases from 1000 s to
6000 s
Remark: The chart in Figure 2 shows that, as the runtime
increases from 1000 s to 6000 s, the solution quality of
ACO-MGA2 is always better than that of GAVEO.
V CONCLUSIONS
The MGA problem is a new approach to the structural analysis of biological molecules; until now, three algorithms have been introduced to solve it. The Greedy algorithm is a heuristic algorithm, so it is very fast, but its solution quality is not good. The newly proposed algorithm ACO-MGA2 is an improved version of ACO-MGA. Experiments showed that it outperforms the GAVEO algorithm with respect to both solution quality and runtime.
Like other ACO-based algorithms, ACO-MGA2 could easily be parallelized to work with a large number of graphs.
ACKNOWLEDGMENT
This work was mainly done during the stay of the authors
at the Vietnam Institute for Advanced Study in Mathematics (VIASM).
We thank Dr. Thomas Fober for useful email communications and for providing the dataset for testing.
REFERENCES
[1] A. E. Aladag and C. Erten (2013), "SPINAL: scalable protein interaction network alignment," Bioinformatics, 29, 917-924.
[2] D. Conte, P. Foggia, C. Sansone, and M. Vento (2004), "Thirty Years of Graph Matching in Pattern Recognition," Int'l J. Pattern Recognition and Artificial Intelligence, vol. 18, no. 3, pp. 265-298.
[3] O. Dror, H. Benyamini, R. Nussinov, and H. Wolfson (2003), "MASS: Multiple Structural Alignment by Secondary Structures," Bioinformatics, Vol. 19, No. 1, 95-104.
[4] D. Do Duc, H. Q. Dinh, and H. Hoang Xuan (2008), "On the Pheromone Update Rules of Ant Colony Optimization Approaches for the Job Shop Scheduling Problem," 11th Pacific Rim International Conference on Multi-Agents, PRIMA 2008, Hanoi, Vietnam (LNCS), pp. 153-160, December 15-16.
[5] M. Dorigo and T. Stutzle, Ant Colony Optimization, The MIT Press, Cambridge, Massachusetts (2004).
[6] T. Fober, M. Mernberger, G. Klebe, and E. Hullermeier (2009), "Evolutionary Construction of Multiple Graph Alignments for the Structural Analysis of Biomolecules," Bioinformatics, vol. 25, No. 16, 2110-2117.
[7] J. F. Gibrat, T. Madej, and S. H. Bryant (1996), "Surprising similarities in structure comparison," Current Opinion in Structural Biology, Vol. 6, No. 3, 377-385.
[8] M. Hendlich, A. Bergner, J. Günther, and G. Klebe, "Relibase: Design and Development of a Database for Comprehensive Analysis of Protein-Ligand Interactions," J. Molecular Biology, vol. 326, pp. 607-620, 2003.
[9] H. Hoang Xuan, T. Nguyen Linh, D. Do Duc, and H. Huu Tue, "Solving the Traveling Salesman Problem with Ant Colony Optimization: A Revisit and New Efficient Algorithms," REV Journal on Electronics and Communications, Vol. 2, No. 3-4, July-December, 2012, 121-129.
[10] K. Kinoshita and H. Nakamura (2005), "Identification of the Ligand Binding Sites on the Molecular Surface of Proteins," Protein Science, Vol. 14, No. 3, 711-718.
[11] O. Kuchaiev and N. Przulj (2011), "Integrative network alignment reveals large regions of global network similarity in yeast and human," Bioinformatics, 27, 1390-1396.
[12] M. Mernberger, G. Klebe, and E. Hullermeier (2009), "SEGA: Semiglobal Graph Alignment for Structure-Based Protein Comparison," IEEE/ACM Trans. on Computational Biology and Bioinformatics, Vol. 8, No. 5, 1330-1342.
[13] F. Neri, C. Cotta, and P. Moscato, Handbook of Memetic Algorithms, Springer, 2012.
[14] S. Schmitt, D. Kuhn, and G. Klebe, "A New Method to Detect Related Function among Proteins Independent of Sequence and Fold Homology," J. Molecular Biology, vol. 323, no. 2, pp. 387-406, 2002.
[15] D. Shasha, J. Wang, and R. Giugno (2002), "Algorithmics and Applications of Tree and Graph Searching," Proc. 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press, New York, USA, 39-52.
[16] M. Shatsky, R. Nussinov, and H. Wolfson (2004), "A Method for Simultaneous Alignment of Multiple Protein Structures," Proteins: Structure, Function, and Bioinformatics, Vol. 56, No. 1, 143-156.
[17] M. Shatsky, A. Shulman-Peleg, R. Nussinov, and H. J. Wolfson (2006), "The multiple common point set problem and its application to molecule binding pattern detection," Journal of Computational Biology, Vol. 13, No. 2, 407-428.
[18] R. Spriggs, P. Artymiuk, and P. Willett (2003), "Searching for Patterns of Amino Acids in 3D Protein Structures," J. of Chem. Inform. and Comp. Sciences, Vol. 43, No. 2, 412-421.
[19] J. D. Thompson, D. G. Higgins, and T. J. Gibson (1994), "Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, Vol. 22, 4673-4680.
[20] Tran Ngoc Ha, Do Duc Dong, and Hoang Xuan Huan, "An Efficient Ant Colony Optimization Algorithm for Multiple Graph Alignment," Proceedings of the International Conference on Computing, Management and Telecommunications, 2013, 386-391.
[21] N. Weskamp, E. Hullermeier, D. Kuhn, and G. Klebe (2007), "Multiple Graph Alignment for the Structural Analysis of Protein Active Sites," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 4, No. 2, 310-320.
[22] X. Yan, P. Yu, and J. Han (2005), "Substructure Similarity Search in Graph Databases," Proc. of ACM SIGMOD Int. Conf. on Management of Data, New York, 766-777.
[23] X. Yan, F. Zhu, J. Han, and P. Yu (2006), "Searching Substructures with Superimposed Distance," Proc. of International Conference on Data Engineering, 88-88.
[24] S. Zhang, M. Hu, and J. Yang (2007), "Treepi: A novel graph indexing method," Proc. of 23rd International Conference on Data Engineering, 966-975.