An Efficient Ant Colony Algorithm for DNA Motif Finding
Xuan Huan Hoang1, The Hung Nguyen1, T Thu Ha Doan2, and T Anh Tuyet Duong1
1University of Engineering and Technology, VNU, Hanoi, Vietnam
{huanhx, hungnt_55, tuyetdta_55}@vnu.edu.vn
2Hanoi University of Agriculture, doanha86@gmail.com
Abstract. Finding motifs in gene sequences is one of the most important problems in bioinformatics and is NP-hard. This paper proposes a new ant colony optimization algorithm based on the consensus approach, in which a relax technique is applied to recognize the locations of the common motif. The efficiency of the algorithm is evaluated by comparing it with state-of-the-art algorithms.
1 Introduction
Gene regulatory elements are called DNA motifs (hereafter "motifs" for short) and carry important biological information [1,5,12,14,18]. The identification of DNA motifs is currently one of the most important problems in bioinformatics and is NP-hard (see [2,10,12,16,17,19]). There are two main approaches to searching for a motif: biological experiments and computational methods, i.e., bioinformatics. Because of their high cost and long duration, biological experiments are not very effective, whereas computational methods are widely used to predict motifs.
Researchers have proposed various definitions of a motif and many statements of the motif finding problem, and have developed a number of algorithms for finding motifs [1,3,5,15]. One widely used approach is to apply an approximate algorithm to optimize the consensus score or the information content [2,7,10,11,15,16,19]. Recently, methods based on ant colony optimization (ACO) have been applied effectively to this problem by several authors. For example, Bouamama et al. (2010) proposed the MFACO algorithm [2], which uses the consensus score to find motifs and the information content to locate their occurrences (binding sites) in each DNA sequence. Yang et al. (2011) proposed an algorithm [19], referred to from now on as EMACO, that combines ACO with Expectation Maximization (EM) to find the starting positions of motifs in sequences. Liu et al. (2013) proposed the ACRI algorithm [11], which uses the information content as the objective function for the same purpose as EMACO.
In this paper, we propose a new ACO algorithm, called ACOMotif, that uses the total Hamming distance from a motif to the DNA sequences as the score function. ACOMotif uses the same structural graph as MFACO but with different heuristic information, pheromone update rule, and local search technique. For each motif found, the algorithm subsequently applies a relax technique to locate the starting positions in the DNA sequences; this version is called R-ACOMotif. The runtime of ACOMotif is also compared with that of MotifSuite (2012) on a very large dataset called SCPD, obtained from [21]. The efficiency of ACOMotif is demonstrated by experiments on the datasets used in the three articles above and on SCPD.
The rest of this paper is organized as follows. Section 2 states the DNA motif finding problem, followed by a brief introduction to the ACO method and how it is applied in the MFACO, ACRI, and EMACO algorithms. Our new algorithm is introduced in Section 3. Section 4 describes the experiments comparing ACOMotif/R-ACOMotif with MFACO, EMACO, ACRI, and MotifSuite. Some conclusions are presented in the last section.
2 DNA motif finding problem and related works
2.1 DNA motif finding problem
From an optimization perspective, the DNA motif finding problem can be described as follows [2,19]. Consider a set of DNA sequences of the same length, $S = \{S_1, S_2, \dots, S_N\}$, in which every letter $S_{i,j}$ (the $j$-th letter of $S_i$) belongs to the letter set $\Sigma = \{A, C, G, T\}$ for all $i, j$. For a given natural number $l$, there are two approaches to discovering a motif:
1) Consensus approach: Find a string $S_c$ of length $l$ and a set of substrings $M = \{m_1, m_2, \dots, m_N\}$, in which $m_i$ is a substring of length $l$ of $S_i$, such that they minimize the objective function
$$HS_c(S_c, M) = \sum_{i=1}^{N} d_H(S_c, m_i), \tag{1}$$
where $d_H$ denotes the Hamming distance. Then $S_c$ is called a motif of $S$, and each $m_i$ is called a motif instance (or instance for short) from $S_i$.
If we consider $M$ as a matrix (called a consensus matrix) whose $i$-th row is the string $m_i$, and denote by $C(u,j)$ the number of occurrences of nucleotide $u$ in column $j$, the objective function $CS_c(M)$ is formulated as
$$CS_c(M) = \sum_{j=1}^{l} \max_{u \in \Sigma} C(u,j). \tag{2}$$
The motif $S_c$ is then the string of length $l$ whose letter at position $j$ is the nucleotide that occurs most often in the $j$-th column of $M$. Note that a matrix $M$ can have several motifs (when a column has ties), but we consider only one of them.
2) Positional approach: Find a set of substrings $M = \{m_1, \dots, m_N\}$ and a set of starting positions $A = \{a_1, \dots, a_N\}$, in which each instance $m_i$ is the substring of length $l$ of $S_i$ that starts at position $a_i$. In this approach, the objective function is the information content
$$IC(M) = \sum_{j=1}^{l} \sum_{u \in \Sigma} Q(u,j) \log \frac{Q(u,j)}{p_u}, \tag{3}$$
in which $Q(u,j)$ is the frequency of nucleotide $u$ in column $j$ of the matrix $M$ and $p_u$ is the background frequency of $u$ in the entire set $S$. In practice, the location of $m_i$ on $S_i$ is called the binding site of the DNA.
Remark: The optimal solutions of such an objective function are not guaranteed to be real motifs. Therefore, the more solutions an algorithm finds, and the closer the locations of their instances are to the binding sites of the real motifs, the better the algorithm is.
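The two scores of this section can be illustrated with a small sketch. It assumes the usual column-wise definitions: the consensus score sums the count of the most frequent nucleotide in each column, and the information content is the relative entropy of the column frequencies against the background. The function names, the base-2 logarithm, and the uniform background used in the example are assumptions made for illustration only.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def consensus(instances):
    """Return (consensus string, consensus score) for equal-length instances."""
    counts = [Counter(column) for column in zip(*instances)]
    # Ties are broken by alphabet order, so only one of the possible motifs is kept.
    motif = "".join(max(ALPHABET, key=lambda u: c[u]) for c in counts)
    score = sum(max(c.values()) for c in counts)
    return motif, score

def information_content(instances, background):
    """Sum over columns of Q * log2(Q / p); absent nucleotides contribute 0."""
    n = len(instances)
    total = 0.0
    for column in zip(*instances):
        for u, k in Counter(column).items():
            q = k / n
            total += q * math.log2(q / background[u])
    return total

instances = ["ACGT", "ACGA", "TCGT"]           # toy instances, one per sequence
motif, cs = consensus(instances)
hs = sum(hamming(motif, m) for m in instances)  # total Hamming distance to the motif
uniform = {u: 0.25 for u in ALPHABET}           # assumed background frequencies
ic = information_content(instances, uniform)
```

For these toy instances the consensus score plus the total Hamming distance equals the number of letters in the matrix, which is the link between the two consensus-type scores used later in the experiments.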
2.2 Ant Colony Optimization method
Ant Colony Optimization (ACO), proposed by Dorigo [6,9], is a randomized metaheuristic method for solving hard combinatorial optimization problems. The algorithm has been improved in many ways in the literature and is widely used in applications. The memetic scheme, which uses a population-based search technique, was first proposed by Moscato [13] and applied to genetic algorithms. Today, it is combined with other algorithms [3,8].
Memetic-ACO algorithms
In this article, we apply ACO with reinforcement search following a simple memetic scheme, as described in Algorithm 1. In such algorithms, the original problem is converted into the problem of finding solutions on a structural graph $G = (V, E, \Omega, \eta, T)$, where $V$ is the vertex set, $E$ is the edge set, and $\eta$ and $T$ are the heuristic information and the pheromone trails, respectively, used in reinforcement learning; $\eta$ and $T$ can be placed on vertices or edges. An acceptable solution is a path satisfying the condition $\Omega$ that starts from a vertex in a subset $C_0$ of $V$ and is then extended by randomly choosing the next vertex based on the heuristic information and the pheromone trail. The ACO algorithm uses $N_{ant}$ artificial ants; in each iteration, each ant finds a solution by a randomized procedure on the structural graph. All the ant solutions are then assessed, and the best ones are chosen for an enhancing strategy or a local search technique. The solutions obtained are evaluated again, and the pheromone trail is updated as reinforcement learning information for the next iteration. Although many algorithms use the same graph $G(V,E)$, they use different heuristic information, pheromone update rules, and local search techniques. From now on we refer to $G(V,E)$ as the structural graph.
Procedure of Memetic-ACO algorithms;
Begin
    Initialize; // initialize the pheromone trail matrix and the ants
    Repeat
        Construct solutions; // each ant constructs its own solution
        Choose a subset Ωil to evolve by enhanced (or local) search;
        For each individual in Ωil do
            Run enhanced (or local) search;
        Update trail;
    Until End condition;
    Choose the solutions;
End;
Algorithm 1. Specification of a simple memetic-ACO algorithm
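The loop of Algorithm 1 can be sketched generically as follows. This is only an outline under stated assumptions: the problem-specific steps are passed in as callbacks, the score is minimized, and the names `memetic_aco`, `construct`, `refine`, `update_trail`, and `n_elite` are hypothetical.

```python
def memetic_aco(construct, refine, update_trail, score, n_ants, n_loops, n_elite=1):
    """Generic memetic-ACO loop; all callbacks are supplied by the caller.

    construct()      -> one ant builds one solution
    refine(s)        -> enhanced (or local) search applied to solution s
    update_trail(xs) -> reinforce the pheromone with the refined solutions xs
    score(s)         -> objective value, smaller is better (assumption)
    """
    best = None
    for _ in range(n_loops):
        # Each ant constructs its own solution.
        solutions = [construct() for _ in range(n_ants)]
        # Choose the most promising solutions and refine them.
        elite = sorted(solutions, key=score)[:n_elite]
        refined = [refine(s) for s in elite]
        # Feed the refined solutions back as reinforcement information.
        update_trail(refined)
        for s in refined:
            if best is None or score(s) < score(best):
                best = s
    return best

# Toy usage: "solutions" are integers drawn from a fixed stream.
stream = iter([5, 3, 8, 4, 9, 7])
trail_log = []
result = memetic_aco(
    construct=lambda: next(stream),
    refine=lambda s: s - 1,
    update_trail=trail_log.append,
    score=lambda s: s,
    n_ants=3,
    n_loops=2,
)
```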
Recently, some ACO-based algorithms following this scheme have been applied effectively to DNA motif finding:
MFACO (2010), proposed by Bouamama et al. [2], uses the consensus score to find motifs and the information content to determine their starting positions. Experiments showed that this algorithm obtains better results than other leading techniques: GS, BP, and MEME.
EMACO (2011), proposed by Yang et al. [19], combines ACO and Expectation Maximization (EM), in which EM is used to determine the binding sites. Experiments revealed that this method is better than GAME and GALF.
ACRI (2013), proposed by Liu et al. [11], uses the information content as the objective function to determine positions. Unlike the two methods above, this algorithm uses local search at two adjacent positions instead of a random search method. It has also been shown experimentally to give better results than MEME, AlignACE, and Gibbs Sampler.
In ACO algorithms, four important factors affect performance: 1) the structural graph, 2) the heuristic information, 3) the pheromone update rule, and 4) the local search technique. The proposed algorithm uses the same structural graph as the MFACO algorithm.
3 The proposed algorithm
This algorithm, named ACOMotif, uses the total Hamming distance from a motif to the DNA sequences as the objective function. ACOMotif uses the structural graph of MFACO but with different heuristic information, pheromone update rule, and local search technique. For each motif found, the algorithm subsequently applies the relax technique to locate the binding sites in the DNA sequences; in this case ACOMotif is called R-ACOMotif.
3.1 ACOMotif
ACOMotif follows the scheme described in Algorithm 1. The output is the set Q, which includes the motifs of length l and their corresponding instances on the DNA sequences, i.e., the substrings with the smallest Hamming distance to the motifs. The detailed description of ACOMotif is as follows.
Structural graph
The structural graph $G(V, E)$ is the same as MFACO's. To find a motif of length $l$, the graph has $4l$ vertices arranged in four rows and $l$ columns. Each vertex at position $(u, j)$ is labeled by the corresponding nucleotide $u$, as shown in Figure 1. The labels of the vertices in each row are also used to refer to the rows. Edges run from left to right and connect the vertices of two consecutive columns, i.e., vertex $(u, j)$ to vertex $(v, j+1)$. Heuristic information and pheromone trails are placed at the vertices of the first column and on the edges.
Figure 1 Structural graph for finding motif of length l
Heuristic information
The heuristic information is placed at the vertices of the first column and on the edges.
At a vertex $u$ of the first column, the heuristic information is the frequency of nucleotide $u$ in the entire dataset $S$.
The heuristic information on an edge from $(u, j)$ to $(v, j+1)$ is the frequency of the nucleotide pair $uv$ in $S$. There are only 16 such quantities, one for each $(u,v) \in \Sigma \times \Sigma$.
Remark. In MFACO, the heuristic information on the edges is computed by a high-order background model based on the frequency in $S$ of the motif pattern from the first column to the current column. Since such patterns rarely appear in each DNA sequence $S_i$, this kind of statistical information is limited.
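The heuristic information described above amounts to counting single nucleotides and adjacent nucleotide pairs over the whole dataset. A minimal sketch follows; the function name `heuristic_information` and the returned dictionary layout are hypothetical, and the sequences are assumed to contain only the letters A, C, G, T.

```python
from collections import Counter
from itertools import product

def heuristic_information(sequences, alphabet="ACGT"):
    """Return (eta_vertex, eta_edge) as relative frequencies over all sequences."""
    single = Counter()
    pair = Counter()
    for s in sequences:
        single.update(s)                                   # nucleotide counts
        pair.update(s[i:i + 2] for i in range(len(s) - 1))  # adjacent pairs
    n_single = sum(single.values())
    n_pair = sum(pair.values())
    eta_vertex = {u: single[u] / n_single for u in alphabet}
    # Exactly 16 edge values, one per ordered pair (u, v).
    eta_edge = {(u, v): pair[u + v] / n_pair
                for u, v in product(alphabet, repeat=2)}
    return eta_vertex, eta_edge

eta_vertex, eta_edge = heuristic_information(["ACGT", "AACC"])
```

Because the pair frequencies do not depend on the column, they can be computed once before the ants start and reused for every edge of the structural graph.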
Pheromone update rule
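Our algorithm uses the SMMAS (Smooth Max-Min Ant System) pheromone update rule [9]. The pheromone trails $\tau_u$ (on each vertex $u$ of the first column) and $\tau^{j}_{u,v}$ (on the edge from vertex $(u, j)$ to vertex $(v, j+1)$) are first initialized to a predetermined value. After each loop, the pheromone trail at each vertex $u$ is updated by Equation (4):
$$\tau_u \leftarrow (1-\rho)\,\tau_u + \Delta_u, \qquad \Delta_u = \begin{cases} \rho\,\tau_{max} & \text{if } u \text{ belongs to a best solution,} \\ \rho\,\tau_{min} & \text{otherwise,} \end{cases} \tag{4}$$
where $\rho \in (0,1)$, $\tau_{max}$, and $\tau_{min}$ are predetermined parameters.
The pheromone trail on each edge is updated by Equation (5):
$$\tau^{j}_{u,v} \leftarrow (1-\rho)\,\tau^{j}_{u,v} + \Delta^{j}_{u,v}, \tag{5}$$
where
$$\Delta^{j}_{u,v} = \begin{cases} \rho\,\tau_{max} & \text{if the edge belongs to a best solution,} \\ \rho\,\tau_{min} & \text{otherwise.} \end{cases}$$
The computational analysis and experiments in [9] show that this rule is better than the MMAS update rule used in MFACO.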
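A smoothed max-min style update can be sketched as below. The sketch assumes the usual SMMAS form, in which every trail evaporates by a factor (1 - rho) and is then moved toward tau_max if it lies on a best solution and toward tau_min otherwise. The names `init_pheromone` and `smmas_update`, the dictionary keys, and the parameter values in the example are hypothetical.

```python
ALPHABET = "ACGT"

def init_pheromone(length, tau0):
    """Pheromone on first-column vertices and on edges (j, u, v)."""
    tau_vertex = {u: tau0 for u in ALPHABET}
    tau_edge = {(j, u, v): tau0
                for j in range(length - 1) for u in ALPHABET for v in ALPHABET}
    return tau_vertex, tau_edge

def smmas_update(tau_vertex, tau_edge, best_motifs, rho, tau_min, tau_max):
    """Update both dictionaries in place from the current best motifs."""
    rewarded_vertices = {m[0] for m in best_motifs}
    rewarded_edges = {(j, m[j], m[j + 1])
                      for m in best_motifs for j in range(len(m) - 1)}
    for u in tau_vertex:
        delta = rho * (tau_max if u in rewarded_vertices else tau_min)
        tau_vertex[u] = (1 - rho) * tau_vertex[u] + delta
    for e in tau_edge:
        delta = rho * (tau_max if e in rewarded_edges else tau_min)
        tau_edge[e] = (1 - rho) * tau_edge[e] + delta

tau_vertex, tau_edge = init_pheromone(length=3, tau0=1.0)
smmas_update(tau_vertex, tau_edge, ["ACG"], rho=0.5, tau_min=0.1, tau_max=1.0)
```

With this form every trail stays inside the interval [tau_min, tau_max] as long as it starts there, so no explicit clamping step is needed.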
Randomized procedure to find solution
In each iteration, each ant randomly selects a starting vertex $u$ in the first column with the probability given by Equation (6). The ant then walks through the columns one after another, choosing the edge from vertex $u$ of column $j$ to vertex $v$ of column $j+1$ with the probability given by Equation (7).
The path of the ant from the starting vertex to the vertex in the last column identifies an acceptable solution for the motif.
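The random walk of one ant can be sketched as follows. The selection probabilities here are an assumption: each choice is made with probability proportional to the product of pheromone and heuristic information, without exponents, which is a common ACO convention and may differ from the exact formulas of the paper. The function name `construct_motif` and the dictionary layouts are hypothetical.

```python
import random

ALPHABET = "ACGT"

def construct_motif(length, tau_vertex, tau_edge, eta_vertex, eta_edge, rng=random):
    """Build one candidate motif by walking the structural graph column by column."""
    # Starting vertex in the first column.
    weights = [tau_vertex[u] * eta_vertex[u] for u in ALPHABET]
    current = rng.choices(ALPHABET, weights=weights)[0]
    motif = [current]
    # One edge per pair of consecutive columns.
    for j in range(length - 1):
        weights = [tau_edge[(j, current, v)] * eta_edge[(current, v)]
                   for v in ALPHABET]
        current = rng.choices(ALPHABET, weights=weights)[0]
        motif.append(current)
    return "".join(motif)

# Degenerate example in which only one path has non-zero weight.
eta_v = {u: 1.0 for u in ALPHABET}
eta_e = {(u, v): 1.0 for u in ALPHABET for v in ALPHABET}
tau_v = {u: (1.0 if u == "A" else 0.0) for u in ALPHABET}
tau_e = {(j, u, v): (1.0 if v == "C" else 0.0)
         for j in range(3) for u in ALPHABET for v in ALPHABET}
path = construct_motif(4, tau_v, tau_e, eta_v, eta_e, rng=random.Random(0))
```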
The objective function and identification of binding sites
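For each acceptable solution $S_c$, instead of using the objective function in Equation (1), ACOMotif takes the total Hamming distance $H_d(S_c)$ from $S_c$ to the DNA sequences in $S$ as the objective function:
$$H_d(S_c) = \sum_{i=1}^{N} d_H(S_c, m_i), \tag{8}$$
where
$$m_i = \arg\min \{\, d_H(S_c, m) : m \text{ is a substring of length } l \text{ of } S_i \,\} \tag{9}$$
and $d_H$ denotes the Hamming distance. The minimizing string $m_i$ in (9) is an instance of $S_c$, and its position in $S_i$ is the binding site. Note that each $S_c$ can have multiple instances on $S_i$ with the same distance.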
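The objective function can be computed by sliding the candidate motif over every sequence and keeping the closest window. In the sketch below the function names are hypothetical, positions are 0-based, and when several windows of a sequence are equally close the leftmost one is reported, which is an arbitrary choice made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_instance(motif, sequence):
    """Return (distance, start) of the closest window; leftmost on ties."""
    l = len(motif)
    return min((hamming(motif, sequence[p:p + l]), p)
               for p in range(len(sequence) - l + 1))

def total_hamming(motif, sequences):
    """Total Hamming distance of the motif to all sequences and the binding sites."""
    hits = [best_instance(motif, s) for s in sequences]
    score = sum(d for d, _ in hits)
    sites = [p for _, p in hits]
    return score, sites

score, sites = total_hamming("ACG", ["TTACGT", "ACCTTT", "GGGACG"])
```

Each evaluation scans every window of every sequence, so its cost grows with the product of the motif length, the sequence length, and the number of sequences.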
Local search
ACOMotif applies a hill-climbing technique for local search, as described below. After all ants finish their paths through the graph, the solutions are formed together with their total Hamming scores; the local search is then applied to the solutions with the smallest scores.
For each potential motif $S_m$, the set $Q(S_m)$ holds the search results, and the iterative procedure is carried out as follows:
Step 1. Initialize $Q(S_m) = \{S_m\}$;
Step 2. Repeat:
    For each $i = 1, \dots, l$ do:
        2.1 Replace the letter at position $i$ of $S_m$ by each of the three remaining letters of $\Sigma$ in turn to get $S_p$;
        2.2 Compute $H_d(S_p)$;
        2.3 If $H_d(S_p) \le H_d(S_m)$ then $S_m \leftarrow S_p$ and $Q(S_m) \leftarrow Q(S_m) \cup \{S_p\}$;
    Until the objective function cannot be improved any more.
After the local search has been applied to the potential motifs in each iteration, the sets $Q(S_m)$, consisting of candidates with the smallest or nearly smallest score, are merged into the set Q, which contains all the best solutions found so far; among solutions that share the same binding sites, only one motif is retained. Based on the set Q, the pheromone trail on the graph is updated according to Equations (4) and (5). The algorithm stops after a predefined number of loops. The binding sites associated with the motifs in Q allow us to identify the instances of the motifs.
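The hill-climbing step can be sketched as follows. One interpretation is assumed here: a trial motif replaces the current one whenever its score is not worse, and the outer loop repeats only while a full pass produced a strict improvement, which guarantees termination. The function names are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_hamming(motif, sequences):
    """Sum over sequences of the distance to the closest window."""
    l = len(motif)
    return sum(min(hamming(motif, s[p:p + l]) for p in range(len(s) - l + 1))
               for s in sequences)

def local_search(motif, sequences, alphabet="ACGT"):
    """Return (final motif, its score, set of accepted candidates)."""
    current = motif
    current_score = total_hamming(current, sequences)
    candidates = {current}
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            for letter in alphabet:
                if letter == current[i]:
                    continue
                trial = current[:i] + letter + current[i + 1:]
                trial_score = total_hamming(trial, sequences)
                if trial_score <= current_score:
                    # Ties are accepted, but only strict gains keep the loop going.
                    improved = improved or trial_score < current_score
                    current, current_score = trial, trial_score
                    candidates.add(trial)
    return current, current_score, candidates

final, final_score, found = local_search("ACC", ["TTACGT", "ACGTTT", "GGGACG"])
```

The returned candidate set plays the role of the per-motif result set: it may contain motifs whose score was the best at the time they were accepted but is no longer the best at the end.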
3.2 R-ACOMotif
Because the real positions of motifs in DNA sequences are not necessarily the solutions of the optimization problem, ACOMotif additionally uses a relax technique to locate the binding sites of each motif. When ACOMotif employs this technique, it is called R-ACOMotif.
For each motif $S_c$ in the solution set Q of ACOMotif and for a given number, the relax technique finds a set of instances $M = \{m_1, \dots, m_N\}$ and a set of starting positions $A = \{a_1, \dots, a_N\}$ as follows:
Step 1 // Expand the instances set and binding sites
1.1 On each sequence S i, finds locations so that for each substring of
length l, Hamming distance from substring to S c is or
We get sets M i and A i including and respectively;
1.2.Compute the number of elements n i of sets M i respectively and ;
Step 2.// Filter to reduce the size of sets M i and A i
Repeat
2.1 Sort the sets M i in increasing order of n i // the following steps use this new order;
2.2 Determine the smallest number k so that with every i ≥ k then n i >1;
2.3 For each i = k to N do
2.3.1 For each M i, compute
;
2.3.2 Compute g i= min{ };
2.3.3 If then remove out of set M i; 2.4 // reduce ;
Until is smaller than half of that value before the loop;
Step 3 // find the best solutions
3.1 Build all consensus matrices from the reduced set M i and compute
consensus score as in Equation (2);
3.2 Sort the matrices in step 3.1 and the corresponding locations of the instances
in decreasing order with respect to their consensus scores;
Step 4 The solution is the first tuple in the list in step 3.2 // can take more depending on
priority computed
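Steps 3 and 4 can be sketched once the candidate sets have been reduced. The sketch below does not reproduce Steps 1 and 2; it assumes each sequence already has a short list of candidate (start position, substring) pairs, enumerates every combination with one candidate per sequence, and ranks the combinations by consensus score. The function names and the input layout are hypothetical.

```python
from collections import Counter
from itertools import product

def consensus_score(instances):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(Counter(column).values()) for column in zip(*instances))

def rank_combinations(candidate_sets):
    """candidate_sets[i] is a list of (start, substring) pairs for sequence i."""
    ranked = []
    for combo in product(*candidate_sets):
        starts = [a for a, _ in combo]
        instances = [m for _, m in combo]
        ranked.append((consensus_score(instances), starts, instances))
    # Decreasing consensus score; the first tuple is taken as the solution.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

candidates = [
    [(0, "ACG")],
    [(2, "ACG"), (5, "ACT")],
    [(1, "TCG"), (4, "ACG")],
]
ranking = rank_combinations(candidates)
best_score, best_starts, best_instances = ranking[0]
```

The number of combinations is the product of the candidate set sizes, which is why the filtering in Step 2 matters before this enumeration is attempted.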
4 Experimental results
The program was written in Perl and run on a desktop computer with an Intel Core i5 2.5 GHz CPU and 4 GB of RAM under Ubuntu 12.04. Our experiments compare the efficiency of the new algorithm with those of MFACO [2], EMACO [19], and ACRI [11] on the same datasets, using the same numbers of loops and ants as in the corresponding evaluations. The number of ants is fixed to 8. Because we do not have the programs of these algorithms, we cannot compare runtimes on the same machine configuration; the results of the compared algorithms are taken directly from the published articles. The runtime reported for ACOMotif/R-ACOMotif is an average. The parameters were set as follows: ρ is chosen depending on the number of loops of the algorithm.

Number of loops    10-100       100-300      300-600      > 600
Coefficient ρ      0.03-0.05    0.02-0.03    0.01-0.02    0.005
To evaluate computation time, ACOMotif was compared with MotifSuite (2012) [20] on the SCPD dataset [21]. The efficiency of ACOMotif was assessed by experiments on the datasets published with the three algorithms above and on SCPD.
4.1 Comparison with MFACO using consensus approach
The H.sapiens data used in [2] contain three small sets with 6, 9, and 12 strings, respectively. Each string is 3,001 nucleotides long. The H.sapiens dataset has no biologically known actual motif. Therefore, our experiments only compare the values of the objective functions computed by Equation (1) and Equation (2). We use the notation HSc for the total Hamming distance score and CSc for the consensus score. Note that
$$HS_c = l \cdot N - CS_c, \tag{10}$$
so a smaller HSc is equivalent to a greater CSc; therefore, we only need to consider HSc.
The experiments were performed in the same way as in [2]: each set was run three times with 50 loops, the computation time is an average, and the results of MFACO are taken from [2].
The experimental results for H.sapiens dataset 1 are shown in Table I and Table II, with motif length l = 7 and l = 13, respectively.

Table I. Comparison between ACOMotif and MFACO on H.sapiens 1: ρ = 0.03, l = 7, N = 6, runtime 41 s

ACOMotif                 MFACO
Motif      CSc  HSc      Motif      CSc  HSc
CCTCCCC    42   0        AAAAAAA    42   0
AAAAAAA    42   0        AGGAGGA    42   0
GCAGCGG    42   0        AAAAAAG    42   0
GCCGGGG    42   0        TAAAAAT    42   0
GCCGCCG    42   0
AAAAAAG    42   0
GCCTGTG    42   0
TAAAAAT    42   0
CGGCGCC    42   0
GGGCCAG    42   0
GGCCAGG    42   0
GCGGGCG    42   0
CCCGGGC    42   0

Table II. Comparison between ACOMotif and MFACO on H.sapiens 1: ρ = 0.03, l = 13, N = 6, runtime 96 s

ACOMotif                       MFACO
Motif            CSc  HSc      Motif            CSc  HSc
AAAAAAAAAAAGA    76   2        AAAAAAAAAAAGA    76   2
AAAAAAAAAAAAG    75   3        AAAAAAAAAAAAG    75   3
GCTGAGGCAGGAG    72   6        AAAAAAAAAAAGT    75   3
GCCGCCGCCGCCG    72   6        AAAAAAAAAAAAG    75   3
CGCCGCCGCCGCC    72   6
GAGGCTGAGGCAG    71   7
Remark: Table I shows that ACOMotif is considerably better, finding 13 motifs compared with only 4 for MFACO, all with the same HSc and CSc. Table II shows that both algorithms discovered the motif with the best score; MFACO found three motifs whose HSc equals 3, while ACOMotif found only one. However, the number of motifs discovered by ACOMotif is higher overall.
The experimental results for H.sapiens dataset 2 are shown in Table III and Table IV, with motif length l = 7 and l = 13, respectively.

Table III. Comparison between ACOMotif and MFACO on H.sapiens 2: ρ = 0.03, l = 7, N = 9, runtime 62 s

ACOMotif                 MFACO
Motif      CSc  HSc      Motif      CSc  HSc
CCCTCCT    63   0        CCCTCCT    63   0
CTCCCTT    62   1        CCCTCAG    62   1
GAGCAGG    62   1        GGGTTGG    62   1
GGGTTGG    62   1        GAGCAGG    62   1
GGGGCTG    62   1
TGGGAGG    62   1
GGCGGCC    62   1
GGGGCTG    62   1
CCCCTCC    62   1
TTCCTGG    62   1
CCCCTCC    62   1
GGGCTGG    62   1
CCTCCCT    62   1

Table IV. Comparison between ACOMotif and MFACO on H.sapiens 2: ρ = 0.03, l = 13, N = 9, runtime 126 s

ACOMotif                       MFACO
Motif            CSc  HSc      Motif            CSc  HSc
GCCGGCGGGCGCC    102  15       GCCGGCGGGCGCC    102  15
GGCCCCCGGGCGG    101  16       GCGGGCGGGCGCC    101  16
GGGGGAGCAGGAG    101  16       GCCGGCGGGCGGC    100  17
GGCCGGCGGGCGG    100  17       GCCGGAGGGCGCC    100  17
GCAGGGGCTGGGG    100  17
GGCCAGGCTCGGC    100  17
CCCCGCCCCCGGC    100  17
Remark: Table III shows that ACOMotif is considerably better in terms of the number of motifs found with the minimum score. ACOMotif found 12 motifs whose HSc equals 1, compared with only 3 for MFACO. As can be seen from Table IV, ACOMotif remained superior to MFACO: it also found the motif with the best score, together with two motifs whose HSc equals 16 and four motifs whose HSc equals 17.
The experimental results for H.sapiens dataset 3 are shown in Table V and Table VI, with motif length l = 7 and l = 13, respectively.

Table V. Comparison between ACOMotif and MFACO on H.sapiens 3: ρ = 0.03, l = 7, N = 9, runtime 114 s

ACOMotif                 MFACO
Motif      CSc  HSc      Motif      CSc  HSc
GGCGGGG    123  3        GGGGCGG    123  3
CTGAGGC    123  3        CCCAGCT    123  3
CCAGCTG    123  3        CCAGCTG    123  3
GAGGCAG    123  3        CTGAGGC    123  3
GGGGCGG    123  3

Table VI. Comparison between ACOMotif and MFACO on H.sapiens 3: ρ = 0.03, l = 13, N = 9, runtime 231 s

ACOMotif                       MFACO
Motif            CSc  HSc      Motif            CSc  HSc
GGGAGGCTGAGGC    205  29       GGGAGGCTGAGGC    205  29
CGGGAGGCGGAGG    204  30       CGGGAGGCGGAGG    204  30
GCTGAGGCAGGAG    202  32       GGAGGCTGAGGCA    202  32
GGAGGCTGAGGCA    202  32       CGGGAGGCGGGGG    201  33
GGCTGAGGCAGGA    119  35
GGGCGGGGCGGGG    119  35

Remark: Table V shows that ACOMotif discovered one more motif with the minimum score 3 than MFACO. In Table VI, ACOMotif again gave better results in terms of both the number of motifs found and the score.
4.2 Comparison with MFACO and ACRI in terms of positional approach
The experiment was carried out on the E.coli dataset of CRP binding sites, used by both MFACO and ACRI in [2] and [11], to compare the discovered binding sites. The dataset contains eighteen 105-nucleotide strings. The length of the examined motif is 22, the same as in MFACO and ACRI. R-ACOMotif was run 20 times, each with 300 loops, 10 ants, and ρ = 0.02. The experimental results are given in Table VII.
Table VII. Comparison between R-ACOMotif and the MFACO and ACRI algorithms
Remark: The results show that R-ACOMotif and MFACO both discovered all the correct starting positions, whereas ACRI had a comparatively high error.
4.3 Comparison with EMACO in terms of position approach
Experiments on two datasets, ERE and E2F, were carried out for comparison with EMACO [19]. Each dataset includes 25 strings of 200 nucleotides each; the real motifs and their starting positions on each string are known in advance. The algorithms were run 20 times, using 20 ants and 100 loops, and their average values were compared. According to [19], a discovered position is correct if it is at most 3 units away from the real location.
To assess the results, the study [4] proposed three measures, namely precision, recall, and F-score:
$$\text{Precision} = \frac{n_c}{n_p}, \qquad \text{Recall} = \frac{n_c}{n_t}, \qquad \text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{11}$$
where $n_c$ is the number of binding sites that were correctly predicted, $n_p$ is the total number of predicted binding sites, and $n_t$ is the total number of actual binding sites. In particular, the F-score is said to be suitable for assessing the quality of an algorithm [11]. The experimental comparison with EMACO is presented in Table VIII.
Table VIII. Comparison between R-ACOMotif and EMACO on the ERE and E2F datasets
Remark: From the above table, we can easily see that on both datasets R-ACOMotif has significantly better measurements than EMACO. In particular, the former's F-score is higher than the latter's. Thus, we can conclude that R-ACOMotif performs better than the ACO algorithm combined with EM in [19].
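The three measures used in this comparison can be computed as in the sketch below. It assumes one predicted and one actual starting position per sequence, treats a missing prediction as `None`, and counts a prediction as correct when it is within the stated tolerance of 3 positions. The function name `evaluate_sites` and the example positions are hypothetical.

```python
def evaluate_sites(predicted, actual, tolerance=3):
    """Return (precision, recall, f_score) for per-sequence start positions."""
    n_p = sum(p is not None for p in predicted)   # predicted binding sites
    n_t = len(actual)                             # actual binding sites
    n_c = sum(p is not None and abs(p - a) <= tolerance
              for p, a in zip(predicted, actual)) # correctly predicted sites
    precision = n_c / n_p if n_p else 0.0
    recall = n_c / n_t if n_t else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

precision, recall, f_score = evaluate_sites([10, 20, None, 40], [12, 30, 5, 40])
```

In this toy case two of the three predictions are within the tolerance, so the precision is 2/3 while the recall, measured against all four actual sites, is 1/2.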
4.4 Comparison with MotifSuite
To compare computation time and precision, ACOMotif and MotifSuite [20] were run on two datasets, GCR1 and GCN4, taken from the SCPD data [21]. GCR1 contains six strings (DNA sequences) of length 9,050, and its real motif is CTTCC (l = 5). GCN4 contains nine strings, and its real motif is TGACTC (l = 6).
Both programs were run 20 times with the same 20 loops to compute the average runtime and to check whether they can find the real motif. ACOMotif used eight ants. The experimental results show that ACOMotif found the real motif CTTCC on GCR1 with score HSc = 0 and TGACTC on GCN4 with score HSc = 1, whereas MotifSuite did not. The runtimes of the two algorithms, presented in Table IX, show that ACOMotif is dramatically faster.