An Efficient Ant Colony Algorithm for DNA Motif Finding


Xuan Huan Hoang1, The Hung Nguyen1, T Thu Ha Doan2, and T Anh Tuyet Duong1

1University of Engineering and Technology, VNU, Hanoi, Vietnam

{huanhx, hungnt_55, tuyetdta_55}@vnu.edu.vn

2Hanoi University of Agriculture, doanha86@gmail.com

Abstract. Finding motifs in gene sequences is one of the most important problems in bioinformatics and is NP-hard. This paper proposes a new ant colony optimization algorithm based on the consensus approach, in which a relax technique is applied to locate the common motif. The efficiency of the algorithm is evaluated by comparing it with state-of-the-art algorithms.

1 Introduction

Gene regulatory elements are called DNA motifs (hereafter "motifs" for short) and carry important biological information [1,5,12,14,18]. Identifying DNA motifs is currently one of the most important problems in bioinformatics and is NP-hard (see [2,10,12,16,17,19]). There are two main approaches to searching for a motif: biological experiments and computational methods, i.e., bioinformatics. Because biological experiments are costly and time-consuming, computational methods are widely used to predict motifs.

Researchers have proposed various definitions of a motif, several formulations of the motif finding problem, and a number of algorithms for solving it [1,3,5,15]. One widely used approach is to apply an approximate algorithm that optimizes the consensus score or the information content [2,7,10,11,15,16,19]. Recently, several authors have applied ant colony optimization (ACO) effectively to this problem. For example, Bouamama et al. (2010) proposed the MFACO algorithm [2], which uses the consensus score to find motifs and the information content to locate their occurrences (binding sites) in each DNA sequence. Yang et al. (2011) proposed an algorithm [19], referred to from now on as EMACO, that combines ACO with Expectation Maximization (EM) to find the starting positions of the motif in the sequences. Liu et al. (2013) proposed the ACRI algorithm [11], which uses the information content as the objective function for the same purpose as EMACO.

In this paper, we propose a new ACO algorithm called ACOMotif, which uses the total Hamming distance from the motif to the DNA sequences as its score function. ACOMotif uses the same structural graph as MFACO but with different heuristic information, pheromone update rule, and local search technique. For each motif found, the algorithm subsequently applies a relax technique to locate the starting positions in the DNA sequences; this version is called R-ACOMotif. The runtime of ACOMotif is also compared with that of MotifSuite (2012) [20] on a very large dataset, SCPD, obtained from [21]. The efficiency of ACOMotif is demonstrated by experiments on the datasets published in the three articles above and on SCPD.

The rest of this paper is organized as follows. Section 2 states the DNA motif finding problem, followed by a brief introduction to the ACO method and how it was applied in the MFACO, ACRI, and EMACO algorithms. Our new algorithm is introduced in Section 3. Section 4 describes the experiments comparing ACOMotif/R-ACOMotif with MFACO, EMACO, ACRI, and MotifSuite. Some conclusions are presented in the last section.

2 DNA motif finding problem and related works

2.1 DNA motif finding problem

From an optimization perspective, the DNA motif finding problem can be described as follows [2,19]. Consider a set of DNA sequences of the same length, $S = \{S_1, S_2, \ldots, S_N\}$, in which $S_i = s_{i1} s_{i2} \ldots s_{in}$ and each $s_{ij}$ belongs to the alphabet $\Sigma = \{A, C, G, T\}$ for all $i, j$. For a given natural number $l$, there are two approaches to discovering a motif:

1) Consensus approach: Find a string $S_c$ of length $l$ and a set of substrings $\{m_1, m_2, \ldots, m_N\}$, in which $m_i$ is a substring of length $l$ of $S_i$, such that they minimize the objective function

$$\sum_{i=1}^{N} d_H(S_c, m_i), \qquad (1)$$

where $d_H$ denotes the Hamming distance. Then $S_c$ is called a motif of $S$, and each $m_i$ is called a motif instance (or instance for short) from $S_i$.

If we consider $M$ as a matrix (called a consensus matrix) whose $i$-th row is the string $m_i$, and denote by $C(u,j)$ the number of occurrences of nucleotide $u$ in column $j$, the objective function $CSc(M)$ is formulated as

$$CSc(M) = \sum_{j=1}^{l} \max_{u \in \Sigma} C(u,j). \qquad (2)$$

Then the motif $S_c$ is a string of length $l$ whose letter at position $j$ is the nucleotide that occurs most often in the $j$-th column of $M$. Note that a matrix $M$ can have several motifs, but we consider only one of them.

2) Positional approach: Find a set of substrings $\{m_1, m_2, \ldots, m_N\}$ and a set of starting positions $\{a_1, a_2, \ldots, a_N\}$, in which each instance $m_i$ is the length-$l$ substring of $S_i$ starting at position $a_i$. In this approach, the objective function is the information content:

$$IC(M) = \sum_{j=1}^{l} \sum_{u \in \Sigma} Q(u,j) \log \frac{Q(u,j)}{p_u}, \qquad (3)$$

where $Q(u,j)$ is the frequency of nucleotide $u$ in column $j$ of the matrix $M$, and $p_u$ is the background frequency of $u$ in the entire set $S$. In practice, the location of $m_i$ on $S_i$ is called a binding site.

Remark: The optimal solutions of these objective functions are not guaranteed to be real motifs. Therefore, the more solutions an algorithm finds, and the closer the locations of their instances are to the binding sites of the real motif, the better the algorithm is.
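As an illustration of the two objective functions above, the following minimal Python sketch computes the consensus score and the information content of a small set of instances. It assumes the usual column-wise definitions; the function names, the toy instances, the uniform background frequencies, and the choice of base-2 logarithm are all assumptions made for this example rather than details taken from the paper.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def consensus_score(instances):
    """Sum, over columns, of the count of the most frequent nucleotide."""
    return sum(max(Counter(col).values()) for col in zip(*instances))

def consensus_string(instances):
    """One motif of the matrix: the most frequent letter of each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def information_content(instances, background):
    """Column-wise relative entropy against the background (base 2 assumed)."""
    n = len(instances)
    total = 0.0
    for col in zip(*instances):
        counts = Counter(col)
        for u in ALPHABET:
            q = counts[u] / n
            if q > 0:  # terms with zero frequency contribute nothing
                total += q * math.log2(q / background[u])
    return total

# Toy instances (hypothetical data, one instance per sequence).
instances = ["ACGT", "ACGA", "ACCT"]
motif = consensus_string(instances)
csc = consensus_score(instances)
hsc = sum(hamming(motif, m) for m in instances)
ic = information_content(instances, {u: 0.25 for u in ALPHABET})
print(motif, csc, hsc, round(ic, 4))
```

For these toy instances the consensus score and the total Hamming distance to the consensus string add up to the number of instances times the motif length, which is the relationship used later when the two scores are compared.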

2.2 Ant Colony Optimization method

Ant Colony Optimization (ACO), proposed by Dorigo [6,9], is a randomized metaheuristic method for solving hard combinatorial optimization problems. It has been improved in many ways in the literature and applied widely. The memetic scheme, which uses a population-based search technique, was first proposed by Moscato [13] and applied to genetic algorithms. Today, it is combined with other algorithms [3,8].

Memetic-ACO algorithms

In this article, we apply ACO with reinforcement search following a simple memetic scheme, as described in Algorithm 1. In such algorithms, the original problem is converted into the problem of finding solutions on a structural graph $G = (V, E, \Omega, \eta, T)$, where $V$ is the vertex set, $E$ is the edge set, and $\eta$ and $T$ are the heuristic information and the pheromone trails, respectively, used as reinforcement learning information; $\eta$ and $T$ can be placed on vertices or edges. An acceptable solution is a path satisfying the condition $\Omega$ that starts from a vertex in a subset $C_0$ of $V$ and is extended by randomly choosing the next vertex based on the heuristic information and the pheromone trail. The ACO algorithm uses $N_{ant}$ artificial ants; in each iteration, each ant finds a solution by a randomized procedure on the structural graph. All the ant solutions are then assessed, and the best one is chosen for an enhancing strategy or a local search technique. The resulting solutions are evaluated again, and the pheromone trail is updated to serve as reinforcement learning information in the next iteration. Although many algorithms use the same graph $G(V,E)$, they use different heuristic information, pheromone update rules, and local search techniques. From now on, we refer to $G(V,E)$ as the structural graph.

Procedure of Memetic-ACO algorithms;
Begin
    Initialize; // initialize pheromone trail matrix and the ants
    Repeat
        Construct solutions; // each ant constructs its own solution
        Choose a subset Ωil to evolve by enhanced (or local) search;
        For each individual in Ωil do
            Run enhanced (or local) search;
        Update trail;
    Until End condition;
    Choose the solutions;
End;

Algorithm 1. Specification of a simple memetic-ACO algorithm
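The scheme in Algorithm 1 can be sketched in Python as a generic loop that takes the problem-specific parts as callbacks. This is only an illustrative skeleton under stated assumptions: the function `memetic_aco`, its parameters (`n_ants`, `n_loops`, `n_elite`), and the toy problem used to exercise it are hypothetical and are not the authors' implementation.

```python
import random

def memetic_aco(construct, score, local_search, update_trail,
                n_ants=8, n_loops=50, n_elite=2, seed=0):
    """Generic loop: construct, pick the best ants, improve them, update the trail."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_loops):
        # Each ant constructs its own solution; lower scores are better.
        solutions = sorted((construct(rng) for _ in range(n_ants)), key=score)
        # Only the n_elite best solutions are passed to local search.
        improved = [local_search(s) for s in solutions[:n_elite]]
        candidate = min(improved, key=score)
        if best is None or score(candidate) < score(best):
            best = candidate
        update_trail(best)
    return best

# Toy problem (hypothetical): recover a hidden 4-letter string.
TARGET = "ACGT"
weights = [{u: 1.0 for u in "ACGT"} for _ in TARGET]  # one trail table per position

def construct(rng):
    """Sample one letter per position, proportionally to the current trail."""
    return "".join(rng.choices(list(w), weights=list(w.values()))[0] for w in weights)

def score(s):
    """Hamming distance to the hidden target."""
    return sum(a != b for a, b in zip(s, TARGET))

def update_trail(best):
    """Reinforce the letters of the best solution found so far."""
    for w, u in zip(weights, best):
        w[u] += 1.0

best = memetic_aco(construct, score, lambda s: s, update_trail)
print(best, score(best))
```

Passing the construction, scoring, local search, and trail update as callbacks mirrors the observation that algorithms sharing one structural graph differ mainly in these components.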

Recently, some ACO-based algorithms following this scheme have been applied effectively to DNA motif finding:

MFACO (2010), proposed by Bouamama et al. [2], uses the consensus score to find motifs and the information content to determine their starting positions. Experiments showed that this algorithm obtains better results than other leading techniques: GS, BP, and MEME.

EMACO (2011), proposed by Yang et al. [19], combines ACO and Expectation Maximization (EM), in which EM is used to determine binding sites. Experiments showed that this method is better than GAME and GALF.

ACRI (2013), proposed by Liu et al. [11], uses the information content as the objective function to determine positions. Unlike the two methods above, this algorithm uses local search at two adjacent positions instead of a random search method. It was also shown experimentally to give better results than MEME, AlignACE, and Gibbs Sampler.

In ACO algorithms, four factors strongly affect performance: 1) the structural graph, 2) the heuristic information, 3) the pheromone update rule, and 4) the local search technique. The proposed algorithm uses the same structural graph as the MFACO algorithm.

3 The proposed algorithm

The proposed algorithm, named ACOMotif, uses the total Hamming distance from the motif to the DNA sequences as the objective function. ACOMotif uses the structural graph of MFACO but with different heuristic information, pheromone update rule, and local search technique. For each motif found, the algorithm subsequently applies a relax technique to locate binding sites in the DNA sequences; in this case, ACOMotif is called R-ACOMotif.

3.1 ACOMotif

ACOMotif follows the scheme described in Algorithm 1. The output is the set $Q$, which contains the motifs of length $l$ together with their corresponding instances on the DNA sequences, i.e., the substrings with the smallest Hamming distance to the motifs. The detailed description of ACOMotif is as follows.

Structural graph

The structural graph $G(V, E)$ is the same as that of MFACO. To find a motif of length $l$, the graph has $4l$ vertices arranged in four rows and $l$ columns. Each vertex at position $(u, j)$ is labeled with the corresponding nucleotide $u$, as shown in Figure 1. The labels of the vertices in each row are also used to refer to the rows. Edges run from left to right and connect vertices of two consecutive columns. We denote by $e^{j}_{uv}$ the edge connecting vertex $(u, j)$ to vertex $(v, j+1)$. Heuristic information and pheromone trails are placed at the vertices of the first column and on the edges.

Figure 1 Structural graph for finding motif of length l

Heuristic information

The heuristic information is placed at the first-column vertices and on the edges.

At the vertices of the first column, the heuristic information $\eta_u$ is the frequency of nucleotide $u$ in the entire dataset $S$.

The heuristic information $\eta_{uv}$ on an edge from $(u, j)$ to $(v, j+1)$ is the frequency of the nucleotide pair $uv$ in $S$. There are only 16 such quantities, one for each $(u,v) \in \Sigma \times \Sigma$.

Remark: In MFACO, the heuristic information on edges is computed by a high-order background model based on the frequency in $S$ of the motif pattern from the first column to the current column. Since such patterns appear rarely in each DNA sequence $S_i$, this kind of statistical information is of limited use.
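The frequency-based heuristic information described above can be sketched as follows. The function name `heuristic_information` and the toy sequences are hypothetical, and the normalization (single letters over all letters, pairs over all adjacent pairs within each sequence) is an assumption, since the text only says "frequency". The sketch also assumes the sequences contain only the letters A, C, G, and T.

```python
from collections import Counter
from itertools import product

def heuristic_information(sequences, alphabet="ACGT"):
    """Return (vertex, edge) heuristics as nucleotide and pair frequencies."""
    single = Counter()
    pair = Counter()
    for s in sequences:
        single.update(s)
        # Adjacent pairs are counted within a sequence, never across sequences.
        pair.update(s[k:k + 2] for k in range(len(s) - 1))
    n_single = sum(single.values())
    n_pair = sum(pair.values())
    eta_vertex = {u: single[u] / n_single for u in alphabet}
    eta_edge = {(u, v): pair[u + v] / n_pair
                for u, v in product(alphabet, repeat=2)}
    return eta_vertex, eta_edge

# Toy dataset (hypothetical).
eta_vertex, eta_edge = heuristic_information(["ACGT", "AACC"])
print(eta_vertex["A"], eta_edge[("A", "C")])
```

Because only 4 vertex values and 16 pair values are needed, they can be computed once before the ants start and reused for every column of the graph.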

Pheromone update rule

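Our algorithm uses the SMMAS (Smooth Max-Min Ant System) pheromone update rule [9]. The pheromone trails $\tau_u$ (on each vertex $u$ of the first column) and $\tau^{j}_{uv}$ (on the edge from $(u, j)$ to $(v, j+1)$) are first initialized to a predetermined value. After each loop, the pheromone trail at each vertex $u$ is updated by Equation (4):

$$\tau_u \leftarrow (1 - \rho)\,\tau_u + \Delta_u, \qquad \Delta_u = \begin{cases} \rho\,\tau_{\max} & \text{if } u \text{ is the starting vertex of one of the best solutions}, \\ \rho\,\tau_{\min} & \text{otherwise}, \end{cases} \qquad (4)$$

where $\rho \in (0,1)$, $\tau_{\max}$, and $\tau_{\min}$ are predetermined parameters.

The pheromone trail on each edge is updated by Equation (5):

$$\tau^{j}_{uv} \leftarrow (1 - \rho)\,\tau^{j}_{uv} + \Delta^{j}_{uv}, \qquad (5)$$

where:

$$\Delta^{j}_{uv} = \begin{cases} \rho\,\tau_{\max} & \text{if the edge lies on the path of one of the best solutions}, \\ \rho\,\tau_{\min} & \text{otherwise}. \end{cases}$$

The computational analysis and experiments in [9] show that this rule is better than the MMAS update rule used in MFACO.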
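A minimal sketch of a smooth max-min update of this kind is given below, assuming the form in which every trail entry evaporates by a factor $(1 - \rho)$ and then receives $\rho\,\tau_{\max}$ if it belongs to a reinforced solution and $\rho\,\tau_{\min}$ otherwise. The function name `smmas_update`, the dictionary layout, and the parameter values are hypothetical choices for illustration.

```python
def smmas_update(trail, reinforced, rho, tau_min, tau_max):
    """Evaporate every entry, then add rho*tau_max (reinforced) or rho*tau_min."""
    for key in trail:
        delta = rho * (tau_max if key in reinforced else tau_min)
        trail[key] = (1.0 - rho) * trail[key] + delta
    return trail

# First-column vertices, all initialized to tau_max (assumed initial value).
trail = {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0}
smmas_update(trail, reinforced={"A"}, rho=0.03, tau_min=0.25, tau_max=1.0)
print(trail)
```

Under this form, a trail that starts inside the interval between `tau_min` and `tau_max` stays inside it, because each update is a weighted average of the old value and one of the two bounds. No explicit clipping step is therefore needed.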

Randomized procedure to find solution

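In each iteration, each ant randomly selects a starting vertex $u$ in the first column with probability

$$p_u = \frac{\tau_u^{\alpha}\,\eta_u^{\beta}}{\sum_{w \in \Sigma} \tau_w^{\alpha}\,\eta_w^{\beta}}. \qquad (6)$$

Then the ant walks through the columns in order; the probability of choosing the edge from vertex $u$ of column $j$ to vertex $v$ of column $j+1$ is

$$p^{j}_{uv} = \frac{(\tau^{j}_{uv})^{\alpha}\,\eta_{uv}^{\beta}}{\sum_{w \in \Sigma} (\tau^{j}_{uw})^{\alpha}\,\eta_{uw}^{\beta}}. \qquad (7)$$

Here $\alpha$ and $\beta$ weight the pheromone trail and the heuristic information, respectively. The path of the ant from the starting vertex to a vertex of the last column identifies an acceptable solution, i.e., a candidate motif.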
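The random walk of one ant can be sketched as below. The sketch assumes the standard ACO selection rule in which each choice is drawn with probability proportional to (pheromone to the power `alpha`) times (heuristic to the power `beta`); the exponents, the function name `build_path`, and the data layout (one pheromone dictionary per column transition) are assumptions for illustration.

```python
import random

ALPHABET = "ACGT"

def build_path(length, tau_vertex, tau_edge, eta_vertex, eta_edge, rng,
               alpha=1.0, beta=1.0):
    """Walk the 4 x length graph column by column and return the visited labels."""
    def pick(weights):
        # Roulette-wheel selection proportional to the given weights.
        return rng.choices(ALPHABET, weights=weights)[0]

    # Starting vertex in the first column.
    u = pick([tau_vertex[a] ** alpha * eta_vertex[a] ** beta for a in ALPHABET])
    path = [u]
    # tau_edge[j] holds the pheromone on edges between column j and column j + 1.
    for j in range(length - 1):
        u = pick([tau_edge[j][(u, v)] ** alpha * eta_edge[(u, v)] ** beta
                  for v in ALPHABET])
        path.append(u)
    return "".join(path)

# Uniform trails and heuristics (hypothetical values) give a uniformly random motif.
uniform_vertex = {a: 0.25 for a in ALPHABET}
uniform_edge = {(u, v): 1 / 16 for u in ALPHABET for v in ALPHABET}
motif = build_path(7, uniform_vertex, [uniform_edge] * 6,
                   uniform_vertex, uniform_edge, random.Random(1))
print(motif)
```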

The objective function and identification of binding sites

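For each acceptable solution $S_c$, instead of the objective in Equation (1), ACOMotif takes the total Hamming distance $H_d(S_c)$ from $S_c$ to the DNA sequences in $S$ as the objective function:

$$H_d(S_c) = \sum_{i=1}^{N} d_H(S_c, m_i), \qquad (8)$$

where $d_H$ denotes the Hamming distance and

$$m_i = \arg\min \{\, d_H(S_c, m) : m \text{ is a substring of length } l \text{ of } S_i \,\}. \qquad (9)$$

The minimizing string $m_i$ in (9) is an instance of $S_c$, and its position in $S_i$ is the binding site. Note that $S_c$ can have multiple instances on $S_i$ with the same distance.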
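The objective and the binding sites can be computed together by sliding the candidate motif along each sequence. The sketch below is a straightforward illustration of that idea; the function names, the 0-based positions, and the toy sequences are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_sites(candidate, sequence):
    """Minimum distance over all windows, and every start position attaining it."""
    l = len(candidate)
    dists = [hamming(candidate, sequence[k:k + l])
             for k in range(len(sequence) - l + 1)]
    d_min = min(dists)
    return d_min, [k for k, d in enumerate(dists) if d == d_min]

def total_hamming(candidate, sequences):
    """Sum of per-sequence minimum distances, plus the candidate binding sites."""
    total, sites = 0, []
    for s in sequences:
        d, positions = best_sites(candidate, s)
        total += d
        sites.append(positions)
    return total, sites

# Toy sequences (hypothetical); positions are 0-based.
score, sites = total_hamming("ACG", ["TTACGTT", "ACCACG", "GGGAAG"])
print(score, sites)
```

Each sequence keeps a list of positions rather than a single one, which reflects the remark that a motif can have several instances at the same distance on one sequence.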

Local search

ACOMotif applies a hill-climbing technique for local search, as described below. After all ants have finished their paths through the graph, each solution is assigned its total Hamming score; local search is then applied to the solutions with the smallest scores.

For each potential motif $S_m$, a set $Q(S_m)$ holds the search results, and the iterative procedure is carried out as follows:

Step 1. Initialize $Q(S_m) = \{S_m\}$;

Step 2. Repeat:

    For each $i = 1, \ldots, l$ do:

        2.1 Replace the letter at position $i$ of $S_m$ by each of the three remaining letters of $\Sigma$ in turn to get $S_p$;

        2.2 Compute $H_d(S_p)$;

        2.3 If $H_d(S_p) \le H_d(S_m)$ then $S_m \leftarrow S_p$ and $Q(S_m) \leftarrow Q(S_m) \cup \{S_p\}$;

    Until the objective function cannot be improved any further.

After local search has been applied to the potential motifs in each iteration, the sets $Q(S_m)$, which consist of candidates with the smallest or nearly smallest scores, are merged into the set $Q$ containing all the best solutions found so far; among solutions that have the same binding sites, only one motif is retained. Based on the set $Q$, the pheromone trails on the graph are updated according to Equations (4) and (5). The algorithm stops after a predefined number of loops. The binding sites associated with the motifs in $Q$ allow us to identify the instances of each motif.
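The hill-climbing step can be sketched as follows. One deliberate difference from the procedure as worded above is flagged here: the sketch records candidates whose score is equal to or better than the current one, but only moves to a candidate when the score strictly decreases, which is an assumption made so that the loop is guaranteed to terminate. All names and the toy data are hypothetical.

```python
def total_distance(candidate, sequences):
    """Sum over sequences of the minimum Hamming distance to any window."""
    l = len(candidate)
    return sum(
        min(sum(x != y for x, y in zip(candidate, s[k:k + l]))
            for k in range(len(s) - l + 1))
        for s in sequences
    )

def hill_climb(motif, sequences, alphabet="ACGT"):
    """Single-letter substitutions until no substitution lowers the score."""
    current, current_score = motif, total_distance(motif, sequences)
    pool = {motif: current_score}  # candidates at least as good as the current one
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            for u in alphabet:
                if u == current[i]:
                    continue
                cand = current[:i] + u + current[i + 1:]
                cand_score = total_distance(cand, sequences)
                if cand_score <= current_score:
                    pool[cand] = cand_score
                if cand_score < current_score:  # strict improvement: move there
                    current, current_score = cand, cand_score
                    improved = True
    return current, current_score, pool

# Toy data (hypothetical): "ACG" occurs exactly in every sequence.
best, best_score, pool = hill_climb("ACT", ["TTACGTT", "CCACGCC", "GACGGA"])
print(best, best_score, sorted(pool))
```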

3.2 R-ACOMotif

Because the real positions of motifs in DNA sequences are not necessarily the solution of the optimization problem, ACOMotif additionally uses a relax technique to locate the binding sites of each motif. When ACOMotif employs this technique, it is called R-ACOMotif.

With each motif S c found in the set of solutions Q of ACOMotif and given a number , relax technique finds set of instances } and set of starting

Step 1 // Expand the instances set and binding sites

1.1 On each sequence S i, finds locations so that for each substring of

length l, Hamming distance from substring to S c is or

We get sets M i and A i including and respectively;

1.2.Compute the number of elements n i of sets M i respectively and ;

Step 2.// Filter to reduce the size of sets M i and A i

Repeat

2.1 Sort the sets $M_i$ in increasing order of $n_i$; // the steps below follow this new order

2.2 Determine the smallest number $k$ such that $n_i > 1$ for every $i \ge k$;

2.3 For each $i = k$ to $N$ do

2.3.1 For each M i, compute

;

2.3.2 Compute g i= min{ };

2.3.3 If then remove out of set M i; 2.4 // reduce ;

Until is smaller than half of that value before the loop;

Step 3. // Find the best solutions

3.1 Build all consensus matrices from the reduced sets $M_i$ and compute their consensus scores as in Equation (2);

3.2 Sort the matrices from Step 3.1, together with the corresponding locations of their instances, in decreasing order of consensus score;

Step 4. The solution is the first tuple in the list from Step 3.2. // more tuples can be taken, depending on the computed priority

4 Experimental results

The program was written in Perl and run on a desktop computer with an Intel Core i5 2.5 GHz CPU and 4 GB of RAM, running the Ubuntu 12.04 operating system. Our experiments compare the efficiency of the new algorithm with that of MFACO [2], EMACO [19], and ACRI [11] on the same datasets, using the same numbers of loops and ants as in the corresponding evaluations. The number of ants is fixed to 8. Because we do not have the programs of these algorithms, we cannot compare runtimes on the same machine configuration; the results of the compared algorithms are therefore taken directly from the published articles. The reported runtime of ACOMotif/R-ACOMotif is an average. The parameters were set as follows:

The coefficient $\rho$ is chosen depending on the algorithm's number of loops:

| Number of loops | Coefficient |
|---|---|
| 10-100 | 0.03-0.05 |
| 100-300 | 0.02-0.03 |
| 300-600 | 0.01-0.02 |
| > 600 | 0.005 |

To evaluate computation time, ACOMotif was compared with MotifSuite (2012) [20] on the SCPD dataset [21]. The efficiency of ACOMotif was assessed by experiments on the datasets published for the three algorithms above and on SCPD.

4.1 Comparison with MFACO using consensus approach

The experiments on H.sapiens used in [2] involve three small sets containing 6, 9, and 12 strings, respectively. Each string is 3,001 nucleotides long. The H.sapiens dataset has no biologically known motif. Therefore, our experiments only compare the values of the objective functions computed as in Equation (1) and Equation (2). We use the notation HSc for the total Hamming distance score and CSc for the consensus score. Note that

$$HSc + CSc = N \cdot l. \qquad (10)$$

Hence a smaller HSc is equivalent to a greater CSc; therefore, we only need to consider HSc.
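The relationship between the two scores can be checked with a short derivation, assuming that the motif is the column-wise majority string of the consensus matrix, so that in column $j$ exactly $N - \max_{u} C(u,j)$ instances differ from it. The numerical line uses the first row of Table II ($N = 6$, $l = 13$, $CSc = 76$).

```latex
\begin{align*}
HSc &= \sum_{j=1}^{l} \Bigl( N - \max_{u \in \Sigma} C(u,j) \Bigr)
     = N \cdot l - \sum_{j=1}^{l} \max_{u \in \Sigma} C(u,j)
     = N \cdot l - CSc, \\
HSc &= 6 \cdot 13 - 76 = 78 - 76 = 2 .
\end{align*}
```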

The experiments were performed in the same way as in [2]: each set was run three times with 50 loops, the reported computation time is an average, and the results of MFACO are taken from [2].

The experimental results for H.sapiens dataset 1 are shown in Table I and Table II, with motif length l = 7 and l = 13, respectively.

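Table I. A comparison between ACOMotif and MFACO on H.sapiens 1: $\rho$ = 0.03, l = 7, N = 6, runtime 41 s

| ACOMotif motif | CSc | HSc | MFACO motif | CSc | HSc |
|---|---|---|---|---|---|
| CCTCCCC | 42 | 0 | AAAAAAA | 42 | 0 |
| AAAAAAA | 42 | 0 | AGGAGGA | 42 | 0 |
| GCAGCGG | 42 | 0 | AAAAAAG | 42 | 0 |
| GCCGGGG | 42 | 0 | TAAAAAT | 42 | 0 |
| GCCGCCG | 42 | 0 | | | |
| AAAAAAG | 42 | 0 | | | |
| GCCTGTG | 42 | 0 | | | |
| TAAAAAT | 42 | 0 | | | |
| CGGCGCC | 42 | 0 | | | |
| GGGCCAG | 42 | 0 | | | |
| GGCCAGG | 42 | 0 | | | |
| GCGGGCG | 42 | 0 | | | |
| CCCGGGC | 42 | 0 | | | |

Table II. A comparison between ACOMotif and MFACO on H.sapiens 1: $\rho$ = 0.03, l = 13, N = 6, runtime 96 s

| ACOMotif motif | CSc | HSc | MFACO motif | CSc | HSc |
|---|---|---|---|---|---|
| AAAAAAAAAAAGA | 76 | 2 | AAAAAAAAAAAGA | 76 | 2 |
| AAAAAAAAAAAAG | 75 | 3 | AAAAAAAAAAAAG | 75 | 3 |
| GCTGAGGCAGGAG | 72 | 6 | AAAAAAAAAAAGT | 75 | 3 |
| GCCGCCGCCGCCG | 72 | 6 | AAAAAAAAAAAAG | 75 | 3 |
| CGCCGCCGCCGCC | 72 | 6 | | | |
| GAGGCTGAGGCAG | 71 | 7 | | | |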

Remark: Table I shows that ACOMotif is considerably better (13 motifs found) than MFACO (only 4 motifs found) at the same HSc and CSc. Table II shows that both algorithms discovered the motif with the best score, but MFACO found three motifs with HSc equal to 3, while ACOMotif found only one. However, the number of motifs discovered by ACOMotif is higher in general.

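The experimental results for H.sapiens dataset 2 are presented in Table III and Table IV, with motif length l = 7 and l = 13, respectively.

Table III. Comparison between ACOMotif and MFACO on H.sapiens 2: $\rho$ = 0.03, l = 7, N = 9, runtime 62 s

| ACOMotif motif | CSc | HSc | MFACO motif | CSc | HSc |
|---|---|---|---|---|---|
| CCCTCCT | 63 | 0 | CCCTCCT | 63 | 0 |
| CTCCCTT | 62 | 1 | CCCTCAG | 62 | 1 |
| GAGCAGG | 62 | 1 | GGGTTGG | 62 | 1 |
| GGGTTGG | 62 | 1 | GAGCAGG | 62 | 1 |
| GGGGCTG | 62 | 1 | | | |
| TGGGAGG | 62 | 1 | | | |
| GGCGGCC | 62 | 1 | | | |
| GGGGCTG | 62 | 1 | | | |
| CCCCTCC | 62 | 1 | | | |
| TTCCTGG | 62 | 1 | | | |
| CCCCTCC | 62 | 1 | | | |
| GGGCTGG | 62 | 1 | | | |
| CCTCCCT | 62 | 1 | | | |

Table IV. Comparison between ACOMotif and MFACO on H.sapiens 2: $\rho$ = 0.03, l = 13, N = 9, runtime 126 s

| ACOMotif motif | CSc | HSc | MFACO motif | CSc | HSc |
|---|---|---|---|---|---|
| GCCGGCGGGCGCC | 102 | 15 | GCCGGCGGGCGCC | 102 | 15 |
| GGCCCCCGGGCGG | 101 | 16 | GCGGGCGGGCGCC | 101 | 16 |
| GGGGGAGCAGGAG | 101 | 16 | GCCGGCGGGCGGC | 100 | 17 |
| GGCCGGCGGGCGG | 100 | 17 | GCCGGAGGGCGCC | 100 | 17 |
| GCAGGGGCTGGGG | 100 | 17 | | | |
| GGCCAGGCTCGGC | 100 | 17 | | | |
| CCCCGCCCCCGGC | 100 | 17 | | | |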

Remark: Table III shows that ACOMotif is considerably better in terms of the number of low-score motifs found: ACOMotif found 12 motifs with HSc equal to 1, compared with only 3 for MFACO. As can be seen from Table IV, ACOMotif retains its advantage over MFACO: it also found the best-scoring motif, and in addition two motifs with HSc equal to 16 and four motifs with HSc equal to 17.

The experimental results for H.sapiens dataset 3 are shown in Table V and Table VI, with motif length l = 7 and l = 13, respectively.

Table V. Comparison between ACOMotif and MFACO on H.sapiens 3: l = 7, runtime 114 s

| ACOMotif motif | CSc | HSc | MFACO motif | CSc | HSc |
|---|---|---|---|---|---|
| GGCGGGG | 123 | 3 | GGGGCGG | 123 | 3 |
| CTGAGGC | 123 | 3 | CCCAGCT | 123 | 3 |
| CCAGCTG | 123 | 3 | CCAGCTG | 123 | 3 |
| GAGGCAG | 123 | 3 | CTGAGGC | 123 | 3 |
| GGGGCGG | 123 | 3 | | | |

Table VI. Comparison between ACOMotif and MFACO on H.sapiens 3: l = 13, runtime 231 s

| ACOMotif motif | CSc | HSc | MFACO motif | CSc | HSc |
|---|---|---|---|---|---|
| GGGAGGCTGAGGC | 205 | 29 | GGGAGGCTGAGGC | 205 | 29 |
| CGGGAGGCGGAGG | 204 | 30 | CGGGAGGCGGAGG | 204 | 30 |
| GCTGAGGCAGGAG | 202 | 32 | GGAGGCTGAGGCA | 202 | 32 |
| GGAGGCTGAGGCA | 202 | 32 | CGGGAGGCGGGGG | 201 | 33 |
| GGCTGAGGCAGGA | 119 | 35 | | | |
| GGGCGGGGCGGGG | 119 | 35 | | | |

Remark: Table V shows that ACOMotif discovered one more motif with the minimum score 3 than MFACO did. Table VI shows that ACOMotif still gave better results in terms of both the number of motifs found and the score.

4.2 Comparison with MFACO and ACRI in terms of position approach

The experiment was carried out on the E.coli CRP binding sites dataset, used by both MFACO and ACRI in [2], [11], to compare the discovered binding sites. The dataset contains eighteen 105-nucleotide strings. The length of the examined motif is 22, the same as in MFACO and ACRI. R-ACOMotif was run 20 times, each with 300 loops, 10 ants, and $\rho$ = 0.02. The experimental results are shown in Table VII.

Table VII. Comparison between R-ACOMotif and the MFACO and ACRI algorithms

Remark: The results show that R-ACOMotif and MFACO both discovered all the correct starting positions, whereas ACRI had a comparatively high error.

4.3 Comparison with EMACO in terms of position approach

Experiments on two datasets, ERE and E2F, were carried out for comparison with EMACO [19]. Each dataset includes 25 strings of 200 nucleotides each; the real motifs and their starting positions on each string are known in advance. The algorithms were run 20 times with 20 ants and 100 loops, and their average values were compared. According to [19], a discovered position is correct if it is at most 3 positions away from the real location.

To assess the results, the study [4] proposed three measures, namely precision, recall, and F-score:

$$\text{Precision} = \frac{n_c}{n_p}, \qquad \text{Recall} = \frac{n_c}{n_t}, \qquad \text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad (11)$$

where $n_c$ is the number of binding sites that were correctly predicted, $n_p$ is the total number of predicted binding sites, and $n_t$ is the total number of actual binding sites. The F-score in particular is considered suitable for assessing the quality of an algorithm [11]. The experimental results in comparison with EMACO are presented in Table VIII.
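The three measures can be computed as in the sketch below, using the tolerance of 3 positions stated above. The function name `evaluate_sites`, the convention of one predicted position per sequence (with `None` meaning no prediction), and the toy positions are assumptions made for illustration.

```python
def evaluate_sites(predicted, actual, tolerance=3):
    """Precision, recall, and F-score for one predicted site per sequence."""
    n_p = sum(p is not None for p in predicted)   # predicted binding sites
    n_t = len(actual)                             # actual binding sites
    n_c = sum(p is not None and abs(p - a) <= tolerance
              for p, a in zip(predicted, actual)) # correct predictions
    precision = n_c / n_p if n_p else 0.0
    recall = n_c / n_t if n_t else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example (hypothetical positions): two of three predictions are within tolerance.
precision, recall, f_score = evaluate_sites([10, 52, None, 7], [12, 40, 30, 7])
print(precision, recall, f_score)
```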

Table VIII. Comparison between R-ACOMotif and EMACO on the ERE and E2F datasets

Remark: The table shows that on both datasets R-ACOMotif achieves significantly better measurements than EMACO. In particular, the former's F-score is higher than the latter's. Thus, we conclude that R-ACOMotif performs better than the combination of ACO and EM in [19].

4.4 Comparison with MotifSuite

To compare computation time and precision, ACOMotif and MotifSuite [20] were run on two datasets, GCR1 and GCN4, taken from the SCPD data [21]. GCR1 contains six strings (DNA sequences) of length 9,050, and its real motif is CTTCC (l = 5). GCN4 contains nine strings, and its real motif is TGACTC (l = 6).

Both programs were run 20 times with the same 20 loops to compute the average runtime and to check whether they could find the real motif. ACOMotif used eight ants. The experimental results show that ACOMotif found the real motif CTTCC on GCR1 with score HSc = 0 and TGACTC on GCN4 with score HSc = 1, whereas MotifSuite did not. The runtimes of the two algorithms, presented in Table IX, show that ACOMotif is dramatically faster.
