DSpace at VNU: ACOHAP: An efficient ant colony optimization for the haplotype inference by pure parsimony problem


DOI 10.1007/s11721-013-0077-8

ACOHAP: an efficient ant colony optimization for the haplotype inference by pure parsimony problem

Dong Duc Do · Sy Vinh Le · Xuan Huan Hoang

Received: 20 March 2012 / Accepted: 12 February 2013 / Published online: 28 February 2013

Abstract Haplotype information plays an important role in many genetic analyses. However, the identification of haplotypes based on sequencing methods is both expensive and time consuming. Current sequencing methods can efficiently determine only the conflated data of haplotypes, that is, genotypes. This raises the need to develop computational methods to infer haplotypes from genotypes. Haplotype inference by pure parsimony is an NP-hard problem and remains a challenging task in bioinformatics. In this paper, we propose an efficient ant colony optimization (ACO) heuristic method, named ACOHAP, to solve the problem. The main idea is based on the construction of a binary tree structure through which ants can travel and resolve conflated data of all haplotypes from site to site. Experiments with both small and large data sets show that ACOHAP outperforms other state-of-the-art heuristic methods. ACOHAP is as good as the currently best exact method, RPoly, on small data sets. However, it is much better than RPoly on large data sets. These results demonstrate the efficiency of the ACOHAP algorithm in solving the haplotype inference by pure parsimony problem for both small and large data sets.

Keywords ACOHAP · Ant colony optimization · Haplotype inference · Pure parsimony · Genotypes

D.D. Do

Information Technology Institute, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

e-mail: dongdoduc@vnu.edu.vn

S.V. Le (corresponding author)

University of Engineering and Technology & Information Technology Institute, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

e-mail: vinhbio@gmail.com

X.H. Hoang

University of Engineering and Technology, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

e-mail: huanhx@vnu.edu.vn


1 Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of genomic variation. The nucleotide variants at SNP sites are called alleles. Most SNPs are biallelic, that is, only two different nucleotides are observed in the population. A haplotype is a sequence of alleles on one chromosome. Haplotypes provide important information for many genetic analyses (The International HapMap Consortium 2007; Graça et al. 2010, and references therein). However, the experimental determination of haplotypes is expensive and time consuming. Fortunately, current sequencing methods can efficiently determine their conflated data, that is, genotypes. This motivates researchers to develop computational methods to infer haplotypes from genotypes.

Haplotype inference is a challenging problem in bioinformatics (Istrail 2004; Graça et al. 2010). Haplotype inference by pure parsimony (HIPP), that is, finding the smallest number of haplotypes to explain a set of genotypes without recombination, is an NP-hard problem (Gusfield 2001, 2003). Many different computational methods have been proposed to solve this problem. These methods can be classified as either heuristic or exact approaches.

Clark was the first to propose an inference rule-based approach to solve the HIPP problem (Clark 1990). Tininini et al. generalized Clark's inference rules to construct the CollHaps algorithm (Tininini et al. 2010). This algorithm starts from a list of haplotypes and performs a sequence of collapse steps in order to minimize the number of distinct haplotypes. CollHaps, a heuristic method, is designed to conduct a number of complete attempts (driven by a randomized quasi-greedy strategy) for each problem instance to find the best solution. Another heuristic method, called the parsimonious tree-grow method (PTG), was proposed to solve this problem (Li et al. 2005). The main idea of PTG is to resolve genotypes from site to site by growing a maximum parsimony tree. Although PTG is very efficient in terms of computational complexity, it is less accurate than other methods.

Exact methods for solving the HIPP problem include integer linear programming techniques (Gusfield 2001), Boolean satisfiability (SAT) formulation techniques (Lynce and Marques-Silva 2008), and Boolean constraint techniques (Graça et al. 2007, 2008). Recently, Graça et al. (2007, 2008) developed a pseudo-Boolean optimization method, called RPoly, for solving the HIPP problem. Experimental results show that RPoly outperforms other exact methods. Although exact methods are able to find optimal solutions for the HIPP problem, they are only applicable to small data sets due to the computational burden (Gusfield and Orzack 2005; Brown and Harrower 2006).

The ant colony optimization (ACO) approach has been widely used to tackle combinatorial optimization problems (Dorigo and Stützle 2004, and references therein). Benedettini and colleagues were the first to apply ACO to solve the HIPP problem (Benedettini et al. 2008). Their algorithm includes two levels: in the first level, it uses ACO to determine a good visiting order of genotypes; in the second level, it employs ACO to infer haplotypes from genotypes following the orders determined in the first level. Although heuristic information is estimated to determine a good visiting order of genotypes in this ACO system, it is not estimated to instruct ants to infer individual alleles.

In this paper, we propose an ACO-based method, named ACOHAP, to solve the HIPP problem for large data sets. The key idea is the construction of a binary tree structure which allows ants to travel and resolve conflated data of all haplotypes from site to site. Heuristic information for each individual allele can be effectively estimated to guide ants towards good solutions. This technique overcomes the limitation of the 2-level ACO method in estimating heuristic information.

The rest of the paper is organized as follows: The HIPP problem is presented in Sect. 2; the ACOHAP algorithm is described in Sect. 3; Sect. 4 analyzes the performance of different methods on both small and large data sets; conclusions are given in the last section.

2 The HIPP problem

We assume SNPs are biallelic and represent alleles by 0 or 1. A haplotype of $m$ sites is represented by a string $h = h_1 \ldots h_m$ of size $m$, where $h_i \in \{0, 1\}$. Consider an unordered haplotype pair $(h^a, h^b)$. Their corresponding genotype $g$ is represented by a string $g = g_1 \ldots g_m$ of size $m$, where $g_i \in \{0, 1, 2\}$ is the conflated data at site $i$. Here, $g_i$ is defined as follows:

\[
g_i =
\begin{cases}
h^a_i & \text{if } h^a_i = h^b_i \text{ (genotype $g$ is homozygous at site $i$)} \\
2 & \text{if } h^a_i \neq h^b_i \text{ (genotype $g$ is heterozygous at site $i$)}
\end{cases}
\tag{1}
\]

The haplotype pair $(h^a, h^b)$ is called a haplotype resolution of $g$, and we say the genotype $g$ is resolved by $(h^a, h^b)$. A given genotype $g$ has exactly $2^{k-1}$ different haplotype resolutions, where $k \geq 1$ is the number of sites at which $g$ is heterozygous (a genotype with no heterozygous site has the single resolution $(g, g)$). For example, genotype $g = 022$ can be resolved by the two different unordered haplotype pairs $(000, 011)$ and $(001, 010)$.
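The count of resolutions can be checked by enumeration. The following is a minimal sketch, not part of the original method; the function name `resolutions` and the string encoding of genotypes over the characters 0, 1, 2 are assumptions made for illustration. It fixes the first heterozygous site of the first haplotype to 0 so that each unordered pair is produced exactly once.

```python
from itertools import product


def resolutions(genotype: str) -> list[tuple[str, str]]:
    """Enumerate all unordered haplotype pairs that resolve `genotype`.

    `genotype` is a string over {'0', '1', '2'}; '2' marks a heterozygous site.
    (Hypothetical helper, for illustration only.)
    """
    het = [i for i, g in enumerate(genotype) if g == "2"]
    if not het:
        # No heterozygous site: the only resolution is (g, g).
        return [(genotype, genotype)]
    pairs = []
    # Fix the allele of h_a at the first heterozygous site to '0';
    # the remaining k - 1 heterozygous sites are free, giving 2^(k-1) pairs.
    for bits in product("01", repeat=len(het) - 1):
        ha = list(genotype)
        hb = list(genotype)
        for site, allele in zip(het, ("0",) + bits):
            ha[site] = allele
            hb[site] = "1" if allele == "0" else "0"
        pairs.append(("".join(ha), "".join(hb)))
    return pairs


print(resolutions("022"))  # [('000', '011'), ('001', '010')]
```

For a genotype with four heterozygous sites the same function returns eight pairs, in line with the count stated above.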

Let $G = \{g^1, \ldots, g^n\}$ be a set of $n$ genotypes at the $m$ loci under consideration. A haplotype set $H = \{h^1, \ldots, h^k\}$ is said to be a solution of $G$ if each genotype of $G$ is resolved by a pair of haplotypes in $H$.

Given a set of genotypes $G$, the HIPP problem is to find a solution $H$ with the minimum number of haplotypes (Gusfield 2003).

Consider an example of three genotypes $G = \{g^1, g^2, g^3\} = \{121, 002, 221\}$. The set $H = \{h^1, h^2, h^3, h^4\} = \{000, 001, 101, 111\}$ is an optimal solution of $G$ having four haplotypes, where $(h^3, h^4)$, $(h^1, h^2)$ and $(h^2, h^4)$ are haplotype resolutions of $g^1$, $g^2$, and $g^3$, respectively.
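The definition of a solution can be verified mechanically on this example. The sketch below is illustrative only; the names `conflate` and `is_solution` are hypothetical and the string encoding is an assumption. It checks that every genotype is the conflation of some pair of haplotypes from the candidate set; it does not check optimality.

```python
def conflate(ha: str, hb: str) -> str:
    """Genotype obtained from a haplotype pair: equal alleles are kept, unequal ones become '2'."""
    return "".join(a if a == b else "2" for a, b in zip(ha, hb))


def is_solution(H, G) -> bool:
    """True if every genotype in G is resolved by some pair of haplotypes in H.

    A haplotype may be paired with itself (fully homozygous genotype).
    """
    hs = sorted(set(H))
    explained = {conflate(a, b) for i, a in enumerate(hs) for b in hs[i:]}
    return all(g in explained for g in G)


G = ["121", "002", "221"]
print(is_solution({"000", "001", "101", "111"}, G))  # True
print(is_solution({"000", "001", "101"}, G))         # False: 121 and 221 are not explained
```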

3 Methods

3.1 Graph construction

This section describes a binary tree structure to represent all possible haplotypes with $m$ sites (see Fig. 1). The tree structure has the following properties:

– It is a full binary tree with $m + 1$ levels. The root is at level 0 and leaves are at level $m$.

– We denote by $v$, $v \in \{0, 1\}$, the label of a branch. A branch from an internal node $X$ to its left (right) child is labeled 0 (1) and called the 0-branch (1-branch).

– A node is labeled by the concatenated string of branch labels from the root to the node.

– The label of a node at level $i$ represents a haplotype with $i$ sites.

The binary tree has $2^m$ leaves, which represent the $2^m$ different possible haplotypes.

Fig. 1 The full binary tree of depth $m$ (the root is at level 0; leaves are at level $m$). Branches are labeled either 0 or 1. A node is labeled by the concatenated string of branch labels from the root to the node. The label of a node at level $i$ represents a haplotype with $i$ sites. Ants can travel from the root to the leaves of the tree to determine haplotypes. For example, two haplotypes $h^a = 001$ and $h^b = 101$ (bold paths) are a haplotype resolution of $g = 201$.

In the algorithm proposed below, we think of an ant traveling from the root to the leaves of the tree to resolve each genotype $g$ into two haplotypes $h^a$ and $h^b$. At level $i - 1$, the ant determines allele $h^a_i$ by following either the 0-branch ($h^a_i = 0$) to the left child or the 1-branch ($h^a_i = 1$) to the right child. Specifically, if genotype $g$ is homozygous at site $i$, we assign $h^a_i = g_i$ and the ant follows the $g_i$-branch. Otherwise, the ant decides to follow either the 0-branch or the 1-branch based on the pheromone trail and heuristic information. Note that the complementary haplotype $h^b$ can be determined from $h^a$ and $g$, and vice versa. Specifically,

\[
h^b_i =
\begin{cases}
h^a_i & \text{if } g_i \text{ is 0 or 1 (homozygous)} \\
1 - h^a_i & \text{if } g_i \text{ is 2 (heterozygous)}
\end{cases}
\tag{2}
\]

For example, genotype $g = 201$ can be resolved into two haplotypes $h^a = 001$ and $h^b = 101$ (bold paths in the tree in Fig. 1).
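The rule for recovering the complementary haplotype is simple enough to state as code. This is a minimal sketch under the assumption that haplotypes and genotypes are encoded as strings; the name `complement` is hypothetical.

```python
def complement(ha: str, g: str) -> str:
    """Derive h_b from h_a and genotype g: copy homozygous sites, flip heterozygous ones."""
    return "".join(
        a if gi in "01" else str(1 - int(a))
        for a, gi in zip(ha, g)
    )


print(complement("001", "201"))  # 101
```

Because the rule is symmetric, applying it to the result returns the original haplotype, which reflects the "vice versa" in the text.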

3.2 The HAPIN algorithm

We propose an ant-walking algorithm, named HAPIN, to determine a solution $H$ of $G$. The algorithm starts from the root of the tree with $n$ initially undetermined haplotype pairs $(h^{1a}, h^{1b}), \ldots, (h^{na}, h^{nb})$. To determine these haplotype pairs, we think of them as ants and allow them to follow branches of the tree from the root to the leaves in such a way that haplotype pair $(h^{sa}, h^{sb})$ forms a haplotype resolution of $g^s$ for each $s \in \{1, \ldots, n\}$. A leaf is active if it contains at least one haplotype. Labels of active leaves constitute a solution $H$ of $G$ (see Fig. 2).

Fig. 2 The full binary tree structure through which ants can travel and determine a solution $H$ of $G$. Consider an example of three genotypes $G = \{121, 002, 221\}$: haplotypes $h^{1a}, h^{1b}, h^{2a}, h^{2b}, h^{3a}, h^{3b}$ follow branches from the root to four leaves: 000, 001, 101, and 111. The set $H = \{000, 001, 101, 111\}$ including labels of these leaves is a solution of $G$.

The HAPIN algorithm performs $m$ iterations. In the $i$th iteration, alleles at site $i$ for all haplotypes are inferred (all ants move from nodes at level $i - 1$ to nodes at level $i$). Let $H_o$ ($H_e$) be the list of haplotypes whose genotypes are homozygous (heterozygous) at site $i$. In the $i$th iteration, the HAPIN algorithm performs the two following steps: the homozygous step and the heterozygous step. The homozygous step determines alleles at site $i$ for all haplotypes of $H_o$. It is straightforward to determine these alleles. The heterozygous step determines alleles at site $i$ for all haplotypes of $H_e$. The determination of each allele is guided by both the pheromone trail and heuristic information. The details of the HAPIN algorithm are described in Algorithm 1.

Figure 2 illustrates an example of three genotypes $G = \{g^1, g^2, g^3\} = \{121, 002, 221\}$. Haplotypes $h^{1a}, h^{1b}, h^{2a}, h^{2b}, h^{3a}, h^{3b}$ follow branches to four leaves: 000 ($h^{2a}$), 001 ($h^{2b}, h^{3a}$), 101 ($h^{1a}$), and 111 ($h^{1b}, h^{3b}$). Thus, the haplotype set $H = \{000, 001, 101, 111\}$ is a solution of $G$.

3.2.1 Pheromone trail update

Pheromone trail update is a crucial step in ant colony optimization as it guides ants to find good solutions. The heterozygous step uses pheromone trail $\tau^s_{iv}$ as a guide to choose between alleles $h^{sa}_i$ and $h^{sb}_i$. We use the smooth max-min ant system method (Do et al. 2008), a refined version of the well-known pheromone boundary rule (Stützle and Hoos 2000), to update the pheromone trail. Let $H = \{h^1, \ldots, h^k\}$ be a solution which is used to update the pheromone trail. Consider genotype $g^s \in G$ resolved by haplotypes $h^{sa}, h^{sb} \in H$; the pheromone trail


Algorithm 1: HAPIN Algorithm

Input: A set of $n$ genotypes with $m$ sites $G = \{g^1, \ldots, g^n\}$, where $g^s = g^s_1 \ldots g^s_m$

Output: A haplotype solution $H = \{h^1, \ldots, h^k\}$ of $G$

Begin

  Set a list of $n$ initially undetermined haplotype pairs $(h^{1a}, h^{1b}), \ldots, (h^{na}, h^{nb})$ to the root of the tree

  for $i = 1 \to m$ do

    // In the $i$th iteration, alleles of all haplotypes at site $i$ are determined

    1. Homozygous step: Homozygous genotypes at site $i$ are processed in this step. Consider a homozygous genotype $g^s$ at site $i$; alleles $h^{sa}_i$ and $h^{sb}_i$ are simply assigned equal to $g^s_i$. Specifically, if $g^s_i = 0$, we assign $h^{sa}_i = h^{sb}_i = 0$ (haplotypes $h^{sa}$ and $h^{sb}$ follow the 0-branches to the left nodes). Otherwise, we assign $h^{sa}_i = h^{sb}_i = 1$ (haplotypes $h^{sa}$ and $h^{sb}$ follow the 1-branches to the right nodes)

    2. Heterozygous step: Heterozygous genotypes at site $i$ are processed in this step. Consider a heterozygous genotype $g^s$ at site $i$; allele $h^{sa}_i$ is assigned to $v$, $v \in \{0, 1\}$. The probability $P^s_i(v)$ that $h^{sa}_i$ is assigned $v$ ($h^{sa}$ follows the $v$-branch; $h^{sb}$ follows the $(1 - v)$-branch) is defined as follows:

    \[
    P^s_i(v) = \frac{(\tau^s_{iv})^{\alpha} (\eta^s_{iv})^{\beta}}{(\tau^s_{i0})^{\alpha} (\eta^s_{i0})^{\beta} + (\tau^s_{i1})^{\alpha} (\eta^s_{i1})^{\beta}}
    \tag{3}
    \]

    in which $\alpha$, $\beta$ are the relative influence coefficients of the pheromone trail and of the heuristic information, respectively. These parameters are set as in Table 1

  Collect labels of all active leaves to form a haplotype solution $H$

End
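The branch choice in the heterozygous step is the usual ACO random-proportional rule restricted to two options. The sketch below illustrates it for a single genotype and site; the function names, the pair-of-floats representation of pheromone and heuristic values, and the default exponents `alpha=1.0`, `beta=1.0` are all assumptions made for illustration, not the tuned settings of the paper.

```python
import random


def branch_probability(tau, eta, v, alpha=1.0, beta=1.0):
    """Probability that h^{sa}_i takes allele v, given (tau_0, tau_1) and (eta_0, eta_1).

    alpha and beta defaults are placeholders, not the paper's tuned values.
    """
    weights = [(tau[u] ** alpha) * (eta[u] ** beta) for u in (0, 1)]
    return weights[v] / (weights[0] + weights[1])


def choose_branch(tau, eta, alpha=1.0, beta=1.0, rng=random):
    """Sample the allele of h^{sa} at a heterozygous site; h^{sb} then takes 1 - v."""
    p0 = branch_probability(tau, eta, 0, alpha, beta)
    return 0 if rng.random() < p0 else 1


# Equal pheromone, heuristic favouring the 1-branch three to one.
print(branch_probability((1.0, 1.0), (1.0, 3.0), 1))  # 0.75
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which is convenient when comparing parameter settings.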

Table 1 The ParamILS program (Hutter et al. 2007) was used to determine good parameter values for ACOHAP. Value: the parameter values used in the tuning process (bold numbers are default values)

f {0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0} 3.0 τmax/τmin = f × m × n

pheromone trail

information

ρ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} 0.3 Pheromone evaporation factor

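$\tau^s_{iv}$ ($i = 1, \ldots, m$; $v \in \{0, 1\}$) is updated as follows:

\[
\tau^s_{iv} = (1 - \rho)\,\tau^s_{iv} +
\begin{cases}
\rho\,\tau_{\min} & \text{if } h^{sa}_i \neq v \\
\rho\,\tau_{\max} & \text{if } h^{sa}_i = v
\end{cases}
\tag{4}
\]

in which $\rho \in (0, 1]$ is the evaporation parameter; $\tau_{\min}$ and $\tau_{\max}$ are the lower bound and upper bound of the pheromone trail, respectively. These parameters are set as in Table 1.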
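The update for one genotype can be sketched as follows, assuming the reading that the branch taken by $h^{sa}$ in the updating solution is reinforced toward the upper bound while the other branch decays toward the lower bound, as in the smooth max-min ant system. The function name, the list-of-pairs layout `tau[i][v]`, and the numeric values in the usage lines are illustrative assumptions.

```python
def update_pheromone(tau, ha, rho, tau_min, tau_max):
    """Smooth max-min style update of tau[i][v] for one genotype s.

    tau : list of [tau_i0, tau_i1] per site (modified in place and returned)
    ha  : haplotype h^{sa} of the solution used for the update, as a '0'/'1' string
    """
    for i, allele in enumerate(ha):
        for v in (0, 1):
            deposit = tau_max if int(allele) == v else tau_min
            tau[i][v] = (1 - rho) * tau[i][v] + rho * deposit
    return tau


tau = [[1.0, 1.0], [1.0, 1.0]]
print(update_pheromone(tau, "01", rho=0.5, tau_min=0.0, tau_max=2.0))
# [[1.5, 0.5], [0.5, 1.5]]
```

Since each new value is a convex combination of the old value and one of the two bounds, trails that start inside the interval between the bounds stay inside it without explicit clamping.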


3.2.2 Heuristic information estimation

For a heterozygous genotype $g^s$ at site $i$, the heuristic information $\eta^s_{iv}$ is used as a guide to determine alleles $h^{sa}_i$ and $h^{sb}_i$. We denote by $H_i$ the list of haplotypes whose alleles at site $i$ are already determined. This haplotype list is used to estimate the heuristic information $\eta^s_{iv}$.

Two partially determined haplotypes $h$ and $h'$ are called compatible if they can follow branches to the same leaf. For example, the two haplotypes $h^{2b}$ and $h^{3a}$ in Fig. 2 are compatible because they follow branches to the same leaf 001.

We use the heuristic that haplotypes $h^{sa}$, $h^{sb}$ should follow branches such that they will be compatible with as many haplotypes of $H_i$ as possible. We denote by $c^{sa}_v$ the number of haplotypes of $H_i$ which are compatible with $h^{sa}$ when following the $v$-branch, $v \in \{0, 1\}$. If $c^{sa}_v = 0$, $h^{sa}$ cannot travel with any haplotype of $H_i$ to the same leaf, that is, it will result in a new active leaf. If $h^{sa}$ follows the $v$-branch ($h^{sb}$ follows the $(1 - v)$-branch), the minimum number of new active leaves $t^s_v$ resulting from $h^{sa}$ and $h^{sb}$ can be determined as follows:

\[
t^s_v =
\begin{cases}
2 & \text{if } c^{sa}_v = 0 \text{ and } c^{sb}_{1-v} = 0 \\
0 & \text{if } c^{sa}_v \neq 0 \text{ and } c^{sb}_{1-v} \neq 0 \\
1 & \text{otherwise}
\end{cases}
\tag{5}
\]
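The compatibility counts and the resulting lower bound on new active leaves can be sketched as below. The representation is an assumption: a partially determined haplotype is a string over 0, 1 and `?` (undetermined), and two such strings are treated as compatible when they never disagree at a site where both are determined. The function names are hypothetical, and the text does not spell out the compatibility test at this level of detail.

```python
def compatible(h1: str, h2: str) -> bool:
    """Assumed test: no site where both haplotypes are determined and differ."""
    return all(a == b or "?" in (a, b) for a, b in zip(h1, h2))


def count_compatible(h: str, determined) -> int:
    """Number of haplotypes in `determined` (playing the role of H_i) compatible with h."""
    return sum(compatible(h, other) for other in determined)


def min_new_leaves(c_a: int, c_b: int) -> int:
    """Minimum number of new active leaves, given the counts c^{sa}_v and c^{sb}_{1-v}."""
    if c_a == 0 and c_b == 0:
        return 2
    if c_a != 0 and c_b != 0:
        return 0
    return 1


H_i = ["001", "000", "0??"]
c_a = count_compatible("001", H_i)  # h^{sa} after taking the 1-branch at the last site
c_b = count_compatible("110", H_i)  # its complement, which matches nothing in H_i
print(c_a, c_b, min_new_leaves(c_a, c_b))  # 2 0 1
```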

Haplotypes $h^{sa}$ and $h^{sb}$ tend to follow branches which minimize the minimum number of new active leaves and maximize the number of compatible haplotypes. To this end, the heuristic information $\eta^s_{i0}$ and $\eta^s_{i1}$ are estimated as follows:

\[
(\eta^s_{i0}, \eta^s_{i1}) =
\begin{cases}
\ldots & \text{if } t^s_0 < t^s_1 \\
\ldots & \text{if } t^s_0 > t^s_1 \\
\left( (c^{sa}_0 + c^{sb}_1 + 1)m,\ (c^{sa}_1 + c^{sb}_0 + 1)m \right) & \text{if } t^s_0 = t^s_1
\end{cases}
\tag{6}
\]

3.3 The ACOHAP algorithm

We propose an ACO algorithm, named ACOHAP, to search for a good solution $H$ of $G$. The key component of ACOHAP is the HAPIN algorithm, which determines one solution at each run. Local searches are typically coupled with ACO algorithms to improve solutions. ACOHAP uses the so-called stochastic first improvement rule method (Di Gaspero and Roli 2008) to improve solutions generated by the HAPIN algorithm. This simple local search improves a solution by replacing two current haplotypes by a new one if possible. The improvement process is repeated until no further replacement is found.

The ACOHAP algorithm performs the HAPIN algorithm several times to find a good solution. Since the HIPP problem assumes that sites are independent, the order in which sites are inferred is randomly specified for each run of the HAPIN algorithm to enforce search space exploration. The ACOHAP algorithm is described in Algorithm 2.

The complexities of both the HAPIN algorithm and the stochastic first improvement rule algorithm are $O(n^2 m)$. Thus, the complexity of the inference step in the ACOHAP algorithm is $O(n^2 m)$ for each run. The overall complexity of the ACOHAP algorithm is


Algorithm 2: ACOHAP Algorithm

Data: A list of $n$ genotypes with $m$ sites $G = \{g^1, \ldots, g^n\}$, where $g^s = g^s_1 \ldots g^s_m$

Result: The best found solution $H = \{h^1, \ldots, h^k\}$

begin

  // Initialization step

  – Initialize pheromone trail

  – Set the currently best solution $H_{best}$ = undefined

  – Set the number of loops $N_{loops} = 0$

  repeat

    $H_{local}$ = undefined

    for $p \leftarrow 1$ to $N_{ants}$ do

      // Inference step

      – Specify a random inferring order of sites $S = (s_1, \ldots, s_m)$

      – Perform the HAPIN algorithm to determine solution $H_p$ using $S$, i.e. alleles of all haplotypes at site $s_i$ are determined in the $i$th iteration

      – Perform the stochastic first improvement rule algorithm to improve the obtained solution $H_p$

      – If $H_p$ is better than $H_{local}$ then update $H_{local} = H_p$

    // Updating step

    – Use $H_{local}$ to update the pheromone trail

    – If $H_{local}$ is better than the currently best solution $H_{best}$ then update $H_{best} = H_{local}$

    – Increase the number of loops $N_{loops}$ by one

    // Restarting step

    – Reset the pheromone trail if no improvement is found after 30 consecutive iterations

  until (the running time exceeds a given time limit)

  // Return the best solution found $H_{best}$

  Return $H_{best}$

end
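The control flow of Algorithm 2 can be sketched as a driver that takes the problem-specific pieces as callables. This is only a skeleton under stated assumptions: `hapin`, `local_search`, `update_pheromone` and `reset_pheromone` are hypothetical callbacks standing in for the components described in the text, `n_ants=10` and the one-second default `time_limit` are placeholders, solutions are compared by their number of haplotypes, and the stall counter is restarted after a pheromone reset (a choice the pseudocode leaves open).

```python
import random
import time


def acohap(G, hapin, local_search, update_pheromone, reset_pheromone,
           n_ants=10, time_limit=1.0, stall_limit=30, rng=None):
    """Skeleton of the ACOHAP main loop; all callbacks are supplied by the caller."""
    rng = rng or random.Random()
    m = len(G[0])
    reset_pheromone()                      # initialization step
    best, stall = None, 0
    deadline = time.monotonic() + time_limit
    while True:                            # repeat ... until time limit
        local = None
        for _ in range(n_ants):            # inference step
            order = list(range(m))
            rng.shuffle(order)             # random site order for this run
            H = local_search(hapin(G, order))
            if local is None or len(H) < len(local):
                local = H
        update_pheromone(local)            # updating step
        if best is None or len(local) < len(best):
            best, stall = local, 0
        else:
            stall += 1
            if stall >= stall_limit:       # restarting step
                reset_pheromone()
                stall = 0
        if time.monotonic() >= deadline:
            return best
```

Keeping the components as parameters lets the loop be exercised with trivial stand-ins before a real inference routine is plugged in.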

4 Experimental results

We compared ACOHAP to the currently best methods, RPoly version 1.2.1 (Graça et al. 2008),¹ CollHaps (Tininini et al. 2010),² and PTG (Li et al. 2005),³ on both small and large data sets. All experiments were conducted on a PC cluster of 24 nodes (AMD 2.2 GHz, 48 GB RAM). RPoly is an exact method which uses pseudo-Boolean optimization techniques to find optimal solutions. The running time of RPoly was set to 100000 seconds

1 http://sat.inesc-id.pt/~assg/rpoly/.

2 http://www.iasi.cnr.it/~liuzzi/BIOCOMP/SNP/.

3 http://doc.aporc.org/wiki/PTG.

(∼28 h) for each problem instance. RPoly returns an approximate solution if it cannot find an optimal solution for a given problem instance after 100000 seconds.

CollHaps is a heuristic method which generalizes the well-known Clark's rule method to solve the HIPP problem. PTG is a very fast heuristic algorithm. Both CollHaps and ACOHAP require long running times to converge to optimal solutions. In our experiments, the running time limit for both ACOHAP and CollHaps was set to 1000 seconds. In addition to the time limit of 100000 seconds, we also tested the performance of RPoly with a time limit of 1000 seconds, called RPoly1000, for each problem instance.

We used the ParamILS program (Hutter et al. 2007) to determine good parameter settings for ACOHAP. Table 1 presents the parameters and their values as used in the tuning process. The parameter settings for ACOHAP were obtained using ParamILS on artificially generated benchmarks (SU1, SU2, SU3, and SU-100kb) from the International HapMap Consortium (Marchini et al. 2006). These parameters are given in Table 1 and are used to test the performance of ACOHAP on all data sets. Note that ACOHAP was applied only once to each problem instance.

Unfortunately, the software for the two-level ACO method (ACO-HI+) (Benedettini et al. 2008) is no longer available for testing (as communicated by the first author of Benedettini et al. 2008). However, we can compare ACOHAP with ACO-HI+ on 100 problem instances of the SU2 data set (Marchini et al. 2006) for which results obtained with ACO-HI+ are available (Benedettini et al. 2008).

4.1 Small data sets

We tested the different methods on nine small data sets including four artificially generated benchmarks and five biological data sets

4.1.1 Artificially generated benchmarks

We used artificially generated benchmarks (SU1, SU2, SU3, SU-100kb)⁴ from the International HapMap Consortium (Marchini et al. 2006) to assess the performance of the different methods. SU1 was generated using a constant recombination rate across the whole region, a constant population size, and random mating. It contains 100 problem instances of 90 genotypes. SU2 was generalized from SU1 in the sense that the recombination rate varies across the region. SU3 is the same as SU2 except that the demography model is consistent with the white Americans model. The small data set SU-100kb contains 29 problem instances, each consisting of 90 genotypes. These four data sets together contain 329 different problem instances. They are summarized in Table 2.

Results from RPoly, ACOHAP and CollHaps for the SU data sets are presented in Table 3. The PTG method is not applicable to multiple problem instances. Therefore, we cannot assess its performance for these 329 artificial problem instances. The sum of the objective function values of the solutions generated by ACOHAP for all 329 problem instances is 42305, which is smaller than those from RPoly (42823) and CollHaps (42852). ACOHAP found optimal solutions for 303 (92 %) out of 329 problem instances. RPoly found optimal solutions for these and 17 other problem instances (320 in total). However, optimal solutions from RPoly for these 17 problem instances are only slightly better than those from ACOHAP. ACOHAP is much better than RPoly on the nine problem instances where RPoly could

4 http://www.stats.ox.ac.uk/~marchini/phaseoff.html.


Table 2 Artificially generated benchmarks (SU1, SU2, SU3, SU-100kb) from the International HapMap Consortium (Marchini et al. 2006). These four data sets contain 329 different problem instances, each consisting of 90 genotypes

Data set #Problem instances #Genotypes (n) Genotype length (m)

Table 3 Results obtained with RPoly, RPoly1000, ACOHAP and CollHaps for the SU data sets; #Opts: The number of optimal solutions; #Haps: The number of haplotypes

Table 4 Results obtained with ACOHAP and CollHaps on artificially generated benchmarks. ACOHAP < CollHaps: ACOHAP is better than CollHaps; ACOHAP = CollHaps: ACOHAP is as good as CollHaps; ACOHAP > CollHaps: ACOHAP is worse than CollHaps

Data set   #Problem instances ACOHAP < CollHaps   #Problem instances ACOHAP = CollHaps   #Problem instances ACOHAP > CollHaps

not find optimal solutions after 100000 seconds. The approximate solutions from RPoly for these nine problem instances are even worse than those from CollHaps. RPoly with a time limit of 1000 seconds (RPoly1000) found optimal solutions for 310 problem instances. Thus, it could not find optimal solutions for 10 problem instances whose optimal solutions were found by RPoly with a time limit of 100000 seconds.

The sum of the objective function values of the solutions generated by ACOHAP (42305) is about 1.3 % smaller than that of CollHaps (42852). ACOHAP is as good as CollHaps for the small data set SU-100kb. However, it is superior to CollHaps for larger data sets, that is, SU1, SU2, and SU3 (see Table 4 for more details). For example, ACOHAP outperforms CollHaps on 76 out of 100 problem instances of the SU1 data set and is equal to CollHaps on the 24 remaining problem instances. CollHaps shows no better results than ACOHAP for the SU1 and SU3 data sets. It is only better than ACOHAP on two out of 100 problem instances of SU2.
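The percentages quoted for the SU data sets follow directly from the reported totals. The short check below only reproduces that arithmetic from the figures given in the text; the variable names are illustrative.

```python
# Totals of haplotype counts over the 329 SU problem instances, as reported in the text.
acohap_total, rpoly_total, collhaps_total = 42305, 42823, 42852

# Relative reduction of ACOHAP with respect to CollHaps.
rel_vs_collhaps = (collhaps_total - acohap_total) / collhaps_total
print(f"{rel_vs_collhaps:.1%}")  # 1.3%

# Share of instances for which ACOHAP reached an optimal solution.
opt_fraction = 303 / 329
print(f"{opt_fraction:.0%}")  # 92%

# Instances left unsolved to optimality by RPoly within 100000 s and additionally within 1000 s.
print(329 - 320, 320 - 310)  # 9 10
```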
