DOI 10.1007/s11721-013-0077-8
ACOHAP: an efficient ant colony optimization
for the haplotype inference by pure parsimony problem
Dong Duc Do · Sy Vinh Le · Xuan Huan Hoang
Received: 20 March 2012 / Accepted: 12 February 2013 / Published online: 28 February 2013
© Springer Science+Business Media New York 2013
Abstract Haplotype information plays an important role in many genetic analyses. However, the identification of haplotypes based on sequencing methods is both expensive and time consuming. Current sequencing methods can efficiently determine only the conflated data of haplotypes, that is, genotypes. This raises the need to develop computational methods to infer haplotypes from genotypes.
Haplotype inference by pure parsimony is an NP-hard problem and remains a challenging task in bioinformatics. In this paper, we propose an efficient ant colony optimization (ACO) heuristic method, named ACOHAP, to solve the problem. The main idea is based on the construction of a binary tree structure through which ants can travel and resolve conflated data of all haplotypes from site to site. Experiments with both small and large data sets show that ACOHAP outperforms other state-of-the-art heuristic methods. ACOHAP is as good as the currently best exact method, RPoly, on small data sets. However, it is much better than RPoly on large data sets. These results demonstrate the efficiency of the ACOHAP algorithm in solving the haplotype inference by pure parsimony problem for both small and large data sets.
Keywords ACOHAP · Ant colony optimization · Haplotype inference · Pure parsimony · Genotypes
D.D. Do
Information Technology Institute, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: dongdoduc@vnu.edu.vn
S.V. Le (✉)
University of Engineering and Technology & Information Technology Institute, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: vinhbio@gmail.com
X.H. Hoang
University of Engineering and Technology, VNU, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: huanhx@vnu.edu.vn
1 Introduction
Single nucleotide polymorphisms (SNPs) are the most frequent form of genomic variation. The nucleotide variants at SNP sites are called alleles. Most SNPs are biallelic, that is, only two different nucleotides are observed in the population. A haplotype is a sequence of alleles on one chromosome. Haplotypes provide important information for many genetic analyses (The International HapMap Consortium 2007; Graça et al. 2010, and references therein). However, the experimental determination of haplotypes is expensive and time consuming. Fortunately, current sequencing methods can efficiently determine their conflated data, that is, genotypes. This motivates researchers to develop computational methods to infer haplotypes from genotypes.
Haplotype inference is a challenging problem in bioinformatics (Istrail 2004; Graça et al. 2010). Haplotype inference by pure parsimony (HIPP), that is, finding the smallest number of haplotypes to explain a set of genotypes without recombination, is an NP-hard problem (Gusfield 2001, 2003). Many different computational methods have been proposed to solve this problem. These methods can be classified as either heuristic or exact approaches.
Clark was the first to propose an inference rule-based approach to solve the HIPP problem (Clark 1990). Tininini et al. generalized Clark's inference rules to construct the CollHaps algorithm (Tininini et al. 2010). This algorithm starts from a list of haplotypes and performs a sequence of collapse steps in order to minimize the number of distinct haplotypes. CollHaps, a heuristic method, is designed to conduct a number of complete attempts (driven by a randomized quasi-greedy strategy) for each problem instance to find the best solution. Another heuristic method, called the parsimonious tree-grow method (PTG), was proposed to solve this problem (Li et al. 2005). The main idea of PTG is to resolve genotypes from site to site by growing a maximum parsimony tree. Although PTG is computationally very efficient, it is less accurate than other methods.
Exact methods for solving the HIPP problem include integer linear programming techniques (Gusfield 2001), Boolean satisfiability (SAT) formulation techniques (Lynce and Marques-Silva 2008), and Boolean constraint techniques (Graça et al. 2007, 2008). Recently, Graça et al. (2007, 2008) developed a pseudo-Boolean optimization method, called RPoly, for solving the HIPP problem. Experimental results show that RPoly outperforms other exact methods. Although exact methods are able to find optimal solutions for the HIPP problem, they are only applicable to small data sets due to the computational burden (Gusfield and Orzack 2005; Brown and Harrower 2006).
The ant colony optimization (ACO) approach has been widely used to tackle combinatorial optimization problems (Dorigo and Stützle 2004, and references therein). Benedettini and colleagues were the first to apply ACO to solve the HIPP problem (Benedettini et al. 2008). Their algorithm includes two levels: in the first level, it uses ACO to determine a good visiting order of genotypes; in the second level, it employs ACO to infer haplotypes from genotypes following the orders determined in the first level. Although heuristic information is estimated to determine a good visiting order of genotypes in this ACO system, it is not estimated to guide ants when inferring individual alleles.
In this paper, we propose an ACO-based method, named ACOHAP, to solve the HIPP problem for large data sets. The key idea is the construction of a binary tree structure which allows ants to travel and resolve conflated data of all haplotypes from site to site. Heuristic information for each individual allele can be effectively estimated to guide ants towards good solutions. This technique overcomes the limitation of the 2-level ACO method in estimating heuristic information.
The rest of the paper is organized as follows: The HIPP problem is presented in Sect. 2; the ACOHAP algorithm is described in Sect. 3; Sect. 4 analyzes the performance of different methods on both small and large data sets; conclusions are given in the last section.
2 The HIPP problem
We assume SNPs are biallelic and represent alleles by 0 or 1. A haplotype of $m$ sites is represented by a string $h = h_1 \ldots h_m$ of size $m$ where $h_i \in \{0, 1\}$. Consider an unordered haplotype pair $(h^a, h^b)$. Their corresponding genotype $g$ is represented by a string $g = g_1 \ldots g_m$ of size $m$ where $g_i \in \{0, 1, 2\}$ is the conflated data at site $i$. Here, $g_i$ is defined as follows:

$$
g_i =
\begin{cases}
h^a_i & \text{if } h^a_i = h^b_i \text{ (genotype } g \text{ is homozygous at site } i\text{)} \\
2 & \text{if } h^a_i \neq h^b_i \text{ (genotype } g \text{ is heterozygous at site } i\text{)}
\end{cases}
\tag{1}
$$

The haplotype pair $(h^a, h^b)$ is called a haplotype resolution of $g$, and we say the genotype $g$ is resolved by $(h^a, h^b)$. A given genotype $g$ has exactly $2^{k-1}$ different haplotype resolutions, where $k \geq 1$ is the number of sites at which $g$ is heterozygous. For example, genotype $g = 022$ can be resolved by the two different unordered haplotype pairs (000, 011) and (001, 010).
Let $G = \{g^1, \ldots, g^n\}$ be a set of $n$ genotypes at the $m$ loci under consideration. A haplotype set $H = \{h^1, \ldots, h^k\}$ is said to be a solution of $G$ if each genotype of $G$ is resolved by a pair of haplotypes in $H$.
Given a set of genotypes $G$, the HIPP problem is to find a solution $H$ with the minimum number of haplotypes (Gusfield 2003).
Consider an example of three genotypes $G = \{g^1, g^2, g^3\} = \{121, 002, 221\}$. The set $H = \{h^1, h^2, h^3, h^4\} = \{000, 001, 101, 111\}$ is an optimal solution of $G$ having four haplotypes, where $(h^3, h^4)$, $(h^1, h^2)$ and $(h^2, h^4)$ are haplotype resolutions of $g^1$, $g^2$, and $g^3$, respectively.
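The definitions of this section can be checked mechanically. The following is a minimal Python sketch, not part of ACOHAP itself: the function names `conflate`, `resolutions` and `is_solution` are illustrative assumptions, haplotypes and genotypes are encoded as strings over 0/1 and 0/1/2, and the enumeration in `resolutions` is exponential in the number of heterozygous sites, so it is only meant for small examples.

```python
from itertools import product


def conflate(ha, hb):
    """Genotype of the unordered haplotype pair (ha, hb), following Eq. (1)."""
    if len(ha) != len(hb):
        raise ValueError("haplotypes must have the same number of sites")
    return "".join(a if a == b else "2" for a, b in zip(ha, hb))


def resolutions(g):
    """All unordered haplotype pairs that resolve genotype g (brute force)."""
    het_sites = [i for i, x in enumerate(g) if x == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het_sites)):
        ha, hb = list(g), list(g)
        for i, b in zip(het_sites, bits):
            ha[i] = b
            hb[i] = "1" if b == "0" else "0"
        # frozenset makes the pair unordered, so (ha, hb) and (hb, ha) coincide
        pairs.add(frozenset(("".join(ha), "".join(hb))))
    return pairs


def is_solution(H, G):
    """True if every genotype in G is resolved by a pair of haplotypes in H."""
    hs = set(H)
    return all(any(pair <= hs for pair in resolutions(g)) for g in G)
```

For the example above, `is_solution(["000", "001", "101", "111"], ["121", "002", "221"])` holds, and `resolutions("022")` contains the two pairs listed in the text.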
3 Methods
3.1 Graph construction
This section describes a binary tree structure to represent all possible haplotypes with $m$ sites (see Fig. 1). The tree structure has the following properties:
– It is a full binary tree with $m + 1$ levels. The root is at level 0 and leaves are at level $m$.
– We denote by $v$, $v \in \{0, 1\}$, the label of a branch. A branch from an internal node $X$ to its left (right) child is labeled 0 (1) and called the 0-branch (1-branch).
– A node is labeled by the concatenated string of branch labels from the root to the node.
– The label of a node at level $i$ represents a haplotype with $i$ sites.
The binary tree has $2^m$ leaves, which represent the $2^m$ different possible haplotypes.
Fig. 1 The full binary tree of depth $m$ (the root is at level 0; leaves are at level $m$). Branches are labeled either 0 or 1. A node is labeled by the concatenated string of branch labels from the root to the node. The label of a node at level $i$ represents a haplotype with $i$ sites. Ants can travel from the root to the leaves of the tree to determine haplotypes. For example, two haplotypes $h^a = 001$ and $h^b = 101$ (bold paths) are a haplotype resolution of $g = 201$
In the algorithm proposed below, we think of an ant traveling from the root to the leaves of the tree to resolve each genotype $g$ into two haplotypes $h^a$ and $h^b$. At level $i - 1$, the ant determines allele $h^a_i$ by following either the 0-branch ($h^a_i = 0$) to the left child or the 1-branch ($h^a_i = 1$) to the right child. Specifically, if genotype $g$ is homozygous at site $i$, we assign $h^a_i = g_i$ and the ant follows the $g_i$-branch. Otherwise, the ant decides whether to follow the 0-branch or the 1-branch based on the pheromone trail and heuristic information. Note that the complementary haplotype $h^b$ can be determined from $h^a$ and $g$, and vice versa. Specifically,

$$
h^b_i =
\begin{cases}
h^a_i & \text{if } g_i \text{ is 0 or 1 (homozygous)} \\
1 - h^a_i & \text{if } g_i = 2 \text{ (heterozygous)}
\end{cases}
\tag{2}
$$

For example, genotype $g = 201$ can be resolved into two haplotypes $h^a = 001$ and $h^b = 101$ (bold paths in the tree in Fig. 1).
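The rule for the complementary haplotype can be written as a short function. This is an illustrative Python sketch under the same string encoding as the text; the name `complement` is an assumption introduced here, and the consistency check on homozygous sites is an added safeguard rather than part of the rule.

```python
def complement(ha, g):
    """Return h^b given h^a and genotype g, following the rule of Eq. (2)."""
    if len(ha) != len(g):
        raise ValueError("haplotype and genotype must have the same length")
    hb = []
    for allele, site in zip(ha, g):
        if site == "2":
            # heterozygous site: h^b carries the opposite allele
            hb.append("1" if allele == "0" else "0")
        elif allele == site:
            # homozygous site: both haplotypes carry the genotype's allele
            hb.append(allele)
        else:
            raise ValueError("h^a is inconsistent with g at a homozygous site")
    return "".join(hb)
```

With the example from the text, `complement("001", "201")` gives `"101"`, and applying the function to `"101"` gives back `"001"`.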
3.2 The HAPIN algorithm
We propose an ant-walking algorithm, named HAPIN, to determine a solution $H$ of $G$. The algorithm starts from the root of the tree with $n$ initially undetermined haplotype pairs $(h^{1a}, h^{1b}), \ldots, (h^{na}, h^{nb})$. To determine these haplotype pairs, we think of them as ants and allow them to follow branches of the tree from the root to the leaves in such a way that haplotype pair $(h^{sa}, h^{sb})$ forms a haplotype resolution of $g^s$ for each $s \in \{1, \ldots, n\}$. A leaf is active if it contains at least one haplotype. Labels of active leaves constitute a solution $H$ of $G$ (see Fig. 2).
Fig. 2 The full binary tree structure through which ants can travel and determine a solution $H$ of $G$. Consider an example of three genotypes $G = \{121, 002, 221\}$: haplotypes $h^{1a}, h^{1b}, h^{2a}, h^{2b}, h^{3a}, h^{3b}$ follow branches from the root to four leaves: 000, 001, 101, and 111. The set $H = \{000, 001, 101, 111\}$ consisting of the labels of these leaves is a solution of $G$
The HAPIN algorithm performs $m$ iterations. In the $i$th iteration, alleles at site $i$ for all haplotypes are inferred (all ants move from nodes at level $i - 1$ to nodes at level $i$). Let $H_o$ ($H_e$) be the list of haplotypes whose genotypes are homozygous (heterozygous) at site $i$. In the $i$th iteration, the HAPIN algorithm performs the following two steps: the homozygous step and the heterozygous step. The homozygous step determines alleles at site $i$ for all haplotypes of $H_o$. It is straightforward to determine these alleles. The heterozygous step determines alleles at site $i$ for all haplotypes of $H_e$. The determination of each allele is guided by both the pheromone trail and heuristic information. The details of the HAPIN algorithm are described in Algorithm 1.
Figure 2 illustrates an example of three genotypes $G = \{g^1, g^2, g^3\} = \{121, 002, 221\}$. Haplotypes $h^{1a}, h^{1b}, h^{2a}, h^{2b}, h^{3a}, h^{3b}$ follow branches to four leaves: 000 ($h^{2a}$), 001 ($h^{2b}, h^{3a}$), 101 ($h^{1a}$), and 111 ($h^{1b}, h^{3b}$). Thus, the haplotype set $H = \{000, 001, 101, 111\}$ is a solution of $G$.
3.2.1 Pheromone trail update
Pheromone trail update is a crucial step in ant colony optimization as it guides ants to find good solutions. The heterozygous step uses pheromone trail $\tau^s_{iv}$ as a guide to choose between alleles $h^{sa}_i$ and $h^{sb}_i$. We use the smooth max-min ant system method (Do et al. 2008), a refined version of the well-known pheromone boundary rule (Stützle and Hoos 2000), to update the pheromone trail. Let $H = \{h^1, \ldots, h^k\}$ be a solution which is used to update the pheromone trail. Consider genotype $g^s \in G$ resolved by haplotypes $h^{sa}, h^{sb} \in H$, the pheromone trail
Algorithm 1: HAPIN Algorithm
Input: A set of $n$ genotypes with $m$ sites $G = \{g^1, \ldots, g^n\}$ where $g^s = g^s_1 \ldots g^s_m$
Output: A haplotype solution $H = \{h^1, \ldots, h^k\}$ of $G$
Begin
  Set a list of $n$ initially undetermined haplotype pairs $(h^{1a}, h^{1b}), \ldots, (h^{na}, h^{nb})$ to the root of the tree
  for $i = 1 \to m$ do
    // In the $i$th iteration, alleles of all haplotypes at site $i$ are determined
    1. Homozygous step: Homozygous genotypes at site $i$ are processed in this step. Consider a genotype $g^s$ that is homozygous at site $i$; alleles $h^{sa}_i$ and $h^{sb}_i$ are simply assigned equal to $g^s_i$. Specifically, if $g^s_i = 0$, we assign $h^{sa}_i = h^{sb}_i = 0$ (haplotypes $h^{sa}$ and $h^{sb}$ follow the 0-branches to the left nodes). Otherwise, we assign $h^{sa}_i = h^{sb}_i = 1$ (haplotypes $h^{sa}$ and $h^{sb}$ follow the 1-branches to the right nodes)
    2. Heterozygous step: Heterozygous genotypes at site $i$ are processed in this step. Consider a genotype $g^s$ that is heterozygous at site $i$; allele $h^{sa}_i$ is assigned a value $v$, $v \in \{0, 1\}$. The probability $P^s_i(v)$ that $h^{sa}_i$ is assigned $v$ ($h^{sa}$ follows the $v$-branch; $h^{sb}$ follows the $(1 - v)$-branch) is defined as follows:

       $$
       P^s_i(v) = \frac{(\tau^s_{iv})^{\alpha} (\eta^s_{iv})^{\beta}}{(\tau^s_{i0})^{\alpha} (\eta^s_{i0})^{\beta} + (\tau^s_{i1})^{\alpha} (\eta^s_{i1})^{\beta}}
       \tag{3}
       $$

       in which $\alpha$, $\beta$ are the relative influence coefficients of the pheromone trail and of the heuristic information, respectively. These parameters are set as in Table 1
  Collect labels of all active leaves to form a haplotype solution $H$
End
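The branch choice in the heterozygous step of Algorithm 1 can be illustrated with a few lines of Python. This is a minimal sketch of Eq. (3) only: the function names, the pair layout `tau = (tau_0, tau_1)` and `eta = (eta_0, eta_1)`, and the `rng` argument are assumptions made for this example.

```python
import random


def branch_probability(v, tau, eta, alpha, beta):
    """P(v) from Eq. (3); tau and eta are pairs indexed by branch label 0/1."""
    weights = [(tau[u] ** alpha) * (eta[u] ** beta) for u in (0, 1)]
    total = weights[0] + weights[1]
    if total <= 0:
        raise ValueError("at least one branch must have positive weight")
    return weights[v] / total


def choose_branch(tau, eta, alpha, beta, rng=random):
    """Sample the allele for h^{sa}_i; h^{sb}_i then takes the other allele."""
    p0 = branch_probability(0, tau, eta, alpha, beta)
    return 0 if rng.random() < p0 else 1
```

For instance, with equal pheromone on both branches, `alpha = beta = 1` and heuristic values `(1.0, 3.0)`, the 0-branch is chosen with probability 0.25 and the 1-branch with probability 0.75.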
Table 1 The ParamILS program (Hutter et al. 2007) was used to determine good parameter values for ACOHAP. Value: the parameter values used in the tuning process (bold numbers are default values)
f    {0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0}    3.0    τmax/τmin = f × m × n
α                                                            Influence of the pheromone trail
β                                                            Influence of the heuristic information
ρ    {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}    0.3    Pheromone evaporation factor
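$\tau^s_{iv}$ ($i = 1, \ldots, m$; $v \in \{0, 1\}$) is updated as follows:

$$
\tau^s_{iv} = (1 - \rho)\,\tau^s_{iv} +
\begin{cases}
\rho\,\tau_{\min} & \text{if } h^{sa}_i \neq v \\
\rho\,\tau_{\max} & \text{if } h^{sa}_i = v
\end{cases}
\tag{4}
$$

in which $\rho \in (0, 1]$ is the evaporation parameter; $\tau_{\min}$ and $\tau_{\max}$ are the lower bound and upper bound of the pheromone trail, respectively. These parameters are set as in Table 1.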
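The update rule can be sketched as follows. This Python fragment is an illustration under assumed data structures, not the authors' implementation: `tau[s][i][v]` is a nested list holding the trail for genotype `s`, site `i` and branch `v`, and `ha_list[s]` is the haplotype $h^{sa}$ of the updating solution as a 0/1 string. The update pulls the trail on the branch taken by $h^{sa}$ towards $\tau_{\max}$ and the trail on the other branch towards $\tau_{\min}$.

```python
def update_pheromone(tau, ha_list, rho, tau_min, tau_max):
    """Apply the smoothed update to every (genotype, site, branch) in place."""
    for s, ha in enumerate(ha_list):
        for i, allele in enumerate(ha):
            chosen = int(allele)
            for v in (0, 1):
                # the branch taken by h^{sa} moves towards tau_max,
                # the other branch moves towards tau_min
                target = tau_max if v == chosen else tau_min
                tau[s][i][v] = (1 - rho) * tau[s][i][v] + rho * target
    return tau
```

Because each new value is a convex combination of the old value and one of the two bounds, a trail initialized inside $[\tau_{\min}, \tau_{\max}]$ stays inside that interval without explicit clamping.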
3.2.2 Heuristic information estimation
For a genotype $g^s$ that is heterozygous at site $i$, the heuristic information $\eta^s_{iv}$ is used as a guide to determine alleles $h^{sa}_i$ and $h^{sb}_i$. We denote by $H_i$ the list of haplotypes whose alleles at site $i$ are already determined. This haplotype list is used to estimate the heuristic information $\eta^s_{iv}$.
Two partially determined haplotypes $h$ and $h'$ are called compatible if they can follow branches to the same leaf. For example, the two haplotypes $h^{2b}$ and $h^{3a}$ in Fig. 2 are compatible because they follow branches to the same leaf 001.
We use the heuristic that haplotypes $h^{sa}$, $h^{sb}$ should follow branches such that they will be compatible with as many haplotypes of $H_i$ as possible. We denote by $c^{sa}_v$ the number of haplotypes of $H_i$ which are compatible with $h^{sa}$ when following the $v$-branch, $v \in \{0, 1\}$; $c^{sb}_v$ is defined analogously for $h^{sb}$. If $c^{sa}_v = 0$, $h^{sa}$ cannot travel with any haplotype of $H_i$ to the same leaf, that is, it will result in a new active leaf. If $h^{sa}$ follows the $v$-branch ($h^{sb}$ follows the $(1 - v)$-branch), the minimum number of new active leaves $t^s_v$ resulting from $h^{sa}$ and $h^{sb}$ can be determined as follows:

$$
t^s_v =
\begin{cases}
2 & \text{if } c^{sa}_v = 0 \ \&\ c^{sb}_{1-v} = 0 \\
0 & \text{if } c^{sa}_v \neq 0 \ \&\ c^{sb}_{1-v} \neq 0 \\
1 & \text{otherwise.}
\end{cases}
\tag{5}
$$
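A small Python sketch may help to make the counts $c^{sa}_v$, $c^{sb}_{1-v}$ and the value $t^s_v$ concrete. It rests on assumptions that go beyond the text: partially determined haplotypes are encoded as lists with `None` at undetermined sites, two haplotypes are treated as compatible when they agree at every site determined in both, and `H_i` is assumed not to contain the pair being resolved. All function names are illustrative.

```python
def compatible(h1, h2):
    """True if h1 and h2 agree wherever both are determined (None = open)."""
    return all(a is None or b is None or a == b for a, b in zip(h1, h2))


def count_compatible(h, i, v, H_i):
    """Number of haplotypes in H_i compatible with h once h[i] is set to v."""
    trial = list(h)
    trial[i] = v
    return sum(1 for other in H_i if compatible(trial, other))


def new_active_leaves(ha, hb, i, v, H_i):
    """t_v of Eq. (5): ha takes the v-branch, hb the (1 - v)-branch at site i."""
    c_a = count_compatible(ha, i, v, H_i)
    c_b = count_compatible(hb, i, 1 - v, H_i)
    if c_a == 0 and c_b == 0:
        return 2
    if c_a != 0 and c_b != 0:
        return 0
    return 1
```

As a usage example, take `ha = [0, None, 1]`, `hb = [1, None, 1]` and `H_i = [[0, 0, 1], [1, 1, 1]]` at site index 1: choosing `v = 0` lets both haplotypes join existing leaves (value 0), whereas `v = 1` would open two new leaves (value 2).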
Haplotypes $h^{sa}$ and $h^{sb}$ tend to follow branches which minimize the minimum number of new active leaves and maximize the number of compatible haplotypes. To this end, the heuristic information $\eta^s_{i0}$ and $\eta^s_{i1}$ are estimated as follows:

$$
(\eta^s_{i0}, \eta^s_{i1}) =
\begin{cases}
\ldots & \text{if } t^s_0 < t^s_1 \\
\ldots & \text{if } t^s_0 > t^s_1 \\
\big((c^{sa}_0 + c^{sb}_1 + 1)m,\ (c^{sa}_1 + c^{sb}_0 + 1)m\big) & \text{if } t^s_0 = t^s_1
\end{cases}
\tag{6}
$$
3.3 The ACOHAP algorithm
We propose an ACO algorithm, named ACOHAP, to search for a good solution $H$ of $G$. The key component of ACOHAP is the HAPIN algorithm which determines one solution at each run. Local searches are typically coupled with ACO algorithms to improve solutions. ACOHAP uses the so-called stochastic first improvement rule method (Di Gaspero and Roli 2008) to improve solutions generated by the HAPIN algorithm. This simple local search improves a solution by replacing two current haplotypes by a new one if possible. The improvement process is repeated until no further replacement is found.
The ACOHAP algorithm performs the HAPIN algorithm several times to find a good solution. Since the HIPP problem assumes that sites are independent, the order in which sites are inferred is randomly specified for each run of the HAPIN algorithm to enforce search space exploration. The ACOHAP algorithm is described in Algorithm 2.
The complexities of both the HAPIN algorithm and the stochastic first improvement rule algorithm are $O(n^2 m)$. Thus, the complexity of the inference step in the ACOHAP algorithm is $O(n^2 m)$ for each run. The overall complexity of the ACOHAP algorithm is
Algorithm 2: ACOHAP Algorithm
Data: A list of $n$ genotypes with $m$ sites $G = \{g^1, \ldots, g^n\}$ where $g^s = g^s_1 \ldots g^s_m$
Result: The best found solution $H = \{h^1, \ldots, h^k\}$
begin
  // Initialization step
  – Initialize the pheromone trail
  – Set the currently best solution $H_{best}$ = undefined
  – Set the number of loops $N_{loops} = 0$
  repeat
    $H_{local}$ = undefined
    for $p \leftarrow 1$ to $N_{ants}$ do
      // Inference step
      – Specify a random inferring order of sites $S = (s_1, \ldots, s_m)$
      – Perform the HAPIN algorithm to determine solution $H_p$ using $S$, i.e., alleles of all haplotypes at site $s_i$ are determined in the $i$th iteration
      – Perform the stochastic first improvement rule algorithm to improve the obtained solution $H_p$
      – If $H_p$ is better than $H_{local}$ then update $H_{local} = H_p$
    // Updating step
    – Use $H_{local}$ to update the pheromone trail
    – If $H_{local}$ is better than the currently best solution $H_{best}$ then update $H_{best} = H_{local}$
    – Increase the number of loops $N_{loops}$ by one
    // Restarting step
    – Reset the pheromone trail if no improvement is found after 30 consecutive iterations
  until (the running time exceeds a given time limit)
  // Return the best solution found $H_{best}$
  Return $H_{best}$
end
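The control flow of Algorithm 2 can be sketched in Python as a generic loop. This is a skeleton under stated assumptions, not the authors' code: `construct`, `improve`, `update` and `reset` are hypothetical callables standing in for the HAPIN run, the local search, the pheromone update and the pheromone reset; a solution is any collection whose length is the number of haplotypes; and "no improvement" is interpreted here as no improvement of the best solution found so far.

```python
import random
import time


def acohap_loop(m, n_ants, construct, improve, update, reset,
                time_limit, restart_after=30, rng=random):
    """Repeat construct/improve/update until the time limit is exceeded."""
    best = None
    stall = 0  # consecutive loops without improving the best solution
    deadline = time.monotonic() + time_limit
    while True:
        local = None
        for _ in range(n_ants):
            # inference step: random site order, construction, local search
            order = list(range(m))
            rng.shuffle(order)
            candidate = improve(construct(order))
            if local is None or len(candidate) < len(local):
                local = candidate
        # updating step
        update(local)
        if best is None or len(local) < len(best):
            best, stall = local, 0
        else:
            stall += 1
        # restarting step
        if stall >= restart_after:
            reset()
            stall = 0
        if time.monotonic() >= deadline:
            return best
```

Writing the termination test at the end of the body mirrors the repeat-until form of Algorithm 2, so at least one loop is always executed.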
4 Experimental results
We compared ACOHAP to the currently best methods, RPoly version 1.2.1 (Graça et al. 2008),¹ CollHaps (Tininini et al. 2010),² and PTG (Li et al. 2005)³ on both small and large data sets. All experiments were conducted on a PC cluster of 24 nodes (AMD 2.2 GHz, 48 GB RAM). RPoly is an exact method which uses pseudo-Boolean optimization techniques to find optimal solutions. The running time of RPoly was set to 100000 seconds
1 http://sat.inesc-id.pt/~assg/rpoly/.
2 http://www.iasi.cnr.it/~liuzzi/BIOCOMP/SNP/.
3 http://doc.aporc.org/wiki/PTG.
(∼28 h) for each problem instance. RPoly returns an approximate solution if it cannot find an optimal solution for a given problem instance after 100000 seconds.
CollHaps is a heuristic method which generalizes the well-known Clark's rule method to solve the HIPP problem. PTG is a very fast heuristic algorithm. Both CollHaps and ACOHAP require long running times to converge to optimal solutions. In our experiments, the running time limit for both ACOHAP and CollHaps was set to 1000 seconds. In addition to the time limit of 100000 seconds, we also tested the performance of RPoly with a time limit of 1000 seconds, called RPoly1000, for each problem instance.
We used the ParamILS program (Hutter et al. 2007) to determine good parameter settings for ACOHAP. Table 1 presents the parameters and their values as used in the tuning process. The parameter settings for ACOHAP were obtained using ParamILS on artificially generated benchmarks (SU1, SU2, SU3, and SU-100kb) from the International HapMap Consortium (Marchini et al. 2006). These parameters are given in Table 1 and are used to test the performance of ACOHAP on all data sets. Note that ACOHAP was applied only once to each problem instance.
Unfortunately, the software for the two-level ACO method (ACO-HI+) (Benedettini et al. 2008) is no longer available for testing (as communicated by the first author of Benedettini et al. 2008). However, we can compare ACOHAP with ACO-HI+ on 100 problem instances of the SU2 data set (Marchini et al. 2006) for which results obtained with ACO-HI+ are available (Benedettini et al. 2008).
4.1 Small data sets
We tested the different methods on nine small data sets including four artificially generated benchmarks and five biological data sets.
4.1.1 Artificially generated benchmarks
We used artificially generated benchmarks (SU1, SU2, SU3, SU-100kb)⁴ from the International HapMap Consortium (Marchini et al. 2006) to assess the performance of the different methods. SU1 was generated using a constant recombination rate across the whole region, a constant population size, and random mating. It contains 100 problem instances of 90 genotypes. SU2 was generalized from SU1 in the sense that the recombination rate varies across the region. SU3 is the same as SU2 except that the demography model is consistent with the white Americans model. The small data set SU-100kb contains 29 problem instances each consisting of 90 genotypes. These four data sets together contain 329 different problem instances. They are summarized in Table 2.
Results from RPoly, ACOHAP and CollHaps for the SU data sets are presented in Table 3. The PTG method is not applicable to multiple problem instances. Therefore, we cannot assess its performance for these 329 artificial problem instances. The sum of the objective function values of the solutions generated by ACOHAP for all 329 problem instances is 42305, which is smaller than those from RPoly (42823) and CollHaps (42852). ACOHAP found optimal solutions for 303 (92 %) out of 329 problem instances. RPoly found optimal solutions for these and 17 other problem instances (320 in total). However, the optimal solutions from RPoly for these 17 problem instances are only slightly better than those from ACOHAP. ACOHAP is much better than RPoly on the nine problem instances where RPoly could
4 http://www.stats.ox.ac.uk/~marchini/phaseoff.html.
Table 2 Artificially generated benchmarks (SU1, SU2, SU3, SU-100kb) from the International HapMap Consortium (Marchini et al. 2006). These four data sets contain 329 different problem instances each consisting of 90 genotypes
Data set   #Problem instances   #Genotypes (n)   Genotype length (m)
Table 3 Results obtained with RPoly, RPoly1000, ACOHAP and CollHaps for the SU data sets. #Opts: the number of optimal solutions; #Haps: the number of haplotypes
Table 4 Results obtained with ACOHAP and CollHaps on artificially generated benchmarks. ACOHAP < CollHaps: ACOHAP is better than CollHaps; ACOHAP = CollHaps: ACOHAP is as good as CollHaps; ACOHAP > CollHaps: ACOHAP is worse than CollHaps
Data set   #Problem instances (ACOHAP < CollHaps)   #Problem instances (ACOHAP = CollHaps)   #Problem instances (ACOHAP > CollHaps)
not find optimal solutions after 100000 seconds. The approximate solutions from RPoly for these nine problem instances are even worse than those from CollHaps. RPoly with a time limit of 1000 seconds (RPoly1000) found optimal solutions for 310 problem instances. Thus, it could not find optimal solutions for 10 problem instances whose optimal solutions were found by RPoly with a time limit of 100000 seconds.
The sum of the objective function values of the solutions generated by ACOHAP (42305) is about 1.3 % smaller than that of CollHaps (42852). ACOHAP is as good as CollHaps for the small data set SU-100kb. However, it is superior to CollHaps for larger data sets, that is, SU1, SU2, and SU3 (see Table 4 for more details). For example, ACOHAP outperforms CollHaps on 76 out of 100 problem instances of the SU1 data set and is equal to CollHaps on the 24 remaining problem instances. CollHaps shows no better results than ACOHAP for the SU1 and SU3 data sets. It is only better than ACOHAP on two out of 100 problem instances of SU2.
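As a quick check of the percentages quoted in this subsection, using only the totals reported above:

$$
\frac{42852 - 42305}{42852} = \frac{547}{42852} \approx 0.0128 \approx 1.3\,\%,
\qquad
\frac{303}{329} \approx 0.921 \approx 92\,\%.
$$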