Classes of Algorithms for Phylogenetic Analysis

Algorithms for phylogenetic analysis can be broadly classified into three main classes, which differ mainly in the strategy used to compute the objective function, while also proposing alternative approaches to search the huge solution space of this problem. The main classes are the following:

Distance-based algorithms: methods based on the prior computation of a matrix of pairwise distances between the sequences (obtained from sequence alignment), seeking trees whose distances are consistent with those in the input matrix;

Maximum parsimony: methods that search for trees that minimize the number of mutations (at internal nodes of the tree) needed to explain the variability of the sequences;

Statistical/Bayesian methods: methods that define probabilistic models for the occurrence of different types of mutations, and use those models to score trees based on their probability (or likelihood), searching for the trees that most likely explain the sequences according to the assumed model.

In this book, we will emphasize the methods in the first group, while also providing an overview of the remaining classes. This will be done in the next subsections.

9.2.1 Distance-Based Methods

Distance-based methods for phylogeny inference rely on objective functions that measure the consistency of the distances between the leaves (representing sequences) in the tree with those obtained from sequence similarity (through sequence alignment). Thus, these methods try to adjust the structure of the tree and the lengths of the edges connecting the nodes to mimic the pairwise distances between the sequences.

The first step in distance-based methods is the definition of a matrix of distances between the sequences. In this case, distance is the inverse of similarity, and the methods used to calculate these distances consist of aligning the sequences and computing the distances based on this alignment.

One simple way of achieving this task is to calculate the percentage of columns where the alignment contains mismatches or gaps. Of course, more sophisticated distance functions may be defined using the set of parameters for pairwise alignment discussed in Chapter 6.
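As a simple illustration of this idea, the sketch below computes the fraction of columns with a mismatch or a gap for two already aligned sequences (the function name and the example sequences are illustrative):

```python
def p_distance(seq_a, seq_b):
    # seq_a and seq_b are rows of the same alignment, hence of equal length;
    # a column counts as different if the symbols mismatch or either one is a gap
    diffs = sum(1 for x, y in zip(seq_a, seq_b)
                if x != y or x == "-" or y == "-")
    return diffs / len(seq_a)

print(p_distance("ACGT-A", "ACGTTA"))  # 1 of 6 columns differs -> 0.1666...
```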

Given the matrix of distances, the objective function may be defined as an error function that measures the differences between the distances in the tree and the distances in the matrix, to be minimized. One common approach is to consider the sum of squared errors:

$$\mathrm{score}(T) = \sum_{i,j \in S} \left( d_{ij}(T) - D_{ij} \right)^2 \qquad (9.1)$$

where $S$ is the set of input sequences, $T$ represents the tree (the solution to be scored), $d_{ij}(T)$ represents the distance between the leaves representing sequences $i$ and $j$ in the tree $T$, and $D_{ij}$ represents the distance between sequences $i$ and $j$ in the input matrix $D$ obtained from sequence alignment.
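As a minimal sketch of this scoring function, assuming both the tree-induced distances and the input matrix are stored as dictionaries indexed by ordered pairs of sequence labels (an illustrative representation, not the book's):

```python
def sq_error_score(tree_dist, D, seqs):
    # sum of squared differences over all unordered pairs of sequences (Eq. 9.1)
    return sum((tree_dist[(i, j)] - D[(i, j)]) ** 2
               for idx, i in enumerate(seqs) for j in seqs[idx + 1:])

seqs = ["s1", "s2", "s3"]
D = {("s1", "s2"): 2.0, ("s1", "s3"): 4.0, ("s2", "s3"): 4.0}
tree_dist = {("s1", "s2"): 2.0, ("s1", "s3"): 3.0, ("s2", "s3"): 3.0}
print(sq_error_score(tree_dist, D, seqs))  # 0^2 + (-1)^2 + (-1)^2 = 2.0
```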

Note that the distance between two nodes ($u$ and $v$) in a rooted phylogenetic tree $T$ is related to the vertical distances traveled from the origin $u$ to the destination $v$. If $w$ is the nearest common ancestor of $u$ and $v$, the distance between $u$ and $v$ is given by the sum $d_{uw} + d_{wv}$. In the tree from Fig. 9.1, for instance, traveling between the leaves representing sequences 1 and 4 implies going from sequence 1 up to the common ancestor of sequences 1 to 4, and traveling back down to sequence 4. Both of these distances are computed as differences of node heights: $d_{uw} = h(w) - h(u)$ and $d_{wv} = h(w) - h(v)$, where $h(x)$ denotes the height of node $x$ (note that $h(w)$ is larger than $h(u)$ and $h(v)$, since $w$ is a common ancestor of $u$ and $v$).

If we assume the molecular clock hypothesis, which states that the mutation rate in all branches of the tree is uniform, then the distance between every leaf and the root is the same (as is the case with the tree in the figure). In this case, the tree is ultrametric and the height of each leaf can be defined as 0. Therefore, in the previous expression, we would have $d_{uv} = 2 \times h(w)$.
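As a minimal sketch of this height-based computation (node names and heights below are purely illustrative), the distance between two leaves can be obtained from a dictionary of node heights and their nearest common ancestor:

```python
def leaf_distance(heights, u, v, w):
    # d(u, v) = (h(w) - h(u)) + (h(w) - h(v)), with w the nearest common ancestor;
    # if the tree is ultrametric and leaves have height 0, this is simply 2 * h(w)
    return (heights[w] - heights[u]) + (heights[w] - heights[v])

heights = {"seq1": 0.0, "seq4": 0.0, "anc": 3.0}      # hypothetical heights
print(leaf_distance(heights, "seq1", "seq4", "anc"))  # 6.0 = 2 * h(anc)
```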

Given an objective function based on these general principles, a number of methods have been proposed. Unfortunately, when the number of sequences grows, the solution space grows very rapidly, and the problem has been proven to be NP-hard. As such, there are no algorithms that can efficiently provide guaranteed optimal solutions for reasonable dimensions (number of sequences).

So, most algorithms used in practice are heuristic, providing reasonable solutions for most problem instances. One of the simplest and most popular methods is the Unweighted Pair Group Method Using Arithmetic Averages (UPGMA), which is based on agglomerative hierarchical clustering algorithms.

This algorithm first considers each sequence (tree leaf) to be in its own cluster, placed at height zero in the tree. The algorithm starts by taking the pair of closest sequences/clusters (i.e. the minimum value in the matrix $D$) and joining these sequences, creating an internal node whose height in the tree is half of the distance between these sequences.

In the next iteration, those sequences will act as a single cluster (merging the two clusters containing each sequence), with the distances to the remaining sequences/clusters calculated as the average of the distances. This leads to an update of $D$ in which the rows and columns of the joined sequences disappear and a new row and column appear for the new cluster.

In subsequent iterations, the algorithm proceeds by finding the pair of clusters at minimum distance and repeating the same steps: joining the clusters, adding an internal node to the tree at the corresponding height, and updating the distance matrix $D$. The algorithm stops when all sequences are within a single cluster, which corresponds to the root node of the tree.

An example of the application of the UPGMA algorithm to a set of five sequences is shown in Fig. 9.2. The input distance matrix is shown in the upper left corner. In the first step, the clusters representing sequences 1 and 2 are merged, with the respective internal node in the tree placed at height $h = 1$. The matrix $D$ is updated by removing the rows and columns for sequences 1 and 2 and creating new ones for the new cluster. The distances from this cluster to the remaining ones are calculated as the mean of the distances of the sequences in the cluster (sequences 1 and 2, in this case) to the remaining ones.

In the following step, the clusters representing sequences 4 and 5 are merged, and the algorithm proceeds in a manner similar to the previous step. The third step merges the cluster of sequences 1 and 2 with the one containing sequence 3, while the final step merges the two remaining clusters.

Figure 9.2: Example of the application of UPGMA to a set of five sequences, showing, for the different steps of the algorithm, the state of: (A) the distance matrices $D$; (B) the clusters created; (C) the evolutionary tree.

As a general rule, in UPGMA, the distance between two clusters $A$ and $B$ is calculated by averaging the distances of all pairs made from elements of both sets:

$$\frac{1}{|A| \cdot |B|} \sum_{i \in A} \sum_{j \in B} D_{ij} \qquad (9.2)$$

Within the execution of the algorithm, clusters are merged. If, in a given iteration, clusters $A$ and $B$ are joined to obtain a new cluster ($AB$), the distances from the new cluster to each cluster $X$ can be calculated by a weighted average of the distances already calculated in the matrix:

$$D(AB, X) = \frac{|A| \cdot D(A, X) + |B| \cdot D(B, X)}{|A| + |B|} \qquad (9.3)$$

where $D(X, Y)$ here represents the distance of clusters $X$ and $Y$ in $D$.
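Putting these pieces together, a minimal UPGMA sketch might look as follows; the distance matrix is stored as a dictionary indexed by frozensets of cluster labels and updated with Eq. (9.3) at each merge (this representation and the label scheme are illustrative assumptions, not the book's implementation):

```python
import itertools

def upgma(D, names):
    # each cluster label maps to (subtree, height, set of member sequences)
    clusters = {name: (name, 0.0, frozenset([name])) for name in names}
    while len(clusters) > 1:
        # pick the pair of clusters at minimum distance in D
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: D[frozenset(pair)])
        dist = D[frozenset((a, b))]
        tree_a, _, members_a = clusters.pop(a)
        tree_b, _, members_b = clusters.pop(b)
        new = a + "+" + b
        # distances from the new cluster to the others: weighted average (Eq. 9.3)
        for x in clusters:
            D[frozenset((new, x))] = (
                len(members_a) * D[frozenset((a, x))] +
                len(members_b) * D[frozenset((b, x))]
            ) / (len(members_a) + len(members_b))
        # the new internal node sits at half the distance between the merged pair
        clusters[new] = ((tree_a, tree_b, dist / 2), dist / 2,
                         members_a | members_b)
    return next(iter(clusters.values()))[0]

# hypothetical three-sequence example (distances are illustrative)
D = {frozenset(("1", "2")): 2.0, frozenset(("1", "3")): 6.0,
     frozenset(("2", "3")): 6.0}
print(upgma(D, ["1", "2", "3"]))  # ('3', ('1', '2', 1.0), 3.0)
```

In the example, sequences 1 and 2 are joined first, at height 1.0, and the remaining sequence is added at height 3.0.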

An alternative is the WPGMA algorithm (Weighted Pair Group Method with Arithmetic Mean), where the distances from the new cluster to the existing ones are calculated as the simple arithmetic mean of the distances of the joined clusters:

$$D(AB, X) = \frac{D(A, X) + D(B, X)}{2} \qquad (9.4)$$
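The practical difference between the two update rules can be seen with a small illustrative sketch: UPGMA (Eq. (9.3)) weights each previous distance by the size of its cluster, while WPGMA (Eq. (9.4)) weights both equally.

```python
def upgma_update(d_ax, d_bx, size_a, size_b):
    # Eq. 9.3: size-weighted average of the previously computed distances
    return (size_a * d_ax + size_b * d_bx) / (size_a + size_b)

def wpgma_update(d_ax, d_bx):
    # Eq. 9.4: plain arithmetic mean, regardless of cluster sizes
    return (d_ax + d_bx) / 2.0

print(upgma_update(4.0, 8.0, 3, 1))  # 5.0 -> the larger cluster dominates
print(wpgma_update(4.0, 8.0))        # 6.0 -> both clusters count equally
```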

The UPGMA algorithm is quite simple and has gained popularity in the community, but suffers from many limitations. One of the assumptions of the algorithm is that the mutation rate is uniform and, thus, the trees are ultrametric. If the input distance matrix is ultrametric, the algorithm returns the optimal solution. However, this is an assumption that is rarely true in practice and which leads to erroneous results in many situations.

An alternative method is Neighbor Joining (NJ), originally created to infer unrooted trees, which does not assume constant mutation rates across the different lineages. Note that, although it creates unrooted trees, the result of the NJ algorithm is in some cases given as a rooted tree, obtained by adding a root. There are different ways to do so, the most common being to place the root at the midpoint of the longest path connecting two leaves in the tree.

The main difference of NJ, when compared to UPGMA, is that the criterion used to select the clusters to merge in each step does not only take into account the distance between the clusters, but also seeks cluster pairs whose nodes are far apart from all the others. To achieve this, the original distance matrix $D$ is pre-processed to create a new matrix $Q$, where each cell is calculated as:

$$Q_{ij} = (n-2) D_{ij} - \sum_{k=1}^{n} D_{ik} - \sum_{k=1}^{n} D_{jk} \qquad (9.5)$$

It is this matrix $Q$ that is used to select the nearest clusters in each step, as is done with $D$ in UPGMA. Notice that this implies that the clusters to be merged (or the nodes to be joined by edges) are selected based on a trade-off, seeking pairs of nodes that have a short distance between themselves (first term of the previous expression) and large distances to the other clusters (measured by the last two sums in the previous expression).

In each step of the algorithm, the matrix $D$ is recalculated. Taking $u$ as the new cluster (new node in the tree) created by joining nodes $a$ and $b$, the distances to each of the other clusters/nodes (represented as $i$) are calculated as:

$$D_{ui} = \frac{1}{2} \left( D_{ai} + D_{bi} - D_{ab} \right) \qquad (9.6)$$

From this new matrix $D$, the matrix $Q$ is recalculated by applying the expression given above, and the algorithm proceeds by choosing its minimum value to select the clusters to merge (nodes to join).
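A minimal sketch of these two NJ steps, reusing the dictionary-based distance matrix of the earlier UPGMA sketch (an illustrative representation), could be:

```python
import itertools

def nj_q_matrix(D, labels):
    # Eq. 9.5: Q_ij = (n - 2) * D_ij - sum_k D_ik - sum_k D_jk
    n = len(labels)
    row_sum = {i: sum(D[frozenset((i, k))] for k in labels if k != i)
               for i in labels}
    return {(i, j): (n - 2) * D[frozenset((i, j))] - row_sum[i] - row_sum[j]
            for i, j in itertools.combinations(labels, 2)}

def nj_new_distances(D, labels, a, b):
    # Eq. 9.6: distances from the new node u (joining a and b) to every other node i
    return {i: 0.5 * (D[frozenset((a, i))] + D[frozenset((b, i))]
                      - D[frozenset((a, b))])
            for i in labels if i not in (a, b)}
```

In a full NJ run, the pair minimizing $Q$ is joined, the distances to the new node are obtained with Eq. (9.6), and $Q$ is recomputed from the reduced matrix until the tree is fully resolved.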

9.2.2 Maximum Parsimony

Maximum parsimony methods define the objective function of the phylogenetic inference problem by estimating the number of mutations implied by the tree to explain the input sequences. Thus, in principle, shorter trees that explain the data are preferred, following Occam's razor principle, which states a preference for simpler models.

These methods are typically based on a previously computed multiple sequence alignment of the input sequences, which allows the identification of a set of columns in the alignment considered informative about the possible phylogeny, namely columns where there are mutations (gaps or mismatches), and thus variations across the different sequences. From this information, these methods try to identify the tree that requires the minimum number of mutations to explain the variations in the sequences.
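For a fixed tree, the minimum number of mutations needed to explain a single alignment column can be counted with Fitch's small-parsimony algorithm, sketched below (the nested-tuple tree and the example column are illustrative choices, not taken from the text):

```python
def fitch_count(tree, column):
    # tree: nested tuples of leaf names, e.g. (("1", "2"), ("3", ("4", "5")))
    # column: dict mapping each leaf name to its symbol in this alignment column
    changes = 0

    def post_order(node):
        nonlocal changes
        if isinstance(node, str):      # a leaf: its state set is the observed symbol
            return {column[node]}
        left, right = node
        s_left, s_right = post_order(left), post_order(right)
        common = s_left & s_right
        if common:                     # children can agree: keep the intersection
            return common
        changes += 1                   # otherwise one extra mutation is implied
        return s_left | s_right

    post_order(tree)
    return changes

tree = (("1", "2"), ("3", ("4", "5")))
column = {"1": "A", "2": "A", "3": "C", "4": "C", "5": "G"}
print(fitch_count(tree, column))  # 2 mutations suffice for this column
```

Summing these counts over the informative columns gives the parsimony score of a candidate tree; the hard part of the problem is then the search over the space of possible trees.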

The simplest way of identifying the most parsimonious tree is the enumeration of all possible solutions and their costs. However, this is again an NP-hard problem, which makes this strategy non-viable in practice, being feasible only for a small number of sequences (typically fewer than 10).

An alternative that has been proposed for this problem is the use of branch-and-bound methods (we will cover these in more detail in the context of motif discovery in Chapter 10).

In this case, an exact solution will still be found, but without having to test all possible trees, since some areas of the search space can be discarded once it is guaranteed that they do not contain the optimal solution. In practice, this allows the number of input sequences to be increased, still guaranteeing optimal solutions, up to about 20 sequences.

In other cases, there is the need to resort to heuristic methods, which allow a larger number of sequences to be considered. Among these heuristics, one may find those based on tree rearrangement operations, such as nearest-neighbor interchanges, subtree pruning and regrafting, and tree bisection and reconnection, as well as meta-heuristics such as genetic algorithms or simulated annealing.

Overall, these methods have the advantage of making it easier to identify relationships between tree branches and sequence mutations. However, they show more limitations when the sequences are distant and require deeper phylogenies.

9.2.3 Statistical Methods

Maximum likelihood methods rely on a statistical model of the occurrence of mutations in the DNA. This model is used to estimate the likelihood of possible trees, by multiplying the estimated probabilities of the mutation events implied by the tree. Models of DNA evolution include, among others, Kimura's two-parameter model, the Jukes-Cantor model, and the Tamura-Nei model.
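As an illustration of the kind of per-branch terms that are multiplied when scoring a tree, the sketch below gives the Jukes-Cantor substitution probabilities for a single branch; taking mu as the overall substitution rate per site is an assumption about the parameterization.

```python
import math

def jukes_cantor_prob(t, mu, same):
    # probability that a base is unchanged (same=True) or has become one
    # particular other base (same=False) after time t under the Jukes-Cantor model
    decay = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

# on a short branch, staying in the same base is far more likely than changing
print(jukes_cantor_prob(0.1, 1.0, True))   # ~0.906
print(jukes_cantor_prob(0.1, 1.0, False))  # ~0.031
```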

This class of methods is typically quite computationally intensive, the underlying problem also being NP-hard. The pruning algorithm, a variant of dynamic programming, can be used to reduce the complexity by computing the likelihood of sub-trees, thus decomposing the overall problem into smaller sub-problems.

Maximum likelihood methods offer great flexibility, given the plethora of possible mutation models that can be considered. One advantage is that they allow statistical flexibility by permitting varying rates of evolution across sites and lineages.

Bayesian inference methods are an alternative that bears some similarity to the previous ones. Bayesian methods are based on a prior probability distribution over the possible phylogenetic trees. The search methods generally use variants of Markov chain Monte Carlo algorithms, including different sets of local move operators that apply changes to the trees.
