Classes of Optimization Algorithms for Multiple Sequence Alignment


8.2.1 Dynamic Programming

For pairwise alignment, we have already seen that there are efficient algorithms, based on Dynamic Programming (DP), to solve those optimization problems, assuring optimal solutions considering the defined objective functions. So, an immediate question would be: can we generalize these algorithms for N sequences?

Figure 8.1: Example of the recurrence rule for a dynamic programming algorithm for multiple sequence alignment.

The answer is “Yes, but ...”. Indeed, we can generalize DP methods for MSA, but these are not efficient when the number of sequences to align grows. In terms of complexity, the pairwise alignment case has, as we saw in the previous chapter, a DP algorithm with quadratic complexity in the average length of the sequences to align. When generalizing for MSA, the complexity becomes exponential, with N as the exponent. For instance, with sequences of length around 300, pairwise alignment fills roughly 300² = 9 × 10⁴ cells, while aligning N = 10 such sequences would require on the order of 300¹⁰ ≈ 6 × 10²⁴ cells. Thus, it will only be possible to use this method for a small number of sequences, below the typical number required in biological research.

Indeed, if, in the pairwise case, we need to fill 2-dimensional matrices (of scores and trace-back), in MSA we would have to fill N-dimensional structures (hypercubes) to assure an optimal solution. We will show an example for the case where N = 3. In this case, the S and T matrices we need to fill in the DP algorithm (revisit Chapter 6 for the details) will be 3-dimensional. You can imagine filling a large cube with smaller cubes inside it, while the arrows for the trace-back would connect vertexes along faces and edges or cross through the smaller cubes (Fig. 8.1).

An example of the recurrence relation for the DP, in this case, is given by the following expression:

S_{i,j,k} = max( S_{i-1,j-1,k-1} + sm(a_i, b_j) + sm(a_i, c_k) + sm(b_j, c_k),
                 S_{i-1,j-1,k} + sm(a_i, b_j) + g,
                 S_{i-1,j,k-1} + sm(a_i, c_k) + g,
                 S_{i,j-1,k-1} + sm(b_j, c_k) + g,
                 S_{i-1,j,k} + 2g,
                 S_{i,j-1,k} + 2g,
                 S_{i,j,k-1} + 2g ),   ∀ 0 < i ≤ n, 0 < j ≤ m, 0 < k ≤ p

In the previous expression, we follow the nomenclature of Chapter 6, namely in the definition of the recurrence relation for the Needleman-Wunsch algorithm, and will consider the objective function to be given by the sum of pairs method. The sequences to align will be A = a_1a_2...a_n, B = b_1b_2...b_m and C = c_1c_2...c_p, and the gaps will be penalized by g (a constant penalty for each gap in the column).
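To make the recurrence concrete, the following is a minimal sketch, in Python, of the score phase of this three-sequence DP. It assumes a toy substitution function sm (1 for a match, 0 otherwise, standing in for a real substitution matrix) and the constant penalty g per gap in a column defined above; the trace-back cube T, analogous to the pairwise case, is left out.

def sm(x, y, match=1, mismatch=0):
    # toy substitution score: 1 for a match, 0 for a mismatch
    return match if x == y else mismatch

def msa3_score(A, B, C, g=-1):
    """Score phase of the 3-sequence DP (sum of pairs, constant penalty g
    per gap in a column). Returns the 3D matrix S; the optimal score is
    S[len(A)][len(B)][len(C)]. The trace-back cube T is omitted."""
    n, m, p = len(A), len(B), len(C)
    S = [[[None] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    S[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(p + 1):
                if i == j == k == 0:
                    continue
                cands = []
                if i > 0 and j > 0 and k > 0:   # no gap in the column
                    cands.append(S[i-1][j-1][k-1] + sm(A[i-1], B[j-1])
                                 + sm(A[i-1], C[k-1]) + sm(B[j-1], C[k-1]))
                if i > 0 and j > 0:             # gap in C
                    cands.append(S[i-1][j-1][k] + sm(A[i-1], B[j-1]) + g)
                if i > 0 and k > 0:             # gap in B
                    cands.append(S[i-1][j][k-1] + sm(A[i-1], C[k-1]) + g)
                if j > 0 and k > 0:             # gap in A
                    cands.append(S[i][j-1][k-1] + sm(B[j-1], C[k-1]) + g)
                if i > 0:                       # gaps in B and C
                    cands.append(S[i-1][j][k] + 2 * g)
                if j > 0:                       # gaps in A and C
                    cands.append(S[i][j-1][k] + 2 * g)
                if k > 0:                       # gaps in A and B
                    cands.append(S[i][j][k-1] + 2 * g)
                S[i][j][k] = max(cands)
    return S

# usage: optimal sum-of-pairs score for three short sequences
S = msa3_score("ATAGC", "AACC", "ATGC")
print(S[-1][-1][-1])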

There have been some attempts to reduce the computational cost of DP algorithms for MSA, by avoiding the need to try all possible paths in the N-dimensional hypercube. These approaches are based on the fact that the optimal MSA imposes a given pairwise alignment for each pair of sequences. This can be seen as a projected path over the 2-dimensional space of the pairwise alignment matrices.

For an optimal MSA, it is possible to calculate an upper bound for the cost of the projected path for a given pair of sequences. This limits the set of possible paths to consider in the pairwise alignment, which in turn can be used to limit the set of hypotheses to explore when searching for the MSA. Although this has brought significant improvements, the number of sequences that could be aligned was still around 10, thus not serving most practical purposes.

8.2.2 Heuristic Algorithms

An alternative to DP algorithms, which may be used when we need to scale MSA to a larger number of sequences (the most common practical scenario), is to use heuristic (also called approximation) algorithms, which cannot assure optimal solutions but have an acceptable complexity in terms of running time.

Broadly, heuristic algorithms for MSA can be classified as:

Progressive – start by aligning two sequences and then iteratively add the remaining sequences to the alignment;

Iterative – consider an initial alignment and then try to improve this alignment by moving, adding or removing gaps;

Hybrid – can combine different strategies, and use complementary information (e.g. protein structural information, libraries of good local alignments).

The general idea of progressive algorithms is to create an initial alignment of a core of the two most related sequences, adding, in subsequent iterations, increasingly distant sequences, until the final alignment with all sequences is attained.

Figure 8.2: Overall workflow of the CLUSTAL algorithm, representative of progressive MSA algorithms.

Although there are several variants of these algorithms, we will here detail the main steps of the CLUSTAL algorithm, a classical MSA method that, although now discontinued, has given rise to a successful family of algorithms, namely CLUSTALW and, more recently, Clustal Omega, currently widely used. An overview of the main steps of CLUSTAL is provided in Fig. 8.2.

The first step of the algorithm is to calculate the pairwise alignments of all pairs of sequences in the set. From these alignments, a matrix is built containing the similarity between each pair of sequences, which serves as input to the algorithm that creates a guide tree. The methods to create this tree will be covered in more detail in the next chapter.
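As an illustration of this first step, the sketch below fills such a similarity matrix for a small set of sequences. The pairwise_score function here is a trivial placeholder (counting equal positions), standing in for a proper global alignment score such as the Needleman-Wunsch implementation of Chapter 6; the construction of the guide tree itself is left for the next chapter.

from itertools import combinations

def pairwise_score(s1, s2):
    # placeholder for a proper global alignment score, e.g. the
    # Needleman-Wunsch implementation of Chapter 6
    return sum(1 for x, y in zip(s1, s2) if x == y)

def similarity_matrix(seqs):
    """Symmetric matrix with the pairwise alignment scores of all pairs."""
    n = len(seqs)
    mat = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        mat[i][j] = mat[j][i] = pairwise_score(seqs[i], seqs[j])
    return mat

# usage
seqs = ["ATAGC", "ATACC", "TTAGC", "ATAGG"]
for row in similarity_matrix(seqs):
    print(row)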

The selection of the first two sequences is of foremost importance in these methods, as it will define the core of the alignment. From the previous matrix and guide tree, these will typically be the two sequences with the highest similarity, whose common ancestor is nearest the leaves of the tree. The order in which the remaining sequences are considered by the algorithm is defined by the guide tree (in the figure, the first two will be S1 and S2, followed by S3 and S4).

One important step of these algorithms is the way further sequences are added to an existing alignment. The usual way to address this step is to create, from the existing alignment, a summary of the content of each column. In the simplest version, one could represent each column by the most common character (consensus), but this would be overly simplistic. Instead, it is common to represent the relative frequencies of the different characters (this is normally called a profile and will be covered in more detail in Chapter 11).
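A minimal sketch of such a profile, computed simply as the relative frequency of each symbol (including the gap character) per column of an existing alignment, could look as follows.

def profile(alignment):
    """Relative frequency of each symbol (including the gap '-') in each
    column of an alignment, given as a list of equal-length gapped strings."""
    nrows, ncols = len(alignment), len(alignment[0])
    prof = []
    for col in range(ncols):
        counts = {}
        for seq in alignment:
            counts[seq[col]] = counts.get(seq[col], 0) + 1
        prof.append({sym: c / nrows for sym, c in counts.items()})
    return prof

# usage
aln = ["A-TGC",
       "AATGC",
       "A-TCC"]
for column in profile(aln):
    print(column)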

Thus, in each iteration, we need to align a sequence with a previous alignment, represented as a profile. This implies following an algorithm similar to the ones used in pairwise alignment, but in this case adjusted to align a sequence to a profile, which implies changing the way the score for a column is computed.

Figure 8.3: Example of the calculation of the score of a column when combining an alignment with a sequence. The selected columns are highlighted. SM stands for the substitution matrix used.

An example of this process in CLUSTAL is shown in Fig. 8.3. In this case, to define the match score (the diagonal move in the S/T matrices of the DP algorithm), all possible characters appearing in the profile column, weighted by their frequencies, are combined with the symbol of the sequence being added. Thus, the score will be a weighted average of the scores of the possible pairs.
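The match score against a profile column can thus be sketched as a weighted average, as below. Note that the toy substitution function and the treatment of gap symbols in the profile (here charged with the gap penalty g) are illustrative choices, not the exact scheme used by CLUSTAL.

def match_score(column_freqs, symbol, sm, g=-1):
    """Score of matching `symbol` (from the sequence being added) against one
    profile column: a weighted average of the substitution scores for the
    symbols present in the column, weighted by their relative frequencies.
    Gaps in the profile contribute the gap penalty g (one possible choice)."""
    score = 0.0
    for sym, freq in column_freqs.items():
        if sym == "-":
            score += freq * g
        else:
            score += freq * sm(sym, symbol)
    return score

# usage, with a toy substitution function and one column of the profile above
col = {"A": 0.5, "T": 0.25, "-": 0.25}
print(match_score(col, "A", lambda x, y: 1 if x == y else 0))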

In the step of adding new sequences to an alignment, the motto “once a gap, always a gap” prevails, since when a gap appears in an alignment it will be maintained throughout the next steps.

Notice that, in the more general case, we may need to join alignments from two profiles, corresponding to different branches of the guide tree, to build a new alignment. The process is easily thought of as a generalization of the method described above.

Within the CLUSTAL family, the CLUSTALW algorithm was quite popular for many years. Compared to the original version, it added a number of improvements, namely the adaptation of the parameters, both the substitution matrix used and the gap penalties, according to the stage of the alignment and to the residues present in given regions of the sequences.

The specific rules for these adaptations were derived from the observation of good and bad alignments, also taking into account protein structural information. As an example of such a rule, stretches of hydrophilic residues usually indicate loop or random coil structural regions, leading to a reduction of gap opening penalties in these regions.

On the other hand, regarding substitution matrices, these are varied along the alignment process, depending on the estimated divergence of the sequences being added to the alignment at each stage. Also, when calculating scores in the alignment of profiles, sequences are down-weighted if they are very similar to others in the set, and up-weighted if they are distant from the others.

More recently, the CLUSTAL family has switched to Clustal Omega. This algorithm greatly increased the efficiency of the method used to create the guide tree that determines the order of the sequences in the alignments. Thus, it can be used with a much larger number of sequences (the EBI site currently allows 4000, but the authors report testing with over 100,000).

Although very popular and widely used in practice, progressive alignments have some problems, mostly related to their greedy heuristic nature. Indeed, wrong decisions made in early steps of the alignment will not be corrected afterwards. The worst results of these algorithms occur when most of the sequences to align have low similarity.

An approach developed to counteract such problems has led to consistency-based methods, of which the most well known is probably T-coffee. The idea is to maximize the agreement of pairwise alignments and, thus, try to avoid errors in the progressive algorithms. T-coffee builds libraries of global and local pairwise alignments, which are used to calculate weights for each position pair in each pairwise alignment. These weights are then used in the progressive alignment as scores in the DP algorithm when adding new sequences to the alignment.

This method provides MSA solutions that are typically more accurate, but it suffers from some computational efficiency problems, not being suited for a large number of sequences when compared, for instance, with Clustal Omega.

Iterative algorithms can be an alternative. They start by generating an alignment using a given method, which may be a progressive alignment or any other approach. Then, they try to improve this alignment by making selected changes and evaluating the effect of those changes on the objective function.

One alternative to improve the alignments is to repeatedly realign sub-groups of sequences, or sub-groups of columns, by changing the position of gaps or adding/removing gaps. More sophisticated optimization meta-heuristics, such as genetic algorithms, have also been tried, with some success.
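The sketch below illustrates the simplest flavor of this idea: a hill-climbing loop that repeatedly moves one gap within one sequence of a candidate alignment and keeps the change only if the sum-of-pairs objective improves. It is only a schematic example of gap rearrangement, not the refinement strategy of any particular tool.

import random

def sp_score(alignment, g=-1, sm=lambda x, y: 1 if x == y else 0):
    """Objective used in this chapter: per column, substitution scores over
    all pairs of non-gap characters, plus g for each gap in the column."""
    total = 0.0
    for col in zip(*alignment):
        chars = [c for c in col if c != "-"]
        total += g * (len(col) - len(chars))
        for i in range(len(chars)):
            for j in range(i + 1, len(chars)):
                total += sm(chars[i], chars[j])
    return total

def refine(alignment, iters=1000, seed=0):
    """Hill-climbing refinement: move one gap of one sequence to a random
    position and keep the change only if the objective improves."""
    random.seed(seed)
    best, best_score = list(alignment), sp_score(alignment)
    for _ in range(iters):
        cand = list(best)
        s = random.randrange(len(cand))
        gaps = [p for p, c in enumerate(cand[s]) if c == "-"]
        if not gaps:
            continue
        seq = list(cand[s])
        seq.pop(random.choice(gaps))                     # remove one gap ...
        seq.insert(random.randrange(len(seq) + 1), "-")  # ... reinsert elsewhere
        cand[s] = "".join(seq)
        cand_score = sp_score(cand)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best, best_score

# usage
aln = ["A-TGC", "AAT-C", "ATG-C"]
print(refine(aln))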

Hybrid algorithms include more recent proposals such as MUSCLE, which combines progressive alignment based on profiles with techniques from iterative methods, in this case refining the guide tree to improve the resulting alignment. It is one of the more efficient methods currently available.

On the other hand, the MAFFT algorithm brought the application of Fast Fourier Transforms (FFT) to MSA algorithms, converting the sequences of amino acid symbols into sequences of volume and polarity values. FFTs allow homologous regions to be rapidly identified, enabling fast MSA algorithms that also combine progressive and iterative methods. Recent versions allow a lot of flexibility in their configuration, with different trade-offs between quality and speed.
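To give a flavor of how the FFT is used, the sketch below encodes two sequences with a single hypothetical numeric property per symbol (not the actual volume and polarity scales used by MAFFT) and computes, with numpy, their correlation at every relative offset; peaks in this correlation indicate candidate homologous regions that can then be aligned in detail.

import numpy as np

# hypothetical numeric property per symbol, centered around zero, used only
# for illustration (MAFFT encodes amino acids by normalized volume and polarity)
PROP = {"A": -0.3, "C": -0.1, "G": 0.1, "T": 0.3}

def encode(seq):
    return np.array([PROP[c] for c in seq])

def fft_correlation(s1, s2):
    """Correlation of the two encoded sequences for every relative offset,
    computed via FFT in O(L log L) instead of O(L^2)."""
    x, y = encode(s1), encode(s2)
    L = len(x) + len(y) - 1
    fx = np.fft.rfft(x, L)
    fy = np.fft.rfft(y, L)
    return np.fft.irfft(fx * np.conj(fy), L)

# usage: the offset with the highest correlation is a candidate location of a
# homologous segment; here the peak is at offset 2, where the shared GCCGTA
# segment of the two sequences lines up
corr = fft_correlation("ATGCCGTA", "GCCGTAAT")
print(int(np.argmax(corr)))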
