Dynamic Programming Algorithms for Global Alignment

6.4.1 The Needleman-Wunsch Algorithm

We have already seen that pairwise sequence alignment is a complex problem with a huge number of possible solutions, even for moderately sized sequences. Since trying all possible solutions is not feasible, we need more intelligent approaches to the problem.

The main idea driving these algorithms is a divide-and-conquer approach, commonly used to address complex problems by sub-dividing them into simpler problems and combining their solutions. This is made possible by the additive nature of the scoring functions presented in the previous section, which allows a solution (an alignment) to be decomposed into its columns, with the overall score being the sum of the column scores.

Notice that the affine gap penalty model does not obey this principle and, therefore, the algorithms studied in these sections consider the simpler gap penalty based on a constant value per gap position (column). It is still possible to define similar algorithms for the affine gap penalty model, but these are more complex and will not be covered in this text.

Based on these observations, in the early days of Bioinformatics, different researchers proposed the use of dynamic programming (DP) algorithms to address biological sequence alignment. DP is a general-purpose class of optimization algorithms based on a divide-and-conquer approach, where optimal solutions for sub-problems and their scores are re-used (and not recomputed) when solving larger problems. This idea has been applied to sequence alignment by considering that the alignment of two sequences can be composed from alignments of sub-sequences of those sequences.

We will next detail these algorithms, starting with the Needleman-Wunsch (NW) algorithm for global sequence alignment, named after the two authors who proposed it.

Suppose we want to align two sequences A and B of sizes n and m, respectively. The individual symbols in A and B can be accessed by subscripting, using indexes, and thus we can represent $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$.

When executing the NW algorithm, we will build a matrix S, where the elements of sequence A index the rows and the elements of sequence B index the columns. We will add an extra row and an extra column at the beginning to represent alignments with gaps in all positions of one of the sequences.

Figure 6.4: Example of an S matrix for the Needleman-Wunsch algorithm.

One of the purposes of the algorithm is to fill this matrix S so that each element holds the score of the optimal alignment of the sub-sequences of A and B comprising all symbols from the beginning up to the character represented by the respective row (for A) and column (for B). Thus, the element $S_{i,j}$ indicates the optimal score for aligning the first i characters of A ($a_1 \ldots a_i$) with the first j characters of B ($b_1 \ldots b_j$). Notice that the rows and columns of S are indexed starting from 0, with the row and column of index 0 being the ones for gaps.

The structure of the S matrix can be visualized in Fig. 6.4 for a given example, considering the alignment of two amino acid sequences: A = PWSHG and B = HGWAG. Notice that the matrix S has n+1 rows and m+1 columns.

The main idea of this algorithm is that the matrix S can be filled cell by cell, using adjacent cells to reach the value of the target cell. This means using the previously calculated scores of smaller alignments to get the optimal score for the current alignment through composition. To fill S, we can use the following recurrence relation:

$$S_{i,j} = \max\left(S_{i-1,j-1} + sm(a_i, b_j),\; S_{i-1,j} + g,\; S_{i,j-1} + g\right), \quad \forall\, 0 < i \le n,\ 0 < j \le m \tag{6.4}$$

where $sm(c_1, c_2)$ gives the value of the substitution matrix for symbols $c_1$ and $c_2$, while g provides the penalty value for a gap, considering a constant penalty per gap position (with g < 0).

Notice that this recurrence expression defines the possible orders for calculating the values in S: to calculate $S_{i,j}$, the values of $S_{i-1,j-1}$, $S_{i-1,j}$ and $S_{i,j-1}$ must already be known. In practice, we fill the matrix by rows or by columns, going from left to right and from top to bottom, i.e. increasing the indexes.

This recurrence relation states that an alignment may always be obtained by the composition of other alignments. In particular, it builds an alignment by adding an extra column to a previous alignment and computing the new score by summing the previous score with the one from the added column.
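The composition idea can be illustrated with a short top-down sketch in which the score of each sub-problem is computed once and then re-used. This is not the matrix-based formulation developed in this section: the function name `nw_score_memo`, the simple match/mismatch scoring (used here instead of a substitution matrix) and the default parameter values are assumptions made only for illustration.

```python
from functools import lru_cache


def nw_score_memo(seq1, seq2, match=1, mismatch=-1, g=-2):
    """Optimal global alignment score via memoized recursion (illustrative)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best(i, j): optimal score for aligning seq1[:i] with seq2[:j]
        if i == 0:
            return j * g  # only gaps aligned against the first j symbols of seq2
        if j == 0:
            return i * g  # only gaps aligned against the first i symbols of seq1
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        return max(
            best(i - 1, j - 1) + sub,  # column with one symbol from each sequence
            best(i - 1, j) + g,        # symbol from seq1 aligned with a gap
            best(i, j - 1) + g,        # symbol from seq2 aligned with a gap
        )

    return best(len(seq1), len(seq2))


print(nw_score_memo("ACGT", "ACGT"))  # 4
print(nw_score_memo("AC", "A"))       # -1
```

Because the recursion depth grows with the sequence lengths, a sketch like this is only suitable for short sequences; filling the matrix iteratively avoids that limitation.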

There are three ways to add a column to a previous alignment: to add a column with the next symbols from each sequence, to add the next symbol from the first sequence and a gap, or to add the next symbol from the second sequence and a gap. These are the three hypotheses given in the recurrence from Eq. (6.4). In the first hypothesis, the score of the new column is obtained from the respective entry in the substitution matrix, while in the other two the score is given by g. The new scores are added to the previous ones, and the maximum is taken as the score of the new alignment and used to fill S in that position.

To fill S using this recurrence relation, we need to define how to initialize the matrix, filling the first row and the first column. In both cases, the idea is to multiply the number of columns in the alignment by g, since these alignments are composed only of gaps in one of the sequences. Thus, $S_{i,0} = i \cdot g, \forall\, 0 < i \le n$ and $S_{0,j} = j \cdot g, \forall\, 0 < j \le m$.
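The initialization can be sketched as follows. The function name `init_score_matrix` is hypothetical, and the matrix is pre-allocated with zeros as placeholders for the cells that the recurrence would fill later; this is just one possible way to lay out the data.

```python
def init_score_matrix(n, m, g):
    """Return an (n+1) x (m+1) matrix with the gaps' row and column filled."""
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        S[0][j] = j * g  # first j symbols of B aligned only with gaps
    for i in range(1, n + 1):
        S[i][0] = i * g  # first i symbols of A aligned only with gaps
    return S


S = init_score_matrix(5, 5, -8)
print(S[0])                   # [0, -8, -16, -24, -32, -40]
print([row[0] for row in S])  # [0, -8, -16, -24, -32, -40]
```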

Fig. 6.4 provides an example of the application of the algorithm to fill S. In this case, the substitution matrix used is the BLOSUM62 provided in Fig. 6.3 and g = −8.

Notice that, in the example from the figure, the gaps’ row and column are shaded in gray and are calculated as stated above. For the remaining cells, the recurrence relation from Eq. (6.4) is applied. An example is provided in the figure for the cell $S_{1,1}$.
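The calculation for this cell can be reproduced by hand. The sketch below assumes that the first symbols of the two sequences are P and H, as in the example, and uses the BLOSUM62 score for the pair (P, H), which is −2; the variable names are chosen only for this illustration.

```python
g = -8       # constant gap penalty used in the example
sm_ph = -2   # BLOSUM62 score for the pair (P, H)

# Neighbouring cells, all taken from the initialized gaps' row and column
s_diag = 0      # S[0][0]
s_up = 1 * g    # S[0][1]
s_left = 1 * g  # S[1][0]

candidates = {
    "diagonal": s_diag + sm_ph,  # P aligned with H
    "vertical": s_up + g,        # P aligned with a gap
    "horizontal": s_left + g,    # H aligned with a gap
}
s11 = max(candidates.values())
print(candidates)  # {'diagonal': -2, 'vertical': -16, 'horizontal': -16}
print(s11)         # -2
```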

From the definition put forward for the matrix S, it is easy to conclude that the overall score of the best alignment of the whole sequences is given in the lower right corner of the matrix (cell highlighted with dots in the figure).

The process explained above allows us to reach the score of the optimal alignment (indeed, of all optimal alignments of sub-sequences of the original sequences starting at the first position), but it is still not sufficient to obtain the best alignment itself. To be able to reconstruct the best alignment, we need to “memorize” the decisions taken every time the recurrence relation is applied. That is, we need to store which of the three possible hypotheses provided the highest score, and consequently which was the previous alignment.

This is normally achieved by encoding these choices using a three-symbol alphabet and building a trace-back matrix, typically denoted as T, with the same dimensions as the S matrix. Fig. 6.5 provides the T matrix for the previous example in a graphical way, using arrows that indicate whether the best previous alignment came from the diagonal (first hypothesis of the recurrence, with no gaps), from the previous row (vertical) or from the previous column (horizontal), the latter two being the cases with a gap in one of the sequences.

Figure 6.5: Example of the trace-back matrix for the Needleman-Wunsch algorithm in the example provided.
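A small sketch of how such a matrix could be rendered with arrows in text form is given below. It assumes the integer encoding adopted in the implementation section (1 for diagonal, 2 for vertical, 3 for horizontal, with 0 in the origin cell); the function name `render_traceback` and the hand-written matrix are hypothetical and do not correspond to the example in the figure.

```python
ARROWS = {0: ".", 1: "↖", 2: "↑", 3: "←"}


def render_traceback(T):
    """Return a text rendering of a trace-back matrix, one row per line."""
    return "\n".join(" ".join(ARROWS[code] for code in row) for row in T)


# Small hand-written trace-back matrix, used only to show the rendering
T_demo = [
    [0, 3, 3],
    [2, 1, 3],
    [2, 2, 1],
]
print(render_traceback(T_demo))
```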

Having the trace-back information available allows us to recover the best alignment by traversing the matrix. The process starts in the lower right corner of the matrix, follows the directions (arrows) from the T matrix, and terminates in the upper left corner (row and column with index zero). When moving through the matrix, the alignment is built in reverse order (from the last column to the first). If the move is diagonal, the column consists of the characters in the row and column of the original cell (first hypothesis of the recurrence, with no gaps). If the move is vertical, we only advance in the first sequence, with a gap in the second, while in a horizontal move the reverse happens.

Fig. 6.6 provides an illustration of this process for the previous example. The path over the matrix is highlighted and the final alignment is provided. As a validation, the optimal score is recalculated from the alignment, and we can check that it matches the one in the corner of the matrix.
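This validation step amounts to summing the scores of the alignment columns. A minimal sketch is shown below, assuming that the alignment is given as two strings of equal length, that gaps are written as "-", and that the substitution matrix is a dictionary keyed by two-character strings. The function name `alignment_score` and the toy matrix are hypothetical and are not the BLOSUM62 values.

```python
def alignment_score(row1, row2, sm, g):
    """Sum the column scores of an alignment given as two equal-length strings."""
    if len(row1) != len(row2):
        raise ValueError("both rows of the alignment must have the same length")
    total = 0
    for c1, c2 in zip(row1, row2):
        if c1 == "-" or c2 == "-":
            total += g            # column containing a gap
        else:
            total += sm[c1 + c2]  # column with two symbols
    return total


# Toy substitution matrix over the alphabet {A, C}, for illustration only
toy_sm = {"AA": 2, "AC": -1, "CA": -1, "CC": 2}
print(alignment_score("AC-", "A-C", toy_sm, -3))  # -4
print(alignment_score("AC", "AC", toy_sm, -3))    # 4
```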

6.4.2 Implementing the Needleman-Wunsch Algorithm

The algorithm explained in the previous section will be implemented in a set of Python functions provided below. The first function, given in the next block, implements the core algorithm, filling the matrices S and T by applying the recurrence defined in the previous section, as well as the initialization of the gaps’ row and column. In this case, we use integer values in the trace-back matrix representation, where the value 1 is used for diagonal, 2 for vertical and 3 for horizontal. The auxiliary function max3t provides the integer to fill T using this encoding scheme.

Figure 6.6: Example of the process of recovering the optimal alignment from the trace-back information in the Needleman-Wunsch algorithm, for the example provided above.

Note that, in this function, the matrices S and T are represented as lists of lists, and these lists are filled as the values are calculated, through the use of the append function. The function first initializes the gaps’ row and column, and then fills the remainder of the matrices by applying the recurrence relation. The result of the function is a tuple, with the matrix S in the first position and the matrix T in the second.

    def needleman_Wunsch(seq1, seq2, sm, g):
        S = [[0]]
        T = [[0]]
        ## initialize gaps' row
        for j in range(1, len(seq2) + 1):
            S[0].append(g * j)
            T[0].append(3)
        ## initialize gaps' column
        for i in range(1, len(seq1) + 1):
            S.append([g * i])
            T.append([2])
        ## apply the recurrence relation to fill the remaining of the matrix
        for i in range(0, len(seq1)):
            for j in range(len(seq2)):
                s1 = S[i][j] + score_pos(seq1[i], seq2[j], sm, g)  # diagonal
                s2 = S[i][j + 1] + g                               # vertical
                s3 = S[i + 1][j] + g                               # horizontal
                S[i + 1].append(max(s1, s2, s3))
                T[i + 1].append(max3t(s1, s2, s3))
        return (S, T)


    def max3t(v1, v2, v3):
        ## returns 1, 2 or 3 according to which argument is the largest
        if v1 > v2:
            if v1 > v3:
                return 1
            else:
                return 3
        else:
            if v2 > v3:
                return 2
            else:
                return 3
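The function above calls score_pos, which is not defined in this section. A minimal sketch of what such a function could look like is given below, assuming that the substitution matrix sm is a dictionary keyed by two-character strings and that gaps are written as "-". The helper `simple_submat` is hypothetical and only builds a toy match/mismatch matrix, so that the code can be tried without a BLOSUM62 file.

```python
def score_pos(c1, c2, sm, g):
    """Score of one alignment column (sketch, assuming sm is a dict)."""
    if c1 == "-" or c2 == "-":
        return g        # column containing a gap
    return sm[c1 + c2]  # column with two symbols


def simple_submat(alphabet, match, mismatch):
    """Toy substitution matrix: `match` on the diagonal, `mismatch` elsewhere."""
    return {
        a + b: (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }


toy_dna_sm = simple_submat("ACGT", 1, -1)
print(score_pos("A", "A", toy_dna_sm, -2))  # 1
print(score_pos("A", "C", toy_dna_sm, -2))  # -1
print(score_pos("A", "-", toy_dna_sm, -2))  # -2
```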

The next function takes the trace-back (T) matrix built by the previous function, together with the two sequences, and implements the process of recovering the optimal alignment. The algorithm starts in the bottom right corner of the matrix and uses the information in T to update the position and gather the alignment. A vertical (horizontal) cell in T leads to moving to the previous row (column), and adds to the alignment a column with a symbol from sequence A (B) in the respective row (column) and a gap in the other. A diagonal cell in T leads to moving to the previous row and column, and adds to the alignment a column with a symbol from sequence A and another from sequence B, in the respective row and column. The algorithm stops when the top left corner of the matrix is reached. Notice that the alignment is represented by two strings of the same size, and is built in this function from the end to the beginning, i.e. new columns are always added at the beginning.

    def recover_align(T, seq1, seq2):
        res = ["", ""]
        i = len(seq1)
        j = len(seq2)
        while i > 0 or j > 0:
            if T[i][j] == 1:
                res[0] = seq1[i - 1] + res[0]
                res[1] = seq2[j - 1] + res[1]
                i -= 1
                j -= 1
            elif T[i][j] == 3:
                res[0] = "-" + res[0]
                res[1] = seq2[j - 1] + res[1]
                j -= 1
            else:
                res[0] = seq1[i - 1] + res[0]
                res[1] = "-" + res[1]
                i -= 1
        return res

The following code block defines a testing function for the previous code, using the example of a protein sequence alignment from the previous section, aligning sequences “PHSWG” and “HGWAG”, using the BLOSUM62 substitution matrix and g = −8.

    def print_mat(mat):
        for i in range(0, len(mat)):
            print(mat[i])


    def test_global_alig():
        ## read_submat_file is not defined in this section
        sm = read_submat_file("blosum62.mat")
        seq1 = "PHSWG"
        seq2 = "HGWAG"
        res = needleman_Wunsch(seq1, seq2, sm, -8)
        S = res[0]
        T = res[1]
        print("Score of optimal alignment:", S[len(seq1)][len(seq2)])
        print_mat(S)
        print_mat(T)
        alig = recover_align(T, seq1, seq2)
        print(alig[0])
        print(alig[1])


    test_global_alig()
