6.5.1 The Smith-Waterman Algorithm
In the last section, we covered a dynamic programming algorithm for the global alignment of biological sequences, working with an additive score that takes into account a substitution matrix and constant gap penalties. Here, we will discuss the changes that we need to consider in the algorithm, for the case of local alignments.
First, we need to address the relevant changes in the problem definition. When performing a local alignment, we are interested in the best partial alignment of sub-sequences from the two sequences, i.e. the one that maximizes the objective function (score). Notice that each sub-sequence must include all characters of a sequence from a starting to an ending position. There are no changes in the way scores are calculated, but now alignments covering only parts of the sequences are acceptable.
Figure 6.7: Example of an S matrix for the Smith-Waterman algorithm.
This definition imposes a number of changes in the algorithm from the previous section, leading to the Smith-Waterman algorithm, which again takes its name from its original authors. The major change in the algorithm is the reformulation of the recurrence relation used in the DP.
In this case, if the best of the three alternatives considered before leads to a negative score, the best option at that position is to restart the alignment from there onward, ignoring the previous parts of the sequences. This implies adding 0 as an alternative in the recurrence relation, which becomes:
$$S_{i,j} = \max\left(S_{i-1,j-1} + sm(a_i, b_j),\; S_{i-1,j} + g,\; S_{i,j-1} + g,\; 0\right), \quad \forall\, 0 < i \le n,\; 0 < j \le m \tag{6.5}$$
Also, the initialization of the matrix (first row and column) is done by filling it with zeros: $S_{i,0} = 0, \forall\, 0 < i \le n$ and $S_{0,j} = 0, \forall\, 0 < j \le m$, since in these cases the best local alignment ignores the columns consisting only of gaps, which would reduce the score.
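To make the role of the extra 0 alternative concrete, the recurrence for a single cell can be sketched as a small Python function. This is only an illustration: the name local_cell and the numeric values used in the calls are hypothetical and are not taken from the figures.

```python
def local_cell(diag, up, left, sub, g):
    """Value of S[i][j] in the local alignment recurrence.

    diag, up, left: values of S[i-1][j-1], S[i-1][j] and S[i][j-1]
    sub: substitution score sm(a_i, b_j); g: (negative) gap penalty
    """
    return max(diag + sub, up + g, left + g, 0)


# All three alternatives are negative, so the alignment restarts with 0.
print(local_cell(0, 0, 0, sub=-2, g=-8))   # 0
# The diagonal alternative extends a previous partial alignment.
print(local_cell(8, 0, 0, sub=11, g=-8))   # 19
# Opening a gap from the cell above is the best alternative here.
print(local_cell(0, 19, 0, sub=-3, g=-8))  # 11
```

Without the final 0 in the call to max, the first call would return -2, which is the behavior of the global alignment recurrence.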
Fig. 6.7 shows the S matrix for an example, where the sequences are the same as those used in the previous section, but now the aim is to perform a local alignment. Two examples are shown for the application of the previous recurrence relation.
Note that in this case, the best alignment can end in any cell of the matrix, namely the one with the highest score, i.e. the maximum value in the S matrix. In the example, there are two alternative best alignments, both with a score of 19.
As before, to be able to recover the optimal alignment we need to keep trace-back information (the T matrix). In this case, this matrix can have four possible values: the three as before (diagonal, horizontal, and vertical) and an extra one corresponding to the case where the alignment is terminated (where the S matrix is filled with 0). The T matrix is graphically shown for the example in Fig. 6.8, as before by overlapping arrows. The cells where the alignment is terminated have no arrows.
Figure 6.8: Example of the process of gathering the trace-back information and recovering the optimal alignment for the Smith-Waterman algorithm, for the example provided above.
Also in the figure, we can check the process of recovering the alignment. In this case, this process starts in the cell with the highest score. Then, it proceeds as before, following the arrows, until the upper left corner is reached (meaning the sequences terminate) or a cell indicating that the alignment should terminate is reached (no arrow provided). In either case, the alignment is terminated. This process is shown by the cells highlighted in gray, for the two alternative optimal alignments.
6.5.2 Implementing the Smith-Waterman Algorithm
The Smith-Waterman algorithm was implemented using Python functions, as before. The first code block shows the core function that builds the S and T matrices, also returning the maximum score. The implementation is similar to the one for global alignments, with the changes explained above. In this case, a tuple with the S and T matrices is also returned, but with an extra element holding the score of the optimal alignment.
def smith_Waterman(seq1, seq2, sm, g):
    S = [[0]]
    T = [[0]]
    maxscore = 0
    # first row and first column are initialized with zeros
    for j in range(1, len(seq2) + 1):
        S[0].append(0)
        T[0].append(0)
    for i in range(1, len(seq1) + 1):
        S.append([0])
        T.append([0])
    # fill the remaining cells using the recurrence relation
    for i in range(0, len(seq1)):
        for j in range(len(seq2)):
            s1 = S[i][j] + score_pos(seq1[i], seq2[j], sm, g)  # diagonal
            s2 = S[i][j + 1] + g  # vertical
            s3 = S[i + 1][j] + g  # horizontal
            b = max(s1, s2, s3)
            if b <= 0:
                # no positive alternative: the alignment restarts here
                S[i + 1].append(0)
                T[i + 1].append(0)
            else:
                S[i + 1].append(b)
                T[i + 1].append(max3t(s1, s2, s3))
                if b > maxscore:
                    maxscore = b
    return (S, T, maxscore)
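The function smith_Waterman calls two helpers, score_pos and max3t, that come from the previous sections and are not repeated here. For readers who want to run the code in isolation, a minimal stand-in could look like the sketch below. It rests on assumptions that are not confirmed by this section: that the substitution matrix sm is a dictionary keyed by the two characters concatenated, and that ties between alternatives are broken as shown. The codes returned by max3t (1 for diagonal, 2 for vertical, 3 for horizontal) follow the way the T matrix is used by the functions in this section.

```python
def score_pos(c1, c2, sm, g):
    """Score of a single alignment column (stand-in, see assumptions above)."""
    # a column containing a gap symbol scores the gap penalty
    if c1 == "-" or c2 == "-":
        return g
    # assumption: sm maps a pair such as "AG" to its substitution score
    return sm[c1 + c2]


def max3t(v1, v2, v3):
    """Return 1, 2 or 3 according to which of v1, v2, v3 is the largest.

    1 = diagonal, 2 = vertical, 3 = horizontal. The tie-breaking used
    here is an assumption of this sketch.
    """
    if v1 > v2:
        return 1 if v1 > v3 else 3
    return 2 if v2 > v3 else 3


# hypothetical two-entry substitution matrix, for illustration only
toy_sm = {"AA": 4, "AG": 0}
print(score_pos("A", "A", toy_sm, -8))  # 4
print(score_pos("-", "A", toy_sm, -8))  # -8
print(max3t(5, 1, 2))                   # 1
```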
The following code shows the function used to recover the optimal alignment, given the S and T matrices computed by the previous function. The algorithm first finds the cell in the matrix with the highest score (using the auxiliary function max_mat), which is the starting point. The process of moving through the matrix and building the alignment is similar to the one implemented in function recover_align. The main difference lies in the termination criterion, which is, in this case, to find a cell where the T matrix has a value of 0.
Notice that this function only handles one optimal alignment, and therefore when multiple alignments exist with the same score (as is the case in the example from the previous section) only one is returned.
def recover_align_local(S, T, seq1, seq2):
    res = ["", ""]
    # start in the cell with the highest score
    i, j = max_mat(S)
    # follow the trace-back until a termination cell (T == 0) is found
    while T[i][j] > 0:
        if T[i][j] == 1:  # diagonal: align both characters
            res[0] = seq1[i - 1] + res[0]
            res[1] = seq2[j - 1] + res[1]
            i -= 1
            j -= 1
        elif T[i][j] == 3:  # horizontal: gap in seq1
            res[0] = "-" + res[0]
            res[1] = seq2[j - 1] + res[1]
            j -= 1
        elif T[i][j] == 2:  # vertical: gap in seq2
            res[0] = seq1[i - 1] + res[0]
            res[1] = "-" + res[1]
            i -= 1
    return res

def max_mat(mat):
    # returns the (row, column) of the first cell holding the maximum value
    maxval = mat[0][0]
    maxrow = 0
    maxcol = 0
    for i in range(0, len(mat)):
        for j in range(0, len(mat[i])):
            if mat[i][j] > maxval:
                maxval = mat[i][j]
                maxrow = i
                maxcol = j
    return (maxrow, maxcol)
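Since max_mat keeps only the first cell with the maximum value, alternative optimal alignments are lost. One possible way to locate all starting points is sketched below. The function name max_mat_all is hypothetical, and the small matrix is made up for illustration (it is not the matrix from Fig. 6.7).

```python
def max_mat_all(mat):
    """Return the list of (row, column) positions holding the maximum value."""
    maxval = max(max(row) for row in mat)
    return [(i, j)
            for i, row in enumerate(mat)
            for j, value in enumerate(row)
            if value == maxval]


# made-up score matrix in which the maximum appears twice
toy_S = [[0, 0, 0],
         [0, 19, 3],
         [0, 2, 19]]
print(max_mat_all(toy_S))  # [(1, 1), (2, 2)]
```

Each returned position could then serve as a separate starting cell for the trace-back, which would require a variant of recover_align_local that receives the starting cell as an argument. Note that if the maximum score is 0, every zero cell is returned, so callers may want to treat that case separately.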
An example of the use of these functions to align the protein sequences given in the example is provided next.
def test_local_alig():
    sm = read_submat_file("blosum62.mat")
    seq1 = "HGWAG"
    seq2 = "PHSWG"
    res = smith_Waterman(seq1, seq2, sm, -8)
    S = res[0]
    T = res[1]
    print("Score of optimal alignment:", res[2])
    print_mat(S)
    print_mat(T)
    alinL = recover_align_local(S, T, seq1, seq2)
    print(alinL[0])
    print(alinL[1])

test_local_alig()
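The test above depends on a substitution matrix file and on helper functions from earlier sections. For a quick check without those dependencies, the whole procedure can be condensed into one self-contained sketch. It is not the implementation from this section: it assumes a simple match/mismatch scoring scheme instead of BLOSUM62, and the function name local_align, the default parameter values and the DNA-like test sequences are all hypothetical choices made for illustration.

```python
def local_align(seq1, seq2, match=2, mismatch=-1, g=-2):
    """Return (score, aligned1, aligned2) for one optimal local alignment."""
    n, m = len(seq1), len(seq2)
    # S holds scores; T holds trace-back codes:
    # 0 = terminate, 1 = diagonal, 2 = vertical, 3 = horizontal
    S = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            options = [(S[i - 1][j - 1] + sub, 1),
                       (S[i - 1][j] + g, 2),
                       (S[i][j - 1] + g, 3)]
            # on ties, max keeps the first option (diagonal is preferred)
            score, move = max(options, key=lambda t: t[0])
            if score > 0:
                S[i][j], T[i][j] = score, move
                if score > best:
                    best, bi, bj = score, i, j
    # trace-back from the best cell until a termination cell is found
    a1, a2 = "", ""
    i, j = bi, bj
    while T[i][j] > 0:
        if T[i][j] == 1:
            a1, a2 = seq1[i - 1] + a1, seq2[j - 1] + a2
            i, j = i - 1, j - 1
        elif T[i][j] == 2:
            a1, a2 = seq1[i - 1] + a1, "-" + a2
            i -= 1
        else:
            a1, a2 = "-" + a1, seq2[j - 1] + a2
            j -= 1
    return best, a1, a2


print(local_align("TACGA", "GACGT"))  # (6, 'ACG', 'ACG')
```

With these toy sequences, the shared fragment ACG scores 6, and extending the alignment in either direction would add a mismatch, so the trace-back stops at the zero cell just before the fragment.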