Pairwise Sequence Alignment in BioPython

The BioPython framework includes a specific module, named Bio.pairwise2, that provides an implementation of dynamic programming algorithms similar to those put forward in this chapter. We will provide here a few examples of the use of this module, while the full documentation may be found at http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html.

When performing alignments using this module, several functions are available, whose names start with either “global” or “local”, depending on the type of alignment. The name of each function ends with a two-character code that indicates the parameters it takes: the first character indicates the parameters for matches/mismatches or a substitution matrix, and the second indicates the parameters for gap penalties.

Regarding the first character, if it takes the value “x”, the alignment considers the match score to be 1 and the mismatch score to be 0. If “m” is provided, the function takes a match score and a mismatch score as parameters. On the other hand, if “d” is selected, a dictionary defining a full substitution matrix may be passed to the function.

Regarding the second character, “x” is used when no gap penalties are imposed (g = 0), while “s” is used to define an affine gap penalty model, with different penalties for gap opening and gap extension (or constant gap penalties, by setting these two values to be the same).
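As a small illustration of this naming scheme, the sketch below decodes a function name into its parts. The helper describe_alignment_function and the two dictionaries are hypothetical names introduced here for illustration; they are not part of BioPython, and only the codes described above are covered.

```python
# Hypothetical helper, not part of BioPython: it only restates the naming
# convention described in the text for the codes "x", "m", "d" and "x", "s".
MATCH_CODES = {
    "x": "match scores 1, mismatch scores 0",
    "m": "match and mismatch scores passed as arguments",
    "d": "substitution matrix passed as a dictionary",
}
GAP_CODES = {
    "x": "no gap penalties",
    "s": "gap opening and extension penalties passed as arguments",
}


def describe_alignment_function(name):
    """Return (alignment type, match scheme, gap scheme) for a function name."""
    for kind in ("global", "local"):
        if name.startswith(kind) and len(name) == len(kind) + 2:
            match_code, gap_code = name[-2], name[-1]
            if match_code in MATCH_CODES and gap_code in GAP_CODES:
                return kind, MATCH_CODES[match_code], GAP_CODES[gap_code]
    raise ValueError(f"unrecognized function name: {name}")


for fname in ("globalxx", "globalds", "localms"):
    print(fname, "->", describe_alignment_function(fname))
```

Reading the output for globalds, for instance, shows a global alignment scored with a substitution matrix and with gap opening and extension penalties given as arguments.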

The following example shows how to perform an alignment of two DNA sequences, using a match score of 1, while the mismatch score and gap penalties are both 0. The code prints the number of alternative optimal alignments and the alignments themselves with the scores.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# global alignment: match = 1, mismatch = 0, no gap penalties
alignments = pairwise2.align.globalxx("ATAGAGAATAG", "ATGGCAGATAGA")
# number of alternative optimal alignments
print(len(alignments))
# print each alignment together with its score
for a in alignments:
    print(format_alignment(*a))
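To see what score this parameter setting produces, the following is a minimal pure-Python sketch of the dynamic programming recurrence with a match score of 1 and with mismatch and gap penalties of 0. Under these parameters the optimal global score equals the length of the longest common subsequence of the two sequences. The function name global_score_xx is an assumption made for this illustration; it computes only the score, not the alignments, and is not the BioPython implementation.

```python
def global_score_xx(s1, s2):
    """Optimal global score with match = 1, mismatch = 0 and gap penalty = 0."""
    # prev holds the previous row of the score matrix; row 0 is all zeros
    # because leading gaps cost nothing under these parameters
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            diag = prev[j - 1] + (1 if c1 == c2 else 0)  # match or mismatch
            up = prev[j]                                 # gap in s2
            left = curr[j - 1]                           # gap in s1
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_score_xx("ATAGAGAATAG", "ATGGCAGATAGA"))
```

Only two rows of the matrix are kept in memory, which is enough when the trace-back is not needed.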

The next example shows how to align two protein sequences, using the BLOSUM62 substitution matrix, an opening gap penalty of −4 and an extension penalty of −1.

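from Bio import pairwise2
from Bio.pairwise2 import format_alignment
# Note: recent Biopython releases no longer ship Bio.SubsMat; there,
# substitution_matrices.load("BLOSUM62") from Bio.Align plays the same role.
from Bio.SubsMat import MatrixInfo

# BLOSUM62 substitution matrix
matrix = MatrixInfo.blosum62
# global alignment: gap opening penalty -4, gap extension penalty -1
for a in pairwise2.align.globalds("KEVLA", "EVSAW", matrix, -4, -1):
    print(format_alignment(*a))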
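The effect of the two gap parameters can be made concrete with a short sketch of an affine gap cost. The function affine_gap_penalty is a hypothetical name, and the sketch assumes the convention in which the first position of a gap is charged the opening penalty and each further position is charged the extension penalty; some tools instead charge opening plus extension for the first position, so the values should be checked against the tool being used.

```python
def affine_gap_penalty(length, open_penalty=-4, extend_penalty=-1):
    """Total penalty of one contiguous gap of the given length.

    Assumed convention: the first gap position costs open_penalty and every
    additional position costs extend_penalty.
    """
    if length <= 0:
        return 0
    return open_penalty + (length - 1) * extend_penalty


for k in range(1, 5):
    print(k, affine_gap_penalty(k))
```

With an opening penalty of −4 and an extension penalty of −1, one gap of length 3 costs −6 under this convention, whereas three separate gaps of length 1 cost −12, which is why affine models favor fewer, longer gaps.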

Finally, the last example shows how to perform local alignments. The first case aligns DNA sequences using a match score of 3, a mismatch score of −2 and a constant gap penalty g of −3. The second shows a local alignment of the same protein sequences from the last example, also using the same set of parameters.

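from Bio import pairwise2
from Bio.pairwise2 import format_alignment
# Note: recent Biopython releases no longer ship Bio.SubsMat; there,
# substitution_matrices.load("BLOSUM62") from Bio.Align plays the same role.
from Bio.SubsMat import MatrixInfo

matrix = MatrixInfo.blosum62

# local DNA alignment: match 3, mismatch -2, gap opening and extension both -3
local_dna = pairwise2.align.localms("ATAGAGAATAG", "GGGAGAATC", 3, -2, -3, -3)
for a in local_dna:
    print(format_alignment(*a))

# local protein alignment: BLOSUM62, gap opening -4, gap extension -1
local_prot = pairwise2.align.localds("KEVLA", "EVSAW", matrix, -4, -1)
for a in local_prot:
    print(format_alignment(*a))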
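As a way to cross-check the DNA case without BioPython, the sketch below computes only the optimal local alignment score with the Smith-Waterman recurrence, using a match score of 3, a mismatch score of −2 and a constant gap penalty of −3. The function name local_score is an assumption made for this illustration, and the sketch does not recover the alignment itself.

```python
def local_score(s1, s2, match=3, mismatch=-2, gap=-3):
    """Best local alignment score with a constant (linear) gap penalty."""
    best = 0
    # first row of the score matrix: all zeros, as in the local algorithm
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            diag = prev[j - 1] + (match if c1 == c2 else mismatch)
            # scores below zero are reset to zero, so an alignment may
            # start at any cell of the matrix
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_score("ATAGAGAATAG", "GGGAGAATC"))
```

For these two sequences the shared segment GAGAAT contributes six matches, that is 6 × 3 = 18, and no extension of it improves the score.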

Bibliographical Notes and Further Reading

BLOSUM matrices were proposed by Henikoff and Henikoff in [74], while Margaret Dayhoff first proposed PAM matrices [43]. The Needleman-Wunsch and Smith-Waterman algorithms take the names of their authors and were proposed in [118] and [141], respectively. Edit distance was introduced by Levenshtein [97].

Exercises and Programming Projects

Exercises

1. a. Consider the application of the Smith-Waterman algorithm to the sequences: S1: ANDDR; S2: AARRD. The alignment parameters should be the BLOSUM62 substitution matrix and the value of g = −8. Calculate (by hand): (i) the S matrix with the best scores; (ii) the trace-back matrix; (iii) the optimal alignment and its score. Check if there are any alternative optimal alignments.

b. Write a program in Python, using the functions defined in this chapter, that confirms the results you obtained in the previous exercise.

2. a. Consider the application of the Needleman-Wunsch algorithm to the following DNA sequences: S1: TACT; S2: ACTA. The parameters used are the following: gap penalty (g): −3, match (equal characters): 3, mismatch: −1. Calculate (by hand): (i) the S matrix with the best scores; (ii) the trace-back matrix; (iii) the optimal alignment and its score. Check if there are any alternative optimal alignments.

b. Write a program in Python, using the functions defined in this chapter, that confirms the results you obtained in the previous exercise.

3. Write and test a function that, given a binary matrix (with elements 0 or 1), coming from a function that creates dotplot matrices, identifies the largest diagonal containing ones (it can be the main diagonal or any other diagonal in the matrix). The result should be a tuple with: the size of the diagonal, the row where it begins, the column where it begins.

4. Consider the functions to calculate pairwise global alignments. Note that, in the case there are distinct alignments with the same optimal score, the functions only return one of them. Notice that these ties arise in the cases where, in the recurrence relation of the DP algorithm, there are at least two alternatives that return the same score.

a. Define a function needleman_Wunsch_with_ties, which is able to return a trace-back matrix (T) with each cell being a list of optimal alternatives and not a single one.

b. Define a function recover_align_with_ties, which, taking a trace-back matrix created by the previous function, can return a list with the multiple optimal alignments.

5. Considering the functions to calculate pairwise local alignments, define similar functions to the previous exercise for the case of multiple optimal alignments. Note that, in this case, ties may also arise due to multiple equal scores in the S matrix (check the example from Figs. 6.7 and 6.8).

6. Write and test a function that, given two lists of sequences (l1 and l2), finds, for each sequence in l1, the most similar sequence in l2 (considering similarity based on identity, as defined above). The result will be a list with the same size as l1, indicating in each position i the index in l2 of the sequence most similar to the i-th sequence in l1.

7. Write and test a function that, given two DNA sequences s1 and s2, searches for the best possible local alignment between a putative protein encoded by a reading frame from s1 and a putative protein encoded by a reading frame from s2 (check Section 4.4 for the details on reading frame calculations). The result will be a tuple with the best alignment and its score. The parameters of the alignment should be passed as arguments to the function.

Programming Projects

1. Using the object-oriented functionalities of Python, develop a set of classes to represent alignments and implement the algorithms described in this chapter. These could be integrated within classes to handle biological sequences, as proposed in previous chapters.

As a suggestion, the following classes may be defined:

• A class to keep substitution matrices, implementing their creation in different ways (e.g. with match/mismatch scores or loading from files) and access to scores for a pair of symbols;

• A class to keep alignments, with methods to create them, to access their columns, and to calculate scores given the alignment parameters;

• A class to implement the alignment algorithms based on DP, which should keep alignment parameters as attributes.

2. Implement a class to represent dot plots, implementing methods to create, print and analyze these matrices.

a. Start by considering binary matrices and the algorithms defined in this chapter.

b. Consider more complex approaches, by implementing matrices with numerical values in the range [0, 1]. These may be calculated with filters or using substitution matrices of pairs of characters, or neighborhood windows.

Searching Similar Sequences in Databases

In this chapter, we discuss the problem of finding sequences similar to a target sequence in large databases. We discuss how the need to repeat pairwise comparisons a large number of times changes the requirements of the problem and demands more efficient solutions. We discuss existing heuristic algorithms and tools for this task, and proceed to implement a simplified version of the most popular one (BLAST). We finish the chapter by looking at BioPython and checking how we can run and analyze the results from BLAST using Python scripts.
