In this section, we will implement a simplified version of a progressive algorithm. The implementation will be object-oriented, thus providing a basis on which further algorithms can be built. We will first describe a few core classes, before addressing the implementation of the algorithm itself.
The starting point of this implementation will be the class MySeq proposed in Section 4.6, which implements the biological sequences that will be aligned (DNA, RNA, or proteins).
The remaining classes will be explained further in the next subsections.
8.3.1 Representing Alignments: Class MyAlign
We will start by developing a class to represent alignments. This class, although quite simple, will allow more modularity, defining a natural result for pairwise or multiple sequence alignment algorithms.
The class MyAlign will define alignments using two variables: the alignment type (DNA, RNA, protein) and the list of sequences included in the alignment, which will be defined as strings that include the symbol "-" to represent gaps. The constructor and some core methods to access the information kept in the alignment are given below:
class MyAlign:

    def __init__(self, lseqs, al_type="protein"):
        self.listseqs = lseqs
        self.al_type = al_type

    def __len__(self):  # number of columns
        return len(self.listseqs[0])

    def __getitem__(self, n):
        if type(n) is tuple and len(n) == 2:
            i, j = n
            return self.listseqs[i][j]
        elif type(n) is int:
            return self.listseqs[n]
        return None

    def __str__(self):
        res = ""
        for seq in self.listseqs:
            res += "\n" + seq
        return res

    def num_seqs(self):
        return len(self.listseqs)

    def column(self, indice):
        res = []
        for k in range(len(self.listseqs)):
            res.append(self.listseqs[k][indice])
        return res

if __name__ == "__main__":
    alig = MyAlign(["ATGA-A", "AA-AT-"], "dna")
    print(alig)
    print(len(alig))
    print(alig.column(2))
    print(alig[1, 1])
    print(alig[0])
Apart from the constructor, the three other special methods define the length of an alignment (the number of columns), how to access elements of the alignment (a single integer index returns a sequence, while a pair of indexes selects a row, i.e. a sequence, and a column), and how to convert an alignment to a string for printing. The remaining two methods give easy access to the number of sequences and to a column of the alignment. The last part shows a very simple example of the use of these methods.
Another important method in this class, which will be used to implement our MSA algorithm, is the calculation of the consensus of the alignment. A consensus is defined as the sequence composed of the most frequent character in each column of the alignment, ignoring gaps. This definition is implemented by the following method, included in the MyAlign class:
class MyAlign:
    (...)

    def consensus(self):
        cons = ""
        for i in range(len(self)):
            # count the occurrences of each character in column i
            cont = {}
            for k in range(len(self.listseqs)):
                c = self.listseqs[k][i]
                if c in cont:
                    cont[c] = cont[c] + 1
                else:
                    cont[c] = 1
            # select the most frequent character, ignoring gaps
            maximum = 0
            cmax = None
            for ke in cont.keys():
                if ke != "-" and cont[ke] > maximum:
                    maximum = cont[ke]
                    cmax = ke
            cons = cons + cmax
        return cons

if __name__ == "__main__":
    alig = MyAlign(["ATGA-A", "AA-AT-"], "dna")
    print(alig.consensus())
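This method collects the frequencies of the characters in each column, using a dictionary, and selects the most common non-gap character. Because the comparison is strict, a tie is resolved in favor of the first tied character found when iterating over the dictionary keys. In Python 3.7 and later, dictionaries preserve insertion order, so this is the tied character that occurs in the topmost sequence; in earlier versions, the iteration order is not guaranteed. Note also that a column consisting only of gaps leaves cmax as None, which makes the string concatenation fail.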
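To make the tie-breaking rule explicit, the column-wise selection can also be sketched as a standalone function. The names consensus_column and consensus below are hypothetical helpers, not part of the MyAlign class, and two choices are assumptions of this sketch: ties go to the lexicographically smallest character, and a column made only of gaps yields a gap.

```python
from collections import Counter

def consensus_column(column, gap="-"):
    """Most frequent non-gap character of a column (hypothetical helper).

    Assumption: ties are broken by taking the lexicographically
    smallest of the tied characters.
    """
    counts = Counter(c for c in column if c != gap)
    if not counts:
        return gap  # assumption: an all-gap column yields a gap
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)

def consensus(seqs, gap="-"):
    """Consensus of a list of equal-length aligned strings."""
    # zip(*seqs) iterates over the columns of the alignment
    return "".join(consensus_column(col, gap) for col in zip(*seqs))

print(consensus(["ATGA-A", "AA-AT-"]))  # prints AAGATA
```

In the second column, T and A are tied, so this sketch selects A; a rule that prefers the first character encountered would select T instead.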
8.3.2 Pairwise Alignment: Class AlignSeq
The algorithms for pairwise sequence alignment based on Dynamic Programming were previously implemented in Chapter 6, using Python functions. In this case, we will use an object-oriented implementation.
The first class defined for this purpose, SubstMatrix, implements substitution matrices in a way very similar to the ones we implemented before, using the same dictionary-based representation (see Section 6.3.3 for the details). The class has one attribute keeping the alphabet, and a second one keeping the dictionary with the scores for pairs of characters.
We will not provide here the full content of this class, given the similarity to the code presented before (the full code is available on the book's website). Below, we just provide the constructor, a few access methods, and the headers of the remaining methods, where in most cases the names are similar to the ones used in Chapter 6.
class SubstMatrix:

    def __init__(self):
        self.alphabet = ""
        self.sm = {}

    def __getitem__(self, ij):
        i, j = ij
        return self.score_pair(i, j)

    def score_pair(self, c1, c2):
        if c1 not in self.alphabet or c2 not in self.alphabet:
            return None
        return self.sm[c1 + c2]

    def read_submat_file(self, filename, sep):
        (...)

    def create_submat(self, match, mismatch, alphabet):
        (...)
Note that, as before, we can create substitution matrices by reading the full set of scores from a file (method read_submat_file) or by defining two scores for matches and mismatches (method create_submat).
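As an illustration of the dictionary-based representation, the following minimal sketch builds a match/mismatch matrix keyed by two-character strings, which is the key format used by score_pair above. The functions make_submat and lookup are hypothetical stand-ins written for this example; they are not the bodies of the elided methods.

```python
def make_submat(match, mismatch, alphabet):
    """Build a dict mapping each pair 'XY' to its score (hypothetical helper)."""
    sm = {}
    for c1 in alphabet:
        for c2 in alphabet:
            sm[c1 + c2] = match if c1 == c2 else mismatch
    return sm

def lookup(sm, alphabet, c1, c2):
    """Score of a pair, or None if a character is outside the alphabet."""
    if c1 not in alphabet or c2 not in alphabet:
        return None
    return sm[c1 + c2]

alphabet = "ACGT"
sm = make_submat(1, -1, alphabet)
print(len(sm))                         # one entry per ordered pair: 16
print(lookup(sm, alphabet, "A", "A"))  # match score: 1
print(lookup(sm, alphabet, "A", "C"))  # mismatch score: -1
```

Storing both orderings of each pair ("AC" and "CA") costs a little memory but keeps the lookup a single dictionary access.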
The class that runs the alignments is named PairwiseAlignment. This class will have a set of variables keeping the parameters of the alignment (substitution matrix and gap penalty), the sequences to align, and the S and T matrices from the DP algorithms. As before, we do not show here the full code (available on the book's website).
from MyAlign import MyAlign
from MySeq import MySeq
from SubstMatrix import SubstMatrix

class PairwiseAlignment:

    def __init__(self, sm, g):
        self.g = g
        self.sm = sm
        self.S = None
        self.T = None
        self.seq1 = None
        self.seq2 = None

    def score_pos(self, c1, c2):
        (...)

    def score_alin(self, alin):
        (...)

    def needleman_Wunsch(self, seq1, seq2):
        if (seq1.seq_type != seq2.seq_type): return None
        (...)
        return self.S[len(seq1)][len(seq2)]

    def recover_align(self):
        (...)
        return MyAlign(res, self.seq1.seq_type)

    def smith_Waterman(self, seq1, seq2):
        if (seq1.seq_type != seq2.seq_type): return None
        (...)
        return maxscore

    def recover_align_local(self):
        (...)
        return MyAlign(res, self.seq1.seq_type)
Notice that, in this implementation, the methods that run the algorithms return the optimal scores and fill the matrices S and T, which are kept as attributes of the object. The methods to recover the alignments use the content of those matrices to return the optimal alignments, as an object of the class MyAlign described in the previous section.
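The split between filling the matrices and recovering the alignment can be illustrated with plain functions. This is a minimal sketch under stated assumptions, not the elided code of the class: sequences are plain strings, the substitution matrix is a dictionary keyed by two-character strings, the gap penalty g is linear, and the trace-back codes (1 for diagonal, 2 for up, 3 for left, with ties resolved in that order) are a choice made for this example.

```python
def needleman_wunsch(seq1, seq2, sm, g):
    """Fill score matrix S and trace-back matrix T; return (score, S, T)."""
    n, m = len(seq1), len(seq2)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):       # first column: gaps in seq2
        S[i][0] = g * i
        T[i][0] = 2
    for j in range(1, m + 1):       # first row: gaps in seq1
        S[0][j] = g * j
        T[0][j] = 3
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + sm[seq1[i - 1] + seq2[j - 1]]
            up = S[i - 1][j] + g
            left = S[i][j - 1] + g
            S[i][j] = max(diag, up, left)
            # assumption: ties prefer diagonal, then up, then left
            T[i][j] = 1 if S[i][j] == diag else (2 if S[i][j] == up else 3)
    return S[n][m], S, T

def recover_align(seq1, seq2, T):
    """Follow T from the bottom-right cell back to the origin."""
    a1, a2 = "", ""
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if T[i][j] == 1:            # diagonal: align two characters
            a1, a2 = seq1[i - 1] + a1, seq2[j - 1] + a2
            i, j = i - 1, j - 1
        elif T[i][j] == 2:          # up: gap in seq2
            a1, a2 = seq1[i - 1] + a1, "-" + a2
            i -= 1
        else:                       # left: gap in seq1
            a1, a2 = "-" + a1, seq2[j - 1] + a2
            j -= 1
    return [a1, a2]

sm = {a + b: (1 if a == b else -1) for a in "ACGT" for b in "ACGT"}
score, S, T = needleman_wunsch("ATAGC", "AACC", sm, -1)
aligned = recover_align("ATAGC", "AACC", T)
print(score)
print("\n".join(aligned))
```

Because only T is needed for the trace-back, the scoring step and the recovery step can be called separately, which mirrors the division between needleman_Wunsch and recover_align in the class.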
8.3.3 Implementing Multiple Sequence Alignment: Class MultipleAlign
In this section, we provide the implementation of the simplified progressive MSA algorithm.
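This algorithm will receive as input a set of sequences, provided in a given order, as well as the parameters that define the objective function of the alignment (substitution matrix and gap penalties), returning an alignment of the sequences. Being a progressive heuristic, it does not guarantee that the returned alignment is optimal, and the result may depend on the order in which the sequences are provided.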
Figure 8.4: Example of the application of the MSA algorithm to be implemented in this chapter.
In our implementation, the Needleman-Wunsch algorithm, described in Chapter 6, will be used to compute the pairwise alignments. The MSA will start by using this algorithm to align the first two sequences, according to the defined parameters. Afterwards, in each iteration, the next sequence will be added to the alignment. This process first takes the current alignment and calculates its consensus. Then, this consensus is aligned with the new sequence using the Needleman-Wunsch algorithm. The resulting alignment is reconstructed from the columns of the previous alignment, following the aligned consensus: where the pairwise alignment places a gap in the consensus, a column of gaps is inserted in all the previous sequences.
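The reconstruction step can be sketched in isolation as follows. The function add_aligned_seq is a hypothetical helper written for this illustration, and the aligned consensus and aligned new sequence passed to it are hand-written inputs rather than the output of an actual Needleman-Wunsch run.

```python
def add_aligned_seq(rows, aligned_cons, aligned_new, gap="-"):
    """Expand the rows of an alignment to follow the gaps introduced in
    its consensus, then append the newly aligned sequence."""
    res = [""] * len(rows)
    orig = 0  # index of the current column in the previous alignment
    for c in aligned_cons:
        if c == gap:
            # the pairwise alignment opened a gap in the consensus:
            # every previous sequence receives a gap in this column
            for k in range(len(rows)):
                res[k] += gap
        else:
            # copy the next column of the previous alignment unchanged
            for k in range(len(rows)):
                res[k] += rows[k][orig]
            orig += 1
    res.append(aligned_new)
    return res

# Hand-written inputs, for illustration only
rows = ["ATAGC", "A-ACC"]
new_rows = add_aligned_seq(rows, "AT-AGC", "ATGA-C")
print("\n".join(new_rows))
```

Note that gaps already present in the previous alignment (such as the one in the second row) are preserved, since whole columns are copied.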
An example of an alignment with 3 sequences is shown in Fig. 8.4, where the alignment parameters used were: match score: 1, mismatch score: -1, gap penalty: -1. Note that in the calculation of the consensus, in the column where there is a tie, the consensus shown selects the first character in lexicographical order (in this case, C); reproducing this choice with the consensus method requires visiting the tied characters in sorted order.
We will implement this algorithm resorting to the class MultipleAlignment, built on the basis created by the set of classes proposed in the previous sections. This class will have two variables: one containing the sequences to be aligned (objects of class MySeq), and the other the alignment parameters in the form of an object of the class PairwiseAlignment, which will be used to perform the pairwise alignments using the Needleman-Wunsch algorithm. The implementation of this class is provided below.
from PairwiseAlignment import PairwiseAlignment
from MyAlign import MyAlign
from MySeq import MySeq
from SubstMatrix import SubstMatrix

class MultipleAlignment():

    def __init__(self, seqs, alignseq):
        self.seqs = seqs
        self.alignpars = alignseq

    def add_seq_alignment(self, alignment, seq):
        res = []
        for i in range(len(alignment.listseqs) + 1):
            res.append("")
        # align the consensus of the current alignment with the new sequence
        cons = MySeq(alignment.consensus(), alignment.al_type)
        self.alignpars.needleman_Wunsch(cons, seq)
        align2 = self.alignpars.recover_align()
        orig = 0  # current column in the previous alignment
        for i in range(len(align2)):
            if align2[0, i] == '-':
                # gap in the consensus: add a column of gaps
                for k in range(len(alignment.listseqs)):
                    res[k] += "-"
            else:
                # copy the next column of the previous alignment
                for k in range(len(alignment.listseqs)):
                    res[k] += alignment[k, orig]
                orig += 1
        res[len(alignment.listseqs)] = align2.listseqs[1]
        return MyAlign(res, alignment.al_type)

    def align_consensus(self):
        self.alignpars.needleman_Wunsch(self.seqs[0], self.seqs[1])
        res = self.alignpars.recover_align()
        for i in range(2, len(self.seqs)):
            res = self.add_seq_alignment(res, self.seqs[i])
        return res

def test():
    s1 = MySeq("ATAGC")
    s2 = MySeq("AACC")
    s3 = MySeq("ATGAC")
    sm = SubstMatrix()
    sm.create_submat(1, -1, "ACGT")
    aseq = PairwiseAlignment(sm, -1)
    ma = MultipleAlignment([s1, s2, s3], aseq)
    al = ma.align_consensus()
    print(al)

if __name__ == "__main__":
    test()
In the code, the method align_consensus provides the implementation of the MSA algorithm, returning an object of the class MyAlign. The method add_seq_alignment is an auxiliary method that adds a new sequence to an existing alignment; it is called in each iteration of the main algorithm to align the consensus of the previous alignment with the new sequence, returning the new alignment. The test function defines a simple example, with the scenario previously depicted in Fig. 8.4.