Probabilistic motifs are typically expressed through Probabilistic Weight Matrices (PWM) [47,68,144], also called Templates or Profiles. A PWM is a matrix of weighted matches in which the rows correspond to the biological symbols (nucleotides or amino acids) and the columns to the positions of the motif. Fig. 11.1 shows the scheme of a PWM representing a DNA motif of size N. The value of the cell P_ij denotes the probability of nucleotide i being found at position j of the motif P. In this model, independence between the symbols of the motif is assumed. For a sequence S = S_1 S_2 ... S_N of length N, the likelihood of being matched by a PWM P is given by formula (11.1).
For sequence comparison purposes, logarithms are easier to handle than probability values; therefore, the log-odds of the P_ij cells are usually used, creating the so-called Position Specific Scoring Matrices. The likelihood of S being matched by P can then be written as a score given by formula (11.2).
Figure 11.1: Representation of a Position Weight Matrix. Rows represent symbols and columns the positions along the motif. Each cell contains the probability of finding the respective symbol at a certain position.
Figure 11.2: Position weight matrix derived from a set of eight sequences and scanning of an input sequence to find the most probable motif match.
Bioinformatics Algorithms. DOI:10.1016/B978-0-12-812520-5.00011-0
Copyright © 2018 Elsevier Inc. All rights reserved.
    p(S, P) = \prod_{i=1}^{N} P(S_i, i)    (11.1)

    score(S, P) = \sum_{i=1}^{N} \log P(S_i, i)    (11.2)
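As a quick check of how formulas (11.1) and (11.2) relate, the following sketch evaluates both on a small made-up PWM. The matrix values, the ALPHABET string and the function names are illustrative assumptions and are not taken from the figures; natural logarithms are used here, although any base would serve.

```python
import math

# Hypothetical 4 x 3 PWM: rows follow ALPHABET, columns are motif positions.
ALPHABET = "ACGT"
PWM = [
    [0.50, 0.25, 0.125],   # A
    [0.25, 0.25, 0.125],   # C
    [0.125, 0.25, 0.25],   # G
    [0.125, 0.25, 0.50],   # T
]

def probability(seq, pwm):
    """Formula (11.1): product of the per-position probabilities."""
    p = 1.0
    for i, symbol in enumerate(seq):
        p *= pwm[ALPHABET.index(symbol)][i]
    return p

def log_score(seq, pwm):
    """Formula (11.2): sum of the logs of the per-position probabilities."""
    return sum(math.log(pwm[ALPHABET.index(symbol)][i])
               for i, symbol in enumerate(seq))

p = probability("ACT", PWM)
s = log_score("ACT", PWM)
print(p)                # 0.5 * 0.25 * 0.5 = 0.0625
print(s, math.log(p))   # the score equals the log of the probability, up to rounding
```

Because the logarithm is monotonic, ranking candidate sub-sequences by score gives the same order as ranking them by probability, while sums avoid the very small numbers that long products tend to produce.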
Fig. 11.2 presents the PWM generated from a set of 8 sequences. The probability of a sequence a = GATCAT matching this motif P is given by the product of its position probabilities: p(GATCAT|P) = 5/8 × 7/8 × 3/8 × 4/8 × 6/8 × 5/8 ≈ 0.0481.
Given a sequence S and a PWM P, we can determine the sub-sequence of S of length N with the highest likelihood of being matched by P. This can be done by scanning S with a sliding window of length N and applying the calculation in formula (11.1) to each window. Fig. 11.2 describes the process and the respective probability calculation when scanning an input sequence.
We will now define the class MyMotifs, which handles position weight matrices and implements the calculations described above. Before that, we implement two functions: the first creates a matrix of given dimensions and the second prints a matrix.
def create_matrix_zeros(nrows, ncols):
    res = []
    for i in range(0, nrows):
        res.append([0] * ncols)
    return res

def print_matrix(mat):
    for i in range(0, len(mat)):
        print(mat[i])
The class MyMotifs implements the data structure and methods required to create a PWM, to derive different deterministic representations of the motif, and to determine the probability of occurrence of the motif along a sequence.
The class contains five attributes: i) the sequences used to build the PWM model; ii) the size of the motif, i.e. the length of those sequences; iii) the alphabet; iv) a matrix containing the absolute counts; and v) a matrix with the frequency of the symbols at each position of the motif. These matrices are created by calling the functions do_counts and create_pwm.
class MyMotifs:
    """Class to handle Probabilistic Weighted Matrix"""

    def __init__(self, seqs=[], pwm=[], alphabet=None):
        if seqs:
            self.size = len(seqs[0])  # motif length
            self.seqs = seqs  # objects from class MySeq
            self.alphabet = seqs[0].alphabet()
            self.do_counts()
            self.create_pwm()
        else:
            self.pwm = pwm
            self.size = len(pwm[0])
            self.alphabet = alphabet

    def __len__(self):
        return self.size

    def do_counts(self):
        # rows are symbols of the alphabet, columns are motif positions
        self.counts = create_matrix_zeros(len(self.alphabet), self.size)
        for s in self.seqs:
            for i in range(self.size):
                lin = self.alphabet.index(s[i])
                self.counts[lin][i] += 1

    def create_pwm(self):
        if self.counts is None:
            self.do_counts()
        self.pwm = create_matrix_zeros(len(self.alphabet), self.size)
        for i in range(len(self.alphabet)):
            for j in range(self.size):
                self.pwm[i][j] = float(self.counts[i][j]) / len(self.seqs)
From a PWM, we can generate a consensus sequence that captures the most conserved symbols at each position of the motif. The consensus function scans every position of the motif and selects the most frequent symbol at each position. The masked_consensus function works in a similar way, but at each position it outputs the most frequent symbol only if that symbol occurs in more than 50% of the sequences; otherwise it outputs the symbol "-".
    def consensus(self):
        """Returns the sequence motif obtained with the most frequent
        symbol at each position of the motif"""
        res = ""
        for j in range(self.size):
            maxcol = self.counts[0][j]
            maxcoli = 0
            for i in range(1, len(self.alphabet)):
                if self.counts[i][j] > maxcol:
                    maxcol = self.counts[i][j]
                    maxcoli = i
            res += self.alphabet[maxcoli]
        return res

    def masked_consensus(self):
        """Returns the sequence motif obtained with the symbol that occurs
        in more than 50% of the input sequences; other positions get '-'"""
        res = ""
        for j in range(self.size):
            maxcol = self.counts[0][j]
            maxcoli = 0
            for i in range(1, len(self.alphabet)):
                if self.counts[i][j] > maxcol:
                    maxcol = self.counts[i][j]
                    maxcoli = i
            if maxcol > len(self.seqs) / 2:
                res += self.alphabet[maxcoli]
            else:
                res += "-"
        return res
We then implement in the class the functions to calculate the probability of a sequence matching a given motif. This is implemented in the function probability_sequence, which multiplies, over all symbols of the input sequence, the probability of each symbol occurring at its position.
We can then generalize this procedure to a longer sequence and compute the probability of occurrence for every sub-sequence of motif length N (function probability_all_positions). Note that the scanning is done from the first index to the one corresponding to position |S| − N + 1, i.e. the length of the input sequence minus the length of the motif plus one. The probabilities of every sub-sequence are then stored in a list. This function can be easily adapted to find the sub-sequence with the highest probability of matching the motif: the function most_probable_sequence returns the index of such a sub-sequence.
Knowing the sub-sequences where the motif has a higher likelihood of occurrence allows us to update and refine our motif. In the function create_motif, we scan the input sequences and select the most probable sub-sequence of each one to build a new motif, which is returned as an object of the class MyMotifs.
    def probability_sequence(self, seq):
        res = 1.0
        for i in range(self.size):
            lin = self.alphabet.index(seq[i])
            res *= self.pwm[lin][i]
        return res

    def probability_all_positions(self, seq):
        res = []
        for k in range(len(seq) - self.size + 1):
            # evaluate the window of motif length starting at position k
            res.append(self.probability_sequence(seq[k:k + self.size]))
        return res

    def most_probable_sequence(self, seq):
        """Returns the index of the most probable sub-sequence of the input sequence"""
        maximum = -1.0
        maxind = -1
        # the + 1 ensures the last window of the sequence is also scanned
        for k in range(len(seq) - self.size + 1):
            p = self.probability_sequence(seq[k:k + self.size])
            if p > maximum:
                maximum = p
                maxind = k
        return maxind

    def create_motif(self, seqs):
        from MySeq import MySeq
        l = []
        for s in seqs:
            ind = self.most_probable_sequence(s.seq)
            subseq = MySeq(s[ind:(ind + self.size)], s.get_seq_biotype())
            l.append(subseq)
        return MyMotifs(l)
Finally, code is provided to test our class. We start by providing eight sequences to build the motif and visualize the matrices of absolute counts and frequencies. The consensus and masked consensus motifs are also found. Probability calculation is also performed for several input sequences.
def test():
    from MySeq import MySeq
    seq1 = MySeq("AAAGTT")
    seq2 = MySeq("CACGTG")
    seq3 = MySeq("TTGGGT")
    seq4 = MySeq("GACCGT")
    seq5 = MySeq("AACCAT")
    seq6 = MySeq("AACCCT")
    seq7 = MySeq("AAACCT")
    seq8 = MySeq("GAACCT")
    lseqs = [seq1, seq2, seq3, seq4, seq5, seq6, seq7, seq8]
    motifs = MyMotifs(lseqs)
    print("Counts matrix")
    print_matrix(motifs.counts)
    print("PWM")
    print_matrix(motifs.pwm)
    print("Sequence alphabet")
    print(motifs.alphabet)
    for s in lseqs:
        print(s)
    print("Consensus sequence")
    print(motifs.consensus())
    print("Masked Consensus sequence")
    print(motifs.masked_consensus())
    print(motifs.probability_sequence("AAACCT"))
    print(motifs.probability_sequence("ATACAG"))
    print(motifs.most_probable_sequence("CTATAAACCTTACATC"))
In the first part of this chapter, we discussed how a PWM provides a probabilistic representation of a motif by capturing the frequency of each symbol along its sequence positions. A PWM can then be used to search for novel motif matches within the input sequences and can be refined by incorporating other high-scoring matches.
When the number of sub-sequences from which the PWM model is derived is relatively small, some symbols may not be observed at some positions, i.e. have an observed frequency of zero. This will lead to situations where the resulting motif probability will be zero. To avoid this and to increase numerical stability a pseudo-count is typically added to the PWM values.
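A minimal sketch of one common way to apply pseudo-counts is given below. It assumes that a constant is added to every cell of the counts matrix before the frequencies are computed (add-one smoothing); the function name pwm_with_pseudocounts, the default value of 1.0 and the example counts are illustrative assumptions rather than part of the MyMotifs class.

```python
def pwm_with_pseudocounts(counts, pseudocount=1.0):
    """Turn a counts matrix (rows = symbols, columns = positions) into
    frequencies after adding `pseudocount` to every cell."""
    nrows = len(counts)
    ncols = len(counts[0])
    pwm = []
    for i in range(nrows):
        pwm.append([0.0] * ncols)
    for j in range(ncols):
        # column total after smoothing: observed counts plus one pseudocount per symbol
        col_total = sum(counts[i][j] for i in range(nrows)) + pseudocount * nrows
        for i in range(nrows):
            pwm[i][j] = (counts[i][j] + pseudocount) / col_total
    return pwm

# Hypothetical counts for a motif of length 2 built from 4 sequences (rows: A, C, G, T).
counts = [[4, 0],
          [0, 2],
          [0, 2],
          [0, 0]]
smoothed = pwm_with_pseudocounts(counts)
for row in smoothed:
    print(row)
```

With this choice every cell is strictly positive and each column still sums to one, so no sub-sequence receives a probability of exactly zero and the logarithm in formula (11.2) is always defined.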
Visualization is another important aspect of motif representation. A common way to visualize a PWM is through a sequence logo, introduced by Schneider and Stephens in 1990 [47,137].
This type of graphic represents each motif position as a stack of letters whose height is proportional to the conservation level of that position. Tools such as Weblogo [40] make it easy to create appealing representations of PWMs. In sequence logos, the information content of position i, computed over every symbol b, is given by formula (11.3).
    I_i = 2 + \sum_{b} P_{b,i} \log_2 P_{b,i}    (11.3)
Figure 11.3: Weblogo for the eight sub-sequences.
For a DNA motif, perfectly conserved positions contain 2 bits of information, while 1 bit corresponds to positions where two of the four bases each occur in 50% of the sequences. Fig. 11.3 shows the sequence logo created by Weblogo for the eight sub-sequences in Fig. 11.2.
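Formula (11.3) can be evaluated one PWM column at a time. The sketch below is a minimal illustration for DNA; the function name information_content is an assumption, and terms with zero probability are treated as contributing zero, which is the usual convention.

```python
import math

def information_content(column):
    """Formula (11.3) for a DNA column: 2 + sum_b P_b * log2(P_b).
    Symbols with probability 0 contribute nothing to the sum."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

print(information_content([1.0, 0.0, 0.0, 0.0]))      # perfectly conserved: 2.0
print(information_content([0.5, 0.5, 0.0, 0.0]))      # two equally likely bases: 1.0
print(information_content([0.25, 0.25, 0.25, 0.25]))  # all bases equally likely: 0.0
```

Applying the function to each column of a PWM gives the total height of the letter stack at that position of the logo.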