Introduction: Problem Definition and Relevance

Within the context of biological sequence mining, the term motif refers to a non-trivial sequential pattern that is shared across multiple sequences. By non-trivial, we mean that the motif has a minimum relevant length and captures a combination of symbols that differs from the underlying symbol distribution. Its main feature, however, is its recurrence, i.e. the fact that it occurs in several of the analyzed sequences.

The relevance of a motif arises when the set of sequences where it occurs is associated with genomic elements, such as genes, that share a certain biological property or are under the same regulatory control. The hypothesis is then that the motif plays a role in the associated biological phenomenon.

For instance, in DNA sequences, motifs may occur in the promoter region of genes and indicate the presence of binding sites of single proteins or protein complexes that have a regulatory role in gene transcription. In protein sequences, motifs may indicate the existence of conserved domains, i.e. parts of the protein that play specific biological functions, such as enzyme binding sites for substrates or other molecules.

We can distinguish two main classes of motifs: deterministic and probabilistic. Deterministic motifs are often captured by an enhanced regular expression syntax and, as the name indicates, are either present or absent in the input sequences. These were already addressed in Chapter 5.

On the other hand, probabilistic motifs form a looser model that captures the underlying variability of the motif occurrences. When a given segment of a sequence is presented to the model, the probability of it being part of the motif can be derived. As we will see in the next chapter, position weight matrices (PWMs) can be used to represent probabilistic motifs. In this chapter, we will discuss the algorithmic framework to define deterministic motifs and present motif finding algorithms of increasing efficiency.

Bioinformatics Algorithms. DOI:10.1016/B978-0-12-812520-5.00010-9

We will start by introducing the concepts that will help us to better define the problem of efficiently discovering the sequence motifs conserved among a set of related biological sequences. Note that we will use the terms sequence pattern and motif with equivalent meaning throughout this chapter.

As we have seen in previous chapters, biological sequences are defined as strings of symbols, which for DNA consist of four letters, Σ_DNA = {A, C, G, T}, and for protein sequences of twenty letters, Σ_Protein = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. The characteristics of nucleotide and protein sequences, such as the size of the alphabet, the typical length of the sequence or the underlying symbol distribution, raise different algorithmic challenges. Therefore, motif discovery algorithms need to be optimized according to the nature of the sequences to which they are applied.

As we mentioned before, a deterministic motif can be captured by regular expression syntax. In some cases, the motif is composed of a contiguous combination of symbols, while in other, more complex cases the motif may contain certain parts, possibly of variable length, corresponding to highly variable symbols, which apart from their length do not add any relevant information. These two cases can be differentiated using the terms sub-string and sub-sequence. As defined in [70], a sub-string corresponds to a consecutive part of a string, while a sub-sequence corresponds to a new sequence obtained by possibly dropping some symbols from the original sequence while preserving the order of the remaining ones. The latter is a more general case than the former. For instance, for a sequence S = abcdef, acdf is a sub-sequence of S and bcd is a sub-string of S.
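The distinction between the two notions can be checked mechanically. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the text.

```python
def is_substring(candidate: str, seq: str) -> bool:
    """True if candidate occurs as a consecutive block of seq."""
    return candidate in seq


def is_subsequence(candidate: str, seq: str) -> bool:
    """True if candidate can be obtained from seq by dropping symbols
    while keeping the remaining ones in their original order."""
    remaining = iter(seq)
    # 'symbol in remaining' consumes the iterator up to the first match,
    # so each later symbol must be found further to the right.
    return all(symbol in remaining for symbol in candidate)


S = "abcdef"
print(is_substring("bcd", S))     # True
print(is_subsequence("acdf", S))  # True
print(is_substring("acdf", S))    # False
```

Every sub-string is also a sub-sequence, but not the other way around, which is why the sub-sequence is the more general case.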

If |S| denotes the length of the sequence S and S = xyz, with |x| ≥ 0 and |z| ≥ 0, then x is said to be a prefix of S and z a suffix of S. Alternatively, S is a superstring or supersequence of x.

Generally, a deterministic motif M can be defined as a non-null string of symbols from an enhanced alphabet. The alphabet Σ' results from an extension of Σ with an additional set of symbols E, Σ' = Σ ∪ E. E adds greater expressiveness to the patterns and may contain symbols that express the existence of gaps or symbols that reflect positions that may be occupied by more than one symbol. The set of all the strings that result from substituting the symbols of E in M defines a pattern language, L(M). A motif M is said to match or hit a sequence S if S contains some string y that occurs in L(M).
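To make the notion of a pattern language concrete, the sketch below enumerates L(M) for a motif without gaps. The dictionary `EXTENDED` is an assumption made for illustration: it borrows three symbols (Y, R, N) from the IUPAC nucleotide code, and the function names are hypothetical.

```python
from itertools import product

# Assumed extended symbols E: each maps to the plain DNA symbols
# it may stand for (Y, R and N as in the IUPAC nucleotide code).
EXTENDED = {"Y": "CT", "R": "AG", "N": "ACGT"}


def pattern_language(motif: str) -> list[str]:
    """Enumerate every plain string obtained by substituting the
    extended symbols of the motif."""
    choices = [EXTENDED.get(symbol, symbol) for symbol in motif]
    return ["".join(combo) for combo in product(*choices)]


def matches(motif: str, seq: str) -> bool:
    """A motif hits a sequence if some string of L(M) occurs in it."""
    return any(y in seq for y in pattern_language(motif))


print(pattern_language("GTYRAC"))
# ['GTCAAC', 'GTCGAC', 'GTTAAC', 'GTTGAC']
print(matches("GTYRAC", "AAGTTGACCC"))  # True
```

Explicit enumeration grows exponentially with the number of degenerate positions, so it is only meant to illustrate the definition, not to serve as a matching strategy.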

We have seen in Chapter 5 that the restriction enzyme EcoRI cuts the DNA when it finds the sequence pattern GAATTC. In this case, we have a very clear pattern represented by a unique string. However, it is often the case that the binding sequences vary slightly. This leads to more subtle patterns that need to comprise more than one sequence match. This is, for instance, the case of HindII [140], a type II restriction enzyme that cuts the DNA at specific sequences that are six nucleotides long. When the DNA is read in the 5' to 3' direction, the cutting points are formed by the sequences: GT, followed by T or C, then followed by A or G, and finally AC. In this case, we have a degenerate but still highly conserved motif.

Figure 10.1: Multiple representations of a motif with matches in five DNA sequences.

When considering, for instance, transcription factors (TFs), their motifs become more complex, since these proteins often bind to sequences that are highly degenerate. This degeneracy greatly increases the complexity of de novo motif discovery.

The EcoRI motif can be described using only the Σ_DNA alphabet, while HindII is a good candidate to be described with an extended version of the Σ_DNA alphabet. Fig. 10.1 depicts the occurrences of the strings along the input sequences and their alignment. The second and third positions of the motif can be captured by the symbols Y and R, which respectively represent C or T and A or G. The IUPAC nucleotide code provides symbols for the different degenerate positions and is an example of an extended alphabet.
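One way to search for a motif written over such an extended alphabet is to translate it into a regular expression. The sketch below assumes a small subset of the IUPAC nucleotide code and writes the HindII site described above as GTYRAC; the helper names and the test sequences are made up for illustration.

```python
import re

# Subset of the IUPAC nucleotide code; only the symbols needed here.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> str:
    """Translate a motif over the extended alphabet into a regex."""
    return "".join(IUPAC[symbol] for symbol in motif)


def find_hits(motif: str, seq: str) -> list[int]:
    """Start positions of all (possibly overlapping) motif matches."""
    # The lookahead makes overlapping occurrences visible to finditer.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq)]


print(iupac_to_regex("GTYRAC"))                 # GT[CT][AG]AC
print(find_hits("GAATTC", "TTGAATTCAGAATTC"))   # [2, 9]
print(find_hits("GTYRAC", "CGTCAACTTGTTGACA"))  # [1, 9]
```

The EcoRI motif needs no extended symbols, so its regular expression is the plain string itself.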

A possible representation of a motif is to store the alignment of all its instances in the input sequences. Equivalently, we can store the starting positions and later recover the corresponding strings. This way we capture all the diversity of the motif, conserving all the information.

A drawback of this representation is that we need to store as many strings as occurrences of the motif.

An alternative representation is to take advantage of an extended alphabet and capture the motif with regular expression syntax. This is called a consensus motif. While this provides a more compact and intuitive motif representation, we lose information regarding the number of times each instance of the motif occurs.
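A consensus motif can be derived from a set of aligned instances by replacing each column with the symbol that stands for the set of nucleotides seen there. The sketch below uses the IUPAC nucleotide code for that mapping; the five instances are invented for illustration and are not the ones of Fig. 10.1.

```python
# IUPAC nucleotide code, indexed by the set of nucleotides it represents.
IUPAC_BY_SET = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(instances: list[str]) -> str:
    """Collapse equally long, aligned instances into one consensus string."""
    columns = zip(*instances)
    return "".join(IUPAC_BY_SET[frozenset(col)] for col in columns)


# Hypothetical aligned matches of a six-nucleotide motif.
instances = ["GTTAAC", "GTCGAC", "GTTGAC", "GTTGAC", "GTCAAC"]
print(consensus(instances))  # GTYRAC
```

Note that GTTGAC occurs twice among the instances, but this is no longer visible in the consensus, which is the information loss mentioned above.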

Another possibility is to build a profile or matrix of the frequencies of the symbols at each position of the motif, represented as columns of the matrix. This representation only depends on the size of the alphabet and the length of the motif and is independent of the number of motif occurrences. The negative aspect comes from the fact that we no longer preserve the order in which the symbols occur. In the next chapter, we will discuss in detail the representation based on a matrix of frequencies along the motif positions.

One important measure of interest of a motif is its recurrence, also denoted as frequency or support. For an input set of sequences D, this can be measured as the number of different sequences of D that M matches (each sequence counts once) or as the total number of matches of M in D (more than one count per sequence). This leads to the notion of frequent motif, which is a motif whose frequency is greater than or equal to a user-defined threshold; otherwise the motif is infrequent.

For the sake of generality, we can define an abstract measure of interest, Score, as a measure of M in relation to D. The frequency of a motif M is an example of such a score measure. For a profile representation, we can think of a scoring measure that takes into account the frequency of the most frequent symbol at each position. Overall, these frequencies can be either summed or multiplied. For a profile matrix count and motif length L, the score function for a motif M can be defined by the equation:
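The two ways of counting can be sketched as follows. For simplicity, the motif is assumed to be a plain string matched exactly, and the function names and the toy set `D` are illustrative only.

```python
def sequence_support(motif: str, seqs: list[str]) -> int:
    """Number of sequences with at least one match (each counts once)."""
    return sum(1 for s in seqs if motif in s)


def total_matches(motif: str, seqs: list[str]) -> int:
    """Total number of (possibly overlapping) matches over all sequences."""
    size = len(motif)
    return sum(
        1
        for s in seqs
        for start in range(len(s) - size + 1)
        if s[start:start + size] == motif
    )


def is_frequent(motif: str, seqs: list[str], threshold: int) -> bool:
    """A motif is frequent if its support reaches the threshold."""
    return sequence_support(motif, seqs) >= threshold


D = ["ATGATGC", "CCATGA", "GGGTTT"]
print(sequence_support("ATG", D))  # 2
print(total_matches("ATG", D))     # 3
print(is_frequent("ATG", D, 2))    # True
```

Here ATG occurs twice in the first sequence, so the total count exceeds the per-sequence support.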

$$\mathrm{Score}(M) = \sum_{i=1}^{L} \max_{k \in \Sigma} \mathrm{count}(k, i) \tag{10.1}$$
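As a worked illustration of Eq. (10.1), assume a hypothetical count matrix built from five aligned matches of a motif of length L = 4 over the DNA alphabet (the values are invented; each column sums to five). The score adds the largest count of each column:

```latex
\mathrm{count} =
\begin{array}{c|cccc}
    & i=1 & i=2 & i=3 & i=4 \\ \hline
  A & 4 & 0 & 1 & 0 \\
  C & 0 & 0 & 3 & 1 \\
  G & 1 & 5 & 0 & 0 \\
  T & 0 & 0 & 1 & 4
\end{array}
\qquad
\mathrm{Score}(M) = 4 + 5 + 3 + 4 = 16
```

A perfectly conserved motif would reach the maximum of 5 × 4 = 20, while the multiplicative variant of the score would give 4 · 5 · 3 · 4 = 240 for the same matrix.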

We are now able to formalize the motif discovery problem: given as input a set of sequences D = {S1, S2, ..., St} defined over an alphabet Σ, a motif length L and an optional minimum score value σ, find all the motifs in D that maximize a given score function or that have a score greater than or equal to σ.

To implement the algorithms for deterministic motif finding, we will create a class called DeterministicMotifFinding. The class will contain two attributes: motif_size, which corresponds to the length of the motif, and the vector seqs, which contains the input sequences where the motif is to be found. The function read_file reads the input sequences from a given file.

In Fig. 10.1, we show that from a set of aligned motif matches, we can build a matrix with a number of columns equal to the motif length and a number of rows equal to the alphabet size. Each cell of the matrix contains the frequency of the symbol in the given position. The function create_motif_from_indexes is defined to build such a matrix. It receives as input the indices of the motif along the input sequences. The variable res consists of a bi-dimensional matrix. Next, the function iterates over all indices to retrieve the respective sequence matches. Note the use of the built-in function enumerate, which returns a sequential index and the associated value in the vector. Then, with a nested loop, the count in the respective cell of the matrix is incremented.
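The problem statement can be illustrated with a naive enumeration. The sketch below is not one of the algorithms developed in this chapter: it assumes that the score is the support (number of different sequences containing the motif), considers only exact substrings of length L, and uses a hypothetical function name and toy input.

```python
from collections import defaultdict


def frequent_motifs(seqs: list[str], length: int, min_support: int) -> dict[str, int]:
    """All exact substrings of the given length that occur in at least
    min_support different sequences, with their support."""
    occurs_in = defaultdict(set)
    for idx, s in enumerate(seqs):
        for start in range(len(s) - length + 1):
            # record which sequences contain this candidate motif
            occurs_in[s[start:start + length]].add(idx)
    return {
        motif: len(ids)
        for motif, ids in occurs_in.items()
        if len(ids) >= min_support
    }


D = ["ATGCA", "CATGG", "TTATG"]
print(frequent_motifs(D, 3, 3))  # {'ATG': 3}
```

With D, L = 3 and σ = 3, the only motif that satisfies the threshold is ATG, which occurs in all three sequences.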

These functions provide a suitable motif representation to approach the motif finding problem.

    # MySeq is the sequence class from the earlier chapters
    # (the module name is assumed here).
    from MySeq import MySeq


    class DeterministicMotifFinding:
        """Class for deterministic motif finding."""

        def __init__(self, size=8, seqs=None):
            self.motif_size = size
            if seqs is not None:
                self.seqs = seqs
                self.alphabet = seqs[0].alphabet()
            else:
                self.seqs = []
                self.alphabet = None  # set later by read_file

        def __len__(self):
            return len(self.seqs)

        def __getitem__(self, n):
            return self.seqs[n]

        def seq_size(self, i):
            return len(self.seqs[i])

        def read_file(self, fic, t):
            # one sequence per line; t is the sequence type passed to MySeq
            with open(fic, "r") as handle:
                for s in handle:
                    self.seqs.append(MySeq(s.strip().upper(), t))
            self.alphabet = self.seqs[0].alphabet()

        def create_motif_from_indexes(self, indexes):
            # res[k][j]: count of alphabet symbol k at motif position j
            res = [[0] * self.motif_size for _ in range(len(self.alphabet))]
            for i, ind in enumerate(indexes):
                # match of the motif in sequence i, starting at position ind
                subseq = self.seqs[i][ind:(ind + self.motif_size)]
                for j in range(self.motif_size):
                    for k in range(len(self.alphabet)):
                        if subseq[j] == self.alphabet[k]:
                            res[k][j] = res[k][j] + 1
            return res

The function score defined below implements the scoring function defined in Eq. (10.1). This function iterates through all the positions of the motif and, for each position, determines the maximum value, which is added to the score of the motif. The function score_multiplicative implements a similar scoring, but instead of summing, the score is obtained by multiplying the maximum value at each motif position.

        def score(self, s):
            """Additive score of Eq. (10.1): sum of the column maxima."""
            score = 0
            mat = self.create_motif_from_indexes(s)
            for j in range(len(mat[0])):
                # highest count observed at motif position j
                maxcol = mat[0][j]
                for i in range(1, len(mat)):
                    if mat[i][j] > maxcol:
                        maxcol = mat[i][j]
                score += maxcol
            return score

        def score_multiplicative(self, s):
            """Multiplicative score: product of the column maxima."""
            score = 1.0
            mat = self.create_motif_from_indexes(s)
            for j in range(len(mat[0])):
                # highest count observed at motif position j
                maxcol = mat[0][j]
                for i in range(1, len(mat)):
                    if mat[i][j] > maxcol:
                        maxcol = mat[i][j]
                score *= maxcol
            return score
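To see the count matrix and both scores on a small input without depending on the MySeq class, the following sketch mirrors the logic of the methods above with plain Python strings. The standalone functions, the fixed DNA alphabet, the three sequences and the start positions are all assumptions made for illustration.

```python
from math import prod

ALPHABET = "ACGT"  # assumed row order of the count matrix


def count_matrix(seqs, indexes, motif_size):
    """Same idea as create_motif_from_indexes, for plain strings."""
    res = [[0] * motif_size for _ in ALPHABET]
    for seq, start in zip(seqs, indexes):
        subseq = seq[start:start + motif_size]
        for pos, symbol in enumerate(subseq):
            res[ALPHABET.index(symbol)][pos] += 1
    return res


def column_maxima(mat):
    """Largest count in each column (motif position) of the matrix."""
    return [max(column) for column in zip(*mat)]


seqs = ["ATAGAGCTGA", "ACGTAGATGA", "AAGATAGGGG"]
indexes = [1, 3, 3]  # hypothetical start positions: TAG, TAG, ATA
mat = count_matrix(seqs, indexes, 3)
maxima = column_maxima(mat)

print(mat)          # [[1, 2, 1], [0, 0, 0], [0, 0, 2], [2, 1, 0]]
print(sum(maxima))  # 6  (additive score)
print(prod(maxima)) # 8  (multiplicative score)
```

Choosing the start positions [1, 3, 4] instead selects TAG in all three sequences, which raises the additive score to 9 and the multiplicative score to 27.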
