Pre-Processing the Sequence: Suffix Trees

16.3.1 Definitions and Algorithms

As we saw, tries are an interesting way of processing a set of patterns to make their search over different sequences more efficient. However, in many situations, the most relevant problem is to search for a large number of different patterns once (or a few times) over a very large sequence. This is the case when seeking to align a large number of reads (fragments) to a known sequence of large dimension (e.g. a reference genome).

In this case, instead of pre-processing the pattern(s), we may need to process the target sequence, in a way that can lead to the efficient search of occurrences of any given pattern over it. Although the problem is different from the one in the previous section, one of the most popular solutions is quite similar to the tries proposed above.

Figure 16.5: Example of a suffix tree for the sequence TACTG. Leaves are represented as nodes with gray background, numbered according to the initial position of the suffix they represent.

Indeed, one solution to pre-process a large sequence is the use of suffix trees, which may be seen as a special case of tries, built from the set of all suffixes of the target sequence.

More formally, a tree T = (V, E) is a suffix tree representing a sequence s if all of the following conditions are met:

• the number of leaves of T is equal to the length of s (leaves are numbered from 0 to the length of s − 1);

• each edge in the tree (in E) is labeled with a symbol from s;

• every suffix of s is represented by a leaf, and can be reconstructed by concatenating the symbols in the path from the root to the leaf;

• all edges going out of a node in V have distinct symbols.

An example is provided in Fig. 16.5, showing the suffix tree for a DNA sequence (TACTG).

Note that the leaves (represented with gray background) are numbered according to the starting position of the suffix they represent.

Given the definition above, and the algorithm to build tries proposed in the previous section, building suffix trees for a sequence seems a trivial task. Indeed, we only need to compute suffixes for the sequence and add those to a trie, following the algorithm we have defined by iteratively adding each suffix to the tree in exactly the same way we added patterns in the last section. The only change to be considered here is that we need to number the leaves with the index of the suffix, i.e. its starting position in the overall sequence.

However, the astute reader has, at this point, already realized that there may be situations where we cannot create a suffix tree for a given sequence. As an exercise, we recommend that the reader try to build a tree for the sequence TACTA.

Figure 16.6: Example of a suffix tree for the sequence TACTA. Not all suffixes are represented as leaves, since some are prefixes of other suffixes.

You will quickly realize that the number of leaves of the tree will be smaller than the number of suffixes, since for two of the suffixes we do not reach a leaf. This happens whenever a suffix is a prefix of another suffix. In this case, TA is a prefix of TACTA, and A is a prefix of ACTA. Therefore, in the resulting suffix tree there are no leaves numbered 3 or 4 (see Fig. 16.6). This means that the suffix tree in the figure does not comply with all the conditions set in the above definition.

One way to address the previous problem is to add a symbol at the end of the sequence, which in our examples we will represent as $. In this way, no suffix will be the prefix of another suffix, as all suffixes terminate with $, which does not appear in any other position in the sequence. Applying this change to the target sequence, we can follow the defined algorithm to build a suffix tree without further problems. We illustrate this strategy by adding this symbol to the previous sequence (TACTA$). The resulting tree is shown in Fig. 16.7.
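The effect of the terminator can be checked directly on the list of suffixes, without building any tree. The following is a minimal sketch; the function name hidden_suffixes is a hypothetical helper introduced only for this illustration. It reports the starting positions of the suffixes that are prefixes of another suffix, which are exactly the ones that would not end in a leaf.

```python
def hidden_suffixes(seq):
    """Return start positions of suffixes that are prefixes of another suffix."""
    suffixes = [seq[i:] for i in range(len(seq))]
    hidden = []
    for i, suffix in enumerate(suffixes):
        # Such a suffix ends at an internal node, because a longer suffix
        # continues along the same path in the tree.
        if any(j != i and other.startswith(suffix)
               for j, other in enumerate(suffixes)):
            hidden.append(i)
    return hidden


print(hidden_suffixes("TACTA"))   # [3, 4]
print(hidden_suffixes("TACTA$"))  # []
```

For TACTA the positions 3 and 4 are reported, matching the missing leaves discussed above, while for TACTA$ the list is empty.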

Another important task is the search for the occurrence of patterns in the sequence, using the suffix tree that represents it. This process is quite simple to achieve in a suffix tree. Indeed, we just need to walk over the suffix tree starting in the root, and following the edges according to the symbols in the pattern.

If those edges exist for the complete set of symbols in the pattern, when we reach the end of the pattern, we will be placed in a given node of the tree. From the definition of a suffix tree, all leaves below that node in the tree correspond to starting positions of the pattern in the sequence.

This process is illustrated in Fig. 16.8A, where the tree from Fig. 16.7, representing the sequence TACTA, is used to search for the pattern TA. The result shows that the pattern occurs at positions 0 and 3 of the original sequence.

Figure 16.7: Example of a suffix tree for the sequence TACTA, with $ symbol added in the last position.

Note that if the previous process fails in a given node, i.e. if there is no edge leaving the node labeled with the next symbol of the pattern, this means that the pattern does not occur in the sequence. This is the case with the search for the pattern ACG in Fig. 16.8B.
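As a sanity check on the two searches just described, the expected answers can be obtained with a brute-force scan of the sequence. This is only a reference sketch (the name naive_occurrences is hypothetical) and not the suffix tree search itself; it is useful mainly for testing a suffix tree implementation against known results.

```python
def naive_occurrences(seq, pattern):
    """Return every start position at which pattern occurs in seq."""
    return [i for i in range(len(seq) - len(pattern) + 1)
            if seq.startswith(pattern, i)]


print(naive_occurrences("TACTA", "TA"))   # [0, 3]
print(naive_occurrences("TACTA", "ACG"))  # []
```

The scan inspects every position of the sequence for every pattern, which is exactly the cost that pre-processing the sequence into a suffix tree is meant to avoid.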

Since suffix trees can be used in scenarios where the target sequence is very large, one of their main problems is the amount of memory needed, since the trees become very large. In many cases there are linear segments, i.e. sequences of nodes with a single outgoing edge. These may be compacted by keeping only the first and last node of the segment, and concatenating the symbols along the path.

An example of this process is provided by Fig. 16.9, where the tree from Fig. 16.7 is compacted. Notice that the tree is shown in two versions: the first shows edges with strings of symbols, while the second shows edges with ranges of positions. Indeed, since the strings in the edges are always sub-strings of the target sequence, to avoid redundancy it is sufficient to keep the starting and end positions of each string.
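The compaction step can be sketched as follows. This is an illustration under several assumptions that are not taken from the text: the tree is kept as nested dictionaries rather than in the node table used later in this chapter, a leaf stores its suffix position under the key None, end positions are inclusive, and all function names are hypothetical. When a label occurs more than once in the sequence, the position range chosen here may differ from the one drawn in the figure.

```python
def build_suffix_trie(seq):
    """Uncompacted suffix tree as nested dicts; a leaf keeps its number under None."""
    text = seq + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node[None] = start
    return root


def compact(node):
    """Merge each chain of single-edge nodes into one edge labeled with a string."""
    if None in node:                      # leaf: return its suffix position
        return node[None]
    result = {}
    for symbol, child in node.items():
        label = symbol
        # Extend the label while the child is internal with one outgoing edge.
        while None not in child and len(child) == 1:
            next_symbol, child = next(iter(child.items()))
            label += next_symbol
        result[label] = compact(child)
    return result


def first_leaf(tree):
    """Return the leaf number reached by always following the first edge."""
    while isinstance(tree, dict):
        tree = next(iter(tree.values()))
    return tree


def to_ranges(tree, depth=0):
    """Replace string labels by inclusive (start, end) positions in seq + '$'."""
    if not isinstance(tree, dict):
        return tree
    ranges = {}
    for label, child in tree.items():
        # Every suffix below this edge spells the label at offset `depth`.
        start = first_leaf(child) + depth
        ranges[(start, start + len(label) - 1)] = to_ranges(child, depth + len(label))
    return ranges


tree = compact(build_suffix_trie("TACTA"))
print(tree)             # edges labeled with sub-strings
print(to_ranges(tree))  # edges labeled with position ranges
```

For TACTA$ the compacted tree has only two internal nodes below the root (after TA and after A), instead of one node per symbol of every suffix.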

Suffix trees can also be used for a number of different tasks when handling strings, and in particular biological sequences. Indeed, from what was explained above, it seems clear that they are useful to search for repeats of patterns in sequences, thus enabling the identification of many types of repeats in genomes.

Also, suffix trees may be created from more than one sequence, which makes it possible to address problems such as discovering which sequences contain a given pattern, finding the longest sub-string shared by a set of sequences, or calculating the maximum overlap of a set of sequences.

Figure 16.8: Example of the search for the patterns TA, in (A), and ACG, in (B), over a suffix tree (for the sequence TACTA). Nodes in blue (mid gray in print version) represent the walk over the tree, considering the symbols in the pattern, while nodes in red (dark gray in print version) show nodes where the symbol in the pattern does not have a matching edge. Leaves marked in blue (light gray in print version) represent all suffixes matching the pattern.

In Fig. 16.10, we show an example of a suffix tree built from two sequences: TAC and ATA.

Two different symbols are used to mark the end of each sequence. The leaves are labeled by joining the sequence index with the starting position of the suffix.

Notice that the pattern search algorithm may be applied over a tree built from multiple sequences in a similar way. In the case of the tree from the figure, for instance, searching for pattern TA would return two matches, one in each sequence.
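A tree over several sequences can be sketched along the same lines. The code below is an illustration under assumptions of our own: nested dictionaries are used instead of the node table of the next section, the terminators are the strings "$1", "$2", ..., sequence indices in the leaf labels start at 0 (the figure may number them differently), and the function names are hypothetical.

```python
def build_generalized_trie(seqs):
    """Suffix tree over several sequences; leaves hold (sequence index, start)."""
    root = {}
    for idx, seq in enumerate(seqs):
        # A distinct terminator per sequence keeps every suffix on its own leaf.
        symbols = list(seq) + ["$" + str(idx + 1)]
        for start in range(len(symbols)):
            node = root
            for symbol in symbols[start:]:
                node = node.setdefault(symbol, {})
            node[None] = (idx, start)
    return root


def find_in_sequences(root, pattern):
    """Return sorted (sequence index, start) matches, or None if there are none."""
    node = root
    for symbol in pattern:
        if symbol not in node:
            return None
        node = node[symbol]
    leaves, stack = [], [node]
    while stack:                          # collect all leaves below the node
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                leaves.append(child)
            else:
                stack.append(child)
    return sorted(leaves)


root = build_generalized_trie(["TAC", "ATA"])
print(find_in_sequences(root, "TA"))  # [(0, 0), (1, 1)]
print(find_in_sequences(root, "G"))   # None
```

Searching for TA returns one match in each sequence: position 0 of TAC and position 1 of ATA.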

Figure 16.9: Example of a compact suffix tree, representing the same sequence as the one in Fig. 16.7. (A) Edges show sub-strings; (B) Edges show position intervals.

Figure 16.10: Example of a suffix tree, built from two sequences: TAC and ATA. The symbols $1 and $2 are added to mark the end of each sequence. Leaves are labeled with a tuple with the sequence and the initial position of the suffix.

16.3.2 Implementing Suffix Trees in Python

To implement suffix trees, we will use a structure very similar to the one used in the previous section for tries, adapted to represent suffix trees. For each node, we will keep a tuple whose first element is the position of the suffix (in the case of leaves), or -1 for internal nodes. The second element of the tuple is a dictionary working similarly to the one used for tries, i.e. keeping the edges (symbols are the keys and destination nodes are the values).

We will define a class SuffixTree to keep suffix trees and implement the respective algorithms. In the next code block, we show the definition of the class, along with the constructor and a method to print a tree.

class SuffixTree:

    def __init__(self):
        self.nodes = {0: (-1, {})}  # root node
        self.num = 0

    def print_tree(self):
        for k in self.nodes.keys():
            if self.nodes[k][0] < 0:
                print(k, "->", self.nodes[k][1])
            else:
                print(k, ":", self.nodes[k][0])

Next, we define the methods that implement the algorithm to build a suffix tree from a given sequence. The method add_node adds a node to the tree, given the origin node, the symbol labeling the edge, and the leaf number (if the node is a leaf; it defaults to -1 otherwise). The method add_suffix adds a suffix to the tree, and is an adaptation of the similar method to add a pattern in the previous section (method add_pattern of the class Trie). Lastly, the method suffix_tree_from_seq adds the $ symbol to the sequence and iteratively calls add_suffix for each suffix of the sequence. The test function builds the suffix tree for the sequence TACTA, yielding the tree shown in Fig. 16.7.

class SuffixTree:
    # (...) constructor and print_tree as defined above

    def add_node(self, origin, symbol, leafnum=-1):
        self.num += 1
        self.nodes[origin][1][symbol] = self.num
        self.nodes[self.num] = (leafnum, {})

    def add_suffix(self, p, sufnum):
        pos = 0
        node = 0
        while pos < len(p):
            if p[pos] not in self.nodes[node][1].keys():
                if pos == len(p) - 1:
                    # last symbol of the suffix: create a numbered leaf
                    self.add_node(node, p[pos], sufnum)
                else:
                    self.add_node(node, p[pos])
            node = self.nodes[node][1][p[pos]]
            pos += 1

    def suffix_tree_from_seq(self, text):
        t = text + "$"
        for i in range(len(t)):
            self.add_suffix(t[i:], i)

def test():
    seq = "TACTA"
    st = SuffixTree()
    st.suffix_tree_from_seq(seq)
    st.print_tree()

test()

Finally, we show the methods that implement the algorithm for pattern search. The method find_pattern walks the tree until it either reaches the final node or fails the search. In the first case, the leaves below the node are returned by calling the auxiliary method get_leafes_below, defined recursively (note this might not be a good idea for real-world scenarios). In the second, the method returns None, since there are no matches. The testing function applies the method to the previous tree, searching for the patterns TA and ACG, mimicking the case depicted in Fig. 16.8.

class SuffixTree:
    # (...) previously defined methods

    def find_pattern(self, pattern):
        node = 0
        for pos in range(len(pattern)):
            if pattern[pos] in self.nodes[node][1].keys():
                node = self.nodes[node][1][pattern[pos]]
            else:
                return None  # no edge for this symbol: no match
        return self.get_leafes_below(node)

    def get_leafes_below(self, node):
        res = []
        if self.nodes[node][0] >= 0:
            res.append(self.nodes[node][0])
        else:
            for k in self.nodes[node][1].keys():
                newnode = self.nodes[node][1][k]
                leafes = self.get_leafes_below(newnode)
                res.extend(leafes)
        return res

def test():
    seq = "TACTA"
    st = SuffixTree()
    st.suffix_tree_from_seq(seq)
    print(st.find_pattern("TA"))
    print(st.find_pattern("ACG"))

test()
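The caveat about recursion deserves a brief illustration. In the uncompacted tree the longest path from the root has one node per symbol of the sequence, and CPython limits the recursion depth (to 1000 calls by default), so collecting the leaves recursively from a node near the root would likely raise a RecursionError for sequences longer than about a thousand symbols. The sketch below replaces the recursion with an explicit stack. The function leaves_below is a hypothetical standalone variant, not part of the class above; it works on the same node table format, and the small table shown is written by hand for the sequence TA$.

```python
def leaves_below(nodes, node):
    """Collect leaf numbers under `node` using an explicit stack, not recursion."""
    leaves = []
    stack = [node]
    while stack:
        current = stack.pop()
        leafnum, edges = nodes[current]
        if leafnum >= 0:
            leaves.append(leafnum)        # leaf: record its suffix position
        else:
            stack.extend(edges.values())  # internal node: visit its children
    return sorted(leaves)


# Node table for "TA$", in the format {id: (leaf number or -1, {symbol: child id})}.
nodes = {
    0: (-1, {"T": 1, "A": 4, "$": 6}),
    1: (-1, {"A": 2}),
    2: (-1, {"$": 3}),
    3: (0, {}),
    4: (-1, {"$": 5}),
    5: (1, {}),
    6: (2, {}),
}

print(leaves_below(nodes, 0))  # [0, 1, 2]
print(leaves_below(nodes, 4))  # [1]
```

The leaves are sorted before being returned because the stack visits them in a different order from the recursive version.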
