Pre-Processing the Patterns: Tries

Part of the document "Thiết kế các thuật toán sinh tin học bằng Python" (pages 343-350)

16.2.1 Definitions and Algorithms

Tries are n-ary trees that organize a set of string patterns. We addressed the concept of trees in Chapter 9, where we mostly discussed binary trees. In the case of tries, each internal node can have n branches (with n ≥ 1).

Bioinformatics Algorithms. DOI:10.1016/B978-0-12-812520-5.00016-X

Copyright © 2018 Elsevier Inc. All rights reserved.

Figure 16.1: Example of a trie for an alphabet with nucleotide symbols: (A) the trie; (B) the patterns represented by the trie in lexicographical order.

In tries, symbols from a given alphabet are associated with the edges (arcs) of the tree, and the represented patterns are read by walking from the root to a leaf. Thus, the tree will have one leaf for each pattern.

An example is shown in Fig. 16.1, for a set of patterns over a DNA sequence alphabet. Notice that each of the patterns shown in Fig. 16.1B can be obtained by following a path from the root to one of the leaves, concatenating the symbols on the edges belonging to the path.

As the previous figure suggests, a trie can be built from a set of patterns. This can be achieved with a simple algorithm, which starts with a tree containing only the root node and iteratively considers each of the patterns, adding the nodes needed so that the tree includes a path from the root to a leaf representing the pattern.

For each pattern p in the input list, we will, in parallel, read the pattern and walk over the tree, starting at the first position of the pattern (p_0) and setting the current node to the root of the tree. In each iteration i, we take the symbol p_i of the pattern and check whether there is an edge going out of the current node that is labeled with p_i. If there is, we descend through that branch and update the current node. If there is not, we create a new node, connect it from the current node by a branch labeled with p_i, and make it the current node. To finish the iteration, we increment i by 1 and move to the next iteration, repeating the process until we pass the final position in the pattern (i.e. i equals the length of p).

Figure 16.2: Example of the application of the algorithm to build a trie from a set of patterns. In this case, the patterns to be considered are: GAT, CCT, and GAG. The current node in each iteration is marked with a gray background. Added nodes and edges have dashed lines in their border and arrow, respectively.

An example of the application of this algorithm is provided in Fig. 16.2, with a trie built from a set of 3 patterns. The addition of each pattern to the current trie is shown in each line, with the different iterations represented by the vertical sequence of tries, where the current node is highlighted in gray and new nodes/edges have dashed lines.
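The walk described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used later in the chapter: it keeps the trie as nested dictionaries (an assumption made here for brevity), and the function name add_pattern and its returned node count are hypothetical helpers introduced to make the three insertions of Fig. 16.2 visible.

```python
def add_pattern(trie, pattern):
    """Insert pattern into a nested-dict trie; return how many nodes were created."""
    node = trie          # the current node starts at the root
    created = 0
    for symbol in pattern:
        if symbol not in node:
            # No outgoing edge labeled with this symbol: create a new node.
            node[symbol] = {}
            created += 1
        # Descend through the (existing or new) edge.
        node = node[symbol]
    return created


trie = {}
for pattern in ["GAT", "CCT", "GAG"]:
    print(pattern, "adds", add_pattern(trie, pattern), "node(s)")
# GAT adds 3 node(s)
# CCT adds 3 node(s)
# GAG adds 1 node(s)
```

The last pattern, GAG, only needs one new node because it shares the path G, A with GAT, which is the sharing of prefixes that the figure illustrates.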

Given a trie created from a set of patterns, we can use it to efficiently search for occurrences of any of those patterns in a target sequence. To see how an algorithm for this task can be designed, let us first see how to check whether a pattern in the trie is a prefix of the target sequence.

In this case, we go through the characters of the sequence and, in parallel, walk over the trie, starting at the root and following the edges corresponding to the characters in the sequence, until one of the following situations occurs: (i) a leaf of the trie is reached, meaning that a pattern has been identified; (ii) the character in the sequence does not exist as a label of an edge going out from the current node, and thus no pattern in the trie has been identified; (iii) the sequence ends before a leaf is reached, in which case no pattern has been identified either.

This algorithm can be easily extended to search for pattern occurrences in the whole target sequence, by repeating this process for all suffixes of the sequence, i.e. all sub-strings starting in position i of the sequence and going to its end, for values of i from 0 to the sequence length minus one.

An example of the application of this algorithm is given in Fig. 16.3, over the trie depicted in Fig. 16.1. In this case, we show the application of the algorithm for the first two suffixes, i.e. the sequences starting in the first two positions of the target sequence.
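The two searches of Fig. 16.3 can be reproduced with a short sketch. It assumes the eight patterns and the target sequence GAGATCCTA used in the test code at the end of this section, and it again uses nested dictionaries rather than the chapter's numbered representation. The names build_trie, prefix_match and all_matches, and the extra "symbols consumed" value returned by prefix_match, are hypothetical and only serve the illustration.

```python
def build_trie(patterns):
    """Build a nested-dict trie: each node maps a symbol to its child node."""
    root = {}
    for pattern in patterns:
        node = root
        for symbol in pattern:
            node = node.setdefault(symbol, {})
    return root


def prefix_match(trie, text):
    """Return (pattern, symbols consumed); pattern is None if no pattern is a prefix."""
    node = trie
    for pos, symbol in enumerate(text):
        if symbol not in node:
            return None, pos              # no edge labeled with this symbol
        node = node[symbol]
        if not node:
            return text[:pos + 1], pos + 1  # leaf reached: pattern identified
    return None, len(text)                # text ended before reaching a leaf


def all_matches(trie, text):
    """Repeat the prefix search for every suffix of text."""
    matches = []
    for i in range(len(text)):
        pattern, _ = prefix_match(trie, text[i:])
        if pattern is not None:
            matches.append((i, pattern))
    return matches


PATTERNS = ["AGAGAT", "AGC", "AGTCC", "CAGAT", "CCTA", "GAGAT", "GAT", "TC"]
trie = build_trie(PATTERNS)
print(prefix_match(trie, "GAGATCCTA"))  # ('GAGAT', 5): first suffix, leaf reached
print(prefix_match(trie, "AGATCCTA"))   # (None, 3): second suffix, no edge for T
print(all_matches(trie, "GAGATCCTA"))   # [(0, 'GAGAT'), (2, 'GAT'), (4, 'TC'), (5, 'CCTA')]
```

Note that every call to prefix_match starts again from the root, regardless of what the previous call learned about the sequence.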

As a limitation of the use of tries for this purpose, notice that when we are searching for patterns in a new suffix, using the previous algorithm, we do not use any information on the previous searches to help in the current one.

Thus, in that respect, the automata put forward in Chapter 5 may be more efficient, although they are more difficult to build when multiple patterns are considered. Indeed, automata may be seen as transformations of tries into graphs, adding failure edges that connect some nodes with partial matches, thus avoiding restarting searches from scratch.
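To make the idea of failure edges more concrete, the sketch below computes them for a small trie with a breadth-first traversal, in the style of the construction usually attributed to Aho and Corasick. It is an assumption that this matches what the text has in mind; the automata of Chapter 5 may be built differently. The functions build_trie and failure_edges are hypothetical names, and the trie uses numbered nodes with node 0 as the root.

```python
from collections import deque


def build_trie(patterns):
    """Trie as {node_id: {symbol: child_id}}, with node 0 as the root."""
    nodes = {0: {}}
    for pattern in patterns:
        node = 0
        for symbol in pattern:
            if symbol not in nodes[node]:
                new_id = len(nodes)
                nodes[node][symbol] = new_id
                nodes[new_id] = {}
            node = nodes[node][symbol]
    return nodes


def failure_edges(nodes):
    """Map each node to the node spelling the longest proper suffix of its
    path that is also a path starting at the root (0 if there is none)."""
    fail = {0: 0}
    queue = deque()
    for child in nodes[0].values():
        fail[child] = 0                  # depth-1 nodes fall back to the root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for symbol, child in nodes[node].items():
            fallback = fail[node]
            # Follow failure edges until one of them can be extended by symbol.
            while fallback != 0 and symbol not in nodes[fallback]:
                fallback = fail[fallback]
            fail[child] = nodes[fallback].get(symbol, 0)
            queue.append(child)
    return fail


nodes = build_trie(["GAT", "CCT", "GAG"])
fail = failure_edges(nodes)
print({n: f for n, f in fail.items() if f != 0})  # {5: 4, 7: 1}
```

Here node 7 (path GAG) points back to node 1 (path G): after reading GAG, a search could continue from G instead of restarting at the root. Only the failure edges are computed; a full search procedure using them is not shown.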

16.2.2 Implementing Tries in Python

The first decision to take when implementing data structures such as tries is how to represent them. In this case, as happened with graphs in the previous chapters, taking into account computational and memory efficiency, we will opt to use dictionaries to keep tries.

We will label the nodes of the tree with sequential integer numbers, and these will serve as the keys in our dictionary. The value associated with each node will represent the edges going out of that node, in the form of another dictionary, where there will be an item for each edge: keys will be the symbols labeling the edges and values will be the indexes of the destination nodes. An empty dictionary will identify the leaves of the tree.

An example is shown in Fig. 16.4, where the trie built in the example from Fig. 16.2 is now labeled with numbers in the nodes, together with the respective representation using dictionaries.
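As a rough illustration of this representation, the dictionary below encodes the trie for GAT, CCT and GAG. The node numbers are an assumption: they follow the order in which nodes would be created when the patterns are inserted one after another, which may or may not coincide with the labels in the figure. The helper patterns_in is hypothetical; it reads the patterns back by walking every path from the root to a leaf.

```python
# Keys are node labels; values map edge symbols to destination nodes.
trie = {
    0: {"G": 1, "C": 4},   # root
    1: {"A": 2},
    2: {"T": 3, "G": 7},
    3: {},                 # leaf: GAT
    4: {"C": 5},
    5: {"T": 6},
    6: {},                 # leaf: CCT
    7: {},                 # leaf: GAG
}


def patterns_in(trie, node=0, prefix=""):
    """List the patterns in lexicographical order by following root-to-leaf paths."""
    if not trie[node]:
        # A leaf ends a pattern; an empty trie (root only) holds no pattern.
        return [prefix] if prefix else []
    found = []
    for symbol, child in sorted(trie[node].items()):
        found.extend(patterns_in(trie, child, prefix + symbol))
    return found


print(patterns_in(trie))  # ['CCT', 'GAG', 'GAT']
```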

Based on this representation, we will implement tries and related algorithms in Python by defining a new class named Trie. An object of this class will keep a trie, having as its main attribute (self.nodes) a dictionary with the previous structure. Also, there will be an attribute (self.num) to keep the label of the most recently created node (since the root is labeled 0, this is the number of nodes minus one), so that labeling of new nodes is possible.

Figure 16.3: Example of the application of the algorithm to search for pattern occurrences in a target sequence. In this case, we show the result of applying the algorithm for the first and second positions of the target sequence. In the first case (top), the search is successful since a leaf is reached (for pattern GAGAT). In the second, the search fails in the last symbol, since an edge labeled with T is not found. Nodes included in the search path are colored in blue (mid gray in print version), while the nodes where it fails and respective edges are colored in red (dark gray in print version).

Figure 16.4: Representation of an example trie using dictionaries. (A) An example trie with numbered nodes; (B) its representation using a dictionary, showing the keys and the respective values (other dictionaries with symbols as keys and destination nodes as values; leaves are represented as empty dictionaries).

The following code shows the class definition, its constructor, a method to print tries (in a format similar to the one used in Fig. 16.4), together with methods to add a pattern to a trie and to create a trie from a list of patterns, following the algorithms described in the previous section (see Fig. 16.2). The example in function test builds and prints the simple trie shown in Figs. 16.2 and 16.4.

    class Trie:

        def __init__(self):
            self.nodes = {0: {}}  # dictionary: node label -> {symbol: destination node}
            self.num = 0          # label of the most recently created node

        def print_trie(self):
            for k in self.nodes.keys():
                print(k, "->", self.nodes[k])

        def add_node(self, origin, symbol):
            # create a new node and link it from origin with an edge labeled symbol
            self.num += 1
            self.nodes[origin][symbol] = self.num
            self.nodes[self.num] = {}

        def add_pattern(self, p):
            pos = 0
            node = 0  # start at the root
            while pos < len(p):
                if p[pos] not in self.nodes[node].keys():
                    self.add_node(node, p[pos])
                node = self.nodes[node][p[pos]]
                pos += 1

        def trie_from_patterns(self, pats):
            for p in pats:
                self.add_pattern(p)

    def test():
        patterns = ["GAT", "CCT", "GAG"]
        t = Trie()
        t.trie_from_patterns(patterns)
        t.print_trie()

    test()

Note that the add_node method creates a new node and links it to an existing one (specified by the origin parameter), labeling the edge with a given symbol, also passed as an argument to the method. This method is used by the add_pattern method, which follows the algorithm illustrated in Fig. 16.2. The pos variable iterates over the symbols in the pattern p, while the variable node keeps the current node in the tree (starting at the root, which is labeled with number 0). Finally, the trie_from_patterns method simply adds each of the patterns in the input list using the previous method.

Next, we will proceed to implement the algorithms to search for patterns in target sequences, using tries. We will first add a method to the previous class to check whether a pattern represented in the trie is a prefix of the sequence (method prefix_trie_match), and then use it to search for occurrences over the whole sequence (method trie_matches), as described in the previous section (see Fig. 16.3).

    class Trie:
        # (...)

        def prefix_trie_match(self, text):
            pos = 0
            match = ""
            node = 0
            while pos < len(text):
                if text[pos] in self.nodes[node].keys():
                    node = self.nodes[node][text[pos]]
                    match += text[pos]
                    if self.nodes[node] == {}:
                        return match  # leaf reached: a pattern was found
                    else:
                        pos += 1
                else:
                    return None  # no edge labeled with the current symbol
            return None  # text ended before a leaf was reached

        def trie_matches(self, text):
            res = []
            for i in range(len(text)):
                m = self.prefix_trie_match(text[i:])
                if m is not None:
                    res.append((i, m))
            return res

    def test2():
        patterns = ["AGAGAT", "AGC", "AGTCC", "CAGAT", "CCTA", "GAGAT", "GAT", "TC"]
        t = Trie()
        t.trie_from_patterns(patterns)
        print(t.prefix_trie_match("GAGATCCTA"))
        print(t.trie_matches("GAGATCCTA"))

    test2()
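One design consequence of identifying patterns with leaves is worth noting: if one pattern is a prefix of another (say GA and GAT), the node where the shorter pattern ends is not a leaf, so a search that stops only at leaves would not report it. The sketch below shows one possible way around this, which is not part of the chapter's code: a hypothetical variant class, here called MarkedTrie, that records in a set the nodes where patterns end and reports every stored pattern that is a prefix of the text.

```python
class MarkedTrie:
    """Hypothetical variant of the Trie class that marks where patterns end."""

    def __init__(self):
        self.nodes = {0: {}}    # node label -> {symbol: destination node}
        self.terminal = set()   # labels of nodes at which some pattern ends

    def add_pattern(self, p):
        node = 0
        for symbol in p:
            if symbol not in self.nodes[node]:
                new_id = len(self.nodes)
                self.nodes[node][symbol] = new_id
                self.nodes[new_id] = {}
            node = self.nodes[node][symbol]
        self.terminal.add(node)  # mark the end of the pattern, leaf or not

    def prefix_matches(self, text):
        """Return every stored pattern that is a prefix of text."""
        node, found = 0, []
        for pos, symbol in enumerate(text):
            if symbol not in self.nodes[node]:
                break
            node = self.nodes[node][symbol]
            if node in self.terminal:
                found.append(text[:pos + 1])
        return found


t = MarkedTrie()
for p in ["GA", "GAT"]:
    t.add_pattern(p)
print(t.prefix_matches("GATT"))  # ['GA', 'GAT']
print(t.prefix_matches("GAC"))   # ['GA']
```

With pattern sets in which no pattern is a prefix of another, such as the ones used in the examples of this section, the marker set contains exactly the leaves and the leaf-based search gives the same answers.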
