Heuristic Algorithm: Boyer-Moore

As we saw in the previous section, the naive algorithm can be computationally costly. Alternative algorithms seek to improve the average efficiency of pattern searching by exploiting the structure of the pattern to skip a significant part of the pairwise symbol comparisons.

Although a number of such algorithms exist, we will cover here only the Boyer-Moore algorithm. Its worst-case complexity is similar to that of the naive algorithm, but in most cases it allows significant gains in performance. The algorithm is based on two rules that, in some situations, allow moving forward more than one position in the target sequence.

As in the naive algorithm, the target sequence is scanned from beginning to end (left to right), but in this case the pattern is compared with the corresponding sub-sequence of the text from right to left. When there is a mismatch between the target sequence and the pattern, two rules can be applied to check whether the search can move forward more than one position in the sequence.
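The right-to-left comparison for a single alignment can be sketched as follows. This is only an illustration: the function name `rightmost_mismatch` and its arguments are assumptions made here, not part of the implementation given later in this section.

```python
def rightmost_mismatch(pattern, text, i):
    """Compare pattern with text[i:i+len(pattern)] from right to left.

    Returns the pattern index of the first mismatch found,
    or -1 if the whole pattern matches at offset i.
    (Hypothetical helper, for illustration only.)
    """
    j = len(pattern) - 1
    while j >= 0 and pattern[j] == text[i + j]:
        j -= 1
    return j


print(rightmost_mismatch("ACCA", "ACCT", 0))  # 3: the last symbol already differs
print(rightmost_mismatch("ACCA", "AACA", 0))  # 1: "CA" matched before the mismatch
print(rightmost_mismatch("ACCA", "ACCA", 0))  # -1: full match
```

The index returned here is the position of the mismatch in the pattern, which is the information both rules use to decide how far to move.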

The first rule that can be applied is the bad-character rule, which states that we can advance the pattern so that the next occurrence (in the pattern) of the mismatched sequence symbol is aligned with the position of the mismatch. If that symbol does not occur in the pattern, we can move forward the maximum number of symbols (the pattern is placed entirely after the position of the mismatch).

Fig. 5.1A shows three examples of the application of this rule. In the first example, the symbol in the sequence where the mismatch occurs (T) does not occur in the pattern, and therefore we can move forward the number of positions that places the pattern’s first symbol in the position following the mismatch. In the second example, the symbol occurs in the pattern, and thus we can move the pattern so that the rightmost occurrence of the symbol in the pattern is aligned with the mismatched symbol in the sequence. The same happens in the third example, but in this case we only move forward one position.
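The rule can be tried out on small cases with the sketch below. The pattern and texts are chosen here for illustration and are not the ones in Fig. 5.1; the helper `bad_character_shift` is also an assumption, and it scans the pattern directly instead of using a pre-computed table.

```python
def bad_character_shift(pattern, text, i, j):
    """Shift suggested by the bad-character rule.

    The pattern is aligned at text[i] and the mismatch is at pattern[j],
    i.e. against text[i + j]. (Hypothetical helper, illustration only.)
    """
    symbol = text[i + j]
    rightmost = pattern.rfind(symbol)  # -1 if the symbol is not in the pattern
    return j - rightmost


# 'T' does not occur in the pattern: the pattern jumps past the mismatch.
print(bad_character_shift("ACCA", "ACCTACCA", 0, 3))  # 4
# 'C' occurs at index 2 of the pattern: move forward a single position.
print(bad_character_shift("ACCA", "AACCA", 0, 3))     # 1
# 'A' occurs to the right of the mismatch: the value is negative,
# so the rule gives no useful shift for this alignment.
print(bad_character_shift("ACCA", "AACA", 0, 1))      # -2
```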

The other rule that may be applied is the good suffix rule, which states that, in case of a mismatch, we can move forward to the next occurrence in the pattern of the part (suffix) that had already matched, to the right of the mismatch.

Fig. 5.1B shows examples of the application of this rule. In the first case, the suffix “AC” matched and, therefore, the pattern moves to the next occurrence of this string in the pattern. The next case shows what happens when the suffix does not occur in the remainder of the pattern, which allows moving the pattern forward a number of positions equal to its length. The third example shows a special case, where the full matching suffix (“CAC”) does not occur elsewhere in the pattern, but a shorter suffix (“AC”) does.
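The three situations can be reproduced with a brute-force reading of the rule, sketched below. It tries increasing shifts until every symbol of the matched suffix either lines up with an equal symbol of the shifted pattern or falls off its left end. The function `good_suffix_shift` and the example patterns are assumptions made for illustration; this is not the table-based pre-processing used by the implementation in this section, which may select larger shifts.

```python
def good_suffix_shift(pattern, j):
    """Smallest shift compatible with the suffix pattern[j+1:] that matched.

    j is the pattern index of the mismatch. Brute force, illustration only.
    """
    m = len(pattern)
    matched = range(j + 1, m)  # pattern positions that matched the text
    return next(
        k for k in range(1, m + 1)
        if all(p - k < 0 or pattern[p - k] == pattern[p] for p in matched)
    )


# Suffix "AC" occurs again further left: align that occurrence.
print(good_suffix_shift("GACTAC", 3))  # 3
# Suffix "AC" occurs nowhere else: shift by the full pattern length.
print(good_suffix_shift("GTTAC", 2))   # 5
# Suffix "CAC" does not recur, but its suffix "AC" starts the pattern.
print(good_suffix_shift("ACTCAC", 2))  # 4
```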

Figure 5.1: Examples of the application of the two rules used by the Boyer-Moore algorithm: (A) bad-character rule; (B) good suffix rule.

To make this algorithm efficient, and to rapidly verify which rule may be applied in each case, the pattern needs to be pre-processed before the search itself, keeping the relevant information in efficient data structures. Note that the number of positions to move forward depends only on the pattern. Therefore, this pre-processing is applied only to the pattern and is independent of the target sequence. Thus, the pre-processing does not need to be repeated to search for the same pattern in other target sequences.

Although this pre-processing has computational costs, these are normally worthwhile, since the pattern is typically much smaller than the sequence and the costs are offset by the larger gains in the search process.

In the case of the bad-character rule, we create a dictionary with all possible symbols in the alphabet as keys, and with values defining the rightmost position where the symbol occurs in the pattern (−1 if the symbol does not occur). This allows rapidly calculating the number of positions to move forward according to this rule, by computing the offset: position of the mismatch in the pattern − value for the symbol in the dictionary. Notice that this value might be negative; in that case, the rule does not help and is ignored in that iteration. This process is done in the process_bcr function in the code given below.
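As a small illustration, the dictionary for the pattern ACCA over the alphabet ACTG can be built and queried as sketched below. The standalone function `build_occurrence_table` is an assumption made here for illustration; it mirrors what the text describes for process_bcr but is not the class method itself.

```python
def build_occurrence_table(alphabet, pattern):
    """Map each alphabet symbol to its rightmost index in pattern (-1 if absent)."""
    occ = {symbol: -1 for symbol in alphabet}
    for position, symbol in enumerate(pattern):
        occ[symbol] = position  # later occurrences overwrite earlier ones
    return occ


occ = build_occurrence_table("ACTG", "ACCA")
print(occ)  # {'A': 3, 'C': 2, 'T': -1, 'G': -1}

# Offset = position of the mismatch in the pattern - dictionary value.
print(3 - occ["T"])  # 4: 'T' is absent, so the whole pattern is skipped
print(3 - occ["C"])  # 1
print(1 - occ["A"])  # -2: negative, the rule is ignored in this iteration
```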

The pre-processing for the good suffix rule is more complex and we will not explain all the details here (the code is given below and we leave a detailed analysis to the interested reader). The result of this process is a list that keeps the number of positions that may be moved forward, depending on the position of the mismatch in the pattern (list index). Notice that, in this process, both situations illustrated above need to be taken into account. This process is done in the process_gsr function in the code given below.

The implementation of the algorithm is given next as a Python class. The class allows defining an alphabet and a pattern in the constructor, and does the pre-processing for both rules, according to the pattern, in the function preprocess, which is called by the constructor.

The function search_pattern allows using an initialized object of this class to search target sequences for the given pattern. It is an adaptation of the naive algorithm given in the previous section, but it makes use of the data structures built in the pre-processing, using the two rules to move forward the maximum number of allowed positions. In the worst case, it advances a single position (as in the naive algorithm), but in other cases it can use one of the rules to move forward more positions (the maximum of the values provided by each of the rules).
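The way the two rules are combined can be illustrated in isolation. In the sketch below, the values of `occ` and `s` are assumed to be the ones the pre-processing produces for the pattern ACCA over the alphabet ACTG (they were obtained by tracing the code by hand), and `combined_shift` is a hypothetical helper, not a method of the class.

```python
# Assumed pre-processing results for pattern "ACCA", alphabet "ACTG".
occ = {"A": 3, "C": 2, "T": -1, "G": -1}  # bad-character dictionary
s = [3, 3, 3, 3, 1]                       # good suffix list; s[0] applies after a full match


def combined_shift(j, text_symbol):
    """Shift after a mismatch at pattern index j against text_symbol."""
    good_suffix = s[j + 1]
    bad_character = j - occ[text_symbol]  # may be negative; max() then discards it
    return max(good_suffix, bad_character)


print(combined_shift(3, "T"))  # 4: the bad-character rule wins (1 vs 4)
print(combined_shift(0, "C"))  # 3: the good suffix rule wins (3 vs -2)
print(combined_shift(3, "C"))  # 1: both rules allow a single position
```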

class BoyerMoore:

    def __init__(self, alphabet, pattern):
        self.alphabet = alphabet
        self.pattern = pattern
        self.preprocess()

    def preprocess(self):
        self.process_bcr()
        self.process_gsr()

    def process_bcr(self):
        # bad-character rule: rightmost position of each symbol (-1 if absent)
        self.occ = {}
        for symb in self.alphabet:
            self.occ[symb] = -1
        for j in range(len(self.pattern)):
            c = self.pattern[j]
            self.occ[c] = j

    def process_gsr(self):
        # good suffix rule: s[j+1] is the shift for a mismatch at position j,
        # s[0] is the shift after a full match; f is an auxiliary list
        self.f = [0] * (len(self.pattern) + 1)
        self.s = [0] * (len(self.pattern) + 1)
        i = len(self.pattern)
        j = len(self.pattern) + 1
        self.f[i] = j
        while i > 0:
            while j <= len(self.pattern) and self.pattern[i-1] != self.pattern[j-1]:
                if self.s[j] == 0:
                    self.s[j] = j - i
                j = self.f[j]
            i -= 1
            j -= 1
            self.f[i] = j
        j = self.f[0]
        for i in range(len(self.pattern)):
            if self.s[i] == 0:
                self.s[i] = j
            if i == j:
                j = self.f[j]

    def search_pattern(self, text):
        res = []
        i = 0
        while i <= len(text) - len(self.pattern):
            # compare the pattern with the text from right to left
            j = len(self.pattern) - 1
            while j >= 0 and self.pattern[j] == text[j+i]:
                j -= 1
            if j < 0:
                res.append(i)
                i += self.s[0]
            else:
                c = text[j+i]
                # move forward the maximum allowed by the two rules
                i += max(self.s[j+1], j - self.occ[c])
        return res


def test():
    bm = BoyerMoore("ACTG", "ACCA")
    # prints [5, 13, 23, 37]
    print(bm.search_pattern("ATAGAACCAATGAACCATGATGAACCATGGATACCCAACCACC"))


test()
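The positions reported by the test above can be cross-checked against a brute-force scan of the same sequence. The helper `naive_search` below is a compact stand-in written for this check (an assumption, not the code of the previous section); it tests every offset with `str.startswith`.

```python
def naive_search(pattern, text):
    """Return every offset where pattern occurs in text (overlaps included)."""
    last_start = len(text) - len(pattern)
    return [i for i in range(last_start + 1) if text.startswith(pattern, i)]


seq = "ATAGAACCAATGAACCATGATGAACCATGGATACCCAACCACC"
print(naive_search("ACCA", seq))  # [5, 13, 23, 37]
```

Any difference between this list and the one returned by search_pattern for the same inputs would point to an error in the pre-processing tables.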
