Implementing Our Own BLAST

In this section, we will build a very simplified version of BLAST, which we will name MyBlast. Although much simpler than BLAST itself, it will give an idea of some of the algorithmic constructions used to build these tools.

The first step in our MyBlast program will be to create a database. We will assume here the simplest hypothesis: the database will be a list of sequences (strings), and we will define below a function to load this database from a text file, where each sequence is on a separate line.

def read_database(filename):
    db = []
    with open(filename) as f:
        for line in f:
            db.append(line.rstrip())
    return db
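As a quick check of the file format assumed above, the following sketch writes a two-sequence toy database to a temporary file and loads it back. The file name `toy_db.txt` and the sequences are made up for illustration, and the function is restated in compact form only so that the block runs on its own.

```python
import os
import tempfile

def read_database(filename):
    """Load one sequence per line, stripping trailing whitespace."""
    with open(filename) as f:
        return [line.rstrip() for line in f]

# Hypothetical toy database: one sequence per line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_db.txt")
    with open(path, "w") as out:
        out.write("GGGTTTCCC\nGATAGCCTT\n")
    db = read_database(path)

print(db)
# ['GGGTTTCCC', 'GATAGCCTT']
```

Note that, with this simple format, an empty line in the file would be loaded as an empty sequence.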

The next step will be to pre-process the query, creating a map that identifies all words of size w (a parameter) occurring in it, keeping for each word the positions where it occurs. This technique is commonly named hashing and, although at first glance it may seem a waste of time, it is very useful when one needs to repeatedly find occurrences of these words in the query. The data structure used to keep these results will be a Python dictionary, since it allows very efficient access to the values associated with its keys.

Thus, the next function creates a dictionary (map) from the query sequence following this idea. The dictionary returned has the words in the query as keys, and lists of positions where these occur as values.

def build_map(query, w):
    res = {}
    for i in range(len(query) - w + 1):
        subseq = query[i:i+w]
        if subseq in res:
            res[subseq].append(i)
        else:
            res[subseq] = [i]
    return res
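To see what such a map looks like, here is a minimal sketch applied to a short made-up query. It uses `collections.defaultdict` from the standard library as an alternative way of writing the same idea; the name `build_map_sketch` is introduced here only for illustration and is not meant to replace the function above.

```python
from collections import defaultdict

def build_map_sketch(query, w):
    """Map each word of size w in query to the list of its start positions."""
    res = defaultdict(list)
    for i in range(len(query) - w + 1):
        res[query[i:i + w]].append(i)
    return dict(res)

# Made-up query; words of size 3 overlap by two characters.
word_map = build_map_sketch("AATATAT", 3)
print(word_map)
# {'AAT': [0], 'ATA': [1, 3], 'TAT': [2, 4]}
```

Words that occur more than once, such as "ATA" here, keep all of their positions, which is what later allows one word in a database sequence to produce several hits.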

In our simplified version of MyBlast, we will not use a substitution matrix; instead, we consider a match score of 1 and a mismatch score of 0, and we do not consider gaps. Thus, we only consider perfect hits, i.e. our threshold will be a score equal to w.

The next code chunk shows a function that, given a sequence, finds all matches of words from this sequence with the query. The result will be a list of hits, where each hit is a tuple of the form (index of the match in the query, index of the match in the sequence). Notice that we use the map created from the query by the previous function, instead of the query itself, which increases the efficiency of the search.

def get_hits(seq, m, w):
    res = []  # list of tuples
    for i in range(len(seq) - w + 1):
        subseq = seq[i:i+w]
        if subseq in m:
            l = m[subseq]
            for ind in l:
                res.append((ind, i))
    return res
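The following sketch shows the shape of the hit list on a small made-up example. The query map is written out by hand (it corresponds to the query "AATATAT" with words of size 3), and `get_hits_sketch` is a compact list-comprehension variant introduced here for illustration, assumed to produce hits in the same order as the loop above.

```python
def get_hits_sketch(seq, m, w):
    """Return (query_index, sequence_index) pairs for every shared word of size w."""
    return [(q_pos, s_pos)
            for s_pos in range(len(seq) - w + 1)
            for q_pos in m.get(seq[s_pos:s_pos + w], [])]

# Map of the made-up query "AATATAT" with w = 3, written out by hand.
query_map = {"AAT": [0], "ATA": [1, 3], "TAT": [2, 4]}

hits = get_hits_sketch("ATATGAAT", query_map, 3)
print(hits)
# [(1, 0), (3, 0), (2, 1), (4, 1), (0, 5)]
```

The word "ATA" at position 0 of the sequence yields two hits because it occurs twice in the query, at positions 1 and 3.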

The next step will be to extend the hits that were found by the previous function. Again, here, we will greatly simplify the process: the hit will be extended, in both directions, as long as the number of matching characters gained is at least half of the number of positions added by the extension. In each direction, the extension that is kept ends at the last matching position found. The result is provided as a tuple with the following fields: the starting index of the alignment on the query, the starting index of the alignment on the sequence, the size of the alignment, and the score (i.e. the number of matching characters).

def extends_hit(seq, hit, query, w):
    stq, sts = hit[0], hit[1]
    ## move forward
    matfw = 0
    k = 0
    bestk = 0
    while 2*matfw >= k and stq+w+k < len(query) and sts+w+k < len(seq):
        if query[stq+w+k] == seq[sts+w+k]:
            matfw += 1
            bestk = k+1
        k += 1
    size = w + bestk
    ## move backwards
    k = 0
    matbw = 0
    bestk = 0
    while 2*matbw >= k and stq > k and sts > k:
        if query[stq-k-1] == seq[sts-k-1]:
            matbw += 1
            bestk = k+1
        k += 1
    size += bestk
    return (stq-bestk, sts-bestk, size, w+matfw+matbw)
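To make the extension rule easier to follow, the sketch below factors the two loops into one hypothetical helper, `extend_direction`, that walks either forward (step +1) or backward (step -1). Both names and the example sequences are assumptions made for illustration; the intent is to apply the same stopping rule as the function above, not to replace it.

```python
def extend_direction(query, seq, q_pos, s_pos, step):
    """Extend from (q_pos, s_pos) in the given direction.

    Continues while at least half of the visited positions matched, and
    returns (length kept, matches), where the length kept ends at the
    last matching position.
    """
    matches, k, best_k = 0, 0, 0
    while (2 * matches >= k
           and 0 <= q_pos + step * k < len(query)
           and 0 <= s_pos + step * k < len(seq)):
        if query[q_pos + step * k] == seq[s_pos + step * k]:
            matches += 1
            best_k = k + 1
        k += 1
    return best_k, matches

def extends_hit_sketch(seq, hit, query, w):
    stq, sts = hit
    fw_len, fw_matches = extend_direction(query, seq, stq + w, sts + w, +1)
    bw_len, bw_matches = extend_direction(query, seq, stq - 1, sts - 1, -1)
    return (stq - bw_len, sts - bw_len,
            w + fw_len + bw_len, w + fw_matches + bw_matches)

# Made-up example: the word "ATA" occurs at index 1 of both strings.
query = "CATAGGCTA"
seq = "GATAGCCTT"
ext = extends_hit_sketch(seq, (1, 1), query, 3)
print(ext)
# (1, 1, 7, 6)
```

Here the forward extension tolerates the mismatch G/C at index 5 and stops being kept after the match at index 7, while the backward extension fails immediately (C versus G at index 0). The resulting alignment covers "ATAGGCT" and "ATAGCCT", with 6 matches out of 7 positions.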

The next function will identify the best alignment between the query and a given sequence, using the previous ones. We will identify all hits of size w and extend all those hits. The one with the best overall score (highest number of matches) will be selected; in case of a tie, the shorter alignment is preferred. The result is provided as a tuple with the format returned by the last function.

def hit_best_score(seq, query, m, w):
    hits = get_hits(seq, m, w)
    bestScore = -1.0
    best = ()
    for h in hits:
        ext = extends_hit(seq, h, query, w)
        score = ext[3]
        if score > bestScore or (score == bestScore and ext[2] < best[2]):
            bestScore = score
            best = ext
    return best
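The selection rule in this function (highest score first, then smallest size on ties) can be sketched with the built-in `max` and a composite key. The extension tuples below are written by hand for illustration and follow the format (query start, sequence start, size, score).

```python
# Hand-written extension tuples: (query start, sequence start, size, score).
extensions = [(0, 3, 8, 6), (1, 1, 7, 6), (1, 1, 4, 4)]

# Higher score wins; among equal scores, the smaller size wins.
# default=() mirrors the empty tuple returned when there are no hits.
best = max(extensions, key=lambda ext: (ext[3], -ext[2]), default=())
print(best)
# (1, 1, 7, 6)

no_hits = max([], key=lambda ext: (ext[3], -ext[2]), default=())
print(no_hits)
# ()
```

Two candidates reach a score of 6, and the one of size 7 is preferred over the one of size 8, since it achieves the same number of matches over fewer positions.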

The final step is to apply the previous functions to compare a query with all the sequences in the database. In this case, we will find the best alignment of the query with each sequence in the database, and then select the best overall alignment among these. The result will be a tuple similar to the ones described above, adding in the last position the index of the sequence with the best alignment. If no hit is found in any sequence, an empty tuple is returned.

def best_alignment(db, query, w):
    m = build_map(query, w)
    bestScore = -1.0
    res = (0, 0, 0, 0, 0)
    for k in range(0, len(db)):
        bestSeq = hit_best_score(db[k], query, m, w)
        if bestSeq != ():
            score = bestSeq[3]
            if score > bestScore or (score == bestScore and bestSeq[2] < res[2]):
                bestScore = score
                res = bestSeq[0], bestSeq[1], bestSeq[2], bestSeq[3], k
    if bestScore < 0:
        return ()
    else:
        return res
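To help read the five-field result, the sketch below takes a result tuple and recovers the two aligned substrings, recomputing the number of matches as a sanity check. The helper `describe_alignment`, the toy database and the result tuple are all assumptions written by hand for illustration, following the tuple format described above.

```python
def describe_alignment(db, query, res):
    """Return (query part, sequence part, recomputed matches) for a result tuple.

    res is assumed to be (query start, sequence start, size, score, db index),
    or an empty tuple when no alignment was found.
    """
    if res == ():
        return None
    stq, sts, size, score, idx = res
    q_part = query[stq:stq + size]
    s_part = db[idx][sts:sts + size]
    matches = sum(a == b for a, b in zip(q_part, s_part))
    return q_part, s_part, matches

# Hypothetical toy database, query and hand-written result tuple.
db = ["GGGTTTCCC", "GATAGCCTT"]
query = "CATAGGCTA"
res = (1, 1, 7, 6, 1)

q_part, s_part, matches = describe_alignment(db, query, res)
print(q_part)   # ATAGGCT
print(s_part)   # ATAGCCT
print(matches)  # 6
```

The recomputed number of matches agrees with the score field of the tuple, as expected under the match score of 1 and mismatch score of 0 used in this section.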
