BLAST Algorithm and Programs

7.2.1 Overview of the BLAST Algorithm

The BLAST algorithm is currently the most widely used method for searching DNA or protein sequence databases. The aim is to find good alignments between a query sequence and the sequences of a given database. The basic idea of the algorithm is to take short “words” (i.e. sub-sequences) and search for high-similarity matches (with no gaps) of these words between the query and the sequences in the database. These matches form a basis that can be extended, in both directions, to obtain high-quality alignments of larger size.

The main steps of the BLAST algorithm may be summarized as follows:

1. Remove regions of low complexity (e.g. sequence repeats) from the sequence that may compromise the quality of the alignment;

2. Obtain all possible “words” of size w (a parameter of the algorithm), i.e. sub-sequences of length w occurring in the query sequence;

3. For each word from the previous step, compile the list of all possible words of size w that can be defined over the allowable alphabet and whose alignment score against it (with no gaps) is higher than a threshold T (a parameter of the algorithm);

4. Search all sequences in the database for all occurrences of the words collected in the last step; each occurrence represents a match (hit) of size w between the query and one of the database sequences;

5. Extend all hits from the last step, in both directions, while the score follows a given criterion (typically, the criterion is dependent on the size of the extension);

6. Select the alignments from the previous step with the highest scores, normalized for their size (these are named high-scoring pairs, or HSPs).
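Steps 2 to 5 above can be sketched in a few lines of Python. This is only an illustration under several assumptions that are not part of the text: a DNA alphabet, a simple match/mismatch score of +1/−1 instead of a substitution matrix, and a "drop-off" stopping rule for the extension (stop once the score falls more than `drop` below the best value seen). All function names are hypothetical, and low-complexity filtering (step 1) and score normalization (step 6) are left out.

```python
from itertools import product

def query_words(query, w):
    """Step 2: all words of length w in the query, with their start positions."""
    return [(query[i:i + w], i) for i in range(len(query) - w + 1)]

def word_score(a, b, match=1, mismatch=-1):
    """Ungapped score between two words of equal length (assumed scoring)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(word, T, alphabet="ACGT"):
    """Step 3: all words over the alphabet whose score against `word` is above T."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) > T]

def find_hits(query, subject, w, T):
    """Step 4: (query_pos, subject_pos) pairs where a neighborhood word occurs."""
    hits = []
    for word, qpos in query_words(query, w):
        for nb in neighborhood(word, T):
            spos = subject.find(nb)
            while spos != -1:
                hits.append((qpos, spos))
                spos = subject.find(nb, spos + 1)
    return sorted(hits)

def extend_one_side(pairs, drop, match, mismatch):
    """Best score gain, and its length, when extending along aligned character pairs."""
    best = cur = best_len = 0
    for k, (x, y) in enumerate(pairs, start=1):
        cur += match if x == y else mismatch
        if cur > best:
            best, best_len = cur, k
        elif best - cur > drop:  # assumed stopping criterion
            break
    return best, best_len

def extend_hit(query, subject, qpos, spos, w, drop=2, match=1, mismatch=-1):
    """Step 5: ungapped extension of a hit in both directions.

    Returns (query_start, subject_start, length, score).
    """
    seed = word_score(query[qpos:qpos + w], subject[spos:spos + w], match, mismatch)
    right, rlen = extend_one_side(
        zip(query[qpos + w:], subject[spos + w:]), drop, match, mismatch)
    left, llen = extend_one_side(
        zip(reversed(query[:qpos]), reversed(subject[:spos])), drop, match, mismatch)
    return (qpos - llen, spos - llen, w + llen + rlen, seed + left + right)

query, subject = "ACGTTGCA", "TTACGTTGCATT"
hits = find_hits(query, subject, w=3, T=2)
print(hits)                                       # [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
print(extend_hit(query, subject, 2, 4, w=3))      # (0, 2, 8, 8)
```

With T = 2 and this scoring, only exact words pass the threshold; the six hits lie on the same diagonal and every one of them extends to the same alignment covering the whole query.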

In the most recent version of BLAST, from 1997, the criterion to extend alignments is more demanding, requiring two nearby hits with scores above T separated by a distance smaller than a given parameter. This change leads to fewer extensions and a more efficient algorithm, while the results obtained have been shown to be of similar quality. With this strategy, it is also possible to include gaps between the two hits, and as a result BLAST can return gapped alignments.
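The two-hit criterion can be illustrated with a small filter over a list of hits. The sketch below makes assumptions beyond what the text states: hits are given as (query position, database position) pairs, two hits only count as a pair when they lie on the same diagonal (same difference between the two positions) and do not overlap, and the names `two_hit_seeds` and `max_dist` are hypothetical.

```python
from collections import defaultdict

def two_hit_seeds(hits, w, max_dist):
    """Keep only hits preceded, on the same diagonal, by a non-overlapping hit
    that starts less than max_dist positions earlier (assumed pairing rule)."""
    by_diagonal = defaultdict(set)
    for qpos, spos in hits:
        by_diagonal[spos - qpos].add(qpos)
    seeds = []
    for diag, positions in by_diagonal.items():
        for cur in positions:
            if any(w <= cur - prev < max_dist for prev in positions):
                seeds.append((cur, cur + diag))
    return sorted(seeds)

hits = [(0, 2), (5, 7), (20, 22), (3, 30)]
print(two_hit_seeds(hits, w=3, max_dist=10))  # [(5, 7)]
```

In this toy example only one of the four hits triggers an extension, which is the source of the efficiency gain described above.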

Notice that the parameters w and T have a strong influence on the set of returned results and on the algorithm’s efficiency. Choosing a smaller value of w or T increases sensitivity, i.e. the number of interesting sequences found, but also increases the processing time.
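The effect of T can be made concrete by counting how many words pass the threshold for a single query word. The sketch below assumes a DNA alphabet and a +1/−1 match/mismatch score, neither of which is prescribed by the text; `neighborhood_size` is a hypothetical helper.

```python
from itertools import product

def neighborhood_size(word, T, alphabet="ACGT", match=1, mismatch=-1):
    """Count words of the same length whose ungapped score against `word` is above T."""
    count = 0
    for cand in product(alphabet, repeat=len(word)):
        score = sum(match if x == y else mismatch for x, y in zip(word, cand))
        if score > T:
            count += 1
    return count

for T in (3, 1, -1):
    print(T, neighborhood_size("ACGT", T))
# prints "3 1", then "1 13", then "-1 67"
```

Each decrease in T admits words with one more mismatch, so the list of words to look up in the database (and hence the number of hits to extend) grows quickly.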

7.2.2 BLAST Programs

The described algorithm gave rise to a number of programs specific to different sequence types, which can be installed locally or run on servers available on the web. The most important of these servers is maintained by NCBI and can be accessed at the URL https://www.ncbi.nlm.nih.gov/BLAST.

The main programs are BLASTN and BLASTP, for nucleotide and protein searches, respectively. BLASTN can be used to search different nucleotide databases for similar sequences. The default is nr/nt, a non-redundant collection of nucleotide sequences, but many alternatives exist, including restricting the search to the genome of a given species, using the RefSeq database, or considering only ribosomal RNA sequences (16S).

BLASTN can be optimized for highly similar sequences, which allows faster searches with longer sequences (using the megablast program). In BLASTN, a number of parameters can be set, including the word size w, whose default value is 11, and those defining the scoring function: the match/mismatch scores (default 2 and −3) and the gap penalties for opening and extension (default values of −5 and −2).
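To see how these scoring parameters combine, the following sketch scores an alignment that is already given as two gapped strings, using the default values quoted above. It assumes that a gap of length k costs the opening penalty plus k times the extension penalty; this convention, and the function name `alignment_score`, are assumptions made for the illustration.

```python
def alignment_score(a, b, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two aligned strings of equal length, where '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence holds the gap run currently open, if any
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != gap_side:
                score += gap_open  # a new gap starts in this column
            score += gap_extend    # every gap column pays the extension penalty
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score

print(alignment_score("ACGT", "ACGA"))          # 3
print(alignment_score("ACGTTGCA", "ACG--GCT"))  # -2
```

In the second example the five matches (+10) are outweighed by one mismatch (−3) and a gap of length two (−5 − 2 − 2 = −9).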

On the other hand, BLASTP can be used to search protein sequence databases, such as nr (the non-redundant set of protein sequences), RefSeq, UniProt (curated sequences from the SwissProt database) or sequences from the PDB database. The set of adjustable parameters is similar to that of BLASTN, with different default values: w is set to 6, while the scoring function uses a substitution matrix (BLOSUM62 by default) and gap penalties for opening and extension (default values of −11 and −1). For protein alignments, there are alternative programs which are not covered in this book, such as PSI-BLAST, PHI-BLAST and DELTA-BLAST.

Also, there are three other programs in the BLAST suite which may be used to search for sequences of a different type:

• BLASTX – takes a DNA sequence as query, but searches over protein sequences, and thus can be used to find potential protein products encoded by a nucleotide sequence; the DNA sequence is translated considering all 6 reading frames (relevant definitions and algorithms may be found in Chapter 4);

• TBLASTN – takes a protein sequence as query, but searches over DNA sequence databases, thus trying to identify database sequences encoding proteins similar to the one in the query; sequences from the databases are translated considering the 6 reading frames, prior to the alignments;

• TBLASTX – takes a DNA sequence as input, searching over DNA sequence databases, but in both cases the sequences are translated considering the 6 reading frames and the matches are searched over protein sequences (this leads to 36 comparisons); this method can be used to identify nucleotide sequences similar to the query based on their coding potential.
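The six-frame translation shared by these three programs can be sketched as follows. The example assumes the standard genetic code and an input made only of the upper-case letters A, C, G and T; the helper names are hypothetical, and stop codons are written as "*".

```python
# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate codon by codon; an incomplete trailing codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three frames of the given strand followed by three of its reverse complement."""
    rc = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]

frames = six_frames("ATGGCCTAA")
print(frames)  # ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

For TBLASTX, both the query and each database sequence yield six translations, hence the 6 × 6 = 36 comparisons mentioned above.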

7.2.3 Significance of the Alignments

One important aspect to discuss is the statistical significance of the alignments obtained. Although a detailed discussion of this subject is outside the scope of this book, we will briefly explain the underlying problems and present the main metrics returned by these programs to address this issue.

The scoring functions used by BLAST and dynamic programming, based on substitution matrices and gap penalties, are relative to the size of the sequences. They are therefore useful to compare different alignments for the same pair of sequences, but cannot be used to compare alignments of different sequences with distinct sizes. When comparing hits of a query against sequences of distinct sizes, normalized scores are needed. BLAST computes normalized scores using measures from information theory, and as such the normalized scores are provided in bits.
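As an illustration, the bit score is commonly written as S' = (λS − ln K) / ln 2, where S is the raw score and λ and K are statistical parameters that depend on the scoring system. The formula is not given in the text above, and the numerical values of λ and K used below are illustrative placeholders, not values to rely on.

```python
import math

def bit_score(raw_score, lam, K):
    """Convert a raw alignment score into bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# lam and K below are placeholder values chosen only for the example.
for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, lam=0.267, K=0.041), 1))
```

Because λ and K absorb the scale of the scoring function, bit scores obtained with different matrices or gap penalties become comparable.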

However, even these scores normalized for sequence size are not enough to compare any type of alignment, and thus cannot answer the question: is the similarity found between the query and the sequence, for a given alignment, statistically significant? Or, in other words, how probable is it that this similarity occurs by pure chance? These are important questions when inferring possible homology from sequence similarity.

The most popular metric to evaluate the significance of an alignment, provided as a result for each HSP in BLAST, is the E value. This value indicates the number of alignments with a score at least as high as that of the current HSP that are expected to occur by chance. Although we will not discuss details, this value is calculated taking into account the score of the alignment, the size of the database and the length of the sequences involved, as well as the parameters of the alignment.
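A simplified version of this calculation, not given in the text, is E = m · n · 2^(−S'), where S' is the score in bits, m is the query length and n is the total length of the database. The sketch below uses that approximation; it ignores the length corrections applied by the actual programs, and the function name is hypothetical.

```python
def expected_hits(bits, query_len, db_len):
    """Approximate E value: search space size times 2 to the power of minus the bit score."""
    return query_len * db_len * 2.0 ** (-bits)

print(expected_hits(10, 32, 32))  # 1.0
# Doubling the database size doubles E; each extra bit of score halves it.
print(expected_hits(10, 32, 64))  # 2.0
print(expected_hits(11, 32, 32))  # 0.5
```

This makes explicit why the same alignment can be significant in a small database and unremarkable in a very large one.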

The lower the value of E, i.e. the closer it is to zero, the more significant the match is. It is difficult to define a threshold for what is considered significant; values from 10^−5 up to 0.05 are used in different scenarios and by different authors.

When checking whether an alignment indicates homology, it is also important not to rely only on the E value, but to look at other results as well. Since BLAST returns local alignments, it is of foremost importance to look at the coverage of the alignment in the query (and also in the sequence found), i.e. at which parts of the overall sequences are included in the proposed alignment. This will help in understanding whether there might be a global homology or just similarity in parts of the sequences (e.g. protein domains or other local patterns).
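Coverage can be computed directly from the start and end positions of the HSPs on the query. The sketch below assumes 1-based inclusive coordinates and merges overlapping HSPs so that no position is counted twice; `query_coverage` is a hypothetical helper, not a function of any BLAST tool.

```python
def query_coverage(hsps, query_len):
    """Fraction of the query covered by at least one HSP.

    hsps: list of (start, end) positions on the query, 1-based and inclusive.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(hsps):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_len

print(query_coverage([(10, 20)], 100))           # 0.11
print(query_coverage([(1, 50), (40, 80)], 100))  # 0.8
```

A hit with a very low E value but a coverage of 0.11 would point to a shared local region, such as a domain, rather than to similarity along the whole sequence.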
