Using BLAST Through BioPython

The BioPython package, covered in several of the previous chapters, has interfaces for running the BLAST programs, both when they are installed locally and through remote access to the servers. The functions provided by BioPython allow us to prepare BLAST queries (defining the query sequences and relevant parameters), execute them, retrieve the results, and process them. These functionalities are useful for automating these procedures, allowing their setup and execution on a large scale.

In the following, we address functions that run remote BLAST queries through the NCBI servers. We will not cover the use of locally installed BLAST programs, although many of the steps are similar.

The most relevant BioPython module for remote calls is Bio.Blast.NCBIWWW, whose core function is qblast. This function receives as parameters: the program to use (a string from the set: “blastn”, “blastp”, “blastx”, “tblastn” or “tblastx”), the database to search (a string, for instance “nr”, “nt” or “swissprot”) and the query sequence (a string with the sequence in FASTA format, or an NCBI identifier such as a GI number). Several optional parameters define the type of output (XML by default), the threshold for the E value, the substitution matrix to use, and the gap penalties, among others.
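As a hedged illustration of these optional parameters, the sketch below collects the arguments for a call in a dictionary before passing them on. The helper names build_qblast_args and run_qblast are hypothetical, as are the chosen values and the short placeholder sequence. The keyword names expect, matrix_name, gapcosts and hitlist_size are those accepted by qblast in recent BioPython releases, but they are worth checking against the installed version.

```python
VALID_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_qblast_args(program, database, sequence, **options):
    """Collect and check the arguments for a remote BLAST call (hypothetical helper)."""
    if program not in VALID_PROGRAMS:
        raise ValueError("unknown BLAST program: " + program)
    args = {"program": program, "database": database, "sequence": sequence}
    # Optional settings, e.g. expect (E value threshold), matrix_name,
    # gapcosts (open and extend costs as one string), hitlist_size.
    args.update(options)
    return args


def run_qblast(args):
    """Send the query to NCBI; needs BioPython and an Internet connection."""
    from Bio.Blast import NCBIWWW  # imported here so the helper above works offline
    return NCBIWWW.qblast(**args)


# Illustrative values only: a protein search with a stricter E value threshold.
example_args = build_qblast_args(
    "blastp", "swissprot", "MHSSALLCCLVLLTGVRA",
    expect=0.001, matrix_name="BLOSUM62", gapcosts="11 1", hitlist_size=20,
)
```

Calling run_qblast(example_args) would then submit the search; it is kept separate here so that the argument checking can be tried without contacting the server.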

The next example shows how to run a BLASTN search in the Python shell, over the non-redundant nucleotide database, using the sequence in the “example_blast.fasta” file (FASTA format).

>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("example_blast.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))

Notice that the third line reads the sequence from the FASTA file (check Section 4.8 for the details on the SeqIO module), while the last line executes the BLAST search with the given parameters (this can take a while since it is made over an Internet connection). The resulting variable is a handle for the results, which may be used to save them to a file, as shown in the next code block, or serve as input to the analysis functions shown afterwards. The next code chunk saves the BLAST results into an XML file.

>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(result_handle.read())
>>> save_file.close()
>>> result_handle.close()
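The same saving step can be written with context managers, so that both handles are closed even if writing fails. This is a minimal sketch: save_blast_results is a hypothetical helper, and an in-memory StringIO object stands in for the handle that qblast would return.

```python
import io
import os
import tempfile


def save_blast_results(result_handle, path):
    """Write everything readable from result_handle to path, then close both."""
    with result_handle, open(path, "w") as save_file:
        save_file.write(result_handle.read())


# Stand-in for a qblast handle, so the sketch runs without Internet access.
fake_handle = io.StringIO("<BlastOutput></BlastOutput>")
xml_path = os.path.join(tempfile.mkdtemp(), "my_blast.xml")
save_blast_results(fake_handle, xml_path)
```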

BioPython provides parsers for XML result files from BLAST that can be used to load files saved as shown in the previous examples, or other files obtained by running BLAST queries directly on the web servers.

If you have a result saved in an XML file “my_blast.xml”, the following code reads the file and returns a handle to be used in the analysis.

>>> result_handle = open("my_blast.xml")

The handle returned in this example can be used as input to the read function, when the file holds the results of a single BLAST search, or to the parse function, when there are multiple results. The result is an object of the class BlastRecord in the first case, or an iterator over objects of this class in the second, which can be traversed using a for loop or the function next.

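>>> from Bio.Blast import NCBIXML
>>> # File with a single search result:
>>> blast_record = NCBIXML.read(result_handle)
>>> # Alternatively, for a file with multiple results (on a freshly opened handle):
>>> blast_records = NCBIXML.parse(result_handle)
>>> for blast_record in blast_records:
...     pass  # process each BlastRecord here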

An object of the class BlastRecord contains all the information available from the result of a BLAST search, including the parameters used in the process. The results are organized hierarchically, with three levels of information:

• At the level of the BlastRecord, there is the list of alignments with the respective information, but also some of the general parameters used in the search: matrix, which keeps the substitution matrix; gap_penalties, which keeps the penalties used for gaps; database, which keeps the target database.

• At the level of each alignment, we can find the set of HSPs in the field hsps (notice that a single sequence can align with the query in different locations, giving rise to different HSPs), but also some information related to the full alignment of the query with a given sequence, such as: title, accession, hit_id and hit_def, providing the description, the accession number, the identifier and the definition of the sequence, or alignment.length, which provides the full length of the alignment.

• At the level of the HSP, we can find information specific to the local alignment, such as expect, score, query_start, sbjct_start, align_length, query, sbjct and match, which contain, respectively, the E value, the normalized score, the index in the query where the alignment starts, the index in the sequence where the alignment starts, the HSP length, the part of the query sequence in the HSP alignment, the part of the matched sequence in the alignment, and the matches between both.
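To make the three levels concrete, the following sketch walks a tiny hand-written fragment in the BLAST XML layout using only the standard library, rather than the BioPython parser. The fragment and its values are invented for illustration and omit most fields of a real result; the tag names follow the NCBI BLAST XML output format.

```python
import xml.etree.ElementTree as ET

# Invented miniature result: one search, one hit, one HSP.
TOY_XML = """<BlastOutput>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>toy protein [Homo sapiens]</Hit_def>
          <Hit_accession>TOY_0001</Hit_accession>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-50</Hsp_evalue>
              <Hsp_qseq>MHSSALL</Hsp_qseq>
              <Hsp_midline>MHSS LL</Hsp_midline>
              <Hsp_hseq>MHSSGLL</Hsp_hseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

root = ET.fromstring(TOY_XML)
summary = []
# Level 1: the whole search (what BlastRecord holds, e.g. the database).
database = root.findtext("BlastOutput_db")
for hit in root.iter("Hit"):
    # Level 2: one matched sequence (an alignment in BioPython terms).
    accession = hit.findtext("Hit_accession")
    for hsp in hit.iter("Hsp"):
        # Level 3: one local alignment (HSP) within that sequence.
        summary.append((database, accession, float(hsp.findtext("Hsp_evalue"))))

print(summary)
```

In practice the BioPython parser hides these tag names behind the attributes listed above, so this is only meant to show where each level comes from.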

A more complete reference of the contents at each level may be found in Section 7.4 of the BioPython tutorial [8]. The next example shows how to navigate this information, printing some of the fields in the different levels of information.

>>> E_threshold = 0.001
>>> for blast_record in blast_records:
...     for alignment in blast_record.alignments:
...         for hsp in alignment.hsps:
...             if hsp.expect < E_threshold:
...                 print("****Alignment****")
...                 print("sequence:", alignment.title)
...                 print("length:", alignment.length)
...                 print("e value:", hsp.expect)
...                 print(hsp.query[0:75] + "...")
...                 print(hsp.match[0:75] + "...")
...                 print(hsp.sbjct[0:75] + "...")
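The same filter can be packaged as a function that collects the significant HSPs instead of printing them. The sketch below is one possible arrangement; significant_hits is a hypothetical name, and the SimpleNamespace objects are made-up stand-ins carrying only the attributes used here, so that the example runs without a BLAST result at hand.

```python
from types import SimpleNamespace


def significant_hits(blast_records, e_threshold=0.001):
    """Return (title, E value) pairs for every HSP below the threshold."""
    found = []
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < e_threshold:
                    found.append((alignment.title, hsp.expect))
    return found


# Made-up stand-ins for parsed records (not real BLAST output).
toy_record = SimpleNamespace(alignments=[
    SimpleNamespace(title="hit A", hsps=[SimpleNamespace(expect=1e-30),
                                         SimpleNamespace(expect=0.5)]),
    SimpleNamespace(title="hit B", hsps=[SimpleNamespace(expect=2.0)]),
])
hits = significant_hits([toy_record])
```

With real data, the first argument would be the iterator returned by NCBIXML.parse.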

To further illustrate the use of these functions, let us consider a simple example and write a script that shows their different functionalities. We will use a FASTA file (“interl10.fasta”) containing a human protein, the interleukin-10 precursor, available at the URL: https://www.ncbi.nlm.nih.gov/protein/10835141.

The first part of our script, shown below, loads the FASTA file, checks the number of amino acids in the sequence (it should be 178), and runs the BLASTP query using the non-redundant protein sequence database and default parameters. Then, we save the BLASTP results to an XML file, so we can recover them later to parse the results.

Note that this step is not strictly needed, since we could use the handle (variable result_handle) directly later. However, since a remote BLAST search can take a while, saving the results means that the search only needs to run once; the interpretation of the results can then be run without further Internet access and without additional delay.

from Bio.Blast import NCBIXML
from Bio.Blast import NCBIWWW
from Bio import SeqIO

record = SeqIO.read(open("interl10.fasta"), format="fasta")
print(len(record.seq))

result_handle = NCBIWWW.qblast("blastp", "nr", record.format("fasta"))

save_file = open("interl-blast.xml", "w")
save_file.write(result_handle.read())
save_file.close()
result_handle.close()
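The idea of querying the server only once can be made explicit with a small cache check. In this sketch, cached_blast_xml and its run_query argument are hypothetical: run_query is any function that returns a readable handle, such as a wrapper around the qblast call above, and a dummy function is used here so the example runs offline.

```python
import io
import os
import tempfile


def cached_blast_xml(xml_path, run_query):
    """Return the XML text at xml_path, calling run_query only if the file is missing."""
    if not os.path.exists(xml_path):
        with run_query() as result_handle, open(xml_path, "w") as save_file:
            save_file.write(result_handle.read())
    with open(xml_path) as saved:
        return saved.read()


calls = []


def dummy_query():
    """Stand-in for the remote search; records each call."""
    calls.append(1)
    return io.StringIO("<BlastOutput/>")


path = os.path.join(tempfile.mkdtemp(), "interl-blast.xml")
first = cached_blast_xml(path, dummy_query)
second = cached_blast_xml(path, dummy_query)  # served from the saved file
```

A design note: checking only for the existence of the file is the simplest policy; a partially written file from an interrupted run would need to be deleted by hand.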

Next, we show how to open the XML file, recover the results from the BLAST search, and print some of the parameters used in the search.

result_handle = open("interl-blast.xml")
blast_record = NCBIXML.read(result_handle)
print("PARAMETERS:")
print("Database: " + blast_record.database)
print("Matrix: " + blast_record.matrix)
print("Gap penalties: ", blast_record.gap_penalties)

The next code chunk shows how we can access the first alignment (the one with the lowest E value), checking which sequence matches (showing accession number, identifier and definition), the alignment length and the number of HSPs (in this case a single one with the full alignment).

first_alignment = blast_record.alignments[0]
print("FIRST ALIGNMENT:")
print("Accession: " + first_alignment.accession)
print("Hit id: " + first_alignment.hit_id)
print("Definition: " + first_alignment.hit_def)
print("Alignment length: ", first_alignment.length)
print("Number of HSPs: ", len(first_alignment.hsps))

Since there is a single HSP, we can check the results in more detail using the following code:

hsp = first_alignment.hsps[0]
print("E-value: ", hsp.expect)
print("Score: ", hsp.score)
print("Length: ", hsp.align_length)
print("Identities: ", hsp.identities)
print("Alignment of the HSP:")
print(hsp.query)
print(hsp.match)
print(hsp.sbjct)
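A common summary derived from these fields is the percentage of identical positions, that is, the number of identities divided by the HSP length. A minimal sketch follows, where percent_identity is a hypothetical helper and the numbers are invented rather than taken from the search above.

```python
from types import SimpleNamespace


def percent_identity(hsp):
    """Percentage of aligned columns where query and subject are identical."""
    return 100.0 * hsp.identities / hsp.align_length


# Invented HSP stand-in: 45 identical positions in a 60-column alignment.
toy_hsp = SimpleNamespace(identities=45, align_length=60)
pid = percent_identity(toy_hsp)
```

The function only reads the identities and align_length attributes, so it can be applied to the hsp variable of the script as well.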

We may also wish to gather some relevant information about the top 10 alignments, getting the sequences that matched and the E value. This is done in the following code:

print("Top 10 alignments:")
for i in range(10):
    alignment = blast_record.alignments[i]
    print("Accession: " + alignment.accession)
    print("Definition: " + alignment.hit_def)
    for hsp in alignment.hsps:
        print("E-value: ", hsp.expect)
    print()
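Indexing with range(10) assumes that at least ten alignments were returned. A slightly more defensive variant, sketched below with the hypothetical helper top_alignments and invented stand-in objects, uses slicing so that shorter result lists are handled, and reports the lowest E value among the HSPs of each alignment.

```python
from types import SimpleNamespace


def top_alignments(blast_record, n=10):
    """Return (accession, definition, best E value) for the first n alignments."""
    rows = []
    for alignment in blast_record.alignments[:n]:  # slicing tolerates fewer than n
        best_e = min(hsp.expect for hsp in alignment.hsps)
        rows.append((alignment.accession, alignment.hit_def, best_e))
    return rows


# Invented record with only two alignments.
toy_record = SimpleNamespace(alignments=[
    SimpleNamespace(accession="ACC1", hit_def="protein one",
                    hsps=[SimpleNamespace(expect=1e-80), SimpleNamespace(expect=1e-5)]),
    SimpleNamespace(accession="ACC2", hit_def="protein two",
                    hsps=[SimpleNamespace(expect=3e-20)]),
])
rows = top_alignments(toy_record, n=10)
```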

As an example of a more specific task, we may be interested in checking which organisms are present in the top 20 alignments. This is done in the next code block, where regular expressions (check Section 5.5) are used to filter the hit_def field to obtain the relevant information.

import re

specs = []
for i in range(20):
    alignment = blast_record.alignments[i]
    definition = alignment.hit_def
    # The organism name appears between square brackets in the definition
    x = re.search(r"\[(.*?)\]", definition).group(1)
    specs.append(x)

print("Organisms:")
for s in specs:
    print(s)
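Calling group(1) directly raises an error when a definition contains no bracketed organism name. The sketch below shows one way to guard against that and to count how often each organism appears; organism_of is a hypothetical helper, and the definitions are invented examples in the usual "description [Organism]" style.

```python
import re
from collections import Counter

ORGANISM_PATTERN = re.compile(r"\[(.*?)\]")


def organism_of(definition):
    """Return the first bracketed organism name, or None if there is none."""
    match = ORGANISM_PATTERN.search(definition)
    return match.group(1) if match else None


# Invented hit definitions for illustration.
definitions = [
    "interleukin-10 precursor [Homo sapiens]",
    "interleukin-10 [Pan troglodytes]",
    "interleukin-10 [Homo sapiens]",
    "unnamed protein product",
]
counts = Counter(org for org in map(organism_of, definitions) if org is not None)
```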

Bibliographical Notes and Further Reading

BLAST was proposed in 1990 in a paper by Altschul et al. [11], while the improved version with gapped alignments was proposed in 1997 in [12].

FASTA provides an alternative suite to BLAST. It was first presented in [123], and can be found on several servers, including http://www.ebi.ac.uk/Tools/sss/fasta.

Exercises and Programming Projects

Exercises

1. Consider the function get_hits above. Create a variant that allows at most 1 character to be different between the sequence and the query words.

2. a. Write a function that, given two sequences of the same length, determines if they have at most d mismatches (d is an argument of the function). The function returns True if the number of mismatches is less than or equal to d, and False otherwise.

b. Using the previous function, find all approximate matches of a pattern p in a sequence. An approximate match of the pattern can have at most d characters that do not match (d is an argument of the function).

3. Search the UniProt database for the record of the human protein APAF (O14727). Save it in the FASTA format. Using BioPython, perform the following operations:

a. Load the file and check that the protein contains 1248 amino acids.

b. Using BLASTP, search for sequences with high similarity to this sequence in the “swissprot” database.

c. Check which global parameters were used in the search: the database, the substitution matrix, and the gap penalties.

d. List the best alignments returned, showing the accession numbers of the sequences, the E value of the alignments, and the alignment length.

e. Repeat the search restricting the target sequences to the organism S. cerevisiae (suggestion: use the argument entrez_query in the qblast function).

f. Check the results from the last operation, listing the best alignments, and checking carefully in each the start position of the alignment in the query and in the sequence.

g. What do you conclude about the existence of homologous genes in yeast for the human protein APAF?

Programming Projects

1. Improve the implementation of MyBlast by changing it to a class, which allows further configuration, and considering the following variants:

• Allow the database to be provided by a FASTA file, keeping in a dictionary the sequences and their descriptions.

• Consider extending only those hits that have other hits in a neighborhood of less than n positions (n should be a parameter of the function). In the extensions, you can use dynamic programming to achieve the best alignment of the characters between the hits.

• Add the option to use substitution matrices in the calculation of the matches. In this case, you should pre-compute all possible words of size w that score over T (T is a parameter) against words in the query, after mapping it. Adjust the hit extension for this scenario, creating criteria for stopping extensions based on scores and extension length.

• Adapt the functions to allow returning a ranking of the best alignments, and not only the one that scores highest.

Multiple Sequence Alignment

In this chapter, we generalize the alignment problem to consider multiple sequences and discuss the impacts on the problem complexity and available algorithms. We review the main classes of optimization algorithms for this problem and discuss their main advantages and limitations. Then, we focus on progressive alignment and implement a simplified version of one of these algorithms in Python. We close the chapter by reviewing how to handle alignments in BioPython.
