Processing Sequences With BioPython

The BioPython package [7] (http://www.biopython.org/) gathers a set of open-source software resources, written in Python, that address a large number of Bioinformatics tasks. It comes with abundant documentation that facilitates its use, including a very complete tutorial and cookbook [8]. It is one of the member projects of the Open Bioinformatics Foundation (OBF), which also hosts other free Bioinformatics software written in other languages, such as Perl, Ruby or Java.

The package can be easily installed with one of the package management tools covered in Section 2.4.4; instructions are provided on the package web page.

We will cover many of the functions available in this package in different chapters within this book, addressing in each the topics related to the presented algorithms.

In this section, we will start by addressing the way BioPython handles biological sequences.

Since the package is written using the object-oriented features of the Python language, we will show here its main classes related to sequence processing. The Seq class is the core one, allowing us to work with biological sequences of different types. A first point to note is that instances of this class are immutable, i.e. their content cannot be changed after creation.

The following example shows how to create a simple sequence by passing only a string as an argument to the constructor. All objects of the class Seq have an associated alphabet defining the symbols allowed in their content, which is itself an object of one of the classes that define the possible alphabets in BioPython. In this case, the alphabet will be an instance of the class Alphabet, which defines the most generic case.

Note that the examples are given in interactive mode, to be run in a Python console, although they could be inserted into runnable scripts with minor changes. The first line loads the corresponding module, which also serves to test whether your installation was done correctly.

>>> from Bio.Seq import Seq
>>> my_seq = Seq("ATAGAGAAATCGCTGC")
>>> my_seq
Seq('ATAGAGAAATCGCTGC', Alphabet())
>>> print(my_seq)
ATAGAGAAATCGCTGC
>>> my_seq.alphabet
Alphabet()

The following examples show how to define sequences specifying the desired alphabets according to the intended use. Here, we define two sequences, one of DNA and the other of a protein, indicating compatible alphabets. These are similar to the ones defined for the functions presented earlier in this chapter, including only non-ambiguous symbols.

>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("ATAGAGAAATCGCTGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('ATAGAGAAATCGCTGC', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
>>> my_prot = Seq("MJKLKVERSVVMSVLP", IUPAC.protein)
>>> my_prot
Seq('MJKLKVERSVVMSVLP', IUPACProtein())

Other alphabets are illustrated by the following example, where we can also learn how to check the content of each alphabet type. Note that in these examples, we assume the previous modules were imported in the above code chunks.

>>> IUPAC.unambiguous_dna.letters
'GATC'
>>> IUPAC.ambiguous_dna.letters
'GATCRYWSMKHBVDN'
>>> IUPAC.IUPACProtein.letters
'ACDEFGHIKLMNPQRSTVWY'
>>> IUPAC.ExtendedIUPACProtein.letters
'ACDEFGHIKLMNPQRSTVWYBXZJUO'
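To make the role of these letter sets concrete, the following plain-Python sketch checks a sequence against an alphabet. It does not use BioPython and makes no claim about what the Seq constructor itself verifies; the helper name invalid_symbols is hypothetical, and the letter strings are copied from the listing above. It shows, for instance, that the protein sequence used earlier contains the symbol J, which belongs only to the extended protein alphabet.

```python
# Letter sets copied from the IUPAC alphabets listed above.
UNAMBIGUOUS_DNA = "GATC"
IUPAC_PROTEIN = "ACDEFGHIKLMNPQRSTVWY"
EXTENDED_IUPAC_PROTEIN = "ACDEFGHIKLMNPQRSTVWYBXZJUO"


def invalid_symbols(seq, letters):
    """Return the sorted symbols of seq that are not in letters.

    Hypothetical helper for illustration; not part of BioPython.
    """
    return sorted(set(seq.upper()) - set(letters))


print(invalid_symbols("ATAGAGAAATCGCTGC", UNAMBIGUOUS_DNA))         # []
print(invalid_symbols("MJKLKVERSVVMSVLP", IUPAC_PROTEIN))           # ['J']
print(invalid_symbols("MJKLKVERSVVMSVLP", EXTENDED_IUPAC_PROTEIN))  # []
```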

Objects of the Seq class can be treated, for many purposes, as strings, since they support many of the functions and operators of lists and strings. Among these are indexing, slicing, concatenation (with the + operator), for loops and other iterators, the in operator, and functions such as len, upper, lower, find and count, among others. Let us see some examples:

>>> for i in my_seq: print(i)
A
T
...
>>> len(my_seq)
16
>>> my_seq[2:4]
Seq('AG', IUPACUnambiguousDNA())
>>> my_seq.count("G")
4
>>> "GAGA" in my_seq
True
>>> my_seq.find("ATC")
8
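As a rough illustration of how a class can behave like a string while remaining immutable, the sketch below defines a small wrapper in plain Python. The class MiniSeq is hypothetical and is not how BioPython implements Seq; it only shows which special methods give an object support for len, iteration, the in operator, indexing, slicing and concatenation.

```python
class MiniSeq:
    """Hypothetical, minimal immutable sequence wrapper (not BioPython's Seq)."""

    __slots__ = ("_data",)

    def __init__(self, data):
        # Bypass our own __setattr__ guard once, during construction.
        object.__setattr__(self, "_data", str(data))

    def __setattr__(self, name, value):
        raise AttributeError("MiniSeq objects are immutable")

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return iter(self._data)

    def __contains__(self, sub):
        return str(sub) in self._data

    def __getitem__(self, index):
        # A single position gives a one-letter string; a slice gives a new MiniSeq.
        result = self._data[index]
        return MiniSeq(result) if isinstance(index, slice) else result

    def __add__(self, other):
        return MiniSeq(self._data + str(other))

    def __str__(self):
        return self._data

    def __repr__(self):
        return f"MiniSeq({self._data!r})"

    def count(self, sub):
        return self._data.count(str(sub))

    def find(self, sub):
        return self._data.find(str(sub))


s = MiniSeq("ATAGAGAAATCGCTGC")
print(len(s), s[2:4], s.count("G"), "GAGA" in s, s.find("ATC"))
# prints: 16 AG 4 True 8
```

Since no __setitem__ method is defined, an assignment such as s[0] = "C" raises a TypeError, which is one simple way to obtain the immutability described earlier.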

It is important to note that, regarding concatenation, only sequences with compatible alphabets will be merged. Let us see two examples, where firstly we have two sequences from the same alphabet, and then two sequences from compatible alphabets to be merged. Notice that in the second case, the resulting alphabet is a third one, in this case, one that is more generic than the two original ones.

>>> seq1 = Seq("MEVRNAKSLV", IUPAC.protein)
>>> seq2 = Seq("GHERWKY", IUPAC.protein)
>>> seq1 + seq2
Seq('MEVRNAKSLVGHERWKY', IUPACProtein())
>>> from Bio.Alphabet import generic_nucleotide
>>> nuc_seq = Seq("ATAGAGAAATCGCTGC", generic_nucleotide)
>>> dna_seq = Seq("TGATAGAACGT", IUPAC.unambiguous_dna)
>>> nuc_seq + dna_seq
Seq('ATAGAGAAATCGCTGCTGATAGAACGT', NucleotideAlphabet())

The Seq class also implements a number of biologically relevant functions, many of them similar to the ones covered in previous sections. In the next example, we can observe how to calculate the transcription and the reverse complement of a DNA sequence.

>>> coding_dna = Seq("ATGAAGGCCATTGTAATGGGCCGC", IUPAC.unambiguous_dna)
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('GCGGCCCATTACAATGGCCTTCAT', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGAAGGCCAUUGUAAUGGGCCGC', IUPACUnambiguousRNA())
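For comparison, the two operations above can be sketched on plain strings, without BioPython. The function names below are chosen for illustration and the code is not the library's implementation; it assumes an upper-case sequence containing only the four unambiguous DNA bases.

```python
# Complement of each unambiguous DNA base.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the result."""
    return dna.translate(DNA_COMPLEMENT)[::-1]


def transcribe(dna):
    """Rewrite the coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


coding_dna = "ATGAAGGCCATTGTAATGGGCCGC"
print(reverse_complement(coding_dna))  # GCGGCCCATTACAATGGCCTTCAT
print(transcribe(coding_dna))          # AUGAAGGCCAUUGUAAUGGGCCGC
```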

Also, there are several functions that allow translating both DNA and RNA sequences, as shown below.

>>> rna_seq = Seq('AUGCGUUUAACU', IUPAC.unambiguous_rna)
>>> rna_seq.translate()
Seq('MRLT', IUPACProtein())
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Note that when doing translation, stop codons may occur in the sequence, which will be represented as '*' symbols. This leads to alphabets where this symbol is included, as shown in the example.

In the last example above, the translation was done using a codon translation table different from the standard one. In this case, the table used was the one that occurs in the translation of proteins synthesized in the mitochondria. There are several different codon tables in BioPython, including the "Standard", the "Vertebrate Mitochondrial", and the "Bacterial", which are used in different scenarios.
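The effect of the chosen table can be reproduced with a short plain-Python sketch. It is an illustration rather than BioPython's implementation: the names build_table and translate are hypothetical, and the two 64-letter strings are the amino acid layouts of the standard and vertebrate mitochondrial codes in TCAG codon order, as used in the NCBI genetic code tables. The sketch assumes unambiguous symbols and ignores any trailing partial codon.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...);
# "*" marks a stop codon.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO_AAS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def build_table(aas):
    """Map each DNA codon to its amino acid (hypothetical helper)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, aas))


def translate(seq, table):
    """Translate DNA or RNA codon by codon, keeping stop codons as '*'."""
    seq = seq.upper().replace("U", "T")   # treat RNA as DNA
    usable = len(seq) - len(seq) % 3      # drop a trailing partial codon
    return "".join(table[seq[i:i + 3]] for i in range(0, usable, 3))


standard_table = build_table(STANDARD_AAS)
mito_table = build_table(VERT_MITO_AAS)

coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate("AUGCGUUUAACU", standard_table))  # MRLT
print(translate(coding_dna, standard_table))      # MAIVMGR*KGAR*
print(translate(coding_dna, mito_table))          # MAIVMGRWKGAR*
```

The only difference between the last two results is the codon TGA, which is a stop codon in the standard code and codes for tryptophan (W) in the vertebrate mitochondrial code.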

The following examples show how to check some of the contents of each table, highlighting, in this case, some differences between the standard and the mitochondrial tables.

>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
>>> print(standard_table)
(...)
>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ATA"]
'M'
>>> standard_table.forward_table["ATA"]
'I'
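To list every codon on which the two tables disagree, rather than probing them one at a time, a plain-Python sketch such as the following can be used. It does not rely on BioPython; the two 64-letter strings are assumed to follow the TCAG codon order of the NCBI genetic code tables, and all names are chosen for illustration.

```python
from itertools import product

# The 64 DNA codons in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO_AAS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

# Keep only the codons whose meaning differs between the two codes.
differences = {
    codon: (std, mito)
    for codon, std, mito in zip(CODONS, STANDARD_AAS, VERT_MITO_AAS)
    if std != mito
}

for codon, (std, mito) in differences.items():
    print(codon, std, "->", mito)
# TGA * -> W
# ATA I -> M
# AGA R -> *
# AGG R -> *

# Stop codons of the vertebrate mitochondrial code, read from the same string.
mito_stops = [c for c, aa in zip(CODONS, VERT_MITO_AAS) if aa == "*"]
print(mito_stops)  # ['TAA', 'TAG', 'AGA', 'AGG']
```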
