Sequence Annotation Objects in BioPython

As already discussed, functions that operate on biological sequences become far more useful when they can work with sequences stored in files, given the typical size of these sequences. The SeqIO module provides a number of functions to read and write sequences from/to files in a wide range of formats, which may contain not only the sequences themselves but also meta-information in the form of annotations. These annotations attach information to the whole sequence, or to parts of it, that can be related to its biological function or other relevant knowledge.

The SeqRecord class is the basic container for sequences and their annotations in BioPython, and is therefore used to hold the results of the reading functions. A SeqRecord object has the following fields:

• seq: the sequence itself, an object of the Seq class;

• id: the sequence identifier;

• name: the sequence name;

• description: a description of the sequence;

• annotations: global annotations for the whole sequence (a dictionary where the keys are annotation types – unstructured properties – and the values are the specific values of those properties for the sequence);

• features: structured features, a list of SeqFeature objects, which can apply to the whole sequence or to parts of it;

• letter_annotations: possible annotations for each letter (position) in the sequence;

• dbxrefs: references to databases.
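The fields listed above can be inspected on a small record built by hand. The following is a minimal sketch, assuming BioPython is installed; the identifier, description, cross-reference, and the "quality" key are all made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# All identifiers and values below are hypothetical.
record = SeqRecord(
    Seq("ATGAATGATAGC"),
    id="DEMO0001.1",
    name="DEMO0001",
    description="hypothetical demo sequence",
    dbxrefs=["Project:00000"],
)

# Global annotations: a plain dictionary attached to the whole sequence.
record.annotations["organism"] = "Unknown organism"

# Per-letter annotations: one value per position, so the list length
# must match the sequence length.
record.letter_annotations["quality"] = [30] * len(record.seq)

print(record.id, record.name, record.description)
print(record.annotations)
print(record.features)  # no structured features have been added yet
print(record.dbxrefs)
print(record.letter_annotations["quality"][:3])
```

Records read from files are filled in the same way, with the parser deciding which fields the file format can populate.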

To understand how features are organized, we need to look at the SeqFeature class, which keeps structured information about the annotations of a sequence. The structure of this object closely follows the content of the records in the Genbank and EMBL databases.

The main attributes of a SeqFeature object are:

• location – indicates the region of the sequence affected by this annotation, as a FeatureLocation object;

• type – string stating the feature type;

• qualifiers – additional information about the feature, kept as a property-value dictionary.
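The three attributes can be seen together on a feature created by hand. This is a minimal sketch, assuming BioPython is installed; the locus tag and note are hypothetical values, and the strand is set on the location object.

```python
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Hypothetical gene on the reverse strand, covering positions 5..17
# (Python-style: start included, end excluded).
gene = SeqFeature(
    FeatureLocation(5, 18, strand=-1),
    type="gene",
    qualifiers={"locus_tag": ["DEMO_0001"], "note": ["made-up example"]},
)

print(gene.type)
print(gene.location)
print(int(gene.location.start), int(gene.location.end), gene.location.strand)
print(gene.qualifiers["locus_tag"])
print(len(gene))  # number of positions covered by the location
```

Qualifier values are stored as lists here, mirroring what the earlier listing of locus tags from a Genbank record displays.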

The FeatureLocation class represents regions of a sequence in a flexible manner, allowing fixed or fuzzy ranges of positions to be defined. Positions can be either exact or fuzzy and, in the latter case, there are several options, such as AfterPosition, BeforePosition or BetweenPosition, which specify a value greater than, less than, or in between defined values. The following example illustrates some of the possibilities, which we will not explore fully here.

>>> from Bio import SeqFeature
>>> start = SeqFeature.AfterPosition(10)
>>> end = SeqFeature.BetweenPosition(40, left=35, right=40)
>>> my_location = SeqFeature.FeatureLocation(start, end)
>>> print(my_location)
[>10:(35^40)]
>>> int(my_location.start)
10
>>> int(my_location.end)
40

The example creates a FeatureLocation object from the two defined positions. The int function forces each fuzzy position to be treated as a single value.
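The other position types behave in the same way. The sketch below, assuming BioPython is installed, builds a second, made-up location that starts before position 5 and shows how length and membership tests rely on the collapsed integer values.

```python
from Bio.SeqFeature import (
    AfterPosition,
    BeforePosition,
    BetweenPosition,
    ExactPosition,
    FeatureLocation,
)

# Same fuzzy location as in the text: starts somewhere after 10 and
# ends somewhere between 35 and 40.
fuzzy = FeatureLocation(AfterPosition(10), BetweenPosition(40, left=35, right=40))

# Hypothetical location that starts before position 5 and ends exactly at 20.
partial = FeatureLocation(BeforePosition(5), ExactPosition(20))

for loc in (fuzzy, partial):
    # int() collapses each fuzzy position to its single default value.
    print(loc, int(loc.start), int(loc.end), len(loc))

# Membership tests also use the collapsed integer boundaries.
print(15 in fuzzy, 40 in fuzzy)
```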

Defining feature locations is useful in many scenarios, for instance to extract, from the whole sequence, the part to which a given feature applies. The next example shows such a scenario, where a gene feature is defined on part of a sequence, more concretely on the reverse complement strand. The extract method is used here to obtain the nucleotides at positions 5 to 17 of the sequence (the end position, 18, is excluded), returned as their reverse complement.

>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_seq = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTT")
>>> # strand=-1 marks the reverse complement strand
>>> example_feat = SeqFeature(FeatureLocation(5, 18, strand=-1), type="gene")
>>> feature_seq = example_feat.extract(example_seq)
>>> print(feature_seq)
AGCCTTTGCCGTC
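What extract does for a simple location can be reproduced with plain string operations. The following sketch does not use BioPython; the function name extract_region is hypothetical, and it only handles upper-case, unambiguous DNA letters.

```python
# Complement table for unambiguous DNA letters (upper case only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_region(sequence, start, end, strand=1):
    """Return sequence[start:end], reverse-complemented when strand is -1."""
    region = sequence[start:end]
    if strand == -1:
        region = region.translate(COMPLEMENT)[::-1]
    return region


parent = "ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTT"
print(extract_region(parent, 5, 18, strand=-1))  # AGCCTTTGCCGTC
print(extract_region(parent, 5, 18, strand=1))   # GACGGCAAAGGCT
```

The first result matches the output of the BioPython example, which confirms that the coordinates refer to the sequence as given and that the strand only affects how the slice is returned.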

It is possible to create a SeqRecord object and fill it field by field, including features and annotations, as shown in the next example.

>>> from Bio.Seq import Seq
>>> seq = Seq("ATGAATGATAGCTGAT")
>>> from Bio.SeqRecord import SeqRecord
>>> seq_rec = SeqRecord(seq)
>>> seq_rec.id = "ABC12345"
>>> seq_rec.description = "My own sequence."
>>> seq_rec.annotations["role"] = "unknown"
>>> seq_rec.annotations
{'role': 'unknown'}
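Features can be attached to a record in a similar field-by-field manner. The sketch below, assuming BioPython is installed, appends one made-up feature (its type, coordinates, and note are illustrative choices) and then extracts the region it covers.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

seq_rec = SeqRecord(
    Seq("ATGAATGATAGCTGAT"), id="ABC12345", description="My own sequence."
)

# Hypothetical feature: a region on the forward strand covering the
# first 15 positions of the sequence.
region = SeqFeature(
    FeatureLocation(0, 15, strand=1),
    type="misc_feature",
    qualifiers={"note": ["illustrative region"]},
)
seq_rec.features.append(region)

for feat in seq_rec.features:
    print(feat.type, feat.location, feat.extract(seq_rec.seq))
```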

However, it is more common and convenient to have this as the result of a reading function.

The SeqIO module provides two main functions to read information from files in different formats: the read function reads a single SeqRecord object, with one sequence and possibly its annotations, while the parse function is able to read multiple records, returning an iterator over SeqRecord objects.
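The difference between the two functions can be tried without any file on disk. The sketch below, assuming BioPython is installed, keeps two made-up FASTA records in memory and wraps them in a StringIO handle, which both functions accept in place of a file name.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records kept in memory instead of a file.
fasta_text = ">seqA first demo\nACGTACGT\n>seqB second demo\nGGCCTTAA\n"

# parse: iterate over every record in the handle.
ids = [rec.id for rec in SeqIO.parse(StringIO(fasta_text), "fasta")]
print(ids)

# read: expects exactly one record, so this two-record input is rejected.
try:
    SeqIO.read(StringIO(fasta_text), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("read() refused the input:", err)
```

In practice, read is the convenient choice for single-record files, while parse is the safe default when the number of records is unknown.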

To exemplify the use of the functions from the SeqIO module, we will use the complete sequence, and its annotation, of a plasmid (pPCP1) of the Yersinia pestis bacterium, more specifically from the strain Yersinia pestis biovar Microtus str. 91001, with the accession number NC_005816 in Genbank. The corresponding record, from the RefSeq database, can be found at the following URL: https://www.ncbi.nlm.nih.gov/nuccore/NC_005816. We will save the contents of this record in two different formats: FASTA and Genbank.

The first example shows how to read a single sequence from a file in the FASTA format. This format allows a minimal representation of sequences, where each sequence has a header row (starting with the > symbol) with some meta-information (typically identifiers and a brief description), followed by a number of rows with the sequence itself. In the example, the file was named “NC_005816.fna”. We can verify that the sequence has 9609 nucleotides.

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> record
SeqRecord(seq=Seq("TGTAACGAACGGTGCAAT..."), ...)
>>> len(record.seq)
9609
>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
>>> record.annotations
{}
>>> record.features
[]

When a minimal format such as FASTA is used, most of the fields in the resulting records will be empty, as can be checked from the results of the previous example (for instance, there are no annotations or features), since this information is not available in this format. Also, the id and description fields are filled partly by guesswork, using rules that are common in certain tools when saving FASTA files but are not universally accepted standards.
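The rule behind that guesswork can be approximated in a few lines. The sketch below is plain Python rather than BioPython's own parser, and the function name split_fasta_header is hypothetical; it takes the first whitespace-delimited word of the header as the identifier and keeps the full header as the description, which is consistent with the values shown for the FASTA record above.

```python
def split_fasta_header(header_line):
    """Split a FASTA header into (identifier, description) using a common rule."""
    title = header_line.lstrip(">").strip()
    # The first whitespace-delimited word is taken as the identifier;
    # the description keeps the full header text, identifier included.
    identifier = title.split(None, 1)[0] if title else ""
    return identifier, title


header = (
    ">gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus "
    "str. 91001 plasmid pPCP1, complete sequence"
)
seq_id, description = split_fasta_header(header)
print(seq_id)
print(description)
```

Headers written by tools that follow other conventions may therefore produce identifiers that need cleaning before further use.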

A different scenario occurs if a richer format is provided. Let us illustrate this by saving the previous record in the Genbank format, which is able to store all annotations and features of the sequence, i.e. the information that is displayed when accessing the URL given above.

In this case, the file name is similar but with the extension “.gb”. Notice that the format needs to be supplied as the second argument of the read function. While the seq field, i.e. the sequence itself, is the same, the other fields show different content. To start, the id, name, and description fields can now be filled correctly. Also, the annotations and features fields have now gained richer content, loaded from the RefSeq record.

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record.seq
Seq('TGTAACGAACGGTGCAATC...CTG', IUPACAmbiguousDNA())
>>> print(len(record.seq))
9609
>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'
>>> len(record.annotations)
11
>>> len(record.features)
29

The annotations field, as stated above, is a dictionary that provides a number of properties for the sequence as a whole. Let us check, in the next example, some of the information it contains in this case.

>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'
>>> record.annotations["taxonomy"]
['Bacteria', 'Proteobacteria', 'Gammaproteobacteria',
 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
>>> record.annotations["date"]
'23-MAY-2013'
>>> record.annotations["gi"]
'45478711'

To explore the 29 features in this record further, we will first count how many of them correspond to annotated genes:

>>> feat_genes = []
>>> for feat in record.features:
...     if feat.type == "gene":
...         feat_genes.append(feat)
...
>>> len(feat_genes)
10
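The same kind of count can be obtained for every feature type at once. The sketch below, assuming BioPython is installed, uses a toy record with made-up features standing in for the parsed Genbank record, and tallies the types with collections.Counter.

```python
from collections import Counter

from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record with made-up features, standing in for a parsed Genbank record.
record = SeqRecord(Seq("ATGGCCAAATAAATGCCCTGA"), id="DEMO")
record.features = [
    SeqFeature(FeatureLocation(0, 21), type="source"),
    SeqFeature(FeatureLocation(0, 12, strand=1), type="gene"),
    SeqFeature(FeatureLocation(0, 12, strand=1), type="CDS"),
    SeqFeature(FeatureLocation(12, 21, strand=1), type="gene"),
]

# Tally every feature type in one pass.
type_counts = Counter(feat.type for feat in record.features)
print(type_counts)

# The gene list from the text, written as a list comprehension.
feat_genes = [feat for feat in record.features if feat.type == "gene"]
print(len(feat_genes))
```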

Let us now explore these genes a bit further, getting their locus tag, strand, and location. Note that strand 1 is the one represented by the sequence itself, while -1 stands for the reverse complement.

>>> for f in feat_genes:
...     print(f.qualifiers['locus_tag'], f.strand, f.location)
...
['YP_pPCP01'] 1 [86:1109](+)
['YP_pPCP02'] 1 [1105:1888](+)
['YP_pPCP03'] 1 [2924:3119](+)
...

Using the extract function, we can try to find the proteins encoded by each of these genes.

>>> for f in feat_genes:
...     print(f.extract(record.seq).translate(table="Bacterial", cds=True))
...
MVTFETVMEIKILHKQGMSSRAIARELGISRNTVKRYLQAKSEPP ...

It is left as an exercise for the reader to check that there are also 10 features of the type “CDS” (coding sequence), each of which has a qualifier key named translation that keeps the encoded protein, and which should therefore be equal to the result obtained above.
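The shape of such a check can be sketched on a toy record, leaving the real plasmid record for the exercise. The example below assumes BioPython is installed; the sequence, the coordinates, and the stored translation "MAK" are made up.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqRecord import SeqRecord

# Toy record: one made-up CDS whose stored translation is "MAK".
record = SeqRecord(Seq("CCATGGCCAAATAAGG"), id="DEMO")
record.features.append(
    SeqFeature(
        FeatureLocation(2, 14, strand=1),
        type="CDS",
        qualifiers={"translation": ["MAK"]},
    )
)

matches = []
for feat in record.features:
    if feat.type != "CDS":
        continue
    # cds=True validates the start and stop codons and drops the final stop.
    protein = feat.extract(record.seq).translate(table="Bacterial", cds=True)
    stored = feat.qualifiers["translation"][0]
    matches.append(str(protein) == stored)
    print(protein, stored)

print(matches)
```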

There are cases where we need to read several sequences from a single file, which can be done by using the parse function. The file used in the example below can be obtained from the book’s web site or from the BioPython tutorial [8], and contains sequences of ribosomal RNA (rRNA) genes from different species of orchids.

In the example, we go through all the records, printing their description and collecting the organisms in a list.

>>> all_species = []
>>> for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
...     print(seq_record.description)
...     all_species.append(seq_record.annotations["organism"])
...
C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA.
C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
...
>>> print(all_species)
['Cypripedium irapeanum', 'Cypripedium californicum', ..., 'Paphiopedilum barbatum']

It is also important to mention that BioPython has a number of functions to retrieve sequences directly from databases and process them afterwards. In the next example, we show how to retrieve a set of sequences from NCBI, providing the GI identifiers as a comma-separated string.

>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "example@gmail.com"
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",
...                        id="6273291,6273290,6273289")
>>> for seq_record in SeqIO.parse(handle, "gb"):
...     print(seq_record.id, seq_record.description[:100], "...")
...     print("Sequence length:", len(seq_record))
...
AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for chloroplast product, partial intron sequence. ...
Sequence length: 902
...
>>> handle.close()

To conclude this section, and this chapter, we will check how to write records to a file. The write function from the SeqIO module writes the contents of one (or several) SeqRecord objects to a file in different formats, receiving as arguments a list (or iterator) of SeqRecord objects, the file name and a string defining the format. On the other hand, the convert function can be used to directly convert records from one format to another, a very useful task in a bioinformatician’s daily work. So, the two code blocks below have equivalent behavior.

from Bio import SeqIO

# Block 1: parse the Genbank file, then write the records as FASTA
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")

# Block 2: the same conversion in a single call
count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fasta", "fasta")

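Both functions also accept open handles instead of file names, which makes them easy to try without touching the disk. The sketch below, assuming BioPython is installed, writes two made-up records to an in-memory FASTA handle and then converts that text to the tab-separated "tab" format; the record identifiers and sequences are illustrative.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records; empty descriptions keep the FASTA headers short.
records = [
    SeqRecord(Seq("ACGT"), id="seqA", description=""),
    SeqRecord(Seq("GGCC"), id="seqB", description=""),
]

# write: returns the number of records written to the handle.
fasta_out = StringIO()
n_written = SeqIO.write(records, fasta_out, "fasta")
fasta_text = fasta_out.getvalue()
print(n_written)
print(fasta_text)

# convert: parse one format and write another in a single call.
tab_out = StringIO()
n_converted = SeqIO.convert(StringIO(fasta_text), "fasta", tab_out, "tab")
print(n_converted)
print(tab_out.getvalue())
```

Both calls return the number of records handled, which is a cheap sanity check after a conversion.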
Exercises and Programming Projects

Exercises

1. Write a program that reads a DNA sequence, converts it to capital letters, and counts how many nucleotides are purines and pyrimidines.

2. Write a program that reads a DNA sequence and checks if it is equal to its reverse complement.

3. Write and test a function that, given a DNA sequence, returns the total number of “CG” duplets contained in it.

4. Write and test a Python function that, given a DNA sequence, returns the size of the first protein that can be encoded by that sequence (in any of the three reading frames). The function should return -1 if no protein is found.

5. Write and test a function that, given an aminoacid sequence, returns a logic value indicating whether the sequence can be a protein or not.

6. Write and test a Python function that, given a DNA sequence, creates a map (dictionary) with the frequencies of the aminoacids it encodes (assuming the translation is initiated in the first position of the DNA sequence). Stop codons should be ignored.

7. Write a program that reads an aminoacid sequence and a DNA sequence and prints the list of all sub-sequences of the DNA sequence that encode the given protein sequence.

8. Write a function that, given a sequence as an argument, detects whether there are repeated sub-sequences of size k (the second argument of the function). The result should be a dictionary where keys are sub-sequences and values are the number of times they occur (at least 2). Use the function in a program that reads the sequence and k and prints the result by decreasing frequency.

Programming Projects

1. Taking as your basis the class MySeq developed above, implement sub-classes for the three distinct types of biological sequences: DNA, RNA, and proteins. In each, define an appropriate constructor. Redefine the methods from the parent class where you feel this is necessary or useful. Adapt the types of the outputs in each method accordingly.

2. The package random includes a number of functions that generate random numbers. Using some of those functions, build a module that implements the generation of random DNA sequences and the analysis of mutations over these sequences. You can include functions to generate random sequences of a given size, to simulate the occurrence of a given number of mutations in a DNA sequence at random positions (including insertions, deletions, and substitutions), and functions to study the impact of mutations on the proteins encoded by those sequences.

Finding Patterns in Sequences

In this chapter, we discuss how to find patterns in sequences and the importance of this task in Bioinformatics. We put forward the basic algorithms for pattern finding and discuss their complexity. Heuristic algorithms are presented to lower the average computational complexity of this task, by suitable pre-processing of the patterns to search. Also, we present regular expressions as a way to find more flexible patterns in sequences, showing how these can be implemented in Python, and discussing their biological relevance with some examples.
