Biological Sequences: Representations and Basic Algorithms

As discussed in Chapter 3, genetic information in biological systems is encoded in DNA molecules. For many practical purposes, Bioinformatics algorithms and tools can represent these molecules as one-dimensional sequences of nucleotides.

Since DNA (or RNA) molecules contain four distinct types of nucleotides, the computational representation of these sequences consists of strings (i.e. sequences of characters) defined over an alphabet of four distinct symbols. For DNA, these symbols are the letters A, C, G, and T, which represent adenine, cytosine, guanine, and thymine, respectively. In RNA, the T is replaced by a U, representing uracil.

Although the basic alphabet for DNA only contains the symbols for the four nucleotides, the International Union of Pure and Applied Chemistry (IUPAC) has defined an extended set of symbols that allow ambiguity in the identification of a nucleotide. These symbols are useful, for instance, in sequencing results at positions where the nucleotide could not be identified with certainty, or in the design of Polymerase Chain Reaction (PCR) primers. Table 4.1 shows the symbols in this IUPAC alphabet and their meaning.

The other most important biological sequences are proteins, which are typically represented as sequences of amino acids. In the standard genetic code, there are 20 different encoded amino acids that can occur in protein sequences, and thus the strings are defined over an alphabet of 20 characters (see Table 4.2). In some cases, it is important to add to this alphabet a symbol representing the decoding of a stop codon, to be used, for instance, when running automatic tools for DNA translation, which we will cover in more detail in the next sections. The most commonly used symbols for the stop codon are the underscore (_) and the asterisk (*).

Bioinformatics Algorithms. DOI:10.1016/B978-0-12-812520-5.00004-3

Table 4.1: IUPAC symbols for nucleotides.

Symbol  Name        Nucleotides represented
A       Adenine     A
C       Cytosine    C
G       Guanine     G
T       Thymine     T
U       Uracil      U
K       Keto        G, T
M       Amino       A, C
R       Purine      A, G
S       Strong      C, G
W       Weak        A, T
Y       Pyrimidine  C, T
B       Not A       C, G, T
D       Not C       A, G, T
H       Not G       A, C, T
V       Not T       A, C, G
N       Any base    A, C, G, T
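One way to put Table 4.1 to work is to store it as a dictionary and use it to test whether a concrete DNA sequence is compatible with a pattern written in the extended alphabet, as happens with degenerate PCR primers. The following is only a minimal sketch: the names `IUPAC_DNA` and `matches_iupac` are assumptions made for this illustration, and the comparison is restricted to DNA (U is left out) and to patterns with the same length as the sequence.

```python
# Each IUPAC symbol of Table 4.1 mapped to the DNA nucleotides it can stand for.
# IUPAC_DNA and matches_iupac are illustrative names, not part of any library.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "M": "AC", "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_iupac(pattern, dna_seq):
    """Checks if dna_seq is compatible with an IUPAC pattern of the same length."""
    pattern = pattern.upper()
    dna_seq = dna_seq.upper()
    if len(pattern) != len(dna_seq):
        return False
    # Every concrete base must be among those allowed by the pattern symbol;
    # unknown pattern symbols allow nothing and therefore never match.
    return all(base in IUPAC_DNA.get(sym, "")
               for sym, base in zip(pattern, dna_seq))
```

For example, `matches_iupac("ATRYN", "ATGCA")` holds because R allows G, Y allows C, and N allows any base, whereas `matches_iupac("ATRYN", "ATCCA")` does not, since C is not a purine.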

Table 4.2: IUPAC symbols for amino acids.

Symbol  Name
A       Alanine
C       Cysteine
D       Aspartic acid
E       Glutamic acid
F       Phenylalanine
G       Glycine
H       Histidine
I       Isoleucine
K       Lysine
L       Leucine
M       Methionine
N       Asparagine
P       Proline
Q       Glutamine
R       Arginine
S       Serine
T       Threonine
V       Valine
W       Tryptophan
Y       Tyrosine

Since DNA, RNA, and protein sequences can be represented as strings, we can use many of the features and functions shown in Chapter 2 to process these sequences and extract meaningful biological results. As an example, let us define a function that checks if an input DNA sequence is valid, using some of the string functions discussed in Section 2.6.3.

Note that the first line of the function body documents the function: it is a comment that states the function's purpose and details the expected inputs and results. Such documentation will be used in most functions developed throughout the book to improve readability.

In the function code, the valid characters are counted and their sum is compared to the length of the string to check if all characters are valid. Similar functions could easily be built for RNA or protein sequences.

def validate_dna(dna_seq):
    """Checks if DNA sequence is valid. Returns True if sequence is valid, or False otherwise."""
    seqm = dna_seq.upper()
    # Count the occurrences of each valid symbol
    valid = seqm.count("A") + seqm.count("C") + seqm.count("G") + seqm.count("T")
    if valid == len(seqm):
        return True
    else:
        return False

>>> validate_dna("atagagagatctcg")
True

>>> validate_dna("ATAGAXTAGAT")
False
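The remark that similar functions could be built for RNA or protein sequences can be illustrated by making the alphabet a parameter. This is a sketch under assumptions: the function name `validate_seq` and the three alphabet constants are introduced here for illustration only, and the protein alphabet contains just the 20 standard symbols, so a stop symbol has to be added explicitly by the caller.

```python
# Illustrative alphabets (names are assumptions, not taken from the text).
DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def validate_seq(seq, alphabet):
    """Checks if all symbols of seq belong to the given alphabet."""
    # A set gives constant-time membership tests for each symbol.
    allowed = set(alphabet)
    return all(symbol in allowed for symbol in seq.upper())
```

With these definitions, `validate_seq("augcuua", RNA_ALPHABET)` is true, while `validate_seq("MKVL*", PROTEIN_ALPHABET)` is false unless the stop symbols are appended, as in `validate_seq("MKVL*", PROTEIN_ALPHABET + "_*")`. As with `validate_dna`, an empty string is reported as valid.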

Another useful example is the ability to calculate the frequency of the different symbols in the sequences, which will be the aim of the function shown in the next code block. The result of the function will be a dictionary, where the keys are the symbols and the corresponding values are frequencies. Note that this can be applied to any of the sequence types defined above (DNA, RNA and proteins).

def frequency(seq):
    """Calculates the frequency of each symbol in the sequence. Returns a dictionary."""
    dic = {}
    for s in seq.upper():
        if s in dic:
            dic[s] += 1
        else:
            dic[s] = 1
    return dic

>>> frequency("atagataactcgcatag")
{'A': 7, 'C': 3, 'G': 3, 'T': 4}

>>> frequency("MVVMKKSHHVLHSQSLIK")
{'H': 3, 'I': 1, 'K': 3, 'L': 2, 'M': 2, 'Q': 1, 'S': 3, 'V': 3}

This function will be used in the next program to calculate the frequency of the amino acids in a sequence read from the user's input. The frequencies will be printed from the most frequent amino acid to the least frequent. Note, in this example, the use of the lambda notation, which allows functions to be defined implicitly within blocks of code. In this case, the notation is used to select the second element of a tuple, and it is applied to each element of the list of (key, value) tuples obtained from the dictionary by calling the method items.

seq_aa = input("Protein sequence:")
freq_aa = frequency(seq_aa)
# Sort the (symbol, count) pairs by count, in descending order
list_f = sorted(freq_aa.items(), key=lambda x: x[1], reverse=True)
for (k, v) in list_f:
    print("Aminoacid:", k, ":", v)
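The counting and sorting steps above can also be expressed with `collections.Counter` from the Python standard library, which is a different technique from the dictionary and lambda approach used in the text. The sketch below assumes a hypothetical function name, `sorted_frequencies`, and takes the sequence as an argument instead of reading it from the user, which makes it easier to test.

```python
from collections import Counter

def sorted_frequencies(seq):
    """Returns (symbol, count) pairs from the most to the least frequent."""
    # Counter builds the symbol -> count mapping in a single pass.
    counts = Counter(seq.upper())
    # most_common() with no argument returns all pairs, ordered by count.
    return counts.most_common()
```

For the DNA example used earlier, `sorted_frequencies("atagataactcgcatag")` starts with `('A', 7)` followed by `('T', 4)`.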

To end this section, let us define a new function for a task that is quite useful in many cases. Here, we will compute the GC content of a DNA sequence, i.e. the percentage of 'G' and 'C' nucleotides in the sequence. Note that the function returns this value as a fraction between 0 and 1.

def gc_content(dna_seq):
    """Returns the fraction (between 0 and 1) of G and C nucleotides in a DNA sequence."""
    gc_count = 0
    for s in dna_seq:
        if s in "GCgc":
            gc_count += 1
    return gc_count / len(dna_seq)
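Two details of `gc_content` are worth keeping in mind: it returns a fraction rather than a value between 0 and 100, and it raises a `ZeroDivisionError` when given an empty string. The variant below is a sketch that addresses both. The name `gc_percent` is hypothetical, and returning 0.0 for an empty sequence is an assumption made here for convenience, not something prescribed by the text.

```python
def gc_percent(dna_seq):
    """Returns the GC content as a percentage (0 to 100).

    Returns 0.0 for an empty sequence (an assumption of this sketch).
    """
    if len(dna_seq) == 0:
        return 0.0
    seq = dna_seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)
```

For instance, `gc_percent("ATGC")` gives 50.0, since two of the four nucleotides are G or C.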

In many cases, for instance when looking for genes or exons, we need to compute the GC content of different parts of a sequence. The next function provides a solution for this task, computing the GC content of the non-overlapping sub-sequences of size k of the input sequence.

def gc_content_subseq(dna_seq, k=100):
    """Returns GC content of non-overlapping sub-sequences of size k. The result is a list."""
    res = []
    for i in range(0, len(dna_seq) - k + 1, k):
        subseq = dna_seq[i:i + k]
        gc = gc_content(subseq)
        res.append(gc)
    return res
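Note that `gc_content_subseq` only considers complete windows, so trailing nucleotides that do not fill a window of size k are ignored. A common generalization is to let the windows overlap by moving them fewer than k positions at a time. The sketch below illustrates this idea: the names `gc_fraction` and `gc_content_windows` and the `step` parameter are assumptions of this example, and `gc_fraction` is redefined locally only to keep the block self-contained.

```python
def gc_fraction(dna_seq):
    """Fraction of G and C in a non-empty sequence (same idea as gc_content)."""
    seq = dna_seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_content_windows(dna_seq, k=100, step=1):
    """Returns the GC fraction of each window of size k, advancing step
    positions at a time. With step == k the windows do not overlap.
    Assumes k >= 1 and step >= 1."""
    res = []
    # Only complete windows are considered, as in gc_content_subseq.
    for i in range(0, len(dna_seq) - k + 1, step):
        res.append(gc_fraction(dna_seq[i:i + k]))
    return res
```

As a small worked case, `gc_content_windows("GGCCAATT", k=4, step=2)` examines the windows GGCC, CCAA, and AATT, giving `[1.0, 0.5, 0.0]`, while `step=4` reproduces the non-overlapping behavior and gives `[1.0, 0.0]`.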

As a practical note, the functions developed in this section (and the next ones) can be put together in a file (e.g. named "sequences.py"), implementing a Python module. The main program (e.g. for testing) can then be defined either in this file, typically at the end, or in other files in the same folder, by using the import command (whose behavior was explained in the previous chapter).
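A possible layout for such a module is sketched below, with a small test routine placed at the end of the file. Only one function is included for brevity; the `test_validate_dna` function and the use of the `if __name__ == "__main__":` guard are choices made for this illustration, not requirements stated in the text.

```python
# Hypothetical contents of sequences.py (shortened to a single function).

def validate_dna(dna_seq):
    """Checks if DNA sequence is valid. Returns True if valid, False otherwise."""
    seqm = dna_seq.upper()
    valid = seqm.count("A") + seqm.count("C") + seqm.count("G") + seqm.count("T")
    return valid == len(seqm)

def test_validate_dna():
    """Minimal checks, using the examples shown earlier in this section."""
    assert validate_dna("atagagagatctcg")
    assert not validate_dna("ATAGAXTAGAT")

# This part runs only when the file is executed directly,
# not when the module is imported from another file.
if __name__ == "__main__":
    test_validate_dna()
    print("All tests passed.")
```

From another file in the same folder, the function would then be available through `from sequences import validate_dna`, and importing the module would not trigger the tests.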

