In the previous section, we looked at translation without considering some of the rules observed in real life. In practice, the translation of a protein always begins with a specific codon (the start codon, “ATG” in the standard table), which codes for the aminoacid Methionine (‘M’). This aminoacid is always the first in a protein, but it can also occur in other positions. Likewise, the translation process terminates when a stop codon is found.
Thus, the translation function developed in the previous section can only be used if the initial position coincides with a start codon, and the sequence terminates in a stop codon, i.e. the DNA sequence passed is a coding DNA sequence.
However, in many cases, we are given a DNA sequence (e.g. from a genome sequencing project) and we do not know in advance where the coding regions are. In these cases, we may be interested in scanning the DNA sequence to search for putative genes or coding regions.
The first step in searching for these regions of interest is to take a DNA (or RNA) sequence and compute the reading frames. A reading frame is a way of dividing the DNA sequence into a set of consecutive non-overlapping triplets (possible codons) (see Section 3.2.2 and Table 3.2). A given sequence has three possible reading frames, starting in the first, second, and third positions. In addition, since there is a complementary strand, we should also compute the three frames corresponding to the reverse complement.
In this context, the following function computes the translation of the six different reading frames, given a DNA sequence. This allows us to scan all possibilities for protein coding regions in a given region of the genome.
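To make the idea of a reading frame concrete, the following minimal sketch splits a short sequence into triplets for each of the three forward frames. The helper name `split_codons` and the example sequence are introduced here only for illustration; they are not part of the functions developed in this chapter.

```python
def split_codons(seq, frame):
    """Splits seq into consecutive non-overlapping triplets, starting at
    offset frame (0, 1 or 2). Trailing bases that do not complete a
    triplet are discarded."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "ATGAAATAG"  # illustrative sequence, not taken from the text
for frame in range(3):
    print(frame, split_codons(seq, frame))
# 0 ['ATG', 'AAA', 'TAG']
# 1 ['TGA', 'AAT']
# 2 ['GAA', 'ATA']
```

Applying the same split to the reverse complement of the sequence would give the remaining three frames.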
def reading_frames(dna_seq):
    """Computes the six reading frames of a DNA sequence (including the reverse complement)."""
    assert validate_dna(dna_seq), "Invalid DNA sequence"
    res = []
    res.append(translate_seq(dna_seq, 0))
    res.append(translate_seq(dna_seq, 1))
    res.append(translate_seq(dna_seq, 2))
    rc = reverse_complement(dna_seq)
    res.append(translate_seq(rc, 0))
    res.append(translate_seq(rc, 1))
    res.append(translate_seq(rc, 2))
    return res
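As a rough, self-contained illustration of what the six frames look like, the sketch below pairs a compact version of the six-frame computation with simplified stand-ins for `validate_dna`, `translate_seq` and `reverse_complement`. These stand-ins are assumptions made for this example only and may differ from the versions defined earlier in the chapter; the codon table is built from the standard genetic code, with ‘_’ as the stop symbol.

```python
# Hypothetical stand-ins for helpers defined earlier in the chapter.
BASES = "TCAG"
# Standard genetic code in TCAG order, with "_" marking stop codons.
AMINOACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINOACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def validate_dna(dna_seq):
    """True if the sequence contains only the symbols A, C, G and T."""
    return all(c in "ACGT" for c in dna_seq.upper())

def reverse_complement(dna_seq):
    """Complements each base and reverses the sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(dna_seq.upper()))

def translate_seq(dna_seq, ini_pos=0):
    """Translates complete codons starting at ini_pos; leftover bases are ignored."""
    seq = dna_seq.upper()
    return "".join(CODON_TABLE[seq[p:p + 3]]
                   for p in range(ini_pos, len(seq) - 2, 3))

def reading_frames(dna_seq):
    """Three frames of the sequence followed by three of its reverse complement."""
    assert validate_dna(dna_seq), "Invalid DNA sequence"
    rc = reverse_complement(dna_seq)
    return [translate_seq(s, k) for s in (dna_seq, rc) for k in range(3)]

print(reading_frames("ATGAAATAG"))
# ['MK_', '_N', 'EI', 'LFH', 'YF', 'IS']
```

Only the first frame of this illustrative sequence contains a start codon followed by a stop codon, which is the pattern the search for coding regions looks for.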
Having computed the reading frames, the next step is to find possible proteins that can be encoded within these frames. This task amounts to finding the so-called open reading frames (ORFs), which are reading frames that have the potential to be translated into protein.
The next function starts to deal with this problem by extracting all possible proteins from an aminoacid sequence. The function goes through the sequence: when an ‘M’ is found, it starts a new putative protein (kept in the current_prot list); each subsequent aminoacid is appended to every putative protein currently in this list; and when a stop symbol (‘_’) is found, all proteins in the list are added to the result and the list is emptied.
def all_proteins_rf(aa_seq):
    """Computes all possible proteins in an aminoacid sequence.
    Returns list of possible proteins."""
    aa_seq = aa_seq.upper()
    current_prot = []
    proteins = []
    for aa in aa_seq:
        if aa == "_":
            if current_prot:
                for p in current_prot:
                    proteins.append(p)
                current_prot = []
        else:
            if aa == "M":
                current_prot.append("")
            for i in range(len(current_prot)):
                current_prot[i] += aa
    return proteins
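One way to check one's understanding of this function is to restate it differently. The sketch below is an alternative formulation, not the method used in the text: it cuts the aminoacid sequence at each stop symbol and, within each stop-terminated segment, reports the suffix starting at every ‘M’. The name `all_proteins_by_split` and the example string are hypothetical.

```python
def all_proteins_by_split(aa_seq, stop="_"):
    """Alternative sketch: every 'M' inside a stop-terminated segment
    starts a putative protein that runs up to (not including) the stop."""
    # The last piece returned by split() is not followed by a stop symbol,
    # so it is discarded, as an unterminated protein would be.
    segments = aa_seq.upper().split(stop)[:-1]
    proteins = []
    for seg in segments:
        for i, aa in enumerate(seg):
            if aa == "M":
                proteins.append(seg[i:])
    return proteins

print(all_proteins_by_split("AMKLMV_QMT_MX"))
# ['MKLMV', 'MV', 'MT']
```

In this example, the first segment contains two ‘M’ symbols and therefore yields two overlapping putative proteins, while the trailing "MX" is dropped because no stop symbol follows it.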
This function, together with the previous one, allows us to compute all putative proteins in all reading frames. The next function starts by computing the translation of the reading frames and then goes through the six aminoacid sequences, searching for all possible proteins with the previous function and gathering them into a resulting list.
def all_orfs(dna_seq):
    """Computes all possible proteins for all open reading frames."""
    assert validate_dna(dna_seq), "Invalid DNA sequence"
    rfs = reading_frames(dna_seq)
    res = []
    for rf in rfs:
        prots = all_proteins_rf(rf)
        for p in prots:
            res.append(p)
    return res
Since the lists returned by this last function may contain a large number of proteins in real world scenarios, it is useful to order the list by the size of the putative proteins and to filter it by a minimum size. This makes sense, since small putative proteins may occur frequently by chance, while a protein pattern with more than a few tens of aminoacids is much less likely to occur by chance. Indeed, notice that, if all 64 codons are taken to be equally likely, the probability of a stop codon is 3/64, around 5%, and therefore a stop codon is expected roughly every 20 aminoacids.
The next function is, therefore, an improved version of the previous one that inserts the proteins into the list in order of size and keeps only those longer than minsize. The ordering is achieved by the auxiliary function provided below, which inserts each protein in the right position, keeping the resulting list ordered by decreasing size.
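The size argument can be made a little more quantitative with a short calculation. The sketch below assumes, as a simplification, that codons are independent and that all 64 are equally likely; real genomes deviate from this, so the numbers are only indicative. The names `P_STOP` and `prob_no_stop` are introduced here for illustration.

```python
# Assumption: independent codons, each of the 64 equally likely.
P_STOP = 3 / 64  # three stop codons in the standard genetic code

def prob_no_stop(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - P_STOP) ** n_codons

# Mean number of codons up to and including the first stop
# (mean of a geometric distribution with success probability P_STOP).
expected_codons_to_stop = 1 / P_STOP

print(round(expected_codons_to_stop, 1))
for n in (20, 50, 100):
    print(n, round(prob_no_stop(n), 4))
```

Under these assumptions, a run of 20 codons is free of stop codons about 38% of the time, whereas a run of 100 codons is stop-free less than 1% of the time, which is why a minimum size is an effective filter.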
def all_orfs_ord(dna_seq, minsize=0):
    """Computes all possible proteins for all open reading frames.
    Returns ordered list of proteins with minimum size."""
    assert validate_dna(dna_seq), "Invalid DNA sequence"
    rfs = reading_frames(dna_seq)
    res = []
    for rf in rfs:
        prots = all_proteins_rf(rf)
        for p in prots:
            if len(p) > minsize:
                insert_prot_ord(p, res)
    return res

def insert_prot_ord(prot, list_prots):
    i = 0
    while i < len(list_prots) and len(prot) < len(list_prots[i]):
        i += 1
    list_prots.insert(i, prot)
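An alternative to ordered insertion, shown here only as a sketch and not as the approach taken in the text, is to collect the proteins first and sort them once with Python's built-in `sorted`. The function name `filter_and_sort` and the sample list are hypothetical; the strict comparison with `minsize` mirrors the one used in all_orfs_ord.

```python
def filter_and_sort(prots, minsize=0):
    """Keeps proteins longer than minsize and orders them by decreasing size."""
    kept = [p for p in prots if len(p) > minsize]
    # sorted() is stable, so proteins of equal size keep their original order.
    return sorted(kept, key=len, reverse=True)

print(filter_and_sort(["MK", "MKLV", "MA", "MKLAA", "MKL"], minsize=2))
# ['MKLAA', 'MKLV', 'MKL']
```

The two approaches may order proteins of equal size differently: the insertion routine places a new protein before existing ones of the same size, while the stable sort preserves the order in which they were found.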