PROTEIN TYPE SPECIFIC AMINO ACID SUBSTITUTION MODELS FOR
INFLUENZA VIRUSES
A Thesis Submitted for the degree of
MASTER OF COMPUTER SCIENCE
BY
Nguyen Van Sau
Information Technology University of Engineering and Technology Vietnam National University Hanoi, 144 Xuan Thuy, Ha Noi, Viet Nam
MAY 2012
© (Sau Nguyen Van), 2012. All rights reserved.
Parts of this thesis have been published in the following articles:
1. Nguyen Van Sau, Dang Cao Cuong, Le Si Quang, Le Sy Vinh, "Protein Type Specific Amino Acid Substitution Models for Influenza Viruses," 2011 Third International Conference on Knowledge and Systems Engineering (KSE), pp. 98-103, 2011.
Contents
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER 1 OVERVIEW
1.1 Motivation
1.2 Organization of this thesis
CHAPTER 2 AMINO ACID SUBSTITUTION MODELS
2.1 Amino acid sequences
2.2 Amino-acid substitution models
CHAPTER 3 METHODS TO ESTIMATE MODELS
4.1 Methods
4.1.1 Counting methods
4.1.2 Maximum likelihood methods
4.2 Protein type specific amino acid substitution models estimation
CHAPTER 4 DATA PREPARATION
3.1 Collecting data
3.2 Categorizing data
3.3 Splitting data
3.4 Aligning data
CHAPTER 5 RESULTS
CHAPTER 6 SUMMARY AND CONCLUSION
APPENDIX
BIBLIOGRAPHY
LIST OF FIGURES
Figure 1 Growth of number of base pairs in NCBI from April 2002 to June 2011
Figure 2 Different shapes of Γ-distribution with respect to shape parameter
Figure 3 The four-step approach to estimate protein type specific amino acid substitution models
Figure 4 Link to download influenza virus' data
Figure 5 The Robinson-Foulds distances between trees inferred using FLU and 11 protein type specific models for protein of Influenza A viruses
LIST OF TABLES
Table 1 Statistical number of deaths present
Table 2 Twenty different amino acids
Table 3 Data of 11 protein types of influenza A viruses
Table 4 Classification data into 11 subgroups
Table 5 Summary of FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models when analyzing their corresponding protein type sequences
Table 6 Pairwise comparisons between FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models in terms of log likelihoods
Table 7 The results of the five best models when analyzing HA sequences
Table 8 The results of the five best models when analyzing NA sequences
Table 9 Log likelihood comparison among HA (NA) model and other models when analyzing HA (NA) protein sequences
Table 10 Correlations among 12 models
NOTATIONS/ABBREVIATIONS
WHO: World Health Organization
RF: Robinson and Foulds
MLE: Maximum Likelihood Estimate
EMBL: European Molecular Biology Laboratory
NCBI: National Center for Biotechnology Information
DDBJ: DNA Data Bank of Japan
BLAST: Basic Local Alignment Search Tool
MAS: Multiple Alignment Sequences
ML: Maximum Likelihood
MP: Maximum Parsimony
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’
Hanoi, May 30th, 2012 Signed ………
ABSTRACT
The amino acid substitution model (matrix) is a crucial part of protein sequence analysis systems. General amino acid substitution models have been estimated from large protein databases. However, they are not specific to influenza viruses. In a previous study, we estimated an amino acid substitution model, FLU, for all influenza viruses. Experiments showed that FLU outperformed other models when analyzing influenza protein sequences.
Influenza virus genomes consist of different protein types, which differ in both structure and evolutionary process. Although the FLU matrix is specific to influenza viruses, it is still not specific to individual influenza protein types. Since influenza viruses cause serious problems for both human health and the economy, it is worthwhile to study them as specifically as possible.
In this thesis, we used more than 27 million amino acids to estimate 11 protein type specific models for influenza viruses. Experiments showed that the protein type specific models outperformed the FLU model, previously the best model for influenza viruses. These protein type specific models help researchers to conduct studies on influenza viruses more precisely.
CHAPTER 1 OVERVIEW
1.1 Motivation
Influenza viruses cause many deaths and pose substantial economic risks. According to the World Health Organization (WHO, http://www.who.int/en/), the first recorded influenza pandemic began in Europe and spread over Asia and Africa in 1580. The biggest epidemic, the Spanish influenza, is believed to have killed between 20 million and 40 million people worldwide. The “Asian Flu” began in China and killed 1 million people worldwide in the 20th century. Since then, further pandemics have continued to occur (see Table 1 for more information).
Table 1 Statistical number of deaths present
(For each year, the first column gives the number of confirmed cases, C, and the second the number of deaths, D.)
Country       2003    2004    2005    2006    2007    2008    2009    2010    2011    Total
               C   D   C   D   C   D   C   D   C   D   C   D   C   D   C   D   C   D   C   D
Azerbaijan     0   0   0   0   0   0   8   5   0   0   0   0   0   0   0   0   0   0   8   5
Bangladesh     0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   2   0   3   0
Cambodia       0   0   0   0   4   4   2   2   1   1   1   0   1   0   1   1   7   7  17  15
China          1   1   0   0   8   5  13   8   5   3   4   4   7   4   2   1   0   0  40  26
Djibouti       0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0
Egypt          0   0   0   0   0   0  18  10  25   9   8   4  39   4  29  13  32  12 151  52
Indonesia      0   0   0   0  20  13  55  45  42  37  24  20  21  19   9   7   7   5 178 146
Iraq           0   0   0   0   0   0   3   2   0   0   0   0   0   0   0   0   0   0   3   2
Laos           0   0   0   0   0   0   0   0   2   2   0   0   0   0   0   0   0   0   2   2
Myanmar        0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   1   0
Nigeria        0   0   0   0   0   0   0   0   1   1   1   0   0   0   0   0   0   0   1   1
Pakistan       0   0   0   0   0   0   0   0   3   1   0   0   0   0   0   0   0   0   3   1
Thailand       0   0  17  12   5   2   3   3   0   0   0   0   0   0   0   0   0   0  25  17
Turkey         0   0   0   0   0   0  12   4   0   0   0   0   0   0   0   0   0   0  12   4
Viet Nam       3   3  29  20  61  19   0   0   8   5   6   5   5   5   7   2   0   0 119  59
Total          4   4  46  32  98  43  11
Figure 1 shows the number of influenza sequences recorded up to 2011. More information about the growth of the GenBank database is available from the National Center for Biotechnology Information at ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt.
Figure 1 Growth of number of base pairs in NCBI from April 2002 to June 2011
Studying the amino acid substitution models of influenza sequences is an important task in bioinformatics. Influenza virus genomes consist of different protein types, which differ in both structure and evolutionary process. Although the FLU matrix (Cuong Cao Dang, Quang Si Le, Olivier Gascuel and Vinh Sy Le, 2010) is specific to influenza viruses, it is still not specific to individual influenza protein types. Since influenza causes serious problems for both human health and the economy, it is worthwhile to study these viruses as specifically as possible. We apply a maximum likelihood method to build protein type specific amino acid substitution models for influenza viruses.
1.2 Organization of this thesis
Chapter 2: We introduce amino acid sequences, amino acid substitution models, and related methods.
Chapter 3: We discuss the maximum likelihood methods for estimating protein type specific amino acid substitution models for influenza viruses.
Chapter 4: We present our procedure to collect and process the data used to estimate protein type specific amino acid substitution models.
Chapter 5: We discuss the results of our models in comparison with the best current model for influenza viruses.
Chapter 6: We summarize the content of this thesis.
Finally, MUSCLE, the PHYLIP package, and XRATE are described in the Appendix.
CHAPTER 2 AMINO ACID SUBSTITUTION MODELS
2.1 Amino acid sequences
There are 20 amino acids, as listed in Table 2 (Marco Salemi, Anne-Mieke Vandamme, 2003). Generally, an amino acid sequence is represented by a sequence of these amino acids.
Table 2 Twenty different amino acids
Name    Three-letter abbreviation    One-letter abbreviation
An amino acid can be represented by a three-letter or one-letter abbreviation code, as described in Table 2. For example, below is an amino acid sequence of 471 amino acids:
MNPNQKIITIGSISLGLVVFNVLLHVVSIIVTVLVLGKGGNNGICNETVVREYNETVRIEKVTQWHNTNV
VEYVPYWNGGTYMNNTEAICDAKGFAPFSKDNGIRIGSRGHIFVIREPFVSCSPIECRTFFLTQGSLLND
KHSNGTVKDRSPFRTLMSVEVGQSPNVYQARFEAVAWSATACHDGKKWMTVGVTGPDSKAVAVIHYGGVP
TDVVNSWAGDILRTQESSCTCIQGDCYWVMTDGPANRQAQYRIYKANQGRIIGQTDISFNGGHIEECSCY
PNDGKVECVCRDGWTGTNRPVLVISPDLSYRVGYLCAGIPSDTPRGEDTQFTGSCTSPMGNQGYGVKGFG
FRQGTDVWMGRTISRTSRSGFEILRIKNGWTQTSKEQIRKQVVVDNLNWSGYSGSFTLPVELSGKDCLVP
CFWVEMIRGKPEEKTIWTSSSSIVMCGVDYEVADWSWHDGAILPFDIDKM
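As an illustration of the one-letter and three-letter codes, the following Python sketch converts between the two notations and computes the amino acid composition of a sequence. The dictionary holds the standard codes of the twenty amino acids; the function names and the ten-residue fragment (the start of the sequence above) are chosen here for illustration only.

```python
from collections import Counter

# Standard one-letter -> three-letter codes for the twenty amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def to_three_letter(sequence):
    """Translate a one-letter sequence into a list of three-letter codes."""
    return [ONE_TO_THREE[aa] for aa in sequence]

def composition(sequence):
    """Return the relative frequency of each amino acid found in the sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: counts[aa] / total for aa in counts}

fragment = "MNPNQKIITI"  # first ten residues of the example sequence
print("-".join(to_three_letter(fragment)))
print(composition(fragment))
```

Frequencies computed this way over a whole data set are one simple source for the amino acid frequency vector used by the substitution models discussed in the next section.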
There are different kinds of mutations when comparing two amino acid sequences (Brown, 2002):
Substitution: one amino acid is replaced by another.
Deletion: one or more amino acids are deleted from the sequence.
Insertion: one or more amino acids are inserted into the sequence.
During evolution, amino acid sequences can be changed by one or more mutations. In the next section, we present amino acid substitution models.
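The three kinds of mutation can be read off a pairwise alignment column by column. The sketch below is a minimal illustration, assuming two already aligned sequences of equal length in which "-" marks a gap and the first sequence is treated as the ancestor; the function name and the two toy sequences are hypothetical. It counts affected residues, not mutation events, so a gap of three columns counts as three.

```python
def classify_changes(ancestor, descendant, gap="-"):
    """Count column-wise differences between two aligned sequences."""
    if len(ancestor) != len(descendant):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for a, d in zip(ancestor, descendant):
        if a == gap and d == gap:
            continue                     # column carries no information for this pair
        if a == gap:
            counts["insertion"] += 1     # residue present only in the descendant
        elif d == gap:
            counts["deletion"] += 1      # residue lost in the descendant
        elif a == d:
            counts["match"] += 1
        else:
            counts["substitution"] += 1  # one amino acid exchanged for another
    return counts

print(classify_changes("MNPNQK-ITI", "MNPSQKIIT-"))
```

Substitution models describe only the substitution columns; insertions and deletions are handled separately by the alignment step.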
2.2 Amino-acid substitution models
Amino acid substitution models describe the evolutionary process of protein sequences and are used for analyzing them. These models are usually used to infer protein phylogenetic trees under maximum likelihood or Bayesian frameworks (Felsenstein, 2004; Yang, 2006). They are also used to estimate pairwise distances between protein sequences that subsequently serve as inputs for distance-based phylogenetic analyses (Opperdoes, 2003). Moreover, these models can be used for aligning protein sequences (Setubal C, Meidanis J, 1997). These and other applications of amino acid substitution models are reviewed by Thorne (2000).
The substitution process between amino acids is modeled by a time-homogeneous, time-continuous, time-reversible, and stationary Markov process (Felsenstein, 2004; Yang, 2006; Strimmer K, Haeseler AV, 2003). Amino acid sites are assumed to evolve independently, and the process is assumed to have remained constant throughout evolution. There are twenty different amino acids, so a time-reversible model has $20 \times 19 / 2 = 190$ exchangeability parameters and 19 free frequency parameters to be estimated. Counting methods or maximum likelihood methods are used to estimate these parameters (see Chapter 3).
The amino acid model is a $20 \times 20$ matrix $Q = \{q_{xy}\}$, where $q_{xy}$ ($x \neq y$) is the number of substitutions from amino acid $x$ to amino acid $y$ per time unit. The diagonal elements $q_{xx}$ are assigned such that the sum of each row equals zero. The matrix $Q$ can be decomposed into a symmetric exchangeability rate matrix $R = \{r_{xy}\}$ and an amino acid frequency vector $\pi = \{\pi_x\}$ such that $q_{xy} = \pi_y r_{xy}$ for $x \neq y$ and $q_{xx} = -\sum_{y \neq x} q_{xy}$.
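The decomposition of the rate matrix into exchangeabilities and frequencies can be sketched in a few lines of Python. This is a minimal illustration, not the estimation code used in this thesis: the exchangeability matrix (called R here) and the frequency vector (called pi here) are hypothetical values for a 4-state toy alphabet, whereas a real amino acid model has 20 states. The rescaling to one expected substitution per unit time is a common convention and is included as an assumption.

```python
def build_rate_matrix(R, pi):
    """Build Q with q_xy = pi_y * r_xy for x != y and rows summing to zero."""
    n = len(pi)
    Q = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                Q[x][y] = pi[y] * R[x][y]
        Q[x][x] = -sum(Q[x])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return Q

def normalize(Q, pi):
    """Rescale Q so that the expected rate -sum_x pi_x q_xx equals 1 (assumed convention)."""
    rate = -sum(pi[x] * Q[x][x] for x in range(len(pi)))
    return [[q / rate for q in row] for row in Q]

# Hypothetical symmetric exchangeabilities and frequencies for 4 states.
R = [[0, 1, 2, 1],
     [1, 0, 1, 3],
     [2, 1, 0, 1],
     [1, 3, 1, 0]]
pi = [0.1, 0.2, 0.3, 0.4]
Q = normalize(build_rate_matrix(R, pi), pi)
for row in Q:
    print([round(q, 3) for q in row])
```

Because R is symmetric, the resulting Q satisfies pi_x q_xy = pi_y q_yx, which is the time-reversibility property assumed above.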
Given a multiple sequence alignment $D$ of $n$ sites, its phylogenetic tree $T$, and the model $Q$, the likelihood of $D$ is calculated by
$$L(D \mid T, Q) = \prod_{i=1}^{n} L(D_i \mid T, Q)$$
where $L(D_i \mid T, Q)$ is the likelihood of site $D_i$ given tree $T$ and model $Q$, which can be efficiently calculated by a pruning algorithm (Felsenstein, 1981). The model $Q$ is then estimated by maximizing the likelihood $L(D \mid T, Q)$.
We have to estimate $R$ and $\pi$ from data sets.
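The pruning computation of the site likelihood can be sketched as follows. To keep the example self-contained, it does not use an estimated model: it assumes the equal-rates (Poisson) model, for which the transition probabilities have a closed form, and a hypothetical three-taxon rooted tree written as nested lists of (child, branch length) pairs. All names and values are illustrative.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
K = len(AMINO_ACIDS)

def transition_prob(x, y, t):
    """P_xy(t) under the equal-rates (Poisson) model, one substitution per unit time."""
    decay = math.exp(-K * t / (K - 1))
    return 1 / K + (1 - 1 / K) * decay if x == y else (1 - decay) / K

def partial_likelihoods(node, site):
    """Pruning step: likelihood of the data below `node` for each state at `node`."""
    if isinstance(node, str):              # leaf: node is a sequence name
        return [1.0 if aa == site[node] else 0.0 for aa in AMINO_ACIDS]
    result = [1.0] * K
    for child, branch_length in node:      # internal node: list of (child, length)
        below = partial_likelihoods(child, site)
        for i, x in enumerate(AMINO_ACIDS):
            result[i] *= sum(transition_prob(x, y, branch_length) * below[j]
                             for j, y in enumerate(AMINO_ACIDS))
    return result

def site_likelihood(tree, site):
    """Likelihood of one site: sum over root states weighted by frequencies 1/K."""
    return sum(p / K for p in partial_likelihoods(tree, site))

def log_likelihood(tree, alignment):
    """Log-likelihood of the alignment: sum of per-site log-likelihoods."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {name: seq[i] for name, seq in alignment.items()}))
        for i in range(n_sites)
    )

# Hypothetical rooted tree ((s1:0.1, s2:0.2):0.05, s3:0.3) and toy alignment.
tree = [([("s1", 0.1), ("s2", 0.2)], 0.05), ("s3", 0.3)]
alignment = {"s1": "MNPK", "s2": "MNPR", "s3": "MSPK"}
print(log_likelihood(tree, alignment))
```

With an estimated model, `transition_prob` would be replaced by entries of the matrix exponential of the rate matrix, and the uniform root weights by the model's amino acid frequencies; the recursion itself stays the same.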
Models of rate heterogeneity
Substitution rates are observed to vary among the sites of a sequence, for example among the three codon positions; this variation is called rate heterogeneity (Felsenstein, 2004). Available evidence shows that heterogeneity of substitution rates is the most dramatic at the second codon position and the least dramatic at the third codon position (Yang, 1996).
Fitch and Margoliash were the first to propose a rate heterogeneity model (Fitch WM, Margoliash E, 1967). It is a simple model with two states, in which each sequence site is either variable or invariable. The two kinds of sites cannot be distinguished directly, because a variable site may appear unvaried as a result of back substitutions or simply by chance (Churchill GA, Haeseler AV, Navidi WC, 1992). To overcome this problem, the two-state model introduces a parameter which indicates the percentage of invariable sites in the sequence. In real applications, the parameter is usually estimated from data.
Recently, rate heterogeneity has commonly been modeled by fitting a gamma distribution (Γ-distribution) of substitution rates, whose shape parameter $\alpha$ is inversely related to the degree of rate heterogeneity. Thus, substitution rate scaling factors $r$ across sites are typically drawn from a Γ-distribution with expectation 1.0 and variance $1/\alpha$:
$$f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} e^{-\alpha r}$$
where
$$\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha - 1} e^{-t}\, dt$$
The degree of rate heterogeneity across sites is adjusted by varying the shape parameter $\alpha$, as shown in Figure 2. A smaller shape parameter describes a stronger heterogeneity of rates across sites. For example, strong rate heterogeneity is modeled by setting the shape parameter $\alpha < 1$. That means substitution rates are very slow at most sites, but much faster at a few sites. In contrast, if $\alpha > 1$, we observe weak heterogeneity of substitution rates. In other words, substitution rate scaling factors are close to 1.0 over all sites.
Figure 2 Different shapes of Γ-distribution with respect to shape parameter
(Le Sy Vinh (2005) Phylogeny Reconstructions Come of Age, thesis, pp.18)
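The effect of the shape parameter can be checked numerically. The following sketch draws rate scaling factors from a gamma distribution with mean 1.0 and compares a small and a large shape parameter; the two values 0.5 and 5.0, the sample size, the seed, and the threshold 0.1 for "slow" sites are arbitrary choices made for illustration.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw rate scaling factors from a gamma distribution with mean 1, variance 1/alpha."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): mean = shape * scale, variance = shape * scale**2
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def summarize(rates):
    """Return sample mean, sample variance, and the fraction of very slow sites."""
    n = len(rates)
    mean = sum(rates) / n
    variance = sum((r - mean) ** 2 for r in rates) / (n - 1)
    slow = sum(r < 0.1 for r in rates) / n
    return mean, variance, slow

for alpha in (0.5, 5.0):
    mean, variance, slow = summarize(sample_site_rates(alpha, 20000))
    print(f"alpha={alpha}: mean={mean:.2f} variance={variance:.2f} slow={slow:.2f}")
```

For the small shape parameter a sizeable fraction of sites is nearly invariable while a few sites evolve much faster than average; for the large shape parameter almost all rates lie close to 1.0.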
In 1995, Gu et al. proposed combining the two-state model with the Γ-distribution model (Gu X, Fu YX, Li WH, 1995). The hybrid model assumes that a fraction of the sequence sites is invariable, while the other sites are variable with substitution rate scaling factors drawn from the Γ-distribution.
More recently, Meyer and von Haeseler proposed a method to identify site-specific substitution rates (Meyer, S and von Haeseler, A., 2003). The method estimates a substitution rate scaling factor for each site based on the maximum likelihood principle. The site-specific substitution rate model is implemented in the Parat program (Meyer, S and von Haeseler, A., 2003) as well as in the IQPNNI package.
CHAPTER 3 METHODS TO ESTIMATE MODELS
4.1 Methods
Protein sequence analysis systems usually require an amino acid substitution model for analyzing the relationships between protein sequences. Therefore, estimating amino acid substitution models has been a crucial task in bioinformatics for more than four decades.
There are two main approaches to estimating amino acid substitution models from protein alignments: counting methods and maximum likelihood methods.
4.1.1 Counting methods
Dayhoff et al. were the first to use a parsimony-based counting method to generate accepted point mutation (PAM) matrices from the limited amount of protein sequence data available at the time (Dayhoff, 1972). To do this, phylogenetic trees were inferred for multiple protein families, and maximum parsimony (MP) was used to reconstruct the ancestral sequences along the trees.
This approach builds the matrix from “accepted point mutations”. An accepted point mutation in a protein is a replacement of one amino acid by another that has been accepted by natural selection. To build the amino acid model, the observed replacements between each amino acid and every other one are tabulated; hence, there are 20 × 20 = 400 possible comparisons. The mutation probability matrix is denoted by $M$, and each element $M_{ij}$ is the probability that amino acid $j$ will be replaced by amino acid $i$. The non-diagonal elements are computed by
$$M_{ij} = \frac{\lambda\, m_j A_{ij}}{\sum_{i} A_{ij}}$$
where $A_{ij}$ is an element of the accepted point mutation matrix, which is known, $\lambda$ is a proportionality constant, and $m_j$ is the mutability of the $j$-th amino acid.
The diagonal elements are calculated by $M_{jj} = 1 - \lambda m_j$. The total probability, that is, the sum of all the elements in each column, must be 1.
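The construction of the mutation probability matrix can be illustrated with a small numerical sketch. The counts, mutabilities, and proportionality constant below are hypothetical values for a 3-letter alphabet, chosen only to show the arithmetic; the accepted point mutation matrix is assumed to have a zero diagonal.

```python
def mutation_probability_matrix(A, mutability, lam):
    """M[i][j]: probability that amino acid j is replaced by amino acid i."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Total accepted point mutations involving amino acid j (diagonal excluded).
        column_total = sum(A[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i != j:
                M[i][j] = lam * mutability[j] * A[i][j] / column_total
        M[j][j] = 1.0 - lam * mutability[j]   # probability that j is unchanged
    return M

# Hypothetical counts and mutabilities for a 3-letter alphabet.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
mutability = [0.8, 1.0, 0.6]
M = mutation_probability_matrix(A, mutability, lam=0.01)
for row in M:
    print([round(p, 4) for p in row])
```

Each column of the result sums to 1, since column j distributes the total change probability of amino acid j over its possible replacements.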
This method estimates substitution rates between amino acids based on the assumption that the probability of exchanging one amino acid for another in a period of time is linear in the substitution rate between the two amino acids. Thus, substitution rates can be estimated directly from the number of exchanges between amino acid sequence pairs. This approach is simple and applicable to large databases. However, the assumption is only acceptable if the time period is short; thus, the amino acid sequences must be very closely related (typically with >85% identity). PAM (Dayhoff, M O., R V ECK, and C M Park, 1972; Dayhoff, M O and Schwartz, R M and Orcutt, B C, 1978) and JTT (Jones, David T and Taylor, William R and Thornton, Janet M., 1992) are the two most popular models estimated by this approach.
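The counting step itself can be sketched as follows. This is a simplified illustration that counts exchanges directly between aligned sequence pairs and skips pairs at or below an identity threshold; it omits the tree and ancestral sequence reconstruction used by Dayhoff et al., and the function names, the gap symbol, and the toy pairs are assumptions.

```python
from collections import Counter

def percent_identity(seq_a, seq_b, gap="-"):
    """Identity over the columns where both aligned sequences have a residue."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def count_exchanges(aligned_pairs, min_identity=85.0, gap="-"):
    """Count unordered amino acid exchanges in sufficiently similar pairs."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if percent_identity(seq_a, seq_b, gap) <= min_identity:
            continue  # too divergent: multiple substitutions per site would bias counts
        for a, b in zip(seq_a, seq_b):
            if a != b and a != gap and b != gap:
                counts[frozenset((a, b))] += 1
    return counts

pairs = [
    ("MNPNQKIITI", "MNPSQKIITI"),  # 90% identical, used
    ("MNPNQKIITI", "MSPSQRIVTI"),  # 60% identical, skipped
]
print(count_exchanges(pairs))
```

Counts of this kind fill the accepted point mutation matrix from which the probabilities described above are derived.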
In 1992, Jones et al. (Jones, David T and Taylor, William R and Thornton, Janet M., 1992) and Gonnet et al. (Gaston H Gonnet, Mark A Cohen, Steven A Benner, 1992) used the same methodology as Dayhoff et al. (Dayhoff, M O and Schwartz, R M and Orcutt, B C, 1978) with a large database of protein data; the matrix of Jones et al. is called the JTT matrix. Jones et al. (D.T Jones, W.R Taylor, J.M Thornton, 1994) also calculated an amino acid replacement matrix specifically for membrane-spanning segments. This matrix has remarkably different values from the Dayhoff matrices.
In 1992, Henikoff and Henikoff (Henikoff, S.; Henikoff, J G, 1992) introduced a new method that uses local, ungapped alignments of distantly related sequences to derive the BLOSUM series of matrices. The number attached to each matrix (e.g., BLOSUM62) refers to the minimum percentage identity of the blocks of multiple aligned amino acids used to construct the matrix.
In 1996, Adachi and Hasegawa studied the amino acid substitution process of mtDNA-encoded proteins and used the maximum likelihood method to construct a transition probability matrix, called the mtREV matrix. The authors showed that mtREV outperformed other models when analyzing the phylogenetic relationships among species based on their mtDNA-encoded protein sequences.
In 2001, Whelan and Goldman used an approximate maximum likelihood method to build a new model of amino acid substitution, called WAG (Whelan S, Goldman N, 2001). The authors showed that WAG was better than the Dayhoff models when comparing maximum likelihood values of globular protein families.
4.1.2 Maximum likelihood methods
This approach takes advantage of multiple alignments by using the maximum likelihood method. The main idea is to estimate both the phylogenies and the substitution model so as to maximize the likelihood of the alignments. Adachi and Hasegawa (Adachi, Jun and Hasegawa, Masami, 1996), Yang et al. (Yang, 2006), and Adachi et al. (Adachi, Jun and Hasegawa, Masami, 1996) were the first to apply the approach to alignments from a few species, with the assumption that all proteins come from the same phylogeny.
Unfortunately, the number of calculations needed to compute each individual likelihood increases significantly with each sequence added to an analysis. Consequently, this may restrict the accuracy of the resulting models or the variety of proteins for which the models are subsequently found to be useful.
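To solve the problems above, Whelan and Goldman relaxed this assumption by using approximate phylogenies for different alignments (Whelan S, Goldman N, 2001). All amino acid sites in an alignment are assumed to evolve independently and according to the same Markov process, which is stationary and homogeneous. In addition, the amino acid frequencies and the evolutionary model are assumed to be constant through time and across all sites in the alignment. The probability of amino acid $i$ being replaced by amino acid $j$ over time $t$ is $P_{ij}(t)$, where $i, j = 1, \dots, 20$. These probabilities can be written as a matrix, $P(t)$, which is calculated as $P(t) = e^{tQ}$, where $Q$ is the rate matrix, with off-diagonal elements $q_{ij}$ being the instantaneous rates of change of amino acid $i$ to amino acid $j$ and with diagonal elements $q_{ii}$ being fixed so that the row sums of $Q$ equal 0. The matrix $Q$ is given by
$$Q = \begin{pmatrix} * & \pi_2 s_{12} & \pi_3 s_{13} & \cdots & \pi_{20} s_{1,20} \\ \pi_1 s_{12} & * & \pi_3 s_{23} & \cdots & \pi_{20} s_{2,20} \\ \pi_1 s_{13} & \pi_2 s_{23} & * & \cdots & \pi_{20} s_{3,20} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \pi_1 s_{1,20} & \pi_2 s_{2,20} & \pi_3 s_{3,20} & \cdots & * \end{pmatrix} \tag{4.1}$$
where the $s_{ij}$ represent the exchangeabilities of amino acid pairs, the $\pi_i$ represent the equilibrium (stationary) frequencies of the 20 amino acids, and $*$ marks the diagonal elements $q_{ii} = -\sum_{j \neq i} q_{ij}$. While performing likelihood calculations on a tree, the mean substitution rate $-\sum_i \pi_i q_{ii}$ was fixed (conventionally to 1, so that branch lengths are measured in expected substitutions per site).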
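The relation between the rate matrix and the transition probabilities can be illustrated numerically. The sketch below approximates the matrix exponential with a truncated Taylor series combined with scaling and squaring, which is one simple technique among several and is not claimed to be the one used by the software in this thesis. The 3-state exchangeabilities and frequencies are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(Q, t, terms=20, squarings=10):
    """Approximate exp(tQ) by scaling and squaring a truncated Taylor series."""
    n = len(Q)
    scale = t / (2 ** squarings)
    scaled = [[scale * q for q in row] for row in Q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes (scale * Q)^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)   # exp(A)^2 = exp(2A)
    return result

# Hypothetical 3-state model: q_ij = pi_j * s_ij, rows summing to zero.
pi = [0.5, 0.3, 0.2]
s = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 0.5],
     [2.0, 0.5, 0.0]]
Q = [[pi[j] * s[i][j] if i != j else 0.0 for j in range(3)] for i in range(3)]
for i in range(3):
    Q[i][i] = -sum(Q[i])

P = mat_exp(Q, t=0.5)
for row in P:
    print([round(p, 4) for p in row])
```

Each row of the result is a probability distribution, and for long times every row approaches the stationary frequencies, as expected for a stationary, reversible process.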
Before estimating the model, Whelan and Goldman made two further assumptions: first, the parameters describing the evolutionary process remain relatively constant across near-optimal tree topologies; second, when ML estimation is performed under two different models of evolution for a single tree topology, the individual branch lengths change in such a way that their ratios remain approximately constant.
Accordingly, the ratios of branch lengths were fixed, and a scaling factor was introduced, which allowed all branch lengths to increase or decrease linearly. Assuming that the families’ topologies and relative branch lengths are fixed at near-optimal values, the overall likelihood is calculated by
$$L(D \mid T, Q) = \prod_{k} L(D_k \mid T_k, Q) \tag{4.2}$$
where $D = \{D_k\}$ is the set of alignments of the protein families, $T = \{T_k\}$ describes the tree topologies and relative branch lengths for each family, and $Q$ represents the model of evolution, consisting of the exchangeability parameters $s_{ij}$ and the frequencies $\pi_i$.
With the parameters associated with $T$ fixed, we then find the ML model of evolution by maximizing over $Q$ in equation (4.3), since our assumptions mean that the resulting model will be close to that obtained by maximizing equation (4.2) over both $Q$ and $T$.
4.1.3 Le’s approach
Le and Gascuel extended Whelan and Goldman’s method by accounting for the variability of evolutionary rates across sites and by optimizing the phylogenies during the estimation process (Quang Le and Olivier Gascuel, 2008). First, Le and Gascuel improved amino acid substitution modeling by relaxing the assumption that all sites evolve at the same rate. The authors use this property to rewrite the substitution matrices: the non-normalized matrices are estimated and can be written as $\rho Q$, where $\rho$ is a global rate. However, normalized matrices are used in tree inference, as usual, and $Q$ denotes a normalized matrix unless explicitly stated.
The likelihood of the data (denoted by $D$) for a given tree $T$ (including branch lengths) and replacement matrix $Q$ is computed by
$$L(D \mid T, Q) = \prod_{i} L(D_i \mid T, Q)$$
where the parameters are the same as in Equation (4.3).
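Le and Gascuel also took into account that sites do not evolve at the same rate, owing to various evolutionary pressures, and that the site rates follow a gamma distribution (denoted by $\Gamma$); the gamma model is usually improved by assuming that some sites are invariant and do not undergo any substitution along the studied phylogenetic tree (Quang Le and Olivier Gascuel, 2008). Hence, we denote by $p_c$ and $\rho_c$ the probability and the rate, respectively, of a site belonging to category $c$. When accounting for invariant sites, we denote by $\nu$ the proportion of invariant sites.
The likelihood of $D$ given the phylogenetic tree $T$, the substitution model $Q$, and gamma distributed rates with invariant sites is computed as
$$L(D \mid T, Q, \rho, \nu) = \prod_{i} \Big( \nu\, L^{\mathrm{INV}}(D_i) + (1 - \nu) \sum_{c} p_c\, L(D_i \mid T, \rho_c Q) \Big) \tag{4.7}$$
where $L^{\mathrm{INV}}(D_i)$ is the likelihood of site $i$ assuming the invariant model, that is, $L^{\mathrm{INV}}(D_i)$ is equal to $\pi_x$ if the site is constant and contains only amino acid $x$, and zero when the site is not constant; $\rho_c Q$ is the rate matrix obtained from $Q$ by scaling it with the rate $\rho_c$. So, equation (4.7) can be rewritten as
$$L(D \mid T, Q, \rho, \nu) = \prod_{i} \Big[ \nu\, \pi_{x_i} \delta_i + (1 - \nu) \sum_{c} p_c\, L(D_i \mid T, \rho_c Q) \Big] \tag{4.8}$$
where $\delta_i$ equals 1 if site $i$ is constant and contains only amino acid $x_i$, and 0 otherwise.
The model estimated by Le and Gascuel’s approach is called LG.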
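The way the invariant-site term and the rate categories are combined for one site can be sketched as follows. This is a minimal illustration under stated assumptions: the per-category site likelihoods are taken as given (in practice they would come from a pruning computation with the rate-scaled matrix), and all names and numbers are hypothetical.

```python
import math

def site_likelihood_gamma_inv(site_column, category_likelihoods, category_probs,
                              frequencies, p_inv):
    """Mix the invariant-site term with the rate-category terms for one site."""
    first = site_column[0]
    is_constant = all(aa == first for aa in site_column)
    # Invariant model: frequency of the amino acid if the site is constant, else zero.
    inv_term = frequencies[first] if is_constant else 0.0
    variable_term = sum(p * lik for p, lik in zip(category_probs, category_likelihoods))
    return p_inv * inv_term + (1.0 - p_inv) * variable_term

def log_likelihood(sites, category_probs, frequencies, p_inv):
    """Sum of per-site log-likelihoods over (column, per-category likelihoods) pairs."""
    return sum(
        math.log(site_likelihood_gamma_inv(column, cat_liks, category_probs,
                                           frequencies, p_inv))
        for column, cat_liks in sites
    )

# Hypothetical inputs: four equally probable rate categories, 20% invariant sites.
frequencies = {"M": 0.02, "N": 0.04}
category_probs = [0.25, 0.25, 0.25, 0.25]
sites = [
    ("MMM", [0.04, 0.03, 0.02, 0.01]),       # constant site
    ("MNM", [1e-4, 4e-4, 6e-4, 5e-4]),       # variable site
]
print(log_likelihood(sites, category_probs, frequencies, p_inv=0.2))
```

Only constant sites receive a contribution from the invariant term, so a variable site is explained entirely by the rate categories.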
4.2 Protein type specific amino acid substitution models estimation
General models have been estimated from large databases; however, current studies have shown that they might not be appropriate for a particular set of species due to differences in the evolutionary processes of these species (David C Nickle, Laura Heath, Mark A Jensen, Peter B Gilbert, James I Mullins, and Sergei L Kosakovsky Pond, 2007; Cuong Cao Dang, Quang Si Le, Olivier Gascuel and Vinh Sy Le, 2010; M W Dimmic, Rest JS, Mindell DP, and Goldstein RA., 2002). A number of specific amino acid substitution models for important species have been introduced. For example, Dimmic and colleagues estimated the rtREV model for inference of retrovirus and reverse transcriptase phylogeny