Protein type specific amino acid substitution models for influenza viruses
Nguyen Van Sau1, Dang Cao Cuong1, Le Si Quang3, Le Sy Vinh1,2
1 University of Engineering and Technology, VNU
2 Institute of Information Technology, VNU
Vietnam National University Hanoi, 144 Xuan Thuy, Ha Noi, Viet Nam
3 Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK
saunv.mcs09@coltech.vnu.vn, cuongdc@vnu.edu.vn, quang@well.ox.ac.uk, vinhls@vnu.edu.vn
Abstract: The amino acid substitution model (matrix) is a crucial part of protein sequence analysis systems. General amino acid substitution models have been estimated from large protein databases; however, they are not specific to influenza viruses. In a previous study, we estimated the amino acid substitution model FLU for all influenza viruses. Experiments showed that FLU outperformed other models when analyzing influenza protein sequences. Influenza virus genomes consist of different protein types, which differ in both structure and evolutionary process. Although the FLU matrix is specific to influenza viruses, it is still not specific to influenza protein types. Since influenza viruses cause serious problems for both human health and the economy, it is worth studying them as specifically as possible.
In this paper, we used more than 27 million amino acids to estimate 11 protein type specific models for influenza viruses. Experiments showed that the protein type specific models outperformed the FLU model, the best existing model for influenza viruses. These protein type specific models help researchers conduct studies on influenza viruses more precisely.
Keywords: influenza virus, amino acid substitution model, phylogenetic tree
1 BACKGROUND
Protein sequence analysis systems usually require an amino acid substitution model to analyze the relationships between protein sequences. Estimating amino acid substitution models has therefore been a crucial task in bioinformatics for more than four decades.
There are two main approaches to estimating amino acid substitution models from protein alignments. The first estimates substitution rates between amino acids under the assumption that the probability of one amino acid being replaced by another over a period of time is linear in the substitution rate between the two amino acids. Substitution rates can thus be estimated directly from the number of exchanges observed between pairs of amino acid sequences. This approach is simple and applicable to large databases. However, the assumption is only acceptable if the time period is short, so the amino acid sequences must be very closely related (typically with >85% identity). PAM [1,2] and JTT [3] are the two most popular models estimated by this approach.
The second approach takes advantage of multiple alignments by using the maximum likelihood method. The main idea is to estimate both the phylogenies and the substitution model so as to maximize the likelihood of the alignments. Adachi and Hasegawa [4], Yang et al. [5], and Adachi et al. [4] were the first to apply this approach to alignments from a few species, under the assumption that all proteins come from the same phylogeny. Whelan and Goldman relaxed this assumption by using approximate phylogenies for different alignments. Le and Gascuel [6] extended Whelan and Goldman's method by optimizing phylogenies and evolutionary rates across sites during the estimation process.
General models have been estimated from large databases; however, recent studies have shown that they might not be appropriate for particular sets of species because of differences in the evolutionary processes of these species [7,8,9]. A number of specific amino acid substitution models for important species have been introduced. For example, Dimmic and colleagues estimated the rtREV model for inference of retrovirus and reverse transcriptase phylogeny [9]. Nickle and coworkers introduced HIV-specific models that showed a consistently superior fit compared with the best general models when analyzing HIV proteins [7].
Influenza viruses are among the most dangerous viruses for birds and humans. They are RNA viruses that belong to the Orthomyxoviridae family and are divided into three types: influenza A, influenza B, and influenza C, of which influenza A is the most prevalent and dangerous. In recent years, influenza A viruses have caused serious problems for human health and the economy. Currently emerging influenza epidemics include H5N1 ('avian flu') and H1N1. More details about historical and recently emerging influenza pandemics and epidemics can be found at the World Health Organization website (http://www.who.int/csr/disease/influenza/en/).
2011 Third International Conference on Knowledge and Systems Engineering
Theoretical and experimental studies have been conducted extensively for decades to understand the evolution, transmission, and infection processes of influenza viruses [10,11,12,8] (and references therein).
Recently, we published the FLU model [8], which was estimated specifically for influenza viruses. Our extensive experiments showed that FLU performs much better than other models when analyzing influenza protein sequences.
Although the FLU model is specific to influenza viruses, it is not specific to protein types. The influenza A virus genome encodes 11 different protein types: HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2 (see Table 1 for more details). These protein types have different structures and evolve at different rates, which raises the need for different amino acid substitution models for different protein types.
In this study, we continue our work on amino acid substitution models for influenza viruses. Since influenza A viruses are the most prevalent and dangerous, we estimated 11 amino acid substitution models for the 11 protein types of influenza A viruses. These models will allow researchers to analyze the evolutionary processes of influenza proteins more precisely.
The paper is organized into 5 sections. Section 2 (Method) presents the theoretical background of amino acid substitution models and our approach to estimating protein type specific models. Section 3 (Data preparation) describes how we prepared the protein sequences used to estimate the models. Section 4 reports comparisons among the models. Conclusions are given in the last section.
2 METHOD
The substitution process at each amino acid site is assumed to be independent of the other sites, stationary, and constant over time [13,14]. We can therefore use a homogeneous, continuous-time, time-reversible Markov process [13,14,15] to model the substitution process between amino acids.
Table 1. Data of 11 protein types of influenza A viruses. Columns: Protein type, #Sequences, #Alignments, Proportion (%).
Figure 1. The four-step approach to estimating protein type specific amino acid substitution models.
The model consists of two components: 1) an instantaneous substitution rate 20x20 matrix $Q = (q_{xy})$, where $q_{xy}$ is the number of substitutions from amino acid $x$ to amino acid $y$ per time unit; 2) an amino acid frequency 20-vector $\pi = (\pi_x)$, where $\pi_x$ is the frequency of amino acid $x$. While $\pi$ can be easily estimated from data using a counting method, $Q$ is the subject of estimation methods.
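The two components can be illustrated with a minimal sketch. It assumes the usual time-reversible parametrization in which each off-diagonal rate is a symmetric exchangeability multiplied by the frequency of the target amino acid; the names `pi`, `exchangeabilities`, and all function names are ours, and the equal exchangeabilities in the toy example are for illustration only, not values from any estimated model.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids


def amino_acid_frequencies(sequences):
    """Estimate the frequency vector by counting residues.

    Gaps and ambiguity codes (anything outside the 20 letters) are skipped.
    """
    counts = Counter(ch for seq in sequences for ch in seq.upper()
                     if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def reversible_rate_matrix(exchangeabilities, pi):
    """Build Q with q_xy = r_xy * pi_y for x != y and rows summing to zero.

    `exchangeabilities` must be a symmetric n x n list of lists (r_xy == r_yx).
    """
    n = len(pi)
    q = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                q[x][y] = exchangeabilities[x][y] * pi[y]
        q[x][x] = -sum(q[x])  # diagonal is still 0.0 here, so this is the row sum
    return q


def is_time_reversible(q, pi, tol=1e-12):
    """Check detailed balance: pi_x * q_xy == pi_y * q_yx for all x, y."""
    n = len(pi)
    return all(abs(pi[x] * q[x][y] - pi[y] * q[y][x]) <= tol
               for x in range(n) for y in range(n))


# Toy data: three short made-up sequences, equal exchangeabilities.
pi = amino_acid_frequencies(["MKT-AY", "MKTLAY", "MRTLAX"])
r = [[1.0] * 20 for _ in range(20)]
Q = reversible_rate_matrix(r, pi)
```

Because the exchangeabilities are symmetric, detailed balance holds by construction, which is what makes the process time-reversible.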
We apply a four-step maximum likelihood approach to estimate protein type specific models, as pictured in Figure 1:
- Data preparation step: Download, clean, classify, and align sequences to create multiple protein alignments (more details are presented in Section 3).
- Constructing tree step: For each protein alignment, use a maximum likelihood method (such as PhyML [16]) to construct a phylogenetic tree using an initial matrix Q (initialized with the FLU matrix).
- Estimating model step: Use an expectation-maximization algorithm (such as XRATE [17]) to train a new model Q' using the protein alignments and the reconstructed trees.
- Comparing step: Compare Q and Q'. If Q' is nearly identical to Q, Q' is considered the final model. Otherwise, replace Q by Q' and go back to the Constructing tree step.
Extensive experiments show that Q hardly changes (Q' ~ Q) after three iterations.
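The iterative part of the procedure (constructing trees, estimating a model, comparing) can be sketched as a simple loop. This is only an outline under stated assumptions: `build_tree` and `train_model` are hypothetical callables standing in for a tree builder such as PhyML and a trainer such as XRATE (they are not the interfaces of those programs), and the convergence test, a largest entry-wise difference below `tolerance`, is our own choice since the text only requires Q' to be "nearly identical" to Q.

```python
def max_abs_difference(q_old, q_new):
    """Largest absolute entry-wise difference between two rate matrices."""
    return max(abs(a - b)
               for row_old, row_new in zip(q_old, q_new)
               for a, b in zip(row_old, row_new))


def estimate_model(alignments, initial_q, build_tree, train_model,
                   tolerance=1e-3, max_iterations=20):
    """Alternate tree construction and model training until Q stabilizes.

    Returns the final matrix and the number of iterations performed.
    """
    q = initial_q
    for iteration in range(1, max_iterations + 1):
        # Constructing tree step: one tree per alignment under the current Q.
        trees = [build_tree(alignment, q) for alignment in alignments]
        # Estimating model step: train Q' from alignments and trees.
        q_new = train_model(alignments, trees)
        # Comparing step: stop when Q' is nearly identical to Q.
        if max_abs_difference(q, q_new) < tolerance:
            return q_new, iteration
        q = q_new
    return q, max_iterations


# Toy stand-ins so the loop can be run end to end; they do no phylogenetics.
TARGET = [[-1.0, 1.0], [2.0, -2.0]]


def toy_build_tree(alignment, q):
    return q  # the "tree" simply remembers the matrix it was built with


def toy_train_model(alignments, trees):
    current = trees[0]
    # Move halfway from the current matrix toward a fixed target matrix.
    return [[(c + t) / 2 for c, t in zip(c_row, t_row)]
            for c_row, t_row in zip(current, TARGET)]


final_q, iterations = estimate_model(["toy alignment"],
                                     [[-0.5, 0.5], [1.0, -1.0]],
                                     toy_build_tree, toy_train_model)
```

In a real pipeline the two callables would wrap external programs; keeping them as parameters lets the control flow be tested separately from those tools.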
3 DATA PREPARATION
On January 7, 2011, there were more than 9,300 complete genomes comprising 200,000 protein sequences in the Influenza database at NCBI (www.ncbi.nlm.nih.gov/genomes/FLU/) [18]. In the database, 95% of the sequences are influenza A proteins, including ~9,000 complete genomes and ~190,000 protein sequences. The other sequences are from influenza B and C viruses.
The number of available sequences for influenza B and C types is not sufficient to estimate protein type specific models for these virus types. We therefore concentrate on estimating models for the 11 protein types of influenza A viruses. The data preparation process is as follows:
- Downloading step: We downloaded 200,000 influenza A protein sequences consisting of more than 27 million amino acids.
- Cleaning step: The data contain a large number of duplicated sequences. We removed the duplicates and obtained ~100,000 unique protein sequences.
- Categorizing step: Sequences were classified into 11 classes corresponding to the 11 protein types: HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2.
- Splitting step: Sequences of the same class were split into subgroups such that each subgroup consists of 5 to 50 sequences.
- Aligning step: The sequences of each subgroup were aligned using the MUSCLE program (default parameters) [19] and subsequently cleaned with the Gblocks program (parameter -b5=h) [20] to eliminate sites containing too many gaps. We selected 2,500 alignments (66,139 sequences, 1,058,987 sites, and 27,588,017 amino acids), each consisting of at least 50 amino acid sites.
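The cleaning and splitting steps above can be sketched as follows. The text does not say how sequences were assigned to subgroups, so this sketch simply assumes consecutive, evenly sized chunks in input order; the function names are ours.

```python
def remove_duplicates(sequences):
    """Keep the first occurrence of each distinct sequence string."""
    return list(dict.fromkeys(sequences))


def split_into_subgroups(sequences, min_size=5, max_size=50):
    """Split a list into consecutive subgroups of min_size..max_size items.

    Uses the fewest groups allowed by max_size and balances their sizes.
    The balance argument assumes min_size <= max_size // 2, which holds
    for the 5..50 range used in the paper.
    """
    n = len(sequences)
    if n < min_size:
        return []  # too few sequences to form a valid subgroup
    n_groups = -(-n // max_size)  # ceiling division
    base, extra = divmod(n, n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more
        groups.append(sequences[start:start + size])
        start += size
    return groups


# Toy example: 101 made-up identifiers standing in for sequences of one class.
toy_class = remove_duplicates([f"seq{i}" for i in range(101)] + ["seq0", "seq1"])
toy_groups = split_into_subgroups(toy_class)
```

Each resulting subgroup would then be passed to the aligner and the gap-cleaning program as separate external steps.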
4 RESULTS
We estimated 11 amino acid substitution models, HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2, for the 11 corresponding protein types of influenza A viruses. To compare different models, we conducted two-fold cross-validation. To this end, we randomly divided the dataset of each protein type into two equal subsets, one for training and the other for testing.
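A random division into two halves can be sketched as below. The fixed `seed` and the function name are our own additions for reproducibility of the illustration; the paper does not state how the random split was generated.

```python
import random


def two_fold_split(alignments, seed=0):
    """Shuffle a copy of the alignments and cut it into two halves."""
    shuffled = list(alignments)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy example with 11 made-up alignment identifiers.
train_set, test_set = two_fold_split(range(11))
# For full two-fold cross-validation the roles are then swapped.
folds = [(train_set, test_set), (test_set, train_set)]
```

With an odd number of alignments the second half receives the extra item.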
Model performance analysis
We compared the 11 protein type specific models with the FLU model (the best amino acid substitution model for influenza viruses) by comparing the likelihoods of the maximum likelihood trees constructed using the different models. Note that this is a standard metric for comparing models.
As expected, the experiments showed that the protein type specific models outperformed all other models when analyzing their corresponding protein sequences (see Table 2). For example, the PA model is the best model in 98.37% of cases when analyzing PA alignments. Table 2 also shows that the NS2 model does not completely outperform the other models: it is the best model in only 65.93% of cases when analyzing NS2 sequences. This is because only a small number of NS2 protein sequences are available for estimating the NS2 model.
Table 3 summarizes the comparisons between FLU and the protein type specific models in terms of log likelihoods when analyzing their corresponding proteins. Note that a greater log likelihood per site indicates a better model. The protein type specific models are clearly better than the FLU model when analyzing their corresponding proteins. For example, the log likelihood per site of the HA model (-16.5699) is higher than those of the other models.
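The two summary statistics used in this comparison, the log likelihood per site and the percentage of alignments for which a model is the best, can be computed as in the sketch below. The input layout (one dictionary of log likelihoods per alignment) and the toy numbers are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def log_likelihood_per_site(log_likelihood, n_sites):
    """Normalize a tree log likelihood by the alignment length."""
    return log_likelihood / n_sites


def best_model_percentages(scores_per_alignment):
    """Return {model: % of alignments where it has the highest log likelihood}.

    `scores_per_alignment` holds one dict per alignment, mapping a model
    name to the log likelihood of the tree built with that model. On a
    tie, the model listed first in the dict is counted as the winner.
    """
    wins = Counter(max(scores, key=scores.get)
                   for scores in scores_per_alignment)
    n = len(scores_per_alignment)
    return {model: 100.0 * count / n for model, count in wins.items()}


# Hypothetical log likelihoods for four alignments (not values from the paper).
toy_scores = [
    {"HA": -1650.2, "FLU": -1662.9, "NA": -1701.4},
    {"HA": -980.7, "FLU": -985.1, "NA": -1010.0},
    {"HA": -2210.3, "FLU": -2209.8, "NA": -2290.6},
    {"HA": -1504.0, "FLU": -1519.5, "NA": -1560.2},
]
percentages = best_model_percentages(toy_scores)
```

Models that never win are simply absent from the returned dictionary.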
Table 2. Summary of FLU and the HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2 models when analyzing their corresponding protein type sequences. For example, PA is the best model in 98.37% of PA alignments.
Model     Alignments where the model is the best (%)
PA        98.37
NS1       98.30
NP        97.53
NA        95.76
HA        90.87
M2        89.52
M1        87.50
PB1       87.39
PB2       86.50
PB1-F2    68.33
NS2       65.93
Table 3. Pairwise comparisons between FLU and the HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2 models in terms of log likelihoods. Column headers: FLU (M2), M1 > M2, M1 < M2.
The most important protein types of influenza viruses are the HA and NA proteins. Combinations of different HA and NA protein variants result in different influenza subtypes such as H1N1 and H5N1. Table 1 shows that HA (~26%) and NA (~14%) are the two most prevalent proteins in the database. In the following, we present our analysis of the HA and NA proteins.
Table 4 shows that the HA model is the best model when analyzing HA proteins. The HA model yields the best likelihood trees for 587 out of 646 HA protein alignments (~90.87%) and is the second best in 58 other cases (~8.9%). FLU is the second best model in most cases (587 out of 646).
Table 5 shows a similar pattern for NA sequences. The NA model clearly outperforms the other models when analyzing NA protein sequences.
Table 4. Results for the five best models when analyzing HA sequences. HA is the best model in 587 out of 646 cases.
PB1-F2 41 1 0 0 0
Table 5. Results for the five best models when analyzing NA sequences. NA is the best model in 361 out of 377 cases.
The NA model ranks first in 361 out of 377 cases (95.76%) and second in the remaining cases. FLU is again the second best model in most cases (95.76%). Table 6 presents the log likelihood per site of the different models when analyzing HA and NA proteins. The log likelihood per site of the HA (NA) model when analyzing HA (NA) proteins is -16.5699 (-13.0690), which is the best among all models. FLU is the second best model.
Table 6. Log likelihood comparison between the HA (NA) model and the other models when analyzing HA (NA) protein sequences.
Model     Log likelihood per site (HA proteins)    Model     Log likelihood per site (NA proteins)
PB1-F2    -21.2087                                 PB1-F2    -16.6106
Table 7. Correlations among the 12 models. These models are very different from each other; for example, the correlation between the HA and FLU models is only 0.878.
M1 0.295
Model correlation analysis
The correlations among FLU and the protein type specific models are reported in Table 7. These models are clearly very different from each other. For example, the correlation between the HA (NA) and FLU models is only 0.878 (0.886). The comparison suggests that the FLU model is not specific enough to analyze different protein types. In other words, protein type specific models should be used to analyze their corresponding proteins.
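One way to compute such a correlation is sketched below. The text does not state which matrix entries were correlated or which coefficient was used, so this sketch assumes the Pearson correlation over the off-diagonal entries of two rate matrices; the function names and the 3x3 toy matrices are ours.

```python
from math import sqrt


def off_diagonal(matrix):
    """Flatten the off-diagonal entries of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def model_correlation(q1, q2):
    """Correlation between the off-diagonal rates of two models."""
    return pearson(off_diagonal(q1), off_diagonal(q2))


# Toy 3x3 matrices: `b` is `a` scaled by 2, `c` reverses the ordering of `a`.
a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
c = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]
```

Since the Pearson coefficient is invariant to rescaling, two matrices that differ only by a normalization constant have correlation 1 under this definition.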
Tree structure analysis
We also analyzed the impact of the matrices on tree structures. To this end, we used the Robinson-Foulds (RF) distance [21] to measure the difference between two tree topologies. The RF distance is the number of bipartitions present in one of the two trees but not in the other, divided by the number of possible bipartitions. Thus, the smaller the RF distance between two trees, the more similar their topologies. Note that the RF distance ranges from 0.0 to 1.0. We computed RF distances between trees reconstructed using the FLU model and the protein type specific models.
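The normalized RF distance can be sketched for small trees as follows. The sketch assumes trees are given as nested tuples of leaf names, treats them as unrooted, counts only non-trivial bipartitions, and normalizes by 2(n - 3), the maximum for two fully resolved trees on n leaves; these representation and normalization choices are ours.

```python
def leaf_set(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions of the tree, viewed as unrooted.

    Each split is stored as the side that does not contain a fixed
    reference leaf, so the same split always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of bipartitions divided by 2 * (n - 3)."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must have the same leaves")
    n = len(leaves)
    if n < 4:
        return 0.0  # no non-trivial bipartitions exist
    return len(bipartitions(tree_a) ^ bipartitions(tree_b)) / (2 * (n - 3))


# Toy five-leaf trees; t1_rerooted is t1 drawn from a different root.
t1 = ((("A", "B"), "C"), ("D", "E"))
t1_rerooted = ("A", ("B", ("C", ("D", "E"))))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = ((("A", "D"), "B"), ("C", "E"))
```

Storing each split relative to a reference leaf makes the result independent of where the nested-tuple representation happens to be rooted.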
Figure 2 shows that the tree topologies inferred using FLU and the 11 protein type specific models differ. There are 7,932 cases where the RF distance is 0.1. We even observed many cases where the trees reconstructed using FLU and the protein type specific models are very different. Thus, the choice of model has a strong impact on tree structures.
Figure 2. The Robinson-Foulds distances between trees inferred using FLU and the 11 protein type specific models for proteins of influenza A viruses. The horizontal axis indicates the RF distance between two tree topologies, and the vertical axis indicates the number of alignments.
5 CONCLUSIONS
Influenza viruses are among the most dangerous viruses for human health and the economy. Although they have been the subject of thousands of studies, they continue to attract extensive research as well as funding from governments and pharmaceutical companies.
Through our intensive study of influenza viruses using a large number of protein sequences, we were able to estimate 11 amino acid substitution models for the 11 protein types of influenza A viruses.
Our experiments showed that the protein type specific models gave better results than FLU, previously the best model for influenza viruses. Model correlation and tree structure analyses showed that these models are very different from each other and have a strong impact on tree structures. The protein type specific models enable researchers to study influenza protein sequences more precisely. We strongly recommend that researchers use protein type specific models to analyze the corresponding protein sequences.
ACKNOWLEDGMENT
This work is partially supported by the TRIG project at the University of Engineering and Technology, VNU Hanoi.
REFERENCES
[1] Dayhoff, Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed. Washington DC: National Biomedical Research Foundation, 1978, vol. 5.
[2] Dayhoff, M. O. and Schwartz, R. M. and Orcutt, B. C., "A Model of Evolutionary Change in Proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed. Washington DC: National Biomedical Research Foundation, 1978, vol. 5, pp. 345-352.
[3] Jones, David T. and Taylor, William R. and Thornton, Janet M., "The rapid generation of mutation data matrices from protein sequences," Comput. Appl. Biosci., pp. 275-282, 1992.
[4] Adachi, Jun and Hasegawa, Masami, "Model of Amino Acid Substitution in Proteins Encoded by Mitochondrial DNA," J. Mol. Evol., pp. 459-468, 1996.
[5] Nielsen, Rasmus and Yang, Ziheng, "Likelihood Models for Detecting Positively Selected Amino Acid Sites and Applications to the HIV-1 Envelope Gene," Genetics, vol. 148, pp. 929-936, 1998.
[6] Quang Le and Olivier Gascuel, "An improved general amino acid replacement matrix," Mol. Biol. Evol., pp. 1307-1320, 2008.
[7] David C. Nickle, Laura Heath, Mark A. Jensen, Peter B. Gilbert, James I. Mullins, and Sergei L. Kosakovsky Pond, "HIV-Specific Probabilistic Models of Protein Evolution," PLoS ONE, vol. 2, p. e503, 2007.
[8] Cuong Cao Dang, Quang Si Le, Olivier Gascuel, and Vinh Sy Le, "FLU, an amino acid substitution model for influenza proteins," BMC Evolutionary Biology, 2010.
[9] M. W. Dimmic, Rest JS, Mindell DP, and Goldstein RA, "rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny," J. Mol. Evol., vol. 55, pp. 65-73, 2002.
[10] Anthony S. Fauci, "Race against time," Nature, vol. 435, 2009.
[11] Daniel A. Janies, Andrew Hill, Rob Guralnick, Farhat Habib, Eric Waltari, and Ward C. Wheeler, "Genomic Analysis and Geographic Visualization of the Spread of Avian Influenza (H5N1)," Systematic Biology, vol. 56, pp. 321-329, 2007.
[12] Tien D. Nguyen, The Vinh Nguyen, Dhanasekaran Vijaykrishna, Robert G. Webster, Yi Guan, J. S. Malik Peiris, and Gavin J. D. Smith, "Multiple Sublineages of Influenza A Virus (H5N1), Vietnam, 2005-2007," Emerging Infectious Diseases, vol. 14, pp. 632-636, 2008.
[13] Felsenstein, Joe, Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates, 2004.
[14] Ziheng Yang, Computational Molecular Evolution, 1st ed.: Oxford University Press, 2006.
[15] "Nucleotide Substitution Models," in The Phylogenetics Handbook: A Practical Approach to DNA and Protein Phylogeny, Marco Salemi and Anne-Mieke Vandamme, Eds. Cambridge: Cambridge University Press, 2003, pp. 72-100.
[16] Guindon, Stephane and Olivier Gascuel, "A Simple, Fast and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood," Syst. Biol., vol. 52, pp. 696-704, 2003.
[17] Peter S. Klosterman, Andrew V. Uzilov, Yuri R. Bendaña, Robert K. Bradley, Sharon Chao, Carolin Kosiol, Nick Goldman, and Ian Holmes, "XRate: a fast prototyping, training and annotation tool for phylo-grammars," BMC Bioinformatics, vol. 7, p. 428, 2006.
[18] Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D, "The Influenza Virus Resource at the National Center for Biotechnology Information," J. Virol., vol. 82, pp. 596-601, 2008.
[19] Robert C. Edgar, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucl. Acids Res., vol. 32, pp. 1792-1797, 2004.
[20] J. Castresana, "Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis," Molecular Biology and Evolution, vol. 17, pp. 540-552, 2000.
[21] Felsenstein, Joseph, "Distance methods for inferring phylogenies: A Justification," Evolution, vol. 38, pp. 16-24, 1984.