
Protein type specific amino acid substitution models for influenza viruses

Nguyen Van Sau 1, Dang Cao Cuong 1, Le Si Quang 3, Le Sy Vinh 1,2

1 University of Engineering and Technology, VNU
2 Institute of Information Technology, VNU
Vietnam National University Hanoi, 144 Xuan Thuy, Ha Noi, Viet Nam

3 Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK

saunv.mcs09@coltech.vnu.vn, cuongdc@vnu.edu.vn, quang@well.ox.ac.uk, vinhls@vnu.edu.vn

Abstract: The amino acid substitution model (matrix) is a crucial part of protein sequence analysis systems. General amino acid substitution models have been estimated from large protein databases; however, they are not specific to influenza viruses. In a previous study, we estimated an amino acid substitution model, FLU, for all influenza viruses. Experiments showed that FLU outperformed other models when analyzing influenza protein sequences. Influenza virus genomes consist of different protein types, which differ in both structure and evolutionary process. Although the FLU matrix is specific to influenza viruses, it is not specific to individual influenza protein types. Since influenza viruses cause serious problems for both human health and the economy, they are worth studying as specifically as possible.

In this paper, we used more than 27 million amino acids to estimate 11 protein type specific models for influenza viruses. Experiments showed that the protein type specific models outperformed the FLU model, the best existing model for influenza viruses. These protein type specific models help researchers conduct studies on influenza viruses more precisely.

Keywords: influenza virus, amino acid substitution model, phylogenetic tree

1 BACKGROUND

Protein sequence analysis systems usually require an amino acid substitution model for analyzing the relationships between protein sequences. Estimating amino acid substitution models has therefore been a crucial task in bioinformatics for more than four decades.

There are two main approaches to estimating amino acid substitution models from protein alignments. The first estimates substitution rates between amino acids under the assumption that the probability of one amino acid being exchanged for another over a period of time is linear in the substitution rate between the two amino acids. Substitution rates can thus be estimated directly from the number of exchanges observed between pairs of amino acid sequences. This approach is simple and applicable to large databases. However, the assumption is acceptable only if the time period is short, so the amino acid sequences must be very closely related (typically with >85% identity). PAM [1,2] and JTT [3] are the two most popular models estimated by this approach.
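As a rough illustration of the counting idea only, the sketch below tallies amino acid exchanges between aligned pairs that exceed an identity threshold. The function name, the toy sequences, and the gap handling are assumptions made for this example; the actual PAM and JTT procedures involve further normalization that is not shown here.

```python
from collections import Counter

def count_exchanges(aligned_pairs, min_identity=0.85):
    """Tally symmetric amino acid exchange counts from aligned sequence pairs.

    Pairs whose identity does not exceed min_identity are skipped, because
    the counting approach assumes a short time between the two sequences.
    """
    counts = Counter()
    for a, b in aligned_pairs:
        # Compare only columns where neither sequence has a gap.
        columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if not columns:
            continue
        identity = sum(x == y for x, y in columns) / len(columns)
        if identity <= min_identity:
            continue
        for x, y in columns:
            if x != y:
                # frozenset makes the count symmetric: R->K and K->R coincide.
                counts[frozenset((x, y))] += 1
    return counts

# Toy, made-up sequences (not influenza data).
pairs = [
    ("MKTAYIAKQR", "MKTAYIAKQK"),  # 90% identical: one R/K exchange is counted
    ("MKTAYIAKQR", "MRSGYLPKQK"),  # 40% identical: pair is skipped
]
exchanges = count_exchanges(pairs)
```

The resulting counts would still need to be converted into rates, for example by normalizing against amino acid frequencies, before they could serve as a substitution matrix.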

The second approach takes advantage of multiple alignments by using the maximum likelihood method. The main idea is to estimate both the phylogenies and the substitution model so as to maximize the likelihood of the alignments. Adachi and Hasegawa [4], Yang et al. [5], and Adachi et al. [4] were the first to apply the approach to alignments from a few species, under the assumption that all proteins come from the same phylogeny. Whelan and Goldman relaxed this assumption by using approximate phylogenies for different alignments. Le and Gascuel [6] extended Whelan and Goldman's method by optimizing phylogenies and evolutionary rates across sites during the estimation process.

General models have been estimated from large databases; however, recent studies have shown that they might not be appropriate for particular sets of species because of differences in the evolutionary processes of these species [7,8,9]. A number of specific amino acid substitution models for important species have been introduced. For example, Dimmic and colleagues estimated the rtREV model for inference of retrovirus and reverse transcriptase phylogeny [9]. Nickle and coworkers introduced HIV-specific models that showed a consistently superior fit compared with the best general models when analyzing HIV proteins [7].

Influenza viruses are among the most dangerous viruses for birds and humans. They are RNA viruses and belong to the Orthomyxoviridae family. They are divided into three types: influenza A, influenza B, and influenza C, of which influenza A is the most prevalent and dangerous. In recent years, influenza A viruses have caused serious problems for human health and the economy. Currently emerging influenza epidemics include H5N1 ('avian flu') and H1N1. More details about historical and recently emerging influenza pandemics and epidemics can be found at the World Health Organization website (http://www.who.int/csr/disease/influenza/en/).

Theoretical and experimental studies have been conducted extensively for decades to understand the evolution, transmission, and infection processes of influenza viruses [10,11,12,8] (and references therein). Recently, we published the FLU model, which was specifically estimated for influenza viruses. Our extensive experiments showed that FLU is much better than other models when analyzing influenza protein sequences.

2011 Third International Conference on Knowledge and Systems Engineering

Although the FLU model is specific to influenza viruses, it is not specific to protein types. The influenza A virus genome encodes 11 different protein types: HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2 (see Table 1 for more details). These protein types have different structures and evolve at different rates, which raises the need for different amino acid substitution models for different protein types.

In this study, we continue our work on amino acid substitution models for influenza viruses. Since influenza A viruses are the most prevalent and dangerous, we estimated 11 amino acid substitution models for the 11 protein types of influenza A viruses. These models will allow researchers to analyze the evolutionary processes of influenza proteins more precisely.

The paper is organized into five sections. Section 2 (Method) presents the theoretical background of amino acid substitution models and our approach to estimating protein type specific models. Section 3 (Data preparation) describes how we prepared the protein sequences used to estimate the models. Comparisons among the models are reported in Section 4. Conclusions are given in the last section.

2 METHOD

The substitution process at each amino acid site is assumed to be independent of other sites, stationary, and constant over time [13,14]. We can therefore use a homogeneous, continuous-time, and time-reversible Markov process [13,14,15] to model the substitution process between amino acids.

Table 1. Data of the 11 protein types of influenza A viruses (columns: Protein type, #Sequences, #Alignments, Proportion (%))

Figure 1. The four-step approach to estimating protein type specific amino acid substitution models

The model consists of two components: 1) an instantaneous substitution rate 20x20 matrix $Q = \{q_{xy}\}$, where $q_{xy}$ is the number of substitutions from amino acid $x$ to amino acid $y$ per time unit; and 2) an amino acid frequency 20-vector $\pi = \{\pi_x\}$, where $\pi_x$ is the frequency of amino acid $x$. While $\pi$ can be easily estimated from data by counting, $Q$ is the subject of estimation methods.
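For reference, the relations below are the standard textbook form of a time-reversible model [13,14], written with $Q$ for the rate matrix and $\pi$ for the frequency vector. The exchangeability symbol $r_{xy}$ and the transition matrix $P(t)$ are notation introduced here for illustration and do not appear in the original text.

```latex
% Time reversibility (detailed balance) between any two amino acids x and y
\pi_x \, q_{xy} = \pi_y \, q_{yx}

% Usual decomposition of Q into symmetric exchangeabilities r_{xy} = r_{yx}
q_{xy} = r_{xy} \, \pi_y \quad (x \neq y), \qquad
q_{xx} = -\sum_{y \neq x} q_{xy}

% Substitution probabilities after evolutionary time t
P(t) = e^{Qt}
```

Under this decomposition, estimating $Q$ amounts to estimating the 190 exchangeabilities $r_{xy}$ once $\pi$ has been fixed from the data.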

We apply a four-step maximum likelihood approach to estimate protein type specific models, as pictured in Figure 1:

- Data preparation step: Download, clean, classify, and align sequences to create multiple protein alignments (more details are presented in Section 3).

- Constructing tree step: For each protein alignment, use a maximum likelihood method (such as PhyML [16]) to construct a phylogenetic tree using an initial matrix Q (initialized with the FLU matrix).

- Estimating model step: Use an expectation-maximization algorithm (such as XRATE [17]) to train a new model Q' using the protein alignments and reconstructed trees.

- Comparing step: Compare Q and Q'. If Q' is nearly identical to Q, Q' is considered the final model. Otherwise, replace Q by Q' and go back to the constructing tree step.


Extensive experiments show that Q is almost unchanged (Q' ≈ Q) after three iterations.
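The iterative procedure can be sketched as a simple fixed-point loop. This is a minimal outline under stated assumptions, not the authors' pipeline: `build_tree`, `train_model`, and `distance` are hypothetical callables standing in for a tree builder such as PhyML, a trainer such as XRATE, and a matrix comparison, and no real command-line interface of those tools is reproduced.

```python
def estimate_model(alignments, initial_q, build_tree, train_model, distance,
                   tol=0.01, max_iter=10):
    """Alternate tree building and model training until the matrix stabilises.

    Returns the final model and the number of iterations performed.
    """
    q = initial_q
    for iteration in range(1, max_iter + 1):
        # Constructing tree step: one tree per alignment under the current Q.
        trees = [build_tree(aln, q) for aln in alignments]
        # Estimating model step: train Q' from the alignments and trees.
        new_q = train_model(alignments, trees)
        # Comparing step: stop when Q' is nearly identical to Q.
        if distance(q, new_q) < tol:
            return new_q, iteration
        q = new_q
    return q, max_iter

# Toy stand-ins (hypothetical): the "model" is a single number that moves
# halfway towards 1.0 on every round, so the loop visibly converges.
final_q, n_iter = estimate_model(
    alignments=["aln1", "aln2"],
    initial_q=0.0,
    build_tree=lambda aln, q: q,                  # the toy "tree" just carries q
    train_model=lambda alns, trees: (trees[0] + 1.0) / 2.0,
    distance=lambda a, b: abs(a - b),
)
```

Passing the tools in as callables keeps the loop independent of any particular software, which is a design choice of this sketch rather than something stated in the paper.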

3 DATA PREPARATION

On January 7, 2011, there were more than 9,300 complete genomes comprising 200,000 protein sequences in the Influenza database at NCBI (www.ncbi.nlm.nih.gov/genomes/FLU/) [18]. In the database, 95% of sequences are influenza A proteins, including ~9,000 complete genomes and ~190,000 protein sequences. The remaining sequences are from influenza B and C viruses.

The number of available sequences for influenza B and C is not sufficient to estimate protein type specific models for these virus types. We therefore concentrate on estimating models for the 11 protein types of influenza A viruses. The data preparation process is as follows:

- Downloading step: We downloaded 200,000 influenza A protein sequences consisting of more than 27 million amino acids.

- Cleaning step: The data contain a large number of duplicated sequences. We removed the duplicates and obtained ~100,000 unique protein sequences.

- Categorizing step: Sequences were classified into 11 classes corresponding to the 11 protein types: HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2.

- Splitting step: Sequences of the same class were split into subgroups such that each subgroup consists of 5 to 50 sequences.

- Aligning step: Sequences of each subgroup were aligned using the MUSCLE program (default parameters) [19] and subsequently cleaned with the Gblocks program (parameter -b5=h) [20] to eliminate sites containing too many gaps. We selected 2,500 alignments (66,139 sequences, 1,058,987 sites, and 27,588,017 amino acids), each consisting of at least 50 amino acid sites.
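The cleaning and splitting steps above can be sketched as follows. The paper does not say how subgroups were formed, so the random shuffling, the handling of a too-small final group, and both function names are assumptions made purely for illustration.

```python
import random

def deduplicate(sequences):
    """Keep the first occurrence of each distinct sequence string."""
    return list(dict.fromkeys(sequences))

def split_into_subgroups(sequences, min_size=5, max_size=50, seed=0):
    """Shuffle sequences and cut them into subgroups of min_size..max_size."""
    pool = list(sequences)
    random.Random(seed).shuffle(pool)
    groups = [pool[i:i + max_size] for i in range(0, len(pool), max_size)]
    # If the final group is too small, merge it with the previous group
    # and split the result into two roughly equal halves.
    if len(groups) > 1 and len(groups[-1]) < min_size:
        merged = groups.pop(-2) + groups.pop()
        half = len(merged) // 2
        groups += [merged[:half], merged[half:]]
    # A lone group below min_size cannot be repaired and is dropped.
    return [g for g in groups if len(g) >= min_size]

# Toy input: 104 distinct placeholder "sequences" plus two duplicates.
raw = [f"SEQ{i}" for i in range(104)] + ["SEQ0", "SEQ1"]
unique = deduplicate(raw)
subgroups = split_into_subgroups(unique)
sizes = sorted(len(g) for g in subgroups)
```

Each subgroup would then be passed to an aligner such as MUSCLE; that call is deliberately left out because its invocation is not described in the text.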

4 RESULTS

We estimated the amino acid substitution models HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2 for the 11 corresponding protein types of influenza A viruses. To compare different models, we conducted two-fold cross validation. To this end, we randomly divided the dataset of each protein type into two equal subsets, one for training and the other for testing.
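A random division into two equal subsets might look like the sketch below. The function name and the fixed seed are assumptions for reproducibility of the example; the paper does not describe how the random split was implemented.

```python
import random

def two_fold_split(alignments, seed=0):
    """Randomly divide a list of alignments into two halves."""
    items = list(alignments)
    random.Random(seed).shuffle(items)
    half = len(items) // 2
    # With an odd number of items the second subset is one element larger.
    return items[:half], items[half:]

# Toy identifiers standing in for the alignments of one protein type.
alignment_ids = [f"aln_{i}" for i in range(646)]
train, test = two_fold_split(alignment_ids)
```

In a full two-fold cross validation the roles of the two subsets are then swapped, so that each half is used once for training and once for testing.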

Model performance analysis

We compared the 11 protein type specific models with the FLU model (the best existing amino acid substitution model for influenza viruses) by comparing the maximum likelihood trees constructed using the different models. Note that this is a standard way to compare different models.

As expected, experiments showed that the protein type specific models outperformed all other models when analyzing their corresponding protein sequences (see Table 2). For example, the PA model is the best model in 98.37% of cases when analyzing PA alignments. As Table 2 also shows, the NS2 model does not completely outperform the other models: it is the best model in only 65.93% of cases when analyzing NS2 sequences. This is because only a small number of NS2 protein sequences are available for estimating the NS2 model.

Table 3 summarizes the comparisons between FLU and the protein type specific models in terms of log likelihoods when analyzing their corresponding proteins. Note that the greater the log likelihood per site, the better the model. The protein type specific models are clearly better than the FLU model when analyzing their corresponding proteins. For example, the log likelihood per site of the HA model (-16.5699) when analyzing HA proteins is higher than those of the other models.
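The two summary statistics used in this section, the percentage of alignments for which a model is best and the log likelihood per site, can be computed as in the sketch below. The function names and all numbers in the toy input are made up for illustration and are not results from the paper.

```python
from collections import Counter

def best_model_percentages(loglik_by_alignment):
    """Percentage of alignments for which each model has the highest log likelihood."""
    wins = Counter(max(scores, key=scores.get) for scores in loglik_by_alignment)
    total = len(loglik_by_alignment)
    return {model: 100.0 * n / total for model, n in wins.items()}

def loglik_per_site(total_loglik, n_sites):
    """Normalise a log likelihood by alignment length so alignments are comparable."""
    return total_loglik / n_sites

# Toy log likelihoods for four alignments under two models (invented values).
toy = [
    {"HA": -1650.2, "FLU": -1661.7},
    {"HA": -980.4, "FLU": -985.0},
    {"HA": -1203.3, "FLU": -1201.9},
    {"HA": -770.1, "FLU": -774.6},
]
percentages = best_model_percentages(toy)
per_site = loglik_per_site(-1650.2, 100)
```

Because log likelihoods are negative, "higher" means closer to zero, which is why `max` selects the best model for each alignment.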

Table 2. Summary of the FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models when analyzing their corresponding protein type sequences. For example, PA is the best model in 98.37% of PA alignments.

Model     Cases in which the model is the best (%)
PA        98.37
NS1       98.30
NP        97.53
NA        95.76
HA        90.87
M2        89.52
M1        87.50
PB1       87.39
PB2       86.50
PB1-F2    68.33
NS2       65.93


Table 3. Pairwise comparisons between FLU and the HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models in terms of log likelihoods (column headers: FLU (M2); M1 > M2; M1 < M2)

The most important protein types of influenza viruses are the HA and NA proteins. Combinations of different HA and NA protein variants result in different influenza subtypes such as H1N1 and H5N1. Table 1 shows that HA (~26%) and NA (~14%) are the two most prevalent proteins in the database. In the following, we present the analysis of the HA and NA proteins.

Table 4 shows that the HA model is the best model when analyzing HA proteins. The HA model yields the best likelihood trees for 587 out of 646 HA protein alignments (~90.87%). It is the second best in 58 other cases (~8.9%). FLU is the second best model in most cases (587 out of 646 cases).

Table 5 shows a similar result when analyzing NA sequences. The NA model completely outperforms the other models when analyzing NA protein sequences.

Table 4. Results of the five best models when analyzing HA sequences. HA is the best model in 587 out of 646 cases.

PB1-F2 41 1 0 0 0

Table 5. Results of the five best models when analyzing NA sequences. NA is the best model in 361 out of 377 cases.

The NA model ranks first in 361 (95.76%) of 377 cases and second in the remaining cases. FLU is again the second best model in most cases (95.76%). Table 6 presents the log likelihood per site of different models when analyzing HA and NA proteins. The log likelihood per site of the HA (NA) model when analyzing HA (NA) proteins is -16.5699 (-13.0690), which is the best among all models. FLU is the second best model.

Table 6. Log likelihood comparison between the HA (NA) model and other models when analyzing HA (NA) protein sequences

Model     Log likelihood per site (HA proteins)     Model     Log likelihood per site (NA proteins)
PB1-F2    -21.2087                                  PB1-F2    -16.6106


Table 7. Correlations among the 12 models. These models are very different from each other; for example, the correlation between the HA and FLU models is only 0.878.

M1 0.295

Model correlation analysis

The correlations among FLU and the protein type specific models are reported in Table 7. These models are clearly very different from each other. For example, the correlation between the HA (NA) and FLU models is only 0.878 (0.886). The comparison suggests that the FLU model is not specific enough for analyzing different protein types. In other words, protein type specific models should be used to analyze their corresponding proteins.
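The text does not state how the correlation between two models was computed. One plausible reading, assumed here purely for illustration, is the Pearson correlation between the corresponding off-diagonal entries of two symmetric matrices; the function names and the 3x3 toy matrices are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def matrix_correlation(a, b):
    """Correlate the upper-triangle (off-diagonal) entries of two symmetric matrices."""
    n = len(a)
    xs = [a[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [b[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(xs, ys)

# Toy 3x3 symmetric matrices; a real model would be 20x20.
m1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
m2 = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]   # m1 scaled by 2
m3 = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]   # upper-triangle order reversed
same_shape = matrix_correlation(m1, m2)
opposite = matrix_correlation(m1, m3)
```

A correlation near 1 indicates that two matrices rank amino acid exchanges similarly even if their overall scales differ.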

Tree structure analysis

We also analyzed the impact of the matrices on tree structures. To this end, we used the Robinson-Foulds (RF) distance [21] to measure the difference between two tree topologies. The RF distance is the number of bipartitions present in one of the two trees but not the other, divided by the number of possible bipartitions. Thus, the smaller the RF distance between two trees, the closer their topologies. Note that the RF distance ranges from 0.0 to 1.0. We computed RF distances between trees reconstructed using the FLU model and those reconstructed using the protein type specific models.
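A minimal sketch of the normalized RF distance is given below. It assumes trees are written as nested tuples of leaf names, that both trees share the same leaf set of at least four taxa, and that "the number of possible bipartitions" means 2(n - 3), the total number of non-trivial bipartitions in two fully resolved unrooted trees on n leaves. These representation choices and the function names are assumptions of this example.

```python
def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Collect the non-trivial bipartitions induced by the internal branches."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        # Canonical form: store the side of the split without the anchor leaf.
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def normalized_rf(tree_a, tree_b):
    """Bipartitions found in exactly one tree, divided by 2 * (n - 3)."""
    a, b = bipartitions(tree_a), bipartitions(tree_b)
    n = len(leaves(tree_a))
    return len(a ^ b) / (2 * (n - 3))

# Toy five-leaf trees (unrooted, written with a trifurcation at the top).
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
t3 = (("A", "D"), "B", ("C", "E"))
d_same, d_half, d_max = normalized_rf(t1, t1), normalized_rf(t1, t2), normalized_rf(t1, t3)
```

Here `t1` and `t2` share the split separating D and E from the rest but disagree on the other one, while `t1` and `t3` share no non-trivial split at all.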

Figure 2 shows that the tree topologies inferred using FLU and the 11 protein type specific models differ. There are 7,932 cases where the RF distance is 0.1. We even observed many cases where the trees reconstructed using FLU and the protein type specific models are very different. Thus, these models have a strong impact on tree structures.

Figure 2. The Robinson-Foulds distances between trees inferred using FLU and the 11 protein type specific models for proteins of influenza A viruses. The horizontal axis indicates the RF distance between two tree topologies (0.0 to 1.0), and the vertical axis indicates the number of alignments (0 to 9,000).


5 CONCLUSIONS

Influenza viruses are among the most dangerous viruses for human health and the economy. Although they have been the subject of thousands of studies, they continue to attract extensive research as well as funding from governments and pharmaceutical companies.

Through our intensive study of influenza viruses using a very large number of protein sequences, we were able to estimate 11 amino acid substitution models for the 11 protein types of influenza A viruses.

Our experiments showed that the protein type specific models gave better results than FLU, the best existing model for influenza viruses. Model correlation and tree structure analyses showed that these models are very different from each other and have a strong impact on tree structures. The protein type specific models enable researchers to study influenza protein sequences more precisely. We strongly recommend that researchers use protein type specific models to analyze the corresponding protein sequences.

ACKNOWLEDGMENT

This work is partially supported by the TRIG project at the University of Engineering and Technology, VNU Hanoi.

REFERENCES

[1] M. O. Dayhoff, Ed., Atlas of Protein Sequence and Structure, vol. 5. Washington DC: National Biomedical Research Foundation, 1978.
[2] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A Model of Evolutionary Change in Proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed. Washington DC: National Biomedical Research Foundation, 1978, vol. 5, pp. 345-352.
[3] David T. Jones, William R. Taylor, and Janet M. Thornton, "The rapid generation of mutation data matrices from protein sequences," Comput. Appl. Biosci., pp. 275-282, 1992.
[4] Jun Adachi and Masami Hasegawa, "Model of Amino Acid Substitution in Proteins Encoded by Mitochondrial DNA," J. Mol. Evol., pp. 459-468, 1996.
[5] Rasmus Nielsen and Ziheng Yang, "Likelihood Models for Detecting Positively Selected Amino Acid Sites and Applications to the HIV-1 Envelope Gene," Genetics, vol. 148, pp. 929-936, 1998.
[6] Quang Le and Olivier Gascuel, "An improved general amino acid replacement matrix," Mol. Biol. Evol., pp. 1307-1320, 2008.
[7] David C. Nickle, Laura Heath, Mark A. Jensen, Peter B. Gilbert, James I. Mullins, and Sergei L. Kosakovsky Pond, "HIV-Specific Probabilistic Models of Protein Evolution," PLoS ONE, vol. 2, p. e503, 2007.
[8] Cuong Cao Dang, Quang Si Le, Olivier Gascuel, and Vinh Sy Le, "FLU, an amino acid substitution model for," BMC Evolutionary Biology, 2010.
[9] M. W. Dimmic, J. S. Rest, D. P. Mindell, and R. A. Goldstein, "rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny," J. Mol. Evol., vol. 55, pp. 65-73, 2002.
[10] Anthony S. Fauci, "Race against time," Nature, vol. 435, 2009.
[11] Daniel A. Janies, Andrew Hill, Rob Guralnick, Farhat Habib, Eric Waltari, and Ward C. Wheeler, "Genomic Analysis and Geographic Visualization of the Spread of Avian Influenza (H5N1)," Systematic Biology, vol. 56, pp. 321-329, 2007.
[12] Tien D. Nguyen, The Vinh Nguyen, Dhanasekaran Vijaykrishna, Robert G. Webster, Yi Guan, J. S. Malik Peiris, and Gavin J. D. Smith, "Multiple Sublineages of Influenza A Virus (H5N1), Vietnam, 2005-2007," Emerging Infectious Diseases, vol. 14, pp. 632-636, 2008.
[13] Joseph Felsenstein, Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates, 2004.
[14] Ziheng Yang, Computational Molecular Evolution, 1st ed. Oxford University Press, 2006.
[15] "Nucleotide Substitution Models," in The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny, Marco Salemi and Anne-Mieke Vandamme, Eds. Cambridge: Cambridge University Press, 2003, pp. 72-100.
[16] Stephane Guindon and Olivier Gascuel, "A Simple, Fast and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood," Syst. Biol., vol. 52, pp. 696-704, 2003.
[17] Peter S. Klosterman, Andrew V. Uzilov, Yuri R. Bendaña, Robert K. Bradley, Sharon Chao, Carolin Kosiol, Nick Goldman, and Ian Holmes, "XRate: a fast prototyping, training and annotation tool for phylo-grammars," BMC Bioinformatics, vol. 7, p. 428, 2006.
[18] Y. Bao, P. Bolotov, D. Dernovoy, B. Kiryutin, L. Zaslavsky, T. Tatusova, J. Ostell, and D. Lipman, "The Influenza Virus Resource at the National Center for Biotechnology Information," J. Virol., vol. 82, pp. 596-601, 2008.
[19] R. C. Edgar, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucl. Acids Res., vol. 32, pp. 1792-1797, 2004.
[20] J. Castresana, "Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis," Molecular Biology and Evolution, vol. 17, pp. 540-552, 2000.
[21] Joseph Felsenstein, "Distance methods for inferring phylogenies: A Justification," Evolution, vol. 38, pp. 16-24, 1984.
