
Amino acid substitution model for rotavirus


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN DUC CANH

AMINO ACID SUBSTITUTION MODEL FOR ROTAVIRUS

MASTER THESIS
Major: Computer Science

Supervisor: Assoc. Prof. Le Sy Vinh

HA NOI - 2019

Abstract

Modeling protein evolution has been a major field of research in bioinformatics for decades. One popular method to approximate the evolution of proteins is to use an amino acid substitution model, which describes the instantaneous rate at which one amino acid is replaced by another. Such models are useful in many ways and have become a main component of a variety of bioinformatic systems. Many models, such as JTT, WAG and LG, have been estimated using data from various species. Recent research showed that these models might be inappropriate for the analysis of some specific species. Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses, notably rotavirus, a contagious virus that can cause gastroenteritis. These epidemics raise a need for modeling the evolution of these emerging viruses. In this thesis, using data from the Viral Genome Resource at the National Center for Biotechnology Information (NCBI), we propose the ROTA model, which has been specifically estimated for modeling the evolution of rotavirus. Analysis revealed significant differences between ROTA and existing models in amino acid frequencies, exchangeability coefficients and inferred phylogenies. Experiments showed that ROTA characterizes the evolutionary patterns of rotavirus better than other models and should be useful in most systems that require an accurate description of rotavirus evolution.

Acknowledgements

I would like to express my sincere gratitude to my advisor, Assoc. Prof. Le Sy Vinh, for his continuous support of my study and research, and for his patience, motivation, enthusiasm and immense knowledge. His guidance helped me throughout the research and the writing of this thesis. I could not have imagined having a better advisor and mentor for my Master's study.

Besides my advisor, I would like to thank Dr. Dang Cao Cuong and MSc. Le Kim Thu for patiently answering my questions and guiding me in solving the various problems that I faced.

I also thank my friends Can Duy Cat, Nguyen Minh Trang and Le Hai Nam for their continuous encouragement, without which I would never have been able to complete this thesis.

My sincere thanks also go to the Information Faculty of the University of Engineering and Technology, Vietnam National University, Hanoi, for providing me with the necessary facilities to conduct experiments.

Last but not least, I would like to thank my parents for giving birth to me in the first place and supporting me spiritually throughout my life.

Declaration

I declare that this thesis has not been submitted for a higher degree to any other university or institution.

Table of Contents

Acknowledgements
Table of Contents
List of Figures
1 Background
  1.1 Sequence evolution
    1.1.1 Heredity materials
    1.1.2 Evolution and homologous sequences
  1.2 Modeling sequence evolution
    1.2.1 Sequence alignment
    1.2.2 General time-reversible substitution model
    1.2.3 Model of rate heterogeneity
    1.2.4 Available amino acid substitution models
  1.3 Phylogenetic trees
    1.3.1 Overview
    1.3.2 Phylogenetic tree reconstruction
    1.3.3 Robinson-Foulds distance
    1.3.4 Phylogenetic hypothesis testing
2 Method
  2.1 Modeling method
  2.2 Model estimation process
3 Results and discussion
  3.1 Data preparation
  3.2 Model analysis
  3.3 Performance on testing alignments
  3.4 Tree topology analysis
    3.4.1 Robinson-Foulds distance
    3.4.2 Shimodaira-Hasegawa test
  3.5 Protein-specific models

NCBI    National Center for Biotechnology Information
RF      Robinson-Foulds

List of Figures

1.1 A sample phylogenetic tree of 5 sequences.

1.2 Two trees $T_1$ and $T_2$ describe the same set of 5 sequences {1, 2, 3, 4, 5} but have different topologies. Tree $T_1$ has two bipartitions {1, 2}|{3, 4, 5} and {1, 2, 3}|{4, 5}, while tree $T_2$ has two bipartitions {1, 3}|{2, 4, 5} and {1, 2, 3}|{4, 5}. Bipartition {1, 2}|{3, 4, 5} is present only in $T_1$, whereas bipartition {1, 3}|{2, 4, 5} is present only in $T_2$. Bipartition {1, 2, 3}|{4, 5} occurs in both $T_1$ and $T_2$. The standardized Robinson-Foulds distance between $T_1$ and $T_2$ is 2/4.

3.1 The exchangeability coefficients in the ROTA, FLU and JTT models. The black (gray or white) bubble at the intersection of row X and column Y represents the exchange rate between amino acid X and amino acid Y in ROTA (FLU or JTT).

3.2 The relative differences between exchangeability coefficients in ROTA and the other two models. The size of the bubble corresponds to the value $(ROTA_{XY} - M_{XY})/(ROTA_{XY} + M_{XY})$, where X and Y are any of the 20 amino acids and M is FLU for subfigure a) or JTT for subfigure b). The black bubble with value 2/3 (1/3) indicates that the coefficient in ROTA is 5 (2) times larger than the corresponding one in M, whereas the white bubble with value 2/3 (1/3) indicates that the coefficient in ROTA is 5 (2) times smaller than the corresponding one in M.

3.3 Amino acid frequencies of the ROTA, FLU and JTT models.

List of Tables

1.1 Twenty different amino acids.

3.1 Number of sequences grouped by protein type in the training and testing datasets.

3.2 The Pearson's correlations between ROTA and 10 widely used models.

3.3 Comparisons of ROTA and 10 other models in constructing maximum likelihood trees.

3.4 Comparisons of ROTA and 10 other models in constructing maximum likelihood trees.

3.5 The normalized Robinson-Foulds distance between trees inferred using ROTA versus the FLU and JTT models for 12 testing multiple alignments.

3.6 The number of test alignments for which trees inferred from existing models are significantly worse than those from ROTA, for the 11 testing alignments for which ROTA is the best-fit model.

3.7 Comparison of log-likelihood per site between ROTA and protein-specific models. For each alignment of protein P, $ROTA_P$ is the model estimated using only sequences of protein P from the training dataset.

A.1 Amino acid exchangeability matrix (the first 20 rows) and frequency vector (the last row) of ROTA.

Motivation

A major field of research in bioinformatics is to understand the evolutionary relationships among species. In recent decades, protein data have accumulated on a large scale and are used in many methods to model the evolutionary process. One popular method is to use protein data to estimate amino acid substitution models, which describe the instantaneous rates at which amino acids are replaced by other amino acids. This kind of model has become crucial for a wide range of bioinformatics systems thanks to its various applications.

The first application of amino acid substitution models is to help create protein sequence alignments [1]. Methods used to align sequences often involve a score matrix, which represents the penalty incurred when a mutation occurs. Typically for nucleotide models, a score matrix can be simply +1 for a matched site, -1 for a mismatched site and -2 for an indel between two nucleotide sequences. Such a simple scoring scheme, however, is not suitable for comparing protein sequences. Amino acids, the building blocks that make up protein sequences, have biochemical properties that influence their exchangeability in the process of evolution. For instance, amino acids of similar molecular sizes are substituted for each other more often than those of widely different molecular sizes. Other properties, such as the tendency to bind with water molecules, also influence the probability of substitution. Therefore, it is important to use a scoring system that reflects these properties, which can be derived from amino acid substitution models.

Amino acid substitution models are also frequently used to infer protein phylogenetic trees using the maximum likelihood approach [2]. In this method, given a phylogeny of protein sequences with branch lengths, an evolutionary model that allows the probability of this tree to be computed is needed to select the maximum likelihood tree. A substitution model allows us to compute the transition probabilities $P_{ij}(t)$, the probability that state $j$ will exist at the end of a branch of length $t$ if the state at the start of the branch is $i$, and thus makes it possible to calculate the likelihood of the data given a phylogenetic tree.

Moreover, these models are used to estimate pairwise genetic distances between protein sequences, which subsequently serve as inputs for distance-based phylogenetic analysis [3]. The objective of distance methods is to fit a tree to a matrix of pairwise genetic distances. For every two sequences, the observed distance is a single value based on the number of positions at which the two sequences differ. The observed distance underestimates the true genetic distance because some of the positions may have experienced multiple substitution events. In distance-based methods, a better estimate of the number of substitutions that have actually occurred is often obtained by applying a substitution model.

Many methods have been proposed to estimate general amino acid substitution models from large and diverse databases [2] [4]. The common methods are counting and maximum likelihood approaches. Estimating amino acid substitution models is much more challenging than estimating nucleotide substitution models due to the large number of parameters to be optimized. For example, the general time-reversible model for nucleotides contains 8 parameters, compared with 208 parameters for models of amino acid substitutions. Thus, amino acid substitution models are typically estimated from large datasets. The first model was the Dayhoff model [5], which was estimated from highly similar protein sequences. As more protein sequences became available, the JTT model [6] was estimated using the same counting method. However, counting methods are limited to closely related protein sequences.

The maximum likelihood method was proposed later [7] to estimate the mtREV model from 20 complete vertebrate mtDNA-encoded protein sequences. The WAG model [8] was estimated using a maximum likelihood method from 182 globular protein families. The WAG model produced better likelihood trees than the Dayhoff and JTT models for a large number of globular protein families. The maximum likelihood method was improved by incorporating the variability of evolutionary rates across sites into the estimation process to create the LG model [4] from the Pfam database. Experiments showed that the LG model gave better results than other models in terms of constructing maximum likelihood trees.

Although a number of general models have been estimated from protein sequences of various species, they might be inappropriate for a particular set of species due to differences in the evolutionary processes of these species. A number of specific amino acid substitution models for important species have been introduced, such as the rtREV model [9] for retroviruses and the HIVb and HIVw models [10] for HIV-related analysis. Recently, the FLU model was estimated for influenza virus; it was shown to be highly different from widely used existing models and more suitable for influenza protein analysis [11].

Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses such as Ebolavirus, Middle East respiratory syndrome coronavirus (MERS-CoV) and rotavirus, a contagious virus that can cause gastroenteritis. These epidemics raise a need to model the evolution of these viruses. The Viral Genome Resource [12] at the National Center for Biotechnology Information (NCBI), which acts as a repository for protein sequences of various emerging viruses, provides the data necessary to model viral protein evolution.

In this thesis, data from the Viral Genome Resource at NCBI are used to estimate such an amino acid substitution model for rotavirus, an emerging virus that is the leading cause of diarrhea in young children. A maximum likelihood approach that allows for the variability of evolutionary rates across sites is used as the method of model estimation.

The main contribution of this work is to provide a rotavirus-specific substitution model, which would benefit subsequent research on understanding the evolutionary patterns of this type of virus. A detailed investigation of the model is conducted to understand how it differs from existing ones and to evaluate its performance on other datasets.

Outline

This thesis is divided into 3 main chapters. The first chapter provides the background knowledge needed to understand the estimation process in the next chapter. Chapter 2 describes how the evolution of rotavirus can be modeled and which steps the model estimation process follows. Chapter 3 describes the data preparation process, then investigates the rotavirus-specific model obtained from the process described in Chapter 2 (called ROTA for short) to highlight its characteristics and compare its performance on a testing dataset with that of other existing models. The results are also discussed in Chapter 3.

Chapter 1

Background

This chapter explains the terminology and concepts necessary to understand the estimation process of an amino acid substitution model. First, section 1.1 describes the evolution of sequences, explaining the heredity materials that are passed down from generation to generation and the homologous sequences that are the main subjects of investigation in understanding the evolutionary process. Subsequently, section 1.2 explains popular methods for modeling sequence evolution and reviews existing amino acid substitution models that are commonly used in practice. The final section discusses phylogenetic trees, which are used as the visual representation of the evolutionary process, and then addresses the methods that we use later to compare these phylogenetic trees.

1.1 Sequence evolution

1.1.1 Heredity materials

In molecular biology, nucleic acids are known to be the heredity materials, which are replicated and passed down from parents to children. There are two types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). In living organisms, DNA is the molecule that carries genetic information. Many viruses use RNA as their genetic material instead.

DNA and RNA are composed of smaller subunits called nucleotides [13]. Nucleotides are linked together through chemical bonds in a linear fashion and form DNA and RNA sequences. There are 4 types of nucleotides in DNA, which are named after the 4 different nitrogenous bases making up these nucleotides: A, C, G and T (short for adenine, cytosine, guanine and thymine, respectively). In RNA, base T is replaced by base U (uracil).

DNA (or RNA in some viruses) instructs cells to produce one of the most important types of macromolecules: proteins. Proteins play a critical role in the continuity of life since they form the structure of organisms, provide many functions and regulate the body. Proteins are sequences of amino acids. There are 20 different amino acids found in nature (see Table 1.1), each of which has different chemical properties. A change of a single amino acid in a protein sequence could have a subtle effect, change the protein function or cause a complete loss of function.

Only some regions in DNA and RNA contain the instructions to encode proteins. Those coding regions are called genes. Within a gene, three consecutive nucleotides, often called a codon, encode one amino acid. The process of creating a protein based on a gene is called translation. Different codons can encode the same amino acid, which explains why, although there are 64 different combinations of 3 nucleotides, only 20 types of amino acids are present. To be precise, 61 codons encode amino acids while the other three are stop codons. As their name suggests, these stop codons instruct cells to stop the translation process.

1.1.2 Evolution and homologous sequences

The well-known theory of evolution proposed by Darwin [14] suggests that species have evolved from a common ancestor. At the molecular level, living organisms share:

• The same genetic material (nucleic acids)

• The same table of encodings (the universal genetic code)

• The same molecular building blocks (such as nucleotides and amino acids)

These shared features strongly suggest that living things today are descendants of a common ancestor, or at least of highly similar ancestors. Heredity materials, the genetic code and the process of translation are passed down from generation to generation, which results in present-day organisms having many similar features at the molecular level.

Table 1.1: Twenty different amino acids.

Name            Three-letter code   One-letter code
Alanine         Ala                 A
Cysteine        Cys                 C
Aspartic Acid   Asp                 D
Glutamic Acid   Glu                 E
Phenylalanine   Phe                 F
Glycine         Gly                 G
Histidine       His                 H
Isoleucine      Ile                 I
Lysine          Lys                 K
Leucine         Leu                 L
Methionine      Met                 M
Asparagine      Asn                 N
Proline         Pro                 P
Glutamine       Gln                 Q
Arginine        Arg                 R
Serine          Ser                 S
Threonine       Thr                 T
Valine          Val                 V
Tryptophan      Trp                 W
Tyrosine        Tyr                 Y

Children, however, often inherit genes from their parents with modifications (also called mutations) due to the error-prone replication process. Mutations can be harmful or beneficial. When mutations occur, the mutated organisms may die, develop new traits or even evolve into new species. Millions of years of replication and mutation have resulted in the diversity of species that we see today.

Mutations can be small-scale or large-scale. In small-scale mutations, genes are affected by changes in one or a small number of nucleotides. If only a single nucleotide is mutated, it is called a point mutation. There are three types of point mutations:

• Substitutions are changes from one nucleotide to another. A substitution may have no effect if it does not change the corresponding gene product (synonymous substitution) or may alter an amino acid in the protein sequence (nonsynonymous substitution). In theory, every type of nucleotide can be substituted by any other type at any position in the sequence. Lethal substitutions, however, are not propagated and are hardly found in nature.

• Insertions add extra nucleotides into the sequence. This may shift the reading frame of codons or change the splice sites of a gene (the positions at which genes are cut and then reassembled before translation occurs), both of which greatly affect the encoded proteins.

• Deletions remove nucleotides from the original sequence. Like insertions, deletions may change the reading frame or splice sites and alter the gene products.

Large-scale mutations are more complicated and affect one or many genes. Some types of large-scale mutations include duplications, which create many copies of the same sequence; deletions of a large region of a gene or of many genes; and translocations, which exchange regions between different DNA molecules.

Sequences of nucleotides or amino acids are said to be homologous if they derive from a common ancestral sequence. Homologous sequences are not necessarily the same as their ancestor due to mutations. Investigating homologous sequences is an important task, as understanding the differences among homologous sequences can reveal the evolutionary relationships among species or among individuals of a species. Identifying homology, however, is a hard task because of the various mutations in the replication process.

In the following sections, we use the term character to denote a nucleotide or an amino acid. Homologous characters are nucleotides or amino acids that derive from a common nucleotide or amino acid.

1.2 Modeling sequence evolution

1.2.1 Sequence alignment

Homologous sequences may not have the same length because of different numbers of insertions and deletions in these sequences. Before the evolutionary relationships can be analysed, homologous sequences need to be aligned. Aligning is the process of inserting the gap character '-' into the input character sequences so that these sequences have equal length.

There may be many different ways to align two sequences. For example, for the two sequences TGAATAGCC and TGAAAGTCC, these are acceptable alignments:

Position    1 2 3 4 5 6 7 8 9
Sequence 1  T G A A T A G C C
Sequence 2  T G A A A G T C C

Position    1 2 3 4 5 6 7 8 9 10
Sequence 1  T G A A T A G - C C
Sequence 2  T G A A - A G T C C

In the absence of the ancestral sequence, the length of the true alignment and the types of mutations that occurred are unknown. In the above example, the first alignment has a length of 9 nucleotides and there are 3 substitutions, at positions 5, 6 and 7. In contrast, the second alignment has a length of 10 nucleotides and there is one deletion (or one insertion of T) at position 5 and one insertion of T (or one deletion) at position 8. Since it is impossible to distinguish whether insertions or deletions have actually occurred, they are often referred to as indels.

A pairwise alignment can be constructed in $O(pq)$ time using the Needleman-Wunsch algorithm [15], where $p$ and $q$ are the lengths of the two sequences. The Needleman-Wunsch algorithm is a dynamic programming algorithm that assigns a score to each alignment and finds the alignments with the highest score. A so-called score matrix is used, which defines a score for every event. Different scoring systems exist. One simple system is: +1 for a match, −1 for a mismatch and −2 for an indel.

Given two DNA sequences $X = (x_1, x_2, \ldots, x_p)$ and $Y = (y_1, y_2, \ldots, y_q)$ together with a scoring matrix $C$ and the character set $\mathcal{A} = \{A, C, G, T, -\}$, the highest-scoring alignments can be found using a recursive formula:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + C(x_i, y_j) \\ F(i-1, j) + C(x_i, -) \\ F(i, j-1) + C(-, y_j) \end{cases}$$

where

• $C(x_i, y_j)$ is the score when $x_i$ is aligned with $y_j$

• $C(x_i, -)$ or $C(-, y_j)$ is the score when there is an indel

• $F(i, j)$ is the best score when aligning the two sequences $X_i = (x_1, x_2, \ldots, x_i)$ and $Y_j = (y_1, y_2, \ldots, y_j)$

[...] or species. Score matrices can be derived from substitution models.
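The dynamic programming idea behind the Needleman-Wunsch algorithm can be illustrated with a minimal Python sketch. It assumes the simple scoring system mentioned above (+1 for a match, −1 for a mismatch, −2 for an indel) and returns only the best score, not the alignment itself. The function name `nw_score` and its parameters are illustrative and are not taken from the thesis.

```python
def nw_score(x, y, match=1, mismatch=-1, indel=-2):
    """Best global alignment score of x and y in O(len(x) * len(y)) time."""
    p, q = len(x), len(y)
    # f[i][j] holds the best score for aligning x[:i] with y[:j].
    f = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        f[i][0] = i * indel          # x[:i] aligned against gaps only
    for j in range(1, q + 1):
        f[0][j] = j * indel          # y[:j] aligned against gaps only
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + pair,   # align x_i with y_j
                          f[i - 1][j] + indel,      # x_i against a gap
                          f[i][j - 1] + indel)      # y_j against a gap
    return f[p][q]


print(nw_score("TGAATAGCC", "TGAAAGTCC"))
```

For the two example sequences, the ungapped alignment scores 6 − 3 = 3 under this scheme, while the alignment with two indels scores 8 − 4 = 4, so the sketch should report 4.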

A pairwise sequence alignment only reveals the relationship between two sequences, whereas the objective of many bioinformatic analyses is to investigate the relationships among a set of sequences. Such analyses require a multiple sequence alignment (MSA), which is a matrix of characters such that row $i$ represents the characters of sequence $i$ and column $j$ represents the homologous characters at position $j$.

Multiple sequence alignments can be constructed using dynamic programming [16]. The time complexity of this approach is $O(m^n 2^n)$, where $n$ is the number of sequences and $m$ is the length of the alignment. This computational burden limits the approach to only a few sequences. Many approximate methods have been proposed for larger numbers of sequences, such as CLUSTALW [17] and MUSCLE [18].

1.2.2 General time-reversible substitution model

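A popular method to model sequence evolution is to use a substitution model, which describes the process by which characters in a character set change into others. The substitution process can be modeled by a time-continuous, discrete-state Markov model $Q = (q_{xy})$ [19]. For sequences with a character set $\mathcal{A}$, $Q$ is a $|\mathcal{A}| \times |\mathcal{A}|$ matrix in which the value $q_{xy}$ at row $x$, column $y$ represents the instantaneous rate at which character $x$ changes into character $y$ for $x \neq y$. The value $q_{xx}$ is calculated such that the sum of every row equals 0. For the nucleotide substitution model, the $Q$ matrix is as follows:

$$Q = \begin{pmatrix}
q_{AA} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & q_{CC} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & q_{GG} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & q_{TT}
\end{pmatrix}$$

Substitution models are usually assumed to be stationary, which means the frequency $\Pi = (\pi_x)$ of every available state does not change with time. The frequency vector $\Pi$ and the instantaneous substitution rate matrix $Q$ are dependent:

$$\Pi Q = 0$$

In practice, the $Q$ matrix is typically normalized such that the expected number of substitutions per site per unit of time is one:

$$-\sum_{x} \pi_x q_{xx} = 1 \tag{1.4}$$

The general time-reversible (GTR) substitution model makes the following assumptions:

• Time-homogeneity: The substitution rates do not change with time.

• Stationarity: The frequencies of all states are stable.

• Time-reversibility: The flow from $x$ to $y$ equals the flow from $y$ to $x$, that is, $\pi_x q_{xy} = \pi_y q_{yx}$. In particular, if two states have the same frequency, the rate at which $x$ changes to $y$ and the rate at which $y$ changes to $x$ are the same.

More specifically, a GTR substitution model for nucleotides can be described by the instantaneous substitution rate matrix $Q$ with entries

$$q_{xy} = r_{xy} \pi_y \quad (x \neq y), \qquad q_{xx} = -\sum_{y \neq x} q_{xy},$$

where $R = (r_{xy})$ is the symmetric matrix of exchangeability coefficients ($r_{xy} = r_{yx}$).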
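As a minimal sketch of how such a rate matrix can be assembled, the Python snippet below assumes the usual GTR parameterization $q_{xy} = r_{xy}\pi_y$ with symmetric exchangeabilities and rescales the result to one expected substitution per unit of time. The function name `build_gtr_q`, the input layout and the numeric values are illustrative assumptions, not values from the thesis.

```python
def build_gtr_q(exchange, freqs):
    """Build a normalized GTR rate matrix.

    exchange: symmetric n x n list of lists of exchangeabilities
              (diagonal entries are ignored); hypothetical input layout.
    freqs:    list of n state frequencies that sum to 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                q[x][y] = exchange[x][y] * freqs[y]
        q[x][x] = -sum(q[x])          # each row sums to zero
    # Expected number of substitutions per unit time before rescaling.
    rate = -sum(freqs[x] * q[x][x] for x in range(n))
    return [[value / rate for value in row] for row in q]


# Illustrative values only (order A, C, G, T); not estimated from any data.
R = [[0.0, 1.0, 4.0, 1.0],
     [1.0, 0.0, 1.0, 4.0],
     [4.0, 1.0, 0.0, 1.0],
     [1.0, 4.0, 1.0, 0.0]]
PI = [0.3, 0.2, 0.2, 0.3]
Q = build_gtr_q(R, PI)
```

Because the exchangeabilities are symmetric, the resulting matrix satisfies $\pi_x q_{xy} = \pi_y q_{yx}$ by construction, which is the time-reversibility property.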

The GTR substitution model for nucleotides has 8 free parameters (3 parameters for the frequency vector $\Pi$, because its elements sum to 1, and 5 parameters for the exchangeability coefficient matrix $R$, due to the normalization in equation 1.4). The GTR substitution model for amino acids can be described in a similar manner, but with many more free parameters (208, namely 19 frequencies and 189 exchangeability coefficients).

Calculate transition probability

From the instantaneous substitution rate matrix $Q$, one can derive a transition probability matrix $P(t) = \{P_{xy}(t)\}$, where $P_{xy}(t)$ is the probability that character $x$ changes to character $y$ during the evolutionary time $t$. If $Q$ is normalized as in equation 1.4, $t$ is the time in which an expected number of $t$ substitutions occur per site. The probability matrix $P(t)$ is useful for constructing phylogenetic trees using the maximum likelihood method and for generating sequences in evolution simulations.

The $Q$ matrix of the GTR model is diagonalizable [20]; hence, $P(t)$ can be calculated efficiently using the decomposition of $Q$. Specifically,

$$P(t) = e^{Qt} = U e^{\Lambda t} U^{-1}$$

where

• $|\mathcal{A}|$ is the size of the character set (the number of possible states)

• $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_{|\mathcal{A}|}\}$ is the $|\mathcal{A}| \times |\mathcal{A}|$ diagonal matrix of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{|\mathcal{A}|}$ of $Q$

• $U = \{u_1, u_2, \ldots, u_{|\mathcal{A}|}\}$ is the matrix of the corresponding eigenvectors of $Q$ and $U^{-1}$ is its inverse
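A minimal Python sketch of computing $P(t) = e^{Qt}$ follows. To stay free of external dependencies, it swaps the eigendecomposition for a truncated power series combined with scaling and squaring, which should give the same matrix up to rounding error. The Jukes-Cantor rate matrix used as input is an illustrative assumption, chosen because its transition probabilities have a known closed form to check against; all function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) with a Taylor series on Q t / 2**squarings,
    followed by repeated squaring of the result."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    power = [row[:] for row in result]
    for k in range(1, terms + 1):
        # After this step, power equals A^k / k!.
        power = [[v / k for v in row] for row in mat_mul(power, a)]
        result = [[result[i][j] + power[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Jukes-Cantor rate matrix, normalized to one expected substitution per unit time.
Q_JC = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = transition_matrix(Q_JC, 0.5)
```

For this particular model, the diagonal entries should agree with the closed form $1/4 + (3/4)e^{-4t/3}$, and every row of the result should sum to 1.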

Estimate genetic distances

The substitution model is also useful for estimating the genetic distance between homologous sequences. The genetic distance $d(X, Y)$ between two sequences $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ is the number of substitutions that have occurred between $X$ and $Y$. The naive approach to estimating the genetic distance is to observe the difference between the two sequences:

$$d(X, Y) = \frac{\sum_{i=1}^{m} \delta(x_i, y_i)}{m}$$

where $\delta(x_i, y_i)$ is 1 if $x_i \neq y_i$ and 0 otherwise. The observed distance underestimates the true genetic distance because some substitutions are hidden:

• Multiple substitutions: Two or more substitutions have occurred at one position but at most one substitution is observed. For example, A changes to C, then to T, but only one substitution from A to T is found by observation. If T then changes to A, no substitution is observed (back substitution).

• Parallel substitutions: The same substitution occurs in both sequences and no substitution is observed. For example, the ancestral nucleotide is A and A changes to C in both descendant sequences. The observed distance between the descendants is 0 while the actual distance is 2.

Once the substitution process is modeled by the rate matrix $Q$, the genetic distance $d$ between two sequences can be estimated using the maximum likelihood (ML) approach with the following likelihood function [19]:

$$L(d) = \prod_{i=1}^{m} \pi_{x_i} P_{x_i y_i}(d)$$

$$d^* = \operatorname*{argmax}_{d \ge 0} \{L(d)\} \tag{1.14}$$
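The gap between the observed distance and a maximum likelihood estimate can be illustrated with a small Python sketch. It assumes the Jukes-Cantor model (equal frequencies and equal rates), which is far simpler than the models discussed in this thesis, and it maximizes the log-likelihood with a ternary search; both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def observed_distance(x, y):
    """Fraction of aligned positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def jc69_log_likelihood(x, y, d):
    """log L(d) = sum_i log(pi_{x_i} * P_{x_i y_i}(d)) under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return sum(math.log(0.25 * (p_same if a == b else p_diff))
               for a, b in zip(x, y))


def ml_distance(x, y, lo=1e-9, hi=10.0, steps=200):
    """Maximize the log-likelihood over d by ternary search.

    The search interval [lo, hi] is an arbitrary illustrative choice.
    """
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if jc69_log_likelihood(x, y, m1) < jc69_log_likelihood(x, y, m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


X, Y = "TGAATAGCC", "TGAAAGTCC"
p = observed_distance(X, Y)
d_ml = ml_distance(X, Y)
print(p, d_ml)
```

For these two sequences the observed distance is 3/9, and the estimate under this model should be larger than that, in line with the well-known Jukes-Cantor correction $-\frac{3}{4}\ln(1 - \frac{4}{3}p)$.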


1.2.3 Model of rate heterogeneity

The GTR substitution model as described above represents the evolution of one site in the analysed sequences. It is unrealistic to assume that every site evolves according to the same model. For example, the third position in a codon usually has much faster substitution rates than the other two positions [21]. A model that has varying substitution rates among sites is called a model of rate heterogeneity.

Using different substitution models for different sites may cause the model to overfit the data. Thus, many models of heterogeneous substitution rates have been proposed [22] [23]. A common type of rate heterogeneity model [22] is a two-state model that categorizes sequence sites as either invariable or variable. Every site $i$ has a substitution rate scaling factor $r_i$, which is zero for invariable sites.

Determining whether individual sites are variable or invariable is impossible, because variable sites in the dataset may happen to be unvaried, or some sites may have undergone back substitutions. One more parameter $\theta$ is therefore introduced to represent the proportion of invariable sites.

Another widely used model of rate heterogeneity is the Gamma-distributed rate heterogeneity model [23]. In this type of model, substitution rate scaling factors across sites are drawn from a $\Gamma$-distribution with expectation 1.0 and variance $1/\alpha$, $\alpha > 0$, whose density is

$$g(r) = \frac{\alpha^{\alpha} r^{\alpha - 1} e^{-\alpha r}}{\Gamma(\alpha)}$$

where

$$\Gamma(\alpha) = \int_0^{\infty} e^{-t} t^{\alpha - 1} \, dt \tag{1.16}$$

One can adjust the rate heterogeneity by adjusting the parameter $\alpha$, which is called the shape parameter of the $\Gamma$-distribution. Generally speaking, the smaller the shape parameter is, the stronger the rate heterogeneity among sites becomes. For example, setting $\alpha = 0.5$ means that most sites hardly undergo substitutions but some other sites have high substitution rates. In contrast, if $\alpha = 5$, sites have highly similar substitution rates (the scaling factors of most sites are approximately 1.0).

A discrete $\Gamma$-distribution can be used instead of its continuous version. In the discrete $\Gamma$-distributed rate heterogeneity model, a parameter $c$, which is the number of substitution rate scaling factor categories $r_1, r_2, \ldots, r_c$, is introduced [24].
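One common way to obtain the category rates is to split the distribution into $c$ equal-probability categories and use the mean rate of each category. The Python sketch below assumes that discretization (the thesis does not state which variant it uses) and relies only on the standard library, computing the incomplete gamma function by its power series and quantiles by bisection. All function names are hypothetical.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def gamma_quantile(a, prob):
    """x such that P(a, x) = prob for a unit-rate gamma, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(a, hi) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if reg_lower_gamma(a, mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def discrete_gamma_rates(alpha, c):
    """Mean rate of each of c equal-probability categories (overall mean 1)."""
    # Category boundaries on the unit-rate scale; the last boundary is infinity.
    cuts = [gamma_quantile(alpha, k / c) for k in range(1, c)]
    upper = [reg_lower_gamma(alpha + 1.0, x) for x in cuts] + [1.0]
    lower = [0.0] + upper[:-1]
    return [c * (u - l) for u, l in zip(upper, lower)]


rates = discrete_gamma_rates(0.5, 4)
print(rates)
```

With $\alpha = 0.5$ and four categories, the lowest category rate should be close to zero and the highest well above 1, which matches the description of strong heterogeneity for small shape parameters.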

A hybrid model that integrates the two-state model and the $\Gamma$-distribution model is also commonly used in practice. In this type of model, some sites are invariable according to the parameter $\theta$, while the other sites are variable, with substitution rate scaling factors drawn from the discrete $\Gamma$-distribution with shape parameter $\alpha$ and $c$ rate categories.

The parameter $c$ is typically set in advance, while the proportion of invariable sites $\theta$ and the shape parameter $\alpha$ are estimated from data. The estimation of these extra parameters has been implemented in phylogenetic packages such as PHYML [25] and IQ-TREE [26].

1.2.4 Available amino acid substitution models

Amino acid substitution models have many more free parameters than their nucleotide counterparts. The parameters are usually estimated from data; therefore, substitution models for amino acids are called empirical substitution models. Different methods have been proposed to estimate amino acid substitution models.

The Dayhoff model [5] was the first model describing the process of amino acid substitutions. The model was estimated using a counting method with 71 sets of closely related proteins and 1572 observed substitutions. The counting method was pioneering at the early stage of substitution model research. The JTT model [6] was estimated using the same counting method but with a larger protein dataset. The BLOSUM62 model [27] also employed a counting approach, but only alignments that were at least 62% similar were used to count amino acid substitutions.

The drawback of the counting method is that it is only applicable to closely related sequences. Later research [28] introduced the resolvent method and estimated the VT model from various types of protein sequences.

The maximum likelihood method was used to estimate the mtREV model [7] based on 20 complete vertebrate mtDNA-encoded protein sequences. The WAG model [8] was estimated using an approximate maximum likelihood approach based on 3,905 globular protein sequences from 182 protein families. The LG model [4] incorporated rate heterogeneity across sites into the estimation process, using protein sequences from the Pfam database. The LG model was shown to be better than the previous models in terms of constructing maximum likelihood phylogenetic trees.

Although many models have been estimated from large and diverse databases of protein sequences from a wide range of species (these models are often referred to as general models), recent studies have shown that they might be inappropriate for representing the evolutionary process of some specific species. A number of species-specific amino acid substitution models have been introduced, such as the rtREV model [9], which is used for inference on retroviruses, and the HIV-specific models [10] (HIVb, HIVw) estimated from HIV data. Recently, the FLU model [11] was estimated for influenza virus and is highly different from widely used existing models. Experiments showed that species-specific models better characterize the evolutionary process of the corresponding species' sequences.

1.3 Phylogenetic trees

1.3.1 Overview

Phylogenetic trees are trees that represent the evolutionary relationships among sequences. Phylogenetic analysis typically uses binary trees, either rooted or unrooted, to represent phylogeny. The sequences of interest are the leaf nodes, while the internal nodes represent divergence events, in which an ancestral sequence splits into 2 descendant sequences. Relationships between sequences are described by the branches of the tree. Two sequences are said to be more closely related if they have a more recent common ancestor. Figure 1.1 is an example of a phylogenetic tree for 5 sequences A, B, C, D and E. In this example, A and B are more closely related than A and C because the common ancestor of A and B is more recent than the common ancestor of A and C. Each branch of a phylogenetic tree can be assigned a value, called the branch length, to represent the genetic distance between the connected nodes.

Figure 1.1: A sample phylogenetic tree of 5 sequences.

There are many different binary trees that can represent the evolution of the same set of sequences. Without regard for branch lengths, the numbers of possible rooted and unrooted binary trees for a set of $n$ sequences grow faster than exponentially with $n$. The number of unrooted binary trees $B(n)$ for $n \ge 3$ sequences is

$$B(n) = \prod_{i=3}^{n} (2i - 5) = \frac{(2n - 5)!}{2^{n-3} (n - 3)!}$$
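To give a feeling for how quickly the number of tree topologies grows, the following Python sketch counts unrooted binary trees using the standard product formula $(2n - 5)!! = 3 \cdot 5 \cdots (2n - 5)$. The function name `unrooted_tree_count` is illustrative.

```python
def unrooted_tree_count(n):
    """Number of unrooted binary tree topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    for i in range(3, n + 1):
        # Leaf i can be attached to any of the 2i - 5 branches
        # of a tree that already contains i - 1 leaves.
        count *= 2 * i - 5
    return count


for n in (5, 10, 20):
    print(n, unrooted_tree_count(n))
```

Five sequences already allow 15 unrooted topologies and ten sequences allow 2,027,025, which is why exhaustive enumeration of trees is feasible only for very small datasets.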

1.3.2 Phylogenetic tree reconstruction

Three optimization criteria are often used to reconstruct phylogenetic trees from homologous sequences: minimum evolution (ME), maximum parsimony (MP) and maximum likelihood (ML).

Date posted: 20/03/2021, 19:33
