VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Duc Canh
AMINO ACID SUBSTITUTION MODEL
FOR ROTAVIRUS
MASTER THESIS
Major: Computer Science
Supervisor: Assoc. Prof. Le Sy Vinh
HANOI - 2019
Abstract
Modeling protein evolution has been a major field of research in bioinformatics for decades. One popular method to approximate the evolution of proteins is to use an amino acid substitution model, which reveals the instantaneous rate at which an amino acid changes into another amino acid. Such a model is useful in many different ways and has become a main component of a variety of bioinformatics systems. Many models such as JTT, WAG and LG have been estimated using data from various species. Recent research showed that these models might be inappropriate for the analysis of some specific species. Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses, notably rotavirus, a contagious virus that can cause gastroenteritis. These epidemics raise a need for modeling the evolution of these emerging viruses. In this thesis, using the data from the Viral Genome Resource at the National Center for Biotechnology Information (NCBI), we propose the ROTA model, which has been specifically estimated for modeling the evolution of rotavirus. Analysis revealed significant differences between ROTA and existing models in amino acid frequencies, exchangeability coefficients as well as inferred phylogenies. Experiments showed that ROTA better characterizes the evolutionary patterns of rotavirus than other models and should be useful in most systems that require an accurate description of rotavirus evolution.
Acknowledgements
I would like to express my sincere gratitude to my advisor, Assoc. Prof. Le Sy Vinh, for the continuous support of my study and research, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Master study.
Besides my advisor, I would like to thank Dr. Dang Cao Cuong and MSc. Le Kim Thu for giving devoted explanations to my questions and guiding me to solve the various problems that I had to face.
I also thank my friends Can Duy Cat, Nguyen Minh Trang and Le Hai Nam for their continuous motivation, without which I would never have been able to complete this thesis.
My sincere thanks also go to the Information Faculty of the University of Engineering and Technology, Vietnam National University, Hanoi for providing me with the necessary facilities to conduct experiments.
Last but not least, I would like to thank my parents for giving birth to me in the first place and supporting me spiritually throughout my life.
Declaration
I hereby declare that this thesis was entirely my own work and that any additional sources of information have been properly cited.
I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights and that any ideas, techniques or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices.
I declare that this thesis has not been submitted for a higher degree to any other University or Institution.
1.3.2 Phylogenetic tree reconstruction
1.3.3 Robinson-Foulds distance
1.3.4 Phylogenetic hypothesis testing
2 Method
2.2 Model estimation process
3 Results and discussion
3.2 Model analysis
3.3 Performance on testing alignments
3.4 Tree topology analysis
MSA   Multiple sequence alignment
NCBI  National Center for Biotechnology Information
RF    Robinson-Foulds
A sample phylogenetic tree of 5 sequences
Two trees $T_1$ and $T_2$ describe the same set of 5 sequences $\{1, 2, 3, 4, 5\}$ but have different topologies. Tree $T_1$ has two bipartitions $\{1, 2\}|\{3, 4, 5\}$ and $\{1, 2, 3\}|\{4, 5\}$ while tree $T_2$ has two bipartitions $\{1, 3\}|\{2, 4, 5\}$ and $\{1, 2, 3\}|\{4, 5\}$. Bipartition $\{1, 2\}|\{3, 4, 5\}$ is present only in $T_1$ whereas bipartition $\{1, 3\}|\{2, 4, 5\}$ is present only in $T_2$. Bipartition $\{1, 2, 3\}|\{4, 5\}$ occurs in both $T_1$ and $T_2$. The standardized Robinson and Foulds distance between $T_1$ and $T_2$ is 2/4
The exchangeability coefficients in the ROTA, FLU and JTT models. The black (gray or white) bubble at the intersection of row X and column Y represents the exchange rate between amino acid X and amino acid Y in ROTA (FLU or JTT)
The relative differences between exchangeability coefficients in ROTA and the other two models. The size of the bubble corresponds to the value $(ROTA_{XY} - M_{XY})/(ROTA_{XY} + M_{XY})$ where X, Y is one of 20 amino acids and M is FLU for subfigure a) or JTT for subfigure b). The black bubble with value 2/3 (1/3) indicates the coefficient in ROTA is 5 (2) times larger than the corresponding one in M whereas the white bubble with value 2/3 (1/3) indicates the coefficient in ROTA is 5 (2) times smaller than the corresponding one in M
Amino acid frequencies of the ROTA, FLU and JTT models
Twenty different amino acids
Number of sequences grouped by protein types in training and
The Pearson's correlations between ROTA and 10 widely used
Comparisons of ROTA and 10 other models in constructing max-
Comparisons of ROTA and 10 other models in constructing max-
The normalized Robinson-Foulds distance between trees inferred using ROTA versus the FLU and JTT models for 12 testing multiple alignments
The number of test alignments for which trees inferred from existing models are significantly worse than those from ROTA, for the 11 testing alignments where ROTA is the best-fit model
Comparison of log-likelihood per site between ROTA and the protein-specific model. For each alignment of protein P, $ROTA_P$ is the model estimated using only sequences of protein P from the training dataset
Amino acid exchangeability matrix (the first 20 rows) and frequency vector (the last row) of ROTA
Introduction
Motivation
A major field of research in bioinformatics is to understand the evolutionary relationships among species. In recent decades, protein data have accumulated on a large scale and are used in many methods to model the evolutionary process. One popular method is to use protein data to estimate amino acid substitution models, which reveal the instantaneous rates at which amino acids change into other amino acids. This kind of model has become crucial for a wide range of bioinformatics systems thanks to its various applications.
The first application of amino acid substitution models is to help create protein sequence alignments [1]. Methods used to align sequences often involve a score matrix which represents the penalty incurred when a mutation occurs. Typically for nucleotide models, a score matrix can be simply +1 for a matched site, -1 for a mismatched site and -2 for an indel between two nucleotide sequences. Such a simple scoring scheme, however, is not suitable for comparing protein sequences. Amino acids, the building blocks that make up protein sequences, have biochemical properties that influence their exchangeability in the process of evolution. For instance, amino acids of similar molecular sizes get substituted more often than those of widely different molecular sizes. Other properties such as the tendency to bind with water molecules also influence the probability of substitution. Therefore, it is important to use a scoring system that reflects these properties, which can be derived from amino acid substitution models.
Amino acid substitution models are also frequently used to infer protein phylogenetic trees using the maximum likelihood approach [2]. In this method, given a phylogeny of protein sequences with branch lengths, an evolutionary model that allows computing the probability of this tree is needed to select the maximum likelihood tree. A substitution model allows us to compute the transition probabilities $P_{ij}(t)$, the probability that state $j$ will exist at the end of a branch of length $t$ if the state at the start of the branch is $i$, and thus makes it possible to calculate the likelihood of the data given a phylogenetic tree.
Moreover, these models are used to estimate pairwise genetic distances between protein sequences that subsequently serve as inputs for distance-based phylogenetic analysis [3]. The objective of distance methods is to fit a tree to a matrix of pairwise genetic distances. For every two sequences, the observed distance is a single value based on the number of differing positions between the two sequences. The observed distance underestimates the true genetic distance because some of the positions may have experienced multiple substitution events. In distance-based methods, a better estimate of the number of substitutions that have actually occurred is often obtained by applying a substitution model.
Many methods have been proposed to estimate general amino acid substitution models from large and diverse databases [2] [4]. The common methods are counting or maximum likelihood approaches. Estimating amino acid substitution models is much more challenging than estimating nucleotide substitution models due to the large number of parameters to be optimized. For example, the general time-reversible model for nucleotides contains 8 parameters in comparison to 208 parameters for models of amino acid substitutions. Thus, amino acid substitution models are typically estimated from large datasets. The first model is the Dayhoff model [5], estimated using highly similar protein sequences. As more protein sequences became available, the JTT model was created [6] using the same counting method. However, counting methods are limited to closely related protein sequences.
The maximum likelihood method was proposed later [7] to estimate the mtREV model from 20 complete vertebrate mtDNA-encoded protein sequences. The WAG model [8] was estimated using a maximum likelihood method from 182 globular protein families. The WAG model produced better likelihood trees than the Dayhoff and JTT models for a large number of globular protein families. The maximum likelihood method was improved by incorporating the variability of evolutionary rates across sites into the estimation process to create the LG model [4] from the Pfam database. Experiments showed that the LG model gave better results than other models in terms of constructing maximum likelihood trees.
Although a number of general models have been estimated from protein sequences of various species, they might be inappropriate for a particular set of species due to differences in the evolutionary processes of these species. A number of specific amino acid substitution models for important species have been introduced, such as the rtREV model [9] for retroviruses and the HIVb and HIVw models [10] for HIV-related analysis. Recently, the FLU model was estimated for influenza virus; it was shown to be highly different from widely used existing models and more suitable for influenza protein analysis [11].
Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses such as Ebolavirus, MERS-CoV (Middle East respiratory syndrome coronavirus), or rotavirus, a contagious virus that can cause gastroenteritis. These epidemics raise a need for modeling the evolution of these viruses. The Viral Genome Resource [12] at the National Center for Biotechnology Information (NCBI), which acts as a repository for protein sequences of various emerging viruses, provides the data necessary to model viral protein evolution.
Problem statement
Existing amino acid substitution models cannot accurately characterize the evolution of rotavirus, so an amino acid substitution model for rotavirus is needed. Protein analysis systems can use a rotavirus-specific amino acid substitution model to analyse rotavirus protein sequences and uncover patterns in rotavirus evolution and the infection process.
In this thesis, the data from the Viral Genome Resource at NCBI is used to estimate such an amino acid substitution model for rotavirus, an emerging virus that is the leading cause of diarrhea in young children. A maximum likelihood approach that allows for the variability of evolutionary rates across sites is used as the method of model estimation.
The main contribution of this work is to provide a rotavirus-specific substitution model which would benefit subsequent research on understanding the evolutionary patterns of this type of virus. A detailed investigation of the model is conducted to understand how this model differs from existing ones and to evaluate its performance on another dataset.
Outline
This thesis is divided into 3 main chapters. The first chapter provides the background knowledge needed to understand the estimation process in the next chapter. Chapter 2 describes how the evolution of rotavirus can be modeled as well as which steps the model estimation process follows. Chapter 3 describes the data preparation process, then investigates the rotavirus-specific model obtained from the process described in Chapter 2 (called ROTA for short) to highlight its characteristics and compare its performance on a testing dataset with other existing models. Some discussion of the results is also given in Chapter 3.
Chapter 1
Background
This chapter explains the terminology and concepts needed to understand the estimation process of an amino acid substitution model. Firstly, section 1.1 describes the evolution of sequences, in which we explain the heredity materials that pass down from generation to generation and the homologous sequences that are the main subjects of investigation for understanding the evolutionary process. Subsequently, section 1.2 explains popular methods to model sequence evolution and reviews existing amino acid substitution models which are commonly used in practice. The final section discusses in detail phylogenetic trees, which are used as the visual representation of the evolutionary process, and then addresses the methods that we use later to compare these phylogenetic trees.
1.1 Sequence evolution
1.1.1 Heredity materials
In molecular biology, nucleic acids are known to be the heredity materials which are replicated and passed down from parents to children. There are two types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). For living organisms, DNA is the molecule that carries genetic information. Many viruses use RNA as their genetic material instead.
DNA and RNA are composed of smaller subunits called nucleotides [13]. Nucleotides are linked together through chemical bonds in a linear fashion and form sequences of DNA and RNA. There are 4 types of nucleotides in DNA, which are named after the 4 different nitrogenous bases making up these nucleotides: A, C, G and T (short for adenine, cytosine, guanine and thymine, respectively). In RNA, base T is replaced by base U (uracil).
DNA (or RNA in some viruses) instructs cells to produce one of the most important types of macromolecules: protein. Proteins play a critical role in the continuity of life since they form the structure of organisms, provide many functionalities and control the regulation of the body. Proteins are sequences of amino acids. There are 20 different amino acids found in nature (see Table 1.1), each of which has different chemical properties. One change of amino acid in a protein sequence could have a subtle effect, change the protein function or cause a complete loss of function.
Only some regions in DNA and RNA contain the instructions to encode proteins. Those coding regions are called genes. Within a gene, three consecutive nucleotides, often called a codon, encode one amino acid. The process of creating a protein based on a gene is called translation. Different codons can encode the same amino acid, which explains why, although there are 64 different combinations of 3 nucleotides, only 20 types of amino acids are present. To be accurate, 61 codons encode amino acids while the other three are stop codons. As their name suggests, these stop codons instruct cells to stop the translation process.
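The codon arithmetic above can be checked with a short Python sketch. It assumes the standard genetic code, in which TAA, TAG and TGA (written in the DNA alphabet) are the three stop codons; the names `NUCLEOTIDES`, `STOP_CODONS` and `all_codons` are illustrative and do not come from the thesis.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# Assumption: stop codons of the standard genetic code, in the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def all_codons():
    """Enumerate every ordered triplet of nucleotides."""
    return ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]


codons = all_codons()
sense_codons = [c for c in codons if c not in STOP_CODONS]
print(len(codons), len(sense_codons))
```

Since 61 sense codons map onto only 20 amino acids, most amino acids are necessarily encoded by more than one codon.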
The well-known theory of evolution proposed by Darwin [14] suggests that species have evolved from a common ancestor. At the molecular level, living organisms share:
• The same genetic material (nucleic acids)
• The same table of encodings (the universal genetic code)
• The same molecular building blocks (such as nucleotides and amino acids)
These shared features strongly suggest that living things today are descendants of a common ancestor, or at least of highly similar ancestors. Heredity materials, genetic codes and the process of translation are passed down from generation to generation.
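Table 1.1: Twenty different amino acids
Name            Three-letter code   One-letter code
Alanine         Ala                 A
Arginine        Arg                 R
Asparagine      Asn                 N
Aspartic acid   Asp                 D
Cysteine        Cys                 C
Glutamine       Gln                 Q
Glutamic acid   Glu                 E
Glycine         Gly                 G
Histidine       His                 H
Isoleucine      Ile                 I
Leucine         Leu                 L
Lysine          Lys                 K
Methionine      Met                 M
Phenylalanine   Phe                 F
Proline         Pro                 P
Serine          Ser                 S
Threonine       Thr                 T
Tryptophan      Trp                 W
Tyrosine        Tyr                 Y
Valine          Val                 V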
Children, however, often inherit genes from their parents with modifications (also called mutations) due to the error-prone replication process. Mutations can be harmful or beneficial. When mutations occur, the mutated organisms may die, develop new traits or even evolve into new species. Millions of years of replication and mutation have resulted in the diversity of species we see today.
Mutations can be small-scale or large-scale. In small-scale mutations, genes are affected by changes in one or a small number of nucleotides. If only a single nucleotide is mutated, it is called a point mutation. There are three types of point mutations:
• Substitutions are changes from one nucleotide to another. A substitution may have no effect if it does not change the corresponding gene product (synonymous substitution) or may alter the amino acid in the protein sequence (nonsynonymous substitution). In theory, every type of nucleotide can be substituted by any other type at any position in the sequence. Lethal substitutions, however, do not replicate and are hardly found in nature.
• Insertions add extra nucleotides into the sequence. This may shift the reading frame of codons or change the splice sites of a gene (the positions where genes are cut and then reassembled before translation occurs), both of which greatly affect the encoded proteins.
• Deletions remove nucleotides from the original sequence. Like insertions, deletions may change the reading frame or splice sites and alter the gene products.
Large-scale mutations are more complicated and affect one or many genes. Some types of large-scale mutations include: duplications, which create many copies of the same sequence; deletions of a large region of a gene or of many genes; and translocations, which exchange regions between different DNA molecules.
Sequences of nucleotides or amino acids are said to be homologous if they derive from a common ancestral sequence. Homologous sequences are not necessarily the same as their ancestor due to mutations. Investigating homologous sequences is an important task because understanding the differences among homologous sequences can reveal the evolutionary relationships among species or among individuals of a species. Identifying homology, however, is a hard task because of the various mutations in the replication process.
In the following sections, we use the term character to denote a nucleotide or an amino acid. Homologous characters are nucleotides or amino acids which derive from a common nucleotide or amino acid.
1.2 Modeling sequence evolution
Homologous sequences may not have the same length because of different numbers of insertions and deletions in these sequences. Before the evolutionary relationships can be analysed, homologous sequences need to be aligned. Aligning is the process of inserting the character '-' into the input character sequences so that these sequences have equal length.
There may be many different ways to align two sequences. For example, with two sequences TGAATAGCC and TGAAAGTCC, these are acceptable alignments:
    TGAATAGCC        TGAATAG-CC
    TGAAAGTCC        TGAA-AGTCC
In the absence of the ancestral sequence, the length of the real alignment and the types of mutations that occurred are unknown. In the above example, the first alignment has a length of 9 nucleotides and there are 3 substitutions at positions 5, 6 and 7. In contrast, the second alignment has a length of 10 nucleotides and there is one deletion (or one insertion of T) at position 5 and one insertion of T (or one deletion) at position 8. Since it is impossible to distinguish whether insertions or deletions have actually occurred, they are often referred to as indels.
The construction of a pairwise alignment can be done in $O(pq)$ time using the Needleman-Wunsch algorithm [15], where $p$ and $q$ are the lengths of the two sequences. The Needleman-Wunsch algorithm is a dynamic programming algorithm which assigns a score to each alignment and finds the alignments with the highest score. A so-called score matrix is used, which defines scores for every event. Different scoring systems exist. One simple system is: +1 for a match, -1 for a mismatch and -3 for an indel.
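Given two DNA sequences $X = (x_1, x_2, \ldots, x_p)$ and $Y = (y_1, y_2, \ldots, y_q)$ together with a scoring matrix $C$ and the character set $\mathcal{A} = \{A, C, G, T, -\}$, the highest score alignments can be found using a recursive formula:
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + C(x_i, y_j) \\ F(i-1, j) + C(x_i, -) \\ F(i, j-1) + C(-, y_j) \end{cases} \]
where
• $C(x_i, y_j)$ is the score when $x_i$ is aligned with $y_j$ (a match or a mismatch)
• $C(x_i, -)$ or $C(-, y_j)$ is the score when there is an indel
• $F(i, j)$ is the best score when aligning the two sequences $X^i = (x_1, x_2, \ldots, x_i)$ and $Y^j = (y_1, y_2, \ldots, y_j)$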
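The dynamic programming recursion can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the thesis: the function name `needleman_wunsch_score` is hypothetical, it returns only the best score (no traceback), and the default scores (+1 for a match, -1 for a mismatch, -3 for an indel) are an assumed simple scheme rather than a substitution-model-derived matrix.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, indel=-3):
    """Best global alignment score of x and y in O(len(x) * len(y)) time."""
    p, q = len(x), len(y)
    # F[i][j] is the best score for aligning the prefixes x[:i] and y[:j].
    F = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        F[i][0] = i * indel  # x[:i] aligned against gaps only
    for j in range(1, q + 1):
        F[0][j] = j * indel  # y[:j] aligned against gaps only
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + pair,  # align x_i with y_j
                F[i - 1][j] + indel,     # align x_i with a gap
                F[i][j - 1] + indel,     # align y_j with a gap
            )
    return F[p][q]


print(needleman_wunsch_score("TGAATAGCC", "TGAAAGTCC"))
print(needleman_wunsch_score("TGAATAGCC", "TGAAAGTCC", indel=-2))
```

For the two example sequences, the gap-free alignment (6 matches, 3 mismatches) wins when indels cost -3, whereas the gapped alignment (8 matches, 2 indels) wins when indels cost -2, which illustrates how strongly the chosen scores shape the result.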
Amino acid sequences can be aligned in a similar fashion using a larger character set $\mathcal{A} = \{A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -\}$ and a different score matrix. A notable observation is that the score matrix for amino acids is much more complicated due to the differences in the properties of amino acids. The more similar two amino acids are, the higher the score for aligning them should be. Also, the score matrix may be different for different types of proteins or species. Score matrices can be derived from substitution models.
Pairwise sequence alignment only reveals the relationship between two sequences, whereas the objective of many bioinformatics analyses is to investigate the relationships among a set of sequences. Such analyses require a multiple sequence alignment (MSA), which is a matrix of characters such that row $i$ represents the characters of sequence $i$ and column $j$ represents the homologous characters at position $j$.
The construction of multiple sequence alignments can be performed using dynamic programming [16]. The time complexity of this approach is $O(m^n 2^n)$ where $n$ is the number of sequences and $m$ is the length of the alignment. This computational burden limits the approach to only a few sequences. Many approximate methods have been proposed for larger numbers of sequences, such as CLUSTALW [17] or MUSCLE [18].
1.2.2 General time-reversible substitution model
A popular method to model sequence evolution is to use a substitution model, which describes the process by which characters in a character set change into others. The substitution process can be modeled by a time-continuous discrete-state Markov model $Q = (q_{xy})$ [19]. For sequences with a character set $\mathcal{A}$, $Q$ is a $|\mathcal{A}| \times |\mathcal{A}|$ matrix in which the value $q_{xy}$ at row $x$ and column $y$ represents the instantaneous rate at which character $x$ changes into character $y$ for $x \neq y$. The value $q_{xx}$ is calculated such that the sum of every row equals 0. For the nucleotide substitution model, the $Q$ matrix is as follows:
\[ Q = \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix} \]
Substitution models are usually assumed to be stationary, which means the frequency $\Pi = (\pi_x)$ of every available state does not change with time. The frequency vector $\Pi$ and the instantaneous substitution rate matrix $Q$ are dependent:
\[ \Pi Q = 0 \]
In practice, the $Q$ matrix is typically normalized such that the expected number of substitutions per site is one:
\[ -\sum_{x \in \mathcal{A}} \pi_x q_{xx} = 1 \qquad (1.4) \]
Substitution models often employ the time-reversibility assumption, that is, the relative substitution rates at which $x$ changes to $y$ and $y$ changes to $x$ are the same:
\[ \frac{q_{xy}}{\pi_y} = \frac{q_{yx}}{\pi_x} \]
Under the time-reversibility assumption, the $Q$ matrix can be decomposed into two components: a relative substitution rate matrix $R = (r_{xy})$ and a frequency vector $\Pi = (\pi_x)$:
\[ q_{xy} = \begin{cases} \pi_y r_{xy} & x \neq y \\ -\sum_{z \neq x} q_{xz} & x = y \end{cases} \]
The $R$ matrix is also called the exchangeability coefficient matrix, where the relative substitution rate between $x$ and $y$, or the coefficient for the substitution between $x$ and $y$, is
\[ r_{xy} = \frac{q_{xy}}{\pi_y} \]
A so-called general time-reversible (GTR) substitution model is a model that employs all of the following assumptions:
• Markovian assumption: The evolution of character $x$ is independent of its previous states
• Time-continuity: The substitution between states can occur at any time during the evolutionary process
• Time-homogeneity: The substitution rates do not change with time
• Stationarity: The frequencies of all states are stable
• Time-reversibility: If two states have the same frequency, the rate at which $x$ changes to $y$ and the rate at which $y$ changes to $x$ are the same
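More specifically, a GTR substitution model for nucleotides can be described by the instantaneous substitution rate matrix $Q$:
\[ Q = \begin{pmatrix} -\sum_{x \neq A} q_{Ax} & a\pi_C & b\pi_G & c\pi_T \\ a\pi_A & -\sum_{x \neq C} q_{Cx} & d\pi_G & e\pi_T \\ b\pi_A & d\pi_C & -\sum_{x \neq G} q_{Gx} & f\pi_T \\ c\pi_A & e\pi_C & f\pi_G & -\sum_{x \neq T} q_{Tx} \end{pmatrix} \]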
The GTR substitution model for nucleotides has 8 free parameters (3 parameters for the frequency vector $\Pi$, because its elements sum to 1, and 5 parameters for the exchangeability coefficient matrix $R$, because of equation 1.4). The GTR substitution model for amino acids can be described in a similar manner, but with many more free parameters (208 free parameters, to be precise).
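The construction of $Q$ from exchangeabilities and frequencies, together with the parameter counts, can be sketched in Python. This is an illustration under stated assumptions, not the thesis code: the helper names `build_gtr_q` and `gtr_free_parameters` and the example values in `exchange` and `freqs` are made up for demonstration.

```python
def build_gtr_q(exchange, freqs):
    """Build a normalized GTR rate matrix.

    exchange: symmetric n x n list of lists of exchangeabilities r_xy
              (the diagonal is ignored).
    freqs:    list of n stationary frequencies summing to 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                q[x][y] = exchange[x][y] * freqs[y]  # q_xy = pi_y * r_xy
        q[x][x] = -sum(q[x])  # every row sums to zero
    # Normalize so that one substitution per site is expected per unit time.
    scale = -sum(freqs[x] * q[x][x] for x in range(n))
    return [[value / scale for value in row] for row in q]


def gtr_free_parameters(n):
    """(n - 1) free frequencies plus n(n - 1)/2 - 1 free exchangeabilities."""
    return (n - 1) + (n * (n - 1) // 2 - 1)


# Hypothetical nucleotide example in the order A, C, G, T.
freqs = [0.1, 0.2, 0.3, 0.4]
exchange = [
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
]
Q = build_gtr_q(exchange, freqs)
print(gtr_free_parameters(4), gtr_free_parameters(20))
```

The normalization removes one degree of freedom from the exchangeabilities, which is why the count is 5 rather than 6 for nucleotides and 189 rather than 190 for amino acids.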
Calculate transition probability
From the instantaneous substitution rate matrix $Q$, one can derive a transition probability matrix $P(t) = \{P_{xy}(t)\}$ where $P_{xy}(t)$ is the probability that character $x$ changes to character $y$ during the evolutionary time $t$. In case $Q$ is normalized as in equation 1.4, $t$ is the duration in which an expected number of $t$ substitutions occur per site. The probability matrix $P(t)$ is useful for constructing phylogenetic trees using the maximum likelihood method or for generating sequences for evolution simulation.
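The $Q$ matrix for the GTR model is diagonalizable [20]; hence, $P(t)$ can be calculated efficiently using the decomposition of $Q$. Specifically,
\[ P_{xy}(t) = \sum_{i=1}^{|\mathcal{A}|} u_{xi} \, e^{\lambda_i t} \, u^{-1}_{iy} \]
where
• $|\mathcal{A}|$ is the size of the character set (the number of possible states)
• $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_{|\mathcal{A}|}\}$ is the $|\mathcal{A}| \times |\mathcal{A}|$ diagonal matrix of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{|\mathcal{A}|}$ of $Q$
• $U = \{u_{xy}\}$ is the matrix of corresponding eigenvectors of $Q$ and $U^{-1} = \{u^{-1}_{xy}\}$ is its inverse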
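As a dependency-free cross-check of $P(t) = e^{Qt}$, the Python sketch below does not use the eigendecomposition described above; it swaps in a truncated Taylor series with scaling and squaring, which gives the same matrix for small examples. The names `mat_mul`, `transition_matrix` and `JC_Q` are illustrative, and the Jukes-Cantor matrix (all rates equal) is an assumed test case chosen because its transition probabilities have a known closed form.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    step = t / (2 ** squarings)
    scaled = [[q[i][j] * step for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes (Q * step)^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # exp(A)^2 = exp(2A)
    return result


# Jukes-Cantor rate matrix, normalized to one expected substitution per site.
JC_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = transition_matrix(JC_Q, 0.5)
print(P[0][0], 0.25 + 0.75 * math.exp(-4 * 0.5 / 3))
```

In practice the eigendecomposition is preferable when $P(t)$ is needed for many branch lengths, because the decomposition of $Q$ is computed once and reused for every $t$.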
Estimate genetic distances
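The substitution model is also useful to estimate the genetic distance between homologous sequences. The genetic distance $d(X, Y)$ between two sequences $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ is the number of substitutions which have occurred between $X$ and $Y$. The naive approach to estimate genetic distance is to count the positions at which the two sequences differ. This observed distance underestimates the true genetic distance because some substitutions are hidden: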
• Multiple substitutions: Two or more substitutions have occurred at one position but at most one substitution is observed. For example, A changes to C, then to T, but only one substitution from A to T is found by observation. If T is then changed to A, no substitution is observed (back substitution).
• Parallel substitutions: The substitution occurs in both sequences and no substitution is observed. For example, the ancestral nucleotide is A and A is changed to T in both descendant sequences. The observed distance between the descendants is 0 while the actual distance is 2.
With evolution modeled by the substitution rate matrix $Q$, the genetic distance $d$ between two sequences can be estimated using the maximum likelihood (ML) approach with the following likelihood function [19]:
\[ L(d) = \prod_{i=1}^{m} \pi_{x_i} P_{x_i y_i}(d) \]
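The gap between observed and model-based distances can be illustrated with a small Python sketch. It assumes the Jukes-Cantor model for nucleotides, under which the maximum likelihood distance has the closed form $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$ for an observed proportion of differing sites $p$; the thesis works with general amino acid models, for which the likelihood is maximized numerically instead. The function names are hypothetical.

```python
import math


def observed_distance(x, y):
    """Proportion of aligned, gap-free positions at which x and y differ."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor ML estimate of substitutions per site from proportion p."""
    if p >= 0.75:
        raise ValueError("observed distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = observed_distance("TGAATAGCC", "TGAAAGTCC")
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the observed one, reflecting the hidden multiple and parallel substitutions.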
1.2.3 Model of rate heterogeneity
The GTR substitution model as described above represents the evolution of one site in the analysed sequences. It is unrealistic that every site evolves according to the same model. For example, the third position in a codon usually has much faster substitution rates than the other two positions [21]. A model that has varying substitution rates among sites is called a model of rate heterogeneity.
Using different substitution models for different sites may cause overfitting of the model to the data. Thus, many models of heterogeneous substitution rates have been proposed [22] [23]. A common type of rate heterogeneity model [22] is a two-state model that categorizes sequence sites as either invariable or variable.
For every site $i$, there is a substitution rate scaling factor
\[ r_i = \begin{cases} 0 & \text{if site } i \text{ is invariable} \\ 1 & \text{otherwise} \end{cases} \]
That is, some sites in the sequences never change while the other sites evolve according to one substitution model.
Differentiating whether sites are variable or invariable is impossible because variable sites in the dataset may happen to be unvaried, or some sites may have undergone back substitutions. One more parameter $\theta$ is introduced to represent the proportion of sites that are invariable.
Another widely used model of rate heterogeneity is the Gamma-distributed rate heterogeneity model [23]. In this type of model, substitution rate scaling factors across sites are drawn from a $\Gamma$-distribution with expectation 1.0 and variance $1/\alpha$, $\alpha > 0$.
One can adjust the rate heterogeneity by adjusting the $\alpha$ parameter, which is called the shape parameter of the $\Gamma$-distribution. Generally speaking, the smaller the shape parameter is, the stronger the rate heterogeneity among sites becomes. For example, setting $\alpha = 0.5$ means that most sites hardly undergo substitutions but some other sites have high substitution rates. In contrast, if $\alpha = 5$, sites have highly similar substitution rates (the scaling factors of most sites are approximately 1.0).
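The effect of the shape parameter can be seen by sampling rate scaling factors, as in the Python sketch below. It uses the standard library's `random.gammavariate(shape, scale)` with scale $1/\alpha$ so that the mean is 1 and the variance is $1/\alpha$; the function names, the number of sites and the 0.1 threshold are arbitrary illustrative choices.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw rate scaling factors from a Gamma law with mean 1, variance 1/alpha."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): mean = shape * scale, var = shape * scale**2.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_below(rates, threshold):
    """Share of sites whose rate scaling factor is below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)


rates_strong = sample_site_rates(0.5, 20000)  # strong heterogeneity
rates_weak = sample_site_rates(5.0, 20000)    # weak heterogeneity
print(fraction_below(rates_strong, 0.1), fraction_below(rates_weak, 0.1))
```

With $\alpha = 0.5$ a sizeable share of sites is nearly invariable (rates below 0.1), whereas with $\alpha = 5$ almost no site is.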
A discrete $\Gamma$-distribution can be used instead of its continuous version. In the discrete $\Gamma$-distributed rate heterogeneity model, a parameter $c$, which is the number of substitution rate scaling factor categories $r_1, r_2, \ldots, r_c$, is introduced [24].
A hybrid model which integrates the two-state model and the $\Gamma$-distribution model is also commonly used in practice. In this type of model, some sites are invariable according to the parameter $\theta$ while the other sites are variable, with substitution rate scaling factors drawn from the discrete $\Gamma$-distribution with the shape parameter $\alpha$ and the number of rate categories $c$.
The parameter $c$ is typically set in advance while the invariant proportion $\theta$ and the shape parameter $\alpha$ are estimated from data. The estimation of these extra parameters has been implemented in some phylogenetic packages such as PHYML [25] or IQ-TREE [26].
Amino acid substitution models have many more free parameters than their nucleotide counterparts. The parameters are usually estimated from data; therefore, substitution models for amino acids are called empirical substitution models. Different methods have been proposed to estimate amino acid substitution models.
The Dayhoff model [5] was the first model describing the process of amino acid substitutions. The model was estimated using a counting method with 71 sets of closely related proteins and 1572 observed substitutions. The counting method was pioneering at the early stage of substitution model research. The JTT model was estimated using the same counting method [6]. JTT, however, used a larger protein dataset in the estimation process. The BLOSUM62 model [27] also employed a counting approach, but only alignments with at least 62% similarity were used to count amino acid substitutions.
The drawback of the counting method is that it is only applicable to closely related sequences. Later research [28] introduced the resolvent method and estimated the VT model from various types of protein sequences.
The maximum likelihood method was used to estimate the mtREV model [7] based on 20 complete vertebrate mtDNA-encoded protein sequences. The WAG model [8] was estimated using an approximate maximum likelihood approach based on 3,905 globular protein sequences from 182 protein families. The LG model [4] incorporated rate heterogeneity across sites into the estimation process using protein sequences from the Pfam database. The LG model was shown to be better than the previous models in terms of constructing maximum likelihood phylogenetic trees.
Although many models have been estimated from large and diverse databases of protein sequences from a wide range of species (these models are often referred to as general models), recent studies have shown that they might be inappropriate to represent the evolutionary process of some specific species. A number of species-specific amino acid substitution models have been introduced, such as the rtREV model [9], which is used for inference on retroviruses, and the HIV-specific models [10] (HIVb, HIVw) estimated from HIV data. Recently, the FLU model [11] was estimated for influenza virus and found to be highly different from widely used existing models. Experiments showed that species-specific models better characterize the evolutionary process of the corresponding species' sequences.
more related than A and C because the common ancestor of A and B is more recent than the common ancestor of A and C. Each branch of a phylogenetic tree can be assigned a value, called the branch length, to represent the genetic distance between the connected nodes.
There are many different binary trees to represent the evolution of the same set of sequences. Without regard for branch lengths, the numbers of possible rooted and unrooted binary trees for a set of $n$ sequences grow super-exponentially with $n$. The number of binary unrooted trees $B(n)$ for $n$ sequences is
\[ B(n) = (2n - 5)!! = \prod_{i=3}^{n} (2i - 5) \]
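To give a feel for how quickly the tree space grows, the Python sketch below evaluates the standard count of unrooted binary tree topologies on $n$ labelled leaves, $(2n - 5)!! = 1 \cdot 3 \cdot 5 \cdots (2n - 5)$. The function name `count_unrooted_binary_trees` is illustrative.

```python
def count_unrooted_binary_trees(n):
    """Number of unrooted binary topologies on n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_binary_trees(n))
```

Already for 10 sequences there are more than two million candidate topologies, which is why exhaustive search is infeasible and tree reconstruction relies on heuristics.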
Three optimization criteria are often used to reconstruct phylogenetic trees from homologous sequences: minimum evolution (ME), maximum parsimony (MP) and maximum likelihood (ML).