Thesis: Amino Acid Substitution Model for Rotavirus


VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Nguyen Duc Canh

AMINO ACID SUBSTITUTION MODEL

FOR ROTAVIRUS

MASTER THESIS

Major: Computer Science

Supervisor: Assoc. Prof. Le Sy Vinh

HANOI - 2019

Abstract

Modeling protein evolution has been a major field of research in bioinformatics for decades. One popular way to approximate the evolution of proteins is to use an amino acid substitution model, which describes the instantaneous rate at which one amino acid changes into another. Such models are useful in many different ways and have become a main component of a variety of bioinformatics systems. Many models, such as JTT, WAG and LG, have been estimated using data from various species. Recent research showed that these models might be inappropriate for analysing some specific species. Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses, notably rotavirus, a contagious virus that can cause gastroenteritis. These epidemics raise a need for modeling the evolution of these emerging viruses. In this thesis, using data from the Viral Genome Resource at the National Center for Biotechnology Information (NCBI), we propose the ROTA model, which has been specifically estimated for modeling the evolution of rotavirus. Analysis revealed significant differences between ROTA and existing models in amino acid frequencies, exchangeability coefficients and inferred phylogenies. Experiments showed that ROTA characterizes the evolutionary patterns of rotavirus better than other models and should be useful in most systems that require an accurate description of rotavirus evolution.

Acknowledgements

I would like to express my sincere gratitude to my advisor, Assoc. Prof. Le Sy Vinh, for his continuous support of my study and research, and for his patience, motivation, enthusiasm and immense knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Master's study.

Besides my advisor, I would like to thank Dr. Dang Cao Cuong and MSc. Le Kim Thu for patiently answering my questions and guiding me through the various problems that I had to face.

I also thank my friends Can Duy Cat, Nguyen Minh Trang and Le Iai Nam for their continuous encouragement, without which I would never have been able to complete this thesis.

My sincere thanks also go to the Information Faculty of the University of Engineering and Technology, Vietnam National University, Hanoi, for providing me with the facilities necessary to conduct experiments.

Last but not least, I would like to thank my parents for giving birth to me in the first place and for supporting me spiritually throughout my life.

Declaration

I hereby declare that this thesis is entirely my own work and that all additional sources of information have been properly cited.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights, and that any ideas, techniques or other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with standard referencing practices.

I declare that this thesis has not been submitted for a higher degree to any other University or Institution.


1.3.2 Phylogenetic tree reconstruction

1.3.3 Robinson-Foulds distance

1.3.4 Phylogenetic hypothesis testing

2 Method

2.2 Model estimation process

3 Results and discussion

3.2 Model analysis

3.3 Performance on testing alignments

3.4 Tree topology analysis

MSA: Multiple sequence alignment

NCBI: National Center for Biotechnology Information

RF: Robinson-Foulds

A sample phylogenetic tree of 5 sequences

Two trees $T_1$ and $T_2$ describe the same set of 5 sequences {1, 2, 3, 4, 5} but have different topologies. Tree $T_1$ has two bipartitions {1, 2}|{3, 4, 5} and {1, 2, 3}|{4, 5} while tree $T_2$ has two bipartitions {1, 3}|{2, 4, 5} and {1, 2, 3}|{4, 5}. Bipartition {1, 2}|{3, 4, 5} is present only in $T_1$ whereas bipartition {1, 3}|{2, 4, 5} is present only in $T_2$. Bipartition {1, 2, 3}|{4, 5} occurs in both $T_1$ and $T_2$. The standardized Robinson and Foulds distance between $T_1$ and $T_2$ is 2/4

The exchangeability coefficients in ROTA, FLU and JTT models. The black (gray or white) bubble at the intersection of row X and column Y presents the exchange rate between amino acid X and amino acid Y in ROTA (FLU or JTT)

The relative differences between exchangeability coefficients in ROTA and other two models. The size of the bubble corresponds to the value $(\mathrm{ROTA}_{XY} - M_{XY}) / (\mathrm{ROTA}_{XY} + M_{XY})$ where X, Y is one of 20 amino acids and M is FLU for the subfigure a) or JTT for the subfigure b). The black bubble with value 2/3 (1/3) indicates the coefficient in ROTA is 5 (2) times larger than the corresponding one in M whereas the white bubble with value 2/3 (1/3) indicates the coefficient in ROTA is 5 (2) times smaller than the corresponding one in M

Amino acid frequencies of ROTA, FLU and JTT models

Twenty different amino acids

Number of sequences grouped by protein types in training and

The Pearson's correlations between ROTA and 10 widely used

Comparisons of ROTA and 10 other models in constructing max-

The normalized Robinson-Foulds distance between trees inferred using ROTA versus FLU and JTT models for 12 testing multiple alignments

The number of test alignments that trees inferred from existing models are significantly worse than those from ROTA for the 11 testing alignments that ROTA is the best-fit model

Comparison of log-likelihood per site between ROTA and protein-specific model. For each alignment of protein P, $\mathrm{ROTA}_P$ is the model estimated using only sequences of protein P from training dataset

Amino acid exchangeability matrix (the first 20 rows) and frequency vector (the last row) of ROTA

Introduction

Motivation

A major field of research in bioinformatics is to understand the evolutionary relationships among species. In recent decades, protein data have accumulated on a large scale and are used in many methods to model the evolutionary process. One popular method is to use protein data to estimate amino acid substitution models, which describe the instantaneous rates at which amino acids change into other amino acids. This kind of model has become crucial for a wide range of bioinformatics systems thanks to its various applications.

The first application of amino acid substitution models is to help create protein sequence alignments [1]. Methods used to align sequences often involve a score matrix which represents the penalty incurred when a mutation occurs. Typically, for nucleotide sequences, the scoring scheme can be as simple as +1 for a matched site, -1 for a mismatched site and -2 for an indel. Such a simple scoring scheme, however, is not suitable for comparing protein sequences. Amino acids, the building blocks that make up protein sequences, have biochemical properties that influence their exchangeability in the process of evolution. For instance, amino acids of similar molecular sizes are substituted for each other more often than those of widely different molecular sizes. Other properties, such as the tendency to bind with water molecules, also influence the probability of substitution. Therefore, it is important to use a scoring system that reflects these properties, which can be derived from amino acid substitution models.

Amino acid substitution models are also frequently used to infer protein phylogenetic trees using the maximum likelihood approach [2]. In this method, given a phylogeny of protein sequences with branch lengths, an evolutionary model that allows the probability of this tree to be computed is needed to select the maximum likelihood tree. A substitution model allows us to compute the transition probabilities $P_{ij}(t)$, the probability that state $j$ will exist at the end of a branch of length $t$ if the state at the start of the branch is $i$, and thus makes it possible to calculate the likelihood of the data given a phylogenetic tree.

Moreover, these models are used to estimate pairwise genetic distances between protein sequences that subsequently serve as inputs for distance-based phylogenetic analysis [3]. The objective of distance methods is to fit a tree to a matrix of pairwise genetic distances. For every two sequences, the observed distance is a single value based on the number of positions at which the two sequences differ. The observed distance underestimates the true genetic distance because some positions may have experienced multiple substitution events. In distance-based methods, a better estimate of the number of substitutions that have actually occurred is usually obtained by applying a substitution model.

Many methods have been proposed to estimate general amino acid substitution models from large and diverse databases [2] [4]. The common methods are counting and maximum likelihood approaches. Estimating amino acid substitution models is much more challenging than estimating nucleotide substitution models due to the large number of parameters to be optimized. For example, the general time-reversible model for nucleotides contains 8 parameters, compared with 208 parameters for models of amino acid substitution. Thus, amino acid substitution models are typically estimated from large datasets. The first model was the Dayhoff model [5], estimated from highly similar protein sequences. As more protein sequences became available, the JTT model was created [6] using the same counting method. However, counting methods are limited to closely related protein sequences.

The maximum likelihood method was proposed later [7] to estimate the mtREV model from 20 complete vertebrate mtDNA-encoded protein sequences. The WAG model [8] was estimated using a maximum likelihood method from 182 globular protein families. The WAG model produced better likelihood trees than the Dayhoff and JTT models for a large number of globular protein families. The maximum likelihood method was later improved by incorporating the variability of evolutionary rates across sites into the estimation process, producing the LG model [4] from the Pfam database. Experiments showed that the LG model gave better results than other models in terms of constructing maximum likelihood trees.

Although a number of general models have been estimated from protein sequences of various species, they might be inappropriate for a particular set of species due to differences in the evolutionary processes of these species. A number of specific amino acid substitution models for important species have been introduced, such as the rtREV model [9] for retroviruses and the HIVb and HIVw models [10] for HIV-related analysis. Recently, the FLU model was estimated for influenza virus; it was shown to be highly different from widely used existing models and more suitable for influenza protein analysis [11].

Meanwhile, the world has witnessed a series of emerging epidemics caused by viruses such as Ebolavirus, MERS-CoV (Middle East respiratory syndrome coronavirus) and rotavirus, a contagious virus that can cause gastroenteritis. These epidemics raise a need for modeling the evolution of these viruses. The Viral Genome Resource [12] at the National Center for Biotechnology Information (NCBI), which acts as a repository for protein sequences of various emerging viruses, provides the data necessary to model viral protein evolution.

Problem statement

Existing amino acid substitution models cannot accurately characterize the evolution of rotavirus, so an amino acid substitution model estimated specifically for rotavirus is needed. Protein analysis systems can utilize such a rotavirus-specific model to analyse rotavirus protein sequences and uncover patterns in rotavirus evolution and the infection process.

In this thesis, data from the Viral Genome Resource at NCBI are used to estimate such an amino acid substitution model for rotavirus, an emerging virus that is the leading cause of diarrhea in young children. A maximum likelihood approach that allows evolutionary rates to vary across sites is used as the model estimation method.

The main contribution of this work is a rotavirus-specific substitution model, which should benefit subsequent research on the evolutionary patterns of this type of virus. A detailed investigation of the model is conducted to understand how it differs from existing ones and to evaluate its performance on other datasets.

Outline

This thesis is divided into 3 main chapters. The first chapter provides the background knowledge needed to understand the estimation process in the next chapter. Chapter 2 describes how the evolution of rotavirus can be modeled and which steps the model estimation process follows. Chapter 3 describes the data preparation process, then investigates the rotavirus-specific model obtained from the process described in Chapter 2 (called ROTA for short) to highlight its characteristics and compare its performance on a testing dataset with that of other existing models. Some discussion of the results is also given in Chapter 3.

Chapter 1

Background

This chapter explains the terminology and concepts needed to understand the estimation process of an amino acid substitution model. Firstly, section 1.1 describes the evolution of sequences, in which we explain heredity materials, which are passed down from generation to generation, and homologous sequences, which are the main subjects of investigation in understanding the evolutionary process. Subsequently, section 1.2 explains popular methods to model sequence evolution and reviews existing amino acid substitution models that are commonly used in practice. The final section discusses phylogenetic trees, which are used as the visual representation of the evolutionary process, and then addresses the methods that we use later to compare these phylogenetic trees.

1.1 Sequence evolution

1.1.1 Heredity materials

In molecular biology, nucleic acids are known to be the heredity materials, which are replicated and passed down from parents to children. There are two types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). For living organisms, DNA is the molecule that carries genetic information. Many viruses use RNA as their genetic material instead.

DNA and RNA are composed of smaller subunits called nucleotides [13]. Nucleotides are linked together through chemical bonds in a linear fashion and form sequences of DNA and RNA. There are 4 types of nucleotides in DNA, which are named after the 4 different nitrogenous bases making up these nucleotides: A, C, G and T (short for adenine, cytosine, guanine and thymine, respectively). In RNA, base T is replaced by base U (uracil).

DNA (or RNA in some viruses) instructs cells to produce one of the most important types of macromolecules: proteins. Proteins play a critical role in the continuity of life since they form the structure of organisms, provide many functions and control the regulation of the body. Proteins are sequences of amino acids. There are 20 standard amino acids found in proteins (see Table 1.1), each of which has different chemical properties. A single amino acid change in a protein sequence could have a subtle effect, change the protein's function or cause a complete loss of function.

Only some regions in DNA and RNA contain the instructions to encode proteins. Those coding regions are called genes. Within a gene, three consecutive nucleotides, often called a codon, encode one amino acid. The process of creating a protein based on a gene is called translation. Different codons can encode the same amino acid, which explains why, although there are 64 different combinations of 3 nucleotides, only 20 types of amino acids are present. To be accurate, 61 codons encode amino acids while the other three are stop codons. As their name suggests, these stop codons instruct cells to stop the translation process.
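The codon arithmetic above can be checked with a short sketch. It assumes the standard genetic code, written as a 64-letter string with codons enumerated in T, C, A, G order; the function name `translate` and the example sequence are illustrative only and are not part of the thesis.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = len(CODON_TABLE) - len(stop_codons)
print(len(CODON_TABLE), sense_codons, stop_codons)
print(translate("ATGGCCAAGTAA"))
```

Because several codons map to the same letter (for example GCT and GCC both give A), a nucleotide substitution inside a codon does not always change the encoded amino acid.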

The well-known theory of evolution proposed by Darwin [14] suggests that species have evolved from a common ancestor. At the molecular level, living organisms share:

* The same genetic material (nucleic acids)

* The same molecular building blocks (such as nucleotides and amino acids)

These shared features strongly suggest that living things today are descendants of a common ancestor, or at least of highly similar ancestors. Heredity materials, genetic codes and the process of translation are passed down from generation to generation.

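Table 1.1: Twenty different amino acids

Name            Three-letter code   One-letter code
Alanine         Ala                 A
Arginine        Arg                 R
Asparagine      Asn                 N
Aspartic acid   Asp                 D
Cysteine        Cys                 C
Glutamine       Gln                 Q
Glutamic acid   Glu                 E
Glycine         Gly                 G
Histidine       His                 H
Isoleucine      Ile                 I
Leucine         Leu                 L
Lysine          Lys                 K
Methionine      Met                 M
Phenylalanine   Phe                 F
Proline         Pro                 P
Serine          Ser                 S
Threonine       Thr                 T
Tryptophan      Trp                 W
Tyrosine        Tyr                 Y
Valine          Val                 V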

Children, however, often inherit genes from their parents with modifications (also called mutations) due to the error-prone replication process. Mutations can be harmful or beneficial. When mutations occur, the mutated organisms may die, develop new traits or even evolve into new species. Millions of years of replication and mutation have resulted in the diversity of species we see today.

Mutations can be small-scale or large-scale. In small-scale mutations, genes are affected by changes in one or a small number of nucleotides. If only one single nucleotide is mutated, it is called a point mutation. There are three types of point mutations:

* Substitutions are changes from one nucleotide to another. A substitution may have no effect if it does not change the corresponding gene product (synonymous substitution) or may alter the amino acid in the protein sequence (nonsynonymous substitution). In theory, every type of nucleotide can be substituted by any other type at any position in the sequence. Lethal substitutions, however, are not propagated and are hardly found in nature.

* Insertions add extra nucleotides into the sequence. This may shift the reading frame of codons or change the splice sites of a gene (the positions at which genes are cut and then reassembled before translation occurs), both of which greatly affect the encoded proteins.

* Deletions remove nucleotides from the original sequence. Like insertions, deletions may change the reading frame or splice sites and alter the gene products.

Large-scale mutations are more complicated and affect one or many genes. Some types of large-scale mutations include duplications, which create many copies of the same sequence; deletions of a large region of a gene or of many genes; and translocations, which exchange regions between different DNA molecules.

Sequences of nucleotides or amino acids are said to be homologous if they derive from a common ancestral sequence. Homologous sequences are not necessarily the same as their ancestor due to mutations. Investigating homologous sequences is an important task, as understanding the differences among homologous sequences can reveal the evolutionary relationships among species or among individuals of a species. Identifying homology, however, is a hard task because of the various mutations in the replication process.

In the following sections, we use the term character to denote a nucleotide or an amino acid. Homologous characters are nucleotides or amino acids which derive from a common nucleotide or amino acid.

1.2 Modeling sequence evolution

Homologous sequences may not have the same length because of different numbers of insertions and deletions in these sequences. Before the evolutionary relationships can be analysed, homologous sequences need to be aligned first. Aligning is the process of inserting the gap character '-' into the input character sequences so that these sequences have an equal length.

There may be many different ways to align two sequences. For example, with two sequences TGAATAGCC and TGAAAGTCC, these are acceptable alignments:

Alignment 1:

TGAATAGCC
TGAAAGTCC

Alignment 2:

TGAATAG-CC
TGAA-AGTCC

In the absence of the ancestral sequence, the length of the true alignment and the types of mutations that occurred are unknown. In the above example, the first alignment has a length of 9 nucleotides and there are 3 substitutions at positions 5, 6 and 7. In contrast, the second alignment has a length of 10 nucleotides and there is one deletion (or one insertion of T) at position 5 and one insertion of T (or one deletion) at position 8. Since it is impossible to distinguish whether insertions or deletions have actually occurred, they are often referred to as indels.

A pairwise alignment can be constructed in $O(pq)$ time using the Needleman-Wunsch algorithm [15], where $p$ and $q$ are the lengths of the two sequences. The Needleman-Wunsch algorithm is a dynamic programming algorithm which assigns a score to each alignment and finds the alignments with the highest score. A so-called score matrix is used, which defines scores for every event. Different scoring systems exist. One simple system is: +1 for a match, -1 for a mismatch and -3 for an indel.

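Given two DNA sequences $X = (x_1, x_2, \ldots, x_p)$ and $Y = (y_1, y_2, \ldots, y_q)$ together with a scoring matrix $C$ and the character set $A = \{A, C, G, T, -\}$, the highest-scoring alignments can be found using a recursive formula:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + C(x_i, y_j) \\ F(i-1, j) + C(x_i, -) \\ F(i, j-1) + C(-, y_j) \end{cases}$$

where

* $C(x_i, y_j)$ is the score when $x_i$ is aligned with $y_j$

* $C(x_i, -)$ or $C(-, y_j)$ is the score when there is an indel

* $F(i, j)$ is the best score when aligning the two prefixes $X^i = (x_1, x_2, \ldots, x_i)$ and $Y^j = (y_1, y_2, \ldots, y_j)$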
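As an illustration of the dynamic programming recursion, the following is a minimal sketch of global alignment in the Needleman-Wunsch style. The function name, the default scores (+1 match, -1 mismatch, -3 indel) and the boundary initialisation with cumulative indel penalties are assumptions made for this example; a real score matrix would replace the simple match/mismatch rule.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, indel=-3):
    """Return (best score, aligned x, aligned y) for a global alignment."""
    p, q = len(x), len(y)
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    F = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        F[i][0] = i * indel
    for j in range(1, q + 1):
        F[0][j] = j * indel
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] + indel,   # x_i against a gap
                          F[i][j - 1] + indel)   # y_j against a gap
    # Trace back one optimal path from the bottom-right corner.
    ax, ay, i, j = [], [], p, q
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + indel:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[p][q], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("TGAATAGCC", "TGAAAGTCC"))            # costly indels
print(needleman_wunsch("TGAATAGCC", "TGAAAGTCC", indel=-2))  # cheaper indels
```

With an indel penalty of -3 the ungapped alignment of the two example sequences scores best, whereas lowering the penalty to -2 makes the gapped alignment preferable, which shows how strongly the result depends on the scoring system.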

Amino acid sequences can be aligned in a similar fashion using a larger character set $A = \{A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -\}$ and a different score matrix. A notable observation is that the score matrix for amino acids is much more complicated because amino acids differ in their properties. The more similar two amino acids are, the higher the score for aligning them should be. Also, the score matrix may differ between types of proteins or species. Score matrices can be derived from substitution models.

A pairwise sequence alignment only reveals the relationship between two sequences, whereas the objective of many bioinformatics analyses is to investigate the relationships among a set of sequences. Such analyses require a multiple sequence alignment (MSA), which is a matrix of characters such that row $i$ represents the characters of sequence $i$ and column $j$ represents the homologous characters at position $j$.

Multiple sequence alignments can be constructed using dynamic programming [16]. The time complexity of this approach is $O(m^n 2^n)$, where $n$ is the number of sequences and $m$ is the length of the alignment. This computational burden limits the approach to only a few sequences. Many approximate methods have been proposed for larger numbers of sequences, such as CLUSTALW [17] and MUSCLE [18].

1.2.2 General time-reversible substitution model

A popular method to model sequence evolution is to use a substitution model, which describes the process by which characters in a character set change into others. The substitution process can be modeled by a time-continuous discrete-state Markov model $Q = (q_{xy})$ [19]. For sequences with a character set $A$, $Q$ is a $|A| \times |A|$ matrix in which the value $q_{xy}$ at row $x$, column $y$ represents the instantaneous rate at which character $x$ changes into character $y$, for $x \neq y$. The value $q_{xx}$ is calculated such that the sum of every row equals 0. For the nucleotide substitution model, the $Q$ matrix is as follows:

$$Q = \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix}$$

Substitution models are usually assumed to be stationary, which means the frequency $\Pi = (\pi_x)$ of every available state does not change with time. The frequency vector $\Pi$ and the instantaneous substitution rate matrix $Q$ are dependent:

$$\Pi Q = 0$$

In practice, the $Q$ matrix is typically normalized such that the expected number of substitutions per site is one:

$$-\sum_{x \in A} \pi_x q_{xx} = 1 \qquad (1.4)$$

Substitution models often employ the time-reversibility assumption, that is, the relative substitution rates at which $x$ changes to $y$ and at which $y$ changes to $x$ are the same:

$$\frac{q_{xy}}{\pi_y} = \frac{q_{yx}}{\pi_x}$$

Under the time-reversibility assumption, the $Q$ matrix can be decomposed into two components: a relative substitution rate matrix $R = (r_{xy})$ and a frequency vector $\Pi = (\pi_x)$:

$$q_{xy} = \begin{cases} \pi_y r_{xy} & \text{if } x \neq y \\ -\sum_{z \neq x} q_{xz} & \text{if } x = y \end{cases}$$

The $R$ matrix is also called the exchangeability coefficient matrix, where the relative substitution rate between $x$ and $y$, or the coefficient for the substitution between $x$ and $y$, is

$$r_{xy} = \frac{q_{xy}}{\pi_y}$$
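To make the decomposition concrete, here is a small sketch that assembles a rate matrix from a symmetric exchangeability matrix and a frequency vector, taking off-diagonal entries as $\pi_y r_{xy}$, setting each diagonal so that rows sum to 0, and rescaling to one expected substitution per site. The function name and the numerical values of `R` and `pi` are hypothetical and chosen only for illustration.

```python
def build_rate_matrix(R, pi):
    """Build a normalized reversible rate matrix Q from exchangeabilities R and frequencies pi."""
    n = len(pi)
    # Off-diagonal entries: q_xy = pi_y * r_xy
    Q = [[pi[y] * R[x][y] if x != y else 0.0 for y in range(n)] for x in range(n)]
    for x in range(n):
        Q[x][x] = -sum(Q[x])  # every row sums to 0
    # Expected number of substitutions per site per unit time before rescaling
    scale = -sum(pi[x] * Q[x][x] for x in range(n))
    return [[q / scale for q in row] for row in Q]

# Hypothetical nucleotide example (order A, C, G, T); values are illustrative only.
pi = [0.3, 0.2, 0.2, 0.3]
R = [[0, 1, 4, 1],
     [1, 0, 1, 4],
     [4, 1, 0, 1],
     [1, 4, 1, 0]]
Q = build_rate_matrix(R, pi)
for row in Q:
    print([round(q, 3) for q in row])
```

The resulting matrix can be checked against the properties stated above: rows sum to 0, the frequency vector is stationary, and $\pi_x q_{xy} = \pi_y q_{yx}$ for every pair of states.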

A so-called general time-reversible (GTR) substitution model is a model that employs all of the following assumptions:

* Markovian assumption: The evolution of character $x$ depends only on its current state and is independent of its previous states

* Time-continuity: Substitutions between states can occur at any time during the evolutionary process

* Time-homogeneity: The substitution rates do not change with time

* Stationarity: The frequencies of all states are stable

* Time-reversibility: If two states have the same frequency, the rate at which $x$ changes to $y$ and the rate at which $y$ changes to $x$ are the same

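More specifically, a GTR substitution model for nucleotides can be described by the instantaneous substitution rate matrix $Q$:

$$Q = \begin{pmatrix} q_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\ \pi_A r_{AC} & q_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\ \pi_A r_{AG} & \pi_C r_{CG} & q_{GG} & \pi_T r_{GT} \\ \pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & q_{TT} \end{pmatrix}$$

where each diagonal entry $q_{xx}$ is the negative sum of the other entries in its row.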

The GTR substitution model for nucleotides has 8 free parameters (3 parameters for the frequency vector $\Pi$, because its elements sum to 1, and 5 parameters for the exchangeability coefficient matrix $R$, because of equation 1.4). The GTR substitution model for amino acids can be described in a similar manner, but with many more free parameters (208 free parameters, to be precise).
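The parameter counts follow from simple arithmetic: the frequencies lose one degree of freedom because they sum to 1, and the symmetric exchangeability matrix loses one because of the normalization. A short sketch (the function name is illustrative):

```python
def gtr_free_parameters(n_states):
    """Count the free parameters of a GTR model with n_states characters."""
    freq = n_states - 1                        # frequencies sum to 1
    exch = n_states * (n_states - 1) // 2 - 1  # symmetric R, minus 1 for normalization
    return freq + exch

print(gtr_free_parameters(4))   # nucleotides: 3 + 5 = 8
print(gtr_free_parameters(20))  # amino acids: 19 + 189 = 208
```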

Calculate transition probability

From the instantaneous substitution rate matrix $Q$, one can derive a transition probability matrix $P(t) = \{P_{xy}(t)\}$, where $P_{xy}(t)$ is the probability that character $x$ changes to character $y$ during the evolutionary time $t$. If $Q$ is normalized as in equation 1.4, $t$ is the duration in which an expected number of $t$ substitutions occurs per site. The probability matrix $P(t)$ is useful for constructing phylogenetic trees using the maximum likelihood method or for generating sequences in evolution simulations.

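The $Q$ matrix of the GTR model is diagonalizable [20]; hence, $P(t)$ can be calculated efficiently using the decomposition of $Q$. Specifically,

$$P_{xy}(t) = \sum_{i=1}^{|A|} u_{xi} \, e^{\lambda_i t} \, u^{-1}_{iy}$$

where

* $|A|$ is the size of the character set (the number of possible states)

* $\lambda_1, \ldots, \lambda_{|A|}$ are the eigenvalues of $Q$

* $U = \{u_{xi}\}$ is the matrix of corresponding eigenvectors of $Q$ and $U^{-1} = \{u^{-1}_{iy}\}$ is its inverse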
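The transition probabilities are the entries of the matrix exponential $e^{Qt}$. The sketch below does not use the eigendecomposition described above; to stay within the Python standard library it approximates $e^{Qt}$ with a truncated Taylor series combined with scaling and squaring, which gives the same matrix up to numerical error. The Jukes-Cantor matrix is used as a test case only because its transition probabilities have a known closed form; all names are illustrative.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transition_matrix(Q, t, terms=20, squarings=10):
    """Approximate P(t) = exp(Q t) by a truncated Taylor series with scaling and squaring."""
    n = len(Q)
    step = [[q * t / 2 ** squarings for q in row] for row in Q]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        # term becomes step^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, step)]
        P = [[p + v for p, v in zip(prow, trow)] for prow, trow in zip(P, term)]
    for _ in range(squarings):
        P = mat_mul(P, P)  # exp(M)^2 = exp(2M)
    return P

# Jukes-Cantor rate matrix, normalized to one expected substitution per site.
Q_jc = [[-1.0 if x == y else 1.0 / 3 for y in range(4)] for x in range(4)]
P = transition_matrix(Q_jc, 0.5)
closed_form = 0.25 + 0.75 * math.exp(-4 * 0.5 / 3)
print(P[0][0], closed_form)
```

For a 20-state amino acid model the eigendecomposition route is usually preferred in practice, since the decomposition is computed once and then reused for every branch length.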

Estimate genetic distances

The substitution model is also useful for estimating the genetic distance between homologous sequences. The genetic distance $d(X, Y)$ between two sequences $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ is the number of substitutions which have occurred between $X$ and $Y$. The naive approach to estimating the genetic distance is to count the positions at which the two sequences differ. This observed distance, however, underestimates the true distance because some substitutions are hidden:

* Multiple substitutions: Two or more substitutions have occurred at one position but at most one substitution is observed. For example, A changes to C, then to T, but only one substitution from A to T is found by observation. If T is then changed back to A, no substitution is observed (back substitution).

* Parallel substitutions: The same substitution occurs in both sequences and no substitution is observed. For example, the ancestral nucleotide is A and A is changed to T in both descendant sequences. The observed distance between the descendants is 0 while the actual distance is 2.

With the substitution process modeled by the rate matrix $Q$, the genetic distance $d$ between two sequences can be estimated using the maximum likelihood (ML) approach with the following likelihood function [19]:

$$L(d) = \prod_{i=1}^{m} \pi_{x_i} P_{x_i y_i}(d)$$
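For a general rate matrix the maximum likelihood distance has to be found numerically. As a simplified illustration, the sketch below assumes the Jukes-Cantor nucleotide model instead of a GTR model, because in that special case the ML distance has the closed form $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$, where $p$ is the observed proportion of differing sites (valid for $p$ below 0.75). The function names and the example sequences are illustrative.

```python
import math

def observed_distance(x, y):
    """Proportion of aligned positions that differ (columns with a gap are skipped)."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(x, y):
    """ML distance under the Jukes-Cantor model: d = -3/4 * ln(1 - 4p/3)."""
    p = observed_distance(x, y)
    return -0.75 * math.log(1 - 4 * p / 3)

x, y = "TGAATAGCC", "TGAAAGTCC"
print(observed_distance(x, y), jc69_distance(x, y))
```

The corrected distance is always at least as large as the observed one, reflecting the hidden multiple and parallel substitutions that simple counting misses.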


1.2.3 Model of rate heterogeneity

The GTR substitution model as described above represents the evolution of one site in the analysed sequences. It is unrealistic to assume that every site evolves according to the same model. For example, the third position in a codon usually has a much faster substitution rate than the other two positions [21]. A model in which substitution rates vary among sites is called a model of rate heterogeneity.

Using a different substitution model for each site may cause the model to overfit the data. Thus, many models of heterogeneous substitution rates have been proposed [22] [23]. A common type of rate heterogeneity model [22] is a two-state model that categorizes sequence sites as either invariable or variable. For every site $i$, there is a substitution rate scaling factor

$$r_i = \begin{cases} 0 & \text{if site } i \text{ is invariable} \\ 1 & \text{otherwise} \end{cases}$$

That is, some sites in the sequences never change while the other sites evolve according to one substitution model.

Determining with certainty whether a site is variable or invariable is impossible, because a variable site may happen not to vary in the dataset or may have undergone back substitutions. One more parameter $\theta$ is therefore introduced to represent the proportion of sites that are invariable.

Another widely used model of rate heterogeneity is the Gamma-distributed rate heterogeneity model [23]. In this type of model, substitution rate scaling factors across sites are drawn from a $\Gamma$-distribution with expectation 1.0 and variance $1/\alpha$, $\alpha > 0$.

One can adjust the rate heterogeneity by adjusting the $\alpha$ parameter; $\alpha$ is called the shape parameter of the $\Gamma$-distribution. Generally speaking, the smaller the shape parameter is, the stronger the rate heterogeneity among sites becomes. For example, setting $\alpha = 0.5$ means that most sites hardly undergo substitutions but some other sites have high substitution rates. In contrast, if $\alpha = 5$, sites have highly similar substitution rates (the scaling factors of most sites are approximately 1.0).
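The effect of the shape parameter can be seen by simulation. The sketch below draws site rates from a Gamma distribution with shape $\alpha$ and scale $1/\alpha$ (so the mean is 1 and the variance is $1/\alpha$) using the standard library's `random.gammavariate`. The function name, the sample size, the seed and the threshold of 0.1 used to call a site "nearly invariable" are arbitrary choices for this illustration.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site rate scaling factors from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_slow(rates, threshold=0.1):
    """Fraction of sites whose rate falls below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)

for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    print(alpha,
          round(statistics.mean(rates), 2),       # close to 1 for any alpha
          round(statistics.pvariance(rates), 2),  # close to 1/alpha
          round(fraction_slow(rates), 3))         # share of nearly invariable sites
```

With a small shape parameter a sizeable share of sites receives a rate near zero, while with a large one almost no site does.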

A discrete $\Gamma$-distribution can be used instead of its continuous version. In the discrete $\Gamma$-distributed rate heterogeneity model, a parameter $c$, the number of substitution rate scaling factor categories $r_1, r_2, \ldots, r_c$, is introduced [24].
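One way to picture the discrete version is to split the continuous distribution into $c$ equally probable bins and represent each bin by its mean rate. Phylogenetic software computes these values from exact Gamma quantiles; the sketch below is only a Monte Carlo approximation under that equal-probability assumption, chosen because the Python standard library has no inverse Gamma distribution function. The function name, sample size and seed are illustrative.

```python
import random

def discrete_gamma_rates(alpha, c, n_samples=100000, seed=0):
    """Approximate c equal-probability rate categories by the mean of each quantile bin."""
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // c  # any leftover draws at the top are ignored
    rates = [sum(draws[k * size:(k + 1) * size]) / size for k in range(c)]
    mean = sum(rates) / c
    return [r / mean for r in rates]  # rescale so the average rate is exactly 1

print([round(r, 2) for r in discrete_gamma_rates(0.5, 4)])
print([round(r, 2) for r in discrete_gamma_rates(5.0, 4)])
```

Each site is then assumed to belong to any of the $c$ categories with equal probability, so the site likelihood becomes an average over $c$ rate-scaled likelihoods.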

A hybrid model which integrates the two-state model and the $\Gamma$-distribution model is also commonly used in practice. In this type of model, a proportion $\theta$ of sites are invariable while the other sites are variable, with substitution rate scaling factors drawn from the discrete $\Gamma$-distribution with shape parameter $\alpha$ and $c$ rate categories.

The parameter $c$ is typically set in advance, while the invariant proportion $\theta$ and the shape parameter $\alpha$ are estimated from data. The estimation of these extra parameters has been implemented in phylogenetic packages such as PhyML [25] and IQ-TREE [26].

Amino acid substitution models have many more free parameters than their nucleotide counterparts. The parameters are usually estimated from data; therefore, substitution models for amino acids are called empirical substitution models. Different methods have been proposed to estimate amino acid substitution models.

The Dayhoff model [5] was the first model describing the process of amino acid substitution. The model was estimated using a counting method with 71 sets of closely related proteins and 1572 observed substitutions. The counting method was pioneering in the early stage of substitution model research. The JTT model was estimated using the same counting method [6]; JTT, however, used a larger protein dataset in the estimation process. The BLOSUM62 model [27] also employed a counting approach, but only alignments with at least 62% similarity were used to count amino acid substitutions.

The drawback of the counting method is that it is only applicable to closely related sequences. Later research [28] introduced the resolvent method and estimated the VT model from various types of protein sequences.

The maximum likelihood method was used to estimate the mtREV model [7] based on 20 complete vertebrate mtDNA-encoded protein sequences. The WAG model [8] was estimated using an approximate maximum likelihood approach based on 3,905 globular protein sequences from 182 protein families. The LG model [4] incorporated rate heterogeneity across sites into the estimation process, using protein sequences from the Pfam database. The LG model was shown to be better than previous models in terms of constructing maximum likelihood phylogenetic trees.

Although many models have been estimated from large and diverse databases of protein sequences from a wide range of species (these models are often referred to as general models), recent studies have shown that they might be inappropriate for representing the evolutionary process of some specific species. A number of species-specific amino acid substitution models have been introduced, such as the rtREV model [9], which is used for inference on retroviruses, and the HIV-specific models [10] (HIVb, HIVw) estimated from HIV data. Recently, the FLU model [11] was estimated for influenza virus and is highly different from widely used existing models. Experiments showed that species-specific models better characterize the evolutionary process of the corresponding species' sequences.

more related than A and C because the common ancestor of A and B is more recent than the common ancestor of A and C. Each branch of a phylogenetic tree can be assigned a value, called the branch length, to represent the genetic distance between the connected nodes.

There are many different binary trees that can represent the evolution of the same set of sequences. Without regard to branch lengths, the numbers of possible rooted and unrooted binary trees for a set of $n$ sequences grow faster than exponentially with $n$. The number of binary unrooted trees $B(n)$ for $n \geq 3$ sequences is

$$B(n) = \prod_{i=3}^{n} (2i - 5) = \frac{(2n-5)!}{2^{n-3}(n-3)!}$$
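The growth in the number of tree topologies is easy to tabulate. The sketch below uses the standard counting results for labeled binary trees: $(2n-5)!!$ unrooted topologies for $n \geq 3$ sequences and $(2n-3)!!$ rooted ones, the latter obtained by treating the root as one extra leaf. The function names are illustrative.

```python
def count_unrooted_trees(n):
    """Number of unrooted binary tree topologies for n >= 3 labeled sequences."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5  # adding the i-th sequence can attach to any of 2i - 5 branches
    return count

def count_rooted_trees(n):
    """Number of rooted binary tree topologies for n >= 2 labeled sequences."""
    return count_unrooted_trees(n + 1)  # the root behaves like an additional leaf

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Already for 10 sequences there are 2,027,025 unrooted topologies, which is why exhaustive search over trees is infeasible for realistic datasets.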

Three optimization criteria are often used to reconstruct phylogenetic trees from homologous sequences: minimum evolution (ME), maximum parsimony (MP) and maximum likelihood (ML).

Title	Amino Acid Substitution Model for Rotavirus
Author	Nguyen Duc Canh
Supervisor	Assoc. Prof. Le Sy Vinh
University	Hanoi University of Engineering and Technology
Major	Computer Science
Type	Master thesis
Year of publication	2019
City	Hanoi

Format
Number of pages	56
File size	498.07 KB