An improving method for estimating amino acid replacement models
Lê Văn Đạt
University of Engineering and Technology (Trường Đại học Công nghệ)
Major: Computer Science; Code: 60 48 01
Supervisor: Dr. Lê Sỹ Vinh
Year of defense: 2012
Abstract: Amino acid replacement models (also called amino acid substitution models or matrices) play important roles in protein phylogenetic analysis and protein sequence alignment. Dayhoff was the first to propose a method for building amino acid models, in 1972. Currently, maximum likelihood (ML) methods are widely used to estimate popular models such as WAG, LG, and FLU. However, ML methods are slow and not applicable to large datasets. The most time-consuming step in estimating matrices is building phylogenetic trees from protein alignments. In this thesis, we propose new methods to overcome this obstacle by splitting large alignments into small ones that still contain enough evolutionary information for estimating matrices. Experiments with both the Pfam and FLU data sets show that the proposed methods are about three to nine times faster than the best current method, while the quality of the estimated matrices is nearly the same. Thus, our methods will enable researchers to estimate matrices from very large datasets.
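To make the splitting idea concrete, here is a minimal Python sketch of one way a large alignment could be partitioned at random into smaller sub-alignments. It is an illustration under stated assumptions, not the procedure used in the thesis: the function name `random_split`, the `max_size` parameter, the dictionary representation of an alignment, and the toy sequences are all hypothetical.

```python
import random


def random_split(alignment, max_size, seed=0):
    """Randomly partition an alignment into smaller sub-alignments.

    `alignment` maps sequence names to aligned sequences (an assumed,
    simplified representation). Each sub-alignment receives at most
    `max_size` sequences, and every sequence lands in exactly one part.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    names = list(alignment)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(names)
    return [
        {name: alignment[name] for name in names[i:i + max_size]}
        for i in range(0, len(names), max_size)
    ]


# Toy alignment with five aligned protein sequences (made-up data).
alignment = {
    "seq1": "MKV-LA",
    "seq2": "MKVTLA",
    "seq3": "MRV-LA",
    "seq4": "MKI-LS",
    "seq5": "MKVTLS",
}

parts = random_split(alignment, max_size=2)
for part in parts:
    print(sorted(part))
```

In a pipeline of this kind, a tree would then be built for each small sub-alignment rather than for the whole alignment, which is where the time saving described in the abstract would come from.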
List of Figures
1.1 Motivation
1.2 Outline of thesis
2.1 Amino acids and amino acid substitutions
2.2 Markov model
2.3 Amino acid substitution models
2.4 Methods to estimate amino acid substitution models
2.4.1 Counting methods
2.4.2 Maximum likelihood methods
2.4.2.1 Intro
2.4.2.2 Steps to build an amino acid substitution model by maximum likelihood method
2.4.3
3 Alignment splitting methods for estimating amino
3.1 The multiple alignment
3.2 Steps to build an amino acid substitution model by alignment splitting method
3.3 Random alignment splitting
3.4 Tree-based alignment splitting
4 Results
4.1 Compare methods on Pfam data set
4.1.1 Data preparation
4.1.2 Time
4.1.3 P
4.1.4 Robustness of model
4.2 Compare methods on FLU data set
4.2.1 Data preparation
4.2.2 Time
4.2.3 P
4.2.4 Robustness of model