The Phylogenetic Handbook
Second Edition
The Phylogenetic Handbook provides a comprehensive introduction to theory and practice of nucleotide and protein phylogenetic analysis. This second edition includes seven new chapters, covering topics such as Bayesian inference, tree topology testing, and the impact of recombination on phylogenies. The book has a stronger focus on hypothesis testing than the previous edition, with more extensive discussions on recombination analysis, detecting molecular adaptation and genealogy-based population genetics. Many chapters include elaborate practical sections, which have been updated to introduce the reader to the most recent versions of sequence analysis and phylogeny software, including Blast, FastA, Clustal, T-coffee, Muscle, Dambe, Tree-Puzzle, Phylip, Mega4, Paup*, Iqpnni, Consel, ModelTest, ProtTest, Paml, HyPhy, MrBayes, Beast, Lamarc, SplitsTree, and Rdp3. Many analysis tools are described by their original authors, resulting in clear explanations that constitute an ideal teaching guide for advanced-level undergraduate and graduate students.
Philippe Lemey is a FWO postdoctoral researcher at the Rega Institute, Katholieke Universiteit Leuven, Belgium, where he completed his Ph.D. in Medical Sciences. He has been an EMBO Fellow and a Marie-Curie Fellow in the Evolutionary Biology Group at the Department of Zoology, University of Oxford. His research focuses on molecular evolution of viruses by integrating molecular biology and computational approaches.
Marco Salemi is Assistant Professor at the Department of Pathology, Immunology and Laboratory Medicine of the University of Florida School of Medicine, Gainesville, USA. His research interests include molecular epidemiology, intra-host virus evolution, and the application of phylogenetic and population genetic methods to the study of human and simian pathogenic viruses.
Anne-Mieke Vandamme is a Full Professor in the Medical Faculty at the Katholieke Universiteit, Belgium, working in the field of clinical and epidemiological virology. Her laboratory investigates treatment responses in HIV-infected patients and is respected for its scientific and clinical contributions to virus–drug resistance. Her laboratory also studies the evolution and molecular epidemiology of human viruses such as HIV and HTLV.
The Phylogenetic Handbook
A Practical Approach to Phylogenetic
Analysis and Hypothesis Testing
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore,
São Paulo, Delhi, Dubai, Tokyo
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
First published in print format
Information on this title: www.cambridge.org/9780521877107
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Section II: Data preparation
2.2.3 Specialized sequence databases, reference databases, and
2.3 Composite databases, database mirroring, and search tools
2.3.3 Some general considerations about database searching
Marc Van Ranst and Philippe Lemey
3.9 Nucleotide sequences vs amino acid sequences
Des Higgins and Philippe Lemey
3.11.2 Aligning the primate Trim5α amino acid sequences
3.14 Comparing alignments using the AltAVisT web tool
4.3 Number of mutations in a given time interval *(optional)
4.4 Nucleotide substitutions as a homogeneous Markov process
Marco Salemi
4.8 Observed vs estimated genetic distances: the JC69 model
4.9 Kimura 2-parameters (K80) and F84 genetic distances
4.10.1 Modeling rate heterogeneity among sites
4.13 Choosing among different evolutionary models
Yves Van de Peer
5.2 Tree-inference methods based on genetic distances
5.5 Programs to display and manipulate phylogenetic trees
5.6 Distance-based phylogenetic inference in Phylip
5.7 Inferring a Neighbor-Joining tree for the primates data set
5.8 Inferring a Fitch–Margoliash tree for the mtDNA data set
5.10 Impact of genetic distances on tree topology: an example using
6.2.1 The simple case: maximum-likelihood tree for
6.3 Computing the probability of an alignment for a fixed tree
Heiko A Schmidt and Arndt von Haeseler
6.9 An illustrative example of an ML tree reconstruction
6.9.2 Getting a tree with branch support values using
7.11.9 Summarizing samples of substitution model parameters
7.11.10 Summarizing samples of trees and branch lengths
8 Phylogeny inference based on parsimony and other methods
David L Swofford and Jack Sullivan
8.3.1 Calculating the length of a given tree under the parsimony
David L Swofford and Jack Sullivan
8.5 Analyzing data with Paup* through the command-line interface
Fred R Opperdoes
9.2.4 Nature of sequence divergence in proteins (the PAM unit)
Fred R Opperdoes and Philippe Lemey
9.4 A phylogenetic analysis of the Leishmanial glyceraldehyde-3-phosphate dehydrogenase gene carried out via the
9.5 A phylogenetic analysis of trypanosomatid glyceraldehyde-3-phosphate dehydrogenase protein sequences using Bayesian
10.3 Hierarchical likelihood ratio tests (hLRTs)
11.3 Likelihood ratio test of the global molecular clock
Philippe Lemey and David Posada
Heiko A Schmidt
12.2 Some definitions for distributions and testing
12.4 How to get the distribution of likelihood ratios
12.5.2 The original Kishino–Hasegawa (KH) test
12.9 Testing a set of trees with Tree-Puzzle and Consel
12.9.1 Testing and obtaining site-likelihood with
Section V: Molecular adaptation
13 Natural selection and adaptation of molecular sequences
Oliver G Pybus and Beth Shapiro
14.6 Estimating branch-by-branch variation in rates
14.7.5 The importance of synonymous rate variation
14.8 Comparing rates at a site in different branches
Sergei L Kosakovsky Pond, Art F Y Poon, and Simon D W Frost
14.13.1 Fitting a global model in the HyPhy GUI
14.13.2 Fitting a global model with a HyPhy
14.14 Estimating branch-by-branch variation in rates
14.14.1 Fitting a local codon model in HyPhy
14.14.2 Interclade variation in substitution rates
14.14.3 Comparing internal and terminal branches
14.15 Estimating site-by-site variation in rates
14.15.3 Single-likelihood ancestor counting (SLAC)
14.16 Estimating gene-by-gene variation in rates
14.16.1 Comparing selection in different populations
14.16.2 Comparing selection between different
Philippe Lemey and David Posada
15.3 Linkage disequilibrium, substitution patterns, and
15.6 Recombination analysis as a multifaceted discipline
15.6.2 Recombinant identification and breakpoint detection
15.8 Performance of recombination detection tools
16 Detecting and characterizing individual recombination events
Mika Salminen and Darren Martin
16.3 Theoretical basis for recombination detection methods
16.4 Identifying and characterizing actual recombination events
Mika Salminen and Darren Martin
16.6 Analyzing example sequences to detect and characterize individual
16.6.2 Exercise 2: Mapping recombination with Simplot
16.6.3 Exercise 3: Using the “groups” feature of Simplot
16.6.4 Exercise 4: Setting up Rdp3 to do an exploratory
18.3.1 Substitution models and rate models among sites
18.3.2 Rate models among branches, divergence time estimation,
Alexei J Drummond and Andrew Rambaut
18.7.1 Translating the data in amino acid sequences
19 Lamarc: Estimating population genetic parameters
Mary K Kuhner
19.2 Basis of the Metropolis–Hastings MCMC sampler
19.8.1 Converting data using the Lamarc file converter
20.2 Steel’s method: potential problem, limitation, and
20.3 Xia’s method: its problem, limitation, and implementation
Xuhua Xia and Philippe Lemey
21 Split networks. A tool for exploring complex evolutionary
Vincent Moulton and Katharina T Huber
21.1 Understanding evolutionary relationships through networks
21.2 An introduction to split decomposition theory
Vincent Moulton and Katharina T Huber
School of Computing Sciences
University of East Anglia
Norwich, UK
John P Huelsenbeck
Department of Integrative Biology
University of California at Berkeley
3060 Valley Life Sciences Bldg
Berkeley, CA 94720-3140, USA
Sergei Kosakovsky Pond
Antiviral Research Center University of California
150 W Washington St, Ste 100 San Diego, CA 92103, USA
Mary Kuhner
Department of Genome Sciences University of Washington Seattle (WA), USA
Philippe Lemey
Rega Institute for Medical Research Katholieke Universiteit Leuven Leuven, Belgium
Vincent Moulton
School of Computing Sciences University of East Anglia Norwich, UK
Fred R Opperdoes
C de Duve Institute of Cellular Pathology Universite Catholique de Louvain Brussels, Belgium
Swedish Museum of Natural History
Box 50007, SE-104 05 Stockholm
Beth Shapiro
Department of Biology The Pennsylvania State University
326 Mueller Lab University Park, PA 16802 USA
Korbinian Strimmer
Institute for Medical Informatics Statistics and Epidemiology (IMISE)
University of Leipzig Germany
Florida, USA
Anne-Mieke Vandamme
Rega Institute for Medical Research Katholieke Universiteit Leuven Leuven, Belgium
Yves Van de Peer
VIB / Ghent University
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent, Belgium
Paul van der Mark
School of Computational Science
Florida State University
Tallahassee, FL 32306-4120, USA
Marc Van Ranst
Rega Institute for Medical Research
Katholieke Universiteit Leuven
Leuven, Belgium
Arndt von Haeseler
Center for Integrative Bioinformatics Vienna (CIBIV)
Max F Perutz Laboratories (MFPL)
Dr Bohr Gasse 9 A-1030 Wien, Austria
Xuhua Xia
Biology Department University of Ottawa Ottawa, Ontario Canada
“It looked insanely complicated, and this was one of the reasons why the snug plastic cover it fitted into had the words DON’T PANIC printed on it in large friendly letters.”
Douglas Adams
The Hitch Hiker’s Guide to the Galaxy
As of February 2008 there were 85 759 586 764 bases in 82 853 685 sequences stored in GenBank (Nucleic Acids Research, Database issue, January 2008). Under any criteria, this is a staggering amount of data. Although these sequences come from a myriad of organisms, from viruses to humans, and include genes with a diverse range of functions, it can all, at least in principle, be studied from an evolutionary perspective. But how? If ever there was an invitation to panic, it is this. Enter The Phylogenetic Handbook, an invaluable guide to the phylogenetic universe.
The first edition of The Phylogenetic Handbook was published in 2003 and represented something of a landmark in evolutionary biology, as it was the first accessible, hands-on instruction manual for molecular phylogenetics, yet with a healthy dose of theory. Up until this point, the evolutionary analysis of gene sequence was often considered something of a black art. The Phylogenetic Handbook made it accessible to anyone with a desktop computer.
The new edition of The Phylogenetic Handbook moves the field along nicely and has a number of important intellectual and structural changes from the earlier edition. Such a revision is necessary to track the major changes in this rapidly evolving field, in terms of both the new theory and new methodologies available for the computational analysis of gene sequence evolution. The result is a fine balance between theory and practice. As with the First Edition, the chapters take us from the basic, but fundamental, tasks of database searching and sequence alignment, to the complexity of the coalescent. Similarly, all the chapters are written by acknowledged experts in the field, who work at the coal-face of developing new methods and using them to address fundamental biological questions. Most of the authors are also remarkably young, highlighting the dynamic nature of this discipline.
The biggest alteration from the First Edition is the restructuring into a series of sections, complete with both theory and practice chapters, with each designed to take the uninitiated through all the steps of evolutionary bioinformatics. There are also more chapters on a greater range of topics, so the new edition is satisfyingly comprehensive. Indeed, it almost stands alone as a textbook in modern population genetics. It is also pleasing to see a much stronger focus on hypothesis testing, which is a key aspect of modern phylogenetic analysis. Another welcome change is the inclusion of chapters describing Bayesian methods for both phylogenetic inference and revealing population dynamics, which fills a major gap in the literature, and highlights the current popularity of this form of statistical inference.
The Phylogenetic Handbook will calm the nerves of anyone charged with undertaking an evolutionary analysis of gene sequence data. My only suggestion for an improvement to the third edition is the words DON’T PANIC on the cover.
Edward C. Holmes
June 12, 2008
The idea for The Phylogenetic Handbook was conceived during an early edition of the Workshop on Virus Evolution and Molecular Epidemiology. The rationale was simple: to collect the information being taught in the workshop and turn it into a comprehensive, yet simply written textbook with a strong practical component. Marco and Annemie took up this challenge, and, with the help of many experts in the field, successfully produced the First Edition in 2003. The resulting text was an excellent primer for anyone taking their first computational steps into evolutionary biology, and, on a personal note, inspired me to try out many of the techniques introduced by the book in my own research. It was therefore a great pleasure to join in the collaboration for the Second Edition of The Phylogenetic Handbook.
Computational molecular biology is a fast-evolving field in which new techniques are constantly emerging. A book with a strong focus on the software side of phylogenetics will therefore rapidly grow a need for updating. In this Second Edition, we hope to have satisfied this need to a large extent. We also took the opportunity to provide a structure that groups different types of sequence analyses according to the evolutionary hypothesis they focus on. Evolutionary biology has matured into a fully quantitative discipline, with phylogenies themselves having evolved from classification tools to central models in quantifying underlying evolutionary and population genetic processes. Inspired by this, the Second Edition provides a broader coverage of techniques for testing models and trees, detecting recombination, the analysis of selective pressure and genealogy-based population genetics. Changing the subtitle to A Practical Approach to Phylogenetic Inference and Hypothesis Testing emphasizes this shift in focus. Thanks to novel contributions, we also hope to have addressed the need for a Bayesian treatment of phylogenetic inference, which started to gain a great deal of popularity at the time the content for the First Edition was already fixed.
Following the philosophy of the First Edition, the book includes many step-by-step software tutorials using example data sets. We have not used the same data sets throughout the complete Second Edition; not only is it difficult to find data sets that consistently meet the assumptions or reveal interesting aspects of all the methods described, but we also feel that being confronted with different data with their own characteristics adds educational value. These data sets can be retrieved from www.thephylogenetichandbook.org, where other useful links listed in the book can also be found. Furthermore, a glossary has been compiled with important terms that are indicated in italics and boldface throughout the book.
We are very grateful to the researchers who took the time to contribute to this edition, either by updating a chapter or writing a novel contribution. I hope that my persistent pestering has not affected any of these friendships. We would like to thank Eddie Holmes in particular for writing the Foreword to the book. It has been a pleasure to work with Katrina Halliday and Alison Evans of Cambridge University Press. We also wish to thank those who supported our research and the work on this book: the Flemish “Fonds voor Wetenschappelijk Onderzoek”, EMBO and Marie Curie funding. Finally, we would like to express our thanks to colleagues, family and friends onto whom we undoubtedly projected some of the pressure in completing this book.
Philippe Lemey
Section I
Introduction
1 Basic concepts of molecular evolution
Anne-Mieke Vandamme
The genome, carrier of this genetic information, is in most organisms deoxyribonucleic acid (DNA), whereas some viruses have a ribonucleic acid (RNA) genome. Part of the genetic information in DNA is transcribed into RNA: either mRNA, which acts as a template for protein synthesis; rRNA, which together with ribosomal proteins constitutes the protein translation machinery; tRNA, which offers the encoded amino acid; or small RNAs, some of which are involved in regulating expression of genes. The genomic DNA also contains elements, such as promoters and enhancers, which orchestrate the proper transcription into RNA. A large part of the genomic DNA of eukaryotes consists of genetic elements such as introns or Alu repeats, the function of which is still not entirely clear. Proteins, RNA, and to some extent DNA, constitute the phenotype of an organism that interacts with the environment.
The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (eds.). Published by Cambridge University Press. © Cambridge University Press 2009.
DNA is a double helix with two antiparallel polynucleotide strands, whereas RNA is a single-stranded polynucleotide. The backbone in each DNA strand consists of deoxyriboses with a phosphodiester linking each 5′ carbon with the 3′ carbon of the next sugar. In RNA the sugar moiety is ribose. On each sugar, one of the four following bases is linked to the 1′ carbon in DNA: the purines, adenine (A) or guanine (G), or the pyrimidines, thymine (T) or cytosine (C); in RNA, thymine is replaced by uracil (U). Hydrogen bonds and base stacking result in the two DNA strands binding together, with strong (triple) bonds between G and C, and weak (double) bonds between T/U and A (Fig. 1.1). These hydrogen-bonded pairs are called complementary. During DNA duplication or RNA transcription, DNA or RNA polymerases synthesize a complementary 5′–3′ strand starting with the lower 3′–5′ DNA strand as template, in order to preserve the genetic information. This genetic information is represented by a one letter code, indicating the 5′–3′ sequential order of the bases in the DNA or RNA (Fig. 1.1). A nucleotide sequence is thus represented by a contiguous stretch of the four letters A, G, C, and T/U.
Fig. 1.1 Chemical structure of double-stranded DNA. The chemical moieties are indicated as follows: dR, deoxyribose; P, phosphate; G, guanine; T, thymine; A, adenine; and C, cytosine. The strand orientation is represented in a standard way: in the upper strand 5′–3′, indicating that the chain starts at the 5′ carbon of the first dR, and ends at the 3′ carbon of the last dR. The one letter code of the corresponding genetic information is given on top, and only takes into account the 5′–3′ upper strand. (Courtesy of Professor C. Pannecouque.)
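The base-pairing rules described above can be illustrated with a short Python sketch. It is not taken from the book: the function names and the example sequence are hypothetical, and the sketch assumes sequences that contain only the four unambiguous DNA letters.

```python
# Watson-Crick pairing in DNA: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, written in its own 5'-3' direction."""
    # The two strands are antiparallel, so the complement is read backwards.
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def to_rna(dna: str) -> str:
    """Write a DNA sequence in the RNA alphabet (thymine replaced by uracil)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    upper_strand = "GATTACA"  # hypothetical example, written 5'-3'
    print(reverse_complement(upper_strand))  # lower strand, also written 5'-3'
    print(to_rna(upper_strand))
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check that the pairing table is symmetric.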
encoded amino acids
by a three- or one-letter abbreviation (Table 1.1). An amino acid sequence is generally represented by a contiguous stretch of one-letter amino acid abbreviations (with 20 possible letters).
The genetic code is universal for all organisms, with only a few exceptions such as the mitochondrial code, and it is usually represented as an RNA code because the RNA is the direct template for protein synthesis (Table 1.2). The corresponding DNA code can be easily reconstructed by replacing the U by a T. Each position of the triplet code can be one of four bases; hence, 4³ or 64 possible triplets encode 20 amino acids (61 sense codes) and 3 stop codons (3 non-sense codes). The genetic code is said to be degenerate, or redundant, since all amino acids except methionine and tryptophan have more than one possible triplet code. The first codon for methionine downstream (or 3′) of the ribosome entry site also acts as the start codon for the translation of a protein. As a result of the triplet code, each contiguous nucleotide stretch has three reading frames in the 5′–3′ direction. The complementary strand encodes another three reading frames. A reading frame that is able to encode a protein starts with a codon for methionine, and ends with a stop codon. These reading frames are called open reading frames or ORFs.
Table 1.2
        U           C           A            G
U    UUU Phe     UCU Ser     UAU Tyr      UGU Cys      U
     UUC Phe     UCC Ser     UAC Tyr      UGC Cys      C
     UUA Leu     UCA Ser     UAA STOP     UGA STOP     A
     UUG Leu     UCG Ser     UAG STOP     UGG Trp      G
C    CUU Leu     CCU Pro     CAU His      CGU Arg      U
     CUC Leu     CCC Pro     CAC His      CGC Arg      C
     CUA Leu     CCA Pro     CAA Gln      CGA Arg      A
     CUG Leu     CCG Pro     CAG Gln      CGG Arg      G
A    AUU Ile     ACU Thr     AAU Asn      AGU Ser      U
     AUC Ile     ACC Thr     AAC Asn      AGC Ser      C
     AUA Ile     ACA Thr     AAA Lys      AGA Arg      A
     AUG Met     ACG Thr     AAG Lys      AGG Arg      G
G    GUU Val     GCU Ala     GAU Asp      GGU Gly      U
     GUC Val     GCC Ala     GAC Asp      GGC Gly      C
     GUA Val     GCA Ala     GAA Glu      GGA Gly      A
     GUG Val     GCG Ala     GAG Glu      GGG Gly      G
The first nucleotide letter is indicated on the left, the second on the top, and the third on the right side. The amino acids are given by their three-letter code (see Table 1.1). Three stop codons are indicated.
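The genetic code and the six reading frames described above can be explored with a minimal Python sketch. It is an illustration only, not a method from the book: the helper names are hypothetical, the table is the standard (universal) code written with one-letter amino acid symbols and `*` for stop, and the ORF search simply reports every stretch from a methionine codon to the next in-frame stop codon.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Degeneracy: how many codons encode each amino acid (or stop).
CODONS_PER_PRODUCT = Counter(CODON_TABLE.values())


def translate(rna: str) -> str:
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def reverse_complement(rna: str) -> str:
    """Complementary RNA strand, written 5'-3'."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[b] for b in reversed(rna))


def find_orfs(rna: str) -> list:
    """Return (strand, frame, protein) for each Met-to-stop stretch in six frames."""
    orfs = []
    for strand, seq in (("+", rna), ("-", reverse_complement(rna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            start = protein.find("M")
            while start != -1:
                stop = protein.find("*", start)
                if stop == -1:  # no in-frame stop codon: the frame stays open
                    break
                orfs.append((strand, frame, protein[start:stop]))
                start = protein.find("M", stop)
    return orfs


if __name__ == "__main__":
    print(len(CODON_TABLE), CODONS_PER_PRODUCT["*"])  # triplets and stop codons
    print(find_orfs("AUGGCCUAA"))  # hypothetical toy sequence
```

`CODONS_PER_PRODUCT` makes the degeneracy visible: in the standard code only methionine and tryptophan are encoded by a single triplet.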
During duplication of the genetic information, the DNA or RNA polymerase can occasionally incorporate a non-complementary nucleotide. In addition, bases in a DNA strand can be chemically modified due to environmental factors such as UV light or chemical substances. These modified bases can potentially interfere with the synthesis of the complementary strand and thereby also result in a nucleotide incorporation that is not complementary to the original nucleotide. When these changes escape the cellular repair mechanisms, the genetic information is altered, resulting in what is called a point mutation. The genetic code has evolved in such a way that a point mutation at the third codon position rarely results in an amino acid change (only in 30% of possible changes). A change at the second codon position always, and at the first codon position mostly (96%), results in an amino acid change. Mutations that do not result in amino acid changes are called silent or synonymous mutations. When a mutation results in the incorporation of a different amino acid, it is called non-silent or non-synonymous. A site within a coding triplet is said to be fourfold degenerate when all possible changes at that site are synonymous (for example “CUN”); twofold degenerate when only two different amino acids are encoded by the four possible nucleotides at that position (for example, “UUN”); and non-degenerate when all possible changes alter the encoded amino acid (for example, “NUU”).
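The percentages quoted above can be checked against the standard genetic code with a small Python sketch. The counting convention is an assumption of this sketch, not something stated in the text: only the 61 sense codons are mutated, every one of the three alternative bases is tried at the chosen position, and a change to a stop codon is counted as non-synonymous. Other conventions shift the figures slightly. The function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def nonsynonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that alter the encoded product, taken over all sense codons."""
    changed = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # assumption: stop codons are not used as starting points
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            changed += CODON_TABLE[mutant] != aa
    return changed / total


def distinct_products(codon: str, position: int) -> int:
    """Number of different products encoded when one site takes all four bases:
    1 means fourfold degenerate, 2 twofold degenerate, 4 non-degenerate."""
    variants = (codon[:position] + b + codon[position + 1:] for b in BASES)
    return len({CODON_TABLE[v] for v in variants})


if __name__ == "__main__":
    for pos in range(3):
        print(f"codon position {pos + 1}: {nonsynonymous_fraction(pos):.0%}")
    print(distinct_products("CUU", 2), distinct_products("UUU", 2),
          distinct_products("UUU", 0))
```

Under this convention the first, second and third codon positions come out at roughly 96%, 100% and 31%, in line with the figures in the text.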
Incorporation errors replacing a purine (A, G) with a purine and a pyrimidine (C, T) with a pyrimidine occur more easily because of chemical and steric reasons. The resulting mutations are called transitions. Transversions, purine to pyrimidine changes and the reverse, are less likely. When resulting in an amino acid change, transversions usually have a larger impact on the protein than transitions, because of the more drastic changes in biochemical properties of the encoded amino acid. There are four possible transition errors (A ↔ G, C ↔ T), and eight possible transversion errors (A ↔ C, A ↔ T, G ↔ C, G ↔ T); therefore, if a mutation occurred randomly, a transversion would be two times more likely than a transition. However, the genetic code has evolved in such a way that, in many genes, the less disruptive transitions are more likely to occur than transversions.
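The count of four transitions and eight transversions can be reproduced by enumerating all twelve directed base changes. The following Python sketch is illustrative only; the function names and the two toy sequences are hypothetical, and the pairwise count assumes two aligned sequences of equal length without gaps.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Classify a change from base a to a different base b."""
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def possible_changes() -> dict:
    """Count all 12 directed single-base changes by type."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in permutations("ACGT", 2):
        counts[classify(a, b)] += 1
    return counts


def count_ts_tv(seq1: str, seq2: str) -> tuple:
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if classify(a, b) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


if __name__ == "__main__":
    print(possible_changes())
    print(count_ts_tv("AACGT", "GACTA"))  # hypothetical aligned pair
```

A transition to transversion ratio well above the 1:2 expected from random change is what the closing sentence of the paragraph refers to.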
Single nucleotide changes in a particular codon often change the amino acid to one with similar properties (e.g. hydrophobic), such that the tertiary structure of the encoded protein is not altered dramatically. Living organisms can therefore tolerate a limited number of nucleotide point mutations in their coding regions. Point mutations in non-coding regions are subject to other constraints, such as conservation of binding places for proteins, conservation of base pairing in RNA tertiary structures or avoidance of too many homopolymer stretches in which polymerases tend to stutter.
Errors in duplication of genetic information can also result in the deletion or insertion of one or more nucleotides, collectively referred to as indels. When multiples of three nucleotides are inserted or deleted in coding regions, the reading frame remains intact and one or more amino acids are inserted or deleted. When one or two nucleotides are inserted or deleted, the reading frame is disturbed and the resulting gene generally codes for an entirely different protein, with different amino acids and a different length from the original protein. The consequence of this change depends on the position in the gene where the change took place. Insertions or deletions are therefore rare in coding regions, but rather frequent in non-coding regions. When occurring in coding regions, indels can occasionally change the reading frame of a gene and make another ORF of the same gene accessible. Such mutations can lead to acquisition of new gene functions. Having small genomes, viruses make extensive use of this possibility. They often encode several proteins from a single gene by using overlapping ORFs. Another type of mutation that can change reading frames or make accessible new reading frames is mutations in splicing patterns. Eukaryotic proteins are encoded by coding gene fragments called exons, which are separated from each other by introns. Removing the introns and joining the exons is called splicing and occurs in the nucleus at the pre-mRNA level through dedicated spliceosomes. Mutations in splicing patterns usually destroy the gene function, but can occasionally result in the acquisition of a new gene function. Viruses have used these mechanisms extensively. By alternative splicing, sometimes in combination with the use of different reading frames, viruses are able to encode multiple proteins by a single gene. For example, HIV is able to encode two additional regulatory proteins using part of the coding region of the env gene by alternative splicing and overlapping reading frames.
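The effect of indel length on the reading frame, as described earlier in this passage, can be made concrete with a toy Python sketch. The sequence and helper names are hypothetical and chosen only for illustration; the sketch splits a coding sequence into codons before and after a deletion.

```python
def codons(seq: str) -> list:
    """Split a sequence into complete codons, starting at the first base."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def delete(seq: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at index `start`."""
    return seq[:start] + seq[start + length:]


def indel_effect(length: int) -> str:
    """An indel keeps the reading frame only if its length is a multiple of 3."""
    return "in-frame" if length % 3 == 0 else "frameshift"


if __name__ == "__main__":
    gene = "ATGGCCAAGTAA"  # hypothetical toy ORF: ATG GCC AAG TAA
    print(codons(gene))
    # Deleting three bases removes one codon and leaves the rest untouched.
    print(indel_effect(3), codons(delete(gene, 3, 3)))
    # Deleting one base changes every codon downstream of the deletion.
    print(indel_effect(1), codons(delete(gene, 3, 1)))
```

In the single-base deletion the original stop codon is no longer read in frame, which is why frameshifts generally change both the amino acids and the length of the encoded protein.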
When parts of two different DNA strands are combined into a single strand, the genetic exchange is called recombination. Recombination has a major effect on the genetic make-up of organisms (see Chapter 15). The most common form of recombination happens in eukaryotes during meiosis, when recombination occurs between homologous chromosomes, shuffling the alleles for the next generation. Consequently, recombination contributes significantly to evolution of diploid organisms. More details on the process and consequences of recombination are provided in Chapter 15.
Another form of genetic exchange is lateral gene transfer, which is a relatively frequent event in bacteria. A dramatic example of this is the origin of eukaryotes arising from bacteria acquiring other bacterial genomes that evolved into organelles such as mitochondria or chloroplasts. The bacterial predecessor of mitochondria subsequently exchanged many genes with the “cellular” genome. Substantial parts of mammal genomes are “littered” with endogenous retroviral sequences, with the “fusion” capacity of some retroviral envelope genes at the origin of the placenta. Every retroviral infection results in lateral gene transfer, usually only in somatic cells.
Genetic variation can also be caused by gene duplication. Gene duplication results in genome enlargement and can involve a single gene, or large genome sections. They can be partial, involving only gene fragments, or complete, whereby entire genes, chromosomes (aneuploidy) or entire genomes (polyploidy) are duplicated. Genes experiencing partial duplication, such as domain duplication, can potentially have a greatly altered function. An entirely duplicated gene can evolve independently. After a long history of independent evolution, duplicated genes can eventually acquire a new function. Duplication events have played a major role in the evolution of species. For example, complex body plans were possible due to separate evolution of duplications of the homeobox genes (Carroll, 1995), and especially in plants, new species are frequently the result of polyploidy.
Fig. 1.2 Loss or fixation of an allele in a population.
1.2 Population dynamics
Mutations in a gene that are passed on to the offspring and that coexist with the original gene result in polymorphisms. At a polymorphic site, two or more variants of a gene circulate in the population simultaneously. Population geneticists typically study the dynamics of the frequency of these polymorphic sites over time. The location in the genome where two or more variants coexist is called the locus. The different variants for a particular locus are called alleles. Virus genomes, in particular, are very flexible to genetic changes; RNA viruses can contain many polymorphic sites in a single population. HIV, for example, does not exist in a single host as a single genomic sequence, but consists of a continuously changing swarm of variants sometimes referred to as a quasispecies (Eigen & Biebricher, 1988; Domingo et al., 2006). Although this has become a standard term for virologists, the quasispecies theory has specific mathematical formulations and to what extent virus populations comply with these is the subject of great debate. The high genetic diversity mainly results from the rapid and error prone replication of RNA viruses.
Diploid organisms always carry two alleles. When both alleles are identical, the organism is homozygous at that locus; when the organism carries two different alleles, it is heterozygous at that locus. Heterozygous positions are polymorphic. Evolution is always a result of changes in allele frequencies, also called gene frequencies, whereby some alleles are lost over time, while other alleles sometimes increase their frequency to 100%, that is, they become fixed in the population (Fig. 1.2). The rate at which this occurs is called the fixation rate. The long-term evolution of a species results from the successive fixation of particular alleles, which reflects fixation of mutations. Terms like fixation rate, mutation rate, substitution rate and evolutionary rate have been used interchangeably by several authors, but they can refer to markedly different processes. This is particularly so for mutation rate, which should preferably be reserved for the rate at which mutations arise at the DNA level, usually expressed as the number of nucleotide (or amino acid) changes per site per replication cycle. Fixation rate, substitution rate, and the rate of molecular evolution are all equivalent when applied to sequences representing different species or populations, in which case they represent the number of new mutations per unit time that become fixed in a species or population. However, when applied to sequences representing different individuals within a population, the interpretation of these terms is subtly altered, because not all observed mutational differences among individuals (polymorphisms) will eventually become fixed in the population. In these cases, fixation rates are not appropriate, but substitution rate or the rate of molecular evolution can still be used to represent the rate at which individuals accrue genetic differences to each other over time (under the selective regime acting on this population). To summarize this from a phylogenetic perspective, the differences in nucleotide or amino acid sequences between taxa are generally called substitutions (although recently generated mutations can be present on terminal branches of trees). If these taxa represent different species or populations, the substitutions will be equivalent to fixation events. If the taxa represent different individuals within a population, branch lengths measure the genetic differences that accrue within individuals, which are not, but ultimately may lead to, fixation events.
The rate at which populations genetically diverge over time is dependent on the underlying mutation rate, the generation time (the time separating two generations), and on evolutionary forces, such as the fitness of the organism carrying the allele or variant, positive and negative selective pressure, population size, genetic drift, reproductive potential, and competition of alleles. If a particular allele is more fit than others in a particular environment, it will be subject to positive selective pressure; if it is less fit, it will be subject to negative selective pressure. An allele can confer a lower fitness to the homozygous organism, while heterozygosity of both alleles at this locus can be an advantage. In this case, polymorphism is advantageous and will be maintained; this is called balancing selection (the heterozygote is more fit than either homozygote). For example, humans who carry the hemoglobin S allele on both chromosomes suffer from sickle-cell anaemia. However, this allele is maintained in the human population because heterozygotes are, to some extent, protected against malaria (Allison, 1956). Fitness of a variant is always the result of a particular phenotype of the organism; therefore, in coding regions, selective pressure always acts on mutations that alter function or stability of a gene or the amino acid sequence encoded by the gene. Synonymous mutations could at first sight be expected to be neutral since they do not result in amino acid changes. However, this is not always true. For example, synonymous changes can alter
Fig. 1.3 Population dynamics of alleles. Each different symbol represents a different allele. A mutation event in the sixth generation gives rise to a new allele. The figure illustrates fixation and loss of alleles during a bottleneck event, and the concept of coalescence time (tracking back the time to the most recent common ancestor of the grey individuals). N: population size.
RNA secondary structure and influence RNA stability; also, they can result in the usage
of a different tRNA that may be less abundant. Still, most synonymous mutations can be considered selectively neutral.
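The balancing selection described above can be illustrated numerically. The following is a minimal sketch, assuming a standard one-locus, two-allele model with random mating and constant genotype fitnesses; the text does not specify a model, so this choice is an assumption. The selection coefficients `s` and `t` are hypothetical illustration values, not estimates for the hemoglobin S allele.

```python
def next_frequency(q, s, t):
    """Frequency of allele S after one generation of selection.

    Assumed genotype fitnesses (hypothetical model, not from the text):
    AA = 1 - s, AS = 1 (fittest), SS = 1 - t, with random mating.
    """
    p = 1.0 - q
    w_aa, w_as, w_ss = 1.0 - s, 1.0, 1.0 - t
    mean_fitness = p * p * w_aa + 2 * p * q * w_as + q * q * w_ss
    # S alleles are passed on by AS heterozygotes and SS homozygotes.
    return (p * q * w_as + q * q * w_ss) / mean_fitness


def iterate(q0, s, t, generations):
    """Apply the selection recursion repeatedly, starting from frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s, t)
    return q


def equilibrium(s, t):
    """Stable polymorphic equilibrium of S under heterozygote advantage."""
    return s / (s + t)


if __name__ == "__main__":
    s, t = 0.1, 0.8  # hypothetical selection coefficients
    for q0 in (0.01, 0.5, 0.9):
        print(q0, iterate(q0, s, t, 500), equilibrium(s, t))
```

Under these assumptions the allele frequency approaches the same intermediate value from any starting point between 0 and 1, so neither allele is lost even though SS homozygotes have low fitness.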
The rate at which a mutation becomes fixed through deterministic or stochastic forces depends on the effective population size (Ne) of the organism. This can be defined as the size of an idealized population that is randomly mating and that has the same gene frequency changes as the population being studied (the “census” population). The effective population size is smaller than the overall population size (N) when a substantial proportion of a population is producing no offspring, when there is inbreeding, in cases of population subdivision, and when selection operates on linked viral mutations. The effective population size is a major determinant of the dynamics of the allele frequencies over time. When the (effective) population size varies over multiple generations, the rates of evolution are notably influenced by generations with the smallest effective population sizes. This may be particularly true if population sizes are greatly reduced due to catastrophes, or during migrations, etc. (Fig. 1.3). Such events can significantly affect genetic diversity and are called genetic bottlenecks. Two individual lineages merging into
a single ancestor as we go back in time is referred to as a coalescent event. In general, the most recent common ancestor of the extant generation never traces back to the first generation of a population. In Fig. 1.3, all individuals of the seventh generation have one common ancestor in the fourth generation (tracing back the gray individuals); the time back to this ancestor is called the coalescence time of the extant individuals.
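The remark above that generations with the smallest population sizes dominate the long-term dynamics can be made concrete. A common approximation, not stated in the text and used here as an assumption, is that the long-term effective population size under fluctuating size is the harmonic mean of the per-generation sizes. The sketch below compares it with the arithmetic mean for a hypothetical series containing one bottleneck generation.

```python
def harmonic_mean_ne(sizes):
    """Harmonic mean of per-generation population sizes.

    Used here as an approximation of the long-term effective population
    size when the size fluctuates between generations (an assumption of
    this sketch, not a formula given in the text).
    """
    if not sizes or any(n <= 0 for n in sizes):
        raise ValueError("sizes must be a non-empty list of positive numbers")
    return len(sizes) / sum(1.0 / n for n in sizes)


def arithmetic_mean(sizes):
    """Plain average of the per-generation sizes, for comparison."""
    return sum(sizes) / len(sizes)


if __name__ == "__main__":
    # Hypothetical series: a single bottleneck of 10 individuals.
    sizes = [1000, 1000, 10, 1000, 1000]
    print("arithmetic mean:", arithmetic_mean(sizes))
    print("harmonic mean:  ", harmonic_mean_ne(sizes))
```

A single generation of 10 individuals pulls the harmonic mean far below the arithmetic mean, which is why a brief bottleneck can affect genetic diversity for a long time.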
An entirely deterministic evolutionary pattern would require that changes in allele or gene frequencies depend solely on the reproductive fitness of the variants in a particular environment and on the environmental conditions. In such a situation the gene frequencies can be predicted if the fitness and environmental conditions are known. In deterministic evolution, changes other than environmental conditions, such as chance events, do not influence allele/gene frequencies. This can only hold true if the effective population size is infinitely large. Natural selection, the effect of positive and negative selective pressure, accounts entirely for the changes in frequencies. When random fluctuations determine in part the allele frequencies, allele/gene frequencies cannot be predicted exactly. In such a stochastic model, one can only determine the probability of frequencies in the next generation. These probabilities still depend on the reproductive fitness of the variants in a particular environment and on the environmental conditions. However, chance events also play a role in populations of limited size. Consequently, only statistical statements about allele/gene frequencies can be made. Random genetic drift, therefore, contributes significantly to changes in frequencies under a stochastic model. The smaller the effective population size, the larger the effect of chance events and the more the fixation rate is determined by genetic drift rather than by selective pressure.
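The contrast between deterministic and stochastic change in allele frequencies can be sketched with a small simulation. The code below assumes a haploid Wright-Fisher model with non-overlapping generations and a fixed number of gene copies; this model, the function names, and the parameter values are illustrative assumptions, not something the text prescribes. Selection shifts the expected frequency deterministically, and binomial sampling of the next generation adds the chance component.

```python
import random


def wright_fisher(n_copies, p0, s=0.0, max_generations=100000, rng=None):
    """Follow one allele until it is lost or fixed.

    n_copies: number of gene copies in the population (constant size).
    p0: starting frequency of the allele.
    s: selective advantage of the allele (0 means neutral).
    Returns (final_frequency, generations_elapsed).
    """
    if rng is None:
        rng = random.Random()
    count = round(p0 * n_copies)
    generation = 0
    while 0 < count < n_copies and generation < max_generations:
        p = count / n_copies
        # Deterministic part: selection changes the expected frequency.
        p_selected = p * (1.0 + s) / (1.0 + s * p)
        # Stochastic part: each copy of the next generation is drawn at random.
        count = sum(rng.random() < p_selected for _ in range(n_copies))
        generation += 1
    return count / n_copies, generation


def fixation_fraction(n_copies, p0, s, replicates, seed=1):
    """Fraction of replicate populations in which the allele becomes fixed."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        final_frequency, _ = wright_fisher(n_copies, p0, s, rng=rng)
        if final_frequency == 1.0:
            fixed += 1
    return fixed / replicates


if __name__ == "__main__":
    for n_copies in (20, 200):
        print(n_copies, fixation_fraction(n_copies, 0.5, 0.05, 200))
```

In this sketch a neutral allele is fixed in roughly a fraction `p0` of the replicates whatever the population size, whereas the outcome for a selected allele depends on how large `n_copies` is relative to `s`.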
Evolution is never entirely deterministic or entirely stochastic. Depending on the interplay of effective population size and the distribution of selection coefficients, the evolution of allele/gene frequencies is more affected by either natural selection or genetic drift. Genetic mutations are always random, but they can sometimes result in an adaptive advantage. In this case, positive selective pressure will increase the frequency of the advantageous mutation, eventually leading to fixation after fewer generations than expected for a neutral change, provided the effective population size is large enough. A mutation under negative selective pressure can become fixed due to random genetic drift when it is not entirely deleterious, but this generally requires more generations than expected for a neutral change. Non-synonymous mutations result in a phenotypic change of an organism, and are subject to selective pressure if they change the interaction of that organism with its environment. As explained above, synonymous mutations are usually neutral and therefore become fixed due to genetic drift. The effect of positive and negative selective pressure can be investigated by comparing the synonymous and non-synonymous substitution rate (see also Chapters 13 and 14).
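How selection and population size together determine the chance of fixation can be sketched with Kimura's diffusion approximation for a new mutation in a diploid population. This formula is not given in the text; it is brought in here as an assumption, with `s` taken as the selective advantage of the heterozygote under additive (genic) selection and the mutation starting as a single copy among `2N` gene copies.

```python
import math


def fixation_probability(s, n_e, n=None):
    """Approximate fixation probability of a single new mutation.

    s: selection coefficient (positive = advantageous, negative = deleterious).
    n_e: effective population size.
    n: census population size (defaults to n_e).
    Uses Kimura's diffusion approximation, an assumption of this sketch.
    Very large values of 4 * n_e * |s| with negative s can overflow.
    """
    if n is None:
        n = n_e
    if s == 0:
        # Neutral case: every one of the 2N copies is equally likely to fix.
        return 1.0 / (2.0 * n)
    # expm1 keeps precision when the exponents are close to zero.
    return math.expm1(-2.0 * n_e * s / n) / math.expm1(-4.0 * n_e * s)


if __name__ == "__main__":
    for n_e in (10, 1000):
        for s in (-0.01, 0.0, 0.01):
            print(n_e, s, fixation_probability(s, n_e))
```

With these hypothetical values, a slightly deleterious mutation keeps a fixation probability close to the neutral one in the small population, but has almost no chance of fixation in the large one, which matches the qualitative statement that drift matters more as the effective population size shrinks.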