
The Phylogenetic Handbook pptx


DOCUMENT INFORMATION

Basic information

Title: The Phylogenetic Handbook, Second Edition
Authors: Philippe Lemey, Marco Salemi, Anne-Mieke Vandamme
Institution: Katholieke Universiteit Leuven
Field: Phylogenetics and Evolutionary Biology
Genre: practical handbook for phylogenetic analysis
Publication year: 2023
City: Leuven
Pages: 751
File size: 6.55 MB


The Phylogenetic Handbook

Second Edition

The Phylogenetic Handbook provides a comprehensive introduction to theory and practice of nucleotide and protein phylogenetic analysis. This second edition includes seven new chapters, covering topics such as Bayesian inference, tree topology testing, and the impact of recombination on phylogenies. The book has a stronger focus on hypothesis testing than the previous edition, with more extensive discussions on recombination analysis, detecting molecular adaptation and genealogy-based population genetics. Many chapters include elaborate practical sections, which have been updated to introduce the reader to the most recent versions of sequence analysis and phylogeny software, including Blast, FastA, Clustal, T-coffee, Muscle, Dambe, Tree-Puzzle, Phylip, Mega4, Paup*, Iqpnni, Consel, ModelTest, ProtTest, Paml, HyPhy, MrBayes, Beast, Lamarc, SplitsTree, and Rdp3. Many analysis tools are described by their original authors, resulting in clear explanations that constitute an ideal teaching guide for advanced-level undergraduate and graduate students.

Philippe Lemey is an FWO postdoctoral researcher at the Rega Institute, Katholieke Universiteit Leuven, Belgium, where he completed his Ph.D. in Medical Sciences. He has been an EMBO Fellow and a Marie-Curie Fellow in the Evolutionary Biology Group at the Department of Zoology, University of Oxford. His research focuses on molecular evolution of viruses by integrating molecular biology and computational approaches.

Marco Salemi is Assistant Professor at the Department of Pathology, Immunology and Laboratory Medicine of the University of Florida School of Medicine, Gainesville, USA. His research interests include molecular epidemiology, intra-host virus evolution, and the application of phylogenetic and population genetic methods to the study of human and simian pathogenic viruses.

Anne-Mieke Vandamme is a Full Professor in the Medical Faculty at the Katholieke Universiteit, Belgium, working in the field of clinical and epidemiological virology. Her laboratory investigates treatment responses in HIV-infected patients and is respected for its scientific and clinical contributions to virus–drug resistance. Her laboratory also studies the evolution and molecular epidemiology of human viruses such as HIV and HTLV.

The Phylogenetic Handbook

A Practical Approach to Phylogenetic Analysis and Hypothesis Testing

CAMBRIDGE UNIVERSITY PRESS

Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Dubai, Tokyo

Cambridge University Press

The Edinburgh Building, Cambridge CB2 8RU, UK

First published in print format

Information on this title: www.cambridge.org/9780521877107

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org

Paperback, eBook (NetLibrary), Hardback

Contents

Section II: Data preparation

2.2.3 Specialized sequence databases, reference databases, and

2.3 Composite databases, database mirroring, and search tools

2.3.3 Some general considerations about database searching

Marc Van Ranst and Philippe Lemey

3.9 Nucleotide sequences vs. amino acid sequences

Des Higgins and Philippe Lemey

3.11.2 Aligning the primate Trim5α amino acid sequences

3.14 Comparing alignments using the AltAVisT web tool

4.3 Number of mutations in a given time interval *(optional)

4.4 Nucleotide substitutions as a homogeneous Markov process

Marco Salemi

4.8 Observed vs. estimated genetic distances: the JC69 model

4.9 Kimura 2-parameters (K80) and F84 genetic distances

4.10.1 Modeling rate heterogeneity among sites

4.13 Choosing among different evolutionary models

Yves Van de Peer

5.2 Tree-inference methods based on genetic distances

5.5 Programs to display and manipulate phylogenetic trees

5.6 Distance-based phylogenetic inference in Phylip

5.7 Inferring a Neighbor-Joining tree for the primates data set

5.8 Inferring a Fitch–Margoliash tree for the mtDNA data set

5.10 Impact of genetic distances on tree topology: an example using

6.2.1 The simple case: maximum-likelihood tree for

6.3 Computing the probability of an alignment for a fixed tree

Heiko A. Schmidt and Arndt von Haeseler

6.9 An illustrative example of an ML tree reconstruction

6.9.2 Getting a tree with branch support values using

7.11.9 Summarizing samples of substitution model parameters

7.11.10 Summarizing samples of trees and branch lengths

8 Phylogeny inference based on parsimony and other methods

David L. Swofford and Jack Sullivan

8.3.1 Calculating the length of a given tree under the parsimony

David L. Swofford and Jack Sullivan

8.5 Analyzing data with Paup* through the command-line interface

Fred R. Opperdoes

9.2.4 Nature of sequence divergence in proteins (the PAM unit)

Fred R. Opperdoes and Philippe Lemey

9.4 A phylogenetic analysis of the Leishmanial glyceraldehyde-3-phosphate dehydrogenase gene carried out via the

9.5 A phylogenetic analysis of trypanosomatid glyceraldehyde-3-phosphate dehydrogenase protein sequences using Bayesian

10.3 Hierarchical likelihood ratio tests (hLRTs)

11.3 Likelihood ratio test of the global molecular clock

Philippe Lemey and David Posada

Heiko A. Schmidt

12.2 Some definitions for distributions and testing

12.4 How to get the distribution of likelihood ratios

12.5.2 The original Kishino–Hasegawa (KH) test

12.9 Testing a set of trees with Tree-Puzzle and Consel

12.9.1 Testing and obtaining site-likelihood with

Section V: Molecular adaptation

13 Natural selection and adaptation of molecular sequences

Oliver G. Pybus and Beth Shapiro

14.6 Estimating branch-by-branch variation in rates

14.7.5 The importance of synonymous rate variation

14.8 Comparing rates at a site in different branches

Sergei L. Kosakovsky Pond, Art F. Y. Poon, and Simon D. W. Frost

14.13.1 Fitting a global model in the HyPhy GUI

14.13.2 Fitting a global model with a HyPhy

14.14 Estimating branch-by-branch variation in rates

14.14.1 Fitting a local codon model in HyPhy

14.14.2 Interclade variation in substitution rates

14.14.3 Comparing internal and terminal branches

14.15 Estimating site-by-site variation in rates

14.15.3 Single-likelihood ancestor counting (SLAC)

14.16 Estimating gene-by-gene variation in rates

14.16.1 Comparing selection in different populations

14.16.2 Comparing selection between different

Philippe Lemey and David Posada

15.3 Linkage disequilibrium, substitution patterns, and

15.6 Recombination analysis as a multifaceted discipline

15.6.2 Recombinant identification and breakpoint detection

15.8 Performance of recombination detection tools

16 Detecting and characterizing individual recombination events

Mika Salminen and Darren Martin

16.3 Theoretical basis for recombination detection methods

16.4 Identifying and characterizing actual recombination events

Mika Salminen and Darren Martin

16.6 Analyzing example sequences to detect and characterize individual

16.6.2 Exercise 2: Mapping recombination with Simplot

16.6.3 Exercise 3: Using the “groups” feature of Simplot

16.6.4 Exercise 4: Setting up Rdp3 to do an exploratory

18.3.1 Substitution models and rate models among sites

18.3.2 Rate models among branches, divergence time estimation,

Alexei J. Drummond and Andrew Rambaut

18.7.1 Translating the data in amino acid sequences

19 Lamarc: Estimating population genetic parameters

Mary K. Kuhner

19.2 Basis of the Metropolis–Hastings MCMC sampler

19.8.1 Converting data using the Lamarc file converter

20.2 Steel’s method: potential problem, limitation, and

20.3 Xia’s method: its problem, limitation, and implementation

Xuhua Xia and Philippe Lemey

21 Split networks. A tool for exploring complex evolutionary

Vincent Moulton and Katharina T. Huber

21.1 Understanding evolutionary relationships through networks

21.2 An introduction to split decomposition theory

Vincent Moulton and Katharina T. Huber

List of contributors

School of Computing Sciences
University of East Anglia
Norwich, UK

John P. Huelsenbeck
Department of Integrative Biology
University of California at Berkeley
3060 Valley Life Sciences Bldg
Berkeley, CA 94720-3140, USA

Sergei Kosakovsky Pond
Antiviral Research Center
University of California
150 W Washington St, Ste 100
San Diego, CA 92103, USA

Mary Kuhner
Department of Genome Sciences
University of Washington
Seattle (WA), USA

Philippe Lemey
Rega Institute for Medical Research
Katholieke Universiteit Leuven
Leuven, Belgium

Vincent Moulton
School of Computing Sciences
University of East Anglia
Norwich, UK

Fred R. Opperdoes
C. de Duve Institute of Cellular Pathology
Université Catholique de Louvain
Brussels, Belgium

Swedish Museum of Natural History
Box 50007, SE-104 05 Stockholm

Beth Shapiro
Department of Biology
The Pennsylvania State University
326 Mueller Lab
University Park, PA 16802, USA

Korbinian Strimmer
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
University of Leipzig
Germany

Florida, USA

Anne-Mieke Vandamme
Rega Institute for Medical Research
Katholieke Universiteit Leuven
Leuven, Belgium

Yves Van de Peer
VIB / Ghent University
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent, Belgium

Paul van der Mark
School of Computational Science
Florida State University
Tallahassee, FL 32306-4120, USA

Marc Van Ranst
Rega Institute for Medical Research
Katholieke Universiteit Leuven
Leuven, Belgium

Arndt von Haeseler
Center for Integrative Bioinformatics Vienna (CIBIV)
Max F. Perutz Laboratories (MFPL)
Dr Bohr Gasse 9
A-1030 Wien, Austria

Xuhua Xia
Biology Department
University of Ottawa
Ottawa, Ontario, Canada

Foreword

“It looked insanely complicated, and this was one of the reasons why the snug plastic cover it fitted into had the words DON’T PANIC printed on it in large friendly letters.”

Douglas Adams

The Hitch Hiker’s Guide to the Galaxy

As of February 2008 there were 85 759 586 764 bases in 82 853 685 sequences stored in GenBank (Nucleic Acids Research, Database issue, January 2008). Under any criteria, this is a staggering amount of data. Although these sequences come from a myriad of organisms, from viruses to humans, and include genes with a diverse range of functions, it can all, at least in principle, be studied from an evolutionary perspective. But how? If ever there was an invitation to panic, it is this. Enter The Phylogenetic Handbook, an invaluable guide to the phylogenetic universe.

The first edition of The Phylogenetic Handbook was published in 2003 and represented something of a landmark in evolutionary biology, as it was the first accessible, hands-on instruction manual for molecular phylogenetics, yet with a healthy dose of theory. Up until this point, the evolutionary analysis of gene sequence was often considered something of a black art. The Phylogenetic Handbook made it accessible to anyone with a desktop computer.

The new edition of The Phylogenetic Handbook moves the field along nicely and has a number of important intellectual and structural changes from the earlier edition. Such a revision is necessary to track the major changes in this rapidly evolving field, in terms of both the new theory and new methodologies available for the computational analysis of gene sequence evolution. The result is a fine balance between theory and practice. As with the First Edition, the chapters take us from the basic, but fundamental, tasks of database searching and sequence alignment, to the complexity of the coalescent. Similarly, all the chapters are written by acknowledged experts in the field, who work at the coal-face of developing new methods and using them to address fundamental biological questions. Most of the authors are also remarkably young, highlighting the dynamic nature of this discipline.

The biggest alteration from the First Edition is the restructuring into a series of sections, complete with both theory and practice chapters, with each designed to take the uninitiated through all the steps of evolutionary bioinformatics. There are also more chapters on a greater range of topics, so the new edition is satisfyingly comprehensive. Indeed, it almost stands alone as a textbook in modern population genetics. It is also pleasing to see a much stronger focus on hypothesis testing, which is a key aspect of modern phylogenetic analysis. Another welcome change is the inclusion of chapters describing Bayesian methods for both phylogenetic inference and revealing population dynamics, which fills a major gap in the literature, and highlights the current popularity of this form of statistical inference.

The Phylogenetic Handbook will calm the nerves of anyone charged with undertaking an evolutionary analysis of gene sequence data. My only suggestion for an improvement to the third edition is the words DON’T PANIC on the cover.

Edward C. Holmes

June 12, 2008

Preface

The idea for The Phylogenetic Handbook was conceived during an early edition of the Workshop on Virus Evolution and Molecular Epidemiology. The rationale was simple: to collect the information being taught in the workshop and turn it into a comprehensive, yet simply written textbook with a strong practical component. Marco and Annemie took up this challenge, and, with the help of many experts in the field, successfully produced the First Edition in 2003. The resulting text was an excellent primer for anyone taking their first computational steps into evolutionary biology, and, on a personal note, inspired me to try out many of the techniques introduced by the book in my own research. It was therefore a great pleasure to join in the collaboration for the Second Edition of The Phylogenetic Handbook.

Computational molecular biology is a fast-evolving field in which new techniques are constantly emerging. A book with a strong focus on the software side of phylogenetics will therefore rapidly grow a need for updating. In this Second Edition, we hope to have satisfied this need to a large extent. We also took the opportunity to provide a structure that groups different types of sequence analyses according to the evolutionary hypothesis they focus on. Evolutionary biology has matured into a fully quantitative discipline, with phylogenies themselves having evolved from classification tools to central models in quantifying underlying evolutionary and population genetic processes. Inspired by this, the Second Edition provides a broader coverage of techniques for testing models and trees, detecting recombination, the analysis of selective pressure and genealogy-based population genetics. Changing the subtitle to A Practical Approach to Phylogenetic Inference and Hypothesis Testing emphasizes this shift in focus. Thanks to novel contributions, we also hope to have addressed the need for a Bayesian treatment of phylogenetic inference, which started to gain a great deal of popularity at the time the content for the First Edition was already fixed.

Following the philosophy of the First Edition, the book includes many step-by-step software tutorials using example data sets. We have not used the same data sets throughout the complete Second Edition; not only is it difficult to find data sets that consistently meet the assumptions or reveal interesting aspects of all the methods described, but we also feel that being confronted with different data with their own characteristics adds educational value. These data sets can be retrieved from www.thephylogenetichandbook.org, where other useful links listed in the book can also be found. Furthermore, a glossary has been compiled with important terms that are indicated in italics and boldface throughout the book.

We are very grateful to the researchers who took the time to contribute to this edition, either by updating a chapter or writing a novel contribution. I hope that my persistent pestering has not affected any of these friendships. We would like to thank Eddie Holmes in particular for writing the Foreword to the book. It has been a pleasure to work with Katrina Halliday and Alison Evans of Cambridge University Press. We also wish to thank those who supported our research and the work on this book: the Flemish “Fonds voor Wetenschappelijk Onderzoek”, EMBO and Marie Curie funding. Finally, we would like to express our thanks to colleagues, family and friends onto whom we undoubtedly projected some of the pressure in completing this book.

Philippe Lemey

Section I: Introduction

1 Basic concepts of molecular evolution

Anne-Mieke Vandamme

The genome, carrier of this genetic information, is in most organisms deoxyribonucleic acid (DNA), whereas some viruses have a ribonucleic acid (RNA) genome. Part of the genetic information in DNA is transcribed into RNA: either mRNA, which acts as a template for protein synthesis; rRNA, which together with ribosomal proteins constitutes the protein translation machinery; tRNA, which offers the encoded amino acid; or small RNAs, some of which are involved in regulating expression of genes. The genomic DNA also contains elements, such as promoters and enhancers, which orchestrate the proper transcription into RNA. A large part of the genomic DNA of eukaryotes consists of genetic elements such as introns or Alu-repeats, the function of which is still not entirely clear. Proteins, RNA, and to some extent DNA, constitute the phenotype of an organism that interacts with the environment.

DNA is a double helix with two antiparallel polynucleotide strands, whereas RNA is a single-stranded polynucleotide. The backbone in each DNA strand consists of deoxyriboses with a phosphodiester linking each 5′ carbon with the 3′ carbon of the next sugar. In RNA the sugar moiety is ribose. On each sugar, one of the four following bases is linked to the 1′ carbon in DNA: the purines, adenine (A) or guanine (G), or the pyrimidines, thymine (T) or cytosine (C); in RNA, thymine is replaced by uracil (U). Hydrogen bonds and base stacking result in the two DNA strands binding together, with strong (triple) bonds between G and C, and weak (double) bonds between T/U and A (Fig. 1.1). These hydrogen-bonded pairs are called complementary. During DNA duplication or RNA transcription, DNA or RNA polymerases synthesize a complementary 5′–3′ strand starting with the lower 3′–5′ DNA strand as template, in order to preserve the genetic information. This genetic information is represented by a one-letter code, indicating the 5′–3′ sequential order of the bases in the DNA or RNA (Fig. 1.1). A nucleotide sequence is thus represented by a contiguous stretch of the four letters A, G, C, and T/U.

The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (eds.). Published by Cambridge University Press. © Cambridge University Press 2009.

Fig. 1.1 Chemical structure of double-stranded DNA. The chemical moieties are indicated as follows: dR, deoxyribose; P, phosphate; G, guanine; T, thymine; A, adenine; and C, cytosine. The strand orientation is represented in a standard way: in the upper strand 5′–3′, indicating that the chain starts at the 5′ carbon of the first dR, and ends at the 3′ carbon of the last dR. The one-letter code of the corresponding genetic information is given on top, and only takes into account the 5′–3′ upper strand. (Courtesy of Professor C. Pannecouque.)
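The base-pairing and strand-orientation rules above can be illustrated with a short Python sketch. It is not taken from the book: the function names and the example sequence are hypothetical, and only unambiguous A, C, G, T letters are assumed.

```python
# Watson-Crick pairing as described in the text: A pairs with T, G pairs with C.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary DNA strand, written in its own 5'->3' direction."""
    # Complement every base, then reverse, because the two strands are antiparallel.
    return seq.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(template: str) -> str:
    """Return the RNA (5'->3') made from a DNA template strand given 5'->3'."""
    # The polymerase builds the complement of the template; in RNA, U replaces T.
    return reverse_complement(template).replace("T", "U")


if __name__ == "__main__":
    upper = "GATTC"                      # hypothetical upper strand, 5'->3'
    lower = reverse_complement(upper)    # lower strand, also written 5'->3'
    print(upper, lower, transcribe(lower))
```

Writing both strands 5′–3′ is a convention choice; it keeps every sequence in the same one-letter notation the text describes.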

encoded amino acids by a three- or one-letter abbreviation (Table 1.1). An amino acid sequence is generally represented by a contiguous stretch of one-letter amino acid abbreviations (with 20 possible letters).

The genetic code is universal for all organisms, with only a few exceptions such as the mitochondrial code, and it is usually represented as an RNA code because the RNA is the direct template for protein synthesis (Table 1.2). The corresponding DNA code can be easily reconstructed by replacing the U by a T. Each position of the triplet code can be one of four bases; hence, 4³ or 64 possible triplets encode 20 amino acids (61 sense codes) and 3 stop codons (3 non-sense codes). The genetic code is said to be degenerate, or redundant, since all amino acids except methionine and tryptophan have more than one possible triplet code. The first codon for methionine downstream (or 3′) of the ribosome entry site also acts as the start codon for the translation of a protein.

Table 1.2

          U            C            A            G
U      UUU Phe      UCU Ser      UAU Tyr      UGU Cys      U
       UUC Phe      UCC Ser      UAC Tyr      UGC Cys      C
       UUA Leu      UCA Ser      UAA STOP     UGA STOP     A
       UUG Leu      UCG Ser      UAG STOP     UGG Trp      G
C      CUU Leu      CCU Pro      CAU His      CGU Arg      U
       CUC Leu      CCC Pro      CAC His      CGC Arg      C
       CUA Leu      CCA Pro      CAA Gln      CGA Arg      A
       CUG Leu      CCG Pro      CAG Gln      CGG Arg      G
A      AUU Ile      ACU Thr      AAU Asn      AGU Ser      U
       AUC Ile      ACC Thr      AAC Asn      AGC Ser      C
       AUA Ile      ACA Thr      AAA Lys      AGA Arg      A
       AUG Met      ACG Thr      AAG Lys      AGG Arg      G
G      GUU Val      GCU Ala      GAU Asp      GGU Gly      U
       GUC Val      GCC Ala      GAC Asp      GGC Gly      C
       GUA Val      GCA Ala      GAA Glu      GGA Gly      A
       GUG Val      GCG Ala      GAG Glu      GGG Gly      G

The first nucleotide letter is indicated on the left, the second on the top, and the third on the right side. The amino acids are given by their three-letter code (see Table 1.1). Three stop codons are indicated.

As a result of the triplet code, each contiguous nucleotide stretch has three reading frames in the 5′–3′ direction. The complementary strand encodes another three reading frames. A reading frame that is able to encode a protein starts with a codon for methionine, and ends with a stop codon. These reading frames are called open reading frames or ORFs.
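As a rough illustration of the triplet code and the six reading frames, the following Python sketch builds the standard code from a 64-letter string (ordered U, C, A, G at each codon position) and scans both strands for ORFs. The function names are hypothetical, and the ORF search is deliberately simple: it reports the stretch from the first AUG to the next in-frame stop, then resumes after that stop.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna: str) -> str:
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def reverse_complement(rna: str) -> str:
    """Complementary RNA strand, written 5'->3'."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def find_orfs(rna: str) -> list[str]:
    """Protein sequences of ORFs (Met ... stop) found in the six reading frames."""
    orfs = []
    for strand in (rna, reverse_complement(rna)):
        for frame in range(3):
            protein = translate(strand[frame:])
            start = protein.find("M")
            while start != -1:
                stop = protein.find("*", start)
                if stop == -1:          # no stop codon: the frame stays open-ended
                    break
                orfs.append(protein[start:stop])
                start = protein.find("M", stop + 1)
    return orfs


if __name__ == "__main__":
    print(len(CODON_TABLE), AMINO.count("*"))   # 64 triplets, 3 of them stops
    print(find_orfs("AUGGCCUAA"))               # hypothetical toy sequence
```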

During duplication of the genetic information, the DNA or RNA polymerase can occasionally incorporate a non-complementary nucleotide. In addition, bases in a DNA strand can be chemically modified due to environmental factors such as UV light or chemical substances. These modified bases can potentially interfere with the synthesis of the complementary strand and thereby also result in a nucleotide incorporation that is not complementary to the original nucleotide. When these changes escape the cellular repair mechanisms, the genetic information is altered, resulting in what is called a point mutation. The genetic code has evolved in such a way that a point mutation at the third codon position rarely results in an amino acid change (only in 30% of possible changes). A change at the second codon position always, and at the first codon position mostly (96%), results in an amino acid change. Mutations that do not result in amino acid changes are called silent or synonymous mutations. When a mutation results in the incorporation of a different amino acid, it is called non-silent or non-synonymous. A site within a coding triplet is said to be fourfold degenerate when all possible changes at that site are synonymous (for example, “CUN”); twofold degenerate when only two different amino acids are encoded by the four possible nucleotides at that position (for example, “UUN”); and non-degenerate when all possible changes alter the encoded amino acid (for example, “NUU”).
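The degeneracy classes and the per-position percentages can be checked with a small Python sketch, assuming the standard genetic code. The function names are hypothetical, and the counting convention (all single-base changes starting from the 61 sense codons, with a change to a stop codon counted as a change) is an assumption rather than the book's stated method.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def amino_acids_at_site(codon: str, pos: int) -> set[str]:
    """Amino acids encoded when the base at pos (0, 1 or 2) takes all four values."""
    return {CODON_TABLE[codon[:pos] + b + codon[pos + 1:]] for b in BASES}


def synonymous_changes(codon: str, pos: int) -> int:
    """How many of the three possible changes at pos keep the amino acid."""
    aa = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:pos] + b + codon[pos + 1:]] == aa
        for b in BASES
        if b != codon[pos]
    )


def fraction_changing(pos: int) -> float:
    """Share of single-base changes at pos that alter the encoded amino acid."""
    sense = [c for c, aa in CODON_TABLE.items() if aa != "*"]
    changed = sum(3 - synonymous_changes(c, pos) for c in sense)
    return changed / (3 * len(sense))


if __name__ == "__main__":
    print(synonymous_changes("CUU", 2))    # third site of CUN: fourfold degenerate
    print(amino_acids_at_site("UUU", 2))   # third site of UUN: two amino acids
    print(synonymous_changes("UUU", 0))    # first site of NUU: non-degenerate
    for pos in range(3):
        print(pos + 1, round(fraction_changing(pos), 3))
```

Under this counting convention the fractions should come out close to 96%, 100%, and 31% for the first, second, and third codon positions, in line with the figures quoted in the text.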

Incorporation errors replacing a purine (A, G) with a purine and a pyrimidine (C, T) with a pyrimidine occur more easily because of chemical and steric reasons. The resulting mutations are called transitions. Transversions, purine to pyrimidine changes and the reverse, are less likely. When resulting in an amino acid change, transversions usually have a larger impact on the protein than transitions, because of the more drastic changes in biochemical properties of the encoded amino acid. Counting each direction separately, there are four possible transition errors (A ↔ G, C ↔ T), and eight possible transversion errors (A ↔ C, A ↔ T, G ↔ C, G ↔ T); therefore, if a mutation occurred randomly, a transversion would be two times more likely than a transition. However, the genetic code has evolved in such a way that, in many genes, the less disruptive transitions are more likely to occur than transversions.
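The transition/transversion distinction can be made concrete with a minimal Python sketch. The names `classify` and `count_ts_tv` are illustrative only; positions containing gaps or ambiguity codes are simply skipped, which is a simplifying assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify(a: str, b: str) -> str:
    """Classify a change between two unambiguous bases."""
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1: str, seq2: str) -> tuple[int, int]:
    """Count transitions and transversions between two aligned sequences."""
    valid = PURINES | PYRIMIDINES
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in valid or b not in valid:
            continue                     # skip gaps and ambiguity codes
        kind = classify(a, b)
        ts += kind == "transition"
        tv += kind == "transversion"
    return ts, tv


if __name__ == "__main__":
    # Enumerate the 12 directed changes among A, C, G, T.
    pairs = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
    kinds = [classify(a, b) for a, b in pairs]
    print(kinds.count("transition"), kinds.count("transversion"))
    print(count_ts_tv("AACGT", "GATTA"))   # hypothetical aligned pair
```

Enumerating the directed changes reproduces the four-versus-eight count behind the "two times more likely" argument for random mutation.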

Single nucleotide changes in a particular codon often change the amino acid to one with similar properties (e.g. hydrophobic), such that the tertiary structure of the encoded protein is not altered dramatically. Living organisms can therefore tolerate a limited number of nucleotide point mutations in their coding regions. Point mutations in non-coding regions are subject to other constraints, such as conservation of binding places for proteins, conservation of base pairing in RNA tertiary structures or avoidance of too many homopolymer stretches in which polymerases tend to stutter.

Errors in duplication of genetic information can also result in the deletion or insertion of one or more nucleotides, collectively referred to as indels. When multiples of three nucleotides are inserted or deleted in coding regions, the reading frame remains intact and one or more amino acids are inserted or deleted. When one or two nucleotides are inserted or deleted, the reading frame is disturbed and the resulting gene generally codes for an entirely different protein, with different amino acids and a different length from the original protein. The consequence of this change depends on the position in the gene where the change took place. Insertions or deletions are therefore rare in coding regions, but rather frequent in non-coding regions. When occurring in coding regions, indels can occasionally change the reading frame of a gene and make another ORF of the same gene accessible. Such mutations can lead to acquisition of new gene functions. Having small genomes, viruses make extensive use of this possibility. They often encode several proteins from a single gene by using overlapping ORFs. Mutations in splicing patterns are another type of mutation that can change reading frames or make new reading frames accessible. Eukaryotic proteins are encoded by coding gene fragments called exons, which are separated from each other by introns. Removing the introns and joining the exons is called splicing and occurs in the nucleus at the pre-mRNA level through dedicated spliceosomes. Mutations in splicing patterns usually destroy the gene function, but can occasionally result in the acquisition of a new gene function. Viruses have used these mechanisms extensively. By alternative splicing, sometimes in combination with the use of different reading frames, viruses are able to encode multiple proteins by a single gene. For example, HIV is able to encode two additional regulatory proteins using part of the coding region of the env gene by alternative splicing and overlapping reading frames.
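The difference between an in-frame indel and a frameshift, described above, can be seen in a toy Python example. The sequence and helper names are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna: str) -> str:
    """Translate codon by codon from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def insert(rna: str, pos: int, fragment: str) -> str:
    """Return rna with fragment inserted before index pos."""
    return rna[:pos] + fragment + rna[pos:]


if __name__ == "__main__":
    original = "AUGGCUGAAUUUUGA"                  # hypothetical toy ORF
    print(translate(original))                    # unmodified protein
    print(translate(insert(original, 3, "GGG")))  # three bases: frame preserved
    print(translate(insert(original, 3, "G")))    # one base: frame shifted
```

In this toy case the three-base insertion adds a single amino acid and leaves the rest of the protein intact, whereas the one-base insertion changes every downstream codon and introduces a premature stop.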

When parts of two different DNA strands are combined into a single strand, the genetic exchange is called recombination. Recombination has a major effect on the genetic make-up of organisms (see Chapter 15). The most common form of recombination happens in eukaryotes during meiosis, when recombination occurs between homologous chromosomes, shuffling the alleles for the next generation. Consequently, recombination contributes significantly to evolution of diploid organisms. More details on the process and consequences of recombination are provided in Chapter 15.

Another form of genetic exchange is lateral gene transfer, which is a relatively frequent event in bacteria. A dramatic example of this is the origin of eukaryotes arising from bacteria acquiring other bacterial genomes that evolved into organelles such as mitochondria or chloroplasts. The bacterial predecessor of mitochondria subsequently exchanged many genes with the “cellular” genome. Substantial parts of mammal genomes are “littered” with endogenous retroviral sequences, with the “fusion” capacity of some retroviral envelope genes at the origin of the placenta. Every retroviral infection results in lateral gene transfer, usually only in somatic cells.

Genetic variation can also be caused by gene duplication. Gene duplication results in genome enlargement and can involve a single gene, or large genome sections. Duplications can be partial, involving only gene fragments, or complete, whereby entire genes, chromosomes (aneuploidy) or entire genomes (polyploidy) are duplicated. Genes experiencing partial duplication, such as domain duplication, can potentially have a greatly altered function. An entirely duplicated gene can evolve independently. After a long history of independent evolution, duplicated genes can eventually acquire a new function. Duplication events have played a major role in the evolution of species. For example, complex body plans were possible due to separate evolution of duplications of the homeobox genes (Carroll, 1995), and especially in plants, new species are frequently the result of polyploidy.

Fig. 1.2 Loss or fixation of an allele in a population.

1.2 Population dynamics

Mutations in a gene that are passed on to the offspring and that coexist with the original gene result in polymorphisms. At a polymorphic site, two or more variants of a gene circulate in the population simultaneously. Population geneticists typically study the dynamics of the frequency of these polymorphic sites over time. The location in the genome where two or more variants coexist is called the locus. The different variants for a particular locus are called alleles. Virus genomes, in particular, are very flexible to genetic changes; RNA viruses can contain many polymorphic sites in a single population. HIV, for example, does not exist in a single host as a single genomic sequence, but consists of a continuously changing swarm of variants sometimes referred to as a quasispecies (Eigen & Biebricher, 1988; Domingo et al., 2006). Although this has become a standard term for virologists, the quasispecies theory has specific mathematical formulations and to what extent virus populations comply with these is the subject of great debate. The high genetic diversity mainly results from the rapid and error-prone replication of RNA viruses.
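As a small illustration of polymorphic sites and allele frequencies, the Python sketch below tabulates the variants found in each column of a set of aligned sequences sampled from one population. The alignment and function names are hypothetical, and the sequences are assumed to be gap-free and of equal length.

```python
from collections import Counter


def allele_frequencies(alignment: list[str]) -> list[dict[str, float]]:
    """Frequency of each variant at every column of an alignment."""
    n = len(alignment)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def polymorphic_sites(alignment: list[str]) -> list[int]:
    """Zero-based positions at which two or more variants coexist."""
    return [
        i for i, freqs in enumerate(allele_frequencies(alignment))
        if len(freqs) > 1
    ]


if __name__ == "__main__":
    sample = ["ACGT", "ACGA", "ATGA", "ACGA"]   # hypothetical population sample
    print(polymorphic_sites(sample))
    print(allele_frequencies(sample)[1])
```

Repeating such a tabulation on samples taken at different time points is one simple way to follow the frequency dynamics the text refers to.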

Diploid organisms always carry two alleles. When both alleles are identical, the organism is homozygous at that locus; when the organism carries two different alleles, it is heterozygous at that locus. Heterozygous positions are polymorphic. Evolution is always a result of changes in allele frequencies, also called gene frequencies, whereby some alleles are lost over time, while other alleles sometimes increase their frequency to 100%, that is, they become fixed in the population (Fig. 1.2). The rate at which this occurs is called the fixation rate. The long-term evolution of a species results from the successive fixation of particular alleles, which reflects fixation of mutations. Terms like fixation rate, mutation rate, substitution rate and evolutionary rate have been used interchangeably by several authors, but they

10 Anne-Mieke Vandamme

can refer to markedly different processes. This is particularly so for mutation rate, which should preferably be reserved for the rate at which mutations arise at the DNA level, usually expressed as the number of nucleotide (or amino acid) changes per site per replication cycle. Fixation rate, substitution rate, and the rate of molecular evolution are all equivalent when applied to sequences representing different species or populations, in which case they represent the number of new mutations per unit time that become fixed in a species or population. However, when applied to sequences representing different individuals within a population, the interpretation of these terms is subtly altered, because not all observed mutational differences among individuals (polymorphisms) will eventually become fixed in the population. In these cases, fixation rates are not appropriate, but substitution rate or the rate of molecular evolution can still be used to represent the rate at which individuals accrue genetic differences to each other over time (under the selective regime acting on this population). To summarize this from a phylogenetic perspective, the differences in nucleotide or amino acid sequences between taxa are generally called substitutions (although recently generated mutations can be present on terminal branches of trees). If these taxa represent different species or populations, the substitutions will be equivalent to fixation events. If the taxa represent different individuals within a population, branch lengths measure the genetic differences that accrue within individuals, which are not, but ultimately may lead to, fixation events.

The rate at which populations genetically diverge over time is dependent on the underlying mutation rate, the generation time (the time separating two generations), and on evolutionary forces, such as the fitness of the organism carrying the allele or variant, positive and negative selective pressure, population size, genetic drift, reproductive potential, and competition of alleles. If a particular allele is more fit than others in a particular environment, it will be subject to positive selective pressure; if it is less fit, it will be subject to negative selective pressure. An allele can confer a lower fitness to the homozygous organism, while heterozygosity of both alleles at this locus can be an advantage. In this case, polymorphism is advantageous and will be maintained; this is called balancing selection (the heterozygote is more fit than either homozygote). For example, humans who carry the hemoglobin S allele on both chromosomes suffer from sickle-cell anaemia. However, this allele is maintained in the human population because heterozygotes are, to some extent, protected against malaria (Allison, 1956). Fitness of a variant is always the result of a particular phenotype of the organism; therefore, in coding regions, selective pressure always acts on mutations that alter function or stability of a gene or the amino acid sequence encoded by the gene. Synonymous mutations could at first sight be expected to be neutral since they do not result in amino acid changes. However, this is not always true. For example, synonymous changes can alter

Fig. 1.3 Population dynamics of alleles. Each different symbol represents a different allele. A mutation event in the sixth generation gives rise to a new allele. The figure illustrates fixation and loss of alleles during a bottleneck event, and the concept of coalescence time (tracking back the time to the most recent common ancestor of the grey individuals). N: population size.

RNA secondary structure and influence RNA stability; also, they result in the usage of a different tRNA that may be less abundant. Still, most synonymous mutations can be considered selectively neutral.
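The balancing selection described in this paragraph can be illustrated with a standard deterministic one-locus model. The sketch below is not taken from the chapter: the relative fitnesses (1 - s for one homozygote, 1 for the heterozygote, 1 - t for the other homozygote), the function names, and the numerical values are assumptions chosen for illustration only. Under random mating in a very large population, the frequency q of the second allele approaches s / (s + t), so both alleles are maintained.

```python
def next_frequency(q, s, t):
    """One generation of selection with heterozygote advantage.

    q is the frequency of allele S; p = 1 - q is the frequency of allele A.
    Assumed relative fitnesses: AA = 1 - s, AS = 1, SS = 1 - t.
    """
    p = 1.0 - q
    w_aa, w_as, w_ss = 1.0 - s, 1.0, 1.0 - t
    # Mean fitness under Hardy-Weinberg genotype proportions.
    mean_w = p * p * w_aa + 2 * p * q * w_as + q * q * w_ss
    # Frequency of S among the gametes produced after selection.
    return (p * q * w_as + q * q * w_ss) / mean_w


def equilibrium_frequency(s, t):
    """Equilibrium frequency of allele S under heterozygote advantage."""
    return s / (s + t)


def iterate(q0, s, t, generations):
    """Apply the selection recursion for a number of generations."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s, t)
    return q


# Hypothetical values: a mild cost for AA (s) and a severe cost for SS (t).
s, t = 0.1, 0.8
print(iterate(0.01, s, t, 1000), equilibrium_frequency(s, t))
```

Starting from a rare or a common S allele, the recursion ends at the same intermediate frequency, which is the sense in which the polymorphism "will be maintained".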

The rate at which a mutation becomes fixed through deterministic or stochastic forces depends on the effective population size (Ne) of the organism. This can be defined as the size of an idealized population that is randomly mating and that has the same gene frequency changes as the population being studied (the “census” population). The effective population size is smaller than the overall population size (N), when a substantial proportion of a population is producing no offspring, when there is inbreeding, in cases of population subdivision, and when selection operates on linked viral mutations. The effective population size is a major determinant of the dynamics of the allele frequencies over time. When the (effective) population size varies over multiple generations, the rates of evolution are notably influenced by generations with the smallest effective population sizes. This may be particularly true if population sizes are greatly reduced due to catastrophes, or during migrations, etc. (Fig. 1.3). Such events can significantly affect genetic diversity and are called genetic bottlenecks. Two individual lineages merging into

a single ancestor as we go back in time is referred to as a coalescent event. In general, the most recent common ancestor of the extant generation does not trace back to the first generation of a population. In Fig. 1.3, all individuals of the seventh generation have one common ancestor in the fourth generation (tracing back the grey individuals); the time back to this ancestor is called the coalescence time of the extant individuals.
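The disproportionate influence of the smallest generations can be made concrete with the harmonic mean, a standard population-genetic approximation for the long-term effective size of a population whose size fluctuates between generations. This is a minimal sketch, not taken from the chapter; the function name and the example sizes are hypothetical.

```python
def harmonic_mean_ne(sizes):
    """Approximate long-term effective size as the harmonic mean of per-generation sizes."""
    if not sizes or any(n <= 0 for n in sizes):
        raise ValueError("sizes must be a non-empty sequence of positive numbers")
    return len(sizes) / sum(1.0 / n for n in sizes)


# Hypothetical population: nine generations of 1000 individuals
# and a single bottleneck generation of 10 individuals.
sizes = [1000] * 9 + [10]
arithmetic_mean = sum(sizes) / len(sizes)
long_term_ne = harmonic_mean_ne(sizes)
print(arithmetic_mean, round(long_term_ne, 1))
```

In this made-up example the arithmetic mean is 901, whereas the harmonic mean is roughly 92: one bottleneck generation out of ten lowers the long-term effective size by about an order of magnitude.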

An entirely deterministic evolutionary pattern would require that changes in allele or gene frequencies depend solely on the reproductive fitness of the variants in a particular environment and on the environmental conditions. In such a situation the gene frequencies can be predicted if the fitness and environmental conditions are known. In deterministic evolution, changes other than environmental conditions, such as chance events, do not influence allele/gene frequencies. This can only hold true if the effective population size is infinitely large. Natural selection, the effect of positive and negative selective pressure, accounts entirely for the changes in frequencies. When random fluctuations determine in part the allele frequencies, allele/gene frequencies cannot be predicted exactly. In such a stochastic model, one can only determine the probability of frequencies in the next generation. These probabilities still depend on the reproductive fitness of the variants in a particular environment and on the environmental conditions. However, chance events also play a role in populations of limited size. Consequently, only statistical statements about allele/gene frequencies can be made. Random genetic drift, therefore, contributes significantly to changes in frequencies under a stochastic model. The smaller the effective population size, the larger the effect of chance events and the more the fixation rate is determined by genetic drift rather than by selective pressure.
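The stochastic model described above can be sketched with the neutral Wright-Fisher model, a standard textbook idealization in which every gene copy of the next generation is drawn at random from the current gene pool. The code is an illustrative assumption rather than anything used in the chapter: the function names, population sizes, replicate counts, and seed are all hypothetical.

```python
import random


def wright_fisher(n_copies, start_count, rng, max_generations=100_000):
    """Simulate neutral drift of one allele until it is lost or fixed.

    n_copies: number of gene copies in the population (2N for diploids).
    start_count: initial number of copies of the tracked allele.
    Returns (final_count, generations_elapsed).
    """
    count = start_count
    for generation in range(max_generations):
        if count == 0 or count == n_copies:
            return count, generation
        freq = count / n_copies
        # Binomial sampling: each new copy carries the allele with probability freq.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
    return count, max_generations


def summarize(n_copies, start_count, replicates, seed=1):
    """Return (fraction of runs ending in fixation, mean generations to loss or fixation)."""
    rng = random.Random(seed)
    fixed = 0
    total_generations = 0
    for _ in range(replicates):
        final, generations = wright_fisher(n_copies, start_count, rng)
        fixed += final == n_copies
        total_generations += generations
    return fixed / replicates, total_generations / replicates


print(summarize(10, 5, 500))   # small population
print(summarize(40, 20, 500))  # larger population, same starting frequency
```

Individual runs cannot be predicted, but statistical statements can be made: with a starting frequency of 0.5, about half of the replicates end in fixation, and the smaller population reaches loss or fixation in fewer generations on average.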

Evolution is never entirely deterministic or entirely stochastic. Depending on the interplay of effective population size and the distribution of selective coefficients, the evolution of allele/gene frequencies is more affected by either natural selection or genetic drift. Genetic mutations are always random, but they can sometimes result in an adaptive advantage. In this case, positive selective pressure will increase the frequency of the advantageous mutation, eventually leading to fixation after fewer generations than expected for a neutral change, provided the effective population size is large enough. A mutation under negative selective pressure can become fixed due to random genetic drift when it is not entirely deleterious, but this generally requires more generations than expected for a neutral change. Non-synonymous mutations result in a phenotypic change of an organism, and are subject to selective pressure if they change the interaction of that organism with its environment. As explained above, synonymous mutations are usually neutral and therefore become fixed due to genetic drift. The effect of positive and negative selective pressure can be investigated by comparing the synonymous and non-synonymous substitution rate (see also Chapters 13 and 14).
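The interplay of selection and drift can also be expressed as a fixation probability. The sketch below uses Kimura's diffusion approximation for a single new mutation in a diploid population. It is a standard population-genetic result, but it is not presented in this chapter, and it gives the probability of fixation rather than the number of generations discussed above. The function name, the assumption of additive (genic) selection, and the example values are illustrative.

```python
import math


def fixation_probability(s, n_e, n=None):
    """Kimura's approximation for the fixation probability of a new mutation.

    s: selection coefficient (positive = advantageous, negative = deleterious),
       assuming additive (genic) selection.
    n_e: effective population size.
    n: census population size (defaults to n_e).
    Very large values of -4 * n_e * s (above about 700) overflow math.expm1.
    """
    if n is None:
        n = n_e
    p0 = 1.0 / (2 * n)  # initial frequency of a single new mutation
    if abs(4 * n_e * s) < 1e-12:
        return p0  # neutral limit: fixation probability equals initial frequency
    # (1 - exp(-4*Ne*s*p0)) / (1 - exp(-4*Ne*s)), written with expm1 for accuracy.
    return math.expm1(-4 * n_e * s * p0) / math.expm1(-4 * n_e * s)


for label, s in [("advantageous", 0.01), ("neutral", 0.0), ("slightly deleterious", -0.001)]:
    print(label, fixation_probability(s, n_e=1000))
```

With these hypothetical values the advantageous mutation fixes with a probability close to 2s, the neutral one with probability 1/(2N), and the slightly deleterious one with a small but non-zero probability, which is how drift allows mildly deleterious changes to become fixed.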

Date posted: 28/03/2014, 10:20
