Bioinformatics and Functional Genomics
Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.
Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical, and Medical business with Blackwell Publishing.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For more information about Wiley products, visit our web site at www.wiley.com.
Cover illustration includes detail from Leonardo da Vinci (1452–1519), dated c. 1506–1507, courtesy of the Schlossmuseum (Weimar).
ISBN: 978-0-470-08585-1
Library of Congress Cataloging-in-Publication Data is available.
Printed in the United States of America
For Barbara, Ava and Lillian with all my love.
Contents in Brief
PART I ANALYZING DNA, RNA, AND PROTEIN SEQUENCES IN DATABASES
PART II GENOMEWIDE ANALYSIS OF RNA AND PROTEIN
PART III GENOME ANALYSIS
Preface to the First Edition
PART I ANALYZING DNA, RNA, AND PROTEIN SEQUENCES IN DATABASES
Organization of The Book
Bioinformatics: The Big Picture
A Consistent Example: Hemoglobin
Organization of The Chapters
A Textbook for Courses on Bioinformatics and Genomics
Key Bioinformatics Websites
Nucleotide and Protein Sequences
Amount of Sequence Data
Organisms in GenBank
Types of Data in GenBank
Genomic DNA Databases
cDNA Databases Corresponding to Expressed Genes
Expressed Sequence Tags (ESTs)
ESTs and UniGene
Sequence-Tagged Sites (STSs)
Genome Survey Sequences (GSSs)
High Throughput Genomic Sequence (HTGS)
Protein Databases
National Center for Biotechnology Information
Introduction to NCBI: Home Page
PubMed
Entrez
BLAST
OMIM
Books
Taxonomy
Structure
The European Bioinformatics Institute (EBI)
Access to Information: Accession Numbers to Label and Identify Sequences
The Reference Sequence (RefSeq) Project
The Consensus Coding Sequence (CCDS) Project
Access to Information via Entrez Gene at NCBI
Relationship of Entrez Gene, Entrez Nucleotide, and Entrez Protein
Comparison of Entrez Gene and UniGene
Entrez Gene and HomoloGene
Access to Information: Protein Databases
UniProt
The Sequence Retrieval System at ExPASy
Access to Information: The Three Main Genome Browsers
The Map Viewer at NCBI
The University of California, Santa Cruz (UCSC) Genome Browser
The Ensembl Genome Browser
Examples of How to Access Sequence Data
HIV pol
Histones
Access to Biomedical Literature
PubMed Central and Movement toward Free Journal Access
Example of PubMed Search: RBP
Perspective
Gaps
Pairwise Alignment, Homology, and Evolution of Life
Scoring Matrices
Dayhoff Model: Accepted Point Mutations
PAM1 Matrix
PAM250 and Other PAM Matrices
From a Mutation Probability Matrix to a Log-Odds Scoring Matrix
Practical Usefulness of PAM Matrices in Pairwise Alignment
Important Alternative to PAM: BLOSUM Scoring Matrices
Pairwise Alignment and Limits of Detection: The “Twilight Zone”
Alignment Algorithms: Global and Local
Global Sequence Alignment: Algorithm of Needleman and Wunsch
Step 1: Setting Up a Matrix
Step 2: Scoring the Matrix
Step 3: Identifying the Optimal Alignment
Local Sequence Alignment: Smith and Waterman Algorithm
Rapid, Heuristic Versions of Smith–Waterman: FASTA and BLAST
Pairwise Alignment with Dot Plots
The Statistical Significance of Pairwise Alignments
Statistical Significance of Global Alignments
Statistical Significance of Local Alignments
Percent Identity and Relative Entropy
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
BLAST Search Steps
Step 1: Specifying Sequence of Interest
Step 2: Selecting BLAST Program
Step 3: Selecting a Database
Step 4a: Selecting Optional Search Parameters
9 Filtering and Masking
Step 4b: Selecting Formatting Parameters
BLAST Algorithm Uses Local Alignment Search Strategy
BLAST Algorithm Parts: List, Scan, Extend
BLAST Algorithm: Local Alignment Search Statistics and
BLAST Searching With Multidomain Protein: HIV-1
Finding Distantly Related Proteins: Position-Specific Iterated BLAST
Pattern-Hit Initiated BLAST (PHI-BLAST)
Profile Searches: Hidden Markov Models
BLAST-Like Alignment Tools to Search Genomic DNA Rapidly
Benchmarking to Assess Genomic Alignment Performance
PatternHunter
BLASTZ
MegaBLAST and Discontiguous MegaBLAST
BLAT
LAGAN
SSAHA
SIM4
Using BLAST for Gene Discovery
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Definition of Multiple Sequence Alignment
Typical Uses and Practical Strategies of Multiple Sequence Alignment
Benchmarking: Assessment of Multiple Sequence Alignment Algorithms
Five Main Approaches to Multiple Sequence Alignment
Exact Approaches to Multiple Sequence Alignment
Progressive Sequence Alignment
Iterative Approaches
Consistency-Based Approaches
Structure-Based Methods
Conclusions from Benchmarking Studies
Databases of Multiple Sequence Alignments
Pfam: Protein Family Database of Profile HMMs
Smart
Conserved Domain Database
Prints
Integrated Multiple Sequence Alignment Resources: InterPro and iProClass
PopSet
Multiple Sequence Alignment Database Curation: Manual versus Automated
Multiple Sequence Alignments of Genomic Regions
Perspective
Hypothesis
Positive and Negative Selection
Neutral Theory of Molecular Evolution
Molecular Phylogeny: Properties of Trees
Tree Roots
Enumerating Trees and Selecting Search Strategies
Type of Trees
Species Trees versus Gene/Protein Trees
DNA, RNA, or Protein-Based Trees
Five Stages of Phylogenetic Analysis
Stage 1: Sequence Acquisition
Stage 2: Multiple Sequence Alignment
Stage 3: Models of DNA and Amino Acid Substitution
Stage 4: Tree-Building Methods
Phylogenetic Methods
Distance
The UPGMA Distance-Based Method
Making Trees by Distance-Based Methods: Neighbor Joining
Phylogenetic Inference: Maximum Parsimony
Model-Based Phylogenetic Inference: Maximum Likelihood
Tree Inference: Bayesian Methods
Stage 5: Evaluating Trees
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
PART II GENOMEWIDE ANALYSIS OF RNA AND PROTEIN
Introduction to RNA
Noncoding RNA
Noncoding RNAs in the Rfam Database
Transfer RNA
Ribosomal RNA
Small Nuclear RNA
Small Nucleolar RNA
MicroRNA
Short Interfering RNA
Noncoding RNAs in the UCSC Genome and Table Browser
Introduction to Messenger RNA
mRNA: Subject of Gene Expression Studies
Analysis of Gene Expression in cDNA Libraries
Pitfalls in Interpreting Expression Data from cDNA Libraries
Full-Length cDNA Projects
Serial Analysis of Gene Expression
Stage 4: Image Analysis
Stage 5: Data Analysis
The Relationship of DNA, mRNA, and Protein
Microarray Data Analysis Software and Data Sets
Reproducibility of Microarray Experiments
Microarray Data Analysis: Preprocessing
Scatter Plots and MA Plots
Global and Local Normalization
Accuracy and Precision
Robust Multiarray Analysis
Hierarchical Cluster Analysis of Microarray Data
Partitioning Methods for Clustering: k-Means Clustering
Clustering Strategies: Self-Organizing Maps
Principal Components Analysis: Visualizing Microarray Data
Supervised Data Analysis for Classification of Genes or Samples
Functional Annotation of Microarray Data
Perspective
Pitfalls
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Protein Databases
Community Standards for Proteomics Research
Techniques to Identify Proteins
Direct Protein Sequencing
Gel Electrophoresis
Mass Spectrometry
Four Perspectives on Proteins
Perspective 1. Protein Domains and Motifs: Modular Nature of Proteins
Added Complexity of Multidomain Proteins
Protein Patterns: Motifs or Fingerprints Characteristic of Proteins
Perspective 2. Physical Properties of Proteins
Accuracy of Prediction Programs
Proteomic Approaches to Phosphorylation
Proteomic Approaches to Transmembrane Domains
Introduction to Perspectives 3 and 4: Gene Ontology Consortium
Perspective 3: Protein Localization
Perspective 4: Protein Function
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Overview of Protein Structure
Protein Sequence and Structure
Biological Questions Addressed by Structural Biology: Globins
Principles of Protein Structure
Primary Structure
Secondary Structure
Tertiary Protein Structure: Protein-Folding Problem
Target Selection and Acquisition of Three-Dimensional Protein Structures
Structural Genomics and the Protein Structure Initiative
The Protein Data Bank
Accessing PDB Entries at the NCBI Website
Integrated Views of the Universe of Protein Folds
Taxonomic System for Protein Structures: The SCOP Database
The CATH Database
The Dali Domain Dictionary
Comparison of Resources
Protein Structure Prediction
Homology Modeling (Comparative Modeling)
Fold Recognition (Threading)
Ab Initio Prediction (Template-Free Modeling)
A Competition to Assess Progress in Structure Prediction
Intrinsically Disordered Proteins
Protein Structure and Disease
Perspective
Pitfalls
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction to Functional Genomics
The Relationship of Genotype and Phenotype
Eight Model Organisms for Functional Genomics
The Bacterium Escherichia coli
The Yeast Saccharomyces cerevisiae
The Plant Arabidopsis thaliana
The Nematode Caenorhabditis elegans
The Fruitfly Drosophila melanogaster
The Zebrafish Danio rerio
The Mouse Mus musculus
Homo sapiens: Variation in Humans
Functional Genomics Using Reverse Genetics and Forward Genetics
Reverse Genetics: Mouse Knockouts and the β-Globin Gene
Reverse Genetics: Knocking Out Genes in Yeast Using Molecular Barcodes
Reverse Genetics: Random Insertional Mutagenesis (Gene Trapping)
Reverse Genetics: Insertional Mutagenesis in Yeast
Reverse Genetics: Gene Silencing by Disrupting RNA
Forward Genetics: Chemical Mutagenesis
Functional Genomics and the Central Dogma
Functional Genomics and DNA: The ENCODE Project
Functional Genomics and RNA
Functional Genomics and Protein
Proteomics Approaches to Functional Genomics
Protein–Protein Interactions
The Yeast Two-Hybrid System
Protein Complexes: Affinity Chromatography and Mass Spectrometry
The Rosetta Stone Approach
Protein–Protein Interaction Databases
Protein Networks
Perspective
Pitfalls
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
PART III GENOME ANALYSIS
Introduction
Five Perspectives on Genomics
Brief History of Systematics
History of Life on Earth
Molecular Sequences as the Basis of the Tree of Life
Role of Bioinformatics in Taxonomy
Genome-Sequencing Projects: Overview
Four Prominent Web Resources
Brief Chronology
First Bacteriophage and Viral Genomes (1976–1978)
First Eukaryotic Organellar Genome (1981)
First Chloroplast Genomes (1986)
First Eukaryotic Chromosome (1992)
Complete Genome of Free-Living Organism (1995)
First Eukaryotic Genome (1996)
Escherichia coli (1997)
First Genome of Multicellular Organism (1998)
Human Chromosome (1999)
Fly, Plant, and Human Chromosome 21 (2000)
Draft Sequences of Human Genome (2001)
Continuing Rise in Completed Genomes (2002)
Expansion of Genome Projects (2003–2009)
Genome Analysis Projects
Criteria for Selection of Genomes for Sequencing
Genome Size
Cost
Relevance to Human Disease
Relevance to Basic Biological Questions
Relevance to Agriculture
Should an Individual from a Species, Several Individuals, or Many Individuals Be Sequenced
Resequencing Projects
Ancient DNA Projects
Metagenomics Projects
DNA Sequencing Technologies
Sanger Sequencing
Pyrosequencing
Cyclic Reversible Termination: Solexa
The Process of Genome Sequencing
Genome-Sequencing Centers
Sequencing and Assembling Genomes: Strategies
Genomic Sequence Data: From Unfinished to Finished
Finishing: When Has a Genome Been Fully Sequenced
Repository for Genome Sequence Data
Role of Comparative Genomics
Genome Annotation: Features of Genomic DNA
Annotation of Genes in Prokaryotes
Annotation of Genes in Eukaryotes
Summary: Questions from Genome-Sequencing Projects
Perspective
Pitfalls
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Classification of Viruses
Diversity and Evolution of Viruses
Metagenomics and Virus Diversity
Bioinformatics Approaches to Problems in Virology
Influenza Virus
Herpesvirus: From Phylogeny to Gene Expression
Human Immunodeficiency Virus
Bioinformatic Approaches to HIV-1
Measles Virus
Perspectives
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Classification of Bacteria and Archaea
Classification of Bacteria by Morphological Criteria
Classification of Bacteria and Archaea Based on Genome Size and Geometry
Classification of Bacteria and Archaea Based on Lifestyle
Classification of Bacteria Based on Human Disease Relevance
Classification of Bacteria and Archaea Based on Ribosomal RNA Sequences
Classification of Bacteria and Archaea Based on Other Molecular Sequences
Analysis of Prokaryotic Genomes
Nucleotide Composition
Finding Genes
Lateral Gene Transfer
Functional Annotation: COGs
Comparison of Prokaryotic Genomes
TaxPlot
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Major Differences between Eukaryotes and Prokaryotes
General Features of Eukaryotic Genomes and Chromosomes
C Value Paradox: Why Eukaryotic Genome Sizes Vary So Greatly
Organization of Eukaryotic Genomes into Chromosomes
Analysis of Chromosomes Using Genome Browsers
Analysis of Chromosomes by the ENCODE Project
Repetitive DNA Content of Eukaryotic Chromosomes
Eukaryotic Genomes Include Noncoding and Repetitive DNA
Repeated Sequences Such as Are Found at Telomeres, Centromeres, and Ribosomal
Transcription Factor Databases and Other Genomic DNA
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Description and Classification of Fungi
Introduction to Budding Yeast Saccharomyces cerevisiae
Sequencing the Yeast Genome
Features of the Budding Yeast Genome
Exploring a Typical Yeast Chromosome
Gene Duplication and Genome Duplication of S. cerevisiae
Comparative Analyses of Hemiascomycetes
Analysis of Whole Genome Duplication
Identification of Functional Elements
Analysis of Fungal Genomes
Aspergillus
Candida albicans
Cryptococcus neoformans: Model Fungal Pathogen
Atypical Fungus: Microsporidial Parasite Encephalitozoon cuniculi
Neurospora crassa
First Basidiomycete: Phanerochaete chrysosporium
Fission Yeast Schizosaccharomyces pombe
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Protozoans at the Base of the Tree Lacking Mitochondria
Trichomonas
Giardia lamblia: A Human Intestinal Parasite
Genomes of Unicellular Pathogens: Trypanosomes and Leishmania
Trypanosomes
Leishmania
The Chromalveolates
Malaria Parasite Plasmodium falciparum and Other Apicomplexans
Astonishing Ciliophora: Paramecium and Tetrahymena
Nucleomorphs
Kingdom Stramenopila
Plant Genomes
Overview
Green Algae (Chlorophyta)
Arabidopsis thaliana Genome
The Second Plant Genome: Rice
The Third Plant Genome: Poplar
The Fourth Plant Genome: Grapevine
Moss
Slime and Fruiting Bodies at the Feet of Metazoans
Social Slime Mold Dictyostelium discoideum
Metazoans
Introduction to Metazoans
Analysis of a Simple Animal: The Nematode Caenorhabditis elegans
The First Insect Genome: Drosophila melanogaster
The Second Insect Genome: Anopheles gambiae
Silkworm
450 Million Years Ago: Vertebrate Genomes of Fish
310 Million Years Ago: Dinosaurs and the Chicken Genome
180 Million Years Ago: The Opossum Genome
100 Million Years Ago: Mammalian Radiation from Dog to Cow
80 Million Years Ago: The Mouse and Rat
5 to 50 Million Years Ago: Primate Genomes
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems/Computer Lab
Self-Test Quiz
Suggested Reading
References
Introduction
Main Conclusions of Human Genome Project
The ENCODE Project
Gateways to Access the Human Genome
NCBI
Ensembl
University of California at Santa Cruz Human Genome Browser
NHGRI
The Wellcome Trust Sanger Institute
The Human Genome Project
Background of the Human Genome Project
Strategic Issues: Hierarchical Shotgun Sequencing to Generate Draft Sequence
Features of the Genome Sequence
The Broad Genomic Landscape
Four Categories of Disease
Monogenic Disorders
Complex Disorders
Genomic Disorders
Environmentally Caused Disease
Other Categories of Disease
Disease Databases
OMIM: Central Bioinformatics Resource for Human Disease
Locus-Specific Mutation Databases
The PhenCode Project
Four Approaches to Identifying Disease-Associated Genes
Linkage Analysis
Genome-Wide Association Studies
Identification of Chromosomal Abnormalities
Genomic DNA Sequencing
Human Disease Genes in Model Organisms
Human Disease Orthologs in Nonvertebrate Species
Human Disease Orthologs in Rodents
Human Disease Orthologs in Primates
Human Disease Genes and Substitution Rates
Functional Classification of Disease Genes
Perspective
Pitfalls
Web Resources
Discussion Questions
Problems
Self-Test Quiz
Suggested Reading
References
Preface to the Second Edition
The Neurobehavioral Unit of the Kennedy Krieger Institute has 16 hospital beds. Most of the patients are children who have been diagnosed with autism, and most engage in self-injurious behavior. They engage in self-biting, self-hitting, head-banging, and other destructive behaviors. In most cases, we do not understand the genetic contributions to such behaviors, limiting the available strategies for treatment. In my research, I am motivated to understand molecular changes that underlie childhood brain diseases. The field of bioinformatics provides tools we can use to understand disease processes through the analysis of molecular sequence data. More broadly, bioinformatics facilitates our understanding of the basic aspects of biology including development, metabolism, adaptation to the environment, genetics (e.g., the basis of individual differences), and evolution.
Since the publication of the first edition of this textbook in 2003, the fields of bioinformatics and genomics have grown explosively. In the preface to the first edition (2003) I noted that tens of billions of base pairs (gigabases) of DNA had been deposited in GenBank. Now in 2009 we are reaching tens of trillions (terabases) of DNA, presenting us with unprecedented challenges in how to store, analyze, and interpret sequence data. In this second edition I have made numerous changes to the content and organization of the book. All of the chapters are rewritten, and about 90% of the figures and tables are updated. There are two new chapters, one on functional genomics and one on the eukaryotic chromosome. I now focus on the globins as examples throughout the book. Globins have a special place in the history of biology, as they were among the first proteins to be identified (in the 1830s) and sequenced (in the 1950s and 1960s). The first protein to have its structure solved by X-ray crystallography was myoglobin (Chapter 11); molecular phylogeny was applied to the globins in the 1960s (Chapter 7); and the globin gene loci were among the first to be sequenced (in the 1980s; see Chapter 16).
The fields of bioinformatics and genomics are far too broad to be understood by one person. Thus many textbooks are written by multiple authors, each of whom brings a deeper knowledge of the subject matter. I hope that this book at least offers the benefit of a single author’s vision of how to present the material. This is essentially two textbooks: one on bioinformatics (parts I and II) and one on genomics (part III). I feel that presenting bioinformatics on its own would be incomplete without further applying those approaches to sequence analysis of genomes across the tree of life. Similarly I feel that it is not possible to approach genomics without first treating the bioinformatics tools that are essential engines of that field.
As with the previous edition a companion website is available which provides up-to-date web links referred to in the book and PowerPoint slides arranged by chapter (www.bioinfbook.org). A resource site for instructors is also available giving detailed solutions to problems (www.wiley.com/go/pevsnerbioinformatics).
In preparing each edition of this book I read many papers and reviewed several thousand websites. I sincerely apologize to those authors, researchers and others whose work I did not cite. It is a great pleasure to acknowledge my colleagues who have helped in the preparation of this book. Some read chapters including Jef Boeke (Chapter 12), Rafael Irizarry (Chapter 9), Stuart Ray (Chapter 7), Ingo Ruczinski (Chapter 11), and Sarah Wheelan (Chapters 3 and 5–7). I thank many students and faculty at Johns Hopkins and elsewhere who have provided critical feedback, including those who have lectured in bioinformatics and genomics courses (Judith Bender, Jef Boeke, Egbert Hoiczyk, Ingo Ruczinski, Alan Scott, David Sullivan, David Valle, and Sarah Wheelan). Many others engaged in helpful discussions including Charles D. Cohen, Bob Cole, Donald Coppock, Laurence Frelin, Hugh Gelch, Gary W. Goldstein, Marjan Gucek, Ada Hamosh, Nathaniel Miller, Akhilesh Pandey, Elisha Roberson, Kirby D. Smith, Jason Ting, and N. Varg.
I thank my wife Barbara for her support and love as I prepared this book.
Preface to the First Edition
This book emerged from lecture notes I prepared several years ago for an introductory bioinformatics and genomics course at the Johns Hopkins School of Medicine. The first class consisted of about 70 graduate students and several hundred auditors, including postdoctoral fellows, technicians, undergraduates, and faculty. Those who attended the course came from a broad variety of fields: students of genetics, neuroscience, immunology or cell biology, clinicians interested in particular diseases, statisticians and computer scientists, virologists and microbiologists. They had a common interest in wanting to understand how they could apply the tools of computer science to solve biological problems. This is the domain of bioinformatics, which I define most simply as the interface of computer science and molecular biology. This emerging field relies on the use of computer algorithms and computer databases to study proteins, genes, and genomes. Functional genomics is the study of gene function using genome-wide experimental and computational approaches.
At its essence, the field of bioinformatics is about comparisons. In the first third of the book we learn how to extract DNA or protein sequences from the databases, and then to compare them to each other in a pairwise fashion or by searching an entire database. For the student who has a gene of particular interest, a natural question is to ask “what other genes (or proteins) are related to mine?”
In the middle third of the book, we move from DNA to RNA (gene expression) and to proteins. We again are engaged in a series of comparisons. We compare gene expression in two cell lines with or without drug treatment, or a wildtype mouse heart versus a knockout mouse heart, or a frog at different stages of development. These comparisons extend to the world of proteins, where we apply the tools of proteomics to complex biological samples under assorted physiological conditions. The alignment of multiple, related DNA or protein sequences is another form of comparison. These relationships can be visualized in a phylogenetic tree.
The last third of the book spans the tree of life, and this provides another level of comparison. Which forms of human immunodeficiency virus threaten us, and how can we compare the various HIV subtypes to learn how we might develop a vaccine? How are a mosquito and a fruitfly related? What genes do vertebrates such as fish and humans share in common, and which genes are unique to various phylogenetic lineages?
I believe that these various kinds of comparisons are what distinguish the newly emerging fields of bioinformatics and genomics from traditional biology. Biology has always concerned comparisons; in this book I quote 19th century biologists such as Richard Owen, Ernst Haeckel, and Charles Darwin who engaged in comparative studies at the organismal level. The problems we are trying to solve have not changed substantially. We still seek a more complete understanding of the unifying concepts of biology, such as the organization of life from its constituent parts (e.g., genes and proteins), the behavior of complex biological systems, and the continuity of life through evolution. What has changed is how we pursue this more complete understanding. This book describes databases filled with raw information on genes and gene products and the tools that are useful to analyze these data.
My training is as a molecular biologist and neuroscientist. My laboratory studies the molecular basis of childhood brain disorders such as Down syndrome, autism, and lead poisoning. We are located at the Kennedy Krieger Institute, a hospital for children with developmental disorders. (You can learn more about this Institute at http://www.kennedykrieger.org.) Each year over 10,000 patients visit the Institute. The hospital includes clinics for children with a variety of conditions including language disorders, eating disorders, autism, mental retardation, spina bifida, and traumatic brain injury. Some have very common disorders, such as Down syndrome (affecting about 1:700 live births) and mental retardation. Others have rare disorders, such as Rett syndrome or adrenoleukodystrophy.
We are at a time when the number of base pairs of DNA deposited in the world’s public repositories has reached tens of billions, as described in Chapter 2. We have obtained the first sequence of the human genome, and since 1995 hundreds of genomes have been sequenced. Throughout the book, you can follow the progress of science as we learn how to sequence DNA, and study its RNA and protein products. At times the pace of progress seems dazzling.
Yet at the same time we understand so little about human disease. For thousands of diseases, a defect in a single gene causes a pathological effect. Even as we discover the genes that are defective in diseases such as cystic fibrosis, muscular dystrophy, adrenoleukodystrophy, and Rett syndrome, the path to finding an effective treatment or cure is obscure. But single gene disorders are not nearly as common as complex diseases such as autism, depression, and mental retardation that are likely due to mutations in multiple genes. And all genetic disease is not nearly as common as infectious disease. We know little about why one strain of virus infects only humans, while another closely related species infects only chimpanzees. We do not understand why one bacterial strain may be pathogenic, while another is harmless. We have not learned how to develop an effective vaccine against any eukaryotic pathogen, from protozoa (such as Plasmodium falciparum that causes malaria) to parasitic nematodes.
The prospects for making progress in these areas are very encouraging specifically because of the recent development of new bioinformatics tools. We are only now beginning to position ourselves to understand the genetic basis of both disease-causing agents and the hosts that are susceptible. Our hope is that the information so rapidly accumulating in new bioinformatics databases can be translated through research into insights into human disease and biology in general.
NOTE TO READERS
This book describes over 1,000 websites related to bioinformatics and functional genomics. All of these sites evolve over time (and some become extinct). In an effort to keep the web links up-to-date, a companion website (http://www.bioinfbook.org) maintains essentially all of the website links, organized by chapter of the book. We try our best to maintain this site over time. We use a program to automatically scan all the links each month, and then we update them as necessary.
An additional site is available to instructors, including detailed solutions to problems (see http://www.wiley.com).
Writing this book has been a wonderful learning experience. It is a pleasure to thank
the many people who have contributed. In particular, the intellectual environment at
the Kennedy Krieger Institute and the Johns Hopkins School of Medicine has been
extraordinarily rich. These chapters were developed from lectures in an introductory
bioinformatics course. The Johns Hopkins faculty who lectured during its first three
years were Jef Boeke (yeast functional genomics), Aravinda Chakravarti (human
disease), Neil Clarke (protein structure), Kyle Cunningham (yeast), Garry Cutting
(human disease), Rachel Green (RNA), Stuart Ray (molecular phylogeny), and
Roger Reeves (the human genome). I have benefited greatly from their insights
into these areas.
I gratefully acknowledge the many reviewers of this book, including a group of
anonymous reviewers who offered extremely constructive and detailed suggestions.
Those who read the book include Russ Altman, Christopher Aston, David P.
Leader, and Harold Lehmann (various chapters), Conover Talbot (Chapters 2 and
18), Edie Sears (Chapter 3), Tom Downey (Chapter 7), Jef Boeke (Chapter 8 and
various other chapters), Michelle Nihei and Daniel Yuan (Chapter 8), Mario
Amzel and Ingo Ruczinski (Chapter 9), Stuart Ray (Chapter 11), Marie Hardwick
(Chapter 13), Yukari Manabe (Chapter 14), Kyle Cunningham and Forrest
Spencer (Chapter 15), and Roger Reeves (Chapter 16). Kirby D. Smith read
Chapter 18 and provided insights into most of the other chapters as well. Each of
these colleagues offered a great deal of time and effort to help improve the content,
and each served as a mentor. Of the many students who read the chapters I mention
Rong Mao, Ok-Hee Jeon, and Vinoy Prasad. I particularly thank Mayra Garcia and
Larry Frelin who provided invaluable assistance throughout the writing process. I am
grateful to my editor at John Wiley & Sons, Luna Han, for her encouragement.
I also acknowledge Gary W. Goldstein, President of the Kennedy Krieger
Institute, and Solomon H. Snyder, my chairman in the Department of Neuroscience at
Johns Hopkins. Both provided encouragement, and allowed me the opportunity to
write this book while maintaining an academic laboratory.
On a personal note, I thank my family for all their love and support, as well as N.
Varg, Kimberly Reed, and Charles Cohen. Most of all, I thank my fiancée Barbara
Reed for her patience, faith, and love.
ACKNOWLEDGMENTS xxv
Ask 10 investigators in human genetics what resources they need most and it is highly
likely that computational skills and tools will be at the top of the list. Genomics, with
its reliance on microarrays, genotyping, high throughput sequencing and the like, is
intensely data-rich and for this reason is impossible to disentangle from
bioinformatics. This text, with its clear descriptions, practical examples and focus on the
overlaps and interdependence of these two fields, is thus an essential resource for
students and practitioners alike.
Interestingly, bioinformatics and genomics are both relatively recent disciplines.
Each emerged in the course of the Human Genome Project (HGP) that was
conceived in the mid-1980s and began officially on October 1, 1990. As the HGP
matured from its initial focus on gene maps in model organisms to the massive efforts
to produce a reference human whole genome sequence, there was an increasing need
for computational biology tools to store, analyze and disseminate large amounts of
sequence data. For this reason, genomics increasingly relied on bioinformatics
and, in turn, the field of bioinformatics flourished. Today, no serious student of
genomics can imagine life without bioinformatics. This interdependence continues to
grow by leaps and bounds as the questions and activities of investigators in genomics
become bolder and more expansive; consider, for example, whole genome
association studies (GWAS), the ENCODE project, the challenge of copy number variants,
the 1000 Genomes project, epigenomics, and the looming growth of personal
genome sequences and their analysis.
This textbook provides a clear and timely introduction to both bioinformatics
and genomics. It is organized so that each chapter can correspond to a lecture for a
course on bioinformatics or genomics and, indeed, we have used it this way for our
students. Also, for readers not taking courses, the book provides essential
background material. For computer scientists and biologists alike the book offers
explanations of available methods and the kinds of problems for which they can be
used. The sections on bioinformatics in the first part of the book describe many of
the basic tools that are used to analyze and compare DNA and protein sequences.
The tone is inviting as the reader is guided to learn to use different software by
example. Multiple approaches for solving particular problems, such as sequence
alignment and molecular phylogeny, are presented. The middle part of the book
introduces functional genomics. Here again the focus is on helping the reader to
learn how to do analyses (such as microarray data analysis or protein structure
prediction) in a practical way. A companion website provides many data sets, so
the student can get experience in performing analyses. Chapter 12 provides a
roadmap to the very complicated topic of functional genomics, spanning a range
of techniques and model organisms used to study gene function. The last third of
the book provides a survey of the tree of life from a genomics perspective. There is an attempt to be comprehensive, and at the same time, to present the material in an interesting way, highlighting the fascinating features that make each genome unique.

Far from being a dry account of the facts of genomics and bioinformatics, the book offers many features that highlight the vitality of this field. There are discussions throughout about how to critically evaluate the performance of different software. For example, there are ‘competitions’ in which different research groups perform computational analyses on data sets that have been validated with some ‘gold standard’, allowing false positive and false negative error rates to be determined. These competitions are described in areas such as microarray data analysis (Chapter 9), mass spectrometry (Chapter 10), protein structure prediction (Chapter 11), or gene prediction (Chapter 16). The book also includes descriptions of important movements in the fields of bioinformatics and genomics, ranging from the RefSeq project for organizing sequences to the ENCODE and HapMap projects. Similarly, there is a rich description of the historical context for different aspects of bioinformatics and genomics, such as Garrod’s views on disease (Chapter 20); Ohno’s classic 1970 book on genome duplication (Chapter 17); and the earliest attempts to create alignments and phylogenetic trees of the globins.
Where will the fields of bioinformatics and genomics go in the next five to 10 years? The opportunities are vast and any prediction will certainly be incomplete, but it is certain that the rapid technological advances in sequencing will provide an unprecedented view of human genetic variation and how this relates to phenotype. In the area of human disease studies, genome-wide association studies can be expected to lead to the identification of hundreds of genes underlying complex disorders. Finally, our understanding of evolution and its relevance to medicine will expand dramatically. Dr. Pevsner’s valuable book will help the student or researcher access the tools and learn the principles that will enable this exciting research.
David Valle, M.D.
Henry J. Knott Professor and Director
McKusick-Nathans Institute of Genetic Medicine,
Johns Hopkins University School of Medicine
FIGURE 3.1 Three-dimensional structures of (a) myoglobin (accession 2MM1), (b) the tetrameric hemoglobin protein (2H35), (c) the beta globin subunit of hemoglobin, and (d) myoglobin and beta globin superimposed. The images were generated with the program Cn3D (see Chapter 11). These proteins are homologous (descended from a common ancestor), and they share very similar three-dimensional structures. However, pairwise alignment of these proteins’ amino acid sequences reveals that the proteins share very limited amino acid identity.

FIGURE 4.7 Middle portion of a typical blastp output provides a graphical display of the results. Database matches are color coded to indicate relatedness (based on alignment score), and the length of each line corresponds to the region in which that sequence aligns with the query sequence. This graphic can be useful to summarize the regions in which database matches align to the query.

FIGURE 6.10 Multiple sequence alignment of the human beta globin locus compared to other vertebrate genomic sequences. (a) A view in the UCSC Genome Browser of the beta globin gene is indicated. Exons are represented by blocks (arrow 1) and tend to be highly conserved among a group of vertebrate genomes. Additionally, several regions of high conservation occur in noncoding areas (e.g., arrow 2). (b) A view of 55 base pairs at the beta globin locus. At this magnification (fewer than 30,000 base pairs), the UCSC genome browser displays the nucleotides of genomic DNA in the multiple sequence alignment of a group of vertebrates. The ATG codon (oriented from right to left) is indicated (three asterisks), and the human protein product is shown (amino acids from right to left matching the start of protein NP_000509, MVHLTPEEKS).
[Figure 8.17 flowchart panels: Experimental design; RNA preparation and probe preparation; Hybridize samples to microarrays; Image analysis; Identify co-regulated genes and classify samples; Biological confirmation; Deposit data in a database (e.g., GEO, ArrayExpress)]

FIGURE 8.17 Overview of the process of generating high throughput gene expression data using microarrays. In stage 1, biological samples are selected for a comparison of gene expression. In stage 2, RNA is isolated and labeled, often with fluorescent dyes. These samples are hybridized to microarrays, which are solid supports containing complementary DNA or oligonucleotides corresponding to known genes or ESTs. In stage 4, image analysis is performed to evaluate signal intensities. In stage 5, the expression data are analyzed to identify differentially regulated genes (e.g., using ANOVA [Chapter 9] and scatter plots; stage 5, at left) or clustering of genes and/or samples (right). Based on these findings, independent confirmation of microarray-based findings is performed (stage 6). The microarray data are deposited in a database so that large-scale analyses can be performed.

FIGURE 8.21 Microarray images. (a) A nitrocellulose filter is probed with [32P]cDNA derived from the hippocampus of a postmortem brain of an individual with Down syndrome. There are 5000 cDNAs spotted on the array. The pattern in which genes are represented on any array is randomized. (b) Six of the signals are visualized using NIH Image software. Image analysis software must define the properties of each signal, including the likelihood that an intense signal (lower left) will “bleed” onto a weak signal (lower right). (c) A microarray from NEN Perkin-Elmer (representing 2400 genes) was probed with the same Rett syndrome and control brain samples used in Fig. 8.20. This technology employs cDNA samples that are fluorescently labeled in a competitive hybridization.
[Figure 11.1 panels: (a) Primary structure; (b) Secondary structure; (c) Tertiary structure; (d) Quaternary structure]

(a) MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

FIGURE 11.1 A hierarchy of protein structure. (a) The primary structure of a protein refers to the linear polypeptide chain of amino acids. Here, human beta globin is shown (NP_000509). (b) The secondary structure includes elements such as alpha helices and beta sheets. Here, beta globin protein sequence was input to the POLE server for secondary structure (http://pbil.univ-lyon1.fr/) where three prediction algorithms were run and a consensus was produced. Abbreviations: h, alpha helix; c, random coil; e, extended strand. (c) The tertiary structure is the three-dimensional structure of the protein chain. Alpha helices are represented as thickened cylinders. Arrows labeled N and C point to the amino- and carboxy-terminals, respectively. (d) The quaternary structure includes the interactions of the protein with other subunits and heteroatoms. Here, the four subunits of hemoglobin are shown (with an α2β2 composition and one beta globin chain highlighted) as well as four noncovalently attached heme groups. Panels (c) and (d) were produced using Cn3D software from NCBI.
FIGURE 11.3 Examples of secondary structure. (a) Myoglobin (Protein Data Bank ID 2MM1) is composed of large regions of α helices, shown as strands wrapped around barrel-shaped objects. By entering the accession 2MM1 into NCBI’s structure site, one can view this three-dimensional structure using Cn3D software. The accompanying sequence viewer shows the primary amino acid sequence. By clicking on a colored region (bracket) corresponding to an alpha helix, that structure is highlighted in the structure viewer (arrow). (b) Human pepsin (PDB 1PSN) is an example of a protein primarily composed of β strands, drawn as large arrows. Selecting a region of the primary amino acid sequence (bracket) results in a highlighting of the corresponding β strand.

FIGURE 18.8 Whole genome duplication in the ciliate Paramecium tetraurelia is inferred by analysis of protein paralogs. The outer circle displays all chromosome-sized scaffolds from the genome sequencing project. Lines link pairs of genes with a “best reciprocal hit” match. The three interior circles show the reconstructed ancestral sequences obtained by combining the paired sequences from each previous step. The inner circles are progressively smaller and reflect fewer conserved genes with a smaller average similarity. From Aury et al. (2006). Used with permission.

FIGURE 18.18 Alignment of C. elegans and C. briggsae conserved syntenic regions using the synteny viewer at WormBase (http://www.wormbase.org). Regions of chromosome I are aligned from C. elegans (above) and C. briggsae (below).
Analyzing DNA, RNA, and Protein Sequences in Databases
The study of bioinformatics includes the analysis of proteins. In the first half of the nineteenth century the Dutch researcher Gerardus Johannes Mulder (1802–1880), advised by the Swedish chemist Jöns Jacob Berzelius (1779–1848), studied the “albuminous” substances or proteins fibrin, albumin from blood, albumin from egg (ovalbumin), and the coloring matter of blood (hemoglobin). Mulder and others extracted and purified these proteins and believed that they all shared the same elemental composition (C400H260N100O120), with varying amounts of phosphorus and sulfur. Justus Liebig (1803–1873) believed that the composition of protein was C48H36N6O14. This page, from Liebig’s Animal Chemistry, or Organic Chemistry in its Applications to Physiology and Pathology (1847, p. 36), discusses albumin, fibrin, and casein (see arrowhead).
Introduction
Bioinformatics represents a new field at the interface of the twentieth-century
revolutions in molecular biology and computers. A focus of this new discipline is the use of
computer databases and computer algorithms to analyze proteins, genes, and the
complete collections of deoxyribonucleic acid (DNA) that comprises an organism
(the genome). A major challenge in biology is to make sense of the enormous
quantities of sequence data and structural data that are generated by genome-sequencing
projects, proteomics, and other large-scale molecular biology efforts. The tools of
bioinformatics include computer programs that help to reveal fundamental
mechanisms underlying biological problems related to the structure and function of
macromolecules, biochemical pathways, disease processes, and evolution.
According to a National Institutes of Health (NIH) definition, bioinformatics is
“research, development, or application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health data, including those
to acquire, store, organize, analyze, or visualize such data.” The related discipline
of computational biology is “the development and application of data-analytical
and theoretical methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.”
While the discipline of bioinformatics focuses on the analysis of molecular
sequences, genomics and functional genomics are two closely related disciplines.
The goal of genomics is to determine and analyze the complete DNA sequence of
an organism, that is, its genome. The DNA encodes genes, which can be expressed
as ribonucleic acid (RNA) transcripts and then in many cases further translated into
protein. Functional genomics describes the use of genomewide assays in the study of
gene and protein function.

Bioinformatics and Functional Genomics, Second Edition. By Jonathan Pevsner
Copyright © 2009 John Wiley & Sons, Inc.

The NIH Bioinformatics Definition Committee findings are reported at http://www.bisti.nih.gov/CompuBioDef.pdf. For additional definitions of bioinformatics and functional genomics, see Boguski (1994), Luscombe et al. (2001), Ideker et al. (2001), and Goodman (2002).
The aim of this book is to explain both the theory and practice of bioinformatics and genomics. The book is especially designed to help the biology student use computer programs and databases to solve biological problems related to proteins, genes, and genomes. Bioinformatics is an integrative discipline, and our focus on individual proteins and genes is part of a larger effort to understand broad issues in biology, such as the relationship of structure to function, development, and disease. For the computer scientist, this book explains the motivations for creating and using algorithms and databases.

ORGANIZATION OF THE BOOK

There are three main sections of the book. The first part (Chapters 2 to 7) explains how to access biological sequence data, particularly DNA and protein sequences (Chapter 2). Once sequences are obtained, we show how to compare two sequences (pairwise alignment; Chapter 3) and how to compare multiple sequences (primarily by the Basic Local Alignment Search Tool [BLAST]; Chapters 4 and 5). We introduce multiple sequence alignment (Chapter 6) and show how multiply aligned sequences can be visualized in phylogenetic trees (Chapter 7). Chapter 7 thus introduces the subject of molecular evolution.

The second part of the book describes functional genomics approaches to RNA and protein and the determination of gene function (Chapters 8 to 12). The central dogma of biology states that DNA is transcribed into RNA then translated into protein. We will examine bioinformatic approaches to RNA, including both noncoding and coding RNAs. We then describe the technology of DNA microarrays and examine microarray data analysis (Chapter 9). From RNA we turn to consider proteins from the perspective of protein families, and the analysis of individual proteins (Chapter 10) and protein structure (Chapter 11). We conclude the middle part of the book with an overview of the rapidly developing field of functional genomics (Chapter 12).

Since 1995, the genomes have been sequenced for several thousand viruses, prokaryotes (bacteria and archaea), and eukaryotes, such as fungi, animals, and plants. The third section of the book covers genome analysis (Chapters 13 to 20). Chapter 13 provides an overview of the study of completed genomes and then descriptions of how the tools of bioinformatics can elucidate the tree of life. We describe bioinformatics resources for the study of viruses (Chapter 14) and bacteria and archaea (Chapter 15; these are two of the three main branches of life). Next we examine the eukaryotic chromosome (Chapter 16) and explore the genomes of a variety of eukaryotes, including fungi (Chapter 17), organisms from parasites to primates (Chapter 18), and then the human genome (Chapter 19). Finally, we explore bioinformatic approaches to human disease (Chapter 20).
BIOINFORMATICS: THE BIG PICTURE

We can summarize the fields of bioinformatics and genomics with three perspectives. The first perspective on bioinformatics is the cell (Fig. 1.1). The central dogma of molecular biology is that DNA is transcribed into RNA and translated into protein. The focus of molecular biology has been on individual genes, messenger RNA (mRNA) transcripts as well as noncoding RNAs, and proteins. A focus of the field of bioinformatics is the complete collection of DNA (the genome), RNA (the transcriptome), and protein sequences (the proteome) that have been amassed (Henikoff, 2002). These millions of molecular sequences present both great opportunities and great challenges.

[Figure 1.1 panels: Central dogma of molecular biology (DNA, RNA, protein, cellular phenotype); Central dogma of genomics (genome, transcriptome, proteome, cellular phenotype)]

FIGURE 1.1 The first perspective of the field of bioinformatics is the cell. Bioinformatics has emerged as a discipline as biology has become transformed by the emergence of molecular sequence data. Databases such as the European Molecular Biology Laboratory (EMBL), GenBank, and the DNA Database of Japan (DDBJ) serve as repositories for hundreds of billions of nucleotides of DNA sequence data (see Chapter 2). Corresponding databases of expressed genes (RNA) and protein have been established. A main focus of the field of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

[Figure 1.2 panels: time of development; region of body; physiological or pathological state]

FIGURE 1.2 The second perspective of bioinformatics is the organism. Broadening our view from the level of the cell to the organism, we can consider the individual’s genome (collection of genes), including the genes that are expressed as RNA transcripts and the protein products. Thus, for an individual organism bioinformatics tools can be applied to describe changes through developmental time, changes across body regions, and changes in a variety of physiological or pathological states.

A bioinformatics approach to molecular sequence data involves the application of computer algorithms and computer databases to molecular and cellular biology. Such an approach is sometimes referred to as functional genomics. This typifies the essential nature of bioinformatics: biological questions can be approached from levels ranging from single genes and proteins to cellular pathways and networks or even whole genomic responses (Ideker et al., 2001). Our goals are to understand how to study both individual genes and proteins and collections of thousands of genes or proteins.
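As a minimal sketch of the central dogma described above, the following Python snippet transcribes a short DNA coding strand into mRNA and translates it into protein using the standard genetic code. The function names (`transcribe`, `translate`) are hypothetical, and the 30-nucleotide example sequence is an assumption chosen for illustration because it encodes the first ten amino acids of human beta globin (MVHLTPEEKS); it is not taken from the text.

```python
from itertools import product

# Standard genetic code, listed in the conventional order of bases U, C, A, G
# at the first, second, and third codon positions ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a DNA coding (sense) strand: T is replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Illustrative coding sequence (an assumption, not from the text).
dna = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
mrna = transcribe(dna)
print(mrna)             # AUGGUGCAUCUGACUCCUGAGGAGAAGUCU
print(translate(mrna))  # MVHLTPEEKS
```

This sketch assumes the input is the coding strand in the correct reading frame; it does not search for start codons, handle introns, or consider the reverse strand.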
From the cell we can focus on individual organisms, which represents a second perspective of the field of bioinformatics (Fig. 1.2). Each organism changes across different stages of development and (for multicellular organisms) across different regions of the body. For example, while we may sometimes think of genes as static entities that specify features such as eye color or height, they are in fact dynamically regulated across time and region and in response to physiological state. Gene expression varies in disease states or in response to a variety of signals, both intrinsic and environmental.

FIGURE 1.3 The third perspective of the field of bioinformatics is represented by the tree of life. The scope of bioinformatics includes all of life on Earth, including the three major branches of bacteria, archaea, and eukaryotes. Viruses, which exist on the borderline of the definition of life, are not depicted here. For all species, the collection and analysis of molecular sequence data allow us to describe the complete collection of DNA that comprises each organism (the genome). We can further learn the variations that occur between species and among members of a species, and we can deduce the evolutionary history of life on Earth. (After Barns et al., 1996 and Pace, 1997.) Used with permission.

Many bioinformatics tools are available to study the broad biological questions relevant to the individual: there are many databases of expressed genes and proteins derived from different tissues and conditions. One of the most powerful applications of functional genomics is the use of DNA microarrays to measure the expression of thousands of genes in biological samples.
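To make the idea of comparing expression across samples concrete, here is a minimal sketch of one common summary, the log2 ratio of expression between two samples. The gene symbols are real (beta globin, alpha globin, and RBP4), but the intensity values, the `log2_ratios` function, and the pseudocount are all invented for illustration; actual microarray analysis (Chapter 9) involves normalization and statistical testing that this sketch omits.

```python
import math


def log2_ratios(sample_a, sample_b, pseudocount=1.0):
    """Return log2(b / a) per gene for genes measured in both samples.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a signal intensity is 0.
    """
    return {
        gene: math.log2(
            (sample_b[gene] + pseudocount) / (sample_a[gene] + pseudocount)
        )
        for gene in sample_a
        if gene in sample_b
    }


# Hypothetical signal intensities for two conditions (made-up numbers).
control = {"HBB": 1023.0, "HBA1": 511.0, "RBP4": 255.0}
treated = {"HBB": 255.0, "HBA1": 511.0, "RBP4": 1023.0}

for gene, ratio in sorted(log2_ratios(control, treated).items()):
    print(f"{gene}\t{ratio:+.1f}")
```

A log2 ratio of +1 corresponds to a twofold increase and -1 to a twofold decrease, which is why this scale is convenient for treating up- and down-regulation symmetrically.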
At the largest scale is the tree of life (Fig. 1.3) (Chapter 13). There are many
millions of species alive today, and they can be grouped into the three major branches
of bacteria, archaea (single-celled microbes that tend to live in extreme
environments), and eukaryotes. Molecular sequence databases currently hold DNA
sequences from over 150,000 different organisms. The complete genome sequences
of thousands of organisms are now available, including organellar and viral genomes.
One of the main lessons we are learning is the fundamental unity of life at the
molecular level. We are also coming to appreciate the power of comparative
genomics, in which genomes are compared. Through DNA sequence analysis we are
learning how chromosomes evolve and are sculpted through processes such as
chromosomal duplications, deletions, and rearrangements, as well as through
whole genome duplications (Chapters 16 to 18).
Figure 1.4 presents the contents of this book in the context of these three
perspectives of bioinformatics.
Part 1: Analyzing DNA, RNA, and protein sequences
Chapter 1: Introduction
Chapter 2: How to obtain sequences
Chapter 3: How to compare two sequences
Chapters 4 and 5: How to compare a sequence to all other sequences in databases
Chapter 6: How to multiply align sequences
Chapter 7: How to view multiply aligned sequences as phylogenetic trees

Part 2: Genome-wide analysis of RNA and protein
Chapter 8: Bioinformatics approaches to RNA
Chapter 9: Microarray data analysis
Chapter 10: Protein analysis and protein families
Chapter 11: Protein structure
Chapter 12: Functional genomics

Part 3: Genome analysis
Chapter 13: The tree of life
Chapter 14: Viruses
Chapter 15: Prokaryotes
Chapter 16: The eukaryotic chromosome
Chapter 17: The fungi
Chapter 18: Eukaryotes from parasites to plants to primates
Chapter 19: The human genome
Chapter 20: Human disease

FIGURE 1.4 Overview of the chapters in this book.
A CONSISTENT EXAMPLE: HEMOGLOBIN
Throughout this book, we will focus on the globin gene family to provide a consistent example of bioinformatics and genomics concepts. The globin family is one of the best characterized in biology.
• Historically, hemoglobin was one of the first proteins to be studied, having been described in the 1830s and 1840s by Mulder, Liebig, and others.
• Myoglobin, a globin that binds oxygen in the muscle tissue, was the first protein to have its structure solved by x-ray crystallography (Chapter 11).
• Hemoglobin, a tetramer of four globin subunits (principally α2β2 in adults), is the main oxygen carrier in blood of vertebrates. Its structure was also one of the earliest to be described. The comparison of myoglobin, alpha globin, and beta globin protein sequences represents one of the earliest applications of multiple sequence alignment (Chapter 6), and led to the development of amino acid substitution matrices used to score protein relatedness (Chapter 3).
• In the 1980s as DNA sequencing technology emerged, the globin loci on human chromosomes 16 (for α globin) and 11 (for β globin) were among the first to be sequenced and analyzed. The globin genes are exquisitely regulated across time (switching from embryonic to fetal to adult forms) and with tissue-specific gene expression. We will discuss these loci in the description of the control of gene expression (Chapter 16).
• While hemoglobin and myoglobin remain the best-characterized globins, the family of homologous proteins extends to two separate classes of plant globins, invertebrate hemoglobins (some of which contain multiple globin domains within one protein molecule), bacterial homodimeric hemoglobins (consisting of two globin subunits), and flavohemoglobins that occur in bacteria, archaea, and fungi. Thus the globin family is useful as we survey the tree of life (Chapters 13 to 18).
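The comparison of globin sequences mentioned above rests on pairwise alignment, which Chapter 3 covers in detail. As a rough sketch only, the following Python code performs a global alignment by dynamic programming (in the style of Needleman and Wunsch) and reports percent identity. The scoring values (+1 match, -1 mismatch, -1 gap) and the function names are assumptions made for illustration; practical protein alignments use amino acid substitution matrices rather than this simple scheme. The beta globin fragment is the start of the sequence shown for Figure 11.1, while the alpha globin fragment is not given in the text and is included as an illustrative assumption.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences share a residue."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


beta_fragment = "MVHLTPEEKSAVTALWGKV"   # start of beta globin (Figure 11.1)
alpha_fragment = "MVLSPADKTNVKAAWGKV"   # assumed alpha globin fragment

total, top, bottom = global_align(beta_fragment, alpha_fragment)
print(top)
print(bottom)
print(f"score={total}, identity={percent_identity(top, bottom):.1f}%")
```

Percent identity here is computed over all alignment columns, including gapped ones; other conventions (for example, dividing by the length of the shorter sequence) give different numbers, so the denominator should always be stated when reporting identity.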
Another protein we will use as an example is retinol-binding protein (RBP4), a small, abundant secreted protein that binds retinol (vitamin A) in blood (Newcomer and Ong, 2000). Retinol, obtained from carrots in the form of vitamin A, is very hydrophobic. RBP4 helps transport this ligand to the eye where it is used for vision. We will study RBP4 in detail because it has a number of interesting features:
• There are many proteins that are homologous to RBP4 in a variety of species, including human, mouse, and fish (“orthologs”). We will use these as examples of how to align proteins, perform database searches, and study phylogeny.
• There are other human proteins that are closely related to RBP4 (“paralogs”). Altogether the family that includes RBP4 is called the lipocalins, a diverse group of small ligand-binding proteins that tend to be secreted into extracellular spaces (Akerstrom et al., 2000; Flower et al., 2000). Other lipocalins have fascinating functions such as apolipoprotein D (which binds cholesterol), a pregnancy-associated lipocalin, aphrodisin (an “aphrodisiac” in hamsters), and an odorant-binding protein in mucus.