
Bioinformatics and Functional Genomics


Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.

Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical, and Medical business with Blackwell Publishing.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For more information about Wiley products, visit our web site at www.wiley.com.

Cover illustration includes detail from Leonardo da Vinci (1452–1519), dated c. 1506–1507, courtesy of the Schlossmuseum (Weimar).

ISBN: 978-0-470-08585-1

Library of Congress Cataloging-in-Publication Data is available.

Printed in the United States of America


For Barbara, Ava and Lillian with all my love.


Contents in Brief

PART I ANALYZING DNA, RNA, AND PROTEIN SEQUENCES IN DATABASES

PART II GENOMEWIDE ANALYSIS OF RNA AND PROTEIN

PART III GENOME ANALYSIS


Preface to the First Edition, xxiii

PART I ANALYZING DNA, RNA, AND PROTEIN SEQUENCES IN DATABASES

Organization of The Book, 4
Bioinformatics: The Big Picture, 4
A Consistent Example: Hemoglobin, 8
Organization of The Chapters, 9
A Textbook for Courses on Bioinformatics and Genomics, 9
Key Bioinformatics Websites, 10
Nucleotide and Protein Sequences, 14
Amount of Sequence Data, 15
Organisms in GenBank, 16
Types of Data in GenBank, 18
Genomic DNA Databases, 19
cDNA Databases Corresponding to Expressed Genes, 19
Expressed Sequence Tags (ESTs), 19
ESTs and UniGene, 20
Sequence-Tagged Sites (STSs), 22
Genome Survey Sequences (GSSs), 22
High Throughput Genomic Sequence (HTGS), 23
Protein Databases, 23
National Center for Biotechnology Information, 23
Introduction to NCBI: Home Page, 23
PubMed, 23
Entrez, 24
BLAST, 25
OMIM, 25
Books, 25
Taxonomy, 25
Structure, 25
The European Bioinformatics Institute (EBI), 25
Access to Information: Accession Numbers to Label and Identify Sequences, 26
The Reference Sequence (RefSeq) Project, 27
The Consensus Coding Sequence (CCDS) Project, 29
Access to Information via Entrez Gene at NCBI, 29
Relationship of Entrez Gene, Entrez Nucleotide, and Entrez Protein, 32
Comparison of Entrez Gene and UniGene, 32
Entrez Gene and HomoloGene, 33
Access to Information: Protein Databases, 33
UniProt, 33
The Sequence Retrieval System at ExPASy, 34
Access to Information: The Three Main Genome Browsers, 35
The Map Viewer at NCBI, 35


The University of California, Santa Cruz (UCSC) Genome Browser, 35
The Ensembl Genome Browser, 35
Examples of How to Access Sequence Data, 36
HIV pol, 36
Histones, 38
Access to Biomedical Literature, 38
PubMed Central and Movement toward Free Journal Access, 39
Example of PubMed Search: RBP, 40
Perspective, 42
Gaps, 55
Pairwise Alignment, Homology, and Evolution of Life, 55
Scoring Matrices, 57
Dayhoff Model: Accepted Point Mutations, 58
PAM1 Matrix, 63
PAM250 and Other PAM Matrices, 65
From a Mutation Probability Matrix to a Log-Odds Scoring Matrix, 69
Practical Usefulness of PAM Matrices in Pairwise Alignment, 70
Important Alternative to PAM: BLOSUM Scoring Matrices, 70
Pairwise Alignment and Limits of Detection: The “Twilight Zone”, 74
Alignment Algorithms: Global and Local, 75
Global Sequence Alignment: Algorithm of Needleman and Wunsch, 76
Step 1: Setting Up a Matrix, 76
Step 2: Scoring the Matrix, 77
Step 3: Identifying the Optimal Alignment, 79
Local Sequence Alignment: Smith and Waterman Algorithm, 82
Rapid, Heuristic Versions of Smith–Waterman: FASTA and BLAST, 84
Pairwise Alignment with Dot Plots, 85
The Statistical Significance of Pairwise Alignments, 86
Statistical Significance of Global Alignments, 87
Statistical Significance of Local Alignments, 89
Percent Identity and Relative Entropy, 90
Perspective, 91
Pitfalls, 94
Web Resources, 94
Discussion Questions, 94
Problems/Computer Lab, 95
Self-Test Quiz, 95
Suggested Reading, 96
References, 97

Introduction, 101
BLAST Search Steps, 103
Step 1: Specifying Sequence of Interest, 103
Step 2: Selecting BLAST Program, 104
Step 3: Selecting a Database, 106
Step 4a: Selecting Optional Search Parameters, 106
9 Filtering and Masking, 111
Step 4b: Selecting Formatting Parameters, 112
BLAST Algorithm Uses Local Alignment Search Strategy, 115


BLAST Algorithm Parts: List, Scan, Extend, 115
BLAST Algorithm: Local Alignment Search Statistics and
BLAST Searching With Multidomain Protein: HIV-1
Finding Distantly Related Proteins: Position-Specific Iterated BLAST
Pattern-Hit Initiated BLAST (PHI-BLAST), 153
Profile Searches: Hidden Markov Models, 155
BLAST-Like Alignment Tools to Search Genomic DNA Rapidly, 161
Benchmarking to Assess Genomic Alignment Performance, 162
PatternHunter, 162
BLASTZ, 163
MegaBLAST and Discontiguous MegaBLAST, 164
BLAT, 166
LAGAN, 168
SSAHA, 168
SIM4, 169
Using BLAST for Gene Discovery, 169
Perspective, 173
Pitfalls, 173
Web Resources, 174
Discussion Questions, 174
Problems/Computer Lab, 174
Self-Test Quiz, 175
Suggested Reading, 176
References, 176

Introduction, 179
Definition of Multiple Sequence Alignment, 180
Typical Uses and Practical Strategies of Multiple Sequence Alignment, 181
Benchmarking: Assessment of Multiple Sequence Alignment Algorithms, 182
Five Main Approaches to Multiple Sequence Alignment, 184
Exact Approaches to Multiple Sequence Alignment, 184
Progressive Sequence Alignment, 185
Iterative Approaches, 190
Consistency-Based Approaches, 192
Structure-Based Methods, 194
Conclusions from Benchmarking Studies, 196


Databases of Multiple Sequence Alignments, 197
Pfam: Protein Family Database of Profile HMMs, 197
Smart, 199
Conserved Domain Database, 199
Prints, 201
Integrated Multiple Sequence Alignment Resources: InterPro and iProClass, 201
PopSet, 202
Multiple Sequence Alignment Database Curation: Manual versus Automated, 202
Multiple Sequence Alignments of Genomic Regions, 203
Perspective, 206
Hypothesis, 221
Positive and Negative Selection, 227
Neutral Theory of Molecular Evolution, 230
Molecular Phylogeny: Properties of Trees, 231
Tree Roots, 233
Enumerating Trees and Selecting Search Strategies, 234
Type of Trees, 238
Species Trees versus Gene/Protein Trees, 238
DNA, RNA, or Protein-Based Trees, 240
Five Stages of Phylogenetic Analysis, 243
Stage 1: Sequence Acquisition, 243
Stage 2: Multiple Sequence Alignment, 244
Stage 3: Models of DNA and Amino Acid Substitution, 246
Stage 4: Tree-Building Methods, 254
Phylogenetic Methods, 255
Distance, 255
The UPGMA Distance-Based Method, 256
Making Trees by Distance-Based Methods: Neighbor Joining, 259
Phylogenetic Inference: Maximum Parsimony, 260
Model-Based Phylogenetic Inference: Maximum Likelihood, 262
Tree Inference: Bayesian Methods, 264
Stage 5: Evaluating Trees, 266
Perspective, 268
Pitfalls, 268
Web Resources, 269
Discussion Questions, 269
Problems/Computer Lab, 269
Self-Test Quiz, 271
Suggested Reading, 272
References, 272

PART II GENOMEWIDE ANALYSIS OF RNA AND PROTEIN

Introduction to RNA, 279
Noncoding RNA, 282
Noncoding RNAs in the Rfam Database, 283
Transfer RNA, 283
Ribosomal RNA, 288
Small Nuclear RNA, 291
Small Nucleolar RNA, 292
MicroRNA, 293
Short Interfering RNA, 294
Noncoding RNAs in the UCSC Genome and Table Browser, 294
Introduction to Messenger RNA, 296
mRNA: Subject of Gene Expression Studies, 300
Analysis of Gene Expression in cDNA Libraries, 302
Pitfalls in Interpreting Expression Data from cDNA Libraries, 308


Full-Length cDNA Projects, 308
Serial Analysis of Gene Expression
Stage 4: Image Analysis, 317
Stage 5: Data Analysis, 318
The Relationship of DNA, mRNA, and Protein
Microarray Data Analysis Software and Data Sets, 334
Reproducibility of Microarray Experiments, 335
Microarray Data Analysis: Preprocessing, 337
Scatter Plots and MA Plots, 338
Global and Local Normalization, 343
Accuracy and Precision, 344
Robust Multiarray Analysis
Hierarchical Cluster Analysis of Microarray Data, 355
Partitioning Methods for Clustering: k-Means Clustering, 363
Clustering Strategies: Self-Organizing Maps, 363
Principal Components Analysis: Visualizing Microarray Data, 364
Supervised Data Analysis for Classification of Genes or Samples, 367
Functional Annotation of Microarray Data, 368
Perspective, 369
Pitfalls, 370
Discussion Questions, 370
Problems/Computer Lab, 371
Self-Test Quiz, 372
Suggested Reading, 373
References, 373

Introduction, 379
Protein Databases, 380
Community Standards for Proteomics Research, 381
Techniques to Identify Proteins, 381
Direct Protein Sequencing, 381
Gel Electrophoresis, 382
Mass Spectrometry, 385
Four Perspectives on Proteins, 388
Perspective 1: Protein Domains and Motifs: Modular Nature of Proteins, 389
Added Complexity of Multidomain Proteins, 394
Protein Patterns: Motifs or Fingerprints Characteristic of Proteins, 394
Perspective 2: Physical Properties of Proteins, 397
Accuracy of Prediction Programs, 399
Proteomic Approaches to Phosphorylation, 401


Proteomic Approaches to Transmembrane Domains, 401
Introduction to Perspectives 3 and 4: Gene Ontology Consortium, 402
Perspective 3: Protein Localization, 406
Perspective 4: Protein Function, 407
Perspective, 411
Pitfalls, 411
Web Resources, 412
Discussion Questions, 414
Problems/Computer Lab, 415
Self-Test Quiz, 415
Suggested Reading, 416
References, 416

Overview of Protein Structure, 421
Protein Sequence and Structure, 422
Biological Questions Addressed by Structural Biology: Globins, 423
Principles of Protein Structure, 423
Primary Structure, 424
Secondary Structure, 425
Tertiary Protein Structure: Protein-Folding Problem, 430
Target Selection and Acquisition of Three-Dimensional Protein Structures, 432
Structural Genomics and the Protein Structure Initiative, 432
The Protein Data Bank, 434
Accessing PDB Entries at the NCBI Website, 437
Integrated Views of the Universe of Protein Folds, 441
Taxonomic System for Protein Structures: The SCOP Database, 441
The CATH Database, 443
The Dali Domain Dictionary, 445
Comparison of Resources, 446
Protein Structure Prediction, 447
Homology Modeling (Comparative Modeling), 448
Fold Recognition (Threading), 450
Ab Initio Prediction (Template-Free Modeling), 450
A Competition to Assess Progress in Structure Prediction, 451
Intrinsically Disordered Proteins, 453
Protein Structure and Disease, 453
Perspective, 454
Pitfalls, 455
Discussion Questions, 455
Problems/Computer Lab, 455
Self-Test Quiz, 456
Suggested Reading, 457
References, 457

Introduction to Functional Genomics, 461
The Relationship of Genotype and Phenotype, 463
Eight Model Organisms for Functional Genomics, 465
The Bacterium Escherichia coli, 466
The Yeast Saccharomyces cerevisiae, 466
The Plant Arabidopsis thaliana, 470
The Nematode Caenorhabditis elegans, 470
The Fruitfly Drosophila melanogaster, 471
The Zebrafish Danio rerio, 471
The Mouse Mus musculus, 472
Homo sapiens: Variation in Humans, 473
Functional Genomics Using Reverse Genetics and Forward Genetics, 473
Reverse Genetics: Mouse Knockouts and the β-Globin Gene, 475
Reverse Genetics: Knocking Out Genes in Yeast Using Molecular Barcodes, 480
Reverse Genetics: Random Insertional Mutagenesis (Gene Trapping), 483
Reverse Genetics: Insertional Mutagenesis in Yeast, 486
Reverse Genetics: Gene Silencing by Disrupting RNA, 489


Forward Genetics: Chemical Mutagenesis, 491
Functional Genomics and the Central Dogma, 492
Functional Genomics and DNA: The ENCODE Project, 492
Functional Genomics and RNA, 492
Functional Genomics and Protein, 493
Proteomics Approaches to Functional Genomics, 493
Protein–Protein Interactions, 495
The Yeast Two-Hybrid System, 496
Protein Complexes: Affinity Chromatography and Mass Spectrometry, 498
The Rosetta Stone Approach, 500
Protein–Protein Interaction Databases, 501
Protein Networks, 502
Perspective, 507
Pitfalls, 508
Discussion Questions, 508
Problems/Computer Lab, 509
Self-Test Quiz, 509
Suggested Reading, 510
References, 510

PART III GENOME ANALYSIS

Introduction, 517
Five Perspectives on Genomics, 519
Brief History of Systematics, 520
History of Life on Earth, 521
Molecular Sequences as the Basis of the Tree of Life, 523
Role of Bioinformatics in Taxonomy, 524
Genome-Sequencing Projects: Overview, 525
Four Prominent Web Resources, 525
Brief Chronology, 526
First Bacteriophage and Viral Genomes (1976–1978), 527
First Eukaryotic Organellar Genome (1981), 527
First Chloroplast Genomes (1986), 528
First Eukaryotic Chromosome (1992), 529
Complete Genome of Free-Living Organism (1995), 530
First Eukaryotic Genome (1996), 532
Escherichia coli (1997), 532
First Genome of Multicellular Organism (1998), 532
Human Chromosome (1999), 533
Fly, Plant, and Human Chromosome 21 (2000), 534
Draft Sequences of Human Genome (2001), 535
Continuing Rise in Completed Genomes (2002), 535
Expansion of Genome Projects (2003–2009), 536
Genome Analysis Projects, 537
Criteria for Selection of Genomes for Sequencing, 538
Genome Size, 539
Cost, 540
Relevance to Human Disease, 541
Relevance to Basic Biological Questions, 541
Relevance to Agriculture, 541
Should an Individual from a Species, Several Individuals, or Many Individuals Be Sequenced, 541
Resequencing Projects, 542
Ancient DNA Projects, 542
Metagenomics Projects, 543
DNA Sequencing Technologies, 544
Sanger Sequencing, 544
Pyrosequencing, 545
Cyclic Reversible Termination: Solexa, 547
The Process of Genome Sequencing, 547
Genome-Sequencing Centers, 547
Sequencing and Assembling Genomes: Strategies, 548
Genomic Sequence Data: From Unfinished to Finished, 549


Finishing: When Has a Genome Been Fully Sequenced, 551
Repository for Genome Sequence Data, 552
Role of Comparative Genomics, 552
Genome Annotation: Features of Genomic DNA, 555
Annotation of Genes in Prokaryotes, 556
Annotation of Genes in Eukaryotes, 558
Summary: Questions from Genome-Sequencing Projects, 558
Perspective, 559
Pitfalls, 559
Discussion Questions, 560
Problems/Computer Lab, 560
Self-Test Quiz, 560
Suggested Reading, 561
References, 561

Introduction, 567
Classification of Viruses, 568
Diversity and Evolution of Viruses, 571
Metagenomics and Virus Diversity, 573
Bioinformatics Approaches to Problems in Virology, 574
Influenza Virus, 574
Herpesvirus: From Phylogeny to Gene Expression, 578
Human Immunodeficiency Virus, 583
Bioinformatic Approaches to HIV-1, 585
Measles Virus, 588
Perspectives, 591
Pitfalls, 591
Web Resources, 591
Discussion Questions, 592
Problems/Computer Lab, 592
Self-Test Quiz, 593
Suggested Reading, 593
References, 593

Introduction, 598
Classification of Bacteria and Archaea, 598
Classification of Bacteria by Morphological Criteria, 599
Classification of Bacteria and Archaea Based on Genome Size and Geometry, 602
Classification of Bacteria and Archaea Based on Lifestyle, 607
Classification of Bacteria Based on Human Disease Relevance, 610
Classification of Bacteria and Archaea Based on Ribosomal RNA Sequences, 611
Classification of Bacteria and Archaea Based on Other Molecular Sequences, 612
Analysis of Prokaryotic Genomes, 615
Nucleotide Composition, 615
Finding Genes, 617
Lateral Gene Transfer, 620
Functional Annotation: COGs, 622
Comparison of Prokaryotic Genomes, 625
TaxPlot, 626
Perspective, 629
Pitfalls, 630
Web Resources, 630
Discussion Questions, 630
Problems/Computer Lab, 631
Self-Test Quiz, 631
Suggested Reading, 632
References, 632

Introduction, 640
Major Differences between Eukaryotes and Prokaryotes, 641
General Features of Eukaryotic Genomes and Chromosomes, 643
C Value Paradox: Why Eukaryotic Genome Sizes Vary So Greatly, 643
Organization of Eukaryotic Genomes into Chromosomes, 644
Analysis of Chromosomes Using Genome Browsers, 645


Analysis of Chromosomes by the ENCODE Project, 647
Repetitive DNA Content of Eukaryotic Chromosomes, 650
Eukaryotic Genomes Include Noncoding and Repetitive DNA
Repeated Sequences Such as Are Found at Telomeres, Centromeres, and Ribosomal
Transcription Factor Databases and Other Genomic DNA
Pitfalls, 687
Web Resources, 688
Discussion Questions, 688
Problems/Computer Lab, 688
Self-Test Quiz, 689
Suggested Reading, 690
References, 690

Introduction, 697
Description and Classification of Fungi, 698
Introduction to Budding Yeast Saccharomyces cerevisiae, 700
Sequencing the Yeast Genome, 701
Features of the Budding Yeast Genome, 701
Exploring a Typical Yeast Chromosome, 704
Gene Duplication and Genome Duplication of S. cerevisiae, 708
Comparative Analyses of Hemiascomycetes, 712
Analysis of Whole Genome Duplication, 712
Identification of Functional Elements, 714
Analysis of Fungal Genomes, 715
Aspergillus, 715
Candida albicans, 718
Cryptococcus neoformans: Model Fungal Pathogen, 719
Atypical Fungus: Microsporidial Parasite Encephalitozoon cuniculi, 719
Neurospora crassa, 719
First Basidiomycete: Phanerochaete chrysosporium, 720
Fission Yeast Schizosaccharomyces pombe, 721
Perspective, 721
Pitfalls, 722
Web Resources, 722
Discussion Questions, 722
Problems/Computer Lab, 723


Self-Test Quiz, 723
Suggested Reading, 724
References, 724

Introduction, 729
Protozoans at the Base of the Tree Lacking Mitochondria, 732
Trichomonas, 732
Giardia lamblia: A Human Intestinal Parasite, 733
Genomes of Unicellular Pathogens: Trypanosomes and Leishmania, 735
Trypanosomes, 735
Leishmania, 736
The Chromalveolates, 738
Malaria Parasite Plasmodium falciparum and Other Apicomplexans, 738
Astonishing Ciliophora: Paramecium and Tetrahymena, 742
Nucleomorphs, 745
Kingdom Stramenopila, 746
Plant Genomes, 748
Overview, 748
Green Algae (Chlorophyta), 748
Arabidopsis thaliana Genome, 751
The Second Plant Genome: Rice, 753
The Third Plant Genome: Poplar, 755
The Fourth Plant Genome: Grapevine, 755
Moss, 756
Slime and Fruiting Bodies at the Feet of Metazoans, 756
Social Slime Mold Dictyostelium discoideum, 756
Metazoans, 758
Introduction to Metazoans, 758
Analysis of a Simple Animal: The Nematode Caenorhabditis elegans, 759
The First Insect Genome: Drosophila melanogaster, 761
The Second Insect Genome: Anopheles gambiae, 764
Silkworm, 765
450 Million Years Ago: Vertebrate Genomes of Fish, 768
310 Million Years Ago: Dinosaurs and the Chicken Genome, 771
180 Million Years Ago: The Opossum Genome, 772
100 Million Years Ago: Mammalian Radiation from Dog to Cow, 773
80 Million Years Ago: The Mouse and Rat, 774
5 to 50 Million Years Ago: Primate Genomes, 778
Perspective, 781
Pitfalls, 781
Web Resources, 782
Discussion Questions, 782
Problems/Computer Lab, 782
Self-Test Quiz, 783
Suggested Reading, 783
References, 784

Introduction, 791
Main Conclusions of Human Genome Project, 792
The ENCODE Project, 793
Gateways to Access the Human Genome, 794
NCBI, 794
Ensembl, 794
University of California at Santa Cruz Human Genome Browser, 798
NHGRI, 800
The Wellcome Trust Sanger Institute, 800
The Human Genome Project, 800
Background of the Human Genome Project, 800
Strategic Issues: Hierarchical Shotgun Sequencing to Generate Draft Sequence, 802
Features of the Genome Sequence, 805
The Broad Genomic Landscape, 806


Four Categories of Disease, 846
Monogenic Disorders, 847
Complex Disorders, 851
Genomic Disorders, 852
Environmentally Caused Disease, 855
Other Categories of Disease, 857
Disease Databases, 859
OMIM: Central Bioinformatics Resource for Human Disease, 859
Locus-Specific Mutation Databases, 862
The PhenCode Project, 865
Four Approaches to Identifying Disease-Associated Genes, 866
Linkage Analysis, 866
Genome-Wide Association Studies, 867
Identification of Chromosomal Abnormalities, 868
Genomic DNA Sequencing, 869
Human Disease Genes in Model Organisms, 870
Human Disease Orthologs in Nonvertebrate Species, 870
Human Disease Orthologs in Rodents, 876
Human Disease Orthologs in Primates, 878
Human Disease Genes and Substitution Rates, 878
Functional Classification of Disease Genes, 880
Perspective, 882
Pitfalls, 882
Web Resources, 882
Discussion Questions, 884
Problems, 884
Self-Test Quiz, 885
Suggested Reading, 885
References, 886


Preface to the Second Edition

The Neurobehavioral Unit of the Kennedy Krieger Institute has 16 hospital beds. Most of the patients are children who have been diagnosed with autism, and most engage in self-injurious behavior. They engage in self-biting, self-hitting, head-banging, and other destructive behaviors. In most cases, we do not understand the genetic contributions to such behaviors, limiting the available strategies for treatment. In my research, I am motivated to understand molecular changes that underlie childhood brain diseases. The field of bioinformatics provides tools we can use to understand disease processes through the analysis of molecular sequence data. More broadly, bioinformatics facilitates our understanding of the basic aspects of biology including development, metabolism, adaptation to the environment, genetics (e.g., the basis of individual differences), and evolution.

Since the publication of the first edition of this textbook in 2003, the fields of bioinformatics and genomics have grown explosively. In the preface to the first edition (2003) I noted that tens of billions of base pairs (gigabases) of DNA had been deposited in GenBank. Now in 2009 we are reaching tens of trillions (terabases) of DNA, presenting us with unprecedented challenges in how to store, analyze, and interpret sequence data. In this second edition I have made numerous changes to the content and organization of the book. All of the chapters are rewritten, and about 90% of the figures and tables are updated. There are two new chapters, one on functional genomics and one on the eukaryotic chromosome. I now focus on the globins as examples throughout the book. Globins have a special place in the history of biology, as they were among the first proteins to be identified (in the 1830s) and sequenced (in the 1950s and 1960s). The first protein to have its structure solved by X-ray crystallography was myoglobin (Chapter 11); molecular phylogeny was applied to the globins in the 1960s (Chapter 7); and the globin gene loci were among the first to be sequenced (in the 1980s; see Chapter 16).

The fields of bioinformatics and genomics are far too broad to be understood by one person. Thus many textbooks are written by multiple authors, each of whom brings a deeper knowledge of the subject matter. I hope that this book at least offers the benefit of a single author’s vision of how to present the material. This is essentially two textbooks: one on bioinformatics (parts I and II) and one on genomics (part III). I feel that presenting bioinformatics on its own would be incomplete without further applying those approaches to sequence analysis of genomes across the tree of life. Similarly I feel that it is not possible to approach genomics without first treating the bioinformatics tools that are essential engines of that field.

As with the previous edition, a companion website is available which provides up-to-date web links referred to in the book and PowerPoint slides arranged by chapter (www.bioinfbook.org). A resource site for instructors is also available giving detailed solutions to problems (www.wiley.com/go/pevsnerbioinformatics).

In preparing each edition of this book I read many papers and reviewed several thousand websites. I sincerely apologize to those authors, researchers and others whose work I did not cite. It is a great pleasure to acknowledge my colleagues who have helped in the preparation of this book. Some read chapters including Jef Boeke (Chapter 12), Rafael Irizarry (Chapter 9), Stuart Ray (Chapter 7), Ingo Ruczinski (Chapter 11), and Sarah Wheelan (Chapters 3 and 5–7). I thank many students and faculty at Johns Hopkins and elsewhere who have provided critical feedback, including those who have lectured in bioinformatics and genomics courses (Judith Bender, Jef Boeke, Egbert Hoiczyk, Ingo Ruczinski, Alan Scott, David Sullivan, David Valle, and Sarah Wheelan). Many others engaged in helpful discussions including Charles D. Cohen, Bob Cole, Donald Coppock, Laurence Frelin, Hugh Gelch, Gary W. Goldstein, Marjan Gucek, Ada Hamosh, Nathaniel Miller, Akhilesh Pandey, Elisha Roberson, Kirby D. Smith, Jason Ting, and N Varg

I thank my wife Barbara for her support and love as I prepared this book.


Preface to the First Edition

This book emerged from lecture notes I prepared several years ago for an introductory bioinformatics and genomics course at the Johns Hopkins School of Medicine. The first class consisted of about 70 graduate students and several hundred auditors, including postdoctoral fellows, technicians, undergraduates, and faculty. Those who attended the course came from a broad variety of fields: students of genetics, neuroscience, immunology or cell biology, clinicians interested in particular diseases, statisticians and computer scientists, virologists and microbiologists. They had a common interest in wanting to understand how they could apply the tools of computer science to solve biological problems. This is the domain of bioinformatics, which I define most simply as the interface of computer science and molecular biology. This emerging field relies on the use of computer algorithms and computer databases to study proteins, genes, and genomes. Functional genomics is the study of gene function using genome-wide experimental and computational approaches.

At its essence, the field of bioinformatics is about comparisons. In the first third of the book we learn how to extract DNA or protein sequences from the databases, and then to compare them to each other in a pairwise fashion or by searching an entire database. For the student who has a gene of particular interest, a natural question is to ask “what other genes (or proteins) are related to mine?”

In the middle third of the book, we move from DNA to RNA (gene expression) and to proteins. We again are engaged in a series of comparisons. We compare gene expression in two cell lines with or without drug treatment, or a wildtype mouse heart versus a knockout mouse heart, or a frog at different stages of development. These comparisons extend to the world of proteins, where we apply the tools of proteomics to complex biological samples under assorted physiological conditions. The alignment of multiple, related DNA or protein sequences is another form of comparison. These relationships can be visualized in a phylogenetic tree.

The last third of the book spans the tree of life, and this provides another level of comparison. Which forms of human immunodeficiency virus threaten us, and how can we compare the various HIV subtypes to learn how we might develop a vaccine? How are a mosquito and a fruitfly related? What genes do vertebrates such as fish and humans share in common, and which genes are unique to various phylogenetic lineages?


I believe that these various kinds of comparisons are what distinguish the newly emerging fields of bioinformatics and genomics from traditional biology. Biology has always concerned comparisons; in this book I quote 19th century biologists such as Richard Owen, Ernst Haeckel, and Charles Darwin who engaged in comparative studies at the organismal level. The problems we are trying to solve have not changed substantially. We still seek a more complete understanding of the unifying concepts of biology, such as the organization of life from its constituent parts (e.g., genes and proteins), the behavior of complex biological systems, and the continuity of life through evolution. What has changed is how we pursue this more complete understanding. This book describes databases filled with raw information on genes and gene products and the tools that are useful to analyze these data.

My training is as a molecular biologist and neuroscientist. My laboratory studies the molecular basis of childhood brain disorders such as Down syndrome, autism, and lead poisoning. We are located at the Kennedy Krieger Institute, a hospital for children with developmental disorders. (You can learn more about this Institute at http://www.kennedykrieger.org.) Each year over 10,000 patients visit the Institute. The hospital includes clinics for children with a variety of conditions including language disorders, eating disorders, autism, mental retardation, spina bifida, and traumatic brain injury. Some have very common disorders, such as Down syndrome (affecting about 1:700 live births) and mental retardation. Others have rare disorders, such as Rett syndrome or adrenoleukodystrophy.

We are at a time when the number of base pairs of DNA deposited in the world’s public repositories has reached tens of billions, as described in Chapter 2. We have obtained the first sequence of the human genome, and since 1995 hundreds of genomes have been sequenced. Throughout the book, you can follow the progress of science as we learn how to sequence DNA, and study its RNA and protein products. At times the pace of progress seems dazzling.

Yet at the same time we understand so little about human disease. For thousands of diseases, a defect in a single gene causes a pathological effect. Even as we discover the genes that are defective in diseases such as cystic fibrosis, muscular dystrophy, adrenoleukodystrophy, and Rett syndrome, the path to finding an effective treatment or cure is obscure. But single gene disorders are not nearly as common as complex diseases such as autism, depression, and mental retardation that are likely due to mutations in multiple genes. And all genetic disease is not nearly as common as infectious disease. We know little about why one strain of virus infects only humans, while another closely related species infects only chimpanzees. We do not understand why one bacterial strain may be pathogenic, while another is harmless. We have not learned how to develop an effective vaccine against any eukaryotic pathogen, from protozoa (such as Plasmodium falciparum that causes malaria) to parasitic nematodes.

The prospects for making progress in these areas are very encouraging specifically because of the recent development of new bioinformatics tools. We are only now beginning to position ourselves to understand the genetic basis of both disease-causing agents and the hosts that are susceptible. Our hope is that the information so rapidly accumulating in new bioinformatics databases can be translated through research into insights into human disease and biology in general.


NOTE TO READERS

This book describes over 1,000 websites related to bioinformatics and functional genomics. All of these sites evolve over time (and some become extinct). In an effort to keep the web links up-to-date, a companion website (http://www.bioinfbook.org) maintains essentially all of the website links, organized by chapter of the book. We try our best to maintain this site over time. We use a program to automatically scan all the links each month, and then we update them as necessary. An additional site is available to instructors, including detailed solutions to problems (see http://www.wiley.com).

Writing this book has been a wonderful learning experience. It is a pleasure to thank the many people who have contributed. In particular, the intellectual environment at the Kennedy Krieger Institute and the Johns Hopkins School of Medicine has been extraordinarily rich. These chapters were developed from lectures in an introductory bioinformatics course. The Johns Hopkins faculty who lectured during its first three years were Jef Boeke (yeast functional genomics), Aravinda Chakravarti (human disease), Neil Clarke (protein structure), Kyle Cunningham (yeast), Garry Cutting (human disease), Rachel Green (RNA), Stuart Ray (molecular phylogeny), and Roger Reeves (the human genome). I have benefited greatly from their insights into these areas.

I gratefully acknowledge the many reviewers of this book, including a group of anonymous reviewers who offered extremely constructive and detailed suggestions. Those who read the book include Russ Altman, Christopher Aston, David P. Leader, and Harold Lehmann (various chapters), Conover Talbot (Chapters 2 and 18), Edie Sears (Chapter 3), Tom Downey (Chapter 7), Jef Boeke (Chapter 8 and various other chapters), Michelle Nihei and Daniel Yuan (Chapter 8), Mario Amzel and Ingo Ruczinski (Chapter 9), Stuart Ray (Chapter 11), Marie Hardwick (Chapter 13), Yukari Manabe (Chapter 14), Kyle Cunningham and Forrest Spencer (Chapter 15), and Roger Reeves (Chapter 16). Kirby D. Smith read Chapter 18 and provided insights into most of the other chapters as well. Each of these colleagues offered a great deal of time and effort to help improve the content, and each served as a mentor. Of the many students who read the chapters I mention Rong Mao, Ok-Hee Jeon, and Vinoy Prasad. I particularly thank Mayra Garcia and Larry Frelin who provided invaluable assistance throughout the writing process. I am grateful to my editor at John Wiley & Sons, Luna Han, for her encouragement.

I also acknowledge Gary W. Goldstein, President of the Kennedy Krieger Institute, and Solomon H. Snyder, my chairman in the Department of Neuroscience at Johns Hopkins. Both provided encouragement, and allowed me the opportunity to write this book while maintaining an academic laboratory.

On a personal note, I thank my family for all their love and support, as well as N. Varg, Kimberly Reed, and Charles Cohen. Most of all, I thank my fiancée Barbara Reed for her patience, faith, and love.


Ask 10 investigators in human genetics what resources they need most and it is highly likely that computational skills and tools will be at the top of the list. Genomics, with its reliance on microarrays, genotyping, high throughput sequencing and the like, is intensely data-rich and for this reason is impossible to disentangle from bioinformatics. This text, with its clear descriptions, practical examples and focus on the overlaps and interdependence of these two fields, is thus an essential resource for students and practitioners alike.

Interestingly, bioinformatics and genomics are both relatively recent disciplines. Each emerged in the course of the Human Genome Project (HGP) that was conceived in the mid-1980s and began officially on October 1, 1990. As the HGP matured from its initial focus on gene maps in model organisms to the massive efforts to produce a reference human whole genome sequence, there was an increasing need for computational biology tools to store, analyze and disseminate large amounts of sequence data. For this reason, genomics increasingly relied on bioinformatics and, in turn, the field of bioinformatics flourished. Today, no serious student of genomics can imagine life without bioinformatics. This interdependence continues to grow by leaps and bounds as the questions and activities of investigators in genomics become bolder and more expansive; consider, for example, whole genome association studies (GWAS), the ENCODE project, the challenge of copy number variants, the 1000 Genomes project, epigenomics, and the looming growth of personal genome sequences and their analysis.

This textbook provides a clear and timely introduction to both bioinformatics and genomics. It is organized so that each chapter can correspond to a lecture for a course on bioinformatics or genomics and, indeed, we have used it this way for our students. Also, for readers not taking courses, the book provides essential background material. For computer scientists and biologists alike the book offers explanations of available methods and the kinds of problems for which they can be used. The sections on bioinformatics in the first part of the book describe many of the basic tools that are used to analyze and compare DNA and protein sequences. The tone is inviting as the reader is guided to learn to use different software by example. Multiple approaches for solving particular problems, such as sequence alignment and molecular phylogeny, are presented. The middle part of the book introduces functional genomics. Here again the focus is on helping the reader to learn how to do analyses (such as microarray data analysis or protein structure prediction) in a practical way. A companion website provides many data sets, so the student can get experience in performing analyses. Chapter 12 provides a roadmap to the very complicated topic of functional genomics, spanning a range of techniques and model organisms used to study gene function. The last third of the book provides a survey of the tree of life from a genomics perspective. There is an attempt to be comprehensive, and at the same time, to present the material in an interesting way, highlighting the fascinating features that make each genome unique.

Far from being a dry account of the facts of genomics and bioinformatics, the book offers many features that highlight the vitality of this field. There are discussions throughout about how to critically evaluate the performance of different software. For example, there are ‘competitions’ in which different research groups perform computational analyses on data sets that have been validated with some ‘gold standard’, allowing false positive and false negative error rates to be determined. These competitions are described in areas such as microarray data analysis (Chapter 9), mass spectrometry (Chapter 10), protein structure prediction (Chapter 11), or gene prediction (Chapter 16). The book also includes descriptions of important movements in the fields of bioinformatics and genomics, ranging from the RefSeq project for organizing sequences to the ENCODE and HapMap projects. Similarly, there is a rich description of the historical context for different aspects of bioinformatics and genomics, such as Garrod’s views on disease (Chapter 20); Ohno’s classic 1970 book on genome duplication (Chapter 17); and the earliest attempts to create alignments and phylogenetic trees of the globins.

Where will the fields of bioinformatics and genomics go in the next five to 10 years? The opportunities are vast and any prediction will certainly be incomplete, but it is certain that the rapid technological advances in sequencing will provide an unprecedented view of human genetic variation and how this relates to phenotype. In the area of human disease studies, genome-wide association studies can be expected to lead to the identification of hundreds of genes underlying complex disorders. Finally, our understanding of evolution and its relevance to medicine will expand dramatically. Dr. Pevsner’s valuable book will help the student or researcher access the tools and learn the principles that will enable this exciting research.

David Valle, M.D.
Henry J. Knott Professor and Director
McKusick-Nathans Institute of Genetic Medicine,
Johns Hopkins University School of Medicine


FIGURE 3.1 Three-dimensional structures of (a) myoglobin (accession 2MM1), (b) the tetrameric hemoglobin protein (2H35), (c) the beta globin subunit of hemoglobin, and (d) myoglobin and beta globin superimposed. The images were generated with the program Cn3D (see Chapter 11). These proteins are homologous (descended from a common ancestor), and they share very similar three-dimensional structures. However, pairwise alignment of these proteins’ amino acid sequences reveals that the proteins share very limited amino acid identity.


FIGURE 4.7 Middle portion of a typical blastp output provides a graphical display of the results. Database matches are color coded to indicate relatedness (based on alignment score), and the length of each line corresponds to the region in which that sequence aligns with the query sequence. This graphic can be useful to summarize the regions in which database matches align to the query.

FIGURE 6.10 Multiple sequence alignment of the human beta globin locus compared to other vertebrate genomic sequences. (a) A view of the beta globin gene in the UCSC Genome Browser is shown. Exons are represented by blocks (arrow 1) and tend to be highly conserved among a group of vertebrate genomes. Additionally, several regions of high conservation occur in noncoding areas (e.g., arrow 2). (b) A view of 55 base pairs at the beta globin locus. At this magnification (fewer than 30,000 base pairs), the UCSC Genome Browser displays the nucleotides of genomic DNA in the multiple sequence alignment of a group of vertebrates. The ATG codon (oriented from right to left) is indicated (three asterisks), and the human protein product is shown (amino acids from right to left matching the start of protein NP_000509, MVHLTPEEKS).
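A codon that reads from right to left in a genome browser, as the ATG does in panel (b), is typically one encoded on the strand opposite to the one displayed. The short Python sketch below illustrates that relationship with a reverse complement. The 30-nucleotide string is an assumption supplied for illustration (the text quotes only the protein start, MVHLTPEEKS), and the function name is hypothetical.

```python
# Map each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


# Assumed start of a beta globin coding sequence (illustrative only).
coding_start = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
displayed = reverse_complement(coding_start)

# On the displayed strand the start codon appears as CAT at the right-hand
# end, so the ATG is read from right to left on the opposite strand.
print(displayed[-3:])                                 # CAT
print(reverse_complement(displayed) == coding_start)  # True
```

Applying the function twice returns the original sequence, which is a quick sanity check when moving between strands.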


[Figure 8.17 flowchart. Stage 1, experimental design: compare two biological samples (normal vs. diseased tissue, cells +/- drug, early vs. late development). RNA preparation and probe preparation: isolate total RNA or mRNA, label with fluorescence (or radioactivity). Hybridize samples to microarrays. Image analysis: detect signals that represent expressed genes; quantitate. Identify co-regulated genes (e.g., cluster analysis); classify samples. Biological confirmation: independently confirm that genes are regulated, e.g., by Northern analysis. Deposit data in a database (e.g., GEO, ArrayExpress): analyze data in the context of other, related experiments; investigate behavior of expressed genes in other experimental paradigms.]

FIGURE 8.17 Overview of the process of generating high throughput gene expression data using microarrays. In stage 1, biological samples are selected for a comparison of gene expression. In stage 2, RNA is isolated and labeled, often with fluorescent dyes. These samples are hybridized to microarrays, which are solid supports containing complementary DNA or oligonucleotides corresponding to known genes or ESTs. In stage 4, image analysis is performed to evaluate signal intensities. In stage 5, the expression data are analyzed to identify differentially regulated genes (e.g., using ANOVA [Chapter 9] and scatter plots; stage 5, at left) or clustering of genes and/or samples (right). Based on these findings, independent confirmation of microarray-based findings is performed (stage 6). The microarray data are deposited in a database so that large-scale analyses can be performed.

FIGURE 8.21 Microarray images. (a) A nitrocellulose filter is probed with [32P]cDNA derived from the hippocampus of a postmortem brain of an individual with Down syndrome. There are 5000 cDNAs spotted on the array. The pattern in which genes are represented on any array is randomized. (b) Six of the signals are visualized using NIH Image software. Image analysis software must define the properties of each signal, including the likelihood that an intense signal (lower left) will “bleed” onto a weak signal (lower right). (c) A microarray from NEN Perkin-Elmer (representing 2400 genes) was probed with the same Rett syndrome and control brain samples used in Fig. 8.20. This technology employs cDNA samples that are fluorescently labeled in a competitive hybridization.


[Figure 11.1 panels: (a) Primary structure; (b) Secondary structure; (c) Tertiary structure; (d) Quaternary structure. Panel (a) shows the sequence below; arrows labeled N and C appear in panel (c).]

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

FIGURE 11.1 A hierarchy of protein structure. (a) The primary structure of a protein refers to the linear polypeptide chain of amino acids. Here, human beta globin is shown (NP_000509). (b) The secondary structure includes elements such as alpha helices and beta sheets. Here, beta globin protein sequence was input to the POLE server for secondary structure (http://pbil.univ-lyon1.fr/) where three prediction algorithms were run and a consensus was produced. Abbreviations: h, alpha helix; c, random coil; e, extended strand. (c) The tertiary structure is the three-dimensional structure of the protein chain. Alpha helices are represented as thickened cylinders. Arrows labeled N and C point to the amino- and carboxy-terminals, respectively. (d) The quaternary structure includes the interactions of the protein with other subunits and heteroatoms. Here, the four subunits of hemoglobin are shown (with an α2β2 composition and one beta globin chain highlighted) as well as four noncovalently attached heme groups. Panels (c) and (d) were produced using Cn3D software from NCBI.


FIGURE 11.3 Examples of secondary structure. (a) Myoglobin (Protein Data Bank ID 2MM1) is composed of large regions of α helices, shown as strands wrapped around barrel-shaped objects. By entering the accession 2MM1 into NCBI’s structure site, one can view this three-dimensional structure using Cn3D software. The accompanying sequence viewer shows the primary amino acid sequence. By clicking on a colored region (bracket) corresponding to an alpha helix, that structure is highlighted in the structure viewer (arrow). (b) Human pepsin (PDB 1PSN) is an example of a protein composed primarily of β strands, drawn as large arrows. Selecting a region of the primary amino acid sequence (bracket) results in a highlighting of the corresponding β strand.


FIGURE 18.8 Whole genome duplication in the ciliate Paramecium tetraurelia is inferred by analysis of protein paralogs. The outer circle displays all chromosome-sized scaffolds from the genome sequencing project. Lines link pairs of genes with a “best reciprocal hit” match. The three interior circles show the reconstructed ancestral sequences obtained by combining the paired sequences from each previous step. The inner circles are progressively smaller and reflect fewer conserved genes with a smaller average similarity. From Aury et al. (2006). Used with permission.


FIGURE 18.18 Alignment of C. elegans and C. briggsae conserved syntenic regions using the synteny viewer at WormBase (http://www.wormbase.org). Regions of chromosome I are aligned from C. elegans (above) and C. briggsae (below).


Analyzing DNA, RNA, and Protein Sequences in Databases


The study of bioinformatics includes the analysis of proteins. In the first half of the nineteenth century the Dutch researcher Gerardus Johannes Mulder (1802–1880), advised by the Swedish chemist Jöns Jacob Berzelius (1779–1848), studied the “albuminous” substances or proteins fibrin, albumin from blood, albumin from egg (ovalbumin), and the coloring matter of blood (hemoglobin). Mulder and others extracted and purified these proteins and believed that they all shared the same elemental composition (C400H260N100O120), with varying amounts of phosphorus and sulfur. Justus Liebig (1803–1873) believed that the composition of protein was C48H36N6O14. This page, from Liebig’s Animal Chemistry, or Organic Chemistry in its Applications to Physiology and Pathology (1847, p. 36), discusses albumin, fibrin, and casein (see arrowhead).


Introduction

Bioinformatics represents a new field at the interface of the twentieth-century revolutions in molecular biology and computers. A focus of this new discipline is the use of computer databases and computer algorithms to analyze proteins, genes, and the complete collection of deoxyribonucleic acid (DNA) that comprises an organism (the genome). A major challenge in biology is to make sense of the enormous quantities of sequence data and structural data that are generated by genome-sequencing projects, proteomics, and other large-scale molecular biology efforts. The tools of bioinformatics include computer programs that help to reveal fundamental mechanisms underlying biological problems related to the structure and function of macromolecules, biochemical pathways, disease processes, and evolution.

According to a National Institutes of Health (NIH) definition, bioinformatics is “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data.” The related discipline of computational biology is “the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.”

While the discipline of bioinformatics focuses on the analysis of molecular sequences, genomics and functional genomics are two closely related disciplines. The goal of genomics is to determine and analyze the complete DNA sequence of an organism, that is, its genome. The DNA encodes genes, which can be expressed as ribonucleic acid (RNA) transcripts and then in many cases further translated into protein. Functional genomics describes the use of genomewide assays in the study of gene and protein function.

The NIH Bioinformatics Definition Committee findings are reported at http://www.bisti.nih.gov/CompuBioDef.pdf. For additional definitions of bioinformatics and functional genomics, see Boguski (1994), Luscombe et al. (2001), Ideker et al. (2001), and Goodman (2002).

Bioinformatics and Functional Genomics, Second Edition. By Jonathan Pevsner. Copyright © 2009 John Wiley & Sons, Inc.

The aim of this book is to explain both the theory and practice of bioinformatics and genomics. The book is especially designed to help the biology student use computer programs and databases to solve biological problems related to proteins, genes, and genomes. Bioinformatics is an integrative discipline, and our focus on individual proteins and genes is part of a larger effort to understand broad issues in biology, such as the relationship of structure to function, development, and disease. For the computer scientist, this book explains the motivations for creating and using algorithms and databases.

There are three main sections of the book. The first part (Chapters 2 to 7) explains how to access biological sequence data, particularly DNA and protein sequences (Chapter 2). Once sequences are obtained, we show how to compare two sequences (pairwise alignment; Chapter 3) and how to compare multiple sequences (primarily by the Basic Local Alignment Search Tool [BLAST]; Chapters 4 and 5). We introduce multiple sequence alignment (Chapter 6) and show how multiply aligned sequences can be visualized in phylogenetic trees (Chapter 7). Chapter 7 thus introduces the subject of molecular evolution.

The second part of the book describes functional genomics approaches to RNA and protein and the determination of gene function (Chapters 8 to 12). The central dogma of biology states that DNA is transcribed into RNA, which is then translated into protein. We will examine bioinformatic approaches to RNA, including both noncoding and coding RNAs. We then describe the technology of DNA microarrays and examine microarray data analysis (Chapter 9). From RNA we turn to consider proteins from the perspective of protein families, and the analysis of individual proteins (Chapter 10) and protein structure (Chapter 11). We conclude the middle part of the book with an overview of the rapidly developing field of functional genomics (Chapter 12).

Since 1995, genomes have been sequenced for several thousand viruses, prokaryotes (bacteria and archaea), and eukaryotes, such as fungi, animals, and plants. The third section of the book covers genome analysis (Chapters 13 to 20). Chapter 13 provides an overview of the study of completed genomes and then describes how the tools of bioinformatics can elucidate the tree of life. We describe bioinformatics resources for the study of viruses (Chapter 14) and bacteria and archaea (Chapter 15; these are two of the three main branches of life). Next we examine the eukaryotic chromosome (Chapter 16) and explore the genomes of a variety of eukaryotes, including fungi (Chapter 17), organisms from parasites to primates (Chapter 18), and then the human genome (Chapter 19). Finally, we explore bioinformatic approaches to human disease (Chapter 20).

BIOINFORMATICS: THE BIG PICTURE

We can summarize the fields of bioinformatics and genomics with three perspectives. The first perspective on bioinformatics is the cell (Fig. 1.1). The central dogma of molecular biology is that DNA is transcribed into RNA and translated into protein. The focus of molecular biology has been on individual genes, messenger RNA (mRNA) transcripts as well as noncoding RNAs, and proteins. A focus of the field of bioinformatics is the complete collection of DNA (the genome), RNA (the transcriptome), and protein sequences (the proteome) that have been amassed (Henikoff, 2002). These millions of molecular sequences present both great opportunities and great challenges. A bioinformatics approach to molecular sequence data involves the application of computer algorithms and computer databases to molecular and cellular biology. Such an approach is sometimes referred to as functional genomics. This typifies the essential nature of bioinformatics: biological questions can be approached from levels ranging from single genes and proteins to cellular pathways and networks or even whole genomic responses (Ideker et al., 2001). Our goals are to understand how to study both individual genes and proteins and collections of thousands of genes or proteins.

[Figure 1.1 diagram. Central dogma of molecular biology: DNA, RNA, protein, cellular phenotype. Central dogma of genomics: genome, transcriptome, proteome, cellular phenotype.]

FIGURE 1.1 The first perspective of the field of bioinformatics is the cell. Bioinformatics has emerged as a discipline as biology has become transformed by the emergence of molecular sequence data. Databases such as the European Molecular Biology Laboratory (EMBL), GenBank, and the DNA Database of Japan (DDBJ) serve as repositories for hundreds of billions of nucleotides of DNA sequence data (see Chapter 2). Corresponding databases of expressed genes (RNA) and protein have been established. A main focus of the field of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

[Figure 1.2 axis labels: time of development; region of body; physiological or pathological state.]

FIGURE 1.2 The second perspective of bioinformatics is the organism. Broadening our view from the level of the cell to the organism, we can consider the individual’s genome (collection of genes), including the genes that are expressed as RNA transcripts and the protein products. Thus, for an individual organism bioinformatics tools can be applied to describe changes through developmental time, changes across body regions, and changes in a variety of physiological or pathological states.
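The central dogma described above (DNA transcribed into RNA, then translated into protein) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method from the book: the function names are hypothetical, the codon table is the standard genetic code, and the 30-nucleotide input is an assumed coding sequence chosen so that it translates to MVHLTPEEKS, the start of beta globin quoted in the text.

```python
# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


# Assumed coding sequence (illustrative only).
coding_dna = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
mrna = transcribe(coding_dna)
print(translate(mrna))  # MVHLTPEEKS
```

Real transcripts add complications that this sketch ignores, such as introns, untranslated regions, and alternative genetic codes in some organelles and organisms.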

From the cell we can focus on individual organisms, which represents a second perspective of the field of bioinformatics (Fig. 1.2). Each organism changes across different stages of development and (for multicellular organisms) across different regions of the body. For example, while we may sometimes think of genes as static entities that specify features such as eye color or height, they are in fact dynamically regulated across time and region and in response to physiological state. Gene expression varies in disease states or in response to a variety of signals, both intrinsic and environmental. Many bioinformatics tools are available to study the broad biological questions relevant to the individual: there are many databases of expressed genes and proteins derived from different tissues and conditions. One of the most powerful applications of functional genomics is the use of DNA microarrays to measure the expression of thousands of genes in biological samples.

FIGURE 1.3 The third perspective of the field of bioinformatics is represented by the tree of life. The scope of bioinformatics includes all of life on Earth, including the three major branches of bacteria, archaea, and eukaryotes. Viruses, which exist on the borderline of the definition of life, are not depicted here. For all species, the collection and analysis of molecular sequence data allow us to describe the complete collection of DNA that comprises each organism (the genome). We can further learn the variations that occur between species and among members of a species, and we can deduce the evolutionary history of life on Earth. (After Barns et al., 1996 and Pace, 1997.) Used with permission.

At the largest scale is the tree of life (Fig. 1.3) (Chapter 13). There are many millions of species alive today, and they can be grouped into the three major branches of bacteria, archaea (single-celled microbes that tend to live in extreme environments), and eukaryotes. Molecular sequence databases currently hold DNA sequences from over 150,000 different organisms. The complete genome sequences of thousands of organisms are now available, including organellar and viral genomes. One of the main lessons we are learning is the fundamental unity of life at the molecular level. We are also coming to appreciate the power of comparative genomics, in which genomes are compared. Through DNA sequence analysis we are learning how chromosomes evolve and are sculpted through processes such as chromosomal duplications, deletions, and rearrangements, as well as through whole genome duplications (Chapters 16 to 18).

Figure 1.4 presents the contents of this book in the context of these three perspectives of bioinformatics.

[Figure 1.4 diagram labels: DNA, RNA, protein; molecular sequence database.]

Part 1: Analyzing DNA, RNA, and protein sequences
Chapter 1: Introduction
Chapter 2: How to obtain sequences
Chapter 3: How to compare two sequences
Chapters 4 and 5: How to compare a sequence to all other sequences in databases
Chapter 6: How to multiply align sequences
Chapter 7: How to view multiply aligned sequences as phylogenetic trees

Part 2: Genome-wide analysis of RNA and protein
Chapter 8: Bioinformatics approaches to RNA
Chapter 9: Microarray data analysis
Chapter 10: Protein analysis and protein families
Chapter 11: Protein structure
Chapter 12: Functional genomics

Part 3: Genome analysis
Chapter 13: The tree of life
Chapter 14: Viruses
Chapter 15: Prokaryotes
Chapter 16: The eukaryotic chromosome
Chapter 17: The fungi
Chapter 18: Eukaryotes from parasites to plants to primates
Chapter 19: The human genome
Chapter 20: Human disease

FIGURE 1.4 Overview of the chapters in this book.

A CONSISTENT EXAMPLE: HEMOGLOBIN

Throughout this book, we will focus on the globin gene family to provide a consistent example of bioinformatics and genomics concepts. The globin family is one of the best characterized in biology.

† Historically, hemoglobin was one of the first proteins to be studied, having been described in the 1830s and 1840s by Mulder, Liebig, and others.

† Myoglobin, a globin that binds oxygen in the muscle tissue, was the first protein to have its structure solved by x-ray crystallography (Chapter 11).

† Hemoglobin, a tetramer of four globin subunits (principally α2β2 in adults), is the main oxygen carrier in blood of vertebrates. Its structure was also one of the earliest to be described. The comparison of myoglobin, alpha globin, and beta globin protein sequences represents one of the earliest applications of multiple sequence alignment (Chapter 6), and led to the development of amino acid substitution matrices used to score protein relatedness (Chapter 3).

† In the 1980s as DNA sequencing technology emerged, the globin loci on human chromosomes 16 (for α globin) and 11 (for β globin) were among the first to be sequenced and analyzed. The globin genes are exquisitely regulated across time (switching from embryonic to fetal to adult forms) and with tissue-specific gene expression. We will discuss these loci in the description of the control of gene expression (Chapter 16).

† While hemoglobin and myoglobin remain the best-characterized globins, the family of homologous proteins extends to two separate classes of plant globins, invertebrate hemoglobins (some of which contain multiple globin domains within one protein molecule), bacterial homodimeric hemoglobins (consisting of two globin subunits), and flavohemoglobins that occur in bacteria, archaea, and fungi. Thus the globin family is useful as we survey the tree of life (Chapters 13 to 18).

Another protein we will use as an example is retinol-binding protein (RBP4), a small, abundant secreted protein that binds retinol (vitamin A) in blood (Newcomer and Ong, 2000). Retinol, obtained from carrots in the form of vitamin A, is very hydrophobic. RBP4 helps transport this ligand to the eye where it is used for vision. We will study RBP4 in detail because it has a number of interesting features:

† There are many proteins that are homologous to RBP4 in a variety of species, including human, mouse, and fish (“orthologs”). We will use these as examples of how to align proteins, perform database searches, and study phylogeny.

† There are other human proteins that are closely related to RBP4 (“paralogs”). Altogether the family that includes RBP4 is called the lipocalins, a diverse group of small ligand-binding proteins that tend to be secreted into extracellular spaces (Akerstrom et al., 2000; Flower et al., 2000). Other lipocalins with fascinating functions include apolipoprotein D (which binds cholesterol), a pregnancy-associated lipocalin, aphrodisin (an “aphrodisiac” in hamsters), and an odorant-binding protein in mucus.


Date posted: 27/07/2019, 23:26
