Methods in molecular biology 338, gene mapping, discovery, and expression m bina (humana, 2006)



Gene Mapping, Discovery, and Expression

Methods and Protocols

Edited by Minou Bina


John M Walker, SERIES EDITOR

355 Plant Proteomics: Methods and Protocols, edited by Hervé Thiellement, Michel Zivy, Catherine Damerval, and Valerie Mechin, 2006
354 Plant–Pathogen Interactions: Methods and Protocols, edited by Pamela C Ronald, 2006
353 DNA Analysis by Nonradioactive Probes: Methods and Protocols, edited by Elena Hilario and John F MacKay, 2006
352 Protein Engineering Protocols, edited by Kristian Müller and Katja Arndt, 2006
351 C. elegans: Methods and Applications, edited by Kevin Strange, 2006
350 Protein Folding Protocols, edited by Yawen Bai and Ruth Nussinov, 2006
349 YAC Protocols, Second Edition, edited by Alasdair MacKenzie, 2006
348 Nuclear Transfer Protocols: Cell Reprogramming and Transgenesis, edited by Paul J Verma and Alan Trounson, 2006
347 Glycobiology Protocols, edited by Inka Brockhausen-Schutzbach, 2006
346 Dictyostelium discoideum Protocols, edited by Ludwig Eichinger and Francisco Rivero-Crespo, 2006
345 Diagnostic Bacteriology Protocols, Second Edition, edited by Louise O'Connor, 2006
344 Agrobacterium Protocols, Second Edition: Volume 2, edited by Kan Wang, 2006
343 Agrobacterium Protocols, Second Edition: Volume 1, edited by Kan Wang, 2006
342 MicroRNA Protocols, edited by Shao-Yao Ying, 2006
341 Cell–Cell Interactions: Methods and Protocols, edited by Sean P Colgan, 2006
340 Protein Design: Methods and Applications, edited by Raphael Guerois and Manuela López de la Paz, 2006
339 Microchip Capillary Electrophoresis: Methods and Protocols, edited by Charles S Henry, 2006
338 Gene Mapping, Discovery, and Expression: Methods and Protocols, edited by M Bina, 2006
337 Ion Channels: Methods and Protocols, edited by James D Stockand and Mark S Shapiro, 2006
336 Clinical Applications of PCR, Second Edition, edited by Y M Dennis Lo, Rossa W K Chiu, and K C Allen Chan, 2006
335 Fluorescent Energy Transfer Nucleic Acid Probes: Designs and Protocols, edited by Vladimir V Didenko, 2006
334 PRINS and In Situ PCR Protocols, Second Edition, edited by Franck Pellestor, 2006
333 Transplantation Immunology: Methods and Protocols, edited by Philip Hornick and Marlene Rose, 2006
332 Transmembrane Signaling Protocols, Second Edition, edited by Hydar Ali and Bodduluri
331 Human Embryonic Stem Cell Protocols, edited by Kursad Turksen, 2006
330 Embryonic Stem Cell Protocols, Second Edition, Vol II: Differentiation Models, edited by Kursad Turksen, 2006
329 Embryonic Stem Cell Protocols, Second Edition, Vol I: Isolation and Characterization, edited by Kursad Turksen, 2006
328 New and Emerging Proteomic Techniques, edited by Dobrin Nedelkov and Randall W Nelson, 2006
327 Epidermal Growth Factor: Methods and Protocols, edited by Tarun B Patel and Paul J Bertics, 2006
326 In Situ Hybridization Protocols, Third Edition, edited by Ian A Darby and Tim D Hewitson, 2006
325 Nuclear Reprogramming: Methods and Protocols, edited by Steve Pells, 2006
324 Hormone Assays in Biological Fluids, edited by Michael J Wheeler and J S Morley Hutchinson, 2006
323 Arabidopsis Protocols, Second Edition, edited by Julio Salinas and Jose J Sanchez-Serrano, 2006
322 Xenopus Protocols: Cell Biology and Signal Transduction, edited by X Johné Liu, 2006
321 Microfluidic Techniques: Reviews and Protocols, edited by Shelley D Minteer, 2006
320 Cytochrome P450 Protocols, Second Edition, edited by Ian R Phillips and Elizabeth A Shephard, 2006
319 Cell Imaging Techniques: Methods and Protocols, edited by Douglas J Taatjes and Brooke T. Mossman, 2006
318 Plant Cell Culture Protocols, Second Edition, edited by Victor M Loyola-Vargas and Felipe Vázquez-Flota, 2005
317 Differential Display Methods and Protocols, Second Edition, edited by Peng Liang, Jonathan Meade, and Arthur B Pardee, 2005
316 Bioinformatics and Drug Discovery, edited by Richard S Larson, 2005
315 Mast Cells: Methods and Protocols, edited by Guha Krishnaswamy and David S Chi, 2005
314 DNA Repair Protocols: Mammalian Systems, Second Edition, edited by Daryl S Henderson, 2006
313 Yeast Protocols, Second Edition, edited by Wei Xiao, 2005
312 Calcium Signaling Protocols, Second Edition, edited by David G Lambert, 2005
311 Pharmacogenomics: Methods and Protocols, edited by Federico Innocenti, 2005
310 Chemical Genomics: Reviews and Protocols, edited by Edward D Zanders, 2005
309 RNA Silencing: Methods and Protocols, edited by Gordon Carmichael, 2005
308 Therapeutic Proteins: Methods and Protocols, edited by C Mark Smales and David C James,



Totowa, New Jersey 07512

www.humanapress.com

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise without written permission from the Publisher. Methods in Molecular Biology™ is a trademark of The Humana Press Inc.

All papers, comments, opinions, conclusions, or recommendations are those of the author(s), and do not necessarily reflect the views of the publisher.

This publication is printed on acid-free paper ∞

ANSI Z39.48-1984 (American Standards Institute)

Permanence of Paper for Printed Library Materials.

Cover illustration: Figure 2, from Chapter 4, “Quantitative DNA Fiber Mapping in Genome Research and Construction of Physical Maps,” by H.-U G Weier and L W Chu

Cover design by Patricia F Cleary.

For additional copies, pricing for bulk purchases, and/or information about other Humana titles, contact Humana at the above address or at any of the following numbers: Tel.: 973-256-1699; Fax: 973-256-8341; E-mail: orders@humanapr.com; or visit our Website: www.humanapress.com

Photocopy Authorization Policy:

Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Humana Press Inc., provided that the base fee of US $30.00 per copy is paid directly to the Copyright Clearance Center at 222 Rosewood Drive, Danvers, MA 01923. For those organizations that have been granted a photocopy license from the CCC, a separate system of payment has been arranged and is acceptable to Humana Press Inc. The fee code for users of the Transactional Reporting Service is: [1-58829-575-3/06 $30.00].

Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

eISBN 1-59745-097-9

Library of Congress Cataloging in Publication Data

Gene mapping, discovery, and expression : methods and protocols / edited by Minou Bina.

p ; cm — (Methods in molecular biology ; v 338) Includes bibliographical references and index.

ISBN 1-58829-575-3 (alk paper)

1 Gene mapping—Methodology 2 Gene mapping—Data processing 3 Genetics—Technique.


Preface

Completion of the sequence of the human genome represents an unparalleled achievement in the history of biology. The project has produced nearly complete, highly accurate, and comprehensive sequences of genomes of several organisms including human, mouse, drosophila, and yeast. Furthermore, the development of high-throughput technologies has led to an explosion of projects to sequence the genomes of additional organisms including rat, chimp, dog, bee, and chicken, and the list is expanding.

The nearly completed draft of genomic sequences from numerous species has opened a new era of research in biology and in biomedical sciences. In keeping with the interdisciplinary nature of the new scientific era, the chapters in Gene Mapping, Discovery, and Expression: Methods and Protocols recapitulate the necessity of integration of experimental and computational tools for solving important research problems. The general underlying theme of this volume is DNA sequence-based technologies. At one level, the book highlights the importance of databases, genome-browsers, and web-based tools for data access and analysis. More specifically, sequencing projects routinely deposit their data in publicly available databases including GenBank, at the National Center for Biotechnology Information (NCBI) in the United States; EMBL, maintained by the European Bioinformatics Institute; and DDBJ, the DNA Data Bank of Japan. Currently, several browsers offer facile access to numerous genomic DNA sequences for gene mapping and data retrieval. These include the map-view at NCBI; the genome browser at the University of California at Santa Cruz, UCSC; and the browser maintained by Ensembl. All three browsers offer sophisticated tools for gene mapping and localization on genomic DNA.

For beginners in the field, through a specific example, one chapter provides a step-by-step procedure for localization, creating a map, and a graphical representation of genes of interest using the genome browser at UCSC. Since the drafts of the genomic sequences provide primarily a reference for studies of gene organization, additional methods are needed for understanding the complexity and dynamic nature of chromosomes. Significantly, segmental duplications are a common feature of many mammalian genomes. Therefore, Gene Mapping, Discovery, and Expression: Methods and Protocols provides a computational protocol for identifying and mapping recent segmental and gene duplications. Another chapter offers a step-by-step procedure for identifying paralogous genes, using the genome browser at UCSC.


To examine local variations in specific regions of chromosomes experimentally, a chapter provides a novel method, Quantitative DNA Fiber Mapping, that relies on fluorescent in situ hybridization (FISH) to identify, delineate, and characterize selected, often small, DNA sequences along a larger piece of the human genome. In another experimental contribution, a chapter describes a sensitive and specific method, Primed in situ labeling, that can be used for localization of single copy genes and sequences too small for detection by conventional FISH.

Novel DNA sequence-based strategies include methods for the discovery and mapping of the functional elements and the “codes” in DNA that regulate the expression of genes. The completed sequence of the human genome and the genomic sequences of model organisms offer a rich source of data for addressing this problem. A fundamental and powerful method is based on comparing the sequences from different species to identify the conserved functional elements. A chapter in this volume describes the VISTA family of computational tools, created to assist researchers in aligning DNA sequences for locating the genomic DNA regions that are highly conserved. Another chapter aims at using sequence conservation as a guide for identifying the elements that may regulate the expression of genes. This chapter describes how to use publicly available servers (Galaxy, the UCSC Table Browser, and GALA) to find genomic sequences whose alignments show properties associated with cis-regulatory modules and conserved transcription factor binding sites. Furthermore, this volume describes additional versatile and web-based tools for promoter, regulatory region, and expression analyses. These tools include CORG “COmparative Regulatory Genomics” and BEARR “Batch Extraction and Analysis of cis-Regulatory Regions.”

DNA sequence-based technologies include other strategies that could help with the identification of regulatory signals and potential protein binding elements in the regulatory regions of genes. For example, a chapter describes how a database of 9-mers from promoter regions of human protein-coding genes could be accessed via the web for the discovery of the lexical characteristics of potential regulatory motifs in human genomic DNA. These characteristics could help with predicting and classifying regulatory cis-elements according to the genes that they control.

Cis-elements can control the expression of genes in an allele-specific fashion. The analysis of allele-specific gene expression is of interest in the study of genomic imprinting. Significantly, there is growing awareness that differences in allelic expression could be widespread among autosomal non-imprinted genes. A chapter in Gene Mapping, Discovery, and Expression: Methods and Protocols provides protocols for in vivo analysis of allele-specific gene expression. These include analysis of the relative allelic abundance of transcribed RNA, and of transcription factor recruitment and Pol II loading by chromatin immunoprecipitation. Another chapter describes miRNA expression vectors containing human RNA polymerase II or III promoters for studies of the control of gene expression.

In this new scientific era, gene expression is extensively studied using microarray technologies. Two chapters describe how to use web-based tools for accessing and analyzing the microarray data. One chapter describes Gene Expression Omnibus (GEO) developed at NCBI. GEO has emerged as a leading fully public repository for gene expression data. The chapter describes how to use Web-based interfaces, applications, and graphics to effectively explore, visualize, and interpret the hundreds of microarray studies and millions of gene expression patterns stored in GEO. Another chapter describes the resources at the Stanford Microarray Database (SMD). This database offers a large amount of data for public use. The chapter describes how to use the primary tools for searching, browsing, retrieving, and analyzing data available at SMD. Furthermore, researchers, educators, and students may find SMD a very useful repository of a large quantity of publicly available data that, together with analysis tools, could be used for exploratory, unsupervised analysis and discovery.

Another level of sequence-based technologies depends on how best to analyze the structural organization of chromosomes, evaluate the sequence specificity of transcription factors, and isolate and identify the components of the protein complexes formed with DNA. More specifically, in cells, the chromosomal DNA is associated with proteins to form complexes referred to as chromatin. A major group of chromosomal proteins, the histones, functions in the compaction of DNA by forming nucleosomes. Another major group corresponds to transcription factors, which control the expression of genes through protein–DNA and protein–protein interactions. Evidence supports major roles for the underlying DNA sequence on the relative arrangement of proteins along the chromosomes. Two chapters in this volume provide DNA sequence-based methods for probing chromatin structure. One chapter describes a step-by-step procedure for detecting and analyzing nucleosome ladders on unique DNA sequences. Another offers a non-invasive method of assaying relative DNA accessibility in yeast chromatin without disrupting DNA–protein interactions.

The DNA sequence specificities of transcription factors are key components of the cis regulatory networks. However, despite their importance, the DNA binding specificities of many transcription factors remain unknown. Furthermore, methods routinely used for characterizing protein binding sites are not scalable and are time-consuming. These issues are problematic because complete, accurate, and reliable datasets of transcription factor binding elements are needed for localizing the regulatory regions of genes. This volume offers two chapters on novel DNA microarray-based technologies for rapid, high-throughput in vitro characterization of the DNA sequence specificities of transcription factors.

Lastly, several chapters in Gene Mapping, Discovery, and Expression: Methods and Protocols offer non-invasive technologies for the isolation of transcription factor complexes formed with specific DNA sequences used as bait. Identification of the components of large protein–DNA complexes is an important step in elucidating the mechanisms by which gene expression is controlled. Two chapters describe the use of powerful methods based on mass spectrometry for identification of proteins in the complexes formed with DNA. These methods can lead to the discovery of novel transcription factors with important roles in the control of gene expression.

Minou Bina

Contents

Preface
Contributors

1 Use of Genome Browsers to Locate Your Favorite Genes
Minou Bina

2 Methods for Identifying and Mapping Recent Segmental and Gene Duplications in Eukaryotic Genomes
Razi Khaja, Jeffrey R MacDonald, Junjun Zhang, and Stephen W Scherer

3 Identification and Mapping of Paralogous Genes on a Known Genomic DNA Sequence
Minou Bina

4 Quantitative DNA Fiber Mapping in Genome Research and Construction of Physical Maps
Heinz-Ulrich G Weier and Lisa W Chu

5 PRINS for Mapping Single-Copy Genes
Avirachan T Tharapel and Stephen S Wachtel

6 VISTA Family of Computational Tools for Comparative Analysis of DNA Sequences and Whole Genomes
Inna Dubchak and Dmitriy V Ryaboy

7 Computational Prediction of cis-Regulatory Modules from Multispecies Alignments Using Galaxy, Table Browser, and GALA
Laura Elnitski, David King, and Ross C Hardison

8 Comparative Promoter Analysis in Vertebrate Genomes with the CORG Workbench
Christoph Dieterich and Martin Vingron

9 cis-Regulatory Region Analysis Using BEARR
Vinsensius Berlian Vega

10 A Database of 9-Mers from Promoter Regions of Human Protein-Coding Genes
Minou Bina, Phillip Wyss, and Syed Rehan Shah

11 A Program Toolkit for the Analysis of Regulatory Regions of Genes
Phillip Wyss, Sheryl A Lazarus, and Minou Bina

12 Analysis of Allele-Specific Gene Expression
Julian C Knight

13 Construction of microRNA-Containing Vectors for Expression in Mammalian Cells
Yoko Fukuda, Hiroaki Kawasaki, and Kazunari Taira

14 Mining Microarray Data at NCBI’s Gene Expression Omnibus (GEO)
Tanya Barrett and Ron Edgar

15 The Stanford Microarray Database: A User’s Guide
Jeremy Gollub, Catherine A Ball, and Gavin Sherlock

16 Detecting Nucleosome Ladders on Unique DNA Sequences in Mouse Liver Nuclei
Tomara J Fleury, Alfred Cioffi, and Arnold Stein

17 DNA Methyltransferase Probing of DNA–Protein Interactions
Scott A Hoose and Michael P Kladde

18 Protein Binding Microarrays (PBMs) for Rapid, High-Throughput Characterization of the Sequence Specificities of DNA Binding Proteins
Michael F Berger and Martha L Bulyk

19 Quantitative Profiling of Protein-DNA Binding on Microarrays
Jiannis Ragoussis, Simon Field, and Irina A Udalova

20 Analysis of Protein-DNA Binding by Streptavidin–Agarose Pulldown
Kenneth K Wu

21 Isolation and Mass Spectrometry of Specific DNA Binding Proteins
Mariana Yaneva and Paul Tempst

22 Isolation of Transcription Factor Complexes by In Vivo Biotinylation Tagging and Direct Binding to Streptavidin Beads
Patrick Rodriguez, Harald Braun, Katarzyna E Kolodziej, Ernie de Boer, Jennifer Campbell, Edgar Bonte, Frank Grosveld, Sjaak Philipsen, and John Strouboulis

Index

Contributors

MICHAEL F BERGER • Biophysics Program, Harvard University, Boston, MA

MINOU BINA • Department of Chemistry, Purdue University, West Lafayette, IN

EDGAR BONTE • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands

HARALD BRAUN • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands

MARTHA L BULYK • Department of Medicine, Division of Genetics; Department of Pathology; and Harvard–MIT Division of Health Sciences & Technology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA

JENNIFER CAMPBELL • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands

LISA W CHU • Department of Genome Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA

ALFRED CIOFFI • Department of Biological Sciences, Purdue University, West Lafayette, IN

ERNIE DE BOER • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands

CHRISTOPH DIETERICH • Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Berlin, Germany

INNA DUBCHAK • Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA

RON EDGAR • National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD

LAURA ELNITSKI • Genome Technology Branch, National Institutes of Health, Rockville, MD

SIMON FIELD • University of Oxford, Oxford, UK

TOMARA J FLEURY • Department of Biological Sciences, Purdue University, West Lafayette, IN

YOKO FUKUDA • Department of Chemistry and Biotechnology, The University of Tokyo, Japan

JEREMY GOLLUB • Department of Biochemistry, Stanford University Medical School, Stanford, CA


FRANK GROSVELD • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands

ROSS C HARDISON • Department of Biochemistry and Molecular Biology, Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, PA

SCOTT A HOOSE • Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX

HIROAKI KAWASAKI • Department of Chemistry and Biotechnology, The University of Tokyo, Japan

RAZI KHAJA • Program in Genetics and Genomic Biology, Research Institute, The Hospital for Sick Children, Toronto, ON, Canada

DAVID KING • Department of Biochemistry and Molecular Biology, Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, PA

MICHAEL P KLADDE • Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX

JULIAN C KNIGHT • Wellcome Trust Centre for Human Genetics, University


JOHN STROUBOULIS • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands

KAZUNARI TAIRA • Department of Chemistry and Biotechnology, The University of Tokyo, Japan

PAUL TEMPST • Molecular Biology Program, Memorial Sloan-Kettering Cancer Center, New York, NY

AVIRACHAN T THARAPEL • Department of Pediatrics, University of Tennessee, Memphis, TN

IRINA A UDALOVA • Kennedy Institute of Rheumatology, Imperial College, London, UK

VINSENSIUS BERLIAN VEGA • Genome Institute of Singapore, Singapore

MARTIN VINGRON • Max Planck Institute for Molecular Genetics, Germany

STEPHEN S WACHTEL • Department of Obstetrics and Gynecology, University

MARIANA YANEVA • Memorial Sloan-Kettering Cancer Center, New York, NY

JUNJUN ZHANG • The Hospital for Sick Children, Toronto, ON, Canada


From: Methods in Molecular Biology, vol. 338: Gene Mapping, Discovery, and Expression: Methods and Protocols. Edited by: M. Bina © Humana Press Inc., Totowa, NJ

1

Use of Genome Browsers to Locate Your Favorite Genes

Minou Bina

Summary

The completion of whole-genome sequencing projects offers the opportunity of creating high-resolution maps of specific segments in a known genomic DNA sequence. For this purpose, several genome browsers have been created. They include the map-view (http://www.ncbi.nlm.nih.gov/mapview/), the Ensembl genome browser (http://www.ensembl.org/), and the genome browser at UCSC (http://genome.ucsc.edu/). For beginners in the field, through a specific example, this chapter provides a step-by-step procedure for creating a map using the genome browser at UCSC. The example describes mapping, in the human genome, the promoter region of the NF-IL6 gene. The procedure is applicable to creating maps of the desired regions in genomes of other species available at the genome browser at UCSC.

Key Words: The Human Genome Project; gene mapping; gene localization.

1 Introduction

The rapid advances of genome sequencing projects have offered the opportunity to map and locate genes of interest, without resorting to time-consuming and costly experimental procedures. Large sequencing projects routinely deposit their data in publicly available databases including GenBank, at the National Center for Biotechnology Information (NCBI) in the United States (1,2); EMBL, maintained by the European Bioinformatics Institute (3); and DDBJ, the DNA Data Bank of Japan (4).

Currently, several browsers offer facile access to numerous genomic DNA sequences for gene mapping and data retrieval. These include NCBI (1,2); the genome browser at the University of California at Santa Cruz (UCSC) (5,6); and the browser maintained by Ensembl (7). All three browsers offer sophisticated tools for gene mapping and localization on genomic DNA. This chapter provides an example of how to use the genome browser at UCSC (5,6) to obtain a map and a graphical view of a known DNA sequence.

2 Materials

The gene localization procedure was done on a PC equipped with the Windows XP operating system. The general procedure should be applicable to other computers (see Note 1).

3 Methods

The genome browser at UCSC provides numerous sophisticated tools for data access, analyses, and visualization (5,6). The following sections will guide a beginner in the field through simple and general procedures for locating and mapping the positions of a known sequence on genomic DNA.

1. Use the BLAT sequence alignment program at the genome browser at UCSC (8).
2. To access BLAT, go to the browser’s home page (http://genome.ucsc.edu/). Click on BLAT, one of the options listed on the left side of the page. You will obtain a query box for pasting a DNA sequence for analysis by BLAT.
3. To conduct a BLAT search, you should provide the query sequence in the FASTA format. In this format, the sequence is presented as a continuous chain of nucleotides, without any numbering and blank spaces (Fig. 1).
4. If you know the GenBank accession number of the DNA sequence of interest, perform the following steps to obtain a FASTA formatted file:
   a. Go to NCBI (http://www.ncbi.nlm.nih.gov/).
   b. Use the pull-down menu next to the query box that contains the words All Databases.
   c. On the menu, select nucleotides.
   d. In the query box next to “for,” type the known accession number. As an example, type AF350408. This accession number contains the nucleotide sequence of a cloned human DNA fragment that includes the promoter region of the NF-IL6 gene (9).
   e. After typing the accession number in the NCBI query box, click on go. You will obtain a page that includes the accession number and a description of the sequence file.
   f. Above the accession number, you will find the word report, in red letters. Click on report. You will obtain a pull-down menu. On the menu, select FASTA. You will obtain the FASTA formatted version of the sequence.
5. Copy the entire sequence.
6. Paste it in the BLAT query box at the UCSC browser, described above in step 2.
7. Alternatively, you can scroll down the BLAT page to use the box that would allow you to upload a FASTA formatted file from your computer.


8. On the top of the BLAT query box, for genome, select human. Click on the pull-down menu to view the extensive list of genomic sequences offered by the browser. (You can also use the procedures described here for mapping and graphical representation of sequences from other species.)
9. Above the BLAT query box, in the box under assembly, choose the latest version (in our example, 2004). Alternatively, from the pull-down menu, select an earlier version of a genomic DNA sequence.
10. Use the pull-down menu under the Query type and select DNA.
11. For the other variables (score and output type), use the default values.
12. Finally, click on submit.
13. You will obtain a page listing the results of the BLAT search (Fig. 2).
14. Examine the column tagged score (Fig. 2). You will find the highest score (6455) for an extended region (positions 7–6477), with 100% sequence identity to the query submitted for analysis by BLAT (Fig. 2). In some cases, for additional extended regions, you might obtain high scores and high sequence identity to the query. These hits may represent pseudogenes or recent duplications that could be examined for further evaluation.
15. Next to each query result (Your Seq., Fig. 2), right-click on details to open the link in a new window. This link provides useful information (see Note 2). For example, on the top of the new window, you will find the chromosomal positions of the query sequence (in that example, chr20:48234366-48240842). Below the positions, you will find the submitted sequence with regions highlighted in different colors. Scroll down to view the results of side-by-side alignment. The quality of the alignment can guide your decision as to whether the reported matches with the query sequence are significant (see Note 2).
16. Go to the browser to obtain a map (a graphical view) of the query sequence. To do so, on the page summarizing the result of the BLAT search (Fig. 2), choose the top line, the line with the highest score. Right-click on the browser link on the left side, to open and view the map in a new window (Fig. 3).

Fig. 1. Example of a FASTA formatted DNA sequence.


17. Examine the page closely. The browser provides an extensive list of options from which you can choose for viewing the map (5). For example, on the top of the page, you can use specific control keys (i.e., the left and right arrows) to move to and view the flanking regions in the map. You can click on zoom buttons to zoom in or out. In the example, click on the arrow marked > twice, to move the map to include the coding region of the sequence. In that example, you will find the coding region of the human NF-IL6 gene, which is also known as C/EBPbeta (Fig. 4).
18. Select from the options listed below the graph (mapping and sequencing tracks), to choose what you want to include in the graph. The options are extensive. You can choose options that would allow the inclusion of additional details in the map. Each time you choose an option, or a set of options, click on the refresh button. The browser will display the selected annotations as a series of horizontal tracks (5).
19. On the graph, the arrows on the tracks representing the gene provide the direction of transcription (Fig. 4). Click on a given track to obtain useful links and information about that track.
20. To obtain the sequence of the region shown in the graph, on the top bar (Fig. 3), right-click on DNA to open a new window for viewing the sequence. Follow the instructions for obtaining the desired format (for example, you can choose to mask the repetitive DNA sequences in lower case letters).
21. To obtain an output of the graph, for your record or for publication, on the top bar (Fig. 3), right-click on PDF/PS to open a new window that would provide the options to save the plot in a PDF or a postscript file (Fig. 4).

Fig. 2. A partial listing of the result of the BLAT search.


Fig. 3. Graphical representation of the promoter region of the human C/EBPbeta (NF-IL6) gene in the genome browser at UCSC. The top of this view shows the control keys for zooming in or out, as well as keys for moving the displayed region to the left or to the right. The bottom view includes a partial listing of the control keys for adding details to and removing tracks from the map.


Fig. 4. Graphical representation of a region that includes both the promoter and the coding region of the human C/EBPbeta (NF-IL6) gene. This representation was obtained by using the control key move, for including the gene in the displayed region. Subsequently, the result was saved in a PDF file. This was done by selecting the key marked PDF/PS, shown on the top of Fig. 3.


22. To obtain a sequence alignment of the conserved regions (Fig. 3), click on the area next to the track named conservation (see Note 3).
23. At the UCSC genome browser, the page that shows the map (Fig. 3) also provides the option of viewing that map in the Ensembl and NCBI browsers. On that page, the links are shown on the top bar (Fig. 3). Click on these options to view the map of the sequence of interest in these alternative browsers.

4 Notes

1. Opening a new window for each of the desired links is recommended. This would circumvent problems with losing the connection to the preceding page. The right-click option, for opening a new window to a link, is available on PCs that use Microsoft operating systems. This option might not be available on other operating systems.

2. Viewing the in-depth information can help you to evaluate whether the matches with the genomic DNA are significant.

3. Currently, multispecies alignment is provided for 30,000 bases or less. Therefore, to obtain an alignment, zoom in on the desired region. This works relatively well for viewing the conserved regions in the promoter regions of genes. To do so, scroll to the left or right, depending on the direction of the transcript. Identify the longest cDNA by including the track for known genes. Subsequently, zoom in on the 5' end of the gene, to bring the viewed region to 30,000 bases or less. Click on refresh. Then click on the track named conservation (Fig. 3). You will obtain alignments of the nucleotide sequences of the selected species.

References

1. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2005) GenBank. Nucleic Acids Res. 33 (Database issue), D34–38.

2. Wheeler, D. L., Barrett, T., Benson, D. A., et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 33 (Database issue), D39–45.

3. Kanz, C., Aldebert, P., Althorpe, N., et al. (2005) The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 33 (Database issue), D29–33.

4. Miyazaki, S., Sugawara, H., Ikeo, K., Gojobori, T., and Tateno, Y. (2004) DDBJ in the stream of various biological data. Nucleic Acids Res. 32 (Database issue), D31–34.

5. Kent, W. J., Sugnet, C. W., Furey, T. S., et al. (2002) The human genome browser at UCSC. Genome Res. 12, 996–1006.

6. Karolchik, D., Hinrichs, A. S., Furey, T. S., et al. (2004) The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32 (Database issue), D493–496.

7. Hubbard, T., Andrews, D., Caccamo, M., et al. (2005) Ensembl 2005. Nucleic Acids Res. 33 (Database issue), D447–453.

8. Kent, W. J. (2002) BLAT--the BLAST-Like Alignment Tool. Genome Res. 12, 656–664.

9. Yang, Y., Pares-Matos, E. I., Tesmer, V. M., et al. (2002) Organization of the promoter region of the human NF-IL6 gene. Biochim. Biophys. Acta 1577, 102–108.


2

Methods for Identifying and Mapping Recent Segmental and Gene Duplications in Eukaryotic Genomes

Razi Khaja, Jeffrey R. MacDonald, Junjun Zhang, and Stephen W. Scherer

Summary

The aim of this chapter is to provide instruction for analyzing and mapping recent segmental and gene duplications in eukaryotic genomes. We describe a bioinformatics-based approach utilizing computational tools to manage eukaryotic genome sequences to characterize and understand the evolutionary fates and trajectories of duplicated genes. An introduction to bioinformatics tools and programs such as BLAST, Perl, BioPerl, and the GFF specification provides the necessary background to complete this analysis for any eukaryotic genome of interest.

Key Words: Bioinformatics; BLAST/MegaBLAST; gene duplication; gene ontology; genome assembly; genomic disorder; GFF (Generic Feature Format); homology; neofunctionalization; paralogous; Perl/BioPerl; pseudogene; RefSeq; RepeatMasker; segmental duplication; sequence alignments; subfunctionalization.

1 Introduction

With the completion of the human genome sequence and the increasing availability of whole genome shotgun sequences (WGS) for numerous other eukaryotic species, we are poised to begin to understand the complexity and dynamic nature of chromosomes. Segmental duplications are nearly identical segments of DNA at two or more sites in a genome; for human they comprise about 3.5 to 5% of the total DNA content (1,2). Segmental duplications also account for 1.2 to 2% of the mouse genome (3,4) and approx 3% of the rat genome (5). Segmental duplications (also called low copy repeats [LCRs]) can be predisposition sites for increased opportunity of nonallelic homologous recombination leading to deletion, inversion, or duplication of large segments of DNA (6). These structural alterations may lead to the gain or loss of dosage-sensitive genetic material and may result in a spectrum of diseases defined as genomic disorders (7–9).

The presence of segmental duplications is a common feature of many mammalian genomes, and their involvement in chromosome evolution and natural variation is an area of active investigation (10–12). Duplication of large segments of DNA can generate duplicate genes in whole (13), or in part (14), and may lead to an expanding repertoire of similar gene products. The identification of recent segmental duplication therefore gives us the ability to map the origin and fate of duplicate genes, which are a driving force in species evolution (see Note 1).

Here we define recent segmental duplications as paralogous regions of a genome having a length greater than 5000 nucleotides (nt) and having greater than 90% DNA sequence identity. We present a computational protocol for identifying and mapping recent segmental and gene duplications in eukaryotic genomes. The major procedures involved in identifying recent segmental and gene duplications include comparing genomic sequences using BLAST (15), parsing and filtering BLAST alignments, and mapping genes to segmental duplications to identify gene duplicates. We note that much of our methodology has arisen in an ongoing initiative to map segmental duplications accurately in the human (2), chimpanzee, mouse (3), and other mammalian genomes, as displayed at publicly available websites (http://projects.tcag.ca/humandup and http://projects.tcag.ca/xenodup).

2 Materials

1. A modest-sized cluster-computer or super-computer with 4 GB of RAM per CPU running any variant of a UNIX or Linux operating system.

2. Internet connection, ftp utilities (e.g., ftp, ncftp, wget).

3. Archiving utilities (e.g., unzip).

4. An assembled genome sequence of a eukaryotic organism that is lower case masked for repetitive elements.

5. The BLAST suite of programs (particularly formatdb and MegaBLAST).

duplications in eukaryotic genomes, the methods summarize: (5) the procedure for performing sequence alignments of all possible pairs of chromosomes using MegaBLAST, (6) how to convert MegaBLAST alignments into Generic Feature Format (GFF), (7) the criteria for filtering GFF records, and (8) how to chain alignments together. Furthermore, we describe how to identify gene duplicates by (9) mapping RefSeq genes to segmental duplications and (10) using the Gene Ontology to characterize gene duplicates by function.

3.1 Prerequisites/Assumptions

To perform segmental duplication analysis of eukaryotic genomes, the reader needs access to a modest-sized cluster-computer or super-computer with a minimum of 4 GB of RAM available to each CPU (see Note 2) running any variant of a UNIX or Linux operating system (see Note 3). Competency in using UNIX command line utilities and programming in Perl is also a necessity (see Note 4). It is also a prerequisite that the BioPerl package (see Note 5) be available in the computing environment. Furthermore, the reader should be capable of using BioPerl to convert MegaBLAST alignment files into GFF records and should be familiar with the GFF version 3 specification (see Note 6).

3.2 Download Genome of Interest

This protocol requires that the genome sequence being targeted for the identification of segmental and gene duplications be assembled and masked for repetitive elements.

Although this protocol is applicable to all eukaryotic genomes (see Note 7), the mouse genome will be used as our example. The May 2004 mouse genome assembly (referred to as mm5 by UCSC or Build 33 by NCBI) can be downloaded from UCSC (http://genome.ucsc.edu) as a zip file by executing the following command:

% wget http://hgdownload.cse.ucsc.edu/goldenPath/mm5/bigZips/chromFa.zip

This zip file contains the mouse genome assembly with one FASTA file for each chromosome. Repetitive elements within each chromosome sequence have been identified with RepeatMasker (http://www.repeatmasker.org) and are represented in lower case letters; nonrepeating DNA sequences are shown in upper case letters. Once the genome has been downloaded, the zip file is uncompressed by executing the following command:

% unzip chromFa.zip

Uncompressing this file will extract one FASTA file for each chromosome sequence. For the mouse genome, this should extract files chr1.fa to chr19.fa, chrX.fa, chrY.fa, and chrM.fa (mitochondrial DNA), as well as chr1_random.fa to chr19_random.fa, chrX_random.fa, chrY_random.fa, and chrUn_random.fa (see Note 8).
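As a quick sanity check after unzipping, the file names listed above can be generated and compared with the contents of the working directory. The following is a minimal Python sketch; the function names are hypothetical and the list simply mirrors the mm5 file names given in the text.

```python
from pathlib import Path


def expected_mouse_fasta_files():
    """Return the FASTA file names the text lists for the mm5 mouse assembly."""
    autosomes = [str(n) for n in range(1, 20)]  # chromosomes 1 to 19
    # Main chromosome assemblies, including the mitochondrial sequence.
    main = [f"chr{c}.fa" for c in autosomes + ["X", "Y", "M"]]
    # Pseudo chromosomes (see Note 8); there is no chrM_random.fa in the list.
    pseudo = [f"chr{c}_random.fa" for c in autosomes + ["X", "Y", "Un"]]
    return main + pseudo


def missing_files(directory="."):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected_mouse_fasta_files()
            if not (base / name).is_file()]
```

Calling `missing_files(".")` in the directory where chromFa.zip was extracted should return an empty list if every file named in the text is present.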

3.3 Download and Install the BLAST Suite of Programs

To perform sequence alignments for identification of segmental duplications in the genome, download and install the BLAST suite of programs on your computing environment. The BLAST suite of programs is available from the NCBI as precompiled binary distributions or as source code. The precompiled binaries are available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/. These are compiled for many operating systems and hardware architectures (see Note 9). Installation is a simple matter of downloading and then uncompressing the distribution for your computing environment. Documentation supplied with the BLAST suite of programs describes command line options for each of the utilities. In this protocol, the formatdb and MegaBLAST (16) command line tools are used to identify segmental duplications in the genome. Formatdb is used to create BLAST databases, and MegaBLAST is used to perform sequence alignments.

3.4 Create BLAST Databases, One for Each Chromosome

Once the genome has been downloaded and the BLAST suite has been installed, create BLAST databases for each of the chromosome FASTA files using the formatdb command line utility. The formatdb command line utility must be used to format a FASTA file such as chr7.fa into a BLAST database before it can be searched by MegaBLAST. The following command is an example of using formatdb to create a BLAST database:

% formatdb -i chr7.fa -p F

Executing this command will create the files chr7.fa.nhr, chr7.fa.nin, and chr7.fa.nsq, which collectively represent the BLAST database for mouse chromosome 7. This database will be searched by MegaBLAST in order to produce sequence alignments for the purpose of identifying segmental duplications in the genome. BLAST databases must be created iteratively for every FASTA file for each chromosome sequence in the genome, including the pseudo chromosomes (see Note 8).

In the example above, "-i chr7.fa" specifies the name of the input file, and "-p F" specifies that the sequence contained within the file is nucleotide. Below is a detailed description of the command line options used:

formatdb 2.2.10 arguments:

-i Input file(s) for formatting (this parameter must be set)

[File In]
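Because a database has to be built for every FASTA file, the formatdb call is usually wrapped in a loop. The sketch below is one possible way to do this in Python; it only uses the "-i" and "-p F" options shown in the text, and the function names and the `run` switch are hypothetical conveniences.

```python
import subprocess


def formatdb_command(fasta_path):
    """Build the formatdb argument list for one nucleotide FASTA file."""
    # "-p F" marks the input as nucleotide sequence, as in the example above.
    return ["formatdb", "-i", str(fasta_path), "-p", "F"]


def build_databases(fasta_paths, run=False):
    """Return one formatdb command per FASTA file; execute them if run=True."""
    commands = [formatdb_command(path) for path in fasta_paths]
    if run:
        for command in commands:
            # check=True stops at the first failing formatdb call.
            subprocess.run(command, check=True)
    return commands
```

With `run=False` (the default) the function only returns the command lines, which is convenient for inspecting them or handing them to a cluster scheduler.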

3.5 Perform Sequence Alignments of All Possible Pairs of Chromosomes Using MegaBLAST

The MegaBLAST program is used to perform sequence alignments because it was designed to identify long alignments efficiently between similar sequences. Since we have defined recent segmental duplications as long stretches of DNA (>5000 nt) having greater than 90% sequence identity, MegaBLAST is well suited to identifying these paralogous regions of the genome. After creating the BLAST databases for each chromosome, MegaBLAST is used to perform sequence alignments between all possible pairs of chromosomes. In other words, each FASTA file is compared with each of the BLAST databases (see Note 10).

The following command is an example of using MegaBLAST to find sequence alignments between mouse chromosome 7 and mouse chromosome 3.

% megablast -d chr7.fa -i chr3.fa -D 2 -F 'm' -U T -o chr7.3.blast

In the example above, the "-d chr7.fa" option specifies that MegaBLAST use the mouse chromosome 7 BLAST database as the subject of this comparison, and the "-i chr3.fa" option specifies mouse chromosome 3 as the query sequence. Sequence alignments are stored in the chr7.3.blast output file as specified by the option "-o chr7.3.blast", and the format of output generated is "traditional BLAST output" as specified by the "-D 2" option. Furthermore, "-U T" specifies that lower case letters in the query sequence should be recognized as repetitive elements. The "-F 'm'" option denotes that the MegaBLAST algorithm should not find word matches in the repetitive regions of the query sequence but should allow for extension of sequence alignments through these regions.

Below is a detailed description of the command line options that are required to perform sequence alignments using MegaBLAST to identify segmental duplications in a genome:

0 - alignment endpoints and score

1 - all ungapped segments endpoints

2 - traditional BLAST output

3 - tab-delimited one-line format [Integer]

A subject database and a query sequence of different chromosomes are used to identify interchromosomal segmental duplications (i.e., duplications that occur between different chromosomes). Executing MegaBLAST on a subject database and query sequence generates many sequence alignments. Not all of these represent sequences involved in segmental duplications, so further steps are required to convert, filter, and process these alignments based on a variety of criteria. These criteria are described in the sections below.
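The all-against-all comparison described in this section can be scripted by pairing every BLAST database with every query FASTA file. The following Python sketch only assembles the command lines, using the options from the example command; the output naming helper is a hypothetical convention modeled on chr7.3.blast, and the function names are not part of any BLAST tool.

```python
from itertools import product


def output_name(db_fasta, query_fasta):
    """Name the output after the database and the query, e.g. chr7.3.blast."""
    db = db_fasta[:-3] if db_fasta.endswith(".fa") else db_fasta
    query = query_fasta[:-3] if query_fasta.endswith(".fa") else query_fasta
    if query.startswith("chr"):
        query = query[3:]  # chr3 -> 3, mirroring the example file name
    return f"{db}.{query}.blast"


def megablast_commands(fasta_files):
    """Return one MegaBLAST command per ordered (database, query) pair.

    Self-comparisons are included, so N files give N * N commands.
    """
    commands = []
    for db, query in product(fasta_files, repeat=2):
        commands.append(["megablast", "-d", db, "-i", query, "-D", "2",
                         "-F", "m", "-U", "T", "-o", output_name(db, query)])
    return commands
```

Each returned list can be passed to `subprocess.run` for serial execution, or written to a job file for a cluster scheduler if the comparisons are to be run in parallel.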

3.6 Convert MegaBLAST Alignments Into GFF Format

In the previous step, MegaBLAST was used to generate traditional BLAST output for all pairs of chromosomes. Sequence alignments in this format are extremely informative since they visualize detailed information about homologous DNA, showing locations of nucleotide mismatches and small insertions and deletions (Fig. 1).

However, programmatically it is difficult to identify duplications from BLAST results in this format, as this output is generated for visual inspection. In order to identify segmental duplications from BLAST results without loss of information, it is necessary to transform traditional BLAST output into a tabular format. The current Generic Feature Format version 3 (GFF3) specification (http://song.sourceforge.net/gff3.shtml) is a widely accepted tabular format for describing genes and other features associated with DNA, RNA, and protein sequences. The BioPerl project (http://www.bioperl.org) supports the parsing of different output formats, including traditional BLAST output, into GFF3.

Using the Bio::SearchIO module that is part of the BioPerl package, convert the BLAST alignment files for each pair of chromosomes into GFF3 records. Below is an example of the result of converting the alignment shown in Fig. 1 as a GFF3 record:

chr7 UCSC_hg17 match 61612 61790 0.0 - Target=chr3 36022 36201;Gap=M6 I1M25 I2 M90 D3 M53;percentId=94.96;alnLength=2123;matches=2016;gaps=24;bitScore=3336;rawScore=1683

To understand how to generate records in GFF3 format, the reader should understand the GFF3 specification. This will enable the user to apply the Bio::SearchIO module to convert BLAST alignment files to generate this output. This format allows storage of all information from the traditional BLAST output, including subject sequence start and stop coordinates, query sequence start and stop coordinates, e-value, strand, percent identity, alignment length, matched nucleotides, gaps, bit score, raw score, and detailed alignment information.

3.7 Filter GFF Records Based on Many Criteria

After converting the traditional BLAST alignments into GFF format, some alignments are excluded since not all are components of recent segmental duplications. To identify sequences meeting a stringent categorization of being a “recent segmental duplication,” GFF records are filtered based on the criteria described below.
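Before any filter can be applied, each GFF record has to be read back into named fields. The Python sketch below is one way to do that, assuming the nine tab-separated columns defined by the GFF3 specification and semicolon-separated key=value attributes; the attribute names (Target, percentId, alnLength) follow the example record in the text, and the function names are hypothetical.

```python
def parse_gff3_line(line):
    """Split one GFF3 line into a dictionary of typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for item in attr_text.split(";"):
        if item:
            key, _, value = item.partition("=")
            attributes[key.strip()] = value.strip()
    return {"seqid": seqid, "source": source, "type": ftype,
            "start": int(start), "end": int(end), "score": score,
            "strand": strand, "phase": phase, "attributes": attributes}


def target_of(record):
    """Return (query name, query start, query end) from the Target attribute."""
    name, start, end = record["attributes"]["Target"].split()[:3]
    return name, int(start), int(end)
```

The subject coordinates are then available as `record["start"]` and `record["end"]`, and the query coordinates through `target_of(record)`.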

3.7.1 Filter Sequence Alignments With Less Than 90 Percent Identity

Recent segmental duplications are defined as paralogous sequences that share greater than 90% sequence similarity. Remove GFF records in which the percent identity attribute does not meet this minimum percent identity cutoff. This filtering criterion is applicable to both inter- and intrachromosomal sequence alignments.

Fig. 1. Traditional BLAST output as generated by MegaBLAST.
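A minimal Python sketch of this filter is shown below. It assumes each record is a dictionary of GFF attributes with a percentId key, as in the example record, and it treats the cutoff as strict ("greater than 90%"); both the function name and the record layout are assumptions made for illustration.

```python
def filter_by_identity(records, min_percent_id=90.0):
    """Keep records whose percent identity is strictly above the cutoff."""
    return [record for record in records
            if float(record["percentId"]) > min_percent_id]
```

For example, a record with percentId=94.96 is kept, whereas records at 90.0 or 88.5 are removed.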

3.7.2 Filter Suboptimal Sequence Alignments

Suboptimal sequence alignments occur when one sequence alignment is redundant in the sense that the subject and query elements are completely covered or spanned by another alignment. Remove the GFF record with the smaller span, which is considered a suboptimal alignment. This filtering step is applicable to both inter- and intrachromosomal sequence alignments.
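The containment test can be sketched in Python as follows. This is a simple quadratic-time illustration, not an optimized implementation; the `Alignment` tuple and function names are hypothetical, and coordinates are assumed to be normalized so that start is not greater than end on both subject and query.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    subject: str
    s_start: int
    s_end: int
    query: str
    q_start: int
    q_end: int


def covers(outer, inner):
    """True if `outer` spans `inner` on both the subject and the query."""
    return (outer.subject == inner.subject and outer.query == inner.query
            and outer.s_start <= inner.s_start and inner.s_end <= outer.s_end
            and outer.q_start <= inner.q_start and inner.q_end <= outer.q_end)


def remove_suboptimal(alignments):
    """Drop every alignment that is completely covered by another one."""
    unique = list(dict.fromkeys(alignments))  # exact duplicates collapse to one
    return [a for a in unique
            if not any(b != a and covers(b, a) for b in unique)]
```

An alignment that is covered on the subject but not on the query is kept, since it may represent a different copy of the duplicated sequence.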

3.7.3 Filter Identical Sequence Alignments

This filtering step is only applicable to intrachromosomal sequence alignments. Exclude self-self matches, whose GFF records have subject sequence coordinates that are identical to the query sequence coordinates.

3.8 Identify Segmental Duplications by Chaining Alignments Together

To define the boundaries of segmental duplications, alignments whose coordinates are monotonically increasing are chained together to form larger contiguous alignments. This compensates for short and fragmented alignments, which have arisen because of insertion or deletion events that have modified paralogous copies of DNA. Since we defined segmental duplications as regions of the genome having length greater than 5000 nt, we need to filter chained alignments that do not meet this minimum length requirement.

1. Sort GFF records by subject and query coordinates.

2. For records of the same subject and query chromosome pair, if adjacent sequence alignments are separated by less than 3000 nt, chain the alignments together.

3. Remove chained alignments that are smaller than 5000 bp.
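The three steps above can be sketched in Python as follows. Several details are assumptions made for illustration rather than statements of the authors' implementation: the gap limit is applied to both subject and query coordinates, alignments are assumed to be on the same strand with start not greater than end, and the length of a chain is taken as the shorter of its subject and query spans. The `Alignment` tuple and function names are hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    subject: str
    s_start: int
    s_end: int
    query: str
    q_start: int
    q_end: int


def _can_chain(prev, nxt, max_gap):
    """Same chromosome pair, increasing coordinates, and gaps below max_gap."""
    same_pair = prev.subject == nxt.subject and prev.query == nxt.query
    increasing = nxt.s_start >= prev.s_start and nxt.q_start >= prev.q_start
    close = (nxt.s_start - prev.s_end < max_gap
             and nxt.q_start - prev.q_end < max_gap)
    return same_pair and increasing and close


def _merge(chain):
    """Collapse a list of chained alignments into one spanning alignment."""
    first = chain[0]
    return first._replace(s_end=max(a.s_end for a in chain),
                          q_end=max(a.q_end for a in chain))


def chain_alignments(alignments, max_gap=3000, min_length=5000):
    """Sort, chain neighbouring alignments, and drop chains that are too short."""
    ordered = sorted(alignments,
                     key=lambda a: (a.subject, a.query, a.s_start, a.q_start))
    chains = []  # each chain is a list of member alignments in sorted order
    for aln in ordered:
        if chains and _can_chain(chains[-1][-1], aln, max_gap):
            chains[-1].append(aln)
        else:
            chains.append([aln])
    merged = [_merge(chain) for chain in chains]
    return [m for m in merged
            if min(m.s_end - m.s_start, m.q_end - m.q_start) + 1 >= min_length]
```

Alignments on the opposite strand would need their query coordinates handled separately (for example by chaining them in decreasing query order), which this sketch does not attempt.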

This step concludes the identification of large regions of the genome involved in recent segmental duplications. Large segmental duplications can often contain duplicate genes and/or be implicated in genomic disease and structural rearrangements; hence they have an inherent biological interest. Subheadings 3.9 and 3.10 discuss mapping genes to segmental duplications, identifying duplicate gene pairs, and characterizing gene duplications using the Gene Ontology.

3.9 Map RefSeq Genes to the Mouse Genome and to Segmental Duplications

To identify and characterize recent gene duplicates in the mouse genome, you will first need to obtain the most current curated gene data set, map the location of the gene to the genome of interest, and perform a positional colocalization of genes and duplications to detect gene paralogs.

3.9.1 Obtain RefSeq Gene Set and Mapping Location in the Mouse Genome

1. Obtain the mouse gene data set (refGene.txt.gz) from the University of California at Santa Cruz (http://hgdownload.cse.ucsc.edu/goldenPath/mm5/database/).

2. Extract the gene mapping information from the above file, and store in GFF3 format. A description of the refGene.txt table format from UCSC can be found at http://genome.ucsc.edu/goldenPath/gbdDescriptions.html#GenePredictions.

3.9.2 Identify Recent Gene Duplicates

Identifying recent gene duplications generated via a segmental duplication event can be accomplished by localizing the genes that lie within the boundaries of the duplications detected and determining the paralogous gene pair in the corresponding duplicon. The genes may be duplicated in whole or part along with the surrounding genomic DNA.

1. Identify the genes that reside completely within the defined boundary of the duplications (whole gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file and identify those genes that fall completely within the coordinates of the duplication.

2. Identify the genes that lie partially within the defined boundary of the duplication (partial gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file and identify those genes that overlap one or both boundaries of the duplication (as defined by either the feature start, the feature end, or those transcripts that span the entire duplication).

3. Now that you have found all RefSeq genes that reside within or span the boundaries of segmental duplications, you will need to search for the paralogous gene pair within the related segmental duplication loci. The duplicated gene may be supported by a curated RefSeq mRNA, an unannotated full-length mRNA, or an expressed sequence tag (EST).

a. Download EST (all_est.txt.gz) and mRNA (all_mrna.txt.gz) data sets from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/mm5/database/).

b. Extract the EST and mRNA mapping information from the above file, and store in GFF3 format. A description of the all_est.txt and all_mrna.txt table format from UCSC can be found at http://genome.ucsc.edu/goldenPath/gbdDescriptions.html#GenePredictions.

c. Identify the transcripts (EST and mRNA) that map completely within the defined boundary of the duplications (whole gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file, and identify those EST or mRNA sequences that fall completely within the coordinates of the duplication.

d. Identify the transcripts (EST and mRNA) that are located partially within the defined boundary of the duplication (partial gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file and identify those EST or mRNA sequences that overlap one or both boundaries of the duplication (as defined by either the feature start, the feature end, or those transcripts that span the entire duplication).

4. You now have a list of all RefSeq genes and EST and mRNA sequences that reside within duplications. This data set will represent all transcribed sequences that are candidates of recent gene duplication events. To determine the relationship between duplicate genes, a pairwise comparison of all transcripts within related duplications is required.

a. To determine whether two transcripts are related (i.e., a duplicated gene pair), you will need to BLAST pairs of transcript sequences.

b. Based on our criteria, genes that share greater than 90% DNA sequence similarity for greater than 50% of the length of the transcript can be categorized as a duplicated gene pair.
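The coordinate comparisons in steps 1 to 3 and the thresholds in step 4b can be sketched in Python as shown below. The sketch assumes that the feature and the duplication lie on the same chromosome and that coordinates are 1-based and inclusive, as in GFF3; which transcript length is used for the 50% criterion is left to the caller, and the function names are hypothetical.

```python
def classify_overlap(feature_start, feature_end, dup_start, dup_end):
    """Classify a gene or transcript relative to one duplication.

    Returns "whole" if the feature lies completely inside the duplication,
    "partial" if it overlaps a boundary or spans the whole duplication,
    and "none" if the two intervals do not overlap.
    """
    if feature_end < dup_start or feature_start > dup_end:
        return "none"
    if dup_start <= feature_start and feature_end <= dup_end:
        return "whole"
    return "partial"


def is_duplicated_pair(percent_identity, aligned_length, transcript_length):
    """Apply the >90% identity over >50% of transcript length criterion."""
    return (percent_identity > 90.0
            and aligned_length > 0.5 * transcript_length)
```

For instance, a transcript at 100 to 200 inside a duplication at 50 to 300 is classified as "whole", while one at 10 to 400 spans the duplication and is classified as "partial".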

3.10 Functional Characterization of Genes by Gene Ontology

Duplicate genes may undergo pseudogenization, subfunctionalization, or neofunctionalization (17). To identify the putative function and fates of duplicate genes, an in silico analysis of gene function should be undertaken using the Gene Ontology (GO) resource (18).

1. Obtain the geneID (extract the ID from the gene2refseq.gz file) for each duplicated gene from the NCBI website (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/). The geneID is a unique NCBI identifier (previously Locus Link ID) for each curated RefSeq entry. The GO database can be searched by this unique ID to extract precomputed gene ontology information. Additional information on the GO project is available at this website: http://www.geneontology.org/

2. Using the unique geneID, assign each gene to its GO annotations from each of the three GO taxonomies (biological processes, cellular component, and molecular function) by utilizing the GO Tree Machine (http://genereg.ornl.gov/gotm/). You will need to create an account. (Registration is free and will allow the user to save and retrieve analyses.)

3. Create a text file containing the list of geneIDs and save it.

a. Log onto the GO Tree Machine site, and give the analysis a relevant name for future access.

b. From the drop-down menu for “Select the ID type in your file,” select Locus Link ID (same as geneID).

c. For “What kind of analysis do you want to do?” select “single gene list” to perform a functional characterization of the duplicated genes.

d. You will need to upload the text file with the list of geneIDs previously created and select “MAKE TREE.”

e. Alternatively, if, for step 3c, you select “interesting gene list vs reference gene list,” you can perform a statistical analysis of duplicated genes to detect GO terms that are relatively enriched compared with the full RefSeq data set. You will need to choose the “MOUSE” reference list.
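Steps 1 and 3 can be scripted rather than done by hand. The Python sketch below maps RefSeq accessions to geneIDs and writes the one-ID-per-line text file. It assumes that gene2refseq is tab-delimited with the taxonomy ID in the first column, the GeneID in the second, and the RNA accession.version in the fourth; this layout should be checked against the header line of the downloaded file. The mouse taxonomy ID is 10090; the function names are hypothetical.

```python
def gene_ids_for_refseq(lines, accessions, tax_id="10090"):
    """Map RefSeq accessions (version suffix ignored) to geneIDs.

    `lines` is any iterable of text lines from the gene2refseq table.
    """
    wanted = {acc.split(".")[0] for acc in accessions}
    found = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4 or cols[0] != tax_id:
            continue
        accession = cols[3].split(".")[0]
        if accession in wanted:
            found[accession] = cols[1]
    return found


def write_gene_id_list(gene_ids, path):
    """Write the unique geneIDs, one per line, sorted numerically."""
    with open(path, "w") as handle:
        for gene_id in sorted(set(gene_ids), key=int):
            handle.write(f"{gene_id}\n")
```

The compressed file can be streamed without unpacking it, for example with `gzip.open("gene2refseq.gz", "rt")` as the `lines` argument; the resulting text file is the one uploaded in step 3d.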

4 Notes

1. Gene duplication allows for relaxed selection owing to redundancy, and this may allow for processes such as subfunctionalization, neofunctionalization, and pseudogenization. Subfunctionalization occurs when two gene copies specialize to perform complementary functions. Neofunctionalization involves gene duplication whereby one of the genes acquires a new biochemical function. Furthermore, pseudogenization occurs when one of the duplicated genes acquires mutations rendering it nonfunctional.

2. Since chromosome sequence FASTA files are quite large and range in size from 50 to 250 Mb, a significant amount of computational power and memory is required to perform the sequence alignments using MegaBLAST.

3. We will explain how to perform this analysis in a serial manner. It is up to the reader to understand the nuances of their particular cluster or supercomputing installation in order to parallelize the algorithm and achieve the desired results in less time. This means understanding whether using MPI or forking and executing processes is suitable.

4. This protocol can be written in any programming language such as Perl, Java, Python, Ruby, C, or C++. However, typically in bioinformatic applications, algorithms are written in Perl.

5. The BioPerl package is available from http://www.bioperl.org/

6. The current Generic Feature Format version 3 (GFF3) specification is available at http://song.sourceforge.net/gff3.shtml

7. Assembled genomes of several species such as human, rat, chimpanzee, dog, chicken, and others are available from the download page of the University of California at Santa Cruz (UCSC), http://hgdownload.cse.ucsc.edu/downloads.html

8. The main chromosome sequence assemblies are found in the chrN.fa files, where N is the name of the chromosome. The chrN_random.fa files are pseudo chromosomes containing sequences that are not yet finished or cannot be localized with certainty at any particular place in the chromosome assembly. The chrUn_random.fa file is another pseudo chromosome containing clones that have not been localized to a particular chromosome in the genome. These pseudo chromosomes should not be overlooked since they can often contain sequences that are involved in segmental duplications and have not been included in the main genome assembly, perhaps because of their duplicated nature.

9. If the precompiled binaries do not match your computing environment, source code is available from NCBI at ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz. The instructions detail how to compile and install this suite of tools for your particular computing environment.

10. A total of N² sequence alignments are performed for all sequence files, where N is the number of files in the genome (i.e., chr1.fa vs chr2 BLAST database and chr2.fa vs chr1 BLAST database). Sequence comparisons are required for all chromosomes in the genome, including the pseudo chromosomes.

References

1. Bailey, J. A., Gu, Z., Clark, R. A., et al. (2002) Recent segmental duplications in the human genome. Science 297, 1003–1007.

2. Cheung, J., Estivill, X., Khaja, R., et al. (2003) Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 4, R25.

3. Cheung, J., Wilson, M. D., Zhang, J., et al. (2003) Recent segmental and gene duplications in the mouse genome. Genome Biol. 4, R47.

4. Bailey, J. A., Church, D. M., Ventura, M., Rocchi, M., and Eichler, E. E. (2004) Analysis of segmental duplications and genome assembly in the mouse. Genome Res. 14, 789–801.

5. Tuzun, E., Bailey, J. A., and Eichler, E. E. (2004) Recent segmental duplications in the working draft assembly of the brown Norway rat. Genome Res. 14, 493–506.

6. Lupski, J. R. (1998) Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet. 14, 417–422.

7. Stankiewicz, P. and Lupski, J. R. (2002) Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82.

8. Eichler, E. E. (2001) Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17, 661–669.

9. Ji, Y., Eichler, E. E., Schwartz, S., and Nicholls, R. D. (2000) Structure of chromosomal duplicons and their role in mediating human genomic disorders. Genome Res. 10, 597–610.

10. Iafrate, A. J., Feuk, L., Rivera, M. N., et al. (2004) Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951.

11. Armengol, L., Pujana, M. A., Cheung, J., Scherer, S. W., and Estivill, X. (2003) Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Hum. Mol. Genet. 12, 2201–2208.

12. Bailey, J. A., Baertsch, R., Kent, W. J., Haussler, D., and Eichler, E. E. (2004) Hotspots of mammalian chromosomal evolution. Genome Biol. 5, R23.

13. Ohno, S. (1970) Evolution by Gene Duplication. Springer, New York, NY.

14. Buiting, K., Korner, C., Ulrich, B., Wahle, E., and Horsthemke, B. (1999) The human gene for the poly(A)-specific ribonuclease (PARN) maps to 16p13 and has a truncated copy in the Prader-Willi/Angelman syndrome region on 15q11–q13. Cytogenet. Cell Genet. 87, 125–131.

15. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

16. Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.

17. Prince, V. E. and Pickett, F. B. (2002) Splitting pairs: the diverging fates of duplicated genes. Nat. Rev. Genet. 3, 827–837.

18. Ashburner, M., Ball, C. A., Blake, J. A., et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.


3

Identification and Mapping of Paralogous Genes

on a Known Genomic DNA Sequence

Minou Bina

Summary

The completion of whole genome sequencing projects offers the opportunity to examine the organization of genes and the discovery of evolutionarily related genes in a given species. For beginners in the field, through a specific example, this chapter provides a step-by-step procedure for identifying paralogous genes, using the genome browser at UCSC (http://genome.ucsc.edu/). The example describes the identification and mapping, in the human genome, of the paralogs of TCF12/HTF4. The example identifies TCF3 and TCF4 as paralogs of the TCF12/HTF4 gene. The example also identifies a related sequence, corresponding to a pseudogene, in one of the introns of the JAK2 gene. The procedure described should be applicable to the discovery and creation of maps of paralogous genes in the genomic DNA sequences that are available at the genome browser at UCSC.

Key Words: The Human Genome Project; mapping of gene families; gene discovery.

1 Introduction

Paralogs refer to genes that appear in more than one copy in the genome of a given organism (1). Paralogs arise from gene duplication events. If it is advantageous, duplicated genes evolve independently to produce distinct but related proteins. This process often involves specialization of paralogous genes into specific functions (1). The evolution of paralogous genes can generate developmental and physiological novelties by changing the patterns of regulation of these genes, by changing the functions of the proteins they encode, or by both (1).

From the complete genomic sequence of a given species, it is possible to identify the paralogous genes in that species. This chapter describes an example of how to map and obtain a graphical representation of paralogous genes in a genomic DNA. The example uses the genome browser at the University of California at Santa Cruz (UCSC) (2,3). In the analysis, paralogy is defined on the basis of significant scores obtained for global alignments of amino acid sequences. This can be contrasted with local alignments, which are often utilized for the discovery of conserved motifs in the amino acid sequences of proteins (see, for example, ref. 4). The example given provides a relatively simple case, a good starting point for a beginner in the field. More complex cases would require additional tool sets. As an example, see the publication that describes how to explore relationships and mine data with the browser at UCSC (5).
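The contrast between global and local alignment can be illustrated with a minimal dynamic-programming sketch. This is not how BLAT computes its scores; the scoring values (match +1, mismatch -1, gap -1) and the two toy sequences are assumptions chosen only to show that two sequences sharing a short motif can have a good local score and a poor global score.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch), end to end.
    local=True:  local alignment (Smith-Waterman), best pair of substrings.
    The scoring scheme is a toy assumption, not a substitution matrix.
    """
    # prev[j] holds the best score for the previous row: a[:i-1] versus b[:j].
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# Toy sequences (hypothetical) that share only the short motif "LLVA".
seq_a = "MKTLLVAGG"
seq_b = "QQQLLVAPP"
print("global:", alignment_score(seq_a, seq_b))              # global: -1
print("local: ", alignment_score(seq_a, seq_b, local=True))  # local:  4
```

A high local score with a low global score, as in this toy case, points to a shared motif rather than to similarity over the full length of the protein, which is the criterion used for paralogy in this chapter.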

… mapping of paralogous genes, the browser at UCSC is used here since it provides numerous tools for data access and visualization (2,3).

3.1 Using the Amino Acid Sequence of a Protein as Query to Identify Potential Paralogous Genes

1. In the browser at UCSC (http://genome.ucsc.edu/), use the BLAT sequence alignment tool (7) for locating genes that might be paralogous to the gene of interest. To access BLAT, in the genome browser click on BLAT, one of the options listed on the left side of the page. You will obtain a query box for pasting the amino acid sequence of a protein.

2. If you want to analyze a predicted protein sequence that was compiled in your lab, you should convert it to a FASTA format. In this format, the sequence is presented as a continuous chain of amino acids, without any numbering, blank spaces, or annotation (Fig. 1).

3. If you know the accession number for a DNA or a protein sequence of interest, perform the following steps to obtain a FASTA formatted file from GenBank (8,9). In the example shown below, the accession number of a DNA file is used to obtain a FASTA formatted file for the corresponding protein.


d. In the query box next to for, type the known accession number for the DNA sequence of interest. As an example, type BK001049. This accession number contains the nucleotide sequence of HTF4c, one of the spliced transcripts of the human TCF12/HTF4 gene (10). After typing the accession number in the query box, click on go. You will obtain a page that includes the accession number and a description of the DNA sequence file.

e. On the right side of the accession number, click on link. On the pull-down menu, select protein.

f. Above the accession number of the retrieved protein sequence file, you will find the word "report," in red letters. On the pull-down menu, click on FASTA. You will obtain the FASTA format of the protein sequence.

g. Copy the entire sequence.

h. Paste the sequence in the BLAT query box at the UCSC browser, described

6. Go to the pull-down menu under the Query type and select protein.

7. For the other variables (score and output type), use the default values.

8. Finally, click on submit.

9. Upon completion of the BLAT search, you will receive a table listing the results (Fig. 2).

10. Examine the column tagged score. You will find the highest score (2100) for an extended region (positions 1–706), with 100% sequence identity to the query sequence analyzed by BLAT. The second and third highest scores (512 and 316) also reflect global alignments, corresponding to positions 5 to 679 and to positions 4 to 681, respectively (Fig. 2). The fourth score might or might not be significant; therefore it should also be analyzed. The other scores correspond to relatively short local alignments and therefore do not appear to be significant. This conclusion can be deduced by examining the information provided in detail (Fig. 2). Therefore, next to each sequence, right-click on details to open the link in a new window and obtain useful information (see Note 2).

Fig. 1. Example of a FASTA formatted protein sequence.

11. First examine the details for the sequence with the highest score, the first line in Fig. 2 (see Note 1). In the details for that line, you will obtain the position of the submitted sequence on human chromosome 15. Also, you will obtain the submitted amino acid sequence with regions highlighted in different colors. These regions might correspond to splice sites in the DNA. On that page, scroll down to view the nucleotide positions of the exons and the splice sites in the genomic DNA. Scroll further down to examine the predicted amino acid sequence encoded by the exons of the gene.

12. Next, examine the details for the sequence with the second highest score, the second line in Fig. 2. You will obtain the position of a genomic DNA region, with a predicted amino acid sequence that shows similarity to the sequence analyzed by BLAT. You will find that the genomic DNA is on human chromosome 18. Also, you will obtain the similarity of the predicted sequence to the submitted amino acid sequence. The regions exhibiting sequence similarity are highlighted in different colors. The result indicates global similarities over an extended region.

13. Scroll down the page to view the genomic positions of the exons in the gene on chromosome 18. Scroll down further to view the results of side-by-side alignments.

14. Next examine the details for the sequence with the third highest score (the third line in Fig. 2). As detailed above in step 12, you will obtain the position of a genomic DNA region with a predicted amino acid sequence that shows similarity to the sequence analyzed by BLAT. The genomic DNA is on human chromosome 19. As described above in step 12, you can view the blocks that show similarity to the query sequence. These blocks are highlighted in different colors. Again, the result indicates global similarities over an extended region. As detailed above in step 13, by scrolling down, you can obtain genomic positions of the exons of the gene in the genomic DNA (in this case chromosome 19) and view the results of side-by-side alignments.

Fig. 2. A partial listing of the result of the BLAT search.

15. Finally, examine the details for the other sequences listed in Fig. 2. The details for the sequence on the fourth line indicate a global alignment that might be significant. However, as shown in Subheading 3.2., steps 10 and 11, you will find that the sequence corresponds to a pseudogene. The details for the sequence on the fifth line identify the same genomic region obtained from the details for chromosome 19, the third line. The details describing the sequence on the sixth line reveal relatively short local alignments with a protein sequence predicted for a gene on chromosome 20. The sequence matches with low scores are unlikely to correspond to paralogous genes.
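The reasoning in steps 10 to 15, separating extended (global) matches from short local ones, can be sketched in code. This is only an illustration under stated assumptions. The scores and query positions of the first three hits are those quoted in step 10. The query length of 706 is assumed from the top hit spanning positions 1–706. The fourth entry, the `BlatHit` class, and the 0.8 coverage cutoff are hypothetical and are not taken from Fig. 2 or from BLAT itself.

```python
from dataclasses import dataclass


@dataclass
class BlatHit:
    score: int
    q_start: int  # first aligned query position (1-based)
    q_end: int    # last aligned query position (inclusive)


def query_coverage(hit, query_len):
    """Fraction of the query spanned by the hit, from q_start through q_end."""
    return (hit.q_end - hit.q_start + 1) / query_len


def triage(hits, query_len, min_coverage=0.8):
    """Split hits into (extended, short) lists, each sorted by descending score.

    min_coverage is an illustrative cutoff, not a value from the chapter.
    """
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    extended = [h for h in ranked if query_coverage(h, query_len) >= min_coverage]
    short = [h for h in ranked if query_coverage(h, query_len) < min_coverage]
    return extended, short


QUERY_LEN = 706  # assumed: the top hit spans positions 1-706 at 100% identity

hits = [
    BlatHit(score=2100, q_start=1, q_end=706),  # step 10, highest score
    BlatHit(score=512, q_start=5, q_end=679),   # step 10, second highest score
    BlatHit(score=316, q_start=4, q_end=681),   # step 10, third highest score
    BlatHit(score=40, q_start=600, q_end=650),  # hypothetical short local hit
]

extended, short = triage(hits, QUERY_LEN)
for h in extended:
    print(h.score, f"{query_coverage(h, QUERY_LEN):.2f}")
```

Coverage alone does not establish paralogy. As step 15 notes, an extended match can still turn out to be a pseudogene, so each candidate should be inspected through its details page.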

3.2 Mapping and Viewing the Chromosomal Positions of the Candidate Paralogous Genes

1. The BLAT report (Fig. 2) includes links for viewing the genomic locations of candidate paralogous genes in the browser at UCSC.

2. First, obtain a map of the query sequence on the genomic DNA. To do so, in the BLAT results (Fig. 2), right-click on the browser link, on the left side of the first line, to open this link in a new window. The first line contains the highest score and indicates 100% sequence identity to the query.

3. Examine the browser page closely (see Note 3). You will find that the map provides the genomic position of the gene encoding TCF12/HTF4, on human chromosome 15 (11). Note that the browser offers an extensive list of options from which you can choose for viewing and analyzing the map. For example, on the top, you can use the left and right arrows to move to the flanking regions in the map. You can use the zoom buttons (on the top right of the page) to zoom in, to obtain an expanded view, or zoom out, to include additional sequences in the map (Fig. 3A).

4. Explore the options that are listed below the graph (below the bar indicating mapping and sequencing tracks), to choose what you want to include in the graph. The options are extensive. You can choose options to create tracks for viewing additional details (2). The options include creating a track for reference sequences (RefSeq). Each time you choose an option, or a set of options, click on the refresh button.

5. On the map displayed, the arrows on the tracks corresponding to known genes provide the direction of transcription (Fig. 3). Click on one of these tracks to obtain information about that track and useful links that could help with data analysis and evaluation (see Note 3). For example, click one of the tracks labeled TCF12. You will obtain a page that includes the accession number for the transcript that corresponds to that track. Scroll down the page to obtain additional information about the gene.

6. To obtain an output of the graph showing the map, for your record or for publication, click on PDF/PS file. In that link, you will be able to save the plot in a PDF file or a PostScript file. Figure 3 displays examples of outputs obtained from PDF files.
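For readers who want to return to the same map view later, a browser position can also be encoded in a URL. The sketch below assumes the `hgTracks` address with its `db` and `position` parameters, as used by the UCSC browser for direct links. The assembly name `hg38` and the coordinates are placeholders for illustration only; they are not the coordinates of TCF12/HTF4 and not the assembly used in this chapter.

```python
from urllib.parse import urlencode

# Assumed base address of the browser's track display page.
HGTRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def browser_url(db, chrom, start, end, base=HGTRACKS):
    """Build a link that opens the genome browser at chrom:start-end.

    db is the assembly name; all values passed in here are supplied by the
    user (for example, copied from the BLAT details page).
    """
    position = f"{chrom}:{start}-{end}"
    return f"{base}?{urlencode({'db': db, 'position': position})}"


# Placeholder assembly and coordinates (hypothetical, for illustration only).
url = browser_url("hg38", "chr15", 1000, 2000)
print(url)
```

The colon in the position is percent-encoded by `urlencode`, which is the standard form for a query-string value.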
