Gene Mapping,
Discovery, and Expression
Methods and Protocols
Edited by
Minou Bina
John M Walker, SERIES EDITOR
355 Plant Proteomics: Methods and Protocols, edited
by Hervé Thiellement, Michel Zivy, Catherine
Damerval, and Valerie Mechin, 2006
354 Plant–Pathogen Interactions: Methods and
Protocols, edited by Pamela C Ronald, 2006
353 DNA Analysis by Nonradioactive Probes:
Methods and Protocols, edited by Elena Hilario and
John F MacKay, 2006
352 Protein Engineering Protocols, edited by Kristian
Müller and Katja Arndt, 2006
351 C elegans: Methods and Applications, edited by
Kevin Strange, 2006
350 Protein Folding Protocols, edited by Yawen Bai
and Ruth Nussinov, 2006
349 YAC Protocols, Second Edition, edited by Alasdair
MacKenzie, 2006
348 Nuclear Transfer Protocols: Cell Reprogramming
and Transgenesis, edited by Paul J Verma and Alan
Trounson, 2006
347 Glycobiology Protocols, edited by Inka
Brockhausen-Schutzbach, 2006
346 Dictyostelium discoideum Protocols, edited by
Ludwig Eichinger and Francisco Rivero-Crespo,
2006
345 Diagnostic Bacteriology Protocols, Second Edition,
edited by Louise O'Connor, 2006
344 Agrobacterium Protocols, Second Edition:
Volume 2, edited by Kan Wang, 2006
343 Agrobacterium Protocols, Second Edition:
Volume 1, edited by Kan Wang, 2006
342 MicroRNA Protocols, edited by Shao-Yao Ying,
2006
341 Cell–Cell Interactions: Methods and Protocols,
edited by Sean P Colgan, 2006
340 Protein Design: Methods and Applications,
edited by Raphael Guerois and Manuela López de la
Paz, 2006
339 Microchip Capillary Electrophoresis: Methods
and Protocols, edited by Charles S Henry, 2006
338 Gene Mapping, Discovery, and Expression:
Methods and Protocols, edited by M Bina, 2006
337 Ion Channels: Methods and Protocols, edited by
James D Stockand and Mark S Shapiro, 2006
336 Clinical Applications of PCR, Second Edition,
edited by Y M Dennis Lo, Rossa W K Chiu,
and K C Allen Chan, 2006
335 Fluorescent Energy Transfer Nucleic Acid
Probes: Designs and Protocols, edited by Vladimir
V Didenko, 2006
334 PRINS and In Situ PCR Protocols, Second
Edition, edited by Franck Pellestor, 2006
333 Transplantation Immunology: Methods and
Protocols, edited by Philip Hornick and Marlene
Rose, 2006
332 Transmembrane Signaling Protocols, Second
Edition, edited by Hydar Ali and Bodduluri
331 Human Embryonic Stem Cell Protocols, edited
by Kursad Turksen, 2006
330 Embryonic Stem Cell Protocols, Second Edition,
Vol II: Differentiation Models, edited by Kursad Turksen, 2006
329 Embryonic Stem Cell Protocols, Second Edition,
Vol I: Isolation and Characterization, edited by Kursad Turksen, 2006
328 New and Emerging Proteomic Techniques,
edited by Dobrin Nedelkov and Randall W Nelson,
2006
327 Epidermal Growth Factor: Methods and Protocols,
edited by Tarun B Patel and Paul J Bertics, 2006
326 In Situ Hybridization Protocols, Third Edition,
edited by Ian A Darby and Tim D Hewitson, 2006
325 Nuclear Reprogramming: Methods and Protocols,
edited by Steve Pells, 2006
324 Hormone Assays in Biological Fluids, edited by
Michael J Wheeler and J S Morley Hutchinson, 2006
323 Arabidopsis Protocols, Second Edition, edited by
Julio Salinas and Jose J Sanchez-Serrano, 2006
322 Xenopus Protocols: Cell Biology and Signal
Transduction, edited by X Johné Liu, 2006
321 Microfluidic Techniques: Reviews and Protocols,
edited by Shelley D Minteer, 2006
320 Cytochrome P450 Protocols, Second Edition,
edited by Ian R Phillips and Elizabeth A Shephard,
2006
319 Cell Imaging Techniques: Methods and Protocols,
edited by Douglas J Taatjes and Brooke T.
Mossman, 2006
318 Plant Cell Culture Protocols, Second Edition,
edited by Victor M Loyola-Vargas and Felipe
Vázquez-Flota, 2005
317 Differential Display Methods and Protocols,
Second Edition, edited by Peng Liang, Jonathan Meade, and Arthur B Pardee, 2005
316 Bioinformatics and Drug Discovery, edited by
Richard S Larson, 2005
315 Mast Cells: Methods and Protocols, edited by Guha
Krishnaswamy and David S Chi, 2005
314 DNA Repair Protocols: Mammalian Systems,
Second Edition, edited by Daryl S Henderson, 2006
313 Yeast Protocols, Second Edition, edited by Wei
Xiao, 2005
312 Calcium Signaling Protocols, Second Edition,
edited by David G Lambert, 2005
311 Pharmacogenomics: Methods and Protocols,
edited by Federico Innocenti, 2005
310 Chemical Genomics: Reviews and Protocols,
edited by Edward D Zanders, 2005
309 RNA Silencing: Methods and Protocols, edited by
Gordon Carmichael, 2005
308 Therapeutic Proteins: Methods and Protocols,
edited by C Mark Smales and David C James,
Totowa, New Jersey 07512
www.humanapress.com
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise without written permission from the Publisher. Methods in Molecular Biology™ is a trademark of The Humana Press Inc.
All papers, comments, opinions, conclusions, or recommendations are those of the author(s), and do not necessarily reflect the views of the publisher.
This publication is printed on acid-free paper ∞
ANSI Z39.48-1984 (American Standards Institute)
Permanence of Paper for Printed Library Materials.
Cover illustration: Figure 2, from Chapter 4, “Quantitative DNA Fiber Mapping in Genome Research and Construction of Physical Maps,” by H.-U G Weier and L W Chu
Cover design by Patricia F Cleary.
For additional copies, pricing for bulk purchases, and/or information about other Humana titles, contact Humana at the above address or at any of the following numbers: Tel.: 973-256-1699; Fax: 973-256-8341; E-mail: orders@humanapr.com; or visit our Website: www.humanapress.com
Photocopy Authorization Policy:
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Humana Press Inc., provided that the base fee of US $30.00 per copy is paid directly to the Copyright Clearance Center at 222 Rosewood Drive, Danvers, MA 01923. For those organizations that have been granted a photocopy license from the CCC, a separate system of payment has been arranged and is acceptable to Humana Press Inc. The fee code for users of the Transactional Reporting Service is: [1-58829-575-3/06 $30.00].
Printed in the United States of America 10 9 8 7 6 5 4 3 2 1
eISBN 1-59745-097-9
Library of Congress Cataloging in Publication Data
Gene mapping, discovery, and expression : methods and protocols / edited by Minou Bina.
p ; cm — (Methods in molecular biology ; v 338) Includes bibliographical references and index.
ISBN 1-58829-575-3 (alk paper)
1 Gene mapping—Methodology 2 Gene mapping—Data processing 3 Genetics—Technique.
Preface
Completion of the sequence of the human genome represents an unparalleled achievement in the history of biology. The project has produced nearly complete, highly accurate, and comprehensive sequences of genomes of several organisms including human, mouse, Drosophila, and yeast. Furthermore, the development of high-throughput technologies has led to an explosion of projects to sequence the genomes of additional organisms including rat, chimp, dog, bee, chicken, and the list is expanding.

The nearly completed draft of genomic sequences from numerous species has opened a new era of research in biology and in biomedical sciences. In keeping with the interdisciplinary nature of the new scientific era, the chapters in Gene Mapping, Discovery, and Expression: Methods and Protocols recapitulate the necessity of integration of experimental and computational tools for solving important research problems. The general underlying theme of this volume is DNA sequence-based technologies. At one level, the book highlights the importance of databases, genome browsers, and web-based tools for data access and analysis. More specifically, sequencing projects routinely deposit their data in publicly available databases including GenBank, at the National Center for Biotechnology Information (NCBI) in the United States; EMBL, maintained by the European Bioinformatics Institute; and DDBJ, the DNA Data Bank of Japan. Currently, several browsers offer facile access to numerous genomic DNA sequences for gene mapping and data retrieval. These include the map-view at NCBI; the genome browser at the University of California at Santa Cruz, UCSC; and the browser maintained by Ensembl. All three browsers offer sophisticated tools for gene mapping and localization on genomic DNA.

For beginners in the field, through a specific example, one chapter provides a step-by-step procedure for localization, creating a map, and a graphical representation of genes of interest using the genome browser at UCSC. Since the drafts of the genomic sequences provide primarily a reference for studies of gene organization, additional methods are needed for understanding the complexity and dynamic nature of chromosomes. Significantly, segmental duplications are a common feature of many mammalian genomes. Therefore, Gene Mapping, Discovery, and Expression: Methods and Protocols provides a computational protocol for identifying and mapping recent segmental and gene duplications. Another chapter offers a step-by-step procedure for identifying paralogous genes, using the genome browser at UCSC.

To examine local variations in specific regions of chromosomes experimentally, a chapter provides a novel method, Quantitative DNA Fiber Mapping, that relies on fluorescent in situ hybridization (FISH) to identify, delineate, and characterize selected, often small, DNA sequences along a larger piece of the human genome. In another experimental contribution, a chapter describes a sensitive and specific method, Primed in situ labeling, that can be used for localization of single copy genes and sequences too small for detection by conventional FISH.

Novel DNA sequence-based strategies include methods for the discovery and mapping of the functional elements and the “codes” in DNA that regulate the expression of genes. The completed sequence of the human genome and the genomic sequences of model organisms offer a rich source of data for addressing this problem. A fundamental and powerful method is based on comparing the sequences from different species to identify the conserved functional elements. A chapter in this volume describes the VISTA family of computational tools, created to assist researchers in aligning DNA sequences for locating the genomic DNA regions that are highly conserved. Another chapter aims at using sequence conservation as a guide for identifying the elements that may regulate the expression of genes. This chapter describes how to use publicly available servers (Galaxy, the UCSC Table Browser, and GALA) to find genomic sequences whose alignments show properties associated with cis-regulatory modules and conserved transcription factor binding sites. Furthermore, this volume describes additional versatile and web-based tools for promoter, regulatory region, and expression analyses. These tools include CORG “COmparative Regulatory Genomics” and BEARR “Batch Extraction and Analysis of cis-Regulatory Regions.”

DNA sequence-based technologies include other strategies that could help with the identification of regulatory signals and potential protein binding elements in the regulatory regions of genes. For example, a chapter describes how a database of 9-mers from promoter regions of human protein-coding genes could be accessed via the web for the discovery of the lexical characteristics of potential regulatory motifs in human genomic DNA. These characteristics could help with predicting and classifying regulatory cis-elements according to the genes that they control.

Cis-elements can control the expression of genes in an allele-specific fashion. The analysis of allele-specific gene expression is of interest in the study of genomic imprinting. Significantly, there is growing awareness that differences in allelic expression could be widespread among autosomal non-imprinted genes. A chapter in Gene Mapping, Discovery, and Expression: Methods and Protocols provides protocols for in vivo analysis of allele-specific gene expression. These include analysis of the relative allelic abundance of transcribed RNA, and of transcription factor recruitment and Pol II loading by chromatin immunoprecipitation. Another chapter describes miRNA expression vectors containing human RNA polymerase II or III promoters for studies of the control of gene expression.

In this new scientific era, gene expression is extensively studied using microarray technologies. Two chapters describe how to use web-based tools for accessing and analyzing the microarray data. One chapter describes Gene Expression Omnibus (GEO) developed at NCBI. GEO has emerged as a leading fully public repository for gene expression data. The chapter describes how to use Web-based interfaces, applications, and graphics to effectively explore, visualize, and interpret the hundreds of microarray studies and millions of gene expression patterns stored in GEO. Another chapter describes the resources at the Stanford Microarray Database (SMD). This database offers a large amount of data for public use. The chapter describes how to use the primary tools for searching, browsing, retrieving, and analyzing data available at SMD. Furthermore, researchers, educators, and students may find SMD a very useful repository of a large quantity of publicly available data that, together with analysis tools, could be used for exploratory, unsupervised analysis and discovery.

Another level of sequence-based technologies depends on how best to analyze the structural organization of chromosomes, evaluate the sequence specificity of transcription factors, and isolate and identify the components of the protein complexes formed with DNA. More specifically, in cells, the chromosomal DNA is associated with proteins to form complexes referred to as chromatin. A major group of chromosomal proteins, the histones, functions in the compaction of DNA by forming nucleosomes. Another major group corresponds to transcription factors, which control the expression of genes through protein–DNA and protein–protein interactions. Evidence supports major roles for the underlying DNA sequence on the relative arrangement of proteins along the chromosomes. Two chapters in this volume provide DNA sequence-based methods for probing chromatin structure. One chapter describes a step-by-step procedure for detecting and analyzing nucleosome ladders on unique DNA sequences. Another offers a non-invasive method of assaying relative DNA accessibility in yeast chromatin without disrupting DNA–protein interactions.

The DNA sequence specificities of transcription factors are key components of the cis regulatory networks. However, despite their importance, the DNA binding specificities of many transcription factors remain unknown. Furthermore, methods routinely used for characterizing protein binding sites are not scalable and are time-consuming. These issues are problematic because complete, accurate, and reliable datasets of transcription factor binding elements are needed for localizing the regulatory regions of genes. This volume offers two chapters on novel DNA microarray-based technologies for rapid, high-throughput in vitro characterization of the DNA sequence specificities of transcription factors.

Lastly, several chapters in Gene Mapping, Discovery, and Expression: Methods and Protocols offer non-invasive technologies for the isolation of transcription factor complexes formed with specific DNA sequences used as bait. Identification of the components of large protein–DNA complexes is an important step in elucidating the mechanisms by which gene expression is controlled. Two chapters describe the use of powerful methods based on mass spectrometry for identification of proteins in the complexes formed with DNA. These methods can lead to the discovery of novel transcription factors with important roles in the control of gene expression.
Minou Bina
Contents
Preface
Contributors
1 Use of Genome Browsers to Locate Your Favorite Genes
Minou Bina
2 Methods for Identifying and Mapping Recent Segmental and Gene Duplications in Eukaryotic Genomes
Razi Khaja, Jeffrey R MacDonald, Junjun Zhang, and Stephen W Scherer
3 Identification and Mapping of Paralogous Genes on a Known Genomic DNA Sequence
Minou Bina
4 Quantitative DNA Fiber Mapping in Genome Research and Construction of Physical Maps
Heinz-Ulrich G Weier and Lisa W Chu
5 PRINS for Mapping Single-Copy Genes
Avirachan T Tharapel and Stephen S Wachtel
6 VISTA Family of Computational Tools for Comparative Analysis of DNA Sequences and Whole Genomes
Inna Dubchak and Dmitriy V Ryaboy
7 Computational Prediction of cis-Regulatory Modules from Multispecies Alignments Using Galaxy, Table Browser, and GALA
Laura Elnitski, David King, and Ross C Hardison
8 Comparative Promoter Analysis in Vertebrate Genomes with the CORG Workbench
Christoph Dieterich and Martin Vingron
9 cis-Regulatory Region Analysis Using BEARR
Vinsensius Berlian Vega
10 A Database of 9-Mers from Promoter Regions of Human Protein-Coding Genes
Minou Bina, Phillip Wyss, and Syed Rehan Shah
11 A Program Toolkit for the Analysis of Regulatory Regions of Genes
Phillip Wyss, Sheryl A Lazarus, and Minou Bina
12 Analysis of Allele-Specific Gene Expression
Julian C Knight
13 Construction of microRNA-Containing Vectors for Expression in Mammalian Cells
Yoko Fukuda, Hiroaki Kawasaki, and Kazunari Taira
14 Mining Microarray Data at NCBI’s Gene Expression Omnibus (GEO)
Tanya Barrett and Ron Edgar
15 The Stanford Microarray Database: A User’s Guide
Jeremy Gollub, Catherine A Ball, and Gavin Sherlock
16 Detecting Nucleosome Ladders on Unique DNA Sequences in Mouse Liver Nuclei
Tomara J Fleury, Alfred Cioffi, and Arnold Stein
17 DNA Methyltransferase Probing of DNA–Protein Interactions
Scott A Hoose and Michael P Kladde
18 Protein Binding Microarrays (PBMs) for Rapid, High-Throughput Characterization of the Sequence Specificities of DNA Binding Proteins
Michael F Berger and Martha L Bulyk
19 Quantitative Profiling of Protein-DNA Binding on Microarrays
Jiannis Ragoussis, Simon Field, and Irina A Udalova
20 Analysis of Protein-DNA Binding by Streptavidin–Agarose Pulldown
Kenneth K Wu
21 Isolation and Mass Spectrometry of Specific DNA Binding Proteins
Mariana Yaneva and Paul Tempst
22 Isolation of Transcription Factor Complexes by In Vivo Biotinylation Tagging and Direct Binding to Streptavidin Beads
Patrick Rodriguez, Harald Braun, Katarzyna E Kolodziej, Ernie de Boer, Jennifer Campbell, Edgar Bonte, Frank Grosveld, Sjaak Philipsen, and John Strouboulis
Index
Contributors
MICHAEL F BERGER • Biophysics Program, Harvard University, Boston, MA
MINOU BINA • Department of Chemistry, Purdue University, West Lafayette, IN
EDGAR BONTE • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands
HARALD BRAUN • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands
MARTHA L BULYK • Department of Medicine, Division of Genetics; Department of Pathology; and Harvard–MIT Division of Health Sciences & Technology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA
JENNIFER CAMPBELL • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands
LISA W CHU • Department of Genome Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA
ALFRED CIOFFI • Department of Biological Sciences, Purdue University, West Lafayette, IN
ERNIE DE BOER • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands
CHRISTOPH DIETERICH • Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Berlin, Germany
INNA DUBCHAK • Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA
RON EDGAR • National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD
LAURA ELNITSKI • Genome Technology Branch, National Institutes of Health, Rockville, MD
SIMON FIELD • University of Oxford, Oxford, UK
TOMARA J FLEURY • Department of Biological Sciences, Purdue University, West Lafayette, IN
YOKO FUKUDA • Department of Chemistry and Biotechnology, The University of Tokyo, Japan
JEREMY GOLLUB • Department of Biochemistry, Stanford University Medical School, Stanford, CA
FRANK GROSVELD • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands
ROSS C HARDISON • Department of Biochemistry and Molecular Biology, Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, PA
SCOTT A HOOSE • Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX
HIROAKI KAWASAKI • Department of Chemistry and Biotechnology, The University of Tokyo, Japan
RAZI KHAJA • Program in Genetics and Genomic Biology, Research Institute, The Hospital for Sick Children, Toronto, ON, Canada
DAVID KING • Department of Biochemistry and Molecular Biology, Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, PA
MICHAEL P KLADDE • Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX
JULIAN C KNIGHT • Wellcome Trust Centre for Human Genetics, University
JOHN STROUBOULIS • Department of Cell Biology, Erasmus Medical Center, Rotterdam, The Netherlands
KAZUNARI TAIRA • Department of Chemistry and Biotechnology, The University of Tokyo, Japan
PAUL TEMPST • Molecular Biology Program, Memorial Sloan-Kettering Cancer Center, New York, NY
AVIRACHAN T THARAPEL • Department of Pediatrics, University of Tennessee, Memphis, TN
IRINA A UDALOVA • Kennedy Institute of Rheumatology, Imperial College, London, UK
VINSENSIUS BERLIAN VEGA • Genome Institute of Singapore, Singapore
MARTIN VINGRON • Max Planck Institute for Molecular Genetics, Germany
STEPHEN S WACHTEL • Department of Obstetrics and Gynecology, University
MARIANA YANEVA • Memorial Sloan-Kettering Cancer Center, New York, NY
JUNJUN ZHANG • The Hospital for Sick Children, Toronto, ON, Canada
From: Methods in Molecular Biology, vol. 338: Gene Mapping, Discovery, and Expression: Methods and Protocols. Edited by: M. Bina © Humana Press Inc., Totowa, NJ
1
Use of Genome Browsers
to Locate Your Favorite Genes
Minou Bina
Summary
The completion of whole-genome sequencing projects offers the opportunity of creating high-resolution maps of specific segments in a known genomic DNA sequence. For this purpose, several genome browsers have been created. They include the map-view (http://www.ncbi.nlm.nih.gov/mapview/), the Ensembl genome browser (http://www.ensembl.org/), and the genome browser at UCSC (http://genome.ucsc.edu/). For beginners in the field, through a specific example, this chapter provides a step-by-step procedure for creating a map using the genome browser at UCSC. The example describes mapping, in the human genome, the promoter region of the NF-IL6 gene. The procedure is applicable to creating maps of the desired regions in genomes of other species available at the genome browser at UCSC.
Key Words: The Human Genome Project; gene mapping; gene localization.
1. Introduction
The rapid advances of genome sequencing projects have offered the opportunity to map and locate genes of interest, without resorting to time-consuming and costly experimental procedures. Large sequencing projects routinely deposit their data in publicly available databases including GenBank, at the National Center for Biotechnology Information (NCBI) in the United States (1,2); EMBL, maintained by the European Bioinformatics Institute (3); and DDBJ, the DNA Data Bank of Japan (4).
Currently, several browsers offer facile access to numerous genomic DNA sequences for gene mapping and data retrieval. These include NCBI (1,2); the genome browser at the University of California at Santa Cruz (UCSC) (5,6); and the browser maintained by Ensembl (7). All three browsers offer sophisticated tools for gene mapping and localization on genomic DNA. This chapter provides an example of how to use the genome browser at UCSC (5,6) to obtain a map and a graphical view of a known DNA sequence.
2. Materials
The gene localization procedure was done on a PC equipped with the Windows XP operating system. The general procedure should be applicable to other computers (see Note 1).
3. Methods
The genome browser at UCSC provides numerous sophisticated tools for data access, analyses, and visualization (5,6). The following sections will guide a beginner in the field through simple and general procedures for locating and mapping the positions of a known sequence on genomic DNA.
1. Use the BLAT sequence alignment program at the genome browser at UCSC (8).
2. To access BLAT, go to the browser’s home page (http://genome.ucsc.edu/). Click on BLAT, one of the options listed on the left side of the page. You will obtain a query box for pasting a DNA sequence for analysis by BLAT.
3. To conduct a BLAT search, you should provide the query sequence in the FASTA format. In this format, the sequence is presented as a continuous chain of nucleotides, without any numbering and blank spaces (Fig. 1).
4. If you know the GenBank accession number of the DNA sequence of interest, perform the following steps to obtain a FASTA formatted file:
   a. Go to NCBI (http://www.ncbi.nlm.nih.gov/).
   b. Use the pull-down menu next to the query box that contains the words All Databases.
   c. On the menu, select nucleotides.
   d. In the query box next to “for,” type the known accession number. As an example, type AF350408. This accession number contains the nucleotide sequence of a cloned human DNA fragment that includes the promoter region of the NF-IL6 gene (9).
   e. After typing the accession number in the NCBI query box, click on go. You will obtain a page that includes the accession number and a description of the sequence file.
   f. Above the accession number, you will find the word report, in red letters. Click on report. You will obtain a pull-down menu. On the menu, select FASTA. You will obtain the FASTA formatted version of the sequence.
5. Copy the entire sequence.
6. Paste it in the BLAT query box at the UCSC browser, described above in step 2.
7. Alternatively, you can scroll down the BLAT page to use the box that would allow you to upload a FASTA formatted file from your computer.
8. On the top of the BLAT query box, for genome, select human. Click on the pull-down menu to view the extensive list of genomic sequences offered by the browser. (You can also use the procedures described here for mapping and graphical representation of sequences from other species.)
9. Above the BLAT query box, in the box under assembly, choose the latest version (in our example, 2004). Alternatively, from the pull-down menu, select an earlier version of a genomic DNA sequence.
10. Use the pull-down menu under the Query type and select DNA.
11. For the other variables (score and output type), use the default values.
12. Finally, click on submit.
13. You will obtain a page listing the results of the BLAT search (Fig. 2).
14. Examine the column tagged score (Fig. 2). You will find the highest score (6455) for an extended region (positions 7–6477), with 100% sequence identity to the query submitted for analysis by BLAT (Fig. 2). In some cases, for additional extended regions, you might obtain high scores and high sequence identity to the query. These scores may represent pseudogenes or recent duplications that could be examined for further evaluation.
15. Next to each query result (Your Seq., Fig. 2), right-click on details to open the link in a new window. This link provides useful information (see Note 2). For example, on the top of the new window, you will find the chromosomal positions of the query sequence (in that example, chr20:48234366-48240842). Below the positions, you will find the submitted sequence with regions highlighted in different colors. Scroll down to view the results of side-by-side alignment. The quality of the alignment can guide your decision as to whether the reported matches with the query sequence are significant (see Note 2).
16. Go to the browser to obtain a map (a graphical view) of the query sequence. To do so, on the page summarizing the result of the BLAT search (Fig. 2), choose the top line, the line with the highest score. Right-click on the browser link on the left side, to open and view the map in a new window (Fig. 3).
Fig. 1. Example of a FASTA formatted DNA sequence.
17. Examine the page closely. The browser provides an extensive list of options from which you can choose for viewing the map (5). For example, on the top of the page, you can use specific control keys (i.e., the left and right arrows) to move to and view the flanking regions in the map. You can click on zoom buttons to zoom in or out. In the example, click on the left arrow (>) twice, to move the map to include the coding region of the sequence. In that example, you will find the coding region of the human NF-IL6 gene, which is also known as C/EBPbeta (Fig. 4).
18. Select from the options listed below the graph (mapping and sequencing tracks), to choose what you want to include in the graph. The options are extensive. You can choose options that would allow the inclusion of additional details in the map. Each time you choose an option, or a set of options, click on the refresh button. The browser will display the selected annotations as a series of horizontal tracks (5).
19. On the graph, the arrows on the tracks representing the gene provide the direction of transcription (Fig. 4). Click on a given track to obtain useful links and information about that track.
20. To obtain the sequence of the region shown in the graph, on the top bar (Fig. 3), right-click on DNA to open a new window for viewing the sequence. Follow the instructions for obtaining the desired format (for example, you can choose masking the repetitive DNA sequences to lower case letters).
21. To obtain an output of the graph, for your record or for publication, on the top bar (Fig. 3), right-click on PDF/PS to open a new window that would provide the options to save the plot in a PDF or a postscript file (Fig. 4).
Fig. 2. A partial listing of the result of the BLAT search.
Fig. 3. Graphical representation of the promoter region of the human C/EBPbeta (NF-IL6) gene in the genome browser at UCSC. The top of this view shows the control keys for zooming in or out, as well as keys for moving the displayed region to the left or to the right. The bottom view includes a partial listing of the control keys for adding details to and removing tracks from the map.
Fig. 4. Graphical representation of a region that includes both the promoter and the coding region of the human C/EBPbeta (NF-IL6) gene. This representation was obtained by using the control key move, for including the gene in the displayed region. Subsequently, the result was saved in a PDF file. This was done by selecting the key marked PDF/PS, shown on the top of Fig. 3.
22. To obtain a sequence alignment of the conserved regions (Fig. 3), click on the area next to the track named conservation (see Note 3).
23. At the UCSC genome browser, the page that shows the map (Fig. 3) also provides the option of viewing that map in the Ensembl and NCBI browsers. On that page, the links are shown on the top bar (Fig. 3). Click on these options to view the map of the sequence of interest in these alternative browsers.
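Steps 3 to 7 above obtain a FASTA formatted sequence through the NCBI web pages. As a hedged alternative for readers who prefer a script, the following Python sketch builds a request to the NCBI E-utilities efetch service and checks that a piece of text is a single FASTA record, reduced to a continuous chain of nucleotides without numbering or blank spaces. The function names (efetch_fasta_url, parse_fasta, fetch_fasta) are hypothetical and chosen only for illustration, and the use of efetch with db=nuccore is an assumption about a convenient route; it is not part of the original procedure.

```python
"""Sketch: obtain and sanity-check a FASTA record for a GenBank accession."""
from __future__ import annotations

from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# IUPAC nucleotide codes accepted in the sequence body.
IUPAC = set("ACGTUNRYSWKMBDHV")


def efetch_fasta_url(accession: str) -> str:
    """Build an efetch URL that asks for a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # assumption: the nucleotide database is wanted
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def parse_fasta(text: str) -> tuple[str, str]:
    """Return (header, sequence) for a single-record FASTA string.

    Digits and blank spaces inside the body are dropped, so the sequence
    comes back as one continuous upper-case string.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA: the first line must start with '>'")
    header = lines[0][1:]
    body = "".join(lines[1:])
    seq = "".join(ch for ch in body if ch.isalpha()).upper()
    if not seq:
        raise ValueError("the FASTA record contains no sequence")
    unexpected = set(seq) - IUPAC
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return header, seq


def fetch_fasta(accession: str) -> tuple[str, str]:
    """Download and parse the record (requires network access)."""
    with urlopen(efetch_fasta_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_fasta() to contact NCBI.
    print(efetch_fasta_url("AF350408"))
```

The download is kept in its own function so that the format check can also be applied to a file saved by hand in step 4f before it is pasted or uploaded in steps 6 and 7.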
4. Notes
1. Opening a new window for each of the desired links is recommended. This would circumvent problems with losing the connection to the preceding page. The right-click option, for opening a new window to a link, is available on PCs that use Microsoft operating systems. This option might not be available on other operating systems.
2. Viewing the in-depth information can help you to evaluate whether the matches with the genomic DNA are significant.
3. Currently, multispecies alignment is provided for 30,000 bases or less. Therefore, to obtain an alignment, zoom in on the desired region. This works relatively well for viewing the conserved regions in the promoter regions of genes. To do so, scroll to the left or right, depending on the direction of the transcript. Identify the longest cDNA by including the track for known genes. Subsequently, zoom in on the 5' end of the gene, to bring the viewed region to 30,000 bases or less. Click on refresh. Then click on the track named conservation (Fig. 3). You will obtain alignments of the nucleotide sequences of the selected species.
From: Methods in Molecular Biology, vol 338: Gene Mapping, Discovery, and Expression:
Methods and Protocols Edited by: M Bina © Humana Press Inc., Totowa, NJ
2
Methods for Identifying and Mapping Recent Segmental and Gene Duplications in Eukaryotic Genomes
Razi Khaja, Jeffrey R MacDonald, Junjun Zhang,
and Stephen W Scherer
Summary
The aim of this chapter is to provide instruction for analyzing and mapping recent segmental and gene duplications in eukaryotic genomes. We describe a bioinformatics-based approach that uses computational tools to manage eukaryotic genome sequences in order to characterize and understand the evolutionary fates and trajectories of duplicated genes.
An introduction to bioinformatics tools and programs such as BLAST, Perl, BioPerl, and the GFF specification provides the necessary background to complete this analysis for any eukaryotic genome of interest.
Key Words: Bioinformatics; BLAST/MegaBLAST; gene duplication; gene ontology; genome assembly; genomic disorder; GFF (Generic Feature Format); homology; neofunctionalization; paralogous; Perl/BioPerl; pseudogene; RefSeq; RepeatMasker; segmental duplication; sequence alignments; subfunctionalization.
1 Introduction
With the completion of the human genome sequence and the increasing availability of whole genome shotgun sequences (WGS) for numerous other eukaryotic species, we are poised to begin to understand the complexity and dynamic nature of chromosomes. Segmental duplications are nearly identical segments of DNA at two or more sites in a genome; in human they comprise about 3.5 to 5% of the total DNA content (1,2). Segmental duplications also account for 1.2 to 2% of the mouse genome (3,4) and approx 3% of the rat genome (5). Segmental duplications (also called low copy repeats [LCRs]) can be predisposition sites that increase the opportunity for nonallelic homologous recombination, leading to deletion, inversion, or duplication of large segments of DNA (6). These structural alterations may lead to the gain or loss of dosage-sensitive genetic material and may result in a spectrum of diseases defined as genomic disorders (7–9).
The presence of segmental duplications is a common feature of many mammalian genomes, and their involvement in chromosome evolution and natural variation is an area of active investigation (10–12). Duplication of large segments of DNA can generate duplicate genes in whole (13) or in part (14), and may lead to an expanding repertoire of similar gene products. The identification of recent segmental duplications therefore gives us the ability to map the origin and fate of duplicate genes, which are a driving force in species evolution (see Note 1).
Here we define recent segmental duplications as paralogous regions of a genome having a length greater than 5000 nucleotides (nt) and having greater than 90% DNA sequence identity. We present a computational protocol for identifying and mapping recent segmental and gene duplications in eukaryotic genomes. The major procedures involved in identifying recent segmental and gene duplications include comparing genomic sequences using BLAST (15), parsing and filtering BLAST alignments, and mapping genes to segmental duplications to identify gene duplicates. We note that many of our methodologies have arisen in an ongoing initiative to map segmental duplications accurately in the human (2), chimpanzee, mouse (3), and other mammalian genomes, as displayed at publicly available websites (http://projects.tcag.ca/humandup and http://projects.tcag.ca/xenodup).
2 Materials
1 A modest-sized cluster-computer or super-computer with 4 GB of RAM per CPU running any variant of a UNIX or Linux operating system
2 Internet connection, ftp utilities (e.g., ftp, ncftp, wget)
3 Archiving utilities (e.g., unzip)
4 An assembled genome sequence of a eukaryotic organism that is lower case masked for repetitive elements
5 The BLAST suite of programs (particularly formatdb and MegaBLAST)
duplications in eukaryotic genomes, the methods summarize: (5) the procedure for performing sequence alignments of all possible pairs of chromosomes using MegaBLAST, (6) how to convert MegaBLAST alignments into Generic Feature Format (GFF), (7) the criteria for filtering GFF records, and (8) how to chain alignments together. Furthermore, we describe how to identify gene duplicates by (9) mapping RefSeq genes to segmental duplications and (10) using the Gene Ontology to characterize gene duplicates by function.
3.1 Prerequisites/Assumptions
To perform segmental duplication analysis of eukaryotic genomes, the reader needs access to a modest-sized cluster-computer or super-computer with a minimum of 4 GB of RAM available to each CPU (see Note 2), running any variant of a UNIX or Linux operating system (see Note 3). Competency in using UNIX command line utilities and programming in Perl is also a necessity (see Note 4). The BioPerl package (see Note 5) must also be available in the computing environment. Furthermore, the reader should be capable of using BioPerl to convert MegaBLAST alignment files into GFF records and should be familiar with the GFF version 3 specification (see Note 6).
3.2 Download Genome of Interest
This protocol requires that the genome sequence being targeted for the identification of segmental and gene duplications be assembled and masked for repetitive elements.
Although this protocol is applicable to all eukaryotic genomes (see Note 7), the mouse genome will be used as our example. The May 2004 mouse genome assembly (referred to as mm5 by UCSC or Build 33 by NCBI) can be downloaded from UCSC (http://genome.ucsc.edu) as a zip file by executing the following command:
% wget http://hgdownload.cse.ucsc.edu/goldenPath/mm5/bigZips/chromFa.zip
This zip file contains the mouse genome assembly with one FASTA file for each chromosome. Repetitive elements within each chromosome sequence have been identified with RepeatMasker (http://www.repeatmasker.org) and are represented in lower case letters; nonrepeating DNA sequences are shown in upper case letters. Once the genome has been downloaded, the zip file is uncompressed by executing the following command:
% unzip chromFa.zip
Uncompressing this file will extract one FASTA file for each chromosome sequence. For the mouse genome, this should extract the files chr1.fa to chr19.fa, chrX.fa, chrY.fa, and chrM.fa (mitochondrial DNA), as well as chr1_random.fa to chr19_random.fa, chrX_random.fa, chrY_random.fa, and chrUn_random.fa (see Note 8).
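Because the rest of the protocol depends on the repeats being lower case masked, it can be worth checking a downloaded file before going further. The following is a minimal sketch in Python (one of the languages listed in Note 4), not part of the published protocol; the function name soft_masked_fraction and the toy sequence are hypothetical and serve only as an illustration.

```python
def soft_masked_fraction(fasta_lines):
    """Return the fraction of sequence letters that are lower case (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy input standing in for the lines of a file such as chr7.fa.
example = [">chrTest", "ACGTacgtNNNN", "ttttACGT"]
print(soft_masked_fraction(example))
```

For a real chromosome, the lines of the open file can be passed in directly, for example soft_masked_fraction(open("chr7.fa")). A value of zero would suggest that the file is not soft-masked.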
3.3 Download and Install the BLAST Suite of Programs
To perform sequence alignments for identification of segmental duplications in the genome, download and install the BLAST suite of programs in your computing environment. The BLAST suite of programs is available from the NCBI as precompiled binary distributions or as source code. The precompiled binaries are available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ and are compiled for many operating systems and hardware architectures (see Note 9). Installation is a simple matter of downloading and then uncompressing the distribution for your computing environment. Documentation supplied with the BLAST suite of programs describes command line options for each of the utilities. In this protocol, the formatdb and MegaBLAST (16) command line tools are used to identify segmental duplications in the genome. Formatdb is used to create BLAST databases, and MegaBLAST is used to perform sequence alignments.
3.4 Create BLAST Databases, One for Each Chromosome
Once the genome has been downloaded and the BLAST suite has been installed, create BLAST databases for each of the chromosome FASTA files using the formatdb command line utility. The formatdb command line utility must be used to format a FASTA file such as chr7.fa into a BLAST database before it can be searched by MegaBLAST. The following command is an example of using formatdb to create a BLAST database:
% formatdb -i chr7.fa -p F
Executing this command will create the files chr7.fa.nhr, chr7.fa.nin, and chr7.fa.nsq, which collectively represent the BLAST database for mouse chromosome 7. This database will be searched by MegaBLAST in order to produce sequence alignments for the purpose of identifying segmental duplications in the genome. BLAST databases must be created in turn for the FASTA file of every chromosome sequence in the genome, including the pseudo chromosomes (see Note 8).
In the example above, “-i chr7.fa” specifies the name of the input file, and “-p F” specifies that the sequence contained within the file is nucleotide. Below is a detailed description of the command line options used:
formatdb 2.2.10 arguments:
-i Input file(s) for formatting (this parameter must be set)
[File In]
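Since one database is needed per FASTA file, the formatdb call shown above has to be repeated for every chromosome and pseudo chromosome. The sketch below, in Python, only builds the command lines and does not run them; the helper names mouse_fasta_files and formatdb_commands are hypothetical, and the file list simply follows the mm5 file names given in Subheading 3.2.

```python
def mouse_fasta_files():
    """File names expected after unzipping chromFa.zip for mm5 (see Subheading 3.2)."""
    names = [str(n) for n in range(1, 20)] + ["X", "Y"]
    files = [f"chr{n}.fa" for n in names] + ["chrM.fa"]
    files += [f"chr{n}_random.fa" for n in names] + ["chrUn_random.fa"]
    return files


def formatdb_commands(fasta_files):
    """Build one 'formatdb -i FILE -p F' argument list per FASTA file."""
    return [["formatdb", "-i", path, "-p", "F"] for path in fasta_files]


commands = formatdb_commands(mouse_fasta_files())
print(len(commands))
print(" ".join(commands[0]))
```

Each argument list could then be handed to subprocess.run(cmd, check=True) on a machine where formatdb is installed; that step is left out here so that the sketch runs without BLAST.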
3.5 Perform Sequence Alignments of All Possible Pairs
of Chromosomes Using MegaBLAST
The MegaBLAST program is used to perform sequence alignments because it was designed to identify long alignments efficiently between similar sequences. Since we have defined recent segmental duplications as long stretches of DNA (>5000 nt) having greater than 90% sequence identity, MegaBLAST is well suited to identifying these paralogous regions of the genome. After creating the BLAST databases for each chromosome, MegaBLAST is used to perform sequence alignments between all possible pairs of chromosomes. In other words, each FASTA file is compared with each of the BLAST databases (see Note 10).
The following command is an example of using MegaBLAST to find sequence alignments between mouse chromosome 7 and mouse chromosome 3:
% megablast -d chr7.fa -i chr3.fa -D 2 -F 'm' -U T -o chr7.3.blast
In the example above, the “-d chr7.fa” option specifies that MegaBLAST use the mouse chromosome 7 BLAST database as the subject of this comparison, and the “-i chr3.fa” option specifies mouse chromosome 3 as the query sequence. Sequence alignments are stored in the chr7.3.blast output file, as specified by the option “-o chr7.3.blast”, and the format of the output generated is “traditional BLAST output”, as specified by the “-D 2” option. Furthermore, “-U T” specifies that lower case letters in the query sequence should be recognized as repetitive elements. The “-F 'm'” option denotes that the MegaBLAST algorithm should not find word matches in the repetitive regions of the query sequence but should allow for extension of sequence alignments through these regions.
Below is a detailed description of the command line options that are required to perform sequence alignments using MegaBLAST to identify segmental duplications in a genome:
0 - alignment endpoints and score
1 - all ungapped segments endpoints
2 - traditional BLAST output
3 - tab-delimited one-line format [Integer]
a subject database and a query sequence of different chromosomes are used to identify interchromosomal segmental duplications (i.e., duplications that occur between different chromosomes). Executing MegaBLAST on a subject database and query sequence generates many sequence alignments. Not all of these represent sequences involved in segmental duplications, so further steps are required to convert, filter, and process these alignments based on a variety of criteria. These criteria are described in the sections below.
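The all-against-all comparison described in this section (N × N runs for N FASTA files, see Note 10) can be laid out as follows. This Python sketch only assembles the MegaBLAST command lines with the options used in the chr7 versus chr3 example; the helper names and the output naming rule (subject name, then query name without the chr prefix) are assumptions made to reproduce the file name chr7.3.blast.

```python
from itertools import product


def output_name(subject_fasta, query_fasta):
    """Name the output like the example: chr7.fa vs chr3.fa -> chr7.3.blast."""
    subject = subject_fasta[:-len(".fa")]
    query = query_fasta[:-len(".fa")]
    if query.startswith("chr"):
        query = query[len("chr"):]
    return f"{subject}.{query}.blast"


def megablast_commands(fasta_files):
    """Build one MegaBLAST command per ordered (subject, query) pair, self pairs included."""
    commands = []
    for subject, query in product(fasta_files, repeat=2):
        commands.append([
            "megablast", "-d", subject, "-i", query,
            "-D", "2", "-F", "m", "-U", "T",
            "-o", output_name(subject, query),
        ])
    return commands


for cmd in megablast_commands(["chr7.fa", "chr3.fa"]):
    print(" ".join(cmd))
```

Because every run is independent, the resulting list is also a natural unit for distributing work across the nodes of a cluster (see Note 3).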
3.6 Convert MegaBLAST Alignments Into GFF Format
In the previous step, MegaBLAST was used to generate traditional BLAST output for all pairs of chromosomes. Sequence alignments in this format are extremely informative since they show detailed information about homologous DNA, including the locations of nucleotide mismatches and small insertions and deletions (Fig 1).
However, it is difficult to identify duplications programmatically from BLAST results in this format, as this output is generated for visual inspection. In order to identify segmental duplications from BLAST results without loss of information, it is necessary to transform traditional BLAST output into a tabular format. The current Generic Feature Format version 3 (GFF3) specification (http://song.sourceforge.net/gff3.shtml) is a widely accepted tabular format for describing genes and other features associated with DNA, RNA, and protein sequences. The BioPerl project (http://www.bioperl.org) supports the parsing of different output formats, including the conversion of traditional BLAST output into GFF3.
Using the Bio::SearchIO module that is part of the BioPerl package, the BLAST alignment files for each pair of chromosomes must be converted into GFF3 records. Below is an example of the result of converting the alignment shown in Fig 1 into a GFF3 record:
chr7 UCSC_hg17 match 61612 61790 0.0 - Target=chr3 36022 36201;Gap=M6 I1M25 I2 M90 D3 M53;percentId=94.96;alnLength=2123;matches=2016;gaps=24;bitScore=3336;rawScore=1683
To understand how to generate records in GFF3 format, the reader should understand the GFF3 specification. This will enable the user to apply the Bio::SearchIO module to convert BLAST alignment files into this output. This format allows storage of all information from the traditional BLAST output, including subject sequence start and stop coordinates, query sequence start and stop coordinates, e-value, strand, percent identity, alignment length, matched nucleotides, gaps, bit score, raw score, and detailed alignment information.
3.7 Filter GFF Records Based on Many Criteria
After converting the traditional BLAST alignments into GFF format, some alignments are excluded, since not all are components of recent segmental duplications. To identify sequences meeting a stringent categorization of being a “recent segmental duplication,” GFF records are filtered based on the criteria described below.
Fig 1 Traditional BLAST output as generated by MegaBLAST
3.7.1 Filter Sequence Alignments With Less Than 90 Percent Identity
Recent segmental duplications are defined as paralogous sequences that share greater than 90% sequence identity. Remove GFF records in which the percent identity attribute does not meet this minimum percent identity cutoff. This filtering criterion is applicable to both inter- and intrachromosomal sequence alignments.
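The percent identity filter can be sketched as below. This is an illustration in Python rather than the Perl and BioPerl route the protocol assumes. It takes for granted that each record is a tab-delimited, nine-column GFF3 line and that percent identity is stored in an attribute named percentId, as in the example record of Subheading 3.6. The two input lines and the function names are made up for the demonstration.

```python
def parse_gff3_line(line):
    """Split a tab-delimited GFF3 line into its fields and an attribute dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-delimited GFF3 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for field in attrs.split(";"):
        if field:
            key, _, value = field.partition("=")
            attributes[key] = value
    return {
        "seqid": seqid,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def passes_identity(record, min_percent=90.0):
    """Keep records whose percentId attribute is greater than the cutoff."""
    return float(record["attributes"]["percentId"]) > min_percent


# Hypothetical records; only the attribute names follow the example in the text.
lines = [
    "chr7\texample\tmatch\t61612\t61790\t0.0\t-\t.\tTarget=chr3 36022 36201;percentId=94.96",
    "chr7\texample\tmatch\t90000\t90500\t0.0\t+\t.\tTarget=chr3 500 1000;percentId=88.10",
]
records = [parse_gff3_line(line) for line in lines]
kept = [r for r in records if passes_identity(r)]
print(len(kept), kept[0]["start"], kept[0]["end"])
```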
3.7.2 Filter Suboptimal Sequence Alignments
Suboptimal sequence alignments occur when one sequence alignment is redundant in the sense that its subject and query elements are completely covered or spanned by another alignment. Remove the GFF record with the smaller span, which is considered a suboptimal alignment. This filtering step is applicable to both inter- and intrachromosomal sequence alignments.
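One way to express this containment test is sketched below in Python. Each alignment is reduced to a tuple (subject start, subject end, query start, query end) with start no greater than end, which is a simplification of the full GFF3 record, and the quadratic comparison is chosen for clarity rather than speed. The function names and coordinates are hypothetical.

```python
def contains(outer, inner):
    """True if outer covers inner on both the subject and the query sequence."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])


def drop_suboptimal(alignments):
    """Remove alignments that are completely spanned by another alignment."""
    # Visit larger spans first, so any covering alignment is seen before
    # the alignments it covers (exact duplicates collapse to one copy).
    ordered = sorted(
        alignments,
        key=lambda a: (a[1] - a[0]) + (a[3] - a[2]),
        reverse=True,
    )
    kept = []
    for aln in ordered:
        if not any(contains(k, aln) for k in kept):
            kept.append(aln)
    return kept


alignments = [
    (100, 900, 1000, 1800),    # covers the next alignment on both sequences
    (200, 400, 1100, 1300),    # suboptimal: spanned by the first alignment
    (5000, 5600, 7000, 7600),  # independent alignment
]
print(drop_suboptimal(alignments))
```

In practice the comparison would be made only among records that share the same subject and query chromosome pair.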
3.7.3 Filter Identical Sequence Alignments
This filtering step is only applicable to intrachromosomal sequence alignments. Exclude self-self matches, whose GFF records have subject sequence coordinates that are identical to the query sequence coordinates.
3.8 Identify Segmental Duplications
by Chaining Alignments Together
To define the boundaries of segmental duplications, alignments whose coordinates are monotonically increasing are chained together to form larger contiguous alignments. This compensates for short and fragmented alignments, which have arisen because of insertion or deletion events that have modified paralogous copies of DNA. Since we defined segmental duplications as regions of the genome having length greater than 5000 nt, we need to filter out chained alignments that do not meet this minimum length requirement.
1 Sort GFF records by subject and query coordinates
2 For records of the same subject and query chromosome pair, if adjacent sequence alignments are separated by less than 3000 nt, chain the alignments together
3 Remove chained alignments that are smaller than 5000 bp
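The three steps above can be sketched as a simple greedy pass. The Python code below is a simplified illustration and not the authors' implementation: it assumes alignments in the same orientation, treats the 3000 nt limit as applying to both the subject and the query gap, and measures chain length on the subject sequence. All names and coordinates are hypothetical.

```python
from collections import defaultdict

MAX_GAP = 3000      # adjacent alignments closer than this are chained
MIN_LENGTH = 5000   # chains shorter than this are discarded


def chain_alignments(alignments, max_gap=MAX_GAP, min_length=MIN_LENGTH):
    """Chain alignments given as (subj_chr, query_chr, s_start, s_end, q_start, q_end)."""
    by_pair = defaultdict(list)
    for subj_chr, query_chr, s_start, s_end, q_start, q_end in alignments:
        by_pair[(subj_chr, query_chr)].append((s_start, s_end, q_start, q_end))

    chains = []
    for (subj_chr, query_chr), alns in by_pair.items():
        alns.sort()  # step 1: sort by subject, then query coordinates
        current = list(alns[0])
        for s_start, s_end, q_start, q_end in alns[1:]:
            increasing = s_start >= current[0] and q_start >= current[2]
            close = (s_start - current[1] < max_gap
                     and q_start - current[3] < max_gap)
            if increasing and close:
                # step 2: extend the current chain
                current[1] = max(current[1], s_end)
                current[3] = max(current[3], q_end)
            else:
                chains.append((subj_chr, query_chr, *current))
                current = [s_start, s_end, q_start, q_end]
        chains.append((subj_chr, query_chr, *current))

    # step 3: drop chains whose subject span is below the minimum length
    return [c for c in chains if c[3] - c[2] + 1 >= min_length]


example = [
    ("chr7", "chr3", 1000, 3000, 50000, 52000),
    ("chr7", "chr3", 3500, 7000, 52400, 55900),
    ("chr7", "chr3", 20000, 21000, 80000, 81000),
]
print(chain_alignments(example))
```

In the toy data the first two alignments are 500 nt apart on the subject and 400 nt apart on the query, so they merge into one chain of about 6 kb. The third alignment stays separate and is removed by the length filter.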
This step concludes the identification of large regions of the genome involved in recent segmental duplications. Large segmental duplications can often contain duplicate genes and/or be implicated in genomic disease and structural rearrangements; hence they have an inherent biological interest. Subheadings 3.9 and 3.10 discuss mapping genes to segmental duplications, identifying duplicate gene pairs, and characterizing gene duplications using the Gene Ontology.
3.9 Map RefSeq Genes to the Mouse Genome
and to Segmental Duplications
To identify and characterize recent gene duplicates in the mouse genome, you will first need to obtain the most current curated gene data set, map the location of each gene to the genome of interest, and perform a positional colocalization of genes and duplications to detect gene paralogs.
3.9.1 Obtain RefSeq Gene Set and Mapping Location in the Mouse Genome
1 Obtain the mouse gene data set (refGene.txt.gz) from the University of California
at Santa Cruz (http://hgdownload.cse.ucsc.edu/goldenPath/mm5/database/)
2 Extract the gene mapping information from the above file, and store it in GFF3 format. A description of the refGene.txt table format from UCSC can be found at http://genome.ucsc.edu/goldenPath/gbdDescriptions.html#GenePredictions
3.9.2 Identify Recent Gene Duplicates
Identifying recent gene duplications generated via a segmental duplication event can be accomplished by localizing the genes that lie within the boundaries of the duplications detected and determining the paralogous gene pair in the corresponding duplicon. The genes may be duplicated in whole or in part along with the surrounding genomic DNA.
1 Identify the genes that reside completely within the defined boundary of the duplications (whole gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file and identify those genes that fall completely within the coordinates of the duplication.
2 Identify the genes that lie partially within the defined boundary of the duplication (partial gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file and identify those genes that overlap one or both boundaries of the duplication (as defined by either the feature start, the feature end, or those transcripts that span the entire duplication).
3 Now that you have found all RefSeq genes that reside within or span the boundaries of segmental duplications, you will need to search for the paralogous gene pair within the related segmental duplication loci. The duplicated gene may be supported by a curated RefSeq mRNA, an unannotated full-length mRNA, or an expressed sequence tag (EST).
a Download EST (all_est.txt.gz) and mRNA (all_mrna.txt.gz) data sets from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/mm5/database/)
b Extract the EST and mRNA mapping information from the above files, and store it in GFF3 format. A description of the all_est.txt and all_mrna.txt table formats from UCSC can be found at http://genome.ucsc.edu/goldenPath/gbdDescriptions.html#GenePredictions
c Identify the transcripts (EST and mRNA) that map completely within the defined boundary of the duplications (whole gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file, and identify those EST or mRNA sequences that fall completely within the coordinates of the duplication.
d Identify the transcripts (EST and mRNA) that are located partially within the defined boundary of the duplication (partial gene duplication). Compare the transcriptional start and end coordinates stored in the GFF3 file and identify those EST or mRNA sequences that overlap one or both boundaries of the duplication (as defined by either the feature start, the feature end, or those transcripts that span the entire duplication).
4 You now have a list of all RefSeq genes and EST and mRNA sequences that reside within duplications. This data set will represent all transcribed sequences that are candidates for recent gene duplication events. To determine the relationship between duplicate genes, a pairwise comparison of all transcripts within related duplications is required.
a To determine whether two transcripts are related (i.e., a duplicated gene pair), you will need to BLAST pairs of transcript sequences
b Based on our criteria, genes that share greater than 90% DNA sequence similarity for greater than 50% of the length of the transcript can be categorized as a duplicated gene pair
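The coordinate comparisons in steps 1 and 2 and the pair criterion in step 4b reduce to a few interval and threshold tests. The Python sketch below is one possible reading of them. The function names are hypothetical, coordinates are assumed to be on the same chromosome with start no greater than end, and "length of the transcript" is taken to mean the aligned length relative to the transcript length.

```python
def classify_gene(tx_start, tx_end, dup_start, dup_end):
    """Classify a transcript relative to one segmental duplication interval."""
    if tx_end < dup_start or tx_start > dup_end:
        return "outside"
    if dup_start <= tx_start and tx_end <= dup_end:
        return "whole"    # lies completely within the duplication
    return "partial"      # overlaps a boundary or spans the whole duplication


def is_duplicated_pair(percent_identity, aligned_length, transcript_length):
    """Apply the >90% identity over >50% of transcript length criterion."""
    return (percent_identity > 90.0
            and aligned_length > 0.5 * transcript_length)


print(classify_gene(1500, 2500, 1000, 5000))
print(classify_gene(500, 1500, 1000, 5000))
print(is_duplicated_pair(95.0, 800, 1200))
```

The same classify_gene test applies unchanged to the EST and mRNA records of steps 3c and 3d.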
3.10 Functional Characterization of Genes by Gene Ontology
Duplicate genes may undergo pseudogenization, subfunctionalization, or neofunctionalization (17). To identify the putative function and fates of duplicate genes, an in silico analysis of gene function should be undertaken using the Gene Ontology (GO) resource (18).
1 Obtain the geneID for each duplicated gene from the NCBI website (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/) by extracting the ID from the gene2refseq.gz file. The geneID is a unique NCBI identifier (previously Locus Link ID) for each curated RefSeq entry. The GO database can be searched by this unique ID to extract precomputed gene ontology information. Additional information on the GO project is available at http://www.geneontology.org/
2 Using the unique geneID, assign each gene its GO annotations from each of the three GO taxonomies (biological process, cellular component, and molecular function) by utilizing the GO Tree Machine (http://genereg.ornl.gov/gotm/). You will need to create an account. (Registration is free and will allow the user to save and retrieve analyses.)
3 Create a text file containing the list of geneIDs
a Log onto the GO Tree Machine site, and give the analysis a relevant name for future access
b From the drop-down menu for “Select the ID type in your file,” select Locus Link ID (same as geneID)
c For “What kind of analysis do you want to do?” select “single gene list” to perform a functional characterization of the duplicated genes
d Upload the text file with the list of geneIDs created previously and select “MAKE TREE”
e Alternatively, if you select “interesting gene list vs reference gene list” in step 3c, you can perform a statistical analysis of duplicated genes to detect GO terms that are relatively enriched compared with the full RefSeq data set. You will need to choose the “MOUSE” reference list
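Steps 1 and 3 both involve turning a set of RefSeq accessions into a plain list of geneIDs. The Python sketch below shows one way this might be done. It assumes that gene2refseq is tab-delimited with the geneID in the second column and the RNA accession (with version suffix) in the fourth, which should be checked against the header line of the downloaded file. The accessions and IDs in the example are placeholders, not real database entries.

```python
def gene_ids_for_refseqs(gene2refseq_lines, refseq_accessions):
    """Collect the geneIDs whose RNA accession is in the given set."""
    wanted = set(refseq_accessions)
    gene_ids = set()
    for line in gene2refseq_lines:
        if line.startswith("#"):
            continue  # header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4:
            continue
        accession = cols[3].split(".")[0]  # drop the version suffix
        if accession in wanted:
            gene_ids.add(cols[1])
    return sorted(gene_ids, key=int)


# Placeholder rows in the assumed column order: tax_id, GeneID, status, RNA accession.
rows = [
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version",
    "10090\t222\tVALIDATED\tNM_900002.1",
    "10090\t111\tREVIEWED\tNM_900001.2",
    "10090\t333\tPROVISIONAL\tNM_900003.1",
]
ids = gene_ids_for_refseqs(rows, ["NM_900001", "NM_900002"])
gene_list_text = "\n".join(ids) + "\n"  # one geneID per line, ready to save and upload
print(gene_list_text, end="")
```

Writing gene_list_text to a file gives the text file that is uploaded in step 3d.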
4 Notes
1 Gene duplication allows for relaxed selection owing to redundancy, and this may allow for processes such as subfunctionalization, neofunctionalization, and pseudogenization. Subfunctionalization occurs when two gene copies specialize to perform complementary functions. Neofunctionalization involves gene duplication whereby one of the genes acquires a new biochemical function. Pseudogenization occurs when one of the duplicated genes acquires mutations rendering it nonfunctional.
2 Since chromosome sequence FASTA files are quite large and range in size from 50
to 250 Mb, a significant amount of computational power and memory is required
to perform the sequence alignments using MegaBLAST
3 We will explain how to perform this analysis in a serial manner. It is up to the reader to understand the nuances of their particular cluster or supercomputing installation in order to parallelize the algorithm and achieve the desired results in less time. This means understanding whether using MPI or forking and executing processes is suitable.
4 This protocol can be written in any programming language such as Perl, Java, Python, Ruby, C, or C++. However, algorithms in bioinformatic applications are typically written in Perl.
5 The BioPerl package is available from http://www.bioperl.org/
6 The current Generic Feature Format version 3 (GFF3) specification is available at http://song.sourceforge.net/gff3.shtml
7 Assembled genomes of several species, such as human, rat, chimpanzee, dog, chicken, and others, are available from the download page of the University of California at Santa Cruz (UCSC), http://hgdownload.cse.ucsc.edu/downloads.html
8 The main chromosome sequence assemblies are found in the chrN.fa files, where N is the name of the chromosome. The chrN_random.fa files are pseudo chromosomes containing sequences that are not yet finished or cannot be localized with certainty at any particular place in the chromosome assembly. The chrUn_random.fa file is another pseudo chromosome containing clones that have not been localized to a particular chromosome in the genome. These pseudo chromosomes should not be overlooked since they can often contain sequences that are involved in segmental duplications and have not been included in the main genome assembly, perhaps because of their duplicated nature.
9 If the precompiled binaries do not match your computing environment, source code is available from NCBI (ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz). The instructions detail how to compile and install this suite of tools for your particular computing environment.
10 A total of N² sequence alignments are performed for all sequence files, where N is the number of files in the genome (i.e., both chr1.fa vs the chr2 BLAST database and chr2.fa vs the chr1 BLAST database). Sequence comparisons are required for all chromosomes in the genome, including the pseudo chromosomes.
References
1 Bailey, J. A., Gu, Z., Clark, R. A., et al. (2002) Recent segmental duplications in the human genome. Science 297, 1003–1007.
2 Cheung, J., Estivill, X., Khaja, R., et al. (2003) Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol 4, R25.
3 Cheung, J., Wilson, M. D., Zhang, J., et al. (2003) Recent segmental and gene duplications in the mouse genome. Genome Biol 4, R47.
4 Bailey, J. A., Church, D. M., Ventura, M., Rocchi, M., and Eichler, E. E. (2004) Analysis of segmental duplications and genome assembly in the mouse. Genome Res 14, 789–801.
5 Tuzun, E., Bailey, J. A., and Eichler, E. E. (2004) Recent segmental duplications in the working draft assembly of the brown Norway rat. Genome Res 14, 493–506.
6 Lupski, J. R. (1998) Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 14, 417–422.
7 Stankiewicz, P. and Lupski, J. R. (2002) Genome architecture, rearrangements and genomic disorders. Trends Genet 18, 74–82.
8 Eichler, E. E. (2001) Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet 17, 661–669.
9 Ji, Y., Eichler, E. E., Schwartz, S., and Nicholls, R. D. (2000) Structure of chromosomal duplicons and their role in mediating human genomic disorders. Genome Res 10, 597–610.
10 Iafrate, A. J., Feuk, L., Rivera, M. N., et al. (2004) Detection of large-scale variation in the human genome. Nat Genet 36, 949–951.
11 Armengol, L., Pujana, M. A., Cheung, J., Scherer, S. W., and Estivill, X. (2003) Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Hum Mol Genet 12, 2201–2208.
12 Bailey, J. A., Baertsch, R., Kent, W. J., Haussler, D., and Eichler, E. E. (2004) Hotspots of mammalian chromosomal evolution. Genome Biol 5, R23.
13 Ohno, S. (1970) Evolution by Gene Duplication. Springer, New York, NY.
14 Buiting, K., Korner, C., Ulrich, B., Wahle, E., and Horsthemke, B. (1999) The human gene for the poly(A)-specific ribonuclease (PARN) maps to 16p13 and has a truncated copy in the Prader-Willi/Angelman syndrome region on 15q11–q13. Cytogenet Cell Genet 87, 125–131.
15 Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J Mol Biol 215, 403–410.
16 Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7, 203–214.
17 Prince, V. E. and Pickett, F. B. (2002) Splitting pairs: the diverging fates of duplicated genes. Nat Rev Genet 3, 827–837.
18 Ashburner, M., Ball, C. A., Blake, J. A., et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet 25, 25–29.
3
Identification and Mapping of Paralogous Genes
on a Known Genomic DNA Sequence
Minou Bina
Summary
The completion of whole genome sequencing projects offers the opportunity to examine the organization of genes and to discover evolutionarily related genes in a given species. For beginners in the field, this chapter provides, through a specific example, a step-by-step procedure for identifying paralogous genes using the genome browser at UCSC (http://genome.ucsc.edu/). The example describes the identification and mapping, in the human genome, of the paralogs of TCF12/HTF4. The example identifies TCF3 and TCF4 as paralogs of the TCF12/HTF4 gene. The example also identifies a related sequence, corresponding to a pseudogene, in one of the introns of the JAK2 gene. The procedure described should be applicable to the discovery and creation of maps of paralogous genes in the genomic DNA sequences that are available at the genome browser at UCSC.
Key Words: The Human Genome Project; mapping of gene families; gene discovery.
1 Introduction
Paralogs are genes that appear in more than one copy in the genome of a given organism (1). Paralogs arise from gene duplication events. If it is advantageous, duplicated genes evolve independently to produce distinct but related proteins. This process often involves specialization of paralogous genes into specific functions (1). The evolution of paralogous genes can generate developmental and physiological novelties by changing the patterns of regulation of these genes, by changing the functions of the proteins they encode, or by both (1).
From the complete genomic sequence of a given species, it is possible to identify the paralogous genes in that species. This chapter describes an example of how to map and obtain a graphical representation of paralogous genes in a genomic DNA. The example uses the genome browser at the University of California at Santa Cruz (UCSC) (2,3). In the analysis, paralogy is defined on the basis of significant scores obtained for global alignments of amino acid sequences. This can be contrasted with local alignments, which are often utilized for the discovery of conserved motifs in the amino acid sequences of proteins (see, for example, ref. 4). The example given provides a relatively simple case, a good starting point for a beginner in the field. More complex cases would require additional tool sets. As an example, see the publication that describes how to explore relationships and mine data with the browser at UCSC (5).
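The contrast between global and local alignment can be illustrated with a toy sketch. The code below is not the method used by BLAT or by the UCSC browser. It is a minimal implementation of the textbook Needleman-Wunsch (global) and Smith-Waterman (local) scoring recurrences, and the scoring values (match +1, mismatch -1, gap -2) and the two short peptide strings are assumptions chosen only for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Hypothetical sequences: the second shares only a short motif (KRL).
query = "MKRLQE"
motif_only = "GGGGKRLGGGG"
print(global_score(query, query), local_score(query, query))
print(global_score(query, motif_only), local_score(query, motif_only))
```

With these toy inputs, a sequence that shares only a short motif with the query still receives a positive local score, while its global score is penalized for every unmatched position. This is why a high score over an extended region is the more useful signal when looking for paralogs.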
mapping of paralogous genes, the browser at UCSC is used here since it provides numerous tools for data access and visualization (2,3).
3.1 Using the Amino Acid Sequence of a Protein
as Query to Identify Potential Paralogous Genes
1. In the browser at UCSC (http://genome.ucsc.edu/), use the BLAT sequence alignment tool (7) for locating genes that might be paralogous to the gene of interest. To access BLAT, in the genome browser click on BLAT, one of the options listed on the left side of the page. You will obtain a query box for pasting the amino acid sequence of a protein.
2. If you want to analyze a predicted protein sequence that was compiled in your lab, you should convert it to FASTA format. In this format, the sequence is presented as a continuous chain of amino acids, without any numbering, blank spaces, or annotation (Fig. 1).
3. If you know the accession number for a DNA or a protein sequence of interest, perform the following steps to obtain a FASTA formatted file from GenBank (8,9). In the example shown below, the accession number of a DNA file is used to obtain a FASTA formatted file for the corresponding protein.
d. In the query box next to for, type the known accession number for the DNA sequence of interest. As an example, type BK001049. This accession number identifies the nucleotide sequence of HTF4c, one of the spliced transcripts of the human TCF12/HTF4 gene (10). After typing the accession number in the query box, click on go. You will obtain a page that includes the accession number and a description of the DNA sequence file.
e. On the right side of the accession number, click on link. On the pull-down menu, select protein.
f. Above the accession number of the retrieved protein sequence file, you will find the word “report,” in red letters. On the pull-down menu, click on FASTA. You will obtain the FASTA format of the protein sequence.
g. Copy the entire sequence.
h. Paste the sequence in the BLAT query box at the UCSC browser, described
6. Go to the pull-down menu under Query type and select protein.
7. For the other variables (score and output type), use the default values.
8. Finally, click on submit.
9. Upon completion of the BLAT search, you will receive a table listing the results (Fig. 2).
10. Examine the column tagged score. You will find the highest score (2100) for an extended region (positions 1–706), with 100% sequence identity to the query sequence analyzed by BLAT. The second and third highest scores (512 and 316) also reflect global alignments, corresponding to positions 5 to 679 and to positions 4 to 681, respectively (Fig. 2). The fourth score might or might not be significant; therefore, it should also be analyzed. The other scores correspond to relatively short local alignments and therefore do not appear to be significant. This conclusion can be deduced by examining the information provided in detail (Fig. 2). Therefore, next to each sequence, right-click on details to open the link in a new window and obtain useful information (see Note 2).
Fig. 1. Example of a FASTA formatted protein sequence.
11. First, examine the details for the sequence with the highest score, the first line in Fig. 2 (see Note 1). In the details for that line, you will obtain the position of the submitted sequence on human chromosome 15. Also, you will obtain the submitted amino acid sequence with regions highlighted in different colors. These regions might correspond to spliced sites in the DNA. On that page, scroll down to view the nucleotide positions of the exons and the splice sites in the genomic DNA. Scroll further down to examine the predicted amino acid sequence encoded by the exons of the gene.
12. Next, examine the details for the sequence with the second highest score, the second line in Fig. 2. You will obtain the position of a genomic DNA region, with a predicted amino acid sequence that shows similarity to the sequence analyzed by BLAT. You will find that the genomic DNA is on human chromosome 18. Also, you will obtain the similarity of the predicted sequence to the submitted amino acid sequence. The regions exhibiting sequence similarity are highlighted in different colors. The result indicates global similarities over an extended region.
13. Scroll down the page to view the genomic positions of the exons in the gene on chromosome 18. Scroll down to view the results of side-by-side alignments.
14. Next, examine the details for the sequence with the third highest score (the third line in Fig. 2). As detailed above in step 12, you will obtain the position of a genomic DNA region with a predicted amino acid sequence that shows similarity to the sequence analyzed by BLAT. The genomic DNA is on human chromosome 19. As described above in step 12, you can view the blocks that show similarity to the query sequence. These blocks are highlighted in different colors. Again, the result indicates global similarities over an extended region. As detailed above in step 13, by scrolling down, you can obtain genomic positions of the exons of the gene in the genomic DNA (in this case chromosome 19) and view the results of side-by-side alignments.
Fig. 2. A partial listing of the result of the BLAT search.
15. Finally, examine the details for the other sequences listed in Fig. 2. The details for the sequence on the fourth line indicate a global alignment that might be significant. However, as shown in Subheading 3.2., steps 10 and 11, you will find that the sequence corresponds to a pseudogene. The details for the sequence on the fifth line identify the same genomic region obtained from the details for chromosome 19, the third line. The details describing the sequence on the sixth line reveal relatively short local alignments with a protein sequence predicted for a gene on chromosome 20. The sequence matches with low scores are unlikely to correspond to paralogous genes.
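The reasoning used in steps 10 to 15 (rank the hits by score, then ask whether each alignment spans most of the query or only a short stretch) can be sketched in a few lines of Python. This is only an illustration and not part of the BLAT tool. The first three hits use the scores and query positions quoted in step 10, with the chromosomes named in steps 11 to 14. The chromosome 20 entry uses a made-up score and made-up positions to stand in for a short local match. The 90% coverage cutoff and all names (`Hit`, `query_coverage`, `triage`) are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chrom: str
    score: int
    q_start: int  # first aligned query position (1-based)
    q_end: int    # last aligned query position (inclusive)


def query_coverage(hit, query_length):
    """Fraction of the query spanned by the alignment."""
    return (hit.q_end - hit.q_start + 1) / query_length


def triage(hits, query_length, min_coverage=0.9):
    """Rank hits by score and label each as 'global-like' or 'local'.

    min_coverage is an illustrative cutoff, not a BLAT parameter.
    """
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    return [
        (h, "global-like" if query_coverage(h, query_length) >= min_coverage
         else "local")
        for h in ranked
    ]


QUERY_LENGTH = 706  # the query spans positions 1-706 (step 10)

HITS = [
    Hit("chr15", 2100, 1, 706),  # the query gene itself (step 11)
    Hit("chr18", 512, 5, 679),   # second highest score (step 12)
    Hit("chr19", 316, 4, 681),   # third highest score (step 14)
    Hit("chr20", 40, 600, 650),  # hypothetical values for a short local match
]

for hit, label in triage(HITS, QUERY_LENGTH):
    coverage = query_coverage(hit, QUERY_LENGTH)
    print(f"{hit.chrom}\tscore={hit.score}\tcoverage={coverage:.2f}\t{label}")
```

A span-based label like this is only a first pass. As step 15 shows, a hit with an apparently global alignment can still turn out to be a pseudogene, so the details page and the genomic context need to be inspected in every case.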
3.2 Mapping and Viewing the Chromosomal Positions
of the Candidate Paralogous Genes
1. The BLAT report (Fig. 2) includes links for viewing the genomic locations of candidate paralogous genes in the browser at UCSC.
2. First, obtain a map of the query sequence on the genomic DNA. To do so, in the BLAT results (Fig. 2) right-click on the browser link, on the left side of the first line, to open this link in a new window. The first line contains the highest score and indicates 100% sequence identity to the query.
3. Examine the browser page closely (see Note 3). You will find that the map provides the genomic position of the gene encoding TCF12/HTF4, on human chromosome 15 (11). Note that the browser offers an extensive list of options from which you can choose for viewing and analyzing the map. For example, on the top, you can use the left and right arrows to move to the flanking regions in the map. You can use the zoom buttons (on the top right of the page) to zoom in, to obtain an expanded view, or zoom out, to include additional sequences in the map (Fig. 3A).
4. Explore the options that are listed below the graph (below the bar indicating mapping and sequencing tracks), to choose what you want to include in the graph. The options are extensive. You can choose options to create tracks for viewing additional details (2). The options include creating a track for reference sequences (RefSeq). Each time you choose an option, or a set of options, click on the refresh button.
5. On the map displayed, the arrows on the tracks corresponding to known genes provide the direction of transcription (Fig. 3). Click on one of these tracks to obtain information about that track and useful links that could help with data analysis and evaluation (see Note 3). For example, click one of the tracks labeled TCF12. You will obtain a page that includes the accession number for the transcript that corresponds to that track. Scroll down the page to obtain additional information about the gene.
6. To obtain an output of the graph showing the map, for your record or for publication, click on PDF/PS file. In that link, you will be able to save the plot in a PDF file or a PostScript file. Figure 3 displays examples of outputs obtained from PDF files.