Supplement to Nature Genetics • September 2002
A user’s guide to the human genome
doi:10.1038/ng964
Cover art by Darryl Leja
Power to the people
Andreas D Baxevanis & Francis S Collins
A user’s guide to the human genome
Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins
& Andreas D Baxevanis
in this interval be identified? What BAC clones cover that particular region?
Question 4
A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?
Question 5
Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively spliced transcripts?
How would an investigator easily find compiled information describing the structure of a gene of interest? Is it possible to obtain the sequence of any putative promoter regions?
Question 11
An investigator has identified and cloned a human gene, but no corresponding mouse ortholog has yet been identified. How can a mouse genomic sequence with similarity to the human gene sequence be retrieved?
A user has identified an interesting phenotype in a mouse model and has been able to narrow down the critical region for the responsible gene to approximately 0.5 cM. How does one find the mouse genes in this region?
There was a time, not too long ago, when the wisdom of genome-sequencing projects was up for discussion. Would they be too expensive, draining funds from other areas of the life sciences? Would they be worth the trouble? Not much more than 15 years have passed since those early debates, and the importance of sequenced genomes to biology and medicine has now gained wide acceptance. This is in part owing to the relatively rapid fall in the cost of sequencing, followed by the undeniably important insights gained from the annotation of several bacterial genomes, and those of a few of our favorite eukaryotes. The news has been so relentlessly upbeat that one might even have expected some ‘genome fatigue’ to set in, especially given the saturation coverage of the publication of the drafts of the human genome sequence 18 months ago. Not so, however; witness the recent jockeying by different groups for inclusion of ‘their’ model organism in the next round of sequencing projects. The honeymoon goes on.
And yet there are important issues to be addressed. One is the concern surrounding any bestseller: that it will have far fewer actual readers than one might expect. At first glance, this would seem not to apply to the human genome. After all, one is hard pressed these days to pick up a copy of Nature Genetics, or any genetics journal, and not find evidence that sequenced genomes inform many of the most important advances. A survey published last year by the Wellcome Trust, however, found that only half of the researchers who were using sequence data were fully conversant with the services provided by the freely accessible databases.
There is also the concern that genome sequencers might be victims of their own success. As computational biologist David Roos recently put it, “We are swimming in a rapidly rising sea of data…how do we keep from drowning?” And if geneticists and bioinformaticians are struggling to stay afloat, what of the non-geneticists who are eager to exploit the sequences but are relative newcomers to the tools needed to navigate all of this information?
It is with these questions in mind that we present A User’s Guide to the Human Genome. Written by Tyra Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis Collins and Andreas Baxevanis of the National Human Genome Research Institute (NHGRI), this peer-reviewed how-to manual guides the reader through some of the basic tasks facing anyone whose work might be facilitated by an improved understanding of the online resources that make sense of annotated genomes. The directors of these online resources (Ewan Birney of Ensembl, David Haussler of the University of California, Santa Cruz and David Lipman of the National Center for Biotechnology Information) have served as advisors during the development of this guide, ensuring a balanced and accurate treatment of their respective web portals. The online version of the guide will also evolve, with an initial update scheduled for April 2003.
As noted by Harold Varmus in his eloquent perspective on A User’s Guide and the public databases it examines, one of the important legacies of the Human Genome Project is its ethos of open access to the data. In this spirit, and with the generous sponsorship of the NHGRI and the Wellcome Trust, the online version of this supplement will be freely available on the Nature Genetics website.
Alan Packer
Nature Genetics
Spreading the word
Power to the people
doi:10.1038/ng962
The National Human Genome Research Institute of the National Institutes of Health is delighted to sponsor this special supplement of Nature Genetics. The primary aim of this supplement is to provide the reader with an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium, as well as data found in other publicly available genome databases. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available and highlighting the most common types of questions that can be asked by searching and analyzing genomic databases. These examples, which have been set in a variety of biological contexts, provide step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. It is hoped that readers will grow in confidence and capability by working through the examples, understanding the underlying concepts, and applying the strategies used in the examples to advance their own research interests.
One of the motivating factors behind the development of this User’s Guide comes from the general sense that the most commonly used tools for genomic analysis still are terra incognita for the majority of biologists. Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicated that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. The inherent potential underlying all of this sequence-based data is tremendous, so the importance of all biologists having the ability to navigate through and cull important information from these databases cannot be overstated.
The study of biology and medicine has truly undergone a major transition over the last year, with the public availability of advanced draft sequences of the genomes of Homo sapiens and Mus musculus, rapidly growing sequence data on other organisms, and ready access to a host of other databases on nucleic acids, proteins and their properties. Yet for the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries. As pointed out by Harold Varmus in the Perspective, free accessibility of all of this basic information, without restrictions, subscription fees or other obstacles, is the most critical component of realizing this potential. It is our modest hope that this User’s Guide will provide another useful contribution.
Andreas D Baxevanis and Francis S Collins
National Human Genome Research Institute
Genomic empowerment: the importance of public databases
doi:10.1038/ng963
Over the past twenty-five years, a mere sliver of recorded time, the world of biology, and indeed the world in general, has been transformed by the technical tools of a field now known as genomics. These new methods have had at least two kinds of effects. First, they have allowed scientists to generate extraordinarily useful information, including the nucleotide-by-nucleotide description of the genetic blueprint of many of the organisms we care about most: many infectious pathogens; useful experimental organisms such as mice, the roundworm, the fruit fly, and two kinds of yeast; and human beings. Second, they have changed the way science is done: the amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.
Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it. Equally importantly, these revolutionary changes have been disseminated throughout the scientific community, and spread to other interested parties, because many of those who practice genomics have made a concerted effort to ensure that access is simplified for all, including those who have not been deeply schooled in the information sciences. The goal of providing genomic information widely has also inevitably attracted the interests of those in the commercial sector, and privately developed versions of various genomes are also now available, albeit for a licensing fee.
The operative principle most prominently involved in transmitting the fruits of genomics, the one that has captured the imagination of the public and served as a standard for the sharing of results and methods more generally in modern biology, has been open access. Funding by public and philanthropic organizations, such as the U.S. National Institutes of Health, the U.S. Department of Energy, the Wellcome Trust in Britain, and many other organizations, has made this altruistic behavior possible and has fostered the idea that genomic information about biological species should be available to all. (Such information about individual human beings is, of course, an entirely different matter and should be protected by privacy rules.) The attitude of open access to new biological knowledge has also been embodied in the databases of the International Nucleotide Sequence Database Collaboration, comprising the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and GenBank at the U.S. National Library of Medicine (NLM). The same focus on open access is exemplified by PubMed (operated by the NLM), other gateways to the scientific literature, and the assemblies of genomic sequence now found at the several Web portals described in this guide.
The Human Genome Project (HGP), which has supported the public genome sequencing effort, has been the mainstay of the effort to make genomes accessible to the entire community of scientists and all citizens. This effort has, in fact, been quite naturally extended to instruct the public about many themes in modern biological science. This has occurred in part because the human genome itself has been such an exciting concept for the public; in part because genomes are natural entry points for teaching many of the principles of biological design, including evolution, gene organization and expression, organismal development, and disease; and in part because those who work on genomes have been tireless in attempts to explain the meaning of genes to an eager public. Endless metaphors, artistic creations, lively journalism, monographs about social and ethical implications, televised lectures from the White House, and many other cultural happenings have been among the manifestations of this fascination. In this way, the HGP has had a strong hand in raising the public’s awareness of new ideas in biology and of the powerful implications of genomics in medicine, law and other societal institutions.
Some of these cultural effects come as much from the behavioral aspects of the HGP as from the genomic sequences themselves. The sharing of new information, even before its assembly into publishable form, has spurred efforts to share other kinds of research tools and has encouraged the notion of making the scientific literature freely accessible through the Internet. The contribution of scientists in many countries to the sequencing of many genomes, including the human genome, has inspired efforts to develop gene-based sciences, from basic genomics to biotechnology, throughout the world, including the poorest developing nations. Indeed, the World Health Organization, the United Nations, and the World Bank have all contributed recently to the growth of the ideas that science is both possible and valuable in all economies and that science can be a means to help unify the world’s population under a banner of enlightenment, demonstrating a virtue of globalization.
From this perspective, the availability of the sequences of many genomes through the Internet is a liberating notion, making extraordinary amounts of essential information freely accessible to anyone with a desktop computer and a link to the World Wide Web. But the information itself is not enough to allow efficient use. Interested people who reside outside the centers for studying genomes need to be told where best to view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis.
The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics. The information is provided in a highly inviting and understandable format by casting it in the form of answers to the questions most commonly posed when approaching big genomes. The information, made freely available on the World Wide Web, has been assembled by some of the best minds in the HGP, who have generously given their time and intellect to encourage widespread use of the great bounty that has been created over the past two decades.
In other words, the guide to use of genomes provided here is simply another indication that the HGP should take great pride in much more than the sequencing of genomes.
Harold Varmus
Memorial Sloan-Kettering Cancer Center
A user’s guide to the human genome
doi:10.1038/ng964
The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions for using many of the most commonly used tools for sequence-based discovery. The major web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system, along with many others that are discussed in the individual examples. It is hoped that readers will become more familiar with these resources, allowing them to apply the strategies used in the examples to advance their own research programs.
Introduction: putting it together
doi:10.1038/ng965
In its short history, the Human Genome Project (HGP) has provided significant advances in the understanding of gene structure and organization, genetic variation, comparative genomics and appreciation of the ethical, legal and social issues surrounding the availability of human sequence data. One of the most significant milestones in the history of this project was met in February 2001 with the announcement and publication of the draft version of the human genome sequence¹. The significance of this milestone cannot be overstated, as it firmly marks the entrance of modern biology into the genome era (and not the post-genome era, as many have stated). The potential usefulness of this rich databank of information should not be lost on any biologist: it provides the basis for ‘sequence-based biology’, whereby sequence data can be used more effectively to design and interpret experiments at the bench. The intelligent use of sequence data from humans and model organisms, along with recent technological innovation fostered by the HGP, will lead to important advances in the understanding of diseases and disorders having a genetic basis and, more importantly, in how health care is delivered from this point forward².
Although this flood of data has enormous potential, many investigators whose research programs stand to benefit in a tangible way from the availability of this information have not been able to capitalize on its potential. Some have found the data difficult to use, particularly with respect to incomplete human genome draft sequence information. Others are simply not sufficiently conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information space, numerous World Wide Web sites, courses and textbooks have become available; many individuals, of course, also turn to their friends and colleagues for guidance. We have prepared this Guide in that same spirit, as an additional resource for our fellow scientists who wish to make use (or better use) of both sequence data and the major tools that can be used to view these data. The Guide has been written in a practical, question-and-answer format, with step-by-step instructions on how to approach a representative set of problems using publicly available resources. The reader is encouraged to work through the examples, as this is the best way to truly learn how to navigate the resources covered and become comfortable using them on a regular basis. We suggest that readers keep copies of the Guide next to their computers as an easy-to-use reference.
Before embarking on this new adventure, it is important to review a number of basic concepts regarding the generation of human genome sequence data. This review does not discuss the chronological development of the HGP or provide an in-depth treatment of its implications; the reader is referred to Nature’s Genome Gateway (http://www.nature.com/genomics/human/) for more information on these topics.
Current status of human genome sequencing
Sequencing of the human genome is nearing completion. The target date for making the complete, high-accuracy sequence available is April 2003, the 50th anniversary of the discovery of the double helix³. As we go to press, however, the work is still a mosaic of finished and draft sequence. A sequence becomes finished when it has been determined at an accuracy of at least 99.99% and has no gaps. Sequence data that fall short of that benchmark but can be positioned along the physical map of the chromosomes are termed ‘draft’. Currently, 87% of the euchromatic fraction of the genome is finished and less than 13% is at the draft stage.
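The finishing criterion quoted above reduces to simple arithmetic: an accuracy of at least 99.99% allows no more than one error per 10,000 bases. The following is a minimal Python sketch of that check. The function name and the 150,000-bp example clone are illustrative assumptions, not anything defined by the sequencing consortium, and in practice the number of errors is not known directly and has to be estimated.

```python
def meets_finished_standard(length_bp, n_errors, n_gaps):
    """Check the two criteria quoted in the text: >= 99.99% accuracy, no gaps.

    Integer arithmetic avoids floating-point rounding at the threshold:
    accuracy >= 99.99% is the same as at most 1 error per 10,000 bases.
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return n_gaps == 0 and n_errors * 10_000 <= length_bp


# Hypothetical 150,000-bp clone: 15 errors is exactly 1 per 10,000 bases.
print(meets_finished_standard(150_000, 15, 0))  # True
print(meets_finished_standard(150_000, 16, 0))  # False
print(meets_finished_standard(150_000, 0, 1))   # False (a gap remains)
```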
Even in this incomplete state, the available data are extremely useful. This usefulness was apparent early on, leading the International Human Genome Sequencing Consortium (IHGSC) to pursue a staged approach in sequencing the human genome. The first stage generated draft sequence across the entire genome¹. The project is now well advanced into its second stage, with draft sequence being improved to ‘finished quality’ across the entire genome, a necessarily localized process. As a result, and as it has been presented to date, the human genome sequence is an evolving mix of both finished and unfinished regions, with the unfinished regions varying in data quality. As the data are initially made available in raw form, with subsequent refinement and improvement, and because data of different quality are found in different places in the genome, users must understand the kinds of data presented by the various tools available.
Determining the human sequence: a brief overview
As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp¹,⁴. To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing⁴. The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches. These approaches are described in detail elsewhere⁴.
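The idea of assembling overlapping reads back into one linear string can be illustrated with a toy example. The Python sketch below uses a simple greedy strategy: it repeatedly merges the two reads with the longest exact suffix-to-prefix overlap. This is an illustration only and not the method used by any production assembler. It assumes error-free reads from a single strand and no repeats, and the function names and the short example sequence are invented for the purpose.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by largest exact overlap; returns the contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping toy 'reads' from the made-up sequence ATGCGTACGTTAGC.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))  # ['ATGCGTACGTTAGC']
```

When the reads do not all overlap, the function returns more than one contig, which is the toy analogue of a draft sequence made up of several unjoined pieces.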
The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy¹. The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.
1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.
2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data. These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).
3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank⁵, where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and ensure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.
4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.
5. Subsequent to the generation and publication of the draft human genome sequence, work has continued towards finishing the sequencing. The final stage initially targeted draft-quality BAC clones. For each of these clones, enough additional shotgun sequence data are obtained to bring the coverage to eight- to tenfold, a stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often in just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).
6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’. Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.
7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, there are a number of steps involved in this process that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html).
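Several quantities in the pipeline steps above lend themselves to a short worked sketch: fold coverage, the keywords that mark a clone’s progress, and the accession.version convention. The Python below is illustrative only. The keyword strings and the AC108475 example are taken from the text, whereas the function names, the 150,000-bp clone length and the 600-bp read length are assumptions chosen for the example.

```python
import math

# Keywords as quoted in the text, in the order a clone passes through them.
HTGS_STAGES = [
    ("htgs_draft", "draft: four- to fivefold coverage"),
    ("htgs_fulltop", "fully topped-up: eight- to tenfold coverage"),
    ("htgs_activefin", "being worked on by a sequence finisher"),
    ("htgs_phase3", "finished: record leaves the HTGS division"),
]


def fold_coverage(n_reads, read_length_bp, clone_length_bp):
    """Average number of times each base of the clone is represented."""
    return n_reads * read_length_bp / clone_length_bp


def reads_needed(target_coverage, read_length_bp, clone_length_bp):
    """Smallest number of reads that reaches the target fold coverage."""
    return math.ceil(target_coverage * clone_length_bp / read_length_bp)


def bump_version(accession_version):
    """Return the next version of a record; the accession itself is unchanged."""
    accession, version = accession_version.rsplit(".", 1)
    return f"{accession}.{int(version) + 1}"


# Assumed sizes: a 150,000-bp BAC clone and 600-bp reads.
print(fold_coverage(1125, 600, 150_000))   # 4.5
print(reads_needed(8, 600, 150_000))       # 2000
print(bump_version("AC108475.2"))          # AC108475.3
```

Under these assumed sizes, moving a clone from draft-level to fully topped-up coverage roughly doubles the number of reads required.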
Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing) to focus their efforts on a single, definitive assembly. To this end, and by agreement, the NCBI assembly will be taken as the reference human genome sequence. It is this NCBI assembly that is displayed at the three major portals covered in this guide.

Box: NCBI reference sequences
The data release and distribution practices adopted by the HGP participants have led not only to very early, pre-publication access to this treasure trove of information, but also to a potentially confusing variety of formats and sources for the sequence data. To address this and other issues, the NCBI initiated the RefSeq project (http://www.ncbi.nlm.nih.gov/locuslink/refseq.html).
The goal of the RefSeq effort is to provide a single reference sequence for each molecule of the central dogma: DNA, the mRNA transcript, and the protein. The RefSeq project helps to simplify the redundant information in GenBank by providing, for example, a single reference for human glyceraldehyde-3-phosphate dehydrogenase mRNA and protein, out of the 14 or so full-length sequences in GenBank. Each alternatively spliced transcript is represented by its own reference mRNA and protein. The RefSeq project also includes sequences of complete genomes and whole chromosomes, and genomic sequence contigs. The human genomic contigs that NCBI assembles, which form the basis of the presentations in the different genome browsers, are part of the RefSeq project. Most RefSeq entries are considered provisional and are derived by an automated process from existing GenBank records. Reviewed RefSeq entries are manually curated and list additional publications, gene function summaries and sometimes sequence corrections or extensions.
Reference sequences are available through NCBI resources, including Entrez, BLAST and LocusLink. They can be easily recognized by the distinctive style of their accession numbers. NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the genome. The NCBI also provides model mRNA RefSeqs produced from genome annotation. These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled genome and then extracting the genomic sequence corresponding to the transcripts. The resulting model mRNA and model protein sequences have accession numbers of the form XM_###### and XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from the original NM_ or GenBank mRNAs because of real sequence polymorphisms, errors in the genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete list of types of RefSeqs, along with details on how they are produced, is available from http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html
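The accession-number conventions described in the NCBI reference sequences box can be turned into a small lookup. The Python sketch below covers only the five prefixes named in that box; the function name is an invention for this example, the accessions used are made-up placeholders rather than real records, and the pattern assumes an optional ‘.version’ suffix of the kind shown for GenBank records earlier in the text.

```python
import re

# Only the prefixes named in the box; other RefSeq types exist.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig",
    "XM": "model mRNA",
    "XP": "model protein",
}

# Two letters, an underscore, digits, and an optional '.version' suffix.
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = _REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


# Placeholder accessions, not real records.
print(classify_refseq("NM_123456"))    # mRNA
print(classify_refseq("XP_123456.1"))  # model protein
print(classify_refseq("AC108475.2"))   # None (a GenBank clone accession)
```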
Annotating the assemblies
Once the assemblies have been constructed, the DNA sequence undergoes a process known as annotation, in which useful sequence features and other relevant experimental data are coupled to the assembly. The most obvious annotation is that of known genes. In the case of NCBI, known genes are identified by simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on to the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database⁶ and the assembly using a fast protein-to-DNA sequence matcher⁷. UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program⁸. In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.
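The placement rule described for the NCBI annotation can be sketched in a few lines of Python. This is one reading of the rule as stated in the text, not NCBI’s actual code: it assumes that the 95% identity threshold always applies, and that the aligned region must then cover either at least half of the transcript or at least 1,000 bases. The `Alignment` class, its `score` field (standing in for whatever measure of alignment quality is used) and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    location: str            # where on the assembly the transcript aligned
    percent_identity: float  # 0-100
    aligned_bases: int       # length of the aligned region
    score: float             # hypothetical stand-in for alignment quality


def passes_tie_rule(aln, transcript_length):
    """>= 95% identity, and aligned region >= 50% of length or >= 1,000 bases."""
    covers_half = aln.aligned_bases * 2 >= transcript_length
    long_enough = aln.aligned_bases >= 1000
    return aln.percent_identity >= 95.0 and (covers_half or long_enough)


def select_placements(alignments, transcript_length):
    """Keep the single best alignment; on a tie, keep those passing the rule."""
    if not alignments:
        return []
    best = max(a.score for a in alignments)
    tied = [a for a in alignments if a.score == best]
    if len(tied) == 1:
        return tied
    return [a for a in tied if passes_tie_rule(a, transcript_length)]


# A 2,000-base transcript with two equally good hits and one weaker hit.
hits = [
    Alignment("contig_A", 99.0, 1800, 50.0),
    Alignment("contig_B", 99.0, 1800, 50.0),
    Alignment("contig_C", 97.0, 900, 40.0),
]
print([a.location for a in select_placements(hits, 2000)])  # ['contig_A', 'contig_B']
```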
Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones. Full details on the types of annotation available and the methods underlying sequence annotation for each of these different types of sequence feature can be found by accessing the URLs listed under Genome Annotation in the Web Resources section of this guide. At UCSC, many of the annotations are provided by outside groups, and there may be a significant delay between the release of the genome assembly and the annotation of certain features. Furthermore, some tracks are generated for only a limited number of assemblies. For an in-depth discussion of genome annotation, the reader is referred to an excellent review by Stein⁹ and the references cited therein. This review, along with the Commentary in this guide, also provides cautions on the possible overinterpretation of genome annotation data.
The data—and sometimes the tools—change every day
The steps outlined in the previous section should emphasize that the state of the human genome sequence will continue to be in flux, as it will be updated daily until it has actually been declared ‘finished’. (Finished sequence is properly defined as the “complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps”2. A more practical definition is that of “essentially finished sequence,” meaning the complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps, except those that cannot be closed by any current method.) The reader should be mindful of this, not just when reading this guide, but also when referring back to it over time. Similarly, the tools used to search, visualize and analyze these sequence data also undergo constant evolution, capitalizing on new knowledge and new technology in increasing the usefulness of these data to the user.
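To put the 99.99% accuracy figure in the definition of finished sequence into perspective, it corresponds to at most one erroneous base per 10,000 bases. The short Python sketch below works through that arithmetic; the function name is an illustrative choice and the 150,000-base clone length is a hypothetical example, not a figure taken from this guide.

```python
def max_expected_errors(length_bp, accuracy=0.9999):
    """Upper bound on the expected number of erroneous bases in a sequence."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return length_bp * (1.0 - accuracy)


# Hypothetical clone of 150,000 bases at the minimum 'finished' accuracy.
clone_errors = max_expected_errors(150_000)
print(round(clone_errors, 1))
```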
Over the next year, sequence producers will continue to add finished sequence to the nucleotide sequence databases, and the NCBI will continue to update the human sequence assembly until its ultimate completion. The human genome sequence will, however, continue to improve even after April 2003, as new cloning, mapping and sequencing technologies lead to the closure of the few gaps that will remain in the euchromatic regions. It is hoped that such technological advances will also allow for the sequencing of heterochromatic regions, regions that cannot be cloned or sequenced using currently available methods.
The sequence-based and functional annotations presented at the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.
Accessing human genome sequence data
Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.
Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes7. It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression. The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition, numerous sequence-based search tools are available, and the Ensembl system itself can be downloaded for use with individual sequencing projects.
The UCSC Genome Browser (http://genome.ucsc.edu) was originally developed by a relatively small academic research group that was responsible for the first human genome assemblies. The genome can be viewed at any scale and is based on the intuitive idea of overlaying ‘tracks’ onto the human genome sequence; these annotation tracks include, for example, known genes, predicted genes and possible patterns of alternative splicing. There is also an emphasis on comparative genomics, with mouse genomic alignments being available. The browser also provides access to an interactive version of the BLAT algorithm8, which UCSC uses for RNA and comparative genomic alignments.
Given its Congressional mandate to store and analyze biological data and to facilitate the use of databases by the research community, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a central hub for genome-related resources. NCBI maintains GenBank, which stores sequence data, including that generated by the HGP and other systematic sequencing projects. NCBI’s Map Viewer provides a tool through which information such as experimentally verified genes, predicted genes, genomic markers, physical maps, genetic maps and sequence variation data can be visualized. The Map Viewer is linked to other NCBI tools, for example Entrez, the integrated information retrieval system that provides access to numerous component databases.
Although we have chosen to illustrate each example using resources available at a single site, almost all the questions in this guide can be answered using any of the three browsers. The
informational sidebars that follow some of the questions provide pointers on how to format the search at other sites. Furthermore, the three sites link to each other wherever possible. Examples presented in this Guide rely on the data and genome browser interfaces that were available in June 2002. As new versions of the genome assembly and viewing tools will come online every few months, the specifics of some of the examples may change over time. Regardless, the basic strategies behind answering the questions in the examples will remain the same. This underscores the importance of readers working through the examples at their own computers so that they may understand and be able to navigate these public databases. The readers are encouraged to explore the alternative methods for answering the questions.
Question 1
How does one find a gene of interest and determine that gene’s structure? Once the gene has been located on the map, how does one easily examine other genes in that same region?
doi:10.1038/ng966
This question serves as a basic introduction to the three major genome viewers. One gene, ADAM2, will be examined using all three sites so that the reader can gain an appreciation of the subtle differences in information presented at each of these sites.
National Center for Biotechnology Information Map Viewer
The NCBI Human Map Viewer can be accessed from the NCBI’s home page, at http://www.ncbi.nlm.nih.gov. Follow the hyperlink in the right-hand column labeled Human map viewer to go to the Map Viewer home page. The notation at the top of the page indicates that this is Build 29, or the NCBI’s 29th assembly of the human genome. Build 29 is based on sequence data from 5 April 2002. The previous genome assembly, Build 28, was based on sequence data from 24 December 2001. To search for any mapped element, such as a gene symbol, GenBank accession number, marker name or disease name, enter that term in the Search for box and then press Find. For this example, enter ‘ADAM2’ and then press Find. The on chromosome(s) box may be left blank for text-based searches such as this one.
The resulting overview page shows a schematic of all of the human chromosomes, pinpointing the position of ADAM2 to the p arm of chromosome 8 (Fig 1.1). The search results section shows that the gene exists on two NCBI maps, Genes_cyto and Genes_seq. Genes_cyto refers to the cytogenetic map, whereas Genes_seq refers to the sequence map. Clicking on either of those two links opens a view of just that map.
Detailed descriptions of these and other NCBI maps are available at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/humansearch.html. To get the most general overview of the genomic context of ADAM2, including all available maps, click on the item in the Map element column (in this case, ADAM2). This view shows ADAM2 and a bit of flanking sequence on chromosome 8p11.2 (Fig 1.2). Three maps are displayed in this view, each of which will be discussed below. Additional maps, discussed in other examples in this guide, can be added to this view using the Maps & Options link.
The rightmost map is the master map, the map providing the most detail. The master map in this case is the Genes_seq map, which depicts the intron/exon organization of ADAM2 and is created by aligning the ADAM2 mRNA to the genome. The gene appears to have 14 exons. The vertical arrow next to the ADAM2 gene symbol (within the pink box) shows the direction in which the gene is transcribed. The gene symbol itself is linked to LocusLink, an NCBI resource that provides comprehensive information about the gene, including aliases, nucleotide and protein sequences, and links to other resources10 (see Question 10). The links to the right of the gene symbol point to additional information about the gene.
• sv, or sequence view, shows the position of the gene in the context of the genomic contig, including the nucleotide and encoded protein sequences.
• ev brings the user to the evidence viewer, a view that displays the biological evidence supporting a particular gene model. This view shows all RefSeq models, GenBank mRNAs, transcripts (whether annotated, known or potential) and expressed sequence tags (ESTs) aligning to this genomic contig. More information on the evidence viewer can be found on the NCBI web site by clicking Evidence Viewer Help on any ev report page.
• hm is a link to the NCBI’s Human–Mouse Homology Map, showing genome sequences with predicted orthology between mouse and human (Fig 12.2).
• seq allows the user to retrieve the genomic sequence of the region in text format. The region of sequence displayed can easily be changed.
• mm is a link to the Model Maker, which shows the exons that result when GenBank mRNAs, ESTs and gene predictions are aligned to the genomic sequence. The user can then select individual exons to create a customized model of the gene. More information on the Model Maker can be found on the NCBI web site by clicking help on any mm report page.
The UniG_Hs map shows human UniGene clusters that have been aligned to the genome. The gray histogram depicts the number of aligning ESTs and the blue lines show the mapping of UniGene clusters to the genome. The thick blue bars are regions of alignment (that is, exons) and the thin blue lines indicate potential introns. In this example, the mapping of UniGene cluster Hs.177959 to the genome follows that of ADAM2, and all the exons align.
The Genes_cyto map shows genes that have been mapped cytogenetically; the orange bar shows the position of the gene. Although ADAM2 has been finely mapped and is represented by a short line, other genes, such as the group below it on a longer line, have been cytogenetically mapped to broader regions of chromosome 8.
Clicking on the zoom control in the blue sidebar allows the user to zoom out to view a larger region of chromosome 8. Zooming out one level shows 1/100th of the chromosome. There are 20 genes in the region, and all 20 are labeled (displayed) in this view (Fig 1.3). The region of ADAM2 is highlighted in red on all maps. On the basis of the Genes_seq map, ADAM2 is located between ADAM18 and LOC206849.
University of California, Santa Cruz Genome Browser
The home page for the UCSC Genome Browser is http://genome.ucsc.edu/. At present, UCSC provides browsers not only for the most recent version of the mouse and human genome data, but also for several earlier assemblies. To use the Genome Browser, select the appropriate organism from the pull-down menu at the top of the blue sidebar (Human, in this case) and then click the link labeled Browser. On the resulting page, select the version of the human assembly to view. The genome browser from August 2001 is based on an assembly of the human genome done by UCSC using sequence data available on that date. The Dec 2001
browser displays annotations based on NCBI’s build 28 of the human genome, and the Apr 2002 browser displays annotations on NCBI’s build 29. As the annotations presented in this most recent human assembly are not yet as comprehensive as those from the December 2001 assembly, the examples in this text are based on the earlier assembly. Select Dec 2001 from the pull-down menu to access the assembly from that date (Fig 1.4).
Supported types of queries are listed below the text input boxes. Enter ‘ADAM2’ in the box labeled position and then click Submit. The results of this search are presented in two categories, Known Genes and mRNA Associated Search Results (Fig 1.5). The section marked Known Genes shows the mapping of the NCBI Reference mRNA sequences to the genome. The mRNA Associated Search Results represent the mapping of other GenBank mRNA sequences to the genome. Click on the Known Genes link for ADAM2 (arrow, Fig 1.5) to see the genomic context of the ADAM2 mRNA Reference Sequence (NM_001464).
The resulting zoomed-in view shows a region of chromosome 8 from base pair 36234934 to 36280132, located within 8p12 (Fig 1.6). The blue track entitled Known Genes (from RefSeq) shows the intron–exon structure of known genes. The vertical boxes indicate exons and the horizontal lines introns. The ADAM2 gene seems to have 14 exons. The direction of transcription is indicated by the arrowheads on the introns. The tracks labeled Acembly Gene Predictions, Ensembl Gene Predictions and Fgenesh++ Gene Predictions are the results of gene predictions (see Question 7). Alignments of other database nucleotide sequences are shown in the Human mRNAs from GenBank, spliced EST, UniGene and Nonhuman mRNAs from GenBank tracks. Translated alignments of mouse and Tetraodon genomic sequence are in the mouse and fish BLAT tracks. Tracks displaying single-nucleotide polymorphisms (SNPs), repetitive elements and microarray data are shown at the bottom. Additional details about each track are available by selecting the track name in the Track Controls at the bottom.
To view the genomic context of ADAM2, zoom out 10× by clicking on the zoom out 10× box in the upper right corner. ADAM2 is located between TEM5 and ADAM18 (Fig 1.7).
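The window arithmetic behind this view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ‘chr8:start-end’ string is assumed as the textual form of the coordinates quoted above, the function names are hypothetical, and widening the window symmetrically about its midpoint is an assumption about how a 10× zoom behaves rather than a description of the browser’s own code. The sketch does not clamp the window to the end of the chromosome.

```python
import re


def parse_position(position):
    """Split a 'chrN:start-end' string into (chromosome, start, end)."""
    match = re.fullmatch(r"(chr\w+):(\d+)-(\d+)", position.replace(",", ""))
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def zoom_out(chrom, start, end, factor=10):
    """Widen a 1-based inclusive window about its midpoint by `factor`."""
    width = end - start + 1
    center = (start + end) // 2
    new_width = width * factor
    new_start = max(1, center - new_width // 2)
    return chrom, new_start, new_start + new_width - 1


# Coordinates of the zoomed-in ADAM2 view quoted in the text.
chrom, start, end = parse_position("chr8:36234934-36280132")
window_bp = end - start + 1
zoomed = zoom_out(chrom, start, end)
print(chrom, window_bp, zoomed)
```

Under these assumptions the zoomed-in view spans 45,199 bases, and the widened window is ten times that size and still contains the original interval.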
Ensembl
The Ensembl7 project, http://www.ensembl.org/, provides genome browsers for four species: human, mouse, zebrafish and mosquito. Click on Human to view the main entry point for the human genome. The current version of human Ensembl is version 6.28.1, based on the NCBI’s 28th build of the genome. To perform a text search, enter ‘ADAM2’ in the text box, and limit the search by selecting Gene from the pull-down search. Click on the upper button labeled Lookup. A single result is returned with a link to the ADAM2 gene (Fig 1.8).
Click on either of the ADAM2 links to retrieve the GeneView window. The returned page contains four sections of data. The first section (Fig 1.9) is an overview of ADAM2, including links to accession numbers and protein domains and families. Links to the Ensembl view of highly similar mouse sequences are presented in the Homology Matches section. Some of these fields will be described in more detail in later examples. The second section of the GeneView window provides information on the gene transcript (Fig 1.10). The sequence of the cDNA is shown, as is a graphic of its intron–exon structure. A limited amount of the genomic context around the gene is shown schematically as well. Exon sequences are shown in the third section of the GeneView (Fig 1.11) and splice sites in the fourth (Fig 1.12). If more than one transcript is predicted for the gene, each is allocated its own transcript, exon and splice-site sections.
The complete genomic context of ADAM2 is viewed by returning to the first section of the GeneView (Fig 1.9) and clicking on one of the two links within the Genomic Location box. The top portion of the resulting ContigView (Fig 1.13) depicts the chromosome, with the region of interest outlined in red. The Overview shows the genomic context of the gene, including the chromosome bands, contigs, markers and genes that map near 8p12. Clicking on any of these items recenters the display around that item. The section of interest is boxed in red on the DNA(contigs) map. The genes annotated by Ensembl as being around ADAM2 are Q96KB2 and ADAM18.
The bottom panel of the ContigView, the Detailed View (Fig 1.14), shows a zoomed-in view of the boxed region, highlighting all features that have been mapped to this region of the human genome. The navigator buttons between the Overview and the Detailed View move the display to the left and right and zoom in and out. The features to be displayed can be changed by selecting the Features pull-down menu and then checking which features to view.
The Features shown in Fig 1.14 are the defaults. The DNA(contigs) map separates items on the forward strand (above) from those on the reverse (below). The only feature on the reverse strand in this view is a single Genscan transcript, predicted by the GENSCAN gene prediction program11 (see Question 7). The forward strand shows five types of features. Starting at the bottom, the ADAM2 transcript is shown in red, indicating that it is a known transcript corresponding to a near-full-length cDNA sequence, protein sequence or both already available in the public sequence database. Black transcripts are predicted based on EST or protein sequence similarity. EST Transcr links to individual aligning ESTs, whereas the UniGene track near the top displays UniGene clusters. The Genscan model on the forward strand contains many exons found in the known transcript. The Proteins and Human proteins boxes indicate protein sequences that align to this version of the genome, whereas NCBI Transcr. links to the NCBI Map Viewer. Positioning the computer mouse over any feature brings up the feature’s name and links to more detailed information.
The NCBI, UCSC and Ensembl sometimes use different symbols for the same genes, so it can be difficult to compare the views obtained by the different browsers. Furthermore, the three sites maintain independent annotation pipelines and do not all attempt to align the same mRNA sequences to the genome. The NCBI is currently displaying build 29, Ensembl shows build 28, and UCSC offers both builds 28 (December 2001) and 29 (April 2002), although all examples from UCSC in this guide will be illustrated using the better-annotated build 28. Because of the differences between the two assemblies, there are subtle discrepancies between what is shown at the NCBI and what is available at UCSC and Ensembl. However, it is fairly easy to navigate among the three sites. The NCBI, for example, links to Ensembl and UCSC through the black boxes at the top of LocusLink entries for human genes, and Ensembl directs users to NCBI and UCSC through the “Jump to” link in its ContigView. Some versions of UCSC’s Genome Browser have links to Ensembl and NCBI’s Map Viewer in the blue bar at the top of each browser page.
Question 2
How can sequence-tagged sites within a DNA sequence be identified?
doi:10.1038/ng967
The NCBI’s electronic PCR (e-PCR) tool12, which is part of the UniSTS resource, can be used to find STS markers within a DNA fragment of interest. UniSTS (http://www.ncbi.nih.gov/genome/sts/) contains all the available data on STS markers, including primer sequences, product size, mapping information and alternative names. Links to other NCBI resources such as Entrez, LocusLink and the MapViewer are also provided. e-PCR looks for potential STSs in a DNA sequence by searching for subsequences with the correct orientation and distance that could represent the PCR primers used to generate known STSs.
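The idea of searching for primer pairs with the correct orientation and spacing can be illustrated with a small Python sketch. It is a simplified illustration of the principle, not the e-PCR program itself: it uses exact string matching on one strand only, the function names and the tolerance parameter are assumptions, and the primers and template are toy sequences invented for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_sts(sequence, forward_primer, reverse_primer, expected_size, tolerance=50):
    """Find candidate STS hits on the given strand of `sequence`.

    A hit is an exact match to the forward primer followed by an exact match
    to the reverse complement of the reverse primer, such that the implied
    product size is within `tolerance` bases of `expected_size`.
    Returns a list of 1-based inclusive (start, end) coordinates.
    """
    sequence = sequence.upper()
    fwd = forward_primer.upper()
    rev_rc = reverse_complement(reverse_primer)
    hits = []
    start = sequence.find(fwd)
    while start != -1:
        stop = sequence.find(rev_rc, start + len(fwd))
        while stop != -1:
            product_size = stop + len(rev_rc) - start
            if abs(product_size - expected_size) <= tolerance:
                hits.append((start + 1, stop + len(rev_rc)))
            stop = sequence.find(rev_rc, stop + 1)
        start = sequence.find(fwd, start + 1)
    return hits


# Toy data, invented for illustration (not a real STS).
fwd_primer = "ACGTTGCA"
rev_primer = "GGATCCAT"
template = "TTTT" + fwd_primer + "C" * 20 + reverse_complement(rev_primer) + "AAAA"
hits = find_sts(template, fwd_primer, rev_primer, expected_size=36, tolerance=5)
print(hits)
```

A fuller treatment would also scan the opposite strand and tolerate mismatches within the primers; those refinements are left out here to keep the orientation and distance test easy to follow.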
The e-PCR home page can be found by going to the NCBI home page, at http://www.ncbi.nlm.nih.gov, and then following the Electronic PCR link in the right-hand column. On the e-PCR home page, paste the sequence of interest or enter an accession number into the large text box at the top of the page. The accession number of the sequence for this example is AF288398. This sequence contains only one STS, stSG47693, which is located between nucleotides (nt) 2102 and 2232 of the sequence under study (Fig 2.1).
Click on the marker name to bring up details of the STS from UniSTS (Fig 2.2). The primer information and PCR product size are listed at the top of the page, along with alternative names for the marker. Often STSs are known by different names on different maps. Cross-references to LocusLink, UniGene and the Genebridge 4 map to which this STS was mapped are shown next. The mapping information section contains links to the NCBI’s MapViewer. At the bottom of the page, the Electronic PCR results show other sequences, including contigs, mRNAs and ESTs that may contain this STS marker.
To see the genomic context of the STS marker in all maps to which it has been mapped, click on the link labeled MapViewer at the top of the Mapping Information section. This map view (Fig 2.3) shows two maps. Note that, in this view, the STS stSG47693 is called RH92759 (highlighted in pink). Gene Map ’99–Genebridge 4 (GM99_GB4, left) has 46,000 STS markers mapped onto the GB4 RH panel by the International Radiation Hybrid Consortium. The STS map (right) shows the NCBI’s placement of STSs onto the genome sequence assembly using e-PCR. Gray lines connect markers that appear in both maps, whereas the red line denotes where the STS RH92759 appears on both maps. In the region shown, there are a total of 211 STSs on the STS map, but only 20 are labeled in this view. To the right of the STS map, the green and yellow circles show the maps on which the STS markers have been placed. One can zoom in or out of this view by clicking on the lines of the zoom tool in the left sidebar.
Question 3
During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers. How can all the known and predicted candidate genes in this interval be identified? What BAC clones cover that particular region?
doi:10.1038/ng968
UCSC
One possible starting point for this search is the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar at the side of the page, and then click Browser. On the Human Genome Browser Gateway page, change the assembly pull-down to Dec 2001. To view a region of the genome between two query terms, enter the terms in the search box, separated by a semicolon. For example, to view the region between STS markers D10S1676 and D10S1675, enter ‘D10S1676;D10S1675’ in the box marked position and press Submit. Because both of these markers map to a single position in the genome, the genome browser for the region between those markers is returned (Fig 3.1).
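The logic of the two-marker query can be sketched in Python as follows. This is a minimal illustration rather than the browser’s own implementation: the function name is hypothetical, and the marker coordinates are placeholders invented for the example, since the actual positions of D10S1676 and D10S1675 are not given in the text.

```python
def region_between(markers, first, second):
    """Return (chromosome, start, end) spanning two uniquely mapped markers.

    `markers` maps a marker name to a single (chromosome, start, end) placement.
    """
    chrom_a, start_a, end_a = markers[first]
    chrom_b, start_b, end_b = markers[second]
    if chrom_a != chrom_b:
        raise ValueError("markers map to different chromosomes")
    return chrom_a, min(start_a, start_b), max(end_a, end_b)


# Placeholder coordinates, invented for illustration only.
marker_positions = {
    "D10S1676": ("chr10", 1_000_000, 1_000_200),
    "D10S1675": ("chr10", 3_500_000, 3_500_150),
}

# The query described in the text: two terms separated by a semicolon.
ucsc_query = ";".join(["D10S1676", "D10S1675"])
region = region_between(marker_positions, "D10S1676", "D10S1675")
print(ucsc_query, region)
```

The check on the chromosome names mirrors the condition stated above: the interval is only well defined when each marker maps to a single position on the same chromosome.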
The STS Markers track displays genetically mapped markers in blue and radiation hybrid–mapped markers in black. Click on the STS Markers label to expand that track and see each marker listed individually (Fig 3.2). The markers of interest are called by their alternate names (AFMA232YH9 and AFMA230VA9 in this view) and are at the top and bottom of the interval, respectively (Fig 3.2, arrows).
The full list of known genes in this display is shown in the Known Genes track (Fig 3.1). These protein-coding genes are taken from the RefSeq mRNA sequences compiled at the NCBI10 and aligned to the genome assembly using the BLAT program8. To export a list of the genes, or other features, in this region, click the Tables link in the top blue bar. For more information about a particular gene (such as MGMT), click on the gene symbol to get a list of additional links to resources such as Online Mendelian Inheritance in Man (OMIM), PubMed, GeneCards and Mouse Genome Informatics (MGI; Fig 3.3). Many tracks, including Acembly Genes, Ensembl Genes and Fgenesh++ Genes, indicate predicted genes (see Question 7). To view the full set of features in any of these categories, click on the title of that track on the left side of the screen in Fig 3.1. To view brief descriptions of these tracks, as well as others not mentioned, click on the gray box to the left of the track or scroll down to Track Controls and click on the title of a feature of interest. Explanations of the gene-prediction programs can be found in Question 7. Reset the browser to its default settings by clicking on the reset all button below the tracks.
To see the BAC clones used for sequencing, return to the page illustrated in Fig 3.1 and click on Coverage at the left side of the screen to expand that track. Here BAC clones are listed individually, with finished regions shown in black and draft regions shown in various shades of gray (Fig 3.4). For details such as size and sequence coverage of a specific clone, click on the clone accession number (such as AL355529.21, arrow). From this screen, click on the accession number (as shown in Fig 3.5) to link to the NCBI Entrez document summary for the clone. The full GenBank entry can be viewed by clicking on AL355529 on the Entrez document summary page.
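The identifier AL355529.21 shown in the Coverage track combines a GenBank accession number (AL355529) with a sequence version number (21), which is why the Entrez page lists the same record under AL355529. A minimal Python sketch of splitting such an identifier is given below; the function name is an illustrative choice and not part of any NCBI tool.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an integer, or None if no version is present.
    """
    accession, dot, version = identifier.partition(".")
    if not accession or (dot and not version.isdigit()):
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, (int(version) if dot else None)


accession, version = split_accession("AL355529.21")
print(accession, version)
```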
According to NCBI naming conventions, this clone is from the RP11 library and has been named 85C15. RP11 is the NCBI designation for RPCI-11, a commonly used human BAC library produced at the Roswell Park Cancer Institute. More information on the naming conventions of genomic sequencing libraries can be found at the NCBI’s Clone Registry (Fig 3.6; http://www.ncbi.nlm.nih.gov/genome/clone/nomenclature.shtml). Clone ordering information is also available, at http://www.ncbi.nlm.nih.gov/genome/clone/ordering.html.
NCBI
The NCBI MapViewer allows for direct viewing of the region between two markers, as long as both markers are on the master map. If, for example, the master map is a cytogenetic one, one can search chromosome 22 for the region between band numbers 22q12.1 and 22q13.2. If the master map is Genes_seq, one can view the region between two mapped genes.
Access the Map Viewer home page by starting at the NCBI home page (http://www.ncbi.nlm.nih.gov) and clicking Human map viewer in the list on the right-hand side of the page. To view multiple hits on the same chromosome, type in the search terms separated by the word ‘OR’. To see the same region between the STS markers D10S1676 and D10S1675, for example, type ‘D10S1676 OR D10S1675’ in the search box, and hit Find. At the top of the resulting page (Fig 3.7), two red tick marks on the chromosome cartoon indicate that the markers map close to each other on chromosome 10. The search results at the bottom of the page show the alternative names for the two markers (AFMA232YH9 and AFMA230VA9) as well as the maps on which they have been placed. To view both markers at the same time, click on the link for chromosome 10 in the chromosome diagram. Fig 3.8 shows the region around D10S1676 and D10S1675, with the original queries highlighted in pink. Red lines connect the positions of the marker on the different maps.
The Maps & Options link, in the horizontal blue bar near the top of the page, allows the user to customize the maps and region displayed. To view, for example, the known and predicted genes in this region, as well as the BAC clones from which the sequence was derived, click on the link to open the Maps & Options window (Fig 3.9). First remove all the maps except Gene and STS from the Maps Displayed box by highlighting them, and selecting <<REMOVE. Next, add the Transcript (RNA), GenomeScan, Component and Contig maps by selecting them from the Available Maps box and selecting ADD>>. Make the STS map the master by highlighting it, then selecting Make Master/Move to Bottom. To limit the view such that only the STSs between D10S1676 and D10S1675 are shown, type the marker names in the Region Shown boxes. Hit Apply to see the aligned maps. In some cases, it may be useful to select a page size larger than the default of 20 to view more data in the browser window.
One can also search for a region between two STS markers using the MapView at Ensembl. Start at the Ensembl Human Genome Browser at http://www.ensembl.org/Homo_sapiens/, click on the idiogram of any chromosome to access the MapView, and enter the marker names in the Jump to Contigview section. To use Ensembl to obtain a list of genes (or other annotations) in a defined chromosomal region, click on Export→Gene List from any ContigView window (Fig 1.14, center yellow bar).
Fig 3.10 shows the maps, as specified in the Maps & Options window. The green dots to the right of the STS map show all the maps on which the markers appear. This is a fairly long region of chromosome 10, and not every STS marker is shown. In particular, although there are 611 STSs in this region, only 20 are shown by name in this view. For each known gene, the Genes_seq map shows all the exons that have been mapped to the genome. Exons for individual known mRNAs are shown on the RNA (Transcript) map. Unless a gene is alternatively spliced, the Genes_seq and RNA maps will be the same. The GScan (GenomeScan) map shows the NCBI’s gene predictions. Any of these genes, known or predicted, are candidates for the disease gene.
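Once the interval and the gene annotations are available as coordinates (for example, from an exported table), compiling the candidate list amounts to an interval-overlap test. The Python sketch below shows that step under stated assumptions: the `Gene` record and the function name are hypothetical, and apart from the symbol MGMT, which is mentioned in this example, every symbol and all coordinates are invented placeholders.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    symbol: str
    start: int
    end: int
    source: str  # "known" or "predicted"


def genes_in_interval(genes, start, end):
    """Return genes overlapping the 1-based inclusive interval [start, end]."""
    return [gene for gene in genes if gene.start <= end and gene.end >= start]


# Hypothetical annotations; coordinates are placeholders, not real positions.
annotations = [
    Gene("MGMT", 1_200_000, 1_500_000, "known"),
    Gene("PRED1", 3_400_000, 3_600_000, "predicted"),
    Gene("GENE_X", 4_000_000, 4_100_000, "known"),
]
candidates = genes_in_interval(annotations, 1_000_000, 3_500_150)
print([gene.symbol for gene in candidates])
```

Genes that only partly overlap the interval are kept in this sketch, on the reasoning that a gene straddling a boundary marker remains a plausible candidate.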
The NCBI’s assembled contigs, also known as the NT contigs, are found in the Contig map. Blue segments come from finished sequence, orange from draft. These contigs are constructed from the individual GenBank sequence entries shown in the Comp (Component) map. Draft HTG records (phase 1 and 2; see http://www.ncbi.nlm.nih.gov/HTGS/) are displayed in orange and finished HTGs in blue. Most of these GenBank entries are derived from BAC clones. The tiling paths of the BAC clones that were assembled into contigs are clearly visible. One can obtain more details about an entry, including the clone name, by clicking on the accession number to link to Entrez. The clone name is visible directly in the MapViewer if the Comp map is the master. A map can be quickly made the master map by clicking on the blue arrow next to its name.
Because this is a zoomed-out view of the chromosome, individual genes and GenBank entries are difficult to visualize. Zooming in, using the controls in the blue sidebar, will show a region in more detail. Alternatively, click on the Data As Table View in the left sidebar to retrieve all data, including those hidden in this view, as a text-based table (partially shown
Question 4
A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?
doi:10.1038/ng969
The starting point for this search would be the web site for the Database of Single Nucleotide Polymorphisms (dbSNP) at the NCBI13, which is located at http://www.ncbi.nlm.nih.gov/SNP. There is a series of links on the page that allow the user to search using either information about the database submission itself or information regarding genes and gene loci.
For this particular search, assume that the region of interest is known and defined by two STS markers, RH70674 and G32133. Begin by scrolling to the section labeled Between Markers at the bottom of the page. Enter the STS marker names ‘RH70674’ and ‘G32133’ into the two text boxes, and click on Submit STS Markers. This will produce a display showing SNPs 1–25 out of the total of 81 within the region of interest. Go to page 3 of the display by entering ‘3’ in the Page box and clicking Display.
The resulting page (Fig 4.1) illustrates most of the possible types of result one would find on a typical dbSNP results page. In the table, starting from the left, the first column gives the individual dbSNP cluster IDs (all starting with ‘rs’). The second column, labeled Map, shows whether a particular SNP has been mapped to a unique position in the genome (illustrated by a single green arrow, as in the first row of the example) or to multiple positions (not shown here).
The next set of columns, labeled Gene, indicates whether these
SNPs are associated with particular features, such as genes,
mRNAs or coding regions. The three columns (L, T and C) are
either lit up or appear gray in every row. Taking each in order:
If the L (for locus) appears in blue, part or all of the marker
position lies either within 2 kilobases (kb) of the 5′ end of a gene
feature or within 500 bases of the 3′ end of a gene feature.
If the T (for transcript) appears in green, part or all of the
marker position overlaps with a known mRNA. This does not
mean, however, that the SNP marker necessarily falls within a
coding region.
If the C (for coding) appears in orange, part or all of the
marker position overlaps with a coding region.
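The three rules above can be expressed as simple interval tests. The following is a minimal sketch, not dbSNP's actual implementation. It assumes a forward-strand gene, 1-based inclusive coordinates, and that the locus window is the gene itself extended by 2 kb upstream and 500 bases downstream. The `Gene` class, the function names and all coordinates are hypothetical.

```python
from dataclasses import dataclass

UPSTREAM = 2000   # 2 kb window at the 5' end, as described in the text
DOWNSTREAM = 500  # 500-base window at the 3' end


@dataclass
class Gene:
    # Hypothetical forward-strand gene; coordinates are 1-based and inclusive.
    start: int        # 5' end of the gene feature
    end: int          # 3' end of the gene feature
    mrna_exons: list  # (start, end) intervals covered by the mRNA
    cds_exons: list   # (start, end) intervals covered by the coding region


def overlaps(start, end, intervals):
    """True if [start, end] shares at least one base with any interval."""
    return any(start <= i_end and end >= i_start for i_start, i_end in intervals)


def ltc_flags(snp_start, snp_end, gene):
    """Return which of the L, T and C indicators would be lit for a marker."""
    locus_window = [(gene.start - UPSTREAM, gene.end + DOWNSTREAM)]
    return {
        "L": overlaps(snp_start, snp_end, locus_window),
        "T": overlaps(snp_start, snp_end, gene.mrna_exons),
        "C": overlaps(snp_start, snp_end, gene.cds_exons),
    }


gene = Gene(
    start=10000,
    end=20000,
    mrna_exons=[(10000, 10500), (15000, 15400), (19000, 20000)],
    cds_exons=[(10200, 10500), (15000, 15400), (19000, 19300)],
)
print(ltc_flags(10100, 10100, gene))  # in the 5' untranslated region
```

A marker in an untranslated part of the mRNA lights L and T but not C, which is the case the text warns about: overlap with a transcript does not imply overlap with a coding region.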
The next column, labeled Het, indicates the average
heterozygosity observed for this marker, on a scale of 0–100%. A reading
of zero means that no information is available for that particular
marker, whereas the pink bars show a 95% confidence interval
for the marker. The Validation column indicates whether the
marker has been validated (shown by a star) or is unvalidated
(shown by light blue boxes). Validated markers have been
verified by independent re-analysis of the sequence. All of the
unvalidated markers shown in Fig. 4.1 are denoted by three blue boxes,
which, according to the scale at the top of the column, means that
there is a >95% success rate in validation. This figure indicates
the probability that this marker is real. (The success rate is
defined as 1 – false-positive rate.)
In the penultimate column, the symbol TT (not shown here)
indicates that individual genotypes are available for this marker.
Finally, the Linkout Avail column indicates which markers are
linked to other databases; a P in this column indicates that the
variation has been mapped to a known protein structure. For a complete description of all the features within this display, click
on any part of the header above the columns.
Returning to the original question, one of the SNPs displayed
on this page does indeed fall within a coding region, as
indicated by an orange C. To obtain more information on any
particular SNP, simply click on the hyperlinked SNP Cluster ID.
Clicking on rs1059133, for example, produces a new page, with
all available information on that SNP (Fig. 4.2). Under the
header marked Submitter records for this RefSNP Cluster is a list
of the individual SNPs (in this case, only one SNP) that have been clustered together to form this single reference SNP. The sequence of the SNP is shown in the next header. Under the
header marked NCBI Resource Links are GenBank and NCBI
RefSeq entries that are associated with this SNP. Scrolling further down on the SNP page (Fig. 4.3), the gene whose coding
region this SNP falls within is indicated in the LocusLink Analysis section (ADAM2, a disintegrin and metalloproteinase
domain 2). The SNP allele is G/C, a non-synonymous change leading to replacement of the Asp residue in the reference sequence by a His residue. Links are also provided to the NCBI Map Viewer, Ensembl map and UCSC genome assembly in the
section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw
data on this particular SNP.
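To see why a G/C change can replace Asp with His, consider the standard genetic code: Asp is encoded by GAT and GAC, and His by CAT and CAC, so a G-to-C change at the first codon position converts one into the other. The sketch below illustrates this with a deliberately minimal codon table. The source does not state which Asp codon ADAM2 uses at this site, so both are shown, and the function names are illustrative only.

```python
# Only the Asp and His codons of the standard genetic code are listed here.
CODON_TO_AA = {"GAT": "Asp", "GAC": "Asp", "CAT": "His", "CAC": "His"}


def substitute(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify(ref_codon, alt_codon):
    """Return (reference residue, variant residue, type of change)."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# A G-to-C change at the first position of either Asp codon yields His.
for ref in ("GAT", "GAC"):
    alt = substitute(ref, 0, "C")
    print(ref, "->", alt, classify(ref, alt))
```

By contrast, a change at the third position (GAT to GAC) leaves the residue as Asp and would be classified as synonymous.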
To answer the final part of this question requires jumping from dbSNP to LocusLink (ref. 10). To do so, click on the ADAM2 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the user to the LocusLink page for ADAM2 and provides
numerous jumping-off points to the NCBI and affiliated resources through the boxed links at the top of the page. More information on these resources can be found by following the LocusLink FAQ link in the left-hand column of the page. By simply examining the LocusLink page itself, one sees that the ADAM2 protein belongs to a family of membrane-anchored proteins that have been implicated in processes as diverse as fertilization, muscle development and neurogenesis.
One often-overlooked source of information on genes and gene products is OMIM (ref. 14). This is an electronic version of the
catalog of human genes and genetic disorders developed by
Victor McKusick at The Johns Hopkins University. OMIM provides
the user with concise textual information from the published
literature on most human disorders with a genetic basis, and
links back to the primary literature as appropriate. Information
comprising an OMIM entry includes the gene symbol, alternate
names for the disease, a description of the disease (including
clinical, biochemical and cytogenetic features), details of the
mode of inheritance (including mapping information) and a clinical synopsis. These entries are manually curated, ensuring that the ‘executive summary’ is up to date and accurate. Although OMIM can be searched directly, many LocusLink entries also link to the OMIM record for the gene. The OMIM entry page for the ADAM2 protein is shown in Fig. 4.4. The page is fully hyperlinked to PubMed, GenBank and other related databases.
Using the UCSC browser, users can retrieve the positions of genome annotations such as SNPs as a text file suitable for loading into a spreadsheet program. While looking at the browser for a defined chromosomal region, click on the
Tables link (Fig. 1.6, upper blue bar). Similarly, to export a
list of genome annotations in a defined chromosomal region
at Ensembl, click on Export from any ContigView window
(Fig. 1.14, center yellow bar).
Question 5
Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively spliced transcripts?
doi:10.1038/ng970
For the purpose of this example, the fragment of mRNA of
interest is contained within GenBank accession number BG334944.
First, retrieve the nucleotide sequence of this EST using the
NCBI’s Entrez interface, at http://www.ncbi.nlm.nih.gov/Entrez/.
Type ‘BG334944’ into the text box at the top of the page,
change the pull-down menu to Nucleotide and press Go. The
resulting page shows one entry, corresponding to accession
number BG334944. To retrieve this sequence in FASTA format (a
common format for bioinformatics programs), change the
pull-down menu on this page to FASTA and then press Text (Fig. 5.1).
A new web page containing only the sequence, in FASTA format,
is produced (Fig. 5.2); copy the resulting sequence.
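For readers unfamiliar with FASTA format, a record consists of a header line beginning with ‘>’ followed by one or more lines of sequence. The sketch below shows one way such text might be read programmatically. The record is a placeholder invented for illustration, not the actual BG334944 sequence, and the function name is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record: the header and bases are invented for illustration.
EXAMPLE = """>example_est hypothetical description
ACGTACGTAC
GTTTGACA
"""

records = parse_fasta(EXAMPLE)
for header, sequence in records:
    print(header, len(sequence))
```

When pasting into a web form such as the BLAT search page used next, the header line and the sequence lines are pasted together exactly as they appear in the FASTA output.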
To determine where this sequence maps within the genome,
use UCSC’s BLAT tool (ref. 8). Begin this search by pointing your web
browser to the UCSC Genome Browser home page, at
http://genome.ucsc.edu. From this page, select Human from the
Organism pull-down menu in the blue bar on the side of the
page, and then click Blat. Paste the FASTA-formatted sequence
obtained from Entrez (above) into the large text box on the BLAT
search page (Fig. 5.3), change the Freeze pull-down menu to Dec.
2001, change the Query pull-down menu to DNA and then press
Submit. The server will (very quickly) return the search results; in
this case, a single match of length 636 is found on the forward
strand of chromosome 9 (Fig. 5.4).
To obtain more details on this hit, click the details link, to the
left of the entry. A long web page is returned, with three major
sections: the mRNA sequence (Fig. 5.5, top), the genomic
sequence (Fig. 5.5, middle) and an alignment of the mRNA
sequence against the genomic sequence (see Fig. 5.9 for an
example). In the alignment in Fig. 5.5, matching bases in the cDNA
and genomic sequences are colored in darker blue and
capitalized. Gaps are indicated in lower-case black type. Light blue
upper-case bases mark the boundaries of aligned regions on
either side of a gap and are often splice sites.
Returning to the BLAT summary page for this search (Fig. 5.4),
click on browser. This will produce a graphic representation of
where this particular mRNA sequence aligns to the genome
(Fig. 5.6). The track labeled Chromosome Band indicates that the
mRNA maps to 9q34.11. The query sequence itself is represented
on the line labeled Your Sequence from BLAT Search (arrow,
Fig. 5.6). The sequence is shown as being discontinuous: regions
of similarity are shown as vertical lines, gaps are shown as thin
horizontal lines, and the direction of the alignment is indicated
by the arrowheads. The aligned regions of the EST query
correspond to the exons of a known gene, shown on the line
immediately below (Known Genes, here RAB9P40). Typing the EST
name, BG334944, directly into a UCSC search box would have
generated a similar result to that shown in Fig. 5.6, but part of the
purpose of this example is to illustrate the use of BLAT.
Approximately halfway down the graphic is a track labeled
Human ESTs That Have Been Spliced. This track is at first shown
in dense mode, with all the ESTs condensed onto a single line. To
see all of the ESTs that align with the genome in this region,
potentially representing differentially spliced transcripts, click
on the track’s label. This will expand this area of the figure so that each EST occupies a single line (Fig. 5.7). The ESTs are of varying length, but most contain the same exons as the known gene and are (presumably) spliced in the same way. Close inspection indicates that some of the ESTs are missing one or more exons compared with the known gene. Consider the lines
marked BE798864 and W52533: the former appears to be
missing the fifth exon, whereas the latter is missing the fourth, fifth and sixth exons.
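The visual comparison described above amounts to asking which exons of the known gene have no overlapping aligned block in an EST. A minimal sketch of that check follows. All coordinates are invented for illustration and do not correspond to RAB9P40 or to the real alignments, and the function name is hypothetical. Because ESTs are often partial, only exons lying inside the span covered by the EST are considered.

```python
def skipped_exons(gene_exons, est_blocks):
    """Return 1-based numbers of gene exons that fall within the span of
    the EST alignment but are not overlapped by any aligned block."""
    span_start = min(start for start, _ in est_blocks)
    span_end = max(end for _, end in est_blocks)
    skipped = []
    for number, (ex_start, ex_end) in enumerate(gene_exons, start=1):
        inside_span = ex_start >= span_start and ex_end <= span_end
        covered = any(b_start <= ex_end and b_end >= ex_start
                      for b_start, b_end in est_blocks)
        if inside_span and not covered:
            skipped.append(number)
    return skipped


# Hypothetical genomic coordinates for a seven-exon gene.
gene_exons = [(100, 200), (400, 500), (700, 800), (1000, 1100),
              (1300, 1452), (1700, 1800), (2000, 2100)]

# Two hypothetical ESTs, each given as its aligned genomic blocks.
est_a = [(700, 800), (1000, 1100), (1700, 1800), (2000, 2100)]
est_b = [(400, 500), (700, 800), (2000, 2100)]

print(skipped_exons(gene_exons, est_a))
print(skipped_exons(gene_exons, est_b))
```

In this invented layout, the first EST skips exon 5 and the second skips exons 4 to 6, mirroring the pattern described for BE798864 and W52533. Exons outside an EST's span are not reported, because their absence may simply reflect an incomplete read.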
Any of the ESTs can be examined in more detail by clicking on
that particular line. Here, click on the line for BE798864 (arrow,
Fig. 5.7) to reach the information page for this EST (Fig. 5.8). The EST is 99.8% identical to the genomic sequence; clicking anywhere on the hyperlinked line in the section marked
EST/Genomic Alignments returns the actual side-by-side
alignment (Fig. 5.9). Differences exist at the ends of the EST, but the sequences are identical in the region surrounding the putative missing exon.
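A percentage identity such as the one quoted above can be computed from a pairwise alignment. The source does not say exactly how the browser derives its figure, so the sketch below uses one common convention as an assumption: matching positions divided by aligned positions, ignoring columns that contain a gap. The function name and test sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical bases over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == gap or base_b == gap:
            continue  # gap columns are ignored under this convention
        compared += 1
        if base_a.upper() == base_b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# One mismatch in 500 aligned bases corresponds to 99.8% identity.
est = "A" * 499 + "C"
genome = "A" * 500
print(round(percent_identity(est, genome), 1))
```

Other conventions count gaps as mismatches or normalize by the query length, so values from different tools are not always directly comparable.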
Ensembl also displays database hits that overlap with each exon in a transcript. These hits may include proteins as well as ESTs and mRNAs, and may illustrate alternatively spliced products. The hits are shown as green boxes in the TransView (Fig. 13.5), which can be accessed in a number of ways; for
example, by clicking on the View Evidence box for a transcript
on the GeneView (Fig. 1.10). Another good starting point for visualizing alternatively spliced transcripts is the NCBI’s
Model Maker (follow the mm link in Fig. 1.2). The Model
Maker displays putative exons from mRNAs, ESTs and gene predictions that align with the genome. Users can select individual exons from these alignments and build a customized gene model. As the Model Maker displays the nucleotide sequence of the model along with its three-frame translation, the effects of adding, modifying or deleting exons can be quickly evaluated.
An alternatively spliced mRNA is more likely to be of biological significance when it changes the sequence of the encoded, wildtype protein. To determine whether EST BE798864 could encode a protein different from that of the known gene
(RAB9P40), one can simply compare the two sequences directly
against each other using the NCBI’s BLAST 2 Sequences tool. First, open a new web browser window, because information from the above search will be needed here; this will prevent having
to use the browser’s Back and Forward keys excessively and is
a good general rule when using multiple web tools. Then access the BLAST home page, at http://www.ncbi.nlm.nih.gov/BLAST.
Select BLAST 2 Sequences, under the header labeled Pairwise BLAST. On this page, the user can simply enter accession numbers
rather than cutting and pasting sequences into the text boxes. For the EST, simply enter its accession number
(BE798864) into the box marked Enter accession or GI for
Sequence 1. Obtaining the accession number of RAB9P40
requires going back to the graphic shown in Fig. 5.6 and clicking
on the gene’s track. Once this has been done, input the gene’s
accession number (NM_005833) into the box marked Enter
accession or GI for Sequence 2. Make sure that the Program
pull-down is set to blastn (to compare a nucleotide sequence against
another nucleotide sequence, hence the n in blastn) and click the
Align button at the bottom of the page to generate the alignment
(Fig. 5.10). The sequence corresponding to sequence 1 (the EST)
is denoted as the query, whereas the sequence corresponding to
sequence 2 (the known gene) is denoted as the subject. The
known gene’s protein translation is also shown, starting at the
end of the third row of the alignment. Examination of the
alignment shows that the EST is missing 153 nt (nt 360–512 of the
mRNA), which corresponds to the fifth exon that is missing in BE798864. This gap is in frame, so the EST could encode a homologous yet shorter protein.
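The in-frame conclusion follows from simple arithmetic: a deletion preserves the reading frame when its length is a multiple of three. A short sketch using the coordinates quoted above (the function names are illustrative):

```python
def gap_length(first_nt, last_nt):
    """Length of a gap given 1-based inclusive start and end positions."""
    return last_nt - first_nt + 1


def is_in_frame(length):
    """A gap keeps the downstream reading frame if it is a multiple of 3."""
    return length % 3 == 0


length = gap_length(360, 512)      # nt 360-512 of the mRNA
print(length)                      # 153
print(is_in_frame(length))         # True
print(length // 3)                 # 51 codons removed
```

A gap of 153 nt therefore removes 51 codons without shifting the frame of the sequence that follows. Whether the shortened protein is actually produced cannot be determined from this calculation.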
Because of the nature of EST sequencing, ESTs often contain sequencing errors at a rate much higher than that of the finished or even draft genomic sequence. It is certainly encouraging that EST BE798864 aligns well with the genomic sequence and that its encoded protein could be in the same frame as that produced from the known gene. In addition, it appears from the UCSC graphic (Fig. 5.7) that other ESTs in this region, such as BE779110, are also missing the fifth exon of RAB9P40. All these predictions must, however, be tested computationally by looking
at the quality of the EST–genomic alignment as shown above. Final proof of alternative splicing can, of course, only be generated at the laboratory bench.