
a user's guide to the human genome


DOCUMENT INFORMATION

Title: A user’s guide to the human genome
Authors: Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins, Andreas D Baxevanis
Subject: Genomics
Type: Guide
Year of publication: 2002
Pages: 82

Supplement to Nature Genetics, September 2002

Cover art by Darryl Leja

Power to the people

Andreas D Baxevanis & Francis S Collins

A user’s guide to the human genome

Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins & Andreas D Baxevanis

in this interval be identified? What BAC clones cover that particular region?

Question 4

A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?

Question 5

Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively spliced transcripts?

How would an investigator easily find compiled information describing the structure of a gene of interest? Is it possible to obtain the sequence of any putative promoter regions?

Question 11

An investigator has identified and cloned a human gene, but no corresponding mouse ortholog has yet been identified. How can a mouse genomic sequence with similarity to the human gene sequence be retrieved?

A user has identified an interesting phenotype in a mouse model and has been able to narrow down the critical region for the responsible gene to approximately 0.5 cM. How does one find the mouse genes in this region?

There was a time, not too long ago, when the wisdom of genome-sequencing projects was up for discussion. Would they be too expensive, draining funds from other areas of the life sciences? Would they be worth the trouble? Not much more than 15 years have passed since those early debates, and the importance of sequenced genomes to biology and medicine has now gained wide acceptance. This is in part owing to the relatively rapid fall in the cost of sequencing, followed by the undeniably important insights gained from the annotation of several bacterial genomes, and those of a few of our favorite eukaryotes. The news has been so relentlessly upbeat that one might even have expected some ‘genome fatigue’ to set in, especially given the saturation coverage of the publication of the drafts of the human genome sequence 18 months ago. Not so, however; witness the recent jockeying by different groups for inclusion of ‘their’ model organism in the next round of sequencing projects. The honeymoon goes on.

And yet there are important issues to be addressed. One is the concern surrounding any bestseller: that it will have far fewer actual readers than one might expect. At first glance, this would seem not to apply to the human genome. After all, one is hard pressed these days to pick up a copy of Nature Genetics, or any genetics journal, and not find evidence that sequenced genomes inform many of the most important advances. A survey published last year by the Wellcome Trust, however, found that only half of the researchers who were using sequence data were fully conversant with the services provided by the freely accessible databases.

There is also the concern that genome sequencers might be victims of their own success. As computational biologist David Roos recently put it, “We are swimming in a rapidly rising sea of data…how do we keep from drowning?” And if geneticists and bioinformaticians are struggling to stay afloat, what of the non-geneticists who are eager to exploit the sequences but are relative newcomers to the tools needed to navigate all of this information?

It is with these questions in mind that we present A User’s Guide to the Human Genome. Written by Tyra Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis Collins and Andreas Baxevanis of the National Human Genome Research Institute (NHGRI), this peer-reviewed how-to manual guides the reader through some of the basic tasks facing anyone whose work might be facilitated by an improved understanding of the online resources that make sense of annotated genomes. The directors of these online resources (Ewan Birney of Ensembl, David Haussler of the University of California, Santa Cruz and David Lipman of the National Center for Biotechnology Information) have served as advisors during the development of this guide, ensuring a balanced and accurate treatment of their respective web portals. The online version of the guide will also evolve, with an initial update scheduled for April 2003.

As noted by Harold Varmus in his eloquent perspective on A User’s Guide and the public databases it examines, one of the important legacies of the Human Genome Project is its ethos of open access to the data. In this spirit, and with the generous sponsorship of the NHGRI and the Wellcome Trust, the online version of this supplement will be freely available on the Nature Genetics website.

Alan Packer, Nature Genetics

Spreading the word


Power to the people

doi:10.1038/ng962

The National Human Genome Research Institute of the National Institutes of Health is delighted to sponsor this special supplement of Nature Genetics. The primary aim of this supplement is to provide the reader with an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium, as well as data found in other publicly available genome databases. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available and highlighting the most common types of questions that can be asked by searching and analyzing genomic databases. These examples, which have been set in a variety of biological contexts, provide step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. It is hoped that readers will grow in confidence and capability by working through the examples, understanding the underlying concepts, and applying the strategies used in the examples to advance their own research interests.

One of the motivating factors behind the development of this User’s Guide comes from the general sense that the most commonly used tools for genomic analysis still are terra incognita for the majority of biologists. Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicated that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. The inherent potential underlying all of this sequence-based data is tremendous, so the importance of all biologists having the ability to navigate through and cull important information from these databases cannot be overstated.

The study of biology and medicine has truly undergone a major transition over the last year, with the public availability of advanced draft sequences of the genomes of Homo sapiens and Mus musculus, rapidly growing sequence data on other organisms, and ready access to a host of other databases on nucleic acids, proteins and their properties. Yet for the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries. As pointed out by Harold Varmus in the Perspective, free accessibility of all of this basic information, without restrictions, subscription fees or other obstacles, is the most critical component of realizing this potential. It is our modest hope that this User’s Guide will provide another useful contribution.

Andreas D Baxevanis and Francis S Collins, National Human Genome Research Institute

Genomic empowerment: the importance of public databases

doi:10.1038/ng963

Over the past twenty-five years, a mere sliver of recorded time, the world of biology, and indeed the world in general, has been transformed by the technical tools of a field now known as genomics. These new methods have had at least two kinds of effects. First, they have allowed scientists to generate extraordinarily useful information, including the nucleotide-by-nucleotide description of the genetic blueprint of many of the organisms we care about most: many infectious pathogens; useful experimental organisms such as mice, the round worm, the fruitfly, and two kinds of yeast; and human beings. Second, they have changed the way science is done: the amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.

Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it. Equally importantly, these revolutionary changes have been disseminated throughout the scientific community, and spread to other interested parties, because many of those who practice genomics have made a concerted effort to ensure that access is simplified for all, including those who have not been deeply schooled in the information sciences. The goal of providing genomic information widely has also inevitably attracted the interests of those in the commercial sector, and privately developed versions of various genomes are also now available, albeit for a licensing fee.

The operative principle most prominently involved in transmitting the fruits of genomics, the one that has captured the imagination of the public and served as a standard for the sharing of results and methods more generally in modern biology, has been open access. Funding by public and philanthropic organizations, such as the U.S. National Institutes of Health, the U.S. Department of Energy, the Wellcome Trust in Britain, and many other organizations, has made this altruistic behavior possible and has fostered the idea that genomic information about biological species should be available to all. (Such information about individual human beings is, of course, an entirely different matter and should be protected by privacy rules.) The attitude of open access to new biological knowledge has also been embodied in the databases of the International Nucleotide Sequence Database Collaboration, comprising the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and GenBank at the U.S. National Library of Medicine. The same focus on open access is exemplified by PubMed (operated by the NLM), other gateways to the scientific literature, and the assemblies of genomic sequence now found at the several Web portals described in this guide.

The Human Genome Project (HGP), which has supported the public genome sequencing effort, has been the mainstay of the effort to make genomes accessible to the entire community of scientists and all citizens. This effort has, in fact, been quite naturally extended to instruct the public about many themes in modern biological science. This has occurred in part because the human genome itself has been such an exciting concept for the public; in part because genomes are natural entry points for teaching many of the principles of biological design, including evolution, gene organization and expression, organismal development, and disease; and in part because those who work on genomes have been tireless in attempts to explain the meaning of genes to an eager public. Endless metaphors, artistic creations, lively journalism, monographs about social and ethical implications, televised lectures from the White House, and many other cultural happenings have been among the manifestations of this fascination. In this way, the HGP has had a strong hand in raising the public’s awareness of new ideas in biology and of the powerful implications of genomics in medicine, law and other societal institutions.

Some of these cultural effects come as much from the behavioral aspects of the HGP as from the genomic sequences themselves. The sharing of new information, even before its assembly into publishable form, has spurred efforts to share other kinds of research tools and has encouraged the notion of making the scientific literature freely accessible through the Internet. The contribution of scientists in many countries to the sequencing of many genomes, including the human genome, has inspired efforts to develop gene-based sciences (from basic genomics to biotechnology) throughout the world, including the poorest developing nations. Indeed, the World Health Organization, the United Nations, and the World Bank have all contributed recently to the growth of the ideas that science is both possible and valuable in all economies and that science can be a means to help unify the world’s population under a banner of enlightenment, demonstrating a virtue of globalization.

From this perspective, the availability of the sequences of many genomes through the Internet is a liberating notion, making extraordinary amounts of essential information freely accessible to anyone with a desktop computer and a link to the World Wide Web. But the information itself is not enough to allow efficient use. Interested people who reside outside the centers for studying genomes need to be told where best to view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis.

The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics. The information is provided in a highly inviting and understandable format by casting it in the form of answers to the questions most commonly posed when approaching big genomes. The information, made freely available on the World Wide Web, has been assembled by some of the best minds in the HGP, who have generously given their time and intellect to encourage widespread use of the great bounty that has been created over the past two decades.

In other words, the guide to use of genomes provided here is simply another indication that the HGP should take great pride in much more than the sequencing of genomes.

Harold Varmus, Memorial Sloan-Kettering Cancer Center

A user’s guide to the human genome

doi:10.1038/ng964

The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions for using many of the most commonly used tools for sequence-based discovery. The major web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system, along with many others that are discussed in the individual examples. It is hoped that readers will become more familiar with these resources, allowing them to apply the strategies used in the examples to advance their own research programs.

Introduction: putting it together

doi:10.1038/ng965

In its short history, the Human Genome Project (HGP) has provided significant advances in the understanding of gene structure and organization, genetic variation, comparative genomics and appreciation of the ethical, legal and social issues surrounding the availability of human sequence data. One of the most significant milestones in the history of this project was met in February 2001 with the announcement and publication of the draft version of the human genome sequence¹. The significance of this milestone cannot be overstated, as it firmly marks the entrance of modern biology into the genome era (and not the post-genome era, as many have stated). The potential usefulness of this rich databank of information should not be lost on any biologist: it provides the basis for ‘sequence-based biology’, whereby sequence data can be used more effectively to design and interpret experiments at the bench. The intelligent use of sequence data from humans and model organisms, along with recent technological innovation fostered by the HGP, will lead to important advances in the understanding of diseases and disorders having a genetic basis and, more importantly, in how health care is delivered from this point forward².

Although this flood of data has enormous potential, many investigators whose research programs stand to benefit in a tangible way from the availability of this information have not been able to capitalize on its potential. Some have found the data difficult to use, particularly with respect to incomplete human genome draft sequence information. Others are simply not sufficiently conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information space, numerous World Wide Web sites, courses and textbooks have become available; many individuals, of course, also turn to their friends and colleagues for guidance. We have prepared this Guide in that same spirit, as an additional resource for our fellow scientists who wish to make use (or better use) of both sequence data and the major tools that can be used to view these data. The Guide has been written in a practical, question-and-answer format, with step-by-step instructions on how to approach a representative set of problems using publicly available resources. The reader is encouraged to work through the examples, as this is the best way to truly learn how to navigate the resources covered and become comfortable using them on a regular basis. We suggest that readers keep copies of the Guide next to their computers as an easy-to-use reference.

Before embarking on this new adventure, it is important to review a number of basic concepts regarding the generation of human genome sequence data. This review does not discuss the chronological development of the HGP or provide an in-depth treatment of its implications; the reader is referred to Nature’s Genome Gateway (http://www.nature.com/genomics/human/) for more information on these topics.

Current status of human genome sequencing

Sequencing of the human genome is nearing completion. The target date for making the complete, high-accuracy sequence available is April 2003, the 50th anniversary of the discovery of the double helix³. As we go to press, however, the work is still a mosaic of finished and draft sequence. A sequence becomes finished when it has been determined at an accuracy of at least 99.99% and has no gaps. Sequence data that fall short of that benchmark but can be positioned along the physical map of the chromosomes are termed ‘draft’. Currently, 87% of the euchromatic fraction of the genome is finished and less than 13% is at the draft stage.
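The finished/draft distinction above can be expressed as a small check. The sketch below is illustrative only: the `SequenceRegion` record and its field names are hypothetical and do not correspond to any IHGSC or GenBank data format, and the ‘unplaced’ label is an assumption for data that meet neither definition. An accuracy of at least 99.99% is treated as at most one erroneous base per 10,000.

```python
from dataclasses import dataclass


@dataclass
class SequenceRegion:
    """Hypothetical summary of one sequenced region (not a real data format)."""
    length_bp: int          # total length of the region in base pairs
    estimated_errors: int   # estimated number of incorrect base calls
    gap_count: int          # number of unresolved gaps
    placed_on_map: bool     # can the region be positioned on the physical map?


def meets_finished_accuracy(region: SequenceRegion) -> bool:
    """At least 99.99% accurate, i.e. at most 1 error per 10,000 bases.

    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    return region.estimated_errors * 10_000 <= region.length_bp


def classify(region: SequenceRegion) -> str:
    """Return 'finished', 'draft' or 'unplaced' (the last label is an assumption)."""
    if meets_finished_accuracy(region) and region.gap_count == 0:
        return "finished"
    if region.placed_on_map:
        return "draft"
    return "unplaced"
```

For example, a gap-free 150,000-bp region with 15 estimated errors sits exactly at the threshold and is classified as finished; one more error, or a single gap, makes it draft.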

Even in this incomplete state, the available data are extremely useful. This usefulness was apparent early on, leading the International Human Genome Sequencing Consortium (IHGSC) to pursue a staged approach in sequencing the human genome. The first stage generated draft sequence across the entire genome¹. The project is now well advanced into its second stage, with draft sequence being improved to ‘finished quality’ across the entire genome, a necessarily localized process. As a result, and as it has been presented to date, the human genome sequence is an evolving mix of both finished and unfinished regions, with the unfinished regions varying in data quality. As the data are initially made available in raw form, with subsequent refinement and improvement, and because data of different quality are found in different places in the genome, users must understand the kinds of data presented by the various tools available.

Determining the human sequence: a brief overview

As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp¹,⁴. To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing⁴. The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches. These approaches are described in detail elsewhere⁴.
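The idea of assembling overlapping reads back into a longer string can be illustrated with a toy example. The sketch below is a deliberately naive greedy merge, not the algorithm used by the IHGSC or by any production assembler: it assumes error-free reads from a single strand, ignores repeats and reads contained within other reads, and the function names and `min_len` parameter are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three short, made-up "reads" that tile a 15-base sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
contigs = greedy_assemble(reads)
```

With these three reads the merge yields a single contig; when reads fail to overlap, several contigs remain, which mirrors the multiple ‘sequence contigs’ of a draft-level clone.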

The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy¹. The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.

1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.

2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data. These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).

3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank⁵, where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and assure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.

4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.

5. Subsequent to the generation and publication of the draft human genome sequence, work has continued towards finishing the sequencing. The final stage initially targeted draft-quality BAC clones. For each of these clones, enough additional shotgun sequence data are obtained to bring the coverage to eight- to tenfold, a stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often in just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).

6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’. Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.

7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, there are a number of steps involved in this process that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html).
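Some of the bookkeeping in the steps above can be sketched in a few lines: the arithmetic behind ‘fold coverage’, the order of the HTGS keywords a clone passes through, and the accession.version increment. This is a minimal illustration, not a description of any sequencing-centre or GenBank software. The function names are invented, the 150,000-bp clone length and 500-bp read length in the example are assumed values rather than figures from the text, and the read count is the idealized number that ignores failed or low-quality reads.

```python
import math
from typing import Optional

# Keywords named in the steps above, in the order a clone passes through them.
HTGS_KEYWORDS = ["htgs_draft", "htgs_fulltop", "htgs_activefin", "htgs_phase3"]


def reads_needed(fold: float, read_length_bp: int, clone_length_bp: int) -> int:
    """Idealized number of reads so that an average base is covered `fold` times."""
    return math.ceil(fold * clone_length_bp / read_length_bp)


def next_keyword(keyword: str) -> Optional[str]:
    """Keyword that follows `keyword` in the pipeline, or None after the last one."""
    i = HTGS_KEYWORDS.index(keyword)
    return HTGS_KEYWORDS[i + 1] if i + 1 < len(HTGS_KEYWORDS) else None


def division(keyword: str) -> str:
    """Finished (phase 3) records move from the HTGS to the primate division."""
    return "primate" if keyword == "htgs_phase3" else "HTGS"


def bump_version(accession_version: str) -> str:
    """Keep the accession, increase the version by one."""
    accession, version = accession_version.rsplit(".", 1)
    return f"{accession}.{int(version) + 1}"


# Assumed example values: a 150,000-bp clone sequenced with 500-bp reads.
draft_reads = reads_needed(4, 500, 150_000)        # fourfold, draft level
topped_up_reads = reads_needed(10, 500, 150_000)   # tenfold, fully topped-up
updated = bump_version("AC108475.2")
```

Under these assumptions, moving from fourfold to tenfold coverage means going from 1,200 to 3,000 reads for the clone, which gives a sense of the extra sequencing that ‘topping up’ requires.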

Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing) to focus their efforts on a single, definitive assembly. To this end, and by agreement, the NCBI assembly will be taken as the reference human genome sequence. It is this NCBI assembly that is displayed at the three major portals covered in this guide.

NCBI reference sequences

The data release and distribution practices adopted by the HGP participants have led not only to very early, pre-publication access to this treasure trove of information, but also to a potentially confusing variety of formats and sources for the sequence data. To address this and other issues, the NCBI initiated the RefSeq project (http://www.ncbi.nlm.nih.gov/locuslink/refseq.html).

The goal of the RefSeq effort is to provide a single reference sequence for each molecule of the central dogma: DNA, the mRNA transcript, and the protein. The RefSeq project helps to simplify the redundant information in GenBank by providing, for example, a single reference for human glyceraldehyde-3-phosphate dehydrogenase mRNA and protein, out of the 14 or so full-length sequences in GenBank. Each alternatively spliced transcript is represented by its own reference mRNA and protein. The RefSeq project also includes sequences of complete genomes and whole chromosomes, and genomic sequence contigs. The human genomic contigs that NCBI assembles, which form the basis of the presentations in the different genome browsers, are part of the RefSeq project. Most RefSeq entries are considered provisional and are derived by an automated process from existing GenBank records. Reviewed RefSeq entries are manually curated and list additional publications, gene function summaries and sometimes sequence corrections or extensions.

Reference sequences are available through NCBI resources, including Entrez, BLAST and LocusLink. They can be easily recognized by the distinctive style of their accession numbers. NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the genome. The NCBI also provides model mRNA RefSeqs produced from genome annotation. These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled genome and then extracting the genomic sequence corresponding to the transcripts. The resulting model mRNA and model protein sequences have accession numbers of the form XM_###### and XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from the original NM_ or GenBank mRNAs because of real sequence polymorphisms, errors in the genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete list of types of RefSeqs, along with details on how they are produced, is available from http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html.
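The accession-number prefixes described in the ‘NCBI reference sequences’ box lend themselves to a simple lookup. The sketch below covers only the five prefixes named in the box; the function name is invented, the accession numbers in the example are made up for illustration, and accepting any number of digits plus an optional version suffix is an assumption rather than a rule taken from the text.

```python
import re
from typing import Optional

# Prefixes described in the RefSeq box and the kind of record each designates.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig",
    "XM": "model mRNA (from genome annotation)",
    "XP": "model protein (from genome annotation)",
}

# Two capital letters, an underscore, digits, and an optional ".version".
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


# Made-up accession numbers, used only to exercise the lookup.
examples = {acc: classify_refseq(acc)
            for acc in ["NM_123456", "XP_123456.1", "AC108475.3"]}
```

A GenBank clone accession such as AC108475.3 has no RefSeq-style prefix, so the lookup returns `None` for it.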

Annotating the assemblies

Once the assemblies have been constructed, the DNA sequence undergoes a process known as annotation, in which useful sequence features and other relevant experimental data are coupled to the assembly. The most obvious annotation is that of known genes. In the case of NCBI, known genes are identified by simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on to the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database6 and the assembly using a fast protein-to-DNA sequence matcher7. UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program8. In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.
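The placement rule quoted above for equally good alignments can be written as a short predicate. This is a sketch of one possible reading of the rule (at least 95% identity, and either coverage of half the transcript or at least 1,000 aligned bases); the function name and parameters are hypothetical and this is not the NCBI annotation pipeline itself.

```python
def passes_placement_rule(percent_identity, aligned_bases, transcript_length):
    """Apply one reading of the rule for placing equal-quality alignments.

    Assumed interpretation: identity >= 95%, and the aligned region either
    covers at least half of the transcript or spans at least 1,000 bases.
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    identical_enough = percent_identity >= 95.0
    covers_half = aligned_bases * 2 >= transcript_length
    long_enough = aligned_bases >= 1000
    return identical_enough and (covers_half or long_enough)
```

Under this reading, a 5,000-base transcript with 1,200 aligned bases at 97% identity would be placed, because the 1,000-base criterion is met even though coverage is below 50%.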

Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones. Full details on the types of annotation available and the methods underlying sequence annotation for each of these different types of sequence feature can be found by accessing the URLs listed under Genome Annotation in the Web Resources section of this guide. At UCSC, many of the annotations are provided by outside groups, and there may be a significant delay between the release of the genome assembly and the annotation of certain features. Furthermore, some tracks are generated for only a limited number of assemblies. For an in-depth discussion of genome annotation, the reader is referred to an excellent review by Stein9 and the references cited therein. This review, along with the Commentary in this guide, also provides cautions on the possible overinterpretation of genome annotation data.

The data—and sometimes the tools—change every day

The steps outlined in the previous section should emphasize that the state of the human genome sequence will continue to be in flux, as it will be updated daily until it has actually been declared ‘finished’. (Finished sequence is properly defined as the “complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps”2. A more practical definition is that of “essentially finished sequence,” meaning the complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps, except those that cannot be closed by any current method.) The reader should be mindful of this, not just when reading this guide but also when referring back to it over time. Similarly, the tools used to search, visualize and analyze these sequence data also undergo constant evolution, capitalizing on new knowledge and new technology in increasing the usefulness of these data to the user.
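The 99.99% accuracy threshold in the definition of finished sequence corresponds to no more than one error per 10,000 bases on average. The small Python sketch below turns that into a bound for a sequence of a given length; the function name and the 150,000-base clone length used in the usage note are hypothetical illustrations, not figures from this guide.

```python
def max_expected_errors(length_bases, errors_per_10kb=1):
    """Upper bound on expected errors at 99.99% accuracy.

    99.99% accuracy is read here as at most 1 error per 10,000 bases.
    """
    if length_bases < 0:
        raise ValueError("length_bases must not be negative")
    return length_bases * errors_per_10kb / 10_000
```

Under this reading, a hypothetical finished clone of 150,000 bases would be expected to contain at most about 15 errors.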

Over the next year, sequence producers will continue to add finished sequence to the nucleotide sequence databases, and the NCBI will continue to update the human sequence assembly until its ultimate completion. The human genome sequence will, however, continue to improve even after April 2003, as new cloning, mapping and sequencing technologies lead to the closure of the few gaps that will remain in the euchromatic regions. It is hoped that such technological advances will also allow for the sequencing of heterochromatic regions, regions that cannot be cloned or sequenced using currently available methods.

The sequence-based and functional annotations presented at the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.

Accessing human genome sequence data

Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.

Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes7. It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression. The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition, numerous sequence-based search tools are available, and the Ensembl system itself can be downloaded for use with individual sequencing projects.

The UCSC Genome Browser (http://genome.ucsc.edu) was originally developed by a relatively small academic research group that was responsible for the first human genome assemblies. The genome can be viewed at any scale and is based on the intuitive idea of overlaying ‘tracks’ onto the human genome sequence; these annotation tracks include, for example, known genes, predicted genes and possible patterns of alternative splicing. There is also an emphasis on comparative genomics, with mouse genomic alignments being available. The browser also provides access to an interactive version of the BLAT algorithm8, which UCSC uses for RNA and comparative genomic alignments.

Given its Congressional mandate to store and analyze biological data and to facilitate the use of databases by the research community, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a central hub for genome-related resources. NCBI maintains GenBank, which stores sequence data, including that generated by the HGP and other systematic sequencing projects. NCBI’s Map Viewer provides a tool through which information such as experimentally verified genes, predicted genes, genomic markers, physical maps, genetic maps and sequence variation data can be visualized. The Map Viewer is linked to other NCBI tools such as Entrez, the integrated information retrieval system that provides access to numerous component databases.

Although we have chosen to illustrate each example using resources available at a single site, almost all the questions in this guide can be answered using any of the three browsers. The


informational sidebars that follow some of the questions provide pointers on how to format the search at other sites. Furthermore, the three sites link to each other wherever possible. Examples presented in this Guide rely on the data and genome browser interfaces that were available in June 2002. As new versions of the genome assembly and viewing tools will come online every few months, the specifics of some of the examples may change over time. Regardless, the basic strategies behind answering the questions in the examples will remain the same. This underscores the importance of readers working through the examples at their own computers so that they may understand and be able to navigate these public databases. Readers are encouraged to explore the alternative methods for answering the questions.


Question 1

How does one find a gene of interest and determine that gene’s structure? Once the gene has been located on the map, how does one easily examine other genes in that same region?

doi:10.1038/ng966

This question serves as a basic introduction to the three major genome viewers. One gene, ADAM2, will be examined using all three sites so that the reader can gain an appreciation of the subtle differences in information presented at each of these sites.

National Center for Biotechnology Information Map Viewer

The NCBI Human Map Viewer can be accessed from the NCBI’s home page, at http://www.ncbi.nlm.nih.gov. Follow the hyperlink in the right-hand column labeled Human map viewer to go to the Map Viewer home page. The notation at the top of the page indicates that this is Build 29, or the NCBI’s 29th assembly of the human genome. Build 29 is based on sequence data from 5 April 2002. The previous genome assembly, Build 28, was based on sequence data from 24 December 2001. To search for any mapped element, such as a gene symbol, GenBank accession number, marker name or disease name, enter that term in the Search for box and then press Find. For this example, enter ‘ADAM2’ and then press Find. The on chromosome(s) box may be left blank for text-based searches such as this one.

The resulting overview page shows a schematic of all of the human chromosomes, pinpointing the position of ADAM2 to the p arm of chromosome 8 (Fig 1.1). The search results section shows that the gene exists on two NCBI maps, Genes_cyto and Genes_seq. Genes_cyto refers to the cytogenetic map, whereas Genes_seq refers to the sequence map. Clicking on either of those two links opens a view of just that map.

Detailed descriptions of these and other NCBI maps are available at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/humansearch.html. To get the most general overview of the genomic context of ADAM2, including all available maps, click on the item in the Map element column (in this case, ADAM2). This view shows ADAM2 and a bit of flanking sequence on chromosome 8p11.2 (Fig 1.2). Three maps are displayed in this view, each of which will be discussed below. Additional maps, discussed in other examples in this guide, can be added to this view using the Maps & Options link.

The rightmost map is the master map, the map providing the most detail. The master map in this case is the Genes_seq map, which depicts the intron/exon organization of ADAM2 and is created by aligning the ADAM2 mRNA to the genome. The gene appears to have 14 exons. The vertical arrow next to the ADAM2 gene symbol (within the pink box) shows the direction in which the gene is transcribed. The gene symbol itself is linked to LocusLink, an NCBI resource that provides comprehensive information about the gene, including aliases, nucleotide and protein sequences, and links to other resources10 (see Question 10). The links to the right of the gene symbol point to additional information about the gene.

• sv, or sequence view, shows the position of the gene in the context of the genomic contig, including the nucleotide and encoded protein sequences.

• ev brings the user to the evidence viewer, a view that displays the biological evidence supporting a particular gene model. This view shows all RefSeq models, GenBank mRNAs, transcripts (whether annotated, known or potential) and expressed sequence tags (ESTs) aligning to this genomic contig. More information on the evidence viewer can be found on the NCBI web site by clicking Evidence Viewer Help on any ev report page.

• hm is a link to the NCBI’s Human–Mouse Homology Map, showing genome sequences with predicted orthology between mouse and human (Fig 12.2).

• seq allows the user to retrieve the genomic sequence of the region in text format. The region of sequence displayed can easily be changed.

• mm is a link to the Model Maker, which shows the exons that result when GenBank mRNAs, ESTs and gene predictions are aligned to the genomic sequence. The user can then select individual exons to create a customized model of the gene. More information on the Model Maker can be found on the NCBI web site by clicking help on any mm report page.

The UniG_Hs map shows human UniGene clusters that have been aligned to the genome. The gray histogram depicts the number of aligning ESTs and the blue lines show the mapping of UniGene clusters to the genome. The thick blue bars are regions of alignment (that is, exons) and the thin blue lines indicate potential introns. In this example, the mapping of UniGene cluster Hs.177959 to the genome follows that of ADAM2, and all the exons align.
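The relationship between aligned blocks (exons) and the gaps between them (potential introns) can be illustrated with a few lines of Python. This is a sketch only: the function name is hypothetical, coordinates are assumed to be 1-based and inclusive, and the numbers in the usage note are invented rather than taken from ADAM2.

```python
def introns_from_exons(exons):
    """Given exon (start, end) pairs, 1-based inclusive, return the gaps between them."""
    ordered = sorted(exons)
    introns = []
    # Walk over consecutive exon pairs and record any gap between them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns
```

With made-up aligned blocks at 100-200, 500-650 and 900-1000, the function returns the two intervening gaps, 201-499 and 651-899.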

The Genes_cyto map shows genes that have been mapped cytogenetically; the orange bar shows the position of the gene. Although ADAM2 has been finely mapped and is represented by a short line, other genes, such as the group below it on a longer line, have been cytogenetically mapped to broader regions of chromosome 8.

Clicking on the zoom control in the blue sidebar allows the user to zoom out to view a larger region of chromosome 8. Zooming out one level shows 1/100th of the chromosome. There are 20 genes in the region, and all 20 are labeled (displayed) in this view (Fig 1.3). The region of ADAM2 is highlighted in red on all maps. On the basis of the Genes_seq map, ADAM2 is located between ADAM18 and LOC206849.

University of California, Santa Cruz Genome Browser

The home page for the UCSC Genome Browser is http://genome.ucsc.edu/. At present, UCSC provides browsers not only for the most recent version of the mouse and human genome data, but also for several earlier assemblies. To use the Genome Browser, select the appropriate organism from the pull-down menu at the top of the blue sidebar (Human, in this case) and then click the link labeled Browser. On the resulting page, select the version of the human assembly to view. The genome browser from August 2001 is based on an assembly of the human genome done by UCSC using sequence data available on that date. The Dec 2001


browser displays annotations based on NCBI’s build 28 of the human genome, and the Apr 2002 browser displays annotations on NCBI’s build 29. As the annotations presented in this most recent human assembly are not yet as comprehensive as those from the December 2001 assembly, the examples in this text are based on the earlier assembly. Select Dec 2001 from the pull-down menu to access the assembly from that date (Fig 1.4).

Supported types of queries are listed below the text input boxes. Enter ‘ADAM2’ in the box labeled position and then click Submit. The results of this search are presented in two categories, Known Genes and mRNA Associated Search Results (Fig 1.5). The section marked Known Genes shows the mapping of the NCBI Reference mRNA sequences to the genome. The mRNA Associated Search Results represent the mapping of other GenBank mRNA sequences to the genome. Click on the Known Genes link for ADAM2 (arrow, Fig 1.5) to see the genomic context of the ADAM2 mRNA Reference Sequence (NM_001464).

The resulting zoomed-in view shows a region of chromosome 8 from base pair 36234934 to 36280132, located within 8p12 (Fig 1.6). The blue track entitled Known Genes (from RefSeq) shows the intron–exon structure of known genes. The vertical boxes indicate exons and the horizontal lines introns. The ADAM2 gene seems to have 14 exons. The direction of transcription is indicated by the arrowheads on the introns. The tracks labeled Acembly Gene Predictions, Ensembl Gene Predictions and Fgenesh++ Gene Predictions are the results of gene predictions (see Question 7). Alignments of other database nucleotide sequences are shown in the Human mRNAs from GenBank, spliced EST, UniGene and Nonhuman mRNAs from GenBank tracks. Translated alignments of mouse and Tetraodon genomic sequence are in the mouse and fish BLAT tracks. Tracks displaying single-nucleotide polymorphisms (SNPs), repetitive elements and microarray data are shown at the bottom. Additional details about each track are available by selecting the track name in the Track Controls at the bottom.

To view the genomic context of ADAM2, zoom out 10× by clicking on the zoom out 10× box in the upper right corner. ADAM2 is located between TEM5 and ADAM18 (Fig 1.7).
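The effect of a 10× zoom on the displayed window can be worked through with the chromosome 8 coordinates given above. The Python sketch below is an illustration under stated assumptions: the `chr8:start-end` position string, the function names, and the choice to widen the window symmetrically around its midpoint are all assumptions here, not a description of the browser's internals, and the window is not clamped at the chromosome end.

```python
def parse_position(position):
    """Split a string such as 'chr8:36234934-36280132' into (chromosome, start, end)."""
    chromosome, span = position.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chromosome, start, end


def zoom_out(position, factor=10):
    """Widen a window by `factor` around its midpoint (assumed behavior)."""
    chromosome, start, end = parse_position(position)
    width = (end - start + 1) * factor
    midpoint = (start + end) // 2
    new_start = max(1, midpoint - width // 2)
    new_end = new_start + width - 1
    return f"{chromosome}:{new_start}-{new_end}"
```

The original window spans 45,199 bases, so a 10× zoom under these assumptions displays 451,990 bases centered on the same region.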

Ensembl

The Ensembl7 project, http://www.ensembl.org/, provides genome browsers for four species: human, mouse, zebrafish and mosquito. Click on Human to view the main entry point for the human genome. The current version of human Ensembl is version 6.28.1, based on the NCBI’s 28th build of the genome. To perform a text search, enter ‘ADAM2’ in the text box, and limit the search by selecting Gene from the pull-down search. Click on the upper button labeled Lookup. A single result is returned with a link to the ADAM2 gene (Fig 1.8).

Click on either of the ADAM2 links to retrieve the GeneView window. The returned page contains four sections of data. The first section (Fig 1.9) is an overview of ADAM2, including links to accession numbers and protein domains and families. Links to the Ensembl view of highly similar mouse sequences are presented in the Homology Matches section. Some of these fields will be described in more detail in later examples. The second section of the GeneView window provides information on the gene transcript (Fig 1.10). The sequence of the cDNA is shown, as is a graphic of its intron–exon structure. A limited amount of the genomic context around the gene is shown schematically as well. Exon sequences are shown in the third section of the GeneView (Fig 1.11) and splice sites in the fourth (Fig 1.12). If more than one transcript is predicted for the gene, each is allocated its own transcript, exon and splice-site sections.

The complete genomic context of ADAM2 is viewed by returning to the first section of the GeneView (Fig 1.9) and clicking on one of the two links within the Genomic Location box. The top portion of the resulting ContigView (Fig 1.13) depicts the chromosome, with the region of interest outlined in red. The Overview shows the genomic context of the gene, including the chromosome bands, contigs, markers and genes that map to near 8p12. Clicking on any of these items recenters the display around that item. The section of interest is boxed in red on the DNA(contigs) map. The genes annotated by Ensembl as being around ADAM2 are Q96KB2 and ADAM18.

The bottom panel of the ContigView, the Detailed View (Fig 1.14), shows a zoomed-in view of the boxed region, highlighting all features that have been mapped to this region of the human genome. The navigator buttons between the Overview and the Detailed View move the display to the left and right and zoom in and out. The features to be displayed can be changed by selecting the Features pull-down menu and then checking which features to view.

The Features shown in Fig 1.14 are the defaults. The DNA(contigs) map separates items on the forward strand (above) from those on the reverse (below). The only feature on the reverse strand in this view is a single Genscan transcript, predicted by the GENSCAN gene prediction program11 (see Question 7). The forward strand shows five types of features. Starting at the bottom, the ADAM2 transcript is shown in red, indicating that it is a known transcript corresponding to a near-full-length cDNA sequence, protein sequence or both already available in the public sequence database. Black transcripts are predicted based on EST or protein sequence similarity. EST Transcr links to individual aligning ESTs, whereas the UniGene track near the top displays UniGene clusters. The Genscan model on the forward strand contains many exons found in the known transcript. The Proteins and Human proteins boxes indicate protein sequences that align to this version of the genome, whereas NCBI Transcr. links to the NCBI Map Viewer. Positioning the computer mouse over any feature brings up the feature’s name and links to more detailed information.

The NCBI, UCSC and Ensembl sometimes use different symbols for the same genes, so it can be difficult to compare the views obtained by the different browsers. Furthermore, the three sites maintain independent annotation pipelines and do not all attempt to align the same mRNA sequences to the genome. The NCBI is currently displaying build 29, Ensembl shows build 28, and UCSC offers both builds 28 (December 2001) and 29 (April 2002), although all examples from UCSC in this guide will be illustrated using the better-annotated build 28. Because of the differences between the two assemblies, there are subtle discrepancies between what is shown at the NCBI and what is available at UCSC and Ensembl. However, it is fairly easy to navigate among the three sites. The NCBI, for example, links to Ensembl and UCSC through the black boxes at the top of LocusLink entries for human genes, and Ensembl directs users to NCBI and UCSC through the “Jump to” link in its ContigView. Some versions of UCSC’s Genome Browser have links to Ensembl and NCBI’s Map Viewer in the blue bar at the top of each browser page.


Question 2

How can sequence-tagged sites within a DNA sequence be identified?

doi:10.1038/ng967

The NCBI’s electronic PCR (e-PCR) tool12, which is part of the UniSTS resource, can be used to find STS markers within a DNA fragment of interest. UniSTS (http://www.ncbi.nih.gov/genome/sts/) contains all the available data on STS markers, including primer sequences, product size, mapping information and alternative names. Links to other NCBI resources such as Entrez, LocusLink and the MapViewer are also provided. e-PCR looks for potential STSs in a DNA sequence by searching for subsequences with the correct orientation and distance that could represent the PCR primers used to generate known STSs.
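The idea behind this search can be sketched in a few lines of Python. The code below is a simplified illustration, not the NCBI e-PCR program: it uses exact string matching only, the function names and the `margin` parameter are assumptions made for the example, and the primers in the usage note are invented.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def find_sts(sequence, forward_primer, reverse_primer, expected_size, margin=50):
    """Return 1-based (start, end) positions of candidate STS hits.

    A hit needs the forward primer, followed downstream by the reverse
    complement of the reverse primer, with a product length within
    `margin` bases of `expected_size`. No mismatches are tolerated.
    """
    sequence = sequence.upper()
    forward = forward_primer.upper()
    reverse_site = reverse_complement(reverse_primer)
    hits = []
    start = sequence.find(forward)
    while start != -1:
        site = sequence.find(reverse_site, start + len(forward))
        while site != -1:
            product_size = site + len(reverse_site) - start
            if abs(product_size - expected_size) <= margin:
                hits.append((start + 1, site + len(reverse_site)))
            site = sequence.find(reverse_site, site + 1)
        start = sequence.find(forward, start + 1)
    return hits
```

Given a toy sequence containing the invented forward primer ACGTAC and, 20 bases further on, the binding site for the invented reverse primer GGAATC, the function reports a single 32-base product. Requiring both the orientation (the reverse primer is matched as its reverse complement) and the distance (product size within a margin) is what distinguishes a plausible STS from a chance primer match.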

The e-PCR home page can be found by going to the NCBI home page, at http://www.ncbi.nlm.nih.gov, and then following the Electronic PCR link in the right-hand column. On the e-PCR home page, paste the sequence of interest or enter an accession number into the large text box at the top of the page. The accession number of the sequence for this example is AF288398. This sequence contains only one STS, stSG47693, which is located between nucleotides (nt) 2102 and 2232 of the sequence under study (Fig 2.1).

Click on the marker name to bring up details of the STS from UniSTS (Fig 2.2). The primer information and PCR product size are listed at the top of the page, along with alternative names for the marker. Often STSs are known by different names on different maps. Cross-references to LocusLink, UniGene and the Genebridge 4 map to which this STS was mapped are shown next. The mapping information section contains links to the NCBI’s MapViewer. At the bottom of the page, the Electronic PCR results show other sequences, including contigs, mRNAs and ESTs that may contain this STS marker.

To see the genomic context of the STS marker in all maps to which it has been mapped, click on the link labeled MapViewer at the top of the Mapping Information section. This map view (Fig 2.3) shows two maps. Note that, in this view, the STS stSG47693 is called RH92759 (highlighted in pink). Gene Map ’99–Genebridge 4 (GM99_GB4, left) has 46,000 STS markers mapped onto the GB4 RH panel by the International Radiation Hybrid Consortium. The STS map (right) shows the NCBI’s placement of STSs onto the genome sequence assembly using e-PCR. Gray lines connect markers that appear in both maps, whereas the red line denotes where the STS RH92759 appears on both maps. In the region shown, there are a total of 211 STSs on the STS map, but only 20 are labeled in this view. To the right of the STS map, the green and yellow circles show the maps on which the STS markers have been placed. One can zoom in or out of this view by clicking on the lines of the zoom tool in the left sidebar.


Question 3

During a positional cloning project aimed at finding a human disease gene, linkage data have been obtained suggesting that the gene of interest lies between two sequence-tagged site markers. How can all the known and predicted candidate genes in this interval be identified? What BAC clones cover that particular region?

doi:10.1038/ng968

UCSC

One possible starting point for this search is the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar at the side of the page, and then click Browser. On the Human Genome Browser Gateway page, change the assembly pull-down to Dec 2001. To view a region of the genome between two query terms, enter the terms in the search box, separated by a semicolon. For example, to view the region between STS markers D10S1676 and D10S1675, enter ‘D10S1676;D10S1675’ in the box marked position and press Submit. Because both of these markers map to a single position in the genome, the genome browser for the region between those markers is returned (Fig 3.1).
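When many marker pairs need to be checked, the query strings can be assembled programmatically before being pasted into the position box. The Python sketch below only builds the text of the query; the function name and its `separator` parameter are hypothetical, and nothing here contacts the browser.

```python
def marker_query(first_marker, second_marker, separator=";"):
    """Join two marker names into a single query string for a search box."""
    terms = [first_marker.strip(), second_marker.strip()]
    if not all(terms):
        raise ValueError("both marker names are required")
    if any(separator.strip() in term for term in terms if separator.strip()):
        raise ValueError("marker names must not contain the separator")
    return separator.join(terms)
```

`marker_query("D10S1676", "D10S1675")` gives the semicolon-separated form used here, and passing `separator=" OR "` gives the form described in the NCBI section of this question.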

The STS Markers track displays genetically mapped markers in blue and radiation hybrid–mapped markers in black. Click on the STS Markers label to expand that track and see each marker listed individually (Fig 3.2). The markers of interest are called by their alternate names (AFMA232YH9 and AFMA230VA9 in this view) and are at the top and bottom of the interval, respectively (Fig 3.2, arrows).

The full list of known genes in this display is shown in the Known Genes track (Fig 3.1). These protein-coding genes are taken from the RefSeq mRNA sequences compiled at the NCBI10 and aligned to the genome assembly using the BLAT program8. To export a list of the genes, or other features, in this region, click the Tables link in the top blue bar. For more information about a particular gene (such as MGMT), click on the gene symbol to get a list of additional links to resources such as Online Mendelian Inheritance in Man (OMIM), PubMed, GeneCards and Mouse Genome Informatics (MGI; Fig 3.3). Many tracks, including Acembly Genes, Ensembl Genes and Fgenesh++ Genes, indicate predicted genes (see Question 7). To view the full set of features in any of these categories, click on the title of that track on the left side of the screen in Fig 3.1. To view brief descriptions of these tracks, as well as others not mentioned, click on the gray box to the left of the track or scroll down to Track Controls and click on the title of a feature of interest. Explanations of the gene-prediction programs can be found in Question 7. Reset the browser to its default settings by clicking on the reset all button below the tracks.

To see the BAC clones used for sequencing, return to the page illustrated in Fig 3.1 and click on Coverage at the left side of the screen to expand that track. Here BAC clones are listed individually, with finished regions shown in black and draft regions shown in various shades of gray (Fig 3.4). For details such as size and sequence coverage of a specific clone, click on the clone accession number (such as AL355529.21, arrow). From this screen, click on the accession number (as shown in Fig 3.5) to link to the NCBI Entrez document summary for the clone. The full GenBank entry can be viewed by clicking on AL355529 on the Entrez document summary page.
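The clone appears above both as AL355529.21 (accession plus sequence version) and as AL355529 (accession alone). When handling lists of such identifiers, it can help to separate the two parts. The following Python sketch does this; the function name is hypothetical and the only assumption is that the version, when present, follows a single dot.

```python
def split_accession_version(identifier):
    """Split 'AL355529.21' into ('AL355529', 21); version is None if absent."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    return accession, (int(version) if dot else None)
```

Comparing only the accession part makes it possible to recognize that two records refer to the same clone even when their sequence versions differ.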

According to NCBI naming conventions, this clone is from the RP11 library and has been named 85C15. RP11 is the NCBI designation for RPCI-11, a commonly used human BAC library produced at the Roswell Park Cancer Institute. More information on the naming conventions of genomic sequencing libraries can be found at the NCBI’s Clone Registry (Fig 3.6; http://www.ncbi.nlm.nih.gov/genome/clone/nomenclature.shtml). Clone ordering information is also available, at http://www.ncbi.nlm.nih.gov/genome/clone/ordering.html.

NCBI

The NCBI MapViewer allows for direct viewing of the region between two markers, as long as both markers are on the master map. If, for example, the master map is a cytogenetic one, one can search chromosome 22 for the region between band numbers 22q12.1 and 22q13.2. If the master map is Gene_Seq, one can view the region between two mapped genes.

Access the Map Viewer home page by starting at the NCBI home page (http://www.ncbi.nlm.nih.gov) and clicking Human map viewer in the list on the right-hand side of the page. To view multiple hits on the same chromosome, type in the search terms separated by the word ‘OR’. To see the same region between the STS markers D10S1676 and D10S1675, for example, type ‘D10S1676 OR D10S1675’ in the search box, and hit Find. At the top of the resulting page (Fig 3.7), two red tick marks on the chromosome cartoon indicate that the markers map close to each other on chromosome 10. The search results at the bottom of the page show the alternative names for the two markers (AFMA232YH9 and AFMA230VA9) as well as the maps on which they have been placed. To view both markers at the same time, click on the link for chromosome 10 in the chromosome diagram. Fig 3.8 shows the region around D10S1676 and D10S1675, with the original queries highlighted in pink. Red lines connect the positions of the marker on the different maps.

The Maps & Options link, in the horizontal blue bar near the top of the page, allows the user to customize the maps and region displayed. To view, for example, the known and predicted genes

One can also search for a region between two STS markers using the MapView at Ensembl. Start at the Ensembl Human Genome Browser at http://www.ensembl.org/Homo_sapiens/, click on the idiogram of any chromosome to access the MapView, and enter the marker names in the Jump to ContigView section. To use Ensembl to obtain a list of genes (or other annotations) in a defined chromosomal region, click on Export Gene List from any ContigView window (Fig 1.14, center yellow bar).


in this region, as well as the BAC clones from which the sequence was derived, click on the link to open the Maps & Options window (Fig 3.9). First remove all the maps except Gene and STS from the Maps Displayed box by highlighting them, and selecting <<REMOVE. Next, add the Transcript (RNA), GenomeScan, Component and Contig maps by selecting them from the Available Maps box and selecting ADD>>. Make the STS map the master by highlighting it, then selecting Make Master/Move to Bottom. To limit the view such that only the STSs between D10S1676 and D10S1675 are shown, type the marker names in the Region Shown boxes. Hit Apply to see the aligned maps. In some cases, it may be useful to select a page size larger than the default of 20 to view more data in the browser window.

Fig 3.10 shows the maps, as specified in the Maps & Options window. The green dots to the right of the STS map show all the maps on which the markers appear. This is a fairly long region of chromosome 10, and not every STS marker is shown. In particular, although there are 611 STSs in this region, only 20 are shown by name in this view. For each known gene, the Genes_Seq map shows all the exons that have been mapped to the genome. Exons for individual known mRNAs are shown on the RNA (Transcript) map. Unless a gene is alternatively spliced, the Genes_Seq and RNA maps will be the same. The GScan (GenomeScan) map shows the NCBI’s gene predictions. Any of these genes, known or predicted, are candidates for the disease gene.

The NCBI’s assembled contigs, also known as the NT contigs, are found in the Contig map. Blue segments come from finished sequence, orange from draft. These contigs are constructed from the individual GenBank sequence entries shown in the Comp (Component) map. Draft HTG records (phase 1 and 2; see http://www.ncbi.nlm.nih.gov/HTGS/) are displayed in orange and finished HTGs in blue. Most of these GenBank entries are derived from BAC clones. The tiling paths of the BAC clones that were assembled into contigs are clearly visible. One can obtain more details about an entry, including the clone name, by clicking on the accession number to link to Entrez. The clone name is visible directly in the MapViewer if the Comp map is the master. A map can be quickly made the master map by clicking on the blue arrow next to its name.

Because this is a zoomed-out view of the chromosome, individual genes and GenBank entries are difficult to visualize. Zooming in, using the controls in the blue sidebar, will provide a region in more detail. Alternatively, click on the Data As Table View in the left sidebar to retrieve all data, including those hidden in this view, as a text-based table (partially shown


Question 4

A user wishes to find all the single nucleotide polymorphisms that lie between two sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within the coding region of a gene? Where can any additional information about the function of these genes be found?

doi:10.1038/ng969

The starting point for this search would be the web site for the Database of Single Nucleotide Polymorphisms (dbSNP) at the NCBI (ref. 13), which is located at http://www.ncbi.nlm.nih.gov/SNP. There is a series of links on the page that allow the user to search using either information about the database submission itself or information regarding genes and gene loci.

For this particular search, assume that the region of interest is known and defined by two STS markers, RH70674 and G32133. Begin by scrolling to the section labeled Between Markers at the bottom of the page. Enter the STS marker names ‘RH70674’ and ‘G32133’ into the two text boxes, and click on Submit STS Markers. This will produce a display showing SNPs 1–25 out of the total of 81 within the region of interest. Go to page 3 of the display by entering ‘3’ in the Page box and clicking Display.

The resulting page (Fig. 4.1) illustrates most of the possible types of result one would find on a typical dbSNP results page. In the table, starting from the left, the first column gives the individual dbSNP cluster IDs (all starting with ‘rs’). The second column, labeled Map, shows whether a particular SNP has been mapped to a unique position in the genome (illustrated by a single green arrow, as in the first row of the example) or to multiple positions (not shown here).

The next set of columns, labeled Gene, indicates whether these SNPs are associated with particular features, such as genes, mRNAs or coding regions. The three columns (L, T and C) are either lit up or appear gray in every row. Taking each in order:

If the L (for locus) appears in blue, part or all of the marker position lies either within 2 kilobases (kb) of the 5′ end of a gene feature or within 500 bases of the 3′ end of a gene feature.

If the T (for transcript) appears in green, part or all of the marker position overlaps with a known mRNA. This does not mean, however, that the SNP marker necessarily falls within a coding region.

If the C (for coding) appears in orange, part or all of the marker position overlaps with a coding region.
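The three rules above can be read as simple interval tests. The following Python sketch is only an illustration under stated assumptions and is not dbSNP's own implementation: the `Gene` record, its field names and the coordinates in the usage note are hypothetical, the gene is assumed to lie on the forward strand (so the 5′ end is the lower coordinate), the SNP is treated as a single position, and the locus region is assumed to be the gene feature extended by the two flanks given in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

UPSTREAM_FLANK = 2000    # 2 kb before the 5' end (value from the text)
DOWNSTREAM_FLANK = 500   # 500 bases after the 3' end (value from the text)


@dataclass
class Gene:
    """Hypothetical forward-strand gene model; 1-based inclusive coordinates."""
    start: int                     # 5' end of the gene feature
    end: int                       # 3' end of the gene feature
    exons: List[Tuple[int, int]]   # exon intervals of a known mRNA
    cds: Tuple[int, int]           # first and last coding positions


def snp_flags(pos: int, gene: Gene) -> str:
    """Return a three-character string: a letter if lit, '-' if gray."""
    # L: assumed here to mean the gene plus its flanking regions.
    in_locus = gene.start - UPSTREAM_FLANK <= pos <= gene.end + DOWNSTREAM_FLANK
    # T: the position overlaps an exon of the known mRNA.
    in_transcript = any(s <= pos <= e for s, e in gene.exons)
    # C: the position is in an exon and inside the coding interval.
    in_coding = in_transcript and gene.cds[0] <= pos <= gene.cds[1]
    lit = (in_locus, in_transcript, in_coding)
    return "".join(letter if on else "-" for letter, on in zip("LTC", lit))


# Made-up gene used only to exercise the function.
example_gene = Gene(
    start=10000,
    end=20000,
    exons=[(10000, 10500), (15000, 15400), (19000, 20000)],
    cds=(10200, 19300),
)
```

With this made-up gene, `snp_flags(9000, example_gene)` gives `L--` (upstream flank), `snp_flags(10100, example_gene)` gives `LT-` (untranslated part of an exon) and `snp_flags(15200, example_gene)` gives `LTC`. The middle case mirrors the point made in the text: a lit T does not imply a lit C.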

The next column, labeled Het, indicates the average heterozygosity observed for this marker, on a scale of 0–100%. A reading of zero means that no information is available for that particular marker, whereas the pink bars show a 95% confidence interval for the marker. The Validation column indicates whether the marker has been validated (shown by a star) or is unvalidated (shown by light blue boxes). Validated markers have been verified by independent re-analysis of the sequence. All of the unvalidated markers shown in Fig. 4.1 are denoted by three blue boxes, which, according to the scale at the top of the column, means that there is a >95% success rate in validation. This figure indicates the probability that this marker is real. (The success rate is defined as 1 – false-positive rate.)

In the penultimate column, the symbol TT (not shown here) indicates that individual genotypes are available for this marker. Finally, the Linkout Avail column indicates which markers are linked to other databases; a P in this column indicates that the variation has been mapped to a known protein structure. For a complete description of all the features within this display, click on any part of the header above the columns.

Returning to the original question, one of the SNPs displayed on this page does indeed fall within a coding region, as indicated by an orange C. To obtain more information on any particular SNP, simply click on the hyperlinked SNP Cluster ID. Clicking on rs1059133, for example, produces a new page, with all available information on that SNP (Fig. 4.2). Under the header marked Submitter records for this RefSNP Cluster is a list of the individual SNPs (in this case, only one SNP) that have been clustered together to form this single reference SNP. The sequence of the SNP is shown under the next header. Under the header marked NCBI Resource Links are GenBank and NCBI RefSeq entries that are associated with this SNP. Scrolling further down on the SNP page (Fig. 4.3), the gene whose coding region this SNP falls within is indicated in the LocusLink Analysis section (ADAM2, a disintegrin and metalloproteinase domain 2). The SNP allele is G/C, a non-synonymous change leading to replacement of the Asp residue in the reference sequence by a His residue. Links are also provided to the NCBI Map Viewer, Ensembl map and UCSC genome assembly in the section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw data on this particular SNP.
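To see why a G/C allele can replace Asp with His, it helps to look at the codons involved: Asp is encoded by GAT or GAC and His by CAT or CAC, so a G-to-C change at the first codon position is enough. The Python sketch below illustrates this. It is a hedged example only: the source does not give the actual codon in ADAM2, so `GAT` is a hypothetical choice, the SNP is assumed to be read on the coding strand, and the codon table is deliberately partial.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TO_RESIDUE = {
    "GAT": "Asp", "GAC": "Asp",
    "CAT": "His", "CAC": "His",
}


def apply_snp(codon: str, offset: int, allele: str) -> str:
    """Return the codon with the base at 0-based `offset` replaced by `allele`."""
    return codon[:offset] + allele + codon[offset + 1:]


def classify(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change as synonymous or non-synonymous."""
    ref = CODON_TO_RESIDUE[ref_codon]
    alt = CODON_TO_RESIDUE[alt_codon]
    if ref == alt:
        return f"synonymous ({ref})"
    return f"non-synonymous ({ref} -> {alt})"


ref_codon = "GAT"                          # hypothetical Asp codon, not taken from ADAM2
alt_codon = apply_snp(ref_codon, 0, "C")   # G -> C at the first position gives CAT

print(classify(ref_codon, alt_codon))                  # non-synonymous (Asp -> His)
print(classify("GAT", apply_snp("GAT", 2, "C")))       # synonymous (Asp)
```

The second call shows the contrast: a change at the third position (GAT to GAC) leaves the residue unchanged, which is what a synonymous SNP in a coding region looks like.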

Answering the final part of this question requires jumping from dbSNP to LocusLink (ref. 10). To do so, click on the ADAM2 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the user to the LocusLink page for ADAM2 and provides numerous jumping-off points to the NCBI and affiliated resources through the boxed links at the top of the page. More information on these resources can be found by following the LocusLink FAQ link in the left-hand column of the page. By simply examining the LocusLink page itself, one sees that the ADAM2 protein belongs to a family of membrane-anchored proteins that have been implicated in processes as diverse as fertilization, muscle development and neurogenesis.

One often-overlooked source of information on genes and gene products is OMIM (ref. 14). This is an electronic version of the catalog of human genes and genetic disorders developed by Victor McKusick at The Johns Hopkins University. OMIM provides the user with concise textual information from the published literature on most human disorders with a genetic basis, and links back to the primary literature as appropriate. Information comprising an OMIM entry includes the gene symbol, alternate names for the disease, a description of the disease (including clinical, biochemical and cytogenetic features), details of the mode of inheritance (including mapping information) and a clinical synopsis. These entries are manually curated, ensuring that the ‘executive summary’ is up to date and accurate. Although OMIM can be searched directly, many LocusLink entries also link to the OMIM record for the gene. The OMIM entry page for the ADAM2 protein is shown in Fig. 4.4. The page is fully hyperlinked to PubMed, GenBank and other related databases.

Using the UCSC browser, users can retrieve the positions of genome annotations such as SNPs as a text file suitable for loading into a spreadsheet program. While looking at the browser for a defined chromosomal region, click on the Tables link (Fig. 1.6, upper blue bar). Similarly, to export a list of genome annotations in a defined chromosomal region at Ensembl, click on Export from any ContigView window (Fig. 1.14, center yellow bar).

Question 5

Given a fragment of mRNA sequence, how would one find where that piece of DNA mapped in the human genome? Once its position has been determined, how would one find alternatively spliced transcripts?

doi:10.1038/ng970

For the purpose of this example, the fragment of mRNA of interest is contained within GenBank accession number BG334944. First, retrieve the nucleotide sequence of this EST using the NCBI’s Entrez interface, at http://www.ncbi.nlm.nih.gov/Entrez/. Type ‘BG334944’ into the text box at the top of the page, change the pull-down menu to Nucleotide and press Go. The resulting page shows one entry, corresponding to accession number BG334944. To retrieve this sequence in FASTA format (a common format for bioinformatics programs), change the pull-down menu on this page to FASTA and then press Text (Fig. 5.1). A new web page containing only the sequence, in FASTA format, is produced (Fig. 5.2); copy the resulting sequence.
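For readers unfamiliar with FASTA, the format is simple: a header line beginning with `>` followed by one or more lines of sequence. The short Python sketch below shows one way to read such text. It is an illustration only; the function name `parse_fasta` is an assumption of this example, and the record in it is a made-up placeholder, not the actual BG334944 sequence.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Split FASTA-formatted text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record: the header and bases are invented for illustration.
example = ">example_est hypothetical placeholder\nACGTACGT\nTTGCA\n"
records = parse_fasta(example)
print(records)
```

Because sequence lines are joined, the wrapped sequence comes back as the single string `ACGTACGTTTGCA`, which is the form most tools expect when the sequence is pasted into a search box.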

To determine where this sequence maps within the genome, use UCSC’s BLAT tool (ref. 8). Begin this search by pointing your web browser to the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar on the side of the page, and then click Blat. Paste the FASTA-formatted sequence obtained from Entrez (above) into the large text box on the BLAT search page (Fig. 5.3), change the Freeze pull-down menu to Dec. 2001, change the Query pull-down menu to DNA and then press Submit. The server will (very quickly) return the search results; in this case, a single match of length 636 is found on the forward strand of chromosome 9 (Fig. 5.4).

To obtain more details on this hit, click the details link, to the left of the entry. A long web page is returned, with three major sections: the mRNA sequence (Fig. 5.5, top), the genomic sequence (Fig. 5.5, middle) and an alignment of the mRNA sequence against the genomic sequence (see Fig. 5.9 for an example). In the alignment in Fig. 5.5, matching bases in the cDNA and genomic sequences are colored in darker blue and capitalized. Gaps are indicated in lower-case black type. Light blue upper-case bases mark the boundaries of aligned regions on either side of a gap and are often splice sites.

Returning to the BLAT summary page for this search (Fig. 5.4), click on browser. This will produce a graphic representation of where this particular mRNA sequence aligns to the genome (Fig. 5.6). The track labeled Chromosome Band indicates that the mRNA maps to 9q34.11. The query sequence itself is represented on the line labeled Your Sequence from BLAT Search (arrow, Fig. 5.6). The sequence is shown as being discontinuous: regions of similarity are shown as vertical lines, gaps are shown as thin horizontal lines, and the direction of the alignment is indicated by the arrowheads. The aligned regions of the EST query correspond to the exons of a known gene, shown on the line immediately below (Known Genes, here RAB9P40). Typing the EST name, BG334944, directly into a UCSC search box would have generated a similar result to that shown in Fig. 5.6, but part of the purpose of this example is to illustrate the use of BLAT.

Approximately halfway down the graphic is a track labeled Human ESTs That Have Been Spliced. This track is at first shown in dense mode, with all the ESTs condensed onto a single line. To see all of the ESTs that align with the genome in this region, potentially representing differentially spliced transcripts, click on the track’s label. This will expand this area of the figure so that each EST occupies a single line (Fig. 5.7). The ESTs are of varying length, but most contain the same exons as the known gene and are (presumably) spliced in the same way. Close inspection indicates that some of the ESTs are missing one or more exons compared with the known gene. Consider the lines marked BE798864 and W52533: the former appears to be missing the fifth exon, whereas the latter is missing the fourth, fifth and sixth exons.

Any of the ESTs can be examined in more detail by clicking on that particular line. Here, click on the line for BE798864 (arrow, Fig. 5.7) to reach the information page for this EST (Fig. 5.8). The EST is 99.8% identical to the genomic sequence; clicking anywhere on the hyperlinked line in the section marked EST/Genomic Alignments returns the actual side-by-side alignment (Fig. 5.9). Differences exist at the ends of the EST, but the sequences are identical in the region surrounding the putative missing exon.

Ensembl also displays database hits that overlap with each exon in a transcript. These hits may include proteins as well as ESTs and mRNAs, and may illustrate alternatively spliced products. The hits are shown as green boxes in the TransView (Fig. 13.5), which can be accessed in a number of ways; for example, by clicking on the View Evidence box for a transcript on the GeneView (Fig. 1.10). Another good starting point for visualizing alternatively spliced transcripts is the NCBI’s Model Maker (follow the mm link in Fig. 1.2). The Model Maker displays putative exons from mRNAs, ESTs and gene predictions that align with the genome. Users can select individual exons from these alignments and build a customized gene model. As the Model Maker displays the nucleotide sequence of the model along with its three-frame translation, the effects of adding, modifying or deleting exons can be quickly evaluated.

An alternatively spliced mRNA is more likely to be of biological significance when it changes the sequence of the encoded, wildtype protein. To determine whether EST BE798864 could encode a protein different from that of the known gene (RAB9P40), one can simply compare the two sequences directly against each other using the NCBI’s BLAST 2 Sequences tool. First, open a new web browser window, because information from the above search will be needed here; this will prevent having to use the browser’s Back and Forward keys excessively and is a good general rule when using multiple web tools. Then access the BLAST home page, at http://www.ncbi.nlm.nih.gov/BLAST. Select BLAST 2 Sequences, under the header labeled Pairwise BLAST. On this page, the user can simply enter accession numbers rather than cutting and pasting sequences into the text boxes. For the EST, simply enter its accession number (BE798864) into the box marked Enter accession or GI for Sequence 1. Obtaining the accession number of RAB9P40 requires going back to the graphic shown in Fig. 5.6 and clicking on the gene’s track. Once this has been done, input the gene’s accession number (NM_005833) into the box marked Enter accession or GI for Sequence 2. Make sure that the Program pull-down is set to blastn (to compare a nucleotide sequence against another nucleotide sequence, hence the n in blastn) and click the Align button at the bottom of the page to generate the alignment (Fig. 5.10). The sequence corresponding to sequence 1 (the EST) is denoted as the query, whereas the sequence corresponding to sequence 2 (the known gene) is denoted as the subject. The known gene’s protein translation is also shown, starting at the end of the third row of the alignment. Examination of the alignment shows that the EST is missing 153 nt (nt 360–512 of the mRNA), which corresponds to the fifth exon that is missing in BE798864. This gap is in frame, so the EST could encode a homologous yet shorter protein.
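The statement that the gap is in frame follows from simple arithmetic on the coordinates given above: a gap whose length is a multiple of three removes whole codons without shifting the reading frame downstream. The Python sketch below works through it. The function names are assumptions of this example, and the sketch assumes the whole gap lies within the coding region, which the source does not state explicitly.

```python
def gap_length(first_nt: int, last_nt: int) -> int:
    """Length of a gap given 1-based inclusive mRNA coordinates."""
    return last_nt - first_nt + 1


def is_in_frame(length: int) -> bool:
    """A gap preserves the reading frame when its length is a multiple of 3."""
    return length % 3 == 0


length = gap_length(360, 512)   # coordinates quoted in the text
print(length)                   # 153
print(is_in_frame(length))      # True
print(length // 3)              # 51 codons' worth of sequence
```

So the missing exon accounts for 153 nt, or 51 codons' worth of sequence. Note that this checks only the frame: depending on where the exon boundaries fall within codons, the residue at the new junction may still differ from the wildtype protein.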

Because of the nature of EST sequencing, ESTs often contain sequencing errors at a rate much higher than those of the finished or even draft genomic sequence. It is certainly encouraging that EST BE798864 aligns well with the genomic sequence and that its encoded protein could be in the same frame as that produced from the known gene. In addition, it appears from the UCSC graphic (Fig. 5.7) that other ESTs in this region, such as BE779110, are also missing the fifth exon of RAB9P40. All these predictions must, however, be tested computationally by looking at the quality of the EST–genomic alignment as shown above. Final proof of alternative splicing can, of course, only be generated at the laboratory bench.

Date posted: 10/04/2014, 10:58