1. Trang chủ
  2. » Luận Văn - Báo Cáo

Báo cáo y học: "Sequencing and analysis of an Irish human genome" ppt

14 483 0

Đang tải... (xem toàn văn)

Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang 14
Dung lượng 854,77 KB

Các công cụ chuyển đổi và chỉnh sửa cho tài liệu này

Nội dung

R E S E A R C H Open AccessSequencing and analysis of an Irish human genome Pin Tong1†, James GD Prendergast2†, Amanda J Lohan1, Susan M Farrington2,3, Simon Cronin4, Nial Friel5, Dan G

Trang 1

R E S E A R C H Open Access

Sequencing and analysis of an Irish human

genome

Pin Tong1†, James GD Prendergast2†, Amanda J Lohan1, Susan M Farrington2,3, Simon Cronin4, Nial Friel5,

Dan G Bradley6, Orla Hardiman7, Alex Evans8, James F Wilson9, Brendan Loftus1*

Abstract

Background: Recent studies generating complete human sequences from Asian, African and European subgroups have revealed population-specific variation and disease susceptibility loci Here, choosing a DNA sample from a population of interest due to its relative geographical isolation and genetic impact on further populations, we extend the above studies through the generation of 11-fold coverage of the first Irish human genome sequence Results: Using sequence data from a branch of the European ancestral tree as yet unsequenced, we identify variants that may be specific to this population Through comparisons with HapMap and previous genetic

association studies, we identified novel disease-associated variants, including a novel nonsense variant putatively associated with inflammatory bowel disease We describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information This analysis has implications for future re-sequencing studies and validates the imputation of Irish haplotypes using data from the current Human Genome Diversity Cell Line Panel (HGDP-CEPH) Finally, we identify gene duplication events as constituting significant targets of recent positive selection in the human lineage

Conclusions: Our findings show that there remains utility in generating whole genome sequences to illustrate both general principles and reveal specific instances of human biology With increasing access to low cost

sequencing we would predict that even armed with the resources of a small research group a number of similar initiatives geared towards answering specific biological questions will emerge

Background

Publication of the first human genome sequence

her-alded a landmark in human biology [1] By mapping out

the entire genetic blueprint of a human, and as the

cul-mination of a decade long effort by a variety of centers

and laboratories from around the world, it represented a

significant technical as well as scientific achievement

However, prior the publication, much researcher interest

had shifted towards a ‘post-genome’ era in which the

focus would move from the sequencing of genomes to

interpreting the primary findings The genome sequence

has indeed prompted a variety of large scale

post-gen-ome efforts, including the encyclopedia of DNA

ele-ments (ENCODE) project [2], which has pointed

towards increased complexity at the levels of the

genome and transcriptome Analysis of this complexity

is increasingly being facilitated by a proliferation of sequence-based methods that will allow high resolution measurements of both and the activities of proteins that either transiently or permanently associate with them [3,4]

However, the advent of second and third generation sequencing technologies means that the landmark of sequencing an entire human genome for $1,000 is within reach, and indeed may soon be surpassed [5] The two versions of the human genome published in

2001, while both seminal achievements, were mosaic renderings of a number of individual genomes Never-theless, it has been clear for some time that sequen-cing additional representative genomes would be needed for a more complete understanding of genomic variation and its relationship to human biology The structure and sequence of the genome across human populations is highly variable, and generation of entire

* Correspondence: brendan.loftus@ucd.ie

† Contributed equally

1 Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland

Full list of author information is available at the end of the article

© 2010 Tong et al.; licensee BioMed Central Ltd This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in

Trang 2

genome sequences from a number of individuals from

a variety of geographical backgrounds will be required

for a comprehensive assessment of genetic variation

SNPs as well as insertions/deletions (indels) and copy

number variants all contribute to the extensive

pheno-typic diversity among humans and have been shown to

associate with disease susceptibility [6] Consequently,

several recent studies have undertaken to generate

whole genome sequences from a variety of normal and

patient populations [7] Similarly, whole genome

sequences have recently been generated from diverse

human populations, and studies of genetic diversity at

the population level have unveiled some interesting

findings [8] These data look to be dramatically

extended with releases of data from the 1000 Genomes

project [9] The 1000 Genomes project aims to achieve

a nearly complete catalog of common human genetic

variants (minor allele frequencies > 1%) by generating

high-quality sequence data for > 85% of the genome

for 10 sets of 100 individuals, chosen to represent

broad geographic regions from across the globe

Repre-sentation of Europe will come from European

Ameri-can samples from Utah and Italian, Spanish, British

and Finnish samples

In a recent paper entitled ‘Genes mirror geography

within Europe’ [10], the authors suggest that a

geogra-phical map of Europe naturally arises as a

two-dimen-sional summary of genetic variation within Europe and

state that when mapping disease phenotypes spurious

associations can arise if genetic structure is not

prop-erly accounted for In this regard Ireland represents an

interesting case due to its position, both geographically

and genetically, at the western periphery of Europe Its

population has also made disproportionate ancestral

contributions to other regions, particularly North

America and Australia Ireland also displays a maximal

or near maximal frequency of alleles that cause or

pre-dispose to a number of important diseases, including

cystic fibrosis, hemochromatosis and phenylketonuria

[11] This unique genetic heritage has long been of

interest to biomedical researchers and this, in

conjunc-tion with the absence of an Irish representative in the

1000 Genomes project, prompted the current study to

generate a whole genome sequence from an Irish

indi-vidual The resulting sequence should contain rare

structural and sequence variants potentially specific to

the Irish population or underlying the missing

herit-ability of chronic diseases not accounted for by the

common susceptibility markers discovered to date [12]

In conjunction with the small but increasing number

of other complete human genome sequences, we

hoped to address a number of other broader questions,

such as identifying key targets of recent positive

selec-tion in the human lineage

Results and discussion Data generated

The genomic DNA used in this study was obtained from

a healthy, anonymous male of self-reported Irish Cauca-sian ethnicity of at least three generations, who has been genotyped and included in previous association and population structure studies [13-15] These studies have shown this individual to be a suitable genetic represen-tative of the Irish population (Additional file 1)

Four single-end and five paired-end DNA libraries were generated and sequenced using a GAII Illumina Genome Analyzer The read lengths of the single-end libraries were 36, 42, 45 and 100 bp and those of the paired end were 36, 40, 76, and 80 bp, with the span sizes of the paired-end libraries ranging from 300 to 550

bp (± 35 bp) In total, 32.9 gigabases of sequence were generated (Table 1) Ninety-one percent of the reads mapped to a unique position in the reference genome (build 36.1) and in total 99.3% of the bases in the refer-ence genome were covered by at least one read, result-ing in an average 10.6-fold coverage of the genome

SNP discovery and novel disease-associated variants SNP discovery

Comparison with the reference genome identified 3,125,825 SNPs in the Irish individual, of which 87% were found to match variants in dbSNP130 (2,486,906

as validated and 240,791 as non-validated; Figure 1) The proportion of observed homozygotes and heterozy-gotes was 42.1% and 57.9%, respectively, matching that observed in previous studies [16] Of those SNPs identi-fied in coding regions of genes, 9,781 were synonymous, 10,201 were non-synonymous and 107 were nonsense

Of the remainder, 24,238 were located in untranslated regions, 1,083,616 were intronic and the remaining 1,979,180 were intergenic (Table 2) In order to validate our SNP calling approach (see Materials and methods)

we compared genotype calls from the sequencing data

to those obtained using a 550 k Illumina bead array Of those SNPs successfully genotyped on the array, 98% were in agreement with those derived from the sequen-cing data with a false positive rate estimated at 0.9%, validating the quality and reproducibility of the SNPs called

Disease-associated variants

Various disease-associated SNPs were detected in the sequence, but they are likely to be of restricted wide-spread value in themselves However, a large proportion

of SNPs in the Human Gene Mutation Database (HGMD) [17], genome-wide association studies (GWAS) [18] and the Online Mendelian Inheritance in Man (OMIM) database [19] are risk markers, not directly causative of the associated disease but rather in linkage disequilibrium (LD) with generally unknown

Trang 3

SNPs that are Therefore, in order to interrogate our

newly identified SNPs for potential causative risk factors,

we looked for those that appeared to be in LD with

already known associated (rather than

disease-causing) variants We identified 23,176 novel SNPs in

close proximity (< 250 kb) to a known HGMD or gen-ome-wide association study disease-associated SNP and where both were flanked by at least one pair of HapMap [20] CEU markers known to be in high LD As the annotation of the precise risk allele and strand of SNPs

in these databases is often incomplete, we focused on those positions, heterozygous in our individual, that are associated with a disease or syndrome Of the 7,682 of these novel SNPs that were in putative LD of a HGMD

or genome-wide association study disease-associated SNP heterozygous in our individual, 31 were non-synon-ymous, 14 were at splice sites (1 annotated as essential) and 1 led to the creation of a stop codon (Table S1 in Additional file 2)

This nonsense SNP is located in the macrophage-sti-mulating immune gene MST1, 280 bp 5′ of a non-synonymous coding variant marker (rs3197999) that has been shown in several cohorts to be strongly associated with inflammatory bowel disease and primary sclerosing cholangitis [21-23] Our individual was heterozygous at both positions (confirmed via resequencing; Additional files 3 and 4) and over 30 pairs of HapMap markers in high LD flank the two SNPs The role of MST1 in the immune system makes it a strong candidate for being the gene in this region conferring inflammatory bowel disease risk, and it had previously been proposed that rs3197999 could itself be causative due to its potential impact on the interaction between the MST1 protein product and its receptor [22]

Importantly, the newly identified SNP 5′ of rs3197999′

s position in the gene implies that the entire region 3′

Table 1 Read information

Data type Library number Number of reads Number of mapped reads Total bases (Gb) Mapped base (Gb) Effective depth

Figure 1 Comparison of detected SNPs and indels to

dbSNP130 The dbSNP alleles were separated into validated and

non-validated, and the detected variations that were not present in

dbSNP were classified as novel.

Table 2 Types of SNPs found

Consequence Number of SNPs % of SNPs Essential_splice_site 135 0.0043

Non_synonymous_coding 10,201 0.3263

Synonymous_coding 9,781 0.3129 Within_mature_mirna 30 0.0010 Within_non_coding_gene 16,512 0.5282

Trang 4

of this novel SNP would be lost from the protein,

including the amino acid affected by rs3197999 (Figure

2) Therefore, although further investigation is required,

there remains a possibility that this previously

unidenti-fied nonsense SNP is either conferring disease risk to

inflammatory bowel disease marked by rs3197999, or if

rs3197999 itself confers disease as previously

hypothe-sized [22], this novel SNP is conferring novel risk via

the truncation of the key region of the MST1 protein

Using the SIFT program [24], we investigated whether

those novel non-synonymous SNPs in putative LD with

risk markers were enriched with SNPs predicted to be

deleterious (that is, that affect fitness), and we indeed

found an enrichment of deleterious SNPs as one would

expect if an elevated number were conferring risk to the relevant disease Of all 7,993 non-synonymous allele changes identified in our individual for which SIFT pre-dictions could be successfully made, 26% were predicted

to be deleterious However, of those novel variants in putative LD with a disease SNP heterozygous in our individual, 56% (14 out of 25) were predicted to be harmful by SIFT (chi-square P = 6.8 × 10-4

, novel non-synonymous SNPs in putative LD with risk allele versus all non-synonymous SNPs identified) This suggests that this subset of previously unidentified non-synonymous SNPs in putative LD with disease markers is indeed sub-stantially enriched for alleles with deleterious consequences

Figure 2 The linkage disequilibrium structure in the immediate region of the MST1 gene Red boxes indicate SNPs in high LD rs3197999, which has previously been associated with inflammatory bowel disease, and our novel nonsense SNP are highlighted in blue.

Trang 5

Indels are useful in mapping population structure, and

measurement of their frequency will help determine

which indels will ultimately represent markers of

predo-minately Irish ancestry We identified 195,798 short

indels ranging in size from 29-bp deletions to 20-bp

insertions (see Materials and methods) Of these, 49.3%

were already present in dbSNP130 Indels in coding

regions will often have more dramatic impacts on

pro-tein translation than SNPs, and accordingly be selected

against, and unsurprisingly only a small proportion of

the total number of short indels identified were found

to map to coding sequence regions Of the 190 novel

coding sequence indels identified (Table S2 Additional

file 2), only 2 were at positions in putative LD with a

heterozygous disease-associated SNP, of which neither

led to a frameshift (one caused an amino acid deletion

and one an amino acid insertion; Table S1 in Additional

file 2)

Population genetics

The DNA sample from which the genome sequence was

derived has previously been used in an analysis of the

genetic structure of 2,099 individuals from various

Northern European countries and was shown to be

representative of the Irish samples The sample was also

demonstrated to be genetically distinct from the core

group of individuals genotyped from neighboring

Brit-ain, and the data are likely, therefore, to complement

the upcoming 1000 Genomes data derived from British

heritage samples (including CEU; Additional file 1)

Non-parametric population structure analysis [25] was

carried out to determine the positioning of our Irish

individual relative to other sequenced genomes and the

CEU HapMap dataset As can be seen in Figure 3, as

expected, the African and Asian individuals form clear

subpopulations in this analysis The European samples

form three further subpopulations in this analysis, with

the Irish individual falling between Watson and Venter

and the CEU subgroup (of which individual NA07022

has been sequenced [26]) Therefore, the Irish genome

inhabits a hitherto unsampled region in European

whole-genome variation, providing a valuable resource

for future phylogenetic and population genetic studies

Y chromosome haplotype analysis highlighted that our

individual belonged to the common Irish and British

S145+ subgroup (JFW, unpublished data) of the most

common European group R1b [27] Indeed, S145

reaches its maximum global frequency in Ireland, where

it accounts for > 60% of all chromosomes (JFW,

unpub-lished data) None of the five markers defining known

subgroups of R1b-S145 could be found in our

indivi-dual, indicating he potentially belongs to an as yet

undefined branch of the S145 group A subset of the

(> 2,141) newly discovered Y chromosome markers found in this individual is therefore likely to be useful in further defining European and Irish Y chromosome lineages

Mapping of reads to the mitochondrial DNA (mtDNA) associated with UCSC reference build 36 revealed 48 differences, which by comparison to the revised Cambridge Reference Sequence [28] and the PhyloTree website [29] revealed the subject to belong to mtDNA haplogroup J2a1a (coding region transitions including nucleotide positions 7789, 13722, 14133) The rather high number of differences is explained by the fact that the reference sequence belongs to the African haplogroup L3e2b1a (for example, differences at nucleo-tide positions 2483, 9377, 14905) Haplogroup J2a (for-merly known as J1a) is only found at a frequency of approximately 0.3% in Ireland [30] but is ten times more common in Central Europe [31]

The distribution of this group has in the past been correlated with the spread of the Linearbandkeramik farming culture in the Neolithic [31], and maximum likelihood estimates of the age of J2a1 using complete mtDNA sequences give a point estimate of 7,700 years ago [32]; in good agreement with this thesis, sampled ancient mtDNA sequences from Neolithic sites in Cen-tral Europe predominantly belong to the N1a group [33]

SNP imputation

The Irish population is of interest to biomedical researchers because of its isolated geography, ancestral impact on further populations and the high prevalence

of a number of diseases, including cystic fibrosis, hemo-chromatosis and phenyketonuria [11] Consequently,

Figure 3 Multidimensional scaling plot illustrating the Irish individual ’s relationship to the CEU HapMap individuals and other previously sequenced genomes.

Trang 6

several disease genetic association studies have been

car-ried out on Irish populations As SNPs are often

co-inherited in the form of haplotypes, such studies

gener-ally only involve genotyping subsets of known SNPs

Patterns of known co-inheritance, derived most

com-monly from the HapMap datasets, are then often used

to infer the alleles at positions not directly typed using

programs such as IMPUTE [34] or Beagle [35] In the

absence of any current or planned Irish-specific

Hap-Map population, disease association studies have relied

on the overall genetic proximity of the CEU dataset

derived from European Americans living in Utah for use

in such analyses However, both this study (Figure 3)

and previous work (Additional file 1) indicate that the

Irish population is, at least to a certain extent,

geneti-cally distinct from the individuals that comprise the

CEU dataset

We were consequently interested in assessing the

accuracy of genome-wide imputation of SNP genotypes

using the previously unavailable resource of

genome-wide SNP calls from our representative Irish individual

Using a combination of IMPUTE and the individual’s

genotype data derived from the SNP array we were able

to estimate genotypes at 430,535 SNPs with an IMPUTE

threshold greater than 0.9 (not themselves typed on the

array) Within the imputed SNPs a subset of 429,617

genotypes were covered by at least one read in our

ana-lysis, and of those, 97.6% were found to match those

called from the sequencing data alone

This successful application of imputation of unknown

genotypes in our Irish individual prompted us to test

whether haplotype information could also be used to

improve SNP calling in whole genome data with low

sequence coverage Coverage in sequencing studies is

not consistent, and regions of low coverage can be

adja-cent to those regions of relatively high read depth As

SNPs are often co-inherited, it is possible that high

con-fidence SNP calls from well sequenced regions could be

combined with previously known haplotype information

to improve the calling of less well sequenced variants

nearby Consequently, we tested whether the use of

pre-viously known haplotype information could be used to

improve SNP calling At a given position where more

than one genotype is possible given the sequencing data,

we reasoned more weight should be given to those

gen-otypes matching those we would expect given the

sur-rounding SNPs and the previously known haplotype

structure of the region To test this, we assessed the

improvements in SNP calling using a Bayesian approach

to combining haplotype and sequence read information

(see Materials and methods) Other studies have also

used Bayesian methods to include external information

to improve calls in low-coverage sequencing studies

with perhaps the most widely used being SOAPsnp [36]

SOAPsnp uses allele frequencies obtained from dbSNP

as prior probabilites for genotype calling Our methods goes further, and by using known haplotype structures

we can use information from SNPs called with relatively high confidence to improve the SNP calling of nearby positions By comparing genotype calls to those observed on our SNP array we found substantial improvements can be observed at lower read depths when haplotype information is accounted for (Figure 4)

At a depth of 2.4X, approximately 95% of genotypes matched those from the bead array when haplotype information was included, corresponding to the accuracy observed at a read depth of 8X when sequence data alone are used Likewise, our method showed substantial improvements in genotype calling compared to only using previously known genotype frequency information

as priors

Given the comprehensive haplotype information likely

to emerge from other re-sequencing projects and the

1000 Genomes project, our data suggest that sequencing

at relatively low levels should provide relatively accurate genotyping data [37] Decreased costs associated with lower coverage will allow greater numbers of genomes

to be sequenced, which should be especially relevant to whole genome case-control studies searching for new disease markers

Causes of selection in the human lineage

There have been numerous recent studies, using a vari-ety of techniques and datasets, examining the causes and effects of positive selection in the human genome [38-42] Most of these have focused on gene function as

a major contributing factor, but it is likely that other factors influence rates of selection in the recent human lineage The availability of a number of completely sequenced human genomes now offers an opportunity

to investigate factors contributing to positive selection

in unprecedented detail

Using this and other available completely sequenced human genomes, we first looked for regions of the human genome that have undergone recent selective sweeps by calculating Tajima’s D in 10-kb sliding win-dows across the genome Positive values of D indicate balancing selection while negative values indicate posi-tive selection (see Materials and methods for more details) Due to the relatively small numbers of indivi-duals from each geographical area (three Africans, three Asians and five of European descent - including refer-ence) [16,26,43-48], we restricted the analysis to regions observed to be outliers in the general global human population

A previous, lower resolution analysis using 1.2 million SNPs from 24 individuals and an average window size

of 500-kb had previously identified 21 regions showing

Trang 7

evidence of having undergone recent selective sweeps in

the human lineage [41] Our data also showed evidence

of selection in close proximity to the majority of these

regions (Table 3)

Gene pathways associated with selection in the human

lineage

Examination of genes under strong positive selection

using the GOrilla program [49] identified nucleic acid

binding and chromosome organization as the Gene

Ontology (GO) terms with the strongest enrichment

among this gene set (uncorrected P = 2.31 × 10-9

and 4.45 × 10-8, respectively)

Genes with the highest Tajima’s D values, and

pre-dicted to be under balancing selection, were most

enriched with the GO term associated with the sensory

perception of chemical stimuli (uncorrectedP = 2.39 ×

10-21) These data confirm a previous association of

olfactory receptors with balancing selection in humans

using HapMap data [50] However, our analysis also

identified that a range of taste receptors were among

the top genes ranked byD value, suggesting that

balan-cing selection may be associated with a wider spectrum

of human sensory receptors than previously appreciated

The next most significantly enriched GO term, not

attributable to the enrichment in taste and olfactory

receptors, was keratinization (uncorrected P = 3.23 ×

10-5) and genes affecting hair growth have previously

been hypothesized to be under balancing selection in the recent human lineage [51]

Gene duplication and positive selection in the human genome

Although most studies examine gene pathways when investigating what underlies positive selection in the human genome, it is likely other factors, including gene duplication, also play a role It is now accepted that fol-lowing gene duplication the newly arisen paralogs are subjected to an altered selective regime where one or both of the resulting paralogs is free to evolve [52] Lar-gely due to the lack of available data, there has been lit-tle investigation of the evolution of paralogs specifically within the human lineage A recent paper has suggested that positive selection has been pervasive during verte-brate evolution and that the rates of positive selection after gene duplication in vertebrates may not in fact be different to those observed in single copy genes [53] The emergence of a number of fully sequenced gen-omes, such as the one presented in this report, allowed

us to investigate the rates of evolution of duplicated genes arising at various time points through the human ancestral timeline

As shown in Figure 5, there is clear evidence in our analysis for high levels of positive selection in recent paralogs, with paralogs arising from more recent dupli-cation events displaying substantially lower values of Figure 4 Improved SNP calling using haplotype data SNP calling performance on chromosome 20 at various read depths with and without the inclusion of haplotype or genotype frequency data.

Trang 8

Tajima’s D than the background set of all genes Indeed,

elevated levels of positive selection over background

rates are observed in paralogs that arose as long ago as

the eutherian ancestors of humans (Figure 5)

Conse-quently, while in agreement with the previous

observa-tion of no general elevaobserva-tion in the rates of evoluobserva-tion in

paralogs arising from the most ancient, vertebrate

dupli-cation events, these data clearly illustrate that more

recently duplicated genes are under high levels of

posi-tive selection

As discussed, it has been proposed that, upon gene

duplication, one of the gene copies retains the original

function and is consequently under stronger purifying

selection than the other However, it has also been

pro-posed that both genes may be under less sequence

restraint, at least in lower eukaryotes such as yeast [52]

We consequently examined the rates of positive

selec-tion in both copies of genes in each paralog pair to see

whether both, or just one, in general show elevated rates

of positive selection in the human lineage More closely

examining paralog pairs that arose from a duplication

event in Homo sapiens highlighted that even when only

those genes in each paralog pair whose value of D was

greater were examined, theirD values were still

signifi-cantly lower than the genome average (t-test P < 2.2 ×

10-16), illustrating that even those genes in each paralog

pair showing the least evidence of positive selection still show substantially higher levels of positive selection than the majority of genes These results therefore sup-port the hypothesis that both paralogs, rather than just one, undergo less selective restraint following gene duplication Consequently, a significant driver for many

of the genes undergoing positive selection in the human lineage (Table S3 in Additional file 2) appears to be this high rate of evolution following a duplication event For example, 25% of those genes with a Tajima’s D value of less than -2 have been involved in a duplication event in Homo sapiens, compared to only 1.63% of genes with D values greater than this threshold (chi-squaredP < 2.2 ×

10-16), illustrating that there is a substantial enrichment

of genes having undergone a recent duplication event among the genes showing the strongest levels of positive selection In conclusion, it appears that whether a gene has undergone a recent duplication event is likely to be

at least as important a predictor of its likelihood of being under positive selection as its function

Conclusions

The first Irish human genome sequence provides insight into the population structure of this branch of the Eur-opean lineage, which has a distinct ancestry from other published genomes At 11-fold genome coverage,

Table 3 Regions of high positive selection, in close proximity to genes, identified in the analysis of Williamson

et al [41]

Williamson et al [41]regions of high positive selection Corresponding regions of low Tajima ’s D in this analysis Chr Position (hg18) Nearest gene Position (hg18) Nearest gene Tajima ’s D

4 169386385 FLJ20035 (0) 169395001-169405000 FLJ20035/DDX60 (0 kb) -2.10

5 15527762 FBXL7 (26 kb) 15535001-15545000 FBXL7 (8.3 kb) -2.23

8 57165523 RPS20 (16 kb) 57200001-57210000 PLAG1 (26 kb) -2.06

10 45498260 ANUBL1 (10 kb) 45495001-45505000 FAM21C (0 kb) -2.27

12 81525433 DKFZp762A217 (79 kb) 81520001-81530000 DKFZp762A217 (75 kb) -2.21

18 44274281 KIAA0427 (45 kb) 44365001-44375000 KIAA0427 (0 kb) -2.28 Regions in this analysis with a Tajima’s D value of less than -2 within 100 kb of the corresponding region from Williamson et al [41] are highlighted in bold (Selection of 21 random positions in the genome 1,000 times never produced as many within close proximity to a window whose Tajima’s D was less than -2.)

Trang 9

approximately 99.3% of the reference genome was

cov-ered and more than 3 million SNPs were detected, of

which 13% were novel and may include specific markers

of Irish ancestry We provide a novel technique for SNP

calling in human genome sequence using haplotype data

and validate the imputation of Irish haplotypes using

data from the current Human Genome Diversity Panel

(HGDP-CEPH) Our analysis has implications for future

re-sequencing studies and suggests that relatively low

levels of genome coverage, such as that being used by

the 1000 Genomes project, should provide relatively

accurate genotyping data Using novel variants identified

within the study, which are in LD with already known

disease-associated SNPs, we illustrate how these novel

variants may point towards potential causative risk

fac-tors for important diseases Comparisons with other

sequenced human genomes allowed us to address

posi-tive selection in the human lineage and to examine the

relative contributions of gene function and gene

duplica-tion events Our findings point towards the possible

pri-macy of recent duplication events over gene function as

indicative of a gene’s likelihood of being under positive selection Overall, we demonstrate the utility of generat-ing targeted whole-genome sequence data in helpgenerat-ing to address general questions of human biology as well as providing data to answer more lineage-restricted questions

Materials and methods Individual sequenced

It has been recently shown that population genetic ana-lyses using dense genomic SNP coverage can be used to infer an individual’s ancestral country of origin with rea-sonable accuracy [15] The sample sequenced here was chosen from among a cohort of 211 healthy Irish con-trol subjects included in recent genome-wide association studies [13,14] with all participants being of self-reported Irish Caucasian ethnicity for at least three gen-erations Using Illumina Infinium II 550 K SNP chips, the Irish samples were assayed for 561,466 SNPs selected from the HapMap project Quality control and genotyping procedures have been detailed previously

Figure 5 Tajima ’s D values for paralogs arisen from gene duplications of different ages Mean Tajima’s D values for genes involved in duplication events of differing ages Horizontal dotted line indicates median Tajima ’s D value of all genes in human genome As can be seen, genes involved in a recent duplication event in general show lower values of D than the genome-wide average, with genes involved in a duplication event specific to Humans, as a group, showing the lowest values of D (Kruskal-Wallis P < 2.2 × 10 -16 ).

Trang 10

[15] We have previously published 300 K density

STRUCTURE [54,55] and principle components analyses

of the Irish cohort both in comparison to similar

cohorts from the UK, Netherlands, Denmark, Sweden

and Finland [15], and in separate analyses in comparison

to additional cohorts from the UK, Netherlands,

Swe-den, Belgium, France, Poland and Germany [14] The

data demonstrate a broad east-west cline of genetic

structure across Northern Europe, with a lesser

north-south component [15] Individuals from the same

popu-lations cluster together in these joint analyses Using

these data, we here selected a ‘typical’ Irish sample,

which clustered among the Irish individuals and was

independent of the British samples, for further

characterization

Genomic library preparation and sequencing

All genomic DNA libraries were generated according to

the protocol Genomic DNA Sample Prep Guide - Oligo

Only Kit (1003492 A) with the exception of the chosen

fragmentation method Genomic DNA was fragmented

in a Biorupter™ (Diagenode, Liége, Belgium) Paired-end

adapters and amplification primers were purchased from

Illumina (Illumina, San Diego, CA, USA catalogue

num-ber PE-102-1003) New England Biolabs (New England

Biolabs, Ipswich, MA, USA) was the preferred supplier

for all enzymes and buffers and Invitrogen (Invitrogen,

Carlsbad, CA, USA) for the dATP Briefly, the workflow

for library generation was as follows: fragmentation of

genomic DNA; end repair to create blunt ended

frag-ments; addition of 3′-A overhang for efficient adapter

ligation; ligation of the paired-end adapters; size

selec-tion of adapter ligated material on a 2.5% high

resolu-tion agarose (Bioline HighRes Grade Agarose - Bioline,

London, UK), catalogue number BIO-41029); a limited

12 cycle amplification of size-selected libraries; and

library quality control and quantification For each

library 5μg of DNA was diluted to 300 μl and

fragmen-ted via sonication - 30 cycles on Biorupter High setting

with a cycle of 30 s ON and 30 s OFF All other

manip-ulations were as detailed in the Illumina protocol

Quantification prior to clustering was carried out with

a Qubit™ Fluorometer (Invitrogen Q32857) and

Quant-iT™ dsDNA HS Assay Kit (Invitrogen Q32851) Libraries

were sequenced on Illumina GAII and latterly GAIIx

Analyzer following the manufacturer’s standard

cluster-ing and sequenccluster-ing protocols - for extended runs

multi-ple sequencing kits were pooled

Read mapping

NCBI build 36.1 of the human genome was downloaded

from the UCSC genome website and the bwa alignment

software [56] was used to align both the single- and

paired-end reads to this reference sequence Two

mismatches to the reference genome were allowed for each read Unmapped reads from one single-end library were trimmed and remapped due to relative poor qual-ity at the end of some reads, but none were trimmed shorter than 30 bp

SNP and indel identification

SNPs were called using samtools [57] and glfProgs [58] programs The criteria used for autosomal SNP calling were: 1, a prior heterozygosity (theta) of 0.001; 2, posi-tions of read depths lower than 4 or higher than 100 were excluded; 3, a Phred-like consensus quality cutoff

of no higher than 100

Only uniquely mapped reads were used when calling SNPs SNPs in the pseudoautosomal regions of the X and Y chromosomes were not called in this study and consequently only homozygous SNPs were called on these chromosomes The criteria used for sex chromo-some SNP calling were: 1, positions of read depths lower than 2 or higher than 100 were excluded; 2, the likelihoods of each of the four possible genotypes at each position were calculated and where any genotype likelihood exceeded 0.5 that did not match the reference

a SNP was called

The positive predictive value in our study, assessed using the 550 k array data as in other studies [48], was 99% As a result of maintaining a low false positive rate, the heterozygote undercall rate observed in this analysis was slightly higher than in other studies of similar depth

- 26% as opposed to 24% and 22% in the Watson and Venter genomes, respectively

SNP consequences were determined using the Ensembl Perl APIs and novel SNPs identified through comparisons with dbSNP130 obtained from the NCBI ftp site Further human genome SNP sets were also downloaded from their respective sources [7,16,26,43-48] The CEU dataset for the SNP imputa-tion and populaimputa-tion structure analysis were downloaded from the Impute and HapMap websites, respectively Previously identified disease variants were downloaded from OMIM (15 April 2009) and HGMD (HGMD Pro-fessional version 2009.4 (12 November 2009)) Pairs of HapMap SNPs in high LD flanking novel markers and known disease variants were identified using the Ensembl Perl APIs

Indels were called using samtools [57] Short indels had to be separated by at least 20 bp (if within 20 bp, the indel with the higher quality was kept) and for the autosomes had to have a mapping quality of greater than 20 and be covered by a read depth of greater than

4 and less than 100 For the sex chromosomes the lower threshold was set at 2 As with SNP calling, only uniquely mapped reads were used Twenty-six randomly selected coding indels were confirmed via resequencing

Ngày đăng: 09/08/2014, 22:23

TỪ KHÓA LIÊN QUAN

🧩 Sản phẩm bạn có thể quan tâm