CHAPTER 3: RESULTS
3.1 First Generation Linkage Disequilibrium and Haplotype Map of the Chromosome 6p and the Major Histocompatibility Complex
To characterize the genetic variation and patterns of linkage disequilibrium (LD) of human chromosome 6p and the MHC, 2 separate genotyping projects were initiated, and from these, 3 distinct SNP maps with increasing marker densities were constructed. SNP genotyping for all datasets was performed using the Illumina GoldenGate platform. The descriptions of each SNP map are summarized in Table 3.1.
The first map surveys the entire chromosome 6p arm at a density of approximately 1 SNP every 100kb. The second focuses a higher density of SNPs (approximately 1 SNP every 20kb) on a contiguous 7Mb stretch that contains the MHC. These 2 maps provide an overview of LD distribution across the MHC and allow comparisons to be made with the rest of the chromosome arm. The third SNP map is focused solely within the MHC at a much greater density (1 SNP every 2.6kb). HLA haplotype and homozygosity data obtained from the first 2 maps guided sample selection for the third map, allowing the detailed analysis of the common and conserved MHC haplotypes present in the Singapore Chinese population.
3.1.1 SNP Set of the First Generation Map
A set of 1152 SNPs was selected and genotyped in 198 Singaporean Chinese individuals. These individuals were randomly selected, unrelated, healthy blood donors from whom appropriate informed consent was obtained. The SNPs were chosen based on available genotype data deposited in dbSNP (build 121, Smigielski et al 2002). SNPs were initially selected to achieve a targeted density of 1 SNP per 100kb across the entire chromosome 6p arm, and a higher density of 1 SNP
Table 3.1: Summary of SNP Sets Used in this Study

SNP Map                                        Chr 6 Coordinates          SNP Density
1 Chromosome 6p SNP Map (Section 3.1.2)        175,572 - 58,750,370       1 per 95.2kb
2 MHC SNP Map (Section 3.1.3)                  30,133,482 - 37,000,199    1 per 19.9kb
3 High-Resolution MHC SNP Map (Section 3.2)    28,970,148 - 33,882,048    1 per 2.6kb

A first generation, low-density SNP map was used to analyse the LD structure of the chromosome 6p and the MHC. A high-resolution SNP map was then constructed to analyse the LD of the MHC in greater detail. The corresponding section in the results chapter of this thesis is indicated for each SNP map.
To reduce the chance of genotyping uninformative markers, priority was given to SNPs that have been shown to occur at high frequency in an East Asian population.
Of the 1152 SNP assays attempted, 1099 were successful, a marker success rate of 95.4%. Of the 198 samples genotyped, 6 failed to produce results that passed Illumina's stringent quality checks, possibly due to inadequate DNA quality or quantity, giving a sample success rate of 97.0%. The overall genotype call rate and reproducibility were both greater than 99.9%.
It has been noted that mis-genotyped and erroneously located SNPs may result in spurious linkage disequilibrium associations and the false introduction of haplotype variants that do not exist in nature (Gabriel et al 2002, Hosking et al 2004). The 1099 successfully genotyped SNPs were therefore passed through a series of quality control filters to remove possible genotyping errors. First, the flanking sequences of the probes used in the assays were re-aligned to the human genome (NCBI build 36) to confirm that the SNP locations were as planned. Eight loci could not be mapped to chromosome 6p and were eliminated at this stage. Next, to remove non-informative markers and identify possible genotyping errors, SNPs that had a minor allele frequency of less than 5% in the population, or that deviated from Hardy-Weinberg equilibrium at a 0.1% significance level, were removed. In total, 909 SNPs passed these filters and were used to construct the SNP maps.
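The minor allele frequency and Hardy-Weinberg filters can be illustrated with a minimal sketch. It assumes genotype counts per SNP are available and uses a 1-degree-of-freedom chi-square goodness-of-fit test; the text does not state which Hardy-Weinberg test was actually applied (an exact test is a common alternative), and the function names are hypothetical.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts AA, AB, BB."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return min(p, 1 - p)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test against Hardy-Weinberg proportions.

    Assumption: the thesis does not say which HWE test was used.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(n_aa, n_ab, n_bb, maf_min=0.05, hwe_alpha=0.001):
    """Apply the two thresholds quoted in the text: MAF >= 5%, HWE p >= 0.1%."""
    return (minor_allele_frequency(n_aa, n_ab, n_bb) >= maf_min
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha)
```

For example, a SNP with counts (48, 96, 48) in 192 samples sits exactly at Hardy-Weinberg proportions and passes, whereas counts of (96, 0, 96) show a complete absence of heterozygotes and fail the equilibrium test.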
Allele frequencies for the SNPs assayed in these samples were compared to those reported for various populations in dbSNP, the data on which the initial SNP selection relied. As one would expect, the frequencies from this study had a much higher coefficient of determination (R² = 0.86) with the aggregated East Asian population data than with data from non-Asian populations (R² = 0.35), thus providing a gauge of the reliability of the genotyping data (Figure 3.1). This also indicates that, in the absence of any other information, the allele frequencies of SNPs reported in dbSNP are sufficient to guide informative marker selection in genotyping studies.
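As a sketch of the comparison above, R² can be computed as the squared Pearson correlation between two lists of allele frequencies. This assumes that each SNP's frequency refers to the same allele in both datasets; the text does not state how R² was derived, and `r_squared` is a hypothetical helper.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical frequencies for four SNPs in the study and in a reference panel
study = [0.10, 0.20, 0.30, 0.40]
reference = [0.20, 0.40, 0.60, 0.80]
print(r_squared(study, reference))  # close to 1 for a perfectly linear relation
```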
3.1.2 Chromosome 6p SNP LD Map
Figure 3.1 Comparing Allele Frequencies Between Singapore Chinese and dbSNP
Allele frequencies in the local Chinese population have a high correlation to frequencies reported in aggregated East Asian populations (panel A, R² = 0.86), in contrast with non-Asian populations (panel B, R² = 0.35) found in dbSNP.
arm, a “picket-fence” approach was used to select genotyped SNPs from the denser 30.0Mb - 37.0Mb segment to achieve an approximate density of 1 SNP per 100kb, consistent with the overall SNP density. In all, 615 SNPs were used to construct a linkage disequilibrium map across the entire chromosome 6p, with the first marker at position 175,572bp and the last marker at position 58,750,370bp of the physical map. The average SNP density is 1 SNP per 95.2kb, with a median distance of 80.0kb between consecutive SNPs, ranging from 27kb to 485kb, with 414 pairs less than 100kb apart. The average minor allele frequency of the data set is 28%, with an average heterozygosity of 0.37.
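The thinning of a dense segment and the spacing statistics quoted above can be sketched as follows. The text names the "picket-fence" approach without detailing it, so the greedy rule here (keep a SNP only if it lies at least a minimum gap beyond the previously kept one) is an assumption, as are the function names and the example positions.

```python
from statistics import mean, median


def thin_snps(positions, min_gap):
    """Greedy thinning: keep a SNP only if it is at least min_gap bp
    beyond the previously kept SNP (assumed rule, not the thesis method)."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept


def spacing_summary(positions):
    """Mean, median, minimum and maximum gap between consecutive SNPs (bp)."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {"mean": mean(gaps), "median": median(gaps),
            "min": min(gaps), "max": max(gaps)}


# Hypothetical dense segment with one SNP every 20kb
dense = list(range(30_000_000, 30_500_001, 20_000))
sparse = thin_snps(dense, min_gap=100_000)
print(len(dense), len(sparse), spacing_summary(sparse))
```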
To evaluate the LD structure across the chromosome arm, the 2 most commonly used measures of LD, r² and D′, were calculated between all possible SNP pairs separated by less than 5Mb. D′ and r² are both based on the disequilibrium parameter D (Ott 1999), the difference between the observed and expected frequencies of 2-locus haplotypes, although they differ in their interpretation: D′ is strictly an indicator of the absence of recombination in the history of the studied population samples, whereas a high r² value additionally requires correlation between allele frequencies (Devlin and Risch 1995, Ardlie et al 2002). The distribution of LD between SNP pairs on this map is shown as a heatmap in Figure 3.2 and can be seen to vary greatly along the chromosome arm. Stronger LD between consecutive marker pairs is seen towards the centre of chromosome 6p, especially in the region telomeric to the MHC. The punctate nature of LD can be seen in several islands of SNPs in high pairwise LD, which stand out against the relatively uniform level of equilibrium in the rest of the chromosome arm. At an average marker density of 1 SNP every 95kb, strong LD between consecutive SNPs is not expected (Gabriel et al 2002, International HapMap Consortium 2005), and these islands of high LD are the exceptions rather than the rule.
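The relationship between D, D′ and r² can be made concrete with a short sketch based on their standard definitions. It assumes the 2-locus haplotype frequency is already known; with unphased genotype data it would first have to be estimated, a step not shown here. The function name and example frequencies are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b  # observed minus expected haplotype frequency
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Allele B (10%) occurs only on haplotypes carrying allele A (50%):
# no recombinant haplotype is observed, yet allele frequencies differ.
d, d_prime, r2 = ld_measures(p_ab=0.10, p_a=0.50, p_b=0.10)
print(round(d_prime, 3), round(r2, 3))  # D' is 1.0 while r^2 is about 0.111
```

The example shows why averaged r² values can sit well below D′: a pair of SNPs with no evidence of recombination still has a low r² when the two allele frequencies are unequal.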
Figure 3.2 Chromosome 6p SNP LD Map
Gene density across the chromosome 6p (represented in blue) is plotted as sliding window gene counts per 100kb. Locations of highlighted genes are represented by green glyphs. All gene annotations were taken from the Vertebrate Genome Annotation Database (Vega) (Wilming et al 2008). The set of SNPs used in the map are drawn as vertical grey lines. Linkage disequilibrium between pairs of SNPs is depicted using a heatmap produced by Haploview (Barrett et al 2005), with darker shades of red representing pairs of SNPs with high D′.
The pairwise LD values were smoothed by averaging them across 2Mb sliding windows, and the resulting distribution of LD across the chromosome arm was plotted as a function of physical distance (Figure 3.3).
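A minimal sketch of this kind of window averaging is given below. It assumes that a pair contributes to a window only when both SNPs fall inside it, and it uses a hypothetical step size of 500kb; the text specifies the 2Mb window but not the step or the exact inclusion rule.

```python
def windowed_mean_ld(pairs, window=2_000_000, step=500_000):
    """Average pairwise LD within sliding windows.

    pairs: iterable of (pos_i, pos_j, ld_value) tuples, positions in bp.
    Returns a list of (window_centre, mean_ld) for windows containing pairs.
    """
    pairs = list(pairs)
    if not pairs:
        return []
    lo = min(min(i, j) for i, j, _ in pairs)
    hi = max(max(i, j) for i, j, _ in pairs)
    out = []
    start = lo
    while start <= hi:
        end = start + window
        # Keep only pairs whose two SNPs both lie inside [start, end)
        vals = [v for i, j, v in pairs if start <= i < end and start <= j < end]
        if vals:
            out.append((start + window // 2, sum(vals) / len(vals)))
        start += step
    return out
```

Passing the D′ values and the r² values separately through such a function would give the two smoothed curves that are compared along the chromosome arm.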
Given that markers in LD do not necessarily show strong allelic correlation (Ott 1999), averaged r² values are much lower than D′, but the 2 parameters track each other closely across the chromosome arm. With the wide marker spacing and relative sparseness of this SNP map, high r² values are not expected, and the mean r² value between pairs of markers less than 5Mb apart is only 0.03 (yellow dotted line). The average for D′ is 0.16 (blue dotted line). There is a noticeable elevation of linkage disequilibrium above the chromosomal average at several locations, the most prominent being an 8Mb-long segment at the centre of the chromosome arm, with elevated LD seen in both D′ and r² values. This strong LD segment lies between positions 25Mb and 33Mb, with the peak being a 2Mb window centred at position 28.9Mb. This peak, between positions 27.9Mb and 29.9Mb of chromosome 6p, contains 21 informative SNPs with pairwise D′ averaging 0.54 and pairwise r² averaging 0.15. This segment of elevated LD also shows fewer recombination hotspots and a lower recombination rate (0.46cM/Mb as opposed to the chromosome average of 1.27cM/Mb) in the genetic map reported by the International HapMap project (International HapMap Consortium 2005).
Figure 3.3 Distribution of LD Across the Chromosome 6p
Averaged pairwise LD between SNPs within 2Mb windows was calculated and plotted against physical distance. LD was calculated using both the D′ coefficient (blue shaded area) and r² (red shaded area). The averaged pairwise LD value across the whole chromosome arm is indicated by the blue (D′) and yellow (r²) horizontal dotted lines. The HapMap genetic map (release 22) in centiMorgans is also plotted as the green line.
The centromeric half of this high-LD segment contains the classical MHC loci (positions 30.0Mb to 33.4Mb), while the telomeric half is marked by the presence of the largest histone cluster in the human genome (over 40 loci coding for histone genes between 26Mb and 28Mb) as well as an 8-zinc finger cluster (between 27.5Mb and 28.7Mb). At the centre and peak of this high-LD segment is a large olfactory receptor cluster, with 13 olfactory receptor genes between 29.1Mb and 29.6Mb. The gene map showing the clusters in this region can be seen in Figure 3.4.
Figure 3.4: Gene Clusters Telomeric to the MHC
The list of genes in this region is obtained from the VEGA project (Wilming et al 2008).
The large gene clusters in the region between 25Mb and 30Mb can be clearly seen in this figure. The centre of the high-LD segment (28.9Mb) lies close to a large olfactory receptor cluster marked out in a red border.
3.1.2.1 Haplotype Blocks Across the Chromosome 6p
Haplotype blocks are defined as segments of DNA along chromosomes that exhibit low diversity and low recombination rates (Daly et al 2001, Patil et al 2001). The chromosome 6p SNP linkage disequilibrium map in this study was constructed at a rather modest resolution of one SNP per 95.2kb, and consequently this sparse SNP map does not allow an exhaustive description of the haplotype blocks that exist on the chromosome arm. However, haplotype blocks identified at this resolution highlight segments of the chromosome arm that have significantly low diversity and high linkage disequilibrium.
Using the conservative definition of haplotype blocks outlined by Gabriel et al (2002), three such blocks can be identified in this SNP map (Figure 3.5). Unsurprisingly, two of these blocks, each over 150kb in length, fall within the high-LD MHC-telomeric region described earlier. The third is a clearly discernible 535kb haplotype block coinciding with the large gene locus SUPT3H and overlapping the RUNX2 locus. This long haplotype block has remarkably low diversity, with 3 out of a possible 256 haplotypes representing more than 97% of the variation in the local Chinese population. SUPT3H is a transcription initiation factor associated with the RNA polymerase II complex and is highly conserved across many organisms. Functional constraints may have resulted in a lack of diversity and suppression of recombination across this locus, maintaining a strong haplotype-block structure across a long stretch of DNA.
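The diversity figure quoted for the SUPT3H block can be illustrated with a small sketch. If the 256 possible haplotypes refer to all allele combinations of biallelic SNPs, this corresponds to 2^8, that is, eight SNPs in the block; this reading is an inference rather than something stated in the text. The haplotype strings and counts below are hypothetical and are not data from this study.

```python
from collections import Counter


def possible_haplotypes(n_snps):
    """Number of distinct haplotypes that n biallelic SNPs can form."""
    return 2 ** n_snps


def top_k_coverage(haplotypes, k):
    """Fraction of observed chromosomes carried by the k most common haplotypes."""
    counts = Counter(haplotypes)
    top = sum(c for _, c in counts.most_common(k))
    return top / len(haplotypes)


# Hypothetical sample of 100 chromosomes typed at 8 SNPs
sample = (["AAAAAAAA"] * 60 + ["CCCCCCCC"] * 30
          + ["AAAACCCC"] * 8 + ["ACACACAC"] * 2)
print(possible_haplotypes(8), top_k_coverage(sample, 3))
```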
3.1.3 An Integrated SNP-HLA Map of the MHC
To generate an MHC SNP map of higher density, the same 192 Singaporean Chinese samples were successfully genotyped at a denser resolution of 1 SNP per 20kb across a 7Mb segment from 30.0Mb to 37.0Mb that includes the MHC. The HLA genotypes for these 192 individuals were also determined using sequence-based typing, and an integrated SNP-HLA haplotype map was constructed. Using this integrated map, the extended and conserved MHC haplotypes present in the local Chinese population are described.
Figure 3.5 Large Haplotype Blocks in the Chromosome 6p
Using the definition of haplotype blocks established by Gabriel et al (2002), 3 large haplotype blocks can be identified across the chromosome arm.
Panel A:
The left side of the panel shows an LD heatmap of a segment of the chromosome between 27.4Mb and 28.6Mb, drawn using Haploview (Barrett et al 2005). Two haplotype blocks of over 150kb in length fall within this region and are outlined in a black border on the heatmap. The positions of the SNPs in the blocks are highlighted in blue. These 2 blocks lie within the high-LD peak described earlier and overlap with zinc finger and histone clusters.
Panel B:
The 3rd haplotype block of SNPs is over 500kb long and lies across a large gene locus, SUPT3H.
3.1.3.1 LD Structure of the MHC and peri-MHC
The same quality control criteria described in the previous section were applied, removing uninformative markers with less than 5% MAF as well as those that failed the Hardy-Weinberg equilibrium test. In all, 81 SNPs were filtered out, and the remaining 345 SNPs were used to describe the LD across the MHC. This SNP map has an average interval of 19.9kb between consecutive markers (ranging from 0.37kb to 140kb, with a median of 12.6kb), an average minor allele frequency of 30% and a heterozygosity of 0.39.
As before, pairwise r² and D′ values were calculated between all SNP pairs less than 500kb apart. The distribution of pairwise LD is shown as a heatmap in Figure 3.6. Regions of stronger LD are seen outside the core MHC (30.0Mb to 33.4Mb) and, at this resolution, no strong LD block is seen across the class I loci HLA-A, -B and -C, nor across the highly polymorphic class II locus HLA-DRB1. The more invariant HLA-DRA, -DMA and -DMB loci are, however, seen amidst SNP markers with higher LD.
The block-like structure of LD is more evident at this resolution, and the criterion laid out by Gabriel et al (2002) was again used to define haplotype block boundaries. In total, 61 blocks of varying physical lengths were identified, ranging from 780bp to 280kb, with an average haplotype block size of 35.5kb (Table 3.2). At this SNP density, 31.5% of the region covered by this map falls within a haplotype block. The majority of the longer haplotype blocks lie outside the traditionally defined MHC
frames, including loci coding for a linked pair of genes, TCP11 (T-complex homologue) and ZN76 (a zinc finger), that are expressed in tandem (Ragoussis et al 1992). Within the MHC proper, 2 large haplotype blocks lie within the class III region: a 132kb block containing CLIC1, VARS2, DDAH1 and several heat shock proteins, and a 65kb block containing C6orf10.
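The summary figures reported for this map appear to be mutually consistent, as the following arithmetic sketch suggests. It assumes the map spans the MHC SNP map coordinates listed in Table 3.1 (30,133,482 to 37,000,199) and that mean spacing is taken as the span divided by the number of SNPs; both are assumptions about how the reported values were derived, and the function names are hypothetical.

```python
def mean_spacing_kb(start_bp, end_bp, n_snps):
    """Span of the map divided by the number of SNPs, in kb."""
    return (end_bp - start_bp) / n_snps / 1000


def block_coverage(n_blocks, mean_block_kb, start_bp, end_bp):
    """Fraction of the map span that lies inside haplotype blocks."""
    return n_blocks * mean_block_kb * 1000 / (end_bp - start_bp)


START, END = 30_133_482, 37_000_199  # coordinates from Table 3.1

print(round(mean_spacing_kb(START, END, 345), 1))             # 19.9
print(round(block_coverage(61, 35.5, START, END) * 100, 1))   # 31.5
```

Under these assumptions, 345 SNPs across the span reproduce the 19.9kb average interval, and 61 blocks averaging 35.5kb reproduce the 31.5% block coverage.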
Figure 3.6 LD Map of the MHC and the peri-MHC
The pairwise LD (calculated as D′) of the 426 SNPs successfully genotyped between 30Mb and 37Mb is shown in this figure, drawn as a heatmap using Haploview (Barrett et al 2005). Shades of red indicate the strength of LD between SNP pairs. The locations of known genes are marked in green. Stronger pairwise LD can be seen in the centromeric region from 34.0Mb to 37.0Mb, which lies outside the boundary of the traditional MHC.
Table 3.2 List of Haplotype Blocks between 30.0Mb and 37.0Mb of the
End Location (Mb)    Block Length (kb)    List of Genes in Block
LY6G6C C6orf25 CLIC1 MSH5 G7C_HUMAN VARS Y LSM2 HSPA1L HSPA1A HSPA1B C6orf48 SNORD48 SNORD52