3.2 A High-Resolution Linkage Disequilibrium Map of the MHC
In the preceding section, the low-resolution, first-generation SNP map provided an
overview of the linkage disequilibrium patterns of the MHC in the Singaporean
Chinese population. From that map, the block-like structure of LD was seen and the
conserved extended haplotypes stretching across megabases were described. However,
the density of the first-generation SNP map limits the ability to resolve fine-scale
recombination patterns in the MHC. While only 31.5% of that map falls within a
haplotype block, the recent HapMap publication concluded that the fraction of the
genome covered by haplotype blocks is greater than 65% (International HapMap
Consortium, 2005).
In a bid to delineate the fine-scale recombination patterns, a higher-resolution
SNP-variation map of the MHC in the local Chinese population was created. Rapid
improvements in SNP genotyping technology, coupled with increased polymorphism
data from the International HapMap Project and MHC studies in other populations
(e.g. Miretti et al. 2005, de Bakker et al. 2006), facilitated the construction of such a
map. The conserved haplotypes established in the previous study were
taken into consideration, and HLA-homozygous samples were sourced and
included in the sample set. Genotyping these homozygous samples at
high resolution provides a high-quality dataset for studying in detail the conserved haplotypes
in the local population, and for comparing these CEHs with those reported in other
populations. The data reported here will also provide a resource for studying and
understanding HLA-disease associations.
3.2.1 High-Resolution SNP Variation Map of the MHC
For constructing a fine-scale variation map of the MHC, 2360 SNPs were genotyped
in 284 Singaporean Chinese individuals. The bulk of these samples consisted of 214
randomly selected, unrelated individuals, of whom 77 overlapped with the samples
genotyped in the previous map. Another 29 samples were taken from archived
B-lymphoblastoid cell-lines that were tested and selected for being homozygous at 2 or
3 HLA loci. These samples were representative of the CEHs identified in the previous
section. The final 41 samples were taken from 12 parent-offspring families (each with at
least both parents and a child). These 12 families provided 48 phase-unambiguous
haplotypes that are useful for improving haplotype reconstruction in the
unrelated individuals. A breakdown of the 284 samples is shown in Table 3.5 below.
Table 3.5: Composition of Samples Used in Constructing High-Resolution SNP Variation Map
SNP genotyping was once again performed using the Illumina GoldenGate assay on a
BeadArray platform. Of the 2360 SNP positions attempted, 2290 were successfully
genotyped. The overall genotyping quality was very high: the locus success rate was
over 97%, the call rate was over 99% and the reproducibility was higher than 99.99%.
Of the 284 Chinese samples genotyped, results were not obtained for 6 (all belonging
to the unrelated individuals group), giving a sample success rate of 97.9%.
To filter out uninformative and possibly erroneously called genotypes, a series of
filters was employed. Only SNPs with a minor allele frequency (MAF) of at least 5% that were in
Hardy-Weinberg equilibrium (using a p-value threshold of 0.001) were retained. The
MAF and heterozygosity distributions of the 2290 markers can be
seen in Figure 3.11. These charts show that the 2290 SNPs had a uniform MAF
distribution, with more SNPs skewed towards higher heterozygosity.
To weed out other potential genotyping errors, SNPs that had genotypes
discordant with pedigree structure in more than one family were also removed.
The locations of the SNPs were re-confirmed by mapping the flanking sequences used
in the design of the SNP assays back to the human genome assembly. This resulted in
the remapping of 2 SNPs within the MHC. The SNP “rs2308655” was remapped from
31,345,141 to 31,430,282 while “rs1611627” was remapped from 29,965,650 to
29,905,761. In both cases, the error was in the Illumina annotation, and it
was reported back to Illumina.
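The minor allele frequency and Hardy-Weinberg filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the text does not say which HWE test was applied, so a 1-d.f. chi-square goodness-of-fit test is assumed here (exact tests are often preferred for small counts), and all function names are hypothetical.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square test against Hardy-Weinberg proportions.

    Assumption: the source does not name its HWE test; this is one common choice.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(n_aa, n_ab, n_bb, min_maf=0.05, hwe_alpha=0.001):
    """Keep a SNP only if MAF >= 5% and the HWE p-value is not below 0.001."""
    return (minor_allele_frequency(n_aa, n_ab, n_bb) >= min_maf
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha)
```

The thresholds default to the values quoted in the text (5% MAF, p = 0.001); the genotype counts passed in would come from the unrelated individuals only.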
In total, 1877 markers were retained, establishing a SNP map that covers a 4.91Mb
segment of chromosome 6p, from positions 28.97Mb to 33.88Mb. With an average
gap of 2.6kb (and a median of 1.6kb) between consecutive SNPs, this map is about 8
times denser than the previous one. Gap intervals range from 18bp to 71kb, with over
88% of the gaps less than 5kb. There were 6 distinct gaps spanning over 25kb, and
these are listed in Table 3.6. Two of the largest gaps were within the hyper-variable
HLA-DRB (71kb) and RCCX (59kb) loci, which exhibit MHC haplotype-specific
lengths and gene content (Dawkins et al. 1999). Individuals carrying different MHC
haplotypes may differ in the number of HLA-DRB paralogues as well as in the
number of copies of the C4A/C4B genes within the RCCX locus. The other large gaps
cover segments that are densely packed with large tracts of repetitive and
transposable elements. These gaps most probably reflect difficulties in designing SNP
assays in regions with repetitive sequences and variable-length polymorphisms,
resulting in the lack of genotype information here.
Table 3.6: Gaps Larger than 25kb in the High-Resolution SNP Map
Gap Length (kb)    Position Along Chromosome 6p (Mb)    Description of Loci
The large gaps in this map coincide with regions of complex polymorphism and repeat
elements, reflecting the difficulty in designing SNP assays here.
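The gap statistics quoted for this map (mean, median, range, fraction of gaps under 5kb, and the list of gaps over 25kb) can be derived from the sorted SNP positions alone. The sketch below shows one way to do this; the function name and the toy positions are hypothetical and are not data from this study.

```python
from statistics import mean, median


def gap_summary(positions_bp, small_gap_bp=5_000, large_gap_bp=25_000):
    """Summarise the gaps between consecutive SNP positions (in base pairs)."""
    pos = sorted(positions_bp)
    intervals = list(zip(pos, pos[1:]))
    gaps = [b - a for a, b in intervals]
    return {
        "mean_bp": mean(gaps),
        "median_bp": median(gaps),
        "min_bp": min(gaps),
        "max_bp": max(gaps),
        # Fraction of gaps shorter than the "small" cut-off (5kb in the text)
        "fraction_small": sum(g < small_gap_bp for g in gaps) / len(gaps),
        # Flanking positions of every gap above the "large" cut-off (25kb)
        "large_gaps": [(a, b) for a, b in intervals if b - a > large_gap_bp],
    }


# Toy positions for illustration only
print(gap_summary([100, 118, 1_118, 31_118]))
```

As a consistency check on the reported figures, 1877 SNPs define 1876 intervals, and 4.91Mb divided by 1876 is approximately 2.6kb, matching the stated average gap.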
For constructing the LD and haplotype maps of the Singaporean Chinese population,
only genotype data from the 208 unrelated individuals (6 samples failed the Illumina
genotype assays) were used. As the 29 specifically chosen homozygous cell-lines and
the 41 family samples were not a random sampling of the local Chinese
population, these were not included in constructing population LD maps. However,
genotype information from the HLA-homozygous cell-lines is a valuable source of
extended haplotypes across the MHC, and these were used in subsequent analysis of
HLA haplotypes and recombination breakpoints. The family-based genotypes were
used to reconstruct phase-unambiguous haplotypes that were subsequently used to
improve the haplotype phasing of the unrelated individuals (see Methods).
The allele frequencies for the SNPs in this data set were compared to those reported
for the 4 populations genotyped as part of the HapMap project (International HapMap
Consortium, 2005). As expected, of the 4 populations the allele frequencies in the
local Chinese show the tightest correlation with those reported in the Beijing Chinese
0.84), reflecting the relatively recent shared ancestry of the 2 ethnic groups. The CHB
and JPT datasets are frequently combined in HapMap data releases, but the results
here indicate that when using HapMap data for designing informative genotyping
panels in the local Chinese population, it is better to consider the CHB data only.
Figure 3.12 Comparing Allele Frequencies with HapMap Panels
Allele frequencies for the 1877 informative SNPs genotyped in the local Chinese population were plotted against the corresponding allele frequencies from each HapMap population, and the Pearson correlation coefficient was calculated.
Clockwise from top left: CHB – Han Chinese (Beijing), JPT – Japanese (Tokyo), CEU – Caucasian (CEPH), YRI – African (Yoruban, Nigeria). Data were obtained from HapMap release 22.
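The comparison above rests on the Pearson correlation between two vectors of allele frequencies, one per population, taken over the same SNPs. A minimal sketch is given below; the function name is hypothetical, and it assumes that frequencies in both populations refer to the same allele of each SNP (the same strand and reference allele), which must be checked before any such comparison.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy allele frequencies for the same four SNPs in two populations
local_freqs = [0.10, 0.25, 0.40, 0.48]
panel_freqs = [0.12, 0.22, 0.43, 0.45]
print(round(pearson_r(local_freqs, panel_freqs), 3))
```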
3.2.2 Estimating Coverage of Known Variation in the MHC using the High-Resolution SNP Map
The MHC is known to be the most polymorphic region in the genome, and the 1877
SNPs genotyped in this study are a subset of the known variation here (Horton et al.
2008). The publicly available HapMap data offer the opportunity to assess how
effective a proxy this 2.6kb-resolution SNP map is for the other known SNPs in the
Chinese population. Having established above that the HapMap Han Chinese data are a
good representative of allele frequencies in the local Chinese population, these Han
Chinese data were used as a surrogate test set. Deposited Han Chinese genotypes in
release 22 of the HapMap consist of 9479 SNPs across the MHC, including the 1877
informative SNPs genotyped in this study. To test the efficacy of these 1877 in
representing the variation in the remaining HapMap Han Chinese SNPs not genotyped
in this study, pairwise LD (r2) values between the 1877 SNPs and the
remaining HapMap SNPs were calculated from the HapMap Han Chinese genotypes.
The results are plotted in 2 bar charts in Figure 3.13. The panel of SNPs used in this
study represents most of the variation in the HapMap Han Chinese population well.
Of the 7602 HapMap SNP loci not genotyped in this study, more than half (51.1%)
0.84. Uninformative SNPs make up the bulk of the 341 HapMap SNPs that were poorly
poorly represented SNPs have a MAF of less than 5%.
Interestingly, the distribution of the 341 poorly represented SNPs was not uniform
across the MHC: 125 poorly represented SNPs were concentrated
within a 900kb segment (32.3Mb to 33.2Mb) defined as the class II region (Horton et
al. 2004), while the remaining 216 were scattered across the other 4Mb of this SNP
map. Furthermore, the 125 poorly represented SNPs in the class II region had an
average minor allele frequency of 8%, compared to an average of 4% for the other
216. This result suggests that although the overall performance of the 1877-SNP map
in capturing HapMap variation is very high, some common variation in the class II
that LD in the class II region is lower than the rest of the MHC.
Figure 3.13 Estimating Coverage of Known Variation in the MHC using the High-Resolution SNP Map
between the 1877 SNPs used and the remaining HapMap SNPs within the MHC locus, were calculated using the genotype data of the HapMap Han Chinese population
(red portion of bar chart). Only 341 (4.5%) of the 7602 SNPs were poorly represented (defined as
Panel B: Of these poorly represented SNPs, the majority are present at a frequency of less than 5% in the population, and are thus not informative in the Han Chinese population.
r2 = 1.0
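The coverage analysis in this section asks, for each HapMap SNP that was not genotyped, how well it is tagged by its best proxy among the genotyped SNPs. A minimal sketch of that idea follows, under two assumptions that are not taken from the source: r2 is approximated by the squared correlation of unphased genotypes coded as 0/1/2 copies of one allele (the study may have used phased haplotypes), and the cut-off for "poorly represented" is left as a required argument because its value is not recoverable from the text. All names are hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (v1 * v2)


def best_proxy_r2(untyped, typed_panel):
    """Highest r2 between one untyped SNP and any SNP in the typed panel."""
    return max(genotype_r2(untyped, typed) for typed in typed_panel)


def poorly_represented(untyped_snps, typed_panel, threshold):
    """Names of untyped SNPs whose best proxy falls below the r2 threshold."""
    return [name for name, genotypes in untyped_snps.items()
            if best_proxy_r2(genotypes, typed_panel) < threshold]


# Toy data: one typed SNP and three untyped SNPs across six individuals
typed_panel = [[0, 1, 2, 1, 0, 2]]
untyped_snps = {
    "snp_a": [0, 1, 2, 1, 0, 2],  # identical to the typed SNP
    "snp_b": [2, 1, 0, 1, 2, 0],  # same SNP with alleles swapped
    "snp_c": [1, 1, 0, 2, 1, 1],  # weakly correlated
}
print(poorly_represented(untyped_snps, typed_panel, threshold=0.5))
```

In practice the search for proxies would normally be restricted to typed SNPs within some physical distance of the untyped SNP; that restriction is omitted here for brevity.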
3.2.3 Fine-scale Linkage Disequilibrium Patterns of the MHC
Linkage-disequilibrium structure of the MHC was analysed in 2 ways. First, the
average LD (r2 and D′) was calculated between all
pairs of SNPs up to 500kb apart. This gives an overview of LD decay across the
4.9Mb SNP map. Second, as recombination is known to occur at preferred ‘hotspots’
and not uniformly across chromosomes, the detailed localised variation of LD over
kilobases was resolved by describing the location of haplotype blocks across the
MHC.
The MHC can be divided into 5 sub-regions that reflect the clustering of the different
classes of HLA genes within (Horton et al. 2004). To see if LD patterns differ across
these sub-regions, the SNP map was divided accordingly, with LD analysed in each
sub-region separately and also across the MHC as a whole. The 5 sub-regions are:
extended class I (29.0Mb to 29.8Mb), class I (29.8Mb to 31.6Mb), class III (31.6Mb
to 32.3Mb), class II (32.3Mb to 33.2Mb) and extended class II (33.2Mb to 33.9Mb).
In this high-resolution variation map, all SNPs are in high LD with at least one other
SNP, as determined by D′. Of the 1877 SNPs, 1872 are in complete LD with at least one
neighbouring SNP (D′=1), while the 5 remaining SNPs have at least a D′=0.9 with a
pairs, LD is seen to decay with increasing physical distance across the MHC (Figure
3.14). SNP pairs less than 20kb apart have an average D′ of 0.81, and pairs separated
by 500kb have an average D′ of 0.32. However, there is a noticeable difference in the
rate of decay of LD across the different sub-regions of the MHC. The class I, class III
and extended class II segments show a level of LD similar to the MHC average, but
SNP pairs in the extended class I region show a lower rate of LD decay with
distance, while the opposite is seen for the class II region. Across the extended class I
region, SNP pairs less than 20kb apart have an average D′ of 0.91, and pairs separated
by 500kb have an average value of 0.36. By contrast, the corresponding values within
the class II region are 0.78 and 0.28. This pattern of LD confirms the observation
reported in Caucasian MHC haplotypes (Miretti et al. 2005), and is in concordance
with the higher LD in the telomeric segment described in the previous section.
Figure 3.14 Pairwise Linkage Disequilibrium as a Function of Marker Distance
Average linkage disequilibrium values (r2 – left, D′ – right) between all SNP pairs up to a distance of 500kb apart are shown as a function of physical distance. Greater physical distance affords more opportunity for recombination, hence the general trend of LD decreasing with increasing marker distance. There is a noticeable spread between LD values
in the Extended Class I (blue curves) versus Class II (green curves) segments.
However, these average pairwise LD values mask the local variations seen on a
finer scale. Widely spaced SNPs up to 500kb apart can be found in complete LD (D′=1),
while some closely spaced markers less than 1kb apart exist in complete equilibrium
(D′=0). The distribution of LD across this high-resolution SNP map can
be described as consecutive runs of SNPs in strong LD interrupted by a sudden
breakdown of LD between closely spaced markers, similar to the
“block-like” structure of LD described in other parts of the genome (Daly et al. 2001,
Dawson et al. 2002, International HapMap Consortium 2005).
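The two pairwise LD measures used in this section, D′ and r2, can both be computed from the counts of the four haplotypes at a pair of biallelic SNPs. The sketch below uses the standard textbook definitions; the function name and the allele labels (A/a at the first SNP, B/b at the second) are hypothetical, and phased haplotype counts are assumed to be available.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D', r2) for two biallelic SNPs from phased haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / n  # frequency of allele B at the second SNP
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = n_AB / n - p_A * p_B  # raw disequilibrium coefficient D
    if d > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d_prime, r2


# Only three of the four possible haplotypes are observed in this toy example
print(ld_measures(60, 0, 20, 20))
```

In the toy example D′ is 1 while r2 is only 0.375, because D′ reaches 1 whenever at least one of the four haplotypes is absent, whereas r2 reaches 1 only when the two SNPs also have matching allele frequencies. This is why a SNP can show D′=1 with a neighbour and still be a poor proxy for it.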
To map the structure of the haplotype blocks seen in the local Chinese population,
block boundaries were determined using well-established criteria (Gabriel et al.
2002) that define a consecutive run of SNPs with significantly high pairwise D′ as a
block. In contrast with the previous first-generation map, this denser SNP map enables
more haplotype blocks to be uncovered. Most of the SNPs on this map (1712 out of
1877, or 91%) lie within defined haplotype blocks, and a total of 203 haplotype blocks
can be identified across the MHC, covering 3.7Mb of this 4.9Mb map (75.25%). This
is similar to the 202 blocks covering 82% of the MHC region reported in an LD map of
a Caucasian population (Miretti et al. 2005). The haplotype block coverage also falls
within the range of the genome-wide average (67-87%) reported in the HapMap project
(International HapMap Consortium, 2005).
The haplotype blocks have an average size of 18.2kb and range from 70bp to 180kb.
As seen in the previous lower-resolution SNP map, 2 of the biggest blocks, with sizes
of 180kb and 100kb, lie within the extended class I region. This indicates that the
blocks identified in the lower-resolution SNP map are robust. There is an average of
7.1 haplotypes per block, which is very similar to that reported in the Caucasian
population (18kb average, 6.4 haplotypes per block). Haplotype blocks are also
segments of low diversity: within a haplotype block, 95% of the total variation in the
local Chinese population is represented by an average of 4.4 haplotypes. Furthermore,
each haplotype block carries an average of 3.9 common haplotypes (present in greater
than 5% of the population).
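The per-block diversity figures above (distinct haplotypes, the number needed to account for 95% of chromosomes, and the number of common haplotypes) can be computed from the phased haplotypes observed within a block. The following is a minimal sketch; the function name and the toy haplotype strings are hypothetical, and each haplotype is assumed to be represented as a string of alleles over the SNPs in the block.

```python
from collections import Counter


def block_diversity(haplotypes, coverage=0.95, common_freq=0.05):
    """Summarise haplotype diversity within one block.

    haplotypes: one allele string per sampled chromosome, e.g. "ACG".
    """
    n = len(haplotypes)
    counts = sorted(Counter(haplotypes).values(), reverse=True)
    # Smallest number of haplotypes (most frequent first) reaching the coverage
    cumulative, needed = 0, 0
    for c in counts:
        cumulative += c
        needed += 1
        if cumulative >= coverage * n:
            break
    return {
        "distinct": len(counts),
        "needed_for_coverage": needed,
        "common": sum(c / n > common_freq for c in counts),
    }


# Toy block: 100 chromosomes carrying 6 distinct haplotypes
toy = (["ACG"] * 50 + ["ATG"] * 30 + ["GCG"] * 10
       + ["GTA"] * 6 + ["ACA"] * 2 + ["GTG"] * 2)
print(block_diversity(toy))
```

Averaging these three values over all blocks would give summary figures of the kind reported in the text.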
The characteristics of the haplotype blocks, both as an MHC-wide average and
broken down into the 5 sub-regions of the MHC, are detailed in Figure 3.15.
The number of haplotype blocks (expressed as a ratio to physical length to account for
the different sizes of the MHC sub-regions) was greatest in the class II region, with
over 60 blocks per Mb. By contrast, there are 24 blocks per Mb in the extended class I
segment, while the MHC average is 41 blocks per Mb. Haplotype blocks in the
extended class I region are larger and have higher coverage, averaging 37.1kb in
length and extending across 88% of the region. Class II region haplotype blocks are
almost a third smaller (12.5kb) and cover only 76% of the underlying DNA sequence.
This shorter, more fragmented haplotype structure of the class II region appears
consistent with the greater number of recombination hotspots discovered there
(Cullen et al. 1997, Jeffreys et al. 2001, Cullen et al. 2002). The haplotype block
characteristics mirror the pattern of stronger and longer LD in the extended class I
region, and weaker LD in the class II segment. The stark contrast between the haplotype
blocks within these 2 sub-regions is clearly illustrated in the LD heatmap in Figure
3.16.
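The per-region block statistics compared above (blocks per Mb, mean block length and fraction of sequence covered) can be derived from a list of block start and end coordinates. The sketch below uses the sub-region boundaries given earlier in this section; the function name and the toy blocks are hypothetical, and blocks that straddle a sub-region boundary are simply excluded, which is an assumption made here and not necessarily how the study assigned them.

```python
# Sub-region boundaries (Mb on chromosome 6p) as listed in the text
SUBREGIONS_MB = {
    "extended class I": (29.0, 29.8),
    "class I": (29.8, 31.6),
    "class III": (31.6, 32.3),
    "class II": (32.3, 33.2),
    "extended class II": (33.2, 33.9),
}


def block_stats(blocks_bp, region_mb):
    """Block density, mean size and coverage for blocks fully inside a region."""
    start, end = (int(round(x * 1e6)) for x in region_mb)
    inside = [(s, e) for s, e in blocks_bp if start <= s and e <= end]
    covered = sum(e - s for s, e in inside)
    return {
        "n_blocks": len(inside),
        "blocks_per_mb": len(inside) / ((end - start) / 1e6),
        "mean_block_kb": covered / len(inside) / 1e3 if inside else 0.0,
        "coverage": covered / (end - start),
    }


# Toy blocks (start, end) in base pairs; the third lies in class I
toy_blocks = [(29_000_000, 29_100_000), (29_200_000, 29_400_000),
              (30_000_000, 30_010_000)]
print(block_stats(toy_blocks, SUBREGIONS_MB["extended class I"]))
```

The same arithmetic applied to the whole map reproduces the MHC-wide figure quoted in the text: 203 blocks over 4.91Mb is approximately 41 blocks per Mb.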