CHAPTER 1:
INTRODUCTION
1.1 The Human Major Histocompatibility Complex (MHC)
1.1.1 Features of the Human MHC
The human Major Histocompatibility Complex (MHC) is a gene-dense segment on the short arm of human chromosome 6. It houses over 200 gene loci in a 3.6Mb region, and more than 40% of the expressed genes here are known to have an immunity-related function (The MHC Sequencing Consortium, 1999). The highly polymorphic human leukocyte antigen (HLA) genes, key loci for histocompatibility matching in organ transplants, are also located within this complex. Traditionally, the MHC is defined as the region bordered by the HLA-F and the RPS18 loci (Campbell and Trowsdale, 1993), and is divided into 3 sub-regions to reflect the clustering of different classes of genes. The class I region lies at the telomeric end and carries a cluster of HLA class I genes such as the classical HLA-A, -B, -C, the non-classical HLA-E, -F, -G and the class I-like MICA and MICB. The class II region is at the centromeric end, comprising HLA class II antigens such as HLA-DR, -DP and -DQ. The class III region is sandwiched in between and comprises varied gene families such as the tumour necrosis factors (TNF, LTA, LTB) and the complement cascade (CFB, C2, C4A, C4B). However, the discovery of HLA-like genes in the regions flanking the classical MHC, as well as extended regions of conserved synteny, led to the hypothesis of an extended MHC in humans (Malfroy et al 1997, Yoshino et al 1997, Stephens et al 1999), and these additional regions have been termed the extended class I and extended class II regions. A gene map of the MHC can be seen in Figure 1.1.
Figure 1.1 Gene Map of the Major Histocompatibility Complex
The gene map between 29.0Mb and 33.8Mb of chromosome 6p is shown above. Only gene loci that are known to be expressed are included. The coloured backgrounds mark out sub-regions of the MHC as described in a recent review of the extended MHC (Horton et al 2004). Yellow – Extended class I region; Blue – Class I region; Green – Class III region; Orange – Class II region; Grey – Extended class II region. The locations of the classical HLA class I genes (HLA-A, HLA-B, HLA-C) and HLA class II genes (HLA-DRB1, HLA-DQB1, HLA-DQP1) are indicated in blue and red blocks respectively. All gene annotations were taken from the Vertebrate Genome Annotation Database (Vega) (Wilming et al 2008).
The distances in this figure are not to scale but the physical map locations are approximately indicated for
A hallmark of the MHC is the duplication that results in the formation of large gene clusters across the region. Besides the HLA genes, very large clusters of RNA genes (157 in total have been found) and histone genes (55 expressed loci) are located here, mostly within the extended class I sub-region. This same sub-region also contains 34 olfactory receptor-coding loci, of which about half are potentially functional (Younger et al 2001, Horton et al 2004). Within the classical MHC, other gene clusters include the tripartite motif-containing zinc-finger genes (TRIM), heat shock proteins, as well as lymphocyte antigen 6 genes (LY6). The gene clusters across the MHC are believed to be the result of both gene-level and larger segmental duplication events, and to be maintained because of functional requirements. For example, several immune system genes function in tandem, such as HLA-DQA and DQB, whose products dimerize, and co-localization can ensure that these genes are expressed in similar quantities for heterodimer formation. Similarly, genes involved in antigen processing, such as the peptide transporters (TAP1/2), the immunoproteasome components (PSMB8/9) and the peptide chaperone TAPBP, could benefit from clustering together for the coordination of expression profiles (Horton et al 2004). Other gene families such as the RNA and histone genes are thought to exist in clusters in order to maximise transcription levels (Mungall et al 2003).
Extreme polymorphism is another characteristic of the MHC, with diversity an order of magnitude higher than the genome average (Stewart et al 2004). Most of this variation can be attributed to the HLA class I and class II genes, which carry between 200 and 1000 different alleles at each locus (Robinson et al 2003). These HLA genes encode cell surface glycoproteins that present endogenous and exogenous peptides to T-cells, subsequently initiating the adaptive immune response. The extreme diversity of HLA genes is believed to be a result of pathogen-driven balancing selection favouring increased variety against evolving pathogens (Meyer and Thompson 2001). As for variation across the entire MHC, there are an estimated 15,000 single-nucleotide polymorphisms (SNPs) between the MHC sequences of any 2 individuals, or about 3.2 SNPs for every kb (Horton et al 2008). This is somewhat similar to the genome average (ENCODE Project Consortium, 2004).
Polymorphism in the MHC is, however, not restricted to SNPs. Large tracts of deletion/insertion polymorphisms (DIPs) have been catalogued between several full-length MHC sequences (Horton et al 2008). The majority of DIPs are a result of copy-number variations of repetitive transposable elements such as short interspersed elements (SINE), long interspersed elements (LINE), Alu elements and human endogenous retroviruses (HERV). These DIPs contribute to sequence length variations between different MHC haplotypes.
Additionally, there also exist gene-specific copy number variations, such as the 2 large regions of complex polymorphism in the MHC, the RCCX module and the HLA-DRB locus. The RCCX module is a segment of DNA that includes part of the STK19 (RP) gene, C4A/B, CYP21 and part of the TNXB gene (Chung et al 2002). The C4 gene encodes a component of the complement cascade and exists in either a long or a short variant. Different MHC haplotypes carry between 1 and 4 modular units of a combination of C4A and/or C4B as a result of segmental duplication, and consequently different length variants of the RCCX module correlate with dosage variation of the C4A and C4B proteins. Dosage variation of C4A and C4B proteins is associated with systemic lupus erythematosus (SLE), possibly through increased production of the C4 protein in local tissues during the inflammatory process, exacerbating complement-mediated tissue injuries (Rupert et al 2002).
The HLA-DRB locus also displays length variation between MHC haplotypes, with 5 main arrangements of this locus: HLA-DR1, -DR8, -DR51, -DR52 and -DR53 (Figure 1.2). All 5 arrangements carry a DRB9 pseudogene at the telomeric end and a DRB1 gene at the centromeric end. In between, the DR subtypes carry different arrangements of DRB3, DRB4 and DRB5 genes, as well as DRB6, DRB7 and DRB8 pseudogenes (Bergstrom et al 1999). Based on the analysis of lineage-specific insertion/deletion elements between syntenic regions in humans, chimpanzees and gorillas, the DR51, DR52 and DR53 arrangements were found to be ancient in the hominid clade, while the DR1 and DR8 arrangements evolved from DR51 and DR52 more recently (Svensson and Andersson 1997).
Figure 1.2 Organisation of HLA-DR Haplotypes
This is a schematic representation of the 5 major DR haplotypes found in the MHC, shown telomeric to centromeric. All haplotypes carry the DRB9 pseudogene at the telomeric end and the DRB1 gene at the centromeric end. Each, however, varies in the complement of DRB paralogs in between. Coloured blocks are expressed genes while grey patterned blocks are pseudogenes. Distances are not drawn to scale. This figure is adapted from Svensson and Andersson, 1997.
1.1.2 MHC and Disease
Given that the MHC is at the heart of the human immune system, it is not surprising that the MHC is associated with the majority of autoimmune and infectious diseases, and thus is the focus of many disease gene-mapping studies (Lechler and Warrens, 2000). Most disease associations with the MHC are first identified as a significant difference in the frequency of a particular HLA allele in a patient group compared to an ethnically matched group of healthy individuals. An example of such an association is that of HLA-B27 alleles and ankylosing spondylitis (AS), a chronic inflammatory rheumatic disease. This is the strongest association described between any autoimmune disease and MHC molecules, with over 94% of AS patients being HLA-B27-positive compared to less than 10% of healthy individuals, translating to an odds ratio of over 170 (Lechler and Warrens, 2000). This association is also remarkable for the fact that it is robust in every population examined. The main hypothesis for the B27 and AS association is that HLA-B27 molecules have the unique ability to bind a set of “arthritogenic self-peptides”. This hypothesis has been supported by epidemiological and functional studies of HLA-B27 subtypes, in which associated subtypes like B*2705 are able to present a self-peptide in 2 different conformations while non-associated subtypes like B*2706 and B*2709 do not share this ability (Ren et al 1997, Hülsmeyer et al 2004). However, even in the face of such strong epidemiological support, the precise molecular basis of the HLA-B27 association with AS has not yet been determined conclusively.
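The odds ratio quoted for HLA-B27 and AS follows directly from the carrier frequencies in patients and controls. The sketch below shows the arithmetic only; the function name `odds_ratio` and the 8% control frequency are assumptions made for illustration, since the text states only "over 94%" and "less than 10%".

```python
def odds_ratio(p_cases: float, p_controls: float) -> float:
    """Odds ratio from the carrier frequency in cases and in controls."""
    if not (0.0 < p_cases < 1.0 and 0.0 < p_controls < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    odds_cases = p_cases / (1.0 - p_cases)          # odds of carrying the allele among patients
    odds_controls = p_controls / (1.0 - p_controls)  # odds of carrying the allele among controls
    return odds_cases / odds_controls


# 94% of patients positive, controls at the quoted upper bound of 10%.
or_at_10 = odds_ratio(0.94, 0.10)

# Hypothetical control frequency of 8%, chosen only for illustration.
or_at_08 = odds_ratio(0.94, 0.08)

print(round(or_at_10, 1), round(or_at_08, 1))
```

With controls at exactly 10% the ratio is about 141, so an odds ratio above 170 corresponds to a control carrier frequency somewhat below the 10% bound (or a patient frequency above 94%).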
One of the very few disease genes mapped conclusively to the MHC region illustrates the complexity of MHC disease associations. Hereditary haemochromatosis is an autosomal recessive disorder of iron metabolism that leads to an accumulation of excess iron in the body and ultimately to multi-organ dysfunction (Feder et al 1996). The association of haemochromatosis with the MHC was first identified in 1976 with the segregation of HLA-A3 alleles with the disease in patients of Caucasian descent throughout Europe (Simon et al 1976). Numerous fine-mapping studies followed in the 2 decades after that, with strong association signals coming from the 1-2Mb region around the HLA-A locus. Eventually a mutation in an HLA class I-like gene, HFE, was conclusively linked to hereditary haemochromatosis, and shown to disrupt the role that HFE plays in regulating iron absorption and distribution (Feder et al 1996, Townsend and Drakesmith 2002). Physical mapping of HFE places it at location 26.2Mb of chromosome 6p, a distance of 3.8Mb away from HLA-A. MHC haplotypes carrying the HLA-A3 allele in Caucasians are highly conserved in the stretch between the HFE and HLA-A loci, providing an explanation for the strong but ultimately spurious association signal for HLA-A3 in hereditary haemochromatosis.
For the majority of MHC-associated diseases, establishing a causative relationship between a disease and an MHC gene has been difficult. In most studies where associations with HLA alleles are found, such as the link between insulin-dependent diabetes mellitus (IDDM) and HLA-DRB1*03/*04, these are oftentimes population-dependent, with little reproducibility in other ethnic groups (Lechler and Warrens, 2000). There are a few reasons for this lack of success. First is the incomplete knowledge of the variation in the MHC outside of the classical HLA loci, especially in non-Caucasian populations. Second, strong stretches of allele conservation due to linkage disequilibrium – exemplified by the tripartite association of HLA-A3, haemochromatosis and the HFE gene – complicate the discovery of disease loci. Classical HLA alleles associated with diseases are thought to be only markers in linkage disequilibrium with the actual disease loci within the MHC (Dawkins et al 1999). Third, most MHC-associated diseases are believed to be complex and polygenic in nature, belonging to the common disease/common variant (CDCV) class of diseases, in which the genetic risk for common diseases will often be due to disease-predisposing alleles with relatively high frequencies (Reich and Lander, 2001). In order to map CDCV diseases, genotyping informative markers selected with knowledge of the linkage disequilibrium of the targeted region is crucial (Zondervan and Cardon, 2004).
Within the Singaporean Chinese population, a range of diseases has been found to be associated with the MHC (Table 1.1). These include adverse drug reactions, cancers, renal diseases and autoimmune conditions. Many of these associations, such as that of HLA-B*4601 with nasopharyngeal carcinoma, have been replicated in other Chinese populations (Hildesheim et al 2002). However, without complete knowledge of the variation and linkage disequilibrium of the MHC in Chinese populations, the identification of a causative or disease-associated locus in each of these diseases has been elusive thus far.
1.2 Linkage Disequilibrium
Linkage disequilibrium (LD) refers to the non-independence of alleles at different loci. As an example, consider 2 adjacent loci, one with alleles A/a and the other with alleles B/b, giving rise to 4 possible haplotypes: AB, ab, Ab and aB. If the 2 loci are independent, the frequency of each haplotype should not deviate from its expected frequency, which is the product of the individual allele frequencies. If there is a significant difference between the observed and expected haplotype frequencies, the 2 loci are said to be in linkage disequilibrium. LD arises when alleles share a common population ancestry, and the strength of this allelic association is gradually eroded by recombination events. Hence the strength of LD is also a reflection of the number of recombination events or the recombination rate. The opportunity for recombination increases with time as well as with physical distance, and therefore LD is theoretically inversely related to physical distance and time. However, this simplistic model of LD poorly represents local variation in LD, in which many other factors are important. Although there is a general trend of LD decreasing over physical distance, closely spaced markers are not necessarily in high LD, while distal markers can be found in high disequilibrium (Ardlie et al 2002).
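The comparison of observed and expected haplotype frequencies in the two-locus example can be sketched in a few lines. This is an illustration only: the function name `disequilibrium_d` and the haplotype counts are made up, and the haplotypes are assumed to be already phased.

```python
from collections import Counter


def disequilibrium_d(haplotypes: list[str]) -> float:
    """Observed minus expected frequency of the AB haplotype.

    Each haplotype is a two-character string: allele A or a at the first
    locus, followed by allele B or b at the second locus.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    counts = Counter(haplotypes)                 # missing haplotypes count as 0
    f_ab = counts["AB"] / n                      # observed AB frequency
    f_a = (counts["AB"] + counts["Ab"]) / n      # frequency of allele A
    f_b = (counts["AB"] + counts["aB"]) / n      # frequency of allele B
    return f_ab - f_a * f_b                      # expected under independence is f_a * f_b


# Hypothetical sample of 100 phased haplotypes with an excess of AB and ab.
associated = ["AB"] * 40 + ["ab"] * 40 + ["Ab"] * 10 + ["aB"] * 10
d_associated = disequilibrium_d(associated)

# Same allele frequencies (0.5 each), haplotypes at their expected frequencies.
independent = ["AB"] * 25 + ["ab"] * 25 + ["Ab"] * 25 + ["aB"] * 25
d_independent = disequilibrium_d(independent)

print(d_associated, d_independent)
```

In the first sample AB is observed at 0.40 against an expected 0.25, a difference of 0.15; in the second the difference is zero. Whether a non-zero difference is statistically significant depends on sample size, which this sketch does not test.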
Local variation in LD is influenced by a number of factors, and some of these are discussed briefly here. Admixture (or migration) into populations may lead to spuriously inflated LD even if no linkage actually exists between markers, especially if the allele frequencies of the 2 populations differ greatly (Pfaff et al 2001). In subsequent generations, recombination will break down the disequilibrium, so the amount of time since the admixture event will affect the amount of LD seen. Population subdivision resulting in inbreeding will also increase LD, but this should not be an issue in large, diversified populations where random mating may be assumed (Laporte and Charlesworth 2002). Natural selection, either purifying selection (deleterious alleles are rapidly swept away from a population) or positive selection (advantageous alleles are rapidly swept to high frequency), may also inflate LD locally. Segments neighbouring the allele under selection will also be swept along in a “hitchhiking” effect, potentially creating long segments of disequilibrium (Sabeti et al 2002). Finally, while the above-mentioned factors inflate LD over long distances, meiotic recombination hotspots do the opposite, breaking down LD between closely spaced markers (Jeffreys et al 2001). The relationship between recombination and LD is discussed in greater detail in Section 1.3 of this chapter.
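The erosion of LD by recombination over generations can be illustrated with the standard textbook decay model, D_t = D_0 (1 - c)^t, where c is the recombination fraction between the two loci per generation. The model is not taken from this chapter: it assumes a large, randomly mating population with no selection, drift or further admixture, and the function names and numerical values below are hypothetical.

```python
import math


def ld_after_generations(d0: float, recomb_fraction: float, generations: int) -> float:
    """Expected D after t generations of random mating: D_t = D_0 * (1 - c)**t."""
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be between 0 and 0.5")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - recomb_fraction) ** generations


def generations_to_halve(recomb_fraction: float) -> float:
    """Number of generations for D to fall to half of its starting value."""
    if not 0.0 < recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    return math.log(0.5) / math.log(1.0 - recomb_fraction)


# Hypothetical starting disequilibrium created by an admixture event.
d_tight = ld_after_generations(0.15, 0.001, 100)  # tightly linked markers
d_loose = ld_after_generations(0.15, 0.05, 100)   # loosely linked markers

print(d_tight, d_loose, generations_to_halve(0.01))
```

Under these assumptions tightly linked markers retain most of their disequilibrium after 100 generations while loosely linked markers lose almost all of it, which is the idealised inverse relationship between LD, distance and time that the local factors above then distort.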
1.2.1 LD and Disease Association Studies
Disease gene mapping has traditionally been done using linkage studies – tracing patterns of allele sharing between affected relatives – which have proven successful for identifying single-gene Mendelian disorders with high penetrance. However, as briefly mentioned earlier, the majority of human diseases may be polygenic and complex, fitting into the common disease/common variant (CDCV) paradigm, and traditional linkage studies are under-powered to detect such disease variants unless armed with an impractically large number of affected families. Rather, case-control association studies – in which the genotypes of a large number of markers across a region are compared between a panel of patients and a panel of population-matched healthy individuals – were thought to be more powerful for detecting high-frequency, small-effect polymorphisms (Risch and Merikangas 1996). In case-control association study designs, there is no reason to believe that the markers selected for genotyping coincide with the actual susceptibility or disease loci; rather, uniformly spaced markers are chosen in the hope that one of these will be in linkage disequilibrium with the disease loci and hence be detected as an associated marker. Regions around associated markers can then be further screened in a directed approach. SNPs are ideal markers for these association studies given their frequency across the genome (one every 300bp) and stability between different populations (ENCODE Project Consortium 2004). The selection of SNPs for association studies should also be optimized with the local variation of LD in mind, with fewer markers in regions of high LD and denser genotyping where LD is negligible.
The two common measures of LD are r² and D´, and both are based on the basic disequilibrium measure D, which is the difference between observed and expected haplotype frequencies. D´ has the useful property of indicating loci that have not been separated by recombination: this is the case when D´ between the loci is equal to unity. The measure r², on the other hand, is a stricter indicator of correlation; while an r² value equal to 1 also signifies no recombination, it further implies that alleles at linked loci are strictly correlated, such that knowing the allele at one locus fully determines the allele at the other – or, put another way, one locus is a perfect proxy for the other (Devlin and Risch 1995).
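The difference between the two measures is easiest to see numerically. The sketch below uses the commonly used definitions, which are assumed here rather than quoted from this chapter: D´ is |D| divided by the largest value D could take given the allele frequencies, and r² is D² divided by the product of all four allele frequencies. The function name `ld_measures` and the haplotype frequencies are hypothetical.

```python
def ld_measures(f_ab: float, p_a: float, p_b: float) -> tuple[float, float, float]:
    """Return (D, D', r^2) for two biallelic loci.

    f_ab is the observed frequency of the AB haplotype, p_a and p_b the
    frequencies of alleles A and B. D' is reported as an absolute value.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = f_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared


# Three of the four haplotypes present (AB 0.2, aB 0.3, ab 0.5, Ab absent):
# no recombinant haplotype is seen, yet the allele frequencies differ.
_, dprime_three, r2_three = ld_measures(f_ab=0.2, p_a=0.2, p_b=0.5)

# Only two haplotypes present (AB 0.3, ab 0.7): each locus predicts the other.
_, dprime_two, r2_two = ld_measures(f_ab=0.3, p_a=0.3, p_b=0.3)

print(dprime_three, r2_three, dprime_two, r2_two)
```

In the first case D´ is 1 but r² is only 0.25, because allele B also occurs on haplotypes carrying a; in the second case both measures equal 1 and either locus is a perfect proxy for the other.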
1.2.2 Block-Like Structure of Linkage Disequilibrium
With the realization that characterizing LD is important in gene-mapping association studies, several groups have recently published reports on the LD structure in the human genome. A seminal paper in 2001 analysed variation patterns across a 500kb segment of chromosome 5q31, and described in detail the underlying LD structure (Daly et al 2001). Daly and colleagues reported that the disequilibrium pattern of the region falls into a series of discrete high-LD blocks (haplotype blocks) ranging from 3 to 92kb in length. Each block displays low diversity, with only 2 to 4 common haplotypes per block, and there is no evidence of historical recombination within a block. Blocks are separated by short distinct intervals of low LD that were believed to be historical recombination hotspots, and correlations between blocks give rise to long-range LD. The authors also reported that a subset of SNPs within each haplotype block could be selected to distinguish between common haplotypes in a block, firmly establishing the concept of “tag SNPs”.
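The idea of tag SNPs can be illustrated with a toy search for the smallest set of SNP positions that tells the common haplotypes of a block apart. This is a brute-force sketch for illustration, not the selection procedure used by Daly and colleagues; the function name `smallest_tag_set` and the four 6-SNP haplotypes are hypothetical.

```python
from itertools import combinations


def smallest_tag_set(haplotypes: list[str]) -> tuple[int, ...]:
    """Return indices of a smallest set of SNP positions that distinguishes
    every haplotype in the list from every other one."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    n_snps = len(haplotypes[0])
    if any(len(h) != n_snps for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNPs")
    if len(set(haplotypes)) != len(haplotypes):
        raise ValueError("duplicate haplotypes cannot be distinguished")
    # Try subsets in order of increasing size; the first hit is a smallest set.
    for size in range(n_snps + 1):
        for subset in combinations(range(n_snps), size):
            signatures = {tuple(h[i] for i in subset) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return subset
    raise AssertionError("unreachable: all SNPs together separate distinct haplotypes")


# Hypothetical block of 6 SNPs with 4 common haplotypes.
block = ["ACGTAC", "ACGTGT", "GTGTAC", "GTATGT"]
tags = smallest_tag_set(block)
print(tags)
```

Here 2 of the 6 SNPs are enough to identify all 4 haplotypes, so genotyping those 2 captures the common variation in the block. The exhaustive search grows exponentially with the number of SNPs and is only practical for small blocks.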
1.2.3 The LD Structure of the Human Genome
The definition of haplotype blocks was formalised by Gabriel and colleagues, who framed blocks as segments of SNPs with statistically significant LD and used this definition to describe the haplotype block structure of the entire human genome (Gabriel et al 2002). This study provided a foundation for the construction of a haplotype map of the genome, which materialized as the HapMap Project (International HapMap Consortium 2005). This project genotyped one million SNPs in 270 samples from 4 populations: 90 individuals of African descent from Nigeria, 90 Caucasian individuals from the United States of America, 45 Han Chinese individuals from China and 45 individuals from Japan. From these data the common DNA variation within the 4 populations was analysed and compared. The consortium reported that the 3 non-African populations had fewer low-frequency alleles than the Nigerian samples, a pattern attributed to population bottlenecks in the history of the non-African samples. Chinese and Japanese allele frequencies were seen to be very similar, and the two sample sets were merged as a single population for analysis. Polymorphic sites were also largely shared between populations, with fewer than 21 out of a million SNPs unique to individual populations. With the higher-density SNP map used, the HapMap project reported the same structure of localised LD variation as before, with long discrete haplotype blocks and short historical recombination intervals. Using the haplotype block definitions established by Gabriel et al 2002, they reported that the average block lengths in the Oriental and Caucasian populations were similar (13kb