5 ABCA1 SNP Survey
5.1 Introduction
Evidence that ABCA1 gene mutations are responsible for the familial high density lipoprotein (HDL) deficiency disorders of Tangier disease and familial hypoalphalipoproteinemia (Bodzioch et al., 1999; Brooks-Wilson et al., 1999; Marcil et al., 1999; Remaley et al., 1999; Rust et al., 1999), together with the long-established finding from epidemiological studies that HDL levels are inversely related to coronary artery disease (CAD; Wang and Briggs, 2004), suggests that common ABCA1 genetic variations may explain phenotypic variation in HDL levels and CAD susceptibility in the general population. Association studies are one approach to investigating this notion, but to facilitate such studies, single-nucleotide polymorphisms (SNPs) in ABCA1 must first be identified.
When this research was initiated in early 2000, few ABCA1 SNPs had been reported in the literature. Pullinger et al. (2000) discovered -278G>C and -14C>T in the proximal promoter as well as 237indelG and 296C>G in the 5’ untranslated region (UTR) while mapping the transcriptional initiation site of the ABCA1 gene. In another report, Wang et al. (2000) found four missense SNPs (R216K, V825I, M883I and R1587K) and five silent coding SNPs (cSNPs; P312, G316, I680, L960 and T1427) during their cDNA-based resequencing efforts.
Here, we surveyed the sequence variation in the ABCA1 gene among Singapore Chinese, Malays and Indians using experimental and in silico strategies. A segment of the ABCA1 proximal promoter was amplified and resequenced. Two strategies were used to examine variation in the exonic regions. Individual exons were amplified and subjected to heteroduplex detection using denaturing high performance liquid chromatography
and candidate SNPs identified from regions of sequence overlaps. The results of the SNP discovery efforts are presented in chronological order.
5.2 Results
5.2.1 First SNPFINDER Analysis of ABCA1 UniGene ESTs (Early 2000)
In silico SNP discovery essentially mines SNPs from pre-existing DNA sequences. Public domain sequences such as ESTs provide a rich resource for SNP discovery (Buetow et al., 1999; Garg et al., 1999; Marth et al., 1999; Picoult-Newberg et al., 1999; Irizarry et al., 2000; Cox et al., 2001). SNPFINDER (Buetow et al., 1999) was chosen to perform in silico SNP mining in the ABCA1 gene because of its integration with the UniGene database, which consists of gene-specific ESTs and full-length mRNAs, and with public DNA electrophoretogram archives. The key steps in the SNPFINDER pipeline are basecalling of DNA traces using PHRED, which assigns a quality (Q) value to each base; assembly and alignment of multiple sequences using PHRAP; and finally, identification of candidate variants from regions of sequence overlap using a statistical analysis that takes sequence quality into account. By examining sequencing traces, bona fide allelic variants can be discerned from sequencing errors with greater confidence than by mere comparison of text-based sequences, for instance, using BLAST (Cox et al., 2001). Candidate variants flagged by SNPFINDER are assigned a score which directly reflects the probability that a position within a given assembly has heterogeneity in nucleotide composition (Buetow et al., 1999).
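The relationship between a PHRED quality value and the estimated basecall error probability can be sketched as follows. This is only an illustration of the standard PHRED definition, Q = -10 log10(p); the function names are hypothetical and are not part of PHRED or SNPFINDER.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Estimated probability that a basecall with quality value q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Quality value corresponding to an error probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must lie in (0, 1]")
    return -10 * math.log10(p)

if __name__ == "__main__":
    # Print the error probability for a few illustrative quality values.
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

On this scale, each increase of 10 in Q corresponds to a tenfold drop in the estimated error probability.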
When this survey was first conducted in early 2000, the ABCA1 UniGene cluster comprised two mRNAs and 60 non-redundant ESTs (Table 5.1). Most of the ESTs originated from normal tissues. DNA traces were available for approximately half (31/60) of the ESTs, and these, together with the mRNAs, were analyzed by SNPFINDER. Six
gene, which probably reflects the fact that cDNA synthesis is generally primed by oligo dT primers as well as the fairly long 3’ portion of the ABCA1 mRNA (Santamarina-Fojo et al., 2000). Since SNPFINDER was primarily developed for large-scale identification of SNPs, the original authors used a high arbitrary cutoff score of 0.99 in order to have a higher confirmation rate at the validation stage. Applying such a stringent cutoff in our case would have yielded no hits. Nevertheless, closer inspection of the multiple alignment revealed a highly probable A>G polymorphism (SNPFINDER score 0.96) at the position corresponding to nucleotide 8995 in the 3’UTR of the ABCA1 mRNA (numbering with respect to the mRNA reference sequence NM_005502). As illustrated in Figure 5.1, two out of nine ESTs originating from distinct tissue sources harbour the minor G variant. Moreover, the candidate variant is flanked by high quality bases (denoted by uppercase letters in the multiple alignment in Figure 5.1), further indicating the high likelihood of the SNP’s existence.
In contrast, the candidate variants at positions 9091, 9410, 10027, 10029 and 10032 were less likely to be true SNPs because they occurred as singletons, possessed low scores or were flanked by low quality bases (Figure 5.2). Alternatively, some of them might represent true but rare variants whose detection would require more members in the contig. The variant at nucleotide 9410 has since been confirmed in a Japanese population survey (Iida et al., 2001).
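The reasoning about singletons and low quality bases can be illustrated with a minimal sketch that tallies alleles at one alignment column. It follows the convention used in the figures, where uppercase letters denote high quality basecalls and lowercase letters low quality ones. The function name and the example column are hypothetical; this is not SNPFINDER's statistical scoring, only a simple count.

```python
from collections import Counter

def tally_column(column_bases):
    """Count alleles at one alignment column, using high quality calls only.

    Uppercase A/C/G/T are treated as high quality; lowercase letters,
    gaps and ambiguity codes are ignored.
    """
    counts = Counter(b for b in column_bases if b in "ACGT")
    ranked = counts.most_common()
    has_variant = len(ranked) > 1
    # Number of high quality reads supporting the second most common allele.
    minor_count = ranked[1][1] if has_variant else 0
    return {
        "counts": dict(counts),
        "has_variant": has_variant,
        "minor_count": minor_count,
        "singleton": minor_count == 1,
    }

# Hypothetical column: seven high quality A calls, two high quality G calls,
# and two low quality calls (lowercase) that are disregarded.
column = ["A", "A", "G", "A", "a", "G", "A", "A", "A", "g", "A"]
result = tally_column(column)
print(result)
```

A variant supported by a single high quality read is flagged as a singleton, which on its own cannot distinguish a rare allele from a sequencing or cloning artefact.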
To experimentally verify the existence of the 8995A>G candidate SNP, a short 109 bp segment containing the candidate SNP was amplified and subjected to single-strand conformation polymorphism (SSCP) analysis and sequencing. Figure 5.3 shows the three distinct band migration patterns on SSCP gels which match the three expected
single- [...] downstream of the stop codon, and ~1400 bp upstream of the first polyadenylation motif, AAUAAA (Santamarina-Fojo et al., 2000).
To assess the potential biological significance of 8995A>G, we searched for conserved sequences in the ABCA1 3’UTR. No sequence conservation was found among the 3’UTRs of the ABCA1 mRNAs from human, mouse, rat and chicken. This is not unexpected, since the 3’ ends of genes are generally more heterogeneous among species than protein coding sequences (Makalowski et al., 1996). The less conserved nature of 3’UTRs possibly confers flexibility in spatial and temporal aspects of gene regulation in a manner specific to the organism (Conne et al., 2000). Furthermore, a search against the 3’UTR database (Pesole et al., 2002; http://bighost.area.ba.cnr.it/BIG/Blast/BlastUTR.html) also revealed no other genes carrying sequence motifs similar to the ABCA1 3’UTR.
Allelic variants can create different structural folds in mRNA, leading to different
http://www.bioinfo.rpi.edu/applications/mfold/old/rna/) was utilized to investigate whether the A and G allelic variants at 8995 would impact the folding of the ABCA1 3’UTR. RNA secondary structures encoded by both variants appear similar to one another (Figure 5.4).
Table 5.1 List of mRNAs and ESTs in the human ABCA1 UniGene cluster Hs.211562 from a release in early 2000. “*” indicates ESTs with DNA traces (31 in total) that are available from public FTP archives (e.g. Washington University Genome Sequencing Centre) and which could be automatically retrieved by SNPFINDER for SNP detection.
AJ012376
AA902925* Smooth muscle AA328447 Whole embryo
N63586* Central nervous
Figure 5.1 A high confidence candidate 8995A>G SNP identified from the ABCA1 UniGene cluster Hs.211562 (early 2000 release). Only ESTs with DNA traces were analyzed with the SNPFINDER program (indicated by an asterisk in Table 5.1). These sequences were basecalled by PHRED, assembled by PHRAP and candidate variants identified from regions of overlap by DEMIGLACE (Buetow et al., 1999). (A) Position of the candidate SNP in the context of the multiple alignment. Two out of nine ESTs harbour the G variant. Upper and lower case letters in the alignment denote bases of high or low sequence quality respectively. (B) Representative DNA traces with the corresponding PHRED Q values at the variant position. The Q value is a measure of the quality of each basecall and is related to the error probability p by: Q = -10 log10(p).
Figure 5.2 SNPFINDER multiple alignment showing the low confidence candidate variants identified from members of the ABCA1 UniGene cluster Hs.211562. Candidate variants are highlighted in blue or green columns. (A) 9091G>A. (B) From left to right, 10027T>G, 10029G>A and 10032A>T. (C) 9410A>G. Upper and lower case letters in the alignment denote high and low base quality calls respectively. SNPFINDER scores ranged between 0.05 and 0.66. This analysis was conducted in
Figure 5.3 Experimental confirmation of 8995A>G, a novel SNP in the 3’UTR of the ABCA1 gene. (A) SSCP analysis of a short 109 bp amplicon flanking the SNP reveals three reproducible and distinctive band migration patterns. (B) DNA traces showing the three representative genotypes.
Figure 5.4A Lowest energy RNA structure predicted by MFOLD (Zuker, 2003)
Figure 5.4B Lowest energy RNA structure predicted by MFOLD (Zuker, 2003) for the 8995G variant.
To gauge the efficiency of the initial SNPFINDER analysis as well as to potentially uncover more ABCA1 SNPs, a second analysis was conducted on a later release of the ABCA1 UniGene cluster (build 156, release 28 Sep 2002). By this time, the number of sequences in the UniGene cluster had increased from 62 to 87 (Table 5.2). There was also an increased number of ESTs derived from tumourigenic sources, attributable to initiatives to catalogue genes expressed in cancers (Strausberg et al., 2003).
Seven high confidence candidate variants with scores of at least 0.96 were identified (Figure 5.5). These include 8995A>G, which had been confirmed earlier. Singletons were recorded at positions 8375 (Figure 5.5B) and 8517 (Figure 5.5C), whereas multiple occurrences of the alternative variants were recorded at positions 8705 (four sequences with G vs 11 sequences with T, low quality sequences disregarded; Figure 5.5D) and 8720 (three sequences with G vs 12 sequences with T; Figure 5.5D). However, we also noted that these high scoring candidate variants resided exclusively in ESTs obtained from a single cDNA library, NIH MGC69 (sequences 601490437FL, 601491738FL and 601492503FL in Figure 5.5 correspond to ESTs BE880894, BE879545 and BE878485 respectively in Table 5.2), which originated from an undifferentiated large cell carcinoma (library information obtained from the IMAGE consortium, http://image.llnl.gov). None of these putative variants have been confirmed in dbSNP (build 123). It is plausible that they were acquired during propagation of the tissue in culture or represent mutations with a specific role in tumour development. Two candidate variants with high scores of 0.96, 8539C>T (Figure 5.5C) and 9410A>G (Figure 5.5F), were found in two ESTs, one of which was derived from a large cell carcinoma. In the first SNPFINDER analysis conducted in early 2000, we also detected
9410A>G variant then. Both 8539C>T (rs4149339) and 9410A>G (rs4149341) were first verified in the Japanese population (Iida et al., 2001), and multiple dbSNP (build 123) submissions have also been noted to date.
The remaining five candidate variants from the second SNPFINDER analysis registered lower scores ranging between 0.02 and 0.79. Among these was 10029G>A (Figure 5.5H), which was previously encountered in the first SNPFINDER analysis (Figure 5.2), whereas the other four candidate variants at positions 8570 (Figure 5.5C), 8673 (Figure 5.5D), 9097 (Figure 5.5E) and 9696 (Figure 5.5G) represented novel predictions from the second analysis. None of the low scoring putative variants have been documented in dbSNP (build 123). Two variants at positions 10027 and 10032 documented from the first SNPFINDER analysis (Figure 5.2) were not found in the second analysis because the original EST sequences in which these candidate sequence variants were initially identified had been withdrawn from the UniGene set.
Table 5.3 summarizes the comparison of the two SNPFINDER analyses conducted separately in 2000 and 2002. We combined information from dbSNP (build 123) and published reports (Iida et al., 2001) to determine the number of in silico variants that are likely to be true positives or negatives. By increasing the number of sequences in the alignment, the sensitivity of the in silico mining is higher since no false negatives were encountered in the second SNPFINDER analysis. On the other hand, using a cutoff score of 0.96, two false negatives from the first SNPFINDER analysis were
which was missed completely due to a lack of ESTs in this region. Although an expanded set of UniGene sequences was able to detect more true variants, more false positives were also generated, probably due largely to the source of the ESTs in which these variants were found.
Table 5.2 List of mRNAs and ESTs in the human ABCA1 UniGene cluster Hs.211562 from a later release (release 156, date accessed 28 Sep 2002).
Accession   Tissue source
AI356194    Glioblastoma (pooled)
N46182      Melanocyte
AW845151    Colon
R01051      Liver and spleen
BF574391    Muscle (skeletal)
AW364428    Denis drash
AA527406    Colon
AA826281    Germinal center B
AW364424    Denis drash
AI359714    Glioblastoma (pooled)
AI241822    Tumour, 5 pooled
AW364344    Denis drash
AA902925    Leiomyosarcoma
AA748860    Germinal center B
AW364342    Denis drash
AA493786    Thyroid
AA704305    Liver and Spleen
AW364331    Denis drash
AI399824    Pooled human melanocyte, fetal heart, uterus
AA731742    Germinal center b cell
AL048638    Brain
AW190098    Adenocarcinoma
AI819656    Squamous cell carcinoma
AA669024    Lung
AW130712    Adenocarcinoma
AA814091    Germinal center B cell
AA618276    Thyroid
AW044702    Pooled
AA625082    Pooled human melanocyte, fetal heart, uterus
AI344681    2 pooled tumours (clear cell type)
AA292158    Ovarian tumour
AA434152    Ovarian tumour
BG149600    Carcinoid
BM830709    Stomach
AA302670    Adipose tissue, white
BF951740    Nervous normal
BM978608    Primary lung epithelial cells
AA302777    Adipose tissue, white
BF988872    Placenta normal
BM153383    Leukopheresis
BE880894    Large cell carcinoma
AW601575    Breast
N94914      Multiple sclerosis lesions
BE879545    Large cell carcinoma
BF855659    Prostate normal
BM728651    Ocular tissues
BE878485    Large cell carcinoma
BF671104    Muscle (skeletal)
BI063291    Uterus_tumour
AV656040    Liver tissue
BF216316    Glioblastoma
N63586      Multiple sclerosis lesions
AV647223    Liver tissue
AU135588    Placenta
BG678861    Squamous cell carcinoma
BF094524    Uterus tumour
BF116114    Fibrotheoma
Figure 5.5 Continued from previous page. (D) From left to right: 8673T>G, 8705T>G and 8720T>G with scores of 0.16, 0.99 and 0.99 respectively.
Figure 5.5 Continued from previous page. (G) 9696C>T, 0.02. (H) 10029G>A, 0.38.
Table 5.3. Summary of two SNPFINDER analyses conducted separately in
identified in silico
6 12 Number of predicted variants with
5.2.3 SNP Survey in the ABCA1 Proximal Promoter
Resequencing of a 600 bp segment of the ABCA1 proximal promoter revealed seven sequence variants: -14C>T, -99G>C, -278C>G, -302C>T, -407C>G, -463C>T and -564T>C. Representative DNA traces for these SNPs, except -14C>T, are shown in Figure 5.6. Because only a limited number of individuals (n=16 per ethnicity) were sequenced, genotype and allele frequencies were not determined as they are unlikely to be representative of the true population frequencies. All variants except -463C>T were detected in multiple individuals in each local population sample. -463C>T appeared as a heterozygous variant in one Indian individual and thus could represent a rare or population-specific variant, or a PCR-induced mutation. Sequencing of a second, freshly amplified fragment confirmed that the singleton variant was genuine. -14C>T, -99G>C, -278C>G, -302C>T, -407C>G and -564T>C have been documented in numerous ABCA1 promoter surveys involving Caucasian (Pullinger et al., 2000; Zwarts et al., 2002; Probst et al., 2004; Tregouet et al., 2004) and Japanese individuals (Iida et al., 2002; Shioji et al., 2004; Yamakawa-Kobayashi et al., 2004).
To assess the potential biological significance of the ABCA1 promoter SNPs, we determined whether they affect known or putative consensus transcription factor binding sites. Previous work has identified a cholesterol-responsive element in the ABCA1 promoter, the DR4 element. It consists of two half sites of an imperfect direct repeat of TGACCT separated by 4 bp (TGACCGatagTAACCT) and is located in the region -70 to -55 bp. The DR4 element is critical for oxysterol activation of the ABCA1 gene by the nuclear receptor heterodimers LXR-RXR (Costet et al., 2000; Schwartz et al., 2000). Other experimentally mapped elements include the E-box centered at position -147 (Yang et al., 2002) and the GnT repeats (recognition motif for the zinc finger transcription
ABCA1 gene transcription. None of the seven promoter SNPs identified in this study disrupt any of these experimentally verified transcriptional regulatory elements.
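The structure of the DR4 element described above (two imperfect copies of the half site TGACCT separated by a 4 bp spacer) can be sketched as a simple scan that tolerates a limited number of mismatches per half site. This is an illustrative search, not the method used to map the element experimentally; the function names, the mismatch threshold and the flanking bases in the example are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_direct_repeats(seq, half_site="TGACCT", spacer=4, max_mismatches=1):
    """Return (start, matched_sequence, mismatches_1, mismatches_2) tuples.

    start is a 0-based offset into seq, not a promoter coordinate.
    """
    seq = seq.upper()
    k = len(half_site)
    span = 2 * k + spacer
    hits = []
    for i in range(len(seq) - span + 1):
        first = seq[i:i + k]
        second = seq[i + k + spacer:i + span]
        m1 = hamming(first, half_site)
        m2 = hamming(second, half_site)
        if m1 <= max_mismatches and m2 <= max_mismatches:
            hits.append((i, seq[i:i + span], m1, m2))
    return hits

# The DR4 sequence quoted in the text, with hypothetical flanking bases.
example = "CCC" + "TGACCGatagTAACCT" + "GGG"
hits = find_direct_repeats(example)
print(hits)
```

In the quoted element, each half site differs from TGACCT at exactly one position, which is why a threshold of one mismatch per half site is enough to recover it here.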
Putative transcription factor binding sites in the ABCA1 proximal promoter were identified by searching against the TRANSFAC database (Wingender et al., 1996) using MATCH (Kel et al., 2003). -407C>G lies in a segment that matches the consensus binding sites for hepatic nuclear factor 4 (HNF4) and c-REL, with core and matrix similarity scores of 0.88 and 0.78, and 1.00 and 0.87 respectively (Figure 5.7). HNF4, a liver-enriched transcription factor, controls a variety of genes involved in lipid and glucose metabolism such as the apolipoproteins AI, AII, B and CIII (Sladek and Seidel, 2001), and mutations of the HNF4α isoform underlie human diseases such as maturity-onset diabetes of the young (Yamagata et al., 1996) and non-insulin dependent diabetes mellitus (Nakajima et al., 1996). c-REL is a member of the NF-κB family of transcription factors and contains a potent transactivation domain (Chen and Green, 2004). The presence of this putative NF-κB recognition element in the ABCA1 proximal promoter is consistent with the role of inflammation in atherosclerosis.
To determine whether the proximal promoter SNPs affected evolutionarily conserved bases, we performed a comparative analysis of the promoter segments of ABCA1 orthologues. Figure 5.8 shows the multiple alignment of human, chimpanzee, dog, mouse and rat ABCA1 promoters. The human variant -99C>G targets an extremely conserved base, although no putative (Figure 5.7) or known transcription factor binding site is located here. In contrast, the sites corresponding to the human -14C>T, -463C>T and -564T>C variants have apparently diverged over evolutionary time; for instance, three different nucleotide variants are seen across species for -14C>T and -564T>C. The
conserved segment containing the -14C>T and -99C>G variants is attributed to the presence of the DR4, E-box, GnT and core promoter elements (Costet et al., 2000; Schwartz et al., 2000; Porsch-Ozcurumez et al., 2001; Langmann et al., 2002; Yang et al., 2002). We had earlier noted that -407C>G resides in a putative transcription factor binding site (Figure 5.7).
Figure 5.6 ABCA1 proximal promoter SNPs identified by direct
Figure 5.8 Sequence variants in the human ABCA1 gene proximal promoter shown in relation to the chimpanzee, dog, mouse and rat sequences. The
Figure 5.8 Continued from previous page. Sequence variants in the human ABCA1 gene proximal promoter shown in relation to the chimpanzee, dog, mouse and rat sequences.
Promoter
The large ABCA1 gene comprises 50 exons (Santamarina-Fojo et al., 2000), hence a SNP discovery strategy based entirely on resequencing would be time-consuming and costly. Therefore, for efficient SNP discovery in the ABCA1 exons, we used a screening technique, DHPLC, which is based on heteroduplex detection of sequence variants under partially denaturing conditions. Primers were designed to amplify most of the 50 exons individually, including the entire 5’UTR and the protein coding portion of exon 50. In addition, a fragment containing the newly identified exon 1A and distal promoter (Cavelier et al., 2001) was also analyzed. Collectively, 19,273 bp, including 7,162 bp of exonic sequences, over 49 fragments were subjected to DHPLC analysis. Sixteen DNA samples were screened per fragment in each ethnic group. This sample size is estimated to have >99% power to detect SNPs with a minor allele frequency of at least 10% (Kruglyak and Nickerson, 2001). Samples displaying differential DHPLC profiles were re-amplified from genomic DNA and sequenced in order to identify the nature and location of the sequence variant.
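The dependence of detection power on sample size can be illustrated with a simple sampling calculation. The sketch below assumes diploid individuals, independently sampled chromosomes and a screening method that detects every variant present in the sample; it is not necessarily the model used by Kruglyak and Nickerson (2001), and the function name is hypothetical.

```python
def detection_probability(n_individuals: int, maf: float) -> float:
    """Probability that at least one copy of the minor allele is present
    among 2 * n_individuals independently sampled chromosomes."""
    if not 0 <= maf <= 1:
        raise ValueError("minor allele frequency must lie in [0, 1]")
    n_chromosomes = 2 * n_individuals
    return 1 - (1 - maf) ** n_chromosomes

if __name__ == "__main__":
    # 16 individuals per ethnic group; 48 when the three groups are pooled.
    for n in (16, 48):
        print(n, round(detection_probability(n, 0.10), 4))
```

Under these assumptions, a variant with a minor allele frequency of 10% is sampled with a probability of about 0.97 among 16 individuals and above 0.99 among the 48 individuals pooled across the three groups.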
Six cSNPs, comprising five missense SNPs and one silent SNP, as well as two 5’UTR SNPs were identified. Figure 5.9 shows the DHPLC elution profiles of representative samples harbouring putative variants. The 5’UTR SNPs, 237indelG and 296C>G, were initially identified from the same PCR fragment during DHPLC analysis. None of these exonic SNPs are considered novel as they had been documented prior to or during the course of the DHPLC analysis (Pullinger et al., 2000; Wang et al., 2000; Clee et al., 2001). All exonic SNPs were detected in at least one representative DNA sample from each of the local ethnic groups.