IDENTIFICATION AND CHARACTERIZATION OF
CONSERVED CIS-REGULATORY ELEMENTS
IN THE HUMAN GENOME
ALISON P LEE
B Computing (Computer Science) (Honours)
National University of Singapore, 2004
A THESIS SUBMITTED FOR THE DEGREE OF
Acknowledgements
There are several people I would like to acknowledge:
Firstly, my advisor Professor Byrappa Venkatesh for being an excellent and patient mentor, and for devoting so much time and energy to helping me improve my
reasoning, writing and presentation skills;
Dr Sydney Brenner and Dr Ng Huck Hui (GIS, Singapore), members of my thesis advisory committee, for their scientific advice;
Dr Alice Tay for her constant encouragement and discussions at work and
conferences, and her work on the fugu Hox clusters project;
Esther Koh for her work on the fugu Hox clusters project and Yang Yuchen for his work on the TFCONES project;
Gene Yeo, Eddie Loh, Nidhi Dandona for showing me the ropes in bioinformatics;
Luo Ming, Wang Jianli, Kevin Lam for useful software/hardware discussions and troubleshooting;
Krish Jon Mathavan, Elizabeth Yeoh, Tay Boon Hui and Sumanty Tohari for
guidance in laboratory techniques;
Patrick Gilligan and Ravi Vydianathan for their valuable comments on manuscripts;
Data Storage and Cluster Computing teams at Bioinformatics Institute, led by Lai Loong Fong and Stephen Wong respectively, for speedy response to my requests for server fixes or software and database updates;
Arun Kumar, Zhao Zhiyang, Peng Huiling, and Chua Yiwen at A*STAR Biological Resource Centre for their skilled work in DNA microinjections and mouse husbandry;
The members of IMCB Histopathology Unit, especially Keith Rogers and Susan Rogers for histology tips, protocols and equipment;
The staff of A*STAR Graduate Academy and NUS Graduate School for Integrative Sciences and Engineering for the handling of administrative affairs;
All present and former members of the Comparative Genomics Laboratory and DNA Sequencing Facility in IMCB, Singapore for making the laboratory a conducive and pleasant place to work in;
Finally, my greatest appreciation goes to my parents and my brother for their
strongest support.
Table of contents
Acknowledgements
Table of contents
Summary
List of tables
List of figures
List of abbreviations
Chapter 1: Introduction
1.1 The Human Genome Project
1.2 Cis-regulatory elements
1.3 Methods to identify cis-regulatory elements
1.4 Comparative genomics approach for identifying cis-regulatory elements
1.5 Transcription factors (TFs)
1.6 Objectives of my work
Chapter 2: Materials and methods
2.1 Identifying human, mouse and fugu TF-encoding genes
2.2 Identifying CNEs in TF-encoding gene loci
2.3 Computing CNE statistics of human TF-encoding genes
2.4 Enrichment analysis of CNEs within experimentally defined TFBS
2.5 Gene Ontology enrichment analysis of human TF-encoding genes
2.6 Expression analysis of human TF-encoding genes
2.7 Motif finding
2.8 Predicting TFBS in human-fugu CNEs
2.9 Building and implementing the TFCONES database
2.10 Sequencing and annotating the fugu Hox gene clusters
2.11 Functional assay of CNEs in transgenic mice
Chapter 3: Results – CNEs in transcription factor-encoding genes
3.1 TF-encoding genes in human, mouse and fugu
3.2 Conserved clusters of TF-encoding genes in human, mouse and fugu
3.3 Identification of human-mouse and human-fugu CNEs
3.4 Distribution pattern of human-fugu CNEs in the introns and flanking regions of TF-encoding genes
3.5 Presence of experimentally verified TFBS within human-fugu CNEs
3.6 Functional categories of human TF-encoding genes associated with human-fugu CNEs
3.7 Association between human-fugu CNEs and expression profiles of TF-encoding genes
3.8 Over-represented motifs in the human-fugu CNEs of central nervous system-expressing TF-encoding genes
3.9 Database of TFs and CNEs associated with TF-encoding genes
3.10 Discussion
Chapter 4: Results – CNEs in the Hox gene clusters
4.1 The Hox gene clusters
4.2 Fugu Hox loci and conserved syntenic blocks
4.3 CNEs in the Hox loci
4.4 Distribution of CNEs in fugu Hox loci
4.5 CNEs in the human Hox loci
4.6 Discussion
Chapter 5: Results – Functional assay of CNEs associated with a representative TF-encoding gene
5.1 Introduction
5.2 Functions and expression pattern of Lhx2 gene
5.3 Organization of the Lhx2 locus in vertebrates
5.4 CNEs in the Lhx2 gene locus
5.5 Expression patterns of Lhx2, Crb2 and Dennd1a in developing mouse embryos
5.6 Functional assay of Lhx2 CNEs in E11.5 transgenic mouse embryos
5.7 Site-directed mutagenesis of a predicted motif in Lhx2_CNE2/3
5.8 Discussion
Chapter 6: Discussion
Bibliography
Annexes
List of my publications
Summary
Comparative genomics is a powerful approach for identifying conserved cis-regulatory elements in the human genome. Since functional sequences evolve more slowly than the surrounding neutrally evolving regions, cis-regulatory elements can be identified as conserved noncoding elements (CNEs) in comparisons of human and other vertebrate genomes. In particular, comparison of human with distantly related vertebrates such as fishes increases the likelihood that most predicted CNEs are functional sequences. The objectives of my project were to identify all the CNEs associated with transcription factor (TF)-encoding genes in the human genome by comparison with pufferfish (fugu) sequences, analyze the characteristics of the CNEs and assay the functions of selected CNEs in transgenic mice. I started by building a curated database of all TF-encoding genes in human, mouse and fugu, and predicted CNEs (≥ 65% identity over 50 bp) associated with orthologous genes by locus-by-locus global alignments. In total, 1,738 human, 1,495 mouse, and 1,762 fugu TF-encoding genes were identified, with 1,145 genes having orthologs in all three genomes. Further analyses focused on the set of 816 DNA-binding TF-encoding orthologous genes. A total of 2,843 human-fugu CNEs (total length ~388 kb) were found to be associated with these TF-encoding genes. An online database named TFCONES (Transcription Factor Genes & Associated COnserved Noncoding ElementS; http://tfcones.fugu-sg.org/) was constructed to catalog the human, mouse and fugu TF-encoding genes, together with their associated CNEs. The TFCONES database should be useful to researchers interested in studying the regulation of TF-encoding genes and understanding gene regulatory networks in vertebrates.
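The CNE criterion quoted above (≥ 65% identity over 50 bp) can be illustrated with a sliding-window scan over a pairwise alignment. The following is only a minimal sketch under stated assumptions, not the pipeline used in this work: the function name `find_conserved_windows`, the treatment of gap columns as mismatches, and the merging of overlapping qualifying windows are all choices made here for illustration, and the input is assumed to be two already-aligned, equal-length strings with `-` for gaps.

```python
def find_conserved_windows(aln_a, aln_b, window=50, min_identity=0.65):
    """Return merged (start, end) alignment-column intervals (end exclusive)
    covered by at least one window whose identity is >= min_identity.

    Hypothetical helper for illustration only; gap columns ('-') are
    counted as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    # 1 where the column holds the same non-gap base in both rows, else 0
    match = [1 if a == b and a != "-" else 0
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    intervals = []
    if n < window:
        return intervals
    count = sum(match[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # slide the window one column to the right
            count += match[start + window - 1] - match[start - 1]
        if count >= min_identity * window:
            end = start + window
            if intervals and start <= intervals[-1][1]:
                # overlaps or touches the previous interval: extend it
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals
```

Because overlapping windows are merged, a reported interval is the union of qualifying windows, so its ends may include some poorly conserved columns; a real analysis would also need to exclude coding and repetitive sequence, which this sketch does not attempt.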
The human-fugu CNEs identified showed a significant overlap with experimentally verified transcription factor binding sites (TFBS) of known transcriptional activators and repressors, confirming that some CNEs function as transcriptional enhancers or silencers. In addition, functional enrichment analyses indicated that the CNEs are significantly associated with TF-encoding genes that are involved in regulating development, particularly the development of the central nervous system. Furthermore, expression profiling based on publicly available expression data showed that genes that are most highly expressed in central nervous system tissues are enriched with human-fugu CNEs. Motif discovery within the CNEs of TF-encoding genes that are most highly expressed in the central nervous system revealed four 8-mer motifs that are likely to be involved in transcriptional enhancer activity in the central nervous system. To verify the functions of CNEs and the motifs, I assayed the CNEs of a representative TF-encoding gene, the LIM homeobox gene Lhx2, in transgenic mice. Four out of eight CNEs tested demonstrated enhancer activity by recapitulating Lhx2 expression in the midbrain and hindbrain at embryonic day 11.5. Mutagenesis of a predicted motif in a selected CNE abolished gene expression in the neural tube and dorsal root ganglia, demonstrating that the motif is indeed critical for enhancer activity.
List of tables
Table 1 Fugu scaffolds (assembly v3.0) for the seven Hox loci
Table 2 Primer sequences of the eight Lhx2 constructs
Table 3 TF-encoding genes in human, mouse and fugu genomes
Table 4 Conserved clusters of human, mouse and fugu TF-encoding genes
Table 5 Top twenty TF-encoding genes associated with the highest density of human-fugu CNEs in human
Table 6 Top twenty TF-encoding genes associated with the highest density of human-fugu CNEs in fugu
Table 7 Top twenty TF-encoding genes associated with the highest number and total length of human-fugu CNEs
Table 8 Location of human-fugu CNEs in relation to the protein-coding sequence of nearest human TF-encoding genes
Table 9 Significantly over-represented and under-represented Gene Ontology (GO) terms (P < 0.01) of CNE-associated human TF-encoding genes
Table 10 Over-represented 8-mer motifs in human-fugu CNEs of cluster #3 human TF-encoding genes
Table 11 Conserved syntenic fragments at the fugu and human Hox loci
Table 12 Human-fugu CNEs in the fugu Hox loci
Table 13 Human-fugu CNEs in the human Hox loci
Table 14 Details of Lhx2 CNE constructs tested in transgenic mice
Table 15 Summary of the expression patterns directed by the CNEs tested
List of figures
Figure 1 A flowchart of protocols used for identifying human, mouse and fugu TFs, and human-mouse, human-fugu CNEs
Figure 2 Distribution of lengths of human-mouse and human-fugu CNEs
Figure 3 Plot of total length of CNEs associated with DNA-binding TF-encoding genes against the length of non-repetitive noncoding sequence (in kilobases; kb) of human or fugu gene locus
Figure 4 Plot of CNE density against the total length of CNEs associated with DNA-binding TF-encoding genes
Figure 5 CNEs in the MEIS2 gene locus
Figure 6 Distribution of UTR-intronic and internal-intronic human-fugu CNEs
Figure 7 TF-encoding genes predominantly expressed in the central nervous system are enriched with CNEs
Figure 8 List of human-fugu CNEs associated with a representative TF-encoding gene FOXA2
Figure 9 An image of the location of CNEs relative to the associated TF-encoding gene FOXA2
Figure 10 Conserved syntenic blocks at the fugu and human Hox loci
Figure 11 VISTA plot of the MLAGAN alignment of fugu HoxAa locus with human HoxA and mouse HoxA loci
Figure 12 VISTA plot of the MLAGAN alignment of fugu HoxDa locus with human HoxD and mouse HoxD loci
Figure 13 Profiles of CNEs at the human HoxA, HoxB, HoxC, and HoxD loci
Figure 14 Lhx2 gene loci in human, mouse, chicken and fugu
Figure 15 Syntenic genes surrounding Lhx2 in human, mouse and fugu
Figure 16 CNEs in the Lhx2 locus
Figure 17 Ten CNEs associated with the human LHX2 gene locus are located within the introns of the upstream gene DENND1A
Figure 18 Expression of Lhx2, Crb2, Dennd1a in E11.5 mouse embryos
Figure 19 Lhx2_CNE2/3 directs lacZ expression in the neural tube and dorsal root ganglia
Figure 20 Lhx2_CNE5/6 directs lacZ expression in the hindbrain and neural tube
Figure 21 Lhx2_CNE7 directs lacZ expression in the hindbrain and neural tube
Figure 22 Lhx2_CNE10 directs lacZ expression in the midbrain, hindbrain and neural tube
Figure 23 Mutation of the overlapping motifs in Lhx2_CNE2
Figure 24 Site-directed mutagenesis of a predicted motif in Lhx2_CNE2 results in significantly reduced lacZ expression in the neural tube and dorsal root ganglia
List of abbreviations
CNE conserved noncoding element
Myr millions of years
PBS phosphate buffered saline
PCR polymerase chain reaction
TFBS transcription factor binding site
TSS transcription start site
UTR untranslated region (of an mRNA)
Chapter 1: Introduction
1.1 The Human Genome Project
A major quest in biology is to understand the function and regulation of all the human genes and their role in human biology and diseases. The Human Genome Project, launched in 1990, was the first step towards accomplishing this ambitious scientific endeavor. The project had two major goals: (i) to determine the complete sequence of the human genome and (ii) to identify all the functional elements encoded by it. In 2001, the first goal was achieved when two draft sequences of the human genome were released (Lander et al. 2001; Venter et al. 2001). Since then, coordinated efforts have been made to close the gaps in the genome and to achieve high sequencing accuracy. As a result, the human genome sequence is now essentially complete (International Human Genome Sequencing Consortium 2004). The second goal, that is, to map all the functional elements in the human genome, is however far from completion. Although about 21,000 protein-coding genes have been identified in the human genome with high confidence (Clamp et al. 2007), the identification of noncoding functional elements poses a major challenge. Comparisons of the human and mouse genomes have revealed that approximately 5% of the human genome is under purifying selection and represents functional sequence (Waterston et al. 2002). Protein-coding genes account for 1.5% of the genome and the remaining 3.5% is functional noncoding sequence. Functional noncoding elements in the human genome include noncoding RNA genes, cis-regulatory elements involved in transcriptional regulation, splicing regulatory elements and sequences that dictate chromatin structure. Unlike protein-coding genes, which have a characteristic structure and genetic code that facilitate their identification, very little is known about the structure and organization of noncoding functional elements. Hence, their identification and verification remain a challenge. The main focus of my project is to identify and verify functional cis-regulatory elements in the human genome.
1.2 Cis-regulatory elements
Cis-regulatory elements are sequences that regulate the precise spatial and temporal expression of their target genes. While the genomic content of every cell in the body is largely the same, important biological processes such as development, differentiation, proliferation and apoptosis can take place because gene expression is differentially regulated in various cell types and at different developmental stages. Cis-regulatory elements comprise core promoters, proximal promoters, enhancers, silencers, insulators and locus control regions. The core promoter is the minimal region surrounding the transcription start site of a gene (extending ~35 bp on either side) that is sufficient to direct the initiation of transcription by the RNA polymerase II machinery (Butler and Kadonaga 2002). At the core promoter, RNA polymerase II, general transcription factors and other associated factors congregate and form the pre-initiation complex. The other classes of cis-regulatory elements contain multiple binding sites for sequence-specific DNA-binding transcription factors. The proximal promoter extends ~250 bp on either side of the transcription start site (Butler and Kadonaga 2002). Transcriptional enhancers, also known as cis-regulatory modules, and silencers contain binding sites for transcriptional activators and repressors respectively, and can be located far away from their target gene (Maston et al. 2006). Insulators prevent a neighbouring enhancer from acting on a gene promoter when the insulator is situated between them, or shield a gene from the encroachment of repressive heterochromatin (West et al. 2002). Locus control regions are composed of multiple cis-regulatory elements including enhancers, silencers, insulators, and matrix or scaffold attachment regions (Maston et al. 2006). They are capable of directing the tissue-specific expression of one or more linked genes to physiological levels in a copy-number dependent manner at ectopic sites, possibly by opening up a chromosome domain or by preventing the establishment of heterochromatin (Li et al. 2002). Among the various classes of cis-regulatory elements, transcriptional enhancers contribute the most to specificity in gene expression. Core and proximal promoters generally direct basal levels of transcription, while increases and nuances in gene expression are largely achieved by the highly coordinated interaction between transcriptional enhancers and their cognate upstream transcription factors. While a gene can have at most a proximal and a core promoter, it can potentially be associated with several enhancers, each containing a different combination of binding sites for different transcription factors. Hence, the total number of gene expression patterns that can be generated is significantly larger than the number of trans-acting factors encoded in the human genome (Maston et al. 2006).
Experimental studies on transcriptional enhancers have revealed some characteristic features of enhancers. Firstly, transcriptional enhancers are modular in nature. This means that distinct enhancers can act independently upon a single promoter at different times, in different cell types and in response to different external stimuli. This feature is best demonstrated by the regulated expression of the pair-rule gene even skipped (eve) in Drosophila. The expression of eve in seven thin stripes of the early blastoderm embryo is achieved by five different enhancers, and each enhancer contributes to expression in exactly one or two stripes (Andrioli et al. 2002). Secondly, transcriptional enhancers can be located in the 5’ or 3’ flanking regions, untranslated exonic regions or introns of a gene. They can even be situated at significant distances from the gene. For example, an enhancer that initiates and directs specific Sonic hedgehog (Shh) expression in the posterior regions of developing mammalian forelimbs and hindlimbs was found to be located 1 Mb away in an intron of a neighbouring gene (Lettice et al. 2003). Thirdly, enhancers act on their target genes in an orientation-independent way. An SV40 viral enhancer sequence that was cloned 1.4 kb upstream of a rabbit β-globin gene enhanced globin gene expression, as did the sequence when cloned 3.3 kb downstream (Banerji et al. 1981). Lastly, the prevailing model for long-range interaction between sequence-specific transcription factors at an enhancer and the basic transcriptional machinery at the promoter is a “looping-out” mechanism by which the DNA between the enhancer and the promoter “loops out” so that the enhancer is brought close to the promoter (Vilar and Saiz 2005).
In order to understand the regulation of human genes in different tissues and at different developmental stages, it is important to identify and characterize all the regulatory elements associated with human genes. A comprehensive catalog of cis-regulatory elements will not only help improve our understanding of how individual human genes are transcriptionally regulated, but also lead to an understanding of how certain mutations in human gene regulatory regions cause genetic diseases. Many disease-associated loci fall within noncoding regions that are potentially involved in gene regulation. For example, patients with preaxial polydactyly, a condition marked by extra digits on hands and feet, have point mutations in the long-range enhancer of the Shh gene described above. These point mutations cause Shh expression, which is usually asymmetrically localized to the posterior limb buds, to expand abnormally into the anterior region, thus resulting in extra digits (Lettice et al. 2003). In patients with Van Buchem disease, a homozygous recessive disorder that is characterized by a gradual increase in bone density and distortions in the face, mandible and head, entrapment of cranial nerves and excessive weight, a 52-kb deletion has been uncovered in human chromosome region 17p21. Within this deleted region resides a 250-bp long-range enhancer that is postulated to direct expression of sclerostin in the human adult skeleton. Sclerostin is a negative regulator that keeps adult bone formation in check. Loss of the enhancer results in significant loss of adult sclerostin expression and uncurtailed bone growth (Loots et al. 2005). Finally, a genomic deletion in human chromosome region 4q35 has been associated with facio-scapulo-humeral dystrophy (FSHD), one of the most common forms of muscular dystrophy. The deletion locus contains a matrix attachment region which blocks a nearby enhancer from activating the FRG1 (FSHD region gene 1 protein) gene in normal cells. The FRG1 gene encodes a putative splicing factor that is reportedly over-expressed in the myoblasts of FSHD patients. The deletion event truncates a D4Z4 repeat array adjacent to the matrix attachment region and disrupts the latter’s ability to block the enhancer from activating FRG1, causing an aberrant upregulation of FRG1 (Petrov et al. 2008). The role of FRG1 in FSHD is still unknown.
The above examples of cis-regulatory mutations that are associated with genetic diseases underscore the importance of cis-regulatory elements in maintaining the correct spatio-temporal expression of crucial genes. Besides gleaning new insights into genetic diseases from regulatory elements, clinicians have begun to use cis-regulatory elements to direct tissue-specific or sustained expression of transgenes in gene therapy approaches. For instance, inducible promoters that are sensitive to external stimuli like ionizing radiation or hypoxia are highly specific and have been proposed as tools to restrict expression of cytotoxic or tumoricidal genes to cancer cells (Goverdhana et al. 2005). Furthermore, insulators have been explored as a way to override chromatin position effects and to ensure sustained expression of transgenes introduced into the human genome during gene therapy (Recillas-Targa et al. 2004). Hence, cis-regulatory elements have become useful in understanding and correcting genetic disorders. In addition, the identification of cis-regulatory elements and of mutations that may occur in them has implications for understanding the evolutionary processes that give rise to population variation. At least 1% of all human genes are postulated to possess functional cis-regulatory polymorphisms, and these variations lead to considerable perturbation in gene expression levels and may thus impact various aspects of phenotype (Rockman and Wray 2002). For the reasons mentioned above, identifying and characterizing all the cis-regulatory elements in the human genome is indeed a high-priority endeavor.
1.3 Methods to identify cis-regulatory elements
Cis-regulatory elements such as transcriptional enhancers lack a well-defined structure similar to that of protein-coding genes. They are typically composed of clusters of binding sites for several different transcription factors, and there is very limited information on how these binding sites are arranged within the cis-regulatory element. Transcription factor binding sites are typically only 5 – 12 bp long and are degenerate, that is, variation is tolerated at some positions more than at others. Without a priori information about which transcription factors regulate a gene of interest, the identification of all the cis-regulatory elements associated with the gene is a complex task. Furthermore, cis-regulatory elements are position-independent and orientation-independent, hence searches for cis-regulatory elements are required to cover extensive stretches of the genome and both strands of the DNA. The challenge is not only to identify a cis-regulatory element, but also to dissect the core sequences of a cis-regulatory element that are crucial for its function and to determine their mode of action. The methods that have been used to date in identifying and characterizing cis-regulatory elements can be broadly categorized into traditional methods and genomic-era strategies that became possible after the availability of whole genome sequences of human and other organisms.
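The two search requirements noted above, degenerate binding sites and both DNA strands, can be made concrete with a short sketch. It is an illustration only and not a method taken from this work: the function names are hypothetical, the motif is written in standard IUPAC ambiguity codes, and the consensus WGATAR used in the usage note is a commonly cited GATA-factor site chosen purely as an example.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(seq, motif):
    """Return sorted (start, strand, site) hits of a degenerate motif.

    `start` is always a 0-based forward-strand coordinate; `site` is the
    matched sequence read 5'->3' on the strand where it was found.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    # A lookahead lets overlapping occurrences be reported as well
    pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in motif.upper()) + "))")
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    n, k = len(seq), len(motif)
    for m in pattern.finditer(reverse_complement(seq)):
        # convert the reverse-strand offset back to a forward-strand start
        hits.append((n - m.start() - k, "-", m.group(1)))
    return sorted(hits)
```

For example, `scan_both_strands("GGCTTATCAA", "WGATAR")` finds no site on the forward strand but reports one on the reverse strand, which a forward-only search would miss. A regular expression treats every allowed base as equally good; position weight matrices are the usual way to express that some substitutions are tolerated more than others.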
1.3.1 Traditional strategies
Traditional approaches to identifying cis-regulatory elements can be described as either biochemical or genetic methods. The first biochemical method to study protein-DNA interaction is the DNA footprinting assay (Galas and Schmitz 1978). The aim of this method is to identify the binding sites of a particular protein within a DNA sequence. The protein of interest is added to PCR-amplified DNA that has been singly end-labeled and the mixture is subjected to cleavage (either by an endonuclease or a chemical reagent). The resulting DNA fragments are separated by gel electrophoresis alongside the fragments obtained from a control DNA sample without addition of the protein. The two DNA samples yield different cleavage patterns, with the protein-bound DNA sample missing bands at the DNA sites (“footprints”) where the protein has bound and protected the DNA from cleavage. To identify the binding sites, the positions of the footprints are approximated in relation to the radiolabeled end. The advantage of this method is that single-nucleotide resolution binding sites can be obtained when a highly reactive and sequence-independent chemical reagent such as hydroxyl radical is used to cleave the DNA (Jain and Tullius 2008). Another biochemical method that describes protein-DNA interaction is the electrophoretic mobility shift assay (EMSA), introduced in the 1980s. EMSA resembles the DNA footprinting assay in that it detects binding of a protein of interest to an end-labeled DNA sequence based on movement of the mixture along a polyacrylamide gel. The difference is that there is no cleavage step, and hence a single DNA band is observed after gel electrophoresis as opposed to a series of bands as in DNA footprinting. DNA molecules to which proteins have bound move more slowly in the gel than the control DNA with no protein, resulting in a ‘band shift’. The advantage of EMSA is that multiple binding sites can be detected by applying increasing protein concentrations that will then result in more pronounced band shifts. A limitation of the above two methods is that both assays require prior knowledge of a putative cis-regulatory element and a potential DNA-binding protein. In addition, the step of protein-DNA interaction is carried out in vitro, which may not necessarily reflect in vivo events.
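The readout of the DNA footprinting assay described above amounts to comparing two lanes and reporting the runs of bands that are present in the control but missing from the protein-bound sample. A toy sketch of that comparison follows. It assumes an idealized gel in which each band is recorded as an integer fragment length measured from the labeled end and the cleavage reagent cuts at every nucleotide; the function name and the input representation are hypothetical and are not taken from this work.

```python
def find_footprints(control_bands, bound_bands):
    """Return (first, last) runs of fragment lengths seen in the control
    lane but absent from the protein-bound lane.

    Idealized illustration: bands are integer fragment lengths measured
    from the labeled end, and consecutive missing lengths are grouped
    into one protected region.
    """
    missing = sorted(set(control_bands) - set(bound_bands))
    runs = []
    for pos in missing:
        if runs and pos == runs[-1][1] + 1:
            # adjacent to the previous missing band: same footprint
            runs[-1] = (runs[-1][0], pos)
        else:
            runs.append((pos, pos))
    return runs
```

With a control lane containing every length from 1 to 20 and a bound lane lacking lengths 8 to 13, the function reports a single footprint `(8, 13)`. Real gels show reduced rather than absent band intensity, so an actual analysis would compare intensities instead of presence or absence.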
One example of a biochemical assay that captures the chromatin state of DNA sequences in vivo is the DNase I hypersensitivity assay (Keene et al. 1981; McGhee et al. 1981). In the eukaryotic nucleus, DNA is coiled around histone octamer complexes at regular intervals to form nucleosomes. These nucleosomes serve to pack the large eukaryotic genome into the nucleus and also influence important events like gene transcription and DNA replication. Modifications to histone proteins, like trimethylation of histone H3’s lysine 4 and acetylation of histone H3 (Xi et al. 2007), can lower the DNA’s affinity for nucleosomes and displace nucleosomes, creating a state of open chromatin and making the underlying DNA susceptible to DNase I cleavage. The displacement of nucleosomes also implies that the DNA is accessible to binding by proteins such as transcription factors. Hence, DNase I hypersensitivity marks various functional noncoding elements, including cis-regulatory elements, origins of replication, recombination elements and structural sites of telomeres and centromeres (Cereghini et al. 1984; Gross and Garrard 1988). The advantage of using DNase I hypersensitivity to find potential cis-regulatory elements is that DNase I hypersensitivity is a transient property of DNA sequences, and hence this method is useful in the study of spatial and temporal patterns of transcriptional activity. The cis-regulatory elements that are active in a particular cellular phenotype can also be identified by carrying out the assay on samples derived from that phenotype. Since the assay was introduced, the level of resolution has greatly improved from ~500 bp on either side of the nucleosome-free region using Southern transfer followed by indirect end-labeling (Wu 1980) to nearly nucleotide resolution using PCR assessment (Yoo et al. 1996) and quantitative PCR (McArthur et al. 2001). Nevertheless, DNase I hypersensitivity implies the presence of a cis-regulatory element but does not demonstrate its function.
Chromatin immunoprecipitation (ChIP) is a method that identifies the sites of transcription factor-DNA interaction in vivo. DNA-binding proteins are covalently cross-linked to genomic DNA by the addition of formaldehyde to living cells. The genomic DNA is extracted from the cells and fragmented by sonication to lengths in the range of 100 – 500 bp. An antibody highly specific for the transcription factor of interest is used to precipitate all the DNA fragments to which the protein is bound. Protein-DNA crosslinking is reversed and the DNA fragments are then purified. A ChIP library is constructed whereby the purified DNA fragments are cloned into a vector and sequencing is carried out using vector primers. Although the ChIP technique is effectively unbiased in its search space for cis-regulatory elements and can be used to detect protein-DNA interactions in a particular type of cells, there still remain several disadvantages to this technique. A highly specific antibody must be raised against the transcription factor of interest, which is not a trivial task, especially for transcription factors that belong to larger families of structurally similar transcription factors. In addition, more in-depth analyses have to be carried out to map the actual bases of contact between the transcription factor and DNA within the larger DNA fragment. Finally, some of the detected interactions may be indirect interactions, meaning that another DNA-binding factor has interacted with the transcription factor being studied.
A more accurate and reliable way to verify a putative cis-regulatory element is to
examine its ability to direct gene expression This goal is achieved through the
reporter gene assay In reporter gene assays aimed at verifying enhancers, the
sequence of interest is cloned upstream of a reporter gene linked to a core promoter Examples of reporter genes include genes that encode β-galactosidase, green
fluorescent protein and firefly-luciferase The construct is incorporated into cultured cells through transient or stable transfection and this is a fast method to measure or
examine the expression of the reporter gene as directed by the cis-regulatory element
being studied (Carey and Smale 2000; Himes and Shannon 2000) By varying the
length of the putative cis-regulatory element, the minimal cis-regulatory element can
be identified Silencers are tested in a similar way as enhancers except that the core promoter used is one that drives strong reporter gene expression so that any resulting reduction in expression level can be readily detected (Maston et al 2006) On the other hand, more complex arrangements are required in order to verify putative
insulators and locus control regions A putative insulator is cloned between a known
Trang 23enhancer and a core promoter to demonstrate its enhancer-blocking ability or tested in transgenic assays to test its heterochromatin-barrier ability (West et al 2002), while a putative locus control region must be verified in transgenic assays to see if it can overcome chromosomal position effects (Maston et al 2006) Two limitations of conducting reporter gene assays in cell lines are that developmentally active
enhancers cannot be tested in cell lines and cell lines may not be available for all types of cells To overcome these limitations, transgenic systems are used in place of
cultured cells In addition, transgenic systems allow the researchers to show that a
cis-regulatory element indeed functions in living organisms The plasmid construct is first linearized and then injected into mouse or zebrafish embryos, after which the
construct becomes randomly integrated into the genome, for example as demonstrated
by Kothary et al. (1989) and Muller et al. (1997). The researcher then monitors the reporter gene expression during development. The advantage of transgenic reporter
assays is that the effect of putative cis-regulatory elements can be studied in vivo and
to date, this remains the most accurate and effective way to prove that an element is functional. However, this method does have its disadvantages. Firstly, because
integration of the transgene into the host genome occurs at random locations, in vivo expression of the reporter gene may not reflect the actual in vivo expression driven by the cis-regulatory element, due to position effects (whereby the reporter gene
inadvertently falls under the control of endogenous cis-regulatory elements or
heterochromatin), thus leading to ectopic gene expression. Hence, it is crucial to ensure that a similar expression pattern is obtained in several independent transgenic
lines before a conclusion is made about the putative cis-regulatory element being
tested. Secondly, microinjections of the transgene into mouse or zebrafish embryos are carried out at the one-cell stage so that ideally, all the cells in the resulting
organism contain a copy of the transgene in their genomes. However, mosaicism arises if cell divisions begin before the transgene can be integrated into the genome, especially
in the rapidly dividing zebrafish embryo. This means that insertion of the transgene into the host genome occurs in only a subset of all cells and reporter gene expression
can only be observed in subpopulations of cells. Nevertheless, the analysis of
cis-regulatory elements by examining mosaic patterns of reporter gene expression in microinjected transient transgenic fish embryos is fairly rapid (Muller et al. 2002). Furthermore, these assays are now aided by techniques such as transposon-based gene transfer (Grabher and Wittbrodt 2007; Kawakami 2007). After transgenic assays have
been carried out and a cis-regulatory element is shown to direct gene expression to
specific tissues at particular developmental stages, subsequent experiments include making constructs with serial deletions or site-directed mutagenesis to find the critical
segments of the cis-regulatory element. These experiments are very tedious and
time-consuming and are appropriate for only small sets of genes.
1.3.2 Genomic era strategies
Ever since the release of the human genome, methods of detecting cis-regulatory
elements have gradually become high-throughput, capable of detecting hundreds to thousands of elements in one experiment. Typically, these methods make use of the completed human genome as a reference to which large numbers of experimentally obtained sequences can be mapped or from which numerous microarray probes can be designed for large-scale hybridization. The DNase I hypersensitivity assay and ChIP assay take advantage of the availability of genome sequences in similar ways, to identify DNase I hypersensitive sites and regions bound by transcription factors on a whole-genome level. Massively parallel signature sequencing was used to generate
many short sequence tags from a genomic DNase library derived from quiescent human CD4+ T-cells and ~14,000 DNase I hypersensitive sites were identified
(Crawford et al. 2006b). In the same year, NimbleGen tiling microarrays designed from the ENCODE regions that make up 1% of the human genome were used to identify captured DNase I-digested ends in primary and immortalized human cell types (Crawford et al. 2006a; Sabo et al. 2006). More recently, a high-resolution whole-genome map of chromatin perturbations in primary human CD4+ T-cells was constructed using Solexa and 454 sequencing platforms in combination with
NimbleGen tiling arrays and almost 95,000 sites were identified (~2.1% of the
genome) (Boyle et al. 2008).
Recent extensions of the original ChIP assay allow the researcher to map transcription factor binding sites on a genome-wide scale. In ChIP-on-chip, a method also known
as genome-wide location analysis, the protein-bound enriched DNA fragments are hybridized to a tiling microarray. This method was applied to human chromosomes 21 and 22 to find the binding sites for three transcription factors – Sp1, c-Myc and p53 (Cawley et al. 2004). Although ChIP-on-chip has become a fairly high-throughput way to identify sequences across the genome due to availability of commercially-offered microarray chips (e.g., NimbleGen Human ChIP-chip 2.1M array, Affymetrix GeneChip® Human Tiling 2.0R array), there are problems of cross-hybridization especially for large mammalian genomes. In addition, the method predicts large numbers of binding sites and only a fraction is expected to be truly functional, hence statistical methods are required to attach a statistical significance to each site as an indicator of its functional relevance (Buck and Lieb 2004). In ChIP-PET, the enriched DNA sequences are sequenced through the paired-end ditag sequencing method (Loh
et al. 2006). This method has been applied successfully to identify the binding sites for c-Myc in human B cells (Zeller et al. 2006). The latest method that makes
effective use of next generation sequencing technologies is ChIP-seq. In ChIP-seq, the purified bound DNA is analyzed by massively parallel short-read sequencing. The short reads are then mapped back to the reference genome. For example, the binding sites of STAT1 transcription factor in human HeLa S3 cells that have been stimulated
by interferon gamma were mapped using ChIP followed by sequencing on the Solexa platform (Robertson et al. 2007).
Studies into the genome-wide distribution of histone acetylation and methylation marks have suggested that certain histone modifications are a good indicator of the
presence of cis-regulatory elements, just as DNase I hypersensitive sites are known to
be correlated with certain histone modifications. For example, the relative levels of histone modification across the genome can be quantified using ChIP-SAGE, which is
a combination of chromatin immunoprecipitation to target modified histone proteins and the technique of serial analysis of gene expression to sequence the precipitated DNA sequences. This has been done for H3 histones diacetylated on lysines 9 and 14 (denoted H3K9acK14ac) in human peripheral T-cells, located in genomic regions called histone acetylation islands (Roh et al. 2005). The H3K9acK14ac mark is
correlated with known and active gene promoters, enhancers and locus control
regions. Furthermore, luciferase assays of randomly selected histone acetylation islands showed that 43% (39/90) of them could function as enhancers in human Jurkat T-cells (Roh et al. 2007). Recently, when other types of histone modifications were investigated, 17 different histone acetylation and methylation marks were all found to
be present in the promoters of a set of ~3,300 genes that are more highly expressed
than those genes that lacked the histone marks (Wang et al. 2008). In a separate study, the distribution patterns of several types of acetylation and methylation modifications
on selected lysine residues on histones H3 and H4 were determined using ChIP-on-chip on 44 different loci in the human genome. The patterns showed that active
promoters are associated with H3K4 trimethylation while enhancers are associated with H3K4 monomethylation (Heintzman et al. 2007). An algorithm was designed to scan the human genome based on this finding and ~220 active promoters and ~420 enhancers were predicted. Almost 90% of promoter predictions mapped to known transcription start sites and 63% of enhancer predictions corresponded to previously reported DNase I hypersensitive sites or binding sites of co-activator p300 and
Mediator subunit TRAP220, showing that these histone methylation features are
useful in predicting specific cis-regulatory elements (Heintzman et al. 2007). As in the
case of genome-wide DNase I hypersensitivity assays, the advantage of genome-wide profiling of histone modifications is that active and repressive chromatin domains which may be dynamically changing can be demarcated in different cell types and under different cellular conditions. Nevertheless, additional experiments are required
to identify the transcription factors responsible for creating the regions of open or
closed chromatin and the essential portions of the cis-regulatory elements.
A technique known as chromosome conformation capture (abbreviated as “3C”) is used to detect intrachromosomal and interchromosomal interactions in a cell’s natural state (Dekker et al. 2002; Tolhuis et al. 2002) and has the aim of detecting long-range enhancers that, together with their associated transcription factors, are brought into close proximity of the core promoter of their target gene. The 3C protocol is
accomplished in five steps (Simonis et al. 2007). Firstly, similar to the ChIP protocol,
formaldehyde is used to fix cells and form cross-links between interacting proteins and DNA. Secondly, the nuclei are isolated from the cells and DNA is digested with a restriction enzyme. A ligation step is then carried out to promote intramolecular ligation between interacting DNA fragments. The cross-links are reversed and ligation frequencies are analyzed by quantitative PCR using primers specific to the locus of interest (known as the “bait” sequence). Because 3C is only suitable for studying a small number of chromosomal interactions, 3C-carbon copy (abbreviated as “5C”) was developed whereby a multiplex ligation-mediated amplification step is included after the cross-links are reversed (Dostie et al. 2006). This step uses multiplex primers that are composed of universal primer sequences like T7 and T3 plus ligation junction sequences to amplify selected ligation junctions. A quantitative replicate of the initial 3C library is hence created and finally analyzed through high-throughput sequencing
or tiling arrays. Both 3C and 5C require the researcher to painstakingly design primers specific not only to the bait sequence, but also to every possible restriction fragment (known as the “captured” sequence), hence circular 3C (abbreviated as “4C”)
(Simonis et al. 2006; Zhao et al. 2006) was developed that does not require primers for captured sequences. 4C uses the same 3C protocol, except that it uses a six-base cutter for the first restriction digest, and after ligation between restriction fragments and reversal of cross-links, it uses a four-base cutter in a second restriction digest to produce smaller restriction fragments and also trim ligation junctions. This is
followed by a self-circularization step that is more likely to occur since the DNA is no longer bound to proteins. Inverse PCR using bait-specific primers is then carried out and the products are analyzed by high-throughput sequencing or custom microarrays with probes that target sequences flanking recognition sites of the first restriction enzyme (Simonis et al. 2007). A caveat of the suite of 3C and 3C-derived techniques
is the large number of cells required, because at most two instances of
chromosomal interactions at a locus of interest can be captured per diploid cell.
Another disadvantage of these methods is that they provide information about the frequency but not the functionality of DNA interactions, hence further analysis is required to characterize these DNA interactions (Simonis et al. 2007).
In contrast to the above large-scale biochemical methods, in silico prediction of
cis-regulatory elements helps the researcher to narrow the search space of genomic
sequences so that traditional experimental methods can be applied on a manageable
scale. Because all cis-regulatory elements contain transcription factor binding sites
(TFBS), predicting individual TFBS or organized clusters of TFBS is a logical step to computationally analyzing a gene locus or the whole genome. Databases like
TRANSFAC and JASPAR catalog TFBS obtained from scientific literature and/or position weight matrices (PWMs) that are derived from known TFBS. TFBS
prediction can be achieved by either string matching against known TFBS or
statistical matching against PWMs. To create a PWM, alignments of known binding sites of a transcription factor are first constructed and the frequency of observing a particular nucleotide at each base of the alignment is represented as a log-likelihood ratio relative to the genomic background. To score an input sequence against a PWM, the sequence is scanned on both strands for possible sites, the log-likelihood ratios of all bases in the site are summed up and sites whose scores exceed a user-defined threshold are then reported. Several programs that search TFBS databases have been used widely, for example, MatInspector (Cartharius et al. 2005), P-Match
(Chekmenev et al. 2005) and TESS (Schug 2003). The power of computational
prediction of TFBS is that the search is not restricted to a single transcription factor,
or only selected genomic regions. However, TFBS databases are rarely complete. Some transcription factor families may be under-represented compared with others or all the TFBS of a particular transcription factor may not have been comprehensively
identified due to the degenerate nature of TFBS. For this reason, de novo motif
finding has been proposed as an alternative way of finding novel TFBS without a priori information about the transcription factors that regulate the gene of interest. For example, a binding site for SIX3 in a forebrain enhancer of SHH that was verified by
a supershift assay could not be detected by TRANSFAC analysis (Jeong et al. 2008). Secondly, defining the threshold score is not a trivial task. An appropriate threshold score has to strike a reasonable tradeoff between sensitivity (to minimize false
negatives) and specificity (to minimize false positives). As not all sites that
correspond very well to known TFBS or PWMs are functional, TFBS prediction tends
to result in many false positives and hence, TFBS prediction is usually used in
conjunction with other criteria that provide additional support to a putative
cis-regulatory element. Firstly, Harbison et al. (2004) enforced a threshold score of 60%
of the maximum possible score for a particular transcription factor, and also added an additional requirement that the sequences must reside near a sequence bound by the same transcription factor as verified by ChIP. Secondly, Loots and Ovcharenko
(2004) developed the program rVISTA to identify TFBS that are also evolutionarily conserved, in a bid to reduce the rate of false positives while maintaining reasonable sensitivity. rVISTA accepts input alignments from various sources like the BLASTZ program, zPicture, ECR Browser and Penn State University’s GALA database and identifies TFBS that are interconnected in the BLASTZ alignment and highly
conserved (>80% identity) in 20-bp sliding windows. In addition, the TFBS are
clustered for individual or combinations of transcription factors by a filtering scheme
that identifies evolutionarily conserved regions that contain a minimum number of sites for at least a few transcription factors. Lastly, Blanchette et al. (2006) combined TFBS prediction, sequence conservation in rodents, and the presence of clustered
binding sites of one to five different transcription factors to identify ~118,000
cis-regulatory modules across the human genome.
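
The PWM construction and scanning procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used by MatInspector, P-Match or TESS: the function names, the uniform background frequencies, the pseudocount and the example binding sites are all assumptions made for the example, and input sequences are assumed to contain only the uppercase letters A, C, G and T.

```python
import math

# Hypothetical helper names; uniform background and pseudocount are assumptions.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-likelihood-ratio PWM from equal-length aligned binding sites."""
    background = background or {b: 0.25 for b in "ACGT"}
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in "ACGT"
        })
    return pwm


def scan(seq, pwm, threshold):
    """Score every window on both strands; report (start, strand, score) hits.

    Start positions are given in forward-strand coordinates.
    """
    w = len(pwm)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - w + 1):
            score = sum(pwm[j][s[i + j]] for j in range(w))
            if score >= threshold:
                start = i if strand == "+" else len(seq) - w - i
                hits.append((start, strand, score))
    return hits


# Toy example: four made-up sites for an imaginary factor.
example_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
example_pwm = build_pwm(example_sites)
print(scan("GGGTATAAAGGG", example_pwm, threshold=5.0))
```

Raising the threshold trades sensitivity for specificity, which is the tradeoff discussed above; a rule such as a fixed fraction of the maximum possible score could be layered on top of `scan` by summing the largest entry of each PWM column.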
As an alternative to TFBS prediction programs that are restricted to existing
information in TFBS databases, de novo motif discovery has been proposed as a means to find novel TFBS. De novo motif discovery aims to identify patterns in
noncoding DNA that occur more frequently than expected and that signify a possible
underlying function. This is based on the observation that cis-regulatory elements tend
to possess multiple binding sites for the same transcription factor. Many de novo
motif finding algorithms have been introduced in recent years that differ from each other in various aspects – the definition of a motif, the criteria for statistical
overrepresentation and the method used to find the overrepresented motifs (Tompa et
al. 2005), for example, expectation maximization used by MEME (Bailey and Elkan 1995), Gibbs sampling used by AlignACE (Hughes et al. 2000) and exhaustive
enumeration of short k-mers by Weeder (Pavesi et al. 2004). Motif discovery is also
employed on the conserved noncoding elements (CNEs) of all human genes, or all noncoding regions associated with co-regulated or co-expressed genes. For example, motif finding within the mammalian-conserved regions of promoters and 3’
untranslated regions of all human genes yielded 174 motifs in promoters implicated in transcriptional regulation and 106 motifs in 3’ untranslated regions potentially
involved in post-transcriptional regulation (Xie et al. 2005). A search for motifs in the CNEs within 1 kb upstream of 18 groups of co-expressed genes (e.g., genes
expressed in neuronal tissue, pancreas, heart or skeletal muscle as indicated by their microarray profiles) resulted in 431 previously reported TFBS motifs and 579 novel motifs (Huber and Bulyk 2006). These studies have met with a certain amount of
success in identifying bona fide cis-regulatory elements through confirmatory
comparisons with known cis-regulatory elements. However, it still remains to be seen
how accurate and precise existing motif finding programs can be and what the extent
of overlap is between the predictions of different programs. One study attempted to make a comparison by applying 13 different motif finding programs to the same sequence dataset. The main conclusions were that researchers should use a few motif finding programs in combination rather than only one program and also should
consider the best few results instead of just the most significant motif for further analyses so as to increase sensitivity (Tompa et al. 2005).
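
The idea of statistical overrepresentation underlying de novo motif discovery can be illustrated with a toy exact k-mer enumeration. This sketch is deliberately simplistic and is not the algorithm of Weeder, MEME or AlignACE: it allows no mismatches, uses a plain enrichment ratio with add-one smoothing instead of a proper significance test, and the function names and example sequences are hypothetical.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def overrepresented(foreground, background, k, min_ratio=2.0):
    """Return (k-mer, enrichment ratio) pairs, most enriched first.

    The ratio compares the fraction of foreground sequences containing the
    k-mer with a smoothed fraction of background sequences containing it.
    """
    fg = kmer_presence(foreground, k)
    bg = kmer_presence(background, k)
    results = []
    for kmer, n in fg.items():
        fg_frac = n / len(foreground)
        bg_frac = (bg[kmer] + 1) / (len(background) + 1)  # add-one smoothing
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            results.append((kmer, ratio))
    return sorted(results, key=lambda item: (-item[1], item[0]))


# Made-up upstream sequences of "co-expressed" genes sharing one 6-mer.
co_expressed = ["AATGACGTCA", "CCTGACGTGG", "TGACGTAAAA"]
control = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGTTTTGG"]
print(overrepresented(co_expressed, control, k=6))
```

In practice the foreground would be the noncoding or conserved regions of a co-regulated gene set, and, in line with the recommendation of Tompa et al. (2005), the top few candidates from several programs would be carried forward.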
1.4 Comparative genomics approach for identifying cis-regulatory elements
Comparative genomics is a first step to prioritizing candidate regulatory elements for biological assays. It is not biased towards specific genomic regions (e.g., promoters)
and does not rely on a priori information on which transcription factors are regulating the target gene. The basic premise of this approach to identifying cis-regulatory
elements is that functional noncoding elements tend to evolve more slowly than
non-functional DNA due to selective pressure. Hence, cis-regulatory elements can be
identified as conserved noncoding elements (CNEs) in sequence comparisons of related genomes. Although this approach has become increasingly popular in the genome era, it had been applied successfully prior to the release of multiple vertebrate genomes. For example, the concept of phylogenetic footprinting was introduced as
early as 1988 and established as a technique to identify cis-regulatory elements
through the comparison of the promoters and 5’ flanking regions of ε-globin and γ-globin genes of several primates and a few other mammals (Tagle et al. 1988).
Briefly, the process of phylogenetic footprinting involves alignment of orthologous nucleotide sequences in two or more species, followed by the identification of
noncoding sequences that are noticeably more conserved than background sequences. There exists evidence that experimentally verified TFBS tend to be more conserved than their surrounding sequences, although not perfectly conserved (Moses et al.
2003). Conversely, notable proportions of CNEs have been shown to act as bona fide cis-regulatory elements in transgenic mouse or zebrafish assays (Pennacchio et al.
2006; Shin et al. 2005; Woolfe et al. 2005).
Comparisons between human and closely related species such as mouse have been successfully used to identify CNEs in the human genome, some of which were shown
to recapitulate the expression patterns of nearby genes (Lee et al. 2004; Loots et al. 2000). For example, Loots et al. (2000) searched for human-mouse CNEs in a ~1 Mb genomic region at human chromosome 5q31 that were ≥ 70% identical over 100 bp or more. Fifteen CNEs were found to be conserved in human, mouse and several other mammals, and one CNE was determined to regulate the expression of cytokine genes
interleukin-4, interleukin-13 and interleukin-5 located in the same genomic interval.
While human-mouse sequence comparisons can be particularly useful for finding
mammalian-specific cis-regulatory elements, this approach tends to identify many
false positives due to the relatively short evolutionary distance between the two
species (i.e., ~60 Myr). The short divergence time between human and mouse is insufficient for functional sequences to be clearly distinguished from neutrally evolving DNA. On the other hand, as the evolutionary divergence between two
species increases, the average conservation level of DNA that has been under neutral evolution since their last common ancestor decreases. Sequence comparisons between human and a more distantly related species like fish, which have been separated by a larger evolutionary distance (i.e., ~420 Myr), increase the probability that the CNEs identified are functional sequences. Among the fishes, the genome of the Japanese
pufferfish (Takifugu rubripes), also known as “fugu”, is a particularly attractive genome for comparative genomics. The fugu genome was proposed in 1993 as a
model vertebrate genome for identifying genes and other functional elements in the human genome (Brenner et al. 1993) and it was the second vertebrate genome to be fully sequenced (Aparicio et al. 2002), the first being the human genome. Fugu is a
particularly attractive model for identifying conserved cis-regulatory elements in the human genome due to its compact genome size (reducing noise in in silico
predictions) and maximum evolutionary distance (increasing the specificity of
cis-regulatory element predictions) from the human lineage.
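
A conservation criterion of the kind quoted above (for example, ≥ 70% identity over 100 bp) can be applied to a pairwise alignment with a simple sliding window. The sketch below is an illustrative assumption about how such a scan might be written, not the procedure of any cited study: the function name is hypothetical, gaps are written as "-" and counted as non-identical columns, coordinates are alignment columns rather than genomic positions, and coding exons would still have to be masked separately before calling the surviving regions CNEs.

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=0.70):
    """Return merged (start, end) column intervals passing the identity cutoff.

    aln_a and aln_b are the two rows of a pairwise alignment (equal length).
    A window passes if the fraction of identical, non-gap columns is at
    least min_identity; overlapping or adjacent passing windows are merged.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have equal length")
    n = len(aln_a)
    if n < window:
        return []
    match = [1 if a == b and a != "-" else 0 for a, b in zip(aln_a, aln_b)]
    regions = []
    current = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old.
            current += match[start + window - 1] - match[start - 1]
        if current / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: 40 identical columns followed by 20 mismatched columns.
row_a = "ACGT" * 10 + "A" * 20
row_b = "ACGT" * 10 + "T" * 20
print(conserved_windows(row_a, row_b, window=20, min_identity=0.88))
```

Tightening `min_identity` or lengthening `window` plays the same role as choosing a more distant comparison species: fewer neutrally evolving regions pass, at the cost of missing weakly conserved elements.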
The first proof-of-concept study that demonstrated the amenability of the fugu
genome to comparative genomics was carried out by Aparicio et al. (1995). Within
two transcriptional enhancers that mediate Hoxb4 expression in the mesoderm,
ectoderm and neural tube, three sequence regions CR1, CR2 and CR3 were found to
be conserved between mouse and fugu. CR1 was shown to be essential for expression
in the mesoderm, and the central and peripheral nervous systems while CR3 directed gene expression to the posterior hindbrain in exactly the same manner as the larger mouse enhancer sequence. This demonstrated that searching for CNEs between
mammals and fugu can help identify critical cis-regulatory elements (Aparicio et al. 1995). Mouse-fugu comparisons have also revealed cis-regulatory elements in the
Pax9/Nkx2-9 locus. While human-mouse comparisons identified 15 CNEs each about
1 kb long, human-mouse-fugu comparisons narrowed this list down to 2 CNEs,
named CNS-6 and CNS+2, that mediate Nkx2-9 expression in the ventral neural tube and Pax9 expression in the medial nasal process (Santagati et al. 2003). Finally,
comparisons between human and fugu identified a set of nine CNEs residing in the
gene deserts that flank the human Dachshund gene locus. When assayed in transgenic
mice, seven of the nine elements drove reporter gene expression patterns that partially
or fully recapitulated the expression pattern of Dachshund (Nobrega et al. 2003),
demonstrating that comparative genomics using the fugu genome can enable a
researcher to efficiently sift through large tracts of noncoding DNA for functional sequences.
Two types of multi-species alignments have been used to identify conserved
enhancers associated with human genes – namely, whole-genome alignments and locus-by-locus alignments. Whole-genome comparisons of human and other
vertebrates have indeed proved to be effective in identifying functional cis-regulatory
elements in the human genome (Bejerano et al. 2004; Pennacchio et al. 2006;
Sandelin et al. 2004a; Sanges et al. 2006; Shin et al. 2005; Woolfe et al. 2005).
Whole-genome comparisons typically use local alignment programs like BLASTZ (Schwartz et al. 2003) and MegaBLAST (Zhang et al. 2000) to rapidly align regions
of high homology. When carried out between distantly related genomes such as
human and fish, whole-genome comparisons tend to fail to identify and align all the orthologous sequences due to the stringent criterion of local alignment algorithms. On the other hand, locus-by-locus global alignments of orthologous gene loci using programs like LAGAN/MLAGAN (Brudno et al. 2003) and AVID (Bray et al. 2003)
are more effective in identifying all the associated CNEs. Furthermore, because global alignment algorithms have an additional assumption that input sequences occur in the same order and orientation, they have more power in detecting weakly conserved regions than local alignments (Bray et al. 2003; Frazer et al. 2003). Nevertheless, it should be noted that global alignment algorithms tend to miss conserved functional elements that have undergone local inversions and rearrangements. For my project, I applied the comparative genomics approach, using fugu as a model fish genome, to
identify cis-regulatory elements in the human genome. I chose to focus on the
transcription factor-encoding genes of the human genome, carried out locus-by-locus comparisons between human, mouse and fugu orthologous genes and identified
conserved cis-regulatory elements associated with these genes.
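
The sensitivity of global alignment to local inversions, noted above, can be illustrated with a textbook Needleman-Wunsch scorer. This is a generic sketch under assumed scoring parameters (match +1, mismatch -1, linear gap -2); it is not the LAGAN, MLAGAN or AVID algorithm, and the sequences and helper names are invented for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(seq, start, end):
    """Reverse-complement seq[start:end], mimicking a local inversion."""
    return seq[:start] + seq[start:end].translate(_COMPLEMENT)[::-1] + seq[end:]


reference = "ACCGTTGACTAGGCTA"
inverted = invert_segment(reference, 4, 12)

# Identical sequences score one match per base.
print(global_alignment_score(reference, reference))
# The inverted block can no longer be aligned colinearly, so the score drops.
print(global_alignment_score(reference, inverted))
```

Because the dynamic programming recurrence only considers colinear paths, the inverted segment contributes mismatches or gaps even though its sequence content is fully conserved on the opposite strand.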
1.5 Transcription factors (TFs)
Transcription factors (TFs) are proteins that bind to cis-regulatory elements and
activate or repress transcription of genes. Other proteins that are involved in
transcriptional regulation and that could fall under the broad classification of TFs, include co-factors, structural enzymes involved in chromatin or histone modification affecting the transcriptional state of target genes, and general transcription factors that associate directly with RNA polymerase II in a transcription initiation complex. The human genome is estimated to contain about 2,000 TF-encoding genes including co-factors, chromatin modifying enzymes and general transcription factors (Messina et
al. 2004). TFs play crucial roles in development (e.g., HOX, SOX and PAX proteins) (Chi and Epstein 2002; Krumlauf 1994; Scotting and Rex 1996), cell cycle
progression (e.g., c-MYC, c-JUN) (Adhikary and Eilers 2005; Mechta-Grigoriou et al. 2001), tumor suppression (e.g., p53, FOXO proteins) (Arden 2007; Vousden and Lane
2007) and cell differentiation (e.g., RUNX, DLX proteins) (Durst and Hiebert 2004; Panganiban and Rubenstein 2002). The vast majority of TFs are known to regulate the expression of a number of different genes, and TF-encoding genes themselves are the key targets of TFs, with many TFs regulating the expression levels of their own genes. Thus, TFs represent crucial nodes in the gene regulatory networks that determine the correct development of the body plan and regulate various physiological processes (Levine and Davidson 2005; Stathopoulos and Levine 2005). To gain an
understanding of such gene regulatory networks, it is important to identify
cis-regulatory elements associated with TF-encoding genes and the TFs involved in the
differential expression of TF-encoding genes. Mutations that disrupt the cis-regulatory
elements of TF-encoding genes have been implicated in various congenital disorders.
For example, the TF-encoding gene PAX6 is essential for eye, pancreas and central nervous system development. Haploinsufficiency of PAX6 causes brain defects and a
congenital eye malformation known as aniridia. In some aniridia patients, breakpoints
downstream of PAX6 are observed at chromosome 11p13. These breakpoints disrupt long-range enhancers further downstream, resulting in loss of PAX6 expression in the
retina, iris and ciliary body of developing eyes and parts of the diencephalon
(Kleinjan et al. 2006). Campomelic dysplasia is a congenital disorder characterized by bowing and angulation of long bones, together with other skeletal and extraskeletal defects. Translocation breakpoints associated with campomelic dysplasia are scattered
up to 1 Mb upstream of SOX9. Sequence comparisons between human and fugu followed by transgenic mouse assays revealed three cis-regulatory elements
distributed up to 290 kb 5’ and 95 kb 3’ of SOX9 that recapitulate SOX9 expression in cooperation with the SOX9 core promoter, and these elements are potentially
disrupted by the campomelic dysplasia translocation breakpoints (Bagheri-Fam et al.
2006). To determine how aberrant expression of certain TF-encoding genes leads to
human congenital disorders, it is important to identify the mutations in the
cis-regulatory elements of TF-encoding genes associated with these disorders. A
prerequisite for this task is to identify all the cis-regulatory elements associated with
TF-encoding genes. Therefore, I chose to use a comparative genomics approach to
identify evolutionarily conserved cis-regulatory elements associated with
TF-encoding genes.
1.6 Objectives of my work
The main aim of my project is to identify evolutionarily conserved cis-regulatory
elements associated with TF-encoding genes in the human genome using a
comparative genomics approach and to verify some of the predicted cis-regulatory
elements in a transgenic mouse assay. Previous whole-genome comparisons have shown that TF-encoding genes tend to be enriched with CNEs (Bejerano et al. 2004; Iwama and Gojobori 2004; Sandelin et al. 2004a; Woolfe et al. 2005). Notably, a large number of CNEs identified in these genome-wide comparisons were found to be associated with TF-encoding and developmental genes. For example, 83% (1,140 out
of 1,373) of CNEs (>70% identity/>100 bp alignment length) identified in a genome-wide comparison of human and fugu were located in the vicinity of about 120 human DNA-binding TF-encoding genes (Woolfe et al. 2005), while 104 of the 290 human genes associated with human-zebrafish CNEs (>70% identity/>80 bp sequence
length) were found to be TF-encoding genes (Shin et al. 2005). In spite of such a known association between TF-encoding genes and CNEs, no systematic gene-by-gene comparison of all the orthologous human and other vertebrate TF-encoding
genes has been carried out to identify potential cis-regulatory elements associated
with them. Although orthologous TF-encoding genes of human, rodents, fugu and zebrafish have been previously compared (Dieterich et al. 2005; Iwama and Gojobori 2004; Robertson et al. 2006; Sandelin et al. 2004b), such studies were promoter-centric, with comparisons restricted to sequences flanking the transcription start sites (TSS) (e.g., from 3 kb upstream to 500 bp downstream of the TSS for ConSite (Sandelin et al. 2004b); ~10 kb upstream of coding region (Dieterich et al. 2005)). The scope of such
studies is limited because cis-regulatory elements can be located at considerable
distance from transcription start sites, even up to several hundred kilobases away (Lettice et al. 2003; Nobrega et al. 2003).
In my comparative genomics approach, I have chosen to compare the human and mouse genomes with the fugu genome. The specific objectives of my project are (1)
to build a curated database of TF-encoding genes in human, mouse and fugu
genomes; (2) to carry out locus-by-locus alignments of their noncoding regions using
a global alignment strategy; (3) to identify CNEs associated with them and construct a database of these CNEs; (4) finally, to assay some of these CNEs in transgenic mouse
assays using lacZ (β-galactosidase) as a reporter gene.
Chapter 2: Materials and methods
2.1 Identifying human, mouse and fugu TF-encoding genes
Sequences for 1,962 human transcription factors (TFs) were obtained from Messina et
al. (2004) and redundancies were removed by a homology search against human RefSeq proteins. Several known proteins missing from Messina et al.’s dataset (e.g.,
JMJ2A, JMJ2C, JMJ2D, HES4 and DLX6) were included with the search results and
the resulting proteins were mapped to human genes in Ensembl Release 37
(http://www.ensembl.org/). The TFs were classified by DNA-binding domains, and if lacking a DNA-binding domain, the TFs were classified separately into one of the following categories: co-factors, general transcription factors, components of
chromatin remodeling complexes and transcriptional regulators that are involved solely in protein-protein interactions (i.e., TFs with ZnF-PHD, ZnF-BTB/POZ, ZnF-MYND domains). Mouse orthologs of human TFs were retrieved from Ensembl BioMart. Fugu orthologs were identified using a combination of data from Ensembl BioMart (fugu assembly v4.0) and INPARANOID analysis (Remm et al. 2001). Teleost fishes contain duplicate copies for many human genes due to a “fish-specific” whole-genome duplication event in the fish lineage (Amores et al. 1998; Christoffels
et al. 2004; Postlethwait et al. 1998). INPARANOID was used to identify duplicate fugu orthologs for human TFs that may have been missed in Ensembl. Only proteins longer than 50 residues were used in the analysis. INPARANOID identified some many-to-many ortholog groups which were resolved into smaller families based on phylogenetic analysis using PHYLIP (Felsenstein J., University of Washington, Seattle) with sequences from cartilaginous fishes, lamprey, tunicate or amphioxus as the outgroup. For each family, multiple human and fugu proteins and a single