Identifying and characterizing cis regulatory elements in the human genome



IDENTIFICATION AND CHARACTERIZATION OF

CONSERVED CIS-REGULATORY ELEMENTS

IN THE HUMAN GENOME

ALISON P LEE

B Computing (Computer Science) (Honours)

National University of Singapore, 2004

A THESIS SUBMITTED FOR THE DEGREE OF


Acknowledgements

There are several people I would like to acknowledge:

Firstly, my advisor Professor Byrappa Venkatesh for being an excellent and patient mentor, and for devoting so much time and energy to helping me improve my reasoning, writing and presentation skills;

Dr Sydney Brenner and Dr Ng Huck Hui (GIS, Singapore), members of my thesis advisory committee, for their scientific advice;

Dr Alice Tay for her constant encouragement and discussions at work and conferences, and her work on the fugu Hox clusters project;

Esther Koh for her work on the fugu Hox clusters project and Yang Yuchen for his work on the TFCONES project;

Gene Yeo, Eddie Loh, Nidhi Dandona for showing me the ropes in bioinformatics;

Luo Ming, Wang Jianli, Kevin Lam for useful software/hardware discussions and troubleshooting;

Krish Jon Mathavan, Elizabeth Yeoh, Tay Boon Hui and Sumanty Tohari for guidance in laboratory techniques;

Patrick Gilligan and Ravi Vydianathan for their valuable comments on manuscripts;

Data Storage and Cluster Computing teams at Bioinformatics Institute, led by Lai Loong Fong and Stephen Wong respectively, for speedy response to my requests for server fixes or software and database updates;

Arun Kumar, Zhao Zhiyang, Peng Huiling, and Chua Yiwen at A*STAR Biological Resource Centre for their skilled work in DNA microinjections and mouse husbandry;


The members of IMCB Histopathology Unit, especially Keith Rogers and Susan Rogers for histology tips, protocols and equipment;

The staff of A*STAR Graduate Academy and NUS Graduate School for Integrative Sciences and Engineering for the handling of administrative affairs;

All present and former members of the Comparative Genomics Laboratory and DNA Sequencing Facility in IMCB, Singapore for making the laboratory a conducive and pleasant place to work in;

Finally, my greatest appreciation goes to my parents and my brother for their strongest support.


Table of contents

Acknowledgements

Table of contents

Summary

List of tables

List of figures

List of abbreviations

Chapter 1: Introduction

1.1 The Human Genome Project

1.2 Cis-regulatory elements

1.3 Methods to identify cis-regulatory elements

1.4 Comparative genomics approach for identifying cis-regulatory elements

1.5 Transcription factors (TFs)

1.6 Objectives of my work

Chapter 2: Materials and methods

2.1 Identifying human, mouse and fugu TF-encoding genes

2.2 Identifying CNEs in TF-encoding gene loci

2.3 Computing CNE statistics of human TF-encoding genes

2.4 Enrichment analysis of CNEs within experimentally defined TFBS

2.5 Gene Ontology enrichment analysis of human TF-encoding genes

2.6 Expression analysis of human TF-encoding genes

2.7 Motif finding

2.8 Predicting TFBS in human-fugu CNEs

2.9 Building and implementing the TFCONES database

2.10 Sequencing and annotating the fugu Hox gene clusters

2.11 Functional assay of CNEs in transgenic mice

Chapter 3: Results – CNEs in transcription factor-encoding genes

3.1 TF-encoding genes in human, mouse and fugu

3.2 Conserved clusters of TF-encoding genes in human, mouse and fugu

3.3 Identification of human-mouse and human-fugu CNEs

3.4 Distribution pattern of human-fugu CNEs in the introns and flanking regions of TF-encoding genes

3.5 Presence of experimentally verified TFBS within human-fugu CNEs

3.6 Functional categories of human TF-encoding genes associated with human-fugu CNEs

3.7 Association between human-fugu CNEs and expression profiles of TF-encoding genes

3.8 Over-represented motifs in the human-fugu CNEs of central nervous system-expressing TF-encoding genes

3.9 Database of TFs and CNEs associated with TF-encoding genes

3.10 Discussion

Chapter 4: Results – CNEs in the Hox gene clusters

4.1 The Hox gene clusters

4.2 Fugu Hox loci and conserved syntenic blocks

4.3 CNEs in the Hox loci

4.4 Distribution of CNEs in fugu Hox loci

4.5 CNEs in the human Hox loci

4.6 Discussion

Chapter 5: Results – Functional assay of CNEs associated with a representative TF-encoding gene

5.1 Introduction

5.2 Functions and expression pattern of Lhx2 gene

5.3 Organization of the Lhx2 locus in vertebrates

5.4 CNEs in the Lhx2 gene locus

5.5 Expression patterns of Lhx2, Crb2 and Dennd1a in developing mouse embryos

5.6 Functional assay of Lhx2 CNEs in E11.5 transgenic mouse embryos

5.7 Site-directed mutagenesis of a predicted motif in Lhx2_CNE2/3

5.8 Discussion

Chapter 6: Discussion

Bibliography

Annexes

List of my publications


Summary

Comparative genomics is a powerful approach for identifying conserved cis-regulatory elements in the human genome. Since functional sequences evolve more slowly than the surrounding neutrally evolving regions, cis-regulatory elements can be identified as conserved noncoding elements (CNEs) in comparisons of human and other vertebrate genomes. In particular, comparison of human with distantly related vertebrates such as fishes increases the likelihood that most predicted CNEs are functional sequences. The objectives of my project were to identify all the CNEs associated with transcription factor (TF)-encoding genes in the human genome by comparison with pufferfish (fugu) sequences, analyze the characteristics of the CNEs and assay the functions of selected CNEs in transgenic mice. I started by building a curated database of all TF-encoding genes in human, mouse and fugu, and predicted CNEs (≥ 65% identity over 50 bp) associated with orthologous genes by locus-by-locus global alignments. In total, 1,738 human, 1,495 mouse, and 1,762 fugu TF-encoding genes were identified, with 1,145 genes having orthologs in all three genomes. Further analyses focused on the set of 816 DNA-binding TF-encoding orthologous genes. A total of 2,843 human-fugu CNEs (total length ~388 kb) were found to be associated with these TF-encoding genes. An online database named TFCONES (Transcription Factor Genes & Associated COnserved Noncoding ElementS; http://tfcones.fugu-sg.org/) was constructed to catalog the human, mouse and fugu TF-encoding genes together with their associated CNEs. The TFCONES database should be useful to researchers interested in studying the regulation of TF-encoding genes and understanding gene regulatory networks in vertebrates.
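The CNE criterion quoted above (≥ 65% identity over 50 bp) can be illustrated with a minimal sliding-window sketch. It assumes that a pairwise global alignment is already available as two gapped strings of equal length; the function name `conserved_windows`, the treatment of gap columns as mismatches and the merging of overlapping windows are illustrative assumptions, not the pipeline used in this work, which also has to exclude coding and repetitive sequence.

```python
def conserved_windows(aligned_a, aligned_b, window=50, min_identity=0.65):
    """Return merged (start, end) alignment-column intervals covered by
    windows of `window` columns whose identity is >= `min_identity`.

    Hypothetical helper: gap columns ('-') are counted as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aligned_a)
    if n < window:
        return []
    # 1 where the column holds the same non-gap base in both sequences.
    match = [int(a == b and a != "-")
             for a, b in zip(aligned_a.upper(), aligned_b.upper())]
    count = sum(match[:window])  # matches in the first window
    intervals = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if intervals and start <= intervals[-1][1]:
                # Overlaps or abuts the previous interval: extend it.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals


if __name__ == "__main__":
    human = "A" * 120
    fish = "A" * 60 + "C" * 60
    print(conserved_windows(human, fish))
```

In practice the reported intervals would still need to be mapped from alignment columns back to genomic coordinates and filtered against annotated exons.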


The human-fugu CNEs identified showed a significant overlap with experimentally verified transcription factor binding sites (TFBS) of known transcriptional activators and repressors, confirming that some CNEs function as transcriptional enhancers or silencers. In addition, functional enrichment analyses indicated that the CNEs are significantly associated with TF-encoding genes that are involved in regulating development, particularly the development of the central nervous system. Furthermore, expression profiling based on publicly available expression data showed that genes that are most highly expressed in central nervous system tissues are enriched with human-fugu CNEs. Motif discovery within the CNEs of TF-encoding genes that are most highly expressed in the central nervous system revealed four 8-mer motifs that are likely to be involved in transcriptional enhancer activity in the central nervous system. To verify the functions of CNEs and the motifs, I assayed the CNEs of a representative TF-encoding gene, the LIM homeobox gene Lhx2, in transgenic mice. Four out of eight CNEs tested demonstrated enhancer activity by recapitulating Lhx2 expression in the midbrain and hindbrain at embryonic day 11.5. Mutagenesis of a predicted motif in a selected CNE abolished gene expression in the neural tube and dorsal root ganglia, demonstrating that the motif is indeed critical for enhancer activity.


List of tables

Table 1 Fugu scaffolds (assembly v3.0) for the seven Hox loci

Table 2 Primer sequences of the eight Lhx2 constructs

Table 3 TF-encoding genes in human, mouse and fugu genomes

Table 4 Conserved clusters of human, mouse and fugu TF-encoding genes

Table 5 Top twenty TF-encoding genes associated with the highest density of human-fugu CNEs in human

Table 6 Top twenty TF-encoding genes associated with the highest density of human-fugu CNEs in human-fugu

Table 7 Top twenty TF-encoding genes associated with the highest number and total length of human-fugu CNEs

Table 8 Location of human-fugu CNEs in relation to the protein-coding sequence of nearest human TF-encoding genes

Table 9 Significantly over-represented and under-represented Gene Ontology (GO) terms (P < 0.01) of CNE-associated human TF-encoding genes

Table 10 Over-represented 8-mer motifs in human-fugu CNEs of cluster #3 human TF-encoding genes

Table 11 Conserved syntenic fragments at the fugu and human Hox loci

Table 12 Human-fugu CNEs in the fugu Hox loci

Table 13 Human-fugu CNEs in the human Hox loci

Table 14 Details of Lhx2 CNE constructs tested in transgenic mice

Table 15 Summary of the expression patterns directed by the CNEs tested


List of figures

Figure 1 A flowchart of protocols used for identifying human, mouse and fugu TFs, and human-mouse, human-fugu CNEs

Figure 2 Distribution of lengths of human-mouse and human-fugu CNEs

Figure 3 Plot of total length of CNEs associated with DNA-binding TF-encoding genes against the length of non-repetitive noncoding sequence (in kilobases; kb) of human or fugu gene locus

Figure 4 Plot of CNE density against the total length of CNEs associated with DNA-binding TF-encoding genes

Figure 5 CNEs in the MEIS2 gene locus

Figure 6 Distribution of UTR-intronic and internal-intronic human-fugu CNEs

Figure 7 TF-encoding genes predominantly expressed in the central nervous system are enriched with CNEs

Figure 8 List of human-fugu CNEs associated with a representative TF-encoding gene FOXA2

Figure 9 An image of the location of CNEs relative to the associated TF-encoding gene FOXA2

Figure 10 Conserved syntenic blocks at the fugu and human Hox loci

Figure 11 VISTA plot of the MLAGAN alignment of fugu HoxAa locus with human HoxA and mouse HoxA loci

Figure 12 VISTA plot of the MLAGAN alignment of fugu HoxDa locus with human HoxD and mouse HoxD loci

Figure 13 Profiles of CNEs at the human HoxA, HoxB, HoxC, and HoxD loci

Figure 14 Lhx2 gene loci in human, mouse, chicken and fugu

Figure 15 Syntenic genes surrounding Lhx2 in human, mouse and fugu

Figure 16 CNEs in the Lhx2 locus

Figure 17 Ten CNEs associated with the human LHX2 gene locus are located within the introns of the upstream gene DENND1A

Figure 18 Expression of Lhx2, Crb2, Dennd1a in E11.5 mouse embryos

Figure 19 Lhx2_CNE2/3 directs lacZ expression in the neural tube and dorsal root ganglia

Figure 20 Lhx2_CNE5/6 directs lacZ expression in the hindbrain and neural tube

Figure 21 Lhx2_CNE7 directs lacZ expression in the hindbrain and neural tube

Figure 22 Lhx2_CNE10 directs lacZ expression in the midbrain, hindbrain and neural tube

Figure 23 Mutation of the overlapping motifs in Lhx2_CNE2

Figure 24 Site-directed mutagenesis of a predicted motif in Lhx2_CNE2 results in significantly reduced lacZ expression in the neural tube and dorsal root ganglia


List of abbreviations

CNE conserved noncoding element

Myr millions of years

PBS phosphate buffered saline

PCR polymerase chain reaction

TFBS transcription factor binding site

TSS transcription start site

UTR untranslated region (of an mRNA)


Chapter 1: Introduction

1.1 The Human Genome Project

A major quest in biology is to understand the function and regulation of all human genes and their roles in human biology and disease. The Human Genome Project, launched in 1990, was the first step towards accomplishing this ambitious scientific endeavor. The project had two major goals: (i) to determine the complete sequence of the human genome and (ii) to identify all the functional elements encoded by it. In 2001, the first goal was achieved when two draft sequences of the human genome were released (Lander et al. 2001; Venter et al. 2001). Since then, coordinated efforts were made to close the gaps in the genome and to achieve high sequencing accuracy. As a result, the human genome sequence is now essentially complete (International Human Genome Sequencing Consortium 2004). The second goal, that is, to map all the functional elements in the human genome, is however far from complete.

Although about 21,000 protein-coding genes have been identified in the human genome with high confidence (Clamp et al. 2007), the identification of noncoding functional elements poses a major challenge. Comparisons of the human and mouse genomes have revealed that approximately 5% of the human genome is under purifying selection and represents functional sequence (Waterston et al. 2002). Of this 5%, protein-coding genes account for 1.5% of the genome and the remaining 3.5% is functional noncoding sequence. Functional noncoding elements in the human genome include noncoding RNA genes, cis-regulatory elements involved in transcriptional regulation, splicing regulatory elements and sequences that dictate chromatin structure. Unlike protein-coding genes, which have a characteristic structure and genetic code that facilitate their identification, very little is known about the structure and organization of noncoding functional elements. Hence, their identification and verification remain a challenge. The main focus of my project is to identify and verify functional cis-regulatory elements in the human genome.

1.2 Cis-regulatory elements

Cis-regulatory elements are sequences that regulate the precise spatial and temporal expression of their target genes. While the genomic content of every cell in the body is largely the same, important biological processes such as development, differentiation, proliferation and apoptosis can take place because gene expression is differentially regulated in various cell types and at different developmental stages. Cis-regulatory elements comprise core promoters, proximal promoters, enhancers, silencers, insulators and locus control regions. The core promoter is the minimal region surrounding the transcription start site of a gene (extending ~35 bp on either side) that is sufficient to direct the initiation of transcription by the RNA polymerase II machinery (Butler and Kadonaga 2002). At the core promoter, RNA polymerase II, general transcription factors and other associated factors congregate and form the pre-initiation complex. The other classes of cis-regulatory elements contain multiple binding sites for sequence-specific DNA-binding transcription factors. The proximal promoter extends ~250 bp on either side of the transcription start site (Butler and Kadonaga 2002). Transcriptional enhancers, also known as cis-regulatory modules, and silencers contain binding sites for transcriptional activators and repressors respectively, and can be located far away from their target gene (Maston et al. 2006). Insulators prevent a neighbouring enhancer from acting on a gene promoter when the insulator is situated between them, or shield a gene from the encroachment of repressive heterochromatin (West et al. 2002). Locus control regions are composed of multiple cis-regulatory elements including enhancers, silencers, insulators, and matrix or scaffold attachment regions (Maston et al. 2006). They are capable of directing the tissue-specific expression of one or more linked genes to physiological levels in a copy-number dependent manner at ectopic sites, possibly by opening up a chromosome domain or by preventing the establishment of heterochromatin (Li et al. 2002). Among the various classes of cis-regulatory elements, transcriptional enhancers contribute the most to specificity in gene expression. Core and proximal promoters generally direct basal levels of transcription, while increases and nuances in gene expression are largely achieved by the highly coordinated interaction between transcriptional enhancers and their cognate upstream transcription factors. While a gene can have at most a proximal and a core promoter, it can potentially be associated with several enhancers, each containing a different combination of binding sites for different transcription factors. Hence, the total number of gene expression patterns that can be generated is significantly greater than the number of trans-acting factors encoded in the human genome (Maston et al. 2006).

Experimental studies on transcriptional enhancers have revealed some of their characteristic features. Firstly, transcriptional enhancers are modular in nature. This means that distinct enhancers can act independently upon a single promoter at different times, in different cell types and in response to different external stimuli. This feature is best demonstrated by the regulated expression of the pair-rule gene even skipped (eve) in Drosophila. The expression of eve in seven thin stripes of the early blastoderm embryo is achieved by five different enhancers, and each enhancer contributes to expression in exactly one or two stripes (Andrioli et al. 2002). Secondly, transcriptional enhancers can be located in the 5’ or 3’ flanking regions, untranslated exonic regions or introns of a gene. They can even be situated at significant distances from the gene. For example, an enhancer that initiates and directs specific Sonic hedgehog (Shh) expression in the posterior regions of developing mammalian forelimbs and hindlimbs was found to be located 1 Mb away in an intron of a neighbouring gene (Lettice et al. 2003). Thirdly, enhancers act on their target genes in an orientation-independent way. An SV40 viral enhancer sequence that was cloned 1.4 kb upstream of a rabbit β-globin gene enhanced globin gene expression, as did the sequence when cloned 3.3 kb downstream (Banerji et al. 1981). Lastly, the prevailing model for long-range interaction between sequence-specific transcription factors at an enhancer and the basic transcriptional machinery at the promoter is a “looping-out” mechanism, by which the DNA between the enhancer and the promoter “loops out” so that the enhancer is brought close to the promoter (Vilar and Saiz 2005).

In order to understand the regulation of human genes in different tissues and at different developmental stages, it is important to identify and characterize all the regulatory elements associated with human genes. A comprehensive catalog of cis-regulatory elements will not only help improve our understanding of how individual human genes are transcriptionally regulated, but also lead to an understanding of how certain mutations in human gene regulatory regions cause genetic diseases. Many disease-associated loci fall within noncoding regions that are potentially involved in gene regulation. For example, patients with preaxial polydactyly, a condition marked by extra digits on the hands and feet, have point mutations in the long-range enhancer of the Shh gene described above. These point mutations cause Shh expression, which is usually asymmetrically localized to the posterior limb buds, to expand abnormally into the anterior region, thus resulting in extra digits (Lettice et al. 2003). In patients with Van Buchem disease, a homozygous recessive disorder that is characterized by a gradual increase in bone density and distortions in the face, mandible and head, entrapment of cranial nerves and excessive weight, a 52-kb deletion has been uncovered on human chromosome region 17p21. Within this deleted region resides a 250-bp long-range enhancer that is postulated to direct expression of sclerostin in the human adult skeleton. Sclerostin is a negative regulator that keeps adult bone formation in check. Loss of the enhancer results in significant loss of adult sclerostin expression and uncurtailed bone growth (Loots et al. 2005). Finally, a genomic deletion in human chromosome region 4q35 has been associated with facio-scapulo-humeral dystrophy (FSHD), one of the most common forms of muscular dystrophy. The deletion locus contains a matrix attachment region which blocks a nearby enhancer from activating the FRG1 (FSHD region gene 1 protein) gene in normal cells. The FRG1 gene encodes a putative splicing factor that is reportedly over-expressed in the myoblasts of FSHD patients. The deletion event truncates a D4Z4 repeat array adjacent to the matrix attachment region and disrupts the latter’s ability to block the enhancer from activating FRG1, causing an aberrant upregulation of FRG1 (Petrov et al. 2008). The role of FRG1 in FSHD is still unknown.

The above examples of cis-regulatory mutations that are associated with genetic diseases underscore the importance of cis-regulatory elements in maintaining the correct spatio-temporal expression of crucial genes. Besides gleaning new insights into genetic diseases from regulatory elements, clinicians have begun to use cis-regulatory elements to direct tissue-specific or sustained expression of transgenes in gene therapy approaches. For instance, inducible promoters that are sensitive to external stimuli like ionizing radiation or hypoxia are highly specific and have been proposed as tools to restrict expression of cytotoxic or tumoricidal genes to cancer cells (Goverdhana et al. 2005). Furthermore, insulators have been explored as a way to override chromatin position effects and to ensure sustained expression of transgenes introduced into the human genome during gene therapy (Recillas-Targa et al. 2004). Hence, cis-regulatory elements have become useful in understanding and correcting genetic disorders. In addition, the identification of cis-regulatory elements and of the mutations that may occur in them has implications for understanding the evolutionary processes that give rise to population variation. At least 1% of all human genes are postulated to possess functional cis-regulatory polymorphisms, and these variations lead to considerable perturbation in gene expression levels and may thus impact various aspects of phenotype (Rockman and Wray 2002). For the reasons mentioned above, identifying and characterizing all the cis-regulatory elements in the human genome is indeed a high-priority endeavor.

1.3 Methods to identify cis-regulatory elements

Cis-regulatory elements such as transcriptional enhancers lack a well-defined structure similar to that of protein-coding genes. They are typically composed of clusters of binding sites for several different transcription factors, and there is very limited information on how these binding sites are arranged within the cis-regulatory element. Transcription factor binding sites are typically only 5–12 bp long and are degenerate, in that variation is tolerated at some positions more than at others. Without a priori information about which transcription factors regulate a gene of interest, the identification of all the cis-regulatory elements associated with the gene is a complex task. Furthermore, cis-regulatory elements are position-independent and orientation-independent, hence searches for cis-regulatory elements are required to cover extensive stretches of the genome and both strands of the DNA. The challenge is not only to identify a cis-regulatory element, but also to dissect the core sequences of a cis-regulatory element that are crucial for its function and to determine their mode of action. The methods that have been used to date in identifying and characterizing cis-regulatory elements can be broadly categorized into traditional methods and genomic-era strategies that became possible after the whole genome sequences of human and other organisms became available.
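To make the point about short, degenerate binding sites and the need to scan both strands concrete, the following minimal sketch searches a sequence for a degenerate motif written in IUPAC nucleotide codes. The function names and the example motif `WGATAR` (a GATA-like consensus chosen purely for illustration) are assumptions and do not come from this work; real binding-site prediction typically relies on position weight matrices rather than exact consensus matching.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string that may contain IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(sequence, motif):
    """Return sorted (position, strand, matched_text) hits for `motif`.

    Positions are 0-based on the forward strand. Minus-strand sites are
    found by searching the forward sequence for the motif's reverse
    complement; a lookahead is used so that overlapping hits are kept.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pattern_motif in (("+", motif.upper()),
                                  ("-", reverse_complement(motif))):
        regex = "".join(IUPAC[base] for base in pattern_motif)
        for m in re.finditer("(?=(" + regex + "))", sequence):
            hits.append((m.start(), strand, m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_sites("TTGATAAGCCTTATCAA", "WGATAR"):
        print(hit)
```

Because such short patterns match by chance very frequently in a large genome, a scan like this only nominates candidate sites; it does not by itself indicate function.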

1.3.1 Traditional strategies

Traditional approaches to identifying cis-regulatory elements can be described as either biochemical or genetic methods. The first biochemical method developed to study protein-DNA interaction is the DNA footprinting assay (Galas and Schmitz 1978). The aim of this method is to identify the binding sites of a particular protein within a DNA sequence. The protein of interest is added to PCR-amplified DNA that has been singly end-labeled, and the mixture is subjected to cleavage (either by an endonuclease or a chemical reagent). The resulting DNA fragments are separated by gel electrophoresis alongside the fragments obtained from a control DNA sample to which no protein was added. The two DNA samples generate different cleavage patterns, with the protein-bound DNA sample missing bands at the DNA sites (“footprints”) where the protein has bound and protected them from cleavage. To identify the binding sites, the positions of the footprints are approximated in relation to the radiolabeled end. The advantage of this method is that binding sites can be resolved to single nucleotides when a highly reactive and sequence-independent chemical reagent such as hydroxyl radical is used to cleave the DNA (Jain and Tullius 2008). Another biochemical method that describes protein-DNA interaction is the electrophoretic mobility shift assay (EMSA), introduced in the 1980s. EMSA resembles the DNA footprinting assay in that it detects binding of a protein of interest to an end-labeled DNA sequence based on movement of the mixture along a polyacrylamide gel. The difference is that there is no cleavage step, and hence a single DNA band is observed after gel electrophoresis as opposed to a series of bands as in DNA footprinting. DNA molecules to which proteins have bound move more slowly in the gel than the control DNA with no protein, resulting in a ‘band shift’. The advantage of EMSA is that multiple binding sites can be detected by applying increasing protein concentrations, which then result in more pronounced band shifts. A limitation of the above two methods is that both assays require prior knowledge of a putative cis-regulatory element and a potential DNA-binding protein. In addition, the step of protein-DNA interaction is carried out in vitro, which may not necessarily reflect in vivo events.
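The band-comparison logic of the footprinting assay described above can be sketched in a few lines. This is an idealized illustration under stated assumptions: each lane is represented simply as the set of fragment lengths (distance in bases from the labeled end) that produce a visible band, protection is treated as complete, and the function name `footprints` is hypothetical. Real gels show graded band intensities and require quantitative comparison.

```python
def footprints(control_fragments, bound_fragments):
    """Return (first, last) runs of consecutive fragment lengths that
    appear in the control lane but are missing from the protein-bound lane.

    Hypothetical helper: lanes are given as iterables of integer fragment
    lengths measured from the labeled end.
    """
    missing = sorted(set(control_fragments) - set(bound_fragments))
    runs = []
    for length in missing:
        if runs and length == runs[-1][1] + 1:
            # Extends the current protected region by one base.
            runs[-1] = (runs[-1][0], length)
        else:
            runs.append((length, length))
    return runs


if __name__ == "__main__":
    control = range(1, 41)  # cleavage at every position
    bound = [n for n in control if not 15 <= n <= 24]  # protected stretch
    print(footprints(control, bound))
```

Each reported run corresponds to one candidate protected region, located relative to the labeled end of the DNA.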

One example of a biochemical assay that captures the chromatin state of DNA sequences in vivo is the DNase I hypersensitivity assay (Keene et al. 1981; McGhee et al. 1981). In the eukaryotic nucleus, DNA is coiled around histone octamer complexes at regular intervals to form nucleosomes. These nucleosomes serve to pack the large eukaryotic genome into the nucleus and also influence important events like gene transcription and DNA replication. Modifications to histone proteins, like trimethylation of lysine 4 of histone H3 and acetylation of histone H3 (Xi et al. 2007), can lower the DNA’s affinity for nucleosomes and displace nucleosomes, creating a state of open chromatin and making the underlying DNA susceptible to DNase I cleavage. The displacement of nucleosomes also implies that the DNA is accessible to binding by proteins such as transcription factors. Hence, DNase I hypersensitivity marks various functional noncoding elements, including cis-regulatory elements, origins of replication, recombination elements and structural sites of telomeres and centromeres (Cereghini et al. 1984; Gross and Garrard 1988). The advantage of using DNase I hypersensitivity to find potential cis-regulatory elements is that DNase I hypersensitivity is a transient property of DNA sequences, and hence this method is useful in the study of spatial and temporal patterns of transcriptional activity. The cis-regulatory elements that are active in a particular cellular phenotype can also be identified by carrying out the assay on samples derived from that phenotype. Since the assay was introduced, the level of resolution has greatly improved from ~500 bp on either side of the nucleosome-free region using Southern transfer followed by indirect end-labeling (Wu 1980) to nearly nucleotide resolution using PCR assessment (Yoo et al. 1996) and quantitative PCR (McArthur et al. 2001). Nevertheless, DNase I hypersensitivity implies the presence of a cis-regulatory element but does not demonstrate its function.

Chromatin immunoprecipitation (ChIP) is a method that identifies the sites of transcription factor-DNA interaction in vivo. DNA-binding proteins are covalently cross-linked to genomic DNA by the addition of formaldehyde to living cells. The genomic DNA is extracted from the cells and fragmented by sonication to lengths in the range of 100–500 bp. An antibody highly specific for the transcription factor of interest is used to precipitate all the DNA fragments to which the protein is bound. Protein-DNA crosslinking is reversed and the DNA fragments are then purified. A ChIP library is constructed whereby the purified DNA fragments are cloned into a vector and sequencing is carried out using vector primers. Although the ChIP technique is effectively unbiased in its search space for cis-regulatory elements and can be used to detect protein-DNA interactions in a particular type of cell, several disadvantages remain. A highly specific antibody must be raised against the transcription factor of interest, which is not a trivial task, especially for transcription factors that belong to larger families of structurally similar transcription factors. In addition, more in-depth analyses have to be carried out to map the actual bases of contact between the transcription factor and DNA within the larger DNA fragment. Finally, some of the detected interactions may be indirect, meaning that the transcription factor being studied is associated with the DNA through another DNA-binding factor rather than by contacting the DNA itself.

A more accurate and reliable way to verify a putative cis-regulatory element is to

examine its ability to direct gene expression This goal is achieved through the

reporter gene assay In reporter gene assays aimed at verifying enhancers, the

sequence of interest is cloned upstream of a reporter gene linked to a core promoter Examples of reporter genes include genes that encode β-galactosidase, green

fluorescent protein and firefly-luciferase The construct is incorporated into cultured cells through transient or stable transfection and this is a fast method to measure or

examine the expression of the reporter gene as directed by the cis-regulatory element

being studied (Carey and Smale 2000; Himes and Shannon 2000) By varying the

length of the putative cis-regulatory element, the minimal cis-regulatory element can

be identified Silencers are tested in a similar way as enhancers except that the core promoter used is one that drives strong reporter gene expression so that any resulting reduction in expression level can be readily detected (Maston et al 2006) On the other hand, more complex arrangements are required in order to verify putative

insulators and locus control regions A putative insulator is cloned between a known

Trang 23

enhancer and a core promoter to demonstrate its enhancer-blocking ability, or tested in transgenic assays for its heterochromatin-barrier ability (West et al. 2002), while a putative locus control region must be verified in transgenic assays to see if it can overcome chromosomal position effects (Maston et al. 2006). Two limitations of conducting reporter gene assays in cell lines are that developmentally active enhancers cannot be tested in cell lines and that cell lines may not be available for all types of cells. To overcome these limitations, transgenic systems are used in place of cultured cells. In addition, transgenic systems allow researchers to show that a cis-regulatory element indeed functions in living organisms. The plasmid construct is first linearized and then injected into mouse or zebrafish embryos, after which the construct becomes randomly integrated into the genome, as demonstrated, for example, by Kothary et al. (1989) and Muller et al. (1997). The researcher then monitors the reporter gene expression during development. The advantage of transgenic reporter assays is that the effect of putative cis-regulatory elements can be studied in vivo and, to date, this remains the most accurate and effective way to prove that an element is functional. However, this method does have its disadvantages. Firstly, because integration of the transgene into the host genome occurs at random locations, expression of the reporter gene may not reflect the actual in vivo expression driven by the cis-regulatory element, due to position effects (whereby the reporter gene inadvertently falls under the control of endogenous cis-regulatory elements or heterochromatin), thus leading to ectopic gene expression. Hence, it is crucial to ensure that a similar expression pattern is obtained in several independent transgenic lines before a conclusion is made about the putative cis-regulatory element being tested. Secondly, microinjections of the transgene into mouse or zebrafish embryos are carried out at the one-cell stage so that ideally, all the cells in the resulting


organism contain a copy of the transgene in their genomes. However, mosaicism arises if cell divisions begin before the transgene can be integrated into the genome, especially in the rapidly dividing zebrafish embryo. This means that insertion of the transgene into the host genome occurs in only a subset of all cells, and reporter gene expression can only be observed in subpopulations of cells. Nevertheless, the analysis of cis-regulatory elements by examining mosaic patterns of reporter gene expression in microinjected transient transgenic fish embryos is fairly rapid (Muller et al. 2002). Furthermore, these assays are now aided by techniques such as transposon-based gene transfer (Grabher and Wittbrodt 2007; Kawakami 2007). After transgenic assays have been carried out and a cis-regulatory element is shown to direct gene expression to specific tissues at particular developmental stages, subsequent experiments include making constructs with serial deletions or site-directed mutagenesis to find the critical segments of the cis-regulatory element. These experiments are very tedious and time-consuming and are appropriate for only small sets of genes.

1.3.2 Genomic era strategies

Ever since the release of the human genome, methods of detecting cis-regulatory elements have gradually become high-throughput, capable of detecting hundreds to thousands of elements in one experiment. Typically, these methods make use of the completed human genome as a reference to which large numbers of experimentally obtained sequences can be mapped or from which numerous microarray probes can be designed for large-scale hybridization. The DNase I hypersensitivity assay and ChIP assay take advantage of the availability of genome sequences in similar ways, to identify DNase I hypersensitive sites and regions bound by transcription factors on a whole-genome level. Massively parallel signature sequencing was used to generate


many short sequence tags from a genomic DNase library derived from quiescent human CD4+ T-cells, and ~14,000 DNase I hypersensitive sites were identified (Crawford et al. 2006b). In the same year, NimbleGen tiling microarrays designed from the ENCODE regions that make up 1% of the human genome were used to identify captured DNase I-digested ends in primary and immortalized human cell types (Crawford et al. 2006a; Sabo et al. 2006). More recently, a high-resolution whole-genome map of chromatin perturbations in primary human CD4+ T-cells was constructed using Solexa and 454 sequencing platforms in combination with NimbleGen tiling arrays, and almost 95,000 sites were identified (~2.1% of the genome) (Boyle et al. 2008).

Recent extensions of the original ChIP assay allow the researcher to map transcription factor binding sites on a genome-wide scale. In ChIP-on-chip, a method also known as genome-wide location analysis, the protein-bound enriched DNA fragments are hybridized to a tiling microarray. This method was applied to human chromosomes 21 and 22 to find the binding sites for three transcription factors: Sp1, c-Myc and p53 (Cawley et al. 2004). Although ChIP-on-chip has become a fairly high-throughput way to identify sequences across the genome due to the availability of commercial microarray chips (e.g., NimbleGen Human ChIP-chip 2.1M array, Affymetrix GeneChip® Human Tiling 2.0R array), there are problems of cross-hybridization, especially for large mammalian genomes. In addition, the method predicts large numbers of binding sites, of which only a fraction is expected to be truly functional; hence, statistical methods are required to attach a statistical significance to each site as an indicator of its functional relevance (Buck and Lieb 2004). In ChIP-PET, the enriched DNA sequences are sequenced through the paired-end ditag sequencing method (Loh


et al. 2006). This method has been applied successfully to identify the binding sites for c-Myc in human B cells (Zeller et al. 2006). The latest method that makes effective use of next generation sequencing technologies is ChIP-seq. In ChIP-seq, the purified bound DNA is analyzed by massively parallel short-read sequencing. The short reads are then mapped back to the reference genome. For example, the binding sites of the STAT1 transcription factor in human HeLa S3 cells stimulated by interferon gamma were mapped using ChIP followed by sequencing on the Solexa platform (Robertson et al. 2007).

Studies into the genome-wide distribution of histone acetylation and methylation marks have suggested that certain histone modifications are a good indicator of the presence of cis-regulatory elements, just as DNase I hypersensitive sites are known to be correlated with certain histone modifications. For example, the relative levels of histone modification across the genome can be quantified using ChIP-SAGE, which is a combination of chromatin immunoprecipitation to target modified histone proteins and the technique of serial analysis of gene expression to sequence the precipitated DNA sequences. This has been done for H3 histones diacetylated on lysines 9 and 14 (denoted H3K9acK14ac) in human peripheral T-cells, which are located in genomic regions called histone acetylation islands (Roh et al. 2005). The H3K9acK14ac mark is correlated with known and active gene promoters, enhancers and locus control regions. Furthermore, luciferase assays of randomly selected histone acetylation islands showed that 43% (39/90) of them could function as enhancers in human Jurkat T-cells (Roh et al. 2007). Recently, when other types of histone modifications were investigated, 17 different histone acetylation and methylation marks were all found to be present in the promoters of a set of ~3,300 genes that are more highly expressed


than those genes that lacked the histone marks (Wang et al. 2008). In a separate study, the distribution patterns of several types of acetylation and methylation modifications on selected lysine residues on histones H3 and H4 were determined using ChIP-on-chip on 44 different loci in the human genome. The patterns showed that active promoters are associated with H3K4 trimethylation while enhancers are associated with H3K4 monomethylation (Heintzman et al. 2007). An algorithm was designed to scan the human genome based on this finding, and ~220 active promoters and ~420 enhancers were predicted. Almost 90% of promoter predictions mapped to known transcription start sites and 63% of enhancer predictions corresponded to previously reported DNase I hypersensitive sites or binding sites of co-activator p300 and Mediator subunit TRAP220, showing that these histone methylation features are useful in predicting specific cis-regulatory elements (Heintzman et al. 2007). As in the case of genome-wide DNase I hypersensitivity assays, the advantage of genome-wide profiling of histone modifications is that active and repressive chromatin domains, which may be dynamically changing, can be demarcated in different cell types and under different cellular conditions. Nevertheless, additional experiments are required to identify the transcription factors responsible for creating the regions of open or closed chromatin and the essential portions of the cis-regulatory elements.

A technique known as chromosome conformation capture (abbreviated as “3C”) is used to detect intrachromosomal and interchromosomal interactions in a cell’s natural state (Dekker et al. 2002; Tolhuis et al. 2002). It aims to detect long-range enhancers that, together with their associated transcription factors, are brought into close proximity of the core promoter of their target gene. The 3C protocol is accomplished in five steps (Simonis et al. 2007). Firstly, similar to the ChIP protocol,


formaldehyde is used to fix cells and form cross-links between interacting proteins and DNA. Secondly, the nuclei are isolated from the cells and DNA is digested with a restriction enzyme. A ligation step is then carried out to promote intramolecular ligation between interacting DNA fragments. The cross-links are reversed and ligation frequencies are analyzed by quantitative PCR using primers specific to the locus of interest (known as the “bait” sequence). Because 3C is only suitable for studying a small number of chromosomal interactions, 3C-carbon copy (abbreviated as “5C”) was developed, in which a multiplex ligation-mediated amplification step is included after the cross-links are reversed (Dostie et al. 2006). This step uses multiplex primers that are composed of universal primer sequences like T7 and T3 plus ligation junction sequences to amplify selected ligation junctions. A quantitative replicate of the initial 3C library is hence created and finally analyzed through high-throughput sequencing or tiling arrays. Both 3C and 5C require the researcher to painstakingly design primers specific not only to the bait sequence, but also to every possible restriction fragment (known as the “captured” sequence); hence circular 3C (abbreviated as “4C”) (Simonis et al. 2006; Zhao et al. 2006), which does not require primers for captured sequences, was developed. 4C uses the same 3C protocol, except that it uses a six-base cutter for the first restriction digest and, after ligation between restriction fragments and reversal of cross-links, a four-base cutter in a second restriction digest to produce smaller restriction fragments and also trim ligation junctions. This is followed by a self-circularization step that is more likely to occur since the DNA is no longer bound to proteins. Inverse PCR using bait-specific primers is then carried out and the products are analyzed by high-throughput sequencing or custom microarrays with probes that target sequences flanking recognition sites of the first restriction enzyme (Simonis et al. 2007). A caveat of the suite of 3C and 3C-derived techniques


is the large number of cells that is required, because at most two instances of chromosomal interactions at a locus of interest can be captured per diploid cell. Another disadvantage of these methods is that they provide information about the frequency but not the functionality of DNA interactions; hence further analysis is required to characterize these DNA interactions (Simonis et al. 2007).
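The two successive restriction digests used in 4C can be illustrated with a small in silico sketch. The enzymes chosen here (HindIII as the six-base cutter, DpnII as the four-base cutter), the function name `digest` and the toy sequence are assumptions made purely for illustration; the text above does not specify which enzymes are used.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    `cut_offset` is the position of the cut within the recognition site
    (e.g. 1 for A^AGCTT, 0 for ^GATC). Returns the list of fragments.
    """
    fragments = []
    previous = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[previous:cut])
        previous = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[previous:])
    return fragments

# Illustrative enzyme choices (assumptions, not taken from the text).
HINDIII = ("AAGCTT", 1)  # six-base cutter, cuts A^AGCTT
DPNII = ("GATC", 0)      # four-base cutter, cuts ^GATC

toy_locus = "CCCAAGCTTGGGGATCTTTAAGCTTCC"  # made-up sequence
first_digest = digest(toy_locus, *HINDIII)
second_digest = [digest(fragment, *DPNII) for fragment in first_digest]
```

The second digest yields shorter fragments than the first, which is what makes the subsequent self-circularization and inverse PCR steps practical.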

In contrast to the above large-scale biochemical methods, in silico prediction of cis-regulatory elements helps the researcher to narrow the search space of genomic sequences so that traditional experimental methods can be applied on a manageable scale. Because all cis-regulatory elements contain transcription factor binding sites (TFBS), predicting individual TFBS or organized clusters of TFBS is a logical step in computationally analyzing a gene locus or the whole genome. Databases like TRANSFAC and JASPAR catalog TFBS obtained from scientific literature and/or position weight matrices (PWMs) that are derived from known TFBS. TFBS prediction can be achieved by either string matching against known TFBS or statistical matching against PWMs. To create a PWM, alignments of known binding sites of a transcription factor are first constructed and the frequency of observing a particular nucleotide at each position of the alignment is represented as a log-likelihood ratio relative to the genomic background. To score an input sequence against a PWM, the sequence is scanned on both strands for possible sites, the log-likelihood ratios of all bases in the site are summed up, and sites whose scores exceed a user-defined threshold are then reported. Several programs that search TFBS databases have been used widely, for example, MatInspector (Cartharius et al. 2005), P-Match (Chekmenev et al. 2005) and TESS (Schug 2003). The power of computational prediction of TFBS is that the search is not restricted to a single transcription factor,


or only selected genomic regions. However, TFBS databases are rarely complete. Some transcription factor families may be under-represented compared with others, or the TFBS of a particular transcription factor may not have been comprehensively identified due to the degenerate nature of TFBS. For this reason, de novo motif finding has been proposed as an alternative way of finding novel TFBS without a priori information about the transcription factors that regulate the gene of interest. For example, a binding site for SIX3 in a forebrain enhancer of SHH that was verified by a supershift assay could not be detected by TRANSFAC analysis (Jeong et al. 2008). Secondly, defining the threshold score is not a trivial task. An appropriate threshold score has to strike a reasonable tradeoff between sensitivity (to minimize false negatives) and specificity (to minimize false positives). As not all sites that correspond very well to known TFBS or PWMs are functional, TFBS prediction tends to result in many false positives and hence, TFBS prediction is usually used in conjunction with other criteria that provide additional support to a putative cis-regulatory element. Firstly, Harbison et al. (2004) enforced a threshold score of 60% of the maximum possible score for a particular transcription factor, and also added a requirement that the sequences must reside near a sequence bound by the same transcription factor as verified by ChIP. Secondly, Loots and Ovcharenko (2004) developed the program rVISTA to identify TFBS that are also evolutionarily conserved, in a bid to reduce the rate of false positives while maintaining reasonable sensitivity. rVISTA accepts input alignments from various sources like the BLASTZ program, zPicture, ECR Browser and Penn State University’s GALA database and identifies TFBS that are interconnected in the BLASTZ alignment and highly conserved (>80% identity) in 20-bp sliding windows. In addition, the TFBS are clustered for individual or combinations of transcription factors by a filtering scheme


that identifies evolutionarily conserved regions that contain a minimum number of sites for at least a few transcription factors. Lastly, Blanchette et al. (2006) combined TFBS prediction, sequence conservation in rodents, and the presence of clustered binding sites of one to five different transcription factors to identify ~118,000 cis-regulatory modules across the human genome.
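The PWM construction and scanning procedure described earlier in this section can be sketched as follows. This is a minimal illustration only: it assumes a uniform background of 0.25 per base, a pseudocount of 1, and a handful of made-up binding sites, and the function names are hypothetical rather than taken from MatInspector, P-Match or TESS. The threshold of 60% of the maximum possible score mirrors the cutoff attributed above to Harbison et al. (2004).

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-likelihood-ratio PWM from aligned, equal-length sites."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def score_site(pwm, site):
    """Sum the per-position log-likelihood ratios of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))

def scan(pwm, sequence, threshold):
    """Scan both strands; report (start, strand, score) for scores above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = score_site(pwm, site)
            if score > threshold:
                hits.append((start, strand, score))
    return hits

# Made-up example sites and sequence, for illustration only.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(example_sites)
max_score = sum(max(column.values()) for column in pwm)
hits = scan(pwm, "GGGTGACTCAGGG", threshold=0.6 * max_score)
```

Raising the threshold trades sensitivity for specificity, which is the tradeoff discussed above.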

As an alternative to TFBS prediction programs that are restricted to existing information in TFBS databases, de novo motif discovery has been proposed as a means to find novel TFBS. De novo motif discovery aims to identify patterns in noncoding DNA that occur more frequently than expected and that signify a possible underlying function. This is based on the observation that cis-regulatory elements tend to possess multiple binding sites for the same transcription factor. Many de novo motif finding algorithms have been introduced in recent years that differ from each other in various aspects: the definition of a motif, the criteria for statistical overrepresentation and the method used to find the overrepresented motifs (Tompa et al. 2005), for example, expectation maximization used by MEME (Bailey and Elkan 1995), Gibbs sampling used by AlignACE (Hughes et al. 2000) and exhaustive enumeration of short k-mers by Weeder (Pavesi et al. 2004). Motif discovery is also employed on the conserved noncoding elements (CNEs) of all human genes, or all noncoding regions associated with co-regulated or co-expressed genes. For example, motif finding within the mammalian-conserved regions of promoters and 3’ untranslated regions of all human genes yielded 174 motifs in promoters implicated in transcriptional regulation and 106 motifs in 3’ untranslated regions potentially involved in post-transcriptional regulation (Xie et al. 2005). A search for motifs in the CNEs within 1 kb upstream of 18 groups of co-expressed genes (e.g., genes


expressed in neuronal tissue, pancreas, heart or skeletal muscle as indicated by their microarray profiles) resulted in 431 previously reported TFBS motifs and 579 novel motifs (Huber and Bulyk 2006). These studies have met with a certain amount of success in identifying bona fide cis-regulatory elements through confirmatory comparisons with known cis-regulatory elements. However, it still remains to be seen how accurate and precise existing motif finding programs can be and what the extent of overlap is between the predictions of different programs. One study attempted to make a comparison by applying 13 different motif finding programs to the same sequence dataset. The main conclusions were that researchers should use a few motif finding programs in combination rather than only one program, and should consider the best few results instead of just the most significant motif for further analyses so as to increase sensitivity (Tompa et al. 2005).
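The idea of overrepresentation by exhaustive enumeration of short k-mers can be sketched in a few lines. This is only a toy illustration and not an implementation of Weeder, MEME or AlignACE: the enrichment measure (a smoothed ratio of the fraction of foreground and background sequences containing each k-mer), the function names and the sequences are all assumptions introduced here.

```python
from collections import Counter

def kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(foreground, background, k, top=3):
    """Rank foreground k-mers by a smoothed foreground/background ratio."""
    fg = kmer_presence(foreground, k)
    bg = kmer_presence(background, k)

    def enrichment(kmer):
        fg_rate = (fg[kmer] + 1) / (len(foreground) + 1)
        bg_rate = (bg[kmer] + 1) / (len(background) + 1)
        return fg_rate / bg_rate

    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(fg, key=lambda kmer: (-enrichment(kmer), kmer))
    return [(kmer, fg[kmer], bg[kmer]) for kmer in ranked[:top]]

# Made-up sequences: the foreground set shares an embedded TGACTCA site.
foreground = ["ACGTGACTCATT", "TTGACTCAGGCA", "CCATGACTCAAG"]
background = ["ACGTACGTACGT", "GGCATTCCAAGG", "TTTTAAAACCCC"]
top_hits = enriched_kmers(foreground, background, k=6, top=2)
```

Real motif finders differ mainly in how a motif is defined (allowing mismatches or using a matrix) and in the statistical model for overrepresentation, as noted above.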

1.4 Comparative genomics approach for identifying cis-regulatory elements

Comparative genomics is a first step to prioritizing candidate regulatory elements for biological assays. It is not biased towards specific genomic regions (e.g., promoters) and does not rely on a priori information on which transcription factors are regulating the target gene. The basic premise of this approach to identifying cis-regulatory elements is that functional noncoding elements tend to evolve more slowly than non-functional DNA due to selective pressure. Hence, cis-regulatory elements can be identified as conserved noncoding elements (CNEs) in sequence comparisons of related genomes. Although this approach has become increasingly popular in the genome era, it had been applied successfully prior to the release of multiple vertebrate genomes. For example, the concept of phylogenetic footprinting was introduced as early as 1988 and established as a technique to identify cis-regulatory elements


through the comparison of the promoters and 5’ flanking regions of ε-globin and γ-globin genes of several primates and a few other mammals (Tagle et al. 1988). Briefly, the process of phylogenetic footprinting involves alignment of orthologous nucleotide sequences in two or more species, followed by the identification of noncoding sequences that are noticeably more conserved than background sequences. There exists evidence that experimentally verified TFBS tend to be more conserved than their surrounding sequences, although not perfectly conserved (Moses et al. 2003). Conversely, notable proportions of CNEs have been shown to act as bona fide cis-regulatory elements in transgenic mouse or zebrafish assays (Pennacchio et al. 2006; Shin et al. 2005; Woolfe et al. 2005).

Comparisons between human and closely related species such as mouse have been successfully used to identify CNEs in the human genome, some of which were shown to recapitulate the expression patterns of nearby genes (Lee et al. 2004; Loots et al. 2000). For example, Loots et al. (2000) searched for human-mouse CNEs in a ~1 Mb genomic region at human chromosome 5q31 that were ≥ 70% identical over 100 bp or more. Fifteen CNEs were found to be conserved in human, mouse and several other mammals, and one CNE was determined to regulate the expression of cytokine genes interleukin-4, interleukin-13 and interleukin-5 located in the same genomic interval.
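A conservation criterion of the kind used above (at least 70% identity over a window of 100 bp) can be applied to a pairwise alignment with a simple sliding window. The following is a minimal sketch under stated assumptions: the input is two already aligned, equal-length strings with `-` for gaps, gap columns count as mismatches, and the function name is hypothetical. It does not reproduce the exact procedure of Loots et al. (2000).

```python
def conserved_regions(aln_a, aln_b, window=100, min_identity=0.70):
    """Return merged (start, end) column intervals (end exclusive) covered by
    at least one window whose fraction of identical columns >= min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 for an identical, non-gap column; 0 otherwise.
    matches = [int(a == b and a != "-") for a, b in zip(aln_a, aln_b)]
    regions = []
    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions

# Toy alignment with a short window so the behaviour is easy to follow.
toy_a = "ACGTACGTAC" + "GGGGGGGGGG"
toy_b = "ACGTACGTAC" + "TTTTTTTTTT"
toy_regions = conserved_regions(toy_a, toy_b, window=10, min_identity=0.75)
```

In practice the coding exons would also need to be masked before calling the remaining conserved intervals CNEs.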

While human-mouse sequence comparisons can be particularly useful for finding mammalian-specific cis-regulatory elements, this approach tends to identify many false positives due to the relatively short evolutionary distance between the two species (i.e., ~60 Myr). The short divergence time between human and mouse is insufficient for functional sequences to be clearly distinguished from neutrally evolving DNA. On the other hand, as the evolutionary divergence between two


species increases, the average conservation level of DNA that has been under neutral evolution since their last common ancestor decreases. Sequence comparisons between human and a more distantly related species like fish, separated by a larger evolutionary distance (i.e., ~420 Myr), increase the probability that the CNEs identified are functional sequences. Among the fishes, the genome of the Japanese pufferfish (Takifugu rubripes), also known as “fugu”, is a particularly attractive genome for comparative genomics. The fugu genome was proposed in 1993 as a model vertebrate genome for identifying genes and other functional elements in the human genome (Brenner et al. 1993) and it was the second vertebrate genome to be fully sequenced (Aparicio et al. 2002), the first being the human genome. Fugu is a particularly attractive model for identifying conserved cis-regulatory elements in the human genome due to its compact genome size (reducing noise in in silico predictions) and maximum evolutionary distance (increasing the specificity of cis-regulatory element predictions) from the human lineage.

The first proof-of-concept study that demonstrated the amenability of the fugu genome to comparative genomics was carried out by Aparicio et al. (1995). Within two transcriptional enhancers that mediate Hoxb4 expression in the mesoderm, ectoderm and neural tube, three sequence regions CR1, CR2 and CR3 were found to be conserved between mouse and fugu. CR1 was shown to be essential for expression in the mesoderm, and the central and peripheral nervous systems, while CR3 directed gene expression to the posterior hindbrain in exactly the same manner as the larger mouse enhancer sequence. This demonstrated that searching for CNEs between mammals and fugu can help identify critical cis-regulatory elements (Aparicio et al. 1995). Mouse-fugu comparisons have also revealed cis-regulatory elements in the


Pax9/Nkx2-9 locus. While human-mouse comparisons identified 15 CNEs each about 1 kb long, human-mouse-fugu comparisons narrowed this list down to 2 CNEs, named CNS-6 and CNS+2, that mediate Nkx2-9 expression in the ventral neural tube and Pax9 expression in the medial nasal process (Santagati et al. 2003). Finally, comparisons between human and fugu identified a set of nine CNEs residing in the gene deserts that flank the human Dachshund gene locus. When assayed in transgenic mice, seven of the nine elements drove reporter gene expression patterns that partially or fully recapitulated the expression pattern of Dachshund (Nobrega et al. 2003), demonstrating that comparative genomics using the fugu genome can enable a researcher to efficiently sift through large tracts of noncoding DNA for functional sequences.

Two types of multi-species alignments have been used to identify conserved enhancers associated with human genes, namely whole-genome alignments and locus-by-locus alignments. Whole-genome comparisons of human and other vertebrates have indeed proved to be effective in identifying functional cis-regulatory elements in the human genome (Bejerano et al. 2004; Pennacchio et al. 2006; Sandelin et al. 2004a; Sanges et al. 2006; Shin et al. 2005; Woolfe et al. 2005). Whole-genome comparisons typically use local alignment programs like BLASTZ (Schwartz et al. 2003) and MegaBLAST (Zhang et al. 2000) to rapidly align regions of high homology. When carried out between distantly related genomes such as human and fish, whole-genome comparisons tend to fail to identify and align all the orthologous sequences due to the stringent criteria of local alignment algorithms. On the other hand, locus-by-locus global alignments of orthologous gene loci using programs like LAGAN/MLAGAN (Brudno et al. 2003) and AVID (Bray et al. 2003)


are more effective in identifying all the associated CNEs. Furthermore, because global alignment algorithms have an additional assumption that input sequences occur in the same order and orientation, they have more power in detecting weakly conserved regions than local alignments (Bray et al. 2003; Frazer et al. 2003). Nevertheless, it should be noted that global alignment algorithms tend to miss conserved functional elements that have undergone local inversions and rearrangements. For my project, I applied the comparative genomics approach, using fugu as a model fish genome, to identify cis-regulatory elements in the human genome. I chose to focus on the transcription factor-encoding genes of the human genome, carried out locus-by-locus comparisons between human, mouse and fugu orthologous genes and identified conserved cis-regulatory elements associated with these genes.

1.5 Transcription factors (TFs)

Transcription factors (TFs) are proteins that bind to cis-regulatory elements and activate or repress transcription of genes. Other proteins that are involved in transcriptional regulation and that could fall under the broad classification of TFs include co-factors, structural enzymes involved in chromatin or histone modification affecting the transcriptional state of target genes, and general transcription factors that associate directly with RNA polymerase II in a transcription initiation complex. The human genome is estimated to contain about 2,000 TF-encoding genes including co-factors, chromatin modifying enzymes and general transcription factors (Messina et al. 2004). TFs play crucial roles in development (e.g., HOX, SOX and PAX proteins) (Chi and Epstein 2002; Krumlauf 1994; Scotting and Rex 1996), cell cycle progression (e.g., c-MYC, c-JUN) (Adhikary and Eilers 2005; Mechta-Grigoriou et al. 2001), tumor suppression (e.g., p53, FOXO proteins) (Arden 2007; Vousden and Lane


2007) and cell differentiation (e.g., RUNX, DLX proteins) (Durst and Hiebert 2004; Panganiban and Rubenstein 2002). The vast majority of TFs are known to regulate the expression of a number of different genes, and TF-encoding genes are themselves key targets of TFs, with many TFs regulating the expression levels of their own genes. Thus, TFs represent crucial nodes in the gene regulatory networks that determine the correct development of the body plan and regulate various physiological processes (Levine and Davidson 2005; Stathopoulos and Levine 2005). To gain an understanding of such gene regulatory networks, it is important to identify cis-regulatory elements associated with TF-encoding genes and the TFs involved in the differential expression of TF-encoding genes. Mutations that disrupt the cis-regulatory elements of TF-encoding genes have been implicated in various congenital disorders. For example, the TF-encoding gene PAX6 is essential for eye, pancreas and central nervous system development. Haploinsufficiency of PAX6 causes brain defects and a congenital eye malformation known as aniridia. In some aniridia patients, breakpoints downstream of PAX6 are observed at chromosome 11p13. These breakpoints disrupt long-range enhancers further downstream, resulting in loss of PAX6 expression in the retina, iris and ciliary body of developing eyes and parts of the diencephalon (Kleinjan et al. 2006). Campomelic dysplasia is a congenital disorder characterized by bowing and angulation of long bones, together with other skeletal and extraskeletal defects. Translocation breakpoints associated with campomelic dysplasia are scattered up to 1 Mb upstream of SOX9. Sequence comparisons between human and fugu followed by transgenic mouse assays revealed that three cis-regulatory elements distributed up to 290 kb 5’ and 95 kb 3’ of SOX9 recapitulate SOX9 expression in cooperation with the SOX9 core promoter, and these elements are potentially disrupted by the campomelic dysplasia translocation breakpoints (Bagheri-Fam et al.


2006). To determine how aberrant expression of certain TF-encoding genes leads to human congenital disorders, it is important to identify the mutations in the cis-regulatory elements of TF-encoding genes associated with these disorders. A prerequisite for this task is to identify all the cis-regulatory elements associated with TF-encoding genes. Therefore, I chose to use a comparative genomics approach to identify evolutionarily conserved cis-regulatory elements associated with TF-encoding genes.

1.6 Objectives of my work

The main aim of my project is to identify evolutionarily conserved cis-regulatory elements associated with TF-encoding genes in the human genome using a comparative genomics approach and to verify some of the predicted cis-regulatory elements in a transgenic mouse assay. Previous whole-genome comparisons have shown that TF-encoding genes tend to be enriched with CNEs (Bejerano et al. 2004; Iwama and Gojobori 2004; Sandelin et al. 2004a; Woolfe et al. 2005). Notably, a large number of CNEs identified in these genome-wide comparisons were found to be associated with TF-encoding and developmental genes. For example, 83% (1,140 out of 1,373) of CNEs (>70% identity/>100 bp alignment length) identified in a genome-wide comparison of human and fugu were located in the vicinity of about 120 human DNA-binding TF-encoding genes (Woolfe et al. 2005), while 104 of the 290 human genes associated with human-zebrafish CNEs (>70% identity/>80 bp sequence length) were found to be TF-encoding genes (Shin et al. 2005). In spite of such a known association between TF-encoding genes and CNEs, no systematic gene-by-gene comparison of all the orthologous human and other vertebrate TF-encoding genes has been carried out to identify potential cis-regulatory elements associated


with them. Although orthologous TF-encoding genes of human, rodents, fugu and zebrafish have been previously compared (Dieterich et al. 2005; Iwama and Gojobori 2004; Robertson et al. 2006; Sandelin et al. 2004b), such studies were promoter-centric, with comparisons restricted to sequences flanking the transcription start sites (TSS) (e.g., -3 kb to +500 bp relative to the TSS for ConSite (Sandelin et al. 2004b); ~10 kb upstream of the coding region (Dieterich et al. 2005)). The scope of such studies is limited because cis-regulatory elements can be located at considerable distance from transcription start sites, even up to several hundred kilobases away (Lettice et al. 2003; Nobrega et al. 2003).

In my comparative genomics approach, I have chosen to compare the human and mouse genomes with the fugu genome. The specific objectives of my project are (1) to build a curated database of TF-encoding genes in human, mouse and fugu genomes; (2) to carry out locus-by-locus alignments of their noncoding regions using a global alignment strategy; (3) to identify CNEs associated with them and construct a database of these CNEs; and (4) to assay some of these CNEs in transgenic mouse assays using lacZ (β-galactosidase) as a reporter gene.


Chapter 2: Materials and methods

2.1 Identifying human, mouse and fugu TF-encoding genes

Sequences for 1,962 human transcription factors (TFs) were obtained from Messina et al (2004), and redundancies were removed by a homology search against human RefSeq proteins. Several known proteins missing from Messina et al.’s dataset (e.g., JMJ2A, JMJ2C, JMJ2D, HES4 and DLX6) were added to the search results, and the resulting proteins were mapped to human genes in Ensembl Release 37 (http://www.ensembl.org/). The TFs were classified by DNA-binding domain. TFs lacking a DNA-binding domain were classified separately into one of the following categories: co-factors, general transcription factors, components of chromatin remodeling complexes, and transcriptional regulators that are involved solely in protein-protein interactions (i.e., TFs with ZnF-PHD, ZnF-BTB/POZ or ZnF-MYND domains). Mouse orthologs of human TFs were retrieved from Ensembl BioMart. Fugu orthologs were identified using a combination of data from Ensembl BioMart (fugu assembly v4.0) and INPARANOID analysis (Remm et al 2001). Teleost fishes contain duplicate copies of many human genes due to a “fish-specific” whole-genome duplication event in the fish lineage (Amores et al 1998; Christoffels et al 2004; Postlethwait et al 1998). INPARANOID was used to identify duplicate fugu orthologs of human TFs that may have been missed in Ensembl. Only proteins longer than 50 residues were used in the analysis. INPARANOID identified some many-to-many ortholog groups, which were resolved into smaller families based on phylogenetic analysis using PHYLIP (Felsenstein J., University of Washington, Seattle) with sequences from cartilaginous fishes, lamprey, tunicate or amphioxus as the outgroup. For each family, multiple human and fugu proteins and a single
