A study on genomic distribution and sequence features of human long inverted repeats reveals species-specific intronic inverted repeats

Yong Wang and Frederick C C Leung

School of Biological Sciences and Genome Research Centre, The University of Hong Kong, China

Keywords

human; intron; long inverted repeat; primates; stem-loop

Correspondence

Y Wang, School of Biological Sciences, The University of Hong Kong, Hong Kong, China

Fax: +852 2857 4672

Tel: +852 2299 0825

E-mail: wangyong@graduate.hku.hk

(Received 11 December 2008, revised 19 January 2009, accepted 23 January 2009)

doi:10.1111/j.1742-4658.2009.06930.x

The inverted repeats present in a genome play dual roles. They can induce genomic instability and, on the other hand, regulate gene expression. In the present study, we report the distribution and sequence features of recombinogenic long inverted repeats (LIRs) that are capable of forming stable stem-loops or palindromes within the human genome. A total of 2551 LIRs were identified, and 37% of them were located in long introns (largely > 10 kb) of genes. Their distribution appears to be random in introns and is not restrictive, even for regions near intron–exon boundaries. Almost half of them comprise TG/CA-rich repeats, inversely arranged Alu repeats and MADE1 mariners. The remaining LIRs are mostly unique in their sequence features. Comparative studies of human, chimpanzee, rhesus monkey and mouse orthologous genes reveal that human genes have more recombinogenic LIRs than other orthologs, and over 80% are human-specific. The human genes associated with the human-specific LIRs are involved in the pathways of cell communication, development and the nervous system, as based on significantly over-represented Gene Ontology terms. The functional pathways related to the development and functions of the nervous system are not enriched in chimpanzee and mouse orthologs. The findings of the present study provide insight into the role of intronic LIRs in gene regulation and primate speciation.

Abbreviations

FDR, false discovery rate; GO, Gene Ontology; LIR, long inverted repeat; siRNA, small interfering RNA; TIR, terminal inverted repeat.

An inverted repeat consists of two repeat copies (hereafter termed arms) that are approximately complementary to each other. Generally, there is a spacer between the arms, and the full structure of an inverted repeat can form a stem-loop or palindrome. The potential to form a stable stem-loop is determined by the arm size, the spacer size and the degree of matching between the arms [1,2]. For example, a relatively large spacer makes it difficult for the two arms to form a stem.

Studies of inverted repeats show that they can cause instability in a genome and, on the other hand, regulate gene expression in both prokaryotes and eukaryotes. Being capable of forming secondary structures [3], inverted repeats can induce genomic instability via gene amplification, recombination, DNA double-strand breaks and rearrangement [1,2,4–8]. Moreover, inverted repeats provide sites for the integration of viruses into eukaryotic genomes [9,10] and also act as replication stall sites, as shown in a recent study in which evidence obtained in vivo demonstrated replication stalling by hairpins formed by inverted repeats in bacteria, yeast and mammalian cells [11]. As a result, they are restricted in a genome to some extent. For example, neighboring repetitive elements, such as Alu repeats, generally occur in the same orientation, and head-to-head and tail-to-tail arrangements are rarely observed, particularly when the spacer between them is small [1,12]. In a mouse transgene experiment, the introduction of a large palindrome was followed by numerous rearrangements, which were assumed to attenuate the impact of the palindrome in the progeny [13].

Inverted repeats also regulate gene expression. The stem-loops and palindromes formed by inverted repeats are involved in RNA interference, transcription initiation, initiation of DNA replication and alternative splicing of exons. The small interfering (si)RNA genes active in RNA interference comprise inverted repeats capable of forming a stem-loop motif longer than 22 bp. Some are derived from miniature inverted-repeat transposable elements [14]. To date, siRNA genes have been identified in organisms ranging from Caenorhabditis elegans to humans. RNA interference was initially discovered as an efficient mechanism for inhibiting the expression of specific genes [15,16], and was later found to be responsible for developmental regulation [17,18] and heterochromatin maintenance [19,20]. In promoters, inverted repeats can facilitate the recognition process and the subsequent binding of RNA polymerase during gene transcription [21,22]. Moreover, inverted repeats in a cruciform structure attract mediators of second messenger-directed transcription, hence altering the transcriptional response [23]. Many studies also show that inverted repeats are essential for the initiation of DNA replication in plasmids, bacteria, eukaryotic viruses and mammalian cells [24]. Inverted repeats in introns are able to affect the alternative splicing of exons [25,26] and the removal efficiency of introns [27,28]. For example, alternative splicing of exon 2 in the COL2A1 gene is mediated by a stem-loop adjacent to the exon–intron boundary [25].

Because inverted repeats are both unstable and functional elements in a genome, they are expected to be distributed in intergenic regions or large introns of genes. In the yeast genome, almost 100% of large palindromes (> 25 bp) are far away from coding regions [29]. Any insertion approaching conserved transcribed sites will ultimately be erased unless its presence provides an evolutionary advantage and, thus, is under positive selection. One line of evidence for this is that, compared to introns, upstream regions of genes have more palindromes, which probably developed for the initiation of transcription [30]. A recent study shows that Caenorhabditis lineages have conserved inverted repeats in intergenic regions [31], which were suggested to be functional and therefore actively maintained in the lineages. The human genome contains many such motifs, although little is known at present about their fine-scale distribution, sequence features and potential functions [32,33]. Human inverted repeats were investigated in a previous study, in which the majority were found, on the basis of their structural features, to have only a weak capacity to form a simple stem-loop or hairpin [32]. The genome-wide distribution of human palindromes has also been surveyed, and a database has been created for public use [30]. However, palindromes with mismatches and indels were not collected in the database.

In the present study, we first located all the long inverted repeats (LIRs) in the human genome that are characterized by long arms, high arm similarity and a short internal spacer. They were termed recombinogenic LIRs in our previous study on human chromosomes 21 and 22 [33], although their distribution and frequency had not been fully surveyed in the whole human genome. The present study aims to provide a panoramic view of recombinogenic LIRs. On the basis of evidence obtained in vivo [1,2,11], the LIRs identified in the present study can easily form stem-loops or palindromes. Their presence in the human genome by itself implies that they are functional in some manner. The results show that 37% of the LIRs are located in intronic regions and that some are primate-specific. TG/CA-rich repeats are the most frequently observed feature in LIR arms. Considering that the LIRs probably have essential functions and drive the speciation of primates, we studied the degree of conservation and the species specificity of the LIRs among orthologous genes from the mouse (Mus musculus), rhesus monkey (Macaca mulatta), chimpanzee (Pan troglodytes) and human. The results demonstrate that human orthologs have relatively more LIRs, most of which are human-specific. In light of the Gene Ontology (GO) profile of the human orthologs, these human-specific LIRs are probably essential for the development of the advanced functions of the human nervous system.

Results

Characteristics and distribution of LIRs in human autosomes

We identified 2551 LIRs in human autosomes, and approximately 87% of them have a short spacer (0–9 bp) and a short arm (31–59 bp) (Fig. 1). By contrast, the mismatch rate between the arms of an LIR varies from 0 to 0.15, with the numbers of LIRs in the different ranges showing a relatively low standard deviation (Fig. 1). These results indicate that a majority of the LIRs are able to form a stem-loop with a stem of 31–59 bp and a tiny loop (or none for a palindrome).
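The three descriptors used above (arm length, spacer size and mismatch rate) can be illustrated with a minimal Python sketch. It is not the search algorithm of the study, which also tolerates insertions; the function names are hypothetical, and the arms are assumed to be of equal length with no indels.

```python
# Minimal sketch (assumption: equal-length arms, no indels; names are hypothetical).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def arm_mismatch_rate(left_arm: str, right_arm: str) -> float:
    """Fraction of left-arm positions that do not pair with the right arm."""
    if not left_arm or len(left_arm) != len(right_arm):
        raise ValueError("arms must be non-empty and of equal length")
    # A perfect inverted repeat has left_arm == reverse_complement(right_arm).
    target = reverse_complement(right_arm)
    mismatches = sum(a != b for a, b in zip(left_arm.upper(), target))
    return mismatches / len(left_arm)


def describe_inverted_repeat(left_arm: str, spacer: str, right_arm: str) -> dict:
    """Summarise an inverted repeat by the three descriptors used in the text."""
    return {
        "arm_length": len(left_arm),
        "spacer_length": len(spacer),
        "mismatch_rate": arm_mismatch_rate(left_arm, right_arm),
    }
```

With a spacer of length zero and a mismatch rate of zero, the structure is a perfect palindrome; a longer spacer corresponds to the loop of a stem-loop.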

The genomic distribution of the LIRs shows that the density of LIRs selected by our criteria is quite low (Fig. 2). The highest and lowest LIR densities were observed in chromosomes 4 (1.2/Mb) and 22 (0.44/Mb), respectively. Interestingly, the LIR density negatively co-varies with gene density among the chromosomes (t = 19.8; P < 10⁻⁴) (Fig. 3). The point denoting chromosome 19 is notably far from the regression line, which is accounted for by the gene clusters that contribute to the two-fold higher gene density of chromosome 19 compared to the genomic average [34].

Fig. 1. Characteristics of human LIRs. The 2551 LIRs were classified according to spacer size, mismatch rate and arm length.

Fig. 2. Distribution of LIRs in the human genome. The density is represented by the number of LIRs per 1 Mb of sequence. The shortest bars denote one LIR per 1 Mb.

We found that the negative correlation is due to the high frequency of LIRs in long genes. A total of 956 LIRs (in 702 genes) were located in genic regions, and 1595 in intergenic regions. In other words, 37% of the LIRs were found within genes. However, our calculation of the coverage of genes in the human genome was 26.9%, which is consistent with the value reported previously [35]. When introns were taken into account, the percentage was 25.1%. This implies that the distribution of the LIRs is not random. Statistical analysis shows that the presence of LIRs is significantly biased towards genes (chi-square test; P < 0.0001). The LIRs that have long arms (> 400 bp) and the associated genes are listed in Table S1. Surprisingly, over half of them were found within genes. There are two cases of partial overlap between LIRs and exons. In one case, the left arm of an intronic LIR extends into an exon of c14orf165 and, in the other case, an LIR on chromosome 17 partially overlaps an exon of a putative gene. This frequency is much lower than expected. We did not find any LIRs overlapping either the start or end site of a gene.
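The enrichment of LIRs within genes can be checked with a back-of-the-envelope goodness-of-fit calculation. The sketch below is only one plausible way to set up such a test (the text does not specify how the chi-square test was constructed): it assumes that the expected genic fraction equals the 26.9% gene coverage and uses one degree of freedom. The function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_two_classes(observed_in: int, observed_out: int,
                           expected_fraction_in: float):
    """Chi-square goodness-of-fit test for two classes (1 degree of freedom).

    Assumption: under the null hypothesis, LIRs fall inside genes in
    proportion to the fraction of the genome covered by genes.
    """
    total = observed_in + observed_out
    expected_in = total * expected_fraction_in
    expected_out = total - expected_in
    chi2 = ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts from the text: 956 genic and 1595 intergenic LIRs; gene coverage 26.9%.
chi2, p_value = chi_square_two_classes(956, 1595, 0.269)
print(f"genic fraction = {956 / (956 + 1595):.3f}, chi2 = {chi2:.1f}, P = {p_value:.2e}")
```

Under these assumptions the P value falls far below 0.0001, in line with the reported significance.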

Further results confirmed that LIRs tend to reside in large intronic and intergenic regions. Only five LIRs were found in introns < 1 kb (the smallest intron was 757 bp), and none in intergenic regions < 2 kb. The median sizes for these introns and intergenic regions are 46 and 386 kb, respectively. Moreover, most of the LIR-containing intergenic regions are > 10 kb. Correspondingly, a chromosome that has more long genes will show a lower gene density, in agreement with the above negative correlation between LIR density and gene density (Fig. 3).

We then studied the positions of the LIRs in introns and intergenic regions. A short distance to the exon–intron boundary or transcription starting point is an indication that an LIR is functioning in the gene. A ratio of 0–0.5 was used to denote the relative distance of an LIR to the boundaries (0) or to the center (0.5) of a region, and was divided into five ranges. We calculated the percentage of LIRs falling within each range and observed only a small difference between the ranges, suggesting a random distribution of LIRs in both intronic and intergenic regions (see Fig. S1). We also considered the effect of the length of these regions on the distribution. The intronic and intergenic regions were therefore classified on the basis of their lengths. Within each of the length groups, the numbers of LIRs in the ratio ranges do not show any significant difference (chi-square tests; d.f. = 4; P > 0.1) (see Fig. S1). Therefore, the LIRs do not avoid the boundaries of exonic or genic regions. The median distance to the exon boundaries is 7.8 kb, and that to the gene boundaries is 69 kb.
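The relative-distance ratio described above can be sketched as follows. This is an illustrative reading of the procedure rather than the code used in the study: each LIR is assumed to be represented by a single coordinate, the five ranges are assumed to be of equal width (0.1), and the function names are hypothetical.

```python
from collections import Counter


def relative_distance_bin(position: int, region_start: int, region_end: int,
                          n_bins: int = 5) -> int:
    """Bin index (0 = nearest a boundary, n_bins - 1 = nearest the center).

    The ratio is (distance to the nearest boundary) / (region length),
    which lies between 0 and 0.5. Integer arithmetic avoids rounding issues.
    """
    length = region_end - region_start
    if length <= 0 or not region_start <= position <= region_end:
        raise ValueError("position must lie inside a non-empty region")
    nearest = min(position - region_start, region_end - position)
    return min((2 * n_bins * nearest) // length, n_bins - 1)


def bin_counts(lirs, n_bins: int = 5) -> list:
    """Count LIRs per ratio range; lirs holds (position, start, end) tuples."""
    counts = Counter(relative_distance_bin(p, s, e, n_bins) for p, s, e in lirs)
    return [counts.get(i, 0) for i in range(n_bins)]
```

The per-range counts returned by `bin_counts` are the kind of input that a chi-square test with four degrees of freedom would take when comparing the five ranges against a uniform expectation.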

Strikingly, pseudogenes were frequently found around the intergenic LIRs. A total of 803 intergenic LIRs (50%) have one or two neighboring pseudogenes, of which 422 are RNA pseudogenes. According to the annotation in the Ensembl database (http://www.ensembl.org), approximately 27% of the human genes are pseudogenes. The occurrence of pseudogenes adjacent to LIRs is statistically significant (chi-square test; P < 0.0001).

Sequence features of the human LIRs

We found that over half (51%) of the identified LIRs could be clustered into groups consisting of at least three members on the basis of sequence similarity. The group members comprise simple repeats, known repetitive elements, amplified genes or duplicated genomic fragments. The largest group consists of LIRs formed by stretches of TG/CA dinucleotides and interspersed TA dinucleotides. We defined them as TG/CA-rich LIRs, accounting for 33% and 39% of all the LIRs in the intronic and intergenic regions, respectively (Fig. 4). By contrast, we also identified TC/GA-rich LIRs, which account for only 3% of the LIRs in each of the regions. Thus, the frequency of TG/CA-rich LIRs is at least 11-fold higher than that of TC/GA-rich ones (for intronic LIRs: 11-fold; for intergenic LIRs: 13-fold). The difference is statistically significant (chi-square test; P < 0.0001). On average, the combination of TG/CA-rich and TC/GA-rich LIRs accounts for 38% of the identified LIRs. Additionally, we could not identify any LIRs constructed from simple repeats with longer repeat units (> 2 bp).

Fig. 3. Negative correlation between gene density and LIR frequency. The black dots show the correlation between gene density and LIR density in the 22 chromosomes. Regression line: y = -23.8x + 33.6 (R² = 0.53); x-axis: S-LIR density (/Mb); the labeled outlying point is chromosome 19.

The second largest group comprises the LIRs involved in known human repetitive elements. We found 145 MADE1 mariners and 108 inverted Alu repeats in our LIR collection. The mariner has a short spacer and long terminal inverted repeats (TIRs). In the present study, mariners were considered as LIRs in cases of high identity between the TIRs. Among both intronic and intergenic LIRs, they account for 6% in total. Alu repeats in the LIRs are mostly partial in structure and are arranged head-to-head or tail-to-tail. In some large LIRs, more than one Alu was included in one arm, and the complete structure of Alu could be retained therein. The proportion of inverted Alus within the LIRs is 6% for intronic regions and 3% for intergenic regions.

The grouping of the LIRs is also a result of gene amplification or fragmental duplication, although the numbers of such groups and of the members inside the groups are not large. We identified 20 LIRs in genes encoding a novel protein similar to septin (NPS-Septin) and eight in genes encoding POTE. These genes belong to gene families, and their duplication is coupled with the spread of the LIRs inside the genes. The remaining LIRs, aside from the above groups, show similarity either to one or to none of the others. They are labeled as rare LIRs, accounting for 49% of all the LIRs in the human genome.

We explored the LIRs in the NPS-Septin gene family in more detail. A BLAT search in the University of California Santa Cruz (UCSC) browser (http://genome.ucsc.edu) was used to confirm the association of the LIRs with the gene family. The longest gene, LOC400807, is approximately 107 kb, and an NPS-Septin LIR is positioned at approximately 10.6 kb. In addition, we also found more NPS-Septin LIRs on the Y chromosome, although they were not present in NPS-Septin genes. Sequence alignment displays highly identical arms but diverse spacers for the 20 NPS-Septin LIRs (Fig. 5). They are able to form variant stem-loop structures in which both the stem and the loop are of different sizes. Except for those on chromosomes 3, 10 and Y, all the LIRs were located at subtelomeric regions (see Table S2). The proximal LIRs show similar spacer motifs; for example, the three LIRs on chromosome 1p (nos 1–3) and the two LIRs on the Y chromosome (Fig. 5). This is evidence for inverted duplication of the fragments at these regions. We also noted that sequence similarity at the flanking regions of the LIRs declines gradually at all sites.

Species-specific LIRs inside orthologous genes

To determine the species specificity of the LIRs, we detected LIRs in mouse, rhesus monkey, chimpanzee and human orthologous genes. Among 12 723 groups of orthologous genes, we identified 546 LIRs for human orthologs, 481 for chimpanzee orthologs, 201 for mouse orthologs and 130 for rhesus monkey orthologs. With respect to species specificity, 421 (77%) are human-specific, 355 (74%) are chimpanzee-specific, 180 (90%) are mouse-specific and 107 (82%) are rhesus monkey-specific. For the nonspecific LIRs, 13 groups of orthologs from the three primate species all have at least one LIR, and 104 ortholog pairs from humans and chimpanzees possess LIR(s). This suggests that most of the nonspecies-specific LIRs are shared by the primates, and that some LIRs developed specifically in the primate lineage.
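The species-specific percentages quoted above follow directly from the reported counts, as the short Python check below shows. The dictionary layout and variable names are illustrative only.

```python
# (total LIRs in orthologs, species-specific LIRs), as reported in the text.
lir_counts = {
    "human": (546, 421),
    "chimpanzee": (481, 355),
    "mouse": (201, 180),
    "rhesus monkey": (130, 107),
}

# Percentage of LIRs that are species-specific, rounded to the nearest integer.
specific_percent = {
    species: round(100 * specific / total)
    for species, (total, specific) in lir_counts.items()
}

for species, percent in specific_percent.items():
    print(f"{species}: {percent}% species-specific")
```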

We next obtained the biological profile of the human orthologs that have human-specific LIR(s). Compared to randomly-selected human genes, the orthologs are significantly enriched with GO terms within the categories of development, binding, membrane, cell communication and signal transduction (Table 1). An important finding is that a number of the terms are related to the nervous system, including neurotransmitter receptor activity (GO:0030594), central nervous system development (GO:0007417), GABA receptor activity (GO:0016917), axonogenesis (GO:0007409), projection, generation, differentiation and development of neurons (GO:0043005, GO:0048699, GO:0030182, GO:0048666), synapse (GO:0045202), and so on. The GO term that is under-represented in these genes is GO:0006955 for immune response [false discovery rate (FDR) = 0.048].
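The enrichment analysis was run in BLAST2GO (see the legend of Table 1), the internals of which are not described here. As a rough illustration of the underlying idea only, the sketch below computes a one-sided Fisher's exact test for a single GO term and then applies a Benjamini-Hochberg adjustment across terms. The choice of Benjamini-Hochberg as the FDR procedure is an assumption, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: test genes with the term      b: test genes without the term
    c: reference genes with the term d: reference genes without the term
    Returns P(X >= a) under the hypergeometric null (over-representation).
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1)) / denominator


def benjamini_hochberg(p_values: list) -> list:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In such a scheme, a term would be reported as over-represented when its adjusted value is below 0.05, mirroring the FDR < 0.05 threshold stated in the legend of Table 1.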

We also performed the same test on the 104 orthologs with human- and chimpanzee-specific LIRs. The GO terms from the human orthologs were compared with those from randomly-selected human genes, showing that the above over-represented GO terms related to the nervous system were largely not assigned to these orthologs (see Table S3). Only the term GO:0045202 is related to the synapse. Basically, the terms for binding, membrane, signal transduction and cell communication are retained in the list.

Fig. 4. Composition of LIRs in the human genome. The LIRs in the POTE and NPS-Septin families account for 3% of all the intronic LIRs. The 'other' LIRs, accounting for 49% of all LIRs, refer to those with unique sequence features.

As a control, we obtained over-represented GO terms from the mouse genes associated with mouse-specific LIRs by comparison with randomly-selected mouse orthologs. A part of the result shown in Table S4 is similar to that shown in Table S3 (e.g. binding and signal transduction). The difference is that the result for the mouse orthologs includes GO terms for the regulation of transcription, the RNA biosynthetic process and the phosphate metabolic process. We found one over-represented term (GO:0007399) in a pathway for nervous system development (FDR = 0.0368).

The 104 LIRs common to human and chimpanzee orthologs were studied with the aim of uncovering the mechanism of their formation. We searched the arm sequences of the LIRs in the UCSC genome browser for homologous fragments in other mammalian genomes. The species specificity of the LIRs was demonstrated in several cases, in which we found half-sized LIRs in the rhesus monkey genome. One case is the LIR in the human gene c9orf52, which has 19 ORFs and four transcription variants. Positioned between exons 17 and 18 (5.3 kb to exon 18; 36.59 kb to exon 17) (Fig. 6), it has a homolog in the chimpanzee genome. However, all homologous fragments from the rhesus monkey correspond to one arm of the LIR. Moreover, motif conservation was exhibited at the flanking sequences of the LIR in primates (Fig. 6). In other words, the half-sized LIR represents the ancestral status, and the full-sized LIR developed in the chimpanzee and human lineages. We did not find fragments homologous to the LIR in the mouse genome. Instead, a half-sized LIR was observed in the dog genome, suggesting that nonprimate genomes also lack the full-sized LIR. This also serves as additional solid evidence for the presence of the half-sized LIR in the rhesus monkey. These results imply that some LIRs were derived by inverted duplication of one arm.

Fig. 5. Alignment of the LIRs mostly located in genes encoding a novel protein similar to septin. The locations of the LIRs are listed in Table S2. Essentially, the arms of the LIRs are approximately 1–48 and 97–143 bp, and can be extended into the spacers in some LIRs.

Table 1. GO terms over-represented in human genes having human-specific LIRs. The genes are human orthologs that have at least one human-specific LIR. Reference genes are randomly selected from the list of orthologs, and are used for comparison with the test human genes with specific LIRs. The GO terms in 352 test genes were compared with those in 296 reference genes, using Fisher's exact test in BLAST2GO. FDR was applied to obtain significantly over-represented (FDR < 0.05) GO terms in the test genes. Several GO terms belonging to levels 1 or 2 are not included.

GO ID	Term	FDR
GO:0007154	Cell communication	2.65E-04
GO:0032501	Multicellular organismal process	2.65E-04
GO:0004888	Transmembrane receptor activity	7.24E-04
GO:0031224	Intrinsic to membrane	7.24E-04
GO:0016021	Integral to membrane	8.83E-04
GO:0032502	Developmental process	0.001316
GO:0030695	GTPase regulator activity	0.001397
GO:0007275	Multicellular organismal development	0.001397
GO:0044459	Plasma membrane part	0.001397
GO:0030182	Neuron differentiation	0.001526
GO:0007186	G-protein coupled receptor protein	0.002093
GO:0004872	Receptor activity	0.002643
GO:0007166	Cell surface receptor	0.003784
GO:0000166	Nucleotide binding	0.003784
GO:0031420	Alkali metal ion binding	0.004149
GO:0045211	Postsynaptic membrane	0.004149
GO:0006811	Ion transport	0.004408
GO:0046872	Metal ion binding	0.005536
GO:0030554	Adenyl nucleotide binding	0.005536
GO:0007155	Cell adhesion	0.005536
GO:0005083	Small GTPase regulator activity	0.005868
GO:0005509	Calcium ion binding	0.006231
GO:0016773	Phosphotransferase activity	0.006277
GO:0017076	Purine nucleotide binding	0.00637
GO:0000902	Cell morphogenesis	0.007098
GO:0032989	Cellular structure morphogenesis	0.007098
GO:0030234	Enzyme regulator activity	0.007765
GO:0051056	Regulation of small GTPase	0.009344
GO:0048869	Cellular developmental process	0.009344
GO:0030154	Cell differentiation	0.009344
GO:0005215	Transporter activity	0.009344
GO:0032559	Adenyl ribonucleotide binding	0.009362
GO:0032555	Purine ribonucleotide binding	0.01047
GO:0032553	Ribonucleotide binding	0.01047
GO:0031226	Intrinsic to plasma membrane	0.011022
GO:0005230	Extracellular ion channel activity	0.014854
GO:0031175	Neurite development	0.014854
GO:0030030	Cell projection organization	0.016226
GO:0022804	Active transmembrane transporter	0.016354
GO:0004672	Protein kinase activity	0.017032
GO:0048856	Anatomical structure development	0.017199
GO:0043687	Post-translational protein modification	0.017406
GO:0015075	Ion transmembrane transporter	0.017424
GO:0006464	Protein modification process	0.017424
GO:0051234	Establishment of localization	0.018271
GO:0022803	Passive transmembrane transporter	0.019349
GO:0022838	Substrate-specific channel activity	0.019349
GO:0004713	Protein-tyrosine kinase activity	0.020021
GO:0004930	G-protein coupled receptor activity	0.020021
GO:0022857	Transmembrane transporter activity	0.020021
GO:0050793	Regulation of developmental process	0.021558
GO:0008509	Anion transmembrane transporter	0.021558
GO:0019199	Transmembrane protein kinase activity	0.023565
GO:0048667	Neuron morphogenesis	0.023565
GO:0046578	Ras protein signal transduction	0.023565
GO:0004714	Kinase activity	0.023565
GO:0022891	Transmembrane transporter activity	0.026669
GO:0009790	Embryonic development	0.028903
GO:0006793	Phosphorus metabolic process	0.030622
GO:0005096	GTPase activator activity	0.033582
GO:0015698	Inorganic anion transport	0.033582
GO:0065007	Biological regulation	0.034756
GO:0005089	Rho guanyl-nucleotide exchange factor	0.041165
GO:0005088	Ras guanyl-nucleotide exchange factor	0.041165
GO:0007010	Cytoskeleton organization	0.041874
GO:0007417	Central nervous system development	0.043131
GO:0030594	Neurotransmitter receptor activity	0.043131
GO:0004674	Protein serine/threonine kinase	0.045995
GO:0008092	Cytoskeletal protein binding	0.046358
GO:0005515	Protein binding	0.047541

Discussion

A survey of recombinogenic LIRs across the human genome

In the present study, we identified LIRs in the human genome and provide a fine map of their distribution. Because of their strong capability to form a stem-loop, these LIRs are recombinogenic; they account for only approximately 0.4% of all human LIRs, as suggested previously [33]. Our algorithm allows the presence of mismatches and insertions in the stem part of the secondary structure, and also provides settings for spacer size, arm size and arm similarity. Because their internal structures vary, inverted repeats differ in the efficiency with which they induce instability. Evidence is available suggesting that arm size, arm similarity and internal spacer size are all important factors [1,2]. Therefore, the inverted repeats identified in the present study are generally associated with a high potential for stem-loop or palindrome formation. This is partially supported by the fact that approximately 87% of our LIRs have a short spacer of < 10 bp. Nonetheless, we cannot preclude the possibility that some of the LIRs have difficulty forming a stem-loop, such as the reversely duplicated genes and the extremely large LIRs with a huge spacer (see Table S1). In previous studies, the methods employed for inverted repeat identification could not search for inverted repeats by freely defining arm similarity, spacer size and indels [30,32,36]. Thus, the map of the LIRs obtained in the present study provides a more detailed distribution of stem-loops in the human genome, and confirms that the LIRs are mostly located in long introns and intergenic regions. Furthermore, the inverted repeats in the present study are more likely to be functional than those of previous studies because functional inverted repeats such as siRNA genes are rarely palindromes showing 100% arm similarity [30,32,36].

Because of the difficulties encountered in the design of the algorithm for LIR searching at the genome level, and because of the complex folding structures of inverted repeats, we could not target all the inverted repeats with a strong potential to form a stem-loop or a palindrome. In particular, there are a large number of AT-rich regions in the human genome, and the frequency of (TA)n repeats is 19.4 per Mb [37]. The self-complementary (TA)n repeats can by themselves form variant secondary structures. To remove these simple repeats, we set the GC content of the arm sequences at > 20%. This step, however, unavoidably removed AT-rich LIRs, some of which have been implicated as mediators of the constitutional t(11;22) translocation in humans [38]. Although there are also a large number of (CA)n and (GA)n repeats in the human genome, the frequency of their complementary repeats (TG)n and (CT)n is much lower [37]. Therefore, the presence of the TG/CA-rich and TC/GA-rich LIRs is not a result of the enrichment of (CA)n and (GA)n repeats.
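The GC-content filter mentioned above can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the text does not say whether the 20% threshold was applied to each arm separately or to both arms together, so the example pools the two arms, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def passes_gc_filter(left_arm: str, right_arm: str, threshold: float = 0.20) -> bool:
    """Keep an LIR only if the pooled arm GC content exceeds the threshold.

    Assumption: the two arms are pooled; the original criterion may differ.
    """
    return gc_content(left_arm + right_arm) > threshold
```

For example, `passes_gc_filter("ATATATATAT", "ATATATATAT")` returns `False`, which shows how a pure (TA)n inverted repeat would be discarded, whereas a TG/CA-rich pair such as `("TGTGTGTGTG", "CACACACACA")` passes with a GC content of 0.5.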

Fig. 6. An intronic LIR in c9orf52 and the flanking conserved sequence. The arrow denotes the intronic LIR, positioned between exons 17 and 18. The large arrows with opposing orientations indicate the two arms of the LIR. Rhesus monkey (Macaca mulatta) and dog (Canis familiaris) genes possess half-sized LIRs.


Probable functions of the LIRs

The results obtained in the present study show that a considerable proportion of the LIRs are within genes and tend to be located in large introns of long genes. The LIRs in the large introns, although still unstable, will not greatly disturb the coding parts of the genes. Knowing the genomic distribution and sequence features of the LIRs enables us to speculate about their biological functions.

First, there are a large number of TG/CA-rich LIRs in our collection, and these intronic TG and CA tracts are probably involved in the alternative splicing of genes. One study revealed that intronic TG tracts, particularly in a hairpin structure, are important in the intron knockout process and help to create complicated splicing patterns [39]. On the other hand, CA tracts and CA-rich sequences are confirmed regulators of alternative splicing. One study showed that the insertion of a CA repeat at different intronic positions results in variant splicing patterns in a human gene [40]. Perhaps splicing sites at intron–exon boundaries can be recognized easily by a signal of secondary structure. Taken together, this allows us to propose that the TG/CA-rich LIRs are regulators in human genes.

Second, approximately half of the LIRs are unique in their sequence features, and some of them are probably unidentified siRNA genes. In the present study, the LIRs are longer than the minimal length required for an siRNA. Although their arm similarity is higher than that observed in most siRNAs, some of them are still candidates for siRNA genes. We used EMBOSS sirna (http://emboss.sourceforge.net/apps/cvs/sirna.html) to identify the candidates with a threshold score of 8, and found that 267 of the LIRs are potential siRNA genes. The validity of these motifs in gene silencing requires further empirical examination.

LIRs and recombination hotspots are not related

The question remains as to whether the recombinogenic LIRs identified in the present study are frequently associated with recombination hotspots in the human genome. Almost 47% of the human genome is composed of repeats [37], and direct repeats are predominant over inverted repeats in the human genome, partially because inverted repeats are able to induce instability five-fold more efficiently than direct repeats [41]. The UCSC browser provides the recombination rate data for the human genome. Essentially, recombination hotspots are concentrated in subtelomeric regions [42]. These regions, however, do not have more LIRs than other regions (Fig. 2) and, instead, some chromosomal LIR-rich regions are located in the inner part of the chromosomes. The lack of an association between LIRs and recombination hotspots is also suggested by a recent study of human recombination hotspots based on a computational simulation using single nucleotide polymorphism data, which showed that inverted repeats were not over-abundant in the hotspots [43]. In the present study, we did not detect over-represented LIRs in the hotspots (results not shown). Therefore, a contribution of LIRs to recombination hotspots is not supported, and the recombination-inducing effect of the recombinogenic LIRs probably acts only on specific genomic regions.

LIRs spreading via fragmental duplications

NPS-Septin genes are spread in the human genome, possibly as a result of interchromosomal recombination and fragmental duplication. One study showed that interchromosomal recombination frequently occurs at the subtelomeric regions in humans [42]. NPS-Septin was assumed to be one of the gene families that amplified themselves by this mechanism. The result of gene amplification is concurrent duplication of the intronic LIRs, as observed in chromosomes 1 and Y in the present study. By contrast, only two chimpanzee NPS-Septin LIRs were identified, which is in accordance with the low frequency of subtelomeric duplications in the chimpanzee genome [42]. Regarding the spread of LIRs in the POTE family, at least those on chromosome 2 subtelomeres are most likely the result of intrachromosomal recombination, as inferred from genomic locations. Similarly, the chimpanzee genome has two POTE homologs: one on chromosome 12 and another on chromosome 22. In addition, the LIRs in the POTE and NPS-Septin families were entirely absent from other mammals in current genomic assemblies.

Probable role of the LIRs in primate speciation

Among the orthologous genes, we found that human and chimpanzee genes contain more LIRs than rhesus monkey and mouse orthologs. Our data suggest that most of the LIRs shared by human and chimpanzee orthologs were developed and maintained by the common ancestor of humans and chimpanzees. However, the true difference in LIR frequency among the primates may be somewhat smaller than it appears. Our search for LIRs in rhesus monkey orthologs probably missed a proportion of the LIRs, even though the required similarity between arms was lowered to 75%. Where the similarity between arms was lower than 75%, some of the LIRs shared by all primates could not be detected. Indeed, we observed higher mismatch rates in the stems formed by monkey LIRs, and the corresponding chimpanzee and human LIRs have undergone compensating mutations that improve the stability of the stem-loops for human and chimpanzee LIRs relative to those for rhesus monkey (results not shown). The compensating mutations comprise one line of evidence for the functional role and adaptive evolution of the primate LIRs.

The biological profiling of the orthologs with human-specific LIRs implies their association with the development of the central nervous system. Moreover, GO terms in pathways such as cell communication and transmembrane signal transduction are enriched in these orthologs. The number of genes in eukaryotic genomes does not differ as much as previously considered [44,45], and the morphological and physiological differences among eukaryotes are considered to be the result of the different regulation levels of the existing genes. The intronic LIRs in the present study are probably novel, essential regulatory motifs that enable a complex expression profile and the fine regulation of human genes, as suggested previously [46]. The appearance of the LIRs probably provides humans with an evolutionary advantage and contributes to the speciation of primates.

Experimental procedures

Identification of LIRs

The human genome (Build 35) was downloaded from the NCBI (http://www.ncbi.nlm.nih.gov/) and the locations of all human genes and their exons (for protein-coding genes) were obtained from the Ensembl database (http://www.ensembl.org). From the gene list, we obtained the locations of the boundaries of the genic and nongenic regions. Exons belonging to the same genes were sorted again according to their genomic locations, and the introns were defined as the intervals between the exons. From the list, the boundaries of exons and introns were determined.
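The exon-sorting step can be illustrated with a minimal sketch. It is not the authors' code: the function name `introns_from_exons`, the 1-based inclusive coordinates and the merging of overlapping exons are all assumptions made for illustration.

```python
def introns_from_exons(exons):
    """Return introns as the gaps between the exons of one gene.

    exons: list of (start, end) tuples, 1-based inclusive (assumed convention).
    Overlapping or adjacent exons are merged before gaps are taken (assumption).
    """
    introns = []
    prev_end = None
    for start, end in sorted(exons):          # sort by genomic location
        if prev_end is not None and start > prev_end + 1:
            introns.append((prev_end + 1, start - 1))
        prev_end = end if prev_end is None else max(prev_end, end)
    return introns


if __name__ == "__main__":
    # Hypothetical exon coordinates, given out of order on purpose.
    print(introns_from_exons([(500, 600), (100, 200), (150, 250), (300, 400)]))
```

Each returned interval supplies the two exon–intron boundaries used in the later distance calculations.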

We first searched for inverted repeats across the human genome using bespoke software [33]. The settings for this step were: arm length > 30 bp; arm identity > 85%; and spacer < 2 kb. In addition, inverted repeats with an arm GC content of < 20% were filtered out. This aimed to exclude the abundance of inverted repeats formed by (TA)n simple repeats, as shown in our primary study. A (TA)n repeat is by itself an inverted repeat, and can form variant secondary structures rather than an exclusive and stable stem-loop. Therefore, (TA)n repeats were not the typical inverted repeats required, and we removed them from the dataset in the present study. Several types of redundancies were removed, as described previously [33]. To define the recombinogenic LIRs, we screened the collection with a new criterion: the ratio of arm length to spacer length must be larger than the mismatch, where the mismatch is equal to 100% minus identity. Therefore, the LIRs in our dataset were recombinogenic LIRs [33].
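The thresholds above can be written as a short filter. This is a sketch under stated assumptions, not the bespoke software of [33]: the class and function names are hypothetical, identity and GC content are taken as fractions between 0 and 1, and the mismatch is taken as a fraction (1 minus identity). If the original criterion used the percentage value instead, the cut-off would differ. Treating a zero-length spacer as passing is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class InvertedRepeat:
    arm_length: int      # bp, length of one arm
    spacer_length: int   # bp between the two arms
    identity: float      # arm identity as a fraction (0 to 1)
    gc_content: float    # GC fraction of the arms (0 to 1)


def passes_initial_screen(ir):
    """Arm length > 30 bp, identity > 85%, spacer < 2 kb, arm GC content not below 20%."""
    return (ir.arm_length > 30
            and ir.identity > 0.85
            and ir.spacer_length < 2000
            and ir.gc_content >= 0.20)


def is_recombinogenic(ir):
    """Arm/spacer ratio must exceed the mismatch (1 - identity, as a fraction: assumption)."""
    mismatch = 1.0 - ir.identity
    if ir.spacer_length == 0:
        return True  # no spacer: ratio is unbounded (assumption)
    return ir.arm_length / ir.spacer_length > mismatch


if __name__ == "__main__":
    candidate = InvertedRepeat(arm_length=100, spacer_length=500,
                               identity=0.90, gc_content=0.50)
    print(passes_initial_screen(candidate) and is_recombinogenic(candidate))
```

Under this reading, a long spacer is tolerated only when the arms match almost perfectly.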

The LIRs within genes were identified and the ratio of their relative distance to exon–intron boundaries was calculated. Here, a ratio approaching 0 indicates that the LIR lies close to the nearest exon–intron boundary, whereas a ratio close to 0.5 means that the LIR is positioned near the center of an intron, regardless of direction. Pseudogenes were not used in this survey. For LIRs in intergenic regions, the same ratios were also measured; in that case, the ratios represent the relative distance to the closest neighboring genes. We also attempted to identify cases of partial overlap between LIRs and genes or exons.
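One way to compute such a ratio is sketched below. The text does not state which point of the LIR was measured, so using the LIR midpoint is an assumption, as is the function name. The same function would apply to intergenic LIRs if the interval limits are taken as the ends of the two neighboring genes.

```python
def relative_boundary_distance(lir_start, lir_end, region_start, region_end):
    """Distance from the LIR midpoint to the nearest region boundary,
    divided by the region length. Ranges from 0 (at a boundary) to 0.5 (center)."""
    region_length = region_end - region_start
    if region_length <= 0:
        raise ValueError("region must have positive length")
    midpoint = (lir_start + lir_end) / 2      # assumed reference point
    if not region_start <= midpoint <= region_end:
        raise ValueError("LIR midpoint lies outside the region")
    return min(midpoint - region_start, region_end - midpoint) / region_length


if __name__ == "__main__":
    # Hypothetical intron spanning 1000-2000 with an LIR at its center.
    print(relative_boundary_distance(1450, 1550, 1000, 2000))
```

Taking the minimum of the two distances makes the ratio independent of strand and of which boundary is nearer.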

Classification of LIRs

We selected the LIRs that were basically constructed by dinucleotide repeats. In the case where TG + TA + CA > 80% of the arm of an LIR, it was considered as TG/CA-rich; in the case where TC + TA + GA > 80%, it was considered as TC/GA-rich. The remaining LIRs were classified on the basis of similarity. We first used consensus motifs of common human repetitive elements (from RepBase: http://www.girinst.org/) as templates. An LIR was considered to be formed by a known repetitive element if the identity of the homologous part (> 20 bp) was higher than 75%. For the results obtained, LIRs formed by inverted Alu repeats were further confirmed by RepeatMasker (http://repeatmasker.org). Second, LIRs homologous to each other were searched. Similarly, the criteria were: homologous part > 20 bp and identity of homologous part > 75%. Put simply, the algorithm for searching the homologous part aimed to find an identical seed of 5 bp and then extend the seed at both ends until two consecutive mismatches occurred on each side.
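Both classification steps can be sketched in a few lines. This is an illustration, not the authors' implementation, and every name in it is hypothetical. Two points are assumptions: the dinucleotide percentage is read as the fraction of arm bases covered by the listed dinucleotides, and the seed extension is ungapped, runs independently in each direction and is trimmed to end on a match.

```python
TG_CA_SET = ("TG", "TA", "CA")
TC_GA_SET = ("TC", "TA", "GA")


def dinucleotide_coverage(arm, dinucleotides):
    """Fraction of arm bases covered by any of the listed dinucleotides."""
    covered = [False] * len(arm)
    for i in range(len(arm) - 1):
        if arm[i:i + 2] in dinucleotides:
            covered[i] = covered[i + 1] = True
    return sum(covered) / len(arm) if arm else 0.0


def classify_arm(arm):
    """Apply the 80% rule; checking TG/CA before TC/GA is an arbitrary order."""
    arm = arm.upper()
    if dinucleotide_coverage(arm, TG_CA_SET) > 0.80:
        return "TG/CA-rich"
    if dinucleotide_coverage(arm, TC_GA_SET) > 0.80:
        return "TC/GA-rich"
    return "other"


def _extend(a, b, i, j, step):
    """Ungapped extension from (i, j); stops after two consecutive mismatches.
    Returns (positions kept, matches kept), trimmed to end on a match."""
    kept = kept_matches = examined = matches = run = 0
    while 0 <= i < len(a) and 0 <= j < len(b) and run < 2:
        examined += 1
        if a[i] == b[j]:
            matches += 1
            run = 0
            kept, kept_matches = examined, matches
        else:
            run += 1
        i += step
        j += step
    return kept, kept_matches


def homologous_part(a, b, seed=5):
    """Longest seed-and-extend hit as (start_a, start_b, length, identity), or None."""
    best = None
    for i in range(len(a) - seed + 1):
        kmer = a[i:i + seed]
        j = b.find(kmer)
        while j != -1:
            right, right_m = _extend(a, b, i + seed, j + seed, +1)
            left, left_m = _extend(a, b, i - 1, j - 1, -1)
            length = left + seed + right
            identity = (left_m + seed + right_m) / length
            if best is None or length > best[2]:
                best = (i - left, j - left, length, identity)
            j = b.find(kmer, j + 1)
    return best


def is_homologous(hit, min_length=20, min_identity=0.75):
    """Criteria from the text: homologous part > 20 bp and identity > 75%."""
    return hit is not None and hit[2] > min_length and hit[3] > min_identity


if __name__ == "__main__":
    print(classify_arm("TGTGTGTGTG"))
    print(homologous_part("TTACGTAGGCATCC", "GGACGTACGCATAA"))
```

The same `homologous_part` call would serve for both comparisons in the text: an LIR against a repeat consensus, and one LIR against another.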

LIRs in mammalian orthologous genes

We obtained orthologous genes for human–chimpanzee, human–rhesus monkey and human–mouse species pairs from the BIOMART database (http://www.ensembl.org/biomart/), which employs the Ensembl 42 Homology Database. By searching the same human gene IDs in the three ortholog tables, we created a new ortholog table containing 12 723 groups of orthologous genes from the four species.
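The table-merging step amounts to intersecting the three pairwise tables on the human gene ID. The sketch below shows the idea with plain dictionaries. The function name and the gene IDs are made-up placeholders, and no assumption is made about how multiple orthologs per gene were handled.

```python
def merge_ortholog_tables(chimp, rhesus, mouse):
    """Keep only human gene IDs present in all three pairwise ortholog tables.

    Each input maps a human gene ID to a list of ortholog gene IDs.
    """
    shared = set(chimp) & set(rhesus) & set(mouse)
    return {
        human_id: {
            "chimpanzee": chimp[human_id],
            "rhesus": rhesus[human_id],
            "mouse": mouse[human_id],
        }
        for human_id in sorted(shared)
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real Ensembl IDs.
    chimp = {"HUMAN_G1": ["CHIMP_G1"], "HUMAN_G2": ["CHIMP_G2"]}
    rhesus = {"HUMAN_G1": ["RHESUS_G1"], "HUMAN_G3": ["RHESUS_G3"]}
    mouse = {"HUMAN_G1": ["MOUSE_G1"], "HUMAN_G2": ["MOUSE_G2"]}
    print(merge_ortholog_tables(chimp, rhesus, mouse))
```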

In the BIOMART database, some orthologous genes are of the types ‘one-to-many’ and ‘many-to-many’, which denote a multiple orthologous relationship between the genes. In the

Date posted: 07/03/2014, 00:20
